Software at Scale

Software at Scale 13 - Emma Tang: ex Data Infrastructure Lead, Stripe

0:00

-40:59

Software at Scale 13 - Emma Tang: ex Data Infrastructure Lead, Stripe

Effective Management of Big Data Platforms

Mar 21, 2021

Emma Tang was the Engineering Manager of the Data Infrastructure team at Stripe. She was also a Lead Software Engineer at Aggregate Knowledge, where she worked on the data platform.

We explore the technological and organizational challenges of maintaining big data platforms. We discuss when a company would require a “Big Data” system, what the properties of a good system look like, how some of these systems look like today, some of the tools/frameworks that work well, hiring the right engineers, and unsolved problems in the field.

Apple Podcasts | Spotify | Google Podcasts

Share Software at Scale

Highlights

0:30 - “Big Data” for software engineers - when does a company need a big data solution

2:30 - The transition from when a company uses a regular database to a big data solution, with a motivating example of Stripe

4:20 - Verification of processed output. Some of the tools involved: Amazon S3, Parquet, Kafka, and MongoDB.

9:00 - The cost of ensuring correctness in the data processing. Using tools like Protobuf to enforce schemas

13:30 - Data Governance as a trend

16:30 - Why should a company have a data platform organization?

21:30 - Hiring for data infrastructure/platform engineers

24:00 - How does a data organization maintain quality? What metrics do they look at?

28:30 - Trends of some problem areas in this space.

33:30 - Emma’s interest in data infrastructure, and advice for those looking to get into the field.

Transcript

Utsav Shah: [00:13] Hey, everyone, welcome to another episode of the Software at Scale podcast. Joining me here today is Emma Tang, who is an ex-Engineering Manager at Stripe on the Data Infrastructure team of the Data Infrastructure group. And she used to be a Lead Software Engineer at Neustar, which is an AdTech company. Thanks for being a guest today.

Emma Tang: [00:30] Thanks for having me, Utsav.

Utsav Shah: [00:32] Yeah. Could you tell the listeners and me, because a lot of us are mostly just experienced with full-stack software development, we do front end, we do back end, where does data infrastructure come in, or data in general? When would I use something like Spark? And when does that become business-critical?

Emma Tang: [00:51] That's a really good question and I think that's a question on a lot of folks’ minds. It's like, “When does my data become too big that I need to use big data?” Or “Should I consider it right now, given the set of circumstances that I have?” I personally don't think there's a hard and fast rule in terms of volume or size that defines big data. You can't say, “Oh, I have one petabyte of data, therefore, I need to start using Spark.” I think that actually, it's really depending on your situation, like your architecture and your business needs. So, some symptoms of when you need to consider big data include, oh, you're actually hitting your production database for analytics and you're really worried about the load on your production application. That's a time where you need to consider maybe something that's more specialized for dealing with big data. Another example is, let's say you integrate with a bunch of third-party vendors like Salesforce, and Zendesk, and Clearbit, and you want to integrate the data into your system and yield insights from that, you should also consider big data when you're at that point. And then obviously, another case is, you just literally have a business need. For example, you need to produce data-heavy reports for customers or regulators, or other internal purposes, you also should consider spinning up big data custom for that as well.

Utsav Shah: [02:10] Okay, so maybe, walking through this as somebody who has no experience in this field at all so I'm just guessing here, maybe like V1 of how Stripe would be implemented is just a relational database that has transactions, it's like a ledger. But at some point, you want to do a lot of processing on those transactions, like what the tax rate is for a certain transaction. And there's also really complicated transactions like you're paying maybe an Instacart shopper, and then you're also paying Instacart something and that money is coming from some random account somewhere. At that point, you would start considering managing those transactions using something like Spark? Is that a good way to think about it?

Emma Tang: [02:51] Yeah, you're definitely going at it the right angle here. So, at Stripe, for example, some of our earliest use cases for big data was actually in kind of the reporting and reconciliation side. So basically, because we're a FinTech company, we deal with a lot of money. And when money is involved, accuracy is super important. So, we can't be like, “A rough estimate of what happened yesterday.” No, we have to know exactly what happened yesterday, and not only yesterday, for all the days before that as well. If a transaction was somehow reversed a few days later, we need to know about that, and we need to reconcile the systems. So big data is great for that. You can very easily take in the whole universe of all the data you've ever had in your company, and just crunch it in a few minutes. So that's the power that you get. At Stripe, some of our earliest use cases, including our internal billing system, so at the end of the month, we need to charge our customers a fee and say, “Oh, you made these many transactions on Stripe, we need to charge you this much money.” That process is actually a big data process at Stripe, it's literally how we make money. And big data is at the epicenter of that.

Some other examples are just product examples. So, Stripe has things called Radar fraud detection, or Capital, which is our lending product, and Sigma analytics products. All these products are powered by big data. Without machine learning without big data, none of these are possible.

Utsav Shah: [04:13] Okay, that makes a lot of sense. So, you basically have something like a crone job or something, just bear with me here, which will go through like one day's worth of transactions and figure out, “Okay, these are the final fees that I need to charge people, or charge our customers based on whatever transactions they made,” and that will finally go through.

Emma Tang: [04:33] Yep, sort of like that. And we have to combine different data sources together as well. It's not just one transaction source, you also have to combine it with their custom fee plan, because they might have a very different fee plan from other merchants. And you have to combine maybe currency codes, maybe they operate in different countries, and there's different currency codes and different currency exchanges and all that. So it’s a lot more complicated than I’m making it out to be, but that's where big data shines. You can really do very complex stuff on really [05:00] big data and everything, the system is built for that kind of processing.

Utsav Shah: [05:05] Okay. How does one verify that what the number-crunching did is actually correct when you have such a large pipeline?

Emma Tang: [05:15] That's such a good question, and one of the biggest pain points in big data honestly. Again, I think it really varies by business need. If it's a B2C company that's like Facebook or social media, where you want to know how many clicks happen on this post or whatever, it's not super mission-critical that it's super, super accurate. And therefore, the way you implement that system might be a little bit different. You don't need at least once or only once like semantics. Whereas at Stripe, we really do care about accuracy. So, for us, we actually build in a lot of custom validation frameworks on top of the jobs themselves. So once a job finishes running, we actually have a bunch of separate post hoc processes that we run after the job runs to validate the data, that it is the right format, that it follows some rules, for example, it has to be always growing, or “the max may not exceed this”, or unique key validations. So, we have all of these that run after each job to automatically validate this. And if it doesn't satisfy these, we immediately invalidate the data and stop the entire pipeline from running. So, this is something that we built bespoke at Stripe, but I honestly feel, and we'll talk more about future trends and stuff like that, but I honestly feel that this is one of the things that the industry needs to do better, which is easy ways to build validations on top of big data pipelines.

Utsav Shah: [06:40] That's really interesting. It sounds like it's the inverse of machine learning where you're trying to make sure your training data is clean before it goes into the pipeline. Here, you need to kind of figure out your output data is clean and so you run a bunch of verifiers, and you need to--

Emma Tang: [06:56] Yeah, exactly.

Utsav Shah: [06:56] Okay, so that makes sense. So maybe a system design for something like Stripe would be, you have an API, which pretty much writes information to some kind of big data warehouse like Hive or something like that, and then you have Spark jobs that crunch all of that information that go into Hive and produce this final output, which is like reports and billing, and then you have a bunch of verifies that check other--

Emma Tang: [07:21] Yeah.

Utsav Shah: [07:22] Okay, that's really interesting.

Emma Tang: [07:23] I can describe the system at Stripe for you. It's a little bit different from other companies because, again, big data is so core to our business at Stripe, therefore, it does look a little bit different for us. So, there are a couple of different sources of data for us to be used in big data. One is our online database. So, at Stripe, we use a combination of different types of databases, we use some relational, but also some NoSQL databases. So, at Stripe, we have a really huge Mongo presence. And there are actually ETL pipelines that take the data from Mongo, and basically apply yesterday's changes to yesterday’s snapshot to create today's new snapshot. That snapshot will land in S3 for downstreams to use. In that process, we also apply a schema, and we do validations on top of that Mongo data, so the data is actually in Parquet format when it lands in S3. So, another source of data for us is from our Kafka Streams. So, these are more immutable event-based streams but also end up being archived in S3. And we also transform that to a Parquet format. And then there's other smaller kinds of data streams that flow in mostly from third-party integrations, like I mentioned, Salesforce, etc., those also land in S3. So as Stripe, our data lake is S3, the source of truth is S3. However, we do have data warehouses on top of that, but they are not considered the source of truth, even though they have almost all the stuff that S3 contains.

So at Stripe, we use a combination of Redshift and Presto, but we're in the process of deprecating Redshift. So, I think we're very close. So that basically sits on top of the data, and it gets ingested in every day through another pipeline as well. In terms of batch distributed computation, we use Spark. We use just mostly DataFrames, and we also have quite a bit of Spark SQL. We used to use something called Scalding, which is something based on MapReduce, which is an older technology, but we've mostly rubbed off that at this point.

Utsav Shah: [09:29] Okay. And the benefits of Spark are you get basically a streaming infrastructure rather than a map job, like a map phase and a reduce phase. And each phase taking a long… especially the reduce phase, could potentially take a lot of time to complete.

Emma Tang: [09:42] Yeah, so we actually don't use Spark for streaming purposes. So, Spark actually still has analogous phases. There's mapper phases, but there's shuffles where it's kind of like a reduce phase. So, I think the basic concept is very similar, but I think the main advantage [10:00] to do MapReduce is it's a lot of in-memory processing, which is a lot faster than having to [Inaudible 10:06] onto disk over and over again. So that's why it's a lot faster. It's also just because the people working on it, there's so much attention on it, the optimization they're doing is really, really great, they have a really strong optimization engine. And if you write data frames or SQL, they can actually optimize your query very well, to the point where it will outperform raw Spark, without using DataFrames by like 10x, 20x. It’s quite amazing. So, there's a lot of interesting things that are happening on the Spark SQL side.

Utsav Shah: [10:39] Yeah, that definitely sounds really interesting. And it seems like you have a lot of different data sources that put stuff into S3, and then that gets processed by the Spark jobs, and you have this business-critical output. And it's a very different shape of a company compared to your regular tech companies that don't have to deal with… or a lot of tech companies don't have to deal with large scale data consistency. As you said, a lot of it is analytics so it's okay if it's slightly wrong. That's fascinating.

Emma Tang: [11:10] Yeah, it's fascinating but it's also very challenging in different ways, I would say. We care a lot more about accuracy, so our pipelines tend to be really robust, but also slightly more computationally intensive. Like for the same job, we’re probably spending three times the computation power than a normal company would to make sure everything's great, accurate,

Utsav Shah: [11:31] Okay. So just in the additional verification, and all of that is where that additional time goes through.

Emma Tang: [11:36] Yeah, there's a lot of that. And also, sometimes we take the more robust route. For example, you can get an estimate by just crunching three days of numbers, but we would go back in history and crunch all 10 years of data.

Utsav Shah: [11:48] Oh, interesting. And it's only going to grow with time.

Emma Tang: [11:51] It’s only going to grow.

Utsav Shah: [11:51 And even transaction volume is only going to grow.

Emma Tang: [11:53] Exactly. So, there's actually good ways to work around that issue as well but we can go into that another time, as long as-- There's a lot of stuff here. And there's some things I didn't even touch upon that's also in the sphere of big data. For example, scheduling is also a major issue. You have pipelines, you want to make sure that this job runs after job A and job B runs after job D. That kind of what's called a directed acyclic graph is something that is very important to use in making data pipelines more robust. At Stripe, we use Airflow, but there's also other schedulers out there as well.

Utsav Shah: [12:32] Okay. Yeah, I'm familiar with Airflow, where, as you said, it seems like it allows you to make these DAGs off different jobs so you make sure that the entire pipeline goes through. But what happens then when somebody adds an intermediate job that starts failing? Does the entire company get paged? How do you maintain reliability?

Emma Tang: [12:53] You ask all the best questions.

Utsav Shah: [12:55] Thank you.

Emma Tang: [12:56] Unfortunately, sometimes that does happen. So different people deal with it differently. At Stripe, we have basically different types of rules for each job to specify what the failure behavior is. So, there are rules like “Stop the entire pipeline”, that's usually the default. But there's also rules like, “If this job fails, and the next job needs to run, the next job is allowed to use yesterday's data.” Or “If this job fails, there's some other default job that you can default to. That job will run instead, and then the downstreams can run.” So, we have a lot of complicated logic around this, and unfortunately, it just really depends on the situation.

Utsav Shah: [13:37] Yeah, that makes sense. So, I guess one example I can think of is if you're trying to detect whether there was a lot of fraud or not, you don't have to use the previous day's data. I don't know, I could be wrong here on what the business criticality of that is in case that job is failing temporarily. But definitely calculating how much you need to bill your customers; the entire pipeline has to stop.

Emma Tang: [13:58] Yeah. So, unfortunately, when that happens, people do get paged. All of us get paged.

Utsav Shah: [14:04] Yeah. So how do you maintain data quality? This is a very persistent problem that I've seen where sometimes data might just not be-- For example, country codes, but there might be an invalid country code over there in that, and then some assumptions that a downstream job makes on valid country codes could just break. How do you make sure that data isn't getting dirty because of some intermediate job? I don't know if that makes sense.

Emma Tang: [14:32] Yeah, no, that does make sense. So, like we talked a little bit before, I think validations are a really important part of it. So, we do spend a lot of again, computational resources and ultimately money on making sure that the data is really correct. Some architectural decisions that we made to make sure things are good is, for the most part, at Stripe, in terms of Spark jobs, it's almost all purely Scala, so it's type safe. [15:00] We also make sure that all data that's important also has a schema associated with it. So, all data is written in Parquet but we also have a schema representation for every data set. So, we use a combination of Thrift and Proto. We basically, used to only use Thrift; we’re moving on to Proto as a schema representation. And they need to match one to one with the data you're reading. So, if you have a mismatched data schema anywhere along the way, it won't work. So, we rather things fail than be false is our approach. It's an extreme way to do things, again, but that's just the world that we operate in.

Utsav Shah: [15:39] I think that's fascinating. So, every single schema is basically specified via a Protobuf message.

Emma Tang: [15:45] Yeah.

Utsav Shah: [15:46] Or at least the really critical ones.

Emma Tang: [15:48] Yes, I would say almost everything. So, everything that I mentioned that comes through Mongo, or comes through Kafka, all datasets coming in that way will automatically have a schema representation, either in Thrift or Proto. And we have some libraries that autogen Scala libraries for each model, basically, which users can then use in their type safe Spark jobs. So that's how we ensure there's consistency there.

Utsav Shah: [16:15] Okay, that makes sense. So being the Engineering Manager for pretty much one of the most, it seems like business-critical teams, the company must be pretty stressful.

Emma Tang: [16:27] It can get stressful, that's true. Like you mentioned, when jobs fail, it can fail any time of the day, which can be stressful when you're firefighting. I would also say, increasingly, and this touches upon another subject that we might talk about later, which is future and stuff like that, a trend that I'm seeing a lot is increased data governance, data lineage, data security type work. There's just more regulation these days about how data is stored, how data is moved around, what kind of data you can keep around. So that also adds a lot of pressure to data teams as well. Not only do you have to do your job of making sure the infrastructures are running, that you're actually fulfilling the business needs, but you also have these security and privacy regulatory things that you have to start implementing solutions around as well. So, I think there is, in the last few years, a little bit of a double whammy of pressure on data teams. So, I definitely think it’s a ripe solution for some startup to come in and help solve.

Utsav Shah: [17:34] So maybe to dig into that a bit, is an example of that something like European data has to be processed in a European data center? Is that an example of that?

Emma Tang: [17:44] Yeah, an example is, if a country comes out and says, “Oh, we're going to implement data locality rules. All payment data needs to live in a data center in my country.” And then you might have to spend a lot of time actually working with a regulator to figure out what does that actually mean? Does it mean just storage? Does it mean storage on online databases but also S3? Does it mean compute, like data in fly? And if it is compute, is there a time limit? Can you keep it around for 24 hours, and then ship it over by then? There's a lot of different kinds of possible specifications here. So that's an example of a rule. Another thing is GDPR, you have to encrypt different things in ways. So that's another example.

Utsav Shah: [18:33] Is there a particular country that has the most annoying rules? I don't know if you want to share. Or one particular rule that you had to deal with?

Emma Tang: [18:47] I think honestly, these are great things for the consumer and for businesses too. These are things that we should have in the world. They do add technical challenges to the teams that have to sell for them. An example of one that was a little tricky to solve was the India data locality rule that’s coming out, specifically for payment data. And yeah, basically, if you already have a company and architecture that has scaled to a certain point, then migrating away from that architecture to make data locality work can be a bit challenging. But we still figured out a really great solution, so I'm very happy about that.

Utsav Shah: [19:28] And you can't even fall back to the US when something like India because it’s literally halfway around the world, so the latency must be just really high.

Emma Tang: [19:38] Yeah, so luckily, we're all AWS so latency is not as bad as you think it is, even though, yes, there is some bit of that latency.

Utsav Shah: [19:50] Okay. Yeah, I just remember trying to SSH into my work production servers when I was visiting back. It was like, “I can't work here at all, there’s no point. [20:00] It's just too slow.”

Emma Tang: [20:03] Yeah, it could be like that. We did a lot of studies before we even attempted with any solution, to compare the latency characteristics. And in the end, it was actually much better than we assumed so, it was great.

Utsav Shah: [20:13] Yeah, that is interesting. So how did the work of your team evolve over time? Because I'm guessing in the beginning, there were teams just writing Spark jobs, and at some point, somebody decided, “We need to really invest in this space and make sure there's a [platform-ish 20:33] layer.’ So, I'm guessing that's where your team is. How did that work evolve over time?

Emma Tang: [20:37] Yeah, no, that's a good question. So, I'm going to step back a bit and talk a little bit about why would someone want to have a data org. Is it okay to not have a data org at all, and just figure out whatever solution that works for the business need? And I think the analogy for me is having planes around when most of the transportation is our trains. You can usually get to places on trains, and it's relatively okay, but there are some places that you cannot get to, like, from North America to Europe, you can't get there just on a train, you have to have a plane. But if it's a really short distance, if you're going from San Jose to San Francisco, maybe a plane is a little bit too extra, probably a train is okay. That's an analogy, I think. Big data, it's going to solve problems that your current architecture just cannot solve. It's just not possible. So, you're just losing out on capabilities and possible business products and scalability if you don't even consider it. You just will be kind of left out in the wind, in the dust if that happens.

So the question is, when do you want to consider doing it? And like I mentioned before, when you've noticed that there are make or break situations in terms of business needs, the products that you want to make, you want to run machine learning, for example, you definitely to set up big data in order to even do that. Let's say you have a real-time business need for processing some large volume data, you probably need to set up Kafka. If your production database is falling over because you're running queries on it, you should probably set up a data warehouse like a Redshift, or a Snowflake. So, these are the times when you need to consider it. But I would really recommend not waiting until the breaking point. Investigate earlier, see if you can integrate it into it smoothly so that your architecture will just work well with whatever big data systems that you introduce.

Okay, now, sorry, I'm going to go back to your original question, which is how has work evolved on big data at Stripe? So, I think I'll just start out by saying there is a couple of different philosophies about how data infrastructure groups should be organized. And I think, especially on the point of where do you put data infrastructure engineers and data engineers? And to clarify, in terms of terminology, I think the phrase data engineer is actually interpreted quite differently at different companies. My personal definition is folks who are setting up the Hadoop clusters or Spark or basically writing these autogen for schema data models, I think these folks, I call them data infrastructure engineers. And the folks who are using the infrastructure to write the jobs and pipelines, I call them data engineers. So, I think that's how I differentiate the two. And a lot of the organizational principles revolve around how do you organize a relationship with these two groups?

One philosophy is basically have all of them in the same org. So, you have data infrastructure engineers, and then you have data engineers. And data engineers will help production engineers write their pipelines, but they report up to the same folks as the data infrastructure engineers do. Another philosophy is, let's say the data engineers are actually inside the product works, right? They report up to the product works, but they just know data engineering, and they will write the pipelines for the product works. And the third is there's no data engineer. There's data infrastructure engineers, and there's product engineers, but the product engineers have data engineering capabilities, they know how to do these things, they are more generalists. So, I think those are the main three philosophies.

And Stripe definitely started out as data engineers and data infrastructure are one and the same. When it was small, the people who spun up the infrastructure are also the ones who knew how to write this, so they were doing both jobs. Eventually, when we started growing, it was just very inefficient to keep doing that because they're fundamentally very different skill sets. Understanding production data, understanding the business logic, and how to combine data sources to yield the kind of insights that people want from this is not the same job as making sure the Hadoop cluster doesn't fall over [25:00] from load. So eventually we started separating the two and that's where we were met with the hard decision of should data engineering live within our org, which is kind of how it was in the past, versus trying to split it out outside. And we've experimented with different kinds of ways in the past, but I think we ended up in a hybrid model in the end, where we have an organization under data science, which takes care of the most complicated business models and the most complicated analytics, and they will write the pipelines for the other products, basically. And they report up to data science.

And then on the other side, most other pipelines, not the top level most complicated data models, like most other pipelines on the product side are written by product engineers. And as data info’s folks like myself and my team, we make sure the tooling at Stripe is some of the best tooling around. If you want to write a data job, you want to schedule a job, it's super easy. Observability into your job should be super easy, testing should be super easy. That's our job to make sure the barrier is really low so that we don't need to hire specialized data engineers to kind of tackle this problem.

I can talk on and on about this topic but what I think is important to call out is there's a couple of reasons why having a separate data engineering org may not be the best solution for a lot of different companies. One is data engineers are actually really hard to hire. If you think about it, what do data engineers do, in my definition of data engineers? They're the ones translating production logic into Scala or Python or whatever you write your data pipelines in, so they're literally translating things. So, it doesn't really bring a really big sense of satisfaction when you're not producing a lot of things. You're literally a translation layer. And these folks are also highly technical as well. There's no reason why they need to just write pipeline jobs, they can move to other roles as well. So traditionally, they're really hard to hire. And we've observed this at Stripe as well, they're just really hard to hire.

And the other thing is when a problem has a software solution, versus a hire people solution, you should always try to bias towards the software solution a little bit, because, at some point, your company is going to grow big enough that hiring a million data engineers is just not very tenable. First up, they are hard to hire. And second, it's expensive to hire them. So, if you take the resources that you would, and put them into developing the best software solution possible, I think that's a more long-term sustainable path forward.

Utsav Shah: [27:31] That's so interesting. And I think that certainly makes sense. I don't think I had a good understanding of what a data engineer does before I heard you talk about it. And yeah, how did you even go about evaluating whether somebody is a good data engineer or not, or whether somebody even fits that archetype of what you're looking for?

Emma Tang: [27:50] So I personally don't hire for data engineers, I hire for data infrastructure engineers, which is, again, a very different skill set. I think for data engineers, the folks who write these pipelines, they have to have a curiosity into the business, and the product, to really want to understand why the data should be combined the way they do, be creative about coming out with solutions. I would definitely say they need to be a lot more invested in the business side of things.

Utsav Shah: [28:19] That makes sense. It's like somebody who's interested in how the product works and why people are buying the product. And they want to--

Emma Tang: [28:24] Yeah, working closely with product teams. Yeah, exactly.

Utsav Shah: [28:28] And yeah, you hire for data infra engineers, these seem like more your standard infrastructure engineers interested in building big systems and seeing them work and maintaining them. Is that correct?

Emma Tang: [28:38] Yeah. So, we definitely want to see folks who not only can dig very deep and can successfully operate in very large codebases, because all of the technology we use - Hadoop, Spark, Kafka - they're all very, very complicated codebases, and you need to be very familiar with digging around in that kind of code. I think the other side that we really like seeing is some DevOps experience or at least the potential affinity for the kind of DevOps work that we sometimes have to do. If someone just really hates DevOps, doesn't want to go onto the server to see what the logs are, that’s not the greatest thing. So, it's at once technical depth, but also operational excellence.

Utsav Shah: [29:21] That also seems hard to hire really good people.

Emma Tang: [29:24] Yeah.

Utsav Shah: [29:25] But at least the role is clear, I would say, in the industry what this person does. Interesting.

Emma Tang: [29:31] Yeah. Honestly, all the companies are going after the same pool of people. It’s really funny. If you’ve been in this industry long enough, you basically know everybody else who’s in the industry. You just check in with them once in a while, “Hi, are you ready yet?” Check it every six months or so.

Utsav Shah: [29:46] Yeah. So, you're definitely looking for people who commit to Hadoop and all that, for example? Or that would be like an ideal.

Emma Tang: [29:51] Not necessarily Hadoop.

Utsav Shah: [29:52] Okay.

Emma Tang: [29:53] I would say people with some distributed computation experience is usually helpful. It doesn't have to be Hadoop. You can be [30:00] a Kafka expert, or you can be a Scalding expert or whatever, and still be very successful at a company that does other types of distributed computation.

Utsav Shah: [30:10] Okay. And how do you evaluate whether you're doing a good job and you're helping the business enough? Clearly, when things aren't burning down, that's great. But what kind of quality metrics or like, “Are we shipping enough?” How do you measure yourself?

Emma Tang: [30:25] That's a really good question. So, we have a couple of mechanisms to kind of make sure that we're keeping ourselves honest. We definitely take a look at the runtimes of jobs themselves. We want to make sure that even though data is always growing, we still want to maintain a neutral, if not, lower runtime, over time. That means we keep on making optimizations on the frameworks that we're providing. We keep on providing really useful helper methods so that you can avoid processing more data than you need to. We introduce things like Iceberg data format so there's more optimized reading and writing. So, we're trying really hard. Ultimately, we're trying to make sure that we get to the results as fast as possible, as cheaply as possible. And then the other thing is, are we addressing business needs? We have check-ins with all of the user teams at Stripe, usually, at least every quarter, if not more frequently, where we really listen to what their business needs are and see if there's any ways that we can introduce new systems that help address their really painful, burning need. So, for example, Stripe Capital, they were looking at potentially, Spark ML, as well as data science was interesting. So, this will be a framework that we potentially will introduce to solve the business need. And we will check in periodically to see if what we're doing is fast enough and solves their problems for them.

I think a lot of other metrics include false failure rates. We want to make sure that those get reduced. We want to count how many data validation errors that we hit. We want to reduce the false positives there, and also just reduce the number of validation errors in general. And, Stripe, in general, is very focused on metrics. So, we have a lot of KPIs. So as an org, we actually have a dashboard of like 10 or 20 different metrics that we track internally to make sure we're doing our job right. So those are kind of the things that we keep ourselves honest with. But again, like you mentioned, there's a lot of different things that are coming in that are not really well captured by metrics, for example, security or data locality. So, it’s definitely a complicated business.

Utsav Shah: [32:38] Yeah, it sounds like it's one of the hardest technical challenges, trying to deal with all of this data and all of these different constraints and making sure you're doing your job right. It sounds like a pretty hard technical challenge.

Emma Tang: [32:49] Yeah, but that's the fun part.

Utsav Shah: [32:52] Yeah. The best part about Stripe is it enables so much stuff. I really like the mission of the community which is, I think improving the GDP of the internet, like enabling or increasing the GDP of the internet. I think that's such a great mission statement. Yeah, and there's so many technical details behind that to get all that working.

Emma Tang: [33:13] Yeah, definitely. Definitely.

Utsav Shah: [13:15] So then maybe you can explain to me as a layman, why Snowflake is so popular? And what did it solve in the industry that other people were not doing?

Emma Tang: [33:25] Yeah. So, this is something that I personally feel pretty strongly about, which is SQL-ification of everything. If you think about initially what big data looked like, it's like MapReduce jobs. So, you write jobs in Java, they're really verbose, you have to write sometimes your own serialization, deserialization code. It's just extremely complex and you need to know a lot before you can even be productive and write anything that does any transformation. So then comes other kinds of possibilities, like Spark, for example. Spark initially, also, you can only write in JVM languages, so Scala or Java, and you also needed to write a lot, but they had a lot more helper functions. And then they introduced Spark DataFrames, which is a little bit more robust, a little bit more rails on it. They kind of modeled the Python DataFrames kind of thing. And then there's Spark SQL. And out of all of these, guess which one has the most excitement and increased adoption trends? It's the one that's easiest for folks to use. So, lowering the barrier to entry to big data has been a long-term trend. Hive is very popular for this reason as well. You can write SQL on it. And then Snowflake, Redshift, and the technologies like this are just one and the same. You can do massive data computation with SQL using familiar kind of interfaces and familiar tools. So that's why it's super powerful.

I personally have never worked with Snowflake myself so I can't go too deep in here, but my understanding is that [35:00] there's a lot of advantages to the technology itself. It's also compute and storage is also separated, which is a really great thing. I think initially, Redshift, at least now, they also separate, but in the beginning, it was not separate. So, if the hardware is only like that big, and you can't separate compute and storage, there's only so much data you can put on it, and there's only so much data you can use. And that's not the case with Snowflake. There's also locking associated with Redshift. It took us forever, you can't really do your own thing, and it's pretty expensive, whereas Snowflake, you don’t have that issue. So, I am bullish on any technology that comes out with a SQL solution that's really easy to use, very difficult to mess up, and that can crunch large amounts of data. I'm bullish on any type of technology there. So, Snowflake is just an example of that.

Utsav Shah: [35:51] Okay, that makes sense. And it's more about just the ease of use, and maybe the speed or something like that might be [Inaudible 35:56]

Emma Tang: [35:56] Yeah, ease of use, and the possibility of crunching large amounts of data. You can't do the same on a Postgres database. Postgres is also easy to use, but it's like, you don't have the same capacity.

Utsav Shah: [36:07] That makes sense. And so, you mentioned this earlier, but what do you think are some startups that are missing in this space? Something that still needs improvement. You said verification is one thing.

Emma Tang: [36:17] Yeah. So, I'll just start from the top in terms of just general trends, I think not necessarily all of them are startups. Some of them are just literally further improvements of the open-source repos that we have right now. So, in terms of batch distributed computation, I kind of hinted at this, but we're seeing a lot of database optimizations being borrowed, and then transformed for distributed computation. So, a lot of the stuff that we learned a few decades ago that we applied to databases, we're trying to reapply them again to this field. So, for example, columnar format, column stats, table metadata, SQL query optimization, all of these have arrived for big data. Spark SQL and Spark Data Frame have really great optimization engines that's moving in that direction. There's another thing that is kind of taking the big data world by storm, it's table formats. And honestly, what that is, is, before, data is just data, it can be anything really. But table formats kind of bring a little bit more structure to this. So, for example, the one that we use that Stripe is called Apache Iceberg. Basically, you can think of it as a wrapper around Parquet data. It basically has some extra metadata around to describe the data that it actually is referring to. And that is very powerful, that allows for things like schema evolution, or partition evolution, or table versioning, and also optimized performance, because you have data like oh, [Inaudible 37:56] and stuff like that. So, it's super powerful. I think that's where the industry is going.

I think some other trends that I see are, again, privacy and security. And over here, I think there's just been a lot of demand on data teams too. At every company, we’re facing similar problems, which is how do we make sure that data is tracked, that lineage is clear? How do we make sure that PII data is kept away, stored securely? How do we tokenize different data if we don't need them to be everywhere? A lot of similar problems at all these companies and data catalogs, data lineage products, and data governance products are becoming much more interesting and necessary. LinkedIn had this Datahub data catalog product that were released, and Lyft also released Amundsen. So definitely a lot of companies have released their open-source thing, but there's no one company that's stood out and said, “Oh, we're going to put a stake in this field, and we're going to take over the marketplace.” So, I still think it's a ripe place to explore.

I personally would really like to see kind of a GUI pipeline scheduler. Airflow is definitely a step in the right direction. It has a very feature-full UI, you can do a lot of stuff on it, and that's one of the reasons it's so popular. But I think it could be one step further. The definitions of these DAGs and the jobs themselves, potentially, at least a simple version of it could be a drag and drop. There's no reason why you definitely need to write Python to write as an operator. So, I would like to see something in that direction. And kind of adjacent to that is data pipeline observability and operability. I think, also Airflow offers a little bit of this, but it's really hard. Like you mentioned, what if a job fails in the middle of the night, who gets paged? Can they be smarter about who's getting paged? If this pipeline fails, can there be an estimate about the final job and [40:00] what the delay and landing time is? Which validation failed? How did the schema change from yesterday to today? All of these things could be better captured with better tooling. Right now, it's very ad hoc. You go in and look at the S3 bucket, and you say, “Oh, the data yesterday was like this.” So, I think there's a lot of potential there as well.

Utsav Shah: [40:18] The way you’re describing it, it's like the 2000s, or the at least 2010s, when there wasn't as many options like Prometheus and Datadog, and all of these things, which is now available in production, but seems like it's still not available for data infrastructure.

Emma Tang: [40:34] Exactly. That’s exactly the pain point.

Utsav Shah: [40:37] And just the idea of using drag and drop UIs instead of Python described schema, that makes so much sense. But then you get the advantage of version control, and all of that like with Python. But still, I can totally see, it should be really easy for someone maybe who doesn't understand how to write code at all, they only know how to use SQL, to be able to configure and write a data job.

Emma Tang: [40:59] Yeah.

Utsav Shah: [40:59] Is that possible today? Is there someone who only knows how to run SQL or write SQL? Can they write a data job at Stripe today? Or do you think the tools just aren't there yet?

Emma Tang: [41:12] Currently as Stripe, in order to write a proper pipeline, it's not really possible. You can write kind of pseudo pipelines based on Presto, or Redshift, but in terms of all the robust capabilities that we offer, most of it is only through code. We are moving in that direction, though. So, we wrote a bespoke kind of framework within Stripe that allows folks to write pure SQL and that SQL will get translated into a Spark job. So, you don't have to write Scala at all but you need to write a little bit of Python in order to schedule it. So that still is there. So really, the next step is if we can even get rid of that little bit of Python, and have a GUI-based solution, then that could be much more powerful.

So a lot of companies do have solutions like this, though. Honestly, I don't know how other companies do it but these other companies that use Hive primarily, I think they also write pipelines with Hive, and that's a pure SQL solution. So, I'm pretty sure there isn't a UI solution at other companies. I think Facebook has a solution that's UI-based. Uber also has a solution that is based on Airflow, but they put a custom kind of interface on top that is much more UI-focused. So, everybody's kind of moving in the same direction at these companies. It's just that there is no unified solution yet.

Utsav Shah: [42:40] Basically, make it easier to operate on really large-scale data, and remove as many barriers of entry as possible.

Emma Tang: [42:47] Mm-hmm.

Utsav Shah: [42:48] Okay. Yeah, taking a step back a bit, what got you interested in this space? Traditionally, at least I hadn't heard of data munging, and all of that even in University and I hadn't heard it until I'd been working for like, a year or two years and I had to mess with Hive and try to understand how our user was doing something. And this was like, internal users. And that's when I got interested in making pretty dashboards and all that at max. What got you interested in working on this space?

Emma Tang: [43:13] That's a really good question. So, I think I just got lucky in my previous job. Before Stripe, I was at a smaller company. We were acquired by a larger company, but we kind of stayed nuclear within the big company. And it was small enough where I got a chance to work on basically everything. So, I started on Salt Stack and then I moved towards the backend, and then towards infrastructure. And our infrastructure team comprises both of just normal infrastructure, like our edge servers, because it's AdTech, and then we also have the data infrastructure segment as well. And I've just found myself really gravitating towards the data infrastructure side of things. I think the fact that it's so powerful was really just really astounding. Once you've seen the magic of a big data job work, there is something magical about it. I think the other thing is distributed computation is fundamentally like-- I don't think. I think it's a second-order kind of software problem. Like you're not dealing with one server and how it's breaking, you have to deal with like 1000 servers and how they work in concert, and how they talk to each other, and how it works. So, I think that it was interesting in terms of the technical challenge side of things, so I just definitely found myself gravitating towards it.

I think also the timing was a little bit lucky. I started putting a lot more time into Spark around 2015, 2016. That was when Spark became more stable, I think, from 1.3, 1.6, and then 2.0. It started getting massively more adoption as well. So, I think there was a wave of interest and I was riding that wave as well. So, I went to Spark Summits, and you meet all the people that are super excited about this field as well and a lot of like-minded [45:00] folks. So, I kind of got swept up in the storm that happened at that time, and just kept on working in this field. And it's definitely just, I still think intellectually, challenge-wise, it's one of the best places to be. It's just so interesting. It's endlessly interesting. It's moving very fast.

Utsav Shah: [45:19] That's really cool. And what's your suggestion on somebody who's trying to break into the field right now? Like, if they're just in university, let's say. Or maybe they have a few years of experience, but they find these problems really fascinating. How would somebody transition into this? Or how would somebody start working on data infra or become a data engineer?

Emma Tang: [45:36] Yeah. So, I would advise folks like this. Most companies these days do have big data capabilities. Try to get yourself opportunities to work using the data infrastructure, even if you're not working on the data infrastructure itself. Basically, if there's a chance to write a pipeline for your team, and usually there is, your team probably needs metrics, right? Your team probably needs to know some business insights and how they need to operate. So, if an opportunity arises, take that opportunity, and run with it, and talk with the data infrastructure folks at your company, and kind of learn what their day-to-day is, and see if there's anything you can help and contribute with. If you see that, oh, the way of doing this is kind of repetitive and there's a helper function that you can write on the infrastructure side that could help a lot of other people, try to do that. Honestly, writing libraries is a huge part of what data infrastructure does. We just abstract away things so that our users have an easier time too in writing code. So, I think just being exposed to areas is going to be super important.

What I often see instead is when the opportunity arises to jump on these data challenges that are kind of separate from your work, a lot of folks just shrug away. They're like, “Oh, this is not related to what I do. What I do is Ruby, I do not want to learn Scala. This is totally different.” And I think that's like you’re kind of doing yourself a disservice. If you don't even try it, how do you know? And also, you want to bet on your future. You don't know what's going to be hot in the future, you just know Ruby. It might be good to expose yourself to a few more types of technology. So, I would say just take the opportunities as they come and just go forward with it and try different things.

Utsav Shah: [47:27] Very good. Yeah, you've answered a lot of questions on big data and everything and I don't think I have any more questions for you.

Emma Tang: [47:34] Awesome. Maybe I'll leave with a couple of common pitfalls that I've seen when people spin up their data infrastructure. So, I would say the number one thing is, people trying to do super bespoke solutions like they implement something on their own. They're like, “Oh, I need a scheduler, let me implement something on my own. It's going to be super cool.” It's usually going to be cool, and it will serve your needs in the beginning, but maintaining it and adapting it to your growing business need is going to be a lot harder than if you had an army of open-source people working on the same problem. So, I would say choose popular, widely used frameworks, see how many stars and how many Stack Overflow questions people have answered, just go with the most boring, common solution. That’s usually a good call.

The other thing is probably, if you can afford it, maybe consider using out-of-the-box solutions on the cloud providers, like use EMR on AWS, instead of spinning up your own Hadoop cluster, that probably isn’t worth it. At some point, you can probably just move it in-house, it’s not that hard. You do want to make sure that you're using a technology that is going to be replaceable. I think TCP data flow is much harder to bring in-house than EMR AWS. But yeah, so definitely do your own research. I think the other thing I would call out, and this is much harder to do, so only do it if you absolutely can. But try to make your data immutable, especially the data writing to big data sources that you're going to use downstream. If necessary, just provide a separate path for your offline data than from your online data. They don't need to be exactly the same. I mean, the data needs to be the same, but the format does not need to be exactly the same. So, try to make it immutable if possible. And last but not least, I would say try to have a single source of truth for your data. You don't want to have a situation where the source of truth is separated in different places, like, “Oh, there are some tables that are most accurate on Redshift. There are some tables that are most accurate in S3. There are some tables that are most accurate in these different archives but not these other archives.” So definitely try to standardize and make sure that there is a very easy-to-use multipurpose data lake instead of trying to federate different sources together in the future. That's just such a pain.

Utsav Shah: [49:49] Yeah, so S3 should be the source of truth and has the most accurate stuff always and everybody in the org should basically be aware of something like that. Is that a good way to summarize that?

Emma Tang: [49:57] Yeah.

Utsav Shah: [49:58] Okay, and [50:00] that brings me to another question which you mentioned was, use AWS EMR or use these tools instead of trying to host it open source, at least until the cost becomes prohibitive. Do you know what the markup is, and at what point you should think about, you know, “I shouldn't use Redshift and I should try hosting my own Presto,” for example? At what point do you think it makes sense to start thinking about that? And how much is the markup generally? I'm sure you all have done cost evaluations and these things.

Emma Tang: [50:25] Yeah. I don't recall exactly what the markup was.

Utsav Shah: [50:30] Just like an [Inaudible 50:31], like how much it makes--

Emma Tang: [50:33] It’s definitely less than 10% of your entire spend...

Utsav Shah: [50:35] Oh, interesting.

Emma Tang: [50:36] Yeah, on the cluster. So probably much less than that, but don't quote me on this. We haven't looked at this in a while. But I think it's mostly a matter of, honestly, when people move on to their own clusters, usually, it's more of a business need like they need to install some custom software, that just they cannot do with AWS. Or they need to sync the data into some bizarre place, EMR just does not accommodate that. It's usually those reasons that people move off. And the savings is another justification for moving it off. So, it really does depend on your business need. And on the flip side, it does depend on how much resources you have. If you have the engineers, and you could possibly do this, then by all means evaluate. But a lot of companies, they might not even ever get to that point where they can afford to hire specialized data infrastructure engineers to try and spin this up. So, it really depends on the situation.

Utsav Shah: [51:29] So it's more like a feature gap, rather than a cost-saving measure is why you would consider moving in-house. That's pretty interesting.

Emma Tang: [51:36] That's what I observed. Yeah.

Utsav Shah: [51:38] Okay. Like, there's some very weird place, you need to put data that AWS doesn't support. I mean, I'm just trying to think because it sounds so abstract to me, what kind of things would they not support? What kind of integrations would they not support?

Emma Tang: [51:51] So one thing they don't support is if you want to do Datadog.

Utsav Shah: [51:56] Okay, interesting.

Emma Tang:[Cross talk 51:57]] At least last time we evaluated it. They run on specific versions of Linux, and it just doesn't accommodate our cost of installation of Datadog. So that was one reason we couldn’t do it. There's also security concerns as well. You know, these boxes, they're spinning up and down, they’re processing all of this data, and they're ephemeral. So, there are potential security concerns with that as well.

Utsav Shah: [52:23] Interesting. Yeah, it sounds a little less abstract. So now you were saying something like you can add alerts to whether you're getting enough data on time, and that's why you would put some of your data in Datadog. Okay. That makes sense. Cool.

Emma Tang: [52:35] Yeah.

Utsav Shah: [52:37 Thank you. I learned a lot of stuff about big data and managing big data in general.

Emma Tang: [52:43] Yeah, I'm just spilling all of it out.

Utsav Shah: [52:47] Yeah, and I think it's going to be great to have this recording. And these people are going to be familiar with how much work there is in managing all these infrastructures.

Emma Tang: [52:57] Yeah, cool. Thanks for having me again.

Utsav Shah: [52:59 Yeah, thank you.

Emma Tang: [53:00] All right, bye.