Software at Scale

Software at Scale 26 - Tramale Turner: Head of Engineering, Traffic at Stripe

Jul 07, 2021

Tramale Turner is the Head of Engineering, Traffic at Stripe. Previously, he was a Senior Engineering Manager at F5 Networks and a Senior Manager at Nintendo.

This episode has an unexpectedly deep dive into security and compliance at Stripe. We discuss Stripe’s philosophy and approach towards building secure systems, achieving compliance standards like PCI, and complex requirements like data locality laws.

Highlights

05:00 - Growth at Stripe

09:00 - A sampling of challenges involved in being a payments provider

11:00 - Stripe API traffic is much lower than the traditional large companies with Traffic teams like Google/Facebook/Netflix. Why does Stripe need a Traffic team/group?

16:00 - Stripe’s innovative approach with an embargo for credit card numbers from most of their platform. Idempotency keys.

20:00 - Compliance automation at Stripe!

30:00 - Should the entire organization need to know and care about compliance? Or should teams provide internal platforms to abstract away compliance concerns?

36:00 - Data Governance and locality laws.

45:00 - Security’s relationship with Compliance, and how Stripe thinks about security

53:00 - How to build teams that need to achieve such lofty goals?

Transcript

(Best effort. Find my contact info at /about to report any errors).

Utsav Shah: Hey, Tramale, welcome to another episode of the Software at Scale podcast and thank you for joining me. If you could tell listeners your background and, first of all, the origin of your name, which I think is extremely interesting.

Tramale Turner: Yes, sure. So first of all, thank you for having me. My name is Tramale Turner. And the origin of my name, I think, unless my father was lying to me, is as follows. When I was born, he had this image of the Magi in his head, the three wise men as they're colloquially known. And he thought, “Okay, three men, tri-male.” Didn't like the “I” and so he changed the “I” to an “A”, and I thus became Tramale.

Utsav Shah: Cool, very cool. And so a lot of your experience has been in the networking and traffic space. Right now, you're the Head of Engineering of Traffic at Stripe. And previously, you worked at F5 Networks. So what got you interested in this space? If you could just tell us a little bit about your story? What got you into this space? And what do you think of it? Clearly, you like it?

Tramale Turner: Well, it's interesting. In fact, I hadn't been in specifically this functional area for the majority of my career, but it has been my focus for the last, let's say, five years. I started off as a software engineer. I was a software engineering student at the University of Pennsylvania. I left Penn after my third year, moved to Japan, and worked as a software engineer in Tokyo, working on what I think we now call digital marketing and/or web development, or sometimes interactive media, but it didn't really have a name at that point. And I was designing software for interactive CDs, for websites and what have you. I ultimately left Japan, came back to the United States, was in the Bay for a little bit, commuting between San Francisco and Tokyo, then moved back home to start a company founded by myself and with investment from the folks that I worked for in Japan.

So I started a series S corp in Michigan and built this product called [Inaudible 02:32], and [Inaudible 02:33] in Japanese just means the Festival of the gods. [Inaudible 02:37] was, for all intents and purposes, another social networking service. But this was the age of GeoCities and Yahoo, and social networking services at scale were not really a thing, so to speak. And so our idea, which we thought was novel, was to build an interactive, full media experience where you could have communities and create communities. My co-workers, the people I hired, and the people who, in fact, invested all met on IRC. And so we had this image of people just being able to have real-time experiences and real-time chat in our minds as we were building this. And it was a pretty fun experience as you might imagine running a startup and being sort of the lead technical person. I was way too young to be a CTO of any shape, but I effectively was that. And we showed that off at Macworld and got an investor interested in actually buying the IP. That person bought out the company. I made a little bit of money, bought a house, then went through a couple of failed startups also, in this domain of interactive media, interactive marketing. Ultimately ended up at the Volkswagen Group. Spent nine years at the Volkswagen Group working initially in marketing, strangely enough, working for the Volkswagen brand, and then transitioning more into technology. Traveled all around the world for Volkswagen, mostly all around Germany, and then Latin America. Moved to Puebla, Mexico for a year and then to Herndon, Virginia for almost two years before I left Volkswagen and joined Nissan, again at an advertising agency captive inside of Nissan in Franklin, Tennessee.

I lived in Nashville, commuted to Franklin, and worked on the Infiniti brand. Did that for six whole months before I got a ping from this little video game company in the Pacific Northwest called Nintendo, your listeners probably have never heard of that. Went to Nintendo working as an engineering manager for this consumer online and publishing team as it became known, working on payment services, account services, and developer support. So if you've ever used a Wii U, 3DS, or a Nintendo Switch, and you've had an NNID or Nintendo Network identifier, that was the service that my team built, and that I helped create, and we were also responsible for payments within the eShop, and so on.

So left Nintendo, joined F5, spent 13 pretty fun months at F5 building a brand new set of teams therein when I got a DM in my LinkedIn from the Seattle Site Lead for Stripe who said I should come over for lunch. That was two and a half years ago. I went over on a Taco Tuesday, and I never left and have been a part of the Traffic team and became the leader of the Traffic team for my entire tenure at Stripe, two and a half years now.

Utsav Shah: Okay, how big was Stripe? I have, first of all, so many questions, especially around your experience with Nintendo. But I'm going to ask you how big was Stripe when you joined it? And how big do you think it is now, just approximately?

Tramale Turner: Yeah, when I joined Stripe, it was approximately 1,000 people. And it is approximately four times the size of that now and growing so much.

Utsav Shah: You’ve grown so much over the last few years.

Tramale Turner: Yeah. Yeah, especially the last 18 months, I would say, we've experienced phenomenal growth. I can't say specifically how many people are there now but as I said, about four times an increase, but still larger than that and growing rapidly.

Utsav Shah: Did your experience at Nintendo, especially working on the payments platform and stuff make you I guess recognize how important the problem is? Or how hard and complex the problem that Stripe is solving? Or is it just like, a combination of things?

Tramale Turner: I love this question. So when I got that invitation to lunch, I of course knew what Stripe was. I follow patio11 on Twitter and had seen his posts and Patrick Collison's posts on Hacker News, but I pretty much in my mind just saw Stripe as a payment services provider or what we sort of colloquially call a PSP. They have a payment gateway, they connect you to credit card acquirers, and no big deal, you can accept credit card payments online. And I had dealt with Chase Paymentech intimately while at Nintendo and so yeah, I was very familiar with how the structure of those agreements worked and both the utility and some of the failure modes and fault domains that exist when dealing with a payment gateway. And that's what I walked through the door thinking that I was going to experience. Someone was going to talk to me about joining this thing that's going to, you know, in their words, I'm sure, change everything or transform it. But what I found out was something completely different, and it's super interesting.

The person that I was having lunch with that day, a gentleman named Brian Delahunty, was talking to me about Lyft, who is a user of Stripe, and talking me through all of the different use cases. And I won't belabor the point with all of the various products and services that Lyft makes use of, but as he was talking, a light went on, and I started to see, “Oh, it's not just a payment services provider. This is a company that's trying to democratize access to economic enablement. And not trying to do it just for a certain segment of the population, like startups or developers or large enterprises. It's for everyone, like literally anyone that has access to a connection and can get access to the API,” which is, again, if you have a computer or a phone and have connectivity, you can do that. You can then create a business. And what really occurred to me in that moment was, you could build something that could support putting food on your table every night, without a whole lot of effort. And when I got that concept, when it started to make sense to me, I said to myself, I couldn't imagine not being a part of it.

Utsav Shah: Wow, that's a great story. And I think there's also a lot of complexity there in something as simple as a Lyft transaction because the person driving the Lyft might be from some other country, the company Lyft is incorporated in some other country, and all of the rules and regulations surrounding all of that, making sure you have enough capital, it just sounds like such an interesting technical problem, plus, you're helping the world.

Tramale Turner: Yeah, I mean, to your point, right. So if we can, because it's important, let's just quickly break it down. You as a platform provider, who has perhaps these 1099, or whatever the regulation is within your country, independent contractors who are driving on your behalf, you want to be able to accept funds somehow, right? So you need the scaffolding so that people can pay you money, right? And you can get some remunerative benefit from the service that you're providing. You want to be able to identify those people, to your point, right? You don't know who that person is when they say they want to be a Lyft driver, an Uber driver, a Gojek driver, or whomever. And so that identity question is super important as well. You want to be able to manage the platform, right? So being able to do the core service that you enable, in Lyft’s case, being able to enable rideshare. But with that rideshare, the core primitives for enabling rideshare, you also have to worry about the movement of money. And oh, if you're operating in different countries, what about foreign exchange, right? And then you have to pay those people. And you want to make sure that you have enough money in whatever account that you're managing to do that. And those people who get paid, maybe they want to do that instantly, maybe you have a payment method by which they can have whatever money they've made in the day, immediately transferred to that payment method, maybe a debit card, right? And then they can go, as I said, to their local bodega or to their grocery store and buy the groceries that they need to put food on their table that night, right. And so when I saw that, in real time, and then it started to make sense to me, I started to see Stripe for so much more than what I think most people perceived it to be. And I think the story is starting to, you know, speak volumes for itself these days but there's still a lot more that we can teach the world about what we have to offer. And I'm really excited about the opportunity to do that going forward.

Utsav Shah: Yeah, I think the mission, increasing the GDP of the internet, it's so simple, but it makes so much sense. Yeah, so I've seen some of the numbers that Patrick has posted about on Twitter, and the API volume of Stripe, it’s charges, right? It's not going to be like QPS to a free service or anything like that. So I've seen it roughly translated to like 5,000 to 6,000 QPS. That's where I've just done the math from whatever numbers he shared. And I'm sure it's maybe twice or thrice of that. But it doesn't seem like an inherently super high traffic service. And maybe you can correct me if I'm wrong, and I know you might not be able to share everything publicly but at that point, when you joined the team, why was there a need for a Traffic team? What was the goal of the Traffic team? And how has that goal expanded over time?

Tramale Turner: Yeah, I think it's a really interesting question. You're wrong about the number of QPS, but that's okay. It's okay to be wrong. I won't tell you specifically what that specific metric is but what I will say, which I think is both something intuitive, but also important to consider is that when you're thinking about user attachment to a service, typically you're thinking about, I mean, in this modern age, we think about hyperscale services. So we think about Facebook, or we think about Netflix, we think about users who are engaging at volume, or at scale, as we like saying in the industry, with consumer-oriented, sometimes skewing towards entertainment services, right. And so making sure that you have cached content as close as possible to that user, think about the Netflix core team, for instance, and the work that they do to make sure that the CDNs and the edge of the network are performant and robust and resilient. Those edge network teams that they have, and they do have multiple teams, they sort of make sense, again, intuitively.

But the other thing that you would consider is that if you have to do a retry, or if there is a network partition or packet loss of some shape or form, you know, you kind of get, for free with TCP/IP, a guarantee of delivery, right? So you can retry again, TCP/IP, layer seven will sort of take care of some of the nuance of what does retry actually look like or what does effective and efficient packet delivery look like. And so you can sort of fake it and you can buffer streams or you can sort of jitter delivery of content. And all of that makes sense again, for most folks, I think and certainly, people who are listening to this podcast are very familiar with the vagaries of content delivery and building things like CDNs. But when you start talking about money, it becomes a completely different game. So it's not so simple to just retry a request, because you may inadvertently double charge someone. That's a really bad thing. Or you may inadvertently double pay someone, also really bad. Really bad for the user, really bad for the organization providing the service, and potentially in violation of regulatory restrictions or regulations that you have to comport to in order to continue doing your business.

So when you ask why, or when did a “Traffic team” become an integral part of Stripe’s engineering organization, from the very beginning. Absolutely important to understand what happens at the edge of the network, terminating TLS because, of course, everything is TLS encrypted and protected, making sure that whatever was in that payload goes saliently to its destination, and that the response goes saliently back to the requester. So that's sort of table stakes. But you also want to be quite careful because you're collecting credit card numbers. And so most payment services providers have something called a cardholder data environment. That's where all of those PANs, or primary account numbers, the credit card numbers that you have, sort of sit at rest. In most cases, at least in businesses that are doing it in a PCI-compliant way, which is hopefully all of them, you never want anyone who doesn't need to see that primary account number to have access to it. And so how do you do all of the work that you need to do to make the credit card “part of the business” work? You send forward probably some tokenized representation of that credit card number. And that's what the majority of the business deals with, only that tokenized representation. And then they communicate back to the cardholder data environment and the cardholder data environment will communicate with the acquiring partner and the banks to make sure that there are funds available, and that they can commit the charge. And that is part of what the Traffic team is responsible for, amongst many other things at Stripe.

Utsav Shah: Okay, so that's super interesting. Let me just clarify this. So there is some service or something at the edge, which takes the actual credit card information, does some kind of hash or tokenization of it, and then the rest of the service [in the factory 17:27] at Stripe never has to worry about potentially dealing with PCI because they will have to at some level, but they don't have to worry about actually holding the credit card information because your edge service has taken care of that. And there's only one data store somewhere that knows how to map from your token to an actual credit card number. Roughly, does that sound accurate?

Tramale Turner: Yeah. So the only thing I would correct there is that there is a service that understands the semantics of translating between an actual credit card number and a tokenized credit card number. But as for where those credit card numbers sit at rest, let's just say, without going too much into specifics, that there is high resiliency and robustness of persistence to make sure that, one, there is as little latency as possible for the transaction. Because we want the user and the user being the partner of Stripe, the person that signed up to Stripe, the organization that signed up to Stripe, and their customer to have a really good experience. We want wherever they are on the globe for them to imagine that Stripe might have been founded in their country because it's so fast and because it's so effective in closing those transactions. And so how we persist, and how we translate from raw PAN to tokenized PAN is a really fundamentally interesting distributed systems problem that I think we're pretty darn good at executing against. But yeah, effectively, your summary is broadly correct with, as I said, a couple of nuances that I would correct.
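
The tokenization idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of Stripe's system: the class name `TokenVault`, the `tok_` prefix, and the in-memory dictionaries are all hypothetical stand-ins for a cardholder data environment. It uses a random token rather than a hash of the card number, since card numbers are short and guessable enough that a plain hash could be brute-forced.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a cardholder data environment.

    Only this class ever sees the raw card number (PAN); everything else
    in the system would handle the opaque token.
    """

    def __init__(self):
        self._pan_by_token = {}
        self._token_by_pan = {}

    def tokenize(self, pan: str) -> str:
        # Reuse the existing token so one card maps to one token.
        if pan in self._token_by_pan:
            return self._token_by_pan[pan]
        # The token is random, so it reveals nothing about the PAN.
        token = "tok_" + secrets.token_hex(16)
        self._pan_by_token[token] = pan
        self._token_by_pan[pan] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code running inside the vault boundary should call this.
        return self._pan_by_token[token]


vault = TokenVault()
test_pan = "4242424242424242"  # a well-known test card number, not a real card
token = vault.tokenize(test_pan)
```

Services outside the vault would store and pass around `token` only; the lookup back to the raw number happens at the single point that talks to the acquirer.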

Utsav Shah: Okay, that makes sense. And I've also seen, speaking to the other problem you were talking about with requests not needing to be doubled, it's really bad to retry a request without thinking too hard about it, because you don't want to double charge people, or you don't want to double pay someone. I've seen something interesting in the Stripe API Docs about idempotency keys, where you let users specify a key. And I'm guessing what that means is some service at Stripe maintains a database of every single request that comes in. And from the API Docs, it looks like y’all garbage collect after like 24 hours or something. So you need to store every single request that comes in, if it has an idempotency key, and that's how you make sure not to retry things.

Tramale Turner: That's exactly right. You got it.

Utsav Shah: Yeah. And is that something that your team does?

Tramale Turner: We support the effective routing of those API calls that are idempotent. But there are other teams that are actually responsible for making sure that the integrity of a call that has an idempotency key attached to it fulfills the service requests, and to make sure that whatever mutation occurs only occurs once.
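
A minimal sketch of the idempotency-key pattern described in this exchange follows. It assumes a single process and an in-memory dictionary; the names `IdempotencyStore` and `run_once` are hypothetical, and the 24-hour expiry simply mirrors the figure mentioned in the conversation. A production system would need an atomic check-and-set in a shared datastore, which this sketch does not attempt.

```python
import time


class IdempotencyStore:
    """Remembers the response for each key so a retry does not repeat the mutation."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without waiting
        self._results = {}   # key -> (stored_at, response)

    def run_once(self, key, operation):
        now = self._clock()
        entry = self._results.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Same key seen recently: replay the stored response, skip the work.
            return entry[1]
        response = operation()
        self._results[key] = (now, response)
        return response


charges = []


def charge():
    # Stand-in for the mutation that must happen at most once per key.
    charges.append(1000)
    return {"status": "succeeded", "amount": 1000}


store = IdempotencyStore()
first = store.run_once("key-123", charge)
retry = store.run_once("key-123", charge)  # a client retry after a timeout
```

After both calls, `charges` holds a single entry and `retry` is the same response object as `first`, which is the behavior a client wants when it cannot tell whether its first request arrived.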

Utsav Shah: Okay, so what else am I missing? Is there anything else that is interesting like that edge service you're talking about, which you have to think about, as--? Let's say that you are pitching to an engineer about the technical challenges that your team has to face. What can you talk about publicly?

Tramale Turner: Yeah, so what I like to talk about is the fact that regardless of what any candidate has heard about an edge team in other organizations, and there are, I think, to your earlier point… well actually, I don't know that we covered this point but there are edge teams sort of spiraled throughout the industry. I brought up Netflix, but there's the GFE team, the Google Front End team at Google, Amazon has a similarly shaped team that works on edge primitives. All of the hyperscalers do. Many of the scaled, if you will, used to be startups and are now big companies, Spotify, Shopify, you name it, they probably all have something similar. Even Lyft has an edge team. Typically, those teams are very small, because they have a very targeted and very specific set of services. Many of them that I've experienced, work on things like, you know, maybe there's an Envoy sidecar proxy at the edge, and they want to make sure they're delivering service requests to some service mesh effectively and efficiently. And they worry only about again, that initial TLS termination and making sure that whatever they have instantiated at the edge is robust, resilient, scalable, etc. If a team goes deeper, like I think the Netflix team does, maybe there are several teams within the edge construct that worry about CDNs, worry about API gateways, and worry about also just making sure that there's effective service-to-service communication, and maybe they worry about some other core distributed systems primitives like leader election, and what have you.

Our team does all of those things. And in addition to the network sort of core features, functions, and primitives that we proffer to the organization, and to make sure that the API is highly available and resilient, we also worry about this notion of compliance. And why is that? So I mentioned that we're responsible for the cardholder data environment and the cardholder data environment is the thing that makes Stripe PCI compliant. And PCI DSS is a compliance regime that is a consortium of acquirers and a lot of folks in its current state that care about the integrity and the security of folks’ credit card information, and to make sure that as you're building services online, offline, the amalgam of the two, that you're doing so safely. Because what no one wants to ever have happen is that you're on the front page of the news because you leaked credit card numbers, or someone was able to break into a system and exploit those numbers and then trade them on the dark web or some horrible story like that. So every organization that deals with credit cards has to have some level of PCI compliance, be that online or offline.

And we are really good at the PCI compliance motion at Stripe, I would say, fairly exceptional. And the team that looks after a lot of that work, from an engineering perspective is the Traffic team. That's right. So because we got really good at that, the organization sort of looked at us and said, “Would you be interested in thinking about how you can help remove the toil of a lot of these other compliance and regulatory concerns we have?” So SOC 1, SOC 2, future things that we might be thinking about, that I can't talk too directly about here. But as an example, there are things like FINRA, HIPAA, FedRAMP, all types of regulatory concerns. And what's interesting about those regulatory concerns is that they all require some level of evidence to show that you are compliant with the regulation. And that evidence collection motion, when you start to zoom out and look at it, a lot of the things that are being asked are quite similar. And so as engineers, you look at that and say, “Oh, well, these are declarative statements. And when we have declarative statements, that means we can probably automate the thing. We can come up with a state machine or we can come up with some system by which we are automating the collection of this evidence and maybe even automating the reporting of our attestation of the correctness of that evidence.” And so that's one of the things that the Traffic team at Stripe also looks after.

And then we have yet another thing that we're working on that is kind of in the shape of an infrastructure primitive as a product. And I can't talk too much about that, but I will say that it is not completely orthogonal to that notion of compliance and that notion of just being very assertive about protecting the integrity of information that one might share with one's customers as an organization, as a business. And so I'm hopeful that we'll be talking about that in the next less than a year, I hope, depending on the velocity by which we build these primitives, but it's something I'm super excited about. And I really can't wait for Stripe’s users to hear about it.

Utsav Shah: Cool. And I don't know if you've ever spoken about the thing that I'm doing right now. I'm working for a compliance automation company. Yeah, I don't know if we've ended up talking about what I've been working on since we spoke last. I've been working for a compliance automation company, so all of this sounds super exciting to me. And it's interesting that the Traffic team has ended up in charge. But I guess that somewhat makes sense as well, given that y'all are the team that has to worry about making sure your integrations in a sense, with the rest of the company are compliant. I can also totally imagine that… how do I frame this? Any sort of regulatory issue should be caught at the earliest layer possible, and that could end up being the Traffic team. You can imagine if there is a payment that is being made, again, in your Lyft example, in a way that isn't compliant, you don't want to find out right at the end. You want to find out right at the beginning. I don't know if that makes sense.

Tramale Turner: No, it totally makes sense to me. And I think yes, and… I think that when we think about compliance, one of the things that a lot of organizations in my experience tend to do is, well, everyone in the organization sort of steps away from compliance, and they're like, “Oh, that belongs to the regulatory--" Like, maybe there's literally a compliance organization typically reporting to the CFO or to the chief legal officer, and the engineers in the organization sort of sigh and they're like, “I don't want to deal with this toilsome evidence collection process. It's just super disruptive. And I understand it's necessary, but it's not something that gives me joy.” And what I love about that is that it's boring. Boring, but everyone who participates in this ecosystem, and not just for payments, as I mentioned, like, if you're doing health care information, you're dealing with HIPAA, if you're trying to sell to the federal government, you're dealing with FedRAMP, and so on, and so forth. So I mean, you know this because you're working on a startup that has seen the opportunity, so much opportunity in this space. And what I'm working on is just making sure that we're really good for Stripe internally, with all the compliance and regulatory motions that we have to comport to, but I totally see, totally see the opportunity for platforms and products.

In fact, we know this to be already something that there are many companies that have tooling and/or platforms and services that support. So GRC tooling, which people may be familiar with, which is governance, risk, and compliance, there are a ton of vendors. ServiceNow is one of the biggest SaaS vendors, for instance, that has a GRC tool where evidence is supposed to rest and then you can use that resting state of evidence to support these different regulatory needs and concerns. So there's opportunity here, and I'm excited to just have a team that is really well versed in sort of the complexities of compliance, as well as having the experience to know how to build let's say primitives, then services and platforms to help accelerate getting a lot of that work done. That very necessary, but maybe very boring work done.

Utsav Shah: So in terms of goals, would it be success if no one in the rest of the organization had to care about compliance and it all got kind of platformized by your team? Or would it be success if people still have to do it, but it's super seamless for them? And maybe if you could just talk concretely like how do you enable compliance when you have like a 4,000-person company that has to think about it pretty much because it's bread and butter for your business?

Tramale Turner: Yeah, I think that's an excellent question. I think success for me, looks like being very rigorous around understanding what the toil cost of all of the compliance motions that we invest in today are. Again, these are stay-in-business motions, they're not things that are optional. They're not things that you can choose not to do. If you want to stay in business and you want to continue doing business in certain markets or within certain industries or proffering certain services, you must do these things. And so it's a really easy equation to look and see, “Okay, how many people are committed to--" You know, let's constrain it down. Let's just talk about PCI DSS. “How many people are committed every year to collecting evidence, cleaning evidence, sitting with an internal auditor, sitting with a QSA, and making sure that everything that we're supposed to be doing to secure this data, this credit card data that we're holding, is correct.” That's literally the question. Is it correct? And you can look at all of the effort that you expend against that, and then start to see, “Okay, what parts of this are redundant? What parts of this are things, to the earlier point, could be collapsed into declarative statements that we could tell a piece of software to go and do?”

So a really simple, just very basic version of an example of that is, one of the things you want to be able to see is access logs that tell you when a persistent store that has sensitive data was accessed. You could easily have an operator go to a system, type a set of commands, copy those commands, show the date, and show that the operator is pulling that information to show those logs of when something was accessed and/or modified. You could also ask a computer to do that. And the computer could probably do it more effectively, more efficiently. And with that saliency of automation, you can then start to see, “Okay, what is the time we just saved from having the operator do that?” So those are sort of like the basic conversations and the really easy starter primitives that one would want to start building around, like the automation of evidence collection.
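
The access-log example above lends itself to a small sketch of what "asking a computer to do that" could look like. Everything here is assumed for illustration: the record fields (`store`, `actor`, `action`, `at`), the store name `card-vault`, and the shape of the evidence record are hypothetical, and a real collector would read from the actual logging system rather than a list.

```python
import hashlib
import json
from datetime import datetime, timezone


def collect_access_evidence(log_records, store_name, collected_at=None):
    """Pull the access records for one sensitive store into an evidence record.

    The digest lets an auditor check later that the records were not altered
    after collection.
    """
    matching = [r for r in log_records if r["store"] == store_name]
    payload = json.dumps(matching, sort_keys=True).encode("utf-8")
    when = collected_at or datetime.now(timezone.utc)
    return {
        "control": "access-log-review:" + store_name,
        "collected_at": when.isoformat(),
        "record_count": len(matching),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "records": matching,
    }


# Hypothetical log records; field names are assumptions for this sketch.
sample_logs = [
    {"store": "card-vault", "actor": "svc-tokenizer", "action": "read",
     "at": "2021-07-01T10:00:00Z"},
    {"store": "card-vault", "actor": "alice", "action": "modify",
     "at": "2021-07-01T11:30:00Z"},
    {"store": "metrics-db", "actor": "bob", "action": "read",
     "at": "2021-07-01T12:00:00Z"},
]

evidence = collect_access_evidence(sample_logs, "card-vault")
```

Run on a schedule, a collector like this replaces the operator's manual steps of typing commands, copying output, and recording the date, and the time saved is the toil the conversation refers to.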

Where I think it goes further is, “Okay, well, can you also automate the reporting piece as well? Can you take all of that data that you've collected, if you're collecting it regularly, and just on-demand generate the attestation of compliance?” I think you can. I think that that's actually something quite feasible. And if you get to that point, and you've solved it for PCI DSS, can you solve it for PCI P2PE? Can you solve it for FedRAMP? Can you solve it for FINRA, etc, etc.? And my conjecture at this moment - conjecture, because I'm proving it - is that you can do that, and I am solving that problem internally for Stripe as part of the infrastructure organization. And I am hopeful because we all tend to be quite ambitious at Stripe, that that's something that if we get really good at it, who knows? Who knows what we might do with it going forward as a potential service offering?
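
One way to picture the "declarative statements" idea and on-demand reporting is a table of which evidence each regime asks for, checked against whatever has been collected. This is a sketch only: the framework names and evidence types below are deliberately made-up placeholders, not the actual requirements of PCI DSS, FedRAMP, or any other regime, and `attest` is a hypothetical helper.

```python
# Placeholder frameworks and evidence types; these are NOT real requirement
# lists. Overlap between frameworks is what makes shared collection pay off.
REQUIRED_EVIDENCE = {
    "framework-a": {"access_logs", "encryption_at_rest"},
    "framework-b": {"access_logs", "change_review"},
}


def attest(framework, collected):
    """Report which required evidence is on hand for one framework."""
    required = REQUIRED_EVIDENCE[framework]
    on_hand = set(collected)
    missing = sorted(required - on_hand)
    return {
        "framework": framework,
        "satisfied": sorted(required & on_hand),
        "missing": missing,
        "ready": not missing,
    }


# Evidence gathered once (values here are dummy references) and reused.
collected = {
    "access_logs": "evidence-ref-001",
    "encryption_at_rest": "evidence-ref-002",
}

report_a = attest("framework-a", collected)
report_b = attest("framework-b", collected)
```

Here the same `access_logs` evidence counts toward both placeholder frameworks, while `report_b` flags `change_review` as still missing, which is the kind of gap a generated report would surface before an auditor does.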

Utsav Shah: That makes a lot of sense. And you can imagine, I think everybody that needs to maintain credit card information needs to be PCI compliant. So offering that as “You know what? If you have Stripe, you can also get reports that prove that you've been compliant with the rate.” I don't know how feasible that is because I don't have any familiarity with PCI, but it seems like that is certainly an approach that is exciting. It's also scary because you'd be a competitor of the company [Inaudible 34:30] and I don’t want that.

Tramale Turner: Well, I must say, just to be clear, for users of Stripe, we do handle PCI compliance on their behalf, but there are different levels of PCI compliance and there are different types of PCI compliance. I mentioned P2PE, which allows you to do things like tap on a phone or use the NFC chip within your smartphone to have it act as a point of sales device. I think that you're going to see as different types of online or near online, I think what we traditionally call offline, but they are offline only in as much as there's a person who is present, but everything that's happening after the payment method is made available to the retailer, is something that's happening in the digital space, happening online.

So I think as you see the proliferation of those things, maybe not so much in North America, because North America is just strange in the world. But certainly, globally, there are all kinds of new payment methods coming up every single day. You're going to also see governments and regulatory bodies saying, “Okay, we kind of want to make sure that we have a handle on what's happening here because we want to protect our consumers, we want to protect our citizens, in many cases, to make sure that they're not being taken advantage of or that we're not allowing the proliferation of crime or criminal enterprises by virtue of these new payment methods that are coming online.” A great example of that is what we've seen with digital currency, or crypto recently, and its enablement of some crime vectors that clearly are not optimal. And regulatory bodies are trying to figure out in the present moment, in real-time, right now, how to put some controls around some of those criminal enterprise vectors.

Utsav Shah: Yeah. And as part of maybe a follow up to that there's also-- All right, this is from my conversation with Emma at Stripe recently, have you noticed, or is your team also thinking about data governance stuff - credit card information of a particular country should only live in that country - because it seems like there's more and more laws around that? Is that kind of stuff or something that you have to think about?

Tramale Turner: Absolutely. So it's already public knowledge because I think there were several articles published about it, but one very deliberate way that we're making investments in that space-- And just to sort of add clarity, for folks who may not be familiar, there is a notion of data residency, data locality or data sovereignty. And what does that mean? That means that a regulatory and or political body might say, “For users who are using online services to make payments and to buy goods and services, who reside within our municipality, our country, whatever domain that we control, we would want their data to remain within the borders of the country.” Or if there is a processing function, some type of compute that's occurring, where you're capturing that payment method and doing something to remove or add funds to it like that, that processing must happen only within the borders that they control. And they may say, at least on the surface, that they're doing that for, indeed, to protect the user from potential criminal activity, to be able to easily have access to those services, should there be some nefarious activity, and they need to collect evidence from those services much more saliently or easily.

And then there, if you go deeper, is something of the notion of protectionism. So there may be other competitors within that country that are trying to compete with more globally established players, and the country or the regulatory body may want to support their own natively grown industries in order to allow them to scale and or to provide, maybe to some definition of better, “better” services to their citizenry. So that's what we're talking about.

And Stripe recently - and I should caveat that recently means on the order of two and a half years - has been thinking about this for India. So India passed regulations that basically said that any payment service provider that's operating within the country has to keep any Indian citizen’s data that is used against those services resident within the country. Simple as that. So what does that really mean concretely? If I am shopping online with my credit card, I'm in India and I'm an Indian citizen, or I'm using RuPay or I'm using some other payment method that is either global or local to India, any information pertaining to that payment method, if it initiates in India, must stay in India. And there are some caveats, like you have a little bit of grace to process outside of the country, but any data that is processed can't persist outside of the country for more than 24 hours and things like that. And so if you've built, for instance, on a hyperscaler, and let's say all of your services are in US-East-1, if you're using AWS, that presents an issue, right, because you now need to figure out how to move those services regionally. And then not to degrade the services of any of your users who are not in that country, but also for the users in that country, make it somewhat seamless and make it also seem as though all of those services that you proffer to everyone else are available wherever those sort of more constrained borders might be.
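The residency rule as described in the conversation (data may be processed abroad, but must not persist there for more than 24 hours) can be sketched as a simple check. This is only an illustration: the `StoredCopy` type, the country codes, and the `residency_violations` helper are hypothetical, and the actual regulation has details that the conversation does not cover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)  # grace window as described in the conversation


@dataclass(frozen=True)
class StoredCopy:
    record_id: str
    origin_country: str  # where the payment was initiated
    stored_in: str       # country of the region holding this copy
    written_at: datetime


def residency_violations(copies, now, restricted=("IN",)):
    """Return IDs of copies that have outlived the grace window abroad."""
    violations = []
    for c in copies:
        abroad = c.origin_country in restricted and c.stored_in != c.origin_country
        if abroad and now - c.written_at > GRACE:
            violations.append(c.record_id)
    return violations


now = datetime(2021, 7, 7, 12, 0, tzinfo=timezone.utc)
copies = [
    StoredCopy("pay_1", "IN", "IN", now - timedelta(days=30)),   # in-country: fine
    StoredCopy("pay_2", "IN", "US", now - timedelta(hours=30)),  # abroad too long
    StoredCopy("pay_3", "IN", "US", now - timedelta(hours=2)),   # within grace
    StoredCopy("pay_4", "US", "US", now - timedelta(days=30)),   # unrestricted
]
print(residency_violations(copies, now))
```

A check like this only detects violations after the fact; the hard part discussed in the episode is arranging storage so that such copies never linger abroad in the first place.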

And so that's a tough problem. It's not just a tough problem. It's actually a really difficult problem. But Stripe recently announced that we solved that problem for India. And if you're watching the FinTech news, as I think many folks are these days, you'll see that we'll be making future announcements about other locations that we've solved it for in the coming months. And it's something that we have been investing a lot of time and rigorous effort in making sure that we do correctly. Again, not just for the countries that have data sovereignty rules, either pending or already in existence, but also to make sure that we don't degrade services for the rest of the world, modulo those countries. So I really love this topic because I worked on India. And I'm really happy with not only how we addressed the issue, but the many excellent engineers that just dove deeply into the problem, and rigorously worked to effect a really nontrivial change to, I'm hoping, the delight of our users in India.

Utsav Shah: That just sounds so amazing. Stripe sounds to me like a compliance state machine where you have to continuously inject laws and changes in laws and make the system work. I don't know if you could talk anything about the actual implementation. So I know, at a very basic level, you have to run some, or a significant percentage of services in India. And that makes sense. I don't know if you can talk about anything else, or where the complexity of the implementation comes in.

Tramale Turner: Yeah, I can't talk about specifics, but I can talk about what the-- I think any reasonable person, and certainly anyone that's dealt with distributed systems and understands convergence and understands the, I think, misnamed eventual consistency (it’s eventual convergence, really), I think the thing that makes all of this difficult is state, at the end of the day: where does data rest, and making sure that you can attest that there is a compliant resting state of that data, per whatever the regulation says it needs to be. I mean, my friends who focus on the computing primitives will throw stones at me, I'm sure, but I actually see compute and, actually, my own area, networking, as much easier. It's not that difficult for me to come up with smart route maps and tag a request with sort of a locality primitive and say that, “Hey, I want you to direct this traffic only to these regions.” I mean, we've been doing that sort of, if you will, network-directive and topological decisioning forever, right? Like since the inception of the internet, more or less. That's not a true statement. But it's not that difficult. But dealing with making sure that data that you-- I mean, remember, these are corporations and corporations need to be able to close their books and understand how much revenue they've generated and understand if there are nefarious people trying to attack the system, how they're trying to attack the system. And you need reams and reams of data. And you need to be able to go through all of your data and understand what that data is telling you about what's happening within the system to get state about the system, which is why we call them stateful systems, right?

That is a majorly, huge, huge, huge part of the issue. That's the big rock. And we at Stripe have spent a lot of time, I think, bending our core persistence primitives to a point where-- I don't know how every sort of storage problem within our industry looks in comparison. But I would say there are probably very few organizations that with the tools and services that we make use of, are pushing them to their utility limits as we are. I would argue that probably only the hyperscalers are doing the type of computer science, the type of distributed systems focus that we are investing in in order to comply with these regulations. And I know for a fact that it's a big deal at AWS. I'm here in Seattle and so I have a lot of friends that work at that particular hyperscaler. And I would imagine that my friends down the street at Google, also, similarly, are dealing with how best to comport and comply with these laws and regulations.
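The comparatively easy networking half mentioned above, tagging a request with a locality primitive and directing it only to certain regions, might look like the following Python sketch. The region names, the `country` field, and both functions are hypothetical; nothing here reflects how Stripe actually routes traffic.

```python
# Hypothetical region names and residency policy, for illustration only.
ALLOWED_REGIONS = {
    "IN": ["in-region-1"],  # traffic tagged IN may only go to in-country regions
}
DEFAULT_REGIONS = ["us-region-1", "eu-region-1"]


def tag_locality(request):
    """Attach a locality tag based on an assumed country-of-origin field."""
    tagged = dict(request)
    country = request.get("country")
    tagged["locality"] = country if country in ALLOWED_REGIONS else None
    return tagged


def choose_region(request, healthy_regions):
    """Pick the first healthy region permitted by the request's locality tag."""
    candidates = ALLOWED_REGIONS.get(request.get("locality"), DEFAULT_REGIONS)
    for region in candidates:
        if region in healthy_regions:
            return region
    # Fail closed: never spill restricted traffic into a non-permitted region.
    raise RuntimeError("no permitted region is healthy")


healthy = {"in-region-1", "us-region-1"}
print(choose_region(tag_locality({"country": "IN"}), healthy))
print(choose_region(tag_locality({"country": "US"}), healthy))
```

The sketch fails closed when no permitted region is healthy, which trades availability for compliance. The routing decision stays small either way; as the conversation stresses, the difficult part is where the resulting state comes to rest.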

Utsav Shah: Yeah, and maybe taking a step back from compliance, what compliance really [Inaudible 46:18] a bunch of things to make. What compliance means is, “Are you complying with our set of laws?” But the laws are there for a reason. They're not there just to impede progress for no reason. But a lot of compliance is around making sure that you're keeping your data secure, and also having processes around that. So the flipside of compliance is really security. And I'm trying to understand if there's anything you can talk about with regards to the kind of security practice and security measures and maybe security implementations y’all have to think about, because I'm sure you're dealing with tons of actors trying to steal data from your systems by just fuzzing the API, like, “Maybe I can just get access to something that I'm not supposed to access.” How do you think about that? And how much is your team responsible for that?

Tramale Turner: Yeah, it's a good question. You keep asking me these great questions that I can't actually give you direct answers to. Let me put it this way. We're dealing with money. And money, as we all know, creates an attractive attack vector for assailants. And I think, not just criminals who are trying to perhaps find a way to exploit the system for some remunerative benefit, but also people who would just want to deny others access to that democratization of economic freedom and enablement that I spoke to at the beginning, the thing that actually gets me up and excited about Stripe every morning. And so what I will say is that Stripe has already, and is continuing to invest in one of the best and most rigorous security teams and individuals building platforms and services to protect the integrity of our business that I've ever had the pleasure of working with. The leader of security at Stripe, Niels Provos, was a long-serving Googler, was a Google Vice President of Security, left Google to come to Stripe. And Niels has been building, I think, an amazing team of practitioners, that every time I get an opportunity to interface with them, and I do so frequently, I'm thoroughly impressed. And I actually consider my team to be an extension in some ways of the security team. Security is a different pillar at Stripe than infrastructure, but clearly, they conjoin and clearly, they're intersectional. And I am one of those points of intersection because I support this CDE, this Tier 0 service that is incredibly a lot of things. It's incredibly secure, for sure, but also incredibly important to the viability. Stripe doesn't exist without the CDE, so to speak.

So one thing that I can talk about that we do that I think would be pretty obvious to anyone that deals with hyperscaled or highly scaled services, is that we are very concerned about what's happening at the edge of the network. To your point, who's sort of poking at the API? And what sort of things are they doing, layer three, four, or seven to try to either deny access to the API or to try to do strange things that the API doesn't expect? So you mentioned fuzzing but there are all types. If you were to look at request logs of Stripe, you would see all types of attempts at just doing strange escape vectors, trying to manipulate URL paths, using really odd URI access methods. It would not surprise, I think, most people, but would also really, really freak you out if you saw how frequently and how often folks are trying to attack just this one organization. And you think about all of the payment service providers that are out there. I am hopeful that they all have security teams and engineering teams more broadly, that are just as rigorous around security as we are.

But again, to the point, this notion of denial of service is a big deal, we all know that. And so we make use of certain AWS primitives like AWS Shield to protect that layer three and four but for layer seven, we have sort of our own, it's not necessarily a WAF, a web application firewall, but it is a platform, a tool kit that we have to address different attack vectors that we have both seen and that we anticipate and expect against the API. And we insert primitives into that platform to allow us to do things like throttling or to deny API keys that maybe have leaked for whatever reason. Or if we see card testing, which is something that frequently happens within the industry - trying to test the card to see that it's usable and then to steal funds from that card once you find out that it is usable. We can identify that type of activity and then block it immediately. So that's just a sampling. But trust me, if I were able to, I could probably talk for hours about many of the different types of security mitigations, remediations, and defense-in-depth things that we invest in at Stripe.
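The three layer-seven primitives named above (throttling, denying leaked API keys, and blocking card testing) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `EdgeFilter` class, its thresholds, and the "consecutive declines" heuristic are hypothetical and are not a description of Stripe's platform.

```python
from collections import defaultdict


class EdgeFilter:
    """Toy request filter: token-bucket throttle plus a key denylist."""

    def __init__(self, rate_per_sec=5.0, burst=10, max_declines=3):
        self.rate = rate_per_sec          # tokens refilled per second
        self.burst = burst                # bucket capacity
        self.max_declines = max_declines  # declines in a row before blocking
        self.denied_keys = set()
        self._buckets = {}                 # api_key -> (tokens, last_seen)
        self._declines = defaultdict(int)  # api_key -> consecutive declines

    def deny_key(self, api_key):
        """Block a key outright, e.g. because it is known to have leaked."""
        self.denied_keys.add(api_key)

    def record_result(self, api_key, declined):
        """Track consecutive declines; a long run is treated as card testing."""
        if declined:
            self._declines[api_key] += 1
            if self._declines[api_key] >= self.max_declines:
                self.deny_key(api_key)
        else:
            self._declines[api_key] = 0

    def allow(self, api_key, now):
        """Return True if a request arriving at `now` (seconds) may proceed."""
        if api_key in self.denied_keys:
            return False
        tokens, last = self._buckets.get(api_key, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self._buckets[api_key] = (tokens, now)
            return False
        self._buckets[api_key] = (tokens - 1, now)
        return True
```

Passing `now` in explicitly keeps the sketch deterministic and easy to test; a production system would read a clock, share state across edge nodes, and use much richer signals than a run of declines.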

Utsav Shah: Yeah, so you're talking about web application firewalls. And I think, publicly, I've seen just systems that block based on IP addresses or certain URL patterns. And what you're describing, at least what it sounds like to me is a whole platform of being able to specify all of these different rules or these different criteria, like, “Oh, looks like an API token for someone has gotten leaked, or there's just suspicious activity,” and being able to automatically disable those things. And that makes a lot of sense. And also, it makes sense that you don't want to be DDoSed, since you have a lot of people depending on you for their business. And I'm sure when Stripe goes down, a lot of companies are upset and are calling you up.

Tramale Turner: Indeed, we try not to go down. And I would argue we're pretty good at that. Our availability is staggering when you look at it, five nines. And we say for the network, and I don't want to over speak here, but I think just mathematically for the API to be available at five nines, you can imagine what the network, at least for minutely availability we tried to achieve. Just to be clear, I didn't state what that SLI is so no one at Stripe come at me. I didn't reveal that publicly. But again, we have very high, great expectations, as it were. I've been quoting a lot of Dickens at work recently because wherever I look, and whenever I see a plan, or whenever I see an ambition at Stripe, and I see how it implicates my team, I go, “Wow, we've got a lot of great expectations.”
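The arithmetic behind "five nines" is easy to check. The Python sketch below computes the yearly downtime budget for a given availability and, assuming independent failures, the availability of a chain of components that must all be up. The function names are made up for this example, and no actual Stripe SLI is implied.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of allowed downtime in the period at a given availability."""
    return (1.0 - availability) * period_minutes


def serial_availability(*components):
    """Availability of a chain where every component must be up.

    Assumes component failures are independent, which is a simplification.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


print(round(downtime_budget_minutes(0.99999), 2))  # five nines, per year
print(round(downtime_budget_minutes(0.9999), 2))   # four nines, per year
print(serial_availability(0.99999, 0.99999))       # API on top of a network
```

Five nines leaves roughly 5.26 minutes of downtime per year. Because availabilities multiply in a serial chain, an API that sits on top of a network can only reach five nines if the network itself does better than five nines, which is the point being made about the network above.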

Utsav Shah: Yeah, maybe the last set of questions are around building teams to service these lofty goals. When you have a small team, and I'm just talking about like 50 people, 100 people in an entire company, there are a few trusted individuals that you can think about, you know, “This person will take charge of these things.” But once your company grows so big, you have to think about just expanding from individuals or even teams at that point. How do you make sure your infrastructure stays at five nines? You know, if there is a lot of attrition on the team, how do you set yourself up for success? And what kind of processes and what kind of people are you looking to hire? What are the things you're thinking about when you have to think about, you know, “How do I make sure that my APIs stay at five nines for next year?”

Tramale Turner: I love this because there's no easy answer. There's no sort of canonical response that says, “This is what you do.” If there were, everyone would be copying it, and everyone would be doing it. Here's my experience, and then I'll tell you what I specifically do. My experience is that you, first of all, face the reality of what you just said, people are going to churn. At Stripe, for instance, we're huge proponents of internal mobility. And we actually encourage people to move teams after 12 months or so within a role, assuming, you know, decent performance. And I think the good thing about that is that it encourages teams to be very rigorous about their processes, about making sure that they have runbooks for their services, that they build highly reliable, highly resilient services, that they look to, if you will, build a moat of protection around faults.

The old head of infrastructure at Stripe-- I should say the former head of infrastructure, because he’s not that old of a gent, is Will Larson, Lethain all over the internet. And Will wrote this book recently, An Elegant Puzzle, under Stripe Press. And I encourage folks to read it. I'll make fun of Will, here, I find Will's writing incredibly dense, and you'll probably have to read the book three times. But it's a good book. And Will talks a lot about organizational structures and dealing with the vagaries of what will happen within an infrastructure as it intersects with those organizational structures. So you're talking about how technical complexity meets organizational complexity. And the thing that you have to hold as a base principle is that you don't know anything. It is the sort of Socratic paradox, “I know that I don't know anything.” And for all of your experience, and for all of the things that you've done, you're not going to build the perfect, most reliable, never-faults system. So how do you get around that? You prepare for eventual failure modes, you prepare for understanding what those fault domains look like. AWS could go down. An AWS region could fail tomorrow. What do you do when that happens? And how do you recover from that type of epic disaster? Do you even have a plan? Do you have a runbook? Have you done a run day? Have you done a game day or whatever your company might call it? Have you tried to test your assumptions and validate whatever mitigations that you have in place? And if you haven't, you're doing it wrong.

And so what I do is I try to hire the best. Absolutely. I look for the best possible folks that I can and I never sort of rest on the laurels. I have a ton of great folks that work in Traffic, I'm still looking for greater folks because I want to constantly challenge the notion that we have solved for the problem set, the problem domain that we are responsible for. I listen - I try to - as much as I talk and hopefully more so than I talk because it's really about learning about the environment, learning about experiences and learning about what people are observing, and then synthesizing a sort of mental model of what we should be doing and what we should be investing in, in order to address these risks. I test my assumptions rigorously. So we do game days, we make sure that if we synthetically fail something, how are we mitigating against that eventual occurrence, and so on, and so forth? So I think proper planning prevents poor performance, which is sort of something you steal from the MBAs, is something that really is relevant to engineering and engineering disciplines as well. If you're not planning, then you're not doing it right.

Utsav Shah: That makes a lot of sense to me. And the one question that comes from there is, how do you balance what's practical versus what should be correct on principle? A simple example is, you should make sure that your database backups work. I think almost no one would argue that that's something you should do. But then there are so many things that could go wrong. And thinking about the AWS example, I know a lot of companies would be like, “Oh, if AWS goes down, it's fine if I go down as well.” I'm sure that's not true at larger companies and companies like Stripe, but there's also a prioritization game that you have to play. So is there any framework you think through, like, “How do I prioritize?” Is it just based on how much risk something has towards my business versus what's the likelihood of this happening? What else is there that you think about?

Tramale Turner: Well, I mean, when you think about it from that perspective, it's almost like thinking about-- We were talking a lot about security today. And when you talk about security mitigations, you talk about threat models, and you talk about what's the probability of something occurring? And what's interesting about probability is that you're not saying-- I mean, it's very rare that you're saying the probability is zero, right? You're going to have some percentage of likelihood of occurrence. And so the probability of an AWS region going down is very, very low. But it has happened, it happened in 2012. We all remember the big EBS outage that took out US-East completely. It wasn't just US East one, it was like the entire East Coast went down. And it was a horrible day. Amazon learned a ton from that. And that shape of failure mode is not likely to occur again but the fact remains that a transit zone could go down, some core EBS primitive could go down, bad things can and will happen. And so when you talk about how to be principled around that, it's just acknowledgment of the fact that there's enough entropy in the world that random things, Black Swan events will happen. And if you are being rigorous about-- You know, I think, to the thing that you're driving towards, there's different levels of rigor and expectations depending upon the type of service that you're delivering. For us, we have to take every eventuality and every possibility quite seriously, again, because of the type of information that we're dealing with. We're dealing with people's livelihoods.

We go back to the beginning of our conversation. If I fail to be rigorous about my pursuit and about the things that I'm responsible for, and that person somewhere in Bangalore, or somewhere in Kenya, or somewhere in Dublin, can't put food on their table that evening, that's my fault, I own that failure. And I have to have empathy with that user because I wouldn't want that to be me, and I wouldn't want to have an organization that my livelihood depends upon to not have that type of concern in mind as they're doing what they do every day. And so when we talk about principles, that's the way that I say my principles-- Like of course, Stripe has a canonical set of leadership and operating principles, as does every company of that scale. But I think even beyond the leadership and operating principles, which are very good, it's just about being human and caring about the fact that these services go beyond just bits and bytes, that they touch people at the end of the day. And so the humanist that’s in me says it's just about caring for my fellow man and woman and making sure that I do everything that I can to ensure that they have a good day.

Utsav Shah: Yeah, that makes sense. And it's very easy to get lost in numbers like four nines or five nines, but when you translate that to the actual human impact, that's what really shows you the importance of the work that you're doing. Yeah, well, this has been a lot of fun. Thank you so much for being a guest. And I hope to have you for a round two, maybe in a few years when you can talk about some of these topics more publicly.

Tramale Turner: All right. I will put that on the calendar. Thanks for having me. It was really fun.
