
Software at Scale 20 - Naphat Sanguansin: ex Server Platform SRE, Dropbox

Naphat Sanguansin was formerly the TL of the Server Platform SRE and Application Services teams at Dropbox, where he led efforts to improve Dropbox’s availability SLA and set a long-term vision for server development.

This episode is more conversational than regular episodes since I was on the same team as Naphat and we worked on a few initiatives together. We share the story behind the reliability of a large monolith with hundreds of weekly contributors, and the eventual decision to “componentize” the monolith for both reliability and developer productivity that we’ve written about officially here. This episode serves as a useful contrast to the recent Running in Production episode, where we talk more broadly about the initial serving stack and how that served Dropbox.

Highlights

1:00 - Why work on reliability?

4:30 - Monoliths vs. Microservices in 2021. The perennial discussion (and false dichotomy)

6:30 - Tackling infrastructural ambiguity

12:00 - Overcoming the fear of legacy systems

22:00 - Bucking traditional blue/green (or whatever color) deployments in emergencies. Pushing the entire site at once so that hot-fixes can go out quickly. How to think about deployments from first principles. And the benefits of Envoy.

31:00 - What happens when you forget to jitter your distributed system

34:00 - If the monolith was reliable, why move away from the monolith?

41:00 - The approach that other large monoliths like Facebook, Slack, and Shopify have taken (publicly) is to push many times a day. Why not do that at Dropbox?

52:00 - Why zero-cost migrations are important at larger companies.

56:00 - Setting the right organizational incentives so that teams don’t over-correct for reliability or product velocity.

Transcript

Intro: [00:00] Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening.

Utsav: [00:15] Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Naphat Sanguansin who is an old friend of mine from Dropbox and now a Senior Software Engineer at Vise. At Dropbox, we worked on a bunch of things together like developer productivity, and finally on the application services team, where we were in charge of dropbox.com and the Python monolith behind the main website that powers Dropbox. Thanks for joining me, Naphat. 

Naphat:[00:47] Yeah, happy to be here. Thanks for having me. 

Utsav: [00:49] Yeah, I think this is going to be a really fun episode because we can remember a bunch of things from our previous jobs basically. I want to ask you what got you interested in working on Dropbox, on the main website? So there were a bunch of different things we were doing in the past, and at some point, you transitioned to work on various parts of the site. So what got you interested in that?

Naphat: [01:15] Yeah, that's a good question. There are multiple factors, but timing, I think, is probably the most important one here. So that was right when I had just moved to a new country, I moved to Ireland, I had switched teams completely, and I was sort of looking for the next big thing to sink my teeth into. And you’ll remember, at the time, Dropbox was right at the point of trying to move the SLA from three nines to three and a half nines, I believe. And what that actually means is, they can only be down for 20 minutes instead of 40 minutes. And so that surfaced a bunch of reliability problems, like some assumptions we had made before about how quick pushes can be, how much downtime we can tolerate, things like that. They all failed hard, so we needed to unpack everything and figure out how to do this correctly. For historical context, back then, there wasn't-- Let me just back up. 

So Dropbox is built on a large Python monolith server; we actually have a blog post about this. The way to think about it is to imagine how you would build a server if you had just come out of college. You would probably start with a Python application, [the language 02:34] at the time. You might use some kind of web framework, let's say Django. Dropbox doesn’t use Django, but let's just say something like that. You would define a bunch of endpoints, write your implementations, and just build the website on top of it. Dropbox started like that. It's definitely very, very similar to [Inaudible 02:54] company. And then we sort of grew that codebase. So we added a bunch more engineers, added a bunch more endpoints, and fast forward 10 years later, and you're at a place where we have a couple million lines of code in this monolith, we have thousands of machines serving this same copy of code, we have 1000 plus endpoints. Throughout the entire history of Dropbox, the monolith went through multiple transitions because its age started to show each time. And then back in 2019, it really just came to a head; we couldn't keep going the way we were anymore. Up to that point, also, there wasn't really a dedicated team that owned this service, and so it hadn't received real investment in a long time, [Inaudible 03:54]. And now that we were unraveling all these assumptions we had made about it before, we needed to put in the investment. And so I was asked if I wanted to come in and look at this and figure out what to do with it, and I was like, “Okay,” even though I had actually never looked at the product codebase at all. Prior to that, as you know, we were [Inaudible 04:17] actually were parallel for a lot of years. So prior to that, I was pretty much working on the development side of things. I was doing a little bit of CI/CD. And this was a new challenge, with lots of interesting things that I hadn't thought about before, so I decided to take it and see how it goes. And so that's how I came to start working on this monolith, this Python server. And fast forward two years later, and here we are.

Utsav: So yeah, a lot of people think about monoliths versus microservices. If you had to start Dropbox [05:00] 10 years back, or if you had to start your own company today, with your experience, would you not start with a monolith? Or is it just that (at least, my opinion is that) we have to continuously invest in the monolith rather than getting rid of it? What are your thoughts?

Naphat: [05:16] Yeah, I always believe that we should always do what's practical. And the monolith model served Dropbox really well for the past 10 years or so. It's really when Dropbox grew to 1,000 engineers and 1,000 commits per day that it really started to break down. If you're at a startup and you ever get to that point, you're already very successful. So I will say do what’s practical. If you start with microservices from the beginning, you're putting a lot of infrastructure investment upfront, whereas you could be spending that time on things like actually getting your product off the ground and making sure that you have product-market fit. So in a way, I would say the journey Dropbox went through is probably going to be very typical for most startups; maybe some startups will invest in it sooner rather than later, or they might do continuous investment over the lifecycle of the startup. It really depends on what the company's needs are and what the problems are. So there isn't a one-size-fits-all here. Once you get to a size like Dropbox, where you probably have tens to a hundred teams working on the entire monolith, that's when it starts to make sense that, okay, splitting this apart into separate entities might make more sense: you're able to push independently, you're able to not be blocked by other teams. But there's a huge spectrum in between; it’s not one or the other. Yeah.

Utsav: [06:51] So how do you tackle a problem as ambiguous as that? So you want to go from 99.9 to 99.95. And what that means, as you said, is that your downtime per month can be no more than 20 minutes. How do you unpack that? And how do you start thinking about what you need to solve?

Naphat: [07:12] Yeah, that's a good question. So we pretty much approached everything from first principles when we started looking into this project. So we put ourselves on-call, first of all, and then we started looking into the various issues that come up. We looked at historical data going back years, what some of the common past outages were, and what the themes were. And we also started talking to people who had actually been looking at this for a while, figuring out what their perspectives were and trying to get the lay of the land for ourselves. Once we had all this information, we just started assembling what we believed the biggest hotspots were. We knew what the goal was. In our mind, the goal for 2019 was always clear: stabilize this as much as possible, get it to a point where we could easily buy ourselves another year, year and a half of not even looking at this, and then figure out what the long-term path is beyond the monolith.

So with that goal in mind, in 2019, we had to stabilize it and get to the new SLA that we wanted. That allowed us to identify a bunch of problems. So one of the problems that we encountered, for example, was the push itself. When we had to do an emergency push, the push itself could easily take 40 minutes or longer, just because of how unstructured the push was. I think there were like 10-plus clusters in the same deployment, just because no one had ever invested in it. And the way we were doing the push, it actually took, I think, close to 20 minutes end to end just to do a rolling restart of the service. And nobody understood why that number was picked. And there was actually a lot of fear around changing the number. So that's one problem. We knew we had to go fix it. 

We identified a bunch of other problems. Again, the general mindset we had really was that we cannot be afraid of the monolith. If you’re going to be on-call for this, if you're going to be investing a lot of energy into this over the next year while we build the next thing, we need to get to a point where we know exactly what it’s going to do, and we know exactly where the limits are. And so I think I spent a good one to two months just poking at the service in various ways: taking away boxes, trying to drive utilization up and figuring out how it fails and how I can recover from it, figuring out how long the push takes and what exactly goes into the startup sequence, figuring out how long the build takes, [10:00] figuring out what the other failure modes are and how we prevent those failure modes to begin with. Once we had all the information, we just started getting to work. We knocked down the problems one after another and eventually got to a stable place. I think this is true with engineering problems in general. We just had to approach everything from first principles. We just had to make sure that we didn't have any biases going in, we didn't have any assumptions about how a problem should be solved, and just start. You know, you have to break some eggs to make an omelet, right? So I just started doing things, started poking at it, started seeing what it does. Figure out how to do it safely. First of all, figure out how to get a safe environment to do that in. 

For us, the way we did it was we redirected a small percentage of traffic to a small cluster, and then we only operated on that cluster. So even if that were to go down, it wouldn't be such a huge outage. And that allowed us to mimic what it would look like on a larger cluster. And that gave us a lot of confidence. So I don't know. What's your opinion on all this? I know you also came into this with a different mindset as well in 2020.

Utsav: [11:24] No, I think that makes total sense. I like the idea of splitting off a little bit of traffic to another independent cluster where you can play around and actually get confidence. Because when you're so large and all of that context has been lost, since the engineers working on it have moved on, a lot of decisions are fear-based decisions. They're not really rational, and then you kind of backfill your rationalization with, “Oh, it's always been this way, so it can probably never be better.” But you can always test things out. I think what's more interesting to me is when we, I guess, went against some industry-standard things, and I think for the better. One example is, in the industry, people talk about blue-green deployments, where you have another version of the service running in parallel, and you slowly switch over one by one. And that's basically not possible when you have like 2000 machines, or just a large cluster. And yeah, if you have a limit of being down for no more than 20 minutes, you can't wait 20 minutes for a new code push to go out, because a bad outage means you've basically blown your SLA, and you can't fix that. So I'm curious to know how you thought about fixing that, and how you basically validated that approach? Because I know that the solution to this was basically pushing all of Dropbox in parallel, which sounds super scary to me. How did you gain the confidence to be okay with that?

Naphat: [13:06] That's a good question. And let's talk about pushes in two separate categories. Let's talk about emergency pushes first, which is what we're talking about here: getting a hotfix out as quickly as possible. And this is why you had to mention our SLA. And then we'll get into how we do the regular push later on, and we'll talk about what the nuances are, what the differences between them are. At the end of the day, again, it comes down to doing what’s practical. So what do we know? We know that we have a 40-minute, now 20-minute, downtime SLA. What that means is that, for the most part, you should probably be able to push in like five to 10 minutes, or 15 at the most. And so how do you do that against 2000 machines? That completely changes how you have to think about pushes. Like you said, we are not going to be able to do any kind of meaningful Canary on all this. There's just not enough time to gather meaningful data. So what do we do? 

So we started looking into what the build and push actually do, breaking down the current timings: which of them are things that we can never change, and which are things that we can just reconfigure. So what goes into a push? When we try to ship a hotfix, we track everything from the beginning. We start from people writing code to actually fix it. Let's say that you had to do a one-line fix to something; before we started all this, it took about maybe 10 minutes [15:00] or 15 minutes to actually create the branch, get someone else to do a quick approval, actually commit it, and then make a hotfix branch. So that's usually 10 to 15 minutes lost. We then have a build step that, prior to this, took about five to 10 minutes as well. So that's, let's say, 20 to 25 minutes lost at this point. And then the push itself, in one step, took about 20 minutes without any kind of Canary. So we're at about 45 minutes, without any kind of Canary, without any kind of safety. So this seems like an impossible problem. So let's break it down. 

Why does it take 10 to 15 minutes to create a commit? It really shouldn’t. The main problem here is that we were using the same workflow for creating a hotfix commit as we were for regular code review, and they're really, really different. If you're doing a hotfix, it should either be a really small change, or it should be something that you already committed somewhere else into the main branch, where you already know from some other validation that it probably works. Anything else that needs review, you're probably not going to figure out in 20 minutes anyway, so you're not going to go through the hotfix flow. Oh, and I forgot to mention that when something goes wrong, the first thing you should do is check whether you can roll back; that actually is a lot faster. We'll get into that in a bit, and how we sped it up as well. 

[16:17] So we wrote a quick tool, just a tool that takes a commit that’s already on the main branch, creates a hotfix branch with the right configuration, and then kicks off the push. This reduces the 10 to 15 minutes of initial time to about two minutes. So okay, making good progress here. We then started looking into the build. What's going on here? And it turns out that, because of how the Build team was structured at Dropbox, a lot of the investment that we made in build speed (we have talked about this externally; we use Bazel, we actually cache our builds and all that) was limited to CI, and not to the production build.
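To make the hotfix-tool idea concrete, here is a minimal sketch of what such a tool might look like: take an already-reviewed commit from the main branch, put it on a hotfix branch cut from the release, and hand off to the push pipeline. The branch names, remote, and hand-off step are assumptions for illustration, not Dropbox's actual tooling.

```python
# Hypothetical sketch of a "hotfix from an existing main-branch commit" tool.
# Branch naming, the remote, and the push trigger are illustrative assumptions.
import subprocess
import sys

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def create_hotfix(commit_sha, release_branch="release/current"):
    # Start from the release branch that is currently in production.
    run("git", "fetch", "origin", release_branch)
    hotfix_branch = f"hotfix/{commit_sha[:8]}"
    run("git", "checkout", "-b", hotfix_branch, f"origin/{release_branch}")
    # Pull the already-reviewed fix from the main branch onto the release code.
    run("git", "cherry-pick", commit_sha)
    run("git", "push", "origin", hotfix_branch)
    # Placeholder: hand off to whatever kicks off the emergency push pipeline.
    print(f"hotfix branch {hotfix_branch} pushed; trigger the emergency push from here")

if __name__ == "__main__":
    create_hotfix(sys.argv[1])
```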

And this is not a hard problem; it's just that no one had ever thought to look into this, and no one had ever spent the time to look into this. So coming from a CI team, I knew exactly where the knobs were. And so I just talked to a few of my old teammates, including you, and we figured out how to actually speed up the production build, and we cut it down to about three to four minutes. So that's pretty good. So between this and creating a commit, we're at about five to six minutes. And that leaves us with another 14 minutes to do the push that we need to do. 

[17:42] And now let's talk about the actual push. We needed to get the push time down. Twenty minutes was never going to work. So we needed to get it down to something that is manageable and something that we believed would, in the event of a rollback, give us more than enough time to make a decision. And so we picked five minutes as our benchmark, as our target, and we wanted to get there. This just came intuitively: you have 20 minutes, and if you need to do a rollback, you kind of want some time to make a decision, and you want to have plenty of wiggle room in case something goes wrong. So let's say five minutes. We took a look at how the push was done, and really, the 20-minute push that was happening was completely artificial. There were a bunch of delays inserted into the push out of the fear that if you were to restart the servers too quickly, because this is a very large deployment, the other services that this server talks to might be overwhelmed when we re-establish the connections. It is a valid fear. It had just never actually been tested. And there are actually ways that we have set up our infrastructure such that it wouldn't have this problem. For example, we actually deploy a sidecar with each of the servers, and that reduces the number of connections to upstream services by about 64x, because we run 64 processes per box. So we had things that we knew to be true that made us believe that we should be able to handle much quicker restarts. 

[19:20] So how do we validate it? There really isn't a much better way to do this than to try it out, unfortunately. And so we made the small cluster that I was talking about earlier, because at the end of the day, there are two things we have to validate. We have to validate that the server itself is okay when it restarts that quickly, and then we have to validate that the upstream services are okay. With the server itself, you don't need a large deployment. You can validate on a small deployment and eliminate one side of the problem. And then with the upstream services, we just had to go slow. [20:00] There isn't really another way to go about this. So we just had to go slow and monitor all the services as we went. So we went from 20 minutes to 18 to 16 to 14, eventually to five. And we fixed a bunch of things along the way, of course, because issues did come up. And now we have a five-minute push. So if you look at the end to end now, we have about five to six minutes to create a commit, and then five minutes to do the push. That leaves us about nine minutes to actually do extra validation to make sure the push is actually safe. 
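Adding up the numbers quoted above makes the budget explicit. This is just the arithmetic of the conversation, using rough midpoints; none of these figures are official.

```python
# Rough end-to-end hotfix budget, in minutes, using the numbers quoted above.
SLA_BUDGET = 20  # downtime budget per month quoted above (roughly 20 minutes at 99.95%)

before = {"create commit": 12.5, "build": 7.5, "push": 20}   # ~10-15, 5-10, 20
after  = {"create commit": 2,    "build": 3.5, "push": 5}    # ~2, 3-4, 5

print("before:", sum(before.values()), "min")                 # ~40 min, well over budget
print("after: ", sum(after.values()), "min")                  # ~10-11 min
print("slack for validation:", SLA_BUDGET - sum(after.values()), "min")  # roughly 9 min
```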

[20:35] And so we started thinking, “Okay, maybe we do a very informal Canary,” where if we were to do a hotfix, we probably know exactly what we're pushing, it’s probably only one commit, or rather, we [Inaudible 20:47] so there’s only one commit. And there’s probably only one very, very distinct change. So what if we just very quickly push Canary within one or two minutes, because it's just a subset of the machines, and then just see whether, for those new boxes, you're seeing the same errors as before; we have the metrics to tell us that. And this is very different from doing a blue-green deployment, where you would have to create two clusters of machines, one with the old code, one with the new code, and then try to compare metrics between them. This is just all eyeballing, all looking at exactly the error that we know is causing the site to crash and seeing whether it's coming down. So we built that in. 

[21:37] We also built in another validation step for internal users, where we would push to a deployment that only serves internal Dropbox users. And this is optional, depending on just how quickly you want to go. And then we codified all this into a pipeline that we can just click through with buttons in the UI. At the end of the day, we got the build and push down to about 15 to 18 minutes, depending on the time of day, or depending on how lucky you are with the build, and it worked really well. And then, at that point, it became a question of, “Okay, now that we have this build time and push time down, how do we actually keep this going? And how do we make sure that things don't regress?” Because a lot of the problems that we discovered are things that will regress if you don't keep an eye on them. The build cache could easily be broken again. So we established a monthly DRT where the teams that were on-call were supposed to go and try doing the push themselves and see that it actually completes in time. And then we postmortem every DRT to make sure that, “Okay, if it didn’t complete in time, why didn't it complete in time? If you do the breakdown, where was the increase? Is it in the build? Is it in the push?” and go from there.

Utsav: [22:51] Yeah, and I think one of those things that stands out to listeners, like I know the answer, but I want to hear it again, is that if you push all of Dropbox at the same time, doesn't that mean you drop a lot of requests, like for those five minutes everything is down? And that's what people are worried about, right?

Naphat: [23:11] Right. So we are not exactly pushing everything at the same time. You should think of a push as a capacity problem. How many machines can you afford to take away at a time while still serving the website? Whatever that number is, take that many and restart those machines as quickly as possible. So that's what we did. We set the threshold for ourselves at 60%; we never want utilization to go above 60%. And just to be safe, we only took 25% at a time and actually used that for pushing. And that allowed us to push everything in four batches. And that means that each batch needs to restart as quickly as possible. It's isolated to that one machine, and so we stop traffic to that one machine and just kick it as quickly as possible. It takes about one minute per batch, one and a half minutes per batch, depending on the batch.

Utsav: [24:04] But then still, wouldn't you see like a 25% availability hit, if you're pushing 25% at the same time?

Naphat: [24:11] No, because we stop traffic to the boxes first and reroute the traffic to the other boxes. So this is why we need to turn it into a capacity problem and make sure that, okay, if you know that at any given time the site is never more than 60% utilized, you can afford to take 25% away and still have a 15% overhead.
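A quick back-of-the-envelope version of that capacity argument, using the numbers from the conversation (the 60% utilization ceiling and 25% batch size are Dropbox's choices; the rest is arithmetic):

```python
# Treating the push as a capacity problem: how big can each restart batch be?
peak_utilization = 0.60   # the site never uses more than 60% of the fleet
batch_fraction   = 0.25   # fraction of machines restarted at once

remaining_capacity = 1 - batch_fraction            # 0.75 of the fleet still serving
overhead = remaining_capacity - peak_utilization   # 0.15 of total fleet capacity to spare
per_box_utilization = peak_utilization / remaining_capacity  # ~0.80 on the surviving boxes

batches = int(round(1 / batch_fraction))           # 4 batches to cover the whole fleet
minutes_per_batch = 1.25                           # ~1 to 1.5 minutes quoted above
print(batches, "batches,", batches * minutes_per_batch, "minutes total")  # ~5 minute push
```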

Utsav: [24:30] Yeah. How do you route traffic to the other boxes? What decides that? Because it seems like this is a complicated orchestration step? 

Naphat: [24:40] It actually isn't that complicated. The infrastructure at Dropbox is very well structured in this sense. There is a global proxy; at the time, we actually wrote a few blog posts about it. There is a global proxy called Bandaid. Dropbox [25:00] was in the process of replacing it with Envoy right when you and I left. But there's a global proxy called Bandaid. This keeps track of all the monolith boxes that are up, and when I say up, I mean passing health checks. So when we go and push a box, we make sure it fails its health check right away, the global proxy kicks it out of the serving pool within five seconds, we wait for requests to drain, and then we just kick it as quickly as possible.
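A minimal sketch of the per-batch drain-and-restart dance described here. The helper functions are stubs standing in for whatever the real deploy system does; none of this is Dropbox's actual deploy API.

```python
# Sketch of one push batch: drain via health checks, then restart fast.
# The helpers below are stubs; the constants reflect the numbers quoted above.
import time

HEALTH_CHECK_EVICTION_S = 5    # proxy drops a box ~5s after its health check starts failing
DRAIN_TIMEOUT_S = 30           # give in-flight requests a bounded time to finish

def fail_health_check(box): print(f"{box}: health check now failing")
def in_flight_requests(box): return 0             # stub: ask the box how many requests remain
def restart_service(box): print(f"{box}: restarting")
def wait_until_healthy(boxes): print("waiting for proxy to re-admit", boxes)

def push_batch(boxes):
    # 1. Take the batch out of the serving pool by failing its health checks.
    for box in boxes:
        fail_health_check(box)
    time.sleep(HEALTH_CHECK_EVICTION_S)           # global proxy stops routing within ~5s

    # 2. Let in-flight requests drain, with a bounded wait.
    deadline = time.time() + DRAIN_TIMEOUT_S
    for box in boxes:
        while in_flight_requests(box) > 0 and time.time() < deadline:
            time.sleep(1)

    # 3. Restart as fast as possible; no live traffic is hitting these boxes anymore.
    for box in boxes:
        restart_service(box)
    wait_until_healthy(boxes)                     # proxy re-admits them once checks pass again

push_batch(["web-001", "web-002"])
```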

Utsav: [25:29] Okay, so it becomes this two-step dance, in the sense that the global proxy realizes that you shouldn't be sending traffic to these old boxes anymore, and it can basically reroute, and at that point you can bring up the new version. 

Naphat: [25:43] Exactly 

Utsav: [25:44] Okay. 

Naphat: [25:45] Exactly, exactly. It's not at all a hard problem. It's just a matter of, “Okay, do we have the architecture to do this or not?” And we do, with the global proxy in place.

Utsav: [25:56] Yeah, and with things like Envoy, I think you get all of this for free. You can configure it in a way that-- And I think by default, I guess once it realizes that things are failing health checks, it can kick them out of its pool. 

Naphat: [26:09] Right.

Utsav: [26:10] Yeah.

Naphat: [26:11] And this is a feature that you need anyway. You need a way to be able to dynamically change the serving set. Machines will go down. Sometimes you add emergency capacity. So I feel like you need the global proxy to have this particular feature, and it comes out of the box with Envoy, like you said. It's just a matter of how you actually send the proxy that information. At Dropbox, we did that via ZooKeeper. There are other solutions out there.

Utsav: [26:37] And can you maybe talk about any other reliability problems? So this basically helps you reduce the push time significantly, but were there some interesting and really low-hanging fruit that you're comfortable sharing? Small things that nobody had ever looked at but ended up helping with a lot of problems.

Naphat: [26:58] Yeah, let's see. It’s been so long, and I haven't actually thought about this in a while. But one thing came to mind right away, which is not at all a small one, but I think it's still funny to talk about. So as part of figuring out the push, we needed to know: what is the capacity constraint on the server? As in, what percent utilization can it go up to before it actually starts falling over? The intuitive answer is 100%, but servers, like things in life, are never that clear or that easy. And so, among the monolith on-call rotation, there was a common belief that we should never go above 50%. And this bugged the heck out of me when I joined, because it seemed like we were leaving capacity unused. But we had seen empirically that when you get to about 60, 70% on some clusters, the utilization often jumps. It jumps from 70 to 100 right away, and it starts dropping requests. We had no idea why. And so that's interesting. How do you actually debug that? Very luckily for us, there were a lot of infrastructure investments that went into Dropbox that made this kind of debugging easier. 

[28:22] So for example, just before I started working on this, Dropbox replaced its entire monitoring system with a new monitoring system that has 16-second granularity. That allowed us to get a different view into all these problems. And so it turns out that what we thought was 70% utilization was actually something that spiked to 100% every minute, and then spiked back down to around 60%, and really averaged out to about 70. So that’s the problem. So that turns out [Inaudible 00:29:02] it’s just a matter of profiling. And it turns out that that's just because of the old monitoring system that we hadn't completely shut down yet. It would wake up every minute in a synchronized manner: it would wake up on the minute, on the clock, and do a lot of work. And so the box would be entirely locked up. So one of our engineers on the team said, “Okay, you know what? Until we shut down the old monitoring system, what if we just make sure it's not synchronized, and we just splay it?” That allowed us to go to about 80, 90% utilization. That's pretty good. And that's why we kept it at about 60%. 
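The fix being described is the classic splay/jitter trick: give each box a random offset so periodic work stops lining up fleet-wide on the minute. A minimal sketch, not the actual monitoring agent:

```python
# Minimal illustration of splaying a periodic task so thousands of boxes
# don't all wake up on the same wall-clock second. Not the actual agent code.
import random
import time

PERIOD_S = 60

def run_periodic(collect_metrics):
    # Each process picks a fixed random offset once at startup, so the fleet's
    # wakeups are spread across the whole minute instead of aligned to :00.
    splay = random.uniform(0, PERIOD_S)
    time.sleep(splay)
    while True:
        start = time.time()
        collect_metrics()
        # A little per-iteration jitter too, so the fleet doesn't re-synchronize over time.
        time.sleep(max(0, PERIOD_S - (time.time() - start) + random.uniform(-1, 1)))

# Example usage:
# run_periodic(lambda: print("collecting metrics"))
```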

[29:46] So this is one of those things where we really just had to not be afraid of the system we were operating. And so the way we discovered all this is we just, again, created a small cluster and then started taking [30:00] boxes away to drive the utilization up, and we just observed the graphs. And the first time I did this, I was with another engineer, and we were both basically freaking out with each extra box we were taking away, because you never know with this particular monolith how it's going to behave, and we still didn't fully understand the problem. And we actually caused maybe one outage, but it wasn't a huge outage. But it’s fine; this allowed us to actually pin down the real cause, and then we actually fixed it. And of course, when we shut down the old monitoring system, this problem went away entirely. So yep, that's one thing. Then you might also ask why we were only at 90% utilization and why we couldn’t go to 100. That's because of quirks in how our load balancing works. Our server is pre-forked, and so because of that, our concurrency is fixed, unlike at most companies. At most companies, when they get to 100%, they can serve a little bit more; they will just slow everything down. For us, at 100%, we just start dropping requests. And unless your load balancing is completely perfect and knows about every single machine, every single process, it’s not going to achieve [Inaudible 31:14] 100%, and our load balancing isn't perfect.

Utsav: [31:19] That makes total sense. And I think that's why distributed systems engineers should remember the concept of jitter, because you never know when you'll need it, and for how long its absence will quietly waste capacity for your company. That’s one expensive mistake. I remember seeing those graphs when we shipped that change. It was so gratifying to see. And also, I wish that we had added a little bit of jitter earlier and never had to deal with this at all.

Naphat: [31:48] This is the theme of what my 2019 looked like. At the end of the year, we actually sent out a giant email with all the fixes we did. I wish I still had it so I could actually read it to you. Let's say 60% of them might have been major fixes, but 40% of them were minor fixes like this, all one-line fixes that we had just never invested enough to actually go and look at. And so it's funny, but it saved us a lot of energy, it saved us a lot of time. By the end of the year, the on-call rotation was pretty healthy. You joined in 2020, you tell me how healthy it was. But I don't think there were that many hitches by the end of it. We could actually--

Utsav: [32:37] It was surreal to me that the team in charge of pretty much the most important part of the company, in a sense, had fewer alerts than the team that ran their own CI/CD system. I think that just made me realize that a lot of it is about thinking from first principles; fundamentally, it's just a service. And the amount of time you spend on on-call toil is inversely proportional to the amount of investment you have put into quality and reliability. Things don't have to be that bad, but it just takes a little bit of time to fix and investigate.

Naphat: [33:31] Yep, for sure. And this is just the mindset that I want every infrastructure engineer to have. By the way, a lot of the things I talk about are things you would normally give to a team of SREs to solve. But of the people who worked on this with me, only one was a full-time SRE. And that's not saying anything against SREs. I'm just saying that when you are an infrastructure engineer, you need to be in a mindset where you don't divide the work completely between SRE and SWE. If you have the flexibility and you have the resources, sure. But when you don't, you need to be able to go in-depth. It goes in both directions: infrastructure SREs should be able to go and do some SWE work, and infrastructure SWEs should be able to go and investigate how servers react. And it will just allow us to build better infrastructure in turn. And so yeah, looking back, it's kind of funny; it seemed like such an insurmountable problem at the time. But at the end of the day, you know what? We just had to go fix things one by one. And now we just have good stories to tell.

Utsav: [34:33] So then all of this begs the question, now that the monolith is generally reliable in 2020, and the goal for 99.95 is right there, why is the long-term decision to move away from the monolith if it works? What is the reasoning there?

Naphat: [34:55] That's a very good question. [35:00] So what you're talking about is our 2020 project. And it's called the Atlas Project. There's a blog post about it if you Google Dropbox Atlas.

Utsav: [35:07] I can put it in the show notes for this podcast.

Naphat: [35:11] Perfect, perfect. So it is basically a rewrite of our biggest service at Dropbox. Biggest stateless service, I’d say; Dropbox also runs Magic Pocket, which is way bigger than this, but that's a stateful service. So this is a rewrite of our biggest service at Dropbox. We undertook it for two main reasons. First is developer productivity. And what I mean by that is, at some point, the monolith really starts to restrict what you can do when you have 1000 engineers contributing to it. Well, let’s say five to 600 product engineers contributing to it. What are some restrictions here? For example, you can never really know when your code is going to go out. And this is a huge problem when we do launches, because it could be that you're trying to launch a new feature at Dropbox, but someone made a change to the help page and messed that up. And we can’t launch something with a broken help page. And this kind of thing, yes, there should be tests, but not everything is tested. It's just the reality of things in software engineering. So what do you do at that point? Well, then you have to roll everything back, fix the help page, then roll back out. 

And actually, we didn't talk about the regular push process, but we have a semi blue-green deployment, similar to that. It takes about two to three hours to run. This is the thing that we use to push out thousands of commits, but we have to be more careful there, because we don’t actually know what we are pushing. So if you had to restart the entire process, you set back any launch you have by two to three hours. So this was a huge problem at Dropbox. There were other problems, of course. You never quite know if someone else is going to be doing something to the in-memory state that will corrupt your endpoint. You are not allowed to push any more frequently than whatever cadence the central team gives you, which happens to be daily. When you're doing local development, you can never really start just the specific part that you care about; you have to start the entire thing, which takes a long time to build. So the monolith itself really restricts what you can do and how productive you can be when you get to a certain size. And again, this is a huge size; I'm not saying every company should do this. We got to a few million lines of code, five to 600 engineers contributing to this, and 1000 plus endpoints, so this is an extreme scale. 

On the other front, the other reason we embarked on this project was that this particular server was built at the beginning of Dropbox. It used the latest and greatest technology at the time, but it really hasn't caught up to the rest of the company. For example, the server that we were using had no support for HTTP/2.0. And so we were, at some point, having a bug where-- It didn't have support for HTTP/1.1 or 2.0. It only supported 1.0. Our global proxy, the Bandaid proxy that we talked about earlier, and also Envoy, these proxies only support HTTP/1.1 and above. And so we had these two things talking using incompatible protocols for the longest time. For the most part it's fine, except for some parts where it's not. And so we actually had a bug for like an entire week where, in some cases, we would return a success, a 200, with an empty body, just because of [Inaudible 39:07] incompatible protocol. We could probably upgrade this Python server, but doing so requires a significant amount of work. If you're already going to go through with that amount of work, maybe we should also think about, “Okay, what else do we want to change here? What else can we do to actually move this in the direction that we all want it to move in?” 

And just for context, in all this time that the server existed at Dropbox, every other service at Dropbox had already moved to gRPC. So gRPC was a very well-supported protocol at Dropbox. It was very well tested. There was a team that actually upgraded it regularly and ran load tests on it and all that, but this thing just hadn't kept up. So we needed to find a way to get to something else that is equally well supported, which meant getting to gRPC itself. [40:00] So enter Atlas. We decided, “Okay, time to invest in this for real. We bought ourselves time in 2019. Let’s now go do some engineering work; let's figure out how to build something that we're going to be proud of.” And I took this very seriously, because back when I was still doing interviews, when I was still based in the US, every time I had to talk about the Dropbox codebase, I sort of talked around the complexity of the monolith, because we all knew just how bad it was. And when I was actually talking about it, I just had to say, “Yes, it's bad, but also look at all these other things.” I don't want to keep saying that. I just want to say, “You know what? We have a great platform for every engineer to work in, and we should all be proud to work here.” That matters a lot to morale at a company. 

Utsav: [40:51] Yeah.

Naphat: [40:53] And so we embarked on the project. We first started building a team. The team had to be almost completely rebuilt. You joined. That made my day. That made my quarter, let's say. And then we started putting together a plan. It was a collaboration between us and another infrastructure team at Dropbox. We had to do a bunch of research and put together a plan for, “Okay, what do we want the serving stack to look like? What do we want the experience to look like?” We should also get into what our plan actually was and what [Inaudible 41:29] were shipping. Do you have anything that you want to add to the backstory here before we move on?

Utsav: [41:35] No, I want to ask actually a few more probing questions. So if you look at other big companies, Slack and Shopify in particular, they seem to have worked around the problem. First of all, their codebases aren't as old as ours, so I doubt that they’re running into the kind of random bugs that we were running into, but they're still pretty big. And they seem to work around the problem by pushing 12 times a day, 14 times a day; they just push very frequently. And that also gets developers’ code out faster, so that solves one part of the developer productivity problem. I guess, why did we not pursue that? And why did we instead decide to give people the opportunity to push on their own, in a sense, or to have one part of the monolith’s code being buggy not block the other part, rather than just pushing really frequently?

Naphat: [42:30] Right. I would have loved to get to that model. And that was actually the vision that we were selling as we were selling Atlas to the rest of the company. It's just that the way the monolith was structured at Dropbox, it wasn't possible to push that many times a day. It wasn't possible to push just a component; you had to push everything. So about pushing many times a day, for example: we couldn't get the push to be reliably automated, just because of how many things and how many endpoints we’re pushing. So we built this blue-green deployment; I'm going to call it Canary Analysis, because that's what we called it at Dropbox. The way it works is that we have three clusters for the monolith. We have the Canary cluster, which receives the newest code first, and we have the control cluster, which receives the same amount of traffic as Canary but stays on the same version of code as prod. So it's the same amount of traffic, same traffic pattern, and all that. During a push, we push Canary and we kick control, so they have the same life cycle. If the Canary code is not worse than control, then all the metrics should look no worse: CPU should look no worse, memory should look no worse, everything like [Inaudible 43:44] and all that should look no worse. 

We then wrote a script that goes and looks at all these metrics after an hour and just determines whether you can actually proceed forward with the push or not. It turns out that, let's say, more than half the time the push wasn't succeeding, and so Dropbox actually had a full-time rotation for pushing this monolith. If we were already running into problems every other time, there was just no feasible way we were going to get to aggressive pushing; not even four times a day, let's say three times a day. We weren’t going to get that. It's not going to fit within the business day. Keep in mind also that the Canary Analysis itself takes about two to three hours. And the reason it took that long is, A, we wanted the metrics, but also, we made each push itself very slow just so that if there was a problem, we'd catch it in time and roll back. So because of how it's built, it's just not feasible. We would actually have to figure out how to work around all these problems, and Atlas, the project that we were building towards, will solve the problem, or will at least give us the foundation to [45:00] go forward with that eventually. I remember this because this is one of the first questions you asked me when you joined the team. And I completely agree, and I really hope that Dropbox will get there. 
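The comparison script can be as simple as checking that Canary's key metrics have not regressed relative to control beyond some tolerance. A rough sketch of that shape; the metric names, thresholds, and query helper are illustrative stand-ins, not Dropbox's actual canary analysis:

```python
# Rough shape of a canary-vs-control check: both clusters were restarted at the
# same time and receive the same traffic, so their metrics should track each other.
# The query function, sample values, and thresholds below are illustrative stand-ins.

def query_metric(cluster, name):
    # Stand-in for a real metrics query (e.g. averaged over the last hour).
    return {"canary":  {"error_rate": 0.002, "cpu": 0.55, "p95_latency_ms": 180},
            "control": {"error_rate": 0.002, "cpu": 0.52, "p95_latency_ms": 175}}[cluster][name]

TOLERANCES = {"error_rate": 1.25, "cpu": 1.15, "p95_latency_ms": 1.20}  # allowed canary/control ratio

def canary_ok():
    failures = []
    for metric, max_ratio in TOLERANCES.items():
        canary, control = query_metric("canary", metric), query_metric("control", metric)
        if control > 0 and canary / control > max_ratio:
            failures.append(f"{metric}: canary={canary} vs control={control}")
    return (len(failures) == 0), failures

ok, failures = canary_ok()
print("proceed with push" if ok else f"hold the push: {failures}")
```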

I left before the project was completely rolled out but having talked to people who are still there, they are very much on target. And going forward, they can easily then start to invest and say, “Okay, you know what? Now that push is reliable, let's push multiple times a day.”

Utsav: [45:29] Yeah, I think philosophically, the reason why people like splitting their stuff up into multiple services is basically that there's separation of concerns and all of that. But also, you get to own your push speed, in a sense. You get to push without being blocked on other people. And by breaking up the monolith into chunks, and if not letting people push on their own, at least not blocking their push on somebody else's bad code, I think that basically breaks down the problem and makes things much more sustainable for a longer period of time. Because the monolith is only going to get bigger; we're only going to get more engineers working on more products and more features and everything. So that's why I feel like it's actually a very interesting direction that we went in, and I think it's the right one. And yeah, we can always do both in parallel as well. We can also push each part of the monolith like 12 times a day and make sure that that doesn't get blocked. 

Naphat: [46:31] Yeah, but just to make sure that we don't completely give the win to microservices here, there are real problems with microservices. And Dropbox tried really hard to move to the services model a few years earlier, and we couldn't quite get there. There are real problems. Running a service is not easy. And the skillset that you would need to run a service, it's not the same skill set that you will need to write a good product or write a good feature on the website. So we're asking our engineers to spend time doing things that they're not comfortable doing. And we're asking them to be on-call for something that they are not completely comfortable operating. And from a reliability standpoint, we are now tasked with setting the standard across multiple services instead of just dealing with one team. There are real problems with microservices and so this is why we didn't-- 

[47:25] Let's talk about Atlas a little bit. So with Atlas, the next generation of Dropbox’s monolith, we sort of tried to take a hybrid approach. First, I said microservices have real problems, but our monolith is not going to scale. So what do we do? We went with the idea of a managed service. And what that means is, we keep the experience of the monolith, meaning that you still get someone else to be on-call for you (we get our team, our previous team, to be on-call), you get someone else pushing for you, and in any case, it's going to be automated. It's going to be a regular push cadence; you just have to commit your code by a certain time, and it will show up in production the next day. You're not responsible for any of that. But behind the scenes, we are splitting everything out into services, and you get most of the benefits of a microservice architecture, but you still get the experience of a monolith. There are some nuances; we do have to put in some guardrails, like, “Okay, this only works for stateless services, you're not allowed to open a background thread that goes and does stuff,” because that's going to be very hard for us to manage. You need to use the framework that we are going to be enforcing, gRPC. You need to make sure that your service is well behaved; it cannot just return errors all the time, otherwise we're going to kick you out, or we're going to yell at your team very loudly, or very politely. So there are rules you have to follow, but if you follow the rules, you get the experience, and it’s going to be easy. So that’s the whole idea with Atlas. And I'm probably not doing it justice, which is probably a good thing for Dropbox, but it's really interesting that we got there. With Atlas, it seems like the right balance. It is a huge investment, so it's not going to be right for every company. It took us a year to build it, and it's probably going to be another six months to finish the migration. But it's very interesting, and I think it is the right direction for Dropbox.

Utsav: [49:36] Yeah. I think the analogy I like to give is it's like building a Heroku or an App Engine for your internal developers. But rebuilding an App Engine or Heroku, what's the point? You could just use one of those. The idea is that you give them a bunch of features for free. So checking if their service on Heroku is reliable, we do that. We [50:00] automatically track the right metrics and make sure that your route doesn't suddenly have only 50% availability. We basically make sure of that. Making sure that your service gets pushed on time, that that push is reliable, making sure that for basic issues, operations just happen automatically. We can even automatically Canary your service: we push only 10% and we see if that has a problem. All of these infrastructural concerns that people have to think about when developing their own services, we manage all that. And in return, we just ask you to use standard frameworks like gRPC. And the way we can do this behind the scenes is that if you're using a gRPC service, we know what kind of metrics we're going to get from each method and everything. And we can use that to automatically do things like Canary Analysis. I think it's a really innovative way to think about things. Because from user research, we basically found this very obvious conclusion in retrospect - product engineers don't care that much about building out interesting infrastructure. They just want to ship features. So if you think about it from that mindset, everything that you can do to abstract out infrastructural work is okay with people, and in fact, they prefer it that way.
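One way a platform can get uniform per-method metrics "for free" from gRPC is a server interceptor, since every method flows through the same place. A hedged sketch in Python, assuming the standard grpcio API; the metrics sink here is just an in-process counter, and this is not Dropbox's Atlas code:

```python
# Sketch: a gRPC server interceptor that records per-method request and error counts,
# the kind of uniform signal a managed platform can build canary analysis on.
import collections
import grpc

REQUESTS = collections.Counter()
ERRORS = collections.Counter()

class MetricsInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        method = handler_call_details.method          # e.g. "/myapp.Files/ListFolder"
        handler = continuation(handler_call_details)
        if handler is None or not handler.unary_unary:
            return handler                            # only wrap unary-unary calls in this sketch

        inner = handler.unary_unary

        def wrapped(request, context):
            REQUESTS[method] += 1
            try:
                return inner(request, context)
            except Exception:
                ERRORS[method] += 1                   # note: misses context.abort()-style errors
                raise

        return grpc.unary_unary_rpc_method_handler(
            wrapped,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )

# Usage (with generated servicer code already registered):
# from concurrent.futures import ThreadPoolExecutor
# server = grpc.server(ThreadPoolExecutor(), interceptors=[MetricsInterceptor()])
```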

Naphat: [51:29] For sure. And this isn't specific to product engineering; every engineer thinks this way. We have a goal, we want to get to it, and everything else is in the way. How do we minimize the pain? And I really liked the way you phrased it, that we sort of provide a lot of things out of the box. I think one of the directors at Dropbox used the analogy that Atlas would be a server with batteries included in the box. You don't have to think about anything; it just works. It sounds innovative, and it probably is, but you have to give credit where it’s due. A lot of this other tooling already exists at Dropbox; we’re just packaging it together, and we're saying, “Okay, product engineers at Dropbox, here's the interface. Write your code here and you get all of this.” We’re just packaging it: you get automatic profiling, you get automatic monitoring, you get automatic alerting. It really is a pretty good experience; I kind of miss it now that I'm working at a startup. 

Utsav: [52:28] And I think the best thing is you get an automatic dashboard for your service. You build something out and you get a dashboard of every single metric that's relevant to you, at least at a very high level. You get all of the relevant exceptions. You also get auto-scaling for free, you don't have to tweak any buttons or tweak any configuration to do that. We automatically manage that. And that's the reason why we enforce constraints like stateless because if you have a stateful service, auto-scaling and everything is a little funky. So for most use cases, an experience like that makes sense. And I think it really has to do with the shape of the company. There are some companies where they have a few endpoints, and they have a lot of stuff going on behind the scenes. But with the shape of Dropbox, you have basically like 1000 plus endpoints doing a little bit of slightly different business logic and then there's the core file system and sync flow. So for all of the users and all of the engineers working on these different endpoints, something like Atlas just makes sense for them.

Naphat: [53:40] Right. And there's a real question here about whether we should have fixed this a different way. Like, for example, should we have changed how we provide endpoints at Dropbox? Could we have benefited from a structure like GraphQL, for example, and not had to worry so much about the backend? It's a real question. Realistically, at a company this big, with a lot of existing legacy code, we had to make a choice that is practical. And I keep coming back to this: we had to make something that we knew we could migrate the rest of the things to, but that still reasonably serves the needs we have. And this was actually one of the requirements for shipping Atlas. We're not going to build and maintain two systems. The old system has to die. And that constrained a lot of how we were thinking about it, some in good ways, and some not in good ways, but that is just the way things are. 

Someone I respect very deeply said this to me recently: “Don't try to boil the ocean.” We just try to take things one step at a time and move things in the right direction. And just make sure that, as long as we can articulate what we want this to look like two to three years out, and we are okay with that, that's probably a [55:00] good enough direction. 

Utsav: [55:03] I think, yeah, often what you see in companies, especially larger ones, when you have to do migrations, is that people keep the old system around for a while because it's not possible to migrate everything. But also, you impose a lot of cost on the teams when you're doing that migration. You make them do a lot of work to fit the new interface. Now, the interesting thing about the Atlas project, which we've written about in the blog post, is the fact that it was meant to be a zero-cost migration: engineers don't have to do some amount of work per existing piece of functionality in order to get stuff done. Of course, it wasn't perfect, but that was the plan, that we could automatically migrate everyone. I think it's a great constraint, and I love the fact that we did that, but why do you think that was super important for us to do?

Naphat: [55:56] Yeah, again, I think we have been burned so many times at Dropbox by tech debt and incomplete migrations, and so we benefited from all these past experiences. And we knew that, okay, if we're going to do this project correctly, we need to complete the migration. And so we need to define a system that will allow us to easily complete the migration, and we need to assemble a team that will help us complete the migration. So the team that we put together is probably responsible for a good majority of the migrations at Dropbox in the past. And so there are the experiences that each person brings. I've seen one teammate, for example, write some code that parses the Python abstract syntax tree and then just changes some of the APIs around for our migration. I was in awe of the solution. I didn't think it would be that easy, but he did it in a 50-line Python file. So I think that it's a very ambitious goal to have, and you need to set it at the beginning; you need to know that you're going to be designing for this so that you do everything the right way to accommodate it. But now that you have it, and now that you're doing it, by the end of it you get to reap all the benefits. We actually get to go and kill this legacy server, for example, and delete it from our codebase. We get to assume that everything at Dropbox is gRPC. That is a huge, huge thing to assume. We get to assume that everything at Dropbox will emit the same metrics and all that, even if it doesn't all behave the same way. 
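For flavor, here is roughly what an AST-based rewrite of that kind looks like in a few dozen lines of Python. The old and new API names are made up for illustration, and ast.unparse (Python 3.9+) drops comments and formatting, so a real migration tool would likely use a concrete-syntax-tree library instead; this is not the script described above.

```python
# Hedged sketch of an AST-based codemod: mechanically rename a legacy API call
# across source files. The old/new names below are hypothetical.
import ast
import sys

OLD_NAME = "legacy_rpc_call"      # hypothetical old helper
NEW_NAME = "atlas_grpc_call"      # hypothetical replacement

class RenameCalls(ast.NodeTransformer):
    def visit_Name(self, node):
        # Rewrite bare references to the old helper, e.g. legacy_rpc_call(...).
        if node.id == OLD_NAME:
            return ast.copy_location(ast.Name(id=NEW_NAME, ctx=node.ctx), node)
        return node

def rewrite_file(path):
    source = open(path).read()
    tree = ast.parse(source)
    new_tree = ast.fix_missing_locations(RenameCalls().visit(tree))
    open(path, "w").write(ast.unparse(new_tree))   # note: drops comments and formatting

if __name__ == "__main__":
    for path in sys.argv[1:]:
        rewrite_file(path)
```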

But I think it's very satisfying to look back on. I am really glad that we took this time. And we’re patting ourselves on the back a lot, but it's not like this project went completely smoothly. There were real problems that came up, and so we had to go back to the drawing board a few times; we had to make sure that we actually satisfied a real customer need. And by customers, I mean Dropbox engineers, our real customers while we were building this platform. But I think we got there in the end. It took a few iterations, but we got there.

Utsav: [58:32] Yeah, and maybe just to close this up, do you have any final thoughts or takeaways for listeners who are thinking about infrastructure and developer productivity in general? The interesting part organizationally, to me, was that our team was not only responsible for keeping the monolith up, so its reliability, but also for the productivity of engineers. And maybe you can talk about why that is so important? Because if you focus only on one side, you might end up over-correcting.

Naphat: [59:05] For sure. I think one thing I want to focus on is how important it is to have empathy. You really need to be able to put yourself in your customers’ shoes, and if you can't, you need to figure out how to get there, and figure out, “Okay, what exactly do they care about? What problems are they facing? And are you solving the right problems for them?” You're going to have your own goals. You want to create some infrastructure. Can you marry the two goals? And where you can't, how do you make a trade-off? And how do you explain it? That's going to go a long way in a migration like this, because you’re going to need to be able to explain, “Okay, why are you changing my workflow? Why are you making me write [Inaudible 59:44]” You can then tell a whole story about, “Okay, this is how your life is going to get better, there are going to be faster pushes, you're going to get these faster workflows. But first, you have to do this,” and that's your story. That's a lot better, but it requires [01:00:00] trust to get there. 

So I would say that's probably the most important thing to keep in mind when building infrastructure, or really building anything: make sure you know who your customers are, and make sure you have an open line of communication to them. There were a few product engineers at Dropbox that I met with weekly, just to get their opinions and bring them into the early stages [Inaudible 01:00:25]. And it was great. They sort of became champions of the project themselves and started championing it to their own teams. It’s a very amazing thing to watch. But it really comes down to that. Just have empathy for your customers, figure out what they want. And you really can only do that if you own both sides of the equation. If I only own reliability, I have no incentive to go talk with customers. If I only own workflows, I have no leverage to pull on reliability. So you really need to own both sides of the equation. Once the project matures, and you don't really need to invest in it anymore, you can talk about having a different structure for the team. But given how quickly we were moving, there was really only one way this could have gone.

Utsav: [01:01:16] Sounds good. And yeah, I think this has been a lot of fun reliving our previous work. And it's been really exciting to be part of that project, just to see how it's gone through and seeing what the impact is. And hopefully, in our careers, we can keep doing impactful projects like that, is the way I think about it.

Naphat: [01:01:39] That is the hope, but really, just getting to do this is a privilege in its own right. If I never get to do another large project, I mean, I'll be a bit sad, but I also won't be completely unsatisfied. Yeah, I'm really glad that we got to have this experience. And I'm really glad for all the things that we got to solve along the way, all the people we got to meet, all the relationships we built, and now all the memories we have. And it just really energizes us for the next thing, I think. 

Utsav: [01:02:15] Yeah. And yeah, thank you for being a guest. I think this was a lot of fun. 

Naphat: [01:02:20] Of course, of course. This is a lot of fun. We should do this again sometime.

Utsav: [01:02:26] We have a lot more stories like this.
