Welcome to the first Software at Scale podcast. This episode contains an interview with Alexey Ivanov, Principal Engineer, Infrastructure at Dropbox.
The motivation for yet another software podcast is to let software builders share technical decisions, opinions, and stories in an informal way. Personal blogs and corporate engineering blogs are extremely informative, but often require high activation energy to be published. This podcast instead tries to replicate bar conversations with grizzled senior engineers reminiscing about horrors of their systems and what they’ve learnt over the years.
In this podcast, we discuss object storage, load balancing, build systems like Bazel, Nginx/Envoy config management, monoliths, services, gRPC, and more.
Highlights
Notes are italicized.
0:00 - Introduction
2:20 - Experience working on Object Storage at Yandex in 2012
3:55 - LevelDB wasn’t efficient enough for avatar storage, presumably due to record size. The failure mode was memory consumption, and LevelDB didn’t seem to work well on spinning drives at the time, so they built a custom storage backend.
5:55 - RocksDB/WiredTiger might be more appropriate for such a use case today. In general, it now makes sense to use off-the-shelf components unless the work involves the core of the business and requires innovation. Other examples - Figma and browser-based multiplayer design, and Dropbox with Magic Pocket.
8:55 - Experience working on Server Team at Dropbox in 2015. Teams were fairly broad (Server Team, Client Team), and a new Systems Engineering team was created as a lower layer for Server Team that focused on the edge network and runtime concepts like service discovery. Service Discovery at Dropbox today is fairly sophisticated.
10:45 - Dropbox’s stack in 2015. Stateless systems weren’t as mature as the stateful ones, and there might have been a little duct tape involved. David Mah’s talk on securing user data at SRECon is worth listening to.
12:53 - Initial DNS-based service discovery, and Nginx config management and generation via Python and Jinja2.
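To make the 12:53 note concrete, here is a minimal sketch of template-driven Nginx config generation. It is not Dropbox's actual tooling: the standard library's `string.Template` stands in for Jinja2 so the example has no third-party dependency, and the `render_upstream` helper, the upstream name, and the backend addresses are all hypothetical.

```python
from string import Template

# Hypothetical template for one Nginx upstream block.
# string.Template is a stand-in for Jinja2 here.
UPSTREAM = Template("""upstream $name {
$servers
}
""")


def render_upstream(name, backends):
    """Render an upstream block from a list of (host, port) pairs."""
    if not backends:
        # An upstream block without servers is rejected by Nginx,
        # so fail at generation time instead of at reload time.
        raise ValueError(f"upstream {name!r} has no backends")
    servers = "\n".join(f"    server {host}:{port};" for host, port in backends)
    return UPSTREAM.substitute(name=name, servers=servers)


if __name__ == "__main__":
    print(render_upstream("api", [("10.0.0.1", 8080), ("10.0.0.2", 8080)]))
```

Generating the file from data, rather than editing it by hand, lets checks like the empty-backend guard run before a config ever reaches a proxy.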
15:09 - A DNS outage story.
17:30 - Monoliths aren’t that bad, and many successful businesses start off with monoliths.
17:51 - Dropbox doesn’t use the term “microservice”, nor does it encourage too many tiny services. Services shouldn’t be too small or too big.
20:10 - How to reasonably manage configs for Envoy - learnings from Nginx config generation. Object-oriented Python that generates a protobuf config provides a declarative and reusable config format. Materializing to protobuf helps avoid a whole class of bugs, such as an unquoted “no” in YAML being parsed as the boolean false.
23:05 - “Config languages eventually converge to Turing Complete languages”. Pick the language based on which engineers will need to edit the configs most often. Concretely - Python for Traffic engineers who need configurability, and a stripped-down YAML format for product engineers who need to perform small, scoped tasks like adding new routes.
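As a rough illustration of the 20:10 point, the sketch below builds a config from typed Python objects and only then serializes it. It is an assumption-laden stand-in, not the approach described in the episode: dataclasses plus JSON replace the protobuf schema, and the `Route`, `Listener`, and `render` names, along with the example values, are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    cluster: str
    timeout_s: float = 15.0

    def __post_init__(self):
        # A YAML 1.1 loader turns an unquoted `no` into the boolean False.
        # Checking types at construction time surfaces that immediately.
        for name in ("prefix", "cluster"):
            value = getattr(self, name)
            if not isinstance(value, str):
                raise TypeError(f"{name} must be a str, got {type(value).__name__}")
        if not self.prefix.startswith("/"):
            raise ValueError(f"prefix must start with '/': {self.prefix!r}")


@dataclass(frozen=True)
class Listener:
    name: str
    port: int
    routes: tuple = ()


def render(listener):
    """Materialize the typed objects into a serialized config (JSON here)."""
    return json.dumps(asdict(listener), indent=2, sort_keys=True)


if __name__ == "__main__":
    edge = Listener(name="edge", port=443, routes=(Route("/api", "api_backend"),))
    print(render(edge))
```

A value such as `False` in place of a cluster name is rejected when the `Route` is constructed, which is the kind of early failure a schema-backed format gives over free-form YAML.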
24:50 - What sparked the Nginx to Envoy migration? The killer feature was the community and the ability to participate in the development process. Shout out to Matt Klein, the leader of the Envoy project, for fostering an inclusive community.
28:30 - Envoy aligns with the Google way of development that Dropbox adopted - it works well with gRPC and builds with Bazel. gRPC at Dropbox; Bazel at Dropbox.
29:10 - Initially, Bazel was an unpopular decision at Dropbox, but it ended up as one of the best decisions made. The hermeticity guarantees and the build graph are extremely useful features for incremental builds and tests, tracking down dependencies and keeping deployments (and the deployment system) simple.
“Bazel is like a sewer. You get out of it what you put into it” - Mike Solomon
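The incremental-testing benefit mentioned at 29:10 comes from walking the build graph backwards from whatever changed. The sketch below shows the idea on a toy graph; it is not Bazel's implementation, and the `affected_targets` function and the target labels are hypothetical.

```python
from collections import deque


def affected_targets(deps, changed):
    """Return every target that transitively depends on a changed target.

    deps maps each target to the list of targets it depends on directly.
    """
    # Invert the edges: for each target, who depends on it?
    rdeps = {}
    for target, direct in deps.items():
        rdeps.setdefault(target, set())
        for dep in direct:
            rdeps.setdefault(dep, set()).add(target)

    # Breadth-first walk over reverse dependencies.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for parent in rdeps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    graph = {
        "//svc:api": ["//lib:auth"],
        "//svc:api_test": ["//svc:api"],
        "//svc:web": ["//lib:ui"],
    }
    print(sorted(affected_targets(graph, ["//lib:auth"])))
```

With this toy graph, a change to `//lib:auth` touches `//svc:api` and `//svc:api_test` but leaves `//svc:web` alone, so only the first two need rebuilding and retesting. Bazel exposes a comparable query through the `rdeps` function of `bazel query`.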
32:10 - How should someone decide whether Bazel is the right choice for their company? It’s definitely not for everyone. When it starts becoming painful to manage multi-language builds, it might become worth it.
36:00 - The “Google” way of development (monorepo, one build system) is very different from the “Amazon” style of team independence in all software decisions. Which approach is better, and why?
The Google style seems to work well for midsize companies that don’t have infinite resources to paper over the inefficiencies and duplication in the Amazon development style. Somewhat alternative opinion: the Google Platform Rant.
40:20 - Why did Dropbox decide to build their own monitoring system as recently as 2019 instead of using something off the shelf?
The answer is mostly cost efficiency. The volume of metrics logged would make an external solution prohibitively expensive. Magic Pocket probably logs a lot of metrics.
44:00 - What’s a project that you’re most proud of?
The transition from systems engineers just managing Nginx configs to rolling out Dropbox’s edge network and the first ten Points of Presence was awesome. That work hit the sweet spot of technical innovation and direct, measurable improvement for users.