Software at Scale 3 - Bharat Mediratta: ex-CTO, Dropbox
Bharat Mediratta was a Distinguished Engineer at Google, CTO at Altschool, and CTO at Dropbox. At Google, he worked on GWS (Google Web Server), a system that I’ve always been curious about, especially since its Wikipedia entry calls it “one of the most guarded components of Google's infrastructure”.
In this podcast, we discuss GWS, bootstrapping a culture of testing at Google, breaking up services to be more manageable, monorepos, build systems, the ethics of software at scale, and more. We talked for almost an hour and a half, and didn’t even manage to cover his experiences at Altschool or Dropbox (which will hopefully be covered in a follow-up).
Listen on Apple Podcasts or Spotify.
Notes are italicized.
0:20 - Background - Childhood interests in technology. His dad was a director at ADE, India. His dad recruited APJ Abdul Kalam, arguably one of India’s most popular Presidents, and kick-started Kalam’s career.
6:10 - Studying tech in university. Guru Meditation errors.
10:50 - Working at Sun Microsystems as a first job.
12:30 - Transitioning from being a programmer to a leader, and thinking about project plans and deadlines.
14:15 - Working on side projects for the company (a potential inspiration for 20% projects at Google?)
15:30 - Moving from Sun to a few startups to Google. How did 20% projects start?
16:50 - Google News as a 20% project. Apparently, the 20% project concept has its own Wikipedia page.
18:24 - Did 20% time require management approval?
19:30 - TK (Thomas Kurian) at Google Cloud, and how the management model compares to early Google.
21:00 - Declining an offer from Google in 2002, and going to VA Linux instead.
22:28 - Growth at Google from 2004 onwards.
24:28 - Hiring at Google at that time. “A players hire A players, B players hire C players”.
24:55 - Culture Fit (indoctrination)? Two weeks of “fairly intense education”, a Noogler project, and a general investment of time and money to help explain the Google way of doing things. It wasn’t accidental. I went through this in 2016 and definitely learnt a bunch, especially from an intriguing talk called “Life of a Query”.
27:22 - Culturally integrating acquisitions successfully. YouTube as an example.
28:40 - Differences between Google and YouTube, and other acquisitions like Motorola Mobility.
30:20 - Search/Google Web Server (GWS) only had 3 nines of availability? The difference between a forager and a refiner (in terms of programming).
31:15 - What was GWS? The server responsible for the Google homepage and Google Search.
32:20 - There was only one infrastructure engineer on GWS at the time (who wanted to switch), but about a hundred engineers made changes to it every week.
33:10 - Starting with writing unit tests for this system.
33:40 - “They” used to call GWS “the neck of Google”. Extremely critical, but also extremely fragile. Search results and 98% of revenue came through this system. One second of downtime implied revenue loss. Rewriting was infeasible.
34:50 - How to use unit tests to create a culture of shared understanding. Bharat released a manifesto that basically said “all changes to GWS required unit tests”. This caused massive consternation at the time.
36:10 - A quick example of how to enforce unit tests on new code. If an engineer didn’t add a new unit test, Bharat would write the test for them, and it would often fail due to a bug in the engineer’s code. This led to a culture where engineers realized the value of writing these tests (and implicitly
39:23 - New Googlers were taught to write unit tests, so that new engineers would spread a culture of writing tests. “Oh, everyone writes unit tests at Google”.
41:50 - “What kind of features were those hundreds of engineers adding to GWS?” An example - searching for a UPS tracking number automatically showed you UPS tracking results. These were all quiet launches.
Some of the software design around experimentation towards Google search might have influenced Optimizely’s design.
45:00 - Google’s search page in 2007 was pure HTML. In 2009, it was completely AJAX based. This was a massive shift that happened transparently for users.
46:00 - “We wanted Search to be a utility. We wanted it to be the air you breathe. You don’t turn on the faucet and worry that water doesn’t come out.”
47:40 - The evolution of GWS’s architecture. Initially, it was very monolithic: GWS would talk to the indices, get results, rank them, and send back HTML. This was eventually broken into layers. Each layer had its own responsibility, and the plan was to stick to that.
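To make the layering idea concrete, here is a minimal sketch of a search front end split into retrieval, ranking, and rendering layers. Everything in it (the toy `INDEX`, the function names, the scoring) is hypothetical and chosen for illustration; it is not how GWS was actually built.

```python
# Hypothetical sketch: names and data are illustrative only.
from html import escape

# Toy inverted index: term -> {url: term weight}
INDEX = {
    "cat": {"a.example/cats": 2.0, "b.example/pets": 1.0},
    "food": {"b.example/pets": 3.0, "c.example/recipes": 1.5},
}


def retrieve(query, index=INDEX):
    """Index layer: return {url: [weights]} for URLs matching any query term."""
    hits = {}
    for term in query.lower().split():
        for url, weight in index.get(term, {}).items():
            hits.setdefault(url, []).append(weight)
    return hits


def rank(hits):
    """Ranking layer: order URLs by summed weight, best first (ties by URL)."""
    return sorted(hits, key=lambda url: (-sum(hits[url]), url))


def render(query, urls):
    """Presentation layer: turn the ranked URLs into an HTML fragment."""
    items = "".join(f"<li>{escape(u)}</li>" for u in urls)
    return f"<h1>{escape(query)}</h1><ol>{items}</ol>"


def serve(query):
    """Front end: compose the layers without knowing their internals."""
    return render(query, rank(retrieve(query)))
```

Because `serve` only composes the three functions, any one layer (for example, `rank`) could in principle be moved behind a separate service without touching the others.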
The number one query at Google at the time was “Yahoo” - a navigational search query.
50:00 - Google Instant was rolled out in 2010. Internally, this was called “Google Psychic”, because it was pretty good at predicting what users wanted to search.
51:50 - “A rewrite would have been a disaster”. GWS was essentially refactored from inside out every 18 months for 11 years. The first refactor was breaking ranking out of GWS into another service.
57:00 - YouTube knew that if it convinced enough people to get better internet, Google would make more revenue.
59:00 - Search grew from 500-1000 people in 2004, to 3000 people in 2010.
59:30 - How exactly did search ranking work, technically and organizationally? The Long Click.
61:40 - Google ran 20+ experiments to figure out the best shade of blue on the Search page. This might seem silly, but it helps at scale, since it could potentially find the shade that would help the most colorblind individuals.
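As a rough illustration of how an experiment like this can be run, the sketch below assigns each user to one variant deterministically by hashing, then compares click-through rates per variant. The function names, the experiment name, and the event format are assumptions made for this example; the episode does not describe Google's actual experiment framework.

```python
# Hypothetical sketch of A/B-style variant assignment; not Google's system.
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def assign_variant(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Map a user to a variant; the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def click_through_rate(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Given (variant, clicked) pairs, return the click rate for each variant."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for variant, clicked in events:
        shown[variant] += 1
        clicks[variant] += int(clicked)
    return {variant: clicks[variant] / shown[variant] for variant in shown}


if __name__ == "__main__":
    shades = ["blue-a", "blue-b", "blue-c"]  # placeholder variant names
    print(assign_variant("user-42", "link-color", shades))
```

Hashing the experiment name together with the user ID keeps assignments stable across requests while letting different experiments split users independently.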
67:50 - Hate speech from Google search, and the ethical quandaries around building a humanity scale system.
70:30 - Improving iteration speed and developer productivity for these systems.
71:50 - Google had an ML model for search results back in 2004 that was competitive with the hand-built systems, but didn’t end up using it, due to the lack of understandability. This has definitely changed now. I had read that document during my internship, but was surprised to learn that Google had a working ML model for ranking since 2004.
73:30 - Service Oriented Architecture at Google. It enabled GWS to move from C to C++ and divest itself of some responsibilities. But Google stuck with a monorepo, unlike Amazon.
76:40 - Components in the Monorepo + Blaze (Bazel) helped Google scale its builds and improve iteration speed. Components is the most interesting piece, since to my understanding, it hasn’t been written about much externally.
78:00 - The scale and complexity of the monorepo.
79:40 - The 400,000 line Makefile, and the start of Blaze.
82:00 - What were the benefits of “Components”?
84:00 - The project to multi-thread GWS, when it was serving 5 - 10 billion search queries a day. It started off as a practical joke.
91:00 - It’s rarely only about the technology. It’s about culture and team cohesion.