Software at Scale 2 - Christine Dodrill: ex-SRE, Lightspeed

Dec 07, 2020

This episode contains an interview with Christine Dodrill, ex Senior Site Reliability Expert at Lightspeed.

We discuss Kubernetes, Spectre/Meltdown, configuration languages, a controversial testing philosophy, autoscaling (auto-failing), technical problems vs social problems, monoliths, Conway’s Law and Canada.

Listen on Apple Podcasts or Spotify.

Highlights

Notes are italicized.

5:34 - Stack Overflow might become actively harmful if you’re working on WebAssembly or something sufficiently niche.

7:40 - Spectre/Meltdown caused a 20-40% slowdown at one workplace, which led to some interesting projects, like the aforementioned WebAssembly work.

11:30 - What’s up with the title Senior Site Reliability Expert?

Apparently, you can’t call yourself an “engineer” in Canada without going through some kind of process, one that software developers don’t need to bother with.

14:40 - It’s questionable how much software developers are “engineers” in the first place.

16:52 - YAML allows 8 values for boolean true and false, such as “no” and “on”, which conflict with the ISO country code for Norway (NO) and the province code for Ontario (ON). Maybe Starlark is an answer. Christine uses Dhall with promising results.

Dhall looks like Haskell. It has a strong type system, with variables, functions, and imports, but is otherwise a config language. There is a Kubernetes package for it.
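The YAML boolean pitfall from 16:52 can be illustrated with a small sketch. This is not a real YAML parser: it only mimics YAML 1.1-style implicit boolean resolution for the eight words mentioned above, and the function and variable names are hypothetical. Real parsers differ in exactly which spellings they accept.

```python
# Illustrative sketch only: mimics YAML 1.1-style implicit booleans.
# It is not a real YAML parser, and all names here are hypothetical.
_TRUE = ("y", "yes", "true", "on")
_FALSE = ("n", "no", "false", "off")


def _spellings(words):
    # Accept lower-case, Capitalized, and UPPER-CASE forms of each word.
    return {form for w in words for form in (w, w.capitalize(), w.upper())}


TRUE_SPELLINGS = _spellings(_TRUE)
FALSE_SPELLINGS = _spellings(_FALSE)


def resolve_scalar(text, quoted=False):
    """Return True/False for an unquoted boolean-like scalar, else the text itself."""
    if not quoted:
        if text in TRUE_SPELLINGS:
            return True
        if text in FALSE_SPELLINGS:
            return False
    return text


country_codes = ["SE", "NO", "DK"]
print([resolve_scalar(c) for c in country_codes])  # ['SE', False, 'DK']
print(resolve_scalar("ON"))                        # True
print(resolve_scalar("NO", quoted=True))           # NO
```

Quoting the value is the usual workaround in YAML. A typed config language sidesteps the problem because a string field can never silently become a boolean.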

20:30 - Nix the language. An example to configure a website.

24:00 - Experiences building internal tools that interact with other internal tools. Developers tend to have strange environments. One developer kept source code only on a thumb drive, which caused a few issues.

27:20 - Compliance requirements can be a useful way to stop developers from committing security snafus.

29:30 - Experience with Kubernetes.

30:40 - Kubernetes Autoscaling out of the box is a great way to cause downtime. Experiences on the Metrics team at Heroku which worked on autoscaling. Most applications tend to be I/O bound to the database, so autoscaling tends to become “auto-failing” and cause more problems than it solves.
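The “auto-failing” argument can be sketched with some back-of-the-envelope arithmetic. The model below is a deliberate simplification, and every number and name in it is a hypothetical assumption: each app instance can serve a fixed request rate and opens a fixed-size connection pool, while a single shared database caps both total throughput and total connections.

```python
# Toy model, not a benchmark: all names and numbers are hypothetical.

def effective_throughput(instances, per_instance_rps, db_max_rps):
    """Requests/sec actually served: app capacity capped by the shared database."""
    return min(instances * per_instance_rps, db_max_rps)


def connections_over_limit(instances, pool_size, db_max_connections):
    """Connections the fleet wants beyond what the database allows."""
    return max(0, instances * pool_size - db_max_connections)


for n in (2, 4, 8, 16):
    served = effective_throughput(n, per_instance_rps=100, db_max_rps=300)
    excess = connections_over_limit(n, pool_size=20, db_max_connections=100)
    print(f"instances={n:2d} served_rps={served} excess_connections={excess}")
```

Under these assumptions, throughput stops improving past 3 instances, while scaling to 8 or 16 instances only adds connection pressure on the database. That is the sense in which scaling out an I/O-bound application can cause more problems than it solves.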

37:00 - PostgreSQL, PgBouncer, and Transaction ID wraparound. External postmortem.

40:48 - “A lot of document databases are solutions looking for problems”

45:30 - “Continuous Deployment can be a double-edged sword”

47:20 - “A lot of unit testing methodology I’ve seen is kind of fundamentally wrong. A fake version of the world will only let you see how fake your world is”.

53:30 - Experience with tiered deployments - stage, QA, and production.

58:40 - Exploring the model where product engineers only build features, and SREs focus on reliability.

A conclusion is that some governance is probably required to prevent a complexity explosion.

60:30 - Monoliths are pretty great. Eventually, Conway’s Law takes effect. Incongruities in products or APIs often reflect team boundaries.

68:00 - Buzzwords at big companies.