Tammy Butow is a Principal SRE at Gremlin, an enterprise Chaos Engineering platform that makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust. She’s also the co-founder of Girl Geek Academy, an organization that encourages women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and DigitalOcean.
In this episode, we talk about reliability engineering and Chaos Engineering. We talk about the growing trend of outages across the internet and their underlying reasons. We explore common themes in outages, like marketing events and lack of budgets/planning, the impact of such outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.
01:00 - Starting as the seventh employee at Gremlin
04:00 - An analysis of recent outages and their root causes
09:00 - A mindset shift on software reliability
14:00 - If you’re suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?
25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?
30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear “chaos” or “failure as a service”?
43:00 - Regression testing failure in CI/CD
51:00 - Trends of interest in Chaos Engineering over time