Software at Scale 59 - Incident Management with Nora Jones

Software at Scale is where we discuss the technical stories behind large software applications.
Nora is the CEO and co-founder of Jeli, an incident management platform.

Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli.

Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes of software failures and underscored the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. Whereas the traditional blameless approach seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them rather than assigning fault. This approach encourages open discussion of incidents, creates a sense of safety, and leads to better long-term outcomes.

We also discuss chaos engineering, the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach lets teams experiment and build the skills they need to respond effectively to incidents.

Nora then introduces Jeli, an incident management platform that prioritizes the human aspects of incidents. Unlike other platforms that concentrate solely on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.

We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.