Why you should frequently turn down ~30% of canary instances

The most effective (if scary) way to understand how your stateless service operates under load

If you’re building a production service that handles a large number of requests per second with high availability, it’s wise to understand how the service behaves when it’s overloaded. For example, with Python web servers, the scaling limits are often due to high memory usage: CPU utilization is generally low, and overloads manifest as a higher rate of OOMs. This awareness can be crucial during an emergency.

However, it’s often unclear how to understand the scalability limits of a service. A common approach is to generate synthetic load via a script, which is most useful when the underlying service provides a relatively simple API (like Get, Put) and will behave similarly under production traffic. But this approach is not suitable if your service serves a large number of endpoints, has non-trivial business logic, or has a fair number of dependencies that can affect availability. For example, a load testing script might not trigger an expensive case or an RPC to a fragile dependency. Custom load testing scripts also add maintenance toil, and load testing tends to become a one-off process rather than an easily repeatable one.

An alternate approach is the Utilization DRT (Disaster Recovery Test).

Utilization DRT

The goals are:

  • Estimate how much headroom your service has in practice

  • Surface anomalies in your service’s behavior when it’s overloaded

To run a production service reliably, it’s important to understand its headroom - what percentage of traffic beyond the current peak it can handle without falling over.

Service owners should estimate a back-of-the-napkin desired headroom percentage based on the service’s availability SLA, cost budget, traffic patterns, and auto-scaling efficiency. For me, it’s generally 30 - 40% for a service that has to handle three 9s or more. Then one can use the utilization DRT to estimate the actual headroom so that capacity (or minimum autoscaled capacity) can be provisioned correctly.

Additionally, services tend to display unusual behavior when under heavy load - garbage collection pausing too often is a standard problem for JVM-based servers. Understanding this and tuning servers under controlled conditions is important so that they don’t fall over on high-stress days due to unforeseen issues.


The steps are:

  • One time - Create a dedicated canary cluster that is X% (usually 10%) of your production cluster

  • One time - Configure a request proxy (Envoy / Amazon API Gateway) to send X% of traffic to the canary cluster

  • One time - Set up a canary dashboard that tracks requests per second (RPS), number of available processes serving requests, and availability percentage

  • Turn down instances in the canary cluster one-by-one until it’s clear that with one more instance going down, users will see an availability hit
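The turn-down loop in the last step can be sketched in Python. The hooks here (`turn_down`, `turn_up`, `availability_ok`) are hypothetical: in a real run they would stop or restart an instance via your orchestrator and read the canary dashboard’s availability metric. Below they are stubbed with a simple capacity simulation so the control flow is visible.

```python
def run_utilization_drt(instances, turn_down, turn_up, availability_ok):
    """Turn down canary instances one-by-one, stopping (and rolling back
    the last turn-down) as soon as availability would take a hit.
    Returns the list of instances that could safely be turned down."""
    drained = []
    for instance in instances:
        turn_down(instance)
        if not availability_ok():
            turn_up(instance)  # roll back the instance that tipped us over
            break
        drained.append(instance)
    return drained


# Stubbed dry-run: a 10-pod canary serving ~100 RPS, with a hypothetical
# capacity of ~17 RPS per pod. "Availability is OK" as long as the pods
# still up can cover the canary's share of traffic.
canary = [f"pod-{i}" for i in range(10)]
down = set()
CANARY_RPS, RPS_PER_POD = 100, 17
ok = lambda: (len(canary) - len(down)) * RPS_PER_POD >= CANARY_RPS
drained = run_utilization_drt(canary, down.add, down.discard, ok)
print(drained)  # 4 pods could be turned down before availability dipped
```

In practice the interesting part is what replaces the stubs: `availability_ok` should read the same signal your users experience (e.g. the proxy’s success rate), not an internal metric that can look healthy while retries mask failures.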

This is similar in appearance to a “Chaos Monkey” test - but with a clear goal in mind: to understand the behavior of your cluster in practice. There are several advantages to performing these tests.


It mirrors production traffic

This is the key advantage - it allows one to test complicated web servers without hand-written load generation scripts.

It’s relatively simple to set up

A request proxy and an isolated canary cluster are independently useful to have set up. After that, all you need is a script that turns off web server processes (or, more commonly, proxy sidecar processes) and a script to turn them back on when it’s time to roll back the change. This also means it’s easy to set up automation to repeat the process.

It’s easy to estimate overall cluster throughput

Let’s say that a service’s peak load is 1,000 RPS, and it’s provisioned with 100 pods, each running a single server process that can handle one request at a time. One can set aside 10 pods as the canary deployment and turn them down until the request proxy has to retry a large number of requests to maintain the appearance of reliability to clients. In this example, let’s say the tipping point came after turning down 4 pods. We can then justifiably conclude that the overall cluster has 40% headroom for today’s load.
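The arithmetic from this example is worth writing out: the headroom fraction measured on the canary extrapolates to the full cluster, because the canary receives a proportional slice of traffic.

```python
# Back-of-the-napkin check using the figures above: 1,000 RPS peak,
# a 10-pod canary, and a tipping point after 4 pods were turned down.
peak_rps = 1_000
canary_pods, pods_turned_down_at_tip = 10, 4

headroom = pods_turned_down_at_tip / canary_pods    # fraction of spare capacity
est_capacity_rps = peak_rps / (1 - headroom)        # what the cluster could absorb
print(f"{headroom:.0%} headroom, ~{est_capacity_rps:.0f} RPS estimated capacity")
```

This prints 40% headroom and an estimated ceiling of roughly 1,667 RPS for the full cluster at today’s traffic mix.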


This approach has some unique downsides.

It cannot easily be used for new services

The recommendation for new services is to slowly ramp up traffic and add headroom as required.

It doesn’t solve for overloaded dependencies

One way to learn how your service reacts when another service is overloaded is to run a DRT for that other service.


Why not use auto-scaling so you never have to worry about capacity issues?

Auto-scaling is a great solution for dynamically scaling the number of instances - not only does it keep the system efficient, it also protects against client overload. However, auto-scaling is inherently reactive: a service will see an availability hit if there’s a large spike of requests that it cannot handle while new capacity comes online. So you still need to provision enough headroom to ensure that your service is not overloaded while auto-scaling kicks in, and then we’re back to square one - how much headroom is enough? Use a utilization DRT to confirm.
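One way to put a rough number on “enough headroom while auto-scaling kicks in” is a toy model (my own assumption, not from the DRT itself): assume spike traffic compounds at some per-minute growth rate, and that new capacity takes a few minutes to come online. Until it does, headroom has to absorb the entire spike.

```python
# Toy model (illustrative assumptions, tune to your own traffic data):
# traffic compounds at `growth_per_min` during a spike, and new capacity
# takes `lag_min` minutes to come online after the autoscaler reacts.
growth_per_min = 0.10   # hypothetical: +10% traffic per minute
lag_min = 3             # hypothetical: autoscaler reaction + instance startup

required_headroom = (1 + growth_per_min) ** lag_min - 1
print(f"provision at least {required_headroom:.0%} headroom")
```

With these illustrative numbers the model suggests roughly 33% headroom, which is in the same ballpark as the 30 - 40% rule of thumb earlier in the post. The DRT then tells you whether your cluster actually has that much.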

Why not use serverless and not deal with these issues?

Serverless/Lambda tends to be at least 5x more expensive than running an equivalent server in AWS. At some point, the cost tradeoff is too hard to ignore, let alone other issues like cold-start times.


Utilization DRTs are a useful tool for a reliability engineer to truly understand how their service performs under overload and to be more confident about scale-ups or launches. Hopefully this post has walked through a simple technique for running services more predictably and reliably.