Go for Internal Services

Common benefits, pitfalls, and some advice

Nov 07, 2020

Go (also called Golang) is Google’s open source programming language, generally used for backend and systems engineering. Its major benefits are static typing, fast compilation times, and simplicity via its limited feature set.

I’ve been working with Go at my day job for internal services for a few years, and have noticed some common themes from various success stories, incident post-mortems, and conversations with other engineers. This post inspired me to write some of these experiences down.

go vet

go vet is a static analyzer that checks for common pitfalls, like using references to iterator variables.

Run this automatically as a linter or in CI to catch issues before they hit production.

staticcheck

staticcheck is another static analyzer that catches problems like dead code. In my experience, it’s too slow to run locally for large projects, but it’s useful as a pre-submit check in CI.

Error Handling Assumptions

Go is famous (notorious?) for its slightly verbose error handling idiom:

value, err := someOperation()
if err != nil {
    return nil, err
}

In practice, there’s often an assumption that if the returned error is nil, then the returned value is non-nil, and vice versa. Verifying in every case that the returned value is non-nil when the error is nil adds verbosity for very little benefit. So if you’re writing a library that has many different use cases, it’s useful to guarantee to callers that the value is non-nil whenever err is nil. I don’t know of a great way to flag violations automatically, since returning nil for both is valid in some cases.
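
As a minimal sketch of that guarantee (the `Config` type, `LoadConfig` function, and `ErrNotFound` error are hypothetical names used only for illustration), a lookup can report "not found" as an error instead of returning `(nil, nil)`:

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical value type used only for illustration.
type Config struct {
	Name string
}

// ErrNotFound is returned when no config exists for a name.
var ErrNotFound = errors.New("config not found")

// LoadConfig guarantees that if the returned error is nil, the returned
// *Config is non-nil. A missing entry is reported as an error rather than
// as (nil, nil), so callers never need a separate nil check.
func LoadConfig(store map[string]*Config, name string) (*Config, error) {
	cfg, ok := store[name]
	if !ok || cfg == nil {
		return nil, fmt.Errorf("load %q: %w", name, ErrNotFound)
	}
	return cfg, nil
}

func main() {
	store := map[string]*Config{"api": {Name: "api"}}

	cfg, err := LoadConfig(store, "api")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.Name) // safe: cfg is non-nil because err is nil

	if _, err := LoadConfig(store, "missing"); errors.Is(err, ErrNotFound) {
		fmt.Println("missing config reported as an error")
	}
}
```

Callers that care about the "not found" case can still distinguish it with `errors.Is`, without giving up the non-nil guarantee.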

Panic Handling

In most cases, services should terminate on panic. Since a panic can originate anywhere in the codebase and halts normal execution, no error handling code runs (other than deferred functions, which may call recover), and internal state can become inconsistent. In other words, “all bets are off”. The panic handler should be a top-level function that makes a best-effort attempt to report the exception to an external system, and then terminates. Deterministic panics should be caught by SLO alerts like availability, or by checks for crash loops at the task level.
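
One way such a handler could look is sketched below. The `reportPanic` function is a hypothetical stand-in for a real exception-reporting client, and `handleRequest` and the exit code 2 are likewise assumptions made for the example:

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// reportPanic is a hypothetical stand-in for a client that sends the panic
// to an external exception-reporting system. Here it only writes to stderr.
var reportPanic = func(value any, stack []byte) {
	fmt.Fprintf(os.Stderr, "panic: %v\n%s", value, stack)
}

// exit is a variable so the termination step can be replaced in tests.
var exit = os.Exit

// handlePanic must be deferred directly (defer handlePanic()) so that its
// call to recover sees the panic. It reports on a best-effort basis and
// then terminates the process instead of resuming normal execution.
func handlePanic() {
	if r := recover(); r != nil {
		reportPanic(r, debug.Stack())
		exit(2)
	}
}

// handleRequest is a hypothetical unit of work that panics on bad input.
func handleRequest(id int) {
	if id < 0 {
		panic(fmt.Sprintf("invalid request id %d", id))
	}
	fmt.Println("handled request", id)
}

func main() {
	defer handlePanic()
	handleRequest(1)
	// handleRequest(-1) would panic here; handlePanic would then report
	// the panic and exit with status 2.
}
```

Note that recover only sees panics raised on the goroutine where the deferred call was registered, so each goroutine a service starts needs its own `defer handlePanic()` for this to cover it.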

Functions that can panic

By convention, functions that panic in valid error scenarios must be prefixed with `Must`. This signals to callers that the function they’re calling should be thoroughly vetted and not passed arbitrary input without sanitization. I’ve had incidents where arbitrary input causes crashes on a small percentage of the fleet that have been hard to track down (since we didn’t handle panics by reporting them to an exception reporting system).
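
The standard library follows this convention with `regexp.MustCompile` and `template.Must`. A minimal sketch of the same pattern, using a hypothetical `ParsePort` / `MustParsePort` pair, might look like this:

```go
package main

import (
	"fmt"
	"strconv"
)

// ParsePort returns an error for invalid input. Use it for anything that
// comes from users, config files, or the network.
func ParsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, fmt.Errorf("parse port %q: out of range", s)
	}
	return n, nil
}

// MustParsePort panics on invalid input. The Must prefix warns callers to
// pass only vetted values, such as constants known at compile time.
func MustParsePort(s string) int {
	n, err := ParsePort(s)
	if err != nil {
		panic(err)
	}
	return n
}

// A hard-coded default is a reasonable use of the Must variant: if it is
// wrong, the process fails at startup rather than on some later request.
var defaultPort = MustParsePort("8080")

func main() {
	fmt.Println(defaultPort)
	if _, err := ParsePort("not-a-port"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Keeping the error-returning variant next to the `Must` variant gives callers with arbitrary input a safe option.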

Iterator Variables in Goroutines

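This issue deserves a special mention: it can take many hours to debug, even if you’ve been working in Go for several years, so it’s worth remembering. Before Go 1.22, a loop declares its iterator variable once and reuses it for every iteration, so every closure below captures the same `val`, and the goroutines often all print the last element. (Since Go 1.22, modules that declare `go 1.22` or later get a fresh variable per iteration.)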

for _, val := range values {
	go func() {
		fmt.Println(val) // doesn't do what you expect
	}()
}
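
A common fix that works on any Go version is to pass the iterator variable to the goroutine as an argument, so each goroutine receives its own copy. The sketch below assumes a hypothetical `collect` helper and adds a `sync.WaitGroup` and mutex so the result can be inspected:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect starts one goroutine per value and passes the value as an
// argument, so each goroutine works on its own copy rather than on a
// shared loop variable.
func collect(values []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, val := range values {
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			out = append(out, v)
		}(val) // val is evaluated here, once per iteration
	}
	wg.Wait()
	sort.Strings(out) // goroutine scheduling order is not deterministic
	return out
}

func main() {
	fmt.Println(collect([]string{"a", "b", "c"})) // prints [a b c]
}
```

Shadowing the variable inside the loop body (`val := val`) is an equivalent alternative when changing the closure signature is inconvenient.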

Race Mode

Go provides a race detector to catch concurrency bugs. It will often double memory usage or worse (the Go documentation cites 5-10x memory and 2-20x execution time for a typical program) and slow down your program. Run tests in CI with the `-race` flag to catch issues.

Occasionally, bugs are deployed to production before the race detector catches them, since the affected tests fail nondeterministically. It’s useful to “bake” changes for a few hours, or run a stress test, until some automation can catch these flakes. Often, the race detector flags a test-only issue, since developers tend not to think much about concurrency in tests.

It seemed like a good idea to try deploying canary instances in race mode to detect issues proactively, but that caused a significant latency regression and was never tried again.

Context

Context was built for API calls and other cross-process communication.

Be careful with caller deadlines: they might not leave enough time for your service to finish its own work. For example, your service might need 500ms to complete an operation, but a caller might specify a deadline of 100ms (generally because of a tight deadline set by a nested caller). These requests show up as timeouts in your service and affect its SLA. Your service should fail requests that don’t leave meaningful time to complete the work. If checking this everywhere sounds like overkill, it’s still worth logging the deadline received from each caller, so you can track down the source the next time you have to debug mysterious timeouts.
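
A rough sketch of such a check follows. The `minBudget` value, the `checkBudget` and `simulateCall` helpers, and the `caller` label are assumptions made for illustration; a real service would pick a budget per operation and take the caller identity from request metadata:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// minBudget is an assumed lower bound on the time the operation needs.
const minBudget = 500 * time.Millisecond

// ErrDeadlineTooShort marks requests rejected before any work starts.
var ErrDeadlineTooShort = errors.New("caller deadline too short")

// checkBudget logs the time remaining on the caller's deadline and rejects
// requests that cannot plausibly finish within it.
func checkBudget(ctx context.Context, caller string) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		return nil // the caller did not set a deadline
	}
	remaining := time.Until(deadline)
	log.Printf("caller=%s remaining=%s", caller, remaining)
	if remaining < minBudget {
		return fmt.Errorf("%w: have %s, need %s", ErrDeadlineTooShort, remaining, minBudget)
	}
	return nil
}

// simulateCall is a hypothetical caller that sets the given timeout.
func simulateCall(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return checkBudget(ctx, "frontend")
}

func main() {
	if err := simulateCall(100 * time.Millisecond); err != nil {
		fmt.Println("rejected:", err)
	}
	if err := simulateCall(2 * time.Second); err == nil {
		fmt.Println("accepted")
	}
}
```

Rejecting early with a distinct error also makes these requests easy to separate from genuine timeouts in dashboards and logs.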

Error Visibility

Attach stack traces to errors (via a library or a small helper) and log them, so you’re not stuck wondering where an error came from.
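
The standard `errors` package does not record stack traces, so this usually means a third-party library or a small wrapper. The following is only a sketch of the wrapper approach: `stackError`, `WithStack`, `StackOf`, and `queryBackend` are hypothetical names, not part of any existing library.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
)

// stackError is a hypothetical wrapper that records the stack at the point
// where an error is first wrapped.
type stackError struct {
	err   error
	stack []byte
}

func (e *stackError) Error() string { return e.err.Error() }
func (e *stackError) Unwrap() error { return e.err }

// WithStack attaches the current stack to err, unless the error chain
// already carries one. It returns nil for a nil error.
func WithStack(err error) error {
	if err == nil {
		return nil
	}
	var se *stackError
	if errors.As(err, &se) {
		return err
	}
	return &stackError{err: err, stack: debug.Stack()}
}

// StackOf returns the recorded stack trace, or nil if there is none.
func StackOf(err error) []byte {
	var se *stackError
	if errors.As(err, &se) {
		return se.stack
	}
	return nil
}

var errTimeout = errors.New("backend timeout")

// queryBackend is a hypothetical failing call that captures the stack
// where the error originates.
func queryBackend() error { return WithStack(errTimeout) }

func main() {
	err := fmt.Errorf("handle request: %w", queryBackend())
	fmt.Println(err)
	fmt.Printf("%s", StackOf(err)) // log the stack alongside the message
}
```

Because the wrapper implements `Unwrap`, `errors.Is` and `errors.As` keep working through it, and the stack survives further wrapping with `%w`.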

Conclusion

Go is my “go to” language to use in production. Hopefully these experiences provide some food for thought when you develop your next service in Go.

Software at Scale
