GUIDs Are Not The Only Answer

Dec 31, 2020

Globally Unique Identifiers (GUIDs), also known as Universally Unique Identifiers (UUIDs), are often used as practically unique identifiers in systems.

GUIDs have many advantages. You can safely generate identifiers in a distributed setting without relying on a central identifier registrar. Many systems are vulnerable to Insecure Direct Object Reference attacks, and even though permissions should always be checked for any access, it’s much easier to exploit these with sequential IDs than with GUIDs (a high profile example). Sequential IDs can also leak important information, such as how many objects exist. Modern databases have built-in UUID types which have less storage overhead and perform better than a text type. There are many blogs extolling the benefits of GUIDs.

But anecdotally, there are many cases where raw GUIDs are used as identifiers, even though they might not be the appropriate choice.

Challenges

By their nature, GUIDs are opaque values that don’t convey much information about what they’re identifying. They’re generally unwieldy to use due to their length and structure. URLs should not contain GUIDs due to their poor aesthetics. To steal an example from Coding Horror, database queries at the REPL can become frustrating:

$ select * where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'

Poorly formatted log statements/errors can become harder to debug, since the identifiers don’t tell us what they’re identifying:

error processing BAE7DF4-DDF-3RG-5TY3E3RF456AS10: nil

This can be mitigated by having libraries that report stack traces/line numbers and help identify the exact line of code with the error, but in general, UUIDs need surrounding context to aid debugging.

Moreover, the lack of type safety when using raw integers and raw UUIDs as IDs can cause large problems. Consider a variation of the example from the linked blog:

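from typing import List, Tuple
from uuid import UUID

def get_team_and_user_ids() -> List[Tuple[UUID, UUID]]:
    """Returns a list of (team_id, user_id) tuples."""
    ...

def ban(user_id: UUID) -> None: ...

# Bug: the tuple is unpacked in the wrong order, so `user` holds a team ID.
# Both values are plain UUIDs, so a type checker cannot flag the mistake.
for (user, team) in get_team_and_user_ids():
    ban(user)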

The script might just error out since user and team UUIDs don’t overlap, but the consequences could have been worse with integer IDs: a team ID could coincide with a valid user ID, and the wrong user would be banned.
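One way to catch the swapped tuple is to give each kind of ID its own type. The following is a minimal sketch, not the approach from the linked blog: `UserId` and `TeamId` are hypothetical wrapper classes, and the body of `get_team_and_user_ids` is a stand-in that fabricates one pair so the example runs.

```python
from dataclasses import dataclass
from typing import List, Tuple
from uuid import UUID, uuid4


@dataclass(frozen=True)
class UserId:
    """Hypothetical wrapper: a UUID that identifies a user."""
    value: UUID


@dataclass(frozen=True)
class TeamId:
    """Hypothetical wrapper: a UUID that identifies a team."""
    value: UUID


def get_team_and_user_ids() -> List[Tuple[TeamId, UserId]]:
    """Stand-in data source: returns a list of (team_id, user_id) tuples."""
    return [(TeamId(uuid4()), UserId(uuid4()))]


def ban(user_id: UserId) -> str:
    # A static type checker flags a TeamId passed here; the isinstance
    # check additionally guards untyped callers at runtime.
    if not isinstance(user_id, UserId):
        raise TypeError(f"expected UserId, got {type(user_id).__name__}")
    return f"banned user-{user_id.value}"
```

With these types, unpacking the tuple as `(user, team)` and calling `ban(user)` is rejected by a type checker such as mypy, and raises `TypeError` at runtime instead of silently operating on a team ID.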

Designing your Identifiers

Identifiers should be designed for the properties of the object they’re going to be identifying. We can mix and match components based on the importance of mutability, debuggability, performance and privacy for the object. For example, identifiers for permanent or external objects like customers should be treated differently from those for ephemeral or internal objects like tasks/jobs.

Structure identifiers with prefixes and hierarchies

Often for internal objects, debuggability is key, and prefixes can help with debugging. For example, all task identifiers could have the prefix “task-”. This automatically makes context-free debugging easier, and makes it easy to grep for all messages that mention task IDs:

error processing task-2: nil

Hierarchies help even more with debuggability. Let’s say that every “Job” spawns N “Tasks”. Task IDs might want to contain their job IDs, so that it becomes trivially simple to grep for tasks of a particular job. For example, a task ID would look like: `job-123-task-1`. This also helps in ad-hoc database queries to find relevant rows without complex JOINs. We have examples of this in real life, like area codes in phone numbers.
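A minimal sketch of such a hierarchical ID, assuming numeric job and task numbers and the `job-<n>-task-<m>` layout from the example. The names `TaskId`, `parse_task_id` and `belongs_to_job` are illustrative, not part of any library.

```python
import re
from typing import NamedTuple

# Assumed layout: job-<job number>-task-<task number>
TASK_ID_PATTERN = re.compile(r"^job-(\d+)-task-(\d+)$")


class TaskId(NamedTuple):
    job: int
    task: int

    def __str__(self) -> str:
        return f"job-{self.job}-task-{self.task}"


def parse_task_id(raw: str) -> TaskId:
    """Parse a string such as 'job-123-task-1' back into its components."""
    match = TASK_ID_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"not a task ID: {raw!r}")
    return TaskId(job=int(match.group(1)), task=int(match.group(2)))


def belongs_to_job(raw: str, job: int) -> bool:
    # Including the trailing "-task-" keeps job 12 from matching job 123.
    return raw.startswith(f"job-{job}-task-")
```

The same prefix check works as a `LIKE 'job-123-task-%'` filter in an ad-hoc query or as a grep pattern over logs.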

We can add constant prefixes/hierarchies to any existing ID type. Critics might point out that storing a prefix with a UUID is wasteful since we’d lose the benefits of database-optimized UUID types. This can be mitigated by storing the un-prefixed UUID in the database and having application code automatically prepend the prefix for client code to use, at the cost of some added complexity.
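As a rough sketch of that boundary, the two helpers below add and strip a prefix while the database keeps the raw UUID. The prefix `cus-` and the function names are assumptions made for illustration.

```python
from uuid import UUID, uuid4

CUSTOMER_PREFIX = "cus-"  # hypothetical prefix for customer IDs


def to_external(prefix: str, raw: UUID) -> str:
    """Format a stored UUID as the prefixed ID that clients and logs see."""
    return f"{prefix}{raw}"


def from_external(prefix: str, external: str) -> UUID:
    """Validate the prefix and recover the UUID to use in database queries."""
    if not external.startswith(prefix):
        raise ValueError(f"expected an ID starting with {prefix!r}: {external!r}")
    return UUID(external[len(prefix):])


stored = uuid4()                              # what the UUID column holds
shown = to_external(CUSTOMER_PREFIX, stored)  # what clients receive
```

Rejecting an ID with the wrong prefix at this boundary also gives a cheap runtime check against passing one kind of ID where another is expected.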

Content Addressability

Sometimes, we want a zero-inconsistency approach to storing objects, so it might make sense to make the identifier (or part of it) the checksum of the content that is to be stored. Anyone holding the identifier can then recompute the checksum to verify that the underlying content has not been modified. Git does this with SHA1 commit hashes (even though this is being migrated to SHA256 as of December 2020). Bazel (build system) uses a Content Addressable Store for remote caching.
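A minimal sketch of a content-derived identifier using SHA-256 from Python's standard library. This is a simplification and not Git's or Bazel's actual scheme; the `sha256-` prefix and the function names are assumptions.

```python
import hashlib


def content_id(content: bytes) -> str:
    """Derive the identifier from the bytes being stored."""
    return "sha256-" + hashlib.sha256(content).hexdigest()


def verify(identifier: str, content: bytes) -> bool:
    """Recompute the checksum to confirm the content matches its ID."""
    return identifier == content_id(content)


# A toy in-memory store keyed by content ID.
store = {}
blob = b"hello world"
blob_id = content_id(blob)
store[blob_id] = blob
```

Identical content always maps to the same ID, so storing the same bytes twice is naturally deduplicated, and any modification to the stored bytes makes `verify` return `False`.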

Semi sortability

If you need basic sortability and don’t mind the privacy concerns, consider using a scheme like Segment’s KSUID. This sidesteps the use of sequential IDs, which are painful to migrate away from. UUID1 has some known issues.
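To illustrate the idea only, here is a sketch of a time-prefixed random ID. It is not the KSUID format (KSUID uses its own epoch and base62 encoding); the layout here, a 4-byte big-endian Unix timestamp followed by 16 random bytes and hex-encoded, is an assumption chosen for simplicity.

```python
import os
import struct
import time
from typing import Optional


def sortable_id(now: Optional[float] = None) -> str:
    """Return a 40-character hex ID that sorts by creation second."""
    seconds = int(time.time() if now is None else now)
    # Big-endian so that lexicographic order matches chronological order.
    # A 4-byte unsigned timestamp only covers dates up to the year 2106.
    timestamp = struct.pack(">I", seconds)
    return (timestamp + os.urandom(16)).hex()


def created_at(identifier: str) -> int:
    """Recover the creation time, which is exactly the privacy trade-off."""
    return struct.unpack(">I", bytes.fromhex(identifier[:8]))[0]
```

IDs created in different seconds sort chronologically as plain strings, while IDs created within the same second are ordered arbitrarily by their random suffix, hence "semi" sortable.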

Type Safety

At the very least, identifiers should not be allowed to float freely as strings or integers, in order to prevent a class of inconsistency bugs. SQLAlchemy, a database manipulation library in Python, lets users implement a custom TypeDecorator (and has native support for Postgres UUIDs). Let’s reproduce an example with some tweaks:

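import uuid
from typing import NewType

from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.dialects.postgresql import UUID

db = SQLAlchemy()

# A distinct static type for Foo identifiers; at runtime it is a plain uuid.UUID.
FooId = NewType("FooId", uuid.UUID)

def new_foo_id() -> FooId:
    return FooId(uuid.uuid4())

class Foo(db.Model):
    # The column uses the native Postgres UUID type; primary_key implies unique.
    id = db.Column(UUID(as_uuid=True), primary_key=True, default=new_foo_id)

With a custom type “FooId” used for these identifiers, a static type checker such as mypy will reject other ID types or raw UUIDs wherever a FooId is expected, provided function signatures are annotated with it.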

Conclusion

We’ve seen that there’s a fair amount of complexity involved in ID management and decision making. Some high-scale applications end up deploying custom ID generation services. Identifiers are an early, hard-to-reverse decision that affects even schemaless datastores, so they’re worth thinking about in advance. The right decision can compound developer productivity and prevent painful infrastructure migrations down the road.

Software at Scale
