SQL vs NoSQL is the Wrong Distinction

A misconception that I held for a really long time was that systems that use SQL (Structured Query Language) are by definition relational databases, and systems that don’t (e.g., MongoDB, Cassandra) are non-relational.

It’s easy to understand whether a database uses SQL or not, but what does it mean when we say a database is relational? At a minimum, relational databases expose data in the form of “relations,” another word for tables with rows and columns.1 SQL is just the ubiquitous way of querying for data relationally.

AWS explains this distinction subtly. The definition of relational databases starts with:

A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database.

And SQL only makes an appearance a little further down, when it’s explained that “SQL or Structured Query Language is the primary interface used to communicate with Relational Databases.”

The creators of MongoDB seem to be aware of this misleading conflation of terms when justifying why they didn’t go with SQL:

The term NoSQL is a bit inaccurate - we are really talking about horizontally scalable postrelational stores, not about the query language. I would consider the Google App Engine Data Store NoSQL, and it uses a SQL-like query language GQL.

Some of their reasons (emphasis mine):

The main reason we went the way we did with the query language - representing queries as JSON - was to normalize the data we are storing with the query mechanism.  If we are storing JSON in the database, can we not represent the queries that way too?  We thought that made sense.

I’ve never been a fan of embedding one programming language in another. Building up strings of SQL dynamically has always seemed a bit strange to me. I much prefer representing the queries in a data-driven way instead.

Therefore, it’s not logically inconsistent that MongoDB’s new data lake offering can be queried via SQL.

Why is this distinction important? It’s pedantic, but SQL is a versatile language that can do much more than query relational databases. For example, SQL can be used to query S3, trigger analytics jobs in Hive, retrieve OS information on the fly via osquery, provide continuously updating relational views on streaming data from Materialize, and even query NoSQL databases. And conversely, relational data can be queried from languages other than SQL, like EdgeQL.

Instead of dividing databases by SQL vs. NoSQL, we could try to divide them up by relational vs. non-relational (perhaps that’s how most of us think of the distinction anyway). But these two camps of databases are slowly converging too, with JSON datatype support in MySQL and PostgreSQL, and pseudo-JOINs and transactions in MongoDB. Other factors like the supposed horizontal scalability of NoSQL databases aren’t valid either. Practitioners figured out that we could build semi-relational databases by tacking on query engines to key-value stores, fully relational distributed databases ground up (F1, CockroachDB), or as layers on primary-replica RDBMS node groups (Vitess).

NoSQL databases admit that their biggest draw is the lack of strict schemas and the ensuing benefits like less painful migrations. In my opinion, this is the major difference between the two camps now. Let’s try to distinguish these databases by strict vs. loose schema validation, instead of SQL vs. NoSQL, and things might become less confusing for the next generation of developers.

1

Based on this definition, both row-based and column-based data databases should count as relational, but in practice, it seems like only row-based database systems are considered relational. The creator of the term “relational” specified that data from a relational system should be queryable via table name, primary key, and column name (Rule 2), and vendors like AWS seem to consider columnar and relational databases as opposites. Column stores are often considered one of the four key NoSQL database types.