Primary keys only need to be unique; there’s many characteristics they often have but don’t need to have

There is often a misunderstanding about primary keys in database tables. Primary keys identify rows uniquely, so the only property they have to obey is they must be unique.

Primary keys don't have to be numbers. But if they are, here are a few properties these numbers don't have to have:

  1. Numerical PKs don't indicate ordering, i.e. which row was inserted first.
  2. Looking at the highest PK value doesn't indicate the number of rows created (cardinality).
  3. There are not guaranteed not to be "holes" in the numerical sequence: e.g. row 11 and row 13 exist, but row 12 doesn't.

In addition to PKs not having to have these properties, there are good reasons why auto-generated PKs actually don't have them practice.

Many types of systems (for example websites, or the central transaction database of a supermarket) measure their performance not by the duration of a particular transaction (0.10s or 0.08s doesn't make a big difference) but measure their performance by concurrency, which influences the throughput, i.e. if you can handle 100 or 1,000 transactions per second.

If all data and locks which are system-wide are stored on the database, then all other parts of the system (web servers, application servers, cash tills) can be independent of one another. If the performance of system were limited by one of these non-database parts, one can improve the throughput of the system (although not the duration of individual transactions) by increasing the concurrency, i.e. simply adding more of these non-database parts in parallel. (Surprisingly, the case in one system I'm building, concerning simulation of many different mobile phone tariffs for a particular phone usage scenario, this is indeed the case: the slowest part by far is the application server)

Increasing the concurrency in the database, on the other hand, is not such an easy task. One can't just split the data up into different databases: unique constraints must be enforced over all data, referential integrity (foreign keys) must be enforced over all data, locks must be system-wide (e.g. "does user X have enough money? first lock, then check, then take away money, then unlock")

But there are techniques available:

So, given that the main contention point is the database, and that multiple parallel CPUs and multiple parallel disks are in use, it would be unfortunate to make e.g. one disk wait for the other to finish. It should be possible that both disks are in use at the same physical moment in time.

And to do this, one needs to make sure that there is no reason why one transaction (e.g. purchase by a user) needs to wait on another transaction (e.g. different purchase by a different user). One needs to use locking where appropriate – but no more (e.g. lock the user's balance for the duration of a transaction so if two items are charged the same account at the same time so one doesn't get "check balance transaction A -> OK; check balance transaction B -> OK; subtract A; subtract B" and find the user had enough money for either A or B but not A+B together – but two transactions for two different users can be totally independent).

Imagine if primary keys could indicate either ordering or cardinality, or weren't allowed to have holes. This would have the following consequences:

So primary keys are just arbitrary data indicating uniqueness alone. And that's good, because if they didn't, there would be much more inter-transaction contention, and one couldn't utilize ones multi-server multi-disk multi-CPU environment: the speed of the system would be limited to the speed of the slowest transaction.

P.S. I recently created a nerdy privacy-respecting tool called When Will I Run Out Of Money? It's available for free if you want to check it out.

This article is © Adrian Smith.
It was originally published on 28 Jan 2009
More on: Databases | Software Architecture