Database normalization: what is it and why do I need to know about it?

Trust me, you do need to know about it.

Dylan Alvarez
devartis

--

I think that a well-designed system is one that is understandable by humans and avoids causing them unnecessary problems.

Database normalization is a set of rules that help improve the design of our database schema. And in Active Record frameworks such as Django and Rails, that means improving the design of the domain model as well.

As a consequence, database normalization becomes a key part of building a well-designed system.

Our main objectives

Avoiding data duplication

Having more than one mention of the same statement (e.g., “David is 23 years old”) in our database generates inconsistencies and damages performance (because we have to update all of those sources of truth every time that statement needs to be updated).

Avoiding null values

They always generate bugs. Application code is prone to null pointer errors, and different databases could reply differently to the same query when null values are present. For example:

Source: Django docs

Giving each table one unique purpose

The model that represents that table is more cohesive and the database itself is more understandable.

Step 1: Don’t repeat columns

Let’s imagine we’re working for a bank in which credit card holders can have up to three additional cards. We could store additional card holder names like this:

But that would mean that the query “cards that have Robert as additional card holder” looks like this:

With this representation, we are introducing null values. Also, we’re tying the database schema to an arbitrary business rule. (What if we allow five additional cards? Would that query still be valid?)

An improved way of representing this is:

Which would turn the query into something a bit less repetitive:

We got rid of null values, and the additional card limit is now more flexible (it could be different for different card holders).

Ask yourself: Are multiple columns responding to the same question? (In this case, “Who has an additional card tied to this one?”)

Step 2: Values must be atomic

And by atomic I mean that they state only one fact. For example: “289 Buckingham Ave, Riverview” states both an address and a town.

Sometimes, the appropriate level of detail depends on the applications that make use of our database. For instance, separating the area code of the rest of a phone number could be either essential or unneeded. If you’re not sure, I suggest you choose the more detailed option.

Let’s say we need to store information about loans: recipient data, location and amount. Our current schema looks like this:

Now we want to retrieve information about loans in China from ICBC. We try to manage using this query:

But there are many problems with it. For example, the result of this query includes a loan in Chinatown, we could be missing an ICBC loan because of a difference in spacing inside the JSON document, and finally, this query may take a long time to complete, because it uses partial string matching twice.

All of these problems arose because there are multiple statements inside of a single database cell. The recipient column informs us about the bank name, the recipient name, and (sometimes) their birthdate. And the location column informs us about the recipient’s neighborhood, the city and the country that neighbourhood is in.

We could improve it by making all values atomic:

There’s a null value in here. That’s not ideal, but I think that in this case it’s representative of the fact that we do not always know the recipient’s birthdate. It’s up to you how strictly you follow these rules.

Our query now looks like this:

Much better. This is also a much faster query, since we now match complete strings instead of using an expensive LIKE query.

This is one of the sneakiest mistakes you can make, because what’s “atomic” depends on your application. For example, there are cases in which phone_number is not atomic enough, and it’s much more convenient to, for example, separate the area_code. Pay close attention to your fields’ atomicity.

Ask yourself: Is one cell responding more than one question?

Step 3: Spot functional dependencies

A functional dependency between two attributes (columns) of an entity (row) is a relationship in which the value of one determines the value of the other. For example, in this table:

There’s a dependency: zip_codestate_name. If zip_code is 10001, then state_name is New York, every time.

If we have many rows that state this fact “Zip code 10001 belongs to the New York state”, then an inconsistency is possible.

We could save a row in which zip_code is 10001 and state_name is Alabama. Now our database gives two different replies to the question “Where is zip code 10001?”, depending on which row you’re asking.

Here we don’t have that problem. If we make zip_code values unique, then there’s only one answer to our question.

Ask yourself: Do I allow multiple answers to the same question?

Step 4: A null value is always a code smell

Did you know that there are both friends and followers on Facebook? Yup, me neither

There are potential problems with this table:

Instances of SocialNetworkUser are prone to null pointer exceptions.

Worse yet (and even worse on flexible languages such as JavaScript), they are prone to silent errors! For example, in Python 2 this works even with social_network_user.friends == None:

is_popular = social_network_user.friends > 1000

If most users are Facebook users, it’s easy for mistakes like that one to slip by. If instead, we store social network users this way:

Then there is no way to (successfully) ask an InstagramUser instance questions that it doesn’t know how to reply. There are a lot of potential bugs that are not possible anymore.

Also, there’s the benefit of having a domain model that’s more cohesive.

Ask yourself: Are all rows in my table of the same kind? Is there a condition under which we know for sure a column is going to be null?

What about performance? Separating tables means expensive JOIN queries!

JOIN queries require going through multiple tables to find something. They tend to be slower than alternatives that don’t use them. However, designing a solution in fear of the cost of a JOIN query is a case of premature optimization.

If a database normalization step makes your query slower, first evaluate if there’s a way to optimize the query itself, then consider using indexes, and finally, if there is no alternative, proceed to denormalize.

What about NoSQL databases?

Normalization rules do not apply to non-relational databases. In those databases duplication is usually tolerated, and the schema is designed around the most common queries.

What you gain in return are some advantages over relational databases, like the ability to distribute a single database around the world with relative ease.

There’s more to normalization than this

This is only an introduction to the theory of database normalization. I recommend going into more detail, for example, reading this post.

Thank you for reading!

Visit us!

--

--