31

Database Third Normal Form Explained in Simple English

Second Normal Form

This is the fourth in a series of posts teaching normalization. 

The third post focused on the second normal form, its definition, and examples to hammer it home.

Once a table is in second normal form, we are guaranteed that every column is dependent on the primary key, or as I like to say, the table serves a single purpose.  But what about relationships among the columns?  Could there be dependencies between columns that could cause an inconsistency?

A table containing both columns for an employee’s age and birth date is spelling trouble, there lurks an opportunity for a data inconsistency!

How are these addressed?  By the third normal form.

3NF – Third Normal Form Definition

A table is in third normal form if:

  • A table is in 2nd normal form.
  • It contains only columns that are non-transitively dependent on the primary key

Wow!  That’s a mouthful.  What does non-transitively dependent mean?  Let’s break it down.

DatabaseNormalizationFiveMinuteGuideButton

Transitive

When something is transitive, then a meaning or relationship is the same in the middle as it is across the whole.  If it helps think of the prefix trans as meaning “across.”  When something is transitive, then if something applies from the beginning to the end, it also applies from the middle to the end.

Since ten is greater than five, and five is greater than three, you can infer that ten is greater than three.

In this case, the greater than comparison is transitive.  In general, if A is greater than B, and B is greater than C, then it follows that A is greater than C.

If you’re having a hard time wrapping your head around “transitive” I think for our purpose it is safe to think “through”  as we’ll be reviewing to see how one column in a table may be related to others, through a second column.

Dependence

An object has a dependence on another object when it relies upon it.  In the case of databases, when we say that a column has a dependence on another column, we mean that the value can be derived from the other.  For example, my age is dependent on my birthday.  Dependence also plays an important role in the definition of the second normal form.

Transitive Dependence

Now let’s put the two words together to formulate a meaning for transitive dependence that we can understand and use for database columns.

I think it is simplest to think of transitive dependence to mean a column’s value relies upon another column through a second intermediate column.

Consider three columns:  AuthorNationality, Author, and Book.  Column values for AuthorNationality and Author rely on the Book; once the book is known, you can find out the Author or AuthorNationality.  But also notice that the AuthorNationality relies upon Author.  That is, once you know the Author, you can determine their nationality.  In this sense then, the AuthorNationality relies upon Book, via Author.  This is a transitive dependence.

This can be generalized as being three columns:  A, B and PK.  If the value of A relies on PK, and B relies on PK, and A also relies on B, then you can say that A relies on PK though B.  That is A is transitively dependent on PK.

Let’s look at some examples to understand further.

Primary Key (PK)Column AColumn BTransitive Dependence?
PersonIDFirstNameLastNameNo, In Western cultures a person’s last name is based on their father’s LastName, whereas their FirstName is given to them.
PersonIDBodyMassIndexIsOverweightYes,  BMI over 25 is considered overweight.It wouldn’t make sense to have the value IsOverweight be true when the BodyMassIndex was < 25.
PersonIDWeightSexNo:There is no direct link between the weight of a person and their sex.
VehicleIDModelManufacturerYes:Manufacturers make specific models.  For instance, Ford creates the Fiesta; whereas, Toyota manufacturers the Camry.

To be non-transitively dependent, then, means that all the columns are dependent on the primary key (a criteria for 2nd normal form) and no other columns in the table.

Issues with our Example Data Model

Let’s review what we have done so far with our database.  You’ll see that I’ve found one transitive dependency:

CustomerCity relies on CustomerPostalCode which relies on CustomerID

Generally speaking a postal code applies to one city.  Although all the columns are dependent on the primary key, CustomerID, there is an opportunity for an update anomaly as you could update the CustomerPostalCode without making a corresponding update to the CustomerCity.

We’ve identified this issue in red.

Issues with the Third Normal Form

Fix the Model to 3NF Standards

In order for our model to be in third normal form, we need to remove the transitive dependencies.  As we stated our dependency is:

CustomerCity relies on CustomerPostalCode which relies on CustomerID

It is OK that CustomerPostalCode relies on CustomerID; however, we break 3NF by including CustomerCity in the table.  To fix this we’ll create a new table, PostalCode, which includes PostalCode as the primary key and City as its sole column.

The CustomerPostalCode remains in the customer table.  The CustomerPostalCode can then be designated a foreign key.  In this way, through the relation, the city and postal code is still known for each customer.  In addition, we’ve eliminated the update anomaly.

Data Model in Third Normal Form

To better visualize this, here are the Customer and PostalCode tables with data.

Sample Data in 3NF

Now each column in the customer table is dependent on the primary key.  Also, the columns don’t rely on one another for values.  Their only dependency is on the primary key.

Data in 3NF

The same holds true for the PostalCode table.

At this point our data model fulfills the requirements for the third normal form.  For most practical purposes this is usually sufficient; however, there are cases where even further data model refinements can take place.  If you are curious to know about these advanced normalization forms, I would encourage you to read about BCNF (Boyce-Codd Normal Form) and more!

Can Normalization Get out of Hand?

Can database normalization be taken too far?  You bet!  There are times when it isn’t worth the time and effort to fully normalize a database.  In our example you could argue to keep the database in second normal form, that the CustomerCity to CustomerPostalCode dependency isn’t a deal breaker.

I think you should normalize if you feel that introducing update or insert anomalies can severely impact the accuracy or performance of your database application.  If not, then determine whether you can rely on the user to recognize and update the fields together.

There are times when you’ll intentionally denormalize data.  If you need to present summarized or complied data to a user, and that data is very time consuming or resource intensive to create, it may make sense to maintain this data separately.

DatabaseNormalizationFiveMinuteGuideButton

Several years ago I developed a large engineering change control system which, on the home page, showed each engineer’s the parts, issues, and tasks requiring their attention.  It was a database wide task list.   The task list was rebuilt on-the-fly in real-time using views.  Performance was fine for a couple of years, but as the user base grew, more and more DB resources were being spent to rebuild the list each time the user visited the home page.

I finally had to redesign the DB.  I replaced the view with a separate table that was initially populated with the view data and then maintained with code to avoid anomalies.   We needed to create complicated application code to ensure it was always up-to-date.

For the user experience it was worth it.  We traded off complexity in dealing with update anomalies for improved user experience.

This post concludes our series on normalization.  If you want to start from the beginning, click here.

More tutorials are to follow! Remember!  I want to remind you all that if you have other questions you want answered, then post a comment or tweet me.  I’m here to help you. What other topics would you like to know more about?

Kris Wenzel
 

Kris Wenzel has been working with databases over the past 28 years as a developer, analyst, and DBA.He has a BSE in Computer Engineering from the University of Michigan and a MBA from the University of Notre Dame.Kris has written hundreds of blog articles and many online courses. He loves helping others learn SQL.

  • Tony Alderman says:

    Just some info in case someone is thinking about taking your advice… zip codes can cross city lines, county lines, and even state lines. I’ve had to fix many databases because people don’t realize zip codes are set by the Post Office (federal gov’t) for route efficiency — it has nothing to do with a city (state & local gov’t).

    • That example still holds. Since postal code is the primary key it is OK for a city to be assigned more than one postal code. I’ve yet to see a postal code, at least in the US, listed under more than two cities.

      In this case, the city is for mailing purposes, so you could live in “Phoenix” yet have a “Scottsdale” address.

      • Tony Alderman says:

        From Wikipedia[1]… “For example ZIP code 42223 spans Christian, KY, and Montgomery, TN, and ZIP code 97635 spans Lake, OR, and Modoc, CA”.

        Now you have. :-)

        This is the extreme example of zip codes spanning multiple states — it is much more common for them to span multiple counties and cities.

        [1] http://en.wikipedia.org/wiki/ZIP_code#Structure_and_allocation

        • Hi Tony, thanks for brining up this example. It points out some interesting things with the way people can use and intrepret a data model!

          I agree that zip code can span more than once city, but when you mail a letter to someone residing in that zip code you don’t say, for example:

          Mr. Smith
          123 Sunny Ave
          Christian, KY or Montgomery, TN 42223

          Instead you would address the letter to:

          Mr. Smith
          123 Sunny Ave
          FORT CAMPBELL, KY 42223

          This brings up a really important point in data modeling: knowing the purpose of your table’s column. If the city column is meant to define the geopolitical area you live in, then using zip code won’t work as a primary key, for the reasons you mention.

          Yet, if the city serves to complete the mailing address, then the model I show works.

          Bottom line, when working with data it is really important to understand its meaning and intended use before your model is complete.

    • Tony says:

      Yeah I agree, not a good example of 3NF since zip code is not directly related to city. What you’re saying is that city depends on zip code, which is not true at all. In fact, in many rural unincorporated areas there may be no city at all. http://en.wikipedia.org/wiki/ZIP_code#By_geography

      • See my replies to Tony Alderman (above).

        There doesn’t have to be a one to one mapping between zip code and city, nor city and zip code.

        In my model the city is the mailing address, and, as such, is well suited to 3NF.

  • Aju Mohan says:

    Thanks a lot @Kris Wenzel . . I think you are the best one on internet to teach Normalization this simple. The site also structured in good manner. Wishes.. :)

  • aditya says:

    Hello sir,
    As we did in the customer table to break the table into customer and postal code tables.so like the same can we break the salesoffice table into salesoffice and offno table since we can find the salesoffice city by using the officeno.i know this is illogical but as we follows the rule can we?

    • Hi – It all depends on whether office number is transitively dependent on office. In my opinion it think it makes sense to keep them the same.

      I would break office number into a separate table if there was more than one office number per office location.

  • Gus says:

    Great set of articles about normalization, my tutor in class didn’t explain it half as well as you have done. Thanks!

  • Sakthivek says:

    Nice expanation…Learned the Normalization clearly…Good work…

  • prichaKid says:

    I must confess; I’ve never understood normalization until now…tanks boss, u did a great job….

  • sherry says:

    Very easy explanation! thanks for the article!!

  • Tabo says:

    I like how you show the relationships. My lecturer left me hanging on that part.

  • Grace says:

    Very informative! I enjoyed reading the tutorial on Database Normalization.

  • Priti says:

    Thanks for such simple and clear explanations. Would like to get some problem solving excersise and solution for practice purposes.

    • Hi Priti,

      Thanksfor the suggestion. I’ve been putting together some SQL puzzles and posting them on this blog and on my FaceBook group EssentialSql.

      I’ll look to see about putting together a Normalization puzzle.

  • Carl Taylor says:

    HI
    I have an issue with your Postalcode table in that your primary key, postalcode, breaks the rules for primary keys – it must be unique for every record in that table. Your first two records have the same postal code. This is the issue with your plan to separate them out as more than one address has the same postal code. Probably similar to the UK. Here, in a street, one post code will cover about 15 houses.
    Thoughts?
    Thanks

  • Hi the postal codes are unique. I think you’re seeing 48123 and 48213 together and your eyes are playing tricks on you!

    Also, it is OK for the non primary keys to be not be unique. So I can have different zip codes for the same city. This happens all the time for large cities such as New York and Chicago.

  • James Prater says:

    Great Work Kris. Very helpful info here. You definitely explained it a lot better than the coursework I’ve reviewed.

  • Hua Liang says:

    It contains only columns that are non-transitively, if you remove “dependent on the primary key” life would be easier.

  • John Knauer says:

    Kris,

    As I was reading through your explanation of 3NF, I had the same thought as Tony – a single zip code can, indeed, span several municipalities! However, as you point out, The zip code is associated with the “major” city of that area. Case in point, I live in a township which is immediately adjacent to a major city. They share the same zip code. My mail is addressed to the major city not the township. Therefore, your example holds to 3NF. Thanks for this refresher on data normalization!

  • Krishna Bala says:

    Nicely explained: It was easy to follow the plain-english DB normalization process. Great examples and simplified to the essence!

  • >