Data is abundant, and organisations maintain large databases to store the massive volumes generated daily. Scientists and analysts study this stored data to identify patterns and trends and to inform upcoming marketing decisions. However, because data collected from diverse sources is often redundant, analysts first apply a set of procedures known as normalization to separate the important data from the redundant.
What Is Normalization?
Normalization is the process of organising data by eliminating redundancy while maintaining data integrity. Once normalized, the data is neatly tabulated, making analysis far more straightforward.
Before learning about the normal forms, first, let’s learn about the keys.
Types of Keys in DBMS
A key in a DBMS uniquely identifies a record or row in a table. Keys are also used to define and enforce relationships between tables. A single table usually contains several types of key.
The different types of keys you should be aware of are,
- The Primary Key
- Candidate Key
- Super Key
- Foreign Key
- Composite Key
The Primary Key
The primary key is the single key chosen to uniquely identify each record in a table.
For example, the employee id in an employee table is unique to each employee, so it can serve as the primary key.
Candidate Key
A candidate key is any attribute, or minimal combination of attributes, that can uniquely identify a record. From the set of candidate keys, the most suitable one is chosen as the primary key.
In the employee table, for example, in addition to the employee id, attributes such as the passport number and licence card number can also uniquely identify an employee.
Super Key
A super key is any set of attributes that can uniquely identify a tuple. Every super key is a superset of some candidate key.
Foreign Key
A foreign key is a column (or set of columns) in one table that references the primary key of another table.
For example, the department id column in the employee table references the primary key of the department table.
Composite Key
A composite key combines two or more attributes that, taken together, uniquely identify a tuple. When no single column can serve as the primary key, several columns are combined to form one; this combination is known as a composite key.
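The key types above can be sketched with a small SQLite schema. This is a minimal illustration, not taken from the article's tables: the `department`, `employee`, and `enrollment` table and column names are assumed for the example.

```python
import sqlite3

# In-memory database; enable foreign key enforcement (off by default in SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# department_id is the primary key of the department table.
conn.execute("""
    CREATE TABLE department (
        department_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL
    )""")

# employee_id is the primary key; passport_no is another candidate key
# (enforced here with UNIQUE); department_id is a foreign key that
# references the department table's primary key.
conn.execute("""
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        passport_no   TEXT UNIQUE,
        department_id INTEGER REFERENCES department(department_id)
    )""")

# A composite key: no single column is unique on its own, but the pair
# (student_no, topic_no) identifies each row.
conn.execute("""
    CREATE TABLE enrollment (
        student_no INTEGER,
        topic_no   TEXT,
        PRIMARY KEY (student_no, topic_no)
    )""")
```

With this schema, the database itself rejects rows that would violate a key: a duplicate composite key or a department id that does not exist in the department table both raise an integrity error.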
Data is usually collected in large volumes from many different sources, and it initially lands in large tables in the database.
These enormous tables frequently contain anomalies that can leave the database inconsistent, compromise its integrity, or introduce redundant data. Such inconsistencies can hinder analysis and lead to inaccurate predictions.
This is why analysts first normalise database tables using the following normal forms:
- 1NF or First Normal Form
- 2NF or Second Normal Form
- 3NF or Third Normal Form
Most tables are considered fully normalized once they reach the Third Normal Form.
Now, let's look at each normal form in turn.
First Normal Form (1NF)
The rules of the First Normal Form are:
- Each record must be distinct.
- Each cell of the table should contain only one value.
So if a column holds two or more values in the same cell, those values must be split into separate rows.
The initial table in the example below stores multiple values in a single cell. After applying 1NF, each value gets its own record.

| ID | Name | Courses |
|----|------|---------|
| 1  | A    | t1, t2  |
| 2  | E    | t3      |
| 3  | M    | t2, t3  |

| ID | Name | Courses |
|----|------|---------|
| 1  | A    | t1      |
| 1  | A    | t2      |
| 2  | E    | t3      |
| 3  | M    | t2      |
| 3  | M    | t3      |
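The 1NF split above can be sketched in a few lines of Python: each row holding a comma-separated list of courses is expanded into one row per course. The `to_1nf` helper name is assumed for the example.

```python
# Unnormalized rows: the Courses cell can hold several values.
unnormalized = [
    (1, "A", "t1, t2"),
    (2, "E", "t3"),
    (3, "M", "t2, t3"),
]

def to_1nf(rows):
    """Expand multi-valued Courses cells into one row per value."""
    result = []
    for id_, name, courses in rows:
        for course in courses.split(","):
            result.append((id_, name, course.strip()))
    return result
```

Running `to_1nf(unnormalized)` produces the five single-valued rows shown in the second table.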
Second Normal Form (2NF)
The rules of the Second Normal Form are:
- The table must first satisfy the First Normal Form.
- The table must not contain any partial dependency.
A partial dependency occurs when a non-prime attribute depends on only part of a composite primary key; every non-prime attribute must depend on the whole key.
In the example below, topic T1 appears twice with the same fee, which is redundant: TOPIC_FEE depends only on TOPIC_NO, part of the composite key (STUDENT_NO, TOPIC_NO). Applying 2NF splits the table in two; table 1 records students and their topics, while table 2 records each topic and its fee.
| STUDENT_NO | TOPIC_NO | TOPIC_FEE |
|------------|----------|-----------|
| 1          | T1       | 2500      |
| 2          | T2       | 3333      |
| 1          | T4       | 6700      |
| 4          | T3       | 5500      |
| 4          | T1       | 2500      |
| 2          | T5       | 5003      |
Table 1

| STUDENT_NO | TOPIC_NO |
|------------|----------|
| 1          | T1       |
| 2          | T2       |
| 1          | T4       |
| 4          | T3       |
| 4          | T1       |
| 2          | T5       |

Table 2

| TOPIC_NO | TOPIC_FEE |
|----------|-----------|
| T1       | 2500      |
| T2       | 3333      |
| T3       | 5500      |
| T4       | 6700      |
| T5       | 5003      |
Reducing redundant data is the aim of 2NF. If 100 students chose topic T1, the original table would repeat T1's fee 100 times. After the split, the fee is stored once: a single lookup in table 2 shows that T1's fee is 2500.
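The 2NF decomposition above can be sketched in plain Python: the fee, which depends only on the topic, moves into its own mapping and is stored once per topic. The `to_2nf` helper name is assumed for the example.

```python
# (student_no, topic_no, topic_fee) rows from the original table.
rows = [
    (1, "T1", 2500),
    (2, "T2", 3333),
    (1, "T4", 6700),
    (4, "T3", 5500),
    (4, "T1", 2500),
    (2, "T5", 5003),
]

def to_2nf(rows):
    """Split out the partial dependency TOPIC_NO -> TOPIC_FEE."""
    # Table 1: which student took which topic.
    enrollments = [(student, topic) for student, topic, _ in rows]
    # Table 2: each topic's fee, stored exactly once.
    fees = {}
    for _, topic, fee in rows:
        fees[topic] = fee
    return enrollments, fees
```

However many students choose T1, the `fees` mapping holds its fee in a single entry.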
Third Normal Form (3NF)
The rules of the Third Normal Form are:
- The table must first satisfy the Second Normal Form.
- The table must not contain any transitive functional dependency for non-prime attributes.
Non-prime attributes are those that do not belong to any candidate key. These non-prime attributes should not depend on other non-prime attributes in the table.
The third normal form is used to enhance data integrity.
In the table below, STUD_NO determines SUB_ID, which in turn determines SUB. STUD_NO therefore determines SUB only through SUB_ID, which means the table has a transitive functional dependency.
| STUD_NO | NAME    | SUB_ID | SUB  | ADDRESS    |
|---------|---------|--------|------|------------|
| 1       | John    | 11     | SQL  | California |
| 2       | Matthew | 12     | C++  | Florida    |
| 3       | Abraham | 13     | Java | Illinois   |
| 4       | Chris   | 12     | C++  | Oregon     |
Hence, the table above is divided into two, ensuring that every remaining column depends only on the primary key: the student table keeps STUD_NO, NAME, SUB_ID, and ADDRESS, while the SUB_ID → SUB mapping moves to its own table.

| STUD_NO | NAME    | SUB_ID | ADDRESS    |
|---------|---------|--------|------------|
| 1       | John    | 11     | California |
| 2       | Matthew | 12     | Florida    |
| 3       | Abraham | 13     | Illinois   |
| 4       | Chris   | 12     | Oregon     |

| SUB_ID | SUB  |
|--------|------|
| 11     | SQL  |
| 12     | C++  |
| 13     | Java |
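The 3NF decomposition can be sketched the same way: SUB depends on SUB_ID rather than directly on STUD_NO, so the SUB_ID → SUB mapping is pulled into its own table. The `to_3nf` helper name is assumed for the example.

```python
# (stud_no, name, sub_id, sub, address) rows from the original table.
students = [
    (1, "John", 11, "SQL", "California"),
    (2, "Matthew", 12, "C++", "Florida"),
    (3, "Abraham", 13, "Java", "Illinois"),
    (4, "Chris", 12, "C++", "Oregon"),
]

def to_3nf(rows):
    """Remove the transitive dependency STUD_NO -> SUB_ID -> SUB."""
    # Student table: every column depends only on STUD_NO.
    student_table = [(no, name, sub_id, addr)
                     for no, name, sub_id, _, addr in rows]
    # Subject table: SUB depends only on SUB_ID, stored once per subject.
    subjects = {sub_id: sub for _, _, sub_id, sub, _ in rows}
    return student_table, subjects
```

After the split, renaming a subject (say, correcting "C++") is a single update in the subject table instead of one per student.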
The Bottom Line on Normalization
Normalization makes analysing databases much easier and efficient enough to derive solutions. In this article, we have discussed the three main normal forms. Although further forms exist, analysts usually only normalise tables up to the third form.
Start learning to code
If you want to learn more about things you can do with data, check out our article on predictive analytics. Alternatively, if you want to learn some coding basics for free, then check out our free 5 Day Coding Challenge. Register now through the form below.