Normalization in DBMS11
Normalization in DBMS11
Normalization in DBMS11
BCNF in Database
Normalization is a process of organizing the data in database to avoid data
redundancy, insertion anomaly, update anomaly & deletion anomaly. Let’s
discuss about anomalies first then we will discuss normal forms with
examples.
Normalization
If a database design is not perfect, it may contain anomalies, which are like a bad
dream for any database administrator. Managing a database with anomalies is
next to impossible.
• Update anomalies − If data items are scattered and are not linked to each other
properly, then it could lead to strange situations. For example, when we try to update
one data item having its copies scattered over several places, a few instances get
updated properly while a few others are left with old values. Such instances leave the
database in an inconsistent state.
• Deletion anomalies − We tried to delete a record, but parts of it was left undeleted
because of unawareness, the data is also saved somewhere else.
• Insert anomalies − We tried to insert data in a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a
consistent state.
Each attribute must contain only a single value from its pre-defined domain.
If we follow second normal form, then every non-prime attribute should be fully
functionally dependent on prime key attribute. That is, if X → A holds, then there
should not be any proper subset Y of X, for which Y → A also holds true.
We see here in Student_Project relation that the prime key attributes are Stu_ID
and Proj_ID. According to the rule, non-key attributes, i.e. Stu_Name and
Proj_Name must be dependent upon both and not on any of the prime key attribute
individually. But we find that Stu_Name can be identified by Stu_ID and Proj_Name
can be identified by Proj_ID independently. This is called partial dependency,
which is not allowed in Second Normal Form.
We broke the relation in two as depicted in the above picture. So there exists no
partial dependency.
• For any non-trivial functional dependency, X → A, then either − o X is a superkey or, o A is prime
attribute.
We find that in the above Student_detail relation, Stu_ID is the key and only prime
key attribute. We find that City can be identified by Stu_ID as well as Zip itself.
Neither Zip is a superkey nor is City a prime attribute. Additionally, Stu_ID → Zip
→ City, so there exists transitive dependency.
To bring this relation into third normal form, we break the relation into two relations as
follows −
In the above image, Stu_ID is the super-key in the relation Student_Detail and Zip
is the super-key in the relation ZipCodes. So, Stu_ID → Stu_Name, Zip and
Zip → City
• The words normalization and normal form refer to the structure of a database.
• Normalization was developed by IBM researcher E.F. Codd In the 1970s.
• Normalization increases the clarity in organizing data in Database.
Example:
Sample Employee table, it displays employees are working with multiple departments.
Edward 45 Quality
Assurance
Alex 36 Human
Resource
Employee table following 1NF:
Melvin 32 Marketing
Melvin 32 Sales
The entity should be considered already in 1NF, and all attributes within the entity
should depend solely on the unique identifier of the entity.
Example:
1 Monitor Apple
2 Monitor Samsung
3 Scanner HP
productID product
1 Monitor
2 Scanner
3 Head
phone
Brand table:
1 1 1
brandID brand
1 Apple
2 Samsung
2 1 2
3 HP
3 2 3
4 JBL
4 3 4
Well, this is a highly simplified explanation for Database Normalization. One can study this
process extensively though. After working with databases for some time, you'll
automatically create Normalized databases, as it's logical and practical.
A table is said to be in First Normal Form (1NF) if and only if each attribute of the relation is atomic.
That is,
• Each row in a table should be identified by primary key (a unique column value or group
of unique column values)
• No rows of data should have repeating group of column values.
Let's consider the STUDENT table with his ID, Name address and 2 subjects that he has opted for.
• Look at Chris entry. He has only subject. Hence Subject2 for him is NULL. Here storage
space for second entry is simply wasted.
• In the case of Joseph, he has two subjects, Mathematics and Physics, both the columns
have values. Imagine if he opts for third subject? There is no column for his third entry. In
this case, whole table needs to be altered, which is not good at this stage. Once
database is designed, it should be a perfect one. We should not be modifying it as we
start adding/updating data.
• One of the requirements of 1NF is, each table should have primary key. This key in the
table makes each record unique. In our example we have it already- STUDENT_ID.
• Here, SUBJECT1 and SUBJECT2 are same set of columns, i.e.; it has same kind of
information stored - Subject, which is a violation of first rule of 1NF. As it states, there
should not be any repeating columns. We have to remove such columns. But think how?
In order to have STUDENT in 1NF, we have to remove multiple SUBJECT columns from STUDENT
table. Instead, create only one SUBJECT column, and for each STUDENT enters as many rows
as SUBJECT he has. After making this change, the above table will change as follows:
Now STUDENT_ID alone cannot be a primary key, because it does not uniquely determines
each record in the table. If we want to records for Joseph, and we query by his ID,100, gives
us two records. Hence Student_ID is no more a primary key. When we observe the data in the
table, all the four field uniquely determines each record. Hence all four fields together
considered as primary key.
• All subset of data, which applies to have multiple rows in a table must be removed and
placed in a new table. And this new table and the parent table should be related by the
use of foreign key.
In the 1NF STUDENT table above, Joseph and Allen have multiple rows because of their
SUBJECTS. Although it is in 1NF form, it wastes storage space by repeating whole of their
information - name and address in each row. In addition, Student ID alone is strong enough to
be a primary key. If we make Student_ID as primary, all other attributes in the table cannot be
uniquely identified. This is because of multiple rows exists for single ID. Hence it does not
satisfy second condition of 2NF.
So what we can do here is, apply the third condition of 2NF. Remove Subject from the
STUDENT table and create a separate table for it. So the two tables are - STUDENT and
SUBJECT. Now the STUDENT table will have only STUDENT information - STUDENT_ID,
STUDENT_NAME and ADDRESS. New SUBJECT table will have STUDENT_ID and
SUBJECT_NAME.
Now there is no repeating group of columns in STUDENT table and STUDENT_ID is the
primary key of STUDENT table. It uniquely identifies the Student name and address which are
non key attributes of this table. Hence it satisfies both 1NF and 2NF.
In the new SUBJECT table, Subject names opted by each student. Since same student
cannot opt for same subject multiple times in academic year, there will not be any duplicity
of data. But Student_ID alone is not unique; hence it cannot be a primary key. Both
Student_ID and Subject_Name is unique in this table. Hence both of them together become a
primary key. Hence SUBJECT table satisfies 1NF.
There is no non-key attributes in SUBJECT table. Hence we cannot verify second condition
of 2NF. According to the third condition of 2NF, we have removed the data which is forming
multiple rows and put those details in new table. It also states that there should be
relationship between the original table and new table by using foreign key constraint. In the
SUBJECT table, STUDENT_ID is derived from STUDENT table. In this table, STUDENT _ID is
part of primary key as well as it is a foreign key. Hence we can easily relate both STUDENT
and SUBJECT table by using STUDENT_ID. Hence satisfies the third condition of 2NF.
If we want to know which all subjects Joseph has opted for, we would query as below:
SELECT std.STUDENT_ID,
std.STUDENT_NAME,
sb.SUBJECT
Above query will search whole student table for 'Fraser Village Drive' and then update it to
'Fraser Village Dr’. But searching a huge table and updating the single or multiple records will
be a very time consuming, hence affecting the performance of the database.
Instead, if we have these details in a separate table ZIPCODE and is related to STUDENT
table using zip? However ZIPCODE table will have comparatively less amount of records
and we just have to update ZIPCODE table once. It will automatically reflect in the STUDENT
table! Hence making the database and query simpler! And table is in 3NF.
UPDATE ZIPCODE z
SELECT std.STUDENT_NAME,
z.STREET,
z.CITY,
z.ZIP
In the above 3NF example, STUDENT_ID is the Primary key in STUDENT table and ZIP is the
primary key in the ZIPCODE table. There is no other key column in each of the tables which
determines the functional dependency. Hence it's in BCNF form. That is, with STUDENT_ID,
we can retrieve STUDENT_NAME and ZIP from STUDENT table. Similarly, with ZIP value, we
can retrieve STREET and CITY from ZIPCODE table
Let us consider another example - consider each student who has taken major subjects has
different advisory lecturers. Each student will have different advisory lecturers for same
Subjects. There exists following relationship, which is violation of BCNF.
Below table will have all the anomalies too. If we delete any student from below table, it
deletes lecturer's information too. If we add any new lecturer/student to the database, it
needs other related information also. Also, if we update subject for any student, his lecturer
info also needs to be changed, else it will lead to inconsistency.
Hence we need to decompose the table so that eliminates so that it eliminates such
relationship. Now in the new tables below, there are no inter-dependent composite keys
(moreover, there is no composite key in both the tables). If we need to add/update/delete
any lecturer, we can directly insert record into STUDENT_ADVISOR table, without affecting
STUDENT_MAJOR table. If we need to insert/update/delete any subject for a student, then
we can directly do it on STUDENT_MAJOR table, without affecting STUDENT_ADVISOR
table. When we have both advisor as well as major subject information, then we can directly
add/update both the tables. Hence we have eliminated all the anomalies in the database.