Unit 6 - Normalization

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

UNIT VI

NORMALIZATION
At the end of this unit, the student should be able to:

• Explain normalization and its use;


• Identify the informal design guidelines of relations;
• Explain functional dependency;
• Apply normalization of tables to achieve first normal form;
• Apply normalization of tables to achieve second normal form;
• Apply normalization of tables to achieve third normal form.

Basically, to create a relational database schema, the database designer groups attributes into relations
by using his common sense or by following the guidelines of mapping a conceptual data model (such
as the ER or Enhanced-ER (EER) data model) to a relational data model. However, there is still a
need to formally analyze why one grouping of attributes into a relation schema may be better (in
terms of goodness and appropriateness) than another. In this unit, we discuss some of the theories that
have been developed with the goal of evaluating relational schemas for design quality. We can assess
formally why one set of groupings of attributes into relation schemas is better than another.

6.1 WHAT IS NORMALI ZATION ?

We recall that a relational database model consists of elements called relations. Each relation is a set
of attributes with values for each attribute such that:

- Each attribute (column) value must be a single value only.


- All values for a given attribute (column ) must be of the same data type.
- Each attribute (column) name must be unique.
- The order of attributes (columns) is insignificant
- No two tuples (rows) in a relation can be identical.
- The order of the tuples (rows) is insignificant.

• Normalization helps achieve the above mentioned relation/table characteristics.


• Normalization is a process in which we systematically examine relations for anomalies and, when
detected, remove those anomalies by splitting up the relation into two new, related, relations.
• Normalization is an important part of the database design process: Often during normalization,
the database designers get their first real look into how the data are going to interact in the
database.
• Finding problems with the database structure at this stage is strongly preferred to finding
problems further along in the development process because at this point it is fairly easy to cycle
back to the conceptual model (Entity Relationship model or Enhanced ER) and make changes.
• Normalization can also be thought of as a trade-off between data redundancy and performance.
Normalizing a relation reduces data redundancy but introduces the need for joins when all of the
data is required by an application such as a report query.
6.2 PRACTICAL USE OF NORMALIZAT ION

Most practical design projects acquire existing designs of databases from previous designs, designs in
legacy models, or from existing files. Normalization is carried out in practice so that the resulting
designs are of high quality and meet the desirable properties.

The normalization process (introduced by Codd, 1972) basically takes a relation schema through a
series of tests to certify whether it satisfies a certain normal form. The normal form of a relation
refers to the highest normal form condition that it meets, and hence indicates the degree to which it
has been normalized.

Initially, Codd proposed three normal forms, which he called first (1NF), second (2NF), and third
(3NF) normal form. A stronger definition of 3NF—called Boyce-Codd normal form (BCNF)—was
proposed later by Boyce and Codd. All these normal forms are based on a single analytical tool: the
functional dependencies among the attributes of a relation. Later, a fourth normal form (4NF) and a
fifth normal form (5NF) were introduced, based on the concepts of multivalued dependencies and
join dependencies, respectively.

However, database designers need not normalize to the highest possible normal form. Relations may
be left in a lower normalization status, such as 2NF, for performance reasons. Thus there is a process
called denormalization which stores the join of higher normal form relations as a base relation,
which is in a lower normal form.

6.3 INFORMAL DESIG N GU IDELI NES

These are some of the informal guidelines that may be used as measures to determine the quality of relation
schema design.

A. Make sure that the semantics of the attributes is clear in the schema.

The semantics of a relation refers to its meaning resulting from the interpretation of attribute values
in a tuple. In grouping attributes to form a relation schema, assume that attributes belonging to one
relation have certain real-world meaning and a proper interpretation associated with them.

Guideline 1: Design a relation schema so that it is easy to explain its meaning. Do not combine
attributes from multiple entity types and relationship types into a single relation. Intuitively, if a
relation schema corresponds to one entity type or one relationship type, it is straightforward to
interpret and to explain its meaning. Otherwise, if the relation corresponds to a mixture of multiple
entities and relationships, semantic ambiguities will result and the relation cannot be easily explained.
Figure 6-1: A simplified company relational database
schema.

B. Minimize the redundant information in tuples and update anomalies.

One goal of schema design is to minimize the storage space used by the base relations. Grouping
attributes into relation schemas has a significant effect on storage space. Storing natural joins of base
relations leads to an additional problem referred to as update anomalies. These anomalies happen
during insertion, deletion, and modification of relations.

Examples of anomalies that may happen using the schema in Figure 6-2:
• To insert a new tuple for an employee who works in department number 5, we must enter all
the attribute values of department 5 correctly so that they are consistent with the
corresponding values for department 5 in other tuples in EMP_DEPT. In the design in
Figure 6-1, we do not have to worry about this consistency problem because we enter only
the department number in the employee tuple; all other attribute values of department 5 are
recorded only once in the database, as a single tuple in the DEPARTMENT relation.
• If we delete from EMP_DEPT an employee tuple that happens to represent the last
employee working for a particular department, the information concerning that department
is lost from the database. This problem does not occur in the database in Figure 6-1 because
DEPARTMENT tuples are stored separately.
• In EMP_DEPT, if we change the value of one of the attributes of a particular department—
say, the manager of department 5—we must update the tuples of all employees who work in
that department; otherwise, the database will become inconsistent. If we fail to update some
tuples, the same department will be shown to have two different values for manager in
different employee tuples, which would be wrong.
Guideline 2: Design the base
relation schemas so that no
insertion, deletion, or
modification anomalies are
present in the relations. If any
anomalies are present, note
them clearly and make sure
that the programs that update
the database will operate
correctly.

Figure 6-2: Two relation schemas suffering from update anomalies. (a) EMP_DEPT and
(b) EMP_PROJ.

C. Avoid too many NULL values in tuples.

If many of the attributes do not apply to all tuples in the relation, we end up with many NULLs in
those tuples. This can waste space at the storage level and may also lead to problems with
understanding the meaning of the attributes and with specifying JOIN operations at the logical level.
Another problem with NULLs is how to account for them when aggregate operations such as
COUNT or SUM are applied. SELECT and JOIN operations involve comparisons; if NULL values
are present, the results may become unpredictable.

Take note that NULL values have multiple interpretations: a) the attribute does not apply to the tuple,
b) the attribute value for the tuple is unknown, c) the value is known but absent- that is, it has not been
recorded yet.

Guideline 3: Avoid placing attributes in a base relation whose values may frequently be NULL. If
NULLs are unavoidable, make sure that they apply in exceptional cases only and do not apply to a
majority of tuples in the relation.

For example: If only 15 percent of employees have individual offices, there is little justification for
including an attribute Office_number in the EMPLOYEE relation; rather, a relation
EMP_OFFICES(Essn, Office_number) can be created to include tuples for only the employees with
individual offices.

D. Avoid the generation of spurious tuples.

Guideline 4: Design relation schemas so that they can be joined with equality conditions on attributes
that are appropriately related (primary key, foreign key) pairs in a way that guarantees that no
spurious tuples are generated. Avoid relations that contain matching attributes that are not (foreign
key, primary key) combinations because joining on such attributes may produce spurious tuples.
6.3 FORMAL DESIGN G UID ELINE S

FUNCTIONAL DEPENDENCY (FD)

This is the most important concept in relational schema design theory. A functional dependency is a
constraint between two sets of attributes from the database. It describes a relationship
between ATTRIBUTES within a single relation. An attribute is FUNCTIONALLY
DEPENDENT on another if we can use the value of one attribute to determine the value of
another.

Suppose that our relational database schema has n attributes A1, A2, ..., An; let us think of the whole
database as being described by a single universal relation schema R = {A1, A2, ... , An}. Functional
dependency is denoted by X Y, between two sets of attributes X and Y that are subsets of R
specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that,
for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].

Consider the relation schema EMP_PROJ in Figure 6-2(b); from the semantics of the attributes and
the relation, we know that the following functional dependencies should hold:

a. Ssn Ename
b. Pnumber {Pname, Plocation}
c. {Ssn, Pnumber} Hours

These functional dependencies specify that (a) the value of an employee’s Social Security number
(Ssn) uniquely determines the employee name (Ename), (b) the value of a project’s number
(Pnumber) uniquely determines the project name (Pname) and location (Plocation), and (c) a
combination of Ssn and Pnumber values uniquely determines the number of hours the employee
currently works on the project per week (Hours). Alternatively,we say that Ename is functionally
determined by (or functionally dependent on) Ssn, or given a value of Ssn, we know the value of Ename,
and so on.

NORMAL FORMS

Normalization of data can be considered a process of analyzing the given relation schemas based on
their FDs and primary keys to achieve the desirable properties of (1) minimizing redundancy and
(2) minimizing the insertion, deletion, and update anomalies. The normalization procedure provides
database designers with the following:
• A formal framework for analyzing relation schemas based on their keys and on the
functional dependencies among their attributes.
• A series of normal form tests that can be carried out on individual relation schemas so that
the relational database can be normalized to any desired degree.

Although several higher normal forms have been defined, such as the 4NF and 5NF, the practical
utility of these normal forms becomes questionable when the constraints on which they are based are
rare, and hard to understand or to detect by the database designers and users who must discover
these constraints. Thus, we give particular attention to normalization only up to 3NF.

The normalization process for a given relation is summarized into:

1. Specify the Key of the relation


2. Specify the functional dependencies of the relation. Sample data (tuples) for the relation
can assist with this step.
3. Apply the definition of each normal form (starting with 1NF). Follow this as a
sequence to achieve the desirable state of 3NF relations by progressing through the
intermediate states of 1NF and 2NF if needed.
4. If a relation fails to meet the definition of a normal form, change the relation (most
often by splitting the relation into two new relations) until it meets the definition.
5. Re-test the modified/new relations to ensure they meet the definitions of each
normal form.

A. FIRST NORMAL FORM (1NF)

A table is in the 1st Normal Form if:


• Contains only atomic values.
• No repeating groups. A repeating group means that a table contains two or more columns
that are closely related. For example, a table that records data on a book and its author(s)
with the following columns: [Book ID], [Author 1], [Author 2], [Author 3] is not in 1NF
because [Author 1], [Author 2], and [Author 3] are all repeating the same attribute.
• No duplicate rows.

Example 1:

a.

This table is not in first normal form because


the [Color] column can contain multiple values.

b.

The first normal form is satisfied, as the


columns on each table all hold just one value.

Figure 6-3: (a) An un-normalized TABLE_PRODUCT and (b) its 1st normal form - result of splitting
it into two tables.
Example 2:

Using the First


Normal Form, data
redundancy increases,
as there will be many
columns with same
data in multiple rows
but each row as a
whole will be unique.

Figure 6-4: An un-normalized table Student and its 1st Normal Form.

Example 3:

Figure 6-5: (a) A relation schema that


is NOT in 1NF. (b) Sample state of
relation DEPARTMENT. (c) 1NF
version of the same relation with
redundancy.
Example 4:

Figure 6-6: Normalizing nested


relations into 1NF.
(a) Schema of the EMP_PROJ
relation with a nested relation attribute
PROJS.
(b) Sample extension of the
EMP_PROJ relation showing
nested relations within each tuple.
(c) Decomposition of EMP_PROJ
into relations EMP_PROJ1 and
EMP_PROJ2 by propagating the
primary key.

B. SECOND NORMAL FORM (2NF)

A table is in the 2nd Normal Form if:


• It is in first normal form.
• All non-key attributes are fully functionally dependent on the primary key.
• For a table that has concatenated primary key, each column in the table that is not part of
the primary key must depend upon the entire concatenated key for its existence.

A functional dependency X Y is a full functional dependency if removal of any attribute A


from X means that the dependency does not hold any more; that is, for any attribute A ϵ X, (X –
{A}) does not functionally determine Y. A functional dependency X Y is a partial dependency if
some attribute A ϵ X can be removed from X and the dependency still holds; that is, for some A ϵ
X, (X – {A}) Y.
Example 1:

This table has a composite primary


key [Customer ID, Store ID]. The
non-key attribute is [Purchase
Location]. In this case, [Purchase
Location] only depends on [Store ID],
which is only part of the primary key.
Therefore, this table does not satisfy
second normal form.

The partial functional dependency


was removed by splitting the table. In
the table [TABLE_STORE], the
column [Purchase Location] became
fully dependent on the primary key of
that table, which is [Store ID].

Figure 6-7: A relation TABLE_PURCHASE_DETAIL fixed to its 2 nd Normal Form.

C. THIRD NORMAL FORM (3NF)

A table is in the 3rd Normal Form if:

• It is in second normal form.


• There is NO transitive functional dependency.

A functional dependency X Y in a relation schema R is a transitive dependency if there exists a


set of attributes Z in R that is neither a candidate key nor a subset of any key of R, and both X→Z
and Z→Y hold.

Transitive functional dependency exists when we have the following relationships in the table: A is
functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively
dependent on A via B.
Example 1:

In the table TABLE_BOOK_DETAIL,


[Book ID] determines [Genre ID], and
[Genre ID] determines [Genre Type].
Therefore, [Book ID] determines [Genre
Type] via [Genre ID] and we have
transitive functional dependency, and this
structure does not satisfy third normal
form.

All non-key attributes are fully


functional dependent only on
the primary key. In
[TABLE_BOOK], both [Genre
ID] and [Price] are only
dependent on [Book ID]. In
[TABLE_GENRE], [Genre
Type] is only dependent on
[Genre ID].

Figure 6-8: A relation TABLE_BOOK_DETAIL fixed to its 3rd Normal Form.

You might also like