Unit - Ii
Unit - Ii
Unit - Ii
Topics: Functional Dependencies & Normalization For Relational Database Normal Forms Based on Primary Keys General Definitions of Second and Third Normal Forms Boyce-Codd Normal Form Algorithms for Relational Database Schema Design Multivalued Dependencies and Fourth Normal Form Join Dependencies and Fifth Normal Form The Database Design Process
2.0 Introduction
E. F. Codd, in the early 1970's using relational mathematics, devised a system where tables can be designed in such a way that certain "anomalies" can be eliminated by the selection of which columns (attributes) to be included in the table. Since relational mathematics is based upon "relations", it is assumed that all tables in this discussion satisfy the assumptions incorporated in a relation, mentioned earlier. The widespread use of the relational database model is a fairly recent phenomenon because the operation of joining tables requires considerable computer resources and it is only in recent years that computer hardware is such that large relational databases can be satisfactorily maintained. Suppose we are now given the task of designing and creating a database. Good database design needless to say, is important. Careless design can lead to uncontrolled data redundancies that will lead to problems with data anomalies. In this chapter we will examine a process known as Normalisation - a rigorous design tool that is based on the mathematical theory of relations which will result in very practical operational implementations. A properly normalised set of relations actually simplifies the retrieval and maintenance processes and the effort spent in ensuring good structures is certainly a worthwhile investment. Furthermore, if database relations were simply seen as file structures of some vague file system, then the power and flexibility of RDBMS cannot be exploited to the full.
2.1 Objective
a. A Bad Design E.Codd has identified certain structural features in a relation which create retrieval and update problems. Suppose we start off with a relation with a structure and details like:
Page 45
Advanced RDBMS
Simple structure This is a simple and straightforward design. It consists of one relation where we have a single tuple for every customer and under that customer we keep all his transaction records about parts, up to a possible maximum of 9 transactions. For every new transaction, we need not repeat the customer details (of name, city and telephone), we simply add on a transaction detail. Note the following disadvantages:
The relation is wide and clumsy We have set a limit of 9 (or whatever reasonable value) transactions per customer. What if a customer has more than 9 transactions? For customers with less than 9 transactions, it appears that we have to store null values in the remaining spaces. What a waste of space! The transactions appear to be kept in ascending order of P#s. What if we have to delete, for customer Codd, the part numbered 1- should we move the part numbered 2 up (or rather, left)? If we did, what if we decide later to re-insert part 2? The additions and deletions can cause awkward data shuffling.
Let us try to construct a query to "Find which customer(s) bought P# 2" ? The query would have to access every customer tuple and for each tuple, examine every of its transaction looking for (P1# = 2) OR (P2# = 2) OR (P3# = 2) ... OR (P9# = 2) A comparatively simple query seems to require a clumsy retrieval formulation! b. Another Bad Design Alternatively, why don't we re-structure our relation such that we do not restrict the number of transactions per customer. We can do this with the following structure:
Page 46
Advanced RDBMS
This way, a customer can have just any number of Part transactions without worrying about any upper limit or wasted space through null values (as it was with the previous structure). Constructing a query to "Find which customer(s) bought P# 2" is not as cumbersome as before as one can now simply state: P# = 2. But again, this structure is not without its faults:
It seems a waste of storage to keep repeated values of Cname, Ccity and Cphone. If C# 1 were to change his telephone number, we would have to ensure that we update ALL occurrences of C# 1's Cphone values. This means updating tuple 1, tuple 2 and all other tuples where there is an occurrence of C# 1. Otherwise, our database would be left in an inconsistent state. Suppose we now have a new customer with C# 4. However, there is no part transaction yet with the customer as he has not ordered anything yet. We may find that we cannot insert this new information because we do not have a P# which serves as part of the 'primary key' of a tuple. Suppose the third transaction has been canceled, i.e. we no longer need information about 25 of P# 1 being ordered on 26 Jan. We thus delete the third tuple. We are then left with the following relation:
But then, suppose we need information about the customer "Martin", say the city he is located in. Unfortunately as information about Martin was held in only that tuple and having the entire tuple deleted because of its P# transaction, meant also that we have lost all information about Martin from the relation.
Page 47
Advanced RDBMS
As illustrated in the above instances, we note that badly designed, unnormalised relations waste storage space. Worse, they give rise to the following storage irregularities:
Update anomaly: Data inconsistency or loss of data integrity can arise from data redundancy/repetition and partial update. Insertion anomaly: Data cannot be added because some other data is absent. Deletion anomaly: Data maybe unintentionally lost through the deletion of other data.
2.2 Content
2.2.1 Functional Dependencies & Normalization For Relational Databases Informal design guidelines for relational schemas Intuitively, it would seem that these undesirable features can be removed by breaking a relation into other relations with desirable structures. We shall attempt by splitting the above Transaction relation into the following two relations, Customer and Transaction, which can be viewed as entities with a one to many relationship.
Figure 4-2: 1:M data relationships Let us see if this new design will alleviate the above storage anomalies: a. Update anomaly If C# 1 were to change his telephone number, as there is only one occurrence of the tuple in the Customer relation, we need to update only that one tuple as there are no redundant/duplicate tuples. b. Addition anomaly Adding a new customer with C# 4 can be easily done in the Customer relation of which C# serves as the primary key. With no P# yet, a tuple in Transaction need not be created.
Page 48
Advanced RDBMS
c. Deletion anomaly Canceling the third transaction about 25 of P# 1 being ordered on 26 Jan would now mean deleting only the third tuple of the new Transaction relation above. This leaves information about Martin still intact in the new Customer relation. This process of reducing a relation into simpler structures is the process of Normalisation. Normalisation may be defined as a step by step reversible process of transforming an unnormalised relation into relations with progressively simpler structures. Since the process is reversible, no information is lost in the transformation. Normalisation removes (or more accurately, minimises) the undesirable properties by working through a series of stages called Normal Forms. Originally, Codd defined three types of undesirable properties:
and the stages of normalisation that remove the associated problems are defined below. We shall now show a more formal process on how we can decompose relations into multiple relations by using the Normal Form rules for structuring. According to (Elmasri & Navathe, 1994), the normalization process, as first proposed by Codd (1972), takes a relation schema through a series of tests to "certify" whether or not it belongs to a certain normal form. Initially, Codd proposed three normal forms, which he called first, second, and third normal form. A stronger definition of 3NF was proposed later by Boyce and Codd and is known as Boyce-Codd normal form (BCNF). All these normal forms are based on the functional dependencies among the attributes of a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were proposed, based on the concepts or multivalued dependencies and join dependencies, respectively. Functional Dependencies Normalization of data can be looked on as a process during which unsatisfactory relation schemas are decomposed by breaking up their attributes into smaller relation schemas that possess desirable properties. One objective of the original normalization process is to ensure that the update anomalies do not occur.
Page 49
Advanced RDBMS
Normal forms provide database designers with: A formal framework for analyzing relation schemas based on their keys and on the functional dependencies among their attributes. A series of tests that can be carried out on individual relation schema so that the relational database can be normalized to any degree. When a test fails, the relation violating that test must be decomposed into relations that individually meet the normalization tests. Normal forms, when considered in isolation from other factors, do not guarantee a good database design. It is generally not sufficient to check separately that each relation schema in the database is, say, in BCNF or 3NF. Rather, the process of normalization through decomposition must also confirm the existence of additional properties that the relational schemas, taken together, should possess. Two of these properties are: The lossless join or nonadditive join property, which guarantees that the spurious tuple problem does not occur. The dependency preservation property, which ensures that all functional dependencies are represented in some of the individual resulting relations. Let's begin by creating a sample set of data. Imagine we are working on a system to keep track of employees working on certain projects. Project number 1023 Project name Employee number Employee name Rate category Hourly rate $60 $50 $40 $60 $50
1056
Online agency
estate
11 17
A problem with the above data should immediately be obvious. Tables in relational databases, which would include most databases you'll work with, are in a simple grid, or table format. Here, each project has a set of employees. So we couldn't even enter the data into this kind of table. And if we tried to use null fields to cater for the fields that have no value, then we cannot use the project number, or any other field, as a primary key (a primary key is a field, or list of fields, that uniquely identify one record). There is not much use in having a table if we can't uniquely identify each record in it.
Page 50
Advanced RDBMS
So, our solution is to make sure that each field has no sets, or repeating groups. Now we can place the data in a table. employee_project table Project number 1023 1023 1023 1056 1056 Project name Madagascar site Madagascar site Madagascar site Online agency Online agency travel travel travel estate estate Employee number 11 12 16 11 17 Employee name Rate category Hourly rate $60 $50 $40 $60 $50
Notice that the project number cannot be a primary key on it's own. It does not uniquely identify a row of data. So, our primary key must be a combination of project number and employee number. Together these two fields uniquely identify one row of data. (Think about it. You would never add the same employee more than once to a project. If for some reason this could occur, you'd need to add something else to the key to make it uniqueSo, now our data can go in table format, but there are still some problems with it. We store the information that code 1023 refers to the Madagascar travel site 3 times! Besides the waste of space, there is another serious problem. Look carefully at the data below. employee_project table Project number 1023 1023 1023 1056 1056 Project name Madagascar site Madagascar site Madagascat site Online agency Online agency travel travel travel estate estate Employee number 11 12 16 11 17 Employee name Rate category Hourly rate $60 $50 $40 $60 $50
Page 51
Advanced RDBMS
Did you notice anything strange in the data above? Congratulations if you did! Madagascar is misspelt in the 3rd record. Now imagine trying to spot this error in a table with thousands of records! By using the structure above, the chances of the data being corrupted increases drastically. The solution is simply to take out the duplication. What we are doing formally is looking for partial dependencies, ie fields that are dependent on a part of a key, and not the entire key. Since both project number and employee number make up the key, we look for fields that are dependent only on project number, or on employee number. We identify two fields. Project name is dependent on project number only (employee_number is irrelevant in determining project name), and the same applies to employee name, hourly rate and rate category, which are dependent on employee number. So, we take out these fields, as follows: employee_project table Project number 1023 1023 1023 1056 1056 Employee number 11 12 16 11 17
Clearly we can't simply take out the data and leave it out of our database. We take it out, and put it into a new table, consisting of the field that has the partial dependency, and the field it is dependent on. So, we identified employee name, hourly rate and rate category as being dependent on employee number. The new table will consist of employee number as a key, and employee name, rate category and hourly rate, as follows: Employee table Employee number 11 12 16 17 Employee name Vincent Radebe Pauline James Charles Ramoraz Monique Williams Rate category A B C B Hourly rate $60 $50 $40 $50
And the same for the project data. Project table Page 52
Advanced RDBMS
Project number Project name 1023 Madagascar travel site 1056 Online estate agency Note the reduction of duplication. The text "Madagascar travel site" is stored once only, not for each occurrence of an employee working on that project. The link is made through the key, the project number. Obviously there is no way to remove the duplication of this number without losing the relation altogether, but it is far more efficient storing a short number repeatedly, than a large piece of text We're still not perfect. There is still room for anomalies in the data. Look carefully at the data below. Employee table Employee number 11 12 16 17 Employee name Vincent Radebe Pauline James Charles Ramoraz Monique Williams Rate category A B C B Hourly rate $60 $50 $40 $40
The problem above is that Monique Williams has been awarded an hourly rate of $40, when she is actually category B, and should be earning $50 (In the case of this company, the rate category - hourly rate relationship is fixed. This may not always be the case). Once again we are storing data redundantly: the hourly rate - rate category relationship is being stored in its entirety for each employee. The solution, as before, is to remove this excess data into its own table. Formally, what we are doing is looking for transitive relationships, or relationships where a non-key attribute is dependent on another non-key relationship. Hourly rate, while being in one sense dependent on Employee number (we probably identified this dependency earlier, when looking for partial dependencies) is actually dependent on Rate category. So, we remove it, and place it in a new table, with its actual key, as follows. Employee table Employee number 11 12 16 17 Rate table Page 53 Employee name Vincent Radebe Pauline James Charles Ramoraz Monique Williams Rate category A B C B
Advanced RDBMS
Rate category A B C Hourly rate $60 $50 $40
We've cut down once again. It is now impossible to mistakenly assume rate category "B" is associated with an hourly rate of anything but $50. These relationships are only stored in once place - our new table, where it can be ensured they are accurate. a. Modification Anomalies
Once our E-R model has been converted into relations, we may find that some relations are not properly specified. There can be a number of problems: o Deletion Anomaly: Deleting a relation results in some related information (from another entity) being lost. o Insertion Anomaly: Inserting a relation requires we have information from two or more entities - this situation might not be feasible. Here is a quick example: A company has a Purchase order form:
Page 54
Advanced RDBMS
LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty) PO_HEADER (PO_Number, PODate, Vendor, Ship_To, ...) Consider some sample data for the LINE_ITEMS relation: PO_Number ItemNum PartNum Description Price Qty O101 O101 O101 O102 O102 O103
What are some of the problems with this relation ? What happens when we delete item 2 from Order O101 ? These problems occur because the relation in question contains data about 2 or more themes. Typical way to solve these anomalies is to split the relation in to two or more relations - Process called Normalization.
Page 55
Advanced RDBMS
Normal Forms based on Primary keys The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date. The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account. a. First Normal Form First normal form is now considered to be part of the formal definition of a relation; historically, it was defined to disallow multivalued attributes, composite attributes, and their combinations. It states that the domains of attributes must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute. Practical Rule: "Eliminate Repeating Groups," i.e., make a separate table for each set of related attributes, and give each table a primary key. Formal Definition: A relation is in first normal form (1NF) if and only if all underlying simple domains contain atomic values only. First normal form deals with the "shape" of a record type. Under first normal form, all occurrences of a record type must contain the same number of fields. First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
Page 56
Advanced RDBMS
Example 1: Let's run again through the example we've just done, this time without the data tables to guide us. After all, when you're designing a system, you usually won't have test data available at this stage. The tables were there to show you the consequences of storing data in unnormalized tables, but without them we can focus on dependency issues, which is the key to database normalization. In the beginning, the data structure we had was as follows: Project number Project name 1-n Employee numbers (1-n indicates that there are many occurrences of this field - it is a repeating group) 1-n Employee names 1-n Rate categories 1-n Hourly rates So, to begin the normalization process, we start by moving from zero normal form to 1st normal form. The definition of 1st normal form There are no repeating groups All the key attributes are defined All attributes are dependent on the primary key So far, we have no keys, and there are repeating groups. So we remove the repeating groups, and define the primary key, and are left with the following: Employee project table Project number - primary key Project name Employee number - primary key Employee name Rate category Hourly rate This table is in 1st normal form. Example 2: Consider this example
Page 57
Advanced RDBMS
No repeating groups. As an example, it might be tempting to make an invoice table with columns for the first, second, and third line item (see above). This violates the first normal form, and would result in large rows, wasted space (where an invoice had less than the maximum number of line items), and *horrible* SQL statements with a separate join for each repetition of the column. First form normalization requires you make a separate line item table, with it's own key (in this case the combination of invoice number and line number) (See below).
To conclude a relation is in first normal form if it meets the definition of a relation: 1. 2. 3. 4. 5. 6. Each column (attribute) value must be a single value only. All values for a given column (attribute) must be of the same type. Each column (attribute) name must be unique. The order of columns is insignificant. No two rows (tuples) in a relation can be identical. The order of the rows (tuples) is insignificant. Page 58
Advanced RDBMS
If you have a key defined for the relation, then you can meet the unique row requirement. Example relation in 1NF: STOCKS (Company, Symbol, Date, Close_Price) Company Symbol Date IBM IBM IBM Netscape Netscape IBM IBM IBM NETS NETS Close Price
01/05/94 101.00 01/06/94 100.50 01/07/94 102.00 01/05/94 33.00 01/06/94 112.00
We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE. b. General Definitions of Second and Third Normal Forms Second normal form is based on the concept of fully functional dependency. A functional X->Y is a fully functional dependency is removal of any attribute A from X means that the dependency does not hold any more. A relation schema is in Second Normal Form if every nonprime attribute in relation is fully functionally dependent on the primary key of the relation. It also can be restated as: a relation schema is in Second Normal Form if every nonprime attribute in relation is not partially dependent on any key of the relation. Practical Rule: "Eliminate Redundant Data," i.e., if an attribute depends on only part of a multivalued key, remove it to a separate table. Formal Definition: A relation is in second normal form (2NF) if and only if it obeys the conditions of First Normal Form and every nonkey attribute is fully dependent on the primary key. To put it simple a table is in 2nd normal form if Its in 1st normal form It includes no partial dependencies (where an attribute is dependent on only a part of a primary key).
Page 59
Advanced RDBMS
Example 1: So, we go through all the fields. Considering our example, Project name is only dependent on Project number. Employee name, Rate category and Hourly rate are dependent only on Employee number. So we remove them, and place these fields in a separate table, with the key being that part of the original key they are dependent on. So, we are left with the following 3 tables: Employee project table Project number - primary key Employee number - primary key Employee table Employee number - primary key Employee name Rate category Hourly rate Project table Project number - primary key Project name The table is now in 2nd normal form. Example 2: Consider this example. As we know that second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record: --------------------------------------------------| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS | ====================------------------------------The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:
The warehouse address is repeated in every record that refers to a part stored in that warehouse.
Page 60
Advanced RDBMS
If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated. Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse. If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records: ------------------------------- -------------------------------| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSEADDRESS | ====================----------- =============------------------When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third. The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs. To summarize,
A relation is in second normal form (2NF) if all of its non-key attributes are dependent on all of the key. Relations that have a single attribute for a key are automatically in 2NF. This is one reason why we often use artificial identifiers as keys. In the example below, Close Price is dependent on Company, Date and Symbol, Date The following example relation is not in 2NF: STOCKS (Company, Symbol, Headquarters, Date, Close_Price) Company Symbol Headquarters Date IBM SONY SONY Netscape Netscape IBM SONY SONY NETS NETS Armonk, NY Armonk, NY Armonk, NY Close Price
Page 61
Advanced RDBMS
Company, Date -> Close Price Symbol, Date -> Close Price Company -> Symbol, Headquarters Symbol -> Company, Headquarters Consider that Company, Date -> Close Price. So we might use Company, Date as our key. However: Company -> Headquarters This violates the rule for 2NF. Also, consider the insertion and deletion anomalies. One Solution: Split this up COMPANY (Company, Symbol, STOCKS (Symbol, Date, Close_Price) into two relations: Headquarters)
Company -> Symbol, Headquarters Symbol -> Company, Headquarters Symbol Date SONY SONY SONY NETS NETS Symbol, Date -> Close Price c. Third Normal Form Third normal form is based on the concept of transitive dependency. A functional dependency X->Y in a relation is a transitive dependency if there is a set of attributes Z that is not a subset of any key of the relation, and both X->Z and Z->Y hold. In other words, a relation is in 3NF if, whenever a functional dependency X->A holds in the relation, either (a) X is a superkey of the relation, or (b) A is a prime attribute of the relation. Practical Rule: "Eliminate Columns not Dependent on Key," i.e., if attributes do not contribute to a description of a key, remove them to a separate table. Close Price
01/05/94 101.00 01/06/94 100.50 01/07/94 102.00 01/05/94 33.00 01/06/94 112.00
Page 62
Advanced RDBMS
Formal Definition: A relation is in third normal form (3NF) if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. To put it simple the definition of 3rd normal form It's in 2nd normal form It contains no transitive dependencies (where a non-key attribute is dependent on another non-key attribute). Example 1: We can narrow our search down to the Employee table, which is the only one with more than one non-key attribute. Employee name is not dependent on either Rate category or Hourly rate, the same applies to Rate category, but Hourly rate is dependent on Rate category. So, as before, we remove it, placing it in it's own table, with the attribute it was dependent on as key, as follows: Employee project table Project number - primary key Employee number - primary key Employee table Employee number - primary key Employee name Rate Category Rate table Rate category - primary key Hourly rate Project table Project number - primary key Project name These tables are all now in 3rd normal form, and ready to be implemented. Example 2: Third normal form is violated when a non-key field is a fact about another non-key field, as in
Page 63
Advanced RDBMS
-----------------------------------| EMPLOYEE | DEPARTMENT | LOCATION | ============-----------------------The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:
The department's location is repeated in the record of every employee assigned to that department. If the location of the department changes, every such record must be updated. Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department. If a department has no employees, there may be no record in which to keep the department's location.
To satisfy third normal form, the record shown above should be decomposed into the two records: ------------------------- ------------------------| Item No. | Part Number | | Part Number | Part Name | ------------------------- -------------------------To summarize, a record is in third normal forms if
If it is in second normal form and it contains no transitive dependencies. Consider relation R containing attributes A, B and If A -> B and B -> C then A -> C
C.
Transitive Dependency: Three attributes with the above dependencies. Example: At CUNY: Course_Code -> Course_Num, Section Course_Num, Section -> Classroom, Professor Example: At Rutgers: Course_Index_Num -> Course_Num, Section Course_Num, Section -> Classroom, Professor
Page 64
Advanced RDBMS
Example: Company County Tax Rate IBM AT&T Company -> County and County -> Tax Rate thus Company -> Tax Rate What happens if we We loose information about 2 different themes. Split this up into two relations: Company County SONY AT&T Company -> County County Tax Rate Putnam 28% Ritchie 26% County -> Tax Rate Before you rush off and start normalizing everything, a word of warning. No process is better than good old common sense. Take a look at this example. Customer table Number - primary key Name Address Zip Code Town Putnam Ritchie remove AT&T? Putnam 28% Bergen 26%
Page 65
Advanced RDBMS
What normal form is this table in? Giving it a quick glance, we see no repeating groups, and a primary key defined, so it's at least in 1st normal form. There's only one key, so we needn't even look for partial dependencies, so it's at least in 2nd normal form. How about transitive dependencies? Well, it looks like Town might be determined by Zip Code. And in most parts of the world that's usually the case. So we should remove Town, and place it in a separate table, with Zip Code as the key? No! Although this table is not technically in 3rd normal form, removing this information is not worth it. Creating more tables increases the load slightly, slowing processing down. This is often counteracted by the reduction in table sizes, and redundant data. But in this case, where the town would almost always be referenced as part of the address, it isn't worth it. Perhaps a company that uses the data to produce regular mailing lists of thousands of customers should normalize fully. It always comes down to how the data is going to be used. Normalization is just a helpful process that usually results in the most efficient table structure, and not a rule for database design. But judging from some of the table structures I've seen out there, it's better to err and normalize than err and not! Functional dependency - a field Y is functionally dependent on a field (or fields) X if it is invalid to have two records with the same X value but different Y values. (A given X must always occur with the same Y.)
A Functional Dependency describes a relationship between attributes in a single relation. An attribute is functionally dependant on another if we can use the value of one attribute to determine the value of another.
Example: Employee_Name is functionally dependant on Social_Security_Number because Social_Security_Number can be used to determine the value of Employee_Name. We use the symbol -> -> is read functionally determines to indicate a functional dependency.
Student_ID -> Student_Major Student_ID, Course#, Semester# -> Grade SKU -> Compact_Disk_Title, Artist Model, Options, Tax -> Car_Price Course_Number, Section -> Professor, Classroom, Number of Students The attributes listed on the left hand side of the -> are called determinants. One can read A -> B as, "A determines B". i. Keys and Uniqueness
Key: One or more attributes that uniquely identify a tuple (row) in a relation.
Page 66
Advanced RDBMS
The selection of keys will depend on the particular application being considered. Users can offer some guidance as to what would make an appropriate key. Also this is pretty much an art as opposed to an exact science. Recall that no two relations should have exactly the same values, thus a candidate key would consist of all of the attributes in a relation. A key functionally determines a tuple (row).
Not all determinants are keys. Consider this example. In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value. There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data: PERSON John Smith John Smith ADDRESS 123 Main St., New York 321 Center St., San Francisco
Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact. Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency. --------------------------------------| PERSON | ADDRESS | -------------+------------------------| John Smith | 123 Main St., New York | | John Smith | 123 Main Street, NYC | ---------------------------------------
Page 67
Advanced RDBMS
We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms. For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier. ---------------------------------------------------------| EMPLOYEE | FATHER | FATHER'S-ADDRESS |============------------+-------------------------------| | Art Smith | John Smith | 123 Main St., New York | | Bob Smith | John Smith | 123 Main Street, NYC | | Cal Smith | John Smith | 321 Center St., San Francisco | ---------------------------------------------------------|
However, in formal terms, there is no functional dependency here between FATHER'SADDRESS and FATHER, and hence no violation of third normal form. d. Boyce-Codd normal form Boyce-Codd normal form is stricter than 3NF, meaning that every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF. A relation schema is an BCNF if whenever a functional dependency X->A holds in the relation, then X is a superkey of the relation. The only difference between BCNF and 3NF is that condition (b) of 3NF, which allows A to be prime if X is not a superkey, is absent from BCNF. Formal Definition: A relation is in Boyce/Codd normal form (BCNF) if and only if every determinant is a candidate key. [A determinant is any attribute on which some other attribute is (fully) functionally dependent.] To put it simple, a relation is in Boyce-Codd if every determinant is a candidate key. Steps in analyzing for BCNF: (1) Find and list all the candidate keys. (Usually the primary key is known.) (2) Determine and list all functional dependencies, noting those which are dependent on attributes which are not the entire primary key. (3) Determine if any dependencies exist which are based on part but not all of a candidate key. (4) Project into relations which remove the problems found in (3). To summarize,
Page 68
Advanced RDBMS
A relation is in BCNF if every determinant is a candidate key. Recall that not all determinants are keys. Those determinants that are keys we initially call candidate keys. Eventually, we select a single candidate key to be the primary key for the relation. Consider the following example: Funds consist of one or more Investment Types. Funds are managed by one or more Managers Investment Types can have one more Managers Managers only manage one type of investment. FundID InvestmentType Manager 99 99 33 22 11 Common Stock Common Stock Growth Stocks Common Stock Smith Green Brown Smith Municipal Bonds Jones
FundID, InvestmentType -> Manager FundID, Manager -> InvestmentType Manager -> InvestmentType In this case, the combination FundID and InvestmentType form a candidate key because we can use FundID,InvestmentType to uniquely identify a tuple in the relation. Similarly, the combination FundID and Manager also form a candidate key because we can use FundID, Manager to uniquely identify a tuple. Manager by itself is not a candidate key because we cannot use Manager alone to uniquely identify a tuple in the relation. Is this relation R(FundID, InvestmentType, Manager) in 1NF, 2NF or 3NF ? Given we pick FundID, InvestmentType as the Primary Key: 1NF for sure. 2NF because all of the non-key attributes (Manager) is dependant on all of the key. 3NF because there are no transitive dependencies. Consider what happens if we delete the tuple with FundID 22. We loose the fact that Brown manages the InvestmentType "Common Stocks." The following are steps to normalize a relation into BCNF: 1. List all of the determinants.
Page 69
Advanced RDBMS
2. See if each determinant can act as a key (candidate keys). 3. For any determinant that is not a candidate key, create a new relation from the functional dependency. Retain the determinant in the original relation. For our Rorig(FundID, InvestmentType, Manager) The FundID, FundID, Manager Which determinants FundID, FundID, Manager NO determinants example: are: InvestmentType Manager as keys ? YES YES
Create a new relation from the functional dependency: Rnew(Manager, Rorig(FundID, Manager) InvestmentType)
In this last step, we have retained the determinant "Manager" in the original relation Rorig. Relational Database Design and Further Dependencies Fourth and fifth normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a manyto-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees. In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow. Multivalued Dependencies and Fourth Normal Form Multivalued dependencies are a consequence of first normal form, which disallowed an attribute in a tuple to have a set of values. If we have two or more multivalued independent attributes in the same relation schema, we get into a problem of having to repeat every value of one of the attributes with every value of the other attribute to keep the relation instances consistent.
Page 70
Advanced RDBMS
Fourth normal form is based on multivalued dependencies, which is violated when a relation has undesirable multivalued dependencies, and hence can be used to identify and decompose such relations. A relation scheme R is in 4NF with respect to a set of dependencies F is, for every nontrivial multivalued dependency X->F, X is a superkey for R. Practical Rule: "Isolate Independent Multiple Relationships," i.e., no table may contain two or more 1:n or n:m relationships that are not directly related. Formal Definition: A relation R is in fourth normal form (4NF) if and only if, whenever there exists a multivalued dependency in the R, say A->>B, then all attributes of R are also functionally dependent on A. Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form. The term "independent" will be discussed after considering an example. Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as ------------------------------| EMPLOYEE | SKILL | LANGUAGE | =============================== Instead, they should be represented in the two records -------------------- ----------------------| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE | ==================== ======================= Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example. The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record: (1) A disjoint format, in which a record contains either a skill or a language, but not both: ------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------|
Page 71
Advanced RDBMS
| Smith | cook | | | Smith | type | | | Smith | | French | | Smith | | German | | Smith | | Greek | ------------------------------This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.) (2) A random mix, with three variations: (a) Minimal number of records, with repetitions: ------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | type | German | | Smith | type | Greek | ------------------------------(b) Minimal number of records, with null values: ------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | type | German | | Smith | | Greek | ------------------------------(c) Unrestricted: ------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | type | | | Smith | | German | | Smith | type | Greek | -------------------------------
Page 72
Advanced RDBMS
(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages: ------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | cook | German | | Smith | cook | Greek | | Smith | type | French | | Smith | type | German | | Smith | type | Greek | ------------------------------Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:
If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent. Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages. Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.
Fourth normal form minimizes such update problems. a. Independence We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed. In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:
Page 73
Advanced RDBMS
------------------------------| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | type | French | | Smith | type | German | | Smith | type | Greek | ------------------------------Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record. b. Multivalued Dependencies For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "crossproduct" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent. We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:
An employee uses certain skills on certain projects. An employee uses certain languages on certain projects.
If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form. To summarize,
Page 74
Advanced RDBMS
Multivalued Dependency: A type of functional dependency where the determinant can determine more than one value. More formally, there are 3 criteria: 1. There must be at least 3 attributes in the relation. call them A, B, and C, for example. 2. Given A, one can determine multiple values of B. Given A, one can determine multiple values of C. 3. B and C are independent of one another. Book example: Student has one or more majors. Student participates in one or more activities. StudentID Major 100 100 100 100 200 StudentID ->-> Major StudentID ->-> Activities CIS CIS Activities Baseball Volleyball
Portfolio ID Stock Fund 999 999 999 999 888 Janus Fund Janus Fund Scudder Fund Scudder Fund Global
Bond Fund Municipal Bonds Dreyfus Short-Intermediate Municipal Bond Fund Municipal Bonds
Global Dreyfus Short-Intermediate Municipal Bond Fund T. Rowe Price Emerging Markets Bond Fund
Kaufmann Fund
Advanced RDBMS
2. All three attributes taken together form the key. 3. Latter two attributes are independent of one another. 4. Insertion anomaly: Cannot add a stock fund without adding a bond fund (NULL Value). Must always maintain the combinations to preserve the meaning. Stock Fund and Bond Fund form a multivalued dependency on Portfolio ID. PortfolioID ->-> Stock Fund PortfolioID ->-> Bond Fund Resolution: Split into two tables with the common key: Portfolio ID Stock Fund 999 999 888 Janus Fund Scudder Global Fund Kaufmann Fund
Portfolio ID Bond Fund 999 999 888 Municipal Bonds Dreyfus Short-Intermediate Municipal Bond Fund T. Rowe Price Emerging Markets Bond Fund
c. Join Dependencies and Fifth Normal Form In some cases there may be no losses join decomposition into two relation schemas but there may be a losses join decomposition into more than two relation schemas. These cases are handled by the join dependency and fifth normal form, and its important to note that these cases occur very rarely and are difficult to detect in practice. Practical Rule: "Isolate Semantically Related Multiple Relationships," i.e., there may be practical constraints on information that justify separating logically related many-tomany relationships. Formal Definition: A relation R is in fifth normal form (5NF)also called projectionjoin normal form (PJNF)if and only if every join dependency in R is a consequence of the candidate keys of R. A join dependency (JD) specified on a relations schema R, specifies a constraint on instances of R. The constraint states that every legal instance of R should have a losses join decomposition into sub-relations of R, that when reunited make the entire relation R. A relation schema R is in fifth normal form (5NF) (or project-join normal form (PJNF)) with respect to a set F of functional, multivalued, and join dependencies if, for every Page 76
Advanced RDBMS
nontrivial join dependency JD(R1, R2, , Rn) in F (implied by F), every Ri is a superkey of R. Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others. We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields: ----------------------------| AGENT | COMPANY | PRODUCT | |-------+---------+---------| | Smith | Ford | car | | Smith | GM | truck | ----------------------------This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not. But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company. ----------------------------| AGENT | COMPANY | PRODUCT | |-------+---------+---------| | Smith | Ford | car | | Smith | Ford | truck | | Smith | GM | car | | Smith | GM | truck | | Jones | Ford | car | ----------------------------In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:
Page 77
Advanced RDBMS
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT | |-------+---------| |---------+---------| |-------+---------| | Smith | Ford | | Ford | car | | Smith | car | | Smith | GM | | Ford | truck | | Smith | truck | | Jones | Ford | | GM | car | | Jones | car | ------------------- | GM | truck | --------------------------------------These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not. Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms. Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form. One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times. It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form. It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all. The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.
Page 78
Advanced RDBMS
----------------------------| AGENT | COMPANY | PRODUCT | |-------+---------+---------| | Smith | Ford | car | | Smith | Ford | truck | | Smith | GM | car | | Smith | GM | truck | | Jones | Ford | car | | Jones | Ford | truck | | Brown | Ford | car | | Brown | GM | car | | Brown | Totota | car | | Brown | Totota | bus | ----------------------------------------------- --------------------- ------------------| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT | |-------+---------| |---------+---------| |-------+---------| | Smith | Ford | | Ford | car | | Smith | car | Fifth | Smith | GM | | Ford | truck | | Smith | truck | Normal | Jones | Ford | | GM | car | | Jones | car | Form | Brown | Ford | | GM | truck | | Jones | truck | | Brown | GM | | Toyota | car | | Brown | car | | Brown | Toyota | | Toyota | bus | | Brown | bus | ------------------- --------------------- ------------------Observe that:
Jones sells cars and GM makes cars, but Jones does not represent GM. Brown represents Ford and Ford makes trucks, but Brown does not sell trucks. Brown represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.
Page 79
Advanced RDBMS
In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case. To summarize,
There are certain conditions under which after decomposing a relation, it cannot be reassembled back into its original form. We don't consider these issues here. Inclusion Dependencies, other Dependencies and Normal Forms
2.2.2
Domain Key Normal Form (DKNF) We can also always define stricter forms that take into account additional types of dependencies and constraints. The idea behind domain-key normal form is to specify, (theoretically, at least) the "ultimate normal form" that takes into account all possible dependencies and constraints. A relation is said to be in DKNF if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation. For a relation in DKNF, it becomes very straightforward to enforce the constraints by simply checking that each attribute value in a tuple is of the appropriate domain and that every key constraint on the relation is enforced. However, it seems unlikely that complex constraints can be included in a DKNF relation; hence, its practical utility is limited
A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains.
Constraint: An rule governing static values of an attribute such that we can determine if this constraint is True or False. Examples: 1. 2. 3. 4. Functional Dependencies Multivalued Dependencies Inter-relation rules Intra-relation rules
Key: Unique identifier of a tuple. Domain: The physical (data type, size, NULL values) and semantic (logical) description of what values an attribute can hold. There is no known algorithm for converting a relation directly into DK/NF.
Page 80
Advanced RDBMS
Unavoidable Redundancies Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown in the independence topic, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times. a. Inter-Record Redundancy The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundancies.. Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy: ------------------------- ------------------------| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION | ============------------- ==============--------------------------------| EMPLOYEE | LOCATION | ============----------In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy. Inter-record redundancy has been recognized for some time, and has recently been addressed in terms of normal forms and normalization. While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:
Single-valued vs. multi-valued facts. Dependency on the entire key. Independent vs. dependent facts. The presence of mutual constraints. The presence of non-unique or non-singular representations.
Page 81
Advanced RDBMS
And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications. Practical Database Design and Tuning: Organisations The Role of Information Systems in
The role played by the Information Systems at this Information Technology explosion is very eminent and it has become a necessity . The Information System makes an organization to function effectively and efficiently. It provides information to all the levels of the organization the needed information for the current and future needs and growth. The information Systems help the management people to organize their schedules and plan for the development and growth of the organization themselves. Modern technologies goes a step even deeper that it gives the information to the mobile phones of the top authorities. They are provided with the needed information on their palm tops The Database Design Process Many of you asked for a "complete" example that would run through all of the normal forms from beginning to end using the same tables. This is tough to do, but here is an attempt: Example EMPLOYEE ( Name, Project, Task, Office, Phone ) Note: Keys are underlined. Example Data: Name Project Task Office Floor Phone Bill Bill Bill Bill Sue Sue Sue Ed 100X 100X 200Y 200Y 100X 200Y 300Z 100X T1 T2 T1 T2 T33 T33 T33 T2 400 400 400 400 442 442 442 588 4 4 4 4 4 4 4 5 1400 1400 1400 1400 1442 1442 1442 1588 relation:
Page 82
Advanced RDBMS
Name is the employee's name Project is the project they are working on. Bill is working on two different projects, Sue is working on 3. Task is the current task being worked on. Bill is now working on Tasks T1 and T2. Note that Tasks are independent of the project. Examples of a task might be faxing a memo or holding a meeting. Office is the office number for the employee. Bill works in office number 400. Floor is the floor on which the office is located. Phone is the phone extension. Note this is associated with the phone in the given office.
List all of the functional dependencies for EMPLOYEE. Are all of the non-key attributes dependant on all of the key ? Split into two relations EMPLOYEE_PROJECT_TASK and EMPLOYEE_OFFICE_PHONE. EMPLOYEE_PROJECT_TASK (Name, Project, Task) Name Project Task Bill Bill Bill Bill Sue Sue Sue Ed 100X 100X 200Y 200Y 100X 200Y 300Z 100X T1 T2 T1 T2 T33 T33 T33 T2
EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone) Name Office Floor Phone Bill Sue 400 442 4 4 1400 1442
Page 83
Advanced RDBMS
Ed c. Third Normal Form
588
1588
Assume each office has exactly one phone number. Are there any transitive dependencies ? Where are the modification anomalies in EMPLOYEE_OFFICE_PHONE ? Split EMPLOYEE_OFFICE_PHONE.
EMPLOYEE_PROJECT_TASK (Name, Project, Task) Name Project Task Bill Bill Bill Bill Sue Sue Sue Ed 100X 100X 200Y 200Y 100X 200Y 300Z 100X T1 T2 T1 T2 T33 T33 T33 T2
EMPLOYEE_OFFICE (Name, Office, Floor) Name Office Floor Bill Sue Ed 400 442 588 4 4 5
EMPLOYEE_PHONE (Office, Phone) Office Phone 400 442 588 d. Boyce-Codd Normal Form 1400 1442 1588
Page 84
Advanced RDBMS
List all of the functional dependencies for EMPLOYEE_PROJECT_TASK, EMPLOYEE_OFFICE and EMPLOYEE_PHONE. Look at the determinants. Are all determinants candidate keys ?
Are there any multivalued dependencies ? What are the modification anomalies ? Split EMPLOYEE_PROJECT_TASK. EMPLOYEE_PROJECT (Name, Project ) Name Project Bill Bill Sue Sue Sue Ed EMPLOYEE_TASK (Name, Task ) Name Task Bill Bill Sue Ed T1 T2 T33 T2 100X 200Y 100X 200Y 300Z 100X
EMPLOYEE_OFFICE (Name, Office, Floor) Name Office Floor Bill Sue Ed R4 (Office, Phone) 400 442 588 4 4 5
Page 85
Advanced RDBMS
Office Phone 400 442 588 1400 1442 1588
At each step of the process, we did the following: 1. 2. 3. 4. Write out the relation (optionally) Write out some example data. Write out all of the functional dependencies Starting with 1NF, go through each normal form and state why the relation is in the given normal form.
Another short example Consider the following example of normalization for a CUSTOMER relation. Relation CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone) Example Data CustomerID Name C101 C102 Bill Smith Street City State Zip NJ Phone Name
f. Functional Dependencies CustomerID -> Name, Street, City, State, Zip, Phone Zip -> City, State g. Normalization
1NF Meets the definition of a relation. 2NF All non key attributes are dependent on all of the key. 3NF There are no transitive dependencies. BCNF Relation CUSTOMER is not in BCNF because one of the determinants Zip can not act as a key for the entire relation. Solution: Split CUSTOMER into two relations: CUSTOMER (CustomerID, Name, Street, Zip, Phone) ZIPCODES (Zip, City, State)
Page 86
Advanced RDBMS
Check both CUSTOMER and ZIPCODE to ensure they are both in 1NF up to BCNF.
As a final step, consider de-normalization. 2.2.3 An overview of Databases Tuning In Relational Databases & Automated Design Tools The databases needed to be tuned and pruned in order to provide with a updated and upgraded information. So it is vital to manage the database . If you want to get the maximum performance from your applications you need to tune your SQL statements. Tuning of SQL statements means discovering the execution plan that Oracle is using. Once the execution plan is known one can attempt to improve it. The performance of Query can be improved in many ways. By creating indexes one can increase the size of the buffer cache, and use optimizer hints. Hints are instructions to the Oracle optimizer that are buried within your statement. These Hints are used to control virtually any aspect of statement execution. Two important points when tuning rollback segments are Detecting contention and reducing shrinkage. Contention occurs when there are too few rollback segments in your database for the amount of updates that are occurring. Shrinkage occurs when one defined an optimal size for a rollback segment and then the rollback segment grows beyond that size and is forced to shrink back again. Normalization is carried out in practice so that the resulting designs are of high quality and meet the desirable properties The practical utility of these normal forms becomes questionable when the constraints on which they are based are hard to understand or to detect The database designers need not normalize to the highest possible normal form. ( usually up to 3NF, BCNF or 4NF) Denormalization: the process of storing the join of higher normal forms relations as a base relation which is in a lower normal form
Page 87
Advanced RDBMS
Constraint: An rule governing static values of an attribute such that we can determine if this constraint is True or False. There are three types of anomalies involved (1)update anomaly, (2)insertion anomaly and (3)deletion anomaly. A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains.
2.5. Summary
Normalization of data can be looked on as a process during which unsatisfactory relation schemas are decomposed by breaking up their attributes into smaller relation schemas that possess desirable properties A relation is in first normal form (1NF) if and only if all underlying simple domains contain atomic values only. A relation is in second normal form (2NF) if and only if it is in 1NF and every non key attribute is fully dependent on the primary key. A relation is in third normal form (3NF) if and only if it is in 2NF and every non key attribute is non transitively dependent on the primary key. A relation is in Boyce/Codd normal form (BCNF) if and only if every determinant is a candidate key. [A determinant is any attribute on which some other attribute is (fully) functionally dependent.] A relation R is in fourth normal form (4NF) if and only if, whenever there exists a multivalued dependency in the R, say A->>B, then all attributes of R are also functionally dependent on A. Multivalued dependencies are defined essentially as relationships which accept the "cross-product" maintenance policy We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields
Page 88
Advanced RDBMS
A relation R is in fifth normal form (5NF)also called projection-join normal form (PJNF)if and only if every join dependency in R is a consequence of the candidate keys of R. 1. ________may be defined as a step by step reversible process of transforming an unnormalised relation into relations with progressively simpler structures. 2. A relation R is in ___________ normal form if and only if, whenever there exists multivalued dependency in the R, say A->>B, then all attributes of R are also functionally dependent on A. 3. Define Multivalued Dependency. 4. What is a Constraint?
2.8 Assignment
Prepare assignment about oracle 8i.
2.11 Keywords
1. 2. 3. 4. 5. 6. Normalization Boyce-Codd Normal form 1NF First Normal Form 2NF Second Normal Form Multi-Valued Dependency Functional Dependency
Page 89