DBMS
1. Introduction
DBMS is one of the most important modules for the Specialist Officer (IT) Exam. As we have seen, the objective paper of
Professional Knowledge (especially for Scale-I Officers) in the IBPS Exam has many questions from the Database and
Networking modules. Thus, aspirants should prepare DBMS thoroughly. The term DBMS stands for Database
Management System. So what exactly is a DBMS?
DBMS is the acronym for Database Management System. A DBMS is a collection of interrelated data and a set of
programs to access this data in a convenient and efficient way. It controls the organization, storage, retrieval,
security and integrity of data in a database.
A database management system (DBMS) is computer software that manages databases; it may use any of a variety
of database models, such as the hierarchical, network and relational models.
The first type of DBMS emerged between the 1960s and 1970s: the hierarchical DBMS. IBM had the first
model, developed on the IBM 360; their DBMS was called IMS and was originally written for the Apollo program.
This type of DBMS was based on binary trees, where the overall shape was like a tree and relations were limited
to those between parent and child records.
The hierarchical database model is one of the oldest database models. In the hierarchical data model, records are linked
both to the superior records on which they depend and to the records which depend on them. A
tree structure may establish one-to-many relationships: a parent can have many children, exhibiting a one-to-many
relationship. The children (and their descendants) are the nodes, or dependents, of the root, and in general a root may have any
number of dependents.
A tree-structure diagram is the schema for a hierarchical database. Such a diagram
consists of two basic components:
1. Boxes, which correspond to record types
2. Lines, which correspond to links
Pros:
✓ The model allows easy addition and deletion of new information.
✓ Data at the top of the Hierarchy is very fast to access.
✓ It relates well to anything that works through a one-to-many relationship.
Cons:
• Real-world requirements often involve more sophisticated relationships, which this model fails to cater for.
The Network Database model can be viewed as an upside-down tree where each member's information is a branch
linked to the owner, which forms the bottom of the tree. The network database model was a progression from the
hierarchical database model and was designed to solve some of that model's problems, specifically the lack of
flexibility. It addresses the need to model more complex relationships, such as the many-to-many relationship, which
the hierarchical model could not deal with.
The Network model replaces the hierarchical tree with a graph, thus allowing more general connections among the
nodes. The main difference of the network model from the hierarchical model is its ability to handle many-to-many
(N:N) relations. In other words, it allows a record to have more than one parent.
Pros:
✓ In network database terminology, a relationship is a set. Each set comprises two types of records: an
owner record and a member record.
✓ In a network model an application can access an owner record and all the member records within a set.
✓ Network Model supports data independence to some level as it draws a clear line of demarcation between
programs and the complex physical storage details.
Cons:
• The insertion, deletion and updating operations on any record require a large number of pointer
adjustments.
• A change in structure demands a change in the application as well, which leads to lack of structural
independence.
Relational databases store data in tables that are two dimensional. The tables have rows and columns. Relational
database tables are "normalized" so data is not repeated more often than necessary. With traditional databases, data
manipulated by the application is transient and data in the database is persisted (Stored on a permanent storage
device). In object databases, the application can manipulate both transient and persisted data.
Reduced Data Redundancy and Inconsistency: With a DBMS, multiple file formats and the duplication of the same
information in different files are largely eliminated. This reduces data duplication, and with less duplication the
data stays more consistent.
Data Integrity: Data integrity refers to the accuracy and consistency of data stored in a database.
A DBMS ensures data integrity by managing transactions according to the ACID properties: atomicity, consistency, isolation and
durability. Such integrity guarantees are absent in a file management system.
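For example, using MySQL syntax, a funds transfer can be wrapped in a transaction so that either both updates persist or neither does (the accounts table and its columns here are only illustrative):
START TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE acc_no = 101;
UPDATE accounts SET balance = balance + 500 WHERE acc_no = 102;
COMMIT;
# If anything goes wrong before COMMIT, issuing ROLLBACK; undoes both updates (atomicity).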
Sharing of Data: In DBMS, data can be shared by authorized users of the organization. The database administrator
manages the data and gives rights to users to access the data.
Control Over Concurrency: In a file-based system, if two users can access data simultaneously, it is possible that they
will interfere with each other. For example, if both users attempt to perform update operation on the same record,
then one may overwrite the values recorded by the other. Most database management systems have sub-systems to
control the concurrency so that transactions are always recorded with accuracy.
Data Independence: The separation of data structure of database from the application program that uses the data is
called data independence. In DBMS, you can easily change the structure of database without modifying the
application program.
2. Database Architecture
The generalized architecture of a DBMS is called the ANSI/SPARC model. The architecture is divided into three levels: the external (view) level, the conceptual (logical) level and the internal (physical) level.
Schema can be defined as the design of a database. The overall description of the database is called the database
schema.
You can relate it to something like types and variables in programming languages. Thus, essentially, the schema is the
logical structure of the database. Just like the view levels in data abstraction, the schema is of three types:
1. Physical Schema:
The design of a database at physical level is called physical schema, how the data stored in blocks of storage is
described at this level.
2. Logical schema:
Logical schema can be defined as the design of the database at the logical level. At this level, the programmers as well as the
database administrator (DBA) work. Data can be described here as certain types of data records which can be
stored in the form of data structures; however, the internal details (such as the implementation of a data structure)
remain hidden at this level.
3. View schema:
The design of the database at the view (external) level is called the view schema. It describes only the part of the
database that is relevant to a particular group of users and hides the rest of the database from them.
• Physical Data Independence–the ability to modify the physical schema without changing the logical schema.
• Applications depend on the logical schema
• In general, the interfaces between the various levels and components should be well defined so that changes in
some parts do not seriously influence others.
What is an Instance?
Databases change over time as information is inserted and deleted. The collection of information stored in the
database at a particular moment is called an instance.
A database system provides a data-definition language to specify the database schema and a data-manipulation
language to express database queries and updates.
1. Data Definition Language: DDL is used for specifying the database schema. It contains commands to create tables,
alter the structure, delete tables or rename tables.
2. Data Manipulation Language: As the name specifies itself DML is used for accessing and manipulating data in a
database.
3. Data Control Language: DCL is used for granting and revoking user access on a database (e.g. GRANT and REVOKE).
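A few representative statements of each category, written in MySQL syntax (the dept table and the 'reader' user are only illustrative):
# DDL: define or change the structure
CREATE TABLE dept (dept_id INT PRIMARY KEY, dept_name CHAR(20));
ALTER TABLE dept ADD location CHAR(20);
# DML: access and manipulate the data
INSERT INTO dept VALUES (10, 'Accounts', 'Delhi');
UPDATE dept SET location = 'Mumbai' WHERE dept_id = 10;
SELECT * FROM dept WHERE dept_id = 10;
DELETE FROM dept WHERE dept_id = 10;
# DCL: grant and revoke access
GRANT SELECT ON dept TO 'reader'@'localhost';
REVOKE SELECT ON dept FROM 'reader'@'localhost';
# DDL again: remove the table
DROP TABLE dept;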
3. Entity-Relationship Model
What is an Entity?
In a database, we would be grouping only related data together and storing them under one group name called Entity
/ Table. This helps in identifying which data is stored where and under what name. It reduces the time to search for a
particular data in a whole database.
Entities can be classified based on their strength. An entity is considered weak if it is existence-dependent on another entity.
Following are basic types of entities:
1. Strong Entity: Entities having their own attributes as primary keys are called strong entities. For example,
EMPLOYEE has EMPLOYEE_ID as its primary key; hence it is a strong entity.
2. Weak Entity: Entities which cannot form a primary key from their own attributes alone are known as weak entities. These
entities derive their primary key from the combination of their own attributes and the primary key of the entity they
depend on. The relationship between a weak entity set and a strong entity set is called an identifying relationship.
3. Composite Entity: Entities participating in many-to-many relationships are called composite entities.
Note: the primary key of the strong entity set is not explicitly stored with the weak entity set, since it is implicit in the
identifying relationship. If loan_number were explicitly stored, payment could be made a strong entity, but then the
relationship between payment and loan would be duplicated by an implicit relationship defined by the attribute
loan_number common to payment and loan.
3.1 Attributes
An ER diagram is a means of visualizing how the information a system produces is related. There are five main
components of an ER Diagram:
1. Connecting lines, solid lines that connect attributes to show the relationships of entities in the diagram.
2. Entities: Represented by Rectangle
• Strong Entity: These shapes are independent from other entities, and are often called parent entities, since
they will often have weak entities that depend on them.
Here you can see: Employee Works in Department. EMPLOYEE and Dept are entity types and WorksIn is the
relationship, represented with a diamond figure.
A recursive relationship is one in which the same entity participates more than once in the relationship. For example:
every manager is also an employee, so manager is not a new entity,
but just a subset of the instances of the entity EMPLOYEE.
Recursive Relationship (Manages):
EMPLOYEE   MANAGER
Nikhil     Anuj
Sumita     Nikhil
Anuj
Rahul
An attribute describes a property or characteristic of an entity. For example, Name, ID, Age, Address etc. can be
attributes of an EMPLOYEE.
A key attribute represents the main characteristic of an entity. It is used to represent the primary key; an ellipse with
an underline represents a key attribute. Here EmpId is the key attribute, i.e. the primary key, which will uniquely
identify the EMPLOYEE records.
An attribute can also have its own attributes. Such an attribute is known as a composite attribute.
4. Cardinality
The cardinality of a relationship is the number of instances of entity B that can be associated with entity A. There is a
minimum cardinality and a maximum cardinality for each relationship. Cardinality refers to the maximum number of
times an instance in one entity can relate to instances of another entity. Ordinality, on the other hand, is the minimum
number of times an instance in one entity can be associated with an instance in the related entity.
Cardinality notations commonly shown in ER diagrams: many, zero or one, one, one (and only one), zero or many, one or many.
• The number of entities from E2 that an entity from E1 can possibly be associated with through R (and vice-versa)
determines the cardinality ratio of R.
• Four possibilities are usually specified:
1. one-to-one (1:1)
2. one-to-many (1:N)
3. many-to-one (N:1)
4. many-to-many (M:N)
Cardinality Ratios
One-to-one: An E1 entity may be associated with at most one E2 entity and similarly an E2 entity may be associated
with at most one E1 entity.
One-to-many: An E1 entity may be associated with many E2 entities whereas an E2 entity may be associated with at
most one E1 entity.
Many-to-one: An E1 entity may be associated with at most one E2 entity, whereas an E2 entity may be associated
with many E1 entities.
Many-to-many: An E1 entity may be associated with many E2 entities and, likewise, an E2 entity may be associated
with many E1 entities.
Mapping Cardinalities
Note: Some elements in A and B may not be mapped to any elements in the other set.
In one department we have many employees, so the following represents a one-to-many relationship.
And in the case where any number of employees may work in any number of departments, the relationship is many-to-many.
5. Keys
A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity. A
candidate key of an entity set is a minimal super key
Generalization is a bottom-up approach in which two lower level entities combine to form a higher-level entity. In
generalization, the higher-level entity can also combine with other lower level entity to make further higher-level
entity. Specialization is opposite to Generalization. It is a top-down approach in which one higher level entity can be
broken down into two lower level entities.
Specialization is a top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in
the set.
• These subgroupings become lower-level entity sets that have attributes or participate in relationships that do not
apply to the higher-level entity set.
• Depicted by a triangle component labelled ISA (E.g. customer “is a” person).
• Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the
higher-level entity set to which it is linked.
Specialization Example
Generalization
• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level
entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in
the same way.
• The terms specialization and generalization are used interchangeably.
Generalization Example
Aggregation
Aggregation is a process when relation between two entities is treated as a single entity. It is an abstraction that treats
relationships as entities.
• Eliminate this redundancy via aggregation
• Treat relationship as an abstract entity
• Allows relationships between relationships
• Abstraction of relationship into new entity
Aggregation Example
The relational model for database management is a data model based on predicate logic and set theory. It was
invented by Edgar Codd. The fundamental assumption of the relational model is that all data are represented as
mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n sets.
n-ary Relationship
When there are n entities set participating in a relation, the relationship is called as n-ary relationship.
1) Relation - The fundamental organizational structure for data in the relational model is the relation. A relation is a
two-dimensional table made up of rows and columns. Each relation also called a table, stores data about entities.
2) Tuples - The rows in a relation are called tuples. They represent specific occurrences (or records) of an entity. Each
row consists of a sequence of values, one for each column in the table. In addition, each row (or record) in a table must
be unique. A tuple variable is a variable that stands for a tuple.
3) Attributes – The columns in a relation are called attributes. The attributes represent characteristics of an entity.
4) Domain – For each attribute there is a set of permitted values called domain of that attribute. For all relations ‘r’,
the domain of all attributes of ‘r’ should be atomic. A domain is said to be atomic if elements of the domain are
considered to be indivisible units.
1. Primary Key: A primary key uniquely identifies a record in a table. Student_ID is the primary key in the STUDENT
table below.
2. Candidate Key: A candidate key is a single field or the least combination of fields that uniquely identifies
each record in the table. Every table must have at least one candidate key but may have
several. For example, in the table STUDENT, Student_ID and Roll_No. are candidate keys.
Roll_No. Student_ID
001 11093100
002 11093101
003 11093126
004 11093127
3. Foreign Key: A foreign key is generally a primary key from one table that appears as a field in another. For
example, let us consider these two tables, STUDENT and LIBRARY_RECORD.
STUDENT
Roll_No. Student_ID Student_Name Student_Class
001 11093100 Ravi Kumar 3
002 11093101 Nihal Sharma 4
003 11093126 Astha Mathur 3
004 11093127 Nishi Arora 5
LIBRARY_RECORD
Lib_CardNo Student_ID Student_Name Address
AX120 11093101 Nihal Sharma 12th Avenue Street, Delhi
AX121 11093126 Astha Mathur XYZ Lane,Delhi
BL101 11093127 Nishi Arora 5-D, Z Block, Delhi
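In SQL, such a link can be declared explicitly with a FOREIGN KEY constraint. A minimal sketch in MySQL syntax (the column types used here are assumptions):
CREATE TABLE STUDENT (
  Roll_No CHAR(3) PRIMARY KEY,
  Student_ID CHAR(8) UNIQUE,
  Student_Name CHAR(30),
  Student_Class SMALLINT
);
CREATE TABLE LIBRARY_RECORD (
  Lib_CardNo CHAR(5) PRIMARY KEY,
  Student_ID CHAR(8),
  Student_Name CHAR(30),
  Address CHAR(50),
  FOREIGN KEY (Student_ID) REFERENCES STUDENT (Student_ID)
);
# Every Student_ID stored in LIBRARY_RECORD must now exist in STUDENT.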
Non-key attributes are attributes other than the candidate-key attributes in a table, and non-prime attributes are
attributes that are not prime attributes (i.e. not part of any candidate key).
• In a procedural language, the user instructs the system to perform a sequence of operations on the database to
compute the desired result.
• In a nonprocedural language, the user describes the desired information without giving a specific procedure
for obtaining that information.
An operator can be either unary or binary. Following are some operations of relational algebra:
1. Selection operator (σ): Selection operator is used to select tuples from a relation based on some condition.
Syntax: σ (Cond)(Relation Name)
Example: σ (SALARY>20000)(EMPLOYEE) selects the EMPLOYEE tuples whose SALARY is greater than 20000.
2. Projection operator (∏): Projection is used to select particular columns (attributes) from a relation; duplicate
rows are eliminated from the result.
Syntax: ∏ (Attribute List)(Relation Name)
3. Union Operation (∪): It performs binary union between two given relations. Union on two relations R1 and
R2 can only be computed if R1 and R2 are union compatible (These two relation should have same number of
attributes and corresponding attributes in two relations have same domain) . Duplicate tuples are
automatically eliminated in union operation.
Syntax: Relation1 U Relation2
r ∪ s = { t | t ∈ r or t ∈ s}
Note: r, and s must have the same number of attributes.
This projects the names of the employees who are managers in IT_Dept, managers in FUNCT_Dept, or both:
∏ Managers (IT_Dept) ∪ ∏ Managers(FUNCT_Dept)
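The SQL counterpart of the union operator is UNION, which also eliminates duplicate rows. A sketch using the same (assumed) relation and column names as the algebra example:
SELECT Managers FROM IT_Dept
UNION
SELECT Managers FROM FUNCT_Dept;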
4. Minus (-): Minus on two relations R1 and R2 can only be computed if R1 and R2 are union compatible. Minus
operator when applied on two relations as R1-R2 will give a relation with tuples which are in R1 but not in
R2.
Syntax: Relation1 - Relation2
Example
To find persons who are students but not employees, we can use the minus operator like this:
∏ Name (STUDENTS) − ∏ Name (EMPLOYEE)
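In SQL the same difference can be written with a subquery (classic MySQL has no MINUS/EXCEPT operator). The table and column names follow the algebra example above:
SELECT Name FROM STUDENTS
WHERE Name NOT IN (SELECT Name FROM EMPLOYEE);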
6. Cartesian Product (Χ): The cartesian product of two tables combines each row in one table with each row in
the other table. It combines tuples from two relations, but unlike the join operation, its result contains all pairs
of tuples from the two relations, regardless of whether their attribute values match.
Syntax: r Χ s
Where r and s are relations and their output will be defined as −
r Χ s = { q t | q ∈ r and t ∈ s}
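In SQL a Cartesian product is obtained by listing both tables in the FROM clause without a join condition, or with CROSS JOIN. Using the names and email tables from the MySQL example later in this chapter:
SELECT * FROM names CROSS JOIN email;
# Every row of names is paired with every row of email: 3 x 3 = 9 rows.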
Relational calculus is a non-procedural query language. It uses mathematical predicate calculus instead of
algebra. It provides the description about the query to get the result whereas relational algebra gives the
method to get the result.
{t | EMPLOYEE (t) and t.SALARY>20000} - implies that it selects the tuples from EMPLOYEE relation
such that resulting employee tuples will have salary greater than 20000. It is example of selecting a range of
values.
In the tuple relational calculus, you use variables that range over the tuples of a relation. In the domain
relational calculus, you also use variables, but in this case the variables take their values from domains
of attributes rather than from tuples of relations. A domain relational calculus expression has the following
general format:
{ <x1, x2, ..., xn> | P(x1, x2, ..., xn) }
where each xi is a domain variable and P is a formula built from atoms over those variables.
For example, select EMP_ID and EMP_NAME of employees who work for department ID 415.
6. Normalization
Normalization is a process of organizing the data in database to avoid data redundancy, insertion anomaly, update
anomaly & deletion anomaly. Thus, database normalization is a database schema design technique, by which an
existing schema is modified to minimize redundancy and dependency of data.
What are Anomalies in Database Management?
Anomalies are inconvenient or error-prone situations arising when we process the tables. There are three types of
anomalies:
1. Update Anomalies: Incorrect data may have to be changed, which could involve many records having to be
changed, leading to the possibility of some changes being made incorrectly.
2. Delete Anomalies: A record of data can legitimately be deleted from a database, and the deletion can result in
the deletion of the only instance of other, required data. E.g. deleting a book loan from a library member can
remove all details of the particular book from the database, such as the author, book title, etc.
3. Insert Anomalies: The nature of a database may be such that it is not possible to add a required piece of data
unless another piece of unavailable data is also added. E.g. A library database that cannot store the details of a
new member until that member has taken out a book.
We use functional dependencies to test relations to see whether they are legal under a given set of functional
dependencies, and to specify constraints on the set of legal relations.
If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.
We say that F holds on R if all legal relations on R satisfy the set of functional dependencies F.
Note: A specific instance of a relation schema may satisfy a functional dependency even if the functional
dependency does not hold on all legal instances. For example, a specific instance of loan may, by chance, satisfy
amount →customer_name.
A functional dependency is trivial if it is satisfied by all instances of a relation
Example:
customer_name, loan_number →customer_name
customer_name→customer_name
Inference rules
Armstrong's axioms are sound and complete, i.e. they enable the computation of every functional dependency implied by a
given set. The axioms are:
1. Reflexivity - if the B's are a subset of the A's, then A → B.
2. Augmentation - If A → B, then A, C → B, C.
3. Transitivity - If A → B and B → C then A → C.
Additional inference rules
4. Decomposition - If A → B, C then A → B
5. Union - If A → B and A → C then A → B, C
6. Pseudo transitive - If A → B and C, B → D then C, A → D
Equivalence of sets of functional dependencies
Two sets of functional dependencies S and T are equivalent iff each can be inferred from the other (S implies T and T implies S).
The dependency {A_1, ..., A_n} → {B_1, ..., B_m} is trivial if the B's are a subset of the A's, nontrivial if at least
one of the B's is not among the A's, and completely nontrivial if none of the B's is among the A's.
Closure (F+)
All dependencies in F, together with all dependencies that can be inferred from F using the above rules, form the closure of F,
denoted by F+.
After finding the set of functional dependencies that hold on a relation, the next step is to find the super key for
that relation (table). The set of identified functional dependencies plays a vital role in finding the key for the relation.
Example:
EMPLOYEE(empid, empname, dept, age, salary, experience)
Let the functional dependencies be as follows:
empid->empname
{age,experience}->salary
empid->{age,dept}
dept->experience
In the above example, let us find the closure of the attribute empid, i.e. the closure of {empid}.
Since we are finding the closure of empid, empid is an element of the closure set C+. Now we go step by step.
• Step 1: Select each functional dependency and check whether the left side of functional dependency is a
subset of closure. If yes, add the right side of that functional dependency to closure set. if not, check the next
functional dependency
• Step 2: Keep on checking the functional dependencies until there are no more functional dependencies whose
left side is a subset of the closure set C+.
What is a subset? A set M is said to be a subset of another set N only if all elements of set M are present in set N. Set N
can have more elements than M.
• So, in our example, empid is an element of the closure set C+. So, initially, C+={empid}.
• First functional dependency says empid functionally determines empname. Its left side ( {empid}) is subset of
C+. Therefore, its right side is added to C +. Now C+={empid, empname}.
• Second fd (functional dependency) says {age, experience}->salary. Here left side ( {age, experience} ) is not a
subset of C+. So we check the next fd.
• Third fd says, empid->{age, dept}. Here left side ( {empid} ) is subset of C+. Therefore, its right side is added to
C+. Now, C becomes, C+={empid, empname, age, dept}.
• Fourth fd says, dept->experience. Here left side ( {dept} ) is a subset of C +. So we are adding its right side (
{experience} ) to Closure set. Now, C+={empid, empname, age, dept, experience}.
• We are looking again for a functional dependency with its left side as a subset of closure set. Since the closure
set C+ is getting changed in some steps, there is more possibility to find another functional dependency with
its left side as a subset of C+. Again, we go through every functional dependency.
• Since sets do not allow duplication, we should do nothing if the right side of a functional dependency whose
left side is subset of C+, is already present in closure set C+.
• Second fd has a left side that is subset of C+. {age, experience}->salary. Therefore, salary is added to C+. Now,
C+={empid, empname, age, dept, experience, salary}.
• There isn't any more functional dependency whose left side is subset of C+ and give at least one new attribute
to closure set. Therefore, we stop now.
We say that a set of functional dependencies F covers another set of functional dependencies G, if every functional
dependency in G can be inferred from F. More formally, F covers G if G+ ⊆ F+. F is a minimal cover of G if F is the
smallest set of functional dependencies that cover G.
We find the minimal cover by iteratively simplifying the set of functional dependencies. To do this, we will use three
methods:
Simplifying an FD by the Union Rule: Let X, Y, and Z be sets of attributes. If X → Y and X → Z, then X → YZ.
Simplifying an FD by simplifying the left-hand side: Let X and Y be sets of attributes and B a single attribute not in X.
Let F be: XB → Y
and H be: X → Y
If F ⇒ X → Y, then we can replace F by H. In other words, if Y ⊆ X+F (the closure of X under F), then we can replace F by H.
For example, say we have the following functional dependencies (F):
• AB → C
•A→B
And we want to know if we can simplify to the following (H):
•A→C
•A→B
Then A+F = ABC. Since, Y ⊆ X+F, we can replace F by H.
Simplifying an FD by simplifying the right-hand side: Let X and Y be sets of attributes, and C be a single attribute
not in Y .
Let F be:
X → Y C and H be
X→Y
If H ⇒ X → Y C, then we can replace F by H. In other words, if Y C ⊆ X+H, then we can replace F by H.
For example, say we have the following functional dependencies (F):
• A → BC
•B→C
And we want to know if we can simplify to the following (H):
•A→B
•B→C
Then A+H = ABC. Since, BC ⊆ X+H, we can replace F by H.
Understanding Normalization
In relational database theory, normalization is the process of restructuring the logical data model of a database to
eliminate redundancy, organize data efficiently, and reduce repeating data and to reduce the potential for anomalies
during data operations. Data normalization also may improve data consistency and simplify future extension of the
logical data model. The formal classifications used for describing a relational database's level of normalization are
called normal forms (NF).
A non-normalized database can suffer from data anomalies: A non-normalized database may store data representing
a particular referent in multiple locations. An update to such data in some but not all of those locations results in an
update anomaly, yielding inconsistent data. A normalized database prevents such an anomaly by storing such data
(i.e. data other than primary keys) in only one location.
A non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional
dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized
database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.
Similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such
databases may require deleting data from the inappropriate dependency. A normalized database prevents such
deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.
Normal forms
Edgar F. Codd originally defined the first three normal forms. The first normal form requires that tables be made up
of a primary key and a number of atomic fields, and the second and third deal with the relationship of non-key fields
to the primary key. These have been summarized as requiring that all non-key fields be dependent on "the key, the
whole key and nothing but the key". In practice, most applications in 3NF are fully normalized. However, research
has identified potential update anomalies in 3NF databases. BCNF is a further refinement of 3NF that attempts to
eliminate such anomalies. The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation
of many-many and one-many relationships. Sixth normal form (6NF) only applies to temporal databases.
First normal form (1NF) lays the groundwork for an organized database design: Ensure that each table has a primary
key: minimal set of attributes which can uniquely identify a record. It states that the domain of an attribute must
include only atomic values and the value of any attribute in a tuple must be single value from the domain of that
attribute. It doesn’t allow nested relation. Data that is redundantly duplicated across multiple rows of a table is
moved out to a separate table.
Atomicity: Each attribute must contain a single value, not a set of values.
Unnormalized form (UNF): A table that contains one or more repeating groups.
First normal form (1NF): A relation in which the intersection of each row and column contains one and only one
value.
UNF → 1NF: remove repeating groups by:
• Entering appropriate data in the empty columns of rows.
• Placing repeating data, along with a copy of the original key attribute, in a separate relation.
• Identifying a primary key for each of the new relations.
For Example:
Consider a table STUDENT with fields Roll_No, Name, Course. Here a student may have opted for more than one course, so
the values in the Course field will not be atomic:
Roll_No Name Course
1 Snehal Polity, History, Economics
2 Kajal DBMS, CD
3 Amit Physics, Chemistry
After converting it to First Normal Form (1NF), each row holds exactly one course:
Roll_No Name Course
1 Snehal Polity
1 Snehal History
1 Snehal Economics
2 Kajal DBMS
2 Kajal CD
3 Amit Physics
3 Amit Chemistry
General Definition: A relation schema R is in second normal form (2NF) if every nonprime attribute A in R is not
partially dependent on any key of R.
Partial Dependency – If a proper subset of a candidate key determines a non-prime attribute, it is called a partial
dependency.
• Create separate tables for sets of values that apply to multiple records.
• Relate the tables with a foreign key.
• Records should not depend on anything other than a table's primary key (a compound key, if necessary). For
example, consider a customer's address in an accounting system. The address is needed by the Customers
table, but also by the Orders, Shipping, Invoices, Accounts Receivable, and Collections tables. Instead of
storing the customer's address as a separate entry in each of these tables, store it in one place, either in the
Customers table or in a separate Addresses table.
As per First Normal Form, no two rows of data may contain repeating groups of information, i.e. each set of columns
must have a unique value, such that multiple columns cannot be used to fetch the same row. The primary key is
usually a single column, but sometimes more than one column can be combined to create a single primary key
(a composite key). To identify or establish 2NF we must identify the candidate keys and any partial
dependencies.
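As a sketch of removing a partial dependency (the ORDER_ITEM and PRODUCT tables here are hypothetical): in ORDER_ITEM(Order_ID, Product_ID, Product_Name, Quantity) the key is (Order_ID, Product_ID), but Product_Name depends on Product_ID alone, so it is moved into its own table:
CREATE TABLE PRODUCT (
  Product_ID INT PRIMARY KEY,
  Product_Name CHAR(30)
);
CREATE TABLE ORDER_ITEM (
  Order_ID INT,
  Product_ID INT,
  Quantity INT,
  PRIMARY KEY (Order_ID, Product_ID),
  FOREIGN KEY (Product_ID) REFERENCES PRODUCT (Product_ID)
);
# Product_Name now depends on the whole key of its table, so both tables are in 2NF.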
3NF Rule:
1. The database must meet all the requirements of the second normal form.
2. Any field which is dependent not only on the primary key but also on another field is moved out to a separate table.
(No Transitive Dependencies)
Example:
STUDENT( Stu_ID, Stu_Name, City, Zip)
We find that in the above STUDENT relation, Stu_ID is the key and the only prime attribute.
City can be identified by Stu_ID as well as by Zip itself. Zip is not a superkey and City is not a prime attribute.
Additionally, Stu_ID → Zip → City, so there exists a transitive dependency.
To bring this relation into third normal form, we break the relation into two relations: STUDENT (Stu_ID, Stu_Name, Zip) and ZIP (Zip, City).
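In SQL the two relations could be created as follows (a sketch; the column types and the ZIP table name are assumptions):
CREATE TABLE ZIP (
  Zip CHAR(6) PRIMARY KEY,
  City CHAR(20)
);
CREATE TABLE STUDENT (
  Stu_ID INT PRIMARY KEY,
  Stu_Name CHAR(30),
  Zip CHAR(6),
  FOREIGN KEY (Zip) REFERENCES ZIP (Zip)
);
# City now depends only on Zip, so the transitive dependency Stu_ID -> Zip -> City is removed.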
General Definition:
A relation schema R is in 3NF if, whenever a nontrivial functional dependency X → A
holds in R,
either a) X is a superkey of R, or b) A is a prime attribute of R.
i.e. A relation schema R is in 3NF if every nonprime attribute of R meets both of the following conditions:
1. It is fully functionally dependent on every key of R.
2. It is non-transitively dependent on every key of R.
A relation is in BCNF if and only if every determinant is a candidate key. The second and third normal forms assume that
all attributes not part of the candidate keys depend on the candidate keys, but do not deal with dependencies within
the keys.
In a relational model, if all of the information about an entity is to be represented in one relation, it will be necessary
to repeat all the information other than the multivalue attribute value to represent all the information that we wish to
represent. This results in many tuples about the same instance of the entity in the relation and the relation having a
composite key (the entity id and the multivalued attribute). Of course, the other option suggested was to represent
this multivalued information in a separate relation. The situation of course becomes much worse if an entity has more
than one multivalued attribute and these values are represented in one relation by a number of tuples for each entity
instance. The multivalued dependency relates to this problem when more than one multivalued attribute exists.
Consider the following relation that represents an entity employee that has one multivalued attribute proj:
emp (e#, dept, salary, proj)
We have so far considered normalization based on functional dependencies; dependencies that apply only to
single-valued information. For example, e# -> dept implies only one dept value for each value of e#. Not all
information in a database is single-valued, for example, proj in an employee relation may be the list of all projects that
the employee is currently working on. Although e# determines the list of all projects that an employee is working on,
e# -> proj is not a functional dependency.
We can more clearly analyze the multivalued dependency by the following example.
programmer (emp_name, qualifications, languages)
This relation includes two multivalued attributes of entity programmer; qualifications and languages. There are no
functional dependencies.
Now, more formally, X ->> Y is said to hold for R(X, Y, Z) if, whenever t1 and t2 are two tuples in R that have the same values
for the attributes X (that is, t1[X] = t2[X]), R also contains tuples t3 and t4 (not necessarily distinct) such that
t1[x] = t2[x] = t3[x] = t4[x]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]
In other words if t1 and t2 are given by
t1 = [X, Y1, Z1], and
t2 = [X, Y2, Z2]
then there must be tuples t3 and t4 such that
t3 = [X, Y1, Z2], and
t4 = [X, Y2, Z1]
We are therefore insisting that every value of Y appears with every value of Z to keep the relation instances consistent.
In other words, the above conditions insist that X alone determines Y and Z and there is no relationship between Y and
Z since Y and Z appear in every possible pair and hence these pairings present no information and are of no
significance.
Definition: A relation schema R is in 4NF with respect to a set of dependencies F, if, for every non-trivial
multivalued dependency X ->>Y in F+, X is a super key for R.
Decomposition Property: A relation must satisfy the following two properties during decomposition.
i. Lossless: The lossless-join property of a decomposition ensures that no spurious rows are
generated when the relations are reunited through a natural join operation, i.e. the information in an instance r of R must
be preserved in the instances r1, r2, r3, ..., rk where ri = ΠRi (r).
Decomposition is lossless with respect to a set of functional dependencies F if, for every relation instance r on R
satisfying F, 𝑟 = 𝜋𝑅1 (𝑟) ∗ 𝜋𝑅2 (𝑟) ∗ . . . ∗ 𝜋𝑅𝑘 (𝑟)
ii. Dependency Preserving Property: If a set F of functional dependencies holds on R, it should be possible to enforce F
by enforcing appropriate dependencies on each Ri.
Decomposition D = (R1, R2, R3, ..., Rk) of schema R preserves a set of dependencies F if
(πR1(F) ∪ πR2(F) ∪ ... ∪ πRk(F))+ = F+
where πRi(F) is the projection of F onto Ri.
i.e. any FD that logically follows from F must also logically follow from the union of the projections of F onto the Ri's. Then
D is called dependency preserving.
Join Dependency
Join dependency is the term used to indicate the property of a relation schema that cannot be decomposed losslessly
into two relation schemas, but can be decomposed losslessly into three or more simpler relation schemas. It means that
a table, after it has been decomposed into three or more smaller tables, must be capable of being joined again on
common keys to form the original table.
Decomposition of Ri is lossless, as Ri1 ∩ Ri2 = X, Ri2 − Ri1 = A and X → A.
Result: a lossless decomposition of R into BCNF relations.
Goals of Normalization
• Let R be a relation scheme with a set F of functional dependencies.
• Decide whether a relation scheme R is in “good” form.
• In the case that a relation scheme R is not in “good” form, decompose it into a set of relation schemes {R1, R2, ..., Rn}
such that each relation scheme is in good form, the decomposition is a lossless-join decomposition and, preferably,
the decomposition is dependency preserving.
‘A relation R is in fifth normal form (5NF) – also called projection-join normal form (PJ/NF) if and only if every
join dependency in R is a consequence of the candidate keys of R.’
For every normal form it is assumed that every occurrence of R can be uniquely identified by a primary key using one
or more attributes in R.
• Join dependencies generalize multivalued dependencies and lead to project-join normal form (PJNF), also called
fifth normal form.
• A class of even more general constraints leads to a normal form called domain-key normal form.
• Problem with these generalized constraints: they are hard to reason with, and no sound and complete set of
inference rules exists for them. Hence they are rarely used.
In short, normalization of a database is achieved by following a set of rules called 'forms' in creating the
database.
These rules are five in number (with one extra one, BCNF, stuck in between 3 and 4): 1NF, 2NF, 3NF, BCNF, 4NF and 5NF.
What's SQL ?
In 1971, IBM researchers created a simple non-procedural language called Structured English Query Language, or
SEQUEL. It was based on Dr. Edgar F. (Ted) Codd's design of a relational model for data storage, in which he
described a universal programming language for accessing databases.
In the late 80's, ANSI and ISO (two organizations dealing with standards for a wide variety of things) came
out with a standardized version called Structured Query Language, or SQL. SQL is pronounced as 'Sequel'. There have
been several versions of SQL; the latest one is SQL-99, though SQL-92 is the most universally adopted
standard.
SQL is the language used to query all databases. It's simple to learn and appears to do very little but is the heart of a
successful database application. Understanding SQL and using it efficiently is highly imperative in designing an
efficient database application. The better your understanding of SQL the more versatile you'll be in getting
information out of databases.
As you can see SQL is like English. Let's build a real world example database using MySQL and perform some SQL
operations on it. A database that practically anyone could use would be a Contacts database. In our example we are
going to create a database with the following fields:
First Name, Last Name, Birth Date, Street Address, City, State, Zip, Country, Telephone Home, Telephone Work, Email,
Company Name, Designation
First, let’s decide how we are going to store this data in the database. For illustration purposes, we are going to
keep this data in multiple tables.
This will let us exercise all the SQL commands pertaining to retrieving data from multiple tables. Also, we can
separate different kinds of entities into different tables. So, let's say you have thousands of friends and need to send a
mass email to all of them, a SELECT statement (covered later) will look at only one table.
Well, we can keep the FirstName, LastName and BirthDate in one table.
Address related data in another.
Company Details in another.
Emails can be separated into another.
Telephones can be separated into another.
Type, size in bytes and description of the common MySQL column types:
TINYINT(length): 1 byte. Integer with an unsigned range of 0 to 255 and a signed range of -128 to 127.
SMALLINT(length): 2 bytes. Integer with an unsigned range of 0 to 65535 and a signed range of -32768 to 32767.
MEDIUMINT(length): 3 bytes. Integer with an unsigned range of 0 to 16777215 and a signed range of -8388608 to 8388607.
INT(length): 4 bytes. Integer with an unsigned range of 0 to 4294967295 and a signed range of -2147483648 to 2147483647.
BIGINT(length): 8 bytes. Integer with an unsigned range of 0 to 18446744073709551615 and a signed range of -9223372036854775808 to 9223372036854775807.
FLOAT(length, decimal): 4 bytes. Floating point number with max. value +/-3.402823466E38 and min. (non-zero) value +/-1.175494351E-38.
DOUBLE PRECISION(length, decimal): 8 bytes. Floating point number with max. value +/-1.7976931348623157E308 and min. (non-zero) value +/-2.2250738585072014E-308.
DECIMAL(length, decimal): length bytes. Floating point number with the range of the DOUBLE type that is stored as a CHAR field type.
TIMESTAMP(length): 4 bytes. YYYYMMDDHHMMSS, YYMMDDHHMMSS, YYYYMMDD or YYMMDD. A TIMESTAMP value is updated each time the row changes; a NULL value sets the field to the current time.
DATE: 3 bytes. YYYY-MM-DD.
TIME: 3 bytes. HH:MM:SS.
DATETIME: 8 bytes. YYYY-MM-DD HH:MM:SS.
YEAR: 1 byte. YYYY or YY.
CHAR(length): length bytes. A fixed-length text string where fields shorter than the assigned length are filled with trailing spaces.
VARCHAR(length): length bytes. A variable-length text string (255 characters max) where unused trailing spaces are removed before storing.
TINYTEXT: length+1 bytes. A text field with a max. length of 255 characters.
TINYBLOB: length+1 bytes. A binary field with a max. length of 255 characters.
TEXT: length+2 bytes. Up to 64 KB of text.
BLOB: length+2 bytes. Up to 64 KB of data.
MEDIUMTEXT: length+3 bytes. Up to 16 MB of text.
MEDIUMBLOB: length+3 bytes. Up to 16 MB of data.
LONGTEXT: length+4 bytes. Up to 4 GB of text.
LONGBLOB: length+4 bytes. Up to 4 GB of data.
ENUM: 1 or 2 bytes. This field can contain one of up to 65535 possible options, e.g. ENUM('abc','def','ghi').
SET: 1-8 bytes. This field can contain any number of a set of predefined possible values.
The following examples will make things quite clear on declaring Data Types within SQL statements.
Steps in Creating the Database using MySQL.
CREATE TABLE company_details (contact_id SMALLINT NOT NULL PRIMARY KEY, CompanyName
CHAR(25), Designation CHAR(15));
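The remaining tables are created in the same way. For instance, the names table (whose dumped definition also appears further below) can be created with an AUTO_INCREMENT primary key, along these lines:
CREATE TABLE names (contact_id SMALLINT NOT NULL AUTO_INCREMENT PRIMARY KEY, FirstName
CHAR(20), LastName CHAR(20), BirthDate DATE);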
Note: Here we assume that one person will have only one email address. Now if there were a situation where one person has
multiple addresses, this design would be a problem. We'd need another field which would keep values that indicated to whom the
email address belonged to. In this particular case email data ownership is indicated by the primary key. The same is true for
telephones. We are assuming that one person has only one home telephone and one work telephone number. This need not be true.
Similarly, one person could work for multiple companies at the same time, holding two different designations. In all these cases an
extra field will solve the issue. For now, however, let's work with this small design.
KEYS: The relationships between columns located in different tables are usually described through the use of keys.
As you can see we have a PRIMARY KEY in each table. The Primary key serves as a mechanism to refer to other fields
within the same row. In this case, the Primary key is used to identify a relationship between a row under
consideration and the person whose name is located inside the 'names' table. We use the AUTO_INCREMENT
statement only for the 'names' table as we need to use the generated contact_id number in all the other tables for
identification of the rows.
This type of table design where one table establishes a relationship with several other tables is known as a 'one to
many’ relationship.
In a 'many to many' relationship we could have several Auto Incremented Primary Keys in various tables with
several inter-relationships.
Foreign Key: A foreign key is a field in a table which is also the primary key of another table. The requirement that every
foreign key value matches an existing primary key value is commonly known as 'referential integrity'.
Execute the following commands to see the newly created tables and their contents.
To see the tables inside the database:
mysql> SHOW TABLES;
+-----------------------+
| Tables in contacts |
+-----------------------+
| address |
| company_details |
| email |
| names |
| telephones |
+----------------------+
5 rows in set (0.00 sec)
To see the columns within a particular table:
mysql>SHOW COLUMNS FROM address;
+---------------+-------------+------+-----+---------+-------+---------------------------------+
| Field | Type | Null | Key | Default | Extra | Privileges |
+---------------+-------------+------+-----+---------+-------+---------------------------------+
| contact_id | smallint(6) | | PRI | 0 | | select,insert,update,references |
| StreetAddress | char(50) | YES | | NULL | | select,insert,update,references |
| City | char(20) | YES | | NULL | | select,insert,update,references |
| State | char(20) | YES | | NULL | | select,insert,update,references |
| Zip | char(10) | YES | | NULL | | select,insert,update,references |
| Country | char(20) | YES | | NULL | | select,insert,update,references |
+---------------+-------------+------+-----+---------+-------+------------------ ---------------+
6 rows in set (0.00 sec)
So, we have the tables created and ready. Now we put in some data.
Let's start with the 'names' table as it uses a unique AUTO_INCREMENT field which in turn is used in the
other tables.
Before we start experimenting with manipulating the data let's look at how MySQL stores the Data.
To do this execute the following command from the shell prompt.
mysqldump contacts > contacts.sql
Note: The reverse operation for this command is:
mysql contacts < contacts.sql
The file generated is a text file that contains all the data and SQL instructions needed to recreate the same
database. As you can see, the SQL here is slightly different from what was typed in. Don't worry about this,
it's all good! It should also be obvious that this is a good way to back up your stuff.
# MySQL dump 8.2
#
# Host: localhost Database: contacts
#--------------------------------------------------------
# Server version 3.22.34-shareware-debug
#
# Table structure for table 'address'
#
CREATE TABLE address (
contact_id smallint(6) DEFAULT '0' NOT NULL,
StreetAddress char(50),
City char(20),
State char(20),
Zip char(10),
Country char(20),
PRIMARY KEY (contact_id)
);
#
# Dumping data for table 'address'
#
INSERT INTO address VALUES (1,'300 Yamila Ave.','Los Angeles','CA','300012','USA');
INSERT INTO address VALUES (2,'4000 Nikki St.','Boca Raton','FL','500034','USA');
INSERT INTO address VALUES (3,'404 Tia Blvd.','New York','NY','10011','USA');
#
# Table structure for table 'company_details'
#
CREATE TABLE company_details (
contact_id smallint(6) DEFAULT '0' NOT NULL,
CompanyName char(25),
Designation char(20),
PRIMARY KEY (contact_id)
);
#
# Dumping data for table 'company_details'
#
INSERT INTO company_details VALUES (1,'Xerox','New Business Manager');
INSERT INTO company_details VALUES (2,'Cabletron','Customer Support Eng');
INSERT INTO company_details VALUES (3,'Apple','Sales Manager');
#
# Table structure for table 'email'
#
CREATE TABLE email (
contact_id smallint(6) DEFAULT '0' NOT NULL,
Email char(20),
PRIMARY KEY (contact_id)
);
#
# Dumping data for table 'email'
#
INSERT INTO email VALUES (1,'[email protected]');
INSERT INTO email VALUES (2,'[email protected]');
INSERT INTO email VALUES (3,'[email protected]');
#
# Table structure for table 'names'
#
CREATE TABLE names (
contact_id smallint(6) DEFAULT '0' NOT NULL auto_increment,
FirstName char(20),
LastName char(20),
BirthDate date,
PRIMARY KEY (contact_id)
);
#
# Dumping data for table 'names'
#
INSERT INTO names VALUES (3,'Tia','Carrera','1975-09-18');
INSERT INTO names VALUES (2,'Nikki','Taylor','1972-03-04');
INSERT INTO names VALUES (1,'Yamila','Diaz','1974-10-13');
#
# Table structure for table 'telephones'
#
CREATE TABLE telephones (
contact_id smallint(6) DEFAULT '0' NOT NULL,
TelephoneHome char(20),
TelephoneWork char(20),
PRIMARY KEY (contact_id)
);
#
# Dumping data for table 'telephones'
#
INSERT INTO telephones VALUES (1,'333-50000','333-60000');
INSERT INTO telephones VALUES (2,'444-70000','444-80000');
INSERT INTO telephones VALUES (3,'555-30000','555-40000');
Let's try some SELECT statement variations:
To select all names whose corresponding contact_id is greater than 1.
mysql> SELECT * FROM names WHERE contact_id > 1;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
+------------+-----------+----------+------------+
2 rows in set (0.00 sec)
As a condition we can also use NOT NULL. This statement will return all names where there exists a
contact_id.
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.06 sec)
Results can be arranged in a particular way using the statement ORDER BY.
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL ORDER BY LastName;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 1 | Yamila | Diaz | 1974-10-13 |
| 2 | Nikki | Taylor | 1972-03-04 |
+------------+-----------+----------+------------+
3 rows in set (0.00 sec)
HAVING
Take a look at the first query using the statement WHERE and the second statement using the statement
HAVING.
mysql> SELECT * FROM names WHERE contact_id >=1;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 1 | Yamila | Diaz | 1974-10-13 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 3 | Tia | Carrera | 1975-09-18 |
+------------+-----------+----------+------------+
3 rows in set (0.03 sec)
mysql> SELECT * FROM names HAVING contact_id >=1;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.00 sec)
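HAVING is really meant for filtering on aggregate values computed with GROUP BY, something WHERE cannot do. For example, using the address data shown earlier, to list countries that appear more than once:
mysql> SELECT Country, COUNT(*) FROM address GROUP BY Country HAVING COUNT(*) > 1;
+---------+----------+
| Country | COUNT(*) |
+---------+----------+
| USA | 3 |
+---------+----------+
1 row in set (0.00 sec)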
Now let's work with multiple tables and see how information can be pulled out of the data.
mysql> SELECT names.contact_id, FirstName, LastName, Email FROM names, email WHERE
names.contact_id = email.contact_id;
+------------+-----------+----------+-------------------+
| contact_id | FirstName | LastName | Email |
+------------+-----------+----------+-------------------+
| 1 | Yamila | Diaz | [email protected] |
| 2 | Nikki | Taylor | [email protected] |
| 3 | Tia | Carrera | [email protected] |
+------------+-----------+----------+-------------------+
3 rows in set (0.11 sec)
mysql> SELECT DISTINCT names.contact_id, FirstName, Email, TelephoneWork FROM names, email,
telephones WHERE names.contact_id = email.contact_id AND email.contact_id = telephones.contact_id;
+------------+-----------+-------------------+---------------+
| contact_id | FirstName | Email | TelephoneWork |
+------------+-----------+-------------------+---------------+
| 1 | Yamila | [email protected] | 333-60000 |
| 2 | Nikki | [email protected] | 444-80000 |
| 3 | Tia | [email protected] | 555-40000 |
+------------+-----------+-------------------+---------------+
3 rows in set (0.05 sec)
So what's a JOIN?
JOIN is the action performed on multiple tables that returns a result as a table. It's what makes a database
'relational'.
There are several types of joins. Let's look at LEFT JOIN (OUTER JOIN) and RIGHT JOIN
Let's first check out the contents of the tables we're going to use
mysql> SELECT * FROM names;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.00 sec)
mysql> SELECT * FROM email;
+------------+-------------------+
| contact_id | Email |
+------------+-------------------+
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
+------------+-------------------+
3 rows in set (0.00 sec)
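A LEFT JOIN returns every row of the left table together with the matching rows of the right table (NULLs are returned where there is no match). A sketch against the two tables above:
mysql> SELECT names.contact_id, FirstName, LastName, Email FROM names LEFT JOIN email ON
names.contact_id = email.contact_id;
Since every contact_id in names has a matching row in email, the result here is the same three rows as the earlier two-table SELECT; a RIGHT JOIN simply reverses the roles of the two tables.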
BETWEEN
This conditional statement is used to select data where a certain related constraint falls within a certain range
of values. The following example illustrates its use.
mysql> SELECT * FROM names;
+------------+-----------+----------+------------+
| contact_id | FirstName | LastName | BirthDate |
+------------+-----------+----------+------------+
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
+------------+-----------+----------+------------+
3 rows in set (0.06 sec)
mysql> SELECT FirstName, LastName FROM names WHERE contact_id BETWEEN 2 AND 3;
+-----------+----------+
| FirstName | LastName |
+-----------+----------+
| Tia | Carrera |
| Nikki | Taylor |
+-----------+----------+
2 rows in set (0.00 sec)
ALTER
The ALTER statement is used to add a new column to an existing table or to make changes to it.
mysql> ALTER TABLE names ADD Age SMALLINT;
Query OK, 3 rows affected (0.11 sec)
Records: 3 Duplicates: 0 Warnings: 0
Now let's take a look at the 'ALTER'ed Table.
mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) | | PRI | 0 | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | smallint(6) | YES | | NULL | |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.06 sec)
But we don't require Age to be a SMALLINT type when a TINYINT would suffice. So we use another ALTER
statement.
mysql> ALTER TABLE names CHANGE COLUMN Age Age TINYINT;
Query OK, 3 rows affected (0.02 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) | | PRI | NULL | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | tinyint(4) | YES | | NULL | |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
MODIFY
You can also use the MODIFY statement to change column data types.
mysql> ALTER TABLE names MODIFY COLUMN Age SMALLINT;
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SHOW COLUMNS FROM names;
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| contact_id | smallint(6) | | PRI | NULL | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | smallint(6) | YES | | NULL | |
+------------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
To Rename a Table:
mysql> ALTER TABLE names RENAME AS mynames;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW TABLES;
+--------------------+
| Tables_in_contacts |
+--------------------+
| address |
| company_details |
| email |
| mynames |
| telephones |
+--------------------+
5 rows in set (0.00 sec)
We rename it back to the original name.
mysql> ALTER TABLE mynames RENAME AS names;
Query OK, 0 rows affected (0.01 sec)
UPDATE
The UPDATE command is used to change the value of a field in the rows of a table that match a condition.
mysql> UPDATE names SET Age ='23' WHERE FirstName='Tia';
Query OK, 1 row affected (0.06 sec)
Rows matched: 1 Changed: 1 Warnings: 0
The table after this UPDATE:
mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate | Age |
+------------+-----------+----------+------------+------+
| 3 | Tia | Carrera | 1975-09-18 | 23 |
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
3 rows in set (0.05 sec)
After another UPDATE changes Tia's Age to 24, the table looks like this:
mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate | Age |
+------------+-----------+----------+------------+------+
| 3 | Tia | Carrera | 1975-09-18 | 24 |
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
3 rows in set (0.00 sec)
DELETE
The DELETE statement removes the rows of a table that match the WHERE condition.
mysql> DELETE FROM names WHERE Age=23;
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM names;
+------------+-----------+----------+------------+------+
| contact_id | FirstName | LastName | BirthDate | Age |
+------------+-----------+----------+------------+------+
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
+------------+-----------+----------+------------+------+
2 rows in set (0.00 sec)
A DEADLY MISTAKE
mysql> DELETE FROM names;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM names;
Empty set (0.00 sec)
Without a WHERE clause, DELETE removes every row in the table. One more destructive tool...
DROP TABLE
mysql> DROP TABLE names;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW TABLES;
+--------------------+
| Tables_in_contacts |
+--------------------+
| address |
| company_details |
| email |
| telephones |
+--------------------+
4 rows in set (0.05 sec)
mysql> DROP TABLE address ,company_details, telephones;
Query OK, 0 rows affected (0.06 sec)
mysql> SHOW TABLES;
Empty set (0.00 sec)
As you can see, the table 'names' no longer exists. MySQL does not ask for confirmation or give a warning, so be careful.
Optimizing your Database: Databases tend to get large at some point or another, and here arises the issue of database optimization. Queries take longer and longer as the database grows, and certain things can be done to speed them up.
Clustering: The easiest method is that of 'clustering'. If you run a certain kind of query often, it is faster if the table contents are physically arranged in the same order in which the data is requested. To keep the tables in sorted order you need a clustering index. Some databases keep the data sorted automatically.
Ordered Indices: These are 'lookup' tables of sorts. For each column that may be of interest to you, you can create an ordered index. It should be noted that these optimization techniques produce extra system load: the index has to be maintained every time the underlying data changes.
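For instance (a sketch only; the index name idx_lastname is arbitrary), an ordered index on the LastName column of the names table used earlier could be created like this:
mysql> CREATE INDEX idx_lastname ON names (LastName);
Queries that filter or sort on LastName can then use the index instead of scanning the whole table; the trade-off is that the index must be maintained on every INSERT, UPDATE and DELETE. (In InnoDB, the primary key itself acts as the clustering index, so rows are stored in primary-key order.)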
There are additional methods, such as B-trees and hashing, which you may like to read up on but which will not be discussed here.
Replication: Replication is the term given to the process by which databases synchronize with each other. In this process one database updates its own data with respect to another, or according to certain criteria for updates specified by the programmer. Replication can be used under various circumstances; examples include safety and backup, and providing a copy of the database at a location closer to certain users.
Transactions are expected to satisfy the ACID properties:
(A) Atomicity: Even if an action consists of multiple steps, it is treated as one indivisible operation; either every step happens or none of them does.
(C) Consistency: The database exists in a valid and accurate operating state before and after a transaction.
(I) Isolation: Processes within one transaction are independent and cannot interfere with those in other transactions.
(D) Durability: Once a transaction commits, its changes persist even if the system subsequently fails.
Now, if a large number of tables were involved in a particular action, say 50, all 50 tables would stay locked until the transaction completed. To improve on this, another technique called two-phase locking (2PL) is used. In this method of locking, locks are acquired only when needed but are released only when the transaction is completed. This is done to make sure that altered data can be safely restored if the transaction fails for any reason. This technique can also result in problems such as "deadlocks".
In this case, processes requiring the same resources lock each other up, each preventing the other from completing an action. The options here are to abort one of them, or to let the programmer handle it. Older MySQL versions provided transactions by building the Berkeley DB libraries into the server (current versions use the InnoDB storage engine for this), so it is the source distribution you would want for such a MySQL installation. Read the MySQL manual on enabling transaction support.
7. Beyond MySQL
Beyond RDBMS
Distributed Databases (DDB)
A distributed database is a collection of several, logically interrelated databases located at multiple sites of a computer network. A distributed database management system permits the management of such a database and makes the operation transparent to the user. Good examples of distributed databases would be those used by banks or multinational firms with several office locations, where each distributed data system works only with the data that is relevant to its operations. DDBs have the full functionality of any DBMS. It is also important to know that a distributed database is considered to be actually one database rather than a set of discrete files, and the data within a distributed database is logically interrelated.
As an example of organizing related data into tables, suppose a single table stores, for every student, the student's own details together with Course Information (Course Title, Course Professor, CourseID). We can see that for every student, the Course Information is repeated for each course, so if a student has three courses you will need to add another set of columns for Course Title, Course Professor and CourseID. Student Information and Course Information can therefore be considered two broad groups and split into separate tables, for example:
Table Name: Student Information
StudentID (Primary Key)
StudentName
AdvisorName
with the course-related columns (CourseID, Course Title, Course Professor) kept in a separate Course Information table.
8. Transaction Control
Transaction: A collection of actions that transforms the DB from one consistent state into another consistent state;
during the execution, the DB might be inconsistent. A transaction can be defined as a logical unit of work on the
database. This may be an entire program, a piece of a program or a single command (like the SQL commands such as
INSERT or UPDATE) and it may engage in any number of operations on the database.
Consider an example of the steps involved in a simple transaction of deducting 100 Rupees from Amit's account:
Open_Acc (Amit)
OldBal = Amit.bal
NewBal = OldBal - 100
Amit.bal = NewBal
Close_Acc (Amit)
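In SQL, the same unit of work would typically be wrapped in an explicit transaction. A minimal sketch, assuming a hypothetical accounts table with holder and balance columns:
mysql> START TRANSACTION;
mysql> UPDATE accounts SET balance = balance - 100 WHERE holder = 'Amit';
mysql> COMMIT;
If any step fails before the COMMIT, issuing ROLLBACK undoes the partial changes, so the account is never left in a half-updated state.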
Typical problems that arise when transactions run concurrently include:
Uncommitted data
• Occurs when two transactions are executed concurrently and the first transaction is rolled back after the second transaction has already accessed the uncommitted data.
Inconsistent retrievals
• Occurs when a transaction accesses data before and after one or more other transactions finish working with that data.
8.3 Schedules
Schedule – a sequence of instructions that specifies the chronological order in which the instructions of concurrent transactions are executed. A schedule for a set of transactions must consist of all instructions of those transactions and must preserve the order in which the instructions appear in each individual transaction.
A transaction that successfully completes its execution will have a commit instruction as the last statement (will be
omitted if it is obvious). A transaction that fails to successfully complete its
execution will have an abort instruction as the last statement (will be omitted if it is obvious).
The execution sequences just described are called schedules. They represent the chronological order in which
instructions are executed in the system. Clearly, a schedule for a set of transactions must consist of all instructions of
those transactions, and must preserve the order in which the instructions appear in each individual transaction.
A serial schedule is one in which the instructions of the transactions are not interleaved: one transaction runs to completion before the next begins (Schedule S2 in the next section is an example).
8.4 Serializability
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S
and S´ are conflict equivalent. We say that a schedule S is serializable if it is conflict equivalent to a serial schedule.
Schedule S1 below can be transformed into S2, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions.
Therefore, Schedule S1 is serializable.
Schedule S1:
T1              T2
read (A)
write (A)
                read (A)
                write (A)
read (B)
write (B)
                read (B)
                write (B)
Schedule S2:
T1              T2
read (A)
write (A)
read (B)
write (B)
                read (A)
                write (A)
                read (B)
                write (B)
Schedule S3 is not serializable: We are unable to swap instructions in the schedule to obtain either the serial schedule
<T3, T4>, or the serial schedule <T4,T3>
Schedule S3:
T3              T4
read (P)
                write (P)
write (P)
8.5 Recoverability
Recoverable schedule: For each pair of transactions Ti and Tj, where Tj reads data items written by Ti, Ti must commit
before Tj commits.
The following schedule is not recoverable if T6 commits immediately after its read of A:
T5              T6
read (A)
write (A)
                read (A)
read (B)
If T5 should abort, T6 would have read (and possibly shown to the user) an inconsistent database state. Hence, the database must ensure that schedules are recoverable.
T7              T8              T9
read (A)
read (B)
write (A)
                read (A)
                write (A)
                                read (A)
Transaction T7 writes a value of A that is read by transaction T8. Transaction T8 writes a value of A that is read by transaction T9. Suppose at this point T7 fails. T7 must be rolled back; since T8 is dependent on T7, T8 must be rolled back; and since T9 is dependent on T8, T9 must be rolled back as well.
This phenomenon, in which a single transaction failure leads to a series of transaction rollbacks, is called Cascading rollback.
• Cascading rollback is undesirable, since it leads to the undoing of a significant amount of work.
• It is desirable to restrict the schedules to those where cascading rollbacks cannot occur, Such schedules are
called Cascadeless Schedules.
• Formally, a cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
• Every cascadeless schedule is also a recoverable schedule.
Cascadeless Schedule
T10             T11
read (A)
write (A)
read (B)
commit
                read (A)
8.7 Implementation of Isolation Levels
The goal of concurrency-control policies is to provide a high degree of concurrency, while ensuring that all schedules
that can be generated are conflict or view serializable, recoverable, and cascadeless.
Locking
Instead of locking the entire database, a transaction could, instead, lock only those data items that it accesses. Under
such a policy, the transaction must hold locks long enough to ensure serializability, but for a period short enough not
to harm performance excessively.
Further improvements to locking result if we have two kinds of locks: shared and exclusive. Shared locks are used for
data that the transaction reads and exclusive locks are used for those it writes. Many transactions can hold shared
locks on the same data item at the same time, but a transaction is allowed an exclusive lock on a data item only if no
other transaction holds any lock (regardless
of whether shared or exclusive) on the data item. This use of two modes of locks along with two-phase locking allows
concurrent reading of data while still ensuring serializability.
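MySQL's InnoDB engine exposes both lock modes directly. As a sketch (reusing the names table from earlier purely for illustration):
mysql> START TRANSACTION;
mysql> SELECT Age FROM names WHERE contact_id = 2 LOCK IN SHARE MODE;
mysql> SELECT Age FROM names WHERE contact_id = 2 FOR UPDATE;
mysql> UPDATE names SET Age = 30 WHERE contact_id = 2;
mysql> COMMIT;
The first SELECT takes a shared lock on the row, so other transactions can still read it but not modify it; the FOR UPDATE variant takes an exclusive lock, blocking writers and any readers that request locks until this transaction commits.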
Snapshot Isolation
By maintaining more than one version of a data item, it is possible to allow a transaction to read an old version of a
data item rather than a newer version written by an uncommitted transaction or by a transaction that should come
later in the serialization order. There are a variety of multi-version concurrency control techniques. One in particular,
called snapshot isolation, is widely used in practice.
Snapshot isolation ensures that attempts to read data never need to wait (unlike locking). Read-only transactions
cannot be aborted; only those that modify data run a slight risk of aborting. Since each transaction reads its own
version or snapshot of the database, reading data does not cause subsequent update attempts by other transactions to
wait (unlike locking). Since most transactions are read-only (and most others read more data than they update), this is
often a major source of performance improvement as compared to locking.
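In practice this is exposed through the transaction isolation level. In MySQL/InnoDB, for example, the default REPEATABLE READ level gives each transaction consistent snapshot reads; it can also be set explicitly (a sketch):
mysql> SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
mysql> START TRANSACTION;
mysql> SELECT * FROM names;
mysql> COMMIT;
Plain SELECTs inside that transaction then read from the same snapshot taken at the first read, so they neither block writers nor are blocked by them.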
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions. Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on it. If a lock cannot be granted, the requesting transaction is made to wait until all incompatible locks held by other transactions have been released; the lock is then granted.
Drawbacks
The potential for deadlock exists in most locking protocols; deadlocks are a necessary evil.
Starvation is also possible if the concurrency-control manager is badly designed.
For example:
• A transaction may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an S-lock on the same item.
• The same transaction is repeatedly rolled back due to deadlocks.
The concurrency-control manager can be designed to prevent starvation.
One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction
issue lock and unlock requests in two phases:
1. Growing phase. A transaction may obtain locks, but may not release any lock.
2. Shrinking phase. A transaction may release locks, but may not obtain any new locks.
Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once the transaction releases
a lock, it enters the shrinking phase, and it can issue no more lock requests.
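For instance, a transaction that updates two items A and B could, under two-phase locking, issue its requests in this order (using lock-X for an exclusive lock): lock-X(A); read(A); write(A); lock-X(B); read(B); write(B); unlock(A); unlock(B). Every lock is acquired before any lock is released, so the protocol is respected; the point just after the last lock is acquired is the transaction's lock point.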
Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the operations that transactions are about to execute and analyzes whether they could create a deadlock situation. If it finds that a deadlock might occur, the transaction is not allowed to proceed.
There are deadlock prevention schemes which use the time-stamp ordering of transactions in order to predict and avoid a deadlock situation.
WAIT-DIE SCHEME:
In this scheme, if a transaction requests a lock on a resource (data item) that is already held with a conflicting lock by some other transaction, one of two possibilities may occur:
• If TS(Ti) < TS(Tj), that is, Ti, which is requesting the conflicting lock, is older than Tj, then Ti is allowed to wait until the data item is available.
• If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, then Ti dies. Ti is restarted later with a random delay but with the same timestamp.
• This scheme allows the older transaction to wait but kills the younger one.
WOUND-WAIT SCHEME:
In this scheme, if a transaction requests a lock on a resource (data item) that is already held with a conflicting lock by some other transaction, one of two possibilities may occur:
• If TS(Ti) < TS(Tj), that is, Ti, which is requesting the conflicting lock, is older than Tj, then Ti forces Tj to be rolled back, that is, Ti wounds Tj. Tj is restarted later with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item.
In both schemes, it is the transaction that entered the system later (the younger one) that is aborted.
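A quick worked example: suppose TS(T1) = 5 and TS(T2) = 10, so T1 is the older transaction. If the younger T2 requests an item locked by T1, then under wait-die T2 dies (and is later restarted with the same timestamp), while under wound-wait T2 simply waits. If instead the older T1 requests an item locked by T2, then under wait-die T1 is allowed to wait, while under wound-wait T1 wounds T2: T2 is rolled back and T1 gets the item.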
Deadlock Detection
Deadlocks can be described precisely in terms of a directed graph called a wait-for graph. This graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all the transactions in the system. Each element in the set E of edges is an ordered pair Ti → Tj. If Ti → Tj is in E, then there is a directed edge from transaction Ti to Tj, implying that transaction Ti is waiting for transaction Tj to release a data item that it needs.
When transaction Ti requests a data item currently being held by transaction Tj, then the edge Ti → Tj is inserted in
the wait-for graph. This edge is removed only when transaction Tj is no longer holding a data item needed by
transaction Ti.
A deadlock exists in the system if and only if the wait-for graph contains a cycle. Each transaction involved in the cycle is said to be deadlocked. To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an algorithm that searches for a cycle in the graph.
For instance, suppose the wait-for graph currently contains only the edges T3 → T2 and T2 → T4. Since the graph has no cycle, the system is not in a deadlock state.
Suppose now that transaction T4 is requesting an item held by T3. The edge T4 → T3 is added to the wait-for graph, resulting in a new system state. This time the graph contains the cycle T2 → T4 → T3 → T2.
Deadlock Recovery
When a detection algorithm determines that a deadlock exists, the system must recover from the deadlock. The most
common solution is to roll back one or more transactions to break the deadlock. Choosing which transaction to abort
is known as Victim Selection.
Selection of a victim: In the wait-for graph discussed above, transactions T2, T3 and T4 are deadlocked. In order to remove the deadlock, one of these three transactions must be rolled back.
We should roll back those transactions that will incur the minimum cost. When a deadlock is detected, the choice of which transaction to abort can be made using the following criteria:
• The transaction which holds the fewest locks
• The transaction that has done the least work
• The transaction that is farthest from completion
Rollback: Once we have decided that a particular transaction must be rolled back, we must determine how far this
transaction should be rolled back.
The simplest solution is a total rollback: Abort the transaction and then restart it. However, it is more effective to roll
back the transaction only as far as necessary to break the deadlock. Such partial rollback requires the system to
maintain additional information about the state of all the running transactions. Specifically, the sequence of lock
requests/grants and updates performed by the transaction needs to be recorded. The deadlock detection mechanism
should decide which locks the selected transaction needs to release in order to break the deadlock. The selected
transaction must be rolled back to the point where it obtained the first of these locks, undoing all actions it took after
that point. The recovery mechanism must be capable of performing such partial rollbacks. Furthermore, the
transactions must be capable of resuming execution after a partial rollback.
Problem of Starvation: In a system where the selection of victims is based primarily on cost factors, it may happen that the same transaction is always picked as a victim. As a result, this transaction never completes its designated task. To avoid starvation, a transaction should be able to be picked as a victim only a (small) finite number of times. The most common solution is to include the number of rollbacks in the cost factor.
PRACTICE SET
1. Mechanism developed to enforce users to enter data in the required format is?
(a) Data validation (b) Input mask (c) Criteria (d) Data verification (e) None of these
2. What is the size of the Date & Time field type?
(a) 1 (b) 8 (c) 255 (d) 50 (e) None of these
3. The options like Save, Open Database, Print are available in
(a) Home tab (b) Backstage View tab (c) None of these (d) Database Tools tab (e) File menu
4. Which of the following methods can be used to add more tables in a database?
(a) Design View (b) Table Wizard (c) By Entering Data (d) All of the above (e) None of these
5. The feature that the database allows to access only certain records in the database is?
(a) Forms (b) Reports (c) Queries (d) Tables (e) None of these
6. You can find the Sort & Filter group of commands in?
(a) Create ribbon (b) Home ribbon (c) Database tools ribbon (d) Fields ribbon (e) None of these
7. Arrange according to size
(a) Record, field, byte, bit (b) Bit, field, byte, record (c) Field, byte, record, bit (d) Byte, bit, record, field (e) None of these
8. What is the maximum allowed field size for Boolean (Yes/No) fields?
(a) 1 (b) 8 (c) 50 (d) 255 (e) None of these
9. What is a relational database?
(a) A database structured to recognize relations between stored items of information; it is based on the relational model of data
(b) A database that is not related to other databases
(c) A database to store human relations
(d) Both (a) and (b) (e) None of these
10. What is the best data type for a field that stores mobile numbers?
(a) Memo (b) Number (c) Date/Time (d) Text (e) None of these
11. Which filter method lets you filter records based on a criterion you specify?
(a) Filter by form (b) Filter by selection (c) Auto filter (d) Advanced filter (e) None of these
12. Which of the following field types is used to store photographs of employees?
(a) Memo (b) None of these (c) OLE (d) Photo (e) Picture
13. Which of the following is not a function of the DBA?
(a) Network Maintenance (b) Routine Maintenance (c) Schema Definition (d) Authorization for data access (e) None of these
14. Which of the following is a Data Model?
(a) Entity-Relationship model (b) Relational data model (c) Object-Based data model (d) Network model (e) All of the above
15. Which of the following is not a Modification of the Database?
(a) Deletion (b) Insertion (c) Sorting (d) Updating (e) None of these
16. Which of the following represents a relationship among a set of values?
(a) A row (b) A table (c) A field (d) A column (e) None of these
17. A column header is referred to as?
(a) Table (b) Relation (c) Attributes (d) Domain (e) None of these
18. In mathematical terms a table is referred to as?
(a) Relation (b) Attribute (c) Tuple (d) Domain (e) None of these
19. Minimal Super keys are called?
(a) Schema keys (b) Candidate keys (c) Domain keys (d) Attribute keys (e) None of these
20. The Primary key must be?
(a) Non Null (b) Unique (c) Either (a) or (b) (d) Both (a) and (b) (e) Null
21. By a Grouped Report you understand-
(a) Type of report generated by the Report Wizard
(b) Type of report that presents records sorted in ascending or descending order as you specify
(c) Type of report that displays data grouped by fields you specified
(d) Both (a) and (b) (e) None of these
22. Which of the following is not a level of data abstraction?
(a) Physical Level (b) Critical Level (c) Logical Level (d) View Level (e) None of these
23. Data Manipulation Language enables users to
(a) Retrieval of information stored in the database
(b) Only (a) and (b)
(c) Insertion of new information into the database
(d) Deletion of information from the database
(e) All of the above
24. Which of the following is not a Storage Manager component?
(a) Transaction Manager (b) Logical Manager (c) Buffer Manager (d) None of these (e) File Manager
25. To display an associated record from a related table in datasheet view, you can?
(a) Double click the record
(b) Apply the filter by form command
(c) Single click on the expand indicator (+) next to the record
(d) Double click on the expand indicator (+) next to the record
(e) None of these
26. A Relation is a
(a) Subset of a Cartesian product of a list of attributes
(b) Subset of a Cartesian product of a list of domains
(c) Subset of a Cartesian product of a list of tuples
(d) Subset of a Cartesian product of a list of relations
(e) None of these
27. Who proposed the relational model?
(a) Bill Gates (b) E.F. Codd (c) Herman Hollerith (d) Charles Babbage
1. Data warehousing:
1.1 Introduction
Data warehousing is combining data from multiple sources into one comprehensive and easily manipulated database. The primary aim of data warehousing is to provide businesses with analytic results from data mining, OLAP, scorecarding and reporting.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support
of management's decision making process.
Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing
operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
Operational Data:
➢ Focusing on transactional function such as bank card withdrawals and deposits
➢ Detailed
➢ Updateable
➢ Reflects current data
Informational Data:
➢ Focusing on providing answers to problems posed by decision makers
➢ Summarized
➢ Non updateable
Important Terminology:
Enterprise Data Warehouse- It collects all information about subjects (customers, products, sales, assets, personnel) that spans the entire organization.
Data Mart- Departmental subsets that focus on selected subjects. A data mart is a segment of a data warehouse that
can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales,
payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller
than the corporate data warehouse.
Decision Support System (DSS)- Information technology to help the knowledge worker (executive, manager and analyst) make faster and better decisions.
Drill-down- Traversing the summarization levels from highly summarized data down to the underlying current or old detail.
Meta data- Data about data. It contains the location and description of warehouse system components: names, definitions, structure, etc.
Data warehouse architecture is based on a relational database management system server that functions as the central
repository for informational data. In the data warehouse architecture, operational data and processing is completely
separate from data warehouse processing.
NOTE on Operational Data Store: Operational Data store (ODS) is an architecture concept to support
day-to-day operational decision support and contains current value data propagated from operational applications
⚫ ODS is subject-oriented, similar to a classic definition of a Data warehouse
⚫ ODS is integrated, in the same sense as a data warehouse
However
⚫ ODS is volatile, while a data warehouse is nonvolatile
⚫ ODS contains very current data, while a data warehouse contains both current and historical data.
⚫ ODS contains detailed data only, and not precalculated summaries and aggregates, as is typical for a data
warehouse.
1.5. Data Warehouse Design Approaches:
Designing or building a data warehouse can follow either of two approaches. These approaches are notably known as:
• Top-Down Approach- In the top-down approach suggested by Bill Inmon, we build a centralized repository to house corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy. The central repository for corporate-wide data helps us maintain a single version of the truth. The data in the EDW is stored at the most detailed level, for two reasons: flexibility to be used by multiple departments, and flexibility to cater to future requirements.
• Bottom-Up Approach- The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data warehouse. Here we build the data marts separately, at different points in time, as and when the requirements of a specific subject area are clear. The data marts are then integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts; a conformed dimension or a conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. A conformed dimension means exactly the same thing with every fact table it is joined to.
A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.
The bottom-up approach helps us incrementally build the warehouse by developing and integrating data marts as and when their requirements are clear; we do not have to wait until the overall requirements of the warehouse are known.
Meta Data: Meta data is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two types:
Technical Meta data: It contains information about data warehouse data used by the warehouse designer and administrator to carry out development and management tasks. It includes:
• Info about data stores
• Transformation descriptions, that is, the mapping methods from the operational database to the warehouse database
• Warehouse Object and data structure definitions for target data
• The rules used to perform clean up, and data enhancement
• Data mapping operations
• Access authorization, backup history, archive history, info delivery history, data acquisition history, data access
etc.,
Business Meta data: It contains information that describes, for business users, the data stored in the data warehouse. It includes:
• Subject areas, and info object type including queries, reports, images, video, audio clips etc.
• Internet home pages
• Info related to info delivery system
• Data warehouse operational info such as ownerships, audit trails etc.,
Meta data helps users to understand the content and find the data. Meta data is stored in a separate data store known as the informational directory or Meta data repository, which helps to integrate, maintain and view the contents of the data warehouse.
A metadata repository is a database of data about data (metadata). The purpose of the metadata repository is
to provide a consistent and reliable means of access to data. The repository itself may be stored in a physical
location or may be a virtual database, in which metadata is drawn from separate sources.
Its purpose is to provide information to business users for decision making. There are five categories of such end-user tools:
1. Data query and reporting tools- Query and reporting tools are used to generate queries and reports.
A data mart is a subset of a data warehouse that supports the requirements of a particular department or business function.
The characteristic that differentiates a data mart from a data warehouse is that a data mart focuses only on the requirements of users associated with one department or business function.
OLAP (Online Analytical Processing): OLAP is an approach to answering multidimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.
OLAP databases are highly de-normalized, which makes the data redundant but helps to improve analytic performance. Even so, processing can be slow and may take many hours depending on the data involved.
Types of OLAP:
➢ Relational OLAP
➢ Multidimensional OLAP
➢ Hybrid OLAP
OLTP (Online Transaction Processing): OLTP is a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. It manages current data and stores all of the given data. It is characterized by a large number of short online transactions and quick, real-time responses to users.
The main purpose of an OLTP system is to control and run the fundamental business tasks.
OLAP vs OLTP:
• OLAP technology is used to perform complex analysis of the data in a data warehouse; OLTP technology is used to perform updates on operational or transactional systems (e.g. a point-of-sale system).
• OLAP holds historical and only relevant data; OLTP holds current data of all types.
• OLAP has few concurrent users; OLTP has many concurrent users.
• OLAP systems are used by knowledge workers such as executives, managers and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
• OLAP provides a summarized and multidimensional view of data; OLTP provides a detailed and flat relational view of data.
• OLAP is based on the Star, Snowflake and Fact Constellation schemas; OLTP is highly normalized with many tables.
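As an illustration of the star schema mentioned above, a warehouse for sales data might use one fact table surrounded by dimension tables; the table and column names below are hypothetical:
mysql> CREATE TABLE dim_product (product_id INT PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(30));
mysql> CREATE TABLE dim_date (date_id INT PRIMARY KEY, full_date DATE, month_name VARCHAR(10), year_no SMALLINT);
mysql> CREATE TABLE fact_sales (product_id INT, date_id INT, units_sold INT, revenue DECIMAL(10,2));
mysql> SELECT p.category, d.year_no, SUM(f.revenue) FROM fact_sales f, dim_product p, dim_date d WHERE f.product_id = p.product_id AND f.date_id = d.date_id GROUP BY p.category, d.year_no;
The fact table holds the measures (units_sold, revenue) keyed by the dimensions, and a typical OLAP query joins the fact table to its dimensions and aggregates, as in the last statement.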
2. Data Mining:
Data mining is the process of extracting previously unknown, valid and actionable information from large sets of data and then using that information to make crucial business decisions.
Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data. The focus of data mining is to find information that is hidden and unexpected.
2.1. Data Mining Techniques:
➢ Association
➢ Classification
➢ Clustering
➢ Prediction
➢ Sequential Patterns
➢ Decision Trees
➢ Combinations
Various fields use data mining technologies because they allow fast access to valuable information hidden in vast amounts of data. Data mining technologies have been applied successfully in many areas:
• Financial Data Analysis:
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases: design and construction of data warehouses for multidimensional data analysis and data mining; loan payment prediction and customer credit policy analysis; classification and clustering of customers for targeted marketing; detection of money laundering and other financial crimes.
• Retail Industry:
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web. Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction.
• Telecommunication Industry:
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, email, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is expanding quickly. This is why data mining has become very important in helping to understand the business. Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service.