R19 DBMS Material
STUDY MATERIAL
II B.TECH II YEAR
[JNTU-K R19 REGULATION]
Prepared by
M. Ganesh Babu
Assistant Professor
Dept. CSE
UNIT I
Database is a collection of related data and data is a collection of facts and figures that can be
processed to produce information.
Mostly, data represents recordable facts. Data aids in producing information, which is based on facts. For example, if we have data about the marks obtained by all students, we can then draw conclusions about the toppers and the average marks.
A database management system stores data in such a way that it becomes easier to retrieve,
manipulate, and produce information.
CHARACTERISTICS OF DBMS
Traditionally, data was organized in file formats. DBMS was a new concept then, and all the research was done to make it overcome the deficiencies of the traditional style of data management.
A modern DBMS has the following characteristics −
Real-world entity − A modern DBMS is more realistic and uses real-world entities to
design its architecture. It uses the behavior and attributes too. For example, a school
database may use students as an entity and their age as an attribute.
Relation-based tables − DBMS allows entities and relations among them to form tables.
A user can understand the architecture of a database just by looking at the table names.
Isolation of data and application − A database system is entirely different from its data. A database is an active entity, whereas data is said to be passive, on which the database works and organizes. DBMS also stores metadata, which is data about data, to ease its own processes.
Less redundancy − DBMS follows the rules of normalization, which splits a relation when any of its attributes has redundant values. Normalization is a mathematically rich and scientific process that reduces data redundancy.
Consistency − Consistency is a state where every relation in a database remains consistent. There exist methods and techniques that can detect attempts to leave the database in an inconsistent state. A DBMS can provide greater consistency than earlier forms of data-storing applications like file-processing systems.
Query Language − DBMS is equipped with a query language, which makes it more efficient to retrieve and manipulate data. A user can apply as many different filtering options as required to retrieve a set of data. This was traditionally not possible where a file-processing system was used.
ACID Properties − DBMS follows the concepts of Atomicity, Consistency, Isolation,
and Durability (normally shortened as ACID). These concepts are applied on
transactions, which manipulate data in a database. ACID properties help the database stay
healthy in multi-transactional environments and in case of failure.
Multiuser and Concurrent Access − DBMS supports a multi-user environment and allows users to access and manipulate data in parallel. Though there are restrictions on transactions when users attempt to handle the same data item, users are always unaware of them.
Multiple views − DBMS offers multiple views for different users. A user in the Sales department will have a different view of the database than a person working in the Production department. This feature enables the users to have a focused view of the database according to their requirements.
Security − Features like multiple views offer security to some extent where users are
unable to access data of other users and departments. DBMS offers methods to impose
constraints while entering data into the database and retrieving the same at a later stage.
DBMS offers many different levels of security features, which enables multiple users to
have different views with different features. For example, a user in the Sales department
cannot see the data that belongs to the Purchase department. Additionally, how much of the Sales department's data should be displayed to the user can also be managed. Since a DBMS does not store data on disk in the plain form that traditional file systems use, it is very hard for miscreants to break the code.
DBMS ARCHITECTURE
In 1-tier architecture, the DBMS is the only entity where the user directly sits on the DBMS and
uses it. Any changes done here will directly be done on the DBMS itself. It does not provide
handy tools for end-users. Database designers and programmers normally prefer to use single-
tier architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which the DBMS
can be accessed. Programmers use 2-tier architecture where they access the DBMS by means of
an application. Here the application tier is entirely independent of the database in terms of
operation, design, and programming.
3-tier Architecture
A 3-tier architecture separates its tiers from each other based on the complexity of the users and
how they use the data present in the database. It is the most widely used architecture to design a
DBMS.
Database (Data) Tier − At this tier, the database resides along with its query processing
languages. We also have the relations that define the data and their constraints at this
level.
Application (Middle) Tier − At this tier reside the application server and the programs
that access the database. For a user, this application tier presents an abstracted view of the
database. End-users are unaware of any existence of the database beyond the application.
At the other end, the database tier is not aware of any other user beyond the application
tier. Hence, the application layer sits in the middle and acts as a mediator between the
end-user and the database.
User (Presentation) Tier − End-users operate on this tier and they know nothing about
any existence of the database beyond this layer. At this layer, multiple views of the
database can be provided by the application. All views are generated by applications that
reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
DATA MODELS
Data models define how the logical structure of a database is modeled. Data Models are
fundamental entities to introduce abstraction in a DBMS. Data models define how data is
connected to each other and how they are processed and stored inside the system.
The very first data models were flat data models, where all the data used was kept in the same plane. Earlier data models were not very scientific; hence, they were prone to introducing lots of duplication and update anomalies.
Entity-Relationship Model
Entity-Relationship (ER) Model is based on the notion of real-world entities and relationships among them. While formulating a real-world scenario into the database model, the ER Model creates entity sets, relationship sets, general attributes, and constraints.
ER Model is based on mapping cardinalities −
one to one
one to many
many to one
many to many
Relational Model
The most popular data model in DBMS is the Relational Model. It is a more scientific model than the others. This model is based on first-order predicate logic and defines a table as an n-ary relation.
Database Schema
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are associated.
It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationships among them. It contains a descriptive detail of the database, which can be depicted by means of schema diagrams. It is the database designers who design the schema to help programmers understand the database and make it useful.
Physical Database Schema − This schema pertains to the actual storage of data and its
form of storage like files, indices, etc. It defines how the data will be stored in a
secondary storage.
Logical Database Schema − This schema defines all the logical constraints that need to
be applied on the data stored. It defines tables, views, and integrity constraints.
Database Instance
It is important that we distinguish these two terms individually. Database schema is the skeleton
of database. It is designed when the database doesn't exist at all. Once the database is
operational, it is very difficult to make any changes to it. A database schema does not contain
any data or information.
A database instance is a state of operational database with data at any given time. It contains a
snapshot of the database. Database instances tend to change with time. A DBMS ensures that its
every instance (state) is in a valid state, by diligently following all the validations, constraints,
and conditions that the database designers have imposed.
If a database system is not multi-layered, then it becomes difficult to make any changes in the
database system. Database systems are designed in multi-layers as we learnt earlier.
Data Independence
A database system normally contains a lot of data in addition to users’ data. For example, it
stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult
to modify or update a set of metadata once it is stored in the database. But as a DBMS expands,
it needs to change over time to satisfy the requirements of the users. If the entire data were interdependent, modifying it would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one layer, it does
not affect the data at another level. This data is independent but mapped to each other.
Logical data independence is a kind of mechanism that liberates the logical schema from the actual data stored on the disk. If we make some changes to the table format, it should not change the data residing on the disk.
Physical data independence is the corresponding power to change the physical storage without impacting the logical data or schemas. For example, in case we want to change or upgrade the storage system itself − suppose we want to replace hard-disks with SSDs − it should not have any impact on the logical data or schemas.
Advantages of DBMS
The database management system has a number of advantages as compared to the traditional computer file-based processing approach. The DBA must keep in mind these benefits or capabilities while designing databases and monitoring the DBMS.
The Main advantages of DBMS are described below.
Controlling Data Redundancy
In non-database systems, each application program has its own private files. In this case, duplicated copies of the same data are created in many places. In DBMS, all the data of an organization is integrated into a single database file. The data is recorded in only one place in the database and is not duplicated.
Sharing of Data
In DBMS, data can be shared by authorized users of the organization. The database administrator
manages the data and gives rights to users to access the data. Many users can be authorized to access
the same piece of information simultaneously. Remote users can also share the same data. Similarly, the data of the same database can be shared between different application programs.
Data Consistency
By controlling the data redundancy, the data consistency is obtained. If a data item appears only once,
any update to its value has to be performed only once and the updated value is immediately available to
all users. If the DBMS has controlled redundancy, the database system enforces consistency.
Integration of Data
In a database management system, data is stored in tables. A single database contains multiple tables and relationships can be created between tables (or associated data entities). This makes it easy to retrieve and update data.
Integrity Constraints
Integrity constraints or consistency rules can be applied to the database so that only correct data can be entered into the database. The constraints may be applied to a data item within a single record or they may be applied to relationships between records.
Forms
A form is a very important object of DBMS. You can create forms very easily and quickly in DBMS. Once a form is created, it can be used many times and it can be modified very easily. The created forms are also saved along with the database and behave like a software component. A form provides a very easy (user-friendly) way to enter data into the database, edit data, and display data from the database. Non-technical users can also perform various operations on the database through forms without going into the technical details of the database.
Report Writers
Most of the DBMSs provide report writer tools used to create reports. The users can create reports very easily and quickly. Once a report is created, it can be used many times and it can be modified very easily. The created reports are also saved along with the database and behave like a software component.
Control Over Concurrency
In a computer file-based system, if two users are allowed to access data simultaneously, it is possible that they will interfere with each other. For example, if both users attempt to perform an update operation on the same record, then one may overwrite the values recorded by the other. Most database management systems have sub-systems to control the concurrency so that transactions are always recorded with accuracy.
Backup and Recovery Procedures
In a computer file-based system, the user creates backups of data regularly to protect the valuable data from damage due to failures of the computer system or application program. This is a very time-consuming method if the amount of data is large. Most of the DBMSs provide 'backup and recovery' sub-systems that automatically create backups of data and restore data if required.
Data Independence
The separation of data structure of database from the application program that uses the data is called
data independence. In DBMS, you can easily change the structure of database without modifying the
application program.
Advantages of DBMS
One of the main advantages of using a database system is that the organization can exert, via the DBA, centralized management and control over the data. The database administrator is the focus of this centralized control. Any application requiring a change in the structure of a data record requires an arrangement with the DBA, who makes the necessary modifications. Such modifications do not affect other applications or users of the record in question.
Reduction of Redundancies:
Centralized control of data by the DBA avoids unnecessary duplication of data and effectively
reduces the total amount of data storage required.
It also eliminates the extra processing necessary to trace the required data in a large mass of data.
Elimination of Inconsistencies:
The main advantage of avoiding duplication is the elimination of inconsistencies that tend to be
present in redundant data files.
Any redundancies that exist in the DBMS are controlled and the system ensures that these
multiple copies are consistent.
Shared Data:
A database allows the sharing of data under its control by any number of application programs or users. For example, the applications for the public relations and payroll departments can share the same data.
Integrity:
Centralized control can also ensure that adequate checks are incorporated in the DBMS to
provide data integrity.
Data integrity means that the data contained in the database is both accurate and consistent.
Therefore, data values being entered for the storage could be checked to ensure that they fall
within a specified range and are of the correct format.
Security:
Data is of vital importance to an organization and may be confidential. Such confidential data must not be accessed by unauthorized persons. The DBA, who has the ultimate responsibility for the data in the DBMS, can ensure that proper access procedures are followed, including proper authentication schemes for access to the DBMS and additional checks before permitting access to sensitive data. Different levels of security could be implemented for various types of data and operations.
Disadvantages of DBMS
Problems of Centralization:
Keeping all data and control in one place increases the potential severity of security breaches and of disruption to the operation of the organization because of downtimes and failures.
The replacement of a monolithic centralized database by a federation of independent and
cooperating distributed databases resolves some of the problems resulting from failures and
downtimes.
Complexity of Backup and Recovery:
Backup and recovery operations are fairly complex in a DBMS environment, and this is exacerbated in a concurrent multi-user database system. Furthermore, a database system requires a certain amount of controlled redundancy and duplication to enable access to related data items.
APPLICATION OF DBMS:
Databases are used to support the internal operations of organizations and to underpin online interactions with customers and suppliers.
Databases are used to hold administrative information and more specialized data, such as
engineering data or economic models. Examples of database applications include computerized
library systems, flight reservation systems and computerized parts inventory systems.
1. Banking: For customer information, accounts, loans, and banking transactions.
2. Airlines: For reservations and schedule information. Airlines were among the first to use
databases in a geographically distributed manner - terminals situated around the world accessed
the central database system through phone lines and other data networks.
3. Credit card transactions: For purchases on credit cards and generation of monthly statements.
4. Telecommunication: For keeping records of calls made, generating monthly bills, maintaining balances on prepaid calling cards, and storing information about the communication networks.
5. Finance: For storing information about holdings, sales, and purchases of financial instruments such as stocks and bonds.
6. Manufacturing: For management of the supply chain and for tracking the production of items in factories, inventories of items in warehouses/stores, and orders for items.
7. Human resources: For information about employees, salaries, payroll taxes and benefits, and for generation of paychecks.
UNIT II
The relational model represents data as a table with columns and rows. Each row is known as a tuple. Each column of the table has a name, or attribute.
Attribute: It contains the name of a column in a particular table. Each attribute Ai must have
a domain, dom(Ai)
Relational instance: In the relational database system, the relational instance is represented
by a finite set of tuples. Relation instances do not have duplicate tuples.
Relational schema: A relational schema contains the name of the relation and name of all
columns or attributes.
Relational key: A relational key is a set of one or more attributes whose values can identify a row in the relation uniquely.
NULL Values:
In general, each NULL value is considered to be different from every other NULL in the
database. When a NULL is involved in a comparison operation, the result is considered to
be UNKNOWN. Hence, SQL uses a three-valued logic with values True,
False and Unknown. It is, therefore, necessary to define the results of three-valued logical
expressions when the logical
connectives AND, OR, and NOT are used.
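A small illustration of this three-valued logic in SQL, assuming a Students table with a nullable age column:
SELECT * FROM Students WHERE age = NULL;   -- the comparison evaluates to UNKNOWN, so no rows are returned
SELECT * FROM Students WHERE age IS NULL;  -- the IS NULL predicate must be used instead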
Integrity Constraints
o Integrity constraints are a set of rules. It is used to maintain the quality of information.
o Integrity constraints ensure that the data insertion, updating, and other processes have
to be performed in such a way that data integrity is not affected.
o Thus, integrity constraint is used to guard against accidental damage to the database.
Types of Integrity Constraint
1. Domain constraints
o Domain constraints can be defined as the definition of a valid set of values for an
attribute.
o The data type of domain includes string, character, integer, time, date, currency, etc.
The value of the attribute must be available in the corresponding domain.
2. Entity integrity constraints
o The entity integrity constraint states that a primary key value cannot be null, because the primary key value is used to identify individual rows in a relation; if it were null, rows could not be identified.
3. Referential integrity constraints
o A referential integrity constraint is specified between two tables: a foreign key in one table must either be null or match a primary key value in the referenced table.
4. Key constraints
o A key is an attribute (or set of attributes) that is used to identify an entity within its entity set uniquely.
o An entity set can have multiple keys, out of which one key will be the primary key. A primary key must contain a unique value and can never contain a null value in the relational table.
Example:
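A minimal sketch showing a domain constraint and key constraints together (the table and column names here are chosen for illustration):
CREATE TABLE Employee (
  emp_id INT PRIMARY KEY,          -- primary key: unique and never null
  email  VARCHAR(50) UNIQUE,       -- an alternate candidate key
  age    INT CHECK (age >= 18)     -- domain constraint restricting valid values
);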
Oracle Built-In-Functions:
1. ASCII: Returns the number code that represents the specific character.
Query syntax: ASCII(single_character)
2. CEIL: Returns integer value that is greater than or equal to the number ‘x’
Query syntax: CEIL(x)
3. FLOOR: Returns integer value that is less than or equal to the number ‘x’
Query syntax: FLOOR(x)
4. ROUND: Returns the rounded-off value of the number ‘x’ up to ‘y’ decimal places.
Query syntax: ROUND(x,y)
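For example, these functions can be tried against Oracle's dual table:
SELECT ASCII('A') FROM dual;         -- 65
SELECT CEIL(4.2) FROM dual;          -- 5
SELECT FLOOR(4.8) FROM dual;         -- 4
SELECT ROUND(123.456, 2) FROM dual;  -- 123.46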
Date functions in SQL:
1. ADD_MONTHS: Return a date value after adding ‘n’ months to the date ‘x’.
2. TO_CHAR: Converts numeric and date values to a character string value. The result cannot be used for calculations since it is a string value.
3. NVL: If 'x' is NULL, replaces it with 'y'. 'x' and 'y' must be of the same datatype.
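For example (the emp table and its commission column in the last line are assumed for illustration):
SELECT ADD_MONTHS(SYSDATE, 3) FROM dual;          -- date three months from today
SELECT TO_CHAR(SYSDATE, 'DD-MON-YYYY') FROM dual; -- today's date as a string
SELECT NVL(commission, 0) FROM emp;               -- replaces a NULL commission with 0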
An operator manipulates individual data items and returns a result. The data items are called
operands or arguments. Operators are represented by special characters or by keywords. For
example, the multiplication operator is represented by an asterisk (*) and the operator that
tests for nulls is represented by the keywords IS NULL. There are two general classes of
operators: unary and binary. Oracle Database Lite SQL also supports set operators.
Comparison operators:
1. Equal to Operator (=): Checks if the values of two operands are equal or not; if yes, then the condition becomes true.
2. Not Equal to Operator (!=): Checks if the values of two operands are equal or not; if the values are not equal, then the condition becomes true.
3. Greater than Operator (>): Checks if the value of the left operand is greater than the value of the right operand; if yes, then the condition becomes true.
4. Less than Operator (<): Checks if the value of the left operand is less than the value of the right operand; if yes, then the condition becomes true.
5. Greater than or Equal to Operator (>=): Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, then the condition becomes true.
Logical operators:
a) ALL Operator: The ALL operator is used to compare a value to all values in another value set.
b) AND Operator: The AND operator allows the existence of multiple conditions in an SQL statement's WHERE clause.
c) ANY Operator: The ANY operator is used to compare a value to any applicable value in the list as per the condition.
CREATE TABLE
Creating a basic table involves naming the table and defining its columns and each column's data type.
The SQL CREATE TABLE statement is used to create a new table.
Syntax
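The basic syntax of the CREATE TABLE statement is as follows −
CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);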
Example
The following code block is an example, which creates a CUSTOMERS table with ID as the primary key and NOT NULL constraints showing that these fields cannot be NULL while creating records in this table −
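A reconstruction consistent with the DESC CUSTOMERS output shown below −
CREATE TABLE CUSTOMERS(
   ID       INT             NOT NULL,
   NAME     VARCHAR (20)    NOT NULL,
   AGE      INT             NOT NULL,
   ADDRESS  CHAR (25),
   SALARY   DECIMAL (18, 2),
   PRIMARY KEY (ID)
);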
DROP TABLE
The SQL DROP TABLE statement is used to remove a table definition and all the data from that table.
Syntax
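The basic syntax of the DROP TABLE statement is as follows −
DROP TABLE table_name;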
Example
Let us first verify the CUSTOMERS table and then we will delete it from the database as
shown below −
SQL> DESC CUSTOMERS;
+---------+---------------+------+-----+---------+-------+
| Field   | Type          | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| ID      | int(11)       | NO   | PRI |         |       |
| NAME    | varchar(20)   | NO   |     |         |       |
| AGE     | int(11)       | NO   |     |         |       |
| ADDRESS | char(25)      | YES  |     | NULL    |       |
| SALARY  | decimal(18,2) | YES  |     | NULL    |       |
+---------+---------------+------+-----+---------+-------+
5 rows in set (0.00 sec)
This means that the CUSTOMERS table is available in the database, so let us now drop it as
shown below.
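SQL> DROP TABLE CUSTOMERS;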
INSERT INTO
The SQL INSERT INTO statement is used to add new rows of data to a table in the database.
Syntax
There are two basic syntaxes of the INSERT INTO statement which are shown below.
INSERT INTO TABLE_NAME (column1, column2, column3,...columnN)
VALUES (value1, value2, value3,...valueN);
Here, column1, column2, column3,...columnN are the names of the columns in the table into
which you want to insert the data.
You may not need to specify the column(s) name in the SQL query if you are adding values
for all the columns of the table. But make sure the order of the values is in the same order as
the columns in the table.
The SQL INSERT INTO syntax will be as follows −
INSERT INTO TABLE_NAME VALUES (value1,value2,value3,...valueN);
Example
The following statement would create the first of six records in the CUSTOMERS table; the remaining five records are created with similar INSERT statements.
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (1,'Ramesh',32,'Ahmedabad',2000.00);
The SQL SELECT statement is used to fetch the data from a database table which returns
this data in the form of a result table. These result tables are called result-sets.
Syntax
The basic syntax of the SELECT statement with the WHERE clause is as shown below.
SELECT column1, column2, columnN
FROM table_name
WHERE [condition]
You can specify a condition using the comparison or logical operators like >, <, =, LIKE,
NOT, etc. The following examples would make this concept clear.
Example
Consider the CUSTOMERS table having the following records −
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
The following code is an example which would fetch the ID, Name and Salary fields from
the CUSTOMERS table, where the salary is greater than 2000 −
SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000;
This would produce the following result −
+----+----------+----------+
| ID | NAME | SALARY |
+----+----------+----------+
| 4 | Chaitali | 6500.00 |
| 5 | Hardik | 8500.00 |
| 6 | Komal | 4500.00 |
| 7 | Muffy | 10000.00 |
+----+----------+----------+
The following query is an example, which would fetch the ID, Name and Salary fields from
the CUSTOMERS table for a customer with the name Hardik.
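SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE NAME = 'Hardik';
This would produce the following result −
+----+--------+----------+
| ID | NAME | SALARY |
+----+--------+----------+
| 5 | Hardik | 8500.00 |
+----+--------+----------+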
Here, it is important to note that all the strings should be given inside single quotes ('').
Whereas, numeric values should be given without any quote as in the above example.
The SQL AND & OR operators are used to combine multiple conditions to narrow data in an SQL statement. These two operators are called conjunctive operators.
These operators provide a means to make multiple comparisons with different operators in
the same SQL statement.
The AND operator allows the existence of multiple conditions in an SQL statement's
WHERE clause.
Syntax
The basic syntax of the AND operator with a WHERE clause is as follows −
SELECT column1, column2, columnN
FROM table_name
WHERE [condition1] AND [condition2]...AND [conditionN];
You can combine N number of conditions using the AND operator. For an action to be taken
by the SQL statement, whether it be a transaction or a query, all conditions separated by the
AND must be TRUE.
Example
Consider the CUSTOMERS table having the following records −
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Following is an example, which would fetch the ID, Name and Salary fields from the
CUSTOMERS table, where the salary is greater than 2000 and the age is less than 25 years
−
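SQL> SELECT ID, NAME, SALARY
FROM CUSTOMERS
WHERE SALARY > 2000 AND AGE < 25;
This would produce the following result −
+----+-------+----------+
| ID | NAME | SALARY |
+----+-------+----------+
| 6 | Komal | 4500.00 |
| 7 | Muffy | 10000.00 |
+----+-------+----------+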
The OR Operator
The OR operator is used to combine multiple conditions in an SQL statement's WHERE clause. For an action to be taken by the SQL statement, it is enough for any one of the conditions separated by the OR to be TRUE.
UPDATE Query
The SQL UPDATE query is used to modify the existing records in a table. You can use the WHERE clause with an UPDATE query to update the selected rows; otherwise, all the rows would be affected.
Syntax
The basic syntax of the UPDATE query with a WHERE clause is as follows −
UPDATE table_name
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition];
You can combine N number of conditions using the AND or the OR operators.
Example
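For instance, the following query would update the ADDRESS of the customer with ID 6 (the values here are chosen for illustration):
UPDATE CUSTOMERS
SET ADDRESS = 'Pune'
WHERE ID = 6;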
The SQL DELETE Query is used to delete the existing records from a table.
You can use the WHERE clause with a DELETE query to delete the selected rows,
otherwise all the records would be deleted.
Syntax
The basic syntax of the DELETE query with the WHERE clause is as follows −
DELETE FROM table_name
WHERE [condition];
You can combine N number of conditions using AND or OR operators.
Example
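For instance, the following query would delete the customer whose ID is 6 (the value is chosen for illustration):
DELETE FROM CUSTOMERS
WHERE ID = 6;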
(1) Requirements Analysis: The very first step in designing a database application is to
understand what data is to be stored in the database, what applications must be built on the
database and what operations must be performed on the database. In other words, we must
find out what the users want from the database. This process involves discussions with user
groups, a study of the current operating environment and how it is expected to change, an analysis of any available documentation on existing applications, and so on.
(2) Conceptual Database Design: The information gathered in the requirements analysis step
is used to develop a high-level description of the data to be stored in the database, along with
the constraints that are known to hold over this data. The goal is to create a description of the
data that matches how both users and developers think of the data. This facilitates discussion among all the people involved in the design process, i.e., developers as well as users who have no technical background. In simple words, the conceptual database design phase is used in drawing the ER model.
(3) Logical Database Design: We must implement our database design and convert the
conceptual database design into a database schema (a description of data) in the data model
(a collection of high level data description constructs that hide many low level storage details)
of the DBMS. We will consider only relational DBMSs, and therefore, the task in
the logical design step is to convert the conceptual database design in the form of an ER
schema (Entity-Relationship schema) into a relational database schema.
(4) Schema Refinement: The fourth step in database design is to analyze the collection of
relations in our relational database schema to identify potential problems, and to refine it.
(5) Physical Database Design: This step may simply involve building indexes on some
tables and clustering some tables, or it may involve redesign of parts of the database schema
obtained from the earlier design steps.
(6) Application and Security Design: Any software project that involves a DBMS must consider the applications supported by the database, identify the entities involved in the underlying processes, and decide which users may access which parts of the data.
Conceptual design:
What are the entities and relationships in the enterprise?
What information about these entities and relationships should be stored in the
database?
What are the integrity constraints or business rules that hold?
A database schema in the ER Model can be represented pictorially (ER
diagrams)
An ER diagram can be mapped into a relational schema
E-R MODEL:
An entity–relationship model (ER model) is a systematic way of describing and defining
a business process. An ER model is typically implemented as a database.
The main components of E-R model are: entity set and relationship set.
The standard geometric shapes and their meanings in an E-R Diagram are: rectangles for entity sets, ellipses for attributes, diamonds for relationship sets, and lines that link attributes to entity sets and entity sets to relationship sets.
2. ENTITIES, ATTRIBUTES, AND ENTITY SETS:
ENTITIES:
Example: Specific person, Company, Event, Plant, Building, Room, Chair, Course,
Employee etc.
In E-R Diagram, an entity is represented using rectangles. Name of the Entity is written
inside the rectangle.
Examples: STUDENT, EMPLOYEE, ACCOUNT etc.
A Weak entity is an entity that depends on another entity. A weak entity does not have a key attribute of its own. A double rectangle represents a weak entity.
Examples: CLASS_SECTION, DEPENDANT etc.
An Entity set is a set of entities of the same type that share the same properties.
An Entity set is a collection of similar entities.
Examples: set of all persons, companies, Job positions, Courses, Academic staff, Managers,
Employees etc.
–All entities in an entity set have the same set of attributes. (Until we consider ISA
hierarchies, anyway!)
–Each entity set has a key.
–Each attribute has a domain.
The Employees entity set with attributes ssn, name, and lot is shown in Figure 2.1. An
entity set is represented by a rectangle, and an attribute is represented by an oval. Each
attribute in the primary key is underlined.
ATTRIBUTES:
An entity is represented by a set of attributes. Attributes are descriptive properties
possessed by each member of an entity set.
Composite Attribute:
An attribute can also have its own attributes. These attributes are known as composite attributes.
Composite attributes can be divided into subparts. For example, an attribute name could be structured as a composite attribute consisting of first-name, middle-initial, and last-name, i.e., an attribute divided into subparts: Name (First name, Middle name, Last name).
Multivalued Attributes:(********************)
An attribute that can hold multiple values is known as a multivalued attribute. We represent it with double ellipses in an E-R Diagram. E.g., a person can have more than one phone number, so the phone number attribute is multivalued.
There may be instances where an attribute has a set of values for a specific entity. Consider an employee entity set with the attribute phone-number. An employee may have zero, one, or several phone numbers, and different employees may have different numbers of phones. This type of attribute is said to be an attribute having more than one value, e.g., Phone Number.
Derived Attribute: A derived attribute is one whose value is dynamic and derived from
another attribute. It is represented by dashed ellipses in an E-R Diagram. E.g. Person age is a
derived attribute as it changes over time and can be derived from another attribute (Date of
birth).
E-R diagram with multivalued and derived attributes: multivalued attributes are drawn with double ellipses and derived attributes with dashed ellipses, as described above.
A total participation of an entity set represents that each entity in the entity set must have at least one relationship in a relationship set. For example, in a college database, each college must have at least one associated student.
Each n-tuple denotes a relationship involving n entities e1 through en, where entity ei is in
entity set Ei. Note that several relationship sets might involve the same entity sets. For
example, we can also have a Manages relationship set involving Employees and Departments.
A relationship can also have descriptive attributes. Descriptive attributes are used to
record information about the relationship, rather than about any one of the participating
entities.
The ‘since’ value is shown beside each relationship in this many-to-many, total-participation instance; i.e., the employee with ssn 123-22-3666 Works_In did 51 since 1/1/91; similarly, the employee with ssn 231-31-5368 Works_In did 51 since 3/3/93, and so on.
A ternary relationship is an association (connection) between three entities: an employee, a department, and a location.
Example: Each department has offices in several locations and we want to record the locations at which each employee works.
The entity sets that participate in a relationship set need not be distinct. Sometimes a relationship might involve two entities in the same entity set.
For example, consider the Reports_To relationship set that is shown in Figure 2.5. Since
employees report to other employees, every relationship in Reports_To is of the form (emp1,
emp2), where both emp1 and emp2 are entities in Employees.
However, they play different roles: emp1 reports to the managing employee emp2, which is
reflected in the role indicators supervisor and subordinate in Figure 2.5.
If an entity set plays more than one role, the role indicator concatenated with an attribute
name from the entity set gives us a unique name for each attribute in the relationship set. For
example, the Reports To relationship set has attributes corresponding to the ssn of the
supervisor and the ssn of the subordinate, and the names of these attributes are supervisor ssn
and subordinate ssn.
An employee can work in several departments, and a department can have several employees,
as illustrated in the Works_In instance shown in Figure 2.3.
Here, Employee 231-31-5368 has worked in Department 51 since 3/3/93 and in Department
56 since 2/2/92. Department 51 has two employees. Thus one department can have many
employees.
But, if we want each department to have only one employee working in it, then that is an example of a key constraint.
Example: Consider another relationship set called Manages between the Employees and
Departments entity sets as in the Figure 2.6.
Here, each department can have only one manager. The restriction that each department can
have only one manager is an example of key constraints. This restriction is indicated in the
above ER diagram by using an arrow from departments to manages, such that a department
can have only one manager.
An instance of the Manages relationship set is shown in Figure 2.7. While this is also a
potential instance for the Works In relationship set, the instance of Works In shown in Figure
2.3 violates the key constraint on Manages.
Key Constraints for Ternary Relationships:
In Figure 2.8, we show a ternary relationship with key constraints. Each employee works in
at most one department, and at a single location.
An instance of the Works_In3 relationship set is shown in Figure 2.9. Notice that each
department can be associated with several employees and locations, and each location can be
associated with several departments and employees; however, each employee is associated
with a single department and location.
Participation Constraints: (***********)
The ER diagram in Figure 2.10 shows both the Manages and Works_In relationship sets and
all the given constraints. If the participation of an entity set in a relationship set is total, the
two are connected by a thick line; independently, the presence of an arrow indicates a key
constraint. The instances of Works_In and Manages shown in Figures 2.3 and 2.7 satisfy all
the constraints in Figure 2.10.
Weak Entities:
An entity set that does not have a primary key among its own attributes is termed a weak entity set. As an example, consider the entity set Dependents, which has the two attributes pname and age, illustrated in the ER diagram shown in Figure 2.11.
A dependent is an example of a weak entity set. A weak entity can be identified uniquely
only by considering some of its attributes in conjunction with the primary key of another
entity, which is called the identifying owner.
The Dependents weak entity set and its relationship to Employees is shown in Figure 2.11.
The total participation of Dependents in Policy is indicated by linking them with a dark line.
The arrow from Dependents to Policy indicates that each Dependents entity appears in at most
one Policy relationship. To underscore the fact that Dependents is a weak entity and Policy is
its identifying relationship, we draw both with dark lines. To indicate that pname is a partial
key for Dependents, we underline it using a broken line. This means that there may well be
two dependents with the same pname value.
Class Hierarchies:
To classify the entities in an entity set into subclass entity is known as class hierarchies.
Example: we might want to classify Employees entity set into subclass entities Hourly_Emps
entity set and a Contract _Emps entity set to distinguish the basis on which they are paid.
Then the class hierarchy is illustrated as shown in Figure 2.12.
This class hierarchy illustrates the inheritance concept, where the subclass ISA (read as: is a) superclass, indicating the “is a” relationship (inheritance). Therefore, the attributes defined for an Hourly_Emps entity are the attributes of Hourly_Emps plus the attributes of Employees (because a subclass inherits its superclass's properties).
Likewise the attributes defined for a Contract_Emps set are the attributes of Contract_Emps
plus attributes of Employees.
Generalization:
Generalization is the process of identifying(defining) some generalized (common)
characteristics of a collection of (two or more) entity sets and creating a new entity set that
contains entities possessing these common characteristics. Here, the subclasses (Hourly_Emps, Contract_Emps, etc.) are defined first, and the superclass (Employees) is defined next.
In short, Hourly_Emps and Contract_Emps are generalized by Employees.
The class hierarchy can specify two kinds of constraints. They are
Overlap Constraints:
Overlap constraints determine whether two subclasses are allowed to contain the same entity.
Example: can Akbar be both a Hourly_Emps entity and a Contract_Emps entity? The answer
is no.
As another example: can Akbar be both a Contract_Emps entity and a Senior_Emps entity? Here the answer is yes.
Thus, this is a specialization hierarchy property. We denote this by writing “Contract_Emps overlaps Senior_Emps”.
Covering Constraints:
Covering constraints determine whether the entities in the subclasses collectively include all
entities in the superclass.
Example: Should every Employees entity be an Hourly_Emps or a Contract_Emps entity? The answer is no; an employee could instead be, say, a Daily_Emps entity.
As another example: should every Motor_Vehicle (superclass) be a bike (subclass) or a car (subclass)? The answer is yes.
Thus, the generalization hierarchy property is that every instance of the superclass is an instance of some subclass. We denote this by writing “bikes and cars cover Motor_Vehicles”.
AGGREGATION: (**************************)
Used when we have to model a relationship involving (entity sets and) a relationship set.
Aggregation allows us to indicate that a relationship set (identified through a dashed box)
participates in another relationship set.
Aggregation allows a relationship set to be treated as an entity set for purposes of
participation in (other) relationship sets.
This is illustrated in Figure, with a dashed box around Sponsors (and its participating entity
sets) used to denote aggregation. This effectively allows us to treat Sponsors as an entity set
for purposes of defining the Monitors relationship set.
Figure. Aggregation
Restaurant Example:
Note:
For a ternary relationship, one can only see which specific restaurant buys what kind of food from which supplier.
With aggregation, you have more information about which supplier supplies a food item; any restaurant that needs that item can choose from that list.
Uses of Aggregation:
We use an aggregation, when we need to express a relationship among relationships. Thus,
there are really two distinct relationships, sponsors and monitors, each with its own attributes.
Example: The Monitors relationship has an attribute until that records the ending date until
when the employee is appointed as the sponsorship monitoring.
Entity Vs Attributes:
While identifying the attributes of an entity set, it is sometimes not clear, whether a property
should be modeled as an attribute or as an entity set.
Similar to the problem of wanting to record several working periods for an employee in Works_In4, we may want to record several values of the descriptive attributes for each instance of this relationship. This is accomplished by introducing a new entity set, Duration.
Entity Vs Relationship:
It is not always clear whether an object is best expressed by an entity set or a relationship set.
Example: If a manager gets a separate discretionary budget (dbudget) for each department he or she manages, dbudget is naturally an attribute of the Manages relationship.
What if a manager gets a single discretionary budget that covers all managed departments? Storing it with Manages is then:
Redundancy: dbudget is stored once for each department managed by the manager.
Misleading: it suggests dbudget is associated with each department–manager combination.
Binary versus Ternary Relationships:
It is always possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets.
The choice between using aggregation or a ternary relationship is mainly determined by the
existence of a relationship that relates a relationship set to an entity set (or second relationship
set). The choice may also be guided by certain integrity constraints that we want to express.
Monitors is a distinct relationship, with a descriptive attribute (see Figure). If we don’t need to record the until attribute of Monitors, then we might reasonably use a ternary relationship instead (see Figure). With aggregation, we can also say that each sponsorship is monitored by at most one employee.
ER DIAGRAMS EXAMPLES
1) Entity Relationship (ER) Modeling - Learn with a Complete Example
Here we are going to design an Entity Relationship (ER) model for a college database. Say we have the following requirements.
Step 1: Identify the entities
1. Department
2. Course
3. Instructor
4. Student
Step 2: Identify the relationships
1. One department offers many courses. But one particular course can be offered by only
one department. hence the cardinality between department and course is One to Many
(1:N)
2. One department has multiple instructors. But an instructor belongs to only one department. Hence the cardinality between department and instructor is One to Many (1:N)
3. One department has only one head and one head can be the head of only one
department. Hence the cardinality is one to one. (1:1)
4. One course can be enrolled by many students and one student can enroll for many
courses. Hence the cardinality between course and student is Many to Many (M:N)
5. One course is taught by only one instructor. But one instructor teaches many courses.
Hence the cardinality between course and instructor is Many to One (N :1)
Step 3: Identify the key attributes
All the entities represented in rectangular boxes in the ER diagram become independent tables in the database. In the below diagram, STUDENT, COURSE, LECTURER and SUBJECTS form individual tables.
All the single-valued attributes are considered as columns of the corresponding table. In the STUDENT entity, STUDENT_ID and STUDENT_NAME form the columns of the STUDENT table. Similarly, LECTURER_ID and LECTURER_NAME form the columns of the LECTURER table. And so on.
Key attribute in the ER diagram becomes the Primary key of the table.
In diagram above, STUDENT_ID, LECTURER_ID, COURSE_ID and SUB_ID are the key
attributes of the entities. Hence we consider them as the primary keys of respective table.
Declare the foreign key column, if applicable.
In the diagram, attribute COURSE_ID in the STUDENT entity is from COURSE entity.
Hence add COURSE_ID in the STUDENT table and assign it foreign key constraint.
COURSE_ID and SUBJECT_ID in the LECTURER table form the foreign key columns. Hence, by declaring the foreign key constraints, the mapping between the tables is established.
A hobby in the Student table is a multi-valued attribute. Any student can have any number of
hobbies. So we cannot represent multiple values in a single column of STUDENT table. We
need to store it separately, so that we can store any number of hobbies, adding/ removing /
deleting hobbies should not create any redundancy or anomalies in the system. Hence we
create a separate table STUD_HOBBY with STUDENT_ID and HOBBY as its columns. We
create a composite key using both the columns.
Any composite attributes are merged into same table as different columns.
In the diagram above, Student Address is a composite attribute. It has Door#, Street, City,
State and Pin. These attributes are merged into STUDENT table as individual columns.
One can ignore derived attribute, since it can be calculated at any time.
In the STUDENT table, Age can be derived at any point of time by calculating the difference
between DateOfBirth and current date. Hence we need not create a column for this attribute.
It reduces duplication in the database.
These are the very basic rules of converting an ER diagram into tables and columns, and assigning the mapping between the tables. The table structure at this point would be as below:
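A minimal SQL sketch of the resulting tables, using the entity and attribute names from the example above (data types are chosen for illustration):
CREATE TABLE COURSE (
  COURSE_ID   INT PRIMARY KEY,
  COURSE_NAME VARCHAR(50)
);
CREATE TABLE STUDENT (
  STUDENT_ID    INT PRIMARY KEY,          -- key attribute becomes the primary key
  STUDENT_NAME  VARCHAR(50),
  DOOR_NO       VARCHAR(10),              -- composite attribute Address merged as individual columns
  STREET        VARCHAR(30),
  CITY          VARCHAR(30),
  STATE         VARCHAR(30),
  PIN           VARCHAR(10),
  DATE_OF_BIRTH DATE,                     -- Age is derived from this, so no Age column is stored
  COURSE_ID     INT REFERENCES COURSE(COURSE_ID)  -- foreign key to COURSE
);
CREATE TABLE STUD_HOBBY (
  STUDENT_ID INT REFERENCES STUDENT(STUDENT_ID),
  HOBBY      VARCHAR(30),
  PRIMARY KEY (STUDENT_ID, HOBBY)         -- composite key for the multivalued Hobby attribute
);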
RELATIONAL MODEL:
The Relational Database is a collection of one or more relations, where each relation is a
table with rows and columns.
The main construct for representing data in the relational model is a relation (table). A
relation consists of a relation schema and a relation instance. The relation instance is a
table, and the relation schema describes the column heads for the table.
The schema specifies the relation’s name, the name of each field (or column, or attribute),
and the domain of each field. A domain is referred to in a relation schema by the domain
name and has a set of associated values.
Students (sid: string, name: string, login: string, age: integer, gpa: real)
The field named sid has a domain named string. The set of values associated with domain
string is the set of all character strings.
An instance of a relation is a set of tuples, also called records, in which each tuple has
the same number of fields as the relation schema. A relation instance can be thought of as a
table in which each tuple is a row, and all rows have the same number of fields
The instance S1 contains six tuples and has, as we expect from the schema, five fields. Note
that no two rows are identical. This is a requirement of the relational model—each relation is
defined to be a set of unique tuples or rows.
For example, an instance with three such rows would have cardinality = 3 and degree = 5, with all rows distinct.
Domain constraints are so fundamental in the relational model that we will henceforth
consider only relation instances that satisfy them; therefore, relation instance means relation
instance that satisfies the domain constraints in the relation schema.
The degree, also called arity, of a relation is the number of fields. The cardinality of a
relation instance is the number of tuples in it. In Figure 3.1, the degree of the relation (the
number of columns) is five, and the cardinality of this instance is six.
A relation schema specifies the domain of each field or column in the relation instance.
These domain constraints in the schema specify an important condition that we want each
instance of the relation to satisfy: The values that appear in a column must be drawn from the
domain associated with that column. Thus, the domain of a field is essentially the type of that
field, in programming language terms, and restricts the values that can appear in the field.
Tuples are inserted using the INSERT command. We can insert a single tuple into the
Students table as follows:
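-- a sketch of the elided statement, following the Students schema above
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, 'Smith', 'smith@ee', 18, 3.2);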
We can delete tuples using the DELETE command. We can delete all Students tuples with
name equal to Smith using the command:
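-- a sketch of the elided statement
DELETE FROM Students
WHERE name = 'Smith';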
We can modify the column values in an existing row using the UPDATE command. For
example, we can increment the age and decrement the gpa of the student with sid 53688:
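-- a sketch of the elided statement
UPDATE Students
SET age = age + 1, gpa = gpa - 1
WHERE sid = 53688;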
The WHERE clause is applied first and determines which rows are to be modified. The
SET clause then determines how these rows are to be modified.
If this query is applied on the instance S1 of Students shown in Figure 3.1, we obtain the
instance shown in Figure 3.3.
1. When the DBA or end user defines a database schema, he or she specifies the ICs that
must hold on any instance of this database.
2. When a database application is run, the DBMS checks for violations and disallows
changes to the data that violate the specified ICs.
There are three types of integrity constraints in addition to domain constraints: key constraints, foreign key constraints, and general constraints. They are discussed below.
1. CANDIDATE KEY:
A candidate key is a collection of fields/columns/attributes that uniquely identifies a tuple.
Let us take a closer look at the above definition of a (candidate) key.
There are two parts to the definition:
1. Two distinct tuples in a legal instance (an instance that satisfies all ICs, including the key
constraint) cannot have identical values in all the fields of a key.
2. No subset of the set of fields in a key is a unique identifier for a tuple.
Example: In “customer” relation the attribute “cid” is a key, it uniquely defines a tuple in a
relation. No two rows in a relation “customer” can have the same “cid” value.
The set of attributes that form a candidate key need not all be taken as the key; the attributes may be treated as candidates from which the key is chosen.
Example: cid and cname may both be candidate keys, which means either cid or cname can be taken as the key, but not both. Each of them independently and uniquely identifies a particular row. The candidate keys that are not taken as the key are called alternate keys.
2. COMPOSITE KEY:
A composite key consists of more than one attribute and uniquely identifies a tuple in a relation. All the attributes that form the set of keys, taken together, determine a unique row in a table.
Example: The set (cid, accno) is a composite key which maintains the uniqueness of each
row. Both cid, accno are taken as keys.
3. SUPER KEY:
A super key is a set of one or more attributes that, taken together, uniquely identifies a tuple in a relation. Every candidate key and every composite key is a super key.
4. PRIMARY KEY:
A primary key is a single attribute (or minimal set of attributes) that uniquely identifies a particular record. More specifically, it can be defined as the candidate key which has been selected as the key to identify unique records.
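A sketch of such a declaration, following the Students schema used earlier (the constraint name is chosen for illustration):
CREATE TABLE Students ( sid   CHAR(20),
                        name  CHAR(30),
                        login CHAR(20),
                        age   INTEGER,
                        gpa   REAL,
                        CONSTRAINT StudentsKey PRIMARY KEY (sid) );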
The last line declares sid as the primary key for the Students relation. If a user inserts a repeated value for sid, an error occurs and the constraint name is returned, indicating a violation of the constraint.
2) FOREIGN KEY:
In a foreign key reference, a link is created between two tables when the column or columns that hold the primary key value for one table are referenced by the column or columns in another table. This column becomes a foreign key in the second table.
To ensure that only bona fide students can enroll in courses, any value that appears in the
sid field of an instance of the Enrolled relation should also appear in the sid field of some
tuple in the Students relation. The sid field of Enrolled is called a foreign key and refers to
Students. The foreign key in the referencing relation (Enrolled) must match the primary key
of the referenced relation (Students), i.e., it must have the same number of columns and
compatible data types, although the column names can be different.
This constraint is illustrated in Figure 3.4. As the figure shows, there may well be some
students who are not referenced from Enrolled (e.g., the student with sid =50000).
However, every sid value that appears in the instance of the Enrolled table also appears in the primary key column of a row in the Students table.
A FOREIGN KEY constraint does not have to be linked only to a PRIMARY KEY constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table. A FOREIGN KEY constraint can contain null values; however, if any column of a composite FOREIGN KEY contains a null value, verification of the constraint is skipped.
The foreign key constraint states that every sid value in Enrolled must also appear in
Students, that is, sid in Enrolled is a foreign key referencing Students.
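A sketch of such a declaration, following the Enrolled and Students schemas used in this discussion:
CREATE TABLE Enrolled ( sid   CHAR(20),
                        cid   CHAR(20),
                        grade CHAR(10),
                        PRIMARY KEY (sid, cid),
                        FOREIGN KEY (sid) REFERENCES Students );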
3).General Constraints:
Domain, primary key, and foreign key constraints are considered to be a fundamental part of
the relational data model and are given special attention in most commercial systems.
Sometimes, however, it is necessary to specify more general constraints.
Example: we may require that student ages be within a certain range of values; given such an
IC specification, the DBMS will reject inserts and updates that violate the constraint. This is
very useful in preventing data entry errors. If we specify that all students must be at least 16
years old, then instances in which every student is at least 16 are valid cases, i.e., legal instances. All the others, having students younger than 16, are invalid cases, i.e., illegal instances. The instance of Students shown in Figure 3.1 is illegal because two students are underage. If we disallow the insertion of these two tuples, we have a legal instance, as shown in Figure 3.5.
The IC that students must be older than 16, is known as an extended domain constraint,
because we are restricting age values more stringently (strictly), than by simply using a
standard domain such as integer.
In general, constraints beyond the domain, primary key, and foreign key constraints can also be specified.
Example: we may require that a student whose age is greater than 18 must have a gpa greater than 3.
1. Table Constraints: These are applied to a particular table and are checked whenever that specific table is updated.
2. Assertions: These are applied to a collection of tables and are checked whenever any of these tables is updated.
Integrity Constraints (ICs) are the rules that, when applied to relations, restrict the insertion of incorrect data and also help to prevent deletions and updates that may lead to loss of data integrity. Therefore, one should be very careful when applying integrity constraints to relations.
The operations such as insertion, deletion and updating must be discarded if they are found to
violate integrity constraints. This section provides a brief on different violations of ICs and
also the solutions to handle these violations.
Consider the instance S1 of Students shown in Figure 3.1. The following insertion violates
the primary key constraint because there is already a tuple with the sid 53688, and it will be
rejected by the DBMS:
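-- a sketch of the elided statement; the values other than sid are illustrative
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, 'Mike', 'mike@ee', 17, 3.4);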
The following insertion violates the constraint that the primary key cannot contain null:
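-- a sketch of the elided statement
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (NULL, 'Mike', 'mike@ee', 17, 3.4);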
Deletion does not cause a violation of domain, primary key or unique constraints.
However, an update can cause violations, similar to an insertion:
This update violates the primary key constraint because there is already a tuple with sid
50000.
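A sketch of such an update:

UPDATE Students S
SET S.sid = 50000
WHERE S.sid = 53688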
In addition to the instance S1 of Students, consider the instance of Enrolled shown in
Figure 3.4. Deletions of Enrolled tuples do not violate referential integrity, but insertions of
Enrolled tuples could. The following insertion is illegal because there is no student with sid
51111:
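A sketch of the illegal insertion (assuming Enrolled has fields cid, grade, sid):

INSERT INTO Enrolled (cid, grade, sid)
VALUES ('Hindi101', 'B', 51111)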
EXAMPLE:
The following options specify what happens when a delete or update operation is performed
on the referenced (Students) relation. These options are included as part of the foreign key
declaration. NO ACTION is the default option, which means that violating update and delete
operations are simply rejected.
1) On Delete Cascade: when a row is deleted from the Students relation, all the rows in
Enrolled that refer to the deleted row are also deleted.
2) On Update Cascade: when the primary key attribute of a row in Students is updated, the
same update is carried out in the referring rows of Enrolled.
3) On Delete Set Default: when a row is deleted from Students, the foreign key column of
the referring rows in Enrolled is set to a default value.
4) On Delete Set Null: when a row is deleted from Students, the foreign key column of the
referring rows in Enrolled is set to NULL. A sketch of such a declaration is shown below.
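A hedged sketch of such a foreign key declaration on Enrolled (column types are illustrative):

CREATE TABLE Enrolled ( sid CHAR(20),
                        cid CHAR(20),
                        grade CHAR(2),
                        PRIMARY KEY (sid, cid),
                        FOREIGN KEY (sid) REFERENCES Students
                          ON DELETE CASCADE
                          ON UPDATE CASCADE )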
NOTE: SQL also provides the facility to defer or to immediately apply the checking of
constraints on a relation. A constraint can be in one of two modes:
1) Deferred mode
2) Immediate mode
Usually, constraints are checked at the end of each SQL statement, and if a constraint is
violated the statement is rejected. With a deferred constraint, however, the checks are
postponed and carried out at the time of commit.
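A sketch of declaring and using a deferred constraint (the constraint name sid_fk is illustrative):

CONSTRAINT sid_fk FOREIGN KEY (sid) REFERENCES Students
  DEFERRABLE INITIALLY DEFERRED

Within a transaction, the mode can also be switched explicitly:

SET CONSTRAINTS sid_fk DEFERRED;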
RELATIONAL ALGEBRA:
Relational algebra is a procedural query language, which takes instances of relations as
input and yields instances of relations as output. It uses operators to perform queries. An
operator can be either unary or binary. They accept relations as their input and yield
relations as their output. Relational algebra is performed recursively on a relation and
intermediate results are also considered relations.
Example schemas:
Sailors (sid: integer, sname: string, rating: integer, age: real)
Boats (bid: integer, bname: string, color: string)
Reserves (sid: integer, bid: integer, day: date)
Example Instances:
The "Sailors" and "Reserves" relations are our examples. We will use positional or named
field notation, and assume that names of fields in query results are 'inherited' from names of
fields in the query's input relations.
SELECTION (σ):
The selection operation is a unary operation. This is used to find horizontal subset of relation
or tuples of relation.
It selects tuples that satisfy the given predicate from a relation. It is denoted by sigma(σ).
Notation − σ_p(r)
Where σ stands for the selection predicate and r stands for the relation; p is a propositional
logic formula which may use connectives like AND, OR, and NOT. These terms may use
relational operators like =, ≠, ≥, <, >, ≤.
Example: If you want all the Sailors having rating more than 8 from instance S2 of Sailors.
The query is,
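σ_rating>8(S2)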
PROJECTION (π):
The projection operation is a unary operation which applies only on a single relation at a
time. This is used to select vertical subset of relation (i.e., columns of table)
It projects the column(s) of a relation that are listed in the projection list. It is denoted by pi (π).
Example: If you want to find all sailors' names and ratings from instance S2 of Sailors, the
query is,
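π_sname,rating(S2)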
For example, we can compute the names and ratings of highly rated sailors by
combining two of the preceding queries. The expression
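π_sname,rating(σ_rating>8(S2))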
produces the result shown in Figure 4.7. It is obtained by applying the selection to S2 (to get
the relation shown in Figure 4.4) and then applying the projection.
2. SET OPERATIONS:
The relational algebraic operations can be divided into basic set oriented operations (Union,
Intersection, Set difference and Cartesian product).
Union (∪): R ∪ S returns a relation instance containing all tuples that occur in either relation
instance R or S (or both). R and S must be union-compatible.
Notation − R ∪ S
Two relation instances are said to be union-compatible if the following conditions hold:
1. They have the same number of fields, and
2. Corresponding fields, taken in order from left to right, have the same domains.
Note that field names are not used in defining union-compatibility. For convenience, we will
assume that the fields of R ∪ S inherit names from R, if the fields of R have names.
The union of S1 and S2 is shown in Figure 4.8. Fields are listed in order; field names are
also inherited from S1. S2 has the same field names, of course, since it is also an instance of
Sailors. In general, fields of S2 may have different names; recall that we require only domains
to match. Note that the result is a set of tuples. Tuples that appear in both S1 and S2 appear
only once in S1 ∪ S2. Also, S1 ∪ R1 is not a valid operation because the two relations are not
union-compatible.
Figure 4.8 S1 ∪ S2
Intersection (∩): R ∩ S returns a relation instance containing all tuples that occur in both R
and S. The relations must be union-compatible.
Notation − R ∩ S
If the relations have no tuples in common, the result is an empty relation. The rules of the set
union operation also apply here.
Figure 4.9 S1 ∩ S2
Set Difference (−): R − S returns a relation instance containing all tuples that occur in R but
not in S. The relations R and S must be union-compatible, and the schema of the result is
defined to be identical to the schema of R.
The result of the set difference operation consists of the tuples that are present in the first
relation but not in the second; the common tuples of the two relations are removed, and a new
relation is produced with the remaining tuples of the first relation.
Notation − R − S
Figure 4.10 S1 − S2
Cross-Product (×): R × S returns a relation instance whose schema contains all the fields of
R (in the same order as they appear in R) followed by all the fields of S (in the same order as
they appear in S).
We will use the convention that the fields of R × S inherit names from the corresponding
fields of R and S. It is possible for both R and S to contain one or more fields having the same
name; this situation creates a naming conflict. The corresponding fields in R × S are unnamed
and are referred to solely by position.
Notation − R × S
The result of the cross-product S1 × R1 is shown in Figure 4.11. Because R1 and S1 both
have a field named sid, by our convention on field names, the corresponding two fields in S1
× R1 are unnamed, and referred to solely by the position in which they appear in Figure 4.11.
The fields in S1 × R1 have the same domains as the corresponding fields in R1 and S1. In
Figure 4.11, sid is listed in parentheses to emphasize that it is not an inherited field name; only
the corresponding domain is inherited.
Figure 4.11 S1 × R1
3. Renaming (ρ): (********)
The rename (ρ) operation is a unary operation which is used to give names to relational
algebra expressions.
The results of relational algebra expressions are also relations, but without any name. The
rename operation allows us to name the output relation. 'Rename' is denoted by the lowercase
Greek letter rho (ρ).
Suppose you want to find the Cartesian product of a relation with itself; by using the rename
operator we give an alias name to that relation, and we can then easily combine the relation
with its alias. It is helpful in removing ambiguity.
Notation − ρ_x(E), where the result of expression E is saved under the name x.
For example, the expression ρ(C(1 → sid1, 5 → sid2), S1 × R1) returns a relation that
contains the tuples shown in Figure 4.11 and has the following schema:
C(sid1: integer, sname: string, rating: integer, age: real, sid2: integer, bid: integer, day: date).
4. Joins: (********)
The join operation is one of the most useful operations in relational algebra and is the most
commonly used way to combine information from two or more relations.
The join operation, denoted by "⋈", is a relational algebra operation used to combine (join)
two relations much like the Cartesian product, except that duplicate attributes are kept only
once (the same column appears as a single column) and subsequent operations (selection,
projection, etc.) become much simpler. In simple words, join connects relations on columns
containing comparable information. There are three main kinds of joins:
1. Condition Joins
2. Equi Join and
3. Natural join.
1. Condition Joins:
The most general version of the join operation accepts a join condition c and a pair of
relation instances as arguments, and returns a relation instance. The join condition is identical
to a selection condition in form. The operation is defined as follows:
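R ⋈_c S = σ_c(R × S)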
Thus, ⋈_c is defined to be a cross-product followed by a selection. Note that the condition c can
(and typically does) refer to attributes of both R and S. The reference to an attribute of a
relation, say R, can be by position (of the form R.i) or by name (of the form R.name).
Example: the result of S1 ⋈_{S1.sid < R1.sid} R1 is shown in Figure 4.12. Because
sid appears in both S1 and R1, the corresponding fields in the result of the cross-product S1 ×
R1 (and therefore in the result of S1 ⋈_{S1.sid < R1.sid} R1) are unnamed. Domains are
inherited from the corresponding fields of S1 and R1.
Figure 4.12 S1 ⋈_{S1.sid < R1.sid} R1
2. Equi Join:
A common special case of the condition join is when the join condition consists solely of
equalities of the form R.name1 = S.name2, i.e., equalities between fields of R and S. The
schema of the result of an equijoin contains the fields of R (with the same names and
domains as in R) followed by the fields of S that do not appear in the join conditions. If this
set of fields in the result relation includes two fields that inherit the same name from R and S,
they are unnamed in the result relation.
Result schema: similar to that of the cross-product, but with only one copy of the fields for
which equality is specified.
3. Natural Join:
Natural join does not use any comparison operator. It does not concatenate the way a
Cartesian product does. We can perform a Natural Join only if there is at least one common
attribute that exists between two relations. In addition, the attributes must have the same
name and domain.
Natural join acts on those matching attributes where the values of attributes in both the
relations are same.
A special case of the join operation R ⋈ S is an equijoin in which equalities are specified on
all fields having the same name in R and S. In this case, we can simply omit the join
condition; the default is that the join condition is a collection of equalities on all common
fields. This special case is called a natural join, and its result is guaranteed not to have two
fields with the same name.
The equijoin expression S1 ⋈_{S1.sid = R1.sid} R1 is actually a natural
join and can simply be denoted as S1 ⋈ R1, since the only common field is sid. If the two
relations have no attributes in common, S1 ⋈ R1 is simply the cross-product.
Figure: S1 ⋈ R1
5. Division:
The division operator is useful for expressing certain kinds of queries that include the
phrase "for all". It is denoted by (/). It is, in a sense, the inverse of the Cartesian product.
Consider two relation instances A and B in which A has (exactly) two fields x and y and B
has just one field y, with the same domain as in A. We define the division operation A/B as
the set of all x values (in the form of unary tuples) such that for every y value in (a tuple of)
B, there is a tuple <x, y> in A.
Another way to understand division is as follows. For each x value in (the first column of) A,
consider the set of y values that appear in (the second field of) tuples of A with that x value. If
this set contains (all y values in) B, the x value is in the result of A/B.
An analogy with integer division may also help to understand division. For integers A and B,
A/B is the largest integer Q such that Q ∗ B ≤ A. For relation instances A and B, A/B is the
largest relation instance Q such that Q × B ⊆ A.
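As an illustration: let A = { <s1, b1>, <s1, b2>, <s2, b1> } over fields (x, y) and B = { <b1>, <b2> } over field (y). Then A/B = { <s1> }, because s1 is paired in A with every y value in B, while s2 is not paired with b2.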
Division is illustrated in Figure 4.14.
Figure 4.14 Examples Illustrating Division
1. Basic SQL Query:
Structured Query Language (SQL) is the most widely used commercial relational database
language. It was originally developed at IBM in the SEQUEL-XRM and System-R projects
(1974–1977).
1. The Data Definition Language (DDL): This subset of SQL supports the creation, deletion,
and modification of definitions for tables and views. Integrity constraints can be defined on
tables, either when the table is created or later. The DDL also provides commands for specifying
access rights or privileges to tables and views. Although the standard does not discuss indexes,
commercial implementations also provide commands for creating and deleting indexes.
2. The Data Manipulation Language (DML): This subset of SQL allows users to pose queries
and to insert, delete, and modify rows.
3. Embedded and dynamic SQL: Embedded SQL features allow SQL code to be called from a
host language such as C or COBOL. Dynamic SQL features allow a query to be constructed (and
executed) at run-time.
4. Triggers: The new SQL:1999 standard includes support for triggers, which are actions
executed by the DBMS whenever changes to the database meet conditions specified in the
trigger.
5. Security: SQL provides mechanisms to control users’ access to data objects such as tables and
views.
6. Transaction management: Various commands allow a user to explicitly control aspects of
how a transaction is to be executed.
7. Client-server execution and remote database access: These commands control how a client
application program can connect to an SQL database server, or access data from a database over a
network.
SELECT Clause:
Let us consider a simple query:
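A sketch of such a query over the Sailors schema (dropping DISTINCT gives the duplicate-preserving variant):

SELECT DISTINCT S.sname, S.age
FROM Sailors S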
The answer to this query with and without the keyword DISTINCT on instance S3 of
Sailors is shown in Figures 5.4 and 5.5. The only difference is that the tuple for Horatio
appears twice if DISTINCT is omitted; this is because there are two sailors called Horatio,
both aged 35.
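(Q15) Find all sailors with a rating above 7; a sketch in the textbook's style:

SELECT S.sid, S.sname, S.rating, S.age
FROM Sailors AS S
WHERE S.rating > 7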
This query uses the optional keyword AS to introduce a range variable. Incidentally, when
we want to retrieve all columns, as in this query, SQL provides convenient shorthand: We can
simply write SELECT *. This notation is useful for interactive querying, but it is poor style
for queries that are intended to be reused and maintained.
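The three conceptual evaluation steps below can be traced on a query such as Q1 (find the names of sailors who have reserved boat 103):

SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND R.bid = 103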
The first step is to construct the cross-product S4 × R3, which is shown in Figure 5.8.
The second step is to apply the qualification S.sid = R.sid AND R.bid=103. This step
eliminates all but the last row from the instance shown in Figure 5.8.
The third step is to eliminate unwanted columns; only sname appears in the SELECT
clause. This step leaves us with the result shown in Figure 5.9, which is a table with a single
column and, as it happens, just one row.
Examples of Basic SQL Queries:
(Q16) Find the sids of sailors who have reserved a red boat.
(Q2) Find the names of sailors who have reserved a red boat.
(Q4) Find the names of sailors who have reserved at least one boat.
(Q17) Compute increments for the ratings of persons who have sailed two different
boats on the same day.
Also, each item in a qualification can be as general as expression1 = expression2.
(Q18) Find the ages of sailors whose name begins and ends with B and has at least three
characters.
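Hedged sketches of two of the queries above, following the textbook's style:

(Q16)
SELECT R.sid
FROM Boats B, Reserves R
WHERE B.bid = R.bid AND B.color = 'red'

(Q18)
SELECT S.age
FROM Sailors S
WHERE S.sname LIKE 'B_%B'

In Q18, the pattern 'B_%B' matches names that begin with B, end with B, and have at least three characters.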
The UNION operation combines two relations and automatically eliminates the duplicate
tuples.
The INTERSECT operation finds the common tuples of two relations and eliminates the
duplicate tuples.
The EXCEPT operation finds the tuples which are in one relation but not in the other
relation and automatically eliminates duplicate tuples.
(Q5) Find the names of sailors who have reserved a red or a green boat.
(Q19) Find the sids of all sailors who have reserved red boats but not green boats.
(Q20) Find all sids of sailors who have a rating of 10 or have reserved boat 104.
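A sketch of Q5 using UNION:

SELECT S.sname
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
UNION
SELECT S2.sname
FROM Sailors S2, Reserves R2, Boats B2
WHERE S2.sid = R2.sid AND R2.bid = B2.bid AND B2.color = 'green'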
4. NESTED QUERIES:
(Q1) Find the names of sailors who have reserved boat 103.
(Q2) Find the names of sailors who have reserved a red boat.
(Q21) Find the names of sailors who have not reserved a red boat.
(Q1) Find the names of sailors who have reserved boat number 103.
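A sketch of the nested-query version of Q1:

SELECT S.sname
FROM Sailors S
WHERE S.sid IN ( SELECT R.sid
                 FROM Reserves R
                 WHERE R.bid = 103 )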
Set-Comparison Operators:
(Q22) Find sailors whose rating is better than some sailor called Horatio.
(Q23) Find sailors whose rating is better than every sailor called Horatio.
We can obtain all such queries with a simple modification to Query Q22: just replace ANY
with ALL in the WHERE clause of the outer query.
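A sketch of Q22 using the ANY set-comparison operator:

SELECT S.sid
FROM Sailors S
WHERE S.rating > ANY ( SELECT S2.rating
                       FROM Sailors S2
                       WHERE S2.sname = 'Horatio' )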
(Q6) Find the names of sailors who have reserved both a red and a green boat.
(Q9) Find the names of sailors who have reserved all boats.
5. AGGREGATE OPERATORS:
Aggregate functions operate on a multiset of values and return a single value. SQL supports
five aggregate operations, which can be applied on any column, say A, of a relation:
1. COUNT ([DISTINCT] A): The number of (unique) values in the A column.
2. SUM ([DISTINCT] A): The sum of all (unique) values in the A column.
3. AVG ([DISTINCT] A): The average of all (unique) values in the A column.
4. MAX (A): The maximum value in the A column.
5. MIN (A): The minimum value in the A column.
Examples:
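A sketch of a typical aggregate query, finding the average age of sailors with a rating of 10:

SELECT AVG (S.age)
FROM Sailors S
WHERE S.rating = 10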
There are two such sailors, and their average age is 25.5. MIN (or MAX) can be used instead
of AVG in the above queries to find the age of the youngest (oldest) sailor.
(Q30) Find the names of sailors who are older than the oldest sailor with a rating of 10.
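A sketch of Q30:

SELECT S.sname
FROM Sailors S
WHERE S.age > ( SELECT MAX (S2.age)
                FROM Sailors S2
                WHERE S2.rating = 10 )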
The expressions appearing in the group-qualification in the HAVING clause must have a
single value per group. The intuition is that the HAVING clause determines whether an
answer row is to be generated for a given group. Therefore, a column appearing in the
group-qualification must appear as the argument to an aggregation operator, or it must
also appear in grouping-list.
If the GROUP BY clause is omitted, the entire table is regarded as a single group.
(Q31) Find the age of the youngest sailor for each rating level.
(Q32) Find the age of the youngest sailor who is eligible to vote (i.e., is at least 18 years
old) for each rating level with at least two such sailors.
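Sketches of Q31 and Q32 using GROUP BY and HAVING:

(Q31)
SELECT S.rating, MIN (S.age)
FROM Sailors S
GROUP BY S.rating

(Q32)
SELECT S.rating, MIN (S.age) AS minage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
HAVING COUNT (*) > 1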
More Examples of Aggregate Queries:
(Q33) For each red boat, find the number of reservations for this boat.
(Q34) Find the average age of sailors for each rating level that has at least two sailors.
After identifying groups based on rating, we retain only groups with at least two sailors.
The answer to this query on instance S3 is shown in Figure 5.14.
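A sketch of Q34:

SELECT S.rating, AVG (S.age) AS avgage
FROM Sailors S
GROUP BY S.rating
HAVING COUNT (*) > 1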
(Q35) Find the average age of sailors who are of voting age (i.e., at least 18 years old) for
each rating level that has at least two sailors.
(Q36) Find the average age of sailors who are of voting age (i.e., at least 18 years old) for
each rating level that has at least two such sailors.
(Q37) Find those ratings for which the average age of sailors is the minimum overall
ratings.
5. NULL VALUES:
The SQL NULL is the term used to represent a missing value. A NULL value in a table is
a value in a field that appears to be blank.
A field with a NULL value is a field with no value. It is very important to understand that a
NULL value is different from a zero value or a field that contains spaces.
The basic syntax of NULL while creating a table:
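A sketch of such a definition, using the CUSTOMERS table referred to below (column types are illustrative):

CREATE TABLE CUSTOMERS (
   ID      INT           NOT NULL,
   NAME    VARCHAR (20)  NOT NULL,
   AGE     INT           NOT NULL,
   ADDRESS CHAR (25),
   SALARY  DECIMAL (18, 2),
   PRIMARY KEY (ID)
);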
Here, NOT NULL signifies that column should always accept an explicit value of the
given data type. There are two columns, where we did not use NOT NULL, which means
these columns could be NULL.
A field with a NULL value is one that has been left blank during record creation.
The NULL value can cause problems when selecting data, however, because when
comparing an unknown value to any other value, the result is always unknown and not
included in the final results.
You must use the IS NULL or IS NOT NULL operators in order to check for a NULL
value.
Consider the following table, CUSTOMERS having the following records:
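The records themselves are not reproduced here; a sketch of checking for NULL values on such a table:

SELECT ID, NAME, AGE, ADDRESS, SALARY
FROM CUSTOMERS
WHERE SALARY IS NULL;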
SELECT *
FROM STUDENT S
WHERE S.group = 'B';
This query returns the set of tuples that satisfy the WHERE condition; all tuples that do not
satisfy the condition are ignored. Tuples with NULL values in the group column are also
ignored, because for them the condition evaluates to FALSE or UNKNOWN. This
elimination of rows that evaluate to unknown makes queries involving EXISTS and/or
UNIQUE much simpler and easier to understand, and makes the evaluation of these queries
(nested queries especially) much easier.
We know that comparing two fields with NULL values for equality yields UNKNOWN.
When it comes to duplicate elimination, however, two NULL values are treated as equal: two
tuples are said to be duplicates if they hold the same values, or if they both hold NULL in the
corresponding columns. So, although NULL = NULL evaluates to UNKNOWN in a WHERE
clause, two NULLs are treated as matching values when checking for duplicates.
Every arithmetic operator (+, -, %, /, *) returns UNKNOWN (NULL) if any one of its
arguments is NULL. Similarly, the aggregate operators SUM, AVG, MIN, MAX and
COUNT(DISTINCT) simply ignore NULL values and return the result computed from the
remaining non-NULL values. The only exception is COUNT(*), which does not ignore
NULL values: it counts all tuples and returns the number of tuples in the table.
To understand this, consider simple instances of Project and Department as shown in
table.
If we perform join operation on these two tables,
SELECT D1.*, P1.*
FROM DEPARTMENT D1, PROJECT P1
WHERE D1.Project_no = P1.Project_no;
The result of this statement is shown in Table 1.
The Table 1 shows the simple join operation of two tables, only those rows are selected
that satisfied the condition. However, if we want to include those rows that do not satisfy the
condition, then we can use the concept of OUTER JOINS.
Example:
SELECT D1.*, P1.*
FROM DEPARTMENT D1 LEFT OUTER JOIN PROJECT P1
ON D1.Project_no = P1.Project_no;
So, the LEFT OUTER JOIN result contains the rows that match in both tables, plus the rows
of the left table that have no match in the other table; for those rows, the values of the
attributes corresponding to the second table are NULL.
SELECT D1.*, P1.*
FROM DEPARTMENT D1 FULL OUTER JOIN PROJECT P1
ON D1.Project_no = P1.Project_no;
In this declaration, i.e., the creation of the STUDENT table, Sid is the PRIMARY KEY,
hence it must be UNIQUE and must not be NULL. The Project field indicates the project
taken up by the student; this field can take NULL values, as it is possible that a student has
not yet been assigned a project.
6. Embedded SQL: (****************)
SQL provides a powerful declarative query language. Writing queries in SQL is usually
much easier than coding the same queries in a general - purpose programming language.
However, a programmer must have access to a database from a general purpose programming
language for at least two reasons:
Not all queries can be expressed in SQL, since SQL does not provide the full expressive
power of a general-purpose language. That is, there exist queries that can be expressed in
a language such as C, Java, or COBOL that cannot be expressed in SQL. To write such
queries, we can embed SQL within a more powerful language.
Non-declarative actions, such as printing a report or interacting with a user, cannot be done
from within SQL, and a host language is needed for them.
SQL statements can refer to variables defined in the host program. Such host-language
variables must be prefixed by a colon (:) in SQL statements and must be declared between the
commands EXEC SQL BEGIN DECLARE SECTION and EXEC SQL END DECLARE
SECTION. The declarations are similar to how they would look in a C program and, as usual
in C, are separated by semicolons.
For example, we can declare variables c_sname, c_sid, c_rating, and c_age (with the
initial c used as a naming convention to emphasize that these are host language variables) as
follows:
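A sketch of the declare section, in the textbook's embedded-SQL style:

EXEC SQL BEGIN DECLARE SECTION
char c_sname[20];
long c_sid;
short c_rating;
float c_age;
EXEC SQL END DECLARE SECTION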
The intent is that after each embedded SQL statement is executed, the value of
SQLSTATE should be checked. If SQLERROR is specified and the value of SQLSTATE
indicates an exception, control is transferred to stmt, which is presumably responsible for
error/exception handling. Control is also transferred to stmt if NOT FOUND is specified and
the value of SQLSTATE is 02000, which denotes NO DATA.
SQL defines standards for embedding dynamic SQL calls in a host language, such as C, as
in the following example. A dynamic SQL program may contain a '?', which is a placeholder
for a value that is supplied when the SQL program is executed.
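A sketch in C following the textbook's example (the statement name readytogo is illustrative; a '?' placeholder inside the string would be bound at execution time with EXECUTE ... USING):

char c_sqlstring[] = {"DELETE FROM Sailors WHERE rating > 5"};
EXEC SQL PREPARE readytogo FROM :c_sqlstring;
EXEC SQL EXECUTE readytogo;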
8. CURSORS: (************)
A major problem in embedding SQL statements in a host language like C is that an
impedance mismatch occurs because SQL operates on sets of records, whereas languages like
C do not cleanly support a set-of-records abstraction. The solution is to essentially provide a
mechanism that allows us to retrieve rows one at a time from a relation. This mechanism is
called a cursor.
A cursor is a temporary work area created in the system memory when a SQL statement is
executed. A cursor contains information on a select statement and the rows of data accessed
by it.
This temporary work area is used to store the data retrieved from the database, and
manipulate this data. A cursor can hold more than one row, but can process only one row at a
time. The set of rows the cursor holds is called the active set.
We can declare a cursor on any relation or on any SQL query (because every query returns
a set of rows).
Once a cursor is declared, we can open it (which positions the cursor just before the first
row); fetch the next row; move the cursor (to the next row, to the row after the next n, to the
first row, or to the previous row, etc., by specifying additional parameters for the FETCH
command); or close the cursor.
As an example, we can find the name and age of a sailor, specified by assigning a value
to the host variable c_sid as follows:
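EXEC SQL SELECT S.sname, S.age
         INTO :c_sname, :c_age
         FROM Sailors S
         WHERE S.sid = :c_sid;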
The INTO clause allows us to assign the columns of the single answer row to the host
variables c_sname and c_age.
Consider a query that computes the names and ages of all sailors with a rating greater than
the current value of the host variable c_minrating. This query returns a collection of rows,
not just one row. The solution is to use a cursor:
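EXEC SQL DECLARE sinfo CURSOR FOR
         SELECT S.sname, S.age
         FROM Sailors S
         WHERE S.rating > :c_minrating;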
This code can be included in a C program, and once it is executed, the cursor sinfo is
defined. Subsequently, we can open the cursor:
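EXEC SQL OPEN sinfo;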
We can use the FETCH command to read the first row of cursor sinfo into host language
variables:
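EXEC SQL FETCH sinfo INTO :c_sname, :c_age;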
2. Properties of Cursors:
The general form of a cursor declaration is:
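Along the lines of the textbook's grammar:

DECLARE cursorname [INSENSITIVE] [SCROLL] CURSOR
  [WITH HOLD]
  FOR some query
  [ORDER BY order-item-list]
  [FOR READ ONLY | FOR UPDATE]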
For example, if sinfo is an updatable cursor and is open, we can execute the following
statement:
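UPDATE Sailors S
SET S.rating = S.rating - 1
WHERE CURRENT OF sinfo;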
This embedded SQL statement modifies the rating value of the row currently pointed to by
cursor sinfo; similarly, we can delete this row by executing the next statement:
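DELETE FROM Sailors
WHERE CURRENT OF sinfo;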
A cursor is updatable by default unless it is a scrollable or insensitive cursor, in which
case it is read-only by default.
If the keyword SCROLL is specified, the cursor is scrollable, which means that variants of
the FETCH command can be used to position the cursor in very flexible ways; otherwise,
only the basic FETCH command, which retrieves the next row, is allowed.
If the keyword INSENSITIVE is specified, the cursor behaves as if it is ranging over a
private copy of the collection of answer rows.
For example, while we are fetching rows using the sinfo cursor, we might modify rating
values in Sailor rows by concurrently executing the command:
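UPDATE Sailors S
SET S.rating = S.rating - 1;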
The answer is sorted first in ascending order by minage, and if several rows have the same
minage value, these rows are sorted further in descending order by rating. The cursor would
fetch the rows in the order shown in Figure 5.18.
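A hedged sketch of a query that would produce this ordering:

SELECT S.rating, MIN (S.age) AS minage
FROM Sailors S
WHERE S.age >= 18
GROUP BY S.rating
ORDER BY minage ASC, S.rating DESC;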
9. COMPLEX INTEGRITY CONSTRAINTS IN SQL: (*****)
When a row is inserted into Reserves or an existing row is modified, the conditional
expression in the CHECK constraint is evaluated. If it evaluates to false, the command is
rejected.
Example:
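CREATE DOMAIN ratingval INTEGER DEFAULT 1
  CHECK ( VALUE >= 1 AND VALUE <= 10 )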
INTEGER is the base type for the domain ratingval, and every ratingval value must be of
this type. Values in ratingval are further restricted by using a CHECK constraint; in defining
this constraint, we use the keyword VALUE to refer to a value in the domain.
Assertions are constraints applied over a group of tables. Unlike table constraints, which are
associated with a single table, assertions are applied across multiple tables.
Example:
As an example, suppose that we wish to enforce the constraint that the number of boats plus
the number of sailors should be less than 100. We could try the following table constraint:
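CREATE TABLE Sailors ( sid INTEGER,
                       sname CHAR(10),
                       rating INTEGER,
                       age REAL,
                       PRIMARY KEY (sid),
                       CHECK ( rating >= 1 AND rating <= 10 ),
                       CHECK ( ( SELECT COUNT (S.sid) FROM Sailors S )
                             + ( SELECT COUNT (B.bid) FROM Boats B ) < 100 ) )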
This solution suffers from two drawbacks. It is associated with Sailors, although it
involves Boats in a completely symmetric way. More important, if the Sailors table is empty,
this constraint is defined (as per the semantics of table constraints) to always hold, even if we
have more than 100 rows in Boats! We could extend this constraint specification to check
that Sailors is nonempty, but this approach becomes very cumbersome. The best solution is to
create an assertion, as follows:
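CREATE ASSERTION smallClub
CHECK ( ( SELECT COUNT (S.sid) FROM Sailors S )
      + ( SELECT COUNT (B.bid) FROM Boats B ) < 100 )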
10. TRIGGERS AND ACTIVE DATABASES: (*****)
A trigger is a procedure that is automatically invoked by the DBMS in response to specified
changes to the database; a database with a set of associated triggers is called an active
database. A trigger description contains three parts:
1. Event: a change to the database that activates the trigger.
2. Condition: a query or test that is run when the trigger is activated.
3. Action: a procedure that is executed when the trigger is activated and its condition is
true. A sample trigger is sketched below.
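A textbook-style sketch showing the three parts (the YoungSailors table is assumed for illustration):

CREATE TRIGGER youngSailorUpdate          -- Event: rows inserted into Sailors
AFTER INSERT ON Sailors
REFERENCING NEW TABLE NewSailors
FOR EACH STATEMENT
  INSERT                                  -- Action: copy qualifying rows
  INTO YoungSailors(sid, name, age, rating)
  SELECT sid, name, age, rating
  FROM NewSailors N
  WHERE N.age <= 18                       -- Condition: only young sailors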
Advantages of triggers:
1) Triggers can be used as an alternative method for implementing referential integrity
constraints.
2) By using triggers, business rules and transactions are easy to store in the database and can
be applied consistently even if there are future updates to the database.
3) When a change happens in a database, a trigger can propagate that change through the
entire database.
Disadvantages of triggers: an active database contains a set of triggers, and therefore it
becomes quite difficult to maintain. Which triggers are activated in what order can be hard
to understand, because a statement can activate more than one trigger and the action of one
trigger can activate other triggers.
Redundancy refers to repetition of same data or duplicate copies of same data stored in
different locations.
The Schema Refinement refers to refining the schema by using some technique; the best
technique for schema refinement is decomposition. Decomposition can eliminate
redundancy.
For example, suppose some update operation is being carried out, such as entering a new
record for an employee with id 100002. This must be done multiple times, i.e., once for each
file which stores the employee's details. This leads to redundant storage, i.e., the same
information is stored multiple times, and to update anomalies when some copies are missed.
3. Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
For example, suppose a new employee record is being entered for an employee who has not
yet been assigned an Emp_id. If we assume that null values are not allowed, then it is
impossible to enter the new record until the employee has been assigned an Emp_id. This is
called an insertion anomaly.
4. Deletion anomalies: It may not be possible to delete some information without losing
some other information as well.
For example, if we want to delete the grade entries where the grade is equal to 'A', then all
the information of Emp_section_id 268 will be deleted/lost.
2. Use of Decompositions:
Decomposition is the solution to the problems caused by data redundancy. Decomposition
means breaking up a large schema into multiple smaller schemas. Decomposition helps to
remove the anomalies and helps to maintain data integrity.
We can restrict redundancy in Employee database by dividing it into two smaller
relations/Schemas as in table1R and Table2R.
Now we can easily update Emp_section_id in the Section schema without bothering
about updates to the other tuples. To insert a new tuple, we can directly insert the new
record in the Section schema (with the help of Emp_section_id) even if the new employee
has not yet been assigned an Emp_id. To delete the entry with the grade equal to 'A', we
can do it directly on the Section schema without losing other information.
Thus, decomposition eliminates the problems caused by the different anomalies.
Functional Dependency
The functional dependency is a relationship that exists between two attributes. It typically
exists between the primary key and non-key attribute within a table.
X → Y
The left side of an FD is known as the determinant; the right side is known as the
dependent.
For example:
Here the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee
table, because if we know the Emp_Id, we can tell the employee name associated with it.
Example:
1. ID → Name,
2. Name → DOB
Inference Rules (Armstrong's Axioms):
1. Reflexive Rule (IR1)
In the reflexive rule, if Y is a subset of X, then X determines Y.
If X ⊇ Y then X → Y
Example:
1. X = {a, b, c, d, e}
2. Y = {a, b, c}
2. Augmentation Rule (IR2)
In the augmentation rule, if X determines Y, then XZ determines YZ for any Z.
If X → Y then XZ → YZ
Example: For R(A, B, C, D), if A → B then AC → BC.
3. Transitive Rule (IR3)
In the transitive rule, if X determines Y and Y determines Z, then X must also determine Z.
If X → Y and Y → Z then X → Z
4. Union Rule (IR4)
Union rule says, if X determines Y and X determines Z, then X must also determine Y and Z.
If X → Y and X → Z then X → YZ
Proof:
1. X → Y (given)
2. X → Z (given)
3. X → XY (using IR2 on 1 by augmentation with X. Where XX = X)
4. XY → YZ (using IR2 on 2 by augmentation with Y)
5. X → YZ (using IR3 on 3 and 4)
5. Decomposition Rule (IR5)
Decomposition rule is also known as the project rule. It is the reverse of the union rule.
If X → YZ then X → Y and X → Z
Proof:
1. X → YZ (given)
2. YZ → Y (using IR1 Rule)
3. X → Y (using IR3 on 1 and 2)
6. Pseudo transitive Rule (IR6)
If X → Y and WY → Z then WX → Z
Proof:
1. X → Y (given)
2. WY → Z (given)
3. WX → WY (using IR2 on 1 by augmenting with W)
4. WX → Z (using IR3 on 3 and 2)
Normalization
1NF A relation will be in 1NF if it contains only atomic values, i.e., each attribute holds a
single value.
2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally
dependent on the primary key.
3NF A relation will be in 3NF if it is in 2NF and no transitive dependency exists for
non-prime attributes.
BCNF A relation will be in BCNF if it is in 3NF and, for every functional dependency
X → Y, X is a super key.
4NF A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
5NF A relation is in 5NF if it is in 4NF, does not contain any join dependency, and joining is
lossless.
First Normal Form (1NF)
o A relation will be in 1NF if it contains only atomic values; an attribute of a table cannot
hold multiple values.
Example:
EMPLOYEE table:
EMP_ID   EMP_NAME   EMP_PHONE                 EMP_STATE
14       John       7272826385, 9064738238    UP
The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID   EMP_NAME   EMP_PHONE     EMP_STATE
14       John       7272826385    UP
14       John       9064738238    UP
Second Normal Form (2NF)
o A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally
dependent on the primary key.
Example: Let's assume a school stores the data of teachers and the subjects they teach. In
a school, a teacher can teach more than one subject.
TEACHER table:
TEACHER_ID   SUBJECT     TEACHER_AGE
25           Chemistry   30
25           Biology     30
47           English     35
83           Math        38
83           Computer    38
In the given table, the non-prime attribute TEACHER_AGE is dependent on TEACHER_ID
alone, which is a proper subset of the candidate key {TEACHER_ID, SUBJECT}; that is
why the table is not in 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID   SUBJECT
25           Chemistry
25           Biology
47           English
83           Math
83           Computer
Third Normal Form (3NF)
o A relation will be in 3NF if it is in 2NF and does not contain any transitive
dependency for non-prime attributes.
o 3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
o If there is no transitive dependency for non-prime attributes, then the relation must be
in third normal form.
A relation is in third normal form if it holds at least one of the following conditions for every
non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
EMPLOYEE table:
EMPLOYEE_ZIP table:
EMP_ZIP   EMP_STATE   EMP_CITY
201010    UP          Noida
02228     US          Boston
60007     US          Chicago
06389     UK          Norwich
462007    MP          Bhopal
Boyce Codd normal form (BCNF)
o A table is in BCNF if, for every functional dependency X → Y, X is a super key of the
table. BCNF is a stronger version of 3NF.
Example: Let's assume there is a company where employees work in more than one
department.
EMPLOYEE table: (EMP_ID, EMP_COUNTRY, EMP_DEPT, DEPT_TYPE, EMP_DEPT_NO)
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
364 UK
EMP_DEPT table:
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Fourth normal form (4NF)
o A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
dependency.
o For a dependency A → B, if for a single value of A multiple values of B exist, then
the relation has a multi-valued dependency.
Example
STUDENT
STU_ID   COURSE      HOBBY
21       Computer    Dancing
21       Math        Singing
34       Chemistry   Dancing
74       Biology     Cricket
59       Physics     Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent
entities with no relationship between them, so the table contains a multi-valued dependency.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
Fifth normal form (5NF)
o A relation is in 5NF if it is in 4NF and not contains any join dependency and joining
should be lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order
to avoid redundancy.
o 5NF is also known as Project-join normal form (PJ/NF).
Example
Consider a table that records which LECTURER teaches which SUBJECT in which
SEMESTER. In such a table, John takes both Computer and Math classes for Semester 1 but
doesn't take the Math class for Semester 2. In this case, a combination of all three fields is
required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject and who
will be taking that subject, so we leave Lecturer and Subject as NULL. But all three columns
together act as a primary key, so we can't leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER     LECTURER
Semester 1   Anshika
Semester 1   John
Semester 1   John
Semester 2   Akash
Semester 1   Praveen
UNIT V
Transaction
A transaction is a set of logically related operations on a database that is executed as a single
unit of work.
Example: Suppose an employee of bank transfers Rs 800 from X's account to Y's account.
This small transaction contains several low-level tasks:
X's Account
1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Y's Account
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
Operations of Transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in a
buffer in main memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
Let's take the example of a debit transaction on an account X whose initial balance is 4000.
The transaction consists of the following operations:
1. R(X);
2. X = X - 500;
3. W(X);
o The first operation reads X's value from database and stores it in a buffer.
o The second operation will decrease the value of X by 500. So buffer will contain
3500.
o The third operation will write the buffer's value to the database. So X's final value will
be 3500.
But it may be possible that, because of a hardware, software or power failure, the transaction
fails before finishing all the operations in the set.
For example: If in the above transaction, the debit transaction fails after executing operation
2 then X's value will remain 4000 in the database which is not acceptable by the bank.
Transaction property
The transaction has the four properties. These are used to maintain consistency in a database,
before and after the transaction.
Property of Transaction
1. Atomicity
2. Consistency
3. Isolation
4. Durability
Atomicity
o It states that all operations of the transaction take place at once if not, the transaction
is aborted.
o There is no midway, i.e., the transaction cannot occur partially. Each transaction is
treated as one unit and either run to completion or is not executed at all.
Commit: If a transaction commits, then all the changes made become visible.
Abort: If a transaction aborts, then none of its changes become visible (they are rolled back).
Example: Let's assume that following transaction T consisting of T1 and T2. A consists of
Rs 600 and B consists of Rs 300. Transfer Rs 100 from account A to account B.
T1              T2
Read(A)         Read(B)
A := A - 100    B := B + 100
Write(A)        Write(B)
If the transaction T fails after the completion of transaction T1 but before completion of
transaction T2, then the amount will be deducted from A but not added to B. This shows the
inconsistent database state. In order to ensure correctness of database state, the transaction
must be executed in entirety.
Consistency
o The integrity constraints are maintained so that the database is consistent before and
after the transaction.
o The execution of a transaction will leave a database in either its prior stable state or a
new stable state.
o The consistent property of database states that every transaction sees a consistent
database instance.
o The transaction is used to transform the database from one consistent state to another
consistent state.
For example: The total amount must be maintained before or after the transaction.
Therefore, the database is consistent. In the case when T1 is completed but T2 fails, then
inconsistency will occur.
Isolation
o It shows that the data which is used at the time of execution of a transaction cannot be
used by the second transaction until the first one is completed.
o In isolation, if the transaction T1 is being executed and using the data item X, then
that data item can't be accessed by any other transaction T2 until the transaction T1
ends.
o The concurrency control subsystem of the DBMS enforces the isolation property.
Durability
o It states that once a transaction has committed, its changes are permanent: they survive
system failures and cannot be lost. The recovery subsystem of the DBMS is responsible
for ensuring durability.
States of Transaction
Active state
o The active state is the first state of every transaction. In this state, the transaction is
being executed.
o For example: Insertion or deletion or updating a record is done here. But all the
records are still not saved to the database.
Partially committed
o In the partially committed state, a transaction executes its final operation, but the data
is still not saved to the database.
o In the total mark calculation example, a final display of the total marks step is
executed in this state.
Committed
o A transaction is said to be in a committed state if it executes all its operations
successfully. In this state, all its effects are permanently saved in the database system.
Failed state
o If any of the checks made by the database recovery system fails, then the transaction
is said to be in the failed state.
o In the example of total mark calculation, if the database is not able to fire a query to
fetch the marks, then the transaction will fail to execute.
Aborted
o If any of the checks fail and the transaction has reached a failed state then the
database recovery system will make sure that the database is in its previous consistent
state. If not then it will abort or roll back the transaction to bring the database into a
consistent state.
o If the transaction fails in the middle of execution, then all the operations it has already
executed are rolled back to restore the database to its consistent state.
o After aborting the transaction, the database recovery module will select one of the two
operations:
1. Re-start the transaction
2. Kill the transaction
Schedule
A schedule is a series of operations from one or more transactions, arranged in the order in
which they are executed.
1. Serial Schedule
The serial schedule is a type of schedule where one transaction is executed completely before
starting another transaction. In the serial schedule, when the first transaction completes its
cycle, then the next transaction is executed.
For example: Suppose there are two transactions T1 and T2 which have some operations. If
there is no interleaving of operations, then there are the following two possible outcomes:
1. Execute all the operations of T1, followed by all the operations of T2.
2. Execute all the operations of T2, followed by all the operations of T1.
o In the given (a) figure, Schedule A shows the serial schedule where T1 followed by
T2.
o In the given (b) figure, Schedule B shows the serial schedule where T2 followed by
T1.
2. Non-serial Schedule
If interleaving of operations from different transactions is allowed, then the schedule is
non-serial.
3. Serializable schedule
o The serializability of schedules is used to find non-serial schedules that allow the
transaction to execute concurrently without interfering with one another.
o It identifies which schedules are correct when executions of the transaction have
interleaving of their operations.
o A non-serial schedule will be serializable if its result is equal to the result of its
transactions executed serially.
Here,
Assume a schedule S. For S, we construct a graph known as a precedence graph. This graph
is a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices
contains all the transactions participating in the schedule. The set of edges contains all edges
Ti → Tj for which one of the following three conditions holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
o If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti
are executed before the first instruction of Tj is executed.
o If a precedence graph for schedule S contains a cycle, then S is non-serializable. If the
precedence graph has no cycle, then S is known as serializable.
For example:
Explanation:
The precedence graph for schedule S1 contains a cycle that's why Schedule S1 is non-
serializable.
Explanation:
The precedence graph for schedule S2 contains no cycle that's why ScheduleS2 is
serializable.
Conflicting Operations
Two operations are said to be conflicting if they satisfy all of the following conditions: they
belong to different transactions, they operate on the same data item, and at least one of them
is a write operation.
Conflict Equivalent
In the conflict equivalent, one can be transformed to another by swapping non-conflicting
operations. In the given example, S2 is conflict equivalent to S1 (S1 can be converted to S2
by swapping non-conflicting operations).
Example:
Schedule S2 is a serial schedule because, in this, all operations of T1 are performed before
starting any operation of T2. Schedule S1 can be transformed into a serial schedule by
swapping non-conflicting operations of S1.
T1          T2
Read(A)
Write(A)
Read(B)
Write(B)
            Read(A)
            Write(A)
            Read(B)
            Write(B)
View Serializability
A schedule is view serializable if it is view equivalent to some serial schedule.
View Equivalent
Two schedules S1 and S2 are said to be view equivalent if they satisfy the following
conditions:
1. Initial Read
An initial read of both schedules must be the same. Suppose two schedule S1 and S2. In
schedule S1, if a transaction T1 is reading the data item A, then in S2, transaction T1 should
also read A.
Above two schedules are view equivalent because Initial read operation in S1 is done by T1
and in S2 it is also done by T1.
2. Updated Read
In schedule S1, if transaction Ti is reading data item A which has been updated by Tj, then in
S2 also Ti should read A as updated by Tj.
The above two schedules are not view equivalent because, in S1, T3 is reading A updated by
T2 and in S2, T3 is reading A updated by T1.
3. Final Write
A final write must be the same in both schedules. In schedule S1, if transaction T1 performs
the final write on data item A, then in S2 the final write on A should also be done by T1.
The above two schedules are view equivalent because the final write operation in S1 is done
by T3 and in S2 the final write operation is also done by T3.
Example:
Schedule S
With three transactions, the total number of possible serial schedules = 3! = 6. They are:
S1 = <T1 T2 T3>
S2 = <T1 T3 T2>
S3 = <T2 T3 T1>
S4 = <T2 T1 T3>
S5 = <T3 T1 T2>
S6 = <T3 T2 T1>
Schedule S1
In both schedules S and S1, there is no read except the initial read, so we don't need to
check the updated-read condition.
The initial read operation in S is done by T1 and in S1, it is also done by T1.
The first schedule S1 satisfies all three conditions, so we don't need to check another
schedule. Hence, schedule S is view equivalent to the serial schedule:
T1 → T2 → T3
Recoverability of Schedule
Sometimes a transaction may not execute completely due to a software issue, system crash or
hardware failure. In that case, the failed transaction has to be rolled back. But some other
transaction may also have used the value produced by the failed transaction, so we also have
to roll back those transactions.
The above table 1 shows a schedule which has two transactions. T1 reads and writes the
value of A, and that value is read and written by T2. T2 commits, but later on T1 fails. Due to
the failure, we have to roll back T1. T2 should also be rolled back because it read the value
written by T1, but T2 can't be rolled back because it has already committed. This type of
schedule is known as an irrecoverable schedule.
Irrecoverable schedule: The schedule will be irrecoverable if Tj reads the updated value of
Ti and Tj commits before Ti commits.
The above table 2 shows a schedule with two transactions. Transaction T1 reads and writes
A, and that value is read and written by transaction T2. But later on, T1 fails. Due to this, we
have to roll back T1. T2 should also be rolled back because T2 has read the value written by
T1. As T2 has not committed before T1 commits, we can roll back transaction T2 as well. So
this schedule is recoverable with cascading rollback.
Recoverable with cascading rollback: The schedule will be recoverable with cascading
rollback if Tj reads the updated value of Ti and the commit of Tj is delayed until the commit
of Ti.
The above Table 3 shows a schedule with two transactions. Transaction T1 reads and writes
A and commits, and only then is that value read and written by T2. So this is a cascadeless
recoverable schedule.
Failure Classification
To find that where the problem has occurred, we generalize a failure into the following
categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
The transaction failure occurs when a transaction fails to execute or reaches a point from
where it can't go any further. If a transaction or a process fails partway, this is called a
transaction failure.
2. System Crash
o System failure can occur due to power failure or other hardware or software
failure. Example: Operating system error.
3. Disk Failure
o It occurs where hard-disk drives or storage drives used to fail frequently. It
was a common problem in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, disk head crash, and
unreachability to the disk or any other failure, which destroy all or part of disk
storage.
Log-Based Recovery
o The log is a sequence of records. Log of each transaction is maintained in some stable
storage so that if any failure occurs, then it can be recovered from there.
o If any operation is performed on the database, then it will be recorded in the log.
o But the process of storing the logs should be done before the actual transaction is
applied in the database.
Let's assume there is a transaction to modify the City of a student. The following logs are
written for this transaction.
o When the transaction is initiated, then it writes 'start' log.
<Tn, Start>
o When the transaction modifies the City from 'Noida' to 'Bangalore', then another log is
written to the file.
<Tn, City, 'Noida', 'Bangalore'>
o When the transaction is finished, then it writes another log to indicate the end of the
transaction.
<Tn, Commit>
When the system is crashed, then the system consults the log to find which transactions need
to be undone and which need to be redone.
1. If the log contains the record <Ti, Start> and <Ti, Commit>, then the transaction Ti
needs to be redone.
2. If the log contains the record <Ti, Start> but does not contain either <Ti, Commit>
or <Ti, Abort>, then the transaction Ti needs to be undone.
Checkpoint
o The checkpoint is a type of mechanism where all the previous logs are removed from
the system and permanently stored in the storage disk.
o The checkpoint is like a bookmark. During the execution of a transaction, such
checkpoints are marked, and as the transaction executes, log files are created using the
steps of the transaction.
o When it reaches to the checkpoint, then the transaction will be updated into the
database, and till that point, the entire log file will be removed from the file. Then the
log file is updated with the new step of transaction till next checkpoint and so on.
o The checkpoint is used to declare a point before which the DBMS was in the
consistent state, and all transactions were committed.
In the following manner, a recovery system recovers the database from this failure:
o The recovery system reads log files from the end to start. It reads log files from T4 to
T1.
o Recovery system maintains two lists, a redo-list, and an undo-list.
o A transaction is put into the redo list if the recovery system sees a log with <Tn,
Start> and <Tn, Commit>, or just <Tn, Commit>. All the transactions in the redo list
are redone, since their logs show that they completed.
o For example: In the log file, transactions T2 and T3 will have <Tn, Start> and <Tn,
Commit>. The T1 transaction will have only <Tn, Commit> in the log file, because T1
committed after the checkpoint was crossed. Hence T1, T2 and T3 are put into the
redo list.
o The transaction is put into undo state if the recovery system sees a log with <Tn,
Start> but no commit or abort log found. In the undo-list, all the transactions are
undone, and their logs are removed.
o For example: Transaction T4 will have <Tn, Start>. So T4 will be put into the undo
list, since this transaction is not yet complete and failed in between.
INDEXING TECHNIQUES:
B+ Tree
o The B+ tree is a balanced multi-way (n-ary) search tree, not a binary tree. It follows a
multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. B+ tree ensures that all leaf
nodes remain at the same height.
o In the B+ tree, the leaf nodes are linked using a link list. Therefore, a B+ tree can
support random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is
of the order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.
Internal node
o An internal node of the B+ tree can contain at least n/2 child (tree) pointers, except the
root node.
o At most, an internal node of the tree contains n child pointers.
Leaf node
o The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.
o At most, a leaf node contains n record pointer and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to next leaf node.
Searching a record in a B+ Tree
Suppose we have to search for 55 in the below B+ tree structure. First, we will fetch the
intermediary node, which will direct us to the leaf node that can contain the record for 55.
So, in the intermediary node, we will find a branch between 50 and 75 nodes. Then at the
end, we will be redirected to the third leaf node. Here DBMS will perform a sequential search
to find 55.
B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node
after 55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert
60 there.
In this case, we have to split the leaf node, so that it can be inserted into tree without affecting
the fill factor, balance and order.
The 3rd leaf node has the values (50, 55, 60, 65, 70), and the intermediate node currently
branches to it at 50. We will split the leaf node of the tree in the middle so that its balance is
not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.
If these two have to be leaf nodes, the intermediate node cannot branch from 50 alone. It
should have 60 added to it, and then we can have a pointer to the new leaf node.
This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node too. If we remove it from the
intermediate node, then the tree will not satisfy the rule of the B+ tree. So we need to modify
it to have a balanced tree.
After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:
File Organization
o The File is a collection of records. Using the primary key, we can access the records.
The type and frequency of access can be determined by the type of file organization
which was used for a given set of records.
o File organization is a logical relationship among various records. This method defines
how file records are mapped onto disk blocks.
o File organization is used to describe the way in which the records are stored in terms
of blocks, and the blocks are placed on the storage medium.
o The first approach to map the database to files is to use several files, storing only
fixed-length records in any given file. An alternative approach is to structure our files
so that they can accommodate records of multiple lengths.
o Files of fixed length records are easier to implement than the files of variable length
records.
File organization contains various methods. These particular methods have pros and cons on
the basis of access or selection. In the file organization, the programmer decides the best-
suited file organization method according to his requirement.
Sequential File Organization
This method is the easiest method for file organization. In this method, files are stored
sequentially. This method can be implemented in two ways: the pile file method and the
sorted file method.
1. Pile file method:
o It is a quite simple method. In this method, we store the records in a sequence, i.e., one
after another; the records are inserted in the order in which they arrive in the tables.
o In case of updating or deleting a record, the record will be searched in the memory
blocks. When it is found, it will be marked for deletion, and the new record is inserted.
Suppose we have records R1, R3, and so on up to R9 and R8 stored in a sequence; records
here are nothing but rows in a table. If we want to insert a new record R2 into the sequence,
it will simply be placed at the end of the file.
2. Sorted file method:
Suppose there is a pre-existing sorted sequence of records R1, R3, and so on up to R6 and
R7. If a new record R2 has to be inserted into the sequence, it will be inserted at the end of
the file, and then the sequence will be sorted again.
Pros of sequential file organization:
o It contains a fast and efficient method for handling huge amounts of data.
o In this method, files can be easily stored in cheaper storage mechanism like magnetic
tapes.
o It is simple in design. It requires no much effort to store the data.
o This method is used when most of the records have to be accessed like grade
calculation of a student, generating the salary slip, etc.
o This method is used for report generation or statistical calculations.
Cons of sequential file organization:
o It wastes time, as we cannot jump directly to the particular record that is required but
have to move through the file sequentially.
o Sorted file method takes more time and space for sorting the records.
Heap File Organization
o It is the simplest and most basic type of organization. It works with data blocks. In
heap file organization, the records are inserted at the file's end. When the records are
inserted, it doesn't require the sorting and ordering of records.
o When the data block is full, the new record is stored in some other block. This new
data block need not to be the very next data block, but it can select any data block in
the memory to store new records. The heap file is also known as an unordered file.
o In the file, every record has a unique id, and every page in a file is of the same size. It
is the DBMS responsibility to store and manage the new records.
If we want to search, update or delete data in heap file organization, we need to traverse
the data from the start of the file until we get the requested record.
If the database is very large, then searching for, updating or deleting a record will be
time-consuming, because there is no sorting or ordering of records. In heap file
organization, we need to check all the data until we get the requested record.
Pros of heap file organization:
o It is a very good method of file organization for bulk insertion. If a large amount of
data needs to be loaded into the database at one time, this method is best suited.
o In the case of a small database, fetching and retrieving records is faster than in
sequential file organization.
Cons of heap file organization:
o This method is inefficient for large databases, as it takes time to search for or
modify a record; the whole file may have to be traversed.
Hash File Organization
Hash file organization uses the computation of a hash function on some fields of the
records. The hash function's output determines the location of the disk block where the
record is to be placed.
When a record has to be retrieved using the hash key columns, the address is generated and
the whole record is retrieved using that address. In the same way, when a new record has
to be inserted, the address is generated using the hash key and the record is inserted
directly at that address. The same process applies to delete and update operations.
In this method, there is no need to search or sort the entire file. Each record is stored
at a seemingly random location in memory, determined by the hash function.
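A minimal Python sketch of the idea, assuming an in-memory dictionary standing in for the
disk blocks and a simple mod-based hash function (the keys and records are illustrative):

buckets = {}                          # bucket address -> list of records

def address(key, num_buckets=8):
    # Any column value can be hashed; the output picks the data block.
    return hash(key) % num_buckets

def insert(key, record):
    buckets.setdefault(address(key), []).append((key, record))

def search(key):
    # Search, update and delete all recompute the same address,
    # so no index structure and no full-file scan are needed.
    for k, rec in buckets.get(address(key), []):
        if k == key:
            return rec

insert(103, "rec-103")
insert(104, "rec-104")
print(search(104))                    # 'rec-104'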
Indexed sequential access method (ISAM)
The ISAM method is an advanced sequential file organization. In this method, records are
stored in the file using the primary key. An index value is generated for each primary key
and mapped to the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
o Since each record has the address of its data block, searching for a record in a huge
database is quick and easy.
o This method supports range retrieval and partial retrieval of records. Since the index
is based on the primary key values, we can retrieve the data for a given range of
values. In the same way, partial values can also be searched easily, e.g., student
names starting with 'JA'.
Cons of ISAM
o This method requires extra disk space to store the index values.
o When new records are inserted, these files have to be reconstructed to maintain the
sequence.
o When a record is deleted, the space used by it needs to be released; otherwise, the
performance of the database will slow down.
Cluster File Organization
o When two or more tables are stored in the same file, it is known as a cluster. These
files have two or more tables in the same data block, and the key attributes which
are used to map these tables together are stored only once.
o This method reduces the cost of searching for various records in different files.
o The cluster file organization is used when there is a frequent need for joining the
tables on the same condition, and those joins return only a few records from both
tables. In the given example, we retrieve records for particular departments only;
this method is not suited to retrieving the records of all departments.
In this method, we can directly insert, update or delete any record. Data is sorted based
on the key used for searching. The cluster key is the key on which the joining of the
tables is performed.
1. Indexed Clusters:
In an indexed cluster, records are grouped based on the cluster key and stored together.
The above EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster:
all the records are grouped based on the cluster key, DEP_ID, and stored together.
2. Hash Clusters:
It is similar to the indexed cluster. In a hash cluster, instead of storing the records
based on the cluster key, we generate a hash value from the cluster key and store together
the records with the same hash key value.
Pros of cluster file organization:
o The cluster file organization is used when there is a frequent request for joining
tables with the same joining condition.
o It provides efficient results when there is a 1:M mapping between the tables.
Cons of cluster file organization:
o This method has low performance for very large databases.
o If there is any change in the joining condition, this method cannot be used;
traversing the file with a different join condition takes a lot of time.
o This method is not suitable for tables with a 1:1 relationship.
Indexing in DBMS
Indexing is used to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed. An index is a data structure used to
locate and access the data in a database table quickly.
Index structure:
o The first column of the index is the search key, which contains a copy of the primary
key or a candidate key of the table. These values are stored in sorted order so that
the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be
found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is
10 bytes long. If the IDs start with 1, 2, 3 and so on, and we have to search for the
employee with ID 543:
o In the case of a database with no index, we have to scan the disk blocks from the
start until we reach 543. The DBMS will find the record after reading 543*10 = 5430
bytes.
o In the case of an index, we search using the index; assuming each index entry
occupies only 2 bytes, the DBMS will find the record after reading about
542*2 = 1084 bytes, which is far less than in the previous case.
Primary Index
o If the index is created on the basis of the primary key of the table, it is known as
a primary index. Primary keys are unique to each record and have a 1:1 relation
with the records.
o As primary keys are stored in sorted order, the performance of the searching operation
is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is the same as the number of
records in the main table.
o It needs more space to store the index records themselves. Each index record holds
the search key and a pointer to the actual record on the disk.
Sparse index
o In the sparse index, index records appear for only a few items in the data file. Each
item points to a block.
o Instead of pointing to each record in the main table, the index points to records in
the main table at intervals, i.e., with gaps (see the sketch below).
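A small Python sketch of a sparse-index lookup, with assumed anchor keys and block
contents: the lookup finds the highest index entry not exceeding the search key and then
scans only that block sequentially.

from bisect import bisect_right

anchors = [100, 200, 300]             # first search key of each block
blocks = {0: list(range(100, 200)),   # assumed block contents
          1: list(range(200, 300)),
          2: list(range(300, 400))}

def lookup(key):
    # Highest index entry whose search key is <= the wanted key.
    block_no = bisect_right(anchors, key) - 1
    # Scan only the chosen block sequentially for the record.
    return key in blocks[block_no]

print(lookup(257))                    # True: block 1 is scanned and 257 is found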
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns, which may not be unique for each record.
o In this case, to identify the records faster, we group two or more columns to get a
unique value and create an index out of them. This method is called a clustering
index.
o Records which have similar characteristics are grouped together, and indexes are
created for these groups.
The previous scheme is a little confusing because one disk block is shared by records
which belong to different clusters. Using a separate disk block for each cluster is a
better technique.
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows.
These mappings are usually kept in primary memory so that address fetches are faster; the
actual data is then searched in secondary memory based on the address obtained from the
mapping. If the mapping size grows, fetching the address itself becomes slower, and the
sparse index is no longer efficient. To overcome this problem, secondary indexing is
introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced.
In this method, the huge range for the columns is selected initially so that the mapping size of
the first level becomes small. Then each range is further divided into smaller ranges. The
mapping of the first level is stored in the primary memory, so that address fetch is faster. The
mapping of the second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record of roll 111 in the diagram, the search looks for the
highest entry which is smaller than or equal to 111 in the first-level index. It gets
100 at this level.
o Then, in the second index level, it again looks for the highest entry smaller than or
equal to 111 and gets 110. Using the address 110, it goes to the data block and
scans each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is
done in the same manner (a sketch follows).
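A minimal Python sketch of this two-level lookup for roll 111; the index contents below
are assumptions chosen to match the example.

from bisect import bisect_right

first_level = [1, 100, 200]                 # kept in primary memory
second_level = {100: [100, 110, 120, 130]}  # kept in secondary memory
data_blocks = {110: [110, 111, 112, 113]}   # data block starting at roll 110

def find(roll):
    # First level: highest entry <= roll gives 100.
    l1 = first_level[bisect_right(first_level, roll) - 1]
    # Second level: highest entry <= roll gives the block address 110.
    entries = second_level[l1]
    l2 = entries[bisect_right(entries, roll) - 1]
    # Finally scan the data block record by record.
    return roll in data_blocks[l2]

print(find(111))                            # True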
Hashing
In a huge database structure, it is very inefficient to search through all the index
values to reach the desired data. The hashing technique instead calculates the direct
location of a data record on the disk without using an index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data
bucket or data blocks.
In this, the hash function can use any of the column values to generate the address. Most
of the time, the hash function uses the primary key to generate the address of the data
block. The hash function can be anything from a simple to a complex mathematical function.
We can even consider the primary key itself as the address of the data block, i.e., each
row is stored in the data block whose address is the same as its primary key.
The above diagram shows data block addresses that are the same as the primary key values.
The hash function can also be a simple mathematical function like mod, or a more complex
one such as exponential, cos or sin. Suppose we have a mod(5) hash function to determine
the address of the data block. In this case, applying mod(5) to the primary keys generates
3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
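A one-line check of this arithmetic in Python; the EMP_IDs are assumed values, chosen only
to reproduce the addresses 3, 3, 1, 4 and 2 mentioned above:

for emp_id in (103, 108, 106, 104, 107):    # assumed primary keys
    print("EMP_ID", emp_id, "-> bucket", emp_id % 5)
# prints buckets 3, 3, 1, 4 and 2 respectively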
Types of Hashing:
o Static Hashing
o Dynamic Hashing
Static Hashing
In static hashing, the resultant data bucket address is always the same. That means if we
generate an address for EMP_ID = 103 using the hash function mod(5), the result will
always be the same bucket address, 3. The bucket address does not change.
Hence in this static hashing, the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in the memory used to store the
data.
Operations of Static Hashing
o Searching a record
When a record needs to be searched, then the same hash function retrieves the address of the
bucket where the data is stored.
o Insert a Record
When a new record is inserted into the table, then we will generate an address for a new
record based on the hash key and record is stored in that location.
o Delete a Record
To delete a record, we first fetch the record which is to be deleted, and then delete it
from that address in memory.
o Update a Record
To update a record, we will first search it using a hash function, and then the data record is
updated.
Bucket Overflow
If we want to insert a new record into the file but the address of the data bucket
generated by the hash function is not empty, i.e., data already exists at that address,
the situation is known as bucket overflow. This is a critical situation in static hashing.
To overcome this situation, there are various methods. Some commonly used methods are as
follows:
1. Open Hashing
When the hash function generates an address at which data is already stored, the next
available bucket is allocated to the new record. This mechanism is called Linear Probing.
For example: suppose R3 is a new record which needs to be inserted, and the hash function
generates the address 112 for it. But that bucket is already full, so the system searches
for the next available data bucket, 113, and assigns R3 to it.
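A minimal Python sketch of linear probing using the bucket numbers from the example; the
initial bucket occupancy is assumed:

buckets = {110: "R1", 111: "R4", 112: "R2", 113: None, 114: None}

def insert_linear(start_addr, record):
    addr = start_addr
    while buckets[addr] is not None:    # bucket occupied: probe the next one
        addr += 1
    buckets[addr] = record
    return addr

print(insert_linear(112, "R3"))         # 112 is full, so R3 lands in 113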
2. Close Hashing
When buckets are full, then a new data bucket is allocated for the same hash result and is
linked after the previous one. This mechanism is known as Overflow chaining.
For example: suppose R3 is a new record which needs to be inserted into the table, and the
hash function generates the address 110 for it. But this bucket is too full to store the
new record. In this case, a new bucket is allocated, linked after bucket 110, and the
record is placed there.
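A minimal Python sketch of overflow chaining, modelling the chain of buckets for one hash
address as a list; the capacity and initial contents are assumed:

BUCKET_CAPACITY = 2
chains = {110: ["R1", "R2"]}            # primary bucket 110 is already full

def insert_chained(addr, record):
    # Appending past the capacity models linking a new overflow
    # bucket after the primary bucket for the same hash address.
    chains.setdefault(addr, []).append(record)

insert_chained(110, "R3")
print(chains[110][:BUCKET_CAPACITY])    # ['R1', 'R2'] in the primary bucket
print(chains[110][BUCKET_CAPACITY:])    # ['R3'] in the linked overflow bucket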
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
To insert a record:
o First, follow the same retrieval procedure, ending up in some bucket.
o If there is still space in that bucket, place the record in it.
o If the bucket is full, split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, based on the last bits of their hash
addresses:
The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket B0. The
last two bits of 5 and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3
are 10, so they go into bucket B2. The last two bits of 7 are 11, so it goes into bucket
B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, whose last two bits are 01, it must go into
bucket B1. But bucket B1 is full, so it will be split.
o The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5
and 9 are 001, so they stay in bucket B1, while the last three bits of 6 are 101, so
it goes into the new bucket B5 (a sketch of this grouping and split follows).
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
directory entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
directory entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory
entries, because the last two bits of both entries are 11.
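A small Python sketch of the bucket grouping and the split; the 5-bit hash addresses are
assumed, chosen only to match the last-bit patterns described above:

addrs = {2: "00000", 4: "00100",     # last two bits 00 -> B0
         5: "00001", 6: "00101",     # last two bits 01 -> B1
         1: "00010", 3: "00110",     # last two bits 10 -> B2
         7: "00011", 9: "10001"}     # 7 -> B3; 9 also ends in 01

def group(keys, bits):
    # Group keys into buckets by the last `bits` bits of their hash address.
    buckets = {}
    for k in keys:
        buckets.setdefault(addrs[k][-bits:], []).append(k)
    return buckets

for bits, keys in sorted(group([1, 2, 3, 4, 5, 6, 7], 2).items()):
    print(bits, keys)   # 00 [2, 4] / 01 [5, 6] / 10 [1, 3] / 11 [7]

# Inserting 9 overflows the 01 bucket, so that bucket alone is split
# by looking at one more bit of the hash address:
for bits, keys in sorted(group([5, 6, 9], 3).items()):
    print(bits, keys)   # 001 [5, 9] / 101 [6]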
Pros of dynamic hashing:
o In this method, performance does not degrade as the data in the system grows; the
memory simply grows to accommodate the data.
o Memory is well utilized, as it grows and shrinks with the data; no memory lies
unused.
o This method is good for dynamic databases where the data grows and shrinks
frequently.
Cons of dynamic hashing:
o In this method, if the data size increases, the number of buckets also increases. The
addresses of the data are maintained in the bucket address table, because the data
addresses keep changing as buckets grow and shrink. If there is a huge increase in
data, maintaining the bucket address table becomes tedious.
o A bucket overflow situation can also occur here, but it takes longer to reach than
in static hashing.