14UIT305-Database Systems PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 174

14UIT305 DATABASE SYSTEMS LTPC

Regulation 2014 3003

OBJECTIVES:
• To introduce the basic concepts of database system design and architecture
• To familiarize the Normal forms
• To demonstrate the transaction, recovery controls and storage techniques

UNIT I INTRODUCTION 9

Purpose of Database System -– Views of data – Data Models – Database Languages ––Database System Architecture
– Database users and Administrator – Entity–Relationship Model (E-R model) – E-R Diagrams -- Introduction to
relational databases.

UNIT II RELATIONAL MODEL 9

The relational Model – The catalog – Types– Keys– Relational Algebra – Domain Relational Calculus – Tuple
Relational Calculus– Fundamental operations – Additional Operations – SQL fundamentals – Integrity – Triggers–
Security – Advanced SQL features – Embedded SQL – Dynamic SQL– Missing Information – Views –Introduction
to Distributed Databases and Client/Server Databases.

UNIT III DATABASE DESIGN 9

Functional Dependencies – Non-loss Decomposition – Functional Dependencies – First, Second, Third Normal
Forms, Dependency Preservation – Boyce/Codd Normal Form-Multi-valued Dependencies and Fourth Normal Form
– Join Dependencies and Fifth Normal Form.

UNIT IV TRANSACTIONS 9

Transaction Concepts – Transaction Recovery – ACID Properties – System Recovery –Media Recovery – Two Phase
Commit– Save Points – SQL Facilities for recovery –Concurrency – Need for Concurrency – Locking Protocols –
Two Phase Locking – Intent Locking – Deadlock– Serializability – Recovery Isolation Levels – SQL Facilities for
Concurrency.

UNIT V IMPLEMENTATION TECHNIQUES 9

Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage – File Organization – Organization
of Records in Files – Indexing and Hashing – Ordered Indices – B+ tree Index Files – B tree Index Files – Static
Hashing – Dynamic Hashing – Query Processing Overview – Catalog Information for Cost Estimation – Selection
Operation – Sorting – Join Operation – Database Tuning. Multimedia Database. Case study: FIRM –a database
management system for real time avionics.

TOTAL: 45 PERIODS
COURSE OUTCOMES:
After the successful completion of this course, the student will be able to
• Explain the basic needs of database systems
• Build query with structured query language
• Explain the need for security in database
• Construct a DBMS for an application

TEXT BOOKS:
1. Abraham Silberschatz, Henry F. Korth, Sudharshan.S, “Database System Concepts”,
Tata McGraw Hill, 5th Edition, 2006.
2. Date.C.J, Kannan.A, Swamynathan.S, “An Introduction to Database Systems”, Pearson
Education, 8th Edition, 2006.

REFERENCE BOOKS:
1. RamezElmasri, ShamkantB.Navathe, “Fundamentals of Database Systems”, Pearson
Addision Wesley, 4th Edition, 2007.
2. Raghu Ramakrishnan, “Database Management Systems”, Tata McGraw Hill, 3rd Edition
3. Singh.S.K, “Database Systems Concepts, Design and Applications”, Pearson Education,
1st Edition, 2006.
4. Hector Garcia-Molina, Jeffrey D.Ullman, Jennifer Widom, “Database Systems: The
5. Complete Book”, Pearson Education, 4th Edition, 2009.
3

UNIT I

INTRODUCTION

Purpose of Database systems- Views of data- Data Models- Database Languages- Database system
Architecture– Database users and Administrator- Entity Relationship Model (ER model) – E-R
Diagram – Introduction to relational databases.

1. INTRODUCTION

Data: Known facts that can be recorded that have implicit meaning.
E.g. Student roll no, names, address etc

Database: collection of inter-related data organized meaningfully for a specific purpose.

DBMS: DBMS is a collection of interrelated data and a set of program to access those data. The primary
goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and
efficient.

Database System: Database and DBMS collectively known as database system.


INTRODUCTION TO FILE AND DATABASE SYSTEMS:

Database Applications
 Banking: all transactions
 Airlines: reservations, schedules
 Universities: registration, grades
 Sales: customers, products, purchases
 Online retailers: order tracking, customized recommendations
 Manufacturing: production, inventory, orders, supply chain
 Human resources: employee records, salaries, tax deductions
 Credit card transactions
 Telecommunications & Finance
4

2. PURPOSE OF DATABASE SYSTEMS


Drawbacks of Conventional File Processing System
i. Data redundancy and inconsistency
 Since the files and application programs are created by different programmers over a long
period of time, the files have different formats and the programs may be written in several
programming language. The same piece of information may be duplicated in several files.
 For Example: The address and phone number of particular customer may appear in a file that
consists of personal information and in saving account records file also. This redundancy
leads to data consistency that is, the various copies of the same data may no longer agree.
 For example: a changed customer address may be reflected in personal information file, but
not in saving account records file.
ii. Difficulty in accessing data
 Conventional file processing environments do not allow needed data to be retrieved in a
convenient and efficient manner.
 For Example: Suppose that bank officer needs to find out the names of all customers who
live within the city‘s 411027 zip code. The bank officer has now two choices: Either get the
list of customers and extract the needed information manually, or ask the data processing
department to have a system programmer write the necessary application program. Both
alternatives are unsatisfactory.
iii. Data isolation
 Since, data is scattered in various files, and files may be in different formats, it is difficult to
write new application programs to retrieve appropriate data.
iv. Concurrent access anomalies
 In order to improve the overall performance of the system and obtain a faster response time
many systems allow multiple users to update the data simultaneously. In such environment,
interaction of concurrent updates may results in inconsistent data.
 For Example: Consider bank account A, with $500.If two customers with draw funds (say
$50 and $100 resp ) from account A at the same time, the result of the concurrent executions
$400, rather than $350. In order to guard against this possibility, some form of supervision
must be maintained in the system.
v. Atomicity Problem
 System failure will lead to atomicity problem.
5

 For Example: Failure during transfer of fund from system A to A. It will be debited from A
but not credited to B leading to wrong transaction.
vi. Concurrent Access Anomalies
 In order to improve the overall performance of the system and obtain a faster response time
many systems allow multiple users to update the data simultaneously. In such environment,
interaction of concurrent updates may result in inconsistent data.
 For Example: Consider bank account A, containing $500. If two customers withdraw funds
say $50 and $100 respectively) from account A at about the same time, the result of the
concurrent executions may leave the account in an incorrect (or inconsistent) state. Balance
will be $400 instead of $350. To protect against this possibility, the system must maintain
some form of supervision.
vii. Security problems
 Not every user of the database system should be able to access all the data. System should be
protected using proper security.
 For Example: In a banking system, pay roll personnel should be only given authority to see
the part of the database that has information about the various bank employees. They do not
need access to information about customer accounts.
 Since application programs added to the system in an ad-hoc manner, it is difficult to enforce
such security constraints.
viii. Integrity problems
 The data values stored in the database must satisfy certain types of consistency constrains.
 For Example: The balance of a bank account may never fall below a prescribed amount (say
$100).These constraints are enforced in the system by adding appropriate code in the various
application programs.

Advantages of Database

Data base is a way to consolidate and control the operational data centrally. It is a better way to control
the operational data. The advantages of having a centralized control of data are:

i. Redundancy can be reduced


In non-database systems, each application or department has its own private files resulting in
considerable amount of redundancy of the stored data. Thus storage space is wasted. By having a
centralized database most of this can be avoided.
ii. Inconsistency can be avoided
6

When the same data is duplicated and changes are made at one side, which is not propagated
to the other site, it gives rise to inconsistency. Then the two entries regarding the same data will not
agree. So, if the redundancy is removed, chances of having inconsistent data are also removed.
iii. The data can be shared
The data stored from one application, can be used for another application. Thus, the data of
database stored for one application can be shared with new applications.
iv. Standards can be enforced
With central control of the database, the DBA can ensure that all applicable standards are
observed in the representation of the data.
v. Security can be enforced
DBA can define the access paths for accessing the data stored in database and he can define
authorization checks whenever access to sensitive data is attempted.
vi. Integrity can be maintained
Integrity means that the data in the database is accurate. Centralized control of the data helps
in permitting the administrator to define integrity constraints to the data in the database.

3. VIEW OF DATA
A major purpose of a database system is to provide users with an abstract view of the data. That is,
the system hides certain details of how the data are stored and maintained.
Data abstraction
The Complexity is hidden from the users through several level of abstraction. There are three levels
of data abstraction:
i. Physical level: It is the lowest level of abstraction that describes how the data are actually stored.
The physical level describes complex low-level data structures in details.
ii. Logical level: It is the next higher level of abstraction that describes what data are stored in the
database and what relationships exist among those data.
iii. View level: It is the highest level of abstraction that describes only part of the entire database.
7

Figure The three levels of data abstraction

Data Independence
The ability to modify a scheme definition in one level without affecting a scheme definition in the
next higher level is called data independence. There are two levels of data independence:

1. Physical data independence is the ability to modify the physical scheme without causing application
programs to be rewritten. Modifications at the physical level are occasionally necessary in order to improve
performance.

2. Logical data independence is the ability to modify the conceptual scheme without causing application
programs to be rewritten. Modifications at the conceptual level are necessary whenever the logical structure
of the database is altered.

Logical data independence is more difficult to achieve than physical data independence since
application programs are heavily dependent on the logical structure of the data they access.

Instances and schemas

Database change over times as information is inserted and deleted. The collection of information
stored in the database at a particular moment is called an instance of the database.

The overall design of the database is called the database schema.

Types of database schemas

i. Physical schema: It describes the database design at the physical level.


ii. Logical schema: It describes the database design at the physical level.
8

iii. Subschema: A database may also have several subschemas at the view level called as subschemas
that describe different views of the database.

4. DATA MODELS
 Underlying structure of the database is called as data models.
 It is a collection of conceptual tools for describing data, data relationships, data semantics, and
consistency constraints.
 It is a way to describe the design of the database at physical, logical and view level.
Different types of data models are:
Entity relationship model
Relational model
Hierarchical model
Network model
Object Based model
Object Relational model
Semi Structured Data model
Entity relationship model
It is based on a collection of real world things or objects called entities and the relationship among
these objects.
The Entity relationship model is widely used in database design.

Relational Model
The relational model uses a collection of tables to represent both data and the relationship among
those data.
Each table has multiple columns and each column has a unique name.
Software such as Oracle, Microsoft SQL Server and Sybase are based on the relational model.
E.g. Record Based model. It is based on fixed format records of several types.
Hierarchical Model
 Hierarchical database organize data in to a tree data structure such that each record type has only one
owner
 Hierarchical structures were widely used in the first main frame database management systems.
 Links are possible vertically but not horizontally or diagonally.
9

Advantages
High speed of access to large datasets.
Ease of updates.
Simplicity: the design of a hierarchical database is simple.
Data security: Hierarchical model was the first database model that offered the data security
that is provided and enforced by the DBMS.
Efficiency: The hierarchical database model is a very efficient one when the database
contains a large number of transactions, using data whose relationships are fixed.
Disadvantages
Implementation complexity
Database management problems
Lack of structural independence

Network Model
 The model is based on directed graph theory.
 The network model replaces the hierarchical tree with a graph thus allowing more general
connections among the nodes.
 The main difference of the network model from the hierarchical model is its ability to handle many-
to-many (n: n) relationship or in other words, it allows a record to have more than one parent.
 Example is, an employee working for two departments.
Sample network model

Advantages:
Conceptual simplicity :
10

Capability to handle more relationship types:


Data independence:
Disadvantages:
Detailed structural knowledge is required.
Lack of structural independence.
Object-Based Data model
 The object- oriented model is an extension of E-R model.
 The object- oriented model is based on a collection of objects.
 An object contains values stored in instance variables within the object.
 An object also contains bodies of code that operate in the object these bodies of code are called
methods.
 Objects that contain the same types of values and methods are grouped together into classes.
Advantages:
Applications require less code.
Applications use more natural data model.
Code is easier to maintain.
It provides higher performance management of objects and complex interrelationships
between objects.
Object-oriented features improve productivity.Data access is easy.

Object Relational Model


 Object-relational data model combines the feature of modern object-oriented programming
languages with relational database features.
 Some of the object-relational systems available in the market are IBM DB2 universal server,
Oracle Corporation‘s oracle 8, Microsoft Corporations SQL server 7 and so on.
Semi Structured Data Model
 This data model allows the individual data items of same type to have different sets of attributes.
 Other data model allows a particular type of data item to have same set of attributes.
 Extensible Markup Language (XML) is used to represent structured data.

5. DATABASE LANGUAGES

A database system provides


A Data Definition Language to specify the database schema (DDL)
A Data Manipulation Language to express database queries and updates.
11

Data definition and data manipulation languages are not two separate languages but part of a single
database language such as SQL language.
Data definition language
DDL specifies the database schema and some additional properties to data.
The storage structure and access methods are specified using specified using special type of DDL called
s data storage and data definition language.
The data values stored in the database must satisfy certain consistency constraints. For example,
suppose the balance on an account should not fall below $100.
Database system concentrates on constraints that have less overload.
1. Domain Constraints:
Domain of possible value should be associated with every attributes.
E.g. integer type, character type, date/time type
Declaring attributes to a particular domain will act as a constraint on that value.
They are tested as and when values are entered in to database.
2. Referential Constraints:
In some cases there will be value that appears in one relation for a given set of attributes also
appears for a certain set of attributes in some other relation. Such constraint is called
Referential Constraints.
If any modification violates the constraints then the action that caused the violation should be
rejected.
3. Assertions
It is a condition that database should always satisfy.
Domains and referential integrity are special form of assertion.
E.g. Every loan should have a customer whose account balance is minimum of $1000.00
Modifications to database should not cause violation to assertion.
4. Authorization
The users are differentiated as per the access permit given to them on the different data of the
database. This is known as authorization.
The most common authorizations are
i. Read authorization
Allows reading but no modification of data.
ii. Insert authorization
Allows insertion of new data but no modification of existing data.
12

iii. Update authorization


Allows modification but not deletion.
iv. Delete authorization
Allows deletion of data.
The users may be assigned with all, none or combination of these types.
The DDL gets some input and generates some output.
This output is placed in data dictionary which contains Meta data.
Meta data is data about data.
Data Dictionary is a special type of table which can only be accessed and updated by database
system.
Database system consults the Data Dictionary before reading or modifying actual data.

Data Manipulation Language

A data-manipulation language (DML) is a language that helps users to access or manipulate data.
A query is a statement requesting the retrieval of information.
The portion of DML that involves information retrieval is called as query language.
There are basically two types of DML:
Procedural DMLs
User should specify what data are needed and how to get those data.
Declarative DMLs (nonprocedural DMLs)
User should specify what data are needed without specifying how to get those data. This is
easier to learn and user than procedural DML.
Data manipulation that can be performed using DML are
The retrieval of information stored in the database
The insertion of new information into the database
The deletion of information from the database
The modification of information stored in the database

6. DATABASE SYSTEM ARCHITECTURE

A database system is partitioned into modules that deal with each of the responsibilities of the
overall system. The functional components of a database system can be broadly divided into
 Storage Manager
 Query Processor
13

The database system architecture is influenced by the underlying computer architecture.


The database system can be centralized or client server.
Database systems are partitioned into two or three parts.
In two tier architecture, the application is partitioned into a component that resides at the client
machine and invokes database functionality at the server machine through query language.
Application program interface standards like ODBC and JDBC are used for interaction between the
client and the server.
Two tier architecture

In three tier architecture, the client machines act as a front end and do not contain any direct database
calls.
The client end communicates with the application servers through interface.
The application server interacts with database system to access data.
The business logic of application says what actions to be carried out under what condition.
Three tier is more appropriate for large applications.
Three tier architecture
14

Storage Manager
A storage manager is a program module that provides the interface between the low level data stored in
the database and the application programs and queries submitted to the system.
The storage manager is responsible for the interaction with the file manager.
The storage manager translates the various DML statements into low-level file system commands. Thus,
the storage manager is responsible for storing, retrieving, and updating data in the database.
Components of the storage manager are:

1. Authorization and integrity manager: It tests for satisfaction of various integrity constraints and
checks the authority of users accessing the data.
2. Transaction manager: It ensures that the database remains in a consistent state despite system
failures, and concurrent executions proceed without conflicting.
3. File manager: It manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
4. Buffer manager: It is responsible for fetching data from disk storage into main memory and to
decide what data to cache in main memory. It enables the database to handle data sizes that are much
larger than the size of the main memory. The storage manager implements several data structures as
part of physical system implementation.
i. Data files: which store the database itself.
ii. Data dictionary: It contains metadata that is data about data. The schema of a table is an
example of metadata. A database system consults the data dictionary before reading and
modifying actual data.
iii. Indices: Which provide fast access to data items that hold particular values.
The Query Processor
15

The query processor is an important part of the database system. It helps the database system to simplify
and facilitate access to data. The query processor components include:
1. DDL interpreter, which interprets DDL statements and records the definitions in the data
dictionary.
2. DML compiler, which translates DML statements in a query language into an evaluation plan
consisting of low-level instructions that the query evaluation engine understands.
A query can be translates into any number of evaluations plans that all give the same result.
The DML compiler also performs query optimization, that is, it picks up the lowest cost
evaluation plan from among the alternatives.
3. Query evaluation engine, which executes low-level instructions generated by the DML
compiler.
16

Database System Structure:


17

7. DATABASE USERS AND ADMINISTRATOR


People who work with a database can be categorized as:
 Database Users
 Database administrators
7.1. DATABASE USERS
There are four types of database users, differentiated by the way they interact with the system.
1. Naive users
 Naive users interact with the system by invoking one of the application programs that have
been written previously.
 Naive users are typical users of form interface, where the user can fill in appropriate fields of
the form.
 Naive users may also simply read reports generated from the database.
2. Application Programmers
 Application programmers are computer professionals who write application programs.
 Rapid application development (RAD) tools enable the application programmer to construct
forms and reports without writing a program.
 Special types of programming languages that combine control structures with data
manipulation language. These languages, sometimes called fourth-generation languages.
3. Sophisticated users
 Sophisticated users interact with the system without writing programs. Instead, they form their
requests in a database query language.
 They submit each such query to a query processor that the storage manager understands.
 Online analytical processing (OLAP) tools simplify analysis and data mining tools specify
certain kinds of patterns in data.
4. Specialized users
 Specialized users are sophisticated users who write specialized database applications that do not
fit into the traditional data-processing framework.
 The applications are computer-aided design systems, knowledge base and expert systems,
systems that store data with complex data types
7.2. DATABASE ADMINISTRATORS
A person who has such central control over the system is called a database administrator (DBA).
The functions of a DBA include:
18

• Schema definition. The DBA creates the original database schema by executing a set of data definition
statements in the DDL.
• Storage structure and access-method definition.
• Schema and physical-organization modification. The DBA carries out changes to the schema and
physical organization to reflect the changing needs of the organization.
• Granting of authorization for data access. By granting different types of authorization, the database
administrator can regulate which parts of the database various users can access.
Authorization information is kept in a special system structure that the database system consults whenever
someone attempts to access the data in the system.
• Routine maintenance. Examples of the database administrator‘s routine maintenance activities are:
1. periodically backing up the database
2. Ensuring that enough free disk space
3. Monitoring jobs running on the database and ensuring that performance is not degraded by very
expensive tasks submitted by some users.
4. Ensuring that performance is not degraded by very expensive tasks submitted by some users.

8. ENTITY RELATIONSHIP MODEL (ER MODEL)


The E-R data model considers the real world consisting of a set of basic objects, called entities, and
relationships among these objects. It is a high level data model that is useful for developing a conceptual design
for a database.
The E-R data model employs three basic notions:
1. Entity sets
2. Relationship sets
3. Attributes
1. Entity Sets
An entity is ‗thing‘ or ‗object in the real world that is distinguishable from all other objects.
For example, each person is an entity.
An entity has a set of properties, and the values for some set of properties may uniquely identify an
entity.
For example, a customer with customer-id property with value C101 uniquely identifies that person.
An entity may be concrete, such as person or a book, or it may be abstract, such as a loan, or a
holiday.
An entity set is a set of entities of the same type that share the same properties, or attributes.
19

For example all persons who are customers at a given bank can be defined as entity set customer.
The properties that describe an entity are called attributes.

2. Relationships and Relationships sets


Relationship is an association among several entities.
Relationship set is a set of relationships of the same type.
The association between entity set is referred to as participation. That is, the entity sets E1, E2, . .
.,En participate in relationship set R.
Recursive relationship set: Same entity set participating in a relationship more than once in a
different role is called Recursive relationship set.
The attributes of entities in Recursive relationship set is called descriptive attributes.

Types of relationships
i) Unary relationship: A unary relationship exists when an association is maintained within a single entity.
Boss Employee
Manager

Worker

FigureAssociation between two objects of the same entity set.

ii) Binary relationship: A binary relationship exists when two entities are associated.

Publisher Publishes Book

iii) Ternary relationship: A ternary relationship exists when there are three entities associated.

Teacher Teaches Subject

Student
20

iv) Quaternary relationship: A quaternary relationship exists when there are four entities associated.

Teacher

Student Studies Course material

Subject

The number of entity set participating in a relationship is called degree of the relationship set.
Binary relationship set is of degree 2; a tertiary relationship set is of degree 3.
Entity role: The function that an entity plays in a relationship is called that entity‘s role. A role is one end
of an association.

Person Works- Company


for
Employee Employee

Here Entity role is Employee

3. Attributes
The properties that describes an entity is called attributes.
The attributes of customer entity set are customer_id, customer_name and city.
Each attributes has a set of permitted values called the domain or value set.
Each entity will have value for its attributes.
Example:
 Customer Name John
 Customer Id 321

Attributes are classified as


Simple
Composite
Single- valued
Multi-valued
Derived
21

1) Simple attribute:
This type of attributes cannot be divided into sub parts.
Example: Age, sex, GPA
2) Composite attribute:
This type of attributes Can be subdivided.
Example: Address: street, city, state, zip
3) Single-valued attribute:
This type of attributes can have only a single value
Example: Social security number
4) Multi-valued attribute:
Multi-valued attribute Can have many values.
Example: Person may have several college degrees, phone numbers
5) Derived attribute:
Derived attribute Can be calculated or derived from other related attributes or entities.
Example: Age can be derived from D.O.B.
6) Stored attributes:
The attributes stored in a data base are called stored attributes.
An attribute takes a null value when an entity does not have a value for it.
Null values indicate the value for the particular attribute does not exists or unknown.
E.g. : 1. Middle name may not be present for a person (non existence case)
2. Apartment number may be missing or unknown.
CONSTRAINTS
An E-R enterprise schema may define certain constraints to which the contents of a database system
must conform.
Three types of constraints are
1. Mapping cardinalities
2. Key constraints
3. Participation constraints
1. Mapping cardinalities
 Mapping cardinalities express the number of entities to which another entity can be associated via
a relationship set.
 Cardinality in E-R diagram that is represented by two ways:
i) Directed line ( ) ii) Undirected line ( --------- )
22

There are 4 categories of cardinality.


i) One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated
with at most one entity in A.

Example: A customer with single account at given branch is shown by one-to-one


relationship as given below

Customer Depositor Account

ii) One-to-many: An entity in A is associated with any number of entities (zero or more) in B. An
entity in B, however, can be associated with at most one entity in A.

Example: A customer having two accounts at a given branch is shown by one-to-many


relationship as given below.

Customer Depositor Account

iii) Many-to-one: An entity in A is associated with at most one entity in B. An entity in B, however, can
be associated with any number (zero or more) of entities in A.
23

Example: Many employees works for a company. This relationship is shown by many-to-one as given
below.
Employees Works-for Company

iv) Many-to-many: An entity in A is associated with any number (zero or more) of entities in B, and an
entity in B is associated with any number (zero or more) of entities in A.

Example: Employee works on number of projects and project is handled by number of employees.
Therefore, the relationship between employee and project is many-to-many as shown below.

Works- Project
Employee
on

2. Keys

 A key allows us to identify a set of attributes and thus distinguishes entities from each other.
 Keys also help uniquely identify relationships, and thus distinguish relationships from each other.

Key Type Definition


24

Any attribute or combination of attributes that uniquely identifies a row in the table.

Superkey Example: Roll_No attribute of the entity set ‗student‘ distinguishes one student entity
from another. Customer_name, Customer_id together is a Super key

Minimal Superkey. A superkey that does not contain a subset of attributes that is itself a
superkey.
Candiate Key
Example: Student_name and Student_street,are sufficient to uniquely identify one
particular student.

The candidate key selected to uniquely identify all rows. It should be rarely changed and
Primary Key cannot contain null values.

Example: Roll_No is a primary set of ‗student‘ entity set.


An attribute (or combination of attributes) in one table that must either match the primary
key of another table or be null
Foreign Key
Example: Consider in the staff relation the branch_no attribute exists to match staff to the
branch office they work in. In the staff relation, branch_no is foreign key.

Secondary Key An attribute or combination of attributes used to make data retrieval more efficient.

3. Participation Constraint

 Participation can be divided into two types.


1. Total 2. Partial
 If every entity in the entity set E participates in at least one relationship in R. Then participation is
called Total Participation
 If only some entities in the entity set E participate in relationships in R. Then the participation is
called Partial Participation.

9. ENTITY-RELATIONSHIP(E-R) DIAGRAMS
 E-R diagram can express the overall logical structure of a database graphically.
 E-R diagram consists of the following major components:
25

Component name Symbol Description

Rectangles represent entity sets

Ellipses represent attributes

Diamonds represent relationship sets

link attributes to entity sets and entity sets to


Lines
relationship sets

Double ellipses represent multivalued attributes

Double rectangles
represent weak entity sets

Represent total participation of an entity in a


Double lines
relationship set

Dashed ellipses represent derived attributes

E-R diagram with composite, multivalued, and derived attributes.

• Double lines are used in an E-R diagram to indicate that the participation of an entity set in a
relationship set is total; that is, each entity in the entity set occurs in at least one relationship in that
relationship set.
26

The number of time an entity participates in a relationship can be specified using complex
cardinalities.
An edge between an entity set and binary relationship set can have an associated minimum and
maximum cardinality assigned in the form of l..h.
l - Minimum cardinality
h - Maximum cardinality
A minimum value of 1 indicates total participation of the entity set in the relationship set.
A maximum value of 1 indicates that the entity participates in at most one relationship.
A maximum value * indicates no limit.
A label 1... on an edge is equivalent to a double line.

0..* indicates a customer can have 0 or more loan


1..1 indicates a loan must have one associated customer
Strong and Weak entity sets
An entity set may not have sufficient attributes to form a primary key. Such an entity set is termed a
weak entity set.
An entity set that has a primary key is termed a strong entity set.
Weak entity set is associated with another entity set called the identifying or owner entity set. ie,
weak entity set is said to be existence dependent on the identifying entity set.
Identifying entity set is said to own the weak entity set.
The relationship among the weak and identifying entity set is called the identifying relationship.
27

Discriminator in a weak entity set is a set of attributes that distinguishes the different entities among
the weak entity also called as partial key.
Extended E-R Features
ER model that is supported with the additional semantic concepts is called the extended entity
relationship model or EER model.
EER model deals with
1. Specialization
2. Generalization
3. Aggregation
1. Specialization:
The process of designating subgroupings within an entity set is called Specialization
Specialization is a top-down process.
Consider an entity set person. A person may be further classified as one of the following:
• Customer
• Employee
All person has a set of attributes in common with some additional attributes.
Specialization is depicted by a triangle component labeled ISA.
The label ISA stands for ―is a‖ for example, that a customer ―is a‖ person.
The ISA relationship may also be referred to as a super class-subclass relationship.
2. Generalization:
Generalization is a simple inversion of specialization.
Generalization is the process of defining a more general entity type from a set of more specialized
entity types.
Generalization is a bottom-up approach.
Generalization results in the identification of a generalized super class from the original subclasses.
28

Person is the higher-level entity set


Customer and employee are lower-level entity sets.
The person entity set is the superclass of the customer and employee subclasses.
Attribute Inheritance
A property of the higher- and lower-level entities created by specialization and generalization is
attribute inheritance.
The attributes of the higher-level entity sets are said to be inherited by the lower-level entity
sets.
For example, customer and employee inherit the attributes of person.
The outcome of attribute inheritance is
1. A higher-level entity set with attributes and relationships that apply to all of its lower-level
entity sets.
2. Lower-level entity sets with distinctive features that apply only within a particular lower-
level entity set
If an entity set is involved as a lower-level entity set in only one ISA relationship, then the entity
set has single inheritance
29

If an entity set is involved as a lower-level entity set in more than one ISA relationship, then the
entity set has multiple inheritance and the resulting structure is said to be a lattice.
Constraints on Generalizations
1. One type of constraint determining which entities can be members of a lower-level entity set. Such
membership may be one of the following:
• Condition-defined. In condition-defined the members of lower-level entity set is evaluated on
the basis of whether or not an entity satisfies an explicit condition.
• User-defined. User defined constraints are defined by user.
2. A second type of constraint relates to whether or not entities may belong to more than one lower-
level entity set within a single generalization. The lower-level entity sets may be one of the
following:
• Disjoint. A disjointness constraint requires that an entity belong to no more than one lower-
level entity set.
• Overlapping. Same entity may belong to more than one lower-level entity set within a single
generalization.
3. A final constraint, the completeness constraint specifies whether or not an entity in the higher-level
entity set must belong to at least one of the lower-level entity sets .This constraint may be one of the
following:
• Total generalization or specialization. Each higher-level entity must belong to a lower-level
entity set. It is represented by double line.
• Partial generalization or specialization. Some higher-level entities may not belong to any
lower-level entity set.
3. Aggregation
One limitation of the E-R model is that it cannot express relationships among relationships.
Consider the ternary relationship works-on, between a employee, branch, and job. Now, suppose
we want to record managers for tasks performed by an employee at a branch. There another
entity set manager is created.
The best way to model such a situation is to use aggregation.
Aggregation is an abstraction through which relationships are treated as higherlevel entities.
In our example works-on act as high level entity.
30

E-R diagram with redundant relationships. E-R diagram with aggregation.


31

Summary of ER diagram
32

10. INTRODUCTION TO RELATIONAL DATABASES


A relational database is based on the relational model and uses a collection of tables to represent
both data and the relationship among those data.
It includes DML and DDL languages.
Tables:
Each table has multiple columns and each column has unique name.
Account number Balance
A-101 500
A-215 700
A-102 400
A relational model is an example of a record based model.
Record based model are structured in fixed format record of several types.
Each table contains record of particular type. Each record type defines a fixed number of fields or
attributes.
The columns of the table correspond to the attribute of record type.
Data Manipulation Language (DML)
DML includes following commands
1. INSERT – To insert one or more number of Rows.
2. SELECT – To display one or more rows.
3. UPDATE – Used to alter the column values in a table.
4. DELETE – Used to delete one or more rows.
Data Definition Language (DDL)
DDL includes following commands
1. CREATE – Command used for creating tables.
2. DESC – Command used to view the table structure.
3. ALTER – Command used for modifying table structure.
4. RENAME – used to change the name of the table.
5. DROP – Command used for removing an existing table.
33

2 Mark Questions

1. Define data, database, database management system, database system?


2. List any eight applications of DBMS.
3. What are the disadvantages of keeping organization information in a file processing system?
4. What are the advantages of using a DBMS (Centralized control of data)?
5. With the block diagram, discuss briefly the various levels of data abstraction?
6. Define instance and schema?
7. Define the terms 1) physical schema 2) logical schema 3)Subschema
8. What is conceptual schema?
9. Define data model?
10. What is storage manager?
11. What are the components of storage manager?
12. What is the purpose of storage manager?
13. List the data structures implemented by the storage manager.
14. What is a data dictionary?
15. What is an entity relationship model?
16. What are attributes? Give examples.
17. What are the types of Attributes?
18. What is relationship? Give examples
19. Define the terms
20. Define null values.
21. Define the terms
22. What is meant by the degree of relationship set?
23. Define the terms
24. Define weak and strong entity sets?
25. What does the cardinality ratio specify?
26. Explain the two types of participation constraint.
27. Define the terms
28. Write short notes on relational model
29. Define the term Domain.
30. Specify with suitable examples, the different types of keys used in database management systems.
31. Define Data model.
34

16 Mark Questions

1. Explain DBMS System Architecture.


2. Explain E-R Model in detail with suitable example.
3. Explain about various data models.
4 Draw an E – R Diagram for Banking, University, Company, Airlines, ATM, Hospital, Library, Super
market, Insurance Company.
5. Explain in detail about the various database languages.
6. Discuss about various operations in Relational Databases.
7. Drawa neat sketch with discuss the three schema architecture of DBMS and Data Independence.

UNIT II
RELATIONAL MODEL

The relational Model – The catalog- Types– Keys - Relational Algebra – Domain Relational Calculus –
Tuple Relational Calculus - Fundamental operations – Additional Operations- SQL fundamentals -
Integrity – Triggers - Security – Advanced SQL features –Embedded SQL– Dynamic SQL- Missing
Information– Views – Introduction to Distributed Databases and Client/Server Databases

1. THE RELATIONAL MODEL


STRUCTURE OF RELATIONAL DATABASES:
A relational database consists of a collection of tables, each of which is assigned a unique name.
A row in a table represents a relationship among a set of values.
BASIC STRUCTURE
Each column header is attributes. Each attribute allows a set of permitted values called domain of
that attribute.
A table of n-attributes must be a subset of
D1 × D2 × ・ ・ ・ × Dn.1 × Dn
A relation is a cartesian product of list of domains.
35

Mathematically table is called as a relation and rows in a table are called as tuples.
The tuples in a relation can be either sorted or unsorted.
Several attributes can have same domain. E.g.: customer_name, employee_name.
Attributes can also be distinct. E.g.: balance, branch_name
Attributes can have null values incase if the value is unknown or does not exist.
Database schema begins with upper case and database relation begins with lower case.
Account-schema = (account-number, branch-name, balance)
account (Account-schema)

Account Table

2. THE CATALOG
The catalog is a place where all the schemas and the corresponding mappings are kept.
The catalog contains detailed information also called as descriptor information or meta data.
Descriptor information is essential for the system to perform its job properly.
For example the authorization subsystem uses catalog information about users and security
constraints to grant or deny access to a particular user.
The catalog should be self describing.

3. RELATIONAL ALGEBRA
The relational algebra is a procedural query language.
It consists of a set of operations that take one or two relations as input and produce a new relation as
their result.
Formal Definition
A basic expression in the relational algebra consists of either one of the following:
A relation in the database
A constant relation
Let E1 and E2 be relational-algebra expressions; the following are all relational-algebra expressions:
E1 E2
36

E1 – E2
E1 x E2
p (E1), P is a predicate on attributes in E1
s (E1), S is a list consisting of some of the attributes in E1
x (E1), x is the new name for the result of E1

Operations can be divided into


Basic operations or Fundamental Operations -Select, Project, Union, rename, set difference &
Cartesian product.
Additional operations that can be expressed in terms of basic operations-Set intersection, Natural
join Division and Assignment.
Extended operations-Generalized projection, Aggregate operations and Outer join.

3.1. BASIC OPERATIONS OR FUNDAMENTAL OPERATIONS


The select, project, and rename operations are called unary operations, because they operate on one
relation.
The other three operations (union, set difference, cartesian product) operate on pairs of relations and
are, therefore, called binary operations.
The basic or fundamental operations are as follows
1. Select
2. Project
3. Union
4. Rename
5. Set difference
6. Cartesian product.
1. Select Operation (σ)
The select operation selects tuples that satisfy a given predicate.
Syntax
σ<select condition>(R)
Symbol σ is used to denote the select operator.
Predicate appears as a subscript to σ and argument relation in paranthesis.
Example
Consider the loan relation
37

Query: σ branch-name =”Perryridge” (loan)

Output relation is

Select operation allows all comparisons using =, _=, <, ., >,.


It allows combination of server predicates using connectives like and (∧), or (∨), and not (¬).

E.g.: 1. σ amount>1200 (loan)


2. σ branch-name =”Perryridge” ∧amount>1200 (loan)
Other Examples
Consider following Book relation.
Book_Id Title Author Publisher Year Price
B001 DBMS Korth McGraw Hill 2000 250
B002 Compiler Ulman 2004 350
B003 OOMD Rambaugh 2003 450
B004 PPL Sabista 2000 500

Following are the some examples of the select operation.

Example 1: Display books published in the 2000.

Query 1: σyear=2000(Book)
The output of query 1 is shown below.

Book_Id Title Author Publisher Year Price


B001 DBMS Korth McGraw_Hill 2000 250
B004 PPL Sabista 2000 500

Example 2: Display all books having price greater than 300.

Query 2: σprice>300(Book)
The output of query 2 is shown below.
38

Book_Id Title Author Publisher Year Price


B002 Compiler Ulman 2004 350
B003 OOMD Rambaugh 2003 450
B004 PPL Sabista 2000 500

Example 3: Select the tuples for all books whose publishing year is 2000 or price is greater than 300.
Query 3: σ(year=2000) OR (price>300)(Book)
The output of query 3 is shown below.

Book_Id Title Author Publisher Year Price


B001 DBMS Korth McGraw_Hill 2000 250
B002 Compiler Ulman 2004 350
B003 OOMD Rambaugh 2003 450
B004 PPL Sabista 2000 500

Example 4: Select the tuples for all books whose publishing year is 2000 and price is greater than 300.

Query 3: σ(year=2000) AND (price>300)(Book)


The output of query 4 is shown below.

Book_Id Title Author Publisher Year Price


B004 PPL Sabista 2000 500

2. Project operation (Π)

The project operation selects certain columns from a table while discarding others. It removes any
duplicate tuples from the result relation.

Syntax
Π<attributelist> ( R )

The symbol Π (pi) is used to denote the project operation


Attribute list to be projected is specified as subscript of Π and R denotes the relation.

E.g.: Π loan-number, amount(loan)


39

Example: The following are the examples of project operation on Book relation.
Example 1: Display all titles with author name.
Query 1: ΠTitle, Author (Book)
The output of query 1 is shown below.
Title Author
DBMS Korth
Compiler Ulman
OOMD Rambaugh
PPL Sabista
Example 2: Display all book titles with authors and price.
Query 2: ΠTitle, Author, Price ( Book )
The output of query 2 is shown below.
Title Author Price
DBMS Korth 250
Compiler Ulman 350
OOMD Rambaugh 450
PPL Sabista 500
Composition of select and project operations
The relational operations select and project can be combined to form a complicated query.
Π customer-name (σ customer-city =”Harrison” (customer))

Input Table: customer


40

Output:
Customer-name

Hayes

Example: Display the titles of books having price greater than 300.
Query: ΠTitle,( σprice>300(Book))
The output of query 1 is shown below.
Title
Compiler
OOMD
PPL

3. Rename operation (ρ)


In relational algebra, you can rename either the relation or the attributes or both. The general rename
operation can take any of the following forms:
Syntax
ρs(new attribute names)( R )  renames both the relation and its attributes

ρs( R ) renames only the relation

ρ(new attribute names)( R )renames only the attribute name


The symbol ‘ρ‘ (rho) is used to denote the RENAME operator.
‘S‘ is the new relation
41

‘R‘ is original relation.

Example 1: Renames both the relation and its attributes, the second renames the relation only and the third
renames as follows.

*ρ Temp(Bname, Aname, Pyear, Bprice) ( Book )

Example 2: Only the relation name is renamed.


*ρTemp (Book)

Example 3: Only the attribute names are renamed


*ρ (Bname, Aname, Pyear, Bprice)( Book )

4. Union operation (U)


For union operation (r U s) to be valid two condition should be satisfied.
1. Relation r and s should have the same number of attributes.
2. The domain value of ith attribute of r and ith attribute of s must be the same for all i.
Syntax: Relation1 U Relation 2
Example: Consider the two relations:

To find the names of all customers with a loan in the bank:


Πcustomer-name (borrower)
To find the names of all customers with an account in the bank:
Πcustomer-name (depositor)
Union of these two sets; that is, To find customer names that appear in either or both of the two relations.
Πcustomer-name (borrower) U Πcustomer-name (depositor)
The result relation for this query:
42

5. Set Difference Operation (-)


To find tuples that is in one relation but is not in another. The two condition of union operation aso
apply for set difference.
Syntax: Relation1 - Relation 2
Example: find all customers of the bank who have an account but not a loan.
Πcustomer-name (depositor) − Πcustomer-name (borrower)
The result relation for this query

6. Cartesian-Product Operation(X)
Cartesian product is also known as CROSS PRODUCT or CROSS JOINS.
Cartesian product allows us to combine information from any 2 relation.
Syntax: Relation1 x Relation 2
Example: Consider following two relations publisher_info and Book_info.

Publisher_Info
Publisher_code Name
P0001 McGraw_Hill
P0002 PHI
P0003 Pearson

Book_Info
Book_ID Title
B0001 DBMS
43

B0002 Compiler
The Cartesian product of Publisher_Info and Book_Info is given in fig .

Publisher_Info X Book_Info
Publisher_code Name Book_ID Title
P0001 McGraw_Hill B0001 DBMS
P0002 PHI B0001 DBMS
P0003 Pearson B0001 DBMS
P0001 McGraw_Hill B0002 Compiler
P0002 PHI B0002 Compiler
P0003 Pearson B0002 Compiler

3.2. ADDITIONAL OPERATIONS


1. Set intersection
2. Natural join
3. Division
4. Assignment
1. Set intersection Operation ( )
The result of intersection operation is a relation that includes all tuples that are in both Relation1 and
Relation2.
The intersection operation is denoted by depositor borrower.
Syntax: Relation1 Relation 2
Example:

Π customer-name (borrower Π customer-name (depositor)


The result relation for this query:

2. Naturaljoin ( )
44

The natural join operation performs a selection on those attributes that appear in both relation
schemes and finally removes duplicate attributes.
Syntax: Relation1 Relation 2
Example: consider the 2 relations

Employee Salary

Emp_code Emp_name Emp_code Salary

E0001 Hari E0001 2000

E0002 Om E0002 5000

E0003 Smith E0003 7000

E0004 Jay E0004 10000

Query:Π emp_name, salary (employee salary)

The output of query is:

Emp_name Salary
Hari 2000
Om 5000
Smith 7000
Jay 10000

3. Division operation (÷)

Division Operation is suited to queries that include the phrase ‘for all‘.

Syntax: Relation1 ÷ Relation 2


Example: Consider three relations:
45

Account Relation Branch Relation

Depositor Relation

Suppose that we wish to find all customers who have an account at all the branches located in Brooklyn.
Step 1: We can obtain all branches in Brooklyn by the expression
r1 = Π branch-name (σ branch-city =’”Brooklyn” (branch))
The result relation for this expression is shown in figure.

Step 2: We can find all (customer-name, branch-name) pairs for which the customer has an account at a
branch by writing
r2 = Π customer-name, branch-name (depositor account)
Figure shows the result relation for this expression.
46

Now, we need to find customers who appear in r2 with every branch name in r1. The operation that
provides exactly those customers is the divide operation.
Thus, the query is
Π customer-name, branch-name (depositor account)
÷ Π branch-name (σ branch-city =”Brooklyn” (branch))
The result of this expression is a relation that has the schema (customer-name) and that contains the tuple
(Johnson).
4. The Assignment Operation ( )
The assignment operation works like assignment in a programming language.
Example:

Result of the expression to the right of the ← is assigned to the relation variable on the left of the←.
With the assignment operation, a query can be written as a sequential program consisting of a series
of assignments followed by an expression whose value is displayed as the result of the query.

3.3. EXTENDED RELATIONAL-ALGEBRA OPERATIONS

1. Generalized projection
2. Aggregate operations
3. Outer join.
1. Generalized projection
The generalized-projection operation extends the projection operation by allowing arithmetic
functions to be used in the projection list.
The generalized projection operation has the form
47

ΠF1, F2,..., Fn(E)


Where E is any relational-algebra expression, and each of F1, F2, . . . , Fn is an arithmetic
expression involving constants and attributes in the schema of E.
Example: Suppose we have a relation credit-info, as in Figure

Query: Π customer-name, (limit − credit-balance) as credit-available (credit-info)


Result of this query:

2. Aggregate Functions
Aggregate functions take a collection of values and return a single value as a result. Few Aggregate
Function are,
1. Avg
2. Min
3. Max
4. Sum
5. Count
1. Avg: The aggregate function avg returns the average of the values.
Example: Use the pt-works relation in Figure

Suppose that we want to find out the average of salaries.

G avg (salary)(pt-works)
48

The symbol G is the letter G in calligraphic font; read it as ―calligraphic G.‖

Result:
Salary
2062.5

2. Min and Max


Min: Return the minimum values in a collection
Max: Return the Maximum values in a collection

Example: branch-name G sum (salary) as sum-salary, max (salary) as max-salary (pt-works)


Result:

The attribute branch-name in the left-hand subscript of G indicates that the input relation pt-
works must be divided into groups based on the value of branch-name.
The calculated sum is placed under the attribute name sum-salary and the maximum salary is
placed under the attribute max-salary.
3. Sum:
The aggregate function sum returns the total of the values.
Example: Suppose that we want to find out the total sum of salaries.
G sum(salary)(pt-works)
The symbol G is the letter G in calligraphic font; read it as “calligraphic G”
Result:
Salary
16500

4. Count:
Returns the number of the elements in the collection,

Example: G count-distinct(branch-name) (pt-works)


Result: The result of this query is a single row containing the value 3.
3. Outer join
Joins are classified into three types namely:
49

1. Inner Join
2. Outer Join
3. Natural Joint

Inner Join ( )
Inner Join returns the matching rows from the tables that are being jointed.
Example: Consider the two relations

Example:
Result:

Outer Join
The outer-join operation is an extension of the join operation to deal with missing information.
Outer-join operations avoid loss of information.
Outer Joins are classified into three types namely:
1. Left Outer Join
2. Right Outer Join
3. Full Outer Join

1. Left Outer Join ( )

The left outer join ( ) takes all tuples in the left relation that did not match with any tuple in the
right relation, pads the tuples with null values for all other attributes from the right relation, and adds them
to the result of the natural join.
50

Example:
Result:

2. Right Outer Join ( )


The right outer join ( ) is symmetric with the left outer join: It pads tuples from the right relation
that did not match any from the left relation with nulls and adds them to the result of the natural join.

Example:
Result:

3. Full Outer Join ( )

The full outer join( ) does both of those operations, padding tuples from the left relation that did
not match any from the right relation, as well as tuples from the right relation that did not match any from
the left relation, and adding them to the result of the join.

Example:
Result:
51

4. RELATIONAL CALCULUS
Relational Calculus is a formal query language where we can write one declarative expression to
specify a retrieval request and hence there is no description of how to retrieve it.
A calculus expression specifies what is to be retrieved rather than how to retrieve it.
Relational Calculus is considered to be non procedural language.
Relational Calculus can be divided into
1. Tuple Relational Calculus
2. Domain Relational Calculus

4.1. TUPLE RELATIONAL CALCULUS


Tuple Relational Calculus is a nonprocedural query language.
A query in the Tuple Relational Calculus is expressed as follows
{t | P (t ) }
It is the set of all tuples t such that predicate P is true for t
t r denotes that tuple t is in relation r
P is a formula similar to that of the predicate calculus
A tuple variable is said to be a free variable unless it is quantified by a or .
t ∈ loan ∧ ∃ s ∈ customer(t[branch-name] = s[branch-name])

t is a free variable. Tuple variable s is said to be a bound variable.


A tuple-relational-calculus formula is built up out of atoms. An atom has one of the
following forms:
1. s r, where s is a tuple variable and r is a relation

2. s[x] u[y], where s and u are tuple variables, x is an attribute on which s is defined, y is
an attribute on which u is defined, and is a comparison operator
3. s[x] c, where s is a tuple variable, x is an attribute on which s is defined, is a
comparison operator, and c is a constant in the domain of attribute x
Rules to built formulas from atoms
An atom is a formula.
If P1 is a formula, then so are ¬P1 and (P1).
If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2, and P1 ⇒ P2.
52

If P1(s) is a formula containing a free tuple variable s, and r is a relation, then


∃ s ∈ r (P1(s)) and ∀ s ∈ r (P1(s)) are also formulae.
Equivalence relation in Tuple relational calculus
P1 ∧ P2 is equivalent to ¬ (¬ (P1) ∨ ¬ (P2)).
∀ t ∈ r (P1(t)) is equivalent to ¬ ∃ t ∈ r (¬P1(t)).
P1 ⇒ P2 is equivalent to ¬ (P1) ∨ P2.
Banking Example
1. branch (branch_name, branch_city, assets )
2. customer (customer_name, customer_street, customer_city )
3. account (account_number, branch_name, balance )
4. loan (loan_number, branch_name, amount )
5. depositor (customer_name, account_number )
6. borrower (customer_name, loan_number )
53

Safety of Expressions

A tuple relational calculus may generate an infinite relation.


For example, { t | t loan} results in infinitely many tuples that are not in loan relation.
To guard against the problem, a domain is defined for all tuple relational calculus formula P.
It is denoted by dom(P). it denotes that P can take value only in that domain.
An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in
one of the relations, tuples, or constants that appear in P

4.2. DOMAIN RELATIONAL CALCULUS


Domain relational calculus is also a nonprocedural query language equivalent in power to the tuple
relational calculus.
It servers as the theoretical basis of widely used Query By Example (QBE) language.
Domain relational calculus expression is of the form:
{ x1, x2, …, xn | P (x1, x2, …, xn)}
x1, x2, …, xn represent domain variables.
P represents a formula composed of atoms.
An atom in Domain relational calculus has one of the following form
1. < x1, x2, . . . , xn >r, where r is a relation on n attributes and x1, x2, . . . , xn are domain
variables or domain constants.
2. x y, where x and y are domain variables and is a comparison operator

3. x c, where x is a domain variable, is a comparison operator, and c is a constant.

Rules to built formulas from atoms


An atom is a formula.
If P1 is a formula, then so are ¬P1 and (P1).
If P1 and P2 are formulae, then so are P1 ∨ P2, P1 ∧ P2, and P1 ⇒ P2.
If P1(x) is a formula in x, where x is a free domain variable, then
∃ x (P1(x)) and ∀ x (P1(x)) are also formulae.
Example Queries

1. Find the loan_number, branch_name, and amount for loans of over $1200

{ l, b, a | l, b, a loan a > 1200}


54

2. Find the names of all customers who have a loan of over $1200

{ c | l, b, a ( c, l borrower l, b, a loan a > 1200)}

3. Find the names of all customers who have a loan from the Perryridge branch and the loan
amount:
{ c, a | l ( c, l borrower b ( l, b, a loan b = ―Perryridge‖))}
{ c, a | l ( c, l borrower l, “ Perryridge”, a loan)}
Safety of Expressions
The expression: { x1, x2, …, xn | P (x1, x2, …, xn )} is safe if all of the following hold:
1. All values that appear in tuples of the expression are values from dom (P) (that is, the values appear
either in P or in a tuple of a relation mentioned in P).
2. For every ―there exists‖ subformula of the form x (P1(x)), the subformula is true if and only if there
is a value of x in dom (P1) such that P1(x) is true.
3. For every ―for all‖ subformula of the form x (P1 (x)), the subformula is true if and only if P1(x) is
true for all values x from dom (P1).

5. SQL FUNDAMENTALS
5.1. Introduction
SQL is a standard common set used to communicate with the relational database
management systems.
All tasks related to relational data management-creating tables, querying, modifying, and
granting access to users, and so on.
5.2. Advantages of SQL
SQL is a high level language that provides a greater degree of abstraction than procedural languages.
SQL enables the end-users and systems personnel to deal with a number of database management
systems where it is available.
Application written in SQL can be easily ported across systems.
SQL specifies what is required and not how it should be done.
SQL was simple and easy to learn can handle complex situations.
All SQL operations are performed at a set level.
5.3. Parts of SQL
The SQL language has several parts:
55

Data-definition language (DDL). The SQL DDL provides commands for defining relation
schemas, deleting relations, and modifying relation schemas.
Interactive data-manipulation language (DML). The SQL DML includes a query language based
on both the relational algebra and the tuple relational calculus. It also includes commands to insert
tuples into, delete tuples from, and modify tuples in the database.
View definition. The SQL DDL includes commands for defining views.
Transaction control. SQL includes commands for specifying the beginning and ending of
transactions.
Embedded SQL and dynamic SQL. Embedded and dynamic SQL define how SQL statements can
be embedded within general-purpose programming languages, such as C, C++, Java, PL/I, COBOL,
Pascal, and FORTRAN.
Integrity. The SQL DDL includes commands for specifying integrity constraints that the data stored
in the database must satisfy. Updates that violate integrity constraints are disallowed.
Authorization. The SQL DDL includes commands for specifying access rights to relations and
views.
5.4. Domain Types in SQL
1. Char (n): Fixed length character string, with user-specified length n.
2. varchar(n): Variable length character strings, with user-specified maximum length n.
3. int: Integer (a finite subset of the integers that is machine-dependent).
4. Smallint: Small integer (a machine-dependent subset of the integer domain type).
5. numeric (p,d): fixed point number, with user-specified precision of p digits, with n digits to the
right of decimal point.
6. Real, double precision: Floating point and double-precision floating point numbers, with
machine-dependent precision.
7. float (n): Floating point number, with user-specified precision of at least n digits.
8. Date: Dates, containing a (4 digit) year, month and date
Example: date ‗2005-7-27‘
9. Time: Time of day, in hours, minutes and seconds.
Example: time ‗09:00:30‘ time ‗09:00:30.75‘
10. Timestamp: date plus time of day
Example: timestamp ‗2005-7-27 09:00:30.75‘
11. Interval: period of time
56

Example: interval ‗1‘ day

5.5. DATA DEFINITION LANGUAGE (DDL)


It is used to create a table, alter the structure of a table and also drop the table.
DDL Commands
1. CREATE – Command used for creating tables.

Syntax: create table <table name> (columnname1 data type (size), Columnname 2 data

type(size),.., columnname n data type(size));

Example: create table customer (cust_name varchar2 (15), social_security_no

number(11),cust_street varchar2(7),cust_city varchar2(10));

2. DESC – Command used to view the table structure.


Syntax: desc <table name>;
Example: desc customer;

3. ALTER – Command used for modifying table structure.


i) Syntax: alter table <table name> modify (columnname data type (new size));
Example: alter table customer modify (cust_street varchar2 (10));

ii) Syntax: alter table <table name> modify (columnname new data type (size));
Example: alter table customer modify (social_security_no varchar2 (11));

iii) Syntax: alter table <table name> add (new columnname data type (size));
Example: alter table customer add (acc_no varchar2(5));

iv) Syntax: alter table<table name> drop (column name);

Example: alter table customer drop (acc_no);

4. RENAME – used to change the name of the table.


Syntax: rename <Old table name> to <New table name>;
Example: rename cust to cust1;
57

5. DROP – Command used for removing an existing table.


Syntax: drop table <table name>;
Example: drop table cust1;

5.6. DATA MANIPULATION LANGUAGE (DML)


Data Manipulation language commands let user to insert, modify and delete the data from
database.
DDL Commands
1. INSERT – To insert one or more number of Rows
Syntax 1: insert into <table name> values (List of Data Values)
Syntax 2: insert into <table name> (column names) values (list of data values)
Insert command using User interaction
Syntax 3: insert into <Table name> values (&columnname1, &columnname2…)

2. SELECT – To display one or more rows


Syntax 1: select * from <table name>;
Syntax 2: select columnname1, columnname 2 from <table name>;
Syntax 3: select * from <table name> where <condition>;

Example:
a) Find the names of all branches in the loan table
select branch_name from loan;
b) List all account numbers made by brighton branch
select acc_no from account where branch_name = 'brighton';
c) List the customers who are living in the city harrison
select cust_name from customer where cust_city = 'harrison';

3. UPDATE – Used to alter the column values in a table

Syntax: update <table name> set column_name=new_value where <condition>;


Example: update the account table to replace the balance value 500 to 450.
Query: update account set balance=450 where acc_no=‘A-101‘;
58

4. DELETE – Used to delete one or more rows.

Syntax: delete from <table name> where <condition>;


Example: delete from borrower where cust_name=‘jackson‘;

5.7. BASIC STRUCTURE OF SQL EXPRESSION


SQL expression consists of three clauses:
Select: The select clause corresponds to projection operation of the relational algebra. It is used to
list the attributes desired in the result of a query.
From: The from clause corresponds to the Cartesian product operation of the relational algebra .It
lists the relations to be scanned in the evaluation of the expression.
Where: The where clause corresponds to the selection predicate of the relations that appear in the
form clause.
General form of SQL query
Select A1, A2…………., An

From R1, R2……………, Rm

Where P

Where, A1-represent an attribute


R1-represent relation
P-is a predicate
Example:
“Find the names of all branches in the loan relation”:
select branch-name from loan
select distinct branch-name from loan /*Distinct keyword eleminates duplicates*/
select all branch-name from loan /*Duplicates are not removed*/
“Find all loan numbers for loans made at the Perryridge branch with loan amounts greater that $1200.”
select loan-number from loan where branch-name = ‘Perryridge‘ and amount > 1200

Rename Operation
The SQL allows renaming relations and attributes using the as clause:
Old-name as new-name
59

Example: Find the name, loan number and loan amount of all customers; rename the column name
loan_number as loan_id.
select customer_name ,borrower.loan_number as loan_id,a mount from borrower,loan where
borrower.loan_number = loan.loan_number

Tuple Variables
Tuple variables are defined in the from clause via the use of the as clause.
Example: Find the customer names and their loan numbers for all customers having a loan at some
branch.
select customer_name, T.loan_number, S.amount from borrower as T, loan as S
where T.loan_number = S.loan_number

String Operation
SQL includes a string-matching operator for comparisons on character strings.The operator ―like‖
uses patterns that are described using two special characters:
o percent (%). The % character matches any substring.
o underscore ( _ ). The _ character matches any character.
Example:
‘Perry%‘ matches any string beginning with ―Perry‖.
‘%idge%‘ matches any string containing ―idge‖ as a substring, for example, ‘Perryridge‘, ‘Rock
Ridge‘, ‘Mianus Bridge‘, and ‘Ridgeway‘.
‘- - - ‘ matches any string of exactly three characters.
‘ - - -%‘ matches any string of at least three characters.
Example: Select * from customer where customer_name like ‘j%‘;

Select * from customer where customer_Street like ‘_a%‘;

SQL supports a variety of string operations such as


 concatenation (using ―||‖)
 converting from upper to lower case (and vice versa)
 finding string length, extracting substrings, etc.

Ordering the Display of Tuples


List in alphabetic order the names of all customers having a loan in Perryridge branch
60

Example :select distinct customer_name from borrower, loan where borrower loan_number =
loan.loan_number and branch_name = 'Perryridge' order by customer_name
 We may specify desc for descending order or asc for ascending order, for each attribute; ascending
order is the default.
Example: order by customer_name desc

Set Operations
Set operators combine the results of two queries into a single one.
1. Union – returns all distinct rows selected by either query.
Example: Find all customers having a loan, an account or both at the bank
Query: select cust_name from depositor union select cust_name from borrower;

2. Union all - returns all rows selected by either query


Example: Find all customers having a loan and an account at the bank
Query: select cust_name from depositor union all select cust_name from borrower;

3. Intersect – returns only rows that are common to both the Queries
Example: Find all customers who have both a loan, and an account at the bank.
Query: select cust_name from depositor intersect select cust_name from borrower;

4. Minus – returns all distinct rows selected only by the first Query and not by the second.
Example: To find all customers who have an account but no loan at the bank.
Query: select cust_name from depositor minus select cust_name from borrower;

Aggregate Function
These functions operate on the multiset of values of a column of a relation, and return a value
(a) AVG - To find the average of values.
Example: Find the average of account balance from the account table
Query: select avg (balance) from account;
(b) SUM – To find the sum of values.
Example: Find the sum of account balance from the account table
61

Query: select sum (balance) from account;


(c) MAX – Returns the maximum value.
Example: Find the Maximum value of account balance from the account table.
Query: select max (balance) from account;
(d) MIN - Returns the minimum value.
Example: Find the Minimum value of account balance from the account table.
Query: select min (balance) from account;
(e) COUNT – Returns the number of rows in the column or table.
Example 1: Find the number rows in the customer table.
Query: select count (*) from customer;
Example 2: Find the number of rows in the balance column of account table.
Query: select count (balance) from account;
(f) GROUP BY
Aggregate Functions – Group By
Example 1: Find the average account balance at each branch.

Query:select branch_name, avg (balance) from account group by branch_name;

(g) HAVING CLAUSE

Aggregate Functions – Having Clause

Example 2: Find the average account balance at brighton branch.

Query: select branch_name, avg (balance) from account group by branch_name having
branch_name=‘brighton‘;

Null Values
 It is possible for tuples to have a null value, denoted by null, for some of their attributes
 Null signifies an unknown value or that a value does not exist.
 The predicate is null can be used to check for null values.
Example: Find all loan number which appears in the loan relation with null values for
amount.
select loan_number from loan where amount is null
62

and: The result of true and unknown is unknown, false and unknown is false, while unknown and unknown
is unknown.
or: The result of true or unknown is true, false or unknown is unknown, while unknown or unknown is
unknown.
not: The result of not unknown is unknown.

Nested Subqueries
 SQL provides a mechanism for the nesting of subqueries.
 A subquery is a select-from-where expression that is nested within another query.
 A common use of subqueries is to perform tests for set membership, set comparisons, and set
cardinality.
Example 1: Find all the information of customer who has an account number is A-101.
Query: select * from customer where cust_name=(select cust_name
from depositor where acc_no='A-101');

Example 2:Find all customers who have a loan from the bank, find their names And loan numbers.
Query: select cust_name, loan_no from borrower where loan_no in
(select loan_no from loan);
1. Set memberships
INExample:
Select * from customer where customer_name in(‘Hays‘, ‘Jones‘);
NOT INExample: Select * from customer where customer_name not in(‘Hays‘,’Jones‘);

2. Set comparisons
SQL uses various comparison operators such as <, <=,=,>,>=,<>,any, all,
some etc to compare sets.
Examples 1: Select * from borrower where loan_number<any(select loan_number
from loan 2 where branch_name=‘Perryridge‘);
Example 2: Select loan_no from loan from amount<=30000;

3. Test for Empty Relation


Exists is a test for non empty set.
63

Example: Select title from book where exists(select * from order where book.book-
id=order.book_id);
Similar to exists we can use not exists also.
Example: Select title from book where not exists(select * from order where book.book-
id=order.book_id);

4. Test for absence of duplicate tuples


The Unique construct tests whether a subquery has any duplicate tuples in its result.
Example: select T.customer-name from depositor as T
where unique (select R.customer-name from account, depositor as R
where T.customer-name = R.customer-name and
R.account-number = account.account-number and
account.branch-name = ‘Perryridge‘)
Not Unique construct is used for test the existence of duplicate tuples in the same manner.
Complex Queries
Complex queries are often hard or impossible to write as a single SQL block.There are two ways of
composing multiple SQL blocks to express a complex query.
1. Derived relations
2. with clause.

1. Derived Relations
SQL allows a subquery expression to be used in the from clause. If we use such an
expression, then we must give the result relation a name, and we can rename the attributes. For renaming as
clause is used
For example: “Find the average account balance of those branches where the average account balance is
greater than $1200.”
Select branch-name, avg-balance from (select branch-name, avg (balance) from account
group by branch-name)as branch-avg (branch-name, avg-balance) where avg-balance > 1200
Here subquery result is named as branch-avg with attributes of branch-name and avg-balance.
2. with clause.
The with clause provides a way of defining a temporary view, whose definition is available
only to the query in which the with clause occurs.
64

Consider the following query, which selects accounts with the maximum balance; if there are
many accounts with the same maximum balance, all of them are selected.
with max-balance (value) as
select max(balance)
from account
select account-number
from account, max-balance
where account.balance = max-balance.value

6. INTEGRITY
Integrity constraints ensures that changes made to the database by authorized users donot result in a
loss of data consistency.
It is a mechanism used to prevent invalid data entry into the table.
Prevents accidental damages of database.
Types
1. Domain integrity Constraints
2. Entity integrity Constraints
3. Referential integrity Constraints
1. Domain integrity Constraints
A Domain is a set of values that may be assigned to an attribute. All values that appear in a column
of a relation (table) must be taken from the same domain.
Types
 Not Null Constraints
 Check Constraints

a) NOT NULL – It will not allow null values.


Syntax : create table <table name>(columnname data type (size) constraint constraint_name not
null);
Example : account_no char(10) notnull;
b) CHECK - Use the CHECK constraint when you need to enforce integrity rules that can be evaluated
based on a condition (logical expression).
Syntax : create table <table name>(columnname data type (size) constraint constraint_name check
(check_condition));
65

Example: create table student (name char(15) not null,student-id char(10), degree_level char(15),
primary key(student_id), check(degree_level in(‗bachelors‘,‘master‘,‘doctorate‘)));
The create domain clause can be used to de.ne new domains. For example, the statements:
create domain Dollars numeric(12,2)
create domain Pounds numeric(12,2)
2. Entity integrity Constraints
The entity integrity constraints state that no primary key value can be null. This is because the
primary key value is used to identify individual tuples in a relation.
Types
Unique Constraint
Primary key Constraint
a) UNIQUE – Avoid duplicate values
unique(Aj1,Aj2,……,Ajm)

The unique specification saya that attributes Aj1,Aj2,……,Ajm form a candidate key. These attributes
should have distinct values.
Syntax :create table <table name>(columnname data type (size) constraint
constraint_name unique);
b) Composite UNIQUE – Multicolumn unique key is called composite unique key
Syntax : create table <table name>(columnname1 data type (size), columnname2 data type
(size), constraint constraint_name unique (columnname1, columnname2));
c) PRIMARY KEY – It will not allow null values and avoid duplicate values.
Syntax : create table <table name>(columnname data type (size) constraint constraint_name
primary key);
d) Composite PRIMARY KEY – Multicolumn primary key is called composite primary key
Syntax : create table <table name>(columnname1 data type (size), columnname2 data type
(size), constraint constraint_name primary key (columnname1, columnname2));

3. REFERENTIAL INTEGRITY
Ensures that a value appears in one relation for a given set of attributes also appears for a certain set
of attributes in another relation. This condition is called referential integrity
Reference key (foreign key) – Its represent relationships between tables. Foreign key is a column whose
values are derived from the primary key of the same or some other table.
66

Syntax: create table <table name>(columnname data type (size) constraint constraint_name
references parent_table_name);
Formal Definition
 Let r1(R1) and r2(R2) be relations with primary keys K1 and K2 respectively.
 The subset of R2 is a foreign key referencing K1 in relation r1, if for every t2 in r2 there
must be a tuple t1 in r1 such that t1[K1] = t2[ ].
 Referential integrity constraint also called subset dependency since its can be written as
(r2) K1 (r1)
Assertions
 An assertion is a predicate expressing a condition that we wish the database always to satisfy.
 An assertion in SQL takes the form
Create assertion <assertion-name> check <predicate>
 When an assertion is made, the system tests it for validity, and tests it again on every update that
may violate the assertion
 Asserting for all X,P(X) is achieved in a round-robin fashion using not exists X such that not
P(X).
Assertion Example
 The sum of all loan amounts for each branch must be less than the sum of all account balances at
the branch.
create assertion sum-constraint check (not exists (select * from branch where (select
sum(amount) from loan where loan.branch-name = branch.branch-name) >= (select
sum(balance) from account where account.branch-name = branch.branch-name)))

7.TRIGGERS
 A trigger is a statement that is executed automatically by the system as a side effect of a
modification to the database.
 To design a trigger mechanism, we must:
 Specify the conditions under which the trigger is to be executed.
 Specify the actions to be taken when the trigger executes.
Trigger Example
 Suppose that instead of allowing negative account balances, the bank deals with overdrafts by
 setting the account balance to zero
67

 creating a loan in the amount of the overdraft


 giving this loan a loan number identical to the account number of the overdrawn account
 The condition for executing the trigger is an update to the account relation that results in a negative
balance value.
Syntax:
Create or replace trigger<trigger-name>{before/after}{insert/delete/update}on <table-
name>[for each statement/for each row][when <condition>];
Parts of trigger:
a. Trigger statement (The DML statement like insert/delete/update. It fires the trigger body)
b. Trigger body
c. Trigger restriction (optional)
Types of triggers:
a. Before
b. After
c. For each row
d. For each statement (default)
a. Before/after:
It specifies when the trigger boby should be fired.
In case of before, the trigger will be executed before executing the triggering statement.
In case of after, it will be executed after the triggering statement.
b. For each row/statement:
It decides if the trigger body to be fired once for each row affected by the triggering statement or
only once for the statementb executed.
By default, the trigger fires for each statement.
Disabling Triggers:
Syntax :Alter trigger <trigger_name> disable;
Example : Alter trigger sales_trigger disable;
Specific triggers on a table can be disabled as follows
Alter table purchase_details disable purchase;
All triggers on a table can be disabled on a table as follows.
Syntax :Alter table <table_name> disable all triggers;
Example :Alter table sales_details disables all triggers;
Enabling trigger:
68

Syntax :Alter table <table_name> enable trigger_name;


Example :Alter table purchase_details enable purchase;
To Enable all triggers
Syntax :Alter table <table_name> enable all triggers;
Dropping triggers:
Syntax :Drop trigger <trigger_name>;
Example :Drop trigger purchase;
Trigger Example:
create trigger overdraft-trigger after update on account
referencing new row as nrow for each row when nrow.balance < 0
Triggering Events and Actions in SQL
 Triggering event can be insert, delete or update
 Triggers on update can be restricted to specific attributes
 E.g. create trigger overdraft-trigger after update of balance on account
 Values of attributes before and after an update can be referenced
referencing old row as : for deletes and updates
referencing new row as : for inserts and updates
 Triggers can be activated before an event, which can serve as extra constraints. E.g. convert blanks
to null.
Statement Level Triggers
 Instead of executing a separate action for each affected row, a single action can be executed for all
rows affected by a transaction
o Use for each statement instead of for each row
o Use referencing old table or referencing new table to refer to temporary tables (called
transition tables) containing the affected rows
o Can be more efficient when dealing with SQL statements that update a large number of rows
When Not To Use Triggers
 Triggers were used earlier for tasks such as
 maintaining summary data (e.g. total salary of each department)
 Replicating databases by recording changes to special relations (called change or delta
relations) and having a separate process that applies the changes over to a replica
 There are better ways of doing these now:
 Databases today provide built in materialized view facilities to maintain summary data
69

 Databases provide built-in support for replication


 Encapsulation facilities can be used instead of triggers in many cases
 Define methods to update fields
 Carry out actions as part of the update methods instead of
through a trigger
8. SECURITY
Security of data is important concept in DBMS because it is essential to safeguard the data against
any unwanted users.
It is a protection from malicious attempts to steal or modify data.
There are five different levels of security
1. Database system level
Authentication and authorization mechanism to allow specific users access only to required data.
2. Operating
Protection from invalid logins
File-level access protection
Protection from improper use of ―superuser‖ authority.
Protection from improper use of privileged machine instructions.
3. Network level
Each site must ensure that it communicates with trusted sites.
Links must be protected from theft or modification of messages.
Mechanisms used
Identification protocol (password based)
Cryptography
4. Physical level
Protection of equipment from floods,power failure etc.
Protection of disks from theft,erasure,physical damage etc.
Protection of network and terminal cables from wire tapes,non-invasive electronic
eavesdropping,physical damage, etc.
Solution
Replication hardware-mirrored disks,dual busses etc.
Multiple access paths between every pair of devices.
Physical security by locks,police etc.
70

Software techniques to detect physical security breaches


5. Human level
Protection from stolen passwords,sabotage,etc.
Solution
1. Frequent change of passwords.
2. Use of ―non-guessable‖ passwords.
3. Log all invalid access attempts.
4. Data audits.
5. Careful hiring practices.
Authorization
Forms of authorization on parts of the database:
 Read authorization - allows reading, but not modification of data.
 Insert authorization - allows insertion of new data, but not modification of existing data.
 Update authorization - allows modification, but not deletion of data.
 Delete authorization - allows deletion of data
Forms of authorization to modify the database schema:
 Index authorization - allows creation and deletion of indices.
 Resources authorization - allows creation of new relations.
 Alteration authorization - allows addition or deletion of attributes in a relation.
 Drop authorization - allows deletion of relations.
 The grant statement is used to give authorization
Syntax: grant <previlege list> on <relation name or view name> to <user/role list>
Example : grant select on account to john, mary; // this query grants database users john and
mary with select authorization.
grant update(account) on loan to john, mary;
 Revoke statement: It gets back the granted previlege
Syntax: revoke <previlege list> on <relation name or view name> from <user/role
list>[restrict|cascade]
Example : revoke select on branch from john, mary;
revoke update(account) on loan from john, mary;
 Revocation of a privilege from a user may cause other users also to lose that privilege; referred to as
cascading of the revoke.
 We can prevent cascading by specifying restrict:
71

revoke select on
branch from U1, U2, U3 restrict
 With restrict, the revoke command fails if cascading revokes are required.
Roles
 Roles permit common privileges for a class of users can be specified just once by creating a
corresponding ―role‖
 Privileges can be granted to or revoked from roles, just like user
 Roles can be assigned to users, and even to other roles
O create role teller
create role manager
o grant select on
branch to teller
grant update (balance) on account to teller
grant all privileges on account to manager
grant teller to manager
grant teller to alice, bob
grant manager to avi
Authorization and Views
 Users can be given authorization on views, without being given any authorization on the relations
used in the view definition
 Ability of views to hide data serves both to simplify usage of the system and to enhance security by
allowing users access only to data they need for their job
 A combination or relational-level security and view-level security can be used to limit a user‘s
access to precisely the data that user needs.
Granting of Privileges
 The passage of authorization from one user to another may be represented by an authorization grant
graph.
 The nodes of this graph are the users.
 The root of the graph is the database administrator.
 Consider graph for update authorization on loan.
 An edge Ui Uj indicates that user Ui has granted update authorization on loan to Uj.
Authorization Grant Graph
72

Authorization Grant Graph


 Requirement: All edges in an authorization graph must be part of some path originating with the
database administrator
 If DBA revokes grant from U1:
o Grant must be revoked from U4 since U1 no longer has authorization
o Grant must not be revoked from U5 since U5 has another authorization path from DBA
through U2
 Must prevent cycles of grants with no path from the root:
o DBA grants authorization to U2
o U7 grants authorization to U3
o U8 grants authorization to U2
o DBA revokes authorization from U2

 If the database administrator revokes authorization from U2, U2 retains authorization through U3,

 If authorization is revoked subsequently from U3, U3 appears to retain authorization through U2.
73

 When the database administrator revokes authorization from U3, the edges fromU3 to U2 and from
U2 to U3 are no longer part of a path starting with the database administrator.
 The edges between U2 and U3 are deleted, and the resulting authorization graph is

Audits Trials
 An audit trail is a log of all changes (inserts/deletes/updates) to the database along with information
such as which user performed the change, and when the change was performed.
 Used to track erroneous/fraudulent updates.
 Can be implemented using triggers, but many database systems provide direct support.
Limitations of SQL Authorization
 SQL does not support authorization at a tuple level
o E.g. we cannot restrict students to see only (the tuples storing) their own grades
 With the growth in Web access to databases, database accesses come primarily from application
servers.
o End users don't have database user ids, they are all mapped to the same database user id
 All end-users of an application (such as a web application) may be mapped to a single database
user
 The task of authorization in above cases falls on the application program, with no support from
SQL
o Benefit: fine grained authorizations, such as to individual tuples, can be implemented by the
application.
o Drawback: Authorization must be done in application code, and may be dispersed all over
an application
o Checking for absence of authorization loopholes becomes very difficult since it requires
reading large amounts of application code
Encryption
74

Data Encryption Standard (DES) substitutes characters and rearranges their order on the basis of an
encryption key which is provided to authorized users via a secure mechanism. Scheme is no more secure
than the key transmission mechanism since the key has to be shared.
 Advanced Encryption Standard (AES) is a new standard replacing DES, and is based on the
Rijndael algorithm, but is also dependent on shared secret keys
 Public-key encryption is based on each user having two keys:
o public key – publicly published key used to encrypt data, but cannot be used to decrypt data
o private key -- key known only to individual user, and used to decrypt data.
Need not be transmitted to the site doing encryption.
 Encryption scheme is such that it is impossible or extremely hard to decrypt data given only the
public key.
 The RSA public-key encryption scheme is based on the hardness of factoring a very large number
(100's of digits) into its prime components.
Authentication (Challenge response system)
 Password based authentication is widely used, but is susceptible to sniffing on a network
 Challenge-response systems avoid transmission of passwords
o DB sends a (randomly generated) challenge string to user
o User encrypts string and returns result.
o DB verifies identity by decrypting result
o Can use public-key encryption system by DB sending a message encrypted using user‘s
public key, and user decrypting and sending the message back
 Digital signatures are used to verify authenticity of data
o Private key is used to sign data and the signed data is made public.
o Any one can read the data with public key but cannot generate data without private key..
o Digital signatures also help ensure nonrepudiation: sender
cannot later claim to have not created the data
Digital Certificates
 Digital certificates are used to verify authenticity of public keys.
 Problem: when you communicate with a web site, how do you know if you are talking with the
genuine web site or an imposter?
o Solution: use the public key of the web site
o Problem: how to verify if the public key itself is genuine?
 Solution:
75

o Every client (e.g. browser) has public keys of a few root-level certification authorities
o A site can get its name/URL and public key signed by a certification authority: signed
document is called a certificate
o Client can use public key of certification authority to verify certificate
o Multiple levels of certification authorities can exist. Each certification authority
 presents its own public-key certificate signed by a higher level authority, and
 Uses its private key to sign the certificate of other web sites/authorities

9. EMBEDDED SQL
 Embedded SQL are SQL statements included in the programming language
 The SQL standard defines embeddings of SQL in a variety of programming languages such as C,
Java, and Cobol.
 A language to which SQL queries are embedded is referred to as a host language, and the SQL
structures permitted in the host language comprise embedded SQL.
 The embedded SQL program should be preprocessed prior to compilation.
 The preprocessor replaces embedded SQLrequests with host language declarations and procedure
calls.
 The resulting program is compiled by host language compiler.
 EXEC SQL statement is used to identify embedded SQL request to the preprocessor
o EXEC SQL <embedded SQL statement > END_EXEC
Note: this varies by language (for example, the Java embedding uses # SQL { …. }; , C language uses
semicolon instead of END_EXEC)
Example Query
 From within a host language, find the names and cities of customers with more than the variable
amount dollars in some account.
 Specify the query in SQL and declare a cursor for it
EXEC SQL
declare c cursor for
select depositor.customer_name, customer_city
from depositor, customer, account
where depositor.customer_name = customer.customer_name
and depositor account_number = account.account_number
and account.balance > :amount
76

END_EXEC
 The open statement causes the query to be evaluated
EXEC SQL open c END_EXEC
 The fetch statement causes the values of one tuple in the query result to be placed on host
language variables.
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples in the query result
 A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to ‗02000‘ to
indicate no more data is available
 The close statement causes the database system to delete the temporary relation that holds the result
of the query.
EXEC SQL close c END_EXEC
Note: above details vary with language. For example, the Java embedding defines Java iterators to step
through result tuples.
Updates Through Cursors
 Can update tuples fetched by cursor by declaring that the cursor is for update
o declare c cursor for select * from account
where branch_name = ‗Perryridge‘
for update
 To update tuple at the current location of cursor c
update account set balance = balance + 100 where current of c

10. DYNAMIC SQL


 The Dynamic SQL allows programs to construct and submit SQL queries at run time.
 Dynamic SQL can be executed immediately or can be used later.
 Two principle Dynamic SQL statements are
1. PREPARE
2. EXECUTE
SQLSOURCE = ‘Delete from account where amount<10000;
EXEC SQL PREPARE SQLPREPPED FROM:SQLSOURCE;
EXEC SQL EXECUTE SQLPREPPED;
 SQLSOURCE specifies the programming language variable.
77

 SQLPREPPED identifies the SQL variables. It holds the compiled version of SQL statement whose
source form is given in SQLSOURCE.
 The prepare statement takes the source statement and prepares it to produce an executable version,
which is stored in SQLPREPPED.
 EXECUTE statement executes the SQLPREPPED version.
 EXECUTE IMMEDIATE statement combines the functions of PREPARE and EXECUTE in a
single operation.
Call Level Interface
 The SQL Call Level Interface (SQL/CLI) is based on Microsoft‘s Opensoure DataBase Connectivity
(ODBC).
 They allow the applications to be written from which the exact SQL code is not known until run
time.
 Two principle reason for using SQL/CLI
Dynamic SQL is a source code statement. Dynamic SQL requires some kind of SQL
compiler to process the operations like PREPARE, EXECUTE. SQL/CLI does not requir any
special compiler instead it uses the host language compiler. It is in object code form.
SQL/CLI is DBMS independent i.e, it allows creation of several applications with different
DBMS.
 Example for SQL/CLI
strcpy (sqlsource, ― Delete from account where amount>10000);
rc = SQLExecDirect(hstmt,(SQLCHAR*)sqlsource,SQL.NTS);
 Strcpy is used to copy the source form of delete statement into sqlsource variable.
 SQLExecDirect executes the SQL Statement contained in sqlsource anf assigns the return code to
the variable rc.
Two standards connects an SQL database and performs queries and updates.
 Opensoure DataBase Connectivity (ODBC) was initially developed for C language and extended to
other languages like C++, C# and Visual Basic.
 Java DataBase Connectivity (JDBC) is an application program interface for java language.
 The users and applications connects to an SQL server establishing a session, executes a series of
statements and finally disconnects the session.
 In addition to normal SQL commands, a session can also contains commands to commit the work
carried out or rollback the work carried out in a session.
78

11.VIEWS
A View is an object that gives the user a logical view of data from an underlying tables or tables.
It is not desirable for all users to see the entire logical model.
Security consideration may require that certain data be hidden from users.
Any relation that is not part of the logical model, but is made visible to a user as a virtual relation,
is called as view

Views may be created for the following reasons:


1. To provide data security
2. Query simplicity
3. Structure simplicity

Creating of Views

VIEW:-is an imaginary table.


Syntax: Create view <view name>(column alias name) as query with condition.
Query: create view custall as (Select cust_name, city from customer);

Assigning Names to Columns


Example: create view custall (customername, city) as (Select cust_name, city from customer);

Selecting data from view: Display the view


Example:select * from custall;

Updation of a View

Views can be used for data manipulation i.e, the user can perform insert, Update,a nd the delete
operations on the view.
The views on which data manipulation can be done are called updatable Views, the views that do
not allow data manipulation are called Readonly Views.

Destroying a view
79

A view can be dropped by using the drop view command.


Syntax: drop view view_name;
Example: drop view custall;

12.INTRODUCTION TO DISTRIBUTED DATABASES AND


CLIENT/SERVER DATABASES

A distributed database is a database in which storage devices are not all attached to a common processing
unit such as the CPU, controlled by a distributed database management system (together sometimes called
adistributed database system).
Two Types :
• Homogeneous distributed databases
Same software/schema on all sites, data may be partitioned among sites
Goal: provide a view of a single database, hiding details of distribution
• Heterogeneous distributed databases
Different software/schema on different sites
Goal: integrate existing databases to provide useful functionality
• Differentiate between local and global transactions
A local transaction accesses data in the single site at which the transaction was initiated.
A global transaction either accesses data in a site different from the one at which the transaction was initiated
or accesses data in several different sites.

CLIENT/SERVER DATABASES :

• The user specifies which database(s) to query and formulates a query, using the client software.
• The client software then connects to the database(s) and submits the query, in a structure suitable for
communication between client and server.
• The server retrieves data from the database, orders these, and returns these to the client.
• The client processes the incoming data, and presents them to the user.
2 Mark Questions

1. Define- relational algebra.


2. What is a SELECT operation?
3. What is a PROJECT operation?
4. Write short notes on tuple relational calculus
5. Write short notes on domain relational calculus
6. Define query language?
7. What are the two different categories of query languages?
8. Write short notes on Schema diagram.
9. Which condition is called referential integrity? Explain its basic concepts.
10. What are the parts of SQL language?
80

12. What are the categories of SQL command?


13. What are the three classes of SQL expression?0r Explain the basic structure of an SQL expression.
14. Give the general form of SQL query?
15. What is the use of rename operation?
16. Define tuple variable?
17. List the string operations supported by SQL?
18. List the set operations of SQL?
19. What is the use of Union and intersection operation?
20. What are aggregate functions? And list the aggregate functions supported by SQL?
21. What is the use of group by clause?
22. What is the use of sub queries?
23. What is view in SQL? How is it defined?
24. What is the use of with clause in SQL?
25. List the table modification commands in SQL?
26. List out the statements associated with a database transaction?
27. What is transaction?
28. List the SQL domain Types?
29. What is the use of integrity constraints?
30. Mention the 2 forms of integrity constraints in ER model?
31. What is trigger?
32. What are domain constraints?
33. What are referential integrity constraints?
34. What is assertion? Mention the forms available.
35. Give the syntax of assertion?
36. What is the need for triggers?
37. List the requirements needed to design a trigger.
38. Give the forms of triggers?
39. What does database security refer to?
40. List some security violations (or) name any forms of malicious access.
41. List the types of authorization.
42. What is authorization graph?
43. List out various user authorization to modify the database schema.
44. What are audit trails?
81

45. Mention the various levels in security measures.


46. Name the various privileges in SQL?
47. Mention the various user privileges.
48. Give the limitations of SQL authorization.
49. Give some encryption techniques?
50. What does authentication refer?
51. List some authentication techniques.
52. What is embedded SQL? What are its advantages?

16 Mark Questions

1. Discuss about various operations in Relational algebra (Fundamental operations – Additional operation)
2. Discuss in detail about an Integrity, Triggers and Security.
3. Explain Embedded and Dynamic SQL.
4. Explain String Operations and Aggregate functions used in SQL.
5. Explain detail in domain relational calculus.
6. Explain detail in Tuple relational calculus.
7. Explain detail in distributed databases and client/server databases.
82

UNIT III

DATABASE DESIGN

Functional Dependencies – Non-loss Decomposition – Functional Dependencies – First, Second, Third


Normal Forms, Dependency Preservation – Boyce/Codd Normal Form- Multi-valued Dependencies and
Fourth Normal Form – Join Dependencies and Fifth Normal Form

1. INTRODUCTION
Relational database design requires a ―good collection of relation schemas.
Pit-falls in Relational Database Design
A bad design may lead to
• Repetition of information
• Inability to represent certain information
Design Goals
a) Avoid redundant data.
b) Ensure that relationships among attributes are represented.
c) Facilitate the checking of updates for violation of database integrity constraints.
Lending-schema= (branch_name, branch_city, assets,c ustomer_name, loan_no, amount).

branch_name branch_city assets customer_name loan_no amount

Downtown Brooklyn 90,00,000 Jones L-17 1000

Redwood Palo Alto 21,00,000 Smith L-23 2000

Porryride Horseneck 17,00,000 Hayes L-15 1500

Downtown Brooklyn 90,00,000 Jackson L-14 3500

Example: Consider the relation schema:


Here branch Downtown details are represented 2 times. This leads to a redundancy problem.
Redundancy leads to
(a) Wastage of space.
(b) Complicates updating, introduces inconsistency.
Null values
(a) Cannot store information about a branch if no loan exist.
83

(b) Can use null values, but they are difficult to handle.
2. FUNCTIONAL DEPENDENCIES
 Functional dependencies are constraints on the set of legal relations.
 The functional dependency holds on R if and only if for any legal relations r(R), whenever
any two tuples t1 and t2 of r agree on the attributes , they also agree on the attributes . That is,
 t1[ ] = t2 [ ] t1[ ] = t2 [ ]
 It requires that the value for a certain set of attributes determines uniquely the value for another set
of attributes.
 In a given relation R, X and Y are attributes. Attributes Y is functionally dependent on attribute X if
each value of X determines exactly one value of Y, which is represented as
X –> Y
i.e., “X determines Y” or “Y is functionally dependent on X”
X –> Y does not imply Y –> X
For example, in a student relation the value of an attribute “ Marks” is known then the value of an
attribute “Grade” is determined since
Marks –> Grade
Types
(a) Full functional dependency
(b) Partial functional dependency
(c) Transitive functional dependency
(a)Full dependencies
In a relation R, X and Y are attributes. X functionally determines Y. Subset of X should not
functionally determine Y.

In the above example marks is fully functionally dependent on student_no and course_no together
and not on subset of {student_no, course_no}.
This means marks cannot be determined either by student_no or course_no alone.It can be
determined only using student_no and course_no together.
Hence marks are fully functionally dependent on {student_no, course_no}.
(b)Partial dependencies
84

Attribute Y is partially dependent on the attribute X only if it is dependent on a subset of attribute X.


For example course_name, Instructer_name are partially dependent on composite attributes {student-
no,course_no} because course_no alone defines course_name, Instructor_name.
(c)Transitive dependencies
X, Y and Z are 3 attributes in the relation R.

X –> Y

Y –> Z

X –> Z
For example, grade depends on marks and in turn mark depends on {student_no
course_no}, hence Grade depends fully transitively on {student_no & course_no}.
Use of Functional Dependencies
 We use functional dependencies to:
o Test relations to see if they are legal under a given set of functional dependencies.
 If a relation r is legal under a set F of functional dependencies, we say that r satisfies
F.
o specify constraints on the set of legal relations
 We say that F holds on R if all legal relations on R satisfy the set of functional
dependencies F.
2.1. CLOSURE OF A SET OF FUNCTIONAL DEPENDENCIES
 Given a set of functional dependencies F, there are certain other functional dependencies that are
logically implied by F.
o For example: If A B and B C, then we can infer that A C
 The set of all functional dependencies logically implied by F is the closure of F.
 We denote the closure of F by F+.
 We can find all F+ by applying Armstrong’s Axioms:
o Reflexivity Rule
 If is a set of attributes and , then holds.
o Augmentation Rule
 If , then is a set of attributes, then holds.
o Transitivity Rule
 If holds and holds then holds.
 These rules are
85

o Sound (generate only functional dependencies that actually hold) and


o Complete (generate all functional dependencies that hold).
 In addition to these three basic rules there are three additional rules to simplify manual computation
of F+.
o Union Rule
 If holds and holds, then holds
o Decomposition Rule
 If holds, then holds and holds
o Pseudotransitivity Rule
 If holds and holds, then holds
Example:
Consider the schema R = (A, B, C, G, H, I)
Set of functional dependency F = {A B
A C
CG H
CG I
B H}
 some members of F+
o A H
 By transitivity from A B and B H then A H holds.
o AG I
 By augmenting A C with G, to get AG CG and then transitivity with CG I
we get AG I.
o CG HI
 By union rule of CG H and CG I, CG HI holds.
 The left-hand and right-hand sides of a functional dependency are both subsets of R.

Procedure for Computing F+: To compute the closure of a set of functional dependencies F:
F+=F
repeat
for each functional dependency f in F+
86

apply reflexivity and augmentation rules on f


add the resulting functional dependencies to F +
+
for each pair of functional dependencies f1and f2 in F
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F +
until F + does not change any further
 A set of size n has 2n subsets, there are a total of 2 × 2n = 2n+1 possible functional dependencies,
where n is the number of attributes in R.
 Each iteration of the loop except the last iteration adds at least one functional dependency to F+.
2.2. CLOSURE OF ATTRIBUTES SETS
 An attribute is functionally determined by if holds.
+
 Given a set of attributes , define the closure of ( ) under F as the set of attributes that are
functionally determined by under F
+
 Algorithm to compute , the closure of under F
result := a;
while (changes to result) do
for each in F do
begin
if result then result := result
end
Example of Attribute Set Closure
R = (A, B, C, G, H, I)
F = {A B
A C
CG H
CG I
B H}
To compute closure of A, (AG)+
The algorithm start with result = AG
1. result = AG
2. A → B includes B in result. Since A→ B is in F and A result, so result: = result

B.
87

3. result = ABCG (A C)
3. result = ABCGH (CG H and CG AGBC)
4. result = ABCGHI (CG I and CG AGBCH)
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
 Testing for super key:
+, +
o To test if is a super key, we compute and check if contains all attributes of R.
 Testing functional dependencies
o To check if a functional dependency holds (or, in other words, is in F+), just check if
+
.
+
o That is, we compute by using attribute closure, and then check if it contains .
o Is a simple and cheap test, and very useful
2.3. CANONICAL COVER
 If a relational schema R has a set of functional dependencies.
 Whenever a user performs an update on the relation, the database system must ensure that the update
does not violate any functional dependencies.
 The system must roll back the update if it violates any functional dependencies in the set F.
 The violation can be checked by testing a simplified set of functional dependencies.
 If simplified set of functional dependency is satisfied then the original functional dependency is
satisfied and vice versa.
 Sets of functional dependencies may have redundant dependencies that can be inferred from the
others.
 A canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no
redundant dependencies or redundant parts of dependencies

Extraneous Attributes
 An attribute of a functional dependency is said to be extraneous if we can remove it without
changing the closure of the set of functional dependencies.
 Consider a set F of functional dependencies and the functional dependency in F.
o Attribute A is extraneous in if A and F logically implies
(F – { }) {( – A) }.
88

o Attribute A is extraneous in if A and the set of functional dependencies


(F – { }) { ( – A)} logically implies F.
 Example: Given F = {A C, AB C}
o B is extraneous in AB C because {A C, AB C} logically implies A C
 Example: Given F = {A C, AB CD}
o C is extraneous in AB CD since AB C can be inferred even after deleting C.
Testing if an Attribute is Extraneous
 Consider a set F of functional dependencies and the functional dependency in F.
 To test if attribute A is extraneous in
1. compute ({ } – A)+ using the dependencies in F
2. check that ({ } – A)+ contains ; if it does, A is extraneous in
 To test if attribute A is extraneous in
+
1. compute using only the dependencies in
(F – { }) { ( – A)},
+
2. check that contains A; if it does, A is extraneous in
 Example: F contains AB CD, A E and E C. to check C is extraneous in AB CD, we compute
the attribute closure of AB under
o F‘={ AB CD, A E, E C}
 The closure is ABCDE, which includes CD. So C is extraneous.
Definition of Canonical Cover
 A canonical cover for F is a set of dependencies Fc such that
o F logically implies all dependencies in Fc, and
o Fc logically implies all dependencies in F, and
o No functional dependency in Fc contains an extraneous attribute, and
o Each left side of functional dependency in Fc is unique.
 To compute a canonical cover for F:
Fc =F
repeat
Use the union rule to replace any dependencies in F
1 1 and 1 2 with 1 1 2

Find a functional dependency with an


extraneous attribute either in or in
89

If an extraneous attribute is found, delete it from


until F does not change
Computing a Canonical Cover
 R = (A, B, C)
F = {A BC
B C
A B
AB C}
 By union rule combine A BC and A B into A BC
o Set is now {A BC, B C, AB C}
 A is extraneous in AB C
o Check if the result of deleting A from AB C is implied by the other dependencies
 After deleting A from AB C the resultant set will be {A BC, B C, B C}
 B C is already present in the set.
o So the resultant set is now {A BC, B C}
 C is extraneous in A BC
o Removing C from A BC we get {A B, B C}
 The canonical cover is:
A B
B C
 Canonical cover might not be unique.

3. NORMALIZATION
 Normalization of data is a process of analyzing the given relational schema based on their functional
dependencies and primary key to achieve the desirable properties of
 Minimize redundancy
 Minimize insert, delete and update anomalies during database activities
 Normalization is an essential part of database design.
 The concept of normalization helps the designer to built efficient design.
Purpose of Normalization:
Minimize redundancy in data.
Remove insert, delete and update anomaly during database activities.
90

Reduce the need to reorganize data when it is modified or enhanced.


Normalization reduces a complex user view to a set of small and stable subgroups of
fields/relations.
This process helps to design a logical data model known as conceptual data model.
Normalization Forms: Different normalization forms are:
1. First normal form (1NF):
A relation is said to be in the first normal form if it is already in unnormalized form and it has
no repeating group.
2. Second normal form (2NF):
A relation is said to be in second normal form if it is already in the first normal form and it
has no partial dependency.
3. Third normal form (3NF):
A relation is said to be in third normal form if it is already in second normal form and it has
no transitive dependency.
4. Boyce-Codd normal form(BCNF):
A relation is said to be in Boyce-Codd normal form if it is already in third normal form and
every determinant is a candidate key. It is a stronger version of 3NF
5. Fourth normal form (4NF) :
A relation is said to be in fourth normal form if it is already in BCNF and it has no
multivalued dependency.
6. Fifth normal form (5NF) :
A relation is said to be in 5NF if it is already in 3NF and has no join dependency.

4. FIRST NORMAL FORM (1NF)


 1NF does not allow multi valued attribute or composite attribute and their combinations.
 It states that domain of the attribute includes only single value, atomic or indivisible value.
 1NF does not allow relation within relation.
 Example: Consider the following schema Department
Department
Dname Dnumber Dmgr_ssn Dlocation

Department
91

Dname Dnumber Dmgr_ssn Dlocation

Research 5 333445555 {Bellaire, Sugsrland,


Houston}

Administration 4 987654321 Stafford

headquarters 1 888665555 Houston

 In our example Departmentrelation is not in 1NF because Dlocation has multivalued attributes.
 There are 3 main techniques to achieve 1NF for such relation.

1. Remove the Dlocation that violates 1NF and place it in a separate relation Dept_location
along with primary key Dnumber of department. The primary key of this relation is the
combination of {Dnumber, Dlocation}.
Dept_location
Dnumber Dlocation

5 Bellaire

5 Sugsrland

5 Houston

4 Stafford

1 Houston

2. Expand the key so that there will be separate tuple in the original department relation. The
primary key becomes {Dnumber, Dlocation}. This solution has the disadvantage of
introducing redundancy in the relation.

Dname Dnumber Dmgr_ssn Dlocation

Research 5 333445555 Bellaire

Research 5 333445555 Sugsrland


92

Research 5 333445555 Houston

Administration 4 987654321 Stafford

headquarters 1 888665555 Houston

3. If a maximum number of values is knowm for the attribute. For example, if it is known that
atmost three locations can exist for a department, and then replace Dlocation by Dlocation1,
Dlocation 2, and Dlocation3. This solution has the disadvantage of introducing null values if
most departments have fewerthan three locations.

Dname Dnumber Dmgr_ssn Dlocation Dlocation Dlocation3


1 2

Research 5 333445555 Bellaire Sugsrland Houston

Administration 4 987654321 Stafford Null Null

headquarters 1 888665555 Houston Null Null

 1NF does not allow nested relation


EMP_PROJ
PROJS
Eid Ename Pnumber Hours

 This schema can also be represented as


EMP_PROJ (Eid, Ename, {PROJS (Pnumber, Hours)})
 To normalize this nested relation into 1NF, we remove the nested relation attribuyes into a new
relation and propagate the primary key into it.
 Primary key of the new relation will be the partial key with the primary key of the original relation.
EMP_PROJ (Eid, Ename, {PROJS ( Pnumber, Hours)})

EMP_PROJ1 EMP_PROJ2
Eid Ename Eid Pnumber Hours
93

5. SECOND NORMAL FORM (2NF)


 2NF is based on the concept of full functional dependency.
 A functional dependency XY is full functional dependency if removal of any attribute A from X
means that the dependency does not hold any more.
i.e., any attribute A X, (X-{A}) does not functionally determine Y.
 A functional dependency XY is partial functional dependency if some attribute A X removed

from X and the functional dependency still holds.


i.e, for some A X, (X-{A})Y holds
 Example
{eid, Pnumber}hours is partial functional dependency
{eid, Pnumber}Ename is partial functional dependency because eid Ename holds
 The test for 2NF involves testing for FDs whose LHS attribute are parts of the PK. If the PK
contains a single attribute, the test need not be applied at all.
 A relational schema R is in 2NF if every nonprime attribute A in R is full functional dependent on
the PK of R.
 Prime attribute: An attribute of a relational schema R is called a Prime attribute of R if it is a
member of some candidate key of R.
 Nonprime attribute: An attribute is called a nonprime attribute if it is not a prime attribute. i.e., if it
is not a member of any candidate key.
 Example
EMP_PROJ
Ssn Pnumber Hours Ename Pname Plocation

 In the above example EMP_PROJ. Ssn and Pnumber are primary key.
 The table is in 1NF.
 FD1 is in 2NF but FD2 and FD3 violates 2Nf.
 The Ename, Pname, Plocation in FD2 and FD3 are partially dependent on the primary key attributes
Ssn and Pnumber.
 A relation which is not in second normal form can be made to be in 2NF by decomposing the
relation into a number such that each nonprime attribute is fully functional dependent on the primary
key.
94

 So the above table can be decomposed in to three tables.


EMP_PROJ
Ssn Pnumber Hours Ename Pname Plocation

FD1

FD2

FD3

EP1
Ssn Pnumber Hours

FD1

EP2
Ssn Ename

FD2

EP3
Pnumber Pname Plocation

FD3

6. THIRD NORMAL FORM (3NF)

 Third Normal Form is based on the concept of transitive dependency.


 A relational schema R is in 3NF if it satisfies 2NF and no nonprime attribute in relation R is
transitively dependent on the primary key.
 A functional dependency XY in a relational schema R is a transitive dependency if there is a set of
attributes Z that is neither a candidate key or a subset of any key R, and both XZ and ZY hold.
Example:

EMP_DEPT
95

Ename Eid DOB Address Dnumber Dname DMGRid

ED1
Ename Eid DOB Address Dnumber

ED2
Dnumber Dname DMGRid

 The dependency EidDMRid is transitive through Dnumber in EMO_DEPT, because both the
dependencies EidDnumber and DnumberDMGRid hold.
 Dnumber is neither a key itself nor a subset of key of EMP_DEPT. therefore the EMP_DEPT
relational schema is not in 3NF.
 The relation is in 2NF because there is no partial dependencies on the key attribute.
 We can normalize EMP_DEPT by decomposing it into two 3NF relational schemas ED1 and ED2.

Summary of Normal Forms

Normal form Test Remedy

Relation should have no multivalued Forms new relations for each


1NF attributes or nested relations. multivalued attributes or nested
relations.

For relations where primary key Decomposes and set up a new relation
contains multiple attribute, no non key for each partial key with its dependent
2NF attribute should be functionally attributes. Make sure to keep a relation
dependent on a pert of primary key. with the original primary key and any
attributes that are fully FD on it.

Relations should not have a non key Decompose and set up a relation that
attribute functionally determined by includes the non key attributes that
3NF another non key attribute. i.e., there functionally determines other non-key
should be no transitive dependency of attributes.
a non key attribute on the primary key.
96

7. LOSS LESS DECOMPOSITION


 Let R be a relational schema and F be a set of functional dependencies on R.
 Let R1 and R2 form a decomposition of R.
 Let r(R) be a relation with schema R.
 The decomposition is lossless decomposition if
R1(r) R2(r) =r
 If natural join is computed on R1 and R2 then we get the relation r.
 A decomposition that is not a lossless decomposition is called lossy decomposition.
 The lossless join decomposition is also called lossless decomposition and the lossy join
decomposition is called lossy decomposition.
 R1 and R2 form a lossless decomposition of R if at least one of the following functional dependency
is in F+
o R1 R2R1
o R1 R2R2
 If R1 R2 forms a super key of either R1 or R2, the decomposition of R is a lossless decomposition.
 Attribute closure can be used to calculate superkey.
 Example: Consider the following schema

bor_loan = (customer_id, loan_number, amount)

If it is decomposed into

borrower = (customer_id, loan_number)

loan = (loan_number, amount)

Here borrower = loan_number and loan_numberamount, satisfies lossless decomposition

rule.

8. DEPENDENCY PRESERVATION
 Let F be a set of functional dependencies on a schema R, and let R1, R2, . . . , Rn be a decomposition
of R.
 The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only
attributes of Ri.
 Example
97

F = {A → B, B → C}
The restriction of F is A → C, since A → C is in F+, even though it is not in F.
 Even though F’≠ F, F‘+=F+ where F‘=F1 F2 F3 Fn.
 The decomposition having the property F‘+=F+ is a dependency-preserving decomposition.
Algorithm to test dependency preservation
compute F+;
for each schema Ri in D do
begin
Fi : = the restriction of F+ to Ri;
end
F’: =θ
for each restriction Fi do
begin
F’ = F’ Fi
end
compute F’+;
if (F’+= F+) then return (true)
else return (false);

 The input to the algorithm is a set of decomposed relational schemas D = {R1, R2, R3…, Rn} and a
set F of functional dependencies.
 This algorithm is expensive since it requires the computation of F+
 The second alternative method to calculate dependency preservation is as follows.
 The test is applied to each { } in F
result = α
while (changes to result) do
for each Ri in the decomposition
t = (result ∩Ri)+ ∩ Ri
result = result t
 If result contains all attributes in β, then the functional dependency α → β is preserved.

9. BOYCE-CODD NORMAL FORM (BCNF)


 A relational schema R is in BCNF with respect to a set F of functional dependencies, if for all
FD in F+ of the form { }, where R and R. at least one of the following holds
98

{ } is a trivial dependency (i.e., )

is a super key for the schema R.


BCNF Decomposition Algorithm
result := {R};
done := false;
compute F+;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let be a nontrivial functional
dependency that holds on Ri
+
such that Ri is not in F ,
and = ;
result := (result – Ri ) (Ri – ) ( , );
end
else done := true;
Example:
Consider the following relational schema.
1. Customer-schema = (customer-name, customer-street, customer-city)
customer-name → customer-street, customer-city
2. Branch-schema = (branch-name, assets, branch-city)
branch-name → assets, branch-city
3. Loan-info-schema = (branch-name, customer-name, loan-number, amount)
loan-number → amount, branch-name
 Customer-schema is in BCNF. Since customer-name is a candidate key, functional dependencies with
customer-name on the left side do not violate the definition of BCNF.
 Similarly, the relation schema Branch-schema is also in BCNF. The schema Loan-info-schema is not in
BCNF. First, note that loan-number is not a superkey for Loan-info-schema, since we can have a pair of
tuples with a single loan for example,
(Downtown, John Bell, L-44, 1000)
(Downtown, Jane Bell, L-44, 1000)
loannumber is not a candidate key.
99

 However, the functional dependency loan-number → amount is nontrivial. Therefore, Loan-info-schema


does not satisfy the definition of BCNF.
 If several customer names are associated with a loan, then branch name and the amount is repeated once
for each customer.
 This can be eliminated by redesigning the database such that all schemas are in BCNF.
 One approach to this problem is to take the existing non- BCNF design as a starting point, and to
decompose those schemas that are not in BCNF.
 Consider the decomposition of Loan-info-schema into two schemas:
Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)
The decomposition is based on the following expression
R1 = ( , )
R2 = (Ri – )
 This decomposition is a lossless-join decomposition. To determine whether these schemas are in BCNF,
we need to determine what functional dependencies apply to them. I
 n this example, it is easy to see that loan-number → amount, branch-name applies to the Loan-schema,
and that only trivial functional dependencies apply to Borrower-schema.
 Although loan-number is not a superkey for Loan-info-schema, it is a candidate key for Loan-schema.
Thus, both schemas of our decomposition are in BCNF.
Testing Decomposition for BCNF
To test if a relation is in BCNF the following can be done:
1. To check if a nontrivial dependency α → β causes a violation of BCNF, compute α+ and
verify that it includes all attributes of R; that is, it is a superkey of R.
2. To check if a relation schema R is in BCNF, it sufficient to check only if the dependencies in the
set F does not violate BCNF, rather than to check all dependencies in F+.
If none of the dependencies in F causes a violation of BCNF, then none of the dependencies in F+ will
cause a violation of BCNF either. It is not true when a relation is decomposed.
An alternative BCNF test to check if a relation Ri, decomposition of R is in BCNF, the following test
can be applied:.
For every subset α of attributes in Ri, check that α+ either includes no attribute of Ri-α, or includes
all attributes of Ri.

10.MULTI-VALUED DEPENDENCIES AND FOURTH NORMAL FORM


100

MULTI-VALUED DEPENDENCIES (MVD)

 A Multi-valued Dependencies X →→ Y on relation schema R, where X and Y are both subset of R,


specifies the following constraints any relation state r of R.
 If two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4 should exist in r with
the following properties

t3[X] = t4[X] = t1[X] = t2[X]


t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[Z] = t2[Z] and t4[Z] = t1[Z]
where Z is used to denote (R- (X Y))
 X →→ Y holds, we say that X multidetermines Y.
 A MVD is trivial Multi-valued Dependencies if
a. Y is a subset of X.
b. X Y=R
 In MVD X →→ Y, X →→ Z can be written as X →→ Y/Z.
Inference rules for functional and multivalued dependencies
1. IR1 (reflexive rule for FDs) : If X Y, then XY
2. IR2 (augmentation rule for FDs) : {XY}=XZYZ
3. IR3 (transitive rule for FDs) : {XY, YZ}=XZ
4. IR4 (complementation rule for FDs) : { X →→ Y} = {X→→(R-X Y))}
5. IR5 (augmentation rule for MVDs) : if X →→ Y and w Z, then wX→→ YZ
6. IR6 (transitive rule for MVDs) : if { X →→ Y, X →→ Z } = X →→ (Z-Y)
7. IR7 (replication rule for MVDs) : { X →→ Y}={ x →→ y}
8. IR8 (coalescence rule for FDs and MVDs) :
if { X →→ Y} and there exists w with the properties that
a) w y is empty
b) wz and
c) y z, then xz

FOURTH NORMAL FORM


 A relational schema R is in 4NF with respect to a set of dependency F if for every non trivial
multivalued dependency X →→ Y in F+, X is a super key for R.
 Example
101

EMP
Ename Pname Dname

Smith X John

Smith Y Anna

Smith X Anna

Smith Y John

EMP_PROJECTS EMP_DEPENDENTS
Ename Pname Ename Dname

Smith X Smith John

Smith Y Smith Anna

 If the relation has nontrivial MVDs, then insert, delete and update operations on single tuple may
cause additional tuples to be modified.
 To overcome these anomalies the relation is decomposed into 4NF.
Procedure for 4NF:
Input: A universal relation R and a set of functional and multivalued dependencies F
Set D :={ R}
While there is a relation schema Q in D that is not in 4NF, do
{
Choose a relation schema Qin D that is not in 4NF;
Fine a nontrivial MVD X →→ Y that violates 4NF;
Replace Qin D by two relation schemas (Q-Y) and (X Y);
};

11.JOIN DEPENDENCIES AND FIFTH NORMAL FORM


JOIN DEPENDENCIES
 A join dependency (JD) denoted by JD (R1, R2, R3… Rn) specified on relation schema R, specifies a
constraint on the state r of R.
102

 The constraint states that every legal state r of R should have a nonadditive join decomposition into
R1, R2, R3… Rn i.e, for every such relation r we have
( R1(r), R2(r)… Rn(r)) =r
 JD denoted as JD (R1, R2) implies an MVD (R1 R2) →→ (R1 - R2).
FIFTH NORMAL FORM (5NF)
A relational schema R is in fifth normal form or Project Join Normal Form (PJNF) with respect to a
set F of functional, multivalued and join dependency if, for every nontrivial join dependency JD (R1, R2,
+
R3… Rn) in F , every Ri is a superkey of R.

SUPPLY
Sname Part_name Proj_name

Smith Bolt
Sname Part_name
Smith Nut
Smith Bolt
Adamsky Bolt
Smith Nut
Walton Nut
Adamsky Bolt
Adamsky Nail
Walton Nut
Adamsky Bolt

Smith Bolt Y

 The supply relation is decomposed into three relations R1, R2, R3 that are in 5NF.
103

Adamsky Nail Sname Proj_name Part_name Proj_name

Smith X Bolt X

Smith Y Nut Y

Adamsky Y Bolt Y

R1 Walton Z Nut Z
R2
R3 Adamsky X Nail X

ANOMALIES IN DATABASES
 There are three types of anomalies. They are
1. Insert Anomalies
2. Update Anomalies
3. Delete Anomalies

1. Insert Anomalies:

 The inability to insert part of information into a relational schema due to the unavailability of part of
the remaining information is called Insert Anomalies.
 Example: If there is a guid having no registered under him, then we can not insert the guide‘s
information in the schema project.

2. Update Anomalies:

 Updation of relation schema with redundancy may lead to update anomalies.


 Example: If a person changes his address then the updation should be carried out wherever the
copies occur. If it is not updated properly then data inconsistency arises.

3. Delete Anomalies:

 If the deletion of some information leads to loss of some other information, then we say there is a
deletion anomaly.
 Example: If a guide guides one student and if the student discontinues the course then the
information about the guid will be lost.
104

2 Mark Questions

1. What is meant by functional dependencies?


2. What are the uses of functional dependencies?
3. What is Fully Functional dependency?
4. What is Partial Functional dependencies?
5. What is Transitive Functional dependencies?
6. What are axioms?
7. Write the inference rule for functional dependencies.
8. What is meant by computing the closure of a set of functional dependency?
9. Define canonical cover?
10. List the properties of canonical cover.
11. List the disadvantages of relational database system
12. What is known as normalization?
13. Summarize different normal form.
14. What is first normal form?
15. What is 2NF?
16. Define Boyce codd normal form
17. Explain the desirable properties of decomposition.
18. Describe briefly any two undesirable properties that a bad database design may have.
19. What is the purpose of normalization?

16 Mark Questions
1. Explain detail about Functional Dependencies.
2. Explain detail about first, second and third normalization form.
3. Explain detail about Boyce code normal form and fifth normalization form.
4. Explain detail in decomposition using Functional Dependencies.
5. Explain detail in decomposition using Multi-Valued Dependencies.
105

UNIT IV

TRANSACTIONS

Transaction Concepts - Transaction Recovery – ACID Properties – System Recovery – Media Recovery
– Two Phase Commit - Save Points – SQL Facilities for recovery – Concurrency – Need for Concurrency
– Locking Protocols – Two Phase Locking – Intent Locking – Deadlock- Serializability – Recovery
Isolation Levels – SQL Facilities for Concurrency.

1. TRANSACTION CONCEPTS

A transaction is a logical unit of work. It begins with the execution of a BEGIN TRANSACTION
operation and ends with the execution of a COMMIT or ROLLBACK operation.

A Sample Transaction (Pseudo code)

BEGIN TRANSACTION
UPDATE ACC123 (BALANCE: =BALANCE-$100);
If any error occurred THEN GOTO UNDO;
END IF;
UPDATE ACC123 (BALANCE: =BALANCE+$100);
If any error occurred THEN GOTO UNDO;
END IF;
COMMIT;
GOTO FINISH;
UNDO;
ROLLBACK;
FINISH;
RETURN;
In our example an amount of $100 is transferred from account 123 to 456.
It is not a single atomic operation, it involves two separate updates on the database.
Transaction involves a sequence of database update operation.
106

The purpose of this transaction is to transform a correct state of database into another incorrect state,
without preserving correctness at all intermediate points.
Transaction management guarantees a correct transaction and maintains the database in a correct
state.
It guarantees that if the transaction executes some updates and then a failure occurs before the
transaction reaches its planned termination, then those updates will be undone.
Thus the transaction either executes entirely or totally cancelled.
The system component that provides this atomicity is called transaction manager or transaction
processing monitor or TP monitor.
ROLLBACK and COMMIT are key to the way it works.
1. COMMIT:
The COMMIT operation signals successful end of transaction.
It tells the transaction manager that a logical unit of work has been successfully completed and
database is in correct state and the updates can be recorded or saved.
2. ROLLBACK:
a. By contrast, the ROLLBACK operation signals unsuccessful end of transaction.
b. It tells the transaction manager that something has gone wrong, the database might be in
incorrect state and all the updates made by the transaction should be undone.
3. IMPLICIT ROLLBACK:
Explicit ROLLBACK cannot be issued in all cases of transaction failures or errors. So the
system issues implicit ROLLBACK for any transaction failure.
If the transaction does not reach the planned termination then we ROLLBACK the transaction
else it is COMMITTED.
4. MESSAGE HANDLING:
A typical transaction will not only update the database, it will also send some kind of message
back to the end user indicating what has happened.
Example: “Transfer done” if the COMMIT is reached, or Error “transfer not done”
5. RECOVERY LOG:
The system maintains a log or journal or disk on which all particular about the updation is
maintained.
The values of before and after updation is also called as before and after images.
This log is used to bring the database to the previous state incase of some undo operation.
The log consist of two portions
107

a.an active or online portion


b. an archive or offline portion.
The online portion is the portion used during normal system operation to record details of
updates as they are performed and it is normally kept on disk.
When the online portion becomes full, its contents are transferred to the offline portion, which
can be kept on tape.
6. STATEMENT ATOMICITY:
The system should guarantee that individual statement execution must be atomic.
7. PROGRAM EXECUTION IS A SEQUENCE OF TRANSACTIONS:
COMMIT and ROLLBACK terminate the transaction, not the application program.
A single program execution will consist of a sequence of several transactions running one after
another.
PROGRAM EXECUTION IS A SEQUENCE OF TRANSACTIONS

8. NO NESTED TRANSACTIONS:
An application program can execute a BEGIN TRANSACTION statement only when it has no
transaction currently in progress.
i.e., no transaction has other transactions nested inside itself.
9. CORRECTNESS:
Consistent means ―not violating any known integrity constraint.‖
Consistency and correctness of the system should be maintained.
If T is a transaction that transforms the database from state D1 to state D2, and if D1 is correct,
then D2 is correct as well.
10. MULTIPLE ASSIGNMENT:
108

Multiple assignments allow any number of individual assignments (i.e., updates) to be


performed “simultaneously”
Example: UPDATE ACC 123 {BALANCE: = BALANCE - $100}
UPDATE ACC 456 {BALANCE: = BALANCE + $100}
Multiple assignments would make the statement atomic.
Current products do not support multiple assignments.
2. TRANSACTION RECOVERY
A transaction begins by executing a BEGIN TRANSACTION operation and ends by executing either a
COMMIT or a ROLLBACK operation.
COMMIT establishes a commit point or synch point.
A commit point corresponds to the successful end of a transaction and the database will be in a correct
state.
ROLLBACK rolls the database back to the previous commit point.
There will be several transactions executing in parallel in a database.
When a commit point is established:
1. When a program is committed, the change is made permanent. i.e., they are guaranteed to be
recorded in the database. Prior to the commit point updates are tentative i.e., they can be
subsequently be undone.
2. All database positioning is lost and all tuple locks are released.

Database positioning means at the time of execution each program will typically have addressability to
certain tuples in the database, this addressability is lost at a COMMIT point.
Transactions are not only a unit of work but also unit of recovery.
If a transaction successfully commits, then the system updates will be permanently recorded in the
database, even if the system crashes the very next moment.
If the system crashes before the updates are written physically to the database, the system‘s restart
procedure will still record those updates in the database.
The values can be discovered from the relevant records in the log.
The log must be physically written before the COMMIT processing can complete. This is called write-
ahead log rule.
The restart procedure helps in recovering any any transactions that completed successfully but not
physically written prior to the crash.
Implementation issues
109

1. Database updates are kept in buffers in main memory and not physically written to disk until the
transaction commits. That way, if the transaction terminates unsuccessfully, there will be no need
to undo any disk updates.
2. Database updates are physically written to the disk after COMMIT operation. That way, if the
system subsequently crashes, there will be no need to redo any disk updates.
If there is no enough disk space then a transaction may steal buffer space from another transaction. They
may also force updates to be written physically at the time of COMMIT.
Write ahead log rule is elaborated as follows:
1. The log record for a given database update must be physically written to the log before that update
is physically written to the database.
2. All other log records for a given transaction must be physically written to the log before the
COMMIT log record for that transaction is physically written to the log.
3. COMMIT processing for a given transaction must not complete until the COMMIT log record for
that transaction is physically written to the log.
3. ACID PROPERTIES
ACID stands for Atomicity, Correctness, Isolation and Durability.
* Atomicity: Transactions are atomic.
Consider the following example
Transaction to transfer $50 from account A to account B:
read(A)
A := A – 50
write(A)
read(B)
B := B + 50
write(B)
read(X), which transfers the data item X from the database to a local buffer belonging to the
transaction that executed the read operation.
write(X), which transfers the data item X from the local buffer of the transaction that executed the
write back to the database.
Before the execution of transaction Ti the values of accounts A and B are $1000 and $2000,
respectively.
Suppose if the transaction fails due to some power failure, hardware failure and system error the
transaction Ti will not execute successfully.
110

If the failure happens after the write(A) operation but before the write(B) operation. The database
will have values $950 and $2000 which results in a failure.
The system destroys $50 as a result of failure and leads the system to inconsistent state.
The basic idea of atomicity is: The database system keeps track of the old values of any data on
which a transaction performs a write, if the transaction does not terminate successfully then the
database system restores the old values.
Atomicity is handled by transaction-management component.
* Correctness/ Consistency:
Transactions transform a correct state of the database into another correct state, without necessarily
preserving correctness at all intermediate points.
In our example the transaction is in consistent state if the sum of A and B is unchanged by the
execution of transaction.
*Isolation:
Transactions are isolated from one another.
Even though there are many transactions running concurrently, any given transaction‘s updates are
concealed from all the rest, until that transaction commits.
The database will be temporarily inconsistent while the transaction is in progress.
When the amount is reduced from A and not yet incremented to B. the database will be inconsistent.
If a second concurrently running transaction reads A and B at this intermediate point and computes
A+B, it will observe an inconsistent value.
If the second transaction performs updates on A and B based on the inconsistent values that it read,
the database will remain inconsistent even after both transactions are completed.
In order to avoid this problem serial execution of transaction is preferred.
Concurrency control component maintain isolation of transaction.

*Durability:
Once a transaction commits, its updates persist in the database, even if there is a subsequent system
crash.
The computer system failure may lead to loss of data in main memory, but data written to disk are
not lost.
Durability is guaranteed by ensuring the following
o The updates carried out by the transaction should be written to the disk.
111

o Information stored in the disk should be sufficient to enable the database to reconstruct the
updates when the database system restarts after failure.
o Recovery management component is responsible for ensuring durability.

4. SYSTEM RECOVERY
The system must be recovered not only from purely local failures such as an individual transaction, but
also from ―global‖ failures.
A local failure affects only the transaction in which the failure has actually occurred.
A global failure affects all of the transactions in progress at the time of the failure.
The failures fall into two broad categories:
1. System failures (e.g., power outage), which affect all transactions currently in progress but do not
physically damage the database. A system failure is sometimes called a soft crash.
2. Media failures (e.g., head crash on the disk), which cause damage to the database or some portion
of it. A media failure is sometimes called a hard crash.
System failure and recovery
During system failures the contents of main memory is lost.
The transaction at the time of the failure will not be successfully completed, so transactions must be
undone i.e., rolled back when the system restarts.
It is necessary to redo certain transactions at the time of restart that is not successfully completed prior
to the crash but did not manage to get their updates transferred from the buffers in main memory to the
physical database.
Whenever some prescribed number of records has been written to the log the system automatically takes
a checkpoint.
The checkpoint record contains a list of all transactions that were in progress at the time the checkpoint
was taken.
To see how a check point works consider the following
112

A system failure has occurred at time tf.


The most recent checkpoint prior to time tf was taken at time tc.
Transactions of type T1 completed (successfully) prior to time tc.
Transactions of type T2 started prior to time tc and completed (successfully) after time tc and before
time tf.
Transactions of type T3 also started prior to time tc but did not complete by time tf.
Transactions of type T4 started after time tc and completed (successfully) before time tf.
Finally, transactions of type T5 also started after time tc but did not complete by time tf.
The transactions of types T3 and T5must be undone, and transactions of types T2 and T4 must be redone. At
restart time, the system first goes through the following procedure
1. Start with two lists of transactions, the UNDO list and the REDO list.
2. Set the UNDO list equal to the list of all transactions given in the most recent checkpoint record
and the REDO list to empty.
3. Search forward through the log, starting from the checkpoint record.
4. If a BEGIN TRANSACTION log record is found for transaction T, add T to the UNDO list.
5. If a COMMIT log record is found for transaction T, move T from the UNDO list to the REDO list.
6. When the end of the log is reached, the UNDO and REDO lists are identified.
The system now works backward through the log, undoing the transactions in the UNDO list.
Then works forward, redoing the transactions in the REDO list.
Restoring the database to a correct state by redoing work is sometimes called forward recovery.
Restoring the database to a correct state by undoing work is called backward recovery.
When all recovery activity is complete, then the system is ready to accept new work.
ARIES
113

Earlier recovery system performs UNDO before REDO operations.


ARIES scheme performs REDO before UNDO operation.
ARIES operates in three broad phases:
1. Analysis: Build the REDO and UNDO lists.
2. Redo: Start from the log determined in the analysis phase and restore the database to the state it was in the
time of crash.
3. Undo: Undo the effects of transactions that failed to commit.
The name ARIES stands for “Algorithms for Recovery and Isolation Exploiting Semantics”
5. TWO PHASE COMMIT
Two-phase commit is important whenever a given transaction can interact with several independent
“resource managers”
Example,
o Consider a transaction running on an IBM mainframe that updates both an IMS database and
a DB2 database. If the transaction completes successfully, then both IMS data and DB2 data
are committed.
o Conversely, if the transaction fails, then both the updates must be rolled back.
o It is not possible to commit one database update and rollback the other. If done so the
atomicity will not be maintained in the system.
o Therefore, the transaction issues a single “global” or system-wide COMMIT or
ROLLBACK.
o That COMMIT or ROLLBACK is handled by a system component called the coordinator.
o Coordinators task is to guarantee the resource managers commit or roll back.
o It should also guarantee even if the system fails in the middle of the process.
o The two-phase commit protocol is responsible for maintaining such a guarantee.
WORKING
Assume that the transaction has completed and a COMMIT is issued. On receiving the COMMIT
request, the coordinator goes through the following two-phase process:

Prepare:
The resource manager should get ready to ―go either way‖ on the transaction.
The participant in the transaction should record all updates performed during the transaction from
temporary storage to permanent storage.
In order to perform either COMMIT or ROLLBACK as necessary.
114

Resource manager now replies “OK” to the coordinator or NOT OK based on the write operation.
Commit:
When the coordinator has received replies from all participants, it takes a decision regarding the
transaction and records it in the physical log.
If all replies were OK, that the decision is”commit”, if any reply was “Not OK” the decision is
“rollback.”
The coordinator informs its decision to all the participants.
Each participant must then commit or roll back the transaction locally, as instructed by the
coordinator.
If the system fails at some point during the process, the restart procedure looks for the decision of the
coordinator.
If the decision is found then the two phase commit can start processing from where it has left off.
If the decision is not found then it assumes that the decision is ROLLBACK and the process can
complete appropriately.
If the participants are from several systems like in distributed system, then some participants should wait
for long time for the coordinators decision.
Data communication manager (DC manager) can act as a resource manager in case of a two-phase
commit process.
6. SAVEPOINTS
Transactions cannot be nested with in another transaction.
Transactions cannot be broken down into smaller subtransactions.
Transactions establish intermediate savepoints while executing.
If there is a roll back operation executed in the transaction, instead of performing roll back all the way to
the beginning we can roll back to the previous savepoint.
Savepoint is not the same as performing a COMMIT, updates made by the transaction are still not
visible to other transaction until the transaction successfully executes a COMMIT.
7. MEDIA RECOVERY
Media recovery is different from transaction and system recovery.
A media failure is a failure such as a disk head crash or a disk controller failure in which some portion
of the database has been physically destroyed.
Recovery from such a failure basically involves reloading or restoring the database from a backup or
dump copy and then using the log.
There is no need to undo transactions that were still in progress at the time of the failure.
115

The dump portion of that utility is used to make backup copies of the database on demand.
Such copies can be kept on tape or other archival storage, it is not necessary that they be on direct access
media.
After a media failure, the restore portion of the utility is used to recreate the database from a specified
backup copy.

8. SQL FACILITIES FOR RECOVERY

SQL supports transactions and transaction-based recovery.


All executable SQL statements are atomic except CALL and RETURN.
SQL provides BEGIN TRANSACTION, COMMIT, and ROLLBACK, called START
TRANSACTION, COMMIT WORK, and ROLLBACK WORK, respectively.
Syntax for START TRANSACTION:
START TRANSACTION <option commalist>;
The <option commalist> specifies an access mode, an isolation level, or both
The access mode is either READ ONLY or READ WRITE.
o If neither is specified, READ WRITE is assumed. If READ WRITE is specified, the
isolation level must not be READ UNCOMMITTED.
The isolation level takes the form ISOLATION LEVEL <isolation>, where <isolation> can be READ
UNCOMMITTED, READ COMMITTED, REPEATABLE READ, or SERIALIZABLE.
The syntax for COMMIT and ROLLBACK is:
COMMIT [WORK] [AND [NO] CHAIN];
ROLLBACK [WORK] [AND [NO] CHAIN];
AND CHAIN causes a START TRANSACTION to be executed automatically after the COMMIT;
AND NO CHAIN is the default.
A CLOSE is executed automatically for every open cursor except for the cursors declared WITH
HOLD.
A cursor declared WITH HOLD is not automatically closed at COMMIT
SQL also supports savepoints.
Syntax: SAVEPOINT <savepoint name>;
This syntax creates a savepoint with the specified user-chosen name.
Syntax for roll back : ROLLBACK TO <savepoint name>;
This statement undoes all updates done since the specified savepoint.
116

Syntax for releasing savepoints: RELEASE <savepoint name>;


This statement drops the specified savepoint. All savepoints are automatically dropped at transaction
termination.

2 Mark Questions

1. What is transaction?
2. What are the two statements regarding transaction?
3. What are the properties of transaction?
4. What is recovery management component?
5. When is a transaction rolled back?
Any changes that the aborted transaction made to the database must be undone. Once the changes caused by
an aborted transaction have been undone, then the transaction has been rolled back.
6. What are the states of transaction?
7. What is a shadow copy scheme?
8. Give the reasons for allowing concurrency?
9. What is average response time?
10. What are the two types of serializability?
11. Define lock?
12. What are the different modes of lock?
13. Define deadlock?
14. Define the phases of two phase locking protocol
15. Define upgrade and downgrade?
16. What is a database graph?
17. What are the two methods for dealing deadlock problem?
18. What is a recovery scheme?
19. What are the two types of errors?
20. What are the storage types?
21. Define blocks?
22. What is meant by Physical blocks?
23. What is meant by buffer blocks?
24. What is meant by disk buffer?
25. What is meant by log-based recovery?
26. What are uncommitted modifications?
117

27. Define shadow paging.


28. Define page.
29. Explain current page table and shadow page table.
30. What are the drawbacks of shadow-paging technique?
31. Define garbage collection.
32. Differentiate strict two phase locking protocol and rigorous two phase locking protocol.
33. How the time stamps are implemented
34. What are the timestamps associated with each data item?
35. Why is it necessary to have control of concurrent execution of transaction? How is it made possible?

16 Mark Questions

1. Define Serializability. Explain the types of serializability with example.


2. Explain Deadlock with example.
3. Explain in detail about Locking Protocol.
4. Explain the Need for Concurrency Control.
5. Discuss about transaction recoverability.
6. Explain Recovery isolation levels with example.
7. Explain in detail about ACID properties.
118

UNIT V

IMPLEMENTATION TECHNIQUES
Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary storage – File
Organization – Organization of Records in Files – Indexing and Hashing –Ordered Indices – B+ tree
Index Files – B tree Index Files – Static Hashing – Dynamic Hashing – Query Processing Overview –
Catalog Information for Cost Estimation – Selection Operation – Sorting – Join Operation – Database
Tuning,Multimedia Database. Case Study:FIRM – a database management system for real time avionics.

1. OVERVIEW OF PHYSICAL STORAGE MEDIA

Classification of Physical Storage Media

» Accessing Speed
» Cost per unit of data
» Reliability
o data loss on power failure or system crash
o physical failure of the storage device
» Can differentiate storage into:
o volatile storage: loses contents when power is switched off
o non-volatile storage:
 Contents persist even when power is switched off.
 Includes secondary and tertiary storage, as well as batter-backed up main-
memory.
Storage Hierarchy

» The various storage media can be organized in a hierarchy accounting to their speed and their cost.
» The higher level is expensive, but is fast. As we move down the hierarchy, the cost per bit decreases, where
as the access time increases.
» Storage hierarchy includes 3 main categories.
1. Primary storage: Fastest media but volatile (cache, main memory).
2. Secondary storage: next level in hierarchy, non-volatile, moderately fast access time also called on-line
storage
E.g. flash memory, magnetic disks
3. Tertiary storage: lowest level in hierarchy, non-volatile, slow access time also called off-line storage
119

E.g. magnetic tape, optical storage

Physical Storage Media


» Cache – fastest and most costly form of storage; volatile; managed by the computer system
hardware.
» Main memory:
o fast access (10s to 100s of nanoseconds; 1 nanosecond = 10–9 seconds)
o generally too small (or too expensive) to store the entire database
 capacities of up to a few Gigabytes widely used currently
 Capacities have gone up and per-byte costs have decreased steadily and rapidly
(roughly factor of 2 every 2 to 3 years)
o Volatile — contents of main memory are usually lost if a power failure or system crash
occurs.
» Flash memory
o Data survives power failure
o Data can be written at a location only once, but location can be erased and written to again
 Can support only a limited number (10K – 1M) of write/erase cycles.
 Erasing of memory has to be done to an entire bank of memory
o Reads are roughly as fast as main memory
o But writes are slow (few microseconds), erase is slower
o Cost per unit of storage roughly similar to main memory
o Widely used in embedded devices such as digital cameras
o Is a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
» Magnetic Disk
o Data is stored on spinning disk, and read/written magnetically
120

o Primary medium for the long-term storage of data; typically stores entire database.
o Data must be moved from disk to main memory for access, and written back for storage
Much slower access than main memory (more on this later)
o direct-access – possible to read data on disk in any order, unlike magnetic tape
o Capacities range up to roughly 400 GB currently
 Much larger capacity and cost/byte than main memory/flash memory
 Growing constantly and rapidly with technology improvements (factor of 2 to 3 every
2 years)
o Survives power failures and system crashes
 disk failure can destroy data, but is rare
» Optical storage
o non-volatile, data is read optically from a spinning disk using a laser
o CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
o Write-one, read-many (WORM) optical disks used for archival storage (CD-R, DVD-R,
DVD+R)
o Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
o Reads and writes are slower than with magnetic disk
o Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism
for automatic loading/unloading of disks available for storing large volumes of data
» Tape storage
o non-volatile, used primarily for backup (to recover from disk failure), and for archival data
o sequential-access – much slower than disk
o very high capacity (40 to 300 GB tapes available)
o tape can be removed from drive storage costs much cheaper than disk, but drives are
expensive
o Tape jukeboxes available for storing massive amounts of data
 hundreds of terabytes (1 terabyte = 109 bytes) to even a petabyte (1 petabyte = 1012
bytes)

2. MAGNETIC-DISK
a. Data is stored on spinning disk, and read/written magnetically
121

b. Primary medium for the long-term storage of data; typically stores entire database.
c. Data must be moved from disk to main memory for access, and written back for storage
i. Much slower access than main memory (more on this later)
d. direct-access – possible to read data on disk in any order, unlike magnetic tape
e. Capacities range up to roughly 400 GB currently
i. Much larger capacity and cost/byte than main memory/flash memory
ii. Growing constantly and rapidly with technology improvements (factor of 2 to 3 every
2 years)
f. Survives power failures and system crashes
i. disk failure can destroy data, but is rare
Magnetic Hard Disk Mechanism
» Read-write head
a. Positioned very close to the platter surface (almost touching it)
b. Reads or writes magnetically encoded information.
» Surface of platter divided into circular tracks
a. Over 50K-100K tracks per platter on typical hard disks
» Each track is divided into sectors.
a. A sector is the smallest unit of data that can be read or written.
b. Sector size typically 512 bytes
c. Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks)
» To read/write a sector
a. disk arm swings to position head on right track
b. platter spins continually; data is read/written as sector passes under head
» Head-disk assemblies
a. multiple disk platters on a single spindle (1 to 5 usually)
b. One head per platter, mounted on a common arm.
» Cylinder i consists of ith track of all the platters
» Earlier generation disks were susceptible to head-crashes
a. Surface of earlier generation disks had metal-oxide coatings which would disintegrate on
head crash and damage all data on disk
b. Current generation disks are less susceptible to such disastrous failures, although individual
sectors may get corrupted
» Disk controller – interfaces between the computer system and the disk drive hardware.
122

a. accepts high-level commands to read or write a sector


b. initiates actions such as moving the disk arm to the right track and actually reading or writing
the data
c. Computes and attaches checksums to each sector to verify that data is read back correctly
i. If data is corrupted, with very high probability stored checksum won‘t match
recomputed checksum
d. Ensures successful writing by reading back sector after writing it
e. Performs remapping of bad sectors

Disk Subsystem

FigureDisk Subsystem

Multiple disks connected to a computer system through a controller

f. Controllers functionality (checksum, bad sector remapping) often carried out by individual
disks; reduces load on controller
123

Disk interface standards families


g. ATA (AT adaptor) range of standards
h. SATA (Serial ATA)
i. SCSI (Small Computer System Interconnect) range of standards
j. Several variants of each standard (different speeds and capabilities)
Performance Measures of Disks

» Access time – the time it takes from when a read or write request is issued to when data transfer
begins. Consists of:
a. Seek time – time it takes to reposition the arm over the correct track.
b. Rotational latency – time it takes for the sector to be accessed to appear under the head.
» Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
a. 25 to 100 MB per second max rate, lower for inner tracks
» Mean time to failure (MTTF) – the average time the disk is expected to run continuously without
any failure.
a. Typically 3 to 5 years
Optimization of Disk-Block Access

» Block – a contiguous sequence of sectors from a single track


a. data is transferred between disk and main memory in blocks
b. sizes range from 512 bytes to several kilobytes
i. Smaller blocks: more transfers from disk
ii. Larger blocks: more space wasted due to partially filled blocks
iii. Typical block sizes today range from 4 to 16 kilobytes
Techniques used for accessing data from disk
c. Disk-arm-scheduling algorithms order pending accesses to tracks so that disk arm
movement is minimized
d. elevator algorithm : move disk arm in one direction (from outer to inner tracks or vice
versa), processing next request in that direction, till no more requests in that direction, then
reverse direction and repeat

» File organization – optimize block access time by organizing the blocks to correspond to how data
will be accessed
a. E.g. Store related information on the same or nearby cylinders.
124

b. Files may get fragmented over time


i. E.g. if data is inserted to/deleted from the file
ii. Or free blocks on disk are scattered, and newly created file has its blocks scattered
over the disk
iii. Sequential access to a fragmented file results in increased disk arm movement
c. Some systems have utilities to defragment the file system, in order to speed up file access

» Nonvolatile write buffers speed up disk writes by writing blocks to a non-volatile RAM buffer
immediately
a. Non-volatile RAM: battery backed up RAM or flash memory
i. Even if power fails, the data is safe and will be written to disk when power returns
b. Controller then writes to disk whenever the disk has no other requests or request has been
pending for some time
c. Database operations that require data to be safely stored before continuing can continue
without waiting for data to be written to disk
d. Writes can be reordered to minimize disk arm movement

» Log disk – a disk devoted to writing a sequential log of block updates


a. Used exactly like nonvolatile RAM
i. Write to log disk is very fast since no seeks are required
ii. No need for special hardware (NV-RAM)
» File systems typically reorder writes to disk to improve performance
a. Journaling file systems write data in safe order to NV-RAM or log disk
b. Reordering without journaling: risk of corruption of file system data
125

3. RAID

» RAID: Redundant Arrays of Independent Disks


a. disk organization techniques that manage a large numbers of disks, providing a view of a
single disk of
i. high capacity and high speed by using multiple disks in parallel, and
ii. high reliability by storing data redundantly, so that data can be recovered even if a
disk fails
» The chance that some disk out of a set of N disks will fail is much higher than the chance that a
specific single disk will fail.
» Originally a cost-effective alternative to large, expensive disks

Improvement of Reliability via Redundancy

» Redundancy – store extra information that can be used to rebuild information lost in a disk failure
» E.g., Mirroring (or shadowing)
a. Duplicate every disk. Logical disk consists of two physical disks.
b. Every write is carried out on both disks
i. Reads can take place from either disk
c. If one disk in a pair fails, data still available in the other
i. Data loss would occur only if a disk fails, and its mirror disk also fails before the
system is repaired
1. Probability of combined event is very small
a. Except for dependent failure modes such as fire or building collapse or
electrical power surges
» Mean time to data loss depends on mean time to failure, and mean time to repair

Improvement in Performance via Parallelism

Two main goals of parallelism in a disk system:


» Load balance multiple small accesses to increase throughput
» Parallelize large accesses to reduce response time.

Improve transfer rate by striping data across multiple disks.


126

1. Bit-level striping – split the bits of each byte across multiple disks
a. In an array of eight disks, write bit i of each byte to disk i.
b. Each access can read data at eight times the rate of a single disk.
c. But seek/access time worse than for a single disk
i. Bit level striping is not used much any more
2. Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1
d. Requests for different blocks can run in parallel if the blocks reside on different disks
e. A request for a long sequence of blocks can utilize all disks in parallel
RAID Levels

» Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
a. Different RAID organizations, or RAID levels, have differing cost, performance and
reliability characteristics
» RAID Level 0: Block striping; non-redundant.
a. Used in high-performance applications where data lost is not critical.

» RAID Level 1: Mirrored disks with block striping


a. Offers best write performance.
b. Popular for applications such as storing log files in a database system.

» RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.

» RAID Level 3: Bit-Interleaved Parity


127

a. a single parity bit is enough for error correction, not just detection, since we know which
disk has failed
i. When writing data, corresponding parity bits must also be computed and written to a
parity bit disk
ii. To recover data in a damaged disk, compute XOR of bits from other disks (including
parity bit disk)

b. Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to
participate in every I/O.
c. Subsumes Level 2 (provides all its benefits, at lower cost).

» RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a
separate disk for corresponding blocks from N other disks.
a. Provides higher I/O rates for independent block reads than Level 3
i. block read goes to a single disk, so blocks stored on different disks can be read in
parallel
b. Provides high transfer rates for reads of multiple blocks than no-striping
c. Before writing a block, parity data must be computed
1. More efficient for writing large amounts of data sequentially

» RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1
disks, rather than storing data in N disks and parity in 1 disk.
a. E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the
data blocks stored on the other 4 disks.
b. Higher I/O rates than Level 4.
128

i. Block writes occur in parallel if the blocks and their parity blocks are on different
disks.
c. Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.

» RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information
to guard against multiple disk failures.
a. Better reliability than Level 5 at a higher cost; not used as widely.

Choice of RAID Level

» Factors in choosing RAID level


a. Monetary cost
b. Performance: Number of I/O operations per second, and bandwidth during normal operation
c. Performance during failure
d. Performance during rebuild of failed disk
i. Including time taken to rebuild failed disk
» RAID 0 is used only when data safety is not important
a. E.g. data can be recovered quickly from other sources
» Level 2 and 4 never used since they are subsumed by 3 and 5
» Level 3 is not used anymore since bit-striping forces single block reads to access all disks, wasting
disk arm movement, which block striping (level 5) avoids
» Level 6 is rarely used since levels 1 and 5 offer adequate safety for almost all applications
» So competition is between 1 and 5 only
» Level 1 provides much better write performance than level 5
» Level 1 had higher storage cost than level 5
» Level 5 is preferred for applications with low update rate, and large amounts of data
» Level 1 is preferred for all other applications
Hardware Issues
129

» Software RAID: RAID implementations done entirely in software, with no special hardware
support
» Hardware RAID: RAID implementations with special hardware
a. Use non-volatile RAM to record writes that are being executed
b. Beware: power failure during write can result in corrupted disk
i. E.g. failure after writing one block but before writing the second in a mirrored system
ii. Such corrupted data must be detected when power is restored
1. Recovery from corruption is similar to recovery from failed disk
2. NV-RAM helps to efficiently detected potentially corrupted blocks
a. Otherwise all blocks of disk must be read and compared with
mirror/parity block
» Hot swapping: replacement of disk while system is running, without power down
a. Supported by some hardware RAID systems,
b. reduces time to recovery, and improves availability greatly
» Many systems maintain spare disks which are kept online, and used as replacements for failed disks
immediately on detection of failure
a. Reduces time to recovery greatly
» Many hardware RAID systems ensure that a single point of failure will not stop the functioning of
the system by using
a. Redundant power supplies with battery backup
b. Multiple controllers and multiple interconnections to guard against controller/interconnection
failures
4. TERTIARY STORAGE

Optical Disks

» Compact disk-read only memory (CD-ROM)


a. Disks can be loaded into or removed from a drive
b. High storage capacity (640 MB per disk)
c. High seek times or about 100 msec (optical read head is heavier and slower)
d. Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to magnetic
disks
» Digital Video Disk (DVD)
a. DVD-5 holds 4.7 GB , and DVD-9 holds 8.5 GB
130

b. DVD-10 and DVD-18 are double sided formats with capacities of 9.4 GB and 17 GB
c. Other characteristics similar to CD-ROM
» Record once versions (CD-R and DVD-R) are becoming popular
a. data can only be written once, and cannot be erased.
b. high capacity and long lifetime; used for archival storage
c. Multi-write versions (CD-RW, DVD-RW and DVD-RAM) also available

Magnetic Tapes

» Hold large volumes of data and provide high transfer rates


a. Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital Linear Tape)
format, 100 GB+ with Ultrium format, and 330 GB with Ampex helical scan format
b. Transfer rates from few to 10s of MB/s
» Currently the cheapest storage medium
a. Tapes are cheap, but cost of drives is very high
» Very slow access time in comparison to magnetic disks and optical disks
a. limited to sequential access.
b. Some formats (Accelis) provide faster seek (10s of seconds) at cost of lower capacity
» Used mainly for backup, for storage of infrequently used information, and as an off-line medium for
transferring information from one system to another.
» Tape jukeboxes used for very large capacity storage
a. (terabyte (1012 bytes) to petabye (1015 bytes)

Storage Access

» A database file is partitioned into fixed-length storage units called blocks. Blocks are units of both
storage allocation and data transfer.
» Database system seeks to minimize the number of block transfers between the disk and memory.
We can reduce the number of disk accesses by keeping as many blocks as possible in main
memory.
» Buffer – portion of main memory available to store copies of disk blocks.
» Buffer manager – subsystem responsible for allocating buffer space in main memory.
» Buffer Manager
» Programs call on the buffer manager when they need a block from disk.
131

a. If the block is already in the buffer, the requesting program is given the address of the block
in main memory
b. If the block is not in the buffer,
i. the buffer manager allocates space in the buffer for the block, replacing (throwing
out) some other block, if required, to make space for the new block.
ii. The block that is thrown out is written back to disk only if it was modified since the
most recent time that it was written to/fetched from the disk.
iii. Once space is allocated in the buffer, the buffer manager reads the block from the
disk to the buffer, and passes the address of the block in main memory to requester.

» Buffer-Replacement Policies
» Most operating systems replace the block least recently used (LRU strategy)
» Idea behind LRU – use past pattern of block references as a predictor of future references
» Queries have well-defined access patterns (such as sequential scans), and a database system can use
the information in a user‘s query to predict future references
a. LRU can be a bad strategy for certain access patterns involving repeated scans of data
i. e.g. when computing the join of 2 relations r and s by a nested loops
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match …
b. Mixed strategy with hints on replacement strategy provided by the query optimizer is
preferable
» Pinned block – memory block that is not allowed to be written back to disk.
» Toss-immediate strategy – frees the space occupied by a block as soon as the final tuple of that
block has been processed

» Most recently used (MRU) strategy – system must pin the block currently being processed. After
the final tuple of that block has been processed, the block is unpinned, and it becomes the most
recently used block.
» Buffer manager can use statistical information regarding the probability that a request will reference
a particular relation
a. E.g., the data dictionary is frequently accessed. Heuristic: keep data-dictionary blocks in
main memory buffer
132

5. FILE ORGANIZATION
» The database is stored as a collection of files. Each file is a sequence of records. A record is a
sequence of fields.
Types

1. Fixed Length Record


2. Variable Length Record
Fixed-Length Records
» Simple approach:
o Store record i starting from byte n (i – 1), where n is the size of each record.
o Record access is simple but records may cross blocks
 Modification: do not allow records to cross block boundaries
» Deletion of record i:
alternatives:
o move records i + 1, . . ., n to i, . . . , n – 1
o move record n to i
o do not move records, but link all free records on a free list

FigureFile Containing account record

File of with Record 2 Deleted and All Records Moved


133

File of, with Record 2 deleted and Final Record Moved

Free Lists
» Store the address of the first deleted record in the file header.
» Use this first record to store the address of the second deleted record, and so on
» Can think of these stored addresses as pointers since they ―point‖ to the location of a record.
» More space efficient representation: reuse space for normal attributes of free records to store
pointers. (No pointers stored in in-use records.)
134

FigureFile of after deletion of records 1,4 and 6.


Variable-Length Records

» Variable-length records arise in database systems in several ways:


o Storage of multiple record types in a file.
o Record types that allow variable lengths for one or more fields.
o Record types that allow repeating fields (used in some older data models).
» Byte string representation
o Attach an end-of-record ( ) control character to the end of each record
o Difficulty with deletion
o Difficulty with growth
135

Byte-String Representation of Variable-Length Records

FigureByte-String representation of variable-Length records


Variable-Length Records: Slotted Page Structure

FigureSlotted Page Structure


» Slotted page header contains:
o number of record entries
o end of free space in the block
o location and size of each record
» Records can be moved around within a page to keep them contiguous with no empty space between
them; entry in the header must be updated.
» Pointers should not point directly to record — instead they should point to the entry for the record in
header.

Fixed-length representation:
o reserved space
o pointers
» Reserved space – can use fixed-length records of a known maximum length; unused space in shorter
records filled with a null or end-of-record symbol.
136

FigureFile of using reserved-space method


Pointer Method

FigureFile of using Linked Lists


» Pointer method
o A variable-length record is represented by a list of fixed-length records, chained together via
pointers.
o Can be used even if the maximum record length is not known
» Disadvantage to pointer structure; space is wasted in all records except the first in an a chain.
» Solution is to allow two kinds of block in file:
o Anchor block – contains the first records of chain
o Overflow block – contains records other than those that are the first records of chairs.
137

FigureAnchor-block and Overflow-block Structure

6. ORGANIZATION OF RECORDS IN FILES


» Heap – a record can be placed anywhere in the file where there is space
» Sequential – store records in sequential order, based on the value of the search key of each record
» Hashing – a hash function computed on some attribute of each record; the result specifies in which
block of the file the record should be placed
» Records of each relation may be stored in a separate file. In a clustering file organization records
of several different relations can be stored in the same file
o Motivation: store related records on the same block to minimize I/O
Sequential File Organization
» Suitable for applications that require sequential processing of the entire file
» The records in the file are ordered by a search-key

FigureSequential file for account records


138

» Deletion – use pointer chains


» Insertion –locate the position where the record is to be inserted
o if there is free space insert there
o if no free space, insert the record in an overflow block
o In either case, pointer chain must be updated
» Need to reorganize the file from time to time to restore sequential order

FigureSequential file after an insertion


Clustering File Organization

» Simple file structure stores each relation in a separate file


» Can instead store several relations in one file using a clustering file organization
Example: Consider two relation
The depositor Relation The customer Relation

» E.g., clustering organization of customer and depositor:


139

FigureClustering file structure

o good for queries involving depositor customer, and for queries involving one single customer
and his accounts
o bad for queries involving only customer
o results in variable size records
Clustering File Structure with Pointer Chains

FigureClustering file structure with pointer chains

7. INDEXING AND HASHING

» Comparison of Ordered Indexing and Hashing


» Index Definition in SQL
» Multiple-Key Access
Basic Concepts
» Indexing mechanisms used to speed up access to desired data.
o E.g., author catalog in library
» Search Key - attribute to set of attributes used to look up records in a file.
» An index file consists of records (called index entries) of the form
» Index files are typically much smaller than the original file
» Two basic kinds of indices:
o Ordered indices: search keys are stored in sorted order
o Hash indices: search keys are distributed uniformly across ―buckets‖ using a ―hash function‖.
Index Evaluation Metrics
» Access types supported efficiently. E.g.,
o records with a specified value in the attribute
o Or records with an attribute value falling in a specified range of values.
140

» Access time
» Insertion time
» Deletion time
» Space overhead
8. ORDERED INDICES
o Indexing techniques evaluated on basis of:
» In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in
library.
» Primary index: in a sequentially ordered file, the index whose search key specifies the sequential
order of the file.
o Also called clustering index
o The search key of a primary index is usually but not necessarily the primary key.
» Secondary index: an index whose search key specifies an order different from the sequential order
of the file. Also called non-clustering index.
Primary index
» Index-sequential file: ordered sequential file with a primary index.
Sequential File for account Records

FigureSequential file for account records


Dense and Sparse Indices
There are two types of ordered indices that we can use:
Dense Index Files
» Dense index — Index record appears for every search-key value in the file.
141

FigureDense Index
Sparse Index Files
» Sparse Index: contains index records for only some search-key values.
o Applicable when records are sequentially ordered on search-key
» To locate a record with search-key value K we:
o Find index record with largest search-key value < K
o Search file sequentially starting at the record to which the index record points
» Less space and less maintenance overhead for insertions and deletions.
» Generally slower than dense index for locating records.
» Good tradeoff: sparse index with an index entry for every block in file, corresponding to least
search-key value in the block.
Example of Sparse Index Files

FigureSparse Index

Secondary Indices
» Frequently, one wants to find all the records whose values in a certain field (which is not the search-
key of the primary index satisfy some condition.
142

o Example 1: In the account database stored sequentially by account number, we may want to
find all accounts in a particular branch
o Example 2: as above, but where we want to find all accounts with a specified balance or range
of balances
» We can have a secondary index with an index record for each search-key value; index record points
to a bucket that contains pointers to all the actual records with that particular search-key value.
Secondary Index on balance field of account

FigureSecondary Index on account file, on noncandidate key balance

Primary and Secondary Indices


» Secondary indices have to be dense.
» Indices offer substantial benefits when searching for records.
» When a file is modified, every index on the file must be updated, Updating indices imposes overhead
on database modification.
» Sequential scan using primary index is efficient, but a sequential scan using a secondary index is
expensive
o each record access may fetch a new block from disk
Multilevel Index
» If primary index does not fit in memory, access becomes expensive.
» To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential
file and construct a sparse index on it.
o outer index – a sparse index of primary index
o inner index – the primary index file
» If even outer index is too large to fit in main memory, yet another level of index can be created, and
so on.
» Indices at all levels must be updated on insertion or deletion from the file.
143

Index Update: Deletion


» If deleted record was the only record in the file with its particular search-key value, the search-key is
deleted from the index also.
» Single-level index deletion:
o Dense indices – deletion of search-key is similar to file record deletion.
o Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the
entry in the index with the next search-key value in the file (in search-key order). If the next
search-key value already has an index entry, the entry is deleted instead of being replaced.
Index Update: Insertion
» Single-level index insertion:
o Perform a lookup using the search-key value appearing in the record to be inserted.
o Dense indices – if the search-key value does not appear in the index, insert it.
o Sparse indices – if index stores an entry for each block of the file, no change needs to be made
to the index unless a new block is created. In this case, the first search-key value appearing in
the new block is inserted into the index.
» Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level
algorithms
9. B+-TREE INDEX FILES
B+-tree indices are an alternative to indexed-sequential files.
144

» Disadvantage of indexed-sequential files: performance degrades as file grows, since many overflow
blocks get created. Periodic reorganization of entire file is required.
» Advantage of B+-tree index files: automatically reorganizes itself with small, local, changes, in the
face of insertions and deletions. Reorganization of entire file is not required to maintain
performance.
» Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
» Advantages of B+-trees outweigh disadvantages, and they are used extensively.
A B+-tree is a rooted tree satisfying the following properties:
» All paths from root to leaf are of the same length
» Each node that is not a root or a leaf has between [n/2] and n children.
» A leaf node has between [(n–1)/2] and n–1 values
» Special cases:
o If the root is not a leaf, it has at least 2 children.
o If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1)
values.
+
B -Tree Node Structure
Figuretypical node of a B+-Tree

o Ki are the search-key values


o Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for
leaf nodes).
» The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn–1
Leaf Nodes in B+-Trees
Properties of a leaf node:
» For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket
of pointers to file records, each record having search-key value Ki. Only need bucket structure if
search-key does not form a primary key.
» If Li, Lj are leaf nodes and i < j, Li‘s search-key values are less than Lj‘s search-key values
» Pn points to next leaf node in search-key order
145

+
FigureA leaf node for account B -Trees index (n=3)

Non-Leaf Nodes in B+-Trees


» Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m
pointers:
o All the search-keys in the subtree to which P1 points are less than K1
o For 2 i n – 1, all the search-keys in the subtree to which Pi points have values greater than or
equal to Ki–1 and less than Km–1

Example of a B+-tree

+
Figure B -tree for account file (n = 3)

+
Figure B -tree for account file (n = 5)
146

» Leaf nodes must have between 2 and 4 values ( (n–1)/2 and n –1, with n = 5).
» Non-leaf nodes other than root must have between 3 and 5 children ( (n/2 and n with n =5).
» Root must have at least 2 children.
Observations about B+-trees
» Since the inter-node connections are done by pointers, ―logically‖ close blocks need not be
―physically‖ close.
» The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
» The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file),
thus searches can be conducted efficiently.
» Insertions and deletions to the main file can be handled efficiently, as the index can be restructured
in logarithmic time.
Queries on B+-Trees
» Find all records with a search-key value of k.
o Start with the root node
 Examine the node for the smallest search-key value > k.
 If such a value exists, assume it is Kj. Then follow Pi to the child node
 Otherwise k Km–1, where there are m pointers in the node. Then follow Pm to the
child node.
o If the node reached by following the pointer above is not a leaf node, repeat the above
procedure on the node, and follow the corresponding pointer.
o Eventually reach a leaf node. If for some i, key Ki = k follow pointer Pi to the desired record or
bucket. Else no record with search-key value k exists.
» In processing a query, a path is traversed in the tree from the root to some leaf node.
» If there are K search-key values in the file, the path is no longer than log n/2 (K) .

procedure find(value V )
set C = root node
while C is not a leaf node begin
Let Ki = smallest search-key value, if any, greater than V
if there is no such value then begin
Let m = the number of pointers in the node
set C = node pointed to by Pm
end
147

else set C = the node pointed to by Pi


end
if there is a key value Ki in C such that Ki = V
then pointer Pi directs us to the desired record or bucket
else no record with key value k exists
end procedure
Updates on B+-Trees: Insertion
» Find the leaf node in which the search-key value would appear
» If the search-key value is already there in the leaf node, record is added to file and if necessary a
pointer is inserted into the bucket.
» If the search-key value is not there, then add the record to the main file and create a bucket if
necessary. Then:
o If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
o Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the
next slide.
» Splitting a node:
o take the n(search-key value, pointer) pairs (including the one being inserted) in sorted order.
Place the first n/2 in the original node, and the rest in a new node.
o let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node
being split. If the parent is full, split it and propagate the split further up.
» The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the
root node may be split increasing the height of the tree by 1.

FigureResult of splitting node containing Brighton and Downtown on inserting Clearview


148

FigureB+-Tree before and after insertion of ―Clearview‖

Updates on B+-Trees: Deletion

» Find the record to be deleted, and remove it from the main file and from the bucket (if present)
» Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has
become empty
» If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a
single node, then
o Insert all the search-key values in the two nodes into a single node (the one on the left), and
delete the other node.
o Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively
using the above procedure.
» Otherwise, if the node has too few entries due to the removal, and the entries in the node and a
sibling fit into a single node, then
o Redistribute the pointers between the node and a sibling such that both have more than the
minimum number of entries.
o Update the corresponding search-key value in the parent of the node.
» The node deletions may cascade upwards till a node which has n/2 or more pointers is found. If
the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
149

Examples of B+-Tree Deletion

Figure Before and after deleting “Downtown”

» The removal of the leaf node containing ―Downtown‖ did not result in its parent having too little
pointers. So the cascaded deletions stopped with the deleted leaf node‘s parent.

Figure Deletion of “Perryridge” from result of previous example


150

» Node with ―Perryridge‖ becomes underfull (actually empty, in this special case) and merged with its
sibling.
» As a result ―Perryridge‖ node‘s parent became underfull, and was merged with its sibling (and an
entry was deleted from their parent)
» Root node then had only one child, and was deleted and its child became the new root node

FigureBefore and after deletion of “Perryridge” from earlier example

» Parent of leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling
» Search-key value in the parent‘s parent changes as a result
10. B-TREE INDEX FILES

» Similar to B+-tree, but B-tree allow search-key values to appear only once; eliminates redundant
storage of search keys.
» Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each
search key in a nonleaf node must be included.
» Generalized B-tree leaf node
151

» Nonleaf node – pointers Bi is the bucket or file record pointers.


B-Tree Index File Example

» Advantages of B-Tree indices:


o May use less tree nodes than a corresponding B+-Tree.
o Sometimes possible to find search-key value before reaching leaf node.
» Disadvantages of B-Tree indices:
o Only small fraction of all search-key values are found early
o Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth
than corresponding B+-Tree
o Insertion and deletion more complicated than in B+-Trees
o Implementation is harder than B+-Trees.
» Typically, advantages of B-Trees do not outweigh disadvantages.
152

11. HASHING TECHNIQUES


Introduction
» One disadvantage of sequential file organization is that we must access an index structure to locate data or
must use binary search, and that results in more I/O operation.
» File Organization based on the technique of hashing allow us to avoid accessing an index structure.
» Hashing also provide a way of constructing indices.
Types
1. Static Hashing
2. Dynamic Hashing
Static Hashing
» A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
» In a hash file organization we obtain the bucket of a record directly from its search-key value using
a hash function.
» Hash function h is a function from the set of all search-key values K to the set of all bucket addresses
B.
» Hash function is used to locate records for access, insertion as well as deletion.
» Records with different search-key values may be mapped to the same bucket; thus entire bucket has
to be searched sequentially to locate a record.
Hash Functions
» Worst has function maps all search-key values to the same bucket; this makes access time
proportional to the number of search-key values in the file.
» An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key
values from the set of all possible values.
» Ideal hash function is random, so each bucket will have the same number of records assigned to it
irrespective of the actual distribution of search-key values in the file.
» Typical hash functions perform computation on the internal binary representation of the search-key.
o For example, for a string search-key, the binary representations of all the characters in the
string could be added and the sum modulo the number of buckets could be returned. .
Handling of Bucket Overflows
» Bucket overflow can occur because of
o Insufficient buckets
o Skew in distribution of records. This can occur due to two reasons:
 multiple records have same search-key value
153

 chosen hash function produces non-uniform distribution of key values


» Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by
using overflow buckets.
» Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
» Above scheme is called closed hashing.
o An alternative, called open hashing, which does not use overflow buckets, is not suitable for
database applications.

FigureOverflow chaining in a hash structure


Hash Indices
» Hashing can be used not only for file organization, but also for index-structure creation.
» A hash index organizes the search keys, with their associated record pointers, into a hash file
structure.
» Strictly speaking, hash indices are always secondary indices
o if the file itself is organized using hashing, a separate primary hash index on it using the same
search-key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash
organized files.
154

Example of Hash Index

FigureHash index on search key account_number of account file

Deficiencies of Static Hashing


» In static hashing, function h maps search-key values to a fixed set of B of bucket addresses.
o Databases grow with time. If initial number of buckets is too small, performance will degrade
due to too much overflows.
o If file size at some point in the future is anticipated and number of buckets allocated
accordingly, significant amount of space will be wasted initially.
o If database shrinks, again space will be wasted.
o One option is periodic re-organization of the file with a new hash function, but it is very
expensive.
» These problems can be avoided by using techniques that allow the number of buckets to be modified
dynamically.
155

12. DYNAMIC HASHING


» Good for database that grows and shrinks in size
» Allows the hash function to be modified dynamically
» Extendable hashing – one form of dynamic hashing
o Hash function generates values over a large range — typically b-bit integers, with b =32.
o At any time use only a prefix of the hash function to index into a table of bucket addresses.
o Let the length of the prefix be i bits, 0 i 32.
o Bucket address table size = 2i. Initially i = 0
o Value of i grows and shrinks as the size of the database grows and shrinks.
o Multiple entries in the bucket address table may point to a bucket.
o Thus, actual number of buckets is < 2i
 The number of buckets also changes dynamically due to coalescing and splitting of
buckets.
General Extendable Hash Structure

o In this structure, i2 = i3 = i, whereas i1 = i – 1


FigureGeneral extendable hash structure

Use of Extendable Hash Structure


156

» Each bucket j stores a value ij; all the entries that point to the same bucket have the same values on
the first ij bits.
» To locate the bucket containing search-key Kj:
o Compute h(Kj) = X
» Use the first i high order bits of X as a displacement into bucket address table, and follow the pointer to
appropriate bucket
» To insert a record with search-key value Kj
o Follow same procedure as look-up and locate the bucket, say j.
o If there is room in the bucket j inserts record in the bucket.
o Else the bucket must be split and insertion re-attempted
 Overflow buckets used instead in some cases
o To split a bucket j when inserting record with search-key value Kj:
» If i > ij (more than one pointer to bucket j)
o Allocate a new bucket z, and set ij and iz to the old ij -+ 1.
o make the second half of the bucket address table entries pointing to j to point to z
o Remove and reinsert each record in bucket j.
o recompute new bucket for Kj and insert record in the bucket (further splitting is required if the
bucket is still full)
» If i = ij (only one pointer to bucket j)
o Increment i and double the size of the bucket address table.
o Replace each entry in the table by two entries that point to the same bucket.
o recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
» When inserting a value, if the bucket is full after several splits (that is, i reaches some limit b) create
an overflow bucket instead of splitting bucket entry table further.
» To delete a key value,
o Locate it in its bucket and remove it.
o The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket
address table).
o Coalescing of buckets can be done (can coalesce only with a ―buddy‖ bucket having same value
of ij and same ij –1 prefix, if it is present)
o Decreasing bucket address table size is also possible
157

 Note: decreasing bucket address table size is an expensive operation and should be
done only if number of buckets becomes much smaller than the size of the table
Use of Extendable Hash Structure: Example

FigureHash Function for branch_name

FigureInitial Hash structure, bucket size = 2

» Hash structure after insertion of one Brighton and two Downtown records
158

» Hash structure after insertion of Mianus record

» Hash structure after insertion of three Perryridge records


159

Hash structure after insertion of Redwood and Round Hill records

Extendable Hashing vs. Other Schemes

» Benefits of extendable hashing:


o Hash performance does not degrade with growth of file
o Minimal space overhead
» Disadvantages of extendable hashing
o Extra level of indirection to find desired record
o Bucket address table may itself become very big (larger than memory)
 Need a tree structure to locate desired record in the structure!
o Changing size of bucket address table is an expensive operation
» Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of
more bucket overflows
Comparison of Ordered Indexing and Hashing
» Cost of periodic re-organization
» Relative frequency of insertions and deletions
» Is it desirable to optimize average access time at the expense of worst-case access time?
» Expected type of queries:
o Hashing is generally better at retrieving records having a specified value of the key.
o If range queries are common, ordered indices are to be preferred
13. QUERY PROCESSING
160

Basic Steps in Query Processing


o Parsing and translation
o Optimization
o Evaluation

» Parsing and translation


o Translate the query into its internal form. This is then translated into relational algebra.
o Parser checks syntax, verifies relations
» Evaluation
o The query-execution engine takes a query-evaluation plan, executes that plan, and returns the
answers to the query.
Basic Steps in Query Processing: Optimization

» A relational algebra expression may have many equivalent expressions


E.g., balance 2500( balance(account)) is equivalent to
Balance ( balance 2500(account))

» Each relational algebra operation can be evaluated using one of several different algorithms
o Correspondingly, a relational-algebra expression can be evaluated in many ways.
» Annotated expression specifying detailed evaluation strategy is called an evaluation-plan.
o E.g., can use an index on balance to find accounts with balance < 2500,
or can perform complete relation scan and discard accounts with balance 2500
161

FigureA query evaluation plan


» Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost.
o Cost is estimated using statistical information from the database catalog
Example: number of tuples in each relation, size of tuples, etc.
» In this Query Processing we study
 How to measure query costs
 Algorithms for evaluating relational algebra operations
 How to combine algorithms for individual operations in order to evaluate a complete
expression
 We study how to optimize queries, that is, how to find an evaluation plan with lowest
estimated cost
Measures of Query Cost

» Cost is generally measured as total elapsed time for answering query


o Many factors contribute to time cost
 disk accesses, CPU, or even network communication
» Typically disk access is the predominant cost, and is also relatively easy to estimate. Measured by
taking into account
o Number of seeks * average-seek-cost
o Number of blocks read * average-block-read-cost
o Number of blocks written * average-block-write-cost
 Cost to write a block is greater than cost to read a block
data is read back after being written to ensure that the write was successful
» For simplicity we just use number of block transfers from disk as the cost measure
» Costs depends on the size of the buffer in main memory
» Real systems take CPU cost into account, differentiate between sequential and random I/O, and take
buffer size into account
» We do not include cost to writing output to disk in our cost formulae
14. SELECTION OPERATION
» File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
162

» Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the
selection condition.
o Cost estimate = br block transfers + 1 seek
 br denotes number of blocks containing records from relation r
o If selection is on a key attribute, can stop on finding record
 cost = (br /2) block transfers + 1 seek
o Linear search can be applied regardless of
 selection condition or
 ordering of records in the file, or
 availability of indices
» A2 (binary search). Applicable if selection is an equality comparison on the attribute on which file
is ordered.
o Assume that the blocks of a relation are stored contiguously
o Cost estimate (number of disk blocks to be scanned):
 cost of locating the first tuple by a binary search on the blocks
– log2(br) * (tT + tS)
 If there are multiple records satisfying selection
– Add transfer cost of the number of blocks containing records that satisfy
selection condition
Selections Using Indices
» Index scan – search algorithms that use an index
 Selection condition must be on search-key of index.
» A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding
equality condition
 Cost = (hi + 1) * (tT + tS)
» A4 (primary index on nonkey, equality) Retrieve multiple records.
 Records will be on consecutive blocks
o Let b = number of blocks containing matching records
 Cost = hi * (tT + tS) + tS + tT * b

» A5 (equality on search-key of secondary index).


 Retrieve a single record if the search-key is a candidate key
o Cost = (hi + 1) * (tT + tS)
 Retrieve multiple records if search-key is not a candidate key
163

o each of n matching records may be on a different block


o Cost = (hi + n) * (tT + tS)
– Can be very expensive
Selections Involving Comparisons
» Can implement selections of the form A V (r) or A V(r) by using
o a linear file scan or binary search,
o or by using indices in the following ways:
» A6 (primary index, comparison). (Relation is sorted on A)
 For A V(r) use index to find first tuple v and scan relation sequentially from there
 For A V (r) just scan relation sequentially till first tuple > v; do not use index
» A7 (secondary index, comparison).
 For A V(r) use index to find first index entry v and scan index sequentially from
there, to find pointers to records.
 For A V (r) just scan leaf pages of index finding pointers to records, till first entry > v
 In either case, retrieve records that are pointed to
– requires an I/O for each record
– Linear file scan may be cheaper
Implementation of Complex Selections
Conjunction: 1 2 . . . n(r)
» A8 (conjunctive selection using one index).
o Select a combination of i and algorithms A1 through A7 that results in the least cost for i

(r).
o Test other conditions on tuple after fetching it into memory buffer.
» A9 (conjunctive selection using multiple-key index).
o Use appropriate composite (multiple-key) index if available.
» A10 (conjunctive selection by intersection of identifiers).
o Requires indices with record pointers.
o Use corresponding index for each condition, and take intersection of all the obtained sets of
record pointers.
o Then fetch records from file
o If some conditions do not have appropriate indices, apply test in memory.
Disjunction: 1 2 ... n (r).

» A11 (disjunctive selection by union of identifiers).


164

o Applicable if all conditions have available indices.


 Otherwise use linear scan.
o Use corresponding index for each condition, and take union of all the obtained sets of record
pointers.
o Then fetch records from file
» Negation: (r)
o Use linear scan on file
o If very few records satisfy , and an index is applicable to
 Find satisfying records using index and fetch from file

14. SORTING
» We may build an index on the relation, and then use the index to read the relation in sorted order.
May lead to one disk block access for each tuple.
» For relations that fit in memory, techniques like quicksort can be used. For relations that don‘t fit in
memory, external sort-merge is a good choice.
External Sort-Merge
o Let M denote memory size (in pages).
» Create sorted runs.
Let i be 0 initially.
Repeatedly do the following till the end of the relation:
(a) Read M blocks of relation into memory
(b) Sort the in-memory blocks
(c) Write sorted data to run Ri; increment i.
Let the final value of i be N
» Merge the runs (N-way merge).
We assume (for now) that N < M.
o Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first
block of each run into its buffer page
o repeat
 Select the first record (in sort order) among all buffer pages
 Write the record to the output buffer. If the output buffer is full write it to disk.
165

 Delete the record from its input buffer page.


If the buffer page becomes empty then
read the next block (if any) of the run into the buffer.
o until all input buffer pages are empty:
» If N M, several merge passes are required.
1. In each pass, contiguous groups of M - 1 runs are merged.
2. A pass reduces the number of runs by a factor of M -1, and creates runs longer by the same
factor.
 E.g. If M=11, and there are 90 runs, one pass reduces the number of runs to 9, each
10 times the size of the initial runs
o Repeated passes are performed till all runs have been merged into one.
Example: External Sorting Using Sort-Merge

» Cost analysis:
o Total number of merge passes required: logM–1(br/M) .
o Block transfers for initial run creation as well as in each pass is 2br
 for final pass, we don‘t count write cost
we ignore final write cost for all operations since the output of an operation
may be sent to the parent operation without being written to disk
 Thus total number of block transfers for external sorting:
br ( 2 logM–1(br / M) + 1)
166

» Cost of seeks
o During run generation: one seek to read each run and one seek to write each run
 2 br / M
o During the merge phase
 Buffer size: bb (read/write bb blocks at a time)
 Need 2 br / bb seeks for each merge pass
except the final one which does not require a write
 Total number of seeks:
2 br / M + br / bb (2 logM–1(br / M) -1)

15. JOIN OPERATION


» Several different algorithms to implement joins
1. Nested-loop join
2. Block nested-loop join
3. Indexed nested-loop join
4. Merge-join
5. Hash-join
» Choice based on cost estimate
» Examples use the following information
o Number of records of customer: 10,000 depositor: 5000
o Number of blocks of customer: 400 depositor: 100
Nested-Loop Join
» To compute the theta join r s
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr,ts) to see if they satisfy the join condition
if they do, add tr • ts to the result.
end
end
» r is called the outer relation and s the inner relation of the join.
» Requires no indices and can be used with any kind of join condition.
» Expensive since it examines every pair of tuples in the two relations.
167

» In the worst case, if there is enough memory only to hold one block of each relation, the estimated
cost is
nr bs + br
block transfers, plus
nr + br
seeks
» If the smaller relation fits entirely in memory, use that as the inner relation.
o Reduces cost to br + bs block transfers and 2 seeks
» Assuming worst case memory availability cost estimate is
o with depositor as outer relation:
 5000 400 + 100 = 2,000,100 block transfers,
 5000 + 100 = 5100 seeks
o with customer as the outer relation
 10000 100 + 400 = 1,000,400 block transfers and 10,400 seeks
» If smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 block transfers.

Block Nested-Loop Join

» Variant of nested-loop join in which every block of inner relation is paired with every block of outer
relation.
for each block Br of r do begin
for each block Bs of s do begin
for each tuple tr in Br do begin
for each tuple ts in Bs do begin
Check if (tr,ts) satisfy the join condition
if they do, add tr • ts to the result.
end
end
end
end
» Worst case estimate: br bs + br block transfers + 2 * br seeks
» Best case: br + bs block transfers + 2 seeks.
Indexed Nested-Loop Join
» Index lookups can replace file scans if
168

o join is an equi-join or natural join and


o an index is available on the inner relation‘s join attribute
» For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join
condition with tuple tr.
» Worst case: buffer has space for only one page of r, and, for each tuple in r, we perform an index
lookup on s.
» Cost of the join: br (tT + tS) + nr c
o Where c is the cost of traversing index
Example of Nested-Loop Join Costs
» Compute depositor customer, with depositor as the outer relation.
» Let customer have a primary B+-tree index on the join attribute customer-name, which contains 20
entries in each index node.
» Since customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find
the actual data
» depositor has 5000 tuples
» Cost of block nested loops join
o 400*100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks
» Cost of indexed nested loops join
o 100 + 5000 * 5 = 25,100 block transfers and seeks.
o CPU cost likely to be less than that for block nested loops join
Merge-Join
» Sort both relations on their join attribute (if not already sorted on the join attributes).
» Merge the sorted relations to join them
o Join step is similar to the merge stage of the sort-merge algorithm.
o Main difference is handling of duplicate values in join attribute — every pair with same
value on join attribute must be matched
» Can be used only for equi-joins and natural joins
» Each block needs to be read only once (assuming all tuples for any given value of the join attributes
fit in memory)
» Thus the cost of merge join is:
br + bs block transfers + br / bb + bs / bb seeks
+ the cost of sorting if relations are unsorted.
169

Hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join
attribute
Hash-Join

» Applicable for equi-joins and natural joins.


» A hash function h is used to partition tuples of both relations
» h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s
used in the natural join.
o r0, r1, . . ., rn denote partitions of r tuples
 Each tuple tr r is put in partition ri where i = h(tr [JoinAttrs]).
o r0,, r1. . ., rn denotes partitions of s tuples
 Each tuple ts s is put in partition si, where i = h(ts [JoinAttrs]).
 Note: In book, ri is denoted as Hri, si is denoted as Hsi and n is denoted as nh.

o
» r tuples in ri need only to be compared with s tuples in si Need not be compared with s tuples in any
other partition.

Hash-Join Algorithm

The hash-join of r and s is computed as follows.

» Partition the relation s using hashing function h. When partitioning a relation, one block of memory
is reserved as the output buffer for each partition.
» Partition r similarly.
» For each i:
170

 Load si into memory and build an in-memory hash index on it using the join attribute. This hash
index uses a different hash function than the earlier one h.
 Read the tuples in ri from the disk one by one. For each tuple tr locate each matching tuple ts in
si using the in-memory hash index. Output the concatenation of their attributes.
» Relation s is called the build input and r is called the probe input.
» The value n and the hash function h is chosen such that each si should fit in memory.
» Recursive partitioning required if number of partitions n is greater than number of pages M of
memory.
Handling of Overflows
» Partitioning is said to be skewed if some partitions have significantly more tuples than some others
» Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be
» Many tuples in s with same value for join attributes
» Bad hash function
» Overflow resolution can be done in build phase
» Partition si is further partitioned using different hash function.
» Partition ri must be similarly partitioned.
» Overflow avoidance performs partitioning carefully to avoid overflows during build phase
» E.g. partition build relation into many partitions, then combine them
» Both approaches fail with large numbers of duplicates
» Fallback option: use block nested loops join on overflowed partitions
Cost of Hash-Join
» If recursive partitioning is not required: cost of hash join is
3(br + bs) +4 nh
» Total cost estimate is:
2(br + bs logM–1(bs) – 1 + br + bs block transfers +
2( br / bb + bs / bb ) logM–1(bs) – 1 seeks
» If the entire build input can be kept in main memory no partitioning is required
» Cost estimate goes down to br + bs.
Hybrid Hash–Join
» Useful when memory sized are relatively large, and the build input is bigger than memory.
» Main feature of hybrid hash join:
 Keep the first partition of the build relation in memory.
171

» E.g. With memory size of 25 blocks, depositor can be partitioned into five partitions, each of size 20
blocks.
» Division of memory:
 The first partition occupies 20 blocks of memory
 1 block is used for input, and 1 block each for buffering the other 4 partitions.
» customer is similarly partitioned into five partitions each of size 80
» the first is used right away for probing, instead of being written out
» Cost of 3(80 + 320) + 20 +80 = 1300 block transfers for
hybrid hash join, instead of 1500 with plain hash-join.
Complex Joins
» Join with a conjunctive condition:
r 1 2 ... n s
» Either use nested loops/block nested loops, or
» Compute the result of one of the simpler joins r i s
 final result comprises those tuples in the intermediate result that satisfy the remaining
conditions
1 ... i –1 i +1 ... n

» Join with a disjunctive condition


r 1 2 ... ns

» Either use nested loops/block nested loops, or Compute as the union of the records in
individual joins
(r 1 s) (r 2 s) ... (r n s)
Other Operations
» Duplicate elimination can be implemented via hashing or sorting.
» On sorting duplicates will come adjacent to each other, and all but one set of duplicates can
be deleted.
» Hashing is similar – duplicates will come into the same bucket.
» Projection:
» Perform projection on each tuple followed by duplicate elimination.
» Aggregation can be implemented in a manner similar to duplicate elimination.
» Sorting or hashing can be used to bring tuples in the same group together, and then the
aggregate functions can be applied on each group.
172

» Set operations ( , and ): can either use variant of merge-join after sorting, or variant of hash-
join.
r s:
o Add tuples in si to the hash index if they are not already in it.
o At end of si add the tuples in the hash index to the result.
r s:
o output tuples in si to the result if they are already there in the hash index
r – s:
o for each tuple in si, if it is there in the hash index, delete it from the index.
o At end of si add remaining tuples in the hash index to the result.
» Outer join can be computed either as
» A join followed by addition of null-padded non-participating tuples.
» by modifying the join algorithms.
» Modifying merge join to compute r s
» In r s, non participating tuples are those in r – R(r s)
» Modify merge-join to compute r s: During merging, for every tuple tr from r that do
not match any tuple in s, output tr padded with nulls.
» Right outer-join and full outer-join can be computed similarly.
» Modifying hash join to compute r s
» If r is probe relation, output non-matching r tuples padded with nulls
» If r is build relation, when probing keep track of which r tuples matched s tuples. At end of
si output non-matched r tuples padded with nulls
173

2 Mark Questions
1. Define query optimization.
2. What is an index?
3. What are called jukebox systems?
4. What are the types of storage devices?
5. What is called remapping of bad sectors?
6. Define access time.
7. Define seek time.
8. Define average seek time.
9. Define rotational latency time.
10. Define average latency time.
11. What is meant by data-transfer rate?
12. What is meant by mean time to failure?
13. What is a block and a block number?
14. What are called journaling file systems?
15. What is the use of RAID?
16. What is called mirroring?
17. What is called mean time to repair?
18. What is called bit-level striping?
19. What is called block-level striping?
20. What are the two main goals of parallelism?
21. What are the factors to be taken into account when choosing a RAID level?
22. What is meant by software and hardware RAID systems?
23. Define hot swapping?
24. What are the ways in which the variable-length records arise in database systems?
25. What is the use of a slotted-page structure and what is the information present in the header?
26. What are the two types of blocks in the fixed –length representation? Define them.
27. What is known as heap file organization?
28. What is known as sequential file organization?
29. What is hashing file organization?
30. What is known as clustering file organization?
31. What are the types of indices?
32. What are the techniques to be evaluated for both ordered indexing and hashing?
174

33. What is known as a search key?


34. What is a primary index?
35. What are called index-sequential files?
36. What are the two types of indices?
37. What are called multilevel indices?
38. What is B-Tree?
39. What is a B+-Tree index?
40. What is a hash index?
41. What is called query processing?
42. What are the steps involved in query processing?
43. What is called an evaluation primitive?
44. What is called a query evaluation plan?
45. What is called a query –execution engine?
46. What are called as index scans?
47. What is called as external sorting?
48. What is called as recursive partitioning?
49. What is called as an N-way merge?
50. What is known as fudge factor?
51. How to choose the best evaluation plan for Query?

16 Mark Questions

1. Explain various hashing techniques.


2. Explain RAID.
3. Explain the steps in Query processing.
4. Explain B+ Tree and B-tree.
5. a) Explain the types of File Organization.
b) Explain some basic algorithms used in selection operation., Sorting and Join operation.
6. Explain physical storage media with example.

You might also like