DBMS Reference Notes


CHAPTER-1

Database Management System (DBMS):

- A DBMS is a collection of interrelated data and a set of programs to access
those data.
- The primary goal of a DBMS is to provide a way to store and retrieve
database information that is both convenient and efficient.

Database systems are designed to manage large bodies of information.
Management of data involves both defining structures for the storage of
information and providing mechanisms for the manipulation of information. In
addition, the database system must ensure the safety of the information
stored, despite system crashes or attempts at unauthorized access.

Applications of Database Systems:

Enterprise Information:
 Sales: Information about customers, products and purchases.
 Accounting: Payments, receipts, account balances, assets and other
accounting information.
 Human resources: Information about employees, salaries, taxes and
benefits.
 Manufacturing: Tracking production, inventories of items in warehouses,
orders for items.

Banking and Finance:

 Banking: Customer information, accounts, loans and banking
transactions.
 Credit card transactions: For purchases on credit cards and generation of
monthly statements.
 Finance: For storing information about holdings, sales and purchases of
financial instruments such as stocks and bonds.

Universities: For student information, course registration, and grades.

Airlines: For reservations and schedule information.

Telecommunication: For keeping records of calls made, generating monthly
bills, maintaining balances on prepaid calling cards.

Purpose of Database Systems:

File Processing System: Before database management systems were
introduced, organizations usually stored information in a file processing
system. This system stores permanent records in various files, and it needs
different application programs to extract records from and add records to the
appropriate files.

Keeping organizational information in a file processing system has a number
of major disadvantages:

1. Data redundancy and inconsistency:

- The same information may be duplicated in several places (files).
- This redundancy leads to higher storage and access costs.
- In addition, it may lead to data inconsistency; that is, the various
copies of the same data may no longer agree.

2. Difficulty in accessing data:

- The designers of the original system may not have anticipated all the
requests that are needed in a modern business.
- File processing systems do not allow needed data to be retrieved in
a convenient and efficient manner.

3. Data isolation:

- Data are scattered in various files, and files may be in different
formats.
- Writing new application programs to retrieve the appropriate data is
difficult.

4. Integrity Problems:

- The data values stored in the database must satisfy certain types of
consistency constraints.
- Developers enforce these constraints in the system by adding
appropriate code in the various application programs.
- When new constraints are added, it is difficult to change the
programs to enforce them.
- The problem is compounded when constraints involve several data
items from different files.

5. Atomicity problems:

- A computer system, like any other device, is subject to failure.
- It is necessary that, if a failure occurs, the data be restored to the
consistent state that existed prior to the failure.
- The action must be atomic: it must happen in its entirety or not at
all.

6. Concurrent access anomalies:

- Many systems allow multiple users to update the data
simultaneously.
- In such an environment, interaction of concurrent updates is possible,
which may result in inconsistent data.
- Supervision is needed to guard against such possibilities, which is
not possible in a conventional file processing system.

7. Security Problems:

- Not every user of the database system should be able to access all
the data.
- In a file processing system, the application programs do not
enforce such security.

These difficulties with the conventional file processing system
prompted the development of database systems.

Data Abstraction:

The purpose of a database system is to provide users with an abstract view of
the data. This means the system hides certain details of how the data are
stored and maintained.

Since many users are not computer trained, developers hide this complexity
from users through several levels of abstraction, to simplify users’ interactions
with the system.

1. Physical Level: This level describes how the data are actually stored. It
uses complex, low-level data structures.

2. Logical Level: This level describes what data are stored in the
database and what relationships exist among those data. It
describes the entire database in terms of relatively simple structures.
Users of the logical level do not need to be aware of the
complexity of the physical level. Database administrators, who decide
what data to keep in the database, use the logical level.

3. View Level: This level describes only a part of the database. Because
of the variety of information stored in a large database, the logical level
is still complex for many users, who need to access only a part
of the database. The view level simplifies their interaction with the
system. The system provides many views for the same database.

Fig: Three levels of data abstraction (view 1 … view n at the view level on
top, the logical level below them, and the physical level at the bottom)
Instances and Schemas:

A database changes over time as information is inserted and deleted. The
collection of information stored in the database at a particular moment is
called an instance.
The overall design of the database is called the database schema.

According to the level of abstraction, there are three database schemas:

Physical Schema: describes the database design at the physical level.
Logical Schema: describes the database design at the logical level.
Subschemas: schemas at the view level that describe different views of the
database.

Data Models:

A data model consists of a set of conceptual tools that are used to describe
the structure (data types, relationships, and constraints) of a database.

1. Hierarchical data model:

- Oldest type of data model.
- It is a record-based representational (implementation) data model. Different
records are inter-related through hierarchical or tree-like structures.
- The root may have any number of dependents; each of these dependents
may have any number of lower-level dependents.
- A parent record can have several children, but a child can have only one
parent.
- It therefore supports only one-to-one and one-to-many relationships.

2. Network data model:

- An extension of the hierarchical database structure.
- It is also a record-based representational (implementation) data model.
- More flexible than the hierarchical data model: it describes data and
relationships using a graph rather than a tree-like structure.
- Unlike the hierarchical data model, it is able to represent many-to-many
relationships as well.

3. Entity-Relationship Model:
- The entity-relationship model is based on the perception of real-world objects
called entities and relationships among those objects.
- It is a conceptual data model.
- The database is modeled as a collection of entities and relationships among
those entities.
- It is represented graphically by these basic components:
 Rectangles (represent entity sets)
 Ellipses (represent attributes)
 Diamonds (represent relationship sets among entity sets)
 Lines (link attributes to entity sets and entity sets to relationship sets)
4. Relational model:
- It is also a representational (implementation) data model.
- Unlike the hierarchical and network models, there are no physical links.
- Data is maintained in the form of tables consisting of rows and columns.
- Each row (record) represents an entity and each column (field) represents an
attribute of the entity.
- The relationship between two tables is implemented through a common
attribute in the tables, not by physical links or pointers (see the sketch below).
- This makes querying much easier in a relational database system than in a
hierarchical or network database system.
- It is more programmer friendly and the most popular model in today’s
commercial world.
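To make the common-attribute idea concrete, here is a minimal Python sketch
(the student and enrollment rows are invented for illustration): the two tables
are related only by sharing the SID column, and a query matches rows on that
value rather than following pointers.

    # Two tables related through a common attribute (SID), not pointers.
    students = [
        {"SID": 1, "name": "Ram"},
        {"SID": 2, "name": "Sita"},
    ]
    enrollments = [
        {"SID": 1, "course": "DBMS"},
        {"SID": 2, "course": "Networks"},
        {"SID": 1, "course": "OS"},
    ]

    # "Which courses does each student take?" -- match rows on the shared column.
    for s in students:
        courses = [e["course"] for e in enrollments if e["SID"] == s["SID"]]
        print(s["name"], courses)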

5. Object Oriented data model:

- The object-oriented data model is based on the object-oriented programming
paradigm.
- It consists of the basic object-oriented concepts:
 Object and object identifier: Any real-world entity is uniformly modeled
as an object (associated with a unique id).
 Attributes and methods: Every object has a state (the set of values for
the attributes of the object) and a behavior (the set of methods or program
code which operate on the state of the object).
 Class: a means of grouping all the objects which share the same set of
attributes and methods.
 Class hierarchy and inheritance: Derive a new class (subclass) from an
existing class (superclass). The subclass inherits all the attributes and
methods of the superclass and may have additional attributes and
methods.

Data Independence
• The capacity to change the database schema without affecting
application programs is called data independence.
• Database system provides two types of data independence:
1. Logical Data Independence
2. Physical Data Independence

Logical Data Independence:

• The capacity to change the conceptual schema without having to
change associated application programs is called logical data
independence.
• The correspondence between a particular external view and the
conceptual view is called the external/conceptual mapping.
• When a modification is made to the conceptual schema, only the
external/conceptual mapping needs to be changed. This is the
responsibility of the database administrator.
Physical Data Independence:
• The capacity to change the internal schema without affecting
application programs is called physical data independence.
• The correspondence between the conceptual view and the stored
database is called conceptual/internal mapping. It specifies how
conceptual records are represented at the internal level.
• If a change is made to the storage structure, then the
conceptual/internal mapping must be changed accordingly. The DBA is
responsible for managing such changes.

Database Users:
• Database users are those who interact with the database in order to
query and update the database and generate reports.
• On the basis of how users interact with the database, users are
classified as below:
- Naive users: normal users who invoke previously written application programs.
- Sophisticated users: business analysts and scientists who use a database
query language.
- Specialized users: write specialized database programs such as
computer-aided design systems.
- Application programmers: computer professionals who develop application
programs to facilitate easy data access.

Database Administrator:
• The DBA is the person who has central control over both the data and the
application programs.
• Responsibilities of DBA:
- Schema definition and modification
- New software installation
- Security enforcement and administration
- Data analysis
- Preliminary database design
- Physical organization modification
- Routine maintenance checks

Application Architecture:

• There are three different architectures:

- One-tier (Centralized Architecture): The database system, application
programs and user interface are all executed on a single system, and
dumb terminals are used only to display the information.
- Client/server architecture:
Two-tier
• In this system, the user interface and application program are processed on
the client side.
• The database system is processed on the server side.
• The end users work on the client computer and the database system runs
on the server.
Three-tier:
• Three-tier architecture is primarily used for web-based applications.
• It adds an intermediate layer known as the application server (web server)
between the client and the database server.
• The client communicates with the application server, which in turn
communicates with the database server (a minimal sketch of this call chain
follows).
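The three-tier call chain can be illustrated with a small Python sketch; the
function names and the toy data store are hypothetical, chosen only to show
that the client never contacts the database directly:

    # Hypothetical three-tier call chain (illustrative names, not a real framework).
    def database_server(query):
        # Tier 3: the only layer that touches the stored data.
        data = {"balance:A-101": 500}
        return data.get(query)

    def application_server(request):
        # Tier 2: business logic; talks to the database on the client's behalf.
        if request["action"] == "get_balance":
            return database_server("balance:" + request["account"])

    def client(account):
        # Tier 1: user interface; never contacts the database directly.
        return application_server({"action": "get_balance", "account": account})

    print(client("A-101"))  # 500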
CHAPTER-2

1. Entity and Entity Sets

• An entity is a thing or object in the real world that is distinguishable from other objects, e.g., each student in
a class is an entity.

• An entity has attributes whose values distinguish it from other entities.

• An entity set is a set of entities of the same type that share the same properties or attributes, e.g., the set of
all students in a class is an entity set.

• Entity sets are represented by a rectangle with the entity name within it in an E-R diagram.

2. Attributes
• Attributes are the pieces of information that describe the properties of an entity, e.g., a student entity can
have student-id, student name, and student address.

• They are the descriptive properties possessed by each entity of an entity set.

• Each entity may have its own value for each attribute.

• Attributes are represented by an oval with the attribute name inside it.

Types:

1. Simple and composite attributes:

 The attributes that cannot be divided into subparts are called simple (atomic) attributes. E.g., the roll
number attribute of a student cannot be further divided into subparts; thus the roll number attribute
of a student entity acts as a simple attribute.
 The attributes that can be divided into subparts are called composite attributes. E.g., the name attribute
of a particular student can be further divided into the subparts first-name, middle-name and last-name;
thus the name attribute acts as a composite attribute.
2. Single-valued and Multi-valued attributes:
 The attributes which can have only one value are called single-valued attributes. E.g., date-of-
birth of a student is a single-valued attribute because a person cannot have two dates of birth.
 The attributes that can have more than one value are called multi-valued attributes. E.g., a
student can have more than one contact number, therefore contact number is a multi-valued
attribute. Multi-valued attributes are represented by a double oval in an E-R diagram.
3. Stored versus Derived attributes:
 An attribute whose value need not be stored, but rather can be computed from other attributes
or entities, is called a derived attribute. E.g., the value of age can be determined from the current
(today’s) date and the value of that person’s birth-date attribute. The age attribute is hence
called a derived attribute and is said to be derivable from the birth-date attribute. It
is represented by a dotted oval in an ER diagram.
 An attribute whose value is stored in the database is called a stored attribute.

3. Relationship and Relationship Sets:

• An association between two or more entities is called a relationship, e.g., in “Raj teaches Manju”, teaches is
the association between the entities Raj and Manju.

• An association between two entity sets is called a relationship set, e.g., in “Teacher teaches students”,
teaches is the association between the entity sets teacher and student.

• Relationships are represented by a diamond symbol in an E-R diagram.

• A relationship may also have attributes called descriptive attributes. Consider a relationship set
depositor with entity sets customer and account. We could associate the attribute access-date to that
relationship to specify the most recent date on which a customer accessed an account.

Degree of relationship
• The number of entity sets that participate in a relationship set is called the degree of the relationship set.

• On the basis of degree, relationships can be divided as below:

1. Unary Relationship:
 If only one entity set participates in a relationship (more than once), it is called a unary relationship.
 Here the same entity set participates in the relationship twice, with different roles.
 Role names are specified above the link joining the entity set and the relationship set.
 This type of relationship set is sometimes called a recursive relationship set.
2. Binary Relationship:
 Relationship sets in which two entity sets participate are called binary relationships. In other
words, relationship sets of degree 2 are called binary relationships.
 This is the most common type of relationship in database systems.
3. N-ary Relationship:
 A relationship set in which more than two entity sets participate is called an n-ary relationship.
 Ternary and quaternary relationships are special cases of n-ary relationships.
 For example, the ternary relationship Arrange between the three entity sets financial institution, solicitors
and buyers.

4. Mapping Cardinality Constraints:

• The E-R model constraint that describes the maximum number of possible relationship occurrences for an
entity set participating in a given relationship type is called mapping cardinality.

• On the basis of cardinality ratio, relationships can be categorized into one-to-one, one-to-many, many-to-
one, and many-to-many.

• We denote cardinality constraints by drawing either a directed line (→) signifying “one” or an
undirected line (—) signifying “many” between the relationship set and the entity set.

1. One-to-one relationship: If every entity in A is associated with at most one entity in B and vice-versa,
then the relationship is called a one-to-one relationship, e.g., every bank has only one CEO and a
person can be CEO of only one bank.

2. One-to-many relationship: If an entity in A can be associated with any number of entities in B but
every entity in B can be associated with at most one entity in A, then it is called a one-to-many
relationship. E.g., a mother can have any number of children but a child can have only one mother.

3. Many-to-one relationship: If every entity in A can be associated with only one entity in B but an
entity in B can be associated with any number of entities in A, then it is called a many-to-one
relationship. E.g., a book is always published by only one publisher but a publisher can publish any
number of books.

4. Many-to-many relationship: If an entity in A can be associated with any number of entities in B and
vice-versa, then it is called a many-to-many relationship. E.g., a student can enroll in more than one
subject and a subject can be enrolled in by many students.

Participation Constraints:
• A constraint on the ER model that determines whether all or only some entity occurrences participate in a
relationship is called a participation constraint.

• There are two types of participation constraints: total participation constraints and partial participation
constraints.

• The participation of an entity set A in a relationship set R is said to be total if every entity in A
participates in the relationship at least once. This is represented by a double line.

• The participation of an entity set A in a relationship set R is said to be partial if only some of the members
of A participate in the relationship. It is represented by a single line.

5. Keys:
• A set of one or more attributes whose values are distinct for each individual entity in the entity set is called
a key.

• Its values can be used to identify each entity uniquely. E.g., the name attribute is a key of the entity set
company because no two companies are allowed to have the same name.

Types of Keys:
Super Key:

• A super key of an entity set is a set of one or more attributes whose values uniquely determine each
entity in the entity set.

• If K is a super key, then any superset of K is also a super key.

• Example: the student-id attribute of the entity set student is sufficient to distinguish one student entity from
another. Thus student-id is a super key. Similarly, the set of attributes {roll-number, name, address} is
also a super key of the entity set student.

Candidate Key:

• A candidate key of an entity set is a minimal super key. That is, a super key none of whose
proper subsets is itself a super key is called a candidate key.

• For example, student-id is a candidate key of the entity set student, but the set of attributes {roll-number,
name, address} is not a candidate key of the entity set student because it has a proper subset,
{roll-number}, that is itself a super key.

• All candidate keys are super keys, but the converse is not true.

Primary Key:

• A primary key is a candidate key that is chosen by the database designer as the principal means of uniquely
identifying entities within an entity set.

• There may exist several candidate keys; one of the candidate keys is selected to be the primary key.

• For example: the entity set student has candidate keys student-id and roll-number; if the database designer
chooses student-id for the purpose of uniquely identifying entities within the entity set, then it becomes the
primary key.

• A primary key must satisfy the following two characteristics:

- It cannot be null

- It cannot be duplicated

(A small sketch testing the super-key and candidate-key definitions follows.)
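The definitions above can be checked mechanically. Below is a minimal Python
sketch (the student rows are invented sample data): a set of attributes is a
super key if projecting the rows onto it leaves every row distinct, and a
candidate key if, in addition, removing any one attribute breaks that property.

    def is_superkey(rows, attrs):
        # No two rows agree on all the listed attributes.
        projected = {tuple(row[a] for a in attrs) for row in rows}
        return len(projected) == len(rows)

    def is_candidate_key(rows, attrs):
        # A minimal super key: dropping any attribute must break uniqueness.
        return is_superkey(rows, attrs) and all(
            not is_superkey(rows, [a for a in attrs if a != x]) for x in attrs)

    students = [
        {"student_id": 1, "roll_number": 10, "name": "Ram"},
        {"student_id": 2, "roll_number": 11, "name": "Ram"},
    ]
    print(is_superkey(students, ["student_id", "name"]))       # True
    print(is_candidate_key(students, ["student_id", "name"]))  # False: not minimal
    print(is_candidate_key(students, ["student_id"]))          # True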

6. Primary keys for Relationship Sets:


• Primary keys of an entity sets allows us to distinguish among the various entities of an entity set. We
need a similar mechanism to distinguish among the various relationships of a relationship set.

• Let R be a relationship set involving entity sets E1, E2, …, En. Let primary-key(Ei) denote the set of
attributes that forms the primary key for entity set Ei. The composition of the primary key for a
relationship set depends on the set of attributes associated with the relationship set R.

• If the relationship set R has no attributes associated with it, then the set of attributes

primary-key(E1) ∪ primary-key(E2) ∪ … ∪ primary-key(En)

describes a relationship in set R.

• If the relationship set R has attributes a1, a2, …, am associated with it, then the set of attributes

primary-key(E1) ∪ primary-key(E2) ∪ … ∪ primary-key(En) ∪ {a1, a2, …, am}

describes a relationship in set R.

• In both of the above cases, the set of attributes

primary-key(E1) ∪ primary-key(E2) ∪ … ∪ primary-key(En)

forms a super key for the relationship set.

• The structure of the primary key for the relationship set depends on the mapping cardinality of the
relationship set.

• For example, consider the relationship set depositor between customer and account. If the relationship set
is many-to-many, then the primary key of depositor consists of the union of the primary keys of customer
and account.

• However, if a customer can have only one account, that is, if the depositor relationship is many-to-one from
customer to account, then the primary key of depositor is simply the primary key of customer.

• Similarly, if the relationship is one-to-many, that is, an account is owned by at most one customer, then the
primary key of depositor is simply the primary key of account.

• For one-to-one relationships, either primary key can be used.

7. Entity-Relationship diagram:
• This is the most popular conceptual model used for designing a database.

• Once the entity types, relationship types, and their corresponding attributes have been identified, the
next step is to graphically represent these components using an entity-relationship (E-R) diagram.

• An E-R diagram is a specialized graphical tool that demonstrates the interrelationships among the various
entities of a database. We use different symbols in an E-R diagram.

• An E-R diagram focuses on high-level database design and hides low-level details of database
representation; therefore it can be used to communicate with users of the system while collecting
information.

(Note: show the symbols used in the E-R model)

8. Weak Entity sets and Strong Entity sets:

• An entity set that does not possess sufficient attributes to form a primary key is called a weak entity set.

• One that does have a primary key is called a strong entity set.

• A weak entity set depends on a strong entity set for its existence. Such strong entity sets are
called owner or identifying entity sets.

• Weak entity sets do not have a primary key, but we still need a means of distinguishing among their
entities. The primary key of the owner entity set and the discriminator of the weak entity set allow this
distinction to be made.

• An attribute of a weak entity set that is used in combination with the primary key of the strong entity set to
identify the weak entity set uniquely is called the discriminator (partial key).

• Weak entity sets are represented by a double rectangle, the identifying relationship by a double diamond,
and the discriminator by a dotted underline.

• The participation of a weak entity set in its identifying relationship is always total.

9. Reducing E-R diagrams to Tables:

• We can represent a database that conforms to an E-R database schema by a collection of relation
schemas.

• Both the E-R model and the relational database model are logical representations of real-world
enterprises. Because the two models employ similar design principles, we can convert an E-R design
into a relational design.

9.1 Representation of Strong Entity Sets with Simple Attributes:

• Let E be a strong entity set with only simple descriptive attributes a1, a2, …, an. We represent this entity
set by a schema called E with n distinct attributes.

• For a schema derived from a strong entity set, the primary key of the entity set serves as the primary key
of the resulting schema.

• For example, consider the entity set student of the E-R diagram. This entity set has 4 attributes: SID,
Sname, address, phone no. We represent this entity set by a schema called student with four attributes:

student (SID, Sname, address, phone no.)

• Note that since SID is the primary key of the entity set, it is also the primary key of the relation schema.

9.2 Representation of Strong Entity sets with Complex attributes:

• We handle composite attributes by creating a separate attribute for each of the component attributes.

• For example, for the composite attribute Sname of entity set Student, the generated schema contains the
attributes f_name, m_name and l_name; there is no separate attribute or schema for Sname.

• The relational schema derived from the Student entity with complex attributes is thus:

Student (SID, f_name, m_name, l_name, street_name, ward_no., dist_name, phone no.)

• For a multivalued attribute M, we create a relation schema R with an attribute A that corresponds to M
and attributes corresponding to the primary key of the entity set of which M is an attribute.

• For example, consider an entity set student which includes the multi-valued attribute phone no. The
primary key of entity set student is SID. For this multi-valued attribute, we create a relation schema

student_phone (SID, phone no.)

• We create a primary key of the relation schema consisting of all attributes of the schema. In the above
example the primary key consists of both SID and phone no.

• In addition, we create a foreign-key constraint on the relation schema created from a multi-valued
attribute. In the above example SID will be the foreign key.

9.3 Representation of Weak Entity Sets:
• Let A be a weak entity set with attributes a1, a2, …, am. Let B be the strong entity set on which A depends.
Let the primary key of B consist of attributes b1, b2, …, bn. We represent the entity set A by a relation
schema called A as

{a1, a2, …, am} ∪ {b1, b2, …, bn}

• For schemas derived from a weak entity set, the combination of the primary key of the strong entity set
and the discriminator of the weak entity set serves as the primary key of the schema.

• In addition to creating a primary key, we also create a foreign-key constraint on the relation A.

• For example, consider the weak entity set section. This entity set has the attributes: sec_id, semester, year.
The primary key of the course entity set, on which section depends, is course_id. Thus we represent
section by a schema with the following attributes:

section (course_id, sec_id, semester, year)

9.4 Representation of Relationship Sets:

• Let R be a relationship set, let a1, a2, …, am be the set of attributes formed by the union of the primary
keys of each of the entity sets participating in R, and let the descriptive attributes (if any) of R be
b1, b2, …, bn. We represent the relation schema called R as

{a1, a2, …, am} ∪ {b1, b2, …, bn}

• The primary key for the relation schema is chosen as follows:

I. For a binary many-to-many relationship, the union of the primary-key attributes from the participating
entity sets becomes the primary key.

II. For a binary one-to-one relationship set, the primary key of either entity set can be chosen as the primary
key. The choice can be made arbitrarily.

III. For a binary many-to-one or one-to-many relationship set, the primary key of the entity set on the many
side of the relationship set serves as the primary key.

IV. For an n-ary relationship set without any arrows on its edges, the union of the primary-key attributes from
the participating entity sets becomes the primary key.

V. For an n-ary relationship set with an arrow on one of its edges, the primary keys of the entity sets not on
the arrow side of the relationship set serve as the primary key for the schema.

• We also create a foreign-key constraint on the relation schema R.

• For example, consider the relationship set Enrolls. The relationship set involves the following two entity
sets:

Student with the primary key SID

Course with the primary key course-id

• Since the relationship set has no attributes, the Enrolls schema has two attributes, SID and course-id.

• We also create two foreign-key constraints on the Enrolls relation, with attribute SID referencing the
primary key of student and attribute course-id referencing the primary key of course (the sketch below
collects these schemas).
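Putting section 9 together, the Student/Course/Enrolls example reduces to the
relation schemas below, written as plain Python data so the rules are explicit
(the attribute list of course beyond course_id is assumed for illustration):

    # Schemas produced by the reduction rules in section 9.
    schemas = {
        "student": {"attrs": ["SID", "Sname", "address", "phone_no"],
                    "primary_key": ["SID"]},
        "course":  {"attrs": ["course_id", "title"],        # title is assumed
                    "primary_key": ["course_id"]},
        # Many-to-many relationship set: primary key = union of participants' keys.
        "enrolls": {"attrs": ["SID", "course_id"],
                    "primary_key": ["SID", "course_id"],
                    "foreign_keys": {"SID": "student", "course_id": "course"}},
    }
    for name, schema in schemas.items():
        print(name, schema["attrs"], "PK:", schema["primary_key"])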

10. Extended E-R model (EER model):

• The E-R model concepts discussed so far can model only traditional systems.

• To model complex systems such as databases for engineering design and manufacturing,
telecommunications, complex software systems, and geographic information systems (GIS), among many
other applications, it is not enough.

• The EER model includes all the concepts introduced by the ER model.

• Additionally, it includes the concepts of specialization, generalization, higher- and lower-level entity sets,
attribute inheritance, and aggregation.

10.1 Specialization:
• An entity set may include sub-groupings of entities that are distinct in some way from other entities in the
set. A subset of entities within an entity set may have attributes that are not shared by all the entities in
the entity set.

• Thus the E-R model provides a means of designating such distinct sub-groupings.

• The process of designating sub-groupings within an entity set is called specialization.

• It is a top-down design process.

• For example, the entity set person can be specialized into the entity sets employee and customer, because
the employee entities can be further described by emp-id and salary whereas the customer entities can be
described by cus-id.

• Further, on the basis of job type, the entity set employee can be divided into the entity sets manager,
typist, sweeper, etc. Thus specialization may create a hierarchy.

• In an E-R diagram, specialization is represented by a triangle component labeled ISA. The label ISA stands
for “is a” and represents, for example, that a customer “is a” person. This may be referred to as a
superclass-subclass relationship.

10.2 Generalization:
• Generalization is the reverse of specialization. It is a bottom-up design process.

• Here we combine a number of entity sets that share the same features into a higher-level entity set.

• Designers apply generalization to emphasize the similarities among the entity sets and to hide their
differences.

• Specialization and generalization are simple inversions of each other; they are represented in the E-R
diagram in the same way.

• For example, the entity set “employee” can be viewed as a generalization of {officer, typist, sweeper}. In
reverse, we can view {officer, typist, sweeper} as a specialization of “employee”.

10.3 Attribute Inheritance:

• A crucial property of the higher- and lower-level entities created by specialization and generalization is
attribute inheritance.

• The attributes of higher-level entity sets are said to be inherited by lower-level entity sets.

• For example, customer and employee inherit the attributes of person. Thus customer is described by the
name, street and city attributes and additionally by the customer-id attribute; employee is described by
the name, street and city attributes and additionally by the employee-id and salary attributes.

10.4 Constraints on Specialization/Generalization:

• Condition-defined vs user-defined: This type of constraint determines which entities can be members
of a given lower-level entity set. If we can determine exactly which entities can be members of each
subclass by a condition, the subclasses are called predicate-defined (or condition-defined).

• A subclass is called user-defined if membership in the subclass is determined by the database users.

• Disjoint vs overlapping: Constraints on whether or not entities may belong to more than one lower-
level entity set within a single specialization.

 Disjoint: an entity can belong to only one lower-level entity set. Noted in E-R diagram by writing
disjoint next to the ISA triangle.

 Overlapping: an entity can belong to more than one lower-level entity set.

• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong
to at least one of the lower-level entity sets within a generalization.

 Total: an entity must belong to one of the lower-level entity sets

 Partial: an entity need not belong to one of the lower-level entity sets

10.5 Aggregation:
• A relationship set is an association between entity sets. Sometimes we have to model a relationship
between a collection of entities and relationships.

• The E-R model cannot express relationships among relationship sets or relationships between a
relationship set and an entity set.

• Consider a DB with information about employees who work on a particular project and use a number of
machines in doing that work. One alternative in this case is to use a ternary relationship between
{employee, project and machinery}.

• The main problem that will arise in this case is that it has redundant relationships.

• Every {employee, project} combination that appears in the uses relationship set also appears in the works
relationship set.

• The solution is to use aggregation. An abstraction through which relationship sets (along with associated
entity sets) are treated as higher-level entities is called aggregation; it thus allows us to express
relationships between a relationship set and an entity set.

• For example, we treat the relationship set works and the entity sets employee and project as a higher-
level entity set.

11. Use of Entity sets or Relationship sets:

11.1 Use of Entity sets versus Attributes


i. Consider the entity set instructor with the attribute phone_number.

ii. Here phone_number is used as an attribute, but if we want to keep extra information about a phone, such as
its location (home or office) or its type (mobile or landline), we can treat phone_number as an entity
set.

iii. Thus, if we take this point of view, we do not add the attribute phone_number to instructor. Rather, we
create: a phone entity set with attributes phone_number and location, and a relationship set inst_phone,
denoting the association between instructors and their phones.

iv. The difference here is that treating a phone as an entity better models a situation where one may want to
keep extra information about a phone. Treating a phone as an attribute implies that instructors have
precisely one phone number each. Treating a phone as an entity permits instructors to have
several phone numbers and other information.

11.2 Use of Entity sets versus Relationship Sets:


• It is not always clear whether an object is best expressed by an entity set or relationship set.

• Consider the takes relationship set to model the situation where a student takes a course.

• If we want to enter a course-registration record for each course that each student takes, then we have an
entity set to represent the course-registration record.

• Each registration entity is related to exactly one student and to exactly one section, so we have two
relationship sets, one to relate course-registration records to students and one to relate course-
registration records to sections.

• Thus the entity sets course and student with the takes relationship set are replaced by one entity set and
two relationship sets:

- registration, course_reg, student_reg

• Both approaches represent the university’s information, but the use of takes is more compact and
preferable. However, if we want to associate other information, it might be best to make registration an
entity in its own right.
Entity and entity set
An entity is a “thing” or “object” in the real world, either animate or
inanimate, that can be easily identified. For example, in a school
database, students, teachers, classes, and courses offered can be
considered as entities. All these entities have some attributes or properties
that give them their identity.

An entity set is a collection of similar types of entities. An entity set may
contain entities with attributes sharing similar values. For example, a
Students set may contain all the students of a school; likewise a Teachers
set may contain all the teachers of a school from all faculties. Entity sets
need not be disjoint.

Attributes
Entities are represented by means of their properties, called attributes. All
attributes have values. For example, a student entity may have name,
class, roll no and age as attributes.

There exists a domain or range of values that can be assigned to attributes.
For example, a student's name cannot be a numeric value. It has to be
alphabetic. A student's age cannot be negative, etc.

Types of Attributes
 Simple attribute − Simple attributes are atomic values, which cannot be
divided further. For example, a student's phone number is an atomic value of
10 digits.

 Composite attribute − Composite attributes are made of more than one
simple attribute. For example, a student's complete name may have first name,
middle name and last name.

 Derived attribute − Derived attributes are attributes that do not exist in
the physical database, but their values are derived from other attributes present
in the database. For example, the average salary in a department should not be
saved directly in the database; instead it can be derived. For another example,
age can be derived from date_of_birth.

 Single-value attribute − Single-value attributes contain a single value. For
example − Social_Security_Number.

 Multi-value attribute − Multi-value attributes may contain more than one
value. For example, a person can have more than one phone number, email
address, etc.

These attribute types can come together as −

 simple single-valued attributes

 simple multi-valued attributes

 composite single-valued attributes

 composite multi-valued attributes


Entity-Set and Keys
A key is an attribute or collection of attributes that uniquely identifies an
entity within an entity set.

For example, the roll_number of a student makes him/her identifiable
among students.

 Super Key − A set of attributes (one or more) that collectively identifies an


entity in an entity set.

 Candidate Key − A minimal super key is called a candidate key. An entity set
may have more than one candidate key.

 Primary Key − A primary key is one of the candidate keys chosen by the
database designer to uniquely identify the entity set.

Relationship
The association among entities is called a relationship. For example, an
employee works_at a department, a student enrolls in a course. Here,
Works_at and Enrolls are called relationships.

Relationship Set
A set of relationships of similar type is called a relationship set. Like
entities, a relationship too can have attributes. These attributes are
called descriptive attributes.

Degree of Relationship
The number of participating entity sets in a relationship defines the degree of
the relationship.

 Binary = degree 2

 Ternary = degree 3

 n-ary = degree n
Mapping Constraints

An E-R enterprise schema may define certain constraints to which the
contents of a database must conform. Here, we examine mapping
cardinalities and participation constraints, which are two of the most
important types of constraints.

1 Mapping Cardinalities
Cardinality defines the number of entities in one entity set which can be
associated with entities of another set via a relationship set.

 One-to-one − One entity from entity set A can be associated with at most one
entity of entity set B and vice versa.

 One-to-many − One entity from entity set A can be associated with more than
one entity of entity set B; however, an entity from entity set B can be
associated with at most one entity of entity set A.

 Many-to-one − More than one entity from entity set A can be associated with
at most one entity of entity set B; however, an entity from entity set B can be
associated with more than one entity from entity set A.

 Many-to-many − One entity from A can be associated with more than one
entity from B and vice versa.
2 Participation Constraints

The participation of an entity set E in a relationship set R is said to be total
if every entity in E participates in at least one relationship in R. If only
some entities in E participate in relationships in R, the participation of
entity set E in relationship R is said to be partial. For example, we expect
every loan entity to be related to at least one customer through the
borrower relationship. Therefore the participation of loan in the relationship
set borrower is total. In contrast, an individual can be a bank customer
whether or not he/she has a loan with the bank. Hence, it is possible that
only some of the customer entities are related to the loan entity set
through the borrower relationship, and the participation of customer in the
borrower relationship set is therefore partial.
CHAPTER-3
Structure of Relational Databases:
• A relational database consists of a collection of tables, each of which is assigned a unique name.
• A row in a table represents a relationship among a set of values.
• Since table is a collection of such relationship, we call a table as a relation.
(Example table: the account relation; its columns account-number, branch-name and balance are the
attributes, and its rows are the tuples.)

• Consider the account table. It has three column headers: account-number, branch-name and balance. We
refer to these headers as attributes.
• For each attribute, there is a set of permitted values, called the domain of that attribute. For the
attribute branch-name, the domain is the set of all branch names.
• The rows in a table are called tuples. For example, in the above account relation, (A-101, Butwal, 500) is
a tuple.
• Attribute values are normally required to be atomic, that is, indivisible.
• The special value null is a member of every domain; it signifies that a value is unknown or does
not exist. Null values cause a number of difficulties when we access or update the database, and thus
should be eliminated if at all possible.
Let D1 denote the set of all account numbers, D2 the set of all branch names and D3 the set of all
balances. Then the relation account is a subset of
D1 × D2 × D3
that is, the account relation is a relation over account-number × branch-name × balance.
In general, a table of n attributes is a subset of
D1 × D2 × … × Dn (a minimal sketch of this view follows).
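This “subset of the Cartesian product of the domains” view can be checked
directly in Python; the tiny domains below are invented for illustration:

    from itertools import product

    D1 = {"A-101", "A-102"}      # account numbers
    D2 = {"Butwal", "Manigram"}  # branch names
    D3 = {500, 700}              # balances

    # Every possible tuple over the three domains:
    all_tuples = set(product(D1, D2, D3))
    print(len(all_tuples))  # 2 * 2 * 2 = 8

    # A relation instance is just some subset of those tuples:
    account = {("A-101", "Butwal", 500), ("A-102", "Manigram", 700)}
    print(account <= all_tuples)  # True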
Database Schema:
• The database schema is the logical design of the database. We use a convention of lower-case
names for relations, and names beginning with an uppercase letter for relation schemas.
• If A1, A2, A3 are attributes, then R = (A1, A2, A3) is a relation schema. For example:
Account-schema = (account-number, branch-name, balance)
 r(R) denotes a relation r on relation schema R. For example:
account (Account-schema)
• In general, a relation schema consists of a list of attributes and their corresponding domains.
Database Instance:
• The contents of a relation change with time as the relation is updated.
• A database instance is a snapshot of the data in the database at a given instant of time.
• We often simply say relation when we actually mean relation instance, because the current values
(relation instance) of a relation are specified by the table.
Database:
• A database consists of multiple relations.
• Information about an enterprise is broken up into parts, with each relation storing one part of the
information. E.g.:
account: stores information about accounts
depositor: stores information about which customer owns which account
customer: stores information about customers

(Example tables: the account, depositor and customer relations.)

• Storing all the information as a single relation such as
bank (branch-name, assets, customer-name, customer-address, account-number, balance, …)
results in:
I. Repetition of information: If a customer has several accounts, we must list her address once for each
account. That is, we must repeat certain information several times. This repetition is wasteful and
should be avoided.
II. The need for null values: If a branch has no accounts (say, a newly created branch), we cannot construct
a complete tuple, because no data concerning customers and accounts are available yet. To represent
incomplete tuples, we must use null values.
Keys:
• Let K be a subset of R.
• K is a super key of R if the values of K are sufficient to identify a unique tuple of each possible relation
r(R). For example:
{customer-id, customer-name} and {customer-id} are both super keys of the customer relation.
• K is a candidate key if K is minimal.
Example: customer-id is a candidate key for the customer relation, since it is a super key and no
proper subset of it is a super key.
Determining Keys from E-R Sets
• Strong Entity set: The primary key of the entity set becomes primary key of the relation.
• Weak Entity set: The primary key of the relation consists of the union of the primary key of the strong
entity set and the discriminator of the weak entity set.
• Relationship set: The union of the primary keys of the related entity sets become a super key of the
relation.
 For many-to-one relationship sets, the primary key of the “many” entity set becomes the relation’s
primary key.
 For one-to-one relationship sets, the relation’s primary key can be that of either entity sets
 For many-to-many relationship sets, the union of the primary keys becomes the relation’s primary key.
• A set of attributes in one relation that references the primary key of another relation (table) is called a
foreign key. If the primary key of relation r1 is used as an attribute of relation r2, then that attribute is
called a foreign key from r2, referencing r1.

Schema Diagram for Banking Enterprise

(Figure: schema diagram showing the branch, account, depositor, loan, borrower and customer relations,
with foreign-key arrows between them.)

 A database schema, along with primary-key and foreign-key dependencies, can be depicted pictorially
by a schema diagram.
 The above figure shows the schema diagram for our banking enterprise. Each relation appears as a box,
with the attributes listed inside it and the relation name above it. If there are primary-key attributes, we
draw a horizontal line across the box, with the primary-key attributes listed above the line.
 Foreign-key dependencies appear as arrows from the foreign-key attributes of the referencing relation
to the primary key of the referenced relation. For example, the account-number attribute of the depositor
relation acts as a foreign key referencing the primary key of the account relation. The dependency is
shown by the arrow.
(Note: do not confuse a schema diagram with an E-R diagram. In an E-R diagram, attributes are shown in
ovals, and foreign keys are not shown explicitly as in a schema diagram.)

Query Language:
• A query language is a language in which a user requests information from the database.
• Query languages can be categorized as either procedural or non-procedural.
• In a procedural language, the user instructs the system to perform a sequence of operations on the
database to compute the desired result.
• In a non-procedural language, the user describes the desired information without giving a specific
procedure for obtaining that information.
• The relational algebra is procedural, whereas the tuple relational calculus and domain relational calculus
are non-procedural approaches.

The Relational Algebra:


• The relational algebra is a procedural query language.
• It consists of a set of operations that take one or two relations as input and produce a new relation as
their result.
• The fundamental operations in the relational algebra are:
select, project, union, set difference, Cartesian product, and rename.
• In addition there are several other operations:
set intersection, natural join, division, and assignment.

Fundamental Operations:
• The select, project and rename operations are called unary operations, because they operate on one
relation.
• The union, set difference and Cartesian product operations are binary operations, because they operate
on pairs of relations.

Note: The account, depositor, loan and borrower tables of the banking example are used below to explain
the fundamental relational algebra operations.

1. The Select Operation:

(Example: a relation r with attributes A, B, C, D, and the result of σA=B ∧ D>5(r), which keeps only those
tuples of r in which A = B and D > 5.)
• We use the lowercase Greek letter sigma (σ) to denote selection.
• The select operation selects tuples that satisfy a given predicate.
• The predicate appears as a subscript to σ.
• The argument relation is in parentheses after the σ.
• Thus, to select those tuples of the account relation where the branch is Butwal, we write:
σbranch-name = Butwal (account)
• We can find all the tuples in which the balance is more than 600 by writing
σbalance > 600 (account)
• In general we allow comparisons using =, ≠, <, ≤, >, ≥ in the selection predicate.
• Further, we can combine several predicates into a larger predicate by using the connectives and (∧),
or (∨), and not (¬).
• Thus, to find those tuples whose branch-name is Butwal and whose balance is more than 600, we write
σbranch-name = Butwal ∧ balance > 600 (account)
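As a minimal Python sketch (with invented account tuples), selection is just a
filter that keeps the tuples satisfying the predicate:

    # A toy relation: a set of tuples with schema (account_number, branch_name, balance).
    account = {
        ("A-101", "Butwal", 500),
        ("A-215", "Manigram", 700),
        ("A-102", "Butwal", 900),
    }

    def select(predicate, relation):
        # sigma: keep only the tuples that satisfy the predicate.
        return {t for t in relation if predicate(t)}

    # sigma(branch-name = Butwal AND balance > 600)(account)
    print(select(lambda t: t[1] == "Butwal" and t[2] > 600, account))
    # {('A-102', 'Butwal', 900)}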

2. The Project Operation:

(Example: a relation r with attributes A, B, C, and the result of ΠA,C(r), which keeps only columns A and C
and removes duplicate tuples.)
• The project operation is a unary operation that returns its argument relation, with certain attributes
left out.
• Projection is denoted by the uppercase Greek letter pi (Π).
• We list those attributes that we wish to appear in the result as a subscript to Π.
• The argument relation follows in parentheses.
• Thus, if we want to list all the account numbers and branch names but do not care about the balances,
we write:
Πaccount-number, branch-name (account)
Composition of Relational Operations:
• We can use different relational operations together to query information from a relation.
• Consider the query “find those account numbers whose balance is greater than 800” on the relation
account; we write:
Πaccount-number (σbalance > 800 (account))
• Here, instead of giving the name of a relation as the argument of the projection operation, we give a
select operation that evaluates to a relation.
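Projection, and its composition with selection, can be sketched the same way
(the same toy account relation as above, repeated so the snippet runs on its
own):

    account_attrs = ("account_number", "branch_name", "balance")
    account = {
        ("A-101", "Butwal", 500),
        ("A-215", "Manigram", 700),
        ("A-102", "Butwal", 900),
    }

    def project(wanted, attrs, relation):
        # pi: keep only the named columns; using a set removes duplicate tuples.
        idx = [attrs.index(a) for a in wanted]
        return {tuple(t[i] for i in idx) for t in relation}

    # pi(account-number, branch-name)(account)
    print(project(("account_number", "branch_name"), account_attrs, account))

    # Composition: pi(account-number)(sigma(balance > 800)(account))
    big = {t for t in account if t[2] > 800}
    print(project(("account_number",), account_attrs, big))  # {('A-102',)}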
3. The Union Operation:

(Example: relations r and s, each with attributes A and B, and their union r ∪ s, which contains every tuple
appearing in r, in s, or in both.)

• The union operation is a binary operation which involves a pair of relations.
• Consider a query to find the names of the customers of the bank who have either an account or a loan or
both.
• To answer this query, we need the information in the depositor relation and in the borrower relation.
• To find the names of all customers with a loan in the bank, we write:
Πcustomer-name (borrower)
• To find the names of all customers with an account in the bank, we write:
Πcustomer-name (depositor)
• To answer the query we need the union of these two sets, because we need all the customer names that
appear in either or both of the two relations. So the expression needed is:
Πcustomer-name (borrower) ∪ Πcustomer-name (depositor)

(Example tables: the result of Πcustomer-name (depositor), the result of Πcustomer-name (borrower), and
the result of their union.)

• We must ensure that unions are taken between compatible relations.
• The union operation r ∪ s is valid if the following two conditions hold:
 The relations r and s must be of the same arity. That is, they must have the
same number of attributes.
 The attribute domains must be compatible.
(Note: r and s may be temporary relations derived from relational-algebra expressions.)
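A minimal Python sketch of union over toy name relations (the compatibility
check here can only test arity; domain compatibility is assumed):

    def union(r, s):
        # r U s: valid only if both relations have the same arity.
        arities = {len(t) for t in r | s}
        if len(arities) > 1:
            raise ValueError("relations are not union-compatible")
        return r | s

    borrower_names = {("Sita",), ("Hari",)}
    depositor_names = {("Ram",), ("Sita",)}

    # pi(customer-name)(borrower) U pi(customer-name)(depositor)
    print(union(borrower_names, depositor_names))
    # {('Ram',), ('Sita',), ('Hari',)} -- Sita appears only once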

4. The Set Difference Operation:

(Example: relations r and s, each with attributes A and B, and the result r – s.)

• The set-difference operation, denoted by (–), allows us to find tuples that are in one relation but not in
another.
• The expression r – s produces a relation containing those tuples in r but not in s.
• Thus we can find all the customers of the bank who have an account but not a loan by writing
Πcustomer-name (depositor) – Πcustomer-name (borrower)
• As with the union operation, we must ensure that set differences are taken
between compatible relations (same arity, compatible domains).
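Python’s built-in set difference mirrors this operation directly, as in this
tiny sketch with the same toy name relations:

    depositor_names = {("Ram",), ("Sita",)}
    borrower_names = {("Sita",), ("Hari",)}

    # pi(customer-name)(depositor) - pi(customer-name)(borrower):
    # customers with an account but no loan.
    print(depositor_names - borrower_names)  # {('Ram',)}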

5. The Cartesian-Product Operation:

(Example: a relation r with attributes A, B, a relation s with attributes C, D, E, and their product r × s,
which pairs every tuple of r with every tuple of s.)
• The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two
relations. We write the Cartesian product of relations r and s as
r × s
• The same attribute name may appear in both r and s, so we need to differentiate between these
attributes. We do so by attaching to each attribute the name of the relation
from which it originally came. For example, the relation schema for borrower × loan is
(customer-name, borrower.loan-number, loan.loan-number, branch-name, amount)
• Assume that we have n1 tuples in borrower and n2 tuples in loan. Then there are n1 × n2 ways of
choosing a pair of tuples, so the Cartesian product results in a large table.
• Suppose that we want to find the names of all the customers who have a loan at the bank, along with the
branch name and amount. We need the information in both the loan and borrower relations, so we
take the Cartesian product of borrower and loan (borrower × loan). The tuples we want
in borrower × loan are those that have
borrower.loan-number = loan.loan-number
Thus we write
Πcustomer-name, branch-name, amount (σborrower.loan-number = loan.loan-number (borrower × loan))
This produces the required result.
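Here is a minimal sketch of that product-then-select-then-project pattern in
Python, with invented borrower and loan tuples; tuple positions stand in for
the prefixed attribute names:

    from itertools import product

    # borrower: (customer_name, loan_number); loan: (loan_number, branch_name, amount)
    borrower = {("Ram", "L-11"), ("Sita", "L-12")}
    loan = {("L-11", "Butwal", 900), ("L-12", "Manigram", 1500)}

    # borrower X loan: every pair of tuples, concatenated (2 x 2 = 4 tuples).
    cart = {b + l for b, l in product(borrower, loan)}

    # sigma(borrower.loan-number = loan.loan-number), then
    # pi(customer-name, branch-name, amount):
    matched = {t for t in cart if t[1] == t[2]}
    print({(t[0], t[3], t[4]) for t in matched})
    # {('Ram', 'Butwal', 900), ('Sita', 'Manigram', 1500)}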

6. The Rename Operation:

• The results of relational-algebra expressions do not have names that we can use to refer to them. It
is useful to be able to give them names; the rename operator, denoted by the lowercase Greek letter ρ,
lets us do this. Given a relational-algebra expression E, the expression
ρx (E)
returns the result of expression E under the name x.
• We can also rename the attributes that are in E. Assume that a relational-algebra expression E has
n attributes. Then the expression
ρx(A1, A2, …, An) (E)
returns the result of expression E under the name x, and with the attributes renamed to A1, A2, …, An.
Example:
• Consider the account relation; the expression:
ρacc (Πaccount-number, balance (account))
will result in a relation named acc with attributes account-number and balance.
• If we want to change the attribute names as well, then we write the expression:
ρacc(name, amount) (Πaccount-number, balance (account))
It results in a relation named acc with attribute account-number renamed to name and attribute
balance renamed to amount.

Practice questions:
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city )
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
a. Find all loans of over $1200.
b. Find the loan number for each loan of an amount greater than $1200
c. Find the names of all customers who have a loan, an account, or both, from the bank
d. Find the names of all customers who have a loan at the Butwal branch.
e. Find the names of all customers who have a loan at the Butwal branch but do not have an account at the
bank.
f. Find the largest account balance.
Answers for the above questions:
a. σamount > 1200 (loan)
b. ∏loan-number (σamount > 1200 (loan))
c. ∏customer-name (borrower) ∪ ∏customer-name (depositor)
d. ∏customer-name (σbranch-name = Butwal (σborrower.loan-number = loan.loan-number (borrower × loan)))
e. ∏customer-name (σbranch-name = Butwal (σborrower.loan-number = loan.loan-number (borrower × loan)))
– ∏customer-name (depositor)
f. ∏balance (account) – ∏account.balance (σaccount.balance < d.balance (account × ρd (account)))

Formal definition of relational algebra:


A basic expression in the relational algebra consists of either one of the following:
• A relation in the database
• A constant relation
A constant relation is written by listing its tuples within { }, for example { (A-101, Sunwal, 500), (A-215,
Manigram, 700) }.
A general expression in relational algebra is constructed out of smaller sub-expressions. Let E1 and E2 be
relational-algebra expressions. Then, these are all relational- algebra expressions:
• E1 ∪ E2
• E1 − E2
• E1 × E2
• σP (E1), where P is a predicate on attributes in E1
• ΠS(E1), where S is a list consisting of some of the attributes in E1
• ρx (E1), where x is the new name for the result of E1

Additional Operations:
• The fundamental operations of the relational algebra are sufficient to express any relational-algebra
query.
• However if we restrict ourselves to just the fundamental operations, certain common queries are
lengthy to express.
• Therefore, there are additional operations that do not add any power to the algebra, but simplify common queries.
• Those additional operations are:
set intersection, natural join, division, and assignment.

The Set-intersection Operation:


[Figure: example relations r and s and their intersection r ∩ s]

• The set-intersection operation is denoted by ∩.
• If r and s are two relations, then the set-intersection operation is written as r ∩ s.
• If we wish to find all the customers who have both a loan and an account, using set intersection we can write:
Πcustomer-name (borrower) ∩ Πcustomer-name (depositor)
• Note that we can rewrite any relational-algebra expression that uses set intersection by replacing the intersection operation with a pair of set-difference operations:
r ∩ s = r − (r − s)
• Thus set intersection is not a fundamental operation and does not add any power to the relational algebra. It is simply more convenient to write r ∩ s than r − (r − s).

The Natural-Join Operation:


• It is desirable to simplify certain queries that require a cartesian product. Usually, a query that involves
a cartesian product includes a selection operation on the result of the cartesian product.
• The natural join is a binary operation that allows us to combine certain selection and a cartesian
product into one operation.
• It is denoted by the join symbol ⋈.
• This operation forms a cartesian product of its two arguments, performs a selection forcing equality
on those attributes that appear in both relation schemas, and finally removes duplicate attributes.

• Suppose that we want to find the names of all the customers who have a loan at the bank, along with the branch name and amount. We first form the cartesian product of the borrower and loan relations, then select those tuples that have the same loan-number, followed by the projection operation:
Πcustomer-name, branch-name, amount (σborrower.loan-number = loan.loan-number (borrower X loan))
• The natural join simplifies this operation. We can obtain the same result by using the natural join as follows:
Πcustomer-name, branch-name, amount (borrower ⋈ loan)

The assignment Operation:


• It is convenient at times to write a relational-algebra expression by assigning parts of it to temporary
relation variables.
• The assignment operation, denoted by ←, works like an assignment. Let us take an example:
temp1 ← Πcustomer-name (depositor)
Here, the names of customers who have an account are assigned to the temporary variable temp1.
temp2 ← Πcustomer-name (borrower)
Here, the names of customers who have a loan are assigned to the temporary variable temp2.
Thus, to find the names of customers who have both an account and a loan, we can write
temp1 ∩ temp2
• The assignment operation does not result in any relation being displayed to the user.
• Rather, the result of the expression to the right of the ← is assigned to the relation variable on the left of the ←.
• The assignment is always made to a temporary relation variable and is never permanent.
• It does not add any power to relational algebra rather it is a convenient way to express complex
queries.

The Division Operation:


[Figure: example relations r and s and their quotient r ÷ s]

• The division operation is denoted by (÷)


• Let r and s be relations on schemas R and S respectively, where
R = (A1, ..., Am, B1, ..., Bn)
S = (B1, ..., Bn)
Then the result of r ÷ s is a relation on schema
R − S = (A1, ..., Am)
• It is useful for a special kind of query that includes the phrase "for all".
Suppose that we wish to find the names of customers who have an account at all the branches located in Butwal city, using the account, depositor, and branch relations.

• We can obtain all branches in Butwal by the expression
r1 ← Πbranch-name (σbranch-city = "Butwal" (branch))
• We can find all (customer-name, branch-name) pairs for which the customer has an account at a branch by writing
r2 ← Πcustomer-name, branch-name (depositor ⋈ account)
• Now we need to find the customers who appear in r2 with every branch name in r1. The operation that provides exactly those customers is the divide operation. So we write
r2 ÷ r1
which produces the names of the customers who have an account at all branches located in Butwal.

Modifying the Database:


• The content of the database can be modified; that is, we can add, remove, or change information in the database using the following operations:
 Deletion
 Insertion
 Updating
• All these operations are expressed using the assignment operator.
1.Deletion:
• A delete request is expressed similarly to a query.
• However, instead of displaying tuples to the user, the selected tuples are removed from the database.
• We can delete only whole tuples; cannot delete values on only particular attributes.
• A deletion is expressed in relational algebra by
r ← r − E
where r is a relation and E is a relational algebra query.
Example:
• Delete all account records in the Butwal branch.
account ← account − σbranch-name = "Butwal" (account)
• Delete all loan records with amount in the range 0 to 50.
loan ← loan − σamount ≥ 0 ∧ amount ≤ 50 (loan)

2. Insertion:
• To insert data into a relation, we either specify a tuple to be inserted or write a query whose result is a
set of tuples to be inserted.
• The attribute value for the inserted tuples must be members of the attribute’s domain.
• Similarly, tuples inserted must be of the correct arity; that is, the number of attributes must be the same.
• In relational algebra, an insertion is expressed by
r ← r ∪ E
where, r is a relation and E is a relational algebra expression
• We express the insertion of a single tuple by letting E be a constant relation containing one tuple
Example:
• Insert information in the database specifying that Meena has Rs 1200 in account A-209 at the Sunwal
branch.
account ← account ∪ {(A-209, "Sunwal", 1200)}
depositor ← depositor ∪ {("Meena", A-209)}

3. Updating:
• It is a mechanism to change a value in a tuple without changing all values in the tuple.
• We use a generalized projection operator to do this task
r ← ΠA1, A2, ..., An (r)
• Each Ai is either
 the ith attribute of r, if the ith attribute is not updated, or,
 if the attribute is to be updated Ai is an expression, involving only constants and the attributes of r,
which gives the new value for the attribute.
Example:
• Make interest payments by increasing all balances by 5 percent
account ← Πaccount-number, branch-name, balance * 1.05 (account)
• Pay all accounts with balances over 1000, 6 percent interest, and pay all others 5 percent:
account ← Πaccount-number, branch-name, balance * 1.06 (σbalance > 1000 (account)) ∪ Πaccount-number, branch-name, balance * 1.05 (σbalance ≤ 1000 (account))

Views:
• In some cases, it is not desirable for all users to see the entire logical model, that is, all the actual relations stored in the database. Security considerations may require that certain data be hidden from users.
• Consider a person who needs to know a customer's account number but has no need to see the balance. This person should see a relation, described in relational algebra by
∏customer-name, account-number (account ⋈ depositor)
• Any relation that is not part of the logical model, but is made visible to a user as a virtual relation, is
called a view.

View Definition:
• To define a view, we must give the view a name and must state a query that computes the view.
• A view is defined using the create view statement which has the form
create view v as <query expression>
Where, <query expression> is any relational algebra query expression and the view name is represented by v.
• As an example, consider the view consisting of the customer’s name and their account number. If we
wish this view to be called cust_acc, we define it as follows:
create view cust_acc as
∏customer-name, account-number (account ⋈ depositor)
• Once the view is defined, we can use the view name to refer to the virtual relation that the view
generates.
• Using the view cust_acc, we can find all the customers who have account by writing
∏customer-name (cust_acc)
• At any given time, the set of tuples in the view relation is the result of evaluation of the query
expression that defines the view at that time.
• Thus if a view relation is computed and stored, it may become out of date if the relations used to define
it are modified.
• To avoid this, views are usually implemented as follows:
 When we define a view, the database system stores the definition of the view itself, rather than the result of evaluation of the relational-algebra expression that defines the view.
 Whenever a view relation appears in a query, it is replaced by the stored query expression.
 Thus, whenever we evaluate the query, the view relation gets recomputed.
 Certain database systems allow view relations to be stored, but they make sure that, if the actual
relations used in the view definition change, the view is kept up to date. Such views are called
materialized views. The process of keeping the view up to date is called view maintenance.
 Difference in assignment operation and view definition:
 We evaluate the assignment operation once, and the assignment variable does not change when we update the relations involved in the query expression assigned to it. In contrast, any modification we make to the relations involved in defining a view changes the set of tuples in the view relation as well.
CHAPTER-4

SQL: STRUCTURED QUERY LANGUAGE


The SQL language has several parts:
• Data-definition language (DDL): The DDL language provides commands for defining relation
schemas, deleting relations, and modifying relation schemas.
• Data-manipulation language (DML): The DML language includes query language for retrieval of
information. It also includes commands to insert, delete and modify information in the
database. The basic DML clauses are select, from, and where.
• View definition: The DDL includes commands for defining views.
• Transaction control: SQL Includes commands for specifying the beginning and ending of
transactions.
• Embedded SQL and dynamic SQL: Embedded and dynamic SQL define how SQL statements can
be embedded within general-purpose programming languages, such as C, C++, Java, etc.
• Integrity: The SQL DDL includes commands for specifying integrity constraints that the data
stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
• Authorization: The SQL DDL includes commands for specifying access rights to relations and
views.
Basic Structure:
• A typical SQL query has the form
select A1, A2,…......, An
from r1, r2,......., rm
where P
Here, each Ai represents an attribute, and each ri a relation. P is a predicate.
• The query is equivalent to the relational-algebra expression
ΠA1, A2, ..., An (σP (r1 × r2 × ... × rm))
• The basic structure of an SQL expression consists of three clauses: select, from and where. These three clauses form the basic DML.
• The select clause corresponds to the projection operation of the relational algebra. It is used to
list the attributes desired in the result of query.
• The from clause corresponds to the cartesian-product operation of the relational algebra. It
lists the relations to be scanned in the evaluation of the expression.
• The where clause corresponds to the selection predicate of the relational algebra. It consists of
a predicate involving attributes of the relations that appear in the from clause.

To explain all the terms in this chapter we will be using the following relation schema:
Branch (branch_name, branch_city, assets)
Customer (customer_name, customer_street, customer_city)
Loan (loan_number, branch_name, amount)
Borrower (customer_name, loan_number)
Account (account_number, branch_name, balance)
Depositor(customer_name, account_number)

The select Clause:


• It is used to list the attributes desired in the result of query.
• It is of the form: select A1, A2,…......, An
from r1, r2,......., rm
Where A1, A2, ..., An are the desired attributes that are to be retrieved from the relations r1, r2, ..., rm.
• Let us consider a simple query, "Find the names of all branches in the loan relation".
• We can get the desired result by writing
select branch_name
from loan
The result is a relation consisting of a single attribute with the heading branch_name.
• SQL allows duplicates in relations as well as in the results of SQL expressions.
• In the case where we want to force the elimination of duplicates, we insert the keyword
distinct after select.
• Thus, if we want to remove duplicates, we can rewrite the above query as
select distinct branch_name
from loan
• SQL allows us to use the keyword all to specify explicitly that duplicates are not removed:
select all branch_name
from loan
• The asterisk symbol * can be used to denote "all attributes". Thus if we write
select *
from loan
It indicates that all attributes of loan are to be selected.
• Thus, a select clause of the form select * indicates that all attributes of all relations appearing
in the from clause are selected.
• The select clause may also contain arithmetic expressions involving the operators +, -, * and / operating on constants or attributes of tuples. For example,
select loan_number, branch_name, amount * 100
from loan
This will return a relation that is the same as the loan relation, except that the attribute amount is multiplied by 100.

The Where Clause:


• The where clause consists of a predicate involving attributes of the relations that appear in the
from clause.
• Consider the query "Find all loan numbers for loans made at the Butwal branch with loan amount greater than 1200". This query can be written as
select loan_number
from loan
where branch_name = 'Butwal' and amount > 1200
• Comparison results can be combined using the logical connectives and, or, and not.
• Comparison operators <, <=, >, >=, and = are used to compare strings and arithmetic expressions.
• SQL includes a between comparison operator to simplify where clauses that specify that a value be less than or equal to some value and greater than or equal to some other value.
• If we wish to find the loan number of those loans with loan amounts between 10,000 and
20,000, we can use the between comparison as below
select loan_number
from loan
where amount between 10000 and 20000
• Similarly, we can use the not between comparison operator.

The From Clause


• The from clause lists the relations to be scanned in the evaluation of the expression.
• Consider the query "Find the names, loan numbers and loan amounts for the customers who have a loan from the bank". The query can be written as
select customer_name, borrower.loan_number, amount
from borrower, loan
where borrower.loan_number = loan.loan_number
• Here we are using the notation relation-name.attribute-name, to avoid ambiguity in cases
where an attribute appears in more than one relation.

The Rename Operation


• SQL provides a mechanism for renaming both relations and attributes. It uses the as clause,
taking the form:
old_name as new_name
• The names of the attributes in the result are derived from the names of the attributes in the
relations in the from clause.
• We cannot always derive names in this way for several reasons:
 First, two relations in the from clause may have attributes with the same name, in which case the attribute name is duplicated in the result.
 Second, if we use an arithmetic expression in the select clause, the resultant attribute does not have a name.
 Third, even if an attribute name can be derived from the base relations as in the preceding example, we may want to change the attribute name in the result relation.
 For example, if we want the attribute name loan_number to be replaced with the name loan_id,
we can rewrite the preceding query as
select customer_name, borrower.loan_number as loan_id, amount
from borrower, loan
where borrower.loan_number = loan.loan_number

Tuple Variables
• A tuple variable in SQL must be associated with a particular relation.
• We define a tuple variable in the from clause by placing it after the name of the relation with
which it is associated, with the keyword as in between.
• To illustrate it, we rewrite the query "For all customers who have a loan from the bank, find the names, loan numbers, and loan amounts" as
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number
• Tuple variables are most useful for comparing two tuples in the same relation.
• Suppose that we want the query "Find the names of all branches that have assets greater than those of at least one branch located in Butwal".
• We can write the expression as
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and S.branch_city = 'Butwal'

String Operation
• SQL specifies strings by enclosing them in single quotes, for example, 'Butwal'.
• A single quote character that is part of a string can be specified by using two single quote characters; for example, the string 'It's right' can be specified by 'It''s right'.
• The most commonly used operation on strings is pattern matching using the operator like. We
describe the patterns by using two special characters:
 Percent (%): The % character matches any substring.
 Underscore (_): The _ character matches any single character.
• Consider the following examples to illustrate pattern matching:
 'But%' matches any string beginning with 'But'.
 '%tw%' matches any string containing 'tw' as a substring.
 '_ _' matches any string of exactly two characters.
 '_ _ %' matches any string of at least two characters.
• Consider the query "Find the names of all customers whose street address includes the substring 'nag'". The query can be written as
select customer_name
from customer
where customer_street like '%nag%'
• For patterns to include special pattern characters (that is % and _), SQL allows the specification
of an escape character.
• The escape character is used immediately before a special pattern character to indicate that
the special pattern character is to be treated like a normal character.
• We define the escape character for a like comparison using the escape keyword
• Consider the following patterns, which use a backslash (\) as the escape character:
 like 'ab\%cd%' escape '\' matches all strings beginning with 'ab%cd'.
 like 'ab\_cd%' escape '\' matches all strings beginning with 'ab_cd'.
• SQL allows us to search for mismatches instead of matches by using the not like comparison
operator.

Ordering of Tuples
• SQL offers the user some control over the order in which tuples in a relation are displayed.
• The order by clause causes the tuples in the result of a query to appear in sorted order.
• To list in alphabetic order all customers who have a loan at the Butwal branch, we write
select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and branch_name = 'Butwal'
order by customer_name
• By default, the order by clause lists the items in ascending order.
• To specify the sort order, we may specify desc for descending order or asc for ascending order.
• Furthermore, ordering can be performed on multiple attributes.
• Suppose that we wish to list the entire loan relation in descending order of amount. If several
loans have the same amount, we order them in ascending order by loan number. We express
this query in SQL as follows:
select *
from loan
order by amount desc, loan_number asc

Set Operations
• The set operations union, intersect, and except operate on relations and correspond to the
relational-algebra operations ∪, ∩, and −.
• Each of the above operations automatically eliminates duplicates; to retain all duplicates use
the corresponding multiset versions union all, intersect all and except all.
• As with union, intersection and set difference in relational algebra, the relations participating in the set operations must be compatible; that is, they must have the same set of attributes.

1. The Union Operation:


• To find all the customers having a loan, an account, or both at the bank, we write
(select customer_name
from depositor)
union
(select customer_name
from borrower)
• If we want to retain all duplicates, we must write union all in place of union:
(select customer_name
from depositor)
union all
(select customer_name
from borrower)

2. The Intersect Operation:


• To find all customers who have both a loan and an account at the bank, we write
(select customer_name
from depositor)
intersect
(select customer_name
from borrower)
• If we want to retain all duplicates, we must write intersect all in the place of intersect
(select customer_name
from depositor)
intersect all
(select customer_name
from borrower)

3. The Except operation:


• To find all customers who have an account but no loan at the bank, we write:
(select customer_name
from depositor)
except
(select customer_name
from borrower)
• If we want to retain all duplicates , we must write except all in place of except.
(select customer_name
from depositor)
except all
(select customer_name
from borrower)

Aggregate Functions

• Aggregate functions are functions that take a collection of values as input and return a single
value. SQL offers five built-in aggregate functions:
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
• Consider the query "Find the average balance at the Butwal branch". We write this query as follows:
select avg (balance)
from account
where branch_name = 'Butwal'
The result of this query is a relation with a single attribute, containing a single tuple with numerical
value corresponding to the average balance at the Butwal branch.
• There are cases where we must eliminate duplicates while computing an aggregate, and for this we use distinct in the aggregate expression. For example:
 Find the number of branches of the bank.
select count (distinct branch_name)
from account
• We use the aggregate function count frequently to count the number of tuples in a relation. We
use the notation (*) for this. For example
 Find the number of tuples in the account relation.
select count (*)
from account
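• The remaining aggregate functions are used in the same way. For example, to find the largest account balance (practice question f from the relational-algebra section), we can simply write
select max (balance)
from account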
1. Aggregate Functions – Group By:
• The group by clause is used where we would like to apply the aggregate function not only to
the set of tuples but also to a group of sets of tuples.
• The attributes given in the group by clause are used to form groups.
• Tuples with the same values on the attributes in the group by clause are placed in one group.
• For example: "Find the average account balance at each branch". We write this query as follows:
select branch_name, avg (balance)
from account
group by branch_name

2. Aggregate Function – Having


• The having clause is useful to state a condition that applies to groups rather than to tuples.
• For example, we might be interested in only those branches where the average account
balance is more than Rs 1200. This condition does not apply to single tuple; rather, it applies to
each group constructed by the group by clause.
• Thus, SQL applies predicates in the having clause after groups have been formed.
• We express this query as:
select branch_name, avg (balance)
from account
group by branch_name
having avg (balance) > 1200

Null values
• It is possible for tuples to have a null value, denoted by null, for some of their attributes
• Null signifies an unknown value or that a value does not exist.
• The predicate is null can be used to check for null values.
E.g. Find all account number which appear in the account relation with null values for balance.
select account_number
from account
where balance is null
• The result of any arithmetic expression involving null is null
E.g. 5 + null returns null
• Any comparison with null returns null
E.g. 5 < null returns null
• However, aggregate functions simply ignore nulls except count (*).
• Since the predicate in a where clause can involve operation such as and, or and not on the
results of comparisons, such operations deal with unknown value as:
 OR: (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) =
unknown
 AND: (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown)
= unknown
 NOT: (not unknown) = unknown
(Note that a tuple satisfies the where clause only if the predicate evaluates to true; if it evaluates to false or unknown, the tuple is not included in the result.)

Nested Subqueries
• SQL provides a mechanism for the nesting of subqueries.
• A subquery is a select-from-where expression that is nested within another query.
• A common use of subqueries is to perform tests for set membership, set comparisons, and set
cardinality.
1. Set Membership
• The in connective tests for set membership, where the set is the collection of values produced
by a select clause.
• The not in connective tests for the absence of set membership.
• Consider the query "Find all the customers who have both a loan and an account at the bank".
We can write the query as
select distinct customer_name
from borrower
where customer_name in (select customer_name
from depositor)
• Similarly, to find all customers who have a loan at the bank but do not have an account at the bank, we can write:
select distinct customer_name
from borrower
where customer_name not in (select customer_name
from depositor)
• The following query selects the names of customers who have a loan at the bank and whose names are neither Hari nor Sita.
select distinct customer_name
from borrower
where customer_name not in ('Hari', 'Sita')
• Find all customers who have both an account and a loan at the Butwal branch.
select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and
branch_name = 'Butwal' and
(branch_name, customer_name) in
(select branch_name, customer_name
from depositor, account
where depositor.account_number = account.account_number)

2.Set Comparison:
• Nested subquery can be used to compare sets.
• Consider the query "Find the names of all branches that have assets greater than those of at least one branch in Butwal".
• Here, the phrase "greater than at least one" is represented in SQL by > some. Thus the query can be written as:
can be written as:
select branch_name
from branch
where assets > some (select assets
from branch
where branch_city = 'Butwal')
• SQL allows < some, <= some, >= some, and = some comparisons.
• Similarly, the construct > all corresponds to the phrase "greater than all".
• If we modify the above query to "Find the names of all branches that have an asset value greater than that of each branch in Butwal",
• then we write the query as follows:
select branch_name
from branch
where assets > all (select assets
from branch
where branch_city = 'Butwal')
• SQL also allows < all, <= all, >= all, and = all comparisons.

3. Test for Empty Relations


• SQL includes a feature for testing whether a subquery has any tuples in its result.
• The exists construct returns the value true if the argument subquery is nonempty.
• Using the exists construct, we can write the query "Find all customers who have both an account and a loan at the bank" in still another way:
select customer-name
from borrower
where exists (select *
from depositor
where depositor.customer-name = borrower.customer- name)
4. Test for the absence of Duplicate Tuples
• SQL includes a feature for testing whether a subquery has any duplicate tuples in its result.
• The unique construct returns the value true if the argument subquery contains no duplicate
tuples.
• Using the unique construct, we can write the query "Find all customers who have at most one account at the Butwal branch" as follows:
select customer-name
from depositor
where unique (select customer-name
from account, depositor
where depositor.account-number = account.account-number and
branch-name = 'Butwal')
• We can test for the existence of duplicate tuples in a subquery by using the not unique
construct.
• To illustrate this construct, consider the query "Find all customers who have at least two accounts at the Butwal branch", which we write as
select customer-name
from depositor
where not unique (select customer-name
from account, depositor
where depositor.account-number = account.account-number and
branch-name = 'Butwal')

Derived Relations
• SQL allows a subquery expression to be used in the from clause.
• If we use such an expression, then we must give the result relation a name, and we can rename
the attributes.
• We do this renaming by using the as clause. For example, consider the subquery
(select branch-name, avg (balance)
from account
group by branch-name)
as branch-avg (branch-name, avg-balance)
• The subquery result is named branch-avg, with the attributes branch-name and avg-balance.
• To illustrate the use of a subquery expression in the from clause, consider the query "Find the average account balance of those branches where the average account balance is greater than $1200".
• We can write the query as:
select branch-name, avg-balance
from (select branch-name, avg (balance)
from account
group by branch-name)
as branch-avg (branch-name, avg-balance)
where avg-balance > 1200

Views
• We define a view in SQL by using the create view command.
• To define a view, we must give the view a name and must state the query that computes the
view. The form of the create view command is
create view v as <query expression>
where <query expression> is any legal query expression. The view name is represented by v.
• As an example, consider the view consisting of branch names and the names of customers who
have either an account or a loan at that branch. Assume that we want this view to be called all-
customer. We define this view as follows:
create view all-customer as
(select branch-name, customer-name
from depositor, account
where depositor.account-number = account.account-number)
union
(select branch-name, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number)
• View names may appear in any place where a relation name may appear. Using the view all-
customer, we can find all customers of the Butwal branch by writing
select customer-name
from all-customer
where branch-name = 'Butwal'
• The attribute names of a view can be specified explicitly as follows:
 create view branch-total-loan(branch-name, total-loan) as
select branch-name, sum(amount)
from loan
group by branch-name
• The preceding view gives for each branch the sum of the amounts of all the loans at the branch.
Since the expression sum(amount) does not have a name, the attribute name is specified
explicitly in the view definition.

Modification of the Database


1.Deletion
• A delete request is expressed in much the same way as a query.
• We can delete only whole tuples; we cannot delete values on only particular attributes. SQL
expresses a deletion by
delete from r where P
Where P represents a predicate and r represents a relation.
• Note that a delete command operates on only one relation. If we want to delete tuples from
several relations, we must use one delete command for each relation.
• Delete all tuples from the loan relation.
delete from loan
• Delete all account tuples in the Butwal branch.
delete from account
where branch-name = Butwal
• Delete all loans with loan amounts between $1300 and $1500.
delete from loan
where amount between 1300 and 1500
2. Insertion
• To insert data into a relation, we either specify a tuple to be inserted or write a query whose
result is a set of tuples to be inserted.
• The tuples to be inserted must be of the correct arity (the number of attributes must be equal), and the attribute values must come from the corresponding attribute's domain.
• If we want to insert a new account A – 101 at the Butwal branch which has a balance 1000,
then we write:
insert into account
values ('A-101', 'Butwal', 1000)
Here, the values are specified in the order corresponding to the attributes of the relation.
• For the benefit of users who may not remember the order of the attributes, SQL allows the
attributes to be specified as part of the insert statement. For example, the following SQL insert
statements are identical in function to the preceding one:
insert into account (account-number, branch-name, balance)
values ('A-101', 'Butwal', 1000)

insert into account (branch-name, account-number, balance)


values ('Butwal', 'A-101', 1000)
• We might want to insert tuples on the basis of the result of a query.
• Suppose that we want to present a new Rs 200 savings account as a gift to the loan customers of the Butwal branch, one for each loan they have:
insert into account
select loan-number, branch-name, 200
from loan
where branch-name = 'Butwal'

3. Updates
• The update statement is used to change a value in a tuple without changing all values in the tuple.
• Increase all balances by 5 percent:
update account
set balance = balance * 1.05
• If the increment is to be applied only to accounts with a balance of 1000 or more, then we write:
update account
set balance = balance * 1.05
where balance >= 1000
• Let us now suppose that all accounts with balances over 1,000 receive 6 percent interest,
whereas all others receive 5 percent. We could write two update statements:
 update account
set balance = balance * 1.06
where balance > 1000
 update account
set balance = balance * 1.05
where balance <= 1000
• SQL provides a case construct, which we can use to perform both the updates with a single
update statement, avoiding the problem with order of updates.
update account
set balance = case
when balance <= 1000 then balance * 1.05
else balance * 1.06
end

4. Update of a view

• We can modify a view in much the same way as we modify a relation.


• As an illustration, consider the following view definition:
create view loan-branch as
select branch-name, loan-number
from loan
• If we want to insert a new tuple with branch name 'Butwal' and loan number 'L-101', we can write:
insert into loan-branch
values ('Butwal', 'L-101')
• SQL represents this insertion by an insertion into the relation loan, since loan is the actual
relation from which the view loan-branch is constructed. We must, therefore, have some value
for amount. This value is a null value. Thus, the preceding insert results in the insertion of the
tuple
('L-101', 'Butwal', null) into the loan relation.
• A modification is permitted through a view only if the view in question is defined in terms of
one relation of the actual relational database.
Joined relations
• Join operations take two relations and return as a result another relation.
• These operations are typically used as subquery expressions in the from clause
• Join condition – defines which tuples in the two relations match, and what attributes are
present in the result of the join.
• Join type – defines how tuples in each relation that do not match any tuple in the other
relation (based on the join condition) are treated.

[Figure: sample loan and borrower relations]
• Note: borrower information is missing for loan L-301, and loan information is missing for loan L-401.

Loan inner join borrower on loan.loan-number = borrower.loan-number


The table below is the result table for this query:

• The expression computes the inner join of the loan and the borrower relations, with the join
condition being loan.loan-number = borrower.loan-number.
• The attributes of the result consist of the attributes of the left-hand-side relation followed by
the attributes of the right-hand-side relation.
• We can rename the result relation and the attributes by using as clause.
loan inner join borrower on loan.loan-number = borrower.loan-number
as lb(loan-number, branch, amount, cust, cust-loan-num)

Loan left outer join borrower on loan.loan-number = borrower.loan-number


The table below is the result table for this query:
• First, compute the result of the inner join as before. Then, for every tuple in the left-hand-side
relation loan that does not match any tuple in the right-hand-side relation borrower in the
inner join, add a tuple to the result of the join in such a way that attributes of a tuple that are
derived from the left-hand-side relation are filled in with their values and the remaining
attributes on the right are filled with null values.
• The case is similar for the right outer join, where tuples are added to the result of the join in such a way that the attributes derived from the right-hand-side relation are filled in with their values and the remaining attributes on the left are filled with null values.

Loan natural inner join borrower


The table below is the result table for this query:

• The result is similar to the result of the inner join with the on condition, since they have, in
effect, the same join condition. However, the attribute loan-number appears only once in the
result of the natural join, whereas it appears twice in the result of the join with the on
condition.

Loan natural right outer join borrower


The table below is the result table for this query:

• The attributes of the result are defined by the join condition, which is a natural join; hence,
loan-number appears only once. The first two tuples in the result are from the inner natural
join of loan and borrower. The tuple (Sita, L-401) from the right-hand-side relation does not
match any tuple from the left-hand-side relation loan in the natural inner join. Hence, the tuple
(L-401, null, null, Sita) appears in the join result.

loan full outer join borrower using (loan-number)


The table below is the result table for this query:
• The join condition using(A1, A2, . . . , An) is similar to the natural join condition, except that the
join attributes are the attributes A1, A2, . . . , An, rather than all attributes that are common to
both relations.
Example:
• Find all customers who have either an account or a loan (but not both) at the bank.
select customer-name
from (depositor natural full outer join borrower)
where account-number is null or loan-number is null

Transaction
• A transaction consists of a sequence of query and/or update statements. The SQL standard
specifies that a transaction begins implicitly when an SQL statement is executed. One of the
following SQL statements must end the transaction:
1. Commit work commits the current transaction; that is, it makes the updates performed by the
transaction become permanent in the database. After the transaction is committed, a new
transaction is automatically started.
2. Rollback work causes the current transaction to be rolled back; that is, it undoes all the
updates performed by the SQL statements in the transaction. Thus, the database state is
restored to what it was before the first statement of the transaction was executed.
• Commit is similar to saving changes to a document that is being edited.
• Rollback is similar to quitting the edit session without saving changes.
• Transaction rollback is useful if some error condition is detected during execution of a
transaction.
• The database system guarantees that in the event of some failure, such as an error in one of the
SQL statements, a power outage, or a system crash, a transaction's effects will be rolled back if
it has not yet executed commit work.
Example:
• Transfer of money from one account to another involves two steps:
 deduct from one account and credit to another
• If one step succeeds and the other fails, the database is in an inconsistent state. Therefore, either both steps should succeed or neither should.
• If any step of a transaction fails, all work done by the transaction can be undone by rollback
work.
• Rollback of incomplete transactions is done automatically, in case of system failures
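• As an illustrative sketch of this transfer (the account numbers A-101 and A-209 and the amount Rs 500 are assumed here purely for illustration), the two steps followed by the commit could be written as:
update account set balance = balance - 500
where account_number = 'A-101';    -- deduct from the source account
update account set balance = balance + 500
where account_number = 'A-209';    -- credit the destination account
commit work;
• If an error is detected after the first update, executing rollback work instead of commit work undoes the deduction, restoring the consistent state.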

• In many SQL implementations, by default each SQL statement is taken to be a transaction on its
own, and gets committed as soon as it is executed.
• Automatic commit of individual SQL statements must be turned off if a transaction consisting
of multiple SQL statements needs to be executed. How to turn off automatic commit depends
on the specific SQL implementation.
• An alternative, which is part of the SQL:1999 standard, is to allow multiple SQL statements to
be enclosed between the keywords
begin atomic . . .
end
All the statements between the keywords then form a single transaction.

Data Definition Language(DDL)


• A set of relations in a database is specified by means of a data definition language.
• The SQL DDL allows specification of not only a set of relations, but also information about each
relation, including
 The schema for each relation
 The domain of values associated with each attribute.
 The integrity constraints
 The security and authorization information for each relation
 The physical storage structure of each relation on disk
• The DDL language provides commands for defining relation schemas, deleting relations, and
modifying relation schemas

Domain Types in SQL


• The SQL standard supports a variety of built-in domain types, including:
• char(n): A fixed-length character string with user-specified length n. The full form, character,
can be used instead.
• varchar(n): A variable-length character string with user-specified maximum length n. The full
form, character varying, is equivalent.
• int: An integer (a finite subset of the integers that is machine dependent). The full form,
integer, is equivalent.
• smallint: A small integer (a machine-dependent subset of the integer domain type).
• numeric(p, d): A fixed-point number with user-specified precision. The number consists of p
digits (plus a sign), and d of the p digits are to the right of the decimal point. Thus,
numeric(3,1) allows 44.5 to be stored exactly, but neither 444.5 nor 0.32 can be stored exactly
in a field of this type.
• real, double precision: Floating-point and double-precision floating-point numbers with
machine-dependent precision.
• float(n): A floating-point number, with precision of at least n digits.
• date: A calendar date containing a (four-digit) year, month, and day of the month.
• time: The time of day, in hours, minutes, and seconds. A variant, time(p), can be used to
specify the number of fractional digits for seconds. It is also possible to store time zone
information along with the time.
• timestamp: A combination of date and time. A variant, timestamp(p), can be used to specify
the number of fractional digits for seconds.
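• As a small illustrative sketch (the employee relation here is hypothetical, not part of the bank schema), several of these domain types can appear together in one table definition:
create table employee
(emp_id char(10),
emp_name varchar(30),
salary numeric(8, 2),    -- up to 6 digits before and 2 after the decimal point
date_joined date,
primary key (emp_id))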

Schema Definition in SQL


• We define an SQL relation by using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint 1),
...,
(integrity-constraint m))
Where r is the name of the relation, each Ai is the name of an attribute in the schema of relation r, and
Di is the domain type of values of attribute Ai.
The allowed integrity constraint include:
• Primary key (A1, A2,....An): The primary key specification says that attributes A1 , A2 , . . . , An
form the primary key for the relation. The primary key attributes are required to be non-null
and unique; that is, no tuple can have a null value for a primary key attribute, and no two tuples
in the relation can be equal on all the primary-key attributes. Although the primary key
specification is optional, it is generally a good idea to specify a primary key for each relation.
• check(P): The check clause specifies a predicate P that must be satisfied by every tuple in the
relation.
• By default null is a legal value for every attribute in SQL, unless the attribute is specifically
stated to be not null. An attribute can be declared to be not null in the following way:
account-number char(10) not null
Examples:
• create table customer
(customer-name char(20),
customer-street char(30),
customer-city char(30),
primary key (customer-name))
• create table account
(account-number char(10) not null,
branch-name char(15),
balance integer,
primary key (account-number),
check (balance >= 0))

Drop Clause:
• To remove a relation from an SQL database, we use the drop table command. The drop table
command deletes all information about the relation from the database. The command
drop table r
is a more drastic action than
delete from r
• The latter doesn't delete relation r, but deletes all tuples in r. The former deletes not only all
tuples of r, but also the schema for r.
Alter clause:
• We use the alter table command to add attributes to an existing relation. All tuples in the
relation are assigned null as the value for the new attribute. The form of the alter table
command is
alter table r add A D
where r is the name of an existing relation, A is the name of the attribute to be added, and D is
the domain of the added attribute.
• We can drop attributes from a relation by the command
alter table r drop A
where r is the name of an existing relation, and A is the name of an attribute of the relation.
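• As a concrete sketch of these two forms (the phone_number attribute is hypothetical; note that some SQL implementations require drop column rather than drop):
alter table customer add phone_number char(10)
alter table customer drop phone_number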

Embedded SQL:
• The SQL standard defines embeddings of SQL in a variety of programming languages such as
Pascal, PL/I, Fortran, C, and Cobol.
• A language in which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL.
• The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
EXEC SQL <embedded SQL statement> END-EXEC
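• As a minimal sketch (shown for C, where each embedded statement ends with a semicolon rather than END-EXEC; exact syntax varies by preprocessor), a program can declare a host variable, open a cursor over a query, and fetch a tuple into that variable:
EXEC SQL BEGIN DECLARE SECTION;
char cname[21];                     /* host variable that receives customer_name */
EXEC SQL END DECLARE SECTION;
EXEC SQL declare c cursor for
select customer_name from depositor;
EXEC SQL open c;
EXEC SQL fetch c into :cname;       /* the next tuple is copied into cname */
EXEC SQL close c;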

• A programmer must have access to a database from a general-purpose programming language for at least two reasons:
1. Not all queries can be expressed in SQL, since SQL does not provide the full expressive power
of a general-purpose language. That is, there exist queries that can be expressed in a language
such as C, Java, or Cobol that cannot be expressed in SQL. To write such queries, we can embed
SQL within a more powerful language.
2. Nondeclarative actions—such as printing a report, interacting with a user, or sending the
results of a query to a graphical user interface—cannot be done from within SQL. For an
integrated application, the programs written in the programming language must be able to
access the database.

Dynamic SQL
• The dynamic SQL component of SQL allows programs to construct and submit SQL queries at
run time.
• In contrast, embedded SQL statements must be completely present at compile time; they are
compiled by the embedded SQL preprocessor.
• Using dynamic SQL, programs can create SQL queries as strings at run time (based on input
from the user) and can either have them executed immediately or have them prepared for
subsequent use. Preparing a dynamic SQL statement compiles it, and subsequent uses of the
prepared statement use the compiled version.
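• As a minimal sketch in embedded C (the statement text and the names sqlprog, dynprog, and account, as well as the account number, are illustrative):
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;   /* compile the statement at run time */
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;  /* supply the parameter and execute */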
Unit 5 Integrity Constraints

5.1 Domain Constraints

 A domain is defined as the set of all unique values permitted for an attribute. For example, a domain of date is the set of all possible valid dates, a domain of integer is all possible whole numbers, and a domain of day-of-week is Sunday, Monday, Tuesday, ..., Saturday.
 This in effect is defining rules for a particular attribute. If it is determined that an
attribute is a date then it should be implemented in database to prevent invalid
dates being entered.
 Domain constraints are user-defined data types, and we can define them like this:
Domain Constraint = data type + Constraints (NOT NULL / UNIQUE / PRIMARY KEY / FOREIGN KEY / CHECK / DEFAULT)
 Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
 Domain constraints are the most elementary form of integrity constraint.
 They test values inserted in the database, and test queries to ensure that the
comparisons make sense.
 New domains can be created from existing data types
E.g. create domain Dollars numeric(12, 2)
create domain Pounds numeric(12,2)
 We cannot assign or compare a value of type Dollars to a value of type Pounds.
However, we can convert type as below
(cast r.A as Pounds)
(it should also multiply by the dollar-to-pound conversion-rate)
 The check clause in SQL-92 permits domains to be restricted:
Use check clause to ensure that an hourly-wage domain allows only values
greater than a specified value.

create domain hourly-wage numeric(5,2)

constraint value-test check(value >= 4.00)

The domain has a constraint that ensures that the hourly-wage is greater than or equal to 4.00.
The clause constraint value-test is optional; useful to indicate which constraint
an update violated.
 Can have complex conditions in domain check
create domain AccountType char(10)
constraint account-type-test check (value in (‘Checking’, ‘Saving’))

check (branch-name in (select branch-name from branch))

Example:

I want to create a table “student_info” with “stu_id” field having value greater than 100,
I can create a domain and table like this:

create domain id_value int


constraint id_test
check(value > 100);

create table student_info (


stu_id id_value PRIMARY KEY,
stu_name varchar(30),
stu_age int
);

Another example:
I want to create a table bank_account with an account_type field having value either 'Checking' or 'Saving':

create domain account_type char(20)


constraint acc_type_test
check(value in ("Checking", "Saving"));

create table bank_account (


account_nbr int PRIMARY KEY,
account_holder_name varchar(30),
account_type account_type
);
5.2 Referential Integrity
5.2.1 Basic Concepts
 Ensures that a value that appears in one relation for a given set of
attributes also appears for a certain set of attributes in another relation.

Example: If 'Nabil-Bank' is a branch name appearing in one of the tuples in the account relation, then there must exist a tuple in the branch relation for branch 'Nabil-Bank'.

 Formal Definition

Let r1(R1) and r2(R2) be relations with primary keys K1 and K2 respectively.
The subset α of R2 is a foreign key referencing K1 in relation r1 if, for every tuple t2 in r2, there is a tuple t1 in r1 such that t1[K1] = t2[α].
The referential-integrity constraint is also called a subset dependency, since it can be written as
∏α (r2) ⊆ ∏K1 (r1)

5.2.2 Referential Integrity in the E-R Model


 Consider relationship set R between entity sets E1 and E2. The relational
schema for R includes the primary keys K1 of E1and K2 of E2.
Then K1 and K2 form foreign keys on the relational schemas for E1 and E2
respectively.

[Figure: E-R diagram of relationship set R between entity sets E1 and E2]

 Weak entity sets are also a source of referential integrity constraints.


The relation schema for a weak entity set must include the primary key attributes of the entity set on which it depends.
5.2.3 Database Modification
Database modifications can cause violations of referential integrity. We list here the tests that we must make for each type of database modification to preserve the following referential-integrity constraint:
Πα (r2) ⊆ ΠK (r1)
• Insert. If a tuple t2 is inserted into r2, the system must ensure that there is a tuple t1 in r1 such that t1[K] = t2[α]. That is,
t2[α] ∈ ΠK (r1)

• Delete. If a tuple t1 is deleted from r1, the system must compute the set of tuples in r2 that reference t1:
σα = t1[K] (r2)
If this set is not empty, either the delete command is rejected as an error, or the tuples that reference t1 must themselves be deleted. The latter solution may lead to cascading deletions, since tuples may reference tuples that reference t1, and so on.

• Update. We must consider two cases for update: updates to the referencing
relation (r2), and updates to the referenced relation (r1).
If a tuple t2 is updated in relation r2, and the update modifies values for the foreign key α, then a test similar to the insert case is made. Let t2′ denote the new value of tuple t2. The system must ensure that
t2′[α] ∈ ΠK (r1)
If a tuple t1 is updated in r1, and the update modifies values for the primary key (K), then a test similar to the delete case is made. The system must compute
σα = t1[K] (r2)
using the old value of t1 (the value before the update is applied). If this set is not empty, the update is rejected as an error, or the update is cascaded in a manner similar to delete.

5.2.4 Referential Integrity in SQL


 Primary and candidate keys and foreign keys can be specified as part of the
SQL create table statement:
 The primary key clause lists attributes that comprise the primary key.
 The unique clause lists attributes that comprise a candidate key.
 The foreign key clause lists the attributes that comprise the foreign key and
the name of the relation referenced by the foreign key.
 By default, a foreign key references the primary key attributes of the
referenced table
foreign key (account-number) references account
 Short form for specifying a single column as a foreign key:
account-number char(10) references account
 Reference columns in the referenced table can be explicitly specified
but must be declared as primary/candidate keys
foreign key (account-number) references account(account-number)
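 SQL also lets the schema designer request the cascading behavior described in 5.2.3, instead of having the violating modification rejected. A brief sketch of the clause (standard SQL, shown here in isolation):
foreign key (account-number) references account
on delete cascade
on update cascade
With these options, deleting or updating a referenced account tuple propagates the change to the referencing tuples.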

Referential Integrity in SQL – Example


create table customer
(customer-name char(20),
customer-street char(30),
customer-city char(30),
primary key (customer-name))
create table branch
(branch-name char(15),
branch-city char(30),
assets integer,
primary key (branch-name),
check (assets>=0))

create table account


(account-number char(10),
branch-name char(15),
balance integer,
primary key (account-number),
foreign key (branch-name) references branch,
check (balance>=0))

create table depositor


(customer-name char(20),
account-number char(10),
primary key (customer-name, account-number),
foreign key (account-number) references account,
foreign key (customer-name) references customer)

Fig: SQL data definition for part of the bank database.


5.3 Assertions
 An assertion is a predicate expressing a condition that we wish the
database always to satisfy.
 Domain constraints and referential –integrity constrains are special forms
of assertions.
 An assertion in SQL takes the form
create assertion <assertion-name> check <predicate>
 When an assertion is made, the system tests it for validity, and tests it again on every update that may violate the assertion.
This testing may introduce a significant amount of overhead; hence, assertions should be used with great care.
 Asserting
for all X, P(X)
is achieved in a round-about fashion using the construct
not exists X such that not P(X)

Assertion Example
Eg 1)The sum of all loan amounts for each branch must be less than the sum of all
account balances at the branch.
create assertion sum-constraint check
(not exists (select * from branch
where (select sum(amount) from loan
where loan.branch-name = branch.branch-name)
>= (select sum(balance) from account
where account.branch-name = branch.branch-name)))
Eg 2) Every loan has at least one borrower who maintains an account with a minimum balance of $1000.00.

create assertion balance-constraint check


(not exists ( select * from loan
where not exists ( select *
from borrower, depositor, account
where loan.loan-number = borrower.loan-number
and borrower.customer-name = depositor.customer-name
and depositor.account-number = account.account-number
and account.balance >= 1000)))

5.4 Triggers
 A trigger is a statement that is executed automatically by the system as a side effect of a
modification to the database.
 To design a trigger mechanism, we must meet two requirements:
1) Specify the conditions under which the trigger is to be executed.
2) Specify the actions to be taken when the trigger executes.
The above model of triggers is referred to as the event-condition-action model for
triggers.

 Triggers have many uses, such as implementing business rules, audit logging, and even
carrying out actions outside the database system
 Triggers were introduced to the SQL standard in SQL:1999, but were supported even earlier, using non-standard syntax, by most databases.

Trigger Example
 Suppose that instead of allowing negative account balances, the bank deals with overdrafts by
setting the account balance to zero
creating a loan in the amount of the overdraft
giving this loan a loan number identical to the account number of the
overdrawn account
 The condition for executing the trigger is an update to the account relation that results in a
negative balance value

Trigger Example in SQL:1999

create trigger overdraft-trigger after update on account


referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
insert into borrower
(select customer-name, account-number
from depositor
where nrow.account-number = depositor.account-number);

insert into loan values


(nrow.account-number, nrow.branch-name, -nrow.balance);
update account set balance = 0
where account.account-number = nrow.account-number
end

Fig: Example of SQL:1999 syntax for triggers.

Triggering Events and Actions in SQL

 Triggering event can be insert, delete or update


 Triggers on update can be restricted to specific attributes
E.g. create trigger overdraft-trigger after update of balance on account
 Values of attributes before and after an update can be referenced
referencing old row as : for deletes and updates
referencing new row as : for inserts and updates
 Triggers can be activated before an event, which can serve as extra constraints. E.g. convert
blanks to null.

create trigger setnull-trigger before update on r


referencing new row as nrow
for each row
when nrow.phone-number = ''
set nrow.phone-number = null
Security
 Protection from malicious attempts to steal or modify data.

a) Database system level


# Authentication and authorization mechanisms to allow specific users access
only to required data
# We concentrate on authorization in the rest of this chapter

b) Operating system level


# Operating system super-users can do anything they want to the database
# Good operating system level security is required.

c) Network level: must use encryption to prevent


# Eavesdropping (unauthorized reading of messages)
# Masquerading (pretending to be an authorized user or sending messages
supposedly from authorized users)

d) Physical level
# Physical access to computers allows destruction of data by intruders;
traditional lock-and-key security is needed
# Computers must also be protected from floods, fire, etc.

e) Human level
# Users must be screened to ensure that authorized users do not give access
to intruders
# Users should be trained on password selection and secrecy

Authorization
Forms of authorization on parts of the database:
 Read authorization - allows reading, but not modification of data.
 Insert authorization - allows insertion of new data, but not modification of existing data.
 Update authorization - allows modification, but not deletion of data.
 Delete authorization - allows deletion of data

Forms of authorization to modify the database schema:

 Index authorization - allows creation and deletion of indices.


 Resources authorization - allows creation of new relations.
 Alteration authorization - allows addition or deletion of attributes in a relation.
 Drop authorization - allows deletion of relations.
Chapter-6
Relational Database Design
Design Goals
• Avoid redundant data
• Ensure that relationships among attributes are represented
• Facilitate the checking of updates for violation of database integrity constraints.

6.1 Pitfalls in Relational Database Design


Relational database design requires that we find a good collection of relation schemas. A bad design
may lead to
 Repetition of Information.
 Inability to represent certain information.
Example
Consider the relation schema:
Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)
Redundancy:
 Data for branch-name, branch-city, assets are repeated for each loan that a branch makes.
 Wastes space
 Complicates updating, introducing possibility of inconsistency of assets value
Null Values:
 Another problem with the Lending-schema design is that we cannot represent directly the
information concerning a branch (branch-name, branch-city, assets) unless there exists at least
one loan at the branch. This is because tuples in the lending relation require values for loan-
number, amount, and customer-name.
 One solution to this problem is to introduce null values, however, null values are difficult to
handle. If we are not willing to deal with null values, then we can create the branch information
only when the first loan application at that branch is made. Worse, we would have to delete this
information when all the loans have been paid.

6.2 Functional Dependencies


A functional dependency is a constraint between two sets of attributes of a relation.
• Suppose that in a relation with schema R, X ⊆ R and Y ⊆ R. A functional dependency X → Y
holds on R if, in any table having schema R, for every two rows/tuples t1 and t2, whenever the
values of the attributes in X are the same in t1 and t2, the values of the attributes in Y are also
the same. That is,
if t1[X] = t2[X], then
they must also have t1[Y] = t2[Y].
This means that the values of the X component of a tuple of relation r(R) uniquely (or functionally)
determine the values of the Y component. We also say that there is a functional dependency from X to Y,
or that Y is functionally dependent on X.
If K is a superkey of a relation R, then the functional dependency K → R holds.
If X → Y holds in R, this does not say whether or not Y → X holds in R.
The database designers will use their understanding of the semantics of the attributes of R - that is, how
they relate to one another - to specify the functional dependencies that should hold on all relation states
r of R. Whenever the semantics of two sets of attributes in R indicate that a functional dependency
should hold, we specify the dependency as a constraint.
Hence, the main use of functional dependencies is to describe a relation schema R by specifying
constraints on its attributes that must hold at all times.
Consider relation schema:
EMP_PROJ (E_id, Pnumber, hours, Ename, Pname, Plocation).
Here, the following functional dependencies should hold:
a. E_id → Ename
b. Pnumber → {Pname, Plocation}
c. {E_id, Pnumber} → hours.
These functional dependencies specify that
a) the value of an employee's ID (E_id) uniquely determines the employee name (ENAME),
b) the value of a project's number (PNUMBER) uniquely determines the project name (PNAME) and
location (PLOCATION), and
c) a combination of E_id and Pnumber values uniquely determines the number of hours the employee
currently works on the project per week (HOURS).
Alternatively, we say that ENAME is functionally determined by (or functionally dependent on) E_id, or
"given a value of E_id, we know the value of Ename," and so on.
Closure of a Set of Functional Dependencies
Given a set F of functional dependencies, we can prove that certain other functional dependencies hold.
We say that such functional dependencies are logically implied by F.
E.g. if A → B and B → C, then we can infer that A → C.
Suppose we are given a relation schema R = (A, B, C, G, H, I) and the set of functional dependencies
A→B, A→C, CG → H, CG → I, B→H
The functional dependency
A→H is logically implied.
That is, we can show that, whenever our given set of functional dependencies holds on a relation R, A →
H must also hold on the relation.
Suppose that t1 and t2 are tuples such that
t1[A] = t2[A]
Since we are given that A → B, it follows from the definition of functional dependency that
t1[B] = t2[B]
Then, since we are given that B → H, it follows from the definition of functional dependency that
t1[H] = t2[H]
Therefore, we have shown that, whenever t1 and t2 are tuples such that t1[A] = t2[A], it must be that
t1[H] = t2[H], that is, A → H holds.
Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional
dependencies logically implied by F. Given F, we can compute F+ directly from the formal definition of
functional dependency.
We can use the following three rules to find logically implied functional dependencies. By applying these
rules repeatedly, we can find all of F+, given F. This collection of rules is called Armstrong’s axioms.
1. Reflexivity rule. If α is a set of attributes and β ⊆ α, then α → β holds.
2. Augmentation rule. If α → β holds and γ is a set of attributes, then γα → γβ holds.
3. Transitivity rule. If α → β holds and β → γ holds, then α → γ holds.
These rules are
sound (generate only functional dependencies that actually hold) and
complete (generate all functional dependencies that hold).
Although Armstrong's axioms are complete, it is tiresome to use them directly for the computation of F+.
To simplify matters, we list additional rules. It is possible to use Armstrong's axioms to prove that these
rules are correct.
1. Union rule. If α → β holds and α → γ holds, then α → βγ holds.
2. Decomposition rule. If α → βγ holds, then α → β holds and α → γ holds.
3. Pseudotransitivity rule. If α → β holds and γβ → δ holds, then αγ → δ holds.
Example
R = (A, B, C, G, H, I)
F={ A→B, A→C, CG → H, CG → I, B → H}
Some members of F+
 A → H
by transitivity from A → B and B → H
 AG → I
by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I
OR
by pseudotransitivity: A → C and CG → I imply that
AG → I
 CG → HI:
Since CG → H and CG → I, the union rule implies that
CG → HI.
OR
Augmentation of CG → I to infer CG → CGI,
augmentation of CG → H to infer CGI → HI,
and then transitivity

Procedure for Computing F+


To compute the closure of a set of functional dependencies F:
F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1 and f2 in F+
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F+
until F+ does not change any further
Closure of Attribute Sets
Let α be a set of attributes. We call the set of all attributes functionally determined by α under a set F of
functional dependencies the closure of α under F; we denote it by α+.
Algorithm to compute α+, the closure of α under F:
result := α;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end
Here, the input is a set F of functional dependencies and the set α of attributes. The output is stored in
the variable result.
How the algorithms works
R=(A,B,C,G,H,I)
F={A→B, A→C, CG → H, CG → I, B → H}
Let us compute (AG)+
1. Result = AG
2. Result = ABG (A → B is in F and A ⊆ result (which is AG), so result := result ∪ B)
3. Result = ABCG (A → C and A ⊆ ABG)
4. Result = ABCGH (CG → H and CG ⊆ ABCG)
5. Result = ABCGHI (CG → I and CG ⊆ ABCGH)
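The algorithm above translates almost line for line into code. The following is a minimal Python sketch; the representation of F as a list of (left-hand side, right-hand side) pairs of sets is our own assumption for illustration, not part of the original algorithm.

# Compute alpha+, the closure of the attribute set 'attrs' under F.
# F is a list of (lhs, rhs) pairs, each a Python set of attribute names.
def closure(attrs, F):
    result = set(attrs)
    changed = True
    while changed:                        # while (changes to result) do
        changed = False
        for lhs, rhs in F:                # for each beta -> gamma in F do
            if lhs <= result and not rhs <= result:
                result |= rhs             # result := result U gamma
                changed = True
    return result

# Reproducing the worked example: R = (A, B, C, G, H, I)
F = [({'A'}, {'B'}), ({'A'}, {'C'}),
     ({'C', 'G'}, {'H'}), ({'C', 'G'}, {'I'}),
     ({'B'}, {'H'})]
print(sorted(closure({'A', 'G'}, F)))     # ['A','B','C','G','H','I'], i.e. (AG)+ = ABCGHI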
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
Testing for superkey:
 To test if α is a superkey, we compute α+ and check if α+ contains all attributes of R.
Testing functional dependencies
 To check if a functional dependency α → β holds (or, in other words, is in F+), just check if β ⊆ α+.
 That is, we compute α+ by using attribute closure, and then check if it contains β.
 This is a simple and cheap test, and very useful.
Computing closure of F
 For each γ ⊆ R, we find the closure γ+, and for each
S ⊆ γ+, we output a functional dependency γ → S.
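Using the closure function sketched above (and the same set F), both tests become one-liners. Again, a hedged illustration with assumed names, not library code:

R = {'A', 'B', 'C', 'G', 'H', 'I'}

def is_superkey(attrs, R, F):
    return closure(attrs, F) >= set(R)     # does alpha+ contain all of R?

def fd_holds(lhs, rhs, F):
    return set(rhs) <= closure(lhs, F)     # is beta a subset of alpha+?

print(is_superkey({'A', 'G'}, R, F))       # True: (AG)+ covers all of R
print(fd_holds({'A'}, {'H'}, F))           # True: A -> H is in F+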

6.3 Decomposition
We should decompose a relation schema that has many attributes into several schemas with fewer
attributes. Careless decomposition, however, may lead to another form of bad design.
Careless decomposition may lead to the loss of information and such decomposition is called lossy-
decomposition, or a lossy-join decomposition. Lossy-decomposition is a bad database design.
A decomposition that is not a lossy-join decomposition is a lossless-join decomposition.
When we decompose a relation into a number of smaller relations, it is crucial that the decomposition
be lossless.
A set of relation schemas {R1 , R2 , . . . , Rn } is a decomposition of R if
R = R1∪R2 ∪···∪Rn
That is, {R1, R2,...,Rn} is a decomposition of R if, for i = 1,2,...,n, each Ri is a subset of R, and every
attribute in R appears in at least one Ri.
To have a lossless-join decomposition, we need to impose constraints on the set of possible relations.
Let C represent a set of constraints on the database, and let R be a relation schema. A decomposition
{R1, R2, . . . , Rn} of R is a lossless-join decomposition if, for all relations r on schema R that are legal
under C,
r = ΠR1(r) ⋈ ΠR2(r) ⋈ ··· ⋈ ΠRn(r)
Some functional dependencies are said to be trivial because they are satisfied by all relations.
For example, A → A is satisfied by all relations involving attribute A. Similarly, AB → A is satisfied by all
relations involving attributes A and B.
In general, a functional dependency of the form α → β is trivial if β ⊆ α.
Desirable Properties of Decomposition
The decomposition has mainly two desirable properties:
1. Lossless-Join Decomposition
2. Dependency Preservation
We can use a given set of functional dependencies in designing a relational database in which most of
the undesirable properties do not occur. When we design such systems, it may become necessary to
decompose a relation into several smaller relations.
Consider a Lending-Schema:
Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)
The set F of functional dependencies that we require to hold on Lending-schema is
branch-name → branch-city assets
loan-number → amount branch-name
loan-number → customer-name
Based on above functional dependency, we decompose it to the following three relations:
Branch-schema = (branch-name, branch-city, assets)
Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)

Lossless-Join Decomposition
When we decompose a relation into a number of smaller relations, it is crucial that the decomposition
be lossless.
Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a
decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the
following functional dependencies is in F+:
 R1 ∩ R2 → R1
 R1 ∩ R2 → R2
In other words, if R1 ∩ R2 forms a superkey of either R1 or R2, the decomposition of R is a lossless-join
decomposition. We can use attribute closure to efficiently test for superkeys.
Let's demonstrate that our decomposition of Lending-schema is a lossless-join decomposition by
showing a sequence of steps that generate the decomposition. We begin by decomposing Lending-
schema into two schemas:
 Branch-schema = (branch-name, branch-city, assets)
 Loan-info-schema= (branch-name, customer-name, loan-number, amount)
Since branch-name → branch-city assets, the augmentation rule for functional dependencies implies
that
branch-name → branch-name branch-city assets
Since Branch-schema ∩ Loan-info-schema = {branch-name}, it follows that our initial decomposition is a
lossless-join decomposition.
Next, we decompose Loan-info-schema into
Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)
This step results in a lossless-join decomposition, since loan-number is a common attribute and loan-
number → amount branch-name.
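This test is easy to run mechanically: compute the closure of the common attributes and check whether it covers one of the two schemas. A small Python sketch reusing the closure function from Section 6.2; the variable names and the way attribute names are written out as strings are our own assumptions for illustration:

def is_lossless(R1, R2, F):
    common = set(R1) & set(R2)              # R1 intersect R2
    c = closure(common, F)
    return c >= set(R1) or c >= set(R2)     # superkey of R1 or of R2?

# The Loan-info-schema step just described:
F_lend = [({'loan-number'}, {'amount', 'branch-name'})]
loan_schema = {'loan-number', 'branch-name', 'amount'}
borrower_schema = {'customer-name', 'loan-number'}
print(is_lossless(loan_schema, borrower_schema, F_lend))   # True: loan-number determines Loan-schema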

Dependency Preservation
When an update is made to the database, the system should be able to check that the update will not
create an illegal relation, that is, one that does not satisfy all the given functional dependencies.
Suppose it is given that a set F of functional dependencies holds on any relation based on schema R.
Then the set of functional dependencies that holds on any relation subschema R1 is F1, which contains
the functional dependencies of F involving attributes of only R1. So if the decomposition of R is {R1, R2, …, Rn}
such that the corresponding sets of functional dependencies that hold on them are {F1, F2, …, Fn}, then the
following should be true:
F+ = {F1 ∪ F2 ∪ … ∪ Fn}+.
Such a decomposition is called dependency preserving decomposition.
Example
Consider the schema
R = {A, B, C, D}
such that the following functional dependencies hold on it:
F = {A→B, A→BC, C→D}.
Now suppose the decomposition of this R is
R1= {A,B} and R2 = {B,C,D},
The functional dependencies which hold on R1 are F1 = {A→B} (Note: F1 should contain all the functional
dependencies in F which have only attributes of R1) and
those on R2 are F2 = {C→D}.
The union F1 ∪ F2 is {A→B, C→D}, which doesn't contain
A→BC, so it is not a dependency preserving decomposition.
If we decompose R into these relation schemas
R1 ={A,B,C} and R2={C,D} then
F1 = {A→B, A→BC} and F2 = {C→D}, and
F1 ∪ F2 is {A→B, A→BC, C→D}, so it is a dependency preserving decomposition.
Algorithm for testing dependency preservation
The algorithm shown below is for testing dependency preservation. The input is a set D = {R1, R2, . . . ,
Rn} of decomposed relation schemas, and a set F of functional dependencies.
compute F+;
for each schema Ri in D do
begin
Fi : = the restriction of F+ to Ri;
end
F′ := ∅
for each restriction Fi do
begin
F ′ = F ′ ∪ Fi
end
compute F′+;
if (F′+ = F+) then return (true)
else return (false);
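Computing F+ as this algorithm requires can be exponential in the worst case. A widely used alternative checks each dependency α → β in F directly against the decomposition without materializing F+. The Python sketch below reuses the closure function from earlier and is offered as an illustration under the same representation assumptions as before:

def preserves(F, decomposition):
    # decomposition is a list of attribute sets R1..Rn
    for lhs, rhs in F:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for Ri in decomposition:
                # what does 'result' determine within Ri alone?
                t = closure(result & Ri, F) & Ri
                if not t <= result:
                    result |= t
                    changed = True
        if not set(rhs) <= result:        # alpha -> beta cannot be enforced locally
            return False
    return True

# The example above: R = {A,B,C,D}, F = {A->B, A->BC, C->D}
F_ex = [({'A'}, {'B'}), ({'A'}, {'B', 'C'}), ({'C'}, {'D'})]
print(preserves(F_ex, [{'A', 'B'}, {'B', 'C', 'D'}]))    # False
print(preserves(F_ex, [{'A', 'B', 'C'}, {'C', 'D'}]))    # True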
Example
R = (A, B, C)
F = {A → B, B → C}
Can be decomposed in two different ways
1. R1 =(A,B), R2 =(B,C)

 Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
 Dependency preserving
R1 = (A,B), F1: A → B
R2 = (B,C), F2: B → C
F1 ∪ F2 = F
2. R1=(A,B), R2 =(A,C)
 Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
Not dependency preserving
R1 = (A,B), F1: A → B
R2 = (A,C)
cannot check B → C
Suppose that we decompose the schema R = (A, B, C, D, E) into
(A, B, C) and (A, D, E)
Show that this decomposition is a lossless-join decomposition if the following set F of functional
dependencies holds:
A → BC CD → E B→D E→A
Ans: A decomposition {R1,R2} is a lossless-join decomposition if R1 ∩ R2 → R1 or R1 ∩ R2 → R2.
Let R1 = (A,B,C) and R2 = (A,D,E)
Given,
A → BC (given)
AA → ABC (by the augmentation rule)
A → ABC (union of identical sets)
And the common attribute in R1 and R2 is A, that is, R1 ∩ R2 = {A}, so
R1 ∩ R2 → R1
Thus the requirement for lossless-join decomposition is fulfilled, so, this decomposition is lossless-join
decomposition.
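As a quick mechanical check, the is_lossless sketch from earlier reproduces this answer (attribute names written as single-letter strings; illustrative only):

F_ex2 = [({'A'}, {'B', 'C'}), ({'C', 'D'}, {'E'}),
         ({'B'}, {'D'}), ({'E'}, {'A'})]
print(is_lossless({'A', 'B', 'C'}, {'A', 'D', 'E'}, F_ex2))
# True: A+ = {A,B,C,D,E}, which covers R1 = (A,B,C)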

List all functional dependencies satisfied by the given relation.


Ans: The functional dependencies that are satisfied by the given relations are:
A → B, C → B and the dependency they logically imply: AC → B.
Use Armstrong's axioms to prove the soundness of the union rule.
The union rule states that if α → β and α → γ, then
α → βγ.
Proof:
α → β (given)
αα → αβ (augmentation rule)
α → αβ (union of identical sets)
α → γ (given)
αβ → βγ (augmentation rule)
α → βγ (transitivity rule)
Thus the soundness of the union rule is proved.

6.4 Normalization
The normalization process takes a relation schema through a series of tests to "certify" whether it
satisfies a certain normal form.
There are different normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. Boyce-Codd Normal Form (BCNF)

First Normal Form (1NF)


First Normal form states that the domain of an attribute must include only atomic (simple, indivisible)
values.
A relation R is said to be in the first normal form (1NF) if and only if every attribute contains atomic
values only.
Hence, 1NF disallows multivalued attributes, composite attributes, and combinations of the two.
Consider the Department relation schema given below:
[Figure: the Department relation, with the multivalued attribute Dlocations - table omitted]
We assume that each department can have a number of locations.
The above relation is not in 1NF because Dlocations is not an atomic attribute.
We can change this relation into first normal form as follows:
[Figure: the decomposed Department and Dept_locations relations - tables omitted]

Remove the attribute Dlocations that violates 1NF and place it in a separate relation Dept_locations
along with the primary key Dnumber of Department. The primary key of this relation is the combination
{Dnumber, Dlocation}. A distinct tuple exists for each location of a department in Dept_locations. This
decomposes the non-1NF relation into two 1NF relations.

Second Normal Form (2NF)


Second normal form (2NF) is based on the concept of full functional dependency.
A functional dependency X →Y is a full functional dependency if removal of any attribute A from X
means that the dependency does not hold any more; that is, for any attribute A∈X, (X - {A}) does not
functionally determine Y.
A functional dependency X →Y is a partial dependency if some attribute A∈X can be removed from X
and the dependency still holds; that is, for some A∈X, (X - {A}) →Y.
In the given relation schema emp_proj, {eid, proj-number} → hours is a full dependency (neither eid → hours
nor proj-number → hours holds). However, the dependency {eid, proj-number} → ename is partial
because eid → ename holds.

emp_proj (eid, ename, proj-number, hours)


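Full versus partial dependency can also be checked mechanically with attribute closure: a dependency X → Y is full exactly when no attribute can be dropped from X. A short Python sketch under the same assumptions as the earlier ones (the FDs listed for emp_proj are those stated above):

def is_full_dependency(lhs, rhs, F):
    if not set(rhs) <= closure(lhs, F):
        return False                       # the dependency does not even hold
    for a in lhs:
        if set(rhs) <= closure(set(lhs) - {a}, F):
            return False                   # a is removable, so only partial
    return True

F_ep = [({'eid'}, {'ename'}),
        ({'proj-number'}, {'pname', 'plocation'}),
        ({'eid', 'proj-number'}, {'hours'})]
print(is_full_dependency({'eid', 'proj-number'}, {'hours'}, F_ep))   # True: full
print(is_full_dependency({'eid', 'proj-number'}, {'ename'}, F_ep))   # False: eid -> ename holds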
Definition: A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally
dependent on the primary key of R.
The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of
the primary key.
If the primary key contains a single attribute, the test need not be applied at all.
The EMP_PROJ relation in figure below is in 1NF but is not in 2NF. The nonprime attribute ENAME
violates 2NF because of FD2, as do the nonprime attributes PNAME and PLOCATION because of FD3. The
functional dependencies FD2 and FD3 make ENAME, PNAME, and PLOCATION partially dependent on
the primary key {EID, PNUMBER} of EMP_PROJ, thus violating the 2NF test.
If a relation schema is not in 2NF, it can be "second normalized" or "2NF normalized" into a number of
2NF relations in which nonprime attributes are associated only with the part of the primary key on which
they are fully functionally dependent.
The functional dependencies FD1, FD2, and FD3 in the Figure below hence lead to the decomposition of
EMP_PROJ into the three relation schemas EP1, EP2, and EP3 as shown in the figure, each of which is in 2NF.

Third Normal Form (3NF)


Third normal form (3NF) is based on the concept of transitive dependency.
A functional dependency X→Y in a relation schema R is a transitive dependency if there is a set of
attributes Z that is neither a candidate key nor a subset of any key of R, and both X →Z and Z →Y hold.
The dependency SSN→DMGRSSN is transitive through DNUMBER in relation EMP_DEPT in the figure
below because both the dependencies SSN →DNUMBER and DNUMBER →DMGRSSN hold and
DNUMBER is neither a key itself nor a subset of the key of EMP_DEPT.
Definition: A relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute of R is transitively
dependent on the primary key.
The relation schema EMP_DEPT in the given figure is in 2NF, since no partial dependencies on a key exist.
However, EMP_DEPT is not in 3NF because of the transitive dependency of DMGRSSN (and also DNAME)
on SSN via DNUMBER.
We can normalize EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2 shown
in the figure. Intuitively, we see that ED1 and ED2 represent independent entity facts about employees and
departments.
A 3NF definition equivalent to the above states that a relation schema R is in third normal form
(3NF) if, whenever a functional dependency X→A holds in R, either (a) X is a superkey of R, or
(b) A is a prime attribute of R.

BOYCE-CODD NORMAL FORM (BCNF)


It is a slightly stronger version of the third normal form (3NF). A relation schema R is in BCNF if
whenever a functional dependency X →A holds in R, then X is a superkey of R.
The formal definition of BCNF differs slightly from the definition of 3NF: the only difference is that the
second condition of 3NF, which allows A to be prime, is absent from BCNF.
In practice, most relation schemas that are in 3NF are also in BCNF.
Only in rare cases does a 3NF table not meet the requirements of BCNF. A 3NF table which does not
have multiple overlapping candidate keys is guaranteed to be in BCNF.
In our example, FD3 violates BCNF in relation LOTS because AREA is not a superkey of LOTS. Note that
FD3 satisfies 3NF in LOTS because COUNTY-NAME is a prime attribute (condition (b) above), but this
condition does not exist in the definition of BCNF. We can decompose LOTS into two BCNF relations
LOTS1 and LOTS2, as shown in our example.
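The BCNF definition yields a direct mechanical test: every nontrivial functional dependency must have a superkey as its left-hand side. A hedged Python sketch, again reusing the closure function from Section 6.2; the schema and dependencies shown are the Lending-schema example from earlier:

def bcnf_violations(R, F):
    # Return the dependencies in F that violate BCNF on schema R.
    bad = []
    for lhs, rhs in F:
        if rhs <= lhs:                      # trivial dependency: beta subset of alpha
            continue
        if not closure(lhs, F) >= set(R):   # alpha is not a superkey of R
            bad.append((lhs, rhs))
    return bad

R = {'branch-name', 'branch-city', 'assets',
     'customer-name', 'loan-number', 'amount'}
F = [({'branch-name'}, {'branch-city', 'assets'}),
     ({'loan-number'}, {'amount', 'branch-name'})]
print(bcnf_violations(R, F))   # both dependencies listed: neither LHS is a superkey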
CHAPTER-7
Data Governance
7.1 Introduction
• Data governance is a control that ensures that the data
entry by an operations team member or by an automated
process meets precise standards, such as a business rule, a
data definition and data integrity constraints in the data
model.
• The data governor uses data quality monitoring against
production data to communicate errors in data back to
operational team members, or to the technical support
team, for corrective action.
• Data governance includes the people, processes,
and information technology required to create a consistent
and proper handling of an organization's data across the
business enterprise in order to improve data quality.
• Data governance is a set of processes that ensures that
important data assets are formally managed throughout
the enterprise.
• Data governance ensures that data can be trusted and that
people can be made accountable for any adverse event that
happens because of low data quality.
• It is about putting people in charge of fixing and preventing
issues with data so that the enterprise can become more
efficient.
• Data governance also describes an evolutionary process for
a company, altering the company's way of thinking and
setting up the processes to handle information so that it
may be utilized by the entire organization.
• It's about using technology when necessary in many forms
to help aid the process.
• When companies desire, or are required, to gain control of
their data, they empower their people, set up processes
and get help from technology to do it.
• According to one vendor, data governance is a quality
control discipline for assessing, managing, using, improving,
monitoring, maintaining, and protecting organizational
information. It is a system of decision rights and
accountabilities for information-related processes,
executed according to agreed-upon models which describe
who can take what actions with what information, and
when, under what circumstances, using what methods.
Business enterprises realize the following benefits from the
implementation of data governance programs.
1. Increasing consistency and confidence in decision making
2. Decreasing the risk of regulatory fines
3. Improving data security, also defining and verifying the
requirements for data distribution policies
4. Maximizing the income generation potential of data
5. Designating accountability for information quality
6. Enable better planning by supervisory staff
7. Minimizing or eliminating re-work
8. Optimize staff effectiveness
9. Establish process performance baselines to enable improvement
efforts
10. Acknowledge and hold all gains
Data Governance - Benefits
• Cost
– Increased efficiency reduces costs
– Reduce redundant systems and associated operating costs
– Reduce time and effort needed to verify and correct poor data

• Regulatory
– Reduce risk
– Improve compliance

• Opportunities
– Driving business growth
– Delivering improved client service
– Providing prompt and accurate responses to regulatory requests
– Delivering increased business intelligence from consistent and aggregated data
– Information sharing and coordination across organizations.
– Cross-selling
– Segmenting and targeted service opportunities
7.2 Data governance drivers
• While data governance initiatives can be driven by a desire
to improve data quality, they are more often driven by C-
Level leaders responding to external regulations.
• Examples of these regulations include Sarbanes-
Oxley, Basel I, Basel II, HIPAA, and a number of data
privacy regulations.
• To achieve compliance with these regulations, business
processes and controls require formal management
processes to govern the data subject to these regulations.
• Successful programs identify drivers meaningful to both
supervisory and executive leadership.
• Common themes among the external regulations center on
the need to manage risk. The risks can be financial
misstatement, inadvertent release of sensitive data, or
poor data quality for key decisions.
• Methods to manage these risks vary from industry to
industry. Examples of commonly referenced best practices
and guidelines include COBIT, ISO/IEC 38500, and others.
• The proliferation of regulations and standards creates
challenges for data governance professionals, particularly
when multiple regulations overlap the data being managed.
Organizations often launch data governance initiatives to
address these challenges.
Data Governance - Drivers
An effective data governance program leads to a successful data management capability.
[Figure: the drivers of data governance - data quality, customer service, security, data strategy and
accountability, corporate agility, corporate decision making, cost reduction, and compliance and risk
management.]
1. Corporate Governance
• Following the Enron scandal, the US government took action to hold
directors accountable for the misrepresentation of financial results with
the Sarbanes-Oxley Act.
Many South African companies have dual listings or international partners
and have to comply with this act. On the local front, King III is the most
comprehensive set of guidelines yet published to ensure good financial
governance for South African companies, while government entities must
comply with the Public Financial Management Act (PFMA).

2. Privacy legislation
• The Consumer Protection Act, the PCI-DSS (for credit card vendors) and
the pending Protection of Personal Information bill all require that
sensitive data is secure, is of good quality, is accessed on a need to know
basis only, and is disposed of when no longer needed.
3. Risk
• Basel II (and the pending Basel III) and Solvency II are regulations set up to
manage and assess risk for banks and insurance companies.
• Organizations are required to establish a process for data quality
management and account for adjustments to historical data.
• The regulations require that risk calculations must be provably correct
based on ongoing assessment and monitoring of core risk data.
• Like any large program that touches all departments in the business, a
pragmatic data governance implementation will add value.

4. Master Data Management (or other data intensive IT projects)


• Master data, by definition, is used for many purposes and across many
areas of the business. Data governance helps to ensure that the MDM
implementation takes all the necessary views and uses into account, helps
to manage conflict and resolve issues.
5. Duplicated effort
• In many large organizations we see huge
duplication of effort as different teams try to
address the same issue.
• At worst this may simply result in wasted expense
– multiple projects trying to achieve the same
result.
• In the worst case, projects may clash with each
other – with one undoing the results of another.
7.3 Data governance initiative
• Data governance initiatives improve data quality by assigning a
team responsible for data's accuracy, accessibility, consistency, and
completeness, among other metrics.
• This team usually consists of executive leadership, project
management, line-of-business managers, and data stewards.
• The team usually employs some form of methodology for tracking
and improving enterprise data, such as Six Sigma, and tools for data
mapping, profiling, cleansing, and monitoring data.
• Data governance initiatives may be aimed at achieving a number of
objectives including offering better visibility to internal and external
customers (such as supply chain management), compliance
with regulatory law, improving operations after rapid company
growth or corporate mergers, or to aid the efficiency of
enterprise knowledge workers by reducing confusion and error and
increasing their scope of knowledge.
• Many data governance initiatives are also inspired by past
attempts to fix information quality at the departmental
level, leading to incongruent and redundant data quality
processes.
• Most large companies have many applications and
databases that can't easily share information. Therefore,
knowledge workers within large organizations often don't
have access to the information they need to best do their
jobs. When they do have access to the data, the data
quality may be poor. By setting up a data governance
practice or Corporate Data Authority, these problems can
be mitigated.
• The structure of a data governance initiative will vary not
only with the size of the organization, but with the desired
objectives or the 'focus areas' of the effort.
Implementation
• Implementation of a Data Governance initiative may vary in scope
as well as origin.
• Sometimes, an executive mandate will arise to initiate an enterprise
wide effort, sometimes the mandate will be to create a pilot project
or projects, limited in scope and objectives, aimed at either
resolving existing issues or demonstrating value.
• Sometimes an initiative will originate lower down in the
organization's hierarchy, and will be deployed in a limited scope to
demonstrate value to potential sponsors higher up in the
organization.
• The initial scope of an implementation can vary greatly as well,
from review of a one-off IT system, to a cross-organization initiative.
CHAPTER-8
Database Maintenance
 Database maintenance is an activity designed to keep a database running smoothly.
 A number of different systems can be used to build and maintain databases, with one popular
example being MYSQL.
 The maintenance of databases is generally performed by people who are comfortable and
familiar with the database system and the specifics of the particular database, although some
maintenance tasks can be performed by people who do not have experience.
 Databases are used to maintain a library of information in a well organized, accessible format.
They usually are not static, however, because changes are constantly being made as material is
added, removed, and moved around.
 People may also change parameters within the database, decide to use different indexing
systems, and so forth. Over time, this can cause the database to start to malfunction.
 Database maintenance is used to keep the database clean and well organized so that it will not
lose functionality.
 One important aspect of maintaining a database is simply backing up the data so that, if
anything happens, there will be another copy available.
 Some database systems actually do this automatically, sending a backup to another location
every day, every week, or within any other set period of time. Backups are usually not enough,
however.
 Database maintenance includes checking for signs of corruption in the database, looking for
problem areas, rebuilding indexes, removing duplicate records, and checking for any
abnormalities in the database that might signal a problem.
 The goal is to keep the database operating smoothly for users, so that ideally they never need to
think about maintenance issues.
 A database that is not maintained can become sluggish, and people may start to experience
problems when trying to access records.
 Numerous textbooks are available with information about database management, including
how to maintain databases properly.
 It is also possible to take courses to learn about different database systems and how to care
for databases, whether they are being built from scratch or taken over. People can also earn
certifications in specific systems, which indicate a high level of competence.
Data Quality Management
 Data quality management is an administrative discipline that incorporates role establishment,
role deployment, policies, responsibilities and processes with regard to the acquisition,
maintenance, disposition and distribution of data.
 In order for a data quality management initiative to succeed, a strong partnership between
technology groups and the business is required.
 Information technology groups are in charge of building and controlling the entire environment,
that is, architecture, systems, technical establishments and databases.
 This overall environment acquires, maintains, disseminates and disposes of an organization's
electronic data assets.
 When considering a business intelligence platform, there are various roles associated with data
quality management:
Project leader and program manager: In charge of supervising individual projects or the business
intelligence program. They also manage day-to-day functions depending on the budget, scope and
schedule limitations.
Organization change agent: Assists the organization in recognizing the impact and value of the business
intelligence environment, and helps the organization to handle any challenges that arise.
Data analyst and business analyst: Communicate business needs, which consist of in-depth data quality
needs. The data analyst demonstrates these needs in the data model as well as in the prerequisites for
the data acquisition and delivery procedures. Collectively, these analysts guarantee that the quality
needs are identified and demonstrated in the design, and that these needs are carried to the team of
developers.
Data steward: Handles data as a corporate asset.
 An effective data quality management approach has both reactive and proactive elements. The
proactive elements include:
Establishment of the entire governance
Identification of the roles and responsibilities
Creation of the quality expectations as well as the supporting business strategies
Implementation of a technical platform that facilitates these business practices
 The reactive elements include the management of issues in the data located in existing
databases.
Data Cleansing
 Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database.
 Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate,
irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty
data or coarse data.
 Data cleansing may be performed interactively with data wrangling tools, or as batch
processing through scripting.
 After cleansing, a data set will be consistent with other similar data sets in the system. The
inconsistencies detected or removed may have been originally caused by user entry errors, by
corruption in transmission or storage, or by different data dictionary definitions of similar
entities in different stores.
 Data cleansing differs from data validation in that validation almost invariably means data is
rejected from the system at entry and is performed at entry time, rather than on batches of
data.
 The actual process of data cleansing may involve removing typographical errors or validating
and correcting values against a known list of entities. The validation may be strict (such as
rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records
that partially match existing, known records).
 Some data cleansing solutions will clean data by cross checking with a validated data set. Also
data enhancement, where data is made more complete by adding related information, is a
common data cleansing practice.
 For example, appending addresses with phone numbers related to that address.
 Data cleansing may also involve activities like, harmonization of data, and standardization of
data.
 For example, harmonization of short codes (st, rd, etc.) to actual words (street, road, etcetera).
Standardization of data is a means of changing a reference data set to a new standard, e.g., use of
standard codes.
Process of Data Cleansing
a) Data auditing:
 The data is audited with the use of statistical and database methods to detect anomalies and
contradictions: this eventually gives an indication of the characteristics of the anomalies and
their locations.
 Several commercial software packages will let you specify constraints of various kinds (using a
grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual
Basic) and then generate code that checks the data for violation of these constraints.
 This process is referred to below in the bullets "workflow specification" and "workflow
execution."
 For users who lack access to high-end cleansing software, Microcomputer database packages
such as Microsoft Access or File Maker Pro will also let you perform such checks, on a
constraint-by-constraint basis, interactively with little or no programming required in many
cases.
b) Workflow specification:
 The detection and removal of anomalies is performed by a sequence of operations on the data
known as the workflow.
 It is specified after the process of auditing the data and is crucial in achieving the end product of
high-quality data.
 In order to achieve a proper workflow, the causes of the anomalies and errors in the data have
to be closely considered.
c) Workflow execution:
 In this stage, the workflow is executed after its specification is complete and its correctness is
verified.
 The implementation of the workflow should be efficient, even on large sets of data, which
inevitably poses a trade-off because the execution of a data-cleansing operation can be
computationally expensive.
d) Post-processing and controlling:
 After executing the cleansing workflow, the results are inspected to verify correctness. Data that
could not be corrected during execution of the workflow is manually corrected, if possible.
 The result is a new cycle in the data-cleansing process where the data is audited again to allow
the specification of an additional workflow to further cleanse the data by automatic processing.
Challenges and Problems
a) Data cleansing in virtually integrated environments:
 In virtually integrated sources like IBM’s Discovery Link, the cleansing of data has to be
performed every time the data is accessed, which considerably increases the response time and
lowers efficiency.
b) Data-cleansing framework:
 In many cases, it will not be possible to derive a complete data-cleansing graph to guide the
process in advance.
 This makes data cleansing an iterative process involving significant exploration and interaction,
which may require a framework in the form of a collection of methods for error detection and
elimination in addition to data auditing.
 This can be integrated with other data-processing stages like integration and maintenance.
c) Error correction and loss of information:
 The most challenging problem within data cleansing remains the correction of values to
remove duplicates and invalid entries.
 In many cases, the available information on such anomalies is limited and insufficient to
determine the necessary transformations or corrections, leaving the deletion of such entries as
a primary solution.
 The deletion of data, though, leads to loss of information; this loss can be particularly costly if
there is a large amount of deleted data.
d) Maintenance of cleansed data:
 Data cleansing is an expensive and time-consuming process. So after having performed data
cleansing and achieving a data collection free of errors, one would want to avoid the re-
cleansing of data in its entirety after some values in data collection change.
 The process should only be repeated on values that have changed; this means that a cleansing
lineage would need to be kept, which would require efficient data collection and management
techniques.
Data Integrity
 Data integrity refers to maintaining and assuring the accuracy and consistency of data over its
entire life-cycle, and is a critical aspect to the design, implementation and usage of any system
which stores, processes, or retrieves data.
 The term data integrity is broad in scope and may have widely different meanings depending on
the specific context – even under the same general umbrella of computing. This article provides
only a broad overview of some of the different types and concerns of data integrity.
 Data integrity is the opposite of data corruption, which is a form of data loss. The overall intent
of any data integrity technique is the same: ensure data is recorded exactly as intended (such as
a database correctly rejecting mutually exclusive possibilities), and upon later retrieval, ensure
the data is the same as it was when it was originally recorded.
 In short, data integrity aims to prevent unintentional changes to information. Data integrity is
not to be confused with data security, the discipline of protecting data from unauthorized
parties.
 Any unintended changes to data as the result of a storage, retrieval or processing operation,
including malicious intent, unexpected hardware failure, and human error, is failure of data
integrity.
 If the changes are the result of unauthorized access, it may also be a failure of data security.
 Depending on the data involved, the effects could range from something as benign as a single pixel in an image
appearing a different color than was originally recorded, to the loss of vacation pictures or a
business-critical database, to even catastrophic loss of human life in a life-critical system.
 The accuracy and consistency of stored data, indicated by an absence of any alteration in data
between two updates of a data record.
 Data integrity is imposed within a database at its design stage through the use
of standard rules and procedures, and is maintained through the use of error checking
and validation routines.
Data Enrichment
 Data enrichment is a general term that refers to processes used to enhance, refine or otherwise
improve raw data.
 This idea and other similar concepts contribute to making data a valuable asset for almost any
modern business or enterprise.
 It also shows the common imperative of proactively using this data in various ways.
 Data Enrichment is the process of adding value to your data.
 In some cases, external data providers sell data, which may be used to augment existing data.
 In other cases, data from multiple internal sources are simply integrated to get the big picture.
 In any event, the intended result is a data asset that has been increased in value to the user
community.
Data Quality
 Data quality refers to the level of quality of data. There are many definitions of data quality, but
data are generally considered high quality if "they are fit for their intended uses in operations,
decision making and planning."
 Alternatively, data is deemed of high quality if it correctly represents the real-world construct to
which it refers. Furthermore, apart from these definitions, as data volume increases, the
question of internal consistency within data becomes significant, regardless of fitness for use for
any particular external purpose.
 Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
 The state of completeness, validity, consistency, timeliness and accuracy that makes data
appropriate for a specific use.
 The totality of features and characteristics of data that bears on its ability to satisfy a given
purpose; the sum of the degrees of excellence for factors related to data.
 The processes and technologies involved in ensuring the conformance of data values to business
requirements and acceptance criteria.
 Complete, standards based, consistent, accurate and time stamped.
Data Security Management
 Data security management is a way to maintain the integrity of data and to make sure that the
data is not accessible by unauthorized parties or susceptible to corruption of data.
 Data security is put in place to ensure privacy in addition to protecting this data.
 Data itself is a raw form of information that is stored on network servers, possible personal
computers and in the form of columns and rows.
 This data can be anything from personal files to intellectual property and even top-secret
information.
 Data can be considered as anything that can be understood and interpreted by humans.
 Data security methods can be acquired by using specific software solutions or hardware
mechanisms.
 One of the biggest reasons to keep data protected is that there are many corporations that
hackers want to target and breach.
 Data security tends to be a necessity for large businesses, while small ones usually have less
infrastructure in place, making the information less of a loss if breached.
 Depending on the services and content that is to be protected, there can be preventative
measures to further protect the information.
 For example, Windows Rights Management Services (RMS) can be set to control whether a
received email can be read and viewed, edited, copied or saved; these settings can also
set an expiration date for a specific document.
 By keeping data secured, it is possible to give different access to different people. For instance,
sales associates can have access to their sales databases, but are unable to access another sales
associate's information or other business information (e.g. accounts payable, accounts receivable).
By creating a single storage location (or server) for the data and assigning individuals
different access levels, keeping up with data becomes much easier. It makes it easier to maintain
the data, and permits a quick transfer to another storage location if needed. Data security
software can also ensure that secure sites (those that give access to data files) can only be
accessed by authorized personnel.
Data Access
 Data access typically refers to software and activities related to storing, retrieving, or acting
on data housed in a database or other repository. Two fundamental types of data access exist:
1) Sequential access (as in magnetic tape, for example)
2) Random access (as in indexed media)
 Data access crucially involves authorization to access different data repositories. Data access can
help distinguish the abilities of administrators and users. For example, administrators may have
the ability to remove, edit and add data, while general users may not even have "read" rights if
they lack access to particular information.
 Data access refers to a user's ability to access or retrieve data stored within a database or other
repository. Users who have data access can store, retrieve, move or manipulate stored data,
which can be stored on a wide range of hard drives and external devices.
Data Erasure
 Data erasure (also called data clearing or data wiping) is a software-based method of
overwriting the data that aims to completely destroy all electronic data residing on a hard disk
drive or other digital media.
 Permanent data erasure goes beyond basic file deletion commands, which only remove direct
pointers to the data disk sectors and make the data recovery possible with common software
tools.
 Unlike degaussing and physical destruction, which render the storage media unusable, data
erasure removes all information while leaving the disk operable, preserving IT assets and
the environment.
 New flash memory based media implementations, such as solid-state drives or USB flash
drives can cause data erasure techniques to fail allowing remnant data to be recoverable.
 To protect the data on lost or stolen media, some data erasure applications remotely destroy
the data if the password is incorrectly entered.
 Data erasure tools can also target specific data on a disk for routine erasure, providing
a hacking protection method that is less time-consuming than software encryption.
 Failure to erase data could result in:
Identity theft
Time in prison
Federal and/or civil liability
Exorbitant fines or lawsuits
Irreparable damage to an organization’s reputation
Loss in consumer confidence
Decline in revenue due to bad will and loss of clients
Data Privacy
 Data privacy, also called information privacy, is the aspect of information technology (IT) that
deals with the ability an organization or individual has to determine what data in a computer
system can be shared with third parties.
 Information privacy, or data privacy (or data protection), is the relationship between collection
and dissemination of data, technology, the public expectation of privacy, and the legal and
political issues surrounding them.
 Privacy concerns exist wherever personally identifiable information or other sensitive
information is collected, stored, used, and finally destroyed or deleted – in digital form or
otherwise. Improper or non-existent disclosure control can be the root cause for privacy issues.
 The challenge of data privacy is to utilize data while protecting individual's privacy preferences
and their personally identifiable information.
Data Security
 Data security means protecting data, such as a database, from destructive forces and from the
unwanted actions of unauthorized users.
 Data security refers to protective digital privacy measures that are applied to prevent
unauthorized access to computers, databases and websites. Data security also protects data
from corruption. Data security is the main priority for organizations of every size and genre.
 Data security is also known as information security (IS) or computer security.
 Examples of data security technologies include software/hardware disk encryption, backups,
data masking and data erasure.
 A key data security technology measure is scrambling, where digital data, software/hardware,
and hard drives are scrambled and rendered unreadable to unauthorized users and hackers.
Classification of Data Security
1) Physical Security:
 Physical security refers to the security of the hardware associated with the system and the
protection of the site where the computer resides.
 Natural events such as fire, floods, and earthquakes can be considered as some of the physical
threats.
 It is advisable to have backup copies of database in the face of massive disasters.
2) Logical Security:
 Logical security consists of software safeguards for an organization’s systems, including user
identification and password access, authenticating, access rights and authority levels.
 These measures are to ensure that only authorized users are able to perform actions or access
information in a network or a workstation.
 It is a subset of computer security.

Unit:7
Data-governance

Data governance is a control that ensures that the data entry by an operations team member or by an
automated process meets precise standards, such as a business rule, a data definition and data integrity
constraints in the data model.
The data governor uses data quality monitoring against production data to communicate errors in data
back to operational team members, or to the technical support team, for corrective action.
Data governance includes the people, processes, and information technology required to create a
consistent and proper handling of an organization's data across the business enterprise in order to
improve data quality.
Data governance is a set of processes that ensures that important data assets are formally managed
throughout the enterprise.
Data governance ensures that data can be trusted and that people can be made accountable for any
adverse event that happens because of low data quality.
It is about putting people in charge of fixing and preventing issues with data so that the enterprise can
become more efficient.
Data governance also describes an evolutionary process for a company, altering the company’s way of
thinking and setting up the processes to handle information so that it may be utilized by the entire
organization.
It’s about using technology when necessary in many forms to help aid the process.
When companies desire, or are required, to gain control of their data, they empower their people, set
up processes and get help from technology to do it.
According to one vendor, data governance is a quality control discipline for assessing, managing, using,
improving, monitoring, maintaining, and protecting organizational information. It is a system of decision
rights and accountabilities for information-related processes, executed according to agreed-upon
models which describe who can take what actions with what information, and when, under what
circumstances, using what methods.
Business enterprises will are realized following benefits by the
implementation of Data governance programs.
1. Increasing consistency and confidence in decision making
2. Decreasing the risk of regulatory fines
3. Improving data security, also defining and verifying the requirements for data distribution
policies
4. Maximizing the income generation potential of data
5. Designating accountability for information quality
6. Enable better planning by supervisory staff
7. Minimizing or eliminating re-work
8. Optimize staff effectiveness
9. Establish process performance baselines to enable improvement efforts
10. Acknowledge and hold all gain
Data Governance - Benefits
Cost
Increased efficiency reduces costs
Reduce redundant systems and associated operating costs
Reduce time and effort needed to verify and correct poor data
Regulatory
Reduce risk
Improve compliance
Opportunities
Driving business growth
Delivering improved client service
Providing prompt and accurate responses to regulatory requests
Delivering increased business intelligence from consistent and aggregated data
Information sharing and coordination across organizations.
Cross-selling
Segmenting and targeted service opportunities

7.2 Data governance drivers


While data governance initiatives can be driven by a desire to improve data quality, they are more often
driven by C-Level leaders responding to external regulations.
Examples of these regulations include Sarbanes-Oxley, Basel I, Basel II, HIPAA, and a number of data
privacy regulations.
To achieve compliance with these regulations, business processes and controls require formal
management processes to govern the data subject to these regulations.
Successful programs identify drivers meaningful to both supervisory and executive leadership.
Common themes among the external regulations center on the need to manage risk. The risks can be
financial misstatement, inadvertent release of sensitive data, or poor data quality for key decisions.
Methods to manage these risks vary from industry to industry. Examples of commonly referenced best
practices and guidelines include COBIT, ISO/IEC 38500, and others.
The proliferation of regulations and standards creates challenges for data governance professionals,
particularly when multiple regulations overlap the data being managed. Organizations often launch data
governance initiatives to address these challenges.
Data Governance - Drivers
1. Corporate Governance
Following the Enron scandal, the US government took action to hold directors accountable for the
misrepresentation of financial results with the Sarbanes-Oxley Act.
Many South African companies have dual listings or international partners and have to comply
with this act. On the local front, King III is the most comprehensive set of guidelines yet published to
ensure good financial governance for South African companies, while government entities must comply
with the Public Financial Management Act (PFMA).
2. Privacy legislation
The Consumer Protection Act, the PCI-DSS (for credit card vendors) and the pending Protection of
Personal Information bill all require that sensitive data is kept secure, is of good quality, is accessed on a
need-to-know basis only, and is disposed of when no longer needed.
3. Risk
Basel II (and the pending Basel III) and Solvency II are regulations set up to manage and assess risk for
banks and insurance companies.
Organizations are required to establish a process for data quality management and account for
adjustments to historical data.
The regulations require that risk calculations must be provably correct based on ongoing assessment
and monitoring of core risk data.
As with any large program that touches all departments in the business, a pragmatic data governance
implementation will add value.
4. Master Data Management (or other data intensive IT projects)
Master data, by definition, is used for many purposes and across many areas of the business. Data
governance helps to ensure that the MDM implementation takes all the necessary views and uses into
account, and helps to manage conflicts and resolve issues.
5. Duplicated effort
In many large organizations, we see huge duplication of effort as different teams try to address the same
issue.
At best, this may simply result in wasted expense, with multiple projects trying to achieve the same
result. At worst, projects may clash with each other, with one undoing the results of another.
7.3 Data governance initiative
Data governance initiatives improve data quality by assigning a team responsible for data's accuracy,
accessibility, consistency, and completeness, among other metrics.
This team usually consists of executive leadership, project management, line-of-business managers,
and data stewards.
The team usually employs some form of methodology for tracking and improving enterprise data, such
as Six Sigma, and tools for data mapping, profiling, cleansing, and monitoring data.
Data governance initiatives may be aimed at achieving a number of objectives, including offering better
visibility to internal and external customers (such as supply chain management), compliance
with regulatory law, improving operations after rapid company growth or corporate mergers, and aiding
the efficiency of enterprise knowledge workers by reducing confusion and error and increasing their
scope of knowledge.
Many data governance initiatives are also inspired by past attempts to fix information quality at the
departmental level, leading to incongruent and redundant data quality processes.
Most large companies have many applications and databases that can't easily share information.
Therefore, knowledge workers within large organizations often don't have access to the information
they need to best do their jobs. When they do have access to the data, the data quality may be poor. By
setting up a data governance practice or Corporate Data Authority, these problems can be mitigated.
The structure of a data governance initiative will vary not only with the size of the organization, but with
the desired objectives or the 'focus areas' of the effort.
Implementation
Implementation of a Data Governance initiative may vary in scope as well as origin.
Sometimes an executive mandate will arise to initiate an enterprise-wide effort; sometimes the
mandate will be to create a pilot project or projects, limited in scope and objectives, aimed at either
resolving existing issues or demonstrating value.
Sometimes an initiative will originate lower down in the organization’s hierarchy, and will be deployed in
a limited scope to demonstrate value to potential sponsors higher up in the organization.
The initial scope of an implementation can vary greatly as well, from review of a one-off IT system, to a
cross-organization initiative.
Unit 9 Transaction Management
Transaction:
A business deal in which goods, services, or money are passed from one person, account,
etc., to another is called a transaction. A transaction can be defined as a group of tasks. A
single task is the minimum processing unit, which cannot be divided further.

A transaction symbolizes a unit of work performed within a database management system
against a database, and treated in a consistent and reliable way independent of other
transactions. It represents any change in the database.

Transactions in a database environment have two main purposes:

1) To provide a reliable way of recovering the database and making it consistent after a
system failure leaves operations uncompleted.

2) To provide isolation between programs accessing a database concurrently. If this
isolation is not provided, the programs' outcomes may be erroneous.

9.1 ACID Properties

A transaction is a very small unit of a program, and it may contain several low-level
tasks. A transaction in a database system must maintain Atomicity, Consistency,
Isolation, and Durability − commonly known as the ACID properties − in order to ensure
accuracy, completeness, and data integrity.

Atomicity − This property states that a transaction must be treated as an atomic unit, i.e.,
either all of its operations are executed or none. There must be no state in the database
where a transaction is left partially completed. States should be defined either before the
execution of the transaction or after the execution/abortion/failure of the transaction.

Features to consider for atomicity:

 a transaction is a unit of operation - either all the transaction's actions are completed or none are
 atomicity is maintained in the presence of deadlocks
 atomicity is maintained in the presence of database software failures
 atomicity is maintained in the presence of application software failures
 atomicity is maintained in the presence of CPU failures
 atomicity is maintained in the presence of disk failures
 atomicity can be turned off at the system level
 atomicity can be turned off at the session level

Consistency − The database must remain in a consistent state after any transaction. No
transaction should have any adverse effect on the data residing in the database. If the
database was in a consistent state before the execution of a transaction, it must remain
consistent after the execution of the transaction as well. E.g., while sending and receiving
email data, the message sent by the sender remains the same at the receiving end.

Isolation − In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that each transaction will be
carried out and executed as if it were the only transaction in the system. No transaction will
affect the existence of any other transaction. E.g., in a banking system, deposit, loan and
payment transactions handled concurrently by different departments must not interfere
with one another.

Durability − The database should be durable enough to hold all its latest updates even if
the system fails or restarts. If a transaction updates a chunk of data in a database and
commits, then the database will hold the modified data. If a transaction commits but the
system fails before the data could be written on to the disk, then that data will be updated
once the system springs back into action.

Features to consider for durability:

 recovery to the most recent successful commit after a database software failure
 recovery to the most recent successful commit after an application software failure
 recovery to the most recent successful commit after a CPU failure
 recovery to the most recent successful backup after a disk failure
 recovery to the most recent successful commit after a data disk failure
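
At the application level, atomicity and durability are exercised through commit and
rollback. The following is a minimal sketch in Python using the built-in sqlite3 module;
the bank.db file and the accounts table are hypothetical examples, not part of these notes.

import sqlite3

# Minimal sketch of an atomic funds transfer (hypothetical accounts table).
conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (1, 500.0), (2, 100.0)")
conn.commit()

try:
    # Both updates belong to one transaction: debit account 1, credit account 2.
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
    conn.commit()    # the whole transfer becomes durable, all or nothing
except sqlite3.Error:
    conn.rollback()  # undo any partial change; the database stays consistent
finally:
    conn.close()

If the process fails between the two UPDATE statements, the rollback (or simply the absence
of a commit) guarantees that neither balance is changed.
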
9.2 Transaction States (States of Transactions)
A transaction in a database can be in one of the following states −

Active − In this state, the transaction is being executed. This is the initial state of every
transaction.

Partially Committed − When a transaction executes its final operation, it is said to be in a
partially committed state.

Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.

Aborted − If any of the checks fails and the transaction has reached a failed state, then
the recovery manager rolls back all its write operations on the database to bring the
database back to its original state where it was prior to the execution of the transaction.
Transactions in this state are called aborted. The database recovery module can select one
of the two operations after a transaction aborts −

o Re-start the transaction
o Kill the transaction

Committed − If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the database system.
9.2.1 Implementation of Atomicity and Durability

The recovery-management component of a database system can support atomicity and
durability by a variety of schemes. Consider a simple shadow-copy scheme. This scheme,
which is based on making copies of the database, called shadow copies, assumes that
only one transaction is active at a time. The scheme also assumes that the database is
simply a file on disk. A pointer called db-pointer is maintained on disk; it points to the
current copy of the database. In the shadow-copy scheme, a transaction that wants to
update the database first creates a complete copy of the database. All updates are done
on the new database copy, leaving the original copy, the shadow copy, untouched.

If at any point the transaction has to be aborted, the system merely deletes the new copy.
The old copy of the database has not been affected. If the transaction completes, it is
committed as follows. First, the operating system is asked to make sure that all pages of the
new copy of the database have been written out to disk. After the operating system has
written all the pages to disk, the database system updates the pointer db-pointer to point to
the new copy of the database; the new copy then becomes the current copy of the
database. The old copy of the database is then deleted. The figure below depicts the
scheme, showing the database state before and after the update.

Fig: Shadow-copy technique for atomicity and durability

The transaction is said to have been committed at the point where the updated db-pointer
is written to disk.

We now consider how the technique handles transaction and system failures. First, consider
transaction failure. If the transaction fails at any time before db-pointer is updated, the old
contents of the database are not affected. We can abort the transaction by just deleting
the new copy of the database. Once the transaction has been committed, all the updates
that it performed are in the database pointed to by db-pointer. Thus, either all updates of
the transaction are reflected, or none of the effects are reflected, regardless of transaction
failure.

Now consider the issue of system failure. Suppose that the system fails at any time before
the updated db-pointer is written to disk. Then, when the system restarts, it will read
db-pointer and will thus see the original contents of the database, and none of the effects
of the transaction will be visible on the database.

Next, suppose that the system fails after db-pointer has been updated on disk. Before the
pointer is updated, all updated pages of the new copy of the database were written to
disk. Again, we assume that, once a file is written to disk, its contents will not be damaged
even if there is a system failure. Therefore, when the system restarts, it will read db-pointer
and will thus see the contents of the database after all the updates performed by the
transaction. The implementation actually depends on the write to db-pointer being atomic;
that is, either all its bytes are written or none of them are. If some of the bytes of the pointer
were updated by the write, but others were not, the pointer is meaningless, and neither the
old nor the new version of the database may be found when the system restarts. Luckily,
disk systems provide atomic updates to entire blocks, or at least to a disk sector.

In other words, the disk system guarantees that it will update db-pointer atomically, as long
as we make sure that db-pointer lies entirely in a single sector, which we can ensure by
storing db-pointer at the beginning of a block. Thus, the atomicity and durability properties
of transactions are ensured by the shadow-copy implementation of the recovery-
management component.
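
The commit step of the shadow-copy scheme can be illustrated in a few lines of Python.
This is a minimal sketch under the scheme's own assumptions (a single active transaction
and the database stored in one file); the file names and the apply_updates callback are
hypothetical, and an atomic os.replace of the pointer file stands in for the single-sector
db-pointer write.

import os
import shutil

DB_POINTER = "db-pointer"   # hypothetical file holding the name of the current copy

def current_copy():
    with open(DB_POINTER) as f:
        return f.read().strip()

def run_transaction(apply_updates):
    old = current_copy()
    new = old + ".new"
    shutil.copyfile(old, new)    # work on a complete copy; the shadow stays untouched
    apply_updates(new)           # all updates go to the new copy only
    # (To abort instead, simply delete the new copy and return.)

    # Commit: make sure the new copy is on disk, then swing the pointer atomically.
    with open(new, "rb") as f:
        os.fsync(f.fileno())     # flush all pages of the new copy to disk
    tmp = DB_POINTER + ".tmp"
    with open(tmp, "w") as f:
        f.write(new)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, DB_POINTER)  # atomic pointer update: this is the commit point
    os.remove(old)               # the old shadow copy is no longer needed
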
9.2.2 Serializability
When multiple transactions are being executed by the operating system in a
multiprogramming environment, there are possibilities that instructions of one transaction
are interleaved with those of some other transaction.

Schedule − A chronological execution sequence of transactions is called a schedule. A
schedule can have many transactions in it, each comprising a number of
instructions/tasks.

Serial Schedule − It is a schedule in which transactions are aligned in such a way that one
transaction is executed first. When the first transaction completes its cycle, then the next
transaction is executed. Transactions are ordered one after the other. This type of schedule
is called a serial schedule, as transactions are executed in a serial manner.

In a multi-transaction environment, serial schedules are considered as a benchmark. The
execution sequence of the instructions within a transaction cannot be changed, but two
transactions can have their instructions executed in a random fashion. This execution does
no harm if two transactions are mutually independent and working on different segments of
data; but in case these two transactions are working on the same data, then the results
may vary. This ever-varying result may bring the database to an inconsistent state.

To resolve this problem, we allow parallel execution of a transaction schedule only if its
transactions are either serializable or have some equivalence relation among them.
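
As a small worked example (with hypothetical transactions and values), suppose T1 adds 50
to a balance A and T2 subtracts 30 from it, starting from A = 100.

In a serial schedule (T1 then T2), T1 reads A = 100 and writes A = 150; T2 then reads 150
and writes A = 120. In an interleaved schedule where T1 reads A = 100, T2 also reads
A = 100, T1 writes A = 150 and T2 then writes A = 70, T1's update is lost; no serial
ordering of T1 and T2 could produce A = 70, so that schedule is not serializable.
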
9.2.3 Basic concept of Concurrency control and recovery

Concurrency Control
In a multiprogramming environment where multiple transactions can be executed
simultaneously, it is highly important to control the concurrency of transactions. We have
concurrency control protocols to ensure atomicity, isolation, and serializability of concurrent
transactions.

Concurrency control protocols can be broadly divided into two categories −

a) Lock-based protocols (9.2.4 Locking Protocols)

b) Timestamp-based protocols
a) Lock-based Protocols

Database systems equipped with lock-based protocols use a mechanism by which a
transaction cannot read or write data until it acquires an appropriate lock on it.

Locks are of two kinds −

Binary Locks − A lock on a data item can be in two states; it is either locked or unlocked.

Shared/Exclusive − This type of locking mechanism differentiates the locks based on their
use. If a lock is acquired on a data item to perform a write operation, it is an exclusive lock.
Allowing more than one transaction to write on the same data item would lead the
database into an inconsistent state. Read locks are shared because no data value is being
changed.
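
The standard compatibility rule for these two lock modes can be summarized in a small
matrix: a requested lock is granted only if it is compatible with every lock currently held
on the item by other transactions.

                    Lock held by another transaction
Lock requested      Shared (S)      Exclusive (X)
Shared (S)          granted         blocked
Exclusive (X)       blocked         blocked
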
There are four types of lock protocols available:-

1) Simplistic Lock Protocol

Simplistic lock-based protocols allow transactions to obtain a lock on every object before a
'write' operation is performed. Transactions may unlock the data item after completing the
'write' operation.

2) Pre-claiming Lock Protocol

Pre-claiming protocols evaluate their operations and create a list of data items on which
they need locks. Before initiating an execution, the transaction requests the system for all
the locks it needs beforehand. If all the locks are granted, the transaction executes and
releases all the locks when all its operations are over. If all the locks are not granted, the
transaction rolls back and waits until all the locks are granted.
3) Two-Phase Locking (2PL)

This locking protocol divides the execution of a transaction into three parts. In the first
part, when the transaction starts executing, it seeks permission for the locks it requires. The
second part is where the transaction acquires all the locks. As soon as the transaction
releases its first lock, the third part starts. In this part, the transaction cannot demand
any new locks; it only releases the acquired locks.

Two-phase locking thus has two phases: a growing phase, where all the locks are being
acquired by the transaction, and a shrinking phase, where the locks held by the
transaction are being released.

To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock and
then upgrade it to an exclusive lock. A sketch of the two-phase discipline follows.
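
The following minimal Python sketch enforces the growing/shrinking discipline (the
TwoPhaseTransaction class and its lock table are hypothetical; a real lock manager would
add shared/exclusive modes, wait queues, lock upgrades, and deadlock handling):

import threading

class TwoPhaseTransaction:
    """Sketch of 2PL bookkeeping: after the first release, no lock may be acquired."""

    def __init__(self, lock_table):
        self.lock_table = lock_table   # maps data item -> threading.Lock
        self.held = set()
        self.shrinking = False         # flips to True at the first release

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after first release")
        self.lock_table[item].acquire()   # growing phase
        self.held.add(item)

    def release(self, item):
        self.shrinking = True             # shrinking phase begins
        self.held.discard(item)
        self.lock_table[item].release()

    def release_all(self):
        # Releasing everything only at commit gives the Strict-2PL behaviour
        # described in the next protocol.
        for item in list(self.held):
            self.release(item)
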
4) Strict Two-Phase Locking

The first phase of Strict-2PL is the same as in 2PL. After acquiring all the locks in the first
phase, the transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does
not release a lock after using it. Strict-2PL holds all the locks until the commit point and
releases all the locks at one time.

Strict-2PL does not have cascading aborts as 2PL does.
b) Timestamp-based Protocols

The most commonly used concurrency protocol is the timestamp-based protocol. This
protocol uses either system time or a logical counter as a timestamp.

Lock-based protocols manage the order between the conflicting pairs among transactions
at the time of execution, whereas timestamp-based protocols start working as soon as a
transaction is created.

Every transaction has a timestamp associated with it, and the ordering is determined by the
age of the transaction. A transaction created at clock time 0002 would be older than all
other transactions that come after it. For example, any transaction 'y' entering the system at
0004 is two seconds younger, and priority would be given to the older one.

In addition, every data item is given the latest read and write timestamps. This lets the
system know when the last 'read' and 'write' operations were performed on the data item.
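
The basic timestamp-ordering checks can be sketched as follows (the DataItem class is a
hypothetical illustration; a real protocol would also roll back the aborted transaction and
restart it with a new timestamp):

class DataItem:
    """Keeps the latest read and write timestamps, as described above."""
    def __init__(self):
        self.read_ts = 0    # timestamp of the youngest transaction that read the item
        self.write_ts = 0   # timestamp of the youngest transaction that wrote the item

def read(item, ts):
    # A transaction may not read a value already overwritten by a younger transaction.
    if ts < item.write_ts:
        raise RuntimeError("abort: item was overwritten by a younger transaction")
    item.read_ts = max(item.read_ts, ts)

def write(item, ts):
    # A transaction may not overwrite data already read or written by a younger one.
    if ts < item.read_ts or ts < item.write_ts:
        raise RuntimeError("abort: conflicting access by a younger transaction")
    item.write_ts = ts
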
Recovery:
Crash Recovery
A DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depend on its complex architecture and
its underlying hardware and system software. If it fails or crashes in the middle of
transactions, the system is expected to follow some sort of algorithm or technique to
recover the lost data.

Failure Classification
Failures can be generalized into the following categories:-

1) Transaction failure

A transaction has to abort when it fails to execute or when it reaches a point from where it
cannot go any further. This is called transaction failure, where only a few transactions or
processes are affected.

Reasons for a transaction failure could be −

Logical errors − Where a transaction cannot complete because it has some code error or
an internal error condition.

System errors − Where the database system itself terminates an active transaction
because the DBMS is not able to execute it, or it has to stop because of some system
condition. For example, in case of deadlock or resource unavailability, the system aborts an
active transaction.

2) System Crash

There are problems − external to the system − that may cause the system to stop abruptly
and crash. For example, interruptions in the power supply may cause the failure of the
underlying hardware or software, including operating system errors.

3) Disk Failure

In the early days of technology evolution, it was a common problem that hard-disk drives or
storage drives failed frequently.

Disk failures include the formation of bad sectors, unreachability of the disk, disk head
crashes, or any other failure that destroys all or part of the disk storage.
Recovery and Atomicity

When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations,
which are atomic in nature. But according to the ACID properties of a DBMS, atomicity of
transactions as a whole must be maintained, i.e., either all the operations are executed or
none.

When a DBMS recovers from a crash, it should maintain the following −

 It should check the states of all the transactions which were being executed.
 A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
 It should check whether the transaction can be completed now or needs to be rolled
back.
 No transaction should be allowed to leave the DBMS in an inconsistent state.

There are two types of techniques which can help a DBMS in recovering, as well as in
maintaining the atomicity of a transaction:-

 Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
 Maintaining shadow paging, where the changes are made in volatile memory, and
later the actual database is updated.
Log-based Recovery
A log is a sequence of records which maintains a record of the actions performed by a
transaction. It is important that the logs are written prior to the actual modification and
stored on a stable storage medium, which is failsafe.

The database can be modified using two approaches:-

 Deferred database modification − All logs are written on to the stable storage, and the
database is updated only when a transaction commits.
 Immediate database modification − Each log record is followed by an actual database
modification. That is, the database is modified immediately after every operation.
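
Deferred modification can be sketched in a few lines of Python (the log record format
and the in-memory structures are hypothetical illustrations; a real log also stores old
values, abort records, and checkpoints on stable storage):

# Sketch of deferred database modification: writes are only logged until commit.
log = []          # append-only log, assumed to live on stable storage
database = {}     # the actual data items

def write(txn_id, item, new_value):
    # Write-ahead: record the intended change; the database is not touched yet.
    log.append((txn_id, item, new_value))

def commit(txn_id):
    log.append((txn_id, "COMMIT", None))
    # Only after the commit record is logged are the updates applied.
    for t, item, value in log:
        if t == txn_id and item != "COMMIT":
            database[item] = value

If the system crashes before the commit record reaches the log, the database was never
modified, so nothing has to be undone; after the commit record, the logged updates can
simply be replayed.
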
Recovery with Concurrent Transactions

When more than one transaction is being executed in parallel, the logs are interleaved.
At the time of recovery, it would become hard for the recovery system to backtrack all the
logs and then start recovering. To ease this situation, most modern DBMSs use the concept
of 'checkpoints'.

Checkpoint

Keeping and maintaining logs in real time and in a real environment may fill up all the
memory space available in the system. As time passes, the log file may grow too big to be
handled at all. Checkpointing is a mechanism where all the previous logs are removed from
the system and stored permanently on a storage disk. A checkpoint declares a point before
which the DBMS was in a consistent state and all the transactions were committed.
Recovery

When a system with concurrent transactions crashes and recovers, it behaves in the
following manner:-

 The recovery system reads the logs backwards from the end to the last checkpoint.
 It maintains two lists, an undo-list and a redo-list.
 If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn,
Commit>, it puts the transaction in the redo-list.
 If the recovery system sees a log with <Tn, Start> but no commit or abort log is found,
it puts the transaction in the undo-list.

All the transactions in the undo-list are then undone, and their logs are removed. All the
transactions in the redo-list are then redone using their log records.
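
Building the two lists can be sketched as follows (the record format and the
classify_transactions helper are hypothetical; the slice of the log written since the last
checkpoint is assumed to be already available):

def classify_transactions(log_records):
    """Split transactions seen since the last checkpoint into a redo-list
    (committed) and an undo-list (started but never committed or aborted)."""
    started, committed, aborted = set(), set(), set()
    for txn, action in log_records:            # e.g. ("T1", "Start"), ("T1", "Commit")
        if action == "Start":
            started.add(txn)
        elif action == "Commit":
            committed.add(txn)
        elif action == "Abort":
            aborted.add(txn)

    redo_list = committed                      # <Tn, Start> ... <Tn, Commit> was seen
    undo_list = started - committed - aborted  # <Tn, Start> with no Commit/Abort
    return redo_list, undo_list

# Example: T1 committed before the crash, T2 did not.
redo, undo = classify_transactions(
    [("T1", "Start"), ("T2", "Start"), ("T1", "Commit")]
)
# redo == {"T1"}, undo == {"T2"}
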