
Database Management Systems Lecture Note

Chapter 1
Introduction to Database System
Database systems are designed to manage large data sets in an organization. Data
management involves both the definition and the manipulation of the data, which ranges from
simple representation of the data to considerations of structures for the storage of information.
Data management also considers the provision of mechanisms for the manipulation of
information.
Today, Databases are essential to every business. They are used to maintain internal records,
to present data to customers and clients on the World-Wide-Web, and to support many other
commercial processes. Databases are likewise found at the core of many modern
organizations.
The power of databases comes from a body of knowledge and technology that has developed
over several decades and is embodied in specialized software called a database management
system, or DBMS. A DBMS is a powerful tool for creating and managing large amounts of data
efficiently and allowing it to persist over long periods of time, safely. These systems are among
the most complex types of software available.
Thus, for our question: What is a database? In essence, a database is nothing more than a
collection of shared information that exists over a long period of time, often many years. In
common parlance, the term database refers to a collection of data that is managed by a DBMS.
Thus the DB course is about:
 How to organize data
 Supporting multiple users
 Efficient and effective data retrieval
 Secured and reliable storage of data
 Maintaining consistent data
 Making information useful for decision making
Data management has passed through different levels of development along with the
development in technology and services. These levels can best be described by categorizing
them into three levels of development. Even though each new level brings advantages and
overcomes problems of the previous one, all methods of data handling are still in use to some
extent. The three major levels are:
1. Manual Approach
2. Traditional File Based Approach
3. Database Approach


1. Manual Approach
In the manual approach, data storage and retrieval follow the primitive and traditional way
of information handling, where cards and paper are used for the purpose. Data storage and
retrieval are performed using human labour.
 Files for as many events and objects as the organization has are used to store
information.
 Each of the files containing various kinds of information is labelled and stored in one
or more cabinets.
 The cabinets could be kept in safe places for security purposes, based on the sensitivity of
the information contained in them.
 Insertion and retrieval are done by searching first for the right cabinet, then for the right
file, and then for the information.
 One could have an indexing system to facilitate access to the data
Limitations of the Manual approach
 Prone to error
 Difficult to update, retrieve, integrate
 You have the data but it is difficult to compile the information
 Limited to small amounts of information
 Cross referencing is difficult
An alternative approach to data handling is a computerized way of dealing with the
information. The computerized approach could be either decentralized or centralized, based
on where the data resides in the system.
2. Traditional File Based Approach
After the introduction of computers for data processing to the business community, the need to
use the device for data storage and processing increased. There were, and still are, several
computer applications with file based processing used for the purpose of data handling. Even
though the approach evolved over time, the basic structure is still similar, if not identical.
 File based systems were an early attempt to computerize the manual filing system.
 This approach is the decentralized computerized data handling method.
 A collection of application programs performs services for the end-users. In such
systems, every application program that provides service to end users defines and
manages its own data.
 Such systems have a number of programs for each of the different applications in the
organization.
 Since every application defines and manages its own data, the system is subject to
serious data duplication problems.
 A file, in the traditional file based approach, is a collection of records which contain
logically related data.


Limitations of the Traditional File Based approach


As business applications became more complex, demanding more flexible and reliable data
handling methods, the shortcomings of the file based system became evident. These
shortcomings include, but are not limited to:
 Separation or Isolation of Data: information available in one application may not be
known to other applications. Data synchronisation is done manually.
 Limited data sharing- every application maintains its own data.
 Lengthy development and maintenance time
 Duplication or redundancy of data (money and time cost and loss of data integrity)
 Data dependency on the application- data structure is embedded in the application;
hence, a change in the data structure needs to change the application as well.
 Incompatible file formats or data structures (e.g. "C" and COBOL) between different
applications and programs creating inconsistency and difficulty to process jointly.
 Fixed query processing which is defined during application development


The limitations of the traditional file based data handling approach arise from two basic
reasons:
1. Definition of the data is embedded in the application program which makes it
difficult to modify the database definition easily.
2. No control over the access and manipulation of the data beyond that imposed by
the application programs.
The most significant problems experienced by the traditional file based approach of data
handling can be formalized as what are called "update anomalies". There are three types of
update anomalies:
1. Modification Anomalies: a problem experienced when one or more data values are
modified in one application program but not in others containing the same data set.
2. Deletion Anomalies: a problem encountered when one record set is deleted from one
application but remains untouched in other application programs.
3. Insertion Anomalies: a problem experienced whenever there is a new data item to be
recorded and the recording is not made in all the applications. Also, when the same data
item is inserted in different applications, there could be errors in encoding which cause
the new data item to be considered a totally different object.
3. Database Approach
Following a famous paper written by Dr. Edgar Frank Codd in 1970, database systems
changed significantly. Codd proposed that database systems should present the user with a
view of data organized as tables called relations. Behind the scenes, there might be a complex
data structure that allowed rapid response to a variety of queries. But, unlike the user of earlier
database systems, the user of a relational system would not be concerned with the storage
structure. Queries could be expressed in a very high-level language, which greatly increased
the efficiency of database programmers. The database approach emphasizes the integration
and sharing of data throughout the organization.
Thus in Database Approach:
 Database is just a computerized record keeping system or a kind of electronic filing
cabinet.
 Database is a repository for collection of computerized data files.
 Database is a shared collection of logically related data and description of data designed
to meet the information needs of an organization. Since it is a shared corporate resource,
the database is integrated with minimum amount of or no duplication.
 Database is a collection of logically related data where these logically related data
comprises entities, attributes, relationships, and business rules of an organization's
information.


 In addition to containing data required by an organization, a database also contains a
description of the data, which is known as "Metadata", "Data Dictionary", "Systems
Catalogue", "Data about Data" or sometimes "Data Directory".
 Since a database contains information about the data (metadata), it is called a self-
descriptive collection of integrated records.
 The purpose of a database is to store information and to allow users to retrieve and
update that information on demand.
 A database is designed once and used simultaneously by many users.
 Unlike the traditional file based approach, in the database approach there is program-data
independence, that is, the separation of the data definition from the application. Thus the
application is not affected by changes made in the data structure and file organization.
 Each database application will perform some combination of: creating the database,
reading, updating and deleting data.
Benefits of the database approach
 Data can be shared: two or more users can access and use same data instead of storing
data in redundant manner for each user.
 Improved accessibility of data: by using structured query languages, the users can easily
access data without programming experience.
 Redundancy can be reduced: isolated data is integrated in database to decrease the
redundant data stored at different applications.
 Quality data can be maintained: the different integrity constraints in the database
approach will maintain the quality leading to better decision making
 Inconsistency can be avoided: controlled data redundancy will avoid inconsistency of the
data in the database to some extent.
 Transaction support can be provided: the basic demands of any transaction support system
are implemented in a full scale DBMS.
 Integrity can be maintained: data at different applications will be integrated together with
additional constraints to facilitate validity and consistency of shared data resource.
 Security measures can be enforced: the shared data can be secured by having different
levels of clearance and other data security mechanisms.
 Improved decision support: the database will provide information useful for decision
making.
 Standards can be enforced: the different ways of using and dealing with data by different
units of an organization can be balanced and standardized by using the database approach.
 Compactness: since it is an electronic data handling method, the data is stored compactly
(no voluminous papers).


 Speed: data storage and retrieval is fast as it will be using the modern fast computer
systems.
 Less labour: unlike the other data handling methods, data maintenance will not demand
much resource.
 Centralized information control: since relevant data in the organization will be stored at
one repository, it can be controlled and managed at the central level.

Limitations and risks of the Database Approach


 Introduction of new professional and specialized personnel.
 Complexity in designing and managing data
 The cost and risk during conversion from the old to the new system
 High cost to be incurred to develop and maintain the system
 Complex backup and recovery services from the user's perspective
 Reduced performance due to centralization and data independence
 High impact on the system when failure occurs to the central system.


Database Management System (DBMS)


Database Management System (DBMS) is a Software package used for providing EFFICIENT,
CONVENIENT and SAFE MULTI-USER (many people/programs accessing same database, or even same data,
simultaneously) storage of and access to MASSIVE amounts of PERSISTENT (data outlives programs that
operate on it) data. A DBMS also provides a systematic method for creating, updating, storing
and retrieving data in a database. A DBMS also provides services for controlling data access,
enforcing data integrity, managing concurrency control, and recovery. With this in mind, a
full scale DBMS should provide at least the following services to the user.

1. Data storage, retrieval and update in the database


2. A user accessible catalogue
3. Transaction support service: ALL-or-NONE transactions, which minimize data
inconsistency.
4. Concurrency Control Services: access and update on the database by different
users simultaneously should be implemented correctly.
5. Recovery Services: a mechanism for recovering the database after a failure must be
available.
6. Authorization Services (Security): must support the implementation of access and
authorization service to database administrator and users.
7. Support for Data Communication: should provide the facility to integrate with
data transfer software or data communication managers.
8. Integrity Services: rules about data and the change that took place on the data,
correctness and consistency of stored data, and quality of data based on business
constraints.
9. Services to promote data independence between the data and the application
10. Utility services: sets of utility service facilities like
 Importing data
 Statistical analysis support
 Index reorganization
 Garbage collection


DBMS and Components of DBMS Environment

Fig. General architecture of a DBMS

A DBMS is a software package used to design, manage, and maintain databases. Each DBMS
should have facilities to define the database, manipulate the content of the database and
control the database. These facilities will help the designer, the user as well as the database
administrator to discharge their responsibility in designing, using and managing the
database. It provides the following facilities:
 Data Definition Language (DDL):
o Language used to define each data element required by the organization.
o Commands for setting up schema or the intension of database
o These commands are used to setup a database, create, delete and alter table with
the facility of handling constraints
 Data Manipulation Language (DML):
o Is a core command used by end-users and programmers to store, retrieve, and
access the data in the database e.g. SQL

8
Database Management Systems Lecture Note

o Since the required data or Query by the user will be extracted using this type of
language, it is also called "Query Language"
 Data Dictionary:
o Due to the fact that a database is a self describing system, this tool, Data
Dictionary, is used to store and organize information about the data stored in the
database.
 Data Control Language:
o Database is a shared resource that demands control of data access and usage. The
database administrator should have the facility to control the overall operation of
the system.
o Data Control Languages are commands that will help the Database
Administrator to control the database.
o The commands include granting or revoking privileges to access the database or
particular objects within the database, and to commit (store) or roll back (remove)
database transactions.
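As a rough illustration of these three language facilities (DDL, DML and DCL), the following SQL fragment is a sketch only; the table, column and user names are hypothetical and the exact syntax varies slightly between DBMS products.

    -- DDL: defines the schema (intension) of the database
    CREATE TABLE Student (
        studentId   CHAR(10) PRIMARY KEY,
        studentName CHAR(50) NOT NULL,
        dob         DATE
    );

    -- DML: stores and retrieves data (the "query language")
    INSERT INTO Student (studentId, studentName, dob)
    VALUES ('ST0001', 'Abebe Bekele', '1990-05-12');

    SELECT studentName, dob
    FROM   Student
    WHERE  studentId = 'ST0001';

    -- DCL: the DBA grants or revokes privileges on database objects
    GRANT SELECT ON Student TO registrar_clerk;
    REVOKE SELECT ON Student FROM registrar_clerk;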
The DBMS is a software package that helps to design, manage, and use data using the database
approach. Taking a DBMS as a system, one can describe it with respect to its environment or
the other systems interacting with it. The DBMS environment has five components. To
design and use a database, there will be interaction or integration of Hardware, Software,
Data, Procedures and People.
1. Hardware: components that one can touch and feel. These components comprise
various types of personal computers, mainframe or other server computers to be used
in a multi-user system, network infrastructure, and other peripherals required
in the system.
2. Software: a collection of commands and programs used to manipulate the
hardware to perform a function. These include components like the DBMS software,
application programs, operating systems, network software, language software and
other relevant software.
3. Data: since the goal of any database system is to have better control of the data and
to make the data useful, data is the most important component to the user of the database.
There are two categories of data in any database system: Operational data and
Metadata. Operational data is the data actually stored in the system to be used by the
user. Metadata is the data used to store information about the database itself.
The structure of the data in the database is called the schema, which is composed of the
entities, properties of entities, relationships between entities and business constraints.


4. Procedure: these are the rules and regulations on how to design and use a database. They
include procedures such as how to log on to the DBMS, how to use its facilities, how to start
and stop the DBMS, how to make backups, how to handle hardware and software failures, and
how to change the structure of the database.
5. People: this component is composed of the people in the organization who are
responsible for, or play a role in, designing, implementing, managing, administering and
using the resources in the database. This component ranges from groups of people with a high
level of knowledge about the database and the design technology to others with no
knowledge of the system except using the data in the database.

Database Development Life Cycle (DDLC)


As database development is one component of most information system development tasks,
there are several steps in designing a database system. Here more emphasis is given to the
design phases of the system development life cycle. The major steps in database design are:
1. Planning: that is identifying information gap in an organization and propose a
database solution to solve the problem.
2. Analysis: that concentrates more on fact finding about the problem or the
opportunity. Feasibility analysis, requirement determination and structuring, and
selection of best design method are also performed at this phase.
3. Design: in database development more emphasis is given to this phase. The phase is
further divided into three sub-phases.
a. Conceptual Design: concise description of the data, data type, relationship
between data and constraints on the data.
 There is no implementation or physical detail consideration.
 Used to elicit and structure all information requirements
b. Logical Design: a higher level conceptual abstraction with selected specific data
model to implement the data structure.
 It is independent of any particular DBMS and involves no other physical considerations.
c. Physical Design: physical implementation of the logical design of the database
with respect to internal storage and file structure of the database for the selected
DBMS.
 To develop all technology and organizational specification.
4. Implementation: the testing and deployment of the designed database for use.
5. Operation and Support: administering and maintaining the operation of the
database system and providing support to users. Tuning the database operations for
best performance.


Roles in Database Design and Use


As people are one of the components of the DBMS environment, there are a number of roles
played by different stakeholders in the design and operation of a database system.
1. Database Administrator (DBA)
 Responsible to oversee, control and manage the database resources (the database itself,
the DBMS and other related software)
 Authorizing access to the database

 Coordinating and monitoring the use of the database

 Responsible for determining and acquiring hardware and software resources

 Accountable for problems like poor security, poor performance of the system

 Involves in all steps of database development

We can further classify this role in big organizations having huge amounts of data and user
requirements.
a. Data Administrator (DA): is responsible for the management of data resources. This
involves database planning, development, and maintenance of standards, policies and
procedures at the conceptual and logical design phases.
b. Database Administrator (DBA): this is the more technically oriented role. The DBA is
responsible for the physical realization of the database and is involved in physical design,
implementation, security and integrity control of the database.
2. Database Designer (DBD)
 Identifies the data to be stored and choose the appropriate structures to represent and
store the data.
 Should understand the user requirements and choose how the user views the
database.
 Involved in the design phase before the implementation of the database system.
There are two kinds of database designers: one involved in the logical and conceptual
design and another involved in the physical design.
a. Logical and Conceptual DBD
 Identifies data (entity, attributes and relationship) relevant to the organization
 Identifies constraints on each data
 Understand data and business rules in the organization
 Sees the database independent of any data model at conceptual level and consider
one specific data model at logical design phase.
b. Physical DBD
 Takes the logical design specification as input and decides how it should be physically realized.
 Maps the logical data model onto the specified DBMS with respect to tables and integrity
constraints (DBMS dependent design)
 Selects specific storage structures and access paths to the database
 Designs security measures required on the database


3. Application Programmer and Systems Analyst


 System analyst determines the user requirement and how the user wants to view the
database.
 The application programmer implements these specifications as programs; code,
test, debug, document and maintain the application program.
 The application programmer determines the interface on how to retrieve, insert,
update and delete data in the database.
 The application could use any high level programming language according to the
availability, the facility and the required service.
4. End Users
Workers whose job requires accessing the database frequently for various purposes. There are
different groups of users in this category:
a. Naïve Users:
 Sizable proportion of users
 Unaware of the DBMS
 Only access the database based on their access level and demand
 Use standard and pre-specified types of queries.
b. Sophisticated Users
 Users familiar with the structure of the Database and facilities of the DBMS.
 Have complex requirements
 Have higher level queries
 Are most of the time engineers, scientists, business analysts, etc
c. Casual Users
 Users who access the database occasionally.
 Need different information from the database each time.
 Use sophisticated database queries to satisfy their needs.
 Are most of the time middle to high level managers.
These users can be again classified as "Actors on the Scene" and "Workers Behind the Scene".
Actors on the Scene:
 Data Administrator
 Database Administrator
 Database Designer
 End Users
Workers behind the scene
 DBMS designers and implementers: who design and implement different DBMS software.
 Tool Developers: experts who develop software packages that facilitate database system
designing and use. Prototype, simulation, and code generator developers are examples.
Independent software vendors could also be categorized in this group.
 Operators and Maintenance Personnel: system administrators who are responsible for
actually running and maintaining the hardware and software of the database system and the
information technology facilities.


ANSI-SPARC Architecture
The purpose and origin of the Three-Level database architecture
 All users should be able to access the same data. This is important since the database
has a shared data feature where all the data is stored in one location and all users
have their own customized way of interacting with the data.
 A user's view is unaffected or immune to changes made in other views. Since the
requirements of one user are independent of those of another, a change made in one user's
view should not affect other users.
 Users should not need to know physical database storage details. As there are naïve
users of the system, hardware level or physical details should be a black-box for
such users.
 The DBA should be able to change database storage structures without affecting the
users' views. A change in file organization or access method should not affect the
structure of the data, which in turn will have no effect on the users.
 Internal structure of database should be unaffected by changes to physical aspects of
storage, such as change of hard disk
 The DBA should be able to change the conceptual structure of the database without
affecting all users. In any database system, the DBA has the privilege to change the
structure of the database, such as adding tables, adding or deleting an attribute, or
changing the specification of the objects in the database.
All of the above functionalities, and more, are possible due to the three-level
ANSI-SPARC architecture.

Three-level ANSI-SPARC Architecture of a Database


ANSI-SPARC Architecture and Database Design Phases

1. External Level: Users' view of the database. It describes that part of database that is
relevant to a particular user. Different users have their own customized view of the
database independent of other users.
2. Conceptual Level: Community view of the database. Describes what data is stored
in database and relationships among the data along with the business constraints.
3. Internal Level: Physical representation of the database on the computer. Describes
how the data is stored in the database.

The following example can be taken as an illustration of the difference between the three
levels in the ANSI-SPARC database architecture, where:
 The first level is concerned with each group of users and their respective data
requirements, independent of the others.
 The second level describes the whole content of the database, where each piece of
information is represented only once.
 The third level describes how the data is actually stored on the physical storage devices.

Differences between Three Levels of ANSI-SPARC Architecture


Defines DBMS schemas at three levels:


 Internal schema: at the internal level, to describe physical storage structures and access
paths. Typically uses a physical data model, i.e. one specific to the DBMS.
 Conceptual schema: at the conceptual level to describe the structure and constraints for
the whole database for a community of users. It uses a conceptual or an implementation
data model.
 External schema: at the external level to describe the various user views. Usually uses
the same data model as the conceptual level.
Data Independence
Logical Data Independence:
 Refers to immunity of external schemas to changes in conceptual schema.
 Conceptual schema changes e.g. addition/removal of entities should not require
changes to external schema or rewrites of application programs.
 The capacity to change the conceptual schema without having to change the external
schemas and their application programs.
Physical Data Independence
 The ability to modify the physical schema without changing the logical schema
 Applications depend on the logical schema
 In general, the interfaces between the various levels and components should be well
defined so that changes in some parts do not seriously influence others.
 The capacity to change the internal schema without having to change the conceptual
schema
 Refers to immunity of conceptual schema to changes in the internal schema
 Internal schema changes e.g. using different file organizations, storage
structures/devices should not require change to conceptual or external schemas.
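As a small hedged illustration of physical data independence (the table and index names below are assumptions): the DBA can add or change a storage structure, such as an index, without requiring any change to the queries that applications already issue.

    -- Application query: identical before and after the physical change
    SELECT empName
    FROM   Employee
    WHERE  deptId = 'FIN';

    -- DBA changes the internal schema by adding an access path;
    -- the conceptual and external schemas, and the query above, are untouched
    CREATE INDEX idx_employee_dept ON Employee (deptId);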

Data Independence and the ANSI-SPARC Three-level Architecture


The distinction between a Data Definition Language (DDL) and a Data Manipulation
Language (DML)
Database Languages
Data Definition Language (DDL)
 Allows the DBA or user to describe and name the entities, attributes and relationships
required for the application.
 Specification notation for defining the database schema
Data Manipulation Language (DML)
 Provides basic data manipulation operations on data held in the database.
 Language for accessing and manipulating the data organized by the appropriate
data model
 DML also known as query language
Procedural DML: user specifies what data is required and how to get the data.
Non-Procedural DML: user specifies what data is required but not how it is to be
retrieved
Data Control Language (DCL)
 Allows a DBA to define access control and privileges for users.
 It is a mechanism for implementing security at a database object level.
 Uses the Grant and Revoke SQL Statements
SQL is the most widely used non-procedural query language
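For example, the following non-procedural statement (using a hypothetical Employee table) only states what data is required; how the rows are located, and which files, indexes or scan order are used, is decided entirely by the DBMS.

    -- Declarative request: no navigation or access-path instructions are given
    SELECT empName, salary
    FROM   Employee
    WHERE  deptId = 'FIN'
    ORDER BY salary DESC;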
Fourth Generation Language (4GL)
 Query Languages
 Forms Generators
 Report Generators
 Graphics Generators
 Application Generators
A Classification of data models
Data Model
A specific DBMS has its own specific Data Definition Language to define a database schema,
but this type of language is too low level to describe the data requirements of an organization
in a way that is readily understandable by a variety of users.
We need a higher-level language.
Such a higher-level description of the database schema is called a data model.
Data Model: a set of concepts to describe the structure of a database, and certain constraints
that the database should obey.
A data model is a description of the way that data is stored in a database. A data model helps
to understand the relationship between entities and to create the most effective structure to
hold data.


Data Model is a collection of tools or concepts for describing


 Data
 Data relationships
 Data semantics
 Data constraints
The main purpose of Data Model is to represent the data in an understandable way.
Categories of data models include:
 Object-based
 Record-based
 Physical
Record-based Data Models
Consist of a number of fixed format records.
Each record type defines a fixed number of fields,
Each field is typically of a fixed length.
 Hierarchical Data Model
 Network Data Model
 Relational Data Model
1. Hierarchical Model
 The simplest data model
 Record type is referred to as node or segment
 The top node is the root node
 Nodes are arranged in a hierarchical structure as sort of upside-down tree
 A parent node can have more than one child node
 A child node can only have one parent node
 The relationship between parent and child is one-to-many
 Relationships are established by creating physical links between stored records (each is
stored with a predefined access path to other records)
 To add a new record type or relationship, the database must be redefined and then
stored in a new form.

Fig. Example hierarchy: Department is the root node with child nodes Employee and Job;
Employee has the child node Time Card, and Job has the child node Activity.


ADVANTAGES of Hierarchical Data Model:


 Hierarchical Model is simple to construct and operate on
 Corresponds to a number of natural hierarchically organized domains - e.g.,
assemblies in manufacturing, personnel organization in companies
 Language is simple; uses constructs like GET, GET UNIQUE, GET NEXT,
GET NEXT WITHIN PARENT etc.
DISADVANTAGES of Hierarchical Data Model:
 Navigational and procedural nature of processing
 Database is visualized as a linear arrangement of records
 Little scope for "query optimization"
2. Network Model
 Allows record types to have more than one parent, unlike the hierarchical model
 A network data model sees records as set members
 Each set has an owner and one or more members
 Does not directly allow many-to-many relationships between entities (each set is one-to-many)
 Like the hierarchical model, the network model is a collection of physically linked records.
 Allows member records to have more than one owner

ADVANTAGES of Network Data Model:


 Network Model is able to model complex relationships and represents semantics of
add/delete on the relationships.
 Can handle most situations for modeling using record types and relationship types.
 Language is navigational; uses constructs like FIND, FIND member, FIND owner,
FIND NEXT within set, GET etc. Programmers can do optimal navigation through
the database.
DISADVANTAGES of Network Data Model:
 Navigational and procedural nature of processing
 Database contains a complex array of pointers that thread through a set of records.
 Little scope for automated "query optimization"


3. Relational Data Model


 Developed by Dr. Edgar Frank Codd in 1970 (famous paper, 'A Relational Model of Data
for Large Shared Data Banks')
 Terminologies originates from the branch of mathematics called set theory and
predicate logic and is based on the mathematical concept called Relation
 Can define more flexible and complex relationship
 Viewed as a collection of tables called "Relations", equivalent to a collection of record
types
 Relation: a two dimensional table
 Stores information or data in the form of tables, i.e. as rows and columns
 A row of a table is called a tuple, equivalent to a record
 A column of a table is called an attribute, equivalent to a field
 A data value is the value of an Attribute
 Records are related by the data stored jointly in the fields of records in two tables or
files. The related tables contain information that creates the relation
 The tables seem to be independent but are related somehow.
 No physical consideration of the storage is required by the user
 Many tables are merged together to come up with a new virtual view of the
relationship

Alternative terminologies
  Relation  = Table  = File
  Tuple     = Row    = Record
  Attribute = Column = Field

 The rows represent records (collections of information about separate items)


 The columns represent fields (particular attributes of a record)
 Conducts searches by using data in specified columns of one table to find
additional data in another table
 In conducting searches, a relational database matches information from a field in
one table with information in a corresponding field of another table to produce a
third table that combines requested data from both tables


Chapter Two
Relational Data Model
Important terms:
Relation: a table with rows and columns
Attribute: a named column of a relation
Domain: a set of allowable values for one or more attributes
Tuple: a row of a relation
Degree: the degree of a relation is the number of attributes it contains
Unary relation, Binary relation, Ternary relation, N-ary relation
Cardinality: of a relation is the number of tuples the relation has
Relational Database: a collection of normalized relations with distinct relation names.
Relation Schema: a named relation defined by a set of attribute-domain name pairs.
Let A1, A2, ..., An be attributes with domains D1, D2, ..., Dn.
Then the set {A1:D1, A2:D2, ..., An:Dn} is a Relation Schema. A relation R, defined by a
relation schema S, is a set of mappings from attribute names to their corresponding
domains. Thus a relation is a set of n-tuples of the form
(A1:d1, A2:d2, ..., An:dn) where d1 ∈ D1, d2 ∈ D2, ..., dn ∈ Dn.
E.g. Student (studentId char(10), studentName char(50), DOB date) is a relation schema for
the student entity in SQL.
Relational Database Schema: a set of relation schemas, each with a distinct name.
Suppose R1, R2, ..., Rn is the set of relation schemas in a relational database; then the
relational database schema (R) can be stated as: R = {R1, R2, ..., Rn}
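As a sketch, the relation schemas above can be written in SQL as a set of table definitions which together form the relational database schema; the Course relation and its attributes are added here only for illustration.

    -- Relation schema Student = {studentId:char(10), studentName:char(50), DOB:date}
    CREATE TABLE Student (
        studentId   CHAR(10),
        studentName CHAR(50),
        dob         DATE
    );

    -- A second relation schema; together R = {Student, Course}
    CREATE TABLE Course (
        courseId   CHAR(10),
        courseName CHAR(50),
        creditHrs  INT
    );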

Properties of Relational Databases


 A relation has a name that is distinct from all other relation names in the relational
schema.
 Each tuple in a relation must be unique
 All tables are LOGICAL ENTITIES
 Each cell of a relation contains exactly one atomic (single) value.
 Each column (field or attribute) has a distinct name.
 The values of an attribute are all from the same domain.
 A table is either a BASE TABLE (Named Relation) or a VIEW (Unnamed Relation)
 Only Base Tables are physically stored
 VIEWS are derived from BASE TABLES with SQL statements like:
[SELECT .. FROM .. WHERE .. ORDER BY]


 Relational database is the collection of tables


o Each entity in one table
o Attributes are fields (columns) in table
 The order of rows and of columns is theoretically immaterial (though in practice row
order can have an impact on performance)
 Entries with repeating groups are said to be un-normalized
All values in a column represent the same attribute and have the same data format

Building Blocks of the Relational Data Model


The building blocks of the relational data model are:
 Entities: real world physical or logical object
 Attributes: properties used to describe each Entity or real world object.
 Relationship: the association between Entities
 Constraints: rules that should be obeyed while manipulating the data.
1. The ENTITIES (persons, places, things etc.) which the organization has to deal with.
Relations can also describe relationships
The name given to an entity should always be a singular noun descriptive of each item
to be stored in it. E.g. : student NOT students.
Every relation has a schema, which describes the columns, or fields; the relation itself
corresponds to our familiar notion of a table:
A relation is a collection of tuples, each of which contains values for a fixed number of
attributes
 Existence Dependency: the dependence of an entity on the existence of one or
more entities.
 Weak entity: an entity that cannot exist without the entity with which it has a
relationship – it is indicated by a double rectangle
2. The ATTRIBUTES - the items of information which characterize and describe these
entities.
Attributes are pieces of information ABOUT entities. The analysis must of course
identify those which are actually relevant to the proposed application. Attributes will
give rise to recorded items of data in the database
At this level we need to know such things as:
 Attribute name (be explanatory words or phrases)
 The domain from which attribute values are taken (A DOMAIN is a set of values from
which attribute values may be taken.) Each attribute has values taken from a domain.


For example, the domain of Name is string and that of salary is real. However, these
are not shown on E-R models.
o Whether the attribute is part of the entity identifier (attributes which just
describe an entity and those which help to identify it uniquely)
o Whether it is permanent or time-varying (which attributes may change their
values over time)
o Whether it is required or optional for the entity (whose values will sometimes be
unknown or irrelevant)
Types of Attributes
(1) Simple (atomic) Vs Composite attributes
 Simple : contains a single value (not divided into sub parts)
E.g. Age, gender
 Composite: Divided into sub parts (composed of other attributes)
E.g. Name, address
(2) Single-valued Vs multi-valued attributes
 Single-valued : have only single value(the value may change but has only
one value at one time)
E.g. Name, Sex, Id. No. color_of_eyes
 Multi-Valued: have more than one value
E.g. Address, dependent-name
Person may have several college degrees
(3) Stored vs. Derived Attribute
 Stored : not possible to derive or compute
E.g. Name, Address
 Derived: The value may be derived (computed) from the values of other
attributes.
E.g. Age (current year – year of birth)
Length of employment (current date- start date)
Profit (earning-cost)
G.P.A (grade point/credit hours)
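For instance, a derived attribute such as Age is usually not stored but computed from a stored attribute when needed. A hedged SQL sketch, assuming a hypothetical Person table with a dateOfBirth column (date functions differ between DBMS products):

    -- Age is derived at query time from the stored dateOfBirth
    -- (approximate: ignores whether the birthday has occurred yet this year)
    SELECT name,
           EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM dateOfBirth) AS age
    FROM   Person;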
(4) Null Values
 NULL applies to attributes which are not applicable or which do not have
values.
 You may enter the value NA (meaning not applicable)
 Value of a key attribute can not be null.
Default value - assumed value if no explicit value


Entity versus Attributes


When designing the conceptual specification of the database, one should pay attention to the
distinction between an Entity and an Attribute.
 Consider designing a database of employees for an organization:
 Should address be an attribute of Employees or an entity (connected to Employees by
a relationship)?
 If we have several addresses per employee, address must be an entity (attributes
cannot be set-valued/multi valued)
 If the structure (city, Woreda, Kebele, etc) is important, e.g. want to retrieve
employees in a given city, address must be modeled as an entity (attribute values are
atomic)

3. The RELATIONSHIPS between entities which exist and must be taken into account
when processing information. In any business processing one object may be associated
with another object due to some event. Such kind of association is what we call a
RELATIONSHIP between entity objects.
 One external event or process may affect several related entities.
 Related entities require setting of LINKS from one part of the database to another.
 A relationship should be named by a word or phrase which explains its function
 Role names are different from the names of entities forming the relationship: one
entity may take on many roles, the same role may be played by different entities
 For each RELATIONSHIP, one can talk about the Number of Entities and the
Number of Tuples participating in the association. These two concepts are called
DEGREE and CARDINALITY of a relationship respectively.
Degree of a Relationship
 An important point about a relationship is how many entities participate in it. The
number of entities participating in a relationship is called the DEGREE of the
relationship.
Among the Degrees of relationship, the following are the basic:
 UNARY/RECURSIVE RELATIONSHIP: tuples/records of a single entity are related with
each other.
 BINARY RELATIONSHIPS: Tuples/records of two entities are associated in a relationship
 TERNARY RELATIONSHIP: Tuples/records of three different entities are associated
 And a generalized one:
o N-ARY RELATIONSHIP: Tuples from arbitrary number of entity sets are
participating in a relationship.


Cardinality of a Relationship
Another important concept about relationship is the number of instances/tuples that can be
associated with a single instance from one entity in a single relationship. The number of
instances participating or associated with a single instance from an entity in a relationship is
called the CARDINALITY of the relationship. The major cardinalities of a relationship are:
 ONE-TO-ONE: one tuple is associated with only one other tuple.
o E.g. Building – Location: a single building will be located in a single
location and a single location will only accommodate a single building.
 ONE-TO-MANY: one tuple can be associated with many other tuples, but not the
reverse.
o E.g. Department – Student: one department can have multiple students.
 MANY-TO-ONE: many tuples are associated with one tuple but not the reverse.
o E.g. Employee – Department: many employees belong to a single
department.
 MANY-TO-MANY: one tuple is associated with many other tuples and, from the
other side (with a different role name), one tuple will be associated with many tuples.
o E.g. Student – Course: a student can take many courses and a single
course can be attended by many students.
However, note that the degree and cardinality of a relation are different from the degree
and cardinality of a relationship.
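When relationships are later mapped to relations, a many-to-many relationship such as Student – Course is typically represented by a separate linking relation whose key combines the keys of the two participating entities. A hedged sketch, assuming Student and Course tables with the corresponding primary keys already exist:

    -- Each row records one student taking one course
    CREATE TABLE Enrolled_In (
        studentId CHAR(10) REFERENCES Student (studentId),
        courseId  CHAR(10) REFERENCES Course (courseId),
        grade     CHAR(2),
        -- the composite key lets one student appear with many courses
        -- and one course appear with many students
        PRIMARY KEY (studentId, courseId)
    );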

4. Key constraints
If tuples need to be unique in the database, then we need to make each tuple distinct.
To do this we need relational keys that uniquely identify each record.
 Super Key: an attribute/set of attributes that uniquely identify a tuple within a relation.
 Candidate Key: a super key such that no proper subset of that collection is a Super Key
within the relation.
 A candidate key has two properties:
1. Uniqueness
2. Irreducibility
 If a super key is having only one attribute, it is automatically a Candidate key.
 If a candidate key consists of more than one attribute it is called Composite Key.
 Primary Key: the candidate key that is selected to identify tuples uniquely within the
relation.
 The entire set of attributes in a relation can be considered as a primary key in the
worst case.


 Foreign Key: an attribute, or set of attributes, within one relation that matches the
candidate key of some relation.
A foreign key is a link between different relations to create a view or an unnamed
relation
Relational Constraints/Integrity Rules
 Relational Integrity
 Domain Integrity: No value of the attribute should be beyond the allowable
limits
 Entity Integrity: In a base relation, no attribute of a Primary Key can assume
a value of NULL
 Referential Integrity: If a Foreign Key exists in a relation, either the Foreign
Key value must match a Candidate Key value in its home relation or the
Foreign Key value must be NULL
 Enterprise Integrity: Additional rules specified by the users or database
administrators of a database are incorporated
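A hedged SQL sketch of how these integrity rules are commonly declared (the table and column names are illustrative, not taken from the text):

    CREATE TABLE Department (
        deptId   CHAR(5)  PRIMARY KEY,   -- entity integrity: a PK cannot be NULL
        deptName CHAR(40) NOT NULL
    );

    CREATE TABLE Employee (
        empId  CHAR(10) PRIMARY KEY,               -- entity integrity
        salary DECIMAL(10,2) CHECK (salary > 0),   -- domain integrity: allowable limits
        deptId CHAR(5),
        FOREIGN KEY (deptId) REFERENCES Department (deptId)
        -- referential integrity: deptId must match a Department key or be NULL
    );

    -- enterprise integrity: an additional business rule added by the DBA
    ALTER TABLE Employee
        ADD CONSTRAINT chk_salary_cap CHECK (salary <= 100000);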
 Relational Views
Relations are perceived as tables from the users' perspective. Actually, there are two
kinds of relations in a relational database. The two categories or types of relations are
Named and Unnamed Relations. The basic difference is in how the relation is created,
used and updated:
1. Base Relation
A Named Relation corresponding to an entity in the conceptual schema, whose
tuples are physically stored in the database.
2. View (Unnamed Relation)
A View is the dynamic result of one or more relational operations operating on
the base relations to produce another virtual relation that does not actually exist
as presented. So a view is virtually derived relation that does not necessarily
exist in the database but can be produced upon request by a particular user at the
time of request. The virtual table or relation can be created from single or
different relations by extracting some attributes and records with or without
conditions.
Purpose of a view
 Hides unnecessary information from users: since only part of the base relation
(Some collection of attributes, not necessarily all) are to be included in the virtual
table.


 Provide powerful flexibility and security: since unnecessary information will be


hidden from the user there will be some sort of data security.
 Provide customized view of the database for users: each user is going to be
interfaced with their own preferred data set and format by making use of the
Views.
 A view of one base relation can be updated.
 Updates on views derived from various relations are not allowed since they may
violate the integrity of the database.
 Updates on views with aggregation and summary are not allowed, since
aggregation and summary results are computed from a base relation and do not
actually exist as stored data.
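As a sketch of these points, assuming a hypothetical Employee base relation with a salary column, a view can expose only selected attributes and tuples to a particular group of users:

    -- Base relation: physically stored
    CREATE TABLE Employee (
        empId  CHAR(10) PRIMARY KEY,
        name   CHAR(50),
        deptId CHAR(5),
        salary DECIMAL(10,2)
    );

    -- View (virtual relation): hides the salary column and shows only one
    -- department's employees, giving a customized and more secure picture
    CREATE VIEW FinanceStaff AS
        SELECT empId, name
        FROM   Employee
        WHERE  deptId = 'FIN';

    -- The view is queried like a table; its tuples are derived from the base
    -- relation at the time of the request
    SELECT * FROM FinanceStaff;

Because this view is drawn from a single base relation without aggregation, it would normally be updatable, in line with the rules listed above.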

Schemas and Instances and Database State


When a database is designed using a relational data model, all the data is represented in the
form of tables. In such definitions and representations, there are two basic components of the
database: the definition of the Relation or Table, and the actual data stored in each table. The
data definition is what we call the Schema, or the skeleton of the database, and the Relations
with some information at some point in time are the Instance, or the flesh of the database.

Schemas
 Schema describes how data is to be structured, defined at setup/Design time (also
called "metadata")
 Since it is used during the database development phase, there is rarely a need to
change the schema, unless system maintenance demands a change to the definition of
a relation.
 Database Schema (Intension): specifies name of relation and the collection of the
attributes (specifically the Name of attributes).
 refers to a description of the database (or intension)
 specified during database design
 should not be changed unless during maintenance
 Schema Diagrams
 convention to display some aspect of a schema visually
 Schema Construct
 refers to each object in the schema (e.g. STUDENT)
E.g.: STUDENT (FName, LName, Id, Year, Dept, Sex)


Instances
 Instance: is the collection of data in the database at a particular point of time (snap-
shot).
 Also called State or Snap Shot or Extension of the database
 Refers to the actual data in the database at a specific point in time
 State of database is changed any time we add, delete or update an item.
 Valid state: the state that satisfies the structure and constraints specified in the
schema and is enforced by DBMS
 Since an Instance is the actual data of the database at some point in time, it changes rapidly
 To define a new database, we specify its database schema to the DBMS (database is
empty)
 database is initialized when we first load it with data
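A brief sketch of the distinction, reusing the STUDENT schema construct shown above (the data values are invented): the CREATE TABLE statement fixes the schema (intension), while the rows present at any moment form the instance (extension), which changes with every insert, update or delete.

    -- Schema (intension): defined once at design time
    CREATE TABLE Student (
        fName       CHAR(20),
        lName       CHAR(20),
        id          CHAR(10) PRIMARY KEY,
        yearOfStudy INT,        -- corresponds to the Year attribute
        dept        CHAR(10),
        sex         CHAR(1)
    );

    -- Instance (extension / state): the tuples stored at a particular point in time
    INSERT INTO Student VALUES ('Abebe', 'Bekele',  'ST01', 2, 'CS', 'M');
    INSERT INTO Student VALUES ('Almaz', 'Tadesse', 'ST02', 3, 'IS', 'F');

    -- every insert, update or delete moves the database to a new state
    DELETE FROM Student WHERE id = 'ST01';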


Chapter Three
Database Design
Database design is the process of coming up with different kinds of specifications for the data
to be stored in the database. The database design part is one of the middle phases we have in
information systems development where the system uses a database approach. Design is the
part in which we describe how the data should be perceived at different levels and finally
how it is going to be stored in a computer system.
An information system with a database application consists of several tasks, which include:
 Planning of Information systems Design
 Requirements Analysis,
 Design (Conceptual, Logical and Physical Design)
 Implementation
 Testing and deployment
 Operation and Support
From these different phases, the prime interest of a database system will be the Design part
which is again sub divided into other three sub-phases. These sub-phases are:
1. Conceptual Design
2. Logical Design, and
3. Physical Design
In general, one has to go back and forth between these tasks to refine a database design,
and decisions in one task can influence the choices in another task.
In developing a good design, one should answer such questions as:
 What are the relevant Entities for the Organization?
 What are the important features of each Entity?
 What are the important Relationships?
 What are the important queries from the users?
 What are the other requirements of the Organization and the Users?

Fig. The three levels of database design: Conceptual Design → Logical Design → Physical Design


1. Conceptual Database Design


 Conceptual design is the process of constructing a model of the information used in an
enterprise, independent of any physical considerations.
 It is the source of information for the logical design phase.
 Mostly uses an Entity Relationship Model to describe the data at this level.
 After the completion of Conceptual Design one has to go for refinement of the schema,
which is verification of Entities, Attributes, and Relationships

2. Logical Database Design


 Logical design is the process of constructing a model of the information used in an
enterprise based on a specific data model (e.g. relational, hierarchical or network or
object), but independent of a particular DBMS and other physical considerations.
 Normalization process
 Collection of Rules to be maintained
 Discover new entities in the process
 Revise attributes based on the rules and the discovered Entities
3. Physical Database Design
 Physical design is the process of producing a description of the implementation of the
database on secondary storage. -- defines specific storage or access methods used by
database
 Describes the storage structures and access methods used to achieve efficient
access to the data.
 Tailored to a specific DBMS system -- Characteristics are function of DBMS and
operating systems
 Includes estimate of storage space

Conceptual Database Design


 Conceptual design revolves around discovering and analyzing organizational and user
data requirements
 The important activities are to identify
 Entities
 Attributes
 Relationships
 Constraints
 And based on these components develop the ER model using
 ER diagrams


The Entity Relationship (E-R) Model


 Entity-Relationship modeling is used to represent conceptual view of the database
 The main components of ER Modeling are:
 Entities
 Corresponds to entire table, not row
 Represented by Rectangle
 Attributes
 Represents the property used to describe an entity or a relationship
 Represented by Oval
 Relationships
 Represents the association that exist between entities
 Represented by Diamond
 Constraints
 Represent the constraint in the data
 Cardinality and Participation Constraints
Before working on the conceptual design of the database, one has to know and answer the
following basic questions.
 What are the entities and relationships in the enterprise?
 What information about these entities and relationships should we store in the
database?
 What are the integrity constraints that hold? Constraints on each data with respect
to update, retrieval and store.
 Represent this information pictorially in ER diagrams, then map ER diagram into a
relational schema.

Developing an E-R Diagram


 Designing the conceptual model for the database is not a linear process but an iterative
activity where the design is refined again and again.
 To identify the entities, attributes, relationships, and constraints on the data, there are
different set of methods used during the analysis phase. These include information
gathered by…
 Interviewing end users individually and in a group
 Questionnaire survey
 Direct observation
 Examining different documents
 Analysis of requirements gathered
 Nouns -- prospective entities
 Adjectives--prospective attributes
 Verbs/verb phrases-prospective relationships


 The basic E-R model is graphically depicted and presented for review.
 The process is repeated until the end users and designers agree that the E-R diagram
is a fair representation of the organization's activities and functions.
 Checking for Redundant Relationships in the ER Diagram: relationships between
entities indicate access from one entity to another; it is therefore possible to access
one entity occurrence from another entity occurrence even if there are other entities
and relationships that separate them. This is often referred to as 'Navigation' of the
ER diagram.
 The last phase in ER modeling is validating an ER Model against requirement of the
user.

Graphical Representations in ER Diagramming


 Entity is represented by a RECTANGLE containing the name of the entity.

Fig. Strong Entity (single rectangle) and Weak Entity (double rectangle)

 Connected entities are called relationship participants


 Attributes are represented by OVALS and are connected to the entity by a line.

Fig. Attribute notation: simple, multi-valued, composite and key attributes are all drawn as
ovals attached to the entity.
 Primary Key attributes are underlined, and
 A derived attribute is indicated by a DOTTED-LINE oval.

 Relationships are represented by Diamond shaped symbols


 Weak Relationship is a relationship between Weak and Strong Entities
 Strong Relationship is a relationship between two strong Entities

Fig. Strong Relationship and Weak Relationship (diamond symbols)


Example 1: Build an ER Diagram for the following information:


A student record management system will have the following two basic data object categories
with their own features or properties: Students will have an Id, Name, Dept, Age, GPA; and
Courses will have an Id, Name, Credit Hours. Whenever a student enrolls in a course in a
specific Academic Year and Semester, the Student will have a grade for the course.

Fig. ER diagram for Example 1: the Student entity (Id, Name, Dept, DoB, Age, GPA) and the
Course entity (Id, Name, Credit) are connected by the Enrolled_In relationship, which carries
the attributes Academic Year, Semester and Grade.

Example 2: Build an ER Diagram for the following information:


A personnel record management system will have the following two basic data object
categories with their own features or properties: Employee will have an Id, Name, DoB, Age,
Tel; and Department will have an Id, Name, Location. Whenever an Employee is assigned to
one Department, the duration of his stay in the respective department should be registered.
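A possible relational sketch of this second example (the ER-to-relational mapping itself is treated later; the column names and data types here are assumptions): the assignment relationship carries the duration of the employee's stay in the department.

    CREATE TABLE Department (
        deptId   CHAR(5) PRIMARY KEY,
        name     CHAR(40),
        location CHAR(40)
    );

    CREATE TABLE Employee (
        empId CHAR(10) PRIMARY KEY,
        name  CHAR(50),
        dob   DATE,
        tel   CHAR(15)
    );

    -- The Assigned_To relationship records the duration of the stay
    CREATE TABLE Assigned_To (
        empId     CHAR(10) REFERENCES Employee (empId),
        deptId    CHAR(5)  REFERENCES Department (deptId),
        startDate DATE,
        endDate   DATE,
        PRIMARY KEY (empId, deptId, startDate)
    );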

Structural Constraints on Relationship


1. Constraints on Relationship / Multiplicity/ Cardinality Constraints
 A multiplicity constraint is the number or range of possible occurrences of an entity
type/relation that may relate to a single occurrence/tuple of another entity type/relation
through a particular relationship.
 Mostly used to ensure appropriate enterprise constraints.
 One-to-one relationship:
 A customer is associated with at most one loan via the relationship borrower
 A loan is associated with at most one customer via borrower


E.g.: Relationship Manages between STAFF and BRANCH


The multiplicity of the relationship is:
 One branch can only have one manager
 One employee could manage either one or no branches

Employee  1..1 -- Manages -- 0..1  Branch

 One-To-Many Relationships
 In the one-to-many relationship a loan is associated with at most one customer via
borrower, a customer is associated with several (including 0) loans via borrower

E.g.: Relationship Leads between STAFF and PROJECT


The multiplicity of the relationship
 One staff may lead one or more project(s)
 One project is led by one staff

Employee  1..1 -- Leads -- 0..*  Project

 Many-To-Many Relationship
 A customer is associated with several (possibly 0) loans via borrower
 A loan is associated with several (possibly 0) customers via borrower


E.g.: Relationship ―Teaches‖ between INSTRUCTOR and COURSE


The multiplicity of the relationship:
 One Instructor teaches one or more Course(s)
 One Course is taught by zero or more Instructor(s)

Instructor  0..* -- Teaches -- 1..*  Course

Participation of an Entity Set in a Relationship Set


Participation constraint of a relationship is involved in identifying and setting the mandatory
or optional feature of an entity occurrence to take a role in a relationship. There are two
distinct participation constraints, namely: Total Participation and Partial Participation
 Total participation: every tuple in the entity or relation participates in at least one
relationship by taking a role. This means, every tuple in a relation will be attached with
at least one other tuple. The entity with total participation in a relationship will be
connected to the relationship using a double line.
 Partial participation: some tuple in the entity or relation may not participate in the
relationship. This means, there is at least one tuple from that Relation not taking any
role in that specific relationship. The entity with partial participation in a relationship
will be connected to the relationship using a single line.
 Example 1:
 Participation of EMPLOYEE in ―belongs to‖ relationship with DEPARTMENT is total since
every employee should belong to a department.
 Participation of DEPARTMENT in ―belongs to‖ relationship with EMPLOYEE is total since
every department should have at least one employee.
Employee  1..* -- BelongsTo -- 1..1  Department

 Example 2:
 Participation of employee in ―manages‖ relationship with Department, is partial
participation since not all employees are managers.
 Participation of DEPARTMENT in ―Manages‖ relationship with EMPLOYEE is total since
every department should have a manager.


Employee  1..1 -- Manages -- 0..1  Department

Problem in ER Modeling
The Entity-Relationship Model is a conceptual data model that views the real world as
consisting of entities and relationships. The model visually represents these concepts by the
Entity-Relationship diagram. The basic constructs of the ER model are entities, relationships,
and attributes. Entities are concepts, real or abstract, about which information is collected.
Relationships are associations between the entities. Attributes are properties which describe
the entities.
While designing the ER model one could face a problem in the design called connection
traps. Connection traps are problems arising from misinterpreting certain
relationships.
There are two types of connection traps;
1. Fan trap:
Occurs where a model represents a relationship between entity types, but the pathway
between certain entity occurrences is ambiguous.
May exist where two or more one-to-many (1:M) relationships fan out from an entity.
The problem could be avoided by restructuring the model so that there would be no
1:M relationships fanning out from a single entity and all the semantics of the
relationship are preserved.
Example:
EMPLOYEE  1..* -- Works For -- 1..1  BRANCH  1..1 -- IsAssigned -- 1..*  CAR
Semantics description of the problem;

(Figure: occurrence diagram -- employees Emp1..Emp7 linked to branches Br1..Br4, and branches linked to cars Car1..Car7)


Problem: Which car (Car1 or Car3 or Car5) is used by Employee 6 (Emp6) working in
Branch 1 (Br1)? From this ER model one cannot tell which car is used by which
staff, since a branch can have more than one car and a branch is also populated by more
than one employee. Thus we need to restructure the model to avoid the connection trap.
To avoid the Fan Trap problem we can go for restructuring of the E-R Model. This will
result in the following E-R Model.
BRANCH  1..1 -- Has -- 1..*  CAR  1..* -- Used By -- 1..*  EMPLOYEE

Semantics description of the problem;

(Figure: occurrence diagram -- each branch Br1..Br4 is linked to its cars Car1..Car7, and each car is linked to the employee(s) Emp1..Emp7 using it)

2. Chasm Trap:
Occurs where a model suggests the existence of a relationship between entity types, but
the pathway does not exist between certain entity occurrences.
A chasm trap may exist when there are one or more relationships with a minimum
multiplicity (cardinality) of zero forming part of the pathway between related entities.
Example:
BRANCH  1..1 -- Has -- 1..*  EMPLOYEE  0..1 -- Manages -- 0..*  PROJECT

If we have a set of projects that are not currently active, then we cannot assign a project
manager to these projects. So there are projects with no project manager, making the
participation have a minimum value of zero.
Problem:
How can we identify which BRANCH is responsible for which PROJECT? We know
that, whether the PROJECT is active or not, there is a responsible BRANCH. But which
branch is a question to be answered, and since we have a minimum participation of
zero between EMPLOYEE and PROJECT we can‘t identify the BRANCH responsible for
each PROJECT.


The solution for this Chasm Trap problem is to add another relationship between the
extreme entities (BRANCH and PROJECT).

BRANCH  1..1 -- Has -- 1..*  EMPLOYEE  0..1 -- Manages -- 0..*  PROJECT
BRANCH  1..1 -- Responsible for -- 1..*  PROJECT

Enhanced E-R (EER) Models


 Object-oriented extensions to E-R model
 EER is important when we have a relationship between two entities and the participation is
partial between entity occurrences. In such cases EER is used to reduce the complexity of
participation and relationships.
 ER diagrams consider entity types to be primitive objects
 EER diagrams allow refinements within the structures of entity types
 EER Concepts
 Generalization
 Specialization
 Sub classes
 Super classes
 Attribute Inheritance
 Constraints on specialization and generalization
1. Generalization
 Generalization occurs when two or more entities represent categories of the same real-
world object.
 Generalization is the process of defining a more general entity type from a set of more
specialized entity types.
 A generalization hierarchy is a form of abstraction that specifies that two or more
entities that share common attributes can be generalized into a higher level entity type.
 Is considered as bottom-up definition of entities.
 Generalization hierarchy depicts relationship between higher level superclass and
lower level subclass.
 Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a
supertype of another. The level of nesting is limited only by the constraint of simplicity.

Example: Account is a generalized form for Saving and Current Accounts


2. Specialization
 Is the result of taking a subset of a higher level entity set to form a lower level entity set.
 The specialized entities will have additional set of attributes (distinguishing
characteristics) that distinguish them from the generalized entity.
 Is considered as Top-Down definition of entities.
 Specialization process is the inverse of the Generalization process. Identify the
distinguishing features of some entity occurrences, and specialize them into different
subclasses.
 Reasons for Specialization
 Attributes only partially applying to superclasses
 Relationship types only partially applicable to the superclass
In many cases, an entity type has numerous sub-groupings of its entities that are
meaningful and need to be represented explicitly. This need requires the representation of
each subgroup in the ER model. The generalized entity is a superclass and the set of
specialized entities will be subclasses for that specific Superclass.
 Example: Saving Accounts and Current Accounts are Specialized entities for the
generalized entity Accounts. Manager, Sales, Secretary: are specialized employees.

3. Subclass/Subtype
 An entity type whose tuples have attributes that distinguish its members from tuples of
the generalized or Superclass entities.
 When one generalized Superclass has various subgroups with distinguishing features
and these subgroups are represented by specialized form, the groups are called
subclasses.
 Subclasses can be either mutually exclusive (disjoint) or overlapping (inclusive).
 A single subclass may inherit attributes from two distinct superclasses.
 A mutually exclusive category/subclass is when an entity instance can be in only one of
the subclasses.
E.g.: An EMPLOYEE can either be SALARIED or PART-TIMER but not both.
 An overlapping category/subclass is when an entity instance may be in two or more
subclasses.

 E.g.: A PERSON who works for a university can be both EMPLOYEE and a STUDENT at
the same time.
4. Superclass /Supertype
 An entity type whose tuples share common attributes. Attributes that are shared by all
entity occurrences (including the identifier) are associated with the supertype.
 Is the generalized entity
5. Relationship Between Superclass and Subclass
 The relationship between a superclass and any of its subclasses is called a
superclass/subclass or class/subclass relationship
 An instance cannot be a member of a subclass only; i.e. every instance of a subclass is
also an instance of the Superclass.
 A member of a subclass is represented as a distinct database object, a distinct record
that is related via the key attribute to its super-class entity.
 An entity cannot exist in the database merely by being a member of a subclass; it must
also be a member of the super-class.
 An entity occurrence of a superclass does not necessarily belong to any of the
subclasses unless there is total participation in the specialization.
 The relationship between a subclass and a Superclass is an ―IS A‖ or ―IS PART OF‖
type.
 Subclass IS PART OF Superclass
 Manager IS AN Employee
 All subclasses or specialized entity sets should be connected with the superclass using a
line to a circle where there is a subset symbol indicating the direction of
subclass/superclass relationship.

 We can also have subclasses of a subclass forming a hierarchy of specialization.


 Superclass attributes are shared by all subclasses of that superclass
 Subclass attributes are unique for the subclass.


 Attribute Inheritance
 An entity that is a member of a subclass inherits all the attributes of the entity as a member
of the superclass.
 The entity also inherits all the relationships in which the superclass participates.
 An entity may have more than one subclass category.
 All entities/subclasses of a generalized entity or superclass share a common unique
identifier attribute (primary key). i.e. The primary key of the superclass and subclasses are
always identical.

 Consider the EMPLOYEE supertype entity shown above. This entity can have several
different subtype entities (for example: HOURLY and SALARIED), each with distinct
properties not shared by other subtypes. But whether the employee is Hourly or Salaried,
same attributes (EmployeeId, Name, and DateHired) are shared.
 The Supertype EMPLOYEE stores all properties that subclasses have in common. And
HOURLY employees have the unique attribute Wage (hourly wage rate), while SALARIED
employees have two unique attributes, StockOption and Salary.

Constraints on specialization and generalization


 Completeness Constraint.
 The Completeness Constraint addresses the issue of whether or not an occurrence of a
Superclass must also have a corresponding Subclass occurrence.
 The completeness constraint specifies whether every instance of the supertype must also be
represented in at least one subtype.
 The Total Specialization Rule specifies that an entity occurrence should at least be a
member of one of the subclasses. Total Participation of superclass instances on subclasses is
diagrammed with a double line from the Supertype to the circle as shown below.
E.g.: If we have Extension and Regular as subclasses of a superclass Student, then it is
mandatory that each student be either an Extension or a Regular student. Thus the
participation of instances of Student in the Extension and Regular subclasses will be
total.


 The Partial Specialization Rule specifies that it is not necessary for all entity occurrences
in the superclass to be a member of one of the subclasses. Here we have an optional
participation on the specialization. Partial Participation of superclass instances on
subclasses is diagrammed with a single line from the Supertype to the circle.
E.g.: If we have Manager and Secretary as subclasses of a superclass Employee, then
it is not the case that all employees are either managers or secretaries. Thus the
participation of instances of Employee in the Manager and Secretary subclasses will
be partial.

 Disjointness Constraints.
 Specifies the rule of whether one entity occurrence can be a member of more than one
subclass; i.e. it is a type of business rule that deals with the situation where an entity
occurrence of a Superclass may also have more than one Subclass occurrence.
 The Disjoint Rule restricts one entity occurrence of a superclass to be a member of only
one of the subclasses. Example: an Employee can either be salaried or a part-timer, but
not both at the same time.
 The Overlap Rule allows one entity occurrence to be a member of more than one
subclass. Example: an Employee working at the university can be both a Student and an
Employee at the same time.
 This is diagrammed by placing either the letter "d" for disjoint or "o" for overlapping
inside the circle on the Generalization Hierarchy portion of the E-R diagram.
The two types of constraints on generalization and specialization (Disjointness and
Completeness constraints) are not dependent on one another. That is, being disjoint does not
determine whether the tuples in the superclass should have Total or Partial participation for that
specific specialization.
From the two types of constraints we can have four possible constraints
 Disjoint AND Total
 Disjoint AND Partial
 Overlapping AND Total
 Overlapping AND Partial

Chapter Four
Logical Database Design
The whole purpose of database design is to create an accurate representation of the data,
the relationships between the data and the business constraints pertinent to that organization.
Therefore, one can use one or more techniques to design a database. One such technique was
the E-R model. In this chapter we use another technique known as ―Normalization‖ with a
different emphasis on database design: it defines the structure of a database with a specific
data model.
Logical design is the process of constructing a model of the information used in an enterprise
based on a specific data model (e.g. relational, hierarchical or network or object), but
independent of a particular DBMS and other physical considerations.
The focus in logical database design is the Normalization Process
 Normalization process
o Collection of Rules (Tests) to be applied on relations to obtain the minimal, non-redundant
set of attributes.
o Discover new entities in the process
o Revise attributes based on the rules and the discovered Entities
o Works by examining the relationship between attributes known as functional
dependency.
The purpose of normalization is to find the suitable set of relations that supports the data
requirements of an enterprise.
A suitable set of relations has the following characteristics;
 Minimal number of attributes to support the data requirements of the enterprise
 Attributes with close logical relationship (functional dependency) should be placed in
the same relation.
 Minimal redundancy with each attribute represented only once with the exception of the
attributes which form the whole or part of the foreign key, which are used for joining of
related tables.
The first step before applying the rules of the relational data model is converting the conceptual
design into a form suitable for the relational logical model, which is in the form of tables.

Converting ER Diagram to Relational Tables


Three basic rules to convert ER into tables or relations:
Rule 1: Entity Names will automatically be table names
Rule 2: Mapping of attributes: attributes will be columns of the respective tables.
 Atomic or single-valued or derived or stored attributes will be columns


 Composite attributes: the parent attribute will be ignored and the decomposed
attributes (child attributes) will be columns of the table.
 Multi-valued attributes: will be mapped to a new table where the primary key of the
main table will be posted for cross referencing.
Rule 3: Relationships: relationship will be mapped by using a foreign key attribute. Foreign
key is a primary or candidate key of one relation used to create association between tables.
 For a relationship with One-to-One Cardinality: post the primary or candidate key
of one of the tables into the other as a foreign key. In cases where one entity has
partial participation in the relationship, it is recommended to post the candidate key
of the partial participant to the total participant so as to avoid storing null values in
the foreign key attribute (and thus save some storage). E.g.: for a relationship between
Employee and Department where an employee manages a department, the cardinality
is one-to-one as one employee will manage only one department and one
department will have one manager. Here the PK of the Employee can be posted to
the Department or the PK of the Department can be posted to the Employee. But
Employee has partial participation in the relationship "Manages", as not all
employees are managers of departments. Thus, even though both ways are possible, it
is recommended to post the primary key of the Employee to the Department table as
a foreign key.
 For a relationship with One-to-Many Cardinality: Post the primary key or
candidate key from the ―one‖ side as a foreign key attribute to the ―many‖ side. E.g.:
For a relationship called ―Belongs To‖ between Employee (Many) and Department
(One) the primary or candidate key of the one side which is Department should be
posted to the many side which is Employee table.
 For a relationship with Many-to-Many Cardinality: for relationships having many
to many cardinality, one has to create a new table (which is the associative entity)
and post primary key or candidate key from the participant entities as foreign key
attributes in the new table along with some additional attributes (if applicable). The
same approach should be used for relationships with degree greater than binary.
 For a relationship having Associative Entity property: in cases where the
relationship has its own attributes (associative entity), one has to create a new table
for the associative entity and post primary key or candidate key from the
participating entities as foreign key attributes in the new table.


Example to illustrate the major rules in mapping ER to relational schema:

The following ER diagram has been designed to represent the requirement of an organization to
capture Employee, Department and Project information. An Employee works for a department,
and an employee might be assigned to manage a department. Employees might participate
in different projects within the organization. An employee might as well be assigned to lead a
project, in which case the starting and ending dates of his/her project leadership and the bonus
will be registered.
(Figure: ER diagram -- Employee (EID, Name(FName, LName), Salary, multi-valued Tel), Department (DID, DName, DLoc) and Project (PID, PName, PFund); relationships: Manages (1:1) and WorksFor (M:1) between Employee and Department, Participate (M:M) between Employee and Project, and Leads between Employee and Project carrying the attributes StartDate, EndDate and PBonus)

After we have drawn the ER diagram, the next thing is to map the ER diagram into a relational
schema so that the rules of the relational data model can be tested for each relational schema.
The mapping can be done for the entities followed by the relationships, based on the mapping
rules. The mapping has been done as follows.


 Mapping EMPLOYEE Entity:


There will be an Employee table with EID, Salary, FName and LName being the columns.
The composite attribute Name will be ignored as its decomposed attributes (FName and
LName) are columns in the Employee table. The Tel attribute will be mapped to a new table
as it is multi-valued.
Employee
EID FName LName Salary

Telephone
EID Tel

 Mapping DEPARTMENT Entity:


There will be a Department table with DID, DName, and DLoc being the columns.
Department
DID DName DLoc

 Mapping PROJECT Entity:


There will be a Project table with PID, PName, and PFund being the columns.
Project
PID PName PFund

 Mapping the MANAGES Relationship:


As the relationship has one-to-one cardinality, the PK or CK of one of the tables can
be posted into the other. But based on the recommendation, the PK or CK of the partial
participant (Employee) should be posted to the total participant (Department). This
will require adding the PK of Employee (EID) in the Department table as a foreign key.
We can give the foreign key another name, MEID, to mean "manager's employee
id". This will affect the degree of the Department table.
Department
DID DName DLoc MEID

 Mapping the WORKSFOR Relationship:


As the relationship has one-to-many cardinality, the PK or CK of the "One" side
(PK or CK of the Department table) should be posted to the many side (Employee table).
This will require adding the PK of Department (DID) in the Employee table as a foreign
key. We can give the foreign key another name, EDID, to mean "Employee's
Department id". This will affect the degree of the Employee table.
Employee
EID FName LName Salary EDID


 Mapping the PARTICIPATES Relationship:


As the relationship has many-to-many cardinality, we need to create a new table
and post the PK or CK of the Employee and Project tables into the new table. We can
give a descriptive new name to the new table, like Emp_Partc_Project, to mean
"Employee participates in a project".
Emp_Partc_Project
EID PID

 Mapping the LEADS Relationship:


As the relationship is an associative entity, we are supposed to create a table for the
associative entity where the PKs of the Employee and Project tables will be posted in the
new table as foreign keys. The new table will have the attributes of the associative
entity as columns. We can give a descriptive new name to the new table, like
Emp_Lead_Project, to mean "Employee leads a project".
Emp_Lead_Project
EID PID PBonus StartDate EndDate

At the end of the mapping we will have the following relational schema (tables) for the logical
database design phase.
Department
DID DName DLoc MEID

Project
PID PName PFund

Telephone
EID Tel

Employee
EID FName LName Salary EDID

Emp_Partc_Project
EID PID

Emp_Lead_Project
EID PID PBonus StartDate EndDate
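For concreteness, the relational schema above could be declared in SQL DDL roughly as follows. This is a minimal sketch: the data types and column sizes are assumptions not present in the ER model, only four of the six tables are shown, and the 1:1 Manages foreign key is added last to avoid a circular reference between Employee and Department.

CREATE TABLE Department (
    DID   INT PRIMARY KEY,
    DName VARCHAR(40),
    DLoc  VARCHAR(40),
    MEID  INT            -- manager's employee id; FK added below to avoid a circular reference
);

CREATE TABLE Employee (
    EID    INT PRIMARY KEY,
    FName  VARCHAR(40),
    LName  VARCHAR(40),
    Salary DECIMAL(10,2),
    EDID   INT REFERENCES Department(DID)   -- FK posted from the "one" side of WorksFor
);

CREATE TABLE Project (
    PID   INT PRIMARY KEY,
    PName VARCHAR(40),
    PFund DECIMAL(12,2)
);

-- 1:1 Manages: the key of the partial participant (Employee) is posted to the total participant (Department)
ALTER TABLE Department
    ADD CONSTRAINT fk_dept_manager FOREIGN KEY (MEID) REFERENCES Employee(EID);

-- Associative entity Leads: keyed by the posted keys of both participants
CREATE TABLE Emp_Lead_Project (
    EID       INT REFERENCES Employee(EID),
    PID       INT REFERENCES Project(PID),
    PBonus    DECIMAL(10,2),
    StartDate DATE,
    EndDate   DATE,
    PRIMARY KEY (EID, PID)
);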

After converting the ER diagram into table form, the next phase is implementing the process
of normalization, which is a collection of rules each table should satisfy.


Normalization
A relational database is merely a collection of data, organized in a particular manner. As the
father of the relational database approach, Codd created a series of rules (tests) called normal
forms that help define that organization.
One of the best ways to determine what information should be stored in a database is to clarify
what questions will be asked of it and what data would be included in the answers.
Database normalization is a series of steps followed to obtain a database design that allows
for consistent storage and efficient access of data in a relational database. These steps reduce
data redundancy and the risk of data becoming inconsistent.
NORMALIZATION is the process of identifying the logical associations between data items
and designing a database that will represent such associations but without suffering the
update anomalies which are:
1. Insertion Anomalies
2. Deletion Anomalies
3. Modification Anomalies
Normalization may reduce system performance since data will be cross referenced from many
tables. Thus denormalization is sometimes used to improve performance, at the cost of
reduced consistency guarantees.
Normalization is normally considered ―good‖ if it is a lossless decomposition.
All the normalization rules will eventually remove the update anomalies that may occur during
data manipulation after the implementation. The types of problems that could occur in an
insufficiently normalized table are called update anomalies, which include:
1. Insertion anomalies
An "insertion anomaly" is a failure to place information about a new database entry into
all the places in the database where information about that new entry needs to be stored.
Additionally, we may have difficulty inserting some data. In a properly normalized database,
information about a new entry needs to be inserted into only one place in the database;
in an inadequately normalized database, information about a new entry may need to be
inserted into more than one place and, human fallibility being what it is, some of the
needed additional insertions may be missed.

2. Deletion anomalies
A "deletion anomaly" is a failure to remove information about an existing database entry
when it is time to remove that entry. Additionally, deletion of one piece of data may result in loss
of other information. In a properly normalized database, information about an old, to-be-
gotten-rid-of entry needs to be deleted from only one place in the database; in an


inadequately normalized database, information about that old entry may need to be
deleted from more than one place, and, human fallibility being what it is, some of the
needed additional deletions may be missed.

3. Modification anomalies
A modification of a database involves changing some value of an attribute of a table. In
a properly normalized database table, whatever information is modified by the user needs to
be changed in only one place, and the change will be effected and used accordingly.
In order to avoid the update anomalies in a given table, the solution is to decompose
it into smaller tables based on the rules of normalization. However, the decomposition should
have two important properties:
a. The Lossless-join property ensures that any instance of the original relation can be identified
from the instances of the smaller relations.
b. The Dependency preservation property implies that constraint on the original dependency can
be maintained by enforcing some constraints on the smaller relations. i.e. we don‘t have to
perform Join operation to check whether a constraint on the original relation is violated or not.
The purpose of normalization is to reduce the chances for anomalies to occur in a database.
Example of problems related with Anomalies
EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria 2 SQL Database AAU Sidist_Kilo 5
16 Lemma Alemu 5 C++ Programming Unity Gerji 6
28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10
25 Abera Taye 6 VB6 Programming Helico Piazza 8
65 Almaz Belay 2 SQL Database Helico Piazza 9
24 Dereje Tamiru 8 Oracle Database Unity Gerji 5
51 Selam Belay 4 Prolog Programming Jimma Jimma City 8
94 Alem Kebede 3 Cisco Networking AAU Sidist_Kilo 7
18 Girma Dereje 1 IP Programming Jimma Jimma City 4
13 Yared Gizaw 7 Java Programming AAU Sidist_Kilo 6

 Deletion Anomalies:
If the employee with ID 16 is deleted, then all information about the skill C++ and its skill
type is deleted from the database. We will then not have any information about C++
and its skill type.
 Insertion Anomalies:
What if we have a new employee with a skill called Pascal? We cannot decide whether
Pascal is allowed as a value for skill, and we have no clue about the type of skill that
Pascal should be categorized as.


 Modification Anomalies:
What if the address for Helico is changed from Piazza to Mexico? We need to look for
every occurrence of Helico and change the value of School_Add from Piazza to Mexico,
which is prone to error.
A database-management system can work only with the information that we put explicitly into its
tables for a given database and into its rules for working with those tables, where such rules are
appropriate and possible.

Functional Dependency (FD)


Before moving to the definition and application of normalization, it is important to have an
understanding of "functional dependency."
Data Dependency
The logical associations between data items that point the database designer in the direction of a
good database design are referred to as determinant or dependent relationships.
Two data items A and B are said to be in a determinant or dependent relationship if certain values
of data item B always appear with certain values of data item A. If data item A is the
determinant data item and B the dependent data item, then the direction of the association is from
A to B and not vice versa.

The essence of this idea is that if the existence of something, call it A, implies that B must exist and
have a certain value, then we say that "B is functionally dependent on A." We also often express
this idea by saying that "A functionally determines B," or that "B is a function of A," or that "A
functionally governs B." Often, the notions of functionality and functional dependency are
expressed briefly by the statement, "If A, then B." It is important to note that the value of B must be
unique for a given value of A, i.e., any given value of A must imply just one and only one value of
B, in order for the relationship to qualify for the name "function." (However, this does not
necessarily prevent different values of A from implying the same value of B.)

However, for the purpose of normalization, we are interested in finding 1..1 (one to one)
dependencies, lasting for all times (intension rather than extension of the database), and the
determinant having the minimal number of attributes.

X → Y holds if whenever two tuples have the same value for X, they must have the same value for Y.
The notation is A → B, which is read as: B is functionally dependent on A.
In general, a functional dependency is a relationship among attributes. In relational databases, we
can have a determinant that governs one or several other attributes.
FDs are derived from the real-world constraints on the attributes and they are properties on the
database intension not extension.


Example
Dinner Course Type of Wine
Meat Red
Fish White
Cheese Rose

Since the type of Wine served depends on the type of Dinner, we say Wine is functionally
dependent on Dinner.
Dinner → Wine
Dinner Course Type of Wine Type of Fork
Meat Red Meat fork
Fish White Fish fork
Cheese Rose Cheese fork

Since both Wine type and Fork type are determined by the Dinner type, we say Wine is
functionally dependent on Dinner and Fork is functionally dependent on Dinner.
Dinner → Wine
Dinner → Fork
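One informal way to test a proposed functional dependency such as Dinner → Wine against the current contents of a table is to look for a dinner value that appears with more than one wine value. A sketch in SQL, assuming the rows are stored in a hypothetical table Meal(Dinner, Wine):

SELECT Dinner
FROM   Meal
GROUP  BY Dinner
HAVING COUNT(DISTINCT Wine) > 1;   -- any row returned violates Dinner -> Wine

Note that such a query only checks the current extension of the table; as stated above, a functional dependency is a property of the intension and must hold for all time.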

Partial Dependency
If an attribute which is not a member of the primary key is dependent on some part of the
primary key (if we have composite primary key) then that attribute is partially functionally
dependent on the primary key.
Let {A,B} be the Primary Key and C a non-key attribute.
Then if {A,B} → C and B → C,
then C is partially functionally dependent on {A,B}.

Full Functional Dependency


If an attribute which is not a member of the primary key is not dependent on some part of the
primary key but the whole key (if we have composite primary key) then that attribute is fully
functionally dependent on the primary key.
Let {A,B} be the Primary Key and C a non-key attribute.
Then if {A,B} → C holds, but neither A → C nor B → C holds,
then C is fully functionally dependent on {A,B}.

Transitive Dependency
In mathematics and logic, a transitive relationship is a relationship of the following form: "If A
implies B, and if also B implies C, then A implies C."


Example:
If Mr X is a Human, and if every Human is an Animal, then Mr X must be an Animal.
Generalized way of describing transitive dependency is that:
If A functionally governs B, AND
If B functionally governs C
THEN A functionally governs C
Provided that neither C nor B determines A, i.e. neither B → A nor C → A holds.
In the normal notation:
{(A → B) AND (B → C)} ==> A → C, provided that neither B → A nor C → A holds.

Steps of Normalization:
We have various levels or steps in normalization called Normal Forms. The level of complexity,
the strength of the rules and the degree of decomposition increase as we move from a lower-level
Normal Form to a higher one.
 A table in a relational database is said to be in a certain normal form if it satisfies certain
constraints.
 A normal form below represents a stronger condition than the previous one

Normalization towards a logical design consists of the following steps:


UnNormalized Form(UNF):
Identify all data elements
First Normal Form(1NF):
Find the key with which you can find all data i.e. remove any repeating group
Second Normal Form(2NF):
Remove part-key dependencies (partial dependency). Make all data dependent on the
whole key.
Third Normal Form(3NF)
Remove non-key dependencies (transitive dependencies). Make all data dependent on
nothing but the key.
For most practical purposes, databases are considered normalized if they adhere to the third
normal form (there is no transitive dependency).

First Normal Form (1NF)


Requires that all column values in a table are atomic (e.g., a number is an atomic value,
while a list or a set is not).
We have two ways of achieving this:
1. Putting each repeating group into a separate table and connecting them with a
primary key-foreign key relationship


2. Moving these repeating groups to a new row by repeating the non-repeating
attributes; this is known as ―flattening‖ the table. If so, then find the key with which
you can find all data.

Definition: a table (relation) is in 1NF


If
 There are no duplicated rows in the table (i.e. there is a unique identifier).
 Each cell is single-valued (i.e., there are no repeating groups).
 Entries in a column (attribute, field) are of the same kind.

Example for First Normal form (1NF)


UNNORMALIZED
EmpID FirstName LastName Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria SQL, Database, AAU, Sidist_Kilo 5
VB6 Programming Helico Piazza 8
16 Lemma Alemu C++ Programming Unity Gerji 6
IP Programming Jimma Jimma City 4
28 Chane Kebede SQL Database AAU Sidist_Kilo 10
65 Almaz Belay SQL Database Helico Piazza 9
Prolog Programming Jimma Jimma City 8
Java Programming AAU Sidist_Kilo 6
24 Dereje Tamiru Oracle Database Unity Gerji 5
94 Alem Kebede Cisco Networking AAU Sidist_Kilo 7

FIRST NORMAL FORM (1NF)


Remove all repeating groups. Distribute the multi-valued attributes into different rows and
identify a unique identifier for the relation so that it can be said to be a relation in a relational
database. Flatten the table.
EmpID FirstName LastName SkillID Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria 1 SQL Database AAU Sidist_Kilo 5
12 Abebe Mekuria 3 VB6 Programming Helico Piazza 8
16 Lemma Alemu 2 C++ Programming Unity Gerji 6
16 Lemma Alemu 7 IP Programming Jimma Jimma City 4
28 Chane Kebede 1 SQL Database AAU Sidist_Kilo 10
65 Almaz Belay 1 SQL Database Helico Piazza 9
65 Almaz Belay 5 Prolog Programming Jimma Jimma City 8
65 Almaz Belay 8 Java Programming AAU Sidist_Kilo 6
24 Dereje Tamiru 4 Oracle Database Unity Gerji 5
94 Alem Kebede 6 Cisco Networking AAU Sidist_Kilo 7


Second Normal form 2NF


No partial dependency of a non-key attribute on part of the primary key. This will result in a
set of relations at the level of Second Normal Form.
Any table that is in 1NF and has a single-attribute (i.e., a non-composite) key is automatically
also in 2NF.
Definition: a table (relation) is in 2NF
If
 It is in 1NF and
 If all non-key attributes are dependent on the entire primary key. i.e. no partial
dependency.
Example for 2NF:
EMP_PROJ
EmpID EmpName ProjNo ProjName ProjLoc ProjFund ProjMangID Incentive

EMP_PROJ rearranged
EmpID ProjNo EmpName ProjName ProjLoc ProjFund ProjMangID Incentive

Business rule: Whenever an employee participates in a project, he/she will be entitled for an
incentive.
This schema is in its 1NF since we don‘t have any repeating groups or attributes with multi-
valued property. To convert it to a 2NF we need to remove all partial dependencies of non key
attributes on part of the primary key.
{EmpID, ProjNo} → EmpName, ProjName, ProjLoc, ProjFund, ProjMangID, Incentive
But in addition to this we have the following dependencies
FD1: {EmpID} → EmpName
FD2: {ProjNo} → ProjName, ProjLoc, ProjFund, ProjMangID
FD3: {EmpID, ProjNo} → Incentive
As we can see, some non-key attributes are partially dependent on some part of the primary
key. This can be witnessed by analyzing the first two functional dependencies (FD1 and FD2).
Thus, each functional dependency, with its dependent attributes, should be moved to a
new relation where the determinant will be the primary key for each.
EMPLOYEE
EmpID EmpName
PROJECT
ProjNo ProjName ProjLoc ProjFund ProjMangID
EMP_PROJ
EmpID ProjNo Incentive
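If the data of the original relation already exists, the 2NF relations above can be populated by projecting it, as sketched below. The source table name EMP_PROJ_UNF is an assumption, and CREATE TABLE ... AS SELECT is not spelled identically in every DBMS (SQL Server, for example, uses SELECT ... INTO), so treat this purely as an illustration of the projections involved.

CREATE TABLE EMPLOYEE AS
    SELECT DISTINCT EmpID, EmpName FROM EMP_PROJ_UNF;

CREATE TABLE PROJECT AS
    SELECT DISTINCT ProjNo, ProjName, ProjLoc, ProjFund, ProjMangID FROM EMP_PROJ_UNF;

CREATE TABLE EMP_PROJ AS
    SELECT DISTINCT EmpID, ProjNo, Incentive FROM EMP_PROJ_UNF;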


Third Normal Form (3NF)


Eliminate columns dependent on another non-primary-key attribute: if attributes do not contribute to
a description of the key, remove them to a separate table. This level avoids update and delete
anomalies.
Definition: a Table (Relation) is in 3NF If:
 It is in 2NF and
 There are no transitive dependencies between a primary key and non-primary key attributes.
Example for (3NF)
Assumption: Students of same batch (same year) live in one building or dormitory
STUDENT
StudID Stud_F_Name Stud_L_Name Dept Year Dormitory
125/97 Abebe Mekuria Info Sc 1 401
654/95 Lemma Alemu Geog 3 403
842/95 Chane Kebede CompSc 3 403
165/97 Alem Kebede InfoSc 1 401
985/95 Almaz Belay Geog 3 403

This schema is in its 2NF since the primary key is a single attribute and there are no repeating
groups (multi-valued attributes).
Let‘s take StudID, Year and Dormitory and see the dependencies.
StudID → Year AND Year → Dormitory
Year cannot determine StudID and Dormitory cannot determine StudID. Then, transitively,
StudID → Dormitory
To convert it to 3NF we need to remove all transitive dependencies of non-key attributes on
other non-key attributes.
The non-primary-key attributes that depend on each other will be moved to another table and
linked with the main table using a Candidate Key - Foreign Key relationship.
STUDENT
StudID Stud_F_Name Stud_L_Name Dept Year
125/97 Abebe Mekuria Info Sc 1
654/95 Lemma Alemu Geog 3
842/95 Chane Kebede CompSc 3
165/97 Alem Kebede InfoSc 1
985/95 Almaz Belay Geog 3

DORM
Year Dormitory
1 401
3 403
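The decomposition is lossless: joining STUDENT and DORM back on Year reproduces the original relation. A sketch of the reconstructing query:

SELECT s.StudID, s.Stud_F_Name, s.Stud_L_Name, s.Dept, s.Year, d.Dormitory
FROM   STUDENT s
JOIN   DORM d ON d.Year = s.Year;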


Generally, even though there are four additional levels of normalization, a table is said to
be normalized if it reaches 3NF. A database with all tables in 3NF is said to be a Normalized
Database.
A mnemonic for remembering the rationale for normalization up to 3NF could be the following:
1. No Repeating or Redundancy: no repeating fields in the table.
2. The Fields Depend Upon the Key: the table should solely depend on the key.
3. The Whole Key: no partial key dependency.
4. And Nothing But the Key: no inter-data dependency.
5. So Help Me Codd: since Codd came up with these rules.

Other Levels of Normalization


Boyce-Codd Normal Form (BCNF):
BCNF is based on functional dependency that takes into account all the candidate keys in a
relation. So, a table is in BCNF if it is in 3NF and every determinant is a candidate key.
Violation of BCNF is very rare. The potential sources of violation of this rule are:
 The relation contains two (or more) composite candidate keys
 The candidate keys overlap, i.e. have a common attribute.
The issue is related to:
 Isolating Independent Multiple Relationships - No table may contain two or more 1:N or N:M
relationships that are not directly related.
 The correct solution, to cause the model to be in 4th normal form, is to ensure that all M:M
relationships are resolved independently if they are indeed independent, as shown below.
Fourth Normal Form (4NF)
 Isolate Semantically Related Multiple Relationships - There may be practical constraints on
information that justify separating logically related many-to-many relationships.
 MVD (Multi-Valued Dependency): represents a dependency between attributes (for example A,
B, C) in a relation such that for every value of A there is a set of values for B and there is a set of
values for C, but the sets B and C are independent of each other.
 An MVD between attributes A, B, and C in a relation is represented as follows:
 A →→ B and A →→ C
Def: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.

Fifth Normal Form (5NF)


Sometimes called Project-Join Normal Form (PJNF). 5NF is based on join dependency.
Join Dependency: a property of decomposition that ensures that no spurious tuples are generated
when relations are rejoined to obtain the original relation.
Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and if
every join dependency in the table is a consequence of the candidate keys of the table.


Domain-Key Normal Form (DKNF)


 A model free from all modification anomalies.
Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of
keys and domains.
The underlying ideas in normalization are simple enough. Through normalization we want to
design for our relational database a set of tables that;
(1) Contain all the data necessary for the purposes that the database is to serve,
(2) Have as little redundancy as possible,
(3) Accommodate multiple values for types of data that require them,
(4) Permit efficient updates of the data in the database, and
(5) Avoid the danger of losing data unknowingly.

Pitfalls of Normalization: Problems associated with normalization


 Requires data to see the problems
 Is time consuming
 May reduce performance of the system
 Difficult to design and apply
 Prone to human error


Chapter Five
Physical Database Design Methodology for Relational Database
We have established that there are three levels of database design:
 Conceptual design: producing a data model which accounts for the relevant entities and
relationships within the target application domain;
 Logical design: ensuring, via normalization procedures and the definition of integrity rules, that
the stored database will be non-redundant and properly connected;
 Physical design: specifying how database records are stored, accessed and related to ensure
adequate performance.
It is considered desirable to keep these three levels quite separate -- one of Codd's
requirements for an RDBMS is that it should maintain logical-physical data independence. The
generality of the relational model means that RDBMSs are potentially less efficient than those
based on one of the older data models where access paths were specified once and for all at the
design stage. However the relational data model does not preclude the use of traditional
techniques for accessing data - it is still essential to exploit them to achieve adequate
performance with a database of any size.
We can consider the topic of physical database design from three aspects:
 What techniques for storing and finding data exist
 Which are implemented within a particular DBMS
 Which might be selected by the designer for a given application knowing the properties of the
data
Thus the purpose of physical database design is:
1. How to map the logical database design to a physical database design.
2. How to design base relations for target DBMS.
3. How to design enterprise constraints for target DBMS.
4. How to select appropriate file organizations based on analysis of transactions.
5. When to use secondary indexes to improve performance.
6. How to estimate the size of the database
7. How to design user views
8. How to design security mechanisms to satisfy user requirements.
9. How to design procedures and triggers.
Physical database design is the process of producing a description of the implementation of
the database on secondary storage. Physical design describes the base relation, file
organization, and indexes used to achieve efficient access to the data, and any associated
integrity constraints and security measures.
 Sources of information for the physical design process include the global logical data model and
the documentation that describes the model, i.e. the set of normalized relations.
 Logical database design is concerned with the what; physical database design is concerned with
the how.


 Physical database design describes the storage structures and access methods used to achieve efficient access to the data.
Steps in physical database design
 Translate logical data model for target DBMS
o Design base relation
o Design representation of derived data
o Design enterprise constraint
 Design physical representation
o Analyze transactions
o Choose file organization
o Choose indexes
o Estimate disk space and system requirement
 Design user view
 Design security mechanisms
 Consider controlled redundancy
 Monitor and tune the operational system

Translate logical data model for target DBMS


This phase is the translation of the global logical data model to produce a relational database
schema in the target DBMS. This includes creating the data dictionary based on the logical
model and information gathered.
After the creation of the data dictionary, the next activity is to understand the functionality of
the target DBMS so that all necessary requirements are fulfilled for the database intended to be
developed.
Knowledge of the DBMS includes:
 how to create base relations
 whether the system supports:
o definition of Primary key
o definition of Foreign key
o definition of Alternate key(Unique keys)
o definition of Domains
o Referential integrity constraints
o definition of enterprise level constraints


1.1. Design Base Relation


To decide how to represent base relations identified in global logical model in target DBMS.
Designing base relation involves identification of all necessary requirements about a relation
starting from the name up to the referential integrity constraints.
For each relation, need to define:
 The name of the relation;
 A list of simple attributes in brackets;
 The PK and, where appropriate, AKs and FKs.
 A list of any derived attributes and how they should be computed;
 Referential integrity constraints for any FKs identified.
For each attribute, need to define:
 Its domain, consisting of a data type, length, and any constraints on the domain;
 An optional default value for the attribute;
 Whether the attribute can hold nulls.
 Whether the attribute can be derived and, if so, how it should be computed.
The implementation of the physical model is dependent on the target DBMS, since some DBMSs
have more facilities than others for defining database definitions.
The base relation design along with every justifiable reason should be fully documented.
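As an illustration of this kind of documentation expressed directly in DDL, a base relation could be declared as follows. The relation, its domains, the default value and the constraints here are assumptions used only to show the notation, not part of the logical model developed earlier.

-- Hypothetical base relation showing domains, a default value, null rules and a domain constraint.
CREATE TABLE Branch (
    BranchNo  INT          NOT NULL,
    Street    VARCHAR(50)  NOT NULL,
    City      VARCHAR(30)  NOT NULL,
    Country   VARCHAR(30)  DEFAULT 'Ethiopia',      -- optional default value
    Telephone CHAR(13),                             -- nulls allowed
    StaffCnt  INT          CHECK (StaffCnt >= 0),   -- simple constraint on the domain
    PRIMARY KEY (BranchNo)
);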

1.2. Design representation of derived data


While analyzing the requirements of users, we may find that there are some attributes
holding data that will be derived from existing or other attributes. A decision on how to
represent any derived data present in the global logical data model in the target DBMS should
be devised.
Examine logical data model and data dictionary, and produce list of all derived attributes.
Most of the time derived attributes are not expressed in the logical model but will be included
in the data dictionary. Whether to store derived attributes in a base relation or calculate them when
required is a decision to be made by the designer considering the performance impact.
Option selected is based on:
 Additional cost to store the derived data and keep it consistent with operational data
from which it is derived;
 Cost to calculate it each time it is required.
Less expensive option is chosen subject to performance constraints.
The representation of derived attributes should be fully documented.
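As a sketch of the "calculate it when required" option, a derived attribute such as the number of employees per department (an attribute assumed here only for illustration) could be computed through a view over the Employee and Department relations mapped earlier, instead of being stored in the Department base relation:

-- Derived attribute computed on demand rather than stored.
CREATE VIEW DeptStaffCount AS
SELECT d.DID, d.DName, COUNT(e.EID) AS NumEmployees
FROM   Department d
       LEFT JOIN Employee e ON e.EDID = d.DID
GROUP  BY d.DID, d.DName;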


1.3. Design enterprise constraint


Data in the database is not only subject to constraints from the database and the data model
used, but also to some enterprise-dependent constraints. These constraint definitions are also
dependent on the DBMS selected and on enterprise-level requirements.
One needs to know the functionality of the DBMS, since in designing the enterprise constraints
for the target DBMS some DBMSs provide more facilities than others. All the enterprise-level
constraints and the definition method in the target DBMS should be fully documented.
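For example, an enterprise rule such as ―no employee may earn more than 500,000‖ (a made-up figure) could be declared as follows in a DBMS that supports table-level CHECK constraints; a DBMS without this facility would need a trigger or an application-level check instead.

-- Hypothetical enterprise constraint on the Employee relation.
ALTER TABLE Employee
    ADD CONSTRAINT chk_salary_cap CHECK (Salary <= 500000);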
2. Design physical representation
This phase is the level for determining the optimal file organizations to store the base relations
and the indexes that are required to achieve acceptable performance; that is, the way in which
relations and tuples will be held on secondary storage.
Number of factors that may be used to measure efficiency:
 Transaction throughput: number of transactions processed in given time interval.
 Response time: elapsed time for completion of a single transaction.
 Disk storage: amount of disk space required to store database files.
However, no one factor is always correct.
Typically, have to trade one factor off against another to achieve a reasonable balance.

2.1. Analyze transactions


The objective here is to understand the functionality of the transactions that will run on the
database and to analyze the important transactions.
Attempt to identify performance criteria, e.g.:
 Transactions that run frequently and will have a significant impact on performance;
 Transactions that are critical to the business;
 Times during the day/week when there will be a high demand made on the database
(called the peak load).
Use this information to identify the parts of the database that may cause performance
problems.
To select appropriate file organizations and indexes, also need to know high-level functionality
of the transactions, such as:
 Attributes that are updated in an update transaction;
 Criteria used to restrict tuples that are retrieved in a query.
Often not possible to analyze all expected transactions, so investigate most ‗important‘ ones.
To help identify which transactions to investigate, can use:
 Transaction/relation cross-reference matrix, showing relations that each transaction
accesses, and/or


 Transaction usage map, indicating which relations are potentially heavily used.
To focus on areas that may be problematic:
1. Map all transaction paths to relations.
2. Determine which relations are most frequently accessed by transactions.
3. Analyze the data usage of selected transactions that involve these relations.
2.2. Choose file organization
The objective here is to determine an efficient file organization for each base relation
File organizations include Heap, Hash, Indexed Sequential Access Method (ISAM), B+-Tree,
and Clusters.
Most DBMSs provide little or no option to select the file organization. However, they provide the
user with an option to select an index for every relation.
2.3. Choose indexes
The objective here is to determine whether adding indexes will improve the performance of
the system.
One approach is to keep tuples unordered and create as many secondary indexes as necessary.
Another approach is to order tuples in the relation by specifying a primary or clustering index.
In this case, choose the attribute for ordering or clustering the tuples as:
 Attribute that is used most often for join operations - this makes join operation more
efficient, or
 Attribute that is used most often to access the tuples in a relation in order of that
attribute.
If ordering attribute chosen is on the primary key of a relation, index will be a primary index;
otherwise, index will be a clustering index.
Each relation can only have either a primary index or a clustering index.
Secondary indexes provide a mechanism for specifying an additional key for a base relation
that can be used to retrieve data more efficiently.
There is an overhead involved in the maintenance and use of secondary indexes that has to be
balanced against the performance improvement gained when retrieving data.
This includes:
 Adding an index record to every secondary index whenever tuple is inserted;
 Updating a secondary index when corresponding tuple is updated;
 Increase in disk space needed to store the secondary index;
 Possible performance degradation during query optimization to consider all secondary
indexes.


Guidelines for Choosing Indexes


(1) Do not index small relations.
(2) Index PK of a relation if it is not a key of the file organization.
(3) Add secondary index to a FK if it is frequently accessed.
(4) Add secondary index to any attribute that is heavily used as a secondary key.
(5) Add secondary index on attributes that are involved in: selection or join criteria;
ORDER BY; GROUP BY; and other operations involving sorting (such as UNION
or DISTINCT).
(6) Add secondary index on attributes involved in built-in functions.
(7) Add secondary index on attributes that could result in an index-only plan.
(8) Avoid indexing an attribute or relation that is frequently updated.
(9) Avoid indexing an attribute if the query will retrieve a significant proportion of
the tuples in the relation.
(10) Avoid indexing attributes that consist of long character strings.
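Applying guidelines (3) to (5) to the Employee relation mapped earlier, for example, one might create secondary indexes such as the following; the index names are assumptions.

-- Secondary index on the FK that is frequently used to join Employee with Department.
CREATE INDEX idx_employee_edid ON Employee (EDID);

-- Secondary index supporting frequent searches and ORDER BY on last name.
CREATE INDEX idx_employee_lname ON Employee (LName);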

2.4. Estimate disk space and system requirement


The objective here is to estimate the amount of disk space that will be required by the
database.
Purpose is to answer the following questions:
 If system already exists: is there adequate storage?
 If procuring new system: what storage will be required?
3. Design user view
To design the user views that were identified during the Requirements
Collection and Analysis stage of the relational database application development lifecycle.
Define views in DDL to provide user views identified in data model
Map onto objects in physical data model
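As an illustration, a user view that hides salary information from general users could be defined in DDL as follows; the view name and the choice of columns are assumptions.

CREATE VIEW EmployeeContact AS
SELECT EID, FName, LName, EDID
FROM   Employee;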
4. Design security mechanisms
To design the security measures for the database as specified by the users.
System security: authentication
Data security: authorization
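In SQL-based DBMSs, authorization is typically expressed with GRANT statements; a sketch, assuming a user or role named hr_clerk already exists and using the EmployeeContact view defined above.

-- hr_clerk may read the restricted view and update only the Salary column of the base table.
GRANT SELECT ON EmployeeContact TO hr_clerk;
GRANT SELECT, UPDATE (Salary) ON Employee TO hr_clerk;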
5. Consider the Introduction of Controlled Redundancy
The objective here is to determine whether introducing redundancy in a controlled manner by
relaxing the normalization rules will improve the performance of the system. This is
sometimes known as denormalization.
Informally speaking, denormalization is the merging of relations.


Result of normalization is a logical database design that is structurally consistent and has
minimal redundancy.
However, sometimes a normalized database design does not provide maximum processing
efficiency.
It may be necessary to accept the loss of some of the benefits of a fully normalized design in
favor of performance.
Also consider that denormalization:
 Makes implementation more complex;
 Often sacrifices flexibility;
 May speed up retrievals but it slows down updates.
Denormalization refers to a refinement of the relational schema such that the degree of
normalization for a modified relation is less than the degree of at least one of the original
relations.
The term is also used more loosely to refer to situations where two relations are combined into
one new relation, which is still normalized but contains more nulls than the original relations.
There is no fixed rule for when to denormalize, but consider it in the following situations,
specifically to speed up frequent or critical transactions (a small SQL sketch follows the list of steps):
 Step 1 Combining 1:1 relationships
 Step 2 Duplicating non-key attributes in 1:* relationships to reduce joins
 Step 3 Duplicating foreign key attributes in 1:* relationships to reduce joins
 Step 4 Introducing repeating groups
 Step 5 Merging lookup tables with base relations
 Step 6 Creating extract tables.
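A minimal SQL sketch of Step 2 (duplicating a non-key attribute to reduce joins), assuming hypothetical Customer and Orders tables that are not part of these notes; exact ALTER TABLE syntax varies between DBMSs:

-- Before: listing orders with the customer's city requires a join to Customer.
-- After denormalization, City is duplicated into Orders so the join can be avoided:
ALTER TABLE Orders ADD City VARCHAR(30);

UPDATE Orders
SET    City = (SELECT c.City
               FROM   Customer c
               WHERE  c.CustID = Orders.CustID);

The price of this duplication is that every change to Customer.City must now also be propagated to Orders, which illustrates why denormalization may speed up retrievals but slows down updates.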
6. Monitoring and Tuning the operational system
The objective here is to monitor the operational system and improve its performance in order to
correct inappropriate design decisions or to reflect changing requirements.
Importance of monitoring and tuning the operational system:
 Avoids procurement of additional hardware
 Allows downsizing of the hardware configuration, i.e. less and cheaper hardware and less
expensive maintenance
 Faster response time and higher throughput make the organization more productive
 Faster response time improves staff morale and customer satisfaction


Chapter Six
Relational Query Languages
In addition to the structural component of any data model, the manipulation mechanism is
equally important. This component of any data model is called the "query language".
 Query languages: allow manipulation and retrieval of data from a database.
 Query languages != programming languages!
 QLs are not intended to be used for complex calculations.
 QLs support easy, efficient access to large data sets.
 Relational model supports simple, powerful query languages.
Formal Relational Query Languages
 There are varieties of Query languages used by relational DBMS for manipulating
relations.
 Some of them are procedural
 User tells the system exactly what and how to manipulate the data
 Others are non-procedural
 User states what data is needed rather than how it is to be retrieved.
Two mathematical Query Languages form the basis for Relational Query Languages
 Relational Algebra:
 Relational Calculus:
 We may describe the relational algebra as procedural language: it can be used to tell the
DBMS how to build a new relation from one or more relations in the database.
 We may describe relational calculus as a non-procedural language: it can be used to
formulate the definition of a relation in terms of one or more database relations.
 Formally the relational algebra and relational calculus are equivalent to each other. For
every expression in the algebra, there is an equivalent expression in the calculus.
 Both are non-user friendly languages. They have been used as the basis for other,
higher-level data manipulation languages for relational databases.
A query is applied to relation instances, and the result of a query is also a relation instance.
 Schemas of input relations for a query are fixed
 The schema for the result of a given query is also fixed! Determined by definition
of query language constructs.

Relational Algebra
The basic set of operations for the relational model is known as the relational algebra. These
operations enable a user to specify basic retrieval requests.


The result of the retrieval is a new relation, which may have been formed from one or more
relations. The algebra operations thus produce new relations, which can be further
manipulated using operations of the same algebra.
A sequence of relational algebra operations forms a relational algebra expression, whose
result will also be a relation that represents the result of a database query (or retrieval request).
 Relational algebra is a theoretical language with operations that work on one or more
relations to define another relation without changing the original relation.
 The output from one operation can become the input to another operation (nesting is
possible)
 There are different basic operations that could be applied on relations on a database
based on the requirement.
 Selection (  ) Selects a subset of rows from a relation.
 Projection (  ) Deletes unwanted columns from a relation.
 Renaming: assigning intermediate relation for a single operation
 Cross-Product ( x ) Allows to concatenate a tuple from one relation with all
the tuples from the other relation.
 Set-Difference ( - ) Tuples in relation R1, but not in relation R2.
 Union ( ) Tuples in relation R1, or in relation R2.
 Intersection () Tuples in relation R1 and in relation R1
 Join Tuples joined from two relations based on a condition
Join and intersection are derivable from the rest.
 Using these, we can build up sophisticated database queries.
Table 1: Sample table used to illustrate different kinds of relational operations. The relation
contains information about employees, the IT skills they have, and the school where they attended
each skill.
Employee
EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria 2 SQL Database AAU Sidist_Kilo 5
16 Lemma Alemu 5 C++ Programming Unity Gerji 6
28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10
25 Abera Taye 6 VB6 Programming Helico Piazza 8
65 Almaz Belay 2 SQL Database Helico Piazza 9
24 Dereje Tamiru 8 Oracle Database Unity Gerji 5
51 Selam Belay 4 Prolog Programming Jimma Jimma City 8
94 Alem Kebede 3 Cisco Networking AAU Sidist_Kilo 7
18 Girma Dereje 1 IP Programming Jimma Jimma City 4
13 Yared Gizaw 7 Java Programming AAU Sidist_Kilo 6


1. Selection
 Selects subset of tuples/rows in a relation that satisfy selection condition.
 Selection operation is a unary operator (it is applied to a single relation)
 The Selection operation is applied to each tuple individually
 The degree of the resulting relation is the same as the original relation but the
cardinality (no. of tuples) is less than or equal to the original relation.
 The Selection operator is commutative.
 A set of conditions can be combined using the Boolean operators ∧ (AND), ∨ (OR), and
¬ (NOT)
 No duplicates in result!
 Schema of result identical to schema of (only) input relation.
 Result relation can be the input for another relational algebra operation! (Operator
composition.)
 It is a filter that keeps only those tuples that satisfy a qualifying condition (those
satisfying the condition are selected while others are discarded.)
Notation:

σ<Selection Condition> (<Relation Name>)


Example: Find all Employees with skill type of Database.

σ<SkillType="Database"> (Employee)


This query extracts every tuple (with all of its attributes) from the Employee relation where the
value of the SkillType attribute is "Database".
The resulting relation will be the following.
EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria 2 SQL Database AAU Sidist_Kilo 5
28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10
65 Almaz Belay 2 SQL Database Helico Piazza 9
24 Dereje Tamiru 8 Oracle Database Unity Gerji 5
If the query is for all employees with SkillType "Database" and School "Unity", the relational
algebra operation and the resulting relation will be as follows.

σ<SkillType="Database" ∧ School="Unity"> (Employee)


EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
24 Dereje Tamiru 8 Oracle Database Unity Gerji 5
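In SQL (sketched here on the assumption that the Employee relation above exists as a SQL table), the same selections would be written with a WHERE clause:

SELECT *
FROM   Employee
WHERE  SkillType = 'Database';                      -- σ<SkillType="Database">(Employee)

SELECT *
FROM   Employee
WHERE  SkillType = 'Database' AND School = 'Unity'; -- σ with the combined condition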


2. Projection
 Selects certain attributes while discarding the others from the base relation.
 The PROJECT operation creates a vertical partitioning – one part with the needed columns
(attributes) containing the results of the operation, and the other containing the discarded
columns.
 Deletes attributes that are not in projection list.
 Schema of result contains exactly the fields in the projection list, with the same names
that they had in the (only) input relation.
 Projection operator has to eliminate duplicates!
 Note: real systems typically don't do duplicate elimination unless the user explicitly
asks for it.
 If the Primary Key is in the projection list, then duplication will not occur.
 Duplicate removal is necessary to ensure that the resulting table is also a relation.
Notation:
π<Selected Attributes> (<Relation Name>)
Example: To display Name, Skill, and Skill Level of an employee, the query and the resulting
relation will be:

π<FName, LName, Skill, SkillLevel> (Employee)


FName LName Skill SkillLevel
Abebe Mekuria SQL 5
Lemma Alemu C++ 6
Chane Kebede SQL 10
Abera Taye VB6 8
Almaz Belay SQL 9
Dereje Tamiru Oracle 5
Selam Belay Prolog 8
Alem Kebede Cisco 7
Girma Dereje IP 4
Yared Gizaw Java 6

If we want to have the Name, Skill, and Skill Level of an employee with Skill SQL and SkillLevel
greater than 5 the query will be:

π<FName, LName, Skill, SkillLevel> (σ<Skill="SQL" ∧ SkillLevel>5> (Employee))

FName LName Skill SkillLevel


Chane Kebede SQL 10
Almaz Belay SQL 9
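The corresponding SQL sketch (again assuming the Employee table) uses the SELECT list for projection; DISTINCT is added explicitly because, as noted above, real systems do not eliminate duplicates unless asked:

SELECT DISTINCT FName, LName, Skill, SkillLevel
FROM   Employee;                                    -- the plain projection

SELECT DISTINCT FName, LName, Skill, SkillLevel
FROM   Employee
WHERE  Skill = 'SQL' AND SkillLevel > 5;            -- projection over a selection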


3. Rename Operation
 We may want to apply several relational algebra operations one after the other. The
query could be written in two different forms:
1. Write the operations as a single relational algebra expression by nesting the
operations.
2. Apply one operation at a time and create intermediate result relations. In the
latter case, we must give names to the relations that hold the intermediate
results; this is the Rename operation.
If we want to have the Name, Skill, and Skill Level of an employee with salary greater than
1500 and working for department 5, we can write the expression for this query using the two
alternatives:
1. A single algebraic expression, nesting the operations:

π<FName, LName, Skill, SkillLevel> (σ<DeptNo=5 ∧ Salary>1500> (Employee))

2. Using an intermediate relation by the Rename Operation:

Step 1: Result1 ← σ<DeptNo=5 ∧ Salary>1500> (Employee)

Step 2: Result ← π<FName, LName, Skill, SkillLevel> (Result1)

Then Result will be equivalent to the relation we get using the first alternative.
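A rough SQL analogue of the intermediate-relation approach, assuming hypothetical DeptNo and Salary attributes on Employee (they are not in the sample table above), is a named subquery standing in for Result1:

WITH Result1 AS (                      -- plays the role of the renamed intermediate relation
    SELECT *
    FROM   Employee
    WHERE  DeptNo = 5 AND Salary > 1500
)
SELECT FName, LName, Skill, SkillLevel
FROM   Result1;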

4. Set Operations
The three main set operations are Union, Intersection, and Set Difference. The properties of
these set operations are similar to those in mathematical set theory. The difference is that, in
the database context, the elements of each set (which is a relation) are tuples. The set
operations are binary operations that require the two operand relations to be type compatible.
Type Compatibility
Two relations R1 and R2 are said to be Type Compatible if:
1. The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) have the same number
of attributes, and
2. The domains of corresponding attributes must be compatible; that is, Dom
(Ai)=Dom(Bi) for i=1, 2, ..., n.


To illustrate the three set operations, we will make use of the following two tables:
Employee
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
16 Lemma Alemu 5 C++ Programming Unity 6
28 Chane Kebede 2 SQL Database AAU 10
25 Abera Taye 6 VB6 Programming Helico 8
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5
51 Selam Belay 4 Prolog Programming Jimma 8
94 Alem Kebede 3 Cisco Networking AAU 7
18 Girma Dereje 1 IP Programming Jimma 4
13 Yared Gizaw 7 Java Programming AAU 6

RelationOne: Employees who attend Database Course


EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
28 Chane Kebede 2 SQL Database AAU 10
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5

RelationTwo: Employees who attend a course in AAU


EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
94 Alem Kebede 3 Cisco Networking AAU 7
28 Chane Kebede 2 SQL Database AAU 10
13 Yared Gizaw 7 Java Programming AAU 6

a. UNION Operation
The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that
are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
The two operands must be "type compatible"
E.g.: RelationOne ∪ RelationTwo
Employees who attend Database in any School or who attend any course at AAU
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
28 Chane Kebede 2 SQL Database AAU 10
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5
94 Alem Kebede 3 Cisco Networking AAU 7
13 Yared Gizaw 7 Java Programming AAU 6


b. INTERSECTION Operation
The result of this operation, denoted by R ∩ S, is a relation that includes all tuples that
are in both R and S. The two operands must be "type compatible"

Eg: RelationOne ∩ RelationTwo


Employees who attend Database Course at AAU

EmpID FName LName SkillID Skill SkillType School SkillLevel


12 Abebe Mekuria 2 SQL Database AAU 5
28 Chane Kebede 2 SQL Database AAU 10

c. Set Difference (or MINUS) Operation


The result of this operation, denoted by R - S, is a relation that includes all tuples that
are in R but not in S.
The two operands must be "type compatible"

Eg: RelationOne - RelationTwo


Employees who attend Database Course but didn’t take any course at AAU
EmpID FName LName SkillID Skill SkillType School SkillLevel
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5

Eg: RelationTwo - RelationOne


Employees who attended a course at AAU but did not attend any Database course
EmpID FName LName SkillID Skill SkillType School SkillLevel
94 Alem Kebede 3 Cisco Networking AAU 7
13 Yared Gizaw 7 Java Programming AAU 6

The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 - R2 has the same attribute names as
the first operand relation R1 (by convention).
Some Properties of the Set Operators
Notice that both union and intersection are commutative operations; that is,
R ∪ S = S ∪ R, and R ∩ S = S ∩ R
Both union and intersection can be treated as n-ary operations applicable to any number of
relations, as both are associative operations; that is,
R ∪ (S ∪ T) = (R ∪ S) ∪ T, and (R ∩ S) ∩ T = R ∩ (S ∩ T)
The minus operation is not commutative; that is, in general
R - S ≠ S - R
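In SQL, the three set operations map directly onto UNION, INTERSECT and EXCEPT (called MINUS in some DBMSs); this sketch assumes RelationOne and RelationTwo exist as tables or views with the same column list:

SELECT * FROM RelationOne
UNION                       -- duplicates eliminated, as in R ∪ S
SELECT * FROM RelationTwo;

SELECT * FROM RelationOne
INTERSECT                   -- R ∩ S
SELECT * FROM RelationTwo;

SELECT * FROM RelationOne
EXCEPT                      -- R - S (MINUS in some products)
SELECT * FROM RelationTwo;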


5. CARTESIAN (cross product) Operation


This operation is used to combine tuples from two relations in a combinatorial fashion. That
means every tuple in relation R will be paired with every tuple in relation S.
 In general, the result of R(A1, A2, . . ., An) x S(B1,B2, . . ., Bm) is a relation Q with degree n
+ m attributes Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.
 Where R has n attributes and S has m attributes.
 The resulting relation Q has one tuple for each combination of tuples—one from R and
one from S.
 Hence, if R has n tuples, and S has m tuples, then | R x S | will have n* m tuples.
Example:
Employee
ID FName LName
123 Abebe Lemma
567 Belay Taye
822 Kefle Kebede

Dept
DeptID DeptName MangID
2 Finance 567
3 Personnel 123

Then the Cartesian product between Employee and Dept relations will be of the form:
Employee X Dept:
ID FName LName DeptID DeptName MangID
123 Abebe Lemma 2 Finance 567
123 Abebe Lemma 3 Personnel 123
567 Belay Taye 2 Finance 567
567 Belay Taye 3 Personnel 123
822 Kefle Kebede 2 Finance 567
822 Kefle Kebede 3 Personnel 123

Basically, even though it is very important in query processing, the Cartesian Product is not
useful by itself since it relates every tuple in the first relation with every tuple in the
second relation. Thus, to make use of the Cartesian Product, one usually combines it with the
Selection operation, which discriminates among the tuples of a relation by testing whether each
one satisfies the selection condition.
In our example, to extract employee information about managers of the departments (Managers
of each department), the algebra query and the resulting relation will be.

π<ID, FName, LName, DeptName> (σ<ID=MangID> (Employee X Dept))


ID FName LName DeptName
123 Abebe Lemma Personnel
567 Belay Taye Finance
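Sketched in SQL under the assumption that Employee and Dept exist as tables, the Cartesian product is the comma-separated (or CROSS JOIN) form, and the manager query is that product restricted by a WHERE clause:

SELECT *
FROM   Employee, Dept;                 -- Employee X Dept (6 rows for the sample data)

SELECT E.ID, E.FName, E.LName, D.DeptName
FROM   Employee E, Dept D              -- Cartesian product ...
WHERE  E.ID = D.MangID;                -- ... filtered by the selection condition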


6. JOIN Operation
The sequence of a Cartesian product followed by a selection is used so commonly to identify and
select related tuples from two relations that a special operation, called JOIN, was created for it.
Thus, in the JOIN operation the Cartesian product and the Selection operations are used together.
The JOIN operation is denoted by the symbol ⋈.
This operation is very important for any relational database with more than a single relation,
because it allows us to process relationships among relations.
The general form of a join operation on two relations
R(A1, A2,. . ., An) and S(B1, B2, . . ., Bm) is:

R ⋈<join condition> S   is equivalent to   σ<selection condition> (R X S)

where <join condition> and <selection condition> are the same


Where R and S can be any relations that result from general relational algebra expressions.
Since JOIN is an operation that needs two relations, it is a binary operation.
This type of JOIN is called a THETA JOIN ( θ-JOIN ),
where θ is the comparison operator used in the join condition.
θ could be { <, ≤, >, ≥, ≠, = }
Example:
Thus, if in the above example we want to extract employee information about the managers of the
departments, the algebra query using the JOIN operation will be:
Employee ⋈<ID=MangID> Dept
a. EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons only (=).
Such a join, where the only comparison operator used is the equal sign is called an EQUIJOIN.
In the result of an EQUIJOIN we always have one or more pairs of attributes (whose names
need not be identical) that have identical values in every tuple since we used the equality
logical operator.
For example, the above JOIN expression is an EQUIJOIN since the logical operator used is the
equal to operator (=).
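The same equijoin is usually written in SQL with the explicit JOIN ... ON syntax, which is treated identically to the Cartesian-product-plus-selection form above (a sketch assuming the Employee and Dept tables):

SELECT E.ID, E.FName, E.LName, D.DeptName
FROM   Employee E
JOIN   Dept D ON E.ID = D.MangID;      -- equijoin: the only comparison is equality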
b. NATURAL JOIN Operation
We have seen that in an EQUIJOIN one of each pair of attributes with identical values is
superfluous. A new operation, called NATURAL JOIN, was created to get rid of the second (extra)
attribute that would appear in the result of an EQUIJOIN.


The standard definition of natural join requires that the two join attributes, or each pair of
corresponding join attributes, have the same name in both relations. If this is not the case, a
renaming operation on the attributes is applied first.

R1R S represents a natural join between R and S. The degree of R 1 is degree of R


plus Degree of S less the number of common attributes
c. OUTER JOIN Operation
OUTER JOIN is another version of the JOIN operation where non matching tuples from a
relation are also included in the result with NULL values for attributes in the other relation.
There are two major types of OUTER JOIN.
1. RIGHT OUTER JOIN: where non matching tuples from the second (Right) relation are
included in the result with NULL value for attributes of the first (Left) relation.
2. LEFT OUTER JOIN: where non matching tuples from the first (Left) relation are
included in the result with NULL value for attributes of the second (Right) relation.
Notation for Left Outer Join:

R ⟕<Join Condition> S    theta left outer join

R ⟕ S    natural left outer join
When two relations are joined by a JOIN operator, there could be some tuples in the first
relation not having a matching tuple from the second relation, and the query is interested to
display these non matching tuples from the first or second relation. Such query is represented
by the OUTER JOIN.
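A SQL sketch of the left outer join, assuming the small Employee and Dept tables used earlier; the employee who manages no department appears with NULLs for the Dept attributes:

SELECT E.ID, E.FName, E.LName, D.DeptName
FROM   Employee E
LEFT OUTER JOIN Dept D ON E.ID = D.MangID;
-- 822 Kefle Kebede would appear with a NULL DeptName, since he manages no department.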
d. SEMIJOIN Operation
SEMIJOIN is another version of the JOIN operation where the resulting relation contains only the
attributes of one of the relations, for those tuples that are related to tuples in the other relation.
The following notation depicts the inclusion of only the attributes from the first relation (R)
in the result, for the tuples that actually participate in the relationship.

R ⋉<Join Condition> S
Aggregate functions and Grouping statements
Some queries may involve aggregate functions (scalar aggregates, like totals in a report, or
vector aggregates, like subtotals in a report).


a) ℱ<AL> (R): scalar aggregate functions on relation R, with AL a list of (<aggregate
function>, <attribute>) pairs.

b) GA ℱ<AL> (R): vector aggregate functions on relation R, with AL a list of (<aggregate
function>, <attribute>) pairs and GA a grouping attribute.
Example (a): the number of employees in an organization (assume you have an Employee table).
This is a scalar aggregate:
PR(Num_Employees) ← ℱ<COUNT EmpID> (Employee), where PR = produce relation R

Example (b): the number of employees in each department of an organization
(assume you have an Employee table).
This is a vector aggregate:
PR(DeptID, Num_Employees) ← DeptID ℱ<COUNT EmpID> (Employee), where PR = produce relation R
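The two aggregate examples correspond to the following SQL sketch (assuming the Employee table has an EmpID attribute and, for the grouped case, a DeptID attribute; the latter is an assumption rather than part of the sample table):

SELECT COUNT(EmpID) AS Num_Employees   -- scalar aggregate: one row for the whole relation
FROM   Employee;

SELECT DeptID, COUNT(EmpID) AS Num_Employees
FROM   Employee
GROUP BY DeptID;                       -- vector aggregate: one row per department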

Relational Calculus
A relational calculus expression creates a new relation, which is specified in terms of variables
that range over rows of the stored database relations (in tuple calculus) or over columns of the
stored relations (in domain calculus).
In a calculus expression, there is no order of operations to specify how to retrieve the query
result. A calculus expression specifies only what information the result should contain rather
than how to retrieve it.
In Relational calculus, there is no description of how to evaluate a query; this is the main
distinguishing feature between relational algebra and relational calculus.
Relational calculus is considered to be a nonprocedural language. This differs from relational
algebra, where we must write a sequence of operations to specify a retrieval request; hence
relational algebra can be considered as a procedural way of stating a query.
When applied to relational databases, the calculus is not the differential and integral calculus of
mathematics, but a form of first-order logic (predicate calculus); a predicate is a truth-valued
function with arguments.
When we substitute values for the arguments in the predicate, the function yields an
expression, called a proposition, which can be either true or false.
If a predicate contains a variable, as in 'x is a member of staff', there must be a range for
x. When we substitute some values of this range for x, the proposition may be true; for
other values, it may be false.


If COND is a predicate, then the set of all tuples evaluated to be true for the predicate
COND will be expressed as follows:

{t | COND(t)}
Where t is a tuple variable and COND (t) is a conditional expression
involving t. The result of such a query is the set of all tuples t that satisfy
COND (t).
If we have a set of predicates to evaluate for a single query, the predicates can be
connected using ∧ (AND), ∨ (OR), and ¬ (NOT).
Tuple-oriented Relational Calculus
 The tuple relational calculus is based on specifying a number of tuple variables. Each
tuple variable usually ranges over a particular database relation, meaning that the
variable may take as its value any individual tuple from that relation.
 Tuple relational calculus is interested in finding tuples for which a predicate is true for
a relation. Based on use of tuple variables.
 Tuple variable is a variable that ‘ranges over’ a named relation: that is, a variable whose
only permitted values are tuples of the relation.
 If E is a tuple that ranges over a relation employee, then it is represented as
EMPLOYEE(E) i.e. Range of E is EMPLOYEE
 Then to extract all tuples that satisfy a certain condition, we will represent it as all
tuples E such that COND(E) is evaluated to be true.
{E | COND(E)}
The predicates can be connected using the Boolean operators:
∧ (AND), ∨ (OR), ¬ (NOT)
COND(t) is a formula, and is called a Well-Formed Formula (WFF), if:
 COND is composed of n predicates (a formula composed of single predicates), and the
predicates are connected by any of the Boolean operators.
 Each predicate is of the form A θ B, where θ is one of the comparison operators { <,
≤, >, ≥, ≠, = }, which evaluates to either true or false, and A and B are
either constants or variables.
 The formula is unambiguous and makes sense.


Example (Tuple Relational Calculus)


 Extract all employees whose skill level is greater than or equal to 8
{E | Employee(E)  E.SkillLevel >= 8}

EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel


28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10
25 Abera Taye 6 VB6 Programming Helico Piazza 8
65 Almaz Belay 2 SQL Database Helico Piazza 9
51 Selam Belay 4 Prolog Programming Jimma Jimma City 8

 To find only the EmpID, FName, LName, Skill, and the School where the skill was
attended, for employees with skill level greater than or equal to 8, the tuple
relational calculus expression will be:
{E.EmpID, E.FName, E.LName, E.Skill, E.School | Employee(E) ∧ E.SkillLevel >= 8}

EmpID FName LName Skill School


28 Chane Kebede SQL AAU
25 Abera Taye VB6 Helico
65 Almaz Belay SQL Helico
51 Selam Belay Prolog Jimma

 E.FName means the value of the First Name (FName) attribute for the tuple E.
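Since SQL is largely based on tuple relational calculus, the last expression has a direct SQL counterpart (a sketch assuming the Employee table):

SELECT EmpID, FName, LName, Skill, School
FROM   Employee
WHERE  SkillLevel >= 8;   -- the condition plays the role of COND(E)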

Quantifiers in Relational Calculus


 To tell how many instances the predicate applies to, we can use the two quantifiers in
the predicate logic.
 A relational calculus expression written using the existential quantifier can also be
expressed using the universal quantifier.
1. Existential quantifier ∃ ('there exists')
The existential quantifier is used in formulae that must be true for at least one instance,
such as:
An employee with skill level greater than or equal to 8 will be:
{E | Employee(E) ∧ (∃E)(E.SkillLevel >= 8)}
This means there exists at least one tuple of the relation Employee where the value of
the SkillLevel attribute is greater than or equal to 8.
2. Universal quantifier ∀ ('for all')
The universal quantifier is used in statements about every instance, such as:
An employee with skill level greater than or equal to 8 will be:
{E | Employee(E) ∧ (∀E)(E.SkillLevel >= 8)}


This means: for all tuples of the relation Employee, the value of the SkillLevel attribute is
greater than or equal to 8.
Example:
Let's say that we have the following schema (set of relations):
Employee(EID, FName, LName, EDID)
Project(PID, PName, PDID)
Dept(DID, DName, DMangID)
WorksOn(WEID, WPID)
To find employees who work on projects controlled by department 5 the query will be:
{E | Employee(E)  (P)(Project(P)  (w)(WorksOn(w)  PDID =5  EID=WEID))}

Domain Relational Calculus


In tuple relational calculus we use variables that range over tuples of a relation; in domain
relational calculus we use variables that range over domain elements (field values).
 An expression in the domain relational calculus has the following general form
{<x1, x2, ..., xn> | P(x1, x2, ..., xn, ..., xm)}
where <x1, x2, ..., xn> represents the domain variables and P(x1, x2, ..., xn, ..., xm) represents the
formula.
Atomic formulas are of the form R(x1, x2, ..., xn), xi θ xj, or xi θ C, where θ є {<, >, <=, >=, =, ≠},
R is a relation of degree n, each xi is a domain variable, and C is a constant.
If f1 and f2 are formulas, then so are:
f1 ∧ f2, f1 ∨ f2, ¬f1, (∃x)f1, (∀x)f1
 The answer to such a query includes all tuples with attributes <x1, x2, ..., xn> that make the
formula P(x1, x2, ..., xn, ..., xm) true.
 The formula is recursively defined, starting with simple atomic formulas (getting tuples from relations or
making comparisons of values) and building bigger formulas using the logical
connectives, i.e. the predicate P can be a set of formulas combined by Boolean operators.
Example: Consider the schema of relations given above (Employee, Project, Dept, WorksOn).
Query 1: List employees.
{FName, LName | (∃EID, EDID)(Employee(EID, FName, LName, EDID))}
Query 2: Find the list of employees who work in the IS department.
The domain relational calculus expression for the query is:
{EID, FName, LName | (∃DName, EDID, DID, DMangID)(Employee(EID, FName, LName, EDID) ∧
Dept(DID, DName, DMangID) ∧ DID=EDID ∧ DName='IS')}
Query 3: List the names of employees that do not manage any department.
{FName, LName | (∃EID, EDID)(Employee(EID, FName, LName, EDID) ∧
(¬(∃DID, DName, DMangID)(Dept(DID, DName, DMangID) ∧ EID=DMangID)))}


Chapter Seven
Advanced Concepts in Database Systems

 Database Security and Integrity


 Distributed Database Systems
 Data warehousing
1. Database Security and Integrity
A database represents an essential corporate resource that should be properly secured using
appropriate controls.
 Database security encompasses hardware, software, people and data
Multi-user database system - DBMS must provide a database security and authorization
subsystem to enforce limits on individual and group access rights and privileges.
Database security and integrity is about protecting the database from becoming inconsistent and
from being disrupted; such threats are collectively called database misuse.
Database misuse could be intentional or accidental; accidental misuse is easier to cope
with than intentional misuse.
Accidental inconsistency could occur due to:
 System crash during transaction processing
 Anomalies due to concurrent access
 Anomalies due to redundancy
 Logical errors
Intentional misuse, on the other hand, covers a variety of threats, including:
 Unauthorized reading of data
 Unauthorized modification of data or
 Unauthorized destruction of data
Most systems implement good Database Integrity to protect the system from accidental
misuse while there are many computer based measures to protect the system from intentional
misuse, which is termed as Database Security measures.
 Database security is considered in relation to the following situations:
 Theft and fraud
 Loss of confidentiality (secrecy)
 Loss of privacy
 Loss of integrity
 Loss of availability


Security Issues and general considerations


 Legal, ethical and social issues regarding the right to access information
 Physical control
 Policy issues regarding privacy of individual level at enterprise and national level
 Operational consideration on the techniques used (password, etc)
 System level security including operating system and hardware control
 Security levels and security policies in enterprise level
 Database security - the mechanisms that protect the database against intentional or
accidental threats. And Database security encompasses hardware, software, people and
data
 Threat – any situation or event, whether intentional or accidental, that may adversely affect
a system and consequently the organization
 A threat may be caused by a situation or event involving a person, action, or circumstance
that is likely to bring harm to an organization
 The harm to an organization may be tangible or intangible
o Tangible – loss of hardware, software, or data
o Intangible – loss of credibility or client confidence
Examples of threats:
 Using another person's means of access
 Unauthorized amendment/modification or copying of data
 Program alteration
 Inadequate policies and procedures that allow a mix of confidential and normal output
 Wire-tapping
 Illegal entry by a hacker
 Blackmail
 Creating a 'trapdoor' into the system
 Theft of data, programs, and equipment
 Failure of security mechanisms, giving greater access than normal
 Staff shortages or strikes
 Inadequate staff training
 Viewing and disclosing unauthorized data
 Electronic interference and radiation
 Data corruption owing to power loss or surge
 Fire (electrical fault, lightning strike, arson), flood, bomb
 Physical damage to equipment
 Breaking cables or disconnection of cables
 Introduction of viruses


Levels of Security Measures


Security measures can be implemented at several levels and for different components of the
system. These levels are:
1. Physical Level: concerned with physically securing the site containing the computer system.
The backup systems should also be physically protected from access except for authorized users.
2. Human Level: concerned with authorization of database users for accessing the content at
different levels and privileges.
3. Operating System: concerned with the weakness and strength of the operating system
security on data files. A weakness may serve as a means of unauthorized access to the
database. This also includes protection of data in primary and secondary memory from
unauthorized access.
4. Database System: concerned with data access limits enforced by the database system.
Access limits such as passwords, isolated transactions, etc.
Even though we can have different levels of security and authorization on data objects and
users, deciding who accesses which data is a policy matter rather than a technical one.
These policies
 should be known by the system: should be encoded in the system
 should be remembered: should be saved somewhere (the catalogue)
 An organization needs to identify the types of threat it may be subjected to and initiate
appropriate plans and countermeasures, bearing in mind the costs of implementing them

Countermeasures: Computer based controls


 The types of countermeasure to threats on computer systems range from physical controls
to administrative procedures
 Despite the range of computer-based controls that are available, it is worth noting that,
generally, the security of a DBMS is only as good as that of the operating system, owing to
their close association
 The following are computer-based security controls for a multi-user environment:
 Authorization
 The granting of a right or privilege that enables a subject to have legitimate
access to a system or a system‘s object
 Authorization controls can be built into the software, and govern not only what
system or object a specified user can access, but also what the user may do with it
 Authorization controls are sometimes referred to as access controls


 The process of authorization involves authentication of subjects (i.e. a user or
program) requesting access to objects (i.e. a database table, view, procedure,
trigger, or any other object that can be created within the system)
 Views
 A view is the dynamic result of one or more relational operations on the base relations
to produce another relation (a SQL sketch of view-based protection follows this list of countermeasures)
 A view is a virtual relation that does not actually exist in the database, but is
produced upon request by a particular user
 The view mechanism provides a powerful and flexible security mechanism by
hiding parts of the database from certain users
 Using a view is more restrictive than simply having certain privileges granted to
a user on the base relation(s)
 Integrity
 Integrity constraints contribute to maintaining a secure database system by
preventing data from becoming invalid and hence giving misleading or incorrect
results
 Domain Integrity
 Entity integrity
 Referential integrity
 Key constraints
 Backup and recovery
 Backup is the process of periodically taking a copy of the database and log file
(and possibly programs) on to offline storage media
 A DBMS should provide backup facilities to assist with the recovery of a
database following failure
 Database recovery is the process of restoring the database to a correct state in the
event of a failure
 Journaling is the process of keeping and maintaining a log file (or journal) of all
changes made to the database to enable recovery to be undertaken effectively in
the event of a failure
 The advantage of journaling is that, in the event of a failure, the database can be
recovered to its last known consistent state using a backup copy of the database
and the information contained in the log file
 If no journaling is enabled on a failed system, the only means of recovery is to
restore the database using the latest backup version of the database
 However, without a log file, any changes made after the last backup to the
database will be lost


 Encryption
 The encoding of the data by a special algorithm that renders the data unreadable
by any program without the decryption key
 If a database system holds particularly sensitive data, it may be deemed
necessary to encode it as a precaution against possible external threats or
attempts to access it
 The DBMS can access data after decoding it, although there is a degradation in
performance because of the time taken to decode it
 Encryption also protects data transmitted over communication lines
 To transmit data securely over insecure networks requires the use of a
cryptosystem (encryption and decryption keys and their corresponding algorithms)
 Authentication
 All users of the database will have different access levels and permission for
different data objects, and authentication is the process of checking whether the
user is the one with the privilege for the access level.
 It is the process of checking that users are who they say they are.
 Each user is given a unique identifier, which is used by the operating system to
determine who they are
 Thus the system will check whether the user with a specific username and
password is trying to use the resource.
 Associated with each identifier is a password, chosen by the user and known to
the operating system, which must be supplied to enable the operating system to
authenticate who the user claims to be
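As a small SQL sketch of the view, integrity and authorization countermeasures above (the view name and the account name are assumptions made for the example, not part of these notes):

-- A view hiding sensitive columns and rows of the base relation:
CREATE VIEW ProgrammingStaff AS
    SELECT EmpID, FName, LName, Skill
    FROM   Employee
    WHERE  SkillType = 'Programming';

GRANT SELECT ON ProgrammingStaff TO clerk_user;     -- the user never sees the base table

-- Declarative integrity constraints on a base relation:
CREATE TABLE Dept (
    DID     INT PRIMARY KEY,                        -- entity integrity
    DName   VARCHAR(30) NOT NULL,                   -- domain integrity
    DMangID INT,
    FOREIGN KEY (DMangID) REFERENCES Employee(EID)  -- referential integrity
);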
Any database access request will have the following three major components
1. Requested Operation: what kind of operation is requested by a specific query?
2. Requested Object: on which resource or data of the database is the operation sought
to be applied?
3. Requesting User: who is the user requesting the operation on the specified object?
The database should be able to check for all the three components before processing any
request. The checking is performed by the security subsystem of the DBMS.

Forms of user authorization


There are different forms of user authorization on the resource of the database. These forms
are privileges on what operations are allowed on a specific data object.


User authorization on the data/extension


1. Read Authorization: the user with this privilege is allowed only to read the content of the data
object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or items to
the data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes but
are not authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not
anything else.
 Different users, depending on the power of the user, can have one or the combination of the above
forms of authorization on different data objects.
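These forms of authorization correspond directly to SQL's GRANT and REVOKE statements; a sketch with hypothetical account names:

GRANT SELECT, INSERT ON Employee TO data_entry_clerk;   -- read + insert authorization
GRANT UPDATE (SkillLevel) ON Employee TO hr_officer;    -- update limited to one attribute
REVOKE INSERT ON Employee FROM data_entry_clerk;        -- privilege revocation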

Role of DBA in Database Security


The database administrator is responsible for making the database as secure as possible. For
this, the DBA has more powerful privileges than any other user. The DBA grants access
capabilities to database users for the content of the database.
The major responsibilities of DBA in relation to authorization of users are:
1. Account Creation: involves creating different accounts for different USERS as well as
USER GROUPS.
2. Security Level Assignment: involves assigning different users to different categories of
access levels.
3. Privilege Grant: involves giving different levels of privileges for different users and user
groups.
4. Privilege Revocation: involves denying or canceling previously granted privileges for
users due to various reasons.
5. Account Deletion: involves deleting an existing account of a user or user group. This is
similar to revoking all privileges of the user on the database.

2. Distributed Database Systems


 Database development facilitates the integration of data available in an organization and
enforces security on data access. But it is not always the case that organizational data reside
at one site. This demands that databases at different sites be integrated and synchronized with
all the facilities of the database approach, which leads to Distributed Database Systems.
 In a distributed database system, the database is stored on several computers. The
computers in a distributed system communicate with each other through various
communication media, such as high-speed buses or telephone lines.


 A distributed database system consists of a collection of sites, each of which maintains a


local database system and also participates in global transaction where different databases
are integrated together.
 Even though integration of data implies centralized storage and control, in distributed
database systems the intention is different. Data is stored in different database systems in a
decentralized manner but behaves as if it were centralized, through the use of computer
networks.
 A distributed database system consists of loosely coupled sites that share no physical
component and database systems that run on each site are independent of each other.
 Transactions may access data at one or more sites
 Organization may implement their database system on a number of separate computer
system rather than a single, centralized mainframe. Computer Systems may be located at
each local branch office.
The functionalities of a DDBMS will include: Extended Communication Services, Extended
Data Dictionary, Distributed Query Processing, Extended Concurrency Control and Extended
Recovery Services.
Concepts in DDBMS
 Replication: System maintains multiple copies of data, stored in different sites, for
faster retrieval and fault tolerance.
 Fragmentation: a relation is partitioned into several fragments stored at distinct sites (see the sketch after this list)
 Data transparency: Degree to which system user may remain unaware of the details of
how and where the data items are stored in a distributed system
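A rough illustration of horizontal fragmentation, assuming a hypothetical Customer relation with a Branch attribute (in a real DDBMS the fragments are defined and placed through the distributed schema, not by hand like this, and CREATE TABLE ... AS syntax varies between DBMSs):

-- Each fragment is a selection on the global relation, stored at the local site:
CREATE TABLE Customer_AA AS
    SELECT * FROM Customer WHERE Branch = 'Addis Ababa';

CREATE TABLE Customer_JM AS
    SELECT * FROM Customer WHERE Branch = 'Jimma';

-- The global relation can be reconstructed as the union of its fragments:
SELECT * FROM Customer_AA
UNION ALL
SELECT * FROM Customer_JM;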
Advantages of DDBMS
1. Data sharing and distributed control:
 User at one site may be able access data that is available at another site.
 Each site can retain some degree of control over local data
 We will have local as well as global database administrator
2. Reliability and availability of data
 If one site fails, the rest can continue operation as long as a transaction does not demand data
from the failed site, or the data it needs is replicated at other sites.
3. Speedup of query processing
 If a query involves data from several sites, it may be possible to split the query into sub-
queries that can be executed in parallel at several sites (parallel processing).
Disadvantages of DDBMS
1. Software development cost
2. Greater potential for bugs (parallel processing may endanger correctness)
3. Increased processing overhead (due to the extra communication between sites)
4. Communication problems


Homogeneous and Heterogeneous Distributed Databases


 In a homogeneous distributed database
 All sites have identical software
 Are aware of each other and agree to cooperate in processing user requests.
 Each site surrenders part of its autonomy in terms of right to change schemas or
software
 Appears to user as a single system
 In a heterogeneous distributed database
 Different sites may use different schemas and software
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
 Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing

3. Data warehousing
 Data warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides
support for decision making.
 Integrated → a centralized, consolidated database that integrates data derived from the
entire organization.
 Consolidates data from multiple and diverse sources with diverse formats.
 Helps managers to better understand the company's operations.
 Subject-Oriented → the data warehouse contains data organized by topics, e.g. sales,
marketing, finance, etc.
 Time variant: In contrast to the operational data that focus on current transactions, the
warehouse data represent the flow of data through time.
 Data warehouse contains data that reflect what happened last week, last month,
past five years, and so on.
 Non-volatile → once data enter the data warehouse, they are never removed, because
the data in the warehouse represent the company's entire history.
Differences between database and data warehouse
 Because data is added all the time, the warehouse is continually growing.
 The data warehouse and operational environments are separated. Data warehouse receives its
data from operational databases.
 Data warehouse environment is characterized by read-only transactions to very large data sets.
 Operational environment is characterized by numerous update transactions to a few data
entities at a time.
 Data warehouse contains historical data over a long time horizon.
 Ultimately Information is created from data warehouses. Such Information becomes the basis
for rational decision making.
 The data found in data warehouse is analyzed to discover previously unknown data
characteristics, relationships, dependencies, or trends.
