Advanced DB Notes V2


ADBMS NOTES

Abdul Qayoom
File Based Approaches:
Collection of application programs that perform services for the end users (e.g. reports).

Each program defines and manages its own data.

Limitations:

Separation and isolation of data: Each program maintains its own data set and is unaware of the data held by
other programs.

Duplication of data: Different programs may hold copies of the same data, which wastes storage space.

Data dependence: The file structure is defined in the program code, so any change to the structure forces changes to the programs.

Incompatible file formats: Programs are written in different languages and so cannot easily access each
other's files.

Fixed queries/proliferation of application programs: Each program is written for a specific function, so a
new program must be written for every new requirement.

Database:
A database is a collection of logically related data: the entities, attributes and relationships of an organization's
information.

The system catalogue (metadata) provides a description of the data, enabling program-data independence.

Database Management Software (DBMS):


A DBMS is software used to manage and access a database, typically through SQL queries.

The DBMS is the interface between the user and the database (User -> DBMS -> DB).

An RDBMS (Relational Database Management System) stores data in the form of tables (collections of rows
and columns).

Functions of DBMS:

 DBMS is responsible for the storage, retrieval and update of data.
 DBMS is responsible for concurrency control and recovery services.
 DBMS is responsible for authorization, integrity and utility services.
 DBMS provides a user-accessible catalogue and transaction support.
 DBMS provides support for data communication and promotes data independence.

Components of DBMS:

Hardware: PC or Network of systems.

Software: Operating System, Network Software and Application Software.

Data: (Collection of facts and figures) of any organization.

Procedures: Instructions and rules to design and manage database and DBMS.

People: Users of the database and DBMS.

Advantages of DBMS:
Data consistency and redundancy control.

Improved data integrity and security.

Improved data accessibility and responsiveness.

Increased productivity and consistency.

Improved backup and recovery services.

Disadvantage of DBMS:

Complexity: Managing large amounts of data makes the system more complex.

Cost: Hardware, software and data-conversion costs for a DBMS are high.

High failure impact: A software or hardware failure affects every application that relies on the database.

SQL (Structured Query Language):


DDL (Data Definition language):

DDL is used to define and modify the structure (schema) of database tables.

DDL allows us to name and describe the entities, attributes and relationships required for the application.

Queries:

CREATE TABLE: Creates a table.

DROP TABLE: Deletes a table.

ALTER TABLE: Modifies a table's structure.
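The DDL statements above can be tried directly in SQLite through Python's sqlite3 module. A minimal sketch, where the staff table and its columns are illustrative names of my choosing, not anything from these notes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE TABLE: define a new table and its structure
cur.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT)")

# ALTER TABLE: modify the structure by adding a column
cur.execute("ALTER TABLE staff ADD COLUMN salary REAL")
cols = [row[1] for row in cur.execute("PRAGMA table_info(staff)")]

# DROP TABLE: delete the table entirely (structure and data)
cur.execute("DROP TABLE staff")
tables = [row[0] for row in
          cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
```

After the ALTER, `cols` lists all three columns; after the DROP, the table no longer appears in the catalogue.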

DML (Data Manipulation Language):

DML is used to modify the data (records) in tables.

Queries:

INSERT INTO: Inserts data into a table.

DELETE FROM: Deletes records from a table.

UPDATE ... SET: Updates records in a table.
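The three DML statements can be exercised the same way; a sketch against SQLite, with made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# INSERT INTO: add new records to the table
cur.executemany("INSERT INTO staff (name, salary) VALUES (?, ?)",
                [("Ali", 3000.0), ("Sara", 4000.0)])

# UPDATE ... SET: change existing records
cur.execute("UPDATE staff SET salary = salary + 500 WHERE name = 'Ali'")

# DELETE FROM: remove records
cur.execute("DELETE FROM staff WHERE name = 'Sara'")

rows = cur.execute("SELECT name, salary FROM staff").fetchall()
```

Only Ali's (updated) record remains at the end.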

Procedural DML:

Allows users to tell the system exactly how to manipulate the data.

Non-Procedural DML:

Allows users to state what data is needed rather than how it is to be retrieved.

DCL (Data Control Language):

DCL is used to control transactions and concurrency in the database (these transaction statements are also classed as TCL, Transaction Control Language).

Queries:

Transaction Based:
COMMIT: Makes the transaction's changes permanent.

ROLLBACK (abort): Undoes the transaction's changes.
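COMMIT and ROLLBACK are easy to see in action; a small sketch using sqlite3, whose `commit()`/`rollback()` methods correspond to the SQL statements (the accounts table is an invented example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
cur.execute("INSERT INTO accounts VALUES ('A', 100.0)")
conn.commit()  # COMMIT: the inserted row is now permanent

# Start a change, then abort it: ROLLBACK undoes everything
# done since the last COMMIT.
cur.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'A'")
conn.rollback()

balance = cur.execute(
    "SELECT balance FROM accounts WHERE name = 'A'").fetchone()[0]
```

Because the update was rolled back, the balance is still the committed value.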

DQL (Data Query Language):

DQL is used to retrieve data from the database.

DQL operates on the conceptual schema.

DQL can be used to create a view based on a query.

Queries:

SELECT: Retrieves specific records/attributes from the table.

GROUP BY: Groups rows that share the same attribute value.

WHERE: Row-level condition (the selection operator).

HAVING: Condition applied to the groups formed by GROUP BY.

FROM: Names the source tables (conceptually their cross product).

ORDER BY: Sorts the result.
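The clauses above combine into a single query; a sketch with invented sample data, run through sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary REAL)")
cur.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                [("Ali", "IT", 3000), ("Sara", "IT", 4000),
                 ("Omar", "HR", 2500), ("Zara", "HR", 2000)])

# FROM names the source table, WHERE filters rows, GROUP BY forms
# groups, HAVING filters those groups, ORDER BY sorts the result.
rows = cur.execute("""
    SELECT dept, AVG(salary)
    FROM staff
    WHERE salary > 2000
    GROUP BY dept
    HAVING COUNT(*) >= 1
    ORDER BY dept
""").fetchall()
```

Zara's row is removed by WHERE before grouping, so HR's average is computed over Omar alone.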

Nested Queries:
Query inside query is known as nested query.

There are two types of nested queries.

1. Independent nested query: The inner query is independent of the outer query.


2. Correlated nested query: The inner query uses attributes supplied by the outer query.

For an independent nested query, the inner query is executed first and the outer query operates on its result.

A correlated inner query is re-evaluated for each row processed by the outer query.
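Both kinds of nested query can be sketched against the same toy table (names and salaries invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary REAL)")
cur.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                [("Ali", "IT", 3000), ("Sara", "IT", 4000),
                 ("Omar", "HR", 2500)])

# Independent nested query: the inner query runs once on its own,
# then the outer query uses its single result.
top = cur.execute("""
    SELECT name FROM staff
    WHERE salary = (SELECT MAX(salary) FROM staff)
""").fetchall()

# Correlated nested query: the inner query refers to the outer row
# (s.dept), so it is conceptually re-evaluated per outer row.
above_dept_avg = cur.execute("""
    SELECT name FROM staff AS s
    WHERE salary > (SELECT AVG(salary) FROM staff WHERE dept = s.dept)
""").fetchall()
```

Here both queries happen to return Sara: she has the maximum salary overall, and she is the only employee above her department's average.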

Basic terminologies regarding Relational Models (Tables):


Attribute: Attribute is a named column of a relation.

Domain: Domain is the set of allowable values for one or more attributes.

Tuple or Record: A row in a table.

Degree: Number of attributes (columns) in the table.

Cardinality: Number of records (tuples) in the table.

Relational Database: Relational database is a collection of normalized relations with different relation names.

Relational Instance: The set of records existing in a table at a given moment.

Compound Candidate Key: A candidate key composed of multiple attributes.

Prime Attribute: An attribute that belongs to some candidate key.

Primary Key: The candidate key chosen to uniquely identify records; a primary key cannot be null.
Foreign Key: A reference to the primary key of another table (used to connect two tables).

Redundancy: Duplicated records in the database (a waste of storage).

Normalization:

Normalization is the elimination of redundancy (typically by decomposing one relation into two or more well-formed relations).

By default, a relation in an RDBMS is at least in 1NF.

Database relations:
Relation Schema: A named relation defined by a set of attribute-and-domain pairs.

Relational Database Schema: A set of relation schemas, each with a distinct name.

Relation (Table):
Table: Collection of rows and columns (Logical Structure of Database).

Properties of Relations:
Each relation name is distinct from all other relation names in the relational schema.

Each cell of a relation contains exactly one value.

Each attribute has a distinct name.

The values of an attribute are all drawn from the same domain.

Every record (tuple) in a table is unique.

The order of attributes (columns) and tuples (rows) has no significance.

Relational Keys:
Super Key: An attribute (column), or set of attributes, that uniquely identifies a
record in a relation.

Candidate Key:
Superkey (K) such that no proper subset is a superkey within the relation.

In each tuple of R, values of K uniquely identify that tuple (uniqueness).

Primary Key: The candidate key selected to uniquely identify
records in a relation.

Alternate Key: The candidate keys not selected as the primary key.
Foreign Key: An attribute, or set of attributes, within one relation that matches the candidate key of some
(possibly the same) relation.

Integrity Constraints:
Null:

Null represents a value that is unknown or not applicable for a record.
Null is not zero or a space; it marks incomplete or exceptional data.

Entity Integrity:

Primary key cannot be null in a table.

Referential Integrity:

A foreign key value must either match a candidate-key value in the referenced table or be wholly null.

General Constraint:

Any rules defined by users or the DBA based on organizational policies or requirements.
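All three kinds of constraint can be declared in a table definition; a sketch in SQLite (which, as an assumption worth noting, only enforces foreign keys when the `foreign_keys` pragma is enabled; the dept/staff tables are invented examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checks

cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""
    CREATE TABLE staff (
        staff_id INTEGER PRIMARY KEY,               -- entity integrity
        dept_id  INTEGER REFERENCES dept(dept_id),  -- referential integrity
        salary   REAL CHECK (salary >= 0)           -- general constraint
    )
""")
cur.execute("INSERT INTO dept VALUES (1, 'IT')")
cur.execute("INSERT INTO staff VALUES (1, 1, 3000.0)")  # valid row

# Violating referential integrity (no dept 99) raises an error.
try:
    cur.execute("INSERT INTO staff VALUES (2, 99, 1000.0)")
    fk_violation = False
except sqlite3.IntegrityError:
    fk_violation = True
```

The second insert is rejected because 99 matches no dept row and is not null.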

Best Practices of relational databases:


Record: All records in the table should be unique.

Candidate Key: A candidate key should be a combination of the minimum number of attributes (columns).

DBMS Architecture:
There are three levels of abstraction between the users and the physical database.

1. Physical Schema:

Physical Schema is also known as storage metadata or physical metadata.

The physical schema holds the storage details of the database, i.e. how data is physically stored.

Physical schema includes File Structure, Record Structure, Field Structure, Location, Name, Type of File
etc.

2. Conceptual Schema:

Conceptual schema hides the physical details (File Structure, Record Structure, Field Structure, Location,
Name, Type of File etc.).

The conceptual schema has no knowledge of how the data is physically stored in the database.

The conceptual schema provides virtual views (views or tables) representing the stored data according to user
requests.

A view refers to one or more tables in the database, depending on the request.

ANSI-SPARC Three Level Architecture:


1. External Level: The external level presents the users' views of the database.
2. Conceptual Level: The conceptual level presents the community view of the database: what data is stored
and the relationships among that data.
3. Internal Level: The internal level represents the physical representation (how data is stored) of the database
on the computer.

Objectives of Three Level Architecture:

The DBA (Database Administrator) should be able to change the internal (storage) structure of the database
without affecting users' views.
The DBA should be able to change the conceptual structure without affecting users' views.

The internal structure of the database should be unaffected by changes to physical storage (e.g. a changeover to a new storage device).

Views:
Views are subsets of data from one or more tables, produced in response to a query.

A base relation is a named relation in the conceptual schema whose tuples are physically stored in the database; a view is derived from base relations and its tuples are not normally stored.

Views are relational results generated dynamically from a user query.

Objectives of View:

Views enhance security by abstracting the physical database from the user.

Views provide simple, customized relations based on the user's request.

Updating Views:

Changes made through a view are reflected in the base relation if the view references a single relation.

Updates are generally not permitted through views that reference multiple relations.

Updates are not permitted through views involving aggregation or grouping operations.
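A view's dynamic nature is easy to demonstrate; a sketch in SQLite (whose views are read-only without extra triggers, so this shows derivation from the base relation rather than updating through the view; the staff table is an invented example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary REAL)")
cur.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                [("Ali", "IT", 3000), ("Omar", "HR", 2500)])

# A view is a stored query: its rows are derived on demand, not stored.
cur.execute("""
    CREATE VIEW it_staff AS
    SELECT name, salary FROM staff WHERE dept = 'IT'
""")
before = cur.execute("SELECT name FROM it_staff").fetchall()

# Because the view is generated dynamically, a change to the base
# relation is visible through the view immediately.
cur.execute("INSERT INTO staff VALUES ('Sara', 'IT', 4000)")
after = cur.execute("SELECT name FROM it_staff ORDER BY name").fetchall()
```

The new IT row appears in the view without the view being touched.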

Benefits of views:

Views reduce the complexity of understanding the data because they are customized to the request.

Views provide consistency while the physical database changes dynamically.

Views provide abstraction between the physical and conceptual layers of the database.

Data Independence:
Logical Data Independence: The external schema and application programs need not change when the
conceptual schema changes (addition or removal of entities).

Physical Data Independence: Changes to the internal schema (a different file organization, storage
structure or device) do not affect the conceptual schema.

Data Model:
An integrated collection of concepts for describing data, the relationships between data, and the constraints on
the data in an organization.

Data model is used to represent data in an understandable way.

Data Model comprises a structural and manipulative part.

Data models can be,

Object-based data models: Entity-Relationship, semantic, functional and object-oriented
models.

Record-based data models: the relational, network and hierarchical
data models.
Conceptual Modeling:

The conceptual schema is the core of the system and supports all users, so it should be a complete and accurate
representation of the organization's requirements.

Conceptual modeling is the process of developing a conceptual model of information use, independent of implementation details.

System Catalogue:
System catalogue refers to metadata (Repository of Information about data), which describe data in the
database.

System catalogue stores,

Names, types and size of the data items.

Constraints on the data.

Names of the authorized users.

Data items accessible by users and types of access.

Usage Statistics.

R* Approach:
Each object name is mapped to a globally unique internal identifier, the system-wide name.

Commonly used system-wide names:

Creator ID: The ID of the user who created the object.

Creator Site Id: The ID of the site at which the CREATE operation is entered.

Local Name: The unqualified name of the object.

Birth Site Id: The ID of the site at which the object is initially stored.

E.g. MARILYN @ NEWYORK.STATS @ LONDON

R* SQL:
CREATE SYNONYM MSTATS FOR MARILYN @ NEWYORK.STATS @ LONDON;

 This will create a synonym for “MARILYN @ NEWYORK.STATS @ LONDON”.


 Now we can call it by “MSTATS”.

E.g. MARILYN @ NEWYORK.STATS @ LONDON

MARILYN = creator ID (the user who created the object).

NEWYORK = creator site ID (the site at which the CREATE was entered).

STATS = local (unqualified) name of the object.

LONDON = birth site ID (the site at which the object was initially stored).

Flow of "Select … from MSTATS"

 The system looks up the system-wide name behind the synonym MSTATS.
 The system then consults the catalogue at the birth site (LONDON).
 If the object has migrated (say to LA), the LONDON entry points to the migrated object and the system
looks at LA.
 If the object is then migrated again, from LA to New York:
1. The system inserts a new catalogue entry at New York.
2. It deletes the LA catalogue entry.
3. It updates the LONDON (birth-site) catalogue entry to point to New York.

Update Propagation:
The update propagation problem occurs during data replication when some sites are unable to apply an update
to their replica because of a network or other fault.

To address the update propagation problem, systems use the primary copy scheme.

Primary copy scheme:

Under the primary copy scheme, every replicated object is designated one primary copy; all other
copies are secondary.

An update operation is logically complete once the primary copy is updated; the site holding the primary copy
is then responsible for propagating the update to the secondary copies.

Recovery Control:
Recovery control in distributed DBs is based on two-phase commit protocol.

A single transaction can interact with several autonomous resource managers: local DBMSs operating at
different autonomous sites.

Problems with two-phase commit protocol:

 Communication among all the sites generates many messages, with corresponding overhead.
 Loss of local autonomy, because a participant must do what the coordinator tells it.

Concurrency Control:
Concurrency control in distributed system is based on locking.

If each site is responsible for locks on objects stored at that site (as it will be under the local autonomy assumption),
then a straightforward implementation of an update requires at least 5n messages:

1. n lock requests
2. n lock grants
3. n update messages
4. n acknowledgments
5. n unlock requests

Here n is number of sites.

E.g. suppose there are two sites, B and C. (5(2) = 10 messages)

This problem can be mitigated by using the primary copy update strategy.

Messages are reduced from 5n to 2n + 3 (one lock request, one lock grant, n updates, n
acknowledgements and one unlock request).
If the primary copy is unavailable, the update cannot be applied.
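The message counts above can be checked with a little arithmetic; a throwaway sketch (the function names are mine, purely illustrative):

```python
def naive_messages(n: int) -> int:
    # n lock requests + n lock grants + n updates
    # + n acknowledgements + n unlock requests
    return 5 * n

def primary_copy_messages(n: int) -> int:
    # 1 lock request + 1 lock grant + n updates
    # + n acknowledgements + 1 unlock request
    return 2 * n + 3

# With two sites (the B and C example above): 10 vs 7 messages.
two_site = (naive_messages(2), primary_copy_messages(2))
```

For the two-site example the naive scheme needs 10 messages and the primary copy strategy 7; the saving grows with the number of sites.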

Distributed Databases:
A database is said to be distributed if it consists of multiple sites connected via a communication network.

All the databases in a distributed system agree on how data at other connected sites is accessed.

Advantages of distributed databases:

Distributed databases connect data from multiple departments or locations and provide easy accessibility.

They hide data location from the user.

Disadvantages of distributed databases:

Distributed databases are complex and hard to maintain.

Principles for distributed databases:

All distributed-system problems should be handled at the internal (implementation) level, not the external (user)
level.

To satisfy this fundamental principle, there are 12 rules for distributed databases.

1. Local Autonomy:
Each local site should be independent of all the other sites connected to
the network.
Each local site is accountable for its own data.
Each local site is responsible for the security, integrity and storage of its data.
2. No reliance on a central site:
All sites on the network should have equal rights; there must be no master site.
3. Continuous Operation:
A distributed DB system should provide reliability and high availability.
Reliability: the system should continue to perform its operations even if a
crash or fault occurs at some site.
Availability: the system should remain available and not be affected by a fault
at another connected site.
4. Location Independence (Transparency):
Users should be unaware of data location, i.e. whether data comes from a local or a remote
database.
5. Fragmentation Independence:
Data should be stored as a fragment near the location where it is used most, so that
most operations are local and network traffic is reduced.
6. Replication Independence:
Data should be replicated, with updates propagated, to the sites that need it, to enhance
availability.
7. Distributed Query Processing:
Query processing should be optimized no matter which site issues the query.
8. Distributed Transaction Management:
A transaction may execute concurrently through agents at many sites on the network.
Recovery control (rollback) should be available if any crash or fault occurs.
9. Hardware Independence:
The system should work on heterogeneous machines, independent of hardware type.
E.g. Dell, IBM, HP (machines)
10. Operating System Independence:
The system should work on heterogeneous operating systems, independent of
operating system type.
E.g. Linux, UNIX, Windows (OSs)
11. Network Independence:
The system should work on heterogeneous networks, independent of network type.
E.g. wired, wireless (networks)
12. DBMS Independence:

The system should work on heterogeneous DBMSs, independent of DBMS type.

E.g. Oracle, DB2, MySQL (DBMSs)

Problems of Distributed Systems:

Objective: To minimize network utilization.

 Query Processing:
Queries should be processed in a distributed fashion, with distributed query optimization.
Steps for optimizing query processing:
 Estimate the cardinalities of certain intermediate results.
 Move data to the nearest suitable site and process it there.
 Join data from the relevant sites and apply the matching condition.
 Move the matching records to a site from which the delay is very short.
 Complete the query at the site with the shortest delay.
 Catalogue Management:
The distributed catalogue should hold the information needed to preserve location, fragmentation and
replication independence, together with local site information.
The catalogue may be stored in the following ways:
1. Centralized:
The total catalogue is stored at a central site.
Problem: this violates the "no reliance on a central site" principle.
2. Fully Replicated:
The total catalogue is stored at every site.
Problem: this causes the update propagation problem (loss of autonomy).
3. Partitioned:
Every site stores its own catalogue; the total catalogue is the union of all the disjoint local
catalogues.
Problem: remote operations become expensive.
4. Combination of 1 and 3:
Each site stores its own catalogue plus a copy of the unified catalogue of all sites.
Problem: this violates the "no reliance on a central site" principle.
 Object Naming:
Object naming is a common issue: the same name may be used at multiple sites, which causes
ambiguity.
We can use qualified names that include the site name (e.g. recordName.SiteName),
but this violates the location transparency principle.
Naming ambiguity can be prevented with the R* approach.
Security:
Security refers to the protection of data against unauthorized discloser, alteration or destruction; integrity refers
to the accuracy or validity of that data.

Security means making sure users are allowed to do the things they are trying to do;

integrity involves making sure the things they are trying to do are correct.

Security and Integrity can be maintained using constraints stored in system catalog.

Discretionary Control:

A user will typically have different access rights (also known as privileges) on different objects; further, there
are very few inherent limitations on who may access what. (E.g. user U1 might be able to see A but not B, while
user U2 might be able to see B but not A.)

Mandatory Control:

Each data object is labeled with a certain classification level, and each user is given a certain clearance level.
A given data object can be accessed only by users with the appropriate clearance. (E.g. if user U1 can see A
but not B, then the classification of B must be higher than that of A, and so there can be no user U2 who can
see B but not A.)

Statistical Databases:

A statistical database (in the present context) is a database that permits queries that derive aggregated
information (e.g. sums, averages) but not queries that derive individual information.

It can be possible to use inferences from legal queries to deduce the answers to illegal ones.
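The inference risk is concrete: when a group contains a single individual, an aggregate over that group reveals an individual value. A sketch with invented data, where HR has exactly one employee:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary REAL)")
cur.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                [("Ali", "IT", 3000), ("Sara", "IT", 4000),
                 ("Omar", "HR", 2500)])

# Both queries are "legal" aggregate queries, yet taken together they
# disclose an individual value: since HR has exactly one employee,
# the departmental average IS that employee's salary.
count_hr = cur.execute(
    "SELECT COUNT(*) FROM staff WHERE dept = 'HR'").fetchone()[0]
avg_hr = cur.execute(
    "SELECT AVG(salary) FROM staff WHERE dept = 'HR'").fetchone()[0]
```

An attacker who learns `count_hr` is 1 knows `avg_hr` is Omar's exact salary, even though no query ever selected an individual record.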

Recovery Management Components:


Rollback:

Rollback (abort) terminates the transaction and discards its effects.

Rollback is the process of undoing the modifications that were made up to the point of failure.

Transaction Log:

The log records the activities of each transaction.
The transaction log is required to perform rollback operations.

Durability:

A committed transaction's effects must survive any subsequent failure.

Isolation:

Two or more transactions executing concurrently must not interfere with one another.

Schedule:

A time-ordered sequence of the operations of two or more transactions.

A serial schedule executes transactions in sequence (one after another).
A concurrent schedule executes multiple transactions in parallel (interleaved).
