Advance DB Notes V2
Advance DB Notes V2
Advance DB Notes V2
Abdul Qayoom
NET2APPS CodeBot
File Based Approaches:
Collection of application programs that perform services for the end users (e.g. reports).
Limitations:
Separation and isolation of data: Every program maintain own data set and unaware of data at side of
other program.
Duplication of Data: Every program can have same data set, which waste a lot of memory space.
Incompatible File Format: Programs are written in different languages, and so cannot easily access each
other’s files.
Fixed Queries/Proliferation of application programs: Every program is written for specific function so
we need another function of other requirements.
Database:
Database is collection of logically related data entities, attributes and relationships of an organization’s
information.
DBMS is interface between user and database (User -> DBMS -> DB).
RDBMS (Relational Database Management System) used to store data in the form of tables (collection of rows
and columns).
Functions of DBMS:
Components of DBMS:
Procedures: Instructions and rules to design and manage database and DBMS.
People: User
Advantages of DBMS:
Data consistency and redundancy control.
Disadvantage of DBMS:
Cost: is very high for DBMS (Hardware + Software) and data conversions.
DDL is used to modify the structure (Physical Schema) of the database table.
DDL allows to us to describe name entities, attributes, and relationships required for the application.
Queries:
Queries:
Procedural DML:
Non-Procedural DML:
Allows users to state what data is needed rather than how it is to be retrieved.
Queries:
Transaction Based:
Commit:
Rollback (Abort):
Queries:
Having: (Conditions)
Order by:
Nested Queries:
Query inside query is known as nested query.
Domain: Domain is the set of allowable values for one or more attributes.
Relational Database: Relational database is a collection of normalized relations with different relation names.
Relational Instance: If data (records) exist in the table then that set of records is called relational instance.
Primary Key: One of the candidate key which used to uniquely identify a record, Primary key cannot be null.
Foreign Key: Referred primary key of other table (Used to connect two tables).
Normalization:
Elimination of Redundancy (Normalization is possible if two or more independent tables are combined).
Database relations:
Relation Schema: Relation schema is a named relation defined by a set of attributes and domain names pairs.
Relational Database Schema: Relational database schema is set of relation schemas, each with different names.
Relation (Table):
Table: Collection of rows and columns (Logical Structure of Database).
Properties of Relations:
Relation name is different from all other relations in relational schema.
Relational Keys:
Super Key: Super key is an attribute (column) or set attributes, which uniquely identifies a
record in a relation.
Candidate Key:
Superkey (K) such that no proper subset is a superkey within the relation.
Primary Key: Primary key is an attribute (Candidate Key) which is selected to uniquely identify
a record in a relation.
Alternate Key: Candidate keys that are not selected for primary key.
Foreign Key: Attribute, or set of attributes, within one relation that matches candidate key of some
(possibly same) relation.
Integrity Constraints:
Null:
Null represents any value which is unknown or not applicable for that record.
Null is not zero or space but an incomplete value or exceptional data.
Entity Integrity:
Referential Integrity:
General Constraint:
DBMS Architecture:
There are three levels of abstraction between user and hardware database.
1. Physical Schema:
Physical schema are storage details of database, that how data is physically stored in the database.
Physical schema includes File Structure, Record Structure, Field Structure, Location, Name, Type of File
etc.
2. Conceptual Schema:
Conceptual schema hides the physical details (File Structure, Record Structure, Field Structure, Location,
Name, Type of File etc.).
Conceptual schema provide virtual views (View or Table) for representing stored data based on user
requests.
DBA (Database Administrator) should be able to modify database structure without affecting user’s
view.
DBA should be able to change conceptual structure without affecting user’s view.
Views:
Views are sub-sets of data from one or multiple tables based on request by query.
Views are named entities of base relation in conceptual schema, whose tuples are physically stored in DB.
Objectives of View:
View enhanced the security by abstracting the physical DB from the user.
Updating Views:
Changes in a view will reflect to base relation, if view is referencing a single relation.
Changes will not impact to base relations during aggregation or grouping operations.
Benefits of views:
Reduce complexity to understand data because they are customizable based on request.
Data Independence:
Logical Data Independence: We don’t need to change external schema or rewrite the application program for
changes in conceptual schema (Additional or Removal of entities).
Physical Data Independence: Internal schema change (Using different file organization, Storage
Structure/Devices) don’t impact conceptual schema.
Data Model:
Integrated collection of concepts for describing data, relationships between data and constraints on the data in
an organization.
Object Based data models used to represents entity relationships, Semantic, Functional and Object
Oriented.
Record based data models used to represent relational data model, network data model and
hierarchical data model.
Conceptual Modeling:
Conceptual Schema is core of system, supports to all users so it should be complete and accurate
representation of organizational requirements.
System Catalogue:
System catalogue refers to metadata (Repository of Information about data), which describe data in the
database.
Usage Statistics.
R* Approach:
Record name will be mapped to globally unique internal identifier system-wide name.
Creator Site Id: The ID of the site at which the CREATE operation is entered.
Birth Site Id: The ID of the site at which the object is initially stored.
R* SQL:
CREATE SYNONYM MSTATS FOR MARILYN @ NEWYORK.STATS @ LONDON;
LONDON = Site
System will look for system-wide name for this MSTATS table.
Now system will look for Birth Site ID in LONDON catalogue.
If system has migrated to (LA) then system will get information regarding migrated object (LA) and
start looking for that.
If object is again migrated from LA to New York, then:
1. System will enter new catalogue entry.
2. Delete LA catalogue entry.
3. Update the LONDON catalogue entry.
Update Propagation:
Update propagation issue occurs during data replication when some of the site are unable to replicate object
due network or any other fault.
During primary copy scheme every replicated object will be designated with a primary copy and other all
copies will be secondary.
Update operation is be logically complete after update of primary copy and local sites will be responsible
for updates of secondary copies.
Recovery Control:
Recovery control in distributed DBs is based on two-phase commit protocol.
Single transaction will interact with several autonomous resource managers and local DBMSs will operate on
different autonomous sites.
Communication between all sites create a lot of messages and more overhead.
Loss of local autonomy because participant needs to do what told by other opposite site.
Concurrency Control:
Concurrency control in distributed system is based on locking.
If each site is responsible for locks on objects stored at that site (as it will be under local autonomy assumption),
then straightforward implementation will require at least 5n messages.
1. n lock requests
2. n lock grants
3. n update messages
4. n acknowledgments
5. n unlock requests
Solution to this problem can be, if we use primary copy update strategy.
Messages will be reduced from 5n to 2n+3 (one lock request, one lock grant, n updates, n
acknowledgements and one unlock request).
If primary copy unavailable then update cannot be applied.
Distributed Databases:
Any database can said to be distributed if it consist on multiple site connected via communication network.
All distributed databases have agreement for accessing data from other connected databases.
Distributed databases connect multiple (departments or locations) data and provide easy accessibility.
All the distributed system problems should be internal or implementation level not external or user
level.
1. Local Autonomy:
Local autonomy means local side should independent from all other sites connected to
that network.
Local site should be accountable for own data.
Local site should be responsible for security, integrity and storage of data.
2. No reliance on central site:
All sites on network should have equal rights and there should not be any master site.
3. Continues Operation:
Distributed DB system should provide reliability and high availability.
Reliability, refers to that system should perform their operations continuously if any
crash or fault occurs at any site.
Availability, refers to that system should be available and should not impacted by fault
at other connected site.
4. Location Independency or Transparency:
User should be unaware about data location, from where data is coming local or remote
database.
5. Fragmentation Independence:
Data should be stored at near location (fragment) where data has most usability so
mostly operations should be local and network traffic can reduce.
6. Replication Independence:
Data should be replicated with updated data to most of sites so we can enhance
availability.
7. Distributed Query Processing:
Query processing should be optimized from any site as central site.
8. Distribution Transaction Management:
Transaction should be concurrent by agents on most of the sites on the network.
Recovery control (Rollback) should be available for any crash or fault occurrence.
9. Hardware Independency:
System should work on heterogeneous machines and independent from hardware type.
E.g. Dell, IBM, HP (Machines)
10. Operating System Independency:
System should work on heterogeneous operating systems and independent from
operating system type.
E.g. Linux, UNIX, Windows (OSs)
11. Network Independence:
System should work on heterogeneous networks and independent from network type.
E.g. Wired, Wireless (Networks)
12. DBMS Independence:
System should work on heterogeneous DBMSs and independent from DBMS type.
Query Processing:
Our queries should be distributed and query optimization process.
Steps for query process optimization:
Estimate cardinalities of certain intermediate results.
Move data to nearest site and process at nearest site.
Join multiple sites and match specific condition.
Move matched condition records to a site from where delay time is very short.
Complete the site where delays time is very short.
Catalogue Management:
Distributed catalogue should have information to control desired location, fragmentation and
replication independence with local site information.
Catalogue storing possibilities can be:
1. Centralized:
Total catalogue will be stored at central site.
Problem: This will violates “No reliance on central site” principle.
2. Fully Replicated:
Total catalogue will be stored on every site.
Problem: This will cause update propagation problem (Loss of autonomy).
3. Partitioned:
Every site will store own catalogue and total catalogue will be union of all disjoint local
catalogues.
Problem: Remote operation will be expensive.
4. Combination of 1 & 3:
Each site will store own catalogue and a copy of unified catalogue of all sites.
Problem: This will violates “No reliance on central site” principle.
Object Name:
Object naming is a common and general issue due to same record at multiple site, which cause
ambiguity.
We can use qualified names by specifying names with specific site name (E.g. recordName.SiteName)
but this will violates “Location Transparency” principle.
We can prevent naming ambiguity by R* Approach.
Security:
Security refers to the protection of data against unauthorized discloser, alteration or destruction; integrity refers
to the accuracy or validity of that data.
Security means making sure users are allowed to do the things they are trying to do;
Integrity involves making sure the things they (people) are trying to do are correct.
Security and Integrity can be maintained using constraints stored in system catalog.
Discretionary Control:
A user will typically have different access rights (also known as privileges) on different objects; further, there
are very few limitation. (E.g U1 might be able to see A but not B, while user U2 might be able to see B but
not A)
Mandatory Control:
Each data object is labeled with a certain classification level and each user is given a certain clearance level.
A given data object can be accessed only by users with the appropriate clearance. (E.g if user U1 can see A
but not B, then the classification of B must be higher than that of A, and so no user U2 can see B but not A)
Statistical Databases:
A statistical database (in the present context) is a database that permits queries that derive aggregated
information (e.g. sums, averages) but not queries that derive individual information.
It is possible to make inferences from legal queries to deduce the answers to illegal ones.
Transaction Log:
Activities of transaction.
Transaction log is required to perform rollback operations.
Durability:
Isolation:
Schedule: