Experiment No.: 01
Query Optimization
Aim: Implementation of any Query optimizer (Java/Python)
Theory:
The term optimization is actually a misnomer because in some cases the chosen execution plan
is not the optimal (or absolute best) strategy—it is just a reasonably efficient strategy for
executing the query. Finding the optimal strategy is usually too time-consuming—except for the
simplest of queries. In addition, trying to find the optimal query execution strategy may require
detailed information on how the files are implemented and even on the contents of the
files—information that may not be fully available in the DBMS catalog. Hence, planning of a
good execution strategy may be a more accurate description than query optimization.
Processing a high-level query involves several steps: the query optimizer module has the task of producing a good execution plan, and the code generator generates the code to execute that plan. The runtime database processor then has the task of running (executing) the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error occurs, the runtime database processor generates an error message.
Overview
Caching database queries can reduce and even remove the performance degradation caused by
slow database access.
Introduction
Database queries may be a major source of performance problems in a Java application. To process the queries, a database server may have to do work that takes a significant amount of time. Executing such queries and getting their results can take seconds and even minutes. For many Java applications such delays are unacceptable.
Two steps are involved in addressing the performance problem caused by the heavy database
queries. These steps are:
1. Optimizing database queries.
2. Caching database queries.
Optimizing Database Queries
Optimizing database queries is the first step in addressing the performance problem.
It is possible to reduce the query execution time by applying the following simple optimization technique: create an index for each combination of fields involved in the "where" and "order by" clauses.
Example:
Consider an e-commerce management system, the part that contains invoicing:
create table PRODUCT (
ID integer not null,
NAME varchar(254) not null
);
create table INVOICE (
ID integer not null,
NUMBER integer not null,
PRODUCT_ID integer not null,
constraint INVOICE_FK1 foreign key (PRODUCT_ID) references PRODUCT(ID)
);
The following query finds all invoices with the given product and orders them by the invoice
number.
select distinct INVOICE.* from INVOICE, PRODUCT
where PRODUCT.NAME = ? and INVOICE.PRODUCT_ID = PRODUCT.ID
order by INVOICE.NUMBER
The following indexes may be created up front. The suffix PK stands for primary key, AK for unique alternate key, and IX for non-unique index:
create unique index PRODUCT_PK1 on PRODUCT(ID);
create unique index PRODUCT_AK1 on PRODUCT(NAME);
create index INVOICE_IX1 on INVOICE(PRODUCT_ID);
create index INVOICE_IX2 on INVOICE(NUMBER);
The query optimizer determines an efficient execution plan for each SQL statement based on the structure of the query, the available statistical information about the underlying objects, and all the relevant optimizer and execution features.
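The effect of such indexes can be verified by asking the DBMS for the execution plan it has chosen. A minimal sketch, assuming a MySQL/PostgreSQL-style EXPLAIN and an illustrative product name in place of the bind parameter (Oracle instead uses EXPLAIN PLAN FOR together with DBMS_XPLAN.DISPLAY):

-- Show the optimizer's chosen plan for the invoice query
EXPLAIN
SELECT DISTINCT INVOICE.*
FROM INVOICE, PRODUCT
WHERE PRODUCT.NAME = 'Pen' AND INVOICE.PRODUCT_ID = PRODUCT.ID
ORDER BY INVOICE.NUMBER;

With the indexes above in place, the reported plan should use PRODUCT_AK1 to locate the product row and INVOICE_IX1 to fetch the matching invoices, rather than scanning both tables.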
References
Text Books:
1. Elmasri and Navathe, “Fundamentals of Database Systems”, 6th Edition, PEARSON
Education.
2. Theraja Reema, “Data Warehousing”, Oxford University Press, 2009.
3. Data Warehousing, Data Mining, & OLAP by Alex Berson McGraw Hill.
References:
1. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom “Database System
Implementation”, Pearson Ltd. 1/ e
2. Thomas M. Connolly Carolyn Begg, Database Systems : A Practical Approach to Design,
Implementation and Management, 4/e Pearson Ltd.
3. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Definitive Guide To
Dimensional Modeling”, 3rd Edition. Wiley India.
Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/
Questionnaire:
1. An option in view maintenance, in which only the affected part of the view is modified is
called as ........................................
2. The process of replacement of a nested query by a query with a join is called....................
3. The technique chosen by the database system for evaluation of an operation depends on
the size of what?
__________________________________________________________________________
__________________________________________________________________________
4. The term that optimizes sub expressions shared by different expressions in a program is
called as ..................................
5. Is it true “Each relational-algebra expression represents a particular sequence of
operation”
__________________________________________________________________________
6. The files from which algorithms are selected for records are called...............................
7. Is it true “The procedure of choosing a suitable query out of all the queries is classified as
query optimization”
__________________________________________________________________________
__________________________________________________________________________
8. What is the representation of the query in the form of data structure ?
__________________________________________________________________________
__________________________________________________________________________
9. In sort merge strategy, the small sub files sorted are called...............................
10. Is it true “string units is a sequence of zero or more characters enclosed by single
quotes.”
__________________________________________________________________________
Experiment No.: 02
Path Expression Evaluation
Aim: Assignments for query evaluation path expressions.
Theory:
Instances of user defined persistable classes (entity classes, mapped super classes and
embeddable classes) are represented in JPQL by the following types of expressions:
Variables - FROM identification variables and SELECT result variables.
Parameters - when instances of these classes are assigned as arguments.
Path expressions that navigate from one object to another.
Instances of user defined persistable classes can participate in direct comparison using
the = and <> operators. But more often they are used in JPQL path expressions that navigate to
values of simple types (number, boolean, string, date).
Simple type values are more useful in queries. They have special operators and functions (e.g.
for strings and for numbers), they can be compared by all six comparison operators, and they can
be used in ordering.
For example, consider the following query, which navigates from Country objects to the names of their capital cities:
SELECT c.name, c.capital.name FROM Country c
The c identification variable is used for iteration over all the Country objects in the database.
For a country with no capital city, such as Nauru, c.capital is evaluated to NULL and
c.capital.name is an attempt to navigate from a NULL value. In Java, a NullPointerException is
thrown on any attempt to access a field or a method via a null reference. In JPQL, the current
FROM variable (or FROM tuple when there are multiple variables) is simply skipped. It might
be easier to understand exactly how this works by considering the equivalent JOIN query.
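For example, the implicit navigation above behaves like the following explicit inner-join query (JPQL; the identification variable p is illustrative), in which countries whose capital is NULL are simply not matched and therefore do not appear in the result:

SELECT c.name, p.name FROM Country c JOIN c.capital p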
Path Expressions
The Root and Join interfaces (which are subinterfaces of From) represent FROM variables.
FROM variable expressions are considered as basic paths and also serve as starting point for
building more complex paths through navigation.
Given a Path instance, a child Path expression (which represents navigation from the parent path through a persistent field or property) can be constructed by the get method:
// FROM Variable Paths:
Root<Country> country = query.from(Country.class);
Join<Country,Country> neighborCountry = country.join("neighbors");
// Navigation Paths:
Path<String> countryName = country.get("name");
Path<City> capital = country.get("capital");
Path<String> capitalName = capital.get("name");
The path expressions in the above code can be divided into two main groups:
FROM variable expressions, represented by subinterfaces of From (Root, Join) -
The creation of a FROM expression automatically modifies the query by adding a variable
to the FROM clause (representing iteration during query execution). The constructed
variable expression can also be used explicitly in other query clauses.
Navigation expressions, represented by the Path interface -
The creation of a navigation path expression doesn't affect the built query directly. The
constructed expression must be integrated into query clauses explicitly to have effect.
Type Expressions
Entity type expressions can be constructed by the Path's type method. For example, the following
criteria expression checks if the type of a specified entity e is not Country.
Predicate p = cb.notEqual(e.type(), cb.literal(Country.class));
In the above example, the comparison is between the type of the e object (which may represent
any path including a root or a join) and the entity type Country (a criteria literal).
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial Application:
1. Use of path expressions is a common feature in most XML query languages. The path
shortening principle reduces the querying cost by shortening the path expressions with
the knowledge of XML schema.
2. Path expressions using navigational algebraic operators.
Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/
Questionnaire:
1. Is it true “In case of the A4 search algorithm, if the equality condition is on a key, the strategy can retrieve a single record”?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
3. What will be the result of the following query?
WHERE end_date IS NULL
AND NOT (title = 'teller' OR start_date < '2007-01-01')
__________________________________________________________________________
__________________________________________________________________________
4. What will be the result of the following query?
WHERE end_date IS NULL
AND (title = 'teller' OR start_date < '2007-01-01')
__________________________________________________________________________
__________________________________________________________________________
5. What will be the result of the following query?
WHERE title = 'teller' OR start_date = '2007-01-01'
__________________________________________________________________________
__________________________________________________________________________
6. What will be the result of the following query?
WHERE title = 'teller' AND start_date < '2007-01-01'
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
10. Which of the following is an attribute that can uniquely identify a row in a table?
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
Experiment No.: 03
Concurrency Control & Recovery Methods
Aim: Simulation of Concurrency Control Algorithm, Recovery Algorithm (Java/Python)
Theory:
The arbitrariness of both the participants and the action sequences requires extensions of the joint
action control strategies. Transaction techniques extend delegation-based synchronization and
control to situations in which each participant may be unaware of the atomicity constraints
placed on its actions, and cannot rely on more efficient structural solutions. In transaction
frameworks, each participant (and each client) gives up its local autonomy in deciding how to
perform concurrency control. Participants must instead reach consensus deciding how and when
to perform actions and/or to commit to their effects.
Transaction frameworks are among the most famous examples of how providing components
that implement valuable general-purpose functionality sometimes has the price of introducing a
large number of programmer obligations. Classes supporting transaction protocols can be highly
usable and reusable. Transaction frameworks can be used to tackle most of the concurrency
problems discussed in this book. But they rely on designs in which each class, at each layer of
functionality, supports a standardized transaction protocol that propagates control down through
successive layers. The heaviness of transaction frameworks usually restricts their use to contexts
in which you really need to set up objects so as to guarantee atomicity of arbitrary code
sequences.
For example, you may be able to bypass transactional control if you know all the call-sequences
that will ever be encountered in a component or application. In this case, you can specifically
design support for each one (using whatever techniques happen to apply) without having to
address the general case. This is a perhaps extreme extension of the idea (see 2.2.2) of padding
reusable synchronized objects with atomic versions of frequently needed convenience methods.
This is sometimes a plausible alternative, in the spirit of doing the simplest and surest thing that
could possibly work. Similarly, you may be able to rely entirely on client-side locking (see 2.2.3)
in cases where clients somehow know to obtain all locks required for a given action and how to
avoid any potential deadlocks.
Because lightweight (in a relative sense) internal transaction frameworks are normally tied to other application-specific
constraints and features, you are unlikely to use the exact interfaces and classes described here
(although most are simplified variants of those in the net.jini package). And if you instead rely
on a standardized transaction framework such as JDBC or JTS, you will encounter additional
issues tied to persistence support and related services that fall outside the scope of this book.
However, the final example shows how the ideas behind transactions can help structure more ordinary
concurrent OO designs. Thus, the main goals of this section are to give a brief synopsis of how
transaction systems extend other approaches to concurrency control, and to present techniques
that may be scaled down as needed to apply to other concurrency problems.
As a running example, consider again writing a transfer operation for the BankAccount class. From a transactional point of view, a stand-alone transfer operation (without any provisions for automatic transfers) looks like:
pseudoclass AccountUser {                  // Pseudocode
  TransactionLogger log;                   // any kind of logging facility
  // ...
  // Attempt transfer; return true if successful
  public boolean transfer(long amount,
                          BankAccount source,
                          BankAccount destination) {
    TRANSACTIONALLY {
      if (source.balance() >= amount) {
        log.logTransfer(amount, source, destination);
        source.withdraw(amount);
        destination.deposit(amount);
        return true;
      }
      else
        return false;
    }
  }
}
The TRANSACTIONALLY pseudo-qualifier indicates that we'd like this code to be executed in an
all-or-none fashion without any possibility of interference from other activities. Once
implemented, this operation could be used in an automated transfer scheme of the sort described
in 3.5.1. Additionally, the transactional approach permits greater flexibility than seen in our
specific solution, although with significantly greater overhead. Once classes are fitted with
transactional apparatus, it becomes possible to associate transactionality with any sequence of
operations involving bank accounts.
Transaction Protocols
Transaction frameworks rely on extended forms of the before/after tactics characteristic of most
concurrency control strategies. Here, the before-action is typically called join (or
sometimes, begin) and the after-action is called commit. The main differences between
join/commit and operations such as lock acquire/release stem from the fact that join/commit
require consensus among the set of participating objects: All participants must agree to begin and
end transactions. This leads to two-phase protocols surrounding join and/or commit — first to
obtain consensus and then to act. If any participant disagrees about joining or committing, the
attempted transaction is aborted. The most straightforward version of the basic protocol is: ask each prospective participant to join, aborting if any refuses; tentatively perform the actions; then ask each participant whether it can commit, and if all agree tell every participant to commit, otherwise tell every participant to abort.
As in most concurrency control contexts, two complementary sets of policies can be applied to
this protocol. In the purest form of optimistic transactions, participants can always join, but
cannot always commit. In the purest form of conservative transactions, participants cannot
always join, but if they do, they can always commit. Optimistic approaches apply best when the
likelihood of contention is low enough to outweigh rollback costs. Conservative approaches
apply best when it is difficult or impossible to undo actions performed during transactions.
However, it is rare to find pure forms of each, and it is not hard to create frameworks that permit
mixtures.
The most classic forms of conservative transactions can be implemented only if the identities of
all participants can be known before any actions are taken. This is not always possible. In an OO
system, the participants are just those objects whose methods are invoked during some call
sequence that is wrapped as a transaction. Because of polymorphism, dynamic loading, etc., it is
generally impossible to identify them all beforehand; instead, their identities become known only
as the action unfolds. Still, in many cases, at least some participants are known beforehand and
they can be probed before starting to engage in any unrecoverable action.
Transaction Participants
In addition to supporting methods for joining, committing, aborting, and (when necessary) creating transactions, each class in a structured transaction framework must declare all of its public methods with a transaction control argument added to their normal argument lists. A method invocation supplying a given transaction argument serves as a request to perform the associated action on behalf of the given transaction, but without committing to its effects until requested to do so. Methods thus take the form of ordinary operations extended with a Transaction parameter (and, typically, a throws Failure clause).
Transaction is any type that supplies the necessary control information. This transaction
information must be propagated throughout all methods invoked in the course of a particular
transaction, including nested calls to helper objects. The simplest kind of transaction argument is
a transaction key that uniquely identifies each transaction. Each method in each participating
object is then responsible for using this key to manage and isolate actions in accord with the
given transaction policy. Alternatively, a transaction argument may refer to a special control or
coordinator object possessing methods that help participants perform their roles in transactions.
It is, however, possible to cheat here, and many transaction frameworks do. For example,
transaction identifiers can be hidden as thread-specific data (see 2.3.2). Before/after control can
be restricted to intercepted entry points into sessions performing services provided by
components. Participants can be determined via reflection or scanning bytecodes. And rollback
obligations can be semi-automated by serializing the entire state of a component and/or acquiring
locks on entry into a service session. These kinds of tactics can help hide the details of
transactional control from application writers. This comes at the expense of overhead and usage restrictions that are not generally worthwhile in lightweight transaction frameworks performing internal concurrency control.
Interfaces
Participant classes must implement interfaces defining the methods used in transaction control.
Here is a simple but representative interface:
// A representative participant interface (the interface name is assumed here):
interface Transactor {
  // Enter a new transaction and return true if it can be joined
  public boolean join(Transaction t);
  // Return true if this transaction can be committed
  public boolean canCommit(Transaction t);
  // Update state to reflect current transaction
  public void commit(Transaction t) throws Failure;
  // Roll back state (No exception; ignore if inapplicable)
  public void abort(Transaction t);
}
Among many other variants, it is possible to split the join phase similarly to the commit phase —
a preliminary canJoin followed by a required join. The canCommit method is most often
named prepare in transaction frameworks.
For simplicity of illustration, a single Failure exception type is associated with these
operations, as well as all others in this series of examples. Participant objects are allowed to raise
exceptions when they encounter actual or potential conflicts and when they are requested to
participate in transactions they do not know about. Of course in practice, you'd want to subclass
these exception types and use them to provide additional information to clients in cases of
failure.
A second interface or class is needed to describe Transaction objects themselves. In discussing
basic operations, we can use a no-op version:
class Transaction {
// add anything you want here
}
However, it is not always necessary to provide transactional signatures for pure accessor
methods such as balance here. Instead (or in addition to transactional versions), such methods
can return the most recently committed value when called from clients that are not participating
in transactions. Alternatively, a special null-transaction type (or just passing null for
the Transaction argument) can be used to denote one-shot invocations of transactional
methods.
Oracle supports two types of locks: row locks and table locks. These locks can be subdivided
into several categories:
Row Share Table Locks (RS) —These locks are issued when an SQL transaction has declared
its intent to update the table in row share mode. This type of lock will allow other queries to
update rows in the customer table. For example:
lock table customer in row share mode;
SELECT customer_name
FROM CUSTOMER
FOR UPDATE OF CUSTOMER;
Row Exclusive Table Locks (RX)—These locks are issued automatically against a table when
an UPDATE, DELETE, or INSERT statement is issued against the table.
Table Share Locks (S)—This type of lock is issued when the LOCK TABLE command is
issued against the table. This indicates that the transaction intends to perform updates against
some rows in the table, and prevents any other tasks from execution until the LOCK TABLE xxx
IN SHARE MODE has completed.
Share Row Exclusive Table Locks (SRX)—These locks are issued with the LOCK TABLE
xxx IN SHARE ROW EXCLUSIVE MODE command. This prevents any other tasks from
issuing any explicit LOCK TABLE commands until the task has completed, and also prevents
any row-level locking on the target table.
Exclusive Table Locks (X)—This is the most restrictive of the table locks and prevents everything except queries against the affected table. Exclusive locks are used when the programmer desires exclusive control over a set of rows until the operation has completed. The following command is used to lock the CUSTOMER table for the duration of the task:
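A typical form of this command (standard Oracle LOCK TABLE syntax) is:

lock table CUSTOMER in exclusive mode;

The lock is released automatically when the locking transaction commits or rolls back.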
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial Applications:
1. process control
Industrial control systems are usually attached to sensors and actuators. The sensors monitor the state of some real-world process, and the controllers manipulate the valves. Control decisions are made based on the input data and controller configuration parameters. Input data are generated from field devices such as sensors, transmitters and switches that are connected to the controller.
2. Internet Service Management.
Service providers are increasingly deploying service management systems (SMS). An SMS lets
an ISP create and provide IP services such as e-mail, VPN, LDAP directory services, etc. The
SMS streamlines allocation of resources to subscribers, resource management, and controlling all the
relevant network components at a single centralized point in the network. An SMS needs to use a
real-time database for performing authorization, authentication, and accounting for the Internet
users. The SMS must manage such data as session status as well as information about the
network, subscriber, and policies — in the face of the number of subscribers running into
millions. Real-time database management systems are used to maintain these data.
Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/
Questionnaire:
8. Is it true “An approach named Lock timeouts is used for Deadlock detection”
_____________________________________________________________________
9. The system must deal with the deadlocks that are not prevented by using schemes
of..............................................
10. _____________ensures that once transaction changes are done, they cannot be
undone or lost, even in the event of a system failure.
Experiment No.: 04
Fragmentation, Query Processing.
Aim: Design of a distributed database for a real life application - Fragmentation, Query
Processing.
Theory:
Data Fragmentation:
Data fragmentation refers to techniques that break up the database into logical units, called fragments, which may be assigned for storage at the various sites. We also discuss the use of data
replication, which permits certain data to be stored in more than one site, and the process of
allocating fragments—or replicas of fragments—for storage at the various sites. These
techniques are used during the process of distributed database design. The information
concerning data fragmentation, allocation, and replication is stored in a global directory that is
accessed by the DDBS applications as needed. In a DDB, decisions must be made regarding
which site should be used to store which portions of the database. For now, we will assume that
there is no replication; that is, each relation—or portion of a relation—is stored at one site only.
We discuss replication and its effects later in this section. We also use the terminology of
relational databases, but similar concepts apply to other data models. We assume that we are
starting with a relational database schema and must decide on how to distribute the relations over the various sites. To illustrate our discussion, we use the COMPANY relational database schema. Before we decide on how to distribute the data, we must determine the logical units of the database that are to be distributed. The simplest logical units are the relations themselves; that is, each whole relation is to be stored at a particular site.
In our example, we must decide on a site at which to store each of the relations EMPLOYEE, DEPARTMENT, PROJECT, WORKS_ON, and DEPENDENT. In many cases, however, a relation can be divided into smaller logical units for distribution. For example, consider the COMPANY database and assume there are three computer sites, one for each department in the company. We may want to store the database information relating to each department at the computer site for that department. A technique called horizontal fragmentation can be used to partition each relation by department, as sketched below.
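A minimal sketch of such horizontal fragmentation, assuming the EMPLOYEE relation carries a department number attribute Dno and that departments 4 and 5 each have their own site (all names are illustrative):

-- Horizontal fragments of EMPLOYEE, one per department site
CREATE TABLE EMP_D4 AS SELECT * FROM EMPLOYEE WHERE Dno = 4;
CREATE TABLE EMP_D5 AS SELECT * FROM EMPLOYEE WHERE Dno = 5;

-- The original relation can be reconstructed by taking the union of its fragments
SELECT * FROM EMP_D4
UNION ALL
SELECT * FROM EMP_D5;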
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial Application:
Questionnaire:
5. Is it true ” SQL is often used as an interface between the distributed data manager
and local data managers”
______________________________________________________________________________
______________________________________________________
8. What is the advantage of the three-tier architecture over the two-tier architecture.
______________________________________________________________________________
______________________________________________________
9. Which of the following statements is not true about two-tier client-server database
architecture?
______________________________________________________________________________
______________________________________________________
10. Define Localization.
__________________________________________________________________
__________________________________________________________________
Experiment No.: 05
Recovery Management
Aim: Simulation of Recovery methods.
Theory :
Recovery from transaction failures usually means that the database is restored to the most
recent consistent state just before the time of failure. To do this, the system must keep
information about the changes that were applied to data items by the various transactions. This
information is typically kept in the system log. A typical strategy for recovery may be
summarized informally as follows:
1. If there is extensive damage to a wide portion of the database due to catastrophic
failure, such as a disk crash, the recovery method restores a past copy of the database that was
backed up to archival storage (typically tape or other large capacity offline storage media) and
reconstructs a more current state by reapplying or redoing the operations of committed
transactions from the backed up log, up to the time of failure.
2. When the database on disk is not physically damaged, and a noncatastrophic failure of types 1 through 4 has occurred, the recovery strategy is to identify any changes that may cause
an inconsistency in the database. For example, a transaction that has updated some database
items on disk but has not been committed needs to have its changes reversed by undoing its write
operations.
It may also be necessary to redo some operations in order to restore a consistent state of the
database; for example, if a transaction has committed but some of its write operations have not
yet been written to disk. For non catastrophic failure, the recovery protocol does not need a
complete archival copy of the database. Rather, the entries kept in the online system log on disk
are analyzed to determine the appropriate actions for recovery. Conceptually, we can distinguish
two main techniques for recovery from non catastrophic transaction failures: deferred update and
immediate update. The deferred update techniques do not physically update the database on disk
until after a transaction reaches its commit point; then the updates are recorded in the database.
Before reaching commit, all transaction updates are recorded in the local transaction workspace
or in the main memory buffers that the DBMS maintains (the DBMS main memory cache).
The ARIES recovery algorithm, which is based on write-ahead logging, additionally uses checkpointing: the recovery tables described below (the Transaction Table and the Dirty Page Table) are maintained by the transaction manager and written to the log during checkpointing.
In ARIES, every log record has an associated log sequence number (LSN) that is monotonically
increasing and indicates the address of the log record on disk. Each LSN corresponds to a specific change
(action) of some transaction. Also, each data page will store the LSN of the latest log record
corresponding to a change for that page. A log record is written for any of the following actions: updating
a page (write), committing a transaction (commit), aborting a transaction (abort), undoing an update
(undo), and ending a transaction (end). The need for including the first three actions in the log has been
discussed, but the last two need some explanation. When an update is undone, a compensation log record
is written in the log. When a transaction ends, whether by committing or aborting, an end log record is
written. Common fields in all log records include the previous LSN for that transaction, the transaction
ID, and the type of log record.
The previous LSN is important because it links the log records (in reverse order) for each
transaction. For an update (write) action, additional fields in the log record include the page ID for the
page that contains the item, the length of the updated item, its offset from the beginning of the page, the
before image of the item, and its after image.
Besides the log, two tables are needed for efficient recovery: the Transaction Table and the Dirty
Page Table, which are maintained by the transaction manager. When a crash occurs, these tables are
rebuilt in the analysis phase of recovery. The Transaction Table contains an entry for each active
transaction, with information such as the transaction ID, transaction status, and the LSN of the most
recent log record for the transaction. The Dirty Page Table contains an entry for each dirty page in the
buffer, which includes the page ID and the LSN corresponding to the earliest update to that page.
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial Application:
1. Numeric and Textual Databases in Business Applications.
2. Ransomware recovery and endpoints and servers.
Questionnaire:
_________________________________________________________________________________
_________________________________________________________________________________
2. Checkpoints are a part of which system?
_________________________________________________________________________________
7. In which technique can the database be restored up to the last consistent state after a system failure?
8. The _______________consists of the various applications and database that play a role in a backup
and recovery strategy.
10. The actions which are replayed in the order in which they were recorded are called ______________
history.
Experiment No.: 06
The ETL and Data Warehousing tutorial is organized into lessons representing various business intelligence scenarios, each of which describes a typical data warehousing challenge.
This guide might be considered an ETL process and Data Warehousing knowledge base with a series of examples illustrating how to manage and implement the ETL process in a data warehouse environment.
The purpose of this tutorial is to outline and analyze the most widely encountered real-life data warehousing problems and challenges that need to be addressed during the design and architecture phases of a successful data warehouse project deployment.
Going through the sample implementations of the business scenarios is also a good way to compare Business
Intelligence and ETL tools and get to know the different approaches to designing the data integration process.
This also gives an idea and helps identify strong and weak points of various ETL and data warehousing
applications.
This tutorial shows how to use the following BI, ETL and data warehousing tools: Datastage, SAS, Pentaho, Cognos and Teradata.
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial application:
Questionnaire:
Experiment No.: 07
Star schema
Aim: Construction of star schema.
Theory :
This star schema in database case study at Birla SunLife Insurance involves the handling of
information related to policy and customer in the system. The core OLTP system, Ingenieum,
handles day to day transactions as a master system along with the receipt writer and channel
management system. The consolidated data sets come together into the database that
hosts different schemas including the star schema.
One needs to know the source of the OLTP systems to build a robust architecture that can
support schemas and run queries. For our star schema in database scenario we have DB2, SQL
Server and Oracle as OLTP databases and Teradata as an OLAP system.
At the end of the timeline that has been fixed, there will be no more transactions in the OLTP.
Data transfer from OLTP to landing schema is achieved using an ETL tool or a script that
processes batch files of data, which are targeted to the data warehouse where the multiple
schema structures will work with these files.
The star schema in database will depend on the deliverables that are expected in terms of
reporting, analysis and performance trending. The core structure of the star schema comprises
basic fact and dimension tables.
The fact table holds the measures (facts). These are numeric and additive across some or all of
the dimensions. For example, sales figures are numeric and users can look at total sales for a
product, or category, or subcategory, and by any time period.
While the dimension tables are short and fat, the fact tables are generally long and skinny. They are long because they can hold up to the number of records given by the product of the row counts in all the dimension tables.
For example, consider the simplified star schema depicted in Figure 1. In this schema, there are product, time and store dimensions. With 10 years of daily data, 200 stores and 500 products, there is a potential of 365,000,000 records (3,650 days * 200 stores * 500 products). Adding another dimension, such as 10,000 customers, would multiply the potential number of records by a factor of 10,000.
The primary key in the fact table is made up of foreign keys that have migrated from the
dimension tables. These fields are typically integer values. In addition, the measures are also
numeric. Therefore, the size of each record is generally much narrower than those in the
dimension tables.
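A minimal sketch of such a schema in SQL, with illustrative table and column names based on the product, store and time dimensions described above:

-- Dimension tables: short and wide, holding descriptive attributes
CREATE TABLE ProductDimension (
    ProductID   integer PRIMARY KEY,
    ProductName varchar(100),
    Category    varchar(50)
);
CREATE TABLE StoreDimension (
    StoreID   integer PRIMARY KEY,
    StoreName varchar(100),
    Region    varchar(50)
);
CREATE TABLE TimeDimension (
    TimeID   integer PRIMARY KEY,
    FullDate date,
    WeekNo   integer,
    MonthNo  integer,
    YearNo   integer
);

-- Fact table: long and narrow, keyed by the migrated dimension keys
CREATE TABLE SalesFact (
    ProductID   integer REFERENCES ProductDimension(ProductID),
    StoreID     integer REFERENCES StoreDimension(StoreID),
    TimeID      integer REFERENCES TimeDimension(TimeID),
    SalesAmount numeric(12,2),
    UnitsSold   integer,
    PRIMARY KEY (ProductID, StoreID, TimeID)
);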
Fact granularity
The granularity, or frequency, of the data is determined by the lowest level of granularity of each
dimension table. For example, a fact table may store weekly or monthly totals for individual
products. The lower the granularity, the more records that will exist in the fact table. The
granularity also determines how far users can drill down without returning to the base,
transaction-level data.
A major benefit of the star schema in the database is that the low-level transactions may be
summarized to the fact table grain. This speeds the queries performed as part of the decision
support process. However, the aggregation or summarization of the fact table is not always done
if cubes are being built.
The previous section discussed how 500 products sold in 200 stores and tracked for 10 years could produce 365,000,000 records in a fact table with a daily grain. This is the maximum size for the table. A star schema normally does not store a zero value unless it has some significance; so, if a product did not sell at a particular store on a particular day, the system would not store a row for that combination at all.
Even though the fact table is sparsely populated, it still holds the vast majority of the records in
the database and is responsible for almost all of the disk space used. The lower the granularity,
the larger the fact table. In the previous example, moving from a daily to weekly grain would
reduce the potential number of records to just about 52,000,000 records.
The data types for the fields in the fact table help keep it as small as possible. In most fact tables,
all of the fields are numeric, which can require less storage space than the long descriptions we
find in the dimension tables.
Finally, be aware that each added dimension can greatly increase the size of the fact table. If just
one dimension was added to the previous example that included 20 possible values, the potential
number of records would reach 7.3 billion.
Changing attributes
With a star schema in database, changing of database attributes is an issue. For example, in the
StoreDimension in Figure 1, if regions, territories or zones of stores are changed for any reason,
the numbers will not be accurate for historical sales. By simply updating the region for a store,
the total sales for that region will appear as if the current structure has always been true. Thus,
the business would have "lost" history.
In fact, the company might want to see what the sales would have been had this store been in that
other region in prior years. More often, businesses do not want to change the historical data. The
typical approach is to create a new record for the store, which contains the new region, but leaves
the old store record (and therefore the old regional sales data) intact. This prevents companies
from comparing this store’s current sales to its historical sales unless the previous StoreID is
preserved. The answer is to keep the existing StoreName (the primary key from the source
system) on both records and instead add BeginDate and EndDate fields to indicate when each
record is active. The StoreID is a surrogate key, so each record has a different StoreID but the
same StoreName, allowing data to be examined for the store across time regardless of
its reporting structure.
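A sketch of this approach, revising the illustrative StoreDimension above with the surrogate key and effective-date fields just described (store names and dates are made up):

-- StoreID is a surrogate key: each version of a store gets a new value.
-- StoreName is the natural key from the source system and stays the same.
CREATE TABLE StoreDimension (
    StoreID   integer PRIMARY KEY,
    StoreName varchar(100) NOT NULL,
    Region    varchar(50),
    BeginDate date NOT NULL,
    EndDate   date                -- NULL while this version is current
);

-- Moving a store to a new region closes the old version ...
UPDATE StoreDimension
   SET EndDate = DATE '2024-01-01'
 WHERE StoreName = 'Store 42' AND EndDate IS NULL;

-- ... and inserts a new one, leaving historical sales attached to the old StoreID
INSERT INTO StoreDimension (StoreID, StoreName, Region, BeginDate, EndDate)
VALUES (1042, 'Store 42', 'North', DATE '2024-01-01', NULL);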
Result:
_________________________________________________________________________________
Conclusion:
_________________________________________________________________________________
Industrial Application:
Questionnaire:
________________________________________________________________________
3. What is the meaning of Snowflaking?
_________________________________________________________________________________
_______________________________________________________________
4. Is it true that the sequence of jobs to load data into a warehouse is: first load data into the fact tables, then the dimension tables, then aggregates (if any)?
________________________________________________________________________
5. In which type of SCD(Slowly changing dimensions) do we preserve history of
data?
________________________________________________________________________
6. Is it true that a consolidated data mart is a first-level data mart?
________________________________________________________________________
7. During ETL, what do we generally load?
________________________________________________________________________
8. In the four-step dimensional design process, what does declaring the grain of the business process mean?
________________________________________________________________________
9. Is it true that drill-across generally uses a self join to generate a report?
________________________________________________________________________
10. Is it true that, in general, data in a data warehouse is normalized?
________________________________________________________________________
Experiment No.: 08
ETL Operation
Theory:
ETL (Extract-Transform-Load)
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a
process of how the data are loaded from the source system to the data warehouse. Currently, the
ETL encompasses a cleaning step as a separate step. The sequence is then Extract-Clean-
Transform-Load. Let us briefly describe each step of the ETL process.
Process
Extract
The Extract step covers the data extraction from the source system and makes it accessible for
further processing. The main objective of the extract step is to retrieve all the required data from
the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
Update notification - if the source system is able to provide a notification that a record has been
changed and describe the change, this is the easiest way to get the data.
Incremental extract - some systems may not be able to provide notification that an update has
occurred, but they are able to identify which records have been modified and provide an extract
of such records. During further ETL steps, the system needs to identify these changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
Full extract - some systems are not able to identify which data has been changed at all, so a full
extract is the only way one can get the data out of the system. The full extract requires keeping a
copy of the last extract in the same format in order to be able to identify changes. Full extract
handles deletions as well.
When using Incremental or Full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.
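A minimal sketch of an incremental extract, assuming the source system stamps every row with a LAST_MODIFIED column and that the time of the previous run is supplied as a bind parameter:

-- Pull only the rows changed since the previous extract run
SELECT *
FROM   SOURCE_INVOICE
WHERE  LAST_MODIFIED > :last_extract_time;

Deleted rows are not captured this way, which is one reason a periodic full extract (compared against the copy of the previous extract) may still be needed.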
Clean
The cleaning step is one of the most important as it ensures the quality of the data in the data
warehouse. Cleaning should perform basic data unification rules, such as making identifiers and code values consistent across sources, converting null values into a standardized form, and validating and standardizing formats such as phone numbers and postal codes.
Transform
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension (i.e. conformed dimension) using
the same units so that they can later be joined. The transformation step also requires joining data
from several sources, generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
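A sketch of a typical transformation step, with illustrative staging and warehouse names (echoing the star schema sketched in the previous experiment): measures are converted to a common unit and the natural keys arriving from the source are exchanged for the surrogate keys of the conformed dimension tables.

-- Move cleaned staging rows into the fact table, converting currency and
-- looking up surrogate keys in the dimension tables
INSERT INTO SalesFact (ProductID, StoreID, TimeID, SalesAmount, UnitsSold)
SELECT p.ProductID,
       s.StoreID,
       t.TimeID,
       stg.amount_local * stg.exchange_rate,
       stg.units
FROM   STG_SALES stg
JOIN   ProductDimension p ON p.ProductName = stg.product_name
JOIN   StoreDimension   s ON s.StoreName   = stg.store_name
JOIN   TimeDimension    t ON t.FullDate    = stg.sale_date;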
Load
During the load step, it is necessary to ensure that the load is performed correctly and with as
little resources as possible. The target of the Load process is often a database. In order to make
the load process efficient, it is helpful to disable any constraints and indexes before the load and
enable them again only after the load completes. In that case, referential integrity needs to be maintained by the ETL tool to ensure consistency.
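A sketch of this load pattern (Oracle-style ALTER TABLE syntax; all object names are illustrative):

-- Disable constraints and drop indexes before the bulk load ...
ALTER TABLE SalesFact DISABLE CONSTRAINT SalesFact_FK_Product;
DROP INDEX SalesFact_IX1;

-- ... run the bulk load into SalesFact here ...

-- ... then rebuild the index and re-enable the constraint afterwards
CREATE INDEX SalesFact_IX1 ON SalesFact (ProductID);
ALTER TABLE SalesFact ENABLE CONSTRAINT SalesFact_FK_Product;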
Staging
It should be possible to restart, at least, some of the phases independently from the others. For
example, if the transformation step fails, it should not be necessary to restart the Extract step. We
can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-the-processing data.
Fig : ETL Operation scenario
Building your own data transformation tool (usually a set of shell scripts) is the preferred approach for a small number of data sources which reside in storage of the same type. The reason is that the effort to implement the necessary transformations is small, thanks to the similar data structures and common system architecture. Also, this approach saves licensing costs and there is no need to train the staff in a new tool. This approach, however, is dangerous from the TCO (total cost of ownership) point of view. If the transformations become more sophisticated over time, or there is a need to integrate other systems, the complexity of such an ETL system grows but its manageability drops significantly. Similarly, implementing your own tool often amounts to re-inventing the wheel.
Result:
_________________________________________________________________________________
Conclusion:
_________________________________________________________________________________
_________________________________________________________________________________
Industrial Application:
1. An educational publishing company (EPC) wanted to replace its core systems,
including finance, order management, e-commerce, and content management.
Early in the initiative, it was determined that a key element for success would be a
data quality program instituted as part of the data conversion and integration
effort. Experience and research into the common causes of failure for large ERP
implementations highlighted this as a substantial risk area.
2. Global Planning Data Automation.
Questionnaire:
Experiment No.: 09
Data Cubes
Aim: Construction of Cubes.
Theory:
When we try to extract information from a stack of data, we need tools to help us find what's
relevant and what's important and to explore different scenarios. A report, whether printed on
paper or viewed on-screen, is at best a two-dimensional representation of data, a table using
columns and rows. That's sufficient when we have only two factors to consider, but in the real
world we need more powerful tools.
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-
dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think
of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.
But data cubes aren't restricted to just three dimensions. Most online analytical processing
(OLAP) systems can build data cubes with many more dimensions—Microsoft SQL Server 2000
Analysis Services, for example, allows up to 64 dimensions. We can think of a 4-D data cube as
consisting of a series of 3-D cubes, though visualizing such higher-dimensional entities in spatial
or geometric terms can be a problem.
In practice, therefore, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
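In a relational OLAP setting, a small data cube can be materialized directly with the grouping extensions of SQL. A sketch, reusing the illustrative SalesFact and dimension tables from the star schema experiment (GROUP BY CUBE is supported by Oracle, SQL Server and PostgreSQL):

-- GROUP BY CUBE computes the aggregate for every combination of the grouping
-- columns, i.e. all the cells and subtotals of a three-dimensional cube
SELECT s.Region,
       p.Category,
       t.YearNo,
       SUM(f.SalesAmount) AS TotalSales
FROM   SalesFact f
JOIN   StoreDimension   s ON f.StoreID   = s.StoreID
JOIN   ProductDimension p ON f.ProductID = p.ProductID
JOIN   TimeDimension    t ON f.TimeID    = t.TimeID
GROUP BY CUBE (s.Region, p.Category, t.YearNo);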
Dimensions such as product, store and time would make up the structure of such a cube. However, OLAP cubes are not SQL Server relational databases, as data warehouses are.
Because an OLAP cube is not an open SQL Server data warehouse, it requires someone with the know-how and experience to maintain it, whereas a SQL Server data warehouse can be maintained by most IT people who have regular database training. This accordingly has a price tag attached to it, whether that means allocating time and energy from a current employee to manage the OLAP cube or hiring a new, perhaps full-time, employee for this role. Additionally, OLAP cubes tend to be more rigid and limited when it comes to designing reports because of their table-like functionality. Aesthetics and capabilities could, and arguably should, be important to a company that is building its portfolio of BI solutions.
The take-home message is that, when looking for BI tools, you should consider what format of data storage the product uses for analysis. Knowing and understanding the pros and cons of data warehouses and OLAP cubes should lead to a preference, and the storage of company data, which grows larger every year, is arguably the foundation for successful analysis. Some vendors, such as Solver with its BI360 suite, offer a fully built, configurable data warehouse both stand-alone and as part of a comprehensive suite of BI modules.
Result:
______________________________________________________________________________
Conclusion:
______________________________________________________________________________
______________________________________________________________________________
Industrial Application:
Questionnaire:
1. Data warehouse architecture is based on______________
2. ______________________supports basic OLAP operations, including slice and dice, drill-down,
roll-up and pivoting.
3. The core of the multidimensional model is the _____________ , which consists of a large set of
facts and a number of dimensions.
4. The data from the operational environment enter ________________of data warehouse.
5. Is it true “The data from the operational environment enter Current detail data of data
warehouse.”________________
6. Business Intelligence and data warehousing is used for________________
7. The generalization of cross-tab which is represented visually is ____________ which is also
called as data cube.
8. What includes a description of each relation in a database, and each piece of information in
each relation?
______________________________________________________________________________
______________________________________________________________________________
9. Which DBMS structure uses data cubes to represent multiple dimensions?
______________________________________________________________________________
______________________________________________________________________________
10. The diagrams such as cubes and cylinders are classified as?
______________________________________________________________________________
______________________________________________________________________________
Experiment No.: 10
OLAP Operations
Aim: OLAP Operations, OLAP Queries.
Theory:
CONCEPT HIERARCHIES
In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides
users with the flexibility to view data from different perspectives.
For example, if we have attributes such as day, temperature and humidity, we can group their values into subsets and name these subsets, thus obtaining a set of concept hierarchies.
OLAP OPERATIONS
OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP
data cube operations exist to materialize different views of data, allowing interactive querying
and analysis of the data.
Roll up
The roll-up operation (also called the drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction (removing one or more dimensions). Consider the following cube, which records the temperature of certain days, organized by week:
Fig : Example.
Assume we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the
above cube. To do this we have to group columns and add up the values according to the concept
hierarchy. This operation is called roll-up.
By doing this we obtain the following cube:
Fig : Rollup.
Here the roll-up climbs the temperature concept hierarchy (individual degree values --> levels hot, mild, cool), grouping the data by levels of temperature.
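In SQL terms, the same roll-up can be expressed by grouping on the higher-level members of the concept hierarchy. A sketch, assuming a simple TemperatureReadings(Week, Day, Temperature) table corresponding to the cube above:

-- Climb the temperature hierarchy: individual degree values roll up to levels
SELECT Week,
       CASE
         WHEN Temperature BETWEEN 80 AND 85 THEN 'hot'
         WHEN Temperature BETWEEN 70 AND 75 THEN 'mild'
         WHEN Temperature BETWEEN 64 AND 69 THEN 'cool'
       END AS TempLevel,
       COUNT(*) AS DaysAtLevel
FROM   TemperatureReadings
GROUP BY Week,
         CASE
           WHEN Temperature BETWEEN 80 AND 85 THEN 'hot'
           WHEN Temperature BETWEEN 70 AND 75 THEN 'mild'
           WHEN Temperature BETWEEN 64 AND 69 THEN 'cool'
         END;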
Roll Down
The roll down operation (also called drill down) is the reverse of roll up. It navigates from less
detailed data to more detailed data. It can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions. Performing roll down operation
on the same cube mentioned above:
Fig : Rolldown.
The result shown is a drill-down performed on the central cube by stepping down a concept hierarchy: here, drill-down descends the time hierarchy from the level of week to the more detailed level of day. New dimensions can also be added to the cube, because drill-down adds more detail to the given data.
Slicing
Slice performs a selection on one dimension of the given cube, thus resulting in a subcube. For
example, in the cube example above, if we make the selection, temperature=cool we will obtain
the following cube:
Fig : Slicing.
Dicing
The dice operation defines a subcube by performing a selection on two or more dimensions. For
example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube we get the following subcube (still two-dimensional):
Fig :Dice
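Expressed against the same illustrative TemperatureReadings table, slice and dice are simply selections on the cube's dimensions:

-- Slice: fix a single dimension (temperature level = cool)
SELECT Week, Day, Temperature
FROM   TemperatureReadings
WHERE  Temperature BETWEEN 64 AND 69;

-- Dice: select on two or more dimensions at once
SELECT Week, Day, Temperature
FROM   TemperatureReadings
WHERE  Day IN (3, 4)
  AND (Temperature BETWEEN 64 AND 69 OR Temperature BETWEEN 80 AND 85);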
Pivot
Pivot, otherwise known as rotate, changes the dimensional orientation of the cube, i.e. rotates the data axes to view the data from different perspectives. Pivot groups data along different dimensions. The cubes below show a 2-D representation of pivot.
Fig :Pivot
SCOPING: Restricting the view of database objects to a specified subset is called scoping. Scoping allows users to receive and update only the data values they wish to work with.
DRILL ACROSS: Accesses more than one fact table linked by common dimensions; it combines cubes that share one or more dimensions, as sketched below.
DRILL THROUGH: Drills from the bottom level of a data cube through to its back-end relational tables.
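A sketch of a drill-across query, combining the illustrative SalesFact with a hypothetical InventoryFact that shares the product and time dimensions:

-- Aggregate each fact table to a common grain, then join on the shared dimensions
SELECT s.ProductID, s.TimeID, s.TotalSales, i.TotalOnHand
FROM  (SELECT ProductID, TimeID, SUM(SalesAmount) AS TotalSales
       FROM SalesFact GROUP BY ProductID, TimeID) s
JOIN  (SELECT ProductID, TimeID, SUM(QuantityOnHand) AS TotalOnHand
       FROM InventoryFact GROUP BY ProductID, TimeID) i
  ON  s.ProductID = i.ProductID AND s.TimeID = i.TimeID;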
Result:
Conclusion:
Industrial Application:
1. On-line analytical processing and data mining applied to the analysis of curricula chosen by students: designing a data warehouse of university course selections, then extracting, transforming and loading the course-selection data into that warehouse.
Questionnaire:
3. ________________supports basic OLAP operations, including slice and dice, drill-down, roll-up
and pivoting.
4. Can the user perform data entry or editing task on OLAP data?
______________________________________________________________________________
______________________________________________________________________________
7. ____________________ operation of OLAP involves computing all of the data relationships for
one or more dimensions.
8. It is because of _______________ that OLAP is able to achieve great performance for a query.
References
Text Books:
1. Elmasri and Navathe, “Fundamentals of Database Systems”, 6th Edition, PEARSON
Education.
2. Theraja Reema, “Data Warehousing”, Oxford University Press, 2009.
3. Data Warehousing, Data Mining, & OLAP by Alex Berson McGraw Hill.
References:
1. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom “Database System
Implementation”, Pearson Ltd. 1/ e
2. Thomas M. Connolly Carolyn Begg, Database Systems : A Practical Approach to Design,
Implementation and Management, 4/e Pearson Ltd.
3. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Definitive Guide To
Dimensional Modeling”, 3rd Edition. Wiley India.
Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/