
Experiment No.: 01
Query Optimization
Aim: Implementation of a query optimizer (Java/Python)

Theory:
The term optimization is actually a misnomer because in some cases the chosen execution plan
is not the optimal (or absolute best) strategy—it is just a reasonably efficient strategy for
executing the query. Finding the optimal strategy is usually too time-consuming—except for the
simplest of queries. In addition, trying to find the optimal query execution strategy may require
detailed information on how the files are implemented and even on the contents of the
files—information that may not be fully available in the DBMS catalog. Hence, planning of a
good execution strategy may be a more accurate description than query optimization.

Figure 1 Typical steps when processing a high-level query.

The figure above shows the different steps of processing a high-level query. The query optimizer
module has the task of producing a good execution plan, and the code generator generates the
code to execute that plan. The runtime database processor has the task of running (executing) the
query code, whether in compiled or interpreted mode, to produce the query result. If a runtime
error results, an error message is generated by the runtime database processor.
Overview
Caching database queries can reduce and even remove the performance degradation caused by
slow database access.
Introduction
Database queries may be a major source of performance problems in a Java application. To
process the queries, a database server may have to do work that takes a significant amount of time.
Executing such queries and getting their results can take seconds or even minutes. For many Java
applications such delays are unacceptable.
Two steps are involved in addressing the performance problem caused by the heavy database
queries. These steps are:
1. Optimizing database queries.
2. Caching database queries.
Optimizing Database Queries

Optimizing database queries is the first step in addressing the performance problem.

Creating Indexes for Where and Order By Fields

It is possible to reduce query execution time by applying the following simple optimization
technique: create an index for each combination of fields involved in the "where" and
"order by" clauses.
Example:
Consider the invoicing part of an e-commerce management system:
create table PRODUCT (
ID integer not null,
NAME varchar(254) not null
);
create table INVOICE (
ID integer not null,
NUMBER integer not null,
PRODUCT_ID integer not null,
constraint INVOICE_FK1 foreign key (PRODUCT_ID) references PRODUCT(ID)
);
The following query finds all invoices with the given product and orders them by the invoice
number.
select distinct INVOICE.* from INVOICE, PRODUCT
where PRODUCT.NAME = ? and INVOICE.PRODUCT_ID = PRODUCT.ID
order by INVOICE.NUMBER

The following indexes may be created up front. The suffix PK stands for primary key, AK
for unique alternate key, and IX for non-unique index:
create unique index PRODUCT_PK1 on PRODUCT(ID);
create unique index PRODUCT_AK1 on PRODUCT(NAME);
create index INVOICE_IX1 on INVOICE(PRODUCT_ID);

create index INVOICE_IX2 on INVOICE(NUMBER);

This technique alone usually provides a noticeable improvement in query performance.


Analyzing Query Execution Plan
If the query is still slow, the following advanced optimization technique may be applied: create
an index for each combination of fields involved in a full scan as determined by examining
a query execution plan.
Example:
1. Execute the query in question using a database console tool.
2. Request the tool to provide a query execution plan.
3. Find places where the database reports full scans.
4. Add an index or indexes on fields involved in the full scans.
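As an illustration (not part of the source article), the execution plan can also be requested programmatically. The sketch below assumes an Oracle database reachable over JDBC; the connection URL and credentials are placeholders. It uses EXPLAIN PLAN together with DBMS_XPLAN to print the plan so that full scans (TABLE ACCESS FULL) on the INVOICE table can be spotted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ShowExecutionPlan {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details for a local Oracle instance.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "scott", "tiger");
             Statement st = con.createStatement()) {

            // Ask the optimizer to explain the query without executing it.
            st.execute("EXPLAIN PLAN FOR "
                     + "SELECT * FROM INVOICE WHERE PRODUCT_ID = 1");

            // Print the plan; look for TABLE ACCESS FULL lines (full scans)
            // and add indexes on the fields involved in them.
            try (ResultSet rs = st.executeQuery(
                     "SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}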
Caching Database Queries
Even fully optimized queries may take an unacceptably long time. Caching the results of database
queries may provide a significant improvement in application performance, often by orders
of magnitude.
Query Caching Algorithm
A cache is an area of local memory that holds a copy of frequently accessed data that is
otherwise expensive to get or compute. Cached data is identified by a key. The algorithm is
simple:
1. Get a cached query result.
2. If not found, execute the query and put its result into the cache.
3. Return the result of the query to the requestor.
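A minimal in-memory sketch of this algorithm, applied to the invoice query from the earlier example, is shown below. It is an illustration only: there is no eviction, expiration or invalidation, and the class and method names are made up.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueryCache {
    // Cached query results, identified by a key built from the query and its parameter.
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    public List<String> invoiceIdsForProduct(Connection con, String productName)
            throws SQLException {
        String key = "invoiceIdsForProduct:" + productName;

        // 1. Get a cached query result.
        List<String> cached = cache.get(key);
        if (cached != null) {
            return cached;
        }

        // 2. If not found, execute the query and put its result into the cache.
        List<String> result = new ArrayList<>();
        String sql = "SELECT INVOICE.ID FROM INVOICE, PRODUCT "
                   + "WHERE PRODUCT.NAME = ? AND INVOICE.PRODUCT_ID = PRODUCT.ID";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, productName);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(rs.getString(1));
                }
            }
        }
        cache.put(key, result);

        // 3. Return the result of the query to the requestor.
        return result;
    }
}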
Result:
_____________________________________________________________________________
Conclusion:
______________________________________________________________________________
____________________________________________________________________________
Industrial Application:
Applying Multiple Query Optimization in Mobile Databases:
We apply multiple query optimization to batches of pull (on-demand) requests in a mobile
database system. The resulting view can answer several queries at once, and it is broadcast on a
view channel dedicated to common answers of multiple queries rather than over individual
downlink channels.
The Oracle Query Optimizer:
The Optimizer is one of the most fascinating components of the Oracle Database, since it is
essential to the processing of every SQL statement. The Optimizer determines the most efficient
execution plan for each SQL statement based on the structure of the query, the available
statistical information about the underlying objects, and all the relevant optimizer and execution
features.

References
Text Books:
1. Elmasri and Navathe, “Fundamentals of Database Systems”, 6th Edition, Pearson
Education.
2. Reema Thareja, “Data Warehousing”, Oxford University Press, 2009.
3. Alex Berson, “Data Warehousing, Data Mining, & OLAP”, McGraw Hill.

References:
1. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, “Database System
Implementation”, Pearson, 1/e.
2. Thomas M. Connolly, Carolyn Begg, “Database Systems: A Practical Approach to Design,
Implementation and Management”, 4/e, Pearson.
3. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modeling”, 3rd Edition, Wiley India.

Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/

Questionnaire:
1. An option in view maintenance in which only the affected part of the view is modified is
called ........................................
2. The process of replacing a nested query by a query with a join is called ....................
3. The technique chosen by the database system for evaluation of an operation depends on
the size of what?
__________________________________________________________________________
__________________________________________________________________________
4. The technique that optimizes subexpressions shared by different expressions in a program is
called ..................................
5. Is it true that each relational-algebra expression represents a particular sequence of
operations?
__________________________________________________________________________
6. The files from which algorithms are selected for records are called...............................
7. Is it true “The procedure of choosing a suitable query out of all the queries is classified as
query optimization”
__________________________________________________________________________
__________________________________________________________________________
8. What is the representation of the query in the form of a data structure?
__________________________________________________________________________
__________________________________________________________________________
9. In the sort-merge strategy, the small sorted subfiles are called ...............................
10. Is it true that a string is a sequence of zero or more characters enclosed in single
quotes?
__________________________________________________________________________

Experiment No.: 02
Path Expression Evaluation
Aim: Assignments on query evaluation and path expressions.
Theory:
Instances of user defined persistable classes (entity classes, mapped super classes and
embeddable classes) are represented in JPQL by the following types of expressions:
 Variables - FROM identification variables and SELECT result variables.
 Parameters - when instances of these classes are assigned as arguments.
 Path expressions that navigate from one object to another.
Instances of user defined persistable classes can participate in direct comparison using
the = and <> operators. But more often they are used in JPQL path expressions that navigate to
values of simple types (number, boolean, string, date).
Simple type values are more useful in queries. They have special operators and functions (e.g.
for strings and for numbers), they can be compared by all six comparison operators, and they can
be used in ordering.
This section covers the following topics:
 Navigation through Path Expressions
 Navigation through a NULL value
 Entity Type Expressions
 Criteria Query Paths and Types

Navigation through Path Expressions


A path expression always starts with an instance of a user defined class (represented by a
variable, parameter or prefix path expression) and uses the dot (.) operator to navigate through
persistent fields to other objects and values.
For example, c.capital, where c represents a Country entity object, uses the capital persistent field
in the Country class to navigate to the associated Capital entity object.
A path expression whose type is a persistable user class can be extended further by reusing the dot
(.) operator. For example, c.capital.name is a nested path expression that continues from the
Capital entity object to its name field. A path expression can be extended further only if its type
is also a user defined persistable class. The dot (.) operator cannot be applied to collections, maps
and values of simple types (number, boolean, string, date).
For a path expression to be valid the user defined persistable class must contain a persistent field
(or property) with a matching name. The path expression, however, is valid even if the persistent
field is declared as private (which is usually the case).

Navigation through a NULL value


The following query retrieves country names with their capital city names:

SELECT c.name, c.capital.name FROM Country c
The c identification variable is used for iteration over all the Country objects in the database.
For a country with no capital city, such as Nauru, c.capital is evaluated to NULL and
c.capital.name is an attempt to navigate from a NULL value. In Java, a NullPointerException is
thrown on any attempt to access a field or a method via a null reference. In JPQL, the current
FROM variable (or FROM tuple when there are multiple variables) is simply skipped. It might
be easier to understand exactly how this works by considering the equivalent JOIN query.
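For reference, a sketch of that equivalent JOIN query is shown below, executed through a JPA EntityManager. The inner join simply skips countries whose capital is NULL, which matches the behaviour described above; the Country entity and its capital field are assumed to be mapped as in the earlier examples.

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.TypedQuery;

public class CapitalNames {
    // Returns (country name, capital name) pairs, skipping countries with no capital.
    public static List<Object[]> countryAndCapitalNames(EntityManager em) {
        TypedQuery<Object[]> query = em.createQuery(
            "SELECT c.name, k.name FROM Country c JOIN c.capital k",
            Object[].class);
        return query.getResultList();
    }
}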

Entity Type Expressions


The TYPE operator (which is new in JPA 2) returns the type of a specified argument, similarly
to java.lang.Object's getClass method in Java.
The following query returns the number of all the entity objects in the database, excluding
Country entity objects:
SELECT COUNT(e) FROM Object e WHERE TYPE(e) <> Country
Binding an identification variable (e) to the Object class is an extension of ObjectDB that can be
used to iterate over all the entity objects in the database. The Country literal represents the
Country entity class. The TYPE operator returns the actual type of the iterated e. Only objects
whose type is not Country are passed to the SELECT. The SELECT clause counts all these
objects (this is an aggregate query with no GROUP BY - all the objects are considered as one
group, and COUNT calculates its size).

Criteria Query Paths and Types


Paths and navigations are represented in the JPA Criteria API by the Path interface and by its sub
interfaces (From, Root, Join and Join's descendants).

Path Expressions
The Root and Join interfaces (which are subinterfaces of From) represent FROM variables.
FROM variable expressions are considered as basic paths and also serve as starting point for
building more complex paths through navigation.
Given a Path instance, a child Path expression (which represents navigation from the parent path
through a persistent field or property) can be constructed by the get method:
// FROM Variable Paths:
Root<Country> country = query.from(Country.class);
Join<Country,Country> neighborCountry = country.join("neighbors");

// Navigation Paths:
Path<String> countryName = country.get("name");
Path<City> capital = country.get("capital");
Path<String> capitalName = capital.get("name");

The path expressions in the above code can be divided into two main groups:
 FROM variable expressions, represented by subinterfaces of From (Root, Join) -
The creation of a FROM expression automatically modifies the query by adding a variable
to the FROM clause (representing iteration during query execution). The constructed
variable expression can also be used explicitly in other query clauses.
 Navigation expressions, represented by the Path interface -
The creation of a navigation path expression doesn't affect the built query directly. The
constructed expression must be integrated into query clauses explicitly to have effect, as the
sketch below illustrates.
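To make the last point concrete, the following sketch integrates the navigation paths built above into the SELECT and WHERE clauses of a complete criteria query. The class and method names are illustrative only.

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Path;
import javax.persistence.criteria.Root;

public class CountryByCapital {
    // Finds names of countries whose capital city has the given name.
    public static List<String> countryNamesByCapital(EntityManager em, String capitalName) {
        CriteriaBuilder cb = em.getCriteriaBuilder();
        CriteriaQuery<String> query = cb.createQuery(String.class);

        // FROM variable path (added to the FROM clause automatically).
        Root<Country> country = query.from(Country.class);

        // Navigation paths (no effect until integrated into a clause).
        Path<String> countryName = country.get("name");
        Path<String> name = country.get("capital").get("name");

        // Integrate the navigation paths into the SELECT and WHERE clauses.
        query.select(countryName).where(cb.equal(name, capitalName));
        return em.createQuery(query).getResultList();
    }
}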

Type Expressions
Entity type expressions can be constructed by the Path's type method. For example, the following
criteria expression checks if the type of a specified entity e is not Country.
Predicate p = cb.notEqual(e.type(), cb.literal(Country.class));

In the above example, the comparison is between the type of the e object (which may represent
any path including a root or a join) and the entity type Country (a criteria literal).
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________

Industrial Application:
1. Use of path expressions is a common feature in most XML query languages. The path
shortening principle reduces the querying cost by shortening the path expressions with
the knowledge of XML schema.
2. Path expressions using navigational algebraic operators.
Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/

Questionnaire:
1. Is it true that in case of the A4 search algorithm, if the equality condition is on a key, the
strategy can retrieve a single record?
__________________________________________________________________________
__________________________________________________________________________

2. Which of the operators is/are used in “Condition Evaluation”?

__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
3. What will be the result of the following query?
WHERE end_date IS NULL
AND NOT (title = 'teller' OR start_date < '2007-01-01')
__________________________________________________________________________
__________________________________________________________________________
4. What will be the result of the following query?
WHERE end_date IS NULL
AND (title = 'teller' OR start_date < '2007-01-01')
__________________________________________________________________________
__________________________________________________________________________
5. What will be the result of the following query?
WHERE title = 'teller' OR start_date = '2007-01-01'
__________________________________________________________________________
__________________________________________________________________________
6. What will be the result of the following query?
WHERE title = 'teller' AND start_date < '2007-01-01'
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________

7. What is involved in processing a query?

__________________________________________________________________________
__________________________________________________________________________

8. A collection of data designed to be used by different people is called a/an .........................


9. Which of the schemas does define a view or views of the database for particular users?

__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________

10. Which is an attribute that can uniquely identify a row in a table?

__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
Experiment No.: 03

Concurrency Control & Recovery Methods
Aim: Simulation of Concurrency Control Algorithm, Recovery Algorithm (Java/Python)

Theory: In the context of concurrent OO programming, a transaction is an operation
performed by an arbitrary client that invokes an arbitrary set of methods on an arbitrary set of
participant objects, all without interference from other activities.

The arbitrariness of both the participants and the action sequences requires extensions of the joint
action control strategies. Transaction techniques extend delegation-based synchronization and
control to situations in which each participant may be unaware of the atomicity constraints
placed on its actions, and cannot rely on more efficient structural solutions. In transaction
frameworks, each participant (and each client) gives up its local autonomy in deciding how to
perform concurrency control. Participants must instead reach consensus deciding how and when
to perform actions and/or to commit to their effects.

Transaction frameworks are among the most famous examples of how providing components
that implement valuable general-purpose functionality sometimes has the price of introducing a
large number of programmer obligations. Classes supporting transaction protocols can be highly
usable and reusable. Transaction frameworks can be used to tackle most of the concurrency
problems discussed in this book. But they rely on designs in which each class, at each layer of
functionality, supports a standardized transaction protocol that propagates control down through
successive layers. The heaviness of transaction frameworks usually restricts their use to contexts
in which you really need to set up objects so as to guarantee atomicity of arbitrary code
sequences.

For example, you may be able to bypass transactional control if you know all the call-sequences
that will ever be encountered in a component or application. In this case, you can specifically
design support for each one (using whatever techniques happen to apply) without having to
address the general case. This is a perhaps extreme extension of the idea (see 2.2.2) of padding
reusable synchronized objects with atomic versions of frequently needed convenience methods.
This is sometimes a plausible alternative, in the spirit of doing the simplest and surest thing that
could possibly work. Similarly, you may be able to rely entirely on client-side locking (see 2.2.3)
in cases where clients somehow know to obtain all locks required for a given action and how to
avoid any potential deadlocks.

This section provides a brief overview of transaction-based techniques applicable in
general-purpose concurrent programming contexts. The designs presented here deal only with internal
concurrency, and not explicitly with databases or distribution. Because even lightweight (at least
in a relative sense) internal transaction frameworks are normally tied to other application-specific
constraints and features, you are unlikely to use the exact interfaces and classes described here
(although most are simplified variants of those in the net.jini package). And if you instead rely
on a standardized transaction framework such as JDBC or JTS, you will encounter additional
issues tied to persistence support and related services that fall outside the scope of this book.
However, the final example shows how the ideas behind transactions can help structure more ordinary
concurrent OO designs. Thus, the main goals of this section are to give a brief synopsis of how
transaction systems extend other approaches to concurrency control, and to present techniques
that may be scaled down as needed to apply to other concurrency problems.
As a running example, consider again writing a transfer operation for the BankAccount class.
From a transactional point of view, a stand-alone transfer operation (without any provisions
for automatic transfers) looks like:

pseudoclass AccountUser {                          // Pseudocode
  TransactionLogger log;                           // any kind of logging facility
  // ...

  // Attempt transfer; return true if successful
  public boolean transfer(long amount,
                          BankAccount source,
                          BankAccount destination) {
    TRANSACTIONALLY {
      if (source.balance() >= amount) {
        log.logTransfer(amount, source, destination);
        source.withdraw(amount);
        destination.deposit(amount);
        return true;
      }
      else
        return false;
    }
  }
}

The TRANSACTIONALLY pseudo-qualifier indicates that we'd like this code to be executed in an
all-or-none fashion without any possibility of interference from other activities. Once
implemented, this operation could be used in an automated transfer scheme of the sort described
in 3.5.1. Additionally, the transactional approach permits greater flexibility than seen in our
specific solution, although with significantly greater overhead. Once classes are fitted with
transactional apparatus, it becomes possible to associate transactionality with any sequence of
operations involving bank accounts.
Transaction Protocols

Transaction frameworks rely on extended forms of the before/after tactics characteristic of most
concurrency control strategies. Here, the before-action is typically called join (or
sometimes, begin) and the after-action is called commit. The main differences between
join/commit and operations such as lock acquire/release stem from the fact that join/commit
require consensus among the set of participating objects: All participants must agree to begin and
end transactions. This leads to two-phase protocols surrounding join and/or commit — first to
obtain consensus and then to act. If any participant disagrees about joining or committing, the
attempted transaction is aborted. The most straightforward version of the basic protocol is:

1. For each participant p, if p cannot join, abort.

2. For each participant p, tentatively execute p's action.

3. For each participant p, if p cannot commit, abort.

4. For each participant p, commit p's effects of the transaction.

As in most concurrency control contexts, two complementary sets of policies can be applied to
this protocol. In the purest form of optimistic transactions, participants can always join, but
cannot always commit. In the purest form of conservative transactions, participants cannot
always join, but if they do, they can always commit. Optimistic approaches apply best when the
likelihood of contention is low enough to outweigh rollback costs. Conservative approaches
apply best when it is difficult or impossible to undo actions performed during transactions.
However, it is rare to find pure forms of each, and it is not hard to create frameworks that permit
mixtures.

The most classic forms of conservative transactions can be implemented only if the identities of
all participants can be known before any actions are taken. This is not always possible. In an OO
system, the participants are just those objects whose methods are invoked during some call
sequence that is wrapped as a transaction. Because of polymorphism, dynamic loading, etc., it is
generally impossible to identify them all beforehand; instead, their identities become known only
as the action unfolds. Still, in many cases, at least some participants are known beforehand and
they can be probed before starting to engage in any unrecoverable action.

However, in most approaches to conservative OO transactions, participants join only tentatively.


They can still later refuse to commit, using approximately the same strategy seen in optimistic
transactions. Conversely, full rollback is not always necessary in optimistic approaches. Some
roll-forward operations may be allowed if they do not impact overall functionality.

Transaction Participants
In addition to supporting methods for joining, committing, aborting, and (when necessary)
creating transactions, each class in a structured transaction framework must declare all of its
public methods to add a transaction control argument to its set of normal arguments.

A method invocation supplying a given transaction argument serves as a request to perform the
associated action on behalf of the given transaction, but without committing to its effects until
requested to do so. Methods take the form:

ReturnType aMethod(Transaction t, ArgType args) throws...

For example, BankAccount.deposit would be declared as:

void deposit(Transaction t, long amount) throws ...

Transaction is any type that supplies the necessary control information. This transaction
information must be propagated throughout all methods invoked in the course of a particular
transaction, including nested calls to helper objects. The simplest kind of transaction argument is
a transaction key that uniquely identifies each transaction. Each method in each participating
object is then responsible for using this key to manage and isolate actions in accord with the
given transaction policy. Alternatively, a transaction argument may refer to a special control or
coordinator object possessing methods that help participants perform their roles in transactions.
It is, however, possible to cheat here, and many transaction frameworks do. For example,
transaction identifiers can be hidden as thread-specific data (see 2.3.2). Before/after control can
be restricted to intercepted entry points into sessions performing services provided by
components. Participants can be determined via reflection or scanning bytecodes. And rollback
obligations can be semi-automated by serializing the entire state of a component and/or acquiring
locks on entry into a service session. These kinds of tactics can help hide the details of
transactional control from application writers. This comes at the expense of overhead and usage
restrictions that are not generally worthwhile in lightweight transaction frameworks performing
internal concurrency control.

Interfaces
Participant classes must implement interfaces defining the methods used in transaction control.
Here is a simple but representative interface:

class Failure extends Exception {}

interface Transactor {
  // Enter a new transaction and return true, if possible
  public boolean join(Transaction t);

  // Return true if this transaction can be committed
  public boolean canCommit(Transaction t);

  // Update state to reflect current transaction
  public void commit(Transaction t) throws Failure;

  // Roll back state (No exception; ignore if inapplicable)
  public void abort(Transaction t);
}

Among many other variants, it is possible to split the join phase similarly to the commit phase —
a preliminary canJoin followed by a required join. The canCommit method is most often
named prepare in transaction frameworks.
For simplicity of illustration, a single Failure exception type is associated with these
operations, as well as all others in this series of examples. Participant objects are allowed to raise
exceptions when they encounter actual or potential conflicts and when they are requested to
participate in transactions they do not know about. Of course in practice, you'd want to subclass
these exception types and use them to provide additional information to clients in cases of
failure.
A second interface or class is needed to describe Transaction objects themselves. In discussing
basic operations, we can use a no-op version:

class Transaction {
// add anything you want here
}

Again, it is not even necessary to associate an object with a transaction. A simple
unique long transactionKey argument may suffice in place of all uses of Transaction. At the
other extreme, you may need a Transaction Factory for creating all Transactions. This
allows different kinds of Transaction objects to be associated with different styles of
transactions.
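As an illustration of the basic join/commit protocol described above, a minimal coordinator might drive a set of Transactor participants as sketched below. The coordinator class and its method names are invented for illustration; it only assumes the Transactor, Transaction and Failure types shown above.

import java.util.List;

class TransactionCoordinator {
  // Runs the tentative actions transactionally across all participants.
  // Returns true if the transaction committed, false if it was aborted.
  public boolean runTransaction(List<Transactor> participants,
                                Runnable tentativeActions) {
    Transaction t = new Transaction();

    // Phase 1a: every participant must agree to join.
    for (Transactor p : participants) {
      if (!p.join(t)) { abortAll(participants, t); return false; }
    }

    // Tentatively execute the participants' actions under transaction t.
    tentativeActions.run();

    // Phase 1b: every participant must agree to commit.
    for (Transactor p : participants) {
      if (!p.canCommit(t)) { abortAll(participants, t); return false; }
    }

    // Phase 2: make the effects permanent.
    try {
      for (Transactor p : participants)
        p.commit(t);
    } catch (Failure f) {
      abortAll(participants, t);      // a late commit failure forces a rollback attempt
      return false;
    }
    return true;
  }

  private void abortAll(List<Transactor> participants, Transaction t) {
    for (Transactor p : participants)
      p.abort(t);                     // abort ignores transactions it does not know about
  }
}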
Implementations
Participants in transactions must support both a transaction participant interface and an interface
describing their basic actions. For example:

interface TransBankAccount extends Transactor {

  public long balance(Transaction t) throws Failure;

  public void deposit(Transaction t, long amount)
    throws InsufficientFunds, Failure;

  public void withdraw(Transaction t, long amount)
    throws InsufficientFunds, Failure;
}
However, it is not always necessary to provide transactional signatures for pure accessor
methods such as balance here. Instead (or in addition to transactional versions), such methods
can return the most recently committed value when called from clients that are not participating
in transactions. Alternatively, a special null-transaction type (or just passing null for
the Transaction argument) can be used to denote one-shot invocations of transactional
methods.
Oracle supports two types of locks: row locks and table locks. These locks can be subdivided
into several categories:

Row Share Table Locks (RS)—These locks are issued when an SQL transaction has declared
its intent to update the table in row share mode. This type of lock allows other queries to
update rows in the customer table. For example:
lock table customer in row share mode;
SELECT customer_name
FROM CUSTOMER
FOR UPDATE OF customer_name;

Row Exclusive Table Locks (RX)—These locks are issued automatically against a table when
an UPDATE, DELETE, or INSERT statement is issued against the table.

Table Share Locks (S)—This type of lock is issued when the LOCK TABLE command is
issued against the table. This indicates that the transaction intends to perform updates against
some rows in the table, and prevents other tasks from updating the table until the LOCK TABLE xxx
IN SHARE MODE transaction has completed.

Share Row Exclusive Table Locks (SRX)—These locks are issued with the LOCK TABLE
xxx IN SHARE ROW EXCLUSIVE MODE command. This prevents any other tasks from
issuing any explicit LOCK TABLE commands until the task has completed, and also prevents
any row-level locking on the target table.

Exclusive Table Locks (X)—This is the most restrictive of the table locks and prevents
everything except queries against the affected table. Exclusive locks are used when the
programmer desires exclusive control over a set of rows until the operation has completed; they
are obtained with the LOCK TABLE xxx IN EXCLUSIVE MODE command, which locks the
table for the duration of the transaction.
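As a small illustration (not taken from the text above), the following sketch acquires an exclusive table lock from JDBC and holds it until commit. The connection URL, credentials and the placeholder UPDATE statement are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ExclusiveLockExample {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "scott", "tiger")) {
            con.setAutoCommit(false);            // the lock lives inside a transaction
            try (Statement st = con.createStatement()) {
                // Most restrictive table lock: other sessions can only query the table.
                st.execute("LOCK TABLE customer IN EXCLUSIVE MODE");

                // ... perform the updates that require exclusive control ...
                st.executeUpdate("UPDATE customer SET customer_name = customer_name");
            }
            con.commit();                        // committing releases the table lock
        }
    }
}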
Result:

__________________________________________________________________________

Conclusion:

__________________________________________________________________________
__________________________________________________________________________

Industrial Applications:

1. Process Control.
Industrial control systems are usually attached to sensors and actuators. The sensors monitor the
state of some real-world process, and the controllers manipulate actuators such as valves. Control
decisions are made based on the input data and controller configuration parameters. Input data are
generated from field devices such as sensors, transmitters and switches that are connected to the
controller.
2. Internet Service Management.
Service providers are increasingly deploying service management systems (SMS). An SMS lets
an ISP create and provide IP services such as e-mail, VPN, LDAP directory services, etc. The
SMS streamlines allocation of resources to subscribers, resource management, and controlling all the
relevant network components at a single centralized point in the network. An SMS needs to use a
real-time database for performing authorization, authentication, and accounting for the Internet
users. The SMS must manage such data as session status as well as information about the
network, subscriber, and policies — in the face of the number of subscribers running into
millions. Real-time database management systems are used to maintain these data.

3. Spacecraft Control System.


In a spacecraft, a control system is responsible for the successful overall operation of the
spacecraft. It is also responsible for receiving command and control information from the
ground computer. A spacecraft control system maintains contact with the ground control using
antennae, receivers, and transmitters. The control system monitors several parameters relevant to
the successful operation of the spacecraft through several sensors mounted on and within the
spacecraft. In addition to controlling the regular operation of the spacecraft, the controller also
monitors the "health" of the spacecraft itself.
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, “Database System
Implementation”, Pearson, 1/e.
2. Thomas M. Connolly, Carolyn Begg, “Database Systems: A Practical Approach to Design,
Implementation and Management”, 4/e, Pearson.
3. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modeling”, 3rd Edition, Wiley India.

Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/
Questionnaire:

1. Locks placed by command are called ________ .


2. Which lock protects the item from change but not from read?
__________________________________________________________________
3. State advantages of optimistic locking?
______________________________________________________________________________
_____________________________________________________________________
____________________________________________________________
4. Which refers to a cursor type that when the cursor is opened, a primary key value
is saved for each row in the record set ; when the application accesses a row, the
key is used to fetch the current values of the row?
______________________________________________________________________________
_____________________________________________________________________
____________________________________________________________
5. Consider the following transactions with data items P and Q initialized to zero:
T1: read (P) ;
read (Q) ;
if P = 0 then Q : = Q + 1 ;
write (Q) ;
T2: read (Q) ;
read (P) ;
if Q = 0 then P : = P + 1 ;
write (P) ;
______________________________________________________________________________
_____________________________________________________________________
____________________________________________________________

6. Any non-serial interleaving of T1 and T2 for concurrent execution leads to what?


______________________________________________________________________________
____________________________________________________________
7. Which of the following concurrency control protocols ensure both conflict
serializability and freedom from deadlock? I. 2-phase locking II. Time-stamp
ordering.
______________________________________________________________________________
____________________________________________________________

8. Is it true “An approach named Lock timeouts is used for Deadlock detection”
_____________________________________________________________________
9. The system must deal with the deadlocks that are not prevented by using schemes
of..............................................
10. _____________ensures that once transaction changes are done, they cannot be
undone or lost, even in the event of a system failure.

Experiment No.: 04
Fragmentation, Query Processing.

Aim: Design of a distributed database for a real life application - Fragmentation, Query
Processing.

Theory:

Data Fragmentation:

Techniques that are used to break up the database into logical units, called fragments,
which may be assigned for storage at the various sites. We also discuss the use of data
replication, which permits certain data to be stored in more than one site, and the process of
allocating fragments—or replicas of fragments—for storage at the various sites. These
techniques are used during the process of distributed database design. The information
concerning data fragmentation, allocation, and replication is stored in a global directory that is
accessed by the DDBS applications as needed. In a DDB, decisions must be made regarding
which site should be used to store which portions of the database. For now, we will assume that
there is no replication; that is, each relation—or portion of a relation—is stored at one site only.
We discuss replication and its effects later in this section. We also use the terminology of
relational databases, but similar concepts apply to other data models. We assume that we are
starting with a relational database schema and must decide on how to distribute the relations
over the various sites. To illustrate our discussion, we use the relational database schema Before
we decide on how to distribute the data, we must determine the logical units of the database that
are to be distributed. The simplest logical units are the relations themselves; that is, each whole
relation is to be stored at a particular site.

In our example, we must decide on a site at which to store each of the relations EMPLOYEE,
DEPARTMENT, PROJECT, WORKS_ON, and DEPENDENT. In many cases, however, a
relation can be divided into smaller logical units for distribution. For example, consider the
company database, and assume there are three computer sites—one for each
department in the company. We may want to store the database information relating to each
department at the computer site for that department. A technique called horizontal fragmentation
can be used to partition each relation by department.
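As a simple illustration of this idea, the sketch below horizontally fragments an in-memory EMPLOYEE relation by department number, producing one fragment per department site. The tuples and field names are simplified assumptions modeled on the COMPANY schema.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HorizontalFragmentation {
    // A simplified EMPLOYEE tuple: name and department number.
    record Employee(String name, int dno) {}

    public static void main(String[] args) {
        List<Employee> employee = List.of(
            new Employee("Smith", 5), new Employee("Wong", 5),
            new Employee("Zelaya", 4), new Employee("Wallace", 4),
            new Employee("Borg", 1));

        // Horizontal fragmentation: EMP_d = the tuples of EMPLOYEE with dno = d.
        // Each fragment would be allocated to the site of department d.
        Map<Integer, List<Employee>> fragments = employee.stream()
            .collect(Collectors.groupingBy(Employee::dno));

        fragments.forEach((dno, frag) ->
            System.out.println("Fragment EMP_" + dno + " stored at site " + dno + ": " + frag));
    }
}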

Distributed Query Processing:

A distributed database query is processed in stages as follows:


1. Query Mapping. The input query on distributed data is specified formally using a query
language. It is then translated into an algebraic query on global relations. This translation is done
by referring to the global conceptual schema and does not take into account the actual
distribution and replication of data. Hence, this translation is largely identical to the one
performed in a centralized DBMS. It is first normalized, analyzed for semantic errors,
simplified, and finally restructured into an algebraic query.

2. Localization. In a distributed database, fragmentation results in relations being stored in
separate sites, with some fragments possibly being replicated. This stage maps the distributed
query on the global schema to separate queries on individual fragments using data distribution
and replication information.

3. Global Query Optimization. Optimization consists of selecting a strategy from a list of
candidates that is closest to optimal.

Result:

__________________________________________________________________________

Conclusion:

__________________________________________________________________________
__________________________________________________________________________

Industrial Application:

1. Log File Code for generating Customized ISUD matrix table.


2. Network scaling problems.
References
Web Resources:

Questionnaire:

1. The _____________ contains the description of the entire database as seen by the
database administrator.
2. Distributed query processing is more complex than centralized query processing
because it involves _____________

3. What is needed to specify an SQL query with location transparency?


______________________________________________________________________________
______________________________________________________
4. Is it true “The distributed DBMS may integrate the DBMS of a newly acquired
organization”
______________________________________________________________________________
______________________________________________________

5. Is it true ” SQL is often used as an interface between the distributed data manager
and local data managers”
______________________________________________________________________________
______________________________________________________

6. What is the work of a local data manager in distributed DBMS?


______________________________________________________________________________
______________________________________________________

7. The degree of resource sharing determines the architecture for parallel database
processing. The standard classification of architectures is known as......................

8. What is the advantage of the three-tier architecture over the two-tier architecture?
______________________________________________________________________________
______________________________________________________

9. Which of the following statements is not true about two-tier client-server database
architecture?
______________________________________________________________________________
______________________________________________________
10. Define Localization.
__________________________________________________________________
__________________________________________________________________

Experiment No.: 05

Recovery Management
Aim: Simulation of Recovery methods.
Theory:
Recovery from transaction failures usually means that the database is restored to the most
recent consistent state just before the time of failure. To do this, the system must keep
information about the changes that were applied to data items by the various transactions. This
information is typically kept in the system log. A typical strategy for recovery may be
summarized informally as follows:
1. If there is extensive damage to a wide portion of the database due to catastrophic
failure, such as a disk crash, the recovery method restores a past copy of the database that was
backed up to archival storage (typically tape or other large capacity offline storage media) and
reconstructs a more current state by reapplying or redoing the operations of committed
transactions from the backed up log, up to the time of failure.
2. When the database on disk is not physically damaged, and a noncatastrophic failure has
occurred, the recovery strategy is to identify any changes that may cause
an inconsistency in the database. For example, a transaction that has updated some database
items on disk but has not been committed needs to have its changes reversed by undoing its write
operations.

It may also be necessary to redo some operations in order to restore a consistent state of the
database; for example, if a transaction has committed but some of its write operations have not
yet been written to disk. For noncatastrophic failure, the recovery protocol does not need a
complete archival copy of the database. Rather, the entries kept in the online system log on disk
are analyzed to determine the appropriate actions for recovery. Conceptually, we can distinguish
two main techniques for recovery from noncatastrophic transaction failures: deferred update and
immediate update. The deferred update techniques do not physically update the database on disk
until after a transaction reaches its commit point; then the updates are recorded in the database.
Before reaching commit, all transaction updates are recorded in the local transaction workspace
or in the main memory buffers that the DBMS maintains (the DBMS main memory cache).

THE ARIES ALGORITHM


We now describe the ARIES algorithm as an example of a recovery algorithm used in database systems. It is used in many
relational database-related products of IBM. ARIES uses a steal/no-force approach for writing, and it is
based on three concepts: write-ahead logging, repeating history during redo, and logging changes during
undo.
The second concept, repeating history, means that ARIES will retrace all actions of the database
system prior to the crash to reconstruct the database state when the crash occurred. Transactions that were
uncommitted at the time of the crash (active transactions) are undone. The third concept, logging during
undo, will prevent ARIES from repeating the completed undo operations if a failure occurs during
recovery, which causes a restart of the recovery process. The ARIES recovery procedure consists of three
main steps: analysis, REDO, and UNDO. The analysis step identifies the dirty (updated) pages in the
buffer and the set of transactions active at the time of the crash. The appropriate point in the log where
the REDO operation should start is also determined. The REDO phase actually reapplies updates from the
log to the database.
Generally, the REDO operation is applied only to committed transactions. However, this is not
the case in ARIES. Certain information in the ARIES log will provide the start point for REDO, from
which REDO operations are applied until the end of the log is reached. Additionally, information stored
by ARIES and in the data pages will allow ARIES to determine whether the operation to be redone has
actually been applied to the database and therefore does not need to be reapplied. Thus, only the
necessary REDO operations are applied during recovery. Finally, during the UNDO phase, the log is
scanned backward and the operations of transactions that were active at the time of the crash are undone
in reverse order. The information needed for ARIES to accomplish its recovery procedure includes the
log, the Transaction Table, and the Dirty Page Table.

Additionally, checkpointing is used. These tables are maintained by the transaction manager and written
to the log during checkpointing.
In ARIES, every log record has an associated log sequence number (LSN) that is monotonically
increasing and indicates the address of the log record on disk. Each LSN corresponds to a specific change
(action) of some transaction. Also, each data page will store the LSN of the latest log record
corresponding to a change for that page. A log record is written for any of the following actions: updating
a page (write), committing a transaction (commit), aborting a transaction (abort), undoing an update
(undo), and ending a transaction (end). The need for including the first three actions in the log has been
discussed, but the last two need some explanation. When an update is undone, a compensation log record
is written in the log. When a transaction ends, whether by committing or aborting, an end log record is
written. Common fields in all log records include the previous LSN for that transaction, the transaction
ID, and the type of log record.
The previous LSN is important because it links the log records (in reverse order) for each
transaction. For an update (write) action, additional fields in the log record include the page ID for the
page that contains the item, the length of the updated item, its offset from the beginning of the page, the
before image of the item, and its after image.
Besides the log, two tables are needed for efficient recovery: the Transaction Table and the Dirty
Page Table, which are maintained by the transaction manager. When a crash occurs, these tables are
rebuilt in the analysis phase of recovery. The Transaction Table contains an entry for each active
transaction, with information such as the transaction ID, transaction status, and the LSN of the most
recent log record for the transaction. The Dirty Page Table contains an entry for each dirty page in the
buffer, which includes the page ID and the LSN corresponding to the earliest update to that page.
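The following is a deliberately simplified, self-contained sketch of the three ARIES passes over an in-memory log. It ignores checkpoints, compensation log records, recLSN handling and real disk pages, and every class and field name is invented for illustration.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MiniAries {
    enum Type { UPDATE, COMMIT, END }

    // A simplified log record: LSN, transaction id, type, page id, before/after images.
    record LogRecord(long lsn, String tx, Type type, String page, String before, String after) {}

    // The "database": page id -> current value.
    static Map<String, String> pages = new HashMap<>();

    public static void recover(List<LogRecord> log) {
        // 1. ANALYSIS: rebuild the Transaction Table (transactions with no COMMIT/END
        //    record are active, i.e. losers) and the set of dirty pages.
        Set<String> active = new HashSet<>();
        Set<String> dirty = new HashSet<>();
        for (LogRecord r : log) {
            if (r.type() == Type.UPDATE) { active.add(r.tx()); dirty.add(r.page()); }
            else active.remove(r.tx());
        }
        System.out.println("Active at crash: " + active + ", dirty pages: " + dirty);

        // 2. REDO: repeat history - reapply every logged update in forward order.
        for (LogRecord r : log)
            if (r.type() == Type.UPDATE)
                pages.put(r.page(), r.after());

        // 3. UNDO: scan backward and roll back updates of transactions still active.
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            if (r.type() == Type.UPDATE && active.contains(r.tx()))
                pages.put(r.page(), r.before());
        }
    }

    public static void main(String[] args) {
        List<LogRecord> log = List.of(
            new LogRecord(1, "T1", Type.UPDATE, "A", "10", "20"),
            new LogRecord(2, "T2", Type.UPDATE, "B", "5", "7"),
            new LogRecord(3, "T1", Type.COMMIT, "", "", ""),
            new LogRecord(4, "T2", Type.UPDATE, "A", "20", "99"));   // T2 never commits

        recover(log);
        System.out.println(pages);   // A=20, B=5: T1's update survives, T2's updates are undone
    }
}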
Result:
__________________________________________________________________________
Conclusion:
__________________________________________________________________________
__________________________________________________________________________
Industrial Application:
1. Numeric and Textual Databases in Business Applications.
2. Ransomware recovery for endpoints and servers.

Questionnaire:

1. Is it true that two-phase commit is not a recovery technique?

_________________________________________________________________________________
_________________________________________________________________________________
2. Checkpoints are a part of which system?

_________________________________________________________________________________

3.______________deals with soft errors, such as power failures.

4._______________is an essential part of any backup system.

5. Media recovery deals with _______________.

6. Failure recovery and media recovery fall under _______________.

7. In which technique can the database be restored up to the last consistent state after a system failure?

8. The _______________ consists of the various applications and databases that play a role in a backup
and recovery strategy.

9. The simplest approach to introducing redundancy, in which every disk is duplicated, is called
_____________________

10. Replaying the actions in the same order in which they were originally recorded is called
______________ history.

Experiment No.: 06

Data Warehouse Design


Aim: Real life Problem to be defined for Warehouse Design.

The ETL and Data Warehousing tutorial is organized into lessons representing various business intelligence
scenarios, each of which describes a typical data warehousing challenge.
This guide might be considered as an ETL process and Data Warehousing knowledge base with a series of
examples illustrating how to manage and implement the ETL process in a data warehouse environment.

The purpose of this tutorial is to outline and analyze the most widely encountered real-life data warehousing
problems and challenges that need to be addressed during the design and architecture phases of a successful data
warehouse project deployment.

Going through the sample implementations of the business scenarios is also a good way to compare Business
Intelligence and ETL tools and get to know the different approaches to designing the data integration process.
This also gives an idea and helps identify strong and weak points of various ETL and data warehousing
applications.

This tutorial shows how to use the following BI, ETL and data warehousing tools: Datastage, SAS, Pentaho,
Cognos and Teradata.

Data Warehousing & ETL Tutorial lessons


 Surrogate key generation example which includes information on business keys and surrogate keys and shows
how to design an ETL process to manage surrogate keys in a data warehouse environment. Sample design in
Pentaho Data Integration
 Header and trailer processing - considerations on processing files arranged in blocks consisting of a header
record, body items and a trailer. These files usually come from mainframes; the approach also applies to EDI and
EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration
 Loading customers - a data extract is placed on an FTP server. It is copied to an ETL server and loaded into the
data warehouse. Sample loading in Teradata MultiLoad
 Data allocation ETL process case study for allocating data. Examples in Pentaho Data Integration and Cognos
PowerPlay
 Data masking and scrambling algorithms and ETL deployments. Sample Kettle implementation
 Site traffic analysis - a guide to creating a data warehouse with data marts for website traffic analysis and
reporting. Sample design in Pentaho Kettle
 Real time ETL with Kafka streaming - a guide to transition from traditional ETL to modern real-time
streams. Sample design in Pentaho Kettle
 Data Quality - ETL process design aimed to test and cleanse data in a Data Warehouse. Sample outline in PDI
 XML ETL processing and parsing example
Result:

__________________________________________________________________________

Conclusion:

__________________________________________________________________________
__________________________________________________________________________

Industrial application:

1. Fraud detection in retail companies using point-of-sale data and discounts given.


2. Long-tail analysis in e-commerce companies like Amazon, eBay, iTunes, etc.

Questionnaire:

1. Is it true that a data dictionary is a large collection of data mostly stored in a computer
system?
______________________________________________________________________
2. In what is the data stored, retrieved and updated?
_________________________________________________________________________
3. A ________________ system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
4. ___________________ are designed to overcome any limitations placed on the
warehouse by the nature of the relational data model.
5. The biggest drawback of the level indicator in the classic star schema is that it limits
what?
_________________________________________________________________________________
__________________________________________________________________________
___________________________________________________________________
6. What are responsible for running queries and reports against data warehouse tables?
_________________________________________________________________________________
__________________________________________________________________________
___________________________________________________________________
7. Business Intelligence and data warehousing is used for_____________________
8. Is it true that a data warehouse can be updated by end users?
_________________________________________________________________________________
___________________________________________________________________
9. What is the goal of data mining ?
_________________________________________________________________________________
___________________________________________________________________
10. Is it true that a data warehouse can be updated by the end user?
_________________________________________________________________________________
_____________________________________________________________________

Experiment No.: 07

Star schema
Aim: Construction of star schema.

Theory:

This star schema case study at Birla Sun Life Insurance involves the handling of
information related to policies and customers in the system. The core OLTP system, Ingenieum,
handles day-to-day transactions as a master system, along with the receipt writer and channel
management system. The consolidated data sets come together in the database that
hosts different schemas, including the star schema.

Identify the source

One needs to know the source of the OLTP systems to build a robust architecture that can
support schemas and run queries. In our scenario we have DB2, SQL
Server and Oracle as OLTP databases and Teradata as the OLAP system.

At the end of the timeline that has been fixed, there will be no more transactions in the OLTP.
Data transfer from OLTP to landing schema is achieved using an ETL tool or a script that
processes batch files of data, which are targeted to the data warehouse where the multiple
schema structures will work with these files.

The star schema design will depend on the deliverables that are expected in terms of
reporting, analysis and performance trending. The core structure of the star schema comprises
basic fact and dimension tables.

Build fact and dimension tables

The fact table holds the measures (facts). These are numeric and additive across some or all of
the dimensions. For example, sales figures are numeric and users can look at total sales for a
product, or category, or subcategory, and by any time period.

While the dimension tables are short and fat, the fact tables are generally long and skinny. They
are long because they can hold the number of records represented by the product of the counts in
all the dimension tables.

For example, consider the simplified star schema depicted in Figure 1. In this
schema, there are product, time and store dimensions. With 10 years of daily data, 200 stores and
500 products, there is a potential of 365,000,000 records (3,650 days * 200 stores * 500
products). Adding another dimension, such as 10,000 customers, would multiply the potential
number of records by 10,000.

Figure 1 Star schema in database.

The primary key in the fact table is made up of foreign keys that have migrated from the
dimension tables. These fields are typically integer values. In addition, the measures are also
numeric. Therefore, each record in the fact table is generally much narrower than the records in the
dimension tables.
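
As a minimal sketch, the simplified schema of Figure 1 might be created with DDL along the following lines (the table and column names here are illustrative assumptions, not taken from the case study):

-- Dimension tables: short and fat (few rows, many descriptive columns)
create table DIM_STORE (
    STORE_ID   integer      not null,
    STORE_NAME varchar(100) not null,
    REGION     varchar(50)  not null,
    constraint DIM_STORE_PK primary key (STORE_ID)
);
create table DIM_PRODUCT (
    PRODUCT_ID integer      not null,
    NAME       varchar(254) not null,
    CATEGORY   varchar(100) not null,
    constraint DIM_PRODUCT_PK primary key (PRODUCT_ID)
);
create table DIM_TIME (
    TIME_ID  integer not null,
    DAY_DATE date    not null,
    WEEK_NO  integer not null,
    MONTH_NO integer not null,
    YEAR_NO  integer not null,
    constraint DIM_TIME_PK primary key (TIME_ID)
);

-- Fact table: long and skinny; its primary key is made up of the migrated foreign keys
create table FACT_SALES (
    STORE_ID   integer       not null,
    PRODUCT_ID integer       not null,
    TIME_ID    integer       not null,
    SALES_AMT  decimal(12,2) not null,   -- numeric, additive measure
    UNITS_SOLD integer       not null,   -- numeric, additive measure
    constraint FACT_SALES_PK  primary key (STORE_ID, PRODUCT_ID, TIME_ID),
    constraint FACT_SALES_FK1 foreign key (STORE_ID)   references DIM_STORE(STORE_ID),
    constraint FACT_SALES_FK2 foreign key (PRODUCT_ID) references DIM_PRODUCT(PRODUCT_ID),
    constraint FACT_SALES_FK3 foreign key (TIME_ID)    references DIM_TIME(TIME_ID)
);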

Fact granularity

The granularity, or frequency, of the data is determined by the lowest level of granularity of each
dimension table. For example, a fact table may store weekly or monthly totals for individual
products. The lower the granularity, the more records that will exist in the fact table. The
granularity also determines how far users can drill down without returning to the base,
transaction-level data.

A major benefit of the star schema in the database is that the low-level transactions may be
summarized to the fact table grain. This speeds the queries performed as part of the decision
support process. However, the aggregation or summarization of the fact table is not always done
if cubes are being built.

Fact table size

The previous section discussed how 500 products sold in 200 stores and tracked for 10 years
could produce 365,000,000 records in a fact table with a daily grain. This is the maximum size
for the table. The fact table need not store a zero value unless it has some significance. So, if a
product did not sell at a particular store on a particular day, the system simply would not store a
row for that combination.

Even though the fact table is sparsely populated, it still holds the vast majority of the records in
the database and is responsible for almost all of the disk space used. The lower the granularity,
the larger the fact table. In the previous example, moving from a daily to weekly grain would
reduce the potential number of records to just about 52,000,000 records.

The data types for the fields in the fact table help keep it as small as possible. In most fact tables,
all of the fields are numeric, which can require less storage space than the long descriptions we
find in the dimension tables.

Finally, be aware that each added dimension can greatly increase the size of the fact table. If just
one dimension was added to the previous example that included 20 possible values, the potential
number of records would reach 7.3 billion.

Changing attributes

With a star schema, changing dimension attributes is an issue. For example, in the
StoreDimension in Figure 1, if the regions, territories or zones of stores are changed for any reason,
the numbers will not be accurate for historical sales. By simply updating the region for a store,
the total sales for that region will appear as if the current structure had always been true. Thus,
the business would have "lost" history.

In fact, the company might want to see what the sales would have been had this store been in that
other region in prior years. More often, businesses do not want to change the historical data. The
typical approach is to create a new record for the store, which contains the new region, but leave
the old store record (and therefore the old regional sales data) intact. The drawback is that the
company can no longer compare this store's current sales to its historical sales unless the two
records can be linked. The answer is to keep the existing StoreName (the natural key from the source
system) on both records and add BeginDate and EndDate fields to indicate when each
record is active. The StoreID is a surrogate key, so each record has a different StoreID but the
same StoreName, allowing data to be examined for the store across time regardless of
its reporting structure.
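
A minimal sketch of this approach (commonly known as a Type 2 slowly changing dimension), using illustrative column names, values and ANSI date literals:

-- STORE_ID is a surrogate key; STORE_NAME is the natural key from the source system
create table STORE_DIMENSION (
    STORE_ID   integer      not null,    -- surrogate key: different for every version
    STORE_NAME varchar(100) not null,    -- natural key: the same for every version of the store
    REGION     varchar(50)  not null,
    BEGIN_DATE date         not null,
    END_DATE   date,                     -- null = currently active version
    constraint STORE_DIMENSION_PK primary key (STORE_ID)
);

-- The store moves from region 'North' to region 'West' on 2024-01-01:
update STORE_DIMENSION
   set END_DATE = date '2023-12-31'
 where STORE_NAME = 'Store 42'
   and END_DATE is null;

insert into STORE_DIMENSION (STORE_ID, STORE_NAME, REGION, BEGIN_DATE, END_DATE)
values (1042, 'Store 42', 'West', date '2024-01-01', null);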

Result:

_________________________________________________________________________________
Conclusion:

_________________________________________________________________________________

Industrial Application:

1. Creating metadata for the global star schema in OLAP.


2. V&V of the sale fact analyzed by period, documents, and publisher dimensions.

Questionnaire:

1. Data scrubbing is which of the following?


_________________________________________________________________________________
_______________________________________________________________
2. Is it true “In a star schema, dimension tables are long and thin”?

________________________________________________________________________
3. What is the meaning of Snowflaking?
_________________________________________________________________________________
_______________________________________________________________
4. Is it true “The sequence of jobs to load data into the warehouse is: first load data into fact
tables, then dimension tables, then aggregates if any”?

________________________________________________________________________
5. In which type of SCD(Slowly changing dimensions) do we preserve history of
data?

________________________________________________________________________
6. Is it true “A consolidated data mart is a first-level data mart”?

________________________________________________________________________
7. What types of load do we generally have during ETL?

________________________________________________________________________
8. In the four-step dimensional design process, what does declaring the grain of the business process mean?

________________________________________________________________________
9. Is it true “Drill across generally uses a self join to generate a report”?

________________________________________________________________________
10. Is it true “In general, data in a data warehouse is normalized”?

________________________________________________________________________

Experiment No.: 08

ETL Operation

Aim: ETL Operations.

Theory:

ETL (Extract-Transform-Load)
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the
process of how data is loaded from the source systems into the data warehouse. Currently, the
ETL process often includes a cleaning step as a separate step, so the sequence becomes Extract-Clean-
Transform-Load. Let us briefly describe each step of the ETL process.

Process

Extract
The Extract step covers the data extraction from the source system and makes it accessible for
further processing. The main objective of the extract step is to retrieve all the required data from
the source system with as few resources as possible. The extract step should be designed in such a
way that it does not negatively affect the source system in terms of performance, response time
or any kind of locking.

There are several ways to perform the extract:

 Update notification - if the source system is able to provide a notification that a record has been
changed and describe the change, this is the easiest way to get the data.
 Incremental extract - some systems may not be able to provide notification that an update has
occurred, but they are able to identify which records have been modified and provide an extract
of such records. During further ETL steps, the system needs to identify these changes and propagate
them down. Note that, when using a daily extract, we may not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has been changed at all, so a full
extract is the only way one can get the data out of the system. The full extract requires keeping a
copy of the last extract in the same format in order to be able to identify changes. Full extract
handles deletions as well.

When using incremental or full extracts, the extract frequency is extremely important,
particularly for full extracts, where the data volumes can be in tens of gigabytes.
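
For instance, an incremental extract often relies on a last-modified timestamp in the source table; the SRC_INVOICE table and its LAST_UPDATED column below are assumptions, since not every source system provides such a column:

-- Incremental extract: pull only the rows changed since the previous run
select *
  from SRC_INVOICE
 where LAST_UPDATED > :last_extract_time;  -- timestamp recorded at the end of the previous run

-- Full extract: dump the whole table; deletions are detected later by comparing
-- this dump with the copy kept from the previous extract
select *
  from SRC_INVOICE;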

Clean
The cleaning step is one of the most important as it ensures the quality of the data in the data
warehouse. Cleaning should perform basic data unification rules, such as:

 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to the standard Male/Female/Unknown), as illustrated in the sketch after this list
 Convert null values into a standardized Not Available/Not Provided value
 Convert phone numbers and ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
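
A sketch of such unification rules in SQL, assuming a staging table STG_CUSTOMER holding the raw, free-form source values:

-- Translate the various sex codes to the standard Male/Female/Unknown values
update STG_CUSTOMER
   set SEX = case
               when upper(SEX) in ('M', 'MALE', 'MAN')     then 'Male'
               when upper(SEX) in ('F', 'FEMALE', 'WOMAN') then 'Female'
               else 'Unknown'
             end;

-- Convert missing phone numbers into a standardized value
update STG_CUSTOMER
   set PHONE = 'Not Available'
 where PHONE is null or trim(PHONE) = '';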

Transform
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension (i.e. conformed dimension) using
the same units so that they can later be joined. The transformation step also requires joining data
from several sources, generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
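
As an illustration, a transformation that joins a staging table to conformed dimensions, converts amounts to a single currency and aggregates to the fact table grain might look like this (all table and column names are assumptions):

-- Look up dimension keys, conform the currency and aggregate to the daily grain
insert into FACT_SALES (STORE_ID, PRODUCT_ID, TIME_ID, SALES_AMT, UNITS_SOLD)
select st.STORE_ID,
       p.PRODUCT_ID,
       t.TIME_ID,
       sum(s.AMOUNT * s.EXCHANGE_RATE),   -- conformed unit: everything in one currency
       sum(s.QUANTITY)
  from STG_SALES s
  join DIM_STORE   st on st.STORE_NAME = s.STORE_NAME
  join DIM_PRODUCT p  on p.NAME        = s.PRODUCT_NAME
  join DIM_TIME    t  on t.DAY_DATE    = s.SALE_DATE
 group by st.STORE_ID, p.PRODUCT_ID, t.TIME_ID;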

Load
During the load step, it is necessary to ensure that the load is performed correctly and with as
few resources as possible. The target of the load process is often a database. In order to make
the load process efficient, it is helpful to disable any constraints and indexes before the load and
enable them again only after the load completes. The referential integrity then needs to be maintained
by the ETL tool to ensure consistency.
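
The exact statements are vendor-specific; a rough sketch in Oracle-style syntax (the index and constraint names are assumptions) could be:

-- Disable the foreign key and mark the index unusable before the bulk load
alter table FACT_SALES disable constraint FACT_SALES_FK1;
alter index FACT_SALES_IX1 unusable;

-- ... bulk load the fact table here ...

-- Rebuild the index and re-enable the constraint after the load completes
alter index FACT_SALES_IX1 rebuild;
alter table FACT_SALES enable constraint FACT_SALES_FK1;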

Managing ETL Process


The ETL process seems quite straightforward. As with every application, there is a possibility
that the ETL process fails. This can be caused by missing extracts from one of the systems,
missing values in one of the reference tables, or simply a connection or power outage. Therefore,
it is necessary to design the ETL process keeping fail-recovery in mind.

Staging
It should be possible to restart, at least, some of the phases independently from the others. For
example, if the transformation step fails, it should not be necessary to restart the Extract step. We
can ensure this by implementing proper staging. Staging means that the data is simply dumped to
the location (called the Staging Area) so that it can then be read by the next processing phase.
The staging area is also used during the ETL process to store intermediate results of processing.
However, the staging area should be accessed by the ETL process only. It should never be
available to anyone else, particularly not to end users, as it is not intended for data presentation
and may contain incomplete or in-the-middle-of-the-processing data.

Fig : ETL Operation scenario

ETL Tool Implementation


When you are about to use an ETL tool, there is a fundamental decision to be made: will the
company build its own data transformation tool or will it use an existing tool?

Building your own data transformation tool (usually a set of shell scripts) is the preferred
approach for a small number of data sources which reside in storage of the same type. The reason
is that the effort to implement the necessary transformations is small, thanks to the similar data
structures and common system architecture. Also, this approach saves licensing costs and there is
no need to train the staff in a new tool. This approach, however, is dangerous from the TCO
(total cost of ownership) point of view. If the transformations become more sophisticated over time, or there is a
need to integrate other systems, the complexity of such an ETL system grows but its
manageability drops significantly. Similarly, implementing your own tool often
resembles re-inventing the wheel.

Result:

_________________________________________________________________________________

Conclusion:

_________________________________________________________________________________
_________________________________________________________________________________

Industrial Application:
1. An educational publishing company (EPC) wanted to replace its core systems,
including finance, order management, e-commerce, and content management.
Early in the initiative, it was determined that a key element for success would be a
data quality program instituted as part of the data conversion and integration
effort. Experience and research into the common causes of failure for large ERP
implementations highlighted this as a substantial risk area.
2. Global Planning Data Automation.

Questionnaire:

1. How many steps are there in ETL process?


______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
2. What are the steps involved in ETL process?
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
3. Can there be sub steps of ETL steps?
______________________________________________________________________________
______________________________________________________________________________
4. What is initial load and what is full load?
______________________________________________________________________________
______________________________________________________________________________
5. What is meant by incremental load?
______________________________________________________________________________
______________________________________________________________________________
6. What are the names of the layers in ETL?
______________________________________________________________________________
______________________________________________________________________________
7. List a few ETL bugs.
______________________________________________________________________________
______________________________________________________________________________
8. List the types of facts?
______________________________________________________________________________
______________________________________________________________________________
9. Explain what ETL testing operations include.
______________________________________________________________________________
______________________________________________________________________________
10. Why do we need a staging area in an ETL process?
______________________________________________________________________________
______________________________________________________________________________

Experiment No.: 09

Data Cubes
Aim: Construction of Cubes.

Theory:

When we try to extract information from a stack of data, we need tools to help us find what's
relevant and what's important and to explore different scenarios. A report, whether printed on
paper or viewed on-screen, is at best a two-dimensional representation of data, a table using
columns and rows. That's sufficient when we have only two factors to consider, but in the real
world we need more powerful tools.

Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-
dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think
of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.

But data cubes aren't restricted to just three dimensions. Most online analytical processing
(OLAP) systems can build data cubes with many more dimensions—Microsoft SQL Server 2000
Analysis Services, for example, allows up to 64 dimensions. We can think of a 4-D data cube as
consisting of a series of 3-D cubes, though visualizing such higher-dimensional entities in spatial
or geometric terms can be a problem.

In practice, therefore, we often construct data cubes with many dimensions, but we tend to look
at just three at a time. What makes data cubes so valuable is that we can index the cube on one or
more of its dimensions.
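
In SQL, many databases can materialize such a cube directly with GROUP BY CUBE; a sketch over an assumed sales star schema (table and column names are illustrative):

-- Total sales for every combination of region, product category and year,
-- including all subtotals and the grand total (the full data cube)
select st.REGION,
       p.CATEGORY,
       t.YEAR_NO,
       sum(f.SALES_AMT) as TOTAL_SALES
  from FACT_SALES f
  join DIM_STORE   st on st.STORE_ID  = f.STORE_ID
  join DIM_PRODUCT p  on p.PRODUCT_ID = f.PRODUCT_ID
  join DIM_TIME    t  on t.TIME_ID    = f.TIME_ID
 group by cube (st.REGION, p.CATEGORY, t.YEAR_NO);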

An OLAP cube basically takes a spreadsheet and three-dimensionalizes the experience of
analysis. Breaking it down, OLAP means analytical data as opposed to transactional, and the
cube part of the nomenclature refers to the storage aspect. OLAP cubes are basically multi-
dimensional databases. They store data for analysis, and a lot of BI products rely on OLAP
cubes for access to company information for reports, budgets, or dashboards. For example, a
CFO might want to report on company financial data by location, by month, or by product –
these elements would make up the dimensions of this cube. However, OLAP cubes are not SQL
server relational databases, like data warehouses are.
OLAP cubes are not an open SQL Server data warehouse, so they require someone with the
know-how and the experience to maintain them, whereas a SQL Server data warehouse can be
maintained by most IT people who have regular database training. This aspect accordingly has a
price tag attached to it, whether that means allocating time and energy from a current employee
to focus on management of the OLAP cube or hiring a new, perhaps full-time, employee for
this role. Additionally, OLAP cubes tend to be more rigid and limited when it comes to
designing reports because of their table-like functionality. Aesthetics and capabilities could, and
arguably should, be important to a company that is building its portfolio of BI solutions.
The take-home is that, when looking for BI tools, one should consider what format of data
storage the product utilizes for analysis. Knowing and understanding the pros and cons of data
warehouses and OLAP cubes should result in a preference – and the storage of company data,
in exponentially bigger amounts by the year, is arguably a foundation for successful analysis.
Result:
______________________________________________________________________________

Conclusion:
______________________________________________________________________________
______________________________________________________________________________

Industrial Application:

1. An object-oriented approach with the help of the Unified Modeling Language (UML) is
used to create an OLAP cube that supports three-dimensional data. The Indian Postal
Service is considered as a case for storing a large amount of data in the OLAP cube,
which can then easily be used for analysis purposes.
2. Log Analysis as an OLAP Application in SANS Institute InfoSec Reading Room.

Questionnaire:
1. Data warehouse architecture is based on______________
2. ______________________supports basic OLAP operations, including slice and dice, drill-down,
roll-up and pivoting.
3. The core of the multidimensional model is the _____________ , which consists of a large set of
facts and a number of dimensions.
4. The data from the operational environment enter ________________of data warehouse.
5. Is it true “The data from the operational environment enters the current detail data of the data
warehouse”? ________________
6. Business intelligence and data warehousing are used for ________________
7. The generalization of a cross-tab, represented visually, is ____________, which is also
called a data cube.
8. What includes a description of each relation in a database, and each piece of information in
each relation?
______________________________________________________________________________
______________________________________________________________________________
9. Which DBMS structure uses data cubes to represent multiple dimensions?
______________________________________________________________________________
______________________________________________________________________________
10. The diagrams such as cubes and cylinders are classified as?
______________________________________________________________________________
______________________________________________________________________________

Experiment No.: 10
OLAP Operations
Aim: OLAP Operations, OLAP Queries.

Theory:

CONCEPT HIERARCHIES

In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides
users with the flexibility to view data from different perspectives.

For example, if we have attributes such as day, temperature and humidity, we can group their values
into subsets and name these subsets, thus obtaining a set of hierarchies as shown in the figure below.

Fig : concept hierarchy

OLAP OPERATIONS

OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP
data cube operations exist to materialize different views of data, allowing interactive querying
and analysis of the data.

The most popular end user operations on dimensional data are:

Roll up

The roll-up operation (also called the drill-up or aggregation operation) performs aggregation on a
data cube, either by climbing up a concept hierarchy for a dimension or by removing one or more
dimensions, i.e. dimension reduction. Let us explain roll-up with an example:

Consider the following cube illustrating temperature of certain days recorded weekly:

Fig : Example.

Assume we want to set up levels (hot(80-85), mild(70-75), cold(64-69)) in temperature from the
above cube. To do this we have to group columns and add up the values according to the concept
hierarchy. This operation is called roll-up.
By doing this we obtain the following cube:

Fig : Rollup.

The concept hierarchies involved here are day --> week for the time dimension and raw temperature
values --> levels (cold, mild, hot). The roll-up operation groups the data by these levels of temperature.
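
A sketch of this roll-up in SQL, assuming a WEATHER table with one row per day holding DAY_NO, WEEK_NO and TEMPERATURE (the table and its columns are assumptions):

-- Roll up from individual days to weeks and from raw temperatures to levels
select WEEK_NO,
       case
         when TEMPERATURE between 80 and 85 then 'hot'
         when TEMPERATURE between 70 and 75 then 'mild'
         when TEMPERATURE between 64 and 69 then 'cold'
       end      as TEMPERATURE_LEVEL,
       count(*) as DAYS_AT_LEVEL
  from WEATHER
 group by WEEK_NO,
          case
            when TEMPERATURE between 80 and 85 then 'hot'
            when TEMPERATURE between 70 and 75 then 'mild'
            when TEMPERATURE between 64 and 69 then 'cold'
          end;
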
Roll Down

The roll down operation (also called drill down) is the reverse of roll up. It navigates from less
detailed data to more detailed data. It can be realized by either stepping down a concept
hierarchy for a dimension or by introducing additional dimensions. Performing the roll-down operation
on the same cube mentioned above:

Fig : Rolldown.

The result of a drill-down operation performed on the cube above is obtained by stepping down a
concept hierarchy: drill-down occurs by descending the time hierarchy from the level of week to the
more detailed level of day. New dimensions can also be added to the cube, because drill-down adds
more detail to the given data.
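
Conversely, a drill-down re-queries the data at the finer daily grain; a sketch against the same assumed WEATHER table:

-- Drill down from the weekly level back to the individual daily readings
select WEEK_NO,
       DAY_NO,
       TEMPERATURE
  from WEATHER
 order by WEEK_NO, DAY_NO;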

Slicing

Slice performs a selection on one dimension of the given cube, thus resulting in a subcube. For
example, in the cube above, if we make the selection temperature = cool, we will obtain
the following cube:
Fig : Slicing.
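
In SQL terms the slice is simply a selection on a single dimension; a sketch over the same assumed WEATHER table, taking 64-69 as the cool band:

-- Slice: fix one dimension (temperature = cool) and keep the others
select WEEK_NO, DAY_NO, TEMPERATURE
  from WEATHER
 where TEMPERATURE between 64 and 69;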

Dicing

The dice operation defines a subcube by performing a selection on two or more dimensions. For
example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube we get the following subcube (still two-dimensional):

Fig :Dice
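
The same dice expressed as a selection on two dimensions (again over the assumed WEATHER table):

-- Dice: restrict two dimensions at once
select WEEK_NO, DAY_NO, TEMPERATURE
  from WEATHER
 where DAY_NO in (3, 4)
   and (TEMPERATURE between 64 and 69      -- cool
        or TEMPERATURE between 80 and 85); -- hot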

Pivot

Pivot, otherwise known as rotate, changes the dimensional orientation of the cube, i.e. it rotates the
data axes to view the data from different perspectives. Pivot groups data with different
dimensions. The cubes below show a 2-D representation of pivot.

Fig :Pivot
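
A portable way to express a pivot is conditional aggregation, which turns the members of one dimension into columns (a sketch over the assumed WEATHER table; some DBMSs also offer dedicated PIVOT syntax):

-- Pivot: temperature levels become columns, weeks stay as rows
select WEEK_NO,
       count(case when TEMPERATURE between 80 and 85 then 1 end) as HOT_DAYS,
       count(case when TEMPERATURE between 70 and 75 then 1 end) as MILD_DAYS,
       count(case when TEMPERATURE between 64 and 69 then 1 end) as COOL_DAYS
  from WEATHER
 group by WEEK_NO;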

Other OLAP operations

Some more OLAP operations include:

SCOPING: Restricting the view of database objects to a specified subset is called scoping.
Scoping allows users to receive and update only the data values they wish to work with.

SCREENING: Screening is performed against the data or members of a dimension in order to
restrict the set of data retrieved.

DRILL ACROSS: Accesses more than one fact table that is linked by common dimensions;
combines cubes that share one or more dimensions.

DRILL THROUGH: Drills down through the bottom level of a data cube to its back-end
relational tables.

Result:
Conclusion:

Industrial Application:
1. On-line analytical processing and data mining applied to the analysis of the curricula chosen by
students: this accomplishes the design of a data warehouse about university curriculum choices,
extracts and transforms the curriculum-choice data, and then loads it into the data warehouse.


Questionnaire:

1. Is it true “An OLAP database gives better performance”?


______________________________________________________________________________

2. What is the source of OLAP cube metadata?


______________________________________________________________________________
______________________________________________________________________________

3. ________________supports basic OLAP operations, including slice and dice, drill-down, roll-up
and pivoting.

4. Can the user perform data entry or editing task on OLAP data?
______________________________________________________________________________
______________________________________________________________________________

5. Is it true “Cognos is an OLAP vendor”?


______________________________________________________________________________

6. What is the other name for OLAP cube?


______________________________________________________________________________

7. ____________________ operation of OLAP involves computing all of the data relationships for
one or more dimensions.

8. It is because of _______________ that OLAP achieves great performance for a query.

9. Is it true “The output of an OLAP query is displayed as a matrix”?


______________________________________________________________________________
______________________________________________________________________________

10. What is a good alternative to the star schema?


______________________________________________________________________________
______________________________________________________________________________

References
Text Books:
1. Elmasri and Navathe, “Fundamentals of Database Systems”, 6th Edition, PEARSON
Education.
2. Theraja Reema, “Data Warehousing”, Oxford University Press, 2009.
3. Alex Berson, “Data Warehousing, Data Mining, & OLAP”, McGraw Hill.

References:
1. Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom, “Database System
Implementation”, 1/e, Pearson Ltd.
2. Thomas M. Connolly, Carolyn Begg, “Database Systems: A Practical Approach to Design,
Implementation and Management”, 4/e, Pearson Ltd.
3. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Definitive Guide To
Dimensional Modeling”, 3rd Edition. Wiley India.

Web Resources:
1. https://www.cacheonix.org/articles/Caching_for_Java_Applications.htm
2. https://www.cacheonix.org/articles/Caching_Database_Queries_in_Java.htm
3. https://www.tutorialspoint.com/dwh/index.htm
4. https://www.guru99.com/data-warehousing-tutorial.html
5. https://intellipaat.com/tutorial/data-warehouse-tutorial/
