AD Theory 3
1. Top-down approach:
1. External Sources –
An external source is any source from which data is collected, irrespective of the type of
data. The data can be structured, semi-structured or unstructured.
2. Staging Area –
Since the data extracted from the external sources does not follow a particular format, it
needs to be validated before being loaded into the data warehouse. For this purpose, it is
recommended to use an ETL (Extract, Transform, Load) tool:
E (Extract): Data is extracted from the external sources.
T (Transform): Data is transformed into the standard format.
L (Load): The transformed data is loaded into the data warehouse.
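The staging step above can be sketched in a few lines of Python. This is a minimal, illustrative ETL sketch, not a real ETL tool: the record fields, the `clean_record` helper and the `warehouse` list are all invented names for illustration.

```python
# Minimal ETL sketch: records extracted from external sources arrive in
# mixed, messy formats; the staging step validates and standardizes them
# before they are loaded. All names here are illustrative.

raw_records = [                      # Extract: data pulled from external sources
    {"name": " Alice ", "amount": "100"},
    {"name": "bob", "amount": "250"},
    {"name": "", "amount": "x"},     # invalid row, should be rejected
]

def clean_record(rec):
    """Transform: trim, normalize case, cast types; reject bad rows."""
    name = rec["name"].strip().title()
    if not name or not rec["amount"].isdigit():
        return None
    return {"name": name, "amount": int(rec["amount"])}

warehouse = []                       # Load: standardized rows only
for rec in raw_records:
    cleaned = clean_record(rec)
    if cleaned is not None:
        warehouse.append(cleaned)

print(warehouse)
# [{'name': 'Alice', 'amount': 100}, {'name': 'Bob', 'amount': 250}]
```

The invalid third row is dropped during the transform step, which is the role the staging area plays before data reaches the warehouse.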
3. Data Warehouse –
After cleansing, the data is stored in the data warehouse as a central repository. The
warehouse actually stores the metadata, while the actual data is stored in the data marts.
Note that in this top-down approach the data warehouse stores the data in its purest form.
4. Data Marts –
A data mart is also part of the storage component. It stores the information of a particular
function of an organisation, handled by a single authority. There can be as many data
marts in an organisation as there are functions. We can also say that a data mart contains a
subset of the data stored in the data warehouse.
5. Data Mining –
Data mining is the practice of analysing the large volumes of data present in the data
warehouse. Data-mining algorithms are used to find the hidden patterns present in the
database or the data warehouse.
Inmon defines this approach as follows: the data warehouse is a central repository for the
complete organisation, and the data marts are created from it after the complete data
warehouse has been built.
Bottom-up approach:
1. First, the data is extracted from the external sources (as in the
top-down approach).
2. Then, the data goes through the staging area (as explained above) and is
loaded into data marts instead of the data warehouse. The data marts are
created first and provide reporting capability; each addresses a single
business area.
The operational database is the source of information for the data warehouse. It
includes the detailed information used to run the day-to-day operations of the business.
The data changes frequently as updates are made and reflects the current values of the
latest transactions.
Operational database management systems, also called OLTP (Online Transaction
Processing) databases, are used to manage dynamic data in real time.
Data warehouse systems serve users or knowledge workers for the purposes of data
analysis and decision-making. Such systems can organize and present information in
specific formats to accommodate the diverse needs of various users. These systems
are called Online Analytical Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the
goals of both these databases are different.
Operational (OLTP) systems                   Data warehouse (OLAP) systems
-------------------------------------------  -------------------------------------------
Data is mainly updated regularly             Data is non-volatile: new data may be
according to need.                           added regularly but, once added, is
                                             rarely changed.
Widely process-oriented.                     Widely subject-oriented.
Relational databases created for online      Designed for online analytical
transaction processing (OLTP).               processing (OLAP).
It is said to be a star schema because its physical model resembles a star shape, with a fact
table at its centre and the dimension tables at its periphery representing the star's
points. Below is an example to demonstrate the star schema:
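A minimal star schema can also be sketched in SQLite, with one central fact table referencing the dimension tables through foreign keys. All table names, columns and sample values here are invented for illustration.

```python
import sqlite3

# Star-schema sketch: fact_sales at the centre, dim_product and dim_date
# at the points. Names and sample data are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date    VALUES (10, '2024-01-01'), (11, '2024-01-02');
INSERT INTO fact_sales  VALUES (1, 10, 3, 3000.0), (2, 10, 1, 200.0), (1, 11, 2, 2000.0);
""")

# A typical star query: one join from the fact to a dimension, then aggregate.
rows = con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchall()
print(dict(rows))   # {'Electronics': 5000.0, 'Furniture': 200.0}
```

Because every dimension is one join away from the fact table, star-schema queries stay simple: one join per dimension used.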
A factless fact table is a fact table that contains only foreign keys to dimension
tables and no measures (facts). Factless fact tables are used to establish
relationships between elements of different dimensions. They are also useful for
describing events and coverage; that is, such tables can record that something did,
or did not, happen. They often represent many-to-many relationships.
There are two types of factless fact tables:
1. Event Tracking Tables –
A factless fact table can be used to track events of interest to the organization. For
example, attendance at a cultural event can be tracked by creating a fact table that
contains the following foreign keys (i.e. links to dimension tables): event
identifier, speaker/entertainer identifier, participant identifier, event type and date.
This table can then be queried for information such as which cultural program or
program type is the most popular. The example below shows a factless fact
table that records each time a student attends a course; it can answer questions
such as which class has the maximum attendance, or what the average attendance
of a given course is. All such questions are based on COUNT(*) with GROUP BY, so
we can first count and then apply other aggregate functions such as Avg, Max and
Min.
(Figure: attendance fact table)
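The attendance example above can be sketched as follows. The table holds only foreign keys and no measures; the question "which course has the maximum attendance?" reduces to COUNT(*) with GROUP BY. Column names and sample rows are illustrative.

```python
import sqlite3

# Factless fact table sketch: fact_attendance has foreign keys only,
# no measure columns. Sample data is invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_attendance (
    student_id INTEGER,   -- FK to a student dimension
    course_id  INTEGER,   -- FK to a course dimension
    date_id    INTEGER    -- FK to a date dimension
);                        -- note: no measure columns at all
INSERT INTO fact_attendance VALUES
    (1, 101, 1), (2, 101, 1), (3, 101, 1),   -- three attendances in course 101
    (1, 102, 1), (2, 102, 2);                -- two attendances in course 102
""")

# Which course has the maximum attendance? Count rows per course.
top = con.execute("""
    SELECT course_id, COUNT(*) AS attendance
    FROM fact_attendance
    GROUP BY course_id
    ORDER BY attendance DESC
    LIMIT 1
""").fetchone()
print(top)   # (101, 3)
```

The event itself (a student attending a course) is the fact; its mere presence as a row is what gets counted.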
2. Coverage Tables –
The second type of factless fact table is called a coverage table by Ralph Kimball. It is
used to support negative-analysis reports. For example, to create a report showing that a
store did not sell a product for a certain period of time, you need a fact table that captures
all possible combinations; you can then find out what is missing.
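The "what did not sell" question can be sketched as follows: a coverage table lists every product a store could sell, and a LEFT JOIN against the actual sales fact exposes the missing combinations. All names and sample rows are invented for illustration.

```python
import sqlite3

# Coverage-table sketch: fact_coverage enumerates all possible
# store/product combinations; fact_sales records what actually sold.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_coverage (store_id INTEGER, product_id INTEGER);
CREATE TABLE fact_sales    (store_id INTEGER, product_id INTEGER, units INTEGER);
INSERT INTO fact_coverage VALUES (1, 10), (1, 20), (1, 30);  -- store 1 stocks 3 products
INSERT INTO fact_sales    VALUES (1, 10, 5), (1, 30, 2);     -- only two of them sold
""")

# Negative analysis: which stocked products have no matching sale?
unsold = con.execute("""
    SELECT c.product_id
    FROM fact_coverage c
    LEFT JOIN fact_sales s
           ON s.store_id = c.store_id AND s.product_id = c.product_id
    WHERE s.product_id IS NULL
""").fetchall()
print(unsold)   # [(20,)]
```

Without the coverage table there would be no row at all for product 20, so the absence could not be reported.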
Common examples of factless fact tables:
Visitors to the office.
Lists of people for web clicks.
Tracking student attendance or registration events.
Types of Keys:
1. Primary Key – Every row in a dimension table is identified by a
unique value, generally known as the primary key. The primary
key is a unique identifier that helps isolate each row in a dimension.
A table can have numerous records, but each record has only one
primary key.
2. Surrogate Key – These are keys generated by the system and they
generally do not have any built-in meaning. A surrogate key is UNIQUE
and SEQUENTIAL, since it is a consecutively created number for each
record in the table. It is MEANINGLESS, since it has no business
significance other than identifying each row. For example, if a
data warehouse contains information on 20,000 clients, the dimension
table will contain 20,000 surrogate keys, one for each client.
3. Foreign Key – In the fact table, the primary key of another dimension
table acts as a foreign key.
4. Alternate Key – It is also a unique value in the table and is generally
known as the secondary key of the table.
5. Composite Key – It consists of two or more attributes. For example, an
entity may have a clientID and an employeeCode together as its primary
key. Each of the attributes that make up the primary key is itself a
simple key, because each represents a unique reference when
identifying a client in one case and an employee in the other; together
they form a composite key.
6. Candidate Key – An entity type in a logical data model will have zero or
more candidate keys, also referred to simply as unique identifiers. For
instance, if we only deal with American citizens, then SSN is one
candidate key for the Person entity type, and the combination of name
and telephone number (assuming the combination is unique) is
potentially a second candidate key. Both of these keys are called
candidate keys because they are candidates to be chosen as the primary
key, an alternate key, or perhaps not a key at all within a physical data
model.
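The surrogate-key idea above can be sketched in a few lines: the warehouse assigns each new dimension member a meaningless sequential integer, independent of any natural (business) key. The `load_client` helper and the email-style natural keys are hypothetical.

```python
import itertools

# Surrogate-key sketch: a system-generated, sequential, meaningless
# integer per dimension row. Names are illustrative.
next_key = itertools.count(1)          # sequential counter, starts at 1
surrogate_of = {}                      # natural key -> surrogate key

def load_client(natural_key):
    """Assign a surrogate key the first time a client is seen."""
    if natural_key not in surrogate_of:
        surrogate_of[natural_key] = next(next_key)
    return surrogate_of[natural_key]

print(load_client("alice@example.com"))   # 1
print(load_client("bob@example.com"))     # 2
print(load_client("alice@example.com"))   # 1  (same client, same key)
```

The natural key (here, an email address) stays in the dimension row for lookups, while the fact table references only the surrogate key.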
The snowflake schema is a variant of the star schema. Here, the centralized fact
table is connected to multiple dimensions. In the snowflake schema, the dimensions
are present in normalized form in multiple related tables. The snowflake
structure materializes when the dimensions of a star schema are detailed and
highly structured, having several levels of relationship, and the child tables have
multiple parent tables. The snowflake effect affects only the dimension tables and
does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data
warehousing to represent data in a structured way that is optimized for querying
large amounts of data efficiently. In a snowflake schema, the dimension tables are
normalized into multiple related tables, creating a hierarchical or “snowflake”
structure.
In a snowflake schema, the fact table is still located at the center of the schema,
surrounded by the dimension tables. However, each dimension table is further
broken down into multiple related tables, creating a hierarchical structure that
resembles a snowflake.
For example, in a sales data warehouse, the product dimension table might be
normalized into multiple related tables, such as product category, product
subcategory and product details. Each of these tables would be related to the
product dimension table through a foreign-key relationship.
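The sales example above can be sketched as follows: the product dimension is normalized into a separate category table, so a query that touches the category name needs one more join than its star-schema equivalent. All names and sample values are illustrative.

```python
import sqlite3

# Snowflake sketch: dim_product no longer stores the category name;
# it references a normalized dim_category table instead.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);
INSERT INTO dim_category VALUES (1, 'Electronics');
INSERT INTO dim_product  VALUES (10, 'Laptop', 1), (11, 'Phone', 1);
INSERT INTO fact_sales   VALUES (10, 3000.0), (11, 800.0);
""")

# Two joins where a star schema would need only one:
rows = con.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON f.product_id  = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.category_name
""").fetchall()
print(rows)   # [('Electronics', 3800.0)]
```

The category name is now stored once rather than repeated per product, which is exactly the redundancy-versus-joins trade-off discussed below.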
The main difference between star schema and snowflake schema is that the dimension table of the
snowflake schema is maintained in the normalized form to reduce redundancy. The advantage here
is that such tables (normalized) are easy to maintain and save storage space. However, it also
means that more joins are needed to execute a query, which adversely impacts system
performance: the snowflake schema is more complex to query than a star schema, and the
extra table joins can result in slower query response times and higher resource usage in the
database. Additionally, the snowflake schema can be more difficult to understand and
maintain because of the increased complexity of the schema design.
The decision to use a snowflake schema versus a star schema in a data warehousing project will
depend on the specific requirements of the project and the trade-offs between query performance,
schema complexity, and data integrity.
Fact Constellation (Galaxy) Schema:
A fact constellation contains multiple fact tables that share dimension tables. The
dimension tables may consist of two or more sets of attributes that define information at
different grains, and the sets of attributes of the same dimension table may be populated
by different source systems.
Here, the pink-coloured dimension tables are the ones common to both star schemas,
and the green-coloured fact tables are the fact tables of their respective star schemas.
Example:
In the above demonstration:
Placement is a fact table having attributes: (Stud_roll, Company_id,
TPO_id) with facts: (Number of students eligible, Number of students
placed).
Workshop is a fact table having attributes: (Stud_roll, Institute_id,
TPO_id) with facts: (Number of students selected, Number of students
attended the workshop).
Company is a dimension table having attributes: (Company_id, Name,
Offer_package).
Student is a dimension table having attributes: (Student_roll, Name,
CGPA).
TPO is a dimension table having attributes: (TPO_id, Name, Age).
Training Institute is a dimension table having attributes: (Institute_id,
Name, Full_course_fee).
So, there are two fact tables, namely Placement and Workshop, which are part of
two different star schemas: one with fact table Placement and dimension tables
Company, Student and TPO, and another with fact table Workshop and dimension
tables Training Institute, Student and TPO. The two star schemas have two
dimension tables in common and hence form a fact constellation or galaxy schema.
Advantage: Provides a flexible schema.
Disadvantage: It is much more complex and hence, hard to implement and
maintain.
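The constellation described above can be sketched as follows. The Placement and Workshop fact tables share the Student and TPO dimensions, while Company and Training Institute each belong to one star only. Table and column names follow the example; the sample rows are invented for illustration.

```python
import sqlite3

# Fact-constellation sketch: two fact tables sharing two dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student   (stud_roll INTEGER PRIMARY KEY, name TEXT, cgpa REAL);
CREATE TABLE tpo       (tpo_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE company   (company_id INTEGER PRIMARY KEY, name TEXT, offer_package REAL);
CREATE TABLE institute (institute_id INTEGER PRIMARY KEY, name TEXT, full_course_fee REAL);

-- Two fact tables, each the centre of its own star:
CREATE TABLE placement (stud_roll INTEGER, company_id INTEGER, tpo_id INTEGER,
                        students_eligible INTEGER, students_placed INTEGER);
CREATE TABLE workshop  (stud_roll INTEGER, institute_id INTEGER, tpo_id INTEGER,
                        students_selected INTEGER, students_attended INTEGER);

INSERT INTO student   VALUES (1, 'Asha', 8.5);
INSERT INTO tpo       VALUES (7, 'Mr. Rao', 45);
INSERT INTO company   VALUES (3, 'Acme', 12.0);
INSERT INTO institute VALUES (5, 'TrainCo', 500.0);
INSERT INTO placement VALUES (1, 3, 7, 40, 25);
INSERT INTO workshop  VALUES (1, 5, 7, 30, 28);
""")

# Because Student is shared, the two facts can be analysed together:
row = con.execute("""
    SELECT s.name, p.students_placed, w.students_attended
    FROM placement p
    JOIN workshop  w ON w.stud_roll = p.stud_roll   -- shared dimension key
    JOIN student   s ON s.stud_roll = p.stud_roll
""").fetchone()
print(row)   # ('Asha', 25, 28)
```

Joining two fact tables through a shared dimension like this is what distinguishes a constellation from two independent star schemas, and it is also what makes the schema harder to maintain.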
What is a data lake?
A data lake is a central location that holds a large amount of data in its native,
raw format. Compared to a hierarchical data warehouse, which stores data in
files or folders, a data lake uses a flat architecture and object storage to store
the data. Object storage stores data with metadata tags and a unique
identifier, which makes it easier to locate and retrieve data across regions, and
improves performance. By leveraging inexpensive object storage and open
formats, data lakes enable many applications to take advantage of the data.
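The flat, object-storage model a data lake relies on can be sketched with a plain dictionary: every object is stored as raw bytes under a unique identifier, with metadata tags attached for lookup instead of a folder hierarchy. The `put_object` and `find_objects` helpers and all tag names are hypothetical, not any real object-store API.

```python
import uuid

# Data-lake sketch: flat object store keyed by unique IDs, with
# metadata tags for retrieval. All names are illustrative.
lake = {}   # object_id -> (raw bytes, metadata tags)

def put_object(data: bytes, **tags) -> str:
    """Store raw data as-is with metadata tags; return its unique ID."""
    object_id = str(uuid.uuid4())
    lake[object_id] = (data, tags)
    return object_id

def find_objects(**wanted):
    """Locate objects whose tags match, with no folder hierarchy."""
    return [oid for oid, (_, tags) in lake.items()
            if all(tags.get(k) == v for k, v in wanted.items())]

# Raw data of any format lands in its native form:
put_object(b'{"clicks": 42}', source="web", fmt="json", region="eu")
csv_id = put_object(b"id,amount\n1,100", source="pos", fmt="csv", region="us")

print(find_objects(region="us") == [csv_id])   # True
```

Note that the JSON and CSV payloads are stored untouched; only the tags are structured, which is what lets many downstream applications reuse the same raw data.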
Reliability issues
Without the proper tools in place, data lakes can suffer from data reliability issues
that make it difficult for data scientists and analysts to reason about the data. These
issues can stem from difficulty combining batch and streaming data, data
corruption and other factors.
Slow performance
As the size of the data in a data lake increases, the performance of traditional query
engines tends to degrade. Bottlenecks include metadata management, improper data
partitioning and other factors.