DB m8 9 10 11 PDF
DB m8 9 10 11 PDF
DB m8 9 10 11 PDF
4
What is a Data Warehousing?
• An electronic storage of a large amount of
information by a business or organization.
5
Data Warehousing vs. Databases
• A database is a transactional system that is set to monitor and update real-time
data in order to have only the most recent data available.
https://blog.panoply.io/hs-fs/hubfs/db_vs_dw.jpg?width=694&name=db_vs_dw.jpg
6
Alternative Names for Data Warehouse System
https://www.guru99.com/images/1/data_warehousing.png
7
History of Data Warehouse
• 1960- Dartmouth and General Mills in a joint research project, develop the terms
dimensions and facts.
• 1970- A Nielsen and IRI introduces dimensional data marts for retail sales.
• 1983- Tera Data Corporation introduces a database management system which is
specifically designed for decision support
• Data warehousing started in the late 1980s when IBM worker Paul Murphy and
Barry Devlin developed the Business Data Warehouse.
• However, the real concept was given by Inmon Bill. He was considered as a father of
data warehouse. He had written about a variety of topics for building, usage, and
maintenance of the warehouse & the Corporate Information Factory.
8
Benefits of Data Warehouse
https://cdn.educba.com/academy/wp-content/uploads/2019/10/Benefits-of-Data-Warehouse.png
9
Data Warehouse Architecture
• Simple. All data warehouses share a basic design in which metadata, summary data,
and raw data are stored within the central repository of the warehouse. The
repository is fed by data sources on one end and accessed by end users for analysis,
reporting, and mining on the other end.
• Simple with a staging area. Operational data must be cleaned and processed before
being put in the warehouse. Although this can be done programmatically, many data
warehouses add a staging area for data before it enters the warehouse, to simplify
data preparation.
10
Data Warehouse Architecture
• Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data mart.
• Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly
and informally explore new datasets or ways of analyzing data without having to
conform to or comply with the formal rules and protocol of the data warehouse.
11
Data Warehouse Architecture
The Data Warehouse Architecture
generally comprises of three tiers.
https://cdn.educba.com/academy/wp
-content/uploads/2019/05/OLAP-
Server.png
12
Data Warehouse Architecture
Top Tier
The Top Tier consists of the Client-side front end of the architecture.
The Transformed and Logic applied information stored in the Data Warehouse will be
used and acquired for Business purposes in this Tier.
Middle Tier
The Middle Tier consists of the OLAP Servers
OLAP is Online Analytical Processing Server
OLAP is used to provide information to business analysts and managers
Bottom Tier
The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data Warehouse.
13
Components of Data Warehouse
Query
Warehouse Manager
Manager
Load Manager
End-user
Access Tools
https://www.teksystems.com/-/media/teksystems/images/blogs/content%20images/conventional-data-warehouse-
architecture.png?h=327&w=601&hash=7FBF0193539FC596BBDA0A36A2373CB30D802006&hash=7FBF0193539FC596BBDA0A
36A2373CB30D802006&la=en
14
Components of Data Warehouse
1. Load Manager
15
Components of Data Warehouse
2. Warehouse Manager
16
Components of Data Warehouse
3. Query Manager
17
Components of Data Warehouse
18
Designing a Data Warehouse
Any data warehouse design must address the following:
• Specific data content
• Relationships within and between groups of data
• The systems environment that will support the data warehouse
• The types of data transformations required
• Data refresh frequency
19
Applications of Data Warehouse
1. Airline
2. Banking
3. Healthcare
4. Public Sector
5. Investment and Insurance Sector
6. Retail sectors
7. Telecommunication
8. Hospitality Industry
9. Financial services
10. Consumer goods
11. Controlled manufacturing
20
Advantages of Data Warehousing
1. Cleans data 8. Timely access to data
2. Indexes multiple types 9. Enhanced data consistency and quality
3. Secured data and its access 10. Return on Investment is high
4. Query process 11. Increase revenues
5. Enhanced business intelligence 12. Standardizes data across the organization
6. Increased system and query 13. Database Normalization
performance
7. Business Intelligence
21
Disadvantages of Data Warehousing
1. Raising ownership
2. Extra reporting
3. Data flexibility
4. Compatibility with the existing
system
5. Keeping data online
6. Dimensional technique
7. Costs
22
Example of Data Warehouse Tools:
20 Best ETL / Data Warehousing Tools in 2020
1. Cdata Sync
2. QuerySurge
3. Xplenty
4. CloverDX
5. Oracle
6. Amazon RedShift
7. Domo
8. Teradata
9. SAP
10. SAS
23
Example of Data Warehouse Tools:
20 Best ETL / Data Warehousing Tools in 2020
11. IBM DataStage
12. Informatica
13. MS SSIS
14. Talend Open Studio
15. The Ab Initio Software
16. Dundas
17. Sisense
18. TabLeau
19. MicroStrategy
20. Pentaho
24
References:
• Frankenfiled, J., 20-Aug-2019. Data Warehousing. Retrieved from Investopedia
https://www.investopedia.com/terms/d/data-warehousing.asp
• Oracle, 2020. What is Data Management?. Retrieved from
https://www.oracle.com/database/what-is-data-management/
• Oracle, 2020. What is a Data Warehouse?. Retrieved from
https://www.oracle.com/database/what-is-a-data-warehouse/
• Guru99, 2020. What is Data Warehouse? Types, Definition & Example. Retrieved
from https://www.guru99.com/data-warehousing.html#2.
• EDUCBA, 2020. Data Warehouse Architecture. Retrieved from
https://www.educba.com/data-warehouse-architecture/
25
References:
• Reddy, C.. Data Warehousing: Characteristics, Functions, Pros & Cons. Wisestep.
Retrieved from https://content.wisestep.com/data-warehousing-characteristics-
functions-pros-cons/
• Guru99, 2020. 20 Best ETL / Data Warehousing Tools in 2020. Retrieved from
https://www.guru99.com/top-20-etl-database-warehousing-tools.html
26
INTENDED LEARNING OUTCOME
30
Types of Data Warehouse
31
Enterprise Data Warehouse (EDW)
• Enterprise Data Warehouse is a centralized warehouse.
• It provides decision support service across the enterprise.
• It offers a unified approach for organizing and representing data.
• It also provide the ability to classify data according to the subject and give
access according to those divisions.
32
Enterprise Data Warehouse Architecture
One-tier
Architecture
https://content.altexsoft.com/media/2019/10/word-image-19.png
33
Enterprise Data Warehouse Architecture
Two-tier
Architecture
https://content.altexsoft.com/media/2019/10/word-image-20.png
34
Enterprise Data Warehouse Architecture
Three-tier
Architecture
https://content.altexsoft.com/media/2019/10/word-image-21.png
35
Enterprise Data Warehouse Architecture
One-tier architecture for EDW means that you have a database directly connected
with the analytical interfaces where the end user can make queries.
Two-tier architecture. A data mart level is added between the user interface and
EDW. A data mart is a low-level repository that contains domain-specific
information. Simply put, it’s another, smaller-sized database that extends EDW
with dedicated information for your sales/operational departments, marketing, etc.
Three-tier architecture. On top of the data mart layer, enterprises also use online
analytical processing (OLAP) cubes. An OLAP cube is a specific type of database that
represents data from multiple dimensions.
36
Enterprise Data Warehouse Components
https://www.teksystems.com/-/media/teksystems/images/blogs/content%20images/conventional-data-warehouse-
architecture.png?h=327&w=601&hash=7FBF0193539FC596BBDA0A36A2373CB30D802006&hash=7FBF0193539FC596BBDA0A
36A2373CB30D802006&la=en
37
Enterprise Data Warehouse Components
1. Sources
• These are the tools that perform actual connection with source data, its
extraction, and loading to the place where it will be transformed.
Transformation unifies data format.
• ETL and ELT approaches differ in that in ETL the transformation is done
before EDW, in a staging area. ELT is a more modern approach that
handles all the transformation in a warehouse.
38
Enterprise Data Warehouse Components
3. Staging Area
• In the case of ETL, the staging area is the place data is loaded before
EDW. Here, it will be cleaned and transformed to a given data model.
4. DW Database
5. Reporting Layer
39
Operational Data Store
https://sqlwithmanoj.files.wordpress.com/2014/12/ods_dw.jpg
40
Operational Data Store
• ODS support only daily operations, so their view of historical data is very
limited. Although they work very well as sources of current data and are
often used as such by data warehouses, they do not support historically rich
queries.
41
Operational Data Store
https://static.javatpoint.com/tutorial/datawarehouse/images/operational-database-vs-data-warehouse.png
42
Operational Database vs. Data Warehouse
• The Operational Database is the source • Data Warehouse Systems serve users or
of information for the data warehouse. knowledge workers in the purpose of
It includes detailed information used to data analysis and decision-making. Such
run the day to day operations of the systems can organize and present
business. The data frequently changes information in specific formats to
as updates are made and reflect the accommodate the diverse needs of
current value of the last transactions. various users. These systems are called
as Online-Analytical Processing (OLAP)
• Operational Database Management Systems.
Systems also called as OLTP (Online
Transactions Processing Databases),
are used to manage dynamic data in
real-time.
43
Advantages of Operational Data Stores
• An ODS provides access to only the current, finely crumbled, non-aggregated,
less complicated data, which can be queried in a well-fitted approach without
using the operational systems.
• When the Reporting & Analysis tools need data that are closer to real-time
operations, it can be queried from the Operational Data Store, as and when it is
received from the respective source systems, instead of opting for more
prolonging conversion and loading operations from the other database sources
in the data warehouse systems.
• ODS is a secure option as it does not contain all the historical data and
operations, which makes it resilient for any cyber-attacks or hacking of data
privacy.
44
Advantages of Operational Data Stores
• An Operational Data Store is a practically feasible structural design option
when bearing in mind the complex requirements provided by the business, in
order to generate input for analysis and reporting processes that leads to
business decision-making.
• ODS facilitates considerably less total time for turnaround when in trouble, like
the environment failure or when a database needs a restart, which implicates
less stress on the business or the stack holders of the application.
• In order to fetch data from an ODS, querying need not be complex or with
multi-level joins and conditional Simple queries will be sufficient enough as it
holds moderately detailed operational data.
45
Data Mart
http://1.bp.blogspot.com/-pUOXIPxlxps/VLHslazE74I/AAAAAAAAIXM/Rj3oUxpe_dY/s1600/DWH_P002_004.jpg
Data Mart
• A data mart is a subset of the data warehouse. It is specially designed
for a particular line of business, such as sales, finance, sales or finance.
In an independent data mart, data can collect directly from sources.
47
Types of Data Mart
1. Dependent: A dependent data mart
allows sourcing organization's data from a
single Data Warehouse.
https://www.guru99.com/images/1
/022218_0604_DataMartTut1.png
48
Types of Data Mart
2. Independent: Independent data mart is
created without the use of a central data
warehouse.
https://www.guru99.com/images/1/02
2218_0604_DataMartTut2.png
49
Types of Data Mart
3. Hybrid: This type of data marts can take
data from data warehouses or operational
systems.
https://www.guru99.com/images/1/
022218_0604_DataMartTut3.png
50
Steps in Implementing a Data Mart
51
Steps in Implementing a Data Mart
Designing
It covers all the tasks between initiating the request for a data mart to gathering
information about the requirements. Finally, we create the logical and physical design
of the data mart.
Constructing
It involves creating the physical database and the logical structures.
Populating
data in populated in the data mart.
52
Steps in Implementing a Data Mart
Accessing
Involves putting the data to use: querying the data, creating reports, charts, and
publishing them. End-user submit queries to the database and display the results of the
queries
Managing
This step covers management tasks such as-
• Ongoing user access management.
• System optimizations and fine-tuning to achieve the enhanced performance.
• Adding and managing fresh data into the data mart.
• Planning recovery scenarios and ensure system availability in the case when the
system fails.
53
Advantages of Data Mart
• Data marts contain a subset of organization-wide data. This Data is valuable to a
specific group of people in an organization.
• It is cost-effective alternatives to a data warehouse, which can take high costs to
build.
• Data Mart allows faster access of Data.
• Data Mart is easy to use as it is specifically designed for the needs of its users. Thus a
data mart can accelerate business processes.
• Data Marts needs less implementation time compare to Data Warehouse systems. It
is faster to implement Data Mart as you only need to concentrate on subset of data.
• It contains historical data which enables the analyst to determine data trends.
54
Disadvantages of Data Mart
• Many times enterprises create too many unrelated data marts without much
benefit. It can become a big hurdle to maintain.
• Data Mart cannot provide company-wide data analysis as their data set is limited.
55
References:
• Guru99, 2020. What is Data Warehouse? Types, Definition & Example. Retrieved
from https://www.guru99.com/data-warehousing.html#4.
• EDUCBA, 2020. Types of Data Warehouse. Retrieved from
https://www.educba.com/types-of-data-warehouse/
• Technopedia, 02-May-2013. Enterprise Data Warehouse. Retrieved from
https://www.techopedia.com/definition/26204/enterprise-data-warehouse
• Altexsoft Software R&D Engineering, 24-Oct-2019. Enterprise Data Warehouse:
Concepts, Architecture, and Components. Retrieved from
https://www.altexsoft.com/blog/enterprise-data-warehouse-concepts/
• Oracle, 2020. What is a Data Warehouse?. Retrieved from
https://www.oracle.com/database/what-is-a-data-warehouse/
56
References:
• JavaTpoint, 2018. Difference between Operational Database and Data Warehouse.
Retrieved from https://www.javatpoint.com/operational-database-vs-data-
warehouse
• EDUCBA, 2020. Types of Data Warehouse. Retrieved from
https://www.educba.com/operational-data-stores/
• Rouse, M., TechTarget. Operational Data Store (ODS). Retrieved from
https://searchoracle.techtarget.com/definition/operational-data-store
• Grenier, R. 05-Feb-2011. Operational Data Stores (ODS). Retrieved from
http://randygrenier.blogspot.com/2011/02/operational-data-stores-ods.html
57
References:
• Baesens, B., 26-Sept-2018. Data Warehouses, Data Marts, Operational Data Stores,
and Data Lakes: What’s in a Name?. Database Trends and Applications. Retrieved
from http://www.dbta.com/BigDataQuarterly/Articles/Data-Warehouses-Data-
Marts-Operational-Data-Stores-and-Data-Lakes-Whats-in-a-Name-127417.aspx
• EDUCBA, 2020. Types of Data Warehouse. Retrieved from
https://www.educba.com/operational-data-stores/
• Guru 99, 2020. What is Data Mart? Types & Example. Retrieved from
https://www.guru99.com/data-mart-tutorial.html
58
INTENDED LEARNING OUTCOME
4
Types of Data Warehouse Schema
• Star Schema
• Snowflake Schema
• Galaxy Schema
5
Schema
6
Star Schema
http://www.durofy.com/wp-content/uploads/2013/07/star.png https://www.guru99.com/images/1/022218_0758_StarandSnow1.png
7
Star Schema
• It is known as star schema as its structure
resembles a star.
• The star schema is the simplest type of
Data Warehouse schema.
• It is also known as Star Join Schema and is
optimized for querying large data sets.
• The center of the star can have one fact
table and a number of associated
dimension tables.
http://www.durofy.com/wp-content/uploads/2013/07/star.png
8
Star Schema
Fact Table
https://www.oercommons.org/editor/images/6978
https://i.stack.imgur.com/aB9k9.jpg
9
Star Schema
Fact Table
• A Fact table in a Data Warehouse system is nothing but the table that contains
all the facts or the business information, which can be subjected to analysis
and reporting activities when required.
• These tables hold fields that represent the direct facts, as well as the foreign
fields that are used to connect the fact table with other dimension tables in
the Data Warehouse system.
• A Data Warehouse system can have one or more fact tables, depending on the
model type used to design the Data Warehouse.
10
Star Schema
Dimension Table
https://www.dotnetinterviewquestions.in/contentpics/star1.JPG
11
Star Schema
Dimension Table
12
Characteristics of Star Schema
• Every dimension in a star schema is represented with only one-dimension
table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key
• The dimension table are not joined to each other
• Fact table would contain key and measure
• The Star schema is easy to understand and provides optimal disk usage.
• The dimension tables are not normalized.
• The schema is widely supported by BI Tools
13
Steps in Designing Star Schema
1. Identify the business process.
2. Identify the measures or fact
data.
3. Identify the Dimension table
related to fact table.
4. List of attributes or column in
each dimension table
5. Find the lowest level of
aggregation or summary
analysis in the fact table.
https://www.oercommons.org/editor/images/6929
14
Snowflake Schema
https://upload.wikimedia.org/wikipedia/commons/1/1d/Snowflake_schema.png https://static.javatpoint.com/tutorial/datawarehouse/images/star-schema-vs-snowflake-
schema2.png
15
Snowflake Schema
https://www.guru99.com/images/1/022218_0758_StarandSnow2.png
16
Snowflake Schema
17
When to Use Snowflake Schema
Ralph Kimball, the data warehousing guru, proposes three cases where snowflake
implementation is not only acceptable but is also the key to a successful design:
• Large customer dimensions where, for example, 80 percent of the fact table
measurements involve anonymous visitors about whom you collect little
detail, and 20 percent involve reliably registered customers about whom you
collect much detailed data by tracking many dimensions
18
When to Use Snowflake Schema
19
Characteristics of Snowflake Schema
• The main benefit of the snowflake schema is it uses smaller disk space.
• Easier to implement a dimension is added to the Schema.
• Due to multiple tables, query performance is reduced.
• The primary challenge that you will face while using the snowflake Schema
is that, you need to perform more maintenance efforts because of the more
lookup tables.
20
Galaxy Schema
http://wiki.bi-academy.eu/lib/exe/fetch.php?media=bicn01:modelling:en-img-0079.png
21
Galaxy Schema
• Multiple fact tables share
dimension tables.
• This schema is viewed as
collection of stars hence called
galaxy schema.
• It is also called Fact
Constellation Schema.
https://www.guru99.com/images/1/022218_0758_StarandSnow3.png
22
Characteristics of Galaxy Schema
• The dimensions in this schema are separated into separate dimensions based
on the various levels of hierarchy. For example, if geography has four levels of
hierarchy like region, country, state, and city then Galaxy schema should have
four dimensions.
• Moreover, it is possible to build this type of schema by splitting the one-star
schema into more Star schema.
• The dimensions are large in this schema which is needed to build based on the
levels of hierarchy.
• This schema is helpful for aggregating fact tables for better understanding.
23
References
• EDUCBA, 2020. Introduction to Data warehouse Schema. Retrieved from
https://www.educba.com/data-warehouse-schema/
• TechTerms, 2020. Schema. Retrieved from
https://techterms.com/definition/schema
• Guru99, 2020. Star and SnowFlake Schema in Data Warehouse. Retrieved from
https://www.guru99.com/star-snowflake-data-warehousing.html
• OER Commons – Open Educational Resources. Data warehouse Design Using
Oracle. Retrieved from https://www.oercommons.org/authoring/21861-data-
warehouse-design-using-oracle/3/view
• EDUCBA, 2020. Overview of Fact Table. Retrieved from
https://www.educba.com/what-is-fact-table/
24
References
• EDUCBA, 2020. Introduction to Data warehouse Schema. Retrieved from
https://www.educba.com/data-warehouse-schema/
• JavaTpoint, 2018. Difference between Star and Snowflake Schemas. Retrieved
from https://www.javatpoint.com/data-warehouse-star-schema-vs-snowflake-
schema
• Star and Snowflake Schema. Retrieved from
https://www.oracle.com/webfolder/technetwork/tutorials/obe/db/10g/r2/ow
b/owb10gr2_gs/owb/lesson3/starandsnowflake.htm
25
INTENDED LEARNING OUTCOME
• Define OLAP
• Determine the types of OLAP
• Understand OLAP operations.
• List OLAP Tools
• Define ETL and understand its process in data warehouse.
4
Online Analytical Processing (OLAP)
5
Online Analytical Processing (OLAP)
6
OLAP Architecture
https://image.slidesharecdn.com/1-150506060450-conversion-gate01/95/14-data-warehouse-13-638.jpg?cb=1430892370
7
How Does It Work?
• A Data warehouse would extract information from multiple data sources and
formats like text files, excel sheet, multimedia files, etc.
• The extracted data is cleaned and transformed. Data is loaded into an OLAP
server (or OLAP cube) where information is pre-calculated in advance for
further analysis.
8
OLAP Cube
9
Basic Analytical Operations of OLAP
https://www.guru99.com/images/1/022218_1238_WhatisOLAPO2.png
10
Basic Analytical Operations of OLAP
• Drill-down – To perform
the analysis in deeper
among the dimensions of
data. For example, drilling
down from “time period”
to “years” and “months”
and to “days” and so on to
plot sales growth for a
product.
https://www.guru99.com/images/1/022218_1238_WhatisOLAPO3.png
11
Basic Analytical Operations
of OLAP
https://www.guru99.com/images/1/022218_1238_WhatisOLAPO4.png
12
Basic Analytical Operations
of OLAP
https://www.guru99.com/images/1/022218_1238_WhatisOLAPO5.png
13
Basic Analytical Operations
of OLAP
https://www.guru99.com/images/1/022218_1238_WhatisOLAPO6.png
14
Types of OLAP System
15
Types of OLAP System
TYPE OF OLAP EXPLANATION
ROLAP utilize a relational database management system to keep
and control the data. These are the servers that exist
between the database and the user. ROLAP systems work
on the information that resides in a relational database.
16
Types of OLAP System
TYPE OF OLAP EXPLANATION
HOLAP It is a blend of MOLAP and ROLAP. By utilizing both ROLAP
and MOLAP information stores, Hybrid OLAP offers the
qualities of both techniques. HOLAP stores data summaries
in the binary files or in the pre-calculated cubes. It leaves
the quantities of fact and dimension information in the
relational database.
17
Types of OLAP System
TYPE OF OLAP EXPLANATION
WOLAP Web OLAP which is OLAP system accessible via the web
browser. WOLAP is a three-tiered architecture. It consists of
three components: client, middleware, and a database
server.
Mobile OLAP Mobile OLAP helps users to access and analyze OLAP data
using their mobile devices
SOLAP SOLAP is created to facilitate management of both spatial
and non-spatial data in a Geographic Information system
(GIS)
18
Examples of OLAP Tools
• IBM Cognos
• Micro Strategy
• Apache Kylin
• icCube
19
Examples of OLAP Tools
• Pentaho BI
• Mondrian
• Oracle Business Intelligence
Enterprise Edition(OBIEE)
• JsHypercube
• Jedox
20
Advantages of OLAP Disadvantages of OLAP
21
22
What is ETL?
https://lh5.googleusercontent.com/Dg1ZqrEAp-nzxVEog5ZfI2lKD-
quLUttcbQoawUN1a5gyk4lL3DG7k6FSA9_rI_oi9aPGyZZDntlo5ErYoB4os4S07vlHP1Y4iWW1XCy3-UaFpaFcwibzR3GVTSEI50j81FXmmA
23
What is ETL?
24
ETL Process
http://busitelce.com/images/14articles/etl-process/etl-process-in-data-warehouse.jpg
25
ETL Process
• Step 1 - Extraction
All the preferred data from various source systems such as databases,
applications, and flat files is identified and extracted. Data extraction can be
completed by running jobs during non-business hours.
26
ETL Process
• Step 1 - Extraction
Partial Extraction (with update notification): This strategy is also known delta, where only the
data being changed is extracted and update data warehouses
Partial Extraction (without update notification): This strategy refers to extract specific required
data from sources according to load in the data warehouses instead of extracting whole data.
27
ETL Process
• Step 2 - Transformation
Most of the extracted data can’t be directly loaded into the target system.
Based on the business rules, some transformations can be done before
loading the data.
The transformation process also corrects the data, removes any incorrect
data and fixes any errors in the data before loading it.
28
ETL Process
• Step 3 - Loading
All the gathered information is loaded into the target Data Warehouse
tables.
Types of Loading:
Initial Load — populating all the Data Warehouse tables
Incremental Load — applying ongoing changes as when needed periodically.
Full Refresh —erasing the contents of one or more tables and reloading with
fresh data.
29
Example of ETL Tools
RightData
Informatica Data Validation SSISTester
QuerySurge TestBench
ICEDQ GTL QAceGen
Datagaps ETL Validator Zuzena Automated Testing Service
QualiDI DbFit
Talend Open Studio for Data Integration AnyDbTest
Codoid's ETL Testing Services 99 Percentage ETL Testing
Data Centric Testing
30
References
• MOOCs : edX Courses. StanfordOnline: SOE.YDB-OLAP0001
Databases: On-Line Analytical Processing
• EDUCBA, 2020. What is OLAP?. Retrieved from
https://www.educba.com/what-is-olap/
• EDUCBA, 2020. Types of OLAP. Retrieved from
https://www.educba.com/types-of-olap/
• Guru99, 2020. What is OLAP? Cube, Operations & Types in Data Warehouse.
Retrieved from https://www.guru99.com/online-analytical-processing.html#2
• OLAP.com. OLAP Cube. Retrieved from https://olap.com/learn-bi-olap/olap-bi-
definitions/olap-cube/
• OLAP.com. Types of OLAP System. Retrieved from https://olap.com/types-of-
olap-systems/
31
References
• Tutorialspoint Simply Easy Learning, 2020. Data Warehousing – OLAP.
Retrieved from https://www.tutorialspoint.com/dwh/dwh_olap.htm
• Software Testing Help, 2020. Top 10 Best Analytical Processing (OLAP) Tools:
Business Intelligence. Retrieved from
https://www.softwaretestinghelp.com/best-olap-tools/
• EDUCBA, 2020. OLAP Tools. Retrieved from https://www.educba.com/olap-
tools/
• Galaktikasoft, 2020. Advantages Of Using OLAP for Business Intelligence.
Retrieved from https://galaktika-soft.com/blog/advantages-of-using-olap-for-
business-intelligence.html
32
References
• EDUCBA, 2020. What is ETL?. Retrieved from https://www.educba.com/what-
is-etl/
• Software Testing Help, 16-Apr-2020. What Is ETL (Extract, Transform, Load)
Process In Data Warehouse?. Retrieved from
https://www.softwaretestinghelp.com/etl-process-in-data-warehouse/
• Software Testing Help, 16-Apr-2020. Top 10 ETL Testing Tools In 2020.
Retrieved from https://www.softwaretestinghelp.com/top-4-etl-testing-tools/
33
INTENDED LEARNING OUTCOME
4
What is Data Mining?
https://miro.medium.com/max/2800/0*2883vRd4rpXWUIz_.jpg
What is Data Mining?
It is basically the extraction of vital
information/knowledge from a large set of data.
So there is a Mobile network operator. They consult a Data miner to dig into the call
records of the operator. No specific targets are given to the Data Miner.
A quantitative target of finding at least 2 new patterns in a month is given.
As the data miner starts digging into the data he finds a pattern that there are less
international calls on Wednesday as compared to other days.
This information is shared with the management and they come up with the plan to
reduce the international call rates on Wednesdays and start a campaign.
Call rates surge, customers are happy with low call price, more customers sign up and the
company makes more money! Win-Win situation!
Data Mining Examples
Example 2 :
Consider a marketing head of telecom service provides who wants to increase revenues
of long distance services. For high ROI on his sales and marketing efforts customer
profiling is important. He has a vast data pool of customer information like age, gender,
income, credit history, etc. But its impossible to determine characteristics of people who
prefer long distance calls with manual analysis. Using data mining techniques, he may
uncover patterns between high long distance call users and their characteristics.
For example, he might learn that his best customers are married females between the
age of 45 and 54 who make more than $80,000 per year. Marketing efforts can be
targeted to such demographic.
Data Mining Examples
Example 3:
A bank wants to search new ways to increase revenues from its credit card
operations. They want to check whether usage would double if fees were halved.
Bank has multiple years of record on average credit card balances, payment
amounts, credit limit usage, and other key parameters. They create a model to
check the impact of the proposed new business policy. The data results show that
cutting fees in half for a targetted customer base could increase revenues by $10
million.
Data Mining Techniques
Anomaly Detection.
It is used to determine when
something is noticeably different
from the regular pattern. It is
used to eliminate any database
inconsistencies or anomalies at
the source.
Data Mining Techniques
Regression Analysis.
This technique is used to make
predictions based on
relationships within the data set.
Data Mining Techniques
Classification.
This deals with the things which have
labels on it. Note in cluster detection,
the things did not have a label in it and
by using data mining we had to label
and form into clusters, but in
classification, there is information
existing that can be easily classified
using an algorithm.
Data Mining Techniques
Associative Learning.
It is used to analyze which things
tend to occur together either in
pairs or larger groups.
Example of Data Mining Tools
• Rapid Miner
• Orange
• Weka
• Knime
• Sisense
• SQL Server Data
Tools (SSDT)
• Apache Mahout
• Oracle Data Mining
Example of Data Mining Tools
• Rattle
• DataMelt
• IBM Cognos
• IBM SPSS Modeler
• SAS Data Mining
• Teradata
• Board
• Dundas BI
Benefits of Data Mining
• Data mining technique helps companies to get knowledge-based
information.
• Data mining helps organizations to make the profitable adjustments in
operation and production.
• The data mining is a cost-effective and efficient solution compared to
other statistical data applications.
• Data mining helps with the decision-making process.
• Facilitates automated prediction of trends and behaviors as well as
automated discovery of hidden patterns.
• It can be implemented in new systems as well as existing platforms
• It is the speedy process which makes it easy for the users to analyze
huge amount of data in less time.
Disadvantages of Data Mining
• There are chances of companies may sell useful information of their
customers to other companies for money.
• Many data mining analytics software is difficult to operate and requires
advance training to work on.
• Different data mining tools work in different manners due to different
algorithms employed in their design. Therefore, the selection of correct
data mining tool is a very difficult task.
• The data mining techniques are not accurate, and so it can cause serious
consequences in certain conditions.
Data Mining Applications
• Communications
• Insurance
• Education
• Manufacturing
• Banking
• Retail
• Service Providers
• E-Commerce
• Super Markets
• Crime Investigation
• Bioinformatics
20
Data Mining Applications
Applications Usage
Communications Data mining techniques are used in communication sector to predict
customer behavior to offer highly targeted and relevant campaigns.
Insurance Data mining helps insurance companies to price their products profitable
and promote new offers to their new or existing customers.
Education Data mining benefits educators to access student data, predict
achievement levels and find students or groups of students which need
extra attention. For example, students who are weak in maths subject.
Manufacturing With the help of Data Mining Manufacturers can predict wear and tear of
production assets. They can anticipate maintenance which helps them
reduce them to minimize downtime.
Data Mining Applications
Applications Usage
Banking Data mining helps finance sector to get a view of market risks and manage
regulatory compliance. It helps banks to identify probable defaulters to
decide whether to issue credit cards, loans, etc.
Retail Data Mining techniques help retail malls and grocery stores identify and
arrange most sellable items in the most attentive positions. It helps store
owners to comes up with the offer which encourages customers to
increase their spending.
Service Providers Service providers like mobile phone and utility industries use Data Mining
to predict the reasons when a customer leaves their company. They
analyze billing details, customer service interactions, complaints made to
the company to assign each customer a probability score and offers
incentives.
Data Mining Applications
Applications Usage
E-Commerce E-commerce websites use Data Mining to offer cross-sells and up-
sells through their websites. One of the most famous names is
Amazon, who use Data mining techniques to get more customers
into their eCommerce store.
Super Markets Data Mining allows supermarket's develope rules to predict if their
shoppers were likely to be expecting. By evaluating their buying
pattern, they could find woman customers who are most likely
pregnant. They can start targeting products like baby powder, baby
shop, diapers and so on.
Crime Investigation Data Mining helps crime investigation agencies to deploy police
workforce (where is a crime most likely to happen and when?), who
to search at a border crossing etc.
Bioinformatics Data Mining helps to mine biological data from massive datasets
gathered in biology and medicine.
Predictive vs. Descriptive Data Mining Algorithm
Predictive Descriptive
https://www.researchgate.net/figure/Main-data-mining-techniques_fig2_270552309
Predictive Data Mining vs. Descriptive Data Mining
The descriptive analysis is used to mine data and provide the latest
information on past or recent events.
The predictive analysis provides answers of the future queries that move
across using historical data as the chief principle for decisions.
25
Key Data Mining Algorithms
Key Data Mining Algorithms
27
What is Data Mining Algorithm?
• Data Mining Algorithms are a special category of algorithms that are useful
for analyzing data and developing data models to identify meaningful
patterns. These are part of machine learning algorithms.
• These algorithms are implemented through various programming like R
language, Python and using data mining tools to derive the optimized data
models.
• These algorithms are part of data analytics implementation for business.
• These algorithms are based upon statistical and mathematical formulas
which applied to the data set.
Data Mining Algorithms
• C4.5
• k-means
• Support Vector Machines (SVM)
• Apriori
• Expectation-Maximization (EM)
• PageRank
• AdaBoost
• kNN (k-Nearest Neighbors)
• Naïve Bayes
• Classification and Regression Trees (CART)
Data Mining Algorithms
C4.5
C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is
given a set of data representing things that are already classified.
k-means
k-means creates k groups from a set of objects so that the members of a group are
more similar. It’s a popular cluster analysis technique for exploring a dataset.
SVM
Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At
a high-level, SVM performs a similar task like C4.5 except SVM doesn’t use decision
trees at all.
Data Mining Algorithms
Apriori
The Apriori algorithm learns association rules and is applied to a database
containing a large number of transactions.
EM
In data mining, expectation-maximization (EM) is generally used as a clustering
algorithm (like k-means) for knowledge discovery.
PageRank
PageRank is a link analysis algorithm designed to determine the relative
importance of some object linked within a network of objects.
Data Mining Algorithms
AdaBoost
AdaBoost is a boosting algorithm which constructs a classifier. As you
probably remember, a classifier takes a bunch of data and attempts to predict
or classify which class a new data element belongs to.
kNN
kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs
from the classifiers previously described because it’s a lazy learner.
Data Mining Algorithms
Naïve Bayes
Naive Bayes is not a single algorithm, but a family of classification
algorithms that share one common assumption: Every feature of the data
being classified is independent of all other features given the class.
CART
CART stands for classification and regression trees. It is a decision tree
learning technique that outputs either classification or regression trees.
Like C4.5, CART is a classifier.
CRISP-DM
34
Phases of CRISP-DM
https://cdn.educba.com/academy/wp-content/uploads/2019/02/steps-involved.png
Phases of CRISP-DM
1. Business Understanding
2. Data Understanding
Starts with an initial data collection and proceeds with activities in order to get
familiar with the data, to identify data quality problems, to discover first
insights into the data, or to detect interesting subsets to form hypotheses for
hidden information.
36
Phases of CRISP-DM
3. Data Preparation
The data preparation phase covers all activities to construct the final dataset
from the initial raw data.
4. Modeling
Modeling techniques are selected and applied. Since some techniques like
neural nets have specific requirements regarding the form of the data, there
can be a loop back here to data prep.
37
Phases of CRISP-DM
5. Evaluation
Once one or more models have been built that appear to have high quality based on
whichever loss functions have been selected, these need to be tested to ensure they
generalize against unseen data and that all key business issues have been sufficiently
considered. The end result is the selection of the champion model(s).
6. Deployment
38
Knowledge Discovery in Databases (KDD)
• Some people don’t differentiate data mining from knowledge discovery while
others view data mining as an essential step in the process of knowledge
discovery.
• The term knowledge discovery in databases, or KDD for short, refers to the
broad process of finding knowledge in data, and emphasizes the high level
application of particular data minded methods. It is of interest to researchers
in machine learning, pattern recognition, databases, statistics, artificial
intelligence, knowledge acquisition for expert systems, and data visualization.
39
9-step KDD Process Starting from Managerial Step
40
9-step KDD Process Starting from Managerial Step
41
9-step KDD Process Starting from Managerial Step
(Explanation)
42
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 2
Selecting and creating a data set on which discovery will be performed, based on
goals. Determine what data will be used for the knowledge discovery, such as:
what data is available, obtaining additional necessary data, and the integrating all
the data for the knowledge discovery into one data set, including the attributes
that will be considered for the process. This process is very important because the
data mining learns and discovers from the available data. This is the evidence base
for constructing the models. If some important attributes are missing, then the
entire study may fail. From this respect, the more attributes considered, the
better.
43
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 3
Pre-processing and cleansing. Data reliability is enhanced in this stage. It includes
data clearing, such as handling missing values, and removing of outliers.
It may involve complex statistical methods, or using a data mining algorithm in this
context. For example: If one suspects that a certain attribute is of insufficient
reliability, or has many missing data, then this attribute could become the goal of a
data mining supervised algorithm. A prediction model for this attribute will be
developed, and then missing data can be predicted.
44
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 4
Next is data transformation. In this stage, the generation of better data, for the
data mining is prepared and developed. Methods here include dimension
reduction, such as feature selection, and extraction, and record sampling, and
attribute transformation such as discretization of numerical attributes and
functional transformation.
45
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 5
Choosing the appropriate data mining task. We’re now ready to decide which type
of data mining to use. For example: classification, regression, or clustering. This
mostly depends on the KDD goals, and also on the previous steps. There are two
major goals in data mining: prediction and description. Prediction is often referred
to as supervised data mining, while descriptive data mining includes the
unsupervised, and visualization aspects of data mining.
46
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 6
Choosing the data mining algorithm. Now that you have the strategy, we can
decide which tactics to use. This stage includes selecting the specific method for
searching patterns, including multiple inducers. For example, when considering
precision versus understandability, the former is better with neural networks,
while the latter is better with decision trees.
47
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 7
Next is employing the data mining algorithm. Finally you can implement the
appropriate data mining algorithm. In this step we might need to employ the
algorithm several times until a satisfying the result is obtained. For instance, by
tuning the algorithms control parameters, such as the minimum number of
instances in a single leaf of a decision tree.
48
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 8
evaluation. In this stage we evaluate and interpret the mined patterns with respect
to the goals defined in the first step. Here we consider the pre-processing steps
with respect to their effect on the data mining algorithm results. This step focuses
on the comprehensible nature and usefulness of the induced model. In this step,
the discovered knowledge is also documented for further usage.
49
9-step KDD Process Starting from Managerial Step
(Explanation)
Step 9
Using the discovered knowledge. We’re now ready to incorporate the knowledge
into another system for further action. The knowledge becomes active in the sense
that we may make changes to the system, and measure the effects. Actually, the
success of this step determines the effectiveness of the entire KDD process.
50
References
• EDUCBA, 2020. Data Mining Algorithms. Retrieved from
https://www.educba.com/data-mining-algorithms/
• Ray Li, 2015. Top 10 Data Mining Algorithms. Hacker Bits. Retrieved from
https://hackerbits.com/data/top-10-data-mining-algorithms-in-plain-english/
• TechDifferences, 25-Mar-2019. Difference Between Descriptive and Predictive
Data Mining. Retrieved from https://techdifferences.com/difference-between-
descriptive-and-predictive-data-mining.html
• Guru99, 2020. Data Mining Tutorial: Process, Techniques, Tools, Examples.
Retrieved from https://www.guru99.com/data-mining-tutorial.html
• SAS, 2020. Data Mining What it is and why matters. SAS Institute Inc..
Retrieved from https://www.sas.com/en_us/insights/analytics/data-
mining.html
51
References
• JavaTpoint, 2018. Data Mining Tutorial. Retrieved from
https://www.javatpoint.com/data-mining
• EDUCBA, 2020. Introduction to Data Mining. Retrieved from
https://www.educba.com/introduction-to-data-mining/
• Guru99, 2020. Supervised vs Unsupervised Learning: Key Differences.
Retrieved from https://www.guru99.com/supervised-vs-unsupervised-
learning.html
• Software Testing Help, 16-Apr-2020. Top 15 Best Free Data Mining Tools: The
Most Comprehensive List. Retrieved from
https://www.softwaretestinghelp.com/data-mining-tools/
52
References
• RapidMiner, 2020. Data Mining Tools. Retrieved from
https://rapidminer.com/glossary/data-mining-tools/
• Vorhies, W., 26-Jul-2016. CRISP-DM – a Standard Methodology to Ensure a
Good Outcome. Data Science Central. Retrieved from
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-
methodology-to-ensure-a-good-outcome
• Rodrigues, I.. CRISP-DM methodology leader in data mining and big data.
Towards Data Science. Retrieved from https://towardsdatascience.com/crisp-
dm-methodology-leader-in-data-mining-and-big-data-467efd3d3781
• Tutorialspoint, 2020. Data Mining - Knowledge Discovery. Retrieved from
https://www.tutorialspoint.com/data_mining/dm_knowledge_discovery.htm
53
References
• Marsh, J., 2014. Data Science Knowledge Discovery in Databases: 9 Steps to
Success. Udemy Blog. Retrieved from https://blog.udemy.com/knowledge-
discovery-in-databases/
• Tech Differences, 25-Mar-2019. Difference Between Descriptive and Predictive
Data Mining. Retrieved from https://techdifferences.com/difference-between-
descriptive-and-predictive-data-mining.html
• Han, J., Pei, J., Kamber, M.. Data Mining: Concepts and Techniques Third
Edition
54