Data Warehouse Manual


EX1. DATA EXPLORATION AND INTEGRATION WITH WEKA


Steps:
1. Download and install Weka. You can find it here:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
2. Open Weka and have a look at the interface. It is an open-source project written in Java
from the University of Waikato.
3. Click on the Explorer button on the right side.
4. Check the different tabs to familiarize yourself with the tool.
5. Weka comes with a number of small datasets. Those files are located at C:\Program
Files\Weka-3-8 (if it is installed at this location; otherwise, search for Weka-3-8 to find the
installation location). In this folder, there is a subfolder named ‘data’. Open that folder
to see all the files that come with Weka.
6. For easy access, copy the folder ‘data’ and paste it into your ‘Documents’ folder.
7. In this lab, we will work with the Iris dataset. To open the Iris dataset, click on ‘Open
file’ in the ‘Preprocess’ tab. From your ‘data’ folder, select iris.arff and hit Open.
8. To know more about the Iris dataset, open iris.arff in Notepad++ or a similar tool and
read the comments.
9. Click on the Visualize tab to see various 2D visualizations of the dataset.
10. Fill in the tables below.

Flower Type        Count
Iris Setosa        50
Iris Versicolour   50
Iris Virginica     50

Attribute      Minimum   Maximum   Mean   StdDev
sepal length   4.3       7.9       5.84   0.83
sepal width    2.0       4.4       3.05   0.43
petal length   1.0       6.9       3.76   1.76
petal width    0.1       2.5       1.20   0.76
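These values can also be cross-checked outside Weka. The following is a minimal Python sketch (assuming the scipy and pandas libraries are installed and iris.arff has been copied into the working directory):

from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("iris.arff")   # parse the ARFF file
df = pd.DataFrame(data)

# Class counts: 50 instances for each of the three iris classes
print(df["class"].value_counts())

# Minimum, maximum, mean and standard deviation of the four numeric attributes
print(df.describe().loc[["min", "max", "mean", "std"]].round(2))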
Exp 2: Data exploration and integration with WEKA
2.1) Weka: Loading data
In this chapter, we start with the first tab that you use to preprocess the data. This is common to
all algorithms that you would apply to your data for building the model and is a common step
for all subsequent operations in WEKA.
For a machine learning algorithm to give acceptable accuracy, it is important that you cleanse
your data first. This is because the raw data collected from the field may contain null values,
irrelevant columns and so on.
First, you will learn to load the data file into the WEKA explorer. The data can be loaded from
the following sources:
∙ Local file system
∙ Web
∙ Database

Now, navigate to the folder where your data files are stored. The WEKA installation comes
with many sample databases for you to experiment with. These are available in the data
folder of the WEKA installation.
We will open the file from a public URL. Type the following URL in the popup box:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff

To load data from a database, set the connection string to your database, set up the query for
data selection, process the query and load the selected records into WEKA.
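As an illustration of what the ‘Open URL’ option does, the same file can be fetched and inspected with a short Python sketch (this is not part of WEKA itself; it assumes the scipy and pandas libraries are installed and network access is available):

import io
import urllib.request
import pandas as pd
from scipy.io import arff

url = ("https://storm.cis.fordham.edu/~gweiss/data-mining/"
       "weka-data/weather.nominal.arff")
raw = urllib.request.urlopen(url).read().decode("utf-8")

# scipy's ARFF reader accepts file-like objects, so wrap the downloaded text
data, meta = arff.loadarff(io.StringIO(raw))
df = pd.DataFrame(data)
print(df.head())   # the weather.nominal data: 14 instances, 5 nominal attributes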
Exp 3: Apply weka tool for data validation
3.1) Weka – Preprocessing the data
The data that is collected from the field contains many unwanted elements that lead to wrong
analysis. For example, the data may contain null fields, it may contain columns that are
irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet the
requirements of the type of analysis you are seeking. This is done in the preprocessing
module.
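WEKA performs this clean-up through its preprocess filters; purely as an illustration of the same idea, here is a minimal pandas sketch (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("field_data.csv")                 # hypothetical raw extract
df = df.drop(columns=["record_id"])                # drop a column irrelevant to the analysis
df = df.dropna(subset=["target"])                  # remove rows whose class value is null
df["age"] = df["age"].fillna(df["age"].median())   # impute remaining null values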

When you open the file, your screen looks as shown here:
This screen tells us several things about the loaded data, which are discussed further in this
chapter.
3.2) Understanding Data
Let us first look at the highlighted Current relation sub-window. It shows the name of the
relation (dataset) that is currently loaded. You can infer two points from this sub-window: the
number of instances (rows) and the number of attributes (fields) in the loaded data.
At the bottom of the window, you see the visual representation of the class values. If you click
on the Visualize All button, you will be able to see all features in one single window as shown
here:
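Outside Weka, a rough stand-in for this per-attribute view is a bar chart of value counts per attribute. This is a sketch only; it assumes the nominal weather data has already been loaded into a pandas DataFrame df as in the earlier loading sketch:

import matplotlib.pyplot as plt

df = df.apply(lambda s: s.str.decode("utf-8"))   # ARFF nominal values load as bytes
fig, axes = plt.subplots(1, len(df.columns), figsize=(14, 3))
for ax, col in zip(axes, df.columns):
    df[col].value_counts().plot(kind="bar", ax=ax, title=col)
plt.tight_layout()
plt.show()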
WRITE THE QUERY FOR SCHEMA DEFINITION
∙ Each dimension in a star schema is represented with only one dimension table.
∙ This dimension table contains the set of attributes.
∙ The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
Syntax for Cube Definition
define cube <cube_name> [<dimension_list>]: <measure_list>
Syntax for Dimension Definition
define dimension <dimension_name> as (<attribute_or_dimension_list>)
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows −
define cube sales snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Fact Constellation Schema Definition
Fact constellation schema can be defined using DMQL as follows −
define cube sales [time, item, branch, location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

RESULT
Thus the WEKA tool for data validation is applied.
EX5. DESIGN DATA WAREHOUSE FOR REAL-TIME APPLICATIONS
Aim
A data warehouse is a central repository of integrated data that is used for reporting and analysis.
It stores large amounts of historical and current data and enables fast query performance for
analytical purposes. A data warehouse can be used to support business decision-making, improve
operational efficiency, and gain a competitive edge. In this experiment, we will discuss the process
of building a data warehouse from scratch.
Procedure
There are several approaches to designing a data warehouse schema, including −
Star schema − A star schema consists of a central fact table surrounded by dimension tables. The
fact table contains the measures or facts, and the dimension tables contain the attributes or context
for the measures. The schema is called a star because the dimension tables are connected to the
central fact table through foreign key relationships, forming a star shape.
Snowflake schema − A snowflake schema is an extension of the star schema, where the
dimension tables are normalized into multiple tables. This results in a more complex schema, but
it can improve query performance by reducing the amount of data stored in the dimension tables.
Hybrid schema − A hybrid schema is a combination of the star and snowflake schemas, where
some dimension tables are normalized and others are not. This can be useful when some
dimensions are highly granular and require normalization, while others are less granular and can
be denormalized.
Here is an example of a star schema for a sales data warehouse −
∙ Fact table: Sales
  o Date
  o Product ID
  o Customer ID
  o Sales Quantity
  o Sales Amount
∙ Dimension tables:
  o Date − Date, Month, Year, Quarter
  o Product − Product ID, Product Name, Product Category
  o Customer − Customer ID, Customer Name, Customer Location
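Purely as an illustration, this star schema could be realized as relational tables. The sketch below uses SQLite as a stand-in, with table and column names taken from the lists above:

import sqlite3

conn = sqlite3.connect("sales_dw.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_date (
    date_key       INTEGER PRIMARY KEY,
    date TEXT, month INTEGER, quarter INTEGER, year INTEGER
);
CREATE TABLE IF NOT EXISTS dim_product (
    product_id     INTEGER PRIMARY KEY,
    product_name TEXT, product_category TEXT
);
CREATE TABLE IF NOT EXISTS dim_customer (
    customer_id    INTEGER PRIMARY KEY,
    customer_name TEXT, customer_location TEXT
);
CREATE TABLE IF NOT EXISTS fact_sales (
    date_key       INTEGER REFERENCES dim_date(date_key),
    product_id     INTEGER REFERENCES dim_product(product_id),
    customer_id    INTEGER REFERENCES dim_customer(customer_id),
    sales_quantity INTEGER,
    sales_amount   REAL
);
""")
conn.commit()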
Example
Here is an example of an ETL process using Python and the popular data analysis library, Pandas −

import pandas as pd
import sqlite3

# Connect to the source database (any DB-API / SQLAlchemy connection works;
# SQLite is used here only as an example)
conn = sqlite3.connect("source.db")

# Extract data from a database
df = pd.read_sql("SELECT * FROM Sales", conn)

# Transform data: derive date parts and the sales amount
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df['Sales Amount'] = df['Sales Quantity'] * df['Unit Price']

# Load data into the data warehouse table
df.to_sql("Sales", conn, if_exists="replace")

Here is an example of creating an index and partitioning a table in SQL Server −

CREATE INDEX idx_date ON Sales (Date);

CREATE PARTITION FUNCTION pf_sales (DATE)
AS RANGE LEFT FOR VALUES (
    '2022-01-01',
    '2022-06-01',
    '2022-12-01'
);

CREATE PARTITION SCHEME ps_sales
AS PARTITION pf_sales
ALL TO ([PRIMARY]);

ALTER TABLE Sales
DROP CONSTRAINT DF_Sales_Date;

ALTER TABLE Sales
ADD CONSTRAINT DF_Sales_Date
DEFAULT (GETDATE())
FOR Date;

-- The partitioning column must be part of a unique clustered key, and the
-- index is created on the partition scheme to align the table with it.
ALTER TABLE Sales
ADD CONSTRAINT PK_Sales
PRIMARY KEY CLUSTERED (SalesID, Date)
ON ps_sales (Date);

Conclusion
Thus the data warehouse for real-time applications is designed.
EX6. ANALYSE THE DIMENSIONAL MODELING
Step 1: Select the Business Process
The first step involves selecting the business process, and it should be an action resulting in
output.

Step 2: Decide the Grain of each Business Process


A grain is a business process at a specified level. It tells us what exactly a row in a fact table
represents. All the rows in a fact table should result from the same grain. Each fact table is the
result of a different grain selected in a business process.

Step 3: Identify the Dimensions for the Dimensional Table


Before identifying the dimensions, we will understand what a dimension table is.

Dimension Tables
Some important points regarding dimension tables:
1. A dimension table stores textual information related to a business process.
2. It answers the ‘who, what, where, when, why, and how’ questions related to a particular
business process.
3. Dimension tables have more columns and fewer rows.
4. Each dimension table has a primary key that is joined to its given fact table.
5. Dimension attributes are the primary source of query constraints, grouping, and filtering.
Dimensions describe the measurements of the fact table. For example, customer ID is a key
stored in the fact table, and the customer dimension describes its attributes further, such as the
name of the customer, the address of the customer, gender, etc.
Our dimensional model will have the following dimensions:
∙ Date Dimension
∙ Product Dimension
∙ Order Dimension
∙ Customer Dimension
∙ Promotion Dimension
∙ Warehouse Dimension

Step 4: Identify the Facts for the Dimensional Table

This is the final step, in which we have to decide which facts (measurements) must be included
in the fact table.
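As an illustrative sketch, deriving fact rows at a ‘one row per order line’ grain from a raw order-line extract might look like this in pandas (the file and column names are hypothetical):

import pandas as pd

order_lines = pd.read_csv("order_lines.csv")   # hypothetical source extract

fact_order_line = (
    order_lines
    .assign(sales_amount=lambda d: d["quantity"] * d["unit_price"])   # derived measure
    [["date_key", "product_key", "order_key", "customer_key",
      "promotion_key", "warehouse_key", "quantity", "sales_amount"]]  # keys + measures
)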
RESULT
Thus the dimensional model is analysed.
EX7. CASE STUDY USING OLAP
Case Study: Healthcare Analytics with OLAP
Background: A regional healthcare provider, "MedInsight Health Systems," manages multiple
hospitals, clinics, and healthcare facilities. They have a vast amount of data related to patient care,
medical procedures, staff performance, and financial transactions. MedInsight aims to leverage
OLAP for multidimensional analysis to enhance decision-making and improve overall healthcare
delivery.
OLAP Dimensions:
1. Time Dimension:
   ∙ Hierarchy: Year > Quarter > Month > Day
2. Medical Department Dimension:
   ∙ Hierarchy: Department > Subspecialty > Procedure
3. Patient Dimension:
   ∙ Hierarchy: Patient Type > Patient Age Group > Individual Patient
4. Geographical Dimension:
   ∙ Hierarchy: Region > Hospital/Clinic > Department
OLAP Measures:
1. Patient Visits
2. Revenue Generated
3. Average Length of Stay
4. Staff Performance Metrics (e.g., Efficiency, Patient Satisfaction)
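As an illustration of the roll-ups such a cube supports, the sketch below aggregates the Patient Visits and Revenue measures along the Time and Medical Department dimensions with pandas (the column and file names are hypothetical, not MedInsight's actual system):

import pandas as pd

visits = pd.read_csv("patient_visits.csv")   # hypothetical fact extract

# Roll-up: totals by year and quarter, broken down by department
cube = pd.pivot_table(
    visits,
    index=["year", "quarter"],
    columns="department",
    values=["visit_count", "revenue"],
    aggfunc="sum",
)
print(cube)

# Drill-down: add 'month' to the index for a finer grain
drill = pd.pivot_table(visits, index=["year", "quarter", "month"],
                       columns="department", values="revenue", aggfunc="sum")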
OLAP Use Cases:
1. Operational Efficiency Analysis
2. Financial Performance Overview
3. Patient Demographics and Preferences
4. Staff Productivity and Satisfaction
5. Cost Analysis by Medical Procedure
6. Quality of Care Metrics
7. Resource Planning for Future Growth
Benefits of OLAP:
1. Decision Support
2. Patient-Centric Approach
3. Efficient Resource Allocation
4. Strategic Planning
5. Performance Benchmarking
6. Regulatory Compliance
In summary, by implementing OLAP in healthcare analytics, MedInsight Health Systems can
gain a holistic view of its operations, enhance patient care, and make strategic decisions to
improve efficiency and quality in the delivery of healthcare services.
EX. 8 CASE STUDY USING OLTP
Case Study: Online Transaction Processing (OLTP) in E-commerce
Background: "TechMart," an online electronics retailer, is experiencing rapid growth in its
customer base and transaction volume. The company has implemented an OLTP system to
manage its day-to-day online transactions efficiently.
Scenario: TechMart's OLTP system handles various online transactional activities, such as order
processing, inventory management, and customer interactions.
Key Components of OLTP:
1. Order Processing
2. Inventory Management
3. Customer Interactions
4. Transaction Security
5. Scalability
6. Concurrency Control
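A minimal sketch of one such transaction is shown below; SQLite and the table names are used purely for illustration (TechMart's actual stack is not specified), but the pattern of inserting the order and decrementing inventory atomically is the core of OLTP order processing.

import sqlite3

conn = sqlite3.connect("techmart.db")   # hypothetical database with orders/inventory tables
try:
    with conn:   # opens a transaction; commits on success, rolls back on any error
        cur = conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE product_id = ? AND stock >= ?",
            (2, 101, 2),
        )
        if cur.rowcount == 0:
            raise ValueError("insufficient stock")   # abort triggers automatic rollback
        conn.execute(
            "INSERT INTO orders (customer_id, product_id, quantity) VALUES (?, ?, ?)",
            (555, 101, 2),
        )
except ValueError as err:
    print("transaction rolled back:", err)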
Benefits of OLTP for TechMart:
1. Real-time Transaction Processing
2. Improved Customer Experience
3. Efficient Inventory Management
4. Secure Transactions
5. Adaptability to Growth
6. Data Integrity

Conclusion
In summary, TechMart's implementation of OLTP has streamlined its online transaction
processing, enhanced customer experience, and provided a foundation for secure and scalable e-
commerce operations. The OLTP system plays a pivotal role in the day-to-day functioning of the
business, supporting TechMart's growth and ensuring the reliability of its online platform.
EX. 9 IMPLEMENTATION OF WAREHOUSE TESTING.
The Importance of Data Warehouse Testing
With data driving critical business decisions, testing the data warehouse data integration process
is essential. Data comes from numerous sources. The data source affects data quality, so data
profiling and data cleaning must be ongoing. Source data history, business rules, or audit
information may no longer be available.
Understanding the ETL Testing Process
A solid understanding of data modeling provides testing teams with information to develop the
right testing strategy. During the analysis phase, the testing team must learn and understand the
different stages of the data warehouse implementation, including but not limited to:
∙ Source data profiling
∙ Data warehouse design
∙ ETL development
∙ Data loading and transformations
ETL testing includes multiple phases, and testing should be executed throughout the lifecycle of
the data warehouse implementation, not just at the end.
Preparing for ETL Testing
A data warehouse implementation must include end-to-end testing. The QA team must test initial
and incremental loads for the entire ETL process, beginning with identifying source data through
to report and portal functions. They must also test each point between extract and load, including
the data load from the source extract to staging and from staging through transformation; once the
data reaches the data warehouse, they must test data extraction for display and reporting.
With traditional ETL test planning, there are six phases:
1. Understanding business requirements/analysis
2. Creating test plans and estimating time to completion
3. Designing test cases and selecting test data
4. Executing tests with bug reporting and closure
5. Report summary and analysis
6. Test completion
ETL testing is performed in five stages:
1. Identifying data sources and requirements
2. Data acquisition
3. Implementing business logic and dimensional modeling
4. Building and populating data
5. Building reports
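As a small illustration of one such test, the sketch below reconciles row counts and a measure total between a staging extract and the warehouse fact table (the connections and table names are hypothetical):

import sqlite3

source = sqlite3.connect("staging.db")     # hypothetical staging database
target = sqlite3.connect("warehouse.db")   # hypothetical warehouse database

# Completeness check: every staged row should have been loaded
src_count = source.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
tgt_count = target.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"

# Reconciliation check: the measure totals should agree
src_sum = source.execute("SELECT SUM(sales_amount) FROM stg_sales").fetchone()[0]
tgt_sum = target.execute("SELECT SUM(sales_amount) FROM fact_sales").fetchone()[0]
assert abs(src_sum - tgt_sum) < 0.01, "sales_amount totals do not reconcile"

print("load reconciliation checks passed")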

Conclusion:
Thus data warehouse testing is implemented and studied.
