Data Warehouse Manual
2. Open Weka and have a look at the interface. It is an open-source project written in Java
from the University of Waikato.
3. Click on the Explorer button on the right side.
1. Check the different tabs to familiarize yourself with the tool.
2. Weka comes with a number of small datasets. These files are located at C:\Program
Files\Weka-3-8 (if Weka is installed at this location; otherwise, search for Weka-3-8 to find the
installation location). In this folder, there is a subfolder named ‘data’. Open that folder
to see all the files that come with Weka.
3. For easy access, copy the folder ‘data’ and paste it in your ‘Documents’ folder.
4. In this lab, we will work with the Iris dataset. To open it, click on ‘Open
file’ in the ‘Preprocess’ tab. From your ‘data’ folder, select iris.arff and click Open.
5. To learn more about the Iris dataset, open iris.arff in Notepad++ or a similar tool and
read the comments.
6. Click on the ‘Visualize’ tab to see various 2D visualizations of the dataset.
7. Fill in the table below with the class distribution of the Iris dataset:
Class              Number of instances
Iris Setosa        50
Iris Versicolour   50
Iris Virginica     50
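As a cross-check of these counts, the class distribution can also be computed outside Weka. The following is a minimal sketch, assuming scipy and pandas are installed and that iris.arff has been copied into the working directory:

# Sketch: verify the class distribution of iris.arff outside Weka (optional).
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("iris.arff")        # structured array plus attribute metadata
df = pd.DataFrame(data)
df["class"] = df["class"].str.decode("utf-8")  # nominal values are loaded as bytes
print(df["class"].value_counts())              # expected: 50 instances per class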
Now, navigate to the folder where your data files are stored. The WEKA installation comes
with many sample databases for you to experiment with. These are available in the data
folder of the WEKA installation.
We will open the file from a public URL. Type the following URL in the popup box:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
When you open the file, your screen looks as shown here:
This screen tells us several things about the loaded data, which are discussed further in this
chapter.
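For reference only, the same ARFF file can also be fetched and inspected in Python; the sketch below assumes network access plus scipy and pandas, and is not required for the Weka steps above:

# Sketch: fetch weather.nominal.arff from the URL above and inspect it (optional).
import io
import urllib.request
from scipy.io import arff
import pandas as pd

url = ("https://storm.cis.fordham.edu/~gweiss/data-mining/"
       "weka-data/weather.nominal.arff")
text = urllib.request.urlopen(url).read().decode("utf-8")
data, meta = arff.loadarff(io.StringIO(text))
df = pd.DataFrame(data)
print(meta)         # attribute names and their nominal values
print(df.head())    # first few instances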
3.2 Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the
database that is currently loaded, along with the number of instances and attributes it contains.
At the bottom of the window, you see a visual representation of the class values. If you click
on the Visualize All button, you will be able to see all features in a single window, as shown
here:
WRITE THE QUERY FOR SCHEMA DEFINITION
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
Syntax for Cube Definition
define cube < cube_name > [ < dimension_list > ]: < measure_list >
Syntax for Dimension Definition
define dimension < dimension_name > as ( < attribute_or_dimension_list > )
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
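The DMQL statements above are declarative definitions rather than executable SQL. As an informal illustration only, the same star schema can be approximated with ordinary relational tables, and the cube measures then correspond to simple aggregate queries over the fact table. The sketch below uses Python's built-in sqlite3 module; the table and column names follow the DMQL definitions, and the sample rows are invented:

# Sketch: relational approximation of the sales star schema defined above.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One dimension table shown (time); item, branch and location follow the same pattern.
cur.execute("""CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER)""")

# Fact table: one foreign key per dimension plus the raw measure column.
cur.execute("""CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    sales_in_dollars REAL)""")

# Invented sample rows, purely for illustration.
cur.execute("INSERT INTO time_dim VALUES (1, 15, 'Monday', 1, 'Q1', 2024)")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                [(1, 10, 1, 5, 120.0), (1, 11, 1, 5, 80.0)])

# dollars sold = sum(sales in dollars), units sold = count(*), grouped by time.
for row in cur.execute("""SELECT t.year, t.quarter,
                                 SUM(f.sales_in_dollars) AS dollars_sold,
                                 COUNT(*) AS units_sold
                          FROM sales_fact f JOIN time_dim t USING (time_key)
                          GROUP BY t.year, t.quarter"""):
    print(row)    # (2024, 'Q1', 200.0, 2)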
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows −
define cube sales snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
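What distinguishes the snowflake schema is the normalization of the item and location dimensions. A minimal, illustrative sketch of the item/supplier split (again using sqlite3, with column names taken from the DMQL above) looks like this:

# Sketch: normalized item dimension of the snowflake schema (illustrative only).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Supplier attributes are split out of item and stored once per supplier.
cur.execute("""CREATE TABLE supplier_dim (
    supplier_key INTEGER PRIMARY KEY, supplier_type TEXT)""")

cur.execute("""CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key))""")

# The city split of the location dimension follows the same pattern.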
Fact Constellation Schema Definition
Fact constellation schema can be defined using DMQL as follows −
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
RESULT
Thus, the Weka tool for data validation was applied.
EX5. DESIGN A DATA WAREHOUSE FOR REAL-TIME APPLICATIONS
Aim
A data warehouse is a central repository of integrated data that is used for reporting and analysis.
It stores large amounts of historical and current data and enables fast query performance for
analytical purposes. A data warehouse can be used to support business decision-making, improve
operational efficiency, and gain a competitive edge. In this exercise, we will discuss the process of
building a data warehouse from scratch.
Procedure
There are several approaches to designing a data warehouse schema, including −
Star schema − A star schema consists of a central fact table surrounded by dimension tables. The
fact table contains the measures or facts, and the dimension tables contain the attributes or context
for the measures. The schema is called a star because the dimension tables are connected to the
central fact table through foreign key relationships, forming a star shape.
Snowflake schema − A snowflake schema is an extension of the star schema, where the
dimension tables are normalized into multiple tables. This results in a more complex schema, but
it can improve query performance by reducing the amount of data stored in the dimension tables.
Hybrid schema − A hybrid schema is a combination of the star and snowflake schemas, where
some dimension tables are normalized and others are not. This can be useful when some
dimensions are highly granular and require normalization, while others are less granular and can
be denormalized.
Here is an example of a star schema for a sales data warehouse −
Fact table: Sales
o Date
o Product ID
o Customer ID
o Sales Quantity
o Sales Amount
Dimension tables:
o Date − Date, Month, Year, Quarter
o Product − Product ID, Product Name, Product Category
o Customer − Customer ID, Customer Name, Customer Location
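To make this schema concrete before moving on to the ETL example, the sketch below represents the fact table and one dimension table as pandas DataFrames and performs a star join; the sample rows are invented for illustration only:

# Sketch: a pandas star join over the schema above (sample data is invented).
import pandas as pd

product_dim = pd.DataFrame({
    "Product ID": [1, 2],
    "Product Name": ["Laptop", "Mouse"],
    "Product Category": ["Computers", "Accessories"],
})

sales_fact = pd.DataFrame({
    "Date": ["2024-01-15", "2024-01-16"],
    "Product ID": [1, 2],
    "Customer ID": [100, 101],
    "Sales Quantity": [1, 3],
    "Sales Amount": [900.0, 60.0],
})

# Resolve the Product ID foreign key against its dimension table, then aggregate.
report = sales_fact.merge(product_dim, on="Product ID", how="left")
print(report.groupby("Product Category")["Sales Amount"].sum())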
Example
Here is an example of an ETL-style transformation using Python and the Pandas library −
import pandas as pd
# Extract: read the raw sales data (the source file name 'sales.csv' is assumed here)
df = pd.read_csv('sales.csv')
# Transform data
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df['Sales Amount'] = df['Sales Quantity'] * df['Unit Price']
# Load: write the transformed data out (the target file name is also assumed)
df.to_csv('sales_fact.csv', index=False)
Conclusion
Thus, a data warehouse for real-time applications was designed.
EX6. ANALYSE THE DIMENSIONAL MODELING
Step 1: Select the Business Process
The first step involves selecting the business process, and it should be an action resulting in
output.
Dimension Tables
Some important points regarding Dimension Tables:
1. It stores textual information related to a business process.
2. It answers the ‘who, what, where, when, why, and how’ questions related to a particular
business process.
3. Dimension tables have more columns and fewer rows.
4. Each dimension table has a primary key that is joined to its given fact table.
5. Dimension attributes are the primary source of query constraints, grouping, and filtering.
Dimensions describe the measurements of the fact table. For example, customer ID appears in the
fact table, and the customer dimension describes its attributes further, such as the name of the
customer, the address of the customer, gender, etc.
Our dimensional model will have the following dimensions (a sketch for generating the Date
dimension follows this list):
Date Dimension
Product Dimension
Order Dimension
Customer Dimension
Promotion Dimension
Warehouse Dimension
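As an illustration of how one of these dimensions can be populated, the sketch below generates a small Date dimension with pandas; the date range and the surrogate-key scheme are assumptions made for the example:

# Sketch: generating a Date dimension table with pandas (date range is assumed).
import pandas as pd

dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
date_dim = pd.DataFrame({
    "date_key": dates.year * 10000 + dates.month * 100 + dates.day,  # e.g. 20240101
    "date": dates,
    "day": dates.day,
    "day_of_week": dates.day_name(),
    "month": dates.month,
    "quarter": dates.quarter,
    "year": dates.year,
})
print(date_dim.head())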
Conclusion
In summary, TechMart's implementation of OLTP has streamlined its online transaction
processing, enhanced customer experience, and provided a foundation for secure and scalable
e-commerce operations. The OLTP system plays a pivotal role in the day-to-day functioning of the
business, supporting TechMart's growth and ensuring the reliability of its online platform.
EX9. IMPLEMENTATION OF WAREHOUSE TESTING
The Importance of Data Warehouse Testing
With data driving critical business decisions, testing the data warehouse data integration process
is essential. Data comes from numerous sources. The data source affects data quality, so data
profiling and data cleaning must be ongoing. Source data history, business rules, or audit
information may no longer be available.
Understanding the ETL Testing Process
A solid understanding of data modeling provides testing teams with information to develop the
right testing strategy. During the analysis phase, the testing team must learn and understand the
different stages of the data warehouse implementation including but not limited to:
Source data profiling
Data warehouse design
ETL development
Data loading and transformations
ETL testing includes multiple phases, and testing should be executed throughout the lifecycle of
the data warehouse implementation, not just at the end.
Preparing for ETL Testing
A data warehouse implementation must include end-to-end testing. The QA team must test initial
and incremental loads for the entire ETL process, from identifying the source data through to the
report and portal functions. They must also test each point between extract and load, including the
data load from the source extract to staging, from staging to transformation, and, once the data
reaches the data warehouse, the data extraction for display and reporting.
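One simple check that can be applied at each of these hand-off points is a reconciliation of row counts and column totals between the source extract and the loaded target. The sketch below illustrates the idea with pandas; the file names and the reconciled column are assumptions made for the example:

# Sketch: reconcile a source extract against a staging load (file names are assumed).
import pandas as pd

source = pd.read_csv("source_extract.csv")
staging = pd.read_csv("staging_table.csv")

# Row-count check: no records dropped or duplicated during the load.
assert len(source) == len(staging), (
    f"Row count mismatch: source={len(source)}, staging={len(staging)}")

# Measure reconciliation: totals of a numeric column should match within rounding.
src_total = source["Sales Amount"].sum()
stg_total = staging["Sales Amount"].sum()
assert abs(src_total - stg_total) < 0.01, (
    f"Sales Amount mismatch: source={src_total}, staging={stg_total}")

print("Reconciliation checks passed.")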
With traditional ETL test planning, there are six phases:
1. Understanding business requirements/analysis
2. Creating test plans and estimating time to completion
3. Designing test cases and selecting test data
4. Executing tests with bug reporting and closure
5. Report summary and analysis
6. Test completion
ETL testing is performed in five stages:
1. Identifying data sources and requirements
2. Acquiring the data
3. Implementing business logic and dimensional modeling
4. Building and populating the data
5. Building reports
Conclusion:
Thus, warehouse testing was implemented and studied.