Creation of A Data Warehouse: To Create A Data Warehouse From Various .CSV Files Using Postgrsql Tool

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 18

Ex no: 1 CREATION OF A DATA WAREHOUSE

Date:

(Using ETL Tool Extract -Transform-Load)


AIM:
To create a data warehouse from various .csv files using Postgrsql tool.

WHAT IS A DATA WAREHOUSE?


Data warehouse databases are designed for query and analysis, not transactions. The data
that is collected from various sources is separated into analytic and transaction workloads while
enabling extraction, reporting, data mining and a number of different capabilities that transform
the information into actionable, useful applications.
The main data warehouse structures listed are the basic architecture, which is a simple set
up that allows end-users to directly access the data from numerous sources through the
warehouse, a second architecture is a warehouse with a staging area that simplifies warehouse
management and helps with cleaning and processing the data before it is loaded into the
warehouse. And finally there is the architecture with both a staging area and a data mart. Data
marts are used to create and maintain custom categories in organizations with specialized
database designed for specific businesses, so for example if an organization had a data
warehouse for sales, separate from advertising, then the data mart setup would best serve their
needs. To further understand a data warehouse, it is important to look at its characteristics, which
are subject orientation, integration, non-volatility, and time variance.
Subject Oriented: This refers to when data is giving information on aparticular subject.
For example, a company could use data to analyze their companys marketing data, and
its effectiveness. The devotion of a data warehouse to a specific matter is the key
component of a subject-oriented warehouse.
Integrated: This is when data gathered from a number of disparagingsources, and then all
gathered into a coherent whole. By organizing differing data into consistent formats,
companies can resolve problems and inconsistencies among units of measurement and
promote better results.
Nonvolatile: This refers to data that remains stable and unaffected by newdevelopments.
Once entered into the system, this data should not change, ensuring comparison analysis
over a long period of time.
Time Variant: This refers to data gathered is identified with a particular timeperiod and
focuses on change over time. By using a large amount of data, spread over a long time
period, analysts can decipher patterns, and business relationships that would have
otherwise been overlooked.

DATA WAREHOUSE INTEGRATION PROCESS


The whole purpose of data mining is to facilitate business analysis. And to accomplish
this, raw data must be arranged and consolidated into an information base usable by the firm.
This process is referred to as ETL (Extract, Transform, & Load), which though it may seem like
specified steps, is in opposition referring to a broader concept.
EXTRACTION
This step in the process refers to removing the data from its source and making it
accessible for further processing. All the needed data is retrieved without affecting the source
systems performance, response time or locking in a negative manner. This first step in the ETL
process usually involves a cleaning phase in which data quality is ensured through data
unification. The rules of unification should include things such as making identifiers unique such
as gender categories, phone number, and zip code conversions into standard form and validation
of address fields converted into the proper format.
TRANSFORMATION
This step applies a set of rules to change source data into similar dimensions so the same
units of measurement can be used. This transformation step also joins data from a variety of
sources, generates aggregates, surrogate keys and applies validation and new values.
LOADING
The loading phase is a two-part process of disabling constraints and indexes before the
load process starts and then enables them once the load is completed. In this step, the target of
the load process is often a database.
SETTING UP A DATA WAREHOUSE
The main purpose of a data warehouse is to organize large amounts of stable data to be
easily retrieved and analyzed. So when setting up, care must be taken to ensure the data is
rapidly accessible and easily analyzed. One way of designing this system is with the use of
dimensional modeling, which allows large volumes of data to be efficiently queried and
examined. Since much of the data in warehouses is stable, that is, unchanging, there is hardly a
need for repetitive backup methods. Also, once new data is loaded it can be updated and backed
up right away by way of, in some cases, the data preparation database, so it becomes available
for easy access. There are four categories of data warehousing tools; these are extraction, table
management, query management and data integrity tools. All these tools can be used in the setup
and maintenance of the best technology to manage and store the huge amounts of data a
company collects, analyzes and reviews.

COMPANY ANALYSIS
The first step, in setting up the companys data warehouse, is to evaluate the firms
objectives, For example, a growing company might set the objective to engage customers in
building rapport. By examining what the company needs to do to achieve these tasks, what will
need to be tracked, the key performance indicators to be noted and a numeric evaluation of the
companys activities the company can note and evaluate where they need to get started.
EXISTING SYSTEM ANALYSIS
By asking customers and various stakeholders pointed questions, Business
Intelligenceleaders can gather the performance information they currently have in place that is or
isnt effective. Reports can be collected from various departments in the company, and they may
even be able to collect analytical and summary reports from analysts and supervisors.
INFORMATION MODELING OF CORE BUSINESS PROCESSES
An information model is conceptual, and allows for one to form ideas of what business
processes need to be interrelating and how to get them linked. Since the data warehouse is a
collection of correlating structures, creating a concept of what indicators need to be linked
together to create top performance levels is a vital step in the information modeling stage. A
simple way to design this model is to gather key performance indicatorsinto fact tables and relate
them to dimensions such as customers, salespeople, products and such.
DESIGN AND TRACK
Once all those concepts are set in place, the next critical step is to move data into the
warehouse structure and track where it comes from and what it relates to. In this phase of design,
it is crucial to plan how to link data in the separate databases so that the information can be
connected as it is loaded into the data warehouse tables. The ETL process can be pretty complex
and require specialized programs with sophisticated algorithms, so the right tools have to be
chosen at the right, and most cost effective price for the job. Because the data is to be tracked
over time, the data will need to be available for a very long period. However the grain (atoms or
make up) of the data will defer over time, but the system should be set that the differing
granularity is still consistent throughout the singular data structure.
IMPLEMENTATION OF THE PLAN
Once the plan is developed, there is a viable basis for scheduling the project. Because the
project is grand, there should be phases of completion scheduled and then fit together upon
completion. With careful planning, the system can provide much-needed information on how
factors work together to help the organizations activities.

UPDATES
Since the data warehouse is set to retain data for long stretches at many levels of
granularity and has already been set to be consistent throughout the system, in the design phase
of the warehouse setup, there can be various storage plans that tie into the non-repetitive update.
As an example, an IT manager could set up a week and monthly grain storage systems. In the
day grain, data is stored in its original format for 2-3 years, after which it is summarized and
moved to the weekly grain structure where it could remain for another 3-5 years and then finally
to a monthly grain structure. This can all be set at the design phase to work with the different
grains based on data age and be done automatically.
DATA WAREHOUSE COMPONENTS
So as was the case in the design and set up phase of the warehouse, data was merged
from varying sources into a single related database. And so far we have seen that the point of
creating this warehouse structure is to retrieve information faster and more easily so a firm can
market faster, create more revenue, improve service standards and manage industry changes.
LOAD MANAGEMENT
Load management refers to the collection of information from varying internal and
external sources and summarizing, manipulating and changing the data into a format that allows
for analysis. To manage the load, raw data must be kept along with the changed versions to
enable construction of different representations as needed.
WAREHOUSE MANAGEMENT
Warehouse management is the day-by-day management of a data warehouse that ensures
the information is available and effectively backed up and secure.
QUERY MANAGEMENT
Query management allows access to the warehouse contents and may even include the
tasks of separating information into various categories to be presented to different users. Users
may access information through query tools or custom built applications.
DATA WAREHOUSE BACKUP, STORAGE & TOOLS
Like any other program, data warehouses can be tedious to design create and implement,
so special measures should be in place to ensure the information is not lost.
BACKUP AND RESTORATION
An automatic system should be put in place to ensure the information is secure and that if
needed data can be restored quickly and with little or no negative alterations. The first and most
vital step is to ensure the original information source is backed up and then following that a
weekly backup of the entire warehouse as it could prove costly to have to recreate the entire
system from scratch. The use of cold and multiplexing backup systems will ensure less need for
restoration. However, a disaster recovery site should be made available for copies of all key
system components. The simplest way to achieve this is using a system that automatically creates
copies and sends them to the disaster site. But there are systems that can copy hybrid database
systems and create the backup if that is needed as well.
ONLINE AND OFFLINE STORAGE
Data storage should be done both online and offline to avoid overwhelming the system or
having disk full issues. With the system setup to store data in different granularity settings,
one could stash older, raw data and unused or rarely used reports and multimedia offline. The
implementation of hierarchical storage management (storing files automatically to a secondary
source while allowing users access) is a smart move after implementing the data warehouse
setup.
STORAGE TOOLS
There are a few tools being created to rectify the storage issues that occur with data warehouses.
Storage Manager takes care of all the storage objects such as file systems, database,
network intelligence devices and disk and tape arrays. This system also collects data about
data, performs administrative duties and among other things let you see the health of your
data warehouse.
Storage Optimizer is another product that can be used for recommendations of actions
that will remove hot spots and improve online performance and reliability. It will also
include actions to take for offline storage based on historical patterns.
Storage Planner enables planning for large online and offline database capacity. This
program focuses on large, international databases and warehouses.
Create Data Warehouse using Postgresql tool (ETL Tool)
STEPS:
1.Click Start AllPrograms PostgreSQL 9.3 click pgAdmin III

2.Click this icon

3.Enter name, host, password is postgres.


4. Double click sudhagar(localhost:5432)
5.Right click -databases(1) and choose - new database and type database name
6.Double click dwftp and click scemas(1) right click select new schema - type
schema name.
7.Double Click dw tables(0) - Click icon SQL - Type query and click run icon.
8.Then close SQL query dialog box.
9.Upto above we created database along with primary key. Next we Import .csv file and store
data in the databases.
- Gototables choose table name and right click - choose Import option.
- Then select .csv file from the location and select format as csv and click Import button.
Now the data are stored in the databases.
10.Similarly do the following table - phonerate, timedim and facts table.

Display the warehouse data:

11.ClickSQL icon type the queries and click run button.

You might also like