Data Mining
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if the tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion, and label node N with it;
for each outcome j of the splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
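For concreteness, the following is a minimal sketch of decision tree induction in Python using scikit-learn. It is an illustration only: the library and the Iris sample data are assumptions, and scikit-learn's CART implementation differs in detail from the generic procedure above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)                  # training tuples D and their class labels
clf = DecisionTreeClassifier(criterion="entropy")  # attribute selection by information gain
clf.fit(X, y)                                      # grows the tree recursively, as in the method above
print(export_text(clf))                            # each root-to-leaf path is one decision rule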
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity is measured by the following two parameters: the number of leaves in the tree, and the error rate of the tree.
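As a hedged illustration of cost-complexity pruning, the Python sketch below uses scikit-learn's cost_complexity_pruning_path, which trades off exactly these two parameters (tree size in leaves versus error rate). The data set and split are assumptions chosen only for demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    leaves = pruned.get_n_leaves()            # first parameter: number of leaves
    error = 1 - pruned.score(X_test, y_test)  # second parameter: error rate on held-out data
    print(f"alpha={alpha:.4f}  leaves={leaves}  error={error:.3f}")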
What is a Data Warehouse?
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies the reporting and analysis processes of the organization.
It is also a single version of truth for any company for decision making and forecasting.
A data warehouse has the following characteristics:
Subject-Oriented
Integrated
Time-variant
Non-volatile
Subject-Oriented
A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis on modeling and analysis of data for decision making. It also provides a simple and concise view around a specific subject by excluding data that is not helpful to support the decision process.
Integrated
A data warehouse is developed by integrating data from varied sources, which may be stored in different formats. After the transformation and cleaning process, all this data is stored in a common format in the data warehouse.
Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is identified with a particular time period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the record key. Every primary key contained within the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
Non-volatile
A data warehouse is also non-volatile, meaning that previous data is not erased when new data is entered into it.
Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in a data warehouse environment. Only two types of data operations are performed in data warehousing:
1. Data loading
2. Data access
Here are some major differences between an operational application and a data warehouse:
In an operational application, complex programs must be coded to make sure that data update processes maintain high integrity of the final product. In a data warehouse, this kind of issue does not arise because data updates are not performed.
In an operational application, data is placed in a normalized form to ensure minimal redundancy. In a data warehouse, data is not stored in normalized form.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored; the goal is to remove data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-layer architecture physically separates the available sources from the data warehouse. This architecture is not expandable and does not support a large number of end-users. It also has connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used architecture. It consists of a bottom tier (the data warehouse database server), a middle tier (an OLAP server), and a top tier (front-end client tools for querying, reporting, and analysis).
Data Warehouse Components
The data warehouse is based on an RDBMS server, which is a central information repository surrounded by some key components that make the entire environment functional, manageable, and accessible.
The central database is the foundation of the data warehousing environment. This database is implemented on RDBMS technology. However, this kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional database processing rather than for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource-intensive and slow down performance.
The data sourcing, transformation, and migration tools are used for performing all the conversions, summarizations, and other changes needed to transform data into a unified format in the data warehouse. They are also called Extract, Transform and Load (ETL) tools.
These ETL tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update data in the data warehouse. These tools also help maintain the metadata.
ETL tools have to deal with the challenges of database and data heterogeneity.
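The sketch below shows, in Python, the kind of job an ETL tool might generate. The file names, column names, and the SQLite target are assumptions for illustration; real ETL tools also handle scheduling, error handling, and metadata maintenance.

import sqlite3
import pandas as pd

# Extract: read from two heterogeneous sources (assumed files)
orders = pd.read_csv("orders.csv")
customers = pd.read_json("crm_customers.json")

# Transform: unify formats and summarize into a common shape
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily_sales = orders.groupby("order_date", as_index=False)["amount"].sum()

# Load: write the unified results into the warehouse database
with sqlite3.connect("warehouse.db") as conn:
    daily_sales.to_sql("daily_sales", conn, if_exists="replace", index=False)
    customers.to_sql("dim_customer", conn, if_exists="replace", index=False)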
Metadata
The name metadata suggests some high-level technological concept, but it is quite simple. Metadata is data about data that defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
A raw data value is meaningless until we consult the metadata that tells us what it represents (for example, whether a number is a customer ID, an item code, or a sales amount). Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
Metadata helps to answer questions such as:
What tables, attributes, and keys does the Data Warehouse contain?
Where did the data come from?
How many times do data get reloaded?
What transformations were applied during cleansing?
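As a small, hedged example of how such questions can be answered from technical metadata, the Python sketch below reads table names and definitions from SQLite's built-in catalog table, sqlite_master (the warehouse.db file name is an assumption).

import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()

for name, ddl in tables:
    print(name)   # which tables does the warehouse contain?
    print(ddl)    # how were their attributes and keys defined?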
Query Tools
Query tools allow users to interact with the data warehouse system. They include reporting tools, managed query tools, application development tools, data mining tools, and OLAP tools.
Reporting tools: Reporting tools can be further divided into production reporting tools and desktop report writers.
1. Report writers: These are tools designed for end-users for their own analysis.
2. Production reporting: These tools allow organizations to generate regular operational reports. They also support high-volume batch jobs such as printing and calculating. Some popular reporting tools are Brio, Business Objects, Oracle, PowerSoft, and SAS Institute.
Managed query tools: These access tools help end users work around snags in SQL and database structure by inserting a meta-layer between users and the database.
Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of
an organization. In such cases, custom reports are developed using Application
development tools.
Data mining tools: Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data. Data mining tools are used to make this process automatic.
OLAP tools: These tools are based on the concept of a multidimensional database. They allow users to analyze the data using elaborate and complex multidimensional views.
Data warehouse Bus determines the flow of data in your warehouse. The data flow in a
data warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Meta
flow.
While designing a data warehouse bus, one needs to consider the shared dimensions and facts across data marts.
Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as an option to a large-size data warehouse, as it takes less time and money to build. However, there is no standard definition of a data mart; it differs from person to person.
In simple words, a data mart is a subsidiary of a data warehouse. The data mart is used to partition data for a specific group of users.
To design a data warehouse architecture, you need to follow the best practices given below:
Use a data model which is optimized for information retrieval: a dimensional model, a denormalized model, or a hybrid approach.
Ensure that data is processed quickly and accurately. At the same time, take an approach which consolidates data into a single version of the truth.
Carefully design the data acquisition and cleansing process for the data warehouse.
Design a metadata architecture which allows sharing of metadata between the components of the data warehouse.
Consider implementing an ODS model when information retrieval needs are near the bottom of the data abstraction pyramid, or when multiple operational sources need to be accessed.
Make sure that the data model is integrated and not just consolidated. In that case, consider a 3NF data model. It is also ideal for acquiring ETL and data cleansing tools.
Summary:
Data mining is looking for hidden, valid, and potentially useful patterns in huge data sets. Data mining is all about discovering unsuspected or previously unknown relationships amongst the data.
The insights derived via Data Mining can be used for marketing, fraud detection, and
scientific discovery, etc.
Types of Data
Relational databases
Data warehouses
Advanced DB and information repositories
Object-oriented and object-relational databases
Transactional and Spatial databases
Heterogeneous and legacy databases
Multimedia and streaming database
Text databases
Text mining and Web mining
Business understanding:
In this phase, business and data mining goals are established.
Data understanding:
In this phase, a sanity check on the data is performed to check whether it is appropriate for the data mining goals.
First, data is collected from the multiple data sources available in the organization. These data sources may include multiple databases, flat files, or data cubes.
Issues like object matching and schema integration can arise during the data integration process. It is a quite complex and tricky process, as data from various sources is unlikely to match easily. For example, table A contains an entity named cust_no, whereas another table B contains an entity named cust-id. Therefore, it is quite difficult to ensure that both of these objects refer to the same entity. Metadata should be used here to reduce errors in the data integration process.
Next, the step is to search for properties of the acquired data. A good way to explore the data is to answer the data mining questions (decided in the business phase) using query, reporting, and visualization tools.
Based on the query results, the data quality should be ascertained. Missing data, if any, should be acquired.
Data preparation:
The data preparation process consumes about 90% of the time of the project.
The data from different sources should be selected, cleaned, transformed, formatted,
anonymized, and constructed (if required).
Data cleaning is a process to "clean" the data by smoothing noisy data and filling in
missing values.
For example, in a customer demographics profile, age data may be missing. Such incomplete data should be filled in. In some cases, there could be data outliers; for instance, age has a value of 300. Data could also be inconsistent; for instance, the name of the same customer is different in different tables.
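A minimal sketch of these cleaning steps in Python with pandas is shown below. The column names, the plausible age range, and the use of the median as a fill value are assumptions for illustration.

import pandas as pd

customers = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "age":  [34.0, None, 300.0],                # one missing value and one outlier
})

valid = customers["age"].between(0, 120)        # assumed plausible age range
median_age = customers.loc[valid, "age"].median()
customers.loc[~valid, "age"] = median_age       # fills both the missing value and the outlier
print(customers)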
Data transformation:
Data transformation operations change the data to make it useful for data mining and contribute toward the success of the mining process. The following transformations can be applied:
Smoothing: removes noise from the data.
Aggregation: summary or aggregation operations are applied to the data.
Generalization: low-level data is replaced with higher-level concepts.
Normalization: attribute data is scaled so as to fall within a small specified range.
Attribute construction: new attributes are constructed from the given set of attributes and included to help the mining process.
The result of this process is a final data set that can be used in modeling.
Modelling:
In this phase, mathematical models are used to determine data patterns. Based on the business objectives, suitable modeling techniques are selected for the prepared data set, and a scenario is created to test and check the quality and validity of the model.
Evaluation:
In this phase, patterns identified are evaluated against the business objectives.
Results generated by the data mining model should be evaluated against the
business objectives.
Gaining business understanding is an iterative process. In fact, while
understanding, new business requirements may be raised because of data
mining.
A go or no-go decision is taken to move the model in the deployment phase.
Deployment:
In the deployment phase, you ship your data mining discoveries to everyday business
operations.
The knowledge or information discovered during data mining process should be
made easy to understand for non-technical stakeholders.
A detailed deployment plan for shipping, maintenance, and monitoring of the data mining discoveries is created.
A final project report is created with lessons learned and key experiences during
the project. This helps to improve the organization's business policy.
1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps classify data into different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are similar to each other. This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable,
given the presence of other variables.
4. Association Rules:
This data mining technique helps to find associations between two or more items. It discovers hidden patterns in the data set.
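The pure-Python sketch below illustrates the idea behind association rules by computing support and confidence for item pairs in a toy set of transactions. The basket data is invented; production tools use algorithms such as Apriori or FP-growth on much larger data.

from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(p for t in transactions for p in combinations(sorted(t), 2))

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n                          # how often a and b occur together
    confidence = count / item_counts[a]          # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")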
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud detection, or fault detection. Outlier detection is also called outlier analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
Example 1:
Consider the marketing head of a telecom service provider who wants to increase revenues from long distance services. For a high ROI on his sales and marketing efforts, customer profiling is important. He has a vast data pool of customer information like age, gender, income, credit history, etc. But it is impossible to determine the characteristics of people who prefer long distance calls with manual analysis. Using data mining techniques, he may uncover patterns between high long distance call users and their characteristics.
For example, he might learn that his best customers are married females between the
age of 45 and 54 who make more than $80,000 per year. Marketing efforts can be
targeted to such demographic.
Example 2:
A bank wants to search for new ways to increase revenues from its credit card operations.
They want to check whether usage would double if fees were halved.
Bank has multiple years of record on average credit card balances, payment amounts,
credit limit usage, and other key parameters. They create a model to check the impact
of the proposed new business policy. The data results show that cutting fees in half for
a targeted customer base could increase revenues by $10 million.
R-language:
The R language is an open source tool for statistical computing and graphics. R has a wide variety of statistical techniques (classical statistical tests, time-series analysis, classification) and graphical techniques. It offers effective data handling and storage facilities.
There is a chance that companies may sell useful information about their customers to other companies for money. For example, American Express has sold credit card purchases of their customers to other companies.
Much data mining analytics software is difficult to operate and requires advanced training to work with.
Different data mining tools work in different manners due to different algorithms
employed in their design. Therefore, the selection of correct data mining tool is a
very difficult task.
Data mining techniques are not always accurate, so they can cause serious consequences in certain conditions.
Insurance: Data mining helps insurance companies to price their products profitably and promote new offers to their new or existing customers.
Education: Data mining benefits educators by helping them access student data, predict achievement levels, and find students or groups of students who need extra attention, for example, students who are weak in maths.
Manufacturing: With the help of data mining, manufacturers can predict wear and tear of production assets. They can anticipate maintenance, which helps them minimize downtime.
Banking: Data mining helps the finance sector to get a view of market risks and manage regulatory compliance. It helps banks to identify probable defaulters in order to decide whether to issue credit cards, loans, etc.
Retail: Data mining techniques help retail malls and grocery stores identify and arrange the most sellable items in the most attention-getting positions. It helps store owners to come up with offers which encourage customers to increase their spending.
Service Providers: Service providers like mobile phone and utility industries use data mining to predict the reasons why a customer leaves the company. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses data mining techniques to get more customers into its eCommerce store.
Supermarkets: Data mining allows supermarkets to develop rules to predict whether their shoppers are likely to be expecting. By evaluating their buying patterns, they can find women customers who are most likely pregnant and start targeting products like baby powder, baby soap, diapers, and so on.
Crime Investigation: Data mining helps crime investigation agencies to deploy the police workforce (where is a crime most likely to happen and when?), decide whom to search at a border crossing, etc.
Bioinformatics: Data mining helps to mine biological data from the massive data sets gathered in biology and medicine.
Summary:
Data Mining is all about explaining the past and predicting the future for
analysis.
Data mining helps to extract information from huge sets of data. It is the
procedure of mining knowledge from data.
The data mining process includes business understanding, data understanding, data preparation, modelling, evaluation, and deployment.
Important data mining techniques are classification, clustering, regression, association rules, outlier detection, sequential patterns, and prediction.
R-language and Oracle Data mining are prominent data mining tools.
Data mining technique helps companies to get knowledge-based information.
The main drawback of data mining is that much analytics software is difficult to operate and requires advanced training to work with.
Data mining is used in diverse industries such as Communications, Insurance,
Education, Manufacturing, Banking, Retail, Service providers, eCommerce,
Supermarkets, and Bioinformatics.
What is Database?
A database is a collection of related data which represents some elements of the real
world. It is designed to be built and populated with data for a specific task. It is also a
building block of your data solution.
Advantages of Data Warehouse
A data warehouse helps business users to access critical data from a number of sources all in one place.
It provides consistent information on various cross-functional activities
Helps you to integrate many sources of data to reduce stress on the production
system.
Data warehouse helps you to reduce TAT (total turnaround time) for analysis and
reporting.
A data warehouse helps users to access critical data from different sources in a single place, so it saves users' time in retrieving data from multiple sources. You can also access data from the cloud easily.
A data warehouse allows you to store a large amount of historical data to analyze different periods and trends and make future predictions.
Enhances the value of operational business applications and customer
relationship management systems
Separates analytics processing from transactional databases, improving the
performance of both systems
Stakeholders and users may be overestimating the quality of data in the source systems; a data warehouse provides more accurate reports.
Difference between Database and Data Warehouse
Processing Method: A database uses Online Transactional Processing (OLTP), whereas a data warehouse uses Online Analytical Processing (OLAP).
Usage: A database helps to perform fundamental operations for your business, whereas a data warehouse allows you to analyze your business.
Tables and Joins: Tables and joins of a database are complex as they are normalized, whereas tables and joins are simple in a data warehouse because they are denormalized.
Storage Limit: A database is generally limited to a single application, whereas a data warehouse stores data from any number of applications.
Availability: In a database, data is available in real time, whereas a data warehouse is refreshed from source systems as and when needed.
Modeling: ER modeling techniques are used for designing a database, whereas dimensional data modeling techniques are used for designing a data warehouse.
Data Type: Data stored in a database is up to date, whereas a data warehouse stores current and historical data that may not be up to date.
Storage of Data: A database uses a flat relational approach for data storage, whereas a data warehouse uses a dimensional (and sometimes normalized) approach for the data structure, for example star and snowflake schemas.
Query Type: Simple transaction queries are used in a database, whereas complex queries are used in a data warehouse for analysis purposes.
Applications of Database
Sales & Production: Used for storing customer, product, and sales details.
Disadvantages of Data Warehouse
Adding new data sources takes time, and it is associated with high cost.
Sometimes problems associated with the data warehouse may be undetected for
many years.
Data warehouses are high maintenance systems. Extracting, loading, and
cleaning data could be time-consuming.
The data warehouse may look simple, but actually it is too complicated for average users. You need to provide training to end-users, who may otherwise end up not using the data mining and warehouse facilities.
Despite best efforts at project management, the scope of data warehousing will
always increase.
To sum up, we can say that a database helps to perform the fundamental operations of a business, while a data warehouse helps you to analyze your business. You choose either one of them based on your business goals.
The company is in a phase of rapid growth and will need the proper mix of
administrative, sales, production, and support personnel. Key decision-makers want to
know whether increasing overhead staffing is returning value to the organization. As
the company enhances the sales force and employs different sales modes, the leaders
need to know whether these modes are effective. External market forces are changing
the balance between a national and regional focus, and the leaders need to understand
this change's effects on the business.
The only way to gather this performance information is to ask questions. The leaders
have sources of information they use to make decisions. Start with these data sources.
Many are simple. You can get reports from the accounting package, the customer
relationship management (CRM) application, the time reporting system, etc. You'll need
copies of all these reports and you'll need to know where they come from.
Often, analysts, supervisors, administrative assistants, and others create analytical and
summary reports. These reports can be simple correlations of existing reports, or they
can include information that people overlook with the existing software or information
stored in spreadsheets and memos. Such overlooked information can include logs of
telephone calls someone keeps by hand, a small desktop database that tracks shipping
dates, or a daily report a supervisor emails to a manager. A big challenge for data
warehouse designers is finding ways to collect this information. People often write off
this type of serendipitous information as unimportant or inaccurate. But remember that
nothing develops without a reason. Before you disregard any source of information, you
need to understand why it exists.
Another part of this collection and analysis phase is understanding how people gather
and process the information. A data warehouse can automate many reporting tasks,
but you can't automate what you haven't identified and don't understand. The process
requires extensive interaction with the individuals involved. Listen carefully and repeat
back what you think you heard. You need to clearly understand the process and its
reason for existence. Then you're ready to begin designing the warehouse.
By this point, you must have a clear idea of what business processes you need to
correlate. You've identified the key performance indicators, such as unit sales, units
produced, and gross revenue. Now you need to identify the entities that interrelate to
create the key performance indicators. For instance, at our example company, creating
a training sale involves many people and business factors. The customer might not
have a relationship with the company. The client might have to travel to attend classes
or might need a trainer for an on-site class. New products such as Windows 2000 (Win2K) might be released often, prompting the need for training. The company might run a promotion or might hire a new salesperson.
The data warehouse is a collection of interrelated data structures. Each structure stores
key performance indicators for a specific business process and correlates those
indicators to the factors that generated them. To design a structure to track a business
process, you need to identify the entities that work together to create the key
performance indicator. Each key performance indicator is related to the entities that
generated it. This relationship forms a dimensional model. If a salesperson sells 60
units, the dimensional structure relates that fact to the salesperson, the customer, the
product, the sale date, etc.
Then you need to gather the key performance indicators into fact tables. You gather the
entities that generate the facts into dimension tables. To include a set of facts, you
must relate them to the dimensions (customers, salespeople, products, promotions,
time, etc.) that created them. For the fact table to work, the attributes in a row in the
fact table must be different expressions of the same event or condition. You can
express training sales by number of seats, gross revenue, and hours of instruction
because these are different expressions of the same sale. An instructor taught one class
in a certain room on a certain date. If you need to break the fact down into individual
students and individual salespeople, however, you'd need to create another table
because the detail level of the fact table in this example doesn't support individual
students or salespeople. A data warehouse consists of groups of fact tables, with each
fact table concentrating on a specific subject. Fact tables can share dimension tables (e.g., the same customer can buy products, generate shipping costs, and return products).
This sharing lets you relate the facts of one fact table to another fact table. After the
data structures are processed as OLAP cubes, you can combine facts with related
dimensions into virtual cubes.
After identifying the business processes, you can create a conceptual model of the data.
You determine the subjects that will be expressed as fact tables and the dimensions
that will relate to the facts. Clearly identify the key performance indicators for each
business process, and decide the format to store the facts in. Because the facts will
ultimately be aggregated together to form OLAP cubes, the data needs to be in a
consistent unit of measure. The process might seem simple, but it isn't. For example, if
the organization is international and stores monetary sums, you need to choose a
currency. Then you need to determine when you'll convert other currencies to the
chosen currency and what rate of exchange you'll use. You might even need to track
currency-exchange rates as a separate factor.
Now you need to relate the dimensions to the key performance indicators. Each row in
the fact table is generated by the interaction of specific entities. To add a fact, you need
to populate all the dimensions and correlate their activities. Many data systems,
particularly older legacy data systems, have incomplete data. You need to correct this
deficiency before you can use the facts in the warehouse. After making the corrections,
you can construct the dimension and fact tables. The fact table's primary key is a
composite key made from a foreign key of each of the dimension tables.
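To make the structure concrete, here is a hedged sketch of such a star schema in SQLite, driven from Python. The table and column names are illustrative assumptions, not the example company's actual design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, sale_date TEXT);

-- the fact table's primary key is a composite of the dimension foreign keys
CREATE TABLE fact_sales (
    customer_id   INTEGER REFERENCES dim_customer(customer_id),
    product_id    INTEGER REFERENCES dim_product(product_id),
    date_id       INTEGER REFERENCES dim_date(date_id),
    units_sold    INTEGER,
    gross_revenue REAL,
    PRIMARY KEY (customer_id, product_id, date_id)
);
""")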
Data warehouse structures are difficult to populate and maintain, and they take a long
time to construct. Careful planning in the beginning can save you hours or days of
restructuring.
Now that you know what you need, you have to get it. You need to identify where the
critical information is and how to move it into the data warehouse structure. For
example, most of our example company's data comes from three sources. The
company has a custom in-house application for tracking training sales. A CRM package
tracks the sales-force activities, and a custom time-reporting system keeps track of
time.
You need to move the data into a consolidated, consistent data structure. A difficult
task is correlating information among the in-house, CRM, and time-reporting
databases. The systems don't share information such as employee numbers, customer
numbers, or project numbers. In this phase of the design, you need to plan how to
reconcile data in the separate databases so that information can be correlated as it is
copied into the data warehouse tables.
You'll also need to scrub the data. In online transaction processing (OLTP) systems,
data-entry personnel often leave fields blank. The information missing from these
fields, however, is often crucial for providing an accurate data analysis. Make sure the
source data is complete before you use it. You can sometimes complete the information
programmatically at the source. You can extract ZIP codes from city and state data, or
get special pricing considerations from another data source. Sometimes, though,
completion requires pulling files and entering missing data by hand. The cost of fixing
bad data can make the system cost-prohibitive, so you need to determine the most
cost-effective means of correcting the data and then forecast those costs as part of the
system cost. Make corrections to the data at the source so that reports generated from
the data warehouse agree with any corresponding reports generated at the source.
You'll need to transform the data as you move it from one data structure to another.
Some transformations are simple mappings to database columns with different names.
Some might involve converting the data storage type. Some transformations are unit-
of-measure conversions (pounds to kilograms, centimeters to inches), and some are
summarizations of data (e.g., how many total seats sold in a class per company, rather
than each student's name). And some transformations require complex programs that
apply sophisticated algorithms to determine the values. So you need to select the right
tools (e.g., Data Transformation Services—DTS—running ActiveX scripts, or third-party
tools) to perform these transformations. Base your decision mainly on cost, including
the cost of training or hiring people to use the tools, and the cost of maintaining the
tools.
You also need to plan when data movement will occur. While the system is accessing
the data sources, the performance of those databases will decline precipitously.
Schedule the data extraction to minimize its impact on system users (e.g., over a
weekend).
Data warehouse structures consume a large amount of storage space, so you need to
determine how to archive the data as time goes on. But because data warehouses track
performance over time, the data should be available virtually forever. So, how do you
reconcile these goals?
The data warehouse is set to retain data at various levels of detail, or granularity. This
granularity must be consistent throughout one data structure, but different data
structures with different grains can be related through shared dimensions. As data
ages, you can summarize and store it with less detail in another structure. You could
store the data at the day grain for the first 2 years, then move it to another structure.
The second structure might use a week grain to save space. Data might stay there for
another 3 to 5 years, then move to a third structure where the grain is monthly. By
planning these stages in advance, you can design analysis tools to work with the
changing grains based on the age of the data. Then if older historical data is imported,
it can be transformed directly into the proper format.
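A brief sketch of this re-graining idea in Python with pandas follows; the two-year cutoff and column names are assumptions chosen only to illustrate rolling daily facts up to a weekly grain.

import pandas as pd

daily = pd.DataFrame({
    "sale_date": pd.date_range("2020-01-01", periods=1500, freq="D"),
    "units": 1,
})

cutoff = daily["sale_date"].max() - pd.DateOffset(years=2)
aged = daily[daily["sale_date"] < cutoff]              # facts older than two years

weekly = (aged.set_index("sale_date")
              .resample("W")["units"]
              .sum()
              .reset_index())                          # summarized to the week grain
print(weekly.head())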
Step 7: Implement the Plan
After you've developed the plan, it provides a viable basis for estimating work and
scheduling the project. The scope of data warehouse projects is large, so phased
delivery schedules are important for keeping the project on track. We've found that an
effective strategy is to plan the entire warehouse, then implement a part as a data mart
to demonstrate what the system is capable of doing. As you complete the parts, they fit
together like pieces of a jigsaw puzzle. Each new set of data structures adds to the
capabilities of the previous structures, bringing value to the system.
After the tools and team personnel selections are made, the data warehouse design can
begin. The following are the typical steps involved in the data warehousing project
cycle.
Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Report Development
Performance Tuning
Query Optimization
Quality Assurance
Rolling out to Production
Production Maintenance
Incremental Enhancements
Each item listed above represents a typical data warehouse design phase.
Query-driven Approach
Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach
was used to build wrappers and integrators on top of multiple heterogeneous
databases. These integrators are also known as mediators.
Disadvantages
Query-driven approach needs complex integration and filtering processes.
This approach is very inefficient.
It is very expensive for frequent queries.
This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems
follow update-driven approach rather than the traditional approach discussed earlier.
In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance.
The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.
Query processing does not require an interface to process data at local sources.
1. Improving integration
An organization registers data in various systems which support the various business
processes. In order to create an overall picture of business operations, customers, and
suppliers – thus creating a single version of the truth – the data must come together in
one place and be made compatible. Both external (from the environment) and internal
data (from ERP, CRM and financial systems) should merge into the data warehouse and
then be grouped.
2. Speeding up response times
The source systems are fully optimized in order to process many small transactions,
such as orders, in a short time. Generating information about the performance of the
organization only requires a few large ‘transactions’ in which large volumes of data are
gathered and aggregated. The structure of a data warehouse is specifically designed to
quickly analyze such large volumes of (big) data.
3. Faster and more flexible reporting
The structure of both data warehouses and data marts enables end users to report in a
flexible manner and to quickly perform interactive analysis based on various predefined
angles (dimensions). They may, for example, with a single mouse click jump from year
level, to quarter, to month level, and quickly switch between the customer dimension
and the product dimension, all while the indicator remains fixed. In this way, end users
can actually juggle the data and thus quickly gain knowledge about business operations
and performance indicators.
4. Recording changes to build history
Source systems don’t usually keep a history of certain data. For example, if a customer
relocates or a product moves to a different product group, the (old) values will most
likely be overwritten. This means they disappear from the system – or at least they’re
very difficult to trace back. That’s a pity, because in order to generate reliable
information, we actually need these old values, as users sometimes want to be able to
look back in time. In other words: we want to be able to look at the organization’s
performance from a historical perspective – in accordance with the organizational
structure and product classifications of that time – instead of in the current context. A
data warehouse ensures that data changes in the source system are recorded, which
enables historical analysis.
5. Increasing data quality
Stakeholders and users frequently overestimate the quality of data in the source
systems. Unfortunately, source systems quite often contain data of poor quality. When
we use a data warehouse, we can greatly improve the data quality, either through –
where possible – correcting the data while loading or by tackling the problem at its
source.
6. Unburdening operational systems
By transferring data to a separate computer in order to analyze it, the operational
system is unburdened.
7. Unburdening the IT department
A data warehouse and Business Intelligence tools allow employees within the
organization to create reports and perform analyses independently. However, an
organization will first have to invest in order to set up the required infrastructure for
that data warehouse and those BI tools. The following principle applies: the better the
architecture is set up and developed, the more complex reports users can
independently create. Obviously, users first need sufficient training and support, where
necessary. Yet, what we see in practice is that many of the more complex reports end
up being created by the IT department. This is mostly due to users lacking either the
time or the knowledge. That’s why data literacy is an important factor. Another reason
may be that the organization hasn’t put enough effort into developing the right
architecture.
8. Increasing recognizability
Indicators are ‘prepared’ in the data warehouse. This allows users to create complex
reports on, for example, returns on customers, or on service levels divided by month,
customer group and country, in a few simple steps. In the source system this
information only emerges when we manually perform a large number of actions and
calculations. Using a data warehouse thus increases the recognizability of the
information we require, provided that the data warehouse is set up based on the
business.
9. Increasing findability
When we create a data warehouse, we make sure that users can easily access the
meaning of data. (In the source system, these meanings are either non-existent or
poorly accessible.) With a data warehouse, users can find data more quickly, and thus
establish information and knowledge faster.
All the goals of the data warehouse serve the aims of Business Intelligence: making
better decisions faster at all levels within the organization and even across
organizational boundaries.
Applications
The utility of artificial neural network models lies in the fact that they can be used to
infer a function from observations. This is particularly useful in applications where the
complexity of the data or task makes the design of such a function by hand impractical.
Real-life applications
The tasks to which artificial neural networks are applied tend to fall within a few broad categories, such as function approximation (regression analysis), classification (including pattern and sequence recognition), data processing (filtering, clustering, compression), and control.
Application areas include system identification and control (vehicle control, process
control, natural resources management), quantum chemistry, game-playing and
decision making (backgammon, chess, poker), pattern recognition (radar systems, face
identification, object recognition and more), sequence recognition (gesture, speech,
handwritten text recognition), medical diagnosis, financial applications (automated
trading systems), data mining (or knowledge discovery in databases, “KDD”),
visualization and e-mail spam filtering.
Artificial neural networks have also been used to diagnose several cancers. An ANN
based hybrid lung cancer detection system named HLND improves the accuracy of
diagnosis and the speed of lung cancer radiology. These networks have also been used
to diagnose prostate cancer. The diagnoses can be used to build models from a large group of patients and compare them against the information of a single given patient. The models do not depend on assumptions about correlations between different variables.
Colorectal cancer has also been predicted using neural networks. Neural networks can predict the outcome for a patient with colorectal cancer with more accuracy than current clinical methods. After training, the networks could predict multiple patient outcomes from unrelated institutions.
Neural networks and neuroscience
Theoretical and computational neuroscience is the field concerned with the theoretical
analysis and computational modeling of biological neural systems. Since neural systems
are intimately related to cognitive processes and behavior, the field is closely related to
cognitive and behavioral modeling.
The aim of the field is to create models of biological neural systems in order to
understand how biological systems work. To gain this understanding, neuroscientists
strive to make a link between observed biological processes (data), biologically
plausible mechanisms for neural processing and learning (biological neural network
models) and theory (statistical learning theory and information theory).
Types of models
Many models are used in the field, defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behavior of individual neurons, through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behavior can arise from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and their relation to learning and memory, from the individual neuron to the system level.
While initial research had been concerned mostly with the electrical characteristics of
neurons, a particularly important part of the investigation in recent years has been the
exploration of the role of neuromodulators such as dopamine, acetylcholine, and
serotonin on behavior and learning.
Biophysical models, such as BCM theory, have been important in understanding
mechanisms for synaptic plasticity, and have had applications in both computer science
and neuroscience. Research is ongoing in understanding the computational algorithms
used in the brain, with some recent biological evidence for radial basis networks and
neural backpropagation as mechanisms for processing data.
Computational devices have been created in CMOS for both biophysical simulation and
neuromorphic computing. More recent efforts show promise for creating nanodevices
for very large scale principal components analyses and convolution. If successful, these
efforts could usher in a new era of neural computing that is a step beyond digital
computing, because it depends on learning rather than programming and because it is
fundamentally analog rather than digital even though the first instantiations may in fact
be with CMOS digital devices.
Hierarchical clustering algorithm
A hierarchical clustering algorithm is of two types:
i) Agglomerative hierarchical clustering (bottom-up)
ii) Divisive hierarchical clustering (top-down)
Both of these algorithms are exactly the reverse of each other, so we will cover the agglomerative hierarchical clustering algorithm in detail.
Agglomerative hierarchical clustering: This algorithm works by grouping the data one by one on the basis of the nearest distance measure over all the pairwise distances between the data points. The distances between the data points are then recalculated, but which distance should be considered once groups have been formed? For this there are many available methods. Some of them are:
1) single-nearest distance (single linkage)
2) complete-farthest distance (complete linkage)
3) average-average distance (average linkage)
4) centroid distance
In this way we go on grouping the data until one cluster is formed. Then, on the basis of the dendrogram, we can decide how many clusters should actually be present.
Let X = {x1, x2, x3, ..., xn} be the set of data points.
1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2) Find the least-distance pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4) Update the distance matrix by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster.
5) If all the data points are in one cluster then stop; otherwise repeat from step 2.
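For reference, a short Python sketch of this procedure using SciPy is given below; the sample points are invented, linkage builds the merge sequence, and fcluster cuts the resulting dendrogram into a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

Z = linkage(X, method="centroid")                 # merge by centroid distance, as in method 4 above
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)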
Advantages
1) No a priori information about the number of clusters is required.
2) Easy to implement, and it gives the best result in some cases.
Disadvantages
1) The algorithm can never undo what was done previously; a merge cannot be reversed.
2) Time complexity of at least O(n^2 log n) is required, where 'n' is the number of data points.
3) Depending on the type of distance measure chosen for merging, different algorithms can suffer from sensitivity to noise and outliers, difficulty handling clusters of different sizes, or a tendency to break large clusters.
1- Requirement gathering
A data mining project starts with requirement gathering and understanding. Data mining analysts or users define the requirement scope from the vendor's business perspective. Once the scope is defined, we move to the next phase.
2- Data exploration
In this step, data mining experts gather, evaluate, and explore the requirements of the project. Experts understand the problems and challenges and convert them to metadata. In this step, data mining statistics are used to identify and convert data patterns.
3- Data preparations
Data mining experts convert the data into meaningful information for the modelling step. They use the ETL process: extract, transform, and load. They are also responsible for creating new data attributes. Various tools are used to present the data in a structured format without changing the meaning of the data sets.
4- Modelling
Data experts put their best tools in place for this step, as it plays a vital role in the complete processing of the data. All modelling methods are applied to filter the data in an appropriate manner. Modelling and evaluation are correlated steps and are carried out together to check the parameters. Once the final modelling is done, the outcome is quality checked.
5- Evaluation
This is the filtering process that follows successful modelling. If the outcome is not satisfactory, it is passed through the model again. Upon the final outcome, the requirements are checked again with the vendor so that no point is missed. Data mining experts judge the complete result at the end.
6- Deployment
This is the final stage of the complete process. Experts present the data to vendors in
the form of spreadsheets or graphs.
After the data mining process has been applied, it is possible to extract information that has been filtered and refined. Usually, the data mining process is divided into three sections: pre-processing of data, mining the data, and validation of the data. Generally, this process involves the conversion of data into valid information.
ADVANTAGES OF DATA MINING
1. With the help of data mining, marketing companies build data models and predictions based on historical data. They run campaigns, marketing strategies, etc. This leads to success and rapid growth.
2. The retail industry is on the same page as marketing companies: with data mining, they rely on prediction-based models for their goods and services. Retail stores can gain better production and customer insights, and discounts and redemptions are based on historical data.
3. Data mining gives banks insight into their financial benefits and updates. They build models based on customer data and then assess the loan process, which is truly based on data mining. Data mining serves the banking industry in many other ways as well.
4. Manufacturing benefits from data mining by analyzing engineering data and detecting faulty devices and products. This helps manufacturers remove defective items and keep the best products and services in place.
5. It helps government bodies to analyze financial data and transactions and model them into useful information.
6. With data mining, organizations can improve planning and decision making.
7. New revenue streams are generated with the help of Data mining which
results in organization growth.
8. Data mining not only helps in predictions but also helps in the development of
new services and products.
9. Customers gain better insights from the organization, which increases customer engagement and interactions.
10. Once competitive advantages are gained, data mining also helps to reduce costs.
There are many more benefits of data mining and its useful features. When data mining is combined with analytics and big data, it turns into a new trend that meets the demands of the data-driven market.
Conclusion
It is important to note that time must be spent on getting valid information from the data. Therefore, if you want to make your business grow rapidly, you need to make accurate and quick decisions that can grab the available opportunities in a timely manner.
Data mining is a rapidly growing industry in this technology-driven world. Nowadays, everyone requires their data to be used in an appropriate manner and with the right approach in order to obtain useful and accurate information.
Loginworks Softwares is one of the best data mining outsourcing organizations, employing highly qualified and experienced staff in the market research industry.
Data Mining Issues
Data mining systems face a lot of challenges and issues in today's world. Some of them are discussed below.
Data mining is not an easy task, as the algorithms used can get very complex and the data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively extract the information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of
data mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.
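The toy Python sketch below mirrors this divide-process-merge idea: item counts are computed on partitions in parallel and then merged. It is only an illustration of the principle, not a real parallel mining algorithm.

from collections import Counter
from multiprocessing import Pool

def count_items(partition):
    # process one partition independently
    return Counter(item for transaction in partition for item in transaction)

if __name__ == "__main__":
    data = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]] * 1000
    partitions = [data[i::4] for i in range(4)]        # divide the data into partitions
    with Pool(4) as pool:
        partials = pool.map(count_items, partitions)   # process the partitions in parallel
    merged = sum(partials, Counter())                  # merge the partial results
    print(merged.most_common(3))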
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It contains the data ownership information, business definition, and changing policies.
Technical Metadata − It includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes and
indices.
Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation applied
on it.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data, yet it plays an important role. The
various roles of metadata are explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the
data warehouse.
Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly
summarized data.
Metadata also helps in summarization between lightly detailed data and highly
summarized data.
Metadata is used for query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the
following metadata −
Definition of data warehouse − It includes the description of structure of
data warehouse. The description is defined by schema, view, hierarchies,
derived data definitions, and data mart locations and contents.
Business metadata − It contains the data ownership information, business definition, and changing policies.
Operational Metadata − It includes currency of data and data lineage.
Currency of data means whether the data is active, archived, or purged.
Lineage of data means the history of data migrated and transformation applied
on it.
Data for mapping from operational environment to data warehouse − It
includes the source databases and their contents, data extraction, data partition
cleaning, transformation rules, data refresh and purging rules.
Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.
What is Clustering?
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition will represent a cluster, and k ≤ n. This means that it will classify the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an
initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
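A minimal sketch of a partitioning method in Python is shown below, using k-means from scikit-learn (the sample points and k = 2 are assumptions). It creates an initial partitioning and improves it by iterative relocation.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5], [8.0, 8.0], [8.0, 9.0], [9.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # each object belongs to exactly one of the k groups
print(km.cluster_centers_)   # each group contains at least one object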
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
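The agglomerative (bottom-up) approach can be sketched with SciPy's hierarchical clustering routines, assuming SciPy is available; the sample points and the choice of "average" linkage are made up for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points; each point starts out as its own group.
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1], [9.0, 9.0]])

# linkage repeatedly merges the two closest groups (bottom-up)
# until everything is in a single cluster; Z records the merge history.
Z = linkage(X, method="average")

# Cut the hierarchy so that at most 2 clusters remain (a termination condition).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)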
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold, i.e., the neighborhood of a given radius around each data point within a cluster has to contain at least a minimum number of points.
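A density-based clustering sketch using scikit-learn's DBSCAN (assumed to be available): eps plays the role of the neighborhood radius and min_samples the minimum number of points described above. The data points and parameter values are invented for the example.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions plus one isolated point (an outlier).
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1], [8.0, 7.9],
              [20.0, 20.0]])

# A cluster keeps growing while each point's eps-neighborhood
# contains at least min_samples points.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labeled -1 are treated as noise/outliers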
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to a given model. This method locates the clusters by using a density function that reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.
Data Mining - Rule Based Classification
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buys computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
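As a small illustration (not from the original text), here is how a rule such as R1 could be applied in Python; the tuple fields, the default class, and the helper names are hypothetical.

# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
def r1_antecedent(tuple_):
    # All attribute tests in the antecedent are logically ANDed.
    return tuple_["age"] == "youth" and tuple_["student"] == "yes"

def classify(tuple_, default="no"):
    # If the antecedent is satisfied, the rule fires and the consequent
    # (the class prediction) is returned; otherwise fall back to a default class.
    if r1_antecedent(tuple_):
        return "yes"        # buys_computer = yes
    return default          # assumed default class

print(classify({"age": "youth", "student": "yes"}))   # yes
print(classify({"age": "senior", "student": "no"}))   # no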
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
Rule_set = Rule_set + Rule; // add new rule to rule set
until termination condition;
end for
return Rule_set;
The effectiveness of pruning can be assessed with the FOIL_Prune measure:
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses relational or extended-
relational DBMS.
ROLAP includes the following −
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data. The aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting a single value along that one dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube is based on selection criteria that involve three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of data. Consider the following diagram that
shows the pivot operation.
OLAP vs OLTP
Sr.No | Data Warehouse (OLAP) | Operational Database (OLTP)
2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
Following are the 3 chief types of multidimensional schemas, each having its unique advantages.
Star Schema
Snowflake Schema
Galaxy Schema
What is a Star Schema?
The star schema is the simplest type of Data Warehouse schema. It is known as a star schema because its structure resembles a star. In the star schema, the center of the star can have one fact table and a number of associated dimension tables. It is also known as the Star Join Schema and is optimized for querying large data sets.
For example, as you can see in the above-given image, the fact table is at the center and contains keys to every dimension table, such as Deal_ID, Model ID, Date_ID, Product_ID, and Branch_ID, along with other attributes like units sold and revenue.
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to implement when a dimension is added to the schema.
Due to multiple tables, query performance is reduced.
The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance effort because of the additional lookup tables.
Star Schema | Snowflake Schema
Hierarchies for the dimensions are stored in the dimensional table. | Hierarchies are divided into separate tables.
In a star schema, only a single join creates the relationship between the fact table and any dimension table. | A snowflake schema requires many joins to fetch the data.
A single dimension table contains aggregated data. | Data is split into different dimension tables.
Offers higher-performing queries using Star Join Query Optimization. Tables may be connected with multiple dimensions. | The snowflake schema is represented by a centralized fact table which is unlikely to be connected with multiple dimensions.
Snowflake schema contains fully expanded hierarchies. However, this can add
complexity to the Schema and requires extra joins. On the other hand, star schema
contains fully collapsed hierarchies, which may lead to redundancy. So, the best
solution may be a balance between these two schemas which is star cluster schema
design.
In a star schema, the fact table contains a large amount of data with no redundancy. Each dimension table is joined with the fact table using a primary or foreign key.
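The star join described above can be sketched with pandas; the table contents, and all column names other than the keys and measures mentioned in the text (Product_ID, Branch_ID, units sold, revenue), are invented for illustration.

import pandas as pd

# Fact table: foreign keys plus numeric measures (units sold, revenue).
fact_sales = pd.DataFrame({
    "Product_ID": [1, 2, 1],
    "Branch_ID":  [10, 10, 20],
    "Units_Sold": [5, 3, 7],
    "Revenue":    [500.0, 450.0, 700.0],
})

# Dimension tables: small and descriptive, keyed by their primary keys.
dim_product = pd.DataFrame({"Product_ID": [1, 2], "Product_Name": ["Phone", "Tablet"]})
dim_branch  = pd.DataFrame({"Branch_ID": [10, 20], "City": ["Boston", "Austin"]})

# A single join per dimension links the fact table to each dimension table.
report = (fact_sales
          .merge(dim_product, on="Product_ID")
          .merge(dim_branch, on="Branch_ID"))
print(report.groupby("City")["Revenue"].sum())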
Fact Tables
A fact table has two types of columns: foreign key columns (pointing to the dimension tables) and columns of numeric measure values.
Fact Table
Dimension Tables
A dimension table is generally small in size compared to a fact table. The primary key of a dimension table is a foreign key in the fact table.
The snowflake schema is more complex than the star schema because the dimension tables of the snowflake schema are normalized.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model are normalized to reduce redundancies.
The difference between FP growth algorithm and Apriori algorithm is given below:
Difference Between Fp growth and Apriori Algorithm
Fp Growth Algorithm
FP Growth mines frequent itemsets without candidate generation, by building a compact FP-tree from the transactions.
Apriori Algorithm
Apriori is a classic algorithm for mining frequent itemsets for Boolean association rules. It uses a bottom-up approach, designed for finding association rules in a database that contains transactions. Because it repeatedly generates and tests candidate itemsets, it is comparatively very slow.
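A bare-bones sketch of the Apriori idea (candidate generation plus support counting) in Python; the transactions and the minimum support threshold are invented, and real implementations add pruning and many optimizations that are omitted here.

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # Level 1: candidate single items.
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate (bottom-up, level by level).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items()
                     if n / len(transactions) >= min_support}
        frequent.update(survivors)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"], ["milk"]]
for itemset, count in apriori(txns, min_support=0.5).items():
    print(set(itemset), count)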
Extracting important knowledge from a very large amount of data can be crucial to
organizations for the process of decision-making.
1 Association
2 Classification
3 Clustering
4 Sequential patterns
5 Decision tree.
1 Association Technique
The association technique helps to find patterns in huge data sets, based on a relationship between two or more items of the same transaction. The association technique is used in market analysis; it helps us analyze people's buying habits. For example, you might identify that a customer always buys ice cream whenever he comes to watch a movie, so it is likely that the next time he comes to watch a movie he will also want to buy ice cream.
2 Classification Technique
The classification technique is the most common data mining technique. In the classification method we use mathematical techniques such as decision trees, neural networks, and statistics in order to predict unknown records. This technique helps in deriving important information about the data.
Let us assume you have a set of records; each record contains a set of attributes, and depending upon these attributes you will be able to predict unseen or unknown records. For example, given all records of employees who left the company, with the classification technique you can predict which employees will probably leave the company in a future period.
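A small classification sketch using scikit-learn's DecisionTreeClassifier (assumed to be available); the employee features, their encoding, and the labels are entirely made up to mirror the attrition example.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical records: [years_at_company, overtime_hours_per_week]
# Label: 1 = left the company, 0 = stayed.
X = [[1, 15], [2, 12], [8, 2], [10, 1], [1, 14], [9, 3]]
y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Predict for unseen employees.
print(model.predict([[2, 13], [7, 2]]))   # e.g. likely to leave / likely to stay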
3 Clustering Technique
Clustering is one of the oldest techniques used in the process of data mining. The main aim of the clustering technique is to make clusters (groups) from pieces of data which share common characteristics. The clustering technique helps to identify the differences and similarities between the data.
Take the example of a shop in which many items are for sale; the challenge is how to arrange those items in such a way that customers can easily find the item they need. By using the clustering technique, you can keep items that share some similarities in one corner and items with other similarities in another corner.
4 Sequential patterns
Sequential patterns are a useful method for identifying trends and similar recurring patterns. For example, if in customer data you identify that a customer buys a particular product at a particular time of year, you can use this information to suggest that product to the customer at that time of year.
5 Decision tree
The decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In a decision tree you start with a simple question which has two or more answers. Each answer leads to a further question with two or more answers, which helps us to reach a final decision. The root node of the decision tree is a simple question.
For example: first check the water level; if the water level is > 50 ft, an alert is sent, and if the water level is < 50 ft, check again: if the water level is > 30 ft, send a warning, and if the water level is < 30 ft, the water is in the normal range.
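The water-level decision tree above can be written directly as nested conditionals; a minimal sketch, with only the thresholds taken from the text and the function name invented.

def water_level_action(level_ft):
    # Root question: is the water level above 50 ft?
    if level_ft > 50:
        return "alert"
    # Next question: is it above 30 ft?
    if level_ft > 30:
        return "warning"
    return "normal range"

for level in (60, 40, 20):
    print(level, "->", water_level_action(level))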
What is a Data Warehouse?
Data mining refers to the extraction of information from large amounts of data and storing this information in various data sources such as databases and data warehouses.
A data warehouse is a place which stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for the process of decision-making.
Data Warehouse
Relational database management systems, ERP systems, flat files, and CRM systems are some of the sources from which a data warehouse stores information.
Data warehouses reside on servers and run a database management system (DBMS) such as SQL Server, together with load software like SQL Server Integration Services (SSIS) to pull data from the sources into the data warehouse.
A data warehouse usually stores many months or years of data to support historical
analysis.
Data warehouses are built in order to help users understand and enhance their organization's performance.
The idea behind data warehouse is to create a permanent storage space for the data
needed to support reporting, analysis, dashboards, and other BI functions.
2 Integrated:-
A data warehouse stores integrated data collected from heterogeneous sources such as relational databases, flat files, etc. Data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc.
3 Time-variant:-
A data warehouse usually stores many months or years of data to support historical
analysis. Data stored in data warehouse provide information from a historical
perspective.
4 Non-volatile:-
A data warehouse is non-volatile in nature, which means that when new data is added, old data is not erased.
By collecting and analyzing public and private sector data, government data mining
is able to identify potential terrorists or other dangerous activities by
unknown individuals. However, this capability continues to raise concerns for private
citizens when it comes to the government’s access to personal data. In fact, in the
absence of specific laws for privacy, the government could have unlimited access to a
massive volume of personal data from behaviors, conversations and habits of people
who have done nothing to legitimize suspicions, and it could create a dramatic increase
in government monitoring of individuals.
The privacy issues raised concern the quality and accuracy of the mined data, how it is used (especially uses outside the original purview), protection of the data, and the right of individuals to know that their personal information is being collected and how it is being used.
On the other side, the lack of legal rules for government data mining restricts the
ability of companies to use and share the information with governments in legal
ways, making the development of new data mining programs slower.
What Is Hypothesis Testing?
Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends
on the nature of the data used and the reason for the analysis. Hypothesis testing is
used to infer the result of a hypothesis performed on sample data from a larger
population.
KEY TAKEAWAYS
The null hypothesis is the default assumption the analyst starts with, for example that a population parameter equals some claimed value. The alternative hypothesis is effectively the opposite of the null hypothesis. Thus, the two are mutually exclusive, and only one can be true; however, one of the two hypotheses will always be true.
1. The first step is for the analyst to state the two hypotheses so that only one can
be right.
2. The next step is to formulate an analysis plan, which outlines how the data will
be evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either reject or fail to reject the null hypothesis, as illustrated in the sketch below.
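A minimal illustration of these four steps as a one-sample t-test using scipy.stats (assumed to be available); the sample values, the hypothesized mean, and the 0.05 significance level are invented for the example.

from scipy import stats

# Step 1: state the hypotheses. H0: the population mean equals 50; H1: it does not.
sample = [51.2, 49.8, 52.4, 50.9, 48.7, 51.5, 53.0, 50.3]

# Step 2: the analysis plan is a one-sample t-test at significance level 0.05.
alpha = 0.05

# Step 3: carry out the plan on the sample data.
result = stats.ttest_1samp(sample, popmean=50)

# Step 4: analyze the results and either reject or fail to reject H0.
print("t =", round(result.statistic, 3), "p =", round(result.pvalue, 3))
print("reject H0" if result.pvalue < alpha else "fail to reject H0")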
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
Posterior Probability, P(H|X)
Prior Probability, P(H)
where X is a data tuple and H is some hypothesis.
The arc in the diagram allows representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or
not the person is a smoker. It is worth noting that the variable PositiveXray is
independent of whether the patient has a family history of lung cancer or that the
patient is a smoker, given that we know the patient has lung cancer.
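A small numeric sketch of Bayes' theorem, P(H|X) = P(X|H) · P(H) / P(X); the prior and likelihood values are invented for illustration and are not taken from the lung-cancer example above.

def posterior(prior_h, likelihood_x_given_h, likelihood_x_given_not_h):
    # P(X) expanded by the law of total probability.
    p_x = likelihood_x_given_h * prior_h + likelihood_x_given_not_h * (1 - prior_h)
    # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
    return likelihood_x_given_h * prior_h / p_x

# Hypothetical numbers: prior P(H) = 0.01, P(X|H) = 0.9, P(X|not H) = 0.08.
print(round(posterior(0.01, 0.9, 0.08), 4))   # posterior probability P(H|X)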
The perceptron algorithm was designed to classify visual inputs, categorizing subjects
into one of two types and separating groups with a line. Classification is an important
part of machine learning and image processing. Machine learning algorithms find and
classify patterns by many different means. The perceptron algorithm classifies patterns
and groups by finding the linear separation between different objects and patterns that
are received through numeric or visual input.
At the time, the perceptron was expected to be very significant for the development of
artificial intelligence (AI). While high hopes surrounded the initial perceptron, technical
limitations were soon demonstrated. Single-layer perceptrons can only separate classes
if they are linearly separable. Later on, it was discovered that by using multiple layers,
perceptrons can classify groups that are not linearly separable, allowing them to solve
problems single layer algorithms can’t solve.
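A minimal single-layer perceptron training sketch in Python/numpy for a linearly separable toy problem; the data, learning rate, and number of passes are assumptions for illustration.

import numpy as np

# Linearly separable toy data: two groups in 2-D, labels -1 and +1.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],
              [6.0, 6.0], [5.5, 7.0], [7.0, 6.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for _ in range(20):                      # a few passes over the data
    for xi, yi in zip(X, y):
        # Perceptron rule: update the weights only when a point is misclassified.
        if yi * (np.dot(w, xi) + b) <= 0:
            w += lr * yi * xi
            b += lr * yi

# The learned line w.x + b = 0 separates the two groups.
print(w, b, np.sign(X @ w + b))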
A data mart is a condensed version of Data Warehouse and is designed for use by a
specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR
or finance. It is often controlled by a single department in an organization.
Data Mart usually draws data from only a few sources compared to a Data warehouse.
Data marts are small in size and are more flexible compared to a Datawarehouse.
Data Mart helps to enhance user's response time due to reduction in volume of
data
It provides easy access to frequently requested data.
Data marts are simpler to implement when compared to a corporate Data Warehouse. At the same time, the cost of implementing a Data Mart is certainly lower compared with implementing a full data warehouse.
Compared to a Data Warehouse, a data mart is agile. In case of a change in the model, the data mart can be rebuilt quicker due to its smaller size.
A Data Mart is defined by a single Subject Matter Expert (SME). On the contrary, a data warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence, a Data Mart is more open to change compared to a Data Warehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.
1. Dependent: Dependent data marts are created by drawing data from an existing central data warehouse.
2. Independent: Independent data mart is created without the use of a central
data warehouse.
3. Hybrid: This type of data marts can take data from data warehouses or
operational systems.
A dependent data mart allows sourcing the organization's data from a single, central Data Warehouse. It offers the benefit of centralization. If you need to develop one or more
physical data marts, then you need to configure them as dependent data marts.
Dependent data marts can be built in two different ways. Either where a user can
access both the data mart and data warehouse, depending on need, or where access is
limited only to the data mart. The second approach is not optimal, as it can produce what is sometimes referred to as a data junkyard: all data begins from a common source, but much of it ends up scrapped or junked.
Independent Data Mart
An independent data mart is created without the use of central Data warehouse. This
kind of Data Mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data
warehouse nor with any other data mart. In Independent data mart, the data is input
separately, and its analyses are also performed autonomously.
A hybrid data mart combines input from a Data Warehouse with input from other source systems. This can be helpful when you want ad-hoc integration, for example after a new group or product is added to the organization.
Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed
steps to implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks from initiating the request for a data mart through gathering information about the requirements. Finally, we create the logical and physical design of the data mart.
Gathering the business & technical requirements and Identifying data sources.
Selecting the appropriate subset of data.
Designing the logical and physical structure of the data mart.
Date
Business or Functional Unit
Geography
Any combination of above
A simple pen and paper would suffice, though tools that help you create UML or ER diagrams can also embed metadata into your logical and physical designs.
Constructing
This is the second phase of implementation. It involves creating the physical database
and the logical structures.
Implementing the physical database designed in the earlier phase. For instance,
database schema objects like table, indexes, views, etc. are created.
You need a relational database management system to construct a data mart. RDBMS
have several features that are required for the success of a Data Mart.
Storage management: An RDBMS stores and manages the data to create, add,
and delete data.
Fast data access: With a SQL query you can easily access data based on
certain conditions/filters.
Data protection: The RDBMS system also offers a way to recover from system failures such as power failures. It also allows restoring data from backups in case the disk fails.
Multiuser support: The data management system offers concurrent access, the
ability for multiple users to access and modify data without interfering or
overwriting changes made by another user.
Security: The RDMS system also provides a way to regulate access by users to
objects and certain types of operations.
Populating
Populating is the third phase, which involves transferring data into the data mart. You accomplish these population tasks using an ETL (Extract, Transform, Load) tool. This tool allows you to look at the data sources, perform source-to-target mapping, extract the data, transform and cleanse it, and load it into the data mart.
In the process, the tool also creates some metadata relating to things like where the
data came from, how recent it is, what type of changes were made to the data, and
what level of summarization was done.
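A toy extract-transform-load sketch with pandas standing in for an ETL tool; the source columns, the source-to-target mapping, the output file name, and the metadata fields are all hypothetical.

import pandas as pd

# Extract: a (hypothetical) source extract.
source = pd.DataFrame({
    "cust_name": [" Alice ", "Bob", None],
    "amount_usd": ["100", "250", "75"],
    "order_dt": ["2024-01-03", "2024-01-05", "2024-01-05"],
})

# Transform/cleanse: source-to-target mapping, type fixes, trimming, dropping bad rows.
target = (source
          .rename(columns={"cust_name": "customer",
                           "amount_usd": "revenue",
                           "order_dt": "order_date"})
          .assign(customer=lambda d: d["customer"].str.strip(),
                  revenue=lambda d: d["revenue"].astype(float),
                  order_date=lambda d: pd.to_datetime(d["order_date"]))
          .dropna(subset=["customer"]))

# Load: write into the data mart (a CSV file stands in for the mart table here),
# along with a little metadata about the load.
target.to_csv("sales_mart.csv", index=False)
metadata = {"source": "orders extract", "rows_loaded": len(target),
            "loaded_at": pd.Timestamp.now()}
print(metadata)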
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data, creating reports and charts, and publishing them. End-users submit queries to the database and display the results of those queries.
Set up a meta layer that translates database structures and objects names into
business terms. This helps non-technical users to access the Data mart easily.
Set up and maintain database structures.
Set up API and interfaces if required
You can access the data mart using the command line or GUI. GUI is preferred as it can
easily generate graphs and is user-friendly compared to the command line.
Managing
This is the last step of the Data Mart implementation process. This step covers ongoing management tasks such as −
You could use the GUI or command line for data mart management.
Following are the best practices that you need to follow while in the Data Mart
Implementation process:
The source of a Data Mart should be departmentally structured
The implementation cycle of a Data Mart should be measured in short periods of
time, i.e., in weeks instead of months or years.
It is important to involve all stakeholders in planning and designing phase as the
data mart implementation could be complex.
Data Mart Hardware/Software, Networking and Implementation costs should be
accurately budgeted in your plan
Even if the Data Mart is created on the same hardware, it may need different software to handle user queries. Additional processing power and disk storage requirements should be evaluated for fast user response.
A data mart may be in a different location from the data warehouse, so it is important to ensure there is enough networking capacity to handle the data volumes that need to be transferred to the data mart.
The implementation budget should account for the time taken by the Data Mart loading process. Load time increases with the complexity of the transformations.
Advantages
Disadvantages
Many times, enterprises create too many disparate and unrelated data marts without much benefit. This can become a big hurdle to maintain.
Data Mart cannot provide company-wide data analysis as their data set is
limited.
Summary:
What is OLAP?
Online Analytical Processing (OLAP) is a category of software tools which provides analysis of data for business decisions. OLAP systems allow users to analyze database information from multiple database systems at one time.
What is OLTP?
Online Transaction Processing (OLTP) is a category of systems that supports transaction-oriented applications; it captures, stores, and processes the day-to-day transactions of an organization in real time.
Example of OLAP
A company might compare their mobile phone sales in September with sales in October, then compare those results with sales from another location, which may be stored in a separate database.
Amazon analyzes purchases by its customers to come up with a personalized homepage with products likely to interest its customers.
An example of an OLTP system is an ATM center. Assume that a couple has a joint account with a bank. One day both simultaneously reach different ATM centers at precisely the same time and want to withdraw the total amount present in their bank account.
However, the person who completes the authentication process first will be able to get the money. In this case, the OLTP system makes sure that the withdrawn amount will never be more than the amount present in the account. The key point here is that OLTP systems are optimized for transactional processing instead of data analysis.
Online banking
Online airline ticket booking
Sending a text message
Order entry
Add a book to shopping cart
OLAP creates a single platform for all type of business analytical needs which
includes planning, budgeting, forecasting, and analysis.
The main benefit of OLAP is the consistency of information and calculations.
Easily apply security restrictions on users and objects to comply with regulations
and protect sensitive data.
If an OLTP system faces a hardware failure, online transactions are severely affected.
OLTP systems allow multiple users to access and change the same data at the same time, which can sometimes create unprecedented situations.
Difference between OLTP and OLAP
Parameters | OLTP | OLAP
Method | OLTP uses traditional DBMS. | OLAP uses the data warehouse.
Source | OLTP and its transactions are the sources of data. | Different OLTP databases become the source of data for OLAP.
Data Integrity | The OLTP database must maintain data integrity constraints. | The OLAP database does not get frequently modified. Hence, data integrity is not an issue.
Data quality | The data in the OLTP database is always detailed and organized. | The data in the OLAP process might not be organized.
Usefulness | It helps to control and run fundamental business tasks. | It helps with planning, problem-solving, and decision support.
Query Type | Queries in this process are standardized and simple. | Complex queries involving aggregations.
Back-up | Complete backup of the data combined with incremental backups. | OLAP only needs a backup from time to time. Backup is not important compared to OLTP.
User type | It is used by data-critical users like clerks, DBAs, and database professionals. | Used by data-knowledge users like workers, managers, and CEOs.
Purpose | Designed for real-time business operations. | Designed for analysis of business measures by category and attributes.
Number of users | This kind of database allows thousands of users. | This kind of database allows only hundreds of users.
Productivity | It helps to increase the user's self-service and productivity. | Helps to increase the productivity of business analysts.
Challenge | Data warehouses historically have been a development project which may prove costly to build. | An OLAP cube is not an open SQL server data warehouse. Therefore, technical knowledge and experience are essential to manage the OLAP server.
Process | It provides fast results for daily used data. | It ensures that responses to queries are consistently quick.
Characteristic | It is easy to create and maintain. | It lets the user create a view with the help of a spreadsheet.
Style | OLTP is designed to have fast response time, low data redundancy and is normalized. | A data warehouse is created uniquely so that it can integrate different data sources for building a consolidated database.
Summary:
OLAP is a category of software that allows users to analyze information from multiple
database systems at the same time. It is a technology that enables analysts to extract
and view business data from different points of view. OLAP stands for Online Analytical
Processing.
Analysts frequently need to group, aggregate and join data. These operations in
relational databases are resource intensive. With OLAP data can be pre-calculated and
pre-aggregated, making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed in such a
way that creating and viewing reports become easy.
OLAP cube:
At the core of the OLAP concept is the OLAP cube. The OLAP cube is a data structure optimized for very quick data analysis.
The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The OLAP cube is also called a hypercube.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP deals with multidimensional data, usually obtained from different and unrelated sources, so a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.
A Data warehouse would extract information from multiple data sources and formats
like text files, excel sheet, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or
OLAP cube) where information is pre-calculated in advance for further analysis.
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The Roll-up operation can be
performed in 2 ways
1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of grouping things
based on their order or level.
In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up (see the sketch below).
In this aggregation process, the location hierarchy moves up from city to country.
In the roll-up process, one or more dimensions may also be removed. In this example, the Quarter dimension is removed.
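The roll-up in this example can be sketched with a pandas group-by, using the sales figures quoted above (440 and 1560); the table layout itself is assumed.

import pandas as pd

sales = pd.DataFrame({
    "country": ["USA", "USA"],
    "city": ["New Jersey", "Los Angeles"],
    "sales": [440, 1560],
})

# Climb the location hierarchy from city to country: the city level is dropped
# and the measures are aggregated, giving 2000 for USA.
rolled_up = sales.groupby("country", as_index=False)["sales"].sum()
print(rolled_up)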
2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done by stepping down a concept hierarchy or by adding a new dimension.
3) Slice:
Dice:
This operation is similar to a slice. The difference in dice is you select 2 or more
dimensions that result in the creation of a sub-cube.
4) Pivot
In Pivot, you rotate the data axes to provide a substitute presentation of data.
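Slice, dice, and pivot can be illustrated on a tiny cube held as a pandas DataFrame; the dimensions (time, location, item) and all values are invented for the example.

import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "location": ["Boston", "Austin", "Boston", "Austin", "Boston", "Austin"],
    "item":     ["Phone", "Phone", "Phone", "Phone", "Tablet", "Tablet"],
    "sales":    [100, 80, 120, 90, 60, 70],
})

# Slice: fix a single value of one dimension (time = "Q1").
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select values on two or more dimensions to get a sub-cube.
dice = cube[cube["time"].isin(["Q1", "Q2"]) & (cube["location"] == "Boston")]

# Pivot (rotate): swap which dimensions appear on the rows and columns.
view1 = cube.pivot_table(index="location", columns="time", values="sales", aggfunc="sum")
view2 = view1.T   # a rotated presentation of the same data
print(view1, view2, sep="\n\n")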
Hybrid Online Analytical Processing (HOLAP) − In the HOLAP approach, the aggregated totals are stored in a multidimensional database while the detailed data is stored in a relational database. This offers both the data efficiency of the ROLAP model and the performance of the MOLAP model.
Web OLAP (WOLAP) − Web OLAP is an OLAP system accessible via a web browser. WOLAP has a three-tiered architecture. It consists of three components: client, middleware, and a database server.
Mobile OLAP − Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.
ROLAP
ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables. It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
High data efficiency. It offers high data efficiency because query performance
and access language are optimized particularly for the multidimensional data
analysis.
Scalability. This type of OLAP system offers scalability for managing large
volumes of data, and even when the data is steadily increasing.
MOLAP
Hybrid OLAP
Hybrid OLAP is a mixture of both ROLAP and MOLAP. It offers fast computation of
MOLAP and higher scalability of ROLAP. HOLAP uses two databases.
This kind of OLAP helps to economize the disk space, and it also remains
compact which helps to avoid issues related to access speed and convenience.
HOLAP uses cube technology, which allows faster performance for all types of data.
ROLAP data is updated instantly, and HOLAP users have access to this real-time, instantly updated data, while MOLAP brings cleaning and conversion of data, thereby improving data relevance. This brings the best of both worlds.
Advantages of OLAP
Disadvantages of OLAP
OLAP requires organizing data into a star or snowflake schema. These schemas
are complicated to implement and administer
You cannot have a large number of dimensions in a single OLAP cube.
Transactional data cannot be accessed with OLAP system.
Any modification in an OLAP cube needs a full update of the cube. This is a time-
consuming process
Summary:
OLAP is a technology that enables analysts to extract and view business data
from different points of view.
At the core of the OLAP concept is the OLAP cube.
Various business applications and other data operations require the use of OLAP
Cube.
There are primary five types of analytical operations in OLAP 1) Roll-up 2) Drill-
down 3) Slice 4) Dice and 5) Pivot
Three types of widely used OLAP systems are MOLAP, ROLAP, and Hybrid OLAP.
Desktop OLAP, Web OLAP, and Mobile OLAP are some other types of OLAP
systems.
What is MOLAP?
Using MOLAP, a user can view multidimensional data from different facets. Multidimensional data analysis is also possible with a relational database, but that would require querying data from multiple tables. In contrast, MOLAP has all possible combinations of data already stored in a multidimensional array and can access this data directly. Hence, MOLAP is faster compared to Relational Online Analytical Processing (ROLAP).
Key Points
MOLAP Architecture
Database server.
MOLAP server.
Front-end tool.
The MOLAP architecture described above is shown in the given figure.
MOLAP architecture mainly reads the precompiled data. MOLAP architecture has limited
capabilities to dynamically create aggregations or to calculate results that have not
been pre-calculated and stored.
For example, an accounting head can run a report showing the corporate P/L account or
P/L account for a specific subsidiary. The MDDB would retrieve precompiled Profit &
Loss figures and display that result to the user.
MOLAP Advantages
MOLAP Disadvantages
One major weakness of MOLAP is that it is less scalable than ROLAP as it handles
only a limited amount of data.
MOLAP also introduces data redundancy and is resource intensive.
Processing in MOLAP solutions may be lengthy, particularly on large data volumes.
MOLAP products may face issues while updating and querying models when
dimensions are more than ten.
MOLAP is not capable of containing detailed data.
The storage utilization can be low if the data set is highly scattered.
It can handle only a limited amount of data; therefore, it is impossible to include a large amount of data in the cube itself.
MOLAP Tools
Summary: