Data Mining


Data mining is the process of discovering patterns, trends, or useful information from large datasets using various
techniques, including statistical analysis, machine learning, and artificial intelligence. The goal of data mining is to
extract valuable insights and knowledge from raw data, which can then be used for decision-making, predictive analysis,
and other business or research purposes.

The knowledge discovery process consists of the following steps:

1. Data Cleaning: Remove noise and inconsistent data from the dataset.
2. Data Integration: Combine data from multiple sources into a unified dataset.
3. Data Selection: Retrieve the data relevant to the analysis task from the database.
4. Data Transformation: Transform and consolidate the data into forms appropriate for mining, for example by summarization or aggregation.
5. Data Mining: Apply intelligent methods to extract patterns from the data.
6. Pattern Evaluation: Identify the truly interesting patterns based on interestingness measures.
7. Knowledge Presentation: Use visualization and representation techniques to present the mined knowledge to users.
The architecture of a typical data mining system has the following major components:

1. Data Source: Databases, data warehouses, or other repositories where data is stored, cleaned, and integrated.
2. Database or Data Warehouse Server: Retrieves the relevant data based on the user's mining request.
3. Knowledge Base: Provides domain knowledge used to guide the search and evaluate the interestingness of patterns.
4. Data Mining Engine: Performs the core tasks such as characterization, association analysis, classification, prediction, and cluster analysis.
5. Pattern Evaluation Module: Assesses patterns for interestingness, using measures and thresholds.
6. User Interface: Connects users with the system, enabling queries, exploration, and visualization.

Relational Databases:
A relational database is a collection of tables, each having a unique name. Each table consists of a set of attributes
(columns) and usually stores a large set of tuples (rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values.
Suppose that your job is to analyze the AllElectronics data.
Through the use of relational queries, you can ask things like “Show me a list of all items that were sold in the last
quarter.” Relational queries can also apply aggregate functions such as sum, avg (average), count, max (maximum), and min
(minimum). These allow you to ask things like “Show me the total sales of the last month, grouped by branch,” or
“How many sales transactions occurred in the month of December?”
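As a minimal illustration (not the actual AllElectronics schema), the sketch below creates a small, hypothetical sales table in an in-memory SQLite database and runs the kinds of aggregate queries described above; the table and column names are assumptions made for the example.

import sqlite3

# Hypothetical sales table; schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, item TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Vancouver", "TV", "Dec", 800.0),
     ("Vancouver", "CD", "Dec", 15.0),
     ("New York", "TV", "Nov", 820.0),
     ("New York", "TV", "Dec", 790.0)],
)

# "Show me the total sales of December, grouped by branch."
for branch, total in conn.execute(
        "SELECT branch, SUM(amount) FROM sales WHERE month = 'Dec' GROUP BY branch"):
    print(branch, total)

# "How many sales transactions occurred in the month of December?"
print(conn.execute("SELECT COUNT(*) FROM sales WHERE month = 'Dec'").fetchone()[0])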

Data mining in relational databases involves searching for trends and patterns. For instance, it can predict credit risk
by analyzing customer data like income and credit history. It can also find anomalies, like unexpected sales
drops, prompting further investigation into potential causes like packaging changes or price increases.

Data Warehouse

Suppose the president of AllElectronics needs an analysis of sales per item type per branch for the third quarter, but the
data is scattered across various databases worldwide. Having a data warehouse would make this task easy, as it's a
centralized repository of information collected from different sources. Data warehouses involve steps like cleaning,
integration, transformation, and loading of data. They use a multidimensional structure with attributes as dimensions and
store aggregated measures in cells. This structure can be relational or multidimensional (cube), offering fast access to
summarized data.

A data cube for AllElectronics.
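As a rough sketch of the idea (assuming pandas is available), a pivot table can stand in for a simple data cube: the dollars_sold measure is aggregated along item and branch dimensions, and the margins act as roll-ups. The column names and values are illustrative assumptions, not the actual AllElectronics data.

import pandas as pd

# Hypothetical sales records with dimensions (quarter, item, branch) and one measure.
sales = pd.DataFrame({
    "quarter": ["Q3", "Q3", "Q3", "Q3"],
    "item": ["home entertainment", "computer", "computer", "phone"],
    "branch": ["Vancouver", "Vancouver", "New York", "New York"],
    "dollars_sold": [605.0, 825.0, 968.0, 38.0],
})

# A simple cube view: the aggregated measure per (item, branch) cell,
# with "All" margins giving roll-ups along each dimension.
cube = pd.pivot_table(sales, values="dollars_sold",
                      index="item", columns="branch",
                      aggfunc="sum", margins=True, margins_name="All")
print(cube)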


Transactional Database

A transactional database comprises records representing transactions, each with a unique transaction identity number
(trans ID) and a list of items involved in the transaction, like purchased items in a store.
As an analyst of the AllElectronics database, you can pose queries such as “Which items did Sandy Smith purchase?” or
“Which transactions contain item I3?” Answering these may require scanning the entire database. Digging deeper, this type of
analysis can reveal item pairs that are frequently bought together, which helps in bundling items as a sales strategy. For
instance, a store might offer discounted high-end printers with specific computers, based on the insight that they are commonly bought together, aiming to boost sales.
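A minimal sketch of this kind of co-occurrence analysis in plain Python (no mining library assumed), over hypothetical transactions: it counts how often each pair of items appears in the same transaction and flags frequent pairs as bundling candidates.

from collections import Counter
from itertools import combinations

# Hypothetical transactional data: trans_ID -> list of item IDs.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I3", "I8"],
    "T400": ["I3", "I9"],
}

# Count how often each unordered pair of items occurs in the same transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least two transactions are bundling candidates.
for pair, count in pair_counts.items():
    if count >= 2:
        print(pair, count)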

Advanced Data and Information Systems and Advanced Applications



Data Mining Functionalities:


Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data
mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks define the
general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to
make predictions.

There are various data mining functionalities, which are as follows −

Data characterization − It is a summarization of the general characteristics of a target class of data. The data
specified by the user is typically collected by a database query. The output of data characterization can be
presented in multiple forms such as pie charts, curves, and multidimensional data cubes.

Data discrimination − It compares the features of a target class with the features of one or more contrasting classes. The
target and contrasting classes are specified by the user, and the corresponding data objects are retrieved through
database queries.

Association Analysis − Association analysis is a technique that focuses on uncovering relationships, or
associations, among items within a dataset. It aims to identify patterns of co-occurrence among different items.
This is often used in market basket analysis, where the goal is to find items that are frequently purchased
together (see the support/confidence sketch after this list).

Classification − Classification is a data mining technique that categorizes items in a collection based on
predefined properties. It uses models such as if-then rules, decision trees, or neural networks to predict the class of an
item. A training set containing items whose class labels are known is used to build the model, which is then applied to
classify items whose labels are unknown.

Prediction − Prediction estimates unavailable data values or upcoming trends. A value can be anticipated based
on the attribute values of the object and the attribute values of the classes. It can involve predicting missing
numerical values or increase/decrease trends in time-related data. Numeric predictions are typically made by building a
regression model from historical data, while class predictions fill in missing class labels using a training data set.

Clustering − Cluster analysis, also known as clustering, is an unsupervised learning method that groups similar data
points together. The goal of cluster analysis is to divide a dataset into groups (or
clusters) such that the data points within each group are similar to each other and dissimilar to points in other groups.

Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are
data objects whose behavior deviates markedly from the general behavior of the other data objects. Analyzing this
type of data can be essential for applications such as fraud detection.

Evolution analysis − It describes and models regularities or trends for objects whose behavior changes over time.
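The support/confidence sketch referenced in the association analysis item above: a hypothetical market-basket example showing how the two standard interestingness measures of an association rule are computed. The rule {computer} => {printer} and the transactions are illustrative assumptions.

# Hypothetical market-basket data for the rule {computer} => {printer}.
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "scanner"},
    {"printer", "paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "printer"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # among transactions with a computer, fraction also containing a printer
print(f"support = {support:.2f}, confidence = {confidence:.2f}")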

Classification of Data Mining Systems:

Data mining systems can be classified based on various criteria:

1. Kinds of Databases Mined: Systems can be categorized according to the types of databases they work with,
such as relational, transactional, object-relational, data warehouse, spatial, time-series, text, multimedia, or web
mining systems.
2. Kinds of Knowledge Mined: Systems can be grouped by the types of knowledge they extract, including
characterization, discrimination, association, classification, prediction, clustering, outlier analysis, and evolution
analysis. A comprehensive system provides multiple functionalities.
3. Techniques Utilized: Classification can be done based on the data mining techniques employed. This
includes the degree of user interaction (autonomous, interactive, query-driven systems) and the methods used (database-oriented,
machine learning, statistics, visualization, neural networks, etc.).
4. Applications Adapted: Systems can be categorized by the specific applications they cater to, like finance,
telecommunications, DNA analysis, stock markets, or email. Different applications often need domain-specific
methods.
Data Mining Task Primitives:
1. Task-Relevant Data: Specifies the database portions or dataset of interest, including relevant attributes or
dimensions.
2. Kind of Knowledge: Defines the data mining functions to be applied, like characterization, discrimination,
association, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background Knowledge: Utilizes domain knowledge to guide discovery and evaluate patterns. This includes
concept hierarchies, user beliefs, and relationships in the data.
4. Interestingness Measures: Metrics and thresholds used to guide mining or evaluate patterns. Measures differ based
on the type of knowledge. For instance, association rules use support and confidence.
5. Representation for Visualization: Determines how discovered patterns are presented. This could include rules,
tables, charts, graphs, decision trees, and cubes.

Integration of a Data Mining System with a Database or Data Warehouse System:

Possible integration schemes include no coupling, loose coupling, semi-tight coupling, and tight coupling.

No coupling: The DM system does not utilize any function of a DB or DW system; it fetches data from other sources such as flat files.
Loose coupling: The DM system uses some facilities of a DB or DW system, typically fetching data from them and storing mining results back.
Semi-tight coupling: Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining
primitives (such as sorting, indexing, aggregation, and histogram analysis) are provided in the DB/DW system.
Tight coupling: The DM system is smoothly integrated into the DB/DW system, so that mining queries are optimized together with database queries.

Data mining issues:

Mining Methodology and User Interaction Issues

1. Hard to Derive Knowledge for Diverse Domains − Data mining should cater to varied user interests, involving a
wide range of knowledge discovery tasks.
2. Lack of Interactive Models − The data mining process needs to be interactive so that users can focus the
search for patterns, providing and refining data mining requests based on returned results.
3. Ad hoc Data Mining is Not Easy − Query languages for ad hoc mining tasks need to be developed and integrated with data
warehouse queries, ensuring efficient and flexible mining.
4. Presentation and visualization of data mining results − Discovered patterns must be presented in easily
understandable, high-level languages and visuals.
5. Handling noisy or incomplete data − Noise handling through data cleaning methods is essential for accurate
pattern discovery.
6. Pattern evaluation − Discovered patterns should be interesting, not merely common knowledge or lacking in novelty.

Performance Issues
There can be performance-related issues such as follows −

1. Inefficiency and Difficulty in Scalability − Data mining algorithms should be efficient and scalable to handle large
datasets effectively.
2. Need for Parallel, Distributed, and Incremental Algorithms − Due to data size and complexity, parallel and distributed
algorithms divide the data for simultaneous processing, thus enhancing efficiency. Incremental algorithms incorporate
database updates without mining the entire data again from scratch.

Diverse Data Types Issues:

1. Handling of relational and complex types of data − Complex data like multimedia, spatial, and temporal data
require specialized mining approaches.
2. Mining information from heterogeneous databases − Mining knowledge from varied, distributed sources on a
LAN/WAN is complex, as these data sources may be structured, semi-structured, or unstructured.

Data cleaning & its types


Data cleaning routines clean the data by filling in missing values, smoothing noisy data, identifying and removing
outliers, and resolving inconsistencies in the data. Sometimes data are recorded at a level of detail different from what
is required; for example, the analysis may need age ranges such as 20-30, 30-40, and 40-50, while the imported data contain birth dates.
In such cases the data must be transformed into the required form, for instance by deriving age groups from the birth dates.

There are various types of data cleaning which are as follows −


1. Missing Values − Missing values are filled with appropriate values. The following approaches can be used:
a. Ignore the tuple, usually when it is missing values for several attributes.
b. Fill in the missing value manually.
c. Fill in the missing values with a global constant.
d. Fill in the missing values with the attribute mean.
e. Fill in the missing values with the most probable value (for example, one inferred using regression or decision-tree induction).
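A brief sketch of approaches (c) and (d) above, assuming pandas is available; the customer records and column names are hypothetical.

import pandas as pd

# Hypothetical customer records with missing income values.
df = pd.DataFrame({"age": [23, 35, 41, 52],
                   "income": [30000.0, None, 56000.0, None]})

filled_constant = df["income"].fillna(0)                # (c) global constant
filled_mean = df["income"].fillna(df["income"].mean())  # (d) attribute mean
without_missing = df.dropna()                           # (a) ignore tuples with missing values
print(filled_mean.tolist())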

2. Noisy data − Noise is a random error or variance in a measured variable. The following smoothing
methods can be used to handle noise:

a. Binning: Data values are grouped into bins, considering their neighboring values for local smoothing.
b. Regression: Data can be smoothed using regression techniques, fitting it to functions like linear regression for
prediction.
c. Clustering: Outliers can be detected through clustering, where values within clusters are considered normal,
and those outside are outliers.
d. Combined Inspection: Outliers can be identified through a combination of computer analysis and human
review, recognizing patterns with unexpected or unusual values.
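A minimal sketch of smoothing by bin means (method (a) above) in plain Python: the sorted values are partitioned into equal-frequency bins and each value is replaced by its bin's mean. The price list is an illustrative example.

# Equal-frequency binning followed by smoothing by bin means.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(mean, 2)] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]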

3. Inconsistent data − Inconsistencies originating from transaction records, data entry errors, or the merging of databases can be
identified and rectified. Correlation analysis can help spot redundancies, and careful integration of data from different sources
minimizes redundancy issues.

Data Integration:
Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data
conflict detection, and the resolution of semantic heterogeneity contribute toward smooth data integration.

Data Transformation:
Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation, where data are summarized by applying operations such as computing monthly totals from daily sales data. This is
often used for constructing data cubes, allowing analysis at various levels.
Generalization of the data, where low-level (primitive, raw) data are replaced with higher-level concepts through concept
hierarchies. For example, a categorical attribute like "street" can be generalized to "city" or "country."
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or
0.0 to 1.0 (see the min-max sketch after this list).
Attribute construction (or feature construction), where new attributes are constructed and added from the given set
of attributes to help the mining process.
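The min-max sketch referenced in the normalization item above, in plain Python: attribute values are rescaled to a small target range such as [0.0, 1.0] or [−1.0, 1.0]. The income values are hypothetical.

# Min-max normalization; assumes the values are not all identical.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 54000, 73600, 98000]
print(min_max_normalize(incomes))             # scaled to [0.0, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # scaled to [-1.0, 1.0]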

Data Reduction:
Data reduction techniques aim to create a compact version of a dataset while preserving its essential information.
Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Strategies for data reduction include the following:

1. Data cube aggregation: Combine data in a data cube through aggregation operations.
2. Attribute subset selection: Identify and remove irrelevant or redundant attributes.
3. Dimensionality reduction: Use encoding methods or transforms such as principal components analysis to decrease dataset size (see the sketch after this list).
4. Numerosity reduction: Substitute data with smaller representations (parametric models, or non-parametric
methods such as clustering, use of histograms).
5. Discretization and concept hierarchy: Replace raw values with ranges or higher-level concepts, aiding
multi-level analysis.
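The sketch referenced in the dimensionality reduction item above, assuming numpy and scikit-learn are available: a hypothetical four-attribute dataset is projected onto two principal components, shrinking the data while retaining most of its variance.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 tuples with 4 correlated numeric attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
data = np.hstack([base, base @ rng.normal(size=(2, 2))])

# Reduce 4 attributes to 2 principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component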

Data discretization ( for numerical data): It is a technique that involves converting a large set of data values
into smaller ones to simplify data analysis and management. It's particularly useful for continuous data, where attribute
values are grouped into distinct intervals to minimize information loss. There are two main types: supervised, which uses
class information, and unsupervised, which relies on specific strategies like top-down splitting and bottom-up merging.
Histogram analysis: Histogram refers to a plot used to represent the underlying frequency distribution of a continuous
data set.
Binning: Binning refers to a technique that groups a large number of continuous values into a smaller number of
intervals (bins), which can also be used for smoothing.
Cluster analysis is a common method for data discretization. It involves using a clustering algorithm to group values
into clusters, thereby discretizing the data.
Entropy-based discretization is a supervised method involving top-down splitting. It considers class distribution and
preserves split-points for separating attribute ranges.
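A brief sketch of unsupervised discretization, assuming pandas is available: a hypothetical age attribute is split into equal-width intervals (as in histogram analysis) and equal-frequency intervals (as in binning). Entropy-based discretization is not shown, since it additionally requires class labels.

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 45, 46, 52, 70])

equal_width = pd.cut(ages, bins=3)  # equal-width intervals
equal_freq = pd.qcut(ages, q=3)     # equal-frequency intervals

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())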

Concept hierarchies (for categorical data) can be generated in the following ways:

1. User-Defined Schema Ordering: Experts can define hierarchies by specifying attribute orders, like street < city <
province < country.
2. Manual Data Grouping: Portion of a hierarchy can be manually created by grouping data, useful for intermediate
levels like categorizing provinces.
3. Automatic Attribute Ordering: Attributes can be hierarchically ordered based on distinct values, with more values at
the bottom (e.g., street) and fewer at the top (e.g., country). Adjustments can be made as needed.

MOD 2:
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process
The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data
repositories.

A data warehouse has distinct characteristics:

1. Subject-Oriented: It's organized around key subjects like customers, products, etc., focusing on data analysis for
decision-making rather than daily operations.
2. Integrated: It's built by combining diverse sources (databases, files, records), using data cleaning and integration to ensure
consistency in naming and structure.
3. Time-Variant: Data is stored historically (e.g., past 5-10 years), and time-related information is a part of its structure.
4. Nonvolatile: Separate from operational data, it doesn't need transaction processing, recovery, or concurrency control. It involves
only data loading and access operations.

Difference between OLTP & OLAP:

OLTP (on-line transaction processing) systems handle day-to-day operations such as purchasing, inventory, and banking transactions, whereas OLAP (on-line analytical processing) systems serve data analysis and decision making. Key differences:
1. Users and orientation: OLTP is customer-oriented and used by clerks and IT professionals; OLAP is market-oriented and used by knowledge workers such as managers and analysts.
2. Data contents: OLTP manages current, detailed data; OLAP manages large amounts of historical, summarized, and consolidated data.
3. Database design: OLTP usually adopts an entity-relationship, application-oriented design; OLAP adopts a subject-oriented star or snowflake design.
4. Access patterns: OLTP consists of short, atomic read/write transactions requiring concurrency control and recovery; OLAP workloads are mostly complex, read-only queries.
5. View: OLTP focuses on current data within an enterprise or department; OLAP integrates information from many sources and spans multiple versions of a database schema.

Three-Tier Data Warehouse Architecture:

● The bottom tier is a warehouse database server that is almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom tier from operational databases.
● The middle tier is an OLAP server, typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
● The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction etc.)

There are three data warehouse models: the enterprise warehouse, the data mart, and the virtual
warehouse.

An enterprise warehouse collects all of the information about subjects in the entire organization. It puts together data
from different parts of the company, and it can hold a lot of data, from MB to TB & beyond. This kind of warehouse
needs careful planning and might take a year to build.

A data mart is like a smaller version of a data warehouse, focusing on specific needs of a certain group in a company.
Data marts are set up on less expensive servers and can be ready in weeks. They can be independent (drawing from
various sources) or dependent (linked to a larger data warehouse).

Virtual warehouse: A virtual warehouse is like a collection of special views from regular databases. To make queries
fast, only some important views are actually saved. Creating a virtual warehouse is simple, but it needs extra space on
the regular database servers to work well.
Data Warehouse Back-End Tools and Utilities:

1. Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
2. Data cleaning, which detects errors in the data and rectifies them when possible.
3. Data transformation, which converts data from legacy or host format to warehouse format.
4. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
5. Refresh, which propagates the updates from the data sources to the warehouse.

Meta Data:

Metadata refers to information about data itself. In the context of a data warehouse, metadata defines the objects in the
warehouse. It's like a guide for the data. Metadata includes data names, definitions, timestamps for when data was
extracted, where it came from, and details about fields that were added due to cleaning or integration.

1. Structure Metadata: Defines warehouse layout, like schema, views, dimensions, hierarchies, and derived data. Also
covers data marts.
2. Operational Metadata: Tracks data lineage (migration history), status (active/archived), and monitoring (usage,
errors).
3. Summarization Algorithms: Includes methods to summarize data, define measures, dimensions, and queries.
4. Operational Mapping: Describes source databases, extraction, transformation, security, and updates.
5. Performance Data: Involves indices, profiles, and rules for data access and updates.
6. Business Metadata: Contains business terms, data ownership, and policies for understanding.

A Multidimensional Data Model: Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube. It is typically used for the design of corporate data warehouses
and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema.

“What is a data cube?” A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts (measures). Dimensions are the entities or perspectives with respect to which an organization
wants to keep records and are hierarchical in nature.
OLAP Operations

Roll-up (Drill-Up): This operation involves summarizing data at a higher level of aggregation. For instance, going from
monthly sales data to quarterly or yearly totals. It allows you to "roll up" data to a broader perspective.

Drill-down (Roll-Down): Drill-down is the opposite of roll-up. It involves breaking down aggregated data into finer levels
of detail. For example, breaking down yearly sales into monthly or daily sales figures.

Slice − It produces a subcube by performing a selection on one dimension of the given cube.
Dice − It produces a subcube by performing a selection on two or more dimensions.

Pivot (Rotate): This operation involves changing the orientation of the data cube, effectively rotating it to view different
perspectives. It's useful for examining data from various angles.
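A minimal sketch of these operations on a tiny, hypothetical sales cube, assuming pandas is available; the dimensions (year, quarter, city) and the dollars_sold measure are illustrative assumptions.

import pandas as pd

# Hypothetical base data at the (year, quarter, city) level.
sales = pd.DataFrame({
    "year": [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "city": ["Vancouver", "Vancouver", "Toronto", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [400.0, 520.0, 610.0, 300.0, 450.0, 580.0],
})

# Roll-up: aggregate away the quarter dimension (yearly totals per city).
rollup = sales.groupby(["year", "city"])["dollars_sold"].sum()

# Drill-down: return to the finer (year, quarter, city) level.
drilldown = sales.groupby(["year", "quarter", "city"])["dollars_sold"].sum()

# Slice: select on a single dimension (year == 2023).
slice_2023 = sales[sales["year"] == 2023]

# Dice: select on two or more dimensions.
dice = sales[(sales["year"] == 2023) & (sales["city"] == "Vancouver")]

# Pivot (rotate): view the 2023 slice with cities as columns and quarters as rows.
pivot = slice_2023.pivot_table(values="dollars_sold", index="quarter",
                               columns="city", aggfunc="sum")
print(rollup, pivot, sep="\n\n")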

Measure categories: Measures of a data cube can be organized into three categories according to the kind of aggregate function used: distributive (e.g., count, sum, min, max), algebraic (e.g., avg, which can be computed from a bounded number of distributive measures), and holistic (e.g., median, mode, rank, which cannot be computed from distributed partial results in the same way).
