Data Mining
Data mining is the process of discovering patterns, trends, or useful information from large datasets using various
techniques, including statistical analysis, machine learning, and artificial intelligence. The goal of data mining is to
extract valuable insights and knowledge from raw data, which can then be used for decision-making, predictive analysis,
and other business or research purposes.
The knowledge discovery process consists of the following steps (a small preprocessing sketch follows the list):
1. Data Cleaning: Eliminate noisy and inconsistent data from the dataset.
2. Data Integration: Combine data from multiple sources into a unified dataset.
3. Data Selection: Retrieve relevant data from the database for analysis tasks.
4. Data Transformation: Modify and consolidate data into forms suitable for mining, for example through summarization or aggregation.
5. Data Mining: Apply intelligent methods to extract patterns and insights from the data.
6. Pattern Evaluation: Identify the truly interesting patterns based on interestingness measures.
7. Knowledge Presentation: Employ visualization techniques to present mined knowledge to users.
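As a rough, hedged illustration of steps 1 through 4, the Python sketch below runs a toy preprocessing pipeline with pandas; the table layout, column names, and values are assumptions invented for this example, not taken from the text.

    # Toy preprocessing pipeline (illustrative only; data and columns are invented)
    import pandas as pd

    branch_a = pd.DataFrame({"item": ["printer", "laptop", None], "amount": [120.0, 900.0, 50.0]})
    branch_b = pd.DataFrame({"item": ["laptop", "mouse"], "amount": [950.0, 25.0]})

    data = pd.concat([branch_a, branch_b], ignore_index=True)  # integration: combine sources
    data = data.dropna(subset=["item"])                        # cleaning: drop incomplete rows
    selected = data[["item", "amount"]]                        # selection: keep relevant attributes
    summary = selected.groupby("item")["amount"].sum()         # transformation: summarize for mining
    print(summary)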
A typical data mining system architecture includes the following components:
1. Data Source: Where data is stored, cleaned, and integrated.
2. Server: Retrieves data based on user requests.
3. Knowledge Base: Provides domain guidance and constraints.
4. Mining Engine: Performs core tasks like pattern analysis and prediction.
5. Evaluation Module: Assesses patterns for interest, using measures and thresholds.
6. User Interface: Connects users, enabling queries, exploration, and visualization.
Relational Databases:
A relational database is a collection of tables, each having a unique name. Each table consists of a set of attributes
(columns) and usually stores a large set of tuples (rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values.
Suppose that your job is to analyze the AllElectronics data.
Through the use of relational queries, you can ask things like “Show me a list of all items that were sold in the last
quarter.” We can also use aggregate functions such as sum, avg (average), count, max (maximum), and min
(minimum). These allow you to ask things like “Show me the total sales of the last month, grouped by branch,” or
“How many sales transactions occurred in the month of December?”
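The hedged sketch below shows this kind of aggregate query run from Python against a tiny in-memory SQLite table; the sales schema and values are invented for illustration.

    # Aggregate query over an invented sales table (illustrative only)
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (branch TEXT, item TEXT, month TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
        ("Vancouver", "laptop", "2023-12", 1200.0),
        ("Vancouver", "printer", "2023-12", 300.0),
        ("New York", "laptop", "2023-12", 1500.0),
    ])

    # "Show me the total sales of the last month, grouped by branch."
    query = "SELECT branch, SUM(amount) FROM sales WHERE month = '2023-12' GROUP BY branch"
    for branch, total in conn.execute(query):
        print(branch, total)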
Data mining in relational databases involves searching for trends and patterns. For instance, it can predict credit risk
by analyzing customer data like income and credit history. It can also find anomalies, like unexpected sales
drops, prompting further investigation into potential causes like packaging changes or price increases.
Data Warehouse
Suppose the president of AllElectronics needs an analysis of sales per item type per branch for the third quarter, but the
data is scattered across various databases worldwide. Having a data warehouse would make this task easy, as it's a
centralized repository of information collected from different sources. Data warehouses involve steps like cleaning,
integration, transformation, and loading of data. They use a multidimensional structure with attributes as dimensions and
store aggregated measures in cells. This structure can be relational or multidimensional (cube), offering fast access to
summarized data.
Transactional Databases:
A transactional database comprises records representing transactions, each with a unique transaction identity number
(trans ID) and a list of items involved in the transaction, like purchased items in a store.
As an analyst of the AllElectronics database, you can ask queries such as which items were purchased by Sandy Smith or which transactions contain item I3. Answering such queries may require scanning the entire database. Digging deeper, this type of analysis can reveal item pairs that are frequently bought together, which helps in bundling items as a sales strategy. For instance, high-end printers could be offered at a discount with specific computers, based on the insight that they are commonly bought together, with the aim of boosting sales.
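A minimal Python sketch of the underlying idea, counting how often item pairs co-occur in transactions; the transaction list is invented for illustration.

    # Count co-occurring item pairs across invented transactions (illustrative only)
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"computer", "printer"},
        {"computer", "printer", "scanner"},
        {"computer", "mouse"},
    ]

    pair_counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1

    print(pair_counts.most_common(3))  # ("computer", "printer") appears in 2 of 3 transactions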
Classification of Data Mining Systems:
1. Kinds of Databases Mined: Systems can be categorized according to the types of databases they work with,
such as relational, transactional, object-relational, data warehouse, spatial, time-series, text, multimedia, or web
mining systems.
2. Kinds of Knowledge Mined: Systems can be grouped by the types of knowledge they extract, including
characterization, discrimination, association, classification, prediction, clustering, outlier analysis, and evolution
analysis. A comprehensive system provides multiple functionalities.
3. Techniques Utilized: Classification can be done based on the data mining techniques employed. This
includes user interaction (autonomous, interactive, query-driven systems) and methods used (database-oriented,
machine learning, statistics, visualization, neural networks, etc.).
4. Applications Adapted: Systems can be categorized by the specific applications they cater to, like finance,
telecommunications, DNA analysis, stock markets, or email. Different applications often need domain-specific
methods.
Data Mining Task Primitives:
1. Task-Relevant Data: Specifies the database portions or dataset of interest, including relevant attributes or
dimensions.
2. Kind of Knowledge: Defines the data mining functions to be applied, like characterization, discrimination,
association, classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background Knowledge: Utilizes domain knowledge to guide discovery and evaluate patterns. This includes
concept hierarchies, user beliefs, and relationships in the data.
4. Interestingness Measures: Metrics and thresholds used to guide mining or evaluate patterns. Measures differ based on the type of knowledge. For instance, association rules use support and confidence (see the sketch after this list).
5. Representation for Visualization: Determines how discovered patterns are presented. This could include rules,
tables, charts, graphs, decision trees, and cubes.
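A hedged sketch of the support and confidence measures mentioned in item 4, computed in Python for the hypothetical rule {computer} => {printer} over a made-up transaction list.

    # Support and confidence for the rule {computer} => {printer} (data invented)
    transactions = [
        {"computer", "printer"},
        {"computer", "printer", "scanner"},
        {"computer", "mouse"},
        {"printer"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"computer", "printer"} <= t)
    antecedent = sum(1 for t in transactions if "computer" in t)

    support = both / n              # fraction of all transactions containing both items
    confidence = both / antecedent  # fraction of "computer" transactions also containing "printer"
    print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.50, confidence=0.67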
A data mining (DM) system may be integrated with a database (DB) or data warehouse (DW) system. Possible integration schemes include no coupling, loose coupling, semi-tight coupling, and tight coupling.
No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system.
Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system.
Semi-tight coupling: Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives, such as sorting, indexing, and aggregation, are provided in the DB/DW system.
Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
Mining Methodology and User Interaction Issues
1. Hard to Derive Knowledge for Diverse Domains − Data mining should cater to varied user interests, involving a
wide range of knowledge discovery tasks.
2. Lack of Interactive Models − The data mining process needs to be interactive, allowing users to focus the search for patterns by providing and refining data mining requests.
3. Ad hoc Data Mining is Not Easy − Data mining query languages should be developed for ad hoc mining tasks and integrated with data warehouse queries, ensuring efficient and flexible mining.
4. Presentation and visualization of data mining results − Discovered patterns must be presented in easily understandable high-level languages and visuals.
5. Handling noisy or incomplete data − Noise handling through data cleaning methods is essential for accurate pattern discovery.
6. Pattern evaluation − Discovered patterns should be genuinely interesting, not common knowledge or patterns lacking novelty.
Performance Issues
There can be performance-related issues such as the following −
1. Inefficiency and Difficulty in Scalability − Data mining algorithms should be efficient and scalable to handle large
datasets effectively.
2. Inability to Work with Parallel, Distributed, and Incremental Algorithms − Due to the size and complexity of the data, parallel and distributed algorithms divide the data for simultaneous processing, thus enhancing efficiency. Incremental algorithms incorporate database updates without mining the entire data again from scratch.
Diverse Data Types Issues
1. Handling of relational and complex types of data − Complex data like multimedia, spatial, and temporal data
require specialized mining approaches.
2. Mining information from heterogeneous databases − Mining knowledge from varied, distributed sources on a LAN or WAN is complex, as these data sources may be structured, semi-structured, or unstructured.
Data Cleaning:
1. Noisy data − Noise is a random error or variance in a measured variable. The following smoothing methods can be used to handle noise (a small binning sketch appears at the end of this subsection):
a. Binning: Sorted data values are grouped into bins and smoothed using their neighboring values (e.g., by bin means or bin boundaries), performing local smoothing.
b. Regression: Data can be smoothed using regression techniques, fitting it to functions like linear regression for
prediction.
c. Clustering: Outliers can be detected through clustering, where values within clusters are considered normal,
and those outside are outliers.
d. Combined Inspection: Outliers can be identified through a combination of computer analysis and human
review, recognizing patterns with unexpected or unusual values.
2. Inconsistent data − Inconsistencies originating from transactions, data entry, or the merging of databases can be identified and rectified. Correlation analysis can help spot redundancies, and accurate integration of data from different sources minimizes redundancy issues.
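A minimal sketch of smoothing by bin means, mentioned in technique (a) above: sorted values are split into equal-frequency bins and each value is replaced by its bin's mean. The price list is invented for illustration.

    # Smoothing by bin means over an invented, sorted price list
    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bin_size = 3

    smoothed = []
    for i in range(0, len(prices), bin_size):
        bin_values = prices[i:i + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 1)] * len(bin_values))

    print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]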
Data Integration:
Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data
conflict detection, and the resolution of semantic heterogeneity contribute toward smooth data integration.
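As a hedged illustration of correlation analysis during integration, the sketch below uses numpy's Pearson correlation to flag a possibly redundant attribute; the two attribute columns are invented.

    # Flag a possibly redundant attribute via Pearson correlation (data invented)
    import numpy as np

    annual_income = np.array([30000, 45000, 60000, 80000, 100000])
    monthly_income = np.array([2500, 3750, 5000, 6667, 8333])

    r = np.corrcoef(annual_income, monthly_income)[0, 1]
    print(f"correlation = {r:.3f}")  # a value near 1.0 suggests one attribute may be redundant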
Data Transformation:
Data transformation can involve the following:
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation, where data is summarized by applying operations such as computing monthly totals from daily sales data. This is often used for constructing data cubes, allowing analysis at various levels.
Generalization of the data, where low-level (primitive) raw data is replaced with higher-level concepts through concept hierarchies. For example, a categorical attribute like "street" can be generalized to "city" or "country."
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0 (a short sketch follows this list).
Attribute construction (or feature construction), where new attributes are constructed and added from the given set
of attributes to help the mining process.
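As a small, hedged example of the normalization item above, the sketch below applies min-max normalization to scale invented income values into the range [0.0, 1.0].

    # Min-max normalization of invented income values into [0.0, 1.0]
    values = [12000, 35000, 58000, 98000]

    lo, hi = min(values), max(values)
    normalized = [(v - lo) / (hi - lo) for v in values]
    print([round(v, 3) for v in normalized])  # [0.0, 0.267, 0.535, 1.0]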
Data Reduction:
Data reduction techniques aim to create a compact version of a dataset while preserving its essential information.
Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
1. Data cube aggregation: Combine data in a data cube through aggregation operations.
2. Attribute subset selection: Identify and remove irrelevant or redundant attributes.
3. Dimensionality reduction: Use encoding methods to decrease dataset size.
4. Numerosity reduction: Substitute the data with smaller representations, either parametric models or non-parametric methods such as clustering and histograms (see the histogram sketch after this list).
5. Discretization and concept hierarchy: Replace raw values with ranges or higher-level concepts, aiding
multi-level analysis.
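A hedged sketch of numerosity reduction with a histogram (item 4): the raw values are replaced by bucket counts. The price list and number of buckets are invented.

    # Replace invented raw prices with three equal-width histogram buckets
    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 28, 30])
    counts, edges = np.histogram(prices, bins=3)  # three equal-width buckets

    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        print(f"bucket {left:.1f}-{right:.1f}: {count} values")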
Data discretization (for numerical data): A technique that converts a large number of continuous data values into a smaller number of intervals, simplifying data analysis and management. It is particularly useful for continuous data, where attribute values are grouped into distinct intervals while minimizing information loss. There are two main types: supervised, which uses class information, and unsupervised, which relies on strategies such as top-down splitting and bottom-up merging.
Histogram analysis: Histogram refers to a plot used to represent the underlying frequency distribution of a continuous
data set.
Binning: Binning refers to a data smoothing technique that groups a large number of continuous values into a smaller number of bins.
Cluster analysis is a common method for data discretization. It involves using a clustering algorithm to group values
into clusters, thereby discretizing the data.
Entropy-based discretization is a supervised method involving top-down splitting. It uses the class distribution to determine the split-points that separate the attribute's value ranges.
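The hedged sketch below shows simple unsupervised, equal-width discretization with pandas.cut; the ages and interval labels are made up for illustration.

    # Unsupervised equal-width discretization of invented ages into labeled intervals
    import pandas as pd

    ages = pd.Series([13, 22, 25, 37, 45, 52, 61, 70])
    labels = ["youth", "adult", "middle_aged", "senior"]
    discretized = pd.cut(ages, bins=4, labels=labels)  # four equal-width intervals

    print(pd.DataFrame({"age": ages, "interval": discretized}))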
Concept Hierarchy Generation for Categorical Data:
1. User-Defined Schema Ordering: Experts can define hierarchies by specifying attribute orders, like street < city <
province < country.
2. Manual Data Grouping: Portion of a hierarchy can be manually created by grouping data, useful for intermediate
levels like categorizing provinces.
3. Automatic Attribute Ordering: Attributes can be hierarchically ordered based on distinct values, with more values at
the bottom (e.g., street) and fewer at the top (e.g., country). Adjustments can be made as needed.
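A minimal sketch of point 3: attributes are ordered by their number of distinct values, with the attribute having the most distinct values placed at the bottom of the hierarchy. The location records here are invented.

    # Order invented location attributes by number of distinct values
    records = [
        ("5th Ave", "New York", "NY", "USA"),
        ("Main St", "Buffalo", "NY", "USA"),
        ("Oak St", "Toronto", "ON", "Canada"),
        ("Pine St", "Vancouver", "BC", "Canada"),
        ("King St", "Toronto", "ON", "Canada"),
    ]
    attributes = ["street", "city", "province_or_state", "country"]

    distinct = {name: len({row[i] for row in records}) for i, name in enumerate(attributes)}
    hierarchy = sorted(attributes, key=lambda a: distinct[a])  # fewest distinct values first (top)
    print(" < ".join(reversed(hierarchy)))  # street < city < province_or_state < country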
MOD 2:
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision-making process.
The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data
repositories.
1. Subject-Oriented: It's organized around key subjects like customers, products, etc., focusing on data analysis for
decision-making rather than daily operations.
2. Integrated: It's built by combining diverse sources (databases, files, records), using data cleaning and integration to ensure
consistency in naming and structure.
3. Time-Variant: Data is stored historically (e.g., past 5-10 years), and time-related information is a part of its structure.
4. Nonvolatile: Separate from operational data, it doesn't need transaction processing, recovery, or concurrency control. It involves
only data loading and access operations.
There are three data warehouse models: the enterprise warehouse, the data mart, and the virtual
warehouse.
An enterprise warehouse collects all of the information about subjects in the entire organization. It puts together data
from different parts of the company, and it can hold a very large volume of data, from megabytes to terabytes and beyond. This kind of warehouse
needs careful planning and might take a year to build.
A data mart is like a smaller version of a data warehouse, focusing on specific needs of a certain group in a company.
Data marts are set up on less expensive servers and can be ready in weeks. They can be independent (drawing from
various sources) or dependent (linked to a larger data warehouse).
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views are materialized. A virtual warehouse is easy to build, but it requires excess capacity on the operational database servers.
Data Warehouse Back-End Tools and Utilities:
1. Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
2. Data cleaning, which detects errors in the data and rectifies them when possible.
3. Data transformation, which converts data from legacy or host format to warehouse format.
4. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
5. Refresh, which propagates the updates from the data sources to the warehouse.
Metadata:
Metadata refers to information about data itself. In the context of a data warehouse, metadata defines the objects in the warehouse. It is like a guide for the data. Metadata includes data names, definitions, timestamps for when data was extracted, where it came from, and details about fields that were added due to cleaning or integration.
1. Structure Metadata: Defines warehouse layout, like schema, views, dimensions, hierarchies, and derived data. Also
covers data marts.
2. Operational Metadata: Tracks data lineage (migration history), status (active/archived), and monitoring (usage,
errors).
3. Summarization Algorithms: Includes methods to summarize data, define measures, dimensions, and queries.
4. Operational Mapping: Describes source databases, extraction, transformation, security, and updates.
5. Performance Data: Involves indices, profiles, and rules for data access and updates.
6. Business Metadata: Contains business terms, data ownership, and policies for understanding.
A Multidimensional Data Model: Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube. It is typically used for the design of corporate data warehouses
and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema.
“What is a data cube?” A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts (measures). Dimensions are the entities or perspectives with respect to which an organization
wants to keep records and are hierarchical in nature.
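As a hedged illustration of the cube idea, the sketch below arranges an invented sales table along two dimensions (branch and item) with total sales as the measure, using a pandas pivot table.

    # A tiny two-dimensional "cube": dimensions branch and item, measure amount (data invented)
    import pandas as pd

    sales = pd.DataFrame({
        "branch": ["Vancouver", "Vancouver", "New York"],
        "item":   ["computer", "printer", "computer"],
        "amount": [1000, 300, 1500],
    })

    cube = sales.pivot_table(values="amount", index="branch", columns="item",
                             aggfunc="sum", fill_value=0)
    print(cube)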
OLAP Operations
Roll-up (Drill-Up): This operation involves summarizing data at a higher level of aggregation. For instance, going from
monthly sales data to quarterly or yearly totals. It allows you to "roll up" data to a broader perspective.
Drill-down (Roll-Down): Drill-down is the opposite of roll-up. It involves breaking down aggregated data into finer levels
of detail. For example, breaking down yearly sales into monthly or daily sales figures.
Slice − Performs a selection on one dimension of the cube, resulting in a subcube with more specific information.
Dice − Defines a subcube by performing a selection on two or more dimensions.
Pivot (Rotate): This operation involves changing the orientation of the data cube, effectively rotating it to view different
perspectives. It's useful for examining data from various angles.
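A hedged sketch of roll-up and slice on a tiny, invented sales table using pandas; drill-down would simply go the other way, back to the finer (branch, quarter, item) detail.

    # Roll-up and slice over an invented sales table (illustrative only)
    import pandas as pd

    sales = pd.DataFrame({
        "branch":  ["Vancouver", "Vancouver", "New York", "New York"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "item":    ["computer", "printer", "computer", "printer"],
        "amount":  [1000, 300, 1500, 400],
    })

    # Roll-up: aggregate from (branch, quarter, item) detail up to totals per branch
    print(sales.groupby("branch")["amount"].sum())

    # Slice: select a single value on one dimension (quarter = "Q1") to get a subcube
    print(sales[sales["quarter"] == "Q1"])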
Measure categories: