BDA Class 1
MODULE I
DATA MINING
Prepared by
Dr. Hima Suresh
Assistant Professor, CS
Contents
• Data mining
• Applications
• Stages of the data mining process
• Types of data mining
• Data Preprocessing
– Data cleaning
– Data transformation
– Data Reduction
• Text mining
DATA MINING
What is Data Mining?
• Data mining is the process of uncovering patterns, anomalies, and
relationships in large datasets that can be used to make predictions
about future trends.
• The main purpose of data mining is to extract valuable
information from available data.
• Alternative names of Data Mining
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, etc.
Applications of Data Mining
• Data mining is highly useful in the following domains −
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Apart from these, data mining can also be used in the areas
of production control, customer retention, science
exploration, sports, astrology, and Internet Web Surf-Aid.
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
STAGES OF DATA MINING PROCESS
• KDD stands for Knowledge Discovery in Databases. It
refers to the broad process of discovering knowledge in
data and emphasizes the high-level application of
particular data mining methods.
• It is an area of interest to researchers in several fields,
such as artificial intelligence, machine learning, pattern
recognition, databases, statistics, knowledge acquisition
for expert systems, and data visualization.
• Here is the list of steps involved in the knowledge discovery process
• Data Selection −
Selecting data involves two steps. The first step, locating the data, tends to be
more mechanical in nature; the second, identifying the relevant data, requires
significant input from a domain expert.
In this step, data relevant to the analysis task are retrieved from the database.
• Data Cleaning − In this step, the noise and inconsistent data will be removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Transformation −
In this step, the data are transformed or consolidated into forms appropriate
for mining, for example by performing summary or aggregation operations.
In data mining terms, data transformation is the process of changing the form or
structure of existing attributes. It is separate from data cleansing and data
enrichment because it does not correct existing attribute data or add new
attributes; instead, it grooms existing attributes for mining purposes.
• Data Mining − In this step, intelligent methods are applied to extract
data patterns.
• Pattern Evaluation − In this step, data patterns will be evaluated.
• Knowledge Presentation − In this step, knowledge will be represented.
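To make these steps concrete, here is a minimal Python sketch of the selection, cleaning, and transformation stages, assuming a small hypothetical customer table in pandas (all column names and values are invented for illustration):

    import pandas as pd

    # Hypothetical raw data; attributes and values are illustrative only.
    raw = pd.DataFrame({
        "age": [25, 32, None, 41, 38],
        "income": [30000, 52000, 48000, None, 95000],
        "region": ["N", "S", "S", "N", "E"],
    })

    # Data selection: retrieve only the attributes relevant to the task.
    data = raw[["age", "income"]]

    # Data cleaning: fill in missing values with the attribute mean.
    data = data.fillna(data.mean(numeric_only=True))

    # Data transformation: scale every value into the range 0.0 to 1.0.
    data = (data - data.min()) / (data.max() - data.min())
    print(data)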
TYPES OF DATA MINING
• Data scientists and analysts use many different data mining techniques to
accomplish their goals. Some of the most common types include the
following:
• Classification: The process of assigning input instances to their
corresponding class labels.
Example: Image recognition
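A minimal classification sketch using scikit-learn (the decision tree and bundled iris dataset are illustrative choices, not prescribed by these notes):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Predict the species (class label) of iris flowers from measurements.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = DecisionTreeClassifier().fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))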
• Cluster Analysis: The process of grouping a set of objects so that
objects in the same group (called a cluster) are more similar to each
other than to those in other groups.
Example: Target marketing
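A minimal clustering sketch in the same spirit, using k-means on invented customer data (the two-cluster choice is an assumption for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented customer data: [annual spend, visits per month].
    customers = np.array([[200, 1], [220, 2], [2100, 12], [1900, 10], [250, 1]])

    # Group customers into two clusters, e.g. occasional vs. frequent buyers.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
    print(labels)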
• Association: A data mining function that discovers the
probability of the co-occurrence of items in a collection. The relationships
between co-occurring items are expressed as association rules.
This technique drives most recommendation engines, such as
when Amazon suggests that if you purchased one item, you might also like
another. Services like Netflix and Spotify can use association rules to
fuel their content recommendation engines.
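A sketch of association rule mining using the mlxtend library (one common Python option; assumes mlxtend is installed, and the basket data are invented):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded market-basket data: each row is one transaction.
    baskets = pd.DataFrame({
        "bread":  [1, 1, 0, 1],
        "butter": [1, 1, 0, 0],
        "milk":   [0, 1, 1, 1],
    }, dtype=bool)

    # Frequent itemsets first, then rules such as {butter} -> {bread}.
    frequent = apriori(baskets, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])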
• Regression: A data mining technique used to predict a range of
numeric values, given a particular dataset. For example, regression
might be used to predict the cost of a product or service, given other
variables.
Example: Predictive analytics and forecasting.
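A minimal regression sketch predicting a price from a single variable (all numbers are invented):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    size = np.array([[50], [70], [90], [110]])   # e.g. floor area
    price = np.array([150, 200, 260, 310])       # e.g. price in thousands

    # Fit a line to the data and predict the price of an unseen size.
    model = LinearRegression().fit(size, price)
    print("predicted price for size 100:", model.predict([[100]])[0])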
• Text mining analyzes how often people use certain words. It can be
useful for sentiment or personality analysis, as well as for analyzing
social media posts for marketing purposes or to spot potential data
leaks from employees.
• Summarization puts a group of data into a more compact, easier-to-
understand form. For example, you might use summarization to
create graphs or calculate averages from a given set of data. This is
one of the most familiar and accessible forms of data mining.
DATA PRE-PROCESSING
• Data preprocessing is a data mining technique which is used
to transform the raw data into a useful and efficient format.
• Steps Involved in Data Preprocessing:
• 1. Data Cleaning: It involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some data is missing in the dataset. It can
be handled in various ways.
Ignore the tuples: This approach is suitable only when the
dataset we have is quite large and multiple values are missing within
a tuple.
Fill the missing values: There are various ways to do this.
You can choose to fill the missing values manually, with the attribute
mean, or with the most probable value.
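A small pandas sketch of both options, dropping incomplete tuples versus filling with the attribute mean (the toy table is invented):

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 41], "income": [30000, 52000, None]})

    # Option 1: ignore (drop) tuples that contain missing values.
    dropped = df.dropna()

    # Option 2: fill the missing values with the attribute mean.
    filled = df.fillna(df.mean(numeric_only=True))
    print(filled)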
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted
by machines. It can be generated by faulty data collection,
data entry errors, etc. It can be handled by:
Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple
independent variables).
Clustering:
This approach groups similar data into clusters; values that
fall outside the clusters can be treated as outliers (noise).
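As a minimal illustration of the regression option above, this sketch smooths a noisy attribute by replacing each value with its fitted estimate (synthetic data, with an assumed linear relationship):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Noisy measurements of a roughly linear relationship.
    x = np.arange(10).reshape(-1, 1)
    y = 3 * x.ravel() + np.random.default_rng(0).normal(0, 2, size=10)

    # Smooth the data: replace each value with its regression estimate.
    smooth = LinearRegression().fit(x, y).predict(x)
    print(np.round(smooth, 2))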
• 2. Data Transformation:
This step transforms the data into forms appropriate for the
mining process. It involves the following approaches:
Normalization:
It is done to scale the data values into a specified range
(-1.0 to 1.0 or 0.0 to 1.0).
Attribute Selection:
In this strategy, new attributes are constructed from the given
set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute
with interval levels or conceptual levels.
• Eg: Consider a numeric attribute like “income”.
Discretize it into categories like “ Low”, “Medium”,
and “High”.
• Income values and conceptual values:
– 20,000-50,000 → Low
– 50,001-90,000 → Medium
– 90,001-150,000 → High
• 3. Data Reduction: It aims to increase the storage efficiency and
reduce data storage and analysis costs.
• There are various strategies for data reduction which are as follows:
• Data cube aggregation − Before data can be aggregated, data from
several disparate sources often must be integrated. Data integration is a
data preprocessing method that merges data from multiple heterogeneous
sources into a coherent store, retaining and providing a unified
perspective of the data. While performing data integration, redundancy,
inconsistency, duplication, etc. must be dealt with.
• Data integration is especially important in the healthcare industry.
Integrated data from several patient records and clinics assists
clinicians in identifying medical disorders and diseases by combining
information from several systems into a single view from which useful
insights can be derived.
• Effective data collection and integration also improve medical
insurance claims processing accuracy and ensure that patient
names and contact information are recorded consistently and
accurately. Interoperability refers to the sharing of
information across different systems.
• When the data are in a form different from what is needed,
aggregation methods can be applied to the attributes to obtain
the desired summarized attributes.
• Data cube aggregation is about summarizing and
consolidating data within a multidimensional
structure to provide a more comprehensible and
actionable overview of the information.
For example, sales per quarter for the years 2010 to 2012 can be aggregated into a single annual sales record per year.
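A minimal pandas sketch of this roll-up, aggregating invented quarterly sales into one annual total per year:

    import pandas as pd

    # Quarterly sales records (all figures are invented).
    sales = pd.DataFrame({
        "year":    [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
        "amount":  [224, 408, 350, 586, 301, 456, 390, 600],
    })

    # Roll the quarter dimension up into a single annual sales record.
    annual = sales.groupby("year", as_index=False)["amount"].sum()
    print(annual)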
• Concept hierarchies may exist for each attribute, allowing the analysis
of data at multiple levels of abstraction.
• Organization of data into a tree-like structure, where each level of the
hierarchy represents a concept that is more general than the level
below it.
• For example, a hierarchy for a branch could allow branches to be
grouped into regions, based on their address. Data cubes support quick
access to pre-computed, summarized data, thus benefiting online
analytical processing and data mining.
• A data cube is a multidimensional data structure that represents
large amounts of data.
Attribute subset selection − In this method, the irrelevant,
weakly relevant, or redundant attributes or dimensions can be
discovered and deleted. Data sets for analysis can include
hundreds of attributes, some of which can be irrelevant to the
mining task or redundant. For instance, if the task is to classify
customers as to whether or not they are likely to purchase a
popular new CD at All Electronics when notified of a sale,
attributes such as the customer’s telephone number are likely to
be irrelevant, unlike attributes such as age or music_taste.
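A brief sketch of automated attribute subset selection with scikit-learn (SelectKBest is one common approach, used here purely for illustration):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    # Keep only the two attributes most relevant to the class label.
    X, y = load_iris(return_X_y=True)
    X_reduced = SelectKBest(f_classif, k=2).fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)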
Dimensionality reduction − Encoding mechanisms are used to
reduce the data set size. In dimensionality reduction, data
encoding or transformations are applied to obtain a reduced or
“compressed” representation of the original data. If the original
data can be reconstructed from the compressed data without
any loss of information, then the data reduction is called
lossless.
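Principal component analysis (PCA) is one common encoding of this kind; it is lossy, since the original data can only be approximately reconstructed. A brief sketch:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    # Project the 4-dimensional data down to a 2-dimensional representation.
    X, _ = load_iris(return_X_y=True)
    X_2d = PCA(n_components=2).fit_transform(X)
    print(X.shape, "->", X_2d.shape)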
Numerosity reduction − The data are replaced or estimated by
alternative, smaller data representations, such as parametric
models (which need to store only the model parameters
rather than the actual data) or nonparametric methods,
including clustering, sampling, and the use of histograms.
[Reduces the data volume by choosing alternative, smaller
forms of data representation.]
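A small sketch of two of the nonparametric options, random sampling and a histogram, on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(50, 10, size=100_000)   # large synthetic attribute

    # Sampling: keep a small random subset instead of all the values.
    sample = rng.choice(data, size=1_000, replace=False)

    # Histogram: store only bucket counts instead of the raw values.
    counts, edges = np.histogram(data, bins=20)
    print(round(sample.mean(), 2), counts.sum())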
Discretization and concept hierarchy generation − In this
method, the raw data values for attributes are replaced by
ranges or higher conceptual levels.
Data discretization is a form of numerosity reduction that is
very useful for the automatic generation of concept
hierarchies. Discretization and concept hierarchy generation
are powerful tools for data mining, as they enable the mining
of data at multiple levels of abstraction.
• Concept hierarchy is a way of organizing and
representing data at different levels of abstraction or
granularity.
TEXT MINING
• Text mining, also known as text data mining, is the process of
transforming unstructured text into a structured format to
identify meaningful patterns and new insights. By applying
advanced analytical techniques, such as Naïve Bayes, Support
Vector Machines (SVM), and other deep learning algorithms,
companies are able to explore and discover hidden relationships
within their unstructured data.
• Text is one of the most common data types within databases.
Depending on the database, this data can be organized as:
• Structured data: This data is standardized into a tabular format
with numerous rows and columns, making it easier to store and
process for analysis and machine learning algorithms. Structured
data can include inputs such as names, addresses, and phone
numbers.
• Unstructured data: This data does not have a predefined
data format. It can include text from sources like social
media or product reviews, or rich media formats like
video and audio files.
• Semi-structured data: As the name suggests, this data is
a blend between structured and unstructured data
formats. While it has some organization, it doesn’t have
enough structure to meet the requirements of a
relational database. Examples of semi-structured data
include XML, JSON and HTML files.
TEXT MINING TECHNIQUES
• The process of text mining comprises several activities that
enable you to deduce information from unstructured text
data. Before you can apply different text mining techniques,
you must start with text preprocessing, which is the practice
of cleaning and transforming text data into a usable format.
• This practice is a core aspect of natural language processing
(NLP) and it usually involves the use of techniques such as
language identification, tokenization, part-of-speech tagging,
chunking, and syntax parsing to format data appropriately for
analysis. When text preprocessing is complete, you can apply
text mining algorithms to derive insights from the data.
• Some of these common text mining techniques include:
1. Information retrieval
• Information retrieval (IR) returns relevant information or
documents based on a pre-defined set of queries or phrases.
IR systems utilize algorithms to track user behaviors and
identify relevant data. Information retrieval is commonly used
in library catalogue systems and popular search engines, like
Google. Some common IR sub-tasks include:
✔ Tokenization: This is the process of breaking long-form
text into sentences and words called “tokens”. These are
then used in models, like bag-of-words, for text clustering
and document matching tasks.
✔ Stemming: This refers to the process of separating the
prefixes and suffixes from words to derive the root word form
and meaning. This technique improves information retrieval
by reducing the size of indexing files.
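A minimal sketch of tokenization followed by Porter stemming with NLTK (assumed installed; a simple whitespace split stands in for a real tokenizer):

    from nltk.stem import PorterStemmer

    text = "Miners are mining interesting patterns from textual data"

    # Tokenization: break the text into word tokens (simplified here).
    tokens = text.lower().split()

    # Stemming: strip suffixes so related word forms share one root.
    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])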
2. Natural language processing (NLP)
• Natural language processing, which evolved from computational
linguistics, uses methods from various disciplines, such as computer
science, artificial intelligence, linguistics, and data science, to enable
computers to understand human language in both written and verbal
forms. By analyzing sentence structure and grammar, NLP sub-tasks
allow computers to “read”. Common sub-tasks include:
✔ Summarization: This technique provides a synopsis of long pieces of text
to create a concise, coherent summary of a document’s main points.
✔ Part-of-Speech (PoS) tagging: This technique assigns a tag to every token
in a document based on its part of speech—i.e. denoting nouns, verbs,
adjectives, etc. This step enables semantic analysis on unstructured text.
✔ Text categorization: This task, which is also known as text classification,
is responsible for analyzing text documents and classifying them based
on predefined topics or categories. This sub-task is particularly helpful
when categorizing synonyms and abbreviations.
✔ Sentiment analysis: This task detects positive or negative
sentiment from internal or external data sources, allowing you
to track changes in customer attitudes over time. It is
commonly used to provide information about perceptions of
brands, products, and services. These insights can propel
businesses to connect with customers and improve processes
and user experiences.
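Tying back to the Naïve Bayes mention earlier, here is a minimal sentiment-classification sketch with scikit-learn (the tiny corpus is invented; real systems train on far more data):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["great product, loved it", "terrible, waste of money",
             "really happy with this", "awful quality, very disappointed"]
    labels = ["positive", "negative", "positive", "negative"]

    # Bag-of-words features feeding a Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
    print(model.predict(["loved it, great quality"]))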
3. Information extraction
✔ Information extraction (IE) surfaces the relevant pieces of
data when searching various documents. It also focuses on
extracting structured information from free text and storing
these entities, attributes, and relationship information in a
database. Common information extraction sub-tasks include:
✔ Feature selection, or attribute selection, is the process of
selecting the important features (dimensions) that
contribute most to the output of a predictive analytics
model.
✔ Feature extraction is the process of deriving a new, smaller
set of features from the raw input to improve the accuracy
of a classification task. This is particularly important for
dimensionality reduction.
✔ Named-entity recognition (NER) also known as entity
identification or entity extraction, aims to find and
categorize specific entities in text, such as names or
locations. For example, NER identifies “California” as a
location and “Mary” as a woman’s name.
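A short NER sketch with spaCy (one common library; assumes the package and its small English model are installed, e.g. via python -m spacy download en_core_web_sm):

    import spacy

    # Load a small pretrained English pipeline (assumed downloaded).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Mary moved from California to New York in 2021.")
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. Mary PERSON, California GPE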
4. Data mining
• Data mining is the process of identifying patterns and
extracting useful insights from big data sets. This practice
evaluates both structured and unstructured data to
identify new information, and it is commonly utilized to
analyze consumer behaviors within marketing and sales.
• Text mining is essentially a sub-field of data mining as it
focuses on bringing structure to unstructured data and
analyzing it to generate novel insights. The techniques
mentioned above are forms of data mining but fall under
the scope of textual data analysis.
TEXT MINING APPLICATIONS
• Text analytics software has impacted the way that many
industries work, allowing them to improve product user
experiences as well as make faster and better business
decisions. Some use cases include:
✔ Customer service: There are various ways in which we solicit
customer feedback from our users. When combined with text
analytics tools, feedback systems, such as chatbots, customer
surveys, NPS (net promoter scores), online reviews, support
tickets, and social media profiles, enable companies to
improve their customer experience with speed.
⮚ Text mining and sentiment analysis can provide a mechanism
for companies to prioritize key pain points for their customers,
allowing businesses to respond to urgent issues in real-time
and increase customer satisfaction.
✔ Risk management: Text mining also has applications in risk
management, where it can provide insights around industry
trends and financial markets by monitoring shifts in sentiment
and by extracting information from analyst reports and
whitepapers. This is particularly valuable to banking
institutions as this data provides more confidence when
considering business investments across various sectors.
✔ Maintenance: Text mining provides a rich and complete
picture of the operation and functionality of products and
machinery. Over time, text mining automates decision making
by revealing patterns that correlate with problems and
preventive and reactive maintenance procedures. Text
analytics helps maintenance professionals unearth the root
cause of challenges and failures faster.
✔ Healthcare: Text mining techniques have been increasingly
valuable to researchers in the biomedical field, particularly for
clustering information. Manual investigation of medical
research can be costly and time-consuming; text mining
provides an automation method for extracting valuable
information from medical literature.
✔ Spam filtering: Spam frequently serves as an entry point for
hackers to infect computer systems with malware. Text mining
can provide a method to filter and exclude these e-mails from
inboxes, improving the overall user experience and
minimizing the risk of cyber-attacks to end users.
THANK YOU