Lesson 1
What is Data Mining?
❑ Data mining is the process of discovering new
insights and trends from large data sets. Data
mining techniques like data warehousing, artificial
intelligence, and machine learning help
professionals organize and analyze information to
make more informed organizational decisions. For
this reason, data mining is popular in industries
like business, healthcare, marketing, and finance.
Why do we need Data Mining?
Data Mining process
❑ The specific steps vary depending on the tools and techniques used, but a general
process includes:
❖ Business Understanding:
➢ Problem Definition: Clearly define the business problem or question that data mining aims to address.
➢ Data Requirements: Identify the relevant data sources and attributes needed to answer the question.
❖ Data Collection:
➢ Gather Data: Collect the necessary data from various sources, such as databases, files, and sensors.
➢ Data Integration: Combine data from different sources into a unified dataset.
❖ Data Preparation:
➢ Data Cleaning: Handle missing values, outliers, and inconsistencies in the data.
➢ Data Transformation: Convert data into a suitable format for analysis, such as normalization or scaling.
➢ Data Reduction: Reduce the dimensionality of the data if necessary to improve efficiency and
performance.
❖ Model Building:
➢ Choose Algorithms: Select appropriate data mining algorithms based on the problem and data
characteristics.
➢ Train Model: Apply the algorithms to the prepared data to create predictive or descriptive models.
➢ Model Evaluation: Assess the model's performance using evaluation metrics like accuracy, precision,
recall, and F1-score.
❖ Evaluation:
➢ Model Validation: Validate the model's performance on a separate test dataset to ensure its
generalizability.
➢ Refine Model: If necessary, iterate on the model building process to improve its performance.
❖ Deployment:
➢ Integrate Model: Integrate the final model into the business application or system.
➢ Monitor Performance: Continuously monitor the model's performance in real-world scenarios and
update it as needed.
❖ Knowledge Discovery:
➢ Interpret Patterns: Extract meaningful insights and patterns from the model's results.
➢ Communicate Findings: Communicate the discovered knowledge to relevant stakeholders in a clear
and understandable manner.
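The data preparation and model evaluation steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a prescribed pipeline: the helper names (`fill_missing`, `min_max_scale`, `classification_metrics`), the toy `ages` column, and the label lists are all made up for the example.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def classification_metrics(actual, predicted, positive=1):
    """Model evaluation: accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical toy column with one missing value.
ages = [20, None, 40, 60]
cleaned = fill_missing(ages)
scaled = min_max_scale(cleaned)

# Hypothetical labels from a separate test set, and a model's predictions for them.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
```

Computing the metrics on labels the model did not see during training is what the Model Validation step refers to.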
Questions (20 mins)
❑ What is Data?
❑ What are the types of data?
The Data is also very complex
Example: Transaction Data
❑ Point cards allow companies to collect information about specific users.
Example: Network Data
Behavioral Data
❑ Mobile phones today record a large amount of information about user behavior
❖ GPS records position
❖ Camera produces images
❖ Communication via phone and SMS
❖ Text via Facebook updates
❖ Association with entities via check-ins
❑ Amazon collects all the items that you browsed, placed into your basket, read
reviews about, or purchased.
❑ Google and Bing record all your browsing activity via toolbar plugins. They also
record the queries you issued, the pages you saw, and the clicks you made.
Types of Data
Categorical Data
Numeric Record Data
❑ If data objects have the same fixed set of numeric attributes, then the data objects
can be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
❑ Such a data set can be represented by an n-by-d data matrix, where there are n
rows, one for each object, and d columns, one for each attribute
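As a small illustration of the n-by-d view, the sketch below stores a made-up data set as a list of rows and treats each row as a point. The attribute meanings (height, weight) and the use of Euclidean distance are assumptions chosen for the example; the slides do not prescribe a particular distance.

```python
import math

# Hypothetical data set: n = 3 objects, d = 2 numeric attributes
# (height in cm, weight in kg).
data_matrix = [
    [170.0, 65.0],
    [180.0, 80.0],
    [173.0, 69.0],
]

n = len(data_matrix)       # number of rows (objects)
d = len(data_matrix[0])    # number of columns (attributes)

def euclidean_distance(p, q):
    """Distance between two objects viewed as points in d-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between the first and third object: sqrt(3**2 + 4**2).
dist = euclidean_distance(data_matrix[0], data_matrix[2])
```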
Document Data
Transaction Data
❑ Each record (transaction) is a set of items.
❑ A set of items can also be represented as a binary vector, where each attribute is
an item.
❑ A document can also be represented as a set of words (no counts)
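The set-to-binary-vector conversion can be sketched as follows. The transactions and item names are hypothetical; the only assumption about the representation is that the items are put in a fixed (here alphabetical) order so that each position in the vector always refers to the same item.

```python
# Hypothetical transactions; each one is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
]

# Fix an attribute order: one attribute (column) per distinct item.
items = sorted(set().union(*transactions))

def to_binary_vector(transaction, items):
    """1 if the item occurs in the transaction, 0 otherwise."""
    return [1 if item in transaction else 0 for item in items]

binary_matrix = [to_binary_vector(t, items) for t in transactions]
```

The same function works for a document represented as a set of words: the items are then the vocabulary, and the counts are deliberately ignored.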
Ordered Data
❑ Genomic sequence data
❑ Data is a long ordered string
Ordered Data
❑ Time series
❖ Sequence of ordered (over “time”) numeric values.
Graph Data
❑ Examples: Web graph and HTML Links
WHY DATA MINING?
❑ Commercial point of view
❖ Data has become the key competitive advantage of companies
➢ Examples: Facebook, Google, Amazon
❖ Being able to extract useful information out of the data is key for exploiting it commercially.
WHY DATA MINING?
❑ Scale (in data size and feature dimension)
❖ Why not use traditional analytic methods?
❖ Enormity of data (very large volumes of data), curse of dimensionality
❖ The amount and complexity of the data do not allow for manual processing. We
need automated techniques.
What is DATA MINING again?
❑ “Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data analyst” (Hand, Mannila, Smyth)
What can we do with DATA MINING?
❑ Some examples:
❖ Frequent item sets and association rule extraction
❖ Coverage
❖ Clustering
❖ Classification
❖ Ranking
❖ Exploratory analysis
Frequent item sets and association rules
❑ Given a set of records each of which contains some number of items from a given
collection:
❖ Identify sets of items (item sets) occurring frequently together
❖ Produce dependency rules which will predict occurrence of an item based on occurrence of
other items.
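Both tasks can be sketched with the two measures commonly used for them, support and confidence. The slides do not name these measures, so treat their use here as an assumption, along with the toy transactions, the function names, and the 0.5 threshold. The enumeration is deliberately brute force; it is a sketch for small examples, not an efficient algorithm.

```python
from itertools import combinations

# Hypothetical records; each one is a set of items.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Enumerate all item sets up to max_size whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

frequent = frequent_itemsets(transactions, min_support=0.5)
rule_conf = confidence({"diapers", "milk"}, {"beer"}, transactions)
```

Here `rule_conf` is the confidence of the dependency rule {diapers, milk} → {beer}: among the transactions that contain diapers and milk, the fraction that also contain beer.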
Frequent item sets: Applications
❑ Text mining: finding associated phrases in text
❖ There are lots of documents that contain the phrases “association rules”, “data mining” and
“efficient algorithm”
❑ Recommendations:
❖ Users who buy this item often buy that item as well
❖ Users who watched James Bond movies also watched Jason Bourne movies.
Frequent item sets: Applications
❑ Supermarket shelf management
❖ Goal: to identify items that are bought together by sufficiently many customers.
❖ Approach: process the point of sale data collected with barcode scanners to find dependencies
among items.
❖ A classic rule:
➢ If a customer buys diapers and milk, then they are very likely to buy beer.
➢ So, don’t be surprised if you find six-packs stacked next to diapers!
Clustering definition
Coverage
❑ The problem of selecting a small set of items such that every user is covered by at least one of them
❖ Given a set of customers and items and the transaction relationship between the two, select a
small set of items that “covers” all users.
➢ For each user there is at least one item in the set that the user has bought.
❖ Applications:
➢ Create a catalog to send out that has at least one item of interest for every customer.
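Selecting a small covering set is an instance of the set cover problem, for which finding the smallest possible set is computationally hard in general. A common heuristic is greedy selection, sketched below. The slides do not say which method is intended, so the greedy approach, the `greedy_cover` name, and the toy purchase data are all assumptions.

```python
def greedy_cover(purchases):
    """Greedy heuristic: repeatedly pick the item bought by the most uncovered users.

    purchases maps each user to the set of items that user has bought.
    Returns a list of items such that every user with at least one
    purchase has bought at least one item in the list.
    """
    # Invert the relationship: item -> set of users who bought it.
    buyers = {}
    for user, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)

    uncovered = {user for user, items in purchases.items() if items}
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(buyers), key=lambda item: len(buyers[item] & uncovered))
        chosen.append(best)
        uncovered -= buyers[best]
    return chosen

# Hypothetical customers and the items each has bought.
purchases = {
    "ann": {"tea", "rice"},
    "bob": {"tea"},
    "cat": {"tea", "soap"},
    "dan": {"soap"},
}
catalog = greedy_cover(purchases)
```

The greedy result always covers every user, but it is not guaranteed to be the smallest possible set.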
Classification
❑ Classification is a technique that assigns data points to categories based on their features or
attributes
❖ Given a collection of records (training set)
➢ Each record contains a set of attributes; one of the attributes is the class.
❖ Find a model for the class attribute as a function of the values of the other attributes.
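To make "a model for the class attribute as a function of the other attributes" concrete, here is a minimal sketch using a 1-nearest-neighbor rule. This is just one simple classifier picked for illustration, not the method the slides have in mind; the attributes (age, income) and the training records are hypothetical.

```python
import math

# Hypothetical training set: (attributes, class).
# Attributes are (age, income in thousands); the class is "yes" or "no".
training_set = [
    ((25, 30), "no"),
    ((30, 40), "no"),
    ((45, 90), "yes"),
    ((50, 110), "yes"),
]

def predict(attributes, training_set):
    """1-nearest-neighbor: return the class of the closest training record."""
    def distance(record):
        return math.dist(attributes, record[0])
    nearest = min(training_set, key=distance)
    return nearest[1]

label_a = predict((28, 35), training_set)
label_b = predict((48, 100), training_set)
```

In practice the attributes would usually be scaled first, since the attribute with the largest numeric range otherwise dominates the distance.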
Classification example
Classification: Application
❑ Ad Click Prediction
❖ Goal: predict if a user that visits a web page will click on a displayed ad. Use it to target users
with high click probability.
❖ Approach:
➢ Collect data for users over a period of time and record who clicks and who does not. The {click, no
click} information forms the class attribute.
➢ Use the history of the user (web pages browsed, queries issued) as the features.
➢ Learn a classifier model and test on new users.
Classification: Application 2
❑ Fraud Detection
❖ Goal: predict fraudulent cases in credit card transactions.
❖ Approach:
➢ Use credit card transactions and the information on its account-holder as attributes.
• When the customer buys, what they buy, how often they pay on time, etc.
➢ Label past transactions as fraud or fair transactions. This forms the class attribute.
➢ Learn a model for the class of the transactions.
➢ Use this model to detect fraud by observing credit card transactions on an account.
Link Analysis Ranking
❑ Link analysis ranking is a technique that uses graph theory to rank web pages based on their link
structure
❖ Given a collection of web pages that are linked to each other, rank the pages according to
importance (authoritativeness) in the graph
➢ Intuition: a page gains authority if it is linked to by another page.
❖ Application: when retrieving pages, the authoritativeness is factored in the ranking.
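One well-known algorithm built on this intuition is PageRank, sketched below with simple power iteration. The slides do not name a specific algorithm, so PageRank is an assumption here, as are the damping factor of 0.85, the iteration count, and the four-page toy graph.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict: page -> list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page with no out-links: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web graph: a, b and d all link to c; c links to d.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],
}
ranks = pagerank(links)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

Page `c` ends up ranked first because three pages link to it, and `d` comes second because its single in-link comes from the highly ranked `c`.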
Exploratory analysis
❑ Trying to understand the data as a physical phenomenon, and describe it with
simple metrics
❖ What does the web graph look like?
❖ How often do people repeat the same query?
❖ Are friends on Facebook also friends on Twitter?
❑ The important thing is to find the right metrics and ask the right questions
❑ It helps our understanding of the world, and can lead to models of the phenomena
we observe.
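A question such as "How often do people repeat the same query?" turns into a simple metric once "repeat" is defined. The sketch below assumes one possible definition (a query is a repeat each time the same user issues it again after the first time) and uses a made-up query log.

```python
from collections import Counter

# Hypothetical query log: (user, query) pairs in the order they were issued.
query_log = [
    ("u1", "weather"),
    ("u1", "weather"),
    ("u2", "data mining"),
    ("u1", "news"),
    ("u2", "data mining"),
    ("u2", "data mining"),
]

# How many times each user issued each query.
counts = Counter(query_log)
total = len(query_log)

# Every occurrence after the first one by the same user counts as a repeat.
repeats = sum(c - 1 for c in counts.values())
repeat_rate = repeats / total
```

A different definition (for example, counting repeats across users) would give a different number, which is why choosing the right metric matters.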
Applications of Data Mining
❑ Healthcare and Insurance
❑ Education
❑ Entertainment
❑ Banking
❑ Marketing
❑ Manufacturing
Answer (20 mins)
Questions
❑ What is data?
❑ What is a data attribute?
❑ What are the types of data and the types of attributes?
Thank you