Lesson 1


Lesson 1: Introduction to Data Mining

Taught by: Mr. ហួន សុភី, Phone: 086611073, E-mail: [email protected]

[email protected]  t.me/nubbedukh  www.nubb.edu.kh  nubb.cambodia
069 556 155, 066 556 155, 096 556 155
What is Data Mining?

❑ After years of data mining, there is still no single answer to this question.
❑ A tentative definition:
❖ Data mining is the use of efficient techniques for the analysis of very large
collections of data and the extraction of useful and possibly unexpected patterns
in the data.

What is Data Mining?
❑ Data mining is the process of discovering new
insights and trends from large data sets. It draws on
data warehousing, artificial intelligence, and machine
learning to help professionals organize and analyze
information and make better-informed organizational
decisions. For this reason, data mining is popular in
industries like business, healthcare, marketing, and finance.

Why do we need Data Mining?

❑ Need to analyze the raw data to extract knowledge.


❑ Enables informed and data-driven decision-making.
❑ Helps in analyzing substantial amounts of data quickly.
❑ Businesses can get reliable information through data mining.
❑ Helps in identifying patterns and trends and detecting fraud.
❑ Need a way to harness the collective intelligence.

Data Mining process

❑ The specific steps vary with the tools and techniques used, but a general
process includes:
❖ Business Understanding:
➢ Problem Definition: Clearly define the business problem or question that data mining aims to address.
➢ Data Requirements: Identify the relevant data sources and attributes needed to answer the question.
❖ Data Collection:
➢ Gather Data: Collect the necessary data from various sources, such as databases, files, and sensors.
➢ Data Integration: Combine data from different sources into a unified dataset.

Data Mining process
❖ Data Preparation:
➢ Data Cleaning: Handle missing values, outliers, and inconsistencies in the data.
➢ Data Transformation: Convert data into a suitable format for analysis, such as normalization or scaling.
➢ Data Reduction: Reduce the dimensionality of the data if necessary to improve efficiency and
performance.
❖ Model Building:
➢ Choose Algorithms: Select appropriate data mining algorithms based on the problem and data
characteristics.
➢ Train Model: Apply the algorithms to the prepared data to create predictive or descriptive models.
➢ Model Evaluation: Assess the model's performance using evaluation metrics like accuracy, precision,
recall, and F1-score.
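
The Data Preparation steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy `ages` column is made up, and mean imputation and min-max scaling are just one possible choice for cleaning and transformation, not a method prescribed by the lesson.

```python
def impute_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw column with one missing value (an assumption for this example).
ages = [20, None, 40, 60]
cleaned = impute_mean(ages)      # missing value becomes the mean, 40.0
scaled = min_max_scale(cleaned)  # smallest value maps to 0.0, largest to 1.0
print(cleaned, scaled)
```

Other choices (dropping incomplete records, standardizing to zero mean and unit variance) would fit the same step; which one is appropriate depends on the data and the algorithm used afterwards.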

Data Mining process
❖ Evaluation:
➢ Model Validation: Validate the model's performance on a separate test dataset to ensure its
generalizability.
➢ Refine Model: If necessary, iterate on the model building process to improve its performance.
❖ Deployment:
➢ Integrate Model: Integrate the final model into the business application or system.
➢ Monitor Performance: Continuously monitor the model's performance in real-world scenarios and
update it as needed.
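
The evaluation metrics named in the process (accuracy, precision, recall, and F1-score) can be computed directly from the true and predicted labels of a test set. The sketch below assumes a binary problem with the positive class encoded as `1`; the label lists are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.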

Data Mining process
❖ Knowledge Discovery:
➢ Interpret Patterns: Extract meaningful insights and patterns from the model's results.
➢ Communicate Findings: Communicate the discovered knowledge to relevant stakeholders in a clear
and understandable manner.

Questions (20 mins)

❑ What is data?
❑ What are the types of data?

Data is also very complex

❑ Multiple types of data: table, time series, image, graphs, etc.


❑ Spatial and temporal aspects
❑ Interconnected data of different types:
❖ From a mobile phone we can collect the user's location, friendship information, check-ins to
venues, opinions posted on Twitter, images taken with the camera, and queries to search engines.

Example: Transaction Data

❑ Billions of real-life customer transactions:
❖ WALMART: 20M transactions per day
❖ Credit card companies: billions of transactions per day.

❑ Loyalty (point) cards allow companies to collect information about specific users.

Example: Network Data

❑ Web: 50 billion pages linked via hyperlinks


❑ Facebook: 500 million users
❑ Twitter: 300 million users
❑ Instant messenger: ~1 billion users
❑ Blogs: 250 million blogs worldwide

Behavioral Data

❑ Mobile phones today record a large amount of information about user behavior
❖ GPS records position
❖ Camera produces images
❖ Communication via phone and SMS
❖ Text via Facebook updates
❖ Association with entities via check-ins
❑ Amazon records all the items that you browsed, placed into your basket, read
reviews about, or purchased.
❑ Google and Bing record all your browsing activity via toolbar plugins. They also
record the queries you asked, the pages you saw, and the clicks you made.

Types of Data

❑ Numeric data: each object is a point in a multidimensional space


❑ Categorical data: each object is a vector of categorical values
❑ Set data: each object is a set of values (with or without counts)
❖ Sets can also be represented as binary vectors, or vectors of counts

❑ Ordered sequences: each object is an ordered sequence of values.


❑ Graph data: data that represents relationships between entities. It is
often visualized as a network or diagram, where nodes (or vertices) represent the
entities and edges (or arcs) represent the connections between them.

Categorical Data

❑ Data that consists of a collection of records, each of which consists of a fixed set of
categorical attributes

Numeric Record Data

❑ If data objects have the same fixed set of numeric attributes, then the data objects
can be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
❑ Such a data set can be represented by an n-by-d data matrix, where there are n
rows, one for each object, and d columns, one for each attribute
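
As a small illustration of the n-by-d data matrix, the sketch below stores three objects with two numeric attributes each as a list of rows. The attribute names (height in cm, weight in kg) and the values are assumptions made up for this example.

```python
# Hypothetical data set: n = 3 objects (rows), d = 2 numeric attributes (columns).
# Column 0 is height in cm, column 1 is weight in kg (both invented for illustration).
data = [
    [170.0, 65.0],
    [180.0, 80.0],
    [160.0, 50.0],
]

n = len(data)     # number of objects (rows)
d = len(data[0])  # number of attributes (columns)


def column(matrix, j):
    """Return attribute j for every object, i.e. one column of the data matrix."""
    return [row[j] for row in matrix]


print(n, d)
print(column(data, 1))  # the weight attribute of all objects
```

Each row is one point in a d-dimensional space, which is what allows distance-based techniques to be applied to such data.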

Document Data

❑ Each document becomes a ‘term’ vector:
❖ each term is a component (attribute) of the vector,
❖ the value of each component is the number of times the corresponding term occurs in the
document.
❖ Bag-of-words representation: word order is ignored
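
The term-vector idea can be sketched as follows. The two example documents are invented, and splitting on whitespace after lowercasing is a simplifying assumption; real text pipelines usually also handle punctuation and stop words.

```python
from collections import Counter


def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document (word order is ignored)."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical documents (assumptions for this example).
docs = [
    "data mining finds patterns in data",
    "mining patterns from large data sets",
]

# The vocabulary is the sorted set of all terms seen in the collection.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document becomes a vector of the same length, one component per vocabulary term, so documents can be compared like any other numeric records.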

Transaction Data
❑ Each record (transaction) is a set of items.
❑ A set of items can also be represented as a binary vector, where each attribute is
an item.
❑ A document can also be represented as a set of words (no counts)
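
To make the binary-vector representation concrete, the sketch below converts a few transactions into 0/1 vectors with one position per item. The transactions themselves are hypothetical.

```python
# Hypothetical transactions: each record is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Milk"},
    {"Bread", "Coke"},
]

# One attribute per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))


def to_binary(transaction, items):
    """Return 1 for each item present in the transaction and 0 otherwise."""
    return [1 if item in transaction else 0 for item in items]


binary_vectors = [to_binary(t, items) for t in transactions]

print(items)
for vec in binary_vectors:
    print(vec)
```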

Ordered Data
❑ Genomic sequence data
❑ Data is a long ordered string

Ordered Data
❑ Time series
❖ Sequence of ordered (over “time”) numeric values.

Graph Data
❑ Examples: Web graph and HTML Links
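
A web graph of this kind is commonly stored as an adjacency list: each page maps to the pages it links to. The sketch below assumes a made-up set of hyperlinks between pages named A, B, and C.

```python
# Hypothetical hyperlinks as (source page, destination page) pairs.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

# Adjacency list: node -> list of nodes it links to.
graph = {}
for src, dst in links:
    graph.setdefault(src, []).append(dst)
    graph.setdefault(dst, [])  # make sure pages with no outgoing links also appear

# In-degree: how many pages link to each page.
in_degree = {node: 0 for node in graph}
for src, targets in graph.items():
    for dst in targets:
        in_degree[dst] += 1

print(graph)
print(in_degree)
```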

WHY DATA MINING?
❑ Commercial point of view
❖ Data has become the key competitive advantage of companies
➢ Examples: Facebook, Google, Amazon
❖ Being able to extract useful information out of the data is key for exploiting them commercially.

❑ Scientific point of view


❖ Scientists are in an unprecedented position where they can collect terabytes (TB) of information
➢ Examples: sensor data, astronomy data, social network data, gene data
❖ We need the tools to analyze such data to get a better understanding of the world and advance
science

WHY DATA MINING?
❑ Scale (in data size and feature dimension)
❖ Why not use traditional analytic methods?
❖ Enormity of data (very large data volumes) and the curse of dimensionality
❖ The amount and the complexity of the data do not allow for manual processing. We
need automated techniques.

What is DATA MINING again?
❑ “Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data analyst” (Hand, Mannila, Smyth)

❑ “Data mining is the discovery of models for data” (Rajaraman, Ullman)


❖ We can have the following types of models
➢ Models that explain the data (e.g., a single function)
➢ Models that predict future data instances
➢ Models that summarize the data
➢ Models that extract the most prominent features of the data.

What can we do with DATA MINING?
❑ Some examples:
❖ Frequent item sets and association rule extraction
❖ Coverage
❖ Clustering
❖ Classification
❖ Ranking
❖ Exploratory analysis

Frequent item sets and association rules
❑ Given a set of records, each of which contains some number of items from a given
collection:
❖ Identify sets of items (item sets) occurring frequently together
❖ Produce dependency rules which will predict occurrence of an item based on occurrence of
other items.

Item sets discovered:    Rules discovered:
{Milk, Coke}             {Milk} --> {Coke}
{Diaper, Milk}           {Diaper, Milk} --> {Beer}
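
The sketch below shows how such item sets and rules could be derived by counting. The five transactions are hypothetical, since the slide's transaction table is not part of this text. The code enumerates all item pairs by brute force, which is only practical for tiny examples; it is not an efficient algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical transactions (assumptions for this example).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_pairs(transactions, min_count):
    """All 2-item sets that occur together in at least min_count transactions."""
    items = sorted(set().union(*transactions))
    return [
        set(pair)
        for pair in combinations(items, 2)
        if support_count(set(pair), transactions) >= min_count
    ]


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


pairs = frequent_pairs(transactions, min_count=3)
print(pairs)
print(confidence({"Milk"}, {"Coke"}, transactions))
print(confidence({"Diaper", "Milk"}, {"Beer"}, transactions))
```

With a minimum support count of 3, only {Coke, Milk} and {Diaper, Milk} are frequent in this toy data. The rule {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.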

Frequent item sets: Applications
❑ Text mining: finding associated phrases in text
❖ There are lots of documents that contain the phrases “association rules”, “data mining” and
“efficient algorithm”

❑ Recommendations:
❖ Users who buy this item often buy that item as well
❖ Users who watched James Bond movies, also watched Jason Bourne movies.

Frequent item sets: Applications
❑ Supermarket shelf management
❖ Goal: to identify items that are bought together by sufficiently many customers.
❖ Approach: process the point of sale data collected with barcode scanners to find dependencies
among items.
❖ A classic rule:
➢ If a customer buys diapers and milk, then they are very likely to buy beer.
➢ So, don’t be surprised if you find six-packs stacked next to diapers!

Clustering definition

❑ Clustering is an unsupervised machine learning technique that groups data points based on
their similarities
❖ Given a set of data points, each having a set of attributes, and a similarity measure among them,
find clusters such that
➢ Data points in one cluster are more similar to one another.
➢ Data points in separate clusters are less similar to one another.
❖ Similarity measures:
➢ Euclidean Distance if attributes are continuous.
➢ Other Problem-specific Measures.
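
As one possible illustration of clustering with Euclidean distance, the sketch below runs a few iterations of k-means on six made-up 2-D points. The lesson does not name a specific algorithm, so the choice of k-means, the starting centroids, and the data are all assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

The three points near (1, 1) end up in one cluster and the three points near (8.5, 8.5) in the other, matching the goal that points within a cluster are more similar to one another than to points in other clusters.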

Coverage
❑ The problem of choosing a small set of items so that every user is covered by at least one of them
❖ Given a set of customers and items and the transaction relationship between the two, select a
small set of items that “covers” all users.
➢ For each user there is at least one item in the set that the user has bought.
❖ Applications:
➢ Create a catalog to send out that has at least one item of interest for every customer.
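
Finding the smallest covering set is a set cover problem, which is hard to solve exactly, so a greedy heuristic is a common approximation. The sketch below assumes that heuristic and a made-up purchase table; the lesson itself does not prescribe an algorithm.

```python
def greedy_cover(purchases):
    """Pick items one at a time, each time choosing the item bought by the most still-uncovered users.

    purchases maps each item to the set of users who bought it.
    """
    uncovered = set().union(*purchases.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(purchases), key=lambda item: len(purchases[item] & uncovered))
        chosen.append(best)
        uncovered -= purchases[best]
    return chosen


# Hypothetical purchase data: item -> users who bought it.
purchases = {
    "Milk": {"u1", "u2", "u3"},
    "Beer": {"u3", "u4"},
    "Coke": {"u1", "u4"},
    "Diaper": {"u5"},
}

catalog = greedy_cover(purchases)
print(catalog)
```

Here the catalog contains Milk, Beer, and Diaper: every user has bought at least one of those items, which is exactly the catalog application described above.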

Classification
❑ Classification is a technique that assigns data points to categories based on their features or
attributes
❖ Given a collection of records (training set)
➢ Each record contains a set of attributes, one of the attributes is the class.
❖ Find a model for class attribute as a function of the values of other attributes.

❖ Goal: previously unseen records should be assigned a class as accurately as possible.


➢ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into
training and test sets, with training set used to build the model and test set used to validate it.
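
The train/test workflow can be sketched with a very simple classifier. The example below assumes a 1-nearest-neighbor rule, a fixed (non-random) split, and invented records with two numeric attributes and a class of "low" or "high"; none of these choices come from the lesson.

```python
import math


def predict_1nn(train, x):
    """Assign x the class of the closest training record (Euclidean distance)."""
    def distance(record):
        attributes, _ = record
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(attributes, x)))

    _, label = min(train, key=distance)
    return label


# Hypothetical records: (attributes, class).
records = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.3), "low"), ((1.1, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.3), "high"), ((5.1, 5.1), "high"),
]

# Divide the data: the training set builds the model, the test set validates it.
train = records[0:3] + records[4:7]
test = [records[3], records[7]]

predictions = [predict_1nn(train, x) for x, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)

print(predictions, accuracy)
```

In practice the split is usually random and the test set larger; the fixed split here only keeps the example deterministic.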

Classification example

Classification: Application
❑ Ad Click Prediction
❖ Goal: predict if a user that visits a web page will click on a displayed ad. Use it to target users
with high click probability.
❖ Approach:
➢ Collect data for users over a period of time and record who clicks and who does not. The {click, no
click} information forms the class attribute.
➢ Use the history of the user (web pages browsed, queries issued) as the features.
➢ Learn a classifier model and test on new users.

Classification: Application 2
❑ Fraud Detection
❖ Goal: predict fraudulent cases in credit card transactions.
❖ Approach:
➢ Use credit card transactions and the information on its account-holder as attributes.
• When a customer buys, what they buy, how often they pay on time, etc.
➢ Label past transactions as fraud or fair transactions. This forms the class attribute.
➢ Learn a model for the class of the transactions.
➢ Use this model to detect fraud by observing credit card transactions on an account.

Link analysis Ranking
❑ Link analysis ranking is a technique that uses graph theory to rank web pages based on their link
structure
❖ Given a collection of web pages that are linked to each other, rank the pages according to
importance (authoritativeness) in the graph
➢ Intuition: a page gains authority if it is linked to by another page.
❖ Application: when retrieving pages, the authoritativeness is factored in the ranking.
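
One well-known way to turn this intuition into numbers is a PageRank-style iteration, sketched below. The lesson does not name a specific algorithm, so PageRank, the damping factor of 0.85, the iteration count, and the four-page toy graph are all assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively pass each page's score along its outgoing links.

    graph maps each page to the list of pages it links to.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical web graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked to by three pages and ranks highest, while page D, which no page links to, ranks lowest. Page A has only one incoming link, but it comes from the high-ranking page C, so A still ranks above B.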

Exploratory analysis
❑ Trying to understand the data as a physical phenomenon, and to describe it with
simple metrics
❖ What does the web graph look like?
❖ How often do people repeat the same query?
❖ Are friends on Facebook also friends on Twitter?

❑ The important thing is to find the right metrics and ask the right questions
❑ It helps our understanding of the world, and can lead to models of the phenomena
we observe.

Applications of Data Mining
❑ Healthcare and Insurance
❑ Education
❑ Entertainment
❑ Banking
❑ Marketing
❑ Manufacturing

Answer (20 mins)

❑ There are two main categories of data types:


❖ Numerical Data
➢ Continuous: Represents measurements that can take on any value within a specific range.
• Examples include age, height, temperature, and income.
➢ Discrete: Represents countable quantities that have distinct values.
• Examples include number of items, population, and shoe size.
❖ Categorical Data
➢ Nominal: Represents categories without an inherent order.
• Examples include color, gender, and product type.
➢ Ordinal: Represents categories with an inherent order.
• Examples include education level, satisfaction rating, and letter grades.

Questions

❑ What is data?
❑ What is a data attribute?
❑ What are the types of data and the types of attributes?

Thank you

