Week 1
Week 1
Week 1
2
• No Required Textbook
• References
• Data Mining: Concepts and Techniques, 2nd ed., Jiawei
Han and Micheline Kamber, Morgan Kaufmann, 2006.
• Jure Leskovec, Anand Rajaraman, and Jeffrey David
Ullman, Mining of Massive Datasets (2nd Edition),
Cambridge University Press, 2014.
• Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin
Kumar, Introduction to Data Mining (2nd Edition),
Pearson, 2018.
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data
Mining, MIT Press, 2001.
3
• Evaluation:
Assessment Weighting
Assignment 30%
Midterm 30%
Paper Critique 10%
Project 30%
Total 100%
4
Paper critique
• Select a paper from recent 2 year’s papers on
related conferences and journals
• Submit a written critique with
• Introduction: paper’s information & preview of your
analysis
• Summary: paper’s main point & findings
• Critique: Strength and weakness of the paper (2-3) &
your opinion (supported by evidence from the paper)
• Conclusion: key points of your analysis & significance of
the research
5
Conferences and Journals
6
Project
• Identify a research topic closely related to this
course
• Propose a novel approach – clarify how it is
different from existing approaches
• Experiments
• Datasets: Public Accessible (or upon approval)
• Experimental Results
• Validate your argument
• Prove the effectiveness of your approach
• Presentation (10 min)
• Submit a report
7
Report Format
• Template:
• https://www.ieee.org/conferences/publishing/template
s.html
• Paper Critique: 1-2 pages
• Project: 5 pages
Introduction Outline
• Why data analysis?
• What is data analysis?
• What kind of data?
• Data analysis functions and evaluation
• Applications of Data Analysis
• Major Challenges in Data Analysis
9
Why Data Analysis?
10
What is data analysis
• Data analysis is a process of
inspecting, cleansing, transforming and modeling
data with the goal of discovering useful
information, informing conclusions and supporting
decision-making.
Data Science
Data
Analysis
Data
Mining
11
Example: A Web Mining Framework
12
Example: Medical Data Mining
• Health care & medical data mining – often adopted
such a view in statistics and machine learning
• Preprocessing of the data (including feature
extraction and dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation
13
What Kinds of Data?
• Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
• Big Data
14
Data Analysis Function: Generalization
16
Data Analysis Function: Classification
18
Data Analysis Function: Cluster Analysis
19
Data Analysis Function: Outlier Analysis
• Outlier analysis
• Outlier: A data object that does not comply with the general behavior of
the data
• Noise or exception? ― One person’s garbage could be another person’s
treasure
• Methods: statistical tests, distance measures, density-based methods
• Useful in fraud detection, rare events analysis
20
Time and Ordering: Sequential Pattern, Trend and Evolution
Analysis
21
Structure and Network Analysis
• Graph mining
• Finding frequent subgraphs (e.g., chemical compounds), substructures
(web fragments)
• Information network analysis
• Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
• Multiple networks
• A person could be multiple information networks: friends, family,
classmates, …
• Links carry a lot of semantic information: Link mining
• Web mining
• Web is a big information network: from PageRank to Google
• Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
22
Evaluation of Knowledge
• Are all mined knowledge interesting?
• One can mine tremendous amount of “patterns” and knowledge
• Some may fit only certain dimension space (time, location, …)
• Some may not be representative, may be transient, …
24
Application Areas
• Industry • Application
Finance Credit Card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record analysis
Transport promotion analysis
Consumer goods Value added data
Data Service providers Power usage analysis
… …
25
Data Analysis works with
Warehouse Data
• Data Warehousing
provides the Enterprise
with a memory
26
Major Challenges in Data Analysis (1)
• Mining Methodology
• Mining various and new kinds of knowledge
• Data Analysis: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive analysis
• Incorporation of background knowledge
• Presentation and visualization of data analysis results
27
Major Challenges in Data Analysis (2)
28
Getting to Know Your Data
• Summary
29
Data Objects
30
Attributes
• Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
• E.g., customer _ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
31
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings
32
• Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚, calendar dates
• No true zero-point
• Ratio
• A ratio-scaled attribute is a numeric attribute with an inherent zero-
point.
• We can speak of values as being an order of magnitude larger than
the unit of measurement.
• e.g., $10 is twice as high as $5
33
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
34
Getting to Know Your Data
• Summary
35
Measuring the Central Tendency
1 n
• Mean (algebraic measure) (sample vs. population): x = å xi (1)
Note: n is sample size and N is population size. n i =1
n
Weighted arithmetic mean:
åw x
•
i i
• Trimmed mean: chopping extreme values
x= i =1
n
(2)
å wi
• Median: i =1
(3)
n / 2 - (å freq)l
Mode
median = L1 + ( ) width
•
freqmedian
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
37
Symmetric vs.
Skewed Data
• Median, mean and mode of symmetric
symmetric, positively and
negatively skewed data
38
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR
1 n 1 n 2 1 n 1 n 1 n
s = 2
å
n - 1 i =1
( xi - x ) =
2
[å xi - (å xi ) 2 ]
n - 1 i =1 n i =1
s = å ( xi - µ ) 2 =
2
N i =1 N
å xi - µ 2
i =1
2
39
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
40
Histogram Analysis
40
• Histogram: Graph display of tabulated
35
frequencies, shown as bars
30
• It shows what proportion of cases fall
into each of several intervals 25
20
15
10
5
0
10000 30000 50000 70000 90000
41
Histograms Often Tell More than Boxplots
42
Getting to Know Your Data
• Summary
43
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
44
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p é x11 ... x1f ... x1p ù
dimensions ê ú
ê ... ... ... ... ... ú
êx ... xif ... xip ú
ê i1 ú
ê ... ... ... ... ... ú
êx ... xnf ... xnp úú
• Dissimilarity matrix êë n1 û
• n data points, but
registers only the é 0 ù
distance ê d(2,1) 0 ú
ê ú
• A triangular matrix ê d(3,1) d ( 3,2) 0 ú
ê ú
ê : : : ú
êëd ( n,1) d ( n,2) ... ... 0úû
45
Proximity Measure for Nominal Attributes
d (i, j) = p -
p
m
46
Proximity Measure for Binary Attributes
A contingency table for binary data
Object j
Object i
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) £ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
49
Special Cases of Minkowski Distance
d (i, j) =| x - x | + | x - x | +...+ | x - x |
i1 j1 i2 j 2 ip jp
• h = 2: (L2 norm) Euclidean distance
d (i, j) = (| x - x |2 + | x - x |2 +...+ | x - x |2 )
i1 j1 i2 j 2 ip jp
• h ® ¥. “supremum” (Lmax norm, L¥ norm) distance.
• This is the maximum difference between any component
(attribute) of the vectors
50
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
x2 x4
L2 x1 x2 x3 x4
4 x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
2 x1
Supremum
L¥ x1 x2 x3 x4
x1 0
x2 3 0
x3 x3 2 5 0
0 2 4 x4 3 1 5 0
Proximity for Ordinal Variables
Example: {small, median, large} -> rank {1, 2, 3} ->numerical {0, 0.5, 1}
52
Attributes of Mixed Type
S pf = 1d ij( f ) dij( f )
d (i, j) =
S pf = 1d ij( f )
if xif = xjf =0 and f is asymmetric binary, otherwise
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and
• Treat zif as numeric zif =
r -1
if
M -1 f
53
Textual Data: Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(A, B) = (A• B) /||A|| ||B|| =
54
Getting to Know Your Data
• Summary
55
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Gain insight into the data by:
• Basic statistical data description: central tendency, dispersion, graphical
displays
• Data visualization: map data onto graphical primitives
• Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of research.
56
References
• Part of the slides are based on
• the slides provided by Jiawei Han, Micheline Kamber,
and Jian Pei. © 2012 Han, Kamber & Pei.
• the slides provide by Richong Zhang @
http://act.buaa.edu.cn
57