Data Mining I
Summer semester 2017
Recap from the previous lecture:
- KDD definition
- The KDD process
- The DM step
- Supervised (or predictive) vs. unsupervised (or descriptive) learning
- Main DM tasks
Clustering: partitioning the data into groups of similar objects
Classification: predict the class attribute from the input attributes; the class is categorical
Regression: predict the class attribute from the input attributes; the class is continuous
Association rule mining: find associations between attributes
Outlier detection: identify non-typical data
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Recap: The KDD process

Data → Preprocessing/Cleaning → Preprocessed data → Transformation → Transformed data → Data Mining → Patterns → Evaluation

- Preprocessing/Cleaning: integration of data from multiple sources, handling of missing values
- Transformation: select useful features, feature transformation/discretization, dimensionality reduction
- Data Mining: extract patterns from the transformed data
- Evaluation: evaluate patterns based on interestingness measures
Real-world data are noisy, incomplete and inconsistent: Know your data!
Noisy: errors/outliers
o erroneous values: e.g., salary = -10K
o unexpected values: e.g., salary = 100K when the rest of the dataset lies in [30K, 50K]
Incomplete: missing data
o missing values: e.g., occupation = ""
o missing attributes of interest: e.g., no information on occupation
Inconsistent: discrepancies in the data
o e.g., student grade ranges might differ between universities: in DE [1-5], in GR [0-10]
Dirty data → poor mining results
Data preprocessing is necessary for improving the quality of the mining results !
Not a focus of this class!
Data integration:
- Integration of multiple databases, data warehouses, or files (entity identification)
Data cleaning:
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Duplicate elimination
Data reduction:
- Aggregation, e.g., from 12 monthly salaries to the average salary per month (more on this next semester)
- Dimensionality reduction, e.g., through PCA (covered in the Data Mining II course)
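To make these steps concrete, here is a minimal pandas sketch of typical cleaning operations; the column names, toy values and fill strategies are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np
import pandas as pd

# Toy data exhibiting the problems named above (values are made up).
df = pd.DataFrame({
    "salary": [35000, -10000, 42000, np.nan, 100000],
    "occupation": ["engineer", "doctor", None, "teacher", "driver"],
})

# Erroneous values: a negative salary is impossible, treat it as missing.
df.loc[df["salary"] < 0, "salary"] = np.nan

# Missing values: impute the median salary, label unknown occupations.
df["salary"] = df["salary"].fillna(df["salary"].median())
df["occupation"] = df["occupation"].fillna("unknown")

# Duplicate elimination.
df = df.drop_duplicates()
print(df)
```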
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Categorical (nominal) variables: a set of M states with no meaningful ordering between the values.
Examples:
- Colors = {brown, green, blue, ..., gray}
- Occupation = {engineer, doctor, teacher, ..., driver}
Ordinal variables
- Similar to categorical variables, but the M states are ordered/ranked in a meaningful way: there is an ordering between the values.
- Allows order relationships to be applied, i.e., >, ≥, <, ≤.
- However, the difference and the ratio between these values have no meaning.

Example: movie ratings

Person   A Beautiful Mind   Titanic
Eirini   5*                 3*
Erich    5*                 1*
Kostas   3*                 3*
Jane     1*                 5*
Emily    1*                 5*
Markus   4*                 3*
Interval variables
- Examples: calendar dates, temperature in Fahrenheit or Celsius, ...
- Differences are meaningful, but ratios still have no meaning: a temperature of 2° Celsius is not "twice as warm" as a temperature of 1° Celsius. The 0° point of the Celsius scale is, in a physical sense, arbitrary, and therefore the ratio of two Celsius temperatures is not physically meaningful.
- No meaningful (unique and non-arbitrary) zero value.
Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf
Examples of features extracted from different data types:
- Images: color histograms, i.e., the distribution of colors in the image (x-axis: color, y-axis: frequency)
- Gene databases: gene expression levels
- Text databases: word frequencies, e.g.:

Term      Frequency
Data      25
Mining    15
Feature   12
Object    7
...
But the feature-based approach allows uniform treatment of instances from different applications.
Outline
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Weighted average:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

For symmetric data distributions, the mean and the median coincide.
Let x1, ..., xn be a random sample of an attribute X. The degree to which X values tend to spread is called dispersion or variance of X.

Variance σ²:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\right]$$

Standard deviation σ is the square root of the variance:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Source: http://en.wikipedia.org/wiki/Normal_distribution
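A minimal NumPy sketch of these descriptors (the sample values and weights are made up for illustration):

```python
import numpy as np

x = np.array([30.0, 36.0, 47.0, 50.0, 52.0])  # illustrative sample of attribute X
w = np.array([1.0, 2.0, 1.0, 1.0, 1.0])       # illustrative weights

mean = x.mean()
weighted_mean = np.sum(w * x) / np.sum(w)  # weighted average, as defined above
variance = np.mean((x - mean) ** 2)        # population variance (1/n convention)
std_dev = np.sqrt(variance)                # standard deviation

print(mean, weighted_mean, variance, std_dev)
```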
Univariate descriptors: Boxplots

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.

Boxplots: 5-number summary (min, Q1, median, Q3, max)
- Q1 (25th percentile), Q3 (75th percentile)
- The median is the 50th percentile
- Range: max value − min value

Sources:
http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
Univariate descriptors: Histograms

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.

Histograms summarize the distribution of X:
- X axis: attribute values; Y axis: frequencies
- Absolute frequency h(a): for each value a, the number of occurrences of a in the sample
- Relative frequency: f(a) = h(a)/n
- Different types of histograms, e.g.:
o Equal width: divides the range into N intervals of equal size
o Equal depth: divides the range into N intervals, each containing approximately the same number of samples

Source: http://www.dbs.ifi.lmu.de/Lehre/KDD/SS16/skript/2_DataRepresentation.pdf
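A small sketch contrasting the two binning strategies with NumPy (the data and the bin count are illustrative):

```python
import numpy as np

x = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55], dtype=float)  # illustrative data
N = 5  # number of bins

# Equal width: N intervals of equal size over [min, max].
width_edges = np.linspace(x.min(), x.max(), N + 1)
width_counts, _ = np.histogram(x, bins=width_edges)

# Equal depth: boundaries at quantiles, so each bin holds ~n/N samples.
depth_edges = np.quantile(x, np.linspace(0, 1, N + 1))
depth_counts, _ = np.histogram(x, bins=depth_edges)

print(width_counts)  # equal-width bins: counts vary strongly
print(depth_counts)  # equal-depth bins: counts are roughly equal
```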
Bivariate descriptors: for numerical features

Given two attributes X, Y, one can measure how strongly they are correlated:
- For numerical data: the correlation coefficient
- For categorical data: χ² (chi-square)

The correlation coefficient (also called Pearson's correlation coefficient) measures the linear association between X and Y:

$$r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n \,\sigma_X \sigma_Y}$$

where xi, yi are the values of the i-th tuple for X, Y.
- Value range: −1 ≤ rXY ≤ 1
- The higher |rXY|, the stronger the correlation
- rXY > 0: positive correlation; rXY ≈ 0: no linear correlation; rXY < 0: negative correlation

[Figure: scatter plots illustrating rXY > 0, rXY ≈ 0, and rXY < 0]
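A minimal sketch of this coefficient in NumPy (the two samples are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative attribute X
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # illustrative attribute Y

# Pearson correlation: covariance divided by the product of std deviations.
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) * x.std() * y.std())

print(r_xy)                     # close to +1: strong positive linear association
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in agrees
```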
Bivariate descriptors: for categorical features

Contingency table
- For categorical/nominal features X = {x1, ..., xc}, Y = {y1, ..., yr}
- Represents the absolute frequency hij of each combination of values (xi, yj) and the marginal frequencies hi., h.j of X, Y.

Attribute X \ Attribute Y   Medium-term unemployment   Long-term unemployment   Total
No education                19                         18                       37
Teaching                    43                         20                       63
Total                       62                         38                       100

Chi-square (χ²) test:

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{h_{i\cdot}\, h_{\cdot j}}{n}$$

where oij is the observed and eij the expected frequency.
Chi-square example (the numbers in parentheses are the expected counts, calculated from the marginal distributions of the two categories):

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
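The same computation as a short NumPy sketch, making the expected counts explicit:

```python
import numpy as np

# Observed contingency table: rows = like science fiction (yes/no),
# columns = play chess (yes/no).
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_sums = observed.sum(axis=1, keepdims=True)  # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)  # 300, 1200
n = observed.sum()                              # 1500

expected = row_sums * col_sums / n              # e_ij = h_i. * h_.j / n
chi2 = np.sum((observed - expected) ** 2 / expected)

print(expected)  # [[ 90. 360.] [210. 840.]]
print(chi2)      # ~507.93
```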
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Feature space
- Intuitively: a domain with a distance function.
- Formally: a feature space F = (Dom, dist), where
o Dom is a set of attributes/features
o dist is a numerical measure of the degree to which two compared objects differ: dist: Dom × Dom → ℝ₀⁺
- For all x, y in Dom, x ≠ y, dist is required to satisfy:
o dist(x, y) > 0 (non-negativity)
o dist(x, x) = 0 (reflexivity)

Metric space
- Formally: a metric space M = (Dom, dist) is a feature space whose distance function additionally satisfies:
o ∀x, y ∈ Dom: dist(x, y) = 0 ⟺ x = y (strictness)
o ∀x, y ∈ Dom: dist(x, y) = dist(y, x) (symmetry)
o ∀x, y, z ∈ Dom: dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality)
- Measures that satisfy all the above properties are called metrics.

[Figure: triangle over points x, y, z illustrating the triangle inequality]
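A quick sketch that spot-checks these properties for a candidate distance function on a finite set of points (the helper names are mine, not from the lecture):

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_metric_on(points, dist):
    """Check the metric properties on all pairs/triples of a point sample."""
    for x, y in itertools.product(points, repeat=2):
        if dist(x, y) < 0:                  # non-negativity
            return False
        if (dist(x, y) == 0) != (x == y):   # reflexivity and strictness
            return False
        if dist(x, y) != dist(y, x):        # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + 1e-12:  # triangle inequality
            return False
    return True

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
print(is_metric_on(pts, euclidean))  # True
```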
Famous example: the Euclidean vector space E = (Dom, dist) with Dom = ℝ^d and

$$dist(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$$

(Dom, dist) is a metric space.

Point coordinates:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Figure: the four points plotted in the plane]
Example: distance matrices for the four points above.

L1 distance matrix:
L1   p1   p2   p3   p4
p1   0    4    4    6
p2   4    0    2    4
p3   4    2    0    2
p4   6    4    2    0

L2 distance matrix:
L2   p1      p2      p3      p4
p1   0       2.828   3.162   5.099
p2   2.828   0       1.414   3.162
p3   3.162   1.414   0       2
p4   5.099   3.162   2       0

L∞ distance matrix:
L∞   p1   p2   p3   p4
p1   0    2    3    5
p2   2    0    1    3
p3   3    1    0    2
p4   5    3    2    0
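These matrices can be reproduced with SciPy's pairwise-distance helper (a sketch using the standard scipy.spatial.distance.cdist function):

```python
import numpy as np
from scipy.spatial.distance import cdist

# The four example points.
points = np.array([[0, 2],
                   [2, 0],
                   [3, 1],
                   [5, 1]], dtype=float)

l1 = cdist(points, points, metric="cityblock")    # L1 (Manhattan)
l2 = cdist(points, points, metric="euclidean")    # L2 (Euclidean)
linf = cdist(points, points, metric="chebyshev")  # L-infinity (maximum norm)

print(l1)
print(np.round(l2, 3))
print(linf)
```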
Min-max normalization maps v from [minA, maxA] to a new range [new_minA, new_maxA]:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

e.g., age = 30 with minA = 10 and maxA = 100, mapped to [0, 1]:
new_age = ((30 − 10)/(100 − 10)) · (1 − 0) + 0 = 2/9

Z-score normalization:

$$v' = \frac{v - mean_A}{stand\_dev_A}$$
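Both normalizations as a quick sketch (the attribute values are illustrative):

```python
import numpy as np

age = np.array([10.0, 30.0, 55.0, 100.0])  # illustrative attribute A

# Min-max normalization to [0, 1]; for age = 30 this yields 2/9.
minmax = (age - age.min()) / (age.max() - age.min())

# Z-score normalization: zero mean, unit standard deviation.
zscore = (age - age.mean()) / age.std()

print(minmax)  # [0. 0.2222... 0.5 1.]
print(zscore)
```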
Jaccard coefficient: a similarity measure for binary attributes. With q = the number of attributes where both objects are 1, r = attributes where only the first object is 1, and s = attributes where only the second object is 1:

$$J = \frac{q}{q + r + s}$$

Example:
Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

e.g., J(Jack, Mary) = 2/(2 + 0 + 1) = 2/3
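A direct sketch of this computation:

```python
import numpy as np

jack = np.array([1, 0, 1, 0, 0, 0])
mary = np.array([1, 0, 1, 0, 1, 0])
jim = np.array([1, 1, 0, 0, 0, 0])

def jaccard(a, b):
    """Jaccard coefficient for binary vectors: q / (q + r + s)."""
    q = np.sum((a == 1) & (b == 1))  # both 1
    r = np.sum((a == 1) & (b == 0))  # only a is 1
    s = np.sum((a == 0) & (b == 1))  # only b is 1
    return q / (q + r + s)

print(jaccard(jack, mary))  # 2/3
print(jaccard(jack, jim))   # 1/3
```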
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Transformation of a document d into a vector r(d) = (h1, ..., hn), hi ≥ 0: the frequency of term ti in d.

Ranking procedure for selecting the terms, using the document frequency

$$DF(t_i) = \frac{\#\text{documents containing } t_i}{\#\text{documents}}$$

1. Compute the document frequency DF(ti) for all terms ti
2. Sort the terms w.r.t. DF(ti) and obtain rank(ti)
3. Sort the terms by score(ti) = DF(ti) · rank(ti),
   e.g., score(t23) = 0.82 · 1 = 0.82, score(t17) = 0.65 · 2 = 1.3
4. Select the k terms with the largest score(ti)

Rank   Term   DF
1.     t23    0.82
2.     t17    0.65
3.     t14    0.52
4.     ...
$$TF(t, d) = \frac{n(t, d)}{\sum_{w \in d} n(w, d)}$$

— the (relative) frequency of term t in document d

$$IDF(t) = \log\left(\frac{|DB|}{|\{d \in DB \mid t \in d\}|}\right)$$

— the inverse frequency of term t in the whole database DB

$$TF\text{-}IDF(t, d) = TF(t, d) \cdot IDF(t)$$

Feature vector with TF-IDF: r(d) = (TF(t1, d)·IDF(t1), ..., TF(tn, d)·IDF(tn))
Jaccard distance between two documents d1, d2, viewed as sets of terms:

$$dist_{Jaccard}(d_1, d_2) = 1 - \frac{|\{t \mid t \in d_1 \wedge t \in d_2\}|}{|\{t \mid t \in d_1 \vee t \in d_2\}|}$$

Cosine distance between two document vectors d1, d2:

$$dist_{cosine}(d_1, d_2) = 1 - \frac{\langle d_1, d_2 \rangle}{\|d_1\| \cdot \|d_2\|} = 1 - \frac{\sum_{i} d_{1,i}\, d_{2,i}}{\sqrt{\sum_{i} d_{1,i}^2} \cdot \sqrt{\sum_{i} d_{2,i}^2}}$$
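A compact sketch tying the pieces together: TF-IDF vectors for a toy corpus, then the cosine distance between documents (the corpus and vocabulary are made up for illustration):

```python
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining is mining of text data",
    "cooking recipes with garlic",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def tf_idf(doc):
    counts = Counter(doc)
    total = sum(counts.values())
    vec = []
    for t in vocab:
        tf = counts[t] / total                              # TF(t, d)
        df = sum(1 for d in tokenized if t in d)            # documents containing t
        idf = math.log(len(tokenized) / df) if df else 0.0  # IDF(t)
        vec.append(tf * idf)
    return vec

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

v0, v1, v2 = (tf_idf(d) for d in tokenized)
print(cosine_distance(v0, v1))  # smaller: both documents are about data/mining
print(cosine_distance(v0, v2))  # 1.0: no shared discriminative terms
```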
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
A variety of techniques
Parallel Coordinates
Quantile Plot
Scatter Plot Matrix
Loess Curve
Spiderweb model
Chernoff faces
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture