Data Mining I
Summer semester 2017
Recap from the previous lecture:
- KDD definition
- The KDD process
- The DM step
- Supervised (or predictive) vs. unsupervised (or descriptive) learning
- Main DM tasks
Clustering: partitioning the data into groups of similar objects
Classification: predict the class attribute from the input attributes; the class is categorical
Regression: predict the class attribute from the input attributes; the class is continuous
Association rule mining: find associations between attributes
Outlier detection: identify non-typical data
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Recap: The KDD process

Data → Preprocessing/Cleaning → Preprocessed data → Transformation → Transformed data → Data Mining → Patterns → Evaluation

- Preprocessing/Cleaning: integration of data from multiple sources, handling of missing values
- Transformation: select useful features, feature transformation/discretization, dimensionality reduction
- Data Mining: extract patterns from the transformed data
- Evaluation: evaluate patterns based on interestingness measures
Real-world data are noisy, incomplete and inconsistent: Know your data!
Noisy: errors/outliers
o erroneous values: e.g., salary = -10K
o unexpected values: e.g., salary = 100K when the rest of the dataset lies in [30K, 50K]
Incomplete: missing data
o missing values: e.g., occupation = ""
o missing attributes of interest: e.g., no information on occupation
Inconsistent: discrepancies in the data
o e.g., student grade ranges might differ between universities: in DE [1-5], in GR [0-10]
Dirty data → poor mining results
Data preprocessing is necessary for improving the quality of the mining results !
Not a focus of this class!
Data integration:
- Integration of multiple databases, data warehouses, or files (entity identification)
Data cleaning:
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Duplicate elimination
Data reduction:
- Aggregation, e.g., from 12 monthly salaries to the average salary per month (more on this next semester)
- Dimensionality reduction, e.g., through PCA (covered in the Data Mining II course)
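To make these steps concrete, here is a minimal pandas sketch of typical cleaning operations; the column names, toy values and fill strategies are illustrative assumptions, not prescribed by the lecture.

```python
import numpy as np
import pandas as pd

# Toy data exhibiting the problems named above (values are made up).
df = pd.DataFrame({
    "salary": [35000, -10000, 42000, np.nan, 100000],
    "occupation": ["engineer", "doctor", None, "teacher", "driver"],
})

# Erroneous values: a negative salary is impossible, treat it as missing.
df.loc[df["salary"] < 0, "salary"] = np.nan

# Missing values: impute the median salary, label unknown occupations.
df["salary"] = df["salary"].fillna(df["salary"].median())
df["occupation"] = df["occupation"].fillna("unknown")

# Duplicate elimination.
df = df.drop_duplicates()
print(df)
```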
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Categorical (nominal) variables: a set of M states with no meaningful ordering between the values.
Examples:
- Colors = {brown, green, blue, ..., gray}
- Occupation = {engineer, doctor, teacher, ..., driver}
Ordinal variables
- Similar to categorical variables, but the M states are ordered/ranked in a meaningful way: there is an ordering between the values.
- Allows order relationships to be applied, i.e., >, ≥, <, ≤.
- However, the difference and the ratio between these values have no meaning.

Example: movie ratings

Person   A Beautiful Mind   Titanic
Eirini   5*                 3*
Erich    5*                 1*
Kostas   3*                 3*
Jane     1*                 5*
Emily    1*                 5*
Markus   4*                 3*
Interval variables
- Examples: calendar dates, temperature in Fahrenheit or Celsius, ...
- Differences are meaningful, but ratios still have no meaning: a temperature of 2° Celsius is not "twice as warm" as a temperature of 1° Celsius. The 0° point of the Celsius scale is, in a physical sense, arbitrary, and therefore the ratio of two Celsius temperatures is not physically meaningful.
- No meaningful (unique and non-arbitrary) zero value.
Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf
Examples of features extracted from different data types:
- Images: color histograms, i.e., the distribution of colors in the image (x-axis: color, y-axis: frequency)
- Gene databases: gene expression levels
- Text databases: word frequencies, e.g.:

Term      Frequency
Data      25
Mining    15
Feature   12
Object    7
...
But the feature-based approach allows uniform treatment of instances from different applications.
Outline
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Weighted average:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

For symmetric data distributions, the mean and the median coincide.
Let x1, ..., xn be a random sample of an attribute X. The degree to which X values tend to spread is called dispersion or variance of X.

Variance σ²:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\right]$$

Standard deviation σ is the square root of the variance:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Source: http://en.wikipedia.org/wiki/Normal_distribution
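A minimal NumPy sketch of these descriptors (the sample values and weights are made up for illustration):

```python
import numpy as np

x = np.array([30.0, 36.0, 47.0, 50.0, 52.0])  # illustrative sample of attribute X
w = np.array([1.0, 2.0, 1.0, 1.0, 1.0])       # illustrative weights

mean = x.mean()
weighted_mean = np.sum(w * x) / np.sum(w)  # weighted average, as defined above
variance = np.mean((x - mean) ** 2)        # population variance (1/n convention)
std_dev = np.sqrt(variance)                # standard deviation

print(mean, weighted_mean, variance, std_dev)
```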
Univariate descriptors: Boxplots

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.

Boxplots: 5-number summary (min, Q1, median, Q3, max)
- Q1 (25th percentile), Q3 (75th percentile)
- The median is the 50th percentile
- Range: max value − min value

Sources:
http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
Univariate descriptors: Histograms

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.

Histograms summarize the distribution of X:
- X axis: attribute values; Y axis: frequencies
- Absolute frequency h(a): for each value a, the number of occurrences of a in the sample
- Relative frequency: f(a) = h(a)/n
- Different types of histograms, e.g.:
o Equal width: divides the range into N intervals of equal size
o Equal depth: divides the range into N intervals, each containing approximately the same number of samples

Source: http://www.dbs.ifi.lmu.de/Lehre/KDD/SS16/skript/2_DataRepresentation.pdf
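A small sketch contrasting the two binning strategies with NumPy (the data and the bin count are illustrative):

```python
import numpy as np

x = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55], dtype=float)  # illustrative data
N = 5  # number of bins

# Equal width: N intervals of equal size over [min, max].
width_edges = np.linspace(x.min(), x.max(), N + 1)
width_counts, _ = np.histogram(x, bins=width_edges)

# Equal depth: boundaries at quantiles, so each bin holds ~n/N samples.
depth_edges = np.quantile(x, np.linspace(0, 1, N + 1))
depth_counts, _ = np.histogram(x, bins=depth_edges)

print(width_counts)  # equal-width bins: counts vary strongly
print(depth_counts)  # equal-depth bins: counts are roughly equal
```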
Bivariate descriptors: for numerical features

Given two attributes X, Y, one can measure how strongly they are correlated:
- For numerical data: the correlation coefficient
- For categorical data: χ² (chi-square)

The correlation coefficient (also called Pearson's correlation coefficient) measures the linear association between X and Y:

$$r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n \,\sigma_X \sigma_Y}$$

where xi, yi are the values of the i-th tuple for X, Y.
- Value range: −1 ≤ rXY ≤ 1
- The higher |rXY|, the stronger the correlation
- rXY > 0: positive correlation; rXY ≈ 0: no linear correlation; rXY < 0: negative correlation

[Figure: scatter plots illustrating rXY > 0, rXY ≈ 0, and rXY < 0]
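A minimal sketch of this coefficient in NumPy (the two samples are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative attribute X
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # illustrative attribute Y

# Pearson correlation: covariance divided by the product of std deviations.
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) * x.std() * y.std())

print(r_xy)                     # close to +1: strong positive linear association
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in agrees
```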
Bivariate descriptors: for categorical features

Contingency table
- For categorical/nominal features X = {x1, ..., xc}, Y = {y1, ..., yr}
- Represents the absolute frequency hij of each combination of values (xi, yj) and the marginal frequencies hi., h.j of X, Y.

Attribute X \ Attribute Y   Medium-term unemployment   Long-term unemployment   Total
No education                19                         18                       37
Teaching                    43                         20                       63
Total                       62                         38                       100

Chi-square (χ²) test:

$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{h_{i\cdot}\, h_{\cdot j}}{n}$$

where oij is the observed and eij the expected frequency.
Chi-square example (the numbers in parentheses are the expected counts, calculated from the marginal distributions of the two categories):

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
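The same computation as a short NumPy sketch, making the expected counts explicit:

```python
import numpy as np

# Observed contingency table: rows = like science fiction (yes/no),
# columns = play chess (yes/no).
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_sums = observed.sum(axis=1, keepdims=True)  # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)  # 300, 1200
n = observed.sum()                              # 1500

expected = row_sums * col_sums / n              # e_ij = h_i. * h_.j / n
chi2 = np.sum((observed - expected) ** 2 / expected)

print(expected)  # [[ 90. 360.] [210. 840.]]
print(chi2)      # ~507.93
```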
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Feature space
- Intuitively: a domain with a distance function.
- Formally: a feature space F = (Dom, dist), where
o Dom is a set of attributes/features
o dist is a numerical measure of the degree to which two compared objects differ: dist: Dom × Dom → ℝ₀⁺
- For all x, y in Dom, x ≠ y, dist is required to satisfy:
o dist(x, y) > 0 (non-negativity)
o dist(x, x) = 0 (reflexivity)

Metric space
- Formally: a metric space M = (Dom, dist) is a feature space whose distance function additionally satisfies:
o ∀x, y ∈ Dom: dist(x, y) = 0 ⟺ x = y (strictness)
o ∀x, y ∈ Dom: dist(x, y) = dist(y, x) (symmetry)
o ∀x, y, z ∈ Dom: dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality)
- Measures that satisfy all the above properties are called metrics.

[Figure: triangle over points x, y, z illustrating the triangle inequality]
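A quick sketch that spot-checks these properties for a candidate distance function on a finite set of points (the helper names are mine, not from the lecture):

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_metric_on(points, dist):
    """Check the metric properties on all pairs/triples of a point sample."""
    for x, y in itertools.product(points, repeat=2):
        if dist(x, y) < 0:                  # non-negativity
            return False
        if (dist(x, y) == 0) != (x == y):   # reflexivity and strictness
            return False
        if dist(x, y) != dist(y, x):        # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + 1e-12:  # triangle inequality
            return False
    return True

pts = [(0, 2), (2, 0), (3, 1), (5, 1)]
print(is_metric_on(pts, euclidean))  # True
```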
Famous example: the Euclidean vector space E = (Dom, dist) with Dom = ℝ^d and

$$dist(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$$

(Dom, dist) is a metric space.

Point coordinates:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Figure: the four points plotted in the plane]
Example: distance matrices for the four points above.

L1 distance matrix:
L1   p1   p2   p3   p4
p1   0    4    4    6
p2   4    0    2    4
p3   4    2    0    2
p4   6    4    2    0

L2 distance matrix:
L2   p1      p2      p3      p4
p1   0       2.828   3.162   5.099
p2   2.828   0       1.414   3.162
p3   3.162   1.414   0       2
p4   5.099   3.162   2       0

L∞ distance matrix:
L∞   p1   p2   p3   p4
p1   0    2    3    5
p2   2    0    1    3
p3   3    1    0    2
p4   5    3    2    0
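These matrices can be reproduced with SciPy's pairwise-distance helper (a sketch using the standard scipy.spatial.distance.cdist function):

```python
import numpy as np
from scipy.spatial.distance import cdist

# The four example points.
points = np.array([[0, 2],
                   [2, 0],
                   [3, 1],
                   [5, 1]], dtype=float)

l1 = cdist(points, points, metric="cityblock")    # L1 (Manhattan)
l2 = cdist(points, points, metric="euclidean")    # L2 (Euclidean)
linf = cdist(points, points, metric="chebyshev")  # L-infinity (maximum norm)

print(l1)
print(np.round(l2, 3))
print(linf)
```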
Min-max normalization maps v from [minA, maxA] to a new range [new_minA, new_maxA]:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

e.g., age = 30 with minA = 10 and maxA = 100, mapped to [0, 1]:
new_age = ((30 − 10)/(100 − 10)) · (1 − 0) + 0 = 2/9

Z-score normalization:

$$v' = \frac{v - mean_A}{stand\_dev_A}$$
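Both normalizations as a quick sketch (the attribute values are illustrative):

```python
import numpy as np

age = np.array([10.0, 30.0, 55.0, 100.0])  # illustrative attribute A

# Min-max normalization to [0, 1]; for age = 30 this yields 2/9.
minmax = (age - age.min()) / (age.max() - age.min())

# Z-score normalization: zero mean, unit standard deviation.
zscore = (age - age.mean()) / age.std()

print(minmax)  # [0. 0.2222... 0.5 1.]
print(zscore)
```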
Jaccard coefficient: a similarity measure for binary attributes. With q = the number of attributes where both objects are 1, r = attributes where only the first object is 1, and s = attributes where only the second object is 1:

$$J = \frac{q}{q + r + s}$$

Example:
Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

e.g., J(Jack, Mary) = 2/(2 + 0 + 1) = 2/3
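A direct sketch of this computation:

```python
import numpy as np

jack = np.array([1, 0, 1, 0, 0, 0])
mary = np.array([1, 0, 1, 0, 1, 0])
jim = np.array([1, 1, 0, 0, 0, 0])

def jaccard(a, b):
    """Jaccard coefficient for binary vectors: q / (q + r + s)."""
    q = np.sum((a == 1) & (b == 1))  # both 1
    r = np.sum((a == 1) & (b == 0))  # only a is 1
    s = np.sum((a == 0) & (b == 1))  # only b is 1
    return q / (q + r + s)

print(jaccard(jack, mary))  # 2/3
print(jaccard(jack, jim))   # 1/3
```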
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
Transformation of a document d into a vector r(d) = (h1, ..., hn), hi ≥ 0: the frequency of term ti in d.

Ranking procedure for selecting the terms, using the document frequency

$$DF(t_i) = \frac{\#\text{documents containing } t_i}{\#\text{documents}}$$

1. Compute the document frequency DF(ti) for all terms ti
2. Sort the terms w.r.t. DF(ti) and obtain rank(ti)
3. Sort the terms by score(ti) = DF(ti) · rank(ti),
   e.g., score(t23) = 0.82 · 1 = 0.82, score(t17) = 0.65 · 2 = 1.3
4. Select the k terms with the largest score(ti)

Rank   Term   DF
1.     t23    0.82
2.     t17    0.65
3.     t14    0.52
4.     ...
$$TF(t, d) = \frac{n(t, d)}{\sum_{w \in d} n(w, d)}$$

— the (relative) frequency of term t in document d

$$IDF(t) = \log\left(\frac{|DB|}{|\{d \in DB \mid t \in d\}|}\right)$$

— the inverse frequency of term t in the whole database DB

$$TF\text{-}IDF(t, d) = TF(t, d) \cdot IDF(t)$$

Feature vector with TF-IDF: r(d) = (TF(t1, d)·IDF(t1), ..., TF(tn, d)·IDF(tn))
Jaccard distance between two documents d1, d2, viewed as sets of terms:

$$dist_{Jaccard}(d_1, d_2) = 1 - \frac{|\{t \mid t \in d_1 \wedge t \in d_2\}|}{|\{t \mid t \in d_1 \vee t \in d_2\}|}$$

Cosine distance between two document vectors d1, d2:

$$dist_{cosine}(d_1, d_2) = 1 - \frac{\langle d_1, d_2 \rangle}{\|d_1\| \cdot \|d_2\|} = 1 - \frac{\sum_{i} d_{1,i}\, d_{2,i}}{\sqrt{\sum_{i} d_{1,i}^2} \cdot \sqrt{\sum_{i} d_{2,i}^2}}$$
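A compact sketch tying the pieces together: TF-IDF vectors for a toy corpus, then the cosine distance between documents (the corpus and vocabulary are made up for illustration):

```python
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining is mining of text data",
    "cooking recipes with garlic",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def tf_idf(doc):
    counts = Counter(doc)
    total = sum(counts.values())
    vec = []
    for t in vocab:
        tf = counts[t] / total                              # TF(t, d)
        df = sum(1 for d in tokenized if t in d)            # documents containing t
        idf = math.log(len(tokenized) / df) if df else 0.0  # IDF(t)
        vec.append(tf * idf)
    return vec

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

v0, v1, v2 = (tf_idf(d) for d in tokenized)
print(cosine_distance(v0, v1))  # smaller: both documents are about data/mining
print(cosine_distance(v0, v2))  # 1.0: no shared discriminative terms
```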
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture
A variety of techniques
Parallel Coordinates
Quantile Plot
Scatter Plot Matrix
Loess Curve
Spiderweb model
Chernoff faces
Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture