Data Mining I: Summer Semester 2017


Fakultät für Elektrotechnik und Informatik

Institut für Verteilte Systeme


Fachgebiet Wissensbasierte Systeme (KBS)

Data Mining I
Summer semester 2017

Lecture 2: Data preprocessing and feature spaces


Lectures: Prof. Dr. Eirini Ntoutsi
Exercises: Tai Le Quy and Damianos Melidis
Recap from previous lecture

KDD definition
KDD process
DM step
Supervised (or predictive) vs Unsupervised (or descriptive) learning
Main DM tasks
Clustering: partitioning in groups of similar objects
Classification: predict class attribute from input attributes, class is categorical
Regression: predict class attribute from input attributes, class is continuous
Association rules mining: find associations between attributes
Outlier detection: identify non-typical data

How does data mining differ from database querying?

Data Mining I: Data preprocessing and feature spaces 2


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 3


Recap: The KDD process [Fayyad, Piatetsky-Shapiro & Smyth, 1996]

Selection: select a relevant dataset or focus on a subset of a dataset (file/DB → target data)
Preprocessing/Cleaning: integration of data from different data sources, noise removal, missing values (→ preprocessed data)
Transformation: select useful features, feature transformation/discretization, dimensionality reduction (→ transformed data)
Data Mining: search for patterns of interest (→ patterns/models)
Evaluation: evaluate patterns based on interestingness measures, statistical validation of the models, visualization, descriptive statistics (→ knowledge)

Data Mining I: Data preprocessing and feature spaces 4
Why data preprocessing?

Real world data are noisy, incomplete and inconsistent: Know your data!
Noisy: errors/ outliers
o erroneous values: e.g., salary = -10K
o unexpected values: e.g., salary = 100K when the rest of the dataset lies in [30K-50K]
Incomplete: missing data
o missing values: e.g., occupation = "" (empty)
o missing attributes of interest: e.g., no information on occupation
Inconsistent: discrepancies in the data
o e.g., student grade ranges differ between universities: in DE [1-5], in GR [0-10]

Data Mining I: Data preprocessing and feature spaces 5


Why data preprocessing?

Real world data are noisy, incomplete and inconsistent: Know your data!
Noisy: errors/ outliers
o erroneous values: e.g., salary = -10K
o unexpected values: e.g., salary = 100K when the rest of the dataset lies in [30K-50K]
Incomplete: missing data
o missing values: e.g., occupation = "" (empty)
o missing attributes of interest: e.g., no information on occupation
Inconsistent: discrepancies in the data
o e.g., student grade ranges differ between universities: in DE [1-5], in GR [0-10]
Dirty data → poor mining results
Data preprocessing is necessary for improving the quality of the mining results !
Not a focus of this class!

Data Mining I: Data preprocessing and feature spaces 6


Major tasks in data preprocessing

Data integration:
Integration of multiple databases, data warehouses, or files (Entity identification)
Data cleaning:
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data transformation:

Normalization in a given range, e.g., [0-1]
Generalization through some concept hierarchy, e.g., "milk 1.5% brand x" → "milk 1.5%" → "milk"
Data reduction:

Duplicate elimination
Aggregation, e.g., from 12 monthly salaries to average salary per month. More on this next semester
Dimensionality reduction, e.g., through PCA (Data Mining II course)

Data Mining I: Data preprocessing and feature spaces 7


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 8


Datasets = instances + features

Datasets consist of instances (also known as examples or objects)


e.g., in a university database: students, professors, courses, grades,
e.g., in a library database: books, users, loans, publishers, .
e.g., in a movie database: movies, actors, director,
e.g., in a supermarket: customers, transactions, products
Instances are described through features (also known as attributes or variables)
E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.

Data Mining I: Data preprocessing and feature spaces 9


Datasets = instances + features

Datasets consist of instances (also known as examples or objects)


e.g., in a university database: students, professors, courses, grades,
e.g., in a library database: books, users, loans, publishers, .
e.g., in a movie database: movies, actors, director,
e.g., in a supermarket: customers, transactions, products
Instances are described through features (also known as attributes or variables)
E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.
Easy visualization: if our data are in a database table, the rows are the instances and the columns are
the features.

Data Mining I: Data preprocessing and feature spaces 10


Basic feature types

Binary/ Dichotomous variables


Categorical (qualitative)
Nominal variables
Ordinal variables
Numeric variables (quantitative)
Interval-scale variables
Ratio-scaled variables

Data Mining I: Data preprocessing and feature spaces 11


Binary/ Dichotomous variables

The attribute can take two values, {0,1} or {true, false}

usually, 0 means absence, 1 means presence
e.g., smoker variable: 1 smoker, 0 non-smoker
e.g., true (1), false (0)
Symmetric binary: both outcomes equally important
e.g., gender (male, female)
Asymmetric binary: outcomes not equally important
e.g., medical tests (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)

Example:
Person   isSmoker
Eirini   0
Erich    1
Kostas   0
Jane     0
Emily    1
Markus   0

Data Mining I: Data preprocessing and feature spaces 12


Binary/ Dichotomous variables

The attribute can take two values, {0,1} or {true, false}

usually, 0 means absence, 1 means presence
e.g., smoker variable: 1 smoker, 0 non-smoker
e.g., true (1), false (0)
Symmetric binary: both outcomes equally important
e.g., gender (male, female)
Asymmetric binary: outcomes not equally important
e.g., medical tests (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)

Example:
Person   isSmoker
Eirini   0
Erich    1
Kostas   0
Jane     0
Emily    1
Markus   0

Give me some examples of binary variables!

Data Mining I: Data preprocessing and feature spaces 13


Categorical: Nominal variables

The attribute can take values within a set of M categories/states.

No ordering in the categories/states.
Only distinctness relationships apply, i.e., equal (=) and different (≠)

Examples:
Colors = {brown, green, blue, ..., gray}
Occupation = {engineer, doctor, teacher, ..., driver}

Example:
Person   Occupation
Eirini   astronaut
Erich    engineer
Kostas   doctor
Jane     engineer
Emily    teacher
Markus   driver

Data Mining I: Data preprocessing and feature spaces 14


Categorical: Nominal variables

The attribute can take values within a set of M categories/states.

No ordering in the categories/states.
Only distinctness relationships apply, i.e., equal (=) and different (≠)

Examples:
Colors = {brown, green, blue, ..., gray}
Occupation = {engineer, doctor, teacher, ..., driver}

Example:
Person   Occupation
Eirini   astronaut
Erich    engineer
Kostas   doctor
Jane     engineer
Emily    teacher
Markus   driver

Give me some examples of nominal variables!

Data Mining I: Data preprocessing and feature spaces 15


Categorical: Ordinal variables

Similar to nominal variables, but the M states are ordered/ranked in a meaningful way.
There is an ordering between the values.
Allows to apply order relationships, i.e., >, ≥, <, ≤
However, the difference and ratio between these values has no meaning.

Examples:
School grades: {A, B, C, D, F}
Movie ratings: {hate, dislike, indifferent, like, love}
Also, movie ratings: {*, **, ***, ****, *****}
Also, movie ratings: {1, 2, 3, 4, 5}
Medals = {bronze, silver, gold}

Example:
Person   A Beautiful Mind   Titanic
Eirini   5*                 3*
Erich    5*                 1*
Kostas   3*                 3*
Jane     1*                 5*
Emily    1*                 5*
Markus   4*                 3*

Data Mining I: Data preprocessing and feature spaces 16


Categorical: Ordinal variables

Similar to nominal variables, but the M states are ordered/ranked in a meaningful way.
There is an ordering between the values.
Allows to apply order relationships, i.e., >, ≥, <, ≤
However, the difference and ratio between these values has no meaning.

Examples:
School grades: {A, B, C, D, F}
Movie ratings: {hate, dislike, indifferent, like, love}
Also, movie ratings: {*, **, ***, ****, *****}
Also, movie ratings: {1, 2, 3, 4, 5}
Medals = {bronze, silver, gold}

Example:
Person   A Beautiful Mind   Titanic
Eirini   5*                 3*
Erich    5*                 1*
Kostas   3*                 3*
Jane     1*                 5*
Emily    1*                 5*
Markus   4*                 3*

Give me some examples of ordinal variables!

Data Mining I: Data preprocessing and feature spaces 17


Numeric: Interval-scale variables

Differences between values are meaningful

The difference between 90° and 100° temperature is the same as the difference between 40° and 50° temperature.

Examples:
Calendar dates, temperature in Fahrenheit or Celsius, ...
Ratios still have no meaning
A temperature of 2° Celsius is not twice as warm as a temperature of 1° Celsius.
The issue is that the 0° point of the Celsius scale is arbitrary in a physical sense, and therefore the ratio of two Celsius temperatures is not physically meaningful.
No meaningful (unique and non-arbitrary) zero value

Give me some examples of interval-scale variables!


Data Mining I: Data preprocessing and feature spaces 18
Numeric: Ratio-scale variables

Both differences and ratios have a meaning


E.g., a 100 kg person is twice as heavy as a 50 kg person.
E.g., a 50-year-old person is twice as old as a 25-year-old person.
Meaningful (unique and non-arbitrary) zero value
Examples:
age, weight, length, number of sales
temperature in Kelvin
When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of 1°.
The zero value is absolute zero, representing the complete absence of molecular motion.

Give me some examples of ratio-scale variables!

Data Mining I: Data preprocessing and feature spaces 19


Nominal-ordinal-interval-ratio

Source: https://www.sagepub.com/sites/default/files/upm-binaries/19708_6.pdf

Why do we care about the feature types?


Data Mining I: Data preprocessing and feature spaces 20
How do we extract features?

Feature extraction depends on the application

Images:
Color histograms: distribution of colors in the image (x axis: color, y axis: frequency)

Gene databases:
gene expression level

Text databases:
Word frequencies, e.g., Data: 25, Mining: 15, Feature: 12, Object: 7
...

But the feature-based approach allows uniform treatment of instances from different applications.
Data Mining I: Data preprocessing and feature spaces 21
Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 22


Univariate descriptors 1/5

Let x1, ..., xn be a random sample of an attribute X. Measures of central tendency of X include:

(Arithmetic) mean/center/average:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Weighted average:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Give me the mean of: 3, 8, 3, 4, 3, 6, 4, 2, 3

Data Mining I: Data preprocessing and feature spaces 23


Univariate descriptors 1/5

Let x1, ..., xn be a random sample of an attribute X. Measures of central tendency of X include:

(Arithmetic) mean/center/average:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Mean is greatly influenced by outliers, a more robust measure is median

Median: the central element in ascending ordering


Middle value if odd number of values, or average of the middle two values otherwise

Give me the median of: 3, 8, 3, 4, 3, 6, 4, 2, 3

Data Mining I: Data preprocessing and feature spaces 24


Univariate descriptors 1/5

Let x1, ..., xn be a random sample of an attribute X. Measures of central tendency of X include:

(Arithmetic) mean/center/average:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: the central element in ascending ordering
Middle value if odd number of values, or average of the middle two values otherwise

Mode: Value that occurs most frequently in the data

Give me the mode of: 3, 8, 3, 4, 3, 6, 4, 2, 3

Data Mining I: Data preprocessing and feature spaces 25
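As a small sketch (standard library only, answers to the prompts above can be read off the comments), all three measures of central tendency for the sample 3, 8, 3, 4, 3, 6, 4, 2, 3:

from statistics import mean, median, mode

sample = [3, 8, 3, 4, 3, 6, 4, 2, 3]

print(mean(sample))    # 4 (= 36/9), the arithmetic mean
print(median(sample))  # 3, middle value of the sorted sample 2,3,3,3,3,4,4,6,8
print(mode(sample))    # 3, the most frequent value (occurs four times)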


Univariate descriptors 2/5

(Figure: symmetric, positively skewed, and negatively skewed distributions.)


Data Mining I: Data preprocessing and feature spaces 26
Univariate descriptors 3/5

Let x1, ..., xn be a random sample of an attribute X. The degree to which X values tend to spread is called the dispersion or variance of X:

Variance σ²:
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

Standard deviation σ is the square root of the variance:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Data Mining I: Data preprocessing and feature spaces 27
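A minimal sketch of the population variance and standard deviation defined above (dividing by n), reusing the sample from the previous slide; the library functions pvariance/pstdev are the matching population versions:

from statistics import pvariance, pstdev   # population versions: divide by n, as in the slide formula

sample = [3, 8, 3, 4, 3, 6, 4, 2, 3]
mean = sum(sample) / len(sample)                            # 4.0
var = sum((x - mean) ** 2 for x in sample) / len(sample)    # 28/9 ~ 3.11
print(var, pvariance(sample))                               # both follow sigma^2 = (1/n) sum (x_i - mean)^2
print(pstdev(sample))                                       # sqrt(28/9) ~ 1.76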


Univariate descriptors 4/5

Example: For a normal distribution


~68% of values drawn from the distribution are within 1σ of the mean μ
~95% of the values lie within 2σ
~99.7% of the values lie within 3σ

Source: http://en.wikipedia.org/wiki/Normal_distribution

Data Mining I: Data preprocessing and feature spaces 28


Univariate descriptors 3/5

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.
Boxplots: 5-number summary
min, Q1, median, Q3, max
Q1 (25th percentile)
Q3 (75th percentile)
Median is the 50th percentile
Range: max value - min value

http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
Data Mining I: Data preprocessing and feature spaces 29
Univariate descriptors 3/5

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.
Boxplots: 5-number summary
min, Q1, median, Q3, max
Q1 (25th percentile)
Q3 (75th percentile)
Median is the 50th percentile
Range: max value - min value

Source: http://www.wellbeingatschool.org.nz/information-
sheet/understanding-and-interpreting-box-plots

Data Mining I: Data preprocessing and feature spaces 30


Univariate descriptors 5/5

Let x1, ..., xn be a random sample of an attribute X. For visual inspection of X, several types of charts are useful.
Histograms:
Summarizes the distribution of X
X axis: attribute values, Y axis: frequencies
Absolute frequency: for each value a, h(a) = # occurrences of a in the sample
Relative frequency: f(a) = h(a)/n
Different types of histograms, e.g.:
Equal width: divides the range into N intervals of equal size
Equal frequency/depth: divides the range into N intervals, each containing approximately the same number of samples

(Figures: equal width histogram, equal depth histogram.)
Source: http://www.dbs.ifi.lmu.de/Lehre/KDD/SS16/skript/2_DataRepresentation.pdf

Data Mining I: Data preprocessing and feature spaces 31
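As an illustrative sketch (assuming matplotlib is available; the sample values and bin count are made up, not from the slide), a histogram and a boxplot can be drawn directly from a sample:

import matplotlib.pyplot as plt

sample = [3, 8, 3, 4, 3, 6, 4, 2, 3, 5, 7, 4, 4, 9, 2]   # toy data for illustration

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(sample, bins=4)      # equal-width histogram: 4 intervals of equal size
ax1.set_xlabel("attribute value")
ax1.set_ylabel("absolute frequency")
ax2.boxplot(sample)           # box shows Q1, median, Q3; whiskers show the rest of the range
plt.show()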


Bivariate descriptors 1/5

Given two attributes X, Y one can measure how strongly they are correlated
For numerical data → correlation coefficient
For categorical data → χ² (chi-square)

Data Mining I: Data preprocessing and feature spaces 32


Bivariate descriptors 2/5: for numerical features

Correlation coefficient (also called Pearson's correlation coefficient) measures the linear association between X, Y:

$r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n \, \sigma_X \sigma_Y}$

xi, yi: the values in the i-th tuple for X, Y
value range: -1 ≤ r_XY ≤ 1
the higher |r_XY|, the stronger the correlation
r_XY > 0: positive correlation
r_XY < 0: negative correlation
r_XY ≈ 0: no (linear) correlation / independent

(Scatter plot sketches: point clouds with r_XY ≈ 0, r_XY > 0, and r_XY < 0.)
Data Mining I: Data preprocessing and feature spaces 33
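A minimal sketch of the coefficient exactly as defined above (population standard deviations in the denominator); the xs/ys values are made up for illustration:

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)     # equals sum((x-mx)(y-my)) / (n * sigma_X * sigma_Y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]           # loosely increasing with xs
print(pearson(xs, ys))         # ~0.77 -> positive correlation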
Bivariate descriptors 3/5: for numerical features

Visual inspection of correlation

Data Mining I: Data preprocessing and feature spaces 34


Bivariate descriptors 4/5: for categorical features

Contingency table
For categorical/nominal features X = {x1, ..., xc}, Y = {y1, ..., yr}
Represents the absolute frequency h_ij of each combination of values (x_i, y_j) and the marginal frequencies h_i, h_j of X, Y.

                Attribute Y
Attribute X     Medium-term unemployment   Long-term unemployment   Total
No education    19                          18                       37
Teaching        43                          20                       63
Total           62                          38                       100

Chi-square (χ²) test:
$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
o_ij: observed frequency, e_ij: expected frequency, $e_{ij} = \frac{h_i \cdot h_j}{n}$

Data Mining I: Data preprocessing and feature spaces 35


Bivariate descriptors 4/5: for categorical features

Chi-square example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the marginal distributions of the two categories):

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

It shows that like_science_fiction and play_chess are correlated in the group

Data Mining I: Data preprocessing and feature spaces 36
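A small sketch that recomputes the example above; the expected counts are derived from the row and column totals as e_ij = h_i · h_j / n, so only the standard library is needed:

observed = [[250, 200],     # like science fiction:     plays chess / does not
            [50, 1000]]     # not like science fiction: plays chess / does not

n = sum(sum(row) for row in observed)
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n     # expected frequency e_ij = h_i * h_j / n
        chi2 += (o - e) ** 2 / e
print(chi2)   # ~507.93, matching the slide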


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 37


Feature spaces and proximity measures

Feature space
Intuitively: a domain with a distance function
Formally: feature space F = (Dom, dist):
Dom is a set of attributes/features
dist: a numerical measure of the degree to which two compared objects differ
dist: Dom × Dom → R0+ (the non-negative reals)

For all x, y in Dom, x ≠ y, dist is required to satisfy the following properties:

dist(x,y) > 0 (non-negativity)
dist(x,x) = 0 (reflexivity)

Data Mining I: Data preprocessing and feature spaces 38


Feature spaces and proximity measures

Metric space
Formally: metric space M = (Dom, dist):
(Dom, dist) is a feature space,
i.e., dist(x,y) > 0 (non-negativity) and dist(x,x) = 0 (reflexivity)
∀x, y ∈ Dom: dist(x,y) = 0 ⟺ x = y (strictness)
∀x, y ∈ Dom: dist(x,y) = dist(y,x) (symmetry)
∀x, y, z ∈ Dom: dist(x,z) ≤ dist(x,y) + dist(y,z) (triangle inequality)

Measures that satisfy all the above properties are called metrics.

Data Mining I: Data preprocessing and feature spaces 39


Feature spaces and proximity measures

Famous example: Euclidean vector space E = (Dom, dist)
(Dom, dist) is a metric space
Dom = R^d
$dist(x, y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}$

Point coordinates:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix:
       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0

If scales differ, standardization is necessary!
e.g., age (range [0-100]) vs. salary (range [10000-100000])
We will come back to this in a few slides.


Data Mining I: Data preprocessing and feature spaces 40
Feature spaces and proximity measures

Manhattan distance or city-block distance (L1 norm)
dist1(p,q) = |p1-q1| + |p2-q2| + ... + |pd-qd|
The sum of the absolute differences of the p, q coordinates

Euclidean distance (L2 norm)
dist2(p,q) = ((p1-q1)² + (p2-q2)² + ... + (pd-qd)²)^(1/2)
The length of the line segment connecting p and q

Supremum distance (Lmax norm or L∞ norm)
dist∞(p,q) = max{|p1-q1|, |p2-q2|, ..., |pd-qd|}
The maximum difference between any attribute of the objects

Minkowski distance (generalization, Lp norm)
distp(p,q) = (|p1-q1|^p + |p2-q2|^p + ... + |pd-qd|^p)^(1/p)

Source: http://www.econ.upf.edu/~michael/stanford/maeb5.pdf

Data Mining I: Data preprocessing and feature spaces 41
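A minimal sketch of the three norms and the Minkowski generalization for equal-length numeric vectors, checked against points p1, p2 from the example slides:

def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def manhattan(p, q):           # L1 norm
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):           # L2 norm
    return minkowski(p, q, 2)

def supremum(p, q):            # L-infinity norm
    return max(abs(a - b) for a, b in zip(p, q))

p1, p2 = (0, 2), (2, 0)
print(manhattan(p1, p2), euclidean(p1, p2), supremum(p1, p2))   # 4, ~2.828, 2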


Feature spaces and proximity measures

Example

Point coordinates:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1 distance matrix:
L1     p1   p2   p3   p4
p1     0    4    4    6
p2     4    0    2    4
p3     4    2    0    2
p4     6    4    2    0

L2 distance matrix:
L2     p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0

L∞ distance matrix:
L∞     p1   p2   p3   p4
p1     0    2    3    5
p2     2    0    1    3
p3     3    1    0    2
p4     5    3    2    0

Data Mining I: Data preprocessing and feature spaces 42


Normalization

Attributes with large ranges outweigh those with small ranges

e.g., income [10000-100000]; age [10-100]
To balance the contribution of an attribute A in the resulting distance, the attributes are scaled to fall within a small, specified range.
min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

Normalize age = 30 in the [0-1] range, given min(age)=10, max(age)=100

Data Mining I: Data preprocessing and feature spaces 44


Normalization

Attributes with large ranges outweigh those with small ranges

e.g., income [10000-100000]; age [10-100]
To balance the contribution of an attribute A in the resulting distance, the attributes are scaled to fall within a small, specified range.
min-max normalization: to [new_minA, new_maxA]

$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$

Normalize age = 30 in the [0-1] range, given min(age)=10, max(age)=100

new_age = ((30-10)/(100-10)) * (1-0) + 0 = 2/9 ≈ 0.22

Data Mining I: Data preprocessing and feature spaces 45
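A small sketch of min-max scaling as defined above, reproducing the age example:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(min_max(30, 10, 100))   # 2/9 ~ 0.222, as computed above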


Normalization

Attributes with large ranges outweigh those with small ranges

e.g., income [10000-100000]; age [10-100]
To balance the contribution of an attribute A in the resulting distance, the attributes are scaled to fall within a small, specified range.
z-score normalization (also called zero-mean normalization)
After zero-mean normalization, each feature will have a mean value of 0

$v' = \frac{v - mean_A}{stand\_dev_A}$

Normalize income = 70,000 if mean_income = 50,000, stand_dev_income = 15,000

Data Mining I: Data preprocessing and feature spaces 46


Normalization

Attributes with large ranges outweigh those with small ranges

e.g., income [10000-100000]; age [10-100]
To balance the contribution of an attribute A in the resulting distance, the attributes are scaled to fall within a small, specified range.
z-score normalization (also called zero-mean normalization)
After zero-mean normalization, each feature will have a mean value of 0

$v' = \frac{v - mean_A}{stand\_dev_A}$

Normalize income = 70,000 if mean_income = 50,000, stand_dev_income = 15,000

new_value = (70,000 - 50,000) / 15,000 ≈ 1.33

Data Mining I: Data preprocessing and feature spaces 47
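And the same idea for z-score normalization, as a sketch reproducing the income example:

def z_score(v, mean_a, stand_dev_a):
    return (v - mean_a) / stand_dev_a

print(z_score(70_000, 50_000, 15_000))   # ~1.33, as computed above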


Proximity between binary attributes 1/2

A binary attribute has only two states: 0 (absence), 1 (presence)

A contingency table for binary data (comparing instance i with instance j):
q = the number of attributes where i was 1 and j was 1
r = the number of attributes where i was 1 and j was 0
s = the number of attributes where i was 0 and j was 1
t = the number of attributes where i was 0 and j was 0

Simple matching coefficient (for symmetric binary variables):
d(i,j) = (r + s) / (q + r + s + t)

Jaccard coefficient (for asymmetric binary variables, 0-0 matches are ignored):
d(i,j) = (r + s) / (q + r + s)

Data Mining I: Data preprocessing and feature spaces 48


Proximity between binary attributes 2/2

Example:
Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

Using the Jaccard coefficient from the previous slide (the tests are asymmetric binary attributes):

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

(q = # attributes where i was 1 and j was 1, r = # where i was 1 and j was 0,
 s = # where i was 0 and j was 1, t = # where i was 0 and j was 0)

Data Mining I: Data preprocessing and feature spaces 49
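A sketch that counts q, r, s, t directly from the 0/1 vectors and reproduces the three distances above (Jaccard-based, since the medical tests are asymmetric binary attributes):

def binary_counts(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def simple_matching_dist(i, j):           # symmetric binary attributes
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def jaccard_dist(i, j):                   # asymmetric: 0-0 matches are ignored
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(jaccard_dist(jack, mary))   # (0+1)/(2+0+1) = 0.33
print(jaccard_dist(jack, jim))    # (1+1)/(1+1+1) = 0.67
print(jaccard_dist(jim, mary))    # (1+2)/(1+1+2) = 0.75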


Proximity between categorical attributes

A nominal attribute has >2 states (generalization of a binary attribute)

e.g., color = {red, blue, green}

Method 1: Simple matching
m: # of matches, p: total # of variables
d(i,j) = (p - m) / p

Name   Hair color   Occupation
Jack   Brown        Student
Mary   Blond        Student
Jim    Brown        Architect

Method 2: Map it to binary variables
create a new binary attribute for each of the M nominal states of the attribute

Name   Brown hair   Blond hair   IsStudent   IsArchitect
Jack   1            0            1           0
Mary   0            1            1           0
Jim    1            0            0           1

Data Mining I: Data preprocessing and feature spaces 50
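Both methods as a small sketch: simple matching over the raw nominal values, and the equivalent mapping to one binary attribute per state (the state list is written out by hand here):

def simple_matching_nominal(i, j):
    m = sum(a == b for a, b in zip(i, j))     # number of matching attributes
    p = len(i)                                # total number of attributes
    return (p - m) / p

jack = ["Brown", "Student"]
mary = ["Blond", "Student"]
jim  = ["Brown", "Architect"]
print(simple_matching_nominal(jack, mary))    # (2-1)/2 = 0.5
print(simple_matching_nominal(jack, jim))     # (2-1)/2 = 0.5

# Method 2: one new binary attribute per nominal state
states = ["Brown", "Blond", "Student", "Architect"]
def to_binary(person):
    return [int(s in person) for s in states]
print(to_binary(jack))                        # [1, 0, 1, 0], as in the table above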


Selecting the right proximity measure

The proximity function should fit the type of data


For dense continuous data, metric distance functions like Euclidean are often used.
For sparse data, typically measures that ignore 0-0 matches are employed
We care about characteristics that objects share, not about those that both lack

Domain expertise is important; maybe there is already a state-of-the-art proximity function in a specific domain and we don't need to answer that question again.
In general, choosing the right proximity measure can be a very time-consuming task
Other important aspects: how to combine proximities for heterogeneous attributes (binary and numeric and nominal etc.)
e.g., using attribute weights
e.g., using attribute weights

Data Mining I: Data preprocessing and feature spaces 51


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 52


Feature transformations for text data 1/6

Text represented as a set of terms (Bag-Of-Words model)


Terms:
Single words (cluster, analysis..)
or
bigrams, trigrams, ..., n-grams (cluster analysis, ...)

Transformation of a document d into a vector r(d) = (h1, ..., hd), hi ≥ 0: the frequency of term ti in d

Example: "The region is preparing for blizzard conditions Friday, with the potential for more than two feet of snow in the Fairfax City area. Conditions are expected to deteriorate Friday afternoon, with the biggest snowfall, wind gusts and life-threatening conditions Friday night and Saturday."

Term frequencies (excerpt): blizzard 1, Friday 3, and 2, Zombie 0

Data Mining I: Data preprocessing and feature spaces 53
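A minimal bag-of-words sketch over single-word terms (lowercasing only, no stopword removal or stemming yet); the text is a shortened version of the snippet above:

from collections import Counter
import re

text = ("The region is preparing for blizzard conditions Friday, with the potential "
        "for more than two feet of snow in the Fairfax City area.")

terms = re.findall(r"[a-z]+", text.lower())    # single-word terms
freq = Counter(terms)
print(freq["blizzard"], freq["friday"], freq["zombie"])   # 1, 1, 0 for this shortened snippet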


Feature transformations for text data 2/6

Challenges/Problems in Text Mining:


1. Common words (e.g., the, and, for, me)
2. Words with the same root (fish, fisher, fishing,)
3. Very high-dimensional space (dimensionality d > 10.000)
4. Not all terms are equally important
5. Most term frequencies hi = 0 (sparse feature space)
More challenges due to language:
Different words have the same meaning (synonyms)
e.g., freedom - liberty
Words have more than one meaning
e.g., java, mouse

Data Mining I: Data preprocessing and feature spaces 54


Feature transformations for text data 3/6

Problem 1: Common words (e.g., the, and, for, me)

Solution: ignore these terms (stopwords)
Stopword lists for all languages are available on the WWW.

Problem 2: Words with the same root (fish, fisher, fishing, ...)

Solution: Stemming
Map the words to their root
- "fishing", "fished", "fish", and "fisher" map to the root word "fish".
For English, the Porter stemmer is widely used.
(Porter's stemming algorithm: http://tartarus.org/~martin/PorterStemmer/index.html)
Stemming solutions exist for other languages also.
The output of stemming is the root (stem) of each word.

Data Mining I: Data preprocessing and feature spaces 55
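A sketch of both steps, assuming the NLTK package and its English stopword list are installed (nltk.download("stopwords")); the word list below is made up for illustration, and any other stopword list or stemmer could be substituted:

from nltk.corpus import stopwords           # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

words = ["the", "fishers", "were", "fishing", "for", "fish"]
terms = [stemmer.stem(w) for w in words if w not in stop]
print(terms)   # roughly ['fisher', 'fish', 'fish']: stopwords removed, words reduced to stems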


Feature transformations for text data 4/6

Problem 3: Too many features/terms (very high-dimensional space)

Solution: select the most important features (feature selection)
Example: average document frequency for a term
Very frequent terms appear in almost all documents
Very rare terms appear in only a few documents

Ranking procedure:
1. Compute the document frequency for all terms ti:
   DF(ti) = (# documents containing ti) / (# documents)
2. Sort terms w.r.t. DF(ti) and get rank(ti)
3. Score terms by score(ti) = DF(ti) · rank(ti)
   e.g., score(t23) = 0.82 · 1 = 0.82
         score(t17) = 0.65 · 2 = 1.3
4. Select the k terms with the largest score(ti)

Rank   Term   DF
1.     t23    0.82
2.     t17    0.65
3.     t14    0.52
4.     ...

Data Mining I: Data preprocessing and feature spaces 56
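A sketch of the ranking procedure on a toy corpus (documents and terms invented for illustration); document frequency, rank, and the DF·rank score follow the four steps above:

docs = [{"data", "mining", "cluster"},
        {"data", "cluster", "analysis"},
        {"data", "text"},
        {"mining", "text", "analysis"}]                # each document as its set of terms

n = len(docs)
terms = set().union(*docs)
df = {t: sum(t in d for d in docs) / n for t in terms}            # DF(t) = #docs containing t / #docs

ranked = sorted(terms, key=lambda t: df[t], reverse=True)          # rank 1 = most frequent term
score = {t: df[t] * (rank + 1) for rank, t in enumerate(ranked)}   # score(t) = DF(t) * rank(t)

k = 3
selected = sorted(terms, key=lambda t: score[t], reverse=True)[:k]
print(selected)   # the k terms with the largest score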


Feature transformations for text data 5/6

Problem 4: Not all terms are equally important

Idea: very frequent terms are less informative than less frequent terms. Define such a term weighting schema.
Solution: TF-IDF (Term Frequency - Inverse Document Frequency)
Consider both the importance of the term in the document and in the whole collection of documents.

$TF(t,d) = \frac{n(t,d)}{\sum_{w \in d} n(w,d)}$   (the frequency of term t in document d)

$IDF(t) = \log\left(\frac{|DB|}{|\{d \in DB : t \in d\}|}\right)$   (inverse frequency of term t in the whole collection DB)

$TF\text{-}IDF(t,d) = TF(t,d) \cdot IDF(t)$

Feature vector with TF-IDF: r(d) = (TF(t1,d)·IDF(t1), ..., TF(tn,d)·IDF(tn))

Data Mining I: Data preprocessing and feature spaces 57
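A sketch of the weighting on a toy corpus (documents invented for illustration); TF normalizes by the document length in terms, IDF uses the natural log as in the formula above:

from collections import Counter
from math import log

docs = [["data", "mining", "data", "cluster"],
        ["text", "mining", "text"],
        ["data", "analysis"]]

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / sum(counts.values())     # TF(t,d) = n(t,d) / sum_w n(w,d)

def idf(term, corpus):
    containing = sum(term in doc for doc in corpus)
    return log(len(corpus) / containing)           # IDF(t) = log(|DB| / |{d : t in d}|)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("data", docs[0], docs))    # 0.5  * log(3/2) ~ 0.20
print(tf_idf("mining", docs[0], docs))  # 0.25 * log(3/2) ~ 0.10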


Feature transformations for text data 6/6

Problem 5: for most of the terms, hi = 0 (sparse vectors)

Euclidean distance is not a good idea: it is influenced by the vector lengths
Idea: use more appropriate distance measures

Jaccard coefficient: ignore terms absent in both documents
$d_{Jaccard}(d_1, d_2) = 1 - \frac{|\{t \mid t \in d_1 \wedge t \in d_2\}|}{|\{t \mid t \in d_1 \vee t \in d_2\}|}$

Cosine coefficient: consider term values (e.g., TF-IDF values)
$d_{cosine}(d_1, d_2) = 1 - \frac{\langle d_1, d_2 \rangle}{\|d_1\| \cdot \|d_2\|} = 1 - \frac{\sum_{i} d_{1,i} \, d_{2,i}}{\sqrt{\sum_{i} d_{1,i}^2} \cdot \sqrt{\sum_{i} d_{2,i}^2}}$

Data Mining I: Data preprocessing and feature spaces 58
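Both document distances as a sketch; the Jaccard version works on term sets, the cosine version on equal-length term-weight vectors (e.g., TF-IDF values). The example inputs are made up:

from math import sqrt

def jaccard_doc_dist(d1, d2):                 # d1, d2: sets of terms
    return 1 - len(d1 & d2) / len(d1 | d2)

def cosine_dist(v1, v2):                      # v1, v2: term-weight vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return 1 - dot / (norm1 * norm2)

print(jaccard_doc_dist({"data", "mining"}, {"data", "analysis"}))   # 1 - 1/3 ~ 0.67
print(cosine_dist([1.0, 0.5, 0.0], [1.0, 0.0, 0.5]))                # 1 - 1/1.25 = 0.2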


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 59


Data Visualization

A variety of techniques
Parallel Coordinates
Quantile Plot
Scatter Plot Matrix
Loess Curve
Spiderweb model
Chernoff faces

Data Mining I: Data preprocessing and feature spaces 60


Parallel Coordinates

Method to visualize high-dimensional data sets
using parallel axes of the feature space

Abstract idea: represent a d-dimensional feature space
with a system of d parallel axes

Instance representation: a polygonal line crossing
each parallel axis at its corresponding feature value of the instance

Image source: https://en.wikipedia.org/wiki/Parallel_coordinates
Slide adapted from http://www.dbs.ifi.lmu.de/Lehre/KDD/SS16/skript/2_DataRepresentation.pdf

Data Mining I: Data preprocessing and feature spaces 61


Spiderweb Model

Method to visualize multivariate data sets
using a spider net, where each radius of the net represents one feature of the data set

Abstract idea: a spider chart (or radar chart) is a visualization technique for multivariate data
in the form of a two-dimensional chart represented on axes (radii) starting from the same point
Notation: radii = equi-angular spokes = spider-net axes

Instance representation:
a polyline that intersects each spider-net axis at the instance's value for that feature

Motivation: sets all instances to a common origin
But it does not help if the number of instances is big (Big Data)

Example: spiderweb visualization from NASA; the most desirable design results can be found in the center of the net.
Image source: https://en.wikipedia.org/wiki/Radar_chart
Slide after https://en.wikipedia.org/wiki/Radar_chart

Data Mining I: Data preprocessing and feature spaces 63


Chernoff Faces

Visualize multivariate data
in the shape of a human face

Motivation: humans can easily perceive faces
and notice small variations in them

Method: each individual part of the face,
e.g., eyes, nose, ..., represents one feature,
and its shape/size the corresponding instance's value

But applicable only up to a certain number of feature dimensions (dimensionality reduction may be needed)

Example: Chernoff faces for lawyers' ratings of twelve judges.
Image source: https://en.wikipedia.org/wiki/Chernoff_face
[Chernoff 1973] Chernoff, H.: The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, Vol. 68, pp. 361-368, 1973.
Slide after https://en.wikipedia.org/wiki/Chernoff_face

Data Mining I: Data preprocessing and feature spaces 64


Chernoff Faces

Example of mapping face parts to climate features.

Using the mapping on the left, climatic data of some cities are represented by Chernoff faces (right).

Fig. 4.11 (left), Fig. 4.12 (right) of R. Mazza, Introduction to Information Visualization, Springer 2009.

Data Mining I: Data preprocessing and feature spaces 65


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 66


Homework/ tutorial

2nd tutorial follows


Homework
Inspect a dataset with Weka! Try different visualizations (also with R)
Readings:
Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 2.
Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011 (Chapter 2, Section 7.2)
If you are interested in more on text mining:
Tutorial on Text Mining and Link Analysis for Web and Semantic Web
Video: http://videolectures.net/ess07_grobelnik_twdmI/
Slides are also there (Download slides)

Natural language processing course: https://www.coursera.org/course/nlp

R. Mazza, Introduction to Information Visualization, Springer 2009

Data Mining I: Data preprocessing and feature spaces 67


Outline

Data preprocessing
Decomposing a dataset: instances and features
Basic data descriptors
Feature spaces and proximity (similarity, distance) measures
Feature transformation for text data
Data Visualization
Homework/ Tutorial
Things you should know from this lecture

Data Mining I: Data preprocessing and feature spaces 68


Things you should know from this lecture

Basics of data preprocessing


Basic feature types
Feature spaces & Metric spaces
Proximity measures
Basics of working with text data

Data Mining I: Data preprocessing and feature spaces 69


Acknowledgement

The slides are based on


KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)
Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
Pedro Domingos' Machine Learning course slides at the University of Washington
Machine Learning book by T. Mitchell, slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
Old Data Mining course slides at LUH by Prof. Udo Lipeck

Data Mining I: Data preprocessing and feature spaces 70
