
CMPE 442 Introduction to Machine Learning
• Features
Features

 Data set – a set of data objects

 Data object – an entity described by a set of attributes
 A feature is a data field representing a characteristic of a data object
 Features are the workhorses of ML
 A feature defines a mapping from the instance space to the feature space
 Distinguish features by:
 domain types
 range of permissible operations
 Ex: Consider two features, a person's age and a house number: while both are
numbers, house numbers are ordinal, so taking their average is meaningless.
 What matters is not just the domain of a feature but also the range of
permissible operations.
Getting to Know Your Data

 What are the types of features?


 What kind of values does each feature have?
 Which features are discrete and which are continuous valued?
 How are the values distributed?
 Can we spot any outliers?
 …
Kinds of Feature

 Numerical
 Features with numerical scale
 Often involve mapping to reals
 Continuous
 Ex: age, price, etc.
 Ordinal
 Features with an ordering but without scale
 Some totally ordered set
 Ex: set of characters, strings, house numbers, etc.
 Allow mode and median as central statistics, and quantiles as dispersion statistics

 Categorical
 Features without ordering or scale
 Allow no statistical summary except for the mode
 A Boolean feature is a special case of a categorical feature, with exactly two values
Kinds of Feature

 Categorical and ordinal features are qualitative:

 They describe a quality of an object without giving an actual size or quantity
 Numerical features are quantitative
Calculations of Features

 Aggregates or Statistics
 Main categories:
 Statistics of Central Tendency
 Statistics of Dispersion
 Shape Statistics
Statistics of Central Tendency

 Mean or average value

 Median – the middle value if we order the instances from lowest to highest
feature value
 Mode – the majority value or values
Statistics of Central Tendency: Mean

 The most common measure of the center of a set of data points


 Arithmetic mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$
 Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$

 Sensitive to extreme values (outliers)


 Trimmed mean – the mean computed after discarding values at the high
and low extremes
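A minimal sketch of the three means in Python (NumPy and SciPy assumed available; the numbers are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # illustrative values
w = np.ones_like(x, dtype=float)  # equal weights reduce the weighted mean to the plain mean

print(np.mean(x))                 # arithmetic mean
print(np.average(x, weights=w))   # weighted arithmetic mean
print(stats.trim_mean(x, 0.1))    # trimmed mean: discard 10% at each extreme
```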
Statistics of Central Tendency: Median

 The middle value in a set of ordered data values


 Is a better measure of the center of the data for skewed (asymmetric) data
 Separates the higher half of a data set from the lower half
 Expensive to compute when we have a large number of observations
 Applicable to numeric and ordinal features
Statistics of Central Tendency: Mode

 The value that occurs most frequently in the set


 Can be determined for qualitative and quantitative attributes
 Greatest frequency might correspond to several different values –
multimodal
 For unimodal numeric data that are moderately skewed:
$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
Statistics of Central Tendency


 The mode is the one statistic we can calculate whatever the domain of the feature,
e.g. blood type.
 In order to calculate the median we need an ordering on the feature values.
 In order to calculate the mean the feature needs to be expressed on some scale.
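A short sketch of which central statistic applies to which kind of feature, using Python's statistics module (the blood types follow the slide's example; the other values are made up):

```python
import statistics

blood_types = ["A", "O", "O", "B", "AB", "O"]      # categorical: only the mode applies
grades      = ["C", "B", "B", "A", "D", "B", "A"]  # ordinal: mode and median apply
ages        = [23, 25, 31, 31, 40, 52]             # quantitative: the mean also applies

print(statistics.multimode(blood_types))  # ['O']
print(statistics.multimode(grades))       # ['B']
print(statistics.median(grades))          # 'B' (lexicographic order happens to match the grade ordering)
print(statistics.mean(ages))              # mean needs a numeric scale
```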
Statistics of Dispersion

 Range
 Quantiles
 Variance
 Standard deviation
Statistics of Dispersion: Range

 Let $x_1, x_2, \ldots, x_N$ be a set of observations for some numeric attribute $X$


 The range of the set is the difference between the largest and smallest
values
Statistics of Dispersion: Quantiles

 Suppose that attribute $X$ is sorted in increasing order


 Pick certain data points so as to split the data distribution into equal-size
consecutive sets – these data points are called quantiles
 Quantiles- data points taken at regular intervals of data distribution, dividing it
into equal-size consecutive sets
 The 2-quantile is the data point dividing the lower and upper halves of the data
distribution  it corresponds to the median
 The 4-quantiles are the three data points that split the data distribution into four
equal parts  referred to as quartiles
 100-quantiles divide the data distribution into 100 equal-sized consecutive sets 
referred to as percentiles
 The median, quartiles and percentiles are the most widely used forms of
quantiles
Example: Percentile of GDP

[Figure: the GDP per capita distribution with the first quartile, median, mean, and third quartile marked.]

 Mean > Median
 The mean is more sensitive to outliers
 The median is preferred for skewed distributions like this
Statistics of Dispersion: Quantiles

 Interquartile Range (IQR) – the distance between the first and third quartiles:
$IQR = Q_3 - Q_1$
 Simple measure of spread that gives the range covered by the middle half of
the data

 Ex: Suppose we have the following values for salary (in thousands of dollars): 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
 Q1 = $47,000
 Q2 = $52,000
 Q3 = $63,000
 IQR = 63 − 47 = $16,000
 A common approach for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile
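A sketch of the salary example in NumPy; method="lower" (NumPy >= 1.22) is an assumption chosen so the quartiles fall on actual data points and reproduce the slide's hand computation, since NumPy's default linear interpolation gives slightly different values:

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # in $1000s

q1, q2, q3 = np.percentile(salaries, [25, 50, 75], method="lower")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 47 52 63 16

# 1.5 x IQR rule: flag values falling far outside the middle half of the data
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salaries[(salaries < low) | (salaries > high)])  # [110]
```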
Statistics of Dispersion: Standard Deviation and Variance
 Indicates how spread out a data distribution is
 Low std means that the data observations tend to be very close to the
mean
 High std indicates that the data are spread out over a large range of
values
 The variance of $N$ observations $x_1, x_2, \ldots, x_N$ for a numeric attribute $X$ is
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$

 The std is equal to the square root of the variance


Statistics of Dispersion: Standard Deviation and Variance
 The basic properties of std:
 std measures spread about the mean and should be considered only when the
mean is chosen as the measure of center
 std=0 only when there is no spread, i.e. when all observations have the same
value
 An observation is unlikely to be more than several stds away from the mean
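A minimal NumPy check of the 1/N formula above:

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

# ddof=0 gives the 1/N population variance used on the slide;
# ddof=1 would give the 1/(N-1) sample variance instead.
var = np.var(x, ddof=0)
std = np.std(x, ddof=0)
assert np.isclose(std, np.sqrt(var))
print(var, std)
```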
Histogram: GDP

GDP per capita is a real-valued feature


We can get its mode by means of a histogram.

The leftmost bin is the mode: a third of the countries have a GDP per capita of not more than $2,000.
This distribution is extremely right-skewed, resulting in a mean that is considerably higher than the
median.
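A hedged sketch of reading the modal bin off a histogram; the GDP numbers below are placeholders, not the slide's actual data:

```python
import numpy as np

gdp = np.array([800, 1500, 1900, 2100, 3400, 5000, 9000, 20000, 45000, 80000])  # made-up values

counts, edges = np.histogram(gdp, bins=5)
i = np.argmax(counts)  # index of the most populated bin
print(f"modal bin: [{edges[i]:.0f}, {edges[i + 1]:.0f}), count = {counts[i]}")
```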
Scatter Plots and Data Correlation

 Scatter Plot- one of the most effective graphical methods for determining if
there is a relationship between two numeric features
 Provides first look for the clusters and outliers, or to explore the possibility of
correlation relationships
 Two features X and Y are correlated if the values of one vary systematically with the values of the other
 Correlations can be positive, negative or null (uncorrelated)
Scatter Plots and Data Correlation

[Figure: example scatter plots showing positive, negative, and null correlation.]
Shape Statistics

 Skewness: $\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^3}{\sigma^3}$, where $\sigma$ is the standard deviation
 Positive skewness indicates that the distribution is right-skewed (the right tail is longer
than the left tail)
 Negative skewness indicates that the distribution is left-skewed
 Kurtosis: $\text{kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^4}{\sigma^4}$

 The normal distribution has a kurtosis of 3

 Excess kurtosis: $\text{kurtosis} - 3$
 Positive excess kurtosis means that the distribution is more sharply peaked than
the normal distribution.
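A sketch with SciPy; note that scipy.stats.kurtosis reports excess kurtosis (normal → 0) by default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)  # a right-skewed sample

print(stats.skew(data))                    # positive: right tail longer than left
print(stats.kurtosis(data))                # excess kurtosis (normal -> 0)
print(stats.kurtosis(data, fisher=False))  # plain kurtosis (normal -> 3)
```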
Feature types and Models

 Models treat different kinds of feature in distinct ways


 Decision trees
 A split on a categorical feature will have as many children as there are feature values
 Ordinal and quantitative features lead to binary splits
 Tree models ignore the scale of quantitative features, treating them as ordinal

 Naïve Bayes
 Works well with categorical features
 Treats ordinal features as categorical, ignoring the order
 Cannot deal with quantitative features unless discretized

 Linear Models
 Can only handle quantitative features
 Linearity assumes a Euclidean instance space where features act as Cartesian coordinates

 Distance-based methods
 Can accommodate all feature types by using an appropriate distance metric
Data Pre-processing: Cleaning

 Real data tend to be incomplete and noisy.


1. Missing Values
2. Noisy Data
Data Cleaning: Missing Values

1. Ignore the sample


 Usually done when the class label is missing
2. Fill in the missing value manually
 Time-consuming for large data sets with many missing values
3. Use a global constant to fill in the missing value
4. Use measure of central tendency for the attribute to fill in the missing value
 Use mean for normal data distribution, median for skewed data distribution
5. Use the attribute mean or median for all samples belonging to the same class as
the given sample
6. Use the most probable value to fill in the missing value
 Can be determined with regression, decision tree induction, etc.
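A short pandas sketch of strategies 3, 4 and 5 above; the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [30.0, None, 47.0, 50.0, None, 110.0],  # hypothetical feature with gaps
    "label":  ["a", "a", "b", "b", "a", "b"],
})

df["income_const"]  = df["income"].fillna(0)                      # 3. global constant
df["income_median"] = df["income"].fillna(df["income"].median())  # 4. central tendency
df["income_class"]  = df["income"].fillna(                        # 5. per-class median
    df.groupby("label")["income"].transform("median"))
print(df)
```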
Data Cleaning: Noisy Data

 Noise is a random error or variance in a measured variable


 We need to smooth out the data to remove the noise
1) Binning Methods: Smooth sorted data values by consulting their neighbourhood
2) Regression
3) Outlier analysis – outliers can be detected by clustering
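A minimal sketch of smoothing by bin means with equal-frequency bins (the values are borrowed from a standard textbook binning example):

```python
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition the sorted values into 3 equal-frequency bins and
# replace each value by its bin mean.
bins = np.array_split(prices, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [9. 9. 9. 22. 22. 22. 29. 29. 29.]
```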
Feature Transformations

 Aim at improving the utility of a feature by changing, removing or adding
information.
 Feature types ordered by the amount of detail they convey:
1. Quantitative
2. Ordinal
3. Categorical
4. Boolean
Feature Transformations

 Binarization:
 Transforms a categorical feature into a set of Boolean features, one for each
value of the categorical feature.
 Loses information.
 Needed if a model cannot handle more than two feature values (see the binarization sketch after this list).
 Unordering:
 Turns an ordinal feature into a categorical one by discarding the ordering of the
feature values.
 Often required since most learning models cannot handle ordinal features
directly.
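A sketch of binarization via one-hot (dummy) encoding in pandas; the feature values are made up:

```python
import pandas as pd

df = pd.DataFrame({"blood_type": ["A", "O", "B", "AB", "O"]})

# One Boolean column per categorical value
binarized = pd.get_dummies(df["blood_type"], prefix="blood")
print(binarized)
```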
Feature Transformations

 Thresholding:
 Transforms a quantitative or ordinal feature into a Boolean feature by finding a
feature value to split on
 Let $f: X \to \mathbb{R}$ be a quantitative feature and let $t \in \mathbb{R}$ be a threshold; then $f_t: X \to \{\text{true}, \text{false}\}$ is a Boolean feature defined by $f_t(x) = \text{true}$ if $f(x) \geq t$ and $f_t(x) = \text{false}$ if $f(x) < t$
 The threshold can be selected in an unsupervised or supervised way
 Unsupervised – involves computing some statistic over the data, typically a statistic of
central tendency (mean, median).
 Supervised – requires sorting the data on the feature value and traversing down this
ordering to optimize a particular objective function.
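A minimal sketch of the unsupervised case, thresholding at the median:

```python
import numpy as np

ages = np.array([23, 25, 31, 31, 40, 52])  # hypothetical quantitative feature

t = np.median(ages)  # unsupervised choice: a central-tendency statistic
f_t = ages >= t      # Boolean feature: f_t(x) = true iff f(x) >= t
print(t, f_t)
```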
Feature Transformations

 Discretization:
 Multiple threshold case.
 Transforms quantitative feature into an ordinal feature.
 Unsupervised discretization:
 divide feature values into predetermined bins.

 Supervised discretization:
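A sketch of the unsupervised case with pandas, dividing values into predetermined equal-width bins; a supervised approach would instead choose the bin boundaries using the class labels:

```python
import pandas as pd

ages = pd.Series([23, 25, 31, 31, 40, 52, 67])  # hypothetical quantitative feature

# 3 predetermined equal-width bins; pd.qcut(ages, 3) would give
# equal-frequency bins instead. The result is an ordered (ordinal) categorical.
binned = pd.cut(ages, bins=3, labels=["young", "middle", "old"])
print(binned)
```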
Feature Transformations

 Normalization:
 Unsupervised feature transformation.
 Often required to neutralize the effect of different quantitative features being
measured on different scales.
 Mostly understood as expressing the feature on a [0,1] scale (min-max scaling).
 Alternatively done by standardization: subtracting the mean and dividing by the
standard deviation.
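A sketch contrasting the two conventions above, min-max scaling to [0,1] and z-score standardization:

```python
import numpy as np

x = np.array([30.0, 36.0, 47.0, 52.0, 63.0, 110.0])  # illustrative values

minmax = (x - x.min()) / (x.max() - x.min())  # rescales onto [0, 1]
zscore = (x - x.mean()) / x.std()             # zero mean, unit standard deviation
print(minmax)
print(zscore)
```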
Feature Transformations

 Calibration:
 Supervised feature transformation adding a meaningful scale carrying class
information to arbitrary features.
 Allows models that require scale, such as linear classifiers, to handle ordinal and
categorical features.