Machine Learning
Features
Workhorses of ML
Mapping from instance space to the feature space
Distinguish features by:
domain types
range of permissible operations
Ex: Consider two features, a person's age and a house number: while both are numbers, house number is ordinal, hence taking the average is meaningless.
What matters is not just the domain of a feature but also the range of
permissible operations.
Getting to Know Your Data
Numerical
Features with numerical scale
Often involve mapping to reals
Continuous
Ex: age, price, etc.
Ordinal
Features with an ordering but without scale
Some totally ordered set
Ex: set of characters, strings, house numbers, etc.
Allow mode and median as central statistics, and quantiles as dispersion statistics
Categorical
Features without ordering or scale
Allow no statistical summary except the mode
A Boolean feature is a special case of a categorical feature with exactly two values
Kinds of Feature
Aggregates or Statistics
Main categories:
Statistics of Central Tendency
Statistics of Dispersion
Shape Statistics
Statistics of Central Tendency
The mode is the only statistic of central tendency we can calculate whatever the domain of the feature. Ex: blood type.
In order to calculate the median, we need an ordering on the feature values.
In order to calculate the mean, the feature must be expressed on some scale.
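As a quick illustration, here is a minimal sketch (assuming pandas; the values are made up) of which central-tendency statistics each kind of feature supports:

```python
import pandas as pd

# Categorical feature: only the mode is meaningful.
blood_type = pd.Series(["A", "O", "O", "B", "AB", "O"])
print(blood_type.mode()[0])                            # 'O'

# Ordinal feature: mode and median are meaningful; the median needs the ordering.
shirt_size = pd.Series(pd.Categorical(["S", "M", "M", "L", "XL"],
                                      categories=["S", "M", "L", "XL"],
                                      ordered=True))
print(shirt_size.mode()[0])                            # 'M'
codes = shirt_size.cat.codes                           # the ordering as integer codes
print(shirt_size.cat.categories[int(codes.median())])  # 'M'

# Quantitative feature: mode, median, and mean all apply.
age = pd.Series([23, 31, 31, 45, 60])
print(age.mode()[0], age.median(), age.mean())         # 31 31.0 38.0
```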
Statistics of Dispersion
Range
Quantiles
Variance
Standard deviation
Statistics of Dispersion: Range
[Figure: a distribution with the first quartile, median, mean, and third quartile marked]
Interquartile Range (IQR) – the distance between the first and third quartiles
IQR = Q3 − Q1
Simple measure of spread that gives the range covered by the middle half of
the data
Ex: Suppose we have the following values for salary (in thousands of dollars): 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Q1 = $47,000
Q2 (median) = $52,000
Q3 = $63,000
IQR = $63,000 − $47,000 = $16,000
A common approach for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile
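The quartiles, IQR, and 1.5 × IQR fences can be checked in a few lines; a sketch assuming NumPy, whose default percentile interpolation differs slightly from the quartile convention used in the example above:

```python
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # in $1000s

q1, q3 = np.percentile(salaries, [25, 75])   # linear interpolation: 49.25, 64.75
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier fences
print(salaries[(salaries < low) | (salaries > high)])  # [110] is a suspected outlier
```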
Statistics of Dispersion: Standard Deviation and Variance
Indicates how spread out a data distribution is
A low standard deviation means that the data observations tend to be very close to the mean
A high standard deviation indicates that the data are spread out over a large range of values
The variance of N observations x₁, x₂, …, x_N for a numeric attribute X is
σ² = (1/N) Σᵢ (xᵢ − x̄)²
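A small NumPy check of the 1/N (population) variance formula above:

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
var = np.mean((x - x.mean()) ** 2)   # (1/N) * sum of squared deviations
std = np.sqrt(var)
print(var, std)
# np.var(x) matches (ddof=0 by default); np.var(x, ddof=1) gives the 1/(N-1) sample variance.
```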
[Histogram of GDP per capita: the leftmost bin is the mode; a third of the countries have a GDP per capita of no more than $2,000. The distribution is extremely right-skewed, resulting in a mean that is considerably higher than the median.]
Scatter Plots and Data Correlation
Scatter plot: one of the most effective graphical methods for determining whether there is a relationship between two numeric features
Provides a first look at clusters and outliers, and a way to explore possible correlation relationships
Two features X and Y are correlated if the values of one vary systematically with the values of the other
Correlations can be positive, negative or null (uncorrelated)
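A small sketch of what positive and null correlation look like numerically, assuming NumPy and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2 * x + rng.normal(scale=0.5, size=200)   # varies systematically with x
y_null = rng.normal(size=200)                     # independent of x

print(np.corrcoef(x, y_pos)[0, 1])    # close to +1: positive correlation
print(np.corrcoef(x, y_null)[0, 1])   # close to 0: uncorrelated

# The relationship becomes visible in a scatter plot, e.g. with matplotlib:
# import matplotlib.pyplot as plt; plt.scatter(x, y_pos); plt.show()
```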
Shape Statistics
Skewness:
skewness = (1/N) Σᵢ (xᵢ − x̄)³ / σ³, where σ is the standard deviation
Positive skewness indicates that the distribution is right-skewed (the right tail is longer than the left tail)
Negative skewness indicates that the distribution is left-skewed
Kurtosis:
kurtosis = (1/N) Σᵢ (xᵢ − x̄)⁴ / σ⁴, which measures how heavy the tails of the distribution are
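A sketch using scipy.stats on synthetic right-skewed data (an exponential sample, whose theoretical skewness is 2):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
right_skewed = rng.exponential(size=10_000)    # long right tail

print(skew(right_skewed))                      # positive: right-skewed
print(kurtosis(right_skewed, fisher=False))    # plain kurtosis (3 for a normal distribution)
print(kurtosis(right_skewed))                  # excess kurtosis (0 for a normal distribution)
```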
Naïve Bayes
Works well with categorical features
Treats ordinal features as categorical, ignoring the order
Cannot deal with quantitative features unless they are discretized
Linear Models
Can only handle quantitative features
Linearity assumes a Euclidean instance space where features act as Cartesian coordinates
Distance-based methods
Can accommodate all feature types by using an appropriate distance metric
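For instance, a Gower-style mixed distance handles numeric and categorical features together; a minimal sketch (illustrative, not a specific library API):

```python
import numpy as np

def mixed_distance(a, b, numeric_ranges):
    """Average per-feature distance: range-normalized absolute difference for
    numeric features, 0/1 mismatch for categorical ones (range given as None)."""
    d = []
    for x, y, r in zip(a, b, numeric_ranges):
        if r is None:
            d.append(0.0 if x == y else 1.0)   # categorical: 0/1 mismatch
        else:
            d.append(abs(x - y) / r)           # numeric: normalized difference
    return float(np.mean(d))

# Instances with features (age, blood_type); suppose ages in the data span 50 years.
print(mixed_distance((25, "A"), (45, "O"), numeric_ranges=(50, None)))  # (0.4 + 1.0)/2 = 0.7
```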
Data Pre-processing: Cleaning
Binarization:
Transforms a categorical feature into a set of Boolean features, one for each
value of the categorical feature.
Loses the information that the values of a categorical feature are mutually exclusive.
Needed if a model cannot handle more than two feature values.
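A minimal binarization sketch using pandas (the column names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"blood_type": ["A", "O", "B", "O", "AB"]})
binarized = pd.get_dummies(df["blood_type"], prefix="blood_type")
print(binarized)
# One Boolean column per value: blood_type_A, blood_type_AB, blood_type_B, blood_type_O;
# exactly one of them is true per row.
```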
Unordering:
Turns an ordinal feature into a categorical one by discarding the ordering of the feature values.
Often required since most learning models cannot handle ordinal features
directly.
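In pandas, unordering an ordinal (ordered categorical) feature is a one-liner; a small sketch:

```python
import pandas as pd

size = pd.Series(pd.Categorical(["S", "M", "L", "M"],
                                categories=["S", "M", "L"], ordered=True))
unordered = size.cat.as_unordered()              # discard the ordering, keep the values
print(size.cat.ordered, unordered.cat.ordered)   # True False
```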
Feature Transformations
Thresholding:
Transforms a quantitative or ordinal feature into Boolean feature by finding a
feature value to split
Let f : X → ℝ be a quantitative feature and let t ∈ ℝ be a threshold; then f_t : X → {true, false} is a Boolean feature defined by f_t(x) = true if f(x) ≥ t and f_t(x) = false if f(x) < t
Threshold can be selected in unsupervised or supervised way
Unsupervised: involves computing some statistic over the data, typically a statistic of central tendency (mean, median).
Supervised: requires sorting the data on the feature value and traversing down this ordering to optimize a particular objective function.
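A sketch of unsupervised thresholding at the median, assuming NumPy and made-up feature values:

```python
import numpy as np

age = np.array([23, 31, 31, 45, 60, 72])   # quantitative feature f(x)

t = np.median(age)          # unsupervised choice: a statistic of central tendency
f_t = age >= t              # Boolean feature: true iff f(x) >= t
print(t, f_t)               # 38.0 [False False False  True  True  True]

# A supervised choice would instead sort (value, label) pairs and scan the
# candidate thresholds, keeping the one that optimizes an objective function.
```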
Feature Transformations
Discretization:
Multiple threshold case.
Transforms quantitative feature into an ordinal feature.
Unsupervised discretization:
divide the feature values into a predetermined number of bins, e.g. equal-width or equal-frequency bins.
Supervised discretization:
chooses the bin boundaries using the class labels, e.g. by recursively partitioning or agglomeratively merging intervals to optimize an objective function.
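A sketch of the unsupervised case with pandas (equal-width bins via pd.cut, equal-frequency bins via pd.qcut); the values are illustrative:

```python
import pandas as pd

price = pd.Series([12, 15, 21, 30, 34, 55, 70, 98])

equal_width = pd.cut(price, bins=4)   # 4 bins of equal width over the value range
equal_freq = pd.qcut(price, q=4)      # 4 bins holding (roughly) equal numbers of values
print(equal_width.cat.categories)
print(equal_freq.cat.categories)
# Either way the result is ordinal: the bins are ordered intervals.
```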
Feature Transformations
Normalization:
Unsupervised feature transformation.
Often required to neutralize the effect of different quantitative features being
measured on different scales.
Often understood as expressing the feature on a [0,1] scale (min–max scaling).
Alternatively done by subtracting the mean and dividing by the standard deviation (standardization), giving the feature zero mean and unit variance.
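A sketch contrasting the two transformations just described, assuming NumPy:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

min_max = (x - x.min()) / (x.max() - x.min())   # maps the feature onto [0, 1]
z_score = (x - x.mean()) / x.std()              # zero mean, unit variance
print(min_max)   # [0.   0.25 0.5  1.  ]
print(z_score)
```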
Feature Transformations
Calibration:
Supervised feature transformation adding a meaningful scale carrying class
information to arbitrary features.
Allows models that require scale, such as linear classifiers, to handle ordinal and
categorical features.
Calibration Example
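One common calibration scheme (not necessarily the one in the original example) replaces each feature value with the empirical probability of the positive class among training instances sharing that value; a minimal pandas sketch with made-up data:

```python
import pandas as pd

# Hypothetical training data: an ordinal feature and a binary class label.
df = pd.DataFrame({"size":  ["S", "S", "M", "M", "M", "L", "L", "L"],
                   "label": [ 0,   0,   0,   1,   1,   1,   1,   1 ]})

# Calibrated feature: empirical P(label = 1 | feature value).
calibration_map = df.groupby("size")["label"].mean()   # S: 0.0, M: 0.67, L: 1.0
df["size_calibrated"] = df["size"].map(calibration_map)
print(df)
# The categorical values now sit on a meaningful [0, 1] scale that carries class information.
```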