KNIME - Seven Techniques for Dimensionality Reduction
Contents
Introduction
Rules
Missing Values
Low Variance Filter
High Correlation Filter
PCA
Random Forests
Backward Feature Elimination
Forward Feature Construction
Comparison
Combining Dimensionality Reduction Techniques
Introduction
KDD Challenge: Customer relationship prediction.
Churn represents a contract severance by a customer;
Appetency, the propensity to buy a service or a product;
Upselling, the possibility of selling additional products alongside the main one
The data set came from the CRM system of a large French
telecommunications company
Huge data set with 50K rows and 15K columns
The problem is not the size of the data set, but rather the number of
input columns
Dimensionality reduction is evaluated by means of supervised classification
algorithms
Used a cascade of the most promising techniques, as detected in the
first phase of the project on the smaller data set.
Evaluate data columns reduction based on:
High number of missing values
Low variance
High correlation with other data columns
Principal Component Analysis (PCA)
First cuts in random forest trees
Backward feature elimination
Forward feature construction
For the methods that use a threshold, an optimal threshold can be defined
through an optimization loop maximizing the classification accuracy on a
validation set for the best out of three classification algorithms:
MLP
decision tree
Naive Bayes
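The optimization loop described above can be sketched as follows. This is a minimal, illustrative example (not the original KNIME workflow): it assumes scikit-learn, uses a low-variance filter as the thresholded reduction method, and runs on synthetic data; all parameter values are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Illustrative synthetic data: 10 informative columns, 10 near-constant ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
X[:, 10:] *= 0.05                        # columns 10-19 get very low variance
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The three candidate classifiers named in the text.
classifiers = [
    DecisionTreeClassifier(random_state=0),
    GaussianNB(),
    MLPClassifier(max_iter=500, random_state=0),
]

thresholds = [0.0, 0.001, 0.01, 0.1]     # candidate filter thresholds
best_threshold, best_acc = None, -1.0
for t in thresholds:
    keep = X_train.var(axis=0) > t       # low-variance filter at threshold t
    # Score each classifier on the validation set; keep the best of the three.
    acc = max(clf.fit(X_train[:, keep], y_train)
                 .score(X_val[:, keep], y_val)
              for clf in classifiers)
    if acc > best_acc:
        best_threshold, best_acc = t, acc
```

The loop selects the threshold whose reduced column set yields the highest validation accuracy for the best of the three classifiers, mirroring the procedure described above.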
Missing Values
Removing Data Columns with Too Many Missing Values
Example: if a data column contains values for only 5-10% of its
rows, it will likely not be useful for the classification of most
records
Remove those data columns with too many missing values, i.e.
those whose percentage of missing values exceeds a given threshold
To count the number of missing values in all data columns, we
can either use a Statistics or a GroupBy node.
Ratio of missing values = number of missing values / total number
of rows
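Outside of KNIME, the same filter can be sketched in a few lines of Python. This is an illustrative helper (the function name and toy data are assumptions, not part of the original workflow), assuming pandas is available:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Drop columns whose ratio of missing values exceeds `threshold`."""
    ratio = df.isna().mean()             # missing values / total rows, per column
    keep = ratio[ratio <= threshold].index
    return df[keep]

# Toy example: column "b" is 75% missing and is dropped at threshold 0.5.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],
    "c": [1, np.nan, 3, 4],
})
reduced = drop_sparse_columns(df, threshold=0.5)
```

`df.isna().mean()` computes exactly the ratio defined above, since the mean of a boolean missing-value mask is the count of missing values divided by the number of rows.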