Feature Engineering Handout


Feature Engineering in Machine Learning
Chun-Liang Li (李俊良)
[email protected]
2016/07/17@
About Me
Academic
• NTU CSIE BS/MS (2012/2013), advisor: Prof. Hsuan-Tien Lin
• Worked with Prof. Chih-Jen Lin
• CMU MLD PhD (2014-), advisors: Prof. Jeff Schneider and Prof. Barnabás Póczos

Competition
• KDD Cup 2011 Champions
• KDD Cup 2013 Champions, with Prof. Hsuan-Tien Lin, Prof. Shou-De Lin, and many students

Working
• Internships in 2012 and 2015

2
What is Machine Learning?
• What is Machine Learning?

  Learning: Existing Data → Machine (Algorithm) → Model
  Prediction: New Data → Model → Prediction

• Data: several length-d vectors

3
Data? Algorithm?
• In academia
  • Assume we are given good enough data (d-dimensional, of course)
  • Focus on designing better algorithms
  • Sometimes complicated algorithms imply publications

• In practice
  • Where is your good enough data?
  • Or, how do you transform your data into a d-dimensional one?

4
From Zero to One:
Create your features from your observations

5
An Apple

How to describe this picture?

6
More Fruits
• Method I: Use size of picture

(640, 580) (640, 580)

• Method II: Use RGB average (see the sketch below)

  (219, 156, 140) (243, 194, 113) (216, 156, 155)

• Many more powerful features developed in computer vision
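As an illustration of Method II, here is a minimal sketch (mine, not from the talk) that computes the per-channel RGB average with Pillow and NumPy; the file name apple.jpg is hypothetical.

```python
# A minimal sketch (not from the original slides): compute the Method II
# feature, i.e. the per-channel RGB average of an image, with Pillow + NumPy.
import numpy as np
from PIL import Image

def rgb_average(path):
    """Return the (R, G, B) mean of an image as a length-3 feature vector."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    return img.reshape(-1, 3).mean(axis=0)  # average over all pixels

# Hypothetical usage:
# print(rgb_average("apple.jpg"))  # e.g. roughly (219, 156, 140)
```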

7
Case Study (KDD Cup 2013)
• Determine whether a paper is written by a given
author

• We are given the raw text of the paper and author records

Data: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

8
NTU Approaches
Pipeline: Feature Engineering → Several Algorithms → Combining Different Models

Feature Engineering: Observation → Encode into Features → Result

9
First observation:
Authors Information
• Are these my (Chun-Liang Li) papers? (Easy! check author names)

1. Chun-Liang Li and Hsuan-Tien Lin. Condensed filter tree for cost-sensitive multi-label classification.

2. Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification.

• Encode by name similarities (e.g., how many characters are the same; see the sketch below)

• Are Li, Chun-Liang and Chun-Liang Li the same?

• Yes! Eastern and Western order

• How about Li Chun-Liang? (Calculate the similarity of the reverse order)

• Also take co-authors into account

• 29 features in total
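A toy sketch (mine, not the competition code) of this kind of name-similarity feature: character overlap between two names, also scoring the reversed word order so that "Li, Chun-Liang" and "Chun-Liang Li" match.

```python
# A toy sketch (not the actual KDD Cup 2013 code): character-overlap
# similarity between two author names, also checking the reversed word
# order so that "Li, Chun-Liang" and "Chun-Liang Li" score highly.
from collections import Counter

def char_overlap(a, b):
    """Fraction of characters the two names share (multiset intersection)."""
    ca, cb = Counter(a.lower()), Counter(b.lower())
    return sum((ca & cb).values()) / max(len(a), len(b), 1)

def name_similarity(a, b):
    """Take the better of the original and the reversed word order of b."""
    reversed_b = " ".join(reversed(b.replace(",", " ").split()))
    return max(char_overlap(a, b), char_overlap(a, reversed_b))

# print(name_similarity("Chun-Liang Li", "Li, Chun-Liang"))  # close to 1.0
```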

10
Second Observation:
Affiliations
• Are Dr. Chi-Jen Lu and Prof. Chih-Jen Lin the same?

• Similar names: Chi-Jen Lu vs. Chih-Jen Lin

• Shared co-author (me!)

• Take affiliations into account!

• Academia Sinica vs. National Taiwan University

• 13 features in total

11
Last of KDD Cup 2013
• Many other features, including

• Can you live for more than 100 years? (At least I think I can't do research after 100 years)

• More advanced: social network features

Summary
The 97 features designed by students won the competition

12
Furthermore
• If I can access the content, can I do better? Definitely!

Who is Robert Galbraith?

"I thought it was by a very mature writer, and not a first-timer." — Peter James

Author: Robert Galbraith

13
Writing Style?
• "I was testing things like word length, sentence length, paragraph length, frequency of particular words and the pattern of punctuation" — Peter Millican (University of Oxford)
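A rough sketch (mine, not Millican's actual tooling) of such stylometric features: average word length, average sentence length, and punctuation frequency computed from raw text.

```python
# A rough sketch (not the actual stylometry tools): simple writing-style
# features such as average word/sentence length and punctuation frequency.
import re
from collections import Counter

def style_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punct = Counter(c for c in text if c in ",;:!?'\"")
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "comma_rate": punct[","] / max(len(words), 1),
    }

# print(style_features("I thought it was by a very mature writer, and not a first-timer."))
```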

14
Game Changing Point:
Deep Learning

15
Common Types of Data
• Image


• Text

16
Representation Learning
• Deep Learning as learning hidden representations

  Raw data → hidden layers → use the last layer to extract features (Krizhevsky et al., 2012)
  (Check Prof. Lee's talk and go to the deep learning session later)

• An active research topic in academia and industry

17
Use Pre-trained Network
• You don't need to train a network by yourself

• Use existing pre-trained network to extract features

• AlexNet
• VGG
• Word2Vec
Result
Simply using deep learning features achieves state-of-the-art
performance in many applications
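A minimal sketch (mine, using the torchvision API rather than anything shown in the talk) of extracting features from a pre-trained AlexNet by taking the activations just before its final classification layer; the image file name is hypothetical.

```python
# A minimal sketch (assumes PyTorch/torchvision, not tools from the talk):
# use a pre-trained AlexNet as a fixed feature extractor by dropping its
# final classification layer.
import torch
from PIL import Image
from torchvision import models, transforms as T

model = models.alexnet(pretrained=True)
model.classifier = model.classifier[:-1]  # keep everything up to the last FC layer
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("apple.jpg")).unsqueeze(0)  # hypothetical image file
with torch.no_grad():
    features = model(img)  # a 4096-dimensional feature vector
```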

18
Successful Example
• The PASCAL Visual Object Classes Challenge

[Bar chart: mean average precision on the PASCAL VOC challenge, 2005-2014. Feature engineering (e.g., HoG features) and algorithms made slow progress before deep learning; the deep learning result (Girshick et al., 2014) is a large jump.]

19
Curse of Dimensionality:
Feature Selection and Dimension Reduction

20
The more, the better?
Practice
• If we have 1,000,000 data points with 100,000 dimensions, how much memory do we need?
  Ans: 10^6 × 10^5 × 8 = 8 × 10^11 (B) = 800 (GB)

Noisy Features
• Is every feature useful? Redundancy?

Theory
• Without any assumption, you need O(1/ε^d) data to achieve ε error for d-dimensional data

21
Feature Selection
• Select important features

• Reduce dimensions

• Explainable results

Commonly Used Tools


• LASSO (Sparse Constraint)
• Random Forests
• Many others
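A minimal sketch (mine, with scikit-learn on synthetic data, not the KDD Cup data) of the first two tools: an L1 penalty that zeroes out weights of unhelpful features, and random-forest importances used to rank features.

```python
# A minimal sketch (assumes scikit-learn; synthetic data): two common
# feature-selection tools, LASSO and random-forest feature importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)

# LASSO: the L1 penalty drives coefficients of unhelpful features to exactly zero.
lasso = Lasso(alpha=0.01).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)

# Random forest: rank features by impurity-based importance, keep the top 10.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_by_forest = np.argsort(forest.feature_importances_)[::-1][:10]

print(kept_by_lasso, top_by_forest)
```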

22
KDD Cup Again
• In KDD Cup 2013, we actually generated more than 200 features (some secrets you won't see in the paper)

• We used random forests to select only 97 features, since many features are unimportant and even harmful. But why?

23
Non-useful Features
• Duplicated features

• Example I: Country (Taiwan) vs. Coordinates (121, 23.5)

• Example II: Date of birth (1990) vs. Age (26)

• Noisy features

• Noisy information (something wrong in your data)

• Missing values (something missing in your data)

• What if we still have too many features?

24
Dimension Reduction
• Let’s visualize the data (a perfect example)


  [Scatter plot: the points lie almost exactly along a line. One dimension is enough]

• Non-perfect example in practice


  [Scatter plot: the points are spread around a line. Trade-off between information and space]

Commonly Used Tools


• Principal Component Analysis (PCA)
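A minimal sketch (mine, with scikit-learn on synthetic data in the spirit of the plots above) of PCA: project the data onto its first principal component and check how much variance is kept.

```python
# A minimal sketch (assumes scikit-learn; synthetic 2-D data): project data
# onto its first principal component and check the retained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=500)
X = np.column_stack([t, 0.5 * t + 0.02 * rng.normal(size=500)])  # nearly one-dimensional

pca = PCA(n_components=1)
Z = pca.fit_transform(X)              # 500 x 1 compressed representation
print(pca.explained_variance_ratio_)  # close to 1.0: one dimension is enough
X_back = pca.inverse_transform(Z)     # approximate reconstruction in 2-D
```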

25
PCA — Intuition
• Let's apply PCA on these faces (raw pixels) and visualize the coordinates

  [Scatter plot of the projected coordinates]

http://comp435p.tk/

26
PCA — Intuition (cont.)
• We can use very few basis faces to approximate (describe) the original faces

  [Images: the first nine basis faces]

(Sirovich and Kirby, Low-dimensional procedure for the characterization of human faces)
http://comp435p.tk/

27
PCA — Case Study
• CIFAR-10 image classification with raw pixels as features, using an approximated kernel SVM
  (Li and Póczos, Utilize Old Coordinates: Faster Doubly Stochastic Gradients for Kernel Methods, UAI 2016)

Dimensions    Accuracy    Time
3072 (all)    63.1%       ~2 hrs
100 (PCA)     59.8%       250 s
Trade-off between information, space and time

28
PCA in Practice
• Practical concern:
  • Time complexity: O(Nd^2)
  • Space complexity: O(d^2)

Small Problem
PCA takes <10 seconds on the CIFAR-10 dataset (d = 3072) using 12 cores (E5-2620)

• Remark: use fast approximations for large-scale problems (e.g., >100k dimensions)

  1. PCA with random projection (implemented in scikit-learn)
     (Halko et al., Finding Structure with Randomness, 2011)

  2. Stochastic algorithms (easy to implement from scratch)
     (Li et al., Rivalry of Two Families of Algorithms for Memory-Restricted Streaming PCA, AISTATS 2016)
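A minimal sketch of option 1 (mine; I assume scikit-learn's randomized SVD solver, which follows Halko et al. 2011): approximate PCA that avoids forming the full d x d covariance matrix.

```python
# A minimal sketch (assumes scikit-learn's randomized SVD solver, following
# Halko et al. 2011): approximate PCA for high-dimensional data.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(2000, 3072))  # stand-in for raw CIFAR-10 pixels

pca = PCA(n_components=100, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(X)  # 2000 x 100 reduced features
print(Z.shape)
```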

29
Conclusion
• Observe the data and encode them into meaningful features


Beginning: Existing Data → Machine (Algorithm)

Now: Existing Data → Features → (Simple) Algorithm

• Deep learning is a powerful tool to use

• Reduce number of features if necessary

• Reduce non-useful features

• Computational concern

30
Thanks!
Any Questions?

31
References
1. Richard Szeliski. Computer Vision: Algorithms and Applications, 2010.
2. Senjuti Basu Roy, Martine De Cock, Vani Mandava, Swapna Savanna, Brian Dalessandro, Claudia
Perlich, William Cukierski, and Ben Hamner. The Microsoft academic search dataset and KDD cup
2013. In KDD Cup 2013 Workshop, 2013.
3. Chun-Liang Li, Yu-Chuan Su, Ting-Wei Lin, Cheng-Hao Tsai, Wei-Cheng Chang, Kuan-Hao Huang,
Tzu-Ming Kuo, Shan-Wei Lin, Young-San Lin, Yu-Chen Lu, Chun-Pai Yang, Cheng-Xia Chang, Wei-
Sheng Chin, Yu-Chin Juan, Hsiao-Yu Tung, Jui-Pin Wang, Cheng-Kuang Wei, Felix Wu, Tu-Chun Yin,
Tong Yu, Yong Zhuang, Shou-De Lin, Hsuan-Tien Lin, and Chih-Jen Lin. Combination of feature
engineering and ranking models for paper-author identification in KDD Cup 2013. In JMLR, 2015.
4. How JK Rowling was unmasked. http://www.bbc.com/news/entertainment-arts-23313074
5. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new
perspectives. In IEEE PAMI, 2015.
6. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, 2012.
7. Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image
Recognition. In ICLR, 2015.
8. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word
Representations in Vector Space. Technical Report, 2013.

32
9. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In CVPR, 2014.
10. Matthew A. Turk and Alex Pentland. Face Recognition Using Eigenfaces. In CVPR, 1991.
11. Chun-Liang Li, and Barnabás Póczos. Utilize Old Coordinates: Faster Doubly Stochastic
Gradients for Kernel Methods. In UAI, 2016.
12. Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. In SIAM Rev.,
2011.
13. Chun-Liang Li, Hsuan-Tien Lin and, Chi-Jen Lu. Rivalry of Two Families of Algorithms for
Memory-Restricted Streaming PCA. In AISTATS, 2016.

33
