Data Mining 101
Outline
Introduction
Terminology
Potential application
Venn diagram
Process overview
Business understanding
Data understanding (exploration)
Data preparation (preprocessing)
Modeling
Evaluation
Deployment (presentation)
Tools & Resources
Introduction: Terminology
Data science
Big data analytics
Statistics
Knowledge Discovery in Databases
Data mining
Customer segmentation
Recommendation engine
Social media mining
What should we do?
Where to start? Do I have to get a master's degree in statistics?
http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg
CRISP DM Methodology
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
Business Understanding
Objective Statement
Bottom-up (start from the data) vs Top-down (start from the problem)
Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Inventory of Resources
Hardware
Data, Knowledge, Tools
Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Requirements, Assumptions, and Constraints
Requirements: Scheduling, Accuracy, Security
Assumptions: Data quality, External factors, Reporting type
Constraints: Legal issues, Budget, Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment
Risks and Contingencies
Business
Organizational
Financial
Contingency Plan
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg
How to evaluate the results?
Define your success criteria!
Data Understanding
Data Collection
External vs Internal
Watch out!
visible accessible
storable presentable
Victor Lavrenko, Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
Data Exploration
Visualization Heuristics
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration
Visualization Tools
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
Data Preparation
Data Selection
Data Cleaning
Remember: Expect problems in your data.
Dirty data: missing values, outliers, duplication, incomplete data, outdated data
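Three of the problems named above (duplicates, missing values, outliers) can be handled in a few lines of plain Python. This is only a sketch: the record fields (`id`, `age`), the sample values, and the 1.5 x IQR outlier rule are assumptions made for illustration, not part of the original slides.

```python
import statistics

# Hypothetical records: field names and values are made up for illustration.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 29},
    {"id": 3, "age": 29},     # duplicate of the previous record
    {"id": 4, "age": 31},
    {"id": 5, "age": 27},
    {"id": 6, "age": 33},
    {"id": 7, "age": 240},    # implausible value (outlier)
]

def drop_duplicates(rows):
    """Keep the first occurrence of each (id, age) pair."""
    seen, unique = set(), []
    for row in rows:
        key = (row["id"], row["age"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def drop_missing(rows, field):
    """Remove rows where the given field is None."""
    return [row for row in rows if row[field] is not None]

def split_outliers(rows, field, k=1.5):
    """Separate rows lying more than k * IQR outside the quartiles."""
    values = [row[field] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    kept = [r for r in rows if low <= r[field] <= high]
    flagged = [r for r in rows if not (low <= r[field] <= high)]
    return kept, flagged

clean = drop_missing(drop_duplicates(records), "age")
clean, outliers = split_outliers(clean, "age")
print(len(clean), "clean rows; flagged ids:", [r["id"] for r in outliers])
```

Flagged rows are returned rather than silently dropped, since whether an outlier is an error or a real observation is usually a business question.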
Data Construction
Feature construction, e.g.:
year from timestamp
quarter from timestamp
BMI
Log(x)
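The constructed features listed above could be derived as in the sketch below. The field names (`timestamp`, `weight_kg`, `height_m`, `income`) are hypothetical, and `income` merely stands in for the generic x in Log(x).

```python
import math
from datetime import datetime

def construct_features(row):
    """Derive new features from one raw record (hypothetical field names)."""
    ts = datetime.fromisoformat(row["timestamp"])
    return {
        "year": ts.year,
        # Months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on.
        "quarter": (ts.month - 1) // 3 + 1,
        # BMI = weight in kilograms divided by height in metres squared.
        "bmi": row["weight_kg"] / row["height_m"] ** 2,
        # A log transform compresses a right-skewed positive variable.
        "log_income": math.log(row["income"]),
    }

raw = {
    "timestamp": "2014-08-15T10:30:00",
    "weight_kg": 70.0,
    "height_m": 1.75,
    "income": 1000.0,
}
features = construct_features(raw)
print(features)
```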
Data Splitting
Two approaches: Training-Validation-Testing and Cross Validation
Data Splitting
Training-Validation-Testing
Training: construct classifier
Validation: pick algorithm and knob settings (tree depth, k in kNN, C in SVM)
Testing: estimate future error rate
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
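The three roles can be sketched with a tiny k-nearest-neighbour classifier: the training part builds the classifier, the validation part picks the knob k, and the testing part is used once to estimate the future error rate. The synthetic data, the 60/20/20 split, and the candidate values of k are assumptions made for this example.

```python
import random

def split_three_way(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and testing parts."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def error_rate(train, rows, k):
    """Fraction of rows misclassified by kNN built on the training part."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in rows)
    return wrong / len(rows)

# Synthetic data: label is 1 when x > 5, with roughly 10% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = int(x > 5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, val, test = split_three_way(data)
# Knob setting: choose k using the validation part only.
best_k = min([1, 3, 5, 9, 15], key=lambda k: error_rate(train, val, k))
# The testing part is touched once, to estimate the future error rate.
print("chosen k:", best_k, "test error:", error_rate(train, test, best_k))
```

Because k was tuned on the validation part, the validation error is an optimistic estimate; only the test error is an honest one.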
Data Splitting
Cross Validation
Every point is both training and testing, never at the same time
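A minimal sketch of how k-fold cross validation assigns points: each index lands in exactly one test fold and in the training part of every other fold. The function name `k_fold_indices` is made up for this example, and the data is assumed to be shuffled beforehand.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n=10, k=3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

The model is trained and scored once per fold, and the k scores are averaged to give the cross-validated estimate.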
Dimensionality Reduction
Principal Component Analysis vs Linear Discriminant Analysis
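As a rough illustration of the PCA side only: PCA ignores class labels and looks for the direction of greatest variance, whereas LDA uses the labels to find directions that separate the classes. The sketch below finds the first principal component by power iteration on the covariance matrix. The sample points and helper names are assumptions made for this example; a real project would normally rely on a numerical library.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (means, unit direction of greatest variance) via power iteration."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(points, means, direction):
    """Reduce each point to one coordinate along the given direction."""
    return [sum((p[j] - means[j]) * direction[j] for j in range(len(means)))
            for p in points]

# Made-up 2-D points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, pc1 = first_principal_component(points)
reduced = project(points, means, pc1)
print("direction:", pc1)
print("1-D coordinates:", reduced)
```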
Modeling
Machine Learning
Classification
Regression
Ranking
Clustering
Model Selection
Generalization bound
Regression
Technique
Linear regression
Kernel ridge regression
Support vector regression
Lasso
Which one should I choose?
Should I use all of them?
It depends on each model's assumptions and on the interpretability you need.
Model Selection
Assumptions
The predictors are linearly independent
The error is a random variable with a mean of zero conditional on the explanatory variables
The sample is representative of the population for the inference prediction
Interpretability
The understandability of why the model is true or how the model is induced from
https://chenhaot.com/pubs/mldg-interpretability.pdf
Beware of Overfitting!
http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
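A small experiment makes the warning concrete. A 1-nearest-neighbour classifier memorises its training data, so its training error is zero, yet it still makes mistakes on held-out data drawn from the same noisy source. The synthetic data (20% flipped labels) and the choice of k values are assumptions made for this sketch.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Label is 1 when x > 5, with roughly 20% of labels flipped as noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = int(x > 5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def error_rate(train, rows, k):
    """Fraction of rows misclassified by kNN built on the training data."""
    return sum(knn_predict(train, x, k) != y for x, y in rows) / len(rows)

train, held_out = make_data(100), make_data(100)
results = {}
for k in (1, 15):
    results[k] = (error_rate(train, train, k), error_rate(train, held_out, k))
    print("k =", k, "training error:", results[k][0], "held-out error:", results[k][1])
```

With k = 1 the model fits the noise as well as the signal. A larger k typically gives a higher training error but a lower held-out error, which is why error should be judged on data the model has not seen.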
Model Assessment
Regression: (R)MSE, Mean Absolute Error, Correlation Coefficient
Classification: Accuracy, Precision, Recall, F-score
Descriptive: Std. Error, p-value, Confidence Interval
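The regression and classification measures above can be computed directly from true and predicted values, as sketched below. The function names and sample values are made up for illustration, the F-score shown is the balanced F1 variant, and the correlation is Pearson's. The descriptive measures are not covered here.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, mean absolute error, and Pearson correlation coefficient."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae,
            "corr": cov / math.sqrt(var_t * var_p)}

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 10.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
print(reg)
print(clf)
```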
Evaluation
Deployment
The Tasks
Plan deployment
Plan monitoring and maintenance
Produce final report
Review project
Visualization: D3.js
Thank you!