Machine Learning Unit 1
Unit-1
Machine Learning
Applications
Learning Techniques
Classification Families
What is Machine Learning?
Machine Learning
◦ Study of algorithms that improve their performance at some task with experience
Optimize a performance criterion using example data or past experience.
Role of Statistics: inference from a sample
Role of Computer Science: efficient algorithms to
◦ Solve the optimization problem
◦ Represent and evaluate the model for inference
Applications
Retail: Market basket analysis, Customer relationship
management (CRM)
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Medicine: Medical diagnosis
Telecommunications: Quality of service optimization
Bioinformatics: Motifs, alignment
Web mining: Search engines
Learning system model
[Figure: learning system model: input samples are fed to a learning method that builds the system; the system is built during training and evaluated during testing.]
Designing a Learning System:
1. Problem Description
2. Choosing the Training Experience
3. Choosing the Target Function
4. Choosing a Representation for the Target Function (Concept Representation)
5. Choosing a Function Approximation Algorithm
6. Final Design
1. Problem Description: A Checker Learning Problem
Task T: Playing Checkers
Performance Measure P: Percent of games won against opponents
Training Experience E: To be selected ==> Games played against itself
2. Choosing the Training Experience
Direct versus Indirect Experience [Indirect experience gives rise to the credit assignment problem and is thus more difficult]
Teacher versus Learner Controlled Experience [the teacher might provide training examples; the learner might suggest interesting examples and ask the teacher for their outcome; or the learner can be completely on its own with no access to correct outcomes]
How Representative is the Experience? [Is the training experience representative of the task the system will actually have to solve? It is best if it is, but such a situation cannot systematically be achieved]
3. Choosing the Target Function
Given a set of legal moves, we want to learn how to choose the best move [since the best move is not necessarily known, this is an optimization problem]
ChooseMove: B --> M is called a Target Function [ChooseMove, however, is difficult to learn. An easier and related target function to learn is V: B --> R, which assigns a numerical score to each board. The better the board, the higher the score.]
Operational versus Non-Operational Description of a Target Function [An operational description must be given]
Function Approximation [The actual function can often not be learned and must be approximated]
4. Choosing a Representation for the Target Function (Concept Representation)
Expressiveness versus Training set size [The more expressive the representation of the target function, the closer to the “truth” we can get. However, the more expressive the representation, the more training examples are necessary to choose among the large number of “representable” possibilities.]
Example of a representation:
◦ x1/x2 = # of black/red pieces on the board
◦ x3/x4 = # of black/red kings on the board
◦ x5/x6 = # of black/red pieces threatened by red/black
V(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
The w's are adjustable or “learnable” coefficients.
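Below is a minimal Python sketch of this linear representation; the function name, feature values and weights are illustrative assumptions, not taken from an actual checkers learner.

```python
# Hypothetical sketch: linear evaluation function V(b) for a checkers board.
# The six features x1..x6 follow the definitions above; the weights are
# illustrative, not learned values.

def evaluate_board(x, w):
    """V(b) = w0 + w1*x1 + ... + w6*x6 for a board described by features x1..x6."""
    w0, *ws = w
    return w0 + sum(wi * xi for wi, xi in zip(ws, x))

# x1..x6: black pieces, red pieces, black kings, red kings,
#         black pieces threatened, red pieces threatened
features = [12, 12, 0, 0, 1, 2]                     # an opening-like position (illustrative)
weights = [0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5]    # w0..w6 (illustrative)
print(evaluate_board(features, weights))
```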
5. Choosing a Function Approximation Algorithm
Generating training examples of the form <b, Vtrain(b)> [e.g. <x1=3, x2=0, x3=1, x4=0, x5=0, x6=0>, +100 (= black won)]
◦ Useful and easy approach: Vtrain(b) <- V̂(Successor(b)), i.e. the current estimate applied to the successor board
Training the System
◦ Defining a criterion for success [What is the error that needs to be minimized?]
◦ Choosing an algorithm capable of finding weights of a linear function that minimize that error [e.g. the Least Mean Square (LMS) training rule].
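A minimal sketch of the LMS weight update for the linear V(b) above, assuming a small learning rate eta; the helper name and sample values are illustrative.

```python
# Hypothetical sketch of the LMS training rule:
# w_i <- w_i + eta * (Vtrain(b) - Vhat(b)) * x_i for each training example.

def lms_update(w, x, v_train, eta=0.01):
    """One LMS step on weights w = [w0..w6] for board features x = [x1..x6]."""
    v_hat = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    error = v_train - v_hat
    w[0] += eta * error                     # bias term (x0 = 1)
    for i, xi in enumerate(x, start=1):
        w[i] += eta * error * xi
    return w

w = [0.0] * 7
w = lms_update(w, [3, 0, 1, 0, 0, 0], v_train=100.0)   # final board: black won
print(w)
```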
6. Final Design for Checkers Learning
[Figure: final design: an Experiment Generator proposes new problems, the Performance System plays them, the Critic produces training examples from the game traces, and the Generalizer updates the learned V function.]
Learning Techniques
Association Analysis
Supervised Learning
◦ Classification
◦ Regression/Prediction
Unsupervised Learning
Reinforcement Learning
Learning Associations
Basket analysis:
P (Y | X ) probability that somebody who buys X also
buys Y where X and Y are products/services.
Example: P ( chips | beer ) = 0.7
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
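A short sketch of estimating the rule confidence P(Y | X) directly from the five transactions above; the helper function is a toy, and the printed values are what this small data set actually yields.

```python
# Estimate P(Y | X): the fraction of baskets containing X that also contain Y.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def confidence(x, y, baskets):
    """Confidence of the rule x -> y, i.e. an estimate of P(y | x)."""
    with_x = [b for b in baskets if x in b]
    return sum(y in b for b in with_x) / len(with_x)

print(confidence("Beer", "Diaper", transactions))   # 3/3 = 1.0
print(confidence("Bread", "Milk", transactions))    # 3/4 = 0.75
```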
Classification
Example: Credit scoring
Differentiating between low-risk and high-risk customers from their income and savings
[Figure: a model (discriminant) in the income/savings plane separating low-risk from high-risk customers.]
Classification: Applications
Also known as pattern recognition
Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style
Character recognition: different handwriting styles
Speech recognition: temporal dependency
◦ Use of a dictionary or the syntax of the language
◦ Sensor fusion: combine multiple modalities, e.g. visual (lip image) and acoustic, for speech
Medical diagnosis: from symptoms to illnesses
Web advertising: predict whether a user clicks on an ad on the Internet
Prediction: Regression
Example: price of a used car
x: car attributes
y: price
y = w·x + w0
y = g(x | θ)
g(): model, θ: parameters
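A minimal sketch of fitting y = w·x + w0 by least squares for the used-car example, using a single attribute; the mileage and price numbers are made up for illustration.

```python
# Fit a straight line y = w*x + w0 to (attribute, price) pairs by least squares.
import numpy as np

x = np.array([10, 30, 50, 80, 120], dtype=float)   # mileage in 1000 km (illustrative)
y = np.array([18, 15, 12, 9, 6], dtype=float)      # price in 1000s (illustrative)

w, w0 = np.polyfit(x, y, deg=1)                    # least-squares line
print(f"g(x | theta) = {w:.3f} * x + {w0:.3f}")
print("predicted price at x = 60:", w * 60 + w0)
```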
Regression Applications
[Figure: example regression application: predicting angles α1 = g1(x, y) and α2 = g2(x, y) from a target position (x, y), as in robot-arm kinematics.]
Supervised Learning: Uses
Example: decision tree tools that create rules
Prediction of future cases: Use the rule to
predict the output for future inputs
Knowledge extraction: The rule is easy to
understand
Compression: The rule is simpler than the
data it explains
Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud
Unsupervised Learning
Learning “what normally happens”
No output
Clustering: Grouping similar instances
Other applications: Summarization,
Association Analysis
Example applications
◦ Customer segmentation in CRM
◦ Image compression: Color quantization
◦ Bioinformatics: Learning motifs
Reinforcement Learning
Topics:
◦ Policies: what actions should an agent take in a particular
situation
◦ Utility estimation: how good is a state (used by policy)
No supervised output but delayed reward
Credit assignment problem (what was responsible
for the outcome)
Applications:
◦ Game playing
◦ Robot in a maze
◦ Multiple agents, partial observability, ...
Classification Families
Decision trees
Nearest Neighbour
Linear discriminative
Non-linear discriminative
Probabilistic (conditional and generative)
Decision Tree C4.5 (J48 classifier)
For a training set T with 14 examples (9 of one class, 5 of the other), split by attribute x1 into subsets of size 5 (class counts 2 and 3), 4 (4 and 0) and 5 (3 and 2):

Info(T) = -9/14·log2(9/14) - 5/14·log2(5/14) = 0.940 bits
Info_x1(T) = 5/14·(-2/5·log2(2/5) - 3/5·log2(3/5))
           + 4/14·(-4/4·log2(4/4) - 0/4·log2(0/4))
           + 5/14·(-3/5·log2(3/5) - 2/5·log2(2/5))
           = 0.694 bits   (with 0·log2(0) taken as 0)
Gain(x1) = 0.940 - 0.694 = 0.246 bits
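A short sketch that reproduces these numbers; the 9/5 class split and the 5/4/5 subsets with their class counts are read off the formulas above.

```python
from math import log2

def info(pos, neg):
    """Entropy (in bits) of a set with `pos` and `neg` examples of the two classes."""
    total = pos + neg
    ent = 0.0
    for c in (pos, neg):
        if c:                                   # 0 * log2(0) is taken as 0
            p = c / total
            ent -= p * log2(p)
    return ent

info_T = info(9, 5)                                             # 0.940 bits
splits = [(2, 3), (4, 0), (3, 2)]                               # subsets induced by x1
info_x1 = sum((p + n) / 14 * info(p, n) for p, n in splits)     # 0.694 bits
print(round(info_T, 3), round(info_x1, 3), round(info_T - info_x1, 3))  # gain 0.246
```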
K-Nearest Neighbor
Features
◦ All instances correspond to points in an n-dimensional Euclidean space
◦ Classification is delayed till a new instance arrives
◦ Classification is done by comparing feature vectors of the different points
◦ The target function may be discrete or real-valued
[Figure: 1-nearest-neighbor, 3-nearest-neighbor and k-nearest-neighbor decision regions R1-R4 over inputs X1 and X2; the decision boundaries are piecewise linear in the 2D input space.]
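A minimal k-NN sketch matching the description above: store all training points, delay classification until a query arrives, then take a majority vote among the k nearest points. The training data and risk labels are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority label among the k training points nearest to x_query (Euclidean)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.5], [3.2, 3.0]])
y_train = ["low-risk", "low-risk", "high-risk", "high-risk"]
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # -> low-risk
```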
Linear Classification…
There is a discriminant function δk(x) for each class k.
Classification rule: Rk = {x : k = arg maxj δj(x)}
In a higher-dimensional space the decision boundaries are piecewise hyperplanar.
Remember that the 0-1 loss function led to the classification rule Rk = {x : k = arg maxj P(G = j | X = x)},
so P(G = k | X = x) can serve as δk(x).
Linear Classification…
All we require here is that the class boundaries {x : δk(x) = δj(x)} be linear for every (k, j) pair.
One can achieve this if the δk(x) themselves are linear, or if some monotone transform of δk(x) is linear.
◦ An example:
P(G = 1 | X = x) = exp(β0 + βᵀx) / (1 + exp(β0 + βᵀx))
P(G = 2 | X = x) = 1 / (1 + exp(β0 + βᵀx))
so that log[ P(G = 1 | X = x) / P(G = 2 | X = x) ] = β0 + βᵀx, which is linear.
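A small sketch of this two-class example: β0 and β are arbitrary illustrative values, and the last line checks that the log-posterior ratio equals the linear score β0 + βᵀx.

```python
import numpy as np

beta0, beta = -1.0, np.array([2.0, -0.5])          # illustrative parameters

def posterior_class1(x):
    """P(G = 1 | X = x) = exp(beta0 + beta.x) / (1 + exp(beta0 + beta.x))."""
    z = beta0 + beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.array([1.0, 0.5])
p1 = posterior_class1(x)
print(p1, 1 - p1)                                  # P(G=1|x) and P(G=2|x)
print(np.log(p1 / (1 - p1)), beta0 + beta @ x)     # both equal the linear score
```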
Linear Classification as a Linear Regression
2D input space: X = (X1, X2).
Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3).
Regression output: Ŷ((x1, x2)) = (1 x1 x2)(XᵀX)⁻¹XᵀY, i.e. one fitted linear function per class:
Ŷ1((x1, x2)) = (1 x1 x2) β̂1
Ŷ2((x1, x2)) = (1 x1 x2) β̂2
Ŷ3((x1, x2)) = (1 x1 x2) β̂3
Classification rule: Ĝ((x1, x2)) = arg maxk Ŷk((x1, x2))
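A sketch of classification by linear regression on the indicator matrix, on a made-up three-class data set; the points, labels and expected predictions are illustrative assumptions.

```python
import numpy as np

# Rows of X are (1, x1, x2); three well-separated classes in 2D (illustrative).
X = np.array([[1, 0.0, 0.0], [1, 0.2, 0.1],      # class 0
              [1, 2.0, 2.0], [1, 2.1, 1.9],      # class 1
              [1, 4.0, 0.0], [1, 4.2, 0.2]])     # class 2
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]                            # N x K indicator matrix

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y, one column per class

def predict(x1, x2):
    y_hat = np.array([1.0, x1, x2]) @ B_hat      # Y_hat_k((x1, x2)), one score per class
    return int(np.argmax(y_hat))

print(predict(0.1, 0.0), predict(2.0, 2.0), predict(4.1, 0.1))   # expected: 0 1 2
```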
Linear Classification: Perceptron
The Masking Problem
Linear regression of the indicator matrix can lead to masking.
[Figure: the fitted functions Ŷ1 = (1 x1 x2) β̂1, Ŷ2 = (1 x1 x2) β̂2 and Ŷ3 = (1 x1 x2) β̂3 plotted along a viewing direction; the middle class is never the maximum and is therefore masked.]
Linear Discriminant Analysis (LDA)
Let fk(x) be the class-conditional density of X in class G = k, and πk the prior probability of class k. Application of Bayes' rule gives the posterior probability
Pr(G = k | X = x) = πk fk(x) / Σl=1..K πl fl(x)
Assuming Gaussian class densities with a common covariance matrix Σ,
log[ Pr(G = k | X = x) / Pr(G = l | X = x) ] = log(πk/πl) + log(fk(x)/fl(x))
  = (log πk + xᵀΣ⁻¹μk - ½ μkᵀΣ⁻¹μk) - (log πl + xᵀΣ⁻¹μl - ½ μlᵀΣ⁻¹μl)
  = δk(x) - δl(x)
Estimates from the training data:
Prior probabilities: π̂k = Nk / N
Means: μ̂k = Σ{i: gi=k} xi / Nk
Covariance matrix: Σ̂ = Σk=1..K Σ{i: gi=k} (xi - μ̂k)(xi - μ̂k)ᵀ / (N - K)
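A sketch of the LDA plug-in estimates and the discriminant δk(x) = xᵀΣ̂⁻¹μ̂k - ½ μ̂kᵀΣ̂⁻¹μ̂k + log π̂k on a toy two-class data set; the data points are illustrative.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
              [2.0, 2.0], [2.3, 1.8], [1.8, 2.2]])   # illustrative data
g = np.array([0, 0, 0, 1, 1, 1])                     # class labels
N, K = len(X), 2

pi_hat = np.array([(g == k).mean() for k in range(K)])          # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class means
Sigma_hat = sum((X[i] - mu_hat[g[i]])[:, None] @ (X[i] - mu_hat[g[i]])[None, :]
                for i in range(N)) / (N - K)                    # pooled covariance
Sigma_inv = np.linalg.inv(Sigma_hat)

def delta(x, k):
    """Linear discriminant score for class k."""
    return x @ Sigma_inv @ mu_hat[k] - 0.5 * mu_hat[k] @ Sigma_inv @ mu_hat[k] + np.log(pi_hat[k])

x = np.array([0.3, 0.3])
print(int(np.argmax([delta(x, k) for k in range(K)])))          # expected: class 0
```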
Fisher's Linear Discriminant [DHS]
From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.
Fisher's LD…
Geometric interpretation: each sample x is projected onto a direction w, giving the scalar y = wᵀx.
From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projected classes is small.
Fisher’s LD…
1 1
Class means: m1
N1
x ,
xi R1
i m2
N2
x
xi R2
i
~ 1 ~ 1
Projected class means: m1
N1
w
xi R1
T
xi wT m1 , m2
N2
w
xi R2
T
xi wT m2
Scatter of projected data (this will indicate overlap between the classes):
~
s12 ( yi m~1 ) 2
( w T
x i wT
m 1 ) 2
w T
x R ( x i m1 )( x i m 1 ) T
w wT S1w
yi : xi R1 xi R1 i 1
~ ~ )2
s22 ( yi m 2 ( w T
x i w T
m 2 ) 2
w T
x R ( x i m 2 )( x i m 2 ) T
w wT S 2 w
yi : xi R2 xi R2 i 2
46
Fisher’s LD…
~ m
(m ~ ) 2 wT S w
r ( w) ~ 2 ~12 T B
2
s1 s2 w Sww
Rayleigh quotient
where S w S1 S 2
S B (m2 m1 )(m2 m1 )T
w S w1 (m2 m1 )
47
Fisher's LD: Classifier
So far so good. However, how do we get the classifier?
Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification:
y(x) = wᵀx - ½(m̃1 + m̃2) = wᵀ(x - ½(m1 + m2)) = (m2 - m1)ᵀ SW⁻¹ (x - ½(m1 + m2))
Assign x to R2 if y(x) > 0 and to R1 otherwise.
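A sketch of Fisher's LD on a toy two-class set: compute S_W = S1 + S2, the direction w = S_W⁻¹(m2 - m1), and classify by the sign of y(x); the data points are illustrative.

```python
import numpy as np

X1 = np.array([[0.0, 0.0], [0.4, 0.3], [0.1, 0.5]])   # class R1 (illustrative)
X2 = np.array([[2.0, 2.0], [2.4, 1.7], [1.9, 2.3]])   # class R2 (illustrative)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                          # within-class scatter of R1
S2 = (X2 - m2).T @ (X2 - m2)                          # within-class scatter of R2
w = np.linalg.solve(S1 + S2, m2 - m1)                 # w = S_W^{-1} (m2 - m1)

def classify(x):
    y = w @ (x - 0.5 * (m1 + m2))                     # threshold at the midpoint
    return "R2" if y > 0 else "R1"

print(classify(np.array([0.2, 0.2])), classify(np.array([2.1, 2.0])))   # R1 R2
```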
Non-linear Discriminant
Analysis
The idea of nonlinear discriminant
analysis is to transform the input data
nonlinearly, before standard linear
discriminant analysis is applied.
NLDA
Multilayer Perceptrons (MLPs) are probably the most common type of neural network, and are used for classification and regression.
They usually consist of at least three layers of neurons.
Nonlinearity is introduced by nonlinear neuron activation functions, such as sigmoids.
Typical MLPs are trained by minimizing the mean square error (MSE) using first- or second-order methods and the backpropagation algorithm.
The adaptive behavior of MLPs is considered one of their greatest advantages over other machine learning techniques.
Multi-Layer Networks
[Figure: a multi-layer network with an input layer, a hidden layer and an output layer.]
Training-Rule for Weights to the Output Layer
Per-pattern error: Ep[w] = ½ Σj (tj^p - yj^p)²
Weight update: Δwji = α yj^p (1 - yj^p) (tj^p - yj^p) xi^p = α δj^p xi^p
with δj^p := yj^p (1 - yj^p) (tj^p - yj^p)
where yj^p is the output of unit j for pattern p, tj^p its target, and xi^p the activation feeding unit j through weight wji.
Training-Rule for Weights to the Hidden Layer
Credit assignment problem: there are no target values t for the hidden layer units. What is the error for a hidden unit?
The output-layer errors δj are propagated back through the weights wjk:
δk^p := xk^p (1 - xk^p) Σj wjk δj^p
Weight update: Δwki = α δk^p xi^p
[Figure: hidden unit xk receives input xi through weight wki and feeds output unit yj through weight wjk.]
Training-Rule for Weights to the Hidden Layer
The same per-pattern error Ep[w] = ½ Σj (tj^p - yj^p)² is minimized; with δk^p defined as above, the resulting update is
Δwki = α δk^p xi^p
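A sketch of one backpropagation step for a tiny 2-2-1 sigmoid network following the update rules above; the weights, input and target are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.5                                    # learning rate
W_hid = np.array([[0.1, -0.2], [0.3, 0.4]])    # w_ki: hidden unit k <- input i
W_out = np.array([[0.2, -0.1]])                # w_jk: output unit j <- hidden k
x = np.array([1.0, 0.5])                       # input pattern x_i^p
t = np.array([1.0])                            # target t_j^p

# Forward pass
h = sigmoid(W_hid @ x)                         # hidden activations x_k^p
y = sigmoid(W_out @ h)                         # output activations y_j^p

# Backward pass
delta_out = y * (1 - y) * (t - y)              # delta_j^p
delta_hid = h * (1 - h) * (W_out.T @ delta_out)   # delta_k^p

# Weight updates: Delta w = alpha * delta * (activation feeding that weight)
W_out += alpha * np.outer(delta_out, h)
W_hid += alpha * np.outer(delta_hid, x)
print(W_out, W_hid)
```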