
ECE595 / STAT598: Machine Learning I

Lecture 14 Logistic Regression

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

© Stanley Chan 2020. All Rights Reserved.


Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier
Discriminative approach: Directly define the classifier
Outline
Discriminative Approaches
Lecture 14 Logistic Regression 1
Lecture 15 Logistic Regression 2

This lecture: Logistic Regression 1


From Linear to Logistic
Motivation
Loss Function
Why not L2 Loss?
Interpreting Logistic
Maximum Likelihood
Log-odd
Convexity
Is logistic loss convex?
Computation
Geometry of Linear Regression

The discriminant function g(x) is linear.

The hypothesis function h(x) = sign(g(x)) is a unit step.

From Linear to Logistic Regression
Can we replace g(x) by sign(g(x))?
How about a soft version of sign(g(x))?
This gives logistic regression.

Sigmoid Function
The function

    h(x) = \frac{1}{1 + e^{-g(x)}} = \frac{1}{1 + e^{-(w^T x + w_0)}}

is called a sigmoid function.

Its 1D form is

    h(x) = \frac{1}{1 + e^{-a(x - x_0)}}, for some a and x_0,

where
a controls the transient speed
x_0 controls the cutoff location
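A minimal NumPy sketch of this 1D sigmoid (the function and parameter names a, x0 are just illustrative):

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    """1D sigmoid h(x) = 1 / (1 + exp(-a * (x - x0)))."""
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

# Larger a gives a sharper transition; x0 shifts the cutoff location.
x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x, a=2.0, x0=1.0))
```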

Sigmoid Function

Note that
h(x) → 1, as x → ∞,
h(x) → 0, as x → −∞.

So h(x) can be regarded as a “probability”.

Sigmoid Function

The derivative is

    \frac{d}{dx} \left[ \frac{1}{1 + e^{-a(x - x_0)}} \right]
      = -\big( 1 + e^{-a(x - x_0)} \big)^{-2} \, e^{-a(x - x_0)} (-a)
      = a \left( \frac{e^{-a(x - x_0)}}{1 + e^{-a(x - x_0)}} \right) \left( \frac{1}{1 + e^{-a(x - x_0)}} \right)
      = a \left( 1 - \frac{1}{1 + e^{-a(x - x_0)}} \right) \left( \frac{1}{1 + e^{-a(x - x_0)}} \right)
      = a \big[ 1 - h(x) \big] \big[ h(x) \big].

Since 0 < h(x) < 1, we have 0 < 1 - h(x) < 1.


Therefore, the derivative is always positive.
So h is an increasing function.
Hence h can be considered as a “CDF”.
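A quick finite-difference check of the identity h'(x) = a[1 − h(x)]h(x), reusing the illustrative sigmoid sketch above:

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

a, x0, x = 2.0, 1.0, 0.3
eps = 1e-6
numeric = (sigmoid(x + eps, a, x0) - sigmoid(x - eps, a, x0)) / (2 * eps)
analytic = a * (1 - sigmoid(x, a, x0)) * sigmoid(x, a, x0)
print(numeric, analytic)  # the two values agree up to O(eps^2)
```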
Sigmoid Function

[Figure: plot of a sigmoid function. Source: http://georgepavlides.info/wp-content/uploads/2018/02/logistic-binary-e1517639495140.jpg]
From Linear to Logistic Regression
Can we replace g(x) by sign(g(x))?
How about a soft version of sign(g(x))?
This gives logistic regression.

Loss Function for Linear Regression
All discriminant algorithms have a Training Loss Function
    J(\theta) = \frac{1}{N} \sum_{n=1}^{N} L(g(x_n), y_n).

In linear regression,

    J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \big( g(x_n) - y_n \big)^2
              = \frac{1}{N} \sum_{n=1}^{N} \big( w^T x_n + w_0 - y_n \big)^2
              = \frac{1}{N} \left\| \begin{bmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{bmatrix} \begin{bmatrix} w \\ w_0 \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \right\|^2
              = \frac{1}{N} \| A\theta - y \|^2.
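A sketch of the matrix form: stack each [x_n^T, 1] as a row of A so that J(θ) = (1/N)‖Aθ − y‖², then solve the least-squares problem with a standard solver. The data here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                      # rows are x_n^T
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=N)

A = np.hstack([X, np.ones((N, 1))])              # rows are [x_n^T, 1]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)    # theta = [w; w0]
J = np.mean((A @ theta - y) ** 2)                # (1/N) * ||A theta - y||^2
print(theta, J)
```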
Training Loss for Logistic Regression
    J(\theta) = \sum_{n=1}^{N} L(h_\theta(x_n), y_n)
              = \sum_{n=1}^{N} -\Big\{ y_n \log h_\theta(x_n) + (1 - y_n) \log(1 - h_\theta(x_n)) \Big\}

This loss is also called the cross-entropy loss.

Why do we want to choose this cost function?
Consider two cases:

    y_n \log h_\theta(x_n) =
    \begin{cases} 0, & \text{if } y_n = 1 \text{ and } h_\theta(x_n) = 1, \\ -\infty, & \text{if } y_n = 1 \text{ and } h_\theta(x_n) = 0, \end{cases}

    (1 - y_n) \log(1 - h_\theta(x_n)) =
    \begin{cases} 0, & \text{if } y_n = 0 \text{ and } h_\theta(x_n) = 0, \\ -\infty, & \text{if } y_n = 0 \text{ and } h_\theta(x_n) = 1. \end{cases}
So if there is a mismatch between y_n and h_\theta(x_n), the loss blows up: a confidently wrong prediction is penalized infinitely.
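A minimal sketch of this cross-entropy loss for given predictions h_θ(x_n); the small clipping constant is an implementation detail (not from the slides) that avoids log(0):

```python
import numpy as np

def cross_entropy(h, y, eps=1e-12):
    """Sum over n of -[ y_n log h_n + (1 - y_n) log(1 - h_n) ]."""
    h = np.clip(h, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0, 1.0, 0.0])
h = np.array([0.9, 0.2, 0.6, 0.4])
print(cross_entropy(h, y))  # small loss for mostly-correct predictions
```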
Why Not L2 Loss?

Why not use L2 loss?


    J(\theta) = \sum_{n=1}^{N} \big( h_\theta(x_n) - y_n \big)^2

Let's look at the 1D case:

    J(\theta) = \left( \frac{1}{1 + e^{-\theta x}} - y \right)^2.

This is NOT convex!

How about the logistic loss?

    J(\theta) = y \log\left( \frac{1}{1 + e^{-\theta x}} \right) + (1 - y) \log\left( 1 - \frac{1}{1 + e^{-\theta x}} \right)

This is concave, so its negative (the actual training loss) is convex!
Why Not L2 Loss?

Experiment: Set x = 1 and y = 1.


Plot J(θ) as a function of θ.
[Figure: J(θ) versus θ over θ ∈ [−5, 5]; left panel: L2 loss, right panel: logistic loss.]
So the L2 loss is not convex, but the logistic loss is concave (its negative is convex).
If you do gradient descent on the L2 loss, you can get trapped at a local minimum.
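A sketch that reproduces this experiment: with x = 1 and y = 1, plot the L2 loss and the (concave) logistic objective as functions of θ.

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = 1.0, 1.0
theta = np.linspace(-5, 5, 400)
h = 1.0 / (1.0 + np.exp(-theta * x))

J_l2 = (h - y) ** 2        # L2 loss: not convex in theta
J_logistic = np.log(h)     # with y = 1 the (1 - y) log(1 - h) term vanishes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(theta, J_l2)
ax1.set_title("L2")
ax2.plot(theta, J_logistic)
ax2.set_title("Logistic")
plt.show()
```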

Outline
Discriminative Approaches
Lecture 14 Logistic Regression 1
Lecture 15 Logistic Regression 2

This lecture: Logistic Regression 1


From Linear to Logistic
Motivation
Loss Function
Why not L2 Loss?
Interpreting Logistic
Maximum Likelihood
Log-odd
Convexity
Is logistic loss convex?
Computation
The Maximum-Likelihood Perspective

We can show that

    \arg\min_\theta J(\theta)
      = \arg\min_\theta \sum_{n=1}^{N} -\Big\{ y_n \log h_\theta(x_n) + (1 - y_n) \log(1 - h_\theta(x_n)) \Big\}
      = \arg\min_\theta \left( -\log \prod_{n=1}^{N} h_\theta(x_n)^{y_n} \big(1 - h_\theta(x_n)\big)^{1 - y_n} \right)
      = \arg\max_\theta \prod_{n=1}^{N} h_\theta(x_n)^{y_n} \big(1 - h_\theta(x_n)\big)^{1 - y_n}.

This is maximum likelihood for a Bernoulli random variable y_n.
The underlying probability is h_\theta(x_n).
Interpreting h(x n )
Maximum-likelihood Bernoulli:

    \hat{\theta} = \arg\max_\theta \prod_{n=1}^{N} h_\theta(x_n)^{y_n} \big(1 - h_\theta(x_n)\big)^{1 - y_n}.

We can interpret h_\theta(x_n) as a probability p. So:

    h_\theta(x_n) = p, and 1 - h_\theta(x_n) = 1 - p.

But p is a function of x_n. So how about

    h_\theta(x_n) = p(x_n), and 1 - h_\theta(x_n) = 1 - p(x_n).

And this probability is “after” you see x_n. So how about

    h_\theta(x_n) = p(1 | x_n), and 1 - h_\theta(x_n) = 1 - p(1 | x_n) = p(0 | x_n).

So h_\theta(x_n) is the posterior probability of class 1 after observing x_n.


Log-Odds
Let us rewrite J as

    J(\theta) = \sum_{n=1}^{N} -\Big\{ y_n \log h_\theta(x_n) + (1 - y_n) \log(1 - h_\theta(x_n)) \Big\}
              = \sum_{n=1}^{N} -\Big\{ y_n \log\Big( \frac{h_\theta(x_n)}{1 - h_\theta(x_n)} \Big) + \log(1 - h_\theta(x_n)) \Big\}

In statistics, the term log(h_θ(x_n)/(1 − h_θ(x_n))) is called the log-odds.
If we put h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, we can show that

    \log\left( \frac{h_\theta(x)}{1 - h_\theta(x)} \right)
      = \log\left( \frac{ \frac{1}{1 + e^{-\theta^T x}} }{ \frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}} } \right)
      = \log e^{\theta^T x} = \theta^T x.

Logistic regression is linear in the log-odds.
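A quick numerical check (with arbitrary illustrative values) that the log-odds of h_θ(x) = 1/(1 + e^{−θᵀx}) equals θᵀx:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)
x = rng.normal(size=4)

h = 1.0 / (1.0 + np.exp(-theta @ x))
log_odds = np.log(h / (1 - h))
print(log_odds, theta @ x)  # identical up to floating-point error
```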


Outline
Discriminative Approaches
Lecture 14 Logistic Regression 1
Lecture 15 Logistic Regression 2

This lecture: Logistic Regression 1


From Linear to Logistic
Motivation
Loss Function
Why not L2 Loss?
Interpreting Logistic
Maximum Likelihood
Log-odd
Convexity
Is logistic loss convex?
Computation
Convexity of Logistic Training Loss
Recall that

    J(\theta) = \sum_{n=1}^{N} -\Big\{ y_n \log\Big( \frac{h_\theta(x_n)}{1 - h_\theta(x_n)} \Big) + \log(1 - h_\theta(x_n)) \Big\}

The first term is -y_n \theta^T x_n, which is linear in \theta, hence convex.
The second term: its gradient is

    \nabla_\theta \big[ -\log(1 - h_\theta(x)) \big]
      = -\nabla_\theta \log\left( 1 - \frac{1}{1 + e^{-\theta^T x}} \right)
      = -\nabla_\theta \log\left( \frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}} \right)
      = -\nabla_\theta \Big[ \log e^{-\theta^T x} - \log\big(1 + e^{-\theta^T x}\big) \Big]
      = -\nabla_\theta \Big[ -\theta^T x - \log\big(1 + e^{-\theta^T x}\big) \Big]
      = x + \nabla_\theta \log\big( 1 + e^{-\theta^T x} \big)
      = x + \left( \frac{-e^{-\theta^T x}}{1 + e^{-\theta^T x}} \right) x
      = h_\theta(x)\, x.

Convexity of Logistic Training Loss
The gradient of the second term is

    \nabla_\theta \big[ -\log(1 - h_\theta(x)) \big] = h_\theta(x)\, x.

The Hessian is

    \nabla_\theta^2 \big[ -\log(1 - h_\theta(x)) \big]
      = \nabla_\theta \big[ h_\theta(x)\, x \big]
      = \nabla_\theta \left[ \left( \frac{1}{1 + e^{-\theta^T x}} \right) x \right]
      = \left( \frac{e^{-\theta^T x}}{(1 + e^{-\theta^T x})^2} \right) x x^T
      = \left( \frac{1}{1 + e^{-\theta^T x}} \right) \left( 1 - \frac{1}{1 + e^{-\theta^T x}} \right) x x^T
      = h_\theta(x)\big[ 1 - h_\theta(x) \big]\, x x^T.
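A finite-difference sanity check (illustrative values) of both formulas: the gradient h_θ(x)x and the Hessian h_θ(x)[1 − h_θ(x)]xxᵀ of −log(1 − h_θ(x)).

```python
import numpy as np

def h(theta, x):
    return 1.0 / (1.0 + np.exp(-theta @ x))

def f(theta, x):
    """f(theta) = -log(1 - h_theta(x)), the second term of the loss."""
    return -np.log(1.0 - h(theta, x))

rng = np.random.default_rng(0)
theta, x = rng.normal(size=3), rng.normal(size=3)

grad = h(theta, x) * x                                    # analytic gradient
hess = h(theta, x) * (1 - h(theta, x)) * np.outer(x, x)   # analytic Hessian

eps = 1e-6
num_grad = np.array([(f(theta + eps * e, x) - f(theta - eps * e, x)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(grad, num_grad, atol=1e-5))             # gradient matches
print(np.all(np.linalg.eigvalsh(hess) >= -1e-12))         # Hessian is PSD
```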

Convexity of Logistic Training Loss

For any v ∈ R^d, we have that

    v^T \nabla_\theta^2 \big[ -\log(1 - h_\theta(x)) \big] v
      = v^T \Big( h_\theta(x)\big[1 - h_\theta(x)\big]\, x x^T \Big) v
      = h_\theta(x)\big[1 - h_\theta(x)\big]\, (v^T x)^2 \ge 0.

Therefore the Hessian is positive semi-definite.
So -\log(1 - h_\theta(x)) is convex in \theta.

Conclusion: The training loss function

    J(\theta) = \sum_{n=1}^{N} -\Big\{ y_n \log\Big( \frac{h_\theta(x_n)}{1 - h_\theta(x_n)} \Big) + \log(1 - h_\theta(x_n)) \Big\}

is convex in \theta.
So we can use convex optimization algorithms to find θ.
Convex Optimization for Logistic Regression
We can use CVX to solve the logistic regression problem
But it requires some re-organization of the equations
    J(\theta) = \sum_{n=1}^{N} -\Big\{ y_n \theta^T x_n + \log(1 - h_\theta(x_n)) \Big\}
              = \sum_{n=1}^{N} -\left\{ y_n \theta^T x_n + \log\left( 1 - \frac{e^{\theta^T x_n}}{1 + e^{\theta^T x_n}} \right) \right\}
              = \sum_{n=1}^{N} -\Big\{ y_n \theta^T x_n - \log\big( 1 + e^{\theta^T x_n} \big) \Big\}
              = -\left\{ \Big( \sum_{n=1}^{N} y_n x_n \Big)^T \theta - \sum_{n=1}^{N} \log\big( 1 + e^{\theta^T x_n} \big) \right\}.

The last term is a sum of log-sum-exp terms: \log\big( e^0 + e^{\theta^T x_n} \big).
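The lecture uses CVX; below is a minimal sketch of the same formulation in CVXPY (a Python analogue), on synthetic 1D data that is purely illustrative. cp.logistic(z) is exactly the log-sum-exp term log(e^0 + e^z).

```python
import numpy as np
import cvxpy as cp

# Synthetic 1D two-class data with an intercept column (illustrative only).
rng = np.random.default_rng(0)
N = 50
x1 = np.concatenate([rng.normal(2, 1, N // 2), rng.normal(6, 1, N // 2)])
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])
X = np.column_stack([x1, np.ones(N)])   # rows are [x_n, 1]

theta = cp.Variable(2)
# J(theta) = -(sum_n y_n x_n)^T theta + sum_n log(1 + exp(theta^T x_n))
J = -(y @ X) @ theta + cp.sum(cp.logistic(X @ theta))
cp.Problem(cp.Minimize(J)).solve()
print(theta.value)
```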
Convex Optimization for Logistic Regression

[Figure: 1D logistic regression fit; the estimated curve is plotted against the data and the true curve (legend: Data, Estimated, True).]
Reading List
Logistic Regression (Machine Learning Perspective)
Chris Bishop’s Pattern Recognition, Chapter 4.3
Hastie-Tibshirani-Friedman’s Elements of Statistical Learning,
Chapter 4.4
Stanford CS 229 Discriminant Algorithms
http://cs229.stanford.edu/notes/cs229-notes1.pdf
CMU Lecture https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
Stanford Language Processing
https://web.stanford.edu/~jurafsky/slp3/ (Lecture 5)
Logistic Regression (Statistics Perspective)
Duke Lecture https://www2.stat.duke.edu/courses/Spring13/sta102.001/Lec/Lec20.pdf
Princeton Lecture https://data.princeton.edu/wws509/notes/c3.pdf
