Lecture 1: Gradients and More
• TA: TBD
• Office hours and location: TBD
Prerequisites
Grading
• 5-6 problem sets (70%)
• See collaboration policy on the web
• Attendance policy?
(subject to change)
What this course is about...
• Problem solving using techniques from continuous optimization
• The growth of AI, ML, and Data Science means that you are likely
to encounter these problems in the real world
• Understand how continuous optimization techniques work and
how to interpret their outputs
• Emphasis is on understanding how to formulate problems and
apply existing tools to (approximately) solve them
• We’ll see a bit of the theory behind these methods too, but
the emphasis will be on applying them
Course Topics
Part 1: Numerical Optimization
• Unconstrained optimization (partial derivatives, gradients, local versus global
optimality)
• Convex functions and gradient descent
• Linear programming
• Constrained convex optimization problems
• Method of Lagrange multipliers
• Projected gradient methods
• Frank-Wolfe/Method of conditional gradients
Univariate Calculus Review
Simple Derivatives
• To find the places where the derivative is equal to zero, we can
employ the standard derivative rules from calculus:
• e.g., the power rule: $\frac{d}{dx}\, x^a = a x^{a-1}$ for any real $a$ (a short symbolic sketch follows below)
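To make this concrete, here is a minimal SymPy sketch of finding the points where the derivative equals zero; the polynomial $f(x) = x^2 - 4x + 1$ is a made-up example, not one from the lecture:

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = x**2 - 4*x + 1                     # hypothetical example function

df = sp.diff(f, x)                     # f'(x) = 2*x - 4
print(sp.solve(sp.Eq(df, 0), x))       # critical points: [2]
print(sp.diff(f, x, 2))                # f''(x) = 2 > 0, so x = 2 is a minimum
```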
Min vs. Inf
• The global minimum may also not exist!
• For example, consider the function
• This function has no finite minimum
• In this case, it doesn't make sense to write $\min_{x} f(x)$
• The infimum of a function $f$ over a set $S$ is defined to be the largest
value $c$ such that $f(x) \geq c$ for all $x \in S$, or $-\infty$ if no such number exists
• This allows us to write $\inf_{x} f(x) = -\infty$
• Similarly,
Min vs. Inf
• Infimums are always well-defined even if the function is
unbounded from below
• The analogous concept for maximization is called the
supremum, i.e., the smallest upper bound
• For example,
• Note that $\sup_{x} f(x) = -\inf_{x} \bigl(-f(x)\bigr)$ (a worked example follows below)
• As a result, we can always talk about minimizing functions
even if we want to maximize one
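To make the max-to-min reduction concrete, here is a small worked example; the function $f(x) = -(x-1)^2$ is a made-up stand-in, since the slide's own example was not recoverable:

$$\sup_{x \in \mathbb{R}} \bigl(-(x-1)^2\bigr) \;=\; 0 \;=\; -\inf_{x \in \mathbb{R}} (x-1)^2 ,$$

so maximizing $f$ over $\mathbb{R}$ is equivalent to minimizing $-f$ and negating the resulting optimal value.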
Optima Under Simple Constraints
Multivariable Calculus Review
Vectors and Matrices
• A vector $x \in \mathbb{R}^n$ is a tuple of $n$ real numbers, where $x_i$ for $i \in \{1, \dots, n\}$ denotes the $i^{\text{th}}$ element of the tuple
• Similarly, a matrix $A \in \mathbb{R}^{m \times n}$ is indexed by a pair $(i, j)$, and $A_{ij}$ denotes the entry in row $i$ and column $j$
• All vectors will be treated as column vectors, i.e., they can be
thought of as matrices with $n$ rows and 1 column
• If $x \in \mathbb{R}^n$, $x^\top$ denotes the transpose of $x$, which is a row vector
• If $x, y \in \mathbb{R}^n$, then $x^\top y = \sum_{i=1}^n x_i y_i$ is the dot product between $x$ and $y$ (see the NumPy sketch below)
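A minimal NumPy sketch of these conventions; the particular vectors are made up for illustration:

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])    # a column vector: 3 rows, 1 column
y = np.array([[4.0], [5.0], [6.0]])

print(x.T)                             # the transpose x^T is a row vector (1 x 3)
print((x.T @ y).item())                # dot product x^T y = 1*4 + 2*5 + 3*6 = 32.0
```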
Partial Derivatives
• The partial derivative of a multivariate function $f(x_1, \dots, x_n)$ with respect to
$x_i$, denoted $\frac{\partial f}{\partial x_i}$, is the derivative of $f$ with respect to $x_i$ assuming that all of the
remaining variables are treated as constants (a short symbolic sketch follows below)
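As a quick illustration, a SymPy sketch of partial differentiation; the function $f(x, y) = x^2 y + \sin y$ is a made-up example:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**2 * y + sp.sin(y)               # hypothetical example function

print(sp.diff(f, x))                   # df/dx = 2*x*y          (y held constant)
print(sp.diff(f, y))                   # df/dy = x**2 + cos(y)  (x held constant)
```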
Multivariate Chain Rule
• Let $f(x_1, \dots, x_n)$ and $x_1(t), \dots, x_n(t)$ be differentiable functions; then
$\frac{d}{dt} f\bigl(x_1(t), \dots, x_n(t)\bigr) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \cdot \frac{d x_i(t)}{dt}$ (a SymPy check follows below)
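A small SymPy check of the rule on a made-up composition; the functions $f$, $x_1(t)$, $x_2(t)$ below are illustrative assumptions, not the lecture's:

```python
import sympy as sp

t = sp.symbols('t', real=True)
x1, x2 = sp.symbols('x1 x2', real=True)

f = x1**2 * x2                 # hypothetical outer function
g1, g2 = sp.cos(t), t**3       # hypothetical inner functions x1(t), x2(t)

# Left-hand side: differentiate the composition directly
lhs = sp.diff(f.subs({x1: g1, x2: g2}), t)

# Right-hand side: sum of partials times derivatives of the inner functions
rhs = (sp.diff(f, x1).subs({x1: g1, x2: g2}) * sp.diff(g1, t)
       + sp.diff(f, x2).subs({x1: g1, x2: g2}) * sp.diff(g2, t))

print(sp.simplify(lhs - rhs))  # 0, as the chain rule predicts
```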
Gradients
• The gradient of $f$ at $x$ is the vector of partial derivatives, denoted $\nabla f(x)$,
whose $i^{\text{th}}$ entry is $\frac{\partial f}{\partial x_i}(x)$ (a numerical sketch follows below)
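A minimal numerical sketch: the analytic gradient of a made-up function compared against central finite differences:

```python
import numpy as np

def f(x):
    # hypothetical example: f(x) = x1^2 + 3*x1*x2
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    # analytic gradient: (2*x1 + 3*x2, 3*x1)
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numerical_grad(f, x, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x0 = np.array([1.0, -2.0])
print(grad_f(x0))               # [-4.  3.]
print(numerical_grad(f, x0))    # approximately the same
```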
Directional Derivatives
• Unlike functions of a single variable, there is not a single
derivative that describes the rate of change at the point
• Consider a function $f(x_1, x_2)$
• Let $g(x_1) = f(x_1, c)$ for some fixed value $c$ of $x_2$
• Then $g'(x_1) = \frac{\partial f}{\partial x_1}(x_1, c)$, which is the slope of $g$ at the point $x_1$
• This means that $\frac{\partial f}{\partial x_1}$ is the slope of the function when
considered only as a function of $x_1$
• Equivalently, this is the slope of $f$ in the direction $v$ where $v_1 = 1$
and all other entries of $v$ are zero
Directional Derivatives
• The directional derivative of a differentiable function $f$ at $x$ along
the vector $v$ is given by $g'(0)$, where $g(t) = f(x + t v)$
• Equivalently, by the multivariate chain rule, the directional derivative is $\nabla f(x)^\top v$
• Example: see the numerical sketch below
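Since the slide's example was not recoverable, here is a hedged numerical sketch with a made-up $f$, point $x_0$, and unit direction $v$, computing the directional derivative both ways:

```python
import numpy as np

def f(x):
    # hypothetical example function
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
v = np.array([1.0, 1.0]) / np.sqrt(2)   # a unit-length direction

# Via the chain rule: directional derivative = grad f(x0) . v
print(grad_f(x0) @ v)

# Via the definition: derivative of g(t) = f(x0 + t*v) at t = 0,
# approximated with a central finite difference
h = 1e-6
print((f(x0 + h * v) - f(x0 - h * v)) / (2 * h))
```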
Directional Derivatives
• For differentiable functions, the gradient gives the direction of
steepest ascent in $f$
• It maximizes the directional derivative, $\nabla f(x)^\top u$, over all possible unit-length
directions $u$
• Recall that $a^\top b = \|a\| \, \|b\| \cos\theta$, where $\theta$ is the angle between the two
vectors (in the 2d-plane that contains them both)
• $u$ is unit length if $\|u\| = 1$
• So, $\nabla f(x)^\top u = \|\nabla f(x)\| \cos\theta \leq \|\nabla f(x)\|$
• Choosing $u = \nabla f(x) / \|\nabla f(x)\|$ attains this upper bound (a numerical check follows below)
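A quick numerical sanity check of this claim, using the same made-up $f$ as above: no random unit direction achieves a larger directional derivative than the normalized gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # gradient of the hypothetical f(x) = x1^2 + 3*x1*x2
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
g = grad_f(x0)

# Directional derivative along the normalized gradient equals ||grad f(x0)||
best = g @ (g / np.linalg.norm(g))

# Compare against many random unit-length directions
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(best)                 # sqrt(8^2 + 3^2) ~ 8.544
print((dirs @ g).max())     # no random direction beats it
```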
Local Optima
Critical points can be local minima, local maxima, or saddle points
(figure source: Wikipedia)
Multivariable Calculus Review
Hessian Matrix
• The Hessian of $f$ at $x$ is the matrix of second partial derivatives, denoted $\nabla^2 f(x)$,
whose $(i, j)$ entry is $\frac{\partial^2 f}{\partial x_i \partial x_j}(x)$ (a short symbolic sketch follows below)
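A short SymPy sketch; the function $f(x_1, x_2) = x_1^2 x_2 + x_2^3$ is a made-up example:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**2 * x2 + x2**3          # hypothetical example function

H = sp.hessian(f, [x1, x2])     # matrix of second partial derivatives
print(H)
# Matrix([[2*x2, 2*x1], [2*x1, 6*x2]])
```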
Constrained Optimization
• How do we minimize a multivariate function with respect to
simple constraints, e.g., minimize $f(x, y)$ subject to $a \leq x \leq b$ and $c \leq y \leq d$?
• This is much more challenging than the univariate case!
• The optima can occur at a critical point, where $\nabla f(x, y) = 0$, or anywhere on the
boundary of the rectangle determined by the constraints! (a numerical sketch follows below)
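To illustrate, here is a small sketch on a made-up instance: the unconstrained critical point lies outside the box, so the constrained minimum lands on the boundary (found here by brute-force grid search):

```python
import numpy as np

# Hypothetical example (not from the slides): minimize
#   f(x, y) = (x - 3)^2 + (y + 1)^2
# subject to 0 <= x <= 1 and 0 <= y <= 1.
def f(x, y):
    return (x - 3)**2 + (y + 1)**2

# The only critical point is where the gradient vanishes:
#   2(x - 3) = 0 and 2(y + 1) = 0  =>  (3, -1),
# which lies outside the feasible rectangle, so the constrained
# minimum must occur on the boundary.

# Brute-force check on a fine grid over the rectangle:
xs = np.linspace(0.0, 1.0, 201)
ys = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(xs, ys)
vals = f(X, Y)
i, j = np.unravel_index(np.argmin(vals), vals.shape)
print("approx. minimizer:", (X[i, j], Y[i, j]))   # ~ (1.0, 0.0), a corner
print("approx. minimum  :", vals[i, j])           # ~ 5.0
```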