05-1 Supervised Learning


Supervised Learning

IE:4172 Big Data Analytics


Stephen Baek
Assignment
1. If A and B are matrices, what conditions must they satisfy in order to compute AB (matrix multiplication)?
2. What kinds of operations are required, and how many times are they called, to compute AB?
3. What is Gauss-Jordan elimination? What kinds of operations are required, and how many times are they called, to perform Gauss-Jordan elimination on a d-by-d matrix?
Assignment
4. Given a linear model Y = XA, the solution is known to be A = ((X^T X)^{-1} X^T) Y. How many operations do you need to perform in order to find A?
5. The solution to 4 can also be computed as (X^T X)^{-1} (X^T Y). Note the parentheses. How many operations do you need to perform to find A this time?
6. What is the complexity of solving a linear system?
7. Can the complexity be improved? How? Find at least 3 ways.
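For questions 1-2, one way to sanity-check your operation counts is to instrument a naive triple-loop multiplication. A minimal sketch (the function and data below are ours, for illustration only, not part of the assignment):

```python
import numpy as np

def matmul_count(A, B):
    """Naive matrix product that counts scalar operations.

    AB is defined only when A is n-by-d and B is d-by-m (inner
    dimensions must match). Each of the n*m output entries takes
    d multiplications and d-1 "real" additions.
    """
    n, d = A.shape
    d2, m = B.shape
    assert d == d2, "inner dimensions must match"
    C = np.zeros((n, m))
    mults = adds = 0
    for i in range(n):
        for j in range(m):
            for k in range(d):
                C[i, j] += A[i, k] * B[k, j]
                mults += 1
                adds += 1  # counts d adds per entry (incl. the add into the initial 0)
    return C, mults, adds

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
C, mults, adds = matmul_count(A, B)
print(mults)  # n * m * d = 4 * 5 * 3 = 60 multiplications
```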
What is Machine Learning?
● Supervised Learning
○ Classification
○ Regression
● Unsupervised Learning
● Self-supervised Learning (not in this course)
● Reinforcement Learning (not in this course)
Learning a Class from Examples
● What follows is some theory of classification into two classes.
● First, we assume there is no noise (the results can be generalized to the noisy case, though).
● What you should learn:
○ Learning can be seen as pruning out possible hypotheses (models).
○ Learning is generalization (we want to predict the classes of new examples).
○ Learning is impossible if the hypothesis (model) space is too large (in other words, we need some prior information; we need to select a model family).
○ The more complex the model family (hypothesis space), the more training data is needed.
Independent and Identically Distributed (iid) Data
● We assume the training set contains data points drawn independently from an identical distribution.
○ The ordering of the data points does not matter.
● This assumption often holds, or loosely holds, in real-world problems.
● Notable exception: time series.
○ e.g., today's temperature is not independent of yesterday's temperature. In fact, there is a strong correlation.
The Family Car Example
Question: Given car properties, is the car a family car?

Car properties: x = (x_1, x_2) = (price, engine power)

Hypothesis: h(x) = 1 if x is a family car, 0 otherwise


Family Car: Training Set
Family Car: True Class
Family Car: Hypothesis class H
Consistent Hypothesis
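The figures for the family car slides are not reproduced in this text export. As a concrete sketch, assume the classic form of this example: the hypothesis class H is the set of axis-aligned rectangles, h(x) = 1 iff (p_1 ≤ price ≤ p_2) AND (e_1 ≤ engine power ≤ e_2), and the most specific consistent hypothesis S is the tightest rectangle containing all positive examples. The data values below are made up for illustration:

```python
import numpy as np

# Toy training set: rows are (price, engine_power); labels are
# 1 = family car, 0 = not. All values are made up for illustration.
X = np.array([[27, 180], [15, 120], [22, 150], [33, 210],
              [18, 130], [45, 300], [10,  90], [24, 160]], dtype=float)
r = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# Most specific consistent hypothesis S: the tightest axis-aligned
# rectangle that still contains every positive example.
pos = X[r == 1]
p1, e1 = pos.min(axis=0)   # lower-left corner
p2, e2 = pos.max(axis=0)   # upper-right corner

def h(x):
    """Predict 1 iff x falls inside the rectangle S."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

print([h(x) for x in X])   # matches r, so S is consistent on this set
```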
What did we learn from the Family Cars example?
● We must choose some hypothesis to be able to predict anything (unless we observe all possible data values).
● This causes inductive bias (the choice of hypothesis space affects your results).
● All consistent hypotheses can be found between the most general and the most specific hypothesis.
● In practical applications, there may be no consistent hypothesis, either because the hypothesis space is too simple (underfitting) or because the data distribution is noisy.
Noise and Model Complexity
● Noise is an unwanted anomaly in the data.
● Because of noise, we may never reach zero error.
● Why use a simpler model:
○ Simpler to use
○ Easier to train
○ Easier to explain
○ Generalizes better
Noise and Model Complexity
● Noise may be caused by:
○ Errors in measurements of input attributes or class labels.
○ Unknown or ignored (hidden or latent) attributes. (Example: a plane ticket's price may increase when there is an event in the region.)
○ The model being wrong or inaccurate (figure).
Regression
● Classification is the prediction of a class label, given attributes.
● Regression is the prediction of a real number, given attributes (usually with noise).
● The training set is given by X = \{(x^t, r^t)\}_{t=1}^{N}, where r^t \in \mathbb{R}.
● Each hypothesis is a function g: \mathbb{R}^d \to \mathbb{R}. We would like g(x^t) \approx r^t for all items in the training set.
● Usually, we want to minimize a quadratic error function:

E(g \mid X) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - g(x^t) \right]^2
Regression
● The simplest case is the linear regressor: g(x) = w_1 x + w_0.
● Optimization task: find w_0 and w_1 such that the error

E(w_1, w_0 \mid X) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - (w_1 x^t + w_0) \right]^2

is minimized.
● Analytic solution:

w_1 = \frac{\sum_t x^t r^t - N \bar{x} \bar{r}}{\sum_t (x^t)^2 - N \bar{x}^2}, \qquad w_0 = \bar{r} - w_1 \bar{x}

where \bar{x} = \frac{1}{N} \sum_t x^t and \bar{r} = \frac{1}{N} \sum_t r^t.
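A minimal sketch of this analytic solution on synthetic data (the data, and the "true" coefficients 2.5 and 1.0, are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(0, 10, size=N)
r = 2.5 * x + 1.0 + rng.normal(0, 1, size=N)   # noisy line

x_bar, r_bar = x.mean(), r.mean()
w1 = (np.sum(x * r) - N * x_bar * r_bar) / (np.sum(x**2) - N * x_bar**2)
w0 = r_bar - w1 * x_bar
print(w1, w0)                    # close to the true 2.5 and 1.0

# Sanity check against NumPy's own least squares fit:
print(np.polyfit(x, r, deg=1))   # [w1, w0]
```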
Linear Regression
● Linear model: g(x \mid w, w_0) = w^T x + w_0 = w_1 x_1 + \dots + w_d x_d + w_0.
● Model parameters: the weight vector w = (w_1, \dots, w_d)^T and the intercept w_0.
● In matrix form, Y = Xw + w_0, where the scalar w_0 is broadcast across all N samples.
● A “trick”: append a constant 1 to each input, x = (1, x_1, \dots, x_d)^T, so that w_0 is absorbed into w and the model becomes simply Y = Xw.
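A sketch of the "trick" in NumPy, assuming X stores one sample per row; prepending a column of ones lets the intercept ride along inside the weight vector:

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0]])       # N = 3 samples, d = 2 features
w = np.array([0.5, -1.0])
w0 = 2.0

# Without the trick: the scalar w0 is broadcast across all N samples.
y1 = X @ w + w0

# With the trick: augment X with a constant-1 column and absorb w0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # rows are (1, x1, x2)
w_aug = np.concatenate([[w0], w])                  # (w0, w1, w2)
y2 = X_aug @ w_aug

print(np.allclose(y1, y2))   # True: the two forms are identical
```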
Linear Regression
● Least squares: find w minimizing E(w) = \|Y - Xw\|^2.
● Setting the gradient to zero gives the normal equations: X^T X w = X^T Y.
● Assume X^T X is invertible. Then

w = (X^T X)^{-1} X^T Y = X^+ Y

where X^+ = (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse of X.
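A minimal sketch comparing the textbook formula with NumPy's pseudoinverse on made-up data. Note that np.linalg.pinv computes X^+ via the SVD, which is numerically more stable than forming (X^T X)^{-1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])  # bias trick
w_true = np.array([1.0, 2.0, -0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

# Textbook formula: w = (X^T X)^{-1} X^T Y
w_normal = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Moore-Penrose pseudoinverse, computed internally via the SVD:
w_pinv = np.linalg.pinv(X) @ Y

print(np.allclose(w_normal, w_pinv))   # True (up to round-off)
print(w_pinv)                          # close to w_true
```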
Linear Regression: Toy data
Linear Basis Functions
Least Squares Solution to Regression
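The formulas on these slides are not reproduced in this export. As a sketch, assuming polynomial basis functions φ_j(x) = x^j, the least squares machinery is unchanged: replace X with the design matrix Φ whose columns are the basis functions evaluated at the inputs:

```python
import numpy as np

def poly_design_matrix(x, k):
    """Design matrix with columns 1, x, x^2, ..., x^k."""
    return np.vander(x, k + 1, increasing=True)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
r = np.sin(3 * x) + 0.1 * rng.normal(size=30)   # made-up target

Phi = poly_design_matrix(x, k=3)   # N-by-(k+1)
w = np.linalg.pinv(Phi) @ r        # same least squares solution as before
print(w)                           # coefficients of the cubic fit
```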
Polynomial Regressors
● E_train is the error on the training data. It decreases as the model complexity k increases.
● E_test is the error on the remaining 93 data points, the "test set". It has a minimum at k = 3.
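A sketch that reproduces this qualitative picture; the course's actual dataset is not available here, so made-up data stand in, with a small training set and the remaining 93 points held out as the test set to echo the slide's numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
r = np.sin(3 * x) + 0.15 * rng.normal(size=100)

x_tr, r_tr = x[:7], r[:7]     # small training set
x_te, r_te = x[7:], r[7:]     # the remaining 93 points: the test set

for k in range(1, 7):
    w = np.polyfit(x_tr, r_tr, deg=k)
    E_train = np.mean((np.polyval(w, x_tr) - r_tr) ** 2)
    E_test = np.mean((np.polyval(w, x_te) - r_te) ** 2)
    print(f"k={k}: E_train={E_train:.4f}  E_test={E_test:.4f}")

# E_train keeps shrinking as k grows; E_test bottoms out at a moderate
# k and then rises again as the model overfits the 7 training points.
```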
Wait a minute...

X^T X is d-by-d.
Computing X^T X requires N × d^2 multiplications and additions.
Inverting X^T X is O(d^3) (Gauss-Jordan).
Afterwards, computing ((X^T X)^{-1} X^T) Y requires N × d^2 + N × d multiplications.
Big Data!

The same counts apply, but now N and d are large: the N × d^2 multiplications to form X^T X and the O(d^3) inversion become the bottleneck. The closed-form solution does not scale cheaply to big data.
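One standard improvement, sketched below on made-up data: solve the normal equations with a matrix factorization (np.linalg.solve) instead of explicitly inverting X^T X. Both routes are O(d^3), but factor-and-solve has smaller constants and better numerical behavior. Other directions, left to the assignment's question 7, include QR/SVD-based solvers such as np.linalg.lstsq and iterative methods (e.g., stochastic gradient descent) that never form X^T X at all:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 50_000, 100
X = rng.normal(size=(N, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

A = X.T @ X   # d-by-d: the N * d^2 multiply-adds from the slide
b = X.T @ Y   # d-vector: N * d multiply-adds

# What the formula literally says: explicit inversion. O(d^3) and the
# least numerically stable of the options.
w_inv = np.linalg.inv(A) @ b

# Factor-and-solve (LU under the hood): still O(d^3), cheaper constants.
w_solve = np.linalg.solve(A, b)

print(np.allclose(w_inv, w_solve))   # True (up to round-off)
```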
