Introduction To Learning: Frederic Precioso 24/01/2019


Introduction to Learning

Winter School ROBOTICA PRINCIPIA

Frederic Precioso
24/01/2019

1
Disclaimer

If any content in this presentation is yours but is not correctly
referenced or if it should be removed, please just let me know
and I will correct it.

2
Overview
• Context & Vocabulary
– What represents Artificial Intelligence?
– Machine Learning vs Data Mining?
– Machine Learning vs Data Science?
– Machine Learning vs Statistics?
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning
3
CONTEXT & VOCABULARY

4
WHAT REPRESENTS ARTIFICIAL INTELLIGENCE?

5
What is Artificial intelligence?

• The term Artificial Intelligence, as a research field, was coined at the conference held on
the campus of Dartmouth College in the summer of 1956, even though the idea had been
around since antiquity.
• For instance, in the first manifesto of Artificial Intelligence, “Intelligent Machinery” (1948),
Alan Turing distinguished two different approaches to AI, which may be termed
“top-down” or knowledge-driven AI and “bottom-up” or data-driven AI.

6
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
What is Artificial intelligence?
• The two different approaches to AI can be detailed:
– “top-down” or knowledge-driven AI
• cognition = high-level phenomenon, independent of low-level details of the implementation
mechanism; first neuron (1943), first neural network machine (1950), neocognitron (1975)
• Evolutionary Algorithms (1954, 1957, 1960), Reasoning (1959, 1970), Expert Systems (1970),
Logic, Intelligent Agent Systems (1990)…
– “bottom-up” or data-driven AI
• opposite approach: start from data to build, incrementally and mathematically, mechanisms
that make decisions
• Machine learning algorithms, Decision Trees (1983), Backpropagation (1984-1986), Random
Forest (1995), Support Vector Machine (1995), Boosting (1995), Deep Learning (1998/2006)…

7
What is Artificial intelligence?

• AI was originally defined, by Marvin Lee Minsky, as “the construction of computer
programs doing tasks that are, for the moment, accomplished more satisfyingly by
human beings because they require high-level mental processes such as perceptual
learning, memory organization and critical reasoning”.

• There is thus the “artificial” side, with the use of computers or sophisticated
electronic processes, and the “intelligence” side, associated with the goal of imitating
(human) behavior.



8
What is Artificial intelligence?

• The concept of strong artificial intelligence refers to a machine capable
not only of producing intelligent behavior, but also of experiencing a real
sense of itself, “real feelings” (whatever may be put behind these words), and “an
understanding of its own arguments”.

• The notion of weak artificial intelligence is the pragmatic approach of engineers:
aiming to build more autonomous systems (to reduce the cost of their
supervision), algorithms capable of solving problems of a certain class, etc. But this
time the machine only simulates intelligence: it seems to act as if it were smart.



9
Why is Artificial Intelligence so difficult to
grasp?

• Frequently, when a technique reaches mainstream use, it is no longer considered
artificial intelligence; this phenomenon is described as the AI effect: “AI is whatever
hasn't been done yet.” (Larry Tesler's Theorem) -> e.g. path finding (GPS), checkers,
electronic chess games, AlphaGo…

⇒ “AI” is continuously evolving and so very difficult to grasp.

10
Machine Learning

x → f(x, α)? → y

Face Detection

Betting on sports Scores, ranking…

Speech Recognition
11
Machine Learning

x → f(x, α)? → y

Support Vector Machines, Random Forest, Artificial Neural Networks
12


MACHINE LEARNING VS DATA MINING?

13
Data Mining Workflow

Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
14
Data Mining Workflow

Validation

Data Mining
Model
Patterns
Transformation
Mainly manual
Preprocessing
Selection

Data
Data warehouse
15
Data Mining Workflow

• Filling missing values
• Dealing with outliers
• Sensor failures
• Data entry errors
• Duplicates
• …

Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
16
Data Mining Workflow
• Aggregation (sum, average)
• Discretization
• Discrete attribute coding
• Text to numerical attribute
• Scale uniformisation or standardisation
• New variable construction
• …

Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
17
Data Mining Workflow
• Regression
• (Supervised) Classification
• Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Novelty/Drift
• …

Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
18
Data Mining Workflow
• Evaluation on Validation Set
• Evaluation Measures
• Visualization
• ...
Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
19
Data Mining Workflow
• Visualization
• Reporting
• Knowledge
• ...

Validation

Data Mining
Model
Patterns
Transformation

Preprocessing
Selection

Data
Data warehouse
20
Data Mining Workflow
Problems
• Regression
• (Supervised) Classification
• Density Estimation / Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Anomaly/Novelty/Drift
• …

Possible Solutions
• Machine Learning
– Support Vector Machine
– Artificial Neural Network
– Boosting
– Decision Tree
– Random Forest
– …
• Statistical Learning
– Gaussian Models (GMM)
– Naïve Bayes
– Gaussian processes
– …
• Other techniques
– Galois Lattice
– …
21
MACHINE LEARNING VS DATA SCIENCE?

22
Data Science Stack
Visualization / Reporting / Knowledge
• Dashboard (Kibana / Datameer)
USER
• Maps (InstantAtlas, Leaflet, CartoDB…)
• Charts (GoogleCharts, Charts.js…)
• D3.js / Tableau / Flame

Analysis / Statistics / Artificial Intelligence


• Machine Learning (Scikit Learn, Mahout, Spark)
• Search / retrieval (ElasticSearch, Solr)

Storage / Access / Exploitation


• File System (HDFS, GGFS, Cassandra …)
• Access (Hadoop / Spark / Both, Sqoop…)
• Databases / Indexing (SQL / NoSQL / Both…, MongoDB, HBase, Infinispan)
• Exploit (LogStash, Flume… )

Infrastructures
• Grid Computing / HPC
• Cloud / Virtualization
HARD
23
MACHINE LEARNING VS STATISTICS?

24
What breed is that Dogmatix (Idéfix) ?

The illustrations of the slides in this section come from the blog “Bayesian Vitalstatistix:
What Breed of Dog was Dogmatix?”
25
Does any real dog have this height and
weight?

• Let us consider vectors x,
independently generated in R^d
(here R^2), following a fixed
but unknown probability
distribution P(x).

26
What should be the breed of these dogs?

• An Oracle assigns a value y to
each vector x following a
probability distribution P(y|x),
also fixed but unknown.

27
An oracle provides me with examples?

• Let S be a training set
S = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)},
with m i.i.d. training samples which
follow the joint probability
P(x, y) = P(x)P(y|x).

28
Statistical solution: Models, Hypotheses…

29
Statistical solution P(height, weight|breed)…

30
Statistical solution P(height, weight|breed)…

31
Statistical solution P(height, weight|breed)…

32
Statistical solution: Bayes,
P(breed|height, weight)…
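
The Bayes step named on this slide, going from P(height, weight|breed) to P(breed|height, weight), can be sketched in a few lines of Python. This is only an illustration: the two breeds, their priors and the Gaussian parameters are made-up values (they are not taken from the cited blog), and height and weight are assumed independent given the breed to keep the example short.

```python
import math

def gaussian_pdf(v, mean, std):
    """Density of a 1-D normal distribution at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical model: breed -> (prior, (mean, std) of height in cm, (mean, std) of weight in kg).
BREEDS = {
    "terrier":  (0.5, (28.0, 3.0), (8.0, 1.5)),
    "labrador": (0.5, (57.0, 4.0), (30.0, 4.0)),
}

def posterior(height, weight):
    """P(breed | height, weight) via Bayes' rule."""
    joint = {}
    for breed, (prior, h, w) in BREEDS.items():
        # Likelihood P(height, weight | breed), assuming conditional independence.
        likelihood = gaussian_pdf(height, *h) * gaussian_pdf(weight, *w)
        joint[breed] = prior * likelihood
    evidence = sum(joint.values())  # P(height, weight)
    return {breed: p / evidence for breed, p in joint.items()}

post = posterior(30.0, 9.0)  # a small dog: the terrier hypothesis dominates
```

The statistical route needs an explicit model for each class; the machine learning route described next only needs a family of functions and a loss.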

33
Machine Learning

• We have a learning machine which can
provide a family of functions {f(x; α)},
where α is a set of parameters.

x → f(x, α)? → y
34
The problem in Machine Learning
• The problem of learning consists in finding the model
(among the {f(x; α)}) which provides the best
approximation ŷ of the true label y given by the Oracle.
• “Best” is defined in terms of minimizing a specific (error)
cost related to your problem/objectives:
Q((x, y), α) ∈ [a; b].
• Examples of cost/loss functions Q: Hinge Loss, Quadratic
Loss, Cross-Entropy Loss, Logistic Loss…

x → f(x, α)? → y

35
Loss in Machine Learning

• How to define the loss L (or the cost Q)?

You should choose the right loss function based on your problem and your data
(here y is the true/expected answer, f(x) the answer predicted by the network).
Classification
• Cross-entropy loss (y ∈ {0, 1}, f(x) ∈ (0, 1)): L(x) = -(y ln(f(x)) + (1-y) ln(1-f(x)))
• Hinge loss (i.e. max-margin loss, a convex surrogate of the 0-1 loss; y ∈ {-1, +1}): L(x) = max(0, 1 - y f(x))
• …
Regression
• Mean Square Error (or Quadratic Loss): L(x) = (f(x) - y)^2
• Mean Absolute Error: L(x) = |f(x) - y|
• …
If the loss is minimized but accuracy is low, you should check the loss function.
Maybe it is not the appropriate one for your task.
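
The four per-sample losses listed on this slide can be written directly in Python. This is a minimal sketch: the function names and the clipping constant `eps` are choices made here for illustration, and the label conventions (y in {0, 1} for cross-entropy, y in {-1, +1} for the hinge loss) are the usual ones, assumed rather than stated on the slide.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy; y in {0, 1}, p = f(x) is a probability in (0, 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge(y, score):
    """Hinge loss; y in {-1, +1}, score = f(x) is a real-valued margin."""
    return max(0.0, 1.0 - y * score)

def squared_error(y, pred):
    """Quadratic loss for regression."""
    return (pred - y) ** 2

def absolute_error(y, pred):
    """Absolute loss for regression, less sensitive to outliers."""
    return abs(pred - y)
```

Note that the hinge loss is zero as soon as the margin y·f(x) exceeds 1, whereas the cross-entropy keeps rewarding more confident correct predictions.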
36
The problem in Machine Learning
For clarity's sake, let us write z = (x, y).

37
Machine Learning fundamental Hypothesis

For clarity's sake, let us write z = (x, y).

S = {z_i}_{i=1,…,m} is built through i.i.d. sampling according to P(z).

Machine Learning  Statistics


Train through Cross-Validation

Machine Learning  Statistics


Training set & Test set have to be distributed according to the
same law (i.e. P(z)).
38
Vapnik learning theory (1995)

Training Error Generalization Error

39
Vapnik learning theory (1995)

40
Machine Learning vs Statistics

41
UNSUPERVISED CLASSIFICATION

42
Unsupervised classification
• The system or the operator has only samples, but no label
• The number of classes and their nature have not been
predetermined
⇒ unsupervised learning or clustering.
– No expert is required.
– The algorithm must discover by itself the more or less
hidden/underlying data structure.

43
Clustering
Clustering: Partition a dataset into groups based on the similarity between the instances.

Clustering

44
Clustering Algorithms (Partition)

• Centroid-based (1957,1967)
(E.g. K-means, PAM, CLARA)

• Hierarchical:
• Bottom up approach.
• Top down approach.

45
Clustering Algorithms (Density)

• Density-based
(E.g. DBSCAN, DENCLUE)

• Distribution-based
(E.g. EM, extension of
K-means)

46
Clustering Algorithms (Graph)

• Graph-based
(E.g. Chameleon)

• Grid-based
(E.g. STING, CLIQUE)

47
How many Clusters?

K?

48
Which Algorithm?

K-means PAM Gaussian model

DBSCAN Average linkage DIANA


49
Algorithms’ Parameters

Avg. linkage: K=2 Avg. linkage: K=3 Avg. linkage: K=4

Avg. linkage: K=5 Avg. linkage: K=6 Avg. linkage: K=7


50
Which metric/distance?

51
Clustering validation measures
• Internal validation: Separation? Compactness?
E.g. Dunn, DB, and Silhouette indexes.
• Problems:
– Performance varies with the presence of noise, variable
densities, and poorly separated clusters.
– Overrates the algorithm that uses the same clustering model.
(Plot: index value, from 0 to 1, as a function of K)

• External validation: agreement with class labels?
E.g. Rand, Jaccard, Purity, MI, VI indexes.
• Problem: Some class labels, at least, have to exist.
(Plot: index value, from 0 to 1, for values 1 to 4 on the horizontal axis)

52
Consensus clustering

Consensus
Clustering

Ensemble of 3 base clusterings Consensus solution

53
Overview

• Unsupervised classification
• Explicit supervised classification
– Decision Trees
– Random Forest
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning

54
EXPLICIT SUPERVISED CLASSIFICATION

55
DECISION TREES

56
Decision tree to decide playing
tennis or not

Objective
2 classes: yes & no
Predict whether a game will
be played or not
Temperature can
easily be converted into
a numerical attribute

I.H. Witten and E. Frank, “Data Mining”, Morgan Kaufmann Pub., 2000.
57
A simple example

58
Example - explanations

• On the nodes
– Distribution of the variable to predict
• The first node is segmented with the variable outlook (sunny, overcast, rainy):
creation of 3 sub-groups
– The first group contains 5 observations, 2 yes and 3 no
• The tree can be translated into a set of decision rules without losing any
information
– Example: if outlook = sunny and humidity = high then play = no
59
Basic algorithm
• A = BestAttribute(Examples) // The best attribute is the one giving
// the most homogeneous sub-nodes
• Assign A to the root
• For each value of A, create a new sub-node of the root
• Classify all the examples in the sub-nodes
• If all examples of a sub-node are homogeneous, assign their class
to the node; if not, repeat this process from this node

• Question: How to measure homogeneity?


Entropy, Gini, Information Gain…
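
One common way to measure homogeneity is Shannon entropy, with the information gain of a split as the criterion behind BestAttribute. The sketch below assumes the class counts of the weather data from the Witten and Frank book cited earlier; only the sunny group (2 yes, 3 no) is stated on the slides, the other counts are taken from that dataset.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count list such as [n_yes, n_no]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# Weather data: 9 "yes" and 5 "no" overall (assumed from the cited dataset);
# splitting on outlook gives sunny [2, 3], overcast [4, 0], rainy [3, 2].
gain_outlook = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

A pure node such as overcast [4, 0] has entropy 0, so it becomes a leaf; the other two sub-nodes are split again.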
60
Decision tree to decide playing
tennis or not
Class:YES
Class: NO Class: YES

61
Final decision tree

62
Decision tree example
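(Figure: partition of the Income/Debt plane, with thresholds t1 and t3 on the Income axis and t2 on the Debt axis, next to the corresponding tree with tests Income > t1, Debt > t2, Income > t3)
Note: tree boundaries are piecewise
linear and axis-parallel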

63
Advantages of decision trees
• Simple and easily interpretable rules (unlike implicit decision methods)
• No need to recode heterogeneous data
• Processing with missing values
• No model and no presupposition to meet (iterative method)
• Fast processing time

64
Drawbacks
• The nodes of level n + 1 are highly dependent on those of level
n (the modification of a single variable near to the top of the
tree can entirely change the tree)
• We always choose the best local attributes, the best global
information gain is not at all guaranteed
• Learning requires a sufficient number of individuals
• Inefficient when there are many classes

• No convergence…
65
Drawbacks
• We always choose the best local attributes; the best global information
gain is not at all guaranteed
– Solution chosen by the algorithm: Gain(A2) = 0.25 at the root, then Gain(A4) = 0.11 and Gain(A1) = 0.07
– Solution that should be chosen: Gain(A5) = 0.18 at the root, then Gain(A7) = 0.30 and Gain(A3) = 0.20
66


Decision trees do not converge?
“Plant” a forest

67
Standard Random Forests

Bagging

Random
Feature
Selection

68
Error of generalization for Random Forest

• Error of generalization of RF can be bounded by:

$R(RF) \le \dfrac{\rho\,(1 - s^2)}{s^2}$

where
– $\rho$ is the mean correlation between two decision trees
– $s$ is the strength (quality of prediction) of the set of decision trees

69
Success story: Kinect

https://www.youtube.com/watch?v=lntbRsi8lU8

70
Success story: Kinect

71
Success story: Kinect

72
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
– Multi-Layer Perceptron
– Support Vector Machine
• Deep Learning
• Reinforcement Learning

73
IMPLICIT SUPERVISED CLASSIFICATION

74
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)

75
The curse of dimensionality [Bellman, 1956]

76
MULTI-LAYER PERCEPTRON

77
First, biological neurons
• Before we study artificial neurons, let’s look at a biological neuron

78
Figure from K.Gurney, An Introduction to Neural Networks
First, biological neurons

Postsynaptic potential as a function of time (ms) and weight value:
excitatory for the red and blue lines, inhibitory for the green line.

79
Then, artificial neurons

Pitts & McCulloch (1943): binary inputs & the activation function f is a threshold

Rosenblatt (1957): real inputs & the activation function f is a threshold


80
(Diagram: Isaac Changhau)
Artificial neuron vs
biology

Inputs $x_0 = 1, x_1, x_2, x_3, \dots, x_n$ (vector $\mathbf{x}$), weights $w_0, w_1, w_2, w_3, \dots, w_n$ (vector $\mathbf{w}$), weighted sum $\sum_{i=0}^{n} w_i x_i$, output $y$.

Spike-based description: gradient descent KO
Rate-based description (steady regime): $y = s\left(\sum_i w_i x_i\right)$, gradient descent OK
81
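
The rate-based neuron y = s(Σ w_i x_i) takes only a few lines of Python. In this sketch the threshold and sigmoid activations are two possible choices for s, and the weights implementing a logical AND are hypothetical values picked by hand.

```python
import math

def step(a):
    """Threshold activation: not differentiable at 0, zero gradient elsewhere."""
    return 1.0 if a >= 0.0 else 0.0

def sigmoid(a):
    """Smooth activation, differentiable everywhere (usable with gradient descent)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(weights, inputs, s):
    """y = s(sum_i w_i x_i), with x_0 = 1 prepended so that w_0 acts as the bias."""
    x = [1.0] + list(inputs)
    return s(sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical weights implementing a logical AND of two binary inputs.
w_and = [-1.5, 1.0, 1.0]
outputs = [neuron(w_and, (a, b), step) for a in (0, 1) for b in (0, 1)]
```

Swapping `step` for `sigmoid` keeps the same decision boundary but returns a graded value, which is what makes training by gradient descent possible.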


From perceptron to network

@tachyeonz: A friendly introduction to neural networks and deep learning.

82
Single Perceptron Unit
• A single perceptron only learns linear (linearly separable) functions [Minsky and Papert, 1969]

• Non-linear functions need layer(s) of neurons → Neural Network


• Neural Network = input layer + hidden layer(s) + output layer

83
Multi-Layer Perceptron
• Training a neural network [Rumelhart et al. / Yann Le Cun et al. 1985]
• Unknown parameters: weights on the synapses
• Minimizing a cost function: some metric between the predicted output
and the given output

• Step function: non-continuous functions are replaced by continuous
non-linear ones
84
Multi-Layer Perceptron
• Minimizing a cost function: some metric between the predicted output
and the given output

• Equation for a network of 3 neurons (i.e. 3 perceptrons):

$y = s\big(w_{13}\, s(w_{11} x_1 + w_{21} x_2 + w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 + w_{02}) + w_{03}\big)$
85
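
The 3-neuron equation of the previous slide can be evaluated directly. The sketch below assumes a sigmoid for s and reads w_ij as the weight from input (or neuron) i to neuron j, with i = 0 for the bias; the numerical weights are hand-picked for illustration so that the network approximates XOR, a function a single perceptron cannot represent.

```python
import math

def s(a):
    """Sigmoid, one possible continuous choice for the activation s."""
    return 1.0 / (1.0 + math.exp(-a))

def network(x1, x2, w):
    """Two hidden neurons (1 and 2) feeding one output neuron (3).

    w[(i, j)] is the weight from input or neuron i to neuron j; i = 0 is the bias.
    """
    h1 = s(w[(1, 1)] * x1 + w[(2, 1)] * x2 + w[(0, 1)])
    h2 = s(w[(1, 2)] * x1 + w[(2, 2)] * x2 + w[(0, 2)])
    return s(w[(1, 3)] * h1 + w[(2, 3)] * h2 + w[(0, 3)])

# Hand-picked illustrative weights (not learned).
w = {
    (1, 1): 20.0, (2, 1): 20.0, (0, 1): -10.0,   # neuron 1 behaves like OR
    (1, 2): -20.0, (2, 2): -20.0, (0, 2): 30.0,  # neuron 2 behaves like NAND
    (1, 3): 20.0, (2, 3): 20.0, (0, 3): -30.0,   # neuron 3 behaves like AND
}
xor = [round(network(a, b, w)) for a in (0, 1) for b in (0, 1)]
```

Training consists in finding such weights automatically by minimizing the cost function, instead of setting them by hand as done here.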
Autonomous Land Vehicle In a Neural
Network (ALVINN)
• ALVINN is an automatic steering system for a car based on input from a
camera mounted on the vehicle.
– Successfully demonstrated in a cross-country trip.

86
ALVINN (1989)

• The ALVINN neural network is:


– 960 inputs (a 30x32 array
derived from the pixels of an
image),
– 4 hidden units and
– 30 output units (each
representing a steering
command).

87
Multi-Layer Perceptron
Theorem [Cybenko, 1989]
• A neural network with one single hidden layer is a universal
approximator: it can represent any continuous function on compact
subsets of R^n
• 2 layers are enough... theoretically:
“…networks with one internal layer and an arbitrary continuous sigmoidal
function can approximate continuous functions with arbitrary precision
providing that no constraints are placed on the number of nodes or the size
of the weights”
• But no efficient learning rule is known, and the size of the hidden layer needed to get
an error ε is exponential in the complexity of the problem (which is unknown
beforehand); the layer must be infinite for an error of 0.
88
SUPPORT VECTOR MACHINE
Partly based on “A Gentle Introduction to Support Vector Machines in Biomedicine”, A. Statnikov, D.
Hardin, I. Guyon, C. F. Aliferis, AMIA 2010.

89
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)

90
The curse of dimensionality [Bellman, 1956]

91
SVM vs ANN
“SVMs have been developed in the reverse order to the development
of neural networks (NNs). SVMs evolved from the sound theory to
the implementation and experiments, while the NNs followed more
heuristic path, from applications and extensive experimentation to
the theory.”

“Support Vector Machines: Theory and Applications” by Lipo
Wang, in Studies in Fuzziness and Soft Computing, Springer, 2005.

92
The Support Vector Machine (SVM)
• The support vector machine (SVM) is a binary classification
algorithm.
• Extensions of the basic SVM algorithm can be applied to solve
problems of regression, feature selection, novelty/outlier
detection, and clustering.
• SVMs are important because of (a) theoretical reasons:
- Robust to very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
93
Linearly separable data, “Hard-
margin” linear SVM
Given training data: $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N \in \mathbb{R}^n$, $y_1, y_2, \dots, y_N \in \{-1, +1\}$
• Want to find a classifier (hyperplane) to separate negative objects from
the positive ones.
• An infinite number of such hyperplanes exist.
• SVMs find the hyperplane that maximizes the gap between data
points on the boundaries (so-called “support vectors”).
• If the points on the boundaries are not informative (e.g., due to noise),
SVMs may not do well.
(Figure legend: negative objects (y = -1), positive objects (y = +1))
94
Kernel Trick

https://www.youtube.com/watch?v=-Z4aojJ-pdg

95
Popular kernels
A kernel is a dot product in some feature space:
$K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)$
Examples:
• Linear kernel: $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j$
• Gaussian kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$
• Exponential kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|)$
• Polynomial kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^q$
• Hybrid kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^q \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$
• Sigmoidal kernel: $K(\vec{x}_i, \vec{x}_j) = \tanh(k\, \vec{x}_i \cdot \vec{x}_j - \delta)$
96
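
Several of the kernels from the previous slide are sketched below in pure Python, together with a check of the statement that a kernel is a dot product in some feature space. The explicit feature map `phi` is the standard one for the polynomial kernel with p = 1, q = 2 on 2-D inputs; the default parameter values and the test vectors are arbitrary choices made for this illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def gaussian_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sq_dist(x, y))

def exponential_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * math.sqrt(sq_dist(x, y)))

def polynomial_kernel(x, y, p=1.0, q=2):
    return (p + dot(x, y)) ** q

def phi(x):
    """Explicit feature map of the polynomial kernel for 2-D inputs, p = 1, q = 2."""
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(x, y)   # computed in the 2-D input space
k_explicit = dot(phi(x), phi(y))       # same value, computed in the 6-D feature space
```

The kernel trick is precisely that `k_implicit` never builds the 6-D vectors; for the Gaussian kernel the corresponding feature space is infinite-dimensional, so only the implicit form is usable.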
How to build a kernel function?

$k(x, y) = k_1(x, y) + k_2(x, y)$
$k(x, y) = \alpha \cdot k_1(x, y)$
$k(x, y) = k_1(x, y) \cdot k_2(x, y)$
$k(x, y) = f(x) \cdot f(y)$, with $f(\cdot)$ a function from input space to $\mathbb{R}$
$k(x, y) = k_3(\Phi(x), \Phi(y))$
$k(x, y) = x B y^T$, with $B$ an $N \times N$ symmetric positive semi-definite matrix
97
Complex kernels on Video Tubes

98
Complex kernels on Video Tubes

99
Classification

100
Classification

101
SVM are ANN

102
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
– Convolutional Neural Networks (CNN)
– Generative Adversarial Networks (GAN)
– Stacked Denoising AutoEncoder (SDAE)
– Recurrent Neural Networks (RNN)
• Reinforcement Learning
103
DEEP LEARNING

104
Deep representation origins
• Theorem Cybenko (1989): A neural network with one single hidden layer
is a universal “approximator”: it can represent any continuous function
on compact subsets of R^n ⇒ 2 layers are enough… but the hidden layer size
may be exponential
(Figure: a single, exponentially wide hidden layer)

105
Deep representation origins
• Theorem Hastad (1986), Bengio et al. (2007): Functions representable
compactly with k layers may require exponential size with k-1 layers

106
Deep representation origins
• Theorem Hastad (1986), Bengio et al. (2007): Functions representable
compactly with k layers may require exponential size with k-1 layers
(Figure: a network with 2 layers whose hidden layer is exponentially wide)

107
Enabling factors
• Why do it now? Before 2006, training deep networks was
unsuccessful because of practical aspects
– faster CPUs
– parallel CPU architectures
– advent of GPU computing

• Results…
– 2009, sound, Interspeech: + ~24%
– 2011, text: + ~15% without any linguistic knowledge
– 2012, images, ImageNet: + ~20%
108
Structure the network?
• Can we put any structure reducing the space of exploration and
providing useful properties (invariance, robustness…)?

$y = s\big(w_{13}\, s(w_{11} x_1 + w_{21} x_2 + w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 + w_{02}) + w_{03}\big)$

109
CONVOLUTIONAL NEURAL NETWORKS
(AKA CNN, CONVNET)

110
Convolutional neural network

111
Deep representation by CNN

112
Convolution in nature

113
Convolution

114
Convolution in nature
1. Hubel and Wiesel worked on the visual cortex of cats (1962)
2. Convolution

3. Pooling

115
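Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, with $\mathbf{w} \cdot \mathbf{x} = \sum_{i=0}^{n} w_i x_i$ and $x_0 = 1$
(Figure: numerical example with bias $w_0$, weight values -4, 4, 0, 0 and input values 0, 0, 0, 2)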

116
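Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$, with $\mathbf{w} \cdot \mathbf{x} = \sum_{i=0}^{n} w_i x_i$
Inputs $x_0 = 1, x_1, x_2, x_3, \dots, x_n$ (vector $\mathbf{x}$); weights $w_0, w_1, w_2, w_3, \dots, w_n$ (vector $\mathbf{w}$)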

117
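
The idea that a convolution is one perceptron, with shared weights, applied to every window of the input can be sketched in one dimension. The filter weights, bias and signal below are hypothetical values; as in most CNN libraries, the filter is applied without flipping (strictly speaking a cross-correlation), and sign(0) is mapped to +1 by convention.

```python
def perceptron(weights, bias, patch):
    """sign(w . x + w_0) on one flattened patch (sign(0) taken as +1)."""
    a = bias + sum(w * v for w, v in zip(weights, patch))
    return 1 if a >= 0 else -1

def convolve_1d(signal, weights, bias):
    """Slide the same perceptron (shared weights) over every window of the signal."""
    k = len(weights)
    return [perceptron(weights, bias, signal[i:i + k])
            for i in range(len(signal) - k + 1)]

# Hypothetical edge detector: fires (+1) where the signal steps up by more than 0.5.
signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
response = convolve_1d(signal, weights=[-1.0, 1.0], bias=-0.5)
```

Weight sharing is the structural constraint: the layer has 3 parameters here regardless of the signal length, and it responds to the rising edge wherever it occurs.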
If convolution = perceptron
1. Convolution

2. Pooling

118
Deep representation by CNN

119
Deep representation by CNN

120
Deep representation by CNN

121
Deep representation by CNN

122
Transfer Learning!!

123
Endoscopic Vision Challenge 2017
Surgical Workflow Analysis in the SensorOR

10th of September
Quebec, Canada

124
Clinical context: Laparoscopic Surgery
Surgical Workflow Analysis

125
Task
Phase segmentation of laparoscopic surgeries

Video

Surgical
Devices

126
Dataset
30 colorectal laparoscopies
• Complex type of operation
• Duration: 1.6h – 4.9h (avg 3.2h)
• 3 different sub-types
– 10x Proctocolectomy
– 10x Rectal resection
– 10x Sigmoid resection

Sensor data recorded in integrated OR (Karl Storz OR1)
• Laparoscopic image stream
• Surgical devices
Recorded at

127
Annotation
Annotated by surgical experts, 13 different phases

128
Method
Spatial Network: ResNet-34 (7x7 conv, 3x3 conv, …, FC), input 224x224x3, output feature vector FV (512)
Temporal Network: input 224x224x20

Number of target classes:
• Rectal resection: 11
• Sigmoid resection: 10
• Proctocolectomy: 12

Spatial network accuracy:
• Rectal resection: 62.91%
• Sigmoid resection: 63.01%
• Proctocolectomy: 63.26%

Temporal network accuracy:
• Rectal resection: 49.88%
• Sigmoid resection: 48.56%
• Proctocolectomy: 46.96%

Final Network: Spatial CNN (224x224x3) → spatial features (512) and Temporal CNN (224x224x20) → temporal features (512) → combined features (1024) → LSTM (512)

Final Network Accuracy:
• Rectal resection (8): 80.7%; Rectal resection (6): 79.9%
• Sigmoid resection (7): 73.5%; Sigmoid resection (1): 54.7%
• Proctocolectomy (1): 71.3%; Proctocolectomy (4): 73.9%

(Figure: results for rectal resection video #8; GT in green, prediction in red; vertical axis: phases 0 to 12)
129
And the winner is...

Rank  Data used        Average Jaccard  Median Jaccard  Accuracy  Team
1     Video            40%              38%             61%       Team UCA
2     Video + Device   38%              38%             60%       Team NCT
3     Video            25%              25%             57%       Team TUM
4     Device           16%              16%             36%       Team TUM
5     Video            8%               7%              21%       Team FFL

130
Why Deep Learning?
Before Deep Learning
Hand-crafted/Engineered features
– Image recognition
– Pixel → edge → texton → patterns → part → object
– Text
– Character → word → word group → clause → sentence → story
– Speech
– Sample → spectral band → sound → … → phone → phoneme → word
Since Deep Learning
• Hierarchy of representations with increasing level of abstraction
• Each stage is a kind of trainable feature transform…
as long as you have enough data to train the hierarchy

Trainable feature transform → Trainable feature transform → Trainable feature transform → Trainable feature transform → Decision
Le Cun - Ranzato
How Deep Learning?
Start from raw data OR from a first level representation?

Possible inputs: gray-level pixels, color pixels, images, waves, text, text embedding

Trainable feature transform → Trainable feature transform → Trainable feature transform → Trainable feature transform → Decision
Le Cun - Ranzato
AMAZING BUT…

133
Amazing but…be careful of the bias in
the initial data

134
Amazing but…be careful of the
adversaries (as any other ML algorithms)

135
Amazing but…be careful of the
adversaries (as any other ML algorithms)

136
Amazing but…be careful of the
adversaries (as any other ML algorithms)

From Thomas Tanay


137
Amazing but…be careful of the
adversaries (as any other ML algorithms)

138
Amazing but…be careful of the
adversaries (as any other ML algorithms)

139
Amazing but…be careful of the
adversaries

https://nicholas.carlini.com/code/audio_adversarial_examples/

140
GENERATIVE ADVERSARIAL NETWORKS

141
How to solve it?
Generative Adversarial Networks

142
It finally did not solve adversarial, but…

Operations between latent representations (manifold)

143
It finally did not solve adversarial, but…

144
(DENOISING) STACKED AUTOENCODER

145
Autoencoder: unsupervised!
Learning a compact representation of the data (no classification)
First we train an AutoEncoder layer 1.

146
Autoencoder: unsupervised!
Second we train an AutoEncoder layer 2.

147
Autoencoder -> Supervised

Then we train an output layer of non-linearities based on softmax.

148
Autoencoder -> Supervised
Finally, we fine-tune the whole network in a supervised way.

149
Denoising stacked Autoencoder:
unsupervised
Result = a new latent representation

150
Denoising stacked Autoencoder: example
Model: Deep Patient, published in Scientific Reports (Nature Publishing Group), 2016

Stage 1.
Data-mining stage &
feature extraction:
Deriving a binary phenotype representation from
Electronic Health Records.

Stage 2.
Unsupervised stage:
Mapping the binary patient
representation to a new space called
Deep Patient (or latent representation)
using Stacked Denoising Autoencoders.

Stage 3.
Supervised stage:
Labeling the medical targets and
training machine learning algorithms
on the latent representation
for classification and prediction of
patients' diseases.
151
Supervised Image Segmentation Task

Credits Matthieu Cord
152


Partly from COLAH’s Blog http://colah.github.io/posts/2015-08-Understanding-LSTMs/

RECURRENT NEURAL NETWORK

153
Recurrent Neural Networks have loops.

154
An unrolled recurrent neural network.

In the last few years, there has been incredible success applying RNNs to a
variety of problems: speech recognition, language modeling, translation,
image captioning…

155
The Problem of Long-Term Dependencies

If we are trying to predict the last word in “the clouds are in the (sky),” we don’t need any
further context – it’s pretty obvious the next word is going to be sky.

Consider trying to predict the last word in the text “I grew up in France… I speak
fluent (French).” Recent information suggests that the next word is probably the name of a
language, but if we want to narrow down which language, we need the context of France,
from further back.
156
The repeating module in a standard
RNN contains a single layer.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information
(cf. vanishing gradients).
The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found
some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs/GRUs largely avoid this problem!
157
The repeating module in an LSTM
contains four interacting layers (a GRU is a simplified variant with fewer gates).

158
Sequence modeling with RNNs

159
One to Many - Image captioning

[Xu et al. 2015]
160


Many to Many Parallel - char-rnn

161
Many to Many - Machine translation
speech2text
• Input : Audio mp3 (natural language)
• Output : "How much would a woodchuck chuck"

[Chan et al. 2015] [Olah et Carter 2016]


162
Overview

• Context & Vocabulary


• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning

163
Based on “Introduction to Deep Learning”, by Professor Qiang Yang,
from The Hong Kong University of Science and Technology

REINFORCEMENT LEARNING

164
Reinforcement Learning

• What’s Reinforcement Learning?


Environment

{Observation, Reward} {Actions}

Agent

• Agent interacts with an environment and learns by maximizing a scalar reward


• No labels or any other supervision
• Previously suffered from hand-crafted states or representations

165
Policies and Value Functions
• Policy $\pi$ is a behavior function selecting actions given states
(it defines the probability of each possible action given the state $s$)

$a = \operatorname*{argmax}_{\text{all possible actions}} \pi(s)$

• Value function $Q^\pi(s, a)$ is the expected total reward $r$ from state $s$ and action $a$
under policy $\pi$

“How good is action $a$ in state $s$?”


166
Approaches To Reinforcement Learning
• Policy-based RL
– Search directly for the optimal policy $\pi^*$
– Policy achieving maximum future reward
• Value-based RL
– Estimate the optimal value function $Q^*(s, a)$
– Maximum value achievable for the best policy $\pi^*$
• Model-based RL
– Build a transition model of the environment
– Plan (e.g. by look-ahead) using model

167
Bellman Equation
• Value function can be unrolled recursively

$Q^\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a \right]$
$= \mathbb{E}_{s'}\left[ r_t + \gamma \max_{a'} Q^\pi(s', a') \mid s, a \right]$

• Optimal value function $Q^*(s, a)$ can be unrolled recursively

$Q^*(s, a) = \mathbb{E}_{s'}\left[ r_t + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$

$V^\pi(s) = \max_{a} Q^\pi(s, a)$

• Value iteration algorithms solve the Bellman equation

$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[ r_t + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$

This last equation corresponds to how we should ideally update Q, i.e. knowing the state distribution, BUT
we do not know the state distribution Prob(s’ | s,a) and the complexity is not even polynomial in the
number of states
168
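
When the transition probabilities are known, the value iteration update above can be run as is. The sketch below uses a made-up 3-state chain (state 2 is terminal and gives a reward of 1 when reached); the transition table `P`, the action names and the discount factor are all assumptions chosen for illustration.

```python
GAMMA = 0.9

# Hypothetical 3-state chain: P[(s, a)] = list of (probability, next_state, reward).
# State 2 is terminal (no outgoing transitions).
P = {
    (0, "stay"): [(1.0, 0, 0.0)],
    (0, "go"):   [(1.0, 1, 0.0)],
    (1, "stay"): [(1.0, 1, 0.0)],
    (1, "go"):   [(1.0, 2, 1.0)],
}
ACTIONS = ("stay", "go")

def q_iteration(n_sweeps=100):
    """Repeatedly apply Q_{i+1}(s,a) = E_{s'}[r + gamma * max_{a'} Q_i(s',a') | s,a]."""
    q = {sa: 0.0 for sa in P}
    for _ in range(n_sweeps):
        q = {
            (s, a): sum(
                # Terminal states have no entry in q, so their value defaults to 0.
                prob * (r + GAMMA * max(q.get((s2, a2), 0.0) for a2 in ACTIONS))
                for prob, s2, r in outcomes
            )
            for (s, a), outcomes in P.items()
        }
    return q

Q = q_iteration()
```

This only works because `P` is given and the table has four entries; with unknown dynamics and raw pixels as states, Q has to be estimated from sampled transitions and approximated by a function, which is where deep reinforcement learning comes in.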
Deep Reinforcement Learning
• Human

• So what’s DEEP RL?


Environment

{Raw Observation, Reward} {Actions}

169
Deep Reinforcement Learning
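• Represent value function by deep Q-network with weights $\mathbf{w}$
$Q(s, a, \mathbf{w}) = Q^\pi(s, a)$
• Define objective function by mean-squared error in Q-values
$\mathcal{L}(\mathbf{w}_i) = \mathbb{E}\Big[\big(\underbrace{r_t + \gamma \max_{a'} Q(s', a', \mathbf{w}_{i-1})}_{\text{target}} - Q(s, a, \mathbf{w}_i)\big)^2\Big]$

• Leading to the following Q-learning gradient

$\dfrac{\partial \mathcal{L}(\mathbf{w}_i)}{\partial \mathbf{w}} = \mathbb{E}\Big[\big(\underbrace{r_t + \gamma \max_{a'} Q(s', a', \mathbf{w}_{i-1})}_{\text{target}} - Q(s, a, \mathbf{w}_i)\big) \dfrac{\partial Q(s, a, \mathbf{w}_i)}{\partial \mathbf{w}}\Big]$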
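
The target and the gradient step can be illustrated on a single transition. To stay self-contained, the sketch replaces the deep network by a lookup table, Q(s, a, w) = w[(s, a)], so that ∂Q/∂w is 1 for the visited entry and 0 elsewhere; the `done` flag, the learning rate and the numerical values are assumptions made for this example.

```python
GAMMA = 0.9
ACTIONS = (0, 1)

def q_value(w, s, a):
    """Tabular stand-in for the network: Q(s, a, w) = w[(s, a)]."""
    return w.get((s, a), 0.0)

def td_update(w, w_prev, transition, lr=0.5):
    """One stochastic gradient step on (target - Q(s, a, w))^2 for one transition.

    The target uses the frozen weights w_prev, as in the objective above.
    For this tabular Q, dQ/dw is 1 for entry (s, a) and 0 elsewhere;
    the constant factor of the squared error is absorbed into lr.
    """
    s, a, r, s_next, done = transition
    bootstrap = 0.0 if done else max(q_value(w_prev, s_next, a2) for a2 in ACTIONS)
    target = r + GAMMA * bootstrap
    td_error = target - q_value(w, s, a)
    new_w = dict(w)
    new_w[(s, a)] = q_value(w, s, a) + lr * td_error
    return new_w, td_error

# Hypothetical frozen weights and one observed transition (s, a, r, s', done).
w_prev = {(1, 0): 2.0, (1, 1): 4.0}
w, err = td_update({(0, 1): 1.0}, w_prev, transition=(0, 1, 1.0, 1, False))
```

Here the target is 1 + 0.9 × 4 = 4.6, so the estimate for (s = 0, a = 1) moves from 1.0 halfway towards it. With a neural network the same error term multiplies the gradient of the network output instead of a single table entry.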

170
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in the score for that step

Mnih, Volodymyr, et al. 2015. 171


DQN in Atari : Human Level Control

Mnih, Volodymyr, et al. 2015.

172
AlphaGO: Monte Carlo Tree Search
• MCTS: Model look-ahead to reduce the search space by predicting the
opponent’s moves

$V^\pi(s) = \max_{a} Q^\pi(s, a)$

Silver, David, et al. 2016.


173
AlphaGO: Learning Pipeline
• Combine SL and RL to learn the search direction in MCTS

Silver, David, et al. 2016.

• SL policy Network
– Prior search probability or potential
• Rollout:
– combine with MCTS for quick simulation on leaf node
• Value Network:
– Build the Global feeling on the leaf node situation
174
Learning to Prune: SL Policy Network
• 13-layer CNN
• Input: board position $s$
• Output: $p_\sigma(a \mid s)$, where $a$ is the next move

175
Learning to Prune: RL Policy Network

Self play

• 1 million samples are used to train.

• RL-Policy network vs SL-Policy network.
– RL-Policy alone wins 80% of games against SL-Policy.
– Combined with MCTS, the SL-Policy network is better.
• Used to derive the Value Network as the ground truth
– Making enough data for training
176
Learning to Prune: Value Network
• Regression: Similar architecture

• SL Network: Sampling to generate a unique game.


• RL Network: Simulate to get the game’s final result.

• Train: 50 million mini-batches of 32 positions
(30 million unique games)

177
AlphaGO: Evaluation

The version solely using the policy network does not perform any search
Silver, David, et al. 2016.
178
QUESTIONS?

179
