Introduction To Learning: Frederic Precioso 24/01/2019

Introduction to Learning
Winter School ROBOTICA PRINCIPIA
Frederic Precioso
24/01/2019
1
Disclaimer
If any content in this presentation is yours but is not correctly
referenced, or if it should be removed, please let me know
and I will correct it.
2
Overview
• Context & Vocabulary
– What represents Artificial Intelligence?
– Machine Learning vs Data Mining?
– Machine Learning vs Data Science?
– Machine Learning vs Statistics?
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning
3
CONTEXT & VOCABULARY
4
WHAT REPRESENTS ARTIFICIAL INTELLIGENCE?
5
What is Artificial intelligence?
• The term Artificial Intelligence, as a research field, was coined at the conference held on
the campus of Dartmouth College in the summer of 1956, even though the idea had been
around since antiquity.
• For instance, in the first manifesto of Artificial Intelligence, “Intelligent Machinery” (1948),
Alan Turing distinguished two different approaches to AI, which may be termed
“top-down” or knowledge-driven AI and “bottom-up” or data-driven AI.
6
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
• The two approaches to AI in more detail:
– “top-down” or knowledge-driven AI
• cognition = high-level phenomenon, independent of the low-level details of the implementation
mechanism; first neuron (1943), first neural network machine (1950), neocognitron (1975)
• Evolutionary Algorithms (1954, 1957, 1960), Reasoning (1959, 1970), Expert Systems (1970),
Logic, Intelligent Agent Systems (1990)…
– “bottom-up” or data-driven AI
• opposite approach: start from data to build, incrementally and mathematically, mechanisms
that take decisions
• Machine learning algorithms, Decision Trees (1983), Backpropagation (1984-1986), Random
Forest (1995), Support Vector Machine (1995), Boosting (1995), Deep Learning (1998/2006)…
7
• AI was originally defined by Marvin Lee Minsky as “the construction of computer
programs doing tasks that are, for the moment, accomplished more satisfyingly by
human beings because they require high-level mental processes such as perceptual
learning, memory organization and critical reasoning”.
• There is thus the “artificial” side, with the use of computers or sophisticated
electronic processes, and the “intelligence” side, associated with the goal of imitating
(human) behavior.

8
• The concept of strong artificial intelligence refers to a machine capable
not only of producing intelligent behavior, but also of experiencing a real
sense of itself, “real feelings” (whatever may be put behind these words), and “an
understanding of its own arguments”.
• The notion of weak artificial intelligence is the pragmatic approach of engineers:
aiming to build more autonomous systems (to reduce the cost of their
supervision), algorithms capable of solving problems of a certain class, etc. This
time, however, the machine only simulates intelligence: it seems to act as if it were smart.

9
Why is Artificial Intelligence so difficult to
grasp?
• Frequently, when a technique reaches mainstream use, it is no longer considered
artificial intelligence; this phenomenon is described as the AI effect: “AI is whatever
hasn't been done yet.” (Larry Tesler's Theorem) -> e.g. path finding (GPS), checkers,
electronic chess, AlphaGo…
“AI” is continuously evolving and therefore very difficult to grasp.
10
Machine Learning
x → f(x, α) ? → y
Face Detection
Betting on sports: scores, ranking…
Speech Recognition
11
Machine Learning
x → f(x, α) ? → y
Support Vector Machines, Random Forest, Artificial Neural Networks
12

MACHINE LEARNING VS DATA MINING?
13
Data Mining Workflow
Diagram stages: Data warehouse, Selection, Data, Preprocessing, Transformation, Data Mining, Patterns, Model, Validation.
• Selection
– Mainly manual
• Preprocessing
– Filling missing values
– Dealing with outliers
– Sensor failures
– Data entry errors
– Duplicates
– …
• Transformation
– Aggregation (sum, average)
– Discretization
– Discrete attribute coding
– Text to numerical attribute
– Scale uniformisation or standardisation
– New variable construction
– …
• Data Mining
– Regression
– (Supervised) Classification
– Clustering (Unsupervised Classification)
– Feature Selection
– Association analysis
– Novelty/Drift
– …
• Validation
– Evaluation on Validation Set
– Evaluation Measures
– Visualization
– ...
• Output of the workflow
– Visualization
– Reporting
– Knowledge
– ...
20
Problems
• Regression
• (Supervised) Classification
• Density Estimation / Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Anomaly/Novelty/Drift
• …
Possible Solutions
• Machine Learning
– Support Vector Machine
– Artificial Neural Network
– Boosting
– Decision Tree
– Random Forest
– …
• Statistical Learning
– Gaussian Models (GMM)
– Naïve Bayes
– Gaussian processes
– …
• Other techniques
– Galois Lattice
– …
21
MACHINE LEARNING VS DATA SCIENCE?
22
Data Science Stack (from USER at the top to HARD at the bottom)
Visualization / Reporting / Knowledge
• Dashboard (Kibana / Datameer)
• Maps (InstantAtlas, Leaflet, CartoDB…)
• Charts (GoogleCharts, Chart.js…)
• D3.js / Tableau / Flame
Analysis / Statistics / Artificial Intelligence
• Machine Learning (scikit-learn, Mahout, Spark)
• Search / retrieval (ElasticSearch, Solr)
Storage / Access / Exploitation
• File System (HDFS, GGFS, Cassandra …)
• Access (Hadoop / Spark / Both, Sqoop…)
• Databases / Indexing (SQL / NoSQL / Both…, MongoDB, HBase, Infinispan)
• Exploit (LogStash, Flume…)
Infrastructures
• Grid Computing / HPC
• Cloud / Virtualization
23
MACHINE LEARNING VS STATISTICS?
24
What breed is Dogmatix (Idéfix)?
The illustrations of the slides in this section come from the blog “Bayesian Vitalstatistix:
What Breed of Dog was Dogmatix?”
25
Does any real dog get this height and
weight?
• Let us consider vectors x,
independently generated in $\mathbb{R}^d$
(here $\mathbb{R}^2$), following a
fixed but unknown probability
distribution P(x).
26
What should be the breed of these dogs?
• An Oracle assigns a value y to
each vector x following a
probability distribution P(y|x),
also fixed but unknown.
27
An oracle provides me with examples?
• Let S be a training set
$S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$,
with m i.i.d. training samples which
follow the joint probability
P(x, y) = P(x)P(y|x).
28
Statistical solution: Models, Hypotheses…
29
Statistical solution P(height, weight|breed)…
30
31
32
Statistical solution: Bayes,
P(breed|height, weight)…
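The Bayes step named on this slide can be sketched numerically. This is only an illustration under stated assumptions: the two breeds, their priors and their Gaussian parameters are made up, and height and weight are assumed independent given the breed; none of this comes from the blog the slides refer to.

```python
import math

def gaussian_pdf(v, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical breeds: prior P(breed) and (mean, std) for height (cm) and weight (kg).
BREEDS = {
    "terrier":  {"prior": 0.5, "height": (28.0, 3.0), "weight": (7.0, 1.5)},
    "shepherd": {"prior": 0.5, "height": (60.0, 5.0), "weight": (32.0, 5.0)},
}

def posterior(height, weight):
    """P(breed | height, weight) via Bayes' rule, assuming height and weight
    are independent given the breed."""
    joint = {}
    for breed, p in BREEDS.items():
        likelihood = gaussian_pdf(height, *p["height"]) * gaussian_pdf(weight, *p["weight"])
        joint[breed] = p["prior"] * likelihood
    evidence = sum(joint.values())  # P(height, weight)
    return {breed: value / evidence for breed, value in joint.items()}

post = posterior(height=30.0, weight=8.0)
```

The statistical route models P(height, weight | breed) first and inverts it with Bayes' rule; the machine learning route described next looks for a function f(x, α) directly.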
33
Machine Learning
• We have a learning machine which can
provide a family of functions {f(x; α)},
where α is a set of parameters.
x → f(x, α) ? → y
34
The problem in Machine Learning
• The problem of learning consists in finding the model
(among the {f(x; α)}) which provides the best
approximation ŷ of the true label y given by the Oracle.
x → f(x, α) ? → y
• “Best” is defined in terms of minimizing a specific (error)
cost related to your problem/objectives:
$Q((x, y), \alpha) \in [a; b]$.
• Examples of cost/loss functions Q: Hinge Loss, Quadratic
Loss, Cross-Entropy Loss, Logistic Loss…
35
Loss in Machine Learning
• How to define the loss L (or the cost Q)?
You should choose the right loss function based on your problem and your data
(here y is the true/expected answer, f(x) the answer predicted by the network).
Classification
• Cross-entropy loss (y in {0, 1}, f(x) in (0, 1)): L(x) = -(y ln(f(x)) + (1-y) ln(1-f(x)))
• Hinge loss (i.e. max-margin loss, a convex surrogate of the 0-1 loss; y in {-1, +1}): L(x) = max(0, 1 - y f(x))
• …
Regression
• Mean Square Error (or Quadratic Loss): L(x) = (f(x) - y)^2
• Mean Absolute Error: L(x) = |f(x) - y|
• …
If the loss is minimized but accuracy is low, you should check the loss function.
Maybe it is not the appropriate one for your task.
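The four losses listed on this slide can be written down directly. This is a minimal per-sample sketch; the function names are illustrative, and the label conventions (y in {0, 1} for cross-entropy, y in {-1, +1} for hinge) are the usual ones rather than something the slide states.

```python
import math

def cross_entropy(y, p):
    """Binary cross-entropy; y in {0, 1}, p = f(x) in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge(y, score):
    """Hinge loss; y in {-1, +1}, score = f(x) is a real-valued margin."""
    return max(0.0, 1.0 - y * score)

def squared_error(y, pred):
    """Quadratic loss for regression."""
    return (pred - y) ** 2

def absolute_error(y, pred):
    """Absolute loss for regression, less sensitive to outliers."""
    return abs(pred - y)
```

Note that the hinge loss is zero as soon as the sample is on the correct side with a margin of at least 1, whereas the cross-entropy keeps decreasing as the prediction gets more confident.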
36
The problem in Machine Learning
For clarity's sake, let us write z = (x, y).
37
Machine Learning fundamental Hypothesis
For clarity's sake, let us write z = (x, y).
$S = \{z_i\}_{i=1,\dots,m}$ is built through i.i.d. sampling according to P(z).
Machine Learning Statistics

Train through Cross-Validation
Machine Learning Statistics

Training set & Test set have to be distributed according to the
same law (i.e. P(z)).
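Training through cross-validation, as mentioned on this slide, can be sketched as a k-fold split of the sample indices. The helper name `k_fold_indices` and its parameters are illustrative assumptions, not part of the lecture.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Shuffling first helps every fold follow the same distribution P(z)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample is used exactly once for testing and k-1 times for training, so the averaged test error estimates the generalization error without a separate held-out set.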
38
Vapnik learning theory (1995)
Training Error Generalization Error
39
Vapnik learning theory (1995)
40
Machine Learning vs Statistics
41
UNSUPERVISED CLASSIFICATION
42
Unsupervised classification
• The system or the operator has only samples, but no labels
• The number of classes and their nature have not been
predetermined
→ unsupervised learning or clustering.
No expert is required.
The algorithm must discover by itself the more or less
hidden/underlying structure of the data.
43
Clustering
Clustering: Partition a dataset into groups based on the similarity between the instances.
Clustering
44
Clustering Algorithms (Partition)
• Centroid-based (1957,1967)
(E.g. K-means, PAM, CLARA)
• Hierarchical:
• Bottom up approach.
• Top down approach.
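As an illustration of the centroid-based family listed on this slide, here is a minimal sketch of K-means (Lloyd's algorithm) on 2-D points. The toy data, the fixed number of iterations and the random initialisation are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

K has to be given in advance, which is exactly the question raised a few slides later.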
45
Clustering Algorithms (Density)
• Density-based
(E.g. DBSCAN, DENCLUE)
• Distribution-based
(E.g. EM, extension of
K-means)
46
Clustering Algorithms (Graph)
• Graph-based
(E.g. Chameleon)
• Grid-based
(E.g. STING, CLIQUE)
47
How many Clusters?
K?
48
Which Algorithm?
K-means PAM Gaussian model
DBSCAN Average linkage DIANA

49
Algorithms’ Parameters
Avg. linkage: K=2 Avg. linkage: K=3 Avg. linkage: K=4
Avg. linkage: K=5 Avg. linkage: K=6 Avg. linkage: K=7

50
Which metric/distance?
51
Clustering validation measures
• Internal validation: Separation? Compactness?
E.g. Dunn, DB, and Silhouette indexes.
• Problems:
– Different performance with respect to the presence of noise, variable
densities, and clusters that are not well separated.
– Overrates the algorithm that uses the same clustering model.
[Plot: internal validation index (0 to 1) as a function of K]
• External validation: == Class labels?
E.g. Rand, Jaccard, Purity, MI, VI indexes.
• Problem: Some class labels, at least, have to exist.
[Plot: external validation index (0 to 1)]
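As an example of external validation, the Rand index compares a clustering with known class labels by counting the pairs of samples on which both partitions agree. This is a minimal sketch; the function name is illustrative.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two partitions agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The index only looks at whether pairs are grouped together, so it does not depend on how the clusters are named: relabelling cluster 0 as 1 and vice versa leaves the score unchanged.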
52
Consensus clustering
Consensus
Clustering
Ensemble of 3 base clusterings Consensus solution
53
Overview
– Decision Trees
– Random Forest
• Deep Learning
54
EXPLICIT SUPERVISED CLASSIFICATION
55
DECISION TREES
56
Decision tree to decide playing
tennis or not
Objective
2 classes: yes & no
Prediction if a game will
be played or not
Temperature will be
easily converted into
numerical
I.H. Witten and E. Frank, “Data Mining”, Morgan Kaufmann Pub., 2000.
57
A simple example
58
Example - explanations
• On the nodes
– Distribution of the variable to predict
• The first node is split on the variable outlook (sunny, overcast, rainy):
creation of 3 sub-groups
– The first group contains 5 observations, 2 yes and 3 no
• The tree can be translated into a set of decision rules without losing any
information
– Example: if outlook = sunny and humidity = high then play = no
59
Basic algorithm
• A = BestAttribute(Examples) // "best" means the attribute giving the most
// homogeneous sub-groups
• Assign A to the root
• For each value of A, create a new sub-node of the root
• Distribute the examples among the sub-nodes
• If all examples of a sub-node are homogeneous, assign their class
to the node; if not, repeat this process from this node
• Question: How to measure homogeneity?
Entropy, Gini, Information Gain…
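The homogeneity measures named above can be sketched as follows for entropy and information gain. The (yes, no) counts for sunny come from the example slide; the counts for overcast and rainy, and the 9 yes / 5 no total, are assumed from the classic 14-example weather data set.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# (yes, no) counts per outlook value; sunny = (2, 3) is from the slide,
# the other two are assumed from the classic weather data set.
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}
parent = (9, 5)
gain_outlook = information_gain(parent, list(outlook.values()))
```

BestAttribute would compute this gain for every candidate attribute and keep the largest one; a pure sub-group such as overcast has entropy 0 and becomes a leaf.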
60
Decision tree to decide playing
tennis or not
Class:YES
Class: NO Class: YES
61
Final decision tree
62
Decision tree example
Debt
Income > t1
t2 Debt > t2
Income
t3 t1
Income > t3
Note: tree boundaries are piecewise
linear and axis-parallel
63
Advantages of decision trees
• Simple and easily interpretable rules (unlike implicit decision methods)
• No need to recode heterogeneous data
• Processing with missing values
• No model and no presupposition to meet (iterative method)
• Fast processing time
64
Drawbacks
• The nodes of level n + 1 are highly dependent on those of level
n (the modification of a single variable near to the top of the
tree can entirely change the tree)
• We always choose the best local attributes, the best global
information gain is not at all guaranteed
• Learning requires a sufficient number of individuals
• Inefficient when there are many classes
• No convergence…
65
Drawbacks
• We always choose the best local attribute; the best global information
gain is not at all guaranteed
Solution chosen by the algorithm: root Gain (A2) = 0.25, then Gain (A4) = 0.11 and Gain (A1) = 0.07
Solution that should be chosen: root Gain (A5) = 0.18, then Gain (A7) = 0.30 and Gain (A3) = 0.20
66

Decision trees do not converge?
“Plant” a forest
67
Standard Random Forests
Bagging
Random
Feature
Selection
68
Error of generalization for Random Forest
• The generalization error of a Random Forest can be bounded by:
$R(RF) \le \dfrac{\rho\,(1 - s^2)}{s^2}$
where
– $\rho$ is the mean correlation between two decision trees
– s is the quality of prediction of the set of decision trees
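To get a feel for this bound, it can be evaluated for a few values. The numbers below are arbitrary illustrations, not measurements from the lecture.

```python
def rf_error_bound(rho, s):
    """Upper bound rho * (1 - s^2) / s^2 on the generalization error."""
    if not 0 < s <= 1:
        raise ValueError("strength s must be in (0, 1]")
    return rho * (1 - s ** 2) / s ** 2

# Same strength, lower correlation between trees -> tighter bound.
correlated = rf_error_bound(rho=0.6, s=0.8)
decorrelated = rf_error_bound(rho=0.2, s=0.8)
```

This is the motivation for bagging and random feature selection: both aim to lower the correlation between trees without degrading their individual quality too much.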
69
Success story: Kinect
https://www.youtube.com/watch?v=lntbRsi8lU8
70
71
72
Overview
– Multi-Layer Perceptron
– Support Vector Machine
• Deep Learning
73
IMPLICIT SUPERVISED CLASSIFICATION
74
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)
75
The curse of dimensionality [Bellman, 1956]
76
MULTI-LAYER PERCEPTRON
77
First, biological neurons
Before we study artificial neurons, let’s look at a biological neuron
78
Figure from K.Gurney, An Introduction to Neural Networks
First, biological neurons
Postsynaptic potential function with weight dependency, as a function of

time (ms) and weight value, being excitatory in case of red and blue lines, and
inhibitory in case of a green line.
79
Then, artificial neurons
McCulloch & Pitts (1943): binary inputs & activation function f is a threshold
Rosenblatt (1958): real inputs & activation function f is a threshold

80
(Diagram: Isaac Changhau)
Artificial neuron vs
biology
$y = s\left(\sum_{i=0}^{n} w_i x_i\right) = s(\mathbf{w} \cdot \mathbf{x})$, with inputs $x_1, \dots, x_n$, weights $w_0, \dots, w_n$ and $x_0 = 1$
Spike-based description (gradient descent: KO) vs rate-based description, steady regime (gradient descent: OK)
81

From perceptron to network
@tachyeonz: A friendly introduction to neural networks and deep learning.
82
Single Perceptron Unit
• A perceptron only learns linear functions [Minsky and Papert, 1969]
• Non-linear functions need layer(s) of neurons → Neural Network
• Neural Network = input layer + hidden layer(s) + output layer
83
Multi-Layer Perceptron
• Training a neural network [Rumelhart et al. / Yann Le Cun et al. 1985]
• Unknown parameters: weights on the synapses
• Minimizing a cost function: some metric between the predicted output
and the given output
• Step function: non-continuous functions are replaced by continuous
non-linear ones
84
• Minimizing a cost function: some metric between the predicted output
and the given output
• Equation for a network of 3 neurons (i.e. 3 perceptrons):
$y = s\big(w_{13}\, s(w_{11} x_1 + w_{21} x_2 + w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 + w_{02}) + w_{03}\big)$
85
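The equation for a network of 3 neurons on the previous slide can be sketched as a forward pass. The sigmoid activation and the hand-picked weights are assumptions made for illustration (they are not learned); they make the network compute XOR, which a single perceptron cannot represent.

```python
import math

def s(a):
    """Sigmoid activation, a continuous replacement for the step function."""
    return 1.0 / (1.0 + math.exp(-a))

def network(x1, x2, w):
    """Forward pass of the 3-neuron network; w[(i, j)] is the weight from
    input i to neuron j, and w[(0, j)] is the bias of neuron j."""
    h1 = s(w[1, 1] * x1 + w[2, 1] * x2 + w[0, 1])
    h2 = s(w[1, 2] * x1 + w[2, 2] * x2 + w[0, 2])
    return s(w[1, 3] * h1 + w[2, 3] * h2 + w[0, 3])

# Hand-picked (not learned) weights: neuron 1 ~ OR, neuron 2 ~ NAND, neuron 3 ~ AND.
XOR_WEIGHTS = {
    (1, 1): 20.0, (2, 1): 20.0, (0, 1): -10.0,
    (1, 2): -20.0, (2, 2): -20.0, (0, 2): 30.0,
    (1, 3): 20.0, (2, 3): 20.0, (0, 3): -30.0,
}

outputs = {(a, b): round(network(a, b, XOR_WEIGHTS)) for a in (0, 1) for b in (0, 1)}
```

Training would find such weights by gradient descent on the cost function, which is possible only because the sigmoid is differentiable.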
Autonomous Land Vehicle In a Neural
Network (ALVINN)
• ALVINN is an automatic steering system for a car based on input from a
camera mounted on the vehicle.
– Successfully demonstrated in a cross-country trip.
86
ALVINN (1989)
• The ALVINN neural network is:

– 960 inputs (a 30x32 array
derived from the pixels of an
image),
– 4 hidden units and
– 30 output units (each
representing a steering
command).
87
Theorem [Cybenko, 1989]
• A neural network with a single hidden layer is a universal
approximator: it can approximate any continuous function on compact
subsets of $\mathbb{R}^n$
• 2 layers are enough ... theoretically:
“…networks with one internal layer and an arbitrary continuous sigmoidal
function can approximate continuous functions with arbitrary precision
providing that no constraints are placed on the number of nodes or the size
of the weights”
• But no efficient learning rule is known, and the size of the hidden layer needed to reach an
error ε is exponential in the complexity of the problem (which is unknown
beforehand); the layer must be infinite for an error of 0.
88
SUPPORT VECTOR MACHINE
Partly based on “A Gentle Introduction to Support Vector Machines in Biomedicine”, A. Statnikov, D.
Hardin, I. Guyon, C. F. Aliferis, AMIA 2010.
89
SVM vs ANN
"SVMs have been developed in the reverse order to the development
of neural networks (NNs). SVMs evolved from the sound theory to
the implementation and experiments, while the NNs followed more
heuristic path, from applications and extensive experimentation to
the theory."
“Support Vector Machines: Theory and Applications” by Lipo
Wang, in Studies in Fuzziness and Soft Computing, Springer, 2005.
92
The Support Vector Machine (SVM)
• The support vector machine (SVM) is a binary classification
algorithm.
• Extensions of the basic SVM algorithm can be applied to solve
problems of regression, feature selection, novelty/outlier
detection, and clustering.
• SVMs are important because of (a) theoretical reasons:
- Robust to very large numbers of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
93
Linearly separable data, “Hard-
margin” linear SVM
Given training data: $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N \in \mathbb{R}^n$ and $y_1, y_2, \dots, y_N \in \{-1, +1\}$
• Want to find a classifier (hyperplane)
to separate negative objects from
the positive ones.
• An infinite number of such
hyperplanes exist.
• SVMs find the hyperplane that
maximizes the gap between data
points on the boundaries (so-called
“support vectors”).
• If the points on the boundaries are
not informative (e.g., due to noise),
SVMs may not do well.
[Figure: negative objects (y=-1) and positive objects (y=+1)]
94
Kernel Trick
https://www.youtube.com/watch?v=-Z4aojJ-pdg
95
Popular kernels
A kernel is a dot product in some feature space:
$K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)$
Examples:
Linear kernel: $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j$
Gaussian kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$
Exponential kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|)$
Polynomial kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^q$
Hybrid kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^q \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$
Sigmoidal: $K(\vec{x}_i, \vec{x}_j) = \tanh(k\, \vec{x}_i \cdot \vec{x}_j - \delta)$
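A short sketch can make the statement "a kernel is a dot product in some feature space" concrete. The explicit feature map `phi_quadratic` below is the standard one for the polynomial kernel with p = 1, q = 2 on 2-D inputs; the function names and the test vectors are illustrative assumptions.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def gaussian_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, p=1.0, q=2):
    return (p + dot(x, y)) ** q

def phi_quadratic(x):
    """Explicit feature map of the polynomial kernel with p=1, q=2 for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
```

Here `polynomial_kernel(x, y)` and `dot(phi_quadratic(x), phi_quadratic(y))` give the same value, but the kernel never builds the 6-dimensional vectors: this is the kernel trick.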
96
How to build a kernel function ?
$k(x, y) = k_1(x, y) + k_2(x, y)$
$k(x, y) = \alpha \cdot k_1(x, y)$
$k(x, y) = k_1(x, y) \cdot k_2(x, y)$
$k(x, y) = f(x) \cdot f(y)$
with f() a function from the input space to $\mathbb{R}$
$k(x, y) = k_3(\Phi(x), \Phi(y))$
$k(x, y) = x B y^T$
with B a symmetric positive semi-definite $N \times N$ matrix
97
Complex kernels on Video Tubes
98
Complex kernels on Video Tubes
99
Classification
100
Classification
101
SVM are ANN
102
Overview
• Deep Learning
– Convolutional Neural Networks (CNN)
– Generative Adversarial Networks (GAN)
– Stacked Denoising AutoEncoder (SDAE)
– Recurrent Neural Networks (RNN)
103
DEEP LEARNING
104
Deep representation origins
• Theorem, Cybenko (1989): a neural network with a single hidden layer
is a universal “approximator”; it can approximate any continuous function
on compact subsets of $\mathbb{R}^n$. 2 layers are enough… but the hidden layer size
may be exponential
105
• Theorem, Håstad (1986), Bengio et al. (2007): functions representable
compactly with k layers may require exponential size with k-1 layers
107
107
Enabling factors
• Why now? Before 2006, training deep networks was
unsuccessful for practical reasons. What changed:
– faster CPUs
– parallel CPU architectures
– advent of GPU computing
• Results…
– 2009, sound, Interspeech: + ~24%
– 2011, text: + ~15% without any linguistic knowledge
– 2012, images, ImageNet: + ~20%
108
Structure the network?
• Can we impose some structure that reduces the search space and
provides useful properties (invariance, robustness…)?
$y = s\big(w_{13}\, s(w_{11} x_1 + w_{21} x_2 + w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 + w_{02}) + w_{03}\big)$
109
CONVOLUTIONAL NEURAL NETWORKS
(AKA CNN, CONVNET)
110
Convolutional neural network
111
Deep representation by CNN
112
Convolution in nature
113
Convolution
114
Convolution in nature
1. Hubel and Wiesel have worked on visual cortex of cats (1962)
2. Convolution
3. Pooling
115
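Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i\right)$, with $x_0 = 1$
[Figure: perceptron diagram with example numeric weights and inputs]
116
Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}) = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i\right)$, with $x_0 = 1$
[Figure: the same perceptron with general weights $w_0, \dots, w_n$ and inputs $x_1, \dots, x_n$]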
If convolution = perceptron
1. Convolution
2. Pooling
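The two steps above can be sketched in one dimension: a convolution is the same perceptron (shared weights) slid over the input, and pooling keeps one summary value per window. The signal, the kernel and the function names are illustrative assumptions; the activation function is left out for brevity.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide one perceptron (shared weights = kernel) over the signal.
    As in most deep learning libraries, the kernel is not flipped (cross-correlation)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k])) + bias
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0, 0, 1, 1, 1, 0, 0, 0, 1]
edges = conv1d(signal, kernel=[-1, 1])   # responds to rising and falling edges
pooled = max_pool1d(edges, size=2)
```

Sharing the weights across positions reduces the number of parameters and makes the response follow the pattern wherever it appears; pooling then adds some tolerance to small shifts.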
118
119
120
121
122
Transfer Learning!!
123
Endoscopic Vision Challenge 2017
Surgical Workflow Analysis in the SensorOR
10th of September
Quebec, Canada
124
Clinical context: Laparoscopic Surgery
Surgical Workflow Analysis
125
Task
Phase segmentation of laparoscopic surgeries
Video
Surgical
Devices
126
Dataset
30 colorectal laparoscopies
• Complex type of operation
• Duration: 1.6h – 4.9h (avg 3.2h)
• 3 different sub-types
– 10x Proctocolectomy
– 10x Rectal resection
– 10x Sigmoid resection
Sensor data recorded in integrated OR (Karl Storz OR1)
• Laparoscopic image stream
• Surgical devices Recorded at
127
Annotation
Annotated by surgical experts, 13 different phases
128
Method
Spatial Network and Temporal Network: ResNet-34 (7x7 conv, 3x3 conv, 3x3 conv, 3x3 conv, FC), each producing a feature vector FV (512)
• Spatial Network input: 224x224x3
• Temporal Network input: 224x224x20
Final Network: spatial features (512) from the Spatial CNN and temporal features (512) from the Temporal CNN, combined (1024) and fed to an LSTM (512)
Number of target classes:
• Rectal resection: 11
• Sigmoid resection: 10
• Proctocolectomy: 12
Spatial network accuracy:
• Rectal resection: 62.91%
• Sigmoid resection: 63.01%
• Proctocolectomy: 63.26%
Temporal network accuracy:
• Rectal resection: 49.88%
• Sigmoid resection: 48.56%
• Proctocolectomy: 46.96%
Final Network Accuracy:
• Rectal resection (8): 80.7%; Rectal resection (6): 79.9%
• Sigmoid resection (7): 73.5%; Sigmoid resection (1): 54.7%
• Proctocolectomy (1): 71.3%; Proctocolectomy (4): 73.9%
[Figure: results for rectal resection video #8; GT in green, prediction in red; vertical axis: phases 0 to 12]
129
And the winner is...
Rank  Data used       Average Jaccard  Median Jaccard  Accuracy  Team
1     Video           40%              38%             61%       Team UCA
2     Video + Device  38%              38%             60%       Team NCT
3     Video           25%              25%             57%       Team TUM
4     Device          16%              16%             36%       Team TUM
5     Video           8%               7%              21%       Team FFL
130
Why Deep Learning?
Before Deep Learning
Hand-crafted/Engineered features
– Image recognition
– Pixel → edge → texton → patterns → part → object
– Text
– Character → word → word group → clause → sentence → story
– Speech
– Sample → spectral band → sound → … → phone → phoneme → word
Since Deep Learning
→ Hierarchy of representations with increasing level of abstraction
→ Each stage is a kind of trainable feature transform…
as long as you have enough data to train the hierarchy
Trainable feature transform → Trainable feature transform → Trainable feature transform → Trainable feature transform → Decision
Le Cun - Ranzato
How Deep Learning?
Start from raw data OR from a first level representation?
Inputs shown: gray-level pixels, color pixels, images, text, text embedding, waves
Trainable feature transform → Trainable feature transform → Trainable feature transform → Trainable feature transform → Decision
Le Cun - Ranzato
AMAZING BUT…
133
Amazing but… be careful of the bias in
the initial data
134
Amazing but… be careful of the
adversaries (as with any other ML algorithm)
135
136
From Thomas Tanay

137
138
139
adversaries
https://nicholas.carlini.com/code/audio_adversarial_examples/
140
GENERATIVE ADVERSARIAL NETWORKS
141
How to solve it?
Generative Adversarial Networks
142
In the end, it did not solve adversarial examples, but…
Operations between latent representations (manifold)
143
In the end, it did not solve adversarial examples, but…
144
(DENOISING) STACKED AUTOENCODER
145
Autoencoder: unsupervised!
Learning a compact representation of the data (no classification)
First we train an AutoEncoder layer 1.
146
Autoencoder: unsupervised!
Second we train an AutoEncoder layer 2.
147
Autoencoder -> Supervised
Then we train an output layer of non-linearities based on softmax.
148
Autoencoder -> Supervised
Finally, we fine-tune the whole network in a supervised way.
149
Denoising stacked Autoencoder:
unsupervised
Result = a new latent representation
150
Denoising stacked Autoencoder: example
Model: Deep Patient, published in Scientific Reports (Nature Publishing Group), 2016
Stage 1.
Data-mining stage &
Feature extraction:
Using Electronic Health
Records to build a binary
phenotype representation.
Stage 2.
Unsupervised stage:
Mapping the Binary Patient
Representation to a new space called
Deep Patient (or Latent Representation)
using Stacked Denoising Autoencoders.
Stage 3.
Supervised stage:
Labeling the medical target and
training Machine Learning algorithms
on the Latent Representation
for classification and prediction of
the patient's disease.
151
Supervised Image Segmentation Task
Credits Matthieu Cord 152

Partly from COLAH’s Blog http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RECURRENT NEURAL NETWORK
153
Recurrent Neural Networks have loops.
154
An unrolled recurrent neural network.
In the last few years, there has been incredible success applying RNNs to a
variety of problems: speech recognition, language modeling, translation,
image captioning…
155
The Problem of Long-Term Dependencies
If we are trying to predict the last word in “the clouds are in the (sky),” we don’t need any
further context – it’s pretty obvious the next word is going to be sky.
Consider trying to predict the last word in the text “I grew up in France… I speak
fluent (French).” Recent information suggests that the next word is probably the name of a
language, but if we want to narrow down which language, we need the context of France,
from further back.
156
The repeating module in a standard
RNN contains a single layer.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information
(cf. vanishing gradients).
The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found
some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs/GRUs largely avoid this problem!
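The vanishing gradient mentioned above can be observed on a toy model. The sketch below assumes a scalar vanilla RNN h_t = tanh(w * h_{t-1} + x_t), which is a simplification chosen for illustration, and tracks how much the state at step t still depends on the initial state.

```python
import math

def rnn_gradient_through_time(w, inputs, h0=0.0):
    """Scalar vanilla RNN h_t = tanh(w * h_{t-1} + x_t).
    Returns |d h_t / d h_0| for each step t, i.e. the running product of
    the per-step factors w * (1 - h_t^2) given by the chain rule."""
    h, grad, history = h0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

grads = rnn_gradient_through_time(w=0.5, inputs=[0.1] * 50)
```

With |w| < 1 every factor is below 1 in absolute value, so the product shrinks geometrically with the length of the gap. Gated units add a path along which the state is carried forward with much less attenuation.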
157
The repeating module in an LSTM
contains four interacting layers (a GRU uses a similar, simpler gated module).
158
Sequence modeling with RNNs
159
One to Many - Image captioning
[Xu et al. 2015] 160

Many to Many Parallel - Char-rnn
161
Many to Many - Machine translation
speech2text
• Input : Audio mp3 (natural language)
• Output : "How much would a woodchuck chuck"
[Chan et al. 2015] [Olah et Carter 2016]

162
Overview
• Context & Vocabulary

• Deep Learning
163
Based on “Introduction to Deep Learning”, by Professor Qiang Yang,
from The Hong Kong University of Science and Technology
REINFORCEMENT LEARNING
164
Reinforcement Learning
• What’s Reinforcement Learning?

Environment
{Observation, Reward} {Actions}
Agent
• The agent interacts with an environment and learns by maximizing a scalar reward
• No labels or any other supervision
• Previously suffered from hand-crafted states or representations
165
Policies and Value Functions
• Policy $\pi$ is a behavior function selecting actions given states
(it defines the probability of each possible action given the state s)
$a = \arg\max_{\text{all possible actions}} \pi(s)$
• Value function $Q^{\pi}(s, a)$ is the expected total reward r from state s and action a
under policy $\pi$
“How good is action a in state s?”

166
Approaches To Reinforcement Learning
• Policy-based RL
– Search directly for the optimal policy $\pi^*$
– Policy achieving maximum future reward
• Value-based RL
– Estimate the optimal value function $Q^*(s, a)$
– Maximum value achievable for the best policy $\pi^*$
• Model-based RL
– Build a transition model of the environment
– Plan (e.g. by look-ahead) using model
167
Bellman Equation
• Value function can be unrolled recursively
$Q^{\pi}(s, a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a\right]$
$= \mathbb{E}_{s'}\left[r_t + \gamma \max_{a'} Q^{\pi}(s', a') \mid s, a\right]$
• Optimal value function $Q^*(s, a)$ can be unrolled recursively
$Q^*(s, a) = \mathbb{E}_{s'}\left[r_t + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
$V^{\pi}(s') = \max_{a'} Q^{\pi}(s', a')$
• Value iteration algorithms solve the Bellman equation
$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[r_t + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$
This last equation describes how we should ideally update Q, i.e. knowing the state distribution, BUT
we do not know the state distribution Prob(s’ | s,a) and the complexity is not even polynomial in the
number of states
168
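The value iteration update can be sketched on a tiny problem where the model is known. Everything below is a hypothetical toy: a 3-state chain with deterministic transitions (so the expectation over s' reduces to a single term), a reward of 1 for reaching the terminal state, and γ = 0.9.

```python
GAMMA = 0.9
STATES, ACTIONS, TERMINAL = (0, 1, 2), ("left", "right"), 2

def step(state, action):
    """Hypothetical deterministic model: returns (next_state, reward)."""
    nxt = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == TERMINAL else 0.0)

def q_value_iteration(n_iter=50):
    """Apply Q_{i+1}(s, a) = r + gamma * max_a' Q_i(s', a') repeatedly."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iter):
        new_q = {}
        for s in STATES:
            for a in ACTIONS:
                if s == TERMINAL:          # no future reward once the episode ended
                    new_q[s, a] = 0.0
                    continue
                nxt, reward = step(s, a)
                new_q[s, a] = reward + GAMMA * max(q[nxt, b] for b in ACTIONS)
        q = new_q
    return q

q_star = q_value_iteration()
```

In this toy problem the table has only six entries and the model is given. With raw pixels as states neither holds, which is what motivates approximating Q with a network and learning from sampled transitions.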
Deep Reinforcement Learning
• Human
• So what’s DEEP RL?

Environment
{Raw Observation, Reward} {Actions}
169
Deep Reinforcement Learning
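• Represent the value function by a deep Q-network with weights w
$Q(s, a, \mathbf{w}) = Q^{\pi}(s, a)$
• Define the objective function by the mean-squared error in Q-values
$\mathcal{L}(\mathbf{w}_i) = \mathbb{E}\left[\Big(\underbrace{r_t + \gamma \max_{a'} Q(s', a', \mathbf{w}_{i-1})}_{\text{target}} - Q(s, a, \mathbf{w}_i)\Big)^2\right]$
• Leading to the following Q-learning gradient
$\dfrac{\partial \mathcal{L}(\mathbf{w}_i)}{\partial \mathbf{w}} = \mathbb{E}\left[\Big(\underbrace{r_t + \gamma \max_{a'} Q(s', a', \mathbf{w}_{i-1})}_{\text{target}} - Q(s, a, \mathbf{w}_i)\Big) \dfrac{\partial Q(s, a, \mathbf{w}_i)}{\partial \mathbf{w}}\right]$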
170
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in the score for that step
Mnih, Volodymyr, et al. 2015.
171

DQN in Atari : Human Level Control
Mnih, Volodymyr, et al. 2015.
172
AlphaGO: Monte Carlo Tree Search
• MCTS: model look-ahead to reduce the search space by predicting the
opponent’s moves
$V^{\pi}(s) = \max_{a} Q^{\pi}(s, a)$
Silver, David, et al. 2016.

173
AlphaGO: Learning Pipeline
• Combine SL and RL to learn the search direction in MCTS
Silver, David, et al. 2016.
• SL policy Network
– Prior search probability or potential
• Rollout:
– combine with MCTS for quick simulation on leaf node
• Value Network:
– Build the Global feeling on the leaf node situation
174
Learning to Prune: SL Policy Network
• 13-layer CNN
• Input: board position s
• Output: $p_{\sigma}(a \mid s)$, where a is the next move
175
Learning to Prune: RL Policy Network
Self play
• 1 Million samples are used to train.

• RL-Policy network VS SL-Policy network.
– RL-Policy alone wins 80% of games against SL-Policy.
– Combined with MCTS, SL-Policy network is better
• Used to derive the Value Network as the ground truth
– Making enough data for training
176
Learning to Prune: Value Network
• Regression: Similar architecture
• SL Network: Sampling to generate a unique game.

• RL Network: Simulate to get the game’s final result.
• Train: 50 million mini-batches of 32 positions

(30 million unique games)
177
AlphaGO: Evaluation
The version solely using the policy network does not perform any search
Silver, David, et al. 2016.
178
QUESTIONS?
179
