Introduction To Learning
Frederic Precioso
24/01/2019
1
Disclaimer
2
Overview
• Context & Vocabulary
– What does Artificial Intelligence represent?
– Machine Learning vs Data Mining?
– Machine Learning vs Data Science?
– Machine Learning vs Statistics?
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning
3
CONTEXT & VOCABULARY
4
WHAT DOES ARTIFICIAL INTELLIGENCE REPRESENT?
5
What is Artificial intelligence?
• The term Artificial Intelligence, as a research field, was coined at the conference held on
the campus of Dartmouth College in the summer of 1956, even though the idea had been
around since antiquity.
• For instance, in the first manifesto of Artificial Intelligence, "Intelligent Machinery" (1948),
Alan Turing distinguished two different approaches to AI, which may be termed
"top-down" or knowledge-driven AI and "bottom-up" or data-driven AI
6
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
What is Artificial intelligence?
• The two different approaches to AI can be detailed:
– "top-down“ or knowledge-driven AI
• cognition = high-level phenomenon, independent of the low-level details of the implementation
mechanism; first neuron (1943), first neural network machine (1950), neocognitron (1975)
• Evolutionary Algorithms (1954,1957, 1960), Reasoning (1959,1970), Expert Systems (1970),
Logic, Intelligent Agent Systems (1990)…
– "bottom-up“ or data-driven AI
• the opposite approach: start from data to build, incrementally and mathematically, mechanisms
that take decisions
• Machine learning algorithms, Decision Trees (1983), Backpropagation (1984-1986), Random
Forest (1995), Support Vector Machine (1995), Boosting (1995), Deep Learning (1998/2006)…
7
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
What is Artificial intelligence?
• There is thus the "artificial" side, referring to the use of computers or sophisticated
electronic processes, and the "intelligence" side, associated with the goal of imitating
(human) behavior.
10
Machine Learning
[Diagram: an unknown model f(x, α) maps an input x to a prediction ŷ]
Face Detection
Speech Recognition
11
Machine Learning
[Diagram: an unknown model f(x, α) maps an input x to a prediction ŷ]
13
Data Mining Workflow
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
14
Data Mining Workflow
Validation
Data Mining
Model
Patterns
Transformation
Mainly manual
Preprocessing
Selection
Data
Data warehouse
15
Data Mining Workflow
• Filling missing values
• Dealing with outliers
• Sensor failures
• Data entry errors
• Duplicates
• …
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
16
Data Mining Workflow
• Aggregation (sum, average)
• Discretization
• Discrete attribute coding
• Text to numerical attribute
• Scale uniformisation or standardisation
• New variable construction
• …
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
17
Data Mining Workflow
• Regression
• (Supervised) Classification
• Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Novelty/Drift
• …
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
18
Data Mining Workflow
• Evaluation on Validation Set
• Evaluation Measures
• Visualization
• ...
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
19
Data Mining Workflow
• Visualization
• Reporting
• Knowledge
• ...
Validation
Data Mining
Model
Patterns
Transformation
Preprocessing
Selection
Data
Data warehouse
20
Data Mining Workflow
Problems:
• Regression
• (Supervised) Classification
• Density Estimation / Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Anomaly/Novelty/Drift
• …
Possible Solutions:
• Machine Learning: Support Vector Machine, Artificial Neural Network, Boosting, Decision Tree, Random Forest, …
• Statistical Learning: Gaussian Models (GMM), Naïve Bayes, Gaussian processes, …
• Other techniques: Galois Lattice, …
21
MACHINE LEARNING VS DATA SCIENCE?
22
Data Science Stack
Visualization / Reporting / Knowledge (USER side)
• Dashboard (Kibana / Datameer)
• Maps (InstantAtlas, Leaflet, CartoDB…)
• Charts (GoogleCharts, Charts.js…)
• D3.js / Tableau / Flame
Infrastructures (HARDWARE side)
• Grid Computing / HPC
• Cloud / Virtualization
23
MACHINE LEARNING VS STATISTICS?
24
What breed is Dogmatix (Idéfix)?
The illustrations of the slides in this section come from the blog “Bayesian Vitalstatistix:
What Breed of Dog was Dogmatix?” 25
Does any real dog have this height and
weight?
26
What should be the breed of these dogs?
27
An oracle provides me with examples?
28
Statistical solution: Models, Hypotheses…
29
Statistical solution P(height, weight|breed)…
30
Statistical solution P(height, weight|breed)…
31
Statistical solution P(height, weight|breed)…
32
Statistical solution: Bayes,
P(breed|height, weight)…
33
Machine Learning
[Diagram: an unknown model f(x, α) maps an input x to a prediction ŷ]
34
The problem in Machine Learning
• The problem of learning consists in finding the model
(among the {f(x; α)}) which provides the best
approximation ŷ of the true label y given by the Oracle.
• "Best" is defined in terms of minimizing a specific (error)
cost related to your problem/objectives,
Q((x, y), α) ∈ [a; b].
• Examples of cost/loss functions Q: Hinge Loss, Quadratic
Loss, Cross-Entropy Loss, Logistic Loss… (see the sketch below)
35
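The loss functions listed above can be written directly; here is a minimal NumPy sketch (a toy illustration, not code from the lecture) for a label y in {-1, +1} and a raw model score f(x; α):

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss: zero when y*score >= 1, grows linearly otherwise (used by SVMs)."""
    return np.maximum(0.0, 1.0 - y * score)

def quadratic_loss(y, score):
    """Quadratic (squared-error) loss between the label and the score."""
    return 0.5 * (y - score) ** 2

def logistic_loss(y, score):
    """Logistic loss, a smooth surrogate of the 0/1 error for labels in {-1, +1}."""
    return np.log1p(np.exp(-y * score))

y, score = 1.0, 0.3          # true label and model output f(x; alpha)
for Q in (hinge_loss, quadratic_loss, logistic_loss):
    print(Q.__name__, Q(y, score))
```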
Loss in Machine Learning
37
Machine Learning fundamental Hypothesis
39
Vapnik's learning theory (1995)
40
Machine Learning vs Statistics
41
UNSUPERVISED CLASSIFICATION
42
Unsupervised classification
• The system or the operator has only samples, but no labels
• The number of classes and their nature have not been
predetermined
This is unsupervised learning, or clustering.
No expert is required.
The algorithm must discover by itself the more or less
hidden/underlying structure of the data.
43
Clustering
Clustering: Partition a dataset into groups based on the similarity between the instances.
Clustering
44
Clustering Algorithms (Partition)
• Centroid-based (1957,1967)
(E.g. K-means, PAM, CLARA)
• Hierarchical:
• Bottom up approach.
• Top down approach.
45
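For the centroid-based family just mentioned, a minimal sketch with scikit-learn's KMeans on synthetic 2-D blobs (the data and parameter values are illustrative, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs in 2-D; no labels are given to the algorithm.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # learned centroids
print(kmeans.labels_[:10])       # cluster index assigned to the first samples
```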
Clustering Algorithms (Density)
• Density-based
(E.g. DBSCAN, DENCLUE)
• Distribution-based
(E.g. EM, extension of
K-means)
46
Clustering Algorithms (Graph)
• Graph-based
(E.g. Chameleon)
• Grid-based
(E.g. STING, CLIQUE)
47
How many Clusters?
K?
48
Which Algorithm?
51
Clustering validation measures
• Internal validation: Separation? Compactness?
E.g. Dunn, DB, and Silhouette indexes.
• Problems:
– Different performance with respect to the existence of noise, variable
densities, and non-well-separated clusters.
[Chart: validation index plotted against the number of clusters K]
52
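A sketch of using an internal index, here the silhouette, to compare several values of K, assuming scikit-learn; it reproduces the kind of index-versus-K curve the original chart showed:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher silhouette = better separated and more compact clusters.
    print(k, round(silhouette_score(X, labels), 3))
```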
Consensus clustering
Consensus
Clustering
53
Overview
• Unsupervised classification
• Explicit supervised classification
– Decision Trees
– Random Forest
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning
54
EXPLICIT SUPERVISED CLASSIFICATION
55
DECISION TREES
56
Decision tree to decide whether to play
tennis or not
Objective:
• 2 classes: yes & no
• Predict whether a game will be played or not
• Temperature can easily be converted into a numerical attribute
I.H. Witten and E. Frank, “Data Mining”, Morgan Kaufmann Pub., 2000.
57
A simple example
58
Example - explanations
• On the nodes
– Distribution of the variable to predict
• The first node is segmented with the variable outlook (sunny, overcast, rainy):
creation of 3 sub-groups
– The first group contains 5 observations, 2 yes and 3 no
• The tree can be translated into a set of decision rules without losing any
information
– Example: if outlook = sunny and humidity = high then play = yes
59
Basic algorithm
• A = BestAttribute(Examples) // the best attribute is the one giving the most
// homogeneous sub-groups
• Assign A to the root
• For each value of A, create a new sub-node of the root
• Classify all the examples into the sub-nodes
• If all the examples of a sub-node are homogeneous, assign their class
to the node; if not, repeat this process from this node (see the sketch below)
61
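A minimal plain-Python sketch of the recursion just described, using entropy-based information gain as the "best attribute" criterion; the tiny weather-style dataset is hypothetical, not the full Witten & Frank table:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain (most homogeneous sub-nodes)."""
    base = entropy(labels)
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            idx = [i for i, r in enumerate(rows) if r[a] == v]
            remainder += len(idx) / len(rows) * entropy([labels[i] for i in idx])
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:              # homogeneous node: assign its class
        return labels[0]
    if not attributes:                     # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {a: {}}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        tree[a][v] = build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                [b for b in attributes if b != a])
    return tree

# Hypothetical toy data in the spirit of the tennis example.
rows = [{"outlook": "sunny", "humidity": "high"},
        {"outlook": "sunny", "humidity": "normal"},
        {"outlook": "overcast", "humidity": "high"},
        {"outlook": "rainy", "humidity": "high"}]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["outlook", "humidity"]))
```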
Final decision tree
62
Decision tree example
[Figure: a two-dimensional (Income, Debt) space partitioned by the splits
Income > t1, Debt > t2, and Income > t3]
Note: tree decision boundaries are piecewise
linear and axis-parallel
63
Advantages of decision trees
• Simple and easily interpretable rules (unlike implicit decision methods)
• No need to recode heterogeneous data
• Handles missing values
• No model assumptions or presuppositions to meet (iterative method)
• Fast processing time
64
Drawbacks
• The nodes of level n + 1 are highly dependent on those of level
n (the modification of a single variable near the top of the
tree can entirely change the tree)
• We always choose the best local attribute; the best global
information gain is not at all guaranteed
• Learning requires a sufficient number of individuals
• Inefficient when there are many classes
• No convergence…
65
Drawbacks
• We always choose the best local attribute; the best global information
gain is not at all guaranteed
[Example from the figure: Gain(A2) = 0.25, Gain(A5) = 0.18, Gain(A4) = 0.11,
Gain(A1) = 0.07, Gain(A7) = 0.30, Gain(A3) = 0.20]
67
Standard Random Forests
• Bagging
• Random Feature Selection
68
Error of generalization for Random Forest
The generalization error of a Random Forest is bounded by
$$ \mathrm{PE}^{*}_{RF} \;\le\; \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}} $$
where
– $\bar{\rho}$ is the mean correlation between two decision trees
– $s$ is the quality of prediction (strength) of the set of decision trees
69
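A hedged scikit-learn sketch of the two Random Forest ingredients above, bagging and random feature selection; `max_features` is the knob that lowers the inter-tree correlation appearing in the bound above. The dataset and parameter values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=200,      # number of bagged trees
    max_features="sqrt",   # random feature selection at each split (decorrelates the trees)
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the data
    oob_score=True,        # out-of-bag estimate of the generalization error
    random_state=0,
).fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3))
```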
Success story: Kinect
https://www.youtube.com/watch?v=lntbRsi8lU8
70
Success story: Kinect
71
Success story: Kinect
72
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
– Multi-Layer Perceptron
– Support Vector Machine
• Deep Learning
• Reinforcement Learning
73
IMPLICIT SUPERVISED CLASSIFICATION
74
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)
75
The curse of dimensionality [Bellman, 1956]
76
MULTI-LAYER PERCEPTRON
77
First, biological neurons
Before we study artificial neurons, let’s look at a biological neuron
78
Figure from K.Gurney, An Introduction to Neural Networks
First, biological neurons
79
Then, artificial neurons
Pitts & McCulloch (1943): binary inputs & the activation function f is a thresholding
$y = s\left(\sum_i w_i x_i\right)$
82
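A minimal NumPy sketch (a toy example, not from the slides) of such a thresholding unit trained with the classic perceptron learning rule on linearly separable data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Linearly separable toy data: label +1 if x1 + x2 > 1, else -1.
X = rng.uniform(-1, 2, size=(200, 2))
y = np.where(X.sum(axis=1) > 1, 1, -1)

w = np.zeros(2)
b = 0.0
for _ in range(20):                      # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # misclassified: move the hyperplane
            w += yi * xi
            b += yi

pred = np.sign(X @ w + b)                # y = sign(sum_i w_i x_i + b)
print("training accuracy:", (pred == y).mean())
```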
Single Perceptron Unit
• A perceptron can only learn linear functions [Minsky and Papert, 1969]
83
Multi-Layer Perceptron
• Training a neural network [Rumelhart et al. / Yann Le Cun et al. 1985]
• Unknown parameters: weights on the synapses
• Minimizing a cost function: some metric between the predicted output
and the given output
86
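A hedged sketch of training a small multi-layer perceptron by minimizing such a cost over the synaptic weights, using scikit-learn's MLPClassifier; the dataset and hyper-parameters are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # non-linearly separable data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers of 16 units
                    activation="relu",
                    solver="adam",                # gradient-based minimization of the loss
                    max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
print("test accuracy:", round(mlp.score(X_te, y_te), 3))
```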
ALVINN (1989)
87
Multi-Layer Perceptron
Theorem [Cybenko, 1989]
• A neural network with one single hidden layer is a universal
approximator: it can represent any continuous function on compact
subsets of Rn
• 2 layers are enough ... theoretically:
"…networks with one internal layer and an arbitrary continuous sigmoidal
function can approximate continuous functions with arbitrary precision
providing that no constraints are placed on the number of nodes or the size
of the weights"
• But no efficient learning rule is known, and the size of the hidden layer grows
exponentially with the complexity of the problem (which is unknown
beforehand) to reach a given error ε; the layer must be infinite for an error of 0.
88
SUPPORT VECTOR MACHINE
Partly based on “A Gentle Introduction to Support Vector Machines in Biomedicine”, A. Statnikov, D.
Hardin, I. Guyon, C. F. Aliferis, AMIA 2010.
89
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)
90
The curse of dimensionality [Bellman, 1956]
91
SVM vs ANN
"SVMs have been developed in the reverse order to the development
of neural networks (NNs). SVMs evolved from the sound theory to
the implementation and experiments, while the NNs followed more
heuristic path, from applications and extensive experimentation to
the theory."
92
The Support Vector Machine (SVM)
• The support vector machine (SVM) is a binary classification
algorithm.
• Extensions of the basic SVM algorithm can be applied to solve
problems of regression, feature selection, novelty/outlier
detection, and clustering.
• SVMs are important because of (a) theoretical reasons:
- Robust to a very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
93
Linearly separable data, "Hard-margin" linear SVM
Given training data: $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N \in \mathbb{R}^n$ with labels $y_1, y_2, \dots, y_N \in \{-1, +1\}$
• We want to find a classifier (hyperplane) to separate the negative objects from the positive ones.
• An infinite number of such hyperplanes exist.
• SVMs find the hyperplane that maximizes the gap between the data points on the boundaries (the so-called "support vectors").
• If the points on the boundaries are not informative (e.g., due to noise), SVMs may not do well.
[Figure: negative objects (y = -1) and positive objects (y = +1) separated by a maximum-margin hyperplane]
94
Kernel Trick
https://www.youtube.com/watch?v=-Z4aojJ-pdg
95
Popular kernels
A kernel is a dot product in some feature space:
$K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)$
Examples:
Linear kernel: $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j$
Gaussian kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \,\|\vec{x}_i - \vec{x}_j\|^{2})$
Exponential kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \,\|\vec{x}_i - \vec{x}_j\|)$
Polynomial kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^{q}$
Hybrid kernel: $K(\vec{x}_i, \vec{x}_j) = (p + \vec{x}_i \cdot \vec{x}_j)^{q} \exp(-\gamma \,\|\vec{x}_i - \vec{x}_j\|^{2})$
Sigmoidal kernel: $K(\vec{x}_i, \vec{x}_j) = \tanh(k\,\vec{x}_i \cdot \vec{x}_j - \delta)$
96
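A sketch comparing some of the kernels above inside the same SVM classifier, assuming scikit-learn; the parameter values are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
for kernel, params in [("linear", {}),
                       ("rbf", {"gamma": 1.0}),                 # Gaussian kernel
                       ("poly", {"degree": 3, "coef0": 1.0})]:  # polynomial kernel
    clf = SVC(kernel=kernel, C=1.0, **params)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(score, 3))
```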
How to build a kernel function ?
$k(\mathbf{x}, \mathbf{y}) = k_1(\mathbf{x}, \mathbf{y}) + k_2(\mathbf{x}, \mathbf{y})$
$k(\mathbf{x}, \mathbf{y}) = \alpha \cdot k_1(\mathbf{x}, \mathbf{y})$
$k(\mathbf{x}, \mathbf{y}) = k_1(\mathbf{x}, \mathbf{y}) \cdot k_2(\mathbf{x}, \mathbf{y})$
$k(\mathbf{x}, \mathbf{y}) = f(\mathbf{x}) \cdot f(\mathbf{y})$, with $f(\cdot)$ a function from the input space to $\mathbb{R}$
$k(\mathbf{x}, \mathbf{y}) = k_3(\Phi(\mathbf{x}), \Phi(\mathbf{y}))$
$k(\mathbf{x}, \mathbf{y}) = \mathbf{x} B \mathbf{y}^{T}$, with $B$ an $N \times N$ symmetric, positive semi-definite matrix
97
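Following these closure rules, a hedged sketch of building a new kernel as a sum and a product of valid kernels and passing it to scikit-learn's SVC as a callable that returns the Gram matrix:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def linear_k(X, Y):
    return X @ Y.T

def gaussian_k(X, Y, gamma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * sq)

def combined_k(X, Y):
    # k1 + k2 and k1 * k2 both remain valid kernels (closure rules above).
    return linear_k(X, Y) + 0.5 * linear_k(X, Y) * gaussian_k(X, Y)

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
clf = SVC(kernel=combined_k, C=1.0)        # callable kernel returning the Gram matrix
print(round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```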
Complex kernels on Video Tubes
98
Complex kernels on Video Tubes
99
Classification
100
Classification
101
SVMs are ANNs
102
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
– Convolutional Neural Networks (CNN)
– Generative Adversarial Networks (GAN)
– Stacked Denoising AutoEncoder (SDAE)
– Recurrent Neural Networks (RNN)
• Reinforcement Learning
103
DEEP LEARNING
104
Deep representation origins
• Theorem, Cybenko (1989): A neural network with one single hidden layer
is a universal "approximator": it can represent any continuous function
on compact subsets of Rn. 2 layers are enough… but the hidden layer size
may be exponential
[Figure: a single hidden layer whose width grows exponentially]
105
Deep representation origins
• Theorem, Hastad (1986), Bengio et al. (2007): Functions representable
compactly with k layers may require exponential size with k-1 layers
106
Deep representation origins
• Theorem, Hastad (1986), Bengio et al. (2007): Functions representable
compactly with k layers may require exponential size with k-1 layers
[Figure: the same function represented with 2 layers requires exponential width]
107
Enabling factors
• Why now? Before 2006, training deep networks was
unsuccessful because of practical aspects:
– faster CPUs
– parallel CPU architectures
– advent of GPU computing
• Results…
– 2009, sound (Interspeech): ~24% improvement
– 2011, text: ~15% improvement, without any linguistic knowledge at all
– 2012, images (ImageNet): ~20% improvement
108
Structure the network?
• Can we impose a structure that reduces the exploration space and
provides useful properties (invariance, robustness…)?
$y = s\big(w_{13}\, s(w_{11} x_1 + w_{21} x_2 + w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 + w_{02}) + w_{03}\big)$
109
CONVOLUTIONAL NEURAL NETWORKS
(AKA CNN, CONVNET)
110
Convolutional neural network
111
Deep representation by CNN
112
Convolution in nature
113
Convolution
114
Convolution in nature
1. Hubel and Wiesel worked on the visual cortex of cats (1962)
2. Convolution
3. Pooling
115
Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
116
Convolution = Perceptron
$y = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$
$\mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i$, with weights $\mathbf{w} = (w_1, w_2, w_3, \dots)$ and input $\mathbf{x} = (x_1, x_2, x_3, \dots)$
117
If convolution = perceptron
1. Convolution
2. Pooling
118
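A minimal NumPy sketch of the two operations named above: a 2-D convolution (the same perceptron-style weighted sum slid over the image) followed by max-pooling; the image, the filter, and the sizes are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: at every position, the same weighted sum w . x."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)      # a hand-crafted vertical-edge detector
fmap = np.maximum(0.0, conv2d(image, edge_filter))  # convolution + thresholding
print(max_pool(fmap).shape)                         # (3, 3) pooled feature map
```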
Deep representation by CNN
119
Deep representation by CNN
120
Deep representation by CNN
121
Deep representation by CNN
122
Transfer Learning!!
123
Endoscopic Vision Challenge 2017
Surgical Workflow Analysis in the SensorOR
10th of September
Quebec, Canada
124
Clinical context: Laparoscopic Surgery
Surgical Workflow Analysis
125
Task
Phase segmentation of laparoscopic surgeries
Video
Surgical
Devices
126
Dataset
30 colorectal laparoscopies
• Complex type of operation
• Duration: 1.6h – 4.9h (avg 3.2h)
• 3 different sub-types:
– 10x Proctocolectomy
– 10x Rectal resection
– 10x Sigmoid resection
127
Annotation
Annotated by surgical experts, 13 different phases
128
Method
Temporal Network: 3x3 conv, 3x3 conv, 3x3 conv, FC on a 224x224x20 input, producing a 512-d feature vector (FV)
Spatial Network: Spatial CNN on a 224x224x3 input
The spatial features (1024) and temporal features (512) feed an LSTM (512) that predicts the phase over time
Temporal network accuracy:
• Rectal resection: 49.88%
• Sigmoid resection: 48.56%
• Proctocolectomy: 46.96%
[Figure: architecture diagram and predicted phase index over time]
129
And the winner is...
2. Team NCT: Data used: Video + Device; Average Jaccard: 38%; Median Jaccard: 38%; Accuracy: 60%
130
Why Deep Learning?
Before Deep Learning
Hand-crafted/Engineered features
– Image recognition
Pixel → edge → texton → patterns → part → object
– Text
Character → word → word group → clause → sentence → story
– Speech
Sample → spectral band → sound → … → phone → phoneme → word
Since Deep Learning
o Hierarchy of representations with increasing level of abstraction
o Each stage is a kind of trainable feature transform…
as long as you have enough data to train the hierarchy
[Diagram: a chain of trainable feature transforms followed by a decision stage]
Le Cun - Ranzato
How Deep Learning?
Start from raw data OR from a first level representation?
[Diagram: possible inputs (gray-level pixels, color pixels, text / text embedding, waves, images) feeding a chain of trainable feature transforms followed by a decision stage]
Le Cun - Ranzato
AMAZING BUT…
133
Amazing but… be careful of the bias in
the initial data
134
Amazing but…be careful of the
adversaries (as any other ML algorithms)
135
Amazing but…be careful of the
adversaries (as any other ML algorithms)
136
Amazing but…be careful of the
adversaries (as any other ML algorithms)
138
Amazing but…be careful of the
adversaries (as any other ML algorithms)
139
Amazing but…be careful of the
adversaries
https://nicholas.carlini.com/code/audio_adversarial_examples/
140
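As a hedged illustration of how such adversarial examples are crafted, here is a fast-gradient-sign-style perturbation computed on a plain logistic-regression model (a simplification: the slides attack deep networks and audio, while this toy uses scikit-learn's digits):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0                                     # scale pixels to [0, 1]
mask = (y == 0) | (y == 1)                       # keep a binary problem
X, y = X[mask], y[mask]
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0:1]                                       # one image of class 0
w = clf.coef_[0]
p = clf.predict_proba(x)[0, 1]
grad = (p - y[0]) * w                            # gradient of the logistic loss w.r.t. the input
x_adv = np.clip(x + 0.2 * np.sign(grad), 0, 1)   # small sign perturbation of the pixels

print("clean prediction:", clf.predict(x)[0],
      " adversarial prediction:", clf.predict(x_adv)[0])
```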
GENERATIVE ADVERSARIAL NETWORKS
141
How to solve it?
Generative Adversarial Networks
142
In the end, it did not solve adversarial examples, but…
143
In the end, it did not solve adversarial examples, but…
144
(DENOISING) STACKED AUTOENCODER
145
Autoencoder: unsupervised!
Learning a compact representation of the data (no classification)
First we train an AutoEncoder layer 1.
146
Autoencoder: unsupervised!
Second we train an AutoEncoder layer 2.
147
Autoencoder -> Supervised
148
Autoencoder -> Supervised
Finally, we fine-tune the whole network in a supervised way.
149
Denoising stacked Autoencoder:
unsupervised
Result = a new latent representation
150
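A minimal sketch of the denoising idea using a single-hidden-layer MLP regressor as the autoencoder (a simplification of the stacked version in the slides; the bottleneck size and noise level are arbitrary choices), assuming scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X, _ = load_digits(return_X_y=True)
X = X / 16.0                                              # pixels in [0, 1]
rng = np.random.default_rng(0)
X_noisy = np.clip(X + rng.normal(0, 0.3, X.shape), 0, 1)  # corrupt the input

# Train the network to reconstruct the clean input from its noisy version.
ae = MLPRegressor(hidden_layer_sizes=(32,),   # 32-d bottleneck = latent representation
                  activation="relu",
                  max_iter=2000,
                  random_state=0).fit(X_noisy, X)

# The new latent representation: activations of the hidden (bottleneck) layer.
latent = np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])
print(latent.shape)                           # (1797, 32)
```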
Denoising stacked Autoencoder: example
Model: Deep Patient, published in Nature, 2016.
Stage 1.
Data-mining stage & feature extraction:
mining Electronic Health Records to build a binary phenotype representation.
Stage 2.
Unsupervised stage:
mapping the binary patient representation to a new space called
Deep Patient (or latent representation),
using stacked denoising autoencoders.
Stage 3.
Supervised stage:
labeling the medical targets and training on the latent representation
with machine learning algorithms
for classification and prediction of patients' diseases.
151
Supervised Image Segmentation Task
153
Recurrent Neural Networks have loops.
154
An unrolled recurrent neural network.
In the last few years, there has been incredible success in applying RNNs to a
variety of problems: speech recognition, language modeling, translation,
image captioning…
155
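A minimal NumPy sketch of the loop being unrolled: the same weights are applied at every time step and the hidden state carries information forward (the dimensions and random weights are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, T = 3, 5, 4                    # input size, hidden size, sequence length
W_xh = rng.normal(size=(d_hidden, d_in)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

x_seq = rng.normal(size=(T, d_in))             # one input sequence
h = np.zeros(d_hidden)                         # initial hidden state
for t in range(T):                             # the "loop" that gets unrolled over time
    h = np.tanh(W_xh @ x_seq[t] + W_hh @ h + b_h)
    print(f"t={t}, h={np.round(h, 2)}")
```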
The Problem of Long-Term Dependencies
If we are trying to predict the last word in “the clouds are in the (sky),” we don’t need any
further context – it’s pretty obvious the next word is going to be sky.
Consider trying to predict the last word in the text “I grew up in France… I speak
fluent (French).” Recent information suggests that the next word is probably the name of a
language, but if we want to narrow down which language, we need the context of France,
from further back. 156
The repeating module in a standard
RNN contains a single layer.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information
(cf. vanishing gradients)
The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found
some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs/GRUs do not have this problem!
157
The repeating module in an LSTM/GRU
contains four interacting layers.
158
Sequence modeling with RNNs
159
One to Many - Image captioning
161
Many to Many - Machine translation
speech2text
• Input : Audio mp3 (natural language)
• Output : "How much would a woodchuck chuck"
163
Based on “Introduction to Deep Learning”, by Professor Qiang Yang,
from The Hong Kong University of Science and Technology
REINFORCEMENT LEARNING
164
Reinforcement Learning
Agent
165
Policies and Value Functions
• Policy π is a behavior function selecting actions given states
(it defines the probability of each possible action given the state s)
$a = \operatorname{argmax}_{\text{possible actions}} \pi(s)$
• Value function $Q^{\pi}(s, a)$ is the expected total reward $r$ from state s and action a
under policy π
167
Bellman Equation
• The value function can be unrolled recursively:
$$Q^{\pi}(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots \mid s, a \,\right]
= \mathbb{E}_{s'}\left[\, r_{t} + \gamma \max_{a'} Q^{\pi}(s', a') \mid s, a \,\right]$$
This last equation corresponds to how we should update Q ideally, i.e. knowing the state distribution, BUT
we do not know the state distribution Prob(s' | s, a) and the complexity is not even polynomial in the
number of states.
168
Deep Reinforcement Learning
• Human
169
Deep Reinforcement Learning
• Represent the value function by a deep Q-network with weights w:
$Q(s, a, \mathbf{w}) \approx Q^{\pi}(s, a)$
• Define the objective function by the mean-squared error in Q-values:
$$\mathcal{L}(\mathbf{w}) = \mathbb{E}\left[\Big( \underbrace{r_{t} + \gamma \max_{a'} Q(s', a', \mathbf{w}^{-})}_{\text{target}} - Q(s, a, \mathbf{w}) \Big)^{2}\right]$$
• with the corresponding gradient:
$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}} = \mathbb{E}\left[\Big( \underbrace{r_{t} + \gamma \max_{a'} Q(s', a', \mathbf{w}^{-})}_{\text{target}} - Q(s, a, \mathbf{w}) \Big) \frac{\partial Q(s, a, \mathbf{w})}{\partial \mathbf{w}}\right]$$
170
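A hedged sketch of the same update rule in its simplest, tabular form (exact Q-learning on an invented toy chain environment rather than a deep Q-network):

```python
import numpy as np

n_states, n_actions = 5, 2          # toy chain: action 1 moves right, action 0 moves left
gamma, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0   # reward only in the right-most state
    return s_next, reward

for episode in range(500):
    s = 0
    for _ in range(20):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        target = r + gamma * np.max(Q[s_next])        # r_t + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (target - Q[s, a])         # move Q(s, a) toward the target
        s = s_next

print(np.round(Q, 2))   # the learned policy is the argmax over actions in each row
```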
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in the score for that step
172
AlphaGO: Monte Carlo Tree Search
• MCTS: model look-ahead to reduce the search space by predicting the
opponent's moves
$V^{\pi}(s) = \max_{a} Q^{\pi}(s, a)$
• SL policy Network
– Prior search probability or potential
• Rollout:
– combine with MCTS for quick simulation on leaf node
• Value Network:
– build a global evaluation ("feeling") of the leaf-node situation
174
Learning to Prune: SL Policy Network
• 13-layer CNN
• Input: board position $s$
• Output: $p_{\sigma}(a \mid s)$, where $a$ is the next move
175
Learning to Prune: RL Policy Network
Self play
177
AlphaGO: Evaluation
The version solely using the policy network does not perform any search
Silver, David, et al. 2016. 178
QUESTIONS?
179