Convnets From Thesis
6 Computer vision
Being able to see is arguably the first step for agents to interact with any world—real or virtual.
Of course, many virtual worlds can provide ground truths about the entire environment,
but that would be missing the point for AI research, because real-world problems do not
provide such luxuries. Even though humans or other animals may use sound and smell in
rather impressive ways to navigate the planet [36], neither of these options is currently very
feasible when it comes to virtual worlds, especially not Minecraft. If then we are to create an
intelligent agent and test its abilities within Minecraft, our best bet is to begin implementing
vision within this agent, such that it can ‘see’ what it is doing. ‘See’ within quotation marks
because this process should not be interpreted as a venture into how human vision works: the
point here is to bootstrap the agent to maneuver within the virtual world, such that a variety
of experiments can be run, either to improve that vision or to use the current visual capacities
for other experiments, such as planning routes—experiments which might, in turn, give a view
into human cognition. If the goal is to have computers process their surroundings at a high
level, then it is not clear that we have to build something biologically plausible, so long as
that goal is met. Of course this may be put forward as a defence for all of narrow AI, but
computer vision does yield rather impressive results, such as outperforming humans
on ImageNet with a 3.6 per cent error rate [28], where humans get a 5-10 per cent error
rate (some have actually tried) [33]. So, notwithstanding the limitations of deep learning
outlined in the introduction, deep learning, and especially its convolutional variant, is clearly
an excellent tool for visual processing. For this reason, the current study ventures to apply
a convolutional neural network to process vision within Malmo. In fact, while some of the
limitations of neural networks are addressed in recent work by Geoffrey Hinton on Capsule
Networks [61], his solution still makes important use of convolutions, adding to the credibility
of this approach to computer vision—even when the computer vision techniques are being
revised to address limitations.
Computer vision tasks can be categorised into a few subsets: localisation, detection, seg-
mentation and classification. Localisation attempts to find where in an image an object is
located. Detection is localisation for multiple objects. Segmentation attempts to find the ex-
act outline of an object in the environment. And classification, which is what the ImageNet
contests are about, deals with assigning objects to a finite set of classes. Because classifica-
tion is currently the most straightforward to achieve, the present study focuses on this as a
way to bootstrap vision in MicroPsi agents, though other techniques can later be added to
this pipeline to enhance the agent’s visual capacities. The current paper bootstraps vision by
using a MobileNet convolutional neural network, which is optimised for efficiency. Before I
explain how this works, I have to explain how convolutional neural networks work. General
information about convolutional neural networks has been gathered from [56] and [41].
6.1 Convolutional Neural Networks
Convolution and convolutional neural networks are arguably the most important concepts
in deep learning at the moment. They can be most succinctly characterised by the creation of
feature maps: which features appear where in an image, and how do they combine to form objects?
The neural network used by Krizhevsky and colleagues in 2012 [37], which revived the field of
deep learning, made use of a deep convolutional architecture. It won the yearly ImageNet
competition, dropping the error rate from 26 per cent to 15 per cent, a major achievement at
the time. Convolutional neural networks are widely used for a variety of applications, but are
most notably good at processing images—as is clear from the fact that they outperform humans
at this task, at least on the ImageNet test.
Convolutional neural networks take their inspiration from biology—though ‘inspiration’
is again the operative word. Hubel and Wiesel [30] showed in an experiment that spe-
cialised neurons fired only in response to edges of a certain orientation in an image or video.
Some neurons respond only to diagonally oriented lines and others only to horizontally oriented ones. Hubel
and Wiesel postulated that these neurons were organised in a columnar architecture which,
meshed together, would form the brain’s visual perception. The gist of their findings and the
basis of the convolutional inspiration is that the brain carries specialised feature detectors that
are organised structurally to be used in visual perception.
Convolution is a mathematical operation that mixes information from two sources according
to a rule that specifies how the mix is effected. We can apply this to images in two
dimensions, the width and the height. The first source of information that we have is the
pixel information of the image. Images are typically stored as width x height x 3, where the 3
stands for the RGB values, or colour channels. Each pixel consists of a value between 0 and
255 for every colour channel (red, green and blue). This means that for each image we get
three matrices of size width x height, one for each colour channel. Channels are also referred
to as dimensions.
The second source of information is a convolutional kernel, or filter, which is a single ma-
trix of numbers (also known as the weights) of a certain predefined width and height. The
numbers in the kernel are organised in a way that forms a recipe that is applied to the input
image. Through convolution we can then go on to mix the information of the input image
with the information in the kernel. You do this by applying the kernel, which is for example
3x3, to a similar sized spot on the image, so a spot of 3x3 pixels in this case, and then perform-
ing element-wise multiplication between that part of the image and the convolution kernel. You
then take the sum and get one datum in the feature map. After this you slide, or rather convolve,
the kernel over the complete image: you move it, say, one step to the right and repeat the process,
and at the end of a row you move one step down, continuing until you have a full feature map.
The stride is the step size with which you convolve over the image, meaning that you could
for example set the stride to two, in which case the kernel moves two steps to the right and down
at a time. This means that there will be less overlap between the patches the kernel sees, and the
feature map will be smaller. This keeps the input to the next layer smaller, which reduces computational costs.
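To make the sliding-and-summing concrete, the following sketch (a toy illustration in Python with NumPy, not the implementation used in this thesis) convolves a single-channel image with a single kernel at a configurable stride. Like most deep learning libraries, it does not flip the kernel, which is strictly speaking cross-correlation:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Naive single-channel convolution: slide the kernel over the image,
    multiply element-wise, sum, and store one datum per position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

# A 3x3 kernel on a 6x6 image with stride 1 yields a 4x4 feature map.
image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)
print(convolve2d(image, kernel, stride=1).shape)  # (4, 4)
print(convolve2d(image, kernel, stride=2).shape)  # (2, 2) -- larger stride, smaller map
```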
Padding adds a few pixels (generally set to 0) at the edges of the image, such that the
corner pixels feature in as many computations as the other pixels. Take for example the upper left
pixel. At any stride, and without padding, this pixel will only be part of one computation, in
contrast to many of the other pixels. To give the border pixels an even weight, you add padding based on
the size of the filter, so that they get a larger vote. This can be important if there is relevant
information at the edges of the image, and it is considered good practice.
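As a small illustration (a sketch with placeholder values, not anything taken from this thesis), zero-padding can be added with NumPy before convolving, and the usual output-size relation per spatial dimension is (W - F + 2P) / S + 1 for input width W, filter width F, padding P and stride S:

```python
import numpy as np

image = np.arange(36).reshape(6, 6)   # a 6x6 single-channel image
padded = np.pad(image, pad_width=1)   # add a 1-pixel border of zeros
print(padded.shape)                   # (8, 8)

# Output size per spatial dimension: (W - F + 2P) / S + 1.
# With W=6, F=3, P=1, S=1 this gives 6, so the feature map stays the same
# size as the input and corner pixels take part in as many kernel
# positions as interior pixels.
W, F, P, S = 6, 3, 1, 1
print((W - F + 2 * P) // S + 1)       # 6
```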
The feature map basically lays out where a feature is located within an image. This is further
illustrated by Figure 6.1 and Figure 6.2.
Figure 6.3: This particular filter did not detect anything at this point.
Consider the segment of the drawing of the mouse in Figure 6.1. Imagine sliding a
kernel over this part of the image. The filter “matches” very well with the
input, and it will output a large number on its feature map for this location. This means that
the CNN has detected that feature at that point in the picture. This is in contrast to other
places, as can be seen in Figure 6.3. The curve was not detected in this part of the image, and
thus the output was 0—or generally low.
We can now understand the convolutional kernel as a feature detector similar to an
orientation-selective neuron. Similar to how the brain does it—do take such claims with reser-
vation—you can organise these kernels structurally in a neural network, stacking feature
detectors upon feature detectors upon even more feature detectors, so as to
recognise more complex features and, in the end, full objects or scenes. The beauty of machine
learning is that the feature detectors need not be pre-defined, and the numbers within the
kernel can be learned through a technique called backpropagation. In a typical pass through
a CNN an image goes through a series of convolutional, non-linear, pooling and fully con-
nected layers. A classical layout would look something like:
1. Convolutional layer
2. Pooling layer
3. Convolutional layer
4. Pooling layer
For the first layer, we could take as our nodes, for example, three 5x5x3 filters, each taking
5 by 5 pixels at a time for each colour channel, and each looking for different features. The
dimensions of the filter have to match the dimensions of the input, so an RGB image will have
filters with a dimension of three. The output of such a layer will have as many dimensions as
there are filters in the layer. A 3x3x3 filter on a 6x6x3 image will end up as a 4x4x1 feature-map.
The output of such a layer, to which an extra bias weight is added, goes through a ReLU ac-
tivation function. The ReLU activation function introduces nonlinearity to a network that
thus far has only computed linear operations in the convolutional layers (multiplication and
summation). Historically, sigmoid or tanh functions were used, but researchers found
that the ReLU works a lot better because it is computationally efficient without performing
any worse than the other options [52]. What a ReLU basically does is set all negative acti-
vations to 0. The input to the second layer is then the set of low-level feature maps that the first
convolutional layer extracted from the picture, allowing the second layer to pick up higher-order
features by convolving over this output. These features will typically be circles, squares,
half-circles, half-squares and the like.
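As a minimal illustration (a sketch, not code from this thesis), the ReLU simply clamps negative values to zero:

```python
import numpy as np

def relu(x):
    """Set all negative activations to 0, leaving positive values untouched."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```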
Pooling layers are there to reduce the size of the output by for example only taking the
largest number in each ‘block’, which is a segment of the input. Once we know that a partic-
ular feature is present in the input, its exact location is not as important as its relative position
with regard to other features. This pooling allows us to reduce the size of the output and
save on computing costs.
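A 2x2 max pooling with stride 2 can be sketched as follows (an illustrative snippet with made-up numbers, not thesis code): only the largest number in each non-overlapping 2x2 block is kept.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep only the largest value in each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return pooled

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]])
print(max_pool_2x2(fm))  # [[4. 2.]
                         #  [2. 8.]]
```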
At the end of the convolutional neural network is a fully connected layer, a la ‘normal’ neu-
ral networks. The feature maps have by then become so small that their contents
are squeezed into a one-dimensional vector, which is fed into a fully connected layer. This
fully connected layer outputs an N-dimensional vector, with N being the number of discrete,
mutually exclusive alternatives (classes) the network has to choose from. A softmax function
then turns this vector into probabilities for each of these classes, adding up to 1.
This means that the fully connected layer is there to see which category correlates most
strongly with the high-level features that the network presents to it. An overview of a typical
architecture can be found in Figure 6.4.
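A minimal sketch of such a classical layout in TensorFlow's Keras API is given below; the filter counts, kernel sizes, input size and the ten-class output are arbitrary placeholder choices, not the configuration used in this thesis:

```python
import tensorflow as tf

# Convolution -> ReLU -> pooling, twice, then flatten and classify with softmax.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (5, 5), activation='relu', input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                        # squeeze feature maps into one vector
    tf.keras.layers.Dense(10, activation='softmax')   # N=10 mutually exclusive classes
])
model.summary()
```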
Another tool to use in convolutional neural networks is 1x1 convolutions, which are some-
times also referred to as ‘network in network’ [44]. These function not to reduce the width
and height, but the dimensionality of their input, and are often used to compute reductions
before other, more expensive convolutions. The way this works is that a 1x1 filter containing,
say, the number 1 simply recreates its input with every feature kept the same. But the filter
spans all dimensions of the input and collapses these into 1, resulting in an output that
is exactly as many dimensions deep as there are filters that you use. So, if you have an input
of, say, 28x28x192, on which you apply 16 1x1x192 filters, then this results in a 28x28x16 out-
put. You could also reduce the channels by using a similar number of 3x3 or 5x5 filters, but
these are much more expensive, so using 1x1 filters to do this before applying the larger filters
can greatly reduce the computational cost and allows deeper networks to be built without
too much expense. By applying an activation function to a 1x1 kernel’s output, you can also
introduce more nonlinearity in the network, allowing it to learn more complex functions,
without too much extra expense.
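The dimensionality reduction from the 28x28x192 example can be sketched as follows (a hedged illustration using Keras, not the thesis implementation):

```python
import tensorflow as tf

# 16 filters of size 1x1x192 collapse the 192 channels into 16,
# while leaving width and height untouched: 28x28x192 -> 28x28x16.
inputs = tf.keras.Input(shape=(28, 28, 192))
reduced = tf.keras.layers.Conv2D(16, kernel_size=1, activation='relu')(inputs)
print(reduced.shape)  # (None, 28, 28, 16)
```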
A CNN forward propagates to produce some output, and we test the quality of that out-
put by comparing it to the ground truth on a labelled dataset. The way this works is by first
taking a loss function, such as the mean squared error, which measures the distance between
the network’s output and the ground truth. This cost function can be seen graphically as
a landscape full of hills and valleys, where the network’s current weights determine its position
within that landscape. What we wish to do is reach the lowest point in that landscape, where
the distance between the output of the network and the ground truth is smallest. This can be done through
backpropagation, which performs a backward pass through the network whereby weights on
the connections between the nodes are changed according to their respective bearing on the
output value. These weights of the network are, in our case, the numbers in our filters. This
means that this algorithm actually constructs feature detectors automatically. This agnosti-
cism is useful because it doesn’t require the programmer to build in any innate structure or
feature detectors. Backpropagation finds this minimum using gradient descent. Gradient de-
scent means that we can move down a hill in the landscape in incremental steps. The size of
the steps we take is expressed in the learning rate: if it is set too small, the network converges
too slowly, but if it is set too large, the process may overshoot its target and never settle into
the valley, as it keeps stepping back and forth over the lowest point.
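The effect of the learning rate can be illustrated with a toy one-dimensional loss landscape (a sketch with made-up numbers, not part of the thesis experiments):

```python
# Gradient descent on the toy loss L(w) = (w - 3)^2, whose minimum lies at w = 3.
def gradient(w):
    return 2 * (w - 3)          # dL/dw

def descend(learning_rate, steps=20, w=0.0):
    for _ in range(steps):
        w = w - learning_rate * gradient(w)  # step downhill, scaled by the learning rate
    return w

print(descend(0.01))   # too small: still far from 3 after 20 steps
print(descend(0.1))    # reasonable: close to 3
print(descend(1.1))    # too large: overshoots back and forth and diverges
```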
We start off by setting random weights for the network, for example through Xavier ini-
tialization [23]. After taking a forward pass on all of our training examples, we compare the
output of the network to the labels and sum the losses to get the total loss of the network. We want
to compute the contribution of each weight in the network to this error. A neural network
is essentially a large composite function, where each function, each layer, multiplies a weight
matrix with the activations of the previous function, the previous layer. Because of this, we
can use the chain rule to compute gradients for the whole network. The chain rule works by
taking the derivative of the outside function, leaving the inner function alone, and then mul-
tiplying by the derivative of that inner function. You do this for each layer in the network. So
to perform backpropagation, we can iteratively apply the chain rule to calculate error deriva-
tives for every weight and bias in the network and update each weight and bias in the opposite
direction of the gradient. We update the weights slightly (by an amount scaled by our learn-
ing rate) and then do the complete forward pass through every training example again. This
can be very slow, as we have to go through the entire training set each time. A solution to this is to
use Stochastic Gradient Descent [16], where only one training example is taken at a time and
the weights are adjusted according to the error on that training example. On most occasions this
works just as well as mini-batch gradient descent, which takes batches of multiple examples,
and even full-batch gradient descent, but it is much quicker.
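A minimal sketch of stochastic gradient descent is shown below (toy data and a single linear weight, purely for illustration; the thesis itself relies on TensorFlow for training):

```python
import numpy as np

# Toy dataset: y = 2x plus a little noise; we try to recover the weight 2.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2 * xs + 0.01 * rng.normal(size=100)

w, learning_rate = 0.0, 0.1
for epoch in range(5):
    for x, y in zip(xs, ys):        # one training example at a time
        prediction = w * x
        error = prediction - y
        grad = 2 * error * x        # chain rule on the squared error (prediction - y)^2
        w -= learning_rate * grad   # update immediately after each example
print(w)                            # close to 2
```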
The goal is to reach a minimal loss, after which the network has learned a model of the input
data. One problem is that the model could have overfit the training data, meaning that it has
modelled the noise present in the data. A way to combat that is to use regularisation, which
adds a penalty to the loss for large weights, such that the network is incentivised to keep
its weights small and thus is less able to fit outlandish distributions. Another way to combat
overfitting is through dropout, whereby some neurons in the network are randomly turned off
during training such that the representation gets spread over the entire network [64]. Pooling
layers also serve to counter overfitting, because we do away with some precise information
making our model more general. Lastly, we might prevent overfitting by gathering more data,
which is an important requirement for many applications of neural networks.
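Both remedies are available as one-liners in Keras; the sketch below (with arbitrary placeholder values, not settings from this thesis) adds an L2 weight penalty to a convolutional layer and a dropout layer before the classifier:

```python
import tensorflow as tf

regularised = tf.keras.Sequential([
    # L2 regularisation adds a penalty on large weights to the loss.
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           kernel_regularizer=tf.keras.regularizers.l2(1e-4),
                           input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    # Dropout randomly switches off half of the units during training.
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])
regularised.summary()
```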
Overfitting can also be countered with data augmentation, whereby images are mirrored, tilted or otherwise changed slightly in order to create
variation in the dataset which can result in a more generally applicable model. This tech-
nique only gets you so far, though, because if you only have a few dozen examples you might
be able to scale it up 4x, but that still results in a relatively small dataset. Another technique
to circumvent needing a lot of input for a given domain is to use transfer learning.
6.3 Transfer learning
For this purpose, the present study uses the MobileNet architecture, because it is optimised
for efficiency, which makes it suitable to use in real time in the Minecraft world.
6.4 MobileNet
MobileNets are a family of mobile-first models for vision built in TensorFlow [29]. The
unique selling point of these MobileNets is that they are mindful of restricted resources while
remaining powerful and accurate enough for most applications. This makes using a MobileNet very practi-
cal for current purposes, as a visual agent has to process frames one by one. Moreover, it
would be nice to be able to experiment with visual agents on any device, including my laptop
without a proper GPU, as most of MicroPsi’s implementation does not require a GPU and is
designed to run on a CPU.
The Y-axis of Figure 7.2 shows the ImageNet Top-1 Accuracy, which measures how many
pictures the network classified correctly by having the actual class of the image as its highest
probability output. The X-axis measures the complexity of the algorithm in Multiply-
Accumulates (MACs), which count the number of fused multiplication and addition op-
erations and are a good measure of the computational requirements of the network. As can
be seen from this graph, MobileNets score very well versus larger networks, despite a signifi-
cantly smaller size. Google’s flagship model, Inception V3 [68], has a Top-1 accuracy of 78 per
cent on ImageNet, but the model is 85MB to download and requires significantly more pro-
cessing power than even the largest MobileNet, which gives 70.5 per cent accuracy
and is a mere 19MB to download [18].
MobileNets are based on a streamlined architecture that uses depth-wise separable convo-
lutions to build light-weight deep neural networks. Normal convolution filters combine the
values of all the input dimensions into one dimension. So an input of 3 channels becomes 1
channel, and an input of 1000 channels becomes 1 channel all the same. MobileNets also
make use of such a construction, but only in their first layer. Other layers use depthwise sepa-
rable convolutions, which are a combination of sequential depthwise and pointwise convolu-
tions. Depthwise convolutions filter the input channels but keep the dimensions, such that a
6x6x3 input image, as in the earlier example, will end up not as the 4x4x1 result of applying a 3x3x3
kernel, but as a 4x4x3 result. This depthwise convolution is then followed by a pointwise con-
volution, which is essentially the application of a 1x1 filter, which then functions to collapse
those dimensions into one. 1x1 convolutions take up 95 per cent of the computational time
of MobileNets [29]. These two steps together are called a depthwise separable convolution,
and they accomplish in two stages what a regular convolution achieves in one go. The end results of
regular convolutions and depthwise separable convolutions are roughly similar, but regular
convolutions expend more effort to get there. For 3x3 kernels, the depthwise separable con-
volution is 9 times as fast and achieves almost the same results. An added benefit is that we
can also apply a ReLU activation function twice instead of once, allowing for more complex
functions without adding extra computational cost. A full MobileNet network involves 30
layers. The design of the network is as follows:
1. Convolutional layer
2. Depthwise layer
3. Pointwise layer
4. Depthwise layer
5. Pointwise layer
6. Depthwise layer
7. Pointwise layer
And so on and so forth, with ReLU activation functions in between. A stride of 2 is some-
times used to reduce the width and height of the data, and pointwise layers sometimes double
the number of channels. In the end the input image is filtered down to 7x7 pixels with a dimen-
sion of 1024, on which an average pooling is applied that results in a vector of 1x1x1024. This
functions as the input to the final layer, which outputs class probabilities through a softmax function.
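A single depthwise separable block can be sketched in Keras as follows (an illustrative sketch of the building block, not a reimplementation of MobileNet): a depthwise convolution filters each channel separately, and a 1x1 pointwise convolution then mixes the channels, with a ReLU after each step.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(6, 6, 3))
# Depthwise step: one 3x3 filter per input channel, channels kept separate (6x6x3 -> 4x4x3).
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, activation='relu')(inputs)
# Pointwise step: 1x1 filters mix the channels into the desired number of outputs (4x4x3 -> 4x4x8).
pointwise = tf.keras.layers.Conv2D(8, kernel_size=1, activation='relu')(depthwise)
print(depthwise.shape, pointwise.shape)  # (None, 4, 4, 3) (None, 4, 4, 8)
```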
In the paper the authors introduce two simple global hyperparameters that efficiently
trade off between latency and accuracy. First there is the width multiplier (alpha), which can
shrink the number of dimensions. If alpha is set to 1, the standard, then the network starts
with 32 channels and ends up with 1024. The second is the resolution multiplier (rho), which
can shrink the dimensions of the input image. A rho of 1 results in a 224x224px input size.
Another option a user is granted is to include or leave out a group of 5 layers in the middle
of the network. MobileNets are trained in TensorFlow, Google’s machine learning
framework [1], using asynchronous stochastic gradient descent, a variant of stochastic
gradient descent, with RMSprop [69] for optimisation.
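In the TensorFlow Keras API (a usage sketch, not the exact setup of this thesis), these hyperparameters appear directly when a pretrained MobileNet is instantiated:

```python
import tensorflow as tf

# alpha is the width multiplier; the input shape reflects the resolution multiplier.
# weights='imagenet' loads the pretrained ImageNet weights.
model = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3),
    alpha=1.0,
    weights='imagenet')
model.summary()
```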