Character Recognition Using ANN
Study Program:
Applied Informatics
Branch of Study:
Educational Institution:
Supervisor:
Bratislava, 2013
Ivor Uhliarik
Acknowledgment
I would like to express my sincere gratitude to my supervisor Mgr. Ľudovít Malinovský for invaluable consultations, interest, initiative, and continuous support throughout the writing of this thesis. I would also like to thank my family, my colleagues, and all the people who have supported me. Finally, I would like to thank Martin Boze and the rest of the n'c community for listening to my rants and for all the witty remarks.
Abstract
The aim of this work is to review existing methods for the handwritten character
recognition problem using machine learning algorithms and implement one of them for a
user-friendly Android application. The main tasks the application provides a solution for
are handwriting recognition based on touch input, handwriting recognition from live
camera frames or a picture file, learning new characters, and learning interactively based
on user's feedback. The recognition model we have chosen is a multilayer perceptron, a feedforward artificial neural network, primarily because of its high performance on non-linearly separable problems. It has also proved powerful in OCR and ICR systems [1], which could be seen as further extensions of this work. We evaluated the perceptron's performance and configured its parameters in the GNU Octave programming language, after which we implemented the Android application using the same perceptron architecture, learning parameters, and optimization algorithms. The application was then tested on a training set consisting of digits, with the ability to learn alphabetical or other characters.
Abstrakt
The aim of this work is to provide an overview of methods for recognizing handwritten characters using machine learning algorithms and to implement one of them in the form of a user-friendly Android application. The main requirements for this application are the recognition of handwriting using the touch sensors of an Android device and from camera frames or an image file, the learning of new, previously unknown characters, and interactive learning based on user feedback. As the learning model we chose a feedforward neural network, the multilayer perceptron. The choice was influenced by its high performance on non-linearly separable problems, as well as by the fact that it is often used in OCR and ICR systems [1], which could be likened to extensions of this work. For testing and evaluating the performance, as well as for the proper setting of the perceptron's parameters, we used the GNU Octave programming language. Based on the resulting configuration, we implemented the perceptron in the developed Android application, using the same architecture, parameters, and optimization algorithms. We tested the application on a training set of digits, with the possibility of adding and learning alphabetical and other characters.
Table of Contents
Introduction
1. Overview
1.1 Project Description
1.2 Character Recognition Algorithms
1.2.1 Image Preprocessing
1.2.2 Feature Extraction
1.2.3 Classification
1.2.3.1 Logistic Regression
1.2.3.2 Multilayer Perceptron
1.3 Existing Applications
1.3.1 Full-featured Document OCR and ICR Systems
1.3.2 Input Methods
2. Learning Model in Detail
2.1 Representation
2.2 Hypothesis
2.3 Learning: Backpropagation
2.4 Learning: Resilient Backpropagation
2.5 Bias, Variance
3. Solution Description
3.1 Functional Specification
3.2 Plan of Solution
3.2.1 Recognition Process Pipelines
3.2.2 Network Architecture
3.2.3 Offline and Online Learning
3.2.4 Used Technologies
4. Implementation
4.1 Android Application Implementation
4.2 Android Application Source Code
5. Results
5.1 Collection Methods
5.2 Result Comparison
6. Conclusion
Bibliography
Appendices
Appendix A: DVD
Appendix B: Android Application Screenshots
ICR: intelligent character recognition.
Matrix of all training examples. Each row contains a feature vector of a single example.
a^(l): vector of activations in layer l.
s_j: number of units in layer j.
g(z): sigmoid function of z.
J(Θ): cost function.
Θ^(l): matrix of weights mapping from layer l to layer l + 1.
δ^(l): vector of error terms in layer l.
Δ^(t): weight update step in epoch t.
ΔΘ^(t): weight change in epoch t.
Introduction
Handwritten character recognition is a field of research in artificial intelligence, computer
vision, and pattern recognition. A computer performing handwriting recognition is said to
be able to acquire and detect characters in paper documents, pictures, touch-screen devices
and other sources and convert them into machine-encoded form. Its application is found in
optical character recognition and more advanced intelligent character recognition
systems. Most of these systems nowadays implement machine learning mechanisms such
as neural networks.
Machine learning is a branch of artificial intelligence inspired by psychology and biology
that deals with learning from a set of data and can be applied to solve a wide spectrum of
problems. A supervised machine learning model is given instances of data specific to a
problem domain and an answer that solves the problem for each instance. When learning is
complete, the model is able not only to provide answers to the data it has learned on, but
also to yet unseen data with high precision.
Neural networks are learning models used in machine learning. Their aim is to simulate the
learning process that occurs in an animal or human neural system. Being one of the most
powerful learning models, they are useful in automation of tasks where the decision of a
human being takes too long, or is imprecise. A neural network can be very fast at
delivering results and may detect connections between seen instances of data that a human cannot see.
We have decided to implement a neural network in an Android application that recognizes
characters written on the device's touch screen by hand and extracted from camera and
images provided by the device. Having acquired the knowledge that is explained in this text, we have implemented the neural network at a low level, without using libraries that already facilitate the process. By doing this, we evaluate the performance of neural
networks in the given problem and provide source code for the network that can be used to
solve many different classification problems. The resulting system is a subset of a complex
OCR or ICR system; these are seen as possible future extensions of this work.
In the first chapter, we describe the overall project in greater detail and review existing
approaches, algorithms and systems of similar nature. As an overview, we also briefly
explain the specific algorithms that have been used in the implementation of the project.
In the second chapter, we explain the chosen neural network model and algorithms on a
low level. The information contained in this chapter should be sufficient in order to begin
implementing the learning model.
Chapter 3 discusses the design choices made before implementing the Android application.
The requirements of the application are specified and the plan of the solution is laid out.
The fourth chapter, implementation, gives a description of how the requirements have been
satisfied, what problems have arisen and how they were solved, and also serves as a technical guide for the application users. The structure of the source code is also depicted.
Chapter 5 compares and discusses the numerical results of the network learning
algorithms.
Finally, in the conclusion, we talk about the accomplishments of this work and state how
the application can be used in other projects or extended to a more complex system.
1. Overview
In order to reach the analysis of the used learning model and the specification and
implementation of the algorithms, and consequently, the Android application, we had to
review existing approaches to the problem of character recognition. In this chapter, we
describe the project and compare the methods that have been considered as candidates for
this work.
We understand interactive learning with user feedback as using online machine learning.
This should not be confused with online and offline handwriting recognition, which is
described above. Online learning is defined as learning one instance of data at a time,
expecting feedback (the real label of a character input by a user) to be provided after the
neural network's prediction. On the other hand, offline learning performs all of the learning
process prior to making any predictions and does not change its prediction hypothesis
afterward. The two methods are examples of lazy and eager learning, respectively.
Online machine learning makes the system more adaptive to change, such as a change in
trends, prices, market needs, etc. In our case, user feedback makes the system able to adapt
to a change in handwriting style, perhaps caused by a change of user. We use both offline
and online learning in this work. Initially, the user expects the application to just
work, so offline learning on initial data is performed. When the system mislabels a
character or the user wants to add a new one, they make a correction, refining the system
progressively. Moreover, online learning executes after correct prediction as well, so the
system should be able to perform better the more it is used.
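To make the interaction concrete, the following minimal Java sketch shows the shape of such an online update: predict, collect the user's confirmed or corrected label, then train on that single example. The Classifier interface and its method names are illustrative only, not the application's actual API.

// Minimal sketch of the online-learning loop described above. The Classifier
// interface and its methods are illustrative names, not the application's API.
interface Classifier {
    int predict(double[] features);
    void train(double[] features, int label); // a single online learning step
}

public final class OnlineLoopSketch {
    static void onUserFeedback(Classifier network, double[] features, int correctedLabel) {
        int predicted = network.predict(features); // prediction shown to the user
        // After the user confirms or corrects the label, learn from this one example.
        network.train(features, correctedLabel);
        System.out.println("predicted " + predicted + ", learned " + correctedLabel);
    }
}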
In addition to implementing the Android application, we have tested and evaluated the
used algorithms in the GNU Octave programming language. The language excels at prototyping and makes it very easy to visualize data. It is also well suited to machine learning applications, as matrix manipulations are part of the language's grammar.
image, but some of them, such as cropping the written character and scaling it to our input
size, are also performed in the touch mode.
Digital capture and conversion of an image often introduces noise which makes it hard to
decide what is actually a part of the object of interest and what is not. Considering the
problem of character recognition, we want to reduce as much noise as possible, while
preserving the strokes of the characters, since they are important for correct classification.
There are many ways of achieving this. Local processing is one of them.
According to [10], local preprocessing makes use of a small area around the pixel in question in the input image (a mask) to compute the output value of the pixel in the output image. This is called filtering. For this task we use convolutional masks that scan an
image, ideally reducing all unwanted noise. Masks are square matrices with elements
representing weights of the surrounding area pixels that determine the light intensity value
of the pixel at hand.
$$h = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad \text{(typical average value)} \qquad (1.1)$$

$$h = \frac{1}{10}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad \text{(center value highlighted)} \qquad (1.2)$$
In order to preserve character strokes in our project, we have used median filtering, which
is a non-linear operation. A median filter replaces the pixel's value with the median of the
intensity of surrounding pixels. The result is shown in Figure 1.
The task of image segmentation is to split an image into parts with strong correlation with
objects or the real world properties the image represents [10]. Probably the simplest image
segmentation method is thresholding. Thresholding is the extraction of the foreground,
which is a character in our case, from the rather monotonic background [3]. In a gray-scale
image, this is equal to binarization of its real-valued data. An image threshold is selected,
according to which the input image is converted. Given an input pixel f(i, j) and threshold
T, the output pixel g(i, j) is calculated as follows:
$$g(i, j) = \begin{cases} 1 & \text{if } f(i, j) \ge T \\ 0 & \text{if } f(i, j) < T \end{cases} \qquad (1.3)$$
Finally, in both touch and image based recognition in our work, we have used cropping and
scaling of the images to a small fixed size.
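As a concrete illustration of (1.3), the following Java sketch binarizes a gray-scale image stored as a two-dimensional intensity array. The array layout and the example threshold are assumptions made for the illustration, not details taken from the application.

// Sketch of the thresholding step (1.3): pixels at or above the threshold T
// become foreground (1), all remaining pixels become background (0).
public final class ThresholdSketch {
    static int[][] threshold(int[][] gray, int t) {
        int[][] out = new int[gray.length][];
        for (int i = 0; i < gray.length; i++) {
            out[i] = new int[gray[i].length];
            for (int j = 0; j < gray[i].length; j++) {
                out[i][j] = gray[i][j] >= t ? 1 : 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] image = { {12, 200, 40}, {180, 90, 255} };
        for (int[] row : threshold(image, 100)) {          // threshold of 100 on a 0-255 scale
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}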
As we have mentioned, there is no method that is intrinsically perfect for a given task.
Evaluation of such methods would take a lot of time and it is not in the scope of this work.
Instead, we set our focus on multilayer feedforward neural networks, which can be viewed
as a combination of a feature extractor and a classifier [3], the latter of which will be
explained shortly.
In our work, we have used the multilayer perceptron neural network model, which will be
more deeply described in the next chapter. For now, we can think of this model as a
directed graph consisting of at least 3 layers of nodes.
The first layer is called the input layer, the last layer is the output layer, and a number of
intermediate layers are known as hidden layers. Except for the input layer, nodes of neural
networks are also called neurons or units. Each node of a layer typically has a weighted
connection to the nodes of the next layer. The hidden layers are important for feature
extraction, as they create an internal abstraction of the data fed into the network. The more
hidden layers there are in a network, the more abstract the extracted features are.
[Figure: a multilayer perceptron with an input layer, a hidden layer, and an output layer]
1.2.3 Classification
Classification is defined as the task of assigning labels (categories, classes) to yet unseen
observations (instances of data). In machine learning, this is done on the basis of training
an algorithm on a set of training examples. Classification is a supervised learning problem,
where a teacher links a label to every instance of data. A label is a discrete number that identifies the class a particular instance belongs to. It is usually represented as a non-negative integer.
There are many machine learning models that implement classification; these are known as
classifiers. The aim of classifiers is to fit a decision boundary (Figure 4) in feature-space
that separates the training examples, so that the class of a new observation instance can be
correctly labeled. In general, the decision boundary is a hyper-surface that separates an N-dimensional space into two partitions, itself being (N − 1)-dimensional.
smartphone's camera for text recognition to import contact information [12]. The
application is, among others, also available for the Android platform.
Tesseract-ocr is an OCR engine developed at HP Labs between 1985 and 1995. It is
claimed [13] that this engine is the most accurate open source OCR engine available,
supporting a wide variety of image formats and over 60 languages. It is free and open
source software, licensed under Apache License 2.0.
Google Goggles is an image recognition Android and iOS application, featuring searching
based on pictures taken by compatible devices and using character recognition for some
use cases [14].
2.1 Representation
We have previously outlined the architecture of a multilayer perceptron. In short, it is a
directed graph with at least three layers, at least one of which is a hidden layer. The more hidden layers there are, the more abstract the features that can be extracted from the input data and the more complex the decision boundary is. Neurons are the vertices of the graph, while
weighted connections are the edges.
In binary classification, using a single output neuron is recommended [9]. Here, the
hypothesis is typically a real value. A threshold is then used to determine the
predicted class.
In multi-class problems, the size of the output layer is typically equal to the number
of classes. Thus, output data is represented as a vector of real values. The predicted
class is the element with the maximum value.
In general, however, we assume the output layer is always a real valued vector:
$$h_\Theta(x) = a^{(L)} = \begin{bmatrix} a^{(L)}_1 \\ a^{(L)}_2 \\ \vdots \\ a^{(L)}_{s_L} \end{bmatrix} \qquad (2.1)$$

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{s_L} \end{bmatrix} \qquad (2.2)$$
$$\Theta^{(l)} = \begin{bmatrix} \Theta^{(l)}_{1,1} & \Theta^{(l)}_{1,2} & \cdots & \Theta^{(l)}_{1,\,s_l+1} \\ \Theta^{(l)}_{2,1} & \Theta^{(l)}_{2,2} & \cdots & \Theta^{(l)}_{2,\,s_l+1} \\ \vdots & \vdots & \ddots & \vdots \\ \Theta^{(l)}_{s_{l+1},1} & \Theta^{(l)}_{s_{l+1},2} & \cdots & \Theta^{(l)}_{s_{l+1},\,s_l+1} \end{bmatrix} \qquad (2.3)$$
Here, l is any layer except the output layer. Using our notation, Θ^(l) is a matrix of weights corresponding to the connection mapping from layer l to layer l + 1. s_l is the number of units in layer l, and s_{l+1} is the number of units in layer l + 1. Thus, the size of Θ^(l) is s_{l+1} × (s_l + 1).
The additional neuron that is included in layer l is the bias neuron. Usually marked as x_0 or a^(l)_0, the bias is a very important element in the network. It can shift the activation function along the x-axis. The bias neuron is only connected to the next layer; it has no input connections.
Note that a row in the weights matrix represents connections from all of the neurons in
layer l to a single neuron in layer l + 1. Conversely, a column in the matrix represents
connections from a single neuron in layer l to all of the neurons in layer l + 1.
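In plain Java, this layout can be captured as one matrix per layer transition. The sketch below is only an illustration of the shape described above; the sizes and the initialization constant are example values, not taken from the application's source code.

import java.util.Random;

// Sketch of storing the network weights: theta[l] has sizes[l + 1] rows and
// sizes[l] + 1 columns, the extra column corresponding to the bias neuron.
public final class WeightsSketch {
    final double[][][] theta;

    WeightsSketch(int[] sizes, Random rnd) {
        theta = new double[sizes.length - 1][][];
        for (int l = 0; l < sizes.length - 1; l++) {
            theta[l] = new double[sizes[l + 1]][sizes[l] + 1];
            for (double[] row : theta[l]) {
                for (int j = 0; j < row.length; j++) {
                    row[j] = rnd.nextDouble() * 0.01; // small random initial weights
                }
            }
        }
    }

    public static void main(String[] args) {
        WeightsSketch w = new WeightsSketch(new int[] {400, 25, 10}, new Random(1));
        System.out.println(w.theta[0].length + " x " + w.theta[0][0].length); // prints "25 x 401"
    }
}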
2.2 Hypothesis
Throughout this text, we have referred to the hypothesis several times. It is the prediction of a class, the output value of a classifier. As mentioned in chapter 1, in order to enable the
network to solve complex nonlinear problems, the use of a nonlinear activation function is
required.
In many cases, the sigmoid activation function is used:
$$g(z) = \frac{1}{1 + e^{-z}} \qquad (2.4)$$
The range of the sigmoid function is (0, 1), which is therefore also the range of the elements in the output layer.
(2.5)
$$a^{(l)} = g\!\left(\Theta^{(l-1)} a^{(l-1)}\right) \qquad (2.6)$$

Here, the sigmoid function is applied element-wise to the product of the weights and the connected neurons from the previous layer, therefore $a^{(l)} \in \mathbb{R}^{s_l}$.
It may be intuitive to go ahead and use (2.6) recursively to compute the overall hypothesis of the network, but as we are assuming a bias neuron in the architecture, it needs to be added to the vector of activations in each layer l before propagating further.
The process of determining the value of the hypothesis in the described way is called
forward propagation. The algorithm broken up into steps follows:
1. Start with the first hidden layer.
2. Compute the activations in the current layer using (2.6).
3. If the current layer is the output layer, we have reached the hypothesis and end.
4. Add a bias unit a^(l)_0 = 1 to the vector of computed activations.
5. Advance to the next layer and go to step 2.
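The steps above translate into a short loop over the layers. The following Java sketch assumes the weight layout of (2.3) and prepends the bias unit to each activation vector; it is an illustration of the procedure, not the application's actual implementation.

// Sketch of forward propagation: theta[l] maps layer l to layer l + 1 and its
// first column holds the bias weights.
public final class ForwardPropagationSketch {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    static double[] forward(double[][][] theta, double[] input) {
        double[] a = input;
        for (int l = 0; l < theta.length; l++) {
            double[] withBias = new double[a.length + 1];
            withBias[0] = 1.0;                                   // bias unit a0 = 1
            System.arraycopy(a, 0, withBias, 1, a.length);
            double[] next = new double[theta[l].length];
            for (int j = 0; j < next.length; j++) {              // one unit in layer l + 1
                double z = 0.0;
                for (int k = 0; k < withBias.length; k++) {
                    z += theta[l][j][k] * withBias[k];
                }
                next[j] = sigmoid(z);                            // a^(l+1) = g(z), as in (2.6)
            }
            a = next;
        }
        return a;                                                // the hypothesis h(x) = a^(L)
    }

    public static void main(String[] args) {
        // A tiny 2-2-1 network with all weights 0.5, just to exercise the code.
        double[][][] theta = {
            { {0.5, 0.5, 0.5}, {0.5, 0.5, 0.5} },
            { {0.5, 0.5, 0.5} }
        };
        System.out.println(forward(theta, new double[] {1.0, 0.0})[0]);
    }
}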
There are a number of different cost functions typically used when training multilayer
perceptrons. Training is often performed by minimizing the mean squared error, which is a
sum of squared differences between computed hypotheses and actual labels. However,
mean non-squared error is also used. To be consistent with [9], we have used a
generalization of the cost function used in logistic regression:
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[\, y^{(i)}_k \log\!\big((h_\Theta(x^{(i)}))_k\big) + (1 - y^{(i)}_k) \log\!\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] \qquad (2.7)$$
In (2.7), m is the number of training examples and K is the total number of possible labels. $h_\Theta(x^{(i)})$ is computed using the forward propagation described above.
The cost function above is not always convex; there may be multiple local minima. However, according to [9], it is sufficient to reach a local minimum even if it is not the global one. In order to
reach a local minimum of the cost function, we use an optimization algorithm, such as
gradient descent. Gradient descent is a simple optimization algorithm that converges to a
local minimum by taking steps in a weight-space iteratively in so-called epochs. The size
of each step is proportional to the negative of the gradient of the cost function at the
current point.
There is an important factor that modifies the descent step size in machine learning: the learning rate. It is a modifier used to tune how fast and how accurately the optimization proceeds, and it heavily determines the efficiency of a learning algorithm.
Figure 8: The effect of learning rate on gradient descent. In (a), the learning
rate is optimal and gradient descent efficiently converges to a minimum. In (b),
the learning rate is too small; the algorithm still converges to a minimum, but
takes too many steps. In (c), the learning rate is too large and the algorithm
diverges.
Since gradient is the partial derivative of the cost function with respect to individual
parameters, the change of parameters in a single gradient descent step is performed as:
$$\Delta\theta = -\eta \, \frac{\partial J(\Theta)}{\partial \theta} \qquad (2.8)$$
In Figure 8 we can see the effect of a wrong choice of the learning rate. There are advanced
ways of determining the right learning rate value, but it is usually sufficient to determine it empirically by applying various learning rates to the learning algorithm and picking the one with the minimum error.
Obtaining the gradient in a multilayer perceptron is not trivial and is done in several steps.
As each neuron has its own activation and weighted connections, it takes its own part in
the cost function. To propagate the error measured on the output layer after a prediction,
each neuron's weights need to be updated differently. To achieve this, we introduce the
concept of an error term $\delta^{(l)}_j$, representing the error of node j in layer l.
To obtain the error terms for all neurons in the network except the input layer (as there is
no error in the input data), we do the following. Given an instance of input data x, forward
propagation is performed to determine the hypothesis. With the input label yj, starting at the
end of the network, we calculate the error terms for the output layer per neuron j:
$$\delta^{(L)}_j = a^{(L)}_j - y_j \qquad (2.9)$$
$$\delta^{(l)} = \left(\Theta^{(l)}\right)^{T} \delta^{(l+1)} \;.\!\ast\; g'(z^{(l)}), \qquad (2.10)$$
where g′ is the derivative of the sigmoid function and $z^{(l)}$ is the vector of the linear combinations of all neurons in layer l − 1 and their respective weights:

$$z^{(l)} = \Theta^{(l-1)} a^{(l-1)}, \qquad \text{for layers } 1 < l \le L \qquad (2.11)$$
(2.12)
(2.13)
Collecting the error terms is essential for the computation of the partial derivatives of the
network. We will now describe the overall process of obtaining the partial derivatives,
called backpropagation, in the following pseudo-code. Note that in this pseudo-code,
matrix elements are indexed from 0, as opposed to mathematical notation, where indexing
starts at 1.
02| Set Δ^(l) := 0 for all l
03| For i = 1 to m
04|     Set a^(1) := x^(i)
05|     Compute a^(l) for l = 2, ..., L using forward propagation
06|     Compute δ^(L) = a^(L) − y^(i)
07|     Compute δ^(L−1), δ^(L−2), ..., δ^(2) using (2.10)
08|     Set Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T
09| D^(l)_{i,j} := (1/m) Δ^(l)_{i,j}
The resulting D^(l) matrices contain the partial derivatives, which an optimization algorithm can use to minimize the cost function and thus enable the algorithm to make predictions on new data.
It is important to note that before using backpropagation to learn the parameters, the weights should be initialized to small random numbers between 0 and 1. The weights must not be initialized to zero, otherwise the individual weight updates will be identical for all weights and the minimization algorithm will fail to converge to a local minimum.
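Put together, one pass of the pseudo-code for a single training example can be sketched in Java as follows. The sketch mirrors equations (2.9) and (2.10) and line 08 of the pseudo-code; the variable names and data layout are illustrative, and the division by m and the weight update are left to the caller.

// Sketch of one backpropagation pass for a single training example: forward
// propagate, compute the error terms, and accumulate delta^(l+1) * (a^(l))^T.
public final class BackpropSketch {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // theta[l] is s_(l+1) x (s_l + 1); delta[l] has the same dimensions.
    static void accumulate(double[][][] theta, double[][][] delta, double[] x, double[] y) {
        int L = theta.length + 1;                       // number of layers
        double[][] a = new double[L][];                 // activations, bias prepended
        a[0] = withBias(x);
        for (int l = 0; l < theta.length; l++) {
            double[] z = multiply(theta[l], a[l]);
            for (int j = 0; j < z.length; j++) z[j] = sigmoid(z[j]);
            a[l + 1] = (l == theta.length - 1) ? z : withBias(z); // no bias on the output layer
        }
        double[] err = new double[y.length];            // delta^(L) = a^(L) - y   (2.9)
        for (int k = 0; k < y.length; k++) err[k] = a[L - 1][k] - y[k];
        for (int l = theta.length - 1; l >= 0; l--) {
            for (int i = 0; i < err.length; i++)        // Delta^(l) += delta^(l+1) (a^(l))^T
                for (int j = 0; j < a[l].length; j++) delta[l][i][j] += err[i] * a[l][j];
            if (l == 0) break;                          // no error terms for the input layer
            double[] prev = new double[theta[l][0].length - 1];   // delta^(l), bias skipped
            for (int j = 0; j < prev.length; j++) {
                double s = 0.0;
                for (int i = 0; i < err.length; i++) s += theta[l][i][j + 1] * err[i];
                double act = a[l][j + 1];               // g'(z) = g(z)(1 - g(z)) = a(1 - a)  (2.10)
                prev[j] = s * act * (1.0 - act);
            }
            err = prev;
        }
    }

    static double[] withBias(double[] v) {
        double[] out = new double[v.length + 1];
        out[0] = 1.0;
        System.arraycopy(v, 0, out, 1, v.length);
        return out;
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) out[i] += m[i][j] * v[j];
        return out;
    }
}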
Resilient backpropagation (RPROP) does not use a learning rate that is constant for all weight connections; instead, it performs a direct adaptation of the weight step using only the sign of the partial derivative, not its magnitude. As such, it overcomes the difficulty of setting the right learning rate value.
For each weight, an individual weight step size is introduced: Δ_{i,j} (not to be confused with the symbol used when accumulating error in pure backpropagation). Given epoch t > 0 and considering Θ as a weight matrix for a single layer (this can be applied to any valid layer), this value evolves during the learning process according to the following rule:
$$\Delta^{(t)}_{i,j} = \begin{cases} \eta^{+} \cdot \Delta^{(t-1)}_{i,j} & \text{if } \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t-1)} \cdot \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t)} > 0 \\[2ex] \eta^{-} \cdot \Delta^{(t-1)}_{i,j} & \text{if } \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t-1)} \cdot \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t)} < 0 \\[2ex] \Delta^{(t-1)}_{i,j} & \text{otherwise} \end{cases} \qquad (2.15)$$
If the sign of the partial derivative has not changed, the step size is increased by the factor η⁺; if the sign has changed, it is decreased by the factor η⁻.
In effect, this method decelerates gradient descent when the step would be too large to converge, and accelerates it in shallow regions where the step size would otherwise be too small.
After determining the update values as shown in (2.15), the weights are updated according to the following rule:
$$\Delta\Theta^{(t)}_{i,j} = \begin{cases} -\Delta^{(t)}_{i,j} & \text{if } \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t)} > 0 \\[2ex] +\Delta^{(t)}_{i,j} & \text{if } \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}}^{(t)} < 0 \\[2ex] 0 & \text{otherwise} \end{cases} \qquad (2.16)$$

$$\Theta^{(t+1)}_{i,j} = \Theta^{(t)}_{i,j} + \Delta\Theta^{(t)}_{i,j} \qquad (2.17)$$
In words, if the derivative is positive, which indicates an increasing error, the weight is
decreased; if the derivative is negative, the weight is increased.
An initial value of the weight update step Δ_0 has to be set before using the algorithm. This value is preferably chosen in proportion to the initial weight size. However, it has been shown that the algorithm is not very sensitive to this choice; the recommended values of η⁻ and η⁺ are 0.5 and 1.2, respectively [4].
In the original document [4] it was suggested that the previous update step be reverted if
the sign of the gradient changes (the minimum was missed). This is called backtracking.
However, in [5], an RPROP variation without backtracking has been proposed, simply
leaving this step out, as it is not crucial in the optimization process and is easier to
implement.
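The update rules (2.15)-(2.17), with backtracking left out as in [5], can be sketched in Java as follows. The η⁺, η⁻ factors and the step-size bounds shown are commonly recommended values rather than necessarily those of the application; the sketch illustrates the rule, not the actual implementation.

// Sketch of the RPROP weight update without backtracking: each weight's step
// size is adapted from the sign of the current and previous gradients.
public final class RpropSketch {
    static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;
    static final double STEP_MAX = 50.0, STEP_MIN = 1e-6;

    // Updates theta in place; prevGrad and step carry state between epochs.
    static void update(double[][] theta, double[][] grad, double[][] prevGrad, double[][] step) {
        for (int i = 0; i < theta.length; i++) {
            for (int j = 0; j < theta[i].length; j++) {
                double sign = prevGrad[i][j] * grad[i][j];
                if (sign > 0) {                               // same sign: accelerate (2.15)
                    step[i][j] = Math.min(step[i][j] * ETA_PLUS, STEP_MAX);
                } else if (sign < 0) {                        // sign change: decelerate
                    step[i][j] = Math.max(step[i][j] * ETA_MINUS, STEP_MIN);
                }
                // Move against the sign of the gradient by the adapted step, (2.16) and (2.17).
                theta[i][j] -= Math.signum(grad[i][j]) * step[i][j];
                prevGrad[i][j] = grad[i][j];
            }
        }
    }
}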
Figure 9: High bias, high variance. On the left, a learning algorithm underfits the
training examples and will likely fail at predictions on unseen examples; in the center,
the algorithm fits the examples just right; on the right, the algorithm overfits the
examples, fitting the training set, but will likely fail at predictions on unseen
examples.
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[\, y^{(i)}_k \log\!\big((h_\Theta(x^{(i)}))_k\big) + (1 - y^{(i)}_k) \log\!\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l+1}} \sum_{j=2}^{s_l+1} \left(\Theta^{(l)}_{i,j}\right)^2 \qquad (2.18)$$
09| D^(l)_{i,j} := (1/m) Δ^(l)_{i,j}                     if j = 0
10| D^(l)_{i,j} := (1/m) (Δ^(l)_{i,j} + λ Θ^(l)_{i,j})    if j ≠ 0
As shown in (2.18) and in line 9, we do not add the regularization term for the bias units,
therefore we skip the first column of the weight matrix.
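In code, lines 09 and 10 correspond to dividing the accumulated errors by m and adding λΘ^(l)_{i,j}/m to every entry outside the bias column. Below is a minimal Java sketch under the same matrix layout as before; it is an illustration, not the application's code.

// Sketch of turning the accumulated errors into regularized gradients;
// the bias column (j = 0) is left unregularized, as in lines 09 and 10 above.
public final class RegularizedGradientSketch {
    static double[][] gradients(double[][] delta, double[][] theta, double lambda, int m) {
        double[][] d = new double[delta.length][delta[0].length];
        for (int i = 0; i < delta.length; i++) {
            for (int j = 0; j < delta[i].length; j++) {
                d[i][j] = delta[i][j] / m;
                if (j != 0) d[i][j] += (lambda / m) * theta[i][j]; // regularize non-bias weights only
            }
        }
        return d;
    }
}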
3. Solution Description
3.1 Functional Specification
As mentioned in the aim, the following are the main requirements of the Android application that has been developed:
[R01] The application provides means to enter a touch mode screen. Here, the user can draw characters freely by hand.
[R01.1] When the user is done drawing, the drawing is recognized and the prediction is
shown to the user.
[R01.2] The drawing, along with the predicted label, can be saved to a persistent location.
[R01.3] The user can provide feedback on the prediction, signaling whether the prediction
was correct, or making a correction to the prediction, performing online learning.
[R02] The application provides means to enter a camera mode screen. Here, the device's camera is used to present its input to the screen.
[R02.1] After showing the camera frame, it is analyzed and found patterns are recognized
and shown to the user.
[R02.2] The process in [R02] and [R02.1] is done continuously as the camera frames are
updated over time.
[R02.3] The user can provide feedback on the prediction, signaling whether the prediction
was correct, or making a correction to the prediction, performing online learning.
[R03] The application provides means to enter an image mode screen. Here, the user can load an image file present on the device, which is then shown on the screen.
[R03.1] After showing the image, it is analyzed and found patterns are recognized and shown to the user.
[R04] The application shows a list of all characters the network has learned.
[R05] The user can add a new character and train it using touch mode as described in [R01].
[R06] The user can perform offline learning on current persistent data at any time.
the learning and prediction computationally feasible. Similar bitmap sizes have been
chosen in related problems by Ng [9] and LeCun et al. [1].
As we have more data available in the touch mode than a pure image bitmap, we have also
decided to collect the bitmap of stroke end points to be able to better distinguish characters
such as '8' and 'B', as mentioned in the overview. The resized bitmaps of these characters
are often similar, but the writing style of each is usually different. By providing this extra
bitmap with each example, we are giving a hint to the neural network classifier about what
features to focus on when performing automatic feature extraction with the hidden layer.
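For illustration, the touch-mode feature vector can be built by unrolling the drawing bitmap and the stroke-end bitmap row by row and concatenating them. The sketch below assumes both bitmaps have the same fixed size (for 20x20 bitmaps this yields 800 features); the exact layout used by the application is not shown here.

// Sketch of building a touch-mode feature vector: the drawing bitmap and the
// stroke-end bitmap are unrolled row by row and concatenated.
public final class TouchFeaturesSketch {
    static double[] features(double[][] drawing, double[][] strokeEnds) {
        int n = drawing.length * drawing[0].length;   // assumes both bitmaps share this size
        double[] x = new double[2 * n];
        int k = 0;
        for (double[] row : drawing)    for (double v : row) x[k++] = v;
        for (double[] row : strokeEnds) for (double v : row) x[k++] = v;
        return x;                                     // e.g. 2 * 400 = 800 elements for 20x20 bitmaps
    }
}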
The pipeline for recognition based on an image or a camera frame is different:
1. Acquire the image bitmap in gray-scale colors.
2. Apply a median filter to the bitmap.
3. Segment the bitmap using thresholding to get a binary bitmap.
4. Find the bounding boxes of external contours in the bitmap.
5. Extract sub-bitmaps from the bounding boxes.
6. Resize the sub-bitmaps to 20x20 pixels.
7. Unroll the sub-bitmap matrices to feature vectors per 400 elements.
8. Feed each feature vector to a trained multilayer perceptron, giving us predictions.
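The following Java sketch mirrors the pipeline above using OpenCV's Java bindings, which the application also relies on. The kernel size, threshold constant, and threshold type are illustrative choices, not the application's exact parameters.

import java.util.ArrayList;
import java.util.List;

import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Rect;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;

// Sketch of the image/camera recognition pipeline described above.
public final class ImagePipelineSketch {
    static List<double[]> extractFeatureVectors(Mat gray, double threshold) {
        Mat blurred = new Mat();
        Imgproc.medianBlur(gray, blurred, 3);                        // step 2: median filter
        Mat binary = new Mat();
        Imgproc.threshold(blurred, binary, threshold, 255,
                Imgproc.THRESH_BINARY_INV);                          // step 3: segmentation
        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(binary.clone(), contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE); // step 4: external contours
        List<double[]> features = new ArrayList<>();
        for (MatOfPoint contour : contours) {
            Rect box = Imgproc.boundingRect(contour);
            Mat character = new Mat();
            Imgproc.resize(binary.submat(box), character, new Size(20, 20)); // steps 5-6
            double[] x = new double[20 * 20];                        // step 7: unroll to 400 features
            for (int i = 0; i < 20; i++)
                for (int j = 0; j < 20; j++)
                    x[i * 20 + j] = character.get(i, j)[0] / 255.0;
            features.add(x);                                         // step 8: feed to the perceptron
        }
        return features;
    }
}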
Our intention is to produce a similar input for the network as in the case of touch input,
only without the stroke end bitmaps as we do not possess this information in this mode.
With this approach, we are able to re-use bitmaps produced in touch input to train a
network that operates on images.
When preprocessing the image, we apply a median blur to remove noise. Median blur was
preferred to other blur filters because of its property of preserving edges, which are
important in character recognition.
Thresholding is a very simple segmentation algorithm that is sufficient to segment out
characters from the background and thus binarize the bitmap. This step is required, as the background is extra information we do not need in the recognition process.
Finding contours in the image and finding the bounding boxes of each pattern allows us to
detect more characters in an image at once. Given the bounding boxes, we are able to
extract bitmaps of the individual characters (or other patterns) and then perform the rest of
the processes just as in the touch mode pipeline, excluding, of course, the stroke end maps
in the feature vectors.
η⁺ is equal to 1.2. After testing the performance of the algorithms, we have decided not to use regularization at all (the regularization parameter λ is 0). Both backpropagation and RPROP
perform 100 optimization epochs.
4. Implementation
In this chapter, we will describe how the Android application requirements have been
satisfied, document the implementation, describe the problems we have run into and how they were solved, and provide a brief user guide in the process.
By default, each time a prediction is made, the drawing and the stroke end bitmaps are
saved to external storage as image files in pairs, so that perceptron weights can be removed
and offline learning can be performed. This behavior satisfies R01.2 and can be turned off
in application settings to save memory when no or only online learning is required.
The camera mode contains an OpenCV user interface element tailored for showing and
processing camera frames in Android. When the user enters the activity, they are
immediately presented with continuously updated camera images, which are processed
according to the image mode recognition pipeline as has been described. Individual
contours that are likely to be characters are found and marked with rectangles. A predicted
label is shown above each such rectangle. In practice, on a Nexus S device with a single-core 1 GHz CPU (the GPU is, unfortunately, not used by the implementation of the OpenCV library, which would likely have performed better) and a camera resolution of 720x480 pixels, the update frequency is approximately 2 frames per second. This depends on the
number of objects visible in the frames. In order to filter out contours with a low
probability of correct classification, contours that cover an area of less than 40 pixels are
not further processed. This significantly improves the frame rate, as there are usually many objects in the image that are too large to be removed by the median filter but are still unwanted.
In this mode, the user can click on the individual characters outlined by the mentioned
rectangles in order to provide feedback. The feedback dialog described in the touch mode
is reused. Corrections of the predictions are immediately visible in the next frames; as
such, this activity may resemble an application of augmented reality. As with the touch
mode, the neural network state is also saved when the activity goes into background or is
killed by the system.
When implementing the camera mode activity, a problem with the choice of segmentation
threshold value arose. OpenCV provides the use of Otsu's method, which automatically
determines the threshold value. However, this has not proved to be useful, as it is computationally more expensive and the chosen threshold values were still not optimal. We have realized the best way to address this problem would be to leave the
setting of the threshold value to the user, with a default of 100 (in the range 0-255). For this, we have used a panel with a segmentation check-box that allows users to see the segmented image, and a seekbar (a UI element for continuous value selection) to control the threshold.
When the user enters the image mode, they are instructed to start by opening an image file.
The system then shows all possible applications that can handle this request; a default
gallery application is preferred, as these usually implement the behavior of providing data correctly; however, other applications, such as file managers, may also be used. The loaded
image file is processed according to the image mode recognition pipeline using the same
techniques as in camera mode, and is then shown to the user. The segmentation control
views are also present. Because this activity would share a lot of source code with the
camera activity, methods that could be reasonably separated from the classes have been
moved to a public class and declared as static.
Unlike in the camera activity, which is constrained to the landscape screen orientation, we
wanted to make this activity flexible. It can be rotated, while its state is retained in a
fragment, which represents a portion of a user interface or a behavior of an activity [17].
The last activity from the four navigation elements is a character list that satisfies R04. It
shows a grid view of learned characters, filling the items with saved bitmaps if present.
When the user selects an item in the grid view, they are taken into the touch mode activity
that is modified to focus on training the single character label. The difference is that no
feedback dialog is shown to the user; it is assumed that the character the user selected in
the character list is always the anticipated label and online learning is performed after each
prediction. Otherwise, the activity is unchanged; the examples are still saved to external
storage if set to do so in the settings and the network state is saved in appropriate events.
The character list activity also contains a menu item for adding a new character as required
in R05. After being selected, the user is asked to assign a label to the new character and is
then taken into the touch input activity to train the new character as described above. This
action adds the character to the list of known characters and modifies the structure of the
perceptron used for touch input recognition. In particular, the output layer size is increased
by one (to be able to predict the new class) and weight connections to this layer are
adjusted and initialized to random values.
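Structurally, adding a class means appending one row of small random weights to the last weight matrix, as sketched below (an illustration under the matrix layout from chapter 2, not the application's actual code).

import java.util.Random;

// Sketch of adding one output unit: the last weight matrix gains one row of
// small random weights, so the network can predict the new class.
public final class AddOutputUnitSketch {
    static double[][] addOutputUnit(double[][] lastTheta, Random rnd) {
        double[][] grown = new double[lastTheta.length + 1][];
        for (int i = 0; i < lastTheta.length; i++) grown[i] = lastTheta[i].clone();
        double[] newRow = new double[lastTheta[0].length];
        for (int j = 0; j < newRow.length; j++) newRow[j] = rnd.nextDouble() * 0.01;
        grown[lastTheta.length] = newRow;   // the appended row for the new class
        return grown;
    }
}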
Because the application maintains two separate perceptrons, the one used for image and
camera recognition is also modified in this way, but only when the user navigates to the
related activities. The online learning that was performed when training the new character
using touch input is, however, not applied to this perceptron. The user can only train it
using online learning in the related activities, or by performing offline learning in the case
that the bitmaps for the new character have been saved.
So far, we have mentioned the application settings only in relation to the option of saving
character bitmaps. Settings are another activity that the user can access from any other
activity at any time. Here, besides the saving option, the user is able to pick the error
minimization algorithm used for offline learning: backpropagation or RPROP, the latter
being the default. The initial learning rate to perform the online learning iteration (using
backpropagation) is 0.3, but if the user feels the need to adjust this parameter, they may do
it here. Finally, the settings contain items that initiate the offline learning process for touch-based and image-based input recognition that satisfy R06.
To make it easy for a user to start using the application, an out-of-the-box experience is
required. The application package contains initial files with a set of image bitmaps, saved
perceptron weights and the list of known characters. These files are compressed in a single
zip file that the application extracts and copies to external storage. To counter the
possibility of injecting malicious files in the zip file, its SHA-1 checksum is verified before
the file is processed.
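Such a check can be done with the standard java.security.MessageDigest API; the sketch below hashes the archive and compares the result to an expected hex digest (the method and parameter names are illustrative).

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of verifying a zip archive's SHA-1 checksum before extracting it.
public final class ChecksumSketch {
    static boolean verifySha1(String path, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) hex.append(String.format("%02x", b));
        return hex.toString().equalsIgnoreCase(expectedHex);
    }
}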
As the Android development guidelines dictate [17], long operations such as file input and
output or offline machine learning should be executed in separate threads. It is preferred to
use the AsyncTask class, which is a thread wrapper that allows simple communication back
to the main application thread. However, when the activity in which an AsyncTask is
started is released from memory, nothing stops the system from killing the leftover thread
that may be performing a critical operation. To ensure such threads keep running, we have
created a service for copying the initial files, and another one that handles offline learning
of both neural networks. These services have been declared foreground, which guarantees
the system will only reclaim the memory if it is required to keep the user interface
responsive [17].
Additionally, when an Android device runs on its battery, the execution of threads may be
interrupted when the device is in sleep mode. As the learning process takes a few minutes
to complete, it should continue to run even after the power button of the device is pressed.
For this, we acquire a partial wake lock. This is a mechanism that controls the power state
of the device. When a partial wake lock is acquired, the device's screen and keyboard
backlight are allowed to be turned off, but the CPU is kept on until all partial wake locks
have been released [17].
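Acquiring and releasing the partial wake lock uses the standard PowerManager API; a minimal sketch (the wake-lock tag and surrounding structure are illustrative) might look as follows.

import android.content.Context;
import android.os.PowerManager;

// Sketch of holding a partial wake lock for the duration of offline learning.
public final class WakeLockSketch {
    static void runWithWakeLock(Context context, Runnable learningTask) {
        PowerManager pm = (PowerManager) context.getSystemService(Context.POWER_SERVICE);
        PowerManager.WakeLock lock =
                pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "ocr:offline-learning");
        lock.acquire();                 // the CPU stays on even if the screen turns off
        try {
            learningTask.run();         // e.g. the offline learning service body
        } finally {
            lock.release();
        }
    }
}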
dialogs package consists of four dialog classes: AboutDialog,
5. Results
Before implementing the learning algorithms in Java, we have tested them as prototypes
using GNU Octave. This allowed us to easily collect results showing how well the learning algorithms perform. This chapter presents and discusses the comparison of the collected
results.
In resilient backpropagation, η⁻ and η⁺ are set to 0.5 and 1.2, respectively.
This configuration has been chosen based on recommendations from the various resources referred to in the explanation of the individual model characteristics, as well as on testing whose full results would be too long to fit in this chapter.
Figure 12: Learning curve of RPROP performed on the training and validation
sets.
In the learning curves, no significant overfitting or underfitting is apparent. We can see that
the RPROP algorithm manages to converge to a better minimum given 100 epochs than
backpropagation. This is caused by the advantages of the RPROP algorithm over pure backpropagation that we explained earlier in this work. Table 1 confirms these findings.
Table 1:

Algorithm        | Training set error (m=400) | Validation set error (m=400) | Avg. training set error | Avg. validation set error | Training set accuracy
Backpropagation  | 1.201                      | 1.396                        | 1.177                   | 1.529                     | 96.25%
RPROP            | 0.020                      | 0.320                        | 0.019                   | 0.331                     | 100.00%
6. Conclusion
This work has mostly been focused on the machine learning methods used in the project.
At first, we reviewed the approaches that are nowadays used in similar applications. After
that, we delved into the inner workings of a multilayer perceptron, focusing on backpropagation and resilient backpropagation, which have been implemented in the Android application. With the knowledge we had described, we specified the requirements of the project and planned the solution. During the development of the application, we ran into a
project and planned the solution. During the development of the application, we ran into a
few problems, which, along with the application structure and details, have been described
in the implementation chapter. Finally, the results of the implementation of the learning
algorithms have been compared.
The Android application performs character recognition based on touch, image, and
camera input. We have developed a Java package containing classes that implement the
multilayer perceptron learning model, which can also be used in other applications due to
its modular design that supports the loose coupling principle. The application itself uses this package in this way.
Several improvements for the application or the learning model used within can be
suggested. For example, the feature extraction performed by the neural network could be
constrained to operate on more strictly preprocessed data. Also, several classifiers learning
on different features could be combined to make the system more robust. Additionally, an
unsupervised clustering learning model, such as an ART network [7], could be used on the
raw input, whose output would be connected to a multilayer perceptron.
Bibliography
[1]
[2]
[3]
Ø. Due Trier, A. K. Jain, T. Taxt, "Feature Extraction Methods for Character Recognition - A Survey", Pattern Recognition, vol. 29, no. 4, pp. 641-662, 1996
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
Appendices
Appendix A: DVD
The enclosed DVD contains an electronic copy of this text, source codes of the Android
application and GNU Octave scripts, Java source code documentation of the Android
application, and an Android package for installation of the Android application.
Screenshot 1: Android application navigation screen
Screenshot 2: Android application touch mode screen