Abstract—Optical character recognition (OCR) is the conversion of pictures of typed or handwritten characters into machine-encoded characters. We chose to work on a subfield of OCR, namely offline recognition of handwritten characters. Kannada script is agglutinative, where simple shapes are concatenated horizontally to form words. This paper presents a comparative study between different machine learning and deep learning models on Kannada characters. A Convolutional Neural Network (CNN) was chosen to show that handcrafted features are not required for recognizing the classes to which characters belong. The CNN beats the accuracy score of previous models by 5%.

Keywords—Optical Character Recognition, Handwritten Kannada Characters, Convolutional Neural Network

I. INTRODUCTION

Handwritten text is the most widely used method for recording information in everyday life. Handwriting recognition is the ability of a computer to detect and process handwritten input from images, photographs, reports, etc. Offline recognition is centered around converting text from an image into codes that can be processed by a computer application such as a text editor. An offline handwritten recognition system accepts images of handwritten documents as input and outputs the characters recognized from the document image. Handwritten character recognition has a wide range of applications, from recognizing the number plates of automobiles to processing bank cheques and form data entry. The challenge of recognizing and classifying handwritten characters lies in the differences between the handwriting styles of different writers. Each person has a unique handwriting, and some have handwriting that is difficult to decipher. Some documents may be damaged, making it difficult to recognize the characters in them. There may be variations in font style and thickness. In addition to this, data is often presented in the form of a scanned image or photograph, which requires accounting for sources of variability like camera positions that cause distortion, illumination, and background and foreground colour.

Handwriting recognition (HWR) has become a popular research area in the past couple of decades. HWR of English characters is a relatively easy and already conquered problem space. However, offline HWR systems have been developed for only a few languages, and most Indic languages like Kannada are still beyond the pale of current HWR techniques. Indic scripts have a rich morphological structure wherein a character can lead to many other characters simply by the addition of a single dot or stroke. This results in a large number of classes (657). Modern Kannada script has 34 consonants and 13 vowels, as well as two other letters. All consonants combine with all the vowels to form consonant-vowel combinations. All possible consonants and consonant-vowel combinations have been considered (Figure 1).

The dataset was chosen from [1], and we worked exclusively on its offline handwritten Kannada characters. Three different models were implemented on the dataset: a Random Forest Classifier (RFC)[6], a Multinomial Naive Bayes Classifier (MNNBC)[7] and a CNN[8]. Optical character recognition using the machine learning models stated above fared in the same accuracy range as [1]. The CNN model was chosen due to the promising results in [2] and [3], to test whether using a deep learning model with no feature extraction would improve accuracy compared to the methods deployed in [1] and [9]. We show that deep learning techniques like the CNN, even without feature extraction, achieve higher accuracy than the machine learning models.

II. BACKGROUND

The following are attempts made by various authors on the topic of optical character recognition of Indic languages. This section briefly discusses the methods used for feature extraction, data pre-processing and classification in similar problems in the past.

In [21] the authors attempt offline cursive handwritten Tamil character recognition using HMMs and feature extraction based on a combination of time-domain and frequency-domain features. Their work focuses on Tamil words rather than isolated characters. In [22] the researchers attempt both online and offline character recognition for Bangla numerals, using a database of Bangla numerals extracted from postal addresses with a state-of-the-art flatbed scanner. In [23] the authors outline the various methods that were and are being used for character recognition, covering feature extraction techniques such as the Fourier, Cosine, Slant and Wavelet transforms. They outline other methods, such as fuzzy rules, Mahalanobis and Hausdorff distances and evolutionary algorithms, and explain why these are not used on Indic scripts. The paper also goes into detail about the various advances made in character recognition for Indic scripts. In [24] the authors propose handwritten Kannada character recognition based on Fisher linear discriminant analysis (FLD)[25]. Their system performs FLD-based feature extraction.
A. The Dataset

The Char74k dataset consists of 657 Kannada character classes collected using a tablet PC. Each character has 25 samples. The samples have different stroke thicknesses and styles, as in the case of most handwritten characters. There are several modifiers which are added to the base-class characters to obtain new characters that belong to the modified classes. The base classes and the modified classes cover all the consonant-vowel combinations. The 657 characters are either vowels, base classes, modified classes or numeric digits.

The full dataset consists of images of street scenes taken in Bangalore. In addition to these images there are also handwritten characters and other characters generated by computer fonts. The individual characters were obtained from the images by manual segmentation. The handwritten characters were captured using a tablet PC.

Fig. 2. Kannada handwritten characters in the Char74k dataset collected using a tablet PC[9].

B. Data Collection and Preprocessing

Most of the data pre-processing consisted of converting the images into a form the classifiers can consume.
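A minimal sketch of such a conversion pipeline, assuming grayscale conversion, a fixed 32x32 resolution and scaling of pixel values to [0, 1] (all three are illustrative choices, not parameters taken from this paper):

import numpy as np
from PIL import Image

IMG_SIZE = 32  # assumed working resolution; not specified in this paper

def preprocess(path):
    """Load one character image, convert to grayscale, resize, scale to [0, 1]."""
    img = Image.open(path).convert("L")      # grayscale
    img = img.resize((IMG_SIZE, IMG_SIZE))   # fixed input size
    return np.asarray(img, dtype=np.float32) / 255.0

def to_flat(images):
    """Flatten images to pixel vectors for the RFC and MNNBC."""
    return np.stack([im.reshape(-1) for im in images])

def to_cnn_input(images):
    """Keep the 2D shape plus a channel axis for the CNN: (N, 32, 32, 1)."""
    return np.stack(images)[..., np.newaxis]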
C. Random Forest Classifier

The RFC[6] was trained with min_samples_leaf=5. It works on the principle of bootstrap aggregating (bagging). It is a meta-algorithm which takes M sub-samples (with replacement) from the initial dataset and trains a predictive model on each of those sub-samples. Random Forest introduces randomness into the model while growing the decision trees. The final model is obtained by averaging the "bootstrapped" models and usually yields better results. Bagging can also be used to provide an ongoing estimate of the generalization error, as well as estimates of strength and correlation.

The drawback of an RFC is that it takes a considerable amount of time for classification due to the decision-tree structure: the decisions and predicted classes must be aggregated.
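A minimal sketch of this classifier with scikit-learn; min_samples_leaf=5 is taken from the paper, while the synthetic stand-in data and the train/test split are ours for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 32 * 32))        # stand-in for flattened character images
y = rng.integers(0, 657, size=1000)    # stand-in for the 657 class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# min_samples_leaf=5 as stated above; all other parameters are sklearn defaults.
rfc = RandomForestClassifier(min_samples_leaf=5)
rfc.fit(X_train, y_train)
print("test accuracy:", rfc.score(X_test, y_test))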
D. Multinomial Naive Bayes Classifier

The MNNBC[7] is a generative model. The principle behind the MNNBC is that the pixel values are treated as independent of each other. It is a probabilistic classifier and therefore calculates the probability of each category using Bayes' theorem. The MNNBC starts by calculating the prior probability of each label; this was implemented by counting the frequency of each label in the training set. The contribution of each character is then combined with this prior probability to obtain estimates for each of the labels in the dataset. The label with the highest probability is output.

A subtle issue with the MNNBC is that if a class label and a certain attribute value never occur together, then the frequency-based probability estimate will be zero. Given the conditional-independence assumption, when the probabilities are multiplied the result will be zero. The model also assumes that the features are independent of each other, which is not always true.
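A minimal sketch with scikit-learn, continuing from the data in the RFC sketch above. The add-one (Laplace) smoothing shown is scikit-learn's default and is one standard way around the zero-frequency problem just described; it is our choice, not a setting reported in the paper:

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB expects non-negative feature values; pixel intensities
# in [0, 1] satisfy this. alpha=1.0 is Laplace (add-one) smoothing,
# which prevents zero frequency estimates from zeroing out the product.
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_train, y_train)            # same flattened pixel features as the RFC
print("test accuracy:", mnb.score(X_test, y_test))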
E. Convolutional Neural Network

Convolutional Neural Networks (CNNs)[8] are comprised of neurons that self-optimise through learning. Each neuron receives an input and performs a specific operation. From the raw input image vectors to the final output of the class scores, the entire network still expresses a single perceptive score function (the weights). The last layer contains the loss function associated with the classes. CNNs are primarily used in the field of pattern recognition within images. Our CNN was implemented using Keras as the API, and the architecture consists of the following layers.

The first is the convolutional (Conv2D) layer. It is a set of learnable filters that creates a convolution kernel which is convolved with the layer input. The first convolutional layer uses 32 filters and the next uses 64. Each filter transforms a part of the image using the kernel filter, and the kernel matrix is applied over the whole image. The parameters for the first Conv2D layer were set as kernel_size=(3, 3), activation='relu' and input_shape=input_shape; the parameters for the second Conv2D layer were set as kernel_size=(3, 3) and activation='relu'. The activation function used is the Rectified Linear Unit (ReLU). The ReLU activation function accelerates convergence and prevents all the neurons from firing in an analogous manner, ensuring that only some of the neurons in the network activate. This makes the activations sparser and more efficient; the rectifier activation function adds non-linearity to the network.

The next layer in the CNN is the pooling (MaxPool2D) layer, a max-pooling operation for spatial data. It examines each 2x2 neighborhood of pixels and picks the maximal value. This operation is performed to reduce computational cost and, to some extent, also to reduce overfitting; the higher the pooling dimension, the greater the downsampling. The parameters were set as pool_size=(2, 2), and the remaining parameters were assigned the default Keras values. These parameters halve the image in each dimension. By combining convolutional and pooling layers, CNNs are able to combine local features and learn more global features of the image.

The Dropout layer randomly drops a portion of the network and forces it to learn features in a distributed way. Dropout is a regularization method where a proportion of nodes in the layer are randomly ignored for each training sample; it is done to improve generalization and reduce overfitting. The probability of dropping is set to 0.25. The Flatten layer is used to convert the final feature maps into a single 1D vector. This flattening step is required in order to use fully connected layers after the convolutional/max-pooling layers, and it combines all the local features found by the previous convolutional layers. A Dense layer, which is an artificial neural network (ANN) classifier, is then applied with activation='relu'. An additional Dropout layer is applied, with the probability of dropping set to 0.5. Ultimately the features are used in two fully connected (Dense) layers; with the last layer, Dense(657, activation="softmax"), the net outputs a probability distribution over the classes. The softmax activation function returns class probabilities which sum to one.
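The description above corresponds to the following Keras model. This is a sketch reconstructed from the stated layers and parameters; the input resolution (32x32x1) and the width of the first Dense layer (128) are not given in the paper and are our assumptions:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

num_classes = 657
input_shape = (32, 32, 1)                 # assumed resolution, single channel

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
    Conv2D(64, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),       # halves each spatial dimension
    Dropout(0.25),                        # drop 25% of activations
    Flatten(),                            # feature maps -> single 1D vector
    Dense(128, activation='relu'),        # width 128 is our assumption
    Dropout(0.5),                         # drop 50% before the output layer
    Dense(num_classes, activation='softmax'),  # class probabilities summing to one
])
model.summary()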
The model is then compiled, and an F1 score is used to evaluate it. F1 is an overall measure of a model's accuracy that combines precision and recall; a good F1 score indicates infrequent false positives and false negatives. The compilation parameters are set as loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adadelta() and metrics=[f1]. The CNN is trained for 128 epochs (iterations over the dataset). The arguments of the fit function are passed as x_train, y_train, batch_size=batch_size, epochs=128, verbose=1 and validation_data=(x_val, y_val), where batch_size was set to 128.
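A sketch of the compile-and-fit step as described, applied to the model defined above. Keras has no built-in f1 metric, so the helper below is one common batch-wise approximation; it is our stand-in, not necessarily the authors' implementation, and x_train, y_train, x_val and y_val are assumed to come from the preprocessing above, with one-hot 657-dimensional labels:

import keras
from keras import backend as K

def f1(y_true, y_pred):
    """Batch-wise F1: harmonic mean of precision and recall."""
    true_pos = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    pred_pos = K.sum(K.round(K.clip(y_pred, 0, 1)))
    poss_pos = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_pos / (pred_pos + K.epsilon())
    recall = true_pos / (poss_pos + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=[f1])

batch_size = 128
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=128,
          verbose=1,
          validation_data=(x_val, y_val))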
V. RESULTS

Comparing our accuracies to the ones shown in papers [1] and [9], we can see that our CNN architecture outperforms all other models and shows an improvement in accuracy of 5% compared to [9], even without manual feature extraction, as shown in Figure 4. Our machine learning models, however, suffer from the same low accuracy as those in [1]; possible reasons for the differences in accuracy are discussed in Section VI. These results were calculated on a cloud GPU. The CNN predicts the handwritten Kannada characters with an accuracy of 57 percent, whereas the other models show low accuracy: the Random Forest Classifier shows an accuracy of 5 percent, and the Multinomial Naive Bayes Classifier shows an accuracy of 4 percent. We see a higher accuracy due to the deep learning nature of the CNN algorithm. The time taken by the CNN to train, 139.98 seconds, is significantly higher than that of either of the machine learning models; the Random Forest takes 90 seconds, followed by the Multinomial Naive Bayes Classifier at 947 milliseconds. One of the reasons the CNN took almost 1.5 to 2 times longer to train than the machine learning algorithms is that it has to train over multiple epochs, 128 in our experiment.

Fig. 3. Time taken by different classifiers.

Fig. 4. Accuracy of different classifiers.
VI. DISCUSSION

In this paper we chose to tackle the problem of handwritten character recognition. [1] showed accuracies of 2.77% and 3.4% using machine learning models like Nearest Neighbors[13] and Support Vector Machines[14] with feature extraction methods like Shape Contexts[15], Geometric Blur[16], the Scale Invariant Feature Transform[17] and Patch Descriptors[18] on handwritten Kannada characters. One reason for the poor accuracy is the fact that Kannada is agglutinative, together with the fact that handwriting varies from person to person. [9] then reported results using Hidden Markov Models[19] with implicit segmentation. Our machine learning models, the RFC and the MNNBC, perform very similarly to [1], showing that we have achieved a baseline for our experiment. The reason the CNN implementation does so well, with an accuracy of 57% even without feature extraction, is the convolutions, multiple filters, and dense and dropout layers, highlighted in [20], which the CNN leverages in order to find spatially invariant features.

TABLE I. RESULTS

Classifier       Time taken to train and predict   Accuracy on test set (%)
Random Forest    1 min 30 s                         5.234
MNNBC            947 ms                             4.199
CNN              2 min 19 s                        57.002
VII. CONCLUSION AND FUTURE WORK

In this paper we have shown that deep learning models, here the CNN, train on and classify handwritten characters far more accurately than probabilistic classifiers like the Multinomial Naive Bayes Classifier and bootstrap-aggregating methods such as the Random Forest Classifier.

Future work will involve implementing a capsule network[11], as it does not share the CNN's difficulty with spatial relationships between features, which are plentiful in Indic languages. A capsule neural network is a deep learning model that models hierarchical relationships better. Capsule networks add structures called capsules to a CNN and reuse the output from those capsules to perform classification. In capsule networks the extra layers are added inside a single layer; they use minimal preprocessing compared to other classification models; and dynamic routing is performed between capsules instead of between neurons as in CNNs. The layers are arranged in functional pods which enable designers to distinguish between the various elements. The model can be evaluated and compared to see whether it outperforms the CNN model. Manual feature extraction methods can also be implemented with the CNN to see if they help improve the accuracy of the models. Feature extraction maximizes the recognition rate with the least number of elements: it starts from an initial set of measured data and derives values (features) that are informative, facilitating the subsequent learning and classification steps, which leads to more accurate interpretations. Methods like Zoning, Histograms of Oriented Gradients[12], Gradient-Based features[13] or other more recent feature extraction methods may be implemented. We may also improve on the cross-validation scheme and use more Indic scripts that are agglutinated along the horizontal direction, like Telugu, Devanagari and Tamil. Cross-validation is a statistical method used to estimate the skill of machine learning models; it is a resampling technique used to evaluate a model when data is limited.
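As an illustration of such a scheme, the sketch below runs stratified k-fold cross-validation with scikit-learn; k=5 is an arbitrary choice for this example, and X, y are assumed to hold the full set of flattened samples and labels (25 per class):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# With 25 samples per class, k=5 keeps 5 samples of every class
# in each validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(min_samples_leaf=5), X, y, cv=cv)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))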
REFERENCES

[1] T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," VISAPP, 2009.
[2] U.-V. Marti and H. Bunke, "The IAM-database: an English sentence database for offline handwriting recognition," International Journal on Document Analysis and Recognition, vol. 5, no. 1, 2002.
[3] Patil and S. Shimpi, "Handwritten English character recognition using neural network," Elixir Comput. Sci. Eng., vol. 41, pp. 5587–5591, 2011.
[4] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," International Conference on Pattern Recognition, 2012.
[5] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," International Conference on Computer Vision, 2011.
[6] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[7] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger, "Tackling the poor assumptions of naive Bayes text classifiers," Proceedings of the 20th International Conference on Machine Learning, 2003.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.
[9] V. M. Venkatesh and D. Vijayasenan, "Implicit segmentation of Kannada characters in offline handwriting recognition using hidden Markov models."
[10] R. Smith, "An overview of the Tesseract OCR engine," Ninth International Conference on Document Analysis and Recognition, vol. 2, 2007.
[11] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," Advances in Neural Information Processing Systems, 2017.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2005.
[13] T. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, pp. 21–27, 1967.
[14] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, 1999.
[15] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[16] A. C. Berg, T. L. Berg, and J. Malik, "Shape matching and object recognition using low distortion correspondences," IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[17] D. G. Lowe, "Object recognition from local scale-invariant features," International Conference on Computer Vision, Corfu, Greece, 1999.
[18] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, June 18–20, vol. 2, 2003.
[19] S. Fine, Y. Singer, and N. Tishby, "The hierarchical hidden Markov model: Analysis and applications," Machine Learning, vol. 32, no. 1, pp. 41–62, 1998.
[20] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," European Conference on Computer Vision, Springer, Cham, 2014.
[21] R. J. Kannan, R. Prabhakar, and R. M. Suresh, "Off-line cursive handwritten Tamil character recognition," IEEE, 2008.
[22] B. B. Chaudhuri, "A complete handwritten numeral database of Bangla–a major Indic script," 2006.
[23] U. Pal and B. B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, no. 9, pp. 1887–1899, 2004.
[24] S. K. Niranjan, V. Kumar, and H. Kumar, "FLD based unconstrained handwritten Kannada character recognition," vol. 3, 2008.
[25] S. Mika et al., "Fisher discriminant analysis with kernels," Neural Networks for Signal Processing, IEEE, 1999.
[26] R. S. Kunte and R. S. Samuel, "A simple and efficient optical character recognition system for basic symbols in printed Kannada text," Sadhana, vol. 32, no. 5, p. 521, 2007.