
Handwritten Line Text Recognition using Machine Learning

Mansi Mayekar, Punit Mestha, Shoaib Asif, Piyush Singh
Department of Electronics and Telecommunication
SIES Graduate School of Technology, Navi Mumbai, India
[email protected], [email protected], [email protected], [email protected]

Asst. Prof. Sonal Jatkar (Project Guide)
Department of Electronics and Telecommunication
SIES Graduate School of Technology, Navi Mumbai, India
[email protected]

Abstract— Handwritten Line Text Recognition (HLTR) is one of the emerging fields within computer vision and machine learning. There are many languages, and in every language there are many individual handwriting styles that need to be correctly identified. Humans identify and process the text they see, and the same is expected of an HTR system, although recognizing handwritten text automatically is difficult for a machine. Handwritten line text recognition involves several steps, among them pre-processing of the input image, feature extraction and transcription. During these processes the system is trained, and the similarities and differences between handwriting samples are learned. The application takes a picture of the handwritten text and converts it into digital text, which is the desired output.

Keywords— HLTR (Handwritten Line Text Recognition), NN (Neural Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), CTC (Connectionist Temporal Classification), TF (TensorFlow), LSTM (Long Short Term Memory)

I. INTRODUCTION
With the advancement of technology, manipulation of photographs has become much easier, and handwritten line text recognition plays an integral part in this. Handwritten line text recognition is a challenging problem, since different people have different handwriting and styles, yet it is a very useful capability. Text recognition in general is a difficult task for a machine, so the system must be trained accordingly. Character recognition involves multiple steps: image acquisition, pre-processing, feature extraction, sequence labeling and decoding of the output. Handwritten text recognition enables a machine to detect, interpret and successfully recognize handwritten text from an external media source such as an image or a scanned document. In this project, neural networks play a vital role: convolutional neural networks and recurrent neural networks are combined with Connectionist Temporal Classification so that the model does not depend on lexical segmentation or manual feature extraction. Training on large datasets also helps considerably in obtaining the desired results. Handwritten line text recognition converts all the text in a picture into letter codes, and the result obtained is a textual representation of the image.

With these processes, the HTR method mainly suffers from unstable segmentation, which affects recognition accuracy. A model consisting of many separately trained modules also makes it tedious to achieve the desired output. In this project, word image algorithms are combined with line image algorithms on a given image, helping the user to easily convert handwritten documents into digital format. The main problem still arises during classification, and the solution is to avoid segmentation into lines. The uses of HTR are very diverse: it can be applied in banking applications, verification of a person's signature, courier services and more.

II. LITERATURE SURVEY
Handwritten line text recognition has been an active field of research and has witnessed much progress. It is a challenging field, as a person's writing depends on various factors such as the instrument used to write, the speed with which the text is written and the pressure applied while writing. In previous research, a sliding window is applied to the input image and each patch is passed to a convolutional feature extractor and then to an encoder-decoder bidirectional long short term memory. The drawback is that this adds more weight parameters to the model, thereby increasing training time [5]. Work in document analysis and recognition shows the advantage of mixing datasets with small images to improve accuracy, but it has two disadvantages: 1) the model requires classification of images as printed or handwritten to improve prediction accuracy, and 2) it was applied to the Latin script only [6]. In work from computer applications and industrial electronics, datasets were segmented and tested using various feature-based classification techniques, but separate classifiers were required for upper- and lower-case English characters to improve recognition accuracy [7]. Work in pattern analysis and machine intelligence suggests using separable multi-dimensional long short term memory recurrent neural network modules that extract contextual information in various directions and consume much less computation; its remaining challenge is to solve context overfitting in order to improve system performance, and it too was applied to the Latin script only [8]. The proposed model uses 7 CNN layers and 2 RNN layers with CTC to predict the line text without pre-segmentation. The LSTM enables the network to retain useful information over longer time periods and to make predictions based on previous results; it also allows more robust training and the exploitation of more context. The Connectionist Temporal Classification (CTC) loss helps to remove alignment problems, as every individual has a different handwriting [1].

III. PROPOSED WORK

A. MODEL OVERVIEW
The proposed Handwritten Line Text Recognition (HTR) model is a state-of-the-art neural network built by stacking 7 Convolutional NN (CNN) layers, 2 Recurrent NN (RNN) layers and a Connectionist Temporal Classification (CTC) layer. The CNN layers extract features from the input image given to the NN. Their output is a 1D/2D feature map, which is passed to the RNN layers, and the CTC layer is used for calculating the loss and for decoding. The NN is trained on images of entire lines/sentences from the IAM line dataset to build the handwritten line text recognition (HTR) model. Training the model is possible on a CPU, although a GPU is preferable.

Fig. 1: Overview of HTR

B. METHODOLOGY

a) Description of Dataset:
The dataset we have used is the IAM dataset, which is a large database. It accommodates various forms of handwritten text that are used for training and testing purposes and thus for performing various experiments. The IAM dataset has over 600 writers who have contributed handwriting samples over more than 1500 pages of scanned text, meaning each writer contributed on average more than 2 pages. In total it contains 5,685 sentences, 13,353 isolated and labeled text lines, and over 115,000 isolated and labeled words. All of this data is combined into one training, one testing and one validation set. The dataset contains many images of the same type with a particular dimension, together with their ground-truth text, and includes text written with various writing instruments [2]. Table I gives the exact amount of text in the dataset, and Fig. 2 shows a sample image from the IAM line dataset. Images from the dataset are divided 95%:5% into training and validation sets for the NN; Table II shows the number of text lines fed to the NN for each set.

Table I: Summary of Dataset

Dataset   Words     Text Lines   Writers
IAM       115,320   13,353       657

Fig. 2: Sample image from IAM line dataset

Table II: Splitting the Dataset

Set          Number of text lines in set
Training     12,685
Validation   668
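As a rough illustration of the 95%:5% split described above, the sketch below partitions a list of IAM line-sample identifiers into training and validation sets. The identifiers shown are hypothetical placeholders; the real ones come from the IAM ground-truth files, whose exact layout is not specified in this paper.

```python
import random

def split_iam_lines(ids, val_fraction=0.05, seed=42):
    """Shuffle IAM line-sample ids and split them 95%:5% into train/validation."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    n_val = int(len(ids) * val_fraction)      # 5% of the samples
    return ids[n_val:], ids[:n_val]           # (train_ids, val_ids)

# Hypothetical identifiers for illustration only.
sample_ids = [f"a01-000u-{i:02d}" for i in range(13353)]
train_ids, val_ids = split_iam_lines(sample_ids)
print(len(train_ids), len(val_ids))           # roughly 12685 / 668
```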
b) Model Architecture:
The proposed model contains 7 CNN layers along with 2 Bi-directional LSTM (BLSTM) layers with 512 hidden cells. Handwritten text lines are given as input to the neural network model; the NN takes gray-scale text images as input. The stacked CNN layers map the input image into feature maps, and the seventh CNN layer outputs a sequence of length T with F features per time step. This sequence is then mapped by the RNN layers to another sequence of the same length T, assigning a probability to each of the C different character classes at every time step. This forms a matrix of size C x T, in which the most probable labeling is found. This process is termed decoding and is performed by the terminating CTC layer. The CTC layer also calculates the loss value for a batch, which is then back-propagated to the output layer of the RNN. The CTC layer therefore serves as both the loss function and the decoding function in this model.

Fig. 3: Architecture of Proposed Model
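A minimal Keras sketch of the architecture described above and in the CNN/RNN subsections that follow: 7 convolutional layers (5x5 kernels in the first two, 3x3 in the rest) with 2x2 pooling, two bidirectional LSTM layers, and a dense projection to the C character classes consumed by the CTC loss. The exact filter counts, the pooling schedule, and the reading of "512 hidden cells" as two bidirectional layers of 256 units (512 outputs per step) are assumptions chosen to reproduce the 100x80 RNN output mentioned later, not values confirmed in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 80          # 79 characters + 1 CTC blank (assumed split of the 80 classes)

def build_htr_model(img_h=64, img_w=800):
    """Sketch of the 7-CNN + 2-BLSTM + CTC architecture described in the paper."""
    inp = layers.Input(shape=(img_h, img_w, 1), name="line_image")   # gray-scale line image

    x = inp
    # 7 convolutional layers: 5x5 kernels in the first two, 3x3 in the remaining five.
    conv_cfg = [(32, 5, (2, 2)), (64, 5, (2, 2)), (128, 3, (2, 2)),
                (128, 3, (2, 1)), (256, 3, (2, 1)), (256, 3, (2, 1)), (512, 3, None)]
    for filters, k, pool in conv_cfg:
        x = layers.Conv2D(filters, k, padding="same", activation="relu")(x)
        if pool is not None:
            x = layers.MaxPooling2D(pool_size=pool)(x)   # condense image regions

    # Collapse the height dimension: (1, T, F) -> sequence of length T=100 with F=512 features.
    x = layers.Reshape((img_w // 8, 512))(x)

    # Two bidirectional LSTM layers (2 x 256 forward/backward units = 512 outputs per step).
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Per-time-step class scores, shape (T, C) = (100, 80), fed to the CTC loss/decoder.
    out = layers.Dense(NUM_CLASSES, activation="softmax", name="char_probs")(x)
    return tf.keras.Model(inp, out)

model = build_htr_model()
model.summary()
```

During training, the CTC loss described later would be attached to this output, for example with tf.keras.backend.ctc_batch_cost; the paper does not state which CTC implementation was used.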

c) Operations:

Input Data:
The input image to the Neural Network is of dimension 800×64. Most images in the IAM dataset are not of this size, so the input image is first resized to a width of 800 or a height of 64. It is then normalized and placed on a white target image of size 800×64 to ease the work of the NN. Fig. 4 shows the output obtained after the preprocessing stage.

Fig. 4: Various steps of preprocessing stage
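A minimal sketch of the preprocessing described under Input Data: the line image is scaled to fit within 800×64 while keeping its aspect ratio, pasted onto a white 800×64 canvas, and normalized. OpenCV and NumPy are used here for illustration; the paper does not name the libraries, and scaling pixel values to [0, 1] is an assumption.

```python
import cv2
import numpy as np

TARGET_W, TARGET_H = 800, 64

def preprocess(img):
    """Resize a gray-scale line image into a white 800x64 canvas and normalize it."""
    h, w = img.shape
    scale = min(TARGET_W / w, TARGET_H / h)                # fit width or height, keep aspect ratio
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    img = cv2.resize(img, (new_w, new_h))

    canvas = np.full((TARGET_H, TARGET_W), 255, dtype=np.uint8)   # white target image
    canvas[:new_h, :new_w] = img                           # place the text on the canvas

    return canvas.astype(np.float32) / 255.0               # scale pixel values to [0, 1]

# Usage: gray = cv2.imread("line.png", cv2.IMREAD_GRAYSCALE); x = preprocess(gray)
```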

CNN:
A 7-layer CNN is used in the model to extract features from the image. The first two layers use a 5x5 filter, and a 3x3 filter is used in the remaining 5 layers. A 2x2 pooling layer is used to condense image regions.

RNN:
The RNN consists of 100 time steps. The LSTM enables the network to retain useful information over longer time periods and to make predictions based on previous results. It also allows more robust training and the exploitation of more context [1]. Two RNN layers are stacked to create a bidirectional LSTM whose output, of size 100x80, is given to the CTC layer. Long short-term memory (LSTM) is an RNN variant that performs better because it has feedback connections, and it can process single images as well as entire sequences of data.

CTC:
The output of the BLSTM layer is given to the CTC layer, which calculates the loss value by comparing the output with the ground-truth line text. The Connectionist Temporal Classification (CTC) loss function helps to remove alignment problems, as every individual has a different handwriting. The loss value is then used to train the Neural Network to predict the correct output.

Different decoding algorithms include:

(i) Best Path Decoding: it uses only the output of the NN and computes the most suitable path by taking the most probable character at each time step (a sketch is given after this list).
(ii) Word Beam Search: it creates beams and scores them; only the top-scoring beams from the previous time step are kept at each time step. This produces more accurate results.
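A minimal NumPy sketch of best path (greedy) CTC decoding as described in item (i): take the arg-max class at each of the T time steps, collapse repeated characters, and drop the CTC blank. The character list is a hypothetical placeholder, and treating the blank as the last class index is an assumption.

```python
import numpy as np

# Hypothetical character list for illustration; the last class index is assumed to be the CTC blank.
CHARS = "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,'-"

def best_path_decode(probs, chars=CHARS):
    """probs: (T, C) matrix of per-time-step class probabilities from the RNN/CTC output."""
    blank = probs.shape[1] - 1               # assumed blank index
    best = np.argmax(probs, axis=1)          # most probable class per time step
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank and idx < len(chars):   # collapse repeats, drop blanks
            out.append(chars[idx])
        prev = idx
    return "".join(out)

# Usage: line_text = best_path_decode(model.predict(x[None, ..., None])[0])
```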

C. IMPROVING THE MODEL:
Prediction accuracy can be increased by:
● Better image pre-processing: reducing background noise so that real-time images are handled more accurately.
● Data augmentation: increasing the size of the dataset by applying further random transformations, such as random noise and random stretch, to the input images (a sketch is given after this list).
● Removing input images with cursive handwriting.
Also, a Multi-Dimensional LSTM (MDLSTM) can be employed to recognize a whole paragraph at once, and line segmentation can be added for full-paragraph text recognition.
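A minimal sketch of the data augmentation mentioned in the list above: a random horizontal stretch and additive random noise applied to a preprocessed line image. The stretch range and noise level are arbitrary illustrative values, not parameters taken from the paper.

```python
import cv2
import numpy as np

def augment(img, rng=np.random.default_rng()):
    """Apply random stretch and random noise to a normalized gray-scale line image."""
    h, w = img.shape

    # Random horizontal stretch by a factor in [0.8, 1.2], then crop/pad back to width w.
    factor = rng.uniform(0.8, 1.2)
    stretched = cv2.resize(img, (max(1, int(w * factor)), h))
    out = np.ones((h, w), dtype=np.float32)                # white background (values in [0, 1])
    out[:, :min(w, stretched.shape[1])] = stretched[:, :w]

    # Additive Gaussian noise, clipped back into the valid pixel range.
    out = out + rng.normal(0.0, 0.05, size=out.shape)
    return np.clip(out, 0.0, 1.0).astype(np.float32)

# Usage: noisy = augment(preprocess(gray))
```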
IV. RESULTS
The proposed HLTR model was trained on a Tesla T4 GPU. The batch size was set to 64, so the model had a total of 198 training batches, and training completed after 68 epochs. The performance of the model is shown in detail using evaluation metrics in Table III. Fig. 5 shows the progression of the Character Error Rate (CER) and Word Error Rate (WER) with respect to the epoch.

Fig. 5: Learning curve of HTR model

Fig. 6: Address accuracy v/s Epoch

Table III: Performance of the Proposed Model

Evaluation Metric    Training    Validation
CER                  10.54%      9.98%
WER                  27.44%      27.08%
Address accuracy     10%         10.62%
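For reference, a small sketch of how the CER and WER reported in Table III can be computed: both are the Levenshtein (edit) distance between the prediction and the ground truth, normalized by the ground-truth length, at the character and word level respectively. This is the standard formulation of these metrics, not code from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(pred, truth):
    return edit_distance(pred, truth) / max(1, len(truth))

def wer(pred, truth):
    return edit_distance(pred.split(), truth.split()) / max(1, len(truth.split()))

print(cer("handwriting", "handwritten"), wer("hand written text", "handwritten text"))
```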


The handwritten image below is the input given to the NN by the user to predict the handwritten text. Our system predicts the handwritten text line by loading the previously created and saved model. The final text recognition output is shown in Fig. 7.

Fig. 7: Final prediction of HTR model

V. CONCLUSION
Our HTR model has a character accuracy of 90.02% and a word accuracy of 73%. The proposed handwritten text line recognition approach described in this paper is very efficient and is achieved mainly through the CNN. In order to recognize characters easily, it is necessary to train the network with a large dataset; hence, to achieve an efficient network, more memory and better processing speed are required. Both efficient and effective results are achieved using this algorithm. Text with less noise gives the best accuracy, and the accuracy depends heavily on the dataset. Various additional processing techniques such as de-slanting, word segmentation and removal of background noise can be used to improve the accuracy further. Increasing the amount of data gives higher accuracy, and avoiding cursive writing also yields better results.
VI. FUTURE SCOPE
In the future, we plan to develop this model so that it can predict text from various datasets. As the future is increasingly technology driven, the model can be used in real-time applications to convert handwritten text into digital text, for example for scanning cheques, bills and historical documents, thereby easing people's daily tasks.
