HCR-Net: A Deep Learning Based Script Independent Handwritten Character Recognition Network
https://doi.org/10.1007/s11042-024-18655-5
Abstract
Handwritten character recognition (HCR) remains a challenging pattern recognition problem
despite decades of research, and research on script independent recognition techniques is scarce.
This is mainly because of similar character structures, different handwriting styles, diverse
scripts, handcrafted feature extraction techniques, unavailability of data and code, and the
development of script-specific deep learning techniques. To address these limitations, we
propose a script independent deep learning network for HCR research, called HCR-Net,
that sets a new research direction for the field. HCR-Net is based on a novel transfer
learning approach for HCR which partly utilizes the feature extraction layers of a pre-trained
network. Due to transfer learning and image augmentation, HCR-Net provides faster and
computationally efficient training, better performance and generalization, and can work
with small datasets. HCR-Net is extensively evaluated on 40 publicly available datasets
of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam,
Telugu, Marathi, Nepali and Arabic languages, and established 26 new benchmark results
while performing close to the best results in the remaining cases. HCR-Net showed performance
improvements of up to 11% over the existing results and achieved a fast convergence rate,
reaching up to 99% of its final performance in the very first epoch. HCR-Net significantly
outperformed the state-of-the-art transfer learning techniques and also reduced the number
of trainable parameters by 34% as compared with the corresponding pre-trained network. To
facilitate reproducibility and further advancement of HCR research, the complete code is
publicly released at https://github.com/jmdvinodjmd/HCR-Net.
Anuj Sharma (corresponding author)
[email protected]
Vinod Kumar Chauhan
[email protected]
Sukhdeep Singh
[email protected]
1 Department of Engineering Science, University of Oxford, Oxford, UK
2 D. M. College (affiliated to Panjab University Chandigarh), Moga, Punjab, India
3 Department of Computer Science and Applications, Panjab University, Chandigarh, India
1 Introduction
For some scripts, like Chinese, Japanese and Latin, there has been extensive research [29,
53, 54, 60, 63]. However, for some other scripts, such as Gurmukhi, the research is still in its
infancy. Since recognition for individual scripts is considered an open problem and is still being
researched [92], there is little research on multi-script models. This is due to the following reasons:
(i) conventional research focused on handcrafted features, which are domain/script
specific and are not always available, (ii) the large variety of scripts and their diversity, (iii)
the unavailability of datasets and code repositories for extending work and reproducing results,
and (iv) the existing deep learning techniques are developed for specific scripts only.
From the above discussion, it is clear that HCR remains a challenging pattern recognition
problem and lacks script independent recognition techniques. Deep learning offers
end-to-end learning solutions for HCR; however, the existing deep learning
techniques are customized to specific scripts/datasets, are computationally expensive and
also lack reproducible code. Thus, the objective of this study is to develop a script independent,
computationally efficient, fast, robust, publicly available and reproducible
deep learning technique for HCR.
Building on the recent success of deep learning and its capability for end-to-end learning,
and on the availability of public datasets, this paper proposes the first script independent
deep convolutional network for HCR, called HCR-Net. The proposed network is a script
independent technique as it is not dependent on script-specific handcrafted features which
might not be available for all the scripts. HCR-Net is based on a novel transfer learning
approach for HCR, which partly utilizes feature extraction layers of a pre-trained network,
unlike the existing techniques [32] that use the entire feature extraction layers. The proposed
transfer learning approach is based on our hypothesis that the HCR task is simpler than those
for which pre-trained models, like VGG16 [87], are developed, so HCR does not need
complex models. HCR-Net has been extensively evaluated on 40 publicly available datasets
and has established several new benchmarks, as discussed in Section 4.
The key contributions of the paper are summarized below.
(a) This paper proposes the first script independent deep convolutional network for
end-to-end HCR, called HCR-Net, and sets a new research direction for the HCR field.
(b) HCR-Net develops a novel transfer learning approach for HCR research by partly utilizing
the feature extraction layers of a pre-trained VGG16 to initialize some of its lower
layers, unlike the existing research which utilizes all feature extraction layers of
pre-trained models. Transfer learning along with image augmentation helps HCR-Net achieve
faster, computationally efficient and robust learning, even on small datasets,
as compared with CNN models developed from scratch.
(c) HCR-Net is extensively evaluated on 40 publicly available datasets of Bangla, Punjabi,
Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi,
Nepali and Arabic languages, and established 26 new benchmark results while performing
very close to the best results in the remaining cases. HCR-Net showed performance improvements
of up to 11% over the existing results. HCR-Net achieved a fast convergence rate
and showed up to 99% of final performance in the very first epoch. HCR-Net also sig-
nificantly (p-value=0.00099 using Student’s t-test) outperformed the existing transfer
learning techniques and showed a 34% reduction in the number of trainable parameters
as compared with the corresponding pre-trained network.
(d) For reproducibility and advancement of the HCR research, the complete code is released
at: https://github.com/jmdvinodjmd/HCR-Net.
Organization of the rest of the paper Section 2 presents a literature review and discusses
conventional approaches and recent deep learning based approaches for HCR. Section 3
discusses HCR-Net and Section 4 presents experimental results on different scripts. Finally,
Section 5 concludes the paper.
2 Literature review
In this section, literature on HCR is briefly discussed which can be broadly classified into
two categories, conventional approaches and deep learning based approaches, as depicted in
Fig. 1 and discussed in the following subsections.
2.1 Conventional approaches
The HCR field has been studied extensively for more than five decades [1, 7, 25, 28, 31, 52,
53, 56, 65, 74, 85, 93, 94]. Earlier, the focus of research was mainly on developing feature
extraction techniques and applying different classification techniques for recognition. Fea-
ture extraction is the process of finding key features which can distinguish different classes
correctly and is a critical factor for the performance of machine learning models. Feature
extraction can be further broadly classified into statistical and structural feature extraction
techniques. Statistical feature extraction considers features based on pixel distribution in an
image, e.g., histograms, zoning and moments. In contrast, structural feature extraction techniques
consider features based on the structure of characters, such as loops, intersections and
number of endpoints. On the other hand, classification techniques are machine learning tools
which learn to classify/recognize a script from a given feature/dataset. For example, SVM
(for more details refer to [13]), k-NN and MLP are the most widely used classifiers in the con-
ventional approaches [22, 24, 25, 33, 36, 79]. A few representative conventional approaches
are discussed below.
Granlund [31] proposed a Fourier transformation-based feature extraction along with a
non-optimized decision method for the recognition of handwritten characters. Lam and Suen
[52] developed a system for the recognition of unconstrained handwritten digits using feature
extraction based on geometric primitives containing topological information such as convex
polygons and line segments, with a relaxation matching classifier. Pal and Chaudhuri [66]
proposed a novel feature extraction method based on the concept of water overflow from
a reservoir as well as statistical and topological features along with a tree-based classifier
for unconstrained offline handwritten Bangla numerals. Bhattacharya and Chaudhuri [8]
used wavelet-based multi-resolution features with multi-layer perceptron classifiers for digit
recognition. Das et al. [25] proposed genetic algorithm (GA), simulated annealing and hill
climbing techniques to sample regions to select local features. They used an SVM classifier for
handwritten digit recognition. Das et al. [24] proposed principal component analysis (PCA),
modular PCA and quad-tree-based hierarchically derived longest-run features with SVM
for recognition of numerals of Devanagari, Telugu, Bangla, Latin and Arabic scripts. Das
et al. [22] presented a benchmark offline dataset of isolated handwritten Bangla compound
characters, called CMATERdb 3.1.3.3. The recognition was performed using quad-tree-based
features with SVM. Ghosh et al. [30] studied a multi-script numeral recognition for Bangla,
Arabic, Telugu, Nepali, Assamese, Gurmukhi, Latin and Devanagari scripts. They used a
histogram of oriented pixel positions and point-light source-based shadow feature extractors
with k-NN, random forest, MLP, simple logistic and sequential minimal optimization as
classifiers.
2.2 Deep learning based approaches
The recent success of deep learning models, especially CNNs, has revolutionized the artificial
intelligence world and has found applications in different fields like image processing,
computer vision, healthcare and natural language processing [12, 14–17, 50, 88, 96, 97].
The success of deep learning models can be attributed mainly to advancements in
hardware technology, new optimization algorithms and the availability of a large number of data
sources. CNNs have shifted the paradigm from handcrafted features to automated features
learned directly from the input images. CNNs also outperform all other machine learning
techniques for HCR and have become the choice of researchers [4, 27, 32, 62, 77, 79]. However,
the main limitations of CNNs are that they need large amounts of data, substantial computing
resources and long training times if trained from scratch. These limitations are overcome with
the use of image augmentation and transfer learning techniques. CNNs are the state of the art
for HCR research, and a few important studies are discussed below.
Kim and Xie [48] proposed a CNN-based architecture for Hangul HCR and reported results
of 95.96% and 92.92% on SERI95a and PE92 datasets, respectively. Roy et al. [79] employed
a layer-wise training of CNN-based architecture for isolated Bangla compound character
recognition. The proposed model was reported to outperform conventional shallow models,
like SVM, as well as regular CNNs. Chi et al. [54] proposed a cascaded CNN with weighted
average pooling to reduce the number of parameters for Chinese HCR. They reported
an accuracy of 97.1% on the ICDAR-2013 dataset. Manjusha et al. [61] also proposed a CNN-based
architecture utilizing scattering transform-based wavelet filters in the first convolutional layer
for Malayalam HCR. Rao et al. [77] designed a lighter multi-channel residual CNN network
(similar to GoogLeNet [97]) for handwritten digit recognition and reported results on mnist
and SVHN datasets. Kavitha and Srimathi [45] proposed a CNN-based architecture for offline
Tamil HCR on HP Labs India dataset and achieved an accuracy of 97.7%. Keserwani et al. [46]
developed a CNN-based architecture for low-memory GPU for offline Bangla HCR. They
used spatial pyramid pooling and fusion of features from different CNN layers. Guha et al. [32]
proposed DevNet, a CNN-based architecture with five convolutional layers followed by max
pooling, one fully connected layer and one fully connected layer as output, for Devanagari
HCR. Melnyk et al. [63] presented a high-performance CNN-based architecture using global
weighted output average pooling to calculate class activation maps for offline Chinese HCR.
Hijam and Saharia [36] introduced the Meitei Mayek (Manipuri script) handwritten character
dataset. They reported results using handcrafted features, such as HOG and the discrete wavelet
transform (DWT), and image pixel intensities with random forest, k-NN and SVM, and also
using a CNN-based architecture. The CNN model provided benchmark results of 95.56%. Lincy
and Gayathri [56] used a CNN-based architecture with a self-adaptive lion algorithm
for fine-tuning the fully connected layers and weights for Tamil HCR. Inunganbi et al. [40]
proposed a three-channel CNN architecture using gradient direction, gradient magnitude and
greyscale images for Meitei Mayek HCR.
Transfer learning is very successful in working with small datasets, including HCR [27,
32, 72, 92]. For example, [27] used fine-tuned VGG16 in two stages for recognition of
Devanagari and Bangla scripts. Pramanik et al. [72] also used fine-tuning of pre-trained
AlexNet and VGG16 on some Indic scripts. Image augmentation, generative adversarial
networks (GANs) and auto-encoders also help to work with limited datasets [4, 20, 27,
44, 49, 96]. Image augmentation artificially expands datasets by using operations, such as
translation, flip, rotation, shear and zoom. on the input images. This helps in developing a
robust classifier with limited datasets because the model is trained on the modified variants
of the training images, e.g., [20, 27, 96]. GANs are deep neural networks which are used to
generate new, synthetic data, similar to real data. For example, [44] used GANs for Devanagari
handwritten character generation. Auto-encoders are also deep neural networks which are
used to learn compact representations of the data, like PCA and also for generating synthetic
data, e.g., [4] used deep encoder and CNN for recognition of handwritten Urdu characters.
In addition, hybrids of conventional and deep learning approaches have also been developed
for HCR. For example, [59] used LeNet-5 for feature extraction and SVM as a classifier
for recognition of Bangla, Devanagari, Latin, Oriya and Telugu. Manjusha et al. [62] used
scattering CNN with SVM for Malayalam HCR. Sarkhel et al. [85] proposed a multi-column
multi-scale CNN architecture based on a multi-scale deep quad tree-based feature extraction
and used SVM as a classifier. They reported their results with Bangla, Tamil, Telugu, Hindi
and Urdu scripts.
Thus, from this brief literature review, we find that conventional approaches are not suitable
for script independent HCR due to their use of handcrafted features, i.e., manually designed
features based on morphological or structural appearance, which might not always be
available. On the other hand, recent deep learning approaches are suitable due to their
end-to-end learning but have been studied little for multi-script HCR. Moreover,
the existing deep learning techniques are computationally expensive, lack reproducible code,
are developed for specific scripts and may not work on other scripts.
3 HCR-Net
In this section, we discuss the architecture of HCR-Net, the contributions of transfer learning
and image augmentation to HCR-Net, and the two-phase training of the proposed network.
3.1 Architecture of HCR-Net
HCR-Net is a CNN-based end-to-end architecture for offline HCR whose lower and middle
layers act as feature extractors and whose upper layers act as a classifier. HCR-Net partly utilizes the
feature extraction layers of a pre-trained VGG16 network (as shown in Fig. 2) for initializing
some of its lower layers, and trains in two phases. It is based on the hypothesis that the HCR
is a relatively simple task as compared to ImageNet tasks on which most of the pre-trained
deep learning networks are developed, e.g., VGG16 was originally trained on ImageNet with
14 million images and 1000 classes [87]. So, the use of only some of the lower layers of
pre-trained models could be sufficient and could give better results for the HCR, and this is
supported by our empirical results. This is also the reason for using VGG16 in HCR-Net and
not using complex and powerful architectures, such as ResNet, DenseNet and Inception that
have a large number of layers but are not useful for HCR.
Figure 2 presents the architecture of HCR-Net. It takes as input a greyscale image
of 32 × 32 pixels and produces class probabilities as output, and the class with the highest
probability is predicted as a target. The architecture consists of convolutional, pooling, batch
normalization, dropout and dense layers. The lower part of the architecture, enclosed inside
the red square in Fig. 2, is similar to VGG16 architecture up to block4_conv2 layer and acts as
a feature extractor. It has four convolutional blocks: the first has two convolution layers with
64 filters followed by a max-pooling layer, the second block has two convolutional layers
with 128 filters followed by max-pooling, the third block has three convolutional layers with
256 filters followed by max-pooling, and the last convolutional block has two convolutional
layers with 512 filters. All the convolutional layers use a stride of one and padding as ‘same’,
and all the pooling layers use a stride of two and padding as ‘valid’. The convolutional blocks
are followed by one batch-normalization layer and two dense layers each of which has 512
neurons and is followed by batch-normalization + dropout (with 35% rate) layers. The last
layer is a dense layer with neurons equal to the number of output classes. All convolutional
and dense layers use ReLU as an activation function except the output dense layer which uses
softmax; ReLU is chosen because it is fast and helps to avoid the vanishing gradient problem.
Categorical cross-entropy is used as the loss function with Root Mean Square Propagation (RMSprop)1 as
the optimizer to update the weights/parameters of the network. The complexity of a deep CNN,
and hence of HCR-Net, is O(n), where n is the number of pixels in an image.
For an input vector $x$, label vector $p$, predicted probability vector $\hat{p}$ and $C$ classes,
the ReLU, softmax, categorical cross-entropy and RMSprop weight update rules are given below [75].
$$\mathrm{ReLU}(x) = \max(0, x), \quad (1)$$
$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \quad (2)$$
$$l(p, \hat{p}) = -\sum_{i=1}^{C} p_i \log(\hat{p}_i), \quad (3)$$
$$v_t = \beta v_{t-1} + (1-\beta) g_t^2, \qquad W_{t+1} = W_t - \alpha_t V_t^{-1/2} g_t, \quad \text{with } V_t = \mathrm{diag}(v_t + \epsilon), \quad (4)$$
where $v_t$ is the velocity term, $g_t = \nabla f(W_t, \xi_t)$, $\beta \in [0, 1]$, $\alpha_t$ is the learning rate (also called the step size), $W_t$ and $\xi_t$ are the model weights and randomness at step $t$, and $\epsilon$ is a very small number for numerical stability. The different layers of HCR-Net are discussed below.
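As a worked example of the update rule in (4), the following NumPy sketch performs one RMSprop step; the hyperparameter values are illustrative defaults, not the ones tuned for HCR-Net.

```python
# A NumPy sketch of one RMSprop step, following equation (4);
# alpha, beta and eps are illustrative values.
import numpy as np

def rmsprop_step(w, grad, v, alpha=1e-4, beta=0.9, eps=1e-8):
    """v_t = beta*v_{t-1} + (1-beta)*g_t^2;  W_{t+1} = W_t - alpha*g_t/sqrt(v_t+eps)."""
    v = beta * v + (1.0 - beta) * grad ** 2   # running average of squared gradients
    w = w - alpha * grad / np.sqrt(v + eps)   # element-wise preconditioned update
    return w, v
```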
Convolution layers are the heart and soul of a CNN and also give the network its name. A
convolutional layer applies the convolution operation, which is a repeated application of a set of weights,
called a filter, to the input image; it generates a feature map and helps in learning a specific feature
during training. So, the use of multiple filters generates multiple feature maps, each learning
some aspect of the image. Convolutions are very useful for learning spatial relationships in
the input and for reducing parameters by sharing weights. Let there be $l$ input feature maps of
size $m \times m$, a convolutional filter of size $n \times n$ with stride $s$ and padding $p$, and $k$ output feature
maps of size $o \times o$; then the number of parameters and the output size of a convolutional
layer are given below.
$$\#\text{Params} = (n \times n \times l + 1) \times k, \qquad o = \frac{m + 2p - n}{s} + 1. \quad (5)$$
Pooling layers (PLs) are commonly inserted after successive convolutional layers. Their function
is to down-sample the feature maps obtained from the convolutional layers, which reduces
computations and the number of parameters; hence pooling helps to avoid over-fitting and
to achieve local translation invariance. A PL applies filters, smaller than the feature maps, to
patches of the feature maps and summarizes the information. The most commonly used pooling
operations are max pooling and average pooling, which return the most activated feature and
the average feature, respectively. Let $m \times m$ be the input size of one feature map, and $n \times n$ the
filter size with stride $s$ and output size $o \times o$; then the number of parameters and the output size of a
pooling layer are given below.
$$\#\text{Params} = 0, \qquad o = \frac{m - n}{s} + 1. \quad (6)$$
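Equations (5) and (6) can be checked with a few lines of Python; the example values below (a 3×3 filter over a single-channel 32×32 input) are illustrative.

```python
# Quick check of (5) and (6): parameter counts and output sizes for
# convolutional and pooling layers.
def conv_params_out(m, n, l, k, s=1, p=0):
    params = (n * n * l + 1) * k          # +1 for the bias of each filter
    out = (m + 2 * p - n) // s + 1        # equation (5)
    return params, out

def pool_out(m, n, s):
    return (m - n) // s + 1               # equation (6); pooling has no parameters

# Example: 32x32 single-channel input, 3x3 filters, 64 feature maps,
# stride 1 and padding 1 ('same'-style, keeping the 32x32 size).
print(conv_params_out(m=32, n=3, l=1, k=64, s=1, p=1))   # -> (640, 32)
print(pool_out(m=32, n=2, s=2))                          # -> 16
```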
1 http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Batch-normalization layers Batch-normalization is a technique for training very deep neural networks that
normalizes the inputs of a layer coming from previous layers; since this is
done in batches, it is called batch-normalization [41]. It helps to stabilize the training of deep
neural networks and obtain faster convergence. Chen et al. [18] argued that a combination of
batch-normalization and dropout outperforms the baselines and gives better training stability
and faster convergence, so we have used this combination with the dense layers in HCR-Net.
Let $x_i$ be the $i$-th $d$-dimensional input point, say an image, let $x_i^{(k)}$ refer to its $k$-th dimension,
and let $B$ be a mini-batch of $b$ data points; then the mean and variance over $B$ are
$$\mu_B = \frac{1}{b}\sum_{i=1}^{b} x_i, \quad \text{and} \quad \sigma_B^2 = \frac{1}{b}\sum_{i=1}^{b} (x_i - \mu_B)^2. \quad (7)$$
For $d$-dimensional input, each dimension is normalized as
$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\sigma_B^{(k)2} + \epsilon}}, \quad \text{where } k \in [1, d],\ i \in [1, b], \quad (8)$$
$\epsilon$ is an arbitrarily small constant added for numerical stability, and the transform is given
below,
$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)}, \quad (9)$$
where $y_i$ refers to the output corresponding to input $x_i$, and $\gamma^{(k)}$ and $\beta^{(k)}$ are parameters learned
during training. So, the batch-normalization transform is $BN_{\gamma^{(k)},\beta^{(k)}}: x_{i=1,\dots,b}^{(k)} \to y_{i=1,\dots,b}^{(k)}$,
with output shape equal to input shape and number of parameters equal to $2d$.
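A NumPy sketch of the transform in (7)-(9) for a mini-batch X of shape (b, d), with gamma and beta as the learned parameters, is given below; it shows the training-time computation only (the moving averages kept for inference are omitted).

```python
# A NumPy sketch of the batch-normalization transform in (7)-(9).
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                    # per-dimension mean, eq. (7)
    var = X.var(axis=0)                    # per-dimension variance, eq. (7)
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalization, eq. (8)
    return gamma * X_hat + beta            # scale and shift, eq. (9)
```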
Dropout layers randomly and temporarily remove a given percentage of neurons from the
hidden layers. This is one of the regularization techniques in deep learning and helps neural
networks avoid over-fitting because it removes the dependency on any specific neuron.
It is not a computationally expensive regularization technique as it does
not require any specific implementation or new parameters. The output shape of a dropout
layer is equal to its input shape. Mostly, it is used after a dense layer, but it can be used
with other layers.
Dense layers This is the most commonly used layer in neural networks. The dense layer,
also called the fully connected layer, contains a given number of neurons each of which is
connected to all neurons in the previous layer. The output layer is mostly a dense layer which
has neurons equal to the number of classes, representing the different class probabilities. Let $n_i$
and $n_o$ be the numbers of input and output neurons of a dense layer; then the number of
parameters (#Params) and the output size of a dense layer are given below.
$$\#\text{Params} = n_i \times n_o + n_o, \qquad \text{Output size} = n_o. \quad (10)$$
Table 1 presents the different layers of HCR-Net along with their types, output shapes and
numbers of parameters. In the first phase, the layers up to block4_conv2 are initialized from pre-trained
VGG16 layers and are frozen, i.e., their parameters/weights are not updated, and the rest of the
layers are trained; so the number of trainable parameters in the first phase is 4,465,674 out of
9,744,202 total parameters. Moreover, in the second phase, all layer weights are updated,
so the number of trainable parameters is 9,741,130 while 3,072 are non-trainable. The non-
trainable parameters belong to the batch-normalization layer because for each dimension
batch-normalization maintains four parameters to keep track of the distributions, out of
which two parameters are non-trainable (i.e., moving_mean and moving_variance).
Table 1 Summary of different layers and parameters of the proposed HCR-Net for an example with 10 classes
Layer (type) Output shape #Params
3.2 Transfer learning
In deep learning, transfer learning is a technique to transfer knowledge learned on one task
to another related task, for example, the use of a deep network pre-trained on ImageNet, like VGG16,
for HCR research. Thus, transfer learning enables the reuse
of pre-trained models on a new but related problem. Transfer learning is very useful for faster
training, mostly better results and learning on small datasets, which would otherwise need large
amounts of data for deep learning models [27]. For transfer learning, either pre-trained models
can be used, or a source model can be trained first where a large amount of data is available
and then the source model can be reused on the target problem. In both cases, the entire
model can be reused or part of it. As discussed in Section 2.2, transfer learning has already
been studied in HCR and has helped to get better performance, e.g., [27, 32, 72].
As shown in Fig. 2, HCR-Net architecture partly utilizes VGG16 up to block4_conv2
layer and initializes those layers with the pre-trained VGG16. Thus, our use of transfer
learning is, to the best of our knowledge, novel in HCR research as it reuses a pre-trained model
only partly, unlike existing research which uses entire pre-trained models [27, 72]. Moreover, this
approach enables the use of transfer learning without resorting to complex models. Details on the
training of HCR-Net are provided in Section 3.4.
3.3 Image augmentation
Image augmentation is a data augmentation technique which helps to artificially expand the
training dataset by creating modified versions of the training images. For example, an input
image can be modified by rotation, shear, translation, flip (vertical or horizontal), zoom and
hybrid combinations of these. This makes it possible to apply deep learning techniques to problems
with limited datasets which otherwise might not provide enough data for training.
Image augmentation also helps a model to generalize well on the test dataset because of its
training on different variants of training images. Some HCR studies have already used image
augmentation and reported improvements in the performance of the model, e.g., [20, 27, 96].
The application of image augmentation is crucial in improving model performance and
generalization capabilities. Image augmentation encompasses a diverse array of transforma-
tive operations essential for enriching the training dataset. For example, rotation introduces
variations in orientation by rotating images within a certain angle range (-45 to 45 degrees),
offering distinct perspectives for model learning. Shearing involves selectively shifting parts
of an image, such as slanting or skewing characters, enhancing the model’s adaptability to
different character structures. Translation displaces images horizontally or vertically, sim-
ulating changes in position and introducing spatial variations. Flipping, whether vertically
or horizontally, mirrors the image and is approached cautiously in HCR to preserve charac-
ter handedness. Zooming adjusts the image scale, providing the model with varying levels
of detail, which is especially beneficial for capturing intricate character features. Hybrid
combinations allow the simultaneous application of multiple operations, contributing to the
creation of a more diverse set of augmented images. Figure 3 presents an example of image
augmentation on a character image and applies rotation, shear, translation, zoom and hybrid
operations. It is to be noted that image augmentation should be applied carefully to HCR as
it is different from other image classification tasks because it can distort the structure of a
character and change its meaning, e.g., a horizontal flip or a large translation. For the specific values
of the different image augmentation operations, please refer to Section 4.2.
Fig. 3 An example of image augmentation using rotation, translation, shear, zoom and hybrid operations
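As a sketch, the HCR-friendly augmentation described above could be configured with Keras' ImageDataGenerator using the ranges given in Section 4.2; the variable names (train_images, train_labels, model) are placeholders, not identifiers from the released code.

```python
# A sketch of HCR-friendly augmentation with Keras' ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,        # small rotations only
    width_shift_range=0.05,   # horizontal translation
    height_shift_range=0.05,  # vertical translation
    shear_range=0.5,          # shear (in degrees)
    zoom_range=0.05,          # mild zoom in/out
    horizontal_flip=False,    # flips would change a character's identity
    vertical_flip=False,
    rescale=1.0 / 255.0)      # scale pixel intensities to [0, 1]

# train_images: (N, 32, 32, 1) array; train_labels: one-hot labels (assumed names)
# model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=10)
```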
3.4 Two-phase training
HCR-Net trains in two phases due to transfer learning. In the first phase, the parameters initialized
from pre-trained VGG16 weights are frozen and the rest of the parameters are updated.
The model trains faster in the first phase than in the second phase and is quite powerful:
most of the time it achieves up to 99% of the final accuracy in just the first epoch and
converges within a few epochs. The model obtained after the first phase is sufficient for most
datasets, such as those for handwritten digit recognition.
In the second training phase of HCR-Net, all the parameters of the network are updated.
However, for the first few epochs the learning rate is kept very small to avoid abrupt changes to
the parameters and losing transferred information. Then, the learning rate is increased, as discussed
in Section 4.2. The second phase is useful for complex datasets, but it is computationally expensive
and requires more epochs to converge, with only marginal improvements.
Table 2 presents the convergence analysis for the two phases of HCR-Net. As is clear from
the table, without image augmentation, the first phase can reach up to 99% (and more in some
cases) of the final accuracy in just the first epoch, and there is only a slight improvement
in the second phase. With image augmentation, the test accuracy in the first epoch of first-phase
training is lower than without augmentation because the training images have
more diversity, and hence there is more to learn from the modified variants of the images. Further,
with augmentation the second phase shows relatively more improvement over the first phase than
without augmentation.
Figure 4 presents the convergence analysis of the two phases, averaged over five runs, of HCR-Net
using the RMSprop optimizer on the Gurmukhi_1.1 dataset, where the first phase runs for 30 epochs
and the second phase runs for 20 epochs. The figure shows how the accuracy improves and
the loss/error reduces with each epoch (i.e., pass through the data), where solid lines represent
average results and shaded regions around the lines represent the standard deviation of the
performance. As is clear from the figure, HCR-Net shows small deviations in performance
over multiple runs. HCR-Net also converges very quickly, in a few epochs, and the slight
performance improvement obtained in the second phase is difficult to detect with the naked eye.
Moreover, the small gap between test and train performance shows a nice convergence of the
model without over-fitting, obtained with the use of dropout.
Fig. 4 Convergence analysis of HCR-Net on Gurmukhi_1.1 dataset, where the first phase takes the first 30
epochs and the second phase uses the remaining 20 epochs
4 Experimental results
This section presents statistics about datasets used in the experiments, provides experimental
settings, compares HCR-Net against different benchmark studies and some of the state-of-
the-art transfer learning techniques, and also discusses error analysis.
4.1 Datasets
The experiments use publicly available datasets from Bangla, Punjabi, Hindi, English,
Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic
languages belonging to Bangla, Gurmukhi, Devanagari, Latin, Urdu, Farsi, Tibetan, Kannada,
Malayalam, Telugu and Arabic scripts. The statistics of 41 datasets used in the experiments
are given in Table 3. The datasets contain two online handwriting datasets (shown with ‘O’
in the name), namely, IAPR TC-11 and UJIPenChars. The online datasets are first converted
to offline images and then the proposed model is applied.
A few samples of some scripts are presented in Table 4, where the samples are taken from the
UCI Devanagari, CMATERdb 3.1.1 & CMATERdb 3.1.2 (Bangla) and UJIPenChars (Latin)
datasets. This highlights structural differences between scripts, e.g., a horizontal line
above the character is used in the Devanagari and Bangla scripts but not in the Latin script. The
table also shows some noise (for Bangla) in the recording of different characters, which makes
HCR a challenging task.
4.2 Experimental settings
The different hyperparameters of HCR-Net, e.g., learning rate, mini-batch size, number of
layers, neurons and optimizer, are selected by trial over a range of values. Each experiment
uses a mini-batch size of 32 and RMSprop as the optimizer. Although RMSprop and Adam
(Adaptive Moment Estimation), which are popular in HCR research, show similar test accuracy,
RMSprop is selected because it takes less time for training. Image augmentation uses
a rotation of 10 degrees, horizontal and vertical shifts of 0.05, a shear of 0.5 and a zoom of 0.05.
The experiments use a staircase learning rate to better control the convergence of the learning
algorithm.
Table 3 Statistics for different HCR datasets
Dataset Writers Samples per class Classes Training samples Testing samples Total samples
Gurmukhi
HWRGurmukhi_1.1 [43] 1 100 35 2450 1050 3500
HWRGurmukhi_1.2 [43] 10 10 35 2450 1050 3500
HWRGurmukhi_1.3 [43] 100 1 35 2450 1050 3500
HWRGurmukhi_2.1 [43] 1 100 56 3920 1680 5600
HWRGurmukhi_2.2 [43] 10 10 56 3920 1680 5600
HWRGurmukhi_2.3 [43] 100 1 56 3920 1680 5600
HWRGurmukhi_3.1 [43] 200 1 35 4900 2100 7000
Devanagari
Nepali (combined) [67] 40 – 58 – – 12,912
Nepali numeral [67] 40 – 10 – – 2880
Nepali vowels [67] 40 – 12 – – 2652
Nepali consonants [67] 40 – 36 – – 7380
Marathi numerals [27] 100 100 10 800 200 1000
Marathi characters [27] 100 100 48 3840 960 4800
Marathi combined [27] 100 100 58 4640 1160 5800
UCI Devanagari numerals [1] – 2000 10 17,000 3,000 20,000
UCI Devanagari characters [1] – 2000 36 61,200 10,800 72,000
UCI Devanagari total [1] – 2000 46 78,200 13,800 92,000
CMATERdb_3.2.1 Devanagari numeral [24, 25] – – 10 2400 600 3000
IAPR TC-11 (O) [83] 25 – 36 – – 1800
Bangla
CMATERdb_3.1.1 (Bangla numeral) [25] – – 10 4089 1019 5108
CMATERdb_3.1.2 (Bangla character) [23] – 300 50 12,000 3,000 15,000
Latin
UJIPenchars (O) [57] 11 – 35 1240 124 1364
mnist [53] – 7,000 10 60,000 10,000 70,000
ARDIS-II [51] – – 10 6602 1000 7602
ARDIS-III [51] – – 10 6600 1000 7600
ARDIS-IV [51] – – 10 6600 1000 7600
Malayalam*
Amrita_MalCharDb [62] 77 – 85 17,236 6,360 29,302
Malayalam_DB [61] 77 – 85 22,942 6,360 29,302
Telugu
CMATERdb 3.4.1 (Telugu numeral) [25] – – 10 2400 600 3000
Kannada
Kannada-mnist [70] – – 10 60,000 10,000 70,000
Dig-mnist [70] – – 10 – 10,240 10,240
Urdu
Urdu [4] – – 10 6,606 1,414 8020
Farsi
Farsi [47] – – 10 60,000 20,000 80,000
Tibetan
Tibetan-mnist – – 10 14,214 3,554 17,768
Arabic
MADBase – – 10 60,000 10,000 70,000
* Both datasets are the same but differ in the splitting
Table 4 Sample handwritten characters from the Devanagari, Bangla and Latin scripts (character images omitted)
The first phase starts with a high learning rate of 1e-4 for faster convergence during the
first five epochs, and then the learning rate is decreased to 5e-5 for the rest of the epochs.
In the second phase, the learning rate is 1e-7 for the first five epochs to avoid abrupt changes in
the weights, is then increased to 5e-6 for the remaining epochs except the last five, and is
decreased to 1e-6 in the last five epochs. The number of epochs required to
train HCR-Net depends on the dataset; training continues until the test accuracy becomes stable.
Generally, without image augmentation, the first phase is run for 30 epochs and the second
phase for at least 20 epochs. With image augmentation, the first phase is run for 10
epochs and the second phase for at least 50 epochs; this is because training with image
augmentation learns from more diverse images than without augmentation and takes longer to
converge. All the experiments are implemented using the Keras library2, averaged over five runs
and executed on a MacBook Pro (RAM 16 GB, Core-i7).
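A sketch of the staircase schedules described above, usable with Keras' LearningRateScheduler callback, is shown below; epoch numbering within each phase is assumed to restart at 0.

```python
# A sketch of the staircase learning-rate schedules for the two phases.
import tensorflow as tf

def phase1_lr(epoch, lr):
    return 1e-4 if epoch < 5 else 5e-5    # high start, then decreased

def make_phase2_lr(total_epochs):
    def schedule(epoch, lr):
        if epoch < 5:
            return 1e-7                   # avoid abrupt weight changes
        if epoch >= total_epochs - 5:
            return 1e-6                   # anneal in the last five epochs
        return 5e-6
    return schedule

callback1 = tf.keras.callbacks.LearningRateScheduler(phase1_lr)
callback2 = tf.keras.callbacks.LearningRateScheduler(make_phase2_lr(20))
# model.fit(..., epochs=30, callbacks=[callback1])   # phase 1
# model.fit(..., epochs=20, callbacks=[callback2])   # phase 2
```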
The datasets are partitioned into an 80:20 ratio for train and test, respectively, wherever test
sets are not separately available. The experimental results are reported on the test dataset
using accuracy, precision, recall and F1-score. However, as is clear from the experiments with
different datasets, all the metrics have almost similar values due to the class-balanced datasets,
so accuracy is a sufficient metric for HCR. Some authors also present test error/cost as a
metric; since the cost depends on the objective function/model used for recognition, cost
is not a good metric and is not used. Moreover, test accuracy is reported at the last epoch of
the network training, unlike some authors who report the best accuracy, which is not the correct
way. This is because, during the training of a model, there may be a spike in test accuracy,
i.e., the optimizer may enter the best solution region for the given test set, but that might not
give the best generalization at that epoch because the decision to select the model is then based on
the test dataset. So, we report both the test accuracy at the last epoch and the best
test accuracy during training, but only the test accuracy at the last epoch is used for comparison
with the existing literature. We argue that either the results should be reported at the last
epoch of training, or they should be reported on a separate dataset not used to select the
model, e.g., by dividing the dataset into train, validation and test sets, where train and validation
may be used for training and the final performance is reported on the test set. Similarly,
the training accuracy is also not a good metric for reporting HCR results because it does
not reflect the generalization of the model, and the model may be overfitting the training
dataset.
2 https://keras.io
4.3 Preprocessing
This paper does not use extensive preprocessing but only simple preprocessing techniques,
as a generic architecture is developed for different scripts. HCR-Net uses greyscale character
images of size 32×32 as inputs, and if image augmentation is turned on, then image augmentation
with rotation (10 degrees), translation (0.05), shear (0.05), zoom (0.05) and hybrid combinations
is applied on the fly during training. Wherever possible, the character images are
plotted against a black background to simplify the computations. All image pixel intensities
are scaled to the range 0 to 1.
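A minimal sketch of this preprocessing is given below; OpenCV is an assumed choice here (any image library would do), and the function name is illustrative.

```python
# A minimal sketch of the preprocessing described above: greyscale,
# resize to 32x32, and scale pixel intensities to [0, 1].
import cv2
import numpy as np

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # greyscale character image
    img = cv2.resize(img, (32, 32))               # fixed 32x32 input size
    return img.astype(np.float32) / 255.0         # intensities in [0, 1]
```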
4.4 Results and comparisons
This subsection provides experimental results and comparisons with the literature. The performance
of HCR-Net is reported on the test dataset using accuracy, precision, recall and
F1-score, without and with augmentation respectively, separated using '|'. We also
present the best accuracy in addition to the accuracy at the last epoch, just to show that the best
test accuracy during training is almost always higher than the accuracy at the last epoch. However,
we compare only the test accuracy at the last epoch with the existing results. The following
subsections discuss the results for different scripts.
Table 5 presents the performance of HCR-Net on the Gurmukhi script. All the performance
metrics for each dataset show similar results because the datasets are class-balanced. HCR-Net
performs very well even though the datasets are quite small, because of the power
of transfer learning from VGG16. Moreover, image augmentation shows improvement only
on HWRGurmukhi_1.1; for the rest of the datasets it leads to a reduction in performance.
This is because the datasets were collected in controlled environments and have less noise
than real-world handwriting, so expanding the datasets with image augmentation techniques
does not improve recognition.
Table 6 presents a comparative study of HCR-Net against the state-of-the-art results. Since
these are recently released public datasets, there is not much literature to compare with. From
the table, it is clear that HCR-Net outperforms the existing results and provides new benchmarks
on all seven datasets, with up to seven percent improvement in test accuracy. This is
because [43] used traditional machine learning techniques with handcrafted features.
Comparing 1.1 with 1.3, or 2.1 with 2.3, i.e., cases with equal numbers of
samples and classes but 1 and 100 writers, respectively, shows a decrease in test accuracy as
the number of writers increases. This underlines that different people have different
writing styles, which impacts performance.
Table 7 presents the performance of HCR-Net on the Nepali, Hindi and Marathi languages, which
share the Devanagari script. All the performance metrics for each dataset show similar results
because the datasets are class-balanced. Image augmentation shows slight improvements on most
of the datasets. It is to be noted that datasets with low performance, e.g., Marathi character
and IAPR TC-11 (O), show better improvements with image augmentation than others,
because the others, like UCI Devanagari, have already reached a high level of performance and
have a large number of samples.
Table 8 presents a comparative study of HCR-Net against the state-of-the-art results. From the
table, it is clear that HCR-Net performs quite well and provides new benchmarks on Marathi
numerals, UCI Devanagari (numerals and characters) and Nepali (numerals, vowels, consonants
and combined). For the Nepali combined dataset, there is no reported result, so there is no
literature to compare with. The IAPR TC-11 dataset is a small online handwriting dataset which
was converted to image form, and HCR-Net is able to beat the baseline model for it.
This demonstrates HCR-Net's capability to recognize online handwriting,
and it can perform better if the datasets are larger. HCR-Net shows the largest improvements
on the Nepali vowel dataset because the baseline uses handcrafted features with shallow
learning, unlike HCR-Net, which is powered by deep learning, image augmentation and
transfer learning. Additionally, the experiments achieved a perfect
score of 100% test accuracy on the UCI numeral dataset in four out of five runs, averaging
99.99%. This is due to the large size of the UCI numeral dataset.
Table 7 Performance of HCR-Net on Devanagari script without|with augmentation (reporting precision, recall, F1-score, and accuracy at the last epoch and best)
Table 8 Recognition rates on Devanagari script datasets (columns: Dataset, Reference, Methodology, Accuracy)
Table 9 presents the performance of HCR-Net on the Swedish and English languages, which share
the Latin script. All the performance metrics for each dataset show similar results because the
datasets are class-balanced. Image augmentation shows slight improvements on all of the
datasets except ARDIS-III, where there is a very slight drop in performance. This is because the
datasets are large and already show good performance, so there is little scope for improvement.
The performance on UJIPenchars (O), an online handwriting dataset, is low compared to the other
datasets, and it shows the highest improvement with image augmentation. This is because it
has only 1240 training points for 35 classes, which is much smaller than the rest of the
datasets.
Table 10 presents a comparative study of HCR-Net against the state-of-the-art results. From
the table, it is clear that HCR-Net performs quite well and provides new benchmarks on the
ARDIS datasets (II, III and IV). mnist is a widely used benchmark dataset in computer vision
with extensive literature; here we present some representative studies only. Despite
being a generic architecture, HCR-Net shows good performance on mnist with a very low
error. The ARDIS dataset is available in three different formats with different preprocessing and
noise levels. HCR-Net outperforms existing results on all of them, including results given in the
literature without naming the exact variant of ARDIS. Our proposed method shows large improvements
compared with the existing literature for ARDIS. Moreover, there is no result reported in
the literature for the ARDIS-IV dataset, so there is nothing to compare with. It is further noted
that HCR-Net performs consistently across the different variants of the ARDIS dataset and
shows near-perfect performance despite their different noise levels. This demonstrates the robust
performance of the proposed script independent network. HCR-Net does not perform well on
the UJIPenchars (O) dataset because it is a very small online handwriting dataset, but the
performance is still comparable to the existing literature.
Table 11 presents the performance of HCR-Net on the Telugu, Malayalam and Kannada scripts'
datasets. All the performance metrics for each dataset show similar results. Image augmentation
shows slight improvements on most of the datasets except CMATERdb 3.4.1, where there
is a very slight drop in performance. Kannada-mnist comes with two test sets, one of which
is an out-of-distribution noisy test set called Dig-mnist. Interestingly, image augmentation
shows a sharp improvement of around three percent in test accuracy for Dig-mnist; this is
because image augmentation produces modified variants of images which are not present in
the training set and helps in better generalization, which is especially helpful here because
Dig-mnist is an out-of-distribution noisy dataset. Moreover, best accuracy values are missing
for the Kannada-mnist test set because it was used for the final evaluation, while Dig-mnist was
used for evaluation during the training.
Table 12 presents a comparative study of HCR-Net against the state-of-the-art results. From
the table, it is clear that HCR-Net performs quite well and provides new benchmarks on the
Amrita_MalCharDb, Malayalam_DB and Kannada-mnist (Dig-mnist) datasets. CMATERdb
3.4.1 has a few versions, and it appears that [24] used a different version, as the dataset statistics
differ from the one used in our work. We obtained a huge improvement of over 11% on
Dig-mnist, an out-of-distribution noisy dataset collected from a practical situation,
because of image augmentation (as reported in Table 11) and transfer learning. This demonstrates
the robustness of the proposed HCR-Net and its suitability for practical real-world
applications where data are noisy and have different styles.
Bangla is one of the most widely studied Indian scripts, and it has several public datasets which
further advance research on this script (refer to [90] for a survey on Bangla handwritten
numeral recognition). Table 13 presents the performance of HCR-Net on Bangla script
datasets. All the performance metrics for each dataset show similar results. Image augmentation
shows slight improvements on most of the datasets, except 3.1.1 and 3.1.2, where there
is a slight drop in performance, while 3.1.3.3 shows large improvements. This is because the
3.1.3.3 dataset has low performance compared to the others, and hence there is more scope
for improvement.
Table 14 Recognition rates on Bangla scripts' datasets
Dataset Reference Methodology Accuracy
CMATERdb 3.1.1 (Bangla numeral) [25] SVM classifier using GA for region subsampling of local features 97.70
[24] Modular Principal Component Analysis and quad-tree based hierarchically derived longest-run features + SVM 98.55
[78] Axiomatic Fuzzy Set theory to calculate features' combined class separability + quad-tree based longest-run feature set and gradient-based directional feature set + SVM 97.45
[85] a multi-column multi-scale CNN architecture + SVM 100.00*
[46] spatial pyramid pooling and fusion of features from different layers of CNN 98.80
[27] fine-tuned VGG16 97.45
[30] Histogram of Oriented Pixel Positions and Point-Light Source-based Shadow with random forest 98.50
HCR-Net our work 98.84
CMATERdb 3.1.2 (Bangla basic character) [9] local chain code histograms + SVM 92.14
[85] a multi-column multi-scale CNN architecture + SVM 100.00
[20] CNN based architecture 93.37
[46] spatial pyramid pooling and fusion of features from different layers of CNN 98.56
[27] fine-tuned VGG16 95.83
HCR-Net our work 97.42
CMATERdb 3.1.3.3 (compound character) [26] Genetic algorithm based two-pass approach + SVM 87.50
[22] A convex hull and quad tree-based features + SVM 79.35
[79] layer-wise trained CNN 90.33
[85] a multi-column multi-scale CNN architecture + SVM 98.12*
[71] – 88.74
[46] spatial pyramid pooling and fusion of features from different layers of CNN 95.70
HCR-Net our work 92.19
ISI Bangla [42] pre-trained LeNet 97.05
[8] multilayer perceptron classifiers using wavelet-based multi-resolution features 98.20
[86] Multi-objective (recognition accuracy and recognition cost per image) optimization to find the informative regions of character image + SVM 98.23
Table 14 presents a comparative study of HCR-Net against the state-of-the-art results. From
the table, it is clear that HCR-Net performs quite well and provides a few new benchmarks
on Banglalekha-isolated (numerals, characters and combined). It is observed that HCR-Net
lags for complex problems with a large number of classes, like CMATERdb 3.1.3.3, which
has 171 classes; it also trains slowly there and takes a large number of epochs, 180 in this case.
However, HCR-Net performs second-best on CMATERdb 3.1.3.3, while the rest of the techniques
show a large performance gap. It is also noted that, among the Bangla datasets, CMATERdb 3.1.3.3
obtains the lowest performance due to its large number of classes. Moreover, it is observed that the
multi-column multi-scale CNN architecture proposed by [85] performs exceptionally well
for the Bangla script.
Table 15 presents the performance of HCR-Net on the Farsi, Urdu, Tibetan and Arabic scripts'
datasets. All the performance metrics for each dataset show similar results. Image augmentation
does not show consistent improvements on most of the datasets; there are only slight changes
in performance, which could be due to the randomness associated with the experiments
and the fact that these datasets already show high performance without image augmentation,
leaving little space for further improvements.
Table 15 Performance of HCR-Net on Farsi, Urdu, Tibetan and Arabic scripts without|with augmentation (reporting precision, recall, F1-score, and accuracy at the last epoch and best)
Table 16 Recognition rates on Farsi, Urdu, Tibetan and Arabic scripts' datasets (columns: Dataset, Reference, Methodology, Accuracy)
Table 16 presents a comparative study of HCR-Net against the state-of-the-art results. From
the table, it is clear that HCR-Net performs quite well and provides new benchmarks on the
Urdu, Tibetan-mnist and MADBase datasets. For Farsi, a DenseNet-based model performs
best, followed by HCR-Net with a very small margin. It is also noted that, among these
scripts, Urdu has the smallest dataset, and HCR-Net performs best on it with more than a one and
a half percent improvement over the baselines, leading to near-perfect performance.
Thus, from these experiments, we conclude that HCR-Net is a script independent architecture
which can handle different scripts. It performs very well and establishes several new
benchmarks. It is observed that HCR-Net shows very high performance on datasets with a
smaller number of classes, like numerals. The image augmentation component of HCR-Net
gives large performance improvements when the test dataset is out-of-distribution and noisy,
e.g., Dig-mnist (Table 11) and ARDIS (Table 9). Transfer learning helps HCR-Net to achieve
faster convergence, robust results and better generalization (Fig. 4 and Table 2).
4.5 Comparison with transfer learning techniques
Here, we compare HCR-Net against some of the popularly used state-of-the-art transfer
learning techniques for HCR: VGG16 [87], Xception [19], ResNet50 [35], InceptionV3
[98] and DenseNet121 [37]. Figure 5 presents the comparative study in terms of test accuracy
and the number of trainable parameters, which measures computational cost, using
the UCI Devanagari numeral dataset. The experimental setup for HCR-Net and the other transfer
learning techniques is the same. All methods train in two phases, where the first phase
trains only the classifier layers while the second phase trains the entire network. From the
figure, it is clear that HCR-Net significantly (p-value=0.00099 using Student’s t-test) outper-
forms the rest of the transfer learning techniques, and VGG16 is the second-best technique.
ResNet50 is the worst performer in terms of test accuracy. However, all the transfer learning
approaches show impressive performance and that is why they are widely used in the HCR
research. It is also observed that only HCR-Net shows fast convergence and could achieve
high performance immediately after the first epoch (please refer to Section 3.4 for HCR-Net's
convergence; the convergence of the rest of the techniques is not shown here). In
terms of computational efficiency, DenseNet121 and HCR-Net are the first and second best
techniques, respectively, and have a large gap from the rest of the techniques. HCR-Net
reduces the number of trainable parameters of the corresponding VGG16 by 34% and thus
is a computationally efficient technique.
Fig. 5 Performance comparison of HCR-Net against state-of-the-art transfer learning techniques for HCR
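As a sketch of how such a significance result could be computed, a two-sample Student's t-test can be run over per-run test accuracies with SciPy; the five-run values below are hypothetical placeholders, not the paper's numbers.

```python
# A sketch of a two-sample Student's t-test over per-run accuracies.
from scipy import stats

hcrnet_acc   = [0.997, 0.996, 0.997, 0.998, 0.997]  # hypothetical values
baseline_acc = [0.991, 0.990, 0.992, 0.991, 0.990]  # hypothetical values
t_stat, p_value = stats.ttest_ind(hcrnet_acc, baseline_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```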
4.6 Error analysis
In this subsection, the causes of misclassifications are analysed using examples from the
UCI Devanagari numeral dataset and the IAPR-11 Devanagari dataset, on which HCR-Net
performs at its best and worst, respectively.
Sub-figure 6a presents a confusion matrix which shows only two misclassifications, where
'digit_1' and 'digit_7' are classified as 'digit_0'. For 'digit_1', this is due to deviations in
the structure introduced by the writer, i.e., due to bad handwriting; for 'digit_7', this appears to be
noise in the recording process, as some part of the character seems to be cropped, and it is
impossible even for humans to identify the class of the character.
Sub-figures 6e and 6f study misclassifications on the IAPR-11 dataset, which is a small
dataset. Here, one reason for the misclassifications is the similarity in the structure
of the characters. As is clear from the figures, the character 'na' is misclassified as 'ta'
because they look quite similar; in fact, this is the major cause of misclassifications. Thus,
as observed in the literature [27], bad handwriting, errors/noise in the recording process and
similarity in the structure of characters cause misclassifications.
5 Conclusion
HCR is a widely studied, challenging learning problem in pattern recognition which has
a variety of applications, such as the automated processing of documents. However, there
is a lack of research on script independent HCR. This is mainly because of the focus of
conventional research on handcrafted feature extraction techniques, the diversity of different
scripts and the unavailability of existing datasets and code repositories. Moreover, deep
learning, especially CNN, provides a great opportunity to develop script independent models,
however, deep learning research in handwriting is still in its infancy and models developed
for HCR are focused on specific scripts.
Fig. 6 Misclassification analysis: (a) confusion matrix of the UCI Devanagari numeral dataset, (b) actual
'digit_0' in the UCI Devanagari numeral dataset, (c) and (d) 'digit_7' and 'digit_9', respectively, misclassified
as 'digit_0' on the UCI Devanagari numeral dataset, (e) and (f) actual 'ta' and 'na', where 'na' is
misclassified as 'ta' on the IAPR-11 (O) Devanagari dataset
This paper proposed the first script independent deep learning architecture for HCR,
called HCR-Net, and opened a new research direction for HCR: the development of script
independent techniques. HCR-Net uses a novel transfer learning approach which partly
utilizes a pre-trained VGG16 network to initialize some parts of HCR-Net, unlike the
existing techniques which utilize the entire feature extraction layers. The proposed transfer
learning technique is based on the hypothesis that HCR is a simpler task than those for
which pre-trained networks are developed, so HCR does not need all the feature extraction
layers of the pre-trained networks. Powered by transfer learning and image augmentation,
HCR-Net is a computationally efficient technique that trains faster, generalizes better
across several scripts, and, unlike standard deep learning techniques which need large
amounts of data, can learn from small datasets. This work is reproducible, and the complete
code is publicly released at https://github.com/jmdvinodjmd/HCR-Net.
The empirical results demonstrated the efficacy of HCR-Net on 40 publicly available datasets
of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam,
Telugu, Marathi, Nepali and Arabic languages. These datasets do not contain any sensitive
information about the writers, mitigating privacy concerns. HCR-Net established 26 new
benchmark results while performing close to the best results in the remaining cases, and showed
performance improvements of up to 11% over the existing results, which establishes HCR-Net
as a script independent architecture for HCR. HCR-Net also significantly outperformed
state-of-the-art transfer learning techniques for HCR and reduced the number of trainable
parameters of the corresponding VGG16 by 34%. In addition, among the transfer learning
techniques, HCR-Net has the fastest convergence rate, achieving up to 99% of its final
performance in the very first epoch. From the misclassification analysis, it is observed that
errors occur mainly due to noisy datasets, bad handwriting and similarity between different
characters. We acknowledge that while most of the datasets were recorded under controlled
writing conditions, we observed the largest performance improvement on Kannada-MNIST,
which was collected in practical real-world situations. This indicates that HCR-Net is capable
of handling dataset biases and adapting to diverse handwriting styles, including those
encountered in real-world scenarios.
HCR-Net is a promising deep learning technique for HCR, but it has room for improvement,
especially in languages with large character sets. In the future, we plan to specialize and
extend HCR-Net for these languages, as well as explore hierarchical versions to address
misclassifications caused by character similarity. Additionally, as the field moves towards
data-centric AI, we believe there are opportunities to improve HCR by developing specialized
pre-processing pipelines and leveraging advanced data-centric methodologies. Aligned with
our overarching research thrust towards script independence, we envision integrating
HCR-Net into a comprehensive handwriting recognition system that treats handwriting
recognition as a pivotal component and handles handwriting independently of scripts and
languages. Such an integration would contribute to the broader landscape of handwriting
recognition, emphasizing adaptability across diverse linguistic and script domains.
Data Availability All the datasets used in the paper are publicly available.
Competing interests The authors report there are no competing interests to declare.
References
1. Acharya S, Pant AK, Gyawali PK (2015) Deep learning based large scale handwritten Devanagari charac-
ter recognition. In: 2015 9th International conference on software, knowledge, information management
and applications (SKIMA). IEEE, pp 1–6
2. Akhlaghi M, Ghods V (2020) Farsi handwritten phone number recognition using deep learning. SN Appl
Sci 2(3):1–10
3. Al-wajih E, Ghazali R (2023) Threshold center-symmetric local binary convolutional neural networks
for Bilingual handwritten digit recognition. Knowl-Based Syst 259:110079
4. Ali H, Ullah A, Iqbal T, Khattak S (2020) Pioneer dataset and automatic recognition of Urdu handwritten
characters using a deep autoencoder and convolutional neural network. SN Appl Sci 2(2):1–12
5. Alkhawaldeh RS (2021) Arabic (Indian) digit handwritten recognition using recurrent transfer deep
architecture. Soft Comput 25(4):3131–3141
6. Basu S, Das N, Sarkar R, Kundu M, Nasipuri M, Basu DK (2010) A novel framework for automatic
sorting of postal documents with multi-script address blocks. Pattern Recognit 43(10):3507–3521
7. Bhattacharya U, Chaudhuri BB (2005) Databases for research on recognition of handwritten characters
of Indian scripts. In: Eighth international conference on document analysis and recognition (ICDAR’05),
vol 2. pp 789–793
8. Bhattacharya U, Chaudhuri BB (2009) Handwritten numeral databases of Indian scripts and multistage
recognition of mixed numerals. IEEE Trans Pattern Anal Mach Intell 31(3):444–457
9. Bhattacharya U, Shridhar M, Parui SK (2006) On recognition of handwritten Bangla characters. In:
Computer vision, graphics and image processing. Springer, pp 817–828
10. Biswas M, Islam R, Shom GK, Shopon M, Mohammed N, Momen S, Abedin A (2017) Banglalekha-
isolated: a multi-purpose comprehensive dataset of handwritten Bangla isolated characters. Data in Brief
12:103–107
11. Bonyani M, Jahangard S, Daneshmand M (2021) Persian handwritten digit, character and word
recognition using deep learning. Int J Doc Anal Recognit (IJDAR) 1–11
12. Chauhan VK, Molaei S, Tania MH, Thakur A, Zhu T, Clifton DA (2023) Adversarial de-confounding
in individualised treatment effects estimation. In: International conference on artificial intelligence and
statistics. PMLR, vol 206, pp 837–849
13. Chauhan VK, Dahiya K, Sharma A (2019) Problem formulations and solvers in linear SVM: a review.
Artif Intell Rev 52(2):803–855
14. Chauhan VK, Thakur A, O’Donoghue O, Clifton DA (2022) COPER: continuous patient state perceiver.
In: 2022 IEEE-EMBS international conference on biomedical and health informatics (BHI). IEEE,
pp 1–4
15. Chauhan VK, Thakur A, O’Donoghue O, Rohanian O, Clifton DA (2022) Continuous patient state
attention models. medRxiv. https://doi.org/10.1101/2022.12.23.22283908
16. Chauhan VK, Zhou J, Lu P, Molaei S, Clifton DA (2023) A brief review of hypernetworks in deep
learning. arXiv:2306.06955
17. Chauhan VK, Zhou J, Molaei S, Ghosheh G, Clifton DA (2023) Dynamic inter-treatment information
sharing for individualized treatment effects estimation. arXiv:2305.15984
18. Chen G, Chen P, Shi Y, Hsieh C-Y, Liao B, Zhang S (2019) Rethinking the usage of batch normalization
and dropout in the training of deep neural networks. arXiv:1905.05928
19. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp 1251–1258
20. Chowdhury RR, Hossain MS, ul Islam R, Andersson K, Hossain S (2019) Bangla handwritten character
recognition using convolutional neural network with data augmentation. In: 2019 Joint 8th international
conference on informatics, electronics & vision (ICIEV) and 2019 3rd international conference on
imaging, vision & pattern recognition (icIVPR). IEEE, pp 318–323
21. Dargan S, Kumar M, Mittal A, Kumar K (2023) Handwriting-based gender classification using machine
learning techniques. Multimed Tools Appl 1–25
22. Das N, Acharya K, Sarkar R, Basu S, Kundu M, Nasipuri M (2014) A benchmark image database of
isolated Bangla handwritten compound characters. Int J Doc Anal Recognit (IJDAR) 17(4):413–431
23. Das N, Basu S, Sarkar R, Kundu M, Nasipuri M et al (2015) An improved feature descriptor for
recognition of handwritten Bangla alphabet. arXiv:1501.05497
24. Das N, Reddy JM, Sarkar R, Basu S, Kundu M, Nasipuri M, Basu DK (2012) A statistical-topological
feature combination for recognition of handwritten numerals. Appl Soft Comput 12(8):2486–2495
25. Das N, Sarkar R, Basu S, Kundu M, Nasipuri M, Basu DK (2012) A genetic algorithm based region
sampling for selection of local features in handwritten digit recognition application. Appl Soft Comput
12(5):1592–1606
26. Das N, Sarkar R, Basu S, Saha PK, Kundu M, Nasipuri M (2015) Handwritten Bangla character recogni-
tion using a soft computing paradigm embedded in two pass approach. Pattern Recognit 48(6):2054–2071
27. Deore SP, Pravin A (2020) Devanagari handwritten character recognition using fine-tuned deep convo-
lutional neural network on trivial dataset. Sādhanā 45(1):1–13
28. Duerr B, Hättich W, Tropf H, Winkler G (1980) A combination of statistical and syntactical pattern
recognition applied to classification of unconstrained handwritten numerals. Pattern Recognit 12(3):189–
199
29. Gan J, Chen Y, Hu B, Leng J, Wang W, Gao X (2023) Characters as graphs: interpretable handwritten
Chinese character recognition via pyramid graph transformer. Pattern Recognit 109317
30. Ghosh S, Chatterjee A, Singh PK, Bhowmik S, Sarkar R (2020) Language-invariant novel feature
descriptors for handwritten numeral recognition. Vis Comput 1–23
31. Granlund GH (1972) Fourier preprocessing for hand print character recognition. IEEE Trans Comput
100(2):195–201
32. Guha R, Das N, Kundu M, Nasipuri M, Santosh KC (2020) DevNet: an efficient CNN architecture for
handwritten Devanagari character recognition. Int J Pattern Recognit Artif Intell 34(12):2052009
33. Gupta A, Sarkhel R, Das N, Kundu M (2019) Multiobjective optimization for recognition of isolated
handwritten Indic scripts. Pattern Recognit Lett 128:318–325
34. Hamida S, Cherradi B, El Gannour O, Raihani A, Ouajji H (2023) Cursive Arabic handwritten word
recognition system using majority voting and k-NN for feature descriptor selection. Multimed Tools
Appl 1–25
35. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of
the IEEE conference on computer vision and pattern recognition, pp 770–778
36. Hijam D, Saharia S (2021) On developing complete character set Meitei Mayek handwritten character
database. Vis Comput 1–15
37. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
38. Huang Z, Shivakumara P, Kaljahi MA, Kumar A, Pal U, Lu T, Blumenstein M (2023) Writer age
estimation through handwriting. Multimed Tools Appl 82(11):16033–16055
39. Inunganbi S (2023) A systematic review on handwritten document analysis and recognition. Multimed
Tools Appl 1–27
40. Inunganbi S, Choudhary P, Manglem K (2021) Handwritten Meitei Mayek recognition using three-
channel convolution neural network of gradients and gray. Comput Intell 37(1):70–86
41. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal
covariate shift. In: International conference on machine learning. PMLR, pp 448–456
42. Jiang W (2020) MNIST-MIX: a multi-language handwritten digit recognition dataset. IOP SciNotes
1(2):025002
43. Jindal SR, Singh H (2019) Benchmark datasets for offline handwritten Gurmukhi script recognition. In:
Document analysis and recognition: 4th workshop, DAR 2018, Held in Conjunction with ICVGIP 2018,
Hyderabad, India, December 18, 2018, Revised Selected Papers, vol 1020. Springer, p 143
44. Kaur S, Verma K (2020) Handwritten Devanagari character generation using deep convolutional gener-
ative adversarial network. In: Soft computing: theories and applications. Springer, pp 1243–1253
45. Kavitha BR, Srimathi C (2019) Benchmarking on offline handwritten Tamil character recognition using
convolutional neural networks. J King Saud Univ-Comput Inf Sci
46. Keserwani P, Ali T, Roy PP (2019) Handwritten Bangla character and numeral recognition using con-
volutional neural network for low-memory GPU. Int J Mach Learn Cybern 10(12):3485–3497
47. Khosravi H, Kabir E (2007) Introducing a very large dataset of handwritten Farsi digits and a study on
their varieties. Pattern Recognit Lett 28(10):1133–1141
48. Kim I-J, Xie X (2015) Handwritten Hangul recognition using deep convolutional neural networks. Int J
Doc Anal Recognit (IJDAR) 18(1):1–13
49. Kong H, Tang D, Meng X, Lu T (2019) Garn: a novel generative adversarial recognition network for
end-to-end scene character recognition. In: 2019 International conference on document analysis and
recognition (ICDAR). pp 689–694
50. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural
networks. Adv Neural Inf Process Syst 2:1097–1105
51. Kusetogullari H, Yavariabdi A, Cheddad A, Grahn H, Hall J (2019) ARDIS: a Swedish historical hand-
written digit dataset. Neural Comput Appl 1–14
52. Lam L, Suen CY (1988) Structural classification and relaxation matching of totally unconstrained hand-
written zip-code numbers. Pattern Recognit 21(1):19–31
53. Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition.
Proceedings of the IEEE 86(11):2278–2324
54. Li Z, Teng N, Jin M, Huaxiang L (2018) Building efficient CNN architecture for offline handwritten
Chinese character recognition. Int J Doc Anal Recognit (IJDAR) 21(4):233–240
55. Chi L, Asfandeyar A, Qu R, Yi W, Lei W, Wu G, Qiang L, Qiang Z (2023) A handwriting recognition
system with WiFi. IEEE Trans Mobile Comput 1–18
56. Lincy RB, Gayathri R (2021) Optimally configured convolutional neural network for Tamil handwritten
character recognition by improved lion optimization model. Multimed Tools Appl 80(4):5917–5943
57. Llorens D, Prat F, Marzal A, Vilar JM, Castro MJ, Amengual J-C, Barrachina S, Castellanos A, Boquera
SE, Gómez JA et al (2008) The ujipenchars database: a pen-based database of isolated handwritten
characters. In: LREC
58. Mahapatra D, Choudhury C, Karsh RK (2020) Generator based methods for off-line handwritten charac-
ter recognition. In: 2020 Advanced communication technologies and signal processing (ACTS). IEEE,
pp 1–6
59. Maitra DS, Bhattacharya U, Parui SK (2015) CNN based common approach to handwritten charac-
ter recognition of multiple scripts. In: 2015 13th International conference on document analysis and
recognition (ICDAR). IEEE, pp 1021–1025
60. Majid N, Smith EHB (2022) Character spotting and autonomous tagging: offline handwriting recognition
for Bangla, Korean and other alphabetic scripts. Int J Doc Anal Recognit (IJDAR) 25(4):245–263
61. Manjusha K, Kumar MA, Soman KP (2018) Integrating scattering feature maps with convolutional
neural networks for Malayalam handwritten character recognition. Int J Doc Anal Recognit (IJDAR)
21(3):187–198
62. Manjusha K, Kumar MA, Soman KP (2019) On developing handwritten character image database for
malayalam language script. Eng Sci Technol Int J 22(2):637–645
63. Melnyk P, You Z, Li K (2020) A high-performance CNN method for offline handwritten Chinese character
recognition and visualization. Soft Comput 24(11):7977–7987
64. Mukhoti J, Dutta S, Sarkar R (2020) Handwritten digit classification in Bangla and Hindi using deep
learning. Appl Artif Intell 34(14):1074–1099
65. Muthureka K, Reddy US, Janet B (2023) An improved customized CNN model for adaptive recognition
of cerebral palsy people’s handwritten digits in assessment. Int J Multimed Inf Retriev 12(2):23
66. Pal U, Chaudhuri BB (2000) Automatic recognition of unconstrained off-line Bangla handwritten numer-
als. In: International conference on multimodal interfaces. Springer, pp 371–378
67. Pant AK, Panday SP, Joshi SR (2012) Off-line Nepali handwritten character recognition using multilayer
perceptron and radial basis function neural networks. In: 2012 Third Asian Himalayas international
conference on internet. IEEE, pp 1–5
68. Parseh MJ, Meftahi M (2017) A new combined feature extraction method for Persian handwritten digit
recognition. Int J Image Graph 17(02):1750012
69. Porwal U, Fornés A, Shafait F (2022) Advances in handwriting recognition
70. Prabhu VU (2019) Kannada-MNIST: a new handwritten digits dataset for the Kannada language.
arXiv:1908.01242
71. Pramanik R, Bag S (2018) Shape decomposition-based handwritten compound character recognition for
Bangla OCR. J Vis Commun Image Represent 50:123–134
72. Pramanik R, Dansena P, Bag S (2018) A study on the effect of CNN-based transfer learning on handwrit-
ten Indic and mixed numeral recognition. In: Workshop on document analysis and recognition. Springer,
pp 41–51
73. Prat F, Marzal A, Martın S, Ramos-Garijo R (2007) A two-stage template-based recognition engine for
on-line handwritten characters. In: Proc. of the Asia-Pacific workshop, pp 77–82
74. Prijatelj DS, Grieggs S, Yumoto F, Robertson E, Scheirer W (2023) Novelty in handwriting recognition.
In: A unifying framework for formal theories of novelty: discussions, guidelines, and examples for
artificial intelligence. Springer, pp 49–70
75. Prince SJD (2023) Understanding deep learning. MIT Press
76. Ram S, Gupta S, Agarwal B (2018) Devanagri character recognition model using deep convolution
neural network. J Stat Manage Syst 21(4):593–599
77. Rao Z, Zeng C, Wu M, Wang Z, Zhao N, Liu M, Wan X (2018) Research on a handwritten character
recognition algorithm based on an extended nonlinear kernel residual network. KSII Trans Int Inf Syst
12(1):413–435
78. Roy A, Das N, Sarkar R, Basu S, Kundu M, Nasipuri M (2014) An axiomatic fuzzy set theory based
feature selection methodology for handwritten numeral recognition. In: ICT and critical infrastructure:
proceedings of the 48th annual convention of computer society of India-Vol I. Springer, pp 133–140
79. Roy S, Das N, Kundu M, Nasipuri M (2017) Handwritten isolated Bangla compound character recog-
nition: a new benchmark using a novel deep learning approach. Pattern Recognit Lett 90:15–21
80. Saha P, Jaiswal A (2020) Handwriting recognition using active contour. In: Artificial intelligence and
evolutionary computations in engineering systems. Springer, pp 505–514
81. Saini A, Daniel S, Saini S, Mittal A (2021) Kannadares-next: a deep residual network for Kannada
numeral recognition. In: Machine learning for intelligent multimedia analytics. Springer, pp 63–89
82. Santosh KC, Iwata E (2012) Stroke-based cursive character recognition. Adv Character Recognit 175
83. Santosh KC, Nattee C, Lamiroy B (2010) Spatial similarity based stroke number and order free clustering.
In: 2010 12th International conference on frontiers in handwriting recognition. IEEE, pp 652–657
84. Sarkar A, Singh K, Mukerjee A (2012) Handwritten Hindi numerals recognition system. CS365 project
report, webpage: https://www.cse.iitk.ac.in/users/cs365/2012/submissions/aksarkar/cs365
85. Sarkhel R, Das N, Das A, Kundu M, Nasipuri M (2017) A multi-scale deep quad tree based feature
extraction method for the recognition of isolated handwritten characters of popular Indic scripts. Pattern
Recognit 71:78–93
86. Sarkhel R, Das N, Saha AK, Nasipuri M (2016) A multi-objective approach towards cost effective
isolated handwritten Bangla character and digit recognition. Pattern Recognit 58:172–189
87. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
arXiv:1409.1556
88. Singh H, Sharma RK, Singh VP (2023) Language model based suggestions of next possible Gurmukhi
character or word in online handwriting recognition system. Multimed Tools Appl 1–19
89. Singh PK, Chatterjee I, Sarkar R, Smith EB, Nasipuri M (2021) A new feature extraction approach for
script invariant handwritten numeral recognition. Expert Syst 38(6):e12699
90. Singh PK, Sarkar R, Nasipuri M (2018) A comprehensive survey on Bangla handwritten numeral recog-
nition. Int J Appl Pattern Recognit 5(1):55–71
91. Singh S, Sharma A (2019) Online handwritten Gurmukhi words recognition: an inclusive study. ACM
Trans Asian Low-Resour Lang Inf Process 18(3):21:1-21:55
92. Singh S, Sharma A, Chauhan VK (2021) Online handwritten Gurmukhi word recognition using fine-
tuned deep convolutional neural network on offline features. Mach Learn Appl 100037
93. Singh S, Sharma A, Chauhan VK (2023) Indic script family and its offline handwriting recognition for
characters/digits and words: a comprehensive survey. Artif Intell Rev 1–53
94. Singh S, Sharma A, Chhabra I (2016) Online handwritten Gurmukhi strokes dataset based on minimal
set of words. ACM Trans Asian Low-Res Lang Inf Process 16(1):1–20
95. Singh S, Sharma A, Chhabra I (2017) A dominant points-based feature extraction approach to recognize
online handwritten strokes. Int J Doc Anal Recognit 20(1):37–58
96. Sufian A, Ghosh A, Naskar A, Sultana F, Sil J, Rahman MMH (2020) BDNet: Bengali handwritten
numeral digit recognition based on densely connected convolutional neural networks. J King Saud
Univ-Comput Inf Sci
97. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A
(2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp 1–9
98. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for
computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp 2818–2826
99. Wan L, Zeiler M, Zhang S, Cun YL, Fergus R (2013) Regularization of neural networks using dropcon-
nect. In: Dasgupta S, McAllester D (eds) Proceedings of the 30th international conference on machine
learning
100. Zhao A, Li J (2023) A significantly enhanced neural network for handwriting assessment in Parkinson’s
disease detection. Multimed Tools Appl 1:1–21
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.