Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 218 (2023) 2287–2298
www.elsevier.com/locate/procedia
Abstract
Character recognition is the extraction of printed or handwritten text from images into machine-readable format. The extracted
text can be easily edited, modified and efficiently stored. While there are several Optical Character Recognition (OCR) and
Handwritten Character Recognition (HCR) systems available for the English language, such systems are not well developed for
Indian languages such as Gujarati. This work deals with text recognition of the Gujarati script. Two different models have been
analyzed in this work for the task of recognizing Gujarati text: a CNN-based EfficientNet B3 and YOLO v4. The system has
been developed using the EfficientNet B3 model, which gives better accuracy and efficiency. The input to the system is an image
containing printed Gujarati text, and the system produces an editable text document containing the recognized text in the
image. The system has been successfully used to create a digital library of Gujarati newspaper articles
from their images. This novel project is a step toward the cultural and linguistic preservation of the Gujarati language.
© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Machine Learning and Data Engineering
Keywords: Optical Character Recognition; Natural Language Processing; Convolutional Neural Network; Gujarati; Gujarati Character Recognition;
Devanagari Script; OCR; Feed-forward Back Propagation (FFBP) Neural Network; EfficientNet B3; Line Segmentation; Word Segmentation;
Character Segmentation; Auto-correct; Article Search
1. Introduction
Gujarati is one of the oldest Indian languages. It is an official language in the state of Gujarat and in the union
territory of Dadra and Nagar Haveli and Daman and Diu. Gujarati is a 700-year-old language spoken by over 55
million people worldwide. It is widely spoken by Gujarati migrants in other parts of India, particularly in Mumbai
and New Delhi, as well as by the Gujarati diaspora outside India, such as in the United States and Canada.
Gujarati script is adapted from the Devanagari script. The Gujarati language consists of 34 consonants and 13 vowels.
It also includes ten digits, from zero to nine. Gujarati writing is an abugida, which means that each base
consonantal letter carries an inherent vowel. Post-consonantal vowels other than ‘a’ are written as diacritics applied
to the consonant, while full-formed letters are used for non-post-consonantal vowels (initial and post-vocalic positions).
Because ‘a’ is the most commonly used vowel, this is a useful approach, as it reduces the amount of writing.
Character and word recognition for the English language is very advanced today, but very little work has been
done for the Gujarati language. One reason is that far fewer people speak Gujarati than English. The other
reason is the difficulty of recognizing Gujarati words, which do not have the distinguishing line on top of each
word found in Hindi. The problem of character recognition in the Gujarati language of the Indian subcontinent
is addressed in this work. Gujarati is one of the many languages that employ variants of the Devanagari script. There is no
notable work in the literature that addresses Gujarati language recognition. The Gujarati script evolved from the
Devanagari script; Sanskrit, Hindi, and Marathi are among the languages that use a similar script. Some Gujarati
characters have a striking resemblance in appearance and can easily be misclassified in the presence of noise.
This paper describes the results obtained using deep learning techniques for character recognition.
This section is followed by the Literature Survey, in which we summarize 12 research papers on related topics.
The section after that presents the details of model training. Following the proposed model section is the
Implementation and Result section, which explains how an input image is enhanced and used as input to the model.
The next section, Evaluation and Discussion, compares the different models we trained, and the paper closes with
the Conclusion section.
2. Literature Survey
The literature review involves the study of various research papers covering a range of Indian languages,
including Gurmukhi and Odia. One common observation is that most of them achieve high accuracy at
character-level recognition but fail to perform well on word- or sentence-level recognition. It is also observed
that different neural network architectures perform better than statistical machine learning models.
Table 1 summarizes the state-of-the-art techniques in this domain.
Table 1. Summary of state-of-the-art techniques.

… — Dataset: pictures obtained from a variety of sources, including photos taken from newspapers, books and other sources, in a variety of font styles, sizes, and colors. Method: OCR on Gujarati scripts. Results: … of 80% for Gujarati scripts.

Paper [6] — Dataset: a total of 11,996 characters are used for testing. Method: the word-connected component labeling method is used to uniquely identify an individual subcomponent; HOG and CCH are used for feature extraction; an SVM classifier is used for classification. Results: the average recognition rate for style identification is 83.57%; the overall average prediction accuracy of the proposed system is 88.4%.

Paper [7] — Dataset: a dataset of 748 images containing 2 images per character with font size 12. Method: PCA is used for feature extraction; a Hopfield neural network with 900 input neurons and 900 output neurons is used for classification. Results: the Hopfield neural network gives an overall accuracy of 93.25%.

Paper [8] — Dataset: 4000 numerals from 200 different people. Method: a 3-layered FFBP neural network for recognition of Gujarati numerals, trained on both handwritten and computer-generated (font: Arial) images. Results: identifies individual characters with an accuracy of 81.5%.

Paper [9] — Dataset: 1800 numerals are used for training, with a 93.26% correct recognition rate and 1.71% rejection on a disjoint test set of another 7760 samples. Method: segmentation first decomposes the text into its connected components; a hybrid classifier consisting of binary decision trees and nearest neighbors is then used; for word-level recognition, statistical and grammar rules are applied as a post-processor. Results: an accuracy of 97% is obtained on printed text.

Paper [10] — Dataset: images from 15 font families with font size 4; the training set consists of 18 characters (classes), each having 20 samples. Method: recognition of similar-appearing Gujarati characters using a Fuzzy KNN classifier paired with two different feature sets, Geometric and Wavelet. Results: the Wavelet features give almost 100% accuracy, whereas the recognition accuracy obtained for the same characters using plain KNN was only 67%.

Paper [11] — Dataset: data collected from a broad group of people, considering their professions, age groups and proficiency in the language; 200 samples for each basic Gujarati character. Method: a new combination of structural and statistical methods (Freeman chain code, Hu's invariant moments and center of mass) is used to extract feature vectors, which are fed to Support Vector Machines; accuracies are analyzed using 10-fold cross-validation. Results: the combination of the three methods yields an accuracy of 87.22% for Gujarati characters.

Paper [12] — Dataset: the publicly available OdiEnCorp 2.0 corpus containing 98,302 sentences, 1.69 million English tokens and 1.47 million Odia tokens. Method: Tesseract-OCR is used to recognize the non-digitized data for the translation process; RGB images are converted to grayscale using luminosity, with a threshold algorithm for binarization and dilation and erosion for image processing. Results: the sacreBLEU score is 8.6 for Odia→English translation and 5.2 for English→Odia.
3. Proposed Model
The proposed system implements Gujarati character recognition using a Convolutional Neural Network and
transfer learning as the foundation for Gujarati script recognition. The dataset was developed with Gujarati characters
organized into 615 distinct classes, and multiple CNN models were trained on these classes.
Figure 1 illustrates the conceptual diagram for the proposed system. The first step is to convert the original image
file to a binary image using adaptive binarization. This enables the separation of the text from the background. The
next step is to detect the contours of the text. Blobs are formed by grouping the outlines. Blobs are arranged into text
lines, which are then evaluated to determine if the text is fixed pitch or proportional pitch. The text spacing
determines how the lines are partitioned into words. The next step is recognition, which is a two-step process. During the
first step, each segmented character is passed to the character recognition model and a list of recognized characters is
saved. A second pass then synchronizes this list with the space and new-line escape codes saved during segmentation to
reconstruct words, lines and paragraphs.
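As a rough illustration of the first two stages of this pipeline, the following is a minimal sketch of adaptive binarization and contour detection using OpenCV; the block size and constant of the threshold are illustrative choices, not the authors' exact parameters.

import cv2

def binarize_and_find_contours(image_path):
    # Read the page as a grayscale image.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Adaptive binarization separates dark text from an unevenly lit background.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 15)
    # Each external contour approximates one text blob.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return binary, contours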
3.2. Annotations
An annotation is a note or a short explanation added to an image to describe a particular part of it. Using this
concept, the dataset was created for training the model on the character dataset. In the proposed system, 98,303
Gujarati words were collected and stored in a text file with each word on a new line. This text file was then converted
to a PDF file with the same format of one word per line. The PDF file was then converted to image files, with each page
becoming a new image, using an online PDF-to-image converter tool, yielding 6558 images (the PDF file had 6558 pages).
All images were stored in a single folder. The images were iterated over one by one and the same segmentation concept
was applied to each image. As each image contained one word per line, line segmentation separated all the words,
producing many images that each contained a single word. Each such line image was much wider than it was tall because
it spanned the entire width of the page.
The next task was to trim the line image from both sides. The complete image was scanned column by column
starting from the rightmost column, that is, the extreme right of the image, toward the left. All the images had
white space on the right side with the word on the left side. While scanning, white pixels were ignored, and as soon
as the first black pixel was encountered, scanning stopped and the image was cropped up to the column just after that
black pixel. Similarly, scanning from the left side, as soon as the first black pixel was encountered, the image was
cropped from the start of the image up to the column just before that pixel. The image then contained only the word,
which fit exactly within the dimensions of the image. To make the word clearer, padding of five pixels was added to
all sides of the image, increasing its height and width by ten pixels. The same procedure was applied to all 6558
images and, as a result, many images were formed containing only a single word each. Once such images are formed,
character segmentation starts: all the word images are iterated over one by one and character segmentation is performed.
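The trimming-and-padding step described above can be sketched as follows, assuming the line image is a binarized grayscale NumPy array (black text on a white background); the function name and the ink threshold of 128 are illustrative.

import cv2
import numpy as np

def trim_and_pad(line_img, pad=5):
    # Columns that contain at least one black pixel hold part of the word.
    ink_columns = np.where((line_img < 128).any(axis=0))[0]
    if ink_columns.size == 0:
        return line_img  # blank line: nothing to trim
    left, right = ink_columns[0], ink_columns[-1]
    word = line_img[:, left:right + 1]
    # Pad five white pixels on every side, growing height and width by ten pixels.
    return cv2.copyMakeBorder(word, pad, pad, pad, pad,
                              cv2.BORDER_CONSTANT, value=255)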
A similar approach is used for character segmentation as for line segmentation. Once a column containing a black
pixel is encountered, the image is cropped from that column up to the next completely white column, and this part of
the image is considered a single character. Here as well, a padding of five pixels is added on all sides of the image,
so its height and width increase by ten pixels. Once the character images were formed, the next important task was to
label them to create the complete dataset. For labeling, the complete text file of Gujarati words used initially was
available, and the order of the character images matched their order of occurrence in the text file. The text file was
therefore simply parsed character by character and the corresponding images were labeled. In this way, a dataset of
81,304 images of Gujarati characters was created for training the model.
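A rough sketch of this labeling pass is shown below, assuming the segmented character images were written to disk so that their sorted order matches the order of the characters in the word list; the paths are hypothetical, and grouping of matras with their base consonants is not handled here.

from pathlib import Path

def label_characters(words_file, char_image_dir):
    text = Path(words_file).read_text(encoding="utf-8")
    # Every non-whitespace character in the word list corresponds to one image.
    characters = [ch for ch in text if not ch.isspace()]
    images = sorted(Path(char_image_dir).glob("*.png"))
    # Pair each character image with the matching character from the text file.
    return list(zip(images, characters))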
Convolutional neural networks are deep learning neural networks that are used to analyze structured data sets
such as images. A convolutional neural network is a feed-forward neural network with up to 20 or 30 layers.
Convolutional neural networks are composed of numerous convolutional layers stacked on top of one another, each
proficient in identifying increasingly complex forms. The architecture of a convolutional neural network is a multi-
layered feed-forward neural network formed by stacking multiple hidden layers on top of each other in sequence.
Because of their sequential architecture, convolutional neural networks may learn hierarchical features.
Convolutional layers are often followed by activation layers, with some of them being followed by pooling layers
[13] [14].
Scaling a Convolutional Neural Network can help it perform better. The three scaling dimensions of a CNN are
its depth, width, and resolution. The depth of a network equals the number of layers present in it, width refers to
the number of channels in each layer, and resolution is the resolution of the input image provided to the CNN. By stacking
additional convolutional layers and increasing the depth, the network can learn more complex features. Increasing
the width of a network allows it to learn more fine-grained features. Increasing the resolution provides the model
with more detailed information about the image. However, these techniques, when applied individually, yield diminishing
accuracy gains and aggravate the problem of vanishing gradients. EfficientNet is a family of deep neural networks derived
using the principle of compound scaling. EfficientNet uses MobileNetV2’s inverted bottleneck convolution (MBConv)
block as its base network. Squeeze-and-excitation optimization is added to this base model, which is then uniformly
scaled in width, depth and resolution using a compound coefficient to obtain the EfficientNet family B0 to B7. It is
observed that EfficientNet models outperform conventional CNN models in terms of accuracy and efficiency [15] [16] [17] [18].
The proposed system uses EfficientNet B3 for training the model. The output layer uses the Softmax activation
function. The model was trained for 10 epochs, and the weights were initialized from ‘imagenet’. ImageNet is an
image database organized according to the WordNet hierarchy, with millions of photographs representing the nodes
in the network. The project has made major contributions to research in computer vision and deep learning, and
researchers are permitted to use the data for non-commercial purposes.
In batch normalization, momentum was set to 0.99 and epsilon to 0.001. Momentum controls the “lag” in the learned
running mean and variance, allowing noise from individual mini-batches to be ignored; by default it is set to a high
value of around 0.99, implying a lot of lag and slow updates of the running statistics. Epsilon is a small float
added to the variance to avoid division by zero. The kernel regularizer was set with a coefficient of 0.016, the
activity regularizer with 0.006 and the bias regularizer with 0.006. The kernel regularizer penalizes the weights W
of the layer (excluding the bias), the bias regularizer penalizes the bias b, and the activity regularizer penalizes
the layer's output y = Wx + b, driving both the weights and the bias toward small values. The training time for
5 epochs on the device configuration mentioned in the Experimental Setup section is 3 hours.
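A minimal Keras sketch of such a transfer-learning model is given below. Only the 615 classes, the ImageNet weights, the Softmax output, the batch-normalization momentum and epsilon, and the regularizer coefficients come from the text; the input size, the 256-unit hidden layer, the choice of l2/l1 regularizers and the optimizer are assumptions made for illustration.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

NUM_CLASSES = 615  # one class per Gujarati character form

# EfficientNet B3 backbone with ImageNet weights; 300x300 input is an assumed size.
base = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet", input_shape=(300, 300, 3), pooling="max")

model = tf.keras.Sequential([
    base,
    layers.BatchNormalization(momentum=0.99, epsilon=0.001),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(0.016),
                 activity_regularizer=regularizers.l1(0.006),
                 bias_regularizer=regularizers.l1(0.006)),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])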
The model was trained and analyzed on Jupyter Notebook using Anaconda Environment with Python 3.6 and on
an Intel i7 processor, 16 GB RAM with NVIDIA GTX 1650, which has 4 GB of GDDR5, clocked at 8GT/s. The
saved model file is deployed on a Graphical User Interface developed using PyQt5.
The dataset was digitally created by converting a list of words into images and then annotating the data; the
details of the annotation process are described in the Annotations section. Various samples of a single character
were created by changing the font style and adding salt-and-pepper noise, which is sparsely occurring black and white
pixels scattered over the image. In this way, 615 different classes were obtained, where each class represents a single
Gujarati character, including characters with matras. Each class contains around 130 to 150 images of a single
character, for a total of 81,304 generated images.
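The salt-and-pepper augmentation mentioned above can be sketched as follows, assuming grayscale NumPy images; the noise fraction is an illustrative choice.

import numpy as np

def add_salt_and_pepper(img, amount=0.02, seed=None):
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < amount / 2] = 0          # pepper: scattered black pixels
    noisy[mask > 1 - amount / 2] = 255    # salt: scattered white pixels
    return noisy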
The data consists of images with high background noise, poor resolution, etc. Data preprocessing is essential for
achieving better results and includes several steps. First, the image consisting of several words and lines is
segmented into separate characters. Then, each individual character image is resized and enhanced to improve its
quality. Skew correction is performed on the image and its contrast and brightness values are adjusted. The
enhanced image is then suitable for character recognition. The detailed preprocessing steps are explained below.
4.3.1. Segmentation
Segmentation is used to separate a paragraph into single characters. This process occurs in three steps: line
segmentation first, followed by word segmentation, and lastly character segmentation. The underlying technique used for
segmentation is histogram projection [19]; a minimal code sketch of the projection approach follows Figure 2.
• Line Segmentation: Line segmentation separates the different lines of text present in the input image. This is done
using the horizontal histogram projection method. A histogram is a plot of pixel counts in an image. In the
horizontal projection method, the number of text pixels along each row of the image is counted, giving an
output array whose size equals the height of the image (the number of rows). When this array is plotted, a
horizontal histogram with rows on the y-axis and pixel counts on the x-axis is obtained. The higher peaks in
the graph indicate text lines, whereas the lower peaks correspond to the gaps between two lines; these
lower peaks are used to mark the segmentation between two lines in the image. Figure 2 (a) depicts the test
image used for OCR and Figure 2 (b) depicts the output of the test image after line segmentation [20] [21].
• Word Segmentation: Word segmentation further divides individual lines into separate words. The vertical
histogram projection method is employed to accomplish this. The number of text pixels along each column of the
image is counted, giving an output array whose size equals the width of the image (the number of columns).
When this array is plotted, a vertical histogram with columns on the x-axis and pixel counts on the y-axis is
obtained. The higher peaks in the graph indicate words, whereas the lower peaks correspond to the gaps between
two words; these lower peaks are used to mark the segmentation between two words in the single-line image
obtained from the previous step. Figure 2 (c) is the output of word segmentation [22].
• Character Segmentation: The images of individual words obtained from the previous step are further segmented
into separate characters in this step. This is known as character segmentation. Figure 3 (a) is segmented
characters of the test image [23].
Fig. 2. (a) Input Image; (b) Segmented Lines; (c) Segmented Words
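A minimal sketch of the histogram-projection idea used for line and word segmentation is given below, assuming a binarized image in which text pixels are non-zero; the ink threshold is an illustrative choice rather than the authors' exact cut-off.

import numpy as np

def split_on_projection(binary, axis, min_ink=1):
    # axis=1 sums each row (horizontal projection -> lines);
    # axis=0 sums each column (vertical projection -> words).
    projection = (binary > 0).sum(axis=axis)
    segments, start = [], None
    for i, value in enumerate(projection):
        if value >= min_ink and start is None:
            start = i                    # entering a text run (high peak)
        elif value < min_ink and start is not None:
            segments.append((start, i))  # gap between lines/words (low peak)
            start = None
    if start is not None:
        segments.append((start, len(projection)))
    return segments

# lines = [binary[r0:r1, :] for r0, r1 in split_on_projection(binary, axis=1)]
# words = [line[:, c0:c1] for c0, c1 in split_on_projection(line, axis=0)]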
4.3.2. Resizing
Image resizing refers to scaling of a digital image. While scaling a vector graphic image, geometric
transformations may be used to scale the visual primitives that make up the image while maintaining the image
quality. A new picture with a greater or lesser number of pixels must be produced when scaling a raster graphics
image. When the number of pixels is reduced (scaling down), there is typically a noticeable loss of quality. Image
scaling may be considered a form of image resampling or reconstruction in the sense of the Nyquist sampling
theorem. According to the theorem, aliasing artifacts can only be avoided by down-sampling a higher-resolution
original image to a smaller image using a suitable 2D anti-aliasing filter, so that the image is reduced to the
amount of information that the smaller image can retain.
4.3.3. Enhancement
Image enhancement consists of techniques for improving images. Histogram plots can be used to improve image
quality. The histogram displays the number of pixels in a picture (vertical axis) that have a specific brightness value
(horizontal axis). Using the histogram, the brightness and contrast of an image can be adjusted, boosting the image's
quality. Figure 3 (b) shows the difference between the resized image and the image after enhancement [24] [25] [26].
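The resizing and enhancement steps can be sketched as follows with OpenCV; the target size and the CLAHE parameters are illustrative assumptions.

import cv2

def resize_and_enhance(char_img, size=(96, 96)):
    # Area interpolation acts as a simple anti-aliasing filter when shrinking.
    resized = cv2.resize(char_img, size, interpolation=cv2.INTER_AREA)
    # Histogram-based contrast enhancement (CLAHE) adjusts brightness and contrast.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(resized)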
The method of recognizing and interpreting text in photographs using computer vision is known as optical
character recognition (OCR). OCR systems go through a variety of pre-processing procedures to clean and denoise
the picture. Following that, the document is binarized for contour detection, which aids in the recognition of lines
and columns. The characters that make up the lines are then isolated, cropped and fed into a trained model
for classification. Character recognition refers to the technique of identifying individual characters. Using contour
detection, the borders of objects can be detected and localized easily in an image; in OCR systems it is used to
extract the characters from the image [27] [28] [29].
Once the model detects the segmented characters, it will return the predictions in the form of an array with a
probability value for each class. The maximum probability value is found and then mapped with the corresponding
index in the class_dict.csv file, which contains all 615 classes in sequence. The predicted class is thus obtained,
but in raw form, that is, not as Gujarati text: for example, the model predicts classes such as ‘Aa’, ‘Ta’, ‘E’, etc.
These must be converted to the equivalent Gujarati text. This is done by a dictionary with key-value pairs, with
the English transliteration as the key and the Gujarati text as the value. Figure 3 (c) is a snapshot of the
dictionary used for mapping Gujarati characters.
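The prediction-to-text mapping can be sketched as follows, assuming the class names in class_dict.csv appear in the same order as the model's output units and that the input image has already been preprocessed to the model's input shape; the file layout and the mapping excerpt are illustrative.

import csv
import numpy as np

with open("class_dict.csv", encoding="utf-8") as f:
    class_names = [row[0] for row in csv.reader(f)]  # 615 class labels, e.g. 'Aa'

gujarati_map = {"Aa": "આ", "Ta": "ત", "E": "એ"}       # excerpt of the key-value dictionary

def predict_character(model, char_img):
    probs = model.predict(char_img[np.newaxis, ...])[0]  # one probability per class
    label = class_names[int(np.argmax(probs))]           # highest-probability class
    return gujarati_map.get(label, label)                # map to Gujarati text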
For post-processing the predicted text, the HunSpell Gujarati dictionary with the Enchant interface was used. Enchant is
a library (and command-line tool) that provides a uniform interface to several different spelling libraries and
programs. Enchant is written in plain C99 and C++11 and therefore works on most modern operating systems.
When given a word that is not present in the dictionary, Enchant uses operations such as delete, insert, transpose,
replace and split to suggest possible correct words.
For the proposed system, after the text in the entire image is predicted, the text is passed through the HunSpell
dictionary to check and highlight the words that are not present in the dictionary. The user then receives
suggestions to correct those words. If the user feels that the predicted word is correct, they have the option to add
that word to the dictionary [30].
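A minimal sketch of this spell-checking step with the PyEnchant interface is shown below, assuming the Gujarati Hunspell dictionary is installed under the tag "gu_IN" and that an existing personal word-list file backs the user's "add to dictionary" option.

import enchant

checker = enchant.DictWithPWL("gu_IN", "user_words.txt")

def review_word(word):
    if checker.check(word):
        return word, []                 # word found in the dictionary
    return word, checker.suggest(word)  # suggestions via delete/insert/transpose/...

# checker.add(word) appends a user-approved word to the personal word list.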
Fig. 3. (a) Segmented Characters; (b) Difference between resized image and enhanced resized image; (c) Mapping Dictionary
The Article Search consists of images of Gujarati newspaper articles and their corresponding text files obtained
by performing OCR on the images. They are segregated according to the newspaper to which they belong and
organized according to the date of publication. The Article Search feature has been incorporated so that users can
conveniently search for newspaper articles stored in the database using search criteria such as keywords, the name of
the newspaper, and the date of publication or a start and end date range. The keyword search is done by string
matching in the files of the particular date range. The output is a list of articles from the database along with their
paths satisfying the particular search criteria. The user can view any article using its path.
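A simplified sketch of this lookup is shown below, assuming the OCR text files are organized as <newspaper>/<YYYY-MM-DD>/<article>.txt; the directory layout and date format are illustrative assumptions.

from datetime import date
from pathlib import Path

def search_articles(root, keyword, newspaper=None, start=date.min, end=date.max):
    matches = []
    for txt in Path(root).glob("*/*/*.txt"):
        paper, day = txt.parts[-3], date.fromisoformat(txt.parts[-2])
        if newspaper and paper != newspaper:
            continue
        if not (start <= day <= end):
            continue
        if keyword in txt.read_text(encoding="utf-8"):  # plain string matching
            matches.append(str(txt))
    return matches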
For ease of use, the proposed system has a Graphical User Interface developed using PyQt5; the tasks of manual
annotation, training, inference and article search can all be performed through the GUI. Since annotation and training
will only be used by developers, two separate GUIs were developed to avoid clutter: one comprising Annotation and
Training, and the other comprising Inference and Article Search. Both GUIs have a login screen. The details of the
GUI are as follows:
4.6.1. Annotations
The menu bar at the left end of the screen, as shown in Figure 4 (a), gives the user options to select the
input image, save the annotated image, create an annotation box, and zoom in and out of the selected image. The
middle section of the screen is the area where the input image appears; once an annotation box is drawn, a pop-up
appears to select the label, as seen in Figure 4 (b). The right-most bar is for editing labels, setting
default labels and viewing the output file list. Some annotation samples can be seen in Figure 4 (c).
This screen provides the user with the ability to start training a model by providing a path to the dataset, as shown
in Figure 5 (a). Training information such as the date and time of the start and end of training and the training
status is reflected in a log. The user can set the name of the training run and select the data path as shown in
Figure 5 (b).
The left part of the screen provides functionality to select an input image, display the selected image, a button to
clear the selection, a checkbox to activate the image enhancement module, a button for performing multi-column
segmentation to separate the articles of a newspaper page into different images, and a Predict button for performing
OCR on the selected input image, as highlighted in Figure 5 (c). The right part of the screen displays the output
of the OCR and highlights all incorrect spellings, as shown in Figure 6 (a), so that the user can right-click on a
highlighted word to use the Spell Checker module, as shown in Figure 6 (b). The user can save the output of the OCR,
with or without using the Spell Checker module, in a text file using the Save button, as highlighted in Figure 6 (c).
Fig. 4. (a) Annotations screen; (b) Screen to select labels; (c) Sample annotations
Fig. 5. (a) Training screen; (b) Screen to set training name and data path; (c) Left side of the inference screen where image is loaded
Fig. 6. (a) Right side of the inference screen where the output is displayed; (b) Menu to select spelling suggestions; (c) Save button
The left part of the screen has filters for the keyword search, such as the date range of the newspapers and an option
to specify the newspaper, together with an output field that displays the paths of the files in which a keyword match
is found, as highlighted in Figure 7 (a). The right part of the screen displays the newspaper article in which the
keyword match is found, as highlighted in Figure 7 (b).
5. Evaluation and Discussion
We have implemented multiple approaches for this study: using YOLO for recognizing Gujarati text, which performs
character recognition and localization at inference time, and the proposed approach of first segmenting a block of
text into characters and then using these segmented character images as input for the character recognition model.
YOLO has a limitation on the number of classes, and using localization increases the inference time, so we shifted
our study toward character segmentation followed by recognition. We implemented two models for character recognition,
one a Convolutional Neural Network model and the other an EfficientNet B3 model trained with transfer learning. The
CNN model achieves a training accuracy of 79.78% and a validation accuracy of 68.17%, while the EfficientNet B3 model
achieves a training accuracy of 98.92% and a validation accuracy of 99.70%. Comparing the evaluation measures of both
character recognition models, the transfer learning model performs better. If computation time is considered as a
comparison metric, the computation time for the YOLO v4 model was high, approximately 30 seconds for an input image
with 10 lines and 108 words. The computation time for the same image with the CNN model was 22 seconds, and with the
proposed EfficientNet B3 model it was brought down to 13 seconds. The evaluation metrics for the proposed
EfficientNet B3 model are discussed below.
Figure 7 (c) is the recognized Gujarati text for the input image, i.e., Figure 2 (a). Figures 8 (a) and 9 (a) are plots
of training accuracy and validation accuracy against the number of epochs for the CNN and EfficientNet B3 models,
respectively. Initially, both accuracy and validation accuracy are very low, with validation accuracy slightly ahead
of training accuracy. As the number of epochs increases, both start rising and become almost constant after a certain
number of epochs, ending very close to one. Figures 8 (b) and 9 (b) are plots of training loss and validation loss
against the number of epochs. Both start at a high value, with validation loss below training loss, and decrease to
values very close to zero by the final epoch. The computation time is the time required by the model to produce the
output after the input image is provided. Accuracy is the capacity of an instrument to measure the precise value,
that is, how closely the measured value resembles a reference or true value, while precision is the degree to which
two or more measurements agree with one another.

Fig. 7. (a) Left side of the screen to enter keyword and apply search filters; (b) Right side of the screen to display the article; (c) Output of OCR

Fig. 8. (a) Plot of Accuracy; (b) Plot of Loss for CNN model

Fig. 9. (a) Plot of Accuracy; (b) Plot of Loss for EfficientNet B3 model
6. Conclusion
The proposed method uses a character recognition model. When an input image is received, it is recursively
broken down into lines, words and characters. The prediction model works on individual characters which, when
combined, give the final output for the image. The overall accuracy of our system therefore depends on two factors:
the segmentation of text into characters and the prediction of Gujarati letters. Our method of segmenting input
text into characters using the pixel values in every row and column of the image has proven to be accurate. For
character prediction, EfficientNet B3 had the best performance, followed by the CNN model. This text recognition
method can be further improved by using autocorrect and autocomplete mechanisms.
Acknowledgments
This study is conducted in collaboration with Vivekanand Education Society’s Institute of Technology and Tata
Institute of Fundamental Research, Mumbai. The insights provided by the faculties of both institutions have led to
the successful completion of the work.
References
[1] S. S. Magare, Y. K. Gedam, D. S. Randhave, R. R. Deshmukh, 2014, Character Recognition of Gujarati and Devanagari Script: A Review,
International Journal of Engineering Research & Technology (IJERT), Volume 03, Issue 01 (January 2014).
[2] M. J. Baheti and K. Kale, "Gujarati Numeral Recognition: Affine Invariant Moments Approach," International Journal of electronics,
Communication & Soft Computing Science & Engineering, pp. 140-146, March 2012.
[3] J. Memon, M. Sami, R. A. Khan and M. Uddin, "Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature
Review (SLR)," in IEEE Access, vol. 8, pp. 142642-142668, 2020, doi: 10.1109/ACCESS.2020.3012542.
[4] Pareek, J., Singhania, D., Kumari, R.R. and Purohit, S., 2020. Gujarati Handwritten Character Recognition from Text Images. Procedia
Computer Science, 171, pp.514-523.
[5] Audichya, Milind & Saini, Jatinderkumar. (2019). An Overview of Optical Character Recognition for Gujarati Typed and Handwritten
Characters.
[6] Ami Mehta, Ashish Gor, "Multi font Multi size Gujarati OCR with Style Identification," International Conference on Energy, Communication,
Data Analytics and Soft Computing (ICECDS 2017).
[7] Solanki, P., & Bhatt, M. (2013). Printed Gujarati Script OCR using Hopfield Neural Network. International Journal of Computer Applications,
69(13), 33–37. https://doi.org/10.5120/11905-7982
[8] Khopkar, M., 2013. OCR for Gujarati Numeral using Neural Network. IJSRD - International Journal for Scientific Research & Development,
Vol. 1, Issue 3, 2013, pp.424–427.
[9] Kumar, Anubhav & Sharma, Anuradha & Chawla, Monika & Prasad, T. (2009). Different Approaches in OCR of Indian Languages.
10.13140/RG.2.1.3243.5046.
[10] Choksi, Amit & Thakkar, Shital. (2012). Recognition of Similar appearing Gujarati Characters using Fuzzy-KNN Algorithm.
[11] S. J. Macwan and A. N. Vyas, "Classification of offline gujarati handwritten characters," 2015 International Conference on Advances in
Computing, Communications and Informatics (ICACCI), 2015, pp. 1535-1541, doi: 10.1109/ICACCI.2015.7275831.
[12] Parida, Shantipriya & Dash, Satya & Bojar, Ondřej & Motlíček, Petr & Pattnaik, Priyanka & Mallick, Debasish. (2020). OdiEnCorp 2.0: Odia-
English Parallel Corpus for Machine Translation.
[13] S. Albawi, T. A. Mohammed and S. Al-Zawi, "Understanding of a convolutional neural network," 2017 International Conference on
Engineering and Technology (ICET), 2017, pp. 1-6, doi: 10.1109/ICEngTechnol.2017.8308186.
[14] Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., … Chen, T. (2018). Recent advances in convolutional neural networks. Pattern
Recognition, 77, 354–377. doi:10.1016/j.patcog.2017.10.013.
[15] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam.
"Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
[16] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "Mobilenetv2: Inverted residuals and linear
bottlenecks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510-4520. 2018.
[17] Tan, Mingxing, and Quoc Le. "Efficientnet: Rethinking model scaling for convolutional neural networks." In International conference on
machine learning, pp. 6105-6114. PMLR, 2019.
[18] Tan, Mingxing, and Quoc Le. "Efficientnetv2: Smaller models and faster training." In International Conference on Machine Learning, pp.
10096-10106. PMLR, 2021.
[19] Dongre, Vikas J., and Vijay H. Mankar. "Devnagari document segmentation using histogram approach." arXiv preprint arXiv:1109.1247
(2011).
[20] Likforman-Sulem, L., Zahour, A. & Taconet, B. Text line segmentation of historical documents: a survey. IJDAR 9, 123–138 (2007).
doi:10.1007/s10032-006-0023-z
[21] Manivannan Arivazhagan, Harish Srinivasan, and Sargur Srihari, "A statistical approach to line segmentation in handwritten documents",
Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000T (29 January 2007); doi:10.1117/12.704538
[22] Perruchet, Pierre, and Annie Vinter. "PARSER: A model for word segmentation." Journal of memory and language 39, no. 2 (1998): 246-
263.
[23] T. M. Breuel, "Segmentation of handprinted letter strings using a dynamic programming algorithm," Proceedings of Sixth International
Conference on Document Analysis and Recognition, 2001, pp. 821-826, doi: 10.1109/ICDAR.2001.953902.
[24] Polesel, Andrea, Giovanni Ramponi, and V. John Mathews. "Image enhancement via adaptive unsharp masking." IEEE transactions on
image processing 9, no. 3 (2000): 505-510.
[25] Hummel, Robert. "Image enhancement by histogram transformation." Unknown (1975).
[26] Munteanu, Cristian, and Agostinho Rosa. "Gray-scale image enhancement as an automatic process driven by evolution." IEEE transactions
on systems, man, and cybernetics, part B (cybernetics) 34, no. 2 (2004): 1292-1298.
[27] White, James M., and Gene D. Rohrer. "Image thresholding for optical character recognition and other applications requiring character
image extraction." IBM Journal of research and development 27, no. 4 (1983): 400-411.
[28] Nagy, George, Thomas A. Nartker, and Stephen V. Rice. "Optical character recognition: An illustrated guide to the frontier." In Document
recognition and retrieval VII, vol. 3967, pp. 58-69. SPIE, 1999.
[29] Singh, Sukhpreet. "Optical character recognition techniques: a survey." Journal of emerging Trends in Computing and information Sciences
4, no. 6 (2013): 545-550.
[30] Lopresti, Daniel. "Optical character recognition errors and their effects on natural language processing." International Journal on Document
Analysis and Recognition (IJDAR) 12, no. 3 (2009): 141-151.