Available online at www.sciencedirect.com
ScienceDirect
www.elsevier.com/locate/procedia
Procedia Computer Science 152 (2019) 111–121

International Conference on Pervasive Computing Advances and Applications – PerCAA 2019


An efficient Devanagari character classification in printed and
handwritten documents using SVM

Shalini Puri a,*, Satya Prakash Singh b

a Birla Institute of Technology, Mesra, Ranchi, Jharkhand 835215, India
b Birla Institute of Technology, Mesra, Ranchi, Jharkhand 835215, India

Abstract

With the increased demand, exploration and globalization of digitized Devanagari documents, many printed and
handwritten mono-lingual character recognition techniques have evolved over the last two decades. This paper
presents an efficient Devanagari character classification model using SVM for printed and handwritten
mono-lingual Hindi, Sanskrit and Marathi documents, which first preprocesses the image, segments it through
projection profiles, removes the shirorekha, extracts features, and then classifies the shirorekha-less characters
into pre-defined character categories. The experiments performed on the proposed system obtained average
classification accuracies of 99.54% and 98.35% for printed and handwritten images, respectively, and showed
better performance than other techniques.

© 2019 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the International Conference on Pervasive
Computing Advances and Applications – PerCAA 2019.

Keywords: document analysis; shirorekha-less words; shirorekha-less characters; feature extraction; character classification; SVM

1. Introduction

Automated Devanagari character recognition is an innovative, prominent and challenging area in today's digitized
world, which has evolved through the combination of artificial intelligence, pattern recognition, machine learning
and data mining concepts. Optical Character Recognition (OCR) systems for non-Indic languages, such as English,
Chinese, Japanese, Korean and German, are already mature compared to those for Indic scripts. After slow initial
progress and long neglect, Devanagari recognition and classification systems are now receiving a good deal
of attention. Although many offline Devanagari OCR methods have been introduced in recent
years, it is still a big challenge to process Devanagari documents due to linguistic criticalities, a large character set,
complex conjuncts, the typical geometric structure of characters, the zone based form, and the use of shirorekha (top line).
In addition to these complexities, recognition of unconstrained handwritten Devanagari characters is
always found to be much harder than recognition of printed ones.

1877-0509 © 2019 The Authors. Published by Elsevier Ltd.


10.1016/j.procs.2019.05.033

Therefore, an efficient Devanagari Character Classification using Support Vector Machine (CC-SVM) method is
proposed in this paper. It preprocesses the offline scanned document images, normalizes them, segments them
using projection profiles, removes the top line, obtains Shirorekha-Less (SL) characters, extracts their features, and
finally recognizes the SL characters by using an SVM classifier. The method makes use of a pseudo-thesaurus, which
stores a set of pre-defined character classes. The segmented SL characters are matched with the shirorekha based
pre-defined characters. If an SL character matches a pre-defined character, it is categorized into that
character's category; otherwise, it is not. During the top line removal, some characters, say, ‘अं’ and ‘ण’,
are divided into two or more parts. In such cases, the parts of a character together need to be matched with only one
character class. The proposed method has been implemented on document images of three Devanagari scripted
languages: Hindi, Sanskrit and Marathi. Sample printed and handwritten mono-lingual
documents of Hindi, Sanskrit and Marathi are shown in Fig. 1 (a) to (f).
The design of the proposed system uses the concepts of In-Document and Out-Document modes of N-lingual text
document recognition and processing systems [1], where N represents mono/bi/tri/multi. These modes represent
internal and external factors of a document image, respectively. The In-Document mode covers the embedded
written contents, such as printed/handwritten type, text font size and type, text organization, structure and
format, whereas the Out-Document mode is based upon page and pen quality, resolution, skew variation, and other
surrounding external factors [2]. The In-Document and Out-Document factors are interrelated in
such a way that they generate varying types of mono-lingual images.

Fig. 1. Sample documents on (a) Hindi printed; (b) Hindi handwritten; (c) Sanskrit printed; (d) Sanskrit handwritten; (e) Marathi printed; and (f)
Marathi handwritten.

The paper is organized as follows. Section 2 includes the thematic background to cover Devanagari basic
concepts and the literature review. The literature review provides a detailed discussion of existing contributions to
character recognition methods for Hindi, Sanskrit and Marathi languages, and compares
these methods on the basis of some major parameters. Section 3 illustrates the detailed design of the proposed
character classification system. Section 4 discusses the data sets used and the results obtained. For this, many experiments
have been performed on mono-lingual printed and handwritten document images of the three languages. Along with
this, the performance of the proposed system is evaluated by comparing its results with the results of other existing
systems. Finally, the last section concludes the paper.

2. Thematic background

Originating from the Brahmi script and being the mother script of many Indian languages, Devanagari is used as a writing
and reading script across a wide belt of India. The evolution and gradual progression of
Devanagari from Brahmi is given as, Brahmi Script → Bharati Script (Bhagwat Gita) → Gupta Script → Nagari
Script → Devanagari Script [3]. Devanagari is used to write many languages of India, such as Sanskrit, Hindi,
Marathi, Rajasthani, Sindhi, Prakrit, Konkani and Nepali. Moreover, it is also used as the secondary script for
the Kashmiri and Punjabi languages. Sanskrit, Hindi and Marathi belong to the Indo-Aryan family of
languages. Sanskrit is considered the traditional language and the mother of Hindi. Hindi is mostly spoken in the
northern part of India and is an official language of India. Marathi is mostly spoken in the western
regions of India, such as Maharashtra, Goa, and Dadra and Nagar Haveli. Devanagari includes many distinctive features of
character and word formation; well-constructed grammar rules; and methods to process the linguistic structures.
The following sub-sections cover Devanagari concepts and a literature review of existing Devanagari character
recognition techniques in detail.

2.1. Devanagari concepts

The foundation of Devanagari script consists of total 13 vowels (स्वर) from ‘अ’ to ‘अः’; 33 basic consonants
(व्यंजन) from ‘क’ to ‘ह’ along with 5 additional consonants of ‘क्ष’, ‘त्र’, ‘ज्ञ’, ‘ड़’, and ‘ढ़’; and 10 numerals (अंक)
from 0 to 9. Devanagari contains a very large set of ligatures or consonant clusters (संयुक्ताक्षर), where these clusters
are constructed by combining two or more characters. Some examples of these compound characters are ‘’, ‘’,
‘’, ‘त’, ‘’, ‘व’, ‘’, ‘ध’, and a special three letter combination symbol, ‘ॐ’ (‘ओ३म्’). A word is formed with
clusters of letters, which consists of varying combinations of consonants and vowels in sequence. The use of
compound characters in a word drastically increases its processing complexity. Devanagari exhibits many unique
properties and contains many measures in its reading and writing styles, as given below.

• It is written from left to right and top to bottom, and is read in the same order.
• It does not have capital or small characters and has no spelling arrangement of letters.
• It constructs syllabic alphabets by combining syllables and characters.
• Many Devanagari symbols follow a phonetic order.
• It uses an angled sub-stroke, called ‘हलंत’ (Halant), which marks a consonant without any vowel, say, ‘ब्’,
‘’, ‘ग्’ etc. Halant is used to write half and full characters. A half character is used to form conjuncts and is
written as a full character with a halant, whereas a full character is a half character without the halant [3]. A
word ends with a full character, not with a half character, as shown in Fig. 2 (a) and 2 (b).
• A very unique feature of Devanagari is the use of a long continuous horizontal top line (shirorekha) on the
characters. When characters are joined together to form a word, the top lines of all the characters are joined
one after another, so the word gets a single long shirorekha, as shown in Fig. 2 (c).
• Shirorekha is partially absent in ‘भ’, ‘थ’, and ‘ध’ consonants and their words, as shown in Fig. 2 (d).
• The geometric structures of many vowels and consonants contain small loops and closed forms, as depicted in
Fig. 2 (e). This property exists in two vowels as closed form, and also exists in four consonant categories, such
as, loop without vertical bar, loop with vertical bar, closed letter and loop with small vertical bar.
• Graphical elements and modifiers are used left, right, below or above the consonants/vowels/both to form a word
frame [3] as shown in Fig. 2 (f).
• An independent vowel does not use consonants on its left and right sides, and is used to appear alone at the
beginning of a word or after another vowel, as depicted in Fig. 2 (g). Due to this, it is not converted into a vowel
modifier.
• Short/long vowels occur on the left/right side of the consonant, as depicted in Fig. 2 (h), and represent modifiers.
Many letters use modifiers to form a word. Other conjunct modifiers are applied on the upper and upper-right
sides.
• It exhibits vertical bar features in characters depending upon the bar's position at the right, left or middle. No
character exists with a vertical bar at the left position. All these cases are illustrated in Fig. 2 (i).
• Word formation in Devanagari includes the combinations, such as, ‘cv’, ‘ccv’, ‘vcc’ etc., with consonant, ‘c’, and
vowel ‘v’.
• The vowel duration is always more than the consonant duration in speech.

Fig. 2. (a) Half characters (column-wise); (b) full characters (column-wise); (c) word construction; (d) use of ‘भ’, ‘थ’ and ‘ध’; (e) characters with
loop properties; (f) graphical symbols, their formulation and examples; (g) use of independent vowels in words; (h) use of short, long and
conjunct vowels on left, right, below and above sides; and (i) vertical bar existence in characters.

2.2. Literature review

This section discusses existing character recognition techniques for Devanagari, especially Hindi, Sanskrit
and Marathi, through SVM, which are further compared on the basis of six major parameters. A review of feature
extraction methods for Devanagari characters and numerals [4] reported a better recognition
rate with Euclidean Distance based K-Nearest Neighbor (ED-KNN) than with SVM. To retrieve knowledge from
ancient Sanskrit handwritten manuscripts, a character detection review [5] was presented, which focused on text
straightening and aging, text line overlapping, character and line overlapping, and writing errors. It found
text line recognition without binarization to be the most effective method among all others. Similarly, a Sanskrit
handwritten word recognition method used Prewitt's operator for edge detection, Freeman Chain Code (FCC) to get
the character image boundary and feature vector, and then a Genetic Algorithm (GA) to find the non-linear
segmentation path [6]. Another handwritten Devanagari OCR [7] followed the steps of normalization based
Gradient Local Auto-Correlation (GLAC), feature extraction, Box-Cox variable transformation, feature
normalization, and Radial Basis Function (RBF) based SVM classification for the ISIDCHAR and V2DMDCHAR
databases. Another method of Devanagari compound OCR [8] included the steps of preprocessing; normalization;
global pre-classification with vertical line presence, its position and enclosed region; local pre-classification with
character end points, position, and rotation invariant features; and finally, classification through SVM and KNN
with k-fold cross validation. The method in [9] performed Devanagari overlapped and conjunct character segmentation for
handwritten expressions of Hindi dialects by using pixel cluster identification and projection profiles. It followed the
steps of scanning, black pixel count, finding inter-character gaps and mid values, checking cluster presence, and
recalculation of mid, segmented and connected characters by drawing the lines. Another overlapped and touching
character recognition method [10] was performed through normalization; binarization and thinning in preprocessing;
thresholding and blob analysis in segmentation; overlapped region detection using FCC; and finally, SVM
classification. The offline Hindi OCR survey [11] discussed recognition by using Artificial Neural Network (ANN),
Fuzzy Logic, GA, SVM, KNN, Hidden Markov Model (HMM), Bacterial Foraging (BF), Clonal Selection (CS) etc.
Further, [12] recognized offline Hindi text words by ignoring upper/lower modifiers and half characters, extracting
features, and identifying the middle zones of characters and words with a rule based classifier.
In [13], Hindi OCR performed binarization and shirorekha removal in preprocessing; feature extraction through
K-means clustering; and finally, linear kernel based SVM classification. The next Hindi OCR [14] performed
binarization, noise removal, skew detection, character segmentation, thinning, feature extraction, and finally,
classification through Rough Fuzzy Multilayer Perceptron (RFMLP), Fuzzy Support Vector Machine (FSVM),
Fuzzy Rough Support Vector Machine (FRSVM) and Fuzzy Markov Random Fields (FMRF). Devanagari OCR
method [15] used Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which extracted
chain coding features, used gradient and directional features for edge detection, reduced LDA features, and finally,
classified characters through SVM. Another Hindi OCR [16] used efficient feature extraction techniques with
multiple classifiers, whereas, offline Devanagari OCR [17] extracted Strength-, Angle- and Histogram Of Gradient
(SOG, AOG, HOG) based directional features, which classified characters by a combined classifier with 3-fold cross
validation. Then, it down sampled SOG and AOG using Gaussian filter, which was followed by HOG features.
The comparative study on handwritten Marathi OCR [18] discussed the recognition steps, which were noise
removal through morphology and thresholding; skew correction through Hough transform; segmentation through
bounding box; normalization; connected pixel based feature extraction; and finally, classification with 5-fold
validation. Further, [19] discussed an offline Marathi scanned handwritten compound OCR, and [20] recognized
multi-oriented characters through scale and rotation invariant features to get foreground and background
information, where it first partitioned characters into multi-circular zones, computed three centroids for each zone,
grouped each zone's character segments into two clusters, obtained one global centroid for each zone's components,
and finally, generated two centroids for two clusters. Next [21] presented 3 noisy handwritten document retrieval
strategies to obtain the text through bootstrap based refined OCR’ed text in first; through modified standard vector
space of raw OCR’ed text in second; and through employing robust image features for document indexing in third.
The primitive feature based Devanagari OCR [22] used features of character's presence and location, such as,
vertical lines, frequency, and location and frequency of intersections of character body with top line. Further, a
detailed survey along with the concept of Hindi printed and handwritten document classification using SVM and
fuzzy was discussed in [23], which focused on recognition and classification of character images and then the
classification of imaged documents into predefined categories. Another survey [24] discussed handwritten Hindi
images along with handwriting types, complexity evaluators and challenges. All of these Devanagari OCR
techniques are compared in Table 1 on the basis of six major parameters: script/language, feature extraction,
printed/handwritten form, focused element, classifier used, and accuracy.

Table 1. Discrimination among existing Devanagari OCR methods.

Ref. No. | Script/Language | Feature extraction technique | Printed/Handwritten | Focused element | Classifier used | Accuracy (%)
[4] | Devanagari | Zernike Moments | Printed & Handwritten | Character & numeral | SVM & ED based KNN | 90% - 98% (KNN) > 83% - 97% (SVM)
[6] | Devanagari/Sanskrit | FCC | Handwritten | Word | SVM | Promising
[7] | Devanagari | GLAC | Handwritten | Character | SVM | 93.21% (ISIDCHAR) & 95.21% (V2DMDCHAR)
[8] | Devanagari | Zernike Moments | Handwritten | Character | SVM & KNN | Basic Character: 98.37% (SVM), 95.82% (KNN), & Compound: 98.32% (SVM), 95.42% (KNN)
[9] | Devanagari/Hindi | Pixel Detection | Handwritten | Character | - | 95% - Touched & Conjunct, 88% - Overlapped
[10] | Indic | FCC | Handwritten | Character | SVM | 93%
[11] | Devanagari | - | Handwritten | Character | ANN, Fuzzy, GA, SVM, KNN, HMM, BF, CS | -
[12] | Devanagari/Hindi | Topological | Handwritten | Word | Rule Based | 88.75%
[13] | Devanagari/Hindi | K-Means Clustering | Handwritten | Character | SVM with ED | Promising
[14] | Devanagari/Hindi | Fuzzy Hough Transform | Handwritten | Character | RFMLP, FSVM, FRSVM & FMRF | 99.8% by FMRF
[15] | Devanagari | PCA & LDA | Handwritten | Character | SVM | Promising
[16] | Devanagari/Hindi | Oriented Gradient & Projection Profiles | Handwritten | Character | Quadratic SVM & Others | Promising
[17] | Devanagari | Gradient Feature Extraction | Handwritten | Character | Combined SVM & Quadratic | 95.81%
[18] | Devanagari/Marathi | Connected Pixel | Handwritten | Character | SVM & KNN | SVM better than KNN
[19] | Devanagari/Marathi | - | Handwritten | Character | - | SVM better than KNN
[20] | Devanagari & Bangla | Centroid Encoding & PCA | Printed | Character | SVM | Outperformed
[21] | Devanagari & Latin/Hindi, Sanskrit & English | Probability & Feature based Word Spotting | Handwritten | Document | - | 66.3, 71.18 & 87.88, and 69.2, 74.34 & 92.33 with & without Relevance Feedback
[22] | Devanagari/Hindi, Sanskrit & Marathi | Used Primitive Features | Handwritten | Character | Existence & Location of Features in glyph | 93.33% in 21 Fonts, & 72.72% for 22 Handwritings of 5 Characters

Table 1 shows that, in spite of many linguistic challenges, many researchers
have contributed to handwritten as well as printed Devanagari OCR. These methods have used many different
feature extraction techniques, and SVM has outperformed other classifiers in most cases and proved to be a good recognizer.
Most of the methods focused on recognition of the Devanagari basic character set, and much of the research has
addressed mono/bi-lingual Sanskrit, Hindi, or Marathi document processing. There remains a need for
recognizing and classifying the large set of basic and complex Devanagari characters. Therefore, the proposed system in this
paper is designed to classify the isolated SL characters of printed or handwritten images using SVM into the
large set of pre-defined categories with good performance.

3. CC-SVM system design

The proposed offline Devanagari Character Classification system using SVM (CC-SVM) is designed to accept
mono-lingual scanned Hindi, Sanskrit and Marathi text images. It preprocesses and segments them, extracts features, and
finally recognizes and classifies the SL characters into pre-defined character categories. The system architecture
includes training and testing phases. During the training phase, a known set of text document images is
taken and processed to get SL characters. When top lines are removed from words during
segmentation, isolated and independent SL characters are obtained. These SL characters can either be
non-modified simple characters or characters whose modifiers or parts were also separated during this process. If
an SL character is a basic simple character with no upper/right/left modifier or part, say, ‘ब’, ‘ह’, ‘प’ etc., then a
simple set of features is used to train the SVM. In other cases, the upper/right/left modifiers (‘◌ै’, ‘◌ौ’, ‘◌ा’) or parts
(‘◌ं’, ‘◌ँ’) get detached from their characters, so their features are used to train the SVM in such a way that
the SVM first calculates the minimum distance between the character and its upper/right/left modifier (or part) and then
recognizes and classifies the SL character or modified SL character into the pre-defined class. During the testing phase,
the performance of the SVM classifier is tested with an unknown set of document images.
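
The paper does not define how the minimum distance between a character and its detached modifier is measured. The following is a minimal sketch of one possible reading, assuming bounding boxes in (left, top, right, bottom) form and Euclidean distance between box centres; the function names `center` and `attach_parts` are hypothetical and not part of CC-SVM.

```python
from math import hypot

def center(box):
    """Centre (x, y) of a bounding box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def attach_parts(char_boxes, part_boxes):
    """For each detached part, return the index of the nearest character box.

    Assumption: "nearest" means the smallest Euclidean distance between
    box centres; the paper does not state its distance measure.
    """
    owners = []
    for part in part_boxes:
        px, py = center(part)
        distances = [hypot(px - cx, py - cy) for cx, cy in map(center, char_boxes)]
        owners.append(distances.index(min(distances)))
    return owners

# Two SL characters side by side and one small component above the second one.
chars = [(0, 10, 20, 40), (30, 10, 50, 40)]
parts = [(32, 0, 40, 6)]
print(attach_parts(chars, parts))
```

In this toy layout the detached component is assigned to the second character, so the pair could then be classified together as one modified SL character.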

3.1. Image preprocessing

After acquiring the scanned text document image IMi, image preprocessing is performed. At this stage, all the
variables and counters are initialized, and paths are set to read the input images and to store the output variables. All
noise and undesired outliers are removed, the skew is corrected, and the image is normalized. Further, the
image is converted from RGB to a gray level image, from which the binary image is obtained.
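
The RGB to gray to binary conversion can be sketched in plain Python as follows. This is an illustration only: the paper's implementation is in MATLAB, and the luma weights and the fixed threshold of 128 used here are assumptions, since the paper does not state its thresholding method.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to gray levels.

    Uses the widely used luma weights 0.299, 0.587 and 0.114 (an assumption).
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def to_binary(gray_image, threshold=128):
    """Mark ink pixels (darker than the threshold) as 1 and background as 0."""
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_image]

# A 2 x 2 toy image: white paper, black ink, dark red ink, dark gray ink.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(200, 30, 30), (20, 20, 20)]]
binary = to_binary(to_gray(rgb))
print(binary)
```

Representing ink as 1 makes the later projection profiles simple sums over rows and columns. Colored text on a light background, as in the dark red pixel here, still maps to ink.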

3.2. Image segmentation

The pre-processed binary image is then segmented to extract the lines, words and characters by using
projection profiles. Firstly, the image is segmented through the horizontal projection profile, so that all the lines are
detected and located in bounding boxes. This works because text lines contain a high density of black pixels
compared to the gaps between adjacent lines. After line detection, words are detected and located
in bounding boxes by using the vertical projection profile. To locate and detect all the characters and their
upper/left/right modifiers (or parts), the top lines are removed from the image word by word by using horizontal
profiling. For this, each top line is first located as the part of a word with the highest
density of black pixels, and then all of its pixels are set to zero. Further, the characters, modifiers and all other character components are bounded
in boxes, so that independent and isolated SL characters are obtained.
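
The profile based steps above can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes a binary image stored as lists of 0/1 values with ink as 1, and it treats the row or rows with the highest ink count in a word as the shirorekha. The helper names are hypothetical.

```python
def horizontal_profile(binary):
    """Number of ink pixels (1s) in every row."""
    return [sum(row) for row in binary]

def vertical_profile(binary):
    """Number of ink pixels (1s) in every column."""
    return [sum(col) for col in zip(*binary)]

def runs(profile):
    """(start, end) pairs of consecutive non-zero entries, end exclusive."""
    spans, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i
        elif not count and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def remove_shirorekha(word):
    """Zero the row(s) with the highest ink count, taken here as the top line."""
    profile = horizontal_profile(word)
    peak = max(profile)
    return [[0] * len(row) if count == peak else list(row)
            for row, count in zip(word, profile)]

word = [
    [1, 1, 1, 1, 1, 1, 1],  # shirorekha joining two characters
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
print(runs(vertical_profile(word)))                     # one joined run
print(runs(vertical_profile(remove_shirorekha(word))))  # one run per character
```

With the top line in place the vertical profile has no gaps, so the word appears as one block; after removal the column gaps separate the two characters. The same `runs` helper applied to `horizontal_profile` of a page would give the text line boundaries.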

3.3. Feature extraction and classification

During this step, the features of all SL character images are extracted. For this, the pixel values are represented in a
(rows × columns) matrix, where the number of rows is equal to the number of columns. The geometrical
structure of a character occupies a number of black pixels, which are represented by the corresponding matrix values, and
geometric features are then extracted from them. The extracted features of SL characters and SL modified characters are
used to train the SVM classifier. A pseudo thesaurus is used to store the pre-defined character
classes. Finally, the SL characters and SL characters with two or more components are matched against these categories.
In this way, they are recognized and further used to classify test documents. After SVM
training, the system is tested on a set of unknown scanned text document images and its performance is analyzed.
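
The paper does not list the exact geometric features it extracts. As a stand-in, the sketch below uses zoning, a common feature for character images: the binary character is divided into an n × n grid and the ink density of each zone becomes one feature. The function name `zone_features` and the choice of zoning are assumptions made for illustration.

```python
def zone_features(char, n=4):
    """Ink density of each zone in an n x n grid laid over a binary character.

    `char` is a list of rows of 0/1 values (1 = ink). The result is a flat
    list of n * n densities in row-major order.
    """
    height, width = len(char), len(char[0])
    features = []
    for zr in range(n):
        r0, r1 = zr * height // n, (zr + 1) * height // n
        for zc in range(n):
            c0, c1 = zc * width // n, (zc + 1) * width // n
            cells = [char[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cells) / len(cells) if cells else 0.0)
    return features

char = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
print(zone_features(char, n=2))
```

A fixed grid gives every character a feature vector of the same length regardless of its bounding box size, which is what a classifier such as an SVM needs as input.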

4. Experimental results

The proposed CC-SVM system has been implemented in MATLAB 2013a with 300 DPI resolution for mono-lingual
printed and handwritten Hindi, Sanskrit and Marathi documents. CC-SVM has been tested on a set of
scanned document images and has obtained very promising results. This section elaborates the data sets used, the
obtained results for document segmentation and character classification, and the evaluation of system performance.

4.1. Data sets

For this implementation, scanned versions of printed documents have been taken, which were collected from
different web sites of Government portals, news articles, and traditional shlokas. The handwritten documents were
written by 2 writers in all 3 languages. A total of 60 documents are considered, of which 60% are used for
training and 40% for testing. These documents are evenly distributed among the six document
types: printed Hindi, printed Sanskrit, printed Marathi, handwritten Hindi, handwritten Sanskrit and
handwritten Marathi. Further, the system is also designed to process images that have a colored background,
colored text, multi-colored background and text, bold text, and 12 – 18 pt. font sizes (printed). Some rules have been
followed for handwritten documents: the shirorekha must be approximately straight, there must be no character
or shirorekha overlapping, and characters must not touch each other except where the language's writing
rules require it.

4.2. Results

The proposed CC-SVM model has been designed for SL character recognition and classification from Hindi,
Sanskrit and Marathi document images. The strength of this model lies in its ability to classify the extracted simple
SL characters and SL characters along with their upper/right/left components into the pre-defined shirorekha based
character categories. It has been found that most of the existing algorithms worked upon shirorekha based characters
instead of SL characters. The second enhancement is that the proposed algorithm extracts complex words and
characters directly from the scanned text document images, which widens the range of printed
and handwritten document images that can be used, whereas most of the existing algorithms have worked upon
simple character images and their recognition only rather than taking text document images as input. Thirdly,
the proposed system performs both character recognition and classification, another enhancement on existing
technologies. SVM is a very useful supervised method for classification of patterns and characters in image
processing. It is generally defined for a 2-class problem, and uses an optimal hyper-plane to maximize the distance
between the closest samples of the 2 classes. It has been successfully applied in a large range of applications and
can be extended to many non-linear variants. In the proposed system, the Gaussian kernel based SVM is used. The
kernel trick lets the SVM separate characters with a hyper-plane even when they are not linearly separable in the original
feature space, which gives better accuracy. Here the SVM technique is used to find the minimum distance between upper/right/left parts and their
characters, from which they were separated during segmentation and shirorekha removal. So, the character forms with
upper/right/left components are also classified efficiently by the system; for example, ‘ङ’ and ‘ड’ can be
discriminated. Here segmentation has been performed very efficiently without any errors.
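
For reference, the standard textbook form of a two-class SVM decision function with a Gaussian kernel is given below. The symbols (support vectors x_i with labels y_i in {-1, +1}, multipliers α_i, bias b, and kernel width σ) are notation introduced here, not taken from the paper, which reports neither its kernel width nor how the two-class machine is extended to the multi-class case.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
K(\mathbf{x}, \mathbf{z}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{z} \rVert^{2}}{2\sigma^{2}} \right)
```

The kernel value depends only on the distance between two feature vectors, so samples close to a support vector contribute strongly to the decision and distant ones contribute almost nothing.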
A pseudo thesaurus of total 51 classes has been maintained for the system, where these classes are considered as
the pre-defined classes. It includes 13 classes of vowels, such as, ‘अ’, ‘आ’, ‘इ’, ‘ई’, ‘उ’, ‘ऊ’, ‘ॠ’, ‘ए’, ‘ऐ’, ‘ओ’,
‘औ’, ‘अं’, and ‘अः’, and 38 classes of consonants, such as, ‘क’, ‘ख’, ‘ग’, ‘घ’, ‘ङ’, ‘च’, ‘छ’, ‘ज’, ‘झ’, ‘ञ’, ‘ट’, ‘ठ’,
‘ड’, ‘ढ’, ‘ण’, ‘त’, ‘थ’, ‘द’, ‘ध’, ‘न’, ‘प’, ‘फ’, ‘ब’, ‘भ’, ‘म’, ‘य’, ‘र’, ‘ल’, ‘व’, ‘श’, ‘ष’, ‘स’, ‘ह’, ‘’, ‘’, ‘’, ‘ड़’
and ‘ढ़’. The character classification accuracy is calculated using eq. (1).

Accuracy (%) = (Total Number of Characters Classified Correctly / Total Number of Characters) × 100 (1)
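The accuracy measure of eq. (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `classification_accuracy` and the five example labels are hypothetical, and the result is scaled by 100 so that it is expressed as a percentage.

```python
def classification_accuracy(predicted, actual):
    """Percentage of characters classified correctly, following eq. (1)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one character is required")
    # Count the positions where the predicted class equals the true class.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical labels for five segmented characters (not data from the paper).
actual = ["क", "ख", "ग", "ङ", "ड"]
predicted = ["क", "ख", "ग", "ड", "ड"]

print(f"{classification_accuracy(predicted, actual):.2f}%")  # prints 80.00%
```

In this example, four of the five characters are classified correctly, so the accuracy is 80%.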

The observations in Table 2 show the segmentation and classification accuracy results achieved for mono-lingual
documents of all 3 languages. The average correctness of document segmentation into lines and of line segmentation
into words was 100% for both printed and handwritten documents, and word segmentation into characters with
shirorekha removal was also 100% for both the printed and the handwritten datasets. The 100% segmentation
accuracies in printed images indicate that all the characters were accurately separated from their
upper/right/left modifiers. The classification accuracies are also high. The accuracy obtained for
Sanskrit printed images is 98.77%, which is the lowest among the printed image results. This is because the
Sanskrit language, especially in shlokas, contains a high number of varying conjuncts and their
combinations. Similarly, Sanskrit handwritten characters were classified with 97.22% accuracy, which is the
lowest of all the accuracies. The average classification accuracy was 99.54% for printed documents and 98.35% for
handwritten documents.

Table 2. Segmentation and SL classification accuracy results for printed and handwritten Hindi, Sanskrit and Marathi documents.

Document type   Segmentation accuracy (%)     SL classification accuracy (%)
                Printed      Handwritten      Printed      Handwritten
Hindi           100          100              100          99.23
Sanskrit        100          100              98.77        97.22
Marathi         100          100              99.86        98.61
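The average classification accuracies quoted in the text (99.54% for printed and 98.35% for handwritten documents) can be reproduced from the per-language values in Table 2. The sketch below assumes an unweighted mean over the three languages, which is consistent with the reported figures but is not stated explicitly in the paper; the names `table2` and `average_accuracy` are illustrative.

```python
from statistics import mean

# SL classification accuracies (%) copied from Table 2.
table2 = {
    "Hindi": {"printed": 100.00, "handwritten": 99.23},
    "Sanskrit": {"printed": 98.77, "handwritten": 97.22},
    "Marathi": {"printed": 99.86, "handwritten": 98.61},
}


def average_accuracy(results, kind):
    """Unweighted mean accuracy over all languages, rounded to 2 decimals."""
    return round(mean(row[kind] for row in results.values()), 2)


print(average_accuracy(table2, "printed"))      # prints 99.54
print(average_accuracy(table2, "handwritten"))  # prints 98.35
```

If the three datasets differ in size, a mean weighted by the number of characters would generally give a slightly different value.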

CC-SVM does not produce good results for italic text recognition in printed documents. Additionally, the system
does not work well for documents with very small font sizes, say 6 pt. Characters with bottom zone modifiers,
such as ‘कु’ and ‘कू’, were also challenging, so another limitation of the system is that it is suited only to
documents that have a minimal number of lower zone modifiers.

4.3. Evaluating CC-SVM performance

Very promising classification results have been obtained with CC-SVM. The classification results of the proposed
system have been compared with the accuracy results of other existing recognition methods. In Section 2, many
existing recognition techniques for the Devanagari script and for the Hindi, Sanskrit and Marathi languages
were discussed and compared with each other on several parameters. Table 1 shows that 88.89% of the surveyed
contributions address handwritten document recognition, compared with 5.55% for printed images
and 5.55% for printed-handwritten document images. Overall, 83.33% of the contributions address character
identification, 11.11% word identification and 5.56% document identification. SVM was the preferred classifier,
used in 72.22% of the contributions. Fig. 3 depicts the result comparison between the proposed CC-SVM method and
existing Devanagari OCR techniques.

Fig. 3. Result comparison of proposed method with existing Devanagari OCR techniques.

In Fig. 3, the accuracy of the proposed system is compared with the results of SVM and KNN [4], the ISIDCHAR and
V2DMDCHAR datasets [7], SVM and KNN for basic and compound characters [8], overlapped, touched
and compound characters [9], SVM [10], combined SVM and quadratic classifier [17], and feature based classification
for 5 characters and 21 fonts [22]. Fig. 4 depicts the comparative analysis of the proposed method with
existing OCR techniques for handwritten Hindi and Sanskrit documents. Here the accuracy of the proposed system is
compared with the results of Hindi OCR using a rule based classifier [12], Hindi OCR using RFMLP, FSVM,
FRSVM and FMRF [14], and Hindi and Sanskrit OCR without relevance feedback [21]. The proposed method
achieves better results than the other existing techniques.

Fig. 4. Result comparison of proposed method with existing techniques for handwritten Hindi and Sanskrit documents.

5. Conclusion

In this paper, an efficient Devanagari character classification system using SVM was proposed, which recognizes
and classifies SL characters and modified SL characters in scanned mono-lingual printed and handwritten Hindi,
Sanskrit and Marathi document images. The system was implemented on diverse datasets of the three languages and
produced more promising results for Hindi and Marathi than for Sanskrit. The misclassification of
Sanskrit characters is attributed to their typical geometrical structure and the use of many complex conjuncts.
The system was further compared with many other existing Devanagari OCR based techniques. These techniques were also
elaborated and compared with each other on different parameters.
In future, the proposed system will be improved to handle a larger set of printed and handwritten document images.
Another possible extension is the identification and categorization of complex conjuncts. The proposed system can
be extended in the following directions:

• Recognition and classification of modified characters and half characters, such as ‘क’, ‘क’, ‘को’, ‘का’, ‘ल्’ etc.
• Recognition and classification of image words.
• Improving the system for multi-fonts and italics text.
• Extending the work for imaged document classification.
• Making the proposed system generic for the inclusion of other Indic and non-Indic scripts.

References

[1] Puri, S., and Singh, S. P. (2016) “A technical study and analysis of text classification techniques in N - Lingual documents”, in International
Conference on Computer Communication and Informatics, IEEE Press, pp. 1–6.
[2] Puri, S., and Singh, S. P. (2016) “Text recognition in bilingual machine printed image documents - challenges and survey”, in Tenth
International Conference on Intelligent Systems and Control, IEEE Press, pp. 1–8.
[3] D'source. History of Devanagari Letterforms. Retrieved on September 10, 2018 from
http://www.dsource.in/resource/history-devanagari-letterforms/characteristics-script
[4] Holambe, A. K. N., Thool, R. C., and Jagade, S. M. (2012) “A brief review and survey of feature extraction methods for Devnagari OCR”, in
Ninth International Conference on ICT and Knowledge Engineering, IEEE Press, pp. 99–104.
[5] Shah, K. R., and Badgujar, D. D. (2013) “Devnagari handwritten character recognition (DHCR) for ancient documents: a review”, in
Conference on Information & Communication Technologies, IEEE Press, pp. 656–660.
[6] Dwivedi, N., Srivastava, K., and Arya, N. (2013) “Sanskrit word recognition using Prewitt's operator and support vector classification”, in
International Conference on Emerging Trends in Computing, Communication and Nanotechnology, IEEE Press, pp. 265–269.
[7] Jangid, M., and Srivastava, S. (2014) “Gradient local auto-correlation for handwritten Devanagari character recognition”, in International
Conference on High Performance Computing and Applications, IEEE Press, pp. 1–5.
[8] Kale, K. V., Deshmukh, P. D., Chavan, S. V., Kazi, M. M., and Rode, Y. S. (2013) “Zernike moment feature extraction for handwritten
Devanagari compound character recognition”, in Science and Information Conference, IEEE Press, pp. 459–466.
[9] Thakral, B., and Kumar, M. (2014) “Devanagari handwritten text segmentation for overlapping and conjunct characters - a proficient
technique”, in Proceedings of Third International Conference on Reliability, Infocom Technologies and Optimization, IEEE Press, pp. 1–4.
[10] Chame, S. D., and Kumar, A. (2016) “Overlapped character recognition: an innovative approach”, in Sixth International Conference on
Advanced Computing, IEEE Press, pp. 464–469.
[11] Indian, A., and Bhatia, K. (2017) “A survey of offline handwritten Hindi character recognition”, in Third International Conference on
Advances in Computing, Communication & Automation, IEEE Press, pp. 1–6.
[12] Garg, N. K., Kaur, L., and Jindal, M. (2015) “Recognition of offline handwritten Hindi text using middle zone of the words”, in Fourteenth
International Conference on Computer and Information Science, IEEE Press, pp. 325–328.
[13] Gaur, A., and Yadav, S. (2015) “Handwritten Hindi character recognition using K-means clustering and SVM”, in Fourth International
Symposium on Emerging Trends and Technologies in Libraries and Information Services, IEEE Press, pp. 65–70.
[14] Chaudhuri, A., Mandaviya, K., Badelia, P., and Ghosh, S. K. (2017) “Optical character recognition systems for Hindi language”, in Optical
Character Recognition Systems for Different Languages with Soft Computing: Studies in Fuzziness and Soft Computing, vol. 352, Springer,
Cham, pp. 193–216.
[15] Shitole, S., and Jadhav, S. (2018) “Recognition of handwritten Devanagari characters using linear discriminant analysis”, in Second
International Conference on Inventive Systems and Control, IEEE Press, pp. 100–103.
[16] Yadav, M., and Purwar, R. (2017) “Hindi handwritten character recognition using multiple classifiers”, in Seventh International Conference
on Cloud Computing, Data Science & Engineering – Confluence, IEEE Press, pp. 149–154.
[17] Bhalerao, M., Bonde, S., Nandedkar, A., and Pilawan, S. (2018) “Combined classifier approach for offline handwritten Devanagari character
recognition using multiple features”, in Hemanth D. and Smys S. (eds) Computational Vision and Bio Inspired Computing: Lecture Notes in
Computational Vision and Biomechanics, vol. 28, Springer, Cham, pp. 45–54.
[18] Kamble, P. M., and Hegadi, R. S. (2016) “Comparative study of Handwritten Marathi characters recognition based on KNN and SVM
classifier”, in Santosh K., Hangarge M., Bevilacqua V., and Negi A. (eds) International Conference on Recent Trends in Image Processing
and Pattern Recognition: Communications in Computer and Information Science, vol. 709, Springer, Singapore, pp. 93–101.
[19] Bhandare, M. S., and Kakade, A. S. (2015) “Handwritten (Marathi) compound character recognition”, in International Conference on
Innovations in Information, Embedded and Communication Systems, IEEE Press, pp. 1–4.
[20] Tripathy, N., Chakraborti, T., Nasipuri, M., and Pal, U. (2016) “A scale and rotation invariant scheme for multi-oriented character
recognition”, in 23rd International Conference on Pattern Recognition, IEEE Press, pp. 4041–4046.
[21] Govindaraju, V., Cao, H., and Bhardwaj, A. (2009) “Handwritten document retrieval strategies”, in Proceedings of the Third Workshop on
Analytics for Noisy Unstructured Text Data, ACM New York, USA, pp. 3–7.
[22] Sharma, R., and Mudgal, T. (2019) “Primitive feature-based optical character recognition of the Devanagari script”, in Panigrahi C., Pujari
A., Misra S., Pati B., and Li KC. (eds) Progress in Advanced Computing and Intelligent Engineering: Advances in Intelligent Systems and
Computing, vol. 714, Springer, Singapore, pp. 249–259.
[23] Puri, S., and Singh, S. P. (2018) “Hindi text document classification system using SVM and fuzzy – a survey”, International Journal of
Rough Sets and Data Analysis 5 (4): 1–31.
[24] Puri, S., and Singh, S. P. (2019) “Toward recognition and classification of Hindi handwritten document image”, in Hu YC., Tiwari S.,
Mishra K., and Trivedi M. (eds) Ambient Communications and Computer Systems: Advances in Intelligent Systems and Computing, vol. 904,
Springer, Singapore, pp. 497–507.
