Nepali Ocr Using Hybrid Approach of Recognition: Nirajan Pant
NEPALI OCR USING HYBRID APPROACH
OF RECOGNITION
By
NIRAJAN PANT
Master of Technology in Information Technology, Kathmandu University, 2016
A Thesis
Submitted to the
Department of Computer Science and Engineering
Kathmandu University
July 2016
DECLARATION OF ORIGINALITY
Being a student, I understand that I have an ethical and moral obligation to ensure that the
dissertation I have submitted to Kathmandu University is my own, original and free of
plagiarism. All sources are properly acknowledged, and exact words are quoted or paraphrased
with appropriate references throughout the dissertation. Hence I am fully satisfied that the work
I am submitting to the Department of Computer Science and Engineering, Kathmandu
University is my own original research.
_______________
Nirajan Pant
Candidate
THESIS EVALUATION
This thesis, submitted by Nirajan Pant in partial fulfillment of the requirements for the Degree
of Master of Technology in Information Technology from the Kathmandu University, has
been read by the faculty Advisory Committee under whom the work has been done and is
hereby approved.
____________________
Dr. Bal Krishna Bal
(Supervisor)
Assistant Professor
Department of Computer Science and Engineering, Kathmandu University
_____________________
Suresh K. Regmi
(External Examiner)
Managing Director
Professional Computer System (P) Ltd.
____________________
Dr. Manish Pokharel
Head of Department
Department of Computer Science and Engineering, Kathmandu University
This thesis is being submitted by the appointed advisory committee as having met all of the
requirements of the School of Engineering at the Kathmandu University and is hereby
approved.
________________________________
Prof. Dr. Bhupendra Bimal Chhetri
Dean
School of Engineering
Kathmandu University
Date:
PERMISSION
Title: Nepali OCR Using Hybrid Approach of Recognition
Department: Computer Science and Engineering
Degree: Master of Technology in Information Technology
In presenting this thesis in partial fulfillment of the requirements for a graduate degree from
Kathmandu University, I agree that the library of this University shall make it freely available
for inspection. I further agree that permission for extensive copying for scholarly purposes may
be granted by the supervisor who supervised my thesis work or, in his (or her) absence, by the
Head of the Department. It is understood that any other use of this thesis or part thereof for
financial gain shall not be allowed without my written permission. It is also understood that due
recognition shall be given to me and to Kathmandu University in any scholarly use which may
be made of any material in my thesis.
________________
Nirajan Pant
Date:
ACKNOWLEDGEMENTS
I express my sincere gratitude to Dr. Bal Krishna Bal for supervising this thesis. I will always
be indebted to him for his continued motivation, suggestions and involvement, which
contributed significantly to the completion of this thesis.
I am thankful to Madan Puraskar Pustakalaya (MPP), Lalitpur, Nepal, which provided the
Nepali text image data for this thesis work.
Nirajan Pant
ABSTRACT
Nepali, which is an Indo-Aryan language written in the Devanagari Script, is the most widely
spoken language in Nepal with more than 35 million speakers. It is also spoken in many areas
of India, Bhutan, and Myanmar. The Optical Character Recognition (OCR) systems developed
so far for the Nepali language have a very poor recognition rate. The Devanagari script has some
special features, like the ‘dika’ and the rules for joining vowel modifiers, which make it different
from the Latin script, where every character in a word is written separately. One of the major
reasons for the poor recognition rate is errors in character segmentation. The presence of
conjuncts, compound characters and touching characters in scanned documents complicates the
segmentation process, creating major problems when designing an effective character
segmentation technique. Thus, the aim of this work is to reduce the scope of the segmentation
task so that segmentation errors can be minimized.
In this work, I have proposed a hybrid OCR system for printed Nepali text using the Random
Forest (RF) algorithm. It incorporates two different techniques of OCR – firstly, the Holistic
approach and secondly, the Character Level Recognition approach. The system first tries to
recognize a word as a whole and if it is not confident about the word, the character level
recognition is performed. Histogram of Oriented Gradients (HOG) descriptors are used to define
a feature vector of a word or character. Recognition rates of 78.87% and 94.80% are achieved
for the character level recognition approach and the hybrid approach, respectively.
Contents
ACKNOWLEDGEMENTS ..................................................................................................... IV
ABSTRACT.............................................................................................................................. V
List of Figures ....................................................................................................................... VIII
List of Tables ........................................................................................................................... IX
List of Abbreviations ................................................................................................................ X
CHAPTER I INTRODUCTION ............................................................................................... 1
1.1 Optical Character Recognition ................................................................................... 1
1.1.1 General OCR Architecture ................................................................................. 2
1.1.2 Uses and Current Limitations of OCR ............................................................... 5
1.2 Devanagari Script....................................................................................................... 6
1.3 Problem Definition................................................................................................... 10
1.4 Motivation ................................................................................................................ 11
1.5 Research Questions .................................................................................................. 12
1.6 Objectives ................................................................................................................ 12
1.7 Organization of Document ....................................................................................... 13
CHAPTER II LITERATURE REVIEW.................................................................................. 14
2.1 Different Models of Character Segmentation in OCR Systems ............................... 14
2.1.1 Dissection Techniques ..................................................................................... 15
2.1.2 Recognition Driven Segmentation ................................................................... 16
2.1.3 Holistic Technique ........................................................................................... 17
2.2 Segmentation Challenges in Devanagari OCR ........................................................ 17
2.2.1 Over Segmentation of Basic Characters .......................................................... 18
2.2.2 Handling vowel modifiers and Diacritics ........................................................ 18
2.2.3 Handling Compound characters and Ligatures ................................................ 19
2.3 Related work ............................................................................................................ 20
2.3.1 Segmentation.................................................................................................... 20
2.3.2 Recognition ...................................................................................................... 24
2.4 OCR Tools Developed for Devanagari .................................................................... 26
CHAPTER III METHODOLOGY .......................................................................................... 30
3.1 Training:................................................................................................................... 31
3.1.1 Dataset Generation: .......................................................................................... 31
3.1.2 Feature Extraction: ........................................................................................... 33
3.2 Recognition: ............................................................................................................. 33
3.2.1 Line and Word Segmentation .......................................................................... 34
3.2.2 Character Segmentation: .................................................................................. 35
3.2.3 Classifier Tool .................................................................................................. 36
3.2.4 Confidence and Threshold: .............................................................................. 40
CHAPTER IV RESULTS AND DISCUSSION ...................................................................... 42
4.1 Experimental Setup .................................................................................................. 42
4.2 Segmentation Results ............................................................................................... 42
4.3 Recognition Results ................................................................................................. 43
4.4 Computational Cost ................................................................................................. 44
CHAPTER V CONCLUSION AND FUTURE WORK ......................................................... 48
References ................................................................................................................................ 50
APPENDIX I Snapshots ........................................................................................................... A
APPENDIX II Word Recognition Data Sample ........................................................................E
List of Figures
Figure 1 General OCR Architecture .......................................................................................... 2
List of Tables
Table 1 Vowels and Corresponding Modifiers .......................................................................... 8
List of Abbreviations
ASCII – American Standard Code for Information Interchange
PP – Projection Profile
RF – Random Forest
CHAPTER I
INTRODUCTION
This thesis is about improving the performance of Nepali OCR by proper handling of
segmentation problems prevalent in the Nepali language. The assumption made is: “The
performance of Nepali OCR can be improved by using the Hybrid recognition approach”.
Based on this assumption, a Nepali language specific OCR model has been developed. The
concepts of OCR and its general architecture, the Devanagari script for the Nepali language from
the point of view of OCR, and the uses and limitations of OCR are discussed in this chapter. This
chapter includes a basic introduction of the thesis, which covers problem definition, motivation,
research questions and objectives, and a basic overview of terms and terminologies that are
used in this document.

1.1 Optical Character Recognition

Optical Character Recognition (OCR) is the conversion of images of typed, printed or
handwritten documents into computer readable text. OCR enables conversion of
texts in image data into textual data and facilitates editing, searching, republishing without
retyping the whole document. Any written or printed document, if it is to be replicated digitally,
must be retyped exactly, preserving the spellings, words, font-style, and font-size that the
document contains. Also, typing an entire document in order to replicate it is extremely time
consuming. In order to overcome the above
mentioned issues, an OCR system is needed. Documents containing character images can be
scanned, and the recognition engine of the OCR system then interprets the images and turns
images of printed or handwritten characters into machine-readable characters (e.g. ASCII or
Unicode). Therefore, OCR allows users to quickly automate data capture from image
documents, eliminating keystrokes to reduce typing costs while still maintaining a high level
of accuracy.
1.1.1 General OCR Architecture
When an OCR system recognizes text, the program first analyzes the structure of the document
image. It divides the page into elements such as blocks of text, tables, images, etc. The lines
are divided into words and then into characters. Once the characters have been singled out, the
program compares them with a set of pattern images.
The process of character recognition consists of a series of stages, with each stage passing its
results on to the next in a pipeline fashion. There is no feedback loop that would permit an
earlier stage to make use of knowledge gained at a later point in the process (Casey & Lecolinet,
1996). The recognition process can be divided into three major steps: Preprocessing,
Recognition (Feature Extraction) and Post Processing (Optical character recognition, 2015).
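The stage-wise, feedback-free flow described above can be sketched as a simple function composition. This is an illustrative skeleton only; the stage functions are hypothetical stand-ins, not the actual components of this thesis's system:

```python
def ocr_pipeline(image, preprocess, recognize, postprocess):
    """Run the three OCR stages in a fixed order; each stage passes
    its result to the next, with no feedback to earlier stages."""
    cleaned = preprocess(image)    # e.g. de-skew, binarize, segment
    raw_text = recognize(cleaned)  # classify characters or words
    return postprocess(raw_text)   # e.g. lexicon-based correction
```

With toy string stand-ins such as `ocr_pipeline("  hi ", str.strip, str.upper, lambda t: t + "!")`, the point is simply that each stage sees only its predecessor's output.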
Pre-processing
OCR software loads the image and performs pre-processing to increase the recognition
accuracy. Most of the OCRs expect some pre-defined formats of input image such as font-size
ranges, foreground, background, image format, and color format. The pre-processing steps
often performed in OCR are: i) Binarization ii) Morphological Operations and iii) Segmentation
(Hansen, 2002). Binarization is the process of converting an image to a bi-tonal image; most
OCRs work on bi-tonal images. Morphological operations are used in pre- or post-processing
(filtering, thinning, and pruning). They may be applied in degraded documents to increase the
performance of OCR.
- De-skewing
- Binarization
- Segmentation: multiple characters that are connected due to image artifacts must be
separated; single characters that are broken into multiple pieces due to artifacts must be
connected. Usually, in every OCR system, the recognition is performed at the character
level, so segmentation is a basic and essential step.
- Normalization
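As an illustration of the binarization step, the following is a minimal sketch of Otsu's global thresholding in pure NumPy. Real OCR pre-processing typically uses a library implementation (e.g. OpenCV), and this sketch assumes an 8-bit grayscale page image with dark ink on a light background:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the threshold maximizing between-class variance
    over a 0-255 grayscale image (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0       # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    """Convert a grayscale page image to a bi-tonal image:
    True = foreground (ink), False = background (paper)."""
    return gray < otsu_threshold(gray)
```

A subsequent morphological opening or closing (for instance via `scipy.ndimage.binary_opening`) could then remove speckle noise, matching the morphological-operations step above.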
Character recognition
The recognition algorithm is the brain of the OCR system. After successful pre-processing of
the input image document, the OCR algorithm can start recognizing characters and translating
them into character codes (ASCII/Unicode). Creating a one hundred percent accurate algorithm
is probably impossible when a lot of noise and different font styles are present.
Learning - The recognition algorithm relies on a set of learned characters and their
properties. It compares the characters in the scanned image file to the characters in this
learned set.
Comparison - The properties of the extracted characters are compared with those of the
learned characters to choose the best match.
There are two basic types of core OCR algorithm – matrix matching and feature extraction
(Optical Character Recognition, 2015). Matrix matching, also known as “pattern matching” or
“image correlation”, compares an image to a stored glyph on a pixel-by-pixel basis.
This relies on the input glyph being correctly isolated from the rest of the image, and on the
stored glyph being in a similar font and at the same scale. This technique works best with
typewritten text and does not work well when new fonts are encountered. Feature extraction
decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections.
These are compared with an abstract vector-like representation of a character, which might
reduce to one or more glyph prototypes. General techniques of feature detection in computer
vision are applicable to this type of OCR, which is commonly seen in most modern OCR
software. Classifier algorithms are used to compare image features with stored glyph features
and choose the nearest match. Most modern Omnifont OCR programs (ones that can recognize
printed text in any font) are based on feature extraction.
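Since this thesis later uses Histogram of Oriented Gradients (HOG) descriptors, a simplified sketch of the underlying idea is shown below. This is not the exact descriptor used in the system (it omits block normalization and bin interpolation, and the cell size of 8 is an arbitrary illustrative choice):

```python
import numpy as np

def hog_like_features(img, n_bins=9, cell=8):
    """Simplified HOG-style descriptor: per-cell histogram of
    gradient orientations, weighted by gradient magnitude."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # central differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))  # L2-normalize
    return np.concatenate(feats)
```

For a 16x16 glyph image this yields 4 cells of 9 bins each, i.e. a 36-dimensional feature vector that a classifier can compare against stored glyph features.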
Post-processing
This step can help to improve recognition quality; sometimes OCR can output a wrong character
code, and in such cases dictionary support can help to make the decision. OCR accuracy can also
be increased if the output is constrained by a lexicon – a list of words that are allowed to occur
in a document. With dictionary support, the program ensures even more accurate analysis and
recognition.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR
systems can preserve the original layout of the page and produce, for example, an
annotated PDF that includes both the original image of the page and a searchable textual
representation.
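A minimal sketch of lexicon-constrained post-processing: replace each output word by its nearest dictionary entry when the edit distance is small. The `max_dist` threshold here is an illustrative choice, not a value from this thesis:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def lexicon_correct(word, lexicon, max_dist=2):
    """Replace an OCR output word by the nearest lexicon entry,
    if one lies within max_dist edits; otherwise keep it as-is."""
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word
```

For example, the misrecognized word "nepsli" would be corrected to "nepali" against a lexicon containing it, while a word far from every lexicon entry is left unchanged.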
The exact mechanisms that allow humans to recognize objects are yet to be understood, but the
three basic principles are already well known by scientists – integrity, purposefulness and
adaptability (IPA). The most advanced optical character recognition systems are focused on
replicating natural or “animal like” recognition. At the heart of these systems lie these three
principles. The principle of integrity says that the observed object must always be considered
as a “whole” consisting of many
interrelated parts. The principle of purposefulness supposes that any interpretation of data must
always serve some purpose. And the principle of adaptability means that the program must be
capable of self-learning. These principles endow the program with maximum flexibility and
intelligence.

1.1.2 Uses and Current Limitations of OCR

OCR is widely used to recognize and search text from electronic documents or to publish the
text on a website ( Singh, Bacchuwar, & Bhasin , 2012). It has enabled scanned documents to
become more than just image files, turning into fully searchable documents with text content
that is recognized by computers. OCR is a vast field with a number of varied applications such
as invoice imaging, legal industry, banking, health care industry etc. It is widely being used in
digital libraries for searching scanned books and magazines (e.g. Google books), data entry
such as bill payment, passport processing, text-to-speech synthesis, machine translation, and
text mining.
Optical character recognition has been applied to a number of applications. Some of them are
listed below:
- Handwriting Recognition
OCR has simplified the data collection and analysis process. With its continuous advancement,
more and more applications powered by OCR are being developed in various fields.
- The latest software can recreate tables and the original layout
Although an OCR system has a lot of advantages, it also has many limitations. Some of the
limitations are:
- Limited Documents: It does not perform well with documents containing both
- Accuracy: The accuracy depends upon the quality and type of document, including
the font used. Errors that occur during OCR include misreading letters, skipping
over letters that are unreadable, or mixing together text from adjacent columns or
image captions.
- Additional Work: OCR is not error-proof; it also makes mistakes. A person
has to manually compare the original image document and the recognized text for
corrections.
- Not worth doing for small amounts of text: OCR involves a long setup and
correction process, so it may not be feasible or worthwhile for small amounts of
documents.
1.2 Devanagari Script

Many languages including Sanskrit, Nepali, Hindi, Marathi, Bihari, Bhojpuri, Maithili, and
Newari are written in Devanagari, and over 500 million people use it. Devanagari is a syllabic-
alphabetic script with a set of basic symbols - consonants, half-consonants, vowels, vowel-
modifiers, digits and special diacritic marks (Kompalli, Setlur, & Govindaraju, 2006)
(Kompalli, Setlur, & Govindaraju, 2009). The script has its own specified composition rules for
combining vowels, consonants and modifiers. Modifiers are attached to the top, bottom, left or
right side of other characters. All characters of a word are stuck together by a horizontal line,
called dika, which runs at the top of core characters (Khedekar, Ramanaprasad, Setlur, &
Govindaraju, 2003). A Devanagari character may be formed by combining one or more
alphabets; these are referred to as composite characters or conjuncts. For example, the
half-consonant ka (क्) and the consonant ya (य) combine to produce the conjunct character
kya (क्य). Consonant-modifier and conjunct-modifier characters are produced by combining
consonants and conjuncts with vowel modifiers (e.g. क् + ा → का, क्य + ा → क्या). This
combination of alphabets contrasts with
Latin in which the number of characters is fixed. A horizontal header line (dika) runs across the
top of the characters in a word, and the characters span three distinct zones (Figure 2); an
ascender zone above the Dika, the core zone just below the Dika, and a descender zone below
the baseline of the core zone. Symbols written above or below the core will be referred to as
ascenders and descenders respectively. Half consonants followed by a consonant and a vowel
modifier will be referred to as a conjunct.
Nepali, originally known as Khas Kurā, is an Indo-Aryan language with around 17 million
speakers in Nepal, India, Bhutan, and Burma. Nepali is written in Devanagari, which developed
from the Brahmi script in the 11th century AD. Nepali started to be written in this script
from the 12th century AD1. In Nepali, there are 13 vowels (swaravarna), 36 consonants
(vyanjanvarna) (33 pure consonants and 3 composite consonants), 10 numerals, and half-letters.
When vowels come together with consonants, they are written above, below, before or
after the consonant they belong to using special diacritical marks. When vowels are written in
this way they are known as modifiers. In addition, consonants occur together in clusters, often
1 http://www.omniglot.com/writing/nepali.htm
called conjunct consonants. Altogether, there are more than 500 different characters (K.C. &
It is written and read from left to right in a horizontal line. Many languages in India use different
variants of this script. The Nepali language uses a subset of characters from the Devanagari
script set for written purposes.

[Figure 2: Parts of a Devanagari word - the dika (header line), the ascender zone, the head
line, the base line, the descender zone, and a compound character.]

Some characters of Devanagari script are language specific. But the basic
vowels, consonants and modifiers are same in all languages. For example ‘Nukta’ is used in
Hindi but not in Nepali. Similarly, letter ‘LLA’ is also not used in Nepali.
Most consonants have half-forms, which when combined with other consonants yield conjuncts
(Pal & Chaudhuri, 2004).
Table 3 Consonants and their half-forms (consonant followed by its half form, where one was
listed):

क क्   ख ख्   ग ग्   घ घ्   ङ
च च्   छ   ज ज्   झ झ्   ञ ञ्
ट   ठ   ड   ढ   ण ण्
त त्   थ थ्   द   ध ध्   न न्
प प्   फ फ्   ब ब्   भ भ्   म म्
य य्   र   ल ल्   व व्   श श्
स स्   ष ष्   ह ह्
Numerals:
०१२३४५६७८९
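Since OCR output is Unicode, the Devanagari numerals occupy code points U+0966 to U+096F. A small illustrative helper (not part of this thesis's system) can map them to ASCII digits when downstream processing expects Western numerals:

```python
def devanagari_digits_to_ascii(text):
    """Map Devanagari numerals (U+0966..U+096F) to ASCII digits,
    leaving all other characters unchanged."""
    out = []
    for ch in text:
        offset = ord(ch) - 0x0966  # 0x0966 is DEVANAGARI DIGIT ZERO '०'
        out.append(chr(ord('0') + offset) if 0 <= offset <= 9 else ch)
    return "".join(out)
```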
Letter Variants:
In writing Nepali, many letter variations are found in written or printed documents. This is
because fonts have different writing styles. Some characters have letter variants that differ
between the old and new writing styles. The old variants of some letters (e.g. letter अ and letter
ण) are not used these days, but old documents frequently contain these forms. A set of letter
variants exists for such characters.
There are many conjuncts which are written as a single character, e.g. द्द, द्म, हृ; i.e. sometimes two
or more consonants can combine to form a new complex shape. Sometimes the shape of the
compound character is quite different from the shapes of its constituent characters. The frequency
of appearance of compound characters in any text page is much lower than that of basic characters.
ट + ट → ट्ट    ट + ठ → ट्ठ    द + द → द्द    द + म → द्म    श + र → श्र    त + र → त्र    द + ध → द्ध    क + ष → क्ष
In writing Nepali, many consonants come together in a cluster to form Typographic Ligatures.
These are also frequently found in Nepali. The number of ligatures employed may be language-
dependent; thus many more ligatures are conventionally used in writing Sanskrit than in written
Nepali (Typographic ligature, 2016). Using the 33 consonants, hundreds of ligatures can be
formed in total (the number of composite character classes exceeds 5000), most of which are
infrequent.
All the consonant characters, vowel characters, compound characters and modifiers are connected
by the ‘dika’, and it looks as if the characters are hanging from a rope. This is a special feature
of the Devanagari script and it does not appear in the Latin script. There are many shapes that
look similar to one another.
These characteristics of the Devanagari script pose challenges for DOCR (Devanagari
Optical Character Recognition). The Devanagari script differs from the Latin script in these
characteristics, so the same techniques from Latin OCR may not work well for DOCR. Thus
finding a technique suitable for segmentation of text images in Devanagari script is also
challenging.
1.3 Problem Definition

An OCR system has to handle documents composed of varying fonts, and the main thing we
want is accuracy of recognition.
Today we have many OCR project releases for Nepali as well as Hindi and Sanskrit. But their
performance has not been satisfactory. The problem lies in inadequate handling of conjuncts
and compound characters. This issue has to be seriously dealt with in order to develop a reliable
Nepali OCR. In this research work, a hybrid approach to recognition, along with compound
character handling, is applied to improve Nepali OCR.
1.4 Motivation
Digital documents have become a part of everyday life. Anyone can take advantage of scanning
their documents, making them easy to reference, organize, protect and store. There is no
limitation to the types of documents that can be digitized. Thus the increased interest forces us
to deal with any type of document that someone may wish to observe, such as images. Plain
text has a number of advantages over scanned copies of text. A
text document can be searched, edited, reformatted, and stored more compactly but it is not
possible in the case of images. One will not be able to edit, search or reformat any text that
visually appears in images. Images are nothing more than just a collection of pixels for a
computer.
Extracting the text data from images is important for reading, editing and analyzing the text
content contained in the images. Computers cannot directly recognize the text in images.
Thus the design of a computer program called an “OCR” that can recognize text in digital
documents (images) is important.
OCR technology for some scripts like Roman, Chinese, Japanese, Korean and Arabic is fairly
mature and commercial OCR systems are available with accuracy higher than 98%, including
OmniPage Pro from Nuance or FineReader from ABBYY for Roman and Cyrillic scripts, and
Nuance for Asian languages. Despite ongoing research on non-Latin script recognition, most
of the commercial OCR systems focus on Latin-based languages. OCR for Indian scripts, as
well as many low-density languages, is still in the research and development stage. The
resulting systems are often costly and do little to advance the field (Agrawal, Ma, & Doermann,
2009).
In the case of Nepali OCR, the segmentation process cannot achieve full accuracy because of
the dika, conjuncts, and compound and touching characters. These problems directly affect
successful recognition and thus result in decreased performance.
Due to the presence of language-specific constructs, the Devanagari script requires
different approaches to segmentation. Thus working on a better approach for segmentation and
recognition is required. The Devanagari script is different from
the Latin script due to its writing arrangement.
may not apply to the Devanagari script. The main challenges in segmentation for Devanagari
OCR are: i) Handling modifiers and diacritics, and ii) Handling compound characters and
ligatures (connected components). Dealing with these two main challenges is necessary to
achieve better accuracy. One major difficulty in improving the performance of an OCR system
lies in segmentation.

1.5 Research Questions

- What are the current segmentation and recognition techniques for Devanagari (Nepali)
OCR?
- Can the performance of Nepali OCR be improved by using the combined approach of
Holistic methods and character level dissection technique?
1.6 Objectives
This research is focused on improving the performance of Nepali OCR. This research will be
helpful for understanding the segmentation approaches used for Devanagari and Bangla OCR,
and underlying challenges and the improvements required. A better approach for designing an
OCR system for Nepali is the expected outcome of this research. Moreover, improved
recognition performance is expected. The objectives of this research are:
- To implement a hybrid approach of recognition that uses both the holistic approach and
character level recognition
- To determine and evaluate the hybrid approach for improved performance of Nepali
OCR
1.7 Organization of Document

Chapter 1 includes the basic introduction of the thesis, which covers problem definition,
motivation, research questions and objectives, and a basic overview of OCR and the Devanagari
script. Chapter 2 reviews the different segmentation
methods and recognition methods proposed for Devanagari optical character recognition. This
chapter also gives information about various OCR tools developed so far for Devanagari.
Chapter 3 discusses the methods applied to conduct this research work and experiment. In
this chapter, different components and phases of the applied method are also discussed. In chapter
4 segmentation results and recognition results are presented. The computation cost for character
level recognition technique and holistic approach is also described in this chapter. Finally,
chapter 5 concludes the research; the contributions and possible future improvements are
presented there.
In conclusion, in this chapter, the basic concepts of optical character recognition and a general
architecture of OCR, Devanagari script for Nepali language from the point of view of OCR are
discussed. The motivation of the research, research questions, objectives and goals of this
research are also presented.
The next chapter will discuss different segmentation methods and recognition approaches
proposed in the literature. It will also discuss various OCR tools developed so far for Devanagari.
CHAPTER II
LITERATURE REVIEW

A typical OCR process involves several steps such as segmentation, feature
extraction, and classification. Different models or techniques are proposed for character
segmentation. These techniques can be categorized into three major strategies – dissection
technique, recognition driven technique, and holistic methods. The use and selection of these
techniques highly depends on the construct of script and language. Various feature extraction
and classification techniques have been proposed by different researchers. The feature extraction
algorithms may rely on the morphology of characters for better classification. Classification is
one of the major steps in OCR, and the design of a good classifier is also a challenging task. Mostly,
segmentation is determined by the nature of the material to be read and by its quality.
Segmentation is the initial step in a three-step procedure (Casey & Lecolinet, 1996):
3) Find the member of a given symbol set whose attributes best match those of the input,
A character is a pattern that resembles one of the symbols the system is designed to recognize.
But to determine such a resemblance the pattern must be segmented from the document image.
Casey & Lecolinet (Casey & Lecolinet, 1996) have classified the segmentation methods into
three pure strategies, based on how segmentation and classification interact in the OCR process:
1) Dissection techniques, in which the image is cut into components based on character-like
properties. This process of cutting up the image into meaningful components is given
the generic name “dissection”.
2) Recognition driven segmentation, in which the image is searched for components that
match the classes of the system's alphabet.
3) Holistic methods, in which the system seeks to recognize words as a whole, thus
avoiding the need to segment words into characters.
2.1.1 Dissection Techniques

Dissection decomposes the image into sub-images using general properties of the valid
characters such as height, width, and separation from neighboring components. An intelligent
analysis of the image is carried out; however, classification into symbols is not involved at this
stage. One criterion used for dissection is the detection of end-of-character.
The analysis of the projection of a line of print has been used as a basis for segmentation of
non-cursive writing. When printed characters touch, or overlap horizontally, the projection
often contains a minimum at the proper segmentation column (Casey & Lecolinet, 1996). A
peak-to-valley function has been designed to improve this method. A minimum of the
projection is located and the projection value noted. A vertical projection alone is less
satisfactory, and additional analysis is required in order to separate joined characters reliably.
The intersection of two characters can give rise
to special image features. Consequently, dissection methods have been developed to detect
these features and to use them in splitting a character string image into sub-images. Only
image-based information is used in this process.
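The projection-based dissection described above can be sketched as follows: compute the column-wise ink counts of a binarized word image and cut at columns where the projection falls to a minimum. This is a simplified sketch assuming characters that do not touch or overlap horizontally:

```python
import numpy as np

def column_projection(binary):
    """Vertical projection profile: count of ink pixels per column."""
    return binary.sum(axis=0)

def cut_at_minima(binary, max_ink=0):
    """Split a word image at columns whose projection does not
    exceed max_ink (candidate segmentation columns)."""
    proj = column_projection(binary)
    gaps = proj <= max_ink
    segments, start = [], None
    for x, is_gap in enumerate(gaps):
        if not is_gap and start is None:
            start = x                      # character begins
        elif is_gap and start is not None:
            segments.append((start, x))    # character ends
            start = None
    if start is not None:
        segments.append((start, len(gaps)))
    return segments  # list of (start_col, end_col) per character
```

Raising `max_ink` above zero approximates the peak-to-valley idea: columns with only a little ink (such as under the dika) become candidate cut points as well.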
2.1.2 Recognition Driven Segmentation

This approach also segments words into individual characters, which are usually letters. It is
quite different from dissection in that no feature-based dissection algorithm is
employed. Rather, the image is divided systematically into many overlapping pieces without
regard to content. These pieces are classified, and segmentation emerges as a byproduct of
recognition,
which may itself be driven by contextual analysis. The main interest of this category of methods
is that they bypass the segmentation problem: No complex “dissection" algorithm has to be
The basic principle is to use a mobile window of variable width to provide sequences of
tentative segmentations, which are confirmed (or not) by character recognition. Multiple
sequences are obtained from the input image by varying the window placement and size. Each
sequence can be evaluated either serially or in parallel. In the first case, recognition is done
iteratively in a left-to-right scan of words, searching for a "satisfactory" recognition result.
The parallel method proceeds in a more global way: it generates a lattice of all (or many)
possible feature-to-letter combinations, and the final decision is found by choosing an optimal
path through the lattice (Casey & Lecolinet, 1996).
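The serial (left-to-right) strategy can be sketched as follows. Here `classify` is a stand-in for any character classifier that maps a window of columns to a (label, confidence) pair, and the window widths are illustrative assumptions, not values from the literature:

```python
def recognize_serial(columns, classify, max_w=4):
    """Serial recognition-driven segmentation sketch: at each position,
    try windows of width 1..max_w, keep the most confident
    classification, then advance past the accepted window."""
    pos, labels = 0, []
    while pos < len(columns):
        best = None  # (label, width, confidence)
        for w in range(1, max_w + 1):
            window = columns[pos:pos + w]
            if not window:
                break
            label, conf = classify(window)
            if best is None or conf > best[2]:
                best = (label, w, conf)
        labels.append(best[0])
        pos += best[1]
    return labels
```

A parallel method would instead keep all window hypotheses and select the best-scoring path through the resulting lattice.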
2.1.3 Holistic Technique
The holistic technique is the opposite of the classical dissection approach. It recognizes a
word as a whole and thus skips the segmentation of words into characters. This involves
matching the features of the unknown word against those of reference words stored in a
database.
Since a holistic approach does not directly deal with characters or alphabets, a major drawback
of this class of methods is that their use is usually limited to a predefined set of words. A
training stage is thus mandatory to expand or modify the scope of possible words. This
property makes this kind of method more suitable for applications where the lexicon is
statically defined, like check recognition. Such methods can also be tailored to a specific user
or a particular vocabulary; recognition is performed by comparing the features of the unknown
word with those of the references stored in the lexicon.
2.2 Challenges in Devanagari OCR
Devanagari, Bangla, and Gurmukhi share the same issues and challenges, as they follow the
same structure of characters and writing style (e.g., composition rules, headerline, conjuncts,
compound characters, position of vowel modifiers, etc.). The challenges and open problems
related to Devanagari OCR are outlined below. These problems are unique to Devanagari and
Bangla, and hence the solutions adopted by the OCR systems for other scripts cannot be
directly adapted to these scripts.
2.2.1 Over Segmentation of Basic Characters
Some of the characters in Devanagari, such as ग, ण, and श, have two basic components.
Similarly, the letter Kha (ख) also has a structure with two visually separate components and
looks like a combination of the letters Ra and Va (रव). In such cases the OCR system gets
confused and cannot segment a complete basic character. Sometimes poor document quality
also leads to over-segmentation of characters. Some of these problems can be handled during
post-processing.
2.2.2 Handling Vowel Modifiers
The Devanagari script consists of several vowel modifiers. When vowel modifiers come
together with core consonants, they take position at the top, bottom, left, or right and result in
a new shape. Identification and recognition of the modifiers is an important task. The main
challenge is to handle the large number of characters that are formed when the vowel modifiers
combine with the basic characters (Bag & Harit, 2013). Sometimes vowel modifiers come
together with other diacritics (for example, the vowel modifier I (िा) and Chandravindu (ा),
च िह = च + ह + िा + ा). In such cases, segmentation becomes even more complex.
2.2.3 Handling Compound Characters and Ligatures
Devanagari script has a large set of compound characters and ligatures. Sometimes it is hard
to identify the constituent characters of a compound character simply by analyzing its shape.
Thus, handling a large set of compound characters and ligatures is a challenging task.

Apart from these segmentation challenges, there are other challenges too, such as
typographical errors and irregular word and character spacing. Kulkarni (2013) has studied
the display typefaces of the Devanagari script. He noticed that most of the existing digital
display typefaces in Devanagari are inconsistent: they have imbalanced letter structures,
limited or inadequate matras, and ill-designed conjuncts. They also seem outdated and
overused, and many of them copy features and styles from existing Latin typefaces. He
recommends looking at Devanagari type design independently and not as secondary to Latin
type design. This inconsistency and these imbalanced letter structures in typefaces add
complexity to the OCR system. Because of the structural complexities of Indian scripts, a
character recognition module that makes use of only the image information (shape and
structure) of a character is prone to give incorrect results. To improve the recognition accuracy
rate, it is necessary to use language knowledge to correct the recognition result. There has
been limited use of post-processing in Indian OCR systems, and more efforts are needed in
this direction (Bag & Harit, 2013). Almost all Indic scripts need character reordering to
re-organize the output from visual order to logical (Unicode) order. Since most OCR systems
operate strictly from left to right, the characters are scanned and recognized in visual order;
this output needs to be reordered during post-processing.
Apart from the above-mentioned problems, which directly pertain to the OCR systems, there is
a need for a major effort to address related problems like scene text recognition, restoration of
degraded documents, and large scale indexing and search in multilingual document archives.
2.3 Related Work
Various works have been reported in the literature for the correct segmentation of characters
in Devanagari OCR. At the same time, various feature extraction methods and character
recognition algorithms have been proposed. Some of the works from the literature are briefly
described below.
2.3.1 Segmentation
Bansal & Sinha (1998) have considered the problem of conjunct segmentation in the context
of the Devanagari script. Their conjunct segmentation algorithm takes as input the image of
the conjunct and the co-ordinates of its enclosing box; the position of the vertical bar and the
pen width are also inputs to the algorithm. For extracting the second constituent character of
the conjunct, the continuity of the collapsed horizontal projection is checked. Bansal & Sinha
(2001) have divided words into top and bottom strips and then applied vertical projection to
segment the characters. Ma & Doermann (2003) identified Hindi words and then segmented
them into individual characters using the projection profile technique (isolating top modifiers,
separating bottom modifiers, and extracting core characters). Composite characters are
identified and further segmented based on the structural properties of the script and statistical
information; the Collapsed Horizontal Projection Technique is adopted from Bansal & Sinha
(2001) for conjunct segmentation. Bansal & Sinha (2002) present a two-pass algorithm for the
segmentation and decomposition of Devanagari composite (touching and fused)
characters/symbols into their constituent symbols. The proposed algorithm extensively uses
the structural properties of the script. In the first pass, words are segmented into easily
separable characters/composite characters, and statistical information about the height and
width of each separated box is used to hypothesize whether a character box is composite. In
the second pass, the hypothesized composite characters are further segmented using the
structural properties of the script.
Agrawal et al. (Agrawal, Ma, & Doermann, 2009) have generated character glyphs from font
files and passed them through feature extraction routines. For each character segmented in the
document image, feature extraction is performed. With the objective of grouping broken
characters, an intelligent character segmentation and recognition method was developed. For
each word, connected component analysis is performed. Kompalli et al. (Kompalli, Nayak, &
Setlur, 2005) have
proposed a projection profile based method for character segmentation from words. Words are
separated into ascenders, core components, and descenders. Gradient features are used to
classify segmented images into different classes: ascenders, descenders, and core components.
Core components contain vowels, consonants, and frequently occurring conjuncts. Core
components are pre-classified into four groups based on the presence of a vertical bar: no
vertical bar (e.g., छ, ट, ह), vertical bar at the center (e.g., व, फ, क), at the right (e.g., व, त, म), or at
multiple locations (e.g., कय, स, सत). Four neural networks are used for classification within these groups.
Due to ascender and core character separation, characters may be divided into multiple
segments during OCR. Positional information from segmented images is used to reconstruct
the original character. For recognition of valid but not frequently occurring conjuncts, Kompalli
et al. (2005) have attempted to segment the conjunct characters into their constituent consonants
and classify segmented images. For the segmentation of valid but not frequently occurring
conjuncts, authors have examined breaks and joins in the horizontal runs (HRUNS) of a
candidate conjunct character and built a block adjacency graph (BAG). Adjacent blocks in
the BAG are selected from left to right as segmentation hypotheses. Both left and right images
obtained from each segmentation hypothesis are classified using conjunct/vowel classifiers.
The segmentation hypothesis with highest confidence is accepted. Post processing is carried
out using a lexicon with 4,291 entries generated from the Devanagari data set. Kumar & Sengar
(2010) present a projection profile technique for printed Devanagari text: the horizontal
projection histogram is computed and the position of the headerline is located. This separates
the word into top and bottom
strip. A vertical projection histogram for each strip is computed for the segmentation of the
top modifiers and characters. Conjunct/fused characters are not considered in this paper, and
the results are for clean documents containing no conjunct/fused characters. A projection
profile technique is proposed in (Dongre & Mankar, 2011) for the segmentation of Devanagari
text images. To normalize the image against the thickness of the characters, the input image is
thinned. Then the vertical projection histogram is computed and the locations containing single
white pixels are noted. These points are taken as the boundaries of individual characters. The
proposed method skips the process of headerline removal; as a result, during character
segmentation, words are segmented into more symbols than are actually present in the word.
Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2006) have extended their previous work
(Kompalli, Nayak, & Setlur, 2005), and two different approaches, segmentation-driven and
recognition-driven segmentation, are compared for OCR of machine-printed, multi-font
Devanagari text. They have proposed a recognition-driven approach that combines classifier
design with segmentation using the hypothesize-and-test paradigm. Word images are examined
along horizontal runs (HRUNS) to build a Block Adjacency Graph (BAG). Given the BAG of
a word, histogram analysis of block width is used to identify the longest blocks as the headline
(dika) and isolate ascenders from core components. Regression over the centroids of these core
connected components is used to determine a baseline for the word. The classifier is used to
obtain hypotheses for word segments like consonants, vowels, or consonant-ascenders. If the
confidence of the classifier is below a threshold, the algorithm attempts to segment the
conjuncts, consonant-descenders, and half-consonants. Thus, the classifier results are used to
guide further segmentation. Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2009) have
proposed a graph-based recognition-driven segmentation methodology for Devanagari script
OCR using the hypothesize-and-test paradigm. This work is a further improvement of their
previous work (Kompalli et al., 2006). A BAG is constructed from a word image, and
ascenders and core components are isolated. The core components can be isolated characters
that do not need further segmentation, or conjuncts and fused characters that may or may not have
descenders. Multiple hypotheses are obtained for each composite character by considering all
possible combinations of the generated primitive components and their classification scores. A
stochastic model for word recognition has been presented, which describes the design of a
Stochastic Finite State Automaton (SFSA) that outputs word recognition results based on the
component hypotheses and n-gram statistics. It combines classifier scores, script composition
rules, and grammar models to prune the top-n choice results. They have not considered special
diacritic marks like avagraha, udatta, and anudatta, special consonants, punctuation, or
numerals. Symbols such as anusvara, visarga, and the reph character often tend to be classified
as noise.
Work                      Segmentation technique                                       Reported accuracy
Kompalli et al. (2005)    BAG analysis                                                 93.81% for consonants and vowels
Kompalli et al. (2006)    Graph-based character segmentation                           39.58% with segmentation-driven OCR; 44.10% with recognition-driven OCR
Kompalli et al. (2009)    Graph-based recognition-driven segmentation (BAG)            accuracy of recognition-driven segmentation ranges from 72% to 90%
Ma & Doermann (2003)      Structural properties and statistical information            average recognition accuracy of 87.82%
Agrawal et al. (2009)     Font-model-based segmentation, connected component analysis  92% character-level recognition
Bansal & Sinha (1998)     Collapsed horizontal projection segmentation                 85% recognition rate on segmented touching characters
Bansal & Sinha (2002)     Collapsed horizontal projection                              85% recognition rate
For the Nepali HTK OCR (Shakya, Tuladhar, Pandey, & Bal, 2009; Bal, 2009), the projection
profile technique has been adopted for character segmentation. The process includes removal
of the headerline and upper modifiers and then applying a multi-factorial analysis technique to
segment the basic characters. The method is able to segment isolated characters along with half
and conjoined characters. For the classifier, a Hidden Markov Model (HMM) from the HTK
toolkit is used. Rupakheti & Bal (2009) adopted the projection profiling technique for the
Nepali Tesseract OCR. The headerline width is identified, and then the vertical projection
histogram of the word to be segmented is computed. Then histogram analysis is done to mark
the starting and ending
boundary of each character fragment, taking the headerline as a threshold value that qualifies
a segment to be separated.
Most of the researchers have adopted the projection profiling technique for character
segmentation. For Devanagari character segmentation, this technique includes two phases:
preliminary detection of the headerline, and use of its position as a reference to isolate
ascenders, core components, and descenders. For segmentation of compound characters,
Bansal & Sinha (1998, 2001, 2002) have proposed the collapsed horizontal projection
technique, and Kompalli et al. have proposed graph analysis for compound character
segmentation. Ma & Doermann (2003) have used structural properties of the script and
statistical information to segment compound characters. Kompalli et al. (2006, 2009) have
proposed a graph-based recognition-driven character segmentation technique to overcome the
problems of compound character segmentation, which is usually difficult using projection
profile techniques.
2.3.2 Recognition
Various feature extraction algorithms and classifiers have been proposed for Devanagari
optical character recognition, all focusing on improved performance. The shaded portions of
the characters are used as features by Chaudhuri & Pal (1997); the classifiers used were
decision trees. Kompalli et al. (2005) have used GSC features and a Neural Network as a
classifier. Kompalli et al. (2006) have used GSC features and a k-nearest neighbor classifier.
Ma & Doermann (2003) suggest the use of statistical and structural features; they have used
Generalized Hausdorff Image Comparison (GHIC) for the recognition of characters. The
different feature extraction methods and classifiers used by various researchers in the field of
Devanagari OCR are listed in Table 7.
Table 7 Feature Extraction and Classifiers in Devanagari OCR
Bishnu & Chaudhuri (1999) have proposed a recursive contour following method for character
segmentation in Bangla. Based on Bangla writing styles, different zones across the height of
the word are detected; these zones provide certain structural information about the constituent
characters of the word. Recursive contour following solves the problem of overlap between
successive characters. Garain & Chaudhuri (2002) have proposed a method for segmenting
the touching characters in printed Bangla script. Through a statistical study they noted that
touching characters occur mostly at the middle of the middle zone, and hence certain suspected
points of touching were found by inspecting the pixel patterns and their relative position with
respect to the predicted middle zone. The geometric shape is cut at these points and the OCR
scores are noted; the best score gives the desired result. Habib (Murtoza, 2005) has proposed
a projection profiling technique for Bangla character segmentation. The width of the headline
is variable because of print style (font size), so sometimes the headline cannot be removed
cleanly. Two morphological operations, thinning and skeletonization, have been tried to
overcome this problem. These operations remove pixels, and the remaining pixels make up the
image skeleton, from which characters can be segmented.
The Arabic OCR framework proposed by Sabbour & Shafait (2013) takes raw Arabic script
text files as input in the training phase. The training part outputs a dataset of ligatures, where
each ligature is described by a feature vector. The recognition part takes as input an image
specified by the user and uses the dataset of ligatures generated by the training part to convert
the image into text. The evaluation includes versions of degraded text images, which aim at
measuring the robustness of the recognition system against possible image defects such as
jitter. The reported accuracy is 91% for clean Urdu text and 86% for clean Arabic text.
The development of OCR tools has been initiated by many organizations and individuals in
India and Nepal. C-DAC, from India, has developed an OCR system (Chitrankan) for the
Hindi and Marathi languages. Madan Puraskar Pustakalaya (MPP) from Nepal has also
developed OCR projects for the Nepali language (based on the Tesseract open source OCR
engine and the HTK toolkit). Ind.senz (founded by Dr. Oliver Hellwig) is developing OCR
software for the Devanagari script (Sanskrit, Hindi, and Marathi languages). The other
projects are Parichit and the Sanskrit/Hindi Tesseract OCR. These tools are briefly described
below.
Chitrankan: Chitrankan is an OCR (Optical Character Recognition) system for Hindi and
other Indian languages developed by C-DAC. It works with the Hindi and Marathi languages
along with embedded English text. It comes with facilities like a spell checker, saving
recognized text in ISCII format, and exporting text as .RTF for editing in any word processor.
Skew detection and correction up to ±15°, automatic text and picture region detection, and
advanced DSP (Digital Signal Processing) algorithms to remove noise and back page
reflection are also implemented. The recognized text is not very accurate, so manual editing is
required. The supported operating systems are Windows XP and older versions of Windows [2].
[2] http://cdac.in/index.aspx?id=mlc_gist_chitra
Parichit: This project is based on the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/).
The front end is a modified version of VietOCR (http://vietocr.sourceforge.net/). The
project aims to create open source OCRs for Indian and South Asian languages. It also aims
to create high quality training data for building Tesseract language models for each of the
Indian and South Asian languages [3].

Sanskrit / Hindi Tesseract OCR (traineddata files for Devanagari fonts for
Tesseract OCR 3.02+): Tesseract OCR 3.02 provides hin.traineddata for recognizing texts
in the Devanagari script. However, the training texts, images, and box files are not provided,
so it is difficult to improve the accuracy by further improving the traineddata. It is noted that
recognition is more accurate and faster if the training is done with the same or a similar font
as used in the text to be OCRed. This project therefore aims at creating traineddata for various
Devanagari fonts, so that Tesseract OCR can be used for the recognition of documents written
in various Devanagari fonts [4].
Ind.senz OCR Programs: OCR programs are available for the Hindi, Marathi, and Sanskrit
languages. These are the only Devanagari OCR programs developed and available for
professional use. Ind.senz describes the usability of the programs for data entry companies,
publishing houses, and universities, wherever large amounts of Hindi and Sanskrit text have
to be digitized. The programs take text images and transform them automatically into
computer-editable text in Unicode format. Ind.senz reports the achievement of high accuracy
rates on typical Devanagari fonts. The OCR programs are paid software; a demo version is
also available [5].
[3] http://code.google.com/p/parichit
[4] http://sourceforge.net/projects/tesseracthindi
[5] http://www.indsenz.com/int/index.php
Google Drive OCR: Google has launched Nepali OCR in Google Drive. The OCR
technology is free for Google Drive users and shows good performance on single-column
documents. It can retain some formatting like bold, font size, font type, and line breaks, but
lists, tables, columns, footnotes, and endnotes are likely not to be detected. Though it shows
good performance, one needs to be a Google Drive user and surrender one's documents to
Google's servers.
HTK Toolkit Based OCR: This OCR project was developed under Phase I of the PAN
Localization Project by Madan Puraskar Pustakalaya (http://madanpuraskar.org/). The
development of the Nepali OCR was done with guidance and direct training from the
Bangladesh team. The OCR project was closed with the release of a beta version [6]. The
source files and executables are available on http://nepalinux.org [7].
Tesseract Based Nepali OCR: Under the initiatives of MPP and Kathmandu University
(KU), efforts were made to develop a Tesseract based Nepali OCR under the PAN
Localization Project Phase II. In this project, 202 Nepali characters, including basic characters
and some derived characters (characters with ukar, ekar, and aikar), were trained via Tesseract
2.04. It is available for download at http://nepalinux.org, and it can also be downloaded from
the PAN Localization Project website [8].

After the release of the HTK based beta version of the Nepali OCR, the Tesseract based
Nepali OCR was developed in 2009. Since then, the development and enhancement of the
Nepali OCR has been discontinued, and these tools have not been updated for a long time. In
the current scenario, new versions of operating systems and new platforms have been released.
The tools developed do not meet the requirements of the new versions of operating systems
[6] Findings of PAN Localization Project, PAN Localization Project 2012; ISBN: 978-969-9690-02-2
[7] http://nepalinux.org/index.php?option=com_content&task=view&id=46&Itemid=53
[8] http://www.panl10n.net/madan-puraskar-pustakalaya-nepal/
like Windows 7 and Windows 8.1. It is also necessary to develop OCR tools for other platforms.
In conclusion, in this chapter, various works and methods for the correct segmentation of
characters are discussed, and various feature extraction methods and character recognition
algorithms are described briefly. Most of the research focuses on the improvement of the
segmentation process; the methods include projection profile techniques, collapsed horizontal
projection, and graph-based techniques. Various feature extraction methods and classifiers
proposed for the successful recognition of Devanagari characters are also presented, and
finally, various tools developed for Devanagari OCR are reviewed.

The next chapter will discuss the methods applied to conduct this research and experiment,
including the different components and phases of the applied method.
CHAPTER III
METHODOLOGY
The research works on Devanagari Optical Character Recognition suggest that the
segmentation process cannot achieve full accuracy because of noise, touching characters,
compound characters, variation in typefaces, and many similar looking characters. Because of
script-specific constructs such as modifiers (South-East Asian scripts), writing order, or
irregular word spacing (Arabic and Chinese), each script requires different approaches to
segmentation (Agrawal, Ma, & Doermann, 2009). The Devanagari script also possesses its
own constructs, which differ totally from Latin.
The most practiced character dissection method for Devanagari works by removing the
headerline (dika) and separating the lower and upper modifiers, which makes it easy to extract
the basic characters but increases the complexity of extracting the modifiers. The modifiers
get broken, and it is difficult to note their position in a sequence of segmented characters and
restore their original shape. To minimize the overhead of component-level segmentation and
to minimize the errors due to inaccurate dissection, a hybrid approach which combines the
holistic method and the dissection technique is proposed here. Kompalli et al. (Kompalli,
Setlur, & Govindaraju, 2009) have also proposed a novel graph-based recognition-driven
segmentation methodology for Devanagari script OCR using the hypothesize-and-test
paradigm, which is promising and inspiring work for using a hybrid approach to OCR.
Bag & Harit (2013) have also highlighted the need for new approaches, because the problems
are unique to Devanagari and Bangla, and hence the solutions adopted by the OCR systems
for other scripts cannot be directly adapted to these scripts.
Phase 1: Segment the input text image into words and recognize the words using the holistic
approach. Measure the confidence of the classification. If the confidence is lower than a
threshold, then we go to Phase 2.

Phase 2: Words that are poorly classified in Phase 1 are segmented, using the dissection
technique, into compound characters and characters including diacritics. These characters are
then classified.

The general framework of our approach consists of two main parts, training and recognition:
3.1 Training:
Training takes raw Nepali text data as input and outputs a dataset of words and a dataset of
ligatures (compound characters), where each item is described by a feature vector. The training
phase consists of the following steps:
1. Generation of a dataset of images for the possible words and ligatures (compound
characters).
2. Extracting features that describe each word and ligature in the dataset generated by the
previous step.

The first step involves the use of an automated computer program to generate the necessary
training dataset. A text corpus of the target language is fed to the program, and an analysis of
the textual data is
performed to generate the list of words, basic characters, and compound characters, which will
later be used for rendering images representing the corresponding text. The various steps
involved in this process are described below.
In this project, a text corpus collected by Madan Puraskar Pustakalaya (MPP) under the Bhasha
Sanchar Project [9] is used. The corpus includes different types of articles from different news
portals, magazines, websites, and books (about 2,500 articles). The collected text corpus is
fed to the Text Separator, a program written in C#. This program searches for the words and
maintains a dictionary in the form of <word, frequency> tuples; a dictionary of characters in
the form of <character, frequency> tuples is also generated. The number of words extracted
for Nepali is over 150,000, of varying lengths. The number of basic characters and compound
characters extracted is about 7,000.
[9] This corpus has been constructed by the Nelralec / Bhasha Sanchar Project, undertaken by a consortium of the Open
University, Madan Puraskar Pustakalaya (मदन पुरस्कार पुस्तकालय), Lancaster University, the University of Göteborg, ELRA (the
European Language Resources Association), and Tribhuvan University.
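The dictionary construction can be sketched in Python with collections.Counter; the actual Text Separator is a C# program, so this fragment is only an illustrative equivalent, and the sample words are hypothetical:

```python
from collections import Counter

def build_dictionaries(corpus_lines):
    """Build <word, frequency> and <character, frequency> dictionaries
    from an iterable of corpus lines (illustrative sketch)."""
    words = Counter()
    chars = Counter()
    for line in corpus_lines:
        for word in line.split():
            words[word] += 1
            chars.update(word)  # count every character of the word
    return words, chars

# Example: two tiny stand-in "articles"
words, chars = build_dictionaries(["नेपाल नेपाल ओसीआर", "ओसीआर"])
```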
3.1.1.2 Image Dataset Generation:
In order to generate an image dataset of words and compound characters (including basic
characters), the following steps are performed:
- Images for each extracted word and character are rendered using a rendering engine. This
involves rendering the text using 15 different Devanagari Unicode fonts, among them
Mangal, Arial Unicode MS, Samanata, Kokila, Adobe Devanagari, and Madan.
- Degraded images are generated by applying different image filtering operations (e.g.,
blurring).
The second main step of the training phase is to extract a feature vector representing each word
and compound character included in the dataset. For this, Histogram of Oriented Gradients
(HOG) features are used. To extract the HOG features from the dataset, the hog routine
implemented in skimage.feature has been used. The routine allows managing the orientations,
pixels per cell, and cells per block parameters.
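As a sketch, feature extraction with the skimage hog routine might look like the following; the image size and parameter values here are illustrative assumptions, not necessarily the configuration used in this work:

```python
import numpy as np
from skimage.feature import hog

# Illustrative parameters (assumptions): a 32x32 glyph image,
# 9 orientations, 8x8-pixel cells, 2x2-cell blocks.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0  # stand-in for a rendered character

features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# 32/8 = 4 cells per side -> 3x3 block positions
# -> 3 * 3 * 2 * 2 * 9 = 324 feature values
```

The same call would be applied to every rendered word and ligature image to produce the training feature vectors.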
3.2 Recognition:
The recognition part takes as input an image which is specified by the user through the user
interface. Its main task is to recognize any text that occurs in the input image. The recognized
text is presented as an output to the user in an editable format. The recognition of the text in
an input image involves the steps described below.

In this research work, instead of the projection profile, a blob detection based approach for
line and word segmentation has been used. Blobs are bright-on-dark or dark-on-bright regions
in an image [10, 11]. In the Devanagari script, each word is a bunch of characters tied to each
other by the headerline (dika). This property of the Devanagari script makes it easy to use blob
detection for detecting individual words in a text document. Figure 8 shows Nepali words,
with each word as a separate bright region on a black background.
Step 6: Re-apply blob detection within a line to perform word segmentation (vertical and
horizontal blurring may be applied for more accurate segmentation).
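As an illustrative sketch, connected-component labelling with scipy.ndimage can stand in for the blob detector described above; the actual implementation may use a different blob detection routine:

```python
import numpy as np
from scipy import ndimage

def detect_word_blobs(binary_image):
    """Label connected ink regions (words) in a binarized text image
    and return one bounding box (pair of slices) per blob."""
    labels, count = ndimage.label(binary_image)
    return ndimage.find_objects(labels), count

# Toy "line" with two words separated by white space (1 = ink)
line = np.zeros((5, 12), dtype=int)
line[1:4, 0:4] = 1    # first word
line[1:4, 7:11] = 1   # second word
boxes, n = detect_word_blobs(line)
```

Because the headerline connects all characters of a Devanagari word, each word forms a single connected component, so its bounding box can be passed directly to the character segmentation module.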
The segmentation of characters into basic components becomes more challenging due to the
script's properties: the use of modifiers and diacritics, and compound characters and ligatures.
By studying the structure of the Devanagari script and the use of compound characters, it was
found that it would be better to treat compound characters and ligatures as single characters.
The method is inspired by the work of Nazly Sabbour and Faisal Shafait, the ligature based
approach used to implement segmentation-free Arabic and Urdu OCR (Sabbour & Shafait,
2013). On analyzing the Nepali text corpus, it was found that there are about 7,000 compound
characters (basic characters, conjuncts, ligatures) used in Nepali. The projection profile (PP)
algorithm is used to segment characters.
This module takes the blobs (rectangles enclosing the words) as input. In the earlier step, Horizontal Projection is applied on the word. Hpp(word) = {r1, r2, r3, …, rn} is the result of Horizontal Projection, which contains the score of black pixels in each row. An analysis of Hpp(word) is performed to detect the headerline of a word and to calculate its height HLh(word). Those rows that have a Horizontal Projection score equal to the maximum score or near the maximum, and that are neighbors of each other, form the headerline. The analysis is performed on the upper half of the word. The location of the headerline, HLl(word, x, y), is the position of the headerline in a word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. The height of the headerline is given by HLh(word) = y − x. Then vertical projection is applied on the blob. Vpp(word) = {v1, v2, v3, …, vn} is the list of scores of black pixels in each column of the blob, where v1, v2, …, vn are the scores of black pixels in the respective columns.
The method that has been practiced so far to isolate the individual characters in a word of Devanagari Script is to remove the headerline. I have also used the same method, and it works fine for isolating compound characters. The headerline is not actually removed; instead HLh(word) is subtracted from each element of Vpp(word), giving Vpp(word)hr = {v1 − HLh(word), v2 − HLh(word), …, vn − HLh(word)}. An element may become less than zero; in such a case that element is set to zero, because no score can be less than zero. The next task is to find the cut points, CP(word) = <word, {cp1, cp2, …, cpn}>, where cp1, cp2, …, cpn are cut points, by analyzing Vpp(word)hr. Cut points are the points in a word from where we can chop the word to isolate the characters; normally these are the points where the element of Vpp(word)hr is equal to zero, that is, the space between two characters. Finally, the cut points are noted and the segmentation is performed. The result of segmentation is given by CS(TextImage) = {<word1, <cs11, cs12, …, cs1n>>, <word2, <cs21, cs22, …, cs2m>>, …, <wordp, <csp1, csp2, …, cspk>>}.
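The projection profile procedure above can be sketched in Python. This is a minimal illustration under the stated definitions; the 90% "near maximum" threshold, the filter on empty segments, and the helper names are assumptions, not the thesis implementation.

```python
import numpy as np

def segment_characters(word_img):
    """Projection-profile character segmentation of one word blob.
    word_img: 2-D binary array, 1 = black pixel."""
    h, w = word_img.shape

    # Hpp(word): black-pixel score of each row.
    hpp = word_img.sum(axis=1)

    # Headerline: neighbouring rows in the upper half whose score is at or
    # near the maximum ("near" taken as 90% here, an assumption).
    upper = hpp[: h // 2]
    near_max = np.where(upper >= 0.9 * upper.max())[0]
    x, y = near_max.min(), near_max.max()   # HLl(word, x, y)
    hl_height = y - x + 1                   # HLh(word)

    # Vpp(word)hr: vertical projection minus the headerline contribution,
    # clamped at zero because no score can be negative.
    vpp_hr = np.maximum(word_img.sum(axis=0) - hl_height, 0)

    # Cut points: columns where the header-removed profile is zero.
    cut_cols = np.where(vpp_hr == 0)[0]

    # Cut at the centre of each run of zero columns; keep only segments
    # that contain ink below the headerline.
    segments, start = [], 0
    for gap in np.split(cut_cols, np.where(np.diff(cut_cols) > 1)[0] + 1):
        if gap.size == 0:
            continue
        cp = int(gap.mean())
        if cp - start > 1 and word_img[y + 1:, start:cp].any():
            segments.append(word_img[:, start:cp])
        start = cp
    if w - start > 1 and word_img[y + 1:, start:w].any():
        segments.append(word_img[:, start:w])
    return segments
```

On a synthetic word of two vertical strokes joined by a headerline, the gap between the strokes produces a zero run in Vpp(word)hr and the word is split into two character images.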
For the development of both the word classifier and the character classifier, the Random Forest classifier tool of scikit-learn12 is used. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

12 http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html [Accessed: 03-24-2016]
For testing purposes a limited set of words and characters has been trained. The training of the classifiers is described below.
Word Classifier: Three different Random Forest classifiers are trained based on the word length, i.e. the ratio of image width and height. Training data images with width <= (height*2) lie in class 1, those with width <= (height*4) in class 2, and those with width <= (height*6) in class 3.
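The routing of a word image to one of the three classifiers can be sketched as a small helper. This is an illustration; the fall-back for words wider than six times their height is an assumption, since the text does not say how such words are handled.

```python
def word_class(width, height):
    """Assign a word image to one of the three Random Forest
    classifiers by its width-to-height ratio."""
    if width <= height * 2:
        return 1
    if width <= height * 4:
        return 2
    if width <= height * 6:
        return 3
    return 3  # wider words fall back to the widest class (an assumption)
```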
The following graph shows the learning curve for word classifier 1 (Class 1) with 20 iterations and a test size of 30 percent.

Figure 10 Learning Curve - Word classifier 1

The following graph shows the learning curve for word classifier 2 (Class 2) with 20 iterations and a test size of 30 percent.

The following graph shows the learning curve for word classifier 3 (Class 3) with 20 iterations and a test size of 30 percent.
Character Classifier: A single Random Forest classifier is developed for the classification of characters.
The feature extraction configuration, random forest setup, training data size, and the cross validation result for training are presented in Table 9. The classifier was trained successfully.

The following graph shows the learning curve for the character classifier with 20 iterations and a test size of 30 percent.
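A curve of this kind can be reproduced with scikit-learn's learning_curve routine and a ShuffleSplit of 20 iterations with a 30 percent test split. The data below is a random stand-in; the real features are HOG descriptors of word and character images.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Illustrative stand-in data, not the thesis dataset.
rng = np.random.RandomState(0)
X = rng.rand(200, 16)
y = (X[:, 0] > 0.5).astype(int)

# 20 iterations with a 30 percent test split, matching the curves above.
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0), X, y, cv=cv)

print(test_scores.mean(axis=1))  # mean validation score per training size
```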
The prediction probability indicates how accurately some character/word has been classified. The threshold defined is a numeric value; a word whose prediction probability falls below it is passed on to character level classification. The threshold value is calculated by studying the classification results and the corresponding prediction probabilities. In this work, 0.2 is taken as the threshold. A sample of word recognition data is presented in Appendix II.
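The threshold decision can be sketched as follows. The helper name is hypothetical; only the 0.2 threshold and the use of the prediction probability as confidence come from this work.

```python
def needs_character_level(word_probs, threshold=0.2):
    """Decide whether a word must fall back to character level recognition.
    word_probs: class probabilities for one word, e.g. from the word
    classifier's predict_proba. The confidence is the probability of the
    predicted (most likely) class."""
    confidence = max(word_probs)
    return confidence < threshold
```

For example, the sample entry with confidence 0.16 in Appendix II would be routed to character level recognition, while a word recognized with confidence 0.5 would be accepted as a whole.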
In conclusion, in this chapter the proposed model of Nepali OCR is discussed. Our model consists of a two phase recognition scheme: first, the OCR engine tries to recognize words as a whole; second, if it is not confident about a word, it tries to segment the word into its constituent characters and recognize at the character level. The general framework of our model is presented in this chapter.
The training phase consists of two main steps, dataset generation and feature extraction. The processes of dataset generation and feature extraction are described in this chapter.
Word and line segmentation and character segmentation algorithms are described. The word and line segmentation algorithm uses blob detection and the projection profile technique for segmentation.
Finally, the different classifiers used in the experiment and their configurations are presented. The training results, cross validation results, and training curves are presented for both the word and character classifiers. The classifiers were successfully trained with more than 80% accuracy.
In the next chapter, segmentation results and recognition results are described.
CHAPTER IV
RESULTS AND DISCUSSION
In this chapter, the experimental study and testing of the proposed architecture are presented. To test the system, various documents were generated and collected. Results of the testing of different modules of the system, namely word segmentation, character segmentation, word level recognition, and compound character level recognition, are also presented here.
Various tools and libraries are used for image processing and machine learning. The experiment is conducted on a machine with the following configuration:
Title                        Description
Computer System              Dell Inspiron 5420, i5 Processor, 4GB RAM, 1GB NVIDIA Graphics
Operating System             Windows 10
Programming Languages        C#, Python 3
Image Processing Libraries   Aforge.net, Accord.net, scikit-image, OpenCV-Python
Machine Learning Libraries   scikit-learn
For various tasks, including the GUI design of the experiment software, pre-processing of input images, and post-processing, C# has been used as the major language. C# libraries like Aforge.net and Accord.net are used for different image processing tasks such as reading images, removing noise, and performing segmentation. Similarly, scikit-image has been used for image feature extraction. For implementing machine learning, the RandomForest routine of scikit-learn has been used.
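A minimal sketch of the training setup with scikit-learn is given below. The features here are random stand-ins for the HOG vectors extracted with scikit-image, and the feature length, class count, and forest size are assumptions, not the Table 9 configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for HOG feature vectors extracted with scikit-image
# (dimensions and labels are illustrative, not the thesis values).
rng = np.random.RandomState(1)
X = rng.rand(300, 64)            # one feature vector per image
y = rng.randint(0, 3, size=300)  # word/character class labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross validation, as in Table 9
clf.fit(X, y)

# The prediction probability of the predicted class serves as the
# recognition confidence used by the hybrid decision step.
probs = clf.predict_proba(X[:1])
```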
The line and word segmentation accuracy is high; since there is clear space between lines, it is almost 100%. The accuracy of word segmentation is reduced a little by the lower modifiers (Ukar, Ookar, Rrikar, and Halant) if they are separated from the core character, and by punctuation marks like comma and dot. The character segmentation results for 7 documents are presented below.
Document                     1      2      3      4      5      6      7
Characters Present           212    118    370    353    166    273    289
Characters Over-segmented    5      7      12     11     2      3      4
Characters Under-segmented   14     7      23     9      9      18     19
Error (%)                    8.96   11.86  9.45   5.66   6.62   7.69   7.95
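The error rates in the table can be reproduced from the over- and under-segmentation counts. Truncating, rather than rounding, to two decimal places matches the reported figures.

```python
present = [212, 118, 370, 353, 166, 273, 289]
over    = [5, 7, 12, 11, 2, 3, 4]
under   = [14, 7, 23, 9, 9, 18, 19]

# Error (%) = (over-segmented + under-segmented) / characters present * 100,
# truncated to two decimal places.
errors = [int((o + u) / p * 10000) / 100
          for p, o, u in zip(present, over, under)]
print(errors)  # [8.96, 11.86, 9.45, 5.66, 6.62, 7.69, 7.95]
```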
From the above test, it is clear that most of the errors are due to under-segmentation. The errors from under-segmentation mainly occur when adjacent characters touch or overlap each other.
The classifier was tested on documents containing characters from a set of trained 519 words
and 417 characters. The result of recognition of 7 documents is presented in Table 12.
From Table 12, we can see that the accuracy of the Character Level Recognition approach ranges from 69.49% to 85.38%. The average accuracy rate of the Character Level Recognition approach is 78.87%. The accuracy rate of the Hybrid approach ranges from 90.76% to 98%, with an average of 94.81%. The recognition results are also presented as a bar chart in Figure 14.
Figure 14 Recognition Results (bar chart comparing, for Documents 1 to 7, the accuracy of the character level recognition approach, 69.49% to 85.38%, against the hybrid approach, 90.76% to 98%)
From the above results, we can see that the proposed hybrid approach is promising. The
accuracy rate increased by more than 10% while using hybrid approach.
In this section, the computational cost of both recognition approaches and its mathematical interpretation is discussed. This computational cost only includes the cost of segmentation and recognition; the other costs, like pre-processing, training, and post-processing, are not considered.
This technique involves word segmentation, segmenting each word into characters, and then recognition of each character. Thus the total computational cost for this approach is given by:

Ct = ws + Ccls + Cclr

Where,
Ct = Total computational cost of the character level recognition approach
ws = Cost of segmenting the image document into words
Ccls = Total cost of character level segmentation
Cclr = Total cost of character level recognition

Assume that there are n words in a document and the time taken to segment an image document into words is ws. Let us say the average cost to segment each word into characters is Cs, so that Ccls = n × Cs. Also assume that the character segmentation yields m characters, i.e. there are m characters present. If r is the recognition time required to recognize a single character, then Cclr = m × r and the total cost becomes:

Ct = ws + n × Cs + m × r …………………………….. (1)
This approach also begins with word level segmentation; the process of recognition starts with the segmentation of the image document into possible words. The word level segmentation cost is again ws.
Now, in this approach, we first try to recognize all the words. Let us say the average cost of recognition of a single word is wr; with n words in total, the cost of recognition of n words is n × wr.
Then the confidence of recognition for each recognized word is calculated to decide whether to go for character level recognition or not. If the cost of calculating the recognition confidence of a single word is Rcc, then the total cost of calculating recognition confidence for all words is n × Rcc.
The next step is deciding how many words have been successfully recognized and how many require further processing. If p words require further processing, then n − p words do not require character level segmentation. The cost of character level segmentation equals Ccls = p × Cs. If the character segmentation of the p words results in q characters, then the cost of character level recognition is Cclr = q × r.
Thus the total computational cost of the hybrid approach is given by the equation

Cth = ws + n × wr + n × Rcc + p × Cs + q × r …………………….. (2)

Where,
Cth = Total computational cost of hybrid approach
wr = Average cost of recognizing a single word
Rcc = Cost of calculating the recognition confidence of a single word

In the worst case of the hybrid approach, every word requires character level processing, i.e. p = n and q = m, so

Cth = ws + n × wr + n × Rcc + n × Cs + m × r …………………….. (3)

For the character level recognition technique, the computational cost in the worst case and the best case is equal, and is given by equation (1):

Ct = ws + n × Cs + m × r ……………………………………….. (4)

By comparing equation (3) and equation (4), we can see that in the worst case the hybrid approach carries all the cost of equation (4) plus the extra terms n × wr and n × Rcc, and hence

Ct < Cth ………………………………………………………….. (5)

In the best case of the hybrid approach, no word requires character level processing, i.e. p = 0 and q = 0, so

Cth = ws + n × wr + n × Rcc ……………………………………… (6)

On comparing equation (1) and equation (6), n is always less than m, i.e. n < m, and hence

Ct > Cth
This shows that in the best case the hybrid approach performs better in terms of computational cost, but in the worst case the character level recognition technique performs better. However, it is not always the case that the hybrid approach performs best; even if we consider p to be n/2, the outcome depends on the relative costs of word recognition, confidence calculation, and character level processing.
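The cost expressions above can be checked numerically. The unit costs below are illustrative assumptions, not measured values; the point is only that the hybrid cost sits below the character level cost of equation (1) when no word needs fallback, and above it when every word does.

```python
def cost_character_level(ws, n, cs, m, r):
    """Character level cost, equation (1): word segmentation, per-word
    character segmentation, per-character recognition."""
    return ws + n * cs + m * r

def cost_hybrid(ws, n, wr, rcc, p, cs, q, r):
    """Hybrid cost derived above: word recognition plus a confidence
    check for every word, then character level processing only for
    the p unconfident words yielding q characters."""
    return ws + n * wr + n * rcc + p * cs + q * r

# Illustrative unit costs (assumptions, not measured values).
ws, n, m = 1.0, 100, 400          # n words, m characters, n < m
cs, r, wr, rcc = 1.0, 1.0, 1.0, 0.5

best = cost_hybrid(ws, n, wr, rcc, p=0, cs=cs, q=0, r=r)    # no fallback
worst = cost_hybrid(ws, n, wr, rcc, p=n, cs=cs, q=m, r=r)   # full fallback
char_only = cost_character_level(ws, n, cs, m, r)

print(best < char_only < worst)
```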
In conclusion, the experimental environment and the results of the experiment performed are discussed. The chapter begins with the discussion of the hardware and software environment on which the experiment is performed, and then the segmentation results and the recognition results are presented.
In the segmentation results, the character segmentation results and the error rates are presented. In the recognition results, the results of the character level recognition approach and the hybrid approach of recognition (the proposed method) are presented. The results are also presented in a bar chart.
In the next chapter, the contributions and the possible future improvements will be described in the conclusion.
CHAPTER V
CONCLUSION AND FUTURE WORK
Using the hybrid approach improved the recognition accuracy. The performance of the OCR increased by nearly 10% while using the hybrid approach. However, this is not always the case; it depends on how many words have been trained. Word recognition is comparable to asking how familiar one is with a language. Training results show that the word classifier has better performance, nearly 90% on average. The performance of character level recognition is found to be 78.87%.
We studied the recognition techniques proposed for Devanagari OCR, which includes Nepali, Sanskrit, and related languages.
We proposed a model for Nepali OCR which combines the Holistic technique and Character level dissection techniques. At first, the system tries to recognize a word as a whole; if it is not confident about the classification, then character level dissection and recognition are performed, which reduces the segmentation task.
The model is trained using Random Forest classifiers. The HOG descriptor has been used for feature extraction.
Along with the cross validation testing, the manual testing of the models is also presented. The testing shows higher accuracy rates and possibilities for its further improvement.
Our focus was on improving the performance of Nepali OCR by using a hybrid approach of
recognition. The approach reduces the character level or component level segmentation task.
There are several issues and possibilities that can be addressed in the future to further improve the system, such as the touching and fused character problems. These problems have always been pertinent issues and challenges for Devanagari OCR. The problem may be addressed by applying some recognition driven segmentation technique.
The model proposed can be generalized and trained to recognize a large set of words and compound characters, not only for Nepali but for Hindi, Marathi, and other languages that use the Devanagari script.
Better and more concrete methods must be designed for creating multiple classes of word images. The use of multiple classifiers apparently improved the performance, but this has to be verified on larger datasets.
References
Agrawal, M., Ma, H., & Doermann, D. (2010). Generalization of Hindi OCR using adaptive
segmentation and font files. In Guide to OCR for Indic Scripts. Springer London, pp.
181-207.
Bag, S., & Harit, G. (2013). A survey on optical character recognition for Bangla and
Devanagari Script. Sadhana, 133-168.
Bal, B. K. (2009). Scripts Segmentation and OCR II Nepali OCR and Bangla Collaboration.
Conference on Localized ICT Development and Dissemination across Asia. PAN
Localization Project. Laos.
Bansal, V., & Sinha, M. (2001). A complete OCR for printed Hindi text in Devanagari Script.
ICDAR (p. 0800). IEEE.
Bansal, V., & Sinha, R. (1998). Segmentation of Touching Characters in Devanagari.
Proceedings CVGIP, (pp. 371-376). Delhi.
Bansal, V., & Sinha, R. (2002). Segmentation of touching and fused Devanagari Characters.
Pattern Recognition, 875-893.
Bishnu, A., & Chaudhuri, B. B. (1999). Segmentation of Bangla handwritten text into
characters by recursive contour following. Proceedings of the International
Conference on Document Analysis and Recognition, (pp. 402-405).
Casey, R. G., & Lecolinet, E. (1996, July). A Survey of Methods and Strategies in Character
Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18.
Chaudhuri, B. B., & Pal, U. (1997). An OCR System to Read Two Indian Language Scripts:
Bangla and Devanagari (Hindi). Proceedings of the Fourth International Conference
on Document Analysis and Recognition (pp. 1011-1015). IEEE.
Dhurandhar, A., Shankarnarayanan, K., & Jawale, R. (2005). Robust Pattern Recognition
Scheme for Devanagari Script. (pp. 1021 – 1026). Springer-Verlag Berlin Heidelberg
2005.
Dongre, V. J., & Mankar, V. H. (2011). Devanagari Document Segmentation Using
Histogram Approach. International Journal of Computer Science, Engineering and
Information Technology (IJCSEIT).
Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-
Degree Compared. (pp. 1-10). Grid Computing Environments Workshop.
Garain, U., & Chaudhuri, B. B. (2002). Segmentation of touching characters in printed
Devnagari and Bangla scripts using fuzzy multifactorial analysis. IEEE Trans. Syst.
Man Cybern., (pp. 449–459).
Hansen, J. (2002). A Matlab Project in Optical Character Recognition (OCR). DSP Lab,
University of Rhode Island, 6.
Holley, R. (n.d.). How Good Can It Get? Analysing and Improving OCR Accuracy in Large
Scale Historic Newspaper Digitisation Programs. Retrieved 03 04, 2014, from
http://www.dlib.org/dlib/march09/holley/03holley.html
K.C., S., & Nattee, C. (2007). Template-based Nepali Natural Handwritten Alphanumeric
Character Recognition. Thammasat Int. J. Sc. Tech, 12(1).
Khedekar, S., Ramanaprasad, V., Setlur, S., & Govindaraju, V. (2003). Text - Image
Separation in Devanagari Documents. Proceedings of the Seventh International
Conference on Document Analysis and Recognition (ICDAR 2003) .
Kompalli, S., Nayak, S., & Setlur, S. (2005). Challenges in OCR of Devanagari Documents.
Kompalli, S., Setlur , S., & Govindaraju, V. (2006). Design and Comparison of Segmentation
Driven and Recognition Driven Devanagari OCR.
Kompalli, S., Setlur, S., & Govindaraju, V. (2009). Devanagari OCR using a recognition
driven segmentation framework and stochastic language models. IJDAR.
Kulkarni, S. (2013). Issues with Devanagari Display Type. WhiteCrow Designs.
Kumar, V., & Sengar, P. K. (2010). Segmentation of Printed Text in Devanagari Script and
Gurmukhi Script. International Journal of Computer Applications, 3.
Ma, H., & Doermann, D. (2003). Adaptive Hindi OCR using generalized Hausdorff Image
Comparison. ACM Transactions on Asian Language Information Processing, 2(3),
193-218.
Murtoza, S. M. (2005). Bangla Optical Character Recognition. BRAC University.
OCR Applications. (2015, April). Retrieved from cvision.
OCR Processing Steps [ABBYY Developer Portal]. (n.d.). Retrieved 05 22, 2014, from
http://www.abbyy-developers.eu/en:tech:processing
Optical character recognition - From Wikipedia, the free encyclopedia. (n.d.). Retrieved 03
03, 2014, from http://en.wikipedia.org/wiki/Optical_character_recognition
Optical character recognition. (2015, 05 04). (Wikipedia.org) Retrieved 5 22, 2014, from
Wikipedia: http://en.wikipedia.org/wiki/Optical_character_recognition
Optical Character Recognition. (2015, 04 17). Retrieved from Webopedia:
http://www.webopedia.com/TERM/O/optical_character_recognition.html
Pal, U., & Chaudhuri, B. (2004). Indian script character recognition: a survey. Pattern
Recognition.
Rupakheti, P., & Bal, B. K. (2009). Research Report on the Nepali OCR. Madan Puraskar
Pustakalaya.
Sabbour, N., & Shafait, F. (2013). A Segmentation Free Approach to Arabic and Urdu OCR.
SPIE Proceedings.
Scanning in Digital Age. (2015, 04 16). Retrieved from Record Nations:
http://www.recordnations.com/articles/scanning-in-digital-age/
Shakya, S., Tuladhar, S., Pandey, R., & Bal, B. K. (2009). Interim Report on Nepali OCR.
Madan Puraskar Pustakalaya.
Singh, A., Bacchuwar, K., & Bhasin , A. (2012, June). A Survey of OCR Applications.
International Journal of Machine Learning and Computing, 2.
Typographic ligature. (2016, 03 24). Retrieved 04 04, 2016, from Wikipedia:
http://en.wikipedia.org/wiki/Typographic_ligature
What is OCR? (2015, 04 17). Retrieved from ABBYY:
http://finereader.abbyy.com/about_ocr/whatis_ocr/
What is optical character recognition? (n.d.). Retrieved 03 03, 2014, from
http://www.webopedia.com/TERM/O/optical_character_recognition.html
APPENDIX I
Snapshots
A
Word Training Data
B
Segmentation
Recognition
C
Text Extractor
D
APPENDIX II
Word Recognition Data Sample
Here a sample word recognition data is presented. Each line is a result of recognition of a single word.
The data has two values separated by underscore (_). First value represents word code and second value
is the recognition confidence (prediction probability).
226_0.5
196_0.46000000000000002
130_0.62
66_0.41999999999999998
129_0.35999999999999999
36_0.40000000000000002
37_0.47999999999999998
38_0.29999999999999999
39_0.32000000000000001
40_0.76000000000000001
41_0.54000000000000004
42_0.46000000000000002
43_0.54000000000000004
44_0.57999999999999996
173_0.28000000000000003
46_0.85999999999999999
47_0.41999999999999998
48_0.47999999999999998
35_0.34000000000000002
49_0.32000000000000001
51_0.44
52_0.56000000000000005
53_0.69999999999999996
54_0.14000000000000001
55_0.34000000000000002
56_0.32000000000000001
57_0.54000000000000004
58_0.41999999999999998
59_0.92000000000000004
60_0.29999999999999999
61_0.38
62_0.68000000000000005
63_0.66000000000000003
50_0.71999999999999997
34_0.52000000000000002
346_0.16
32_0.40000000000000002
3_0.23999999999999999
4_0.47999999999999998
5_0.52000000000000002
6_0.29999999999999999