Nepali Ocr Using Hybrid Approach of Recognition: Nirajan Pant
NEPALI OCR USING HYBRID APPROACH
OF RECOGNITION
By
NIRAJAN PANT
Master of Technology in Information Technology, Kathmandu University, 2016
A Thesis
Submitted to the
Department of Computer Science and Engineering
Kathmandu University
July 2016
DECLARATION OF ORIGINALITY
Being a student, I understand that I have an ethical and moral obligation to ensure that the
dissertation I have submitted to Kathmandu University is my own, original and free of
plagiarism. All sources are properly acknowledged, and exact words are quoted or paraphrased
with appropriate references throughout the dissertation. Hence I am fully satisfied that the work
I am submitting to the Department of Computer Science and Engineering, Kathmandu
University is my own original research.
_______________
Nirajan Pant
Candidate
THESIS EVALUATION
This thesis, submitted by Nirajan Pant in partial fulfillment of the requirements for the Degree
of Master of Technology in Information Technology from the Kathmandu University, has
been read by the faculty Advisory Committee under whom the work has been done and is
hereby approved.
____________________
Dr. Bal Krishna Bal
(Supervisor)
Assistant Professor
Department of Computer Science and Engineering, Kathmandu University
_____________________
Suresh K. Regmi
(External Examiner)
Managing Director
Professional Computer System (P) Ltd.
____________________
Dr. Manish Pokharel
Head of Department
Department of Computer Science and Engineering, Kathmandu University
This thesis is being submitted by the appointed advisory committee as having met all of the
requirements of the School of Engineering at the Kathmandu University and is hereby
approved.
________________________________
Prof. Dr. Bhupendra Bimal Chhetri
Dean
School of Engineering
Kathmandu University
Date:
PERMISSION
Title: Nepali OCR Using Hybrid Approach of Recognition
Department: Computer Science and Engineering
Degree: Master of Technology in Information Technology
In presenting this thesis in partial fulfillment of the requirements for a graduate degree from
Kathmandu University, I agree that the library of this University shall make it freely available
for inspection. I further agree that permission for extensive copying for scholarly purposes may
be granted by the supervisor who supervised my thesis work or, in his (or her) absence, by the
Head of the Department. It is understood that any other use of this thesis or part thereof for
financial gain shall not be allowed without my written permission. It is also understood that due
recognition shall be given to me and to Kathmandu University in any scholarly use which may
be made of any material in my thesis.
________________
Nirajan Pant
Date:
ACKNOWLEDGEMENTS
I express my sincere gratitude to Dr. Bal Krishna Bal for supervising this thesis. I will always
be indebted to him for his continued motivation, suggestions and involvement, which
contributed significantly to the completion of this thesis.
I am thankful to Madan Puraskar Pustakalaya (MPP), Lalitpur, Nepal, which provided the
Nepali text image data for this thesis work.
Nirajan Pant
ABSTRACT
Nepali, which is an Indo-Aryan language written in the Devanagari Script, is the most widely
spoken language in Nepal with more than 35 million speakers. It is also spoken in many areas
of India, Bhutan, and Myanmar. The Optical Character Recognition (OCR) systems developed
so far for the Nepali language have a very poor recognition rate. The Devanagari script has some
special features, like the ‘dika’ and the rules for joining vowel modifiers, which make it different
from the Latin script, where every character in a word is written separately. One of the major
reasons for the poor recognition rate is errors in character segmentation. The presence of
conjuncts, compound characters and touching characters in scanned documents complicates the
segmentation process, creating major problems when designing an effective character
segmentation technique. Thus, the aim of this work is to reduce the scope of the segmentation
task so that segmentation errors can be minimized.
In this work, I have proposed a hybrid OCR system for printed Nepali text using the Random
Forest (RF) algorithm. It incorporates two different techniques of OCR – firstly, the Holistic
approach and secondly, the Character Level Recognition approach. The system first tries to
recognize a word as a whole and if it is not confident about the word, the character level
recognition is performed. Histogram of Oriented Gradients (HOG) descriptors are used to define
a feature vector of a word or character. Recognition rates of 78.87% and 94.80% are achieved
for the character level recognition approach and the hybrid approach, respectively.
Contents
ACKNOWLEDGEMENTS ..................................................................................................... IV
ABSTRACT.............................................................................................................................. V
List of Figures ....................................................................................................................... VIII
List of Tables ........................................................................................................................... IX
List of Abbreviations ................................................................................................................ X
CHAPTER I INTRODUCTION ............................................................................................... 1
1.1 Optical Character Recognition ................................................................................... 1
1.1.1 General OCR Architecture ................................................................................. 2
1.1.2 Uses and Current Limitations of OCR ............................................................... 5
1.2 Devanagari Script....................................................................................................... 6
1.3 Problem Definition................................................................................................... 10
1.4 Motivation ................................................................................................................ 11
1.5 Research Questions .................................................................................................. 12
1.6 Objectives ................................................................................................................ 12
1.7 Organization of Document ....................................................................................... 13
CHAPTER II LITERATURE REVIEW.................................................................................. 14
2.1 Different Models of Character Segmentation in OCR Systems ............................... 14
2.1.1 Dissection Techniques ..................................................................................... 15
2.1.2 Recognition Driven Segmentation ................................................................... 16
2.1.3 Holistic Technique ........................................................................................... 17
2.2 Segmentation Challenges in Devanagari OCR ........................................................ 17
2.2.1 Over Segmentation of Basic Characters .......................................................... 18
2.2.2 Handling vowel modifiers and Diacritics ........................................................ 18
2.2.3 Handling Compound characters and Ligatures ................................................ 19
2.3 Related work ............................................................................................................ 20
2.3.1 Segmentation.................................................................................................... 20
2.3.2 Recognition ...................................................................................................... 24
2.4 OCR Tools Developed for Devanagari .................................................................... 26
CHAPTER III METHODOLOGY .......................................................................................... 30
3.1 Training:................................................................................................................... 31
3.1.1 Dataset Generation: .......................................................................................... 31
3.1.2 Feature Extraction: ........................................................................................... 33
3.2 Recognition: ............................................................................................................. 33
3.2.1 Line and Word Segmentation .......................................................................... 34
3.2.2 Character Segmentation: .................................................................................. 35
3.2.3 Classifier Tool .................................................................................................. 36
3.2.4 Confidence and Threshold: .............................................................................. 40
CHAPTER IV RESULTS AND DISCUSSION ...................................................................... 42
4.1 Experimental Setup .................................................................................................. 42
4.2 Segmentation Results ............................................................................................... 42
4.3 Recognition Results ................................................................................................. 43
4.4 Computational Cost ................................................................................................. 44
CHAPTER V CONCLUSION AND FUTURE WORK ......................................................... 48
References ................................................................................................................................ 50
APPENDIX I Snapshots ........................................................................................................... A
APPENDIX II Word Recognition Data Sample ........................................................................E
List of Figures
Figure 1 General OCR Architecture .......................................................................................... 2
List of Tables
Table 1 Vowels and Corresponding Modifiers .......................................................................... 8
List of Abbreviations
ASCII – American Standard Code for Information Interchange
PP – Projection Profile
RF – Random Forest
CHAPTER I
INTRODUCTION
This thesis is about improving the performance of Nepali OCR by proper handling of
segmentation problems prevalent in the Nepali language. The assumption made is: “The
performance of Nepali OCR can be improved by using the Hybrid recognition approach”.
Based on this assumption, a Nepali language specific OCR model has been developed. The
concepts of OCR and its general architecture, the Devanagari script for the Nepali language from
the point of view of OCR, and the uses and limitations of OCR are discussed in this chapter. This
chapter includes a basic introduction of the thesis, which covers problem definition, motivation,
research questions and objectives, and a basic overview of terms and terminologies that are
used in this document.

1.1 Optical Character Recognition

Optical Character Recognition (OCR) is the conversion of images of typed, printed or
handwritten documents into computer readable text. OCR enables conversion of
texts in image data into textual data and facilitates editing, searching, republishing without
retyping the whole document. Any written or printed document, if it is to be replicated digitally,
must be retyped exactly, preserving the spellings, words, font-style, and font-size that the
document contains. Also, typing an entire document in order to replicate it is extremely time
consuming. In order to overcome the above
mentioned issues, an OCR system is needed. Documents containing character images can be
scanned, and the recognition engine of the OCR system then interprets the images and turns
images of printed or handwritten characters into machine-readable characters (e.g. ASCII or
Unicode). Therefore, OCR allows users to quickly automate data capture from image
documents, eliminating keystrokes to reduce typing costs while still maintaining a high level
of accuracy.
1.1.1 General OCR Architecture
When an OCR system recognizes text, the program first analyzes the structure of the document
image. It divides the page into elements such as blocks of text, tables, images, etc. The lines
are divided into words and then into characters. Once the characters have been singled out, the
program compares them with a set of pattern images.
The process of character recognition consists of a series of stages, with each stage passing its
results on to the next in a pipeline fashion. There is no feedback loop that would permit an
earlier stage to make use of knowledge gained at a later point in the process (Casey & Lecolinet,
1996). The recognition process can be divided into three major steps: Preprocessing,
Recognition (Feature Extraction) and Post Processing (Optical character recognition, 2015).
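The stage-wise, feedback-free flow described above can be sketched as a simple function composition. This is an illustrative skeleton only; the stage functions are hypothetical stand-ins, not the actual components of this thesis's system:

```python
def ocr_pipeline(image, preprocess, recognize, postprocess):
    """Run the three OCR stages in a fixed order; each stage passes
    its result to the next, with no feedback to earlier stages."""
    cleaned = preprocess(image)    # e.g. de-skew, binarize, segment
    raw_text = recognize(cleaned)  # classify characters or words
    return postprocess(raw_text)   # e.g. lexicon-based correction
```

With toy string stand-ins such as `ocr_pipeline("  hi ", str.strip, str.upper, lambda t: t + "!")`, the point is simply that each stage sees only its predecessor's output.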
Pre-processing
OCR software loads the image and performs pre-processing to increase the recognition
accuracy. Most of the OCRs expect some pre-defined formats of input image such as font-size
ranges, foreground, background, image format, and color format. The pre-processing steps
often performed in OCR are: i) Binarization ii) Morphological Operations and iii) Segmentation
(Hansen, 2002). Binarization is the process of converting an image to a bi-tonal image; most
OCRs work on bi-tonal images. Morphological operations are used in pre- or post-processing
(filtering, thinning, and pruning). They may be applied in degraded documents to increase the
performance of OCR.
- De-skewing
- Binarization
- Segmentation: multiple characters that are connected due to image artifacts must be
separated; single characters that are broken into multiple pieces due to artifacts must be
connected. Usually, in every OCR system, the recognition is performed at the character
level, so segmentation is a basic and essential step.
- Normalization
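As an illustration of the binarization step, the following is a minimal sketch of Otsu's global thresholding in pure NumPy. Real OCR pre-processing typically uses a library implementation (e.g. OpenCV), and this sketch assumes an 8-bit grayscale page image with dark ink on a light background:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the threshold maximizing between-class variance
    over a 0-255 grayscale image (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0       # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    """Convert a grayscale page image to a bi-tonal image:
    True = foreground (ink), False = background (paper)."""
    return gray < otsu_threshold(gray)
```

A subsequent morphological opening or closing (for instance via `scipy.ndimage.binary_opening`) could then remove speckle noise, matching the morphological-operations step above.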
Character recognition
The recognition algorithm is the brain of the OCR system. After successful pre-processing of
the input image document, the OCR algorithm can start recognizing characters and translating
them into character codes (ASCII/Unicode). Creating a one hundred percent accurate algorithm
is probably impossible when a lot of noise and different font styles are present.
Learning - The recognition algorithm relies on a set of learned characters and their
properties. It compares the characters in the scanned image file to the characters in this
learned set.
Comparison - The properties of the extracted characters are compared with those of the
learned characters to choose the best match.
There are two basic types of core OCR algorithm – matrix matching and feature extraction
(Optical Character Recognition, 2015). Matrix matching, also known as “pattern matching” or
“image correlation”, compares an image to a stored glyph on a pixel-by-pixel basis.
This relies on the input glyph being correctly isolated from the rest of the image, and on the
stored glyph being in a similar font and at the same scale. This technique works best with
typewritten text and does not work well when new fonts are encountered. Feature extraction
decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections.
These are compared with an abstract vector-like representation of a character, which might
reduce to one or more glyph prototypes. General techniques of feature detection in computer
vision are applicable to this type of OCR, which is commonly seen in most modern OCR
software. Classifier algorithms are used to compare image features with stored glyph features
and choose the nearest match. Most modern Omnifont OCR programs (ones that can recognize
printed text in any font) are based on feature extraction.
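Since this thesis later uses Histogram of Oriented Gradients (HOG) descriptors, a simplified sketch of the underlying idea is shown below. This is not the exact descriptor used in the system (it omits block normalization and bin interpolation, and the cell size of 8 is an arbitrary illustrative choice):

```python
import numpy as np

def hog_like_features(img, n_bins=9, cell=8):
    """Simplified HOG-style descriptor: per-cell histogram of
    gradient orientations, weighted by gradient magnitude."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # central differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))  # L2-normalize
    return np.concatenate(feats)
```

For a 16x16 glyph image this yields 4 cells of 9 bins each, i.e. a 36-dimensional feature vector that a classifier can compare against stored glyph features.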
Post-processing
This step can help to improve recognition quality; sometimes OCR can output a wrong character
code, and in such cases dictionary support can help to make the decision. OCR accuracy can also
be increased if the output is constrained by a lexicon – a list of words that are allowed to occur
in a document. With dictionary support, the program ensures even more accurate analysis and
recognition.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR
systems can preserve the original layout of the page and produce, for example, an
annotated PDF that includes both the original image of the page and a searchable textual
representation.
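A minimal sketch of lexicon-constrained post-processing: replace each output word by its nearest dictionary entry when the edit distance is small. The `max_dist` threshold here is an illustrative choice, not a value from this thesis:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def lexicon_correct(word, lexicon, max_dist=2):
    """Replace an OCR output word by the nearest lexicon entry,
    if one lies within max_dist edits; otherwise keep it as-is."""
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word
```

For example, the misrecognized word "nepsli" would be corrected to "nepali" against a lexicon containing it, while a word far from every lexicon entry is left unchanged.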
The exact mechanisms that allow humans to recognize objects are yet to be understood, but the
three basic principles are already well known by scientists – integrity, purposefulness and
adaptability (IPA). The most advanced optical character recognition systems are focused on
replicating natural or “animal like” recognition. At the heart of these systems lie these three
principles. The principle of integrity says that the observed object must always be considered
as a “whole” consisting of many
interrelated parts. The principle of purposefulness supposes that any interpretation of data must
always serve some purpose. And the principle of adaptability means that the program must be
capable of self-learning. These principles endow the program with maximum flexibility and
intelligence.

1.1.2 Uses and Current Limitations of OCR

OCR is widely used to recognize and search text from electronic documents or to publish the
text on a website ( Singh, Bacchuwar, & Bhasin , 2012). It has enabled scanned documents to
become more than just image files, turning into fully searchable documents with text content
that is recognized by computers. OCR is a vast field with a number of varied applications such
as invoice imaging, legal industry, banking, health care industry etc. It is widely being used in
digital libraries for searching scanned books and magazines (e.g. Google books), data entry
such as bill payment, passport processing, text-to-speech synthesis, machine translation, and
text mining.
Optical character recognition has been applied to a number of applications. Some of them are
listed below:
- Handwriting Recognition
OCR has simplified the data collection and analysis process. With its continuous advancement,
more and more applications powered by OCR are being developed in various fields.
- The latest software can recreate tables and the original layout
Although an OCR system has a lot of advantages, it also has many limitations. Some of the
limitations are:
- Limited Documents: It does not perform well with documents containing both
- Accuracy: The accuracy depends upon the quality and type of document, including
the font used. Errors that occur during OCR include misreading letters, skipping
over letters that are unreadable, or mixing together text from adjacent columns or
image captions.
- Additional Work: OCR is not error-proof; it also makes mistakes. A person
has to manually compare the original image document and the recognized text for
corrections.
- Not worth doing for small amounts of text: OCR involves a long setup and
correction process, so it may not be feasible or worthwhile for small amounts of
documents.
1.2 Devanagari Script

Many languages including Sanskrit, Nepali, Hindi, Marathi, Bihari, Bhojpuri, Maithili, and
Newari are written in Devanagari, and over 500 million people use it. Devanagari is a syllabic-
alphabetic script with a set of basic symbols - consonants, half-consonants, vowels, vowel-
modifiers, digits and special diacritic marks (Kompalli, Setlur, & Govindaraju, 2006)
(Kompalli, Setlur, & Govindaraju, 2009). The script has its own specified composition rules for
combining vowels, consonants and modifiers. Modifiers are attached to the top, bottom, left or
right side of other characters. All characters of a word are stuck together by a horizontal line,
called dika, which runs at the top of core characters (Khedekar, Ramanaprasad, Setlur, &
Govindaraju, 2003). A Devanagari character may be formed by combining one or more
alphabets; these are referred to as composite characters or conjuncts. For example, the
half-consonant ka (क्) and the consonant ya (य) combine to produce the conjunct character
kya (क्य). Consonant-modifier and conjunct-modifier characters are produced by combining
consonants and conjuncts with vowel modifiers (e.g. क् + ा → का, क्य + ा → क्या). This
combination of alphabets contrasts with
Latin in which the number of characters is fixed. A horizontal header line (dika) runs across the
top of the characters in a word, and the characters span three distinct zones (Figure 2); an
ascender zone above the Dika, the core zone just below the Dika, and a descender zone below
the baseline of the core zone. Symbols written above or below the core will be referred to as
ascenders and descenders respectively. Half consonants followed by a consonant and a vowel
modifier will be referred to as a conjunct.
Nepali, originally known as Khas Kurā, is an Indo-Aryan language with around 17 million
speakers in Nepal, India, Bhutan, and Burma. Nepali is written in Devanagari, which developed
from the Brahmi script in the 11th century AD. Nepali started to be written in this script
from the 12th century AD1. In Nepali, there are 13 vowels (swaravarna), 36 consonants
(vyanjanvarna) (33 pure consonants and 3 composite consonants), 10 numerals, and half-letters.
When vowels come together with consonants, they are written above, below, before or
after the consonant they belong to using special diacritical marks. When vowels are written in
this way they are known as modifiers. In addition, consonants occur together in clusters, often
1 http://www.omniglot.com/writing/nepali.htm
called conjunct consonants. Altogether, there are more than 500 different characters (K.C. &
It is written and read from left to right in a horizontal line. Many languages in India use different
variants of this script. The Nepali language uses a subset of characters from the Devanagari
script set for written purposes.

[Figure 2: Parts of a Devanagari word - the dika (header line), the ascender zone, the head
line, the base line, the descender zone, and a compound character.]

Some characters of Devanagari script are language specific. But the basic
vowels, consonants and modifiers are same in all languages. For example ‘Nukta’ is used in
Hindi but not in Nepali. Similarly, letter ‘LLA’ is also not used in Nepali.
Most consonants have half-forms, which when combined with other consonants yield conjuncts
(Pal & Chaudhuri, 2004).
Table 3 Consonants and their half-forms (consonant followed by its half form, where one was
listed):

क क्   ख ख्   ग ग्   घ घ्   ङ
च च्   छ   ज ज्   झ झ्   ञ ञ्
ट   ठ   ड   ढ   ण ण्
त त्   थ थ्   द   ध ध्   न न्
प प्   फ फ्   ब ब्   भ भ्   म म्
य य्   र   ल ल्   व व्   श श्
स स्   ष ष्   ह ह्
Numerals:
०१२३४५६७८९
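Since OCR output is Unicode, the Devanagari numerals occupy code points U+0966 to U+096F. A small illustrative helper (not part of this thesis's system) can map them to ASCII digits when downstream processing expects Western numerals:

```python
def devanagari_digits_to_ascii(text):
    """Map Devanagari numerals (U+0966..U+096F) to ASCII digits,
    leaving all other characters unchanged."""
    out = []
    for ch in text:
        offset = ord(ch) - 0x0966  # 0x0966 is DEVANAGARI DIGIT ZERO '०'
        out.append(chr(ord('0') + offset) if 0 <= offset <= 9 else ch)
    return "".join(out)
```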
Letter Variants:
In writing Nepali, many letter variations are found in written or printed documents. This is
because fonts have different writing styles. Some characters have letter variants that differ
between the old and new writing styles. The old variants of some letters (e.g. letter अ and letter
ण) are not used these days, but old documents frequently contain these forms. A set of letter
variants exists for such characters.
There are many conjuncts which are written as a single character, e.g. द्द, द्म, हृ; i.e. sometimes two
or more consonants can combine to form a new complex shape. Sometimes the shape of the
compound character is quite different from the shapes of its constituent characters. The frequency
of appearance of compound characters in any text page is much lower than that of basic characters.
ट + ट → ट्ट    ट + ठ → ट्ठ    द + द → द्द    द + म → द्म    श + र → श्र    त + र → त्र    द + ध → द्ध    क + ष → क्ष
In writing Nepali, many consonants come together in a cluster to form Typographic Ligatures.
These are also frequently found in Nepali. The number of ligatures employed may be language-
dependent; thus many more ligatures are conventionally used in writing Sanskrit than in written
Nepali (Typographic ligature, 2016). Using the 33 consonants, hundreds of ligatures can be
formed in total (the number of composite character classes exceeds 5000), most of which are
infrequent.
All the consonant characters, vowel characters, compound characters and modifiers are connected
by the ‘dika’, and it looks as if the characters are hanging from a rope. This is a special feature
of the Devanagari script and it does not appear in the Latin script. There are many shapes that
look similar to one another.
These characteristics of the Devanagari script pose challenges for DOCR (Devanagari
Optical Character Recognition). The Devanagari script differs from the Latin script in these
characteristics, so the same techniques from Latin OCR may not work well for DOCR. Thus
finding a technique suitable for segmentation of text images in Devanagari script is also
challenging.
1.3 Problem Definition

An OCR system has to handle documents composed of varying fonts, and the main thing we
want is accuracy of recognition.
Today we have many OCR project releases for Nepali as well as Hindi and Sanskrit. But their
performance has not been satisfactory. The problem lies in inadequate handling of conjuncts
and compound characters. This issue has to be seriously dealt with in order to develop a reliable
Nepali OCR. In this research work, a hybrid approach to recognition, along with compound
character handling, is applied to improve Nepali OCR.
1.4 Motivation
Digital documents have become a part of everyday life. Anyone can take advantage of scanning
their documents, making them easy to reference, organize, protect and store. There is no
limitation to the types of documents that can be digitized. Thus the increased interest forces us
to deal with any type of document that someone may wish to observe, such as images. Plain
text has a number of advantages over scanned copies of text. A
text document can be searched, edited, reformatted, and stored more compactly but it is not
possible in the case of images. One will not be able to edit, search or reformat any text that
visually appears in images. Images are nothing more than just a collection of pixels for a
computer.
Extracting the text data from images is important for reading, editing and analyzing the text
content contained in the images. Computers cannot directly recognize the text in images.
Thus the design of a computer program called an “OCR” that can recognize text in digital
documents (images) is important.
OCR technology for some scripts like Roman, Chinese, Japanese, Korean and Arabic is fairly
mature and commercial OCR systems are available with accuracy higher than 98%, including
OmniPage Pro from Nuance or FineReader from ABBYY for Roman and Cyrillic scripts, and
Nuance for Asian languages. Despite ongoing research on non-Latin script recognition, most
of the commercial OCR systems focus on Latin-based languages. OCR for Indian scripts, as
well as many low-density languages, is still in the research and development stage. The
resulting systems are often costly and do little to advance the field (Agrawal, Ma, & Doermann,
2009).
In the case of Nepali OCR, the segmentation process cannot achieve full accuracy because of
the dika, conjuncts, and compound and touching characters. These problems directly affect
successful recognition and thus result in decreased performance.
Due to the presence of language-specific constructs, the Devanagari script requires
different approaches to segmentation. Thus working on a better approach for segmentation and
recognition is required. The Devanagari script is different from
the Latin script due to its writing arrangement.
may not apply to the Devanagari script. The main challenges in segmentation for Devanagari
OCR are: i) Handling modifiers and diacritics, and ii) Handling compound characters and
ligatures (connected components). Dealing with these two main challenges is necessary to
achieve better accuracy. One major difficulty in improving the performance of an OCR system
lies in segmentation.

1.5 Research Questions

- What are the current segmentation and recognition techniques for Devanagari (Nepali)
OCR?
- Can the performance of Nepali OCR be improved by using the combined approach of
Holistic methods and character level dissection technique?
1.6 Objectives
This research is focused on improving the performance of Nepali OCR. This research will be
helpful for understanding the segmentation approaches used for Devanagari and Bangla OCR,
and underlying challenges and the improvements required. A better approach for designing an
OCR system for Nepali is the expected outcome of this research. Moreover, improved
recognition performance is expected. The objectives of this research are:
- To implement a hybrid approach of recognition that uses both the holistic approach and
character level recognition
- To determine and evaluate the hybrid approach for improved performance of Nepali
OCR
1.7 Organization of Document

Chapter 1 includes the basic introduction of the thesis, which covers problem definition,
motivation, research questions and objectives, and a basic overview of OCR and the Devanagari
script. Chapter 2 reviews the different segmentation
methods and recognition methods proposed for Devanagari optical character recognition. This
chapter also gives information about various OCR tools developed so far for Devanagari.
Chapter 3 discusses the methods applied to conduct this research work and experiment. In
this chapter, different components and phases of the applied method are also discussed. In chapter
4 segmentation results and recognition results are presented. The computation cost for character
level recognition technique and holistic approach is also described in this chapter. Finally,
chapter 5 concludes the research; the contributions and possible future improvements are
presented there.
In conclusion, in this chapter, the basic concepts of optical character recognition and a general
architecture of OCR, Devanagari script for Nepali language from the point of view of OCR are
discussed. The motivation of the research, research questions, objectives and goals of this
research are also presented.
The next chapter will discuss different segmentation methods and recognition approaches
proposed in the literature. It will also discuss various OCR tools developed so far for Devanagari.
CHAPTER II
LITERATURE REVIEW

A typical OCR process involves several steps such as segmentation, feature
extraction, and classification. Different models or techniques are proposed for character
segmentation. These techniques can be categorized into three major strategies – dissection
technique, recognition driven technique, and holistic methods. The use and selection of these
techniques highly depends on the construct of script and language. Various feature extraction
and classification techniques have been proposed by different researchers. The feature extraction
algorithms may rely on the morphology of characters for better classification. Classification is
one of the major steps in OCR, and the design of a good classifier is also a challenging task. Mostly,
segmentation is determined by the nature of the material to be read and by its quality.
Segmentation is the initial step in a three-step procedure (Casey & Lecolinet, 1996):
3) Find the member of a given symbol set whose attributes best match those of the input,
A character is a pattern that resembles one of the symbols the system is designed to recognize.
But to determine such a resemblance the pattern must be segmented from the document image.
Casey & Lecolinet (Casey & Lecolinet, 1996) have classified the segmentation methods into
three pure strategies, based on how segmentation and classification interact in the OCR process:
1) Dissection techniques, in which the image is cut into components based on character-like
properties. This process of cutting up the image into meaningful components is given
the generic name “dissection”.
2) Recognition driven segmentation, in which the image is searched for components that
match the classes of the system's alphabet.
3) Holistic methods, in which the system seeks to recognize words as a whole, thus
avoiding the need to segment words into characters.
2.1.1 Dissection Techniques

Dissection decomposes the image into sub-images using general properties of the valid
characters such as height, width, and separation from neighboring components. An intelligent
analysis of the image is carried out; however, classification into symbols is not involved at this
stage. One criterion used for dissection is the detection of end-of-character.
The analysis of the projection of a line of print has been used as a basis for segmentation of
non-cursive writing. When printed characters touch, or overlap horizontally, the projection
often contains a minimum at the proper segmentation column (Casey & Lecolinet, 1996). A
peak-to-valley function has been designed to improve this method. A minimum of the
projection is located and the projection value noted. A vertical projection alone is less
satisfactory, and additional analysis is required in order to separate joined characters reliably.
The intersection of two characters can give rise
to special image features. Consequently, dissection methods have been developed to detect
these features and to use them in splitting a character string image into sub-images. Only
image-based information is used in this process.
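The projection-based dissection described above can be sketched as follows: compute the column-wise ink counts of a binarized word image and cut at columns where the projection falls to a minimum. This is a simplified sketch assuming characters that do not touch or overlap horizontally:

```python
import numpy as np

def column_projection(binary):
    """Vertical projection profile: count of ink pixels per column."""
    return binary.sum(axis=0)

def cut_at_minima(binary, max_ink=0):
    """Split a word image at columns whose projection does not
    exceed max_ink (candidate segmentation columns)."""
    proj = column_projection(binary)
    gaps = proj <= max_ink
    segments, start = [], None
    for x, is_gap in enumerate(gaps):
        if not is_gap and start is None:
            start = x                      # character begins
        elif is_gap and start is not None:
            segments.append((start, x))    # character ends
            start = None
    if start is not None:
        segments.append((start, len(gaps)))
    return segments  # list of (start_col, end_col) per character
```

Raising `max_ink` above zero approximates the peak-to-valley idea: columns with only a little ink (such as under the dika) become candidate cut points as well.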
2.1.2 Recognition Driven Segmentation

This approach also segments words into individual characters, which are usually letters. It is
quite different from dissection in that no feature-based dissection algorithm is
employed. Rather, the image is divided systematically into many overlapping pieces without
regard to content. These pieces are classified, and segmentation emerges as a byproduct of
recognition,
which may itself be driven by contextual analysis. The main interest of this category of methods
is that they bypass the segmentation problem: No complex “dissection" algorithm has to be
The basic principle is to use a mobile window of variable width to provide sequences of
tentative segmentations, which are confirmed (or not) by character recognition. Multiple
sequences are obtained from the input image by varying the window placement and size. Each
sequence can be evaluated either serially or in parallel. In the first case, recognition is done
iteratively in a left-to-right scan of words, searching for a "satisfactory" recognition result.
The parallel method proceeds in a more global way: it generates a lattice of all (or many)
possible feature-to-letter combinations, and the final decision is found by choosing an optimal
path through the lattice (Casey & Lecolinet, 1996).
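The serial (left-to-right) strategy can be sketched as follows. Here `classify` is a stand-in for any character classifier that maps a window of columns to a (label, confidence) pair, and the window widths are illustrative assumptions, not values from the literature:

```python
def recognize_serial(columns, classify, max_w=4):
    """Serial recognition-driven segmentation sketch: at each position,
    try windows of width 1..max_w, keep the most confident
    classification, then advance past the accepted window."""
    pos, labels = 0, []
    while pos < len(columns):
        best = None  # (label, width, confidence)
        for w in range(1, max_w + 1):
            window = columns[pos:pos + w]
            if not window:
                break
            label, conf = classify(window)
            if best is None or conf > best[2]:
                best = (label, w, conf)
        labels.append(best[0])
        pos += best[1]
    return labels
```

A parallel method would instead keep all window hypotheses and select the best-scoring path through the resulting lattice.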
2.1.3 Holistic Technique
The holistic technique is the opposite of the classical dissection approach. It recognizes a
word as a whole and thus skips the segmentation of words into characters. This involves
matching the features of the unknown word against those of reference words stored in a
database.
Since a holistic approach does not directly deal with characters or alphabets, a major drawback
of this class of methods is that their use is usually limited to a predefined set of words. A
training stage is thus mandatory to expand or modify the scope of possible words. This
property makes this kind of method more suitable for applications where the lexicon is
statically defined, like check recognition. Such methods can also be tailored to a specific user
or a particular vocabulary; recognition is performed by comparing the features of the unknown
word with those of the references stored in the lexicon.
2.2 Challenges in Devanagari OCR
Devanagari, Bangla, and Gurmukhi share the same issues and challenges, as they follow the
same structure of characters and writing style (e.g., composition rules, headerline, conjuncts,
compound characters, position of vowel modifiers, etc.). The challenges and open problems
related to Devanagari OCR are outlined below. These problems are unique to Devanagari and
Bangla, and hence the solutions adopted by the OCR systems for other scripts cannot be
directly adapted to these scripts.
2.2.1 Over Segmentation of Basic Characters
Some of the characters in Devanagari, such as ग, ण, and श, have two basic components.
Similarly, the letter Kha (ख) also has a structure with two visually separate components and
looks like a combination of the letters Ra and Va (रव). In such cases the OCR system gets
confused and cannot segment a complete basic character. Sometimes poor document quality
also leads to over-segmentation of characters. Some of these problems can be handled during
post-processing.
2.2.2 Handling Vowel Modifiers
The Devanagari script consists of several vowel modifiers. When vowel modifiers come
together with core consonants, they take position at the top, bottom, left, or right and result in
a new shape. Identification and recognition of the modifiers is an important task. The main
challenge is to handle the large number of characters that are formed when the vowel modifiers
combine with the basic characters (Bag & Harit, 2013). Sometimes vowel modifiers come
together with other diacritics (for example, the vowel modifier I (िा) and Chandravindu (ा),
च िह = च + ह + िा + ा). In such cases, segmentation becomes even more complex.
2.2.3 Handling Compound Characters and Ligatures
Devanagari script has a large set of compound characters and ligatures. Sometimes it is hard
to identify the constituent characters of a compound character simply by analyzing its shape.
Thus, handling a large set of compound characters and ligatures is a challenging task.

Apart from these segmentation challenges, there are other challenges too, such as
typographical errors and irregular word and character spacing. Kulkarni (2013) has studied
the display typefaces of the Devanagari script. He noticed that most of the existing digital
display typefaces in Devanagari are inconsistent: they have imbalanced letter structures,
limited or inadequate matras, and ill-designed conjuncts. They also seem outdated and
overused, and many of them copy features and styles from existing Latin typefaces. He
recommends looking at Devanagari type design independently and not as secondary to Latin
type design. This inconsistency and these imbalanced letter structures in typefaces add
complexity to the OCR system. Because of the structural complexities of Indian scripts, a
character recognition module that makes use of only the image information (shape and
structure) of a character is prone to give incorrect results. To improve the recognition accuracy
rate, it is necessary to use language knowledge to correct the recognition result. There has
been limited use of post-processing in Indian OCR systems, and more efforts are needed in
this direction (Bag & Harit, 2013). Almost all Indic scripts need character reordering to
re-organize the output from visual order to logical (Unicode) order. Since most OCR systems
operate strictly from left to right, the characters are scanned and recognized in visual order;
this output needs to be reordered during post-processing.
Apart from the above-mentioned problems, which directly pertain to the OCR systems, there is
a need for a major effort to address related problems like scene text recognition, restoration of
degraded documents, and large scale indexing and search in multilingual document archives.
2.3 Related Work
Various works have been reported in the literature for the correct segmentation of characters
in Devanagari OCR. At the same time, various feature extraction methods and character
recognition algorithms have been proposed. Some of the works from the literature are briefly
described below.
2.3.1 Segmentation
Bansal & Sinha (1998) have considered the problem of conjunct segmentation in the context
of the Devanagari script. Their conjunct segmentation algorithm takes as input the image of
the conjunct and the co-ordinates of its enclosing box; the position of the vertical bar and the
pen width are also inputs to the algorithm. For extracting the second constituent character of
the conjunct, the continuity of the collapsed horizontal projection is checked. Bansal & Sinha
(2001) have divided words into top and bottom strips and then applied vertical projection to
segment the characters. Ma & Doermann (2003) identified Hindi words and then segmented
them into individual characters using the projection profile technique (isolating top modifiers,
separating bottom modifiers, and extracting core characters). Composite characters are
identified and further segmented based on the structural properties of the script and statistical
information; the Collapsed Horizontal Projection Technique is adopted from Bansal & Sinha
(2001) for conjunct segmentation. Bansal & Sinha (2002) present a two-pass algorithm for the
segmentation and decomposition of Devanagari composite (touching and fused)
characters/symbols into their constituent symbols. The proposed algorithm extensively uses
the structural properties of the script. In the first pass, words are segmented into easily
separable characters/composite characters, and statistical information about the height and
width of each separated box is used to hypothesize whether a character box is composite. In
the second pass, the hypothesized composite characters are further segmented using the
structural properties of the script.
Agrawal et al. (Agrawal, Ma, & Doermann, 2009) have generated character glyphs from font
files and passed them through feature extraction routines. For each character segmented in the
document image, feature extraction is performed. With the objective of grouping broken
characters, an intelligent character segmentation and recognition method was developed. For
each word, connected component analysis is performed. Kompalli et al. (Kompalli, Nayak, &
Setlur, 2005) have
proposed a projection profile based method for character segmentation from words. Words are
separated into ascenders, core components, and descenders. Gradient features are used to
classify segmented images into different classes: ascenders, descenders, and core components.
Core components contain vowels, consonants, and frequently occurring conjuncts. Core
components are pre-classified into four groups based on the presence of a vertical bar: no
vertical bar (e.g., छ, ट, ह), vertical bar at the center (e.g., व, फ, क), at the right (e.g., व, त, म), or at
multiple locations (e.g., कय, स, सत). Four neural networks are used for classification within these groups.
Due to ascender and core character separation, characters may be divided into multiple
segments during OCR. Positional information from segmented images is used to reconstruct
the original character. For recognition of valid but not frequently occurring conjuncts, Kompalli
et al. (2005) have attempted to segment the conjunct characters into their constituent consonants
and classify segmented images. For the segmentation of valid but not frequently occurring
conjuncts, authors have examined breaks and joins in the horizontal runs (HRUNS) of a
candidate conjunct character and built a block adjacency graph (BAG). Adjacent blocks in
the BAG are selected from left to right as segmentation hypotheses. Both left and right images
obtained from each segmentation hypothesis are classified using conjunct/vowel classifiers.
The segmentation hypothesis with highest confidence is accepted. Post processing is carried
out using a lexicon with 4,291 entries generated from the Devanagari data set. Kumar & Sengar
(2010) present a projection profile technique for printed Devanagari text: the horizontal
projection histogram is computed and the position of the headerline is located. This separates
the word into top and bottom
strip. A vertical projection histogram for each strip is computed for the segmentation of the
top modifiers and characters. Conjunct/fused characters are not considered in this paper, and
the results are for clean documents containing no conjunct/fused characters. A projection
profile technique is proposed in (Dongre & Mankar, 2011) for the segmentation of Devanagari
text images. To normalize the image against the thickness of the characters, the input image is
thinned. Then the vertical projection histogram is computed and the locations containing single
white pixels are noted. These points are taken as the boundaries of individual characters. The
proposed method skips the process of headerline removal; as a result, during character
segmentation, words are segmented into more symbols than are actually present in the word.
Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2006) have extended their previous work
(Kompalli, Nayak, & Setlur, 2005), and two different approaches, segmentation-driven and
recognition-driven segmentation, are compared for OCR of machine-printed, multi-font
Devanagari text. They have proposed a recognition-driven approach that combines classifier
design with segmentation using the hypothesize-and-test paradigm. Word images are examined
along horizontal runs (HRUNS) to build a Block Adjacency Graph (BAG). Given the BAG of
a word, histogram analysis of block width is used to identify the longest blocks as the headline
(dika) and isolate ascenders from core components. Regression over the centroids of these core
connected components is used to determine a baseline for the word. The classifier is used to
obtain hypotheses for word segments like consonants, vowels, or consonant-ascenders. If the
confidence of the classifier is below a threshold, the algorithm attempts to segment the
conjuncts, consonant-descenders, and half-consonants. Thus, the classifier results are used to
guide further segmentation. Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2009) have
proposed a graph-based recognition-driven segmentation methodology for Devanagari script
OCR using the hypothesize-and-test paradigm. This work is a further improvement of their
previous work (Kompalli et al., 2006). A BAG is constructed from a word image, and
ascenders and core components are isolated. The core components can be isolated characters
that do not need further segmentation, or conjuncts and fused characters that may or may not have
descenders. Multiple hypotheses are obtained for each composite character by considering all
possible combinations of the generated primitive components and their classification scores. A
stochastic model for word recognition has been presented, which describes the design of a
Stochastic Finite State Automaton (SFSA) that outputs word recognition results based on the
component hypotheses and n-gram statistics. It combines classifier scores, script composition
rules, and grammar models to prune the top-n choice results. They have not considered special
diacritic marks like avagraha, udatta, and anudatta, special consonants, punctuation, or
numerals. Symbols such as anusvara, visarga, and the reph character often tend to be classified
as noise.
Work                      Segmentation technique                                       Reported accuracy
Kompalli et al. (2005)    BAG analysis                                                 93.81% for consonants and vowels
Kompalli et al. (2006)    Graph-based character segmentation                           39.58% with segmentation-driven OCR; 44.10% with recognition-driven OCR
Kompalli et al. (2009)    Graph-based recognition-driven segmentation (BAG)            accuracy of recognition-driven segmentation ranges from 72% to 90%
Ma & Doermann (2003)      Structural properties and statistical information            average recognition accuracy of 87.82%
Agrawal et al. (2009)     Font-model-based segmentation, connected component analysis  92% character-level recognition
Bansal & Sinha (1998)     Collapsed horizontal projection segmentation                 85% recognition rate on segmented touching characters
Bansal & Sinha (2002)     Collapsed horizontal projection                              85% recognition rate
For the Nepali HTK OCR (Shakya, Tuladhar, Pandey, & Bal, 2009; Bal, 2009), the projection
profile technique has been adopted for character segmentation. The process includes removal
of the headerline and upper modifiers and then applying a multi-factorial analysis technique to
segment the basic characters. The method is able to segment isolated characters along with half
and conjoined characters. For the classifier, a Hidden Markov Model (HMM) from the HTK
toolkit is used. Rupakheti & Bal (2009) adopted the projection profiling technique for the
Nepali Tesseract OCR. The headerline width is identified, and then the vertical projection
histogram of the word to be segmented is computed. Then histogram analysis is done to mark
the starting and ending
boundary of each character fragment, taking the headerline as a threshold value that qualifies
a segment to be separated.
Most of the researchers have adopted the projection profiling technique for character
segmentation. For Devanagari character segmentation, this technique includes two phases:
preliminary detection of the headerline, and use of its position as a reference to isolate
ascenders, core components, and descenders. For segmentation of compound characters,
Bansal & Sinha (1998, 2001, 2002) have proposed the collapsed horizontal projection
technique, and Kompalli et al. have proposed graph analysis for compound character
segmentation. Ma & Doermann (2003) have used structural properties of the script and
statistical information to segment compound characters. Kompalli et al. (2006, 2009) have
proposed a graph-based recognition-driven character segmentation technique to overcome the
problems of compound character segmentation, which is usually difficult using projection
profile techniques.
2.3.2 Recognition
Various feature extraction algorithms and classifiers have been proposed for Devanagari
optical character recognition, all focusing on improved performance. The shaded portions of
the characters are used as features by Chaudhuri & Pal (1997); the classifiers used were
decision trees. Kompalli et al. (2005) have used GSC features and a Neural Network as a
classifier. Kompalli et al. (2006) have used GSC features and a k-nearest neighbor classifier.
Ma & Doermann (2003) suggest the use of statistical and structural features; they have used
Generalized Hausdorff Image Comparison (GHIC) for the recognition of characters. The
different feature extraction methods and classifiers used by various researchers in the field of
Devanagari OCR are listed in Table 7.
Table 7 Feature Extraction and Classifiers in Devanagari OCR
Bishnu & Chaudhuri (1999) have proposed a recursive contour following method for character
segmentation in Bangla. Based on Bangla writing styles, different zones across the height of
the word are detected; these zones provide certain structural information about the constituent
characters of the word. Recursive contour following solves the problem of overlap between
successive characters. Garain & Chaudhuri (2002) have proposed a method for segmenting
the touching characters in printed Bangla script. Through a statistical study they noted that
touching characters occur mostly at the middle of the middle zone, and hence certain suspected
points of touching were found by inspecting the pixel patterns and their relative position with
respect to the predicted middle zone. The geometric shape is cut at these points and the OCR
scores are noted; the best score gives the desired result. Habib (Murtoza, 2005) has proposed
a projection profiling technique for Bangla character segmentation. The width of the headline
is variable because of print style (font size), so sometimes the headline cannot be removed
cleanly. Two morphological operations, thinning and skeletonization, have been tried to
overcome this problem. These operations remove pixels, and the remaining pixels make up the
image skeleton, from which characters can be segmented.
The Arabic OCR framework proposed by Sabbour & Shafait (2013) takes raw Arabic script
text files as input in the training phase. The training part outputs a dataset of ligatures, where
each ligature is described by a feature vector. The recognition part takes as input an image
specified by the user and uses the dataset of ligatures generated by the training part to convert
the image into text. The evaluation includes versions of degraded text images, which aim at
measuring the robustness of the recognition system against possible image defects such as
jitter. The reported accuracy is 91% for clean Urdu text and 86% for clean Arabic text.
The development of OCR tools has been initiated by many organizations and individuals in
India and Nepal. C-DAC, from India, has developed an OCR system (Chitrankan) for the
Hindi and Marathi languages. Madan Puraskar Pustakalaya (MPP) from Nepal has also
developed OCR projects for the Nepali language (based on the Tesseract open source OCR
engine and the HTK toolkit). Ind.senz (founded by Dr. Oliver Hellwig) is developing OCR
software for the Devanagari script (Sanskrit, Hindi, and Marathi languages). The other
projects are Parichit and the Sanskrit/Hindi Tesseract OCR. These tools are briefly described
below.
Chitrankan: Chitrankan is an OCR (Optical Character Recognition) system for Hindi and
other Indian languages developed by C-DAC. It works with the Hindi and Marathi languages
along with embedded English text. It comes with facilities like a spell checker, saving
recognized text in ISCII format, and exporting text as .RTF for editing in any word processor.
Skew detection and correction up to ±15°, automatic text and picture region detection, and
advanced DSP (Digital Signal Processing) algorithms to remove noise and back page
reflection are also implemented. The recognized text is not very accurate, so manual editing is
required. The supported operating systems are Windows XP and older versions of Windows [2].
[2] http://cdac.in/index.aspx?id=mlc_gist_chitra
Parichit: This project is based on the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/).
The front end is a modified version of VietOCR (http://vietocr.sourceforge.net/). The
project aims to create open source OCRs for Indian and South Asian languages. It also aims
to create high quality training data for building Tesseract language models for each of the
Indian and South Asian languages [3].

Sanskrit / Hindi Tesseract OCR (traineddata files for Devanagari fonts for
Tesseract OCR 3.02+): Tesseract OCR 3.02 provides hin.traineddata for recognizing texts
in the Devanagari script. However, the training texts, images, and box files are not provided,
so it is difficult to improve the accuracy by further improving the traineddata. It is noted that
recognition is more accurate and faster if the training is done with the same or a similar font
as used in the text to be OCRed. This project therefore aims at creating traineddata for various
Devanagari fonts, so that Tesseract OCR can be used for the recognition of documents written
in various Devanagari fonts [4].
Ind.senz OCR Programs: OCR programs are available for the Hindi, Marathi, and Sanskrit
languages. These are the only Devanagari OCR programs developed and available for
professional use. Ind.senz describes the usability of the programs for data entry companies,
publishing houses, and universities, wherever large amounts of Hindi and Sanskrit text have
to be digitized. The programs take text images and transform them automatically into
computer-editable text in Unicode format. Ind.senz reports the achievement of high accuracy
rates on typical Devanagari fonts. The OCR programs are paid software; a demo version is
also available [5].
[3] http://code.google.com/p/parichit
[4] http://sourceforge.net/projects/tesseracthindi
[5] http://www.indsenz.com/int/index.php
Google Drive OCR: Google has launched Nepali OCR in Google Drive. The OCR
technology is free for Google Drive users and shows good performance on single-column
documents. It can retain some formatting like bold, font size, font type, and line breaks, but
lists, tables, columns, footnotes, and endnotes are likely not to be detected. Though it shows
good performance, one needs to be a Google Drive user and surrender one's documents to
Google's servers.
HTK Toolkit Based OCR: This OCR project was developed under Phase I of the PAN
Localization Project by Madan Puraskar Pustakalaya (http://madanpuraskar.org/). The
development of the Nepali OCR was done with guidance and direct training from the
Bangladesh team. The OCR project was closed with the release of a beta version [6]. The
source files and executables are available on http://nepalinux.org [7].
Tesseract Based Nepali OCR: Under the initiatives of MPP and Kathmandu University
(KU), efforts were made to develop a Tesseract based Nepali OCR under the PAN
Localization Project Phase II. In this project, 202 Nepali characters, including basic characters
and some derived characters (characters with ukar, ekar, and aikar), were trained via Tesseract
2.04. It is available for download at http://nepalinux.org, and it can also be downloaded from
the PAN Localization Project website [8].

After the release of the HTK based beta version of the Nepali OCR, the Tesseract based
Nepali OCR was developed in 2009. Since then, the development and enhancement of the
Nepali OCR has been discontinued, and these tools have not been updated for a long time. In
the current scenario, new versions of operating systems and new platforms have been released.
The tools developed do not meet the requirements of the new versions of operating systems
[6] Findings of PAN Localization Project, PAN Localization Project 2012; ISBN: 978-969-9690-02-2
[7] http://nepalinux.org/index.php?option=com_content&task=view&id=46&Itemid=53
[8] http://www.panl10n.net/madan-puraskar-pustakalaya-nepal/
like Windows 7 and Windows 8.1. It is also necessary to develop OCR tools for other platforms.
In conclusion, in this chapter, various works and methods for the correct segmentation of
characters are discussed, and various feature extraction methods and character recognition
algorithms are described briefly. Most of the research focuses on the improvement of the
segmentation process; the methods include projection profile techniques, collapsed horizontal
projection, and graph-based techniques. Various feature extraction methods and classifiers
proposed for the successful recognition of Devanagari characters are also presented, and
finally, various tools developed for Devanagari OCR are reviewed.

The next chapter will discuss the methods applied to conduct this research and experiment,
including the different components and phases of the applied method.
CHAPTER III
METHODOLOGY
The research works on Devanagari Optical Character Recognition suggest that the
segmentation process cannot achieve full accuracy because of noise, touching characters,
compound characters, variation in typefaces, and many similar looking characters. Because of
script-specific constructs such as modifiers (South-East Asian scripts), writing order, or
irregular word spacing (Arabic and Chinese), each script requires different approaches to
segmentation (Agrawal, Ma, & Doermann, 2009). The Devanagari script also possesses its
own constructs, which differ totally from Latin.
The most practiced character dissection method for Devanagari works by removing the
headerline (dika) and separating the lower and upper modifiers, which makes it easy to extract
the basic characters but increases the complexity of extracting the modifiers. The modifiers
get broken, and it is difficult to note their position in a sequence of segmented characters and
restore their original shape. To minimize the overhead of component-level segmentation and
to minimize the errors due to inaccurate dissection, a hybrid approach which combines the
holistic method and the dissection technique is proposed here. Kompalli et al. (Kompalli,
Setlur, & Govindaraju, 2009) have also proposed a novel graph-based recognition-driven
segmentation methodology for Devanagari script OCR using the hypothesize-and-test
paradigm, which is promising and inspiring work for using a hybrid approach to OCR.
Bag & Harit (2013) have also highlighted the need for new approaches, because the problems
are unique to Devanagari and Bangla, and hence the solutions adopted by the OCR systems
for other scripts cannot be directly adapted to these scripts.
Phase 1: Segment the input text image into words and recognize the words using the holistic
approach. Measure the confidence of the classification. If the confidence is lower than a
threshold, then we go to Phase 2.

Phase 2: Words that are poorly classified in Phase 1 are segmented, using the dissection
technique, into compound characters and characters including diacritics. These characters are
then classified.

The general framework of our approach consists of two main parts, training and recognition:
3.1 Training:
Training takes raw Nepali text data as input and outputs a dataset of words and a dataset of
ligatures (compound characters), where each item is described by a feature vector. The training
phase consists of the following steps:
1. Generation of a dataset of images for the possible words and ligatures (compound
characters).
2. Extracting features that describe each word and ligature in the dataset generated by the
previous step.

The first step involves the use of an automated computer program to generate the necessary
training dataset. A text corpus of the target language is fed to the program, and an analysis of
the textual data is
performed to generate the list of words, basic characters, and compound characters, which will
later be used for rendering images representing the corresponding text. The various steps
involved in this process are described below.
In this project, a text corpus collected by Madan Puraskar Pustakalaya (MPP) under the Bhasha
Sanchar Project [9] is used. The corpus includes different types of articles from different news
portals, magazines, websites, and books (about 2,500 articles). The collected text corpus is
fed to the Text Separator, a program written in C#. This program searches for the words and
maintains a dictionary in the form of <word, frequency> tuples; a dictionary of characters in
the form of <character, frequency> tuples is also generated. The number of words extracted
for Nepali is over 150,000, of varying lengths. The number of basic characters and compound
characters extracted is about 7,000.
[9] This corpus has been constructed by the Nelralec / Bhasha Sanchar Project, undertaken by a consortium of the Open
University, Madan Puraskar Pustakalaya (मदन पुरस्कार पुस्तकालय), Lancaster University, the University of Göteborg, ELRA (the
European Language Resources Association), and Tribhuvan University.
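The dictionary construction can be sketched in Python with collections.Counter; the actual Text Separator is a C# program, so this fragment is only an illustrative equivalent, and the sample words are hypothetical:

```python
from collections import Counter

def build_dictionaries(corpus_lines):
    """Build <word, frequency> and <character, frequency> dictionaries
    from an iterable of corpus lines (illustrative sketch)."""
    words = Counter()
    chars = Counter()
    for line in corpus_lines:
        for word in line.split():
            words[word] += 1
            chars.update(word)  # count every character of the word
    return words, chars

# Example: two tiny stand-in "articles"
words, chars = build_dictionaries(["नेपाल नेपाल ओसीआर", "ओसीआर"])
```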
3.1.1.2 Image Dataset Generation:
In order to generate an image dataset of words and compound characters (including basic
characters), the following steps are performed:
- Images for each extracted word and character are rendered using a rendering engine. This
involves rendering the text using 15 different Devanagari Unicode fonts, among them
Mangal, Arial Unicode MS, Samanata, Kokila, Adobe Devanagari, and Madan.
- Degraded images are generated by applying different image filtering operations (e.g.,
blurring).
The second main step of the training phase is to extract a feature vector representing each word
and compound character included in the dataset. For this, Histogram of Oriented Gradients
(HOG) features are used. To extract the HOG features from the dataset, the hog routine
implemented in skimage.feature has been used. The routine allows managing the orientations,
pixels per cell, and cells per block parameters.
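As a sketch, feature extraction with the skimage hog routine might look like the following; the image size and parameter values here are illustrative assumptions, not necessarily the configuration used in this work:

```python
import numpy as np
from skimage.feature import hog

# Illustrative parameters (assumptions): a 32x32 glyph image,
# 9 orientations, 8x8-pixel cells, 2x2-cell blocks.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0  # stand-in for a rendered character

features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# 32/8 = 4 cells per side -> 3x3 block positions
# -> 3 * 3 * 2 * 2 * 9 = 324 feature values
```

The same call would be applied to every rendered word and ligature image to produce the training feature vectors.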
3.2 Recognition:
The recognition part takes as input an image which is specified by the user through the user
interface. Its main task is to recognize any text that occurs in the input image. The recognized
text is presented as an output to the user in an editable format. The recognition of the text in
an input image involves the steps described below.

In this research work, instead of the projection profile, a blob detection based approach for
line and word segmentation has been used. Blobs are bright-on-dark or dark-on-bright regions
in an image [10, 11]. In the Devanagari script, each word is a bunch of characters tied to each
other by the headerline (dika). This property of the Devanagari script makes it easy to use blob
detection for detecting individual words in a text document. Figure 8 shows Nepali words,
with each word as a separate bright region on a black background.
Step 6: Re-apply blob detection within a line to perform word segmentation (vertical and
horizontal blurring may be applied for more accurate segmentation).
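As an illustrative sketch, connected-component labelling with scipy.ndimage can stand in for the blob detector described above; the actual implementation may use a different blob detection routine:

```python
import numpy as np
from scipy import ndimage

def detect_word_blobs(binary_image):
    """Label connected ink regions (words) in a binarized text image
    and return one bounding box (pair of slices) per blob."""
    labels, count = ndimage.label(binary_image)
    return ndimage.find_objects(labels), count

# Toy "line" with two words separated by white space (1 = ink)
line = np.zeros((5, 12), dtype=int)
line[1:4, 0:4] = 1    # first word
line[1:4, 7:11] = 1   # second word
boxes, n = detect_word_blobs(line)
```

Because the headerline connects all characters of a Devanagari word, each word forms a single connected component, so its bounding box can be passed directly to the character segmentation module.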
The segmentation of characters into basic components becomes more challenging due to the
script's properties: the use of modifiers and diacritics, and compound characters and ligatures.
By studying the structure of the Devanagari script and the use of compound characters, it was
found that it would be better to treat compound characters and ligatures as single characters.
The method is inspired by the work of Nazly Sabbour and Faisal Shafait, the ligature based
approach used to implement segmentation-free Arabic and Urdu OCR (Sabbour & Shafait,
2013). On analyzing the Nepali text corpus, it was found that there are about 7,000 compound
characters (basic characters, conjuncts, ligatures) used in Nepali. The projection profile (PP)
algorithm is used to segment characters.
This module takes the blobs (rectangles enclosing the words) as input. In the earlier step, Horizontal Projection is applied on the word. Hpp(word) = {r1, r2, r3, …, rn} is the result of Horizontal Projection, which contains the score of black pixels in each row. An analysis of Hpp(word) is performed to detect the headerline of a word and to calculate its height HLh(word). Those rows that have a Horizontal Projection score equal to the maximum score or near the maximum, and that are neighbors of each other, form the headerline. The analysis is performed on the upper half of the word. The location of the headerline, HLl(word, x, y), is the position of the headerline in a word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. The height of the headerline is given by HLh(word) = y − x. Then vertical projection is applied on the blob. Vpp(word) = {v1, v2, v3, …, vn} is the list of scores of black pixels in each column of the blob, where v1, v2, …, vn are the scores of black pixels in the respective columns.
The method that has been practiced so far to isolate the individual characters in a word of Devanagari Script is to remove the headerline. I have also used the same method, and it works fine for isolating compound characters. The headerline is not actually removed; instead HLh(word) is subtracted from each element of Vpp(word), giving Vpp(word)hr = {v1 − HLh(word), v2 − HLh(word), …, vn − HLh(word)}. An element may become less than zero; in such a case that element is set to zero, because no score can be less than zero. The next task is to find the cut points, CP(word) = <word, {cp1, cp2, …, cpn}>, where cp1, cp2, …, cpn are cut points, by analyzing Vpp(word)hr. Cut points are the points in a word from where we can chop the word to isolate the characters; normally these are the points where the element of Vpp(word)hr is equal to zero, that is, the space between two characters. Finally, the cut points are noted and the segmentation is performed. The result of segmentation is given by CS(TextImage) = {<word1, <cs11, cs12, …, cs1n>>, <word2, <cs21, cs22, …, cs2m>>, …, <wordp, <csp1, csp2, …, cspk>>}.
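The projection profile procedure above can be sketched in Python. This is a minimal illustration under the stated definitions; the 90% "near maximum" threshold, the filter on empty segments, and the helper names are assumptions, not the thesis implementation.

```python
import numpy as np

def segment_characters(word_img):
    """Projection-profile character segmentation of one word blob.
    word_img: 2-D binary array, 1 = black pixel."""
    h, w = word_img.shape

    # Hpp(word): black-pixel score of each row.
    hpp = word_img.sum(axis=1)

    # Headerline: neighbouring rows in the upper half whose score is at or
    # near the maximum ("near" taken as 90% here, an assumption).
    upper = hpp[: h // 2]
    near_max = np.where(upper >= 0.9 * upper.max())[0]
    x, y = near_max.min(), near_max.max()   # HLl(word, x, y)
    hl_height = y - x + 1                   # HLh(word)

    # Vpp(word)hr: vertical projection minus the headerline contribution,
    # clamped at zero because no score can be negative.
    vpp_hr = np.maximum(word_img.sum(axis=0) - hl_height, 0)

    # Cut points: columns where the header-removed profile is zero.
    cut_cols = np.where(vpp_hr == 0)[0]

    # Cut at the centre of each run of zero columns; keep only segments
    # that contain ink below the headerline.
    segments, start = [], 0
    for gap in np.split(cut_cols, np.where(np.diff(cut_cols) > 1)[0] + 1):
        if gap.size == 0:
            continue
        cp = int(gap.mean())
        if cp - start > 1 and word_img[y + 1:, start:cp].any():
            segments.append(word_img[:, start:cp])
        start = cp
    if w - start > 1 and word_img[y + 1:, start:w].any():
        segments.append(word_img[:, start:w])
    return segments
```

On a synthetic word of two vertical strokes joined by a headerline, the gap between the strokes produces a zero run in Vpp(word)hr and the word is split into two character images.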
For the development of both the word classifier and the character classifier, the Random Forest classifier tool of scikit-learn12 is used. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

12 http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html [Accessed: 03-24-2016]
For testing purposes a limited set of words and characters has been trained. The training of the classifiers is described below.
Word Classifier: Three different Random Forest classifiers are trained based on the word length, i.e. the ratio of image width and height. Training data images with width <= (height*2) lie in class 1, those with width <= (height*4) in class 2, and those with width <= (height*6) in class 3.
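The routing of a word image to one of the three classifiers can be sketched as a small helper. This is an illustration; the fall-back for words wider than six times their height is an assumption, since the text does not say how such words are handled.

```python
def word_class(width, height):
    """Assign a word image to one of the three Random Forest
    classifiers by its width-to-height ratio."""
    if width <= height * 2:
        return 1
    if width <= height * 4:
        return 2
    if width <= height * 6:
        return 3
    return 3  # wider words fall back to the widest class (an assumption)
```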
The following graph shows the learning curve for word classifier 1 (Class 1) with 20 iterations and a test size of 30 percent.

Figure 10 Learning Curve - Word classifier 1

The following graph shows the learning curve for word classifier 2 (Class 2) with 20 iterations and a test size of 30 percent.

The following graph shows the learning curve for word classifier 3 (Class 3) with 20 iterations and a test size of 30 percent.
Character Classifier: A single Random Forest classifier is developed for the classification of characters.
The feature extraction configuration, random forest setup, training data size, and the cross validation result for training are presented in Table 9. The classifier was trained successfully.

The following graph shows the learning curve for the character classifier with 20 iterations and a test size of 30 percent.
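A curve of this kind can be reproduced with scikit-learn's learning_curve routine and a ShuffleSplit of 20 iterations with a 30 percent test split. The data below is a random stand-in; the real features are HOG descriptors of word and character images.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve

# Illustrative stand-in data, not the thesis dataset.
rng = np.random.RandomState(0)
X = rng.rand(200, 16)
y = (X[:, 0] > 0.5).astype(int)

# 20 iterations with a 30 percent test split, matching the curves above.
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0), X, y, cv=cv)

print(test_scores.mean(axis=1))  # mean validation score per training size
```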
The prediction probability indicates how accurately some character/word has been classified. The threshold defined is a numeric value; a word whose prediction probability falls below it is passed on to character level classification. The threshold value is calculated by studying the classification results and the corresponding prediction probabilities. In this work, 0.2 is taken as the threshold. A sample of word recognition data is presented in Appendix II.
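The threshold decision can be sketched as follows. The helper name is hypothetical; only the 0.2 threshold and the use of the prediction probability as confidence come from this work.

```python
def needs_character_level(word_probs, threshold=0.2):
    """Decide whether a word must fall back to character level recognition.
    word_probs: class probabilities for one word, e.g. from the word
    classifier's predict_proba. The confidence is the probability of the
    predicted (most likely) class."""
    confidence = max(word_probs)
    return confidence < threshold
```

For example, the sample entry with confidence 0.16 in Appendix II would be routed to character level recognition, while a word recognized with confidence 0.5 would be accepted as a whole.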
In conclusion, in this chapter the proposed model of Nepali OCR is discussed. Our model consists of a two phase recognition scheme: first, the OCR engine tries to recognize words as a whole; second, if it is not confident about a word, it tries to segment the word into its constituent characters and recognize at the character level. The general framework of our model is presented in this chapter.
The training phase consists of two main steps, dataset generation and feature extraction. The processes of dataset generation and feature extraction are described in this chapter.
Word and line segmentation and character segmentation algorithms are described. The word and line segmentation algorithm uses blob detection and the projection profile technique for segmentation.
Finally, the different classifiers used in the experiment and their configurations are presented. The training results, cross validation results, and training curves are presented for both the word and character classifiers. The classifiers were successfully trained with more than 80% accuracy.
In the next chapter, segmentation results and recognition results are described.
CHAPTER IV
RESULTS AND DISCUSSION
In this chapter, the experimental study and testing of the proposed architecture are presented. To test the system, various documents were generated and collected. Results of the testing of different modules of the system, namely word segmentation, character segmentation, word level recognition, and compound character level recognition, are also presented here.
Various tools and libraries are used for image processing and machine learning. The experiment is conducted on a machine with the following configuration:
Title                        Description
Computer System              Dell Inspiron 5420, i5 Processor, 4GB RAM, 1GB NVIDIA Graphics
Operating System             Windows 10
Programming Languages        C#, Python 3
Image Processing Libraries   Aforge.net, Accord.net, scikit-image, OpenCV-Python
Machine Learning Libraries   scikit-learn
For various tasks, including the GUI design of the experiment software, pre-processing of input images, and post-processing, C# has been used as the major language. C# libraries like Aforge.net and Accord.net are used for different image processing tasks such as reading images, removing noise, and performing segmentation. Similarly, scikit-image has been used for image feature extraction. For implementing machine learning, the RandomForest routine of scikit-learn has been used.
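A minimal sketch of the training setup with scikit-learn is given below. The features here are random stand-ins for the HOG vectors extracted with scikit-image, and the feature length, class count, and forest size are assumptions, not the Table 9 configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for HOG feature vectors extracted with scikit-image
# (dimensions and labels are illustrative, not the thesis values).
rng = np.random.RandomState(1)
X = rng.rand(300, 64)            # one feature vector per image
y = rng.randint(0, 3, size=300)  # word/character class labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross validation, as in Table 9
clf.fit(X, y)

# The prediction probability of the predicted class serves as the
# recognition confidence used by the hybrid decision step.
probs = clf.predict_proba(X[:1])
```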
The line and word segmentation accuracy is high; since there is clear space between lines, it is almost 100%. The accuracy of word segmentation is reduced a little by the lower modifiers (Ukar, Ookar, Rrikar, and Halant) if they are separated from the core character, and by punctuation marks like comma and dot. The character segmentation results for 7 documents are presented below.
Document                     1      2      3      4      5      6      7
Characters Present           212    118    370    353    166    273    289
Characters Over-segmented    5      7      12     11     2      3      4
Characters Under-segmented   14     7      23     9      9      18     19
Error (%)                    8.96   11.86  9.45   5.66   6.62   7.69   7.95
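The error rates in the table can be reproduced from the over- and under-segmentation counts. Truncating, rather than rounding, to two decimal places matches the reported figures.

```python
present = [212, 118, 370, 353, 166, 273, 289]
over    = [5, 7, 12, 11, 2, 3, 4]
under   = [14, 7, 23, 9, 9, 18, 19]

# Error (%) = (over-segmented + under-segmented) / characters present * 100,
# truncated to two decimal places.
errors = [int((o + u) / p * 10000) / 100
          for p, o, u in zip(present, over, under)]
print(errors)  # [8.96, 11.86, 9.45, 5.66, 6.62, 7.69, 7.95]
```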
From the above test, it is clear that most of the errors are due to under-segmentation. The errors from under-segmentation mainly occur when adjacent characters touch or overlap each other.
The classifier was tested on documents containing characters from a set of trained 519 words
and 417 characters. The result of recognition of 7 documents is presented in Table 12.
From Table 12, we can see that the accuracy of the Character Level Recognition approach ranges from 69.49% to 85.38%. The average accuracy rate of the Character Level Recognition approach is 78.87%. The accuracy rate of the Hybrid approach ranges from 90.76% to 98%, with an average of 94.81%. The recognition results are also presented as a bar chart in Figure 14.
Figure 14 Recognition Results (bar chart comparing, for Documents 1 to 7, the accuracy of the character level recognition approach, 69.49% to 85.38%, against the hybrid approach, 90.76% to 98%)
From the above results, we can see that the proposed hybrid approach is promising. The
accuracy rate increased by more than 10% while using hybrid approach.
In this section, the computational cost of both recognition approaches and its mathematical interpretation is discussed. This computational cost only includes the cost of segmentation and recognition; the other costs, like pre-processing, training, and post-processing, are not considered.
This technique involves word segmentation, segmenting each word into characters, and then recognition of each character. Thus the total computational cost for this approach is given by:

Ct = ws + Ccls + Cclr

Where,
Ct = Total computational cost of the character level recognition approach
ws = Cost of segmenting the image document into words
Ccls = Total cost of character level segmentation
Cclr = Total cost of character level recognition

Assume that there are n words in a document and the time taken to segment an image document into words is ws. Let us say the average cost to segment each word into characters is Cs, so that Ccls = n × Cs. Also assume that the character segmentation yields m characters, i.e. there are m characters present. If r is the recognition time required to recognize a single character, then Cclr = m × r and the total cost becomes:

Ct = ws + n × Cs + m × r …………………………….. (1)
This approach also begins with word level segmentation; the process of recognition starts with the segmentation of the image document into possible words. The word level segmentation cost is again ws.
Now, in this approach, we first try to recognize all the words. Let us say the average cost of recognition of a single word is wr; with n words in total, the cost of recognition of n words is n × wr.
Then the confidence of recognition for each recognized word is calculated to decide whether to go for character level recognition or not. If the cost of calculating the recognition confidence of a single word is Rcc, then the total cost of calculating recognition confidence for all words is n × Rcc.
The next step is deciding how many words have been successfully recognized and how many require further processing. If p words require further processing, then n − p words do not require character level segmentation. The cost of character level segmentation equals Ccls = p × Cs. If the character segmentation of the p words results in q characters, then the cost of character level recognition is Cclr = q × r.
Thus the total computational cost of the hybrid approach is given by the equation

Cth = ws + n × wr + n × Rcc + p × Cs + q × r …………………….. (2)

Where,
Cth = Total computational cost of hybrid approach
wr = Average cost of recognizing a single word
Rcc = Cost of calculating the recognition confidence of a single word

In the worst case of the hybrid approach, every word requires character level processing, i.e. p = n and q = m, so

Cth = ws + n × wr + n × Rcc + n × Cs + m × r …………………….. (3)

For the character level recognition technique, the computational cost in the worst case and the best case is equal, and is given by equation (1):

Ct = ws + n × Cs + m × r ……………………………………….. (4)

By comparing equation (3) and equation (4), we can see that in the worst case the hybrid approach carries all the cost of equation (4) plus the extra terms n × wr and n × Rcc, and hence

Ct < Cth ………………………………………………………….. (5)

In the best case of the hybrid approach, no word requires character level processing, i.e. p = 0 and q = 0, so

Cth = ws + n × wr + n × Rcc ……………………………………… (6)

On comparing equation (1) and equation (6), n is always less than m, i.e. n < m, and hence

Ct > Cth
This shows that in the best case the hybrid approach performs better in terms of computational cost, but in the worst case the character level recognition technique performs better. However, it is not always the case that the hybrid approach performs best; even if we consider p to be n/2, the outcome depends on the relative costs of word recognition, confidence calculation, and character level processing.
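The cost expressions above can be checked numerically. The unit costs below are illustrative assumptions, not measured values; the point is only that the hybrid cost sits below the character level cost of equation (1) when no word needs fallback, and above it when every word does.

```python
def cost_character_level(ws, n, cs, m, r):
    """Character level cost, equation (1): word segmentation, per-word
    character segmentation, per-character recognition."""
    return ws + n * cs + m * r

def cost_hybrid(ws, n, wr, rcc, p, cs, q, r):
    """Hybrid cost derived above: word recognition plus a confidence
    check for every word, then character level processing only for
    the p unconfident words yielding q characters."""
    return ws + n * wr + n * rcc + p * cs + q * r

# Illustrative unit costs (assumptions, not measured values).
ws, n, m = 1.0, 100, 400          # n words, m characters, n < m
cs, r, wr, rcc = 1.0, 1.0, 1.0, 0.5

best = cost_hybrid(ws, n, wr, rcc, p=0, cs=cs, q=0, r=r)    # no fallback
worst = cost_hybrid(ws, n, wr, rcc, p=n, cs=cs, q=m, r=r)   # full fallback
char_only = cost_character_level(ws, n, cs, m, r)

print(best < char_only < worst)
```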
In conclusion, the experimental environment and the results of the experiment performed are discussed. The chapter begins with the discussion of the hardware and software environment on which the experiment is performed, and then the segmentation results and the recognition results are presented.
In the segmentation results, the character segmentation results and the error rates are presented. In the recognition results, the results of the character level recognition approach and the hybrid approach of recognition (the proposed method) are presented. The results are also presented in a bar chart.
In the next chapter, the contributions and the possible future improvements will be described in the conclusion.
CHAPTER V
CONCLUSION AND FUTURE WORK
Using the hybrid approach improved the recognition accuracy. The performance of the OCR increased by nearly 10% while using the hybrid approach. However, this is not always the case; it depends on how many words have been trained. Word recognition is comparable to asking how familiar one is with a language. Training results show that the word classifier has better performance, nearly 90% on average. The performance of character level recognition is found to be 78.87%.
We studied the recognition techniques proposed for Devanagari OCR, which includes Nepali, Sanskrit, and related languages.
We proposed a model for Nepali OCR which combines the Holistic technique and Character level dissection techniques. At first, the system tries to recognize a word as a whole; if it is not confident about the classification, then character level dissection and recognition are performed, which reduces the segmentation task.
The model is trained using Random Forest classifiers. The HOG descriptor has been used for feature extraction.
Along with the cross validation testing, the manual testing of the models is also presented. The testing shows higher accuracy rates and possibilities for its further improvement.
Our focus was on improving the performance of Nepali OCR by using a hybrid approach of
recognition. The approach reduces the character level or component level segmentation task.
There are several issues and possibilities that can be addressed in the future to further improve the system, such as the touching and fused character problems. These problems have always been pertinent issues and challenges for Devanagari OCR. The problem may be addressed by applying some recognition driven segmentation technique.
The model proposed can be generalized and trained to recognize a large set of words and compound characters, not only for Nepali but for Hindi, Marathi, and other languages that use the Devanagari script.
Better and more concrete methods must be designed for creating multiple classes of word images. The use of multiple classifiers apparently improved the performance, but this has to be verified on larger datasets.
References
Agrawal, M., Ma, H., & Doermann, D. (2010). Generalization of Hindi OCR using adaptive
segmentation and font files. In Guide to OCR for Indic Scripts. Springer London, pp.
181-207.
Bag, S., & Harit, G. (2013). A survey on optical character recognition for Bangla and
Devanagari Script. Sadhana, 133-168.
Bal, B. K. (2009). Scripts Segmentation and OCR II Nepali OCR and Bangla Collaboration.
Conference on Localized ICT Development and Dissemination across Asia. PAN
Localization Project. Laos.
Bansal, V., & Sinha, M. (2001). A complete OCR for printed Hindi text in Devanagari Script.
ICDAR (p. 0800). IEEE.
Bansal, V., & Sinha, R. (1998). Segmentation of Touching Characters in Devanagari.
Proceedings CVGIP, (pp. 371-376). Delhi.
Bansal, V., & Sinha, R. (2002). Segmentation of touching and fused Devanagari Characters.
Pattern Recognition, 875-893.
Bishnu, A., & Chaudhuri, B. B. (1999). Segmentation of Bangla handwritten text into
characters by recursive contour following. Proceedings of the International
Conference on Document Analysis and Recognition, (pp. 402-405).
Casey, R. G., & Lecolinet, E. (1996, July). A Survey of Methods and Strategies in Character
Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18.
Chaudhuri, B. B., & Pal, U. (1997). An OCR System to Read Two Indian Language Scripts:
Bangla and Devanagari (Hindi). Proceedings of the Fourth International Conference
on Document Analysis and Recognition (pp. 1011-1015). IEEE.
Dhurandhar, A., Shankarnarayanan, K., & Jawale, R. (2005). Robust Pattern Recognition
Scheme for Devanagari Script. (pp. 1021 – 1026). Springer-Verlag Berlin Heidelberg
2005.
Dongre, V. J., & Mankar, V. H. (2011). Devanagari Document Segmentation Using
Histogram Approach. International Journal of Computer Science, Engineering and
Information Technology (IJCSEIT).
Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-
Degree Compared. (pp. 1-10). Grid Computing Environments Workshop.
Garain, U., & Chaudhuri, B. B. (2002). Segmentation of touching characters in printed
Devnagari and Bangla scripts using fuzzy multifactorial analysis. IEEE Trans. Syst.
Man Cybern., (pp. 449–459).
Hansen, J. (2002). A Matlab Project in Optical Character Recognition (OCR). DSP Lab,
University of Rhode Island, 6.
Holley, R. (n.d.). How Good Can It Get? Analysing and Improving OCR Accuracy in Large
Scale Historic Newspaper Digitisation Programs. Retrieved 03 04, 2014, from
http://www.dlib.org/dlib/march09/holley/03holley.html
K.C., S., & Nattee, C. (2007). Template-based Nepali Natural Handwritten Alphanumeric
Character Recognition. Thammasat Int. J. Sc. Tech, 12(1).
Khedekar, S., Ramanaprasad, V., Setlur, S., & Govindaraju, V. (2003). Text - Image
Separation in Devanagari Documents. Proceedings of the Seventh International
Conference on Document Analysis and Recognition (ICDAR 2003) .
Kompalli, S., Nayak, S., & Setlur, S. (2005). Challenges in OCR of Devanagari Documents.
Kompalli, S., Setlur , S., & Govindaraju, V. (2006). Design and Comparison of Segmentation
Driven and Recognition Driven Devanagari OCR.
Kompalli, S., Setlur, S., & Govindaraju, V. (2009). Devanagari OCR using a recognition
driven segmentation framework and stochastic language models. IJDAR.
Kulkarni, S. (2013). Issues with Devanagari Display Type. WhiteCrow Designs.
Kumar, V., & Sengar, P. K. (2010). Segmentation of Printed Text in Devanagari Script and
Gurmukhi Script. International Journal of Computer Applications, 3.
Ma, H., & Doermann, D. (2003). Adaptive Hindi OCR using generalized Hausdorff Image
Comparison. ACM Transactions on Asian Language Information Processing, 2(3),
193-218.
Murtoza, S. M. (2005). Bangla Optical Character Recognition. BRAC University.
OCR Applications. (2015, April). Retrieved from cvision.
OCR Processing Steps [ABBYY Developer Portal]. (n.d.). Retrieved 05 22, 2014, from
http://www.abbyy-developers.eu/en:tech:processing
Optical character recognition - From Wikipedia, the free encyclopedia. (n.d.). Retrieved 03
03, 2014, from http://en.wikipedia.org/wiki/Optical_character_recognition
Optical character recognition. (2015, 05 04). (Wikipedia.org) Retrieved 5 22, 2014, from
Wikipedia: http://en.wikipedia.org/wiki/Optical_character_recognition
Optical Character Recognition. (2015, 04 17). Retrieved from Webopedia:
http://www.webopedia.com/TERM/O/optical_character_recognition.html
Pal, U., & Chaudhuri, B. (2004). Indian script character recognition: a survey. Pattern
Recognition.
Rupakheti, P., & Bal, B. K. (2009). Research Report on the Nepali OCR. Madan Puraskar
Pustakalaya.
Sabbour, N., & Shafait, F. (2013). A Segmentation Free Approach to Arabic and Urdu OCR.
SPIE Proceedings.
Scanning in Digital Age. (2015, 04 16). Retrieved from Record Nations:
http://www.recordnations.com/articles/scanning-in-digital-age/
Shakya, S., Tuladhar, S., Pandey, R., & Bal, B. K. (2009). Interim Report on Nepali OCR.
Madan Puraskar Pustakalaya.
Singh, A., Bacchuwar, K., & Bhasin , A. (2012, June). A Survey of OCR Applications.
International Journal of Machine Learning and Computing, 2.
Typographic ligature. (2016, 03 24). Retrieved 04 04, 2016, from Wikipedia:
http://en.wikipedia.org/wiki/Typographic_ligature
What is OCR? (2015, 04 17). Retrieved from ABBYY:
http://finereader.abbyy.com/about_ocr/whatis_ocr/
What is optical character recognition? (n.d.). Retrieved 03 03, 2014, from
http://www.webopedia.com/TERM/O/optical_character_recognition.html
APPENDIX I
Snapshots
A
Word Training Data
B
Segmentation
Recognition
C
Text Extractor
D
APPENDIX II
Word Recognition Data Sample
Here a sample word recognition data is presented. Each line is a result of recognition of a single word.
The data has two values separated by underscore (_). First value represents word code and second value
is the recognition confidence (prediction probability).
226_0.5
196_0.46000000000000002
130_0.62
66_0.41999999999999998
129_0.35999999999999999
36_0.40000000000000002
37_0.47999999999999998
38_0.29999999999999999
39_0.32000000000000001
40_0.76000000000000001
41_0.54000000000000004
42_0.46000000000000002
43_0.54000000000000004
44_0.57999999999999996
173_0.28000000000000003
46_0.85999999999999999
47_0.41999999999999998
48_0.47999999999999998
35_0.34000000000000002
49_0.32000000000000001
51_0.44
52_0.56000000000000005
53_0.69999999999999996
54_0.14000000000000001
55_0.34000000000000002
56_0.32000000000000001
57_0.54000000000000004
58_0.41999999999999998
59_0.92000000000000004
60_0.29999999999999999
61_0.38
62_0.68000000000000005
63_0.66000000000000003
50_0.71999999999999997
34_0.52000000000000002
346_0.16
32_0.40000000000000002
3_0.23999999999999999
4_0.47999999999999998
5_0.52000000000000002
6_0.29999999999999999