
NEPALI OCR USING HYBRID APPROACH

OF RECOGNITION

By

NIRAJAN PANT
Master of Technology in Information Technology, Kathmandu University, 2016

A Thesis
Submitted to the
Department of Computer Science and Engineering
Kathmandu University

In partial fulfillment of the requirements for the degree of


Master of Technology in Information Technology

July 2016
DECLARATION OF ORIGINALITY

Being a student, I understand that I have an ethical and moral obligation to ensure that the
dissertation I have submitted to Kathmandu University is my own, original and free of
plagiarism. All sources are properly acknowledged, and exact words are quoted or paraphrased
with appropriate references throughout the dissertation. Hence, I am fully satisfied that the work
I am submitting to the Department of Computer Science and Engineering, Kathmandu
University, is my own original research.

_______________

Nirajan Pant

Candidate

University Registration No: 015493-13

I
THESIS EVALUATION
This thesis, submitted by Nirajan Pant in partial fulfillment of the requirements for the Degree
of Master of Technology in Information Technology from the Kathmandu University, has
been read by the faculty Advisory Committee under whom the work has been done and is
hereby approved.

____________________
Dr. Bal Krishna Bal
(Supervisor)
Assistant Professor
Department of Computer Science and Engineering, Kathmandu University

_____________________
Suresh K. Regmi
(External Examiner)
Managing Director
Professional Computer System (P) Ltd.

____________________
Dr. Manish Pokharel
Head of Department
Department of Computer Science and Engineering, Kathmandu University

This thesis is being submitted by the appointed advisory committee as having met all of the
requirements of the School of Engineering at the Kathmandu University and is hereby
approved.

________________________________
Prof. Dr. Bhupendra Bimal Chhetri
Dean
School of Engineering
Kathmandu University

Date:

II
PERMISSION
Title Nepali OCR Using Hybrid Approach of Recognition
Department Computer Science and Engineering
Degree Master of Technology in Information Technology

In presenting this thesis in partial fulfillment of the requirements for a graduate degree from
Kathmandu University, I agree that the library of this University shall make it freely available
for inspection. I further agree that permission for extensive copying for scholarly purposes may
be granted by the supervisor who supervised my thesis work or, in his (or her) absence, by the
Head of the Department. Any other use of this thesis or part thereof for financial gain shall not
be allowed without my written permission. It is also understood that due recognition shall be
given to me and to Kathmandu University in any scholarly use which may be made of any
material in my thesis.

________________

Nirajan Pant

Date:

III
ACKNOWLEDGEMENTS
I express my sincere gratitude to Dr. Bal Krishna Bal for supervising this thesis. I will always
be indebted to him for his continued motivation, suggestions and involvement, which have
contributed significantly to the completion of this thesis.

I am thankful to Madan Puraskar Pustakalaya (MPP), Lalitpur, Nepal, which provided the
Nepali text image data for this thesis work.

Finally, I express my thankfulness to all members of the Department of Computer Science and
Engineering, and to my friends and family members, who helped me directly or indirectly
toward the successful accomplishment of this thesis. This day would not have been possible
without their continued support, motivation and encouragement.

Nirajan Pant

Master of Technology in Information Technology


Kathmandu University

IV
ABSTRACT
Nepali, an Indo-Aryan language written in the Devanagari script, is the most widely spoken
language in Nepal, with more than 35 million speakers. It is also spoken in many areas of India,
Bhutan, and Myanmar. The Optical Character Recognition (OCR) systems developed so far for
the Nepali language have very poor recognition rates. The Devanagari script has special
features, such as the 'dika' and the rules for joining vowel modifiers, which distinguish it from
the Latin script, where every character in a word is written separately. One of the major causes
of the poor recognition rate is error in character segmentation. The presence of conjunct,
compound, and touching characters in scanned documents complicates the segmentation
process, creating major problems when designing an effective character segmentation
technique. Thus, the aim of this work is to reduce the scope of the segmentation task so that
segmentation errors can be minimized.

In this work, I propose a hybrid OCR system for printed Nepali text using the Random Forest
(RF) algorithm. It incorporates two different OCR techniques: the holistic approach and the
character-level recognition approach. The system first tries to recognize a word as a whole; if
it is not confident about the word, character-level recognition is performed. Histogram of
Oriented Gradients (HOG) descriptors are used to define the feature vector of a word or
character. Recognition rates of 78.87% and 94.80% are achieved for the character-level
recognition approach and the hybrid approach, respectively.

Keywords: OCR, Devanagari Script, Pre-processing, Segmentation, HOG Feature, Feature
Descriptor, Classification, Random Forest (RF)

V
Contents
ACKNOWLEDGEMENTS ..................................................................................................... IV
ABSTRACT.............................................................................................................................. V
List of Figures ....................................................................................................................... VIII
List of Tables ........................................................................................................................... IX
List of Abbreviations ................................................................................................................ X
CHAPTER I INTRODUCTION ............................................................................................... 1
1.1 Optical Character Recognition ................................................................................... 1
1.1.1 General OCR Architecture ................................................................................. 2
1.1.2 Uses and Current Limitations of OCR ............................................................... 5
1.2 Devanagari Script....................................................................................................... 6
1.3 Problem Definition................................................................................................... 10
1.4 Motivation ................................................................................................................ 11
1.5 Research Questions .................................................................................................. 12
1.6 Objectives ................................................................................................................ 12
1.7 Organization of Document ....................................................................................... 13
CHAPTER II LITERATURE REVIEW.................................................................................. 14
2.1 Different Models of Character Segmentation in OCR Systems ............................... 14
2.1.1 Dissection Techniques ..................................................................................... 15
2.1.2 Recognition Driven Segmentation ................................................................... 16
2.1.3 Holistic Technique ........................................................................................... 17
2.2 Segmentation Challenges in Devanagari OCR ........................................................ 17
2.2.1 Over Segmentation of Basic Characters .......................................................... 18
2.2.2 Handling vowel modifiers and Diacritics ........................................................ 18
2.2.3 Handling Compound characters and Ligatures ................................................ 19
2.3 Related work ............................................................................................................ 20
2.3.1 Segmentation.................................................................................................... 20
2.3.2 Recognition ...................................................................................................... 24
2.4 OCR Tools Developed for Devanagari .................................................................... 26
CHAPTER III METHODOLOGY .......................................................................................... 30
3.1 Training:................................................................................................................... 31
3.1.1 Dataset Generation: .......................................................................................... 31
3.1.2 Feature Extraction: ........................................................................................... 33
3.2 Recognition: ............................................................................................................. 33
3.2.1 Line and Word Segmentation .......................................................................... 34
3.2.2 Character Segmentation: .................................................................................. 35
3.2.3 Classifier Tool .................................................................................................. 36

VI
3.2.4 Confidence and Threshold: .............................................................................. 40
CHAPTER IV RESULTS AND DISCUSSION ...................................................................... 42
4.1 Experimental Setup .................................................................................................. 42
4.2 Segmentation Results ............................................................................................... 42
4.3 Recognition Results ................................................................................................. 43
4.4 Computational Cost ................................................................................................. 44
CHAPTER V CONCLUSION AND FUTURE WORK ......................................................... 48
References ................................................................................................................................ 50
APPENDIX I Snapshots ........................................................................................................... A
APPENDIX II Word Recognition Data Sample ........................................................................E

VII
List of Figures
Figure 1 General OCR Architecture .......................................................................................... 2

Figure 2 Structure of Nepali Text Word .................................................................................... 8

Figure 3 Over-segmentation Example (Letter ण, श, ग) .............................................................. 18

Figure 4 Segmentation using Projection Profile Technique .................................................... 18

Figure 5 Proposed Nepali OCR Model .................................................................................... 30

Figure 6 Training Dataset Generation ...................................................................................... 32

Figure 7 Feature Extraction ..................................................................................................... 33

Figure 8 Nepali text words as Blobs ........................................................................................ 34

Figure 9 Snapshot of Character Segmentation ......................................................................... 34

Figure 10 Learning Curve - Word classifier 1 ......................................................................... 38

Figure 11 Learning Curve - Word classifier 2 ......................................................................... 38

Figure 12 Learning Curve - Word classifier 3 ......................................................................... 39

Figure 13 Learning Curve - Character classifier ...................................................................... 40

Figure 14 Recognition Results ................................................................................................. 44

VIII
List of Tables
Table 1 Vowels and Corresponding Modifiers .......................................................................... 8

Table 2 Diacritics and Special Symbols .................................................................................... 8

Table 3 Consonants and their half-forms ................................................................................... 9

Table 4 Letter Variants .............................................................................................................. 9

Table 5 Formation of Compound Characters ............................................................................. 9

Table 6 Existing Text Segmentation Approaches for Devanagari OCR .................................. 23

Table 7 Feature Extraction and Classifiers in Devangari OCR ............................................... 25

Table 8 Word Classifier Training ............................................................................................ 37

Table 9 Character Classifier Training ...................................................................................... 39

Table 10 Experimental Environment ....................................................................................... 42

Table 11 Character Segmentation Results ............................................................................... 43

Table 12 Recognition Results .................................................................................................. 43

IX
List of Abbreviations
ASCII – American Standard Code for Information Interchange

BAG – Block Adjacency Graph

C-DAC – Centre for Development of Advanced Computing

DOCR – Devanagari Optical Character Recognition

DSP – Digital Signal Processing

GHIC – Generalized Hausdorff Image Comparison

GSC – Gradient, Structural and Concavity

GUI – Graphical user interface

HMM – Hidden Markov model

HOG – Histogram of Oriented Gradients

HPP – Horizontal Projection Profile

HTK – Hidden Markov Model Toolkit

IPA – Integrity, purposefulness and adaptability

ISCII – Indian Script Code for Information Interchange

MPP – Madan Puraskar Pustakalaya

OCR – Optical Character Recognition

PDF – Portable Document Format

PP – Projection Profile

RF – Random Forest

SFSA – Stochastic Finite Automata

VPP – Vertical Projection Profile

X
CHAPTER I
INTRODUCTION

This thesis is about improving the performance of Nepali OCR by properly handling the
segmentation problems prevalent in the Nepali language. The underlying assumption is: "The
performance of Nepali OCR can be improved by using the hybrid recognition approach". Based
on this assumption, a Nepali-language-specific OCR model has been developed, and the
proposed model is evaluated experimentally.

This chapter discusses the concepts of OCR and its general architecture, the Devanagari script
for the Nepali language from the point of view of OCR, and the uses and limitations of OCR.
It also gives a basic introduction to the thesis, covering the problem definition, motivation,
research questions and objectives, and an overview of the terms and terminology used in this
thesis.

1.1 Optical Character Recognition


OCR is a field of computer science that involves converting text in images of typewritten,
printed, or handwritten documents into computer-readable text. OCR enables the conversion of
text in image data into textual data and facilitates editing, searching, and republishing without
retyping the whole document. Any written or printed document, if it is to be replicated digitally,
needs to be photocopied or scanned. Such a replicated document cannot be altered in terms of
the spellings, words, font style, and font size it contains, and typing an entire document in order
to replicate it is extremely time consuming. An OCR system overcomes these issues.
Documents containing character images can be scanned, after which the recognition engine of
the OCR system interprets the images and turns printed or handwritten characters into
machine-readable character codes (e.g. ASCII or Unicode). OCR therefore allows users to
quickly automate data capture from image documents, eliminates keystrokes to reduce typing
costs, and still maintains a high level of accuracy in text-processing applications.

1
1.1.1 General OCR Architecture

When an OCR system recognizes text, the program first analyzes the structure of the document
image. It divides the page into elements such as blocks of text, tables, and images. The lines
are divided into words and then into characters. Once the characters have been singled out, the
program compares them with a set of pattern images.

Figure 1 General OCR Architecture

The process of character recognition consists of a series of stages, with each stage passing its
results on to the next in pipeline fashion. There is no feedback loop that would permit an earlier
stage to make use of knowledge gained at a later point in the process (Casey & Lecolinet,
1996). The recognition process can be divided into three major steps: Pre-processing,
Recognition (Feature Extraction) and Post-processing (Optical character recognition, 2015)
(OCR Processing Steps [ABBYY Developer Portal], n.d.).

Pre-processing

OCR software loads the image and performs pre-processing to increase recognition accuracy.
Most OCR systems expect some pre-defined format of input image, such as font-size range,
foreground, background, image format, and color format. The pre-processing steps often
performed in OCR are: i) Binarization, ii) Morphological Operations, and iii) Segmentation
(Hansen, 2002). Binarization is the process of converting an image to a bi-tonal image; most
OCR systems work only on bi-tonal images. Morphological operations (filtering, thinning, and
pruning) are used in pre- or post-processing; they may be applied to degraded documents to
increase the performance of OCR.
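As an illustration of the binarization step, the following is a minimal sketch (not the implementation used in this thesis) of Otsu's global thresholding method, which picks the threshold that maximizes the between-class variance of the gray-level histogram:

```python
import numpy as np

def otsu_binarize(gray):
    """Binarize a grayscale page (2-D uint8 array) with Otsu's global
    threshold; returns a mask where 1 marks ink (dark) pixels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    w_b = sum_b = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]                 # weight of the dark class
        if w_b == 0:
            continue
        w_f = total - w_b              # weight of the light class
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        between = w_b * w_f * (mean_b - mean_f) ** 2
        if between > best_var:         # maximize between-class variance
            best_var, best_t = between, t
    return (gray <= best_t).astype(np.uint8)   # dark pixels = ink
```

In practice, libraries such as OpenCV provide this directly; the sketch only makes the histogram-based idea explicit.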

Different actions performed during pre-processing are:

- De-skewing
- Binarization
- Page layout analysis
- Detection of text lines and words
- Character segmentation: for per-character OCR, multiple characters that are connected
  due to image artifacts must be separated, and single characters that are broken into
  multiple pieces due to artifacts must be connected. Usually, in every OCR system,
  recognition is performed at the character level, so segmentation is a basic and important
  phase of recognition. Effective segmentation at the character level yields better
  recognition accuracy.
- Normalization
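The detection of text lines mentioned above is commonly done with a horizontal projection profile (HPP): rows containing no ink separate one line from the next. A minimal sketch of this idea, under the assumption of a cleanly binarized, de-skewed page (1 = ink):

```python
import numpy as np

def segment_lines(binary, min_height=2):
    """Split a binarized page into text lines using the horizontal
    projection profile: runs of rows with ink form lines, and empty
    rows separate them. Returns (start_row, end_row) pairs."""
    profile = binary.sum(axis=1)          # ink-pixel count per row
    lines, start = [], None
    for row, count in enumerate(profile):
        if count > 0 and start is None:
            start = row                   # a new line begins
        elif count == 0 and start is not None:
            if row - start >= min_height: # ignore specks thinner than this
                lines.append((start, row))
            start = None
    if start is not None:                 # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines
```

The same idea applied column-wise (a vertical projection profile) splits a line into words; for Devanagari the dika complicates the further split into characters, as discussed in Chapter II.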

Character recognition

The recognition algorithm is the brain of the OCR system. After successful pre-processing of
the input document image, the OCR algorithm can start recognizing characters and translating
them into character codes (ASCII/Unicode). Creating a one-hundred-percent-accurate
algorithm is probably impossible where there is a lot of noise and many different font styles are
present.

In general, character recognition consists of the following procedures:

- Learning: the recognition algorithm relies on a set of learned characters and their
  properties. It compares the characters in the scanned image file to the characters in this
  learned set.
- Extraction and isolation of individual characters from an image
- Determination of the properties of the extracted characters
- Comparison of the properties of the learned and extracted characters

There are two basic types of core OCR algorithms: matrix matching and feature extraction
(Optical Character Recognition, 2015). Matrix matching, also known as "pattern matching" or
"pattern recognition", involves comparing an image to a stored glyph on a pixel-by-pixel basis.
This relies on the input glyph being correctly isolated from the rest of the image, and on the
stored glyph being in a similar font and at the same scale. This technique works best with
typewritten text and does not work well when new fonts are encountered. Feature extraction
decomposes glyphs into "features" such as lines, closed loops, line direction, and line
intersections. These are compared with an abstract vector-like representation of a character,
which might reduce to one or more glyph prototypes. General techniques of feature detection
in computer vision are applicable to this type of OCR, which is commonly seen in most modern
OCR software. Machine-learning algorithms such as neural networks and nearest-neighbor
classifiers are used to compare image features with stored glyph features and choose the nearest
match. Most modern omnifont OCR programs (ones that can recognize printed text in any font)
work by feature detection rather than pattern recognition.
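To make the matrix-matching idea concrete, here is a toy sketch (the glyphs and labels are invented for illustration): the input character image is compared pixel by pixel against each stored template, and the label of the best-agreeing template wins.

```python
import numpy as np

def match_glyph(char_img, templates):
    """Matrix matching: compare a binarized character image against
    stored glyph templates pixel by pixel and return the label of the
    template with the highest fraction of agreeing pixels."""
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        score = np.mean(char_img == tmpl)  # fraction of matching pixels
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

As the text notes, this only works when the input glyph is isolated, scaled like the templates, and in a similar font; feature-based methods relax those requirements.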

Post-processing

This step can help improve recognition quality. Sometimes OCR outputs a wrong character
code, and in such cases dictionary support can help make the decision. OCR accuracy can also
be increased if the output is constrained by a lexicon, a list of words that are allowed to occur
in a document. With dictionary support, the program ensures even more accurate analysis and
recognition of documents and simplifies further verification of recognition results.
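One common way to apply such a lexicon constraint, sketched here as an illustration (not necessarily the post-processing used by any particular OCR engine), is to replace each recognized word with its nearest lexicon entry under edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, via dynamic programming
    over a rolling row of the edit matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Replace an OCR output word by the closest word in the lexicon."""
    return min(lexicon, key=lambda w: edit_distance(word, w))
```

A practical system would also keep the raw output when no lexicon word is close enough, to avoid "correcting" proper nouns out of existence.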

The output stream may be a plain text stream or file of characters, but more sophisticated OCR

systems can preserve the original layout of the page and produce, for example, an

annotated PDF that includes both the original image of the page and a searchable textual

representation.

4
The exact mechanisms that allow humans to recognize objects are yet to be fully understood,
but three basic principles are already well known to scientists: integrity, purposefulness and
adaptability (IPA). The most advanced optical character recognition systems focus on
replicating natural, or "animal-like", recognition, and these three principles lie at their heart.
The principle of integrity says that the observed object must always be considered as a "whole"
consisting of many interrelated parts. The principle of purposefulness supposes that any
interpretation of data must always serve some purpose. And the principle of adaptability means
that the program must be capable of self-learning. These principles endow the program with
maximum flexibility and intelligence, bringing it as close as possible to human recognition
(What is OCR?, 2015).

1.1.2 Uses and Current Limitations of OCR

OCR is widely used to recognize and search text from electronic documents or to publish the
text on a website (Singh, Bacchuwar, & Bhasin, 2012). It has enabled scanned documents to
become more than just image files, turning them into fully searchable documents with text
content that is recognized by computers. OCR is a vast field with a number of varied
applications such as invoice imaging, the legal industry, banking, and the health care industry.
It is widely used in digital libraries for searching scanned books and magazines (e.g. Google
Books), in data entry such as bill payment and passports, and in text-to-speech synthesis,
machine translation, text mining, check processing, and automatic number plate recognition.

Optical character recognition has been applied in a number of areas, some of which are listed
below:

- Institutional repositories and digital libraries
- Banking: form processing, check collection, etc.
- Healthcare: general forms, insurance forms, and prescription document processing
- Automatic number plate recognition
- Handwriting recognition

5
OCR has simplified the data collection and analysis process. With its continuous advancement,
more and more applications powered by OCR are being developed in various fields, including
finance, education, and government agencies.

The advantages of OCR can be summarized as:

- Cheaper than paying someone to manually enter large amounts of text
- Much faster than manually entering large amounts of text
- The latest software can recreate tables and the original layout

Although an OCR system has many advantages, it also has many limitations. Some of these
are outlined below:

- Limited documents: it does not perform well with documents containing both images
  and text, tables, or noise and dirt.
- Accuracy: the accuracy depends upon the quality and type of document, including the
  font used. Errors that occur during OCR include misreading letters, skipping over
  letters that are unreadable, or mixing together text from adjacent columns or image
  captions.
- Additional work: OCR is not error-proof; it also makes mistakes. A person has to
  manually compare the original image document with the recognized text and correct
  any errors.
- Not worth doing for small amounts of text: OCR requires a long process of document
  scanning, recognition, and verification of the output text, so it may not be feasible or
  worthwhile for small amounts of documents.

1.2 Devanagari Script


The Devanagari script is derived from the ancient Brahmi script through many modifications.
Many languages, including Sanskrit, Nepali, Hindi, Marathi, Bihari, Bhojpuri, Maithili, and
Newari, are written in Devanagari, and over 500 million people use it. Devanagari is a
syllabic-alphabetic script with a set of basic symbols: consonants, half-consonants, vowels,
vowel modifiers, digits, and special diacritic marks (Kompalli, Setlur, & Govindaraju, 2006)
(Kompalli, Setlur, & Govindaraju, 2009). The script has its own composition rules for
combining vowels, consonants, and modifiers. Modifiers are attached to the top, bottom, left,
or right side of other characters. All characters of a word are stuck together by a horizontal
line, called the dika, which runs along the top of the core characters (Khedekar,
Ramanaprasad, Setlur, & Govindaraju, 2003). A Devanagari character may be formed by
combining one or more alphabets; these are referred to as composite characters or conjuncts.
For example, the half-consonant ka (क्‍) and the consonant ya (य) combine to produce the
conjunct character kya (क्य). Consonant-modifier and conjunct-modifier characters are
produced by combining consonants and conjuncts with vowel modifiers (e.g. क + ा → का,
क्य + ा → क्या). This combination of alphabets contrasts with Latin, in which the number of
characters is fixed. A horizontal header line (dika) runs across the top of the characters in a
word, and the characters span three distinct zones (Figure 2): an ascender zone above the dika,
the core zone just below the dika, and a descender zone below the baseline of the core zone.
Symbols written above or below the core will be referred to as ascender or descender
components, respectively. A composite character formed by one or more half-consonants
followed by a consonant and a vowel modifier will be referred to as a conjunct character or
conjunct (Kompalli, Setlur, & Govindaraju, 2006).
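Because the dika connects every character in a word, a common first step before character segmentation is to locate and remove it. The following is a simple heuristic sketch (not the method used in this thesis): the dika is taken to be the row with the maximum horizontal projection in the upper half of the word image.

```python
import numpy as np

def strip_dika(word_img):
    """Locate the dika (header line) of a binarized Devanagari word
    image (1 = ink) as the densest row in the upper half, and blank
    that row so the characters hanging below can be separated."""
    profile = word_img.sum(axis=1)          # ink pixels per row
    upper = len(profile) // 2               # assume dika lies in the top half
    dika_row = int(np.argmax(profile[:upper]))
    out = word_img.copy()
    out[dika_row] = 0                       # erase the header-line row
    return dika_row, out
```

Real documents need a more robust version (the dika is several pixels thick and may be skewed), which is exactly where the segmentation errors discussed in Chapter II arise.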

Nepali, originally known as Khas Kurā, is an Indo-Aryan language with around 17 million
speakers in Nepal, India, Bhutan, and Burma. Nepali is written in Devanagari, which developed
from the Brahmi script in the 11th century AD; Nepali began to be written in it from the 12th
century AD.1 In Nepali, there are 13 vowels (swaravarna), 36 consonants (vyanjanvarna) (33
pure consonants and 3 composite consonants), 10 numerals, and half-letters. When vowels come
together with consonants, they are written above, below, before, or after the consonant they
belong to, using special diacritical marks. When vowels are written in this way they are known
as modifiers. In addition, consonants occur together in clusters, often called conjunct
consonants. Altogether, there are more than 500 different characters (K.C. & Nattee, 2007).
Sentences end with the 'purnaviram'.

1 http://www.omniglot.com/writing/nepali.htm

It is written and read from left to right in a horizontal line. Many languages in India use different
variants of this script. The Nepali language uses a subset of characters from the Devanagari
script set for written purposes. Some characters of the Devanagari script are language specific,
but the basic vowels, consonants, and modifiers are the same in all languages. For example, the
'Nukta' is used in Hindi but not in Nepali; similarly, the letter 'LLA' is not used in Nepali.

Figure 2 Structure of Nepali Text Word (labels: dika/header line, ascender, upper zone,
middle zone, lower zone, base line, descender, compound character)

Vowels and corresponding modifiers:

Table 1 Vowels and Corresponding Modifiers


Vowel: अ आ इ ई उ ऊ ऋ ए ऐ ओ औ अं अः
Corresponding vowel modifier: (none) ा ि ी ु ू ृ े ै ो ौ ं ः

Diacritics, Consonant Modifiers and Special Symbols: In some situations, a consonant

following (or preceding) another consonant is represented by a modifier called a consonant
modifier. In this case, the constituent consonants take modified shapes, such as the 'reph'.

Table 2 Diacritics and Special Symbols


Diacritics and Special Symbols: ँ ं ः ् ऽ
Different forms of Consonant modifier ra (र): र् (reph), ्र (rakar)

Consonants and their Half Forms:


Along with the set of vowel modifiers there is a set of pure consonants (also called half-letters)

which, when combined with other consonants, yield conjuncts (Pal & Chaudhuri, 2004).

8
Table 3 Consonants and their half-forms
Consonant Half Consonant Half Consonant Half Consonant Half Consonant Half
Form Form Form Form Form
क क्‍ ख ख्‍ ग ग्‍ घ घ्‍ ङ
च च्‍ छ ज ज्‍ झ झ्‍ ञ ञ्‍
ट ठ ड ढ ण ण्‍
त त्‍ थ थ्‍ द ध ध्‍ न न्‍
प प्‍ फ फ्‍ ब ब्‍ भ भ्‍ म म््‍
य य्‍ र ल ल्‍ व व्‍ श श्‍
स्‍ ष ष्‍ ह ह्‍
Numerals:
०१२३४५६७८९
Letter Variants:

In writing Nepali, many letter variations are found in written or printed documents, because

different fonts have different writing styles. Some letter variants differ between the old and new

writing styles. The old variants of some letters (e.g. the letters अ and ण) are no longer used

these days, but old documents frequently contain these forms. A set of letter variants is shown

in Table 4.

Table 4 Letter Variants

Letter Variants Letter Variants


Numeral Five Letter ‘La’
Letter ‘A’ Letter ‘Sha’
Letter ‘Jha’ Letter ‘Ksha’
Letter ‘ण’

There are many conjuncts which are written as a single character, e.g. द्द, द्म, हृ; that is,

sometimes two or more consonants combine to form a new complex shape. Sometimes the

shape of the compound character is so complex that it becomes difficult to identify the

constituent characters. Despite the existence of so many compound characters, their frequency

of appearance in any text page is much lower than that of basic characters.

Table 5 Formation of Compound Characters

ट + ट | ट + ठ | द + द | द + म | श + र | त + र | द + ध | क + ष
ट्ट | ट्ठ | द्द | द्म | श्र | त्र | द्ध | क्ष
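At the encoding level, the combinations in Table 5 are represented in Unicode as consonant + virama (halanta, U+094D) + consonant; the font then renders the fused conjunct shape. A small sketch of this composition:

```python
# Devanagari conjuncts are encoded as consonant + virama + consonant;
# the rendering engine produces the fused glyph shown in Table 5.
VIRAMA = "\u094D"   # Devanagari sign virama (halanta)

def conjunct(*consonants):
    """Join consonant letters with the virama to form a conjunct cluster."""
    return VIRAMA.join(consonants)
```

This is why an OCR system that outputs Unicode must recover the constituent consonants of a conjunct even when the printed shape gives no visual hint of them.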

9
In writing Nepali, many consonants come together in a cluster to form typographic ligatures,

which are frequently found in Nepali text. The number of ligatures employed may be language-

dependent; many more ligatures are conventionally used in writing Sanskrit than in written

Nepali (Typographic ligature, 2016). Using the 33 consonants, hundreds of ligatures can be

formed in total (the number of composite character classes exceeds 5000), most of which are

infrequent.

All the consonant characters, vowel characters, compound characters, modifiers are connected

to by ‘dika’ and looks like characters are hanging in a rope. This is a special feature in

Devanagari Script and it does not appear in Latin Script. There are many shapes that look

similar e.g. घ and ध, म and भ, ब and व.

These characteristics of the Devanagari script pose challenges for DOCR (Devanagari Optical Character Recognition). Because the Devanagari script differs from the Latin script in these respects, the techniques used in Latin OCR may not work well for DOCR. Thus, finding a technique suitable for the segmentation of text images in Devanagari script is also challenging.
1.3 Problem Definition

We want an OCR system for Nepali that can recognize different types of documents, including documents composed of varying fonts, and, above all, one that offers high recognition accuracy.
Today there are many OCR project releases for Nepali as well as for Hindi and Sanskrit, but their performance has not been satisfactory. The problem lies in the inadequate handling of conjuncts and compound characters. This issue has to be dealt with seriously in order to develop a reliable and high-performance OCR system for Nepali.
In this research work, a hybrid recognition approach, combining compound character/conjunct (ligature) recognition with character-level recognition, is used to improve the overall performance of Nepali OCR.
1.4 Motivation
Digital documents have become a part of everyday life. Anyone can take advantage of scanning documents to make them easy to reference, organize, protect, and store, and there is no limitation to the types of documents that can be digitized. This increased interest forces us to deal with any type of document that someone may wish to observe, including images. Plain text has a number of advantages over scanned copies of text: a text document can be searched, edited, reformatted, and stored more compactly, none of which is possible in the case of images. One cannot edit, search, or reformat text that visually appears in images; to a computer, images are nothing more than a collection of pixels.
Extracting text data from images is important for reading, editing, and analyzing the text content contained in them. Computers cannot recognize text data directly in images. Thus, the design of a computer program, called "OCR", that can recognize text in digital documents (images) is important.
OCR technology for some scripts such as Roman, Chinese, Japanese, Korean, and Arabic is fairly mature, and commercial OCR systems are available with accuracy higher than 98%, including OmniPage Pro from Nuance and FineReader from ABBYY for Roman and Cyrillic scripts, and Nuance products for Asian languages. Despite ongoing research on non-Latin script recognition, most commercial OCR systems focus on Latin-based languages. OCR for Indian scripts, as well as for many low-density languages, is still in the research and development stage. The resulting systems are often costly and do little to advance the field (Agrawal, Ma, & Doermann, 2009).
In the case of Nepali OCR, the segmentation process cannot achieve full accuracy because of the dika, touching characters, conjuncts/compound characters, modifiers, and variation in typefaces. These problems directly affect successful recognition and thus result in decreased performance. Due to the presence of language-specific constructs, the Devanagari script requires different approaches to segmentation. Thus, working on a better approach to segmentation and on performance improvement is important.
1.5 Research Questions

Studies show that developing an OCR system for the Devanagari script is more challenging than for the Latin script due to its writing arrangement. The techniques applied in Latin OCR may or may not apply to the Devanagari script. The main segmentation challenges for Devanagari OCR are: i) handling modifiers and diacritics, and ii) handling compound characters and ligatures (connected components). Dealing with these two challenges is necessary to achieve better accuracy. One major difficulty in improving the performance of an OCR system lies in the recognition of compound characters forming complex shapes.
The research questions formulated are:

- What are the challenges of Devanagari (Nepali) OCR?
- What are the current segmentation and recognition techniques for Devanagari (Nepali) OCR?
- How can the accuracy of Devanagari (Nepali) OCR be improved using a combined approach of holistic methods and character-level dissection techniques?
1.6 Objectives

This research is focused on improving the performance of Nepali OCR. It will be helpful for understanding the segmentation approaches used for Devanagari and Bangla OCR, the underlying challenges, and the improvements required. A better approach for designing an OCR system for Nepali is the expected outcome of this research. Moreover, the improved techniques will be implemented to develop a prototype OCR system for Nepali.
The objectives of this study are as follows:

- To implement a hybrid approach of recognition that uses both the holistic approach and the dissection method of recognition
- To determine and evaluate the hybrid approach for improved performance of Nepali OCR
1.7 Organization of Document

This document is organized into 5 chapters. Chapter 1 includes the basic introduction of the thesis, covering the problem definition, motivation, research questions and objectives, and a basic overview of terms and terminology. Chapter 2 discusses the different segmentation and recognition methods proposed for Devanagari optical character recognition; it also gives information about various OCR tools developed so far for Devanagari. Chapter 3 discusses the methods applied to conduct this research work and experiment, along with the different components and phases of the applied method. In Chapter 4, segmentation and recognition results are presented, together with the computation cost of the character-level recognition technique and the holistic approach. Finally, Chapter 5 concludes the research; the contributions and possible future improvements are discussed there.

In conclusion, this chapter discussed the basic concepts of optical character recognition, a general architecture of OCR, and the Devanagari script for the Nepali language from the point of view of OCR. The motivation of the research, the research questions, and the objectives and goals of this research were also discussed.
The next chapter will discuss the different segmentation methods and recognition approaches proposed in the literature, as well as various OCR tools developed so far for Devanagari.
CHAPTER II
LITERATURE REVIEW

Optical character recognition is a sequence of multiple processes: segmentation, feature extraction, and classification. Different models and techniques have been proposed for character segmentation. These techniques can be categorized into three major strategies: dissection techniques, recognition-driven techniques, and holistic methods. The use and selection of these techniques highly depends on the constructs of the script and language. Various feature extraction and classification techniques have been proposed by different researchers. Feature extraction algorithms may rely on the morphology of characters for better classification. Classification is one of the major steps in OCR, and the design of a good classifier is also a challenging task. Mostly, supervised learning is used for the classification of characters.
2.1 Different Models of Character Segmentation in OCR Systems

Character segmentation is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols. The difficulty of performing accurate segmentation is determined by the nature of the material to be read and by its quality. Segmentation is the initial step in a three-step procedure (Casey & Lecolinet, 1996):

Given a starting point in a document image:

1) Find the next character image.
2) Extract distinguishing attributes of the character image.
3) Find the member of a given symbol set whose attributes best match those of the input, and output its identity.

This sequence is repeated until no additional character images are found.
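The three steps above can be sketched as a loop. The finder, the attribute extractor, and the symbol set below are toy stand-ins, not part of any system described in this thesis: a "document" is just a list of glyph feature tuples, and matching is nearest-neighbour over those tuples.

```python
# A minimal, runnable sketch of the segment -> extract -> match loop.

def find_next_character(document, pos):
    """Step 1: return (glyph, next_pos), or None when nothing remains."""
    if pos >= len(document):
        return None
    return document[pos], pos + 1

def extract_attributes(glyph):
    """Step 2: distinguishing attributes (here, the glyph tuple itself)."""
    return glyph

def best_match(attrs, symbol_set):
    """Step 3: the symbol whose stored attributes are closest to the input."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(symbol_set, key=lambda s: dist(symbol_set[s], attrs))

def ocr_loop(document, symbol_set):
    text, pos = [], 0
    while True:
        found = find_next_character(document, pos)
        if found is None:           # no additional character images remain
            break
        glyph, pos = found
        text.append(best_match(extract_attributes(glyph), symbol_set))
    return "".join(text)
```

For example, with symbol_set = {"a": (1, 0), "b": (0, 1)}, the call ocr_loop([(1, 0), (0, 1)], symbol_set) yields "ab".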

A character is a pattern that resembles one of the symbols the system is designed to recognize, but to determine such a resemblance the pattern must be segmented from the document image. Casey & Lecolinet (1996) have classified segmentation methods into three pure strategies based on how segmentation and classification interact in the OCR process. The elemental strategies are:

1) The classical approach, in which segments are identified based on "character-like" properties. This process of cutting up the image into meaningful components is given a special name, "dissection".
2) Recognition-based segmentation, in which the system searches the image for components that match classes in its alphabet.
3) Holistic methods, in which the system seeks to recognize words as a whole, thus avoiding the need to segment into characters.
2.1.1 Dissection Techniques

Dissection means the decomposition of an image into a sequence of sub-images using general properties of the valid characters such as height, width, separation from neighboring components, disposition along a baseline, etc. Dissection is an intelligent process in that an analysis of the image is carried out; however, classification into symbols is not involved at this point. The segmentation stage consists of three steps:

1) Detection of the start of a character.
2) A decision to begin testing for the end of a character.
3) Detection of the end of a character.
The analysis of the projection of a line of print has been used as a basis for the segmentation of non-cursive writing. When printed characters touch or overlap horizontally, the projection often contains a minimum at the proper segmentation column (Casey & Lecolinet, 1996). A peak-to-valley function has been designed to improve this method: a minimum of the projection is located and the projection value noted. A vertical projection is less satisfactory for slanted characters.
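The projection-based dissection just described can be sketched as follows. This is a minimal illustration, assuming a binary image given as a list of rows (1 = ink); characters separated by blank columns are cut there, and a touching pair can be split at the projection minimum inside the joined region.

```python
# Dissection by vertical projection profile (illustrative sketch).

def vertical_projection(image):
    """Ink count per column of a row-major binary image."""
    return [sum(col) for col in zip(*image)]

def cut_points(proj):
    """(start, end) column pairs delimited by blank (zero) columns."""
    cuts, inside = [], False
    for x, v in enumerate(proj):
        if v > 0 and not inside:
            cuts.append(x); inside = True    # a character starts
        elif v == 0 and inside:
            cuts.append(x); inside = False   # a character ends
    if inside:
        cuts.append(len(proj))
    return list(zip(cuts[0::2], cuts[1::2]))

def split_touching(proj, start, end):
    """Column of minimum projection inside [start, end): a likely cut
    point for two horizontally touching characters."""
    return min(range(start, end), key=lambda x: proj[x])
```

A projection such as [2, 3, 1, 3, 2] over one joined component has its minimum at column 2, which is the candidate segmentation column.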

Analysis of projections or bounding boxes offers an efficient way to segment non-touching characters in hand-printed or machine-printed data. However, more detailed processing is necessary in order to separate joined characters reliably. The intersection of two characters can give rise to special image features; consequently, dissection methods have been developed to detect these features and to use them to split a character-string image into sub-images. Only image components failing certain dimensional tests are subjected to detailed examination.
2.1.2 Recognition Driven Segmentation

This approach also segments words into individual characters, usually letters, but it is quite different from the dissection-based approach. Here, no feature-based dissection algorithm is employed. Rather, the image is divided systematically into many overlapping pieces without regard to content, and these are classified as part of an attempt to find a coherent segmentation/recognition result. Letter segmentation is a by-product of letter recognition, which may itself be driven by contextual analysis. The main interest of this category of methods is that they bypass the segmentation problem: no complex "dissection" algorithm has to be built, and recognition errors are basically due to failures in classification.

The basic principle is to use a mobile window of variable width to provide sequences of tentative segmentations, which are confirmed (or not) by character recognition. Multiple sequences are obtained from the input image by varying the window placement and size, and each sequence is assessed as a whole based on recognition results. In recognition-based techniques, recognition can be performed following either a serial or a parallel optimization scheme. In the first case, recognition is done iteratively in a left-to-right scan of words, searching for a "satisfactory" recognition result. The parallel method proceeds in a more global way: it generates a lattice of all (or many) possible feature-to-letter combinations, and the final decision is found by choosing an optimal path through the lattice (Casey & Lecolinet, 1996).

Recognition-based segmentation consists of the following two steps:

1) Generation of segmentation hypotheses (e.g. windowing)
2) Choice of the best hypothesis (verification step)
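The serial scheme above can be sketched as follows. This is a simplified illustration under stated assumptions: the input is a list of column slices, and `classify` is a hypothetical stand-in returning a (label, confidence) pair for a window; real systems would also score whole sequences rather than greedily accepting the best window at each step.

```python
# Serial recognition-driven segmentation: slide a variable-width window
# left to right and keep the cut whose recognition score is best.

def recognize_serial(columns, classify, min_w=1, max_w=4):
    out, pos = [], 0
    while pos < len(columns):
        best = None
        for w in range(min_w, min(max_w, len(columns) - pos) + 1):
            label, score = classify(columns[pos:pos + w])
            if best is None or score > best[2]:
                best = (label, w, score)
        label, w, _ = best
        out.append(label)
        pos += w          # segmentation is a by-product of recognition
    return out
```

With a toy classifier that is confident only on the glyph patterns "aa", "b", and "cc", the input "aabcc" is segmented and recognized as ["A", "B", "C"].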

2.1.3 Holistic Technique

The holistic technique is the opposite of the classical dissection approach. This technique recognizes a word as a whole, thus skipping the segmentation of words into characters. It involves comparing features of the unsegmented word image to the features or descriptions of words in a database.

Since a holistic approach does not directly deal with characters or alphabets, a major drawback of this class of methods is that their use is usually limited to predefined words. A training stage is thus mandatory to expand or modify the scope of possible words. This property makes this kind of method more suitable for applications where the lexicon is statically defined, such as check recognition. They can be tailored to a specific user as well as to the particular vocabulary concerned. Holistic methods usually follow a two-step scheme:

1. The first step performs feature extraction.
2. The second step performs global recognition by comparing the representation of the unknown word with those of the references stored in the lexicon (Chaudhuri & Pal, 1997).
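The two-step scheme can be sketched as follows. The global features used here (word width and total ink) are a toy assumption purely for illustration; real holistic systems use shape descriptors such as contour or profile features.

```python
# Holistic word recognition: global features of the whole (unsegmented)
# word image, matched against a fixed lexicon of reference vectors.

def word_features(image):
    """Toy global features: (width, total ink) of a binary word image."""
    return (len(image[0]), sum(map(sum, image)))

def holistic_match(image, lexicon):
    """lexicon maps word -> feature vector; the nearest reference wins."""
    f = word_features(image)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(lexicon, key=lambda w: dist(lexicon[w], f))
```

No character segmentation is attempted: the whole word image is compared to each lexicon entry, which is why the method only works for a predefined vocabulary.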

2.2 Segmentation Challenges in Devanagari OCR

Several works have been reported on Devanagari and other South Asian scripts. Among them, Devanagari, Bangla, and Gurmukhi share the same issues and challenges, as they follow the same character structure and writing style (e.g. composition rules, headerline, conjuncts, compound characters, position of vowel modifiers, etc.). The challenges and open problems related to Devanagari OCR are outlined below. These problems are unique to Devanagari and Bangla, and hence the solutions adopted by OCR systems for other scripts cannot be directly adapted to these scripts (Bag & Harit, 2013).

The segmentation challenges faced in Devanagari OCR are described below:
2.2.1 Over-Segmentation of Basic Characters

Some of the characters in Devanagari, such as ग, ण, and श, have two basic components. Similarly, the letter Kha (ख) also has a structure with visually separate components and looks like a combination of the letters Ra and Va (रव). In such cases the OCR system gets confused and cannot segment a complete basic character. Sometimes poor document quality also leads to over-segmentation of characters. Some of these problems can be handled during post-processing, and some must be considered in the OCR process (segmentation and classification).

Figure 3 Over-segmentation Example (Letters ण, श, ग)
2.2.2 Handling Vowel Modifiers and Diacritics

The Devanagari script has several vowel modifiers. When vowel modifiers come together with core consonants, they take a position at the top, bottom, left, or right and result in a new shape. Identifying these modifiers and recognizing them is an important task. The main challenge is to handle the large number of characters that are formed when the vowel modifiers combine with the basic characters (Bag & Harit, 2013). Sometimes vowel modifiers come together with other diacritics (for example, the vowel modifier I (ि) and the Chandravindu (ँ)). In such cases they overlap and increase the complexity of segmentation.

Figure 4 Segmentation using Projection Profile Technique
2.2.3 Handling Compound Characters and Ligatures

In Devanagari, compound characters and ligatures are common. Conjunct or compound characters may be produced by combining half-consonants with consonants. There is a large set of compound characters and ligatures, and it is sometimes hard to identify the constituent characters by simply analyzing the shape. Thus, handling a large set of compound characters and ligatures is also a challenging task.
Apart from these segmentation challenges there are other challenges too, such as typographical errors and word and character spacing. Kulkarni (2013) has studied the display typefaces of the Devanagari script. He noticed that most of the existing digital display typefaces in Devanagari are inconsistent: they have imbalanced letter structures, limited or inadequate matras, and ill-designed conjuncts. They also seem outdated and are overused, and many of them copy features and styles from existing Latin typefaces. He recommends looking at Devanagari type design independently and not as secondary to Latin type design. This inconsistency and these imbalanced letter structures in typefaces add complexity for the OCR system. Because of the structural complexities of Indian scripts, a character recognition module that makes use of only the image information (shape and structure) of a character is prone to give incorrect results. To improve the recognition accuracy, it is necessary to use language knowledge to correct the recognition result. There has been only limited use of post-processing in Indian OCR systems, and more effort is needed in this direction (Bag & Harit, 2013). Almost all Indic scripts need character reordering to reorganize text from visual order to logical (Unicode) order. Since most OCR systems operate strictly from left to right, the characters are scanned, and recognized, in visual order; the output therefore needs to be reordered in post-processing.
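The reordering step can be sketched for the most common case. This is a deliberate simplification: only the short-i matra is handled, and it is attached to the next single consonant, whereas a real post-processor must also handle reph and full conjunct clusters.

```python
# Visual-to-logical reordering sketch. The short-i matra (U+093F) is
# printed to the LEFT of the consonant it modifies, but Unicode logical
# order places it AFTER that consonant.

I_MATRA = "\u093f"  # Devanagari vowel sign I

def visual_to_logical(chars):
    out, pending = [], 0
    for c in chars:
        if c == I_MATRA:
            pending += 1                 # hold the matra until its base arrives
        else:
            out.append(c)
            out.extend(I_MATRA * pending)
            pending = 0
    out.extend(I_MATRA * pending)        # stray matra with no base: keep it
    return "".join(out)
```

For example, an OCR scan that sees ि then क (visual order) is reordered to the Unicode sequence क + ि.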

Apart from the above-mentioned problems, which directly pertain to OCR systems, a major effort is needed to address related problems such as scene text recognition, restoration of degraded documents, and large-scale indexing and search in multilingual document archives.
2.3 Related Work

Various works have been reported in the literature on the correct segmentation of conjuncts/compound characters and shadow characters to increase the performance of Devanagari OCR. At the same time, various feature extraction methods and character recognition algorithms have been proposed. Some of these works are briefly described below.
2.3.1 Segmentation

Bansal & Sinha (1998) have considered the problem of conjunct segmentation in the context of the Devanagari script. Their conjunct segmentation algorithm takes the image of the conjunct and the coordinates of its enclosing box; the position of the vertical bar and the pen width are also inputs to the algorithm. For extracting the second constituent character of the conjunct, the continuity of the collapsed horizontal projection is checked. Bansal & Sinha (2001) have divided words into top and bottom strips, after which a vertical projection is computed to extract characters/symbols and top modifiers; the collapsed horizontal projection is defined for the segmentation of conjuncts/touching characters and shadow characters. Ma & Doermann (2003) identified Hindi words and then segmented them into individual characters using the projection profile technique (isolating top modifiers, separating bottom modifiers, and extracting core characters). Composite characters are identified and further segmented based on the structural properties of the script and statistical information; the collapsed horizontal projection technique is adopted from Bansal & Sinha (2001) for conjunct segmentation.

Bansal & Sinha (2002) present a two-pass algorithm for the segmentation and decomposition of Devanagari composite (touching and fused) characters/symbols into their constituent symbols. The proposed algorithm extensively uses structural properties of the script. In the first pass, words are segmented into easily separable characters/composite characters, and statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesized composite characters are further segmented; for this, the continuity of the collapsed horizontal projection is checked.

Agrawal, Ma & Doermann (2009) have generated character glyphs from font files and passed them through feature extraction routines. For each character segmented in the document image, feature extraction is performed. With the objective of grouping broken characters and segmenting conjuncts and touching characters, a technique of font-model-based intelligent character segmentation and recognition was developed; for each word, connected component analysis is performed.

Kompalli et al. (2005) have proposed a projection-profile-based method for character segmentation from words. Words are separated into ascenders, core components, and descenders, and gradient features are used to classify segmented images into these classes. Core components contain vowels, consonants, and frequently occurring conjuncts, and are pre-classified into four groups based on the presence of a vertical bar: no vertical bar (e.g. छ, ट, ह), a vertical bar at the center (e.g. फ, क), at the right (e.g. व, त, म), or at multiple locations (e.g. कय, सत). Four neural networks are used for classification within these groups. Due to ascender and core character separation, characters may be divided into multiple segments during OCR; positional information from the segmented images is used to reconstruct the original character. For the recognition of valid but infrequently occurring conjuncts, Kompalli et al. (2005) have attempted to segment the conjunct characters into their constituent consonants and classify the segmented images: they examine breaks and joins in the horizontal runs (HRUNS) of a candidate conjunct character and build a block adjacency graph (BAG). Adjacent blocks in the BAG are selected from left to right as segmentation hypotheses, and both the left and right images obtained from each hypothesis are classified using conjunct/vowel classifiers. The segmentation hypothesis with the highest confidence is accepted. Post-processing is carried out using a lexicon with 4,291 entries generated from the Devanagari data set.

Kumar & Sengar (2010) present a projection profile technique for printed Devanagari and Gurmukhi character segmentation. Initially, the horizontal histogram of a segmented line is computed and the position of the headerline is located; this separates the word into top and bottom strips. The vertical projection histogram of each strip is then computed for the segmentation of top modifiers and characters. Conjuncts/fused characters are not considered in this paper, and the results are for clean documents containing no conjuncts/fused characters. A projection profile technique is also proposed by Dongre & Mankar (2011) for the segmentation of Devanagari text images. To normalize the image against the thickness of the character, the input image is thinned; then the vertical projection histogram is computed and the locations containing single white pixels are noted. These points are taken as the boundaries of individual characters. The proposed method skips the process of headerline removal, and in character segmentation, words are segmented into more symbols than are actually present in the word.
Kompalli et al. (2006) have extended their previous work (Kompalli, Nayak, & Setlur, 2005), comparing two different approaches, segmentation-driven and recognition-driven segmentation, for OCR of machine-printed, multi-font Devanagari text. They propose a recognition-driven approach that combines classifier design with segmentation using the hypothesize-and-test paradigm. Word images are examined along horizontal runs (HRUNS) to build a block adjacency graph (BAG). Given the BAG of a word, histogram analysis of block width is used to identify the longest blocks as the headline (dika) and to isolate ascenders from core components. Regression over the centroids of the core connected components is used to determine a baseline for the word. The classifier is used to obtain hypotheses for word segments such as consonants, vowels, or consonant-ascenders; if the confidence of the classifier is below a threshold, the algorithm attempts to segment conjuncts, consonant-descenders, and half-consonants. Thus, the classifier results are used to guide further segmentation.

Kompalli et al. (2009) have proposed a novel graph-based recognition-driven segmentation methodology for Devanagari script OCR using the hypothesize-and-test paradigm, a further improvement on their previous work (Kompalli et al., 2006). A BAG is constructed from a word image, and ascenders and core components are isolated. The core components can be isolated characters that do not need further segmentation, or conjuncts and fused characters that may or may not have descenders. Multiple hypotheses are obtained for each composite character by considering all possible combinations of the generated primitive components and their classification scores. A stochastic model for word recognition has been presented, describing the design of a Stochastic Finite State Automaton (SFSA) that outputs word recognition results based on the component hypotheses and n-gram statistics; it combines classifier scores, script composition rules, and character n-gram statistics. Post-processing tools such as word n-grams or sentence-level grammar models are applied to prune the top-n choice results. They have not considered special diacritic marks such as avagraha, udatta, and anudatta, special consonants, punctuation, or numerals. Symbols such as the anusvara, visarga, and the reph character often tend to be classified as noise.
Table 6 Existing Text Segmentation Approaches for Devanagari OCR

Authors                 | Segmentation Technique                                      | Performance
Bansal & Sinha (2001)   | Collapsed horizontal projection                             | 93% at character level
Kompalli et al. (2005)  | BAG analysis                                                | 93.81% for consonants and vowels
Kompalli et al. (2006)  | Graph-based character segmentation                          | 39.58% for segmentation-driven OCR; 44.10% for recognition-driven OCR
Kompalli et al. (2009)  | Graph-based recognition-driven BAG segmentation             | Accuracy ranges from 72% to 90%
Ma & Doermann (2003)    | Structural properties and statistical information           | Average recognition accuracy up to 87.82%
Agrawal et al. (2009)   | Font-model-based segmentation, connected component analysis | 92% at character-level recognition
Bansal & Sinha (1998)   | Collapsed horizontal projection for segmentation            | 85% recognition rate on segmented touching characters
Bansal & Sinha (2002)   | Collapsed horizontal projection                             | 85% recognition rate
For the Nepali HTK OCR (Shakya, Tuladhar, Pandey, & Bal, 2009; Bal, 2009), the projection profile technique has been adopted for character segmentation. The process includes removal of the headerline and upper modifiers, followed by a multi-factorial analysis technique to segment the basic characters. The method is able to segment isolated characters along with half and conjoined characters. For the classifier, Hidden Markov Models (HMM) from the HTK toolkit are used. Rupakheti & Bal (2009) adopted the projection profiling technique for the Nepali Tesseract OCR: the headerline width is identified, the vertical projection histogram of the word to be segmented is computed, and histogram analysis is then done to mark the starting and ending boundaries of each character fragment, taking the headerline as a threshold value that qualifies a segment to be separated.
Most researchers have adopted the projection profiling technique for character segmentation. For Devanagari character segmentation, this technique proceeds in two phases: preliminary segmentation first separates words into basic characters and compound/shadow/fused characters, and the latter are then segmented further. In general, preliminary segmentation includes detection of the headline and use of it as a reference to isolate ascenders, core components, and descenders. For the segmentation of compound characters, Bansal & Sinha (1998, 2001, 2002) have proposed continuity checking of the collapsed horizontal projection; Kompalli et al. (2005) have proposed graph analysis; and Ma & Doermann (2003) have used structural properties and statistical information of the script for further segmentation of compound characters. Kompalli et al. (2006, 2009) have proposed a graph-based recognition-driven character segmentation technique to overcome the problems of compound character segmentation, which is usually difficult using projection profile techniques. The various character segmentation approaches for Devanagari OCR are summarized in Table 6.
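The headline-based preliminary segmentation common to these works can be sketched as follows, assuming a binary word image given as a list of rows (1 = ink). The dika is taken as the row with maximum ink; the strip above it holds ascenders and the strip below holds core characters and descenders.

```python
# Headline (dika) detection by horizontal projection (illustrative sketch).

def horizontal_projection(image):
    """Ink count per row of a binary word image."""
    return [sum(row) for row in image]

def find_headerline(image):
    """Row index with maximum ink: the dika in printed Devanagari."""
    proj = horizontal_projection(image)
    return max(range(len(proj)), key=lambda y: proj[y])

def split_strips(image):
    """Return (top strip, strip below the headline), with the dika row
    itself removed; characters in the lower strip can then be separated
    by vertical projection."""
    h = find_headerline(image)
    return image[:h], image[h + 1:]
```

In a real system the headline may span several rows and vary in thickness with font size, so a band of high-ink rows, rather than a single maximum, is usually removed.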

2.3.2 Recognition

Various feature extraction algorithms and classifiers have been proposed for Devanagari optical character recognition, all aimed at improved performance. Shaded portions of the characters are used as features by Chaudhuri & Pal (1997), with decision trees as classifiers. Kompalli et al. (2005) have used GSC features and a neural network classifier, while Kompalli et al. (2006) have used GSC features with a k-nearest-neighbor classifier. Ma & Doermann (2003) suggest the use of statistical structural features; they have used Generalized Hausdorff Image Comparison (GHIC) for the recognition of characters. The feature extraction methods and classifiers used by various researchers in the field of Devanagari OCR are summarized in Table 7.
Table 7 Feature Extraction and Classifiers in Devanagari OCR

Author                  | Features                                 | Classifier                                  | Performance
Pal & Chaudhuri (1997)  | Shaded portions in the character         | Decision tree and template matching         | 96.5%
Kompalli et al. (2005)  | GSC                                      | Neural network                              | 84.77%
Kompalli et al. (2006)  | GSC                                      | k-nearest neighbor                          | 95%
Bansal & Sinha (2002)   | Statistical structure                    | Statistical knowledge                       | 85%
Dhurandhar et al. (2005)| Curves, contour interpolation            | Centroid matching, length matching          | 85%
Kompalli et al. (2009)  | SFSA                                     | Stochastic Finite State Automaton           | 96%
Ma & Doermann (2003)    | Statistical structural features          | Generalized Hausdorff Image Comparison (GHIC) | 87.82%
Agrawal et al. (2009)   | Moment descriptors, directional features | GHIC                                        | 92%
Bansal et al. (2001)    | Filters                                  | Distance-based classifiers                  | 93%
Bishnu & Chaudhuri (1999) have proposed a recursive contour-following method for segmenting handwritten Bangla words into characters. Based on certain characteristics of Bangla writing styles, different zones across the height of the word are detected; these zones provide structural information about the constituent characters of the word, and recursive contour following solves the problem of overlap between successive characters. Garain & Chaudhuri (2002) have proposed a method for segmenting touching characters in printed Bangla script. Through a statistical study they noted that touching characters occur mostly at the middle of the middle zone, and hence certain suspected points of touching are found by inspecting the pixel patterns and their relative position with respect to the predicted middle zone. The geometric shape is cut at these points and the OCR scores are noted; the best score gives the desired result. Habib (Murtoza, 2005) has proposed a projection profiling technique for Bangla character segmentation. The width of the headline is variable because of print style (font size), so sometimes the headline cannot be removed cleanly. Two morphological operations, thinning and skeletonization, have been tried to overcome this problem: these operations remove pixels, and the remaining pixels make up the image skeleton. Characters can then be separated using connected components, which are taken as the input of the recognition step.
The Arabic OCR framework proposed by Nazly and others (Sabbour & Shafait, 2013) takes raw Arabic script data as text files as input in the training phase. The training part outputs a dataset of ligatures, where each ligature is described by a feature vector. The recognition part takes as input an image specified by the user and uses the dataset of ligatures generated by the training part to convert the image into text. The framework also contains versions of degraded text images which aim at measuring the robustness of a recognition system against possible image defects, such as jitter, thresholding, elastic elongation, and sensitivity. The performance of the system is reported to be 91% for clean Urdu text and 86% for clean Arabic text.

2.4 OCR Tools Developed for Devanagari


The development of Devanagari (Sanskrit, Hindi, Marathi, and Nepali) OCR software has been initiated by many organizations and individuals in India and Nepal. C-DAC from India has developed an OCR system (Chitrankan) for the Hindi and Marathi languages. Madan Puraskar Pustakalaya (MPP) from Nepal has also developed OCR projects for the Nepali language (based on the Tesseract open source OCR engine and the HTK tool). Ind.senz (founded by Dr. Oliver Hellwig) is developing OCR software for the Devanagari script (Sanskrit, Hindi and Marathi languages). The other projects are Parichit and Sanskrit/Hindi-Tesseract OCR. These tools are described in detail below:

Chitrankan: Chitrankan is an OCR (Optical Character Recognition) system for Hindi and other Indian languages developed by C-DAC. It works with the Hindi and Marathi languages along with embedded English text. It comes with facilities like a spell checker, saving recognized text in ISCII format, and exporting text as .RTF for editing in any word processor. Skew detection and correction up to ±15°, automatic text and picture region detection, and advanced DSP (Digital Signal Processing) algorithms to remove noise and back page reflection are also implemented. The recognized text is not very accurate, so manual editing is required. The supported operating systems are Windows XP and older versions of Windows [2].

[2] http://cdac.in/index.aspx?id=mlc_gist_chitra

Parichit: This project is based on the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/). The front end is a modified version of VietOCR (http://vietocr.sourceforge.net/). The project aims to create open source OCRs for Indian and South Asian languages. It also aims to create high quality training data for building Tesseract language models for each of the Indian languages. This project reports ongoing work on a Headerline Segmenter (Shirorekha Segmenter) and Character Reordering for post-processing [3].

Sanskrit / Hindi - Tesseract OCR (traineddata files for Devanagari fonts for Tesseract OCR 3.02+): Tesseract OCR 3.02 provides hin.traineddata for recognizing texts in the Devanagari script. However, the training texts, images and box files are not provided, so it is difficult to improve the accuracy by further refining the traineddata. It is noted that recognition is more accurate and faster if the training is done with the same or a similar font as used in the text to be OCRed. With the aim of creating traineddata for various Devanagari fonts, so that Tesseract OCR can be used to recognize documents written in various Devanagari fonts, this traineddata is maintained by http://sourceforge.net/users/shreeshrii. The trained data can be downloaded from http://sourceforge.net/projects/tesseracthindi. Currently the traineddata for the Sanskrit2003 font and another similar font is available [4].

Ind.senz OCR Programs: OCR programs are available for the Hindi, Marathi, and Sanskrit languages. These are the only Devanagari OCR programs developed and available for professional use. Ind.senz describes the programs as usable by data entry companies, publishing houses, and universities – wherever large amounts of Hindi and Sanskrit text have to be digitized. The programs take text images and transform them automatically into computer editable text in Unicode format. Ind.senz reports high accuracy rates on typical Devanagari fonts. The OCR programs are paid software; a demo version can be downloaded from http://www.indsenz.com/int/index.php [5].

[3] http://code.google.com/p/parichit
[4] http://sourceforge.net/projects/tesseracthindi
[5] http://www.indsenz.com/int/index.php

Google Drive OCR: Google has launched Nepali OCR in Google Drive. The OCR technology is free for Google Drive users. The OCR provided performs well on single-column documents. It can retain some formatting like bold, font size, font type and line breaks, but lists, tables, columns, footnotes, and endnotes are likely not to be detected. Though it shows good performance, one needs to be a Google Drive user, surrender one's documents to Google, and work online.

A Step Towards the Development of Nepali OCR:

HTK Toolkit Based OCR: This OCR project was developed under Phase I of the PAN Localization project (2004-2007). The project was executed by Madan Puraskar Pustakalaya, http://madanpuraskar.org/. The development of the Nepali OCR was done with guidance and direct training from the Bangladesh team. The OCR project was closed with the release of a beta version [6]. The source files and executable are available at http://nepalinux.org [7].

Tesseract Based Nepali OCR: Under the initiative of MPP and Kathmandu University (KU), efforts were made to develop a Tesseract based Nepali OCR under PAN Localization Project Phase II. In this project, 202 Nepali characters, including basic characters and some derived characters (characters with ukar, ekar, and aikar), were trained via Tesseract 2.04. It is available for download at http://nepalinux.org, and it can also be downloaded from the website of the PAN Localization Project, www.panl10n.net [8].

After the release of the HTK based beta version of the Nepali OCR, the Google Tesseract based Nepali OCR was developed in 2009. Thereafter, the development and enhancement of Nepali OCR was discontinued; these tools have not been updated for a long time. In the current scenario, new versions of operating systems and new platforms have been released. The tools developed do not meet the requirements of new versions of operating systems like Windows 7 and Windows 8.1. It is also necessary to develop OCR tools for other platforms like Linux and Android.

[6] Findings of PAN Localization Project, PAN Localization Project 2012; ISBN: 978-969-9690-02-2
[7] http://nepalinux.org/index.php?option=com_content&task=view&id=46&Itemid=53
[8] http://www.panl10n.net/madan-puraskar-pustakalaya-nepal/

In conclusion, this chapter discussed various works and methods for the correct segmentation of conjunct/compound characters and shadow characters to increase the performance of OCR. Moreover, various feature extraction methods and character recognition algorithms were also described briefly. Most of the research focuses on improving the performance of Devanagari OCR by improving the conjunct/compound character segmentation process. The methods include projection profile techniques, the collapsed horizontal projection technique, and recognition-driven segmentation techniques.

Various feature extraction methods and classifiers proposed for the successful recognition of Devanagari characters were also presented. Finally, various tools developed for Devanagari OCR, covering the Hindi, Sanskrit, Marathi and Nepali languages, were presented.

The next chapter will discuss the methods applied to conduct this research and experiment. It will also discuss the different components and phases of the applied method.

CHAPTER III
METHODOLOGY

Research works on Devanagari Optical Character Recognition suggest that the segmentation process cannot achieve full accuracy because of noise, touching characters, compound characters, variation in typefaces, and many similar looking characters. Because of the presence of language-specific constructs in non-Latin scripts, such as the “dika” (Devanagari), modifiers (south-east Asian scripts), writing order, or irregular word spacing (Arabic and Chinese), different approaches to segmentation are required (Agrawal, Ma, & Doermann, 2009). The Devanagari script also possesses its own constructs, which differ greatly from Latin.

Figure 5 Proposed Nepali OCR Model

The most practiced character dissection method for Devanagari works by removing the headerline (dika) and separating the lower and upper modifiers, which makes it easy to extract the basic characters but increases the complexity of extracting the modifiers. The modifiers get broken, and it is difficult to note their position in a sequence of segmented characters and to restore their original shape. To minimize the overhead of component level segmentation and minimize the errors due to inaccurate dissection, a hybrid approach which combines the Holistic Method and the Dissection Technique is proposed here. Kompalli et al. (Kompalli, Setlur, & Govindaraju, 2009) have also proposed a novel graph-based recognition driven segmentation methodology for Devanagari script OCR using a hypothesize-and-test paradigm, which is promising work and an inspiration for using a hybrid approach to OCR. Harit and Bag (2013) have also highlighted the need for new approaches, because the problems are unique to Devanagari and Bangla, and hence the solutions adopted by the OCR systems for other scripts cannot be directly adapted to these scripts (Bag & Harit, 2013).

The proposed framework has a two-phase recognition scheme:

Phase 1: Segment the input text image into words and recognize the words using the Holistic Approach. Measure the confidence of classification. If the confidence is lower than the threshold, we go for Phase 2 recognition.

Phase 2: Words that are poorly classified in Phase 1 are segmented into characters using the projection profile. Segmentation results may be characters or compound characters (conjuncts, shadow characters, consonant-consonant-vowel combinations, consonant-vowel combinations, and characters including diacritics). These characters are then classified. A general framework of DOCR is given in Figure 5.

The general framework of our approach consists of two main parts – training and recognition:

3.1 Training:

Training takes raw Nepali text data as input and outputs a dataset of words and a dataset of ligatures (compound characters), where each item is described by a feature vector. The training phase consists of two main steps:

1. Generation of a dataset of images for the possible words and ligatures (compound

characters) of the Nepali language to be used by the application.

2. Extracting features that describe each word and ligature in the dataset generated by the

previous step.

3.1.1 Dataset Generation:

This step involves the use of an automated computer program to generate the necessary training dataset. A text corpus of the target language is fed to the program, and the textual data is analyzed to generate the lists of words, basic characters, and compound characters which will later be used for rendering images representing the corresponding text. The various steps involved in dataset generation are:

3.1.1.1 Create Distinct Words List and Character List:

In this project, a text corpus collected by Madan Puraskar Pustakalaya (MPP) under the Bhasha Sanchar Project [9] is used. The corpus includes different types of articles from different news portals, magazines, websites and books (about 2,500 articles). The text corpus thus collected is fed to the Text Separator, a program written in C#. This program searches for the words and maintains a dictionary in the form of <word, frequency> tuples; a dictionary of characters in the form of <character, frequency> tuples is also generated. The number of words extracted for Nepali is over 150,000, of varying lengths. The number of basic and compound characters extracted is over 7,000, of varying character lengths.

Figure 6 Training Dataset Generation
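The dictionary-building step of the Text Separator can be sketched in Python (the thesis implements it in C#); the whitespace tokenization rule here is a simplifying assumption, since the real tool also extracts compound characters rather than single codepoints:

```python
from collections import Counter

def build_dictionaries(corpus_lines):
    """Build <word, frequency> and <character, frequency> dictionaries
    from raw corpus text, in the spirit of the Text Separator."""
    words = Counter()
    chars = Counter()
    for line in corpus_lines:
        for word in line.split():          # naive whitespace tokenization (assumption)
            words[word] += 1
            for ch in word:                # per-codepoint counts; the real tool
                chars[ch] += 1             # also tracks compound characters
    return words, chars

words, chars = build_dictionaries(["नेपाल नेपाल ओसीआर"])
# words["नेपाल"] == 2, words["ओसीआर"] == 1
```

The resulting word list drives the rendering of the image dataset described next.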

[9] This corpus has been constructed by the Nelralec / Bhasha Sanchar Project, undertaken by a consortium of the Open University, Madan Puraskar Pustakalaya (मदन पुरस्कार पुस्तकालय), Lancaster University, the University of Göteborg, ELRA (the European Language Resources Association) and Tribhuvan University.

3.1.1.2 Image Dataset Generation:

In order to generate an image dataset of words and compound characters (including basic characters), the following steps are carried out:

- Images for each extracted word and character are rendered using a rendering engine. The text is rendered using 15 different Devanagari Unicode fonts, including Mangal, Arial Unicode MS, Samanata, Kokila, Adobe Devanagari, and Madan.

- Degraded images are generated by applying different image filtering operations (e.g. threshold, blur, erode) to the images rendered in the previous step.
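The degradation step can be sketched with plain NumPy (the thesis uses filters from its image processing libraries; the kernel sizes and threshold value below are illustrative assumptions):

```python
import numpy as np

def threshold(img, t=128):
    """Binarize: pixels >= t become white (255), others black (0)."""
    return np.where(img >= t, 255, 0).astype(np.uint8)

def box_blur(img, k=3):
    """Simple k x k mean blur via a sliding-window sum."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def erode(img, k=3):
    """Morphological erosion of white regions: local minimum filter."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    windows = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
               for dy in range(k) for dx in range(k)]
    return np.minimum.reduce(windows)

clean = np.full((24, 48), 255, dtype=np.uint8)   # a blank white glyph image
clean[8:16, 10:40] = 0                           # a dark stroke
degraded = erode(box_blur(threshold(clean)))     # one possible degradation chain
```

In practice, several such chains with varying parameters would be applied to each rendered image to simulate scanning defects.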

3.1.2 Feature Extraction:

The second main step of the training phase is to extract a feature vector representing each word and compound character included in the dataset. For this, the following steps are done:

1. Normalize each image to a fixed width and height
2. Compute the Histogram of Oriented Gradients (HOG) descriptor

To extract the HOG features from the dataset, the hog routine implemented in skimage.feature has been used. The routine allows control over the orientations, pixels per cell, and cells per block. The process of feature extraction is shown in Figure 7.

Figure 7 Feature Extraction
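As an illustration, the HOG call for a Class 1 word image (normalized to 48×24 with the parameters listed in Table 8) might look as follows; the preprocessing around the call is an assumption:

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def word_features(img, size=(24, 48)):
    """Normalize a word image to a fixed size (height, width) and compute
    its HOG descriptor (orientations=8, 8x8 cells, 3x3 blocks, as used
    for the word classifiers)."""
    norm = resize(img, size, anti_aliasing=True)
    return hog(norm, orientations=8,
               pixels_per_cell=(8, 8),
               cells_per_block=(3, 3))

vec = word_features(np.random.rand(30, 60))
```

For a 24×48 image this yields 3×6 cells, hence 1×4 blocks of 3×3 cells, giving a feature vector of length 1 × 4 × 3 × 3 × 8 = 288.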

3.2 Recognition:

The recognition part takes as input an image specified by the user through the user interface. Its main task is to recognize any text that occurs in the input image. The recognized text is presented as output to the user in an editable format. The recognition of the text in an input image is done using the following steps:

Step 1: Segment the page image into lines and words.
Step 2: Describe each unknown word image using the HOG descriptor.
Step 3: Classify each word segment using the Random Forest classifier tool.
Step 4: Calculate the confidence of classification.
Step 5: If the classification confidence is lower than the threshold:
a. Segment the word into characters/ligatures
b. Perform character level classification
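The five steps above can be sketched as a control loop; the segmenters and classifiers below are placeholders standing in for the components described in this chapter, and the threshold value 0.2 is the one adopted in Section 3.2.4:

```python
THRESHOLD = 0.2  # confidence threshold adopted in this work

def recognize_page(page_img, segment_words, segment_chars,
                   word_classifier, char_classifier):
    """Two-phase recognition: holistic word recognition first,
    character-level recognition as a fallback."""
    text = []
    for word_img in segment_words(page_img):            # Step 1
        label, confidence = word_classifier(word_img)   # Steps 2-4
        if confidence >= THRESHOLD:                     # Phase 1 succeeded
            text.append(label)
        else:                                           # Step 5: Phase 2
            chars = [char_classifier(c)[0] for c in segment_chars(word_img)]
            text.append("".join(chars))
    return " ".join(text)

# Toy usage with stub components:
out = recognize_page(
    ["w1", "w2"],
    segment_words=lambda page: page,
    segment_chars=lambda w: list(w),
    word_classifier=lambda w: ("नेपाल", 0.9) if w == "w1" else ("?", 0.1),
    char_classifier=lambda c: (c.upper(), 1.0),
)
```

Only the poorly classified second "word" falls through to character-level recognition, which is what keeps the average cost of the hybrid scheme low.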

3.2.1 Line and Word Segmentation

In this research work, instead of the projection profile, a blob detection based approach for line and word segmentation has been used. Blobs are bright-on-dark or dark-on-bright regions in an image [10] [11]. In the Devanagari script each word is a bunch of characters, and these characters are tied to each other by the header line (dika). This property of the Devanagari script makes it easy to use blob detection for detecting individual words in a text document. Figure 9 shows Nepali words, with each word appearing as a separate bright region on the black background.

Figure 8 Snapshot of Character Segmentation
Figure 9 Nepali text words as Blobs

Word segmentation using blob detection involves various steps.

Algorithm: Line and Word Segmentation

Step 1: Preprocessing: blurring, binarization (grayscale and thresholding), skeletonization, and inverting the image
Step 2: Detect blobs
Step 3: Get the average blob size and remove all very small and very big blobs
Step 4: Create clusters of blobs by analyzing their distribution and y-coordinates
Step 5: Each cluster bounding box represents a text line; word segmentation can now be performed
Step 6: Re-apply blob detection within a line to perform word segmentation (vertical and horizontal blurring may be applied for more accurate segmentation)

[10] Blob Detection, scikit-image, http://scikit-image.org/docs/dev/auto_examples/plot_blob.html, Accessed: 12/19/2015
[11] Blobs Processing, AForge.NET, http://www.aforgenet.com/framework/features/blobs_processing.html, Accessed: 12/19/2015
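Steps 4 and 5 — grouping blobs into text lines by their vertical positions — can be sketched as follows; the representation of a blob as an (x, y, w, h) bounding box and the gap tolerance are assumptions:

```python
def cluster_blobs_into_lines(blobs, gap=10):
    """Group word blobs into text lines by the y-coordinate of their centers.
    blobs: list of (x, y, w, h) bounding boxes. Returns a list of lines,
    each line a list of blobs sorted left to right."""
    lines = []
    for blob in sorted(blobs, key=lambda b: b[1] + b[3] / 2):   # by y-center
        y_center = blob[1] + blob[3] / 2
        if lines and abs(y_center - lines[-1]["y"]) <= gap:
            lines[-1]["blobs"].append(blob)                      # same line
        else:
            lines.append({"y": y_center, "blobs": [blob]})       # new line
    return [sorted(line["blobs"], key=lambda b: b[0]) for line in lines]

lines = cluster_blobs_into_lines([(60, 5, 30, 10), (5, 6, 40, 10), (5, 40, 50, 12)])
# the two top blobs form one line, the lower blob another
```

A production implementation would adapt the gap to the average blob height found in Step 3 rather than using a constant.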

3.2.2 Character Segmentation:

Character segmentation into basic components becomes more challenging because of the script's properties – the use of modifiers and diacritics, and of compound characters and ligatures. By studying the structure of the Devanagari script and the use of compound characters, it was found that it is better to treat compound characters and ligatures as single characters. The method is inspired by the work of Nazly Sabbour and Faisal Shafait, the ligature based approach used to implement segmentation free Arabic and Urdu OCR (Sabbour & Shafait, 2013). On analyzing the Nepali text corpus, it is found that there are about 7,000 compound characters (basic characters, conjuncts, ligatures) used in Nepali. The projection profile (PP) algorithm is used to segment characters.

The algorithm for character segmentation is given below:

Algorithm: Character Segmentation
Step 1: Input: list of blobs
Step 2: Apply horizontal projection on the word rectangle part
        Hpp(word) = {r1, r2, r3, …, rn}, where r1, r2, …, rn are the scores of black pixels in the corresponding rows
Step 3: Find the header line in the word: HLl(word, x, y), HLh(word)
        HLl(word, x, y) is the location of the headerline in the word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. HLh(word) = y – x is the height of the headerline of the word
Step 4: Apply vertical projection on the word rectangle part
        Vpp(word) = {v1, v2, v3, …, vn}, where v1, v2, …, vn are the scores of black pixels in the corresponding columns
Step 5: Remove the header line
        Vpp(word)hr = Vpp(word) – HLh(word) = {v1 - HLh(word), v2 - HLh(word), …, vn - HLh(word)}
Step 6: Detect cut points
        CP(word) = <word, {cp1, cp2, …, cpn}>; cut points are valleys, i.e. the spaces between two characters
Step 7: Perform segmentation

This module takes the blobs (rectangles enclosing the words) as input. In the first step, horizontal projection is applied to the word. Hpp(word) = {r1, r2, r3, …, rn} is the result of the horizontal projection, which contains the score of black pixels in each row. Hpp(word) is analyzed to detect the header line of the word and to calculate its height HLh(word). The rows whose horizontal projection score is equal or close to the maximum score, and which are neighbors of each other, form the headerline. The analysis is performed on the upper half of the word. The location of the headerline, HLl(word, x, y), is the position of the headerline in the word, where x is the location of the upper row that lies in the headerline and y is the location of the lower row. The height of the headerline is given by HLh(word) = y – x. Then vertical projection is applied to the blob. Vpp(word) = {v1, v2, v3, …, vn} is the list of scores of black pixels in each column of the blob. The method practiced so far to isolate individual characters in a word of the Devanagari script is to remove the headerline; I have also used the same method, and it works well for isolating compound characters. The headerline is not actually removed, but HLh(word) is subtracted from each element of Vpp(word); the result is Vpp(word)hr = {v1 - HLh(word), v2 - HLh(word), …, vn - HLh(word)}. On subtraction, some elements may become less than zero; in such cases the element is set to zero, because no score can be less than zero. The next task is to find the cut points, CP(word) = <word, {cp1, cp2, …, cpn}>, where cp1, cp2, …, cpn are cut points, by analyzing Vpp(word)hr. Cut points are the points in a word at which the word can be chopped to isolate the characters; normally these are the points where the element of Vpp(word)hr is equal to zero, that is, the spaces between two characters. Finally, the cut points are noted and the segmentation is performed. The result of segmentation is given by CS(TextImage) = {<word1, <cs11, cs12, …, cs1n>>, <word2, <cs21, cs22, …, cs2m>>, …, <wordp, <csp1, csp2, …, cspq>>}.
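A minimal NumPy sketch of this procedure (binary word image with 1 = black pixel; the 90% headerline threshold is an illustrative assumption):

```python
import numpy as np

def segment_characters(word):
    """Projection-profile character segmentation for a binary word image
    (1 = black ink). Returns (start, end) column ranges of the segments."""
    hpp = word.sum(axis=1)                       # Hpp: black pixels per row
    upper = hpp[: word.shape[0] // 2 + 1]        # headerline lies in the upper half
    header_rows = np.where(upper >= 0.9 * upper.max())[0]   # rows near max score
    hl_height = len(header_rows)                 # HLh(word)
    vpp = word.sum(axis=0)                       # Vpp: black pixels per column
    vpp_hr = np.maximum(vpp - hl_height, 0)      # subtract headerline, clip at 0
    segments, start = [], None
    for col, score in enumerate(vpp_hr):
        if score > 0 and start is None:
            start = col                          # a segment begins
        elif score == 0 and start is not None:
            segments.append((start, col))        # a cut point ends the segment
            start = None
    if start is not None:
        segments.append((start, len(vpp_hr)))
    return segments

word = np.zeros((6, 9), dtype=int)
word[0, :] = 1            # the headerline (dika) ties the word together
word[1:, 1:3] = 1         # first character body
word[1:, 6:8] = 1         # second character body
# segment_characters(word) returns the two character column ranges
```

Subtracting the headerline height from the vertical projection, rather than erasing the headerline pixels, is exactly the trick described above: the valleys between characters then drop to zero and become cut points.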

3.2.3 Classifier Tool

For the development of both the word classifier and the character classifier, a Random Forest classifier tool was developed. According to scikit-learn.org, “A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting” [12].

[12] http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html [Accessed: 03-24-2016]

For testing purposes, a limited set of words and characters has been trained. The training of the Random Forest is performed with the following setup:

Word Classifier: Three different Random Forest classifiers are trained based on the word length, i.e. the ratio of image width to height. Training images with width <= (height*2) lie in class 1, those with width <= (height*4) in class 2, and those with width <= (height*6) in class 3.

Table 8 Word Classifier Training

 | Class 1 | Class 2 | Class 3
Image Normalization | (48, 24) | (96, 24) | (144, 24)
HOG Feature Extraction | orientations=8, pixels_per_cell=(8, 8), cells_per_block=(3, 3) | orientations=8, pixels_per_cell=(8, 8), cells_per_block=(3, 3) | orientations=8, pixels_per_cell=(8, 8), cells_per_block=(3, 3)
Random Forest Setup | 50 estimators | 50 estimators | 50 estimators
Words Trained (519 words, 48 images per word) | 291 | 212 | 16
Accuracy (3-fold Cross Validation) | 0.91 (+/- 0.06) | 0.92 (+/- 0.06) | 0.96 (+/- 0.05)
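The class assignment rule by aspect ratio can be written directly; words wider than six times their height fall outside all three classes, and since how such words are handled is not stated, returning None for that case is an assumption:

```python
def word_class(width, height):
    """Assign a word image to one of the three word classifiers
    by its width-to-height ratio."""
    if width <= height * 2:
        return 1          # normalized to (48, 24)
    if width <= height * 4:
        return 2          # normalized to (96, 24)
    if width <= height * 6:
        return 3          # normalized to (144, 24)
    return None           # wider than 6x height: not covered (assumption)

assert word_class(40, 24) == 1
assert word_class(90, 24) == 2
assert word_class(140, 24) == 3
```

Splitting by aspect ratio keeps each classifier's normalized input close in shape to its training images, which avoids distorting short words into long frames and vice versa.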

The following graphs show the learning curves for word classifier 1 (Class 1), word classifier 2 (Class 2), and word classifier 3 (Class 3), each with 20 iterations and a test size of 30 percent.

Figure 10 Learning Curve - Word classifier 1

Figure 11 Learning Curve - Word classifier 2

Figure 12 Learning Curve - Word classifier 3

Character Classifier: A single Random Forest classifier was developed for the classification of characters.

Table 9 Character Classifier Training

Image Normalization | (37, 58)
HOG Feature Extraction | orientations=9, pixels_per_cell=(8, 8), cells_per_block=(4, 4)
Random Forest Setup | 55 estimators
Characters Trained (144 images per character) | 417 characters (basic and compound characters, half letters, and most frequent compound characters)
Accuracy (3-fold Cross Validation) | 0.84 (+/- 0.07)

The feature extraction configuration, Random Forest setup, training data size and the cross validation result for the training are presented in Table 9. The classifier was trained successfully with an accuracy of more than 80% using the above configuration.

The following graph shows the learning curve for the character classifier with 20 iterations and a test size of 30 percent.

Figure 13 Learning Curve - Character classifier

3.2.4 Confidence and Threshold:

Confidence is a prediction probability. The prediction probability indicates how confidently a character or word has been classified. The threshold is a numeric value; if the confidence is below the threshold, the prediction or classification is marked as a false classification. The threshold value was calculated by studying classification results and the corresponding prediction probabilities. In this work, 0.2 is taken as the threshold. A sample of word recognition data and prediction probabilities is presented in Appendix II.
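With scikit-learn's RandomForestClassifier, the confidence can be taken as the maximum of predict_proba for a sample; this sketch on toy two-class data illustrates the thresholding rule (the toy data and feature dimension are assumptions standing in for real HOG vectors):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

THRESHOLD = 0.2  # threshold adopted in this work

# Toy training data standing in for HOG vectors of two word classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(1, 0.1, (20, 8))])
y = np.array([0] * 20 + [1] * 20)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def classify_with_confidence(clf, sample):
    """Return (label, confidence); confidence below THRESHOLD means the
    word should be passed to character-level recognition."""
    proba = clf.predict_proba([sample])[0]
    label = clf.classes_[np.argmax(proba)]
    return label, float(proba.max())

label, conf = classify_with_confidence(clf, np.full(8, 1.0))
needs_phase2 = conf < THRESHOLD
```

A sample near the class-1 cluster yields a high confidence, so it would be accepted in Phase 1 without character segmentation.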

In conclusion, this chapter discussed the proposed model of Nepali OCR. Our model consists of a two-phase recognition scheme: first, the OCR engine tries to recognize each word as a whole; second, if it is not confident about the word, it segments the word into its constituent characters and recognizes it at the character level. The general framework of our approach consists of two main parts – training and recognition.

The training phase consists of two main steps – dataset generation and feature extraction. The processes of dataset generation and feature extraction were described in this chapter.

The line and word segmentation and character segmentation algorithms were also described. The line and word segmentation algorithm uses blob detection, while character segmentation uses the projection profile technique.

Finally, the different classifiers used in the experiment and their configurations were presented, along with the training results, cross validation results, and learning curves for both the word and character classifiers. The classifiers were successfully trained with more than 80% accuracy.

In the next chapter, the segmentation and recognition results are described.

CHAPTER IV
RESULTS AND DISCUSSION

In this chapter, the experimental study and testing of the proposed architecture are presented. To test the system, various documents were generated and collected. Results of testing the different modules of the system, namely word segmentation, character segmentation, word level recognition, and compound character level recognition, are also presented here.

4.1 Experimental Setup

For conducting the experiment, C# and Python 3 were used as the major programming languages for image processing and machine learning. The experiment was conducted on a machine with the following hardware and software configuration.

Table 10 Experimental Environment

Title Description
Computer System Dell Inspiron 5420, i5 Processor, 4GB RAM,
1GB NVIDIA Graphics
Operating System Windows 10
Programming Languages C#, Python 3
Image Processing Libraries Aforge.net, Accord.net, scikit-image,
OpenCV-Python
Machine Learning Libraries scikit-learn

For various tasks, including the GUI design of the experiment software, pre-processing of input images, and post-processing, C# was mostly used as the major language. C# libraries like AForge.NET and Accord.NET were used for different image processing tasks like reading images, removing noise, and performing segmentation. Similarly, scikit-image was used for image feature extraction. For the machine learning part, the RandomForest routine available in scikit-learn was used.

4.2 Segmentation Results

Line segmentation is found to be accurate (almost 100%) as long as there is a sufficient amount of space between lines. The accuracy of word segmentation is reduced a little by the lower modifiers (Ukar, Ookar, Rrikar, and Halant) when they are separated from the core character, and by punctuation marks like the comma and dot. The character segmentation results for 7 documents are presented in Table 11.

Table 11 Character Segmentation Results

Document | 1 | 2 | 3 | 4 | 5 | 6 | 7
Characters Present | 212 | 118 | 370 | 353 | 166 | 273 | 289
Characters Over-segmented | 5 | 7 | 12 | 11 | 2 | 3 | 4
Characters Under-segmented | 14 | 7 | 23 | 9 | 9 | 18 | 19
Error (%) | 8.96 | 11.86 | 9.45 | 5.66 | 6.62 | 7.69 | 7.95

From the above test, it is clear that most of the errors are due to under-segmentation; the errors due to over-segmentation are less than half of those due to under-segmentation. The average error rate of character segmentation is found to be 8.31%.

4.3 Recognition Results

Both approaches – the character level recognition approach and the hybrid approach – were tested. The classifiers were tested on documents containing words and characters from the trained set of 519 words and 417 characters. The recognition results for 7 documents are presented in Table 12.

Table 12 Recognition Results

Document | Characters Present | Character Level: Correctly Recognized | Character Level: Accuracy | Hybrid: Correctly Recognized | Hybrid: Accuracy
1 | 116 | 85 | 73% | 111 | 95.68%
2 | 92 | 78 | 84.78% | 89 | 96.73%
3 | 130 | 111 | 85.38% | 121 | 93%
4 | 118 | 82 | 69.49% | 116 | 98%
5 | 105 | 83 | 79% | 100 | 95.23%
6 | 139 | 114 | 82% | 131 | 94.24%
7 | 65 | 51 | 78.46% | 59 | 90.76%

From Table 12, we can see that the accuracy of the character level recognition approach ranges from 69.49% to 85.38%, with an average accuracy of 78.87%. The accuracy of the hybrid approach ranges from 90.76% to 98%, with an average accuracy of 94.80%.

The recognition results are also presented as a bar chart in Figure 14.

Figure 14 Recognition Results (bar chart of per-document accuracy for the Character Level Recognition and Hybrid approaches, using the values in Table 12)

From the above results, we can see that the proposed hybrid approach is promising: the accuracy rate increased by more than 10 percentage points when using the hybrid approach.

4.4 Computational Cost

In this sub-section, the computational cost of both recognition approaches is discussed in mathematical terms. This computational cost only includes the cost of segmentation and recognition (other costs like pre-processing, training, and post-processing are omitted for now).

Computational cost – character level recognition technique:

This technique involves word segmentation, segmentation of each word into characters, and then recognition of each character. Thus the total computational cost for this approach is given by:

C_t = w_s + C_cls + C_clr

where
w_s = word segmentation cost,
C_t = total computational cost,
C_cls = character level segmentation cost, and
C_clr = character level recognition cost.

Assume that there are n words in a document and the time taken to segment an image document into words is w_s. If the average cost to segment each word into characters is C_s, then the cost of segmenting n words is n × C_s, i.e. C_cls = n × C_s.

Also assume that the character segmentation yields m characters, i.e. there are m characters present. If r is the recognition time required to recognize a single character, then the total recognition time is m × r, i.e. C_clr = m × r.

Thus, the total computation time is given by the equation:

C_t = w_s + n × C_s + m × r …………… (1)

Computational cost – hybrid approach:

This approach also begins with word level segmentation: the recognition process starts by segmenting the image document into possible words. The word level segmentation cost is the same as that of the character level recognition technique, i.e. w_s.

In this approach, we first try to recognize all the words. If the average cost of recognizing a single word is w_r and there are n words in total, the cost of recognizing n words is W_rc = n × w_r.

Then the confidence of recognition for each recognized word is calculated to decide whether to go for character level recognition or not. If the cost of calculating the recognition confidence of a single word is R_cc, then the total cost of calculating the recognition confidence for all words is R_ct = n × R_cc.

The next step is deciding how many words have been successfully recognized and how many require further processing. If p words require further processing, then n − p words do not require character level segmentation. The cost of character level segmentation is C_cls = p × C_s. If the character segmentation of p words yields q characters, then the cost of recognizing q characters is C_clr = q × r.

Thus the total computational cost of the hybrid approach is given by the equation:

C_th = w_s + W_rc + R_ct + C_cls + C_clr …………… (2)

where

C_th = total computational cost of the hybrid approach
w_s = word segmentation cost
W_rc = word recognition cost
R_ct = recognition confidence calculation cost
C_cls = character-level segmentation cost
C_clr = character recognition cost

The equation can further be written as

C_th = w_s + n × w_r + n × R_cc + p × C_s + q × r        (3)
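Equation (3) can likewise be sketched in code; all numbers below are illustrative assumptions, not measured costs:

```python
def hybrid_cost(w_s, n, w_r, R_cc, p, C_s, q, r):
    """Total cost of the hybrid approach (equation 3): word segmentation,
    word recognition and confidence scoring for all n words, plus character-level
    segmentation and recognition only for the p low-confidence words (q characters)."""
    return w_s + n * w_r + n * R_cc + p * C_s + q * r

# Illustrative (assumed) costs: 100 words, 20 of which fall back to character level.
print(hybrid_cost(w_s=4.0, n=100, w_r=0.25, R_cc=0.125, p=20, C_s=0.25, q=90, r=0.125))  # 57.75
```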

Comparison of computational cost:

Equation (1) can be rewritten as

C_t = w_s + [(n − p) × C_s + (p × C_s)] + [(m − q) × r + (q × r)]

i.e. C_t = w_s + [(n − p) × C_s + (m − q) × r] + p × C_s + q × r        (4)

Comparing equations (3) and (4), we can see that the computation cost w_s + p × C_s + q × r is common to both algorithms.

In the worst case, p = n and q = m, so equation (3) becomes

C_th = w_s + n × w_r + n × R_cc + n × C_s + m × r        (5)

In the best case, p = 0 and q = 0, so equation (3) becomes

C_th = w_s + n × w_r + n × R_cc        (6)

For the character-level recognition technique, the computational cost is the same in both the worst and the best case.

Comparing equations (1) and (5), we can conclude that

C_t < C_th

Comparing equations (1) and (6): n is always less than m (i.e. n < m), so far fewer units need to be recognized; provided the per-word recognition and confidence costs do not outweigh the character segmentation and recognition costs they replace, less computation is performed. This can be expressed as

C_t > C_th

This shows that in the best case the hybrid approach performs better in terms of computational cost, while in the worst case the character-level recognition technique performs better. The hybrid approach will not always perform best, however: even if we take p to be n/2, there will not be much difference in computational cost.
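A small numeric illustration of equations (1), (5) and (6); all unit costs are assumed values chosen for the example, not measured ones:

```python
# Assumed unit costs and counts for illustration only.
w_s, n, m = 4.0, 100, 450          # word segmentation cost, word count, character count
C_s, r = 0.25, 0.125               # per-word char segmentation, per-char recognition
w_r, R_cc = 0.25, 0.125            # per-word recognition, per-word confidence scoring

C_t = w_s + n * C_s + m * r                                  # equation (1)
C_th_worst = w_s + n * w_r + n * R_cc + n * C_s + m * r      # equation (5): p = n, q = m
C_th_best = w_s + n * w_r + n * R_cc                         # equation (6): p = 0, q = 0

print(C_t, C_th_worst, C_th_best)  # 85.25 122.75 41.5
assert C_th_best < C_t < C_th_worst
```

Under these assumed costs the hybrid approach is roughly twice as cheap in the best case and about 44% more expensive in the worst case, matching the inequalities above.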

In conclusion, this chapter discussed the experimental environment and the results of the experiments performed. It began with the hardware and software environment on which the experiments were run, and then presented the segmentation and recognition results.

In the segmentation results, the character segmentation results and error rates were presented. Similarly, in the recognition results section, the results of the character-level recognition approach and of the hybrid recognition approach (the proposed method) were presented, tabulated as accuracy percentages.

The next chapter describes the contributions of this work and possible future improvements in its conclusion.

CHAPTER V
CONCLUSION AND FUTURE WORK

Using the hybrid approach improved the recognition accuracy: the performance of the OCR increased by nearly 10%. However, this is not always the case; it depends on how many words the system has been trained on, since word recognition is analogous to how familiar one is with a language. Training results show that the word classifier performs better, at nearly 90% on average, while the performance of character-level recognition is found to be 78.87%.

The major contributions of this work are:

 We conducted a detailed literature review on different models of character segmentation, the various challenges in Devanagari OCR, and the segmentation and recognition techniques proposed for Devanagari OCR, covering Nepali, Sanskrit, Hindi, and Marathi.

 We proposed a model for Nepali OCR that combines the holistic technique with character-level dissection. The system first tries to recognize a word as a whole; if it is not confident about the classification, character-level dissection and recognition are performed. This noticeably reduces the amount of segmentation work.

 The model is trained using Random Forest classifiers, with the HOG descriptor as the feature set. A set of frequent Nepali words and a set of frequent characters/compound characters are trained and validated using 3-fold cross-validation.

 Along with the cross-validation testing, manual testing of the models is also presented. The testing shows high accuracy rates and possibilities for generalization and further improvement.
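The word-first, fall-back-to-characters routing described above can be sketched as follows. This is a minimal sketch: the classifiers and segmenter are stand-in stubs, and the 0.5 confidence threshold is an assumption for illustration, not the trained Random Forest models used in this work:

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off for accepting a whole-word prediction

def recognize_word_stub(word_image):
    """Stand-in for the Random Forest word classifier: returns (label, confidence)."""
    return word_image.get("word_guess", "?"), word_image.get("word_conf", 0.0)

def segment_chars_stub(word_image):
    """Stand-in for character-level dissection of a word image."""
    return word_image.get("chars", [])

def recognize_char_stub(char_image):
    """Stand-in for the Random Forest character classifier."""
    return char_image

def hybrid_recognize(word_images):
    """Recognize each word holistically; fall back to character-level
    segmentation and recognition only when confidence is too low."""
    results = []
    for img in word_images:
        label, conf = recognize_word_stub(img)
        if conf >= CONFIDENCE_THRESHOLD:
            results.append(label)                 # accept whole-word prediction
        else:                                     # dissect and recognize characters
            chars = segment_chars_stub(img)
            results.append("".join(recognize_char_stub(c) for c in chars))
    return results

# Toy input: the first "image" is confidently recognized as a word, the second is not.
words = [
    {"word_guess": "नेपाल", "word_conf": 0.86},
    {"word_guess": "?", "word_conf": 0.30, "chars": ["ओ", "सि", "आर"]},
]
print(hybrid_recognize(words))  # ['नेपाल', 'ओसिआर']
```

Only the second word incurs the character-level segmentation and recognition cost, which is the source of the computational savings analyzed in Chapter IV.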

Our focus was on improving the performance of Nepali OCR by using a hybrid approach of recognition. The approach reduces the character-level or component-level segmentation work.

There are several issues and possibilities that can be addressed in the future to further improve

the performance. Some of the possible improvements are listed below:

 The performance is constrained by over-segmentation and under-segmentation problems, which have always been pertinent issues and challenges for Devanagari OCR. They may be addressed by applying recognition-driven segmentation. Under-segmented characters mostly include shadow characters and conjunct/compound characters; identifying such characters and training on them is one possible line of future work.

 The proposed model can be generalized and trained to recognize a larger set of words and compound characters, not only for Nepali but also for Hindi, Marathi, and other languages written in the Devanagari script.

 Better, more concrete methods must be designed for creating multiple classes of word images. The use of multiple classifiers apparently improved performance, but this has to be validated further for a more reliable measure.

References
Agrawal, M., Ma, H., & Doermann, D. (2010). Generalization of Hindi OCR using adaptive
segmentation and font files. In Guide to OCR for Indic Scripts. Springer London, pp.
181-207.
Bag, S., & Harit, G. (2013). A survey on optical character recognition for Bangla and
Devanagari Script. Sadhana, 133-168.
Bal, B. K. (2009). Scripts Segmentation and OCR II Nepali OCR and Bangla Collaboration.
Conference on Localized ICT Development and Dissemination across Asia. PAN
Localization Project. Laos.
Bansal, V., & Sinha, M. (2001). A complete OCR for printed Hindi text in Devanagari Script.
ICDAR (p. 0800). IEEE.
Bansal, V., & Sinha, R. (1998). Segmentation of Touching Characters in Devanagari.
Proceedings CVGIP, (pp. 371-376). Delhi.
Bansal, V., & Sinha, R. (2002). Segmentation of touching and fused Devanagari Characters.
Pattern Recognition, 875-893.
Bishnu, A., & Chaudhuri, B. B. (1999). Segmentation of Bangla handwritten text into
characters by recursive contour following. Proceedings of the International
Conference on Document Analysis and Recognition, (pp. 402-405).
Casey, R. G., & Lecolinet, E. (1996, July). A Survey of Methods and Strategies in Character
Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18.
Chaudhuri, B. B., & Pal, U. (1997). An OCR System to Read Two Indian Language Scripts:
Bangla and Devanagari (Hindi). Proceedings of the Fourth International Conference
on Document Analysis and Recognition (pp. 1011-1015). IEEE.
Dhurandhar, A., Shankarnarayanan, K., & Jawale, R. (2005). Robust Pattern Recognition
Scheme for Devanagari Script. (pp. 1021 – 1026). Springer-Verlag Berlin Heidelberg
2005.
Dongre, V. J., & Mankar, V. H. (2011). Devanagari Document Segmentation Using
Histogram Approach. International Journal of Computer Science, Engineering and
Information Technology (IJCSEIT).
Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-
Degree Compared. (pp. 1-10). Grid Computing Environments Workshop.
Garain, U., & Chaudhuri, B. B. (2002). Segmentation of touching characters in printed
Devnagari and Bangla scripts using fuzzy multifactorial analysis. IEEE Trans. Syst.
Man Cybern., (pp. 449–459).
Hansen, J. (2002). A Matlab Project in Optical Character Recognition (OCR). DSP Lab,
University of Rhode Island, 6.
Holley, R. (n.d.). How Good Can It Get? Analysing and Improving OCR Accuracy in Large
Scale Historic Newspaper Digitisation Programs. Retrieved 03 04, 2014, from
http://www.dlib.org/dlib/march09/holley/03holley.html

K.C., S., & Nattee, C. (2007). Template-based Nepali Natural Handwritten Alphanumeric
Character Recognition. Thammasat Int. J. Sc. Tech, 12(1).
Khedekar, S., Ramanaprasad, V., Setlur, S., & Govindaraju, V. (2003). Text - Image
Separation in Devanagari Documents. Proceedings of the Seventh International
Conference on Document Analysis and Recognition (ICDAR 2003) .
Kompalli, S., Nayak, S., & Setlur, S. (2005). Challenges in OCR of Devanagari Documents.
Kompalli, S., Setlur , S., & Govindaraju, V. (2006). Design and Comparison of Segmentation
Driven and Recognition Driven Devanagari OCR.
Kompalli, S., Setlur, S., & Govindaraju, V. (2009). Devanagari OCR using a recognition
driven segmentation framework and stochastic language models. IJDAR.
Kulkarni, S. (2013). Issues with Devanagari Display Type. WhiteCrow Designs.
Kumar, V., & Sengar, P. K. (2010). Segmentation of Printed Text in Devanagari Script and
Gurmukhi Script. International Journal of Computer Applications, 3.
Ma, H., & Doermann, D. (2003). Adaptive Hindi OCR using generalized Hausdorff image
comparison. ACM Transactions on Asian Language Information Processing, 2(3),
193-218.
Murtoza, S. M. (2005). Bangla Optical Character Recognition. BRAC University.
OCR Applications. (2015, April). Retrieved from cvision.
OCR Processing Steps [ABBYY Developer Portal]. (n.d.). Retrieved 05 22, 2014, from
http://www.abbyy-developers.eu/en:tech:processing
Optical character recognition. (2015, 05 04). (Wikipedia.org) Retrieved 5 22, 2014, from
Wikipedia: http://en.wikipedia.org/wiki/Optical_character_recognition
Optical Character Recognition. (2015, 04 17). Retrieved from Webopedia:
http://www.webopedia.com/TERM/O/optical_character_recognition.html
Pal, U., & Chaudhuri, B. (2004). Indian script character recognition: a survey. Pattern
Recognition.
Rupakheti, P., & Bal, B. K. (2009). Research Report on the Nepali OCR. Madan Puraskar
Pustakalaya.
Sabbour, N., & Shafait, F. (2013). A Segmentation Free Approach to Arabic and Urdu OCR.
SPIE Proceedings.
Scanning in Digital Age. (2015, 04 16). Retrieved from Record Nations:
http://www.recordnations.com/articles/scanning-in-digital-age/
Shakya, S., Tuladhar, S., Pandey, R., & Bal, B. K. (2009). Interim Report on Nepali OCR.
Madan Puraskar Pustakalaya.
Singh, A., Bacchuwar, K., & Bhasin , A. (2012, June). A Survey of OCR Applications.
International Journal of Machine Learning and Computing, 2.

Typographic ligature. (2016, 03 24). Retrieved 04 04, 2016, from Wikipedia:
http://en.wikipedia.org/wiki/Typographic_ligature
What is OCR? (2015, 04 17). Retrieved from ABBYY:
http://finereader.abbyy.com/about_ocr/whatis_ocr/

APPENDIX I
Snapshots

Word Extraction from Text Corpus

Character Extraction from Text Corpus

Word Training Data

Character Training Data

Segmentation

Recognition

Text Extractor

APPENDIX II
Word Recognition Data Sample

Here a sample of word recognition data is presented. Each line is the result of recognizing a single word. Each record has two values separated by an underscore (_): the first value is the word code and the second is the recognition confidence (prediction probability).

226_0.5
196_0.46
130_0.62
66_0.42
129_0.36
36_0.4
37_0.48
38_0.3
39_0.32
40_0.76
41_0.54
42_0.46
43_0.54
44_0.58
173_0.28
46_0.86
47_0.42
48_0.48
35_0.34
49_0.32
51_0.44
52_0.56
53_0.7
54_0.14
55_0.34
56_0.32
57_0.54
58_0.42
59_0.92
60_0.3
61_0.38
62_0.68
63_0.66
50_0.72
34_0.52
346_0.16
32_0.4
3_0.24
4_0.48
5_0.52
6_0.3
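A minimal parser for records in this format (word code and confidence separated by an underscore) might look like the sketch below; the 0.5 threshold used for filtering is an illustrative assumption:

```python
def parse_recognition_line(line):
    """Split a 'code_confidence' record into (word_code, confidence)."""
    code, conf = line.strip().split("_")
    return int(code), float(conf)

sample = ["226_0.5", "196_0.46", "66_0.42"]
parsed = [parse_recognition_line(s) for s in sample]
print(parsed)  # [(226, 0.5), (196, 0.46), (66, 0.42)]

# Words below an (assumed) confidence threshold would be routed to
# character-level segmentation and recognition.
low_confidence = [code for code, conf in parsed if conf < 0.5]
print(low_confidence)  # [196, 66]
```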
