
CHAPTER FOUR

OPTICAL CHARACTER RECOGNITION

Introduction: Character Recognition, Evolution, and Development

1.1. Generation and Recognition of Characters

Most people learn to read and write during their first few years of education. By the time they have grown out of childhood, they have already acquired very good reading and writing skills, including the ability to read most texts, whether they are printed in different fonts and styles, or handwritten neatly or sloppily. Most people have no problem reading the following: light or heavy print; upside-down print; advertisements in fancy font styles; characters with flowery ornaments and missing parts; characters with odd decorations, stray marks, or broken and fragmented parts; misspelled words; and artistic and figurative designs. At times the characters and words may appear rather distorted, and yet, by experience and by context, most people can still figure them out. In contrast, despite more than five decades of intensive research, the reading skill of the computer is still well behind that of human beings. Most OCR systems still cannot read degraded documents and handwritten characters/words.

1.2. History of OCR

To understand the phenomena described in the above section, we have to look at the history of
OCR, its development, recognition methods, computer technologies, and the differences between
humans and machines.

It is always fascinating to find ways of enabling a computer to mimic human functions, such as the ability to read, to write, to see things, and so on. OCR research and development can be traced back to the early 1950s, when scientists tried to capture the images of characters and text, first by mechanical and optical means (rotating disks and a photomultiplier, a flying-spot scanner with a cathode-ray tube), and later by photocells and arrays of photocells. At first the scanning operation was slow, and only one line of characters could be digitized at a time by moving the scanner or the paper medium. Subsequently, the invention of drum and flatbed scanners extended scanning to the full page. Then advances in digital integrated circuits brought photoarrays with higher density, faster document transports, and higher speed in scanning and digital conversion. These important improvements greatly accelerated the speed of character recognition, reduced the cost, and opened up the possibility of processing a great variety of forms and documents. Throughout the 1960s and 1970s, new OCR applications sprang up in retail businesses, banks, hospitals, post offices; insurance, railroad, and aircraft companies; newspaper publishers; and many other industries.

In parallel with these advances in hardware development, intensive research on character recognition was taking place in the research laboratories of both academic and industrial sectors. Because recognition techniques and computers were not that powerful in the early days (1960s), OCR machines tended to make lots of errors when the print quality was poor, caused either by wide variations in type fonts and roughness of the paper surface or by the cotton ribbons of typewriters. To make OCR work efficiently and economically, there was a big push from OCR manufacturers and suppliers toward the standardization of print fonts, paper, and ink qualities for OCR applications. New fonts such as OCR-A and OCR-B were designed in the 1970s by the American National Standards Institute (ANSI) and the European Computer Manufacturers Association (ECMA), respectively. These special fonts were quickly adopted by the International Organization for Standardization (ISO) to facilitate the recognition process. As a result, very high recognition rates became achievable at high speed and at reasonable cost. Such accomplishments also brought better printing quality of data and paper for practical applications. Indeed, they completely revolutionized the data input industry and eliminated the jobs of thousands of keypunch operators who had been doing the mundane work of keying data into the computer.

1.3. Development of New Techniques


As OCR research and development advanced, demands on handwriting recognition also increased, because a lot of data (such as addresses written on envelopes; amounts written on checks; and names, addresses, identity numbers, and dollar values written on invoices and forms) were written by hand and had to be entered into the computer for processing. But early OCR techniques were based mostly on template matching, simple line and geometric features, stroke detection, and the extraction of their derivatives. Such techniques were not sophisticated enough for practical recognition of data handwritten on forms or documents. To cope with this, the standards committees in the United States, Canada, Japan, and some countries in Europe designed handprint models in the 1970s and 1980s for people to write in boxes. Characters written in such specified shapes did not vary too much in style, and they could be recognized more easily by OCR machines, especially when the data were entered by controlled groups of people; for example, employees of the same company were asked to write their data like the advocated models. Sometimes writers were asked to follow certain additional instructions to enhance the quality of their samples, for example: write big, close the loops, use simple shapes, do not link characters, and so on. With such constraints, OCR recognition of handprints flourished for a number of years.

1.4. Recent Trends and Movements

As the years of intensive research and development went by, and with the birth of several new
conferences and workshops such as IWFHR (International Workshop on Frontiers in Handwriting
Recognition), ICDAR (International Conference on Document Analysis and Recognition), and
others, recognition techniques advanced rapidly. Moreover, computers became much more
powerful than before.

People could write the way they normally did, characters no longer had to be written like specified models, and the subject of unconstrained handwriting recognition gained considerable momentum and grew quickly. By now, many new algorithms and techniques in preprocessing, feature extraction, and powerful classification methods have been developed.

TOOLS FOR IMAGE PREPROCESSING

To extract symbolic information from millions of pixels in document images, each component in the character recognition system is designed to reduce the amount of data. As the first important step, image and data preprocessing serves the purpose of extracting regions of interest and enhancing and cleaning up the images, so that they can be directly and efficiently processed by the feature extraction component. This section covers the most widely used preprocessing techniques and is organized in a global-to-local fashion.
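
As a concrete illustration, here is a minimal sketch of a typical binarization step, assuming the opencv-python (cv2) package is available; Otsu's thresholding is used only as one common choice, and the file names are hypothetical.

import cv2

# Load a scanned page in gray level (hypothetical file name).
gray = cv2.imread('scanned_page.png', cv2.IMREAD_GRAYSCALE)

# Reduce speckle noise before thresholding.
smoothed = cv2.medianBlur(gray, 3)

# Global binarization with Otsu's method: the threshold is chosen
# automatically from the gray-level histogram.
_, binary = cv2.threshold(smoothed, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite('binary_page.png', binary)

A single global threshold works well on cleanly printed pages; degraded documents often require adaptive, locally computed thresholds instead.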

GENERIC FORM-PROCESSING SYSTEM

When you file an annual income tax form, write a check to pay a utility bill, or fill in a credit card application, have you ever wondered how these documents are processed and stored? Millions of such documents need to be processed as an essential operation in many business and government organizations, including telecommunications, health care, finance, insurance, government, and public utilities. Driven by the great need to reduce both the huge amount of human labor involved in reading the data and entering them into electronic databases and the associated human error, many intelligent automated form-processing systems have been designed and put into practice. This section will introduce the general concept of an intelligent form-processing system that can serve as the kernel of all form-processing applications.

The input of a typical form-processing system consists of color or gray-level images in which several hundred pixels correspond to each linear inch of the paper medium. The output, by contrast, consists of ASCII strings that usually can be represented by just a few hundred bytes. To achieve such a high level of data abstraction, a form-processing system depends on the following main components for information extraction and processing (a minimal pipeline sketch follows the list):

• Image acquisition: acquire the image of a form in color, gray-level, or binary format.
• Binarization: convert the acquired form images to binary format, in which the foreground contains the logo, the form frame lines, the preprinted entities, and the filled-in data.
• Form identification: for an input form image, identify the most suitable form model from a database.
• Layout analysis: understand the structure of forms and the semantic meaning of the information in tables or forms.
• Data extraction: extract pertinent data from the respective fields, and preprocess the data to remove noise and enhance the data.
• Character recognition: convert the gray or binary images that contain textual information to an electronic representation of characters that facilitates post-processing, including data validation and syntax analysis.
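
To make the cascade concrete, the following Python sketch chains these components together. Every function body here is a trivial stub standing in for a real component; only the data flow between stages is meant to be illustrative.

# Schematic sketch of the form-processing cascade; all stages are stubs.

def binarize(raw_image):
    # Stub for binarization: color/gray pixels -> black and white.
    return raw_image

def identify_form(binary_image, model_database):
    # Stub for form identification: pick the best-matching form model.
    return model_database.get('default')

def analyze_layout(binary_image, model):
    # Stub for layout analysis: locate the fields described by the model.
    return {'name': (0, 0, 100, 20)}  # hypothetical field -> bounding box

def extract_data(binary_image, fields):
    # Stub for data extraction: crop the filled-in data from each field.
    return {field: binary_image for field in fields}

def recognize_characters(field_image):
    # Stub for character recognition: field image -> ASCII string.
    return ''

def process_form(raw_image, model_database):
    binary = binarize(raw_image)
    model = identify_form(binary, model_database)
    fields = analyze_layout(binary, model)
    field_images = extract_data(binary, fields)
    return {field: recognize_characters(image)
            for field, image in field_images.items()}

Note how each stage consumes the output of the previous one and passes on less data, from a full page image down to a few hundred bytes of ASCII.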

Each of the above steps reduces the amount of information to be processed by the later steps. Conventional approaches pass information through these components in a cascade manner and seek the best solution for each step. The rigid system architecture and the constant direction of information flow limit the performance of each component to a local maximum. Observations from data in real-life applications, and the way humans read distorted data, lead us to construct a knowledge-based, intelligent form-processing system. Instead of simply concatenating general-purpose modules with little consideration of the characteristics of the input document, the performance of each module is improved by utilizing as much knowledge as possible, and the global performance is optimized by interacting with the knowledge database at run-time.

An overview of an intelligent form-processing system with some typical inputs and outputs is illustrated in Figure 4.1. As the kernel of this intelligent form-processing system, the knowledge database is composed of short-term and long-term memories, along with a set of generic and specific tools for document processing. The short-term memory stores the knowledge gathered at run-time, such as the statistics of a batch of input documents. The long-term memory stores the knowledge gathered in the training phase, such as the references for recognizing characters and logos and for identifying form types. Different types of form images can be characterized by their specific layouts. Instead of keeping the original images of model forms, only the extracted signatures are needed in building the database of known form types. Here, the signature of a form image is defined as a set of statistical and structural features extracted from the input form image. The generic tools include various binarization methods, binary smoothing methods, normalization methods, and so on. The specific tools include cleaning, recognition, or verification methods for certain types of input images. When an input document passes through the major components of the system, the knowledge in the database helps the components select the proper parameters or references for processing. At the same time, each component gathers its respective information and sends it back to the knowledge base. Therefore, the knowledge base adapts itself dynamically at run-time. At run-time, for each unknown document image, features are extracted and compared with the signatures of the prototypes exposed to the system in the learning phase.

Figure 4.1: Overview of an intelligent form-reading system.
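
As a hedged illustration of how such signatures might be stored and matched, the sketch below keeps a long-term memory of signature vectors and returns the nearest known form type. The feature vector itself (e.g., counts of line crossings) is assumed to be computed elsewhere, and the example values are hypothetical.

import numpy as np

class KnowledgeBase:
    """Illustrative long-term memory mapping form types to signatures."""

    def __init__(self):
        self.long_term = {}  # form type -> signature learned in training

    def register(self, form_type, signature):
        # Training phase: store the signature of a model form.
        self.long_term[form_type] = np.asarray(signature, dtype=float)

    def best_match(self, signature):
        # Run-time: return the closest known form type and its distance.
        sig = np.asarray(signature, dtype=float)
        best_type, best_dist = None, np.inf
        for form_type, ref in self.long_term.items():
            dist = np.linalg.norm(sig - ref)
            if dist < best_dist:
                best_type, best_dist = form_type, dist
        return best_type, best_dist

# Hypothetical usage:
kb = KnowledgeBase()
kb.register('tax_form', [12, 3, 7])   # signature learned in training
print(kb.best_match([11, 3, 8]))      # -> ('tax_form', distance)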

As for understanding form structures and extracting the areas of interest, there are two types of approaches in form-reading techniques. The first type of form reader relies upon a library of form models that describe the structure of the forms being processed. A form identification stage matches the features (e.g., line crossings) extracted from an incoming form against those extracted from each modeled design, in order to select the appropriate model for subsequent system processes. The second type of form reader does not need an explicit model for every design that may be encountered, but instead relies upon a model that describes only the design-invariant features associated with a form class. For example, in financial document processing, instead of storing blank forms for every possible check or bill that may be encountered, the system may record rules that govern how the foreground components of a check may be placed relative to its background baselines. A generic financial document processing system based on staff lines and a form description language (FDL) has been described in the literature. In either type of form reader, the system extracts the signature, such as a preprinted logo or a character string, of each training form during the training phase, and stores the information in the long-term memory of the knowledge base. In the working phase, the signature of an input form is extracted and compared statistically and syntactically to the knowledge in the database. By defining a similarity between the signatures of two form images, we are able to identify the format of the input form. If the input form is of a known type, the pertinent items can be extracted directly from approximate positions according to the form registration information. Otherwise, the signature of this unknown form can be registered through human intervention.
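
Continuing the KnowledgeBase sketch above, a similarity threshold can decide whether the input form is of a known type or must be registered through human intervention; the threshold value here is an arbitrary illustrative assumption.

def match_form_type(signature, knowledge_base, threshold=10.0):
    # Returns the known form type, or None if the form is unknown.
    form_type, distance = knowledge_base.best_match(signature)
    if form_type is not None and distance <= threshold:
        return form_type  # known: extract items from registered positions
    return None           # unknown: register through human intervention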

Through form analysis and understanding, we are able to extract the form frames that describe the structure of a form document, and thus extract the user-filled data image from the original form image. Instead of keeping the original form image, only the extracted user-filled data images need to be stored for future reference. This helps to minimize storage and facilitates accessibility. Meanwhile, we can use the "signature" of form images to index and retrieve desired documents in a large database. Up to this point, the useful information in a form has been reduced from a document image that usually contains millions of gray-level or color pixels to only the sub-images that contain the items of interest.

Lab Exercise

Optical Character Recognition (OCR) in Python

Optical character recognition (OCR) is a technology that recognizes text in images, such as scanned documents and photos. Here's how to build a simple OCR engine in Python.

Perhaps you've taken a photo of a text just because you didn't want to take notes, or because taking a photo is faster than typing the text out. Fortunately, thanks to smartphones, we can apply OCR to the pictures of text we take, without having to retype anything.

In Python, OCR can be done in a few lines of code using Tesseract, an open-source optical character recognition engine available for various operating systems.

Python OCR Installation

Tesseract runs on Windows, macOS, and Linux platforms. It supports Unicode (UTF-8) and more than 100 languages. In this exercise, we will start with the Tesseract OCR installation process and then test the extraction of text from images.

The first step is to install Tesseract. In order to use it from Python, we also need to install the pytesseract wrapper. To install pytesseract, just go to the command line and type the following command:

pip install pytesseract

or

py -m pip install pytesseract

You also need to install the Tesseract engine itself (tesseract.exe on Windows). You can get it here: https://github.com/UB-Mannheim/tesseract/wiki

After installing, you need to point pytesseract to the location of the installed tesseract.exe, as shown in the code below.

Python OCR Implementation

After installation is completed, let’s move forward by applying Tesseract with Python. First, we
import the dependencies.

from PIL import Image
import pytesseract
import numpy as np

We will use a simple image to test the usage of Tesseract.

Fig: A sample image for Tesseract to convert into text.

Let’s load this image and convert it to text.

# Mention the installed location of Tesseract-OCR on your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

filename = 'C:/Users/HT/Desktop/Python2023/tes.jpg'

img1 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img1)

Now, let’s see the result.

print(text)

And this is the result:

Tesseract Example in Python
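
Two optional variations you may want to try, assuming the corresponding language data files are installed with Tesseract: image_to_string accepts a lang parameter, and image_to_data returns word-level boxes and confidences.

# Recognize French text (requires the 'fra' trained data to be installed).
text_fr = pytesseract.image_to_string(img1, lang='fra')

# Word-level results: text and confidence values for each detected word.
data = pytesseract.image_to_data(img1, output_type=pytesseract.Output.DICT)
for word, conf in zip(data['text'], data['conf']):
    if word.strip():
        print(word, conf)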
