Computer Vision CH4
Most people learn to read and write during their first few years of education. By the time they have
grown out of childhood, they have already acquired very good reading and writing skills, including
the ability to read most texts, whether they are printed in different fonts and styles, or handwritten
neatly or sloppily. Most people have no problem in reading the following: light prints or heavy
prints; upside down prints; advertisements in fancy font styles; characters with flowery ornaments
and missing parts; and even characters with funny decorations, stray marks, broken, or fragmented
parts; misspelled words; and artistic and figurative designs. At times, the characters and words
may appear rather distorted and yet, by experience and by context, most people can still figure
them out. In contrast, despite more than five decades of intensive research, the reading skill
of the computer is still far behind that of human beings. Most OCR systems still cannot read
degraded documents and handwritten characters/words.
To understand the phenomena described in the above section, we have to look at the history of
OCR, its development, recognition methods, computer technologies, and the differences between
humans and machines.
It is always fascinating to be able to find ways of enabling a computer to mimic human functions,
like the ability to read, to write, to see things, and so on. OCR research and development can be
traced back to the early 1950s, when scientists tried to capture the images of characters and texts,
first by mechanical and optical means such as rotating disks and a photomultiplier, and a flying-spot
scanner with a cathode-ray tube, followed by photocells and arrays of them. At first, the scanning
operation was slow, and only one line of characters could be digitized at a time by moving the scanner
or the paper medium. Subsequently, drum and flatbed scanners were invented, which
extended scanning to the full page. Then, advances in digital integrated circuits brought
photoarrays with higher density, faster transports for documents, and higher speed in scanning and
digital conversions. These important improvements greatly accelerated the speed of character
recognition and reduced the cost, and opened up the possibilities of processing a great variety of
forms and documents. Throughout the 1960s and 1970s, new OCR applications sprang up in retail
businesses, banks, hospitals, post offices; insurance, railroad, and aircraft companies; newspaper
publishers, and many other industries.
At that time, standard shapes for handprinted characters were advocated for data entry. Characters
written in such specified shapes did not vary too much in styles, and they could be recognized
more easily by OCR machines, especially when the data were entered by controlled groups of
people, for example, employees of the same company were asked to write their data like the
advocated models. Sometimes writers were asked to follow certain additional instructions to
enhance the quality of their samples, for example, write big, close the loops, use simple shapes, do
not link characters, and so on. With such constraints, OCR recognition of handprints was able to
flourish for a number of years.
As the years of intensive research and development went by, and with the birth of several new
conferences and workshops such as IWFHR (International Workshop on Frontiers in Handwriting
Recognition), ICDAR (International Conference on Document Analysis and Recognition), and
others, recognition techniques advanced rapidly. Moreover, computers became much more
powerful than before.
People could write the way they normally did, and characters no longer had to be written like
specified models, so the subject of unconstrained handwriting recognition gained considerable
momentum and grew quickly. Since then, many new algorithms and techniques in preprocessing,
feature extraction, and powerful classification methods have been developed.
To extract symbolic information from millions of pixels in document images, each component in
the character recognition system is designed to reduce the amount of data. As the first important
step, image and data preprocessing serve the purpose of extracting regions of interest, enhancing
and cleaning up the images, so that they can be directly and efficiently processed by the feature
extraction component. This section covers the most widely used preprocessing techniques and is
organized in a global-to-local fashion.
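As a very rough illustration of these preprocessing steps (loading, cleanup, binarization, and region-of-interest extraction), the sketch below uses PIL and NumPy; the file name, the median filter, and the simple global threshold are placeholder choices for exposition, not the specific techniques covered in this section.

import numpy as np
from PIL import Image, ImageFilter

# Load the scanned page as a gray-level image; 'form.png' is a placeholder file name.
img = Image.open('form.png').convert('L')
# Clean up salt-and-pepper noise with a small median filter.
img = img.filter(ImageFilter.MedianFilter(size=3))
gray = np.array(img)

# Binarize with a crude global threshold (a real system would use a better method).
binary = (gray < gray.mean()).astype(np.uint8)   # 1 = foreground ink, 0 = background

# Extract the region of interest: the bounding box of all foreground pixels.
rows = np.where(binary.any(axis=1))[0]
cols = np.where(binary.any(axis=0))[0]
roi = binary[rows.min():rows.max() + 1, cols.min():cols.max() + 1]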
When you file the annual income tax form, write a check to pay the utility bill, or fill in a credit
card application form, have you ever wondered how these documents are processed and stored?
Millions of such documents need to be processed as an essential operation in many business and
government organizations including telecommunications, health care, finance, insurance, and
government and public utilities. Driven by the great need to reduce both the huge amount of
human labor involved in reading and entering the data into electronic databases and the associated
human error, many intelligent automated form-processing systems have been designed and put into
practice. This section will introduce the general concept of an intelligent form-processing system
that can serve as the kernel of all form-processing applications.
The input of a typical form-processing system consists of color or gray-level images in which
several hundred pixels correspond to each linear inch of the paper medium. The output, in
contrast, consists of ASCII strings that usually can be represented by just a few hundred bytes. To
achieve such a high level of data abstraction, a form-processing system depends on the following
main components for information extraction and processing:
Image acquisition: acquire the image of a form in color, gray level, or binary format.
Binarization: convert the acquired form images to binary format, in which the foreground
contains the logo, the form frame lines, the preprinted entities, and the filled-in data.
Form identification: for an input form image, identify the most suitable form model from
a database.
Layout analysis: understand the structure of forms and the semantic meaning of the
information in tables or forms.
Data extraction: extract pertinent data from respective fields and preprocess the data to
remove noise and enhance the data.
Character recognition: convert the gray or binary images that contain textual information
to electronic representation of characters that facilitate post-processing including data
validation and syntax analysis.
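To make the flow between these components concrete, here is a schematic Python sketch of such a cascade. Every function is an illustrative stub, with a toy "signature" and a placeholder recognizer assumed for exposition only; it is not the actual method of a production form reader.

import numpy as np

def binarize(gray):
    # Stand-in for a real binarization method: a fixed global threshold.
    return (gray < 128).astype(np.uint8)

def identify_form(binary, models):
    # Pick the registered model whose stored signature is closest to the input's.
    sig = binary.mean()  # toy "signature": overall ink density
    return min(models, key=lambda m: abs(m["signature"] - sig))

def extract_fields(binary, model):
    # Crop each field region registered with the selected form model.
    return {name: binary[r0:r1, c0:c1]
            for name, (r0, r1, c0, c1) in model["fields"].items()}

def recognize(field_image):
    # Placeholder for a character recognizer (e.g., an OCR engine).
    return "<text>"

def process_form(gray, models):
    # Cascade: binarization -> form identification -> data extraction -> recognition.
    binary = binarize(gray)
    model = identify_form(binary, models)
    fields = extract_fields(binary, model)
    return {name: recognize(img) for name, img in fields.items()}

A real system would replace the toy signature and the stub recognizer with the statistical and structural features and the classifiers discussed in the rest of this chapter.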
Each of the above steps reduces the amount of information to be processed by a later step.
Conventional approaches pass information through these components in a cascade manner and
seek the best solution for each step. The rigid system architecture and fixed direction of
information flow limit the performance of each component to a local maximum. Observations from data
in real-life applications and the way humans read distorted data lead us to construct a knowledge-
based, intelligent form-processing system. Instead of simply concatenating general-purpose
modules with little consideration of the characteristics of the input document, the performance of
each module is improved by utilizing as much knowledge as possible, and the global performance
is optimized by interacting with the knowledge database at run-time.
An overview of an intelligent form-processing system with some typical inputs and outputs is
illustrated in Figure 4.1. As the kernel of this intelligent form-processing system, the knowledge
database is composed of short-term and long-term memories along with a set of generic and
specific tools for document processing. The short-term memory stores the knowledge gathered in
run-time, such as the statistics of a batch of input documents. The long-term memory stores the
knowledge gathered in the training phase, such as the references for recognizing characters and
logos, and identifying form types. Different types of form images can be characterized by their
specific layout. Instead of keeping the original images for model forms, only the extracted
signatures are needed in building the database for known form types. Here, the signature of a form
image is defined as a set of statistical and structural features extracted from the input form image.
The generic tools include various binarization methods, binary smoothing methods, normalization
methods, and so on. The specific tools include cleaning, recognition, or verification methods for
certain types of input images. When an input document passes through the major components of the
system, the knowledge in the database helps the components to select the proper parameters or
references for processing. At the same time, each component gathers the respective information and
sends it back to the knowledge base. Therefore, the knowledge base adapts itself dynamically at
run-time: for each unknown document image, features are extracted and compared with
signatures of the prototypes exposed to the system in the learning phase.
Figure 4.1: Overview of an intelligent form-reading system.
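The signature of a form image was defined above as a set of statistical and structural features. As a simplified illustration of what such a signature might look like (the features used in practice are richer and include structural elements such as line crossings), the following sketch computes a purely statistical signature from ink-projection profiles:

import numpy as np

def form_signature(binary, bins=32):
    # Toy signature: coarse horizontal and vertical ink-projection profiles.
    rows = binary.sum(axis=1).astype(float)
    cols = binary.sum(axis=0).astype(float)
    # Resample each profile to a fixed number of bins so forms of different sizes are comparable.
    h = np.interp(np.linspace(0, len(rows) - 1, bins), np.arange(len(rows)), rows)
    v = np.interp(np.linspace(0, len(cols) - 1, bins), np.arange(len(cols)), cols)
    sig = np.concatenate([h, v])
    return sig / (sig.sum() + 1e-9)   # normalize so the signature does not depend on the amount of ink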
For understanding the form structures and extracting the areas of interest, there are two types
of approaches in form-reading techniques: The first type of form reader relies upon a library of
form models that describe the structure of forms being processed. A form identification stage
matches the extracted features (e.g., line crossings) from an incoming form against those
extracted from each modeled design in order to select the appropriate model for subsequent
system processes. The second type of form reader does not need an explicit model for every
design that may be encountered, but instead relies upon a model that describes only design-
invariant features associated with a form class. For example, in financial document processing,
instead of storing blank forms of every possible check or bill that may be encountered, the
system may record rules that govern how the foreground components in a check may be placed
relative to its background baselines. Such a generic financial document-processing system, based
on staff lines and a form description language (FDL), has been described in the literature. In either type of form reader,
the system extracts the signature, such as a preprinted logo or a character string, of each training
form during the training phase, and stores the information in the long-term memory of the
knowledge base. In the working phase, the signature of an input form is extracted and compared
statistically and syntactically to the knowledge in the database. By defining a similarity between
signatures from the two form images, we will be able to identify the format of the input form. If
the input form is of known type, according to the form registration information, the pertinent
items can be extracted directly from approximate positions. Otherwise, the signature of this
unknown form can be registered through human intervention.
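A minimal sketch of this matching step is given below; the cosine similarity and the 0.9 rejection threshold are illustrative choices, and a rejected (unknown) form would be passed on for human registration as described above.

import numpy as np

def identify_form_type(signature, prototypes, threshold=0.9):
    # Compare the input signature with every registered prototype signature.
    best_name, best_score = None, -1.0
    for name, proto in prototypes.items():
        score = np.dot(signature, proto) / (
            np.linalg.norm(signature) * np.linalg.norm(proto) + 1e-9)
        if score > best_score:
            best_name, best_score = name, score
    # Below the threshold the form is treated as unknown (to be registered by a human).
    return best_name if best_score >= threshold else None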
Through form analysis and understanding, we are able to extract the form frames that describe
the structure of a form document, and thus extract the user filled data image from the original
form image. Instead of keeping the original form image, only extracted user filled data images
need to be stored for future reference. This helps to minimize storage and facilitates
accessibility. Meanwhile, we can use the “signature” of form images to index and retrieve
desired documents in a large database. Up to this moment, the useful
information in a form has been reduced from a document image that usually contains millions of gray-
level or color pixels to only the sub-images that contain the items of interest.
Lab Exercise
Optical character recognition (OCR) is a tool that can recognize text in images. Here’s how to
build an OCR engine in Python.
Optical character recognition (OCR) is a technology that recognizes text in images, such as
scanned documents and photos. Perhaps you’ve taken a photo of a text just because you didn’t
want to take notes or because taking a photo is faster than typing. Fortunately, with
smartphones today, we can apply OCR to a picture of text we took earlier and copy the text
without having to retype it.
Python OCR refers to recognizing and extracting text from images, such as scanned documents
and photos, using Python. This can be done with the open-source OCR engine Tesseract in just a
few lines of code. Tesseract is one of the most commonly used OCR tools; it is an optical
character recognition engine available for various operating systems.
Tesseract runs on Windows, macOS and Linux platforms. It supports Unicode (UTF-8) and more
than 100 languages. In this exercise, we will start with the Tesseract OCR installation process and
then test the extraction of text from images.
The first step is to install Tesseract. In order to use the Tesseract library from Python, the
pytesseract wrapper needs to be installed on the system from the command line.
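A typical installation uses pip (or, alternatively, conda, if that is the package manager on your machine):

pip install pytesseract

or

conda install -c conda-forge pytesseract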
On Windows, you also need to install the Tesseract engine itself (tesseract.exe). You can get it
here: https://github.com/UB-Mannheim/tesseract/wiki
Python OCR Implementation
After the installation is complete, let’s move forward by applying Tesseract with Python. First, we
import the dependencies.
import pytesseract
import numpy as np
from PIL import Image  # needed to open the image file

filename = 'C:/Users/HT/Desktop/Python2023/tes.jpg'  # path to the input image
img1 = np.array(Image.open(filename))  # load the image as a NumPy array
text = pytesseract.image_to_string(img1)  # run Tesseract OCR on the image
print(text)  # print the recognized text
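If pytesseract cannot locate the Tesseract executable on Windows, its path can be set explicitly before calling image_to_string. The path below is only an example of a common install location and may differ on your machine:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'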