
Multi-Script-Oriented Text Detection and Recognition in Video/Scene/Born Digital Images

K. S. Raghunandan, Palaiahnakote Shivakumara, Member, IEEE, Sangheeta Roy, G. Hemantha Kumar, Umapada Pal, Senior Member, IEEE, and Tong Lu, Member, IEEE

Manuscript received October 19, 2017; revised January 24, 2018; accepted March 18, 2018. Date of publication March 21, 2018; date of current version April 3, 2019. This work was supported in part by the Natural Science Foundation of China under Grant 61672273, Grant 61272218, and Grant 61321491 and in part by the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant BK20160021. This paper was recommended by Associate Editor M. Wang. (Corresponding author: Tong Lu.)
K. S. Raghunandan and G. H. Kumar are with the Department of Studies in Computer Science, University of Mysore, Karnataka 57005, India (e-mail: raghu0770@gmail.com; ghk.2007@yahoo.com).
P. Shivakumara and S. Roy are with the Faculty of Computer System and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia (e-mail: shiva@um.edu.my; 2sangheetaroy@gmail.com).
U. Pal is with the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: umapada@isical.ac.in).
T. Lu is with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: lutong@nju.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2018.2817642

Abstract— Achieving good text detection and recognition results for multi-script-oriented images is a challenging task. First, we explore bit plane slicing in order to utilize the advantage of the most significant bit information to identify text components. A new iterative nearest neighbor symmetry is then proposed based on the shapes of convex and concave deficiencies of text components in bit planes to identify candidate planes. Further, we introduce a new concept called mutual nearest neighbor pair components based on gradient direction to identify representative pairs of texts in each candidate bit plane. The representative pairs are used to restore words with the help of the edge image of the input one, which results in text detection results (words). Second, we propose a new idea of fixing windows for character components of arbitrarily oriented words based on the angular relationship between sub-bands and a fused band. For each window, we extract features in the contourlet wavelet domain to detect characters with the help of an SVM classifier. Further, we propose to explore HMM for recognizing characters and words of any orientation using the same feature vector. The proposed method is evaluated on standard databases such as ICDAR and YVT video, ICDAR, SVT and MSRA scene data, ICDAR born digital data, and multi-lingual data to show its superiority to the state of the art methods.

Index Terms— Bit plane slicing, convex and concave deficiencies, wavelet sub-bands, arbitrarily-oriented text detection and recognition, hidden Markov model, multi-lingual text detection and recognition.

I. INTRODUCTION

As the explosive proliferation of multimedia content on broadcast and Internet continues, the need for its ubiquitous access at anytime and anywhere over a variety of devices also increases [1]–[3]. Therefore, one can expect huge databases, which consist of diversified data such as videos captured by low resolution mobile cameras, natural scene images captured by high resolution cameras, and uploaded images on webpages. Besides, these databases include texts of different types, namely, caption texts which are manually edited, and scene texts where multiple scripts, orientations, fonts and sizes exist naturally. Such multi-type texts make the text detection and recognition problem much more complex and challenging. To understand such images, there is now increasing attention from researchers in the fields of computer vision and video processing [4], [5]. Among all the information contained in images, texts carry semantic information and could provide useful cues about image content, and hence it is important for humans and computers to understand images. It is evident from the statement in [1] that, given an image containing texts and other objects, viewers often tend to focus on texts. This shows that text detection and recognition is important for humans to understand complex images. Furthermore, text detection and recognition is indispensable for a lot of real time applications such as automatic sign reading, language translation, navigation and surveillance applications [4], [5].

There are methods developed in the literature [4]–[9] to address the issue of text detection and recognition in video, natural scene and born digital images of different orientations, scripts, font sizes, etc. According to the literature review, most available methods focus on a particular data type and address a specific issue like complex background, low contrast, multiple scripts or multiple orientations. As a result, the performances of such methods are poor for data affected by multiple adverse factors. The main cause of the above challenges is as follows: 1) frames captured by low resolution video cameras often suffer from low contrast and low-resolution issues, 2) natural scene images captured by high resolution cameras provide high contrast but suffer from complex background, which leads to more false positives, and 3) born digital images from websites suffer from multiple fonts, sizes, colors, appearance variations, background complexity, etc., which affect scaling, text alignment and the geometrical shapes of character components. Therefore, text detection and recognition in different types of images is considered as an open issue.

It is evident from the illustration presented in Fig. 1 that the existing method [8], which is the state-of-the-art method that explores fractals for text detection in mobile video scene images, produces false positives for video and natural scene images, and it also does not detect text properly from born digital images, as shown in Fig. 1(a). This is because the primary goal of this approach is to detect texts in mobile
Fig. 1. Text detection and recognition results of the existing and the proposed approaches for video, natural scene and born digital images. (a) Existing text detection and recognition results by Tesseract OCR. (b) Proposed text detection and recognition for the video, natural and born digital images.

video images but not texts in multiple types of images. Therefore, it gives inconsistent results. When a text detection method does not detect texts properly, it often directly affects binarization and recognition, as shown in Fig. 1(a), where one can see that the binarization output [9] contains non-text components and background noises for the video and scene images. For example, though the binarization method in [9], which explores a Bayesian classifier for recognizing texts in video and natural scene images, preserves character shapes, the Optical Character Recognizer (OCR) [10], which is available publicly, fails to recognize the texts correctly due to the noises introduced by binarization and the non-text components given by the text detection method. However, for the texts in the born digital image, OCR gives correct results because the text detection and binarization methods work well.

On the other hand, the proposed method detects texts in all three types of images properly and correctly recognizes the texts without binarization, as shown in Fig. 1(b). Therefore, we can conclude that the existing methods are not adequate to handle the challenges posed by multi-type images. Hence, we propose a new method to fill this gap in this work.

To address the above mentioned challenges, we propose to explore the Most Significant Bit (MSB), which carries vital information, using bit slicing for input images. This is because bit information is unlikely to be lost regardless of adverse situations. However, one can expect location misplacements of bits. This may lead to changes of foreground or background colors in an image. Therefore, we propose to use the Canny edge image of planes, as the Canny edge detector gives fine edge details for low contrast irrespective of foreground-background color changes. For each text component in the Canny edge image of the planes, the proposed method introduces Iterative Nearest Neighbor Symmetry (INNS) based on the shapes given by convex/concave deficiencies to detect a candidate plane out of the 8 planes. Then INNS extracts shape and self-symmetry based features, which are invariant to font, font size, orientation and script. For each component in the candidate plane, we further propose the Mutual Nearest Neighbor Pair (MNNP), which uses the gradient outward direction to find the nearest neighbor component in text lines. Since MNNP considers the uniform spacing between characters and words, it finds component pairs irrespective of orientation, script, font size and font. In addition, the same criterion is used for eliminating false positives produced by background complexity. Further, since MNNP uses the Canny edge image of the input image, the lost character components can be restored easily during component pairing. Therefore, the rationale behind the proposed method is to explore shape, symmetry, structure and direction of text components, which has the ability to tackle the above mentioned challenges posed by multi-type images.

Similarly, the main issue with recognition is defining the window size for arbitrarily-oriented characters in multi-type images. To overcome this issue, we propose automatic window size detection based on the fact that the directions of most pixels contribute towards the character height, which helps us to fix correct windows according to the sizes and orientations of characters. Further, the integration of the strengths of different types of features, namely, statistical features which extract geometrical properties, texture features which extract appearance properties, run-length smearing which extracts intra and inter symmetry of character components, and the contourlet wavelet domain which is invariant to scaling, multi-fonts or multi-sizes, helps us to achieve better results for text in multi-type images. Overall, the text detection and recognition steps are proposed based on robust and invariant features and hence the proposed method is generic. The contribution and novelty lies in exploring the above basic concepts for addressing the open challenges of text detection and recognition without rigid constraints to achieve better results.

II. RELATED WORK

Text detection and recognition can be divided broadly into three categories: video, natural scene and born digital images. Therefore, this section reviews the past works on text detection in video, natural scene and born digital images, and text recognition in the respective categories.

The methods for text detection in video images can be further classified into connected component based methods, texture based methods, and edge-gradient based methods. Connected component based methods [4], [5] expect character shapes to be preserved. These methods generally focus on images of high contrast texts with plain background. However, the constraint is not necessarily true for images of different types and scripts, where one can expect images of large variations in contrast and background complexity, which may
cause the loss of information. Therefore, the methods may not work well for different types and scripts. To overcome the problems of connected component based methods, texture based methods are proposed in the literature [7], [8], [11], [12], which require high contrast images. These methods consider the appearance of character components as a special kind of texture property. To extract such a texture property and separate text components from complex background, the methods propose features with a large number of training samples. However, the extracted features are sensitive to low contrast or low resolution. Apart from that, since the methods are trained with pre-defined labels, they may not perform well for the images of different scripts considered in this work. In addition, the methods are too expensive for real time applications. To ease the number of computations of texture based methods, edge-gradient based methods are developed, which generally focus on high gradient that represents text and on edge pixels, which give vital clues for the presence of texts [6], [13], [14]. However, these features are sensitive to complex background, where edges in the background may overlap with text edges. As a result, the methods produce more false positives and hence the performance hampers, especially for multi-type images.

In summary, the methods in [15]–[20] focus on high contrast images for text detection, therefore their performances degrade for low contrast images. The methods in [20]–[22] focus on low contrast images for text detection, thus they report inconsistent results for high contrast images. In the same way, the methods that focus on plain background images such as born digital images, which may suffer from very low resolution, multi-fonts, multi-sizes and multi-colors compared to video and natural scene images, report poor results for complex background images. Recently, there are methods which explore convolutional neural networks and deep learning to overcome the problem of text detection in scene and video images [23]–[28]. For instance, Liu and Jin [29] proposed a deep matching prior network for multi-oriented text detection. The method focusses on fixing tight bounding boxes for multi-oriented texts to prevent background noises such that text detection performance improves significantly. Tian et al. [30] proposed scene text detection based on weak supervision. The method focusses on weakly annotated data to reduce network dependency on a large number of pre-defined labeled data. Although the methods solve complex issues consisting of multi-fonts, sizes, orientations, scripts and low contrast, they still suffer from good character candidate detection. In addition, it is hard to optimize parameters based on pre-labeled samples. This is because the considered problem in this work involves large variations in terms of contrast, background, and foreground complexity. As a result, it is difficult to find a large number of pre-defined samples to train a classifier and represent such variations, especially samples for representing non-text components. Similarly, the methods in [8], [31], and [32] are proposed for detecting multi-oriented and multi-script texts in video images without depending much on learning. However, these methods report inconsistent results for multi-type images. In summary, it is noted from the literature review on text detection in video, natural scene and born digital images that none of the methods tackles the issues of multi-type images, where robust and invariant features with a generic property are required.

When we look at the literature on text recognition in video, natural scene and born digital images, it is found that there are methods which use binarization for recognition or their own classifiers for recognition. The methods in [9] and [33]–[35], which recognize texts through binarization, require complete shapes of characters to achieve better recognition rates. Moreover, most of the methods propose thresholding based criteria for binarization. For the images of different contrasts and background complexities considered in this work, the binarization process may not preserve characters, rather it loses shapes. Therefore, the methods may not perform well for the considered images. To reduce the complexity of the problem, the methods in [36]–[38] are proposed for recognition without a binarization process. These methods generally extract a large number of features using well known descriptors, namely, SIFT, HOG or the combinations of several descriptors, and then explore classifiers or lexicons for better recognition. As a result, the performances of the methods depend much on datasets and samples. In addition, the features extracted based on descriptors work well for high contrast images. Therefore, the methods may not perform well for the multi-type images. Recently, to improve the recognition rates of video and natural scene images, there are methods that explore convolutional networks and deep learning [39]–[43]. For instance, Shi et al. [44] proposed an end-to-end trainable neural network for image based sequence recognition and its application to scene text recognition. The method explores a convolutional recurrent neural network for text recognition. Jain et al. [45] proposed unconstrained scene text and video text recognition for Arabic scripts. The method focusses on a specific Arabic script for achieving results. It is noted from the above discussions that the primary goal of the methods is to recognize a specific script of different texts but not multi-script recognition. In addition, for arbitrarily-oriented texts, the methods fail to fix the window size for characters to extract features, which leads to poor performance. As variations in a dataset increase, the difficulty in determining optimal parameters for the deep learning setup also increases.

Overall, from the review on text detection and recognition in video, natural scene and born digital images, it is observed that the methods are successful for a specific type of data, on which these methods are developed. It is worth mentioning that none of the methods considers more than two types of data for text detection and recognition. Besides, text detection and recognition of multi-scripts, especially for Indian scripts, is still at the infancy stage. Therefore, in this work, we propose a novel method for text detection and recognition in video, natural scene and born digital images irrespective of orientation and script.

III. PROPOSED METHOD

Inspired by the enhancement concept presented in [46] and [47], where it is mentioned that the Most Significant Bit (MSB) carries significant information and the Least Significant Bit (LSB) carries less significant information,
we propose to explore MSB for identifying text components
in gray images using Canny edge detector. Inspired by the
clue used in [19] for text detection in natural scene images
that character components in the same text exhibit high
self-similarity, we propose a new Iterative Nearest Neighbor
Symmetry (INNS) based on the shapes of convex and concave
deficiencies to extract similarities at component level to iden-
tify a candidate plane which contains significant information
out of eight planes.
By considering the complexity of the problem, INNS alone
is insufficient to identify components that represent characters.
Therefore, motivated by the work [12] on text detection in
video where inward and outward gradient directions are used
for text detection, we introduce a new idea for identifying
Mutual Nearest Neighbor Pair (MNNP) components, which
eliminates almost all non-text components. This results in
representatives of texts. The same outward direction is used for restoring missing components with the help of the edge image.

Unlike the existing methods on text recognition [36]–[39], which use a fixed sliding window in the horizontal direction for feature extraction, we propose a new iterative algorithm for determining the actual sizes of character components. This step works based on the fact that the directions of most pixels contribute towards the height of the character. As a result, the proposed recognition is invariant to multi-font and multi-sized texts.

We noticed from the literature review on recognition that statistical features extract the shapes of characters, texture features extract the special texture property of character appearances, and run-length features extract the intra and inter symmetry properties of character components. Therefore, in this work, we propose to integrate statistical-texture and intrinsic features in the contourlet wavelet domain for recognition. It is noted [48] that the contourlet wavelet transform is invariant to rotation and extracts local information. Lastly, inspired by the special property of the Hidden Markov Model (HMM) and its capability of extracting context features with the help of the spatial information of characters [36], [49], we explore HMM for the recognition in this work.

A. Text Detection Approach

Our text detection method consists of three sub-sections. At first, Section III.A.1 introduces bit plane slicing for identifying text components. Next, Section III.A.2 proposes Iterative Nearest Neighbor Symmetry (INNS) for candidate plane detection, and then Section III.A.3 introduces Mutual Nearest Neighbor Pair (MNNP) components for identifying the representatives of text.

1) Text Components Detection Through Bit Plane Slicing: It is true that each pixel in a gray image can be represented by an 8-bit binary vector, say, (b7, b6, b5, b4, b3, b2, b1, b0), where each bm, with m from 0 to 7, is either "0" or "1". In this case, an image may be considered as an overlay of eight bit-planes. Each bit-plane is a two-tone image and can be obtained by the operation defined in equation (1):

∂τ(i, j) = mod( floor( I(i, j) / 2^τ ), 2 )    (1)

where I(i, j) is the gray image and ∂τ(i, j) is the bit-plane information for the τth bit.

Fig. 2. Examples of eight bit planes for the input image.

Fig. 3. Canny edge images for the bit planes of MSB and bit 6 in Fig. 2.

In this way, the proposed method obtains bit planes for each input image by extracting the bits from the bytes of the gray image as shown in Fig. 2, which shows the results of bit planes corresponding to the sequence of bits from the higher order (MSB) to the lower order (LSB). It is observed from Fig. 2 that the planes representing the MSB and bit 6 provide fine details of text components compared to the other planes. We can also notice from the same two bit planes that the background and foreground colors of text components change from dark to white and vice versa. This is due to the change in positions of bits in the byte representation and the unpredictable nature of the input image. Therefore, we propose to employ the Canny edge operator on the bit plane images to obtain edge images. Since it considers abrupt changes for edge detection, the color changes do not affect the result, as shown in Fig. 3, where there is no difference between the planes represented by bit 7 and bit 6. This is the advantage of obtaining Canny edge images for the bit plane images. Therefore, the output of the Canny edge image of the bit planes is considered as text components.

It is true that the MSB should carry significant information of images as shown in Fig. 3. However, it is not true for all cases due to the different characteristics of input images. As a result, any bit out of the eight can carry the significant information plane. One such example is shown in Fig. 4, where it can be seen that bit 5 and bit 4 give fine details of the input image rather than bit 7 and bit 6. Therefore, we propose a new idea for identifying the candidate plane in the subsequent section.
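As a quick illustration of this step, the following Python sketch (an assumption on our part, not code from the paper; it relies on OpenCV and NumPy, and the Canny thresholds are arbitrary) extracts the eight bit planes according to equation (1) and applies the Canny operator to each plane.

import cv2
import numpy as np

def bit_planes_with_edges(gray: np.ndarray):
    """Return the 8 bit planes of an 8-bit gray image and their Canny edge maps."""
    planes, edges = [], []
    for tau in range(8):                      # bit 0 (LSB) ... bit 7 (MSB)
        plane = (gray >> tau) & 1             # Eq. (1): mod(floor(I / 2^tau), 2)
        planes.append(plane)
        # Canny responds only to intensity transitions, so a foreground/background
        # swap inside a plane does not change its edge map (cf. Fig. 3).
        edges.append(cv2.Canny(plane.astype(np.uint8) * 255, 100, 200))
    return planes, edges

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
planes, edge_maps = bit_planes_with_edges(gray)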

When we extract the bits from the bytes of the pixels of the input gray image to obtain the respective bit plane slices (images), the slices provide significant information in binary form. Therefore, this makes implementation easy. In addition, when the image is affected by adverse factors as mentioned in the Introduction, most likely the position of a bit may change, but with much less effect on the value of the bit in contrast to the gray image, because any bit can only be either 0 or 1. Therefore, it may result in an inverted image which swaps background and foreground colors for different input images. Since the proposed method performs the Canny edge operation on the bit slice images as shown in Fig. 2-Fig. 3, this problem has no effect. Similarly, if the position of a bit changes due to adverse factors, any plane out of the 8 can hold the significant information. To detect such a plane, we propose Iterative Nearest Neighbor Symmetry, which results in a candidate plane. Therefore, we prefer to use bit plane slicing for text candidate detection, as it is insensitive to some extent to the above adverse factors compared to the state of the art methods [50].

Fig. 4. Illustration to show that the most significant bit does not carry significant information of the input image.

Fig. 5. Examples of convex and concave deficiencies for the characters "H", "L" and "P". (a) Concave and convex boundaries for the "H", "L" and "P" characters. (b) Concave and convex deficiencies for the images in (a).

2) Iterative Nearest Neighbor Symmetry for Candidate Plane Detection: For each component of each bit plane, the proposed method finds deficiencies by drawing convex and concave boundaries as shown in Fig. 5, where the boundaries are drawn for the characters "H", "L" and "P" in Fig. 5(a), respectively. Convex boundaries are drawn by fixing a closed bounding box for text components, and the deficiencies are obtained by subtracting the actual text components from the boundary of each component. Concave boundaries are drawn by connecting the mid-points of the sides of the bounding box fixed by the convex boundaries. The deficiencies of the convex and concave boundaries are shown in Fig. 5(b) for the characters "H", "L" and "P", respectively. When we look at the shapes of the deficiencies in Fig. 5(b), one can find that the deficiency shapes share a high degree of similarity due to the fact that character components exhibit self-similarity, as stated in [19]. It is true that convex deficiencies are quite common for studying the shapes of text components but concave deficiencies are not. Since the problem is complex due to multi-type images, we propose to also find concave deficiencies as shown in Fig. 5(b). It is worth noting that this new combination helps us to extract the structures of character components regardless of the above challenges.

To study the shapes of the deficiencies, we propose to estimate the ratio of the minor axis to the major axis as a feature as shown in Fig. 5(b), where the major and minor axes are drawn for the convex and concave deficiencies. The reason to use the ratio as a feature is that it is invariant to rotation and scaling, and can withstand the causes created by multi-type images, which is the objective of this work. With the ratio features, to extract the high degree of similarity between deficiency shapes as in Fig. 5(b), we propose a new concept called Iterative Nearest Neighbor Symmetry (INNS) to classify ratio features that are close to each other into one group based on the Max-Min clustering concept as defined in equation (2). If this process results in two equally sized clusters for four ratio features, the component is considered as a text one, else it is considered as a non-text one.

(1/k) Σ_{j=1..k} min_i (P_j − C_i)²,  i = 1, 2    (2)

where k denotes the number of ratio features, i denotes the max-min cluster number, and P denotes the ratio features, which are compared with the maximum and minimum centroids C to select the cluster closest to each feature.

This hypothesis holds when a component produces four deficiencies, in which case it is classified as a text one, say case-1. Suppose the component produces only one deficiency, then we discard the component, say case-2. If a component produces two deficiencies, then the proposed method compares the two ratio features to find the proximity between the two feature values. If the difference of these two feature values satisfies a certain threshold as defined in Algorithm-1, we consider it as a text component, else it will be discarded, say case-3. If the component produces three deficiencies, then we apply the above Max-Min clustering to obtain two clusters, namely, Max and Min. The cluster that contains two feature values is tested with the condition of case-3 to classify the component as a text one or a non-text one. Similarly, if the component produces more than four deficiencies, say n, we apply the above procedure iteratively until it satisfies the symmetry property of any of the above mentioned four cases. The iterative algorithm terminates if the component satisfies the symmetry property, else it continues until no further clusters can be formed.
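A hedged sketch of how the deficiency ratio features could be computed for a single component is given below. The interpretation of the convex boundary as the bounding box and of the concave boundary as the diamond through the side mid-points follows the description above, while the use of cv2.minAreaRect to obtain major and minor axes, and all function names, are our own assumptions.

import cv2
import numpy as np

def ratio_features(component: np.ndarray):
    """component: binary mask (uint8, 0/1) cropped to its bounding box."""
    h, w = component.shape
    box = np.ones((h, w), np.uint8)                         # convex boundary region
    diamond = np.zeros((h, w), np.uint8)                    # concave boundary region
    mids = np.array([[w // 2, 0], [w - 1, h // 2],
                     [w // 2, h - 1], [0, h // 2]], dtype=np.int32)
    cv2.fillConvexPoly(diamond, mids, 1)
    feats = []
    for region in (box, diamond):
        deficiency = cv2.bitwise_and(region, 1 - component)  # boundary minus strokes
        n, labels = cv2.connectedComponents(deficiency)
        for lab in range(1, n):
            ys, xs = np.where(labels == lab)
            if len(xs) < 5:                                  # ignore tiny residues
                continue
            pts = np.stack([xs, ys], axis=1).astype(np.float32)
            (_, _), (d1, d2), _ = cv2.minAreaRect(pts)       # rotated-box extents
            major, minor = max(d1, d2), min(d1, d2)
            if major > 0:
                feats.append(minor / major)                  # rotation/scale invariant
    return feats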

Algorithm 1 Iterative Nearest Neighbor Symmetry for Candidate Plane Detection

Input: Binary image (Ib)
1. For every component in the binary image Ib, do the following:
   A. Find the number of convex and concave deficiencies (Dcc).
   B. If the number of deficiencies (Dcc) is 1, then label the component as false.
   C. If the number of deficiencies (Dcc) is 2, do the following:
      a. Find the ratio (RF1) of the minor axis and major axis of the concave deficiency.
      b. Find the ratio (RF2) of the minor axis and major axis of the convex deficiency.
      c. If the ratio difference of convex and concave is less than ε, then label the component as true, else as false.
   D. If the number of deficiencies is 3:
      a. Find the maximum (CRmax) and minimum (CRmin) clusters.
      b. If CRmax or CRmin is 2, Dcc is set to 2 and go to (C).
   E. If the number of deficiencies is 4, then label the component as true, else consider it as false.
   F. If the number of deficiencies is greater than 4, do the following:
      a. While the component is considered as true or both CRmax and CRmin are equal to zero, do the following:
         i. DCC1 is set to CRmax and DCC2 is set to CRmin.
         ii. Use steps (B) ∼ (E) to test DCC1 and DCC2, respectively.
   G. If CRmax and CRmin are not equal to zero, then label the component as true, else as false.
Output: Label of the binary component of the image.
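The decision cases of Algorithm 1 can be summarised in a few lines of Python, shown below as a sketch rather than the authors' implementation. The Max-Min clustering of equation (2) is approximated by assigning each ratio feature to the nearer of the two extreme values, and the threshold EPSILON as well as the handling of ties are assumptions, since the paper does not report numeric values.

EPSILON = 0.2   # assumed proximity threshold; the paper's epsilon is not specified here

def max_min_clusters(feats):
    c_min, c_max = min(feats), max(feats)
    lo = [f for f in feats if abs(f - c_min) <= abs(f - c_max)]
    hi = [f for f in feats if abs(f - c_min) > abs(f - c_max)]
    return lo, hi

def inns_is_text(feats):
    """Return True if the component is kept as a text candidate."""
    d = len(feats)
    if d <= 1:
        return False                                   # case-2: discard
    if d == 2:
        return abs(feats[0] - feats[1]) < EPSILON      # case-3
    if d == 3:
        lo, hi = max_min_clusters(feats)
        pair = lo if len(lo) == 2 else hi
        return len(pair) == 2 and abs(pair[0] - pair[1]) < EPSILON
    if d == 4:
        lo, hi = max_min_clusters(feats)
        return len(lo) == len(hi) == 2                 # case-1: two equally sized clusters
    # more than four deficiencies: split and test the two clusters iteratively
    lo, hi = max_min_clusters(feats)
    if len(lo) == d or len(hi) == d:                   # cannot split further
        return False
    return inns_is_text(lo) and inns_is_text(hi)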
In Algorithm-1, Dcc denotes the number of deficiencies, INNS denotes whether a component is a text candidate or a non-text candidate, RF1 and RF2 denote ratio features, and CRmax and CRmin are the maximum and minimum clusters, respectively. The strength of INNS is that, since it involves flexible ratio features and nearest neighbor clustering, it can withstand the effect of adverse factors as shown in Fig. 6(a), where we can see impaired and adherent text components. It is noted from Fig. 6(a) and Fig. 6(b) that the deficiencies obtained by the proposed method appear similar in terms of the ratio of the minor to major axis. Therefore, the text components shown in Fig. 6(a) are classified as text candidates.

Fig. 6. INNS classifies the impaired text components as text candidates. (a) Convex and concave boundaries for impaired components. (b) Major and minor axes for convex and concave deficiencies for the images in (a).

The effect of the INNS algorithm for all the planes can be seen in Fig. 7, where we can notice that bit 7 contains a larger number of text candidates compared to the other planes. In other words, the algorithm chooses the plane which gives the highest number of text candidates as the candidate plane. This shows that the text candidates in bit 7 carry significant information of the input image. Therefore, the proposed method chooses this plane as the candidate one. In summary, this section considers all 8 planes as the input and gives one candidate plane as the output for text detection.

Fig. 7. Illustrations of the effects of the INNS algorithm for the planes.

3) Mutual Nearest Neighbor Pair for Text Detection: The proposed method considers the candidate plane given by INNS as the input for identifying Mutual Nearest Neighbor Pairs (MNNP) in the input image for text detection. It is noted from the gradient directions at edge pixels that there are inward and outward directions due to the influence of neighboring characters and the background. Here the inward direction is the direction towards the character and the outward direction is the direction away from the character. It is also true that the spacing between characters is smaller than that between words and text lines. Since our target is arbitrarily oriented text detection, we use the outward direction as it is invariant to rotation and scaling. Therefore, for each component, say A, in a candidate plane, the proposed method moves in the outward gradient direction until it reaches the nearest neighbor component, say B. Then the method moves along the outward gradient direction of component B until it finds its nearest neighbor component, say A. If A finds B and B finds A, then the components A and B are said to be a Mutual Nearest Neighbor Pair (MNNP). The method retains the text candidates that satisfy the MNNP as the representatives of text. The MNNP is formally defined in equation (3):

A_{i+1} := A_i − ∇φ1;  B_{i+1} := B_i − ∇φ2;
MNNP(A, B) = 1  if  B = lim_{n→∞}(A_n)  &&  A = lim_{n→∞}(B_n)    (3)

where ∇φ1 and ∇φ2 are the outward gradient directions of A and B, respectively.
It is illustrated in Fig. 8, where (a) shows a candidate plane given by INNS and Fig. 8(b) shows the gradient outward directions, marked by red circles, for the two adjacent characters in Fig. 8(a). It is observed from Fig. 8(b) that since the two characters are the nearest neighbors, the characters satisfy the MNNP criterion and hence these two components are considered as an MNNP pair. This results in the representatives of the text as shown in Fig. 8(c). From the figure it can be seen that the method eliminates most of the non-text components and retains only the text ones. This is the advantage of this step, because the proposed method does not require additional features or classifiers for removing false positive components, in contrast to the existing methods, which generally use heuristics, rules or classifiers for removing false text candidates. However, it is also seen from Fig. 8(c) that character components are missing compared to the results in Fig. 8(a). This is because some of the text components do not satisfy MNNP.

Fig. 8. Examples of representative text selection using MNNP and text detection through restoration. (a) Candidate plane. (b) Outward directions. (c) Representatives. (d) Grouping MNNP components into one word in the edge image. (e) Restored image. (f) Text detection.

Therefore, to restore missing character components, we use the outward direction of the representatives given by MNNP to find the nearest neighbor components in the Canny edge image of the input one, as shown in Fig. 8(d), where missing character components are restored. The final restored image for the input one is shown in Fig. 8(e) and the text detection results are shown in Fig. 8(f), where it can be seen that the proposed method detects text words of different orientations without losing text or producing false positives.

Fig. 9. Illustrating robustness of the MNNP over impaired and adherent text candidates for text detection. (a) Impaired and adherent text candidates given by INNS. (b) Testing mutual nearest neighbor pairs for the candidates in (a). (c) Text detection results using MNNP.

Fig. 10. Illustrating the robustness of the proposed text detection method for images of different resolutions with blur and complex background. (a) Images of low resolution with blur and complex background. (b) Result of the proposed INNS. (c) Result of the proposed MNNP. (d) Text detection result of the proposed method.

To illustrate the strength of the proposed INNS and MNNP steps together, we test these steps on images with broken and adherent text components, and on images of low resolution along with different degrees of blur and complex background, as shown in Fig. 9(a)-Fig. 9(c) and Fig. 10(a)-Fig. 10(d), respectively. It is found from Fig. 9 that the text candidates satisfy the MNNP criteria and the bounding boxes are fixed successfully as shown in Fig. 9(b) and Fig. 9(c). It is observed from Fig. 10(b)-Fig. 10(d) that the proposed INNS and MNNP have the ability to withstand the effect of low resolution, complex background and degree of blur to some extent. Therefore, we can assert that the INNS and MNNP steps are robust. This is because INNS involves robust steps, namely, estimating the shapes of deficiencies created by convex and concave boundaries using the ratio of the major and minor axes, and symmetry based on the nearest neighbor criteria. Fixing the bounding box and drawing the major axis do not require the full character, and they can work even when a character misses a few pixels, as they consider the majority of pixels. Similarly, the symmetry does not involve any hard threshold or condition. Therefore, INNS is robust to noises and can withstand variations caused by different challenges. In the same way, MNNP considers the gradient outward direction and symmetry based on the distance between two nearest neighbor character components, and is robust to the above-mentioned challenges and orientations because the symmetry represents character components.

B. An Approach for Text Recognition

This section is divided into two sub-sections. We first introduce a novel idea for determining an automatic window
Fig. 11. Automatic window fixing for horizontal text.

Fig. 12. Automatic window fixing for non-horizontal and curved text: (b) is the last result of the iterative algorithm for the Horizontal (H) and Fused (F) sub-bands.

to extract features for text of any direction in Section III.B.1. Then Section III.B.2 proposes a new set of features, which combine statistical-texture and spatial features, for recognizing text in video, natural scene and born digital images with the help of a Hidden Markov Model (HMM).

1) Automatic Window Size Detection for Text Recognition: For each word given by text detection in the previous section, we consider the height of the word as the width of the initial window, which results in a square window as shown in Fig. 11(a) and Fig. 11(b), where we can see the initial square window for the input word. In general, since the text detection method fixes a bounding box for the whole word, covering extra background information, the square window covers more than one character. As a result, the defined square window does not cover only one character. Therefore, to determine a correct window, we propose to explore wavelet high frequency sub-bands and a fused band. The reason to propose Haar wavelet high frequency sub-bands is that wavelet decomposition is good at classifying text pixels from non-text ones, as stated in [11] for text detection. For the square window, we obtain high frequency sub-bands, namely, Horizontal, Vertical and Diagonal, as shown in Fig. 11(b). The proposed method performs an OR operation to fuse the three high frequency sub-bands, as shown in Fig. 11(b) with the label fused. Then we apply k-means clustering with k = 2 on the three sub-bands and the fused window to obtain the respective text clusters as shown in Fig. 11(c). The cluster which gives the highest mean is considered as the text cluster. Since text pixels have high contrast values compared to their background [6], the pixels that have high contrast values are classified into the text cluster. This result outputs the structures of character components as shown in Fig. 11(c).

It is true that the directions of most pixels contribute towards the height of character components. If we calculate the angle of such a character component, it gives almost the angle of the character direction. For example, if a character is in the horizontal direction, it gives almost 90 degrees. Inspired by this observation, we calculate the angle for the fused result and the text clusters of the high frequency sub-bands by passing the coordinates of the pixels to Principal Component Analysis (PCA), as shown in Fig. 11(c), where the principal axis is drawn for the high frequency sub-bands and the fused window. The reason to choose PCA is that it does not require the full shape of the character to find its direction [46], as can be noticed from Fig. 11(c). It is true that PCA is popular for dimensionality reduction rather than angle estimation. However, we explore the property that PCA outputs the principal axis for objects when we feed two dimensional data, such as the X and Y coordinates of "0" and "1" pixels in the image. Therefore, the principal axis can be drawn using the first eigenvector of PCA to estimate the correct angle of the character components. It is also true that the initial square window covers one or more characters according to our observation, as shown in Fig. 11(a). As a result, the arbitrary orientation of a text does not reflect in the content of the square window. It is evident from Fig. 12(a) and Fig. 12(b), where the content of the initial square window looks horizontal with a bit of tilt. Therefore, we ignore the angles of the vertical and diagonal windows and consider the angle of the horizontal and fused windows for experimentation.
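A minimal sketch of the sub-band fusion and text-cluster step for one square window is shown below, assuming PyWavelets and scikit-learn. The OR fusion is approximated by taking the maximum coefficient magnitude across the three sub-bands before clustering, which is an assumption rather than the paper's exact operation, and all names are illustrative.

import numpy as np
import pywt
from sklearn.cluster import KMeans

def text_cluster(band: np.ndarray) -> np.ndarray:
    """k-means with k=2 on coefficient magnitudes; the higher-mean cluster is text."""
    vals = np.abs(band).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vals)
    text_label = int(np.argmax([vals[km.labels_ == k].mean() for k in (0, 1)]))
    return (km.labels_ == text_label).reshape(band.shape)

def window_text_masks(window: np.ndarray):
    """Return text-pixel masks of the horizontal sub-band and the fused band."""
    _, (cH, cV, cD) = pywt.dwt2(window.astype(float), "haar")
    fused = np.maximum(np.abs(cH), np.maximum(np.abs(cV), np.abs(cD)))  # OR-like fusion
    return text_cluster(cH), text_cluster(fused)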
The proposed method iteratively calculates the angle by reducing three pixels at each iteration for both the horizontal window and the fused window until the difference between the horizontal and the fused window angles satisfies +3 or −3 degrees. It can be verified from the example in Fig. 11(d), where the last iteration results with angles are shown. The difference of the fused and the horizontal window angles is −0.4, which satisfies the condition of +3 or −3. These two angles match in the sense that the window contains the exact character without any extra information, as shown in Fig. 11(d) for the fused results. It is observed from Fig. 11(c) and Fig. 11(d) that the angles of the vertical and the diagonal windows do not play a role in calculating angles. In this way, the proposed method determines the correct window iteratively with the help of angle information. The same is true for non-horizontal and curved text, as shown in Fig. 12(a) and Fig. 12(b). In some situations, we may have the exact vertical direction. In this case, instead of considering the horizontal frequency sub-band, we consider the vertical sub-band to find matches with the fused window for angle calculation. In addition, the procedure terminates with the angle of zero degrees rather than 90 degrees. The algorithmic steps of the iterative process for finding the window are described in Algorithm-2.
Algorithm 2 Automatic Window Selection

Input: Detected text image of width w and height h
1. For the text image, do the following:
   A. Take the first square window Winit, where Winit = {P1, P2, . . . , Pn} is an image of width h and height h.
   B. Apply wavelet decomposition on Winit to generate the 3 sub-bands HL, LH, HH and subsequently a fusion image Φ using the union operator on the 3 sub-bands.
   C. Obtain the text clusters (HHtxt, Φtxt) and non-text clusters from HH and Φ using K-means.
   D. Compute the covariance matrix S = (1/m) Σ_{i=1..n} x^(i) (x^(i))^T of HHtxt and Φtxt, where x is the coordinate of a text pixel, m is the total number of pixels, and T is the transpose.
   E. Apply PCA on S and find the direction vector Z having the maximum information.
   F. Estimate the PCA angle θ1 of the HL matrix using sinh⁻¹( z12 / sqrt(z11² + z12²) ) and the PCA angle θ2 of the Φ matrix using sinh⁻¹( z22 / sqrt(z21² + z22²) ).
   G. While the absolute difference between θ1 and θ2 is greater than 3:
      a. Winit = Winit − {p1, p2, p3};
      b. Use step (B) ∼ step (F) to calculate θ1 and θ2.
   H. Apply the shrinking and expanding algorithm to generate a sequence of window sizes W = [W1 W2 . . . Wn].
   I. For each window:
      a. Extract the feature vector.
      b. Apply SVM classification on the feature vector to get a confidence score (q).
   J. Select the window Wop having the maximum SVM score among all windows using Wi: q = max{q1, q2, . . . , qn}.
   K. Recognize the text in Wop.
   L. While the window is not at the end of the text region, do the following:
      a. Move the window in the direction of (θ1 + θ2)/2.
      b. Use step (A) ∼ step (K) to generate the angle and optimal window.
2. Output: Optimal character window.

In Algorithm-2, W and Winit denote the recognized window pixel matrix, while HL, LH and HH are the high-frequency sub-bands of the wavelet decomposition on W, Φ is the fused band, Z denotes the direction vector determined by PCA, θ1 denotes the PCA angle of the HL matrix, and θ2 denotes the PCA angle of the Φ matrix. Variable X is the feature vector generated by the features, and q denotes the SVM classifier score for each feature vector X.

Fig. 13. Examples of path estimation for arbitrarily oriented text using the fused results with angles. (a) Angles for the non-overlapping square window over text. (b) Paths for the different oriented texts.

With this step, the proposed method fixes the correct window for the initial character. Next, to move to the next character with the same window size in a non-overlapping way, we need to find the direction of the text. For this, we use the angle of the initial window of the fused result as the direction to move over the text. The angle of the next window is then used as the direction to move further. This process continues until the end of the word, as shown in Fig. 13(a), where we can see the angles of the window movements over a text, which results in a path for moving the window according to the text direction as shown in Fig. 13(b).
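The core of Algorithm 2 (steps D-G) can be sketched as follows. The principal-axis angle is obtained here with arctan2 of the first eigenvector instead of the sinh⁻¹ form written in the algorithm, and the window is shrunk from its trailing side, which the paper does not specify; both are assumed simplifications. window_text_masks refers to the sketch given earlier.

import numpy as np

def pca_angle(mask: np.ndarray) -> float:
    ys, xs = np.nonzero(mask)
    if len(xs) < 2:
        return 0.0
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    cov = pts.T @ pts / len(pts)                   # step D: covariance of coordinates
    eigvals, eigvecs = np.linalg.eigh(cov)
    z = eigvecs[:, np.argmax(eigvals)]             # step E: principal direction
    return np.degrees(np.arctan2(z[1], z[0]))      # angle of the principal axis

def shrink_to_character(window: np.ndarray, masks_fn, step=3, tol=3.0):
    """masks_fn(window) -> (horizontal_mask, fused_mask), e.g. window_text_masks."""
    while min(window.shape) > step:
        h_mask, f_mask = masks_fn(window)
        theta1, theta2 = pca_angle(h_mask), pca_angle(f_mask)
        if abs(theta1 - theta2) <= tol:            # step G: the two angles match
            return window, (theta1 + theta2) / 2.0
        window = window[:, :-step]                 # step G.a: shrink by three pixels
    return window, 0.0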
Sometimes, due to upper or lower case and font size variations, the window fixed for the initial character using the automatic window size determination may not fit the neighboring characters during moving. The same procedure can also be adopted for fixing a correct window for the next moved window. However, this procedure is slightly sensitive to very small fonts, while small fonts are common in this work. Therefore, we move the same square window in a non-overlapping way to the next character and calculate a confidence score using an SVM classifier to fix the correct window for the character by expanding and shrinking the window pixel by pixel. The procedure for using an SVM classifier to calculate the confidence score is as follows. For the extracted features, a mapping is done between the window that is moving over the character images and its label. It can be represented as x → y, where x ∈ X is a character in a window and y ∈ Y is its class label. Here x ∈ R^n, where n is the number of features extracted from the window. For the input set X and output set Y, the training set will be (x1, y1), . . . , (xw, yw). In the testing phase, for an unknown or query window xq ∈ X, SVM finds the appropriate label yq ∈ Y. In this work, we use the RBF kernel, which is the most popular function used in the literature. The RBF kernel is a function k such that, for all xr, xs ∈ X, k(xr, xs) = (Φ(xr) · Φ(xs)), where Φ is the mapping from X to the dot product feature space. More details regarding the training and the kernel of the SVM classifier can be found in [51]. For training the SVM classifier and setting parameters, we use the same number of training samples as used for recognition.

When a text has uniformly sized characters, window fixing using the SVM classifier terminates quickly. This is the advantage of the automatic window determination. Window fixing using the SVM classifier serves as a verifier for character window detection. Before recognition, we extract a set of features, which will be discussed later in the same section, to calculate a confidence score with the SVM classifier. Since
the proposed method considers the same window to move to the next character, it saves a large number of computations, as it moves character by character and not pixel by pixel, in contrast to the existing methods [36], [52], [53]. More details regarding the training and the kernel of the SVM classifier can be found in [49]. When the confidence score gives the maximum with respect to the ground truth, we consider it as the actual window for the character. It is illustrated in Fig. 14, where (a) provides the path, (b) shows the second window, (c) shows the shrinking window, (d) shows the correct window of the character "h", (e) shows the window shrinking further, and (f) shows the confidence score given by SVM for window reduction. Fig. 14(f) shows the maximum confidence score for fixing the correct window, and the confidence score decreases when the window reduces further.

Fig. 14. Character detection using the confidence score of SVM. (a) Path. (b) Second window. (c) Window shrinking. (d) Correct window. (e) Window shrinking further - stop. (f) Terminating criterion for fixing the correct window.
section provides a window to traverse a word character which are scene text images. This dataset is complex because
by character through path estimation of any direction. For the considered South Indian script is more cursive compared
each window, we propose a new set of features comprising to English and other Indian scripts. Besides, since the images
statistical-texture and spatial features in contourlet wavelet are captured by four mega pixel cameras, text suffer from low
domain. Here, the window is referring one character accord- resolution like video. Therefore, it is slightly complex than the
ing to the previous step. Motivated by the alterative review ICDAR family. In summary, according to the characteristics
in [4]–[9], we propose to combine the strengths of statis- of the databases from video, natural scene and born digital
tical features which generally help in extracting shapes of images, we list expected common challenges and individual
characters, texture features which help in extracting character challenges in Table I.
appearance, and spatial features which help in extracting inter For calculating standard measures, namely, Recall, Preci-
and intra symmetrical features of characters components. sion and F-measure, to evaluate the proposed text detection
For each window (character), the proposed method obtains method, we follow the standard evaluation scheme described
high frequency sub-bands, namely, horizontal, vertical and in ICDAR 2013 Robust Reading Competition [58] for all the
diagonal using contourlet wavelets (Haar). For each sector of experimentation including South Indian script data. We use the
each window of the word image, we extract statistical, textural ground truth available in respective databases for calculating
and run length based features as defined in [54], where the measures. However, for Indian data, we create the ground truth
definitions and formula are presented. These features are used for calculating measures using the same evaluation scheme.

TABLE I
Expected Challenges of Text Detection in Video, Natural Scene and Born Digital Images

The definitions and formulas of the three measures can be found in [58].
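For reference, a simple way to compute word-level Recall, Precision and F-measure from axis-aligned boxes with one-to-one IoU matching is sketched below; the actual ICDAR 2013 protocol [58] is more elaborate, so this is only an approximation, and the 0.5 overlap threshold is an assumption.

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def detection_scores(detections, ground_truth, thr=0.5):
    matched_gt, tp = set(), 0
    for det in detections:
        best = max(range(len(ground_truth)),
                   key=lambda g: iou(det, ground_truth[g]), default=None)
        if best is not None and best not in matched_gt and iou(det, ground_truth[best]) >= thr:
            matched_gt.add(best)
            tp += 1
    recall = tp / len(ground_truth) if ground_truth else 0.0
    precision = tp / len(detections) if detections else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f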
To show the superiority of the proposed method over the existing approaches, we implement the state-of-the-art existing methods to give comparative studies. For example, Epshtein et al. [15], Yin et al. [16], Liao et al. [24], and Tian et al. [30] detect text in natural scene images. Mosleh et al. [14], Li et al. [11], Zhao et al. [13], and Khare et al. [12] detect text in video. Shivakumara et al. [8] detect text in mobile video images. Note that Epshtein et al., Mosleh et al., Zhao et al., Li et al., Khare et al., Dey et al., and Shivakumara et al.'s methods are implemented by us per the instructions given in their papers. However, for Yin et al.'s and Liao et al.'s methods, we use the codes available publicly. We use the same samples for training and testing as the proposed method.

The reason to choose these existing methods for comparative studies is that Epshtein et al. [15], Mosleh et al. [14], Zhao et al. [13], and Tian et al. [30] use the strength of gradient and edge information for text detection, Li et al. [11], Khare et al. [12] and Zhao et al. [13] use the strength of texture for text detection, Yin et al. [16] use the strength of connected component analysis for text detection, while Liao et al. [24] use the deep learning concept for text detection. To show that one single strength is not enough to achieve good results for the considered complexity, we compare the proposed method, which combines the strengths of connected component, gradient, shape and texture of text components and a classifier, with the existing methods.

To evaluate the proposed recognition method, we consider standard databases as in text detection, such as ICDAR 2013 video, ICDAR 2011, SVT scene data, ICDAR 2011 born digital data and South Indian data for experimentation. For all the recognition experiments, we use the ground truth for calculating recognition rates at both word and character levels. In addition, we follow the standard measures as presented in [37] for calculating recognition rates at both word and character levels.

To show the effectiveness of the proposed text recognition method, we implement the state-of-the-art methods for text recognition in scanned images, natural scene images and videos. For example, Milyae et al. [35] and Howe [34] recognize text through binarization in natural scene images, Su et al. [33] recognize text through binarization in degraded document images, Roy et al. [36] recognize text without binarization in natural scene images, Roy et al. [9] recognize text through binarization in video images, and Jaderberg et al. [41] explore deep learning for recognition in natural scene images. Lee and Kim [43] have also proposed a deep convolutional neural network for slab number recognition. Note that the codes or executable files are available for Milyae et al., Howe, Jaderberg et al., Lee and Kim and Su et al.'s methods, while the other two approaches proposed by Roy et al. are implemented according to their papers. We choose these methods because they involve binarization, classifiers and deep learning for recognition, as the proposed method does, to give fair comparative studies.

Fig. 15. Illustrating advantages of the proposed method over MSER+MNNP, which uses color information for text detection. (a) MSER results for the different type images. (b) Text detection results using MSER+MNNP for the images in (a). (c) The result of the proposed INNS for the different type images. (d) Text detection result of the proposed INNS+MNNP.

A. Evaluating Text Detection Method

For text detection, finding the candidate plane through bit plane slicing is the key step, which has several advantages over the state-of-the-art methods as discussed in Section III.A.1. To validate the effectiveness of bit plane slicing over the standard step called Maximally Stable Extremal Regions (MSER), which is widely used for detecting text candidates in images [50] based on color information, we conduct experiments on 100 images randomly chosen from all the databases to give a comparative study. We feed the text candidates given by MSER to the MNNP of the proposed method for text detection; at the same time, we feed the text components in candidate planes given by INNS to the MNNP of the proposed method for text detection. The qualitative results of MSER+MNNP and the proposed INNS+MNNP are respectively shown in Fig. 15(a)-Fig. 15(b), where it can
Authorized licensed use limited to: Cisco. Downloaded on March 18,2021 at 06:21:27 UTC from IEEE Xplore. Restrictions apply.
1156 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 29, NO. 4, APRIL 2019

TABLE II
P ERFORMANCES OF THE P ROPOSED AND E XISTING M ETHODS
ON ICDAR 2015 V IDEO D ATASET

Fig. 16. Examples of text detection results of the proposed method on


ICDAR 2015 and YVT video datasets.

be seen that MSER+MNNP produces more false positives and


does not fix bounding boxes for text lines, correctly. This is
because the MSER considers only color values for grouping
character components with certain threshold values. As a
result, MSER+MNNP misclassifies text as non-text. On the
other hand, the proposed INNS+MNNP gives better results for TABLE III
different type images as shown in Fig. 15(c) and Fig. 15(d). P ERFORMANCES OF THE P ROPOSED AND E XISTING
M ETHODS ON YVT V IDEO D ATA
This shows that though bit plane slicing loses color values,
it is capable of detecting text accurately in multi-type images
with the help of symmetry features used in INNS, which
represent character components more accurately. Therefore,
we can conclude that the loss of color information does not
affect the performance of text detection.
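The bit plane slicing step compared above can be summarized by a minimal Python sketch (not the authors' code); the helper name is ours, and only the standard bit-shift decomposition of an 8-bit grayscale image is assumed.

import numpy as np

def bit_planes(gray):
    # gray: uint8 grayscale image; plane 7 holds the most significant bit.
    return [((gray >> b) & 1).astype(np.uint8) for b in range(8)]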
The quantitative results of MSER+MNNP and the proposed INNS+MNNP are reported in Table II-Table VIII, where one can notice that the proposed INNS+MNNP is better than MSER+MNNP for all the databases in terms of recall, precision and F-measure. The main reason for the poor results of MSER+MNNP is that MSER erroneously considers non-text as text candidates, especially for complex background images. This is because MSER groups uniform pixels as a single component, while INNS involves iterative nearest neighbor symmetry to extract features which represent character components. Since the proposed method requires frames for text detection, we follow the criterion stated in [6] and [7] to extract 320 and 60 key frames from ICDAR 2015 video and YVT video, respectively. The same samples, along with the available ground truth, are used for calculating measures to evaluate the proposed and existing methods. For experiments on natural scene data, we consider 229 images for ICDAR 2013, 200 for MSRA, 238 for SVT, 141 for ICDAR 2011 born digital and 250 for Indian data, with the corresponding ground truth. The same samples are used for evaluating the proposed and existing methods.
1) Experiments on ICDAR 2015 and YVT Video: Sample text detection results of the proposed method for ICDAR 2015 and YVT video are shown in Fig. 16, where it is noticed that the proposed method detects text of small fonts and arbitrarily oriented text. Quantitative results of the proposed and existing methods are reported in Table II and Table III, which show that the proposed method is the best at F-measure compared to the existing methods for ICDAR 2015 video. For YVT data, the proposed method scores the best recall and F-measure compared to the existing methods. Liao et al.'s method is the best at recall and Yin et al.'s method is the best at precision for ICDAR 2015 video compared to the proposed method. This is because these two methods exploit classifiers at every step to achieve better results. On the other hand, the proposed method does not use classifiers for text detection. Since these two methods depend heavily on training samples and classifiers, their overall performance in terms of F-measure is poor compared to the proposed method.
For YVT video data, Wu et al.'s method achieves the best precision compared to the proposed method. This is because the method exploits the advantage of temporal information. However, it is the worst for recall and F-measure compared to the proposed method. The other existing methods, namely, Epshtein et al.'s [15] method, which is sensitive to edges in the background, Mosleh et al.'s and Zhao et al.'s [13] methods, which are limited to caption and big-font text, and Khare et al.'s [12] method, which focuses on high contrast text, score poor results compared to the proposed method. It is observed from Table II and Table III that the proposed and existing methods score better results for YVT video data compared to ICDAR 2015 video data because the YVT video dataset provides high contrast text with complex background, while the ICDAR 2015 video data has both low and high contrast text.
2) Experiments on ICDAR 2013 and SVT Scene Data: Qualitative results of the proposed method for ICDAR 2013 and SVT data are shown in Fig. 17, where we can see images with different backgrounds, fonts, font sizes, etc. Fig. 17 shows that the proposed method detects text well in both datasets.


The quantitative results of the proposed and existing methods for ICDAR 2013 scene data are reported in Table IV. It is observed from Table IV that the proposed method achieves the best F-measure compared to the existing methods. The method that participated in the ICDAR 2013 competition is the best at precision, and Liao et al.'s method is the best at recall. The main reason for the lower recall and precision of the proposed method is that MNNP sometimes misclassifies text components as non-text and thus misses text components.
For SVT data, according to the results reported in Table V, the proposed method is better than the existing methods in terms of recall and F-measure. Liao et al.'s method is better than the proposed and all the other existing methods in terms of recall as it exploits deep learning for text detection. The other existing methods have the inherent limitations discussed in the previous section and hence score poor results. Since this dataset is more complex than the ICDAR 2013 scene dataset, the proposed and existing methods report lower results compared to the ICDAR data. From these experiments, we can conclude that the proposed approach is capable of handling complex backgrounds and multi-type images.

Fig. 17. Examples of text detection results of the proposed method on ICDAR 2013, SVT and MSRA natural scene datasets.

TABLE IV
PERFORMANCES OF THE PROPOSED AND EXISTING METHODS ON ICDAR 2013 SCENE DATASET

TABLE V
PERFORMANCES OF THE PROPOSED AND EXISTING METHODS ON SVT SCENE DATASET

TABLE VI
PERFORMANCES OF THE PROPOSED AND EXISTING METHODS ON MSRA SCENE DATA

3) Experiments on Multi-Oriented MSRA Scene Data: This dataset is different from the above two scene datasets as it contains more multi-oriented text and Chinese script images. In addition, the ground truth is available at the text line level rather than at the word level as in the ICDAR data. Therefore, the proposed method connects words of the same text line to make use of the ground truth for experimentation, as shown in Fig. 17. Sample results in Fig. 17 show that the proposed method detects multi-oriented text lines well. The results of the proposed and existing methods reported in Table VI show that the proposed method scores better results for recall and F-measure compared to the existing methods. However, Yin et al.'s method is the best at precision. The proposed MNNP sometimes misclassifies non-text as text during restoration, especially for multi-oriented text; therefore, the precision of the proposed method is lower. Since the existing approaches are not robust to arbitrary orientations, they score poor results, except Khare et al.'s method [12], which is developed for video text detection rather than scene text detection and hence does not report good results for scene data.
4) Experiments on ICDAR 2011 Born Digital Data: For this dataset, achieving good results compared to video and natural scene images is challenging because the nature of the text is unpredictable. Sample results in Fig. 18 show that the proposed method performs well for born digital data also. The results reported in Table VII show that the proposed method is better than the existing methods in terms of F-measure because the extracted features are invariant to text types. Liao et al.'s method [24] is the best for recall, and Yang et al.'s method [21] is the best for precision compared to the proposed and the other existing methods. Liao et al.'s method is robust to multiple fonts and sizes because of deep learning, while Yang et al.'s method is developed for born digital images and hence scores a high precision. Since the other existing methods are developed either for video text or scene text, they report poor results.
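Connecting word detections into a single text line for the line-level MSRA ground truth can be done, in the simplest axis-aligned case, as in the following minimal Python sketch (not the authors' code; the MSRA ground truth actually uses rotated rectangles, so this only illustrates the idea).

def merge_words_to_line(word_boxes):
    # word_boxes: list of (x1, y1, x2, y2) boxes belonging to one text line.
    xs1, ys1, xs2, ys2 = zip(*word_boxes)
    return min(xs1), min(ys1), max(xs2), max(ys2)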


5) Experiments on Multi-Lingual Scene Data: This dataset is special because it contains multi-lingual scene text of different orientations and low resolution, as shown by the sample images in Fig. 18, where it is noticed that small-font or cursive-font text with different backgrounds is detected properly by the proposed method. Since the text is more cursive than English and Chinese, methods which depend on the shapes of characters may not work well for this dataset. The results reported in Table VIII show that the proposed method is better than the existing methods in terms of recall and F-measure because the proposed features are invariant to scripts. Due to the greater cursiveness and low resolution, the existing methods report poor results. However, Yin et al.'s method scores high precision compared to the proposed and the other existing methods. This is because the approach has the ability to handle multi-script text. However, since it depends much on classifiers and training, it fails to score the best recall and F-measure compared to the proposed method. On the other hand, the proposed method does not depend on classifiers and is the best at F-measure.

Fig. 18. Examples of text detection results of the proposed approach on born digital and South Indian data.

TABLE VII
PERFORMANCES OF THE PROPOSED AND EXISTING METHODS ON ICDAR 2011 BORN DIGITAL DATASET

TABLE VIII
PERFORMANCES OF THE PROPOSED AND EXISTING APPROACHES ON SOUTH INDIAN DATA

TABLE IX
RECOGNITION RATES OF THE PROPOSED AND EXISTING APPROACHES ON DIFFERENT DATASETS AT WORD AND CHARACTER LEVELS (IN %). W AND C INDICATE WORD AND CHARACTER RECOGNITION RATES, RESPECTIVELY

TABLE X
RECOGNITION RATES OF THE PROPOSED AND EXISTING APPROACHES ON SOUTH INDIAN DATASETS AT WORD AND CHARACTER LEVELS (IN %). W AND C INDICATE WORD AND CHARACTER RECOGNITION RATES, RESPECTIVELY

TABLE XI
AVERAGE PROCESSING TIME OF THE PROPOSED METHOD FOR RECOGNITION ON DIFFERENT DATABASES (IN SECONDS)

B. Evaluating Text Recognition Approach

This section consists of four experiments, namely, evaluating the proposed method on video data comprising ICDAR 2013 data, natural scene data comprising ICDAR 2011 and SVT data, born digital data comprising ICDAR 2011 data, and South Indian data which is created by us. For evaluating the proposed method, we use the respective ground truth and testing data reported in the databases for calculating recognition rates at both word and character levels for all the experiments in this work. However, for the South Indian Data (SID), we use 200 words, which include 50 for each language, for validation in the line of the SVT data. The HMM is trained according to the guidelines given in [34] to obtain the values of its parameters. For training, 750, 1850, 1700, 2200 and 2540 (650 for Kannada, 550 for Malayalam, 700 for Tamil and 640 for Telugu) words are used for ICDAR 2013 (I2013) video, ICDAR 2011 (I2011) scene, SVT, ICDAR 2011 (I2011) Born Digital (BD) and South Indian Data (SID), respectively. In total, the proposed approach considers 9040 words for training in this work.
The results of the proposed and existing methods for video, scene, born digital and South Indian data at word and character levels are reported in Table IX and Table X, respectively. Table IX and Table X show that the recognition rates for words are lower than those for characters. This is because the proposed HMM does not involve any post-processing, dictionary, or language models for recognizing words. As a result, if one character is missed or fails to be recognized correctly, the whole word is counted as wrong when calculating the recognition rates.
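A minimal Python sketch (not the authors' code) of this evaluation is given below, assuming exact word matching and a simple position-wise character comparison; published protocols typically use edit distance, so this is only an approximation of the idea.

def recognition_rates(predictions, ground_truths):
    # predictions, ground_truths: parallel lists of word strings.
    word_ok = sum(p == g for p, g in zip(predictions, ground_truths))
    char_total = sum(len(g) for g in ground_truths)
    char_ok = sum(sum(pc == gc for pc, gc in zip(p, g))
                  for p, g in zip(predictions, ground_truths))
    return word_ok / len(ground_truths), char_ok / char_total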

This is not the case for the character recognition rate, where each character contributes to the recognition rate calculation. The existing methods which use binarization (Milyaev et al. [35], Howe [34], Su et al. [33] and Roy et al. [9]) for recognition score low results compared to the methods which do not use binarization (Roy et al. [36], Shi et al. [40], Phan et al. [37], Lee et al. [38], Jaderberg et al. [41], Lee and Kim [43]), as reported in Table IX. This is because, though the binarization methods are developed for natural scene text recognition, they fail to preserve character shapes for complex background images. At the same time, the publicly available OCR [10] is not robust to different fonts or font sizes. On the other hand, the methods of Roy et al., Shi et al., Phan et al., Lee et al., Jaderberg et al., and Lee and Kim give better results at both word and character levels because they extract their own features and use their own classifiers along with dictionaries and language models to achieve better results. In the case of Jaderberg et al. [41], we run the method without lexicons and synthetic images to calculate the measures, as for the proposed method, to allow a fair comparative study. The proposed method gives better results than the existing methods, mainly because of the advantage of determining the window size according to the character size and moving the window over characters.
Note that Roy et al. [36] use HMM for recognizing text in natural scene images, and Jaderberg et al. [41] and Lee and Kim [43] use deep learning, which has the ability to handle complex issues. Therefore, we compare the proposed method with these three methods to show its superiority in multi-lingual ability. The results reported in Table X show that the proposed method is better than the existing methods at both word and character levels. Determining optimal parameter values for complex scripts is hard for the above three existing methods, while the proposed method can be extended to any language as it does not involve any specific parameters or lexicons. Interestingly, the result for the South Indian data with all the scripts (Kannada+Tamil+Telugu+Malayalam) together is lower than for the other data. Hence, it remains an open issue for researchers. In summary, with all the experimental analysis, we can assert that the proposed method is the best for text detection and recognition without many constraints and is invariant to language, orientation, multi-fonts, multi-size, and multi-type text.
In order to analyze the time complexity of the proposed method, we estimate its formal computational time complexity. The proposed method consists of Bit Plane Slicing (BPS) for text component detection, Iterative Nearest Neighbor Symmetry (INNS) for candidate plane detection, Mutual Nearest Neighbor Pair (MNNP) for text detection, Automatic Window size Detection (AWD) for characters in a text, and feature extraction for recognition with HMM. In BPS, for each input image, the proposed method checks bit locations to obtain the 8 bit planes. As a result, the time complexity of BPS is Ω(n²) for the best case and O(n²) for the worst case, where n is the number of 8-bit streams in the image. In INNS, the iterative process may check only a few text components to detect the candidate plane in the best case. Therefore, letting C be the number of such components, the time complexity is Ω(C) for the best case; if C reaches n components, the time complexity is O(n) for the worst case. In MNNP, the symmetry process may check a few candidate components in the image in the best case and n components in the worst case. Therefore, the time complexity of MNNP is Ω(C) for the best case and O(n) for the worst case. For the whole text detection, the time complexity is Ω(n²) + Ω(C) + Ω(C) ≈ Ω(n²) for the best case and O(n²) + O(n) + O(n) ≈ O(n²) for the worst case.
In AWD, the process uses a fixed window for ideal characters; therefore, the time complexity for the best case is Ω(C), where C is a small fixed number of components. If C is n, the time complexity for the worst case is O(n). In the case of feature extraction and recognition using HMM, if the extracted features of an unknown character match one ideal sample, the time complexity of the recognition step is Ω(1). If the features of an unknown character are matched against the features of n characters, the time complexity of recognition is O(n) for the worst case. Therefore, for recognition, the complexity is Ω(C) + Ω(1) ≈ Ω(C) for the best case and O(n) + O(n) ≈ O(n) for the worst case. For the whole proposed method, including text detection and recognition, it is Ω(n²) + Ω(C) ≈ Ω(n²) for the best case and O(n²) + O(n) ≈ O(n²) for the worst case. We find that the time complexity is almost the same for both the best and worst cases.
A system with a 2.59 GHz CPU, 8 GB RAM and Windows 8 is used for the experimentation. We report the average processing time, computed as the mean processing time over 100 randomly chosen images of the respective databases, in Table XI. It is found from Table XI that the INNS step consumes more time compared to the other steps of text detection because INNS involves an iterative process for each component in the image. Similarly, feature extraction with HMM for recognition consumes more time compared to AWD because the HMM process requires more computations.
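As a small illustration (not the authors' code) of how such an average per-image processing time can be measured, assuming a hypothetical callable process(path) that runs the full detection and recognition pipeline:

import random
import time

def average_processing_time(image_paths, process, sample_size=100):
    # Mean wall-clock time per image over a random sample of the database.
    sample = random.sample(image_paths, min(sample_size, len(image_paths)))
    start = time.perf_counter()
    for path in sample:
        process(path)
    return (time.perf_counter() - start) / len(sample)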

Table XI shows that the proposed method takes, on average, more processing time for video data and MSRA data compared to the other databases. This is valid because video involves the processing of temporal frames and MSRA involves arbitrarily-oriented text, which requires more computations than horizontal text. Overall, the proposed method consumes a few seconds per image in order to recognize the text in the image. This is due to the MATLAB implementation. It is also noted that the processing time depends on many factors, such as the data structures of the algorithm, the system configuration and the platform. Since our target is to develop a prototype, we plan to convert the whole MATLAB code to VC++ and make the algorithm efficient with the help of cloud computing in the future, such that the system can work for real-time applications. Since the main aim of the proposed work is to develop a generic method for recognizing text irrespective of orientation, contrast variations, scripts, etc., prototype or working model development is considered beyond the scope of this work.
When the image contains too small a font and poor quality, as shown in Fig. 19(a), the proposed text detection step does not perform well due to the loss of components at the MNNP stage. Similarly, for the poor quality image shown in Fig. 19(b), even the naked eye fails to read the text. For such images, the recognition step fails to recognize the text correctly. The main reason is that the method loses character structure while fixing an automatic window for each character and during feature extraction. Therefore, there is scope for future work.

Fig. 19. Limitation of the proposed text detection and recognition methods. (a) Text detection. (b) Recognition "Gg", "IICC", "eeo".

V. CONCLUSION AND FUTURE WORK

In this work, we have proposed a new method which can cope with the challenges of text detection and recognition in a multi-image environment, namely, video, natural scene and born digital images. We have explored convex and concave deficiencies to identify a candidate plane from eight planes to represent significant information by introducing a new concept called Iterative Nearest Neighbor Symmetry (INNS). Based on the outward gradient direction of components, we have proposed a new idea of Mutual Nearest Neighbor Pair (MNNP) component identification to identify the representatives of texts. For recognition, we have introduced a new idea of determining an automatic window according to character size based on the angular relationship between fused and high frequency wavelet sub-bands. We have proposed the combination of statistical-texture and spatial-information based features in the contourlet wavelet domain for recognition with the help of an HMM model. However, it is noticed from the experimental results that the accuracy is still low for multi-lingual data, too small fonts, blurred and poor quality images. To achieve better results, one can explore fusing the significant information in different bit plane images rather than relying on one candidate plane.

ACKNOWLEDGMENT

The authors would like to thank Pooja G., Navya, Gowrishankar Pillai, Mayur S. and Deepa Shree for their help in creating the ground truth for the South Indian scripts. They would also like to thank Wang Zhen for shaping the algorithms.

REFERENCES

[1] C.-Z. Shi, C.-H. Wang, B.-H. Xiao, S. Gao, and J.-L. Hu, "Scene text recognition using structure-guided character detection and linguistic knowledge," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 7, pp. 1235–1250, Jul. 2014.
[2] D. Tao, J. Cheng, X. Gao, X. Li, and C. Deng, "Robust sparse coding for mobile image labeling on the cloud," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 1, pp. 62–72, Jan. 2017.
[3] Y. Yang, C. Deng, D. Tao, S. Zhang, W. Liu, and X. Gao, "Latent max-margin multitask learning with skelets for 3-D action recognition," IEEE Trans. Cybern., vol. 47, no. 2, pp. 439–448, Feb. 2017.
[4] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1480–1500, Jul. 2015.
[5] X.-C. Yin, Z.-Y. Zuo, S. Tian, and C.-L. Liu, "Text detection, tracking and recognition in video: A comprehensive survey," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2752–2773, Jun. 2016.
[6] L. Wu, P. Shivakumara, T. Lu, and C. L. Tan, "A new technique for multi-oriented scene text line detection and tracking in video," IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1137–1152, Aug. 2015.
[7] G. Liang, P. Shivakumara, T. Lu, and C. L. Tan, "Multi-spectral fusion based approach for arbitrarily oriented scene text detection in video images," IEEE Trans. Image Process., vol. 24, no. 11, pp. 4488–4500, Nov. 2015.
[8] P. Shivakumara, L. Wu, T. Lu, C. L. Tan, M. Blumenstein, and B. S. Anami, "Fractals based multi-oriented text detection system for recognition in mobile video images," Pattern Recognit., vol. 68, pp. 158–174, Aug. 2017.
[9] S. Roy, P. Shivakumara, P. P. Roy, U. Pal, C. L. Tan, and T. Lu, "Bayesian classifier for multi-oriented video text recognition system," Expert Syst. Appl., vol. 42, no. 13, pp. 5554–5566, 2015.
[10] (2016). Tesseract. [Online]. Available: http://code.google.com/p/tesseract-ocr/
[11] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Process., vol. 9, no. 1, pp. 147–156, Jan. 2000.
[12] V. Khare, P. Shivakumara, and P. Raveendran, "A new histogram oriented moments descriptor for multi-oriented moving text detection in video," Expert Syst. Appl., vol. 42, no. 21, pp. 7627–7640, 2015.
[13] X. Zhao, K.-H. Lin, Y. Fu, Y. Hu, Y. Liu, and T. S. Huang, "Text from corners: A novel approach to detect text and caption in videos," IEEE Trans. Image Process., vol. 20, no. 3, pp. 790–799, Mar. 2011.
[14] A. Mosleh, N. Bouguila, and A. B. Hamza, "Automatic inpainting scheme for video text detection and removal," IEEE Trans. Image Process., vol. 22, no. 11, pp. 4460–4472, Nov. 2013.
[15] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. CVPR, Jun. 2010, pp. 2963–2970.
[16] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, "Robust text detection in natural scene images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 970–983, May 2014.
[17] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, "Multi-orientation scene text detection with adaptive clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1930–1937, Sep. 2015.
[18] X. Wang, Y. Song, Y. Zhang, and J. Xin, "Natural scene text detection with multi-layer segmentation and higher order conditional random field based analysis," Pattern Recognit. Lett., vols. 60–61, pp. 41–47, Aug. 2015.
[19] Z. Zhang, W. Shen, C. Yao, and X. Bai, "Symmetry-based text line detection in natural scenes," in Proc. CVPR, Jun. 2015, pp. 2558–2567.
[20] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, "Text flow: A unified text detection system in natural scene images," in Proc. ICCV, Dec. 2015, pp. 4651–4659.
[21] H. Yang, S. Wu, C. Deng, and W. Lin, "Scale and orientation invariant text segmentation for born-digital compound images," IEEE Trans. Cybern., vol. 45, no. 3, pp. 533–547, Mar. 2015.
[22] J. Xu, P. Shivakumara, T. Lu, C. L. Tan, and M. Blumenstein, "Text detection in born-digital images by mass estimation," in Proc. ACPR, Nov. 2015, pp. 690–694.
[23] T. He, W. Huang, Y. Qiao, and J. Yao, "Text-attentional convolutional neural network for scene text detection," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2529–2541, Jun. 2016.
[24] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "TextBoxes: A fast text detector with a single deep neural network," in Proc. AAAI, 2017, pp. 4161–4167.
[25] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, "Multi-oriented text detection with fully convolutional networks," in Proc. CVPR, Apr. 2016, pp. 4159–4167.
[26] H. Cho, M. Sung, and B. Jun, "Canny text detector: Fast and robust scene text localization algorithm," in Proc. CVPR, Jun. 2016, pp. 3566–3573.
[27] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic data for text localisation in natural images," in Proc. CVPR, Apr. 2016, pp. 2315–2324.
[28] L. Gomez and D. Karatzas, "TextProposals: A text-specific selective search algorithm for word spotting in the wild," Pattern Recognit., vol. 70, pp. 60–74, Oct. 2017.
[29] Y. Liu and L. Jin, "Deep matching prior network: Toward tighter multi-oriented text detection," in Proc. ICCV, Mar. 2017, pp. 3454–3461.
[30] S. Tian, S. Lu, and C. Li, "WeText: Scene text detection under weak supervision," in Proc. ICCV, Oct. 2017, pp. 1501–1509.
[31] A. Mittal, P. P. Roy, P. Singh, and B. Raman, "Rotation and script independent text detection from video frames using sub pixel mapping," J. Vis. Commun. Image Represent., vol. 46, pp. 187–198, Jul. 2017.
[32] S. Dey et al., "Script independent approach for multi-oriented text detection in scene image," Neurocomputing, vol. 242, pp. 96–112, Jun. 2017.
[33] B. Su, S. Lu, and C. L. Tan, "Robust document image binarization technique for degraded document images," IEEE Trans. Image Process., vol. 22, no. 4, pp. 1408–1417, Apr. 2013.
[34] N. R. Howe, "A Laplacian energy for document binarization," in Proc. ICDAR, Sep. 2011, pp. 6–10.
[35] S. Milyaev, O. Barinova, T. Novikova, P. Kohli, and V. Lempitsky, "Image binarization for end-to-end text understanding in natural images," in Proc. ICDAR, Aug. 2013, pp. 128–132.
[36] S. Roy, P. P. Roy, P. Shivakumara, G. Louloudis, and C. L. Tan, "HMM-based multi oriented text recognition in natural scene image," in Proc. ACPR, Nov. 2013, pp. 288–292.
[37] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, "Recognizing text with perspective distortion in natural scenes," in Proc. ICCV, Dec. 2013, pp. 569–576.
[38] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu, "Region-based discriminative feature pooling for scene text recognition," in Proc. CVPR, Jun. 2014, pp. 4050–4057.
[39] C.-Y. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," in Proc. CVPR, Mar. 2016, pp. 2231–2239.
[40] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, "Robust scene text recognition with automatic rectification," in Proc. CVPR, Mar. 2016, pp. 4168–4176.
[41] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," Int. J. Comput. Vis., vol. 116, no. 1, pp. 1–20, 2016.
[42] S. Yousfi, S. A. Berrani, and C. Garcia, "Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos," Pattern Recognit., vol. 64, pp. 245–251, Apr. 2017.
[43] S. J. Lee and S. W. Kim, "Recognition of slab identification numbers using a deep convolutional neural network," in Proc. ICMLA, Dec. 2016, pp. 718–721.
[44] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[45] M. Jain, M. Mathew, and C. V. Jawahar, "Unconstrained scene text and video text recognition for Arabic script," in Proc. ASAR, Apr. 2017, pp. 26–30.
[46] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. New Delhi, India: Pearson, 2002.
[47] S. Sudhakaran and A. P. James, "Sparse distributed localized gradient fused features of objects," Pattern Recognit., vol. 48, no. 4, pp. 1538–1546, 2015.
[48] Z. Long and N. H. Younan, "Multiscale texture segmentation via a contourlet contextual hidden Markov model," Digit. Signal Process., vol. 23, no. 3, pp. 859–869, 2013.
[49] A. E. Yacoubi, M. Gilloux, R. Sabourin, and C. Y. Suen, "An HMM-based approach for off-line unconstrained handwritten word modeling and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 752–760, Aug. 1999.
[50] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, "Robust text detection in natural images with edge-enhanced maximally stable extremal regions," in Proc. ICIP, Sep. 2011, pp. 2609–2612.
[51] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[52] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proc. ICCV, Nov. 2011, pp. 1457–1464.
[53] A. Mishra, K. Alahari, and C. V. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Proc. CVPR, Jun. 2012, pp. 2687–2694.
[54] P. Agrawal, M. Vatsa, and R. Singh, "Saliency based mass detection from screening mammograms," Signal Process., vol. 99, pp. 29–47, Jun. 2014.
[55] S. J. Young, J. Jansen, J. J. Odell, D. Ollason, and P. C. Woodland, "The HTK hidden Markov model toolkit book," Entropic Cambridge Res. Lab., Cambridge, U.K., Tech. Rep., 1995. [Online]. Available: http://htk.eng.cam.ac.uk/
[56] D. Karatzas et al., "ICDAR 2015 competition on robust reading," in Proc. ICDAR, Aug. 2015, pp. 1156–1160.
[57] P. X. Nguyen, K. Wang, and S. Belongie, "Video text detection and recognition: Dataset and benchmark," in Proc. WACV, Mar. 2014, pp. 776–783.
[58] D. Karatzas et al., "ICDAR 2013 robust reading competition," in Proc. ICDAR, Aug. 2013, pp. 1115–1124.
[59] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. CVPR, Jun. 2012, pp. 1083–1090.
[60] D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy, "ICDAR 2011 robust reading competition—Challenge 1: Reading text in born-digital images (Web and Email)," in Proc. ICDAR, Sep. 2011, pp. 1485–1490.

K. S. Raghunandan received the master's degree from University of Mysore in 2013, where he is currently pursuing the Ph.D. degree. His research interests include image processing, pattern recognition, and video understanding.

Palaiahnakote Shivakumara received the B.Sc., M.Sc., M.Sc. (Tech.) by research, and Ph.D. degrees in computer science from University of Mysore, Karnataka, India, in 1995, 1999, 2001, and 2005, respectively. He was with the Department of Computer Science, School of Computing, National University of Singapore, from 2008 to 2013, as a Research Fellow on a video text extraction and recognition project. He is currently a Senior Lecturer with the Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. He has published over 190 papers in conferences and journals. His research interests are in the area of image processing and video text analysis. He was a recipient of the prestigious Dynamic Indian of the Millennium award by KG Foundation, India. He has been an Associate Editor for ACM Transactions on Asian and Low-Resource Language Information Processing.

Sangheeta Roy is currently pursuing the Ph.D. degree with University of Malaya, Malaysia. Her areas of interest include image processing, pattern recognition, and video text understanding.

G. Hemantha Kumar received the B.Sc., M.Sc., and Ph.D. degrees from University of Mysore. He is currently a Professor with the Department of Studies in Computer Science, University of Mysore, Mysore. He has published over 200 papers in journals, edited books, and refereed conferences. His current research interests include numerical techniques, digital image processing, pattern recognition, and multimodal biometrics.

Umapada Pal (SM'15) received the Ph.D. degree from the Indian Statistical Institute and did his postdoctoral research at the Institut National de Recherche en Informatique et en Automatique (INRIA), France. In 1997, he joined as a Faculty Member with the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, where he is currently a Professor. Because of his significant impact on document analysis research for the Indian languages, the TC-10 and TC-11 committees of the International Association for Pattern Recognition (IAPR) presented the ICDAR Outstanding Young Researcher Award to Dr. Pal in 2003. He is a fellow of IAPR. He is an editorial board member for several journals, such as PR, PRL, IJDAR, and ACM Transactions on Asian Language Information Processing.

Tong Lu received the B.Sc. and M.Sc. degrees and the Ph.D. degree in computer science from Nanjing University, in 1993, 2002, and 2005, respectively. He is currently a Full Professor with Nanjing University. His current interests are in the areas of multimedia, computer vision, and pattern recognition algorithms/systems.
