Multi-Script-Oriented Text Detection and Recognition in Video/Scene/Born Digital Images
Abstract— Achieving good text detection and recognition results for multi-script-oriented images is a challenging task. First, we explore bit plane slicing in order to utilize the advantage of the most significant bit information to identify text components. A new iterative nearest neighbor symmetry is then proposed based on the shapes of convex and concave deficiencies of text components in bit planes to identify candidate planes. Further, we introduce a new concept called mutual nearest neighbor pair components, based on gradient direction, to identify representative pairs of texts in each candidate bit plane. The representative pairs are used to restore words with the help of the edge image of the input image, which results in text detection results (words). Second, we propose a new idea of fixing windows for character components of arbitrarily oriented words based on the angular relationship between sub-bands and a fused band. For each window, we extract features in the contourlet wavelet domain to detect characters with the help of an SVM classifier. Further, we propose to explore HMM for recognizing characters and words of any orientation using the same feature vector. The proposed method is evaluated on standard databases, namely ICDAR and YVT video data, ICDAR, SVT and MSRA scene data, ICDAR born digital data, and multi-lingual data, to show its superiority to the state-of-the-art methods.

Index Terms— Bit plane slicing, convex and concave deficiencies, wavelet sub-bands, arbitrarily-oriented text detection and recognition, hidden Markov model, multi-lingual text detection and recognition.

Manuscript received October 19, 2017; revised January 24, 2018; accepted March 18, 2018. Date of publication March 21, 2018; date of current version April 3, 2019. This work was supported in part by the Natural Science Foundation of China under Grant 61672273, Grant 61272218, and Grant 61321491, and in part by the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant BK20160021. This paper was recommended by Associate Editor M. Wang. (Corresponding author: Tong Lu.)

K. S. Raghunandan and G. H. Kumar are with the Department of Studies in Computer Science, University of Mysore, Karnataka 57005, India (e-mail: raghu0770@gmail.com; ghk.2007@yahoo.com).

P. Shivakumara and S. Roy are with the Faculty of Computer System and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia (e-mail: shiva@um.edu.my; 2sangheetaroy@gmail.com).

U. Pal is with the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: umapada@isical.ac.in).

T. Lu is with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: lutong@nju.edu.cn).

Digital Object Identifier 10.1109/TCSVT.2018.2817642

I. INTRODUCTION

With the explosive proliferation of multimedia content on broadcast and the Internet, the need for its ubiquitous access anytime and anywhere over a variety of devices also increases [1]–[3]. Therefore, one can expect huge databases, which consist of diversified data such as videos captured by low resolution mobile cameras, natural scene images captured by high resolution cameras, and images uploaded on webpages. Besides, these databases include texts of different types, namely caption texts, which are manually edited, and scene texts, where multiple scripts, orientations, fonts and sizes exist naturally. Such multi-type texts make the text detection and recognition problem much more complex and challenging. To understand such images, there is now increasing attention from researchers in the fields of computer vision and video processing [4], [5]. Among all the information contained in images, texts carry semantic information and can provide useful cues about image content; hence they are important for both humans and computers to understand images. It is evident from the statement in [1] that, given an image containing texts and other objects, viewers often tend to focus on the texts. This shows that text detection and recognition is important for humans to understand complex images. Furthermore, text detection and recognition is indispensable for many real time applications such as automatic sign reading, language translation, navigation and surveillance [4], [5].

There are methods in the literature [4]–[9] that address the issue of text detection and recognition in video, natural scene and born digital images of different orientations, scripts, font sizes, etc. According to the literature review, most available methods focus on a particular data type and address a specific issue such as complex background, low contrast, multiple scripts or multiple orientations. As a result, the performance of such methods is poor for data affected by multiple adverse factors. The main causes of the above challenges are as follows: 1) frames captured by low resolution video cameras often suffer from low contrast and low resolution; 2) natural scene images captured by high resolution cameras provide high contrast but suffer from complex background, which leads to more false positives; and 3) born digital images from websites suffer from multiple fonts, sizes, colors, appearance variations, background complexity, etc., which affect the scaling, text alignment and geometrical shapes of character components. Therefore, text detection and recognition in different types of images is considered an open issue.

It is evident from the illustration presented in Fig. 1 that the existing method [8], which is the state-of-the-art method that explores fractals for text detection in mobile video scene images, produces false positives for video and natural scene images, and it also does not detect text properly in born digital images, as shown in Fig. 1(a). This is because the primary goal of this approach is to detect texts in mobile
video images, but not texts in multiple types of images. Therefore, it gives inconsistent results. When a text detection method does not detect texts properly, it directly affects binarization and recognition, as shown in Fig. 1(a), where one can see that the binarization output [9] contains non-text components and background noise for the video and scene images. For example, though the binarization method in [9], which explores a Bayesian classifier for recognizing texts in video and natural scene images, preserves character shapes, the publicly available Optical Character Recognizer (OCR) [10] fails to recognize the texts correctly due to the noise introduced by binarization and the non-text components given by the text detection method. However, for the texts in the born digital image, the OCR gives correct results because the text detection and binarization methods work well.

Fig. 1. Text detection and recognition results of the existing and the proposed approaches for video, natural scene and born digital images. (a) Existing text detection and recognition results by Tesseract OCR. (b) Proposed text detection and recognition for the video, natural scene and born digital images.

On the other hand, the proposed method detects texts in all the three types of images properly and correctly recognizes the texts without binarization, as shown in Fig. 1(b). Therefore, we can conclude that the existing methods are not adequate to handle the challenges posed by multi-type images. Hence, we propose a new method to fill this gap in this work.

To address the above mentioned challenges, we propose to explore the Most Significant Bit (MSB), which carries vital information, using bit plane slicing of the input images. This is because bit information is unlikely to be lost regardless of adverse situations. However, one can expect location misplacements of bits, which may change foreground or background colors in an image. Therefore, we propose to use the Canny edge image of each plane, as the Canny edge detector gives fine edge details under low contrast irrespective of foreground-background color changes. For each text component in the Canny edge image of a plane, the proposed method introduces Iterative Nearest Neighbor Symmetry (INNS), based on the shapes given by convex/concave deficiencies, to detect a candidate plane out of the 8 planes. INNS then extracts shape and self-symmetry based features, which are invariant to font, font size, orientation and script. For each component in the candidate plane, we further propose the Mutual Nearest Neighbor Pair (MNNP), which uses the outward gradient direction to find the nearest neighbor component in text lines. Since MNNP assumes uniform spacing between characters and words, it finds component pairs irrespective of orientation, script, font size and font. In addition, the same criterion is used for eliminating false positives produced by background complexity. Further, since MNNP uses the Canny edge image of the input image, lost character components can be restored easily during component pairing. Therefore, the rationale behind the proposed method is to explore the shape, symmetry, structure and direction of text components, which have the ability to tackle the above mentioned challenges posed by multi-type images.
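To make the pairing criterion concrete, the following minimal sketch (in Python, with illustrative names) implements only the symmetric part of MNNP: two components form a pair when each is the other's nearest neighbor among component centroids. The outward gradient direction test of the full method is omitted here.

import numpy as np

def mutual_pairs(centroids):
    # Return index pairs (i, j) where components i and j are each other's
    # nearest neighbor; such mutual pairs are kept as text representatives.
    pts = np.asarray(centroids, dtype=float)              # (n, 2) component centroids
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                           # ignore self-distance
    nn = d.argmin(axis=1)                                 # nearest neighbor of each component
    return [(i, int(j)) for i, j in enumerate(nn) if i < j and nn[j] == i]

# Example: two adjacent characters pair up; the far-away blob stays unpaired.
print(mutual_pairs([(0, 0), (10, 1), (21, 2), (90, 50)]))   # -> [(0, 1)]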
Similarly, the main issue with recognition is defining the window size for arbitrarily-oriented characters in multi-type images. To overcome this issue, we propose automatic window size detection based on the fact that the directions of most pixels contribute towards character height, which helps us to fix correct windows according to the sizes and orientations of characters. Further, the integration of the strengths of different types of features, namely statistical features, which extract geometrical properties, texture features, which extract appearance properties, run-length smearing, which extracts the intra- and inter-symmetry of character components, and the contourlet wavelet domain, which is invariant to scaling, multi-fonts and multi-sizes, helps us to achieve better results for text in multi-type images. Overall, the text detection and recognition steps are proposed based on robust and invariant features, and hence the proposed method is generic. The contribution and novelty lie in exploring the above basic concepts for addressing the open challenges of text detection and recognition without rigid constraints to achieve better results.
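As an illustration of the per-window classification step, the sketch below extracts simple sub-band statistics and feeds them to an SVM [51]. A standard wavelet transform (PyWavelets) stands in for the contourlet wavelet domain, and the four statistics per sub-band are illustrative, not the paper's exact feature set.

import numpy as np
import pywt
from sklearn.svm import SVC

def window_features(win):
    # Mean, standard deviation, energy and energy-entropy of each detail
    # sub-band of one character window (db1 stands in for the contourlet).
    _, (cH, cV, cD) = pywt.dwt2(win.astype(float), "db1")
    feats = []
    for band in (cH, cV, cD):
        e = band ** 2
        p = e / (e.sum() + 1e-9)
        feats += [band.mean(), band.std(), e.sum(),
                  float(-(p * np.log2(p + 1e-12)).sum())]
    return np.array(feats)

# Training and use, given labelled windows (y = 1 for character, 0 for non-text):
# X = np.stack([window_features(w) for w in windows])
# clf = SVC(kernel="rbf").fit(X, y)
# is_char = clf.predict(window_features(candidate)[None, :])[0]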
II. RELATED WORK

Text detection and recognition can be divided broadly into three categories: video, natural scene and born digital images. Therefore, this section reviews the past works on text detection in video, natural scene and born digital images, and on text recognition in the respective categories.

The methods for text detection in video images can be further classified into connected component based methods, texture based methods, and edge-gradient based methods. Connected component based methods [4], [5] expect character shapes to be preserved. These methods generally focus on images of high contrast texts with plain background. However, this constraint does not necessarily hold for images of different types and scripts, where one can expect images with large variations in contrast and background complexity, which may
cause the loss of information. Therefore, the methods may not work well for different types and scripts. To overcome the problems of connected component based methods, texture based methods have been proposed in the literature [7], [8], [11], [12], which require high contrast images. These methods consider the appearance of character components as a special kind of texture property. To extract such texture properties and separate text components from complex background, the methods propose features with a large number of training samples. However, the extracted features are sensitive to low contrast or low resolution. Apart from that, since the methods are trained with pre-defined labels, they may not perform well for the images of different scripts considered in this work. In addition, the methods are too expensive for real time applications. To ease the computational burden of texture based methods, edge-gradient based methods have been developed, which generally focus on high gradients that represent text, and on edge pixels, which give vital clues for the presence of texts [6], [13], [14]. However, these features are sensitive to complex background, where edges in the background may overlap with text edges. As a result, the methods produce more false positives, and hence performance suffers, especially for multi-type images.

In summary, the methods in [15]–[20] focus on high contrast images for text detection, and therefore their performance degrades for low contrast images. The methods in [20]–[22] focus on low contrast images for text detection, and thus report inconsistent results for high contrast images. In the same way, the methods that focus on plain background images such as born digital images, which may suffer from very low resolution, multi-fonts, multi-sizes and multi-colors compared to video and natural scene images, report poor results for complex background images. Recently, there have been methods which explore convolutional neural networks and deep learning to overcome the problem of text detection in scene and video images [23]–[28]. For instance, Liu and Jin [29] proposed a deep matching prior network for multi-oriented text detection. The method focuses on fixing tight bounding boxes for multi-oriented texts to prevent background noise, such that text detection performance improves significantly. Tian et al. [30] proposed scene text detection based on weak supervision. The method focuses on weakly annotated data to reduce the network's dependency on a large number of pre-defined labeled data. Although these methods solve complex issues involving multi-fonts, sizes, orientations, scripts and low contrast, they still suffer from poor character candidate detection. In addition, it is hard to optimize parameters based on pre-labeled samples. This is because the problem considered in this work involves large variations in terms of contrast, background, and foreground complexity. As a result, it is difficult to find a large number of pre-defined samples to train a classifier and represent such variations, especially samples representing non-text components. Similarly, the methods in [8], [31], and [32] are proposed for detecting multi-oriented and multi-script texts in video images without depending much on learning. However, these methods report inconsistent results for multi-type images. In summary, it is noted from the literature review on text detection in video, natural scene and born digital images that none of the methods tackles the issues of multi-type images, where robust and invariant features with generic properties are required.

When we look at the literature on text recognition in video, natural scene and born digital images, it is found that there are methods which use binarization for recognition, or their own classifiers for recognition. The methods in [9] and [33]–[35], which recognize texts through binarization, require complete character shapes to achieve better recognition rates. Moreover, most of these methods propose thresholding based criteria for binarization. For the images of different contrasts and background complexities considered in this work, the binarization process may not preserve characters; rather, it loses shapes. Therefore, the methods may not perform well for the considered images. To reduce the complexity of the problem, the methods in [36]–[38] were proposed for recognition without a binarization process. These methods generally extract a large number of features using well known descriptors, namely SIFT, HOG, or combinations of several descriptors, and then explore classifiers or lexicons for better recognition. As a result, the performance of these methods depends much on the datasets and samples. In addition, the features extracted based on descriptors work well for high contrast images. Therefore, the methods may not perform well for multi-type images. Recently, to improve the recognition rates for video and natural scene images, there are methods that explore convolutional networks and deep learning [39]–[43]. For instance, Shi et al. [44] proposed an end-to-end trainable neural network for image based sequence recognition and its application to scene text recognition. The method explores a convolutional recurrent neural network for text recognition. Jain et al. [45] proposed unconstrained scene text and video text recognition for Arabic script. The method focuses on a specific Arabic script for achieving results. It is noted from the above discussions that the primary goal of these methods is to recognize a specific script of different texts, but not multi-script recognition. In addition, for arbitrarily-oriented texts, the methods fail to fix the window size for characters to extract features, which leads to poor performance. As variations in the dataset increase, the difficulty of determining optimal parameters for the deep learning setting also increases.

Overall, from the review on text detection and recognition in video, natural scene and born digital images, it is observed that the methods are successful for the specific data type on which they were developed. It is worth mentioning that none of the methods considers more than two types of data for text detection and recognition. Besides, text detection and recognition of multi-scripts, especially for Indian scripts, is still at the infancy stage. Therefore, in this work, we propose a novel method for text detection and recognition in video, natural scene and born digital images irrespective of orientation and script.
III. PROPOSED METHOD

Inspired by the enhancement concept presented in [46] and [47], where it is mentioned that the Most Significant Bit (MSB) carries significant information and the Least Significant Bit (LSB) carries less significant information,
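For concreteness, bit plane slicing of an 8-bit grayscale image and the per-plane Canny step described above can be sketched as follows; the file name and Canny thresholds are illustrative assumptions, not the paper's settings.

import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)       # hypothetical input frame
planes = [((img >> k) & 1).astype(np.uint8) * 255         # bit plane k as a 0/255 image
          for k in range(8)]                              # planes[7] is the MSB plane
edges = [cv2.Canny(p, 100, 200) for p in planes]          # fine edges for each plane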
Fig. 12. Automatic window fixing for non-horizontal and curved text: (b) is the last result of the iterative algorithm for the Horizontal (H) and Fused (F) sub-bands.
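The iterative rule summarized in the Fig. 12 caption can be sketched as follows, assuming the window grows until the dominant gradient angle of the Horizontal sub-band agrees with that of the Fused band; the fusion rule (mean of the absolute detail sub-bands), the 5-degree tolerance and the growth step are all illustrative assumptions rather than the paper's exact algorithm.

import numpy as np
import pywt

def dominant_angle(band):
    # Magnitude-weighted mode of gradient orientations within a sub-band patch.
    gy, gx = np.gradient(band.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, bin_edges = np.histogram(ang, bins=36, range=(0, 180), weights=mag)
    return bin_edges[hist.argmax()] + 2.5                 # centre of the 5-degree peak bin

def fix_window(gray, x, y, w0=8, h0=8, tol=5.0, step=2, max_iter=20):
    # Grow a window anchored at (x, y) until the H and Fused sub-band angles agree.
    h, w = h0, w0
    for _ in range(max_iter):
        patch = gray[y:y + h, x:x + w]
        _, (cH, cV, cD) = pywt.dwt2(patch.astype(float), "db1")
        fused = (np.abs(cH) + np.abs(cV) + np.abs(cD)) / 3.0
        if abs(dominant_angle(cH) - dominant_angle(fused)) <= tol:
            break                                         # angles agree: window fits the character
        h, w = h + step, w + step                         # otherwise grow and re-test
    return x, y, w, h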
TABLE I. Expected challenges of text detection in video, natural scene and born digital images.

TABLE II. Performances of the proposed and existing methods on the ICDAR 2015 video dataset.
TABLE VI. Performances of the proposed and existing methods on MSRA scene data.

Fig. 17. Examples of text detection results of the proposed method on the ICDAR 2013, SVT and MSRA natural scene datasets.

TABLE IV. Performances of the proposed and existing methods on the ICDAR 2013 scene dataset.
TABLE IX. Recognition rates of the proposed and existing approaches on different datasets at word and character levels (in %). W and C indicate word and character recognition rates, respectively.

TABLE VII. Performances of the proposed and existing methods on the ICDAR 2011 born digital dataset.
because the proposed features are invariant to scripts. Due to greater cursiveness and low resolution, the existing methods report poor results. However, Yin et al.'s method scores higher precision compared to the proposed and the other existing methods. This is because that approach has the ability to detect multi-script text. However, since it depends much on classifiers and training, it fails to score the best recall and F-measure compared to the proposed method. On the other hand, the proposed method does not depend on classifiers and is the best in F-measure.
TABLE X. Recognition rates of the proposed and existing approaches on South Indian datasets at word and character levels (in %). W and C indicate word and character recognition rates, respectively.

TABLE XI. Average processing time of the proposed method for recognition on different databases (in seconds).
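The HMM-based recognition whose rates the preceding tables report can be sketched as follows, with hmmlearn standing in for the HTK toolkit [55] used in the paper: one Gaussian HMM per character class, trained on sequences of window-level feature vectors and scored by log-likelihood. The number of states and the data layout are illustrative assumptions.

import numpy as np
from hmmlearn import hmm

def train_models(train_data, n_states=5):
    # train_data: {label: list of (T_i, d) feature sequences} -> fitted HMMs.
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)                       # stack sequences row-wise
        lengths = [len(s) for s in seqs]          # per-sequence lengths for hmmlearn
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        models[label] = m.fit(X, lengths)
    return models

def recognize(models, seq):
    # Label of the model with the highest log-likelihood for the test sequence.
    return max(models, key=lambda lab: models[lab].score(seq))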
process requires more computations. Table XI shows that the proposed method takes, on average, more processing time for the video data and the MSRA data than for the other databases. This is expected, because video involves processing temporal frames and MSRA involves arbitrarily-oriented text, both of which require more computation than horizontal text. Overall, the proposed method consumes a few seconds per image to recognize the text in it. This is due to the MATLAB implementation. It is also noted that the processing time depends on many factors, such as the data structures of the algorithm, the system configuration and the platform. Since our target is to develop a prototype, we plan to convert the whole MATLAB code to VC++ and make the algorithm efficient with the help of cloud computing in the future, such that the system can work for real time applications. Since the main aim of the proposed work is to develop a generic method for recognizing text irrespective of orientation, contrast variations, scripts, etc., prototype or working model development is considered beyond the scope of this work.

Fig. 19. Limitations of the proposed text detection and recognition methods. (a) Text detection. (b) Recognition: "Gg", "IICC", "eeo".

When an image contains very small fonts and is of poor quality, as shown in Fig. 19(a), the proposed text detection step does not perform well due to the loss of components at the MNNP stage. Similarly, for an image of poor quality, as shown in Fig. 19(b), even the naked eye fails to read the texts. For such images, the recognition step fails to recognize the texts correctly. The main reason is that the method loses character structure while fixing an automatic window for each character and extracting features. Therefore, there is scope for future work.

V. CONCLUSION AND FUTURE WORK

In this work, we have proposed a new method which can cope with the challenges of text detection and recognition in a multi-image environment, namely video, natural scene and born digital images. We have explored convex and concave deficiencies to identify a candidate plane from eight planes to represent significant information, by introducing a new concept called Iterative Nearest Neighbor Symmetry (INNS). Based on the outward gradient direction of components, we have proposed a new idea of Mutual Nearest Neighbor Pair (MNNP) component identification to identify the representatives of texts. For recognition, we have introduced a new idea of determining an automatic window according to character size based on the angular relationship between the fused and high frequency wavelet sub-bands. We have proposed the combination of statistical-texture and spatial information based features in the contourlet wavelet domain for recognition with the help of an HMM model. However, it is noticed from the

ACKNOWLEDGMENT

The authors would like to thank Pooja G., Navya, Gowrishankar Pillai, Mayur S. and Deepa Shree for their help in creating ground truth for South Indian scripts. They would also like to thank Wang Zhen for shaping the algorithms.

REFERENCES

[1] C.-Z. Shi, C.-H. Wang, B.-H. Xiao, S. Gao, and J.-L. Hu, "Scene text recognition using structure-guided character detection and linguistic knowledge," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 7, pp. 1235–1250, Jul. 2014.
[2] D. Tao, J. Cheng, X. Gao, X. Li, and C. Deng, "Robust sparse coding for mobile image labeling on the cloud," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 1, pp. 62–72, Jan. 2017.
[3] Y. Yang, C. Deng, D. Tao, S. Zhang, W. Liu, and X. Gao, "Latent max-margin multitask learning with skelets for 3-D action recognition," IEEE Trans. Cybern., vol. 47, no. 2, pp. 439–448, Feb. 2017.
[4] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1480–1500, Jul. 2015.
[5] X.-C. Yin, Z.-Y. Zuo, S. Tian, and C.-L. Liu, "Text detection, tracking and recognition in video: A comprehensive survey," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2752–2773, Jun. 2016.
[6] L. Wu, P. Shivakumara, T. Lu, and C. L. Tan, "A new technique for multi-oriented scene text line detection and tracking in video," IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1137–1152, Aug. 2015.
[7] G. Liang, P. Shivakumara, T. Lu, and C. L. Tan, "Multi-spectral fusion based approach for arbitrarily oriented scene text detection in video images," IEEE Trans. Image Process., vol. 24, no. 11, pp. 4488–4500, Nov. 2015.
[8] P. Shivakumara, L. Wu, T. Lu, C. L. Tan, M. Blumenstein, and B. S. Anami, "Fractals based multi-oriented text detection system for recognition in mobile video images," Pattern Recognit., vol. 68, pp. 158–174, Aug. 2017.
[9] S. Roy, P. Shivakumara, P. P. Roy, U. Pal, C. L. Tan, and T. Lu, "Bayesian classifier for multi-oriented video text recognition system," Expert Syst. Appl., vol. 42, no. 13, pp. 5554–5566, 2015.
[10] Tesseract. (2016). [Online]. Available: http://code.google.com/p/tesseract-ocr/
[11] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Process., vol. 9, no. 1, pp. 147–156, Jan. 2000.
[12] V. Khare, P. Shivakumara, and P. Raveendran, "A new histogram oriented moments descriptor for multi-oriented moving text detection in video," Expert Syst. Appl., vol. 42, no. 21, pp. 7627–7640, 2015.
[13] X. Zhao, K.-H. Lin, Y. Fu, Y. Hu, Y. Liu, and T. S. Huang, "Text from corners: A novel approach to detect text and caption in videos," IEEE Trans. Image Process., vol. 20, no. 3, pp. 790–799, Mar. 2011.
[14] A. Mosleh, N. Bouguila, and A. B. Hamza, "Automatic inpainting scheme for video text detection and removal," IEEE Trans. Image Process., vol. 22, no. 11, pp. 4460–4472, Nov. 2013.
[15] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. CVPR, Jun. 2010, pp. 2963–2970.
[16] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, "Robust text detection in natural scene images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 970–983, May 2014.
[17] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, "Multi-orientation scene text detection with adaptive clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1930–1937, Sep. 2015.
[18] X. Wang, Y. Song, Y. Zhang, and J. Xin, "Natural scene text detection with multi-layer segmentation and higher order conditional random field based analysis," Pattern Recognit. Lett., vols. 60–61, pp. 41–47, Aug. 2015.
[19] Z. Zhang, W. Shen, C. Yao, and X. Bai, "Symmetry-based text line detection in natural scenes," in Proc. CVPR, Jun. 2015, pp. 2558–2567.
[20] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, "Text flow: A unified text detection system in natural scene images," in Proc. ICCV, Dec. 2015, pp. 4651–4659.
[21] H. Yang, S. Wu, C. Deng, and W. Lin, "Scale and orientation invariant text segmentation for born-digital compound images," IEEE Trans. Cybern., vol. 45, no. 3, pp. 533–547, Mar. 2015.
[22] J. Xu, P. Shivakumara, T. Lu, C. L. Tan, and M. Blumenstein, "Text detection in born-digital images by mass estimation," in Proc. ACPR, Nov. 2015, pp. 690–694.
[23] T. He, W. Huang, Y. Qiao, and J. Yao, "Text-attentional convolutional neural network for scene text detection," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2529–2541, Jun. 2016.
[24] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "TextBoxes: A fast text detector with a single deep neural network," in Proc. AAAI, 2017, pp. 4161–4167.
[25] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, "Multi-oriented text detection with fully convolutional networks," in Proc. CVPR, Apr. 2016, pp. 4159–4167.
[26] H. Cho, M. Sung, and B. Jun, "Canny text detector: Fast and robust scene text localization algorithm," in Proc. CVPR, Jun. 2016, pp. 3566–3573.
[27] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic data for text localisation in natural images," in Proc. CVPR, Apr. 2016, pp. 2315–2324.
[28] L. Gomez and D. Karatzas, "TextProposals: A text-specific selective search algorithm for word spotting in the wild," Pattern Recognit., vol. 70, pp. 60–74, Oct. 2017.
[29] Y. Liu and L. Jin, "Deep matching prior network: Toward tighter multi-oriented text detection," in Proc. ICCV, Mar. 2017, pp. 3454–3461.
[30] S. Tian, S. Lu, and C. Li, "WeText: Scene text detection under weak supervision," in Proc. ICCV, Oct. 2017, pp. 1501–1509.
[31] A. Mittal, P. P. Roy, P. Singh, and B. Raman, "Rotation and script independent text detection from video frames using sub pixel mapping," J. Vis. Commun. Image Represent., vol. 46, pp. 187–198, Jul. 2017.
[32] S. Dey et al., "Script independent approach for multi-oriented text detection in scene image," Neurocomputing, vol. 242, pp. 96–112, Jun. 2017.
[33] B. Su, S. Lu, and C. L. Tan, "Robust document image binarization technique for degraded document images," IEEE Trans. Image Process., vol. 22, no. 4, pp. 1408–1417, Apr. 2013.
[34] N. R. Howe, "A Laplacian energy for document binarization," in Proc. ICDAR, Sep. 2011, pp. 6–10.
[35] S. Milyaev, O. Barinova, T. Novikova, P. Kohli, and V. Lempitsky, "Image binarization for end-to-end text understanding in natural images," in Proc. ICDAR, Aug. 2013, pp. 128–132.
[36] S. Roy, P. P. Roy, P. Shivakumara, G. Louloudis, and C. L. Tan, "HMM-based multi oriented text recognition in natural scene image," in Proc. ACPR, Nov. 2013, pp. 288–292.
[37] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, "Recognizing text with perspective distortion in natural scenes," in Proc. ICCV, Dec. 2013, pp. 569–576.
[38] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu, "Region-based discriminative feature pooling for scene text recognition," in Proc. CVPR, Jun. 2014, pp. 4050–4057.
[39] C.-Y. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," in Proc. CVPR, Mar. 2016, pp. 2231–2239.
[40] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, "Robust scene text recognition with automatic rectification," in Proc. CVPR, Mar. 2016, pp. 4168–4176.
[41] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," Int. J. Comput. Vis., vol. 116, no. 1, pp. 1–20, 2016.
[42] S. Yousfi, S. A. Berrani, and C. Garcia, "Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos," Pattern Recognit., vol. 64, pp. 245–251, Apr. 2017.
[43] S. J. Lee and S. W. Kim, "Recognition of slab identification numbers using a deep convolutional neural network," in Proc. ICMLA, Dec. 2016, pp. 718–721.
[44] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[45] M. Jain, M. Mathew, and C. V. Jawahar, "Unconstrained scene text and video text recognition for Arabic script," in Proc. ASAR, Apr. 2017, pp. 26–30.
[46] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. New Delhi, India: Pearson, 2002.
[47] S. Sudhakaran and A. P. James, "Sparse distributed localized gradient fused features of objects," Pattern Recognit., vol. 48, no. 4, pp. 1538–1546, 2015.
[48] Z. Long and N. H. Younan, "Multiscale texture segmentation via a contourlet contextual hidden Markov model," Digit. Signal Process., vol. 23, no. 3, pp. 859–869, 2013.
[49] A. E. Yacoubi, M. Gilloux, R. Sabourin, and C. Y. Suen, "An HMM-based approach for off-line unconstrained handwritten word modeling and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 8, pp. 752–760, Aug. 1999.
[50] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, "Robust text detection in natural images with edge-enhanced maximally stable extremal regions," in Proc. ICIP, Sep. 2011, pp. 2609–2612.
[51] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[52] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proc. ICCV, Nov. 2011, pp. 1457–1464.
[53] A. Mishra, K. Alahari, and C. V. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Proc. CVPR, Jun. 2012, pp. 2687–2694.
[54] P. Agrawal, M. Vatsa, and R. Singh, "Saliency based mass detection from screening mammograms," Signal Process., vol. 99, pp. 29–47, Jun. 2014.
[55] S. J. Young, J. Jansen, J. J. Odell, D. Ollason, and P. C. Woodland, "The HTK hidden Markov model toolkit book," Entropic Cambridge Res. Lab., Cambridge, U.K., Tech. Rep., 1995. [Online]. Available: http://htk.eng.cam.ac.uk/
[56] D. Karatzas et al., "ICDAR 2015 competition on robust reading," in Proc. ICDAR, Aug. 2015, pp. 1156–1160.
[57] P. X. Nguyen, K. Wang, and S. Belongie, "Video text detection and recognition: Dataset and benchmark," in Proc. WACV, Mar. 2014, pp. 776–783.
[58] D. Karatzas et al., "ICDAR 2013 robust reading competition," in Proc. ICDAR, Aug. 2013, pp. 1115–1124.
[59] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. CVPR, Jun. 2012, pp. 1083–1090.
[60] D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy, "ICDAR 2011 robust reading competition—Challenge 1: Reading text in born-digital images (Web and Email)," in Proc. ICDAR, Sep. 2011, pp. 1485–1490.

K. S. Raghunandan received the master's degree from University of Mysore in 2013, where he is currently pursuing the Ph.D. degree. His research interests include image processing, pattern recognition, and video understanding.

Palaiahnakote Shivakumara received the B.Sc., M.Sc., M.Sc. (Tech.) by research, and Ph.D. degrees in computer science from University of Mysore, Karnataka, India, in 1995, 1999, 2001, and 2005, respectively. He was with the Department of Computer Science, School of Computing, National University of Singapore, from 2008 to 2013, as a Research Fellow on a video text extraction and recognition project. He is currently a Senior Lecturer with the Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. He has published over 190 papers in conferences and journals. His research interests are in the areas of image processing and video text analysis. He was a recipient of the prestigious Dynamic Indian of the Millennium award by KG Foundation, India. He has been an Associate Editor for ACM Transactions on Asian and Low-Resource Language Information Processing.

Sangheeta Roy is currently pursuing the Ph.D. degree with University of Malaya, Malaysia. Her areas of interest include image processing, pattern recognition, and video text understanding.

G. Hemantha Kumar received the B.Sc., M.Sc., and Ph.D. degrees from University of Mysore. He is currently a Professor with the Department of Studies in Computer Science, University of Mysore, Mysore. He has published over 200 papers in journals, edited books, and refereed conferences. His current research interests include numerical techniques, digital image processing, pattern recognition, and multimodal biometrics.

Umapada Pal (SM'15) received the Ph.D. degree from the Indian Statistical Institute and did his postdoctoral research at the Institut National de Recherche en Informatique et en Automatique (INRIA), France. In 1997, he became a Faculty Member with the Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, where he is currently a Professor. Because of his significant impact in the document analysis research domain of the Indian languages, the TC-10 and TC-11 committees of the International Association for Pattern Recognition (IAPR) presented the ICDAR Outstanding Young Researcher Award to Dr. Pal in 2003. He is a fellow of IAPR. He is an editorial board member for several journals, such as PR, PRL, IJDAR, and ACM Transactions on Asian Language Information Processing.

Tong Lu received the B.Sc. and M.Sc. degrees and the Ph.D. degree in computer science from Nanjing University, in 1993, 2002, and 2005, respectively. He is currently a Full Professor with Nanjing University. His current interests are in the areas of multimedia, computer vision, and pattern recognition algorithms/systems.