Deep Learning Approaches To Scene Text Detection A

Artificial Intelligence Review
https://doi.org/10.1007/s10462-020-09930-6
Deep learning approaches to scene text detection:

a comprehensive review
Tauseef Khan1 · Ram Sarkar2 · Ayatullah Faruk Mollah1
© Springer Nature B.V. 2021
Abstract
In recent times, text detection in the wild has significantly raised its ability due to tremen-
dous success of deep learning models. Applications of computer vision have emerged and
got reshaped in a new way in this booming era of deep learning. In the last decade, research
community has witnessed drastic changes in the area of text detection from natural scene
images in terms of approach, coverage and performance due to huge advancement of deep
neural network based models. In this paper, we present (1) a comprehensive review of
deep learning approaches towards scene text detection, (2) suitable deep frameworks for
this task followed by critical analysis, (3) a categorical study of publicly available scene
image datasets and applicable standard evaluation protocols with their pros and cons, and
(4) comparative results and analysis of reported methods. Moreover, based on this review
and analysis, we precisely mention possible future scopes and thrust areas of deep learning
approaches towards text detection from natural scene images on which upcoming research-
ers may focus.
Keywords Text detection · Deep learning · Scene image · End-to-end text reading · Review
of methods
1 Introduction
In real-world, human beings are surrounded by instances of text objects in imagery

that bear important semantic information in practical scenario. In the past few decades,
researchers have started paying attention on developing several methods for text detection
in the wild (natural scene images). Although, reading text from scene images has been
a prime focus to the researchers for a long time, detecting text from such real-life clutter
images is proven to be one of the hot research topics owing to its diversified practical appli-
cations. Nowadays, due to increasing demand for solution of many computer vision prob-
lems, text detection from natural images has emerged as an active research domain in the
* Ayatullah Faruk Mollah

[email protected]
1
Department of Computer Science and Engineering, Aliah University, IIA/27 New Town,
Kolkata 700160, India
2
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
13
Vol.:(0123456789)
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
T. Khan et al.
community. As soon as it was possible to read the scanned images by an electronic device,
researchers started developing automated text detection and recognition systems from such
images. Due to advancement of computer vision and image processing techniques, it is
now more feasible to encounter text detection from complex scene images. Many research-
ers have developed automated optical character recognition (OCR) techniques to read text
from imagery (Maitra et al. 2015; Majhi and Pujari 2018; Manjusha et al. 2018). Although,
OCR systems work reasonably well for well-formatted document images with high detec-
tion accuracy, it fails to detect text from real-world scene images.
Several motives exist behind the booming interest of scene text detection among
research community. (1) Text under scene environment often carries important semantic
information. Scene texts embedded with rich and precise information may be considered to
be a beneficial entity to understand the world surrounded by us. (2) Snowballing usage of
mobile phones with built-in high-resolution camera and computational capability undoubt-
edly creates opportunity for ease of image acquisition and subsequent processing under
uncontrollable circumstances. (3) Huge advancement of various image processing and pat-
tern recognition tools and technologies makes it more attainable to address several chal-
lenges associated with scene text detection. On the other hand, text detection in the wild is
very challenging in comparison with text detection from well-formatted documents. Spot-
ting the locations of texts in scene images is complicated as texts are present in a scattered
way and no a priori information is available. Diversity of texts in the wild is enormous and
in some cases appearance of texts is close to background and other objects, which makes it
quite difficult to segment. Thus, detecting and localizing text from such uncontrolled envi-
ronments is yet a major challenge despite years of rigorous research.
1.1 Related applications
Detecting as well as reading texts in scene images plays a vital role in several real-life
applications. There are numerous applications with texts found in scene images and video
frames, such as indexing of multimedia resources, license plate recognition for vehicle
identification, traffic sign detection, text-to-speech conversion for visually impaired per-
sons, etc. Besides, it opens up many arenas of applications as it enables a machine to have
the ability of reading the surrounding environment in terms of imbibed texts therein. Text
based interaction with real world by artificial intelligence is also going to be a reality in the
coming future. Few of the related applications are highlighted below:
Information retrieval from multimedia. Text embedded in images signifies important

semantic information. So, detecting as well as recognizing texts and keywords from
multimedia resources certainly improves the overall retrieval of semantic information
from media resources.
Pursuance of video. Automatic text detection from video frames enables video analysis
and summarization. In addition to that, efficient extraction of textual information pre-
sent in frames enables effortless video searching, indexing and retrieval.
Familiarity in non-native environment. Multi-lingual scene text detection and recogni-
tion systems reduce language barriers to a great extent through automatic translation.
People may comprehend texts written in any language with the help of such systems.
Smart-phone enabled applications of this kind is really useful particularly in non-native
environment.
13
Deep learning approaches to scene text detection: a comprehensive…
Text synthesis for visually impaired. Conversion of recognized texts from scene images
into speech using different synthesizers enables visually impaired persons to understand
and realize the world around them. Text-to-speech converters or smart-phone apps may
assist those persons in their daily life.
Robotic navigation. Automated sign detection and recognition from scenes enables nav-
igation in many fields including automatic traffic control system.
Automatic license plate recognition. Automated license plate recognition identifies
vehicles in real-time, which may enable parking automation, highway tolling manage-
ment, access control in restricted areas and so on.
Automated annotation in industrial applications. Text detection and recognition from
containers, houses and roads can directly be related to improvement of automated indus-
trial operations. Detecting texts from bottles or containers may certainly improve logis-
tics industry. Reading texts from envelops helps to sort the letters in postal department.
Moreover, house number detection may make geo-satellite systems more intelligent.
1.2 Associated challenges
Scene texts are often affected from uneven illumination, noise, complex backgrounds, pres-
ence of multi-lingual texts, which poses more challenges unlike well-formatted, clean,
flatbed document image analysis. These combined challenges are the main hindrances to
researchers for developing a comprehensive scene text detection method. In complex scene
environments, inflexible image acquisition along with diversity of scene texts throws vari-
ety of challenges for accurate scene text localization. Most of the challenge’s researchers
faced during text detection in the wild are as outlined below:
(1) Diverse nature of text in the wild
Text in scene images, exhibits more diversity and variability in terms of style, layout, ori-
entation and other factors. Moreover, natural scene images comprise of multiple-scripts’
text posing more complexity in text detection task due to irregular pattern.
(2) Complex imagery background
Backgrounds of scene images are often very complex in nature and at the same time unpre-
dictable to localize the texts in the images. Sometimes, scene texts appeared as miscellany
objects under unconstrained environments may require sophisticated and expensive seg-
mentation techniques for appropriate text localization. Also, objects closely resembled as
texts and occlusion due to noise, uneven lighting effects may lead to confusion and errone-
ous detection.
(3) Improper image acquisition
Image acquisition in the wild cannot be flexible due the uncontrollable situations and, thus
it may not guarantee the quality of scene imagery and text instances. Inappropriate image
capturing, distance of lens from the target, shooting angle sometimes lead to low-resolution
and blurred text instances in the acquired image.
(4) Text-specific algorithms
13
T. Khan et al.
In recent times, scene text detection methods are becoming more target specific in terms of
orientation and appearance. Some newly released datasets are devoted to specific type of
texts like multi-oriented, curved texts, blurred texts etc. that motivate researchers to design
text specific models rather than developing robust text detectors in all aspects.
1.3 Pre‑deep learning era
Looking forward to deal with all the associated challenges, researchers have devised dif-
ferent methods to detect texts from scene images. Before the advent of deep learning in
this domain, progress of scene text detection was largely confined to designing handcrafted
low-level features (Ojala et al. 1996; Liu et al. 2016b, 2017; Khan et al. 2017; Fogel and
Sagi 1989; Mallat 1989; da Silveira et al. 2017; Liang et al. 2015; Gllavata et al. 2004;
Shivakumara et al. 2019) of the texts present in images. Traditional text detection methods
mainly rely on discriminating features of text areas within an image. Scene text detection
methods, in the pre deep learning era may be broadly divided into (a) connected compo-
nent (CC) based methods (Bagri and Johari 2015; Greenhalgh and Mirmehdi 2012; Tian
et al. 2016b; Koo and Kim 2013; Mosleh et al. 2012; Khan and Mollah 2019a; b; Paul et al.
2019; Shi et al. 2013; Neumann and Matas 2012; Cho et al. 2016; Jiang et al. 2018), and
(b) sliding-window based methods (Pan et al. 2010; Lee et al. 2011a; Coates et al. 2011;
Wang et al. 2011, 2012).
In CC-based methods, foreground candidate components are extracted using different
segmentation techniques (e.g. stable intensity regions, color clustering) followed by prun-
ing out non-text components using handcrafted low-level features (stroke-width, edge gra-
dient, texture, etc.). Epshtein et al. (2010) have introduced a novel text detector for scene
images using stroke width transform (SWT) and text-candidate extraction by analyzing
pixels of similar strokes. Recently, few works have been proposed on distance transform
based stroke descriptors for foreground component classification and scene text detection.
Khan and Mollah (2019a) have proposed a set of novel feature descriptors using potential
stroke pixels from distance transform map and fed into classifiers for component level clas-
sification. Khan and Mollah (2019b) have also designed a medial skeleton based stroke
feature descriptor for text/non-text classification. Paul et al. (2019) applied fuzzy distance
transform and adaptive stroke filters to localize text components from camera captured nat-
ural images. Maximally stable extremal regions (MSER), a popular region detector, finds
extremal regions within an image that are maximally stable with wide range of threshold
values. Neumann and Matas (2012) have applied MSER to extract stable regions, identify
the characters and group them to form text lines. MSER is found to demonstrate robustness
in identifying text candidate regions in diverse environment. Generally speaking, CC based
methods usually detect a large number of non-text components along with texts. Therefore,
removal of false-positives using rigorous post-processing plays a crucial role for efficient
text detection.
In sliding-window based technique, multi-scale windows are convolved over given
image and multiple text blocks are obtained. Candidate text blocks are then classified
using different classifiers driven by handcrafted features and finally, classified text blocks
are aggregated for word level or text-line level detection. Pan et al. (2010) have developed
a text detection method by finding text-oriented confidence maps and employed a condi-
tional random field (CRF) based technique to aggregate candidate blocks and subsequently
filtered out non-text blocks. Lee et al. (2011a) have designed a multi-scale text detection
13
approach, where extracted text blocks were further classified with AdaBoost using some
handcrafted features.
These types of methods (Epshtein et al. 2010; Li and Lu 2012; Huang et al. 2013; Neu-
mann and Matas 2010; Wang et al. 2011, 2018; Shi et al. 2013; Yao et al. 2012; Chen et al.
2011; Shivakumara et al. 2010; Sain et al. 2018; Francis and Sreenath 2017; Huang 2019;
Chen and Yuille 2004; Lee et al. 2011b; Mishra et al. 2012; Zamberletti et al. 2014; Yi
and Tian 2012; Bai et al. 2013; Lee et al. 2010; Mollah et al. 2012) are mostly driven by
computationally expensive low-level handcrafted features, where designing such features
is itself a challenging task in natural scene environments. These methods also involve tedi-
ous pre-processing and post-processing steps resulting in reduced robustness in detecting
texts in natural imagery. Traditional methods built with multiple steps may tend to incur
erroneous results. It may be stated that low-level hand-crafted features are highly sensitive
to noise, illumination, multi-orientation of texts and other clutters.
1.4 Advent of deep learning
In recent years, deep learning driven models for text detection have proven to be an emerg-
ing and significant move towards computer vision applications (Zhang et al. 2016; Liu and
Jin 2017; Ma et al. 2018; Xu et al. 2019a; Liu et al. 2016a; Liao et al. 2017; 2018a; Zhong
et al. 2016; Tian et al. 2016a; Shi et al. 2017a; Saha et al. 2020). Deep networks consist of
many hidden layers which automatically extract several high-level features from an input
image and generate the desired result. There are several deep neural networks (DNNs)
employed in text detection, but the most popular among them is convolutional neural net-
work (CNN). CNN consists of several layers with fixed size filters which transform the
input image into a feature map with some auto-generated high-level features. The first CNN
model LeNet (LeCun et al. 1998) was presented for handwritten digit recognition problem.
Later, several deep learning frameworks have evolved with various CNN architectures for
scene text detection, which has significantly increased overall performance. But, nowadays,
with increasing demand for computer vision applications, some advanced DNNs are found
in literature (Simonyan and Zisserman 2014; Redmon et al. 2016). In 2012, an advanced
CNN model called AlexNet (Krizhevsky et al. 2012) won the ImageNet large dataset [over
15 + million labeled high-resolution images] challenges for visual recognition.
Nearly, most of the recent approaches are driven by deep network models. Most impor-
tantly, deep learning approaches make researchers free from exhausting task of designing
and testing low-level handcrafted features, which unquestionably gives an elevation to a
bloom of works and pushes the wheel further. Generally speaking, deep learning methods
simplify the overall pipeline of text detection by reducing the number of laborious steps
and time-complexity. Moreover, deep learning based methods harvest significant improve-
ment over traditional approaches on public datasets.
1.5 Motivation and contribution
Incorporation of deep learning based algorithms in text detection has started building its
reputation rapidly after increasing demand for real-world applications. The number of
works involving deep learning significantly increased after 2012. A few survey articles
on scene text detection and recognition are found in literature (Lin et al. 2019; Liu et al.
2019d; Joan and Valli 2019; Zhu et al. 2016; Long et al. 2018b). However, reviews on the
same, focused on deep learning approaches, is a crucial need. Though, few recent articles
13
T. Khan et al.
have attempted to fulfil this need, focused and insightful trend-setting review with pro-
spective future directions is still missing. Firstly, majority of conventional reviews pre-
sent by focusing on general or multi-typed scene text images. How deep learning methods
are plunged into different text-types in scene environment and dealt with various com-
plexities are yet to be figured out. Secondly, in order to encourage competitions on deep
learning based text detection approaches, datasets suitable for deep networks should be
identified as substantially large data are usually required for any deep network. Thirdly,
previous reviews included both deep learning and traditional approaches to maintain par-
ity while compromising focus on deep learning approaches. Thus, a study on deep learn-
ing approaches towards text detection involving related challenges, relevant deep networks,
models, benchmark strategies and datasets, and future directions may be really helpful to
research fraternity. In this paper, an in-depth review of the deep learning based methods
developed for scene text detection from natural scene images on different types of texts
is reported. We have also tried to report a comprehensive and summarized version of the
performance of state-of-the-art methods. In a nutshell, contributions of this survey may be
stated as follows:
• A modest attempt is made to present comprehensive review on different deep learning

based methods adopted for text detection in the wild.
• A categorical discussion on different deep learning networks used in text detection are
reported in detail.
• Standard datasets available for scene text detection have been outlined and grouped on
the basis of their applicable text-types. Standard evaluation protocols along with their
suitability and limitations are also reported.
• Performance comparison of all methods considered in this study is presented on dif-
ferent public datasets along with a comprehensive view of deep learning approaches
versus traditional approaches.
• Finally, an in-depth analysis on the state-of-the-art methods is included and possible
future research trends towards scene text detection are mentioned.
The paper is organized as follows. In Sect. 2, a detailed review of the state-of-the-art

methods using deep learning approaches for different text-types is made. In Sect. 3, some
leading deep learning frameworks used in scene text detection, different CNN architec-
tures and finally hardware-software requirements are discussed in detail. Section 4 presents
a summarized version of standard datasets, their salient features and various evaluation
protocols followed by the researchers. In Sect. 5, performance of state-of-the-art methods
are reported on standard datasets in comparison with traditional approaches along with
detailed discussion and possible future scope of research on scene text detection. Finally,
conclusive remarks are made in Sect. 6.
2 Text detection approaches based on deep networks
Recent development of deep learning based approaches towards text detection from scene
images demonstrates its high performance in complex environments, leading to effective-
ness and robustness into the problem. Several deep learning based methods have been
reported so far to tackle with extremely diverse scene texts (Wang et al. 2019b, c, d, e;
Zhang et al. 2019; Baek et al. 2019; Xue et al. 2019; Liu et al. 2019e, f, 2020a; Tian et al.
13
2019; Kobchaisawat et al. 2020; He et al. 2020; Ma et al. 2020; Song et al. 2020; Jeon
and Jeong 2020). Existing deep learning approaches for scene text detection can be pri-
marily divided into two categories, namely, Top-down and Bottom-up approaches (Zhong
et al. 2019a). In top-down approach, text instances are considered as general objects and
object detection frameworks are adopted to predict text instances directly from input
images, whereas in bottom-up approach, small text primitives are initially localized and
then progressively linked to detect final text. However, these types of categorization are
very general and shallow in nature. In this paper, we categorize the existing deep learning
approaches in a specific way based on different strategies adopted by researchers. Accord-
ing to our analysis, state-of-the-art deep learning approaches are broadly categorized into
four groups, (1) regression-based methods, (2) segmentation based methods, (3) hybrid
methods, and (4) end-to-end text spotting. In the first category of methods (Gao et al. 2019;
Liu et al. 2019c; Liao et al. 2018b; Wang et al. 2020a), text regions are detected by con-
volving rectangle or quadrilateral text boxes in multiple directions over the entire image.
In segmentation methods (Yang et al. 2018; Tang and Wu 2017; Qin et al. 2019a), text
regions are segmented based on text intrinsic information obtained from the scene images.
In this approach, computationally expensive post-processing techniques are needed for
efficient text component extraction from segmented regions. Hybrid methods (He et al.
2017b; Zhong et al. 2019a; Lin et al. 2017; Wang et al. 2019a) are basically a combina-
tion of regression-based and segmentation-based methods that are able to detect texts more
accurately. Finally, end-to-end text spotting methods that combine both text detection and
recognition for accurate text spotting in scene images (Busta et al. 2017; Li et al. 2017;
Liu et al. 2018c, 2020b; Sun et al. 2018; He et al. 2017d, 2018a; Liao et al. 2019a; Lyu
et al. 2018b; Qin et al. 2019b; Qiao et al. 2020; Wang et al. 2019f; Feng et al. 2019) are
discussed.
2.1 Regression based methods
In regression based methods, horizontal/quadrangular boxes are placed over text instances
or small text primitives within text regions. Then, iterative regression of bounding boxes
is applied in order to generate tight coordinates over text instances that can enclose texts
accurately. In these methods, regression is sometimes, applied on entire text instances or it
only focuses on part of the texts. Based on these strategies, regression-based methods may
further be divided into two categories, (1) proposal-based methods, and (2) segment/link-
based methods.
2.1.1 Proposal based methods
Region proposal for scene text detection is an end-to-end convolution-based text detec-
tion method. Proposal based method is mainly inspired by general object detection
frameworks. In this method, entire text region is targeted first and then horizontal/quad-
rilateral boxes are applied on it, finally bounding box regression is performed to gener-
ate accurate coordinate offsets of predicted text boxes. Region proposal methods are
highly efficient and effective for scene text detection in clutters. Multi-scale quadrilat-
eral windows are convolved over text instances, and finally those windows are regressed
having a higher overlapping ratio with text regions for tighter text detection. This tech-
nique deals with mainly multi-oriented and horizontally aligned texts in the wild. It is
worth to mention that proposal-based methods are mostly limited to simple texts with
13
T. Khan et al.
linear shape. This proposal-based method works in a sliding window manner and it is
inspired by object detection techniques, which fails to detect curved texts accurately in
most of the cases. Here, such methods for different orientation of texts are discussed
below.
Arbitrary-oriented Texts Text proposal-based methods efficiently and accurately local-

ize scene texts with arbitrary orientations. Liu and Jin (2017) have developed a deep
network named deep matching prior network (DMPNet) based on CNN for detecting
tightly bounded multi-oriented texts from scene images. In this work, multiple quad-
rilateral boxes are convolved over the candidate text regions for roughly detecting text
instances and finally, a sequential protocol is applied on those text instances for final
text detection. Here, Monte-Carlo method is applied for computing higher overlapping
ratio of multiple boxes with annotated text regions. Figure 1 illustrates this detection
method using multi-scale sliding window technique. Ma et al. (2018) have proposed a
rotation region proposal network (RRPN) that deployed VGG-16 network in front of
the architecture to detect arbitrary oriented scene texts using oriented angle informa-
tion. Recently, Zhong et al. (2019b) have used a novel bounding box module to improve
the detection accuracy of faster region-based CNN (Faster R-CNN) for natural scene
text detection. Yang et al. (2019) have designed an end-to-end trainable text detector
using VGGNet and PVANet (Kim et al. 2016) for arbitrary oriented scene text detec-
tion. EAST is a novel text detector for scene text detection using quadrilateral boxes
have designed by Zhou et al. (2017), that directly localizes text and eliminates inter-
mediate steps in the pipeline like candidate region aggregation and word segmentation
task. Wang et al. (2019d) have proposed adaptive pair-wise corner points identification
for arbitrary shaped text detection. Here, text candidate proposals are generated from
Region Proposal Network (RPN), and further refined and rectified using refinement
network (text/non-text classification, bounding box regression). Then, long short term
memory (LSTM) based network (Hochreiter and Schmidhuber 1997) is applied on true
text instances to predict pair-wise adaptive corner point in upper and lower boundaries
for appropriate representation of text regions. This strategy is more effective than fixed
number of corner points based text detection that mostly fails to detect arbitrary shaped
texts. However, detection of corner points is not always feasible for complex images
Fig. 1 Comparison between horizontal and quadrilateral sliding window techniques for text detection (Liu
and Jin 2017). a The black box represents ground truth, the blue box represents conventional horizontal
rectangle and the red box represents tighter bounding box. b Normal rectangular sliding window kernels. c
Multi-directional and multi-scale quadrilateral kernels used in for tighter text detection
13
where text regions may have blends with background and word level text instances may
be closely located.
Curved Texts Zhu et al. (2019) have implemented a novel text detection algorithm for
curved texts using bounding box regression method. Here, candidate text proposals
are generated using RPN and bounding box regression of text proposals are performed
gradually in a two-stage manner, which produces more accurate text bounding boxes.
Liu et al. (2019a) also have designed a novel polygon-based curved text detector (CTD)
for curve text detection. This model applies recurrent neural network (RNN) (Sherstin-
sky 2018) to connect locating points of text line for smooth and better detection result.
A multi-scale shape regression-based framework is designed by Xue et al. (2019), where
central point of each candidate text proposal is identified using a triangulation method
within polygon region. Then, distances from nearest boundary pixel to central pixel in
both x and y direction are computed to produce a dense boundary of arbitrary shaped
text instances. This method is claimed to be superior to general quadrilateral box regres-
sion methods that generally include additional background region. However, in case of
overlapped text-lines where detecting central point is difficult, it fails.
Horizontal Texts Horizontally aligned texts are the most common features for scene
images which are accurately localized using region-based proposal techniques. Liu
et al. (2016a) have presented a model named single shot multibox detector (SSD) which
adopts fixed sized multiple boxes and estimate score of objects found within default
boxes and then an aggregation technique is applied on positive boxes to generate the
final detection results. TextBoxes (Liao et al. 2017) inherites the architecture of VGG-
16 which adopts the SSD (Liu et al. 2016a) framework to detect scene text with higher
accuracy. TextBoxes++ (Liao et al. 2018a) overcomes the shortcoming of TextBoxes
by introducing quadrilateral boxes instead of horizontally aligned boxes to detect lin-
early aligned scene texts. DeepText (Zhong et al. 2016) uses Inception-RPN inspired by
GoogleNet (Szegedy et al. 2015) which detects scene text accurately using region-based
proposal technique.
2.1.2 Segment/link‑based methods
In segment based method, at first, text instances are segmented/decomposed into multi-
ple parts, then it tried to link those candidate parts using some intrinsic features to obtain
the final detection results. These methods are more flexible compared to proposal-based
techniques for irregular text detection. They can effectively detect text regions in complex
environment. These methods attempt to localize small part of text instances that requires
small receptive field of CNN baselines. Although, linkage strategy leads to more complex
algorithms, it is more versatile in the context of text shape. These methods mainly focus on
multi-oriented and curved texts in natural images.
Arbitrary-oriented Texts Segment based regression only focus on part of the texts and
subsequently, a non-suppression merging technique is applied for final detection. Tian
et al. (2016a) have designed a connectionist text proposal network (CTPN) as an end-
to-end text detector that predicts fine-scale text proposal maps and then links them to
detect final text accurately. Shi et al. (2017a) have proposed a scene text detector (Seg-
Link) for script invariant multi-dimensional texts by linking small decomposed text
primitives. Lyu et al. (2018a) have developed a corner-point based scene text detec-
tion technique inspired by DeNet (Tychsen-Smith and Petersson 2017) and PLN (Wang
13
T. Khan et al.
et al. 2017), where at first corner points are identified and then a position-sensitive map
(Leibe et al. 2016) is generated based on corner points and finally a Non-maximum Sup-
pression (NMS) technique is applied for grouping the maps. MCN (Liu et al. 2018a) is
a powerful robust text detector for multi-scale and multi-oriented scene texts, which first
predicted a graph from input images and then a Markov clustering (Van Dongen 2000)
is applied on this graph for text grouping and accurate bounding box generation.
Curved texts Most of the scene text detectors are applicable for horizontal or multi-ori-
ented text detection. It has been observed that performances of conventional approaches
drop significantly in detecting curved texts in the wild. Xu et al. (2019a) have used a
novel text detection method where a VGG-16 based network learns a directional field
map consisting of 2-D vector (magnitude and direction) generated from an input image
and finally using this directional information curved texts are obtained accurately from
scene images (Fig. 2). In SegLink++ (Tang et al. 2019), an arbitrary oriented text detec-
tion method is proposed using instance-aware text component grouping technique.
Here, multi-level features are extracted to predict text components and estimate links
between text components and then a grouping algorithm is applied based on estimated
links to generate the final result (see Fig. 3). Baek et al. (2019) have designed a unique
framework that localizes individual characters within text instances and subsequently
group them based on affinity scores between adjacent characters to detect the entire text
instance. Model is trained with character level ground truths generated from synthetic
and real images. This model is more robust to text scales, as model localizes individual
characters rather than whole text instance. So, relatively small receptive area is required
to cover a single character region that effectively locates irregular shaped text instances.
However, high-computational cost is involved in character level ground truth generation
and this model is less effective in multi-lingual scenario, especially for Bangla and Ara-
bic texts where segmentation of individual characters is difficult due to cursive nature.
Fig. 2 Pipeline and working principle of text detection model for curved text. TextField model (Xu et al.
2019a) has VGG16 network that learns the two-channel map, then fuses these maps to obtain text direction
field and finally a post-processing operation is applied to extract text instances
13
Fig. 3 Text detection pipeline of SegLink++ (Tang et al. 2019). The network extracts multi-level features
using VGG16 model and then repulsive link estimation is applied for line-level text detection
2.1.3 Ascendancy and hindrances
Regression based methods generally predict text instances using quadrilateral bounding
box offset from text pixels. These methods are greatly inspired by general object detection
frameworks. In Liu et al. (2016a), Liao et al. (2018a), Tian et al. (2016a), text instances are
directly detected using multi-sized rectangular or quadrilateral bounding box regression.
Compared to other types of methods, regression based methods are generally faster since
pixel-wise prediction is not needed. Also due to huge development of object detection
frameworks, these methods have gained significant popularity in recent times. However,
several pitfalls are pointed out for regression-based methods: (1) models need to design
quadrilateral boxes with multiple scales in advance without prior knowledge of text pro-
posals that may lead to inaccurate detection for highly varied text scales, (2) methods may
suffer when dealing with arbitrary shaped text instances specially curved texts and long
text lines, (3) in few cases bounding box refinement is required for final detection by pro-
gressive boundary coordinate regression (Zhang et al. 2019), which may decrease the over-
all efficiency of the model, (4) unwanted background may get involved in final text detec-
tion, and (5) due to lack of character level annotation in public datasets, learning network
may not be fully trained for character level bounding box regression.
2.2 Segmentation based methods
Segmentation based methods deal with multi-scale text components in scene images using
some text-attentional segmentation algorithms. Segmentation techniques are inspired by
text-semantic information and post-processing technique is applied on segmented regions
to extract the actual text part. Segmentation based approaches accurately localize arbitrary
oriented and curved scene texts. General segmentation approach is further divided into
13
T. Khan et al.
two strategies—(a) Semantic segmentation methods and (b) Instance-aware segmentation

methods.
2.2.1 Semantic segmentation
Objective of semantic segmentation is grouping pixels in a meaningful way in order to seg-

ment multiple objects within image by assigning pixel-wise labels. This method is mainly
carried out by fully convolutional network (FCN) (Long et al. 2015) that generally adopt
other classification networks and finally fine-tune the parameters for accurate segmenta-
tion. Zhang et al. (2016) have used FCN for multi-oriented text detection from natural
images. At first, a text salient feature map is generated using text block FCN by pixel-wise
prediction, which produces candidate text instances. Individual character components are
extracted from text instances and their orientations are estimated. Finally, line-level text
boxes are generated using component level features and subsequent false-alarms removal
technique. He et al. (2017a) have proposed an FCN based segmentation approach for multi-
oriented scene text detection. At first, FCN is used for candidate text region separation from
input image, and then two cascaded segmentation networks are designed namely, text line
CNN (TL-CNN) and Instance Aware CNN (IA-CNN) for removal of false-positives and
to extract actual text parts. Wu and Natarajan (2017) have introduced a new segmentation
problem of three class for scene text detection i.e. text, non-text and text border line classes
using FCN based architecture. In this work, novel border learning approach is employed to
extract accurate texts from scene images which significantly reduces post-processing cost.
Yao et al. (2016) have reported a scene text detection method for multi-oriented texts using
semantic segmentation approach, where at first global and pixel-wise prediction maps are
generated from an input image and subsequently different properties of text regions are
estimated using FCN model to localize the text accurately. A novel word level text spotting
method using two convolutional models are reported in Qin and Manduchi (2017). Here,
the first network is inspired by FCN which coarsely localizes the text regions within the
image and second model is inspired by you only look once (YOLO) (Redmon et al. 2016)
model which accurately localizes and detects texts at word level. Li et al. (2018) have
designed a progressive scale expansion network (PSENet) as a segmentation-based shape
arbitrary text detector, where text instances are initially shrunk to different scales and then
progressively expands the kernels to enclose the entire text instance. A novel text detection
method for multi-oriented and curved texts using geometric feature estimation technique is
reported in Long et al. (2018a). Here text instances are predicted using geometric proper-
ties of text regions.
Wang et al. (2019b) have designed a lightweight segmentation framework consisting
of two modules namely feature pyramid enhancement module (FPEM) and feature fusion
module (FFM), where FPEM generates scale-wise feature maps from input image and
FFM aggregates those multi-scale feature maps to generate final feature map. Then, a pixel
aggregation method is applied to predict text instances on final feature map, where true
pixels of text instances are aggregated with appropriate text kernels nearest to correspond-
ing text instances. This method yields high accuracy and efficiency due to its low-cost seg-
mentation process. A super pixel-based segmentation method for scene text detection is
discussed in Tang and Wu (2018) where true text regions are extracted using deep learning-
based region classification technique. Recently, Richardson et al. (2019) have designed a
segmentation-based model that approximately locate the text instances without processing
all the pixels in the image, resulting in a robust and tighter text detection. He et al. (2018b)
13
have designed a multi-lingual text detection framework, where possible text instances are
segmented and actual text instances are obtained with classification, and finally bounding
box regression is applied on text instances to determine final text boxes.
2.2.2 Instance‑aware segmentation
Semantic segmentation based text detectors are limited to un-overlapped text lines where
appropriate segmentation is difficult. In this context, text instance-aware segmentation
approach mitigates this problem quite well (Kong and Fowlkes 2018; Liu et al. 2018b;
Fathi et al. 2017). Instance-aware segmentation method treats multiple objects of same
class as individual instances. This type of method is more complex in nature since labels
are instance-aware. Dai et al. (2018) have developed a novel fused text segmentation-based
network (FTSN) for end-to-end text detection from scene images using an instance-aware
semantic segmentation technique. Here, generated feature maps are fused together for finer
text localization. Deng et al. (2018) have proposed an instance segmentation based method
for text detection, where pixel-wise classification is performed and then true text pixels are
linked together within same text regions. This method is more suitable for overlapped text
instances where general segmentation-based methods struggle to separate them. However,
random geometry feature based post-processing step is attached, that may not be able to
remove false alarms in complex situation. He et al. (2017a) have introduced a multi-scale
FCN model for text region extraction. Further an instance aware segmentation technique is
applied to remove false positives and localize word level text blocks. Wang et al. (2019c)
have introduced a single-shot arbitrary text detector, where multiple geometric properties
are considered for text instance segmentation. Further contextual information’s of rough
text instances are aggregated and text pixels are clustered based on high level object prop-
erties and low-level pixel information for finer text detection. Model is more reliable for
long text lines whereas it fails to detect very small sized texts due to lack of geometric
information.
Another text instance level segmentation approach is reported by Liu et al. (2019e),
where text regions are detected using region expansion technique. Initially, a seed point is
chosen arbitrarily within text region, and then it expands the region gradually by iteratively
merging the neighborhood pixels based on local features generated by CNN. This method
is suitable for curved texts due to robust boundary pixel prediction. Tian et al. (2019) have
developed an instance segmentation based framework, where foreground text instances are
initially extracted. Then, an embedded feature-based clustering algorithm is adopted that
encourage text pixels of same text instances are grouped together to predict appropriate
text bounding boxes (Fig. 4). The key idea of this work is to discriminate text instances
from other object instances within the image based on intrinsic features lied within text
regions. In order to estimate this discrimination between text and other object instances
a novel shape-aware loss (SA Loss) function is proposed as well (Fig. 5). However, the
model is comparatively slower than other state-of-the-art methods due to cluster formation
twice. Recently, Liu et al. (2019h) have designed a novel scene text detection framework
based on Mask R-CNN that generates pyramid shaped soft mask for each text instances,
where pixel values of each text mask are real number that lies between 0 and 1. Value of
each pixel is assigned according to the distance from boundary text boxes. This method
is superior to other Mask R-CNN frameworks in two aspects i.e. quadrilateral shape of
text instances are preserved using pyramid soft mask for training purpose, and boundary
adhered segmentation is more accurate that reduces the number of mislabeled pixels near
13
T. Khan et al.
Fig. 4 Overall framework of Shape-Aware learning-based instance segmentation approach for scene text
detection proposed by Tian et al. (2019)
Fig. 5 Pictorial illustration of shape-aware embedded clustering technique using SA-Loss estimation (Tian
et al. 2019). a Original image. b True text pixels are grouped together within same text regions. c Pixels
belong to different text instances are pushed away from each other
boundary region. Few other works (Xie et al. 2019; Huang et al. 2019) also adopt Mask
R-CNN for scene text detection that effectively detects arbitrary shape text instances. SPC-
NET is similar to PMTD (Liu et al. 2019h) that generates robust shape based text mask and
detects arbitrary shaped text instances more accurately. Pyramid attention network (PAN)
(Huang et al. 2019) effectively removes false alarms that significantly improves the perfor-
mance for multi-oriented and curved texts. However, model is comparatively slow due to
high computation and limited to short text-line instances.
2.2.3 Ascendancy and hindrances
Segmentation based methods are suitable for arbitrary oriented texts with irregular shapes
in natural scene images. These methods perform pixel-wise prediction for text/non-text
to separate text instances quite well under complex environment. Such approach is more
robust for multi-scale text instances. However, after a careful analysis, several limitations
are noted that may help the readers get a clear view about the said methods: (1) it often suf-
fers from overlapped text instances where pixel level segmentation is difficult to separate
and post-processing becomes inevitable at the cost of high computation, (2) long text lines
with high character spacing are often fragmented due to small receptive field that costs
13
partial detection, (3) text instances that blend with background is hard to segment if appro-
priate boundary is not detected, though Tian et al. (2019) tried to overcome this problem by
considering intrinsic features lied within text regions, and (4) such methods are designed
with heavy framework with complicated pipeline, which decreases inference speed.
2.3 Hybrid methods
In recent times, few methods have been reported using both bounding box regression and
segmentation-based approaches for better performance on scene images. He et al. (2017b)
presented a single shot text detector that accurately detects word level texts from natural
images. In this work, text regions are, at first, coarsely extracted using attentional map
generated from images and then background suppression technique is applied using con-
volutional methods to produce accurate instances of word level texts. Finally, a hierar-
chical inception module is developed that aggregates all the features to detect word level
texts from complex scene images in a challenging environment. Zhong et al. (2019a) have
developed an anchor free RPN (AF-RPN), inspired by feature pyramid network [FPN)
(Lin et al. 2017), which consists of three different scale-based detection pyramid modules
(small, medium and large texts) for efficient text detection from natural images. Model
obtained impressive result for horizontal and multi-script text instances but failed to detect
curved texts and very small text instances. A novel hybrid network model is designed for
video text detection which is reported in Wang et al. (2019a). The model consists of text
region prediction and text-sensitive segmentation modules (Fig. 6). This model is inspired
by RefineDet (Zhang et al. 2018) network that consists of three modules namely, feature
extraction, text region detection and semantic segmentation.
Recently, an arbitrary scene text detection framework named as LOMO (Zhang et al.
2019) is developed, which integrates both proposal based and instance segmentation based
methods. In this framework, initial text proposal is generated in direct regression manner,
and then iterative refinement module (IRM) is adopted that iteratively refines the quadrilat-
eral text proposals by continuous regression of boundary coordinates to enclose the entire
text instances. Finally, quadrilateral text proposals are reconstructed in a finer way that
can fit to curve and wavy texts using shape expression module (SEM) inspired by Mask
R-CNN which exploits geometric properties. This model is mainly motivated by general
human perception ability to detect text instances. Normally, human perception can detect
a part of long texts at first sight and entire text instance only after several look. So, this
model adopts this concept by iterative refinement of initial text proposals until it detects the
entire text instances.
This kind of combined approaches is adopted by some works to improve the end result
in complex scenario. Most of the end-to-end trainable networks for both text detection and
recognition use such hybrid approaches. However, overall processing time increases due to
greater number of processing steps than other text detection methods.
2.4 End‑to‑end text spotters
Recently, due to huge advancement of deep learning approaches, end-to-end deep neural
frameworks have emerged for accurate text spotting in nature scene images. Unlike most
methods that consider text detection and detected text recognition as two separate tasks,
these methods consider them a single combined task. While the former methods may suf-
fer from inaccurate results, the latter may yield relatively better results since accuracy of
13
T. Khan et al.
Fig. 6 Pipeline and working principles of the hybrid model reported in (Wang et al. 2019a). a Network
architecture contains three different modules. High-level features are extracted using convolutional
approach, then the text region detector is applied using a two-prediction map, i.e. anchor and text region
prediction. Anchor prediction map estimates the location and size of anchors, then text predictor detects
text region using quadrilateral boxes. Text sensitive map generates semantic information of text regions that
certainly determines the text boxes using a text detector
text recognition largely depends on text detection. In this context, a single pass trainable
framework for both text detection and recognition improves the overall result significantly
for irregular shaped texts in the wild.
Busta et al. (2017) designed an end-to-end trainable text spotter for scene images, which
includes both text detection and recognition module. Here, candidate text proposals are
obtained from input images using RPN and selected text proposals are normalized to fixed
height feature vectors by keeping aspect ratio same using novel bilinear sampling and fed
to CTC based recognizer for character sequence prediction. Li et al. (2017) have used a
similar approach, where text proposals are generated from convolutional feature maps of
text proposal network (TPN). Like Busta’s work (Busta et al. 2017), a fixed sized feature
vector is generated from arbitrary shaped text proposals using region feature encoder (RFE)
that are fed to text detection network (TDN) to generate textness scores of text proposals
and bounding box offsets. RFE is applied again for fixed length feature vector generation
from bounding box proposals obtained from TDN and feed in text recognition network
(TRN) for final recognition based on extracted text instance features. It is observed that
both the methods deal with variable sized text proposals and generate fixed length feature
representation that is subsequently used for text recognition, resulting accurate character
recognition of varying font sizes. But both the methods are limited to only horizontal or
near horizontal texts.
Recently, Liu et al. (2018c) have developed a fast-unified end-to-end trainable text
detection and recognition framework for scene images. The novelty of this framework is
13
region of interest rotate (RoIRotate) operator that performs affine transformation on arbi-
trary oriented text proposals and generates horizontally aligned feature maps that are fur-
ther fed to text recognition branch consisting of RNN encoder and CTC decoder for predic-
tion. This framework is superior to previous state-of-the-art in two ways: (a) model is fast
due to convolutional feature map is shared by both text detection and recognition module
that makes recognition branch almost cost free, (b) model is suitable for arbitrary-oriented
text proposals due to RoIRotate operator. However, this model uses ground truth (GT) text
regions in training phase for text recognition that makes the model less robust and vul-
nerable in real-time scenario. Similar concept is inherited in Sun et al. (2018), He et al.
(2018a), where irregular shaped text proposals are first generated by text proposal network
and subsequently rectified and transferred to text recognition module for end-to-end train-
ing. Limitation of irregular shaped text detection and recognition is further mitigated in
Sun et al. (2018), where multi-scale quadrilateral text proposals are fused and perspective
RoI transformation is applied on final text regions to preserve the text information for the
purpose of recognition, though it bears with some limitations on curved texts.
Instance segmentation based methods are suitable for irregular shaped (specially curved)
text detection. Mask R-CNN (He et al. 2017d) is used as main backbone for instance/
semantic segmentation based methods for text detection module. Liao et al. (2019a)
have designed a complete end-to-end trainable text spotter named Mask TextSpotter that
detects text instances using instance segmentation method and then accurate recognition of
sequence of characters is done using spatial attentional module (SAM) (Fig. 7). Key con-
tribution of this work is mask branch that performs three different tasks: (a) instance seg-
mentation map generation for text detection, (b) character segmentation, and (c) a spatial
attention recognizer used for character level prediction. This improved version of method
overcomes the shortcomings of earlier version (Lyu et al. 2018b) where appropriate loca-
tion information of individual segmented characters is required to group them based on
heuristics. However, it sometimes looses character level contextual information. In con-
trast, spatial attention based recognizer can predict character sequence in word-level more
accurately. So, for recognition task, only word level annotation is required, not character
level annotation like previous state-of-the-art methods. Though, such models are suitable
for curve texts, text detection is often slow in detecting text instances.
Another, instance segmentation based end-to-end trainable framework is reported in
Qin et al. (2019b). Mask R-CNN is used as backbone for text detector that generates
text proposals as rectangular boxes and corresponding text instance segmentation mask.
Fig. 7 Overall pipeline of Mask TextSpotter (Liao et al. 2019a). Text proposals are classified and regressed
using Fast R-CNN, that are fed in Mask branch inspired by Mask R-CNN for text instance segmentation and
final character level prediction
13
T. Khan et al.
A general polygon is fitted on instance segmentation masks for curved text detection.
Novelty of this work lies in RoI masking technique that generates feature representa-
tion for recognition task directly from both axis-aligned text proposals and instance seg-
mentation mask generated from mask R-CNN. This work is comparatively faster as it
removes the text proposal rectification (orientation transformation) step between detec-
tor and recognizer that enables fast and robust text recognition. Recently, Qiao et al.
(2020) have adopted an order-aware multi-class semantic segmentation method for text
instances followed by corner point and bounding box regression that finally enables an
appropriate detection for arbitrary shaped text instances. A shape transformation mod-
ule is employed that generates fiducial points around boundary regions of text instances,
and then irregular feature regions are transformed into fixed sized regular form with the
help of extracted fiducial points that are subsequently fed to recognition module. This
framework is found to effectively detect and recognize text from curved texts due to
appropriate selection of fiducial points around boundary regions. However, this model
yields failure cases for overlapped texts due to inappropriate text segmentation.
Ascendancy and Hindrances. End-to-end text spotting methods are more robust for
arbitrary shaped specially curved texts than other state-of-the-art methods. Recently,
most text spotter adopt Mask R-CNN as main backbone for region proposals that effec-
tively detects curved texts. However, these end-to-end trainable frameworks do not
always perform well due to improper prediction of text instances.
3 Deep learning frameworks for scene text detection
Deep learning based methods for scene text detection often apply some frameworks
originated from cutting-edge network models. It is noticed that general pixel-wise
segmentation networks and object detection frameworks have been widely applied for
recent scene text detection methods and state-of-the-art performance is obtained. In this
section, brief outline of popular deep frameworks followed by insightful views is pre-
sented in connection to scene text detection. Along with this, how such deep frame-
works are applied in scene text detection in order to achieve state-of-the-art results is
also discussed. Based on working principle, deep learning based frameworks can be
broadly categorized into two groups: (1) semantic segmentation based frameworks,
and (2) object detection/instance segmentation based frameworks. Besides these frame-
works, we also report an outline of various CNN architectures used as baseline networks
in deep learning frameworks for scene text detection in Table 1. A brief discussion on
hardware/software requirements and available popular libraries for such networks is also
included at the end of this section.
3.1 Semantic segmentation based frameworks
Every semantic segmentation network performs pixel-wise prediction within an image and
classifies each pixel with specific object label, thereby segmenting the entire image into
different classes. Semantic segmentation networks are effective for complex image level
text segmentation. Few popular semantic segmentation networks are discussed in this sub-
section that are extensively used in scene text detection.
13
Table 1 Summary of different CNN architectures used as baseline networks in different deep learning frameworks for scene text detection and their salient characteristics
Architecture Input image dimension No. of layers No. of parameters Error-rate Evaluation dataset Remarks
LeNet-5 (LeCun et al. 32 × 32 grayscale Conv + max-pooling-2, 60 K 0.95% NIST 1st architecture consists of
1998) FC layer-3 convolution, maxpool-
ing layers and FC layers
AlexNet (Krizhevsky 224 × 224 colour Conv + max-pooling -5, 60 M 15.3% (5-validation) ImageNet, CIFAR-10 1st network to adopt
et al. 2012) FC layer-3 (2020) ReLU as activation
function
VGG-16 (Simonyan and 224 × 224 colour Conv + max-pooling-13, 138 M 8.8% (5-validation) ImageNet Network is deeper than
Zisserman 2014) FC layer-3 AlexNet with small
sized filters
GoogleNet (Szegedy 224 × 224 colour Conv layers-22 5M 6.67% (5-validation) ImageNet Inception module—paral-
et al. 2015) lel tower of different
convolution filters is
applied on feature maps
and outputs are concat-
enated
1 × 1 filter is applied
followed by 3 × 3, 5 × 5
filter and finally 3 × 3
maxpooling followed by
1 × 1 filter and outputs
of each convolution
filters are concatenated
to generate final feature
map
The main motivation is to
dimensionality reduc-
tion of feature maps
using 1 × 1 filter that
decreases the computa-
tional bottleneck also
13
Table 1 (continued)
Architecture Input image dimension No. of layers No. of parameters Error-rate Evaluation dataset Remarks
ResNet (He et al. 2016b) 224 × 224 colour 152 (maximum) 26 M 3.57% (5-validation) ImageNet Residual block—It
13
increases the training
accuracy with number
of convolution layers in
the network using skip
connection strategy
Few convolution blocks
are skipped during
training
In activation (ReLU)
function, output of
last convolution block
and direct input both
are added, that retain
networks generalization
power as well
T. Khan et al.

3.1.1 Fully convolutional network
FCN is used as a backbone architecture for most of the semantic segmentation-based

approaches for scene text detection (Long et al. 2015; He et al. 2017a; Wu and Nat-
arajan 2017; Yao et al. 2016). FCN is a pixel-level classification model that predicts
each pixel as text or non-text and generates a pixel-wise prediction map for text/non-
text classification. Finally, labeled pixels are grouped together for final detection. Fig-
ure 8a illustrates the general pipeline of FCN. It has the ability to learn and predict
from multi-scale images, thereby conforming the arbitrary shaped scene texts found in
natural images. Moreover, FCN exploits global cues from the text regions, generates
corresponding salient maps, and finally localizes the texts from predicted maps in the
subsequent processes. Besides, refinement of FCN with more deep features and variable
stride sizes helps improving the segmentation results further. Zhang et al. (2016) have
adopted text-block FCN for coarse-level detection of text lines from text saliency map
illustrated in Fig. 8b and in the later stage, another FCN which is a minor version of
the previously used FCN predicts centroid of each character, and finally based on some
threshold false text lines are removed. Yao et al. (2016) have used FCN as the main
backbone network for their proposed framework that generates three different prediction
maps from the input images which are text line level, character-level, character-level
orientation, and finally localized the texts following some holistic approach. In another
work (He et al. 2017a), the authors have adopted multi-scale FCN for cascaded text
detection, where model is trained using three different scale maps and scale-wise feature
Fig. 8 A pictorial illustration of general architecture and implementation of FCN on scene text detection
in order to coarsely localize text-blocks. a A general FCN framework (Long et al. 2015). b Generation of
layer-wise feature maps obtained from input images using text-block FCN (Zhang et al. 2016). Finally, all
feature maps are fused together to generate final saliency map which localizes multi-oriented text blocks in
coarse-level (marked in red border). Layer-wise feature maps are generated by exploiting both global and
local contextual information
13
T. Khan et al.
maps are generated. Finally, all the feature maps are joined together for final block-level
prediction.
It is an undeniable fact that scene texts with high diversity can only be localized effec-
tively by exploiting its large contextual information and coarse-to-fine strategy using both
global and local features. In regard to application of FCN in scene text detection, one may
keep the following aspects into consideration: (a) FCN is employed in different stages of
state-of-the-art methods as per the suitability since it can generate prediction maps using
both global and local textual cues. (b) FCN adopts multi-scale texts that imitates the actual
nature of scene texts. (c) Moreover, multi-scale FCN aggregates the features maps obtained
from different scales that capture large contextual information and deal with high variabil-
ity of the texts found under scene environment. (d) General FCN is limited to small sized
text instances because segmentation makes it difficult when dealing with large sized text
instances due to fixed sized receptive area of FCN. Thereby, appropriate strategy can be
adopted by the researchers to enlarge the receptive field which can eventually exploit more
textual information to improve overall performance.
3.1.2 DeconvNet
Another type of semantic segmentation network that performs pixel-wise classification is

reported in Noh et al. (2015). It consists of both convolution and deconvolution networks.
Figure 9a shows the architecture of DeconvNet. Convolution block is same as found in
FCN which uses a series of convolution and max-pooling layers in order to generate feature
maps. The novelty of such networks is the deconvolution block comprised of a series of
deconvolution and un-pooling layers that contribute to image reconstruction from down-
sampled feature maps. In DeconvNet, un-pooling layer retains important information
unlike pooling layer because it records the location of each activation value pooled dur-
ing pooling operation and places it in the respective position with the help of switch vari-
ables that reconstruct the original sized feature maps without losing any spatial information
(Zeiler et al. 2011). Figure 9b illustrates the un-pooling operation more clearly. Inspired by
the performance of DeconvNet over Pascal VoC dataset, it is evaluated on curved texts as
reported in Ch’ng and Chan (2017) and achieved encouraging results. As DeconvNet has
more layers making the network deeper, it enables models extract intrinsic features from
the input images properly. However, the framework suffers from high memory utilization
and computational cost.
3.2 Object detection/instance segmentation based frameworks
Deep learning based object detection frameworks are prevalent among current text
detection methods owing to the rapid advancement of GPU enabled computing devices,
where usually word-level text instances are treated as generic objects (Ma et al. 2018;
Zhong et al. 2016, 2019a, b; Tian et al. 2016a; Yang et al. 2020). Fast and cost-effective
generation of candidate text proposals makes object detectors popular among state-of-
the-art methods. However, general object detectors typically focus on horizontal bound-
ing box regression which is suitable for generic object detection but not for scene texts
having arbitrary orientations, patterns, alignments, etc. Moreover, natural scene texts
with multi-orientation and extreme aspect ratio may be better fitted by oriented and
quadrilateral bounding boxes. In this regard, it may be added that direct application
of general object detection frameworks in scene text detection may yield poor results
13
Fig. 9 Overall framework and working principle of DeconvNet (Noh et al. 2015). a Architecture of the net-
work model consisting of series of convolution layers followed by deconvolution layers in order to generate
final feature map and b Operation of the un-pooling layer that generates feature map same as input image
size using switch variables (Zeiler et al. 2011)
until some suitable modifications are made. State-of-the-art object detectors may be fur-
ther categorized into two groups: (a) two-stage object detector, and (b) one-stage object
detector. Figure 10 illustrates the working principle of both types of object detectors.
Fig. 10 General pipeline of both two-stage and one-stage object detection frameworks (Jiao et al. 2019).
a Working principle of two-stage object detectors. Region proposal network extracts candidate text pro-
posals that are fed to RoI pooling layer for feature vector generation. Finally, classification and bounding
box regression of text proposals are performed. b Working-flow of one-stage object detectors where object
bounding boxes are directly extracted from input images using series of convolution layers followed by
downsampling operation. Finally, feature vectors generated from candidate boxes using RoI pooling layers
followed by bounding box classification and regression
13
T. Khan et al.
For more information, readers are referred to a recently published survey on deep learn-
ing based object detection methods (Jiao et al. 2019).
3.2.1 Two‑stage object detection networks
Two-stage object detection networks work in two phases. In the first phase, text candidate
proposals are generated using RPN, and in the second phase features are extracted from the
candidate text proposals using RoI pooling technique followed by instance classification
and bounding box regression. This kind of detector detects the text instances with a high
accuracy. However, it is slower than a single-stage detector due to the usage of more steps.
A concise discussion on few of the two-stage object detectors are presented in the light of
scene text detection.
R-CNN (Girshick et al. 2014) initially generates some rectangular-shaped region pro-
posals using selective-search techniques and then fixed-length feature vectors for each
region proposal are generated using CNN inspired architecture. Fixed length feature
representations are further classified as text or non-text using support vector machine
(SVM) classifier and finally, bounding box regression is performed to generate true text
boxes. During the training phase, samples are labelled as positive and negative using
intersection over union (IoU) method with GT boxes, where the overlap ratio is set to
0.5 for fine-tuning the network. Jaderberg et al. (2016) employed R-CNN for scene text
detection that generated word-level bounding box proposals. An integrated regional
proposal generation module with R-CNN reduces the computational expenses radically.
Moreover, high recall is obtained using word-level proposals that enhances the overall
performance. Despite that, performance of the general R-CNN is limited due to usage
of conventional region proposal generation in coarse level which fails to handle more
complex text appearance under scene environment.
Fast R-CNN (Girshick 2015) is the extended version of R-CNN network. In Fast
R-CNN, the network takes an entire image and a set of candidate region proposals as
inputs. Then, input image is passed through a series of convolution and max-pooling
layers and corresponding feature maps are generated. Next, for each text proposal, cor-
responding small fixed-length feature vector is computed using RoI pooling layer from
the previously generated feature map. Finally, RoI feature vectors are passed through
fully connected (FC) layers that generate two outputs namely region wise softmax class
probabilities and bounding box regression. This network is much faster than R-CNN
in two ways: (1) In R-CNN each region proposal is individually fed to CNN for fea-
ture generation that takes large amount of time, whereas in Fast R-CNN, feature map is
generated only one time for the entire image and fed to CNN, and (2) In R-CNN, multi-
stage training phase is required such as pre-training of CNN, fine-tuning of CNN, SVM
training stage for classification, and regression stage that require high computational
time, whereas Fast R-CNN is a single-stage end-to-end training model. In general,
Faster R-CNN driven text detection methods adopt Fast R-CNN for classification and
refinement of the candidate text proposals generated from RPN in coarse-level (Jiang
et al. 2017; Zhong et al. 2019b). In Zhong et al. (2019b), Fast R-CNN computes the
textness score of the candidate text proposals for classification. Finally, enlarged true
classified proposals create the search space and extract some probability-based informa-
tion from the enlarged search space for accurate text localization.
13
Faster R-CNN (Ren et al. 2015) is a more improved version than Fast R-CNN in terms
of processing speed. The novelty of Faster R-CNN lies in its region proposal technique,
where it adopts RPN for generating proposals. RPN uses shared convolutional features
for region proposal generation and same convolution layers are used in text detection
network that significantly increases the processing time. Faster R-CNN is extensively
used in state-of-the-art text detection methods (Zhong et al. 2019a, b; Jiang et al. 2017).
However, in most of the cases, horizontal region proposals generated by RPN fails to
accurately localize multi-scale or arbitrary-aligned text instances. Thus, appropriate
changes are made to Faster R-CNN by researchers for precise text detection. Zhong
et al. (2019a) adopted anchor-free RPN (AF-RPN) instead of general RPN in Faster
R-CNN framework for arbitrary-oriented text detection. Ma et al. (2018) designed a
Rotation RPN (RRPN) which generates inclined text proposals for arbitrary-oriented
text instances using text-orientation information. In the context of scene text detection,
suitability of RRPN instead of general RPN of Faster R-CNN framework is visually
illustrated in Fig. 11. Jiang et al. (2017) also inherited Faster R-CNN framework for
text detection where pooling information is extracted from the generated region pro-
posals obtained through RPN using different pooling scales and finally concatenated
all pooling features followed by subsequent post-processing for final detection. So, it
has been observed that Faster R-CNN is merely suitable for horizontal text detection.
However, it fails to localize inclined and arbitrary-oriented texts. Hence, the researchers
who adopted Faster R-CNN made appropriate changes to the network in order to obtain
state-of-the-art performance.
Mask R-CNN (He et al. 2017d) is an extended version of Faster R-CNN which is mainly
used for instance level segmentation in scene text detection. Here, an additional mask
branch is used which is driven by FCN that predicts segmentation masks on each RoI
object using pixel-wise alignment estimation. It adopts FPN for generating RoI feature
vector that helps to reinforce feature representation and fairly overcomes the problem of
small object detection. Mask R-CNN is an amalgamation of both region proposal-based
Fig. 11 A visual illustration of suitability and effectiveness of RRPN on arbitrary-oriented text detection
instead of general RPN of Faster R-CNN framework (Ma et al. 2018) (top-down). a Input image, b coarse-
level text localization using horizontal region proposals generated by general RPN of Faster R-CNN (top),
corresponding text localization using rotational regional proposal generated by RRPN (down). c Final text
detection using general Faster R-CNN after bounding box regression (top), corresponding text detection
using RRPN (modified Faster R-CNN) after bounding box regression (down)
13
T. Khan et al.
object detector and instance segmentation-based framework. Mask R-CNN is suitable

for extreme curved text detection and is able to obtain state-of-the-art results (He et al.
2017d; Liao et al. 2019a; Lyu et al. 2018b; Qin et al. 2019b; Liu et al. 2019h; Xie et al.
2019; Huang et al. 2019). Besides, general Mask R-CNN works comparatively better
than other object detectors, but it behaves erroneously in complex scenario when candi-
date proposals fail to cover the entire text instances. In this context, most of the popular
works incorporated their own ideas with Mask R-CNN framework for betterment of text
detection. Liu et al. (2019h) used soft mask instead of binary text mask in the Mask
R-CNN and achieved state-of-the-art results. Huang et al. (2019) slightly modified the
general Mask R-CNN framework by introducing a high-level feature representation
module using PAN that effectively detects the curved texts.
3.2.2 One‑stage object detection networks
One-stage network directly produces object bounding boxes from input image using series
of convolution layers followed by downsampling operation. No region proposal generation
step is required for this kind of networks, resulting in faster and more efficient network
model than two-stage networks in terms of inference speed. These types of networks are
suitable for fast scene text detection with suitable modification. However, these networks
may fragile to arbitrary shaped text detection in complex environment.
YOLO (Redmon et al. 2016) is an extremely fast and simple real-time object detector.
The model runs a single convolutional network over the entire image and predicts region
bounding boxes with class probability in a single pass. Fast YOLO uses a smaller num-
ber of layers and fewer filters in between to design network model. However, YOLO
cannot be implemented directly for scene text detection due to its inability to localize
small objects. YOLO can be made useful for scene text detection by increasing the cell
size or number of boxes produced per cell. Gupta et al. (2016) designed an improved
version of YOLO which is twice faster than general YOLO architecture, and it is used
for synthetic text detection with regression-based FCN (Gupta et al. 2016). A word level
text detection method is proposed in Qin and Manduchi (2017), where at first, candidate
text regions are roughly segmented using FCN and then YOLO architecture is employed
for accurate word-level text prediction.
SSD (Liu et al. 2016a) is an improved version of YOLO which eases the regression
problem by default bounding boxes with different aspect ratios. SSD achieves better
performance compared to YOLO while maintaining its computational time. Although,
SSD achieves state-of-the-art performance in object detection, direct application is not
so effective for scene text with arbitrary orientation and extreme aspect ratio. In this
regard, the authors of Liu et al. (2016a) proposed a text detection framework inspired by
SSD and achieved high performance by generating quadrilateral boxes instead of hori-
zontal boxes used in general SSD which is suitable for oriented text instances. Liao et al.
(2018b) adopted SSD for arbitrary-oriented text detection with some dedicated changes
in the prediction module where separate feature maps are generated for classification
and bounding box regression tasks instead of using common shared feature map, which
eventually enhances the overall performance. Later, deconvolutional SSD (DSSD) (Fu
et al. 2017) is proposed which is an extended version of SSD, where residual network
and deconvolution layers are added, which improves the accuracy further in the case of
small text object detection.
13
DenseBox (Huang et al. 2015) is an end-to-end unified one-stage object detection frame-
work. This network predicts object bounding boxes with multiple scales directly from
RoI of the input images without any prior proposal generation. Prediction of bounding
boxes is carried out by a series of convolution and maxpooling layers, upsampling lay-
ers and finally non-maximum suppression (NMS) on detected boxes for fixed threshold
value. This framework is more suitable for multi-oriented text detection using quadri-
lateral bounding box regression. DenseBox is fast compared to other object detectors
due to non-involvement of additional region proposal generation module. However, said
framework shows its limitation for curved text detection. In this regard, researchers have
adopted suitable modification to overcome such problems. He et al. (2017c) adopted a
direct regression-based framework for multi-oriented text detection using quadrilateral
boxes inspired by DenseBox. Another work inspired by DenseBox is EAST (Zhou et al.
2017) that straightaway predicts arbitrary-aligned text instances from an entire image.
DenseBox framework is effective for small sized text object detection with complex
backgrounds.
3.3 Hardware and software requirements
One of the main reasons for popularity of deep learning based approach is the recent avail-
ability of hardware to meet the requirements of these models. Deep learning models are
computationally expensive, which needs fast and efficient hardware modules embedded
with computers. Nowadays, wide availability of GPUs with high processing speed has cer-
tainly extended the usage of deep learning approaches. GPU comprises hundreds of cores
which maximizes the floating-point throughputs. In the current scenario, several GPUs like
NVIDIA GTX 1050, GTX 1060, TITAN X, etc. with variable RAM sizes and GPU accel-
erated computing software packages (OpenCL, CUDA, OpenMP, Ocelot) are available
in the market. Figure 12 illustrates a comparison of different GPUs in terms of execution
speed.
Apart from efficient GPU based computers, widely available open-source library pack-
ages are other important contributors towards advancement of DNNs. These libraries are
very fast to execute and provide an efficient implementation of the DNN models using
GPUs. Most of these libraries and frameworks for DNN are mostly implemented in Python
framework. Some of the popular library packages are discussed below:
• Caffe Caffe (Jia et al. 2014) is a deep learning framework developed by the Berkeley
vision and learning center (BVLC). It is very fast and modular in nature. Caffe is not
proper deep learning library rather it is programmatically binding with Python frame-
work. It is mainly used for academic and industrial purposes.
• Theano Python Theano (Bastien et al. 2012) is one of the most popular deep learn-
ing packages binding with Python interface. This Python library is strongly integrated
with the Numpy package. Theano is developed in LISA lab at the University of Mon-
treal, Canada. Theano can be installed in Python using pip package manager: pip install
theano.
• TensorFlow Python TensorFlow (Abadi et al. 2016) is an open-source Python library
used for mathematical computation in a distributed manner. It is developed by Google
Machine Intelligence research organization. TensorFlow can be installed using pip
package manager: pip install TensorFlow.
13
T. Khan et al.
Fig. 12 Performance comparison of different GPUs and TPU for CNN, RNN and their combination (Which
GPU(s) to get for deep learning: my experience and advice for using GPUs in deep learning, https://timdett-
mers.com/ 2019)
• Keras Python Keras (Ketkar 2017) is another open-source Python-based library pack-
age for deep learning methods. It is mainly used for numerical computation, optimiza-
tion (Adam, etc.), normalization and activation functions. It is developed by Francois
Chollet from Google research organization. Keras can be installed in Python using the
command: pip install Keras.
• Lasagne Lasagne (https://lasagne.readthedocs.io/en/latest/) is a lightweight deep learn-
ing library embedded with Python interface. It is mainly used to train different DNNs.
Lasagne is integrated with Theano. This project was started by Dieleman in 2014 with
other people. Lasagne can be installed in Python by the command: pip install lasagne.
4 Benchmark datasets and evaluation protocols
In this section, we briefly discuss on standard datasets available for scene text detection
and recognition and different evaluation protocols for performance measurement. Till date,
several standard datasets have been released comprising multi-oriented texts in a complex
environment. In Sect. 4.1, a detailed overview of different publicly available datasets along
with salient features is reported. In Sect. 4.2, suitability of datasets for training deep learn-
ing models are outlined. Finally, in Sect. 4.3, we have summarized available evaluation
protocols on standard datasets for text detection methods.
4.1 Standard datasets for text detection
As scene text detection and recognition has received attention from researchers across
the globe since last two decades, a good number of datasets have been made available
in public by different research groups/labs. These publicly available datasets consist of
13
widely diversified images captured in the unconstrained environment. A categorical dis-

cussion of different public datasets is reported in this section.
4.1.1 ICDAR datasets (horizontal and multi‑oriented)
Most popular benchmark datasets for text detection and recognition are released through
international conference of document analysis and recognition (ICDAR) organized
“robust reading competition (RRC)” since last few years. ICDAR organizes various
competitions for text detection and recognition on different types of images. Few of the
popular benchmark datasets for text detection are discussed below. Table 2 illustrates a
brief outline of standard datasets to date.
ICDAR 2003 (Lucas et al. 2003) is the first public dataset released by ICDAR in
“Robust Reading and Text Locating Competition” for text detection in the year 2003.
The main idea for releasing this dataset is to detect text region accurately in scenes.
The images were annotated using XML files with three different labels, i.e. location,
words, and segmentation points.
ICDAR 2005 (Lucas 2005) dataset consists of images that are mainly used for char-
acter recognition task. Mainly three types of images are present in this dataset i.e.
digits, upper case characters, and lower-case characters.
ICDAR 2011 (Karatzas et al. 2011) dataset consists of born-digital images and real
scene images for text localization. The main objective for publishing this dataset is
an accurate estimation of text location in natural images.
In ICDAR 2013 (Karatzas et al. 2013) robust reading competition, the main aim was
to detect texts from three different types of images, namely born-digital images, natu-
ral scene images, and scene video frames. This dataset inherits most of the images
from born-digital images of ICDAR 2011 dataset, but some additional images and
video frames along with improved evaluation protocols are also included.
In ICDAR 2015 (Karatzas et al. 2015) edition, a new incidental scene text dataset is
introduced for text detection along with born-digital images, focused scene texts and
video texts. Incidental scene texts are used for text localization, recognition, and to
develop end-to-end systems. Video texts are used for text localization and for devel-
oping end-to-end system, and focused scene texts and born-digital images are used
only for developing end-to-end systems.
ICDAR 2017 RCTW-17 (Shi et al. 2017b) dataset comprises widely used Chinese
text images in the wild. This dataset is mainly used for text localization and end-to-
end recognition task. Images are natural scene and born-digital image. Each image
contains at least one Chinese text line. Annotation of the dataset is done manually by
polygon drawing.
4.1.2 Multi‑lingual datasets
Unlike English texts, multi-lingual texts have high variant of text-style, orientation and
structures. As a consequence, efficient text detection in multi-lingual environment is
a crucial need. In this context, several multi-lingual datasets are developed in recent
times.
13
Table 2 Outline of different benchmark datasets with salient features available for scene text detection
Dataset Size Training set Test set Image type Language Orientation Eval. protocol Task Annotation Source
13
ICDAR’ 03/05 529 258 251 Natural scene English Horizontal IC’03 evalua- Localization/ Word/character Digital camera
(JPEG) tion protocol recognition
ICDAR’11 522 420 102 Born digital English Horizontal DetEval Localization/ word Web and mail
(PNG) recognition/
segmentation
484 229 255 Natural scene English Horizontal DetEval Localization/ Word Digital camera
(PNG) recognition
ICDAR’13 561 420 141 Born digital English Horizontal DetEval Localization/ Word/character Web and mail
(PNG) recognition/
segmentation
462 229 233 Natural scene English Horizontal DetEval Localization/ Word/character Digital camera
(PNG) recognition/
segmentation
28 13 15 Video frames Multi-lingual Multi-oriented CLEAR-MOT Localization Word Youtube/Nico
(Bernardin Nico Douga
and Stie- (http://www.
felhagen nicovideo.jp)
2008)/VACE
(Kasturi et al.
2008; Yi and
Tian 2011)
ICDAR’15 1500 1000 500 Incidental scene English Multi-oriented IoU metric Localization/ Word Digital camera
recognition/
text-reading
ICDAR’17- 18,000 9000 9000 Natural scene Multi-lingual Multi-oriented IoU Localization/ Word Digital camera
MLT script identifi-
cation
ICDAR’17- 12,263 8034 4229 Natural scene Chinese Multi-oriented PASCAL VOC Localization/ Text-line Digital camera/
RCTW and born recognition computer
digital
T. Khan et al.

Table 2 (continued)
ICDAR’19- 20,000 10,000 10,000 Natural scene/ Multi-lingual Multi-oriented IoU Localization/ Word Digital camera/
MLT web script identifi- internet
cation
ICDAR.19- 25,000 20,000 5000 Street view Chinese Multi-oriented IoU Recognition/ Text-line/Char- Mobile camera
ReCTS scene detection/text- acter
reading
KAIST 3000 – – Indoor/outdoor Multi-lingual Multi-oriented IC’03 evalua- Localization/ Pixel Digital/mobile
scene tion protocol recognition camera
SVT 350 100 250 Street view English Horizontal OCR engine Detection Word Google street
scene view (http://
maps.googl
e.com)
MSRA-TD500 500 300 200 Indoor/outdoor Multi-lingual Multi-oriented PASCAL VOC Detection Text-line Packet camera
scene
COCO-text 63,686 43,686 20,000 Natural scene Multi-lingual Horizontal/ PASCAL VOC Detection/clas- Word MS COCO (Lin
vertical sification/ et al. 2014)
SynthText 800 k 800 k – Synthetic English Horizontal DetEval and Detection Word Newsgroup20/
PASCAL Google image
VOC
VISD 10 k 10 k – Synthetic English Multi-oriented DetEval Detection/rec- Word –
ognition
SynthText3D 10 k 10 k – Synthetic 3D English Multi-oriented – Detection Word Newsgroup20
Total-Text 1555 1255 300 Natural scene English Multi-oriented DetEval Detection/rec- Word Digital camera
ognition
CUTE80 80 – 80 Indoor/outdoor English Multi-oriented – Detection Text-line Digital camera/
scene internet
OSTD (Risnu- 89 – – Indoor scene/ English Multi-oriented IC’03 evalua- Detection Text-line Digital camera
mawan et al. street view tion protocol
2014)
13
Table 2 (continued)
13
CTW-1500 1500 – – Indoor/outdoor Multi-lingual Multi-oriented PASCAL VOC Detection Text-line Internet/Google
scene image
C-SVT (full 30 k 25 k 5k Street view Chinese Multi-oriented PASCAL VOC End-to-end text Text-line Mobile camera
annotated) scene reading
AUTNT (Khan 10,771 8619 2152 Document/natu- Multi-lingual Multi-oriented – Classification/ Multi-level Mobile camera
and Mollah ral scene recognition/
2019) script identifi-
cation
Table delivers summarized report of most popular public datasets based on image particulars, dataset utility, level of text-annotation and image acquisition systems
T. Khan et al.

ICDAR 2017 RRC-MLT (Nayef et al. 2017) dataset consists of multi-lingual texts (MLT)
from scene images, used for scene text detection and script identification task. It comprises
9 different scripts and captured by different users and different camera phones.
ICDAR 2019-MLT (Nayef et al. 2019) dataset is an extension of ICDAR17 MLT dataset. It
contains two types of multi-lingual texts i.e. real scene images from the wild and synthetic
images. Images contain texts of 10 different languages.
KAIST consists of images captured in a diverse environment (Lee et al. 2010). Most images
are indoor/outdoor scene images having uneven lighting effects. The dataset contains multi-
lingual text images. Pixel level GT is generated for this dataset.
MSRA-TD500 The main motivation for developing this dataset is to focus on multi-ori-
ented text detection (Yao et al. 2012). This dataset consists of horizontal as well as perspec-
tival distorted and skewed images. Images in this dataset are multi-lingual in nature. All the
images are indoor/outdoor scenes. Dataset consists of diverse texts on complex environ-
ment that makes text detection more challenging. Images are annotated at text-line level
rather at word-level.
COCO-texts It is a large-scale benchmark dataset used for scene text detection and recogni-
tion task developed by Veit et al. (2016). The dataset is derived from MS-COCO dataset
(Nayef et al. 2019). It consists of day-to-day natural scene images captured from various
angles in an uncontrolled environment. So, the dataset reflects a wide range of diversity of
scene texts. Text regions are annotated in multiple levels such as location of texts, legibility
of texts, text category, text script and transcription of texts.
4.1.3 Curved text datasets
Text with curve orientation is one of the most common observation in real-world scenario.
Despite of tremendous success of scene text detection in multi-oriented texts, when it comes
to curved texts, it is still not fully explored. The primary aim of curved text dataset is to bridge
this gap and ease a new course of study for research community.
Total-text The main aim to develop this dataset was to focus on curved scene texts (Ch’ng
and Chan 2017). This dataset contains only texts in English language. Word level annota-
tion is performed on this dataset and the dataset is further classified into three categories
based on text-orientation i.e. horizontal, multi-oriented and curved texts. Text images are
collected in an unconstrained environment with varying size, font, color, etc.
CTW-1500 (Liu et al. 2019a) This dataset contains 1500 scene images where at least one
curved text line per image. Images are mostly captured from indoor-outdoor scene images
using camera phones and also from internet, digital image libraries etc. Dataset contains
English and Chinese texts.
CUTE80 This is the first curve-oriented scene text dataset (Risnumawan et al. 2014).
Images are mostly captured from indoor/outdoor scenes using digital camera phones. Texts
are having a complex background with varying font size, color, perspective distortion, etc.
Text-line and word level annotation are made for this dataset.
4.1.4 Street view image datasets
Few datasets are dedicated to street-view images, where texts within images display high
variability and often suffers from low-resolution. Images are mostly captured under uncon-
strained environment in crowded streets.
13
T. Khan et al.
Street view texts (SVT) Wang and Belongie (2010) have developed a dataset for text
detection which comprises natural scene images. This dataset is mainly harvested from
Google street view images (http://maps.google.com). Images are mainly street-view nat-
ural scene images and texts within images exhibit high diversity. Most of the images are
captured in low illumination and having low resolution. Dataset is annotated at word-
level using bounding rectangles.
C-SVT Sun et al. (2019) have developed a large-scale Chinese street view text (C-SVT)
dataset for end-to-end text reading problem. Images are acquired by cell-phone camera
from different crowded streets across all cities of China. Texts in such images mostly
appear in complex background with arbitrary orientation. The dataset is annotated in
two parts: full annotation and weak annotation. In full annotation, image location and
labels are accurately provided for end-to-end text reading task, whereas in weak anno-
tation, text regions are roughly located and labelled that are supplied to fine-tune the
end-to-end recognizers. It is observed that weak annotation takes much lesser time
than full annotation. It is worth mentioning that full annotated images are used for fully
supervised learning models, weak annotated images are supplied to weakly supervised
learning models, and finally mixture of both annotated images are used for partially
supervised learning models. Experimental results show that partial supervised learning
model obtains higher accuracy when trained with both full and weak annotated images.
ICDAR 2019-ReCTS (Liu et al. 2019g) dataset is released in a competition organized by
ICDAR on reading Chinese text on signboard (ReCTS) in the year 2019. Images of this
dataset are taken from street view Chinese signboard where texts have mostly appeared
in complex background with varying font style and orientation. Images are acquired by
cell-phone cameras in unconstrained environment. Layout of Chinese characters adds
more complexity in signboard due to their artistic outlook.
4.1.5 Synthetic datasets
Due to huge progress of deep networks in scene text detection, requirement of a large
amount of annotated dataset is inescapable while training the model. Synthetic dataset
delivers detailed annotation on scene images in a large scale where text instances are syn-
thetically embedded in a semantically sensible location.
SynthText in the wild Synthetic text images are generated using a synthetic engine, texts
are rendered artificially on natural images (Gupta et al. 2016). This rendering process is
automated. Source images are mainly taken from Newsgroup20 dataset (Mitchell 1999).
The dataset contains street view images and complex indoor/outdoor images where
diversified texts are synthetically superimposed. Texts within images are annotated at
three levels, namely text line, word, and character.
VISD verisimilar image synthesis dataset (VISD) is developed by Zhan et al. (2018)
for text detection and recognition task from scene images. Dataset comprises of syn-
thetic images that are generated in such a way that source text instances are embedded
in semantically sensible location within background images, rather than imposing text
in any random position within background that makes no sense. Fonts, color, orientation
of source text is generated according to sensible location of background. Finally, text
instances are embedded in the exact location within image background. Source texts
are generally embedded in homogenous background regions within images, which cre-
ates high contrast and visibility. Dataset comprises of 10 K synthetic images using 10 K
13
background scene images, which indicates that there is no repetition of background

for any synthetic image and there is certain degree of diversity in this image dataset.
Dimension of images are same i.e. 640 × 480 pixels.
SynthText3D A synthetic image dataset is developed by Liao et al. (2020) using 3D vir-
tual world images. At first, different camera anchors are set manually for capturing 3D
virtual world models. Then, RGB image and corresponding surface map are obtained
from each camera anchors using 3D engine. Next, 2D text regions are generated based
on surface map, and finally, generated text regions are mapped into 3D virtual images
and superimposed on it. Different illumination effects and multiple visibility degrees
are rendered artificially on 3D images. These images have varying illumination effect,
occlusion, perspective distortion and more realistic appearance than other existing syn-
thetic images. It comprises of 10 K images generated from 30 background scenes of
fixed dimension.
4.2 Suitability of datasets for deep networks
It may be noted that a large number of standard datasets have been developed for the sake
of robust scene text detection. Some datasets are large enough to train deep network mod-
els whereas there are some datasets which are relatively small in terms of size. Hence,
some are found to be useful to train DNN models for efficient text detection in wild, e.g.,
SynthText (Gupta et al. 2016), VISD (Zhan et al. 2018), SynthText3D (Liao et al. 2020)
and COCO-text (Veit et al. 2016). These datasets are large enough and diverse in nature
to fine-tune any deep network model. In most of the cases, comparatively large datasets
are used for training the model and small datasets are used for evaluation purpose. Zhong
et al. (2019a) have used ICDAR’17-MLT dataset with 200 k iteration to train the model
as ICDAR’13 and ICDAR’15 datasets are too small for the training phase. Wang et al.
(2019a) have used SynthText to fine-tune the hybrid network model for text detection task
using a multi-scale strategy that deals with the problem of limited training data and then
evaluated the model using other benchmark datasets such as ICDAR 2015, MSRA-TD500
and others.
Publicly available datasets like CUTE80, OSTD consists of very smaller number of
image samples and it may be insufficient to train any model. Poly-FRCNN-3 model (Ch’ng
et al. 2019), evaluated using CUTE80, is trained with SynthText and COCO-text with
100 k and 100 k iterations respectively. So, dataset with large number of samples is still a
necessity particularly for deep learning based models. Especially for curved texts, available
public datasets are not sufficient to train any model. Synthetic annotated texts are somehow
unable to represent the intrinsic properties of real-life scene images completely, that may
lead to an imperfect model while training DNN models for scene text detection.
4.3 Evaluation protocols
Performance assessment is an essential aspect of any research task. In this section differ-
ent evaluation metrics used for text detection/localization are outlined. Earlier, the per-
formance of any text detection algorithms was evaluated by observing the overlapping
ratio of GT and detected regions, but those early approaches could not capitalize the
intrinsic information of text regions for appropriate evaluation. Over the years, research-
ers have tried to overcome those shortfalls and proposed different evaluation metrics
13
T. Khan et al.
for text detection in natural images in a complex environment. Here, some standard and
widely used evaluation protocols reported in the literature are discussed.
4.3.1 Liang’s evaluation protocol
Liang et al. (1997) proposed a quantitative evaluation framework for page layout analy-
sis of document images, which is also used to quantize performance of text detection
methods. Two performance metrics are defined in Eqs. 1 and 2.
( )
Area Gi ∩ Dj
𝜏ij = ( ) (1)
Area Dj
( )
Area Gi ∩ Dj
𝜎ij = ( ) (2)
Area Gi
{ } { }
Let, G = G1 , G2 , … GM be the set of GT boxes and D = D1 , D2 , … DN be the set
of detected boxes. Then 𝜎ij defines the ratio between intersection are of Gi and Dj , and
area of Gi whereas 𝜏ij defines the ratio between intersection area of Gi and Dj , and area
of Dj . Now, the computation of true positive, false alarms, etc. are decided by the fol-
lowing criteria:
• If 𝜎ij ≈ 1 and 𝜏ij ≈ 1, then correctly detected.

• If Gi is not detected by any bounding box then, 𝜎ij ≈ 0 which is counted as not detected.
• If detected box Dj is not in the set of G , then 𝜏ij ≈ 0, is considered as false alarm.
This protocol was originally designed for document images. It supports only one-to-one
matching for evaluation and it is fast and simple to assess the performance of any detection
method. Moreover, this protocol effectively evaluates the performance on zone, text-line
and word level segmentation. However, it has few limitations: (1) it is only suitable for
horizontally aligned text bounding boxes, (2) it is not suitable for complex images where
GT detected boxes are split into several rectangular boxes, (3) it fails in case of skewed
text-lines, and (4) it is sensitive to noise for word level performance evaluation.
4.3.2 ICDAR’03 evaluation protocol
In ICDAR text locating task (Lucas et al. 2003) from scene images, performance evalua-
tion of any algorithm is based on the statistical measures namely Precision (P), Recall (R)
and F-Measure (FM). Let, set of GT rectangles be denoted as T and set of detected rec-
tangles be denoted as E. In this framework, a match score of two rectangles is obtained by
calculating overlap ratio between them. Let, the best match m(r, R) for a specific rectangle r
from a set of rectangles denoted by R can be mathematically expressed in Eq. 3.
( )
m(r, R) = mP r, r� , where r� ∈ R (3)
Match score between GT and detected rectangles is denoted as mP . The metrics of eval-
uation framework can be defined by Eqs. 4–6.
13
∑ � �
re ∈E m re , T
P= (4)
�E�
∑ � �
rt ∈T m rt , E
R= (5)
�T�
1
FM =
𝛼∕P + (1 − 𝛼)∕R (6)
Here re and rt are the specific rectangle of GT and detected rectangle respectively and 𝛼 is
denoted as weight parameter which is set to 0.5.
In ICDAR 2003 competition, these metrics were used in evaluation framework. It performs
one-to-one matching between two sets of rectangular boxes (detected and GT) and selects the
detected box having closest match with GT box. However, this evaluation framework has cou-
ple of limitations, (1) it fails to work when one-to-many or many-to-one matching is followed
(Liu et al. 2019b) because when a single detected rectangle is matched with several GT rec-
tangles (word-level annotation), recall turns to be close to zero and vice versa, and (2) it may
happen that multiple GT/detected boxes are repeatedly matched.
4.3.3 DetEval evaluation protocol
Wolf and Jolion (2006) develop an improved evaluation framework for text detection algo-
rithms which certainly overcomes the shortfalls of previously used evaluation protocols in this
domain. ICDAR’13 dataset has been evaluated using DetEval protocol. This protocol supports
one-to-one, one-to-many (splits), many-to-one (merge) and many-to-many (splits and merge)
matches between GT and detected rectangles. Figure 13 illustrates the different matches based
on overlap area of GT and detected rectangles. Depending on the matching type, evaluation
metrics are defined in Eqs. 7–8.
∑ � �
� � j matchD Dj , G, tr , tp
P G, D, tr , tp = (7)
�D�
∑ � �
� � matchG Gi , D, tr , tp
(8)
i
R G, D, tr , tp =
�G�
Fig. 13 Example of different matching types based on overlap area between GT and detected rectangles
(Wolf and Jolion 2006). a One-to-one, b one-to-many (splits), c many-to-one (merge), d straight-line rec-
tangle and dotted-line rectangle denoted as GT and detected regions respectively
13
T. Khan et al.
Here tr and tp are the constraint values of area R and area P respectively that lie between 0
and 1. The obtained overlapping ratio of two rectangles is compared with these constraints
based on different matching types. matchD and matchG are two matching functions which
are described by the following criteria:
⎧ 1 if D matches with a single GT rectangle

⎪ j
matchD = ⎨ 0 if Dj does not match with GT rectangles
⎪ fsc (k) if Dj matches with several GT rectangles
⎩
⎧ 1 if G matches with a single detected rectangle

⎪ i
matchG = ⎨ 0 if Gi does not match with detected rectangles
⎪ fsc (k) if Gi matches with several detected rectangles
⎩
Here fsc (k) is a function of performance evaluation framework which decides the amount
of penalty will be given depending on the match score. If its value is 1, then no penalty will
be given. In this protocol, its value is set to 0.8. Figure 14 shows the final evaluation metric
values after the result of a text detection algorithm is obtained.
This algorithm tries to resolve the problems of previously proposed evaluation frame-
works to some extent by considering one-to-many and many-to-one matches for evalu-
ation task. It deals with both text-line and word level performance evaluation. Besides,
it is able to evaluate the performance for multiple images without losing its strength.
However, few limitations still exist in this framework that may affect accurate evaluation
of text instances, such as (1) in one-to-many matching, many fragmented text detections
may be considered to be correct ones that partially cover single GT box, incurring erro-
neous detection results, (2) in many-to-one matching, single detected box may roughly
cover all GT boxes, which may not be suitable for word-level text detection algorithm in
granular level, and (3) in one-to-many or many-to-one matches, algorithm only consid-
ers the first match as the best match and ignores subsequent matches that significantly
affects the result.
4.3.4 IoU evaluation protocol
ICDAR’15 dataset for incidental scene text detection task has been evaluated using IoU
metrics (Karatzas et al. 2015). This protocol is same as PASCAL VOC evaluation frame-
work used for object detection (Everingham et al. 2015). True positive (TP) and false posi-
tive (FP) are measured using an overlap ratio of GT and detected rectangle bounding box.
TP is considered if the value of IoU ratio is greater than 0.5 which is defined in Eq. 9.
Fig. 14 Performance evalua-

tion based on metrics where
R = 100% and P = 50% (Wolf and
Jolion 2006)
13
( )
Area Gi ∩ Pj
( ) > 0.5 (9)
Area Gi ∪ Pj
{ } { }
Here G = G1 , G2 , … Gn is a set of GT boxes and P = P1 , P2 , … Pm is a set of pre-
dicted boxes, where 1 < i < n and 1 < j < m. If the IoU value is zero then it is considered
as no detection.
In this framework, detected bounding box is labeled as positive or negative based on over-
lap ratio with GT boxes. However, the primary aim of bounding box based text detection is
to recognize the text accurately. So, incomplete detection should not be accepted in recogni-
tion stage. A few issues may be noted with this protocol: (1) a partially detected bounding
box, which misses some characters, is still considered to be a valid bounding box if overlap
ratio reaches the threshold value, and it could create incorrect result in text recognition, (2) in
some cases, bounding box that encloses text region along with a large region of background,
will also be considered to be valid bounding box if overlap ratio reaches the threshold value,
and (3) if threshold value is set very high to overcome this problem, some legitimate area of
bounding boxes may remain ignored, whereas for low threshold value more background noise
will be added.
4.3.5 Tightness‑aware IoU protocol
This framework effectively addresses several issues of IoU evaluation metrics for scene text
detection methods. Unlike, object detection methods, the main aim of text detection methods
is to detect the text region completely in a granular level. IoU framework measures different
evaluation metrics (P, R, FM) using fixed threshold value that sometime generates erroneous
results. Figure 15 illustrates some inappropriate cases obtained using IoU protocol.
In this regards, Tightness-aware IoU (TIoU) protocol is designed by Liu et al. (2019b),
quantifies the completeness of GT, solidity of detected regions and tightness of matching
scores. So, the prime goal of TIoU is to detect the complete text regions without any informa-
tion loss. P and R defined using modified formulas that manage the cutting behavior (incom-
plete detection of texts) appropriately. These are defined in Eqs. 10, 11.
( ) ( )
Area Gi ∩ Dj ∗ f Ct
TIoUR = ( ) (10)
Area Gi ∪ Dj
Fig. 15 Example of some errone-

ous results obtained using IoU
protocol (Liu et al. 2019b). For
a–d IoU value is 66% (higher
than threshold) and accordingly
detected box is considered as TP.
However, in all cases detected
boxes cannot properly bound the
text instance
13
T. Khan et al.
( ) ( )
Area Dj ∩ Gi ∗ f Ot
TIoUP = ( ) (11)
Area Dj ∪ Gi
( ) ( )
Here f Ct = 1 − x where x = Area t G and f Ot = 1 − x , where x = Area tD , where Ct is
C O
( i) ( j)
defined as not-recalled area of GT box (partial detection) and Ot is defined as union of all
outlier-GT region that fall inside the target GT box.
This protocol also deals with tight bounding box regression, one-to-many and
many-to-one matching types. There are several aspects of this protocol that may lead to
accurate and effective evaluation as follows: (1) Completeness It ensures that detected
bounding box should enclose the target text region completely. In other words, GT
boxes must be recalled completely. (2) Compactness It receives more attention in com-
pact bounding boxes by penalizing outlier-GT problems. (3) Tightness Tighter bound-
ing box is generated by setting relatively high overlap ratio (such as 0.9). However, this
protocol depends on quality of GT boxes for accurate evaluation that may not always
be consistent for complex scene images.
4.3.6 TedEval evaluation protocol
Lee et al. (2019) propose a novel evaluation metric named TedEval (Text detector
Evaluation) for scene text detection which evaluates the performance of an algorithm
by instance aware matching technique and character level detection policy. In this sys-
tem, an appropriate penalty is given for inaccurate detection like missing characters or
overlapping regions. Here, R of GT and P of predicted value are measured in charac-
ter-level which are defined in Eqs. 12–13.
∑�G�
RGi
R=
i=1
(12)
�G�
∑�D�
PDj
P=
j=1
(13)
�D�
Here Recall RGi is the number of correctly detected characters matches over a text length li
and precision PDj is the number of correctly detected characters over a total text length of
GT’s matched with Dj.
This protocol works on text instance level matching and character level scoring
strategy. Salient features of this frameworks are mentioned as follows: (1) It addresses
granularity problem by ensuring one-to-one, one-to-many and many-one matching for
evaluation task. (2) It also addresses completeness of detected bounding box where
text instance level matches are further scored (character-level) and based on that score
penalty is given for missing and overlapping characters. (3) This protocol ensures text
instances’ bounding boxes to be more accurate as recall and precision scores are com-
puted using Pseudo Character Center (PCC) at character level without any character
level annotation. However, this protocol is mainly used for rectangular bounding box
and polygon-shaped bounding box is yet to be evaluated.
13
5 Hitherto progress and open areas
In this section, experimental results of different state-of-the-art methods for scene text
detection on benchmark datasets are reported. Many of such methods conducted experi-
ments on one or more public datasets to prove the robustness of their methods. An attempt
is made to report a fair comparison on the basis of standard performance metrics of both
deep learning based approaches and few recently proposed traditional approaches on dif-
ferent datasets. It is important to mention that for end-to-end text spotting framework, only
detection results of text instances are reported in performance comparison section as rec-
ognition module is beyond scope of this study. Such a categorical comparative study may
help to get a quick idea about the state-of-the-art and also pinpoint where more focus is
required. It is worth mentioning that in all the tables (refer to Tables 3, 4, 5, 6, 7), rounded
values are taken for evaluation metrics.
5.1 Performance–deep versus traditional methods
In Table 3, a comprehensive report on performance analysis of different methods on

ICDAR 2003 and 2005 is presented. As both the datasets consists of the more or less same
images so, a fair comparison can be made with the data given in this table. It is observed
that very few works on deep learning based approaches are reported so far on these data-
sets. Most of the works are based on traditional handcrafted features and achieved signifi-
cant results. The recently proposed method (Huang 2019) using traditional approach sur-
passes the other methods in terms of performance. However, few deep learning approaches
obtained satisfactory results, still, there is a scope to improve the results further on these
datasets using deep networks.
In Table 4, performance of different methods on most popular ICDAR datasets namely
ICDAR’11, ICDAR’13 and ICDAR’15 is reported. ICDAR’13 dataset is an extended ver-
sion of ICDAR’11 dataset where some new images are added to increase its diversity. It
may be noted from Table 4 that Zhong et al. (2019b) have obtained an impressive result
in terms of F-Measure on ICDAR’11 dataset that contains horizontal texts. Text candidate
proposals are enlarged to create search region and then location based pixel classification is
Table 3 Performance comparison of state-of-the-art methods on ICDAR 2003/2005 dataset (Methods are
reported in year-wise order)
Method Year Precision (P) Recall (R) F-measure (FM)
Epshtein et al. (2010)* 2010 0.73 0.60 0.66

Chen et al. (2011)* 2011 0.73 0.60 0.66
Shivakumara et al. (2010)* 2011 0.76 0.86 0.81
Yao et al. (2012)* 2012 0.69 0.66 0.67
SFT-TCD (Huang et al. 2013)* 2013 0.81 0.74 0.72
Huang et al. (2014) 2014 0.84 0.67 0.75
He et al. (2016a) 2016 0.87 0.73 0.79
Ansari et al. (2018) 2018 0.86 0.77 0.82
Sain et al. (2018)* 2018 0.83 0.86 0.85
Huang (2019)* 2019 0.89 0.91 0.90
(*) indicates traditional method
13
Table 4 Performance comparison in terms of text detection accuracy and inference speed (Ts) of different state-of-the-art methods on ICDAR benchmark datasets using dif-
ferent statistical evaluation metrics
Method Year ICDAR’11 ICDAR’13 ICDAR’15
13
P R F-M T(s) P R F-M T(s) P R F-M T(s)
Lu et al. (2015)* 2015 – – – 0.89 0.69 0.78 – – – –

CTPN (Tian et al. 2016a) 2016 0.89 0.79 0.84 0.93 0.83 0.88 0.14 s 0.74 0.52 0.61 –
Gupta et al. (2016) 2016 0.92 0.75 0.82 0.25 s 0.92 0.76 0.83 0.25 s – – – –
He et al. (2016a) 2016 0.91 0.74 0.82 – 0.93 0.73 0.82 – – – – –
Qin and Manduchi (2017) 2017 – – – 0.90 0.83 0.86 0.45 s 0.79 0.65 0.71 0.45 s
SegLink (Shi et al. 2017a) 2017 – – – – 0.87 0.83 0.85 0.05 s 0.73 0.77 0.75 –
TextBoxes (Liao et al. 2017) 2017 0.89 0.82 0.86 0.73 s 0.89 0.83 0.86 0.73 s – – –
R2CNN (Jiang et al. 2017) 2017 – – – – 0.93 0.82 0.87 – 0.85 0.79 0.82 2.25 s
Neycharan and Ahmadyfard (2018)* 2017 – – – – 0.93 0.77 0.85 – – – – –
Dey et al. (2017)* 2017 – – – – 0.67 0.87 0.75 – – – – –
He et al. (2017c) 2017 – – – – 0.92 0.81 0.86 0.9 s 0.82 0.80 0.81 –
EAST (Zhou et al. 2017) 2017 – – – – – – – – 0.83 0.78 0.81 0.06 s
Li et al. (2017)+ 2017 0.89 0.81 0.85 0.73 s – – – – 0.91 0.80 0.85 0.73 s
Lyu et al. (2018a) 2018 – – – – 0.92 0.84 0.88 1s 0.89 0.79 0.84 1s
RRPN (Ma et al. 2018) 2018 – – – – 0.95 0.88 0.91 – 0.84 0.77 0.80 –
FOTS (Liu et al. 2018c)+ 2018 – – – – – – 0.92 0.04 s 0.91 0.85 0.88 0.12 s
Textboxes ++ (Liao et al. 2018a) 2018 – – – – 0.92 0.86 0.89 – 0.87 0.78 0.83 0.43 s
PSENet (Li et al. 2018) 2018 – – – – – – – – 0.89 0.85 0.87 0.13 s
Tang and Wu (2018) 2018 0.90 0.84 0.87 8.7 s 0.91 0.86 0.88 – – – – –
Sain et al. (2018)* 2018 – – – 0.87 0.74 0.80 – – – – –
MCN (Liu et al. 2018a) 2018 – – – – 0.88 0.87 0.88 0.03 s 0.72 0.80 0.76 –
PixelLink (Deng et al. 2018) 2018 – – – – 0.88 0.87 0.88 – 0.85 0.82 0.84 0.33 s
Richardson et al. (2019) 2019 – – – – – – – – 0.85 0.83 0.84 0.28 s
RCR-CNN (Zhu et al. 2019) 2019 – – – – – – – – 0.88 0.86 0.87 0.55 s
T. Khan et al.

SegLink++ (Tang et al. 2019) 2019 – – – – – – – – 0.83 0.80 0.82 0.14 s
Table 4 (continued)
Method Year ICDAR’11 ICDAR’13 ICDAR’15
Gao et al. (2019) 2019 – – – – 0.90 0.80 0.85 0.08 s 0.80 0.78 0.79 0.08 s
Zhong et al. (2019a) 2019 – – – 0.94 0.90 0.92 0.50 s 0.89 0.83 0.86 –
Zhong et al. (2019b) 2019 0.89 0.90 0.89 0.7 s 0.94 0.87 0.91 0.70 s 0.88 0.80 0.84 1.95 s
Huang (2019)* 2019 – – – – 0.88 0.90 0.89 – – – – –
FTPN (Liu et al. 2019c) 2019 – – – – 0.93 0.92 0.92 – 0.68 0.78 0.73 0.3 s
Yang et al. (2019) 2019 – – – – 0.91 0.89 0.90 – 0.84 0.83 0.83 0.07 s
Xu et al. (2019b) 2019 – – – – – – – – 0.90 0.87 0.88 0.47 s
PAN (Wang et al. 2019b) 2019 – – – – – – – – 0.84 0.81 0.83 0.03 s
LOMO (Zhang et al. 2019) 2019 – – – – – – – – 0.87 0.87 0.87 –
CRAFT (Baek et al. 2019) 2019 – – – – 0.97 0.93 0.95 0.11 s 0.89 0.84 0.87 –
Wang et al. (2019d) 2019 – – – – 0.93 0.89 0.91 – 0.89 0.86 0.87 –
SBD (Liu et al. 2019f) 2019 – – – – – – – – 0.89 0.83 0.86 –
Mask TextSpotter (Liao et al. 2019a)+ 2019 – – – – 0.94 0.89 0.92 0.33 s 0.86 0.87 0.87 0.33 s
Qin et al. (2019b)+ 2019 – – – – – – – – 0.91 0.88 0.89 0.27 s
PMTD (Liu et al. 2019h) 2019 – – – – – – – – 0.91 0.87 0.89 –

SPCNET (Xie et al. 2019) 2019 – – – – 0.93 0.90 0.92 – 0.88 0.85 0.87 –
Huang et al. (2019) 2019 – – – – – – – – 0.90 0.81 0.85 –
Kobchaisawat et al. (2020) 2020 – – – – – – – – 0.89 0.86 0.88 0.05 s
PuzzleNet (Liu et al. 2020a) 2020 – – – – – – – – 0.88 0.88 0.88 –
CAST (Jeon and Jeong 2020) 2020 – – – – 0.94 0.69 0.80 0.05 s 0.85 0.76 0.81 0.05 s
Methods are reported on arbitrary shaped scene images with English text (in year-wise order). Only text detection results are reported for end-to-end text spotting methods
(*) indicates traditional method and (+) indicates end-to-end text spotting method
13
Table 5 Performance comparison in terms of text detection accuracy and inference speed (Ts) of state-of-the-art methods on ICDAR Multi-lingual and Chinese datasets using
13
different statistical evaluation metrics
Method Year ICDAR’17-MLT RCTW-17 ICDAR’19-MLT
SegLink (Shi et al. 2017a) 2017 – – – – 0.76 0.40 0.52 – – – – –

RRD + MS (Liao et al. 2018b) 2018 – – – – 0.77 0.59 0.67 0.1 s – – – –
IncepText (Yang et al. 2018) 2018 – – – – 0.78 0.57 0.66 – – – – –
PSENet (Li et al. 2018) 2018 0.77 0.68 0.72 – – – – – – – – –
Lyu et al. (2018a) 2018 0.74 0.70 0.72 – – – – – – – – –
Richardson et al. (2019) 2019 0.73 0.62 0.67 0.39 s – – – – – – – –
Zhong et al. (2019a) 2019 0.75 0.66 0.70 – – – – – – – – –
Xu et al. (2019b) 2019 0.79 0.70 0.74 – – – – – – – – –
FOTS (Liu et al. 2018c) + 2018 0.81 0.62 0.70 – – – – – – – – –
LOMO (Zhang et al. 2019) 2019 0.80 0.67 0.73 – 0.79 0.60 0.68 0.33 s – – – –
CRAFT (Baek et al. 2019) 2019 0.80 0.68 0.74 – – – – – – – – –
SBD (Liu et al. 2019f) 2019 0.83 0.70 0.76 – – – – – – – – –
PMTD (Liu et al. 2019h) 2019 0.84 0.76 0.80 – – – – – – – – –
SPCNET (Xie et al. 2019) 2019 0.80 0.68 0.74 – – – – – – – – –
Huang et al. (2019) 2019 0.80 0.69 0.74 – – – – – – – –
Kobchaisawat et al. (2020) 2020 0.78 0.69 0.73 0.05 s – – – – 0.86 0.75 0.80 0.05 s
CAST (Jeon and Jeong 2020) 2020 0.70 0.58 0.63 0.27 s – – – – – – – –
Methods are reported in year-wise order and only text detection results are reported for end-to-end text spotting methods
T. Khan et al.

Table 6 Performance comparison in terms of text detection accuracy and inference speed (Ts) of state-of-the-art methods on other standard arbitrary-shaped text and street-
view texts datasets using different statistical evaluation metrics (Methods are reported in year-wise order and only text detection results are reported for end-to-end text spot-
ting methods)
Method Year MSRA-TD500 SVT COCO-Text

SegLink (Shi et al. 2017a) 2017 0.86 0.70 0.77 0.11 s – – – – – – – –

EAST (Zhou et al. 2017) 2017 0.87 0.67 0.76 – – – – – 0.50 0.32 0.39 –
Tang and Wu (2017) 2017 – – – 0.59 0.76 0.66 – – –
Dey et al. (2017)* 2017 0.52 0.85 0.65 – 0.55 0.68 0.61 – – – – –
Lyu et al. (2018a) 2018 0.87 0.76 0.81 0.17 s – – – – 0.70 0.26 0.38 –
MCN (Liu et al. 2018a) 2018 0.88 0.79 0.83 – – – – – – – – –
RRD + MS (Liao et al. 2018b) 2018 0.87 0.73 0.79 0.1 s – – – – 0.64 0.57 0.61 –
PixelLink (Deng et al. 2018) 2018 0.83 0.73 0.78 – – – – – – – – –
IncepText (Yang et al. 2018) 2018 0.87 0.79 0.83 – – – – – – – – –
Textboxes++ (Liao et al. 2018a) 2018 – – – – – – – – 0.60 0.56 0.58 0.05 s
Tang and Wu (2018) 2018 – – – – 0.54 0.76 0.63 8.7 s – – – –
TextSnake (Long et al. 2018a) 2018 0.83 0.74 0.78 0.9 s – – – – – – – –
Sain et al. (2018)* 2018 0.85 0.79 0.82 – 0.81 0.67 0.74 – – – – –
Liu et al. (2019a) 2019 0.84 0.77 0.80 – – – – – – – – –
Yang et al. (2019) 2019 – – – – – – – – 0.54 0.60 0.57 –
Qin et al. (2019a) 2019 – – – – 0.79 0.81 0.80 – – – – –
Zhong et al. (2019b) 2019 0.78 0.81 0.79 1.09 s – – – – – – – –
Huang (2019)* 2019 0.90 0.93 0.92 – – – – – – – – –
PAN (Wang et al. 2019b) 2019 0.84 0.83 0.84 0.03 s – – – – – – – –
CRAFT (Baek et al. 2019) 2019 0.88 0.78 0.83 – – – – – – – – –
MSR (Xue et al. 2019) 2019 0.87 0.76 0.81 – – – – – – – – –
Tian et al. (2019) 2019 0.84 0.81 0.83 – – – – – – – – –
Wang et al. (2019d) 2019 0.85 0.82` 0.83 0.1 s – – – – – – – –
13
SBD (Liu et al. 2019f) 2019 0.89 0.80 0.84 0.31 s – – – – – – – –
Table 6 (continued)
Method Year MSRA-TD500 SVT COCO-Text
13
+
Mask TextSpotter (Liao et al. 2019a) 2019 – – – – – – – – 0.66 0.58 0.62 0.20 s
PuzzleNet (Liu et al. 2020a) 2020 0.86 0.86 0.86 – – – – – – – – –
T. Khan et al.

Table 7 Performance comparison in terms of text detection accuracy and inference speed (Ts) of state-of-the-art methods on curved text datasets in terms of different statisti-
cal evaluation metrics (in order of publication year)
Method Year CTW-1500 Total-text CUTE80
Risnumawan et al. (2014) 2014 – – – – – – – – 0.65 0.68 0.61 16.1 s

Total-text (Ch’ng and Chan 2017) 2017 – – – – 0.40 0.33 0.36 – – – – –
TextSnake (Long et al. 2018a) 2018 0.68 0.85 0.75 – 0.83 0.74 0.78 – – – – –
TextField (Xu et al. 2019a) 2018 0.83 0.80 0.81 – 0.81 0.80 0.80 0.02 s – – – –
FTSN (Dai et al. 2018) 2018 – – – – 0.84 0.78 0.81 0.25 s – – – –
SegLink++ (Tang et al. 2019) 2019 0.82 0.79 0.81 – 0.82 0.81 0.81 – – – – –
RCR-CNN (Zhu et al. 2019) 2019 0.83 0.83 0.83 – – – – – – – – –
Liu et al. (2019a) 2019 0.77 0.69 0.73 0.07 s 0.74 0.71 0.73 – – – – –
Ch’ng et al. (2019) 2019 0.86 0.62 0.72 – 0.78 0.68 0.73 0.31 s 0.66 0.64 0.65 –
PAN (Wang et al. 2019b) 2019 0.86 0.81 0.83 0.02 s 0.89 0.81 0.85 0.02 s – – – –
LOMO (Zhang et al. 2019) 2019 0.85 0.76 0.80 – 0.87 0.79 0.83 – – – – –
CRAFT (Baek et al. 2019) 2019 0.86 0.81 0.83 – 0.87 0.80 0.83 – – – – –
MSR (Xue et al. 2019) 2019 0.85 0.78 0.81 – 0.83 0.74 0.79 – – – – –
Tian et al. (2019) 2019 0.82 0.77 0.80 – – – – – – – – –

Wang et al. (2019d) 2019 0.80 0.80 0.80 0.1 s 0.81 0.76 0.78 0.1 s – – – –
Mask TextSpotter (Liao et al. 2019a) + 2019 – – – – 0.88 0.82 0.85 0.20 s – – – –
Qin et al. (2019b)+ 2019 – – – – 0.87 0.85 0.86 0.27 s – – – –
SPCNET (Xie et al. 2019) 2019 – – – – 0.83 0.82 0.82 – – – – –
Huang et al. (2019) 2019 0.86 0.83 0.85 – – – – – – – – –
Kobchaisawat et al. (2020) 2020 – – – – 0.88 0.79 0.83 0.05 s – – – –
PuzzleNet (Liu et al. 2020a) 2020 0.83 0.86 0.84 – – – – – – – –
ContourNet (Wang et al. 2020b) 2020 0.83 0.84 0.84 0.22 s 0.86 0.83 0.85 0.26 s – – – –
ISNet (Yang et al. 2020) 2020 0.85 0.81 0.83 0.06 s 0.85 0.82 0.83 0.04 s – – – –
13
T. Khan et al.
performed within text instance region that significantly improves the localization accuracy.
This framework is mainly suitable for horizontally aligned texts. However, this method
has also achieved impressive result for multi-oriented texts where two-stage approach
converts multi-oriented detection problem into easier horizontal text detection problem.
Recently, CRAFT (Baek et al. 2019) has achieved impressive result (F-measure of 95%)
on ICDAR’13 dataset due to its ability of individual character-level localization rather
than entire text instances. This model is more robust for long multi-oriented text instances,
which reflects in detection accuracy on ICDAR’15 dataset and curved texts as well.
Comparative performance of recent works on multi-lingual datasets are reported in
Table 5. Most of the recent works have achieved significant results on ICDAR’17-MLT
dataset. However, very few works have been reported in literature on ICDAR’19-MLT
dataset as this dataset has just been published. PMTD (Liu et al. 2019h) has achieved high
F-measure of 80% on ICDAR’17-MLT due to its boundary adhered text mask that accu-
rately localizes text instances of arbitrary shapes. It is observed that instance segmentation
based text detection frameworks with Mask R-CNN baseline (Liu et al. 2019h; Xie et al.
2019; Huang et al. 2019) work quite well for multi-lingual text detection. Very recently,
an arbitrary shape fast text detection framework (Kobchaisawat et al. 2020) adopted poly-
gon shape bounding box regression and obtains promising result on ICDAR’19-MLT data-
set with F-measure of 80%. This method overcomes the general rectangular bounding box
regression problem on arbitrary-shape text instances where unwanted background involve-
ment results in poor detection accuracy. Few works have been reported on Chinese scene
text images released in the ICDAR 2017 competition. To deal with Chinese multi-oriented
scene texts (RCTW-17), two separate sets of features are designed for both classification
and regression tasks by the authors of Liao et al. (2018b), where rotation-invariant features
and rotation-sensitive features are designed for bounding box classification and regression
respectively. LOMO (Zhang et al. 2019) reports its text detection performance on Chinese
scene texts and achieves better result compared to state-of-the-art methods.
Table 6 reports the text detection performance of existing methods on arbitrary-shaped
(especially long multi-oriented) and street view text instances. It has been observed that
most of the existing methods are limited to short text instances due to small receptive field
of CNN baseline, resulting in fragmented/partial text detection. Street view texts are not
confined to any particular area that certainly creates more challenges for text localization.
An edge-based traditional text detection method (Huang 2019) outperforms the previous
approaches on MSRA-TD500 dataset (F-measure of 92%). This method generates an edge
saliency map that eliminates complex backgrounds and exploits edge features that are not-
sensitive to uneven lighting and noise effects. It also works effectively on low-contrast
images. Recently, multi-box detection and semantic segmentation are processed in parallel
for scene text detection in Qin et al. (2019a), which achieves state-of-the-art result on SVT
images with F-measure of 80%. Text detection accuracy on COCO-text dataset is compara-
tively low than other benchmark datasets due to its versatility. An end-to-end text spotting
framework with Mask R-CNN (Liao et al. 2019a) achieves state-of-the art performance on
COCO-texts.
Performance comparison of different methods only on curved text is reported in Table 7.
Accurate detection of curved texts is considered to be most challenging for research com-
munity. Since last few years, researchers have been paying more attention on designing text
detection framework suitable for curved texts. Huang et al. (2019) effectively remove false
alarms near boundary of text instances that certainly improves the precision rate, resulting
highest F-measure of 85% on CTW-1500 curved text dataset. Qin et al. (2019b) obtained
state-of-the-art result on Total-text with F-measure of 86%. This method effectively detects
13
curved texts using text instance mask and polygon fitting method. Very few works have
been reported on CUTE80 dataset. Total-text (Ch’ng et al. 2019) obtains 65% of F-meas-
ure on CUTE80 using poly-FRCNN-3 baseline. As CUTE80 has mixed annotation (word
and line level), this method evaluated its performance after converting the annotation into
word-level.
Inference Speed It is an important measure to assess the usability of a text detection
method in practical scenario. Faster inference time reflects higher efficiency. In general,
existing state-of-the-art methods maintain a trade-off between model efficiency and detec-
tion accuracy (Liao et al. 2019b). It has been observed that most of the methods have
higher text detection accuracy but poor inference time due to heavy network architecture
and complex processing steps, which makes them unsuitable for real-time deployment.
Inference time for any model entirely depends on network architecture, parameters, image
size and other implementation particulars. Few works have reported different inference
time for the same model with varying image size, number of processing steps, and other
hyperparameters (Tang and Wu 2018; Jeon and Jeong 2020), that demonstrate the nature
of inference time for any model. In this context, performance comparison of existing meth-
ods according to inference time may not be very logical. After a careful study, it may be
observed that most of the existing methods obtain higher accuracy but low inference speed
when using multi-scale input image strategy, whereas for single-scale input image strat-
egy, scenario may not be the same (Liao et al. 2017; Lyu et al. 2018a). Recently, few real-
time text detectors (He et al. 2020; Liu et al. 2020b; Liao et al. 2019b) are found to give
more priority to processing speed without affecting the performance, which increases their
usability in real–world scenario. Moreover, for end-to-end text spotter, inference speed of
text spotting depends on that of the detection module. Finally, it is also observed that no
specific measurement unit is defined for inference speed, some use frame per second (FPS),
whereas some others report their results in per image processing time in millisecond. So, in
order to compare the inference speed in a common platform, we have converted everything
into per image processing time in second as far as practicable.
5.2 Discussion
In this paper, a categorical review on text detection from natural scene images has been
presented in a comprehensive way. Earlier, public datasets in this domain were mainly
developed considering horizontal texts, but gradually as text detection problem evolved
over time, text with multi-orientation and curved texts came into the scenario. Increasing
complexity of texts certainly demonstrates the wide-scope of research on multi-oriented
and multi-lingual scene images. Deep learning based methods perform significantly well
over all these complex datasets and achieve state-of-the-art results. Initially, single deep
network model was used for text detection (Liao et al. 2017). Later, it has been realized that
a single model is not sufficient to produce the desired result. Hence, different deep network
models are attached in a cascaded or other way for better performance (Zhong et al. 2019a,
b; Liu et al. 2019c; Yang et al. 2019).
General text proposal-based approaches mainly deal with horizontal and multi-oriented
texts. Multi-scale sliding window-based techniques have been proven to be good enough
for detecting horizontally aligned texts. But, for multi-oriented texts these techniques often
fail to detect the entire text regions. Consequently, multi-scale quadrilateral window based
technique is applied, which is introduced by DMPNet (Liu and Jin 2017) for tighter bound-
ing box detection of arbitrary-shaped and multi-oriented text instances. Later, rotational
13
T. Khan et al.
sliding window is adopted for rotation-invariant text detection as well (Ma et al. 2018).
Proposal-based approaches are faster than other state-of-the-art methods as direct bound-
ing box regression is performed on text instances instead of considering entire images.
These approaches mainly depend on fast object detection frameworks. Generally, proposal-
based methods yield limited performance to curved texts (Ma et al. 2018; Liu et al. 2018a;
Jiang et al. 2017; Zhou et al. 2017) using axis aligned bounding box regression techniques,
resulting in additional background involvement and improper detection of text instances.
Natural images have rich semantic information spread over the entire image that may help
to discriminate text instances more accurately from non-texts. For that reason, pixel-wise
semantic segmentation-based methods are adopted to segment text instances more effec-
tively within complex backgrounds. These methods are suitable for arbitrary-shaped and
curved-texts detection (Wang et al. 2019c; Liu et al. 2019e; Tian et al. 2019). However,
these methods sometimes fail to segment overlapping text instances in complex environ-
ment (Zhang et al. 2016; Yao et al. 2016).
Analysis of this survey reveals the fact that most of the researchers are just exploit-
ing the prowess of deep networks in an experimental point-of-view rather than trying
to link between a given problem to be solved and fundamental aspects of deep learning
approaches. It is observed that most of the works are implemented in a particular deep
model either by trial and error method to boost up the end results or deploying different
models in a cascaded way. Although, many researchers can argue in this context that this
kind of implementation is still proven to be revolutionary on scene text detection, but for
a research point of view, to design an efficient framework, one must realize the appropri-
ate characteristic and the ability of any deep network to solve the particular problem in a
granular level.
5.3 Shortcomings and open areas
Current research trend on scene text detection and recognition is largely plunged into deep
learning based models. Since last few years, research community is almost thrown them-
selves into a deep learning environment because entire community is witnessing a signifi-
cant improvement of performance of DNN models. However, some challenges still need to
be addressed that will make deep learning models more robust and effective, and as a result
of that both research community and our society will be benefited. Some of the potential
future scopes of scene text detection in deep learning era are presented in two groups -
major and minor potential scopes/future trends.
5.3.1 Major scopes
(1) Research on multi-lingual scene text Until recently, it has been observed that large
number of deep learning models have primarily focused on texts written in Latin script.
It is worth mentioning that Latin script has comparatively well-formed structure than
some other scripts in the world. Chinese, Arabic, Korean and other scripts have a com-
plex set of patterns. Relatively few works have been reported on Chinese texts (such as
RCTW Shi et al. 2017b) in the wild. Although, deep learning-based approaches have
been introduced in character recognition for Indian regional scripts such as Devanagari,
Bangla, Malayalam, etc., text detection from scene images containing such scripts
is not yet adequately explored. Complex structured scripts are more sensitive due to
font-style, orientation in scenes and hard to detect even using the deep learning based
13
models. So, one possible solution is to extract some intrinsic features from such com-
plex text-scripts and train a deep network model iteratively using those extracted feature
set. Also, a deep learning model can be designed, driven by large training dataset that
can be generated using artificial text synthesizer.
(2) Deployment of real-time system Most of the existing deep networks for text detection
problem are limited to standalone environment rather than being deployed in real-time
environment. Nowadays, text detection problem can be thought of in the context of real-
time target spotting. Therefore, designing more efficient real-time scene text detectors
may meet the expectation. High-end computational resources are prerequisites to many
deep learning models, which is a primary constraint to become deployable in real-time
scenario. Though, few text detection frameworks have been recently reported to have
high inference speed (Wang et al. 2019b; Liao et al. 2019b), these models require high
resources rendering them unsuitable in unconstrained environment. In that regard, a
low-resource, lightweight and fast deep network model with proper customization is
indispensable for future progress in a real-time scenario.
(3) Difficulty in detecting multi-scale text instances It is realized that many existing models
struggle to detect text instances with varying scales. Text detection frameworks often
fail to detect large texts due to small receptive field of DNN baseline, leading to partial
and inaccurate detection. In addition, text instances with highly spaced characters get,
sometimes, fragmented, which affects subsequent recognition. On the other side, text
detection for small text instances suffer in large-scale images. Moreover, it has been
observed that models often resize input images into different scales for processing,
which may reduce the size of small text instances further, resulting in inaccurate detec-
tion. Thus, multi-scale scene text detection is an important consideration to be adopted
by research fraternity in order to develop robust scene text detection frameworks.
(4) Inability of synthetic images for contextual representation of scene texts In the current
scenario, synthetic datasets play a major role in training a deep learning model. Syn-
thetic texts cannot fully represent the rich semantic information which lie in the real
scene images. It is also worth mentioning that data augmentation techniques are also
questionable to represent true variety of scene texts.
(5) Ambiguous text annotation Text annotation plays an essential and crucial role for proper
evaluation of text detection methods. However, multi-lingual datasets have ambiguous
annotation of text instances, e.g. Chinese and Japanese texts are annotated at text-
line level, whereas other scripts are mostly annotated at word level. Such ambiguous
annotation may hinder methods from being evaluated on even platform. Few datasets,
however, have mixed annotations i.e. both text-line level and word-level (e.g. CUTE80).
(6) Focus on video data Appropriate text detection from video frames may expand the
information retrieval system from multimedia resources. However, low availability of
video frames for text detection is one of the main constraints for slow growth in this
area. So, development of datasets containing video frames may be considered as a
significant move for future research.
(7) Less availability of character-level annotation Recently, few works have achieved high
accuracy for arbitrary-shaped and long-text detection using character-level localization
that require baseline network with small receptive area (Baek et al. 2019). However,
these kinds of frameworks require character-level annotation to fine-tune the network.
Unfortunately, dedicated datasets with character level annotation are hardly available.
13
T. Khan et al.
5.3.2 Minor scopes
(1) Usage of transfer learning Transfer learning, a recent development in the deep learning
area where a network model is designed for any specific task may be used for other
tasks as well, may lead to speed up the training and boost up the performance of deep
learning models. So, employing transfer learning approach in scene text detection may
reduce the computation expenses and also increase the performance.
(2) End-to-end system with mobile phone application In deep learning era, demand for
designing an end-to-end scene text detection model is increasing rapidly. In future,
designing such end-to-end scene text detectors embedded as mobile phone application
may increase its usage and popularity.
(3) Handling low-quality image Most of the deep learning models have shown high perfor-
mance on high-quality images that may not always be the case in real scenario. So, in
practical environment, it is necessary to design a model capable in handling low-quality
images due to low-resolution, noise, distortion, blur, etc.
(4) Feature engineering and deep learning Design of a text detection framework, where
handcrafted features can be used in parallel with automated deep leaning based fea-
tures, which may be more efficient than existing models.
6 Conclusion
Text detection from natural scene images has received significant attention among
researchers in the field of computer vision since last few decades due to wide scope of
real-world applications. Though, researchers have proposed various detection methods on
different types of scene texts, there are yet adequate scopes of improvement of overall per-
formance of scene text detection systems. This review is a modest attempt to summarize
different deep learning based approaches along with competing traditional methods used
in scene text detection and recognition. In addition, different DNN frameworks relevant to
text detection have been outlined. A comprehensive summary of public datasets on differ-
ent types of texts along with applicable evaluation frameworks is also presented. Finally,
performance comparison of different state-of-the-art methods on several benchmark data-
sets is reported, which would certainly help research fraternity analyze effectiveness of
each method. It is worth mentioning that a large number of methods have reported their
results on early ICDAR datasets comprised of only English texts. However, other datasets
having multi-lingual, multi-oriented, multi-scaled, curved and street-view images have put
forward more challenges to researchers. Comparatively fewer works have been found on
such datasets so far.
Besides reflecting the state-of-the-art of text detection and recognition with due focus
on deep learning based approaches, open issues and associated research problems have
been identified and reported. Furthermore, suggested scopes of work to address these open
issues have also been outlined, that should be looked into in order to build robust models
that can perform reasonably well on any kind of texts irrespective of language, orientation
and other challenges, which is essential for end-to-end text reading systems—a dire need in
current times.
Acknowledgements Authors are grateful to Department of Computer Science and Engineering, Aliah Uni-
versity for providing necessary support to carry out this work. Tauseef Khan is further grateful to University
13
Grant Commission (UGC), Govt. of India for granting financial support under the scheme of Maulana Azad
National Fellowship.
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghe-
mawat S (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. In:
arXiv:1603.04467
Ansari GJ, Shah JH, Yasmin M, Sharif M, Fernandes SL (2018) A novel machine learning approach for
scene text extraction. Future Gener Comput Syst 87:328–340
Baek Y, Lee B, Han D, Yun S, Lee H (2019) Character region awareness for text detection. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 9365–9374
Bagri N, Johari PK (2015) A comparative study on feature extraction using texture and shape for content
based image retrieval. Int J Adv Sci Technol 80(4):41–52
Bai B, Yin F, Liu CL (2013) Scene text localization using gradient local correlation. In: 12th international
conference on document analysis and recognition, pp 1380–1384
Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow I, Bergeron A, Bouchard N, Warde-Farley D, Ben-
gio Y (2012) Theano: new features and speed improvements. In: arXiv:1211.5590
Bernardin K, Stiefelhagen R (2008) Evaluating multiple object tracking performance: the CLEAR MOT
metrics. J Image Video Process 1
Busta M, Neumann L, Matas J (2017) Deep textspotter: an end-to-end trainable scene text localization and
recognition framework. In: Proceedings of the IEEE international conference on computer vision, pp
2204–2212
Ch’ng CK, Chan CS (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In:
14th international conference on document analysis and recognition, pp 935–942
Ch’ng CK, Chan CS, Liu CL (2019) Total-text: toward orientation robustness in scene text detection. In:
International journal on document analysis and recognition, pp 1–22 (In press)
Chen X, Yuille AL (2004) Detecting and reading text in natural scenes. In: IEEE conference on computer
vision and pattern recognition, vol 2, pp II–II
Chen H, Tsai SS, Schroth G, Chen DM, Grzeszczuk R, Girod B (2011) Robust text detection in natural
images with edge-enhanced maximally stable extremal regions. In: 18th IEEE international confer-
ence on image processing, pp 2609–2612
Cho H, Sung M, Jun B (2016) Canny text detector: fast and robust scene text localization algorithm. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition, pp 3566–3573
CIFAR-10 Dataset. https://www.cs.toronto.edu/~kriz/cifar.html. Accessed on 14 June 2020
Coates A, Carpenter B, Case C, Satheesh S, Suresh B, Wang T, Wu DJ, Ng AY (2011) Text detection and
character recognition in scene images with unsupervised feature learning. In: IEEE international con-
ference on document analysis and recognition, pp 440–445
da Silveira TL, Kozakevicius AJ, Rodrigues CR (2017) Single-channel EEG sleep stage classification based
on a streamlined set of statistical features in wavelet domain. Med Biol Eng Comput 55(2):343–352
Dai Y, Huang Z, Gao Y, Xu Y, Chen K, Guo J, Qiu W (2018) Fused text segmentation networks for multi-
oriented scene text detection. In: 24th international conference on pattern recognition, pp 3604–3609
Deng D, Liu H, Li X, Cai D (2018) Pixellink: detecting scene text via instance segmentation. In: 32th inter-
national conference of atrificial intelligence AAAI, pp 6773–6780
Dey S, Shivakumara P, Raghunandan KS, Pal U, Lu T, Kumar GH, Chan CS (2017) Script independent
approach for multi-oriented text detection in scene image. Neurocomputing 242:96–112
Epshtein B, Ofek E, Wexler Y (2010) Detecting text in natural scenes with stroke width transform. In: IEEE
computer society conference on computer vision and pattern recognition, pp 2963–2970
Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual
object classes challenge: a retrospective. Int J Comput Vis 111(1):98–136
Fathi A, Wojna Z, Rathod V, Wang P, Song HO, Guadarrama S, Murphy KP (2017) Semantic instance seg-
mentation via deep metric learning. In: arXiv:1703.10277
Feng W, He W, Yin F, Zhang XY, Liu CL (2019) TextDragon: an end-to-end framework for arbitrary shaped
text spotting. In: Proceedings of the IEEE international conference on computer vision, pp 9076–9085
Fogel I, Sagi D (1989) Gabor filters as texture discriminator. Biol Cybern 61(2):103–113
Francis LM, Sreenath N (2017) TEDLESS–Text detection using least-square SVM from natural scene. J
King Saud Univ Comput Inf Sci 29(4)
13
T. Khan et al.
Fu CY, Liu W, Ranga A, Tyagi A, Berg AC (2017) DSSD: deconvolutional single shot detector. In: arXiv
:1701.06659
Gao J, Wang Q, Yuan Y (2019) Convolutional regression network for multi-oriented text detection. IEEE
Access 7:96424–96433
Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision,
pp 1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and
semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern rec-
ognition, pp 580–587
Gllavata J, Ewerth R, Freisleben B (2004) Text detection in images based on unsupervised classification of
high-frequency wavelet coefficients. In: 17th international conference on pattern recognition, vol 1,
pp 425–428
Google Street View. http://maps.google.com
Greenhalgh J, Mirmehdi M (2012) Real-time detection and recognition of road traffic signs. IEEE Trans
Intell Transp Syst 13(4):1498–1506
Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: IEEE
conference on computer vision and pattern recognition, pp 2315–2324
He T, Huang W, Qiao Y, Yao J (2016a) Text-attentional convolutional neural network for scene text detec-
tion. IEEE Trans Image Process 25(6):2529–2541
He K, Zhang X, Ren S, Sun J (2016b) Deep residual learning for image recognition. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp 770–778
He D, Yang X, Liang C, Zhou Z, Ororbi AG, Kifer D, Lee Giles C (2017a) Multi-scale FCN with cascaded
instance aware segmentation for arbitrary oriented word spotting in the wild. In: IEEE conference on
computer vision and pattern recognition, pp 3519–3528
He P, Huang W, He T, Zhu Q, Qiao Y, Li X (2017b) Single shot text detector with regional attention. In:
IEEE international conference on computer vision, pp 3047–3055
He W, Zhang XY, Yin F, Liu CL (2017c) Deep direct regression for multi-oriented scene text detection. In:
IEEE international conference on computer vision, pp 745–753
He K, Gkioxari G, Dollár P, Girshick R (2017d) Mask R-CNN. In: Proceedings of the IEEE international
conference on computer vision, pp 2961–2969
He T, Tian Z, Huang W, Shen C, Qiao Y, Sun C (2018a) An end-to-end textspotter with explicit alignment
and attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
5020–5029
He W, Zhang XY, Yin F, Liu CL (2018b) Multi-oriented and multi-lingual scene text detection with direct
regression. IEEE Trans Image Process 27(11):5406–5419
He W, Zhang XY, Yin F, Luo Z, Ogier JM, Liu CL (2020) Realtime multi-scale scene text detection with
scale-based region proposal network. Pattern Recognit 98:107026
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang X (2019) Automatic video scene text detection based on saliency edge map. Multimed Tools Appl
78(24):34819–34838
Huang W, Lin Z, Yang J, Wang J (2013) Text localization in natural images using stroke feature transform
and text covariance descriptors. In: IEEE international conference on computer vision, pp 1241–1248
Huang W, Qiao Y, Tang X (2014) Robust scene text detection with convolution neural network induced
mser trees. In: European conference on computer vision, pp 497–511
Huang L, Yang Y, Deng Y, Yu Y (2015) Densebox: unifying landmark localization with end to end object
detection. In: arXiv:1509.04874
Huang Z, Zhong Z, Sun L, Huo Q (2019) Mask R-CNN with pyramid attention network for scene text detec-
tion. In: 2019 IEEE winter conference on applications of computer vision, pp 764–772
Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2016) Reading text in the wild with convolutional neu-
ral networks. Int J Comput Vis 116(1):1–20
Jeon M, Jeong YS (2020) Compact and accurate scene text detector. Appl Sci 10(6):2096
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: con-
volutional architecture for fast feature embedding. In: 22nd international conference on multimedia,
pp 675–678
Jiang Y, Zhu X, Wang X, Yang S, Li W, Wang H, Fu P, Luo Z (2017) R 2CNN: rotational region CNN for
orientation robust scene text detection. In: arXiv:1706.09579
Jiang M, Cheng J, Chen M, Ku X (2018) An improved text localization method for natural scene images. J
Phys 960(1):012027
Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R (2019) A survey of deep learning-based object detec-
tion. IEEE Access 7:128837–128868
13
Joan SF, Valli S (2019) A survey on text information extraction from born-digital and scene text images.
Proc Natl Acad Sci India Sect A 89(1):77–101
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda LG, Mestre SR, Mas J, Mota DF, Almazan JA, De
Las Heras LP (2011) ICDAR 2011 robust reading competition. In: 12th international conference on
document analysis and recognition, pp 1484–1493
Karatzas D, Shafait F, Uchida S, Iwamura M, Bigorda LG, Mestre SR, Mas J, Mota DF, Almazan JA, De
Las Heras LP (2013) ICDAR 2013 robust reading competition. In: 12th international conference on
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L,
Chandrasekhar VR, Lu S, Shafait F (2015) ICDAR 2015 competition on robust reading. In: 13th
international conference on document analysis and recognition, pp 1156–1160
Kasturi R, Goldgof D, Soundararajan P, Manohar V, Garofolo J, Bowers R, Boonstra M, Korzhova V, Zhang
J (2008) Framework for performance evaluation of face, text, and vehicle detection and tracking in
video: data, metrics, and protocol. IEEE Trans Pattern Anal Mach Intell 31(2):319–336
Ketkar N (2017) Introduction to keras. In: Deep learning with python, pp 97–111
Khan T, Mollah AF (2019a) Distance transform-based stroke feature descriptor for text non-text classifica-
tion. In: Recent developments in machine learning and data analytics, pp 189–200
Khan T, Mollah AF (2019b) AUTNT-A component level dataset for text non-text classification and
benchmarking with novel script invariant feature descriptors and D-CNN. Multimed Tools Appl
78(22):32159–32186
Khan FA, Tahir MA, Khelifi F, Bouridane A, Almotaeryi R (2017) Robust off-line text independent writer
identification using bagged discrete cosine transform features. Expert Syst Appl 71:404–415
Kim KH, Hong S, Roh B, Cheon Y, Park M (2016) Pvanet: deep but lightweight neural networks for real-
time object detection. In: arXiv:1608.08021
Kobchaisawat T, Chalidabhongse TH, Satoh SI (2020) Scene text detection with polygon offsetting and bor-
der augmentation. Electronics 9(1):117
Kong S, Fowlkes CC (2018) Recurrent pixel embedding for instance grouping. In: Proceedings of the IEEE
conference on computer vision and pattern recognition, pp 9018–9028
Koo HI, Kim DH (2013) Scene text detection via connected component clustering and nontext filtering.
IEEE Trans Image Process 22(6):2296–2305
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural net-
works. In: Advances in neural information processing systems, pp 1097–1105
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition.
IEEE 86(11):2278–2324
Lee S, Cho MS, Jung K, Kim JH (2010) Scene text extraction with edge constraint and text collinearity. In:
20th international conference on pattern recognition, pp 3983–3986
Lee JJ, Lee PH, Lee SW, Yuille A, Koch C (2011a) Adaboost for text detection in natural scene. In: 2011
International conference on document analysis and recognition, pp 429–434
Lee JJ, Lee PH, Lee SW, Yuille A, Koch C (2011b) Adaboost for text detection in natural scene. In: Interna-
tional conference on document analysis and recognition, pp 429–434
Lee CY, Baek Y, Lee H (2019) TedEval: a fair evaluation metric for scene text detectors. In: arXiv
:1907.01227
Leibe B, Matas J, Sebe N, Welling M (eds) (2016) Computer vision—ECCV 2016. In: 14th European con-
ference, vol 9908
Li Y, Lu H (2012) Scene text detection via stroke width. In: 21st international conference on pattern recog-
nition, pp 681–684
Li H, Wang P, Shen C (2017) Towards end-to-end text spotting with convolutional recurrent neural net-
works. In: Proceedings of the IEEE international conference on computer vision, pp 5238–5246
Li X, Wang W, Hou W, Liu RZ, Lu T, Yang J (2018) Shape robust text detection with progressive scale
expansion network. In: arXiv:1806.02559
Liang J, Phillips IT, Haralick RM (1997) Performance evaluation of document layout analysis algorithms on
the UW data set. Int Soc Opt Photonics Doc Recognit 3027:149–160
Liang G, Shivakumara P, Lu T, Tan CL (2015) A new wavelet-Laplacian method for arbitrarily-oriented
character segmentation in video text lines. In: 13th international conference on document analysis and
recognition, pp 926–930
Liao M, Shi B, Bai X, Wang X, Liu W (2017) TextBoxes: a fast text detector with a single deep neural net-
work. In: International conference of AAAI, pp 4161–4167
Liao M, Shi B, Bai X (2018a) Textboxes++: a single-shot oriented scene text detector. IEEE Trans Image
Process 27(8):3676–3690
13
T. Khan et al.
Liao M, Zhu Z, Shi B, Xia GS, Bai X (2018b) Rotation-sensitive regression for oriented scene text detec-
tion. In: IEEE conference on computer vision and pattern recognition, pp 5909–5918
Liao M, Lyu P, He M, Yao C, Wu W, Bai X (2019a) Mask textspotter: an end-to-end trainable neural net-
work for spotting text with arbitrary shapes. In: IEEE transactions on pattern analysis and machine
intelligence. https://doi.org/10.1109/tpami.2019.2937086
Liao M, Wan Z, Yao C, Chen K, Bai X (2019b) Real-time scene text detection with differentiable binariza-
tion. In: arXiv:1911.08947
Liao M, Song B, Long S, He M, Yao C, Bai X (2020) SynthText3D: synthesizing scene text images from
3D virtual worlds. Sci China Inf Sci 63(2):120105
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco:
Common objects in context. In: European conference on computer vision, pp 740–755
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object
detection. In: IEEE conference on computer vision and pattern recognition, pp 2117–2125
Lin H, Yang P, Zhang F (2019) Review of scene text detection and recognition. In: Archives of computa-
tional methods in engineering, pp 1–22
Liu Y, Jin L (2017) Deep matching prior network: toward tighter multi-oriented text detection. In: IEEE
international conference on computer vision and pattern recognition, pp 3454–3461
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016a) SSD: single shot multibox
detector. In: European conference on computer vision, pp 21–37
Liu L, Lao S, Fieguth PW, Guo Y, Wang X, Pietikäinen M (2016b) Median robust extended local binary
pattern for texture classification. IEEE Trans Image Process 25(3):1368–1381
Liu L, Fieguth P, Guo Y, Wang X, Pietikäinen M (2017) Local binary features for texture classification: tax-
onomy and experimental study. Pattern Recognit 62:135–160
Liu Z, Lin G, Yang S, Feng J, Lin W, Goh WL (2018a) Learning markov clustering networks for scene
text detection. In: IEEE international conference of computer vision and pattern recognition, pp
6936–6944
Liu S, Qi L, Qin H, Shi J, Jia J (2018b) Path aggregation network for instance segmentation. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 8759–8768
Liu X, Liang D, Yan S, Chen D, Qiao Y, Yan J (2018c) FOTS: fast oriented text spotting with a unified
network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
5676–5685
Liu Y, Jin L, Zhang S, Luo C, Zhang S (2019a) Curved scene text detection via transverse and longitudinal
sequence connection. Pattern Recognit 90:337–345
Liu Y, Jin L, Xie Z, Luo C, Zhang S, Xie L (2019b) Tightness-aware evaluation protocol for scene text
detection. In: IEEE Conference on computer vision and pattern recognition, pp 9612–9620
Liu F, Chen C, Gu D, Zheng J (2019c) FTPN: scene text detection with feature pyramid based text proposal
network. IEEE Access 7:44219–44228
Liu X, Meng G, Pan C (2019d) Scene text detection and recognition with advances in deep learning: a sur-
vey. Int J Doc Anal Recognit 22(2):143–162
Liu Z, Lin G, Yang S, Liu F, Lin W, Goh WL (2019e) Towards robust curve text detection with conditional
spatial expansion. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 7269–7278
Liu Y, Zhang S, Jin L, Xie L, Wu Y, Wang Z (2019f) Omnidirectional scene text detection with sequential-
free box discretization. In: arXiv:1906.02371
Liu X, Zhang R, Zhou Y, Jiang Q, Song Q, Li N, Zhou K, Wang L, Wang D, Liao M, Yang M (2019g)
ICDAR 2019 robust reading challenge on reading chinese text on signboard. In: arXiv:1912.09641
Liu J, Liu X, Sheng J, Liang D, Li X, Liu Q (2019h) Pyramid mask text detector. In: arXiv:1903.11800
Liu H, Guo A, Jiang D, Hu Y, Ren B (2020a) PuzzleNet: scene text detection by segment context graph
learning. In: arXiv:2002.11371
Liu Y, Chen H, Shen C, He T, Jin L, Wang L (2020b) ABCNet: real-time scene text spotting with adaptive
bezier-curve network. In: arXiv:2002.10200
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: IEEE
Long S, Ruan J, Zhang W, He X, Wu W, Yao C (2018a) TextSnake: a flexible representation for detecting
text of arbitrary shapes. In: European conference on computer vision, pp 20–36
Long S, He X, Ya C (2018b) Scene text detection and recognition: the deep learning era. In: arXiv
:1811.04256
Lu S, Chen T, Tian S, Lim JH, Tan CL (2015) Scene text extraction based on edges and support vector
regression. Int J Doc Anal Recognit 18(2):125–135
13
Lucas SM (2005) ICDAR 2005 text locating competition results. In: 8th international conference on
Lucas SM, Panaretos A, Sosa L, Tang A, Wong S, Young R (2003) ICDAR 2003 robust reading compe-
titions. In: 7th international conference on document analysis and recognition, pp 682–687
Lyu P, Yao C, Wu W, Yan S, Bai X (2018a) Multi-oriented scene text detection via corner localiza-
tion and region segmentation. In: IEEE conference on computer vision and pattern recognition, pp
7553–7563
Lyu P, Liao M, Yao C, Wu W, Bai X (2018b) Mask textspotter: an end-to-end trainable neural network
for spotting text with arbitrary shapes. In: Proceedings of the European conference on computer
vision, pp 67–83
Ma J, Shao W, Ye H, Wang L, Wang H, Zheng Y, Xue X (2018) Arbitrary-oriented scene text detection
via rotation proposals. IEEE Trans Multimed 20(11):3111–3122
Ma C, Sun L, Zhong Z, Huo Q (2020) ReLaText: exploiting visual relationships for arbitrary-shaped
scene text detection with graph convolutional networks. In: arXiv:2003.06999
Maitra DS, Bhattacharya U, Parui SK (2015) CNN based common approach to handwritten character
recognition of multiple scripts. In: 13th international conference on document analysis and recog-
nition, pp 1021–1025
Majhi B, Pujari P (2018) On development and performance evaluation of novel odia handwritten digit
recognition methods. Arab J Sci Eng 43(8):3887–3901
Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE
Trans Pattern Anal Mach Intell 7:674–693
Manjusha K, Kumar MA, Soman KP (2018) Reduced scattering representation for Malayalam character
recognition. Arab J Sci Eng 43(8):4315–4326
Mishra A, Alahari K, Jawahar CV (2012) Scene text recognition using higher order language priors. In:
HAL
Mitchell T (1999) The 20 newsgroups text dataset
Mollah AF, Basu S, Nasipuri M (2012) Text detection from camera captured images using a novel fuzzy-
based technique. In: 3rd international conference on emerging applications of information technol-
ogy, pp 291–294
Mosleh A, Bouguila N, Hamza AB (2012) Image text detection using a bandlet-based edge detector and
stroke width transform. In: British machine vision conference, pp 1–12
Nayef N, Yin F, Bizid I, Choi H, Feng Y, Karatzas D, Luo Z, Pal U, Rigaud C, Chazalon J, Khlif W (2017)
ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification-
rrc-mlt. In: 14th IAPR international conference on document analysis and recognition, pp 1454–1459
Nayef N, Patel Y, Busta M, Chowdhury PN, Karatzas D, Khlif W, Matas J, Pal U, Burie JC, Liu CL, Ogier
JM (2019) ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recogni-
tion–RRC-MLT-2019. In: IAPR international conference of document analysis and recognition
Neumann L, Matas J (2010) A method for text localization and recognition in real-world images. In:
Asian conference on computer vision, pp 770–783
Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Proceedings of the
Neycharan JG, Ahmadyfard A (2018) Edge color transform: a new operator for natural scene text locali-
zation. Multimed Tools Appl 77(6):7615–7636
Niconico. http://www.nicovideo.jp
Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In Proceed-
ings of the IEEE international conference on computer vision, pp 1520–1528
Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification
based on featured distributions. Pattern Recognit 29(1):51–59
Pan YF, Hou X, Liu CL (2010) A hybrid approach to detect and localize texts in natural scene images.
Paul S, Saha S, Basu S, Saha PK, Nasipuri M (2019) Text localization in camera captured images using
fuzzy distance transform based adaptive stroke filter. Multimed Tools Appl 78(13):18017–18036
Qiao L, Tang S, Cheng Z, Xu Y, Niu Y, Pu S, Wu F (2020) Text perceptron: towards end-to-end arbi-
trary-shaped text spotting. In: arXiv:2002.06820
Qin S, Manduchi R (2017) Cascaded segmentation-detection networks for word-level text spotting. In:
14th international conference on document analysis and recognition, pp 1275–1282
Qin H, Zhang H, Wang H, Yan Y, Zhang M, Zhao W (2019a) An algorithm for scene text detection
using multibox and semantic segmentation. Appl Sci 9(6):1054
Qin S, Bissacco A, Raptis M, Fujii Y, Xiao Y (2019b) Towards unconstrained end-to-end text spotting. In:
Proceedings of the IEEE international conference on computer vision, pp 4704–4714
13
T. Khan et al.
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection.
In: IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region pro-
posal networks. In: Advances in neural information processing systems, pp 91–99
Richardson E, Azar Y, Avioz O, Geron N, Ronen T, Avraham Z, Shapiro S (2019) It’s all about the scale–
efficient text detection using adaptive scaling. In: arXiv:1907.12122
Risnumawan A, Shivakumara P, Chan CS, Tan CL (2014) A robust arbitrary text detection system for natu-
ral scene images. Expert Syst Appl 41(18):8027–8048
Saha S, Chakraborty N, Kundu S, Paul S, Mollah AF, Basu S, Sarkar R (2020) Multi-lingual scene text
detection and language identification. Pattern Recognit Lett 138:16–22
Sain A, Bhunia AK, Roy PP, Pal U (2018) Multi-oriented text detection and verification in video frames and
scene images. Neurocomputing 275:1531–1549
Sherstinsky A (2018) Fundamentals of recurrent neural network (RNN) and long short-term memory
(LSTM) network. In: arXiv:1808.03314
Shi C, Wang C, Xiao B, Zhang Y, Gao S (2013) Scene text detection using graph model built upon maxi-
mally stable extremal regions. Pattern Recognit Lett 34(2):107–116
Shi B, Bai X, Belongie S (2017a) Detecting oriented text in natural images by linking segments. In: IEEE
Shi B, Yao C, Liao M, Yang M, Xu P, Cui L, Belongie S, Lu S, Bai X (2017b) ICDAR 2017 competition on
reading chinese text in the wild (rctw-17). In: 14th IAPR international conference on document analy-
sis and recognition, pp 1429–1434
Shivakumara P, Phan TQ, Tan CL (2010) A Laplacian approach to multi-oriented text detection in video.
IEEE Trans Pattern Anal Mach Intell 33(2):412–419
Shivakumara P, Roy S, Jalab HA, Ibrahim RW, Pal U, Lu T, Khare V, Wahab AWBA (2019) Fractional
means based method for multi-oriented keyword spotting in video/scene/license plate images. Expert
Syst Appl 118:1–19
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In:
arXiv:1409.1556
Song X, Wu Y, Wang W, Lu T (2020) TK-text: multi-shaped scene text detection via instance segmentation.
In: Proceedings of the international conference on multimedia modeling, pp 201–213
Sun Y, Zhang C, Huang Z, Liu J, Han J, Ding E (2018) Textnet: irregular text reading from images with an
end-to-end trainable network. In: Proceedings of the Asian conference on computer vision, pp 83–99
Sun Y, Liu J, Liu W, Han J, Ding E, Liu J (2019) Chinese street view text: large-scale Chinese text reading
with partially supervised learning. In: Proceedings of the IEEE international conference on computer
vision, pp 9086–9095
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015)
Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pat-
tern recognition, pp 1–9
Tang Y, Wu X (2017) Scene text detection and segmentation based on cascaded convolution neural net-
works. IEEE Trans Image Process 26(3):1509–1520
Tang Y, Wu X (2018) Scene text detection using superpixel-based stroke feature transform and deep learn-
ing based region classification. IEEE Trans Multimed 20(9):2276–2288
Tang J, Yang Z, Wang Y, Zheng Q, Xu Y, Bai X (2019) SegLink++: detecting dense and arbitrary-shaped
scene text by instance-aware component grouping. In: Pattern recognition, vol 96, pp 106954
Tian Z, Huang W, He T, He P, Qiao Y (2016a) Detecting text in natural image with connectionist text pro-
posal network. In: European conference on computer vision, pp 56–72
Tian S, Bhattacharya U, Lu S, Su B, Wang Q, Wei X, Lu Y, Tan CL (2016b) Multilingual scene character
recognition with co-occurrence of histogram of oriented gradients. Pattern Recognit 51:125–134
Tian Z, Shu M, Lyu P, Li R, Zhou C, Shen X, Jia J (2019) Learning shape-aware embedding for scene
text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp
4234–4243
Tychsen-Smith L, Petersson L (2017) Denet: scalable real-time object detection with directed sparse sam-
pling. In: IEEE international conference of computer vision, pp 428–436
Van Dongen SM (2000) Graph clustering by flow simulation (Doctoral dissertation)
Veit A, Matera T, Neumann L, Matas J, Belongie S (2016) Coco-text: Dataset and benchmark for text detec-
tion and recognition in natural images. In: arXiv:1601.07140
Wang K, Belongie S (2010) Word spotting in the wild. In: European conference on computer vision, pp
591–604
Wang K, Babenko B, Belongie S (2011) End-to-end scene text recognition. In: IEEE international con-
ference on computer vision, pp 1457–1464
13
Wang T, Wu DJ, Coates A, Ng AY (2012) End-to-end text recognition with convolutional neural net-
works. In: 21st international conference on pattern recognition, pp 3304–3308
Wang X, Chen K, Huang Z, Yao C, Liu W (2017) Point linking network for object detection. In: arXiv
:1706.03646
Wang K, Li G, Liu X, Yan J, Li S, Huang H (2018) Natural scene text detection based on MSER. In: 3rd
international conference on communications, information management and network security
Wang X, Feng X, Xia Z (2019a) Scene video text tracking based on hybrid deep text detection and lay-
out constraint. Neurocomputing 363:223–235
Wang W, Xie E, Song X, Zang Y, Wang W, Lu T, Yu G, Shen C (2019b) Efficient and accurate arbitrary-
shaped text detection with pixel aggregation network. In: Proceedings of the IEEE international
conference on computer vision, pp 8440–8449
Wang P, Zhang C, Qi F, Huang Z, En M, Han J, Liu J, Ding E, Shi G (2019c) A single-shot arbitrarily-
shaped text detector based on context attended multi-task learning. In: Proceedings of the 27th
ACM international conference on multimedia, pp 1277–1285
Wang X, Jiang Y, Luo Z, Liu CL, Choi H, Kim S (2019d) Arbitrary shape scene text detection with
adaptive text region representation. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp 6449–6458
Wang Y, Xie H, Fu Z, Zhang Y (2019e) DSRN: a deep scale relationship network for scene text detec-
tion. In: Proceedings of the 28th international joint conference on artificial intelligence. AAAI
Press, pp 947–953
Wang H, Lu P, Zhang H, Yang M, Bai X, Xu Y, He M, Wang Y, Liu W (2019f) All you need is bound-
ary: toward arbitrary-shaped text spotting. In: arXiv:1911.09550
Wang S, Liu Y, He Z, Wang Y, Tang Z (2020a) A quadrilateral scene text detector with two-stage net-
work architecture. Pattern Recognit 102:107230
Wang Y, Xie H, Zha Z, Xing M, Fu Z, Zhang Y (2020b) ContourNet: taking a further step toward accu-
rate arbitrary-shaped scene text detection. In: arXiv:2004.04940
Welcome to Lasagne. https://lasagne.readthedocs.io/en/latest/
Which GPU(s) to get for deep learning: my experience and advice for using GPUs in deep learning, https
://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/. Accessed on 3 June 2020
Wolf C, Jolion JM (2006) Object count/area graphs for the evaluation of object detection and segmenta-
tion algorithms. Int J Doc Anal Recognit 8(4):280–296
Wu Y, Natarajan P (2017) Self-organized text detection with minimal post-processing via border learn-
ing. In: IEEE international conference of computer vision, pp 5000–5009
Xie E, Zang Y, Shao S, Yu G, Yao C, Li G (2019) Scene text detection with supervised pyramid context
network. In: Proceedings of the AAAI conference on artificial intelligence, pp 9038–9045
Xu Y, Wang Y, Zhou W, Wang Y, Yang Z, Bai X (2019a) TextField: learning a deep direction field for
irregular scene text detection. IEEE Trans Image Process 28(11):5566–5579
Xu Y, Duan J, Kuang Z, Yue X, Sun H, Guan Y, Zhang W (2019b) Geometry normalization networks for
accurate scene text detection. In: arXiv:1909.00794
Xue C, Lu S, Zhang W (2019) MSR: multi-scale shape regression for scene text detection. In: arXiv
:1901.02596
Yang Q, Cheng M, Zhou W, Chen Y, Qiu M, Lin W, Chu W (2018) Inceptext: a new inception-text mod-
ule with deformable psroi pooling for multi-oriented scene text detection. In: arXiv:1805.01167
Yang P, Zhang F, Yang G (2019) A fast scene text detector using knowledge distillation. IEEE Access
7:22588–22598
Yang P, Yang G, Gong X, Wu P, Han X, Wu J, Chen C (2020) Instance segmentation network with self-
distillation for scene text detection. IEEE Access 8:45825–45836
Yao C, Bai X, Liu W, Ma Y, Tu Z (2012) Detecting texts of arbitrary orientations in natural images. In:
Yao C, Bai X, Sang N, Zhou X, Zhou S, Cao Z (2016) Scene text detection via holistic, multi-channel
prediction. In: arXiv:1606.09002
Yi C, Tian Y (2011) Text string detection from natural scenes by structure-based partition and grouping.
Yi C, Tian Y (2012) Localizing text in scene images by boundary clustering, stroke segmentation, and
string fragment classification. IEEE Trans Image Process 21(9):4256–4268
Zamberletti A, Noce L, Gallo I (2014) Text localization based on fast feature pyramids and multi-res-
olution maximally stable extremal regions. In: Asian conference on computer vision, pp 91–105
Zeiler MD, Taylor GW, Fergus R (2011) Adaptive deconvolutional networks for mid and high level feature
learning. In: 2011 International conference on computer vision, pp 2018–2025
13
T. Khan et al.
Zhan F, Lu S, Xue C (2018) Verisimilar image synthesis for accurate detection and recognition of texts in
scenes. In: Proceedings of the European conference on computer vision, pp 249–266
Zhang Z, Zhang C, Shen W, Yao C, Liu W, Bai X (2016) Multi-oriented text detection with fully convolu-
tional networks. In: IEEE international conference on computer vision and pattern recognition, pp
4159–4167
Zhang S, Wen L, Bian X, Lei Z, Li SZ (2018) Single-shot refinement neural network for object detection.
In: IEEE conference on computer vision and pattern recognition, pp 4203–4212
Zhang C, Liang B, Huang Z, En M, Han J, Ding E, Ding X (2019) Look more than once: an accurate detec-
tor for text of arbitrary shapes. In: Proceedings of the IEEE conference on computer vision and pat-
tern recognition, pp 10552–10561
Zhong Z, Jin L, Zhang S, Feng Z (2016) Deeptext: a unified framework for text proposal generation and text
detection in natural images. arXiv:1605.07314
Zhong Z, Sun L, Huo Q (2019a) An anchor-free region proposal network for Faster R-CNN based text
detection approaches. Int J Doc Anal Recognit 22(3):315–327
Zhong Z, Sun L, Huo Q (2019b) Improved localization accuracy by LocNet for faster R-CNN based text
detection in natural scene images. In: Pattern recognition, p 106986
Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, Liang J (2017) EAST: an efficient and accurate scene
text detector. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
5551–5560
Zhu Y, Yao C, Bai X (2016) Scene text detection and recognition: recent advances and future trends. Front
Comput Sci 10(1):19–36
Zhu Y, Ma C, Du J (2019) Rotated cascade R-CNN: a shape robust detector with coordinate regression. In:
Pattern recognition, vol 96
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
13
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center
GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers
and authorised users (“Users”), for small-scale personal, non-commercial use provided that all
copyright, trade and service marks and other proprietary notices are maintained. By accessing,
sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of
use (“Terms”). For these purposes, Springer Nature considers academic use (by researchers and
students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and
conditions, a relevant site licence or a personal subscription. These Terms will prevail over any
conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription (to
the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of
the Creative Commons license used will apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may
also use these personal data internally within ResearchGate and Springer Nature and as agreed share
it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not otherwise
disclose your personal data outside the ResearchGate or the Springer Nature group of companies
unless we have your permission as detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial
use, it is important to note that Users may not:
1. use such content for the purpose of providing other users with access on a regular or large scale
basis or as a means to circumvent access control;
2. use such content where to do so would be considered a criminal or statutory offence in any
jurisdiction, or gives rise to civil liability, or is otherwise unlawful;
3. falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association
unless explicitly agreed to by Springer Nature in writing;
4. use bots or other automated methods to access the content or redirect messages
5. override any security feature or exclusionary protocol; or
6. share the content in order to create substitute for Springer Nature products or services or a
systematic database of Springer Nature journal content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a
product or service that creates revenue, royalties, rent or income from our content or its inclusion as
part of a paid for service or for other commercial gain. Springer Nature journal content cannot be
used for inter-library loans and librarians may not upload Springer Nature journal content on a large
scale into their, or any other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not
obligated to publish any information or content on this website and may remove it or features or
functionality at our sole discretion, at any time with or without notice. Springer Nature may revoke
this licence to you at any time and remove access to any copies of the Springer Nature journal content
which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or
guarantees to Users, either express or implied with respect to the Springer nature journal content and
all parties disclaim and waive any implied warranties or warranties imposed by law, including
merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published
by Springer Nature that may be licensed from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a
regular basis or in any other manner not expressly permitted by these Terms, please contact Springer
Nature at
[email protected]

Deep Learning Approaches To Scene Text Detection A

Uploaded by

Copyright:

Available Formats

Deep Learning Approaches To Scene Text Detection A

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Deep Learning Approaches To Scene Text Detection A

Uploaded by

Copyright:

Available Formats

Artificial Intelligence Review

Deep learning approaches to scene text detection:

Tauseef Khan1 · Ram Sarkar2 · Ayatullah Faruk Mollah1

© Springer Nature B.V. 2021

In real-world, human beings are surrounded by instances of text objects in imagery

* Ayatullah Faruk Mollah

Information retrieval from multimedia. Text embedded in images signifies important

(1) Diverse nature of text in the wild

(2) Complex imagery background

(3) Improper image acquisition

(4) Text-specific algorithms

1.3 Pre‑deep learning era

1.4 Advent of deep learning

1.5 Motivation and contribution

• A modest attempt is made to present comprehensive review on different deep learning

The paper is organized as follows. In Sect. 2, a detailed review of the state-of-the-art

2 Text detection approaches based on deep networks

2.1 Regression based methods

2.1.1 Proposal based methods

Arbitrary-oriented Texts Text proposal-based methods efficiently and accurately local-

2.1.3 Ascendancy and hindrances

2.2 Segmentation based methods

two strategies—(a) Semantic segmentation methods and (b) Instance-aware segmentation

Objective of semantic segmentation is grouping pixels in a meaningful way in order to seg-

2.2.3 Ascendancy and hindrances

2.4 End‑to‑end text spotters

3 Deep learning frameworks for scene text detection

3.1 Semantic segmentation based frameworks

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

3.1.1 Fully convolutional network

FCN is used as a backbone architecture for most of the semantic segmentation-based

Another type of semantic segmentation network that performs pixel-wise classification is

3.2 Object detection/instance segmentation based frameworks

3.2.1 Two‑stage object detection networks

object detector and instance segmentation-based framework. Mask R-CNN is suitable

3.2.2 One‑stage object detection networks

3.3 Hardware and software requirements

4 Benchmark datasets and evaluation protocols

4.1 Standard datasets for text detection

widely diversified images captured in the unconstrained environment. A categorical dis-

4.1.1 ICDAR datasets (horizontal and multi‑oriented)

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

Content courtesy of Springer Nature, terms of use apply. Rights reserved.

4.1.3 Curved text datasets

4.1.4 Street view image datasets

background scene images, which indicates that there is no repetition of background

4.2 Suitability of datasets for deep networks

4.3.1 Liang’s evaluation protocol

• If 𝜎ij ≈ 1 and 𝜏ij ≈ 1, then correctly detected.

4.3.2 ICDAR’03 evaluation protocol

4.3.3 DetEval evaluation protocol

⎧ 1 if D matches with a single GT rectangle

⎧ 1 if G matches with a single detected rectangle

4.3.4 IoU evaluation protocol

Fig. 14 Performance evalua-

4.3.5 Tightness‑aware IoU protocol

Fig. 15 Example of some errone-

4.3.6 TedEval evaluation protocol

5 Hitherto progress and open areas

1.3 Pre‑deep learning era

1.4 Advent of deep learning

1.5 Motivation and contribution

2 Text detection approaches based on deep networks

2.1 Regression based methods

2.1.1 Proposal based methods

2.1.3 Ascendancy and hindrances

2.2 Segmentation based methods

2.2.3 Ascendancy and hindrances

2.4 End‑to‑end text spotters

3 Deep learning frameworks for scene text detection

3.1 Semantic segmentation based frameworks

3.1.1 Fully convolutional network

3.2 Object detection/instance segmentation based frameworks

3.2.1 Two‑stage object detection networks

3.2.2 One‑stage object detection networks

3.3 Hardware and software requirements

4 Benchmark datasets and evaluation protocols

4.1 Standard datasets for text detection

4.1.1 ICDAR datasets (horizontal and multi‑oriented)

4.1.3 Curved text datasets

4.1.4 Street view image datasets

4.2 Suitability of datasets for deep networks

4.3.1 Liang’s evaluation protocol

4.3.2 ICDAR’03 evaluation protocol

4.3.3 DetEval evaluation protocol

4.3.4 IoU evaluation protocol

4.3.5 Tightness‑aware IoU protocol

4.3.6 TedEval evaluation protocol

5 Hitherto progress and open areas

5.1 Performance–deep versus traditional methods

5.3 Shortcomings and open areas