10

Download as pdf or txt
Download as pdf or txt
You are on page 1of 22

Archives of Computational Methods in Engineering

https://doi.org/10.1007/s11831-019-09315-1

ORIGINAL PAPER

Review of Scene Text Detection and Recognition


Han Lin1 · Peng Yang1,2 · Fanlong Zhang1

Received: 9 October 2018 / Accepted: 8 January 2019


© CIMNE, Barcelona, Spain 2019

Abstract
Scene texts contain rich semantic information which may be used in many vision-based applications, and consequently
detecting and recognizing scene texts have received increasing attention in recent years. In this paper, we first introduce the
history and progress of scene text detection and recognition, and classify conventional methods in detail and point out their
advantages as well as disadvantages. After that, we study these methods and illustrate the corresponding key issues and
techniques, including loss function, multi-orientation, language model and sequence labeling. Finally, we describe commonly
used benchmark datasets and evaluation protocols, based on which the performance of representative scene text detection
and recognition methods are analyzed and compared.

1 Introduction respectively. Text recognition aims to convert image regions


containing text into machine-readable strings. Different from
Texts in scene image contain high-level important semantic the general image classification, the dimension of output
information, which is help to analyzing and understanding sequence for text recognition is not fixed. In most cases, text
the corresponding environment. With the rapid populariza- detection is a preliminary step of text recognition. Recently,
tion of smart phones and mobile computing devices, images many researchers begin to integrate the detection and rec-
with text data are acquired more conveniently and efficiently. ognition tasks into an end-to-end text recognition system.
Therefore, scene text recognition (STR) has become active Considering a small lexicon, word spotting offers an effec-
research topic in computer vision, and its related applica- tive strategy for realizing end-to-end recognition [4].
tions are including image retrieval, automatic navigation The target of traditional optical character recognition
and human–computer interaction, etc. [1–3]. Moreover, the (OCR) is mainly document images acquired by scanner
International Conference on Document Analysis and Recog- [5]. Since even old scanners have enough resolution for text
nition (ICDAR) initiates “Robust Reading” competition in image acquisition, the recognition rates of many OCR meth-
2003, and since then numerous techniques and methods have ods can easily reach 99%. Compared to traditional OCR,
been proposed to greatly advance the development of STR. however, STR is more challenging, which are discussed as
Text detection and recognition are two fundamental tasks follows:
for STR. Text detection aims to determine the position of
text from input image, and the position is often represented (1) Texts are often scattered in the scene image, and
by a bounding box. Generally, the shape of target bound- there is no prior information about their location. For
ing box may be rectangle, oriented rectangle or quadrilat- scanned documents, the number of text lines, line spac-
eral. More precisely, parameters (x, y, w, h) , (x, y, w, h, 𝜃) ing and even the number of words are known. For scene
and (x1 , y1 , x2 , y2 , x3 , y3 , x4 , y4 ) can be used to denotes hori- texts, however, we cannot directly apply segmentation
zontal, rotated and arbitrary quadrilateral bounding box methods for document images since there is no such
formatting rule.
(2) Scene texts often have variety of sizes, fonts and orien-
* Peng Yang tations. Targets in scene image may contain decorated
[email protected]
or specially-designed characters, such as presentation
1
School of Information Engineering, Nanjing Audit slides on screen, calligraphic slogans on wall, and mes-
University, Jiangshu 211815, China sages on digital signboard. Such texts with multifari-
2
School of Information Engineering, Nanchang Hangkong
University, Jiangxi 330063, China

13
Vol.:(0123456789)
H. Lin et al.

ous appearance are difficultly recognized by traditional 2.1 Hand‑Crafted Feature Extraction Stage
OCR engines.
(3) The quality of scene image acquired by digital devices Traditional text detectors focus on developing hand-crafted
is potentially poor. At present, scene text covers wide low-level features to discriminate text and non-text compo-
range of applications linked to wearable cameras or nents in scene image, which can be mainly classified into
massive urban captures which are difficult or undesir- two categories, i.e., sliding window (SW) and connected
able to control. Therefore, characters and their back- component (CC) based methods.
ground often have very low contrast or perspective dis-
tortion, which results in difficulty for localization and
recognition. Figure 1 shows some examples of scene 2.1.1 SW Methods
text images that are not easily detected and recognized.
(4) There are many character-like patterns (non-character) SW methods first detect text information by moving a multi-
in scene image. Since the background of scene image scale sub-window through all possible locations in an image,
is often complex, there are many ambiguous objects and then use a pre-trained classifier to identify whether text
such as leaves, windows or icons that are much like is contained within the sub-window [22].
characters or words. Moreover, sometimes scene texts Wang et al. [6] provided an end-to-end pipeline for STR,
connect to other objects, which easily results in confus- where they perform multi-scale character detection via SW
ing patterns. classification. Features are first extracted by chosen entries
in a HOG descriptor computed at the window location. Then
In this paper, we mainly provide a comprehensive review Random Ferns is applied to evaluate the likelihood of char-
about scene text detection and recognition research over the acter in the window location. Pan et al. [7] estimated the
past decade, and highlight the key techniques. Moreover, we text existing confidence and scale information via SW. After
compare state-of-the-art methods and report the correspond- that, a conditional random field (CRF) model is proposed to
ing performance on several standard benchmark datasets. filter out the non-text components. Similarly, Mishra et al.
[8] used a standard SW method with character aspect ratio
prior to detect potential locations of characters in scene
2 Scene Text Detection image. Wang et al. [9] applied a convolutional neural net-
work (CNN) model with SW scheme to obtain candidate
As mentioned above, scene text detection is a challenging lines of text in given image, and thus estimate text locations.
problem. Similar to majority of computer vision tasks, most Jaderberg et al. [10] also applied CNN in SW fashion to
previous text detection methods are based on handcraft fea- compute text saliency map, which stays the same resolution
tures as well as prior knowledge, and since around 2015 as the original image through zero-padding. After that word
deep learning based methods emerge and gradually become bounding boxes can be generated based on these saliency
the mainstream. maps.

Fig. 1  Examples of scene text images

13
Review of Scene Text Detection and Recognition

The main difficulties for this group of methods lie in Huang et al. [16] applied CNN to learn high-level fea-
designing discriminative features to train a powerful clas- tures from the MSREs components in image. These compo-
sifier, and reasonably managing the number of scanning nents show high discriminant ability and strong robustness
windows to reduce computation complexity. against complicated background ones. Moreover, SW model
and non-maximal suppression (NMS) are incorporated in
the CNN classifier so as to handle the problem of multiple
2.1.2 CC Methods
characters connection. Gomez et al. [17] used the MSER
algorithm to firstly obtain the initial segmentation of image.
CC methods first extract candidate components from the
After that they propose a text specific selective search strat-
image, and then filter out non-text components using manu-
egy, which can group the initial regions by agglomerative
ally designed rules or automatically trained classifiers [23].
clustering in a hierarchy where each node defines a possible
Compared to SW methods, such methods are more efficient
word hypothesis. Finally a ranked list of proposals prioritiz-
and robust. There are two representative methods, i.e., stroke
ing the best hypotheses is provided for text detection. Busta
width transform (SWT) and maximally stable extremal
et al. [18] proposed a stroke detector, which first finds stroke
regions (MSER).
key-points and then uses them to obtain stroke segmentations
Epshtein et al. [11] presented SWT operator to compute
for scene text. They show that compared to the traditional
the width of the most likely stroke for image pixel. Canny
MSER method, using stroke specific key-points could detect
edge detector is first used to find edges in image. After all
more characters with less region segmentations. Cho et al.
the edge pixels in the opposite gradient direction being
[20] presented Canny text detector using multi-stage algo-
found, strokes are considered effective and these pixels are
rithm. ER method is first utilized to extract character candi-
grouped into character candidates. Neumann et al. [12] gave
dates as many as possible, and the overlapped candidates are
a description for character detection problem, i.e., finding
eliminated by NMS. After that, the candidates are classified
all contiguous regions in image such that probability that
as strong text, weak text or non-text with double threshold.
the sequence represents text has a local maximum. Based
Besides strong text, candidates with low confidence, i.e.
on the description, MSER classifier is trained to find region
weak text, are selected by hysteresis. Finally, the surviving
containing characters. Finally, post-processing and connec-
text candidates are grouped to compose sentence. Fabrizio
tion rules are applied to combine the candidate characters
et al. [21] presented a hybrid text detector, which adopts CC
into text line. MSER method needs less priori knowledge
method to generate text candidates and also applies texture
and is more robust to language and oriented text. In order to
analysis to compose text string or discard false positives.
address problems on blurry images or characters with low
CCs in image can be first obtained by employing the toggle
contrast, the same authors implemented character detection
mapping morphological segmentation (TMMS) algorithm.
in all extremal regions (ERs) instead of just in MSERs [13,
A shape descriptor based on fast wavelet decomposition is
14]. They use incrementally computable descriptors as fea-
used to classify each CC as character or non- character. After
tures to train a sequential classifier, which can reduce the
that, a series of texture features are used to train a support
high false positive rate in real-time. Yin et al. [15] proposed
vector machine (SVM) for post-processing. He et al. [22]
a fast MSERs pruning algorithm, which can significantly
developed contrast-enhancement maximally stable extre-
reduce the number of character candidates to be processed.
mal regions (CE-MSERs) detector, which extends the con-
Character candidates are clustered into text candidates by
ventional MSERs by enhancing intensity contrast between
the single-link clustering algorithm, whose distance weights
text patterns and background. Furthermore, they trained a
and clustering threshold can be automatically learnt. Such
text-attentional CNN that could extract high-level features
new MSER based method is more robust and efficient for
including text region mask, character label, and binary text/
text detection.
non-text information. The two schemes are incorporated to
Generally speaking, CC methods easily bring with
form an effective text detection model. Zhang et al. [19] pro-
numerous non-text components. Therefore, correctly filter-
posed a text detector which exploits the symmetry property
ing out the false positives is critical to the success of this
of character groups. Different from traditional methods that
group of methods.
mainly exploit the properties of single characters or strokes,
this new detector could utilize context information from
2.1.3 Hybrid Methods scene image to implement text lines extraction.

In order to more efficiently handle scene text with cluttered 2.2 Deep learning Era
background information, several hybrid methods are pro-
posed, which make use of the advantages of different meth- Recently, deep learning has been widely used in semantic
ods and combine with specific schemes. segmentation and general object detection, and achieved

13
H. Lin et al.

great success. Accordingly, related methods are also being backbone, and then multi-level feature maps are combined
adopted in the field of text detection. In general, semantic and fed to the region proposed network (RPN) for text region
segmentation based detectors first extract text blocks from of interest (ROI) generation. The whole architecture could
the segmentation map generated by fully convolutional net- implement text detection and segmentation simultaneously
work (FCN). After that, bounding boxes of text are obtained and provide predictions both in the pixel and word level.
by complex post-processing. General object detectors, how- Deng et al. [28] proposed a scene text detector (called Pix-
ever, predict candidate bounding boxes directly by regarding elLink) based on instance segmentation. The Single-Shot
texts as objects. Different from common objects, texts have Detector (SSD) [29] like architecture is used to extract fea-
clear definition of orientation, which should be predicted tures and perform text/non-text prediction as well as link
in addition to the axis-aligned bounding box information. prediction. The predicted positive pixels are joined together
into text instances by predicted positive links. Finally, text
2.2.1 Semantic Segmentation Based Methods bounding boxes are generated directly from the segmenta-
tion result without location regression. Li et al. [30] pro-
Yao et al. [24] take scene text detection as a semantic seg- posed the progressive scale expansion network (PSENet)
mentation problem. They use a FCN model based on holis- for segmentation-based text detection. In order to handle the
tically-nested edge detection (HED) to produce global maps, closely adjacent text instances, a progressive scale expan-
including information of text region, individual characters sion algorithm is presented. Inspired by the idea of breadth
and their relationship. And the proposed algorithm could first-search, the expansion starts from the pixels of multiple
detect multi-oriented and curved texts in scene image. He kernels and iteratively merges the adjacent text pixels until
et al. [33] presented the cascaded convolutional text net- the largest kernels are explored. Yang et al. [31] proposed an
works (CCTN), which uses two networks to implement IncepText architecture based on instance-aware segmenta-
coarse-to-fine segmentation for scene image. Note that tion, which could deal with scene texts with large variance
the coarse network outputs a per-pixel heat-map indicat- of scale, aspect ratio, and orientation. ResNet-50 module is
ing the location and probability of text instance, and the first used for feature extraction, and Inception-Text module
fine network outputs two heat-maps for final text detection. is appended after feature fusion. Furthermore, deformable
Zhang et al. [25] also implement text detection with coarse- PSROI pooling [32] is applied to detect multi-oriented text.
to-fine procedure. They first use a FCN (called Text-Block This group of methods is suitable for handling multi-
FCN) to predict the salient map of text blocks. After that oriented text in real-world scene image. Once text instances
MSER method is applied to extract multi-oriented text in image are very close to each other, however, simply
line candidates. Finally, they train another smaller FCN using text/non-text semantic segmentation is hard to sepa-
(called Character-Centroid FCN) to provide the character rate them. Therefore, post-processing is often inevitable to
centroid information, based on which false text line can- improve the performance.
didates can be eliminated. Qin et al. [26] proposed a text
detector based on the cascade of two CNNs. Text regions 2.2.2 General Object Detection Based Methods
of interest are first produced by a FCN and then resized to
a square shape with fixed size. The next stage is the word Zhong et al. [34] developed a unified framework (called
detection procedure, i.e., training a YOLO-like network to DeepText) for text detection. An inception-RPN is pro-
generate oriented rectangular bounding boxes for all words. posed in the framework, which could achieve a high recall
Finally, a NMS stage is implemented to handle overlapping with only hundreds of word region proposals via apply-
bounding boxes. He et al. [40] proposed a FCN architecture ing multi-scale sliding windows over the feature maps and
for multi-oriented scene text detection with two tasks. The designing a set of text characteristic prior bounding boxes
classification task implements down-sampled segmentation with each sliding position. Gupta et al. [35] presented an
between text and non-text for input image, and the regres- efficient engine that could generate synthetic scene images
sion task determines the vertex coordinates of quadrilateral with text annotations, and all synthetic images are used to
text boundaries through direct regression. Zhou et al. [44] train a fully-convolutional regression network (FCRN) for
also proposed a FCN based model for scene text detection. text detection. Since an extreme variant of Hough voting is
Multiple channels of pixel-level text score map and geom- adopted in FCRN, all individual predictions could be aggre-
etry could be generated in this model, which is flexible to gated across the input image. Tian et al. [36] proposed a con-
produce either word level or line level predictions. Further- nectionist text proposal network (CTPN) to localize scene
more, a locality aware NMS with low time complexity is text. In CTPN, VGG16 backbone is first used for feature
proposed for post-processing. Dai et al. [27] presented a extraction, and then a vertical anchor mechanism is devel-
detector based on fused text segmentation networks. Fea- oped to predict text locations in a fine scale. Finally, a Bi-
tures of each image are first extracted through a resnet-101 directional long short term memory (BLSTM) is applied to

13
Review of Scene Text Detection and Recognition

connect the fine scale sequential text proposals. Liao et al. using a pixel-wise text mask. Such model could effectively
[37] presented an end-to-end trainable scene text detector suppress background interference in the convolutional fea-
(called TextBoxes), which is inspired by SSD. Since SSD tures. Furthermore, multi-scale inception features are aggre-
is general object detector, it cannot be directly applied for gated to encode rich local and context information for text
text detection. To address the problem, text-box layers are prediction. The whole detector works in a coarse-to-fine
included in the architecture of TextBoxes, which could manner. Zhong et al. [47] presented an anchor-free region
detect the words with extreme aspect ratios by designing proposal network (AF-RPN), which could generate high-
long default boxes and irregular 1*5 convolutional filters. quality inclined text proposals directly without designing
Ma et al. [38] proposed a rotation region proposal networks complicated hand-crafted anchors. In AF-RPN, three detec-
(RRPN), which is built upon the Faster-RCNN [39] archi- tion modules are attached on different pyramid levels for
tecture. Since the ground truth (GT) of a text region is rep- detecting small, medium and large text instances. Lyu et al.
resented with 5 tuples (x, y, w, h, 𝜃) , where 𝜃 is the angle [48] proposed a hybrid network for multi-oriented scene text
parameter, RRPN could generate inclined proposals with detection. The corner points of text region are first detected,
text orientation information. Jiang et al. [41] also proposed and at the same time position sensitive segmentation maps
a Faster-RCNN based architecture, called rotational region are predicted. After that, candidate bounding boxes are gen-
CNN ­(R2CNN), for arbitrary-oriented text detection. They erated by sampling and grouping corner points, and finally
point out that using an angle parameter could make the net- suppressed by using NMS. He et al. [49] presented an end-
work hard to detect vertical texts. Therefore, the coordinates to-end text spotter, which is based on the idea of mask
of the first two vertices in clockwise and the height of the R-CNN [50]. Especially, a text-alignment layer is designed
bounding box are used to represent an inclined rectangle in by introducing a grid sampling scheme. It aims to compute
­R2CNN. Liu et al. [42] designed a small set of quadrilateral fixed length convolutional features that precisely align to a
sliding windows to roughly recall text. In training phase, a detected text region with arbitrary orientation. The bounding
shared Monte-Carlo method is proposed to compute over- box and segmentation mask of text could be jointly predicted
lapping area between GT and sliding window. The sliding in the multi-task model.
window beyond the given overlapping threshold is consid-
ered as positive and used to finely localize the text. Shi et al.
[43] proposed a novel perspective, i.e., texts are composed
3 Discussion
of segments and links. A segment is a part of a word or text
line, and a link connects two adjacent segments. Both seg-
In general, traditional hand-crafted feature extraction based
ments and links are detected by a SSD like network, and then
methods consist of several steps, which make the detection
they are taken as nodes and edges of a graph respectively.
system complicated and inefficient, and easily result in error
Finally, a depth-first search (DFS) algorithm is performed
accumulation. Moreover, they need too many manual optimi-
on the graph to find the connected components (word or text
zations of classification rules. Deep learning based methods,
line). Liao et al. [45] presented a rotation-sensitive regres-
however, inherit the merits of machine learning. As long as
sion detector (RRD) based on SSD, which has two network
having sufficient number of training samples, they could out-
branches. The regression branch extracts rotation-sensitive
distance the traditional methods in terms of both accuracy
features by rotating the convolutional filters, while the clas-
and efficiency. Figure 2 shows the focused scene text detec-
sification branch extracts rotation-invariant features by pool-
tion results on standard datasets (including ICDAR 2003,
ing the rotation-sensitive features.
ICDAR 2005, ICDAR 2011 and ICDAR 2013) in terms of
This kind of detectors is often trained by bounding-box
F-measure reported in literatures mentioned in Sects. 2.1
annotations just like general object detection methods do,
and 2.2. The blue and red bars represent traditional and
which is difficult to learn fine information of text. While
deep learning based methods respectively. Obviously, deep
handling small-scale texts, only using single shot model
learning based methods achieve overwhelming performance,
may result in accuracy loss. Moreover, it requires designing
which explains why they become the mainstream recently.
anchors or default boxes with various scales, aspect ratios
and orientations in advance.

2.2.3 Hybrid Methods 4 Scene Text Recognition

Recently, some researchers try to combine the two kinds of Similar to text detection, scene text recognition also experi-
above methods so as to correctly detect texts under more ences the transition from traditional means using handcrafted
complex situations. He et al. [46] proposed a text atten- features to deep learning era. In this section, we roughly
tion model, which encodes strong text-specific information classify current mainstream text recognition methods into

13
H. Lin et al.

Fig. 2  Performance comparison of representative scene text detectors

three categories: character classification based, word clas- images with arbitrary sizes. Furthermore, a multi-stage pool-
sification based and sequence based methods. ing scheme is adopted so as to utilize both higher and lower
level features for recognition. Kang et al. [63] designed a
4.1 Character Classification Based Methods context-aware convolutional recurrent network for word rec-
ognition. Besides a lexicon dictionary, the metadata of the
Bissacco et al. [51] use a deep neural network that is trained input image, such as title, tags, and comments, are used as
on HOG features for character classification. In order to a context prior to enhance the recognition rate. Yang et al.
enhance the recognition performance, a two-level language [65] proposed an adaptive ensemble of deep neural networks
model is adopted: a compact character-level n-gram model (AdaDNNs), which could select and combine network com-
is held in RAM and a much larger distributed word-level ponents at different iterations within a Bayesian-based for-
n-gram model is accessed over the network. Jaderberg et al. mulation framework for text recognition.
[57] proposed a CNN based architecture employing a con- Word recognition is actually a multi-class classification
ditional random field (CRF) graphical model. In this model, task with a large number of class labels (e.g. the number
unary terms are provided by a CNN that predicts charac- of English words is about 90,000). The strong expression
ters at each position of the output, and higher order terms and computation ability of CNN make this task possible.
are provided by another CNN that detects the presence of However, the deformation of long word image may affect
n-grams. Lee et al. [60] presented recursive recurrent neural the recognition rate. Furthermore, this kind of methods often
networks (RNNs) with attention model for text recognition. relies on a pre-defined dictionary.
The RNNs could be applied for learning character-level
language model without using n-grams. The soft-attention 4.3 Sequence Based Methods
mechanism allows the model to select features flexibly for
end-to-end training. Shi et al. [55] proposed a convolutional recurrent neural
This group of methods finds individual characters in network (CRNN) for image-based sequence recognition. A
scene image and consequently recognizes them one by one. standard CNN model is first used to extract a sequential fea-
Complex heuristic rules or language models are often indis- ture representation from input image. Then a bidirectional
pensable to integrate characters into words due to the occur- long-short term memory (LSTM) network is connected with
rences of missing or superfluous characters. the top convolutional layers to predict a label distribution for
each frame of feature sequence. Finally, the connectionist
4.2 Word Classification Based Methods temporal classification (CTC) is applied to find the label
sequence with the highest probability conditioned on the
Jaderberg et al. [52] proposed a synthetic data engine, which per-frame predictions. He et al. [58] also developed a deep-
could generate plenty of cropped word images with different text recurrent network (DTRN) for scene text recognition.
styles. A CNN framework is trained using synthetic data Similar to [55], a MaxOut CNN is responsible for encod-
without handcrafted labeling and achieves high performance ing input image into an ordered sequence, and a LSTM is
for word recognition. Shi et al. [56] presented a variant of employed to decode the CNN sequence into a word string.
CNN for script identification under multilingual scenarios. In order to deal with perspective distortion text and curved
In this network, feature maps that have a fixed number text, Shi et al. [59] proposed a recognizer with automatic
of rows but a variable number of columns are input to a rectification. The input image is first employed thin-plate-
spatially-sensitive pooling (SSP) layer, which could handle spline (TPS) transformation, and then the rectified image is

13
Review of Scene Text Detection and Recognition

fed to a sequence recognition network (SRN) to obtain the 4.5 End‑to‑end Text Spotting
final result. The methods mentioned above are mainly under
an encoder-decoder framework, and use a frame-wise loss Text detection and recognition are usually combined to
to optimize the model. However, the misalignment between implement text spotting, rather than being treated as sepa-
the ground truth (GT) sequence and the output probability rate tasks. In a unified system, the recognizer not only pro-
distribution (PD) sequence may mislead the training [68]. In duces recognition outputs but also regularizes text detec-
[68], an edit probability (EP) method is proposed for accu- tion with its semantic-level awareness [70]. Wang et al. [9]
rate text recognition. EP measures the probability of a text applied CNN to implement end-to-end text recognition. In
string conditioned on the input image under parameters for this model, NMS is used to remove overlapping candidates
training attention model, meanwhile considering the pos- and obtain the set of line-level bounding boxes for texts.
sible occurrences of missing/superfluous characters. And then beam search technique is used to find the best seg-
The advantages and disadvantages of the three kinds of mentation of words. The proposed method achieves state-of-
methods for text recognition are summarized in Table 1. the art results under tasks of character recognition, lexicon
driven cropped word recognition and end-to-end recogni-
tion. Yao et al. [53] presented a unified framework, where
4.4 Hybrid Methods text detection and recognition share both features and clas-
sification. Furthermore, the dictionary is generated accord-
In this subsection, we also review some hybrid text recogni- ing to Bing search, whose error correction scheme can be
tion methods, which mainly rely on intricate graphical model used to enhance the recognition rate. Jaderberg et al. [61]
or hand-crafted feature designing, and do not strictly fall into also proposed t an end-to-end text spotting system. Word
the above categories. Shi et al. [71] use the tree-structured level bounding box proposals are first obtained with high
model to generate detection windows that contain candidate recall, and then filtered by a random forest classifier for
characters. Then a CRF model is built on the detection win- improving precision. Two CNNs are used for bounding box
dows to decide character locations. Finally, word recogni- regression and text recognition respectively. Moysset et al.
tion is implemented according to a cost function defined by [64] designed a CRNN system, in which the convolutional
character detection scores, spatial constraints and linguistic layers share parameters over the different regressors to find
knowledge. Yao et al. [72] represent each candidate char- text lines locally, and a 2D-LSTM model is trained with
acter by a set of strokelets that could capture the essential CTC alignment to recognize texts. Gomez et al. [67] pre-
substructures of character at multi-scales. Coupled with sented a text-specific proposal method, which first extracts
HOG descriptor, they could train a random forest classifier connected components from input image, and then groups
with high performance and efficiency. Almazan et al. [54] them by their similarity via single linkage clustering (SLC).
proposed a word recognition method based on embedded Furthermore, a ranking strategy is designed to prioritize the
attributes. On one hand, a pyramidal histogram of charac- best word proposals. Finally, an end-to-end word spotting
ters (PHOC) representation for each word is defined, which system can be built by incorporating the word recognizers
embeds label strings into a d-dimensional space. On the provided in [61]. Liao et al. [70] proposed a novel text detec-
other hand, word image is represented using Fisher vector. tor called TextBoxes ++. TextBoxes ++ is an extension of
Finally, the attributes with PHOCs could be learned by train- [37], which could efficiently detect arbitrary-oriented scene
ing a SVM. Lou et al. [62] represent word recognition model text. Combined with a text recognizer, TextBoxes ++ can
as a high-order factor graph, where hypothetical neighbor- also be used for end-to-end text spotting.
ing candidate characters are constructed edges of the graph More recently, researchers begin to design unified end-
and taken as random variables. Four factors, i.e., transition, to-end trainable deep learning network (DNN) that could
smoothness, consistency, and singleton, are defined and predict both text regions and text labels in a single forward
applied for word parsing. pass. Bartz et al. [66] presented a single DNN that could

Table 1  Comparison of different kinds of text recognition methods


Method Strength Weakness

Character classification based Be insensitive to font variation, noise, blur and orientation Rely on complex heuristic rules or language models
Word classification based Can effectively recognize words in scene image with a Rely on a lexicon and hardly to handle long word
large number of class labels with deformation
Sequence based Do not rely on the precision of text segmentation, and can Need to design proper objective function to opti-
process arbitrary strings mize the network parameters

13
H. Lin et al.

train text detector and recognizer from input image. Moreo- 5.1.2 Resnet
ver, a recurrent spatial transformer is applied as attention
mechanism, which makes the localization of the text be Deeper neural networks are more difficult to train, since the
learned by the network itself. Liu et al. [69] adopted FCN accuracy may get saturated and degrade rapidly. To address
to find bounding boxes of text, based on which a RoIRotate the degradation problem, He et al. [74] proposed a deep
operator is introduced to extract proper features from shared residual learning framework (called Resnet), whose building
feature maps. Finally, the features of text proposal are fed to block is defined as y = F(X, {Wi }) + x (see Fig. 4), where x
RNN and CTC for text recognition. and y are the input and output vectors of the layers consid-
ered, and F(X, {Wi }) is the residual mapping to be learned.
Some text detectors [27, 31] use Resnet 50/101 as backbone
5 Key Techniques for Scene Text Detection for feature extraction.
and Recognition
5.1.3 Regions with CNN (R‑CNN)
In this section, state-of-the-art techniques used in current
scene text detection and recognition methods are reviewed. Fast R-CNN [39] is an end-to-end architecture for object
As mentioned in Sect. 2, deep learning based methods detection. In this architecture, an input image and multiple
have become the mainstream for text detection. Therefore, regions of interest (RoIs) are input into a FCN, and softmax
Sects. 4.1 to 4.3 analyze the relevant schemes and issues, probabilities and per-class bounding-box regression offsets
including network architecture, loss function and multi-ori- are the outputs (see Fig. 5a). Faster R-CNN [76] makes
entation detection. With text recognition, techniques related improvement on Fast R-CNN, which aims to reduce the
to language model and sequence labeling are discussed in time spending on region proposals generation (see Fig. 5b).
Sects. 4.4 and 4.5. A region proposal network (RPN) that shares full-image
convolutional features with the detection network is pro-
5.1 Network Architecture posed, and the RPN and Fast R-CNN are finally merged
into a single network by sharing their convolutional fea-
5.1.1 Fully Convolutional Network (FCN) tures. By incorporating additional components into these

FCN [73] could yield hierarchies of features for effective


semantic segmentation (see Fig. 3). Since the merits of
multi-scale learning and prediction conform to the nature
of scene text, many methods [24–26, 33, 40] adopt FCN as
their backbone for text detection. Generally, a pixel-wise
text/non-text salient map is first obtained by using FCN,
which produces pixel-wise labeling or labeled region con-
taining texts. After that, candidate bounding boxes of text
could be generated. By applying skip architecture of FCN,
receptive fields with different sizes could be helpful to
encode both local features and global context of text.
Fig. 4  A building block for residual learning [74]

Fig. 3  Architecture of FCN [73]

13
Review of Scene Text Detection and Recognition

Fig. 5  Architecture of R-CNN series. a Fast R-CNN [39], b faster R-CNN [76]

architectures, several text detection methods [34, 38, 41, 49] 5.1.5 Single Shot Detector (SSD)
with computational efficiency are proposed.
SSD [29] defines a set of default boxes for the output space
of bounding boxes, and it simultaneously predicts the
5.1.4 You only Look Once (YOLO) shape offsets and the confidences for all object categories
(see Fig. 7). In SSD, predictions are combined from mul-
YOLO [75] is a single convolutional network that simultane- tiple feature maps with different resolutions. Compared to
ously predicts multiple bounding boxes and class probabili- YOLO, SSD could effectively deal with objects of various
ties for those boxes (see Fig. 6). Since YOLO takes object sizes. Moreover, SSD eliminates proposal generation and
detection as a single regression problem, it extremely fast feature resampling, which is different from R-CNN based
comparing with R-CNN based system. However, it may network. Since SSD integrates the advantages of YOLO
achieve poor precision while localizing objects with small and Fast R-CNN/Faster R-CNN, many methods [37, 42, 43,
size. Therefore, it cannot be directly applied for text detec- 45, 48] extend this architecture for text detection by giving
tion. Inspired by YOLO, Gupta et al. [35] proposed a fully- some specific modifications, such as designing default boxes
convolutional regression network (FCRN), which could with larger aspect ratios or multi orientations, and adopting
effectively and efficiently detect texts in scene image. inception-style convolutional filters.

Fig. 6  Architecture of YOLO [75]

13
H. Lin et al.

Fig. 7  Architecture of SSD [29]

5.2 Loss Function 36–38, 41, 43, 45–47, 49] as the loss for distinguishing text
(y = 1) and non-text (y = 0).
Just like in general machine learning model, a loss func-
tion should be defined first in deep neural network to 5.4 Smooth‑L1 Loss Function
measure the gap between prediction and actual value. And
then training algorithm seeks to minimize the loss func- It is often used for bounding box regression task [27, 31, 34,
tion. The smaller the loss function is, the more robust the 36–38, 41, 43–48], which is defined as follow
model is. Most work often takes text detection as a multi ∑
task learning problem, e.g. classification and regression. In Lreg = smoothL1 (pi , p ∗) (3)
this section, some commonly used loss functions for text
i∈S

detection are listed and discussed. in which,


{
0.5(𝜎x)2 if |x| < 1∕𝜎 2
5.2.1 Cross‑Entropy Loss Function smoothL1 (x) = (4)
|x| − 0.5∕𝜎 2 otherwise
It is often used in tasks such as pixel/instance classifica-
where p and p ∗ are predicted value and ground truth respec-
tion or segmentation [25, 27, 28, 30, 31, 33, 44, 48], which
tively, and x represents the error between p and p ∗. Note
is defined as follow
that the deviation function of Smooth-L1 is also a piecewise
function. In [42], Liu et al. defined a continuous function
1∑
N
Lce = − [y log ŷ n + (1 − yn ) log(1 − ŷ n )] (1) as follow
N n=1 n
smoothLn (x) = (|x| + 1) ln(|x| + 1) − |x| (5)
where yn and ŷ n are actual value and prediction respectively. They claims that smooth-Ln loss could achieve the tradeoff
Note that if the same weight is put on all positive pixels, between robustness and stability (see Fig. 8)
it may achieve poor performance while handling instances
with small areas. Therefore, several balanced cross-entropy 5.4.1 Squared Loss Function
losses [28, 44] are also introduced to facilitate the training
procedure. It is a conventional loss for regression task, which is defined
as follow
5.3 Softmax Loss Function
Lsqu = (y − ŷ )2 (6)
It should be found in many general object detection meth- where y and ŷ are actual value and prediction respectively.
ods, which is defined as follow In [26] [35], a bounding box is parameterized in terms of the
(m−1 ) position of its center, width, height, orientation angle and the

Lsm = log ezj − zy (2) confidence that the box contains a word. While training the
j=0 network, all the parameters are optimized by minimizing a
multi-part squared loss function.
where zy is the ith value on score vector for classification, There are many other loss functions used for scene text
and y is the classification label. This function is used in [34, detection. For example, the Dice loss [48] is adopted to

13
Review of Scene Text Detection and Recognition

Fig. 8  Comparison of smooth-L1 and smooth-Ln [42]

implement position-sensitive segmentation, and the IOU scale expansion algorithm that could make the kernels
loss [44] is applied for regressing four channels of axis- grow from small to large scale, is used to obtain the final
aligned bounding box since it is invariant against texts with detections. Therefore, the prediction is robust to arbitrary
different scales. shapes and orientations. In [31], the position-sensitive RoI
(PSROI) pooling [79] is replaced by a deformable PSROI
5.5 Multi‑orientation Detection pooling, which could implement multi-oriented text detec-
tion through adding offsets to the spatial binning positions.
Most of the previous work focuses on horizontal text detec- Note that most of above work includes segmentation
tion and achieves pretty good performance. However, text step, which is usually time-consuming. A new trend
in real-world situation could appear with any orientation. inspired by general object detection has emerged recently,
Therefore, text orientation needs to be estimated and cor- i.e., generating inclined proposals/boxes to roughly recall
rected for subsequent recognition procedure. Although text, and then implementing bounding box regression to
many studies [81–89] have concentrated on multi-oriented finely localize text region. Text orientation information
scene text detection, the accuracy rates need to be further could be represented by different ways, such as rotation
improved. With the initiating of ICDAR 2015 Competition anchors [38], inclined minimum area rectangle [41] or
Challenge 4, a large number of deep learning based meth- quadrangles inside horizontal sliding windows [42]. Dif-
ods have stood out, and achieved superior performance over ferent from previous text detection methods that rely on
conventional approaches. shared features for both classification and oriented bound-
In [24], individual characters and their relationship, i.e., ing box regression, active rotating filters (ARF) [80] are
linking orientation are considered, and the corresponding used to extract rotation-sensitive features in [45]. Since
prediction maps are produced by training the holistically- ARF convolves feature map with a canonical filter and
nested edge detection (HED) [77] based network. Since its rotated clones, it can help to capture rotation sensitive
HED could find edges of different scales and orientations, features. In [48], scene text detection is implemented by
it can be used for multi-orientation text detection. Similar localizing corner points of text bounding boxes and seg-
work could be found in [43], where the oriented text is menting text regions in relative positions (see Fig. 9). The
decomposed into segments and links, and the final detec- candidate boxes are generated by grouping corner points
tion results are produced via combining segments con- according to the scores of segmentation maps.
nected by links. Since text lines from the same text block
often have a roughly uniform spatial layout, a projection 5.6 Language Model
profile based skew estimation algorithm [78] is used to
determine the possible orientation of text line in [25]. Strong language prior, e.g. probability distribution over char-
In [27, 33], pixel-wise text region masks with arbitrary acter/word sequence, would make major contribution to final
shapes are taken as supervision information for training text recognition. Some characters or strings cannot be easily
segmentation network so as to handle multi-orientation distinguished, such as the number “0” and the character “O”,
texts. In [30], the concept “kernel” is introduced, which or the string “cl” and character “d”. If a proper language
denotes multiple predicted segmentation areas of text model is adopted to consider the context information, these
instance. The kernels have the similar shape and locate cases must be eliminated.
at the same central point with differ scales. A progressive

13
H. Lin et al.

Fig. 9  Corner points and


position-sensitive maps predic-
tion [48]

5.7 Sequence Labeling

As mentioned in Sect. 3.1, many character classification


based text recognition methods firstly detect individual
characters in image, and sequently recognize each character
using CNN models. In order to train a strong character detec-
tor, however, we need a large number of labeled character
images, which is unrealistic in most cases. Word classifica-
tion based methods assign a class label to each word, and
treat text recognition as an image classification problem.
Such methods often train CNN models with a huge num-
Fig. 10  The N-gram encoding model [57] ber of classes. For English there are about 90 K words, and
for Chinese however, the number of potential words may
exceed 1 million. Moreover, CNN models are often hard
Inspired by the successful applying of hidden markov to deal with long words (the number of characters is large).
model (HMM) in voice recognition, a hybrid HMM/Maxout Recently, the state-of-the-art methods consider text spot-
architecture is proposed in [90], which could sequence words ting as a sequence labeling problem. These methods could
into their corresponding character/inter-character regions by generate an ordered high level sequence from input image,
integrating a lexicon. The method is highly accurate as well and have properties of handling text with arbitrary lengths,
as fast, since it takes constant time relative to lexicon size. lexicon free and avoiding the character segmentation. Some
Conditional random field (CRF) model is adopted to pre- key techniques are reviewed as follows.
dict character position in [8, 91, 92]. The CRF is defined
over a set of random variables, and each random variable
5.7.1 Recurrent Neural Network (RNN)
denotes a potential character in word. In order to recognize
weak character or non-dictionary words, however, it needs
RNN is an important branch of DNN family, which does
to compute unary and higher-order terms for all candidate
not need the position information of each element in a
characters, which results in expensive computation. In [51],
sequence image. In [55, 58, 59], a CNN model is first
the beam search based on n-gram model is used to obtain
used to convert text image into a sequence of features,
candidate characters. Beside this language model, a simple
and then sequential features are fed to a RNN model for
dictionary is also maintained for providing a soft scoring
learning context information and generating a predicted
signal. Finally, the candidate characters are re-ranked by
sequence. Traditional RNN is hard to transmit the gradient
using both language model and shape model. Similarly, a
information consistently over long time due to the vanish-
word is taken as a composition of bag-of-n-grams in [57]. In
ing gradient problem. The RNN model adopted in [55,
order to compress encoding representation, the model only
58, 59] is the long-short term memory (LSTM) structure.
selects a subset of the space of all possible n-grams. Since
To be more precisely, two LSTMs, one forward and one
the n-gram based CNN has a large number of output nodes,
backward, are combined into a bidirectional LSTM (see
e.g. 10 k output units for n = 4 (see Fig. 10), it increases the
Fig. 11).
training complexity. Different from the above methods, the
recurrent neural network (RNN) is used in [60] to model the
character-level statistics for text. In this model, character
5.7.2 Connectionist Temporal Classification (CTC)
recognition is considered as a task of learning mappings
from pixel intensities to character-level vectors, and does
In CNN + LSTM model [55, 58], the length of the LSTM
not need n-grams any more.
outputs may not consistent with that of the target string.

13
Review of Scene Text Detection and Recognition

6.1 Benchmark Datasets

In this section, we describe the widely used benchmark


datasets for tasks of text detection and recognition, whose
features are summarized in Table 2.
ICDAR 2003 [94]. It is the first released benchmark for
scene text detection and recognition from ICDAR Robust
Reading Competition. There are 258 natural images for train-
ing and 251 natural images for testing. All the text instances
in this dataset are in English and are horizontally placed.
ICDAR 2011 [95]. It inherits from ICDAR 2003 and has
made some modification. There are 229 natural images for
training and 255 natural images for testing.
ICDAR 2013 [96]. It also inherits from ICDAR 2003 and
Fig. 11  The structure of deep bidirectional LSTM [55] has made some modification. There are 229 natural images
for training and 233 natural images for testing.
ICDAR 2015 [97]. It is from the Incidental Scene Text
Therefore, the CTC [93] is applied to approximately map Challenge of the ICDAR 2015 Robust Reading Competi-
the LSTM sequential output into its target string: tion. The dataset includes 1500 natural images in total,
which are acquired using Google Glass. The text instances
Sw∗ ≈ B(arg max P(𝜋|p)) (7) (annotated by 4 vertices of the quadrangle) are usually
𝜋
skewed or blurred in ICDAR 2015, since they are acquired
where B is the projection that removes the repeated labels without user’s prior preference or intention.
and the non-character labels. ICDAR 2017 MLT [98]. It is a large scale multi-lingual
text dataset, which is composed of complete scene images
with 9 languages. There are 7200 training images, 1800
6 Evaluation and Comparison validation images and 9000 testing images in this dataset.
MSRA-TD500 [99]. It has 500 high resolution natural
Scene text detection and recognition have received increas- scene images, where the text instances present with multi
ing attention in computer vision and document analysis, and orientations and the language types include both Chinese
many approaches and methods have been proposed so far. and English. There are 300 images for training and 200
Therefore, it is impossible to give fair evaluation and com- images for testing.
parison for all of them. In this section, we first summarize COCO-Text [100]. It is the largest benchmark that could
the widely used datasets and protocols for text detection and be used for text detection and recognition so far. The orig-
recognition. After that, we mainly survey published results inal images are from the Microsoft COCO dataset, and
of the representative methods for comparison. 173,589 text instances from 63,686 images are annotated

Table 2  Benchmark datasets for Dataset Annotation Orientation Language Task End-to-end
text detection and recognition
ICDAR 2003 Character/word Horizontal English Detection/recognition Yes
ICDAR 2011 Word Horizontal English Detection/recognition Yes
ICDAR 2013 Character/word Horizontal English Detection/recognition Yes
ICDAR 2015 Word Multi oriented English Detection/recognition Yes
Incidental
ICDAR 2017 Word Multi oriented Multi lingual Detection/recognition Yes
MLT
MSRA-TD500 Text line Multi oriented English/Chinese Detection No
COCO-Text Word Horizontal English Detection/recognition Yes
SVT Word Horizontal English Detection/recognition Yes
RCTW-17 Text line Multi oriented Chinese Detection Yes
IIIT 5 k Character/word Horizontal English Recognition No
SynthText Character/word Horizontal English Detection/recognition No
Synth90 k Word Horizontal English Recognition No

13
H. Lin et al.

in COCO-Text. There are 43,686 images for training and


m(r, R) = maxmp (r, r� )|r� ∈ R (8)
20,000 images for validation/testing.
Street View Text (SVT) [101]. It consists of 350 images where mp denotes the match between two rectangles of text
annotated with word-level axis-aligned bounding boxes instances, which can be calculated as the area of intersection
from Google Street View. It contains smaller and lower divided by the area of the minimum bounding box contain-
resolution text, and not all text instances within it are ing both rectangles. Then, the metrics of precision (P), recall
annotated. ( R) and F-measure(F ) can be defined as follows
RCTW-17 [102]. It contains various kinds of image, ∑
including street views, posters, menus, indoor scenes and re ∈E m(re , T)
P= (9)
screenshots for competition on reading Chinese text in image. �E�
The dataset contains about 8000 training images and 4000 test
images, whose annotations are similar to ICDAR2015. ∑
IIIT 5 k [103]. It contains 5000 cropped word images rt ∈T m(rt , E)
R= (10)
downloaded from Google image search. There are 2000 �T�
images for training and 3000 images for testing. Each image
has an associated 50 word lexicon (IIIT5 k-50) and 1 k word 1
lexicon (IIIT5 k-1 k). F=
𝛼∕P + (1 − 𝛼)∕R (11)
SynthText [104]. It contains 858,750 synthetic images,
where texts with random colors, fonts, scales and orienta- where T and E are respectively the sets of ground-truth and
tions are rendered on natural images carefully to have a real- estimated rectangles, and rt and re are respectively a ground-
istic look. The texts in this dataset are annotated in character, truth and an estimated rectangle. 𝛼 is weight parameter,
word and line level. which is often set to 0.5.
Synth90 k [105]. It contains about 9 million synthetic
cropped word images, and covers 90 k different English 6.2.2 DetEval Detection Protocol
words. Similar to SynthText, the synthetic data in Synth90 k
is highly realistic. There are approximate 8 million images Since standard ICDAR detection protocol is unable to handle
for training and 900 k images for testing. the cases of one-to-many and many-to-many matches among
the ground truth and detections, it always underestimates
the performance of text detection algorithms. To address
6.2 Evaluation Protocols the problem, Wolf et al. proposed the DetEval protocol to
comprise the area overlap and the object level evaluation.
In this section, we summarize evaluation protocols for text In this protocol, the metrics of precision ( P′) and recall ( R′)
detection and recognition. The task of text detection could be can be defined as follows
commonly evaluated using ICDAR or DetEval protocol, and ∑
the task of text recognition could be commonly evaluated using i MatchD (Di , G, tr , tp )

P = (12)
word recognition accuracy or end-to-end recognition protocol. �D�

6.2.1 ICDAR Detection Protocol ∑


j MatchG (Gj , D, tr , tp )

R = (13)
First, the best match m(r, R) for a rectangle r in a set of rec- �D�
tangles R is defined as follow
where MatchD and MatchG are functions that consider the
different types of matches:

13
Review of Scene Text Detection and Recognition

⎧1 if Di matches against a single detected rectangle



MatchD (Di , G, tr , tp ) = ⎨ 0 if Di does not match against any detected rectangle (14)
⎪ fsc (k) if Di matches against several (→ k) detected rectangles

⎧1 if Gj matches against a single detected rectangle



MatchG (Gj , D, tr , tp ) = ⎨ 0 if Gj does not match against any detected rectangle (15)
⎪ fsc (k) if Gj matches against several ( → k) detected rectangles

an evaluation protocol that considers true or false positives


where fsc (k) is a parameter function that controls the amount
based on the overlap ratio between the estimated mini-
of punishment, and it is often set to 0.8.
6.2.3 Yao's Detection Protocol

While handling texts with arbitrary orientation, the overlap ratio computed in the way of the standard ICDAR protocol is possibly not accurate. Therefore, Yao et al. [81] proposed an evaluation protocol that counts true or false positives based on the overlap ratio between the estimated minimum-area rectangles and the ground-truth rectangles. If the included angle between the estimated rectangle and the ground-truth rectangle is less than $\pi/8$ and their overlap ratio exceeds 0.5, the estimated rectangle is considered a correct detection. Multiple detections of the same text line are taken as false positives. Thus, the metrics of precision ($P''$) and recall ($R''$) can be defined as follows

$$P'' = |TP| / |E| \tag{16}$$

$$R'' = |TP| / |T| \tag{17}$$

where $TP$ is the set of true-positive detections, while $E$ and $T$ are respectively the sets of estimated rectangles and ground-truth rectangles.
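A minimal sketch of this true-positive test is given below, assuming oriented boxes expose an `angle` field and that an overlap function for rotated rectangles (e.g., via polygon intersection) is supplied by the caller; both are assumptions of this illustration:

```python
import math

# Sketch of the true-positive test in Yao's protocol (Eqs. 16-17).
# overlap(e, g) is an assumed helper returning the overlap ratio of two
# oriented (minimum-area) rectangles, e.g., from polygon intersection.

def is_true_positive(est, gt, overlap):
    diff = abs(est.angle - gt.angle) % math.pi
    angle_ok = min(diff, math.pi - diff) < math.pi / 8
    return angle_ok and overlap(est, gt) > 0.5

def yao_metrics(estimated, ground_truth, overlap):
    # Each ground-truth line is credited to at most one detection;
    # further detections of the same line count as false positives.
    matched, tp = set(), 0
    for e in estimated:
        hit = next((j for j, g in enumerate(ground_truth)
                    if j not in matched and is_true_positive(e, g, overlap)),
                   None)
        if hit is not None:
            matched.add(hit)
            tp += 1
    precision = tp / len(estimated) if estimated else 0.0      # Eq. 16
    recall = tp / len(ground_truth) if ground_truth else 0.0   # Eq. 17
    return precision, recall
```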

Table 3 Performance of different text detection methods evaluated on ICDAR datasets (P/R/F in %)

| Method | Year | ICDAR2011 P/R/F | ICDAR2013 P/R/F | ICDAR2015 P/R/F | ICDAR2017 P/R/F |
|---|---|---|---|---|---|
| Yao [24] | 2016 | – | 88.88/80.22/84.33 | 72.26/58.69/64.77 | – |
| Zhang [25] | 2016 | – | 88/78/83 | 71/43/54 | – |
| He [33] | 2016 | 88/79/84 | 90/83/86 | – | – |
| Zhong [34] | 2016 | 85/81/83 | 87/83/85 | – | – |
| Gupta [35] | 2016 | 91.5/74.8/82.3 | 92/75.5/83 | – | – |
| Tian [36] | 2016 | 89/79/84 | 93/83/88 | 74/52/61 | – |
| Qin [26] | 2017 | – | 90/83/86 | 79/65/71 | – |
| Dai [27] | 2017 | – | 88.6/80/84.1 | – | – |
| Liao [37] | 2017 | 88/82/85 | 88/83/85 | – | – |
| Ma [38] | 2017 | – | 90/72/80 | 82.17/73.23/77.44 | – |
| He [40] | 2017 | – | 92/81/86 | 82/80/81 | – |
| Jiang [41] | 2017 | – | 93.55/82.59/87.73 | 85.62/79.68/82.54 | – |
| Liu [42] | 2017 | – | – | 73.23/68.22/70.64 | – |
| Shi [43] | 2017 | – | 87.7/83/85.3 | 73.1/76.8/75 | – |
| Zhou [44] | 2017 | – | – | 83.27/78.33/80.72 | – |
| He [46] | 2017 | – | 89/86/88 | 80/73/77 | – |
| Deng [28] | 2018 | – | 88.6/87.5/88.1 | 85.5/82/83.7 | – |
| Li [30] | 2018 | – | – | 89.3/85.22/87.21 | 77.01/68.4/72.45 |
| Yang [31] | 2018 | – | – | 93.8/87.3/90.5 | – |
| Liao [45] | 2018 | – | 92/86/89 | 88/80/83.8 | – |
| Zhong [47] | 2018 | – | 94/90/92 | 89/83/86 | 75/66/70 |
| Lyu [48] | 2018 | – | 92/84.4/88 | 89.5/79.7/84.3 | 74.3/70.6/72.4 |
| He [49] | 2018 | – | 91/89/90 | 87/86/87 | – |
| Liu [69] | 2018 | – | –/–/92.82 | – | 81.86/62.3/70.75 |
| Liao [70] | 2018 | – | 92/86/89 | 87.8/78.5/82.9 | – |

Bold indicates the best result for each dataset.

6.2.4 Text Recognition Protocols

Given a cropped word image, word recognition accuracy is a commonly used evaluation metric, which is defined as the ratio of the number of correctly recognized words to the number of ground-truth words. For a holistic scene image containing texts, there are two protocols for evaluation, i.e., word spotting and end-to-end. Word spotting only examines whether the words in the lexicon appear in the input image, and it ignores symbols, punctuation, numbers and words whose length is less than three. The end-to-end protocol concerns both detection and recognition results, and it needs to recognize all the words precisely, no matter whether the lexicon contains these strings. F-measure is also adopted by the two protocols.
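As a simple illustration, the cropped-word metric and the word-spotting filter can be sketched as follows (case-insensitive comparison and a purely alphabetic word filter are assumptions of this sketch rather than a normative implementation):

```python
import re

# Sketch of the recognition-side measurements described above.

def word_accuracy(predictions, ground_truths):
    # Ratio of correctly recognized words to the ground-truth number;
    # case-insensitive comparison is an assumption of this sketch.
    correct = sum(p.lower() == g.lower()
                  for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def counts_for_word_spotting(word):
    # Word spotting ignores symbols, punctuation, numbers and words
    # shorter than three characters.
    return len(word) >= 3 and re.fullmatch(r"[A-Za-z]+", word) is not None
```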

Table 4 Performance of different text detection methods evaluated on other public datasets (P/R/F in %)

| Method | Year | MSRA-TD500 P/R/F | COCO-Text P/R/F | SVT P/R/F | RCTW-17 P/R/F |
|---|---|---|---|---|---|
| Yao [24] | 2016 | 76.51/75.31/75.91 | 43.23/27.1/33.31 | – | – |
| Zhang [25] | 2016 | 83/67/74 | – | – | – |
| He [33] | 2016 | 79/65/71 | – | – | – |
| Gupta [35] | 2016 | – | – | 26.2/27.4/26.7 | – |
| Tian [36] | 2016 | – | – | 68/65/66 | – |
| Dai [27] | 2017 | 87.6/77.1/82 | – | – | – |
| Ma [38] | 2017 | 82.1/67.7/74.2 | – | – | – |
| He [40] | 2017 | 77/70/74 | – | – | – |
| Shi [43] | 2017 | 86/70/77 | – | – | – |
| Zhou [44] | 2017 | 87.28/67.43/76.08 | 50.39/32.4/39.45 | – | – |
| He [46] | 2017 | – | 46/31/37 | – | – |
| Deng [28] | 2018 | 83/73.2/77.8 | – | – | – |
| Yang [31] | 2018 | 87.5/79/83 | – | – | 78.5/56.9/66 |
| Liao [45] | 2018 | 87/73/79 | 64/57/61 | – | 77.5/59.1/67 |
| Lyu [48] | 2018 | 87.6/76.2/81.5 | 61.9/32.4/42.5 | – | – |
| Liao [70] | 2018 | – | 60.87/56.7/58.72 | – | – |

Bold indicates the best result for each dataset.

Fig. 12 The learned context surrounding the text by deformable PSROI pooling [31]

Fig. 13 Rotation sensitive regression [45]


Table 5 Cropped word recognition accuracy (%) on ICDAR datasets

| Method | Year | IC03-50 | IC03-Full | IC03 | IC11-50 | IC11-Full | IC13 | IC15 |
|---|---|---|---|---|---|---|---|---|
| Wang [9] | 2012 | 90 | 84 | – | – | – | – | – |
| Bissacco [51] | 2013 | – | – | – | – | – | 82.83 | – |
| Shi [71] | 2013 | 87.44 | 79.3 | – | 87.04 | 82.87 | – | – |
| Jaderberg [52] | 2014 | 98.7 | 98.6 | – | – | – | 90.8 | – |
| Yao [72] | 2014 | 88.48 | 80.33 | – | – | – | – | – |
| Shi [55] | 2015 | 98.7 | 97.6 | 89.4 | – | – | – | – |
| Jaderberg [57] | 2015 | 97.8 | 97 | 89.6 | – | – | 81.8 | – |
| He [58] | 2016 | 97 | 93.8 | – | – | – | – | – |
| Shi [59] | 2016 | 98.3 | 96.2 | 90.1 | – | – | 88.6 | – |
| Lee [60] | 2016 | 97.9 | 97 | 88.7 | – | – | 90 | – |
| Jaderberg [61] | 2016 | 98.7 | 98.6 | 93.3 | – | – | 90.8 | – |
| Lou [62] | 2016 | – | – | – | – | – | 86.2 | – |
| Yang [65] | 2017 | – | – | – | – | – | 85.21 | 79.78 |
| Bartz [66] | 2017 | – | – | – | – | – | 90.3 | – |
| Bai [68] | 2018 | 98.7 | 97.9 | 94.6 | – | – | 94.4 | 73.9 |

Bold indicates the best result for each dataset.

Table 6 Cropped word recognition accuracy (%) on other public datasets

| Method | Year | SVT-50 | SVT | IIIT5K-50 | IIIT5K-1k | IIIT5K |
|---|---|---|---|---|---|---|
| Wang [9] | 2012 | 70 | – | – | – | – |
| Bissacco [51] | 2013 | 90.93 | – | – | – | – |
| Shi [71] | 2013 | – | 73.51 | – | – | – |
| Jaderberg [52] | 2014 | 95.4 | 80.7 | 97.1 | 92.7 | – |
| Almazan [54] | 2014 | 87.01 | – | 88.57 | 75.6 | – |
| Yao [72] | 2014 | – | 75.89 | 80.2 | 69.3 | 38.3 |
| Shi [55] | 2015 | 96.4 | 80.8 | 97.6 | 94.4 | 78.2 |
| Jaderberg [57] | 2015 | 93.2 | 71.7 | 95.5 | 89.6 | – |
| He [58] | 2016 | 93.5 | – | 94 | 91.5 | – |
| Shi [59] | 2016 | 95.5 | 81.9 | 96.2 | 93.8 | 81.9 |
| Lee [60] | 2016 | 96.3 | 80.7 | 96.8 | 94.4 | 78.4 |
| Jaderberg [61] | 2016 | 95.4 | 80.7 | 97.1 | 92.7 | – |
| Lou [62] | 2016 | – | 80.7 | – | – | – |
| Bartz [66] | 2017 | – | 79.8 | – | – | 86 |
| Bai [68] | 2018 | 96.6 | 87.5 | 99.5 | 97.9 | 88.3 |

Bold indicates the best result for each dataset.

Table 7 End-to-end F-measures (%) on ICDAR03, ICDAR11, ICDAR13 and SVT

| Method | Year | IC03-50 | IC03-Full | IC03 | SVT-50 | SVT | IC11 | IC13 |
|---|---|---|---|---|---|---|---|---|
| Wang [9] | 2012 | 72 | 67 | – | 46 | – | – | – |
| Jaderberg [61] | 2016 | 90 | 86 | 78 | 76 | 53 | 76 | 76 |
| Gupta [35] | 2016 | – | – | – | 67.7 | 55.7 | 84.3 | 84.7 |
| Gomez [67] | 2017 | 92 | 90 | 75 | 85 | 54 | – | – |
| Liao [37] | 2017 | – | – | – | 84 | 64 | 87 | – |
| Liao [70] | 2018 | – | – | – | 84 | 64 | – | – |

6.3 Performance Comparison

In this section, we report the experimental results of representative text detection and recognition methods on some public datasets through a comprehensive literature review. Since different methods may conduct experiments on different benchmark datasets, and even on the same dataset they may adopt different training sets (such as using a synthetic dataset for pre-training, or using a special data augmentation scheme to enlarge the number of training samples), it is impossible for us to make an absolutely fair comparison. However, we can witness the development of state-of-the-art methods in this field and acquire some inspiration.


Table 8 Word spotting and end-to-end F-measures (%) on ICDAR13 and ICDAR15

| Method | Year | Word spotting IC13-100 | IC13-Full | IC13 | End-to-end IC13-100 | IC13-Full | IC13 |
|---|---|---|---|---|---|---|---|
| Gomez [67] | 2017 | 85.37 | 83.58 | 70.71 | 81.16 | 79.49 | 68.54 |
| Liao [37] | 2017 | 94 | 92 | 87 | 91 | 89 | 84 |
| Liu [69] | 2018 | 95.94 | 93.9 | 87.6 | 91.99 | 90.11 | 84.77 |
| Liao [70] | 2018 | 96 | 95 | 87 | 93 | 92 | 85 |

| Method | Year | Word spotting IC15-50 | IC15-Full | IC15 | End-to-end IC15-50 | IC15-Full | IC15 |
|---|---|---|---|---|---|---|---|
| Gomez [67] | 2017 | 56 | 52.26 | 49.73 | 53.3 | 49.61 | 47.18 |
| Liao [37] | 2017 | – | – | – | – | – | – |
| Liu [69] | 2018 | 87.01 | 82.39 | 67.97 | 83.55 | 79.11 | 65.33 |
| Liao [70] | 2018 | 76.45 | 69.04 | 54.37 | 73.34 | 65.87 | 51.9 |

Bold indicates the best result for each dataset.

Tables 3 and 4 report text detection performance of different methods on eight datasets. As mentioned in Sect. 2, deep learning based methods have recently become the mainstream for text detection, so we only give results of this group of methods here. As shown in Table 3, the F-measures on ICDAR2013 and ICDAR2015 now both exceed 90%. In particular, the performance on ICDAR2015 has increased drastically from 54% (Zhang et al. [25]) to 90.5% (Yang et al. [31]) in terms of F-measure. In [31], deformable PSROI pooling is applied to add offsets to the spatial binning positions in PSROI pooling (see Fig. 12), which greatly enhances the performance of multi-oriented text detection. As shown in Table 4, the F-measures on the other four datasets have also reached unprecedented levels. On the largest dataset, COCO-Text, the performance has increased drastically from 33.31% (Yao et al. [24]) to 61% (Liao et al. [45]) in terms of F-measure. In [45], a rotation-sensitive regression network (see Fig. 13) is adopted, which helps to achieve better detection results. It can be observed that abundant techniques from general object detection and semantic segmentation have been extended to scene text localization, and the current trend is applying a deep learning framework to train an end-to-end text detector.

Tables 5, 6, 7 and 8 report text recognition performance of different methods on six commonly used datasets. As shown in Tables 5 and 6, the method of Bai et al. [68] achieves relatively high performance on all ICDAR datasets. In [68], edit probability (EP) is proposed to train an attention-based text recognition model. By applying a sequence generation mechanism for lexicon-free prediction, this method can effectively recognize out-of-training-set words, and it obtains the best results on ICDAR 2003 and ICDAR 2013 without a strong or weak lexicon. As shown in Tables 7 and 8, the methods of Liao et al. [70] and Liu et al. [69] achieve state-of-the-art performance. Since TextBoxes++ [70] extends directly from TextBoxes [37], which mainly handles horizontal texts, it obtains relatively high F-measures on the ICDAR 2013 and SVT datasets.

Fig. 14 Illustration of RoIRotate [69]


Note that the performance improvement of TextBoxes++ is spectacularly significant on the SVT dataset due to its training on low-resolution images. In [69], the RoIRotate operator is proposed to connect detection and recognition in a unified network; it applies a transformation on oriented detection bounding boxes to obtain axis-aligned feature maps (see Fig. 14). Therefore, such a unified network achieves obvious advantages on the oriented ICDAR 2015 dataset. Note that there is no general text recognition method yet, and each method only performs well on certain datasets. As long as the text regions are properly localized, traditional methods have already achieved relatively high cropped-word recognition accuracy. However, present methods attempt to construct an end-to-end framework without complicated pre- or post-processing for both text detection and recognition.

7 Conclusions

Scene text detection and recognition have received increasing attention in computer vision due to their potential applications in numerous fields. This paper mainly reviews detection and recognition methods proposed in the last decade. We comprehensively classify these methods and highlight the key techniques. Furthermore, more than 10 benchmark datasets and the corresponding evaluation protocols are described in the paper. Finally, we report the results of more than 40 representative methods and compare their performance. Although great progress has been achieved in text detection and recognition recently, we also point out some problems that should be addressed.

Since most methods focus on text in English, there is still ample room for performance improvement on non-Latin or multi-lingual datasets, such as RCTW-17, MSRA-TD500 and ICDAR 2017 MLT. It is potentially feasible to construct a common text detection engine based on character detectors, since the character is the most basic element of various languages. Some weakly supervised scene text detection frameworks [106, 107] have been proposed recently, and they can train robust scene text detectors with a small amount of annotated character images. We consider this direction worthy of further study in the future. The results on ICDAR 2015 and COCO-Text are also unsatisfactory, which means that we need to tackle the problem of incidental and diversified text detection. Enhancement and rectification methods [22, 31] should be integrated into conventional deep learning models so as to obtain better performance in future work. Moreover, many existing text recognition methods achieve poor performance with general lexicons. Schemes applying large-scale language information [108, 109] and sequence learning [55] have been proposed for text recognition, and they should be studied further.

Funding This work is supported in part by the Sub Project of National Key Research and Development Program (2017YFC0804002) and the National Natural Science Foundation of China (61662048, 61772277, 71771125 and 61603192).

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Yin XC, Yin X, Huang K, Hao HW (2014) Robust text detection in natural scene images. IEEE Trans Pattern Anal Mach Intell 36:970–983
2. Weinman JJ, Learned-Miller E, Hanson AR (2009) Scene text recognition using similarity and a lexicon with sparse belief propagation. IEEE Trans Pattern Anal Mach Intell 31:1733–1746
3. Karaoglu S, Tao R, Gevers T, Smeulders AWM (2017) Words matter: scene text for image classification and retrieval. IEEE Trans Multimedia 31:1063–1076
4. Ye Q, Doermann D (2015) Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell 37:1480–1500
5. Uchida S (2014) Text localization and recognition in images and video. In: Handbook of document image processing and recognition. Springer, London, pp 843–883
6. Babenko B, Belongie S (2012) End-to-end scene text recognition. In: IEEE international conference on computer vision, pp 1457–1464
7. Pan YF, Hou X, Liu CL (2011) A hybrid approach to detect and localize texts in natural scene images. IEEE Trans Image Process 20:800–813
8. Mishra A, Alahari K, Jawahar CV (2012) Scene text recognition using higher order language priors. In: Proceedings British machine vision conference, pp 1–11
9. Wang T, Wu DJ, Coates A, Ng AY (2012) End-to-end text recognition with convolutional neural networks. In: International conference on pattern recognition, pp 3304–3308
10. Jaderberg M, Vedaldi A, Zisserman A (2014) Deep features for text spotting. In: European conference on computer vision, pp 512–528
11. Epshtein B, Ofek E, Wexler Y (2010) Detecting text in natural scenes with stroke width transform. In: Computer vision & pattern recognition, pp 2963–2970
12. Neumann L, Matas J (2010) A method for text localization and recognition. In: Asian conference on computer vision, pp 770–783
13. Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Computer vision & pattern recognition, pp 3538–3545
14. Neumann L, Matas J (2015) Real-time lexicon-free scene text localization and recognition. IEEE Trans Pattern Anal Mach Intell 38:1872–1885
15. Yin XC, Yin X, Huang K, Hao HW (2014) Robust text detection in natural scene images. IEEE Trans Pattern Anal Mach Intell 36:970–983
16. Huang W, Qiao Y, Tang X (2014) Robust scene text detection with convolution neural network induced MSER trees. In: European conference on computer vision, pp 497–511
17. Gomez L, Karatzas D (2015) Object proposals for text extraction in the wild. In: International conference on document analysis and recognition, pp 206–210


18. Buta M, Neumann L, Matas J (2015) FASText: efficient unconstrained scene text detector. In: IEEE international conference on computer vision, pp 1206–1214
19. Zhang Z, Shen W, Yao C, Bai X (2015) Symmetry-based text line detection in natural scenes. In: IEEE conference on computer vision and pattern recognition, pp 2558–2567
20. Cho H, Sung M, Jun B (2016) Canny text detector: fast and robust scene text localization algorithm. In: IEEE conference on computer vision and pattern recognition, pp 3566–3573
21. Fabrizio J, Robert-Seidowsky M, Dubuisson S, Calarasanu S (2016) TextCatcher: a method to detect curved and challenging text in natural scenes. In: International conference on document analysis and recognition, pp 99–117
22. He T, Huang W, Qiao Y, Yao J (2016) Text-attentional convolutional neural networks for scene text detection. IEEE Trans Image Process 25:2529–2541
23. Zhu Y, Yao C, Bai X (2016) Scene text detection and recognition: recent advances and future trends. Front Comput Sci 10:19–36
24. Yao C, Bai X, Sang N, Zhou X, Zhou S (2016) Scene text detection via holistic, multi-channel prediction, pp 1–10. arXiv:1606.09002
25. Zhang Z, Zhang C, Shen W, Yao C, Liu W (2016) Multi-oriented text detection with fully convolutional networks. In: Computer vision & pattern recognition, pp 4159–4167
26. Qin S, Manduchi R (2017) Cascaded segmentation-detection networks for word-level text spotting. In: International conference on document analysis and recognition, pp 1275–1282
27. Dai Y, Huang Z, Gao Y, Xu Y, Chen K (2017) Fused text segmentation networks for multi-oriented scene text detection, pp 1–6. arXiv:1709.03272
28. Deng D, Liu H, Li X, Cai D (2018) PixelLink: detecting scene text via instance segmentation. In: Proceedings of association for the advancement of artificial intelligence, pp 1–8
29. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S (2016) SSD: single shot multibox detector. In: European conference on computer vision, pp 21–37
30. Li X, Wang W, Hou W, Liu RZ, Lu T (2018) Shape robust text detection with progressive scale expansion network, pp 1–12. arXiv:1806.02559
31. Yang Q, Cheng M, Zhou W, Chen Y, Qiu M (2018) IncepText: a new inception-text module with deformable PSROI pooling for multi-oriented scene text detection. In: International joint conference on artificial intelligence, pp 1–7
32. Dai J, Qi H, Xiong Y, Li Y, Zhang G (2017) Deformable convolutional networks. In: IEEE international conference on computer vision, pp 764–773
33. He T, Huang W, Qiao Y, Yao J (2016) Accurate text localization in natural image with cascaded convolutional text network, pp 1–10. arXiv:1603.09423
34. Zhong Z, Jin L, Zhang S, Feng Z (2016) DeepText: a unified framework for text proposal generation and text detection in natural images, pp 1–12. arXiv:1605.07314v1
35. Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: IEEE conference on computer vision and pattern recognition, pp 2315–2324
36. Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision, pp 56–72
37. Liao M, Shi B, Bai X, Wang X, Liu W (2017) TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of association for the advancement of artificial intelligence, pp 1–7
38. Ma J, Shao W, Ye H, Wang L, Wang H (2017) Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans Multimed 20:1–9
39. Girshick R (2015) Fast R-CNN. In: IEEE international conference on computer vision, pp 1440–1448
40. He W, Zhang XY, Yin F, Liu CL (2017) Deep direct regression for multi-oriented scene text detection. In: IEEE international conference on computer vision, pp 745–753
41. Jiang Y, Zhu X, Wang X, Yang S, Li W (2017) R2CNN: rotational region CNN for orientation robust scene text detection, pp 1–8. arXiv:1706.09579
42. Liu Y, Jin L (2017) Deep matching prior network: toward tighter multi-oriented text detection. In: IEEE conference on computer vision and pattern recognition, pp 3454–3461
43. Shi B, Bai X, Belongie S (2017) Detecting oriented text in natural images by linking segments. In: IEEE conference on computer vision and pattern recognition, pp 2482–3490
44. Zhou X, Yao C, Wen H, Wang Y, Zhou S (2017) EAST: an efficient and accurate scene text detector. In: IEEE conference on computer vision and pattern recognition, pp 2642–2651
45. Liao M, Zhu Z, Shi B, Xia G, Bai X (2018) Rotation-sensitive regression for oriented scene text detection. In: IEEE conference on computer vision and pattern recognition, pp 1–10
46. He P, Huang W, He T, Zhu Q, Qiao Y (2017) Single shot text detector with regional attention. In: IEEE international conference on computer vision, pp 3047–3055
47. Zhong Z, Sun L, Huo Q (2018) An anchor-free region proposal network for faster R-CNN based text detection approaches, pp 1–8. arXiv:1804.09003
48. Lyu P, Yao C, Wu W, Yan S, Bai X (2018) Multi-oriented scene text detection via corner localization and region segmentation. In: IEEE conference on computer vision and pattern recognition, pp 1–10
49. He T, Tian Z, Huang W, Shen C, Qiao Y (2018) Single shot text spotter with explicit alignment and attention. In: IEEE conference on computer vision and pattern recognition, pp 1–10
50. He K, Gkioxari G, Dollar P, Girshick R (2018) Mask R-CNN. IEEE Trans Pattern Anal Mach Intell
51. Bissacco A, Cummins M, Netzer Y, Neven H (2013) PhotoOCR: reading text in uncontrolled conditions. In: IEEE international conference on computer vision, pp 785–792
52. Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014) Synthetic data and artificial neural networks for natural scene text recognition. In: Conference on neural information processing systems, pp 1–10
53. Yao C, Bai X, Liu W (2014) A unified framework for multi-oriented text detection and recognition. IEEE Trans Image Process 23:4737–4749
54. Almazan J, Gordo A, Fornes A, Valveny E (2015) Word spotting and recognition with embedded attributes. IEEE Trans Pattern Anal Mach Intell 36:2552–2566
55. Shi B, Bai X, Yao C (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell 39:2298–2304
56. Shi B, Yao C, Zhang C, Guo X (2015) Automatic script identification in the wild. In: International conference on document analysis and recognition, pp 531–535
57. Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2015) Deep structured output learning for unconstrained text recognition. In: International conference on learning representations, pp 1–10
58. He P, Huang W, Qiao Y, Loy CC, Tang X (2016) Reading scene text in deep convolutional sequences. In: Proceedings of association for the advancement of artificial intelligence, pp 1–8
59. Shi B, Wang X, Lyu P, Yao C, Bai X (2016) Robust scene text recognition with automatic rectification. In: IEEE conference on computer vision and pattern recognition, pp 1–9
60. Lee CY, Osindero S (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In: IEEE conference on computer vision and pattern recognition, pp 2231–2239


61. Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2016) Reading text in the wild with convolutional neural networks. Int J Comput Vis 116:1–20
62. Lou X, Kansky K, Lehrach W, Laan V, Marthi B (2016) Generative shape models: joint text recognition and segmentation with very little training data. In: Advances in neural information processing systems, Barcelona
63. Kang C, Kim G, Yoo SI (2017) Detection and recognition of text embedded in online images via neural context models. In: Proceedings of association for the advancement of artificial intelligence, pp 4103–4110
64. Moysset B, Kermorvant C, Wolf C (2017) Full-page text recognition: learning where to start and when to stop. In: International conference on document analysis and recognition, pp 871–876
65. Yang C, Yin XC, Li Z, Wu J, Guo C (2017) AdaDNNs: adaptive ensemble of deep neural networks for scene text recognition, pp 1–8. arXiv:1710.03425
66. Bartz C, Yang H, Meinel C (2017) STN-OCR: a single neural network for text detection and text recognition, pp 1–9. arXiv:1707.08831
67. Gomezbigorda L, Karatzas D (2017) TextProposals: a text-specific selective search algorithm for word spotting in the wild. Pattern Recognit 70:60–74
68. Bai F, Cheng Z, Niu Y, Pu S, Zhou S (2018) Edit probability for scene text recognition, pp 1–9. arXiv:1805.03384
69. Liu X, Liang D, Yan S, Chen D, Qiao Y (2018) FOTS: fast oriented text spotting with a unified network, pp 1–10. arXiv:1801.01671
70. Liao M, Shi B, Bai X (2018) TextBoxes++: a single-shot oriented scene text detector. IEEE Trans Image Process 27:3676–3690
71. Shi C, Wang C, Xiao B, Zhang Y, Gao S (2013) Scene text recognition using part-based tree-structured character detection. In: IEEE conference on computer vision and pattern recognition, pp 2961–2968
72. Bai X, Yao C, Liu W (2014) Strokelets: a learned multi-scale representation for scene text recognition. In: IEEE conference on computer vision and pattern recognition, pp 4042–4049
73. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: IEEE conference on computer vision and pattern recognition, pp 3431–3440
74. He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition, pp 1–12. arXiv:1512.03385
75. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition, pp 779–788
76. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: International conference on neural information processing systems, pp 91–99
77. Xie S, Tu Z (2015) Holistically-nested edge detection. In: International journal of computer vision, pp 1–16
78. Postl W (1986) Detection of linear oblique structures and skew scan in digitized documents. In: International conference on pattern recognition, pp 687–689
79. Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, pp 379–387
80. Zhou Y, Ye Q, Qiu Q, Jiao J (2017) Oriented response networks. In: IEEE conference on computer vision and pattern recognition, pp 4961–4970
81. Yao C, Zhang X, Bai X, Liu W, Ma Y, Tu Z (2012) Detecting texts of arbitrary orientations in natural images. In: IEEE international conference on computer vision, pp 1083–1090
82. Karatzas D, Antonacopoulos A (2004) Text extraction from web images based on a split-and-merge segmentation method using colour perception. In: International conference on pattern recognition, pp 634–637
83. Rajendran D, Shivakumara P, Su B, Lu S, Tan CL (2011) A new Fourier-moments based video word and character extraction method for recognition. In: International conference on document analysis and recognition, pp 1165–1169
84. Sharma N, Shivakumara P, Pal U, Blumenstein M, Tan CL (2012) A new method for arbitrarily-oriented text detection in video. In: Proceedings of the IAPR international workshop on document analysis systems, pp 74–78
85. Shivakumara P, Sreedhar R, Phan T, Lu S, Tan CL (2012) Multioriented video scene text detection through Bayesian classification and boundary growing. IEEE Trans Circuits Syst Video Technol 22:1227–1235
86. Singh C, Bhatia N, Kaur A (2008) Hough transform based fast skew detection and accurate skew correction methods. Pattern Recognit 41:3528–3546
87. Yi C, Tian Y (2011) Text string detection from natural scenes by structure-based partition and grouping. IEEE Trans Image Process 20:2594–2605
88. Shivakumara P, Phan TQ, Tan CL (2011) A Laplacian approach to multi-oriented text detection in video. IEEE Trans Pattern Anal Mach Intell 33:412–419
89. Pan YF, Hou X, Liu CL (2011) A hybrid approach to detect and localize texts in natural scene images. IEEE Trans Image Process 20:800–813
90. Alsharif O, Pineau J (2013) End-to-end text recognition with hybrid HMM maxout models, pp 1–10. arXiv:1310.1811v1
91. Jawahar CV, Alahari K, Mishra A (2012) Top-down and bottom-up cues for scene text recognition. In: IEEE conference on computer vision and pattern recognition, pp 2687–2694
92. Novikova T, Barinova O, Kohli P, Lempitsky V (2012) Large-lexicon attribute-consistent text recognition in natural images. In: European conference on computer vision, pp 752–765
93. Graves A, Gomez F (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: International conference on machine learning, pp 369–376
94. http://www.iapr-tc11.org/mediawiki/index.php?title=ICDAR_2003_Robust_Reading_competitions. Accessed July 2018
95. http://www.cvc.uab.es/icdar2011competition/?com=downloads. Accessed July 2018
96. http://rrc.cvc.uab.es/?ch=2&com=downloads. Accessed July 2018
97. http://rrc.cvc.uab.es/?ch=4&com=downloads. Accessed July 2018
98. http://rrc.cvc.uab.es/?ch=8&com=introduction. Accessed July 2018
99. http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500). Accessed July 2018
100. https://vision.cornell.edu/se3/coco-text-2/. Accessed July 2018
101. http://vision.ucsd.edu/~kai/grocr/. Accessed July 2018
102. http://rctw.vlrlab.net/. Accessed July 2018
103. http://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset. Accessed July 2018
104. http://www.robots.ox.ac.uk/~vgg/data/scenetext/. Accessed July 2018
105. http://www.robots.ox.ac.uk/~vgg/data/text/. Accessed July 2018
106. Tian S, Lu S, Li C (2017) WeText: scene text detection under weak supervision. In: IEEE international conference on computer vision, pp 1501–1509
107. Hu H, Zhang C, Luo Y, Wang Y, Han J (2017) WordSup: exploiting word annotations for character based text detection. In: IEEE international conference on computer vision, pp 4950–4959


108. Weinman JJ, Butler Z, Knoll D, Field J (2014) Toward integrated scene text reading. IEEE Trans Pattern Anal Mach Intell 36:375–387
109. Bai X, Yao C, Liu W (2016) Strokelets: a learned multi-scale mid-level representation for scene text recognition. IEEE Trans Image Process 25:2789–2802
