
Stroke-Based Scene Text Erasing Using Synthetic Data for Training
Zhengmi Tang, Tomo Miyazaki, Member, IEEE, Yoshihiro Sugaya, Member, IEEE,
and Shinichiro Omachi, Senior Member, IEEE

Manuscript received March 10, 2021. This work was partially supported by JSPS KAKENHI Grant Numbers 18K19772, 19K12033, 20H04201. The authors are with the Graduate School of Engineering, Tohoku University, Sendai, 980-8579, Japan (e-mail: [email protected], [email protected], [email protected], [email protected]).

Abstract—Scene text erasing, which replaces text regions with reasonable content in natural images, has drawn significant attention in the computer vision community in recent years. There are two potential subtasks in scene text erasing: text detection and image inpainting. Either subtask requires considerable data to achieve good performance; however, the lack of a large-scale real-world scene-text removal dataset does not allow existing methods to work to their full potential. To avoid the limitation imposed by the lack of pairwise real-world data, we enhance and make considerable use of synthetic text, and train our model only on the dataset generated by the improved synthetic text engine. Our proposed network contains a stroke mask prediction module and a background inpainting module that can extract the text stroke as a relatively small hole from the cropped text image to maintain more background content for better inpainting results. This model can partially erase text instances in a scene image when bounding boxes are provided, or work with an existing scene-text detector for automatic scene text erasing. The experimental results from the qualitative and quantitative evaluation on the SCUT-Syn, ICDAR2013, and SCUT-EnsText datasets demonstrate that our method significantly outperforms existing state-of-the-art methods, even when they were trained on real-world data.

Index Terms—Scene text erasing, Synthetic text, Background inpainting

Fig. 1. Scene Text Erasing: original images and text bounding boxes (left); text-erased images by our method (middle); predicted text stroke mask (right).

I. INTRODUCTION

Texts created by humankind comprise rich, precise, high-level semantics. Text in our daily life provides us with a considerable amount of valuable information. However, with the increasing number of portable devices such as digital cameras, tablets, and smartphones, and the spread of SNS, a huge volume of scene images including text is shared on the internet every second. These texts can contain private information such as names, addresses, and vehicle number plates. With the increasing development of scene text detection and recognition technology, there is a high risk that automatically collected information is used for illegal purposes. Therefore, scene text erasing, which replaces text regions in scene images with proper content, has drawn considerable attention in the computer vision community in recent years.

After the prior work of Scene Text Eraser [1], scene text erasing research has developed in two directions: one-step and two-step methods. One-step methods [2]–[4] do not need the input of any text location information, as they combine the text detection and inpainting functions into one network, making them lightweight and fast. The drawback is that the text localization mechanism of these networks is weak, and the text-erasing process is not controllable. To allow the network to learn the complicated distribution of scene texts, a considerable number of manually text-erased real-world images are required as training data, because the text distribution of the Synth-text [5] dataset, which is generated according to human-defined rules, is significantly different from real-world cases. To generate scene-text-erased images [4], [6], photo editing software such as Photoshop is used to fill the text regions of natural images with visually plausible content. However, this type of annotation is expensive and time-consuming, because the annotators need to operate carefully to guarantee the quality of the erasing, especially when the text to be erased is on a complicated background.

Two-step approaches decompose the text-erasing task into two sub-problems: text detection and background inpainting. MTRNet [7] inpaints the text region using a manually provided mask of the text regions. Zdenek et al. [8] use a pretrained scene text detection model and an inpainting model to erase the text. This weak supervision method does not require paired training data. However, the inpainting model is trained on the Street View [9] or ImageNet [10] datasets, and thus the pretrained models face domain shift problems and cannot achieve a "perfect fit" in the context of scene text.

In this study, we propose a word-level, two-stage scene-text-erasing network that works on cropped text images. It first predicts the region of the text stroke as a hole, and then inpaints the hole according to the background. The network is trained in an end-to-end fashion.

To avoid the limitation imposed by the lack of pairwise real-world data, we make full use of synthetic texts. The appearance of the synthetic texts [5] is enhanced, and we trained our model only on the dataset generated by the improved synthetic text engine. The model can partially erase text instances in a scene image if text bounding boxes are provided. The model can also work with existing scene text detectors for automatic scene text erasing. Examples of the text-erasing results obtained by the proposed method are shown in Fig. 1.

The main contributions of our study can be summarized as follows:
• We propose a practical text erasing method and a stroke-based text-erasing network that uses cropped text images instead of the entire image, which makes the prediction of the pixel-level text stroke mask more accurate and stable. Benefiting from our erasing pipeline and network structure, our method can erase text instances while retaining more background information and restoring the background texture.
• We enhance the text synthesis engine [5] to make the appearance of synthetic text instances share more similarity with real-world data.
• The quantitative and qualitative evaluation results on the SCUT-Syn [3], ICDAR 2013 [11], and SCUT-EnsText [4] datasets show that our method outperforms previous state-of-the-art methods while being trained only on the dataset generated by the improved synthetic text engine [5].

The remainder of this paper is organized as follows. Section II reviews related works on scene-text detection, image inpainting, and text erasing. Section III introduces the details of our method, including the pipeline and the proposed networks. In Section IV, our proposed method is evaluated and compared with related inpainting and scene-text erasing studies through experiments. Finally, we provide concluding statements in Section V.

II. RELATED WORK

A. Scene Text Detection

The emergence of deep learning has facilitated the development of scene text detection research and shows promising performance compared to traditional manually designed algorithms [12]–[16]. Recent learning-based scene text detection methods can be roughly categorized into regression-based and segmentation-based methods. Regression-based methods aim to predict the bounding boxes of text instances directly. TextBoxes [17] adjusts the aspect ratios of anchors in SSD [18] to detect text with different shapes. CTPN [19] combines the framework of Faster R-CNN [20] with a recurrence mechanism to predict the contextual and dense components of text. RRPN [21] proposes a rotation region proposal to bind multi-oriented scene text with rotated rectangles. EAST [22] directly regresses the rotated rectangles or quadrangles of text through a simplified pipeline without using any anchors. LOMO [23] detects long text and arbitrarily shaped text in scene images by refining the preliminary proposals iteratively and considering text geometry properties including the text region, text center line, and border offsets.

Segmentation-based methods usually first extract text pixels from the segmentation map and then obtain bounding boxes of the text by post-processing. Zhang et al. [24] used an FCN and MSER for pixel-level multi-oriented text detection. Mask TextSpotter [25] built a model based on the framework of Mask R-CNN [26] and performed character-level instance segmentation for each character. TextSnake [27] proposed a novel representation of arbitrarily shaped text and predicted heat maps of text center lines, text regions, radii, and orientations to extract text regions. PSENet [28] gradually expanded the text region from small kernels to large ones to make final predictions through multiple semantic segmentation maps. Liao et al. [66] proposed a module called differentiable binarization (DB) to perform the binarization process in a segmentation network. CRAFT [29] proposed to learn each character center and the affinity between characters in the form of a heat map.

B. Image Inpainting

Image inpainting fills the hole regions of an image with plausible content. Image inpainting research can generally be divided into two categories: non-learning and deep-learning-based approaches. Non-learning approaches transfer the surrounding content to the hole region based on low-level features using traditional algorithms such as patch matching [30]–[32] and diffusion [33], [34]. Although these methods work well on small holes, they cannot deal with large missing regions, where semantically meaningful patches should be selected for hole restoration based on a high-level understanding of the whole image.

The generative adversarial network (GAN) strategy has been widely adopted in recent deep-learning-based approaches, which can be summarized as single-stage inpainting and progressive inpainting. Context Encoders [9] first trained an image-inpainting deep neural network using an encoder–decoder structure and adversarial losses. Iizuka et al. [35] adopted dilated convolution and proposed global and local discriminators for adversarial training. Liu et al. [36] defined a partial convolutional layer with a mask-update mechanism to ensure that the partial convolution filters learn more valid information from the non-hole region and can robustly handle holes of any shape. Xie et al. [37] proposed a learnable bidirectional attention module that includes forward and reverse attention maps for more effective hole filling. Li et al. [38] exploited the correlation between adjacent pixels by recurrently inferring and gathering the hole boundary for the encoder, named the recurrent feature reasoning module.

To generate more realistic textures, a coarse-to-fine strategy was adopted in the progressive inpainting methods. Yu et al. [39] exploited textural similarities to borrow feature information from known background patches to generate the missing patches and designed a two-stage network architecture from coarse to fine. Yu et al. [40] introduced gated convolution, which further generalizes partial convolution by making the mask-update mechanism learnable; the method was then combined with the SN-PatchGAN discriminator to obtain better performance.

Fig. 2. Pipeline of our proposed method. Our text-erasing network takes cropped text images as inputs and outputs the corresponding text-erased images.
Then, by placing these text-erased images back to the original locations of the input image, we can obtain the final text-erased image.
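As a concrete illustration of the pipeline summarized in this caption (and detailed in Section III), the sketch below crops each annotated text region, runs it through the erasing network, and pastes the result back. It assumes axis-aligned integer boxes and a hypothetical `erase_net` callable (a stand-in for the trained network, not part of the paper's code) that maps a cropped image to a text-erased image of the same size; quadrilateral and curved annotations additionally require the rectification described in Section IV-A.

```python
import numpy as np

def erase_scene_text(image: np.ndarray, boxes, erase_net) -> np.ndarray:
    """Erase text from `image` given axis-aligned boxes (x0, y0, x1, y1).

    `erase_net` takes a cropped H x W x 3 image and returns a text-erased
    image of the same size. The paper additionally expands each box before
    cropping and pastes back only the region inside the original box.
    """
    out = image.copy()
    for (x0, y0, x1, y1) in boxes:
        crop = out[y0:y1, x0:x1]        # cropped text image
        erased = erase_net(crop)        # text-erased crop
        out[y0:y1, x0:x1] = erased      # paste back at the original location
    return out
```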

Fig. 3. Structure of our proposed network. It is composed of a stroke mask prediction module (SMPM) (top) and a background inpainting module (BIPM) (bottom). SMPM first predicts the stroke mask of the text, and BIPM inpaints the stroke pixels with the proper content to erase the text.

Yi et al. [41] improved gated convolution through a lightweight design and proposed high-frequency residuals to generate rich and detailed textures for efficient ultra-high-resolution image inpainting.

C. Text Erasing

Early text erasing research started with the removal of born-digital text such as watermarks, captions, and subtitles on images or video sequences [42]. Owing to the plain layout, color, and regular font, born-digital text can be detected by traditional feature engineering approaches, such as binarization [43], [44], and inpainted by a patch matching [45] or smoothing algorithm [46].

Erasing text in the wild is a more complex and challenging task owing to the various fonts, layouts, and illumination conditions. The recent rapid development of deep neural networks has made scene text erasure a promising research task in the computer vision field, and it can be classified into two categories: one-step and two-step methods. One-step methods use an end-to-end model to directly output the text-erased image without the aid of text location information. Pix2Pix [47] is a conditional generative adversarial network (cGAN) designed for general-purpose image-to-image translation tasks, which can be applied to text removal. Nakamura et al. [1] made the first attempt to build a one-stage scene text eraser, a sliding-window method using a skip-connected auto-encoder. This method destroys the integrity of text strokes but cannot maintain the global consistency of the entire image. EnsNet [3] adopted a cGAN with a refined loss function and a local-aware discriminator to erase texts at the entire-image level. Liu et al. [4] provided a comprehensive real-world scene-text removal benchmark, named SCUT-EnsText, and proposed EraseNet, which adopts a coarse-to-fine erasure network structure with a segmentation head that generates a mask of the text region to help with text region localization. MTRNet++ [2] shares the same coarse-to-fine inpainting idea but uses a multi-branch generator; its mask-refine branch predicts stroke-level text masks to guide text removal.

Two-step methods remove the text with awareness of the text location, which can be provided by users or by a pretrained text detector.

Qin et al. [48] proposed a cGAN with one encoder and two decoders to perform content region segmentation and inpainting jointly; this network can be applied to text removal in cropped text images. MTRNet [7] uses manually provided text masks to guide the training and prediction processes of a cGAN, thereby making the text-erasing region controllable. Zdenek et al. [8] proposed a weak supervision method employing a pretrained scene text detector [28] and an image inpainting model [49] to free the scene text erasing task from the requirement of paired training data of scene images with text and the corresponding text-erased images. Bian et al. [6] proposed a cascaded GAN-based model to decouple text stroke detection and stroke removal in text removal tasks.

We consider our proposed method a practical solution to scene text erasing tasks, because the pretrained scene text detector and synthetic data obviate the need for paired real-world data. The relatively weak text detection ability of one-step end-to-end methods can lead to excessive erasure of text-free areas and incomplete erasure of the text region, and the detection ability of these networks can only be improved during end-to-end training, which would require expensive real-world training data. The advantage of our method over weak supervision methods is that using improved synthetic data for training greatly reduces the domain shift problem during the inpainting process.

III. METHODOLOGY

The pipeline of the proposed method is shown in Fig. 2. Given the source image containing the scene text and the corresponding text bounding boxes, text images are cropped from the source image. Subsequently, each cropped text image goes through the text-erasing network. The cropped text-erased images are then inserted into the source image to obtain an output image that does not contain text.

Our network comprises 1) a stroke mask prediction module and 2) a background inpainting module, as illustrated in Fig. 3. Specifically, the stroke mask prediction module first predicts the stroke mask of the scene text Î_mask as the hole from I_in. Then, the background inpainting module fills the hole of the input image I_in with the appropriate content and outputs the text-erased image Î_out.

A. Stroke Mask Prediction Module

An encoder–decoder FCN is adopted in this module. The input image I_in is encoded by three down-sampling convolutional layers and four residual blocks [50], and the feature maps F_m are decoded by three up-sampling transposed convolutional layers to generate the text stroke mask Î_mask. The architecture of the SMPM is presented in Table I. Skip connections concatenate feature maps of the same shape between the encoder and the decoder. The feature maps F_m output from the residual blocks are concatenated with the feature maps F_b in the background inpainting module, which is introduced in Section III-B.

TABLE I
Architecture of the SMPM. Conv: convolutional layer, Deconv: transposed convolutional layer, ResBlock: residual convolutional block, c: channels, k: kernel size, s: stride.

Layers      | Configurations       | Output
Conv×2      | c: 32, k: 3, s: 1    | 128×640
Conv        | c: 64, k: 3, s: 2    | 64×320
Conv×2      | c: 64, k: 3, s: 1    | 64×320
Conv        | c: 128, k: 3, s: 2   | 32×160
Conv×2      | c: 128, k: 3, s: 1   | 32×160
Conv        | c: 256, k: 3, s: 2   | 16×80
Conv×2      | c: 256, k: 3, s: 1   | 16×80
ResBlock×4  | c: 256, k: 3         | 16×80
Conv×2      | c: 256, k: 3, s: 1   | 16×80
Deconv      | c: 128, k: 3, s: 1/2 | 32×160
Conv×2      | c: 128, k: 3, s: 1   | 32×160
Deconv      | c: 64, k: 3, s: 1/2  | 64×320
Conv×2      | c: 64, k: 3, s: 1    | 64×320
Deconv      | c: 32, k: 3, s: 1/2  | 128×640
Conv×2      | c: 32, k: 3, s: 1    | 128×640
Conv        | c: 3, k: 3, s: 1     | 128×640

TABLE II
Architecture of the BIPM. PConv: partial convolutional layer, Upsample: upsampling layer, GCblock: global context block, scale: scale factor, c: channels, k: kernel size, s: stride.

Layers      | Configurations       | Output
PConv       | c: 64, k: 7, s: 2    | 64×320
PConv×2     | c: 64, k: 3, s: 1    | 64×320
PConv       | c: 128, k: 5, s: 2   | 32×160
PConv×2     | c: 128, k: 3, s: 1   | 32×160
PConv       | c: 256, k: 3, s: 2   | 16×80
PConv×2     | c: 256, k: 3, s: 1   | 16×80
GCblock     | c: 512, ratio: 4     | 16×80
Conv        | c: 256, k: 3, s: 1   | 16×80
Upsample    | scale: 2             | 32×160
PConv×2     | c: 256, k: 3, s: 1   | 32×160
PConv       | c: 128, k: 3, s: 1   | 32×160
Upsample    | scale: 2             | 64×320
PConv×2     | c: 128, k: 3, s: 1   | 64×320
PConv       | c: 64, k: 3, s: 1    | 64×320
Upsample    | scale: 2             | 128×640
PConv×2     | c: 64, k: 3, s: 1    | 128×640
PConv       | c: 3, k: 3, s: 1     | 128×640

Because there is usually an imbalance between the pixels of the text region and the pixels of the non-text region in I_in, and both high recall and high precision are expected during mask prediction, the dice loss [51] and an L1 loss are employed to guide the generation of the text mask. In image segmentation, the dice loss is a region-based loss that measures the proportion of correctly predicted pixels relative to the total pixels of both the prediction and the ground truth. Mathematically, the dice loss is defined as

L_{dice} = 1 - \frac{2\sum_{i}^{N}(\hat{I}_{mask})_i (I_{mask})_i}{\sum_{i}^{N}(\hat{I}_{mask})_i + \sum_{i}^{N}(I_{mask})_i},  (1)

where N denotes the total number of pixels in the input image, and Î_mask and I_mask represent the prediction and the ground truth of the text mask, respectively. The total loss of the stroke mask prediction module is

L_{SMPM} = \|\hat{I}_{mask} - I_{mask}\|_1 + \lambda_0 L_{dice},  (2)

and in our experiments, λ_0 is set to 1.0.
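A minimal PyTorch sketch of the SMPM objective in Eqs. (1) and (2) follows; it is an illustration rather than the authors' implementation, and the small constant `eps` (not present in Eq. (1)) is added only for numerical stability.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (1): 1 - 2*sum(pred*gt) / (sum(pred) + sum(gt))."""
    inter = (pred * gt).sum()
    return 1.0 - 2.0 * inter / (pred.sum() + gt.sum() + eps)

def smpm_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor,
              lambda0: float = 1.0) -> torch.Tensor:
    """Eq. (2): L1 reconstruction of the mask plus weighted dice loss."""
    return F.l1_loss(pred_mask, gt_mask) + lambda0 * dice_loss(pred_mask, gt_mask)
```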

B. Background Inpainting Module

In this module, the input image I_in and the predicted mask image Î_mask are taken as input, and the output is a background image O_b in which all text stroke pixels, including text shadows caused by illumination, are replaced with proper texture. As shown in the blue part of Fig. 3, the input image is encoded by three down-sampling partial convolutional layers and concatenated with the feature map F_m from the stroke mask prediction module. This is followed by a self-attention block, which allows the network to learn long-range dependencies. Finally, the decoder generates the output image Î_out. The architecture of the BIPM is presented in Table II.

1) Partial Convolutional Layers: To generate clearer background images and suppress text ghosts and artifacts, we use partial convolutional layers [36] to allow the network to learn more features from the non-text part of I_in. A partial convolutional layer comprises two steps, a partial convolution operation and a mask update, which are defined as

x' = \begin{cases} W^{T}(X \odot M)\,\frac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise,} \end{cases}  (3)

m' = \begin{cases} 1, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise,} \end{cases}  (4)

where W and b indicate the weights and bias of the convolution filter, respectively, X are the input pixels for the current convolution window, and M is the corresponding mask in the receptive field. ⊙ denotes element-wise multiplication, x' is the output feature value of the partial convolution, and m' is the updated mask value. 1 is an all-one matrix with the same shape as M.

2) Skip Connection and Self-Attention Block: By leveraging partial convolutional layers, we can extract more features from outside the text region, which are beneficial for background reconstruction. However, in some cases, the feature information related to texture and illumination (such as highlights, shadows, and transparency) inside the text region is also helpful for background inpainting. Thus, considering that the feature maps F_m generated in the stroke mask prediction module contain rich feature information on text, we concatenate the feature maps F_m and F_b from the two modules using a skip connection and feed the concatenated feature maps into a self-attention network that learns both the correspondences between feature maps and non-local features. Consequently, the features inside and outside the text regions are split and weighted by the self-attention block for later decoding. Here, we adopt a global context (GC) block [52] as the self-attention block; the architecture of the GC block is illustrated in Fig. 4.

Fig. 4. Architecture of the self-attention block: Global Context (GC) block.

3) Training Loss: We introduce four loss functions to measure the structural and textural differences between the output image Î_out and the final ground truth I_out, taking into consideration the spatial smoothness of the inpainted region of Î_out: a pixel reconstruction loss, a perceptual loss, a style loss, and a total variation loss are used during the training of the background inpainting module. The details of the loss functions are given below.

The pixel loss is aimed at guiding pixel-level reconstruction, where more weight is given to the inpainted region. The pixel loss is formulated as

L_{pixel} = \|\hat{I}_{mask} \odot (\hat{I}_{out} - I_{out})\|_1 + 6\|(1 - \hat{I}_{mask}) \odot (\hat{I}_{out} - I_{out})\|_1.  (5)

The perceptual loss and style loss, also known as VGG loss [53], [54], are used to make the generated image more realistic. The perceptual loss captures high-level semantics and can be considered a simulation of human perception of images. It computes the L1 differences between feature representations, extracted at several levels by the same pretrained VGG network [55], of both Î_out and I_comp with respect to I_out. Here I_comp is the composed image, whose hole and non-hole regions are taken from Î_out and I_out, respectively:

I_{comp} = \hat{I}_{mask} \odot I_{out} + (1 - \hat{I}_{mask}) \odot \hat{I}_{out}.  (6)

The perceptual loss is defined as

L_{per} = \mathbb{E}\Big[\sum_{i} \|\phi_i(\hat{I}_{out}) - \phi_i(I_{out})\|_1 + \sum_{i} \|\phi_i(I_{comp}) - \phi_i(I_{out})\|_1\Big],  (7)

where φ_i is the activation map from the relu1_1 to the relu5_1 layer of an ImageNet-pretrained VGG-19 model. The style loss penalizes differences between both Î_out and I_comp and I_out in image style, such as color, texture, and pattern:

L_{style} = \mathbb{E}_i\big[\|G_{\phi_i}(\hat{I}_{out}) - G_{\phi_i}(I_{out})\|_1 + \|G_{\phi_i}(I_{comp}) - G_{\phi_i}(I_{out})\|_1\big],  (8)

where the Gram matrix G_{\phi_i} = \phi_i \phi_i^{T} / (C_i H_i W_i) and C_i H_i W_i is the shape of the feature map φ_i.

The total variation loss is employed to maintain spatial continuity and smoothness in the generated image and to reduce the effect of noise:

L_{tv} = \sum_{(i,j,j+1)\in M} \|I_{comp}^{i,j+1} - I_{comp}^{i,j}\|_1 + \sum_{(i,j,i+1)\in M} \|I_{comp}^{i+1,j} - I_{comp}^{i,j}\|_1,  (9)

where M is the hole region in Î_mask.

Finally, the loss function of the entire network is expressed as

L = L_{SMPM} + L_{pixel} + \lambda_1 L_{per} + \lambda_2 L_{style} + \lambda_3 L_{tv},  (10)

where λ_1, λ_2, and λ_3 are set to 0.05, 100, and 0.1 following Liu et al. [36]; we slightly changed the weight of the style loss according to our own training loss curve.
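As a reference, the objective in Eqs. (5)–(10) can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' code: `vgg_feats` is a hypothetical helper returning the relu1_1–relu5_1 activations of an ImageNet-pretrained VGG-19, the mask is assumed to be 1 on valid pixels and 0 on the text-stroke hole (consistent with Eq. (6)), means are used in place of unnormalized L1 norms, and the total-variation term is restricted to the hole only approximately by masking.

```python
import torch

def gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix G = f f^T / (C*H*W) for a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def tv_loss(comp: torch.Tensor, hole: torch.Tensor) -> torch.Tensor:
    """Eq. (9), approximately restricted to the hole region via masking."""
    dh = ((comp[:, :, :, 1:] - comp[:, :, :, :-1]).abs() * hole[:, :, :, 1:]).mean()
    dv = ((comp[:, :, 1:, :] - comp[:, :, :-1, :]).abs() * hole[:, :, 1:, :]).mean()
    return dh + dv

def total_loss(out, gt, mask, l_smpm, vgg_feats,
               lam_per=0.05, lam_style=100.0, lam_tv=0.1):
    """Eqs. (5)-(10); `mask` is 1 on valid pixels, 0 on the text-stroke hole."""
    hole = 1.0 - mask
    comp = mask * gt + hole * out                      # Eq. (6)

    # Eq. (5): L1 reconstruction, inpainted (hole) region weighted by 6.
    pix = (mask * (out - gt)).abs().mean() + 6.0 * (hole * (out - gt)).abs().mean()

    # Eqs. (7)-(8): perceptual and style terms over VGG-19 activations.
    per, sty = out.new_zeros(()), out.new_zeros(())
    for fo, fc, fg in zip(vgg_feats(out), vgg_feats(comp), vgg_feats(gt)):
        per = per + (fo - fg).abs().mean() + (fc - fg).abs().mean()
        sty = sty + (gram(fo) - gram(fg)).abs().mean() + (gram(fc) - gram(fg)).abs().mean()

    tv = tv_loss(comp, hole)                           # Eq. (9)
    return l_smpm + pix + lam_per * per + lam_style * sty + lam_tv * tv   # Eq. (10)
```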

Fig. 5. Some training image samples generated by our enhanced synthesis text engine. Input (top), final ground truth (middle), text mask ground truth (bottom)

IV. EXPERIMENT

A. Implementation Details

Our implementation is based on PyTorch. In the training process, we generated one million synthetic text images and corresponding text mask images as training data from background images that did not contain text. The input size of our network was 128×640: the height of each training image was resized to 128 while maintaining the aspect ratio, and if the resulting width was less than 640, the remaining pixels on the right side were padded with 0; otherwise, the width was resized to 640. The training batch size was 8 on a single 1080Ti GPU. We used Adam [56] to optimize the entire network with β = (0.9, 0.999) and set the weight decay to 0. The learning rate started from 0.0002 and decayed to nine-tenths of its value after each epoch in the training phase. The network was trained in an end-to-end manner, and we followed the fine-tuning strategy of [36], which freezes the batch normalization parameters in the encoder of the background inpainting module after approximately 10 epochs.

In the inference process, we first expanded and cropped the text bounding box to include more background information. Subsequently, we fed the cropped text image to our proposed network for prediction. Finally, the part of the output inside the original bounding box was pasted back into the source image. Text with arbitrary quadrilateral annotations was transformed by a perspective transformation, and text with curved annotations was transformed by a thin-plate spline into rectangular text images before being fed into our network. The network output was transformed back to its original shape and copied to the original position to obtain the final text-erased image.

B. Dataset and Evaluation Metrics

1) Synthetic Dataset:

• Improved Synth-text: We used over 1,500 English and Chinese fonts and 10,000 background images without text to generate a total of one million images for model training using our enhanced synthesis text engine, which is improved from the Synth-text technology [5]. The training dataset contains the original background images as the final ground truth, synthetic text composited onto the background images as input, and mask images of the synthetic text as the ground truth of the text mask. Compared with the vanilla Synth-text method, we made several improvements to make our generated data share more similarity with real-world data. 1) With a probability of 50%, the text instance is composed directly onto the background instead of being blended by Poisson blending, which would retain some background information behind the text instance. 2) Additional effects were added to the text instances, such as Gaussian blur to simulate out-of-focus text, text shift to simulate text with a 3D structure, and additional shadow parameters to make the text shadow more realistic. 3) We compressed and saved the images in JPEG format with different compression qualities to handle images of varying quality. 4) A dilated mask of each text instance, including all of its effects, was generated in the mask image to reduce the effect of JPEG artifacts around the text edge. Some samples of the generated images are shown in Fig. 5.
• SCUT-Syn [3] was created with the synthesis text engine [5] and contains 8,000 images for training and 800 images for testing. The background images of this dataset were collected from ICDAR 2013 [11] and ICDAR MLT-2017 [57], and the text instances in the background images were manually removed. Most test images are derived from the training set, and the training and testing sets were generated from the same background images, although the synthesized text instances are different. We evaluated our method using only the test images.

2) Real-world Dataset:

• ICDAR 2013 [11] is a widely used scene text image dataset that includes 229 training images and 223 testing images. All text instances are in English and are well focused. In this study, only the test set was used for the evaluation.
• SCUT-EnsText [4] is a comprehensive and challenging scene text removal dataset containing 2,749 training images and 813 testing images, which are collected from ICDAR 2013 [11], ICDAR 2015 [58], MS COCO-Text [59], SVT [60], MLT-2019 [61], and ArT [62]. The text instances of this dataset are in Chinese or English with diverse shapes, such as horizontal text, arbitrary quadrilateral text, and curved text. All text instances were carefully erased by annotators with good visual quality. By providing the original text annotations and text-erased ground truth, this dataset can be comprehensively used for both qualitative and quantitative evaluations. In this study, we used its test set to evaluate the performance of our method.

(a) Input (b) BIPM (c) SMPM+BIPM (d) SMPM+BIPM+SC (e) All (w/o PConv) (f) All
Fig. 6. Visual quality results of ablation study on SCUT-EnsText dataset. From left to right: input images, output of BIPM, output of SMPM+BIPM, output
of SMPM+BIPM+SC, output of all (w/o PConv), and output of all. SMPM: Stroke mask prediction module. BIPM: Background inpainting module. PConv:
Partial convolutions. SC: Skip connection between two modules. SA: Self-attention block. All: SMPM + BIPM + SC + SA.

TABLE III
Ablation study and qualitative comparison between different configurations of our proposed network on the SCUT-Syn and SCUT-EnsText datasets. SMPM: stroke mask prediction module. BIPM: background inpainting module. PConv: partial convolutions. SC: skip connection between the two modules. SA: self-attention block.

Method                             | SCUT-Syn: PSNR↑  SSIM(%)↑  MSE↓ | SCUT-EnsText: PSNR↑  SSIM(%)↑  MSE↓
BIPM (w/o PConv)                   | 34.78  96.67  0.00061           | 34.62  95.79  0.00114
SMPM + BIPM                        | 37.74  97.31  0.00034           | 36.11  96.31  0.00077
SMPM + BIPM + SC                   | 38.29  97.48  0.00030           | 36.36  96.39  0.00070
SMPM + BIPM (w/o PConv) + SC + SA  | 38.70  97.67  0.00025           | 36.49  96.39  0.00073
SMPM + BIPM + SC + SA (all)        | 38.60  97.55  0.00024           | 37.08  96.54  0.00054

3) Evaluation Metrics:

• Quantitative Evaluation: To quantify the text erasure ability of a model, we followed [2]–[4], [7], [8] and utilized a baseline scene text detection model to detect the texts in the text-erased images, evaluating how low the recall of the detection results is. A lower recall indicates that less text was detected and more text was erased by the model. To make a fair comparison with previous studies, the scene text detector EAST [22] and the ICDAR 2013 evaluation protocol were used for the ICDAR 2013 dataset [11], and the text detector CRAFT [29] and the ICDAR 2015 [58] protocol were adopted for the evaluation on SCUT-EnsText [4].
• Qualitative Evaluation: We followed previous image inpainting works by reporting metrics including the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) [63], and the mean squared error (MSE). Higher SSIM and PSNR and lower MSE values indicate better image restoration quality. Qualitative evaluations were conducted on both the SCUT-Syn [3] and SCUT-EnsText [4] datasets.

C. Ablation Study

In this section, we investigate the effectiveness of different settings of our proposed model. The stroke mask prediction module (SMPM), the skip connection (SC) between the two modules, the self-attention block (SA), and partial convolutions (PConv) are the focus of this study. The qualitative evaluation results on the SCUT-Syn and SCUT-EnsText datasets are presented in Table III, and some text-erasing samples are shown in Fig. 6.

• Stroke Mask Prediction Module: The SMPM aims to provide pixel-level information on the text region as the hole for the background inpainting module (BIPM), so that the network can learn more from the valid features of the non-text region and suppress text residue. The qualitative results are presented in Table III. Using the text mask can significantly improve the text erasing performance. It should be noted that, without the mask image from the stroke mask prediction module, the partial convolutional layers in the background inpainting module function as normal convolutional layers.

(a) Input of ours (b) Ground-truth (c) Ours (d) Predicted text mask

Fig. 7. Qualitative results of our method on the SCUT-Syn dataset. From left to right: input image and text bounding boxes, ground truth, output of our
method, and predicted text mask.

TABLE IV
Comparison between previous scene text-erasing studies and our proposed method on the SCUT-Syn and ICDAR2013 datasets.

Method               | SCUT-Syn: PSNR↑  SSIM(%)↑  MSE↓ | ICDAR2013: R↓ | Parameters | Inference speed | Input
Original images      | -      -      -                 | 70.83         | -          | -               | -
SceneTextEraser [1]  | 14.68  46.13  0.7148            | 10.08         | -          | -               | Image
Pix2Pix [47]         | 25.60  89.86  0.2465            | 10.19         | 54.4M      | 17ms            | Image
EnsNet [3]           | 37.36  96.44  0.0021            | 5.66          | 12.4M      | 24ms            | Image
MTRNet [7]           | 29.71  94.43  0.0001            | 0.18          | 54.4M      | -               | Image(256×256) + Text Mask
Weak Supervision [8] | 37.44  93.69  -                 | 2.47          | 28.7+6.0M  | 57+39ms         | Image(256×256)
MTRNet++ [2]         | 34.55  98.45  0.0004            | -             | 18.7M      | 37ms            | Image(256×256)
EraseNet [4]         | 38.32  97.67  0.0002            | -             | 19.7M      | 34ms            | Image
EAST [22] + Ours     | 31.18  95.93  0.002             | 0.73          | 24.1+9.9M  | 18+~23ms        | Image
Ours                 | 38.60  97.55  0.0002            | 0             | 9.9M       | ~23ms           | Image + BBox

• Skip Connection: The skip connection links and concatenates the low-resolution feature maps of the two modules to provide the features inside the text region to the decoder of the BIPM and to improve the accuracy and stability of text mask prediction. Table III shows that the skip connection between the two modules improves the erasing quality of the image on both the SCUT-Syn and SCUT-EnsText datasets.
• Self-Attention Block: The GC block adds channel-wise weights to the input feature maps, taking into consideration the correspondences between feature maps and non-local features. To confirm the importance of the self-attention block, we trained our network without the SA block. From Table III, we can see that the performance of our network decreases when the self-attention block is missing.
• Partial Convolutions: To evaluate the advantages of the partial convolutional layers, we also re-implemented our method without these layers. Table III lists the qualitative performance on the SCUT-Syn and SCUT-EnsText datasets. We observed that, compared to a network with normal convolutional layers, the network with partial convolutional layers performs better on SCUT-EnsText but worse on SCUT-Syn.

(a) Input of ours (b) Input of inpainting methods (c) Ours (d) LBAM (e) RFR-Net (f) HiFill
Fig. 8. Visual qualitative comparison between our method and state-of-the-art image inpainting methods on the SCUT-EnsText dataset. From left to right: input of our method, input of inpainting methods, output of our method, output of LBAM, output of RFR-Net, and output of HiFill.

TABLE V
Comparison between state-of-the-art inpainting methods and our proposed method on the SCUT-EnsText dataset.

Method       | SCUT-EnsText: PSNR↑  SSIM(%)↑  MSE↓ | Parameters | Inference speed | Input                      | Training dataset
LBAM [37]    | 36.21  95.58  0.0007                | 68.3M      | 11ms            | Image(256×256) + Text Mask | Paris Street View [64]
RFR-Net [38] | 36.95  96.12  0.0006                | 31.2M      | 90ms            | Image(256×256) + Text Mask | Paris Street View [64]
HiFill [41]  | 31.48  94.17  0.0021                | 2.7M       | 28ms            | Image + Text Mask          | Places2 [65]
Ours         | 37.89  97.02  0.0004                | 9.9M       | ~23ms           | Image(256×256) + BBox      | Improved Synth-text
Ours         | 37.08  96.54  0.0005                | 9.9M       | ~23ms           | Image + BBox               | Improved Synth-text

TABLE VI
Comparison between state-of-the-art scene text-erasing methods and our proposed method on the SCUT-EnsText dataset.

Method                   | Qualitative eval: PSNR↑  SSIM(%)↑  MSE↓ | Quantitative eval: R↓ | Parameters | Inference speed | Input
Original images          | -      -      -                         | 69.5                  | -          | -               | -
SceneTextEraser [1]      | 25.47  90.14  0.0047                    | 5.9                   | -          | -               | Image
EnsNet [3]               | 29.54  92.74  0.0024                    | 32.8                  | 12.4M      | 24ms            | Image
EraseNet [4]             | 32.30  95.42  0.0015                    | 4.6                   | 19.7M      | 34ms            | Image
DB-ResNet-18 [66] + Ours | 33.17  95.44  0.0020                    | 10.3                  | 12.6+9.9M  | 17+~27ms        | Image
DB-ResNet-50 [66] + Ours | 33.54  95.57  0.0018                    | 10.5                  | 26.1+9.9M  | 37+~27ms        | Image
CRAFT [29] + Ours        | 35.34  96.24  0.0009                    | 3.6                   | 20.7+9.9M  | 120+~27ms       | Image

We believe that discrepancies between the synthetic data and real-world data are the cause of this difference in performance. As mentioned before, one reason for improving the Synth-text engine is that Poisson image editing retains some texture information of the background image when it blends the foreground text instances into a background image, whereas most scene text instances in real-world images are not transparent. For the erasure results on the SCUT-Syn dataset, compared with directly extracting features using normal convolutions, using partial convolutions and the F_m feature concatenation to split the features inside and outside the text region is a relatively inefficient approach for erasing such transparent texts. However, on real-world data such as SCUT-EnsText, partial convolutions achieve better performance than normal convolutions.
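For reference, the masked convolution and mask update of Eqs. (3) and (4) that this ablation toggles can be sketched as a single PyTorch layer. This is a simplified illustration under the stated mask convention (mask is 1 on valid pixels), not the implementation of Liu et al. [36].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Single partial-convolution layer following Eqs. (3)-(4)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used to count valid mask pixels per window, i.e. sum(M).
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: (B, C, H, W); mask: (B, 1, H, W), 1 on valid pixels, 0 on the hole.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
            scale = self.ones.sum() / valid.clamp(min=1.0)   # sum(1) / sum(M)
            new_mask = (valid > 0).float()                   # Eq. (4)
        out = self.conv(x * mask)                            # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (out - bias) * scale + bias                    # Eq. (3) re-normalization
        return out * new_mask, new_mask
```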

(a) Input (b) Ground-truth (c) Ours (d) Predicted text mask

Fig. 9. Qualitative results of our method on SCUT-EnsText dataset. From left to right: input, ground truth, output of our method, and predicted text mask.

(a) Input (b) Ground-truth (c) Ours (d) Predicted text mask

Fig. 10. Our method can retain more detailed background information and restore the background texture. From left to right: input image and text bounding
boxes, ground truth, output of our method, and predicted text mask.

D. Comparison With State-of-the-Art Methods

To evaluate the performance of our proposed method, we compared it with recent state-of-the-art methods on the SCUT-Syn, ICDAR2013, and SCUT-EnsText datasets. For the SCUT-Syn and ICDAR2013 datasets, the results of SceneTextEraser, Pix2Pix, and EnsNet were implemented and reported by Zhang et al. [3]. The results of MTRNet [7], weak supervision [8], MTRNet++ [2], and EraseNet [4] were collected from the official reports in the respective papers. Unless described otherwise, the resolution of the input image is 512×512. Table IV shows the results for the SCUT-Syn [3] and ICDAR2013 [11] datasets. Our proposed method achieves the highest PSNR on the SCUT-Syn dataset and the lowest recall on the ICDAR2013 dataset when the bounding boxes are provided. Some text-erasing examples on the SCUT-Syn dataset are shown in Fig. 7. Our method also achieves a lower recall on the ICDAR2013 dataset when the detection result of EAST [22] is used to provide the text locations. Here, R represents recall, i.e., the detection result of EAST under the ICDAR2013 evaluation protocol. We consider that the reason why MTRNet++, EraseNet, and EnsNet can generate higher-SSIM images on the SCUT-Syn dataset is that most test images are included in the training set and share the same background images with the training images when they are generated. EAST and our model were not trained on the SCUT-Syn training set, resulting in a lower PSNR and SSIM for our results. We believe that this dataset cannot fully reflect the generalization ability of a network when it is used for training. Our text-erasing network is lightweight, with only 9.9 million trainable parameters. For a fair comparison of the inference speed, we tested all methods on a single 1080Ti GPU and an AMD Ryzen 7 3700X @ 3.6 GHz CPU with the original input size of the networks.
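For context, GPU forward time of this kind is commonly measured with explicit synchronization. The sketch below is illustrative and is not the authors' benchmarking code; it assumes a model taking a single CUDA tensor.

```python
import time
import torch

@torch.no_grad()
def average_inference_ms(model, x, warmup: int = 10, iters: int = 100) -> float:
    """Average forward time in milliseconds for a model on a CUDA device."""
    model.eval()
    for _ in range(warmup):          # warm-up excludes CUDA initialization costs
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all queued GPU kernels to finish
    return (time.perf_counter() - start) / iters * 1000.0
```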

(a) Input (b) Ground-truth (c) Ours (d) Predicted text mask

Fig. 11. Some failure cases of our method. From left to right: input image and text bounding boxes, ground truth, output of our method, and predicted text
mask.

The inference time of our method consists of the time cost of the network forward pass, pre-processing, and post-processing. The time cost of pre-processing and post-processing is approximately 4 ms when using the perspective transformation and 76 ms when using the off-the-shelf thin-plate-spline function in OpenCV.

For the SCUT-EnsText dataset, we compared our method with state-of-the-art image inpainting methods and scene text erasing methods. The results of the comparison with the inpainting methods on the SCUT-EnsText dataset are shown in Table V. Our scene text erasing method achieves excellent results in image quality when text bounding box information is provided. We made some revisions to the original text location annotations because we observed some unmatched cases between the location of the erased text and the text bounding box of the ground truth. We selected three state-of-the-art image inpainting methods: LBAM [37], RFR-Net [38], and HiFill [41]. The LBAM [37] and RFR-Net [38] models were pretrained on the Paris Street View dataset [64], and the HiFill [41] model was pretrained on the Places2 dataset [65]. For a fair comparison with the pretrained inpainting methods, we generated the hole mask directly from our revised text location annotations and resized the input images to the same size, because some pretrained inpainting models only work at a resolution of 256×256. Our method can generate images with higher quality than the state-of-the-art image inpainting methods under the same conditions. Some visual quality comparison samples are presented in Fig. 8. We observed that there are always certain unusual textures or artifacts in the text-erased images inpainted by the pretrained image inpainting methods, causing unnatural erasing results. We also found that the inpainting logic of HiFill [41] shows a preference for restoring the hole region with background information from farther away, leading to worse results than those obtained by the other two methods. Although this model is pretrained on the Places2 [65] dataset, which contains many indoor and urban views, it still shows a serious domain shift problem when facing scene text erasing tasks.

In addition, we used our method with a scene text detector as a two-step automatic scene text eraser. In our experiment, we used DB [66] and CRAFT [29] for scene text detection and produced arbitrary quadrilateral bounding boxes as the input for our method. DB-ResNet-18 and DB-ResNet-50 were pretrained on the SynthText and ICDAR 2015 datasets with the box threshold set to 0.3. CRAFT was pretrained on the SynthText, ICDAR 2013, and MLT 2017 datasets, and the text threshold was set to 0.6.
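As described in Section IV-A, each quadrilateral box is rectified to a horizontal rectangle by a perspective transform before erasing, and the erased patch is warped back. A sketch with OpenCV follows; the 128×640 target size matches the network input described earlier, while `erase_net` is a hypothetical stand-in for the trained network.

```python
import cv2
import numpy as np

def erase_quad(image, quad, erase_net, out_h=128, out_w=640):
    """Rectify a quadrilateral text region, erase it, and warp the result back.

    `quad` is a 4x2 float32 array of corners (top-left, top-right,
    bottom-right, bottom-left); `erase_net` maps a rectified crop to a
    text-erased crop of the same size.
    """
    src = np.asarray(quad, dtype=np.float32)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(src, dst)

    crop = cv2.warpPerspective(image, M, (out_w, out_h))
    erased = erase_net(crop)

    # Warp the erased patch back and paste it only inside the original quad.
    back = cv2.warpPerspective(erased, M, (image.shape[1], image.shape[0]),
                               flags=cv2.WARP_INVERSE_MAP)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, src.astype(np.int32), 255)
    out = image.copy()
    out[mask > 0] = back[mask > 0]
    return out
```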

We compared our proposed method with previous scene text erasing methods on the SCUT-EnsText dataset. The results are shown in Table VI and imply that the results of first detecting and then inpainting via our method are significantly superior to those of existing state-of-the-art one-step methods in both the qualitative and quantitative evaluations. Here, R denotes the recall, i.e., the detection result of CRAFT [29] under the ICDAR2015 evaluation [58] protocol. Some qualitative results of our method on the SCUT-EnsText dataset are shown in Fig. 9. Our method can clearly remove the text region, regardless of whether the text instances have varying shapes, fonts, or illumination conditions. Nonetheless, compared with the inference speed of one-step end-to-end scene text erasing methods, that of our two-step cropped-image-based pipeline is relatively slow, as shown in Table VI.

E. Discussion

From the experimental results, we observed that the model trained using our improved synthetic text dataset has a different inpainting logic from that of the annotations of the real-world scene-text removal dataset, which were manually edited using Photoshop. As shown in Fig. 10, our method can erase text while retaining more detailed background information of the image. Moreover, benefiting from the model design and a large amount of training data, our method can reconstruct some background textures for better visual perception. However, owing to the limited text styles of the synthetic text engine, our method fails to erase text oriented in special shapes or stereoscopic text under complicated illumination conditions, as depicted in Fig. 11. Because we propose erasing word-level text in cropped images, our method also encounters some difficulty with small text instances. As there are always JPEG boundary artifacts surrounding the text edge, our model cannot reliably discriminate whether a pixel near the text edge belongs to the background or to the artifact; borrowing features from artifact regions results in the text region being inpainted with strange colors, yielding poor inpainting results.

V. CONCLUSION

In this study, we propose a novel scene text erasing method that addresses the weak text localization problem of one-step methods and the domain shift problem of pretrained inpainting models trained on street view or Places datasets. To this end, our model is trained only on our improved synthetic text dataset. It inpaints the text region based on a predicted text stroke mask on cropped text images, whereby more background information can be preserved. Through the utilization of a stroke mask prediction module, partial convolution layers, an attention block in the background inpainting module, and a skip connection between the two modules, our method can reasonably erase scene text with texture restoration. Using a pretrained scene text detector to provide text location information, our model can function as an automatic scene text eraser to remove text in the wild. Overall, our experimental results show that our method achieves better performance than existing state-of-the-art methods.

REFERENCES

[1] T. Nakamura, A. Zhu, K. Yanai, and S. Uchida, "Scene Text Eraser," in Proc. Int. Conf. Doc. Anal. Recognit., vol. 1, 2017, pp. 832–837.
[2] O. Tursun, S. Denman, R. Zeng, S. Sivapalan, S. Sridharan, and C. Fookes, "MTRNet++: One-stage mask-based scene text eraser," Comput. Vis. Image Underst., vol. 201, p. 103066, 2020.
[3] S. Zhang, Y. Liu, L. Jin, Y. Huang, and S. Lai, "EnsNet: Ensconce Text in the Wild," Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 801–808, 2019.
[4] C. Liu, Y. Liu, L. Jin, S. Zhang, C. Luo, and Y. Wang, "EraseNet: End-to-End Text Removal in the Wild," IEEE Trans. Image Process., vol. 29, pp. 8760–8775, 2020.
[5] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic Data for Text Localisation in Natural Images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2315–2324.
[6] X. Bian, C. Wang, W. Quan, J. Ye, X. Zhang, and D.-M. Yan, "Scene text removal via cascaded text stroke detection and erasing," arXiv, 2020.
[7] O. Tursun, R. Zeng, S. Denman, S. Sivapalan, S. Sridharan, and C. Fookes, "MTRNet: A Generic Scene Text Eraser," in Proc. Int. Conf. Doc. Anal. Recognit., 2019, pp. 39–44.
[8] J. Zdenek and H. Nakayama, "Erasing Scene Text with Weak Supervision," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2020, pp. 2227–2235.
[9] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context Encoders: Feature Learning by Inpainting," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[11] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. I. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras, "ICDAR 2013 Robust Reading Competition," in Proc. 12th Int. Conf. Doc. Anal. Recognit., 2013, pp. 1484–1493.
[12] X. Chen and A. Yuille, "Detecting and reading text in natural scenes," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2, 2004, pp. 366–373.
[13] L. Neumann and J. Matas, "A method for text localization and recognition in real-world images," in Proc. Asian Conf. Comput. Vis., vol. 6494, 2011, pp. 770–783.
[14] A. Jamil, I. Siddiqi, F. Arif, and A. Raza, "Edge-Based Features for Localization of Artificial Urdu Text in Video Images," in Proc. Int. Conf. Doc. Anal. Recognit., 2011, pp. 1120–1124.
[15] A. Mosleh, N. Bouguila, and A. B. Hamza, "Automatic inpainting scheme for video text detection and removal," IEEE Trans. Image Process., vol. 22, no. 11, pp. 4460–4472, 2013.
[16] W. Huang, Z. Lin, J. Yang, and J. Wang, "Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1241–1248.
[17] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "TextBoxes: A fast text detector with a single deep neural network," in Proc. AAAI Conf. Artif. Intell., 2017, pp. 4161–4167.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., vol. 9905, 2016, pp. 21–37.
[19] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," in Proc. Eur. Conf. Comput. Vis., vol. 9912, 2016, pp. 56–72.
[20] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[21] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, "Arbitrary-Oriented Scene Text Detection via Rotation Proposals," IEEE Trans. Multimed., vol. 20, no. 11, pp. 3111–3122, 2018.
[22] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, "EAST: An Efficient and Accurate Scene Text Detector," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2642–2651.
[23] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding, "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10544–10553.
[24] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, "Multi-oriented Text Detection with Fully Convolutional Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4159–4167.
[25] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, "Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes," in Proc. Eur. Conf. Comput. Vis., vol. 11218, 2018, pp. 71–88.

[26] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[27] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, "TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes," in Proc. Eur. Conf. Comput. Vis., vol. 11206, 2018, pp. 19–35.
[28] Y. Li, Z. Wu, S. Zhao, X. Wu, Y. Kuang, Y. Yan, S. Ge, K. Wang, W. Fan, X. Chen, and Y. Wang, "PSENet: Psoriasis Severity Evaluation Network," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 01, 2020, pp. 800–807.
[29] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character Region Awareness for Text Detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9357–9366.
[30] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proc. 28th Annu. Conf. Comput. Graph. Interact. Tech. - SIGGRAPH '01, 2001, pp. 341–346.
[31] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, "PatchMatch," ACM Trans. Graph., vol. 28, no. 3, pp. 1–11, 2009.
[32] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen, "Image melding: Combining inconsistent images using patch-based synthesis," ACM Trans. Graph., vol. 31, no. 4, pp. 1–10, 2012.
[33] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. 27th Annu. Conf. Comput. Graph. Interact. Tech. - SIGGRAPH '00, 2000, pp. 417–424.
[34] M. M. Oliveira, B. Bowen, R. McKenna, and Y.-S. Chang, "Fast Digital Image Inpainting," in Int. Conf. Vis. Imaging Image Process., 2001, pp. 261–266.
[35] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graph., vol. 36, no. 4, pp. 1–14, 2017.
[36] G. Liu, F. A. Reda, K. J. Shih, T. C. Wang, A. Tao, and B. Catanzaro, "Image Inpainting for Irregular Holes Using Partial Convolutions," in Proc. Eur. Conf. Comput. Vis., vol. 11215, 2018, pp. 89–105.
[37] C. Xie, S. Liu, C. Li, M.-M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding, "Image Inpainting With Learnable Bidirectional Attention Maps," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 8857–8866.
[38] J. Li, N. Wang, L. Zhang, B. Du, and D. Tao, "Recurrent Feature Reasoning for Image Inpainting," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7757–7765.
[39] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative Image Inpainting with Contextual Attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5505–5514.
[40] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, "Free-Form Image Inpainting With Gated Convolution," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 4470–4479.
[41] Z. Yi, Q. Tang, S. Azizi, D. Jang, and Z. Xu, "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7505–7514.
[42] C. W. Lee, K. Jung, and H. J. Kim, "Automatic text detection and removal in video sequences," Pattern Recognit. Lett., vol. 24, no. 15, pp. 2607–2623, 2003.
[43] E. A. Pnevmatikakis and P. Maragos, "An inpainting system for automatic image structure - texture restoration with text removal," in Proc. 15th IEEE Int. Conf. Image Process., 2008, pp. 2616–2619.
[44] A. Mosleh, N. Bouguila, and A. B. Hamza, "Image Text Detection Using a Bandlet-Based Edge Detector and Stroke Width Transform," in Proc. Br. Mach. Vis. Conf., 2012, pp. 63.1–63.12.
[45] M. Khodadadi and A. Behrad, "Text localization, extraction and inpainting in color images," in Proc. 20th Iran. Conf. Electr. Eng., 2012, pp. 1035–1040.
[46] P. D. Wagh and D. R. Patil, "Text detection and removal from image using inpainting with smoothing," Proc. Int. Conf. Pervasive Comput., 2015.
[47] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5967–5976.
[48] S. Qin, J. Wei, and R. Manduchi, "Automatic semantic content removal by learning to neglect," in Proc. Br. Mach. Vis. Conf., 2018.
[49] C. Zheng, T.-J. Cham, and J. Cai, "Pluralistic Image Completion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1438–1447.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[51] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation," in Proc. Int. Conf. 3D Vis., 2016, pp. 565–571.
[52] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Work., 2019, pp. 1971–1980.
[53] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. Eur. Conf. Comput. Vis., vol. 9906, 2016, pp. 694–711.
[54] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image Style Transfer Using Convolutional Neural Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2414–2423.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Represent., 2015.
[56] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent., 2015.
[57] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, W. Khlif, M. M. Luqman, J.-C. Burie, C.-L. Liu, and J.-M. Ogier, "ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., vol. 1, 2017, pp. 1454–1459.
[58] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, "ICDAR 2015 competition on Robust Reading," in Proc. 13th Int. Conf. Doc. Anal. Recognit., 2015, pp. 1156–1160.
[59] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images," arXiv, 2016.
[60] K. Wang and S. Belongie, "Word spotting in the wild," in Proc. Eur. Conf. Comput. Vis., vol. 6311, 2010, pp. 591–604.
[61] N. Nayef, C.-L. Liu, J.-M. Ogier, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, and J.-C. Burie, "ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition — RRC-MLT-2019," in Proc. Int. Conf. Doc. Anal. Recognit., 2019, pp. 1582–1587.
[62] C. K. Chng, E. Ding, J. Liu, D. Karatzas, C. S. Chan, L. Jin, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, and J. Han, "ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT," in Proc. Int. Conf. Doc. Anal. Recognit., 2019, pp. 1571–1576.
[63] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[64] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, "What makes Paris look like Paris?" ACM Trans. Graph., vol. 31, no. 4, pp. 1–9, 2012.
[65] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 Million Image Database for Scene Recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1452–1464, 2018.
[66] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, "Real-Time Scene Text Detection with Differentiable Binarization," Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07, pp. 11474–11481, 2020.

Zhengmi Tang received his B.E. degree from Xidian University, Shaanxi, China, in 2017 and his M.E. degree in cybernetics engineering from Hiroshima University, Hiroshima, Japan, in 2020. He is currently pursuing a Ph.D. degree in communication engineering in the IIC-Lab at Tohoku University, Japan. His current research interests include computer vision, scene-text detection, and data synthesis.

Tomo Miyazaki (Member, IEEE) received his B.E. and Ph.D. degrees from Yamagata University (2006) and Tohoku University (2011), respectively. From 2011 to 2012, he worked on the geographic information system at Hitachi, Ltd. From 2013 to 2014, he worked at Tohoku University as a postdoctoral researcher. Since 2015, he has been an Assistant Professor at the university. His research interests include pattern recognition and image processing.

Yoshihiro Sugaya (Member, IEEE) received his B.E., M.E., and Ph.D. degrees from Tohoku University, Sendai, Japan, in 1995, 1997, and 2002, respectively. He is currently an Associate Professor at the Graduate School of Engineering, Tohoku University. His research interests include computer vision, pattern recognition, image processing, parallel processing, and distributed computing. Dr. Sugaya is a member of the Institute of Electronics, Information and Communication Engineers (IEICE) and the Information Processing Society of Japan.

Shinichiro Omachi (M'96-SM'11) received his B.E., M.E., and Ph.D. degrees in Information Engineering from Tohoku University, Japan, in 1988, 1990, and 1993, respectively. He worked as an Assistant Professor at the Education Center for Information Processing at Tohoku University from 1993 to 1996. Since 1996, he has been affiliated with the Graduate School of Engineering at Tohoku University, where he is currently a Professor. From 2000 to 2001, he was a visiting Associate Professor at Brown University. His research interests include pattern recognition, computer vision, image processing, image coding, and parallel processing. He served as the Editor-in-Chief of IEICE Transactions on Information and Systems from 2013 to 2015. Dr. Omachi is a member of the Institute of Electronics, Information and Communication Engineers and the Information Processing Society of Japan, among others. He received the IAPR/ICDAR Best Paper Award in 2007, the Best Paper Method Award of the 33rd Annual Conference of the GfKl in 2010, the ICFHR Best Paper Award in 2010, and the IEICE Best Paper Award in 2012. He is currently the Vice Chair of the IEEE Sendai Section.
