Stroke-Based Scene Text Erasing Using Synthetic Data For Training
of pairwise real-world data, we make full use of synthetic texts. The appearance of the synthetic texts [5] is enhanced, and we train our model only on the dataset generated by the improved synthetic text engine. The model can partially erase text instances in a scene image if text bounding boxes are provided, and it can also work with existing scene text detectors for fully automatic scene text erasing. Examples of the text-erasing results obtained by the proposed method are shown in Fig. 1.

The main contributions of our study can be summarized as follows:

• We propose a practical text-erasing method and a stroke-based text-erasing network that operates on cropped text images instead of the entire image, which makes the prediction of the pixel-level text-stroke mask more accurate and stable. Benefiting from our erasing pipeline and network structure, our method erases text instances while retaining more background information and restoring the background texture.
• We enhance the text synthesis engine [5] so that the appearance of the synthetic text instances shares more similarity with real-world data.
• Quantitative and qualitative evaluations on the SCUT-Syn [3], ICDAR 2013 [11], and SCUT-EnsText [4] datasets show that our method outperforms previous state-of-the-art methods, even though it is trained only on the dataset generated by the improved synthetic text engine [5].

The remainder of this paper is organized as follows. Section II reviews related work on scene text detection, image inpainting, and text erasing. Section III introduces the details of our method, including the pipeline and the proposed networks. In Section IV, our proposed method is evaluated and compared with related inpainting and scene-text-erasing studies through experiments. Finally, we provide concluding statements in Section V.

II. RELATED WORK

A. Scene Text Detection

The emergence of deep learning has facilitated the development of scene text detection research and shows promising performance compared with traditional manually designed algorithms [12]–[16]. Recent learning-based scene text detection methods can be roughly categorized into regression-based and segmentation-based methods. Regression-based methods aim to predict the bounding boxes of text instances directly. TextBoxes [17] adjusts the aspect ratios of the anchors in SSD [18] to detect text with different shapes. CTPN [19] combines the framework of Faster R-CNN [20] with a recurrence mechanism to predict the contextual and dense components of text. RRPN [21] proposes a rotation region proposal to bind multi-oriented scene text with rotated rectangles. EAST [22] directly regresses the rotated rectangles or quadrangles of text through a simplified pipeline without using any anchors. LOMO [23] detects long and arbitrarily shaped text in scene images by iteratively refining the preliminary proposals and considering text geometry properties, including the text region, text center line, and border offsets.

Segmentation-based methods usually first extract text pixels from a segmentation map and then obtain the bounding boxes of the text by post-processing. Zhang et al. [24] used an FCN and MSER for pixel-level multi-oriented text detection. Mask TextSpotter [25] built a model on the framework of Mask R-CNN [26] and performed character-level instance segmentation for each alphabet. TextSnake [27] proposed a novel representation of arbitrarily shaped text and predicted heat maps of text center lines, text regions, radii, and orientations to extract text regions. PSENet [28] gradually expands the text region from small kernels to large ones to make final predictions through multiple semantic segmentation maps. Liao et al. [66] proposed a differentiable binarization (DB) module to perform the binarization process inside a segmentation network. CRAFT [29] learns each character center and the affinity between characters in the form of heat maps.

B. Image Inpainting

Image inpainting fills the hole regions of an image with plausible content. Image inpainting research can generally be divided into two categories: non-learning and deep-learning-based approaches. Non-learning approaches transfer the surrounding content to the hole region based on low-level features using traditional algorithms such as patch matching [30]–[32] and diffusion [33], [34]. Although these methods work well on small holes, they cannot deal with large missing regions, where semantically suitable patches should be selected for hole restoration based on a high-level understanding of the whole image.

The generative adversarial network (GAN) strategy has been widely adopted in recent deep-learning-based approaches, which can be summarized as single-stage inpainting and progressive inpainting. Context encoders [9] first trained an image-inpainting deep neural network using an encoder–decoder structure and adversarial losses. Iizuka et al. [35] adopted dilated convolution and proposed global and local discriminators for adversarial training. Liu et al. [36] defined a partial convolutional layer with a mask-update mechanism to ensure that the partial convolution filters learn more valid information from the non-hole region and can robustly handle holes of any shape. Xie et al. [37] proposed a learnable bidirectional attention module that includes forward and reverse attention maps for more effective hole filling. Li et al. [38] exploited the correlation between adjacent pixels by recurrently inferring and gathering the hole boundary for the encoder, named the recurrent feature reasoning module.

To generate more realistic textures, a coarse-to-fine strategy is adopted in progressive inpainting methods. Yu et al. [39] exploited textural similarities to borrow feature information from known background patches to generate the missing patches and designed a two-stage, coarse-to-fine network architecture. Yu et al. [40] introduced gated convolution, which further generalizes partial convolution by making the mask-update mechanism learnable, and combined it with an SN-PatchGAN discriminator to obtain better performance. Yi et al. [41] improved gated convolution through a lightweight design and proposed high-frequency residuals to generate rich and detailed textures for efficient ultra-high-resolution image inpainting.
Fig. 2. Pipeline of our proposed method. Our text-erasing network takes cropped text images as inputs and outputs the corresponding text-erased images. Then, by placing these text-erased images back at their original locations in the input image, we obtain the final text-erased image.

Fig. 3. Structure of our proposed network. It is composed of a stroke mask prediction module (SMPM) (top) and a background inpainting module (BIPM) (bottom). SMPM first predicts the stroke mask of the text, and BIPM inpaints the stroke pixels with the proper content to erase the text.

C. Text Erasing

Early text-erasing research started with the removal of born-digital text, such as watermarks, captions, and subtitles, in images or video sequences [42]. Owing to their plain layout, color, and regular fonts, born-digital texts can be detected by traditional feature-engineering approaches, such as binarization [43], [44], and inpainted by patch matching [45] or smoothing algorithms [46].

Erasing text in the wild is a more complex and challenging task owing to the variety of fonts, layouts, and illumination conditions. The recent rapid development of deep neural networks has made scene text erasing a promising research task in the computer vision field, and it can be classified into two categories: one-step and two-step methods. One-step methods use an end-to-end model to directly output the text-erased image without the aid of text location information. Pix2Pix [47] is a conditional generative adversarial network (cGAN) designed for general-purpose image-to-image translation tasks, which can be applied to text removal. Nakamura et al. [1] made the first attempt to build a one-stage scene text eraser, a sliding-window method using a skip-connected auto-encoder. This method destroys the integrity of text strokes but cannot maintain the global consistency of the entire image. EnsNet [3] adopted a cGAN with a refined loss function and a local-aware discriminator to erase text at the entire-image level. Liu et al. [4] provided a comprehensive real-world scene-text removal benchmark, named SCUT-EnsText, and proposed EraseNet, which adopts a coarse-to-fine erasure network structure with a segmentation head that generates a mask of the text region to help with text region localization. MTRNet++ [2] shares the same coarse-to-fine inpainting idea but uses a multi-branch generator, whose mask-refine branch predicts stroke-level text masks to guide text removal.

Two-step methods remove the text with awareness of the text location, which can be provided by users or by a pretrained text detector. Qin et al. [48] proposed a cGAN with one
and feed the concatenated feature maps into a self-attention network that learns both the correspondences between feature maps and non-local features. Consequently, the features inside and outside the text regions are split and weighted by a self-attention block for later decoding. Here, we adopt a global context block (GCblock) [52] as the self-attention block.

where M is the hole region in Î_mask.

Finally, the loss function of the entire network is expressed as follows:

L = L_SMPM + L_pixel + λ1 L_per + λ2 L_style + λ3 L_tv,   (10)

where λ1, λ2, and λ3 are set to 0.05, 100, and 0.1, respectively, largely following Liu et al. [36]; we slightly adjusted the weight of the style loss according to our own training-loss curve.
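The weighted combination in Eq. (10) maps directly to a few lines of training code. The following is a minimal PyTorch-style sketch, assuming the individual loss terms (mask loss, pixel loss, perceptual loss, style loss, and total-variation loss) have already been computed by their respective modules; the function and variable names are illustrative and not taken from the actual implementation.

```python
import torch

# Loss weights from Eq. (10): lambda1 (perceptual), lambda2 (style), lambda3 (TV).
LAMBDA_PER, LAMBDA_STYLE, LAMBDA_TV = 0.05, 100.0, 0.1

def total_loss(l_smpm: torch.Tensor, l_pixel: torch.Tensor, l_per: torch.Tensor,
               l_style: torch.Tensor, l_tv: torch.Tensor) -> torch.Tensor:
    """Combine the individual scalar loss terms into the final objective L of Eq. (10)."""
    return (l_smpm
            + l_pixel
            + LAMBDA_PER * l_per
            + LAMBDA_STYLE * l_style
            + LAMBDA_TV * l_tv)
```

During training, the returned scalar would simply be back-propagated through both modules of the network.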
Fig. 5. Some training image samples generated by our enhanced synthesis text engine. Input (top), final ground truth (middle), and text mask ground truth (bottom).

IV. EXPERIMENT

A. Implementation Details

Our implementation is based on PyTorch. In the training process, we generated one million synthetic text images and corresponding text mask images as training data from background images that did not contain text. The input size of our network was 128×640: the height of each training image was resized to 128 while maintaining the aspect ratio. If the resulting width was less than 640, the remaining pixels on the right side of the image were padded with 0; otherwise, the width was resized to 640. The training batch size was 8 on a single 1080Ti GPU. We used Adam [56] to optimize the entire network with β = (0.9, 0.999) and a weight decay of 0. The learning rate started from 0.0002 and decayed to nine-tenths of its value after each epoch. The network was trained in an end-to-end manner, and we followed the fine-tuning strategy of [36], which freezes the batch-normalization parameters in the encoder of the background inpainting module after approximately 10 epochs.
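As a concrete illustration of the resize-and-pad step and the optimizer schedule described above, a minimal PyTorch sketch is given below; the helper name and the use of bilinear interpolation are our assumptions, not details taken from the implementation.

```python
import torch
import torch.nn.functional as F

def resize_and_pad(img: torch.Tensor, target_h: int = 128, target_w: int = 640) -> torch.Tensor:
    """Resize a (1, C, H, W) crop to height 128 while keeping the aspect ratio,
    then right-pad with zeros (or shrink) to width 640."""
    _, _, h, w = img.shape
    new_w = max(1, round(w * target_h / h))
    img = F.interpolate(img, size=(target_h, min(new_w, target_w)),
                        mode='bilinear', align_corners=False)
    if img.shape[-1] < target_w:                      # pad the missing columns with 0
        img = F.pad(img, (0, target_w - img.shape[-1], 0, 0))
    return img

# Optimizer and learning-rate schedule as described above ('model' is assumed):
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999), weight_decay=0)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # x0.9 per epoch
```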
In the inference process, we first expanded and cropped the text bounding box to include more background information. Subsequently, we fed the cropped text image to our proposed network for prediction. Finally, the part of the output inside the original bounding box was pasted back into the source image. Text with arbitrary quadrilateral annotations was rectified by a perspective transformation, and text with curved annotations was rectified by a thin-plate-spline transformation into rectangular text images before being fed into our network. The network output was transformed back to its original shape and copied to its original position to obtain the final text-erased image.
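The quadrilateral branch of this inference pipeline can be sketched with OpenCV as follows. This is only an illustrative sketch under our own assumptions: `erase_net` stands for the trained eraser, the bounding-box expansion and the thin-plate-spline branch for curved text are omitted, and all names are hypothetical.

```python
import cv2
import numpy as np

def erase_quad(image, quad, erase_net, out_w=640, out_h=128):
    """Erase one quadrilateral text instance from an HxWx3 uint8 image.
    quad: 4x2 float32 corners (top-left, top-right, bottom-right, bottom-left).
    erase_net: callable mapping a rectified crop to its text-erased version."""
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    M = cv2.getPerspectiveTransform(np.float32(quad), dst)
    crop = cv2.warpPerspective(image, M, (out_w, out_h))

    erased = erase_net(crop)  # forward pass of the text-erasing network

    # Warp the erased crop back to the image plane and paste only the pixels
    # that lie inside the original quadrilateral annotation.
    restored = cv2.warpPerspective(erased, M, (image.shape[1], image.shape[0]),
                                   flags=cv2.WARP_INVERSE_MAP)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [np.int32(quad)], 255)
    out = image.copy()
    out[mask == 255] = restored[mask == 255]
    return out
```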
B. Dataset and Evaluation Metrics

1) Synthetic Dataset:

• Improved Synth-text images: We used over 1,500 English and Chinese fonts and 10,000 text-free background images to generate a total of one million images for model training using our enhanced synthesis text engine, which is improved from the Synth-text technology [5]. The training dataset contains the original background images as the final ground truth, the synthetic text rendered on the background images as the input, and the mask images of the synthetic text as the ground truth of the text mask. Compared with the vanilla Synth-text method, we made several improvements so that our generated data share more similarity with real-world data (two of them are sketched in code after this list). 1) With a probability of 50%, the text instance is composed directly onto the background instead of being blended in with Poisson blending, which would otherwise leave some background information visible behind the text instance. 2) Additional effects were applied to the text instances, such as Gaussian blur to simulate out-of-focus text, text shift to simulate text with a 3D structure, and richer shadow parameters to make the text shadows more realistic. 3) We compressed and saved the images in JPEG format with different compression qualities to handle images of varying quality. 4) A dilated mask of each text instance, including all of its effects, was generated in the mask image to reduce the influence of JPEG artifacts around the text edges. Some samples of the generated images are shown in Fig. 5.
• SCUT-Syn [3] was created with the Synth-text engine [5] and contains 8,000 images for training and 800 images for testing. The background images of this dataset were collected from ICDAR 2013 [11] and ICDAR MLT-2017 [57], and the text instances in the background images were manually removed. Most test images come from the training set, and the training and testing sets were generated from the same background images, although the synthesized text instances differ. We evaluated our method using only the test images.
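To make improvements 3) and 4) above concrete, the following OpenCV sketch shows one possible way to re-encode a synthetic sample at a random JPEG quality and to dilate its text mask; the quality range and kernel size are illustrative assumptions rather than the values used in our engine.

```python
import cv2
import numpy as np

def random_jpeg_compress(img, quality_range=(40, 95)):
    """Re-encode a synthetic image as JPEG with a random quality so that the
    training data covers images of varying compression quality."""
    q = int(np.random.randint(quality_range[0], quality_range[1] + 1))
    ok, buf = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), q])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def dilate_text_mask(mask, kernel_size=3, iterations=1):
    """Dilate the binary text-stroke mask so that it also covers the JPEG
    artifacts that appear around the text edges."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=iterations)
```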
2) Real-world Dataset:

• ICDAR 2013 [11] is a widely used scene text image dataset that includes 229 training images and 223 testing images. All text instances are in English and are well focused. In this study, only the test set was used for the evaluation.
• SCUT-EnsText [4] is a comprehensive and challenging scene text removal dataset containing 2,749 training images and 813 testing images, which are collected from ICDAR 2013 [11], ICDAR 2015 [58], MS COCO-Text [59], SVT [60], MLT-2019 [61], and ArT [62]. The text instances of this dataset are in Chinese or English and have diverse shapes, such as horizontal text, arbitrary quadrilateral text, and curved text. All text instances were carefully erased by annotators with good visual quality. By providing both the original text annotations and the text-erased ground truth, this dataset can be used comprehensively for qualitative and quantitative evaluations. In this study, we used its test set to evaluate the performance of our method.

3) Evaluation Metrics:

• Quantitative evaluation: To quantify the text-erasing ability of a model, we followed [2]–[4], [7], [8] and used a baseline scene text detection model to detect the text remaining in the text-erased images, evaluating how low the recall of the detection results is. A lower recall indicates that less text is detected, i.e., more text has been erased by the model. For a fair comparison with previous studies, the EAST scene text detector [22] and the ICDAR 2013 evaluation protocol were used on the ICDAR 2013 dataset [11], while the CRAFT text detector [29] and the ICDAR 2015 [58] protocol were adopted for the evaluation on SCUT-EnsText [4].
• Qualitative evaluation: Following previous image inpainting works, we report metrics including the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) [63], and the mean squared error (MSE). Higher PSNR and SSIM values and lower MSE values indicate better image restoration quality. These evaluations were conducted on both the SCUT-Syn [3] and SCUT-EnsText [4] datasets; a minimal sketch of the PSNR and MSE computation is given after this list.
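For reference, the two pixel-level metrics can be computed as follows for 8-bit images; this is a minimal NumPy sketch under the assumption that images are normalized to [0, 1] before comparison, and SSIM would typically be taken from an off-the-shelf implementation of [63].

```python
import numpy as np

def mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error between two 8-bit images, computed on values in [0, 1]."""
    pred = pred.astype(np.float64) / 255.0
    gt = gt.astype(np.float64) / 255.0
    return float(np.mean((pred - gt) ** 2))

def psnr(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-12) -> float:
    """Peak signal-to-noise ratio in dB (peak value 1.0 after normalization)."""
    return float(10.0 * np.log10(1.0 / (mse(pred, gt) + eps)))
```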
Fig. 6. Visual quality results of the ablation study on the SCUT-EnsText dataset. From left to right: input images, output of BIPM, output of SMPM+BIPM, output of SMPM+BIPM+SC, output of all (w/o PConv), and output of all. SMPM: stroke mask prediction module. BIPM: background inpainting module. PConv: partial convolutions. SC: skip connection between the two modules. SA: self-attention block. All: SMPM + BIPM + SC + SA.

TABLE III
ABLATION STUDY AND QUALITATIVE COMPARISON BETWEEN DIFFERENT CONFIGURATIONS OF OUR PROPOSED NETWORK ON THE SCUT-SYN AND SCUT-ENSTEXT DATASETS. SMPM: STROKE MASK PREDICTION MODULE. BIPM: BACKGROUND INPAINTING MODULE. PCONV: PARTIAL CONVOLUTIONS. SC: SKIP CONNECTION BETWEEN TWO MODULES. SA: SELF-ATTENTION BLOCK.

Method                              |        SCUT-Syn           |      SCUT-EnsText
                                    | PSNR↑  SSIM(%)↑  MSE↓     | PSNR↑  SSIM(%)↑  MSE↓
BIPM (w/o PConv)                    | 34.78  96.67     0.00061  | 34.62  95.79     0.00114
SMPM + BIPM                         | 37.74  97.31     0.00034  | 36.11  96.31     0.00077
SMPM + BIPM + SC                    | 38.29  97.48     0.00030  | 36.36  96.39     0.00070
SMPM + BIPM (w/o PConv) + SC + SA   | 38.70  97.67     0.00025  | 36.49  96.39     0.00073
SMPM + BIPM + SC + SA (all)         | 38.60  97.55     0.00024  | 37.08  96.54     0.00054

Fig. 7. Qualitative results of our method on the SCUT-Syn dataset. From left to right: input image and text bounding boxes, ground truth, output of our method, and predicted text mask.

TABLE IV
COMPARISON BETWEEN PREVIOUS SCENE TEXT-ERASING STUDIES AND OUR PROPOSED METHOD ON THE SCUT-SYN AND ICDAR2013 DATASETS.

Method                |      SCUT-Syn             | ICDAR2013 | Parameters | Inference speed | Input
                      | PSNR↑  SSIM(%)↑  MSE↓     | R↓        |            |                 |
Original images       | -      -         -        | 70.83     | -          | -               | -
SceneTextEraser [1]   | 14.68  46.13     0.7148   | 10.08     | -          | -               | Image
Pix2Pix [47]          | 25.60  89.86     0.2465   | 10.19     | 54.4M      | 17ms            | Image
EnsNet [3]            | 37.36  96.44     0.0021   | 5.66      | 12.4M      | 24ms            | Image
MTRNet [7]            | 29.71  94.43     0.0001   | 0.18      | 54.4M      | -               | Image(256×256) + Text Mask
Weak Supervision [8]  | 37.44  93.69     -        | 2.47      | 28.7+6.0M  | 57+39ms         | Image(256×256)
MTRNet++ [2]          | 34.55  98.45     0.0004   | -         | 18.7M      | 37ms            | Image(256×256)
EraseNet [4]          | 38.32  97.67     0.0002   | -         | 19.7M      | 34ms            | Image
EAST [22] + Ours      | 31.18  95.93     0.002    | 0.73      | 24.1+9.9M  | 18+23∼ms        | Image
Ours                  | 38.60  97.55     0.0002   | 0         | 9.9M       | 23∼ms           | Image + BBox

Fig. 8. Visual qualitative comparison between our method and state-of-the-art image inpainting methods on the SCUT-EnsText dataset. From left to right: input of our method, input of inpainting methods, output of our method, output of LBAM, output of RFR-Net, and output of HiFill.

TABLE V
COMPARISON BETWEEN STATE-OF-THE-ART INPAINTING METHODS AND OUR PROPOSED METHOD ON THE SCUT-ENSTEXT DATASET.

Method        |     SCUT-EnsText          | Parameters | Inference speed | Input                       | Training dataset
              | PSNR↑  SSIM(%)↑  MSE↓     |            |                 |                             |
LBAM [37]     | 36.21  95.58     0.0007   | 68.3M      | 11ms            | Image(256×256) + Text Mask  | Paris Street View [64]
RFR-Net [38]  | 36.95  96.12     0.0006   | 31.2M      | 90ms            | Image(256×256) + Text Mask  | Paris Street View [64]
HiFill [41]   | 31.48  94.17     0.0021   | 2.7M       | 28ms            | Image + Text Mask           | Places2 [65]
Ours          | 37.89  97.02     0.0004   | 9.9M       | 23∼ms           | Image(256×256) + BBox       | Improved Synth-text
Ours          | 37.08  96.54     0.0005   | 9.9M       | 23∼ms           | Image + BBox                | Improved Synth-text

TABLE VI
COMPARISON BETWEEN STATE-OF-THE-ART SCENE TEXT-ERASING METHODS AND OUR PROPOSED METHOD ON THE SCUT-ENSTEXT DATASET.

C. Ablation Study

In this section, we investigate the effectiveness of different settings of our proposed model. The stroke mask prediction module (SMPM), the skip connection (SC) between the two modules, the self-attention block (SA), and the partial convolutions (PConv) are the focus of this study. The qualitative evaluation results on the SCUT-Syn and SCUT-EnsText datasets are presented in Table III, and some text-erasing samples are shown in Fig. 6.

• Stroke Mask Prediction Module: SMPM aims to provide pixel-level information about the text region as the hole for the background inpainting module (BIPM), so that the network can learn more from the valid features of the non-text region and suppress text residue. The qualitative results are presented in Table III. Using the text mask significantly improves the text-erasing performance. It should be noted that, without the mask image from the stroke mask prediction module, the partial convolutional layers in the background inpainting module function as normal convolutional layers.
• Skip Connection: The skip connection links and concatenates the low-resolution feature maps of the two modules to provide the features inside the text region for the decoder of the BIPM and to improve the accuracy and stability of the text mask prediction. Table III shows that the skip connection between the two modules improves the erasing quality on both the SCUT-Syn and SCUT-EnsText datasets.
• Self-Attention Block: The GCblock adds channel-wise weights to the input feature maps, taking into consideration the correspondences between feature maps and non-local features. To confirm the importance of the self-attention block, we trained our network without the SA block. From Table III, we can see that the performance of our network decreases when the self-attention block is missing.
• Partial Convolutions: To evaluate the advantages of the partial convolutional layers, we also re-implemented our method without these layers (a minimal partial-convolution layer is sketched after this list). Table III lists the qualitative performance on the SCUT-Syn and SCUT-EnsText datasets. We observed that, compared with the network without partial convolutional layers, the network with partial convolutional layers performs better on SCUT-EnsText but worse on SCUT-Syn. We believe that the discrepancy between synthetic and real-world data causes this difference in performance. As mentioned before, the reason for improving the Synth-text engine is that Poisson image editing retains some texture information of the background image when it blends the foreground text instances into a background image, whereas most scene text instances in real-world images are not transparent. For the erasure results on SCUT-Syn, compared with directly extracting features using normal convolutions, using partial convolution and F_m feature concatenation to split the features inside and outside the text region is a relatively inefficient approach for erasing such transparent text. However, when facing real-world data such as SCUT-EnsText, partial convolution achieves better performance than normal convolution.
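For readers unfamiliar with the mechanism discussed above, the following is a minimal PyTorch sketch of a partial convolution layer in the spirit of Liu et al. [36]; it is a simplified illustration (bias omitted, single-channel mask) and not the exact layer used in our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Simplified partial convolution: only valid (non-hole) pixels contribute,
    the output is renormalized by the number of valid pixels in each window,
    and the validity mask is updated for the next layer."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.register_buffer('ones', torch.ones(1, 1, kernel_size, kernel_size))
        self.window = float(kernel_size * kernel_size)

    def forward(self, x, mask):
        # x: (N, C, H, W) features; mask: (N, 1, H, W), 1 = valid, 0 = hole (text stroke).
        out = self.conv(x * mask)
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.conv.stride,
                             padding=self.conv.padding)
        out = out * (self.window / valid.clamp(min=1.0))  # renormalize by valid count
        new_mask = (valid > 0).float()                     # update the hole mask
        return out * new_mask, new_mask
```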
Fig. 9. Qualitative results of our method on SCUT-EnsText dataset. From left to right: input, ground truth, output of our method, and predicted text mask.

Fig. 10. Our method can retain more detailed background information and restore the background texture. From left to right: input image and text bounding boxes, ground truth, output of our method, and predicted text mask.
D. Comparison With State-of-the-Art Methods

To evaluate the performance of our proposed method, we compared it with recent state-of-the-art methods on the SCUT-Syn, ICDAR 2013, and SCUT-EnsText datasets. For the SCUT-Syn and ICDAR 2013 datasets, the results of SceneTextEraser, Pix2Pix, and EnsNet were implemented and reported by Zhang et al. [3]. The results of MTRNet [7], weak supervision [8], MTRNet++ [2], and EraseNet [4] were collected from the official reports in the respective papers. Unless stated otherwise, the resolution of the input images is 512×512. Table IV shows the results on the SCUT-Syn [3] and ICDAR 2013 [11] datasets. Our proposed method achieves the highest PSNR on the SCUT-Syn dataset and the lowest recall on the ICDAR 2013 dataset when the bounding boxes are provided. Some text-erasing examples on the SCUT-Syn dataset are shown in Fig. 7. Our method achieves a lower recall on the ICDAR 2013 dataset when the detection results of EAST [22] are used to provide the text locations. Here, R represents recall, that is, the detection result of EAST under the ICDAR 2013 evaluation protocol. We consider that the reason why MTRNet++, EraseNet, and EnsNet can generate higher-SSIM images on the SCUT-Syn dataset is that most test images are included in the training set and share the same background images with the training images from which they were generated. EAST and our model were not trained on the SCUT-Syn training set, which leads to the lower PSNR and SSIM of our results. We believe that this dataset cannot fully reflect the generalization ability of a network when it is used for training. Our text-erasing network is lightweight, with only 9.9 million trainable parameters. For a fair comparison of the inference speed, we tested all methods on a single 1080Ti GPU and an AMD Ryzen 7 3700X @ 3.6 GHz CPU with the original input size of each network. The inference time of our method consists of the time cost of the network forward pass, pre-processing, and post-processing. The time cost of pre-processing and post-processing is approximately 4 ms when using the perspective transformation and 76 ms when using the off-the-shelf thin-plate-spline function in OpenCV.

Fig. 11. Some failure cases of our method. From left to right: input image and text bounding boxes, ground truth, output of our method, and predicted text mask.

For the SCUT-EnsText dataset, we compared our method with state-of-the-art image inpainting methods and scene-text-erasing methods. The comparison with the inpainting methods on the SCUT-EnsText dataset is shown in Table V. Our scene text erasing method achieves excellent image quality when the text bounding box information is provided. We made some revisions to the original text location annotations because we observed some mismatches between the locations of the erased text and the text bounding boxes of the ground truth. We selected three state-of-the-art image inpainting methods: LBAM [37], RFR-Net [38], and HiFill [41]. The LBAM [37] and RFR-Net [38] models were pretrained on the Paris Street View dataset [64], and the HiFill [41] model was pretrained on the Places2 dataset [65]. For a fair comparison with the pretrained inpainting methods, we generated the hole masks directly from our revised text location annotations and resized the input images to the same size, because some pretrained inpainting models only work at a resolution of 256×256. Our method generates images of higher quality than the state-of-the-art image inpainting methods under the same conditions. Some visual quality comparison samples are presented in Fig. 8. We observed that there are always certain unusual textures or artifacts in the text-erased images inpainted by the pretrained image inpainting methods, causing unnatural erasing results. We also found that the inpainting logic of HiFill [41] prefers to restore the hole region with more distant background information, leading to worse results than those obtained by the other two methods. Although this model is pretrained on the Places2 [65] dataset, which contains many indoor and urban views, it still shows a serious domain-shift problem when facing scene-text-erasing tasks.

In addition, we used our method with a scene text detector as a two-step automatic scene text eraser. In our experiment, we used DB [66] and CRAFT [29] for scene text detection and produced arbitrary quadrilateral bounding boxes as the input for our method. DB-ResNet-18 and DB-ResNet-50 were pretrained on the SynthText and ICDAR 2015 datasets with the box threshold set to 0.3. CRAFT was pretrained on the SynthText, ICDAR 2013, and MLT 2017 datasets, and the text threshold was set to 0.6. We compared our proposed methods with previous scene-text-erasing methods on the SCUT-EnsText dataset. The results are shown in Table VI and imply that results of first detection and then inpainting via our
[26] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[27] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, "TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes," in Proc. Eur. Conf. Comput. Vis., vol. 11206, 2018, pp. 19–35.
[28] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao, "Shape Robust Text Detection With Progressive Scale Expansion Network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9336–9345.
[29] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character Region Awareness for Text Detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9357–9366.
[30] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proc. 28th Annu. Conf. Comput. Graph. Interact. Tech. (SIGGRAPH), 2001, pp. 341–346.
[31] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, "PatchMatch," ACM Trans. Graph., vol. 28, no. 3, pp. 1–11, 2009.
[32] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen, "Image melding: Combining inconsistent images using patch-based synthesis," ACM Trans. Graph., vol. 31, no. 4, pp. 1–10, 2012.
[33] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. 27th Annu. Conf. Comput. Graph. Interact. Tech. (SIGGRAPH), 2000, pp. 417–424.
[34] M. M. Oliveira, B. Bowen, R. McKenna, and Y.-S. Chang, "Fast Digital Image Inpainting," in Proc. Int. Conf. Vis. Imaging Image Process., 2001, pp. 261–266.
[35] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graph., vol. 36, no. 4, pp. 1–14, 2017.
[36] G. Liu, F. A. Reda, K. J. Shih, T. C. Wang, A. Tao, and B. Catanzaro, "Image Inpainting for Irregular Holes Using Partial Convolutions," in Proc. Eur. Conf. Comput. Vis., vol. 11215, 2018, pp. 89–105.
[37] C. Xie, S. Liu, C. Li, M.-M. Cheng, W. Zuo, X. Liu, S. Wen, and E. Ding, "Image Inpainting With Learnable Bidirectional Attention Maps," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 8857–8866.
[38] J. Li, N. Wang, L. Zhang, B. Du, and D. Tao, "Recurrent Feature Reasoning for Image Inpainting," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7757–7765.
[39] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative Image Inpainting with Contextual Attention," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5505–5514.
[40] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, "Free-Form Image Inpainting With Gated Convolution," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 4470–4479.
[41] Z. Yi, Q. Tang, S. Azizi, D. Jang, and Z. Xu, "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7505–7514.
[42] C. W. Lee, K. Jung, and H. J. Kim, "Automatic text detection and removal in video sequences," Pattern Recognit. Lett., vol. 24, no. 15, pp. 2607–2623, 2003.
[43] E. A. Pnevmatikakis and P. Maragos, "An inpainting system for automatic image structure-texture restoration with text removal," in Proc. 15th IEEE Int. Conf. Image Process., 2008, pp. 2616–2619.
[44] A. Mosleh, N. Bouguila, and A. B. Hamza, "Image Text Detection Using a Bandlet-Based Edge Detector and Stroke Width Transform," in Proc. Br. Mach. Vis. Conf., 2012, pp. 63.1–63.12.
[45] M. Khodadadi and A. Behrad, "Text localization, extraction and inpainting in color images," in Proc. 20th Iran. Conf. Electr. Eng., 2012, pp. 1035–1040.
[46] P. D. Wagh and D. R. Patil, "Text detection and removal from image using inpainting with smoothing," in Proc. Int. Conf. Pervasive Comput., 2015.
[47] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5967–5976.
[48] S. Qin, J. Wei, and R. Manduchi, "Automatic semantic content removal by learning to neglect," in Proc. Br. Mach. Vis. Conf., 2018.
[49] C. Zheng, T.-J. Cham, and J. Cai, "Pluralistic Image Completion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1438–1447.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[51] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation," in Proc. Int. Conf. 3D Vis., 2016, pp. 565–571.
[52] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2019, pp. 1971–1980.
[53] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. Eur. Conf. Comput. Vis., vol. 9906, 2016, pp. 694–711.
[54] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image Style Transfer Using Convolutional Neural Networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2414–2423.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Represent., 2015.
[56] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent., 2015.
[57] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, W. Khlif, M. M. Luqman, J.-C. Burie, C.-L. Liu, and J.-M. Ogier, "ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., vol. 1, 2017, pp. 1454–1459.
[58] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, "ICDAR 2015 competition on Robust Reading," in Proc. 13th Int. Conf. Doc. Anal. Recognit., 2015, pp. 1156–1160.
[59] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images," arXiv preprint, 2016.
[60] K. Wang and S. Belongie, "Word spotting in the wild," in Proc. Eur. Conf. Comput. Vis., vol. 6311, 2010, pp. 591–604.
[61] N. Nayef, C.-L. Liu, J.-M. Ogier, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, and J.-C. Burie, "ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition — RRC-MLT-2019," in Proc. Int. Conf. Doc. Anal. Recognit., 2019, pp. 1582–1587.
[62] C. K. Chng, E. Ding, J. Liu, D. Karatzas, C. S. Chan, L. Jin, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, and J. Han, "ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT," in Proc. Int. Conf. Doc. Anal. Recognit., 2019, pp. 1571–1576.
[63] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[64] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, "What makes Paris look like Paris?" ACM Trans. Graph., vol. 31, no. 4, pp. 1–9, 2012.
[65] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 Million Image Database for Scene Recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1452–1464, 2018.
[66] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, "Real-Time Scene Text Detection with Differentiable Binarization," in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07, 2020, pp. 11474–11481.

Zhengmi Tang received his B.E. degree from Xidian University, Shaanxi, China, in 2017 and his M.E. degree in cybernetics engineering from Hiroshima University, Hiroshima, Japan, in 2020. He is currently pursuing a Ph.D. degree in communication engineering in the IIC-Lab at Tohoku University, Japan. His current research interests include computer vision, scene-text detection, and data synthesis.