Copyright: ©2024 The authors. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

https://doi.org/10.18280/ria.380112

Received: 31 October 2023
Revised: 25 November 2023
Accepted: 28 December 2023
Available online: 29 February 2024

Keywords:
image text detection, scene image, irregular, relation inference, graph convolution network

ABSTRACT

Text detection in natural scene images presents significant challenges, particularly in detecting irregular shapes. As a result of the limited receptive field of CNNs, existing methods have difficulty capturing long-range relationships between distant component regions. This study introduces an innovative method for identifying irregular text in images of natural scenes. The approach utilizes a U-Net architecture combined with connected component analysis, resulting in improved accuracy in detecting text components and reducing the identification of non-character text components. Additionally, our strategy incorporates the use of graph convolution networks (GCN) to deduce adjacency relations among text components. The integration of GCNs introduces a sophisticated mechanism for inferring adjacency relations, contributing significantly to the advancement of text detection in natural scene images. Our method's efficacy is showcased through experimental assessments on three publicly available datasets: ICDAR2013, CTW-1500, and MSRA-TD500.
model the contextual information necessary for accurate detection and localization of irregular text. Hence, in this work we aggregate both Transformer- and GNN-based methods for irregular text detection.

We have made three main contributions:
• Utilizing the U-Net architecture, we perform feature extraction, and character center point estimation is achieved through connected component analysis.
• We represent each text region as a node and employ Graph Neural Networks (GNNs) to build a local inference graph.
• The integration of the inference graph and the deep relational inference network enhances our ability to comprehend the relationships and interactions among character text components in a more holistic manner.

Here is the structure of the research. Section 2 examines various papers on text detection. Section 3 presents the proposed architecture for scene text detection. Section 4 entails an experimental analysis on datasets, along with the evaluation metrics for the proposed methodology. The paper concludes with a final section summarizing the findings.

2. RELATED WORK

Irregular text detection [3-6], which focuses on identifying text in non-standard or non-horizontal orientations, has gained significant attention in recent years. This challenging task has been tackled in various ways by researchers. Here are some notable methods:
(1) Stroke Width Transform (SWT): The SWT algorithm identifies text regions based on the variations in stroke width. It detects regions where the stroke width is relatively constant, which is indicative of text, and distinguishes them from non-text regions.
(2) Connected Component Analysis: This approach segments the image into connected components and analyzes their properties to identify irregular text. By considering attributes like aspect ratio, height, or geometric relationships between components, irregular text regions can be detected.
(3) Hough Transform: The Hough Transform is a widely used technique for detecting lines and shapes in images. By applying the Hough Transform specifically for text detection, researchers have successfully identified irregular text by detecting lines or curves that represent the text shape.
(4) Deep Learning Based Methods: The development of deep learning has produced CNNs that can detect irregular text [7-10]. These models are trained on annotated datasets to learn the complex patterns and characteristics of irregular text, enabling them to accurately detect and localize such text in images.
(5) Graph Neural Networks (GNN): GNNs have also been utilized for irregular text detection tasks [11]. By representing the image as a graph and leveraging the graph structure, GNNs can capture the relationships between text elements and effectively identify irregular text regions.
(6) Hybrid Approaches: Some methods combine multiple techniques to improve irregular text detection.

In particular, Transformers are neural network models that excel in capturing long-range dependencies and modeling contextual information. Transformer-based methods have emerged as powerful techniques for irregular text detection [12, 13], leveraging their ability to capture long-range dependencies and handle complex spatial relationships. These methods have shown promising results in accurately identifying and localizing irregular text regions in images. Here are some notable approaches:
(1) Mask TextSpotter: This method combines the Transformer architecture with a mask-based text detection framework. It first generates text proposals using region-based methods and then refines them using a Transformer-based network. By modeling the contextual information and capturing the relationships between text elements, Mask TextSpotter achieves precise irregular text detection.
(2) Border Detectors: Border detectors based on Transformers aim to detect the boundaries of irregular text regions. By using the self-attention mechanism of Transformers, irregular text regions of arbitrary shapes [14, 15] can be accurately localized by capturing both local and global context.
(3) TextPerceiver: TextPerceiver is a recent approach that combines the Perceiver model, a variant of the Transformer, with a segmentation head. It operates on the entire image and learns to attend to relevant text regions, allowing for effective irregular text detection. The model can adapt to various text shapes and orientations, making it suitable for challenging scenarios.
(4) TextFuseNet: CNNs and Transformers are both used in TextFuseNet. It employs a multi-branch architecture where CNNs capture local features, while Transformers model global context for irregular text detection. By fusing the information from different branches, TextFuseNet achieves robust performance in detecting irregular text regions.
(5) LayoutLM: Although primarily designed for document layout analysis, LayoutLM, which is based on the Transformer, can also be utilized for irregular text detection. By treating text detection as a sequence labeling task, LayoutLM captures the spatial dependencies of irregular text elements and accurately identifies them.

As a technique for detecting irregular text in images, graph neural networks have emerged as one of the most powerful. GNN-based methods represent an image as a graph whose nodes and edges are derived from text components.

Our goal in this article is to provide a brief summary of the latest advances in detecting text in images that have arbitrary and irregular shapes [16, 17]. Recent research has focused on the detection of text in scenes of various orientations, forms, and layouts. Several studies [18-23] have been published on the topic of detecting irregular text because of the considerable interest in this area of text detection. The following four groups best describe the scope of these investigations.

Regression Based Approaches: Regression-based approaches have been used to identify scene text using text bounding boxes. Various approaches have been developed in this category, such as Textboxes [24], ABCnet [25], EAST [26], and the adaptive boundary proposal network proposed by the authors of [27].

To handle rectangular candidate boxes with long sliding windows and convolution kernels, Textboxes uses horizontal text processing and has trouble with text that is not perfectly rectangular. The output of ABCnet is influenced by over-reliance on control points in the description of non-rectangular text shapes. EAST is built to provide quick and precise results for text detection in natural settings. Further, a boundary proposal network was developed in the latter paper to help detect arbitrary-shaped text; it produced accurate boundaries without the need for further post-processing.

While these regression-based approaches have shown
promising results in detecting horizontal and multi-oriented text, they may struggle to do so when presented with scene texts that have very wide aspect ratios and are oriented in unexpected ways.

Segmentation Based Approaches: These have emerged as another approach for text detection, relying on classification at the pixel level [28-31]. Text segmentation zones are identified using deep convolutional neural networks, and then the boxes are created using post-processing. PSEnet [28] presents a progressive scale expansion post-processing approach that greatly enhances detection precision. In contrast, Pixellink [29] overcomes the problem of textual closeness by foreseeing pixel connections between distinct instances of text. According to the study [30], pixels are classified into groups using feature distances through pixel embedding.

There are obvious benefits to using segmentation-based approaches for both text and non-text segmentation [32, 33]. It is possible, however, for irrelevant non-characters to be misclassified as characters while training on the text segmentation areas. This can lower the quality of the segmentation results by causing problems with text line adherence.

Connected Component Based Approaches: Text detection systems employ these methods, which first detect individual text entries, then link or group these into complete text instances after a post-processing step. As a result of their flexible representation and adaptability [34-37], these methods have gained popularity in the detection of arbitrarily shaped text.

Using ordered discs and text centerlines to model text instances, TextSnake [37] represents text of varying shapes successfully. When it comes to inference, however, TextSnake still needs to rely on laborious post-processing procedures like centralising, striding, and sliding. Each text instance is built from ordered rectangular components, including text and non-text, in DRRG's text detection approach.

Text regions are typically divided into several pieces consisting of both text and non-text components by these methods that work on specific text parts. The computational cost and difficulty may rise if many non-text parts must be generated all at once. The authors of [38] present a method for multidirectional text detection that uses exhaustive segmentation to provide potential character candidates. To foretell character region maps and affinity maps, CRAFT [39] uses semi-supervised learning. These techniques can decrease computing complexity and difficulty by limiting their attention to character regions inside text components.

Relational inference is an essential aspect of connected component-based methods, as their performance relies heavily on the grouping of text lines. Methods like Pixellink utilize embedding features to generate text areas and provide instance information. In the case of CRAFT [39], affinity maps are predicted through weakly supervised learning.

However, the receptive field of the CNN limits the efficacy of these approaches, making it difficult to capture relationships between distant component areas using local convolutional operators. Graph convolutional networks (GCNs) were introduced by the authors of [40] to overcome this shortcoming by allowing for local graph-based reasoning and deduction of the likelihood of links between a component and its neighbours. On open-source datasets, their technique outperformed previous best practices.

To accomplish iterative boundary deformation, the authors of [40] present a model that combines a GCN with a recurrent neural network (RNN). The goal of this iterative procedure is to produce a text instance with a more precise form. Their approach performed exceptionally well on difficult text-in-the-wild datasets like TotalText [41].

Transformer Based Approaches: Transformers have grown in popularity in computer vision [42-45] since their introduction for machine translation [46]. With DETR [47], object detection was treated as a set prediction problem instead of a complex post-processing problem. Although DETR suffered from inefficient utilization of high-resolution features and slow training convergence, subsequent work continued to investigate detection transformers. For instance, Deformable-DETR [48] addressed these issues by focusing on sparse features. DE-DETR [49] identified sparse feature sampling as a crucial factor for data efficiency. In the Transformer decoder, dynamic anchor boxes were introduced to enhance training through DAB-DETR [50].

Limitations of the existing approaches are:
(1) Existing methods struggle with detecting irregular shapes in natural scene images.
(2) Limited receptive fields of Convolutional Neural Networks (CNNs) make it difficult to capture long-range relationships between distant text component regions.
(3) Existing methods may inaccurately identify non-character text components, leading to false positives.
(4) Capturing adjacency relations among text components is crucial for accurate text detection.

Advantages of our proposed approach are:
(1) Our approach employs a U-Net architecture, which is particularly effective in capturing irregular shapes. This architecture, combined with connected component analysis, enhances the detection of irregular text shapes, addressing a significant challenge in text detection.
(2) Our method integrates graph convolution networks (GCN), enabling the deduction of adjacency relations among text components. This innovation allows for a more comprehensive understanding of long-range relationships, enhancing the model's ability to connect and identify distant text components.
(3) By combining the U-Net architecture with connected component analysis, our approach enhances the accuracy of text component detection and reduces the likelihood of misidentifying non-character text components. This results in a more precise and reliable text detection system.
(4) The integration of GCNs introduces a sophisticated mechanism for inferring adjacency relations. This step significantly contributes to the advancement of text detection by providing a more nuanced understanding of the spatial relationships between text components.

3. PROPOSED METHOD

In this section, we delve into several key aspects of our methodology for text detection in natural scenes. Firstly, we elucidate the intricacies of Character Center Point Estimation, employing the U-Net architecture to achieve precise identification. This step ensures accurate localization of character center points, a fundamental element for effective text detection. Secondly, we detail the Construction of the Local Inference Graph, where each identified text region is represented as a node, and Graph Neural Networks (GNNs) are utilized to establish a comprehensive local graph. This graph captures adjacency relations, enhancing our model's
ability to understand intricate connections among text components. Additionally, we explore the Comprehensive Exploration of Proximity Relationships, aiming to provide a holistic understanding of the spatial relationships between text elements. Lastly, we discuss Text Line Formation, emphasizing the strategic organization of identified text components into coherent lines. The combination of these components forms a robust and innovative approach to address challenges in text detection, particularly in scenes with irregular shapes and long-range dependencies.

Figure 1 depicts the general architecture of our method, outlining the processes involved in the framework: extraction of text components, construction of a local inference graph, inference of deep adjacency relations, and production of text lines. First, the U-Net architecture [51] is applied for feature extraction, and connected component analysis is applied to its last layer for character center point estimation. Then, we create a local inference graph that stands in for the innate connection relationships among the character text components by capitalising on their fundamental features. The deep relational inference network uses this local inference graph to reason about the relationships between the constituent parts of a character string. At last, the separated connected regions are used to categorise the reasoning outcomes into individual text instances.

3.1 Character center point estimation

At the bottleneck layer, the spatial dimensions are significantly reduced, but the learned features are highly abstract and semantically rich. The decoder path starts with upsampling operations to restore the spatial dimensions, followed by convolutional layers that refine the features.

Skip connections, which link comparable feature maps between the encoder and decoder paths, are a crucial component of the U-Net design. The localization of character centre points is facilitated by these links, which allow low-level and high-level features to be combined. They also aid in maintaining spatial information and fine-grained features from previous network stages.
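As an illustration of this encoder-bottleneck-decoder design, the following is a minimal PyTorch sketch; the module name TinyUNet, the channel widths, and the depth are illustrative assumptions (the full network in this work uses a ResNet-50 backbone, see Section 4.2), but it shows how skip connections concatenate encoder features into the decoder and how a sigmoid head produces the per-pixel heatmap discussed next.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    # Input height and width must be divisible by 4 in this sketch.
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)          # high-resolution features
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)  # abstract, low-resolution
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, 1)        # per-pixel center logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        # Skip connections: concatenate encoder maps with upsampled maps.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))    # heatmap in [0, 1]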
The final layer of the U-Net architecture produces predictions, typically in the form of a heatmap. For character center point estimation, the output layer can be designed to predict the likelihood or probability of each pixel being a character center point. This is accomplished using a sigmoid activation function that generates pixel-wise predictions between 0 and 1.

To identify the character center points, a thresholding operation is applied to the heatmap. Pixels with values above a certain threshold are considered potential character center points. This thresholding step creates a binary mask, where values above the threshold are set to 1 and the rest to 0.

Connected component analysis is then applied to the binary mask. Connected component analysis is represented in Algorithm 1. This analysis identifies and labels connected regions in the binary image, where each labeled region represents a group of adjacent pixels.

Algorithm 1. Character center point estimation using Connected Component Analysis (CCA)

Input: Binary image (after thresholding)
Output: Connected components

import numpy as np

def cca(binary_image):
    # binary_image: 2-D numpy array of 0/1 values (after thresholding).
    # Labels start at 2 so that visited pixels can never be re-read as
    # unvisited foreground pixels (which carry the value 1).
    h, w = binary_image.shape
    components = []
    label_counter = 2
    for y in range(h):
        for x in range(w):
            if binary_image[y, x] != 1:
                continue
            # Depth-first search over the 8-connected neighbourhood,
            # using an explicit stack to avoid recursion limits.
            component, stack = [], [(y, x)]
            binary_image[y, x] = label_counter
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if binary_image[ny, nx] == 1:
                            binary_image[ny, nx] = label_counter
                            stack.append((ny, nx))
            components.append(component)
            label_counter += 1
    return components

To filter out small connected components that may correspond to noise or artifacts, a filtering step based on component area is performed. Components below a certain area threshold are removed, retaining larger and more meaningful components that are more likely to represent character center points.

For each remaining connected component, the centroid (center of mass) is computed by averaging the x and y coordinates of all pixels within the component. These computed centroids represent the estimated character center points.

Optionally, additional refinement techniques can be applied to improve the accuracy of the character center points. These techniques may include centroid shift correction, sub-pixel precision estimation, or the incorporation of geometric constraints.

By following the procedure of connected component analysis and the subsequent steps, the U-Net architecture can effectively identify and extract the character center points from the binary mask, providing a more precise localization of the character positions.
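Taken together, the thresholding, area filtering, and centroid steps can be sketched as follows. This is a minimal illustration (function name and thresholds are assumptions), using scipy.ndimage's 8-connected labeling as a vectorized alternative to Algorithm 1.

import numpy as np
from scipy import ndimage

def estimate_center_points(heatmap, prob_thresh=0.5, min_area=10):
    # Threshold the sigmoid heatmap into a binary mask.
    binary = heatmap > prob_thresh
    # 8-connected labeling, equivalent in effect to Algorithm 1.
    labels, num = ndimage.label(binary, structure=np.ones((3, 3)))
    centers = []
    for idx in range(1, num + 1):
        ys, xs = np.nonzero(labels == idx)
        # Area filter: drop components likely to be noise or artifacts.
        if len(ys) < min_area:
            continue
        # Centroid = mean of the pixel coordinates in the component.
        centers.append((float(xs.mean()), float(ys.mean())))
    return centers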
3.2 Construction of the local inference graph

After character center point estimation, the next step uses a graph convolution network for inferring adjacency relationships between text components. Text components are represented by character center points according to this method, and each piece of text represents a node in the network. Inference time and complexity would increase if all nodes were used for inference directly. A local inference graph is constructed for this purpose in DRRG, which includes the pivot node and its neighbours up to the second order. First-order neighbours are the eight nodes immediately adjacent to the pivot, while second-order neighbours are the four nodes immediately adjacent to the first-order neighbours. Our method, in contrast to DRRG, takes into account only the immediate neighbours of each node. This reduces the number of nodes involved in the reasoning process by choosing the pivot node, four neighbouring nodes of the first order, and two neighbouring nodes of the second order. Figure 3 elaborates on the steps taken to construct the local inference graph. A node's adjacency is determined by evaluating the affinity between it and the pivot node. The affinity $A_s$ between a pivot node $p$ and another node $r$ is defined as follows:

$A_s = 1 - \dfrac{A_{pr}}{\max(H, W)}$ (1)

$A_{pr} = \sqrt{(M_p - M_r)^2 + (N_p - N_r)^2}$ (2)

where $H$ and $W$ represent the height and width of the image, $(M, N)$ are node coordinates, and $A_{pr}$ represents the Euclidean distance between the two nodes $p$ and $r$.

Figure 3. Construction of local inference graph
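A sketch of this construction follows. It is illustrative only: in our method the two second-order nodes are neighbours of the first-order nodes, which this simplified version approximates by taking the next-closest nodes to the pivot.

import numpy as np

def build_local_graph(centers, pivot_idx, H, W, k1=4, k2=2):
    # Eq. (2): Euclidean distance from the pivot to every node.
    pts = np.asarray(centers, dtype=np.float32)
    dist = np.sqrt(((pts - pts[pivot_idx]) ** 2).sum(axis=1))
    # Eq. (1): affinity shrinks with distance, normalized by max(H, W).
    affinity = 1.0 - dist / max(H, W)
    order = np.argsort(-affinity)              # highest affinity first
    order = order[order != pivot_idx]
    first_order = order[:k1].tolist()          # four first-order nodes
    second_order = order[k1:k1 + k2].tolist()  # two second-order nodes
    return first_order, second_order, affinity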
3.3 Comprehensive exploration of proximity relationships

Text nodes are connected in the local inference graph in an accurate manner. However, the adjacency relations between these nodes cannot be accurately represented by a link mapping or embedding mapping approach. To overcome this shortcoming, we present a Graph Convolutional Network (GCN)-based deep relational inference network. Inferring proximity relations between text component nodes is possible with the help of this network. A pivot's relationships with its first-order neighbours are an important part of the deep adjacency relation inference procedure. It is also important to note that the characteristics of a node can be influenced by its neighbors. Hence, fusion features are supplied for the first-order neighbours by second-order neighbours. Two common types of inputs to the GCN are a feature matrix (denoted by F_m) and an adjacency matrix (denoted by A_m). Here is how these two matrices are calculated:

Feature-Matrix (F_m): Each text component of the same text instance is represented by a rotating rectangle, and these rectangles share certain geometric properties. We use a combination of deep features and geometric properties as
features for the textual parts. After a text component has been extracted, we can acquire its deep features by mapping its characteristics to the RROI-Align layer. At the same time, we calculate the text component's geometric properties using its X, Y, W, H, and orientation attributes. We embed the text components' geometric qualities into high-dimensional spaces in order to derive geometric characteristics from text [8, 24, 25]. Eq. (3) and Eq. (4) provide the formulas for determining these embeddings. The feature matrix F_m, which represents the text components, is the outcome of combining the deep features with the geometric characteristics.

$\epsilon_{2j}(z) = \cos\left(\dfrac{z}{1000^{2j/C_{\epsilon}}}\right), \quad j \in [0, C_{\epsilon}/2 - 1]$ (3)

$\epsilon_{2j+1}(z) = \sin\left(\dfrac{z}{1000^{2j/C_{\epsilon}}}\right), \quad j \in [0, C_{\epsilon}/2 - 1]$ (4)

where $z$ is a geometric attribute and $C_{\epsilon}$ is the embedding dimension.
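The embeddings of Eqs. (3) and (4) can be sketched as follows; this is a minimal illustration in which the function name embed_scalar and the default dimension C are placeholders.

import numpy as np

def embed_scalar(z, C=32):
    # Eqs. (3)-(4): sinusoidal embedding of a scalar geometric
    # attribute z (e.g., x, y, w, or h) into a C-dimensional vector.
    j = np.arange(C // 2)
    freq = z / (1000.0 ** (2 * j / C))
    emb = np.empty(C, dtype=np.float32)
    emb[0::2] = np.cos(freq)   # even indices: Eq. (3)
    emb[1::2] = np.sin(freq)   # odd indices: Eq. (4)
    return emb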
up the bulk of exterior imagery. Images are available in
Adjacency-Matrix (Am): Inference graph nodes are dimensions ranging from 1296x864 to 1920x1280. The
connected to produce the adjacency matrix Am. If node a of the collection contains text in several formats, including a wide
text component is connected to node bof the local inference range of languages, scripts, sizes, colours, and orientations
graph, then Am (a, b) =1, and otherwise Am (a, b) =0. (including but not limited to Chinese, English, and
Adjacency analysis between a node and itself is superfluous, combinations).
thus we set Am (a, a) =0. CTW (1500) Dataset: It contains 1500 images: 1000 for
Graph convolutional network: The local inference graph training & 500 for testing. There are 10,751 photographs of
is inferred using a GCN-based inference network based on the cropped text included, with an additional 3,530 images of bent
feature matrix (Fm) and the adjacency matrix (Am). Layer k's text. The pictures were collected by hand from various sources,
feature matrix is referred to as Fk, and its corresponding including the web, image databases like Google Open-Image,
convolutional layer is defined as follows. and mobile phone cameras. There is a lot of horizontal text in
the dataset, as well as text in other orientations.
𝐹 𝑘 = 𝜎((𝐹𝑚𝑘 ⨁ 𝐺𝑋 𝑘 )𝑊 𝑘 ) (5) Total text dataset: The Total-Text dataset includes 1,255
high-dimensional images for training and 300 for testing. Text
−1⁄ −1⁄ in a variety of orientations, including horizontal, multi-
𝐺 = (𝐷 2 𝐴𝐷 2) (6) oriented, and curved text, are included in this collection. The
text examples include both polygon and word-level
𝐷𝑖,𝑖 = ∑ 𝐴𝑖,𝑗 annotations, providing additional information about the
(7) marked areas. The Totaltext dataset is an essential resource for
𝑗
developing and evaluating text detection and recognition
algorithms.
In the equation, Xk represents the feature matrix of size
N×din, where N represents the text components number and din
4.2 Implementation details
is the feature dimension of the input nodes. Similarly, F k
represents the feature matrix of size N×d out, where dout is the Our network relies on the Resnet-50 architecture, which has
feature dimension of the output nodes. Λ represents diagonal undergone pre-training utilizing the ImageNet dataset. We use
matrix, & G illustrates the symmetric normalized Laplacian of a two-stage training procedure that begins with two epochs of
size ((N×N)). The symbol ⊕ denotes concatenation. Wk is the pre-training on the SynthText dataset and concludes with 600
weight matrix of layer k, & σ represents a nonlinear activation epochs of fine-tuning on a targeted benchmark dataset. In the
function. Training involves only computing gradients for the first round of training, we randomly crop text sections, scale
nodes that are 1-order neighbours, since we are primarily them up to 512 pixels wide, and divide them into 12 batches.
interested in connecting the pivot node with its first-order To train the model, we employ the Adam optimizer with a
neighbours, while testing involves the classification of 1-order learning rate set at 104.
nodes. During the fine-tuning phase, we employ a multi-scale
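A single layer of Eqs. (5)-(7) can be sketched in PyTorch as follows. This is illustrative rather than the exact network used here: the layer input stands in for both F_m^k and X^k in Eq. (5), and the layer sizes and ReLU choice are assumptions.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # W^k acts on the concatenated (2 * d_in)-dimensional features.
        self.weight = nn.Linear(2 * d_in, d_out, bias=False)
        self.act = nn.ReLU()

    def forward(self, X, A):
        # Eq. (7): diagonal degree matrix of the adjacency matrix A.
        D_inv_sqrt = torch.diag(A.sum(dim=1).clamp(min=1e-6).pow(-0.5))
        # Eq. (6): symmetric normalization G = D^-1/2 A D^-1/2.
        G = D_inv_sqrt @ A @ D_inv_sqrt
        # Eq. (5): concatenate node features with aggregated features,
        # then apply the layer weights and the nonlinearity sigma.
        fused = torch.cat([X, G @ X], dim=1)
        return self.act(self.weight(fused))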
3.4 Text line formation

As part of the comprehensive exploration of proximity relationships, we summarize the probabilities from all the local inference graphs to derive the adjacency probability matrix S. When deciding whether or not to keep an edge between two nodes, the threshold TH is used: if S(a, b) is greater than the threshold, S(a, b) is set to 1; otherwise, S(a, b) is set to 0. Using BFS, we then find the connected subgraphs L = {L1, L2, ..., Lk} in the whole graph. Each line of text is represented by a subgraph in L. Nodes inside each subgraph are subsequently sorted to complete the procedure.
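The thresholding and BFS grouping described above can be sketched as follows, over an adjacency probability matrix S (the function name and default threshold are illustrative).

from collections import deque

def group_text_lines(S, th=0.5):
    # Binarize the adjacency probability matrix S with threshold TH.
    n = len(S)
    adj = [[b for b in range(n) if S[a][b] > th] for a in range(n)]
    seen, lines = set(), []
    for start in range(n):
        if start in seen:
            continue
        # BFS collects one connected subgraph = one text line.
        queue, line = deque([start]), []
        seen.add(start)
        while queue:
            a = queue.popleft()
            line.append(a)
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    queue.append(b)
        lines.append(sorted(line))   # order nodes within the line
    return lines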
4. EXPERIMENTAL SETUP

4.1 Datasets

ICDAR2013 Dataset: By separating the training from the testing sets and removing duplicate images, the ICDAR2013 dataset was created from the ICDAR2011 benchmark. Annotations have been modified for a subset of ground-truth annotations. We used 229 images for training and 233 for testing, resulting in a dataset of 482 images. The vast majority of the pictures are from nature, and the majority of the texts are horizontal or nearly horizontal.

MSRA (TD500) Dataset: Pocket-camera indoor (office, mall) and outdoor (street) photos make up the bulk of the MSRA-TD500. Signs, doorplates, and warning signs predominate indoors, while guide-boards and billboards take up the bulk of the exterior imagery. Images are available in dimensions ranging from 1296×864 to 1920×1280. The collection contains text in several formats, including a wide range of languages, scripts, sizes, colours, and orientations (including but not limited to Chinese, English, and combinations).

CTW (1500) Dataset: It contains 1500 images: 1000 for training and 500 for testing. There are 10,751 photographs of cropped text included, with an additional 3,530 images of bent text. The pictures were collected by hand from various sources, including the web, image databases like Google Open-Image, and mobile phone cameras. There is a lot of horizontal text in the dataset, as well as text in other orientations.

Total text dataset: The Total-Text dataset includes 1,255 high-resolution images for training and 300 for testing. Text in a variety of orientations, including horizontal, multi-oriented, and curved text, is included in this collection. The text examples include both polygon and word-level annotations, providing additional information about the marked areas. The Total-Text dataset is an essential resource for developing and evaluating text detection and recognition algorithms.

4.2 Implementation details

Our network relies on the ResNet-50 architecture, which has undergone pre-training utilizing the ImageNet dataset. We use a two-stage training procedure that begins with two epochs of pre-training on the SynthText dataset and concludes with 600 epochs of fine-tuning on a targeted benchmark dataset. In the first round of training, we randomly crop text sections, scale them up to 512 pixels wide, and divide them into 12 batches. To train the model, we employ the Adam optimizer with a learning rate set at $10^{-4}$.

During the fine-tuning phase, we employ a multi-scale training strategy. Text regions are randomly cropped and resized to three distinct dimensions: 640×640 with a batch size of 8, 800×800 with a batch size of 4, and 960×960 with a batch size of 4. During the fine-tuning process, we transition to using the SGD optimizer with an initial learning rate of 0.01. This learning rate is decreased by a factor of 0.8 every 100 epochs. Moreover, we incorporate fundamental data augmentation methods, including rotations, crops, color variations, and partial flipping. The hyperparameters associated with the local graph remain constant during both the training and testing stages. All experiments are performed on a single GPU (RTX-2080Ti) utilizing PyTorch 1.2.0.
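For reference, the two optimizer stages described above can be sketched as follows; the momentum value is an assumption not stated in the text.

import torch

def make_optimizer(model, stage):
    # Stage 1 (SynthText pre-training): Adam at 1e-4.
    # Stage 2 (fine-tuning): SGD at 0.01, decayed by 0.8 every 100 epochs.
    if stage == "pretrain":
        return torch.optim.Adam(model.parameters(), lr=1e-4), None
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.8)
    return opt, sched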
4.3 Assessment criteria

The role of evaluation metrics is pivotal when assessing the performance of algorithms designed for irregular text detection. These criteria serve as quantitative measures for evaluating the accuracy and efficacy of the detection system. Various evaluation metrics are commonly employed to assess irregular text detection in a standardized manner.

One commonly used metric is bounding box-based evaluation, where metrics such as precision, recall, and F1-score are computed based on the accuracy of the predicted bounding boxes compared to ground truth annotations. Precision indicates the proportion of instances of irregular text that have been correctly localized out of all the predicted instances. Recall calculates the proportion of correctly detected instances out of all the ground truth instances, while F1-score provides a balanced evaluation by taking into account both precision and recall.

Another metric is pixel-level evaluation, which involves measuring the accuracy of the pixel-wise segmentation masks for irregular text. In this context, the widely utilized evaluation metric is Intersection over Union (IoU), which calculates the overlap between the predicted mask and the ground truth mask. Higher IoU values indicate better segmentation accuracy.

Other evaluation metrics commonly employed in the assessment of irregular text detection encompass Average Precision (AP), which evaluates precision at various recall levels, and the F-measure, which combines precision and recall to provide a consolidated assessment.

The selection of appropriate evaluation metrics depends on factors such as the specific characteristics of irregular text, the complexity of the detection task, and the desired trade-off between precision and recall. It is important to choose metrics that align with the objectives and requirements of the irregular text detection system being evaluated.

Precision evaluates the ratio of accurately identified text instances to all the text instances detected by the system. It emphasizes the accuracy of positive predictions, serving as an indicator of how effectively the algorithm recognizes true positive text regions. Recall, also known as sensitivity, measures the proportion of correctly detected text instances out of all the actual text instances present in the dataset. It emphasizes the ability of the algorithm to capture all the positive instances, minimizing false negatives.

The F1-score, the harmonic mean of precision and recall, represents a balanced assessment of the algorithm's performance: it takes both precision and recall into account simultaneously, giving equal importance to false positives and false negatives.

Additional evaluation metrics for text detection could encompass Intersection over Union (IoU), which quantifies the degree of overlap between the predicted text regions and the ground truth regions, as well as Average Precision (AP), which determines the average precision across various recall levels.
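For concreteness, these criteria can be computed as follows; this is a minimal sketch over detection counts and boolean segmentation masks.

def precision_recall_f1(tp, fp, fn):
    # Precision: correct detections / all predicted detections.
    p = tp / (tp + fp) if tp + fp else 0.0
    # Recall: correct detections / all ground-truth instances.
    r = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def iou(mask_pred, mask_gt):
    # Intersection over Union of two boolean numpy masks.
    inter = (mask_pred & mask_gt).sum()
    union = (mask_pred | mask_gt).sum()
    return inter / union if union else 0.0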
4.4 Ablation study

4.4.1 Exploring the impact of the U-Net architecture for text component extraction through an ablation study
We performed ablation experiments on three datasets, namely Total text, CTW (1500), and MSRA (TD500), to assess the effectiveness of the relational reasoning network. The experimental results are presented in Table 1. In order to mitigate the influence of data on the results, our model was initially pre-trained using SynthText and subsequently fine-tuned using Totaltext and CTW (1500). For MSRA (TD500), which includes both English and Chinese text, we pre-trained our network using ICDAR2017-MLT. The maximum dimensions of the images in Totaltext, CTW (1500), and MSRA (TD500) were restricted to 1,280, 1,024, and 640, respectively, while maintaining their original aspect ratios.

We have enhanced our method based on the DRRG and DPText-DETR approaches and conducted a comparative analysis of the experimental results with DRRG. Figure 4 presents a visual comparison of the text components generated by various methods. A statistical comparison has also been conducted between the two approaches regarding the number of text components and detection results. Table 1 shows that our text component generation method led to significant reductions in text component numbers and detection times. Furthermore, the results indicate that our method effectively reduces the number of non-character text components while improving the overall performance of text detection.

Table 1. Experimental results focusing on the extraction of text components

Dataset        Models        P      R      F
CTW (1500)     DRRG [11]     83.8   81.5   82.6
               TD-GCN [40]   86.7   85.4   86.1
               Proposed      89.5   88.4   89.2
MSRA (TD500)   DRRG [11]     88.1   82.3   85.1
               TD-GCN [40]   89.7   85.1   87.4
               Proposed      90.2   88.9   89.2
Total text     DRRG [11]     83.1   85.9   84.5
               TD-GCN [40]   89.1   84.4   86.1
               Proposed      90.4   88.4   89.1

Figure 4. Bar graph illustration of text component extraction

4.4.2 Ablation experiment on local inference graph
Methods that employ feature extraction networks for direct text region detection often face challenges when it comes to accurately segmenting text lines, for instance when two text regions are mistakenly merged into one region. To address these issues and improve text region segmentation, our approach utilizes a relation inference network that leverages the adjacency relationships between text components. Experimental results on the MSRA (TD500) and CTW (1500) datasets, shown in Table 2, demonstrate the effectiveness of our adjacency inference network. Figure 5 presents a visual comparison of the text components generated by various methods. The local inference ablation experiments reveal significant improvements in precision, recall, and F-measure on MSRA (TD500) and on CTW (1500). These
performance improvements further validate the efficacy of our proposed adjacency inference network.

Figure 6. Detection samples from the proposed method on the ICDAR (2013) dataset

Table 3. Experimental results conducted on the ICDAR (2013) dataset using different methods

Reference         P      R      F
[55]              86.0   70.0   77.0
[6]               87.4   75.9   81.3
[39]              88.2   78.2   82.9
[14]              81.6   77.2   79.3
[15]              85.0   82.0   83.0
[35]              88.1   82.3   85.1
[12]              90.2   81.9   85.8
[13]              90.9   83.8   87.2
[32]              91.5   83.3   87.2
[40]              89.7   85.4   86.1
Proposed model    92.3   87.8   89.3
4.3.3 Empirical investigations conducted on the CTW (1500)
Moreover, we selected the CTW (1500) dataset to assess our method's robustness in detecting irregular scene text. Figure 8 presents several examples showcasing the experimental outcomes achieved using our method. As shown in Table 5, our method outperforms other methods. Remarkably, the results presented in Table 5 highlight that our method surpasses alternative approaches in terms of recall rate and F-measure, attaining values of 85.4% and 86.1%, respectively. As a result, we were able to detect irregular and multidirectional scene text accurately with the help of our method.

Table 6. Results obtained from experiments conducted on the Totaltext dataset using different methods

Reference         P       R       F
[33]              85.6    75.7    80.3
[6]               81.2    79.9    80.6
[28]              84.02   77.9    80.87
[55]              82.1    80.9    81.5
[23]              87.6    79.3    83.3
[11]              86.54   84.93   85.73
Proposed model    89.9    89.2    87.96
method.
5. CONCLUSION
REFERENCES

[5] Yang, C., Chen, M., Yuan, Y., Wang, Q. (2023). Text growing on leaf. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2023.3244322
[6] Xu, Y., Wang, Y., Zhou, W., Wang, Y., Yang, Z., Bai, X. (2019). Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11): 5566-5579. https://doi.org/10.1109/TIP.2019.2900589
[7] Sirisha, U., Sai Chandana, B. (2022). Semantic interdisciplinary evaluation of image captioning models. Cogent Engineering, 9(1): 2104333. https://doi.org/10.1080/23311916.2022.2104333
[8] Sirisha, U., Bolem, S.C. (2022). Aspect based sentiment & emotion analysis with ROBERTa, LSTM. International Journal of Advanced Computer Science and Applications, 13(11). https://doi.org/10.14569/IJACSA.2022.0131189
[9] Sirisha, U., Chandana, B.S. (2023). Privacy preserving image encryption with optimal deep transfer learning based accident severity classification model. Sensors, 23(1): 519. https://doi.org/10.3390/s23010519
[10] Madhuri, M.A., Devi, T.U. (2023). Statistical analysis of design aspects on various graph embedding learning classifiers. In 2023 7th International Conference on Computing Methodologies and Communication (ICCMC), IEEE, pp. 98-105. https://doi.org/10.1109/ICCMC56507.2023.10083741
[11] Zhang, S.X., Zhu, X., Hou, J.B., Liu, C., Yang, C., Wang, H., Yin, X.C. (2020). Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9699-9708.
[12] Wang, X., Zheng, S., Zhang, C., Li, R., Gui, L. (2021). R-YOLO: A real-time text detector for natural scenes with arbitrary rotation. Sensors, 21(3): 888. https://doi.org/10.3390/s21030888
[13] Raisi, Z., Naiel, M.A., Younes, G., Wardell, S., Zelek, J.S. (2021). Transformer-based text detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3162-3171.
[14] Wan, Q., Ji, H., Shen, L. (2021). Self-attention-based text knowledge mining for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5983-5992.
[15] Wang, X., Jiang, Y., Luo, Z., Liu, C.L., Choi, H., Kim, S. (2019). Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6449-6458.
[16] Long, S., He, X., Yao, C. (2021). Scene text detection and recognition: The deep learning era. International Journal of Computer Vision, 129: 161-184. https://doi.org/10.1007/s11263-020-01369-0
[17] Chen, X., Jin, L., Zhu, Y., Luo, C., Wang, T. (2021). Text recognition in the wild: A survey. ACM Computing Surveys (CSUR), 54(2): 1-35. https://doi.org/10.1145/3440756
[18] Zhu, Y., Chen, J., Liang, L., Kuang, Z., Jin, L., Zhang, W. (2021). Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3123-3131.
[19] Wang, Y., Mamat, H., Xu, X., Aysa, A., Ubul, K. (2022). Scene uyghur text detection based on fine-grained feature representation. Sensors, 22(12): 4372. https://doi.org/10.3390/s22124372
[20] Arava, K., Paritala, C., Shariff, V., Praveen, S.P., Madhuri, A. (2022). A generalized model for identifying fake digital images through the application of deep learning. In 2022 3rd International Conference on Electronics and Sustainable Communication Systems (ICESC), IEEE, pp. 1144-1147. https://doi.org/10.1109/ICESC54411.2022.9885341
[21] Sindhura, S., Praveen, S.P., Safali, M.A., Rao, N. (2021). Sentiment analysis for product reviews based on weakly-supervised deep embedding. In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), IEEE, pp. 999-1004. https://doi.org/10.1109/ICIRCA51532.2021.9544985
[22] Praveen, S.P., Sindhura, S., Madhuri, A., Karras, D.A. (2021). A novel effective framework for medical images secure storage using advanced cipher text algorithm in cloud computing. In 2021 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1-4. https://doi.org/10.1109/IST50367.2021.9651475
[23] Zhang, C., Liang, B., Huang, Z., En, M., Han, J., Ding, E., Ding, X. (2019). Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10552-10561.
[24] Liao, M., Shi, B., Bai, X., Wang, X., Liu, W. (2017). Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). https://doi.org/10.1609/aaai.v31i1.11196
[25] Liu, Y., Chen, H., Shen, C., He, T., Jin, L., Wang, L. (2020). Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9809-9818.
[26] Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J. (2017). East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551-5560.
[27] Zhang, S.X., Zhu, X., Yang, C., Wang, H., Yin, X.C. (2021). Adaptive boundary proposal network for arbitrary shape text detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1305-1314.
[28] Li, X., Wang, W., Hou, W., Liu, R.Z., Lu, T., Yang, J. (2018). Shape robust text detection with progressive scale expansion network. arXiv preprint arXiv:1806.02559. https://doi.org/10.48550/arXiv.1806.02559
[29] Deng, D., Liu, H., Li, X., Cai, D. (2018). Pixellink: Detecting scene text via instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.12269
[30] Tian, Z., Shu, M., Lyu, P., Li, R., Zhou, C., Shen, X., Jia, J. (2019). Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4234-4243.
[31] Zhang, S.X., Zhu, X., Hou, J.B., Yang, C., Yin, X.C. (2022). Kernel proposal network for arbitrary shape text detection. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2022.3152596
[32] Liao, M., Zou, Z., Wan, Z., Yao, C., Bai, X. (2022). Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 919-931. https://doi.org/10.1109/TPAMI.2022.3155612
[33] Feng, W., He, W., Yin, F., Zhang, X.Y., Liu, C.L. (2019). Textdragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9076-9085.
[34] Keserwani, P., Dhankhar, A., Saini, R., Roy, P.P. (2021). Quadbox: Quadrilateral bounding box-based scene text detection using vector regression. IEEE Access, 9: 36802-36818.
[35] Yin, F., Wu, Y.C., Zhang, X.Y., Liu, C.L. (2017). Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727. https://doi.org/10.48550/arXiv.1709.01727
[36] Tian, Z., Huang, W., He, T., He, P., Qiao, Y. (2016). Detecting text in natural image with connectionist text proposal network. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part VIII. Springer International Publishing, 14: 56-72. https://doi.org/10.1007/978-3-319-46484-8_4
[37] Long, S., Ruan, J., Zhang, W., He, X., Wu, W., Yao, C. (2018). Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20-36.
[38] Wei, Y., Shen, W., Zeng, D., Ye, L., Zhang, Z. (2018). Multi-oriented text detection from natural scene images based on a CNN and pruning non-adjacent graph edges. Signal Processing: Image Communication, 64: 89-98. https://doi.org/10.1016/j.image.2018.02.016
[39] Baek, Y., Lee, B., Han, D., Yun, S., Lee, H. (2019). Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9365-9374.
[40] Zhang, S., Zhou, C., Li, Y., Zhang, X., Ye, L., Wei, Y. (2023). Irregular scene text detection based on a graph convolutional network. Sensors, 23(3): 1070. https://doi.org/10.3390/s23031070
[41] Ch'ng, C.K., Chan, C.S., Liu, C.L. (2020). Total-text: Toward orientation robustness in scene text detection. International Journal on Document Analysis and Recognition (IJDAR), 23(1): 31-52. https://doi.org/10.1007/s10032-019-00334-z
[42] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929
[43] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022.
[44] Zhou, Y., Xie, H., Fang, S., Wang, J., Zha, Z., Zhang, Y. (2021). TDI TextSpotter: Taking data imbalance into account in scene text spotting. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 2510-2518. https://doi.org/10.1145/3474085.3475423
[45] Zhang, Q., Xu, Y., Zhang, J., Tao, D. (2023). Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 1-22. https://doi.org/10.1007/s11263-022-01739-w
[46] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[47] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision. Cham: Springer International Publishing, pp. 213-229. https://doi.org/10.1007/978-3-030-58452-8_13
[48] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. https://doi.org/10.48550/arXiv.2010.04159
[49] Wang, W., Zhang, J., Cao, Y., Shen, Y., Tao, D. (2022). Towards data-efficient detection transformers. In European Conference on Computer Vision. Cham: Springer Nature Switzerland, pp. 88-105. https://doi.org/10.1007/978-3-031-20077-9_6
[50] Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L. (2022). Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329. https://doi.org/10.48550/arXiv.2201.12329
[51] Lu, X., Jian, M., Wang, X., Yu, H., Dong, J., Lam, K.M. (2022). Visual saliency detection via combining center prior and U-Net. Multimedia Systems, 28(5): 1689-1698. https://doi.org/10.1007/s00530-022-00940-8
[52] Wei, Y., Zhang, Z., Shen, W., Zeng, D., Fang, M., Zhou, S. (2017). Text detection in scene images based on exhaustive segmentation. Signal Processing: Image Communication, 50: 1-8. https://doi.org/10.1016/j.image.2016.10.003
[53] Gao, J., Wang, Q., Yuan, Y. (2019). Convolutional regression network for multi-oriented text detection. IEEE Access, 7: 96424-96433. https://doi.org/10.1109/ACCESS.2019.2929819
[54] Jeon, M., Jeong, Y.S. (2020). Compact and accurate scene text detector. Applied Sciences, 10(6): 2096. https://doi.org/10.3390/app10062096
[55] Tang, J., Yang, Z., Wang, Y., Zheng, Q., Xu, Y., Bai, X. (2019). Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern Recognition, 96: 106954. https://doi.org/10.1016/j.patcog.2019.06.020