
Article

A Sentence-Matching Model Based on Multi-Granularity Contextual Key Semantic Interaction
Jinhang Li 1 and Yingna Li 1,2,*

1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology,
Kunming 650500, China; [email protected]
2 Computer Technology Application Key Lab of the Yunnan Province, Kunming 650500, China

* Correspondence: [email protected]

Abstract: In the task of matching Chinese sentences, the key semantics within sentences and the
deep interaction between them significantly affect the matching performance. However, previous
studies mainly relied on shallow interactions based on a single semantic granularity, which left them
vulnerable to interference from overlapping terms. It is particularly challenging to distinguish
between positive and negative examples within datasets from the same thematic domain. This paper
proposes a sentence-matching model that incorporates multi-granularity contextual key semantic
interaction. The model combines multi-scale convolution and multi-level convolution to extract
different levels of contextual semantic information at word, phrase, and sentence granularities. It
employs multi-head self-attention and cross-attention mechanisms to align the key semantics
between sentences. Furthermore, the model integrates the original, similarity, and dissimilarity
information of sentences to establish deep semantic interaction. Experimental results on both open-
and closed-domain datasets demonstrate that the proposed model outperforms existing baseline
models in terms of matching performance. Additionally, the model achieves matching effectiveness
comparable to large-scale pre-trained language models while utilizing a lightweight encoder.

Keywords: sentence matching; multi-granularity feature extraction; attention mechanism; semantic interaction

1. Introduction
Sentence matching is essential in various natural language processing tasks and finds applications in question answering, information retrieval, recommendation systems, and natural language inference. For example, in question-answering systems, it is used for paragraph retrieval to match relevant paragraphs from a text corpus with the given question. Additionally, it is employed in answer selection tasks to choose the most suitable answer from a set of candidates. Information retrieval and recommendation system tasks rely on sentence matching to establish relationships between queries and documents and rank candidate results accordingly. In natural language inference tasks, sentence matching is utilized to evaluate the semantic relationship between hypotheses and premises.
In recent years, pre-trained language models have shown great success in diverse natural language processing tasks. For instance, Garg et al. [1] employed a fully connected layer to combine the classification vector of BERT [2]. Reimers et al. [3] introduced SBERT, which encodes each sentence into a dense vector and measures the similarity between two vectors to assess the textual matching degree. These BERT-based models effectively capture overlapping terms in sentences. However, these methods are limited by their reliance on single-granularity vectors, which are insufficient to fully represent textual information, and by the limited interaction between sentences.

Inspired by the matching aggregation framework [4], Tang et al. [5] enhanced the
effectiveness of short text matching by considering multiple aspects of interaction that
capture the similarities and differences between two sentences comprehensively. Wang et
al. [6] utilized ChineseBERT [7] to incorporate the visual and phonetic information of
Chinese characters into the pre-training model, which served as encoders for questions
and answers. This approach enabled multi-granularity feature encoding. They introduced
a multi-scale convolutional neural network [8] to extract local semantic information and
developed a context-aware interactive module to model sentence interaction.
Furthermore, they proposed a multi-perspective fusion mechanism to integrate local and
interactive information, with the goal of capturing rich features from questions and
answers. However, these methods have limitations in their focus on low-level local
semantic interactions and their limited discriminative ability when dealing with candidate
options that contain a high degree of overlapping terms.
Lin et al. [9] introduced BERT-SMAP (BERT with self-matching attention pooling),
which incorporates a self-matching attention-pooling mechanism into sentence pairs. This
mechanism allows the model to prioritize the key terms within the sentence pairs rather
than the overlapping terms. Lu et al. [10] proposed MKPM (multi-keyword pair
matching), a method for addressing the sentence-matching task. It utilizes attention
mechanisms on sentence pairs to identify the most important keyword pairs from the two
sentences, representing their semantic relationship and avoiding interference from
redundancy and noise. However, these methods have limitations in terms of their
performance on closed-domain datasets. In datasets with a high semantic similarity
between texts in the same thematic domain, solely focusing on key terms may make it
challenging for the model to distinguish between positive and negative examples.
Humans first consider the presence of matching keywords and then synthesize the
overall meaning of sentences to assess text alignment. To address these issues, this paper
proposes a sentence-matching model that incorporates multi-granularity contextual key
semantic interaction. The model combines pre-trained language model embeddings with
BiGRU contextual encoding to thoroughly consider sentence contextual information. It
utilizes multi-scale CNN and multi-level CNN to extract different levels of semantic
information at word and phrase granularity. The model introduces multi-head self-
attention and cross-attention to emphasize key semantics within sentences and align
crucial information between sentence pairs. By integrating three interaction methods, it
incorporates interactive information on similarity and dissimilarity, enabling focus on key
semantics and local interaction information at different granularity levels. The sentence’s
overall representation is obtained through max pooling, with the pre-trained language
model’s CLS vector serving as the coarse-grained semantics of the entire sentence pair. By
combining these components, the model comprehensively captures the semantic
information of the entire sentence pair.
In Chinese, a language lacking natural word delimiters, semantic information is
conveyed through lexical units. Additionally, the polysemy of Chinese words can
introduce ambiguity in various contexts. Phrases formed by word combinations carry
specific underlying semantics, and Chinese sentences with different word orders can
exhibit entirely different meanings. Approaches that rely on single-granularity text
semantic matching can only analyze semantics at a single level and may not accurately
capture the semantic relationship between texts. Furthermore, they often overlook the
fusion of features from different granularities. In contrast, MCKI explores key semantic
information at the levels of words, phrases, and sentences, while fully considering the
interaction and fusion of semantics across different granularities. This enables MCKI to
achieve superior performance.
The main contributions of this paper can be summarized as follows:
1. Introduction of the multi-granularity contextual key semantic interaction model
(MCKI) for sentence matching, which combines multi-granularity feature extraction
and key semantic interaction fusion. This enables the model to capture
comprehensive semantic information in the text.
2. Development of a semantic feature extraction mechanism that combines multi-scale
CNN and multi-level CNN to extract contextual semantic information at different
granularities and levels.
3. Construction of a key semantic alignment mechanism by integrating multi-head self-
attention and cross-attention, incorporating interaction information at different levels
from low level to high level.
4. The proposed model demonstrates superior matching performance compared to
existing baseline models in the LCQMC and BQ datasets. Additionally, it achieves
comparable results to large-scale pre-trained language models while using a
lightweight encoder.

2. Related Work
With the advancement of deep learning techniques, attention mechanisms have been
extensively explored by researchers to capture semantic interactions and determine the
degree of matching. For example, Zhao et al. [11] presented an interactive attention
network that employed a matching matrix to model the semantic exchange between
source and target texts, improving the matching capability for low-frequency keywords.
Fei et al. [12] introduced a deep neural communication model that utilized BiLSTM and
tree encoders to capture semantic and syntactic features. Through attention mechanisms,
they designed deep communication modules to capture local and global interactions
between semantic and syntactic features, thereby enhancing semantic comprehension.
However, these studies primarily focus on modeling English sentences and address single-granularity text semantic matching. They analyze semantics at a single level and do not take into account the characteristics of the Chinese language. As a result, semantic relationships between Chinese texts cannot be measured accurately.
Several studies have investigated multi-granularity text semantic matching to
comprehensively assess the semantic similarity and matching degree between texts. These
studies have explored deep fusion techniques at the character and word levels. For
example, Zhang et al. [13] utilized a soft alignment attention mechanism to enhance local
information in sentences at different levels, capturing feature information and correlations
from multiple perspectives. Yu et al. [14] employed interactive information to model
diverse text pairs across various tasks and languages. Wang et al. [15] concurrently
considered deep and shallow semantic similarity, as well as granularity at the lexical and
character levels, enabling a deeper exploration of similarity information. Chang et al. [16]
proposed semantic similarity analysis through the fusion of words and phrases.
Considering the characteristics of Chinese text, several research studies have
employed character-level and word-level approaches to develop semantic matching
models. For instance, Zhang et al. [17] proposed a deep feature fusion model that
combines various deep learning structures to capture semantic information. They
generate a similarity tensor using three matching strategies. Lai et al. [18] designed a
lattice-based approach that constructs word grids by selecting character and word paths
and utilized CNN to generate sentence representations. Chen et al. [19] introduced a
neural graph matching network for Chinese text matching, constructing a graph that
includes all possible characters and words in a sentence. They employed graph neural
networks to generate graph representations for predicting the matching degree. Zhang et
al. [20] presented a multi-granularity fusion model that combines LSTM encoding
structures to extract features at different granularities. They employed an interactive
matching layer to capture and fuse more interaction features. Zhao et al. [21] proposed a
multi-granularity alignment model that utilizes attention mechanisms to capture
interaction features between different granularities and merge them for obtaining
matching representations. Lyu et al. [22] introduced a language knowledge-enhanced
graph transformer to update character representations and semantic representations,
obtaining the final sentence representation. Some studies attempted to model text
representations using pinyin (phonetic transcription) and radicals. Tao et al. [23] proposed
a radical-guided correlation model that extracts features from characters, radicals, and
character–radical associations for Chinese text classification. Zhao et al. [24] proposed a
multi-granularity interaction model based on pinyin and radicals for Chinese semantic
matching. In addition to character and word information, they further incorporated multi-
granularity features of Chinese pinyin and radicals. The interaction features between
different granularities and sentences were aggregated to generate the final matching
representation and predict the matching degree. Although these methods utilized multi-
granularity information to comprehend text semantics from different perspectives, they
faced limitations with the Word2Vec model for text embedding representation.
Performance tended to decline when dealing with rare words or domain-specific terms.
In recent years, the widespread use of pre-trained language models has permeated
various domains, benefiting tasks involving the matching of textual semantics. These
models undergo extensive pre-training on large textual corpora, equipping them with the
ability to understand grammar, semantics, and contextual intricacies in natural language.
Wu et al. [25] merged BERT with tree models to investigate matching relationships in
question-answering data. Zhang et al. [26] utilized BERT models and multi-feature
convolutions to extract semantic features for textual semantic matching. Zou et al. [27]
employed RoBERTa as the backbone model and employed a divide-and-conquer strategy
by decomposing keywords and intents for textual semantic matching. Wang et al. [15]
achieved multi-granularity feature encoding using ChineseBERT. They also developed a
context-aware interaction module to model interactive information between sentences
and utilized a multi-perspective fusion mechanism to integrate local and interaction
information, capturing rich features of questions and answers. However, these
approaches primarily focus on low-level local semantic interactions and overlook crucial
semantic information within the text, making them susceptible to irrelevant
terminologies. Lin et al. [9] introduced a self-matching attention-pooling mechanism in
sentence pairs, prioritizing key terms rather than overlapping terms. Lu et al. [10]
proposed a matching method based on multiple pairs of key terms, selecting the most
significant word pairs from two sentences to avoid redundancy and noise interference.
While emphasizing key terms helps eliminate irrelevant semantic information, its
effectiveness is limited in domain-specific datasets where semantic similarity is higher
within texts of the same theme. Solely focusing on key terms poses challenges for the
model in distinguishing positive and negative examples.

3. Multi-Granularity Contextual Key Semantic Interaction Model


This section presents the detailed components of the proposed model, as illustrated
in Figure 1. The multi-granularity contextual key semantic interaction model (MCKI)
consists of several layers: the contextual representation layer, the multi-granularity
feature extraction layer, the key semantic alignment layer, the multi-way interaction
fusion layer, and the similarity prediction layer.

Figure 1. The overall architecture of the MCKI model.

The contextual representation layer utilizes a pre-trained language model to obtain
embedded representations of the sentence pairs. The BiGRU is then employed to capture
contextual features of the sequences.
In the multi-granularity feature extraction layer, a combination of multi-scale CNN
and multi-level CNN is used to extract semantic information at different granularities and
levels from the contextual features.
The key semantic alignment layer first utilizes a multi-head self-attention mechanism
to extract key semantic information within each sentence. Then, a cross-attention
mechanism is applied to align the key semantic information between sentence pairs.
Next, the multi-way interaction fusion layer combines the interaction features of key
semantics through multiple comparison operations.
Finally, the similarity prediction layer employs max pooling to obtain the final
semantic vectors at different granularity levels. These vectors are concatenated with the
CLS token vector to measure the semantic matching relationship.

3.1. Contextual Representation Layer


The contextual representation layer consists of two steps: embedding representation and context feature extraction. For two sentence sequences, denoted as P = {p_1, p_2, ..., p_m} and Q = {q_1, q_2, ..., q_n}, a pre-trained language model (such as BERT) is utilized as the encoder for the sentence pair. Special tokens, namely [CLS] and [SEP], are added to the sequences to distinguish them. The embedding representation of the sentence pair, denoted as X_emb, is obtained through the following encoding process:

X = [[CLS], P, [SEP], Q, [SEP]]  (1)

X_emb = BERT(X) = [h_cls, h_{p_1}, ..., h_{p_m}, h_sep, h_{q_1}, ..., h_{q_n}, h_sep]  (2)

where X represents the input sequence containing the special markers and the sentence pair P and Q. These sequences are encoded using a pre-trained language model, such as BERT, resulting in the embedded representation X_emb ∈ ℝ^{h×(m+n+3)} of the entire sequence. h ∈ ℝ^h represents an embedding vector, with h denoting the dimensionality of the encoder's hidden layer. Notably, h_cls ∈ ℝ^h can serve as a coarse-grained feature, summarizing the semantic information of the entire sentence pair. Recurrent neural networks (RNNs) are commonly employed for modeling sequential data. However, RNN models are susceptible to problems like vanishing or exploding gradients. Hence, this study adopts an efficient model called a Bidirectional Gated Recurrent Unit (BiGRU) to further encode the embedded sequences and extract context features for each of the two sentences. Additionally, a linear layer is utilized to merge the embedding representations, facilitating the comprehensive utilization of both the original and contextual information of the sequences. The representation process is illustrated as follows:

P_context = BiGRU([h_{p_1}, ..., h_{p_m}]) + Linear([h_{p_1}, ..., h_{p_m}])  (3)

Q_context = BiGRU([h_{q_1}, ..., h_{q_n}]) + Linear([h_{q_1}, ..., h_{q_n}])  (4)

BiGRU(H) = Concat(\overrightarrow{GRU}(H), \overleftarrow{GRU}(H))  (5)

where P_context ∈ ℝ^{m×2d}, Q_context ∈ ℝ^{n×2d}, m and n represent the lengths of the sentences, and d denotes the hidden dimension of the unidirectional GRU.
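
For concreteness, the following is a minimal PyTorch-style sketch of this layer (the published implementation is built on PaddleNLP, so the module name, the batch-first tensor layout, and the choice of projecting the embedding branch to the 2d-dimensional BiGRU output space are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ContextualRepresentation(nn.Module):
    """Sketch of Eqs. (3)-(5): BiGRU context encoding plus a linear skip branch."""

    def __init__(self, emb_dim: int = 768, gru_dim: int = 150):
        super().__init__()
        # Each direction of the GRU has hidden size d, so the BiGRU output is 2d.
        self.bigru = nn.GRU(emb_dim, gru_dim, batch_first=True, bidirectional=True)
        # Linear branch mapping the raw embeddings into the same 2d space (assumption).
        self.linear = nn.Linear(emb_dim, 2 * gru_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, emb_dim) -- the token embeddings of one sentence.
        context, _ = self.bigru(h)        # (batch, seq_len, 2d)
        return context + self.linear(h)   # Eq. (3) / Eq. (4)

# Usage: apply the same module separately to the P-token and Q-token slices of X_emb.
layer = ContextualRepresentation()
p_context = layer(torch.randn(2, 20, 768))    # (2, 20, 300)
```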

3.2. Multi-Granularity Feature Extraction Layer


In the field of natural language processing (NLP), convolutional neural networks
(CNNs) have proven to be effective in extracting local features from sentences. They
operate by applying convolutional operations to input sentences, which allows them to
capture n-gram features and generate latent semantic representations that are rich in
information for various downstream tasks. However, when dealing with Chinese text,
words and phrases can vary in length and carry different meanings based on the number
of characters they consist of. Using a fixed-size convolutional kernel might not be
sufficient to capture the diverse range of phrase features, potentially resulting in the
omission of crucial semantic information.
To address the limitations of a single fixed-size convolutional kernel in capturing
diverse phrase features in Chinese text, this study proposes the use of multiple-scale
convolutions to capture semantic features at different granularity levels, including
individual words, phrases, and entire sentences. This allows for a comprehensive
representation of sequence samples.
The multi-scale CNN performs convolutions on the input sentence using a series of
differently sized kernels, each targeting specific n-gram semantic features. By employing
multiple CNNs, the model can capture distinct features contained in different
combinations of n-grams, overcoming the limitations of a single CNN. The network
architecture is depicted in Figure 2.

Figure 2. Structure of multi-scale CNN.

Moreover, combinations of different phrases and expressions carry higher-level semantic
information. By stacking different-scale CNNs, the multi-level CNN can capture advanced
semantic features from low-level character sequences. This allows for the extraction of deeper
semantic information. The network structure is illustrated in Figure 3.

Figure 3. Structure of multi-level CNN.

The multi-scale CNN and multi-level CNN have the ability to capture different
aspects of sentences. The multi-scale CNN captures a broader range of semantic
information, while the multi-level CNN captures deeper semantic information. These two
approaches complement each other. Similar to other CNN models used in NLP tasks, this
study employs fixed-width convolutions to extract local features. The width of the
convolutional kernel matches the hidden dimension of the input sequence.
For a given sequence of sentences, let us assume 𝒄𝑖 ∈ ℝ𝑑 represents the contextual
representation of the 𝑖-th word in the sentence, where 𝑑 denotes the dimension of the
representation. The input sentence consists of 𝑛 words. To extract meaningful local
features, we employ convolutional filters of different scales, denoted as S = {s_1, s_2, ..., s_t}, with a fixed kernel width of l. For example, the feature m_{ij}^{S_i} is generated through the window c_{i:i+S_i−1}:

m_{ij}^{S_i} = GELU(W_j^{S_i}(c_{i:i+S_i−1}) + b_j^{S_i})  (6)

where W_j^{S_i} ∈ ℝ^{S_i×d} represents the weight of the j-th feature map at scale s_i, GELU denotes the activation function, and b_j^{S_i} represents the bias of the feature map.
The convolutional kernel, denoted as s_i, slides over the entire padded contextual representation matrix, denoted as C, to generate feature maps. This process is illustrated below:

m_j^{S_i} = [m_{1j}^{S_i}, m_{2j}^{S_i}, ..., m_{nj}^{S_i}]  (7)

The local n-gram features of the sentences have been extracted using the multi-scale CNN. By concatenating the outputs of the convolutional kernel s_i, we obtain the input for the key semantic alignment layer, which is represented as follows:

M^{S_i} = [m_1^{S_i}, m_2^{S_i}, ..., m_h^{S_i}]  (8)

where h represents the number of feature maps, and M^{S_i} ∈ ℝ^{n×h}.
For the multi-level CNN, it is necessary to repeat the above operation with a convolutional kernel s_j on top of M^{S_i} to realize the extraction of higher-level semantics.
After applying the multi-granularity feature extraction layer to the contextual features of the sentence, the resulting multi-granularity features of the sentences P and Q are as follows:

M_P = {M_P^{S_0}, M_P^{S_1}, M_P^{S_2}, ..., M_P^{S_t}}  (9)

M_Q = {M_Q^{S_0}, M_Q^{S_1}, M_Q^{S_2}, ..., M_Q^{S_t}}  (10)

where M_P^{S_i} ∈ ℝ^{m×h}, M_Q^{S_i} ∈ ℝ^{n×h}, t represents the number of convolutional kernels, M^{S_0} represents the output features of the multi-level CNN, and M^{S_i} represents the output features of the different-scale convolutional kernels.
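
As a sketch of this layer, the snippet below builds the multi-scale branch and the stacked multi-level branch with 1D convolutions; the kernel sizes (3, 4, 5) and (3, 4) and the feature-map width of 300 are taken from Section 4.2, while the use of 'same' padding (so that every output keeps the sequence length implied by Eq. (8)) is an assumption:

```python
import torch
import torch.nn as nn

class MultiGranularityCNN(nn.Module):
    """Sketch of Section 3.2: multi-scale CNN plus a stacked multi-level CNN."""

    def __init__(self, in_dim: int = 300, n_maps: int = 300,
                 scales=(3, 4, 5), level_kernels=(3, 4)):
        super().__init__()
        # One convolution per scale; 'same' padding (stride 1) keeps the sequence length.
        self.scale_convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_maps, k, padding="same") for k in scales)
        # Multi-level branch: convolutions stacked on top of each other.
        self.level_convs = nn.ModuleList(
            nn.Conv1d(in_dim if i == 0 else n_maps, n_maps, k, padding="same")
            for i, k in enumerate(level_kernels))
        self.act = nn.GELU()

    def forward(self, c: torch.Tensor):
        # c: (batch, seq_len, in_dim) contextual features of one sentence.
        x = c.transpose(1, 2)                     # Conv1d expects (batch, dim, len)
        level = x
        for conv in self.level_convs:             # M^{S_0}: output of the stacked CNN
            level = self.act(conv(level))
        feats = [level.transpose(1, 2)]
        for conv in self.scale_convs:             # M^{S_1} .. M^{S_t}: one per scale
            feats.append(self.act(conv(x)).transpose(1, 2))
        return feats                              # each tensor is (batch, seq_len, n_maps)
```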

3.3. Key Semantic Alignment Layer


The key semantic alignment layer plays a crucial role in enabling the model to focus
on important semantic features at different granularities within a sentence and align key
semantic information across sentence pairs. The key semantic features for sentence pairs
should meet the following conditions:
1. They should possess rich semantic characteristics.
2. They should be highly significant within the sentences and directly impact the overall
meaning of the sentences.
3. As key semantic information for sentence pairs, they should exhibit substantial
correlations with each other.
These three features are determined by the contextual features at different
granularities within the sentences, self-attention mechanisms within the sentences, and
cross-attention mechanisms between sentence pairs. Specifically, the contextual features
of the sentences are derived from the outputs of the multi-granularity feature extraction
layer, represented by M_P^{S_i} ∈ ℝ^{m×h} and M_Q^{S_i} ∈ ℝ^{n×h}, where h denotes the number of
feature maps. The overall structure of the key semantic alignment layer is illustrated in
Figure 4.

Figure 4. Structure of key semantic alignment layer.

3.3.1. Multi-Head Self-Attention


The transformer model utilizes the scaled dot-product attention (SDA) mechanism
for self-attention. As transformer-based pre-trained language models have gained
popularity, the use of the scaled dot-product function has become more prevalent. The
process begins by calculating the dot product between 𝑸and𝑲, which serves as a measure
of similarity between them. A higher value indicates a stronger correlation. This result is
then divided by the scaling factor √𝑑, where 𝑑 represents the dimension of the vectors.
The introduction of the scaling factor addresses the issue of gradient vanishing or
exploding that may occur when dealing with large vector dimensions during the dot
product computation. This ensures smoother and more manageable attention scores,
promoting training stability. Subsequently, the softmax function is applied to ensure that
the weighted coefficients sum up to one, achieving attention normalization and
distribution. Finally, the attention expression is obtained by multiplying it with the matrix V. The SDA function is defined as follows:

SDA(Q, K, V) = Softmax(QK^⊤ / √d) V  (11)
By introducing the multi-head mechanism, sequence features are mapped into different linear spaces, allowing for the extraction of relevant information from various representation spaces. This integration of information from different spaces enhances the expressive power of attention. The formula can be represented as follows:

MultiHead(Q, K, V) = Concat(Head_1, ..., Head_t) W  (12)

Head_i = SDA(Q W_i^Q, K W_i^K, V W_i^V)  (13)

where W, W_i^Q, W_i^K, and W_i^V are learnable parameters, and t denotes the number of attention heads.
Using multi-head self-attention, the key semantic features S_P^{S_i} for the sentence P itself can be computed as follows:

S_P^{S_i} = Concat(S_{P1}^{S_i}, ..., S_{Pt}^{S_i}) W  (14)

S_{Pi}^{S_i} = SDA(M_P^{S_i} W_i^Q, M_P^{S_i} W_i^K, M_P^{S_i} W_i^V) = Softmax(M_P^{S_i} W_i^Q (M_P^{S_i} W_i^K)^⊤ / √h) M_P^{S_i} W_i^V  (15)

In this equation, the dimension of S_P^{S_i} matches the dimension of the input feature M_P^{S_i}, i.e., S_P^{S_i} ∈ ℝ^{m×h}. Similarly, the key semantic features S_Q^{S_i} ∈ ℝ^{n×h} for the sentence Q can be calculated.
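
A rough sketch of this intra-sentence step is shown below using PyTorch's built-in multi-head attention with the 20 heads reported in Section 4.2; note that nn.MultiheadAttention includes its own output projection, so it approximates Eqs. (14) and (15) rather than transcribing them literally:

```python
import torch
import torch.nn as nn

n_maps, n_heads = 300, 20
self_attn = nn.MultiheadAttention(embed_dim=n_maps, num_heads=n_heads, batch_first=True)

def key_semantics(m: torch.Tensor) -> torch.Tensor:
    # m: (batch, seq_len, n_maps) -- one granularity M_P^{S_i} or M_Q^{S_i}.
    s, _ = self_attn(m, m, m)    # query = key = value = the sentence itself
    return s                     # S^{S_i}, same shape as the input
```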

3.3.2. Cross-Attention
Cross-attention calculates the attention between sentence pairs in both directions, allowing for the alignment of key semantic information between the two sentences. This attention is computed from the similarity matrix F^{S_i} ∈ ℝ^{m×n}, which considers the lengths of sentence P (m) and sentence Q (n). The computation of the similarity matrix F^{S_i} is as follows:

F^{S_i} = S_P^{S_i} (S_Q^{S_i})^⊤  (16)

Afterwards, the softmax function is applied to normalize the similarity matrix F^{S_i} in both directions. This generates attention weights F_{P→Q}^{S_i} ∈ ℝ^{m×n}, computed row-wise, for the attention from sentence P to sentence Q and F_{Q→P}^{S_i} ∈ ℝ^{n×m}, computed column-wise, for the attention from sentence Q to sentence P. The calculation is as follows:

F_{P→Q}^{S_i} = Softmax(F_{i,:}^{S_i})  (17)

F_{Q→P}^{S_i} = Softmax(F_{:,j}^{S_i})  (18)

To obtain the attention-aware representation R_P^{S_i} ∈ ℝ^{m×h} for sentence P with respect to sentence Q, the attention weights are multiplied with the representations. Similarly, the attention-aware representation R_Q^{S_i} ∈ ℝ^{n×h} for sentence Q with respect to sentence P can be obtained. The calculation is as follows:

R_P^{S_i} = F_{P→Q}^{S_i} S_Q^{S_i}  (19)

R_Q^{S_i} = F_{Q→P}^{S_i} S_P^{S_i}  (20)

where R_P^{S_i} represents the attention-aware representation of sentence P with respect to sentence Q after the i-th scale convolution and attention mechanism. Similarly, R_Q^{S_i} represents the attention-aware representation of sentence Q with respect to sentence P after the i-th scale convolution and attention mechanism.
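
These four equations map almost directly onto batched matrix products; the sketch below follows the description above, with padding masks (which a practical implementation would apply before each softmax) left out for brevity:

```python
import torch
import torch.nn.functional as F

def cross_attention(s_p: torch.Tensor, s_q: torch.Tensor):
    """Sketch of Eqs. (16)-(20): bidirectional alignment of key semantics.

    s_p: (batch, m, h) key semantic features of sentence P at one granularity.
    s_q: (batch, n, h) key semantic features of sentence Q at the same granularity.
    """
    f = torch.bmm(s_p, s_q.transpose(1, 2))       # similarity matrix F, (batch, m, n)
    f_p2q = F.softmax(f, dim=-1)                  # row-wise: attention from P to Q
    f_q2p = F.softmax(f.transpose(1, 2), dim=-1)  # column-wise: attention from Q to P
    r_p = torch.bmm(f_p2q, s_q)                   # R_P, (batch, m, h)
    r_q = torch.bmm(f_q2p, s_p)                   # R_Q, (batch, n, h)
    return r_p, r_q
```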

3.4. Multi-Way Interaction Fusion Layer


To capture the interaction information between sentences, the key semantic features S^{S_i} and the attention-aware representations R^{S_i} can be effectively integrated through three different comparison operations. These operations facilitate a deep interaction between the key semantic features of each sentence, enabling the integration of interaction information across different granularities, from low level to high level. The structure of the multi-way interaction fusion layer is illustrated in Figure 5.

Figure 5. Structure of multi-way interactive fusion layer.

For sentence P, the three comparison operations are defined as follows:

G_{P1}^{S_i} = FFN_1([S_P^{S_i}, R_P^{S_i}])  (21)

G_{P2}^{S_i} = FFN_2([S_P^{S_i} − R_P^{S_i}])  (22)

G_{P3}^{S_i} = FFN_3([S_P^{S_i} ⊙ R_P^{S_i}])  (23)

where G_{P1}^{S_i} ∈ ℝ^{m×2h} (achieved through concatenation along the column direction), G_{P2}^{S_i} ∈ ℝ^{m×h}, and G_{P3}^{S_i} ∈ ℝ^{m×h} represent the interaction features obtained after the three different operations:
1. G_{P1}^{S_i} represents the concatenation of the key semantic features S_P^{S_i} and the attention-aware representation R_P^{S_i}. In the interaction operation, concatenation is used to preserve all the information.
2. G_{P2}^{S_i} performs element-wise subtraction, approximating the calculation of Euclidean distance and measuring the relevance between the two representations using difference information.
3. G_{P3}^{S_i} performs element-wise multiplication, approximating the calculation of cosine similarity and measuring the similarity between the two representations.
These comparison operations enhance semantic information and capture semantic relationships. To optimize the model's efficiency, a feed-forward neural network is employed to integrate the interaction information, resulting in the feature interaction matrix G_P^{S_i} ∈ ℝ^{m×l}:

G_P^{S_i} = FFN_4([G_{P1}^{S_i}, G_{P2}^{S_i}, G_{P3}^{S_i}])  (24)

where FFN_1, FFN_2, FFN_3, and FFN_4 denote distinct feed-forward neural networks with non-shared parameters, conventionally represented as follows:

FFN(Z) = GELU(ZW + b)  (25)

and [G_{P1}^{S_i}, G_{P2}^{S_i}, G_{P3}^{S_i}] ∈ ℝ^{m×4h} is concatenated along the column dimension. The parameters W_4 ∈ ℝ^{4h×l} and b ∈ ℝ^l are adjustable, and l represents the output dimension of the network. This operation enables the model to capture deeper levels of interaction information and reduces the complexity of the representation. Likewise, the feature interaction matrix G_Q^{S_i} ∈ ℝ^{n×l} can be obtained for sentence Q.
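
A compact sketch of this layer is given below; the output widths follow the dimensions stated above (FFN_1 keeps 2h columns, FFN_2 and FFN_3 keep h, and FFN_4 maps the 4h-dimensional concatenation to l), each FFN is a single Linear layer followed by GELU as in Eq. (25), and l = 150 is taken from Section 4.2:

```python
import torch
import torch.nn as nn

class MultiWayFusion(nn.Module):
    """Sketch of Eqs. (21)-(25): concatenation, difference, and element-wise product."""

    def __init__(self, h: int = 300, out_dim: int = 150):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(2 * h, 2 * h), nn.GELU())  # [S, R]
        self.ffn2 = nn.Sequential(nn.Linear(h, h), nn.GELU())          # S - R
        self.ffn3 = nn.Sequential(nn.Linear(h, h), nn.GELU())          # S ⊙ R
        self.ffn4 = nn.Sequential(nn.Linear(4 * h, out_dim), nn.GELU())

    def forward(self, s: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # s, r: (batch, seq_len, h) key semantics and their aligned counterparts.
        g1 = self.ffn1(torch.cat([s, r], dim=-1))          # Eq. (21)
        g2 = self.ffn2(s - r)                              # Eq. (22)
        g3 = self.ffn3(s * r)                              # Eq. (23)
        return self.ffn4(torch.cat([g1, g2, g3], dim=-1))  # Eq. (24): (batch, seq_len, l)
```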

3.5. Similarity Prediction Layer

By applying the max-pooling operation, we map the feature interaction matrices G_P^{S_i} and G_Q^{S_i} to vector representations of the same dimension. This process allows for the extraction of valuable features at various granularities while reducing computational costs. The max-pooling operation for each granularity is defined as follows:

p^{S_i} = MaxPooling(G_P^{S_i})  (26)

q^{S_i} = MaxPooling(G_Q^{S_i})  (27)

where p^{S_i} ∈ ℝ^l and q^{S_i} ∈ ℝ^l represent the semantic feature vectors modeled by the model at scale S_i. By concatenating all semantic feature vectors from different scales with the CLS token vector, we obtain the final semantic interaction vector f:

f = Concat(h_cls, p^{S_0}, p^{S_1}, ..., p^{S_t}, q^{S_0}, q^{S_1}, ..., q^{S_t})  (28)

where f ∈ ℝ^{h+2(t+1)l}, h represents the dimension of the CLS token vector, there are t + 1 scales in total, and l denotes the dimension of the feature vector at each scale. Finally, a fully connected layer is used to predict the matching relationship between the two sentences based on this vector.
The chosen loss function is the cross-entropy loss, represented by the following formula:

ℒ = −(1/N) Σ (y log p + (1 − y) log(1 − p))  (29)

In the equation, 𝑦 represents the true label, which is set to one when there is
semantic matching between the sentence pairs and zero otherwise. 𝑝 denotes the
predicted result, and 𝑁 represents the number of training samples.
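
The prediction layer can be sketched as follows; here n_scales stands for t + 1, and a single-logit head with binary cross-entropy is used as a numerically stable equivalent of Eq. (29) (the original implementation may instead use a two-way softmax classifier):

```python
import torch
import torch.nn as nn

class SimilarityPrediction(nn.Module):
    """Sketch of Eqs. (26)-(29): max pooling, concatenation with CLS, matching head."""

    def __init__(self, cls_dim: int = 768, feat_dim: int = 150, n_scales: int = 4):
        super().__init__()
        # f ∈ R^{h + 2(t+1)l}: the CLS vector plus one pooled vector per scale per sentence.
        self.classifier = nn.Linear(cls_dim + 2 * n_scales * feat_dim, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, h_cls, g_p_list, g_q_list, labels=None):
        # g_*_list: one (batch, seq_len, feat_dim) interaction matrix per scale.
        pooled = [g.max(dim=1).values for g in g_p_list + g_q_list]   # Eqs. (26)-(27)
        f = torch.cat([h_cls] + pooled, dim=-1)                       # Eq. (28)
        logits = self.classifier(f).squeeze(-1)
        probs = torch.sigmoid(logits)
        if labels is not None:
            return probs, self.loss_fn(logits, labels.float())        # Eq. (29)
        return probs
```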

4. Experimental Results and Analysis


4.1. Dataset
The experimental evaluation of the model in this study was conducted on the following
two publicly available datasets, and the dataset statistics are presented in Table 1.

Table 1. Dataset statistics.

Datasets Total Train Validation Test


LCQMC 260,068 238,766 8802 12,500
BQ 120,000 100,000 10,000 10,000

Large-scale Chinese Question Matching Corpus (LCQMC) [28]: LCQMC is a dataset
designed for sentence pair matching in an open-domain context. It focuses on semantic
intent matching, making it more generalizable than sentence rewriting datasets. Each
sentence pair in LCQMC is labeled with zero or one to indicate whether the sentences
match. The dataset consists of 149,226 positive sentence pairs and 110,842 negative
sentence pairs.
Bank Question Dataset (BQ) [29]: The BQ dataset contains customer service logs from
online banks spanning one year. It serves as a dataset for identifying sentence semantic
equivalence, specifically for sentence pair semantic matching. Similar to LCQMC, each
sentence pair in BQ is labeled with zero or one to indicate whether the sentences match.
The positive and negative sentence pairs in BQ are balanced with a ratio of 1:1.

4.2. Implementation
The model architecture and system experiments in this study were implemented
using the PaddleNLP framework. The framework’s pre-trained models and their
initialized weights were utilized. The experiments were conducted using two NVIDIA
GeForce RTX 3090 GPUs (Taiwan Semiconductor Manufacturing Company Limited,
Hsinchu, Taiwan), each with a memory capacity of 24 GB.

In terms of the model architecture, for the input encoding stage, this study utilized
768-dimensional BERT-base, 312-dimensional ERNIE 3.0-nano, and 768-dimensional
ERNIE 3.0-base for text embedding, respectively. For the feature extraction stage, the
BiGRU unidirectional hidden dimension was set to 150. The multi-scale CNN adopts
convolutional filters of size (3, 4, 5), while the multi-level CNN uses convolutional filters
of size (3, 4). The feature map dimension for both was 300. The multi-head self-attention
employs 20 heads. In the interaction fusion stage, the hidden dimension of the FFN was
set to 150.
Regarding model training, the maximum sentence length was set to 128, and the
batch size was 32. The optimization was performed using AdamW with β1 = 0.9 and β2
= 0.999. The learning rate was set to 5 × 10^-5, and the dropout parameter was 0.1. The model
was trained for 3 epochs. The evaluation metric used was accuracy (Acc), and the best-
performing result was chosen as the experimental outcome.
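
For reference, the reported hyperparameters translate roughly into the following PyTorch training setup (the original experiments were run with PaddleNLP; no learning-rate schedule or gradient clipping is described, so none is added, and the model's forward signature is an assumption):

```python
import torch
from torch.optim import AdamW

MAX_LEN, BATCH_SIZE, EPOCHS = 128, 32, 3   # applied at tokenization / data loading
LR, DROPOUT = 5e-5, 0.1                    # dropout is applied inside the model

def train(model, train_loader, device="cuda"):
    # AdamW with beta1 = 0.9 and beta2 = 0.999, as stated above.
    optimizer = AdamW(model.parameters(), lr=LR, betas=(0.9, 0.999))
    model.to(device).train()
    for _ in range(EPOCHS):
        for input_ids, token_type_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            # Assumed forward signature: returns (probabilities, loss) given labels.
            _, loss = model(input_ids.to(device), token_type_ids.to(device),
                            attention_mask.to(device), labels=labels.to(device))
            loss.backward()
            optimizer.step()
```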
In terms of model testing, the evaluation metric used was accuracy (Acc), and the
optimal result was chosen as the experimental outcome. The following definitions were
used: True Positives (TPs) represent the cases where the original sentences were matched
and correctly classified, while True Negatives (TNs) indicate the cases where the original
sentences were unmatched and correctly classified. False Positives (FPs) refer to the cases
where the original sentences were unmatched but mistakenly classified, and False
Negatives (FNs) denote the cases where the original sentences were matched but
erroneously classified. The accuracy was calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (30)
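
Equation (30) is the standard accuracy over the confusion-matrix counts; a one-line helper makes this explicit (the example counts are arbitrary):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Direct implementation of Eq. (30).
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=8500, tn=900, fp=100, fn=100))   # 0.979...
```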

4.3. Baseline Models


In this study, representative methods from three categories, namely neural network-
based methods, attention-based methods, and pre-trained language model-based
methods, were selected as comparative models. The chosen models are as follows:
• Neural Network-based Methods:
Text-CNN [30]: a convolutional neural network model commonly used for sentence
classification and matching tasks.
BiLSTM [31]: a variant of recurrent neural networks that processes sentences using
LSTM units, capturing long-term and short-term dependencies from both forward and
backward contexts.
BiMPM [32]: A sentence-matching model based on BiLSTM. It utilizes BiLSTM to
represent contextual information and performs matching from two directions and four
different perspectives. The matching results are then aggregated using BiLSTM.
• Attention-based Methods:
DRCN [33]: a densely connected co-attention recurrent neural network where each
layer utilizes attention features and hidden features from all preceding recurrent layers.
MIPR [24]: A multi-granularity interaction model based on pinyin and radicals for
Chinese semantic matching. The interaction features between different granularities and
sentences are aggregated to generate the final matching representation and predict the
matching degree.
MKPM [10]: A sentence-matching method based on multi-keyword pair matching. It
employs attention mechanisms for sentence pairs and selects the most important keyword
pairs from both sentences to represent their semantic relationships. This approach helps
to avoid interference from redundancy and noise.
• Pre-trained Language Model-based Methods:
BERT [2]: A renowned pre-trained language model. This study utilizes BERT-base as the
encoder to generate representations for sentence pairs. The CLS vector output is fed into
a linear layer to predict the matching relationship between the pairs.
SBERT [3]: A semantic textual similarity method based on BERT. It employs a Siamese
and triplet network structure to obtain semantically meaningful sentence representations.
MAGE [6]: A method for multi-scale context-aware interaction in sentence pair
matching, based on multi-granularity embeddings. It aligns semantic relationships using
context-aware interaction modules and combines local semantic features with attention
representations using a multi-perspective fusion approach. This study utilizes BERT-base
as the encoder and predicts the matching relationship between sentence pairs using a
linear layer.
MacBERT [34]: An enhanced Chinese pre-trained language model based on BERT. It
replaces the MLM (Masked Language Model) task in BERT with the Mac (MLM as
correction) task, which helps alleviate the discrepancy between pre-training and fine-
tuning stages.
ERNIE 3.0 [35]: A large-scale knowledge-enhanced pre-trained language model that
combines autoregressive and autoencoding networks. This study conducts experiments
using both ERNIE 3.0-base and ERNIE 3.0-nano as encoders. Similar to BERT, the CLS
vector output is fed into a linear layer to predict the matching relationship between
sentence pairs.

4.4. Performance Comparison


Table 2 presents the accuracy results of various models on the test sets of LCQMC
and BQ datasets.

Table 2. Accuracy of different models on LCQMC and BQ.

Methods LCQMC BQ
Text-CNN [28,29] 72.80 68.52
BiLSTM [28,29] 76.10 73.51
BiMPM [28,29] 83.40 81.85
DRCN [33] 85.90 83.30
MIPR [24] 85.29 84.45
MKPM [10] 86.71 84.11
BERT-base 86.75 84.67
SBERT 84.44 83.64
MAGE 87.58 84.84
MacBERT-base [34] 87.00 85.20
MacBERT-large [34] 87.60 85.60
Ernie 3.0-nano 86.12 83.02
Ernie 3.0-base 87.64 85.83
MCKI (BERT-base) 87.90 85.28
MCKI (Ernie 3.0-nano) 87.65 83.69
MCKI (Ernie 3.0-base) 89.08 86.08

Overall, pre-trained language model-based methods outperformed the other two
approaches. The performance improvement of these models was more pronounced on
open-domain datasets compared to closed-domain datasets. This can be attributed to the
emphasis on capturing essential semantic information, which allows for the elimination
of irrelevant semantic details. As a result, the models demonstrated significant
enhancements when handling sentence pairs with more noise. However, it is worth noting
that the BQ dataset consisted of samples with a common theme, resulting in higher
similarity between positive and negative instances. This increased similarity posed a
challenge in distinguishing between them.
Among the neural network-based methods, BiLSTM demonstrated a substantial
accuracy improvement compared to Text-CNN. This is attributed to the fact that text data
can be viewed as sequential data, and the recurrent structure of BiLSTM is better suited
for modeling sequence features. Unlike Text-CNN, which extracts diverse text features
through convolutions of various sizes, BiLSTM takes into account the dependencies
between contextual information present in the text. To further enhance accuracy, BiMPM,
which builds upon BiLSTM, introduces bidirectional different-perspective matching to
capture contextual features and aggregates matching information using BiLSTM.
Methods incorporating attention mechanisms, which enable the model to concentrate
on crucial interaction information between sentences, also exhibit enhanced performance.
DRCN introduced an attention mechanism in recurrent neural networks. MKPM
employed bidirectional attention to select crucial words for keyword matching in sentence
pairs, leading to significant improvements on the LCQMC dataset. This highlighted the
importance of key semantics in sentence matching within an open-domain context. MIPR
extracted semantic interaction features within and between sentences at four different
granularities, resulting in notable improvements on the BQ dataset. This showcases the
influence of multi-granularity semantics in sentence matching within a closed-domain
setting.
Pre-trained language models demonstrate exceptional performance in language
modeling by acquiring contextual representations from extensive data. BERT achieved an
accuracy of 86.75% on the LCQMC dataset and 84.67% on the BQ dataset. In contrast,
SBERT, which employs a dual-encoder architecture to model sentence representations but
lacks explicit interaction information between sentences, exhibited lower effectiveness on
both datasets when compared to the cross-encoder architecture of BERT. While attention
mechanisms enable models to concentrate on important words or phrases in text, they can
pose challenges in distinguishing between positive and negative examples in texts with
similar themes. MAGE effectively handles this challenge through the use of a context-
aware interaction module and a multi-perspective fusion module. Additionally, pre-
trained language models such as MacBERT and Ernie 3.0, which have undergone
extensive training on large-scale Chinese data, enhance performance by improving the
training process.
The proposed method in this study demonstrated significant performance
improvements compared to both neural network-based models and attention-based
models, highlighting the effectiveness of building models upon pre-trained language
models. Specifically, when compared to pre-trained language models, the method in this
study achieved improvements of 1.15% and 0.61% on top of BERT, 1.53% and 0.67% on
top of Ernie 3.0-nano, and 1.44% and 0.25% on top of Ernie 3.0-base, respectively,
surpassing all baseline models. However, the improvement observed for the model based
on Ernie 3.0-base in the BQ dataset is relatively smaller. This can be attributed to two
factors. Firstly, as mentioned earlier, the BQ dataset poses greater difficulty in terms of
matching compared to open-domain datasets. Secondly, Ernie 3.0, being a more advanced
pre-trained language model, has undergone extensive knowledge-enhanced pre-training,
equipping it with certain capabilities for modeling semantic matching relationships in
sentence pairs. Consequently, the potential for improvement relative to the BERT-based
models is relatively smaller.
Additionally, this study achieved performance that is comparable to BERT-base and
even MacBERT-large by leveraging the capabilities of the lightweight Ernie 3.0-nano
model on the LCQMC dataset. This observation suggests that, instead of relying on larger
pre-trained language models, the approach presented in this study can attain similar
performance levels by utilizing a more lightweight pre-trained language model. This
strategy proves effective, particularly when resources are limited.
In comparison to the MAGE model, the proposed method achieved improvements of
0.32% and 0.44% on the two datasets. The approach presented in this study utilizes a
fusion of multi-scale and multi-level CNNs to extract comprehensive contextual features,
enabling the model to capture a wider and deeper range of information. By combining
multi-head self-attention and cross-attention mechanisms, the model not only emphasizes
key semantic information but also aligns significant details across different levels of
semantics. As a result, significant improvements are observed compared to the MAGE
model.

4.5. Ablation Study


To further evaluate the effectiveness of each module, this study included an ablation
study on the model based on Ernie 3.0-nano using the BQ dataset. This study involved
removing specific modules to assess their impact on performance, as displayed in Table 3.
In the table, the symbol “√” denotes the inclusion of the corresponding module, while “-”
denotes its removal.

Table 3. Ablation study on BQ.

Index  Contextual Representation  Multi-Granularity Feature Extraction  Multi-Head Self-Attention  Cross-Attention  Multi-Way Interaction Fusion  Acc (%)
1  √  √  √  √  √  83.69
2  -  √  √  √  √  83.25
3  √  -  √  √  √  83.32
4  √  √  -  √  √  83.35
5  √  √  √  -  √  83.38
6  √  √  √  -  -  83.21
7  √  √  √  √  -  83.26

To assess the effectiveness of the contextual representation module, the study
conducted an ablation experiment by removing this module and directly extracting multi-
granularity features from the original embeddings. The experimental results showed that
the model’s accuracy decreased by 0.44%. Furthermore, this study investigated the impact
of using different contextual representation modules, and the comparative results are
presented in Table 4. Compared to unidirectional LSTM and GRU structures, the
bidirectional structure demonstrated an improvement of approximately 0.7% in accuracy.
This improvement can be attributed to two factors. Firstly, the hidden dimension of the
unidirectional structure is only half that of the bidirectional structure, leading to a
decrease in overall model performance. Secondly, the bidirectional structure captures
both forward and backward semantic dependencies in the text, effectively capturing
contextual semantic information. Additionally, the GRU structure outperformed the
LSTM structure, indicating that GRU is more effective in modeling sequential information
in the text. Moreover, better results were obtained by combining the original embeddings,
which allows for the comprehensive utilization of both the original information and
contextual information from the text.

Table 4. Accuracy of different contextual representation modules.

Modules Acc (%)


LSTM 82.66
GRU 82.83
BiLSTM 83.44
BiGRU 83.57
BiGRU + embedding 83.69

Additionally, to verify the effectiveness of the multi-granularity feature extraction
module, its removal resulted in a decrease of 0.37% in model accuracy. This study also
explored the replacement of this module with different convolutional methods, as depicted in
Table 5. It was observed that a single-scale CNN can only capture features at a specific scale,
limiting its ability to incorporate a broader range of semantic information. Conversely, a multi-
scale CNN compensates for this limitation by capturing semantic information at multiple
scales, while a multi-level CNN focuses on capturing higher-level semantic information. These
two approaches possess distinct capabilities in capturing different semantic features. The
multi-scale CNN extracts a wider range of semantic information, while the multi-level CNN
delves into deeper semantic layers. By combining both approaches, the model achieves
improved performance as they complement each other in capturing diverse semantic features.

Table 5. Accuracy of different multi-granularity feature extraction modules.

Modules Filters Acc (%)


Single-scale CNN (3) 83.35
Multi-scale CNN (3, 4, 5) 83.41
Multi-level CNN (3, 4) 83.37
Multi-scale CNN + Multi-level CNN  (3, 4, 5) + (3, 4)  83.69

This study also examined the impact of attention mechanisms on the model’s
performance. By removing self-attention and directly fusing the multi-granularity contextual
features with the aligned features from cross-attention, the model’s accuracy declined by
0.34%. This decrease can be attributed to the absence of important semantic information at
various granularities. Similarly, removing cross-attention and fusing the multi-granularity
contextual features with the output features from self-attention led to a decrease in model
accuracy by 0.31%. This reduction occurred because the model solely focused on its own
semantic information and lacked interaction between sentences. Furthermore, simultaneous
removal of cross-attention and the multi-way interaction fusion module, relying solely on key
semantic information at different granularities, resulted in a decrease in model accuracy by
0.48%. This finding indicates that aligning crucial semantic information between sentences
and considering semantic interactions at various granularities significantly contribute to
enhancing semantic matching capabilities.
Additionally, solely removing the multi-way interaction fusion module resulted in a
decrease of 0.43% in accuracy. Furthermore, to further validate the effectiveness of different
comparison methods, as presented in Table 6, using only the concatenation of sentence
semantic features achieved an accuracy of 83.35%. Using only semantic difference information
led to an accuracy of 82.91%, and using only semantic similarity information resulted in an
accuracy of 83.05%. These results indicate that combining all three types of comparison
methods is an effective approach for capturing matching interactions. Moreover, the impact
of feature fusion in different directions on performance was explored. In the table, “,”
represents concatenation along the column direction of the tensor, expanding the hidden state
dimension, while “;” represents concatenation along the row direction of the tensor, expanding
the sentence length. The results showed that concatenation in the row direction did not
effectively merge the compared information from multiple ways, which failed to reflect the
interaction between sentences and resulted in decreased performance.

Table 6. Accuracy of different feature fusion methods.

Methods Acc (%)


(M, R) 83.35
(M-R) 82.91
(M⊙R) 83.05
(M-R, M⊙R) 83.24
(M, R, M-R) 83.30
(M, R, M⊙R) 83.51
(M, R, M-R, M⊙R) 83.69
(M; R; M-R; M⊙R) 83.19

Lastly, a comparative experiment was conducted to validate the impact of the CLS
token vector on model performance, as presented in Table 7. The experiment involved
using different prediction modules. The results demonstrated that the best performance
is achieved when the CLS token vector is concatenated with the output vector from the
max-pooling operation. This finding suggests that in addition to fine-grained features,
incorporating the CLS token as a coarse-grained representation of semantic information
for the entire sentence pair effectively enhances semantic matching performance.
Furthermore, the additional concatenation of the output vector from the average-pooling
operation leads to a decrease in model performance. This decrease can be attributed to the
introduction of redundant information, which affects the model’s ability to distinguish
important features.

Table 7. Accuracy of different prediction modules.

Modules Acc (%)


Max pooling 83.23
Avg pooling 83.18
(CLS, Max pooling) 83.69
(CLS, Avg pooling) 83.51
(CLS, Max pooling, Avg pooling) 83.09

4.6. Parameter Sensitivity Analysis


In this section, a further evaluation was carried out to assess the impact of different
combination of multi-scale convolutional filters (3, 4, 5) and multi-level convolutional
filters (3, 4). Increasing or decreasing the number of convolutional filters resulted in
varying degrees of information redundancy or loss, which led to a decrease in
performance.

Table 8. Effect of different scale CNN combinations on model performance.

Multi-Scale CNN Multi-Level CNN Acc (%)


(3, 4, 5)  (3, 4)  83.69
(3, 4, 5)  (3, 4, 5)  83.47
(3, 4)  (3, 4)  83.54
(1, 2, 3)  (3, 4)  83.62
(2, 3, 4)  (3, 4)  83.65
(3, 4, 5)  (1, 2)  83.50
(3, 4, 5)  (2, 2)  83.49
(3, 4, 5)  (2, 3)  83.57
(3, 4, 5)  (3, 2)  83.44
(3, 4, 5)  (3, 3)  83.64
(3, 4, 5)  (4, 2)  83.40
(3, 4, 5)  (4, 3)  83.44
(3, 4, 5)  (4, 4)  83.46
(3, 4, 5)  (4, 5)  83.53

In addition, the impact of the number of self-attention heads on model performance
was investigated, as illustrated in Figure 6. It was observed that as the number of heads
increased, the model performance improved, reaching its peak when the number of heads
was 20. This improvement can be attributed to the average sentence length in the BQ
dataset being 11.9 tokens, and with a model hidden dimension of 300, having 20 attention
heads ensures that each attention unit has a dimension of 15, which is sufficient to capture
most of the sentence patterns in the dataset. However, as the number of heads continues
to increase beyond this point, the dimension per attention unit decreases, leading to a low-
rank bottleneck problem [36]. This ultimately results in a decline in model performance.


Figure 6. Effect of different number of self-attention heads on model performance.

5. Conclusions
This paper presents a sentence-matching model called the multi-granularity
contextual key semantic interaction model (MCKI). The model incorporates pre-trained
language model embeddings and BiGRU to capture contextual information effectively. It
employs a semantic feature extraction mechanism that combines a multi-scale CNN and
multi-level CNN to extract semantic contextual information at different granularities and
levels. The introduction of a key semantic alignment mechanism by integrating multi-
head self-attention and cross-attention mechanisms emphasizes key semantic information
between sentence pairs. By combining three interaction methods, the model
comprehensively considers both the similarity and difference between sentences,
resulting in enhanced matching performance.
The effectiveness of the proposed model and its components was evaluated on public
datasets. The findings demonstrated that in open-domain datasets, matching key semantic
elements effectively eliminates irrelevant information, leading to significant
improvements, especially when sentence pairs contain high levels of noise. Moreover, the
proposed methodology in this study achieved comparable performance to large-scale pre-
trained models by utilizing lightweight pre-training language models, which proves
effective when resources are limited. However, in closed-domain datasets where sentence
pairs share the same theme, considering semantic nuances at different levels and
granularities aids the model in distinguishing overlapping terminologies within the
subject. Nevertheless, the effectiveness of model enhancements remains limited in such
scenarios. Additionally, marginal effects are observed when applying improvements to
more advanced pre-trained models. The limitations and shortcomings of the current
research should be acknowledged. This study primarily focused on combining existing
models without thoroughly exploring improvements to these models. Future research
will aim to enhance known models in order to achieve better performance. Furthermore,
there is a consideration to integrate graph neural networks, which can leverage both the
key semantic information and structural aspects of sentences to optimize matching
performance.

Author Contributions: Conceptualization, J.L. and Y.L.; methodology, J.L.; software, J.L.; validation,
J.L.; formal analysis, J.L.; investigation, J.L.; resources, Y.L.; data curation, J.L.; writing—original
draft preparation, J.L.; writing—review and editing, Y.L.; supervision, Y.L.; funding acquisition, Y.L.
All authors have read and agreed to the published version of the manuscript.

Funding: The research was funded by the Key projects of science and technology plan of Yunnan
Provincial Department of Science and Technology (grant number 202201AS070029).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Restrictions apply to the availability of these data. Data were obtained from
Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), and are available at
http://icrc.hitsz.edu.cn/info/1037/1146.htm and http://icrc.hitsz.edu.cn/info/1037/1162.htm with the
permission of Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen),
accessed on 1 October 2023.
Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181.
2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp.
4171–4186.
3. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
4. Wang, S.; Jiang, J. A compare-aggregate model for matching text sequences. In Proceedings of the ICLR 2017: International
Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–15.
5. Tang, X.; Luo, Y.; Xiong, D.; Yang, J.; Li, R.; Peng, D. Short text matching model with multiway semantic interaction based on
multi-granularity semantic embedding. Appl. Intell. 2022, 52, 15632–15642.
6. Wang, M.; He, X.; Liu, Y.; Qing, L.; Zhang, Z.; Chen, H. MAGE: Multi-scale context-aware interaction based on multi-granularity
embedding for Chinese medical question answer matching. Comput. Methods Programs Biomed. 2023, 228, 107249.
7. Sun, Z.; Li, X.; Sun, X.; Meng, Y.; Ao, X.; He, Q.; Wu, F.; Li, J. ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 2065–
2075.
8. Zhang, S.; Zhang, X.; Wang, H.; Cheng, J.; Li, P.; Ding, Z. Chinese medical question answer matching using end-to-end character-
level multi-scale CNNs. Appl. Sci. 2017, 7, 767.
9. Lin, D.; Tang, J.; Li, X.; Pang, K.; Li, S.; Wang, T. BERT-SMAP: Paying attention to Essential Terms in passage ranking beyond
BERT. Inf. Process. Manag. 2022, 59, 102788.
10. Lu, X.; Deng, Y.; Sun, T.; Gao, Y.; Feng, J.; Sun, X.; Sutcliffe, R. MKPM: Multi keyword-pair matching for natural language
sentences. Appl. Intell. 2022, 52, 1878–1892.
11. Zhao, S.; Huang, Y.; Su, C.; Li, Y.; Wang, F. Interactive attention networks for semantic text matching. In Proceedings of the 2020
IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 861–870.
12. Fei, H.; Ren, Y.; Ji, D. Improving text understanding via deep syntax-semantics communication. In Proceedings of the Findings
of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 84–93.
13. Zhang, X.; Li, Y.; Lu, W.; Jian, P.; Zhang, G. Intra-correlation encoding for Chinese sentence intention matching. In Proceedings
of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 5193–
5204.
14. Yu, C.; Xue, H.; Jiang, Y.; An, L.; Li, G. A simple and efficient text matching model based on deep interaction. Inf. Process. Manag.
2021, 58, 102738.
15. Wang, X.; Yang, H. MGMSN: Multi-Granularity Matching Model Based on Siamese Neural Network. Front. Bioeng. Biotechnol.
2022, 10, 839586.
16. Chang, G.; Wang, W.; Hu, S. MatchACNN: A multi-granularity deep matching model. Neural Process. Lett. 2023, 55, 4419–4438.
17. Zhang, X.; Lu, W.; Li, F.; Peng, X.; Zhang, R. Deep feature fusion model for sentence semantic matching. Comput. Mater. Contin.
2019, 61, 601–616.
18. Lai, Y.; Feng, Y.; Yu, X.; Wang, Z.; Xu, K.; Zhao, D. Lattice CNNs for matching based Chinese question answering. In Proceedings
of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6634–6641.
19. Chen, L.; Zhao, Y.; Lyu, B.; Jin, L.; Chen, Z.; Zhu, S.; Yu, K. Neural graph matching networks for Chinese short text matching.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6152–
6158.
20. Zhang, X.; Lu, W.; Zhang, G.; Li, F.; Wang, S. Chinese sentence semantic matching based on multi-granularity fusion model. In
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 11–14 May 2020; pp. 246–
257.
21. Zhao, P.; Lu, W.; Li, Y.; Yu, J.; Jian, P.; Zhang, X. Chinese semantic matching with multi-granularity alignment and feature
fusion. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July
2021; pp. 1–8.
22. Lyu, B.; Chen, L.; Zhu, S.; Yu, K. LET: Linguistic knowledge enhanced graph transformer for Chinese short text matching. In
Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 13498–13506.
23. Tao, H.; Tong, S.; Zhang, K.; Xu, T.; Liu, Q.; Chen, E.; Hou, M. Ideography leads us to the field of cognition: A radical-guided
associative model for Chinese text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9
February 2021; pp. 13898–13906.
24. Zhao, P.; Lu, W.; Wang, S.; Peng, X.; Jian, P.; Wu, H.; Zhang, W. Multi-granularity interaction model based on pinyins and
radicals for Chinese semantic matching. World Wide Web 2022, 25, 1703–1723.
25. Wu, Z.; Liang, J.; Zhang, Z.; Lei, J. Exploration of text matching methods in Chinese disease Q&A systems: A method using
ensemble based on BERT and boosted tree models. J. Biomed. Inform. 2021, 115, 103683.
26. Zhang, Z.; Zhang, Y.; Li, X.; Qian, Y.; Zhang, T. BMCSA: Multi-feature spatial convolution semantic matching model based on
BERT. J. Intell. Fuzzy Syst. 2022, 43, 4083–4093.
27. Zou, Y.; Liu, H.; Gui, T.; Wang, J.; Zhang, Q.; Tang, M.; Li, H.; Wang, D. Divide and Conquer: Text Semantic Matching with
Disentangled Keywords and Intents. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland,
22–27 May 2022; pp. 3622–3632.
28. Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; Tang, B. LCQMC: A large-scale Chinese question matching corpus. In
Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp.
1952–1962.
29. Chen, J.; Chen, Q.; Liu, X.; Yang, H.; Lu, D.; Tang, B. The BQ Corpus: A large-scale domain-specific Chinese corpus for sentence
semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4946–4951.
30. He, T.; Huang, W.; Qiao, Y.; Yao, J. Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image
Process. 2016, 25, 2529–2541.
31. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings
of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
32. Wang, Z.; Hamza, W.; Florian, R. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the
International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
33. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645.
34. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio Speech
Lang. Process. 2021, 29, 3504–3514.
35. Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y. ERNIE 3.0: Large-scale knowledge
enhanced pre-training for language understanding and generation. arXiv 2021, arXiv:2107.02137.
36. Bhojanapalli, S.; Yun, C.; Rawat, A.S.; Reddi, S.; Kumar, S. Low-rank bottleneck in multi-head attention models. In Proceedings
of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 864–873.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury
to people or property resulting from any ideas, methods, instructions or products referred to in the content.
