A Sentence-Matching Model Based on Multi-Granularity Contextual Key Semantic Interaction
1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology,
Kunming 650500, China; [email protected]
2 Computer Technology Application Key Lab of the Yunnan Province, Kunming 650500, China
* Correspondence: [email protected]
Abstract: In the task of matching Chinese sentences, the key semantics within sentences and the
deep interaction between them significantly affect the matching performance. However, previous
studies mainly relied on shallow interactions based on a single semantic granularity, which left them
vulnerable to interference from overlapping terms. It is particularly challenging to distinguish
between positive and negative examples within datasets from the same thematic domain. This paper
proposes a sentence-matching model that incorporates multi-granularity contextual key semantic
interaction. The model combines multi-scale convolution and multi-level convolution to extract
different levels of contextual semantic information at word, phrase, and sentence granularities. It
employs multi-head self-attention and cross-attention mechanisms to align the key semantics
between sentences. Furthermore, the model integrates the original, similarity, and dissimilarity
information of sentences to establish deep semantic interaction. Experimental results on both open-
and closed-domain datasets demonstrate that the proposed model outperforms existing baseline
models in terms of matching performance. Additionally, the model achieves matching effectiveness
comparable to large-scale pre-trained language models while utilizing a lightweight encoder.
Inspired by the matching aggregation framework [4], Tang et al. [5] enhanced the
effectiveness of short text matching by considering multiple aspects of interaction that
capture the similarities and differences between two sentences comprehensively. Wang et
al. [6] utilized ChineseBERT [7] to incorporate the visual and phonetic information of
Chinese characters into the pre-training model, which served as encoders for questions
and answers. This approach enabled multi-granularity feature encoding. They introduced
a multi-scale convolutional neural network [8] to extract local semantic information and
developed a context-aware interactive module to model sentence interaction.
Furthermore, they proposed a multi-perspective fusion mechanism to integrate local and
interactive information, with the goal of capturing rich features from questions and
answers. However, these methods have limitations in their focus on low-level local
semantic interactions and their limited discriminative ability when dealing with candidate
options that contain a high degree of overlapping terms.
Lin et al. [9] introduced BERT-SMAP (BERT with self-matching attention pooling),
which incorporates a self-matching attention-pooling mechanism into sentence pairs. This
mechanism allows the model to prioritize the key terms within the sentence pairs rather
than the overlapping terms. Lu et al. [10] proposed MKPM (multi-keyword pair
matching), a method for addressing the sentence-matching task. It utilizes attention
mechanisms on sentence pairs to identify the most important keyword pairs from the two
sentences, representing their semantic relationship and avoiding interference from
redundancy and noise. However, these methods have limitations in terms of their
performance on closed-domain datasets. In datasets with a high semantic similarity
between texts in the same thematic domain, solely focusing on key terms may make it
challenging for the model to distinguish between positive and negative examples.
Humans first consider the presence of matching keywords and then synthesize the
overall meaning of sentences to assess text alignment. To address these issues, this paper
proposes a sentence-matching model that incorporates multi-granularity contextual key
semantic interaction. The model combines pre-trained language model embeddings with
BiGRU contextual encoding to thoroughly consider sentence contextual information. It
utilizes multi-scale CNN and multi-level CNN to extract different levels of semantic
information at word and phrase granularity. The model introduces multi-head self-
attention and cross-attention to emphasize key semantics within sentences and align
crucial information between sentence pairs. By integrating three interaction methods, it
incorporates interactive information on similarity and dissimilarity, enabling focus on key
semantics and local interaction information at different granularity levels. The sentence’s
overall representation is obtained through max pooling, with the pre-trained language
model’s CLS vector serving as the coarse-grained semantics of the entire sentence pair. By
combining these components, the model comprehensively captures the semantic
information of the entire sentence pair.
In Chinese, a language lacking natural word delimiters, semantic information is
conveyed through lexical units. Additionally, the polysemy of Chinese words can
introduce ambiguity in various contexts. Phrases formed by word combinations carry
specific underlying semantics, and Chinese sentences with different word orders can
exhibit entirely different meanings. Approaches that rely on single-granularity text
semantic matching can only analyze semantics at a single level and may not accurately
capture the semantic relationship between texts. Furthermore, they often overlook the
fusion of features from different granularities. In contrast, MCKI explores key semantic
information at the levels of words, phrases, and sentences, while fully considering the
interaction and fusion of semantics across different granularities. This enables MCKI to
achieve superior performance.
The main contributions of this paper can be summarized as follows:
1. Introduction of the multi-granularity contextual key semantic interaction model
(MCKI) for sentence matching, which combines multi-granularity feature extraction
and key semantic interaction fusion. This enables the model to capture
comprehensive semantic information in the text.
2. Development of a semantic feature extraction mechanism that combines multi-scale
CNN and multi-level CNN to extract contextual semantic information at different
granularities and levels.
3. Construction of a key semantic alignment mechanism by integrating multi-head self-
attention and cross-attention, incorporating interaction information at different levels
from low level to high level.
4. The proposed model demonstrates superior matching performance compared to
existing baseline models on the LCQMC and BQ datasets. Additionally, it achieves
comparable results to large-scale pre-trained language models while using a
lightweight encoder.
2. Related Work
With the advancement of deep learning techniques, attention mechanisms have been
extensively explored by researchers to capture semantic interactions and determine the
degree of matching. For example, Zhao et al. [11] presented an interactive attention
network that employed a matching matrix to model the semantic exchange between
source and target texts, improving the matching capability for low-frequency keywords.
Fei et al. [12] introduced a deep neural communication model that utilized BiLSTM and
tree encoders to capture semantic and syntactic features. Through attention mechanisms,
they designed deep communication modules to capture local and global interactions
between semantic and syntactic features, thereby enhancing semantic comprehension.
However, these studies primarily focus on modeling English sentences and address
single-granularity text semantic matching. They analyze semantics at a single level and
do not take into account the characteristics of the Chinese language. As a result, they
struggle to accurately measure the semantic relationships between Chinese texts.
Several studies have investigated multi-granularity text semantic matching to
comprehensively assess the semantic similarity and matching degree between texts. These
studies have explored deep fusion techniques at the character and word levels. For
example, Zhang et al. [13] utilized a soft alignment attention mechanism to enhance local
information in sentences at different levels, capturing feature information and correlations
from multiple perspectives. Yu et al. [14] employed interactive information to model
diverse text pairs across various tasks and languages. Wang et al. [15] concurrently
considered deep and shallow semantic similarity, as well as granularity at the lexical and
character levels, enabling a deeper exploration of similarity information. Chang et al. [16]
proposed semantic similarity analysis through the fusion of words and phrases.
Considering the characteristics of Chinese text, several research studies have
employed character-level and word-level approaches to develop semantic matching
models. For instance, Zhang et al. [17] proposed a deep feature fusion model that
combines various deep learning structures to capture semantic information. They
generate a similarity tensor using three matching strategies. Lai et al. [18] designed a
lattice-based approach that constructs word grids by selecting character and word paths
and utilized CNN to generate sentence representations. Chen et al. [19] introduced a
neural graph matching network for Chinese text matching, constructing a graph that
includes all possible characters and words in a sentence. They employed graph neural
networks to generate graph representations for predicting the matching degree. Zhang et
al. [20] presented a multi-granularity fusion model that combines LSTM encoding
structures to extract features at different granularities. They employed an interactive
matching layer to capture and fuse more interaction features. Zhao et al. [21] proposed a
multi-granularity alignment model that utilizes attention mechanisms to capture
interaction features between different granularities and merge them for obtaining
matching representations. Lyu et al. [22] introduced a language knowledge-enhanced
Appl. Sci. 2024, 14, 5197 4 of 21
$\boldsymbol{X}_{emb} = \mathrm{BERT}(X) = [\boldsymbol{h}_{cls}, \boldsymbol{h}_{p_1}, \ldots, \boldsymbol{h}_{p_m}, \boldsymbol{h}_{sep}, \boldsymbol{h}_{q_1}, \ldots, \boldsymbol{h}_{q_n}, \boldsymbol{h}_{sep}] \quad (2)$
where $X$ represents the input sequence containing the special markers and the sentence pair $P$
and $Q$. These sequences are encoded using a pre-trained language model, such as BERT,
resulting in the embedded representation $\boldsymbol{X}_{emb} \in \mathbb{R}^{h \times (m+n+3)}$ of the entire sequence. Each $\boldsymbol{h} \in \mathbb{R}^{h}$
represents an embedding vector, with $h$ denoting the dimensionality of the encoder's
hidden layer. Notably, $\boldsymbol{h}_{cls} \in \mathbb{R}^{h}$ can serve as a coarse-grained feature, summarizing the
semantic information of the entire sentence pair. Recurrent neural networks (RNNs) are
commonly employed for modeling sequential data. However, RNN models are
susceptible to problems like vanishing or exploding gradients. Hence, this study adopts
an efficient model called a Bidirectional Gated Recurrent Unit (BiGRU) to further encode
the embedded sequences and extract context features for each of the two sentences.
Additionally, a linear layer is utilized to merge the embedding representations, facilitating
the comprehensive utilization of both the original and contextual information of the
sequences. The representation process is illustrated as follows:
$\boldsymbol{P}_{context} = \mathrm{BiGRU}([\boldsymbol{h}_{p_1}, \ldots, \boldsymbol{h}_{p_m}]) + \mathrm{Linear}([\boldsymbol{h}_{p_1}, \ldots, \boldsymbol{h}_{p_m}]) \quad (3)$
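As a concrete illustration of Eq. (3), the following minimal PyTorch sketch sums a BiGRU encoding of the token embeddings with a linear projection of those same embeddings. The class name, the random input tensor, and the use of PyTorch are assumptions made for illustration; the authors' implementation is built on PaddleNLP, and the same computation is applied to sentence Q.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Minimal sketch of Eq. (3): BiGRU contextual encoding plus a linear
    projection of the token embeddings, summed element-wise. Hyperparameters
    follow Section 4.2 (768-dim BERT-base embeddings, BiGRU hidden size 150
    per direction); names are illustrative, not the authors' implementation."""
    def __init__(self, emb_dim=768, gru_hidden=150):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, gru_hidden, batch_first=True,
                            bidirectional=True)           # output dim = 2 * 150 = 300
        self.linear = nn.Linear(emb_dim, 2 * gru_hidden)  # project embeddings to 300

    def forward(self, h_tokens):                 # h_tokens: (batch, m, emb_dim)
        context, _ = self.bigru(h_tokens)        # (batch, m, 300)
        return context + self.linear(h_tokens)   # Eq. (3): BiGRU(.) + Linear(.)

# Usage: token embeddings for sentence P, e.g. sliced from a BERT output
# that excludes the [CLS]/[SEP] positions.
h_p = torch.randn(2, 20, 768)                    # (batch=2, m=20 tokens, 768)
p_context = ContextEncoder()(h_p)                # (2, 20, 300)
```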
The multi-scale CNN and multi-level CNN have the ability to capture different
aspects of sentences. The multi-scale CNN captures a broader range of semantic
information, while the multi-level CNN captures deeper semantic information. These two
approaches complement each other. Similar to other CNN models used in NLP tasks, this
study employs fixed-width convolutions to extract local features. The width of the
convolutional kernel matches the hidden dimension of the input sequence.
For a given sentence sequence, let $\boldsymbol{c}_i \in \mathbb{R}^{d}$ represent the contextual
representation of the $i$-th word in the sentence, where $d$ denotes the dimension of the
representation. The input sentence consists of $n$ words. To extract meaningful local
features, we employ convolutional filters of different scales, denoted as $S = \{S_1, S_2, \ldots, S_t\}$,
with a fixed kernel width of $l$. For example, the feature $m_{ij}^{S_i}$ is generated over the window
$\boldsymbol{c}_{i:i+S_i-1}$:

$m_{ij}^{S_i} = \mathrm{GELU}\left(\boldsymbol{W}_j^{S_i}\boldsymbol{c}_{i:i+S_i-1} + b_j^{S_i}\right) \quad (6)$

where $\boldsymbol{W}_j^{S_i} \in \mathbb{R}^{S_i \times d}$ represents the weight of the $j$-th feature map at scale $S_i$, $\mathrm{GELU}$
denotes the activation function, and $b_j^{S_i}$ represents the bias of the feature map.
The convolutional kernel of scale $S_i$ slides over the entire padded contextual
representation matrix, denoted as $\boldsymbol{C}$, to generate feature maps. This process is illustrated
below:

$\boldsymbol{m}_j^{S_i} = [m_{1j}^{S_i}, m_{2j}^{S_i}, \ldots, m_{nj}^{S_i}] \quad (7)$
The local n-gram features of the sentences have been extracted using the multi-scale
CNN. By concatenating the outputs of the convolutional kernel 𝑠𝑖 , we obtain the input for
the key semantic alignment layer, which is represented as follows:
𝑆 𝑆 𝑆
𝑴𝑆𝑖 = [𝒎1𝑖 , 𝒎2𝑖 , . . . , 𝒎ℎ𝑖 ] (8)
where ℎ represents the number of feature maps, and 𝑴𝑆𝑖 ∈ ℝ𝑛×ℎ .
For the multi-level CNN, the above operation is repeated with a convolutional kernel $S_j$
on top of $\boldsymbol{M}^{S_i}$ to extract higher-level semantics.
After applying the multi-granularity feature extraction layer to the contextual
features of the sentence, the resulting multi-granularity features of the sentence 𝑷 and 𝑸
are as follows:
𝑆 𝑆 𝑆 𝑆
𝑴𝑃 = {𝑴𝑃0 , 𝑴𝑃1 , 𝑴𝑃2 , . . . , 𝑴𝑃𝑡 } (9)
𝑆 𝑆 𝑆 𝑆
𝑴𝑄 = {𝑴𝑄0 , 𝑴𝑄1 , 𝑴𝑄2 , . . . , 𝑴𝑄𝑡 } (10)
𝑆 𝑆
where 𝑴𝑃𝑖 ∈ ℝ𝑚×ℎ , 𝑴𝑄𝑖 ∈ ℝ𝑛×ℎ , 𝑡 represents the number of convolutional kernels, 𝑴𝑆0
represents the output features of the multi-level CNN, and 𝑴𝑆𝑖 represents the output
features of different-scale convolutional kernels.
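The multi-granularity feature extraction of Eqs. (6)-(10) can be sketched as follows. This is an illustrative PyTorch approximation rather than the released code: the kernel sizes and number of feature maps follow Section 4.2, while the "same" padding, class structure, and variable names are assumptions.

```python
import torch
import torch.nn as nn

class MultiGranularityCNN(nn.Module):
    """Illustrative sketch of the multi-scale / multi-level extraction
    (Eqs. (6)-(10)) under the Section 4.2 settings: multi-scale kernels of
    sizes (3, 4, 5), multi-level kernels of sizes (3, 4), 300 feature maps."""
    def __init__(self, in_dim=300, n_maps=300, scales=(3, 4, 5), levels=(3, 4)):
        super().__init__()
        # One 1-D convolution per scale; 'same' padding keeps the sequence length n.
        self.scale_convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_maps, k, padding="same") for k in scales)
        # Two stacked convolutions realize the multi-level branch (output M^{S_0}).
        self.level_convs = nn.ModuleList(
            nn.Conv1d(in_dim if i == 0 else n_maps, n_maps, k, padding="same")
            for i, k in enumerate(levels))
        self.act = nn.GELU()

    def forward(self, c):                        # c: (batch, n, in_dim)
        x = c.transpose(1, 2)                    # Conv1d expects (batch, dim, n)
        multi_scale = [self.act(conv(x)).transpose(1, 2) for conv in self.scale_convs]
        y = x
        for conv in self.level_convs:            # deeper semantics, level by level
            y = self.act(conv(y))
        multi_level = y.transpose(1, 2)          # M^{S_0}: (batch, n, n_maps)
        return [multi_level] + multi_scale       # {M^{S_0}, M^{S_1}, ..., M^{S_t}}

feats = MultiGranularityCNN()(torch.randn(2, 20, 300))   # list of (2, 20, 300) tensors
```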
$\boldsymbol{Head}_i = \mathrm{SDA}(\boldsymbol{Q}\boldsymbol{W}_i^Q, \boldsymbol{K}\boldsymbol{W}_i^K, \boldsymbol{V}\boldsymbol{W}_i^V) \quad (13)$

where $\boldsymbol{W}$, $\boldsymbol{W}_i^Q$, $\boldsymbol{W}_i^K$, and $\boldsymbol{W}_i^V$ are learnable parameters, with $t$ denoting the number of
attention heads.
Using multi-head self-attention, the key semantic features $\boldsymbol{S}_P^{S_i}$ for the sentence $\boldsymbol{P}$
itself can be computed as follows:

$\boldsymbol{S}_P^{S_i} = \mathrm{Concat}(\boldsymbol{S}_{P1}^{S_i}, \ldots, \boldsymbol{S}_{Pt}^{S_i})\boldsymbol{W} \quad (14)$

$\boldsymbol{S}_{Pi}^{S_i} = \mathrm{SDA}(\boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^Q, \boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^K, \boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^V) = \mathrm{Softmax}\left(\frac{\boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^Q(\boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^K)^\top}{\sqrt{h}}\right)\boldsymbol{M}_P^{S_i}\boldsymbol{W}_i^V \quad (15)$

In this equation, the dimension of $\boldsymbol{S}_P^{S_i}$ matches the dimension of the input feature
$\boldsymbol{M}_P^{S_i}$, i.e., $\boldsymbol{S}_P^{S_i} \in \mathbb{R}^{m \times h}$. Similarly, the key semantic features $\boldsymbol{S}_Q^{S_i} \in \mathbb{R}^{n \times h}$ for the sentence $\boldsymbol{Q}$
can be calculated.
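As a concrete reference point, multi-head self-attention over one scale's feature map can be reproduced with PyTorch's built-in module; the snippet below is a sketch under the Section 4.2 setting of 20 heads over 300-dimensional features, not the authors' PaddleNLP code.

```python
import torch
import torch.nn as nn

# Sketch of Eqs. (13)-(15): multi-head self-attention over one scale's feature
# map M_P^{S_i}. Section 4.2 reports 20 heads on 300-dim features (15 dims per
# head). Note that nn.MultiheadAttention scales scores by the per-head
# dimension, which may differ slightly from the sqrt(h) normalization in Eq. (15).
self_attn = nn.MultiheadAttention(embed_dim=300, num_heads=20, batch_first=True)

m_p = torch.randn(2, 20, 300)           # M_P^{S_i}: (batch, sentence length m, h)
s_p, _ = self_attn(m_p, m_p, m_p)       # query = key = value = M_P^{S_i}
print(s_p.shape)                        # torch.Size([2, 20, 300]) = S_P^{S_i}
```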
3.3.2. Cross-Attention
Cross-attention calculates the attention between sentence pairs in both directions,
allowing for the alignment of key semantic information between the two sentences. This
attention is computed from the similarity matrix $\boldsymbol{F}^{S_i} \in \mathbb{R}^{m \times n}$, which considers the lengths
of sentence $\boldsymbol{P}$ ($m$) and sentence $\boldsymbol{Q}$ ($n$). The computation of the similarity matrix $\boldsymbol{F}^{S_i}$ is as
follows:
$\boldsymbol{F}^{S_i} = \boldsymbol{S}_P^{S_i}(\boldsymbol{S}_Q^{S_i})^\top \quad (16)$
Afterwards, the softmax function is applied to normalize the similarity matrix $\boldsymbol{F}^{S_i}$
in both directions. This generates row-wise attention weights $\boldsymbol{F}_{PTQ}^{S_i} \in \mathbb{R}^{m \times n}$ for the
attention from sentence $\boldsymbol{P}$ to sentence $\boldsymbol{Q}$, and column-wise attention weights $\boldsymbol{F}_{QTP}^{S_i} \in \mathbb{R}^{n \times m}$
for the attention from sentence $\boldsymbol{Q}$ to sentence $\boldsymbol{P}$. The calculation is as follows:

$\boldsymbol{F}_{PTQ}^{S_i} = \mathrm{Softmax}(\boldsymbol{F}_{i:}^{S_i}) \quad (17)$

$\boldsymbol{F}_{QTP}^{S_i} = \mathrm{Softmax}(\boldsymbol{F}_{:j}^{S_i}) \quad (18)$
To obtain the attention-aware representation $\boldsymbol{R}_P^{S_i} \in \mathbb{R}^{m \times h}$ for sentence $\boldsymbol{P}$ with
respect to sentence $\boldsymbol{Q}$, the attention weights are multiplied with the representations.
Similarly, the attention-aware representation $\boldsymbol{R}_Q^{S_i} \in \mathbb{R}^{n \times h}$ for sentence $\boldsymbol{Q}$ with respect to
sentence $\boldsymbol{P}$ can be obtained. The calculation is as follows:

$\boldsymbol{R}_P^{S_i} = \boldsymbol{F}_{PTQ}^{S_i}\boldsymbol{S}_Q^{S_i} \quad (19)$

$\boldsymbol{R}_Q^{S_i} = \boldsymbol{F}_{QTP}^{S_i}\boldsymbol{S}_P^{S_i} \quad (20)$

where $\boldsymbol{R}_P^{S_i}$ represents the attention-aware representation of sentence $\boldsymbol{P}$ with respect to
sentence $\boldsymbol{Q}$ after the $i$-th scale convolution and attention mechanism, and $\boldsymbol{R}_Q^{S_i}$ represents
the attention-aware representation of sentence $\boldsymbol{Q}$ with respect to sentence $\boldsymbol{P}$ after the
$i$-th scale convolution and attention mechanism.
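The bidirectional cross-attention of Eqs. (16)-(20) reduces to a few tensor operations. The sketch below, with made-up shapes and variable names, illustrates the row-wise and column-wise normalization and the resulting attention-aware representations.

```python
import torch
import torch.nn.functional as F

s_p = torch.randn(2, 20, 300)                  # S_P^{S_i}: (batch, m, h)
s_q = torch.randn(2, 25, 300)                  # S_Q^{S_i}: (batch, n, h)

f = torch.bmm(s_p, s_q.transpose(1, 2))        # Eq. (16): similarity matrix (batch, m, n)
f_ptq = F.softmax(f, dim=-1)                   # Eq. (17): normalize over Q (row-wise)
f_qtp = F.softmax(f.transpose(1, 2), dim=-1)   # Eq. (18): normalize over P (column-wise)

r_p = torch.bmm(f_ptq, s_q)                    # Eq. (19): P attended by Q, (batch, m, h)
r_q = torch.bmm(f_qtp, s_p)                    # Eq. (20): Q attended by P, (batch, n, h)
```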
$\boldsymbol{G}_{P1}^{S_i} = FFN_1([\boldsymbol{S}_P^{S_i}, \boldsymbol{R}_P^{S_i}]) \quad (21)$

$\boldsymbol{G}_{P2}^{S_i} = FFN_2([\boldsymbol{S}_P^{S_i} - \boldsymbol{R}_P^{S_i}]) \quad (22)$

$\boldsymbol{G}_{P3}^{S_i} = FFN_3([\boldsymbol{S}_P^{S_i} \odot \boldsymbol{R}_P^{S_i}]) \quad (23)$
where $\boldsymbol{G}_{P1}^{S_i} \in \mathbb{R}^{m \times 2h}$ (obtained through concatenation along the column direction), $\boldsymbol{G}_{P2}^{S_i} \in \mathbb{R}^{m \times h}$,
and $\boldsymbol{G}_{P3}^{S_i} \in \mathbb{R}^{m \times h}$ represent the interaction features obtained after the three different
operations:
1. $\boldsymbol{G}_{P1}^{S_i}$ represents the concatenation of the key semantic features $\boldsymbol{S}_P^{S_i}$ and the attention-
aware representation $\boldsymbol{R}_P^{S_i}$. In the interaction operation, concatenation is used to
preserve all the information.
2. $\boldsymbol{G}_{P2}^{S_i}$ performs element-wise subtraction, approximating the calculation of Euclidean
distance and measuring the relevance between the two representations using
difference information.
3. $\boldsymbol{G}_{P3}^{S_i}$ performs element-wise multiplication, approximating the calculation of cosine
similarity and measuring the similarity between the two representations.
These comparison operations enhance semantic information and capture semantic
relationships. To optimize the model's efficiency, a feed-forward neural network is
employed to integrate the interaction information, resulting in the feature interaction
matrix $\boldsymbol{G}_P^{S_i} \in \mathbb{R}^{m \times l}$:

$\boldsymbol{G}_P^{S_i} = FFN_4([\boldsymbol{G}_{P1}^{S_i}, \boldsymbol{G}_{P2}^{S_i}, \boldsymbol{G}_{P3}^{S_i}]) \quad (24)$
where $FFN_1$, $FFN_2$, $FFN_3$, and $FFN_4$ denote distinct feed-forward neural networks with
non-shared parameters, conventionally represented as follows:

$FFN(\boldsymbol{Z}) = \mathrm{GELU}(\boldsymbol{Z}\boldsymbol{W} + \boldsymbol{b}) \quad (25)$
and $[\boldsymbol{G}_{P1}^{S_i}, \boldsymbol{G}_{P2}^{S_i}, \boldsymbol{G}_{P3}^{S_i}] \in \mathbb{R}^{m \times 4h}$ is concatenated along the column dimension. The parameters
$\boldsymbol{W}_4 \in \mathbb{R}^{4h \times l}$ and $\boldsymbol{b} \in \mathbb{R}^{l}$ are adjustable, and $l$ represents the output dimension of the
network. This operation enables the model to capture deeper levels of interaction
information and reduces the complexity of the representation. Likewise, the feature
interaction matrix $\boldsymbol{G}_Q^{S_i} \in \mathbb{R}^{n \times l}$ can be obtained for sentence $\boldsymbol{Q}$.
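A minimal sketch of the three-way interaction fusion (Eqs. (21)-(25)) is given below, assuming h = 300 and l = 150 as in Section 4.2; each FFN follows Eq. (25) (a linear layer with GELU), and the class and attribute names are illustrative rather than the released code.

```python
import torch
import torch.nn as nn

class InteractionFusion(nn.Module):
    """Sketch of the three-way interaction fusion of Eqs. (21)-(25):
    concatenation, difference, and element-wise product of the key semantic
    features S and the attention-aware representation R, merged by FFN_4."""
    def __init__(self, h=300, l=150):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(2 * h, 2 * h), nn.GELU())  # G_P1 in R^{m x 2h}
        self.ffn2 = nn.Sequential(nn.Linear(h, h), nn.GELU())          # G_P2 in R^{m x h}
        self.ffn3 = nn.Sequential(nn.Linear(h, h), nn.GELU())          # G_P3 in R^{m x h}
        self.ffn4 = nn.Sequential(nn.Linear(4 * h, l), nn.GELU())      # G_P  in R^{m x l}

    def forward(self, s, r):                           # s, r: (batch, m, h)
        g1 = self.ffn1(torch.cat([s, r], dim=-1))      # Eq. (21): preserve all information
        g2 = self.ffn2(s - r)                          # Eq. (22): difference information
        g3 = self.ffn3(s * r)                          # Eq. (23): similarity information
        return self.ffn4(torch.cat([g1, g2, g3], dim=-1))   # Eq. (24)

g_p = InteractionFusion()(torch.randn(2, 20, 300), torch.randn(2, 20, 300))  # (2, 20, 150)
```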
$\boldsymbol{p}^{S_i} = \mathrm{MaxPooling}(\boldsymbol{G}_P^{S_i}) \quad (26)$

$\boldsymbol{q}^{S_i} = \mathrm{MaxPooling}(\boldsymbol{G}_Q^{S_i}) \quad (27)$

where $\boldsymbol{p}^{S_i} \in \mathbb{R}^{l}$ and $\boldsymbol{q}^{S_i} \in \mathbb{R}^{l}$ represent the semantic feature vectors modeled by the
model at scale $S_i$. By concatenating all semantic feature vectors from the different scales with
the CLS token vector, we obtain the final semantic interaction vector $\boldsymbol{f}$:

$\boldsymbol{f} = \mathrm{Concat}(\boldsymbol{h}_{cls}, \boldsymbol{p}^{S_0}, \boldsymbol{p}^{S_1}, \ldots, \boldsymbol{p}^{S_t}, \boldsymbol{q}^{S_0}, \boldsymbol{q}^{S_1}, \ldots, \boldsymbol{q}^{S_t}) \quad (28)$

where $\boldsymbol{f} \in \mathbb{R}^{h + 2(t+1)l}$, $h$ represents the dimension of the CLS token vector, there are $t+1$
scales in total, and $l$ denotes the dimension of the feature vector at each scale.
Finally, a fully connected layer is used to predict the matching relationship between the
two sentences based on this vector.
The chosen loss function is the cross-entropy loss, represented by the following
formula:
$\mathcal{L} = -\frac{1}{N}\sum\nolimits_{N}\left(y\log p + (1-y)\log(1-p)\right) \quad (29)$
In the equation, 𝑦 represents the true label, which is set to one when there is
semantic matching between the sentence pairs and zero otherwise. 𝑝 denotes the
predicted result, and 𝑁 represents the number of training samples.
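Putting the prediction stage together, the sketch below max-pools each interaction matrix, concatenates the pooled vectors with the CLS vector (Eqs. (26)-(28)), and trains a classification layer with cross-entropy; the two-logit cross-entropy used here is the standard two-class equivalent of the binary formulation in Eq. (29). Shapes and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

h_cls = torch.randn(2, 768)                        # coarse-grained CLS vector (BERT-base)
g_p = [torch.randn(2, 20, 150) for _ in range(5)]  # G_P^{S_i} for t + 1 = 5 scales
g_q = [torch.randn(2, 25, 150) for _ in range(5)]  # G_Q^{S_i} for t + 1 = 5 scales

p_vecs = [g.max(dim=1).values for g in g_p]        # Eq. (26): max pooling over tokens
q_vecs = [g.max(dim=1).values for g in g_q]        # Eq. (27)
f = torch.cat([h_cls] + p_vecs + q_vecs, dim=-1)   # Eq. (28): (2, 768 + 2 * 5 * 150)

classifier = nn.Linear(f.size(-1), 2)              # match / no-match logits
labels = torch.tensor([1, 0])                      # y = 1 matched, y = 0 unmatched
loss = nn.CrossEntropyLoss()(classifier(f), labels)   # two-class form of Eq. (29)
```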
4.2. Implementation
The model architecture and system experiments in this study were implemented
using the PaddleNLP framework. The framework’s pre-trained models and their
initialized weights were utilized. The experiments were conducted using two NVIDIA
GeForce RTX 3090 GPUs (Taiwan Semiconductor Manufacturing Company Limited,
Hsinchu, Taiwan), each with a memory capacity of 24 GB.
In terms of the model architecture, for the input encoding stage, this study utilized
768-dimensional BERT-base, 312-dimensional ERNIE 3.0-nano, and 768-dimensional
ERNIE 3.0-base for text embedding, respectively. For the feature extraction stage, the
BiGRU unidirectional hidden dimension was set to 150. The multi-scale CNN adopts
convolutional filters of size (3, 4, 5), while the multi-level CNN uses convolutional filters
of size (3, 4). The feature map dimension for both was 300. The multi-head self-attention
employs 20 heads. In the interaction fusion stage, the hidden dimension of the FNN was
set to 150.
Regarding model training, the maximum sentence length was set to 128, and the
batch size was 32. The optimization was performed using AdamW with β1 = 0.9 and β2
= 0.999. The learning rate was set to 5 × 10−5, and the dropout parameter was 0.1. The model
was trained for 3 epochs. The evaluation metric used was accuracy (Acc), and the best-
performing result was chosen as the experimental outcome.
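For quick reference, the settings above can be gathered into a single configuration dictionary; this is simply a restatement of the reported values, with key names chosen for readability rather than taken from PaddleNLP or any other framework.

```python
# Hyperparameters reported in Section 4.2, collected for reference.
config = {
    "encoders": ["BERT-base (768)", "ERNIE 3.0-nano (312)", "ERNIE 3.0-base (768)"],
    "bigru_hidden_per_direction": 150,
    "multi_scale_kernels": (3, 4, 5),
    "multi_level_kernels": (3, 4),
    "feature_maps": 300,
    "self_attention_heads": 20,
    "ffn_hidden": 150,
    "max_sentence_length": 128,
    "batch_size": 32,
    "optimizer": "AdamW (beta1=0.9, beta2=0.999)",
    "learning_rate": 5e-5,
    "dropout": 0.1,
    "epochs": 3,
}
```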
In terms of model testing, the evaluation metric used was accuracy (Acc), and the
optimal result was chosen as the experimental outcome. The following definitions were
used: True Positives (TPs) represent the cases where the original sentences were matched
and correctly classified, while True Negatives (TNs) indicate the cases where the original
sentences were unmatched and correctly classified. False Positives (FPs) refer to the cases
where the original sentences were unmatched but mistakenly classified, and False
Negatives (FNs) denote the cases where the original sentences were matched but
erroneously classified. The accuracy was calculated using the following formula:
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (30)$
Methods LCQMC (Acc %) BQ (Acc %)
Text-CNN [28,29] 72.80 68.52
BiLSTM [28,29] 76.10 73.51
BiMPM [28,29] 83.40 81.85
DRCN [33] 85.90 83.30
MIPR [24] 85.29 84.45
MKPM [10] 86.71 84.11
BERT-base 86.75 84.67
SBERT 84.44 83.64
MAGE 87.58 84.84
MacBERT-base [34] 87.00 85.20
MacBERT-large [34] 87.60 85.60
Ernie 3.0-nano 86.12 83.02
Ernie 3.0-base 87.64 85.83
MCKI (BERT-base) 87.90 85.28
MCKI (Ernie 3.0-nano) 87.65 83.69
MCKI (Ernie 3.0-base) 89.08 86.08
Text can be viewed as sequential data, and the recurrent structure of BiLSTM is better suited
for modeling sequence features. Unlike Text-CNN, which extracts diverse text features
through convolutions of various sizes, BiLSTM takes into account the dependencies
between contextual information present in the text. To further enhance accuracy, BiMPM,
which builds upon BiLSTM, introduces bidirectional different-perspective matching to
capture contextual features and aggregates matching information using BiLSTM.
Methods incorporating attention mechanisms, which enable the model to concentrate
on crucial interaction information between sentences, also exhibit enhanced performance.
DRCN introduced an attention mechanism in recurrent neural networks. MKPM
employed bidirectional attention to select crucial words for keyword matching in sentence
pairs, leading to significant improvements on the LCQMC dataset. This highlighted the
importance of key semantics in sentence matching within an open-domain context. MIPR
extracted semantic interaction features within and between sentences at four different
granularities, resulting in notable improvements on the BQ dataset. This showcases the
influence of multi-granularity semantics in sentence matching within a closed-domain
setting.
Pre-trained language models demonstrate exceptional performance in language
modeling by acquiring contextual representations from extensive data. BERT achieved an
accuracy of 86.75% on the LCQMC dataset and 84.67% on the BQ dataset. In contrast,
SBERT, which employs a dual-encoder architecture to model sentence representations but
lacks explicit interaction information between sentences, exhibited lower effectiveness on
both datasets when compared to the cross-encoder architecture of BERT. While attention
mechanisms enable models to concentrate on important words or phrases in text, they can
pose challenges in distinguishing between positive and negative examples in texts with
similar themes. MAGE effectively handles this challenge through the use of a context-
aware interaction module and a multi-perspective fusion module. Additionally, pre-
trained language models such as MacBERT and Ernie 3.0, which have undergone
extensive training on large-scale Chinese data, enhance performance by improving the
training process.
The proposed method in this study demonstrated significant performance
improvements compared to both neural network-based models and attention-based
models, highlighting the effectiveness of building models upon pre-trained language
models. Specifically, when compared to pre-trained language models, the method in this
study achieved improvements of 1.15% and 0.61% on top of BERT, 1.53% and 0.67% on
top of Ernie 3.0-nano, and 1.44% and 0.25% on top of Ernie 3.0-base, respectively,
surpassing all baseline models. However, the improvement observed for the model based
on Ernie 3.0-base in the BQ dataset is relatively smaller. This can be attributed to two
factors. Firstly, as mentioned earlier, the BQ dataset poses greater difficulty in terms of
matching compared to open-domain datasets. Secondly, Ernie 3.0, being a more advanced
pre-trained language model, has undergone extensive knowledge-enhanced pre-training,
equipping it with certain capabilities for modeling semantic matching relationships in
sentence pairs. Consequently, the potential for improvement relative to the BERT-based
models is relatively smaller.
Additionally, this study achieved performance that is comparable to BERT-base and
even MacBERT-large by leveraging the capabilities of the lightweight Ernie 3.0-nano
model on the LCQMC dataset. This observation suggests that, instead of relying on larger
pre-trained language models, the approach presented in this study can attain similar
performance levels by utilizing a more lightweight pre-trained language model. This
strategy proves effective, particularly when resources are limited.
In comparison to the MAGE model, the proposed method achieved improvements of
0.32% and 0.44% on the two datasets. The approach presented in this study utilizes a
fusion of multi-scale and multi-level CNNs to extract comprehensive contextual features,
enabling the model to capture a wider and deeper range of information. By combining
multi-head self-attention and cross-attention mechanisms, the model not only emphasizes
key semantic information but also aligns significant details across different levels of
semantics. As a result, significant improvements are observed compared to the MAGE
model.
Relying on a single convolution scale captures local features within a fixed window only,
limiting its ability to incorporate a broader range of semantic information. Conversely, a multi-
scale CNN compensates for this limitation by capturing semantic information at multiple
scales, while a multi-level CNN focuses on capturing higher-level semantic information. These
two approaches possess distinct capabilities in capturing different semantic features. The
multi-scale CNN extracts a wider range of semantic information, while the multi-level CNN
delves into deeper semantic layers. By combining both approaches, the model achieves
improved performance as they complement each other in capturing diverse semantic features.
This study also examined the impact of attention mechanisms on the model’s
performance. By removing self-attention and directly fusing the multi-granularity contextual
features with the aligned features from cross-attention, the model’s accuracy declined by
0.34%. This decrease can be attributed to the absence of important semantic information at
various granularities. Similarly, removing cross-attention and fusing the multi-granularity
contextual features with the output features from self-attention led to a decrease in model
accuracy by 0.31%. This reduction occurred because the model solely focused on its own
semantic information and lacked interaction between sentences. Furthermore, simultaneous
removal of cross-attention and the multi-path interaction fusion module, relying solely on key
semantic information at different granularities, resulted in a decrease in model accuracy by
0.48%. This finding indicates that aligning crucial semantic information between sentences
and considering semantic interactions at various granularities significantly contribute to
enhancing semantic matching capabilities.
Additionally, solely removing the multi-way interaction fusion module resulted in a
decrease of 0.43% in accuracy. Furthermore, to further validate the effectiveness of different
comparison methods, as presented in Table 6, using only the concatenation of sentence
semantic features achieved an accuracy of 83.35%. Using only semantic difference information
led to an accuracy of 82.91%, and using only semantic similarity information resulted in an
accuracy of 83.05%. These results indicate that combining all three types of comparison
methods is an effective approach for capturing matching interactions. Moreover, the impact
of feature fusion in different directions on performance was explored. In the table, “,”
represents concatenation along the column direction of the tensor, expanding the hidden state
dimension, while “;” represents concatenation along the row direction of the tensor, expanding
the sentence length. The results showed that concatenation in the row direction did not
effectively merge the compared information from multiple ways, which failed to reflect the
interaction between sentences and resulted in decreased performance.
Lastly, a comparative experiment was conducted to validate the impact of the CLS
token vector on model performance, as presented in Table 7. The experiment involved
using different prediction modules. The results demonstrated that the best performance
is achieved when the CLS token vector is concatenated with the output vector from the
max-pooling operation. This finding suggests that in addition to fine-grained features,
incorporating the CLS token as a coarse-grained representation of semantic information
for the entire sentence pair effectively enhances semantic matching performance.
Furthermore, the additional concatenation of the output vector from the average-pooling
operation leads to a decrease in model performance. This decrease can be attributed to the
introduction of redundant information, which affects the model’s ability to distinguish
important features.
Setting the number of self-attention heads to 20 ensures that each attention unit has a dimension of 15, which is sufficient to capture
most of the sentence patterns in the dataset. However, as the number of heads continues
to increase beyond this point, the dimension per attention unit decreases, leading to a low-
rank bottleneck problem [36]. This ultimately results in a decline in model performance.
[Figure: accuracy (Acc, %) as a function of the number of self-attention heads, from 1 to 300.]
5. Conclusions
This paper presents a sentence-matching model called the multi-granularity
contextual key semantic interaction model (MCKI). The model incorporates pre-trained
language model embeddings and BiGRU to capture contextual information effectively. It
employs a semantic feature extraction mechanism that combines a multi-scale CNN and
multi-level CNN to extract semantic contextual information at different granularities and
levels. The introduction of a key semantic alignment mechanism by integrating multi-
head self-attention and cross-attention mechanisms emphasizes key semantic information
between sentence pairs. By combining three interaction methods, the model
comprehensively considers both the similarity and difference between sentences,
resulting in enhanced matching performance.
The effectiveness of the proposed model and its components was evaluated on public
datasets. The findings demonstrated that in open-domain datasets, matching key semantic
elements effectively eliminates irrelevant information, leading to significant
improvements, especially when sentence pairs contain high levels of noise. Moreover, the
proposed methodology in this study achieved comparable performance to large-scale pre-
trained models by utilizing lightweight pre-training language models, which proves
effective when resources are limited. However, in closed-domain datasets where sentence
pairs share the same theme, considering semantic nuances at different levels and
granularities aids the model in distinguishing overlapping terminologies within the
subject. Nevertheless, the effectiveness of model enhancements remains limited in such
scenarios. Additionally, marginal effects are observed when applying improvements to
more advanced pre-trained models. The limitations and shortcomings of the current
research should be acknowledged. This study primarily focused on combining existing
models without thoroughly exploring improvements to these models. Future research
will aim to enhance known models in order to achieve better performance. Furthermore,
there is a consideration to integrate graph neural networks, which can leverage both the
key semantic information and structural aspects of sentences to optimize matching
performance.
Author Contributions: Conceptualization, J.L. and Y.L.; methodology, J.L.; software, J.L.; validation,
J.L.; formal analysis, J.L.; investigation, J.L.; resources, Y.L.; data curation, J.L.; writing—original
draft preparation, J.L.; writing—review and editing, Y.L.; supervision, Y.L.; funding acquisition, Y.L.
All authors have read and agreed to the published version of the manuscript.
Funding: The research was funded by the Key projects of science and technology plan of Yunnan
Provincial Department of Science and Technology (grant number 202201AS070029).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Restrictions apply to the availability of these data. Data were obtained from
Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen), and are available at
http://icrc.hitsz.edu.cn/info/1037/1146.htm and http://icrc.hitsz.edu.cn/info/1037/1162.htm with the
permission of Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen),
accessed on 1 October 2023.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181.
2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp.
4171–4186.
3. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
4. Wang, S.; Jiang, J. A compare-aggregate model for matching text sequences. In Proceedings of the ICLR 2017: International
Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–15.
5. Tang, X.; Luo, Y.; Xiong, D.; Yang, J.; Li, R.; Peng, D. Short text matching model with multiway semantic interaction based on
multi-granularity semantic embedding. Appl. Intell. 2022, 52, 15632–15642.
6. Wang, M.; He, X.; Liu, Y.; Qing, L.; Zhang, Z.; Chen, H. MAGE: Multi-scale context-aware interaction based on multi-granularity
embedding for chinese medical question answer matching. Comput. Methods Programs Biomed. 2023, 228, 107249.
7. Sun, Z.; Li, X.; Sun, X.; Meng, Y.; Ao, X.; He, Q.; Wu, F.; Li, J. ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin
Information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 2065–
2075.
8. Zhang, S.; Zhang, X.; Wang, H.; Cheng, J.; Li, P.; Ding, Z. Chinese medical question answer matching using end-to-end character-
level multi-scale CNNs. Appl. Sci. 2017, 7, 767.
9. Lin, D.; Tang, J.; Li, X.; Pang, K.; Li, S.; Wang, T. BERT-SMAP: Paying attention to Essential Terms in passage ranking beyond
BERT. Inf. Process. Manag. 2022, 59, 102788.
10. Lu, X.; Deng, Y.; Sun, T.; Gao, Y.; Feng, J.; Sun, X.; Sutcliffe, R. MKPM: Multi keyword-pair matching for natural language
sentences. Appl. Intell. 2022, 52, 1878–1892.
11. Zhao, S.; Huang, Y.; Su, C.; Li, Y.; Wang, F. Interactive attention networks for semantic text matching. In Proceedings of the 2020
IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 861–870.
12. Fei, H.; Ren, Y.; Ji, D. Improving text understanding via deep syntax-semantics communication. In Proceedings of the Findings
of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 84–93.
13. Zhang, X.; Li, Y.; Lu, W.; Jian, P.; Zhang, G. Intra-correlation encoding for Chinese sentence intention matching. In Proceedings
of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 5193–
5204.
14. Yu, C.; Xue, H.; Jiang, Y.; An, L.; Li, G. A simple and efficient text matching model based on deep interaction. Inf. Process. Manag.
2021, 58, 102738.
15. Wang, X.; Yang, H. MGMSN: Multi-Granularity Matching Model Based on Siamese Neural Network. Front. Bioeng. Biotechnol.
2022, 10, 839586.
16. Chang, G.; Wang, W.; Hu, S. MatchACNN: A multi-granularity deep matching model. Neural Process. Lett. 2023, 55, 4419–4438.
17. Zhang, X.; Lu, W.; Li, F.; Peng, X.; Zhang, R. Deep feature fusion model for sentence semantic matching. Comput. Mater. Contin.
2019, 61, 601–616.
18. Lai, Y.; Feng, Y.; Yu, X.; Wang, Z.; Xu, K.; Zhao, D. Lattice CNNs for matching based Chinese question answering. In Proceedings
of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6634–
6641.
19. Chen, L.; Zhao, Y.; Lyu, B.; Jin, L.; Chen, Z.; Zhu, S.; Yu, K. Neural graph matching networks for Chinese short text matching.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6152–
6158.
20. Zhang, X.; Lu, W.; Zhang, G.; Li, F.; Wang, S. Chinese sentence semantic matching based on multi-granularity fusion model. In
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 11–14 May 2020; pp. 246–
257.
21. Zhao, P.; Lu, W.; Li, Y.; Yu, J.; Jian, P.; Zhang, X. Chinese semantic matching with multi-granularity alignment and feature
fusion. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China 18–22 July
2021; pp. 1–8.
22. Lyu, B.; Chen, L.; Zhu, S.; Yu, K. Let: Linguistic knowledge enhanced graph transformer for chinese short text matching. In
Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 13498–13506.
23. Tao, H.; Tong, S.; Zhang, K.; Xu, T.; Liu, Q.; Chen, E.; Hou, M. Ideography leads us to the field of cognition: A radical-guided
associative model for chinese text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9
February 2021; pp. 13898–13906.
24. Zhao, P.; Lu, W.; Wang, S.; Peng, X.; Jian, P.; Wu, H.; Zhang, W. Multi-granularity interaction model based on pinyins and
radicals for Chinese semantic matching. World Wide Web 2022, 25, 1703–1723.
25. Wu, Z.; Liang, J.; Zhang, Z.; Lei, J. Exploration of text matching methods in Chinese disease Q&A systems: A method using
ensemble based on BERT and boosted tree models. J. Biomed. Inform. 2021, 115, 103683.
26. Zhang, Z.; Zhang, Y.; Li, X.; Qian, Y.; Zhang, T. BMCSA: Multi-feature spatial convolution semantic matching model based on
BERT. J. Intell. Fuzzy Syst. 2022, 43, 4083–4093.
27. Zou, Y.; Liu, H.; Gui, T.; Wang, J.; Zhang, Q.; Tang, M.; Li, H.; Wang, D. Divide and Conquer: Text Semantic Matching with
Disentangled Keywords and Intents. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland,
22–27 May 2022; pp. 3622–3632.
28. Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; Tang, B. Lcqmc: A large-scale chinese question matching corpus. In
Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp.
1952–1962.
29. Chen, J.; Chen, Q.; Liu, X.; Yang, H.; Lu, D.; Tang, B. The bq corpus: A large-scale domain-specific chinese corpus for sentence
semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4946–4951.
30. He, T.; Huang, W.; Qiao, Y.; Yao, J. Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image
Process. 2016, 25, 2529–2541.
31. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings
of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
32. Wang, Z.; Hamza, W.; Florian, R. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the
International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
33. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645.
34. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-training with whole word masking for chinese bert. IEEE/ACM Trans. Audio Speech
Lang. Process. 2021, 29, 3504–3514.
35. Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y. Ernie 3.0: Large-scale knowledge
enhanced pre-training for language understanding and generation. arXiv 2021, arXiv:2107.02137.
36. Bhojanapalli, S.; Yun, C.; Rawat, A.S.; Reddi, S.; Kumar, S. Low-rank bottleneck in multi-head attention models. In Proceedings
of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 864–873.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury
to people or property resulting from any ideas, methods, instructions or products referred to in the content.