A Spam Transformer Model For SMS Spam Detection
ABSTRACT In this paper, we explore the potential of the Transformer model for detecting spam Short Message Service (SMS) messages by proposing a modified Transformer model designed for SMS spam detection. The proposed spam Transformer is evaluated on the SMS Spam Collection v.1 dataset and UtkMl's Twitter Spam Detection Competition dataset, against a benchmark of multiple established machine learning classifiers and state-of-the-art SMS spam detection approaches. In comparison to all other candidates, our experiments on SMS spam detection show that the proposed modified spam Transformer achieves the best accuracy, recall, and F1-Score, with values of 98.92%, 0.9451, and 0.9613, respectively. In addition, the proposed model also achieves good performance on UtkMl's Twitter dataset, which indicates a promising possibility of adapting the model to other similar problems.
machine learning classifiers, an LSTM deep learning solution, and our proposed spam Transformer model.

B. RELATED WORK
Several different machine learning based classification applications have been proposed in the last few decades [6], [7], [8], [9]. In the field of SMS spam detection, a great number of these approaches are based on traditional machine learning techniques, such as Logistic Regression (LR), Random Forest (RF) [10], Support Vector Machine (SVM) [11], Naïve Bayes (NB), and Decision Trees (DT). Recently, with the prosperity of deep learning techniques, an increasing number of methods have been introduced to address the SMS spam problem using deep learning based solutions such as the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), which is a successful variant of the RNN.

In [12], Gupta et al. compared the performance of 8 different classifiers including SVM, NB, DT, LR, RF, AdaBoost, Neural Network, and CNN. The experimental tests on the SMS Spam Collection v.1 [13] dataset that were conducted by the authors show that the CNN and Neural Network perform better than the other machine learning classifiers, achieving accuracies of 98.25% and 98.00%, respectively.

In [14], Jain et al. proposed a method to apply rule-based models to the SMS spam detection problem. The authors extracted 9 rules and implemented Decision Tree (DT), RIPPER [15], and PRISM [16] to identify spam messages. According to the experimental results from the authors, RIPPER outperformed PRISM and the DT, yielding a 99.01% True Negative Rate (TNR) and a 92.82% True Positive Rate (TPR).

In [1], Roy et al. aimed to adapt the CNN and LSTM to the SMS spam message detection problem. The authors evaluated the performance of the CNN and LSTM by comparing them with Naïve Bayes (NB), Random Forest (RF), Gradient Boosting (GB) [17], Logistic Regression (LR), and Stochastic Gradient Descent (SGD) [18]. The experiments that were conducted by the authors showed that the CNN and LSTM perform significantly better than the tested traditional machine learning approaches when it comes to SMS spam detection.

In [2], the authors proposed the Semantic Long Short-Term Memory (SLSTM), a variant of LSTM with an additional semantic layer. The authors employed Word2vec [19], WordNet [20], and ConceptNet [21] as the semantic layer, and combined the semantic layer with the LSTM to train an SMS spam detection model. The experimental evaluation that was conducted by the authors claimed that the SLSTM achieved an accuracy of 99% on the SMS Spam Collection v.1 dataset.

In [22], Ghourabi et al. proposed the CNN-LSTM model, which consists of a CNN layer and an LSTM layer, in order to identify SMS spam messages in English and Arabic. The authors evaluated the CNN-LSTM by comparing it with the CNN, the LSTM, and 9 traditional machine learning solutions. The experimental tests that were conducted by the authors showed that the CNN-LSTM solution performed better than the other approaches, yielding an accuracy of 98.3% and an F1-Score of 0.914.

C. PAPER ORGANIZATION
The rest of the paper is organized as follows. Section II provides the background and details of the LSTM and our spam Transformer approaches. Concretely, Section II-A introduces the architecture of the RNN, followed by one of its most successful variants, the LSTM, in Section II-B. We then introduce Sequence-to-Sequence in Section II-C, the attention mechanism in Section II-D, and the original version of the Transformer for translation tasks in Section II-E. Furthermore, Section III discusses the modified spam Transformer that we propose in detail. Afterward, Section IV describes the experiment design, and Section V presents the results and analysis. Finally, we conclude in Section VI and describe the future work in Section VII.

II. DEEP LEARNING APPROACHES
While traditional machine learning techniques do perform well in many fields, they still require considerable intervention or guidance from human specialists when applied to real problems. For instance, extracting and representing features from data is always challenging but indispensable work for machine learning practitioners. In other words, the limited capacity of many traditional machine learning classifiers is a major obstacle to more effective and large-scale application. In contrast, many deep learning techniques are able not only to learn far more features but also to extract higher-level features formed by the composition of lower-level ones. With an effective training process, deep learning techniques are better able to consume and make good use of large amounts of data, and thus perform better than traditional machine learning approaches, especially on difficult tasks.

A. RECURRENT NEURAL NETWORK
Shuffling the order of words in a sentence can severely change the meaning of the entire sentence, which could potentially turn a legitimate message into a spam message, and vice versa. Therefore, in many Natural Language Processing (NLP) problems, the order of words is no less important than the words themselves. To address this problem, we need a kind of model that can effectively learn from prior context to improve its understanding of the data. Although the classical feed-forward neural network is a powerful deep learning technique that generally works well in many areas, it cannot utilize information from the past. Derived from the feed-forward neural network, the recurrent neural network (RNN) [23] has the ability to reuse saved information when processing input values. Additionally, unlike the traditional feed-forward neural network
introduces a positional encoding function based on sine and cosine functions of different frequencies.
In the vanilla Transformer model designed for language translation tasks, source language texts and shifted-right target language texts are first sent to embedding layers as the input sequence and the output sequence. Secondly, positional information is injected into the input and output sequences in the positional encoding layer. After that, the input and output sequences are fed into the encoders and decoders, respectively. Then, the Multi-Head Attention layers and fully-connected feed-forward layers, combined as a single encoder or decoder, produce outputs of dimension d_model. The results of the decoders are passed to a linear layer. Finally, the softmax function is applied to the output of the linear layer, producing the translation in the target language.
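To make the sinusoidal encoding mentioned above concrete, the following is a minimal PyTorch sketch of the positional encoding defined in [3]; the function name and tensor shapes are our own illustration rather than the implementation used in this paper.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from Vaswani et al. [3] (assumes an even d_model).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Usage: add the encoding to a batch of embeddings of shape (batch, seq_len, d_model):
# embeddings = embeddings + positional_encoding(embeddings.size(1), embeddings.size(2))
```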
III. PROPOSED MODIFIED TRANSFORMER MODEL FOR SMS SPAM DETECTION
In Fig. 3, the main architecture of the modified Transformer model for SMS spam detection is described. In order to apply the Transformer model to the SMS spam detection task, two major modifications are made to the vanilla Transformer model, which are described in Section III-A and Section III-B, respectively. After that, several implementation details are discussed.

A. MEMORY
The first modification for the SMS spam detection task is the introduction of memory. Since there is no output sequence (target sequence) in the SMS spam detection task, we use a list of trainable parameters named "memory" as a substitute for the output sequence embedding. The length of the memory is a configurable hyper-parameter. Each element of the memory is a vector of dimension d_model, so that it can be adapted to the Transformer model without any extra projection. In other words, the memory is a matrix of dimension len_memory × d_model. The output embedding layer in the original Transformer model is also removed, since there are no target sequence texts to be mapped to numeric vectors. Similar to the output sequence in the vanilla Transformer model, positional information is injected into the memory at the positional encoding layer before it is fed into the decoders.
During the training process, the parameters of the memory are trained, and the memory matrix is expected to contain the important information that helps predict whether or not a message is spam. Therefore, in the decoders of the modified spam Transformer model, with the help of the attention mechanism, the memory helps locate the significant parts of the encoder stack's output sequence that summarize the message, and eventually helps classify spam SMS messages.
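As a rough sketch of this design (not the authors' released code), the memory can be realized as a trainable parameter matrix that takes the place of the target-side embedding. The class name SpamTransformer, the use of torch.nn.Transformer, the default sizes, and the flatten-plus-linear head below are assumptions made for illustration; positional encoding and dropout are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpamTransformer(nn.Module):
    """Sketch of the modified Transformer: a trainable 'memory' replaces the
    target-sequence embedding of the vanilla encoder-decoder model."""

    def __init__(self, d_model: int = 512, nhead: int = 8,
                 len_memory: int = 8, num_layers: int = 2):
        super().__init__()
        # Trainable memory of shape (len_memory, d_model); it is fed to the
        # decoder in place of the shifted target sequence.
        self.memory = nn.Parameter(torch.randn(len_memory, d_model))
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.classifier = nn.Linear(len_memory * d_model, 1)

    def forward(self, src, src_key_padding_mask=None):
        # src: (batch, seq_len, d_model) word embeddings with positional encoding added.
        batch = src.size(0)
        tgt = self.memory.unsqueeze(0).expand(batch, -1, -1)  # broadcast memory per example
        out = self.transformer(src, tgt,
                               src_key_padding_mask=src_key_padding_mask)
        return torch.sigmoid(self.classifier(out.flatten(1)))  # spam probability
```

Broadcasting the same memory matrix across the batch keeps the number of extra parameters independent of the input length.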
Equation (2), as the final activation function, is applied to the output of the linear layers after the decoders, generating a binary result that predicts whether or not the message is spam.

C. DROPOUT
Dropout [37] is a powerful technique published by Hinton et al. in 2012 in order to prevent over-fitting in large feed-forward neural networks. Concretely, Dropout refers to randomly omitting some nodes in those large feed-forward layers on each specific training case. The modified spam Transformer model that we propose employs multiple feed-forward layers; thus, the Dropout technique is implemented in the feed-forward layers of our spam Transformer model. Besides, the Dropout technique is also used in the positional encoding and in the calculation of the attention function.

D. BATCHES AND PADDING
During each epoch of training our proposed models, the whole training set is divided into multiple batches. As the lengths of the messages within the same batch should be equal, some padding words (empty words) have to be added to the shorter message vectors, which interferes with the detection to some extent. Therefore, the algorithm that divides the training set into batches is designed to minimize the number of padding words. Specifically, the training data is first sorted by message length, and the batches are then created from the sorted messages so as to minimize the padding words.
Admittedly, adding padding words may have a negative influence on the model. However, using batches has proved to be a good idea for model training, as it increases the training speed considerably; in fact, a larger batch size greatly accelerates training. Additionally, the negative influence of padding words is addressed by minimizing their use. Besides, padding masks are also passed into the model along with the training batches so that the Transformer model can ignore the padding words during training.
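A minimal sketch of this length-based batching is given below; the function and variable names are ours, and the authors' actual batching code may differ in its details.

```python
def make_batches(token_ids, labels, batch_size=32, pad_id=0):
    """Group messages of similar length to reduce padding.

    token_ids: list of token-id lists; labels: list of 0/1 labels.
    Returns batches of (padded_ids, padding_mask, labels); mask is True at pad positions.
    """
    # Sort messages by length so that each batch contains similarly sized messages.
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(token_ids[i]) for i in idx)
        padded, mask, ys = [], [], []
        for i in idx:
            seq = token_ids[i]
            pad = max_len - len(seq)
            padded.append(seq + [pad_id] * pad)               # append padding words
            mask.append([False] * len(seq) + [True] * pad)    # mark padded positions
            ys.append(labels[i])
        batches.append((padded, mask, ys))
    return batches
```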
E. OPTIMIZATION AND LEARNING RATE
Gradient descent is employed to optimize our modified spam Transformer model. The main idea of the gradient descent algorithm is to minimize the loss function of the model by updating the parameters in the direction opposite to the gradient of the loss function, where the gradient is the vector of partial derivatives of the loss function with respect to the parameters. There are plenty of variant optimizers of gradient descent; we use the AdamW [38] optimizer for our proposed modified spam Transformer with β1 = 0.9, β2 = 0.98, and ε = 10^−9. The learning rate is a critical hyper-parameter in machine learning. It is defined as the step size for updating the parameters, which basically represents the speed of learning of the model. Setting the learning rate too high leads to a situation where the model fails to locate the best parameters (weights and biases), while a learning rate that is too small keeps the model stuck around a local optimum rather than finding a better parameter solution. For the modified spam Transformer model, the same way of determining the learning rate as mentioned in [3] is utilized: the learning rate first increases linearly until reaching warmup_steps steps and then decreases proportionally to the inverse square root of the step number. Concretely, we used warmup_steps = 8000.
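The schedule from [3] can be expressed as lr(step) = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). A hedged PyTorch sketch using AdamW and LambdaLR follows; the base scale and the d_model value here are illustrative assumptions.

```python
import torch

def make_optimizer(model, d_model=512, warmup_steps=8000):
    """AdamW with the warmup schedule of [3]:
    lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.0,
                                  betas=(0.9, 0.98), eps=1e-9)

    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# After each optimizer.step(), call scheduler.step() to advance the schedule.
```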
F. DATAFLOW OF MODIFIED TRANSFORMER
As is shown in Fig. 3, the input messages are first converted into word embeddings using the GloVe model. Following this, the memory (trainable parameters) and the embeddings of the input sequence are positionally encoded, respectively. Then, the processed message vectors are passed to the encoder layers, where multi-head self-attention is performed and the important parts of the input sequence are given larger weights. The results of the encoder layers are passed to the decoder layers. In the decoder layers, multi-head self-attention is computed on the memory. After that, multi-head attention is executed based on the results of the encoder layers and the processed memory. Finally, the decoded vectors are sent to fully-connected linear layers, followed by a final activation function for classification.

IV. EXPERIMENT
A. DATASETS
In the experiments, two different datasets are utilized. The first dataset is the SMS Spam Collection v.1 [13] dataset, which is a labeled SMS message dataset collected for mobile phone spam research. The second one is the UtkMl's Twitter Spam Detection Competition (UtkMl's Twitter) dataset [39] from Kaggle. Table 1 shows the overview statistics of the two datasets.

TABLE 1. The statistics of the two datasets.

Although Twitter posts are not precisely the same as SMS messages, they still have several things in common. For instance, both are usually shorter than roughly 100 words, and people tend to use casual language and abbreviations in both Twitter posts and SMS messages. Therefore, UtkMl's Twitter dataset can also be used to test our model. Besides, we can analyze the extensibility of our model by comparing its performance on these two datasets. In comparison with the SMS Spam Collection v.1 [13] dataset, UtkMl's Twitter dataset contains more data in both the spam and ham classes. Besides, UtkMl's Twitter dataset is balanced, since the numbers of spam messages and ham messages are approximately equal. In terms of language, although a lot of casual language and abbreviations are used in both datasets, they appear more frequently in UtkMl's Twitter dataset. The reason for this observation may be the nature of Twitter posts. Alternatively, it could also be the date at which the dataset was collected, as SMS Spam Collection v.1 was published in 2011.

TABLE 2. The confusion matrix.

B. EVALUATION MEASURES
In order to evaluate the performance of the proposed modified spam Transformer model, metrics such as accuracy, precision, recall, and F1-Score are used in the experiments. All these metrics are calculated based on the confusion matrix (Table 2). As mentioned in the previous section, the spam messages in the SMS Spam Collection v.1 dataset are significantly fewer than the ham messages, which means that the dataset is unbalanced. Therefore, accuracy alone is not sufficient to evaluate the performance of the proposed model, and the F1-Score is also employed in the experiments. The accuracy, precision, recall, and F1-Score are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)    (7)

The precision, also known as the positive predictive value, represents the percentage of the predicted positive cases that are actually positive, i.e., the probability that the classifier is correct given that it predicts positive. The recall, also known as sensitivity, denotes the number of true positive instances divided by the number of actual positive instances, which can also be described as the percentage of the positive cases that are identified successfully. The F1-Score is the harmonic mean of precision and recall, which measures the performance of a classifier in terms of precision and recall in a balanced way.
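These four measures follow directly from the confusion-matrix counts; the short sketch below is only a worked illustration of Equations (4)-(7).

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute the measures of Equations (4)-(7) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```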
C. DATA SPLITTING
For the traditional machine learning approaches, the data is divided into a training set (70%) and a test set (30%). For the LSTM and our proposed modified spam Transformer model, the data is split into a training set (50%), a validation set (20%), and a test set (30%), where the validation set is used after each epoch of training to help us select the best model and perform early stopping to avoid over-fitting.

D. DATA PRE-PROCESSING
The textual messages in the dataset are first tokenized. Tokenization refers to the task of splitting text into meaningful words. Specifically, the spaCy [40] library is employed for data pre-processing in order to tokenize the data.
After that, numeric representation vectors (word embeddings) are calculated based on the textual messages. Two major methods of calculating representation vectors are employed in our experiments.
• TF-IDF Representation: TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used numerical statistic in NLP. It is designed to reflect the importance of a word to a document in the given text corpus. The Term Frequency (TF) is defined as the number of times that a term occurs in a document. A larger TF means the term is referred to more often in the given document, showing that the term is more relevant to the document. There are multiple ways to weight the TF in order to adapt it to different applications; in our experiment, we use the raw count of the term in the document as the TF. The Inverse Document Frequency (IDF) is a value that quantifies the specificity of a term, and it is normally defined as the logarithmically scaled inverse fraction of the number of documents that contain the term. In other words, when a term occurs in a great number of documents, the IDF is numerically low, leading to a low TF-IDF. For instance, the term "the" occurs in almost every English document, leading to a document frequency of almost 1 (100% of the documents in the corpus contain the term "the"). Thus, the IDF of "the" is close to 0, which means that its importance to any document in the corpus is low.
• GloVe Representation: GloVe [41] is an unsupervised learning algorithm for obtaining vector representations for words. The main idea is to map words into a meaningful space where the distance between words is related to semantic similarity. GloVe produces a vector space with meaningful substructure, and it can also find relations such as synonyms between words.
In our experiments, for the deep learning approaches such as the LSTM and our proposed spam Transformer model, the GloVe model is employed to create the representation vectors. Specifically, we used "glove.840B.300d", a pre-trained model with 2.2 million words in its dictionary that converts textual data into 300-dimensional vectors. For the benchmark machine learning algorithms, although the vectors generated by the GloVe model have more dimensions and theoretically contain more information, the TF-IDF representation performs better in practice, presumably due to the limitations of traditional machine learning classifiers. Therefore, the TF-IDF representation is used for calculating representation vectors for the benchmark machine learning algorithms.

E. LOSS FUNCTION
The loss function we used for the deep learning approaches, including the LSTM and the modified spam Transformer, is the Binary Cross Entropy function, which is defined as follows:

l(x_i, y_i) = −w_i [ y_i · log(x_i) + (1 − y_i) · log(1 − x_i) ]    (8)

The weight w_i is the rescaling factor for the loss. Since the SMS Spam Collection v.1 is unbalanced, where spam messages are severely fewer than ham (legitimate) messages, a larger weight is given to the actual spam messages to counteract the negative effect of the unbalanced dataset. The rescaling weight is calculated based on the ratio between the number of ham messages and the number of spam messages.
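A hedged PyTorch sketch of such a weighted binary cross-entropy is shown below; the exact form of the rescaling used by the authors is not spelled out beyond the ham/spam ratio, so the weighting details here are assumptions.

```python
import torch
import torch.nn as nn

def weighted_bce_loss(probs, targets, n_ham, n_spam):
    """Binary cross-entropy of Equation (8) with a larger weight on spam examples.

    probs: predicted spam probabilities in (0, 1); targets: 0.0 for ham, 1.0 for spam.
    """
    spam_weight = n_ham / n_spam  # rescaling factor from the class ratio (assumed form)
    weights = torch.where(targets == 1.0,
                          torch.full_like(targets, spam_weight),
                          torch.ones_like(targets))
    return nn.functional.binary_cross_entropy(probs, targets, weight=weights)
```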
F. MODEL TRAINING
We trained our experiment models on an NVIDIA GeForce RTX 3090 GPU. For the machine learning classifiers, the experiments are performed in the Scikit-learn 0.24.0 [42] environment. For the deep learning approaches, i.e., the LSTM and the spam Transformer model, the experiments are conducted with PyTorch [43].

G. HYPER-PARAMETERS TUNING
In order to tune the models and find the best hyper-parameter set, the Ray Tune [44] library is employed. Ray Tune is a hyper-parameter tuning tool that supports multiple machine learning frameworks. Given a candidate set of hyper-parameters, Ray Tune can find the optimized hyper-parameter setting by training multiple models with different settings and comparing the results automatically. In our experiments, with the help of Ray Tune, we first explored optimal settings for the overall architectural hyper-parameters, such as the number of encoder layers, the number of decoder layers, and the model size. After that, other hyper-parameters, such as the dropout rate and the feed-forward layer size, are tuned under the candidate optimal model settings.
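To illustrate how such a search can be expressed with Ray Tune, the sketch below uses the classic tune.run / tune.report style of the library; the search space, the number of samples, and the helper functions build_model and train_and_validate are hypothetical and do not reproduce the actual setup used in this paper.

```python
from ray import tune

def train_spam_transformer(config):
    # Build and train a model with the given hyper-parameters (training loop omitted),
    # then report the validation score back to Ray Tune.
    model = build_model(num_encoder_layers=config["enc_layers"],   # hypothetical helper
                        num_decoder_layers=config["dec_layers"],
                        d_model=config["d_model"],
                        dropout=config["dropout"])
    f1 = train_and_validate(model)                                  # hypothetical helper
    tune.report(f1=f1)

search_space = {
    "enc_layers": tune.grid_search([2, 4, 6]),
    "dec_layers": tune.grid_search([2, 4, 6]),
    "d_model": tune.choice([256, 512]),
    "dropout": tune.uniform(0.1, 0.3),
}

analysis = tune.run(train_spam_transformer, config=search_space, num_samples=4)
print(analysis.get_best_config(metric="f1", mode="max"))
```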
For the LSTM model, the optimized parameters on both datasets are shown in Table 3. For our modified spam Transformer model on SMS Spam Collection v.1, Table 4 presents the initial hyper-parameters that we started from and the optimized values for which the better result was achieved after tuning. Table 5 shows the initial as well as the optimized hyper-parameters of the modified spam Transformer on UtkMl's Twitter dataset.

TABLE 3. Optimized hyper-parameters for LSTM.

TABLE 4. Initial and optimized hyper-parameters for the modified spam Transformer on SMS Spam Collection v.1.

TABLE 5. Initial and optimized hyper-parameters for the modified spam Transformer on UtkMl's Twitter.

V. RESULTS AND ANALYSIS
A. EVALUATION
We demonstrate the performance of the modified spam Transformer model by comparing it on two datasets with some other typical spam detection classifiers, including Logistic Regression, Naïve Bayes, Random Forests, Support Vector Machine, and Long Short-Term Memory. Besides, for the SMS Spam Collection v.1 dataset, we also compare our models with the CNN-LSTM approach in [22], since it aims to solve the same problem on the same dataset as us.

TABLE 6. Results obtained on SMS Spam Collection v.1.

TABLE 7. Results obtained on UtkMl's Twitter.

Table 6 summarizes the results on the SMS Spam Collection v.1 dataset. For accuracy, our modified spam Transformer model achieved the best value of 98.92%. Concerning precision, the best score came from the Random Forests classifier with a value of 1.0, while our proposed spam Transformer got a value of 0.9781. When it comes to recall, the optimal result came from the spam Transformer model with a value of 0.9451, and the same value came from the Naïve Bayes classifier as well. Finally, in terms of F1-Score, our spam Transformer also achieved the best value of 0.9613. The results of the CNN-LSTM [22] experiment that was conducted
by Ghourabi et al. on the same dataset are also included in Table 6. In Table 8, we present the confusion matrices of all the approaches that we tested in the experiments on the SMS Spam Collection v.1 dataset.
Table 7 summarizes the results on UtkMl's Twitter dataset. The modified spam Transformer model outperformed all other candidates in all four aspects that we tested, with values of 87.06%, 0.8746, 0.8576, and 0.8660 for the accuracy, precision, recall, and F1-Score, respectively. The confusion matrix of the modified spam Transformer model on UtkMl's Twitter is presented in Table 9.

B. ANALYSIS
Although the experimental results show an improved performance of the proposed spam Transformer model compared to the other candidates, the false predictions also indicate a drawback of the proposed model. We analyzed the content of the false prediction samples, including false positive and false negative samples, and found that there were a great number of UNK marks in the data passed to the model, which are produced because the corresponding words were never seen in the training data. In other words, the unknown words obstruct the model from understanding the messages. Besides, SMS messages are usually short, which increases the influence of every single word and makes the unknown words more influential. Indeed, due to the unknown words, the model did not have enough information to detect spam in many of the false prediction cases.
Though our proposed model performs better than the other candidate algorithms on UtkMl's Twitter dataset, the results are still not as good as those on the SMS Spam Collection v.1 dataset. From our observation, the major cause is again the unknown words. Compared to the SMS Spam Collection v.1 dataset, there are more casual language and abbreviations in UtkMl's Twitter dataset, which may be caused by the nature of Twitter posts or by the date of collection of the dataset, as discussed in Section IV-A. Therefore, the negative influence of casual language and abbreviations is more severe on UtkMl's Twitter dataset, and, from our perspective, that is the major cause of more unknown words and eventually worse performance.
In addition, Table 8 and Table 9 show the excellent robustness of our model, which classifies both spam and ham effectively on both the balanced (UtkMl's Twitter) and the unbalanced (SMS Spam Collection v.1) datasets.

VI. CONCLUSION
In this paper, we proposed a modified Transformer model that aims to identify SMS spam. We evaluated our spam Transformer model by comparing it with several other SMS spam detection approaches on the SMS Spam Collection v.1 dataset and UtkMl's Twitter dataset. The experimental results show that, compared to Logistic Regression, Naïve Bayes, Random Forests, Support Vector Machine, Long Short-Term Memory, and CNN-LSTM [22], our proposed spam Transformer model performs better on both datasets.
On the SMS Spam Collection v.1 dataset, our spam Transformer has a better performance in terms of accuracy, recall, and F1-Score compared to the other classifiers. Specifically, our modified spam Transformer approach accomplished an outstanding result on F1-Score.
Additionally, on UtkMl's Twitter dataset, the results from our modified spam Transformer model demonstrate its improved performance on all four aspects in comparison to the other alternative approaches mentioned in this paper. Concretely, our spam Transformer does exceptionally well on recall, which contributes to a distinct F1-Score.

VII. FUTURE WORK
Although the experimental results in this paper have shown an improvement of our proposed spam Transformer model in comparison with some previous approaches to SMS spam detection, we still believe that there is great potential in the model we proposed.
Firstly, since our current two datasets contain only thousands of messages, in the future, we plan to extend our spam Transformer model to a larger dataset with more messages or even other types of content, for the purpose of better performance.
Besides, in our proposed model, we flattened the outputs from the decoders and applied fully-connected linear layers before applying the final activation function and obtaining the prediction. We believe that some dedicated designs or
implementations instead of simple flattening and linear layers could boost the performance, which would be one of the most important pieces of future work.
Additionally, although the experimental results show that our modified model based on the vanilla Transformer performs well on SMS spam detection and confirms the applicability of the Transformer to this problem, the model is still far from optimal. There are improved models based on the Transformer with more complex architectures, such as GPT-3 [4] and BERT [5], that could be explored in the future. Specifically, BERT seems to be a promising starting point for future work, as it is a smaller model and easier to fine-tune.
Finally, as discussed in Section V-B, the proposed model is severely influenced by unknown words in many cases of false prediction. To address this problem, more data pre-processing techniques could be applied. For instance, a larger vocabulary with more words could be a good option, and some semantic operations, such as replacing unknown words with their synonyms, could also be explored. Besides, there are other data pre-processing and feature extraction techniques that could be applied, such as the extraction and analysis of abbreviations, URLs, tags, or emoji in the data.

REFERENCES
[1] P. K. Roy, J. P. Singh, and S. Banerjee, "Deep learning to filter SMS spam," Future Gener. Comput. Syst., vol. 102, pp. 524–533, Jan. 2020.
[2] G. Jain, M. Sharma, and B. Agarwal, "Optimizing semantic LSTM for spam detection," Int. J. Inf. Technol., vol. 11, no. 2, pp. 239–250, Jun. 2019.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5999–6009.
[4] T. B. Brown et al., "Language models are few-shot learners," 2020, arXiv:2005.14165. [Online]. Available: http://arxiv.org/abs/2005.14165
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[6] G. Sonowal and K. S. Kuppusamy, "SmiDCA: An anti-Smishing model with machine learning approach," Comput. J., vol. 61, no. 8, pp. 1143–1157, Aug. 2018.
[7] J. W. Joo, S. Y. Moon, S. Singh, and J. H. Park, "S-detector: An enhanced security model for detecting Smishing attack for mobile computing," Telecommun. Syst., vol. 66, no. 1, pp. 29–38, Sep. 2017.
[8] S. Mishra and D. Soni, "Smishing detector: A security model to detect Smishing through SMS content analysis and URL behavior analysis," Future Gener. Comput. Syst., vol. 108, pp. 803–815, Jul. 2020.
[9] C. Li, L. Hou, B. Y. Sharma, H. Li, C. Chen, Y. Li, X. Zhao, H. Huang, Z. Cai, and H. Chen, "Developing a new intelligent system for the diagnosis of tuberculous pleural effusion," Comput. Methods Programs Biomed., vol. 153, pp. 211–225, Jan. 2018.
[10] T. K. Ho, "Random decision forests," in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), vol. 1, 1995, pp. 278–282.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[12] M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta, "A comparative study of spam SMS detection using machine learning classifiers," in Proc. 11th Int. Conf. Contemp. Comput. (IC3), Aug. 2018, pp. 1–7.
[13] T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami, "Contributions to the study of SMS spam filtering: New collection and results," in Proc. 11th ACM Symp. Document Eng., Sep. 2011, pp. 259–262.
[14] A. K. Jain and B. B. Gupta, "Rule-based framework for detection of Smishing messages in mobile environment," Procedia Comput. Sci., vol. 125, pp. 617–623, 2018.
[15] W. W. Cohen, "Fast effective rule induction," in Machine Learning Proceedings, 1995, pp. 115–123.
[16] J. Cendrowska, "PRISM: An algorithm for inducing modular rules," Int. J. Man-Machine Stud., vol. 27, no. 4, pp. 349–370, Oct. 1987.
[17] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[18] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. COMPSTAT. Physica-Verlag, 2010, pp. 177–186.
[19] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Int. Conf. Learn. Represent., 2013.
[20] G. A. Miller, "WordNet: A lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
[21] H. Liu and P. Singh, "ConceptNet — A practical commonsense reasoning tool-kit," BT Technol. J., vol. 22, no. 4, pp. 211–226, Oct. 2004.
[22] A. Ghourabi, M. A. Mahmood, and Q. M. Alzubi, "A hybrid CNN-LSTM model for SMS spam detection in Arabic and English messages," Future Internet, vol. 12, no. 9, p. 156, Sep. 2020.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[24] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[25] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. 30th Int. Conf. Mach. Learn. (ICML), 2013, pp. 2347–2355.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[27] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734.
[28] J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber, "A clockwork RNN," in Proc. 31st Int. Conf. Mach. Learn. (ICML), vol. 5, 2014, pp. 3881–3889.
[29] C. Zhou, C. Sun, Z. Liu, and F. C. M. Lau, "A C-LSTM neural network for text classification," 2015, arXiv:1511.08630. [Online]. Available: http://arxiv.org/abs/1511.08630
[30] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 4, Sep. 2014, pp. 3104–3112.
[31] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proc. Interspeech, Aug. 2017, pp. 939–943.
[32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence — Video to text," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4534–4542.
[33] D. Bahdanau, K. H. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015.
[34] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Mach. Learn., vol. 3, 2015, pp. 2048–2057.
[35] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2015, pp. 1412–1421.
[36] E. S. D. Reis, C. A. D. Costa, D. E. D. Silveira, R. S. Bavaresco, R. D. R. Righi, J. L. V. Barbosa, R. S. Antunes, M. M. Gomes, and G. Federizzi, "Transformers aftermath," Commun. ACM, vol. 64, no. 4, pp. 154–163, Apr. 2021.
[37] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," 2012, arXiv:1207.0580. [Online]. Available: http://arxiv.org/abs/1207.0580
[38] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101. [Online]. Available: http://arxiv.org/abs/1711.05101
[39] UtkMl's Twitter Spam Detection Competition, UtkMl, Kaggle.
[40] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, "spaCy: Industrial-strength natural language processing in Python," 2020, doi: 10.5281/zenodo.1212303.
[41] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1532–1543.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[43] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., 2019, pp. 8024–8035.
[44] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, "Tune: A research platform for distributed model selection and training," 2018, arXiv:1807.05118. [Online]. Available: http://arxiv.org/abs/1807.05118

HAOYE LU (Member, IEEE) received the joint B.Sc. degree in computer science and mathematics, in 2017, and the master's degree in computer science, in 2019. In 2013, he joined the University of Ottawa, Canada. He is currently working as a Research Associate. His research interests include artificial intelligence and network structures.