A Spam Transformer Model For SMS Spam Detection
ABSTRACT In this paper, we explore the potential of the Transformer model for detecting spam Short Message Service (SMS) messages by proposing a modified Transformer model designed for SMS spam detection. The proposed spam Transformer is evaluated on the SMS Spam Collection v.1 dataset and UtkMl's Twitter Spam Detection Competition dataset, against a benchmark of multiple established machine learning classifiers and state-of-the-art SMS spam detection approaches. In comparison to all other candidates, our experiments on SMS spam detection show that the proposed modified spam Transformer achieves the best accuracy, recall, and F1-Score, with values of 98.92%, 0.9451, and 0.9613, respectively. In addition, the proposed model also achieves good performance on UtkMl's Twitter dataset, which indicates a promising possibility of adapting the model to other similar problems.
machine learning classifiers, an LSTM deep learning solution, and our proposed spam Transformer model.

B. RELATED WORK
Several different machine learning based classification applications have been proposed in the last few decades [6], [7], [8], [9]. In the field of SMS spam detection, a great number of these approaches are based on traditional machine learning techniques, such as Logistic Regression (LR), Random Forest (RF) [10], Support Vector Machine (SVM) [11], Naïve Bayes (NB), and Decision Trees (DT). Recently, with the prosperity of deep learning techniques, an increasing number of methods have been introduced to address the SMS spam problem using deep learning based solutions such as the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), which is a successful variant of the RNN.

In [12], Gupta et al. compared the performance of 8 different classifiers including SVM, NB, DT, LR, RF, AdaBoost, Neural Network, and CNN. The experimental tests on the SMS Spam Collection v.1 [13] dataset that were conducted by the authors show that the CNN and Neural Network perform better than the other machine learning classifiers, achieving accuracies of 98.25% and 98.00%, respectively.

In [14], Jain et al. proposed a method to apply rule-based models to the SMS spam detection problem. The authors extracted 9 rules and implemented Decision Tree (DT), RIPPER [15], and PRISM [16] to identify spam messages. According to the experimental results from the authors, RIPPER outperformed PRISM and the DT, yielding a 99.01% True Negative Rate (TNR) and a 92.82% True Positive Rate (TPR).

In [1], Roy et al. aimed to adapt the CNN and LSTM to the SMS spam message detection problem. The authors evaluated the performance of the CNN and LSTM by comparing them with Naïve Bayes (NB), Random Forest (RF), Gradient Boosting (GB) [17], Logistic Regression (LR), and Stochastic Gradient Descent (SGD) [18]. The experiments that were conducted by the authors showed that the CNN and LSTM perform significantly better than the tested traditional machine learning approaches when it comes to SMS spam detection.

In [2], the authors proposed the Semantic Long Short-Term Memory (SLSTM), a variant of LSTM with an additional semantic layer. The authors employed Word2vec [19], WordNet [20], and ConceptNet [21] as the semantic layer, and combined the semantic layer with the LSTM to train an SMS spam detection model. The experimental evaluation that was conducted by the authors claimed that the SLSTM achieved an accuracy of 99% on the SMS Spam Collection v.1 dataset.

In [22], Ghourabi et al. proposed the CNN-LSTM model, which consists of a CNN layer and an LSTM layer, in order to identify SMS spam messages in English and Arabic. The authors evaluated the CNN-LSTM by comparing it with the CNN, the LSTM, and 9 traditional machine learning solutions. The experimental tests that were conducted by the authors showed that the CNN-LSTM solution performed better than the other approaches, yielding an accuracy of 98.3% and an F1-Score of 0.914.

C. PAPER ORGANIZATION
The rest of the paper is organized as follows. Section II provides the background and details of the LSTM and our spam Transformer approaches. Concretely, Section II-A introduces the architecture of the RNN, followed by one of its most successful variants, the LSTM, in Section II-B. We then introduce Sequence-to-Sequence in Section II-C, the attention mechanism in Section II-D, and the original version of the Transformer for translation tasks in Section II-E. Furthermore, Section III discusses the modified spam Transformer that we propose in detail. Afterward, Section IV describes the experiment design, and Section V presents the results and analysis. Finally, we conclude in Section VI and describe the future work in Section VII.

II. DEEP LEARNING APPROACHES
While traditional machine learning techniques do perform well in many fields, they still require considerable intervention or guidance from human specialists when applied to real problems. For instance, extracting and representing features from data is always challenging but indispensable work for machine learning practitioners. In other words, the limited capacity of many traditional machine learning classifiers is a major obstacle to more effective and large-scale application. In contrast, many deep learning techniques are able not only to learn far more features but also to extract higher-level features formed by the composition of lower-level ones. With an effective training process, deep learning techniques are better able to consume and make good use of large amounts of data, and thus perform better than traditional machine learning approaches, especially on difficult tasks.

A. RECURRENT NEURAL NETWORK
Shuffling the order of words in a sentence can severely change the meaning of the entire sentence, which could potentially turn a legitimate message into a spam message, and vice versa. Therefore, in many Natural Language Processing (NLP) problems, the order of words is no less important than the words themselves. To address this problem, we need a kind of model that can effectively learn from prior context to improve its understanding of the data. Although the classical feed-forward neural network is a powerful deep learning technique that generally works well in many areas, it cannot utilize information from the past. Derived from the feed-forward neural network, the recurrent neural network (RNN) [23] has the ability to reuse saved information when processing input values. Additionally, unlike the traditional feed-forward neural network
introduces a positional encoding function based on sine and cosine functions of different frequencies.
In the vanilla Transformer model designed for language translation tasks, source language texts and shifted-right target language texts are first sent to embedding layers as the input sequence and the output sequence. Secondly, positional information is injected into the input and output sequences in the positional encoding layer. After that, the input and output sequences are fed into the encoders and decoders, respectively. Then, the Multi-Head Attention layers and fully-connected feed-forward layers, combined as a single encoder or decoder, produce outputs of dimension d_model. The results of the decoders are passed to a linear layer. Finally, the softmax function is applied to the output of the linear layer, producing the translation in the target language.
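To make the sinusoidal encoding mentioned above concrete, the following is a minimal PyTorch sketch of the positional encoding defined in [3]; the function name and tensor shapes are our own illustration rather than the implementation used in this paper.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from Vaswani et al. [3] (assumes an even d_model).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Usage: add the encoding to a batch of embeddings of shape (batch, seq_len, d_model):
# embeddings = embeddings + positional_encoding(embeddings.size(1), embeddings.size(2))
```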
III. PROPOSED MODIFIED TRANSFORMER MODEL FOR SMS SPAM DETECTION
In Fig. 3, the main architecture of the modified Transformer model for SMS spam detection is described. In order to apply the Transformer model to the SMS spam detection task, two major modifications are made to the vanilla Transformer model, which are described in Section III-A and Section III-B, respectively. After that, several implementation details are discussed.

A. MEMORY
The first modification for the SMS spam detection task is the introduction of memory. Since there is no output sequence (target sequence) in the SMS spam detection task, we use a list of trainable parameters named "memory" as a substitute for the output sequence embedding. The length of the memory is a configurable hyper-parameter. Each element of the memory is a vector of dimension d_model, so that it can be adapted to the Transformer model without any extra projection. In other words, the memory is a matrix of dimension len_memory × d_model. The output embedding layer in the original Transformer model is also removed, since there are no target sequence texts to be mapped to numeric vectors. Similar to the output sequence in the vanilla Transformer model, positional information is injected into the memory at the positional encoding layer before it is fed into the decoders.
During the training process, the parameters of the memory are trained, and the memory matrix is expected to contain the important information that helps predict whether or not a message is spam. Therefore, in the decoders of the modified spam Transformer model, with the help of the attention mechanism, the memory helps locate the significant parts of the encoder stack's output sequence that summarize the message, and eventually helps classify spam SMS messages.
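As a rough sketch of this design (not the authors' released code), the memory can be realized as a trainable parameter matrix that takes the place of the target-side embedding. The class name SpamTransformer, the use of torch.nn.Transformer, the default sizes, and the flatten-plus-linear head below are assumptions made for illustration; positional encoding and dropout are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpamTransformer(nn.Module):
    """Sketch of the modified Transformer: a trainable 'memory' replaces the
    target-sequence embedding of the vanilla encoder-decoder model."""

    def __init__(self, d_model: int = 512, nhead: int = 8,
                 len_memory: int = 8, num_layers: int = 2):
        super().__init__()
        # Trainable memory of shape (len_memory, d_model); it is fed to the
        # decoder in place of the shifted target sequence.
        self.memory = nn.Parameter(torch.randn(len_memory, d_model))
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.classifier = nn.Linear(len_memory * d_model, 1)

    def forward(self, src, src_key_padding_mask=None):
        # src: (batch, seq_len, d_model) word embeddings with positional encoding added.
        batch = src.size(0)
        tgt = self.memory.unsqueeze(0).expand(batch, -1, -1)  # broadcast memory per example
        out = self.transformer(src, tgt,
                               src_key_padding_mask=src_key_padding_mask)
        return torch.sigmoid(self.classifier(out.flatten(1)))  # spam probability
```

Broadcasting the same memory matrix across the batch keeps the number of extra parameters independent of the input length.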
Equation (2), as the final activation function, is applied to the output of the linear layers after the decoders, generating a binary result that predicts whether or not the message is spam.

C. DROPOUT
Dropout [37] is a powerful technique published by Hinton et al. in 2012 in order to prevent over-fitting in large feed-forward neural networks. Concretely, Dropout refers to randomly omitting some nodes in those large feed-forward layers on each specific training case. The modified spam Transformer model that we propose employs multiple feed-forward layers; thus, the Dropout technique is implemented in the feed-forward layers of our spam Transformer model. Besides, the Dropout technique is also used in the positional encoding and in the calculation of the attention function.

D. BATCHES AND PADDING
During each epoch of training our proposed models, the whole training set is divided into multiple batches. As the lengths of the messages within the same batch should be equal, some padding words (empty words) have to be added to the shorter message vectors, which interferes with the detection to some extent. Therefore, the algorithm that divides the training set into batches is designed to minimize the number of padding words. Specifically, the training data is first sorted by message length, and the batches are then created from the sorted messages so as to minimize the padding words.
Admittedly, adding padding words may have a negative influence on the model. However, using batches has proved to be a good idea for model training, as it increases the training speed considerably; in fact, a larger batch size greatly accelerates training. Additionally, the negative influence of padding words is addressed by minimizing their use. Besides, padding masks are also passed into the model along with the training batches so that the Transformer model can ignore the padding words during training.
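A minimal sketch of this length-based batching is given below; the function and variable names are ours, and the authors' actual batching code may differ in its details.

```python
def make_batches(token_ids, labels, batch_size=32, pad_id=0):
    """Group messages of similar length to reduce padding.

    token_ids: list of token-id lists; labels: list of 0/1 labels.
    Returns batches of (padded_ids, padding_mask, labels); mask is True at pad positions.
    """
    # Sort messages by length so that each batch contains similarly sized messages.
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(token_ids[i]) for i in idx)
        padded, mask, ys = [], [], []
        for i in idx:
            seq = token_ids[i]
            pad = max_len - len(seq)
            padded.append(seq + [pad_id] * pad)               # append padding words
            mask.append([False] * len(seq) + [True] * pad)    # mark padded positions
            ys.append(labels[i])
        batches.append((padded, mask, ys))
    return batches
```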
E. OPTIMIZATION AND LEARNING RATE
Gradient descent is employed to optimize our modified spam Transformer model. The main idea of the gradient descent algorithm is to minimize the loss function of the model by updating the parameters in the direction opposite to the gradient of the loss function, where the gradient is the vector of partial derivatives of the loss function with respect to the parameters. There are plenty of variant optimizers of gradient descent; we use the AdamW [38] optimizer for our proposed modified spam Transformer with β1 = 0.9, β2 = 0.98, and ε = 10^−9. The learning rate is a critical hyper-parameter in machine learning. It is defined as the step size for updating the parameters, which basically represents the speed of learning of the model. Setting the learning rate too high leads to a situation where the model fails to locate the best parameters (weights and biases), while a learning rate that is too small keeps the model stuck around a local optimum rather than finding a better parameter solution. For the modified spam Transformer model, the same way of determining the learning rate as mentioned in [3] is utilized: the learning rate first increases linearly until reaching warmup_steps steps and then decreases proportionally to the inverse square root of the step number. Concretely, we used warmup_steps = 8000.
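The schedule from [3] can be expressed as lr(step) = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). A hedged PyTorch sketch using AdamW and LambdaLR follows; the base scale and the d_model value here are illustrative assumptions.

```python
import torch

def make_optimizer(model, d_model=512, warmup_steps=8000):
    """AdamW with the warmup schedule of [3]:
    lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.0,
                                  betas=(0.9, 0.98), eps=1e-9)

    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# After each optimizer.step(), call scheduler.step() to advance the schedule.
```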
F. DATAFLOW OF MODIFIED TRANSFORMER
As is shown in Fig. 3, the input messages are first converted into word embeddings using the GloVe model. Following this, the memory (trainable parameters) and the embeddings of the input sequence are positionally encoded, respectively. Then, the processed message vectors are passed to the encoder layers, where multi-head self-attention is performed and the important parts of the input sequence are given larger weights. The results of the encoder layers are passed to the decoder layers. In the decoder layers, multi-head self-attention is computed on the memory. After that, multi-head attention is executed based on the results of the encoder layers and the processed memory. Finally, the decoded vectors are sent to fully-connected linear layers, followed by a final activation function for classification.

IV. EXPERIMENT
A. DATASETS
In the experiments, two different datasets are utilized. The first dataset is the SMS Spam Collection v.1 [13] dataset, which is a labeled SMS message dataset collected for mobile phone spam research. The second one is the UtkMl's Twitter Spam Detection Competition (UtkMl's Twitter) dataset [39] from Kaggle. Table 1 shows the overview statistics of the two datasets.

TABLE 1. The statistics of the two datasets.

Although Twitter posts are not precisely the same as SMS messages, they still have several things in common. For instance, both are usually shorter than roughly 100 words, and people tend to use casual language and abbreviations in both Twitter posts and SMS messages. Therefore, UtkMl's Twitter dataset can also be used to test our model. Besides, we can analyze the extensibility of our model by comparing its performance on these two datasets. In comparison with the SMS Spam Collection v.1 [13] dataset, UtkMl's Twitter dataset contains more data in both the spam and ham classes. Besides, UtkMl's Twitter dataset is balanced, since the numbers of spam messages and ham messages are approximately equal. In terms of language, although a lot of casual language and abbreviations are used in both datasets, they appear more frequently in UtkMl's Twitter dataset. The reason for this observation may be the nature of Twitter posts. Alternatively, it could also be the date at which the dataset was collected, as SMS Spam Collection v.1 was published in 2011.

TABLE 2. The confusion matrix.

B. EVALUATION MEASURES
In order to evaluate the performance of the proposed modified spam Transformer model, metrics such as accuracy, precision, recall, and F1-Score are used in the experiments. All these metrics are calculated based on the confusion matrix (Table 2). As mentioned in the previous section, the spam messages in the SMS Spam Collection v.1 dataset are significantly fewer than the ham messages, which means that the dataset is unbalanced. Therefore, accuracy alone is not sufficient to evaluate the performance of the proposed model, and the F1-Score is also employed in the experiments. The accuracy, precision, recall, and F1-Score are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)    (7)

The precision, also known as the positive predictive value, represents the percentage of the predicted positive cases that are actually positive, i.e., the probability that the classifier is correct given that it predicts positive. The recall, also known as sensitivity, denotes the number of true positive instances divided by the number of actual positive instances, which can also be described as the percentage of the positive cases that are identified successfully. The F1-Score is the harmonic mean of precision and recall, which measures the performance of a classifier in terms of precision and recall in a balanced way.
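These four measures follow directly from the confusion-matrix counts; the short sketch below is only a worked illustration of Equations (4)-(7).

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute the measures of Equations (4)-(7) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```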
C. DATA SPLITTING
For the traditional machine learning approaches, the data is divided into a training set (70%) and a test set (30%). For the LSTM and our proposed modified spam Transformer model, the data is split into a training set (50%), a validation set (20%), and a test set (30%), where the validation set is used after each epoch of training to help us select the best model and perform early stopping to avoid over-fitting.

D. DATA PRE-PROCESSING
The textual messages in the dataset are first tokenized. Tokenization refers to the task of splitting text into meaningful words. Specifically, the spaCy [40] library is employed for data pre-processing in order to tokenize the data.
After that, numeric representation vectors (word embeddings) are calculated based on the textual messages. Two major methods of calculating representation vectors are employed in our experiments.
• TF-IDF Representation: TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used numerical statistic in NLP. It is designed to reflect the importance of a word to a document in the given text corpus. The Term Frequency (TF) is defined as the number of times that a term occurs in a document. A larger TF means the term is referred to more often in the given document, showing that the term is more relevant to the document. There are multiple ways to weight the TF in order to adapt it to different applications; in our experiment, we use the raw count of the term in the document as the TF. The Inverse Document Frequency (IDF) is a value that quantifies the specificity of a term, and it is normally defined as the logarithmically scaled inverse fraction of the number of documents that contain the term. In other words, when a term occurs in a great number of documents, the IDF is numerically low, leading to a low TF-IDF. For instance, the term "the" occurs in almost every English document, leading to a document frequency of almost 1 (100% of the documents in the corpus contain the term "the"). Thus, the IDF of "the" is close to 0, which means that its importance to any document in the corpus is low.
• GloVe Representation: GloVe [41] is an unsupervised learning algorithm for obtaining vector representations for words. The main idea is to map words into a meaningful space where the distance between words is related to semantic similarity. GloVe produces a vector space with meaningful substructure, and it can also find relations such as synonyms between words.
In our experiments, for the deep learning approaches such as the LSTM and our proposed spam Transformer model, the GloVe model is employed to create the representation vectors. Specifically, we used "glove.840B.300d", a pre-trained model with 2.2 million words in its dictionary that converts textual data into 300-dimensional vectors. For the benchmark machine learning algorithms, although the vectors generated by the GloVe model have more dimensions and theoretically contain more information, the TF-IDF representation performs better in practice, presumably due to the limitations of traditional machine learning classifiers. Therefore, the TF-IDF representation is used for calculating representation vectors for the benchmark machine learning algorithms.

E. LOSS FUNCTION
The loss function we used for the deep learning approaches, including the LSTM and the modified spam Transformer, is the Binary Cross Entropy function, which is defined as follows:

l(x_i, y_i) = −w_i [ y_i · log(x_i) + (1 − y_i) · log(1 − x_i) ]    (8)

The weight w_i is the rescaling factor for the loss. Since the SMS Spam Collection v.1 is unbalanced, where spam messages are severely fewer than ham (legitimate) messages, a larger weight is given to the actual spam messages to counteract the negative effect of the unbalanced dataset. The rescaling weight is calculated based on the ratio between the number of ham messages and the number of spam messages.
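A hedged PyTorch sketch of such a weighted binary cross-entropy is shown below; the exact form of the rescaling used by the authors is not spelled out beyond the ham/spam ratio, so the weighting details here are assumptions.

```python
import torch
import torch.nn as nn

def weighted_bce_loss(probs, targets, n_ham, n_spam):
    """Binary cross-entropy of Equation (8) with a larger weight on spam examples.

    probs: predicted spam probabilities in (0, 1); targets: 0.0 for ham, 1.0 for spam.
    """
    spam_weight = n_ham / n_spam  # rescaling factor from the class ratio (assumed form)
    weights = torch.where(targets == 1.0,
                          torch.full_like(targets, spam_weight),
                          torch.ones_like(targets))
    return nn.functional.binary_cross_entropy(probs, targets, weight=weights)
```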
F. MODEL TRAINING
We trained our experiment models on an NVIDIA GeForce RTX 3090 GPU. For the machine learning classifiers, the experiments are performed in the Scikit-learn 0.24.0 [42] environment. For the deep learning approaches, i.e., the LSTM and the spam Transformer model, the experiments are conducted with PyTorch [43].

G. HYPER-PARAMETERS TUNING
In order to tune the models and find the best hyper-parameter set, the Ray Tune [44] library is employed. Ray Tune is a hyper-parameter tuning tool that supports multiple machine learning frameworks. Given a candidate set of hyper-parameters, Ray Tune can find the optimized hyper-parameter setting by training multiple models with different settings and comparing the results automatically. In our experiments, with the help of Ray Tune, we first explored optimal settings for the overall architectural hyper-parameters, such as the number of encoder layers, the number of decoder layers, and the model size. After that, other hyper-parameters, such as the dropout rate and the feed-forward layer size, are tuned under the candidate optimal model settings.
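To illustrate how such a search can be expressed with Ray Tune, the sketch below uses the classic tune.run / tune.report style of the library; the search space, the number of samples, and the helper functions build_model and train_and_validate are hypothetical and do not reproduce the actual setup used in this paper.

```python
from ray import tune

def train_spam_transformer(config):
    # Build and train a model with the given hyper-parameters (training loop omitted),
    # then report the validation score back to Ray Tune.
    model = build_model(num_encoder_layers=config["enc_layers"],   # hypothetical helper
                        num_decoder_layers=config["dec_layers"],
                        d_model=config["d_model"],
                        dropout=config["dropout"])
    f1 = train_and_validate(model)                                  # hypothetical helper
    tune.report(f1=f1)

search_space = {
    "enc_layers": tune.grid_search([2, 4, 6]),
    "dec_layers": tune.grid_search([2, 4, 6]),
    "d_model": tune.choice([256, 512]),
    "dropout": tune.uniform(0.1, 0.3),
}

analysis = tune.run(train_spam_transformer, config=search_space, num_samples=4)
print(analysis.get_best_config(metric="f1", mode="max"))
```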
For the LSTM model, the optimized parameters on both datasets are shown in Table 3. For our modified spam Transformer model on SMS Spam Collection v.1, Table 4 presents the initial hyper-parameters that we started from and the optimized values for which the better result was achieved after tuning. Table 5 shows the initial as well as the optimized hyper-parameters of the modified spam Transformer on UtkMl's Twitter dataset.

TABLE 3. Optimized hyper-parameters for LSTM.

TABLE 4. Initial and optimized hyper-parameters for the modified spam Transformer on SMS Spam Collection v.1.

TABLE 5. Initial and optimized hyper-parameters for the modified spam Transformer on UtkMl's Twitter.

V. RESULTS AND ANALYSIS
A. EVALUATION
We demonstrate the performance of the modified spam Transformer model by comparing it on two datasets with some other typical spam detection classifiers, including Logistic Regression, Naïve Bayes, Random Forests, Support Vector Machine, and Long Short-Term Memory. Besides, for the SMS Spam Collection v.1 dataset, we also compare our models with the CNN-LSTM approach in [22], since it aims to solve the same problem on the same dataset as us.

TABLE 6. Results obtained on SMS Spam Collection v.1.

TABLE 7. Results obtained on UtkMl's Twitter.

Table 6 summarizes the results on the SMS Spam Collection v.1 dataset. For accuracy, our modified spam Transformer model achieved the best value of 98.92%. Concerning precision, the best score came from the Random Forests classifier with a value of 1.0, while our proposed spam Transformer got a value of 0.9781. When it comes to recall, the optimal result came from the spam Transformer model with a value of 0.9451, and the same value came from the Naïve Bayes classifier as well. Finally, in terms of F1-Score, our spam Transformer also achieved the best value of 0.9613. The results of the CNN-LSTM [22] experiment that was conducted
by Ghourabi et al. on the same dataset are also included in Table 6. In Table 8, we present the confusion matrices of all the approaches that we tested in the experiments on the SMS Spam Collection v.1 dataset.
Table 7 summarizes the results on UtkMl's Twitter dataset. The modified spam Transformer model outperformed all other candidates in all four aspects that we tested, with values of 87.06%, 0.8746, 0.8576, and 0.8660 for the accuracy, precision, recall, and F1-Score, respectively. The confusion matrix of the modified spam Transformer model on UtkMl's Twitter is presented in Table 9.

B. ANALYSIS
Although the experimental results show an improved performance of the proposed spam Transformer model compared to the other candidates, the false predictions also indicate a drawback of the proposed model. We analyzed the content of the false prediction samples, including false positive and false negative samples, and found that there were a great number of UNK marks in the data passed to the model, which are produced because the corresponding words were never seen in the training data. In other words, the unknown words obstruct the model from understanding the messages. Besides, SMS messages are usually short, which increases the influence of every single word and makes the unknown words more influential. Indeed, due to the unknown words, the model did not have enough information to detect spam in many of the false prediction cases.
Though our proposed model performs better than the other candidate algorithms on UtkMl's Twitter dataset, the results are still not as good as those on the SMS Spam Collection v.1 dataset. From our observation, the major cause is again the unknown words. Compared to the SMS Spam Collection v.1 dataset, there are more casual language and abbreviations in UtkMl's Twitter dataset, which may be caused by the nature of Twitter posts or by the date of collection of the dataset, as discussed in Section IV-A. Therefore, the negative influence of casual language and abbreviations is more severe on UtkMl's Twitter dataset, and, from our perspective, that is the major cause of more unknown words and eventually worse performance.
In addition, Table 8 and Table 9 show the excellent robustness of our model, which classifies both spam and ham effectively on both the balanced (UtkMl's Twitter) and the unbalanced (SMS Spam Collection v.1) datasets.

VI. CONCLUSION
In this paper, we proposed a modified Transformer model that aims to identify SMS spam. We evaluated our spam Transformer model by comparing it with several other SMS spam detection approaches on the SMS Spam Collection v.1 dataset and UtkMl's Twitter dataset. The experimental results show that, compared to Logistic Regression, Naïve Bayes, Random Forests, Support Vector Machine, Long Short-Term Memory, and CNN-LSTM [22], our proposed spam Transformer model performs better on both datasets.
On the SMS Spam Collection v.1 dataset, our spam Transformer has a better performance in terms of accuracy, recall, and F1-Score compared to the other classifiers. Specifically, our modified spam Transformer approach accomplished an outstanding result on F1-Score.
Additionally, on UtkMl's Twitter dataset, the results from our modified spam Transformer model demonstrate its improved performance on all four aspects in comparison to the other alternative approaches mentioned in this paper. Concretely, our spam Transformer does exceptionally well on recall, which contributes to a distinct F1-Score.

VII. FUTURE WORK
Although the experimental results in this paper have shown an improvement of our proposed spam Transformer model in comparison with some previous approaches to SMS spam detection, we still believe that there is great potential in the model we proposed.
Firstly, since our current two datasets contain only thousands of messages, in the future, we plan to extend our spam Transformer model to a larger dataset with more messages or even other types of content, for the purpose of better performance.
Besides, in our proposed model, we flattened the outputs from the decoders and applied fully-connected linear layers before applying the final activation function and obtaining the prediction. We believe that some dedicated designs or
implementations instead of simple flattening and linear layers could boost the performance, which would be one of the most important pieces of future work.
Additionally, although the experimental results show that our modified model based on the vanilla Transformer performs well on SMS spam detection and confirms the applicability of the Transformer to this problem, the model is still far from optimal. There are improved models based on the Transformer with more complex architectures, such as GPT-3 [4] and BERT [5], that could be explored in the future. Specifically, BERT seems to be a promising starting point for future work, as it is a smaller model and easier to fine-tune.
Finally, as discussed in Section V-B, the proposed model is severely influenced by unknown words in many cases of false prediction. To address this problem, more data pre-processing techniques could be applied. For instance, a larger vocabulary with more words could be a good option, and some semantic operations, such as replacing unknown words with their synonyms, could also be explored. Besides, there are other data pre-processing and feature extraction techniques that could be applied, such as the extraction and analysis of abbreviations, URLs, tags, or emoji in the data.

REFERENCES
[1] P. K. Roy, J. P. Singh, and S. Banerjee, "Deep learning to filter SMS spam," Future Gener. Comput. Syst., vol. 102, pp. 524–533, Jan. 2020.
[2] G. Jain, M. Sharma, and B. Agarwal, "Optimizing semantic LSTM for spam detection," Int. J. Inf. Technol., vol. 11, no. 2, pp. 239–250, Jun. 2019.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5999–6009.
[4] T. B. Brown et al., "Language models are few-shot learners," 2020, arXiv:2005.14165. [Online]. Available: http://arxiv.org/abs/2005.14165
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[6] G. Sonowal and K. S. Kuppusamy, "SmiDCA: An anti-Smishing model with machine learning approach," Comput. J., vol. 61, no. 8, pp. 1143–1157, Aug. 2018.
[7] J. W. Joo, S. Y. Moon, S. Singh, and J. H. Park, "S-detector: An enhanced security model for detecting Smishing attack for mobile computing," Telecommun. Syst., vol. 66, no. 1, pp. 29–38, Sep. 2017.
[8] S. Mishra and D. Soni, "Smishing detector: A security model to detect Smishing through SMS content analysis and URL behavior analysis," Future Gener. Comput. Syst., vol. 108, pp. 803–815, Jul. 2020.
[9] C. Li, L. Hou, B. Y. Sharma, H. Li, C. Chen, Y. Li, X. Zhao, H. Huang, Z. Cai, and H. Chen, "Developing a new intelligent system for the diagnosis of tuberculous pleural effusion," Comput. Methods Programs Biomed., vol. 153, pp. 211–225, Jan. 2018.
[10] T. K. Ho, "Random decision forests," in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), vol. 1, 1995, pp. 278–282.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[12] M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta, "A comparative study of spam SMS detection using machine learning classifiers," in Proc. 11th Int. Conf. Contemp. Comput. (IC3), Aug. 2018, pp. 1–7.
[13] T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami, "Contributions to the study of SMS spam filtering: New collection and results," in Proc. 11th ACM Symp. Document Eng., Sep. 2011, pp. 259–262.
[14] A. K. Jain and B. B. Gupta, "Rule-based framework for detection of Smishing messages in mobile environment," Procedia Comput. Sci., vol. 125, pp. 617–623, 2018.
[15] W. W. Cohen, "Fast effective rule induction," in Machine Learning Proceedings, 1995, pp. 115–123.
[16] J. Cendrowska, "PRISM: An algorithm for inducing modular rules," Int. J. Man-Machine Stud., vol. 27, no. 4, pp. 349–370, Oct. 1987.
[17] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[18] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. COMPSTAT. Physica-Verlag, 2010, pp. 177–186.
[19] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proc. Int. Conf. Learn. Represent., 2013.
[20] G. A. Miller, "WordNet: A lexical database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
[21] H. Liu and P. Singh, "ConceptNet — A practical commonsense reasoning tool-kit," BT Technol. J., vol. 22, no. 4, pp. 211–226, Oct. 2004.
[22] A. Ghourabi, M. A. Mahmood, and Q. M. Alzubi, "A hybrid CNN-LSTM model for SMS spam detection in Arabic and English messages," Future Internet, vol. 12, no. 9, p. 156, Sep. 2020.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[24] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[25] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. 30th Int. Conf. Mach. Learn. (ICML), 2013, pp. 2347–2355.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[27] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1724–1734.
[28] J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber, "A clockwork RNN," in Proc. 31st Int. Conf. Mach. Learn. (ICML), vol. 5, 2014, pp. 3881–3889.
[29] C. Zhou, C. Sun, Z. Liu, and F. C. M. Lau, "A C-LSTM neural network for text classification," 2015, arXiv:1511.08630. [Online]. Available: http://arxiv.org/abs/1511.08630
[30] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 4, Sep. 2014, pp. 3104–3112.
[31] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proc. Interspeech, Aug. 2017, pp. 939–943.
[32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence — Video to text," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4534–4542.
[33] D. Bahdanau, K. H. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015.
[34] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. 32nd Int. Conf. Mach. Learn., vol. 3, 2015, pp. 2048–2057.
[35] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2015, pp. 1412–1421.
[36] E. S. D. Reis, C. A. D. Costa, D. E. D. Silveira, R. S. Bavaresco, R. D. R. Righi, J. L. V. Barbosa, R. S. Antunes, M. M. Gomes, and G. Federizzi, "Transformers aftermath," Commun. ACM, vol. 64, no. 4, pp. 154–163, Apr. 2021.
[37] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," 2012, arXiv:1207.0580. [Online]. Available: http://arxiv.org/abs/1207.0580
[38] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2017, arXiv:1711.05101. [Online]. Available: http://arxiv.org/abs/1711.05101
[39] UtkMl's Twitter Spam Detection Competition, UtkMl, Kaggle.
[40] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, "spaCy: Industrial-strength natural language processing in Python," 2020, doi: 10.5281/zenodo.1212303.
[41] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1532–1543.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[43] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., 2019, pp. 8024–8035.
[44] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, "Tune: A research platform for distributed model selection and training," 2018, arXiv:1807.05118. [Online]. Available: http://arxiv.org/abs/1807.05118

HAOYE LU (Member, IEEE) received the joint B.Sc. degree in computer science and mathematics, in 2017, and the master's degree in computer science, in 2019. In 2013, he joined the University of Ottawa, Canada. He is currently working as a Research Associate. His research interests include artificial intelligence and network structures.