Heliyon
journal homepage: www.cell.com/heliyon
Research article
ARTICLE INFO
Keywords: Fake-news; CNN; LSTM; NLP
ABSTRACT
The widespread dissemination of false information across various online platforms has emerged
as a matter of paramount concern due to the potential harm it poses to individuals, communities,
and entire nations. Substantial efforts are currently underway in the research community to
combat this issue. A burgeoning area of study gaining significant traction is the development of
fake news identification techniques. However, this field faces formidable challenges primarily
stemming from limited resources, including access to comprehensive datasets, computational
resources, and evaluation tools. To overcome these challenges, researchers are exploring various
methodologies. One promising approach involves the use of feature abstraction and vectorization
techniques. In this context, we highly recommend utilizing the Python scikit-learn module,
which offers many invaluable tools such as the Count Vectorizer and TfidfVectorizer. These tools
enable the efficient handling of text data by converting it into numerical representations, thereby
facilitating subsequent analysis. Once the text data is appropriately transformed, the next crucial
step involves feature selection. To achieve optimal results, researchers often employ feature
selection methods based on confusion matrices. These methods allow for the exploration and
selection of the most suitable features, which are essential for achieving the highest accuracy in
fake news identification.
1. Introduction
In today’s digital age, the ease of disseminating information across the internet is undeniable. Unfortunately, this convenience has
also paved the way for the rampant spread of counterfeit news, particularly through online social media platforms. Counterfeit news
consists of misleading information that can be difficult to authenticate. This misinformation can perpetuate falsehoods within a
particular nation or exaggerate the costs of government services, potentially leading to instability in some regions. Although there are
organizations dedicated to addressing issues related to author accountability, their effectiveness is limited due to their reliance on
manual fact-checking, which is neither scalable nor reliable in a world where countless articles are created and shared daily.
One potential solution to address this issue is the establishment of a comprehensive system designed to provide a dependable
* Corresponding author.
** Corresponding author.
E-mail addresses: [email protected] (M. Gupta), [email protected] (A. Nanthaamornphong).
https://doi.org/10.1016/j.heliyon.2024.e25244
Received 10 October 2023; Received in revised form 22 November 2023; Accepted 23 January 2024
Available online 28 January 2024
2405-8440/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
D.G. Dev et al. Heliyon 10 (2024) e25244
automated rating for both the credibility of various information sources and the context in which news is presented. It is evident that
many individuals often become unwitting victims of misinformation, lacking the critical evaluation skills required to verify the
accuracy of the content they share. Such tendencies prove detrimental, particularly when false rumors or misleading news stories
contribute to the creation of divisive opinions among individuals or within specific groups. In light of advancing technology, the
implementation of protective measures to combat these actions becomes increasingly imperative. Mass communication wields a
considerable influence over the general population, and regrettably, numerous websites are dedicated to disseminating deceptive
information.
These websites deliberately engage in propagating deceptive promotions, fabrications, and falsehoods while masquerading them as
factual news. Their primary responsibility should be to monitor and regulate the flow of information that has the potential to affect
public trust. The internet is rife with numerous similar websites, collectively significantly impacting public opinion. According to
research, several artificial intelligence algorithms can play a pivotal role in detecting fake news. The overarching aim of fake news
detection is to disrupt the spread of rumors across various platforms, whether they are social media, messaging services, or other
communication channels. This objective has been a driving force behind our project, as detecting false news seeks to identify and
discourage such behaviors, ultimately safeguarding society from the harm they can inflict.
The primary objective is to effectively distinguish between authentic and fake news, a challenge that fundamentally hinges on text
analysis. Developing a model capable of discerning between “authentic” and “fake” news is paramount. This is especially critical given
the rapid dissemination of fake news on social platforms like Facebook and Instagram, microblogging platforms such as Twitter, and
instant messaging applications like WhatsApp and Hike. The proposed method offers invaluable assistance in evaluating the credibility
of news sources and contributes significantly to mitigating the proliferation of false information.
The paper’s structure is as follows: Section 2 presents fake news detection methods; Section 3 discusses machine learning and data
mining tools for mining fake news; Section 4 explains types of online data posts. Sections 5 and 6 provide a literature survey with
motivation. Section 7 introduces the proposed fake news detection model. Section 8 covers experimental settings and results, while
Section 9 offers conclusions.
In fake news research, various methods are used to detect and combat misinformation. This includes applying Natural Language
Processing (NLP) techniques like sentiment analysis and linguistic pattern recognition to analyze news and social media content.
Machine learning algorithms, both supervised and unsupervised, help classify news as real or fake based on factors like source
credibility and textual characteristics. Network-based methods track the spread of false information on social networks. Deep learning,
using CNNs and RNNs, extracts relevant features from text and media to identify fake news. Fact-checking databases and crowdsourced
platforms validate claims against established facts. Combining these methods, often in ensemble models, improves fake news detection
accuracy and efficiency. See Fig. 1 [1] for a diagram of the process.
1. Natural Language Processing (NLP) serves as a critical tool in various research endeavours, enabling the exploration of specific
systems and methodologies. This multifaceted field harnesses the power of algorithmic systems to facilitate the integration of voice
interpretation and speech generation, bolstered by the prowess of Natural Language Processing. Moreover, NLP extends its utility to
encompass activity recognition across diverse languages, thereby showcasing its versatility. A particularly intriguing facet of NLP
lies in its ability to leverage language-specific pipelines, including those designed for the intricate task of emotion analysis [2].
2. Data Mining encompasses two principal categories of methodologies: supervised and unsupervised. The supervised approach
harnesses training data to anticipate concealed behaviours, while Unsupervised Data Mining unveils hidden data patterns, such as
pairs of input tags and clusters. Within the realm of unsupervised data mining, one finds composite structures and a collective
foundation, further enhancing its analytical potential.
3. Machine Learning Classification empowers software systems to evolve and improve without necessitating extensive reprogramming.
Skilled data scientists define the crucial variables or features for accurate evaluation and prediction. Following training, the
algorithm separates data into distinct classes, facilitating the continuous adaptation and enhancement of the system.
4. Decision Trees serve as a pivotal tool for data analysis, where each node represents a rule or “test” pertaining to a specific
characteristic. The tree branches out based on the outcomes of these tests, and class labels are assigned to the leaf nodes.
Decision trees thereby aid in identifying crucial factors and elucidating the relationships among them. They are effective in
predicting target variables and useful in developing new variables and features that drive data-driven insights.
5. The goal behind Random Forests is to generate several different decision trees, each of which provides a different result. The
random forest combines the results predicted by these decision trees. To guarantee diversity among the resulting trees, the
random forest randomly chooses a subset of features for each one. Random forests excel when the individual decision trees are
uncorrelated. When the trees are similar, the result is close to that of a single decision tree.
6. The SVM method represents each data item as a point in n-dimensional space, where n is the number of available features and
each coordinate holds the value of one feature. The data is then divided into two groups by the hyperplane that separates them.
7. This Bayes' theorem-based strategy has applications across machine learning domains [3]. It operates under the assumption
that the predictors are independent: Naive Bayes assumes that the features of a class are unrelated to one another. If a fruit is
red, round, and smaller than 3 inches in diameter, we call it an apple. Naive Bayes presumes that each of these features
contributes independently to the evidence that the fruit is an apple, irrespective of whether they depend on each other or on
other features [4].
8. KNN classifies a new data point by a majority vote of its k nearest neighbours. The point is assigned to the class most common
among those k neighbours, as measured by a distance function [5].
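The nearest-neighbour vote described in item 8 can be illustrated with a minimal sketch. The function name `knn_predict`, the Euclidean distance, and the toy two-dimensional points are assumptions made for illustration; they are not taken from the study's implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Euclidean distance is an assumption here; any distance function could be used.
    """
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["real", "real", "real", "fake", "fake", "fake"]
print(knn_predict(points, labels, (0.2, 0.1)))
```

In practice the feature vectors would come from a text vectorizer rather than hand-written coordinates.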
In the realm of computational linguistics, there is a distinct emphasis placed on the profound and methodical analysis of the textual
origins. Considerable dedication has been directed towards the comprehensive assessment of the diverse array of texts that constitute
the bulk of digital entries. Online posts, as a dynamic medium of communication, encompass a wide spectrum of data types, each
serving a unique purpose and contributing to the multifaceted nature of online discourse. These data types can be broadly categorized
as follows.
1. Textual Data: The most prevalent form of data in online posts is textual content, encompassing written narratives, comments,
descriptions, and captions. This textual data can manifest in various formats, ranging from plain text to intricately formatted
structures like HTML or Markdown, and even encompassing rich text, replete with emojis and specialized characters.
2. Media Data: Online posts frequently feature a medley of media content, comprising:
a. Images: Visual representations, including pictures and graphics, either uploaded or embedded within the posts.
b. Videos: Multimedia content that may reside on platforms such as YouTube or be directly uploaded.
c. Audio: Audio clips or recordings, though less common, do make an appearance in specific posts.
3. Links and URLs: These posts often include hyperlinks directing users to other online resources, websites, articles, or additional
multimedia content. These links may either be seamlessly integrated within the text or presented as discrete references.
4. Metadata: Essential information pertaining to the post itself or the user responsible for its creation is a common component.
This metadata comprises timestamps, author profiles, and engagement metrics, such as likes, shares, and comments.
5. Structured Data: Some online posts incorporate structured data, often presented as tables, lists, or other organized
arrangements. Such structured data can encompass diverse information, ranging from product specifications to event schedules or
survey findings.
6. User-generated Content: User-generated posts may feature data contributed by users, encompassing elements such as ratings,
reviews, or user-generated tags and categories.
7. Geolocation Data: A subset of posts may include geolocation information, offering insights into the locale where the content
was generated. This capability enables users to discern the origin of an event or the geographical context of an image.
8. User Interaction Data: Posts also capture data regarding user engagement, encompassing metrics such as view counts, likes,
comments, and shares.
9. Sentiment Data: Within the textual data, sentiment analysis is employed to discern the emotional tone conveyed by the
content, thereby classifying it as expressing positive, negative, or neutral emotions.
10. Structured Social Network Data: Particularly within the realm of social media platforms, posts are often interconnected
through intricate social networks, showcasing the intricate web of relationships between users, followers, and friends.
11. User Behavior Data: The navigation patterns, engagement strategies, and sharing behaviors of users can be collected and
subjected to analysis, providing valuable insights into user behavior and preferences.
12. Location Data: Posts may be augmented with location-based data, including check-ins, geo-tagged photos, and content
specifically tied to particular geographical locations.
5. Literature survey
In the paper, Ahmad et al. [6] proposed the utilization of a machine learning ensemble strategy for the automated categorization of
news articles. Their research delved into a wide range of linguistic features that could effectively discern between genuine and
fraudulent content. To achieve this, Kaliyar et al. [1] conducted comprehensive training of various machine learning algorithms
through collaborative approaches and assessed their performance across four real-world datasets using these features. The results of
their novel approach demonstrated the superiority of their collaborative ensemble strategy over individual learners.
In this study, Sahoo et al. [7] developed an autonomous method for detecting false news within the Chrome browser, particularly
focusing on identifying deceptive news on Facebook. Using deep learning techniques, they leveraged a diverse set of Facebook account
parameters in conjunction with specific news item attributes to evaluate account activity. The empirical analysis of real-world data
showcased that their proposed approach for identifying false news outperformed existing state-of-the-art algorithms.
Furthermore, Wang et al. [8] exhibited clear effectiveness in detecting counterfeit news and posts using a variety of Machine
Learning approaches. However, the challenge of categorizing fake news was exacerbated by its dynamic traits across different social
media platforms. What set deep learning apart was the ability to compute hierarchical features. Deep neural networks, including
multimodal and parallel processing, computer vision, object recognition, and audio/speech processing, proved to be versatile and
effective. In the research work, Manzoor and Singla [2] not only discussed a theoretical framework but also presented a method for
identifying false reports related to recent events. They employed support vector machines to determine the authenticity of news,
achieving an impressive accuracy rate of up to 96.89 % when compared to other existing models.
For data collection, Jain et al. [9] gathered 1356 news instances from diverse sources on Twitter and online platforms to construct a
comprehensive dataset comprising both genuine and false news articles. The paper delved into advanced methodologies, including
Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), joint approaches, and attention mechanisms.
The author highlighted the attention mechanism’s remarkable accuracy (89.12 percent), while Ko et al.’s method achieved the highest
detection rate (95.0 percent), effectively addressing the challenge of fake news identification.
In their research, Kumar et al. [10] introduced a majority vote technique to identify false news items, leveraging diverse linguistic
features of both fake and genuine news articles. They utilized a publicly available false news dataset consisting of 23,571 news pieces,
and their multi-model false news recognition approach, in combination with the Majority Voting method, achieved an impressive
accuracy rate of 96.38 %, surpassing individual learning techniques.
In this context, Khanam et al. [11] proposed a method for identifying false news in the era of rapidly expanding online information.
Recognizing the challenges posed by the sheer volume of data, the authors employed classification methods to categorize big data. The
study compared a range of Machine Learning algorithms for detecting false news.
In their research, Hiramath and Deshpande [12] developed a unique ML prototype for classifying false news, combining
convolutional and recurrent neural networks to yield superior results compared to non-hybrid approaches. The model displayed an ability
to detect significantly more false news stories, and further tests across various datasets reinforced its effectiveness.
Nasir et al. [13] introduced an innovative approach to automatically detect fake news by extending traditional CNNs to graphs,
incorporating diverse data sources such as topics, user profiles, social networks, and news dissemination. Professional fact-checking
groups validated the accuracy of these news stories, revealing the crucial role of social network structures and information
dissemination in achieving highly accurate (95.3 % ROC AUC) fake news identification. Additionally, the author performed an aging test on
their model, which demonstrated the potential of propagation-based techniques for detecting false news.
From a data mining perspective, Monti et al. [14] explored automated strategies for detecting bogus news, evaluating various
supervised text classification techniques, including CNNs, LSTMs, and BERT. The incorporation of unverified training data via
linguistic model pre-training and distributed word representations resulted in an impressive accuracy rate of 96.42 %.
Wani et al. [15] endeavored to evaluate and compare multiple strategies to address this issue, encompassing classical ML
approaches such as Naive Bayes and general deep learning methods like cross RNN. Their article provided a foundation for choosing an
ML or deep learning approach that balances accuracy and computational efficiency.
Lastly, Han and Metha [16] conducted an in-depth investigation of word embedding for text preprocessing, constructing a
trajectory space of arguments and linguistic connections using a hybrid neural network approach. Their prototype achieved consistent
performance in false news detection, with the fine-tuning of various prototype parameters leading to enhanced accuracy values, such
as 93.51 % when incorporating more input data while mitigating overfitting through a dropout layer.
6. Methodology
Continuing our categorization approach, we describe the processes that allow our system to distinguish fraudulent articles from
legitimate ones. Beyond the initial application of supervised machine learning (ML), we refine the system iteratively: multiple
trials are conducted, both independently and in combination, to maximize accuracy.
Following this experimentation, we select a set of relevant attributes during the data collection and exploration phase. From this
selection, we extract linguistic features covering a wide range of textual attributes. These features are then transformed into a
numerical representation suitable for input to our machine learning models. The linguistic features include indicators of sentiment,
patterns of idiom usage, purpose-driven word selections, instances of informal language, and parts of speech such as
prepositions and verbs.
To keep feature values consistent, we scale them to the standardized range of 0–1. This step is necessary because some attributes
naturally span from 0 to 100 while others exhibit more diverse and unpredictable ranges.
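As a minimal sketch of this scaling step, min-max normalization maps each feature column to the range 0–1. The function name `min_max_scale` and the handling of constant columns are assumptions for illustration; the source does not specify which scaler was used.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# A hypothetical feature that spans 0 to 100.
print(min_max_scale([0, 50, 100]))
```

Each feature column would be scaled separately, so that a 0 to 100 attribute and a narrower one end up on the same footing.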
We use the extracted features to train a diverse set of ML models. Our dataset is divided into two parts, a training set and a
testing set, in an 80/20 ratio. Randomization is applied to ensure a balanced representation of both fraudulent and genuine
content within our testing scenarios.
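One way to obtain an 80/20 split that keeps both classes balanced is a stratified random split, sketched below. The function name `stratified_split`, the fixed seed, and the per-class shuffling are assumptions for illustration rather than the procedure used in the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 50 fake (1) and 50 genuine (0) articles.
labels = [1] * 50 + [0] * 50
train, test = stratified_split(labels)
print(len(train), len(test))
```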
During the training phase, we fine-tune the hyperparameters of our learning algorithms. The goal is to achieve the trade-off
between bias and variance that best suits our dataset. This optimization uses a grid search, iterating through the candidate
parameter combinations until the most effective configuration is found.
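The grid search can be sketched as an exhaustive loop over parameter combinations. Here `grid_search`, `score_fn`, and the parameter names `C` and `depth` are hypothetical; in a real experiment `score_fn` would train a model and return a validation score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid` and return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for a validation score; peaks at C=1, depth=3 (purely illustrative).
    return -((params["C"] - 1) ** 2) - (params["depth"] - 3) ** 2


best_params, best_score = grid_search({"C": [0.1, 1, 10], "depth": [1, 3, 5]}, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes, which is why the candidate values are usually kept to a short list per parameter.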
Whenever feasible, we also evaluate the algorithms' performance as a collective whole and choose the configuration that gives the
best results. Passing the dataset through these multiple procedures, and scrutinizing the collected results, allows us to arrive at
a well-founded conclusion.
Our system analysis procedure for identifying misleading news articles begins with the acquisition of a dataset of political news
articles, followed by an in-depth analysis that eliminates extraneous noise. The dataset is then partitioned, laying the foundation
for the implementation of machine learning techniques to construct our classifier model [17].
Fig. 2 provides a visual representation of the pre-processing of the dataset, carried out with NLTK (Natural Language Toolkit).
This pre-processing prepares the text for the subsequent application of algorithms. We perform information extraction and word
tokenization using the Stanford NLP tools to prepare the textual data for our classifier. The results are encoded as integers and
floating-point values for further analysis.
Natural language processing (NLP) combines linguistics and computer science. Its main goal is to give computers the ability to
process and manipulate human language. It entails applying either rule-based or probabilistic (i.e., statistical and, most recently,
neural network-based) machine learning techniques to the processing of natural language information, such as text or speech corpora.
The objective is to create a computer that can “understand” the contents of documents, including the subtle linguistic context
within them. The system can then accurately classify and organize the documents themselves, in addition to extracting insights and
information from them. Natural language generation, natural language understanding, and speech recognition are common problems in
natural language processing.
For text data tokenization and feature extraction, we use the Python scikit-learn package. Scikit-learn is a machine learning
library for the Python programming language. It provides classification, regression, and clustering techniques, including
support-vector machines, random forests, gradient boosting, k-means, and DBSCAN, and it interoperates with the NumPy and SciPy
numerical and scientific libraries for Python. Furthermore, a confusion matrix is used to visualize the classification results,
aiding their interpretation.
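To make the confusion matrix concrete, the sketch below counts the four outcome types for a binary classifier and derives accuracy from them. The function names and the toy predictions are illustrative assumptions, as is treating the label 1 as the positive (fake) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels: 1 = fake, 0 = genuine.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts))
```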
In the realm of artificial intelligence, deep learning models have established themselves as pioneers, consistently delivering cutting-
edge results across a spectrum of applications. This section of our work offers an insightful overview of the deep learning models and
the underlying architectural intricacies that have played a pivotal role in our research.
Our approach has been evaluated with deep learning-based models such as Convolutional Neural Networks (CNN) and Long Short-Term
Memory networks (LSTM) [5,18], which form the cornerstone of our methodology. These models have been integrated with various
pre-trained word embeddings, contributing to the robustness and depth of our analytical framework.
Fig. 3 shows the CNN model. A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns
feature engineering by itself through the optimization of filters (or kernels). By using regularized weights over fewer
connections, CNNs avoid the vanishing and exploding gradients that affected backpropagation in earlier neural networks. Their
applications include natural language processing, brain–computer interfaces, recommender systems, image classification, image
segmentation, medical image analysis, and image and video recognition. Because of the shared-weight architecture of the convolution
kernels or filters, which slide along input features and produce translation-equivariant responses known as feature maps, CNNs are
sometimes referred to as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN). Contrary to popular belief, most
convolutional neural networks are
not translation-invariant because of the down-sampling process they perform on the input. Because of the similarity in the connecting
pattern between neurons to the structure of the animal visual cortex, convolutional networks were inspired by biological processes.
Only in a small area of the visual field known as the receptive field do individual cortical neurons react to stimuli. Different neurons’
receptive areas partially overlap to encompass the whole visual field. CNNs require less pre-processing than other algorithms for
classifying images. This means that, in contrast to traditional algorithms, where these filters are hand-engineered, the network learns
to optimize the filters (or kernels) through automatic learning. One significant benefit is that feature extraction is not dependent on
human interaction or previous knowledge.
7. Proposed model
The suggested model for detecting false news employs a Long Short-Term Memory (LSTM) recurrent neural network, as depicted in
Fig. 4. The news items undergo an initial preprocessing phase wherein a binary label of 1 denotes fake news, while 0 signifies truthful
news for each news item. Before any modifications are made, the input news items are meticulously cleansed of punctuation and stop
words. News articles’ headlines and body content are then translated into padded word sequences. These sequences are further divided
into tokens, forming the basis for creating word vector representations—a method that remains unproven.
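The preprocessing described above (removing punctuation and stop words, tokenizing, and producing padded word sequences) might look like the following minimal sketch. The function name, the simple regular-expression tokenizer, the tiny stop-word set, and left-padding with zeros are all assumptions for illustration.

```python
import re


def texts_to_padded_sequences(texts, max_len, stop_words=frozenset()):
    """Map each text to a fixed-length list of integer word ids (0 = padding)."""
    vocab = {}
    sequences = []
    for text in texts:
        # Lowercase, drop punctuation by keeping only alphanumeric runs, remove stop words.
        tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]
        # Assign ids in order of first appearance, starting at 1.
        ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
        ids = ids[:max_len]  # truncate long texts
        sequences.append([0] * (max_len - len(ids)) + ids)  # pad on the left
    return sequences, vocab


seqs, vocab = texts_to_padded_sequences(["The cat sat!", "The dog"], max_len=4, stop_words={"the"})
print(seqs, vocab)
```

The integer ids would then index rows of the embedding matrix in the next stage.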
To handle the complexity of high-dimensional news articles, the model utilizes pre-trained GloVe word embeddings. Instead of
initializing with random weights, the embedding layer adopts GloVe weights. GloVe harnesses global co-occurrence data from an
extensive news story corpus, resulting in representations that capture essential linear structures within the word space. The
transformed dataset comprises three distinct parts: training, validation, and testing.
The model’s refinement process involves an iterative approach to minimize the objective function and enhance accuracy. The
proposed methodology employs cross-entropy loss to distinguish fake news stories. Furthermore, various adaptive learning
optimization techniques, such as AdaGrad and Adam, are under scrutiny to improve model performance.
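For a binary fake/genuine label, the cross-entropy loss can be sketched as below. The function name, the clipping constant `eps`, and the example probabilities are assumptions for illustration; the source does not give the exact loss formulation.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident, correct predictions give a lower loss than hesitant ones.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
hesitant = binary_cross_entropy([1, 0], [0.6, 0.4])
print(confident, hesitant)
```

An optimizer such as AdaGrad or Adam would adjust the network weights to reduce this quantity over the training set.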
In the context of addressing the issue of false news, previous research [9] has explored unidirectional pre-trained word embedding
models coupled with a 1D pooling layer system. Our proposed system, on the other hand, leverages automated feature engineering
techniques. We feed input vectors of equal dimensions to all three layers in parallel and pool each block. In our model design, we
meticulously select parameters, including the number of convolutional layers, kernel sizes, the quantity of filters, and other essential
factors, to enhance the model’s robustness.
The issue of error propagation over time and across layers is effectively addressed with the assistance of LSTMs [19–21].
Bi-directional processing emerges as a natural choice for tasks involving the analysis of extensive text sequences and text classification.
To formalize a simple RNN, consider each timestep with inputs (x_1 … x_T), where the model updates its internal state (h_1 … h_T)
and generates its output (y_1 … y_T). This process adheres to the fundamental structure of a conventional neural network. At time
t, the input and output vectors, denoted as x_t and y_t respectively, are influenced by connection weight matrices W_IH and W_OH,
which ensure consistency in both input and output.
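The recurrence can be illustrated with a scalar sketch of a simple RNN. The names `w_ih` and `w_oh` stand in for W_IH and W_OH; the hidden-to-hidden weight `w_hh`, the tanh activation, the zero initial state, and the absence of bias terms are assumptions, since the text does not specify them.

```python
import math


def rnn_forward(xs, w_ih, w_hh, w_oh):
    """Run a simple RNN over scalar inputs x_1..x_T, returning states and outputs."""
    h = 0.0  # assumed initial hidden state
    hs, ys = [], []
    for x in xs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_ih * x + w_hh * h)
        # The output is read from the current state.
        y = w_oh * h
        hs.append(h)
        ys.append(y)
    return hs, ys


hs, ys = rnn_forward([1.0, 0.0], w_ih=1.0, w_hh=0.5, w_oh=2.0)
print(hs, ys)
```

The same weights are reused at every timestep, which is what lets the model handle sequences of any length.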
7.1. Pseudo code of proposed algorithm to detect using the acquired dataset
1. Open URL file: The algorithm starts by opening a file containing a list of URLs (presumably pointing to news articles or web pages).
2. For each title: The algorithm proceeds to iterate through each title of the news articles or web pages associated with the URLs.
3. Title starts with a number: If the title starts with a number, it outputs the file. This step might detect fake news articles with titles
starting with numbers as a potential attribute.
4. Title contains ? and/or ! marks: If the title contains question marks (?) and/or exclamation marks (!), it outputs the file. This step
may aim to identify sensational or exaggerated titles commonly associated with clickbait or misleading content.
5. All words are capital in the title: If all words in the title are in capital letters, it outputs the file. This step could be a heuristic for
detecting sensational headlines, but it might have limitations as many legitimate titles use capitalization for emphasis.
6. Users left the website after visiting: If the algorithm detects that users quickly leave the website after visiting it, it outputs the file.
This step could be an indicator of low-quality or misleading content that fails to engage visitors.
7. Contents have no words from the title: If the web page’s contents have no words from the title, it outputs the file. This step might
be used to identify discrepancies between the title and the actual content, which could indicate misleading information.
8. Title contains keywords: If the title contains certain keywords (not specified in the provided algorithm), it outputs the file.
This step could be used to detect specific patterns or keywords associated with fake news.
9. End for: The algorithm ends the loop, and the computation of attributes for each title is completed.
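The title checks in the steps above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `title_flags`, the flag names and the `keywords` argument are hypothetical, the keyword list is left to the caller because the algorithm does not specify one, and step 6 (users leaving the website) is omitted because it needs visitor analytics rather than text.

```python
import re

def title_flags(title, content="", keywords=()):
    """Return the names of the title heuristics (steps 3-5, 7, 8) that fire."""
    words = re.findall(r"[A-Za-z0-9']+", title)
    content_words = set(re.findall(r"[a-z0-9']+", content.lower()))
    flags = []
    # Step 3: the title starts with a number.
    if title.lstrip()[:1].isdigit():
        flags.append("starts_with_number")
    # Step 4: the title contains question or exclamation marks.
    if "?" in title or "!" in title:
        flags.append("question_or_exclamation")
    # Step 5: every word that contains letters is fully capitalised.
    lettered = [w for w in words if any(c.isalpha() for c in w)]
    if lettered and all(w.isupper() for w in lettered):
        flags.append("all_caps")
    # Step 7: the content shares no word with the title.
    if content and not any(w.lower() in content_words for w in words):
        flags.append("title_words_missing_from_content")
    # Step 8: the title contains one of the caller-supplied keywords.
    if any(k.lower() in title.lower() for k in keywords):
        flags.append("contains_keyword")
    return flags

suspicious = title_flags("10 SHOCKING FACTS!",
                         content="Nothing relevant here.",
                         keywords=("shocking",))
ordinary = title_flags("Senate passes budget bill",
                       content="The Senate passed the budget bill today.")
```

Returning a list of flags rather than a single yes/no keeps each heuristic available as a separate attribute for a downstream classifier.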
D.G. Dev et al. Heliyon 10 (2024) e25244
Count Vectorizer converts a collection of text documents into a matrix of word counts: each text is turned into a vector based on the
frequency (count) of every word that appears in it. Count Vectorizer is computationally efficient and can handle large text datasets with
numerous documents. Especially when working with high-dimensional data, it uses sparse matrix representations to reduce processing
time and memory usage. The Count Vectorizer performs the following steps:
a) lowercases the text (to disable lowercasing, set lowercase=False);
b) assumes UTF-8 encoding for the content;
c) performs tokenization, which divides raw text into smaller textual units;
d) applies word-level tokenization, which treats each word as a distinct token.
For example, given n text documents, the result is a matrix in which each row is a document and each column is a unique word. Each
entry of the matrix records how many times a specific word appears in a particular document.
$$X_{ij} = \sum_{i=1}^{n} CW_{j} \tag{1}$$
Here, i indexes the documents (n in total) and CW_j represents the count of word j.
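The document-by-word count matrix can be illustrated with a short pure-Python sketch. It is not the scikit-learn implementation: it builds a dense list of lists instead of a sparse matrix, the function name `count_matrix` and the sample sentences are made up for illustration, and the token pattern (words of two or more characters) is chosen to mirror scikit-learn's default.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters (mirrors scikit-learn's default pattern).
TOKEN = re.compile(r"\b\w\w+\b")

def count_matrix(docs, lowercase=True):
    """Return (vocabulary, X) where X[i][j] counts word j in document i."""
    tokenized = [TOKEN.findall(d.lower() if lowercase else d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    column = {t: j for j, t in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for token, count in Counter(tokens).items():
            X[i][column[token]] = count
    return vocab, X

docs = ["Fake news spreads fast", "Real news spreads slowly", "Fake fake fake"]
vocab, X = count_matrix(docs)
# vocab -> ['fake', 'fast', 'news', 'real', 'slowly', 'spreads']
# X[2]  -> [3, 0, 0, 0, 0, 0]
```

Each row of `X` is one document and each column one vocabulary word, matching the grid described above.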
TfidfVectorizer is another tool for text documents; instead of raw word counts, it computes a TF-IDF (Term Frequency-Inverse
Document Frequency) value for each word-document pair. Like Count Vectorizer, TfidfVectorizer turns text documents into a matrix,
but its entries are TF-IDF values rather than word counts.
$$X_{ij} = \sum_{i=1}^{n} \mathrm{TFIDF} \ast WW_{j} \tag{2}$$
Here, WW_j represents the weight of word j in document i. The value of TFIDF is calculated as follows.
$$\mathrm{TFIDF}(w, d, D) = \mathrm{TF}(w, d) \ast \mathrm{IDF}(w, D) \tag{3}$$
Here, TF (w, d) represents the Term Frequency of word w in document d, which is the count of w in d and IDF (w, D) represents the
Inverse Document Frequency of word w in the entire corpus D.
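Eq. (3) can be worked through on a toy corpus with the following sketch. The text does not give a formula for IDF, so the common textbook form log(N / df(w)), with N documents and df(w) the number of documents containing w, is assumed here; library implementations such as scikit-learn apply smoothing and normalization, so their values will differ. The corpus and function names are illustrative only.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Term frequency: the raw count of word in one tokenized document."""
    return Counter(doc_tokens)[word]

def idf(word, corpus):
    """Inverse document frequency, assumed here to be log(N / df(word))."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(word, doc_tokens, corpus):
    """TFIDF(w, d, D) = TF(w, d) * IDF(w, D), as in Eq. (3)."""
    return tf(word, doc_tokens) * idf(word, corpus)

corpus = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "spreads", "slowly"],
    ["fake", "fake", "fake"],
]
rare = tfidf("fast", corpus[0], corpus)      # appears in 1 of 3 documents
repeated = tfidf("fake", corpus[2], corpus)  # count 3, appears in 2 of 3 documents
```

A word that occurs in every document gets an IDF of log(1) = 0 under this assumed form, so it contributes nothing to the representation regardless of how often it appears.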
The initial dataset was curated by the Kaggle data science community. It comprises four columns and 7796 rows, as documented in
Ref. [9]. The first column is an index used for referencing and organizing the entries. The second column holds the titles of the news
articles, and the third column holds their full text, which is the content on which the subsequent analyses and classifications are
based. The fourth column is the label, denoting whether each news article is classified as “fake” or “real”, as noted in Ref. [10]. The
dataset was created for the classification of its news articles: the goal is to develop a robust and accurate classification system that
can distinguish genuine news from fabricated news. It serves as the foundation for identifying and optimizing the most suitable
classification algorithms, contributing to the broader effort to combat misinformation and improve the quality of information
dissemination.
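The four-column layout described above could be read as in the following sketch. The column names (`index`, `title`, `text`, `label`), the two sample rows and the helper `load_articles` are hypothetical placeholders, not taken from the actual Kaggle file; an in-memory string stands in for the dataset so the example is self-contained.

```python
import csv
import io
from collections import Counter

# Placeholder rows that only mimic the described layout (not real dataset content).
SAMPLE = """index,title,text,label
0,Example headline one,Body of the first article.,FAKE
1,Example headline two,Body of the second article.,REAL
"""

def load_articles(handle):
    """Read the four columns and normalise the label to lowercase."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["label"] = row["label"].strip().lower()
    return rows

articles = load_articles(io.StringIO(SAMPLE))
label_counts = Counter(article["label"] for article in articles)
```

Checking `label_counts` before training is a quick way to see whether the “fake” and “real” classes are balanced.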
Table 1
Deep learning classifiers accuracy.
Classifier Accuracy
SNN 93 %
RNN + LSTM 91 %
CNN + LSTM 98 %
8.3. Simulation
8.3.2. Comparison
A comparative analysis was conducted between the outcomes of the current investigation and those documented in prior studies. The
comparison was carried out in two stages. The first stage examined the research outcomes in relation to previously published works,
irrespective of whether the datasets under examination were identical or distinct. Table 3 summarizes the comparative analysis with
baseline studies, and Fig. 9 compares the proposed model with the other baseline studies. The hybrid CNN and LSTM approach
outperforms the others, with an accuracy of 0.98.
9. Conclusion
This work focuses on identifying fake news in a two-step process: characterizing and discovering. Initially, we use social media
platforms to understand the foundations of false news. Then, in the discovery phase, we assess existing strategies for false news
detection, employing supervised machine learning algorithms. Our study uses voice and text features and analytical representations,
which differ from traditional text analysis methods, to spot fake news. Our hybrid approach of CNN and LSTM achieved an accuracy of
0.98. Our primary goal in training is to identify textual characteristics that distinguish fake articles from genuine ones.
Using the Linguistic Inquiry and Word Count (LIWC) tool, we extract various textual attributes from articles and integrate them into
our models. We meticulously optimize and parameterize these trained representations to maximize accuracy.
The proposed fake news detection model outperforms the baseline studies, with an accuracy of 0.98. A limitation of this study is that
the proposed model does not work across languages. The approach is also limited to text, so future work could extend it to the image
and video components of fake news detection.
Additional information
Table 2
Performance of Proposed Fake News Detection Model for different classifiers.
Techniques Used Accuracy Precision Recall F1-Score
Deepali Goyal Dev: Writing – original draft, Methodology, Formal analysis, Conceptualization. Vishal Bhatnagar: Writing –
review & editing, Visualization, Data curation. Bhoopesh Singh Bhati: Writing – review & editing, Investigation, Conceptualization.
Manoj Gupta: Writing – review & editing, Investigation, Data curation. Aziz Nanthaamornphong: Writing – review & editing,
Validation, Investigation.
Table 3
Comprehensive comparative analysis with baselines studies.
Reference Techniques Used Year Data Set Accuracy
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.
References
[1] R.K. Kaliyar, A. Goswami, P. Narang, FakeBERT: fake news detection in social media with a BERT-based deep learning approach, Multimed. Tool. Appl. 80 (8)
(2021) 11765–11788.
[2] S.I. Manzoor, J. Singla, Fake news detection using machine learning approaches: a systematic review, in: 2019 3rd International Conference on Trends in
Electronics and Informatics (ICOEI), IEEE, 2019, April, pp. 230–234.
[3] J.C. Reis, A. Correia, F. Murai, A. Veloso, F. Benevenuto, Explainable machine learning for fake news detection, in: Proceedings of the 10th ACM Conference on
Web Science, 2019, June, pp. 17–26.
[4] A. Agarwal, M. Mittal, A. Pathak, L.M. Goyal, Fake news detection using a blend of neural networks: an application of deep learning, SN Computer Science 1
(2020) 1–9.
[5] N. Aslam, I. Ullah Khan, F.S. Alotaibi, L.A. Aldaej, A.K. Aldubaikil, Fake detect: a deep learning ensemble model for fake news detection, Complexity 2021
(2021) 1–8.
[6] I. Ahmad, M. Yousaf, S. Yousaf, M.O. Ahmad, Fake news detection using machine learning ensemble methods, Complexity 2020 (2020) 1–11.
[7] S.R. Sahoo, B.B. Gupta, Multiple features based approach for automatic fake news detection on social networks using deep learning, Appl. Soft Comput. 100
(2021) 106983.
[8] Y. Wang, W. Yang, F. Ma, J. Xu, B. Zhong, Q. Deng, J. Gao, Weak supervision for fake news detection via reinforcement learning, Proc. AAAI Conf. Artif. Intell.
34 (1) (2020, April) 516–523.
[9] A. Jain, A. Shakya, H. Khatter, A.K. Gupta, A smart system for fake news detection using machine learning, in: 2019 International Conference on Issues and
Challenges in Intelligent Computing Techniques (ICICT), vol. 1, IEEE, 2019, September, pp. 1–4.
[10] S. Kumar, R. Asthana, S. Upadhyay, N. Upreti, M. Akbar, Fake news detection using deep learning models: a novel approach, Transactions on Emerging
Telecommunications Technologies 31 (2) (2020) e3767.
[11] Z. Khanam, B.N. Alwasel, H. Sirafi, M. Rashid, Fake news detection using machine learning approaches, in: IOP Conference Series: Materials Science and
Engineering, vol. 1099, no. 1, IOP Publishing, 2021, March, 012040.
[12] C.K. Hiramath, G.C. Deshpande, Fake news detection using deep learning techniques, in: 2019 1st International Conference on Advances in Information
Technology (ICAIT), IEEE, 2019, July, pp. 411–415.
[13] J.A. Nasir, O.S. Khan, I. Varlamis, Fake news detection: a hybrid CNN-RNN based deep learning approach, International Journal of Information Management
Data Insights 1 (1) (2021) 100007.
[14] F. Monti, F. Frasca, D. Eynard, D. Mannion, M.M. Bronstein, Fake News Detection on Social Media Using Geometric Deep Learning, 2019 arXiv preprint arXiv:
1902.06673.
[15] A. Wani, I. Joshi, S. Khandve, V. Wagh, R. Joshi, Evaluating deep learning approaches for covid 19 fake news detection, in: Combating Online Hostile Posts in
Regional Languages during Emergency Situation: First International Workshop, CONSTRAINT 2021, Collocated with AAAI 2021, Virtual Event,
February 8, 2021, Revised Selected Papers 1, Springer International Publishing, 2021, pp. 153–163.
[16] W. Han, V. Mehta, Fake news detection in social networks using machine learning and deep learning: performance evaluation, in: 2019 IEEE International
Conference on Industrial Internet (ICII), IEEE, 2019, November, pp. 375–380.
[17] O.A. Arqub, Z. Abo-Hammour, Numerical solution of systems of second-order boundary value problems using continuous genetic algorithm, Inf. Sci. 279 (2014)
396–415.
[18] M.D. Ibrishimova, K.F. Li, A machine learning approach to fake news detection using knowledge verification and natural language processing, in: Advances in
Intelligent Networking and Collaborative Systems: the 11th International Conference on Intelligent Networking and Collaborative Systems (INCoS-2019),
Springer International Publishing, 2020, pp. 223–234.
[19] A. Choudhary, A. Arora, Linguistic feature based learning model for fake news detection and classification, Expert Syst. Appl. 169 (2021) 114171.
[20] T.A.O. Jiang, J.P. Li, A.U. Haq, A. Saboor, A. Ali, A novel stacking approach for accurate detection of fake news, IEEE Access 9 (2021) 22626–22639.
[21] A.M. Braşoveanu, R. Andonie, Integrating machine learning techniques in semantic fake news detection, Neural Process. Lett. 53 (5) (2021) 3055–3072.
[22] J.Y. Khan, M.T.I. Khondaker, S. Afroz, G. Uddin, A. Iqbal, A benchmark study of machine learning models for online fake news detection, Machine Learning with
Applications 4 (2021) 100032.
[23] T.C. Truong, Q.B. Diep, I. Zelinka, R. Senkerik, Supervised classification methods for fake news identification, in: International Conference on Artificial
Intelligence and Soft Computing, Springer International Publishing, Cham, 2020, October, pp. 445–454.
[24] N.K. Conroy, V.L. Rubin, Y. Chen, Automatic deception detection: methods for finding fake news, Proceedings of the association for information science and
technology 52 (1) (2015) 1–4.
[25] W.Y. Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, 2017 arXiv preprint arXiv:1705.00648.