1-s2.0-S0925231223008032-main

Neurocomputing 557 (2023) 126680
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Information retrieval algorithms and neural ranking models to detect

previously fact-checked information
Tanmoy Chakraborty b , Valerio La Gatta a , Vincenzo Moscato a , Giancarlo Sperlì a ,∗
a
Department of Electrical Engineering and Information Technology (DIETI), University of Naples ‘‘Federico II’’, Via Claudio 21, Naples, Italy
b
Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, India
ARTICLE INFO ABSTRACT
Communicated by Y. Yan Although in the last decade several fact-checking organizations have emerged to verify misinformation, fake
news has continued to proliferate, especially through social media platforms. Even though adopting improved
Keywords:
detection strategies is of utmost importance, the fact-checking process could be optimized by verifying whether
Fact-checking
Claim retrieval
a claim has been previously fact-checked. Despite some ad-hoc information retrieval approaches having been
Learning-to-rank recently proposed, the utility of modern (neural) retrieval systems have not been investigated yet. In this paper,
Semantic matching we consider the standard two-phases retriever-reranker architecture and benchmark different state-of-the-art
techniques from the information retrieval and Q&A literature. We design several experiments on a real-world
Twitter dataset to analyze the efficiency and the effectiveness of the benchmark approaches. Our results show
that combining standard and neural approaches is the most promising research direction to improve retrievers
performance and that complex (neural) rerankers might still be efficient in practice since there is no need to
process a high number of documents to improve ranking performance.
1. Introduction agreed on a self-regulatory Code of Practice3 to converge on a common

strategy to deal with the spread of online disinformation. Third, govern-
Although fake news is not a new phenomenon, since the last decade ments are investing in educating the public on this problem and on how
it has become one of the major threats to democracy, journalism, and to discern false from true information, since it has been proven that,
freedom of expression. The rise of social media has been playing a once trained, the probability of people sharing fake news decreases by
key-role since those platforms enable the creation, the publication and 400%.4
the consumption of news online faster and cheaper. As a result, huge However, the quantity of information is much more than the one
amount of false information which spreads across the population affects people can effectively check and, since fake news spreads comparably
our life. For instance, fake news proliferation during the 2016 US or even faster than true ones [2,3], it is essential to speed up and/or
Presidential Election undermined public trust in the government [1], ease the verification process. From this perspective, one should consider
and from an economic perspective, the false claim stating that ‘Barack that the same viral claim is often reposted by thousands of people in a
Obama was injured in an explosion’ wiped out $130 billion in stock short time-frame and might be shared also after a while in a different
value.1
context. Moreover, when considering political statements, it is well-
The problem of fake news proliferation is being addressed from
known that politicians, even unconsciously, have a tendency to repeat
different perspectives. First, the Duke Reportes’ Lab2 counts more
themselves.5
than 400 fact-checking world-wide organizations which try to debunk
Thus, detecting whether a claim has been already fact-checked
false information though domain experts’ analyses and semi-automatic
seems a promising approach on which researchers should focus more
systems assessing news truthfulness. Second, representatives of online
for at least three reasons. First, it can ease the manual fact-checkers’ ef-
platforms, leading social networks, advertisers and advertising industry
fort, increasing their productivity and thus their effectiveness. Second,
∗ Corresponding author.
E-mail addresses: [email protected] (T. Chakraborty), [email protected] (V. La Gatta), [email protected] (V. Moscato),
[email protected] (G. Sperlì).
1
https://www.forbes.com/sites/kenrapoza/2017/02/26/can-fake-news-impact-the-stock-market/?sh=1883d50d2fac.
2
https://reporterslab.org/fact-checking/.
3
https://ec.europa.eu/digital-single-market/en/code-practice-disinformation.
4
https://edition.cnn.com/videos/tech/2020/04/17/mark-zuckerberg-facebook-limit-coronavirus-misinformation-cnn-town-hall-vpx.cnn.
5
https://www.washingtonpost.com/outlook/its-easy-to-fact-check-trumps-lies-he-tells-the-same-ones-all-the-time.
https://doi.org/10.1016/j.neucom.2023.126680
Received 26 April 2022; Received in revised form 28 April 2023; Accepted 11 August 2023
Available online 21 August 2023
0925-2312/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
T. Chakraborty et al. Neurocomputing 557 (2023) 126680
automatic fact-checking systems might be improved since the veracity disparate scenarios. Recently, researchers are increasingly focusing on
prediction of the input claim could be based on a set of already verified evidence-aware fact-checking, i.e. extracting the veracity of an input
information. Third, journalists, who are sceptical towards the adoption claim based on retrieved evidences, which can support or refute it. Un-
of automatic detection systems, could easily exploit, instead, a tool der these settings, [11] releases FEVER dataset aiming at fact-checking
which checks in real-time if their interviewees are referring to (telling) mutated claims generated from Wikipedia pages. [12,13] exploit web
inaccurate (false) data (claims). It is worth mentioning that popular search engines to find real-time potential evidences and compute their
search engines do not represent an effective solution because they do stance w.r.t. the input claim. In addition, [14] leverages LSTM mod-
not report verified information and thus they could activate a danger- els and attention mechanisms to retrieve documents and to capture
ous and time-consuming verification cascade, i.e., the experts should their most relevant sentences, respectively. [15] first employs neural
in turn verify whether the retrieved evidences are actually reliable. In semantic matching networks to address the document retrieval and
addition, we highlight that the task of detecting previously fact-checked
the evidence selection problems. Inspired by the unprecedented perfor-
information does not depend on the claim veracity; it supports the fact-
mance transformer architectures are achieving in many NLP tasks, [16]
checking process in a preliminary phase w.r.t. retrieving the evidences
adopts BERT model to compute the evidences’ relevance and the verac-
which the truthfulness assessment should be based on.
ity of the input claim. In addition, [17,18] leverage reasoning elements
Despite having been proposed to integrate the checking of the input
over an entity-graph and a hierarchical hypergraph, respectively, to
claim against a knowledge base of verified information in the fact-
checking pipeline [4], the problem of detecting previously fact-checked perform the verification process with fine-grained evidences.
information has been considered only recently by [5] which formulates Another research direction performs fact-checking relying on knowl-
the information retrieval task of ranking a list of verified documents edge base. To this end, [19] builds a knowledge graph of fact-checked
according to the relevance with the input claim. Under these settings, information which can be queried in order to assess the veracity of
the task aims at filtering out already verified claims, thus allowing an input claim. In addition, [20] encodes background knowledge in
professionals to focus on brand new claims which should be carefully the form of Horn rules and generates rule-based explanations support-
checked, also by retrieving other evidences. However, as opposed ing the veracity prediction of the claim. [20] determines the claim
to classical ad-hoc retrieval problems, the documents corpus, i.e. the truthfulness treating the knowledge graph as a flow network. More
verified information, is not static and, in principle, for each truthfulness recently, [21] proposes to use language models as knowledge base,
assessment an update should be triggered on it. exploiting their factual knowledge acquired during the pretraining
Inspired by the great advancements transformers architectures have process.
been bringing to the natural language processing field [6,7], com- Our work analyzes the fact-checking panorama from the perspective
petitors at CheckThat!2020 [8] dealt with the task and showed that of detecting previously fact-checked information: assuming most of the
different transformers’ fine tuning strategies lead to promising perfor- claims are repeated over time, especially on social media platforms, we
mance improvements w.r.t. standard information retrieval baselines aim to detect if an input claim has been already checked and stored in a
(e.g. BM25). However, the efficiency of the proposed approaches has predefined knowledge base. It is worth noting that evidence-based fact-
not been analyzed yet, neither the different requirements of top-k checking approaches [11,15] differ from the considered task because
retrieval and reranking models has been considered within a two-stages
while the former aims at predicting the claim’s veracity by understand-
pipeline. Consequently, since the retriever-reranker architecture has
ing whether some evidences support or refute it, the latter does not
been widely studied for information retrieval and question answer-
depend at all on the claim veracity. Indeed, if an input claim has been
ing systems [9], it seems profitable to explore how the most recent
already fact-checked there is no point in verifying it again regardless
methods and models perform on the above-mentioned task. In this
of its truthfulness. In other words, detecting previously fact-checked
paper, we conduct an extensive benchmark, especially considering the
recent advances that neural ranking models and transformer-based information supports the fact-checking process in a preliminary phase,
systems have brought to both retriever and reranker stages [9,10]. and in principle, is complementary to evidence-based approaches.
We evaluate the models on a real-world tweets’ dataset considering Despite having been proposed to integrate the checking of the input
both the effectiveness and the efficiency of the system. Our results claim against a knowledge base of verified information in the fact-
indicate that the integration of conventional and neural methodologies checking pipeline [4], only during the last year, some initial works
holds considerable potential as a research avenue for enhancing the proposed their solution. [5] ranked verified information according to
performance of retrievers. Additionally, we find that complex neural their relevance to the input claim. Specifically, they use standard infor-
rerankers have the potential to be efficient in practical settings as mation retrieval algorithms (e.g. BM25) and compute cosine similarity
they do not require a high volume of document processing to improve over the embedding produced by a not fine-tuned BERT model. In
ranking performance. addition, competitors at CheckThat!2020 [8] showed that different
Overall, these findings unveil the practical utility of conventional fine tuning strategies lead to promising performance improvements.
and neural methodologies from relevant literature in the context of Finally, [22] addresses the problem using multimodal data, i.e. the texts
detecting previously fact-checked information, thereby highlighting the and the images of the claim and of the verified information.
potential for their effective application in real-world settings. Despite the overall ranking performance, no one has analyzed the
The paper is organized as follows. After having presented related efficiency of the proposed approaches. For instance, winners at Check-
works regarding fact-checking methods and ranking models proposed That!2020 [23] explicitly declares that their approach is unfeasible
in the recent literature (Section 2), we present the benchmarking frame- with large-scale documents’ corpus because it would take hours to
work in Section 3 and define our research objectives in Section 4.1.
retrieve the top-k element for an input claim. From this perspective, we
The experimental evaluation using a Twitter dataset is presented in
consider a more realistic scenario where not only retrieval performance
Section 4. Finally, Section 5 discuss the theoretical and practical im-
but also execution times should be considered. In other words, we fully
plications of our research and Section 6 discusses several conclusions
explore the trade-off between effectiveness and efficiency to understand
and possible future works to focus on.
the best operating settings for such systems.
2. Related works Furthermore, given the information retrieval nature of the task,
there is a wide range of powerful, yet not explored, architectures
2.1. Fact-checking panorama and models [9,10] which could be used to optimize the overall per-
formance. We try to bridge this gap considering a retriever-reranker
The fact-checking problem, i.e. predicting the veracity of a claim, architecture and benchmark a broad range of models considering both
has been studied for long time from different perspectives and under their efficiency and effectiveness.
2
2.2. Multi-stage ranking models tasks. In particular, [39] shows the effectiveness of using ensemble of
different BERT-models and combining point-wise, pair-wise and list-
Ranking list of documents according to some queries is a com- wise loss functions. Similarly, [40] proposes a two-stages re-ranking
mon problem when performing information retrieval tasks. Specifically, pipeline with a point-wise (monoBERT) and a pair-wise (duoBERT)
when the document corpus is very large, multi-stage pipelines are the classification models, respectively.
de-facto standard to solve the problem [9]. In other words, the first Finally, some hybrid architectures have been proposed (e.g. DUET
stage retriever performs top-k document retrieval, i.e. the potential [41] combining the outputs of models from different categories to
set of documents relevant to the query; the second (and, in case, its produce the relevance score.
successors) stage reranker aims at reordering that set of candidates with Whilst interaction-based approaches leads to better ranking perfor-
more powerful and computationally expensive models. mance compared to representation-based ones, their application for
end-to-end retrieval is still limited due to their low efficiency in online
Retriever. The first-stage retrieval task has long been dominated by
ranking scenarios [10].
the classical term-based probabilistic models (e.g. BM25 [24]) due to
In this work, we select models from both categories and apply them
their efficiency and effectiveness even with million-scale corpus of doc-
to rerank previously retrieved top-k fact-checked documents according
uments. Nevertheless, they still suffer from the vocabulary mismatch
to their relevance to the input claim. Specifically, we assess to what
problem [25] and do not model the document semantics which is es-
extent one model should be preferred to another considering their
sential when considering text’s meaning. While in the past decades term
efficiency and effectiveness.
dependency and topic models [26–28] have addressed the former prob-
lem, the unprecedented performance improvements that transformer
3. Method
architectures and representation learning strategies are achieving in
NLP, have determined an explosive growth of works proposing their
3.1. Problem formulation
neural network-based semantic first-stage retriever. [9] classifies neural
retriever into two categories – sparse retrieval methods and dense retrieval
The task of detecting previously fact-checked information aims at
methods. The former strategies adopt efficient sparse representation for
improving the fact-checking process by filtering out all information that
query and documents and essentially improve the weighting scheme of
has been already verified. Thus, the task deals with an input claim 𝑐
the classical term-based methods (e.g. DeepCT [29], docT5query [30]).
and a (large) corpus of fact-checked documents  = {𝑑1 , 𝑑2 , … , 𝑑𝑁 }.
On the other hand, dense retrieval methods usually consist of a
It is worth to note that while 𝑐 does not have a predefined structure
dual-encoder architecture which embed queries and documents in-
(e.g. statements from a political debate or social network posts), a
dependently, the final relevance score is computed through a sim-
document 𝑑𝑖 represents a formal assessment of the claim under veri-
ilarity function 𝑓 . These methods can be further categorized into
fication, indeed it provides its context and all arguments which should
term-level representation learning and document-level representation learn-
be considered in the evaluation.
ing [9]. The former models represent queries and documents with
The information retrieval formulation of this task follows the
the sequence of their terms’ embeddings and 𝑓 performs term-level
retriever-reranker paradigm: the first-stage retriever should learn a
matching and aggregates the result to compute the final score (e.g. DC-
function 𝑠 ∶ { (𝑐, 𝑑𝑖 ) ∣ 𝑑𝑖 ∈  } → R which assigns high scores to
BERT [31], ColBERT [32]). Document-level representation learning ap-
relevant (𝑐, 𝑑) pairs and low scores to irrelevant ones. In other words,
proaches find one global representation for each query and document
the retriever aims at finding the set ̄ = {𝑑̄1 , 𝑑̄2 , … , 𝑑𝑀
̄ ∣ 𝑀 ≪ 𝑁}
(e.g. Sentence-BERT [33] DPR [34]).
of potentially relevant documents with respect to 𝑐, thus making ̄ ⊂
It is worth to note that even if the above-mentioned methods are
. Consequently, the second-stage reranker should learn a function
categorized as first-stage retriever for their efficiency, they can still
𝑓 ∶ { (𝑐, 𝑑̄𝑖 ) ∣ 𝑑̄𝑖 ∈ ̄ } → R which reorders the elements in ̄
be used for end-to-end retrieval, performing jointly the retrieval and
according to how much similar they are to the input claim 𝑐. We
reranking tasks.
observe that learning 𝑠 is not trivial for at least two reasons. First, it has
In this work, we benchmark the wide range of the above-mentioned
a strong efficiency requirement dependent on the million/billion scale
retrievers discussing which category is more promising in the context of
documents’ corpus it deals with. Second, it should be flexible enough
retrieving fact-checked information. In addition, we also assess whether
to integrate additional knowledge about new events. By the same
the most advanced (neural) models could be exploited as one-stage
token, learning 𝑓 could be difficult since input claims can be phrased
retrievers without any reranking.
differently with respect to the corresponding fact-checked information,
Reranker. Even if some of the retriever models have proven discrete even if they refer to the same concepts [5]. Another problem to deal
ranking performance [24,32], researchers are working hard to design with is that complex claims might be the conjunction of different
specialized learning-to-rank systems. In fact, during the last decade, verified claims, thus making even partial matches relevant for the task.
we have witnessed a strong growth in applying deep neural net- It is worth to note that this task does not depend on the claim
works for building ranking models, also referred to as neural ranking truthfulness but can support its estimation since we assume that the
models (NRMs). Specifically, they can categorized into two classes – documents in the corpus have been already fact-checked and thus can
representation-based and interaction-based approaches [10]. The former be used as evidences for the verification task.
methods leverage the same bi-encoder plus matching layer architec-
ture adopted by dense retrieval methods. Some leading examples are 3.2. Benchmarking architecture
DSMN [35] and ESIM [36], which exploit fully-connected networks and
chained LSTMs, respectively, to perform Natural Language Inference As mentioned in the previous section, ranking problems are widely
tasks. In the domain of fact-checking, NSMN [15], first, combines common in information retrieval tasks, and machine learning ap-
bidirectional LSTMs and pooling strategy in order to perform jointly proaches are more and more studied to propose effective solutions.
evidence retrieval and fact verification. In order to integrate and compare the most recent advancements of
On the other hand, interaction-based NRMs aim to capture rele- neural ranking and retrieval models with the classical information
vant matching signals between a query and a document based on retrieval approaches, we considered the two-stages learning-to-rank
word interactions. While pioneering works, i.e. MatchPyramid [37] model depicted in Fig. 1.
and KNRM [38], applied deep neural networks to represent the words The first-stage retriever aims at selecting the subset ̄ of the doc-
interaction matrix, more recently pre-trained transformers [6,7] have uments corpus. Specifically, it assumes that the input claim and the
achieved the new state-of-the-art performance on any ranking-related most related documents share some basic properties such as mentioning
3
4.1. Research questions
Trying to merge the wide literature on retrieval and ranking models

to detect previously fact-checked documents, we design our research
objectives in order to assess which methods are better suited for our
two-stages pipeline. Furthermore, we want to evaluate both the effec-
tiveness and efficiency of the framework with respect to the perfor-
mance of single models and the actual applicability in real scenarios.
Concretely, we want to answer the following research questions:
• (RQ1) Which are the best retrievers? Can modern neural semantic
techniques replace the standard term-based approaches?
Fig. 1. Pipeline of our benchmarking architecture. • (RQ2) Which are the best neural (re-)ranking models?
• (RQ3) What is the benefit of combining retrievers and rerankers
with respect to the overall performance?
the same entities, having similar statistical representation (e.g. TF-
IDF features) or referring to the same concepts/topics. In other words, 4.2. Dataset & Metrics
the retriever filters out the completely unrelated verified information
in . We do not expect that it has high ranking performance; but We considered the dataset provided by [5], consisting of 1000
we require that it has good recall scores in order to not affect the tweets retrieved from Snopes6 fact-checking articles, and of 10 396
reranking performed by the second step. To put it differently, a pre- verified claims extracted from ClaimsKG dataset [19]. Specifically,
selection algorithm which leaves out too many relevant documents, data refers to multiple domains, including politics and gossip, and
would become a bottleneck for the performance of the overall system. the tweet and its verified document may be phrased similarly, thus
Moreover, since the retriever deals with huge quantities of documents, allowing a simpler approximate matching algorithm to work properly,
it should be efficient and scalable with respect to the size of the corpus. or with different terms, thus requiring more refined and semantic-based
We will evaluate the effect of the algorithm’s choice in the experiment strategies to perform the correct match.
section. We used the standard split 60%/20%/20% split, the authors pro-
The second-stage reranker is an advanced NRM which models the vided, for training, validation and testing sets. As in most information
intrinsic semantics of the claim and the (subset of) documents in order retrieval tasks, many verified claims never appear as related to any of
to perform an high performance re-ranking. In other words, once the the original tweets.
retriever has filtered documents which are correlated with the input For the ranking formulation, we adopted Mean Reciprocal Rank
claim at high level, the reranker performs semantic matching trying (MRR), Mean Average Precision truncated at 𝑘 (MAP@k) and the hit
to assess whether the input query and document, i.e. the (𝑐, 𝑑̄𝑖 ) pair, ratio [43] truncated at 𝑘 (HasPositives@k), as evaluation metrics. While
convey, even partially, the same meaning/concepts. the first two metrics take into account the ranking order, the last one
It is worth to note that the chosen multi-stage pipeline allows us to evaluates the capability of the system to retrieve correct matches. It
benchmark both interaction-based and representation-based rerankers is worth to note that since most of the tweets have only one relevant
without affecting too much the system’s efficiency since even if the document, HasPositives@k is almost equal to Recall@k. In addition, we
NRM is computationally expensive, it should only predict the relevance performed the statistical t-test between top-ranked models to assess the
between the input claim and a much smaller set of verified documents, reliability of our results.
i.e. those selected by the retriever algorithm. The extent to which this From the application perspective, metrics on lower values of 𝑘
re-ranking process affects the effectiveness and the efficiency of the (e.g. 𝑘 ∈ {1, 3, 5}) might be indicative of the system utility in easing
framework will be evaluated though our experiments. manual fact-checkers works, i.e. experts would spot in real time if
Despite some recent attempts to build end-to-end neural retrieval the top ranked results are relevant to the input claim. On the other
systems [32,42], we conjecture that the multi-stage pipeline, besides hand, metrics on higher values of 𝑘 (e.g. 𝑘 ∈ {10, 20}) should be
improving efficiency, might also increase the performance of the considered in offline settings and/or in an automated fact-checking
reranker due to the simpler problem it has to solve when combined pipeline where results should be further processed as evidences for the
with the retriever. In other words, if we use the ranking model alone, veracity prediction.
it should learn to distinguish between the input claim’s semantic and all
possible knowledge contained in the documents’ corpus, while in our 4.3. Models & Training details
settings it has to work in a more controllable environment where the
training procedure could assume a certain degree of pertinence between In the following subsection we detail which are the retrievers/
the claim and the documents to (re-)rank. rerankers considered in the benchmark, explaining how they have been
Finally, even if the framework is thought to work with corpus trained and configured in order to promote reproducibility.
represented as lists of verified documents, we point out its flexibility to- We select a wide range of retrievers dividing them in four groups.
wards knowledge graph (KG) representations: specifically, without any First, we consider classical probabilistic approaches including BM25
alteration of the reranker model, the retriever algorithm is replaced by [24], TF-IDF [44] and Language Model with Dirichlet smoothing [45].
an inference procedure on the KG using the entity and the relationships These algorithms assign a score to each tweet-claim pair based on
extracted from input claim. We leave the exploration of this scenario exact matching between the words in the tweet and the words in
to future works. a target verified claim. They have been long studied and applied in
various information retrieval tasks, thus representing the baseline for
4. Experiments the other retrievers. We adopted the Elasticsearch7 (version 7.10.1)
All experiments have been performed on Google Colab equipped

with one single core hyper threaded Xeon Processor @2.2 GHz, 12 GB 6
https://www.snopes.com/.
7
of RAM and a Tesla T4 GPU. The code will be made available on Github. https://www.elastic.co/.
4
implementation for BM25 and LM Dirichlet, with default parameters, While not reaching BM25 performance, it is evident to notice the
and used Haystack library8 for TF-IDF. progress of neural retrievers which overcome the TF-IDF baseline and
Second, we select docT5query [46] as neural sparse retrieval models perform comparably with the LM Dirichlet one.
because expanding the documents with auto-generated queries seems The sparse model docT5Query [46] is the first runner up and
profitable in this context because the query, i.e., the (false) information, exhibits great improvements with respect to BM25 on which it relies.
is often repeated with a few differences over times. Specifically, we In other words, we conjecture that expanding fact-checked documents
adapted the official code9 and use the provided T5-base model to with artificially-generated queries and then indexing through stan-
generate three queries for each document. We then used BM25 to dard techniques (BM25 in our case) is an effective approach because
the generation process clearly extracts the subjects, the topics and/or
reindex the expanded documents.
the events increasing the probability of detecting matching queries
Third, we choose ColBERT [32] as neural sparse retrieval models.
citing those concepts. Unfortunately, we point out how the queries’
It is worth noting that ColBERT can be used for reranking as well, due
generation procedure, relying on the T5 transformer [50], is compu-
to the interaction mechanism it performs between query and document
tationally intensive and might not be usable in online scenarios where
terms. In particular, we used the official repository10 and retrained the the documents’ corpus should be often updated.
bert-based-uncased model using the default hyper-parameters. On the other hand, ColBERT [32] achieves interesting performance
Fourth, we picked SentenceBERT [33] and DPR [34] as neural without requiring any pre-processing step. Moreover, the late inter-
document-based dense retrieval techniques. The former is the first action mechanism it implements between query and document words
attempt to leverage transformer-based models to perform text similarity seems efficient enough to be scalable with million scale corpora.
and thus represents our ‘‘neural’’ baseline. Specifically, we used the Finally, document-level neural retrievers (SentenceBERT [33] and
sentence-transformer library,11 fine tuning (for 4 epochs and a batch DPR [34]) are one step behind the other approaches, probably because
size of 16) the stsb-distilbert-base model using cosine similarity loss. representing the whole document/query with just one embedding pro-
On the other hand, the latter adopts the in-batch negative strategy vides a coarse representation, which does not capture the necessary
to reuse negative examples already in the training batch rather than details to infer the relation between the claim and its verified docu-
creating new ones. In particular, we used the Haystack library,8 fine ment. Concretely, fact-checked documents are usually characterized by
tuning (for 10 epochs and a batch size of 16) the bert-base-uncased longer texts which cite several concepts and entities to assess the claim
model. veracity. Under such scenario, it is difficult to provide an insightful rep-
With the exception of DPR which customizes the batch generation resentation looking at the document as a whole, instead of considering
strategy, the training dataset has always been built considering the more granular information (e.g., text’s terms and/or sentences).
positive query-document pair and 10 random negative ones. To sum up, the recent advancements in neural information retrieval
seem to be bridging the gap with classical retrieval approaches but we
Considering the second stage of the pipeline, we considered 9
have shown that even the most modern retrievers still cannot replace
rerankers, divided in the categories mentioned in Section 2.2. Specif-
them in practice. In addition, combining the two approaches and
ically, we choose MatchPyramid [37], KNMR [38], ConvKNMR [47]
designing more efficient interaction functions are the most promising
and BERT models [6], as interaction-based algorithms; ESIM [36] and
research directions to follow.
HAR [48], as representation-based algorithms; DUET [41] as hybrid
model. 4.5. (RQ2) Which are the best neural re-ranking models?
For HAR we used the official implementation,12 and for all others
methods we adopted the Pytorch implementation of the Matchzoo Table 2 reports the rerankers’ performance, considering just those
framework [49]. All hyper-parameters have been set to default with the queries which have at least one relevant article in the top 50 documents
exception of the number of kernels in KNRM and ConvKNRM which retrieved in the first stage by BM25. Not surprisingly, interaction-
was set to 11. All models have been trained until convergence on based approaches perform generally better than representation-based
the validation set. Finally, for BERT, we adopted the stsb-distilroberta- ones since they explicitly look for relevant matching signals in query-
base cross-encoder provided by the sentence-transformer library,11 fine document pairs.
tuning (for 4 epochs and a batch size of 16) using the cross-entropy Apart from the reranker’s category, the most important insight is
loss. that transformer-based models (BERT [6] and colBERT [32] outperform
When training rerankers, we need to select 𝑘 negative samples for by far other algorithms. Specifically, they reach good results already
each tweet-claim pair. The choice of 𝑘 might be decisive for the per- when considering rankings truncated at top positions, meaning that
they could effectively catch the relation between the fact-checked docu-
formance of the model: low values might determine poor performance
ment’s and the input claim. Despite ranking performance, the execution
because the model would see few pairs representing non-matching
time of these models strongly affects the number of documents they
knowledge. On the other hand, since there is just one verified claim
could practically rerank, we will analyze this aspect in the next section.
matching most of the tweets in our dataset, increasing 𝑘 too much
When observing the huge performance difference between
might lead to imbalanced training set, making the learning task more transformer-based systems and other NRMs, we conjecture that it
difficult. We select 50 random negative documents from the top-100 depends on the transformers’ pre-training procedure, which allows
ones retrieved in the first stage. However, in the experiments we also these models to acquire not only language syntax and semantics but
evaluate the effect of a completely random choice. also factual and relational knowledge [21]. By contrast, other NRMs
(e.g. MatchPyramid [37], KNRM [38] are trained from scratch, thus
4.4. (RQ1) Which are the best retrievers? requiring more training (labeled) data and time to achieve good rerank-
ing performance.
Table 1 reports the results of the retriever models. We do not report Finally, BERT [6] performs better than colBERT [32] because of the
latency performance since the documents’ corpus is too short to observe more complex interaction mechanism it implements to capture match-
significant differences between the chosen models. ing signals between the input claim and the verified document. In-
deed, although colBERT’s late interaction mechanism prioritizes (com-
putational) efficiency, it cannot compete with the full self-attention
8
https://github.com/deepset-ai/haystack. mechanism BERT relies on.
9
https://github.com/castorini/docTTTTTquery. To sum up, employing and fine tuning pre-trained language models
10
https://github.com/stanford-futuredata/ColBERT. seem to be the best and easiest solution to obtain high-quality rank-
11
https://github.com/UKPLab/sentence-transformers. ings. A fairer comparison with other NRMs will be possible when a
12
https://github.com/mingzhu0527/HAR. million-scale dataset of fact-checked information will be released.
5
Table 1
Performance of retrievers (bold indicates the best results, underline the first runner up).
Category Model MRR HasPositives@𝑘
all 𝑘=1 𝑘=3 𝑘=5 𝑘 = 10 𝑘 = 20 𝑘 = 50 𝑘 = 100
TF-IDF 0.681 0.593 0.739 0.789 0.829 0.869 0.914 0.924
Classical LM Dirichlet [51] 0.799 0.770 0.825 0.860 0.890 0.915 0.95 0.960
BM25 [24] 𝟎.𝟖𝟏𝟕 𝟎.𝟕𝟖𝟓 𝟎.𝟖𝟔𝟓 𝟎.𝟖𝟖𝟎 𝟎.𝟖𝟗𝟓 𝟎.𝟗𝟏𝟓 𝟎.𝟗𝟓𝟎 𝟎.𝟗𝟔𝟎
Neural sparse docT5query [30] 0.786 0.754 0.834 0.844 0.894 0.919 0.945 0.960
Term-based ColBERT [32] 0.765 0.708 0.793 0.819 0.874 0.904 0.944 0.949
SentenceBERT [33] 0.669 0.592 0.713 0.763 0.804 0.834 0.884 0.924
Document-level
DPR [34] 0.624 0.547 0.673 0.718 0.753 0.788 0.859 0.909
Table 2 Table 5
Performance of Neural Ranking Models (NRMs) (bold indicates the best results, Effect of negative pairs’ selection during reranker training.
underline the first runner up). Model MAP@𝑘
Category Model MRR MAP@𝑘
𝑘=1 𝑘=3 𝑘=5 𝑘 = 10 𝑘 = 20
all 𝑘=1 𝑘=3 𝑘=5 𝑘 = 10 𝑘 = 20
BERT (random negatives) 0.365 0.525 0.556 0.573 0.575
BERT [6] 𝟎.𝟗𝟔𝟖* 𝟎.𝟗𝟒𝟐* 𝟎.𝟗𝟔𝟖* 𝟎.𝟗𝟔𝟖* 𝟎.𝟗𝟔𝟖* 𝟎.𝟗𝟔𝟖* BERT (top-k negatives) 0.865 0.895 0.901 0.902 0.903
ColBERT [32] 0.903 0.847 0.893 0.901 0.902 0.903
Interaction MAN [22] 0.509 0.386 0.470 0.484 0.501 0.509
-based MatchPyramid [37] 0.495 0.413 0.444 0.462 0.479 0.489
KNRM [38] 0.319 0.212 0.272 0.287 0.298 0.307 retrievers, might compromise the performance becoming a bootleneck
ConvKNRM [52] 0.744 0.677 0.721 0.729 0.738 0.742
for the whole system. In complete fairness, we highlight that the
Representation ESIM [36] 0.507 0.370 0.451 0.482 0.498 0.504 results of the two best models are not statistically different but still
-based HAR [48] 0.602 0.331 0.508 0.557 0.557 0.560
meaningful because, even if the two solutions perform comparably,
Hybrid-based DUET [41] 0.392 0.233 0.302 0.313 0.323 0.330 the combination between BM25 and BERT is much more efficient than
*Statistical significance at 𝑝 = 0.001 w.r.t. the second best BERT alone, as we will show afterwards.
Furthermore, Table 5 clearly proves that retraining the reranker on
the claims retrieved by the first stage positively affects the ranking
Table 3
performance confirming our hypothesis that the use of the prefiltering
Performance of the overall pipeline (bold indicates the best results, underline the first
runner up).
information retrieval algorithm simplifies the learning task since the
Model HasPositive@𝑘
NRM does not have to match the input tweet’s semantic with all
knowledge encoded in the verified claims.
𝑘=1 𝑘=3 𝑘=5 𝑘 = 10 𝑘 = 20
Finally, we evaluate the effect of the number of documents selected
BM25 [24] 0.785 0.865 0.880 0.895 0.915
by the first-stage retriever. Specifically, we assess how this parame-
BERT [6] 𝟎.𝟖𝟔𝟓 𝟎.𝟗𝟑𝟓 𝟎.𝟗𝟔𝟎 𝟎.𝟗𝟕𝟎 𝟎.𝟗𝟖𝟓 ter affects the efficiency and the effectiveness of the overall system.
ColBERT [32] 0.793 0.819 0.874 0.904 0.944
Table 6 reports the runtimes, and their 95% confidence intervals, of
BM25 (100) + BERT 0.862 0.925 0.935 0.945 0.955 the system resulting from the combination of the BM25 retriever and
BM25 (100) + ColBERT 0.779 0.794 0.804 0.804 0.804
each transformer model, the scenario where no re-ranker is applied is
considered as well. While colBERT, as all representation-based models,
Table 4 scales better (even better than the BM25 baseline alone), BERT model
Performance of the overall pipeline (bold indicates the best results, underline the first runtimes strongly increase (up to 30 s per query) with the number
runner up).
of retrieved documents. Concretely, assuming the system should ease
Model MRR MAP@𝑘
manual fact-checker’s effort, we convey that a profitable response
all 𝑘=1 𝑘=3 𝑘=5 𝑘 = 10 𝑘 = 20 all time should be within five seconds thus making BERT algorithm not
BM25 [24] 0.817 0.816 0.819 0.821 0.822 0.817 0.785 tractable when there are more than 1000 documents to rerank. As a
BERT [6] 0.901 0.865 0.895 0.901 0.902 0.903 0.903 result, the application of complex transformer is tightly constrained
ColBERT [32] 0.709 0.749 0.754 0.762 0.765 0.765 0.708 with the adoption of a high-recall retriever which filters out the greatest
BM25 (100) + BERT 𝟎.𝟗𝟎𝟔 𝟎.𝟖𝟕𝟑 𝟎.𝟗𝟎𝟓 𝟎.𝟗𝟎𝟖 𝟎.𝟗𝟎𝟖 𝟎.𝟗𝟎𝟖 𝟎.𝟗𝟎𝟗 part of the documents’ corpus.
BM25 (100) + ColBERT 0.739 0.756 0.760 0.761 0.761 0.762 0.738 In addition, Fig. 2 depicts the MAP metrics varying the top-k fact-
checked documents retrieved by the BM25 baseline: not surprisingly,
performance generally increase (decrease) when considering BERT (col-
4.6. (RQ3) What is the benefit of combining retrievers and rerankers? BERT) reranker. This behavior depends on the fact that when increas-
ing the number of retrieved documents, we are converging on the
As mentioned in the previous section, the two steps of our frame- performance of the second-stage reranker applied in isolation, thus
work capture different kinds of information and thus it is worth ex- not exploiting the retriever anymore. However, we observe that for
ploring how their combination performs. Tables 3 and 4 outlines the BERT, performance no longer improves when retrieving more than 100
performance of the overall system, in terms of HasPositive@k and documents.
MAP@k respectively, considering the combination of two transform- To sum up, information retrieval literature brings a wide range
ers reranker (BERT and ColBERT) with the best retriever algorithm of methods and models, which could be exploited to efficiently solve
(BM25). We configure the system so as the latter model selects the the problem of detecting previously fact-checked information. Specif-
top-100 verified claims from the documents’ corpus. ically, the multi-stage ranking pipeline seems to achieve acceptable
It is evident how the combination is effective when combining quality performance integrating efficient retrievers with most com-
strong retriever with stronger reranker (BERT), in fact, the overall plex rerankers, making the trade-off between ranking performance
combination overcomes both its components, considered in isolation. and runtimes smoother. Concretely, in the context of this benchmark,
On the other hand, weaker rerankers (ColBERT), as well as weak we conclude that the best system is composed by the BM25 model,
6
Table 6 the other hand, the BERT model alone is able to retrieve the correct
Runtimes (in seconds) varying the number of claims to rerank.
verified information but, again, it gets confused with other pizza-
Without rerank colBERT BERT
related claim regarding totally different contexts (e.g. satire, cooking
BM25 (10) 0.0170 ± 0.0017 0.0634 ± 0.0014 0.0500 ± 0.0100 recipes). We believe that this behavior depends on the fact that the
BM25 (100) 0.0233 ± 0.0010 0.0703 ± 0.0014 0.3483 ± 0.0153
considered tweet does not express directly the entities it refers to, thus
BM25 (1000) 0.1156 ± 0.0054 0.1688 ± 0.0053 3.3851 ± 0.1709
BM25 (10000) 0.6122 ± 0.0900 0.7225 ± 0.0908 30.8846 ± 1.5110 making the information extraction and semantic understanding much
more difficult.
5. Theoretical and practical implications
The misinformation phenomenon can have adverse societal effects,

threatening democracies, journalism and freedom of expression. De-
spite their increasing effort, fact-checking organizations cannot keep
the pace of the huge amount of false information spreading on social
media. Within this context, detecting previously fact-checked informa-
tion could improve the verification process increasing fact-checkers
productivity and providing more reliable evidence which the assess-
ment should be based on.
Theoretically, our study bridges the gap between the recent ad-hoc
proposals dealing with the above-mentioned task and the vast amount
of techniques and models that have been proposed in the information
retrieval and Q&A literature. To this end, we have defined a retriever-
reranking framework to assess the efficiency and the effectiveness of
the analyzed techniques, and, as opposed to existing works, we have
explored the best trade-off between ranking performance and execution
times.
Fig. 2. Performance varying the number of claims retrieved by the first stage and Our results show that combining standard and neural approaches
reranked by ColBERT (left) and BERT (right). is the most promising research direction to improve retrieval per-
formance; fine-tuned transformer architectures provide high-quality
(re-)ranking performance. In addition, these complex rerankers might
still be efficient in practice since there is no need to process a high
number of documents to improve ranking performance. Finally, our
retrieving up to 100 fact-checked documents, followed by the BERT
error analysis shows the limitation of standard retrieval algorithms
model which performs a high performance (re-)ranking.
when the inputs texts contain too many entities or do not cite those
entities directly.
4.7. Error analysis
6. Conclusions and future directions

In order to better assess the behavior of the framework in real
scenarios, we conducted an error analysis aiming at understanding
when the best retriever-reranker combination, i.e. the BM25 retriever In this paper, we addressed the problem of detecting previously
followed by a BERT reranker, performs better than its components and fact-checked information through a multi-stage ranking pipeline. We
what kinds of input tweet may still lead to wrong results. have benchmarked state-of-the-art retriever and reranker models, con-
First, we consider the tweet ‘‘#SixFlagsBaltimore closed parks to all sidering also how the combination of standard information retrieval
non-muslims for the Muslim family day. How nice. When is CHRISTIAN algorithms and modern semantic models might improve the overall
day?’’ whose corresponding verified claim is ‘‘Six Flags is temporarily performance. The experimental results prove that the integration of
closing one or more of their theme parks to the public to host Muslim standard term-based and neural-based retrievers is the most promising
Family Day.’’. In this case, neither the BM25 model nor the BERT direction to improve top-k document retrieval. In addition, stronger
model can rank the corresponding verified claim in the first position. transformer-based rerankers seem to be the most effective solution to
In particular, the former algorithm is able to retrieve the right verified perform a high performance reranking due to the extensive knowledge
claim but ranks it on the fifth position along with other topical-related acquired during their pre-training process.
claims such as ‘‘First Lady Michelle Obama proposed a Hug A Muslim There are a number of avenues of future work that we would like to
day to replace Columbus Day.’’, while the BERT model applied in explore. First, after having considered textual claims and documents,
isolation cannot retrieve the correct match in the top-20 ranking, since we wonder whether multimodal data might improve performance,
it gets confused by other information which regard the same subject i.e. tweets’ and articles’ images might be helpful in analyzing the
(e.g. Muslims) but in total different contexts (e.g. war, crime). On semantic relation between the input data and the verified claims. Si-
the other hand, the combination of the two models is effective since multaneously, we would like to explore the possibility to use knowledge
it exploits the partial good results of the first stage and exploits the graph inference algorithms as first-stage retrievers. In addition, some
semantic knowledge acquired by the second one to re-rank the correct works should be devoted to collect a much larger dataset regarding
claim in the first position. claims (and their corresponding verification documents) related to
Second, in order to understand the limits of our approach we different topics and from different sources. Finally, since the proposed
consider a scenario where both the stages applied in isolation and their approaches are opaque with respect to the decisions they make during
combination are not able to perform the correct ranking. Taking the top-k retrieval as well as during reranking, we would like to enhance
tweet ‘‘You should get one for your house! #PizzaVendingMachine’’, the the interpretability of the system using explainable artificial intelli-
ES baseline cannot retrieve the corresponding verified claim in the top- gence techniques in order to unveil the relations that models recognize
100 results, thus penalizing the results of the reranking step as well. On between the input claim and the verified information.
7
CRediT authorship contribution statement [12] R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, P. Nakov, Integrating
stance detection and fact checking in a unified corpus, in: Proceedings of
the 2018 Conference of the North American Chapter of the Association for
Tanmoy Chakraborty: Conceptualization, Methodology, Formal
Computational Linguistics: Human Language Technologies, Volume 2 (Short
analysis, Writing – original draft, Writing – review & editing, Soft- Papers), Association for Computational Linguistics, New Orleans, Louisiana,
ware. Valerio La Gatta: Conceptualization, Methodology, Formal anal- 2018, pp. 21–27, http://dx.doi.org/10.18653/v1/N18-2004, URL: https://www.
ysis, Writing – original draft, Writing – review & editing, Software. aclweb.org/anthology/N18-2004.
Vincenzo Moscato: Conceptualization, Methodology, Formal analy- [13] M. Nadeem, W. Fang, B. Xu, M. Mohtarami, J. Glass, FAKTA: An automatic
end-to-end fact checking system, in: Proceedings of the 2019 Conference of
sis, Writing – original draft, Writing – review & editing, Software.
the North American Chapter of the Association for Computational Linguistics
Giancarlo Sperlì: Conceptualization, Methodology, Formal analysis, (Demonstrations), Association for Computational Linguistics, Minneapolis, Min-
Writing – original draft, Writing – review & editing, Software. nesota, 2019, pp. 78–83, http://dx.doi.org/10.18653/v1/N19-4014, URL: https:
//www.aclweb.org/anthology/N19-4014.
Declaration of competing interest [14] K. Popat, S. Mukherjee, A. Yates, G. Weikum, Declare: Debunking fake news and
false claims using evidence-aware deep learning, in: Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, Association
The authors declare that they have no known competing finan- for Computational Linguistics, Brussels, Belgium, 2018, pp. 22–32, http://dx.
cial interests or personal relationships that could have appeared to doi.org/10.18653/v1/D18-1003, URL: https://www.aclweb.org/anthology/D18-
influence the work reported in this paper. 1003.
[15] Y. Nie, H. Chen, M. Bansal, Combining fact extraction and verification with
neural semantic matching networks, in: Proceedings of the AAAI Conference on
Data availability Artificial Intelligence, Vol. 33, 2019, pp. 6859–6866, http://dx.doi.org/10.1609/
aaai.v33i01.33016859.
The authors do not have permission to share data. [16] A. Soleimani, C. Monz, M. Worring, BERT for evidence retrieval and claim
verification, in: J.M. Jose, E. Yilmaz, J.a. Magalhães, P. Castells, N. Ferro, M.J.
Silva, F. Martins (Eds.), Advances in Information Retrieval, Springer International
Acknowledgements
Publishing, Cham, 2020, pp. 359–366.
[17] C. Chen, F. Cai, X. Hu, J. Zheng, Y. Ling, H. Chen, An entity-graph based rea-
T. Chakrabrorty would like to acknowledge the support of Ramanu- soning method for fact verification, Inf. Process. Manage. 58 (3) (2021) 102472,
jan Fellowship, CAI, IIIT-Delhi and ihub-Anubhuti-iiitd Foundation set http://dx.doi.org/10.1016/j.ipm.2020.102472, URL: https://www.sciencedirect.
up under the NM-ICPS scheme of the Department of Science and com/science/article/pii/S0306457320309614.
[18] C. Chen, F. Cai, X. Hu, W. Chen, H. Chen, HHGN: A hierarchical reasoning-
Technology, India. based heterogeneous graph neural network for fact verification, Inf. Process.
Manage. 58 (5) (2021) 102659, http://dx.doi.org/10.1016/j.ipm.2021.102659,
References URL: https://www.sciencedirect.com/science/article/pii/S0306457321001473.
[19] A. Tchechmedjiev, P. Fafalios, K. Boland, M. Gasquet, M. Zloch, B. Zapilko, S.
[1] H. Allcott, M. Gentzkow, Social Media and Fake News in the 2016 Election, Dietze, K. Todorov, Claimskg: A knowledge graph of fact-checked claims, in:
Working Paper 23089, National Bureau of Economic Research, 2017, http: C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. Cruz, A. Hogan, J. Song,
//dx.doi.org/10.3386/w23089, URL: http://www.nber.org/papers/w23089. M. Lefrançois, F. Gandon (Eds.), The Semantic Web – ISWC 2019, Springer
[2] M. Cinelli, W. Quattrociocchi, A. Galeazzi, C.M. Valensise, E. Brugnoli, A.L. International Publishing, Cham, 2019, pp. 309–324.
Schmidt, P. Zola, F. Zollo, A. Scala, The COVID-19 social media infodemic, [20] P. Shiralkar, A. Flammini, F. Menczer, G.L. Ciampaglia, Finding streams in knowl-
2020, CoRR;abs/2003.05004. URL: https://arxiv.org/abs/2003.05004. arXiv: edge graphs to support fact checking, in: 2017 IEEE International Conference on
2003.05004. Data Mining, ICDM, 2017, pp. 859–864, http://dx.doi.org/10.1109/ICDM.2017.
[3] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news on- 105.
line, Science 359 (6380) (2018) 1146–1151, http://dx.doi.org/10.1126/science. [21] F. Petroni, T. Rocktäschel, P.S.H. Lewis, A. Bakhtin, Y. Wu, A.H. Miller, S.
aap9559, URL: https://science.sciencemag.org/content/359/6380/1146. arXiv: Riedel, Language models as knowledge bases? 2019, CoRR;abs/1909.01066. URL:
https://science.sciencemag.org/content/359/6380/1146.full.pdf. http://arxiv.org/abs/1909.01066. arXiv:1909.01066.
[4] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, [22] N. Vo, K. Lee, Where are the facts? Searching for fact-checked information
M. Joseph, A. Kulkarni, A.K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: to alleviate the spread of fake news, in: Proceedings of the 2020 Conference
The first-ever end-to-end fact-checking system, Proc. VLDB Endow. 10 (12) on Empirical Methods in Natural Language Processing, EMNLP, Association
(2017) 1945–1948, http://dx.doi.org/10.14778/3137765.3137815. for Computational Linguistics, Online, 2020, pp. 7717–7731, http://dx.doi.org/
[5] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a known Lie: Detect- 10.18653/v1/2020.emnlp-main.621, URL: https://www.aclweb.org/anthology/
ing previously fact-checked claims, in: Proceedings of the 58th Annual Meeting 2020.emnlp-main.621.
of the Association for Computational Linguistics, Association for Computational [23] M. Bouziane, H. Perrin, A. Cluzeau, J. Mardas, A. Sadeq, Team buster.ai at
Linguistics, Online, 2020, pp. 3607–3618, http://dx.doi.org/10.18653/v1/2020. CheckThat! 2020 insights and recommendations to improve fact-checking, in:
acl-main.332, URL: https://www.aclweb.org/anthology/2020.acl-main.332. CLEF, 2020.
[6] L. Zhuang, L. Wayne, S. Ya, Z. Jun, A robustly optimized BERT pre-training [24] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and
approach with post-training, in: Proceedings of the 20th Chinese National beyond, Found. Trends Inf. Retr. 3 (4) (2009) 333–389, http://dx.doi.org/10.
Conference on Computational Linguistics, Chinese Information Processing Society 1561/1500000019.
of China, Huhhot, China, 2021, pp. 1218–1227, URL: https://aclanthology.org/ [25] G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The vocabulary problem
2021.ccl-1.108. in human-system communication, Commun. ACM 30 (11) (1987) 964–971,
[7] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidi- http://dx.doi.org/10.1145/32206.32212.
rectional transformers for language understanding, 2018, CoRR;abs/1810.04805. [26] D. Metzler, W.B. Croft, A Markov random field model for term dependencies, in:
URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805. Proceedings of the 28th Annual International ACM SIGIR Conference on Research
[8] A. Barron-Cedeno, T. Elsayed, P. Nakov, G.D.S. Martino, M. Hasanain, R. and Development in Information Retrieval, SIGIR ’05, Association for Computing
Suwaileh, F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z.S. Ali, Machinery, New York, NY, USA, 2005, pp. 472–479, http://dx.doi.org/10.1145/
Overview of CheckThat! 2020: Automatic identification and verification of claims 1076034.1076115.
in social media, 2020, arXiv:2007.07997. [27] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res.
[9] Y. Cai, Y. Fan, J. Guo, F. Sun, R. Zhang, X. Cheng, Semantic models for the 3 (null) (2003) 993–1022.
first-stage retrieval: A comprehensive review, 2021, CoRR;abs/2103.04831. URL: [28] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Pro-
https://arxiv.org/abs/2103.04831. arXiv:2103.04831. ceedings of the 13th International Conference on Neural Information Processing
[10] J. Guo, Y. Fan, L. Pang, L. Yang, Q. Ai, H. Zamani, C. Wu, W.B. Croft, X. Cheng, Systems, NIPS ’00, MIT Press, Cambridge, MA, USA, 2000, pp. 535–541.
A deep look into neural ranking models for information retrieval, Inf. Process. [29] Z. Dai, J. Callan, Context-aware term weighting for first stage passage retrieval,
Manage. 57 (6) (2020) 102067, http://dx.doi.org/10.1016/j.ipm.2019.102067. in: Proceedings of the 43rd International ACM SIGIR Conference on Research
[11] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale and Development in Information Retrieval, SIGIR ’20, Association for Computing
dataset for fact extraction and VERification, in: Proceedings of the 2018 Machinery, New York, NY, USA, 2020, pp. 1533–1536, http://dx.doi.org/10.
Conference of the North American Chapter of the Association for Computational 1145/3397271.3401204.
Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association [30] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction,
for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 809–819, http: 2019, CoRR;abs/1904.08375.URL: http://arxiv.org/abs/1904.08375. arXiv:1904.
//dx.doi.org/10.18653/v1/N18-1074, URL: https://aclanthology.org/N18-1074. 08375.
8
[31] P. Nie, Y. Zhang, X. Geng, A. Ramamurthy, L. Song, D. Jiang, DC-BERT: [49] J. Guo, Y. Fan, X. Ji, X. Cheng, MatchZoo: A learning, practicing, and developing
decoupling question and document for efficient contextual encoding, in: J. system for neural text matching, in: Proceedings of the 42Nd International ACM
Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, Y. Liu (Eds.), SIGIR Conference on Research and Development in Information Retrieval, SIGIR
Proceedings of the 43rd International ACM SIGIR Conference on Research and ’19, ACM, New York, NY, USA, 2019, pp. 1297–1300, http://dx.doi.org/10.1145/
Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 3331184.3331403, URL: http://doi.acm.org/10.1145/3331184.3331403.
25-30, 2020, ACM, 2020, pp. 1829–1832, http://dx.doi.org/10.1145/3397271. [50] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W.
3401271. Li, P.J. Liu, Exploring the limits of transfer learning with a unified text-to-text
[32] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via con- transformer, J. Mach. Learn. Res. 21 (2020) 1–67.
textualized late interaction over BERT, in: Proceedings of the 43rd International [51] C. Zhai, J. Lafferty, A study of smoothing methods for language models applied
ACM SIGIR Conference on Research and Development in Information Retrieval, to ad hoc information retrieval, in: Proceedings of the 24th Annual International
SIGIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. ACM SIGIR Conference on Research and Development in Information Retrieval,
39–48, http://dx.doi.org/10.1145/3397271.3401075. SIGIR ’01, Association for Computing Machinery, New York, NY, USA, 2001, pp.
[33] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese 334–342, http://dx.doi.org/10.1145/383952.384019.
BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), EMNLP/IJCNLP [52] Z. Dai, C. Xiong, J. Callan, Z. Liu, Convolutional neural networks for soft-
(1), Association for Computational Linguistics, 2019, pp. 3980–3990, URL: http: matching N-grams in ad-hoc search, in: Proceedings of the Eleventh ACM
//dblp.uni-trier.de/db/conf/emnlp/emnlp2019-1.html#ReimersG19. International Conference on Web Search and Data Mining, WSDM ’18, Asso-
[34] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, ciation for Computing Machinery, New York, NY, USA, 2018, pp. 126–134,
Dense passage retrieval for open-domain question answering, in: Proceedings http://dx.doi.org/10.1145/3159652.3159659.
of the 2020 Conference on Empirical Methods in Natural Language Processing,
EMNLP, Association for Computational Linguistics, Online, 2020, pp. 6769–
6781, http://dx.doi.org/10.18653/v1/2020.emnlp-main.550, URL: https://www. Tanmoy Chakraborty is an Assistant Professor of Computer
aclweb.org/anthology/2020.emnlp-main.550. Science and a Ramanujan Fellow at IIIT-Delhi, India, where
[35] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured he leads a research group, Laboratory for Computational
semantic models for web search using clickthrough data, in: Proceedings of the Social Systems (LCS2). His primary research interests in-
22nd ACM International Conference on Information & Knowledge Management, clude Social Computing and Natural Language Processing.
CIKM ’13, Association for Computing Machinery, New York, NY, USA, 2013, pp. He has received several awards/fellowships including Fac-
2333–2338, http://dx.doi.org/10.1145/2505515.2505665. ulty Awards from Google, IBM and LinkedIn; Early Career
[36] Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, D. Inkpen, Enhanced LSTM for Research Award, DAAD Faculty Fellowship. He is a member
natural language inference, in: Proceedings of the 55th Annual Meeting of the of ACM and a senior member of IEEE. More details at
Association for Computational Linguistics (Volume 1: Long Papers), Association http://faculty.iiitd.ac.in/~tanmoy/.
for Computational Linguistics, Vancouver, Canada, 2017, pp. 1657–1668, http:
//dx.doi.org/10.18653/v1/P17-1152, URL: https://www.aclweb.org/anthology/
P17-1152.
[37] L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, X. Cheng, Text matching as image Valerio La Gatta is a Ph.D. student in Information Tech-
recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, nology and Electrical Engineering at the Department of
Vol. 30, 2016, 1. Electrical Engineering and Information technology of the
[38] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking University of Naples Federico II. He received the Master de-
with kernel pooling, in: Proceedings of the 40th International ACM SIGIR gree in Computer engineering from the University of Naples
Conference on Research and Development in Information Retrieval, SIGIR ’17, Federico II in 2020. His research interests are focused on
Association for Computing Machinery, New York, NY, USA, 2017, pp. 55–64, Social Network Analysis, eXplainable Artificial Intelligence,
http://dx.doi.org/10.1145/3077136.3080809. Graph Data Mining.
[39] S. Han, X. Wang, M. Bendersky, M. Najork, Learning-to-rank with BERT in TF-
ranking, 2020, CoRR;abs/2004.08476. URL: https://arxiv.org/abs/2004.08476.
arXiv:2004.08476.
[40] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with Vincenzo Moscato is an Associate Professor at the Electrical
BERT, 2019, CoRR;abs/1910.14424. URL: http://arxiv.org/abs/1910.14424. Engineering and Information Technology Department of
arXiv:1910.14424. University of Naples ‘‘Federico II’’. He received the Ph.D.
[41] B. Mitra, F. Diaz, N. Craswell, Learning to match using local and distributed degree in Computer Science from the same University by
representations of text for web search, in: Proceedings of the 26th International defending the thesis: ‘Indexing Techniques for Image and
Conference on World Wide Web, WWW ’17, International World Wide Web Video Databases: an approach based on Animate Vision
Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2017, Paradigm’’. He is one of the leaders of PICUS (Pattern and
pp. 1291–1299, http://dx.doi.org/10.1145/3038912.3052579. Intelligence Computation for mUltimedia Systems) depart-
[42] A. Vakili Tahami, K. Ghajar, A. Shakery, Distilling knowledge for fast retrieval- mental research groups and a member of the Big Data and
based chat-bots, in: Proceedings of the 43rd International ACM SIGIR Conference Artificial Intelligence national laboratories within the Con-
on Research and Development in Information Retrieval, SIGIR ’20, Association sorzio Interuniversitario Nazionale per l’Informatica (CINI).
for Computing Machinery, New York, NY, USA, 2020, pp. 2081–2084, http: His research activities lay in the area of Multimedia, Big
//dx.doi.org/10.1145/3397271.3401296. Data, Artificial Intelligence and Social Network Analysis. He
[43] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua, Neural collaborative filtering, was involved in many national and international research
in: Proceedings of the 26th International Conference on World Wide Web, WWW projects and coordinated as principal investigator some of
’17, International World Wide Web Conferences Steering Committee, Republic the them. He was in the program committees of numerous
and Canton of Geneva, CHE, 2017, pp. 173–182, http://dx.doi.org/10.1145/ international conferences and in the editorial boards of
3038912.3052569. several important journals. Finally, he was an author of
[44] H.C. Wu, R.W.P. Luk, K.F. Wong, K.L. Kwok, Interpreting TF-IDF term weights about 200 publications on international journal, conference
as making relevance decisions, ACM Trans. Inf. Syst. 26 (3) (2008) http://dx. proceedings and book chapters.
doi.org/10.1145/1361684.1361686.
[45] C. Zhai, Statistical language models for information retrieval, Synth.
Lect. Hum. Lang. Technol. 1 (1) (2008) 1–141, http://dx.doi.org/ Giancarlo Sperlì is an assistant professor at the Department
10.2200/S00158ED1V01Y200811HLT001, URL: https://doi.org/10. of Electrical Engineering and Information Technology of the
2200/S00158ED1V01Y200811HLT001. arXiv:https://doi.org/10.2200/ University of Naples Federico II. He obtained his Ph.D. in
S00158ED1V01Y200811HLT001. Information Technology and Electrical Engineering at the
[46] D. Cheriton, From doc2query to doctttttquery, 2019. same University defending his thesis: ‘‘Multimedia Social
[47] Z. Dai, C. Xiong, J. Callan, Z. Liu, Convolutional neural networks for soft- Networks’’. He is a member of the Pattern analysis and
matching N-grams in ad-hoc search, in: Proceedings of the Eleventh ACM Intelligent Computation for mUltimedia Systems (PICUS)
International Conference on Web Search and Data Mining, WSDM ’18, Asso- departmental research groups. His main research interests
ciation for Computing Machinery, New York, NY, USA, 2018, pp. 126–134, are in the area of Cybersecurity, Semantic Analysis of
http://dx.doi.org/10.1145/3159652.3159659. Multimedia Data and Social Networks Analysis. He served
[48] M. Zhu, A. Ahuja, W. Wei, C.K. Reddy, A hierarchical attention retrieval model as guest editor of different special issues on International
for healthcare question answering, in: The World Wide Web Conference, WWW Journals. Finally, he has authored about 95 publications
’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. in international journals, conference proceedings and book
2472–2482, http://dx.doi.org/10.1145/3308558.3313699. chapters.

1-s2.0-S0925231223008032-main

Uploaded by

Copyright:

Available Formats

1-s2.0-S0925231223008032-main

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1-s2.0-S0925231223008032-main

Uploaded by

Copyright:

Available Formats

Neurocomputing 557 (2023) 126680

Contents lists available at ScienceDirect

Information retrieval algorithms and neural ranking models to detect

ARTICLE INFO ABSTRACT

1. Introduction agreed on a self-regulatory Code of Practice3 to converge on a common

4.1. Research questions

Trying to merge the wide literature on retrieval and ranking models

All experiments have been performed on Google Colab equipped

5. Theoretical and practical implications

The misinformation phenomenon can have adverse societal effects,

6. Conclusions and future directions

You might also like