This is the accepted manuscript for the IEEE Internet Computing September/October 2013 special issue on crowdsourcing.
The (more polished) final version can be downloaded from IEEE.
Crowdsourcing
Obtaining High-Quality Relevance Judgments
Using Crowdsourcing
Jeroen B.P. Vuurens
The Hague University of Applied Sciences
Arjen P. de Vries
Centrum Wiskunde & Informatica
The performance of information retrieval (IR) systems is commonly
evaluated using a test set with known relevance. Crowdsourcing is one
method for learning which documents are relevant to each query in the test set.
However, the quality of relevance judgments obtained through crowdsourcing can be
questionable, because it relies on workers of unknown quality, with possible
spammers among them. To detect spammers, the authors’ algorithm
compares judgments between workers; they evaluate their approach by
comparing the consistency of crowdsourced ground truth to that obtained
from expert annotators and conclude that crowdsourcing can match the
quality obtained from the latter.
Information retrieval (IR) systems can help us find necessary data and avoid information overload. To
evaluate such systems’ performance, we would ideally obtain a known ground truth for every document
retrieved for a submitted query — that is, whether the document is relevant or nonrelevant. To establish
ground truth, we must obtain relevance judgments, or users’ subjective opinions regarding how relevant a
document is to an information need they expressed in a query.1 Commonly, expert annotators manually
judge relevance for the top-k ranked documents, a time-consuming and expensive process.2 Alternatively,
we can obtain relevance judgments from anonymous Web users.2 Using crowdsourcing services such as
Amazon’s Mechanical Turk (MTurk; www.mturk.com/mturk/welcome) or CrowdFlower
(http://crowdflower.com), we can obtain judgments from a large number of workers in a short amount of
time relatively inexpensively. By gathering several judgments per query-document pair and aggregating
them using a consensus algorithm, we can establish an assumed ground truth for that pair.3
Using crowdsourcing to obtain relevance judgments presents challenges, however. Several studies found
spam among the obtained crowdsourcing results, with some reporting as much as 50 percent spam.4–6 The
random judgments spammers produce can cause inconsistencies in the obtained ground truth, increasing the
variance of evaluation measures, possibly to a point at which the test set is useless. As a solution, we
propose a new approach to detect spam in crowdsourced relevance judgments. We evaluate such judgments
by comparing their consistency to that of judgments from expert annotators.
Spam in the Crowdsourcing Environment
To develop our approach, we first examined several characteristics inherent to workers in crowdsourcing
environments. We also looked at existing anticheating mechanisms to determine ways we might detect
spammers.
Worker Taxonomy
To discover characteristics that would help us identify cheaters among honest workers, we studied different
worker types. Diligent workers follow instructions and aim to produce meaningful results.6 Sloppy workers
might have good intentions but deliver poor-quality judgments.7 This might result from workers having a
different frame of reference or being less capable of performing the task as required; it can also be due to
workers not following instructions or rushing their work. Such workers are likely to disagree with
coworkers more often.8
Another study analyzed spammers’ voting behavior and identified workers that purposely randomize
their responses so that requesters will have trouble discovering their dishonesty.9 Such random spammers
will likely have an interworker agreement that’s close to random chance.
Finally, uniform spammers use a fixed voting pattern. They seem interested neither in performing the
task as intended nor in advanced cheating, but rather mostly repeat the same answer.9 Although we can
easily detect the long-repeating patterns such spammers produce using manual inspections, automated spam
detection can overlook them — uniform voting patterns increase the chance that these workers will judge in
accordance with other uniform spammers over many query-document pairs or get more answers right on a
skewed relevance distribution.
Anticheating Mechanisms
We can defend against spam in crowdsourcing environments using anticheating mechanisms, including
administering qualification tests,7 hiding trick questions to see if workers are paying attention,9 evaluating
workers on questions with a known ground truth,10 analyzing the time workers spend on tasks,11 removing
low-quality workers,12 or identifying spammers based on monitored user behavior.13 Cheating strategies
evolve as cheaters learn from being detected, so spam-detection mechanisms can become less effective, as
with the time-on-task heuristic.14
A different approach discourages spammers by changing how users interact with a task. Those tasks that
require a worker to enter text or answer questions related to one another aren’t as easy to spam as closed
questions and thus attract fewer spammers.1,8,10 Even without antispam measures, the quality of
crowdsourcing results can be improved using spam-resistant consensus algorithms such as
GetAnotherLabel. This expectation-maximization algorithm estimates each worker’s latent quality and uses
the maximum likelihood that the worker is correct in estimating the most likely correct outcome per
question.3
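For illustration, the following Python sketch implements a simplified expectation-maximization consensus in the spirit of GetAnotherLabel. It isn’t the actual tool: each worker is modeled with a single accuracy value rather than the full confusion matrices the real algorithm estimates, and the function name and data layout are illustrative.

```python
from collections import defaultdict

def em_consensus(judgments, labels, iterations=20):
    """Simplified EM consensus: `judgments` is a list of (worker, item, label)
    tuples, `labels` is the label set. Each worker is modeled by one accuracy."""
    # Initialize item posteriors from per-item vote counts (majority voting).
    counts = defaultdict(lambda: defaultdict(float))
    for worker, item, label in judgments:
        counts[item][label] += 1.0
    posteriors = {item: {l: c[l] / sum(c.values()) for l in labels}
                  for item, c in counts.items()}
    accuracy = defaultdict(lambda: 0.7)  # prior guess for unseen workers

    for _ in range(iterations):
        # M-step: a worker's accuracy is the expected fraction of correct labels.
        correct, total = defaultdict(float), defaultdict(float)
        for worker, item, label in judgments:
            correct[worker] += posteriors[item][label]
            total[worker] += 1.0
        for worker in total:
            accuracy[worker] = correct[worker] / total[worker]

        # E-step: re-estimate each item's label posterior from worker accuracies.
        for item in posteriors:
            scores = {l: 1.0 for l in labels}
            for worker, it, label in judgments:
                if it != item:
                    continue
            # Clamp accuracy so no label probability collapses to zero.
                p = min(max(accuracy[worker], 0.01), 0.99)
                for l in labels:
                    scores[l] *= p if l == label else (1.0 - p) / (len(labels) - 1)
            norm = sum(scores.values())
            posteriors[item] = {l: s / norm for l, s in scores.items()}

    # Return the most likely label per item plus the estimated worker accuracies.
    return {item: max(dist, key=dist.get) for item, dist in posteriors.items()}, accuracy
```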
In general, existing anticheating mechanisms require additional effort from requesters to construct tasks,
prepare tests that can detect cheaters, or process outcomes.
Quality Control
To improve average label (judgment) quality, we identify and remove spammers by comparing their
judgments to those of all other workers for the same questions. We then replace spammers’ judgments with
new ones from other workers. We first obtain the required number of judgments for all query-document
pairs, analyze the data, remove only the most likely spammer, and then replace all judgments the removed
spammer submitted. We iterate these steps until we detect no more spammers. Finally, we aggregate the
obtained judgments using GetAnotherLabel.3
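As a sketch only, the loop below outlines this procedure. Here, spam_scores and collect_replacement_judgments are placeholder names for steps described in the text (the two spam scores are sketched later in the article, and replacements come from a new round of crowdsourcing), most_likely_spammer is the selection rule whose tuned thresholds appear later, and em_consensus refers to the consensus sketch above.

```python
def quality_control(judgments, labels):
    """Outline of the quality-control loop: score all workers, remove the single
    most likely spammer, replace that worker's judgments with new ones, and
    repeat until nobody is flagged; finally aggregate with a consensus algorithm."""
    while True:
        scores = spam_scores(judgments)          # (RandomSpam, UniformSpam) per worker
        spammer = most_likely_spammer(scores)    # None when no score exceeds its threshold
        if spammer is None:
            break
        removed_items = [item for w, item, _ in judgments if w == spammer]
        judgments = [j for j in judgments if j[0] != spammer]
        judgments += collect_replacement_judgments(removed_items)
    return em_consensus(judgments, labels)
```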
Relevance Judgment Consistency
Relevance judgments obtained via crowdsourcing can differ from expert annotators’ judgments.
Crowdsourcing tasks are commonly divided into small human-intelligence tasks (HITs), which lets
individual workers decide how much work they’re willing to do. Documents that are retrieved for the same
query are thus likely to be divided across several workers for judgment, rather than given to a single expert
annotator who would judge them all. If workers judge their share of documents for the same query with
different subjective opinions, similar documents could receive inconsistent judgments. To reduce these
inconsistencies, workers assess relevance according to specific instructions rather than use their own
personal preferences, which could be key to reliable results.15 Having workers make assessments according
to instructions also gives us a common ground for aggregating judgments.
Detecting Random Spammers
A good spam-detection mechanism should find nearly all spammers while minimizing falsely rejected
diligent workers. To decrease such false positives, we need a measure that’s reliable in noisy environments.
Figure 1a shows that two diligent workers are significantly less likely to vote on nonadjoining labels (two
labels that aren’t adjacent on the ordinal scale) than are a spammer and a diligent worker (P = 0.27 versus P
= 0.39, χ²-test p < 0.001; see Figure 1b). We use this feature to calculate a RandomSpam score for every
worker $w$ as the average squared ordinal distance $ord_{ij}^2$ between their judgments $j \in J_w$ and judgments by
other workers on the same query-document pair, $i \in J_{j,\neg w}$:

\mathrm{RandomSpam}_w = \frac{\sum_{j \in J_w} \sum_{i \in J_{j,\neg w}} ord_{ij}^2}{\sum_{j \in J_w} \left| J_{j,\neg w} \right|}   (1)
Note that we don’t use judgments on the “empty/corrupt” label in this computation.
To allow for differences of opinion between diligent workers, we set $ord_{ij} = 0$ if $ord_{ij} < 2$. To reduce
false positives, we also set $ord_{ij} = 0$ when a worker disagrees with only one other worker, reasoning that when
two out of five workers are cheating on a query-document pair, it’s often the case that one spammer
selects a label nonadjoining to the other workers’ labels.
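The sketch below shows one way to compute this score from a list of (worker, query-document pair, ordinal label) tuples; the data layout, the function name, and the exact scope of the single-disagreement safeguard are illustrative simplifications rather than the exact code we used.

```python
from collections import defaultdict

def random_spam_score(worker, judgments):
    """RandomSpam score (Equation 1): average squared ordinal distance between a
    worker's judgments and co-judgments on the same query-document pairs.
    `judgments` holds (worker, item, ordinal_label) tuples; judgments on the
    'empty/corrupt' label are assumed to be filtered out beforehand."""
    by_item = defaultdict(list)
    for w, item, label in judgments:
        by_item[item].append((w, label))

    numerator, denominator = 0.0, 0
    for w, item, label in judgments:
        if w != worker:
            continue
        others = [l for ow, l in by_item[item] if ow != worker]
        distances = [abs(label - l) for l in others]
        # Tolerate small differences of opinion: adjacent labels don't count.
        distances = [d if d >= 2 else 0 for d in distances]
        # A disagreement with only one other worker doesn't count either.
        if sum(1 for d in distances if d > 0) <= 1:
            distances = [0] * len(distances)
        numerator += sum(d * d for d in distances)
        denominator += len(others)
    return numerator / denominator if denominator else 0.0
```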
Figure 1. Detecting random spammers. We looked at judgments from (a) accepted and (b) rejected workers
and compared them with all accepted workers who judged the same label (solid), adjoining labels
(shaded), and nonadjoining labels (white). We obtained these results for the Text Retrieval Conference
crowdsourcing track.
Detecting Uniform Spammers
Uniform spammers choose one label or a simple pattern to uniformly spam, which helps them avoid
detection via interworker comparison. In simulations, we found that if such spammers become abundant,
they could affect consensus and further increase their chances of remaining undetected.6 We designed a
separate algorithm to detect such spammers on the basis of a study for detecting fake coin flip sequences.16
This algorithm counts the average squared number of disagreements uniform spammers have with other
workers within repeating voting patterns:
\mathrm{UniformSpam}_w = \frac{\sum_{s \in S} |s| \left( f(s, J_w) - 1 \right) \sum_{j \in J_{s,w}} \left( \sum_{i \in J_{j,\neg w}} disagree_{ij} \right)^2}{\sum_{s \in S} \sum_{j \in J_{s,w}} \left| J_{j,\neg w} \right|} ,   (2)
where $S$ is a collection of all possible label sequences $s$ with length $|s| = 2$ or $3$. We calculate the number of
disagreements $disagree_{ij}$ between the worker’s judgments $J_{s,w}$ that occur within label sequence $s$ and the
judgments $J_{j,\neg w}$ that other workers have submitted for the same query-document pair as judgment $j$. To
reduce false positives, $disagree_{ij} = 0$ if $disagree_{ij} < 2$; $f(s, J_w)$ is the frequency at which label sequence $s$
occurs within worker $w$’s time-ordered judgments $J_w$.
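The sketch below shows one possible reading of Equation 2: disagreements with co-workers count only for judgments that fall inside label sequences the worker repeats, weighted by sequence length and repeat count. The data layout and some details of the weighting are illustrative simplifications rather than the exact code we used.

```python
from collections import defaultdict

def uniform_spam_score(worker, judgments, seq_lengths=(2, 3)):
    """UniformSpam score (Equation 2), computed from (worker, item, label) tuples
    listed in the order the worker submitted them."""
    by_item = defaultdict(list)
    own = []                                   # the worker's time-ordered (item, label) pairs
    for w, item, label in judgments:
        by_item[item].append((w, label))
        if w == worker:
            own.append((item, label))

    label_stream = [label for _, label in own]
    numerator, denominator = 0.0, 0
    for n in seq_lengths:
        freq = defaultdict(int)                # f(s, J_w): occurrences of each sequence s
        covered = defaultdict(list)            # judgments covered by occurrences of s
        for start in range(len(own) - n + 1):
            s = tuple(label_stream[start:start + n])
            freq[s] += 1
            covered[s].extend(own[start:start + n])
        for s, occurrences in covered.items():
            weight = n * (freq[s] - 1)         # only repeating sequences contribute
            for item, label in occurrences:
                others = [l for ow, l in by_item[item] if ow != worker]
                disagreements = sum(1 for l in others if l != label)
                if disagreements < 2:          # false-positive safeguard from the text
                    disagreements = 0
                numerator += weight * disagreements ** 2
                denominator += len(others)
    return numerator / denominator if denominator else 0.0
```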
Tuning the Spam-Detection Parameters
In a pilot run, we obtained 1,320 judgments on questions with known ground truth from expert annotators.
We split the worker population at a RandomSpam and UniformSpam score of 2.0, computed against this ground truth, to stereotype
each worker type’s voting distribution. We seeded a simulation6 using these voting distributions, simulating
50 percent spammers and iteratively removing the worker with the highest max(RandomSpam,
UniformSpam) score. Empirically, we obtained the best results by removing workers with a RandomSpam
score higher than 1.3 and a UniformSpam score higher than 1.6. Using these parameters on the data
obtained in the pilot run removes workers with an average binary agreement to ground truth of 46.4
percent, while accepted workers have an average binary agreement of 63.6 percent to ground truth. The
simulation model estimates that with 50 percent spam among the workers, our algorithm used with these
parameters will remove more than 98 percent of the spammers and less than 2 percent of the diligent
workers. Simulating with a minimum of five judgments per worker instead of 10 gives comparable results
on spam removal and obtained accuracy, but falsely rejects twice as many diligent workers.
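The per-iteration selection rule with these tuned thresholds can be summarized as follows; the function name and the layout of the score dictionary are illustrative.

```python
def most_likely_spammer(scores, random_threshold=1.3, uniform_threshold=1.6):
    """Among workers whose RandomSpam or UniformSpam score exceeds its tuned
    threshold, return the one with the highest max(RandomSpam, UniformSpam);
    return None when nobody is flagged. `scores` maps each worker to a
    (random_spam, uniform_spam) tuple."""
    flagged = {w: max(rs, us) for w, (rs, us) in scores.items()
               if rs > random_threshold or us > uniform_threshold}
    return max(flagged, key=flagged.get) if flagged else None
```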
Results
We tested our spam-detection algorithms, using data that we gathered in a field test at the 2011 Text
Retrieval Conference (TREC), which organized a crowdsourcing track to address the use of crowdsourcing
to evaluate search engines.
Data
The TREC task was to investigate how to obtain high-quality relevance judgments from individual crowd
workers. For this task, eight participants crowdsourced judgments to estimate relevance for a portion of
3,380 query-document pairs. We were assigned 2,165 pairs for 20 different queries, which were divided
into 433 sets, each with one query and five documents.
We used a simple HIT design that showed workers the query terms, a description containing clear
instructions on what to judge as relevant, the title, and a rendered image of each document (see Figure 2).
We asked workers to judge relevance on a five-point ordinal scale: totally relevant, partly relevant, related,
off topic/spam, or empty/corrupt. We obtained the results in two batches, using one set of five query-document pairs in both batches to create HITs that all contained 10 query-document pairs. Within six days,
we obtained 20,840 judgments from 503 unique workers on MTurk. For every query-document pair, we
initially collected five judgments, gathering more labels to replace detected spam; we used a different
detection mechanism for TREC than the one described here. On average, we collected more than nine
labels per query-document pair. We typically rejected one to three workers per iteration cycle, resulting in a
limited number of HITs being available at the same time. Also, the system didn’t let workers work on more
than one HIT simultaneously or judge the same query-document pair twice. Fifty-eight percent of the
workers worked only one HIT, perhaps partly due to these restrictions. We paid $0.05 plus 10 percent
commission per accepted HIT. In a scenario using five labels per question, this would result in a total cost
of $60.
Figure 2. Instructions used in the Text Retrieval Conference (TREC) 2011 crowdsourcing track. We asked
workers to judge the relevance of documents to a query on an ordinal five-point scale. The instructions
indicate under what conditions they should judge a document as being relevant.
For the study presented here, we drew judgments from the data we obtained for TREC in chronological
order until all query-document pairs had the required number of judgments. In the rare event that the
judgments needed for one set of five query-document pairs were only available in a submitted HIT paired
with another set of five query-document pairs that we didn’t need, we used the whole HIT.
In the absence of known ground truth, we estimated it using GetAnotherLabel on the 41,081 judgments
the other seven TREC participants obtained for the same track.
Time per Judgment
We designed the HIT on MTurk using an external question tag that showed content hosted on our web server
and registered the time taken per judgment. Figure 3 shows the average time per question versus the
question number, in the order that workers answered each question. Ninety-five percent of the HITs have a
different query for the first set of five documents than for the second set (the remaining HITs have the same
query for both sets). Figure 3a thus shows that accepted workers take more time on the first and sixth
question of every HIT — namely, to read a change in the query displayed on the screen. Two expected
learning trends are visible: workers judge consecutive documents to the same query faster, and they
become faster as they work more HITs. Analysis also shows that the total number of HITs performed
doesn’t influence time per HIT on earlier HITs; all workers have similar learning curves.
Figure 3. Time per judgment over different worker types. We showed workers a query followed by five
documents. A new query for the next set of five documents was displayed on the first and sixth question of
every human intelligence task (HIT). To suppress the noise from users taking long breaks rather than
actually working on their assignments, we used only judgments with a time per question < 90 seconds.
Unexpectedly, rejected workers have nearly identical time-per-judgment patterns as accepted workers
on the first two to three HITs. That is, they use more time on the sixth question of each HIT, indicating that
they pay attention to the information on the screen (see Figure 3b). After working three HITs, however,
rejected workers no longer use more time on the sixth question, indicating that they’re intentionally
spamming (they no longer notice any change to the queries onscreen). However, on the first HITs, these
workers individually show the same increase of time on the first and sixth questions, with learning curves
similar to those of other workers. We calculated separate spam scores per HIT, which revealed that these
workers spam right from the start. Although they aren’t following instructions, they take more time to read
new queries onscreen; thus, they might be looking to improve their chances of successfully cheating the
system.
Analyzing Removed Spammers
Figure 4 compares whether crowdsourced judgments agree with estimated ground truth when we use a
maximum of three, four, five, or nine labels per question. The x-axis shows the threshold used for both
spam scores; on the left-hand side no spammers have yet been removed, whereas on the right-hand side
close to all spam has been replaced with other workers’ votes. The nine-label scenario gives an upper
bound for this dataset’s accuracy; the scenario doesn’t show improvement because we didn’t have enough
labels to replace detected spam. Using three labels per question, GetAnotherLabel generates random
estimates caused by too much spam. With four labels per question, spam removal becomes critical because
the GetAnotherLabel results are random when no spammers are removed. In Figure 4, around the threshold
of 5, enough spammers are removed for GetAnotherLabel to estimate relevance better than random;
accuracy increases further if we remove more spammers. Using five labels, GetAnotherLabel
gives better estimates with all spammers present than the four-label scenario with spammers removed.
Removing spam using five labels improves accuracy significantly (t-test p = 0.01) by 2.8 percent to
approximately the same accuracy that we obtain with nine labels.
Figure 4. Consensus between ground truth and GetAnotherLabel. We used three, four, five, and nine labels
per question after spam removal, using the threshold parameter on the x-axis for RandomSpam and
UniformSpam.
In the five-label scenario, we removed a maximum of 27.3 percent of the votes and 20.7 percent of the
workers in 81 iterations.
Estimating Correct Judgments
To assess how suitable the crowdsourced relevance judgments are for evaluating IR systems compared to
judgments obtained from expert annotators, we estimate judgment consistency. In datasets from previous
TRECs, researchers tracked down duplicate documents (different URLs showing the same content).17
Essentially, the same expert judged these documents twice. We use their reported numbers to estimate the
likelihood that an expert assessor gives a consistent judgment (Equation 3), reasoning that two binary
judgments for the same document agree when both judgments are consistent or both are inconsistent
(Equation 4). Based on this, we estimate that 90 percent of the relevance judgments from expert annotators
were judged consistently:
P_{\mathrm{pair\ consistent}} = \frac{24{,}327}{24{,}327 + 5{,}514}   (3)

P_{\mathrm{judgment\ consistent}}^{\,2} + \left( 1 - P_{\mathrm{judgment\ consistent}} \right)^2 = P_{\mathrm{pair\ consistent}}   (4)

P_{\mathrm{judgment\ consistent}} \approx 0.897   (5)
For the crowdsourced relevance judgments, we can estimate the likelihood of consistency by comparing
all pairs of judgments different workers give for the same query-document pair. In the five-label scenario
with spammers removed, the judgments used contain 13,588 binary consistent pairs and 7,452 binary
inconsistent pairs. Using the same formula, we estimate that 77 percent of the relevance judgments
obtained via crowdsourcing are consistent with other judgments.
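Both estimates follow from solving Equation 4 for the per-judgment consistency; the short computation below reproduces the two numbers.

```python
from math import sqrt

def judgment_consistency(consistent_pairs, inconsistent_pairs):
    """Solve Equation 4, p_pair = p**2 + (1 - p)**2, for the per-judgment
    consistency p, taking the root above 0.5."""
    p_pair = consistent_pairs / (consistent_pairs + inconsistent_pairs)
    return (1 + sqrt(2 * p_pair - 1)) / 2

print(judgment_consistency(24327, 5514))    # expert annotators: ~0.897 (Equation 5)
print(judgment_consistency(13588, 7452))    # crowd workers: ~0.77
```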
However, for the crowdsourcing task, we obtained five judgments per query-document pair. Per pair, we
can obtain ground truth by taking consensus using five judgments. We can estimate the likelihood that the
crowdsourced ground truth for a query-document pair is consistent with the ground truth for other
documents judged for the same query by assuming uniform worker quality and using majority voting.18
Using Equation 6, we estimate that 91.6 percent of the aggregated relevance judgments are consistent:
P_{\mathrm{qrel\ consistent}} = \binom{5}{0} P_{\mathrm{judgment\ consistent}}^{5} + \binom{5}{1} P_{\mathrm{judgment\ consistent}}^{4} \left( 1 - P_{\mathrm{judgment\ consistent}} \right) + \binom{5}{2} P_{\mathrm{judgment\ consistent}}^{3} \left( 1 - P_{\mathrm{judgment\ consistent}} \right)^2   (6)
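The computation below reproduces this estimate from Equation 6, counting the cases in which at most two of the five judgments are inconsistent.

```python
from math import comb

def aggregated_consistency(p, n_labels=5):
    """Equation 6: probability that a majority vote over n_labels judgments is
    consistent, i.e. that at most floor(n_labels / 2) judgments are inconsistent."""
    return sum(comb(n_labels, k) * p ** (n_labels - k) * (1 - p) ** k
               for k in range(n_labels // 2 + 1))

print(aggregated_consistency(0.77))    # ~0.916, the 91.6 percent reported above
```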
To verify that we don’t overestimate the number of consistent pairs by removing workers who least
agree, we report that the average binary interworker agreement after removing spam from our set is 65
percent, while the average over the judgments that all other TREC participants obtained is 70 percent.
Our results show that spam affects the outcome of algorithms such as GetAnotherLabel that depend on
estimated worker quality. Replacing spammers’ votes significantly improves accuracy and matches the
accuracy we get when we obtain 80 percent more judgments without removing spam.
Any spam-detection algorithm biases the result set by falsely rejecting workers that don’t agree with the
majority vote. This isn’t necessarily a problem for crowdsourced relevance judgments because we ask
workers to judge documents according to instructions rather than subjective opinion, specifically describing
what they should judge to be relevant. The spam-detection algorithms also tolerate small differences of
opinion on the ordinal scale. The spam scores are based on frequent disagreements within repeating voting
patterns and the selection of nonadjoining labels, neither of which is likely to occur with diligent workers.
We found that workers from the crowd make twice as many inconsistent judgments as expert annotators.
However, crowdsourcing’s strength lies in its low costs, making it feasible to obtain multiple judgments per
query-document pair and thus increase consistency by aggregating these judgments into one outcome. The
consistency of the aggregated crowdsourced ground truth over the data we obtained for TREC 2011 is
comparable to the consistency of ground truth obtained from expert annotators for previous TREC datasets.
The downside of obtaining relevance judgments via crowdsourcing might be the lack of personal
preference in the obtained relevance judgments. Fortunately, when evaluating IR systems, the subjective
nature of relevance judgments hardly influences the outcome.1 Whether this result still holds when multiple
individuals construct the ground truth for one information need requires further investigation.
Finally, we believe that for every spam-detection mechanism, spammers will develop countermeasures
to increase their chances of remaining undetected. Antispam mechanisms that spammers can detect, such as
monitoring user behavior,13 might be more vulnerable to countermeasures. Against our spam-detection
approach, spammers can increase their chances by organizing spamming so that several spammers use the
same decision rules — for example, voting “relevant” on all queries starting with a vowel. However,
countermeasures such as these aren’t easy to organize and can be counteracted when requesters detect
them, just like the defense we propose against uniform spamming.
References
1. E.M. Voorhees, “Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness,”
Information Processing and Management, vol. 36, Elsevier, 2000, pp. 697–716.
2. O. Alonso, D.E. Rose, and B. Stewart, “Crowdsourcing for Relevance Evaluation,” SIGIR Forum, vol. 42, no. 2,
2008, pp. 9–15.
3. P.G. Ipeirotis, F. Provost, and J. Wang, “Quality Management on Amazon Mechanical Turk,” Proc. ACM
SIGKDD Workshop Human Computation, ACM, 2010, pp. 64–67.
4. R. Blanco et al., “Repeatable and Reliable Search System Evaluation Using Crowdsourcing,” Proc. 34th Ann.
Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2011, pp. 923–932.
5. A. Kittur, E.H. Chi, and B. Suh, “Crowdsourcing User Studies with Mechanical Turk,” Proc. 26th Ann. SIGCHI
Conf. Human Factors in Computing Systems (CHI 08), ACM, 2008, pp. 453–456.
6. J.B.P. Vuurens, A.P. de Vries, and C. Eickhoff, “How Much Spam Can You Take? An Analysis of
Crowdsourcing Results to Increase Accuracy,” Proc. SIGIR Workshop Crowdsourcing for Information Retrieval
(CIR 11), ACM, 2011, pp. 21–26.
7. J. Le et al., “Ensuring Quality in Crowdsourced Search Relevance Evaluation,” Proc. SIGIR Workshop
Crowdsourcing for Search Evaluation (CSE 10), ACM, 2010, pp. 17–20.
8. G. Kazai, J. Kamps, and N. Milic-Frayling, “Worker Types and Personality Traits in Crowdsourcing Relevance
Labels,” Proc. 20th ACM Conf. Information and Knowledge Management (CIKM 11), ACM, 2011, pp. 1941–
1944.
9. D. Zhu and B. Carterette, “An Analysis of Assessor Behavior in Crowdsourced Preference Judgments,” Proc.
33rd ACM SIGIR Workshop Crowdsourcing for Search Evaluation (CSE 10), ACM, 2010, pp. 21–26.
10. C. Eickhoff and A.P. de Vries, “Increasing Cheat Robustness of Crowdsourcing Tasks,” Advances in Information
Retrieval, Springer, to appear, 2012.
11. G. Kazai, “In Search of Quality in Crowdsourcing for Search Engine Evaluation,” Advances in Information
Retrieval, Springer, 2011, pp. 165–176.
12. O. Dekel and O. Shamir, “Vox Populi: Collecting High-Quality Labels from a Crowd,” Proc. 22nd Ann. Conf.
Learning Theory (COLT 09), 2009.
13. J. Rzeszotarski and A. Kittur, “Instrumenting the Crowd: Using Implicit Behavioral Measures to Predict Task
Performance,” ACM Symp. User Interface Software and Technology, ACM, 2011, pp. 13–22.
14. J. Downs et al., “Are Your Participants Gaming the System? Screening Mechanical Turk Workers,” Proc. 28th
Int’l Conf. Human Factors in Computing Systems (CHI 10), ACM, 2010, pp. 2399–2402.
15. K. Krippendorff and M.A. Bock, The Content Analysis Reader, SAGE Publications, 2009.
16. M.A. Kouritzin et al., “On Detecting Fake Coin Flip Sequences,” IMS Collections – Markov Processes and
Related Topics: A Festschrift for Thomas G. Kurtz, vol. 4, IMS Collections, 2008, pp. 107–122.
17. F. Scholer, A. Turpin, and M. Sanderson, “Quantifying Test Collection Quality Based on the Consistency of
Relevance Judgments,” Proc. 34th Ann. ACM Special Interest Group on Information Retrieval, ACM, 2011, pp.
1063–1072.
18. V.S. Sheng, F. Provost, and P.G. Ipeirotis, “Get Another Label? Improving Data Quality and Data Mining Using
Multiple, Noisy Labelers,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD
08), ACM, 2008, pp. 614–622.