
Obtaining High-Quality Relevance Judgments using Crowdsourcing

2013, IEEE Internet Computing

Crowdsourcing can be used to obtain the ground truth that is needed to evaluate information retrieval systems. However, the quality of crowdsourced ground truth may be questionable, because it is produced by workers of unknown quality, with possible spammers among them. This study presents a new approach that detects close to all spammers by comparing judgments between workers. We compared the consistency of crowdsourced ground truth to the consistency of ground truth obtained from expert annotators, and concluded that crowdsourcing can match the quality obtained from expert annotators.

This is the accepted manuscript for the IEEE Internet Computing September/October 2013 special issue on crowdsourcing. The (more beautiful) final version can be downloaded at IEEE.

Jeroen B.P. Vuurens, The Hague University of Applied Sciences
Arjen P. de Vries, Centrum Wiskunde & Informatica

The performance of information retrieval (IR) systems is commonly evaluated using a test set with known relevance. Crowdsourcing is one method for learning the relevant documents for each query in the test set. However, the quality of relevance judgments learned through crowdsourcing can be questionable, because it uses workers of unknown quality with possible spammers among them. To detect spammers, the authors' algorithm compares judgments between workers; they evaluate their approach by comparing the consistency of crowdsourced ground truth to that obtained from expert annotators and conclude that crowdsourcing can match the quality obtained from the latter.

Information retrieval (IR) systems can help us find necessary data and avoid information overload. To evaluate such systems' performance, we would ideally obtain a known ground truth for every document retrieved for a submitted query, that is, whether the document is relevant or nonrelevant. To establish ground truth, we must obtain relevance judgments, or users' subjective opinions regarding how relevant a document is to an information need they expressed in a query.1 Commonly, expert annotators manually judge relevance for the top-k ranked documents, a time-consuming and expensive process.2 Alternatively, we can obtain relevance judgments from anonymous Web users.2 Using crowdsourcing services such as Amazon's Mechanical Turk (MTurk; www.mturk.com/mturk/welcome) or CrowdFlower (http://crowdflower.com), we can obtain judgments from a large number of workers in a short amount of time, relatively inexpensively. By gathering several judgments per query-document pair and aggregating them using a consensus algorithm, we can establish an assumed ground truth for that pair.3

Using crowdsourcing to obtain relevance judgments presents challenges, however. Several studies found spam among the obtained crowdsourcing results, with some reporting as much as 50 percent spam.4–6 The random judgments spammers produce can cause inconsistencies in the obtained ground truth, increasing the variance of evaluation measures, possibly to a point at which the test set is useless. As a solution, we propose a new approach to detect spam in crowdsourced relevance judgments. We evaluate such judgments by comparing their consistency to that of judgments from expert annotators.
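As a minimal illustration of the consensus step just mentioned, the sketch below aggregates the judgments collected for one query-document pair by simple majority vote. This is only a stand-in for the EM-based GetAnotherLabel algorithm used later in the article; the example labels come from the five-point scale introduced in the Data section, and the function name is ours.

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate one query-document pair's judgments into an assumed ground truth.

    judgments: labels given by different workers for the same query-document pair.
    Returns the most frequent label (ties are broken arbitrarily).
    """
    label, _count = Counter(judgments).most_common(1)[0]
    return label

# Example: five workers judged the same query-document pair.
votes = ["partly relevant", "related", "partly relevant",
         "off topic/spam", "partly relevant"]
print(majority_vote(votes))   # -> "partly relevant"
```

Majority voting treats all workers as equally reliable, whereas GetAnotherLabel estimates each worker's latent quality before aggregating.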
Spam in the Crowdsourcing Environment

To develop our approach, we first examined several characteristics inherent to workers in crowdsourcing environments. We also looked at existing anticheating mechanisms to determine ways we might detect spammers.

Worker Taxonomy

To discover characteristics that would help us identify cheaters among honest workers, we studied different worker types. Diligent workers follow instructions and aim to produce meaningful results.6 Sloppy workers might have good intentions but deliver poor-quality judgments.7 This might result from workers having a different frame of reference or being less capable of performing the task as required; it can also be due to workers not following instructions or rushing their work. Such workers are likely to disagree with coworkers more often.8 Another study analyzed spammers' voting behavior and identified workers who purposely randomize their responses so that requesters will have trouble discovering their dishonesty.9 Such random spammers will likely have an interworker agreement that's close to random chance. Finally, uniform spammers use a fixed voting pattern. They seem interested neither in performing the task as intended nor in advanced cheating, but rather mostly repeat the same answer.9 Although we can easily detect the long repeating patterns such spammers produce using manual inspection, automated spam detection can overlook them: uniform voting patterns increase the chance that these workers will judge in accordance with other uniform spammers over many query-document pairs, or get more answers right on a skewed relevance distribution.

Anticheating Mechanisms

We can defend against spam in crowdsourcing environments using anticheating mechanisms, including administering qualification tests,7 hiding trick questions to see if workers are paying attention,9 evaluating workers on questions with a known ground truth,10 analyzing the time workers spend on tasks,11 removing low-quality workers,12 or identifying spammers based on monitored user behavior.13 Cheating strategies evolve as cheaters learn from being detected, so spam-detection mechanisms can become less effective, as with the time-on-task heuristic.14 A different approach discourages spammers by changing how users interact with a task. Tasks that require a worker to enter text or answer questions related to one another aren't as easy to spam as closed questions and thus attract fewer spammers.1,8,10 Even without antispam measures, the quality of crowdsourcing results can be improved using spam-resistant consensus algorithms such as GetAnotherLabel. This expectation-maximization algorithm estimates each worker's latent quality and uses the maximum likelihood that the worker is correct to estimate the most likely correct outcome per question.3 In general, existing anticheating mechanisms require additional effort from requesters to construct tasks, prepare tests that can detect cheaters, or process outcomes.

Quality Control

To improve average label (judgment) quality, we identify and remove spammers by comparing their judgments to those of all other workers for the same questions. We then replace spammers' judgments with new ones from other workers. We first obtain the required number of judgments for all query-document pairs, analyze the data, remove only the most likely spammer, and then replace all judgments the removed spammer submitted. We iterate these steps until we detect no more spammers. Finally, we aggregate the obtained judgments using GetAnotherLabel.3
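The quality-control loop described above can be sketched in a few lines of Python. This is a schematic rendering under our own assumptions about the data layout: spam_scores stands in for the RandomSpam/UniformSpam computation defined in the following sections, and collect_replacements stands in for gathering fresh judgments from the crowdsourcing platform.

```python
def remove_spammers(judgments, spam_scores, collect_replacements, threshold):
    """Iteratively remove the most likely spammer and replace their judgments.

    judgments: list of (worker, (query, doc), label) tuples
    spam_scores: function mapping the current judgments to {worker: score}
    collect_replacements: function that crowdsources fresh judgments for a set
        of query-document pairs and returns them as (worker, pair, label) tuples
    threshold: score above which the top-scoring worker is treated as a spammer
    """
    while True:
        scores = spam_scores(judgments)
        if not scores:
            break
        worker, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= threshold:
            break                                  # no remaining likely spammer
        removed_pairs = {pair for w, pair, _ in judgments if w == worker}
        judgments = [j for j in judgments if j[0] != worker]
        judgments += collect_replacements(removed_pairs)
    return judgments
```

The article removes the worker with the highest max(RandomSpam, UniformSpam) score per iteration and uses separate thresholds for the two scores (tuned to 1.3 and 1.6 later in the text); the single threshold here is a simplification.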
Relevance Judgment Consistency

Relevance judgments obtained via crowdsourcing can differ from expert annotators' judgments. Crowdsourcing tasks are commonly divided into small human-intelligence tasks (HITs), which lets individual workers decide how much work they're willing to do. Documents retrieved for the same query are thus likely to be divided across several workers for judgment, rather than given to a single expert annotator who would judge them all. If workers judge their share of documents for the same query with different subjective opinions, similar documents could receive inconsistent judgments.

To reduce these inconsistencies, workers assess relevance according to specific instructions rather than their own personal preferences, which could be key to reliable results.15 Having workers make assessments according to instructions also gives us a common ground for aggregating judgments.

Detecting Random Spammers

A good spam-detection mechanism should find nearly all spammers while minimizing falsely rejected diligent workers. To decrease such false positives, we need a measure that's reliable in noisy environments. Figure 1a shows that two diligent workers are significantly less likely to vote on nonadjoining labels (two labels that aren't adjacent on the ordinal scale) than are a spammer and a diligent worker (P = 0.27 versus P = 0.39, χ2-test p < 0.001; see Figure 1b). We use this feature to calculate a RandomSpam score for every worker w as the average squared ordinal distance ord_ij^2 between their judgments j ∈ J_w and the judgments by other workers on the same query-document pair, i ∈ J_{j,w}:

$$\mathit{RandomSpam}_w = \frac{\sum_{j \in J_w} \sum_{i \in J_{j,w}} \mathit{ord}_{ij}^{2}}{\sum_{j \in J_w} |J_{j,w}|} \qquad (1)$$

Note that we don't use judgments on the "empty/corrupt" label in this computation. To allow for differences of opinion between diligent workers, we set ord_ij = 0 if ord_ij < 2. To reduce false positives, we also set ord_ij = 0 when a worker disagrees with only one other worker, reasoning that when two out of five workers are cheating on a query-document pair, often one spammer is selecting a label nonadjoining to the other workers' labels.

Figure 1. Detecting random spammers. We looked at judgments from (a) accepted and (b) rejected workers and compared them with all accepted workers who judged the same label (solid), adjoining labels (shaded), and nonadjoining labels (white). We obtained these results for the Text Retrieval Conference crowdsourcing track.

Detecting Uniform Spammers

Uniform spammers choose one label or a simple pattern to uniformly spam, which helps them avoid detection via interworker comparison. In simulations, we found that if such spammers become abundant, they could affect consensus and further increase their chances of remaining undetected.6 We designed a separate algorithm to detect such spammers on the basis of a study for detecting fake coin-flip sequences.16 This algorithm counts the average squared number of disagreements uniform spammers have with other workers within repeating voting patterns:

$$\mathit{UniformSpam}_w = \frac{\sum_{s \in S} |s| \left(f_{s,J_w} - 1\right) \sum_{j \in J_{s,w}} \sum_{i \in J_{j,w}} \mathit{disagree}_{ij}^{2}}{\sum_{s \in S} \sum_{j \in J_{s,w}} |J_{j,w}|} \qquad (2)$$

where S is the collection of all possible label sequences s with length |s| = 2 or 3. We calculate the number of disagreements disagree_ij between the worker's judgments J_{s,w} that occur within label sequence s and the judgments J_{j,w} that other workers have submitted for the same query-document pair as judgment j. To reduce false positives, disagree_ij = 0 if disagree_ij < 2; f_{s,J_w} is the frequency at which label sequence s occurs within worker w's time-ordered judgments J_w.
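A sketch of the RandomSpam score in Equation 1, under our reading of the two false-positive rules above, is shown below. Labels are assumed to be mapped to ordinal positions with "empty/corrupt" judgments filtered out beforehand; the data layout and function name are ours.

```python
from collections import defaultdict

def random_spam_scores(judgments):
    """Return {worker: RandomSpam score} for (worker, pair, ordinal_label) tuples."""
    by_pair = defaultdict(list)                 # pair -> [(worker, ordinal label)]
    for worker, pair, label in judgments:
        by_pair[pair].append((worker, label))

    sq_dist = defaultdict(float)                # numerator of Equation 1 per worker
    comparisons = defaultdict(int)              # denominator of Equation 1 per worker
    for votes in by_pair.values():
        for worker, label in votes:
            others = [l for w, l in votes if w != worker]
            # Ordinal distances to the other workers on this pair; a distance
            # of 1 (adjoining labels) is tolerated as a difference of opinion.
            dists = [abs(label - l) for l in others]
            dists = [d if d >= 2 else 0 for d in dists]
            # Ignore a single nonadjoining disagreement on a pair, which is
            # often caused by the other worker being the spammer.
            if sum(1 for d in dists if d > 0) < 2:
                dists = [0] * len(dists)
            sq_dist[worker] += sum(d * d for d in dists)
            comparisons[worker] += len(others)

    return {w: sq_dist[w] / n for w, n in comparisons.items() if n}
```

The UniformSpam score in Equation 2 would be computed analogously, but restricted to disagreements that fall inside label sequences of length two or three that repeat within a worker's time-ordered judgments; we omit that sketch here.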
Tuning the Spam-Detection Parameters

In a pilot run, we obtained 1,320 judgments on questions with known ground truth from expert annotators. We split this population at a RandomSpam and UniformSpam score of 2.0 to stereotype each worker type's voting distribution with respect to ground truth. We seeded a simulation6 with these voting distributions, simulating 50 percent spammers and iteratively removing the worker with the highest max(RandomSpam, UniformSpam) score. Empirically, we obtained the best results by removing workers with a RandomSpam score higher than 1.3 and a UniformSpam score higher than 1.6. Using these parameters on the data obtained in the pilot run removes workers with an average binary agreement to ground truth of 46.4 percent, while accepted workers have an average binary agreement to ground truth of 63.6 percent. The simulation model estimates that with 50 percent spam among the workers, our algorithm used with these parameters will remove more than 98 percent of the spammers and less than 2 percent of the diligent workers. Simulating with a minimum of five judgments per worker instead of 10 gives comparable results on spam removal and obtained accuracy, but falsely rejects twice as many diligent workers.

Results

We tested our spam-detection algorithms using data that we gathered in a field test at the 2011 Text Retrieval Conference (TREC), which organized a crowdsourcing track to address the use of crowdsourcing to evaluate search engines.

Data

The TREC task was to investigate how to obtain high-quality relevance judgments from individual crowd workers. For this task, eight participants crowdsourced judgments to estimate relevance for a portion of 3,380 query-document pairs. We were assigned 2,165 pairs for 20 different queries, which were divided into 433 sets, each with one query and five documents. We used a simple HIT design that showed workers the query terms, a description containing clear instructions on what to judge as relevant, the title, and a rendered image of each document (see Figure 2). We asked workers to judge relevance on a five-point ordinal scale: totally relevant, partly relevant, related, off topic/spam, or empty/corrupt. We obtained the results in two batches, using one set of five query-document pairs in both batches, to create HITs that all contained 10 query-document pairs. Within six days, we obtained 20,840 judgments from 503 unique workers on MTurk. For every query-document pair, we initially collected five judgments, gathering more labels to replace detected spam; we used a different detection mechanism for TREC than the one described here. On average, we collected more than nine labels per query-document pair. We typically rejected one to three workers per iteration cycle, resulting in a limited number of HITs being available at the same time. Also, the system didn't let workers work on more than one HIT simultaneously or judge the same query-document pair twice. Fifty-eight percent of the workers worked only one HIT, perhaps partly due to these restrictions. We paid $0.05 plus 10 percent commission per accepted HIT. In a scenario using five labels per question, this would result in a total cost of $60.

Figure 2. Instructions used in the Text Retrieval Conference (TREC) 2011 crowdsourcing track. We asked workers to judge the relevance of documents to a query on an ordinal five-point scale. The instructions indicate under what conditions they should judge a document as being relevant.
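As a quick check on that cost figure, the arithmetic below reproduces it from the numbers given in this section; the reward and commission rate come from the text, everything else is straightforward bookkeeping.

```python
pairs = 2165                    # query-document pairs assigned to us
labels_per_pair = 5             # five-label scenario
judgments_needed = pairs * labels_per_pair
hits = judgments_needed / 10    # each HIT contained 10 query-document pairs
cost = hits * 0.05 * 1.10       # $0.05 per HIT plus 10 percent commission
print(round(cost, 2))           # -> 59.54, i.e., about $60
```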
For the study presented here, we drew judgments from the data we obtained for TREC in chronological order until all query-document pairs had the required number of judgments. In the rare event that the judgments needed for one set of five query-document pairs were only available in a submitted HIT together with another set of five query-document pairs that we didn't need, we used the whole HIT. In the absence of known ground truth, we estimated it using GetAnotherLabel on the 41,081 judgments the other seven TREC participants obtained for the same track.

Time per Judgment

We designed the HIT on MTurk using an external question tag that showed content hosted on our webserver and registered the time taken per judgment. Figure 3 shows the average time per question versus the question number, in the order that workers answered each question. Ninety-five percent of the HITs have a different query for the first set of five documents than for the second set (the remaining HITs have the same query for both sets). Figure 3a thus shows that accepted workers take more time on the first and sixth question of every HIT, namely to read a change in the query displayed on the screen. Two expected learning trends are visible: workers judge consecutive documents for the same query faster, and they become faster as they work more HITs. Analysis also shows that the total number of HITs performed doesn't influence the time per HIT on earlier HITs; all workers have similar learning curves.

Figure 3. Time per judgment over different worker types. We showed workers a query followed by five documents. A new query for the next set of five documents was displayed on the first and sixth question of every human intelligence task (HIT). To suppress the noise from users taking long breaks rather than actually working on their assignments, we used only judgments with a time per question < 90 seconds.

Unexpectedly, rejected workers have nearly identical time-per-judgment patterns to accepted workers on the first two to three HITs. That is, they use more time on the sixth question of each HIT, indicating that they pay attention to the information on the screen (see Figure 3b). After working three HITs, however, rejected workers no longer use more time on the sixth question, indicating that they're intentionally spamming (they no longer notice any change to the queries onscreen). However, on the first HITs, these workers individually show the same increase of time on the first and sixth questions, with learning curves similar to those of other workers. We calculated separate spam scores per HIT, which revealed that these workers spam right from the start. Although they aren't following instructions, they take more time to read new queries onscreen; thus, they might be looking to improve their chances of successfully cheating the system.
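The time-per-judgment curves in Figure 3 come down to a simple aggregation, sketched below. The record layout and function name are our own; the only detail taken from the text is the 90-second cutoff used to discard judgments that likely include a break.

```python
from collections import defaultdict

def avg_time_per_question(log, cutoff=90.0):
    """Average judging time per question position, per worker group.

    log: iterable of (worker_group, question_number, seconds) records, where
    question_number counts the questions a worker answered, in order, and
    worker_group is, for example, "accepted" or "rejected".
    Judgments taking 90 seconds or more are discarded as likely breaks.
    """
    totals = defaultdict(lambda: [0.0, 0])      # (group, question#) -> [sum, count]
    for group, question, seconds in log:
        if seconds < cutoff:
            cell = totals[(group, question)]
            cell[0] += seconds
            cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```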
Analyzing Removed Spammers

Figure 4 compares whether crowdsourced judgments agree with estimated ground truth when we use a maximum of three, four, five, or nine labels per question. The x-axis shows the threshold used for both spam scores; on the left-hand side no spammers have yet been removed, whereas on the right-hand side close to all spam has been replaced with other workers' votes. The nine-label scenario gives an upper bound for this dataset's accuracy; that scenario doesn't show improvement because we didn't have enough labels to replace detected spam. Using three labels per question, GetAnotherLabel generates random estimates caused by too much spam. With four labels per question, spam removal becomes critical, because the GetAnotherLabel results are random when no spammers are removed. In Figure 4, around a threshold of 5, enough spammers are removed for GetAnotherLabel to estimate better than random relevance; accuracy increases further if we remove more spammers. Using five labels, GetAnotherLabel gives better estimates with all spammers present than the four-label scenario gives with spammers removed. Removing spam using five labels improves accuracy significantly (t-test p = 0.01) by 2.8 percent, to approximately the same accuracy that we obtain with nine labels.

Figure 4. Consensus between ground truth and GetAnotherLabel. We used three, four, five, and nine labels per question after spam removal, using the threshold parameter on the x-axis for RandomSpam and UniformSpam. In the five-label scenario, we removed a maximum of 27.3 percent of the votes and 20.7 percent of the workers in 81 iterations.

Estimating Correct Judgments

To assess how suitable the crowdsourced relevance judgments are for evaluating IR systems compared to judgments obtained from expert annotators, we estimate judgment consistency. In datasets from previous TRECs, researchers tracked down duplicate documents (different URLs showing the same content).17 Essentially, the same expert judged these documents twice. We use their reported numbers to estimate the likelihood that an expert assessor gives a consistent judgment (Equation 3), reasoning that two binary judgments for the same document are the same when both judgments are either consistent or inconsistent (Equation 4). Based on this, we estimate that 90 percent of the relevance judgments from expert annotators were judged consistently:

$$P_{\mathit{pair\ consistent}} = \frac{24{,}327}{24{,}327 + 5{,}514} \qquad (3)$$

$$P_{\mathit{judgment\ consistent}}^{2} + \left(1 - P_{\mathit{judgment\ consistent}}\right)^{2} = P_{\mathit{pair\ consistent}} \qquad (4)$$

$$P_{\mathit{judgment\ consistent}} \approx 0.897 \qquad (5)$$

For the crowdsourced relevance judgments, we can estimate the likelihood of consistency by comparing all pairs of judgments that different workers give for the same query-document pair. In the five-label scenario with spammers removed, the used judgments contain 13,588 binary consistent pairs and 7,452 binary inconsistent pairs. Using the same formula, we estimate that 77 percent of the relevance judgments obtained via crowdsourcing are consistent with other judgments. However, for the crowdsourcing task, we obtained five judgments per query-document pair, so per pair we can obtain ground truth by taking the consensus of five judgments. We can estimate the likelihood that the crowdsourced ground truth for a query-document pair is consistent with the ground truth for other documents judged for the same query by assuming uniform worker quality and using majority voting.18 Using Equation 6, we estimate that 91.6 percent of the aggregated relevance judgments are consistent:

$$P_{\mathit{qrel\ consistent}} = \binom{5}{0} P_{\mathit{judgment\ consistent}}^{5} + \binom{5}{1} P_{\mathit{judgment\ consistent}}^{4} \left(1 - P_{\mathit{judgment\ consistent}}\right) + \binom{5}{2} P_{\mathit{judgment\ consistent}}^{3} \left(1 - P_{\mathit{judgment\ consistent}}\right)^{2} \qquad (6)$$

To verify that we don't overestimate the number of consistent pairs by removing the workers who least agree, we note that the average binary interworker agreement after removing spam from our set is 65 percent, while the average over the judgments that all other TREC participants obtained is 70 percent.
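The consistency estimates above can be reproduced numerically. The short script below, with function names of our own choosing, recovers the 0.897, 0.77, and 91.6 percent figures from the pair counts given in the text: Equation 4 is solved in closed form, and Equation 6 sums over consensus outcomes with at most two inconsistent judgments out of five.

```python
from math import comb, sqrt

def judgment_consistency(consistent_pairs, inconsistent_pairs):
    """Solve p^2 + (1 - p)^2 = P_pair (Equation 4) for the per-judgment consistency p."""
    p_pair = consistent_pairs / (consistent_pairs + inconsistent_pairs)
    return (1 + sqrt(2 * p_pair - 1)) / 2    # take the root above 0.5

def consensus_consistency(p, n=5):
    """Probability that a majority of n judgments is consistent (Equation 6)."""
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(n // 2 + 1))

print(round(judgment_consistency(24_327, 5_514), 3))    # experts: ~0.897
p_crowd = judgment_consistency(13_588, 7_452)
print(round(p_crowd, 2))                                # crowd workers: ~0.77
print(round(consensus_consistency(p_crowd), 3))         # aggregated qrels: ~0.916
```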
Our results show that spam affects the outcome of algorithms such as GetAnotherLabel that depend on estimated worker quality. Replacing spammers' votes significantly improves accuracy and matches the accuracy we get when we obtain 80 percent more judgments without removing spam. Any spam-detection algorithm biases the result set by falsely rejecting workers who don't agree with the majority vote. This isn't necessarily a problem for crowdsourced relevance judgments, because we ask workers to judge documents according to instructions rather than subjective opinion, specifically describing what they should judge to be relevant. The spam-detection algorithms also tolerate small differences of opinion on the ordinal scale. The spam scores are based on frequent disagreements within repeating voting patterns and on the selection of nonadjoining labels, neither of which is likely to occur with diligent workers.

We found that workers from the crowd make twice as many inconsistent judgments as expert annotators. However, crowdsourcing's strength lies in its low costs, making it feasible to obtain multiple judgments per query-document pair and thus increase consistency by aggregating these judgments into one outcome. The consistency of the aggregated crowdsourced ground truth over the data we obtained for TREC 2011 is comparable to the consistency of ground truth obtained from expert annotators for previous TREC datasets. The downside of obtaining relevance judgments via crowdsourcing might be the lack of personal preference in the obtained relevance judgments. Fortunately, when evaluating IR systems, the subjective nature of relevance judgments hardly influences the outcome.1 Whether this result still holds when multiple individuals construct the ground truth for one information need requires further investigation.

Finally, we believe that for every spam-detection mechanism, spammers will develop countermeasures to increase their chances of remaining undetected. Detectable antispam mechanisms, for instance those that monitor user behavior,13 might be more vulnerable to countermeasures. Against our spam-detection approach, spammers can increase their chances by organizing their spamming so that several spammers use the same decision rules, for example voting "relevant" on all queries starting with a vowel. However, countermeasures such as these aren't easy to organize and can be counteracted when requesters detect them, just like the defense we propose against uniform spamming.

References

1. E.M. Voorhees, "Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness," Information Processing and Management, vol. 36, Elsevier, 2000, pp. 697–716.
2. O. Alonso, D.E. Rose, and B. Stewart, "Crowdsourcing for Relevance Evaluation," SIGIR Forum, vol. 42, no. 2, 2008, pp. 9–15.
3. P.G. Ipeirotis, F. Provost, and J. Wang, "Quality Management on Amazon Mechanical Turk," Proc. ACM SIGKDD Workshop Human Computation, ACM, 2010, pp. 64–67.
4. R. Blanco et al., "Repeatable and Reliable Search System Evaluation Using Crowdsourcing," Proc. 34th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2011, pp. 923–932.
5. A. Kittur, E.H. Chi, and B. Suh, "Crowdsourcing User Studies with Mechanical Turk," Proc. 26th Ann. SIGCHI Conf. Human Factors in Computing Systems (CHI 08), ACM, 2008, pp. 453–456.
6. J.B.P. Vuurens, A.P. de Vries, and C. Eickhoff, "How Much Spam Can You Take? An Analysis of Crowdsourcing Results to Increase Accuracy," Proc. SIGIR Workshop Crowdsourcing for Information Retrieval (CIR 11), ACM, 2011, pp. 21–26.
7. J. Le et al., "Ensuring Quality in Crowdsourced Search Relevance Evaluation," Proc. SIGIR Workshop Crowdsourcing for Search Evaluation (CSE 10), ACM, 2010, pp. 17–20.
8. G. Kazai, J. Kamps, and N. Milic-Frayling, "Worker Types and Personality Traits in Crowdsourcing Relevance Labels," Proc. 20th ACM Conf. Information and Knowledge Management (CIKM 11), ACM, 2011, pp. 1941–1944.
9. D. Zhu and B. Carterette, "An Analysis of Assessor Behavior in Crowdsourced Preference Judgments," Proc. SIGIR Workshop Crowdsourcing for Search Evaluation (CSE 10), ACM, 2010, pp. 21–26.
10. C. Eickhoff and A.P. de Vries, "Increasing Cheat Robustness of Crowdsourcing Tasks," Advances in Information Retrieval, Springer, to appear, 2012.
11. G. Kazai, "In Search of Quality in Crowdsourcing for Search Engine Evaluation," Advances in Information Retrieval, Springer, 2011, pp. 165–176.
12. O. Dekel and O. Shamir, "Vox Populi: Collecting High-Quality Labels from a Crowd," Proc. 22nd Ann. Conf. Learning Theory (COLT 09), 2009.
13. J. Rzeszotarski and A. Kittur, "Instrumenting the Crowd: Using Implicit Behavioral Measures to Predict Task Performance," Proc. ACM Symp. User Interface Software and Technology (UIST 11), ACM, 2011, pp. 13–22.
14. J. Downs et al., "Are Your Participants Gaming the System? Screening Mechanical Turk Workers," Proc. 28th Int'l Conf. Human Factors in Computing Systems (CHI 10), ACM, 2010, pp. 2399–2402.
15. K. Krippendorff and M.A. Bock, The Content Analysis Reader, SAGE Publications, 2009.
16. M.A. Kouritzin et al., "On Detecting Fake Coin Flip Sequences," Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz, IMS Collections, vol. 4, 2008, pp. 107–122.
17. F. Scholer, A. Turpin, and M. Sanderson, "Quantifying Test Collection Quality Based on the Consistency of Relevance Judgments," Proc. 34th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2011, pp. 1063–1072.
18. V.S. Sheng, F. Provost, and P.G. Ipeirotis, "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 08), ACM, 2008, pp. 614–622.