An analysis of human factors and label accuracy in crowdsourcing relevance judgments

Gabriella Kazai*, Jaap Kamps, Natasa Milic-Frayling

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

110 Citations (Scopus)

Abstract

Crowdsourcing relevance judgments for the evaluation of search engines is used increasingly to overcome the issue of scalability that hinders traditional approaches relying on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals, the crowd, who are motivated by different incentives and complete the tasks with varying levels of attention and success. This increases the need for careful design of crowdsourcing tasks that attracts the right crowd for the given task and promotes quality work. In this paper, we describe a series of experiments using Amazon's Mechanical Turk, conducted to explore the 'human' characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task, and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured by agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest in and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments resulting in 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines to crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from a peculiarity of the crowdsourcing environment: the sample of individuals engaged under specific work conditions is inherently influenced by the conditions themselves.
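The abstract states that label quality is measured by agreement with a gold set of expert judgments. As a rough illustration of that kind of measurement (not the authors' actual analysis code), the sketch below computes per-worker label accuracy against a gold set; the data structures and names are assumptions made for the example.

```python
# Minimal sketch (not the authors' pipeline): label accuracy as agreement with a gold set.
# Hypothetical data: each record is (worker_id, doc_id, crowd_label);
# `gold` maps doc_id -> expert relevance label.

from collections import defaultdict

gold = {"d1": 1, "d2": 0, "d3": 1}   # hypothetical expert (gold) relevance labels

crowd_labels = [                      # hypothetical HIT assignment output
    ("w1", "d1", 1), ("w1", "d2", 1), ("w1", "d3", 1),
    ("w2", "d1", 1), ("w2", "d2", 0), ("w2", "d3", 0),
]

def accuracy_per_worker(labels, gold):
    """Fraction of each worker's labels that agree with the gold set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for worker, doc, label in labels:
        if doc in gold:               # only gold-judged documents count
            totals[worker] += 1
            hits[worker] += int(label == gold[doc])
    return {w: hits[w] / totals[w] for w in totals}

print(accuracy_per_worker(crowd_labels, gold))
# e.g. {'w1': 0.667, 'w2': 1.0}; such scores can then be related to pay, effort,
# qualifications, and the self-reported human factors described in the paper.
```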

Original language: English
Pages (from-to): 138-178
Number of pages: 41
Journal: Information Retrieval
Volume: 16
Issue number: 2
DOIs
Publication status: Published - Apr 2013
Externally published: Yes

Keywords

  • Crowdsourcing
  • Relevance judgments
  • Study of human factors
