TY - GEN
T1 - On aggregating labels from multiple crowd workers to infer relevance of documents
AU - Hosseini, Mehdi
AU - Cox, Ingemar J.
AU - Milić-Frayling, Nataša
AU - Kazai, Gabriella
AU - Vinay, Vishwa
PY - 2012
AB - We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy, and use these labels to infer document relevance with two methods. The first is the commonly used majority voting (MV), which assigns each document the label that received the most votes, treating all workers equally. The second is a probabilistic model that concurrently estimates document relevance and worker accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy of the relevance assessments and their robustness to noisy labels. We also observe the effect of the derived relevance judgements on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of both the relevance assessments and the resulting IR system rankings. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
UR - http://www.scopus.com/inward/record.url?scp=84860206608&partnerID=8YFLogxK
DO - 10.1007/978-3-642-28997-2_16
M3 - Conference contribution
AN - SCOPUS:84860206608
SN - 9783642289965
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 182
EP - 194
BT - Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Proceedings
T2 - 34th European Conference on Information Retrieval, ECIR 2012
Y2 - 1 April 2012 through 5 April 2012
ER -