The Three Sides of CrowdTruth


  • Lora Aroyo, VU University Amsterdam
  • Chris Welty





Crowdsourcing is often used to gather annotated data for training and evaluating computational systems that attempt to solve cognitive problems, such as understanding natural language sentences. Crowd workers are asked to perform semantic interpretation of sentences to establish a ground truth. This has always been done under the assumption that each task unit, e.g. each sentence, has a single correct interpretation that is contained in the ground truth. We have countered this assumption with CrowdTruth, and have shown that CrowdTruth can be better suited to tasks for which semantic interpretation is subjective. In this paper we investigate how worker metrics for detecting spam depend on the quality of the sentences in the dataset and on the quality of the target semantics. We show that worker quality metrics can improve significantly when the quality of these other aspects of semantic interpretation is considered.






How to Cite

Aroyo, L., & Welty, C. (2014). The Three Sides of CrowdTruth. Human Computation, 1(1).