The Three Sides of CrowdTruth
Abstract
Crowdsourcing is often used to gather annotated data for training and evaluating computational systems that attempt to solve cognitive problems, such as understanding natural language sentences. Crowd workers are asked to perform semantic interpretation of sentences to establish a ground truth. This has always been done under the assumption that each task unit, e.g. each sentence, has a single correct interpretation that is contained in the ground truth. We have countered this assumption with CrowdTruth, and have shown that it can be better suited to tasks for which semantic interpretation is subjective. In this paper we investigate how worker metrics for detecting spam depend on the quality of the sentences in the dataset and on the quality of the target semantics. We show that worker quality metrics can improve significantly when the quality of these other aspects of semantic interpretation is considered.
Human Computation 1 (2014) 13
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).