Exploring the Use of Deep Learning with Crowdsourcing to Annotate Images
Keywords: Crowdsourcing, Computer Vision, Deep Learning, Human Machine Collaboration
Abstract: We investigate what, if any, benefits arise from employing hybrid algorithm-crowdsourcing approaches over conventional approaches of relying exclusively on algorithms or crowds to annotate images. We introduce a framework that enables users to investigate different hybrid workflows for three popular image analysis tasks: image classification, object detection, and image captioning. Three hybrid approaches are included that are based on having workers: (i) verify predicted labels, (ii) correct predicted labels, and (iii) annotate images for which algorithms have low confidence in their predictions. Deep learning algorithms are employed in these workflows since they offer high performance for image annotation tasks. Each workflow is evaluated with respect to annotation quality and worker time to completion on images coming from three diverse datasets (i.e., VOC, MSCOCO, VizWiz). Inspired by our findings, we offer recommendations regarding when and how to employ deep learning with crowdsourcing to achieve desired quality and efficiency for image annotation.
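As a rough illustration of hybrid approach (iii), the sketch below splits model predictions into those kept as-is and those sent to crowd workers for annotation. It is a minimal sketch under stated assumptions, not the framework described in the paper: the `Prediction` type, the `route_low_confidence` function, and the 0.5 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """One model output for one image (hypothetical structure)."""
    image_id: str
    label: str
    confidence: float  # model score in [0, 1]


def route_low_confidence(
    predictions: List[Prediction], threshold: float = 0.5
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (kept, sent_to_crowd).

    Predictions scoring at or above the threshold keep the algorithm's
    label; the rest are queued for human annotation. The threshold value
    is an assumption, not a value taken from the paper.
    """
    kept = [p for p in predictions if p.confidence >= threshold]
    to_crowd = [p for p in predictions if p.confidence < threshold]
    return kept, to_crowd
```

Raising the threshold in such a scheme would send more images to workers, trading worker time for annotation quality, which is the kind of trade-off the abstract says each workflow is evaluated on.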
Copyright (c) 2021 Human Computation
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).