A Crowdsourced Approach to Evaluating the Relevance of Digitized Primary Sources for Historians

Nai-Ching Wang, David Hicks, Paul Quigley, Kurt Luther


Historians spend significant time evaluating the relevance of primary sources that they encounter in digitized archives and through web searches. One reason this task is time-consuming is that historians’ research interests are often highly abstract and specialized. These topics are unlikely to be manually indexed and are difficult to identify with automated text analysis techniques. In this article, we investigate the potential of a new crowdsourcing model in which the historian delegates to a novice crowd the task of evaluating the relevance of primary sources with respect to her unique research interests. The model employs a novel crowd workflow, Read-Agree-Predict (RAP), that allows novice crowd workers to perform as well as expert historians. As a useful byproduct, RAP also reveals and prioritizes crowd confusions as targeted learning opportunities. We demonstrate the value of our model with two experiments with paid crowd workers (n=170), with the future goal of extending our work to classroom students and public history interventions. We also discuss broader implications for historical research and education.


Applications; Techniques; Algorithms





