A Survey of Crowdsourcing in Medical Image Analysis

Rapid advances in image processing capabilities have been seen across many domains, fostered by the application of machine learning algorithms to “big-data”. However, within the realm of medical image analysis, advances have been curtailed, in part, due to the limited availability of large-scale, well-annotated datasets. One of the main reasons for this is the high cost often associated with producing large amounts of high-quality meta-data. Recently, there has been growing interest in the application of crowdsourcing for this purpose; a technique that has proven effective for creating large-scale datasets across a range of disciplines, from computer vision to astrophysics. Despite the growing popularity of this approach, there has not yet been a comprehensive literature review to provide guidance to researchers considering using crowdsourcing methodologies in their own medical imaging analysis. In this survey, we review studies applying crowdsourcing to the analysis of medical images, published prior to July 2018. We identify common approaches, challenges and considerations, providing guidance of utility to researchers adopting this approach. Finally, we discuss future opportunities for development within this emerging domain.


INTRODUCTION
The limited availability and size of labeled datasets for training machine learning algorithms is a common problem in medical image analysis (Greenspan et al., 2016; Litjens et al., 2017). In several other fields, crowdsourcing, defined as the outsourcing of tasks to a crowd of individuals (Howe, 2006), has been found effective for labeling large quantities of data. For example, in computer vision crowdsourcing has been used to annotate large datasets of images and videos with various tags (Kovashka et al., 2016), and online citizen science via platforms such as the Zooniverse has become well established across a number of academic domains including astronomy (Lintott et al., 2008), meteorology (Knapp et al., 2016) and ecology (Willi et al., 2019).
Due to the success of crowdsourcing, several researchers have recently applied these techniques to the annotation of medical images. Although such images present specific challenges, including the crowd's lack of medical expertise, several early papers such as (Mitry et al., 2013; Mavandadi et al., 2012; Maier-Hein et al., 2014a) have demonstrated promising results. Despite the growing interest, there has not been an overview of the work in this field. In this paper, we summarize existing literature on crowdsourcing in medical imaging. This paper originated during the Lorentz workshop "Crowdsourcing in medical image analysis" in June 2018. As participants of the workshop, we searched Google Scholar with the query "crowdsourcing AND (medical or biomedical)" and screened the results for papers focusing on the topic. Google Scholar was selected because previous papers have highlighted the poor indexing of the topic in databases and the high prevalence of crowdsourcing papers in conferences (Wazny, 2017). Additional papers were identified for inclusion by all authors, by examining the references and citations of selected papers. We did not focus exclusively on journal papers, but also included preprints and abstracts, as we realize that some studies may not have been prepared as scientific articles. However, such papers are more difficult to find, and thus there is still a degree of publication bias in our selection. Furthermore, we did not update our search query to include other relevant terms, such as "citizen science", which might have expanded the set of included papers.
We only included papers where the crowd was involved in the analysis of medical or biomedical images, for example by annotating them. Our search strategy resulted in 57 papers. Key terms emerging from these studies are illustrated in Fig. 1. Five key dimensions were identified for discussion: the application involved, the type of interaction between the crowd and the images, the scale of the task (such as the number of images), the type of evaluation performed on the crowd annotations, and the results of the evaluation.
There are a number of surveys related to this work. However, they are quite different in scope:
- Ranard et al. (2014) survey crowdsourcing in health and medical research. They identify four tasks (problem solving, data processing, monitoring, and surveying) and cover 21 papers published until March 2013. In contrast, we only focus on papers where image analysis (i.e. data processing) is involved.
- Kovashka et al. (2016) survey crowdsourcing in computer vision. The surveyed papers focus on analysis of everyday/natural images. Only one of the 195 referenced papers (Gurari et al., 2015a) uses biomedical data.
- Wazny (2017) presents a meta-review of crowdsourcing from 2006 to 2016. Similar to Ranard et al. (2014), it takes a broader view of crowdsourcing. It reviews 48 existing review papers until August 2015, focusing on how each review categorizes the papers, for example by platform, size of crowd, and so forth.
- Alialy et al. (2018) is most similar to our survey, but only focuses on crowdsourcing in human pathology. They do a systematic search with several steps, excluding conference papers or abstracts, and summarize seven papers. The coverage of literature is therefore much more limited than in this work.
The paper is organized as follows. In Sections 2 to 6 we summarize the reviewed papers according to the five dimensions we identified and in Section 7 we discuss overall trends, limitations, and opportunities for future research. A condensed overview of all papers and their properties according to the key dimensions is provided in Table 9 in the Appendix.

APPLICATIONS
There are a variety of crowdsourcing applications addressed in the surveyed papers. We group these applications by the type of task performed by the crowd, the biomedical content of the image, and the dimensionality of the images. Table 1 summarizes the different task types of the surveyed papers. An important task in medical image analysis is classification, and 42% of the surveyed papers focus on this task. Classification can refer to assigning a label to an entire image, such as diagnosing whether a chest CT image contains any abnormalities. Classification can also refer to assigning a label to a part of the image, for example, the type of abnormality located in a particular region of interest. Other types of labels include non-diagnostic labels such as image modality (de Herrera et al., 2014), visual attributes, and assessing the quality of the image (Keshavan et al., 2018). These three types of labels are based more on visual characteristics, and thus might be easier to provide than diagnostic labels without any medical training.

Type of task
A further 39% of the papers focus on localization or segmentation. Typically the goal is to delineate the boundary of an entire healthy structure, or of an abnormality such as a lesion. The difference with how we define the classification task above is that instead of providing information about the image, the annotator has to modify the image, by providing positions or outlines. These tasks rely more on visual characteristics than classification tasks, and may be more easily explained to a nonexpert crowd.
In 12% of the papers both classification and segmentation are addressed. Often this means that the annotator first has to indicate if the structure of interest is visible, and if yes, to locate it in the image.
Finally, 7% of papers request less standard tasks from their crowd. For example, Maier-Hein et al. (2015) focus on determining correspondence between pairs of images. Although this is a type of detection task, where the annotator has to locate points of interest in an image, it is also different since a point of reference is already provided. Another example is Ørting et al. (2017), where the annotator has to decide which image is more similar to a reference image. This is a type of classification problem, but again relying more on visual features than on prior knowledge.

Type of image
Medical images are acquired at vastly different scales and locations depending on the physiological measurement of interest. The imaging acquisition modality and strategy depends heavily on the scale of the anatomy of interest, and different technologies' expected contrast with surrounding tissues. Here we categorize the images by the type of structure that is being imaged, which narrows down the modality. We use the following categorization, also used in two recent surveys of medical
Many of the papers in this survey are aimed at 2D images. The most common application is histopathology/microscopy with 28% of all the papers, followed by retinal images with 14% of the papers. Both applications are over-represented compared to Litjens et al. (2017) and . This overrepresentation in crowdsourcing studies may be because many retinal and microscopic images are acquired in 2D, which might be easier to use in a crowdsourcing study than 3D images.
Breast and heart images, which were already not well represented in the other two surveys, are almost absent in crowdsourcing studies. Both applications can be aimed at 2D or 3D images. However, perhaps due to lack of datasets or perceived difficulty of assessing these images, these applications are almost never considered for crowdsourcing.
Several other papers address applications where images are often 3D, such as the brain (9%) and the lungs (9%). Compared to Litjens et al. (2017) and , brain and lung images are underrepresented in crowdsourcing. This could be due to complexity of images or limitations in interfaces. One approach for dealing with 3D images is to select 2D parts of the original 3D images. For example, Ørting et al. (2017) and O'Neil et al. (2017) select axial slices. Cheplygina et al. (2016) show patches of 2D projections in various directions in the image. Others circumvent the
Table 2. Application Domains.
The last type of data that is addressed is video, common for endoscopy and colonoscopy (both in the abdomen category). Several different approaches are used for presenting video data: 2D frames (Maier-Hein et al., 2014b, 2015; Heim, 2018; Roethlingshoefer et al., 2017), 3D renderings (McKenna et al., 2012), short video clips (Park et al., 2017), or longer videos that can be paused and annotated.
Other applications of crowdsourcing include segmenting hip joints in 2D MRI (Chávez-Aragón et al., 2013), rating visual characteristics of dermatological images and assessing surgical performance (Malpani et al., 2015; Holst et al., 2015). Two papers (Foncubierta Rodríguez and Müller, 2012; de Herrera et al., 2014) look at multiple applications, where the task is classifying image modality, rather than segmentation or diagnosis. A few papers address segmentation in multiple modalities: Gurari et al. (2016) focus on both natural and biomedical images, while Lejeune et al. (2017) address segmentation across four medical applications.

Data availability
In addition to categorizing the types of applications, we also examined whether the datasets used in these studies were publicly available. Out of 57 papers, at least 22 used at least one publicly available dataset. We only considered datasets as public when they were clearly identifiable as such, for example when the paper described the dataset as openly available and contained a reference to a publication about the dataset, and/or a dataset website.

INTERACTION
An important aspect of crowdsourcing medical image annotations is task design. The interplay between the type of image data, the type of annotations that are needed and the available tools for
Rating entire images was the most common interaction and was the main task of 52% of the studies surveyed here. Ratings took many forms, identifying the presence/absence of specific visual features (Sonabend et al., 2017), counting number of cells (Smittenaar et al., 2018), assessing intensity of cell staining (dos Reis et al., 2015), or discriminating healthy samples from diseased (Mavandadi et al., 2012). Most commonly, crowd workers were asked to create new annotations (90% of rating tasks). Less commonly, crowd workers were asked to validate pre-existing annotations (14%). One study involved both validating pre-existing annotations and creating new ones (Heim, 2018), so the percentages do not sum to 100%. Existing annotations were the output of automated methods (Roethlingshoefer et al., 2017;Ganz et al., 2017;Gur et al., 2017) half of the time, and the crowdsourced annotations were used to identify instances with errors to be corrected.
Drawing a shape was the second most common task, comprising 38% of studies. Here crowd workers were asked to draw bounding boxes or segment outlines of structures of interest. Sometimes, this was only after identifying if a structure was present in the image or not (Heim, 2018). Similar to rating images, drawing shapes was used as an interaction for both creating new annotations (90% of drawing tasks) and validating existing annotations (14%). In the case of evaluating existing annotations, drawing was used as a means to indicate the location of errors in segmentations produced by automated methods (Roethlingshoefer et al., 2017;Ganz et al., 2017).
Clicking on specific locations was the third most used interaction, occurring in 25% of studies. Clicking was only used to create new annotations such as identifying the precise location of specific cells, abnormalities, or artifacts within an image. The use of multiple clicks to outline a structure was considered a "drawing a shape" interaction. Selecting points was also used in pairs of video frames to determine the stereotactic correspondence of two video streams for follow-up 3D reconstruction (Maier-Hein et al., 2014b, 2015).
Comparing two or more images was the least used interaction, occurring in only 5 (9%) of the studies. In all cases, comparisons were used to create new annotations, such as marking corresponding points in two consecutive video frames (Maier-Hein et al., 2015) or choosing which of two images was more similar to a target image (Ørting et al., 2017).
Overall, crowds were more often used to create new annotations, than to make judgments on existing annotations, which was done only in (Roethlingshoefer et al., 2017;Foncubierta Rodríguez and Müller, 2012;Ganz et al., 2017;Gur et al., 2017;de Herrera et al., 2014). Ratings and drawing of shapes can be used to obtain more detailed annotations than information already present in datasets.
Clicking interactions are sometimes used to identify specific image features, but more commonly used to create bounding boxes or draw object boundaries. Evaluating existing annotations is always done with rating (Foncubierta Rodríguez and Müller, 2012;de Herrera et al., 2014) or drawing (Roethlingshoefer et al., 2017;Ganz et al., 2017) interactions. Different types of annotations were often collected based on the type of task information being sought, e.g., clicking to obtain counts and locations of specific image features (Cabrera-Bean et al., 2017; Della Mea et al., 2014) and drawing to determine segmentation boundaries (Gurari et al., 2015b;Chávez-Aragón et al., 2013). Ratings were a more general type of interaction and could be used to classify for whole-image level discrimination of image feature presence or category (Keshavan et al., 2018;de Herrera et al., 2014;McKenna et al., 2012) or to obtain estimates of feature certainty (dos Reis et al., 2015).
Rating and drawing interactions for existing annotations are usually chosen to speed up the annotation process, as verifying and correcting existing annotations is faster for crowd workers than annotating an image from scratch (Roethlingshoefer et al., 2017). Similarly, rating with predefined categories is preferred instead of free text input, which is prone to spelling mistakes and misunderstanding of the annotation task (Foncubierta Rodríguez and Müller, 2012). Timmermans et al. (2016) chose custom drawing instead of the most commonly used drawing of independent polygons to better identify the shapes of interest.

Crowdsourcing platforms
A potentially important factor that varies across the surveyed papers is the choice of platform for conducting crowdsourcing experiments. We classify the platforms into six categories: paid commercial marketplaces such as Amazon Mechanical Turk and FigureEight (formerly known as CrowdFlower and acquired by Appen in 2019), volunteer platforms such as Zooniverse and Volunteer Science, custom recruitment/platforms, lab participants, experts, and simulation or no experiment at all. The most common choice is a commercial platform (55%). The second most common choice is a custom platform (23%) followed by a volunteer platform (10%). The remaining 8% were almost equally divided into the other categories with around 5% of all papers reporting prototypes or simulation studies. Around half of the papers we reviewed, namely 25 papers, motivate the choice of the platform and name some of their advantages and disadvantages.
The main reasons for choosing a paid commercial marketplace such as Amazon Mechanical Turk and FigureEight are both explicitly and implicitly mentioned: to be able to reach out to a large and diverse crowd (Boorboor et al., 2018; Brady et al., 2014; Bruggemann et al., 2018; Gurari et al., 2016; Irshad et al., 2015, 2017; Foncubierta Rodríguez and Müller, 2012); to be cost-efficient (Boorboor et al., 2018; Gurari et al., 2016; Irshad et al., 2015, 2017); to be time-efficient (Brady et al., 2014; de Herrera et al., 2014; Irshad et al., 2015, 2017); and to make use of in-place quality control mechanisms (Della Mea et al., 2014; Nguyen et al., 2012; Irshad et al., 2015, 2017). Amazon Mechanical Turk is preferred over other paid platforms because it allows requesters to add custom-built annotation interfaces (Cheplygina et al., 2016; Heim, 2018; Maier-Hein et al., 2015) and test their HITs in a sandbox (Heim, 2018). Similarly, researchers prefer the FigureEight platform because it is available in Europe (Della Mea et al., 2014) and because it provides an internal interface where they can set up the annotation task and distribute it to their network, without having to pay for the annotations (de Herrera et al., 2014; Foncubierta Rodríguez and Müller, 2012).
Volunteer or custom recruitment platforms are usually chosen to mitigate the disadvantages of paid commercial marketplaces. Such platforms enable users to build custom, lightweight applications, fine-tune interfaces and manipulate images for better annotations (Leifman et al., 2015;Albarqouni et al., 2016a;Heller et al., 2017;Keshavan et al., 2018;dos Reis et al., 2015). Furthermore, custom platforms allow requesters to mitigate privacy issues, as data remains in secure, centralized clinical repositories at any time (Gur et al., 2017;Heller et al., 2017). Sullivan et al. (2018) and Huang and Hamarneh (2017) chose to use a custom platform to be able to integrate gamification aspects that proved helpful to reduce the time needed for annotation, to motivate players and maintain them for longer annotation campaigns. One other major advantage of volunteer or custom platforms is the fact that they can reach out to contributors that are interested in supporting medical imaging research or science in general (Rajchl et al., 2016;dos Reis et al., 2015). Furthermore, such applications are usually much simplified, can be played at any time and on any device (Keshavan et al., 2018;dos Reis et al., 2015).
The surveyed papers also report several disadvantages of paid commercial marketplaces. Brady et al. (2014) mention that built-in qualifications in Amazon Mechanical Turk such as "Photo Moderation Master" are not useful for the medical imaging analysis task conducted in the paper, as there were only a few annotators available. Furthermore, this increased both the cost and the time to complete the experiments (Brady et al., 2014; McKenna et al., 2012). Another disadvantage of Amazon Mechanical Turk mentioned by Brady et al. (2017) is the fact that workers could potentially use automated scripts to accept or reserve large amounts of HITs at a time. As a consequence, the time needed to complete each HIT cannot be computed reliably anymore. Cheplygina et al. (2016) note that the integration of custom-built annotation interfaces in Amazon Mechanical Turk, although useful, is costly and time-consuming for novice users of the platform. Mitry et al. (2016) mention that the integrated annotation tool in Amazon Mechanical Turk only allowed rectangles to be drawn, which could affect users' ability to capture more irregular regions of interest. Another major disadvantage of Amazon Mechanical Turk is the fact that the platform was not always available for requesters outside the US (Maier-Hein et al., 2015). Regarding FigureEight, the major disadvantages are the difficulty of setting up gold questions for annotation tasks that involve drawing and segmentation (Della Mea et al., 2014) and insufficient settings to control the number of annotations that should be performed for each image (Albarqouni et al., 2016a).
While volunteer and custom platforms could attract a large pool of participants due to their advertisement campaigns, large drop-off in user participation can be seen over time (Smittenaar et al., 2018;dos Reis et al., 2015). In one case, advertisement campaigns only attracted neuroscientists from the social network of the requesters thus limiting the diversity of the annotators (Keshavan et al., 2018). Furthermore, many participants seem to annotate only a few samples (dos Reis et al., 2015).

Scale
We summarize the scale of the crowdsourcing experiments in terms of number of images annotated, number of images per task, and number of annotations per image.

Number of images:
We classify the number of images into four categories: very small (less than 10 annotated images), small (10 to 100 annotated images), medium (100 to 1000 annotated images) and large (more than 1000 annotated images). Column #I in Table 9 shows an overview of the exact number of images annotated in each paper included in the review. The large majority of reviewed papers (70%) report small and medium scale experiments, while a smaller part report large experiments (22%) or very small experiments (5%). However, in around 3% of the reviewed papers, the scale of the experiments is not reported.

Number of images per task:
In total, 25 out of the 57 papers included in this survey report on the number of images per task: 25% use one image per task, 7% use five images per task, 5% use ten images per task, while the other 7% use between 3 and 84 images per task. Irshad et al. (2015) and Irshad et al. (2017) also mention that out of the five images in the task, four images are actually unlabeled data, while one image is a gold question.

Number of annotations per image:
We divide the number of annotations per image into two categories: a single annotator per image (5%) or multiple annotators per image (63%). Surprisingly, for 33% of surveyed papers the number of annotations per image is not reported nor can it be inferred. Column #Ann/I in Table 9 shows an overview of the exact number of annotations per image, as reported in each paper reviewed.
Overall, the experiments using a single annotator per image involve either simulations or locally recruited, volunteer-based annotators that are not remunerated. The number of annotators per image for experiments using multiple annotators per image ranges from 2 to 5000. However, the majority (66%) of these experiments use between 5 and 25 annotators per image.

Annotators Wage
We classify the wage given to annotators into six different categories: a few dollars per hour, less than or equal to $0.10 per annotation, more than $0.10 per annotation, volunteers (no monetary incentive), not specified (if we have no information about compensation) and none (if no actual experiment or recruitment took place).
More than a third (34%) of papers did not specify anything about wage. In 34% of papers the wage was less than or equal to $0.10, in 24% of papers crowds were volunteers with no monetary incentive, in 5% of papers the wage was more than $0.10, and in 3% of papers the wage was an hourly payment of a few dollars per hour.
Overall, very few papers, mainly those that mention an hourly payment, considered crowd worker wages in relation to minimum wage rules and regulations.

EVALUATION
In this section we describe how the crowdsourced annotations are evaluated. This is done via two strategies: ensuring sufficient quality of annotations by preprocessing and estimating the utility of the crowd annotations for the task at hand. Although the two strategies are closely related and should be considered jointly when designing crowdsourcing experiments, it is informative to consider them separately here.
The first strategy is closely related to the field of quality control in crowdsourcing. Numerous approaches exist to tackle this, starting from simple majority voting and worker filtering to sophisticated statistical and machine learning methods that consider workers' specific skills, task difficulty and clarity of task descriptions. The second strategy is more domain-specific, as different tasks may have different levels of tolerance for errors.

Preprocessing of annotations
Preprocessing of annotations broadly covers what is done to the crowdsourced annotations prior to using them for their intended purpose. It includes filtering individual annotations and/or aggregating annotations. The majority (84%) of the surveyed papers perform some form of preprocessing.

Filtering individuals
One way to filter annotations is to remove annotations made by "poorly performing" annotators. Most crowdsourcing platforms offer a rating score for workers that provides an estimate of their performance, based on their percentage of previously approved tasks. This score is used in 16% of surveyed papers to filter workers prior to assigning tasks. A related approach, used in 12% of surveyed papers, is to exclude workers that fail a test task prior to the actual tasks. A refinement of this, used in 23% of surveyed papers, is to integrate separate test tasks in the tasks and exclude workers that fail the tests. Park et al. (2018), for example, added a smiley face to colonoscopy videos to ensure attention.
Another common filtering approach for individual workers, used in 23% of surveyed papers, is comparing annotations to gold standard annotations. In this case, tasks with known gold standard annotations are injected into the regular working process. A worker's correspondence with the gold standard can then be used to estimate individual worker performance. In contrast to platform scores and unrelated test tasks, this approach assesses worker performance on the specific task, allowing more fine-grained worker selection. The filtering mechanisms used in the surveyed papers are summarized in Table 6.

Table 6. Filtering mechanisms.
Filtering: Before. Papers: (Heim, 2018), (Holst et al., 2015), (Sullivan et al., 2018)
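The gold-standard filtering described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `(worker, task, label)` tuple format, the function names, and the thresholds `min_accuracy` and `min_gold_tasks` are hypothetical and are not taken from any surveyed paper.

```python
from collections import defaultdict


def gold_task_scores(answers, gold):
    """Count correct and attempted gold tasks per worker.

    answers: list of (worker_id, task_id, label) tuples (assumed format).
    gold: dict mapping task_id -> known label, for injected gold tasks only.
    Returns a dict worker_id -> (n_correct, n_attempted).
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for worker, task, label in answers:
        if task in gold:  # regular tasks do not contribute to the estimate
            attempted[worker] += 1
            correct[worker] += int(label == gold[task])
    return {w: (correct[w], attempted[w]) for w in attempted}


def filter_by_gold(answers, gold, min_accuracy=0.8, min_gold_tasks=2):
    """Keep regular-task answers from workers who pass the gold check.

    The thresholds are illustrative defaults, not recommended values.
    """
    scores = gold_task_scores(answers, gold)
    trusted = {
        w for w, (n_ok, n_all) in scores.items()
        if n_all >= min_gold_tasks and n_ok / n_all >= min_accuracy
    }
    return [(w, t, l) for w, t, l in answers if w in trusted and t not in gold]
```

Workers who saw too few gold tasks are excluded here because their accuracy estimate would be unreliable; a real study might instead keep them and weight their answers less.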

Aggregating results
One of the main benefits of crowdsourcing is the fast and cost-effective collection of a large number of annotations. This allows aggregating annotations to reduce noise in the individual annotations.
Majority voting is widely used due to its computational and conceptual simplicity, and was found in 23% of the papers. In the context of medical image analysis, majority voting is applied to annotations, ratings, and also to aggregate slices of images. Heim (2018), for example, used crowdsourcing for organ segmentation in computed tomography scans. Multiple organ outlines are collected via an online tool and pixel-wise majority voting is applied to improve the accuracy of the segmentation. In the case of numerical ratings, mean and median statistics are also used in 12% of the papers to determine a final annotation. For example, Cheplygina et al. (2016) used the median to aggregate the areas of the annotations created by individual workers. A more sophisticated version of the majority vote uses additional information about the general quality of workers. This information can be derived if workers perform multiple tasks or if gold standard data is available. Weighted voting is used in 16% of surveyed papers. For example, Keshavan et al. (2018) used the XGBoost algorithm to estimate annotator weights, and Brady et al. (2017) estimated the weights of the annotators as the probability that an annotator is correct while taking task difficulty into account. The aggregation mechanisms used in the surveyed papers are summarized in Table 7.
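To make the aggregation strategies above concrete, the following is a minimal sketch of label-level majority voting, pixel-wise majority voting on binary masks, median aggregation, and weighted voting. The function names and the list-based data layout are assumptions for illustration. The weights are taken as given; how they are estimated (for example with XGBoost or a difficulty-aware model) is outside this sketch.

```python
from collections import Counter, defaultdict
from statistics import median


def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def pixelwise_majority(masks):
    """Combine equally sized binary masks (2D lists of 0/1).

    A pixel is foreground if strictly more than half of the masks mark it.
    """
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [int(2 * sum(m[r][c] for m in masks) > n) for c in range(cols)]
        for r in range(rows)
    ]


def weighted_vote(labels, weights):
    """Return the label with the largest summed annotator weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Numerical ratings (e.g. annotated areas) can be aggregated with the median.
aggregated_area = median([120, 150, 400])
```

Note that weighted voting can overrule the plain majority when a single highly weighted annotator disagrees with several low-weight annotators, which is the intended behavior when the weights reflect annotator reliability.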

Evaluating annotations
Evaluating how well crowd annotations solve the intended purpose is most commonly (79% of surveyed papers) done by directly comparing crowdsourced annotations to a gold standard. In about 16% of surveyed papers crowd annotations are used for training a machine learning method, and the performance of the machine learning method is used to indirectly evaluate the annotations. The remaining 5% have no evaluation of how well annotations solve the intended purpose.
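The text does not single out one metric for comparing crowd annotations to a gold standard. As one illustration for segmentation tasks, the sketch below computes the Dice overlap coefficient, a widely used measure, between a crowd mask and a gold-standard mask. The choice of Dice, the function name, and the list-of-lists mask format are assumptions made for this example.

```python
def dice_coefficient(crowd_mask, gold_mask):
    """Dice overlap between two equally sized binary masks (2D lists of 0/1).

    Returns 2*|A and B| / (|A| + |B|), or 1.0 when both masks are empty.
    """
    crowd = [p for row in crowd_mask for p in row]
    gold = [p for row in gold_mask for p in row]
    if len(crowd) != len(gold):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for c, g in zip(crowd, gold) if c and g)
    total = sum(1 for c in crowd if c) + sum(1 for g in gold if g)
    return 1.0 if total == 0 else 2 * overlap / total
```

A value of 1 indicates identical masks and 0 indicates no overlap; for classification tasks, simple accuracy against the gold labels would play the same role.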
The gold standard originates from different sources. In about 25% of surveyed papers, the gold standard is based on a single expert, in about 37% the gold standard is based on multiple experts, and in the remaining papers the number of experts is not reported or no expert gold standard is used. Using a gold standard based on a single expert can be problematic since experts often disagree on all but the most trivial tasks. However, only 3 of 21 papers that use multiple experts consider how well experts agree.
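Agreement between two experts can be quantified in several ways. A minimal sketch using Cohen's kappa, a standard chance-corrected agreement statistic, is shown below. It is not implied that the surveyed papers used this particular statistic, and the function name and input format are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which helps judge how much weight a multi-expert gold standard can bear.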
Expert-based gold standards are generally not obtained from experts performing exactly the same task as the crowd. In several cases the only difference in expert and crowd tasks is due to differences in user interface, e.g. a clinical workstation for experts and a web interface for crowds. As long as the fundamental task is the same (e.g. count cells) and the user interface has not been dramatically changed we consider the expert and crowd tasks to be the same. Using this definition, about 40% of the papers use the same task and about 40% use a different task. In the remaining 19% it is either not reported or no expert gold standard is used.
There are several reasons for asking crowds to perform a different task than what experts have done for the gold standard. Some papers use a simplified version of the expert task in order to make the task easier or more suitable as a small self-contained task. For example, ranking relative performance in pairs of surgical videos instead of grading performance in each (Malpani et al., 2015); assessing visual similarity of images instead of classifying disease patterns (Ørting et al., 2017); refining segmentation proposals instead of performing a full segmentation (Maier-Hein et al., 2016); annotating polyps in a single frame instead of in a full video (Park et al., 2017) or counting stained cells instead of classifying disease status (Irshad et al., 2017). Other papers focus on changing the user interface, such as Lejeune et al. (2017) who used an eye tracker for segmentation instead of a mouse, or Albarqouni et al. (2016b) and Mavandadi et al. (2012) who changed the user interface to support gamification strategies.
In a few papers, the evaluation is focused on variation in annotations. For example, Lee and Tufail (2014) and Lee et al. (2016) evaluate annotations in terms of inter-rater reliability; and Heller et al. (2017), Huang and Hamarneh (2017), Leifman et al. (2015), and Sonabend et al. (2017) compare individual annotations to aggregated annotations. Measuring variability of annotations is not directly useful for evaluating the correctness of annotations. However, annotation variability is essential when evaluating how much the crowdsourced annotations can be trusted. Additionally, variation provides an indirect measure of correctness. Large variation can indicate that annotations are often wrong, while small variation indicates that annotations are often correct or the task has been designed such that annotators are consistently wrong. All evaluations used in the surveyed papers are summarized in Table 8.
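One simple way to express annotation variation for a single image is the fraction of annotators who agree with the most common label. The sketch below is an assumed, illustrative measure rather than the method of any surveyed paper. As noted above, a high value does not prove correctness, since annotators can be consistently wrong.

```python
from collections import Counter


def consensus_fraction(labels):
    """Fraction of annotators who chose the most common label for one image."""
    if not labels:
        raise ValueError("at least one annotation is required")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)
```

Images with a low consensus fraction could be flagged for expert review, under the assumption that disagreement signals a difficult or ambiguous case.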

RESULTS AND RECOMMENDATIONS
Here, we provide an overview of the primary results and recommendations emerging from the papers examined in this review. Complementary to the topics discussed in Section 5, we consider how effective the application of crowdsourcing to medical image analysis is, and provide recommendations to ensure data quality.

How effective is the application of crowdsourcing to medical image analysis?
The vast majority of studies examined in this review found crowdsourcing to be a valid approach for data production. Crowdsourcing of medical image analysis was noted to be an accurate approach (Lawson et al., 2017) that can produce the large quantities of annotations needed to solve high-throughput problems requiring human input (Irshad et al., 2015; dos Reis et al., 2015; Lee and Tufail, 2014; Maier-Hein et al., 2014b). Crowdsourcing can be used to create new annotations or make existing data more robust, and can do so more cheaply and quickly than annotation by medical experts (Rajchl et al., 2016; Holst et al., 2015; Gurari et al., 2016; Eickhoff, 2014; Park et al., 2017).
Although the relative efficacy of crowdsourcing applied to medical image analysis will be dependent on the complexity of the task, the papers examined here show crowdsourcing to be an effective methodology across a wide variety of applications, including objective assessment of surgical skill (Malpani et al., 2015), emphysema assessment (Ørting et al., 2017), polyp marking in virtual colonoscopy, identification of chromosomes (Sharma et al., 2017) and biomarker discovery in immunohistochemistry data (Smittenaar et al., 2018). Notably, only one project stated that crowdsourcing could not always be applied effectively to the studied task ("it is very difficult and maybe even impossible to entirely outsource the task of labelling mitotic figures in histology images to crowds" (Albarqouni et al., 2016a)).
Rather than comparing the absolute performance of the crowd to experts or to algorithms, it might be worth considering their relative benefits. For example, crowds were particularly useful for rare classes (Sullivan et al., 2018), which are often difficult cases for algorithms. Another situation where crowds can be useful is identifying data that is missing from the gold standard provided by experts; see, for example, Luengo-Oroz et al. (2012). Benefits of combining crowds with algorithms were also demonstrated by Albarqouni et al. (2016a), Keshavan et al. (2018) and Sharma et al. (2017).

Recommendations to ensure data quality
The papers examined in this review included suggestions to improve the quality of data produced through crowdsourcing. These suggestions focused on refining the task design, crowdsourcing platform and post-processing of annotations. We summarize these recommendations here.

Task design
As discussed, crowdsourcing has been applied effectively to many medical imaging applications. However, careful study design remains necessary to ensure generation of data of sufficient quality.
The selection and design of an appropriate crowdsourcing task is central to project success. Effort should be made to make the task simple and unambiguous (Rajchl et al., 2016; Gurari et al., 2016), and to present study data appropriately. For unavoidably challenging tasks, crowdsourcing may still provide useful data, for instance, through enabling a rapid first-pass evaluation of large-scale datasets (Della Mea et al., 2014; Park et al., 2017). Particularly challenging tasks may be made tractable through gamification (Albarqouni et al., 2016b) or careful reframing of the task, e.g. crowdsourcing of emphysema assessment was made possible through reframing the task as a question of image similarity (Ørting et al., 2017). Alternatively, it may be possible to achieve the desired data quality simply by asking a larger cohort of crowd workers to annotate each data point. Gurari et al. (2016) give an interesting example of task design where quality and speed of crowdsourced segmentations in natural images are increased by flipping images, suggesting that familiarity with an image can be detrimental.
Besides the technical and methodological challenges, ethical considerations must also be addressed in this design phase. Workers should be informed of the visual content of the task, e.g. surgery images, before the first image is shown. Further, an appropriate wage should be provided. An appropriate wage is often hard to determine, as it depends on various factors such as the home country of the workers, the platform in use, and the complexity of the task. Nevertheless, these factors should be considered and reported in publications.
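To make the effect of asking more workers per data point concrete, the sketch below computes the probability that a simple majority vote is correct. It rests on strong simplifying assumptions that are ours rather than the surveyed papers': a binary task, workers who err independently, and an identical per-worker accuracy. The function name and the accuracy value of 0.7 are hypothetical.

```python
from math import comb


def majority_correct_probability(n_workers, p_correct):
    """Probability that a majority of n independent workers is correct.

    Assumes a binary task and the same accuracy p_correct for every worker.
    """
    if n_workers < 1 or n_workers % 2 == 0:
        raise ValueError("use an odd, positive number of workers to avoid ties")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    needed = n_workers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_workers, k) * p_correct**k * (1.0 - p_correct) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


# Hypothetical per-worker accuracy of 0.7 on a binary labelling task.
for n in (1, 3, 5, 9):
    print(n, round(majority_correct_probability(n, 0.7), 3))
```

Under these assumptions, redundancy only helps when individual workers are better than chance. If errors are correlated, for example because the task design misleads most workers in the same way, adding workers will help less than this calculation suggests.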

Crowdsourcing platform
The choice of crowdsourcing platform can influence study cost and completion time, as well as the size and demographics of the crowd. Furthermore, different platforms offer distinct features which may influence the quality of data produced. For example, Heller et al. noted that user interface features, such as zoom and intuitive controls, can increase data quality. Contingent on the complexity of the task and interface design, training materials should be provided, as this can improve results. However, this is not always necessary: in some cases minimal training (Brady et al., 2014) or no training (Ganz et al., 2017) was required.

Post-processing
Post-processing of annotations is recommended to improve annotation quality by removing annotations from poorly performing workers. Alternatively, if multiple workers annotate the same data it is possible to improve annotation quality by aggregating annotations.
The surveyed papers propose a variety of criteria for filtering individual annotations. Examples include time spent on the task (O'Neil et al., 2017), expected shape of the segmentation (Cheplygina et al., 2016; Chávez-Aragón et al., 2013), correlation with other workers' results (Sharma et al., 2017; Chávez-Aragón et al., 2013) and correlation with expert annotations or ground truth (Keshavan et al., 2018; Irshad et al., 2017, 2015; Foncubierta Rodríguez and Müller, 2012). However, due to the lack of comparisons between different filtering approaches, the only clear recommendation from these works is to use some form of filtering. Nguyen et al. (2012) found that filtering unreliable workers did not have a significant influence when annotations from multiple workers are aggregated. However, aggregating without taking individual performance into account might not be the best approach. Malpani et al. (2015) compared different aggregation methods, and found that weighted voting, with weights based on self-reported confidence scores, improved results compared to simple majority voting. Similarly, Irshad et al. (2015) found that aggregating segmentations from 3-5 workers, using weights based on consensus and worker trust scores, improved performance over using single worker annotations. Further, Cheplygina and Pluim (2018) found that disagreement between workers was predictive of melanoma diagnosis in skin lesions, suggesting that simple aggregation, such as majority voting or mean statistics, might not be the best approach.
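The difference between simple majority voting and weighted voting can be sketched as follows. This is a generic illustration, not the specific method of Malpani et al. (2015) or Irshad et al. (2015): the function name, the worker identifiers, the labels and the per-worker weights are all hypothetical. In practice the weights might come from self-reported confidence, trust scores or agreement with other workers.

```python
from collections import defaultdict


def weighted_vote(labels, weights=None):
    """Return the label with the largest total weight.

    labels:  dict mapping worker id -> label for one image.
    weights: optional dict mapping worker id -> non-negative weight.
             Workers without a weight count as 1.0, so omitting weights
             altogether reduces to simple majority voting.
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    weights = weights or {}
    totals = defaultdict(float)
    for worker, label in labels.items():
        totals[label] += weights.get(worker, 1.0)
    # Iterate labels in sorted order so that ties are broken deterministically.
    return max(sorted(totals), key=totals.get)


# Hypothetical annotations of one frame by three workers.
labels = {"w1": "polyp", "w2": "no_polyp", "w3": "no_polyp"}
# Hypothetical per-worker weights, e.g. derived from a trust score.
weights = {"w1": 0.95, "w2": 0.40, "w3": 0.45}

print(weighted_vote(labels))           # unweighted majority vote
print(weighted_vote(labels, weights))  # weighted vote
```

In this constructed example the two schemes disagree: the single highly weighted worker outweighs the two low-weight workers. Whether such weighting actually improves results depends on how well the weights reflect worker reliability.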

DISCUSSION
In this section we discuss the trends, limitations and opportunities within crowdsourcing in medical imaging.

Trends
As discussed in Section 2, crowdsourcing is applied to a variety of medical images; however, it is most commonly applied to histology or microscopy images. The trend for crowdsourcing of this image type may be due to the ease with which these (typically 2D) images can be incorporated into a crowdsourcing or citizen science project. Alternatively, the microscopy images examined in these papers may not have been derived from a patient, and would therefore not require the consent of an individual to use for crowdsourcing purposes.
The most common crowd task is rating entire images. This is somewhat surprising, given that we would expect such tasks to rely more on prior knowledge than other crowdsourced tasks, such as drawing outlines of objects. Again, this trend might be facilitated due to the ease with which rating images can be incorporated in existing platforms.
Most crowdsourcing studies are set up on commercial platforms, followed by custom platforms. Each image is annotated by multiple crowd workers, who typically receive less than $0.10 per annotation. On the one hand, this low reimbursement might be a product of researchers trying to optimize the total number of annotations given a particular budget. On the other hand, it could be a lack of awareness of what appropriate compensation should be (Hara et al., 2017).
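One way to reason about compensation is to convert a per-annotation payment into an implied hourly wage, and conversely to derive the payment needed to reach a target wage. The sketch below shows this arithmetic. The function names, the 30-second annotation time and the target wage of 15 currency units per hour are hypothetical values chosen for illustration, not figures from the surveyed papers.

```python
def implied_hourly_wage(pay_per_annotation, seconds_per_annotation):
    """Hourly wage implied by a per-annotation payment and a task duration."""
    if seconds_per_annotation <= 0:
        raise ValueError("seconds_per_annotation must be positive")
    return pay_per_annotation * 3600.0 / seconds_per_annotation


def pay_for_target_wage(target_hourly_wage, seconds_per_annotation):
    """Per-annotation payment needed to reach a target hourly wage."""
    if seconds_per_annotation <= 0:
        raise ValueError("seconds_per_annotation must be positive")
    return target_hourly_wage * seconds_per_annotation / 3600.0


# Hypothetical: 0.10 per annotation, 30 seconds of work per annotation.
print(round(implied_hourly_wage(0.10, 30), 2))
# Hypothetical: payment per annotation needed for a target of 15 per hour.
print(round(pay_for_target_wage(15.0, 30), 3))
```

The annotation time is best measured in a pilot study, since estimates made by the researchers themselves may not reflect how long the task takes a typical worker.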
A surprising finding is that, often, important details about the crowd and their compensation are missing. Besides missing details in terms of crowd compensation, we find missing details regarding the number of requested annotations per unit. While for some of the surveyed papers, we could infer an approximation of the number of annotations gathered per unit by checking the scale of the experiment and the total amount of annotations gathered, for at least a third of the surveyed papers (33%) this was not possible due to a lack of detail when describing the crowdsourcing experimental methodology.
Crowdsourced annotations are generally processed prior to evaluating how well the annotations solved the intended purpose. Simply excluding workers based on platform scores or a single test task is not as popular as continuously monitoring worker performance. 61% of the surveyed papers aggregate annotations from multiple crowd workers. This is most commonly done by simple majority voting, but some papers use estimates of task difficulty and/or worker performance to obtain a weighted aggregation.
The most common approach to evaluating the quality of preprocessed annotations is by comparing to an expert defined gold standard. A smaller set of papers use the annotations to train a machine learning method and evaluate the performance of the trained method.
The studies we reviewed almost unanimously conclude that crowdsourcing is a viable solution for medical image annotation, which may seem unexpected given the complexity of medical imaging as a field in general. There might be several possible reasons for the lack of negative results. One is researchers selecting tasks which they already expect to be suitable for crowdsourcing. Another reason is publication bias, with papers demonstrating negative results having less chance of being published, which is also an issue in computer vision (Borji, 2018).

Limitations
There are a number of limitations in the way that the current studies are being conducted. There is generally a lack of clarity in the reporting of experimental design and evaluation protocols. Additionally, ethical questions regarding worker compensation, image content and patient privacy are rarely discussed, but seem crucial to address. In several papers the study design appears to be ad hoc. Characteristics such as the platform, number of annotators, how the task is explained and so forth, are not always motivated, or even described. This creates difficulties in understanding what leads to a successful crowdsourcing study and increases the barrier for researchers who have not used crowdsourcing before. The studies which do examine such factors are often conducted on a single application, making it difficult to generalize lessons learned to other applications. Detailed documentation of experiments is a crucial factor for ensuring reproducible science and essential for replication studies.
Another problem is the evaluation of results. The quality of crowdsourced annotations is generally estimated by comparing directly to expert annotations. However, variation in both the expert-defined gold standard and the crowd annotations is not systematically accounted for, making it difficult to assess if crowd annotations are actually good enough. When using annotations to train machine learning methods, noisy crowd annotations might not be a problem if handled by the method. However, variation in annotations should still be investigated in this case. A related problem is using expert annotations to filter crowd annotations, which would not be possible for real unlabelled data, thus leading to overly optimistic results.
Overall, the surveyed papers reported successful results. However, from our personal experience and discussions with other researchers, it is non-trivial to set up a crowdsourcing project for medical images. Due to the lack of negative results, the current literature does not inform researchers inexperienced with crowdsourcing about the main considerations of such a project. Furthermore, very few articles report on pilot experiments which aim to calibrate and identify the optimal crowdsourcing parameter settings such as the number of annotators per image.
There are important ethical issues which are largely not mentioned in the papers we surveyed. First of all, details about compensation are often missing, whereas this can have an important effect on the crowd (Hara et al., 2017). Furthermore, what is reasonable compensation in one country, may be too low for another country due to different cost of living. How to set the compensation fairly is an open issue that researchers should consider in their work.
Another ethical concern is whether it is possible and/or appropriate to share images with the crowd. Some images (for example of surgery) may be traumatic to view or unsuitable for children, a concern that is more specific to the medical domain than to other areas where crowdsourcing is applied, e.g. astronomy or ecology projects. Another issue is sharing images from the perspective of patient consent, which must be considered case by case.

Opportunities
Several papers discuss directions they want to take in further research. One of the popular directions is increasing the role of machine learning. Several papers not using machine learning plan to do so in future (Brady et al., 2017;Sullivan et al., 2018). Papers that already use machine learning discuss improvements to their algorithms or crowd-algorithm combinations (Sharma et al., 2017;Sameki et al., 2016).
Related to the above, tailoring the tasks to individual workers is another possibility. The rating score given to workers by platforms only reflects an overall completion rate, and might be artificially high because employers tend to rate the majority of the tasks positively and apply filtering afterwards. Considering worker scores on different task types could help to make a better selection of workers beforehand.
Another strategy discussed as future work is the use of gamification. Several papers by Luengo-Oroz et al. (2012) 2018) take a more task-independent approach of a mini-game within an existing, larger game. This could be an opportunity for many other researchers, without the need to design a game from scratch. Finally, annotating images at a festival as presented by Timmermans et al. (2016) could be an interesting direction.
Beyond the opportunities that the papers discuss as future research, we see a number of other future directions for the community as a whole. Perhaps the most important future direction is openly sharing our experiences with crowdsourcing, including failures. Due to publication bias, current papers may not reflect the performance and difficulties encountered in a typical crowdsourcing project.
More generally, there is an opportunity to create a set of guidelines for crowdsourcing medical imaging studies. Rather than relying on ad hoc choices, researchers could then make informed decisions about the platform, reward of the annotators and other variables. For example, the European Citizen Science Initiative has a selection of guides for performing citizen science studies 6. A further opportunity is to interact more with other fields where crowdsourcing has been in use longer, and to see which of their best practices are also applicable to medical imaging.
Interacting with workers could both improve projects and help establish guidelines. Workers have created communities (e.g. Reddit 7, Facebook) and discussion boards 8 for some platforms. Chandler et al. (2014) found that 28% ± 5% of the workers on Mechanical Turk read discussion boards and blogs related to Mechanical Turk. The topics of conversation, in order of frequency, are: pay, gratification, completion time, difficulty, how to successfully complete a task, purpose and the requesters' reputation. These forums are a valuable source for researchers for gathering information, measuring opinions and getting feedback on improving their project. This is particularly important because high-throughput workers are more likely to discuss tasks (Chandler et al., 2014). This subgroup (less than 10% of the workers do more than 75% of the work; Hara et al., 2017) is likely to have experience with similar tasks (Chandler et al., 2014), and interaction with these workers may result in various improvements, such as the user interface improvements observed by Bruggemann et al. (2018).
Next to image analysis, crowdsourcing could also be a way to collect, rather than curate, data to improve medical knowledge. This could vary from donating your own medical images, such as via MedicalDataDonors 9, to contributing experiences about rare diseases. Since such initiatives do not focus on image analysis, we did not include them in this survey; however, the works by Ranard et al. (2014) and Wazny (2017) may be good starting points for readers interested in these topics.