
1 Introduction

The development of interactive web technologies allows organizations to access information from individuals outside, and not formally associated with, the organization. This external information is commonly known as user-generated content (UGC) – content voluntarily contributed by individuals external to the organization. Access to UGC is revolutionizing industry and research. UGC sourced through crowdsourcing systems – systems that enable “outsourcing a task to a ‘crowd’, rather than to a designated ‘agent’ … in the form of an open call” [1, p. 355] – has successfully been used in diverse contexts for understanding customers, developing new products, improving service quality, and supporting scientific research [2,3,4,5]. In this paper, UGC and crowdsourcing refer specifically to UGC from purpose-built integrative crowdsourcing systems that “pool complementary input from the crowd” [6, p. 98], rather than passive UGC collected through applications such as social media.

When creating crowdsourcing systems, one important design decision sponsors must make is determining the composition of an appropriate crowd [28]. This decision influences the other design decisions about crowdsourcing projects (i.e. system design, task design, and motivation of contributors). Because the quality of the UGC to be collected is a concern, sponsors either require potential contributors to possess relevant knowledge of the crowdsourcing task or allow a broader spectrum of volunteers to be part of their crowds. Choosing the former implies implementing recruitment strategies that favor knowledgeable contributors and prevent less knowledgeable contributors from participating, such as training volunteers before they are allowed to participate [8, 9] and recruiting experienced contributors – people who have previously participated (or are presently participating) in a similar project [31].

By restricting participation in integrative crowdsourcing projects to trained or experienced contributors, sponsors seek to tap into contributors’ proficiency and familiarity with the task to ensure high information quality [30, 31]. This approach is supported both in practice and in the crowdsourcing literature. For example, Wiggins et al.’s survey of 128 citizen science crowdsourcing projects – which often are integrative crowdsourcing systems that engage citizens in data collection – reports that “several projects depend on personal knowledge of contributing individuals in order to feel comfortable with data quality” [p. 17]. Likewise, [8] promotes a contributor selection strategy for “eliminating poorly performing individuals from the crowd” and identifying experts among volunteers “who consistently outperform the crowd”. However, in this position paper, we make the case against adopting strategies that restrict participation to only knowledgeable contributors.

2 Information Quality and Repurposable UGC

Knowledge about the phenomena on which data are being collected is assumed to positively influence the key dimensions of information quality – information accuracy and information completeness. Information accuracy is defined as “the correctness in the mapping of stored information to the appropriate state in the real world that the information represents” [10, p. 203], while information completeness is the “degree to which all possible states relevant to the user population are represented in the stored information” [10, p. 203]. However, the literature contains several studies in which experts or knowledgeable contributors in the crowd have not provided more accurate information than novices. For example, three studies in an ecological context found that knowledgeable contributors did not provide more accurate data than non-experts [11,12,13]. Likewise, in an experiment in which participants were required to identify and provide information about sightings of flora and fauna, novices performed as well as knowledgeable contributors with respect to the study’s task [9].

Similarly, even though Kallimanis et al. [13] showed that less knowledgeable contributors reported less information than knowledgeable contributors based on the fitness criterion employed in their study, they also found that less knowledgeable contributors provided more data about certain aspects of the task and made significantly more unanticipated discoveries. These findings are largely congruent with Lukyanenko et al.’s field and lab experiments [9, 16], which showed that the conceptualization and design of a crowdsourcing system play a role in the completeness of data provided by contributors with varying degrees of knowledge. In sum, empirical research offers evidence that knowledgeable contributors do not always provide more complete or more accurate information (i.e. higher quality information) than those with little or no domain knowledge.

While accuracy and completeness are pertinent dimensions of information quality, UGC needs to encompass diverse views and perspectives to sufficiently address the need for contributed data to be repurposable [17]. This repurposability requirement can only be met if crowdsourced data is “managed with multiple different fitness for use requirements in mind” [18, p. 11]. That is, the design choices made for integrative crowdsourcing systems should also support information diversity – the “number of different dimensions” present in data [7, p. 214] – to ensure repurposability and reusability of data. The relevant dimensions of information quality for crowdsourced UGC thus go beyond accuracy and dataset completeness and include information diversity.

Information diversity is the ratio of the amount of distinct information in contributions about an entity to the amount of information available in the contributions. The degree of diversity between two contributions A and B, each consisting of a set of attributes, is \( \frac{\left| A \cup B \right| - \left| A \cap B \right|}{\left| A \cup B \right|} \). The higher the ratio, the more diverse the two contributions are. Information diversity promotes discoveries as it enables different users and uses of data, which may lead to unanticipated insights [17]. Information diversity helps provide a better understanding of data points, as some contributors may give details about a data point where others do not. In addition, information diversity affords flexibility to project sponsors, as data requirements may change with new insight or because projects are commissioned without clearly defined hypotheses in mind. A richer, more robust dataset can better handle such changes than a highly constrained one.
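To make the ratio concrete, the short sketch below computes this diversity measure for two contributions modeled as sets of attribute names. The attribute values and the helper name diversity are hypothetical; the sketch only illustrates the formula above.

```python
# Minimal sketch of the diversity ratio between two contributions,
# modeled as sets of attribute names (hypothetical example data).

def diversity(a: set[str], b: set[str]) -> float:
    """Ratio of distinct (non-shared) attributes to all attributes in A ∪ B."""
    union = a | b
    if not union:
        return 0.0  # no information at all; treat diversity as 0
    return (len(union) - len(a & b)) / len(union)

# Two hypothetical sighting reports describing the same entity
contrib_a = {"red breast", "thorned habitat", "near water"}
contrib_b = {"red breast", "short beak"}

print(diversity(contrib_a, contrib_b))  # 0.75: only 1 of 4 attributes is shared
```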

Understandably, information diversity has received little attention in the information quality literature, which has mainly focused on the quality of information collected within organizations with tight control over their information inputs, processing, and outputs, and with predetermined users and uses of the resulting data. Within these traditional organizational settings, described in [17] as closed information environments, information diversity is sometimes considered undesirable, and data management processes seek to minimize or eliminate it. Moreover, in the few cases where data diversity has been considered in the context of the repurposability of UGC, research has focused on system (or data acquisition instrument) design [17,18,19]. Less attention has been paid to how the cognitive diversity of a target crowd (i.e. differences in experience and task proficiency) affects the diversity of the data it generates.

3 Theoretical Foundation for Information Quality in UGC

Generally speaking, humans manage limited cognitive resources in the face of a barrage of sensory experience by paying selective attention to relevant features that aid in identifying instances of a class, while irrelevant features (those not useful for predicting class membership) can be safely ignored. Even though everyone selectively attends to information to some extent, our use of the term selective attention covers only top-down attention, i.e. “internal guidance of attention based on prior knowledge, willful plans, and current goals” [14, p. 509].

Although selective attention leads to efficient learning, it comes at the cost of learned inattention to features that are not “diagnostic” in the present context [21, 22]. Training leads to selective attention to pertinent or diagnostic attributes [22, 24]. When members of a crowd have been trained, their reports will align most closely with the information learned from their training, resulting in less diversity than would be present in data reported by members of an untrained crowd. This effect is particularly pronounced when the training provides specific rules for performing the task, as contributors will tend to rely on (and pay attention to) this explicit information over any implicit inferences they may form themselves – a phenomenon known as salience bias [15].

Consider a citizen science scenario (adapted from [22]) in which contributors who have been trained to identify rose bushes are asked to report their occurrences in a field of rose, cranberry, and raspberry bushes. Assume that, through their training, contributors are able to distinguish rose bushes from the other bushes in the field by the absence of berries. Their training is sufficient to ensure that the data they report are accurate and complete, as other attributes, such as the presence of thorns, are not diagnostic in this context, where rose and raspberry bushes both have thorns. However, if a user later needs to repurpose the collected data to confirm the presence of cranberry bushes in the same field or estimate their number, the presence or absence of berries is no longer diagnostic, as cranberry and raspberry bushes both have red berries, while the presence of thorns becomes diagnostic, as cranberry bushes do not have thorns. The data become inadequate, requiring resources to repeat the data acquisition stage. Training thus tends to align the information contributors report with the training they received, reducing its diversity and, in turn, the repurposability of the data and the ability to make discoveries.
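As an illustration only (the attribute names and reports below are hypothetical, not data from the cited studies), the following sketch shows how a dataset recorded under a trained crowd’s selective attention can fail a later repurposing task that an untrained crowd’s richer reports would support.

```python
# Hypothetical illustration of the rose-bush scenario: trained contributors
# report only the attribute their training made diagnostic ("has_berries"),
# while untrained contributors also report non-diagnostic attributes.

trained_reports = [
    {"has_berries": False},            # rose bush
    {"has_berries": True},             # some berry bush, but which one?
]

untrained_reports = [
    {"has_berries": False, "has_thorns": True},   # rose bush
    {"has_berries": True,  "has_thorns": True},   # raspberry bush
    {"has_berries": True,  "has_thorns": False},  # cranberry bush
]

def can_identify_cranberry(reports):
    """Repurposed task: cranberry bushes are the ones with berries and no thorns."""
    return all("has_berries" in r and "has_thorns" in r for r in reports)

print(can_identify_cranberry(trained_reports))    # False: thorns were never recorded
print(can_identify_cranberry(untrained_reports))  # True: the data support the new use
```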

Similarly, experience increases the tendency towards selective attention. The absence of this tendency is “a developmental default” [23, 24]: infants do not selectively attend to attributes of instances. They reason about entities by observing all the features of individual instances [20] and are, therefore, naturally comparable to novice contributors in an integrative crowdsourcing context [24, 25]. The tendency for selective attention thus forms with development, as a mechanism for coping with the deluge of information around us and an aid to classification. For this reason, the capacity to classify is a distinguishing factor between adults and infants [20]. As experience increases, the tendency for selective attention increases correspondingly.

Knowledge of the crowdsourcing task acquired through training or experience leads contributors to report mainly the attributes of instances they have been taught (or learned experientially) to be relevant to the task [26]; they are thus expected to be less likely than novices to attend to attributes irrelevant to the task [27]. Ogunseye and Parsons [29] argue that knowledge therefore affects the accuracy and completeness of contributed data, as knowledgeable contributors have an increased tendency to focus only on diagnostic attributes, ignoring changes to other attributes when they occur. In addition, knowledgeable contributors show more resistance to further learning [27], impeding their ability to make discoveries. We add here that since contributors with similar knowledge are expected to show similar levels of selective attention and to contribute more homogeneous data than cognitively diverse contributors, knowledge (task proficiency and experience) will also reduce a crowd’s capacity for information diversity.
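This crowd-level effect can be expressed by extending the earlier diversity ratio to the mean pairwise diversity across all contributions. The sketch below does so with hypothetical attribute sets; the function names and example crowds are illustrative assumptions, not data from the cited studies.

```python
from itertools import combinations

# Crowd-level diversity as the mean pairwise diversity ratio
# over all contributions (each modeled as a set of attribute names).

def diversity(a: set, b: set) -> float:
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0

def crowd_diversity(contributions: list) -> float:
    pairs = list(combinations(contributions, 2))
    return sum(diversity(a, b) for a, b in pairs) / len(pairs)

# Trained contributors attend to the same diagnostic attributes...
trained_crowd = [{"has_berries"}, {"has_berries"}, {"has_berries", "height"}]
# ...while a cognitively diverse crowd reports overlapping but varied attributes.
diverse_crowd = [{"has_berries", "has_thorns"},
                 {"leaf_shape", "height"},
                 {"has_berries", "flower_color"}]

print(crowd_diversity(trained_crowd))   # ~0.33: contributions are near-duplicates
print(crowd_diversity(diverse_crowd))   # ~0.89: contributions add distinct attributes
```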

4 Conclusion

As organizations continue to leverage the collective wisdom of crowds, interest in crowdsourced UGC will continue to grow. New discovery and insight from integrative (rather than selective) crowdsourcing tasks depend on the ability of the collected UGC to accommodate the different perspectives of multiple users. This desire for repurposable UGC places a new information diversity requirement on crowdsourced information that is largely absent from traditional IS environments, where the uses of data are usually predetermined and stable. In addition to the traditional dimensions of information quality, we argue for the inclusion of information diversity as a necessary dimension for crowdsourced UGC. We also explain, from a cognitive perspective, why training and experience will constrain information diversity and, correspondingly, reduce the quality of crowdsourced UGC. Consequently, systems that seek repurposable UGC are better served if they are designed with inclusivity and openness as their core focus. Our agenda for future research includes studying how cognitive diversity impacts information diversity in different settings and how this impact affects the quality of decisions made from UGC.