1 Introduction

Creating high quality annotation in the biomedical domain is notoriously challenging and expensive, and text mining tools that assist with capturing entities and relationships from free text will prove invaluable for database curation and other information needs. The complex language in scientific text, combined with the level of expertise required to understand the content, makes it difficult not only to develop systems that can perform these tasks automatically, but also to consistently assess relevance and recognize entity classes (Pyysalo et al. 2008), resulting in inter-annotator agreement scores lower than those seen in the newswire domain (Dingare et al. 2005). Furthermore, representing the variety of information needs of biomedical researchers calls for different types and levels of annotation.

To generate a gold standard for evaluating participant submissions, challenges like BioCreAtIvE and the TREC Genomics track have relied on a mix of existing annotation and annotation generated by expert judges hired specifically for the challenges (Hersh and Bhupatiraju 2003; Hirschman et al. 2005). Having a cadre of expert judges opens the door to new tasks for which little or no annotation exists. At the same time, new tasks bring new challenges in generating high quality (i.e. reliable and reproducible) annotations, the importance of which is demonstrated by repeated calls for high quality annotation and explicit guidelines (Medlock 2008; Pyysalo et al. 2008). Errors in annotation, especially in training data, can cause systems to underperform (Dingare et al. 2005), or worse, misleadingly imply a level of achievable performance based on inapplicable measurements.

What constitutes a good annotated corpus is in part reflected by how well it replicates real information needs. The TREC Genomics track differs from other biomedical challenges in that the main focus of the ad hoc task was to develop systems based on real needs of biomedical researchers, as opposed to the needs of database curators (classification tasks in the track did address curation needs, but they will not be discussed here; see Cohen and Hersh 2006). Several publications by TREC participants discuss features of their systems, such as query construction, differences in indexing, and use of ontologies (Ide et al. 2007). Instead, the aim of this work is to describe how the annotation used to evaluate the systems was created, and how each year’s experiences drove the evolution of corpora, tasks and annotation methods.

Central to biomedical corpus annotation is the use of controlled vocabularies. Using standardized language to unify entities or concepts that are expressed in myriad ways lies at the heart of many successful search engines, such as PubMed (http://www.pubmed.gov) and the underlying Medical Subject Headings (MeSH) vocabulary used to expand queries. Similarly, test collections can be annotated with controlled vocabularies for any number of purposes. Recent debates about assigning controlled vocabulary terms highlight how difficult the vocabularies are to navigate and apply consistently when comprehensiveness makes their size unwieldy, or when no appropriate term exists in the vocabulary (Gerstein et al. 2007; Hahn et al. 2007). Although controlled vocabularies are a powerful tool, training novices to use them is a major undertaking and can be too time consuming to be practical for challenge tasks.

Biomedical annotation is a relatively new phenomenon, and most biologists with the requisite domain expertise are not proficient in setting relevance criteria, working with controlled vocabularies, and applying them consistently. Even full-time “biocurators” require up to six months of training to become truly proficient (Burkhardt et al. 2006; Salimi and Vita 2006). The TREC Genomics track set an aggressive agenda over its 5-year run: to evaluate system performance against large corpora modeling real user needs, with tasks that evolved each year. Domain experts with a background in biology, but not necessarily in annotation, were recruited to generate the annotations for the gold standard test collections. There is no comprehensive quality assessment of these test collections, but the “fair” level of inter-annotator agreement, which failed to improve over the course of the track, suggests that there is room for improvement. Here we describe how tasks and topics were developed and their possible impact on the test collections. The increasing complexity of the tasks made the annotation process more challenging, but it improved the transparency of how judges made decisions, making third-party review of the annotations feasible. Working with controlled vocabularies proved to be difficult, and this was remedied by allowing judges to create their own controlled vocabularies. A method for screening potential annotators was also developed. Lastly, we discuss how to develop topics that are objective and not subject to multiple levels of interpretation. A summary of the lessons learned during the 5 years of the TREC Genomics track is shown in Table 1.

Table 1 Lessons learned during the 5 year TREC Genomics track

2 Ad hoc task evolution: rationale and participant perspective

Ad hoc search topics were designed to represent the diverse granularity and breadth of real user needs (for more about topic selection and rationale, see the track overviews: Hersh et al. 2006, 2007). The user was taken to be a biomedical researcher who may or may not be an expert in the topics being searched. Variations on the ad hoc task, such as constraining the content of the topic or its answers, were also introduced to simulate the array of user needs in the controlled setting that TREC offers, and to explore the ability of automated systems to improve performance when the form in which information needs can be expressed is restricted.

Figure 1 shows examples of topics for each year that the track ran. In year 1, there were no resources for expert human judging, and the ad hoc task from that year will not be discussed here. In year 2, real user needs unrestricted in form and subject were used to create the topics (Fig. 1a). The relatively poor performance of systems in year 2 led to a task change in year 3 that constrained topics to five basic questions called Generic Topic Types (GTTs), with the hope that systems could take advantage of a restricted and predefined set of relationships between entities that differed only in the details, such as a particular gene or disease (Fig. 1b). In both of these years, participating systems searched against 10 years of Medline records comprising title, abstract, author, journal and MeSH terms. Systems were charged with returning a set of document identifiers corresponding to Medline records, ranked by how likely they were to answer the topic questions.

Fig. 1
figure 1

Sample ad hoc topics. a Free-form topics from year 2 that include a title, need, and context for additional background. b The Generic Topic Types from years 3 and 4. c Year 5 topics for which the answer was a list of items

The introduction of full text journal articles in year 4 prompted modifications to the ad hoc task, but the same topics in GTT format from year 3 were re-used to provide continuity. Document retrieval at the level of full text articles often presents users with a daunting amount of text to review. Rather than returning entire documents, participants were asked to return up to a paragraph of text containing the answer to the topic question from the full-text document, simulating a user-friendly design that directs users to the relevant section of the document.

In year 5, entities were incorporated into topics in a new way to simulate information needs whose answers were composed in part of a list of like items, such as, “What [GENES] are involved in NFkB activation of iNOS?” (Fig. 1c), where the type of term requested was genes. This approach was intended to address information overload by providing users with both answers to specific questions and the surrounding text, which provided evidence of why a given answer was retrieved. As in year 4, participants returned up to a paragraph of text that contained relevant entities answering the topic question.

In addition to the tasks described above, participants were scored on two additional considerations: the length of the text they nominated, and the number of distinct answers to the topic question that they provided. Concise passages that answer a question are desirable because they are easier for users to review than long passages containing information that does not pertain to the question. Similarly, if there are multiple correct answers to a question, and each answer can be supported by multiple passages, a system that returns several relevant passages but misses correct answers will score lower than a system that returns more of the correct answers (Hersh et al. 2006, 2007). How “answer diversity” was quantified for TREC is explained in Sect. 3.3.
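The official document-, passage- and aspect-level measures are defined in the track overview papers; purely as an illustration of the two considerations above, the sketch below computes a character-overlap precision for a single returned passage (rewarding concision) and an aspect recall over a set of returned passages (rewarding answer diversity). All data structures and values in the example are hypothetical, not the track's actual scoring code.

```python
# Simplified illustration only: the official TREC Genomics measures are defined
# in the 2006/2007 track overview papers. All inputs below are hypothetical.

def character_precision(returned_span, gold_spans):
    """Fraction of the returned span's characters that fall inside any gold
    answer span; rewards concise passages."""
    start, end = returned_span
    returned_chars = set(range(start, end))
    gold_chars = set()
    for g_start, g_end in gold_spans:
        gold_chars.update(range(g_start, g_end))
    if not returned_chars:
        return 0.0
    return len(returned_chars & gold_chars) / len(returned_chars)

def aspect_recall(returned_aspects, gold_aspects):
    """Fraction of distinct gold aspects covered by the passages a system
    returned; rewards answer diversity."""
    if not gold_aspects:
        return 0.0
    return len(set(returned_aspects) & set(gold_aspects)) / len(set(gold_aspects))

# Hypothetical example: one returned passage (characters 100-250 of a paragraph)
# overlapping a gold span (120-200), and two of three gold aspects covered.
print(character_precision((100, 250), [(120, 200)]))   # 0.533...
print(aspect_recall({"genetic variation", "protein misfolding"},
                    {"genetic variation", "protein misfolding", "therapy"}))  # 0.666...
```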

3 Ad hoc task evolution: the expert judge perspective

Judges with strong backgrounds in biology were recruited by track participants and organizers for a 2 month period each year. The amount of oversight and training that they received increased over time as their responsibilities grew. Judges were randomly assigned topics, each of which consisted of (usually) 1000 “records” requiring evaluation. What constituted a record changed with the corpus: in years 2 and 3, a Medline record with title, abstract, authors and MeSH terms was presented to judges. In years 4 and 5, a record was a paragraph from a full text article in HTML format, i.e. any block of text between the HTML paragraph tags <P> and </P>, including bibliographies and other article sections, e.g. keyword lists. This section describes the judging responsibilities, including the guidance judges received to create consistent annotation for the gold standard.
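As a sketch of how such paragraph-level records can be produced (this is not the track's actual tooling; the class and the example HTML below are hypothetical), the following splits a full-text HTML article into records delimited by the paragraph tags:

```python
# A minimal sketch, assuming records are defined by <P>...</P> blocks as in
# years 4 and 5; not the track's actual corpus-preparation code.
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []      # list of (paragraph_id, text) tuples
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buffer).strip()
            if text:                       # keep every block, including references
                self.paragraphs.append((len(self.paragraphs), text))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

# Hypothetical usage on a fragment of full text:
html = "<p>Prnp mutations alter disease susceptibility.</p><p>See ref 12.</p>"
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)   # [(0, 'Prnp mutations alter...'), (1, 'See ref 12.')]
```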

Table 2 summarizes tasks and judging resources from years 2–5. All years involved assessing the relevance of each record (Sect. 3.1). Two additional responsibilities were introduced in year 4: isolating the minimum portion of text that answered the question from the paragraph “records” (Sect. 3.2), and applying controlled vocabularies to code the relevant text excerpts to capture answer diversity (Sect. 3.3). No training was provided in years 2 and 3, but as the organizers worked with the new corpus in year 4 to test isolating minimum spans and controlled vocabulary assignments, it became clear that more formal training was required to acquaint judges with vocabularies and term selection, and their corresponding search and browsing tools.

Table 2 Overview of the TREC Genomics track ad hoc task and judging responsibilities

3.1 Determining relevance

The official judging process began when judges received their first topics, which were randomly assigned without regard to judges’ areas of expertise (with one exception, when a judge specifically requested topics in her field). As a first step in evaluation of the ad hoc tasks in years 2–5, judges were instructed to determine the relevance of a record by satisfying two essential criteria: firstly, the record had to be on topic, and secondly, it had to answer the question posed in the topic. The former involves identifying key elements of the topic and comparing them to each record. A record is on topic when it contains all of the key elements, whether they are physical entities such as genes or cell types, or conceptual entities, like molecular functions or diseases. Once key elements are defined for records classified as Definitely Relevant, criteria must also be defined to limit which related entities count as Possibly Relevant. The distinction between Definitely and Possibly Relevant requires an understanding of concepts that are related to, but not the same as, the key elements, such as the biology around a mouse homolog of a human gene when the topic is about the human gene. This often involves hypernym, hyponym and part-of relationships, and it can be difficult to provide complete guidance that covers all potential cases, thereby precluding preparation of representative judgments for each topic. Interpretation of the question also affects relevance, as illustrated by one of the Generic Topic Types from years 3 and 4, “What is the role of [GENE] in [DISEASE]?” It was left to judges to decide what constituted a “role”, whether that role should be limited to the gene itself, or whether it included molecular functions carried out by the gene product. The subjectivity inherent in the GTTs led to more specific questions in year 5, where, for instance, genes and proteins were distinct entities and not subject to interpretation.

Table 3 illustrates an example of Definitely vs. Possibly vs. Not Relevant judgments for a sample topic taken from the year 5 judging guidelines.

Table 3 Excerpts and reasoning for relevance judgments

Once a record was deemed on topic, judges were asked to determine whether it answered the question. Given the 2005/2006 topic “What is the role of Prnp in prion diseases?”, a statement such as “Prnp plays a role in prion diseases” contains all the essential elements but fails to answer the question.

3.2 Defining minimal portions of text

For years 4 and 5, when retrieval shifted from the document level to the paragraph, or passage, level, systems were asked to return paragraphs (or portions of paragraphs) from full text journal articles that contained an answer to a topic question, and they were rewarded for minimizing the amount of text beyond that which contained the answer, analogous to the “focused strategy” task used in the Initiative for the Evaluation of XML (INEX) challenge (http://inex.is.informatik.uni-duisburg.de/2006/). To limit the amount of text that had to be assessed, candidate passages could be no more than one paragraph, as defined by the HTML paragraph delimiters. From an evaluation perspective, there are two options for presenting the retrieved text to judges. One option is to present the extracted text produced by the system, but because systems might differ in how they define the start and end of answer text, this method could produce excerpts from the same paragraph of a full text paper differing only in characters at the excerpt boundaries, making judging more error-prone and labor-intensive. The second option, used in years 4 and 5, is to match excerpts produced by systems to their corresponding full paragraphs, pool the paragraphs to create a non-redundant set, and let the judges determine excerpt boundaries from the complete paragraphs and score accordingly (for more about evaluation scores, see the 2006 and 2007 TREC overview papers).
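A minimal sketch of this pooling step is shown below; the data structures (document identifier plus character offsets for system excerpts, and a per-document paragraph index) are assumptions for illustration, not the track's actual pooling code.

```python
# Sketch of pooling system-returned excerpts back to full paragraphs so that
# judges see each paragraph only once (years 4 and 5); assumed data structures.

def pool_excerpts(excerpts, paragraph_index):
    """excerpts: iterable of (doc_id, char_start, char_end) tuples returned by
    participating systems. paragraph_index: maps doc_id to a list of
    (para_id, para_start, para_end) spans. Returns the de-duplicated set of
    (doc_id, para_id) pairs to be judged."""
    pooled = set()
    for doc_id, start, end in excerpts:
        for para_id, p_start, p_end in paragraph_index.get(doc_id, []):
            # an excerpt belongs to the paragraph that contains its start offset
            if p_start <= start < p_end:
                pooled.add((doc_id, para_id))
                break
    return pooled

# Hypothetical example: two systems nominate overlapping excerpts from the same
# paragraph; the pool contains that paragraph only once.
paragraph_index = {"PMID123": [(0, 0, 500), (1, 500, 1200)]}
excerpts = [("PMID123", 520, 700), ("PMID123", 600, 810), ("PMID123", 40, 90)]
print(pool_excerpts(excerpts, paragraph_index))
# {('PMID123', 0), ('PMID123', 1)}  (set order may vary)
```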

The key elements used to determine relevance in the previous section were re-applied to identify the boundaries of the minimal span. Judges were instructed to include non-specific pronouns (“it” or “they”) or non-specific nouns (“the disease”, “this complex”) only if the antecedent was present in the same paragraph. Examples are shown in Table 4. Although this made for longer minimal excerpts, such stringent rules about a complete answer made it easier for a third party (the meta-judge, i.e. the judge who judges the judges) to evaluate the excerpts without consulting the full text. Also, it made the passage “stand on its own”, just as it would have to in a deployed IR system. If the antecedent was in a separate paragraph, the excerpt would be judged as not relevant despite the existence of a complete answer in the document, since a complete answer was not contained in the retrieved paragraph. Abbreviations frequently used in a specialized area of research, such as “LOAD” (Table 4, third example), served as adequate representations of key elements, and did not require supplemental text to explain their origin.

Table 4 Examples from year 4 guidelines to illustrate selection of minimal answer text

3.3 Assigning controlled vocabulary terms (aspects)

In addition to being rewarded for producing minimal excerpts containing answers, groups were also scored on the variety of distinct answers they provided to topic questions. Two approaches were used to uniformly describe similarities and differences among answers. In year 4, judges were instructed in how to codify relevant passages using terms from the Medical Subject Headings (MeSH) vocabulary (see Sect. 4 for more about training), which contains a hierarchical set of terms describing anatomy, diseases, disease treatments, enzymes, genetic variation, and experimental methods, among other subjects (Lipscomb 2000). The breadth of MeSH coverage provided judges with a number of avenues for relating answers to one another. Using examples from year 4, Table 5 illustrates how answers can differ. The first excerpt describes how genetic mutations in the gene encoding PrnP affect disease susceptibility, whereas the second excerpt describes changes to the PrnP protein found in diseased tissue. In this example, MeSH terms that captured the theme of each excerpt were available for selection.

Table 5 MeSH terms applied to excerpts as performed in year 4

In year 5, topics were designed to generate a list of items belonging to a single entity class, such as genes, proteins, diseases, mutations, or toxicities (Fig. 1c; see the 2007 track overview for the complete list of fourteen entity classes). Judges were directed to controlled vocabularies as examples of an entity class (where such vocabularies existed), but they were not required to use them.

3.4 Judging interface

A number of features were incorporated into the judging interface to facilitate data collection and post-processing (Table 6). In years 2 and 3, the interface provided all the information available for retrieval, including title, authors, abstract and MeSH terms, as well as radio buttons for entering the relevance judgment, and a check box for marking the record for later review. In years 4 and 5, the interface displayed the nominated paragraph, a document identifier, a paragraph identifier, a drop-down menu for noting relevance, a link to the full text document, and columns for entering minimal excerpts and aspects.

Table 6 Overview of judging interfaces and features

For years 2 and 3, a simple database built using Microsoft Access sufficed to present records and accept relevance judgments. All judges were PC users, removing platform independence as an issue. The shift in tasks in year 4 required a different kind of judging interface with more fields for entering data, and the larger pool of judges introduced the need for platform independence. A link to the original document allowed judges unfamiliar with the topic to refer to introductory sections or references if needed. The links were frequently used, as evidenced by the fact that judges notified the organizers whenever a link did not work. The additional tasks performed on relevant records made filtering and sorting crucial for consistent annotation. For instance, sorting by document identifier allowed judges to review multiple paragraphs from the same document together.

On rare occasions, passages contained more than one minimal excerpt that answered the question. In year 4, judges were allowed to duplicate records (which for this year meant duplicating a row in a spreadsheet) and enter their annotation accordingly. This process was error-prone for the judges and frequently corrupted the data, interfering with post-processing. Because of this, the ability to duplicate records was abandoned in year 5, and if two minimal spans were present in a paragraph, judges were instructed to capture both spans, including the intervening, non-relevant text. Lastly, the OpenOffice Base application was equipped with reporting features that detected cases of conflicting annotation (e.g. irrelevant records with controlled vocabulary terms), freeing organizers from the task of identifying obvious errors.
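The sketch below illustrates the kind of consistency check those reports performed; it is not the actual Base reporting logic, and the row format is an assumption for illustration.

```python
# Sketch of a consistency check over judged records (assumed row format);
# not the actual OpenOffice Base reports.

def find_conflicts(rows):
    """rows: iterable of dicts with keys 'record_id', 'relevance'
    ('DEFINITELY', 'POSSIBLY' or 'NOT'), 'excerpt' and 'aspects'.
    Returns record ids whose annotation is internally inconsistent."""
    conflicts = []
    for row in rows:
        relevant = row["relevance"] in ("DEFINITELY", "POSSIBLY")
        if not relevant and (row["excerpt"] or row["aspects"]):
            # irrelevant records must not carry excerpts or aspect terms
            conflicts.append(row["record_id"])
        elif relevant and not row["excerpt"]:
            # relevant records must have a minimal answer excerpt
            conflicts.append(row["record_id"])
    return conflicts

rows = [
    {"record_id": 1, "relevance": "NOT", "excerpt": "", "aspects": ["mutation"]},
    {"record_id": 2, "relevance": "DEFINITELY", "excerpt": "Prnp codon 129 ...",
     "aspects": ["genetic variation"]},
]
print(find_conflicts(rows))   # [1]
```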

3.5 Reviewing judges’ work

Biocuration frequently employs secondary checks on annotations to catch unintended errors (Burkhardt et al. 2006; Salimi and Vita 2006). The TREC Genomics track also benefited from a central moderator (the meta-judge) in two ways: firstly, frequently occurring errors caught by the moderator could be addressed and used to influence future task design. Secondly, as specific cases arose when a judge was evaluating a topic, s/he could consult the central moderator for advice on how to best apply the guidelines. Nearly every topic evaluated prompted a question from its evaluator, indicating the utility of this role.

In years 2 and 3, because of the simplicity of the judging responsibilities, completed topics were reviewed only for missing annotation, i.e. records that had not received a relevance judgment. The gold standard test collections generated in these years contained inconsistent annotation, which was detected by some of the participating teams during subsequent failure analysis of their systems (W. Xu, personal communication). Although no quantitative error analysis was performed, a large source of error was inconsistent application of relevance criteria, a problem noted in other studies of expert evaluation (Hripcsak and Wilcox 2002; Dingare et al. 2005). To address this, the written guidelines prepared in years 4 and 5 contained extensive direction and examples for setting criteria.

In year 4, one of the most frequent problems detected by independent review was a lack of familiarity with MeSH, which resulted in missed or misapplied aspects. Consistent application of MeSH was also an issue: for instance, if MeSH terms about experimental methods were assigned to some relevant excerpts, they needed to be assigned to all excerpts containing similar information. This led us to abandon MeSH for annotating answers to user queries, and we decided to let each judge create their own set of topic-specific aspect labels. While this complicated comparing controlled vocabularies from different judges for the same topic, it did result in judges assigning aspects in a more sensible, discriminatory, consistent and usable manner within an individual topic.

3.6 Inter-annotator agreement

Conceivably, lessons learned from year to year, combined with increasingly refined guidelines, should have resulted in greater consistency in judgments over time, independent of the judge. Inter-annotator agreement was measured in each year that judges participated (Table 7; see the track overview papers for more detail). The kappa measure of inter-annotator relevance agreement was largely unchanged. Possible reasons are discussed in Sect. 5.1.
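For reference, the kappa statistic reported in Table 7 corrects raw agreement for the agreement expected by chance; in its standard two-judge (Cohen) form, which we assume here, it is defined as:

```latex
% Cohen's kappa for two judges
\kappa = \frac{p_o - p_e}{1 - p_e}
% p_o: observed proportion of records on which the two judges agree
% p_e: agreement expected by chance, computed from each judge's marginal
%      distribution of Definitely/Possibly/Not Relevant judgments
```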

Table 7 Inter-annotator agreement scores for abstract or passage relevance

3.7 Impact of task design on resources

Four years of relevance assessment by domain experts makes it possible to evaluate the impact of task complexity on the resources required to generate a gold-standard set of documents and assessments. The TREC Genomics track covered 134 different topics over four years, resulting in more than 162 topic evaluations, plus an additional 37 topics that were judged at least twice to measure the reproducibility of relevance assessment. Most topics required review of 1000 records (with some exceptions in years 2 and 3).

Figure 2 shows the number of topics completed by each judge for year 4 (Fig. 2a) and year 5 (Fig. 2b). In year 5, for instance, 12 judges together completed 16 topics, whereas 6 high-performing judges together completed 35 topics. From a resourcing perspective, one can expect most of the judges to complete only a few topics, and a subset of judges to do the bulk of the work. This probably reflects differences in time constraints, motivation to earn extra money (judges were paid by the hour) and interest in this kind of work, which requires attention to detail and tolerance for the repetitive nature of reviewing thousands of scientific articles.

Fig. 2
figure 2

Number of topics completed by each judge: a Year 4 and b Year 5

Judges also appear to be the primary influence on how long it took to evaluate a topic. Presumably, a topic with more relevant records would take more time to evaluate than one with few relevant records if the extra steps of isolating minimum spans and assigning controlled vocabulary terms were time-consuming. Figure 3 largely disputes this hypothesis. The time spent per topic is plotted as a function of the number of relevant records found for that topic. There is a tendency for topics with fewer relevant records to take less time, but the correlation between time per topic and relevant records per topic is weak (R² = 0.2097). For topics with few relevant records, there is considerable variation in time per topic, which could be due to variation among the individual judges. Alternatively, some topics may be harder to evaluate than others, but Table 8 shows that when the same topic was evaluated by more than one judge, the time spent could vary up to 3-fold (e.g. topic 200, Table 8), suggesting that differences among judges are the most significant determinant of judging time. Whether these variations are due to differences in domain expertise or familiarity with the judging process is not clear from our data. This effect may also make it difficult to predict the cost of judging a task based on an hourly judging rate.
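As an illustration of the analysis behind Fig. 3 (the per-topic figures below are hypothetical, not the track's data), a least-squares fit and its coefficient of determination can be computed as follows:

```python
# Sketch of the Fig. 3 analysis with hypothetical per-topic data:
# hours spent judging versus number of relevant records found.
import numpy as np

relevant_records = np.array([12, 40, 75, 110, 150, 220, 300, 330])  # hypothetical
hours_per_topic  = np.array([25, 12, 30, 14, 22, 35, 18, 28])       # hypothetical

# least-squares fit and coefficient of determination (R^2)
slope, intercept = np.polyfit(relevant_records, hours_per_topic, 1)
predicted = slope * relevant_records + intercept
ss_res = np.sum((hours_per_topic - predicted) ** 2)
ss_tot = np.sum((hours_per_topic - hours_per_topic.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# The value depends entirely on the data; the track's real data gave R^2 = 0.2097.
print(round(float(r_squared), 4))
```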

Fig. 3
figure 3

Time spent judging topics as a function of number of relevant records. The time spent on each topic in years 4 and 5 was plotted against the total number of Definitely and Possibly Relevant records in that topic

Table 8 Number of hours per year 5 topic for topics judged more than once

4 Training as a predictor of judge productivity

Two modes of training were used to prepare judges for evaluation. First, judges were instructed to go through a set of guidelines prepared by the track organizers (with feedback from the steering committee), followed by a mandatory conference call to review the guidelines. In year 5, a self-training alternative was added. Using results from the previous year, a “mini-topic” was created consisting of ten definitely relevant, ten possibly relevant and ten not relevant paragraphs. About one-third of the judges completed the training topic; the rest were trained by conference call. Some chose both modes of training.

Successful completion of the mini-topic ended up being a strong predictor of prolific judges who required few corrections following independent review. Of the six judges in year 5 who successfully completed the mini-topic, four completed five or more topics.

5 Discussion

The TREC Genomics track featured ad hoc tasks that simulated a variety of user needs. Changes to the document collection caused the tasks to change, which in turn altered the evaluation process by adding judging duties beyond relevance assessment. The aim of these changes was to make evaluation of full text biomedical information retrieval systems tractable, and to lend insight into system performance (via capturing answer diversity with aspects) without impacting evaluation consistency or making the judging task more arduous than absolutely necessary. Although these latter objectives were not necessarily met, the lessons learned can be used to inform future challenges. We discuss here how topics, judging duties, and the judges themselves affected generation of the gold standard test collections, and offer suggestions for organizing future challenges that involve human expert judgment.

5.1 Factors influencing inter-annotator agreement

Inter-annotator agreement provides a means of assessing the quality of the gold standard, assuming that agreement among annotators reflects a consistent approach for segregating relevant from irrelevant material. Other challenges have used a number of methods to boost inter-annotator agreement, some of which were adopted by the TREC Genomics track. For the 2002 KDD Cup and year 1 of the Genomics track, assessment was performed using existing curation in FlyBase and Entrez Gene, respectively (Hersh and Bhupatiraju 2003; Yeh et al. 2003). Task 1a of the first BioCreAtIvE challenge took a multi-pronged approach by developing guidelines for its three assessors, performing three-way comparisons of replicated judgments, and using pooled results from participants to uncover potential false negatives and false positives (Colosimo et al. 2005). For judging duties that went beyond relevance assessment, the TREC Genomics track employed guidelines, training sessions, scaled-down “mini-topics”, and moderated assessment by an experienced former judge. Nonetheless, inter-annotator agreement remained steady at ~0.6, a fair level of agreement. Possible factors include the freedom afforded to judges to set their own relevance criteria, differences in domain expertise and controlled vocabulary experience among judges, and frequent modification of track tasks.

Studies of evaluation quality have shown that intra-assessor consistency on a given topic has a greater impact on the usefulness of a gold standard for system evaluation than do inter-judge differences (such as those measured by inter-rater agreement) (Voorhees 2000). Inter-annotator agreement in the biomedical realm frequently scores below agreement measures in mainstream domains such as newswire (Hirschman et al. 2002). The breadth and complexity of topics in the TREC Genomics track precluded articulating specific rules for possible relevance for every information need. When overseeing the judging process, we allowed judges to determine their own criteria for relevance, paying more attention to consistent application of relevance criteria than to the criteria themselves, which judges were asked to articulate only if their answers did not make the criteria readily apparent. This probably contributed to the “fair” kappa score for inter-annotator agreement described in Sect. 3.6. Ideally, each topic would be judged by at least three judges using a consensus-driven process, which would provide maximum consistency throughout the entire process of gold standard creation (Hripcsak and Wilcox 2002; Colosimo et al. 2005). The time scale of TREC and the resources required for a consensus approach made this impractical.

Domain expertise was an absolute requirement when enlisting judges for the TREC Genomics track. In studies of characteristics that influence relevance determination, familiarity with the subject matter frequently ranks high, suggesting that domain expertise can influence what constitutes a relevant document (Borlund 2003; Xu and Chen 2006). Of the 27 judges who performed evaluations (three judges participated in two successive years), the majority held a Ph.D. in the biological sciences or an M.D. degree; others held bachelor’s or master’s degrees in biology or bioinformatics. Some judges also had database curation experience, but most were unfamiliar with information retrieval and with applying controlled vocabularies. The complexity of the judging process should not have impacted inter-annotator agreement at the document or passage level, which was a straightforward relevance determination. Minimal excerpt selection and aspect assignment, however, are more likely to suffer, but we do not have enough year-over-year data to quantify this.

Changes to the ad hoc task impacted a number of features of the relevance assessment process, including the number of times a record was reviewed. Reviewing 1000 paragraphs about a given topic increases familiarity with that topic, introducing a possible source of inconsistent relevance assessment between the first records judged and the last. In years 2 and 3, simply assigning relevance required that judges visit each record once, and any return views were purely voluntary. In years 4 and 5, judges were advised to perform relevance judgments first and then return to relevant records to group them and assign aspects, resulting in at least two visits to each potentially relevant passage. Even if judges created and assigned aspects as they evaluated each record, they still had to deconstruct excerpts in order to codify them with aspects.

It was difficult to determine a priori which parts of the evaluation process in years 4 and 5 would be difficult for judges to grasp and how much emphasis each should receive in the guidelines. Based on feedback from the judges, working with MeSH in year 4 proved to be very difficult, including finding the right terms to use as aspects and applying them consistently. That motivated us to allow judges to develop their own vocabularies in year 5, which worked out well. For instance, the training mini-topic asked for a list of proteins. One would naïvely assume that the Uniprot or Entrez Gene protein/gene thesauri, which organize information at the level of individual proteins and genes, would be a good source of vocabulary terms. Several papers, however, did not refer to individual proteins; instead, a multi-subunit complex constituted a correct answer. Judges readily selected the complex as a term, leading to a more accurate “answer” to the topic question than would have been possible using popular controlled vocabularies.

Interestingly, allowing judges in year 5 to create their own controlled vocabulary specific to the topic question proved much easier for them than trying to use an unfamiliar controlled vocabulary such as MeSH. No judge chose to use an existing vocabulary (although many were suggested during training), which would have required them to map the aspects they wished to assign onto that vocabulary’s terms. In some cases, a sufficient controlled vocabulary did not exist (e.g. a set of pathway names). In other cases, the structure of the vocabulary did not match the user needs, as in the protein complex scenario discussed above. The judges’ experiences suggest that controlled vocabularies remain unfamiliar to biomedical researchers, and that the vocabularies may not be appropriate for all tasks in their domain. Maintaining the end-user perspective when designing tools that use controlled vocabularies, such as providing a means of choosing the appropriate vocabulary and support for browsing it, will continue to be important.

The custom vocabularies had another unexpected benefit that simplified review by the meta-judge. Typically, the aspect vocabulary was extracted by the judge from the literature under review, and was therefore specific to the topic. Since the terms selected by judges were frequently taken directly from the record or nearly matched the record text, it was easier for the meta-judge to find each term in context and evaluate its appropriateness. A term from a standard controlled vocabulary may not match the text precisely, complicating interpretation of the annotation; this phenomenon has been cited as a factor that affects the utility of annotated biomedical corpora (Cohen et al. 2005).

5.2 Staffing a challenge in the biomedical domain

Judges were asked to devote ten hours per week to relevance assessment. If each topic took an average of 15 h, judges should have been able to complete at least five topics in the 8 week judging period. From the statistics in Fig. 2, it was apparent that over half the judges devoted less than 10 h per week. We observed that the mini-topic self-training approach was a predictor of productive judges. In future challenges, a screening mechanism such as this might enrich the judging pool with high-performing judges. Once the number of high-performing judges can be estimated, better predictions can be made about how large a gold standard can be constructed within a given amount of time.
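As a rough illustration of such a prediction (the figures and the split between high-performing and other judges are assumptions drawn from the text and Fig. 2, not official track parameters), the following sketch estimates how many topics a given judging pool could complete:

```python
# Back-of-the-envelope resourcing sketch. Assumed figures: ~15 h per topic,
# an 8 week judging period, 10 h/week for high-performing judges and less for
# the rest; these are illustrative assumptions, not track specifications.

def topics_completed(n_high_performers, n_other_judges,
                     hours_per_week_high=10, hours_per_week_other=4,
                     weeks=8, hours_per_topic=15):
    total_hours = weeks * (n_high_performers * hours_per_week_high
                           + n_other_judges * hours_per_week_other)
    return total_hours // hours_per_topic

# Hypothetical pool resembling year 5: 6 high performers and 12 other judges.
print(topics_completed(6, 12))   # 57 topics under these assumptions
```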

Our experience showed that employing more judges required an increasingly platform-independent judging interface. Judges needed a stable work environment that facilitated entering judgments, saving minimum spans, and assigning aspects. They also wanted to be able to work whether or not they were connected to the internet. Organizers needed a system that prevented corruption of key fields and alerted judges to errors in relevance or aspect assignment, which aided efficient post-processing of the assessments. Furthermore, the application had to be widely available and easy to install to minimize time spent on non-judging activities. OpenOffice Base met most of these criteria, and all but one judge was able to install the software. However, nearly every judge lost data at some point when Base crashed, often losing hundreds of assessments. This instability was the single worst problem with using OpenOffice Base, and was loosely correlated with machines that had less memory. The time spent per topic in Fig. 3 includes time spent repeating lost judgments. While the judges were compensated for this extra time, it certainly added to the frustration of an already exacting task. Aside from the crashing problem, the overall experience with Base was positive. If future revisions of OpenOffice Base can improve its stability, it could be a very useful general tool for cross-platform collection of relevance judgments.

6 Conclusions and future directions

The TREC Genomics track sought faithful representation of real user information needs, with less consideration for the capabilities of current systems. Operating over a five-year span allowed the tasks and topics to evolve into an increasingly accurate model of user information needs and to include a variety of textual sources, including both title-plus-abstract records and full text. The downside of frequent task changes was the inability to refine methods in an effort to improve inter-annotator agreement and test collection quality. It was also difficult to judge year-to-year increases in overall system performance. The results across the years of TREC Genomics show that systems were able to adapt and at least maintain performance measures while the tasks and evaluation criteria were becoming more specific and, presumably, more difficult. Task evolution was in part driven by feedback from the relevance judges, who represented the intended end users of the information retrieval systems being tested. For instance, capturing answer diversity in years 4 and 5 arose from an analysis of relevant excerpts that answered a topic question in year 3.

Future work in IR system evaluation for the biomedical domain must continue this trend of evaluating systems within frameworks that model real-world user information needs with increasing accuracy. This includes specifying the desired output (such as list answers), the intended end-user audience (e.g., bench scientists versus clinicians), and the scenarios in which the systems are evaluated. While most evaluations to date have been performed with a fixed corpus and set of information needs, interactive evaluations with a well-defined population of users are becoming more feasible and represent a way to demonstrate real-world effectiveness for a variety of biomedical information retrieval tasks.