
1 API Interoperability

Workflows for data analysis are increasingly employing cloud-based, Web-friendly application programming interfaces (APIs). Thousands of Web tools and APIs are available through Web API registries such as ProgrammableWebFootnote 1, BioCatalogueFootnote 2 (specifically for life science data), and cloud platforms such as GalaxyFootnote 3. However, sifting through these and other API repositories to define a linkable toolset pertinent to the workflow is challenging. Discovering relevant APIs often requires a precise combination of matching keywords; and once discovered, the API outputs must be examined to determine whether or not they can be connected together. This task is made more difficult since, in general, there is a lack of rich metadata that precisely describe the APIs, their services, and the data on which they operate. Improvements to this task have been achieved slowly since authoring rich metadata is seen as tedious and unrewarding. Providing easy methods for Web API annotation that integrate with shared terminologies could ease this perception and foster a more discoverable environment for API repositories. In turn, users could more precisely find linkable services that meet their functional requirements as the number of APIs grows.

The problem of authoring coherent, comprehensive, and structured API metadata is gaining attention as a pressing matter due to the aforementioned demand and a lack of work in this area that has fully addressed the issues described. The API metadata problem requires an end-to-end solution – from the specification of metadata elements, to developing metadata templates, to filling out such templates using ontology-based terms, to offering developer-friendly solutions to augment API results. All of this needs to occur in a manner that facilitates discovery, exploration, and reuse. While this problem is admittedly large and complex, our objective here is to carve out specific elements and pilot a lightweight software system for the annotation, discovery, and reuse of what we call smart Web APIs. Our approach is innovative because we address first the problem of API metadata authoring, a task generally disliked in the field, by making it easier to generate useful metadata, and by demonstrating concrete benefits of semantic metadata to developers and users alike.

The overall aim of this project is to undertake a pilot effort that investigates the use of semantic technologies such as ontologies and Linked Data for the annotation, discovery, and reuse of APIs. Linked DataFootnote 4 involves the creation of typed links between data from different sources on the Web. It has the properties of being machine-readable, having a meaning that is explicitly defined, being linked to other datasets external to itself, and being able to be linked to from external datasets [4]. The Linked Data principles define the use of Web technologies to establish data-level links among diverse data sources. Linked Data is very useful in cases where exchanges of heterogeneous data are required between distributed systems [2]. Web services can similarly benefit from these principles to facilitate integration and composition [20]. The smartAPI specification employs Linked Data with the aim to connect diverse data sources in pursuit of improved API discovery, interoperability, and reuse.

Our main objective is to develop and evaluate a lightweight software system for the discovery and reuse of smart Web APIs. Smart Web APIs have the advantages that they (i) are easier to discover due to rich semantic annotations, (ii) can be readily connected together without additional data wrangling, and (iii) eliminate data silos by providing Linked Data. Our proposed system will consist of two key components: (a) a coordinated facility for the intelligent annotation of smart Web APIs; and (b) an application to discover smart APIs and how they connect to each other. Essentially, smartAPI helps make APIs FAIR [26]: Findable with the API metadata and the registry; Accessible with the detailed API operations metadata; Interoperable with the responseDataType metadata (profiler); and Reusable with the access to existing APIs stored in an open repository. This work is done as part of the NIH Commons Big Data to Knowledge (BD2K) API Interoperability Working Group (WG)Footnote 5 and is available at http://smart-api.info/.

This project has four main contributions:

  • Development of the smartAPI metadata specification, based on the results of a survey of API metadata guidelines and metadata-in-use in API repositories (Sect. 4).

  • Development of an intelligent tool that supports the composition and validation of API metadata that conforms to the smartAPI specification (Sect. 5).

  • Development of a profiler that automatically annotates the API response data with semantic identifiers (Sect. 5).

  • Development of a repository and smartAPI-conformant API to submit, search, and browse API descriptions (Sect. 5) and obtain field-specific metadata suggestions.

We list the different use cases and projects, specifically in the biomedical domain, that are actively participating in the API Interoperability WG and in the process of annotating (or plan to annotate) their APIs using the smartAPI specification (Sect. 6). We then conclude with a discussion on the future direction of this work in Sect. 7.

2 Related Work

Currently, there exist several challenges in finding relevant APIs as well as reusing those APIs. We discuss both of these challenges in this section. Also, when discussing the annotation and description of Web APIs, we need to distinguish two main groups that interact with these APIs [24]. First, there are annotations targeted at developers, with the main aim of facilitating development. Second, there are efforts to describe Web APIs in such a way that automated clients can access and compose them. In this section, we will provide a brief overview of both kinds of annotations.

2.1 Challenges of Finding APIs

Finding relevant APIs is a challenging task for developers for several reasons. Extensive collections of useful and representative code and data are still lacking [10], even as the rapid proliferation of APIs makes it difficult for individual developers and users to discover the resources relevant to them [22]. The most visible and accessible APIs are often those that are currently most used, relegating newer and potentially more useful, but less popular, APIs to obscurity [22]. Application frameworks and software libraries often lack proper documentation [9, 21], and more sophisticated algorithms need to be developed to facilitate the identification of useful resources [10]. The discovery of relevant APIs can be facilitated by providing rich metadata that describe APIs and the services and data associated with them. The smartAPI initiative contributes toward improved discoverability by providing methods that permit simple and intuitive annotation of Web APIs and that are integrated with standard ontologies.

2.2 Challenges of Reusing APIs

Reuse in the context of Web APIs can mean multiple things [24]. First, an API itself is a means to enable reuse of the functionality offered by a certain server. Second, the client-side code for interacting with an API can be (partially) reused across applications. Third, the interface of an API – independent of its implementation – can be (partially) reused by other servers, as is the case with standardized APIs. This third form of reuse is unfortunately rare, since many Web APIs are designed from scratch. The resulting heterogeneity leads to a steep learning curve for the integration of existing Web APIs in applications [10, 24], which is the fourth and most common meaning of “Web API reuse”. The smartAPI initiative aims to tackle this challenge by developing a profiler that features automatic annotation of API response data. This profiler is integrated with the smartAPI editor to facilitate the semantic annotation of APIs. These features enhance reusability as well as interoperability.

2.3 Annotations for Developers

The XML-based Web Service Description Language (WSDL) provided one of the first models to describe Web services [5, 6]. However, WSDL only provides the mechanisms to characterize the technical implementation of Web services; it does not provide the means to capture the functionality of a service. Furthermore, the module source code is generated automatically from a WSDL description and then compiled into a larger program. If the description later changes, the program no longer works, even if the change leaves the functionality intact. This prevents WSDL from being used for automatic service discovery at runtime. In addition, WSDL is limited by proprietary, vendor-specific implementations that are bound to a specific programming language. SwaggerFootnote 6, on the other hand, provides an editor for authoring HTTP API documents, and is widely used by API developersFootnote 7. Swagger uses the OpenAPI specificationFootnote 8, which defines a standard, language-agnostic interface to HTTP APIs. However, each API developer annotates their own API in isolation, which results in less interoperable and reusable APIs. The current Web API landscape is hindered by the problem of scalability, as every API requires its own hardcoded clients, which only benefits the developers. In particular, on the current Web there is a one-to-many relationship between Web APIs and clients: a single API often has clients for one or more programming languages, but none of these clients work with other APIs. As such, individual clients do not scale with the number of APIs. This makes each API unusually short-lived with a tightly coupled relationship of highly subjective quality. This directly leads to increased development costs and prevents the design of a more intelligent generation of clients that provide cross-API compatibility [24]. Annotating APIs is an important step in making them accessible for more generic clients.

2.4 Descriptions for Automated Clients

Many approaches for service description exist with different underlying service models. OWL-S [18] and WSMO [16] are the most well-known Semantic Web Service description paradigms. They both allow the description of high-level semantics of services whose message format is WSDL [7]. Though extension to other message formats is possible, this is rarely seen in practice. Semantic Annotations for WSDL (SAWSDL [14]) aim to provide a more lightweight approach for bringing semantics to WSDL services. Composition of Semantic Web Services has been well documented, but all approaches focus on Remote Procedure Call (RPC) interactions and require specific software [19].

In recent years, several description formats for the more lightweight Web APIs have emerged [25]. Several methods aim to enhance existing technologies to deliver annotations of Web APIs. HTML for RESTful Services (hRESTS, [12]) is a microformats extension to annotate HTML descriptions of Web APIs in a machine-processable way. SA-REST [8] provides an extension of hRESTS that describes other facets such as data formats and programming language bindings. MicroWSMO [13, 17], an extension to SAWSDL that enables the annotation of RESTful services, supports the discovery, composition, and invocation of Web APIs, but requires additional software.

The description of hypermedia APIs is a relatively new field. Hydra [15] is a vocabulary to support API descriptions, but does not directly support automated composition. RESTdesc [23] is a description format for hypermedia APIs that describes them in terms of resources and links. The Resource Linking Language (ReLL, [1]) features media types, resource types, and link types as priorities for description.

With our smartAPI specification, we build upon the existing, widely used OpenAPI specification to provide richer metadata that precisely describes the APIs, their services, the data on which they operate, and the data they return. Our smartAPI editor, which is an extension of the popular Swagger editor, makes it easier to generate useful metadata and indicates which terms are most widely used to annotate Web APIs. The editor also suggests metadata elements and values, along with their usage frequency, to the next API provider while they annotate their API. Furthermore, the smartAPI profiler (cf. Sect. 5), integrated within the editor, provides automatic annotation of the API response data. Finally, the smartAPI registry serves as a repository to save, search, and browse the created API descriptions. Consequently, the smartAPI framework helps to make APIs FAIR.

3 Survey of API Metadata in the Wild

We conducted a survey of existing metadata repositories and specifications that describe APIs. In particular, the following eight resources were surveyed:

  • Repositories:

    • Biocatalogue [3]Footnote 9, a registry of biological Web APIs with 1,184 entries.

    • Programmable WebFootnote 10, a directory of internet-based APIs with over 15,000 API descriptions.

    • Tools & Data Services Registry [11]Footnote 11, a registry with information about analytical tools and data APIs for bioinformatics with 2,331 entries.

  • Specifications:

    • OpenAPI InitiativeFootnote 12, created by a consortium of forward-looking industry experts who recognize the immense value of standardizing how HTTP APIs are described.

    • Minimal Information About a Software (MIAS)Footnote 13, a key set of minimal fields that can provide maximum value when describing software.

    • Prototype smartAPI SpecificationFootnote 14, a specification describing semantically annotated Web APIs that facilitates discovery and reuse of Web-based APIs.

    • Semantic Automated Discovery and Integration (SADI)[27]Footnote 15, a set of design patterns defining the behavior of data retrieval and/or analysis resources that must interoperate on the Semantic Web.

    • schema.org API ReferenceFootnote 16, reference documentations for APIs as described by schema.org.

We retrieved and listed the metadata elements from each of the resources and also analyzed the degree to which each field was actually employed in practice, as measured by its frequency of usage. For instance, in the case of Programmable Web, which contains over 15,000 API descriptionsFootnote 17, all of the entries use the Title and Description fields. However, only 90% of them supply details about the API provider and the primary category to which the API belongs. Results of the survey are available at https://goo.gl/F4OLnW. Thereafter, we aggregated all the metadata elements from the full set of eight resources to produce a common list of 54 API metadata elements (as discussed in Sect. 4).

4 SmartAPI Metadata Specification

This standard is the result of a survey conducted by the NIH Commons Big Data to Knowledge (BD2K) API Interoperability Working Group of existing metadata repositories and specifications that describe APIs. The smartAPI specification implements the FAIR principles: Findable, Accessible, Interoperable, and Reusable. In particular, we aggregated all the metadata elements from the eight surveyed resources to produce a common list of 54 API metadata elements. We subsequently divided these elements into five categories:

  • API Metadata (Table 1Footnote 18): 20 elements

  • Service Provider Metadata (Table 2): 6 elements

  • API Operation Metadata (Table 3): 12 elements

  • Operation Parameter Metadata (Table 4): 10 elements

  • Operation Response Metadata (Table 5): 6 elements

The smartAPI Specification includes 21 metadata elements beyond those included in the OpenAPI specification. Examples of the 21 elements are the category to which the API belongs; metadata format and access mode at the API metadata level; the parameter type and parameter value type at the operation parameter level; and the conformance to a specified response profile at the operation response level. The metadata elements marked with a * in the tables are those specific to the smartAPI specification.

Next, we re-evaluated each of the metadata fields according to its applicability and relevance, and further determined whether each MUST, SHOULD, or MAY be included in the API description. The cardinality and datatype of metadata fields were further specified along with a description and exampleFootnote 19. The smartAPI Specification along with cardinality, datatype, and an example of each metadata element is available at https://websmartapi.github.io/smartapi_specification/.
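The MUST, SHOULD, and MAY levels lend themselves to a simple validation report. The following is a minimal sketch, not the smartAPI editor's actual validator: the `CONFORMANCE` table, the flat document layout, and the function names are assumptions made for illustration, and the field-to-level assignments are not copied from the specification.

```python
# Hypothetical conformance table: the field names and levels are
# illustrative only and are not taken from the smartAPI specification.
CONFORMANCE = {
    "title": "MUST",
    "description": "MUST",
    "contact": "MUST",
    "parameterValueType": "SHOULD",
    "responseDataType": "SHOULD",
    "termsOfService": "MAY",
}


def missing_by_level(document, conformance=CONFORMANCE):
    """Group the fields that are absent or empty in `document` by level."""
    report = {"MUST": [], "SHOULD": [], "MAY": []}
    for field, level in conformance.items():
        if field not in document or document[field] in (None, "", [], {}):
            report[level].append(field)
    return report


def is_valid(document, conformance=CONFORMANCE):
    """In this sketch, a document passes when no MUST field is missing."""
    return not missing_by_level(document, conformance)["MUST"]


# A hand-written, flattened stand-in for an API description.
example = {"title": "Example gene API", "description": "Gene queries"}
report = missing_by_level(example)
print(report)
print(is_valid(example))
```

Here only a missing MUST element makes the document invalid, while missing SHOULD and MAY elements are reported as suggestions; a real validator would also need to check cardinality and datatype.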

Table 1. smartAPI specification metadata elements: API metadata.
Table 2. smartAPI specification metadata elements: Service Provider Metadata.
Table 3. smartAPI specification metadata elements: API Operation Metadata.
Table 4. smartAPI specification metadata elements: API Parameter Metadata.
Table 5. smartAPI specification metadata elements: Operation Response Metadata.

5 SmartAPI Implementation

smartAPI serves both the API providers and API users. The framework consists of three modules: the editor that facilitates the API metadata authoring for API providers; the searchable API registry where the created API documents are stored and indexed; and the profiler that annotates the API response data.

The smartAPI editor is an extension of the Swagger editorFootnote 20, which is widely used by API providers. The Swagger editor uses the OpenAPI specification and provides a framework for creating interactive HTTP API documentation. First, we extended the OpenAPI specification JSON file to incorporate the newly added smartAPI metadata. We then extended the auto-completion functionality of the Swagger editor so that it suggests not only the list of predefined metadata and values, but also the values retrieved from the indexed API documents previously created and saved in the registryFootnote 21, along with the frequency of their usage. Every new API document added to the registry is indexed using the Elasticsearch search engineFootnote 22, and its metadata elements and values, along with their usage frequency, are suggested to the next API provider (Fig. 1b). The conformance level (Required, Recommended, or Optional) of the suggested metadata element is also provided (Fig. 1a).
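The frequency-ranked suggestions described above can be illustrated with a small in-memory stand-in. This sketch does not use Elasticsearch or reproduce the editor's code; the function `suggest_values`, the flat document layout, and the sample registry entries are all assumptions made for illustration.

```python
from collections import Counter


def suggest_values(indexed_documents, field, prefix="", limit=5):
    """Return (value, count) pairs for `field`, most frequently used first."""
    counts = Counter(
        doc[field]
        for doc in indexed_documents
        if isinstance(doc.get(field), str)
    )
    matches = [
        (value, n)
        for value, n in counts.items()
        if value.lower().startswith(prefix.lower())
    ]
    # Sort by descending count, then alphabetically for a stable order.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:limit]


# Hypothetical, already-saved API descriptions.
registry = [
    {"category": "genomics"},
    {"category": "genomics"},
    {"category": "genetics"},
    {"category": "proteomics"},
    {"title": "no category here"},
]

suggestions = suggest_values(registry, "category", prefix="gen")
print(suggestions)
```

Ranking by count surfaces the values that earlier providers used most often, which nudges new annotations toward shared terminology.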

The smartAPI profiler, shown in Fig. 2a, provides automatic annotation of the API response data, i.e. responseDataType (Fig. 2b). To do this, the API response data (e.g. http://mygene.info/v3/gene/1017) is recursively traversed to produce keypath/value pairs, where each keypath consists of one or more labels concatenated together and each value is either a single value or a list of strings. The resource annotation is obtained by comparing the keypath labels to resource names and synonyms from Identifiers.orgFootnote 23. When no match is found, an example value for the keypath is compared against resource identifier patterns from Identifiers.org, and the resulting matches are displayed as suggested annotations. Users may also add their own resource annotation if one does not exist. The annotated API response data is stored in the responseDataType element (Fig. 2b).
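The two-stage matching described above (labels first, identifier patterns as a fallback) can be sketched as follows. This is an illustrative approximation rather than the profiler's implementation: the `RESOURCES` table is a tiny hand-made stand-in for the Identifiers.org registry with simplified patterns, the dot-separated keypath format is an assumption, and the sample response is abbreviated and hand-written rather than the live MyGene.info output.

```python
import re

# Hand-made stand-in for the Identifiers.org registry (simplified patterns).
RESOURCES = {
    "ncbigene": {"names": {"ncbigene", "entrezgene"}, "pattern": r"^\d+$"},
    "uniprot": {
        "names": {"uniprot", "swiss-prot"},
        "pattern": r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$",
    },
}


def flatten(data, path=""):
    """Recursively yield (keypath, value) pairs from nested JSON-like data."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(data, list):
        if all(not isinstance(item, (dict, list)) for item in data):
            yield path, [str(item) for item in data]  # list of strings
        else:
            for item in data:
                yield from flatten(item, path)
    else:
        yield path, data


def annotate(keypath, value):
    """Suggest resource prefixes: match labels first, then value patterns."""
    labels = {label.lower() for label in keypath.split(".")}
    by_name = [p for p, r in RESOURCES.items() if r["names"] & labels]
    if by_name:
        return by_name
    example = value[0] if isinstance(value, list) and value else value
    if example is None or isinstance(example, (list, dict)):
        return []
    return [
        p for p, r in RESOURCES.items() if re.match(r["pattern"], str(example))
    ]


response = {
    "entrezgene": 1017,
    "symbol": "CDK2",
    "uniprot": {"Swiss-Prot": "P24941"},
}
pairs = dict(flatten(response))
annotations = {keypath: annotate(keypath, value) for keypath, value in pairs.items()}
print(annotations)
```

Keypaths that match neither a label nor a pattern (such as `symbol` here) receive an empty suggestion list, which corresponds to the case where the user supplies an annotation manually.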

The “parameterValueType” and “responseDataType” elements are added to the specification to semantically annotate the input (parameter) and the output (response) of the API respectively. As shown in Fig. 1b and Fig. 2b, the values of these metadata elements are semantic identifiers from identifiers.org, prefixcommonsFootnote 24, and other relevant ontologies.

The code, full documentation, and tutorial are available at https://github.com/WebsmartAPI/swagger-editor. A live demo is also availableFootnote 25.

Fig. 1. Auto-suggestion functionality for API metadata elements and values.

Fig. 2. Semantic annotation of the API response (e.g. http://mygene.info/v3/gene/1017) using the smartAPI profiler.

6 SmartAPI Use Cases

One of the main use cases in which we will examine the usefulness and usability of the smartAPI system is to find and explore connections pertaining to cardiovascular pharmacogenomics. Our use case begins with a set of genes that are differentially expressed in hypertrophic cardiomyopathy (HCM), a leading cause of death among young athletes. HCM arises from genetic defects in close to 20 different genes, although the most common forms of HCM result from mutations in genes encoding proteins of the cardiac sarcomeric apparatus. One concern is that young athletes may be increasing their risk of HCM through pharmacogenomic interactions. Our objective is to use the smartAPI platform to i) discover which, if any, differentially expressed genes in HCM are targeted by FDA-approved drugs, and ii) identify which HCM genes are also differentially expressed in other published cardiovascular studies. Information about drug targets and pharmacogenomics is already available as Linked Data, through the open source Bio2RDF projectFootnote 26. Bio2RDF provides nearly 11 billion Linked Data points from 35 life science databases including DrugBankFootnote 27 (a source of drug targets) and PharmGKBFootnote 28 (a source of pharmacogenomic interactions). Users of the smartAPI system can gain access to Bio2RDF data by following the Linked Data generated by the MyGene.info and MyVariant.info smartAPIs to Identifiers.org, which in turn will provide links to these Bio2RDF data.
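As a rough illustration of how an annotated response field becomes a followable link, the sketch below turns annotated values into Identifiers.org-style URIs. The URI layout (`http://identifiers.org/<prefix>/<id>`), the function names, and the sample record and annotations are assumptions made for this example; the sketch does not perform any network requests or query Bio2RDF.

```python
def identifiers_org_uri(prefix, local_id):
    """Build an Identifiers.org-style URI (layout assumed for illustration)."""
    return f"http://identifiers.org/{prefix}/{local_id}"


def linked_identifiers(record, annotations):
    """Map each annotated field of `record` to a list of URIs."""
    links = {}
    for field, prefix in annotations.items():
        if field not in record:
            continue
        value = record[field]
        values = value if isinstance(value, list) else [value]
        links[field] = [identifiers_org_uri(prefix, v) for v in values]
    return links


# Hypothetical API record and its responseDataType-style annotation.
record = {"entrezgene": 1017, "symbol": "CDK2"}
annotations = {"entrezgene": "ncbigene"}

links = linked_identifiers(record, annotations)
print(links)
```

Fields without an annotation (such as `symbol` here) produce no link, which is why semantic annotation of the response is a precondition for this kind of traversal.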

The API Interoperability group is a Working GroupFootnote 29 in the NIH Commons Framework project. The NIH Commons is defined as “an initiative which is essentially a shared virtual space where scientists can work with the digital objects of biomedical research, i.e., it is a system that will allow investigators to find, manage, share, use and reuse data, software, metadata and workflows.”Footnote 30. A series of Commons pilots has been initiated to develop and test these components in order to understand and evaluate how well they will contribute to an ecosystem that will effectively support and facilitate sharing and reuse of digital objects.

Below, we list the projects that are actively participating in the WG and are in the process of annotating (or plan to annotate) their APIs, specifically in the biomedical domain, using the smartAPI specification:

  • MyGene.info [28]Footnote 31 provides Web APIs for both gene queries and gene annotation retrieval. MyGene.info services are being used in Web applications that require querying genes, e.g. BioGPSFootnote 32, as well as in an analysis pipeline to retrieve regularly updated gene annotations. MyGene.info has a Swagger-based API document that was loaded into the smartAPI Swagger editor to be validated against the smartAPI specification and saved into the smartAPI registryFootnote 33. The validation process provided a list of missing required, recommended, and optional metadata elements. As a result, “contact” info was added as a required element and the “parameterType”, “parameterValueType”, and “responseDataType” were recommended. These additions semantically enrich the API document and increase its interoperability with other relevant APIs.

  • MyVariant.info [28]Footnote 34 provides simple-to-use Web APIs to query/retrieve variant annotation data, aggregated from many popular data resources. MyVariant.info was modified and saved into the smartAPI registry through the same process as MyGene.info.

  • The National Institutes of Health Library of Integrated Network Cellular Signatures (NIH LINCS) Data PortalFootnote 35 provides access to a diverse array of novel bioassay data that has been curated and packaged with rich metadata for the assay entities. These metadata conform to the NIH LINCS metadata standardsFootnote 36 that enable integration and interpretation of LINCS data. The LINCS Data Portal APIFootnote 37 provides programmatic access to all datasets, dataset entities, and metadata within the LINCS Data Portal.

  • The BD2K PIC-SURE HTTP API gives authenticated users platform-agnostic programmatic access to disparate, heterogeneous patient-level datasets. The API provides a selection of methods to access, query, and interrogate data in diverse formatsFootnote 38. To test the PIC-SURE API, a demo with the National Health and Nutrition Examination Survey (NHANES) is available onlineFootnote 39. NHANES is a publicly available epidemiological survey conducted by the US CDC, recording over 1,100 variables from more than 41,000 respondents across the US; it is essentially a snapshot of patients’ exposomes and phenomes. The exposome is composed of collections of environmental, behavioral, and dietary factors that are associated with health and disease, and phenomes include clinical and physiological phenotypes that are predictive of health.

  • The Alliance of Genome Resources (AGR)Footnote 40 is an initiative formed in 2016 that has the goals of providing better support for the biological sciences via an integration of shared data; standardization of data models and interfaces; and unified outreach to researchers, educators, and the public. The initial members of AGR are the Gene Ontology ConsortiumFootnote 41 and six model organism databases: Saccharomyces Genome DatabaseFootnote 42, WormBaseFootnote 43, FlyBaseFootnote 44, Zebrafish Model Organism DatabaseFootnote 45, Mouse Genome DatabaseFootnote 46 and Rat Genome DatabaseFootnote 47. This integration will provide the best visualizations and tools currently in use and allow efficient development of new tools in a collaborative manner. As the project moves toward deeper integration of content and software, we will provide easy-to-use cross-organism queries of the extensive data available in the component resources. The data access will be available via an API that will be conformant to the smartAPI specification.

The API Interoperability project is ongoing, and a number of BD2K centers have been actively participating in the WG meetings and have expressed interest in adopting the smartAPI specification and editor to annotate their APIs. Once the APIs have been annotated using the smartAPI editor, we will store them in the smartAPI registryFootnote 48, which will not only provide all of the smartAPI-conformant APIs in one location but will also be integrated into the editor. With this integration, the stored elements and values will be used to suggest related fields and values for new, similar APIs during the annotation process (refer to Sect. 5).

7 Conclusions and Future Work

In this paper, we have defined a smartAPI metadata template that contains 54 API metadata elements used to describe an API. Results are reported for a survey of eight resources that were used to identify these API-associated metadata. We constructed the smartAPI metadata template for the validation of API annotations. Additionally, we built a Web application for the intelligent annotation of smartAPIs. Since authoring metadata can be tedious and overwhelming, we developed software, built upon the existing Swagger editor, that helps users describe their APIs by (i) indicating highly used fields, (ii) suggesting commonly used values, and (iii) enabling the discovery and reuse of terms authored by others. Moreover, we developed a profiler for automatic annotation of API response data and integrated it within our editor to enable semantic annotation of APIs, which increases their reusability and interoperability.

Our proposal to facilitate the authoring of rich API metadata is especially significant because of the increased emphasis on providing cloud-based APIs. If left unmanaged, a majority of APIs will lack the metadata needed to discover them. Our work begins to explore the roadmap sketched out by the participants of the Software Discovery Index WorkshopFootnote 49, which addresses the challenges that the biomedical research community in particular faces in locating, citing, and reusing biomedical software. We believe that the semantic tools and technologies developed in this project will form an important cornerstone in the overall vision of the Commons. As future work, we will assess the ease and utility of authoring smart API metadata for biomedical APIs as well as APIs in other domains. Although we have developed our own API repository (http://smart-api.info/registry/), we expect to be able to export to other repositories that generally have fewer metadata requirements, e.g. ProgrammableWeb. Additionally, our future work will focus on use cases that demonstrate our aim of making APIs interoperable.