These new funder policies now require researchers to develop research data management plans, part of which involves publishing their data in what is called FAIR form.  The four components of FAIR are:
F: Findable. Data should be discoverable by searches, ideally on a global scale using consistent interfaces.
A: Accessible. Data should be openly retrievable not only by humans, but by machines operating on a larger scale for the purpose of data or content mining.
I: Interoperable. Once discovered and retrieved, data should be capable of validation and re-use, again not merely by humans but also by software.
R: Reusable. Data should carry a commensurate and declared license that allows re-use.
Although supporting information (SI) is nowadays a virtually mandatory component of the journal publication process in chemistry, very little of it actually fulfils all these FAIR criteria, for a variety of reasons. SI is mostly distributed as a PDF document containing page breaks and page headers or footers. The PDF wrapper was never designed as a data container; such containment can easily disable data discoverability. Some data, such as crystallographic information, is held in structured semantic form, but this is not generally true. Crucially, the PDF-based SI document never has formally declared metadata (information about the data contained therein), and its monolithic structure (examples have reached 504 pages in length, and this may not even have been close to the maximum) means that even a simple index of the text content is probably next to useless for satisfying the F of FAIR. SI is a child of its parent, the scientific journal article, and as such inherits the persistent (digital object) identifier or DOI of the article. The article DOI, however, carries no information (metadata) about the SI itself or about any data contained in the SI. The DOI normally points to a landing page for the article, and this page has to be visually inspected by a human to ascertain the existence and whereabouts of SI, often in a manner parochial to the journal; a fail for both the F and the A of FAIR. Validation of data held inside a PDF file is rarely possible with any semantic assurance; a fail for the I of FAIR. Finally, the licenses that cover data are, or should be, fundamentally different from those that cover copyrightable materials such as journal articles. These are rarely declared; a fail for the R of FAIR.
All four aspects of FAIR can be addressed by the use of appropriately rich metadata. In this regard, molecular science, and in particular molecule-centric chemical data, has been revolutionised by the introduction of the InChI identifier. The key components and procedures for managing research data using InChI metadata identifiers include the following:
The SI document held on a publisher’s site as part of a journal article can be augmented with or entirely replaced by the use of a data repository. 
This repository should be capable of issuing an identified data depositor with a deposition receipt in the form of a DOI, issued by an associated authority. The current leading DOI registration agency for data is DataCite. 
Such a DOI carries some assurance that metadata describing the deposition has been appropriately gathered and validated against a specified schema. In exchange for issuing a DOI, the issuing authority receives this metadata in a structured manner specified by a declared metadata schema and the entire process should ideally be automated as a workflow by the repository.
The metadata schema includes core aspects such as the identity of the depositor (nowadays defined by their ORCID identifier), the date and time of the deposition, an explicit declaration of the license under which the data is issued, such as CC0, and the name of the publisher (normally the research institution).
An InChI string and key for a molecule can be (automatically) generated and submitted to augment the core metadata, along with the media type of the data, which greatly facilitates its semantic interoperability.
The registration authority in turn provides rich search facilities of the submitted metadata, along with access statistics.
The registration authority can also record specific metadata specifying how the deposited data might be accessed based on its DOI, so that retrieval can be fully automated by machine and the data accessed at high throughput.
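The kind of record described in the steps above can be sketched in a few lines of Python. This is a minimal illustration only: the class name, field names, and checks are hypothetical and are not taken from the DataCite schema or from any repository's code. The ORCID shown is the fictitious example identifier used in ORCID documentation, the date and publisher are placeholders, and the InChI string and key are those of water, chosen purely for illustration.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# An InChIKey has the fixed layout of 14, 10 and 1 upper-case letters.
INCHI_KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
# An ORCID iD is written as four groups of four characters (last may be X).
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


@dataclass
class DepositionRecord:
    """Hypothetical core metadata for one deposited dataset."""
    depositor_orcid: str
    deposited_at: str   # ISO 8601 date and time of the deposition
    license: str        # explicit license declaration, e.g. CC0
    publisher: str      # normally the research institution
    inchi: str
    inchi_key: str
    media_types: List[str] = field(default_factory=list)


def find_problems(record: DepositionRecord) -> List[str]:
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    if not ORCID_PATTERN.fullmatch(record.depositor_orcid):
        problems.append("depositor_orcid is not a well-formed ORCID iD")
    try:
        datetime.fromisoformat(record.deposited_at)
    except ValueError:
        problems.append("deposited_at is not an ISO 8601 date and time")
    if not record.license:
        problems.append("no license declared")
    if not record.publisher:
        problems.append("no publisher declared")
    if not record.inchi.startswith("InChI="):
        problems.append("inchi does not look like an InChI string")
    if not INCHI_KEY_PATTERN.fullmatch(record.inchi_key):
        problems.append("inchi_key does not look like an InChIKey")
    if not record.media_types:
        problems.append("no media type declared")
    return problems


# Placeholder values for illustration only.
example = DepositionRecord(
    depositor_orcid="0000-0002-1825-0097",
    deposited_at="2016-05-31T12:00:00+00:00",
    license="CC0",
    publisher="Example Research Institution",
    inchi="InChI=1S/H2O/h1H2",
    inchi_key="XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    media_types=["chemical/x-cml"],
)

print(find_problems(example))
```

A real repository would validate against the registration agency's declared schema rather than against ad hoc checks like these; the sketch only shows how little information is needed to make a deposition discoverable by molecule.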
The data is now held in an optimal environment which includes appropriate metadata associated with a persistent identifier to ensure the data passes the FAIR tests. Any journal article based on discourse or narrative where supporting evidence based on data is required can now simply include one or more data DOI citations in the bibliography. The article and data DOIs mutually complement each other. To show why InChI-based metadata in particular has the potential to catalyse enthusiastic adoption of RDM best-practices in molecular science, I will devote the rest of this article to a use-case example derived from our own experience and research.
A Use-Case Example
This research narrative, which has been peer reviewed and published in a journal, describes the procedures and outcomes of curating a ten-year-old dataset of molecular files based on the NCI small molecule collection. The data and other research objects associated with this project were separately published in a data repository, cited in the bibliography of the article as refs 25, 27, 35, 36, and 50. The data takes the form of an overall dataset collection assigned the DOI 10.14469/ch/2, with general metadata associated with the collection revealed using the query: http://data.datacite.org/10.14469/ch/2. There are 158,122 items within this collection (this is abnormally high; most collections would have far fewer items), each of which is also assigned its own DOI, e.g. http://doi.org/10.14469/ch/153690, and its own metadata: http://data.datacite.org/10.14469/ch/153690. Inspection of the metadata for any individual entry reveals the presence of both the InChI string and key as identifiers for the molecule in that entry, along with information about the media type(s) present for the data. For this dataset, the presence of a chemical/x-cml media type suggests that a validatable XML-based document with identifiable semantic content can be obtained: http://data.datacite.org/chemical/x-cml/10.14469/ch/153690. Such standardized metadata collection facilitates indexing, including that of the InChI identifiers, allowing a variety of rich searches based on it to be made (see Table). Both the search and the form of the outputs can be fully automated to allow high throughput queries.
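The three URL forms quoted above follow a regular pattern, which is what makes machine access straightforward. The sketch below simply rebuilds those URLs from a DOI and a media type. The function names are hypothetical, the URL patterns are copied from the examples in the text, and no network request is made, so whether the endpoints still respond in this form is an assumption that is not tested here.

```python
from typing import Dict, Iterable

# Base addresses as they appear in the examples quoted in the text.
DOI_RESOLVER = "http://doi.org/"
DATACITE_BASE = "http://data.datacite.org/"


def landing_url(doi: str) -> str:
    """URL that resolves the DOI to its landing page."""
    return DOI_RESOLVER + doi


def metadata_url(doi: str) -> str:
    """URL that returns the metadata record registered for the DOI."""
    return DATACITE_BASE + doi


def media_url(doi: str, media_type: str) -> str:
    """URL that requests the deposited data in one declared media type."""
    return f"{DATACITE_BASE}{media_type}/{doi}"


def urls_for(doi: str, media_types: Iterable[str] = ()) -> Dict[str, str]:
    """Collect every URL a harvesting script might need for one DOI."""
    urls = {"landing": landing_url(doi), "metadata": metadata_url(doi)}
    for media_type in media_types:
        urls[media_type] = media_url(doi, media_type)
    return urls


for name, url in urls_for("10.14469/ch/153690", ["chemical/x-cml"]).items():
    print(name, url)
```

Looping `urls_for` over the item DOIs of a collection would give a script everything it needs to retrieve each entry without a human ever visiting a landing page.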
As the management of research data, together with its deposition as a digital research object, becomes both increasingly common and likely mandatory, the deployment of rich metadata becomes essential. In the molecular sciences, the InChI identifier will play a pivotal role in enabling the discovery of the data and helping to ensure its FAIRness.
The announcement of the detection of gravitational waves has associated FAIR data (doi: 10.7935/K5MW2F23), but the metadata (http://data.datacite.org/10.7935/K5MW2F23) cannot be described as rich.
Research data repositories can be located using this resource: http://www.re3data.org
H. S. Rzepa, P. Murray-Rust and B. J. Whitaker, The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World-Wide Web information Exchange, J. Chem. Inf. Comp. Sci., 1998, 38, 976-982. doi: 10.1021/ci9803233
A manual specifying the search syntax can be found at http://search.datacite.org/help.html
About the article
Henry S. Rzepa
Andrew Mclean is Research and Academic Support Team Leader in the ICT Division of Imperial College London
Matthew J. Harvey
Matthew J. Harvey is a specialist in the High performance computing unit, ICT Division, Imperial College London
Published Online: 2016-05-31
Published in Print: 2016-05-01