Publicly Available Published by De Gruyter February 5, 2018

InChI’ng forward: Community Engagement in IUPAC’s Digital Chemical Identifier

Leah McEwen

Leah McEwen <lrm1@cornell.edu> is a chemistry librarian at Cornell University, USA. She is a member of the IUPAC Committee on Publications and Cheminformatics Data Standards (CPCDS), co-chair of the CPCDS Subcommittee on Cheminformatics Data Standards, and secretary of the InChI Subcommittee. ORCID.org/0000-0003-2968-1674

From the journal Chemistry International

Given two chemical structures, how do you determine if they are the same? How can chemical data from multiple sources be merged accurately? How can published data be consistently indexed and cross-linked for maximum discovery? Managing these processes manually, either for external or internal purposes, is untenable with the current scale of chemical information and the current diversity of sources. The ability to machine process chemical structure data is crucial across the chemical enterprise. InChI technology has become the industry standard for matching and cross-indexing in the major chemical databases.

InChI is based on a canonical algorithm that notates chemical structure information in a layered format, the InChI string, with the formula and connectivity at the core. This standard form, based on a normalized structure, enables interoperability between databases. InChI strings can become quite long, however, especially for larger molecules; a hashed version of 27 characters called the InChIKey can be used for faster searching and matching. The InChIKey hashes the connectivity in one portion and additional information in another portion. This notation facilitates the automation of two key functions when working with large numbers of chemical structures: identification and verification.
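The two-portion layout of the InChIKey can be illustrated with a short Python sketch. The block layout encoded in the pattern below (a connectivity block, a second block ending in a standard/non-standard flag and a version letter, and a final protonation letter) follows the commonly documented InChIKey format; the function and field names are hypothetical and chosen only for this illustration.

```python
import re
from typing import NamedTuple

# Assumed block layout of an InChIKey:
#   14 letters - 8 letters + flag (S or N) + version letter - 1 letter
_INCHIKEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    connectivity_block: str   # hashed connectivity portion
    other_layers_block: str   # hashed additional information (stereo, isotopes, ...)
    is_standard: bool         # True when the flag character is "S"
    version_char: str         # "A" corresponds to InChI version 1
    protonation_char: str     # "N" means no protons added or removed


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = _INCHIKEY_PATTERN.fullmatch(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other, flag, version, protonation = match.groups()
    return InChIKeyParts(connectivity, other, flag == "S", version, protonation)


# Water, whose standard InChIKey is XLYOFNOQVPJJNP-UHFFFAOYSA-N
parts = split_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N")
print(parts.connectivity_block)  # XLYOFNOQVPJJNP
print(parts.is_standard)         # True
```

The sketch only checks the shape of the key. It cannot confirm that a key was actually produced by the InChI software, because the hash cannot be reversed.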

InChI can function as a bridge from a chemical record in one data source to a corresponding chemical record in another. By matching InChIKeys across these data sources, we can see how much overlap exists in the chemical space, and also how much coverage is unique to each source. Databases that collect data from multiple sources often apply this verification routine to sort records that can be connected directly, records that are likely unique, and records that may need further investigation as partial matches. Comparing InChIKeys can also indicate situations where the connectivity is the same but some other variable differs, as with stereoisomers or isotopes (see Figure 1). More information about InChI can be found on the InChI Trust website: www.inchi-trust.org; for a recent overview, see IUPAC100 Essential Tools, January 2018 (see page 32 for more information about IUPAC100 Essential Tools).
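The sorting routine described above might be sketched as follows. This is a minimal illustration rather than the procedure used by any particular database: the record identifiers are invented, and the carvone InChIKeys are quoted as they commonly appear in public databases and are worth verifying before reuse. Records whose full InChIKey matches are treated as directly connected, records that share only the first (connectivity) block as partial matches, and the rest as unique.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the hashed connectivity)."""
    return inchikey.split("-", 1)[0]


def classify_records(source_a: dict, source_b: dict) -> dict:
    """Sort the records of source_a by how they match the keys in source_b.

    Both arguments map a record identifier to an InChIKey.
    """
    keys_b = set(source_b.values())
    blocks_b = {connectivity_block(key) for key in keys_b}
    result = {"exact": [], "partial": [], "unique": []}
    for record_id, key in source_a.items():
        if key in keys_b:
            result["exact"].append(record_id)
        elif connectivity_block(key) in blocks_b:
            # Same connectivity, but stereo or isotope information differs.
            result["partial"].append(record_id)
        else:
            result["unique"].append(record_id)
    return result


# Hypothetical records; identifiers are made up for the example.
source_a = {
    "A-001": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",  # water
    "A-002": "ULDHMXUKGWMISQ-SECBINFHSA-N",  # (R)-carvone
    "A-003": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
}
source_b = {
    "B-17": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",   # water
    "B-42": "ULDHMXUKGWMISQ-VIFPVBQESA-N",   # (S)-carvone
}

print(classify_records(source_a, source_b))
# {'exact': ['A-001'], 'partial': ['A-002'], 'unique': ['A-003']}
```

In this toy data set the two carvone enantiomers land in the partial bucket, which is the situation that calls for further investigation.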

InChI User Meetings

Verifying chemical structure data from multiple sources was by far the most common use case discussed at two dedicated InChI meetings in the past year. The first InChI workshop was held at the European Bioinformatics Institute (EBI) in Hinxton, UK, 20-21 March 2017. The meeting was part of the EBI Industry Programme series, engaging industry members and reviewing their use cases for managing chemical records. Over 60 attendees exchanged ideas, needs, challenges, and innovations during two days of presentations and breakout sessions. The second users meeting was held at the US National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), in Bethesda, MD, 16-18 August 2017, attracting over 80 attendees. Participation at these meetings was international and came from industry, government, and academia. Summary materials for these meetings are available on the InChI Trust site at: www.inchi-trust.org/status-future-iupac-inchi-context-use-cases-august-16-18-2017

Figure 1: Structures, names and InChI notation for the enantiomers of carvone. (Image by User:Walkerma; derivative work: User:Karlhahn - Carvone.png, Public Domain, https://commons.wikimedia.org/w/index.php?curid=8634902)

Figure 2. Reaction InChI (RInChI) string for the above reaction. (Image by G. Blanke, “Reaction InChI.” InChI Workshop at NIH; Bethesda, MD; 16-18 August 2017.)

The successful application of InChI in managing information around small organic molecules has prompted interest in expanding the technology to other chemical classes. The workshops were organized around active or proposed projects addressing large molecules, organometallic and inorganic molecules, polymers, mixed substances, and reactions. Other projects are underway to extend the functionality of the InChI algorithm for tautomeric forms and advanced stereochemistry. The community has also begun to consider the applications of InChI in QR Codes for labelling and metadata schema for indexing datasets, as well as in publication workflows, teaching cheminformatics, etc. Some of these extensions are currently representable to various extents through non-standard forms of InChI, but identification and matching across systems is not possible without the consistent normalization or rule-sets conceived by these projects.

Applications for multi-component systems

Reactions (IUPAC project 2009-043-2-800) [1]

In addition to notating single molecules, IUPAC is developing specifications for using InChI with multi-component systems. The code for supporting RInChI, the InChI-based notation for reactions, was formally announced in April, just after the first InChI meeting [2]. RInChI is essentially an application to assemble InChIs into groups by roles: reactants, products, and reagents. The concatenated string is canonical for a given reaction scheme, using ordered lists of the compounds within each group (see Figure 2). Hashed versions of RInChIs can allow matching at different levels of granularity: individual components, component groups (e.g., same reactants, same products, and/or same reagents), or all components (i.e., same schema).
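The idea of grouping by role, ordering within each group, and hashing at more than one granularity can be sketched in a few lines of Python. This is not the RInChI format or its hashing scheme: the class name, the use of SHA-256, and the truncated digests are assumptions made for illustration, and the InChI strings are given as they are usually listed for these common compounds.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Tuple

ROLES = ("reactants", "products", "reagents")


def _digest(items: Iterable[str]) -> str:
    """Order-independent digest: sort first, then hash (SHA-256 as a stand-in)."""
    joined = "\n".join(sorted(items))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class Reaction:
    reactants: Tuple[str, ...]
    products: Tuple[str, ...]
    reagents: Tuple[str, ...] = ()

    def group_digests(self) -> dict:
        """One digest per role, for matching on a single component group."""
        return {role: _digest(getattr(self, role)) for role in ROLES}

    def overall_digest(self) -> str:
        """One digest over all groups, for matching the whole reaction."""
        groups = self.group_digests()
        return _digest(f"{role}={groups[role]}" for role in ROLES)


ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHYL_ACETATE = "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3"
WATER = "InChI=1S/H2O/h1H2"
SULFURIC_ACID = "InChI=1S/H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)"

# The same esterification entered with components in a different order.
first = Reaction((ACETIC_ACID, ETHANOL), (ETHYL_ACETATE, WATER), (SULFURIC_ACID,))
second = Reaction((ETHANOL, ACETIC_ACID), (WATER, ETHYL_ACETATE), (SULFURIC_ACID,))
no_catalyst = Reaction((ACETIC_ACID, ETHANOL), (ETHYL_ACETATE, WATER))

print(first.overall_digest() == second.overall_digest())       # True
print(first.group_digests()["reactants"]
      == no_catalyst.group_digests()["reactants"])             # True
print(first.overall_digest() == no_catalyst.overall_digest())  # False
```

Sorting before hashing is what makes the result independent of the order in which components were entered. Keeping a digest per role allows a search for "same reactants" without requiring the same reagents.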

Mixtures (IUPAC project 2015-025-4-800) [3]

Another notation using a concatenated approach is MInChI, an InChI-based notation for mixed substances currently under development. InChI strings are included in an ordered list for those components that are characterized, which can facilitate the linking of mixed substances to data about specific components. Additional layers can be used to specify order and groupings, as well as the concentration ratios of components. A significant challenge for the MInChI project is the lack of consistent approaches to describing mixtures, which are most often described in text rather than through chemical structures. The project is considering specifications for input formats, as well as mechanisms to parse and normalize descriptions.
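As a rough illustration of the kind of input record such a notation would need, the sketch below models a mixture as a tree of components, some of which carry an InChI and some of which do not. It does not reproduce the MInChI syntax, which was still under development; the class, its fields, the example mixture, and the free-text amounts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    name: str
    inchi: Optional[str] = None    # None when the component is not characterized
    amount: Optional[str] = None   # free-text amount, not normalized here
    children: List["Component"] = field(default_factory=list)


def characterized_inchis(component: Component) -> List[str]:
    """Collect, in document order, the InChIs of all characterized components."""
    found = [component.inchi] if component.inchi else []
    for child in component.children:
        found.extend(characterized_inchis(child))
    return found


# Illustrative mixture; composition and amounts are invented for the example.
mixture = Component(
    name="denatured alcohol (illustrative)",
    children=[
        Component("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "90 % v/v"),
        Component("water", "InChI=1S/H2O/h1H2", "9 % v/v"),
        Component("proprietary denaturant", amount="1 % v/v"),
    ],
)

print(characterized_inchis(mixture))
# ['InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3', 'InChI=1S/H2O/h1H2']
```

The uncharacterized component stays in the record but contributes no InChI, so the mixture can still be linked to data on its known components.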

Expanding single molecule coverage

Polymers (IUPAC project 2009-042-1-800) [4]

Basic support for polymers is included in a recent code update for InChI (v. 1.05) but is still in beta (testing with live users) [5]. This version of InChI accommodates both source-based and structure-based representation, normalizing to a preferred constitutional repeating unit related to the IUPAC recommendations for polymer nomenclature for regular single strand polymers. The notation is limited to single strand polymers for now, but future considerations arising in the workshops include: hydrogens as end groups, canonicalization with specific end-groups, folding, the degree of polymerization, and end-groups encoded in some separate way. It was also recognized that there is a lack of guidelines for representing polymers and this will be a continuing discussion across IUPAC Divisions IV and VIII.

Large Molecules (IUPAC project 2013-010-1-800) [6]

The connection table-based formats that describe chemical structures, and that serve as the primary input to the InChI algorithm, have for many years been limited to molecules of up to 1000 atoms, severely limiting their ability to represent biomacromolecules. Representing a chemically modified biologic presents many challenges, such as size, variable substitution sites, variable substitution loading, hydrogen bonding, and the presence of heavy metals, as well as the inclusion of several other types of representation formats. How can InChI be developed to support use cases for the comparison and matching of large, complex molecules? Two approaches were discussed. The first involves significant enhancements to the InChI code to handle generic and variable structures, which would be expensive, would likely invalidate existing InChIs, and could lead to many unexpected combinations. An alternative suggestion, inspired by RInChI and MInChI, would be to treat chemically modified biologics as mixtures and capture what is known about them in a collection of InChIs. This approach does not require extensive changes to the core InChI code and could accommodate information about antibodies, linkers, payload, specific attachment points, residue types, familial relationships among components, etc. Hashing the resultant collection, as is done with RInChI, for example, could provide a manageable text string for searching.

Organometallics (IUPAC project 2009-040-2-800) [7]

Notating organometallic compounds is a significant challenge for databases organized around 2D covalent connectivity. The current version of standard InChI disconnects bonds to metals in a similar way to salts. A layer to specify reconnection is offered in the non-standard InChI. This is similar to the process followed by nomenclature rules for organometallic compounds: disconnection, naming ligands, and representing connection. Unfortunately, the current process of disconnection in the standard InChI gives rise to ambiguity in how the metal associates with the ligands. Issues with tautomerization and charge distribution arise in the reconnection process. The lack of advanced stereochemical specification in the current standard InChI also impacts canonicalization. Moreover, chemists and the tools they use for drawing are not consistent in representing metal-organic structures and a variety of specialized bond types have cropped up that are not supported in the current standard InChI. The project faces the fundamental challenge of balancing requirements for canonicalization in ways that can support InChI function across different systems.

Looking ahead to InChI version 2

Tautomers (IUPAC project 2012-023-2-800) [8]

Nothing confounds the searching and bookkeeping functions in large chemical databases like tautomers. There are many examples of different listings of tautomeric forms of the same molecules in the same catalog. The current version of the standard InChI normalizes to a single tautomeric form, effectively “locking in” the placement of the hydrogens. As InChI’s key role as a matching algorithm develops, the inability to match across tautomers becomes a significant limitation. The current project on tautomers has identified around 50 transformations through experimental studies of chemical representation in large chemical databases. Most of these are found to be applicable to some hundreds of compounds, and some, such as heteroatom H shifting, impact tens of thousands of molecules with delocalized bonds. Several of these rule-sets have already been encoded into cheminformatics toolkits and could potentially be considered for version 2 of the standard InChI code.

Advanced Stereochemistry (proposal in development)

Support for enhanced stereoconfiguration was identified as a crucial need in discussion at the first InChI meeting at EBI. The depiction of chirality in connection table formats was for a long time limited to global stereoconfiguration, using a simple flag to indicate either the specified stereoisomer or the racemic mixture. A lack of consensus on representing specificity at stereocenters prompted the practice of depicting specific structures for all stereoisomers, which can be quite extensive. Current connection table formats provide the opportunity to notate absolute, racemic, or relative configurations at specific stereocenters, which in combination could encompass all possible stereoisomers. The standard InChI could be extended to interpret these notations and incorporate them into a new layer. Harmonizing electronic representation conventions and nomenclature guidelines for stereocompounds will be important for consistent application of the enhanced stereo-notation. Other possible advanced stereochemistry types to consider are atropisomers, Haworth projections, non-tetrahedral stereorepresentation, and longer cumulenes.

InChI in the landscape of technologies and standards

Standard Chemical Structure Files (proposal in development)

Input to the InChI algorithm is based on connection table files, including the MDL family of structure formats (MOL, SDF, RDF, etc.). Connection table formats are used across the community and, like InChI, cover much of the 2D small organic molecule space fairly consistently. However, the opportunities discussed in the projects to extend InChI are impacted by diverse practices in representing more advanced chemical features. At the current time, much of chemical data exchange relies on proprietary file formats that have variously been extended by different parties. Different systems can interpret the same file in different ways and there is no clear statement of expected behavior. Freely accessible and redistributable specifications for common formats are desirable in order to introduce the conventions formulated in these projects. Examples, validation sets, and reference implementations are important for establishing reproducible practice between systems. Transparent processes for making improvements and corrections, as well as community governance and stewardship, are a critical part of standards development for chemical data exchange.

QR Codes (IUPAC project 2015-019-2-800) [9]

In addition to linking chemical records in electronic databases, InChIs can be incorporated into QR codes to facilitate links from chemical information in printed formats, such as labels or signs. Workshop discussions suggested that an InChI-based QR code system could improve inventory management, especially in universities and other research institutions with broadly distributed, small amounts of a large number of different specialty chemicals, often with in-house labels. For example, with an InChI QR code app, one could determine the contents of a laboratory shelf or cupboard by sweeping the camera of a mobile device across the containers to capture the QR codes. The project is developing a standard URL format to include in the code for retrieving information associated with a given InChIKey from a given database or supplier, for example (not real links): resolver.example.com/InChIKey/[InChIKey] or www.abc-chemical.com/inchikey_search?q=[InChIKey]
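A URL of this kind could be assembled as in the following sketch. The two templates reuse the placeholder addresses given above, which are not real services; the `https://` scheme, the template dictionary, and the function name are assumptions added for the example. Generating the QR image itself would require a separate library and is left out.

```python
from urllib.parse import quote

# Placeholder templates based on the illustrative (not real) links in the text.
URL_TEMPLATES = {
    "resolver": "https://resolver.example.com/InChIKey/{inchikey}",
    "supplier": "https://www.abc-chemical.com/inchikey_search?q={inchikey}",
}


def lookup_url(service: str, inchikey: str) -> str:
    """Fill the chosen template with a percent-encoded InChIKey."""
    template = URL_TEMPLATES[service]
    return template.format(inchikey=quote(inchikey, safe=""))


print(lookup_url("resolver", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
# https://resolver.example.com/InChIKey/XLYOFNOQVPJJNP-UHFFFAOYSA-N
```

Because an InChIKey contains only uppercase letters and hyphens, it passes through percent-encoding unchanged, which keeps the encoded URL short enough for a compact QR code.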

Using InChI with other standards (under discussion)

The InChI string notation directly encodes normalized structure information. This is typical of chemical notation practice (e.g., SMILES, WLN, Dyson) and allows the standard InChI to be an interoperable link between different systems at a granular level based on the standard normalization rules. However, as an algorithm, it differs from most other digital identifiers, which generally use a unique numbering scheme and reference metadata stores (DOIs, for example). InChIKeys can be used as metadata in a record for articles or datasets, which can be searched and retrieved using those InChIKeys. More specifically, InChIKeys can be incorporated into the metadata of DOIs, which can help connect articles and data. InChIKeys could also be incorporated into other chemical data formats to describe measurement data particular to specific compounds, such as spectral data represented in the IUPAC JCAMP-DX format.
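Searching article and dataset records by InChIKey amounts to building an inverted index over the metadata. The sketch below assumes a very simple record shape with hypothetical field names (`doi`, `inchikeys`) and made-up DOIs; it does not reflect the metadata schema of any DOI registration agency.

```python
from collections import defaultdict

# Hypothetical metadata records; the DOIs are invented for the example.
records = [
    {"doi": "10.5555/example.article.1",
     "inchikeys": ["XLYOFNOQVPJJNP-UHFFFAOYSA-N"]},                 # water
    {"doi": "10.5555/example.dataset.7",
     "inchikeys": ["XLYOFNOQVPJJNP-UHFFFAOYSA-N",                   # water
                   "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"]},                 # ethanol
]


def build_index(metadata_records: list) -> dict:
    """Map each InChIKey to the DOIs whose metadata mention it."""
    index = defaultdict(list)
    for record in metadata_records:
        for key in record["inchikeys"]:
            index[key].append(record["doi"])
    return dict(index)


index = build_index(records)
print(index["XLYOFNOQVPJJNP-UHFFFAOYSA-N"])
# ['10.5555/example.article.1', '10.5555/example.dataset.7']
print(index.get("RYYVLZVUVIJVGH-UHFFFAOYSA-N", []))  # caffeine, not indexed
# []
```

A lookup on one compound then returns both the article and the dataset that mention it, which is the connection between articles and data described above.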

Community contribution

Additional project areas include positional isomers, Markush, a resolver function, and the development of training materials to support the use of InChI more broadly. Clearly there is great momentum around this digital IUPAC standard. Ultimately, such standards are only as valuable as they are useful; the current and future success of InChI depends on dynamic engagement across the community of users. In addition to expanding InChI specifications and applications, conference discussion focused on further engagement with current and prospective users through code testing and development, outreach to other communities using chemical data, and inclusion in chemical education.

https://iupac.org/body/802

References

1. Standard InChI-Based Representation of Chemical Reactions; G. Blanke, Chair. https://iupac.org/project/2009-043-2-800

2. http://www.inchi-trust.org/inchi-reactions-rinchi-released

3. InChI Extension for Mixture Composition; L. McEwen, Chair. https://iupac.org/project/2015-025-4-800

4. InChI Requirements for Representation of Polymers; A. Yerin, Chair. https://iupac.org/project/2009-042-1-800

5. http://www.inchi-trust.org/inchi-version-1-05-released

6. Implementation of InChI for Chemically Modified Biomolecules; K. Taylor, Chair. https://iupac.org/project/2013-010-1-800

7. InChI Requirements for Representation of Organometallic and Coordination Compound Structures; C. Batchelor, Chair. https://iupac.org/project/2009-040-2-800

8. Redesign of Handling Tautomerism for InChI V2; M. Nicklaus, Chair. https://iupac.org/project/2012-023-2-800

9. Identifying InChI Enhancements – QR Codes and Industry Applications; R. Hartshorn, Chair. https://iupac.org/project/2015-019-2-800

Published Online: 2018-2-5
Published in Print: 2018-1-1

©2018 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/
