XML in Chemistry and Chemical Identifiers
by Antony (Tony) N. Davies
Steve Stein of the National Institute of Standards and Technology (NIST) in Gaithersburg, Maryland, USA, and Alan McNaught of the Royal Society of Chemistry, Cambridge, UK, jointly hosted a three-day meeting to discuss IUPAC projects on XML in Chemistry and the Chemical Identifier Project. The meeting was held at NIST from 12 to 14 November 2003.
The meeting was exceptionally well attended with over 50 attendees from governmental and regulatory bodies, research and academic institutes, and industry. A wide range of experts in the field were brought together for a lively exchange of views on many of the topics covered.
XML in Chemistry
Numerous speakers related tales of XML initiatives involving chemistry in their respective organizations, including the European Patent Office, the International Union of Crystallography, and the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research. Various projects within NIST itself were also discussed, such as UnitsML for scientific units and ThermoML for thermodynamic properties. ToxML was described for toxicology data. Despite the range of speakers’ views on the issue of XML in chemistry, one thing became clear: the decision of IUPAC to take a leading role in order to avoid duplication of effort was the right one.
Some very detailed technical discussions were held on the mechanisms surrounding the generation of controlled ontologies or data dictionaries that highlighted the speed at which the field is moving. The number of XML initiatives that have been born, flourished briefly, and then vanished into obscurity was also discussed.
These arguments underlined the essential nature of the problem: research effort would be better spent on novel ways of handling information to enhance productivity, and on better, more advanced tools for data mining, than on repeatedly discussing how best to move the data from A to B. With luck, the IUPAC initiative will bring a certain degree of stability to the information technology base in chemistry and allow teams working in this area to concentrate on their core business without having to worry whether their underlying technology is about to be made obsolete!
IUPAC/NIST Chemical Identifiers (INChI)
Alan McNaught introduced the project, the aim of which is to produce a public Chemical Identifier to uniquely identify compounds. The current version is available for testing and has been expanded to cover organic, inorganic, and organometallic chemistry. It should be noted that the project acronym IChI (for IUPAC Chemical Identifiers) has been changed to INChI, where N stands for NIST. This change was made to recognize the immense contribution of NIST to the project.
But how does INChI work? Well, INChI starts off by looking at the chemistry of the structure to be assigned an “Identifier.” The structure is normalized and a number of chemical rules applied. Next, some mathematics “canonicalizes” the structure (labels atoms) with equivalent atoms receiving the same numbers. Finally, the labelled structure is “serialized” and the output is a character string. Sound simple? Well, as they say in Germany, the devil hides in the details!
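To make the label-then-serialize idea concrete, here is a minimal sketch in Python. It is emphatically not the INChI algorithm: the function names (`rank`, `refine_classes`, `serialize`), the input format (a list of element symbols plus a list of bonded atom-index pairs), and the output string layout are all assumptions made up for illustration. It uses a simple neighbourhood-refinement scheme so that equivalent atoms end up with the same number, and the resulting string does not depend on the order in which the atoms were drawn. Unlike a real identifier, this toy string is not guaranteed to be unique to one structure.

```python
from collections import Counter


def rank(keys):
    """Map each key to its 1-based rank among the distinct keys."""
    order = {key: r for r, key in enumerate(sorted(set(keys)), start=1)}
    return [order[key] for key in keys]


def refine_classes(elements, bonds):
    """Label atoms so that equivalent atoms receive the same number.

    Hypothetical helper: repeatedly refines each atom's class using the
    classes of its neighbours until no new classes appear.
    """
    n = len(elements)
    neighbours = [[] for _ in range(n)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Starting point: element symbol and number of neighbours.
    classes = rank([(elements[i], len(neighbours[i])) for i in range(n)])
    while True:
        keys = [
            (classes[i], tuple(sorted(classes[j] for j in neighbours[i])))
            for i in range(n)
        ]
        refined = rank(keys)
        if len(set(refined)) == len(set(classes)):
            return refined  # no further splitting: labelling is stable
        classes = refined


def serialize(elements, bonds):
    """Turn the labelled structure into a character string (toy format)."""
    classes = refine_classes(elements, bonds)
    formula = "".join(
        f"{el}{count if count > 1 else ''}"
        for el, count in sorted(Counter(elements).items())
    )
    atoms = ",".join(f"{c}{el}" for c, el in sorted(zip(classes, elements)))
    links = ",".join(
        f"{lo}-{hi}"
        for lo, hi in sorted(
            tuple(sorted((classes[a], classes[b]))) for a, b in bonds
        )
    )
    return f"{formula}/{atoms}/{links}"


# The heavy-atom skeleton of ethanol, drawn in two different atom orders.
drawn_one_way = serialize(["C", "C", "O"], [(0, 1), (1, 2)])
drawn_another = serialize(["O", "C", "C"], [(0, 1), (1, 2)])
print(drawn_one_way)
print(drawn_one_way == drawn_another)  # True: atom order does not matter
```

In propane, for instance, `refine_classes(["C", "C", "C"], [(0, 1), (1, 2)])` gives the two end carbons the same number, which is the "equivalent atoms receive the same numbers" behaviour described above. The hard part, and where the devil hides, is breaking the remaining ties in a way that always yields one unique string, which this sketch does not attempt.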
The normalization of the structure involves a series of layers: the raw chemical substance, the molecular formula, and a connectivity layer, followed where necessary by stereochemistry and isotopic layers. The connectivity layer consists of four “sub layers,” with increasing amounts of detail, generated as follows:
- disconnect all H and metal atoms to create a “skeleton”
- reconnect fixed hydrogen atoms to reveal tautomers
- optionally reconnect all mobile hydrogen atoms
- optionally reconnect all metal atoms
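The disconnect-then-reconnect idea in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the names `METALS`, `split_bonds`, and `build_sublayers` are hypothetical, the metal list is a small arbitrary subset, and the sketch makes no attempt to distinguish fixed from mobile hydrogen atoms, so it produces fewer levels of detail than the four sub layers described.

```python
# Illustrative subset only; a real tool would need a complete list of metals.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pt"}


def split_bonds(elements, bonds):
    """Sort bonds into skeleton bonds, bonds to H, and bonds to metals."""
    skeleton, hydrogen_bonds, metal_bonds = [], [], []
    for a, b in bonds:
        pair = {elements[a], elements[b]}
        if pair & METALS:
            metal_bonds.append((a, b))
        elif "H" in pair:
            hydrogen_bonds.append((a, b))
        else:
            skeleton.append((a, b))
    return skeleton, hydrogen_bonds, metal_bonds


def build_sublayers(elements, bonds, reconnect_hydrogens=True,
                    reconnect_metals=False):
    """Return bond lists of increasing detail, starting from the skeleton."""
    skeleton, hydrogen_bonds, metal_bonds = split_bonds(elements, bonds)
    layers = {"skeleton": skeleton}
    current = list(skeleton)
    if reconnect_hydrogens:
        current = current + hydrogen_bonds
        layers["with_hydrogens"] = current
    if reconnect_metals:
        current = current + metal_bonds
        layers["with_metals"] = current
    return layers


# Sodium acetate drawn with an explicit O-Na bond (atom order is arbitrary).
elements = ["C", "C", "O", "O", "Na", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (0, 5), (0, 6), (0, 7)]
for name, layer in build_sublayers(elements, bonds, reconnect_metals=True).items():
    print(name, layer)
```

Keeping the metal bonds in a separate, optional step means the same skeleton is obtained whether or not the person drawing the structure chose to show a bond to the metal.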
As you would expect, this very simple approach came in for some heavy discussion, but “the proof of the pudding is in the eating,” as they say. So far, with some very large structural databases being analyzed in this way, no insurmountable problems have arisen. The developers are looking for beta testers, so please get in touch through the IUPAC Web site if you are interested!
Antony N. Davies <email@example.com> works at Creon Lab Control AG, in Frechen, Germany. He is secretary of the IUPAC Committee on Printed and Electronic Publications and chairman of the Subcommittee on Spectroscopic Data Standards; he is JCAMP-DX external professor at the University of Glamorgan, Wales, United Kingdom.
Page last modified 2 July 2004.
Copyright © 2003-2004 International Union of Pure and Applied Chemistry.