XML in Chemistry
Extensible Markup Language (XML) is a powerful alternative to conventional binary file storage and information exchange. As many scientific organizations and companies delivering scientific products have implemented or are looking at the use of XML, IUPAC decided to review and evaluate what its role could and should be in advancing the use of XML in chemistry. In January this year, the IUPAC Committee on Printed and Electronic Publications (CPEP) organized a two-day Strategic Meeting to assess the Union's position and options. The meeting was hosted by the Unilever Cambridge Centre for Molecular Informatics in the University of Cambridge Department of Chemistry, where delegates from all interested IUPAC Divisions gathered with key players in the field.
XML is often regarded as an extension of the well-known Hypertext Markup Language (HTML), the language most frequently encountered when viewing Web pages. More precisely, both derive from SGML: HTML provides a fixed set of tags, whereas XML lets each community define its own. XML is considered to be the universal format for structured documents and data on the Web.1
As with a conventional Web page, it isn't the use of XML itself that is interesting or even particularly novel, but the content stored within the XML files. In chemistry and associated technical fields, various groups (commercial organizations, academic institutions, and government bodies) have been developing XML formats independently of each other. These formats have similar content but differing data dictionaries and conventions.
This means they are not compatible with each other and, what is far worse, resources are being deployed to address problems already solved by other groups. In order to support standardization in this field for the benefit of the community, IUPAC has decided to actively explore ways in which it can help to unify the various dictionaries and publicize their availability.
IUPAC's Role and Timeline
During the 2001 IUPAC General Assembly in Brisbane, an ad hoc group outlined the dos and don'ts of a possible IUPAC role in advancing the use of XML in chemistry and developed a timeline for further action. The strategic importance of these decisions was reflected in the presentation by Wendy Warr, CPEP chairman, to the IUPAC Council2 and in the subsequent comments by IUPAC's secretary general, Ted Becker, in his article in CI.3
Dos and Don'ts
IUPAC should not:
• Commence activities better left to the computer scientists.
• Re-invent the wheel. The groups currently active at various locations should be invited to contribute to a standardization process through IUPAC as long as their efforts remain in the public domain.
• Become a formal member of the World Wide Web Consortium (W3C), the Object Management Group (OMG), or other similar organizations. These bodies should, however, be informed of IUPAC activities in this area, and IUPAC should continue to monitor their work.
IUPAC should:
• Establish "ownership" of the definition of standard terms in chemistry to be used in digital communications through formal IUPAC recommendations.
• Generate a glossary of standard terms in chemistry for use in applications involved in digital communications such as scientific data exchange or electronic publishing.
• Locate potential interested parties within IUPAC who "own" glossaries of terms or who are in the process of creating them.
• Establish a method to identify and resolve problems in overlap of definitions (within IUPAC as well as with other scientific standards and other organizations).
It was very clear from the Brisbane meeting that the issues raised there needed to be addressed urgently. Hence, by the end of December 2001, the tasks of identifying glossaries, project team members, and contacts between divisions and standing committees had been completed. By then, Professor Bobby Glen of the new Unilever Centre for Molecular Informatics at the University of Cambridge, United Kingdom, had agreed to host a follow-up meeting on 24-25 January 2002, as this type of initiative is of great interest to the fledgling center. Those invited to attend included IUPAC division and standing committee representatives and delegates from outside IUPAC who are active in establishing guidelines for the handling of chemical objects within their organizations. The IUPAC Analytical Chemistry Division was represented by its president, David Moore; the Physical and Biophysical Chemistry Division was represented by Jeremy Frey; and the new Chemical Nomenclature and Structure Representation Division was represented by its president, Alan McNaught. In addition, I represented the IUPAC JCAMP-DX Working Party.
The meeting started with a welcoming address by Bobby Glen, who briefly explained the background of the Unilever Centre and provided a useful overview of the type of projects underway at the center. Alan McNaught, Robert Lancashire, and I discussed IUPAC's intentions, current activities involving IUPAC glossaries, and the status of the JCAMP-DX file formats. Currently, within the eight IUPAC divisions there exist seven glossaries that are supervised by the Interdivisional Committee on Terminology, Nomenclature, and Symbols, which is responsible for ensuring conformity with existing IUPAC recommendations and consistency within and between each volume. These compendia, known as the IUPAC color books, cover chemical terminology; quantities, units, and symbols in physical chemistry; inorganic, organic, macromolecular, and analytical nomenclature; and the terminology and nomenclature of clinical laboratory sciences.4
Jeremy Frey pointed out that one difficulty encountered during the revision of the "green book" (which covers quantities, units, and symbols in physical chemistry) was the accommodation of different definitions, which originated from different fields of chemistry, for single entries in the data dictionary. Steve Heller offered an even broader example of the problem: although nm is widely recognized as nanometers in the scientific community, there is a significant body of opinion that feels that the letters obviously refer to nautical miles!
The International Union of Crystallography (IUCr), represented at the meeting by Brian McMahon, has a very special interest in mark-up language because it has developed a standard format, the Crystallographic Information File (CIF), for the deposition, storage, and distribution of crystallographic data with the publication of peer-reviewed papers. As McMahon explained, CIF was commissioned by IUCr following long-standing interest in the need for an open standard for data and information exchange. CIFs are divided into blocks, with each block consisting of individual labels or tags whose definitions are stored elsewhere. Key points are that the semantic content is kept separate from the syntax of data representation, and that different dictionaries are used for different topic areas. McMahon concluded that one thing was abundantly clear from experience with CIF: "The design of a file format is an essential step, but it is only one component (and in many ways the least difficult) in the process of devising a feature-rich exchange mechanism. Far more difficult is the detailed definition of the tags that will be used within the file to ensure that applications attribute exactly the same meaning to the same item of information. The experience of the expert committees who undertake this work to extend CIF is that years of painstaking effort and discussion may be needed to define a few dozen tags, which are accepted across the community." As a contribution toward the establishment of content-rich XML applications in related areas of chemistry, the IUCr will make available its CIF-based definitions to the IUPAC groups working to establish XML-based applications. The scientific community, said McMahon, is looking forward to the day when effective chemical information exchange standards, widely accepted by the community, complement and interoperate with CIF or its successors.
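The separation McMahon describes, tags in the file and their meaning in a dictionary kept elsewhere, can be illustrated with a minimal Python sketch. It assumes a heavily simplified CIF-style syntax (a `data_` block header followed by single-line tag/value pairs; loops and multi-line text fields are ignored), and `TAG_DEFINITIONS`, `_local_unregistered_tag`, and the sample values are invented for illustration; they are not taken from the IUCr dictionaries.

```python
import shlex

# Hypothetical stand-in for an externally maintained dictionary.
# The definitions are illustrative paraphrases, not IUCr dictionary text.
TAG_DEFINITIONS = {
    "_cell_length_a": "unit-cell length a (illustrative definition)",
    "_cell_length_b": "unit-cell length b (illustrative definition)",
    "_chemical_formula_sum": "sum formula of the compound (illustrative definition)",
}

# Made-up sample data in a simplified CIF-style layout.
SAMPLE = """\
data_example_one
_cell_length_a 5.431
_cell_length_b 5.431
_chemical_formula_sum 'C2 H6 O'

data_example_two
_cell_length_a 10.2
_local_unregistered_tag 42
"""


def parse_blocks(text):
    """Split CIF-style text into {block_name: {tag: value}}.

    Only single-line tag/value pairs are handled in this sketch.
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.lower().startswith("data_"):
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            parts = shlex.split(line)  # keeps quoted values together
            current[parts[0]] = " ".join(parts[1:])
    return blocks


def undefined_tags(blocks, definitions):
    """Return the tags used in any block that the dictionary does not define."""
    used = {tag for items in blocks.values() for tag in items}
    return sorted(used - set(definitions))


blocks = parse_blocks(SAMPLE)
unknown = undefined_tags(blocks, TAG_DEFINITIONS)
print(unknown)
```

The parser never needs to know what a tag means; only the dictionary lookup does. A tag that is missing from the dictionary can still be read, but a receiving application has no agreed meaning to attach to it, which is the hard part McMahon points to.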
Peter Murray-Rust summarized other global activities surrounding the use of XML in science (see "Markup Languages-How to Structure Chemistry-Related Documents" for a review of his work, co-authored with Henry Rzepa). At the meeting, Murray-Rust explained some of the benefits of using XML-based documents, including the ability to "validate" documents for correct or complete content, to create better electronically linked publications, and to significantly simplify information harvesting from such documents. According to Murray-Rust, for XML to function effectively for the sciences there needs to be agreement on the vocabularies or "ontologies" in use. He noted that the W3C expects that "domains" will create domain-specific tools and protocols for different subject areas such as chemistry. He also explained how XML files distinguish between pieces of content whose definitions have often been specified at different locations. Individual XML files may contain content from different ontologies, such as a structure as defined by Chemical Markup Language (CML), a spectrum as defined by JCAMP-DX or SPECTROML, and a mathematical relationship as defined by MathML. This can be regarded as a powerful bonus, but it again raises the question of how reliable the links to those definitions are. This is currently leading to situations where "
Namespaces do not have to be registered, so it is simple for any group or company to define its own version of "element." For example, although such a group could quite correctly claim to be using XML for data storage and transfer, the files generated would be as limited to its own internal applications as if it were using 17-bit binary encoded files. One way in which IUPAC could play a significant role in furthering XML for chemistry, explained Murray-Rust, is by ensuring that dictionaries are future-safe and don't vanish from the Internet when a particular professor retires or a software or publishing house is bought out or goes bankrupt.
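A minimal sketch in Python may make the namespace point concrete. It uses the standard `xml.etree.ElementTree` module; the two namespace URIs and the tiny document are invented for illustration and do not correspond to any real chemical vocabulary.

```python
import xml.etree.ElementTree as ET

# Two hypothetical groups each define their own "element" tag.
# Both namespace URIs are invented and unregistered.
DOC = """\
<report xmlns:a="http://example.org/group-a/chem"
        xmlns:b="http://example.org/group-b/chem">
  <a:element symbol="Na"/>
  <b:element>sodium</b:element>
  <a:element symbol="Cl"/>
</report>
"""

root = ET.fromstring(DOC)


def count_by_namespace(root, local_name):
    """Count tags with the given local name, grouped by namespace URI."""
    counts = {}
    for node in root.iter():
        # ElementTree stores qualified names as "{namespace-uri}local-name".
        if node.tag.startswith("{"):
            uri, _, local = node.tag[1:].partition("}")
        else:
            uri, local = "", node.tag
        if local == local_name:
            counts[uri] = counts.get(uri, 0) + 1
    return counts


counts = count_by_namespace(root, "element")

# Software written for group A's vocabulary sees only group A's tags.
group_a_symbols = [
    node.get("symbol")
    for node in root.findall("{http://example.org/group-a/chem}element")
]
print(counts, group_a_symbols)
```

The parser keeps the two "element" tags apart without difficulty, but it cannot say what either one means. That knowledge lives in whatever sits behind each namespace URI, which is why the continued availability of the dictionaries matters as much as the syntax.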
Jonathan Goodman, of the Unilever Centre, presented an amusing view from an academic and educational standpoint; see "How Well Are We Using XML in Chemistry?" His group has developed several databases that could lend themselves to being made available in an XML format. But, Goodman asked, what would be the immediate benefit? Quite simply, there would be none, he stated. Should IUPAC take a clear lead in laying down guidelines on the presentation of chemical information in XML, however, it would be worthwhile to take this additional step, as other chemists and projects would then be able to access and use the information more easily.
To conclude, Goodman said, "There is a long way to go before XML is used routinely to improve and enhance chemical communication. However, XML friendly structures are already in place, and this should mean that a lot of data can easily be moved to this marked-up language. If an XML-based standard is accepted, then this process could be very rapid and data could be shared and reused much more easily than is now possible."
This supported the views of McMahon, who had commented that to generate an XML file from CIF would be a simple enough task, but questioned whether this would be "good" XML and "fit for purpose." Goodman and McMahon agreed that IUPAC needed to identify the customers who would benefit from XML projects. This includes clearly identifying stakeholders who will make the effort to implement whatever is developed.
Other presentations dealt with XML from various information providers' standpoints. Bill Town from ChemWeb and Sandy Lawson from MDL Information Systems pointed out the difficulties in achieving the uptake of technical developments in large organizations. Efforts have been made across the publishing industry to establish electronic submission and presentation of published papers, but authors still are unhappy about changing their habits. A general discussion was also held on the lack of decent authoring tools.
Kirk Schwall summarized the views of the Chemical Abstracts Service (CAS). According to Schwall, CAS has a collection of highly integrated data that have been organized using SGML since 1994. Since 1997, XML has been used for some data that have required frequent updating and interchangeability. Both the document and authority data collection concepts at CAS have XML as an element of their design. The vast complexity of the operation meant that CAS was forced to handle just about every possible mode of information delivery, with only a small minority of its information suppliers delivering content in an XML format. Even when XML is available it is not used, as the tags are stripped before being regenerated at the end of the document handling process. CAS does have an extensive thesaurus, but this is not publicly available. It was agreed that there is a need for CAS and IUPAC to discuss common ontologies.
Gary Mallard from the U.S. National Institute of Standards and Technology (NIST) summarized XML activities within that organization. According to Mallard, NIST uses XML for standardizing the delivery of the following types of scientific information: numerical data, exchange of instrument/reference data, materials property, and reactions design. The wide range of experience gained by NIST in different fields of scientific information delivery has placed it in a unique position to advise on the strengths and weaknesses of XML in chemistry. Quite often difficulties have arisen over rather banal problems such as unit names not being standardized internationally (e.g., meter vs metre vs mètre), symbols requiring special fonts and characters (e.g., unit °C, prefix µ, and quantity Vemf), or cases in which symbols are not available (or are not standardized internationally) for all units or quantities. Mallard was, however, quick to point out some of the drawbacks of XML. He highlighted the problems associated with files that are essentially uninterpretable if the explanations of the individual labels used are not open and freely available. Mallard recounted that he had created a nice presentation of the various XML efforts underway, but a problem arose when it turned out that several of the reference Web sites essential for the understanding of the ontologies no longer existed.
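The unit-name problem Mallard mentions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CANONICAL_UNITS` is a made-up lookup table, not a NIST or IUPAC resource, and accent stripping is just one of several possible folding strategies.

```python
import unicodedata

# Hypothetical lookup table mapping folded unit names to one agreed symbol.
CANONICAL_UNITS = {
    "meter": "m",
    "metre": "m",
    "degree celsius": "°C",
}


def fold(name):
    """Lower-case a unit name, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def unit_symbol(name):
    """Return the agreed symbol for a unit name, or None if it is unknown."""
    return CANONICAL_UNITS.get(fold(name))


for spelling in ("meter", "Metre", "mètre", "furlong"):
    print(spelling, "->", unit_symbol(spelling))
```

The three spellings of the same unit resolve to one symbol, while a name missing from the table returns `None` instead of a guess. The code is trivial; the real work is agreeing on the table and keeping it published.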
Some of the attendees at the IUPAC Strategic Meeting on XML in Chemistry (from left to right): Robert Lancashire, Bill Town, Jonathan Goodman, Sandy Lawson, Peter Murray-Rust, Kirk Schwall, Brian McMahon, Alan McNaught, Gary Mallard, Steve Stein, David Moore, Steve Heller, Bobby Glen, Kirill Degtyarenko, Richard Cammack, Peter Lampen, and Tony Davies.
A Project for IUPAC
At the conclusion of this very successful meeting, Steve Stein of NIST was appointed to draft a project proposal to IUPAC on "Standard XML Data Dictionaries for Chemistry." In addition, a group of volunteers was established as a task group to support this project. The group plans to give a presentation at the coming CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry, to be held in Columbus, Ohio, on 1 July 2002.5
The future is always difficult to predict, and those who are brave or foolish enough to attempt it are usually proved wrong, often before their predictions go into print. However, I would like to put one point at the end of this summary: IUPAC is in an excellent position to provide a vital service to the scientific community by assisting in the development of information technology in chemistry and associated sciences. This is probably a unique situation in the history of IUPAC because those championing this work clearly understand both the need to work fast and the inherent limitations of working within an IUPAC framework, as shown by the dos and don'ts list from the Brisbane meeting. I wish them all the best and hope to see all of you at the IUPAC/CAS conference in July.
I would like to thank Ian Michael for permission to use my original column published in Spectroscopy Europe,6 as the basis for this extended report, and Henry Rzepa, Peter Murray-Rust, Jonathan Goodman, Brian McMahon, Gary Mallard, and Kirk Schwall for their contributions. I would also like to thank Bobby Glen for hosting the conference and all those who attended the meeting, whether it was just to learn and report back to their IUPAC bodies or to assist with the drive for standardization of scientific IT. It is a hard road we tread and one with few rewards. After all, no one ever won a Nobel Prize for enabling communication among scientists!
6. A. N. Davies, "XML in Chemistry," Spectroscopy Europe 14(1), 22-24 (2002) <www.spectroscopyeurope.com/td_col.html>
Antony N. Davies is secretary of the IUPAC Committee on Printed and Electronic Publications, chairman of the IUPAC Working Party on Spectroscopic Data Standards, and JCAMP-DX external professor, University of Glamorgan, Wales, United Kingdom.