Publicly Available Published by De Gruyter May 24, 2017

The IUPAC Gold Book

An Exemplar for IUPAC Asset Digitization

Stuart Chalk and Leah McEwen
From the journal Chemistry International

Abstract

As IUPAC approaches its 100th anniversary, it is important to re-evaluate the mode(s) in which it operates to sustainably support the standards that are vital to the chemical enterprise: chemical naming, chemical identification, and essential terms for commerce and the community. In addition, the importance of IUPAC’s underpinning of the digital representation of chemistry in the current ‘big data’ environment cannot be overstated. Sustainable global support for the digital use of IUPAC assets is vital to the chemical sciences.

In 2014, Jeremy Frey promoted the idea of a “Digital IUPAC,” or “iUPAC,” to highlight the need to make chemical information readable by computers as well as by humans. [1] Specifically, he stated that,

“The comprehensive conversion of IUPAC’s knowledge base of standards and definitions from human-readable to computer-readable form is essential. It is vital that this conversion be done now, as a matter of extreme urgency, if IUPAC is to maintain its role as the international authority for the chemical sciences.” [2]

We cannot agree more with this statement. Three years later, we argue that it is not only important for chemistry, but for all related sciences, as there are significant gaps in the development of semantic representation of chemical and biological entities and chemical concepts.

In this article, we present a discussion of the IUPAC Gold Book [3] as an exemplar of an important asset that needs to be digitally represented for preservation, maintenance, dissemination, programmatic access, and semantic application. This is an immensely important activity for IUPAC when we consider the number of volunteer-hours invested over the years in the development of the color books upon which the Gold Book is based. This effort continues previous work to formalize a future-looking digital management plan that could be applied to other IUPAC assets. [4] Issues that fall out of the development process are likely to be important in other digitization efforts and a proactive stance can inform appropriate policies and procedures for implementing and sustaining future projects.

The Gold Book is a compendium of authoritative chemical terminology originally compiled from IUPAC recommendations published by the scientific divisions of the Union. Since standardized naming of compounds became important, IUPAC has been involved in defining standards for chemistry. This effort has resulted in a series of ‘color’ books that define concepts in chemistry, many of which are integrated into the Gold Book. Scientific divisions in IUPAC are responsible for updating terms and drafting new definitions, which are then ratified by the IUPAC Interdivisional Committee on Terminology, Nomenclature and Symbols (ICTNS) and published in the Color Books or in IUPAC’s premier journal, Pure and Applied Chemistry (PAC). The Gold Book provides a unified portal into these definitions. The compendium was initiated by Victor Gold in the early 1980s [5] and has undergone several revisions to add and modify terms [6]. It was compiled online in 1997 in Portable Document Format (PDF) by Alan D. McNaught and Andrew Wilkinson of the Royal Society of Chemistry [7] and has since evolved into a web-based form to facilitate global access and use of these important terms.

The first digital version of the Gold Book was originally conceived in 2002 as part of an initiative to translate existing IUPAC standard terminologies into electronic “data dictionaries” in XML format. [8] This was an early vision of a “Digital IUPAC” and the first digital rendition was released in 2006 at the now familiar namespace, goldbook.iupac.org. Further enhancements to the content, markup, and site functionality were implemented in a project from 2007-2009, including functionality to promote linking and citing of the term definitions. [9] At this time, the Gold Book was registered in the CrossRef system of scholarly publishers, and each term record was assigned a Digital Object Identifier (DOI) for persistent referral back to IUPAC authority name-space. A snapshot of the interface is shown in Figure 1.

At the time of this writing, the current project has begun a move of the Gold Book to a new website built on a relational database, from which the pages can be dynamically created. This change will make the website more manageable through the ease of updating the scripts that are used to generate pages, more secure through frequent software updates, and more robust via integrity checking of the data in the database. Fear not though, the changes that will be enacted will not change the content of terms and relationships between them.
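The integrity checking mentioned above can be illustrated with a small relational sketch. This is only an assumption about how such a database might be laid out: the table names (`term`, `term_link`), the column names, and the sample rows are hypothetical and are not taken from the Gold Book project. It uses SQLite from the Python standard library to show how a foreign-key constraint rejects a link to a term that does not exist.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: one table of terms, one table of links between terms.
conn.executescript("""
CREATE TABLE term (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL UNIQUE
);
CREATE TABLE term_link (
    from_id INTEGER NOT NULL REFERENCES term(id),
    to_id   INTEGER NOT NULL REFERENCES term(id),
    PRIMARY KEY (from_id, to_id)
);
""")

# Placeholder rows; real records would carry definitions, sources, DOIs, etc.
conn.execute("INSERT INTO term (id, title) VALUES (1, 'term-a'), (2, 'term-b')")
conn.execute("INSERT INTO term_link VALUES (1, 2)")


def try_link(from_id: int, to_id: int) -> bool:
    """Insert a link; return False if the database rejects it as inconsistent."""
    try:
        conn.execute("INSERT INTO term_link VALUES (?, ?)", (from_id, to_id))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_link(2, 1))   # both terms exist, so the link is accepted
print(try_link(1, 99))  # term 99 does not exist, so the link is rejected
```

With flat HTML files, a link to a missing term would only be noticed when a reader follows it; with constraints of this kind, the database refuses the inconsistent record at the time it is written.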

Lessons Learned for Sustainability

In the Fall of 2016, it was reported that a large portion of the Gold Book website (terms A through G) was not working. An analysis of website snapshots in archive.org’s WayBack Machine [10] shows that, somewhere between June and October, the number of pages declined significantly. As it turns out, over 8000 of the HTML files that contain the text of the terms had just disappeared; to this day there is no explanation for the loss. According to a poster mailed to chemistry departments by the Chemical Abstracts Service (CAS) last year, 50 % of data loss is because of hacking, theft, and loss. If that is the case, then the other 50 % of the loss is due to… what? Servers going down? Data getting corrupted? Formats not being readable? If nothing is safe when it’s on the web, how do we move forward with digital assets securely and sustainably?

It is thanks to the WayBack Machine that the Gold Book website content has been restored. With a little scripting, the missing files could be retrieved from previous snapshots and added back to the website. Based on this event, IUPAC migrated the pages from the Gold Book website to a new (modern) server, where the site could be better managed, actively maintained, and backed up. This event should be taken as the wake-up call that it is. While there was ultimately no loss of data (as far as we know), it could have been much worse. There was little information regarding the complete content and structure of the site as implemented over time and a backup was not known to exist previously.
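The retrieval step described above can be sketched as follows. This is a minimal illustration and not the script that was actually used: it assumes the Wayback Machine's public availability endpoint, and the page name (`EXAMPLE.html`) and the sample response are hand-written placeholders so that the code runs without network access.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint, queried once per page URL.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: str) -> str:
    """Build the lookup URL for the snapshot closest to `timestamp` (YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the archived URL from an availability response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written response in the shape the endpoint returns; the page is hypothetical.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20160601000000/"
                   "http://goldbook.iupac.org/EXAMPLE.html",
            "timestamp": "20160601000000",
            "status": "200",
        }
    }
})

print(availability_query("goldbook.iupac.org/EXAMPLE.html", "20160601"))
print(closest_snapshot(sample))
```

A recovery script would loop over the list of missing page names, fetch each query URL, download the snapshot it points to, and strip any archive-specific markup before restoring the file.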

Lesson learned: Web versions of important IUPAC assets need to be continually managed and documented; backups should never be an afterthought.

Taking a critical look at the Gold Book site in light of this event, we realize not only how valuable it is, but also how complex its contents are. Previous projects put much effort into building a community resource that is not only comprehensive, but also interactive, and of course scientifically rigorous. Looking at the pages with more than a glance reveals a wealth of information and integration. Take for example the encoding of symbols, mathematical equations, and chemical reactions. Each of these components was developed as a Scalable Vector Graphics (SVG) file, an XML-based image format, and then converted into a Portable Network Graphics (PNG) file so that the presentation was preserved on all browsers. This involved much effort, automated scripts, and a naming convention for the files.

Figure 1. The Gold Book website as it appeared from its release in 2005 until 2016.

Figure 2. IUPAC Gold Book term link map for ‘charge-transfer complexes’

Another very important feature of each term is the mapping that shows how the current term relates to other terms in the Gold Book. The implementation of this on the page is in the form of three linked pages that contain image maps with links out to the terms related to a term at the first, second, and third levels (see Figure 2). This visual navigation of the content of the Gold Book can only be done via the web and highlights additional important information about the categorization of a Gold Book term, useful context for those less familiar with the Gold Book’s content. In today’s linked data perspective, image-based navigation is excellent for humans, but it is not a true digital representation of context that computers can interpret, and it can be difficult to maintain over time.
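To make the contrast with image maps concrete, the sketch below shows one way the same first-, second-, and third-level relationships could be derived from link data that a computer can read directly. It is an illustration under stated assumptions: the link table is invented (`term-a` to `term-f` are placeholders, not Gold Book terms), and a breadth-first search is simply one reasonable way to group terms by distance.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical link data: each term maps to the terms its definition links to.
LINKS: Dict[str, List[str]] = {
    "term-a": ["term-b", "term-c"],
    "term-b": ["term-d"],
    "term-c": ["term-d", "term-e"],
    "term-d": ["term-f"],
    "term-e": [],
    "term-f": ["term-a"],
}


def related_by_level(start: str, links: Dict[str, List[str]],
                     max_level: int = 3) -> Dict[int, Set[str]]:
    """Group terms by their shortest link distance (1..max_level) from `start`."""
    levels: Dict[int, Set[str]] = {n: set() for n in range(1, max_level + 1)}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        term, depth = queue.popleft()
        if depth == max_level:
            continue  # do not expand beyond the deepest level requested
        for neighbour in links.get(term, []):
            if neighbour not in seen:
                seen.add(neighbour)
                levels[depth + 1].add(neighbour)
                queue.append((neighbour, depth + 1))
    return levels


print(related_by_level("term-a", LINKS))
```

Because the levels are computed from the data, a link map generated this way stays current whenever a term's links change, whereas a hand-built image map has to be redrawn.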

Lesson learned: Websites need ongoing technology updates and planned migration; IUPAC must support the value of digital assets as well as physical ones.

Given these recent concerns, it is clear that there needs to be a strategy for backing up the website. Going forward, this will involve mirroring the scripts that create the webpages on the new site at GitHub, [11] a popular online code repository, which will allow multiple authorized administrators and developers ready and secure access to the code. In addition to the code, backups of the database behind the new website will be uploaded to GitHub, as well as a mirror of the database on a server maintained by the current project lead at the University of North Florida. GitHub provides tools to document code and features, with complete logs of the changes that have been made and by whom. Let us know if you are interested in helping with this strategy to support the Gold Book website.

Anticipating Usage of the Gold Book for Humans and Computers

Two forward-thinking digital enhancements to the new site are eagerly anticipated: the development of the Gold Book terms as a formal, machine-readable controlled vocabulary (available in different formats), and the on-demand digital publication of Gold Book terms via a web-based application programming interface (API). These enhancements will allow IUPAC to sustainably support digital functionality of the Gold Book as well as access to content.

The Gold Book serves as a basis from which to reference specific concepts using approved formal terminology. Information scientists are clamoring for authoritative controlled vocabularies in machine-readable format to reference in databases, computational models, and other applications. Distributed scientific information systems need to be able to exchange and connect data and controlled vocabularies can facilitate linking across these systems. Of course, controlled vocabularies are not a new thing (your librarian will remind you of this), but the digital application of highly specialized scientific vocabularies is a hot topic right now as we start to understand the ramifications of ‘big data’ for accelerating science.

Machine readability of the Gold Book terms in a controlled vocabulary is only the first step in the logical progression toward the eventual development of an ontology. Formal digital ontologies allow the expression of the meaning of a concept as well as the relationships among concepts, written in a formal language. [12] Ontologies can be referenced by databases to incorporate greater context around data. Figure 3 shows how an ontology entry for the term ‘absorbance’ might be formulated in the Web Ontology Language (OWL) [13] using the Protégé software package. [14] The representation is based on the content of the current Gold Book page, including the source and related terms. This approach could be used to digitally represent the term link maps illustrated in Figure 2. To formalize such structures, IUPAC will need to formulate an approved specification for defining the metadata necessary to correctly characterize and contextualize each term.

Computer-based access to the Gold Book terms can be provided from the new site through an API service. Initially implemented as a proof-of-concept, the API will allow non-HTML versions of the information on each term to be delivered via a documented, standard format. APIs are very common these days on many public sites (e.g., Facebook, Wikipedia, PubChem) and provide a way for web developers to integrate their content accurately with other websites.

As an example, the term absorbance might be available to view in HTML at:

http://goldbook.iupac.org/view/term/absorbance

Figure 3. Sample ontology entry for Gold Book term ‘absorbance’

A more transportable output format could be made available at:

http://goldbook.iupac.org/view/term/absorbance/json

JavaScript Object Notation (JSON) is consistent and succinct and can be read by many programming languages. Offering output in JSON could open up many opportunities for the integration of authoritative IUPAC terms into other websites and would allow the promotion of the service. Note the reference back to the IUPAC-based DOI for the authoritative ‘copy of record’ in the JSON example in Figure 4.
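A short sketch shows how a client might consume such output. It rests on several assumptions: the URL pattern is the proposed one quoted in this article and may not match the deployed service, and the field names (`term`, `definition`, `doi`, `related`) and all values in the sample record are placeholders, not the actual schema or content.

```python
import json

# Hypothetical record: field names and values are placeholders, not the real schema.
SAMPLE_RESPONSE = """
{
  "term": "absorbance",
  "definition": "(definition text as published in the Gold Book)",
  "doi": "10.XXXX/placeholder",
  "related": ["(related term)"]
}
"""


def term_url(term: str, fmt: str = "html") -> str:
    """Build a term URL following the pattern proposed in the article."""
    base = "http://goldbook.iupac.org/view/term/" + term
    return base if fmt == "html" else base + "/" + fmt


def copy_of_record(response_text: str) -> str:
    """Return a resolvable link to the authoritative record named by the DOI field."""
    record = json.loads(response_text)
    return "https://doi.org/" + record["doi"]


print(term_url("absorbance"))
print(term_url("absorbance", "json"))
print(copy_of_record(SAMPLE_RESPONSE))
```

Keeping the DOI in every record means that a site reusing the definition can always point readers back to the IUPAC copy of record rather than to its own cached text.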

The Goldify feature implemented in the previous version of the Gold Book website that allowed automatic addition of term links to other electronic documents could be accomplished via the API in such a way that the usage of the feature could be tracked over time. This in turn could identify major users of the Gold Book and catalyze the development of collaborative projects that make the Gold Book more useful, usable, and relevant. In the long term, there is much potential to develop and manage a multitude of services for Gold Book terms, leveraging their online nature through future projects.
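The kind of automatic linking described above can be approximated in a few lines. This is not the original Goldify implementation: the lookup table is a hypothetical stand-in for a call to the API, the term URL follows the proposed pattern quoted earlier, and the input is assumed to be plain text rather than existing HTML.

```python
import re
from typing import Dict

# Hypothetical lookup table (keys in lower case); a real service would query the API.
TERM_URLS: Dict[str, str] = {
    "absorbance": "http://goldbook.iupac.org/view/term/absorbance",
}


def goldify(text: str, term_urls: Dict[str, str]) -> str:
    """Wrap each whole-word occurrence of a known term in an HTML link."""
    if not term_urls:
        return text
    # Longest terms first so multi-word terms win over their substrings.
    ordered = sorted(term_urls, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        # Look up by lower-cased text but keep the author's original capitalisation.
        url = term_urls[match.group(1).lower()]
        return '<a href="{}">{}</a>'.format(url, match.group(1))

    return pattern.sub(link, text)


print(goldify("The absorbance was measured.", TERM_URLS))
```

If the lookup were served by the API instead of a local table, each request could be logged, which is what would make the usage tracking mentioned above possible.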

Towards Digital IUPAC

It is clear that to make the Gold Book and other valuable IUPAC resources as widely available as possible, they must be implemented digitally on the web in addition to traditional print copies. This vision has been promulgated over time in several projects addressing the IUPAC Color Book corpus. [15, 16] Given the current climate, it must be approached in a way that supports computer as well as human use and honors the rigorous scientific definition process that exists within IUPAC. Many questions arise as we consider what an appropriate web version ‘means’ and how to develop a strategic plan to formulate and sustain digital assets.

  1. Where are the data? the backup? the documentation?

  2. In what format(s) are the data and how do we maintain their integrity?

  3. What processes need to be in place to support the metadata? the web presence? DOI? the formal vocabulary/ontology structures?

  4. What policies address the digital integration of information managed by IUPAC with information managed by other entities?

IUPAC joins many organizations globally going through a transition from physical to digital resources. The process is inevitably iterative and lessons learned along the way uncover many best practices, example use cases, and technology solutions with which to navigate. It is important that IUPAC defines what is important in this process and continues to work toward solutions that fill the needs of members and of the chemistry community. IUPAC Divisions need to plan for digital migration and consider consistent, sustainable, and cost effective solutions to the critical issues of security, integrity, provenance, and management of digital assets. Members at all levels need to engage and actively recruit new members to help shape the Union as it approaches its centennial.

Figure 4. Sample JSON representation of Gold Book term ‘absorbance’

Conclusion

The digital footprint of IUPAC is an important global resource for the Union as it approaches its 100th anniversary. Managing the corpus of assets that IUPAC has developed in this time is an ongoing and monumental task. We all understand how important it is in the long term to curate and promote these contributions so that the chemical community and the Union continue to flourish in the next 100 years.

References

1. Jeremy G. Frey. “Digital IUPAC: A Vision and a Necessity for the 21st Century”, Chem. Int. 36(1):14-16 (2014). https://doi.org/10.1515/ci.2014.36.1.14

2. Rob Smith, Ryan M. Taylor, and John T. Prince. “Current controlled vocabularies are insufficient to uniquely map molecular entities to mass spectrometry signal”, BMC Bioinformatics 16(Suppl 7):S2 (2015). Available at https://doi.org/10.1186/1471-2105-16-S7-S2

3. IUPAC Gold Book. http://goldbook.iupac.org

4. IUPAC project “Backup, Maintenance, and Redevelopment of the IUPAC Gold Book website” https://iupac.org/project/2016-046-1-024

5. Compendium of chemical terminology: IUPAC recommendations. Compiled by Gold, V., Loening, K. L., McNaught, A. N., and Sehmi, P. Blackwell Scientific Publications: Oxford, 1987.

6. IUPAC Project 2001-062-2-027, “Revision of the IUPAC Compendium of Chemical Terminology (the gold book).” Chair: Aubrey D. Jenkins; Group Members: Richard Cammack, Jeremy G. Frey, Anders Kallner, G. Jeffrey Leigh, David Moore, Donald Moss, Gerard P. Moss, Monica Nordberg, Yehuda Shevah. https://iupac.org/project/2001-062-2-027

7. IUPAC Compendium of Chemical Terminology. PDF version, compiled by McNaught, A. D. and Wilkinson, A. 1997.

8. IUPAC project “Standard XML data dictionaries for chemistry” https://iupac.org/project/2002-022-1-024

9. IUPAC project “Enhancement of the electronic version of the IUPAC Compendium of Chemical Terminology” https://iupac.org/project/2007-016-1-024

10. Archive.org WayBack Machine. Available at https://archive.org/web

11. GitHub Version Control Repository. Available at http://github.com

12. Controlled Vocabulary vs Ontology – SemWebTec. https://semwebtec.wordpress.com/2010/11/23/contolled-vocabulary-vs-ontology/

13. OWL 2 Web Ontology Language. Available at https://www.w3.org/TR/owl-overview/

14. Protégé Ontology Editor. Available at http://protege.stanford.edu

15. IUPAC project “Software framework for transformation of IUPAC Color Books to XML” https://iupac.org/project/2007-014-1-024

16. IUPAC project “IUPAC Color Book Data Management” https://iupac.org/project/2013-052-1-024

Published Online: 2017-5-24
Published in Print: 2017-7-26

©2017 by Walter de Gruyter Berlin/Boston