Accessible Published by De Gruyter November 28, 2015

Leveraging Web 2.0 technologies to add value to the IUPAC Solubility Data Series: development of a REST style website and application programming interface (API)

Stuart J. Chalk ORCID logo

Abstract

This paper details an approach to re-purposing scientific data as presented on a web page for the sole purpose of making the data more available for searching and integration into other websites. Data ‘scraping’ is used to extract metadata from a set of pages on the National Institute of Standards and Technology (NIST) website, clean, organize and store the metadata in a MySQL database. The metadata is then used to create a new website at the authors institution using the CakePHP framework to create a representational state transfer (REST) style application program interface (API). The processes used for website analysis, schema development, database construction, metadata scraping, REST API development, and remote data integration are discussed. Lessons learned and tips and tricks on how to get the most out of the process are also included.

Introduction

Over the last 150 years, scientists have generated ever-increasing amounts of research data [1]. As a consequence, at the turn of the last century scholars starting aggregating the results (content-mining) from specific areas of science in order that they and their peers would be able to keep up with scientific progress. As a result, today many science publishers have a plethora of ‘hard-copy’ databases with a wealth of important peer-reviewed data. In addition, in the last 20 years, this process has also been undertaken by many independent scientists taking advantage of the Internet by aggregating information about a specific aspect of their work and ‘self-publishing’ (with very little credit) a website.

As a consequence, in this Web 2.0 [2]/Web 3.0 [3] era there is great need to find automated mechanisms to digitize quality scientific information and publish it in a ways scientists can easily find, download and use, especially given current funding agency movement toward open data [4]. That is, by annotation of the information with unique identifiers (e.g., InChI strings/keys [5]), the use of websites with Application Programming Interfaces (API’s) [6], and script friendly formats (e.g., JavaScript Object Notation – JSON [7–9]). Tools and methods for automated extraction of data from Adobe Portable Document Format (PDF) documents and Hypertext Markup Language (HTML) [10] pages are generally available to the casual hacker but have not been exploited for the purposes of scientific data extraction. The notable exception to this is Peter Murray-Rust’s work on ContentMine [11] a website of resources for retrieving data from webpages and PDF files using a variety of tools and converters.

In this work, we detail the analysis, workflow and scripting behind a project to repurpose data made available on the National Institute of Standards and Technology (NIST) website [12], published from the International Union of Pure and Applied Chemistry (IUPAC) Solubility Database Series [13]. Data exposed on the NIST website was analyzed for its organization and content types, extracted using PHP [14] scripts, stored in a MySQL [15] database and exposed via a new website built using the CakePHP framework – a model-view-controller (MVC) based platform [16] that exposes data via RESTful [17] URLs. The development of this process is discussed along with mechanisms used for integration of data from other website, and lessons learned from the project.

Background

Data scraping

Data scraping [18], a more specific activity than web/page scraping [19], is the process of extracting data from an output (e.g., via a web browser) that was intended for human display. This means that the data of interest has to be isolated from any text that is related to presentation (HTML or CSS [20]) or embedded scripts (e.g., JavaScript [21]).

This process is easily performed by humans, for both printed media and websites, because we can interpret the meaning of the information presented to us (the context). However, manual data scraping is laborious for even small amounts of information. Thus, today, the process is performed by computers programmed in a variety of languages that allow direct access to the content of web pages, for example PHP [14], Perl [22], Python [23], or JavaScript [21]. However, in most cases, there is no unique identification of the information within the HTML source and/or there is no API with which to interact with the data directly. Because of this, the context of the data, its meaning, must be embedded in the code that is written to extract the data and this then relies on the author of the script(s) having a level of knowledge of the domain such that the information can be correctly identified and accurately extracted.

Representational state transfer (REST)

Representational state transfer – REST – is a software architecture [17, 24] with a set of constraints that allows a simple communication between two systems, typically a client and a server where communication between the two is based on the Hypertext Transfer Protocol (http/https) [25]. The REST architecture is resource-based, that is, ‘things’ are what we request information about and in sending a request about a ‘thing’ we get a representation of that ‘thing’ in return (the amount of detail can vary), indicative of the ‘thing’s state. With this perspective on information the http methods; POST, GET, PUT and DELETE are used to implement four common actions performed on a resource; Create, Retrieve, Update, and Delete (CRUD) [26].

The six constraints of a REST architecture are:

  • Uniform interface – a well defined way to communicate requests for information

  • Stateless – the server has no knowledge of the state of any client. Each request to the server must have enough information to transmit the state of the client (if one exists)

  • Client-Server – the system is disconnected, the client cannot assume anything about the server and vice versa, except that a uniform interface exists between the two

  • Cacheable – representations returned from the server can be cached on the client on the server or both

  • Layered System – there can be multiple layers of software involved in retrieving information and/or the information can be retrieved from the servers cache

  • Code on Demand (optional) – the possible transfer of logic (an executable) from the server as part of the transfer of state to the client. This allows the client to do logic, temporarily that is normally only available on the server.

Model-view-controller (MVC)

The Model-view-controller (MVC) architecture [27] is a software design pattern for developing user interfaces. It separates the logic (code) of handling a request from a user (via their web browser) into three parts as shown in Fig. 1. Logic in the controller defines what information to retrieve from the model (database) and how to manipulate it, before sending it to the view, which generates a page for a user. The use of MVC is important in the development of a REST style website as it provides the uniform interface.

Fig. 1: MVC design pattern [27].

Fig. 1:

MVC design pattern [27].

It is also important to point out that many MVC based frameworks are designed to enforce the “Don’t Repeat Yourself” or DRY [28] style of programming where code needed to execute a function is kept in only one place in the software and re-used as needed. This makes a web application much easier to maintain and update.

Application programming interface (API)

An application programming interface (API) [6] is a programming style that exposes the inputs, outputs, operations and data types of a piece of code in a standard documented way. In the context of websites, an API is a definition of the generic format of URLs that will return data from a site, as well as information about the structure of the data returned by each URL. This makes it easy for other software developers to interact with a sites API, as they know how to format requests for information and how to process what is returned.

The Solubility Data Series

The IUPAC Solubility Data Series (SDS) was started in 1979 under Commission V.8 – The Commission on Solubility Data. Since then 103 volumes of data have been published [13] with the 100th volume on the solubility of rare earth metals in water being published in March of 2014. Data included in the volumes is aggregated from the literature, critically reviewed and reported with specific metadata about and/or calculations on the original data [29].

In 1998 IUPAC and NIST signed an agreement to publish the SDS in the Journal of Physical and Chemistry Research Database (JPCRD) [30] and subsequently in 1999 a project was started [31] to put the data from all volumes online. In 2004, 18 of the volumes were made available [12] as a searchable online database. Although the agreement was originally intended to make all volumes available in this way, development ceased in 2006 due to a lack of funding.

Resources

The following software was used to scrape, store and display the data obtained from the NIST website; Apache Httpd Server 2.4 (using SSL – https) [32], MySQL 5.6 [15], PHP 5.6 [14], and CakePHP 2.5 [16]. The software was originally installed on an Apple iMac [33] (running OSX 10.9.5) using MacPorts 2.3 [34]. In order to ensure long-term stability of the website it was transferred to a cloud based VMWare virtual machine running FreeBSD [35], administered and supported by University of North Florida (UNF) Information Technology Services (ITS). Since the research was performed, UNF ITS has upgraded Httpd, PHP, and CakePHP on a regular schedule in order to keep the website as secure and stable as possible.

Discussion

The development of this project started when the author noticed the symposium ‘The IUPAC Solubility Data Series: 100 Volumes of Solubility Data Online’ scheduled at the 248th ACS Meeting in San Francisco, CA. Looking into the SDS and the volumes that were available online at NIST, the dataset seemed to have the right setup for a limited scope project to repurpose scientific data as a REST based website, i.e.,

  • Fundamentally important scientific data

  • High quality scientific data

  • Structured website (with documentation)

  • Internationally recognized publishers

After a cursory analysis of the website, to understand its flow and the extent of the data it contained, it was determined that the site was almost ideal for this project. As the infrastructure for the project was already available to the author (server and software) the project was executed over 1 week (50 h) in order to produce a crude ‘alpha’ version of the site discussed in this paper [36]. A full copy of the CakePHP website code, with documentation, and a MySQL dump of the database tables is available on GitHub [37].

Given the nature of this research, evaluation of a quick way to repurpose existing data into a REST style version, it should be noted that the ‘finished’ website is not a complete re-invention of the current one. Due to time, complexity of the data on the system detail page, and errors in the original conversion of the data from PDF files, the following issues were not addressed in this project:

  • Data for solubility is presented in tables on the NIST webpages and this proved to be difficult to parse into individual solubility data values with units. Thus, the solubility data is stored in JSON encoded tables in the new website for convenience and display.

  • Chemicals should correctly be linked to system types (see below) but are instead linked to individual system data reports as it was easier to identify individual chemicals from the system detail pages. This causes some duplication of chemicals and the addition of ill defined chemical entries (e.g., seawater) in the ‘chemicals’ table.

  • Inconsistencies in the representation of citations led to misidentification of authors, duplication of citations, and inaccurate/no DOI assignment. Citation refno’s were also not extracted from individual system data reports as they were not linked on the page.

As a result of the issues above, anyone accessing the source files for this project should understand that the database and website are not a finished product and are made available primarily to understand the methods and structure needed to implement a project like this.

Schema development

Schema, derived from the Greek word meaning ‘plan’, can be thought of in the context of information science as the way things (data) are related to each other. In order to understand the needs for this project in terms of the REST based API, the PHP scripts written in conformance with the CakePHP ‘model’, and the MySQL database, it was necessary to untangle the available data on the NIST website and evaluate how it fit together.

Figure 2 shows a conceptual picture of the schema for the solubility data available on the websites, that includes semantic annotation of the relationships between the different types of data. The implementation of this schema can be done in a variety of formats using software of the user’s choosing, e.g., SQL database, No-SQL database, MS Excel [38], ElasticSearch [39], etc.

Fig. 2: Schema for the data available on the NIST SDS website [12].

Fig. 2:

Schema for the data available on the NIST SDS website [12].

Database development

Based on the schema in Fig. 2 the development of a schema for MySQL was relatively straightforward. Things in green in Fig. 2 were added as database tables. For data relating authors to citations and chemicals to systems, both of which are many-to-many relationships [40], join tables [41] were also added. Because of the time and complexity of the location of the data in the data tables in the SDS system pages, tables for ‘data’, ‘evaluator’, and ‘references’ (citations in the critical evaluation/experimental remarks provided by the evaluator) were not implemented. These data were stored in fields in the tables (data) and systems (evaluator and references) database tables. The final MySQL schema is shown in Fig. 3.

Fig. 3: MySQL database schema for the NIST SDS website [12].

Fig. 3:

MySQL database schema for the NIST SDS website [12].

Page analysis and data scraping

In order to extract data from the NIST website an analysis of the website structure (links) and page content was performed prior to scraping. Based on the structure and/or the amount of data the extraction was done by either; (i) exporting the page as text and using a text editor to clean up the content so that it was ready to be imported into MySQL, or (ii) importing the page into a PHP script, processing the content, and adding it to MySQL through the CakePHP connection (the ‘Model’) to MySQL.

Under the search page on the NIST site the volumes available online are accessible via a single page (http://srdata.nist.gov/solubility/dataSeries.aspx) with links identified by volume number and title, and by a ‘dataSeries’ identifier (‘nistid’ field in the ‘volumes’ table) coded into the URL link. Data was extract by export as a text file, followed by cleanup, and manual import into MySQL. However, for this paper, the process was also re-written as PHP code in order to show how simple HTML files can be processed in PHP and added to a MySQL database. Analysis of the HTML code of the page (Fig. 4a) shows that each of the volumes on the page is a link with a specific structure. Figure 4b shows a PHP script that captures the content of HTML in a PHP array, removes any unneeded array elements, and then uses the explode [42] function to cut the remaining text for each volume into the $volume, $title, and $nistid variables before then saving them to the MySQL database (Fig. 4c). After the volume, title, and nistid values are saved to the ‘volumes’ table, the reference and URL fields were added manually (the id and updated fields are added automatically).

Fig. 4: HTML, PHP Code, and MySQL database used for extraction/storage of NIST SDS volume data.

Fig. 4:

HTML, PHP Code, and MySQL database used for extraction/storage of NIST SDS volume data.

The same process was used for other data extracted from the NIST website, with some of the scripts being more complicated because of the amount/variety of data extracted, the number of database tables accessed/updated, and the integration of data from other websites.

CakePHP URLs for REST and API

In the CakePHP framework, the MVC architecture leads to URLs that fulfill the REST constraint of a uniform interface. In addition, having a well defined database structure makes building the site straightforward. Starting from the MySQL database table structure in Fig. 2 (except the join tables), a controller PHP file is created for each table used in the CakePHP application (in CakePHP you ‘Bake’ a website). Figure 5 shows a controller file for the ‘Volumes’ controller, that co-ordinates access to data in the ‘volumes’ database table and sends it on to the browser via a view file.

Fig. 5: The CakePHP ‘Volumes’ controller.

Fig. 5:

The CakePHP ‘Volumes’ controller.

Each of the functions defined in a controller perform a particular action and the names of the functions become part of the URL to access the action. In addition, any input parameters of a function (e.g., $vol and $format for the ‘view’ function above) also become part of the URL, if they are required, i.e.,:

<base_url>/<controller>/<function>/<param1>/<param2>/…

For instance if a user wants to ‘view’ volume 20, then the URL will be:

https://chalk.coas.unf.edu/sol/volumes/view/20 (for HTML output)

To make the REST URL useful as an API, additional ‘format’ options allow the data for this volume to be returned in formats consumable by other web applications, e.g., XML and JSON:

https://chalk.coas.unf.edu/sol/volumes/view/20/xml (for XML output)

https://chalk.coas.unf.edu/sol/volumes/view/20/json (for JSON output)

In this way, URLs can be constructed that allow access to some or all of the data in all the tables in the MySQL database. A comparison of the old and new URLs for accessing data in the SDS is shown Table 1.

Table 1

Comparison of current and REST style URLs used to access SDS data.

VolumesCurrent: /solubility/dataSeries.aspx (list of volumes)
REST: /sol/volumes/index (‘index’ action is optional)
Current: /solubility/sol_sys.aspx?nm_dataSeries=<volID>
REST: /sol/volumes/view/<volID>
System typesCurrent: /solubility/sys_category.aspx (list of system types)
REST: /sol/systemtypes/index
Current: /solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
REST: /sol/systemtypes/view/<typeID>
SystemsCurrent: /solubility/casNO.aspx (no ‘index’ option, search by CAS#)
REST: /sol/systems/index
Current: /solubility/sol_detail.aspx?sysID=<sysID> (different sysID)
REST: /sol/systems/view/<sysID>
CitationsCurrent: /solubility/citation.aspx (no ‘index’ option, search by author/year)
REST: /sol/citations/index
Current: /solubility/citation_detail.aspx?REF_NO=<refno>
REST: /sol/citations/view/<id> (database id used instead of refno)

Website integration

During the process of extraction of chemical and citation information from the system detail pages, additional information for the MySQL tables was retrieved if available.

For chemicals, the CAS# or chemical name was used to search the Chemical Identifier Resolver (CIR) website [43] to retrieve InChI strings and keys, and if found, the InChI string was used to determine if the chemical was already in the ‘chemicals’ MySQL table. The CIR site has a very clean REST based API (below) that allows users to access different types of data for over 80 million small molecules. The “structure identifier” can be a name, a CAS#, an InChI Key, a SMILES string, or one of a number of other specialized identifiers. The “representation”, defines what is returned to the user and can be any of the structure identifiers or an image of the structure in GIF or PNG format. Finally, if “format” is omitted an HTML page is returned or if “format” is XML then that is returned (JSON is not available).

http://cactus.nci.nih.gov/chemical/structure/“structure identifier”/“representation”/“format”

Figure 6 shows a code snippet using the CIR REST API to obtain the standard InChI string for a chemical by CAS# ($fields[2]) and if that is not successful, by name ($fields[0]) which is URL encoded [44] to ensure any special characters are URL compatible.

Fig. 6: CakePHP code snippet for retrieval of InChI strings for chemicals from the CIR website [43].

Fig. 6:

CakePHP code snippet for retrieval of InChI strings for chemicals from the CIR website [43].

For citations, CrossRef [45] provides a number of mechanisms to search for Digital Object Identifiers (DOI’s) [46]. CrossRef’s OpenURL [47] interface allows users to find DOI’s of articles using the articles’ metadata (after obtaining a free account). OpenURL [48, 49] is a standard URL format for encoding article metadata using GET variables. Figure 7 shows how metadata extracted from a system data citation is organized and used to search the CrossRef site. In this example, CakePHP’s built-in HttpSocket [50] class is used to send GET variables to the CrossRef OpenURL endpoint.

Fig. 7: CakePHP code snippet showing retrieval of DOIs for citations from the CrossRef website [47].

Fig. 7:

CakePHP code snippet showing retrieval of DOIs for citations from the CrossRef website [47].

Lessons learned, tips and tricks

Data in HTML pages can be well organized but is often not and CSS and JavaScript code get in the way. Exporting HTML pages as text loses the structure you need to be able to find information in the file, so the use of an HTML cleaning tool, or cleanup function within the script used is highly recommended before extraction. XHTML is better than HTML as it can be processed using XML tools, however again watch for errors and beware of special characters which need to be coded as entities [51] in XML which are especially prevalent in non-English web pages. Although there are many options for extracting data from text strings readers are encouraged to use regular expressions (regex) [52] wherever more complicated, or programmatically generated encoding of data occurs.

Conclusion

This project shows that freely available tools, scripting languages, and web resources can be used to take valuable scientific data and make it available in a much more useful way. The website developed for this project [36] and the associated code [37], are made available to encourage the chemistry community to leverage the work that chemist have done, both the original research and the peer-review of data, by applying this general approach to the extraction of as much important scientific data a possible. Indeed, this will take time and money, but it is fundamentally important to the progress of chemistry.

Subsequent to the completion of this project, the author wrote a proposal to NIST to actually implement this project. As of January 2015 the project was funded (NIST grant 60NANB15D034) with the goal of developing this idea further and incorporating semantic technology.


Corresponding author: Stuart J. Chalk, Department of Chemistry, University of North Florida, Jacksonville FL 32224 USA, e-mail: .

Acknowledgments

Thanks go to Tony Williams, the creator of ChemSpider, for seeding me with the idea of doing a ‘Skunkworks’ project.

References

[1] NSF. National Science Foundation (NSF), Washington, DC (2011). url. .Search in Google Scholar

[2] Wikipedia. (2015). url. .Search in Google Scholar

[3] Wikipedia. (2015). url. .Search in Google Scholar

[4] NSF. p. 35. National Science Foundation (NSF), Wasington, DC (2015). url. .Search in Google Scholar

[5] InChI Trust. (2015). url. .Search in Google Scholar

[6] Wikipedia. (2015). url. .Search in Google Scholar

[7] Anonymous. JSON.org (2015). url. .Search in Google Scholar

[8] ECMA. ECMA International (2015). url. .Search in Google Scholar

[9] Wikipedia. (2015). url. .Search in Google Scholar

[10] W3C. The World Wide Web Consortium (2015). url. .Search in Google Scholar

[11] P. Murray-Rust. (2015). url. .Search in Google Scholar

[12] IUPAC-NIST. National Institute of Standards and Technology (NIST), Giathersburg, MD (2006). url. .Search in Google Scholar

[13] IUPAC. International Union of Pure and Applied Chemistry (IUPAC) (2015). url. .Search in Google Scholar

[14] The PHP Group. The PHP Group (2015). url. .Search in Google Scholar

[15] Oracle. Oracle Corporation (2015). url. .Search in Google Scholar

[16] CSF. Cake Software Foundation (CSF) (2015). url. .Search in Google Scholar

[17] Wikipedia. (2015). url. .Search in Google Scholar

[18] Wikipedia. (2015). url. .Search in Google Scholar

[19] Wikipedia. (2015). url. .Search in Google Scholar

[20] W3C. The World Wide Web Consortium (W3C) (2015). url. .Search in Google Scholar

[21] Wikipedia. (2015). url. .Search in Google Scholar

[22] Perl.org. Perl.org (2015). url. .Search in Google Scholar

[23] PSF. Python Software Foundation (2015). url. .Search in Google Scholar

[24] T. Fredrich. Person eCollege (2012). url. .Search in Google Scholar

[25] IETF. The Internet Engineering Task Force (IETF) (2015). url. .Search in Google Scholar

[26] Wikipedia. (2015). url. .Search in Google Scholar

[27] Wikipedia. (2015). url. .Search in Google Scholar

[28] Wikipedia. (2015). url. .Search in Google Scholar

[29] H. Gamsjager, J. W. Lorimer, M. Salomon, D. G. Shaw, R. P. T. Tomkins. J. Phys. Chem. Ref. Data39, 023101 (2010).Search in Google Scholar

[30] AIP. American Institute of Physics (AIP) (2015). url. .Search in Google Scholar

[31] IUPAC. International Union of Pure and Applied Chemistry (IUPAC) (2002). url. .Search in Google Scholar

[32] ASF. The Apache Software Foundation (ASF) (2015). url. .Search in Google Scholar

[33] Apple. Apple Incorporated (2015). url. .Search in Google Scholar

[34] MacPorts. The MacPorts Project (2015). url. .Search in Google Scholar

[35] The FreeBSD Project (2015). url. .Search in Google Scholar

[36] S. Chalk. University of North Florida (UNF) (2015). url. .Search in Google Scholar

[37] S. Chalk. GitHub.com (2015). url. .Search in Google Scholar

[38] Microsoft. Microsoft Incorporated, Remond, WA (2015). url. .Search in Google Scholar

[39] Elastic. Elastic Incorporated, Mountain View, CA (2015). url. .Search in Google Scholar

[40] Wikipedia. Wikipedia (2015). url. .Search in Google Scholar

[41] Wikipedia. Wikipedia (2015). url. .Search in Google Scholar

[42] The PHP Group. The PHP Group (2015). url. .Search in Google Scholar

[43] NIH. National Institute of Health (NIH), Bethesda, MD (2015). url. .Search in Google Scholar

[44] The PHP Group. The PHP Group (2015). url. .Search in Google Scholar

[45] CrossRef. CrossRef.org (2015). url. .Search in Google Scholar

[46] IDF. International DOI Foundation (IDF) (2015). url. .Search in Google Scholar

[47] CrossRef. CrossRef.org (2015). url. .Search in Google Scholar

[48] Wikipedia. (2015). url. .Search in Google Scholar

[49] NISO. National Information Standards Organization (NISO), Baltimore, MD (2015). url. .Search in Google Scholar

[50] CSF. CakePHP Software Foundation (2015). url. .Search in Google Scholar

[51] Wikipedia. (2015). url. .Search in Google Scholar

[52] J. Goyvaerts. (2015). url. .Search in Google Scholar

Published Online: 2015-11-28
Published in Print: 2015-12-1

©2015 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/