This paper details an approach to re-purposing scientific data presented on a web page, with the sole purpose of making the data more available for searching and integration into other websites. Data ‘scraping’ is used to extract metadata from a set of pages on the National Institute of Standards and Technology (NIST) website, and to clean, organize, and store the metadata in a MySQL database. The metadata is then used to create a new website at the author’s institution using the CakePHP framework, which provides a representational state transfer (REST) style application programming interface (API). The processes used for website analysis, schema development, database construction, metadata scraping, REST API development, and remote data integration are discussed. Lessons learned, and tips and tricks on how to get the most out of the process, are also included.
Over the last 150 years, scientists have generated ever-increasing amounts of research data. As a consequence, at the turn of the last century scholars started aggregating the results (content-mining) from specific areas of science so that they and their peers could keep up with scientific progress. As a result, today many science publishers have a plethora of ‘hard-copy’ databases with a wealth of important peer-reviewed data. In addition, over the last 20 years this process has also been undertaken by many independent scientists, taking advantage of the Internet by aggregating information about a specific aspect of their work and ‘self-publishing’ (with very little credit) a website.
In this work, we detail the analysis, workflow and scripting behind a project to repurpose data made available on the National Institute of Standards and Technology (NIST) website, originally published in the International Union of Pure and Applied Chemistry (IUPAC) Solubility Data Series. Data exposed on the NIST website was analyzed for its organization and content types, extracted using PHP scripts, stored in a MySQL database, and exposed via a new website built using the CakePHP framework – a model-view-controller (MVC) based platform that exposes data via RESTful URLs. The development of this process is discussed, along with mechanisms used for integration of data from other websites and lessons learned from the project.
Representational state transfer (REST)
Representational state transfer – REST – is a software architecture [17, 24] with a set of constraints that allows simple communication between two systems, typically a client and a server, where communication between the two is based on the Hypertext Transfer Protocol (HTTP/HTTPS). The REST architecture is resource-based; that is, ‘things’ are what we request information about, and in sending a request about a ‘thing’ we get back a representation of that ‘thing’ (the amount of detail can vary), indicative of the ‘thing’s’ state. With this perspective on information, the HTTP methods POST, GET, PUT, and DELETE are used to implement the four common actions performed on a resource: Create, Retrieve, Update, and Delete (CRUD).
The six constraints of a REST architecture are:
Uniform interface – a well-defined way to communicate requests for information
Stateless – the server has no knowledge of the state of any client. Each request to the server must contain enough information to transmit the state of the client (if one exists)
Client-server – the system is disconnected; the client cannot assume anything about the server and vice versa, except that a uniform interface exists between the two
Cacheable – representations returned from the server can be cached on the client, on the server, or both
Layered system – there can be multiple layers of software involved in retrieving information, and/or the information can be retrieved from the server’s cache
Code on demand (optional) – the possible transfer of logic (an executable) from the server as part of the transfer of state to the client. This allows the client to temporarily execute logic that is normally only available on the server.
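As a concrete illustration of the resource-based view described above, the mapping from HTTP methods to CRUD actions can be sketched in a few lines of PHP. This is an illustrative sketch only; the function name is invented and is not part of the project code.

```php
<?php
// Minimal sketch: mapping the four HTTP methods to the CRUD actions
// they implement in a REST architecture. Illustrative only.
function crudActionFor(string $httpMethod): string
{
    $map = [
        'POST'   => 'create',
        'GET'    => 'retrieve',
        'PUT'    => 'update',
        'DELETE' => 'delete',
    ];
    if (!isset($map[$httpMethod])) {
        throw new InvalidArgumentException("Unsupported method: $httpMethod");
    }
    return $map[$httpMethod];
}
```

In a REST application, a framework's router performs this dispatch automatically, so the same resource URL behaves differently depending on the HTTP method used.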
The model-view-controller (MVC) architecture is a software design pattern for developing user interfaces. It separates the logic (code) for handling a request from a user (via their web browser) into three parts, as shown in Fig. 1. Logic in the controller defines what information to retrieve from the model (database) and how to manipulate it, before sending it to the view, which generates a page for the user. The use of MVC is important in the development of a REST-style website, as it provides the uniform interface.
It is also important to point out that many MVC-based frameworks are designed to enforce the “Don’t Repeat Yourself” (DRY) style of programming, in which the code needed to execute a function is kept in only one place in the software and re-used as needed. This makes a web application much easier to maintain and update.
Application programming interface (API)
An application programming interface (API) exposes the inputs, outputs, operations and data types of a piece of code in a standard, documented way. In the context of websites, an API is a definition of the generic format of URLs that will return data from a site, as well as information about the structure of the data returned by each URL. This makes it easy for other software developers to interact with a site’s API, as they know how to format requests for information and how to process what is returned.
The Solubility Data Series
The IUPAC Solubility Data Series (SDS) was started in 1979 under Commission V.8 – the Commission on Solubility Data. Since then, 103 volumes of data have been published, with the 100th volume, on the solubility of rare earth metals in water, published in March 2014. Data included in the volumes is aggregated from the literature, critically reviewed, and reported with specific metadata about, and/or calculations on, the original data.
In 1998, IUPAC and NIST signed an agreement to publish the SDS in the Journal of Physical and Chemical Reference Data (JPCRD), and subsequently in 1999 a project was started to put the data from all volumes online. In 2004, 18 of the volumes were made available as a searchable online database. Although the agreement was originally intended to make all volumes available in this way, development ceased in 2006 due to a lack of funding.
The following software was used to scrape, store and display the data obtained from the NIST website: Apache Httpd Server 2.4 (using SSL – https), MySQL 5.6, PHP 5.6, and CakePHP 2.5. The software was originally installed on an Apple iMac (running OS X 10.9.5) using MacPorts 2.3. In order to ensure the long-term stability of the website, it was transferred to a cloud-based VMWare virtual machine running FreeBSD, administered and supported by University of North Florida (UNF) Information Technology Services (ITS). Since the research was performed, UNF ITS has upgraded Httpd, PHP, and CakePHP on a regular schedule in order to keep the website as secure and stable as possible.
The development of this project started when the author noticed the symposium ‘The IUPAC Solubility Data Series: 100 Volumes of Solubility Data Online’ scheduled at the 248th ACS Meeting in San Francisco, CA. On looking into the SDS and the volumes that were available online at NIST, the author found that the dataset had the right characteristics for a limited-scope project to repurpose scientific data as a REST-based website, i.e.,
Fundamentally important scientific data
High quality scientific data
Structured website (with documentation)
Internationally recognized publishers
After a cursory analysis of the website, to understand its flow and the extent of the data it contained, it was determined that the site was almost ideal for this project. As the infrastructure for the project was already available to the author (server and software), the project was executed over 1 week (50 h) in order to produce a crude ‘alpha’ version of the site discussed in this paper. A full copy of the CakePHP website code, with documentation, and a MySQL dump of the database tables are available on GitHub.
Given the nature of this research (an evaluation of a quick way to repurpose existing data into a REST-style version), it should be noted that the ‘finished’ website is not a complete re-invention of the current one. Due to time constraints, the complexity of the data on the system detail pages, and errors in the original conversion of the data from PDF files, the following issues were not addressed in this project:
Data for solubility is presented in tables on the NIST web pages, and this proved difficult to parse into individual solubility data values with units. Thus, the solubility data is stored as JSON-encoded tables in the new website, for convenience and display.
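As a hedged sketch of this storage choice (the column names and values below are invented for illustration, not taken from the SDS data), a scraped table can be serialized to a JSON string for storage in a single MySQL text field and decoded again for display:

```php
<?php
// Illustrative only: a small solubility-style table serialized to JSON
// for storage in a single MySQL text field, then decoded for display.
$table = [
    'columns' => ['T/K', 'solubility'],
    'rows'    => [
        [298.15, 0.0051],
        [308.15, 0.0063],
    ],
];
$json     = json_encode($table);       // string saved in the database field
$restored = json_decode($json, true);  // associative array for rendering
```

This keeps the original tabular layout intact at the cost of making the individual values unsearchable at the SQL level, which is the trade-off described above.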
Chemicals should correctly be linked to system types (see below) but are instead linked to individual system data reports, as it was easier to identify individual chemicals from the system detail pages. This causes some duplication of chemicals and the addition of ill-defined chemical entries (e.g., seawater) in the ‘chemicals’ table.
Inconsistencies in the representation of citations led to misidentification of authors, duplication of citations, and inaccurate or missing DOI assignment. Citation ‘refno’ values were also not extracted from individual system data reports, as they were not linked on the page.
As a result of the issues above, anyone accessing the source files for this project should understand that the database and website are not a finished product and are made available primarily to understand the methods and structure needed to implement a project like this.
Schema, derived from the Greek word meaning ‘plan’, can be thought of in the context of information science as the way things (data) are related to each other. In order to understand the needs of this project in terms of the REST-based API, the PHP scripts written in conformance with the CakePHP ‘model’, and the MySQL database, it was necessary to untangle the available data on the NIST website and evaluate how it fit together.
Figure 2 shows a conceptual picture of the schema for the solubility data available on the website, including semantic annotation of the relationships between the different types of data. This schema can be implemented in a variety of formats using software of the user’s choosing, e.g., a SQL database, a No-SQL database, MS Excel, ElasticSearch, etc.
Based on the schema in Fig. 2, the development of a schema for MySQL was relatively straightforward. Entities shown in green in Fig. 2 were added as database tables. For data relating authors to citations and chemicals to systems, both of which are many-to-many relationships, join tables were also added. Because of the time required and the complexity of locating the data within the data tables on the SDS system pages, tables for ‘data’, ‘evaluator’, and ‘references’ (citations in the critical evaluation/experimental remarks provided by the evaluator) were not implemented. These data were instead stored in fields of the ‘data’ table (data) and the ‘systems’ table (evaluator and references). The final MySQL schema is shown in Fig. 3.
Page analysis and data scraping
In order to extract data from the NIST website, an analysis of the website structure (links) and page content was performed prior to scraping. Based on the structure and/or the amount of data, the extraction was done by either (i) exporting the page as text and using a text editor to clean up the content so that it was ready to be imported into MySQL, or (ii) importing the page into a PHP script, processing the content, and adding it to MySQL through the CakePHP ‘Model’ connection.
Under the search page on the NIST site, the volumes available online are accessible via a single page (http://srdata.nist.gov/solubility/dataSeries.aspx), with links identified by volume number and title, and by a ‘dataSeries’ identifier (the ‘nistid’ field in the ‘volumes’ table) coded into the URL link. Data was extracted by exporting the page as a text file, followed by cleanup and manual import into MySQL. However, for this paper the process was also re-written as PHP code in order to show how simple HTML files can be processed in PHP and added to a MySQL database. Analysis of the HTML code of the page (Fig. 4a) shows that each of the volumes on the page is a link with a specific structure. Figure 4b shows a PHP script that captures the content of the HTML in a PHP array, removes any unneeded array elements, and then uses the explode function to cut the remaining text for each volume into the $volume, $title, and $nistid variables before saving them to the MySQL database (Fig. 4c). After the volume, title, and nistid values are saved to the ‘volumes’ table, the reference and URL fields were added manually (the id and updated fields are added automatically).
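The script in Fig. 4b is not reproduced here, but the explode-based approach it describes can be sketched as follows. The anchor format and the ‘dataSeries’ id values below are assumptions for illustration only; the real script reads the live page and saves each row through the CakePHP Model rather than into a local array.

```php
<?php
// Hedged sketch of explode-based scraping of volume links. The two
// sample anchors below are invented stand-ins for lines of the page.
$lines = [
    '<a href="dataList.aspx?dataSeries=7">Volume 1. Helium and Neon</a>',
    '<a href="dataList.aspx?dataSeries=12">Volume 2. Krypton, Xenon and Radon</a>',
];
$volumes = [];
foreach ($lines as $line) {
    // Cut the dataSeries identifier out of the href attribute
    $parts  = explode('dataSeries=', $line);
    $parts  = explode('">', $parts[1]);       // e.g. ['7', 'Volume 1. ...</a>']
    $nistid = $parts[0];
    $text   = explode('</a>', $parts[1])[0];  // link text: 'Volume 1. Helium and Neon'
    // Split 'Volume N. Title' into its number and title
    [$volume, $title] = explode('. ', substr($text, 7), 2);
    $volumes[] = ['volume' => $volume, 'title' => $title, 'nistid' => $nistid];
    // In the real script, each row would be saved via the CakePHP Model here.
}
```

The repeated calls to explode, each peeling off one delimiter, are the essence of the technique: no HTML parser is needed when the page markup is this regular.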
The same process was used for other data extracted from the NIST website, with some of the scripts being more complicated because of the amount/variety of data extracted, the number of database tables accessed/updated, and the integration of data from other websites.
CakePHP URLs for REST and API
In the CakePHP framework, the MVC architecture leads to URLs that fulfill the REST constraint of a uniform interface. In addition, having a well-defined database structure makes building the site straightforward. Starting from the MySQL database table structure in Fig. 3 (excepting the join tables), a controller PHP file is created for each table used in the CakePHP application (in CakePHP you ‘bake’ a website). Figure 5 shows the ‘Volumes’ controller file, which coordinates access to data in the ‘volumes’ database table and sends it on to the browser via a view file.
Each of the functions defined in a controller performs a particular action, and the names of the functions become part of the URL used to access that action. In addition, any input parameters of a function (e.g., $vol and $format for the ‘view’ function above) also become part of the URL, if they are required.
For instance if a user wants to ‘view’ volume 20, then the URL will be:
https://chalk.coas.unf.edu/sol/volumes/view/20 (for HTML output)
To make the REST URL useful as an API, additional ‘format’ options allow the data for this volume to be returned in formats consumable by other web applications, e.g., XML and JSON:
https://chalk.coas.unf.edu/sol/volumes/view/20/xml (for XML output)
https://chalk.coas.unf.edu/sol/volumes/view/20/json (for JSON output)
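The way such a URL decomposes into controller, action, and parameters can be sketched with a simple parser. This is an illustration of the convention, not CakePHP’s actual router, and the path is taken relative to the application base (e.g., ‘volumes/view/20/json’ rather than the full ‘/sol/…’ URL):

```php
<?php
// Illustrative parser for CakePHP-style REST URLs (not the real router).
// 'volumes/view/20/json' → controller 'volumes', action 'view',
// parameters ['20', 'json'].
function parseRestUrl(string $path): array
{
    $parts = array_values(array_filter(explode('/', $path), 'strlen'));
    return [
        'controller' => isset($parts[0]) ? $parts[0] : null,
        'action'     => isset($parts[1]) ? $parts[1] : 'index', // 'index' is the default action
        'params'     => array_slice($parts, 2),
    ];
}
```

Because every controller follows the same pattern, a client that understands one of these URLs can construct requests for any table in the database, which is what makes the interface uniform.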
In this way, URLs can be constructed that allow access to some or all of the data in all the tables of the MySQL database. A comparison of the old and new URLs for accessing data in the SDS is shown in Table 1.
|Volumes|Current: /solubility/dataSeries.aspx (list of volumes)|
||REST: /sol/volumes/index (‘index’ action is optional)|
|System types|Current: /solubility/sys_category.aspx (list of system types)|
|Systems|Current: /solubility/casNO.aspx (no ‘index’ option, search by CAS#)|
||Current: /solubility/sol_detail.aspx?sysID=<sysID> (different sysID)|
|Citations|Current: /solubility/citation.aspx (no ‘index’ option, search by author/year)|
||REST: /sol/citations/view/<id> (database id used instead of refno)|
During the process of extraction of chemical and citation information from the system detail pages, additional information for the MySQL tables was retrieved if available.
For chemicals, the CAS# or chemical name was used to search the Chemical Identifier Resolver (CIR) website to retrieve InChI strings and keys; if found, the InChI string was used to determine whether the chemical was already in the ‘chemicals’ MySQL table. The CIR site has a very clean REST-based API that allows users to access different types of data for over 80 million small molecules. The “structure identifier” can be a name, a CAS#, an InChI key, a SMILES string, or one of a number of other specialized identifiers. The “representation” defines what is returned to the user, and can be any of the structure identifiers or an image of the structure in GIF or PNG format. Finally, if “format” is omitted an HTML page is returned, while if “format” is XML then XML is returned (JSON is not available).
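The CIR URL pattern described above takes the form /chemical/structure/&lt;structure identifier&gt;/&lt;representation&gt;[/&lt;format&gt;]. A small helper that builds such URLs might look like the following sketch; the host and the ‘stdinchi’ representation name reflect the public CIR service as the author used it, but should be checked against the current CIR documentation:

```php
<?php
// Sketch of a CIR URL builder following the pattern
//   /chemical/structure/<structure identifier>/<representation>[/<format>]
// urlencode() protects special characters in chemical names.
function cirUrl(string $identifier, string $representation, string $format = ''): string
{
    $url = 'https://cactus.nci.nih.gov/chemical/structure/'
         . urlencode($identifier) . '/' . $representation;
    return $format === '' ? $url : $url . '/' . $format;
}
// e.g. cirUrl('50-78-2', 'stdinchi') requests the standard InChI by CAS#
```

A GET request to the resulting URL (e.g., via file_get_contents or an HTTP client) returns the requested representation as plain text, or XML when the format segment is ‘xml’.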
Figure 6 shows a code snippet that uses the CIR REST API to obtain the standard InChI string for a chemical by CAS# and, if that is not successful, by name; in both cases the query value ($fields) is URL encoded to ensure any special characters are URL compatible.
For citations, CrossRef provides a number of mechanisms to search for Digital Object Identifiers (DOIs). CrossRef’s OpenURL interface allows users to find the DOIs of articles using the articles’ metadata (after obtaining a free account). OpenURL [48, 49] is a standard URL format for encoding article metadata using GET variables. Figure 7 shows how metadata extracted from a system data citation is organized and used to search the CrossRef site. In this example, CakePHP’s built-in HttpSocket class is used to send GET variables to the CrossRef OpenURL endpoint.
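A hedged sketch of this step, using plain PHP rather than CakePHP’s HttpSocket: citation metadata is packed into OpenURL GET variables and appended to the CrossRef endpoint. The parameter names follow common OpenURL usage, and the ‘pid’ (account email) and metadata values are placeholders, not data from the project:

```php
<?php
// Illustrative OpenURL query construction for a CrossRef DOI lookup.
// All values below are placeholders; 'pid' would be a registered
// CrossRef account email in a real request.
$fields = [
    'pid'      => 'user@example.org',
    'aulast'   => 'Smith',        // first author's surname
    'title'    => 'Some Journal', // journal title
    'volume'   => '10',
    'spage'    => '123',          // start page
    'date'     => '1999',
    'redirect' => 'false',        // return XML rather than redirecting
];
$query = http_build_query($fields);
$url   = 'https://doi.crossref.org/openurl?' . $query;
// $xml = file_get_contents($url);  // would return XML containing the DOI, if found
```

The returned XML either contains a resolved DOI or indicates that no match was found; inconsistent source metadata is what produced the misassigned and missing DOIs noted earlier.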
Lessons learned, tips and tricks
This project shows that freely available tools, scripting languages, and web resources can be used to take valuable scientific data and make it available in a much more useful way. The website developed for this project and the associated code are made available to encourage the chemistry community to leverage the work that chemists have done, both the original research and the peer review of data, by applying this general approach to the extraction of as much important scientific data as possible. This will take time and money, but it is fundamentally important to the progress of chemistry.
Subsequent to the completion of this project, the author wrote a proposal to NIST to implement this project fully. In January 2015 the project was funded (NIST grant 60NANB15D034), with the goal of developing the idea further and incorporating semantic technology.
This Special Topic article is related to work in the IUPAC Solubility Data Series. Pure and Applied Chemistry welcomes consideration, after the normal review process, of similar occasional contributions which provide added value to IUPAC publications.
Thanks go to Tony Williams, the creator of ChemSpider, for seeding me with the idea of doing a ‘Skunkworks’ project.
NSF. National Science Foundation (NSF), Washington, DC (2011). URL: http://www.nsf.gov/cise/aci/taskforces/TaskForceReport_GrandChallenges.pdf
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Application_programming_interface
ECMA. ECMA International (2015). URL: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Representational_state_transfer
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Create,_read,_update_and_delete
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Model-view-controller
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Don%27t_repeat_yourself
AIP. American Institute of Physics (AIP) (2015). URL: http://scitation.aip.org/content/aip/journal/jpcrd
IUPAC. International Union of Pure and Applied Chemistry (IUPAC) (2002). URL: http://www.iupac.org/publications/ci/2002/2402/sd.html
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/Many-to-many_(data_model)
CrossRef. CrossRef.org (2015). URL: http://help.crossref.org/using_the_open_url_resolver
NISO. National Information Standards Organization (NISO), Baltimore, MD (2015). URL: http://www.niso.org/apps/group_public/project/details.php?project_id=82
CSF. CakePHP Software Foundation (2015). URL: http://book.cakephp.org/2.0/en/core-utility-libraries/httpsocket.html
Wikipedia (2015). URL: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
©2015 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/