Publicly Available Published by De Gruyter May 24, 2017

From Experiments to Knowledge

Reproducibility, Validation and Reuse of Crystal Structure Data

Ian Bruno
From the journal Chemistry International

Abstract

The Cambridge Structural Database (CSD) provides a platform for sharing data generated from X-ray and neutron diffraction experiments. [1] It contains experimental determinations of the 3D structures of over 850 000 small organic and metal-organic compounds. In recent years, the rate of growth of the CSD has shown an upward trend, with over 80 000 datasets deposited in 2016. Collectively, these datasets constitute a rich resource of information about 3D molecular structure. Knowledge derived from the CSD is used by academic and industrial scientists worldwide to help address scientific challenges across a range of domains.

Crystallography as a discipline has established standards and best practices that support the reliable exchange of experimental data. [2] Central to these is the Crystallographic Information File (CIF), [3] which enables the semantic representation of metadata pertaining to the experiment, derived and processed data, and the methods used to determine the structure. [4] Publication of both processed and derived data has been common within the crystallographic community for many years, with conversations now turning to the publication and preservation of raw data. [5]
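As a rough illustration of what the CIF format looks like in practice, the sketch below reads simple tag-value pairs from a small, made-up CIF fragment. The sample values and the helper name `parse_simple_cif` are assumptions for illustration only. The parser deliberately ignores loops, multi-line text fields, and other parts of the full CIF syntax, so it is not a substitute for a real CIF library.

```python
import shlex

# Made-up sample data block; the tags are common CIF data names,
# the values are illustrative only.
SAMPLE_CIF = """\
data_example
_cell_length_a 5.4310
_cell_length_b 5.4310
_cell_length_c 5.4310
_symmetry_space_group_name_H-M 'F d -3 m'
_chemical_formula_sum 'Si'
"""


def parse_simple_cif(text):
    """Return (block_name, items) for single-line tag-value pairs only."""
    block_name = None
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            block_name = line[len("data_"):]
        elif line.startswith("_"):
            # shlex handles the quoted values used in this sample
            parts = shlex.split(line)
            if len(parts) == 2:
                items[parts[0]] = parts[1]
    return block_name, items


block_name, items = parse_simple_cif(SAMPLE_CIF)
print(block_name, items["_cell_length_a"], items["_symmetry_space_group_name_H-M"])
```

Because each data name is defined in a shared dictionary, a program reading `_cell_length_a` can rely on what the value means without any knowledge of the software that wrote the file.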

CSD data deposition services aim to make it easy for researchers to comply with recommendations around best practice and policies for publishing crystal structure data. This includes integration with the community-supported checkCIF validation service that assesses the consistency and completeness of the data. [6] checkCIF issues alerts of varying degrees of severity and journal policies often require that severe alerts are addressed or explained prior to publication.

Datasets published in the CSD are uniquely identified by a Digital Object Identifier (DOI) [7] and can be independently cited. CSD deposition services also encourage depositors to supply an ORCID identifier [8] to enable the reliable and unambiguous association of a researcher with their research output. Whilst many structures are associated with a journal article, an increasing number are separately published as CSD Communications. [9]

A CSD entry includes a chemical representation of the substance studied by the diffraction experiment. This is vital for enabling the effective discovery and reuse of the data, particularly in domains beyond crystallography. Generating this representation uses a combination of automated processes and validation by expert scientists. Automated processes generate diagnostic information indicating probable points of error and a reliability score that helps prioritize manual validation activities. [10]

The combination of chemistry and crystallography in the CSD provides a foundation for software solutions that enable knowledge about molecular shape and interactions to be applied to the design of new molecules and materials in areas such as drug discovery [11] and solid form optimization. [12] A reliable chemical representation makes it possible to generate standard InChIs [13] that can be used to establish interoperability between the CSD and other chemical and biological resources. Links have thus far been established between ChemSpider, [14] PubChem, [15] and the Protein Data Bank. [16]
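The idea of using standard InChIs as a shared key between resources can be sketched as a simple join. The record identifiers, the two dictionaries, and the function `link_by_inchi` below are hypothetical and do not reflect how the CSD or any other resource actually stores or links its data; only the InChI strings (benzene, ethanol, water) are real.

```python
# Hypothetical records from two resources, each mapping a local
# identifier to a standard InChI string.
resource_a = {
    "A-0001": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",   # benzene
    "A-0002": "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3",      # ethanol
}
resource_b = {
    "B-17": "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3",        # ethanol
    "B-42": "InChI=1S/H2O/h1H2",                      # water
}


def link_by_inchi(a, b):
    """Map each identifier in `a` to identifiers in `b` with the same InChI."""
    index = {}
    for rec_id, inchi in b.items():
        index.setdefault(inchi, []).append(rec_id)
    return {a_id: index[inchi] for a_id, inchi in a.items() if inchi in index}


links = link_by_inchi(resource_a, resource_b)
print(links)
```

Matching on an exact string only works if both resources generate the identifier from an equally reliable chemical representation, which is why the validation described above matters for interoperability.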

The CSD is part of a wider ecosystem comprising the technical and social components needed to make crystal structure data available in support of published research and for reuse in the pursuit of new discoveries. Achieving this requires the commitment of researchers, publishers, and repositories alike and is greatly aided by community-based standards and recommendations. The experiences of the crystallographic community demonstrate what is required to provide joined-up systems that support the stewardship of data from instrument through publication and subsequent application across domains. The challenges encountered and lessons learnt in the field of crystallography are potentially applicable to initiatives aimed at achieving similar ends for other types of data relevant to chemistry.

References

1. C.R. Groom, I.J. Bruno, M.P. Lightfoot, S.C. Ward, Acta Crystallogr. Sect. B Struct. Sci. Cryst. Eng. Mater., 72:171–179 (2016). doi:10.1107/S2052520616003954

2. S. Larsen, G. Kostorz, Publication standards for crystal structures http://www.iucr.org/home/leading-article/2011/2011-06-02 (Accessed Jan 20, 2017).

3. S.R. Hall, F.H. Allen, I.D. Brown, Acta Crystallogr. Sect. A Found. Crystallogr., 47:655–685 (1991). doi:10.1107/S010876739101067X

4. S.R. Hall, B. McMahon, Data Sci. J., 15:1–15 (2016).

5. L.M.J. Kroon-Batenburg, J.R. Helliwell, B. McMahon, T.C. Terwilliger, IUCrJ, 4:87–99 (2017). doi:10.1107/S2052252516018315

6. A.L. Spek, Acta Crystallogr. Sect. D Biol. Crystallogr., 65:148–155 (2009). doi:10.1107/S090744490804362X

7. International DOI Foundation, DOI Handbook http://www.doi.org/hb.html (Accessed Jan 20, 2017).

8. ORCID | Connecting Research and Researchers https://orcid.org/ (Accessed Jan 20, 2017).

9. C. Groom, New Communications with the New CSD http://www.ccdc.cam.ac.uk/Community/blog/2016-03-15-new-communications-with-the-new-csd/ (Accessed Jun 20, 2016).

10. I.J. Bruno, G.P. Shields, R. Taylor, Acta Crystallogr. Sect. B Struct. Sci., 67:333–349 (2011). doi:10.1107/S0108768111024608

11. C.R. Groom, T.S.G. Olsson, J.W. Liebeschuetz, D.A. Bardwell, I.J. Bruno, F.H. Allen, Mining the Cambridge Structural Database for Bioisosteres, in: N. Brown (Ed.), Bioisosteres in Medicinal Chemistry, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany, 2012: pp. 75–101. doi:10.1002/9783527654307.ch5

12. P.T.A. Galek, E. Pidcock, P.A. Wood, N. Feeder, F.H. Allen, Navigating the Solid Form Landscape with Structural Informatics, in: Comput. Pharm. Solid State Chem., John Wiley & Sons, Inc, Hoboken, NJ, 2016: pp. 15–35. doi:10.1002/9781118700686.ch2

13. S. Heller, A. McNaught, S. Stein, D. Tchekhovskoi, I. Pletnev, J. Cheminform., 5:7 (2013). doi:10.1186/1758-2946-5-7

14. H.E. Pence, A. Williams, J. Chem. Educ., 87:1123–1124 (2010). doi:10.1021/ed100697w

15. E. Bolton, Y. Wang, P. Thiessen, S. Bryant, PubChem: Integrated Platform of Small Molecules and Biological Activities, in: R.A. Wheeler, D.C. Spellmeyer (Eds.), Annual Reports in Computational Chemistry, Volume 4, Elsevier, Oxford, UK, 2008: pp. 217–240. doi:10.1016/S1574-1400(08)00012-1

16. wwPDB, Data correspondences between the PDB and CSD archives now available. http://wwpdb.org/news/news?year=2015#29-July-2015.

Published Online: 2017-5-24
Published in Print: 2017-7-26

©2017 by Walter de Gruyter Berlin/Boston