Published by De Gruyter, June 16, 2022

BioChemUDM: a unified data model for compounds and assays

  • Michael A. Kappler, Christopher T. Lowden and J. Chris Culberson


We present a simple biochemistry data model (BioChemUDM) that represents compounds and assays for the purpose of capturing, reporting, and sharing both biological and chemical data. We describe an approach to registering a compound based solely on a stereo-enhanced sketch, which removes the need for additional user-specified “flags” at the time of compound registration. We also describe a convention for string-based labels that enables inter-organizational sharing of compound and assay data. By co-adopting the BioChemUDM, we have successfully enabled same-day exchange and utilization of chemical and biological information with various stakeholders.

Corresponding author: Michael A. Kappler, IDEAYA Biosciences Inc, 7000 Shoreline Blvd Ste 350, South San Francisco, CA 94080, USA, E-mail:

Article note: A collection of invited papers on Cheminformatics: Data and Standards.


Special thanks to Sandra Simon (IDEAYA Biosciences) and Jacob Spiegel (Workflow Informatics) for supporting the effort to launch the platform based on the BioChemUDM and for assisting with the writing of this manuscript.


Supplementary Material

The online version of this article offers supplementary material (

Published Online: 2022-06-16
Published in Print: 2022-06-27

© 2022 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit:
