Abstract
International frameworks have been put in place to foster chemical weapons nonproliferation and disarmament. These frameworks feature lists of chemicals that can be used as chemical weapons or precursors for their synthesis (CW-control lists). In these lists, chemicals of concern are described through chemical names and CAS Registry Numbers®. Importantly, in some CW-control lists, some entries, rather than specifying individual chemicals, describe families of related chemicals. Working with CW-control lists poses challenges for frontline customs and export control officers implementing these frameworks. Entries that describe families of chemicals are not easy to interpret, especially for non-chemists. Moreover, synonyms and chemical variants complicate the issue of checking CW-control lists through names and registry numbers. To ameliorate these problems, we have developed a functioning prototype of a cheminformatics tool that automates the task of assessing whether a chemical is part of a CW-control list. The tool, dubbed the Nonproliferation Cheminformatics Compliance Tool (NCCT), is a database management system (based on ChemAxon’s Instant JChem) with an embedded database of chemical structures. The key feature of the database is that it contains not only the structures of the individually listed chemicals, but also the generic structures that describe the entries relative to families of chemicals (Markush structures).
Introduction
Chemical weapons (CW) are weapons intended to cause death or incapacitation through the toxic properties of chemicals. The Chemical Weapons Convention (CWC), a disarmament and nonproliferation treaty that entered into force in 1997, imposes a complete ban on chemical weapons, prohibiting not only their use, but also their development, production, and stockpiling [1], [2], [3]. As further explained in the “CW-control lists implemented in the NCCT database” section, to support its verification regime and declaration requirements, the CWC features three schedules that list key toxic chemicals on which chemical weapons can be based as well as precursors for their synthesis.
However, far from being a relic of the past, chemical weapons have been used in recent years on several occasions, although the chemical weapons landscape has changed. Current chemical weapons attacks do not involve large quantities of chemicals like those that were deployed in World War I. Instead, smaller quantities of chemicals have been used by Syria for counterinsurgency operations during its civil war and by the Islamic State against civilians and government forces in Iraq and Syria [4], [5], [6], [7], [8]. Even smaller quantities of chemicals have been employed for targeted assassinations and assassination attempts, most recently for the attempted murder of political activist Alexei Navalny with a Novichok agent [9, 10].
Due to the changing chemical weapons landscape, it is more important than ever to control even small quantities of toxic chemicals that can be used as chemical weapons and precursors for their synthesis. Hence, all possible variants of a given chemical, including those that are not typically produced on a large scale, must be taken into account [11]. This is consistent with a recent recommendation from the Scientific Advisory Board (SAB) to the Organisation for the Prohibition of Chemical Weapons (OPCW), the international organization that supports the implementation of the CWC. Specifically, the SAB recommended that “the isotopically labelled compound or stereoisomer related to the parent chemical specified in the schedule should be interpreted as belonging to the same schedule” [12, 13].
Several frameworks at the national and international level have been put in place to foster CW nonproliferation and promote chemical disarmament, including the CWC, Australia Group, Wassenaar Arrangement, the World Customs Organization’s Strategic Trade Control Enforcement Programme, and European Union export controls. To support their missions, these frameworks contain lists of chemicals that can potentially be employed as chemical weapons or precursors for their synthesis. In all these lists, chemicals are identified through chemical names and, whenever available, registry numbers from the American Chemical Society’s Chemical Abstracts Service (CAS) – CAS Registry Numbers®, also referred to as CAS RNs® [11, 14]. This is well illustrated by the snippet of a CW-control list provided in Fig. 1 [15].

A snippet of CWC Schedule 1 from the OPCW website (https://www.opcw.org/chemical-weapons-convention/annexes/annex-chemicals/schedule-1). The figure shows the first four entries of the schedule. The first three describe three families of chemicals, namely two families of nerve agents of the G-series and one family of nerve agents of the V-series. Specific examples are given for each of the three families. The fourth entry is an individually listed chemical belonging to the sulfur mustards family.
Very importantly, in some of the CW-control lists, some entries are not individual chemicals. They are families of related chemicals based on a common scaffold with variable substituents. These entries do not have CAS RNs associated with them, as CAS RNs only pertain to individual chemicals, not families of chemicals. For instance, the first entry of the first of the three schedules featured in the Annex on Chemicals of the CWC (Schedule 1) defines a large family of nerve agents that includes sarin and soman, which are listed as specific examples, as well as a large number of additional agents that are not explicitly listed. The entry is defined as “O-Alkyl (≤C10, including cycloalkyl) alkyl (Me, Et, n-Pr or i-Pr)-phosphonofluoridates”, where Me, Et, n-Pr and i-Pr stand for methyl, ethyl, normal-propyl, and isopropyl, respectively. As explained by Pontes and coworkers, depending on the definition accepted for the word cycloalkyl, that entry may encompass in excess of two million different chemicals. Other listed families are so large that they cannot be enumerated [16], [17], [18].
Working with lists of chemicals and identifying what chemicals are covered by them poses some challenges for frontline officers implementing these frameworks, such as export control officers and customs officials, as well as employees of chemical manufacturing, shipping, and logistics companies. In particular, entries that describe families of chemicals are not easy to interpret, especially for non-chemists. Moreover, synonyms and chemical variants complicate the issue of checking control lists through names and CAS RNs [11]. By way of example, as illustrated by Pontes and coworkers, more than 80 different synonyms can be found for the nerve agent sarin in chemical databases such as CAS, the Royal Society of Chemistry’s ChemSpider, and the National Library of Medicine’s PubChem [16]. Similarly, CAS contains 17 different variants of sarin with different CAS RNs. Beyond the plain version of the molecule, these include isotopically labelled variants, specific isomers, hydrates, and cyclodextrin-enclosed variants [16].
Through this work, we are endeavoring to make it easier for a wide range of relevant stakeholders, many of whom do not have extensive training in chemistry, to identify whether a chemical under scrutiny is covered by a given list of chemicals that are controlled for security concerns. Among others, possible stakeholders include customs officials, export control officers, and employees of chemical manufacturing, shipping, or logistics companies. To simplify the task of assessing whether a chemical is part of a CW-control list, we propose the development and adoption of a cheminformatics tool, consisting of a database management system with an embedded database of relevant CW-control lists, that can automate this process. We call this system, of which we have developed a working prototype, the Nonproliferation Cheminformatics Compliance Tool (NCCT).
Results and discussion
General description of the IJC-based NCCT prototype
The NCCT prototype is a desktop-based database management system with an embedded database of CW-control lists (the NCCT database). We have developed the NCCT prototype through Instant JChem (IJC), a chemical database management system developed by ChemAxon (Fig. 2) [19].

A snapshot of the Simple Federated Search window of Instant JChem, showing the 5 CW-control lists, divided into 11 tables, implemented in the NCCT database.
The IJC-based NCCT prototype allows a user to input a query chemical, use the entered query to search the embedded NCCT database, and retrieve a list of NCCT Database entries that match the query chemical, if any (Fig. 3). Specifically, through the Instant JChem graphical user interface (GUI), the user can input a query chemical into the IJC-based NCCT prototype in a variety of ways, including CAS RNs, chemical names, and structural identifiers such as SMILES or the IUPAC-developed InChI codes.[1] Alternatively, the user can draw the 2D structure of the query chemical through the Instant JChem GUI. The Instant JChem engine then converts the chemical, no matter which way it has been entered, into a chemical structure. Subsequently, it establishes the equivalence of chemical variants by standardizing the query (Fig. 4). The standardized query structure is then checked against the standardized NCCT database, and the database entries that match the query, if any, are retrieved.

Flowchart showing the functioning of the IJC-based NCCT prototype. A query chemical, in various formats, is entered through the IJC interface (see Fig. 4 for further details). Through the IJC’s Simple Federated Search function, the standardized query is checked against the standardized NCCT database, in Derby format. The database entries matching the query are retrieved, if any.

A query chemical can be entered into IJC in a variety of ways, including by entering a CAS RN, a chemical name, or a structural identifier. The input will be converted into a 2D chemical structure. Alternatively, a 2D chemical structure can be sketched through the IJC interface. By subjecting the query to a standardization process, the equivalence of chemical variants is established. The standardized query is ready to be checked against the standardized NCCT database.
A key feature of the IJC-based NCCT prototype is that the NCCT database contains not only the structures of the individually listed chemicals, but also the generic structures that describe the entries relative to the families of chemicals, encoded as Markush structures.[2] Another key feature of the IJC-based NCCT prototype is the mentioned standardization process applied to the NCCT database tables as well as to the query structures, which establishes the equivalence of different variants of the same chemical, including stereoisomers, protonation states, charges, radicals, salts, and isotopically labelled forms (for further details, see “Methods” section). The standardization is particularly important considering the changing chemical weapons landscape, which makes it imperative to control all variants of the chemicals of concern. This approach is consistent with the mentioned SAB recommendation according to which isotopically labelled or stereoisomeric forms of a scheduled chemical should be considered as belonging to the same schedule [12, 13].
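The idea behind such a standardization can be illustrated outside Instant JChem with the layered structure of the IUPAC InChI identifier. The Python sketch below is not the NCCT standardizer; it is a simplified stand-in that assumes single-component standard InChI strings and simply drops the layers that encode stereochemistry, isotopes, protonation, and charge. The helper name `strip_variant_layers` is hypothetical, and the stereo and isotopic layer values are illustrative only. Salts and mixtures, which have multi-component identifiers, are not handled.

```python
# Standard InChI layers that distinguish variants of one parent skeleton:
# b, t, m, s = stereochemistry; i = isotopes; p = protonation; q = charge.
VARIANT_LAYER_PREFIXES = ("b", "t", "m", "s", "i", "p", "q")


def strip_variant_layers(inchi: str) -> str:
    """Keep only the version, formula, connectivity (c) and hydrogen (h) layers."""
    version, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(VARIANT_LAYER_PREFIXES)]
    return "/".join([version, *kept])


# Pinacolyl alcohol (3,3-dimethylbutan-2-ol) and illustrative variants.
PARENT = "InChI=1S/C6H14O/c1-5(7)6(2,3)4/h5,7H,1-4H3"
ENANTIOMER_A = PARENT + "/t5-/m0/s1"  # one enantiomer (illustrative layer values)
ENANTIOMER_B = PARENT + "/t5-/m1/s1"  # the other enantiomer
LABELLED = PARENT + "/i1+1"           # hypothetical isotopic label on atom 1

if __name__ == "__main__":
    for variant in (ENANTIOMER_A, ENANTIOMER_B, LABELLED):
        # Every variant collapses onto the same parent identifier.
        print(strip_variant_layers(variant) == PARENT)
```

Applying the same reduction to both the query and the stored entries is what lets a variant match its parent; a production tool would perform this step on parsed structures with a cheminformatics toolkit rather than on strings.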
CW-control lists implemented in the NCCT database
As illustrated in the paragraphs below, to date, we have implemented into the NCCT database the CW-control lists from five international frameworks (Fig. 5). Each CW-control list was implemented as one or two separate NCCT database tables, depending on whether their entries required different standardization procedures (for further details, see “Methods” section). In total, the NCCT database comprises 11 database tables, one for each of the three CWC schedules and two for each of the remaining four international frameworks. The total number of entries, including the 3 exceptions listed in the CWC Schedules, is 588. The CW-control lists featured in the NCCT database are also available, in curated and structurally annotated form, at the Costanzi Research website (https://costanziresearch.com/cw-nonproliferation/cw-control-lists/) [14].

The CW-control lists implemented into the NCCT database.
Chemical Weapons Convention (CWC) Schedules. The CWC is an international disarmament and nonproliferation treaty that seeks the complete elimination of chemical weapons. As mentioned, the treaty imposes a complete ban on chemical weapons, prohibiting not only their use, but also their development, production, and stockpiling. With its 193 States Parties, it enjoys almost universal adherence and is the most prominent international framework for chemical disarmament and nonproliferation. To support its verification regime and declaration requirements, in its Annex on Chemicals, the CWC features three schedules of chemicals: Schedule 1, Schedule 2, and Schedule 3. Going from Schedule 1 to Schedule 3, the schedules contain chemicals that, beyond their chemical weapons-related role, have increasingly larger legitimate commercial applications (with Schedule 1 listing chemicals with minimal or no commercial applications, and Schedule 3 listing chemicals with large industrial applications). Each schedule is divided into two parts: toxic chemicals are listed in Part A and precursors are listed in Part B. The CWC Schedules comprise 75 entries, including: 45 individually listed chemicals; 15 families of related chemicals, defined as a central chemical scaffold bearing a number of attached variable chemical groups; and 15 chemicals presented as specific examples for listed families of chemicals. The CWC Schedules also list 3 exceptions, i.e. chemicals that would fall within the scope of one of the listed families but are excluded from the coverage [14, 22].
Australia Group (AG) Chemical Weapons Precursors list. The AG is an informal arrangement between 43 like-minded states committed to preventing the proliferation of chemical and biological weapons through the harmonization of export controls. The AG Chemical Weapons Precursors list comprises a total of 87 dual-use chemicals that can be used as precursors for the synthesis of chemical weapons. Some of these precursors are also covered by the CWC Schedules, while others are not. All of the 87 entries in the AG list are explicitly listed as discrete chemicals [14, 23].
The Wassenaar Arrangement (WA) Munitions List 7 (ML7). The WA is an international framework, adhered to by 42 countries, that was established with the objective of “promoting transparency and greater responsibility in transfers of conventional arms and dual-use goods and technologies.” Within its ML7, the WA features chemical agents, biological agents, riot control agents, radioactive materials, related equipment, components, and materials. In particular, with regard to chemical agents, ML7 comprises all of the CWC Schedule 1 chemicals, with the exception of the Novichok and carbamate entries that were added to CWC Schedule 1 through a recent amendment that entered into force in June 2020 [7, 8]. ML7 also comprises the central incapacitating agent BZ, which is a CWC Schedule 2 chemical, as well as a list of defoliants and a list of riot control agents, neither of which is in the CWC schedules. ML7 comprises 39 entries, including: 27 individually listed chemicals; five families of related chemicals, defined as a central chemical scaffold bearing a number of attached variable chemical groups; six chemicals presented as specific examples for listed families of chemicals; and one entry that lists a mixture of two chemicals (Agent Orange) [11, 24].
European Union Council Regulation 36/2012 (Syria-related list). The European Union tightly regulates the export of dual-use chemicals to Syria, within the scope of the restrictive measures imposed by Council Regulation 36/2012. Beyond posing additional restrictions on the chemicals already present in the general EU export control lists, in Annex Ia and Annex IX, EU Council Regulation 36/2012 lists additional dual-use chemicals that, although being widely used in the chemical industry for non-military purposes, are of security concern when exported to Syria. An example of such chemicals is isopropanol, a dual-use chemical that, beyond its legitimate uses in chemical synthesis or as a disinfectant, is also a precursor, in a highly pure form, for the synthesis of the nerve agent sarin. Annex Ia and Annex IX of EU Council Regulation 36/2012 comprise a total of 80 entries, all of which are explicitly listed as discrete chemicals [11, 25].
The strategic chemicals identified in the World Customs Organization (WCO) Strategic Trade Control Enforcement Implementation Guide (STCE). The WCO is “an independent intergovernmental body whose mission is to enhance the effectiveness and efficiency of Customs administrations.” The WCO maintains a document – the Strategic Trade Control Enforcement Implementation Guide (or STCE) – intended to provide WCO members with “practical assistance related to enforcing strategic trade controls.” In Annex V, the STCE provides a list of strategic chemicals that goes beyond chemicals of CW-proliferation concern and includes chemicals of nuclear, missile, explosive, and military concern. The WCO list of strategic chemicals comprises a total of 304 entries, all of which are explicitly listed as discrete chemicals [11, 26].
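As a quick consistency check, the entry and table counts quoted in this section can be tallied in a few lines of Python. The numbers are taken from the text above; the dictionary keys are just labels chosen for this sketch.

```python
# Entry counts as stated in the text for each implemented CW-control list.
ENTRIES_PER_LIST = {
    "CWC Schedules": 75 + 3,  # 75 entries plus the 3 listed exceptions
    "AG Chemical Weapons Precursors": 87,
    "WA ML7": 39,
    "EU Council Regulation 36/2012": 80,
    "WCO STCE strategic chemicals": 304,
}

# One table per CWC schedule, two tables for each of the other four frameworks.
TABLE_COUNT = 3 * 1 + 4 * 2

TOTAL_ENTRIES = sum(ENTRIES_PER_LIST.values())

if __name__ == "__main__":
    print(f"tables: {TABLE_COUNT}, entries: {TOTAL_ENTRIES}")
```

The tally reproduces the 11 tables and 588 entries reported for the NCCT database.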
Description of the table fields
For each entry in the NCCT database, 12 fields are given (Fig. 6).

The 12 fields featured in the NCCT database. Entry 1A1 of CWC Schedule 1 is given as an example.
The Markush structure field is the field that is searched when the NCCT database is queried. This field is essential for the functioning of the IJC-based NCCT prototype. Despite the fact that Instant JChem labels it as “Markush structure,” this field can be populated with both discrete structures (for individually listed chemicals) and Markush structures (for families of chemicals).
Three fields directly reflect the information provided by the official CW-control frameworks for the listed entries. Specifically, these include the Entry number field, which reflects the number assigned to the entry in the official CW-control list, as well as the Entry name and CAS Registry Number® fields.
The remaining eight fields contain complementary useful information with which we annotated the database [14]. Specifically, these fields include:
three structural identifier fields (namely a SMILES, an InChI, and an InChIKey field);
an overlap field, which indicates whether the chemical in question is also covered by one or more of the other lists of controlled chemicals implemented in the NCCT database;
a category field, which indicates whether the chemical is a chemical weapon agent (indicating the specific class), a precursor for the synthesis of chemical weapons, a defoliant, a riot control agent, or a chemical posing a different threat (explosive, nuclear, missile, or general military concern);
an entry type field, which indicates whether the entry is a family of chemicals, an example pertinent to a family of chemicals, an exception to a family of chemicals (i.e., a chemical that falls within the scope of the family definition but is explicitly excluded from controls in the listing framework), an individually listed small molecule, a polymer, a protein, or a mixture of chemicals.
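The fields described above can be pictured as a simple record type. The following Python sketch is only an illustration of that layout: the class name, attribute names, and the use of a SMILES string in the searched structure field are assumptions made for this example and do not reflect how Instant JChem stores the data internally. The sample record uses the AG entry for pinacolyl alcohol discussed later in the text; its InChIKey is deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NcctEntry:
    """Hypothetical in-memory view of one NCCT database entry."""

    # Searched field: a discrete structure or a Markush structure;
    # None for entries that cannot be represented (polymers, proteins).
    structure: Optional[str]
    # Fields taken directly from the official CW-control list.
    entry_number: str
    entry_name: str
    cas_rn: Optional[str]  # families of chemicals have no CAS RN
    # Annotation fields.
    smiles: Optional[str]
    inchi: Optional[str]
    inchikey: Optional[str]
    overlap: Tuple[str, ...]  # entries in other lists covering the chemical
    category: str
    entry_type: str

    def is_structure_searchable(self) -> bool:
        """Only entries with a stored structure can be hit by a structure query."""
        return self.structure is not None


AG_28 = NcctEntry(
    structure="CC(O)C(C)(C)C",
    entry_number="AG 28",
    entry_name="pinacolyl alcohol",
    cas_rn="464-07-3",
    smiles="CC(O)C(C)(C)C",
    inchi="InChI=1S/C6H14O/c1-5(7)6(2,3)4/h5,7H,1-4H3",
    inchikey=None,  # not reproduced here
    overlap=("CWC 2B14", "STCE 49"),
    category="precursor for the synthesis of chemical weapons",
    entry_type="individually listed small molecule",
)

if __name__ == "__main__":
    print(AG_28.entry_number, AG_28.is_structure_searchable())
```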
Search example 1: addressing the families of chemicals issue
The paragraphs below illustrate how the IJC-based NCCT prototype helps an operator identify whether a chemical under scrutiny falls under the scope of one of the families of chemicals featured in the CW-control lists implemented into the NCCT database.
The chloroethylamine in Fig. 7A is a precursor for the synthesis of the nerve agent VX (shown in the yellow inset) [11]. The CWC does not list this chemical explicitly. Instead, in entry 10 of Part B of Schedule 2 (CWC 2B10), it lists a family of chloroethylamines that comprises the VX precursor. Specifically, the definition provided in CWC Schedule 2 for this entry is: N,N-Dialkyl (Me, Et, n-Pr or i-Pr) aminoethyl-2-chlorides and corresponding protonated salts (Fig. 7B) [29]. This family is characterized by a central scaffold common to all family members and two variable R groups (Fig. 7C).

Panel A: A chloroethylamine that can be employed as a precursor for the synthesis of the nerve agent VX. Panel B: the CWC Schedule 2 entry that describes the family of chemicals that encompasses the chloroethylamine shown in Panel A. Panel C: structural representation of the CWC Schedule 2 family shown in Panel B.
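To make the notion of a scaffold with two variable R groups concrete, the short Python sketch below enumerates the neutral members of the 2B10 family by attaching each allowed pair of alkyl groups to the chloroethylamine scaffold, written as SMILES strings. It is an illustration only: the function and variable names are hypothetical, the protonated salts mentioned in the definition are not generated, and comparing SMILES strings is not how structures are matched in practice, since the same molecule can be written in many ways. A Markush-aware search engine matches structures, not strings.

```python
from itertools import combinations_with_replacement

# The four alkyl groups allowed by the 2B10 definition, as SMILES fragments.
ALKYL_SMILES = {"Me": "C", "Et": "CC", "n-Pr": "CCC", "i-Pr": "C(C)C"}


def enumerate_2b10_family() -> dict:
    """Map each unordered (R1, R2) pair to the SMILES of the neutral amine."""
    members = {}
    for r1, r2 in combinations_with_replacement(ALKYL_SMILES, 2):
        # Scaffold: 2-chloroethyl group on a nitrogen bearing R1 and R2.
        members[(r1, r2)] = f"ClCCN({ALKYL_SMILES[r1]}){ALKYL_SMILES[r2]}"
    return members


FAMILY_2B10 = enumerate_2b10_family()

if __name__ == "__main__":
    for groups, smiles in FAMILY_2B10.items():
        print(groups, smiles)
```

The member with two isopropyl groups, `ClCCN(C(C)C)C(C)C`, is the VX precursor N-(2-chloroethyl)-N-(1-methylethyl)-2-propanamine. This family is small enough to list by hand; for the much larger families in Schedule 1, explicit enumeration becomes impractical, which is why the database stores the generic structure instead.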
The CAS RN for the VX precursor in question is 96-79-7 and the primary name listed by CAS is N-(2-chloroethyl)-N-(1-methylethyl)-2-propanamine. Given this chemical name or this CAS RN, it is not straightforward to infer that this chemical is covered by CWC Schedules. In fact, a frontline officer who tried to match said name and CAS RN with those listed in the CWC Schedules would conclude that the VX precursor is not one of the covered chemicals.
The IJC-based NCCT prototype automates the task of checking whether the VX precursor in question is covered by the CWC Schedules or any of the other implemented international CW-control lists. The operator can enter either the chemical name or the CAS RN® shown in Fig. 8A, both of which will be converted by the Instant JChem engine into a structure (Fig. 8B). The operator can then launch the query and the IJC-based NCCT prototype will search the NCCT database and identify all the entries in the implemented CW-control lists that cover this VX precursor, either as an individual chemical or as part of a family of chemicals. The results will indicate that the VX precursor is covered by CWC Schedule 2, as part of family 2B10 (Fig. 8C). The results will also indicate that the VX precursor is covered as an individual chemical by the AG Chemical Weapons Precursors list, as entry AG 11, and the World Customs Organization STCE list, as entry STCE 19 (Fig. 8C) [23]. Of note, it would have been straightforward to match this VX precursor with entries AG 11 and STCE 19 based on the CAS RN 96-79-7, as these entries list the chemical individually and report its CAS RN. However, it would not have been possible to do the same for entry CWC 2B10, because, as mentioned, the entry covers the whole family of chloroethylamines without explicitly enumerating its members.

A chloroethylamine VX precursor is entered into IJC either through its name or its CAS RN (Panel A). The query is converted into a structure (Panel B), is standardized, and is checked against the NCCT database. The results will indicate that the entered chloroethylamine is a member of the family described by CWC Schedule 2 entry 2B10 and is individually listed by the AG and WCO as entries AG 11 and STCE 19, respectively.
Search example 2: addressing the synonyms and chemical variants issue
The paragraphs below illustrate how the IJC-based NCCT prototype addresses the issues caused by synonyms and chemical variants, thus helping an operator determine whether a chemical under scrutiny is part of a CW-control list even if the supplied chemical name or CAS RN® are different from those found in the list.
The alcohol shown in Fig. 9A is a precursor for the synthesis of the nerve agent soman [30]. It is listed by the AG as entry 28, with the name of pinacolyl alcohol and the CAS RN 464-07-3 (Fig. 9B) [23]. However, the same chemical can be described with many different synonyms, some of which are listed in Fig. 9C. Hence, simply checking whether a chemical name is part of a list is not sufficient to assess whether the chemical in question is covered by that list.

Panel A: Pinacolyl alcohol, a chemical that can be employed as a precursor for the synthesis of the nerve agent soman. Panel B: the AG list entry that describes pinacolyl alcohol. Panel C: examples of chemical names synonymous with pinacolyl alcohol. Panel D: examples of chemical variants of pinacolyl alcohol with CAS RNs different from the one found in the AG list.
Using the CAS RN rather than the name makes things easier, as all these synonyms have the same CAS RN. However, an additional layer of complexity is added by the fact that different variants of the same chemical have different CAS RNs. A few variants of pinacolyl alcohol, including an ionized version of the molecule, two stereoisomers, and an isotopically labelled version, all of which have different CAS RNs, are shown in Fig. 9D. Hence, simply checking whether a CAS RN is part of a list is not sufficient to assess whether the chemical in question is covered by that list, as the chemical in question could be a variant of a chemical whose canonical form is included in a CW-control list.
For the (S) stereoisomer variant of pinacolyl alcohol, given the chemical name (S)-1-tert-butylethanol or the CAS RN 1517-67-5, it is not straightforward to infer that this chemical is covered by a CW-control list. In fact, a frontline officer who tried to match said name and CAS RN with those listed in the CW-control lists would conclude that this chemical is not controlled.
The IJC-based NCCT prototype automates the task of checking whether the variant of pinacolyl alcohol in question is covered by one or more of the international CW-control lists implemented into it. An NCCT operator can enter either the chemical name or the CAS RN shown in Fig. 10A, both of which will be converted by the Instant JChem engine into a structure (Fig. 10B). The operator can then launch the query and the IJC-based NCCT prototype will search the NCCT database and identify all the entries in the implemented CW-control lists that cover pinacolyl alcohol, either as an individual chemical or as part of a family of chemicals. The results will indicate that pinacolyl alcohol is covered as an individual chemical by the CWC Schedules, as entry CWC 2B14, the AG Chemical Weapons Precursors list, as entry AG 28, and the World Customs Organization STCE list, as entry STCE 49 (Fig. 10C).

The (S) isomer of pinacolyl alcohol is entered into IJC either through one of its synonyms or its CAS RN (Panel A). The query is converted into a structure (Panel B), is standardized, and is checked against the NCCT database. The results will indicate that pinacolyl alcohol is listed as an individual chemical by CWC Schedule 2, as entry 2B14, the AG list, as entry 28, and the WCO STCE list, as entry 49. Note how different names are provided for pinacolyl alcohol by CWC Schedule 2 on one hand and the AG and WCO STCE lists on the other hand.
Current limits and future directions
The IJC-based NCCT prototype currently has some limits which we will seek to address as the tool evolves into a mature product.
First, Instant JChem’s interface is designed with chemists in mind, thus making the IJC-based NCCT prototype less intuitive to use for people who lack a certain level of training in chemistry. Going forward, endowing the tool with an easier-to-use interface will be a key aspect of future development efforts. In particular, the development of the next iteration of the NCCT could leverage existing commercial products that were developed with the primary goal of supporting the control of regulated substances, chiefly prescription and illegal narcotic and psychotropic drugs. These tools, which were developed within the scope of the Substance Compliance Service Project of the Pistoia Alliance,[3] include Controlled Substances Squared (CS2), from Scitegrity, and Compliance Checker, from ChemAxon [32], [33], [34]. Both tools are available in a web-based version, with a streamlined interface. Notably, although targeting mainly controlled substances, both tools already have some CW-control lists in their databases [11].
Second, the IJC-based NCCT prototype relies entirely on Instant JChem’s engine for the conversion of names and CAS RNs into structures, and there are several situations in which this conversion fails. It occurs rather often that chemical names cannot be interpreted and converted into structures. Similarly, although less frequently, some CAS RNs cannot be converted into structures. Among other situations, in Instant JChem, this occurs rather commonly for isotopically labelled compounds. For instance, neither the name nor the CAS RN for the isotopically labelled version of sarin shown in Fig. 11A can be converted into a structure by Instant JChem. For CAS RNs, this limit is due to the fact that the numbers cannot be converted into structures algorithmically, as they do not have structural information embedded in them. For the conversion of CAS RNs to structures, a connection to a comprehensive, up-to-date relational database that links the registry numbers to the chemicals that they identify is always needed. Going forward, endowing the NCCT with a robust engine for the conversion of names and CAS RNs into structures will be a key aspect of future development efforts.
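The point that a CAS RN carries no structural information can be made concrete with a short Python sketch. The only thing that can be computed from the number itself is its check digit, which under the published CAS rule equals the sum of the preceding digits, weighted 1, 2, 3, ... from right to left, modulo 10. Anything beyond that requires a lookup. The tiny dictionary below is a hypothetical stand-in for a real registry connection and contains only numbers and names quoted in this article; the function names are also assumptions of this example.

```python
import re
from typing import Optional

CAS_RN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_RN_PATTERN.match(cas_rn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    checksum = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


# Hypothetical stand-in for a registry database: CAS RN -> chemical name.
CAS_TO_NAME = {
    "96-79-7": "N-(2-chloroethyl)-N-(1-methylethyl)-2-propanamine",
    "464-07-3": "pinacolyl alcohol",
    "1517-67-5": "(S)-1-tert-butylethanol",
}


def resolve_cas_rn(cas_rn: str) -> Optional[str]:
    """Return the registered name, or None if the number is valid but unknown."""
    if not is_valid_cas_rn(cas_rn):
        raise ValueError(f"not a well-formed CAS RN: {cas_rn!r}")
    return CAS_TO_NAME.get(cas_rn.strip())


if __name__ == "__main__":
    for number in ("96-79-7", "464-07-3", "1517-67-5", "97-79-7"):
        print(number, is_valid_cas_rn(number))
```

The check digit catches many single-digit typing errors (a mistyped number such as 97-79-7 fails validation), but a valid number that is absent from the lookup table still cannot be resolved, which is exactly the situation described above for less common variants.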

A 32P-labelled version of sarin cannot be entered into IJC by name or CAS RN. However, it can be entered as a SMILES string or an InChI code (Panel A). The query is converted into a structure (Panel B), is standardized, and is checked against the NCCT database. The results will indicate that sarin is covered by the CWC Schedules, as member of family CWC 1A1 (for which it is also listed as a specific example), by the Wassenaar Arrangement ML7, as a member of family b.1.a (for which it is also listed as a specific example), and by the World Customs Organization STCE list, as entry STCE 28.
It is worth pointing out that, rather than using names or CAS RNs, it is always possible to enter the chemical as a structural identifier, such as a SMILES or an InChI code. As mentioned, structural identifiers are text strings that encode a molecular structure [14, 20]. A structural identifier string contains all the information needed to infer the structure of the chemical to which it pertains. Hence, structural identifiers will be successfully converted into a molecular structure by cheminformatics software algorithmically. For instance, although Instant JChem cannot convert the name or the CAS RN of 32P-labelled sarin, the chemical can be successfully entered as a SMILES string or an InChI code (Fig. 11A). The structural identifier will be algorithmically converted by the software into a structure, thus allowing the search to be performed (Fig. 11B). Although the input chemical is an isotopically labelled version of sarin, thanks to the standardization process, the IJC-based NCCT prototype will be able to establish that it is equivalent to non-labelled sarin. The results will indicate that sarin is covered by the CWC Schedules, as member of family CWC 1A1 (for which it is also listed as a specific example), by the Wassenaar Arrangement ML7, as a member of family b.1.a (for which it is also listed as a specific example), and by the World Customs Organization STCE list, as entry STCE 28 (Fig. 11C). Beyond structural identifiers, it is also worth noting that, in many cases, the IJC-based NCCT prototype can convert proper IUPAC names into structures.
Lastly, the Instant JChem function that is used to query the database in the IJC-based NCCT prototype only searches the Markush Structure field of the NCCT database, but not the associated metadata (see “Launching queries” in “Methods” section). Going forward, enabling searches for entries that cannot be searched by structure will be a key aspect of future development efforts. In particular, it will be important to add to the NCCT the ability to query the metadata associated with each entry as well. Coupled with a thorough annotation of NCCT database entries with the synonyms listed in databases such as CAS, ChemSpider, and PubChem, and the CAS RNs associated with all the known variants of the chemical in question, this feature is expected to significantly enhance the robustness of queries based on chemical names and CAS RNs, enabling the identification of controlled chemicals in some of the cases where the conversion of names or CAS RNs to structures does not succeed. Importantly, this feature will also mitigate the current complete inability of the NCCT to query for substances endowed with larger structures, such as biological macromolecules and polymers. Across the CW-control lists implemented into the NCCT database tables, only nine such entries exist (seven organic polymers and two biological macromolecules). For these entries, the Markush Structure field is left empty, as organic polymers have a variable length, thereby lacking a uniquely defined molecular structure, and biological macromolecules have chemical structures that are too large to be handled by the system. Hence, given that, as mentioned, the queries are currently confined to the Markush Structure field of the NCCT database, organic polymers and biological macromolecules are left out of the search.
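A metadata search of the kind envisioned here could be sketched as a simple index from normalized names and CAS RNs to list entries. The Python example below is a hypothetical illustration, not a feature of the current prototype: the record layout, key names, and function names are assumptions, and the single annotation record uses only the pinacolyl alcohol names, CAS RNs, and entry numbers quoted earlier in this article.

```python
def normalize(term: str) -> str:
    """Lower-case the term and collapse whitespace so trivial variations match."""
    return " ".join(term.lower().split())


# Hypothetical annotation records: synonyms and variant CAS RNs per chemical.
ANNOTATIONS = [
    {
        "entries": ("CWC 2B14", "AG 28", "STCE 49"),
        "names": ("pinacolyl alcohol", "(S)-1-tert-butylethanol"),
        "cas_rns": ("464-07-3", "1517-67-5"),
    },
]


def build_index(annotations) -> dict:
    """Map every normalized name or CAS RN to the set of entries it points to."""
    index = {}
    for record in annotations:
        for term in record["names"] + record["cas_rns"]:
            index.setdefault(normalize(term), set()).update(record["entries"])
    return index


def search_metadata(index: dict, term: str) -> list:
    """Return the matching list entries, sorted; empty if nothing matches."""
    return sorted(index.get(normalize(term), set()))


METADATA_INDEX = build_index(ANNOTATIONS)

if __name__ == "__main__":
    print(search_metadata(METADATA_INDEX, "1517-67-5"))
    print(search_metadata(METADATA_INDEX, "  Pinacolyl   Alcohol "))
```

Because such a lookup does not depend on a stored structure, the same mechanism could also serve entries whose structure field is empty, such as polymers and biological macromolecules, provided their names and registry numbers are annotated.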
Conclusions
In summary, for frontline officers, such as export control and customs officers as well as employees of chemical manufacturing, shipping, and logistics companies, it can be difficult to assess whether a chemical under scrutiny is covered by the lists of chemicals embedded in one or more of the international frameworks for the control of chemical weapons and, more generally, chemicals of security concern. As discussed, this is mainly due to two issues: 1) matching a chemical under scrutiny with CW-control list entries that describe whole families of chemicals is a very complex task that cannot be undertaken by individuals who do not have substantial training in chemistry; 2) matching a chemical under scrutiny with CW-control list entries that describe individual chemicals, although an easier task, is significantly complicated by the many synonyms associated with a chemical name and the fact that different variants of the same chemical have different CAS RNs [11].
To ameliorate these issues, we have developed a working prototype of the NCCT, a cheminformatics tool that automates the task of checking whether a chemical under scrutiny is indeed encompassed by one or more CW-control lists either because it falls within the scope of one of the listed families or because it is listed as an individual chemical. As described above, the IJC-based NCCT prototype is a database management system, based on ChemAxon’s Instant JChem platform, with an embedded database of chemical structures.
Through internal tests, we have identified the limits of the IJC-based NCCT prototype, which chiefly revolve around the complexity of the Instant JChem interface, its engine for the conversion of names and CAS RNs into structures, and the restriction of the federated search to the structural field of the database. To guide the development of the NCCT, we have also established a network of relevant stakeholders drawn from international organizations, government agencies, industry, civil society, and academia, including an advisory group, who will assist in the identification of the requirements for a more advanced version of the prototype. Input from stakeholders is key to ensuring that the NCCT adds significant value to current and future efforts to enhance chemical security and prevent the proliferation of chemical weapons. Going forward, we will work with select jurisdictions to subject the IJC-based NCCT prototype to field tests intended to verify the practical usefulness of the tool, further probe its limitations, and assist with identification of the requirements for future iterations of the tool.
Controlling the export of toxic chemicals that can be used as chemical weapons, precursors for their synthesis, and, more generally, chemicals of security concern is paramount when trying to prevent the proliferation of chemical weapons and support chemical security. It is key that the controls be implemented in a thorough and timely manner to prevent illegal transfers of chemicals, while at the same time facilitating legitimate transfers. By bolstering the ability of frontline officers to effectively perform such controls, the NCCT will contribute to ensuring that chemistry be only applied to serve peaceful purposes and support the progress of humanity [11].
Methods
Instant JChem platform
The IJC-based NCCT prototype was built and is managed through ChemAxon’s Instant JChem software (IJC), version 21.8.0 [19]. Specifically, as described below, Instant JChem was used to build the NCCT database and is also used to manage and query it. The Instant JChem software is a desktop-based application that runs on Windows and macOS computers. Installation of Instant JChem is required to query the NCCT database. The NCCT database also resides locally on the desktop computer, in Derby format.
NCCT tables in CSV format
A comma-separated values (CSV) file was created for each of the CW-control tables to be implemented into the NCCT database. Each CSV file had a total of 11 columns, which correspond to the NCCT table fields listed above in “Results” section (Description of the table fields) with the exception of the PubChem URL field, which was created directly in Instant JChem after the import of the NCCT tables (see next section for more details). For each entry, we populated the “Structure” column with the SMILES string retrieved from the Chemical Abstracts Service through the SciFinder-n web tool [35]. When this was not available, the field was left blank. The columns “Entry Number”, “Entry Name”, and “CAS Registry Number®” were populated with the information found in the official version of the CW-control lists. The remaining seven columns were populated with complementary pertinent information (see “Description of the table fields” in “Results” section for further details). Two of these seven columns were populated with InChI structural identifiers and with InChIKey codes, which are the hashed version of InChI codes [14, 20]. InChI and InChIKey codes were derived from the SMILES strings retrieved from SciFinder-n. In particular, the SMILES strings were pasted into the “Draw structure” tool of the National Library of Medicine’s PubChem website [28], where they were converted to their corresponding InChI and InChIKey codes.
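The layout of such a CSV table can be sketched as follows. This is a minimal illustration rather than a copy of the NCCT files: only the four columns named in the text are included, the remaining metadata columns are omitted, and the two rows are simplified examples (sarin with its SMILES string, and ricin with a blank “Structure” field).

```python
import csv
import io

# Columns named in the text; the other metadata columns are left out here.
COLUMNS = ["Structure", "Entry Number", "Entry Name", "CAS Registry Number®"]

# Simplified, illustrative rows (not copied from the NCCT tables).
rows = [
    {"Structure": "CC(C)OP(C)(=O)F", "Entry Number": "1A1",
     "Entry Name": "Sarin", "CAS Registry Number®": "107-44-8"},
    {"Structure": "", "Entry Number": "1A8",
     "Entry Name": "Ricin", "CAS Registry Number®": "9009-86-3"},
]

# Write the table to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)

# Read it back and list the entries whose structure was left blank.
buffer.seek(0)
table = list(csv.DictReader(buffer))
missing = [row["Entry Name"] for row in table if not row["Structure"]]
print(missing)
```

Listing the rows with a blank “Structure” field in this way identifies the entries whose structures have to be supplied by other means after import.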
Implementation of the NCCT database
The NCCT tables in CSV format were imported into Instant JChem, where they were converted into searchable database tables in Derby format through the IJC-embedded Derby database management system (the NCCT database). Each CSV file was converted into an NCCT database table; each row in the CSV file became an entry in the corresponding NCCT database table; each column in the CSV file became a field in the corresponding NCCT database table. The NCCT tables in CSV format were imported as “Markush libraries.” This is a key feature of the NCCT database, as it allows handling of the families of chemicals contained in several CW-control lists. The “Empty structures allowed” feature was turned on. The Instant JChem software automatically converted the SMILES strings in the “Structure” column into two-dimensional molecular structures, which were visually inspected, thoroughly checked for accuracy, and corrected whenever needed. For the individually listed chemicals for which the “Structure” field was left blank, the molecular structures were built with ChemAxon’s Marvin interface, as implemented in Instant JChem. For families of chemicals, the Markush structures were built using ChemAxon’s Markush Editor, version 21.8.0 [36]. The Markush structures were then added to the NCCT database in Instant JChem. Once the NCCT database was implemented, we added a twelfth field featuring clickable PubChem URL links. This was done by adding a static URL field through the Instant JChem interface.
Separation of single component and multiple component substances
Most of the entries in the CW-control lists implemented in the NCCT database contain a single chemical species (single component substances). However, in some CW-control lists, some of the entries, e.g. salts and mixtures, contain multiple chemical species (multiple component substances). In the NCCT database, we split the CW-control lists that feature both single component and multiple component substances into two tables, one with single component substances only and one with multiple component substances only. This is due to the need to apply different standardizers to the single component and multiple component substances (see the paragraph below for further details).
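The split described above can be illustrated with a simple heuristic. The sketch below is an assumption for illustration only and does not reproduce how the NCCT tables were actually partitioned: it relies on the SMILES convention that a dot separates disconnected components, and the `split_table` helper and sample rows are hypothetical.

```python
def is_multi_component(smiles: str) -> bool:
    """In SMILES, a dot separates disconnected components (e.g. the ions of a salt)."""
    return "." in smiles


def split_table(rows):
    """Partition rows into a single-component and a multiple-component table."""
    single, multi = [], []
    for row in rows:
        target = multi if is_multi_component(row["Structure"]) else single
        target.append(row)
    return single, multi


# Illustrative rows only.
rows = [
    {"Entry Name": "Sarin", "Structure": "CC(C)OP(C)(=O)F"},
    {"Entry Name": "Sodium cyanide", "Structure": "[C-]#N.[Na+]"},
]

single, multi = split_table(rows)
print([r["Entry Name"] for r in single])
print([r["Entry Name"] for r in multi])
```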
Standardization of tables and queries
Standardizers were applied to the NCCT database tables to account for the equivalence of different variants of the same chemical as well as different ways of representing the same chemical. The following standardizers were applied to all NCCT database tables: Aromatize, Remove Explicit Hydrogens, Clear Isotopes, and Neutralize. For the NCCT database tables featuring single component substances, the Remove Solvents and Remove Fragment standardizers were also applied to account for the equivalence of salts (these standardizers cannot be applied to lists featuring multiple component substances because they would cause the deletion of one of the components of the entries). The parameters for the Remove Fragment standardizer were set to “Remove smallest” and “Depends on number of heavy atoms” in all the NCCT database tables except for the one relative to CWC Schedule 2, where they were set to “Keep largest” and “Depends on the number of heavy atoms.” The different approach taken for Schedule 2 is due to the fact that, because of the size of one of the Schedule 2 families (entry 2B4), the “Remove smallest” option would not allow Schedule 2 queries to run.
Query structures are automatically standardized during the search process, so that their standardization always matches the one that has been applied to the NCCT table that is being searched. In particular, before a query is run against a specific NCCT database table, it is subjected to the same standardization process to which that specific database table was subjected when the NCCT database was created. Once the search moves to the next NCCT database table, the query is automatically re-standardized to match the standardization process to which the currently searched database table was subjected.
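The logic of per-table standardization can be sketched with a toy example. The code below is not ChemAxon's implementation: the standardizers are crude string manipulations on SMILES, all function and table names are hypothetical, and plain string equality stands in for the graph-based structure matching performed by a real cheminformatics engine (for this reason, the reference SMILES for sarin is written with a bracketed phosphorus atom).

```python
import re


def clear_isotopes(smiles: str) -> str:
    """Toy 'Clear Isotopes': drop mass numbers in bracket atoms, e.g. [32P] -> [P]."""
    return re.sub(r"\[\d+", "[", smiles)


def heavy_atom_count(fragment: str) -> int:
    """Crude estimate: count upper-case letters other than H (aromatic atoms ignored)."""
    return sum(1 for ch in fragment if ch.isupper() and ch != "H")


def keep_largest_fragment(smiles: str) -> str:
    """Toy 'Remove Fragment': keep the dot-separated fragment with most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)


def standardize(smiles: str, standardizers) -> str:
    """Apply a table's standardizers in order."""
    for step in standardizers:
        smiles = step(smiles)
    return smiles


# Each table carries its own pipeline; fragment removal is only safe for
# single-component tables, where it would not delete part of an entry.
TABLES = {
    "single_component": {
        "standardizers": [clear_isotopes, keep_largest_fragment],
        "entries": {"CC(C)O[P](C)(=O)F": "Sarin"},
    },
    "multi_component": {
        "standardizers": [clear_isotopes],
        "entries": {"[C-]#N.[Na+]": "Sodium cyanide"},
    },
}


def federated_search(query: str):
    """Search every table, re-standardizing the raw query for each one."""
    hits = []
    for table_name, table in TABLES.items():
        steps = table["standardizers"]
        key = standardize(query, steps)
        index = {standardize(s, steps): name for s, name in table["entries"].items()}
        if key in index:
            hits.append((table_name, index[key]))
    return hits


print(federated_search("CC(C)O[32P](C)(=O)F"))
print(federated_search("[C-]#N.[Na+]"))
```

In this toy setting, the 32P-labelled query matches sarin once its isotope label is cleared, while the sodium cyanide query is found only in the multiple-component table, where its counterion is not stripped.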
Alternative structural representations
The IJC-based NCCT prototype was tested to verify whether it recognized different structural representations of the chemicals featured in the NCCT database tables. In particular, we ensured that it recognized the representation obtained by inputting into the Instant JChem interface: 1) CAS RN; 2) SMILES string, as found in CAS; 3) SMILES string, as found in PubChem. Whenever a structural representation not recognized by the prototype was found, this alternative structural representation was appended to the relevant NCCT database table.
Launching queries
Query chemicals are searched against the NCCT database using the “Simple Federated Search” function of Instant JChem. The Simple Federated Search allows for the searching of all the NCCT database tables at the same time. The search can also be restricted to one or more of the NCCT database tables, if desired. Of note, the Simple Federated Search only queries the Structure column of the NCCT database tables, not the metadata found in the remaining 11 columns. The search mode is set to “Full” (i.e., the query has to match the entire entry to yield a hit). The search options are set as follows: Stereochemistry: Off; Charges, Isotopes, Radicals, and Valence: Ignore; Vague Bond: Ambiguous aromaticity 5-membered rings; Markush: Homology broad translation; Tautomer: Off. Once these settings are configured, the query can be entered into Instant JChem by inputting its chemical name, its CAS RN, or a structural identifier, all of which are converted by Instant JChem into a molecular structure. Alternatively, the structure can be sketched using Instant JChem’s ‘Sketch Structure’ feature.
Funding source: Global Affairs Canada http://dx.doi.org/10.13039/501100008627
Award Identifier / Grant number: CWC-2020-0001 (P009515)
Research funding: This work was supported by Global Affairs Canada under award CWC-2020-0001 (P009515).
References
[1] S. Costanzi. Kirk-Othmer Encyclopedia of Chemical Technology, pp. 1–32, Wiley Online Library (2020), https://doi.org/10.1002/0471238961.0308051308011818.a01.pub3.
[2] V. Pitschmann. Toxins 6, 1761 (2014), https://doi.org/10.3390/toxins6061761.
[3] K. Ganesan, S. K. Raza, R. Vijayaraghavan. J. Pharm. BioAllied Sci. 2, 166 (2010), https://doi.org/10.4103/0975-7406.68498.
[4] R. K. Hersman, W. Pittinos. Restoring Restraint: Enforcing Accountability for Users of Chemical Weapons, Rowman & Littlefield, Lanham, MD (2018).
[5] R. K. Hersman, S. Claeys. Rigid Structures, Evolving Threat: Preventing the Proliferation and Use of Chemical Weapons, Center for Strategic & International Studies, Washington, DC (2019).
[6] S. Costanzi, J.-H. Machado, M. Mitchell. ACS Chem. Neurosci. 9, 873 (2018), https://doi.org/10.1021/acschemneuro.8b00148.
[7] S. Costanzi, G. D. Koblentz. Nonproliferation Rev. 26, 1 (2019), https://doi.org/10.1080/10736700.2019.1662618.
[8] S. Costanzi, G. D. Koblentz. Arms Control Today 50, 16 (2020).
[9] D. Steindl, W. Boehmerle, R. Körner, D. Praeger, M. Haug, J. Nee, A. Schreiber, F. Scheibe, K. Demin, P. Jacoby, R. Tauber, S. Hartwig, M. Endres, K.-U. Eckardt. The Lancet 397, 249 (2021), https://doi.org/10.1016/s0140-6736(20)32644-1.
[10] S. Costanzi, G. D. Koblentz. Strengthening controls on Novichoks: a family-based approach to covering A-series agents and precursors under the chemical-weapons nonproliferation regime, Nonproliferation Rev. (2022), https://doi.org/10.1080/10736700.2021.2020010.
[11] S. Costanzi, G. D. Koblentz, R. T. Cupitt. Strat. Trade Rev. 6, 69 (2020).
[12] OPCW. Report of the Scientific Advisory Board on Developments in Science and Technology for the Fourth Special Session of the Conference of the States Parties to Review the Operation of the Chemical Weapons Convention, p. 30 (2018), RC-4/DG.1, https://www.opcw.org/sites/default/files/documents/CSP/RC-4/en/rc4dg01_e_.pdf.
[13] C. M. Timperley, J. E. Forman, M. Abdollahi, A. S. Al-Amri, I. P. Alonso, A. Baulig, V. Borrett, F. A. Cariño, C. Curty, D. Gonzalez. Pure Appl. Chem. 90, 1647 (2018), https://doi.org/10.1515/pac-2018-0803.
[14] S. Costanzi, C. K. Slavick, B. O. Hutcheson, G. D. Koblentz, R. T. Cupitt. J. Chem. Inf. Model. 60, 4804 (2020), https://doi.org/10.1021/acs.jcim.0c00896.
[15] OPCW. Annex on Chemicals – Schedule 1, https://www.opcw.org/chemical-weapons-convention/annexes/annex-chemicals/schedule-1.
[16] G. Pontes, J. Schneider, P. Brud, L. Benderitter, B. Fourie, C. Tang, C. M. Timperley, J. E. Forman. J. Chem. Educ. 97, 1715 (2020), https://doi.org/10.1021/acs.jchemed.0c00547.
[17] C. Rücker, M. Meringer, A. Wassermann. J. Chem. Educ. 98, 1465 (2021), https://doi.org/10.1021/acs.jchemed.0c01023.
[18] C. M. Timperley, J. E. Forman. J. Chem. Educ. 98, 1468 (2021), https://doi.org/10.1021/acs.jchemed.1c00134.
[19] ChemAxon. Instant JChem, https://chemaxon.com/products/instant-jchem.
[20] W. A. Warr. WIREs Computational Molecular Science 1, 557 (2011), https://doi.org/10.1002/wcms.36.
[21] D. A. Cosgrove. Scaffold Hopping in Medicinal Chemistry, pp. 15–38, Wiley, Weinheim, Germany (2013), https://doi.org/10.1002/9783527665143.ch02.
[22] OPCW. Annex on Chemicals, https://www.opcw.org/chemical-weapons-convention/annexes/annex-chemicals/annex-chemicals.
[23] The Australia Group. Chemical Weapons Precursors, https://www.dfat.gov.au/publications/minisite/theaustraliagroupnet/site/en/precursors.html.
[24] The Wassenaar Arrangement. Control Lists, https://www.wassenaar.org/control-lists/.
[25] Council Regulation (EU) No 36/2012, https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:02012R0036-20200217&from=EN#tocId2.
[26] World Customs Organization. Strategic Trade Control Enforcement Implementation Guide, http://www.wcoomd.org/en/topics/enforcement-and-compliance/instruments-and-tools/guidelines/wco-strategic-trade-control-enforcement-implementation-guide.aspx.
[27] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu. Nucleic Acids Res. 47, D1102 (2019), https://doi.org/10.1093/nar/gky1033.
[28] National Library of Medicine. PubChem, https://pubchem.ncbi.nlm.nih.gov.
[29] OPCW. Annex on Chemicals – Schedule 2, https://www.opcw.org/chemical-weapons-convention/annexes/annex-chemicals/schedule-2.
[30] J. B. Tucker. Nonproliferation Rev. 16, 363 (2009), https://doi.org/10.1080/10736700903255060.
[31] Pistoia Alliance, https://www.pistoiaalliance.org.
[32] Scitegrity. Controlled Substances Squared, https://scitegrity.co.uk/index.php?page=cs2.
[33] ChemAxon. Compliance Checker, https://chemaxon.com/products/compliance-checker.
[34] D. Taylor, S. G. Bowden, R. Knorr, D. R. Wilson, J. Proudfoot, A. E. Dunlop. Drug Discov. Today 20, 175 (2015), https://doi.org/10.1016/j.drudis.2014.09.021.
[35] Chemical Abstracts Service (CAS). SciFinder-n, https://scifinder-n.cas.org/.
[36] ChemAxon. Markush Editor, https://chemaxon.com/products/instant-jchem.
© 2022 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/