Database
For commongroups to work, it needs a database of compounds searchable by
chemical structure. Specifically, this must be a PostgreSQL database with
the RDKit extension. We do not distribute a pre-built database, and there
are no requirements for or limits on what compounds and data sources might be
included in the database.
We develop and test commongroups using a database compiled from public data
available in the US EPA CompTox Dashboard, containing approximately 700,000
structures. We believe that this is a good starting point for our immediate
goals. We distribute a program that can automatically download
data and recreate the particular database that we use.
What follows is a general description of how a database can be created and
prepared for use with commongroups.
Data sources
From the US EPA CompTox Dashboard:
dsstox_20160701.tsv
- Downloaded as DSSTox_Mapping_20160701.zip
- Date: 2016-07-01 (file generated); 2016-12-14 (posted on EPA website)
- Accessed: 2017-04-18
- Contains mappings between the DSSTox substance identifier (DTXSID) and the associated InChI/InChIKey.
Dsstox_CAS_number_name.xlsx
- Date: 2016-11-14
- Accessed: 2017-04-18
- Contains the CASRN, DTXSID, and the “preferred name” used by US EPA.
PubChem_DTXSID_mapping_file.txt
- Date: 2016-11-14
- Accessed: 2017-04-18
- Contains the PubChem SID, PubChem CID and DTXSID.
Constructing the database
These are the general steps to create the database using the data sources above:
- Create a database table of DTXSIDs, InChI(Key)s, and RDKit mol-type representations of the structures (based on InChI).
- Create database tables with DTXSID-CASRN and DTXSID-CID correspondences.
- Merge the above tables on DTXSID, creating either a new table or a materialized view. Create an index on molecular structures using the GiST-powered RDKit extension.
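As a rough illustration of the merge and indexing steps, the sketch below builds the kind of SQL statements involved. It is not the code in tools/construct_database.py. The source table names (structures, casrn_map, cid_map), their column layouts, the index name, and the choice of LEFT JOIN are all assumptions made for this example; it also assumes the structures table already holds a mol-type column, so the InChI-to-mol conversion is not shown.

```python
# Hypothetical source tables; only `compounds`, `molecule`, `cid`, `casrn`
# and `dtxsid` are names taken from this documentation.
STRUCTURES = "structures"  # assumed columns: dtxsid, inchi, inchikey, molecule (mol)
CASRN_MAP = "casrn_map"    # assumed columns: dtxsid, casrn
CID_MAP = "cid_map"        # assumed columns: dtxsid, cid


def build_statements(view_name="compounds"):
    """Return SQL that merges the tables on DTXSID and indexes the structures."""
    return [
        # Load the RDKit cartridge, which provides the mol type and GiST support.
        "CREATE EXTENSION IF NOT EXISTS rdkit;",
        # Merge on DTXSID. LEFT JOIN (an assumption) keeps structures that
        # have no CASRN or PubChem CID on record.
        f"CREATE MATERIALIZED VIEW {view_name} AS "
        f"SELECT s.dtxsid, s.inchi, s.inchikey, s.molecule, c.casrn, p.cid "
        f"FROM {STRUCTURES} s "
        f"LEFT JOIN {CASRN_MAP} c ON c.dtxsid = s.dtxsid "
        f"LEFT JOIN {CID_MAP} p ON p.dtxsid = s.dtxsid;",
        # GiST index on the mol column to speed up structure searches.
        f"CREATE INDEX {view_name}_molecule_idx "
        f"ON {view_name} USING gist (molecule);",
    ]


if __name__ == "__main__":
    for statement in build_statements():
        print(statement)
```

The statements are returned as strings so they can be inspected before being run with a PostgreSQL client of your choice.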
To execute these steps automatically on your system, first install the required software and then see Automated database instantiation. For more technical
detail and the exact database commands used, please see the source code in
tools/construct_database.py.
Known limitations
Currently, for commongroups to be able to use a database, it must contain a
table or view called compounds, which in turn must contain a column called
molecule, containing molecular structures in the RDKit mol data type.
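To illustrate why the molecule column must use the mol data type, here is a minimal sketch of a substructure search against it, using the RDKit cartridge's @> operator with a SMARTS pattern cast to qmol. The function name and the psycopg2-style %s placeholders are assumptions for this example, and this is not necessarily how commongroups issues its own queries.

```python
def substructure_query(smarts, limit=10):
    """Build a parameterized substructure search on compounds.molecule.

    Returns the SQL text and its parameters separately, so the database
    driver (assumed here to be psycopg2) handles the quoting.
    """
    sql = (
        "SELECT * FROM compounds "
        "WHERE molecule @> %s::qmol "  # rows whose structure contains the pattern
        "LIMIT %s;"
    )
    return sql, (smarts, limit)


# Example use with an already open psycopg2 cursor named `cur` (assumed):
#     sql, params = substructure_query("c1ccccc1Br")
#     cur.execute(sql, params)
#     rows = cur.fetchall()
```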
Furthermore, for HTML display of compound groups, the program currently assumes
that compounds contains the following columns: cid (PubChem CID, also
used for images), casrn, and dtxsid. This may change in future versions.
The database produced by our automatic installation script satisfies these requirements.
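If you build the database some other way, a quick check along the following lines can confirm that these requirements are met. This is only a sketch: the helper names are hypothetical, and it assumes a Python DB-API cursor (for example from psycopg2) connected to your database.

```python
REQUIRED_COLUMNS = ("molecule",)              # needed for structure searches
DISPLAY_COLUMNS = ("cid", "casrn", "dtxsid")  # assumed by the HTML display


def missing_columns(available):
    """Return the required and display columns absent from `available`."""
    present = {name.lower() for name in available}
    return {
        "required": [c for c in REQUIRED_COLUMNS if c not in present],
        "display": [c for c in DISPLAY_COLUMNS if c not in present],
    }


def compounds_columns(cursor):
    """List the column names of `compounds` using a DB-API cursor."""
    # Fetching zero rows still populates cursor.description with column metadata.
    cursor.execute("SELECT * FROM compounds LIMIT 0;")
    return [column[0] for column in cursor.description]


# Example use with an open cursor named `cur` (assumed):
#     print(missing_columns(compounds_columns(cur)))
```

This checks column names only; it does not verify that molecule actually uses the mol data type.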