How to create SD/SDF files from SMILES using freely available software tools
Molecular Design Limited (MDL) MOL/SD files, i.e. files with the extension *.mol (Molecular),
*.sd (Structure-Data) or *.sdf (Structure-Data-Format), contain the structural
information and associated data for a single molecule (*.mol) or for any number of molecules
(*.sd/*.sdf). The SD/SDF format currently serves as the most common standard for exchanging
information about chemicals. Hence, one of the most significant steps in completing a QSAR Model
Reporting Format (QMRF) is to generate and provide SDF files including all
necessary information about the training/test set molecules (i.e. identifiers of all compounds, e.g.
CAS/InChI/name/formula; visualised 3D structures; experimental and predicted values of
target properties/parameters; the values of the molecular descriptors used).
The following step-by-step description of the procedure for generating an SDF file is intended to
serve as a practical guide on how to obtain (in an easy and fast way, starting with SMILES)
the proper attachments, which are necessary to make each QMRF a reliable source of information about
the reported QSAR/QSPR modelling.
1. The first step is to create a SMI file (i.e. a file with the extension *.smi), which must be a
text file including one or more molecular structures. Each structure, represented by its
SMILES, should be placed in a separate row (e.g. a file containing 50 structures should
have 50 rows). The SMI file should begin directly with the list of compounds (SMILES),
without any heading or empty lines. It can contain the IDs of the molecules (in columns), but
they have to be separated from the SMILES by a TAB or SPACE. Other
information can optionally be added (e.g. the set (training/test) to which the compound
belongs, whether the compound is active or inactive, the values of the descriptors, etc.),
and each column (piece of information) has to be TAB/SPACE delimited.
The most common starting point is a table with the compounds listed in separate rows and the
associated data in columns (e.g. [Link] table). First, such a table should be saved,
without the heading row, as a TAB delimited text file (e.g. [Link]). Subsequently,
the extension of the resulting TXT file should be changed to SMI (by simply opening
the [Link] file and saving it again as an [Link] file). The content
of each SMI file can be browsed by opening the file with Notepad/WordPad, etc.
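For readers who prefer to script this step, the table-to-SMI conversion can be sketched in Python as below. This is only an illustration under stated assumptions: the input is TAB delimited text, the function name `table_to_smi` and its parameters are hypothetical, and the example compounds are made up for demonstration. The SMILES column is moved to the front, and the heading row and any empty rows are dropped, as required above.

```python
import csv
import io


def table_to_smi(table_text, smiles_column=0, has_header=True):
    """Turn TAB delimited table text into SMI text.

    The SMILES column is placed first, the remaining columns follow
    (TAB delimited), and heading/empty rows are removed.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    # Drop rows that are empty or contain only whitespace.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if has_header:
        rows = rows[1:]
    lines = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        smiles = cells.pop(smiles_column)
        lines.append("\t".join([smiles] + cells))
    return "".join(line + "\n" for line in lines)


# Illustrative table (hypothetical data): name, SMILES, set membership.
example_table = (
    "Name\tSMILES\tSet\n"
    "ethanol\tCCO\ttraining\n"
    "\n"
    "benzene\tc1ccccc1\ttest\n"
)
print(table_to_smi(example_table, smiles_column=1), end="")
```

The returned text can be written to a file with the extension *.smi and inspected in any text editor.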
2. A properly prepared SMI file can then be converted. Several software tools can
currently perform the SMI to SD/SDF transformation; two freely available
ones are discussed here.
a) OpenBabel version 2.2.3 for Windows, freely available at [Link]
(i)
A properly prepared SMI file has to be selected as the OpenBabel input; its content
is displayed if the file is properly recognized.
(ii)
OpenBabel does not generate coordinates unless the box "Generate 3D
coordinates" is ticked. It is necessary to select this option, as SDF files without
calculated coordinates (all x, y and z coordinates equal to 0) cannot be recognized by
the overwhelming majority of software, including the SDF file browser
recommended in point 4.
(iii)
Before the conversion it is necessary to specify the path to and the name of the
output file in order to save the results.
(iv)
The conversion procedure can be a bit time consuming due to calculation of 3D
coordinates (about 3 minutes for 60 relatively small molecules).
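The same conversion can also be run without the graphical interface. The sketch below assumes that a command-line `obabel` executable is installed and on the PATH, which is the case for more recent Open Babel releases rather than necessarily the 2.2.3 Windows version described here; the `--gen3d` option is the command-line counterpart of the "Generate 3D coordinates" box. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_obabel_command(smi_path, sdf_path):
    """Return the argument list for an SMI to SDF conversion with 3D coordinates.

    Open Babel infers the input and output formats from the file extensions.
    """
    return ["obabel", smi_path, "-O", sdf_path, "--gen3d"]


def convert_smi_to_sdf(smi_path, sdf_path):
    """Run the conversion; return True on success, False otherwise.

    Returns False without running anything if no `obabel` executable
    is found on the PATH (assumption: Open Babel is installed separately).
    """
    if shutil.which("obabel") is None:
        return False
    result = subprocess.run(build_obabel_command(smi_path, sdf_path), check=False)
    return result.returncode == 0
```

A call such as `convert_smi_to_sdf("compounds.smi", "compounds.sdf")` would then write the output file next to the input, provided the tool is available.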
The disadvantage of OpenBabel is that it does not allow various attributes of the
compounds (different IDs, values of descriptors, etc.) to be added to the output file. The only way to
include all the information in the SDF seems to be to prepare an input SMI file containing
TAB/SPACE delimited columns with all the necessary information about the molecules.
Thus, the SMI file should consist of all rows (except the heading row) and columns from
the initial (XLS) data table. However, in the output file all attributes are placed in one row
without a heading, and it is difficult to recognize the meaning of each piece of textual/numerical
information. Hence, OpenBabel can be recommended for producing SDF files including
3D structures with only one ID (e.g. name) associated. As far as QMRFs are concerned,
it would then be necessary to provide the remaining data in a separate attachment (e.g. an XLS file).
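One possible workaround for the single-row limitation is to post-process the output file. The sketch below is not a feature of OpenBabel; it is a plain-text manipulation that assumes the attributes in the first (title) line of each record are TAB delimited and appear in a known order. It keeps the first attribute as the title and copies every attribute into a named SD data item (a `>  <NAME>` line, the value, and an empty line) before the `$$$$` record separator. The function name and the field names are hypothetical.

```python
def title_to_data_fields(sdf_text, field_names):
    """Move TAB delimited title-line attributes into named SD data items.

    field_names lists the column meanings in order, e.g. ["Name", "CAS"].
    """
    text = sdf_text.replace("\r\n", "\n")  # normalise Windows line endings
    out = []
    for record in text.split("$$$$\n"):
        if not record.strip():
            continue  # skip the empty chunk after the last separator
        lines = record.rstrip("\n").split("\n")
        values = lines[0].split("\t")
        lines[0] = values[0]  # the first attribute stays as the title
        for name, value in zip(field_names, values):
            lines.append(f">  <{name}>")
            lines.append(value.strip())
            lines.append("")  # an empty line ends each data item
        out.append("\n".join(lines) + "\n$$$$\n")
    return "".join(out)
```

With named data items in place, each value in the SDF carries its own heading, so a separate attachment explaining the column order may no longer be needed.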
b) Accelrys Discovery Studio Visualizer version 2.5, freely available at [Link]
(i)
The input SMI file should include SMILES and optionally one structure ID
(preferably the name, since the software, by default, will recognize this initial ID as
the name).
(ii)
Other attributes (like CAS, InChI, descriptor values, etc.) can be easily added
(Edit > Add Attribute) and copy-pasted from the XLS data table. They will be
clearly placed in separate rows in the output SDF file.
(iii)
The file can subsequently be saved as SDF (File > Save as > MDL MOL/SD
files). The software generates 3D coordinates, and the procedure is much faster
than the one performed by OpenBabel.
Accelrys Discovery Studio is able to produce SDF files including 3D structures as well as all
other textual/numerical information about studied compounds (additional attachments are not
necessary).
4. The content of the final SDF files can be browsed either with Accelrys Discovery
Studio or with the easy-to-use SDF file browser Hyleos (currently version [Link]), freely
available at [Link]
With the Hyleos application it is possible to screen the content of SDFs (visualised 3D
structures and all associated data) as well as to merge/split the files.
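As a complement to a graphical browser, the content of an SDF file can also be checked with a short script. The sketch below splits a file into records at the `$$$$` separator, reads the named data items of a record, and tests whether a record carries any non-zero coordinates. It assumes the fixed-column V2000 connection table layout (atom count in the first three characters of the fourth line, x, y and z in three 10-character columns); all function names are hypothetical.

```python
def split_sdf(sdf_text):
    """Split SDF text into records; each record is a list of lines."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append(current)  # last record without a closing separator
    return records


def read_data_fields(record):
    """Return the named data items of one record as a dict."""
    fields, name = {}, None
    for line in record:
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1:line.rindex(">")]
            fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line)
            else:
                name = None  # an empty line ends the data item
    return {key: "\n".join(value) for key, value in fields.items()}


def has_nonzero_coordinates(record):
    """Return True if any atom has a coordinate different from 0 (V2000 layout assumed)."""
    atom_count = int(record[3][:3])
    for line in record[4:4 + atom_count]:
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        if (x, y, z) != (0.0, 0.0, 0.0):
            return True
    return False
```

Records for which `has_nonzero_coordinates` returns False were most likely converted without the 3D coordinate option and may not open in other software.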
JRC Computational Toxicology Group
14 December 2009