Create topology and coordinates for benzene molecule

For the molecular system that contains a mixture of small molecules it is convenient to have topology and coordinate files for each component. This way the system coordinates can be created by a third-party software (e.g. Packmol or VMD) and the topology file will just list the molecules and their quantities, not the interatomic connectivity in the molecules. We will use an example of benzene to explain the procedure of creating a pair of coordinates and topology files for a single molecule that can be later used when the molecule is added to the mixture of many molecules.

The coordinates

We will write basic code to synthesize the coordinates according to the atomic parameters. Another approach would be to find the coordinates on the internet, as a stand-alone file or part of another molecular system. The later for benzene may be, e.g. a part of phenylalanine side-chain form any protein that contains this amino-acid. Note that the the carbon, that connects the side-chain to the backbone of the protein will have to be substituted with a hydrogen. Also, if the system is synthesized with the correct force-field parameters, there is no need to minimize the energy (but we will do it to show the complete procedure).

Let us use CHARMM36 force-field for the reference. We assume that you have the force-field files in your local folder, see CHARMM36 section for details. We need cgenff.rtp residue topology file, where we can find benzene molecule description (look for BENZ residue):

[ BENZ ]
; C6H6, benzene, adm jr.
[ atoms ]
    CG   CG2R61 -0.1150   1
    HG    HGR61  0.1150   1
    CD1   CG2R61 -0.1150   1
    HD1    HGR61  0.1150   1
    CD2   CG2R61 -0.1150   1
    HD2    HGR61  0.1150   1
    CE1   CG2R61 -0.1150   1
    HE1    HGR61  0.1150   1
    CE2   CG2R61 -0.1150   1
    HE2    HGR61  0.1150   1
    CZ   CG2R61 -0.1150   1
    HZ    HGR61  0.1150   1
[ bonds ]
    CD1    CG
    CD2    CG
    CE1   CD1
    CE2   CD2
    CZ   CE1
    CZ   CE2
    CG    HG
    CD1   HD1
    CD2   HD2
    CE1   HE1
    CE2   HE2
    CZ    HZ

Here, we have two sections that describe residue atoms and bonds respectively. For atoms, the atom name, type, partial charge and charge group are listed. Each bond is a pair of atom names that should be connected. Note that angles and dihedrals will be generated by GROMACS according to what is in the header to residue topology file, but we will discuss this later. All atoms in the residue has unique name, which is used to identify the atom when bonds are defined. See how CG atom is connected to CD1 and CD2 carbons and to HG hydrogen.

Exercise

Using the residue description of benzene molecule, try to draw it on a piece of paper. Write the unique names of the atoms from the file next to them.

Now let us try to see how the parameters for the interactions are determined. This will also help us to get the equilibrium parameters, needed to create atomic coordinates. We see that all the carbons have type CG2R61, hydrogens are of type HGR61. These atom types are what defines the parameters for bonded and non-bonded interactions for these atoms. To see how, open ffbonded.itp file in the force-field. The bond and angle parameters are in the following lines of the ffbonded.itp file (we will skip dihedrals and non-bonded parameters for now):

...
CG2R61   CG2R61     1   0.13750000    255224.00 ; PROT benzene, JES 8/25/89
...
CG2R61    HGR61     1   0.10800000    284512.00 ; PROT phe,tyr JES 8/25/89
...
CG2R61   CG2R61   CG2R61     5   120.000000   334.720000   0.24162000     29288.00 ; PROT JES 8/25/89
...
CG2R61   CG2R61    HGR61     5   120.000000   251.040000   0.21525000     18409.60 ; PROT JES 8/25/89 benzene
...

As one can see from the comments, these atom types are found in phenylalanine and tyrosine amino-acids as well as in benzene molecule. The equilibrium bond distances are 0.1375 nm for bond between carbons and 0.108 for carbon-hydrogen bond. The equilibrium angle is 120 degrees for both C-C-C and C-C-H angles.

So, to make a coordinate file, we need to create a regular hexagon with carbons in its corners and hydrogens sticking out. Each side of the hexagon is a=0.1375 nm and hydrogens are farther b=0.108 nm away from the corners.

If placed in the x-y plane so that the center of hexagon is (0,0,0) and the longest diagonal along x-axis, the coordinates of the vertices are (clockwise):

\[ \begin{align}\begin{aligned}(a, 0, 0),\\(a/2, -a\cdot\cos(30), 0),\\(-a/2, -a\cdot\cos(30), 0),\\(-a, 0, 0),\\(-a/2, a\cdot\cos(30), 0),\\(a/2, a\cdot\cos(30), 0)\end{aligned}\end{align} \]

The coordinates of the respected hydrogens are in the same layout, but (a+b) away from the center, i.e.:

\[ \begin{align}\begin{aligned}(a+b, 0, 0),\\((a+b)/2, -(a+b)\cdot\cos(30), 0),\\(-(a+b)/2, -(a+b)\cdot\cos(30), 0),\\(-(a+b), 0, 0),\\(-(a+b)/2, (a+b)\cdot\cos(30), 0),\\((a+b)/2, (a+b)\cdot\cos(30), 0)\end{aligned}\end{align} \]

We also need to synchronize the coordinates with the topology: molecular simulation software needs to know which atom is connected to which. This is described in the [ bonds ] section of the topology: atom CD1 is connected to atom CG, CD2 to CG as well, and so on. Hence, if we want to go around the carbon ring, starting from the CG atom, the atom order will be CG, CD1, CE1, CZ, CE2, CD2. This notation is inherited from proteins, where atoms are named according to their separation from the C-alpha (CA) atom: C-beta (CB), C-gamma (CG), C-delta (CD) and so on. If the chain separates, atom names receive a number (e.g. CD1 and CD2). This does not make much sense in case of benzene molecule, but this does not affect us in the future so we can leave it as it is.

Once we decide which coordinates correspond to which atom name, we can create the coordinates file in any format that is suitable for the software. For instance, we can use GROMACS .gro format.

Exercise

Write a basic code that generates coordinates for benzene and saves them in .gro format.

Molecular topology

Once the coordinates are ready, the topology file can be created. If the atoms and residue are named according to the force-field, the basic topology can be created using pdb2gmx utility from GROMACS:

GMX=/usr/local/gromacs/bin/gmx
SYSTEM_NAME=C6H6
$GMX pdb2gmx -f ${SYSTEM_NAME}.gro -o ${SYSTEM_NAME}.gro -p ${SYSTEM_NAME}.top -ff charmm36 -water tip3p

We set and use some variable to make it easier to copy-paste these commands later. Here, we take the .gro file we created and pass it to the pdb2gmx. We use CHARMM36 force-field and tip3p water model (if we don’t specify these options, they will be asked interactively). Note that CHARMM36 is not supplied with GROMACS by default, so it has to be installed or you need to have force-field files (i.e. charmm36.ff folder) in your local folder. As an output we have corrected .gro file (if the correction is needed, e.g. hydrogen were added) and topology file in .top format. We opted to overwrite the .gro file here, since not much is updated there — only the periodic boundary description is added. Don’t worry if you wanted to compare the files — GROMACS makes back-ups it case it overwrites (see the log file for the backup name).

Before we modify the topology to make it more useful in the future, we can finalize our coordinates by running energy minimization simulations. Strictly speaking, this step is not required because we created the coordinates using force-field parameters. Hence the structure should be already in its energy minima. But it is useful to do the energy minimization to fix any imperfections in the structure and to make sure that the coordinates are useable.

The following script performs the energy minimization in vacuo for the molecular system. Assuming that you have this minimization .mdp file and your coordinates are saved as C6H6.gro:

SYSTEM_NAME=C6H6
$GMX editconf -f ${SYSTEM_NAME}.gro -o ${SYSTEM_NAME}.gro -d 0.1
$GMX editconf -f ${SYSTEM_NAME}.gro -o ${SYSTEM_NAME}.gro -box 100 100 100 -noc
$GMX grompp -f em_vac.mdp -c ${SYSTEM_NAME}.gro -o em_vac.tpr
$GMX mdrun -deffnm em_vac
mkdir hydrocarbons
cp ${SYSTEM_NAME}_em.gro hydrocarbons/${SYSTEM_NAME}.gro

Note that this will overwrite your .gro file, but don’t worry — GROMACS will make a backup, surrounded by # signs. We first run the editconf utility twice to set the periodic boundary correctly. The first run moves the molecule so that all the coordinates are positive. The -d 0.1 option means, that the box will be no closer than 0.1 nm from the molecule. By default, the editconf shifts the coordinates to the center of the box, which is in the positive quadrant. The second run of editconf changes the periodic box definition to a cube with 100 nm side. This is an arbitrary large number to make sure that PBC don’t affect the system — modern versions of GROMACS can only run with periodic boundary. Note that we use -noc option here to leave the molecule where it is instead of moving it to the center of the box. After coordinates are prepared, we configure GROMACS with grompp utility. This takes the .mdp file that describes the simulation protocol (see The energy minimization .mdp file section below), topology and coordinate files. grompp creates a portable .tpr file, which contains all the data for the molecular dynamics simulation run. This file is used with gmx mdrun, which is the main simulation engine. The -deffnm option means default name, i.e. the name of the files for input/output that will be different only by extension. Hence, the resulting (energy minimized) structure will be saved as ${SYSTEM_NAME}_em.gro, which are the coordinates that we are going to save for future use.

Some software packages may need a .pdb file instead of .gro. To convert, one can use editconf utility from GROMACS:

$GMX editconf -f ${SYSTEM_NAME}_em.gro -o ${SYSTEM_NAME}.pdb
cp ${SYSTEM_NAME}.pdb hydrocarbons/${SYSTEM_NAME}.pdb

Now, the hydrocarbons folder should have both .gro and .pdb files.

The energy minimization .mdp file

Above, we ran something called energy minimization in vacuum. This is not something that is done by default, we explicitly asked molecular simulation software to do it. This is specified in the configuration or molecular dynamics parameters (.mdp) file (see full description here). We used the following parameters in our run:

title       = enrgy minimisation

; Parameters describing what to do, when to stop and what to save
integrator  = steep  ; Algorithm (steep = steepest descent minimization)
emtol       = 1000.0 ; Stop minimization when the maximum force < 1000.0 kJ/mol/nm
emstep      = 0.01   ; Minimization step size
nstenergy   = 500    ; save energies every 1.0 ps, so we can observe if we are successful
nstxout-compressed       = 500    ; for writing coords (x)
nsteps      = -1     ; run as long as we need
; Settings that make sure we run with parameters in harmony with the selected force-field
constraints             = none          ; no constraints
rcoulomb                = 10            ; short-range electrostatic cutoff (in nm)
rvdw                    = 10            ; short-range van der Waals cutoff (in nm)
coulombtype             = Cut-Off       ; Cutoff electrostatics with large radii
rlist                   = 20

The main parameter here is integrator, which is set to steep or steepest descent energy minimization algorithm. Also note that we set all the cut-offs to an arbitrary large value, which is only allowed since our simulation box is large. This is to minimize the effect of switching the interaction potential off near the cut-off radius.

Creating molecule topology file for future use

Current .top file does two things at the same time: it describes the composition of the entire molecular system and topology of a single benzene molecule. This is not surprising because our system is a singular benzene molecule. However, this is not convenient if we want to re-use the molecular topology description part of the file when benzene molecule(s) is(are) part of another molecular system. What we can do in this case is we can create a separate Include ToPology or .itp file and copy the topology description there. Let us have a look at the generated .top file:

...

; Include forcefield parameters
#include "charmm36.ff/forcefield.itp"

[ moleculetype ]
; Name            nrexcl
Other               3

[ atoms ]
;   nr       type  resnr residue  atom   cgnr     charge       mass  typeB    chargeB      massB
; residue   1 BENZ rtp BENZ q  0.0
    1     CG2R61      1   BENZ     CG      1     -0.115     12.011
    2      HGR61      1   BENZ     HG      1      0.115      1.008
    3     CG2R61      1   BENZ    CD1      1     -0.115     12.011
    4      HGR61      1   BENZ    HD1      1      0.115      1.008
    5     CG2R61      1   BENZ    CD2      1     -0.115     12.011
    6      HGR61      1   BENZ    HD2      1      0.115      1.008
    7     CG2R61      1   BENZ    CE1      1     -0.115     12.011
    8      HGR61      1   BENZ    HE1      1      0.115      1.008
    9     CG2R61      1   BENZ    CE2      1     -0.115     12.011
    10      HGR61      1   BENZ    HE2      1      0.115      1.008
    11     CG2R61      1   BENZ     CZ      1     -0.115     12.011
    12      HGR61      1   BENZ     HZ      1      0.115      1.008   ; qtot 0

[ bonds ]
;  ai    aj funct            c0            c1            c2            c3
    1     2     1
    1     3     1
    ...
    9    11     1
    11    12     1

[ pairs ]
;  ai    aj funct            c0            c1            c2            c3
    1     8     1
    1    10     1
    ...
    8    12     1
    10    12     1

[ angles ]
;  ai    aj    ak funct            c0            c1            c2            c3
    2     1     3     5
    2     1     5     5
    ...
    7    11    12     5
    9    11    12     5

[ dihedrals ]
;  ai    aj    ak    al funct            c0            c1            c2            c3            c4            c5
    2     1     3     4     9
    2     1     3     7     9
    ...
    10     9    11     7     9
    10     9    11    12     9

...

[ system ]
; Name
Benzene

[ molecules ]
; Compound        #mols
Other               1

Here, dots represent parts of the file that were omitted for shortness. After the header, the description of the molecule starts in [ moleculetype ] section. There are two parameters here: name of the newly created molecule and number of exclusions in it. By default, the molecule is called Other. This is followed by the list of atoms with their respective types, residue number, residue name, charge and mass. Note that we are adding mass here, which is defined by the atom type.

Exercise

In the force-filed files, find where is mass is defined for our atom types.

The following sections describe connectivity between atoms, including bonds, angles and dihedral angles. Note that now atoms are referred by their unique index, not by their name. This is because each molecule can have several residues and hence names will not be unique any more. The angles and dihedral angles were not listed in the residue description above, they were added by GROMACS pdb2gmx utility according to the rules in the beginning of the residue database file.

We are skipping the position restrain section here for the clarity of the lesson. This section is normally used if one wants to constrain some (e.g. heavy) atoms in their simulations, which we will not be doing on this stage.There are also some include directive for ions and water molecule, which are self-explanatory.

The topology is finalized by [ system ] and [ molecules ] sections, which describe what molecules we have in our system. Here, we only have one benzene molecule, which was given a default name Other here.

The part we need to keep for the molecule description starts with [ moleculetype ] and ends after [ dihedrals ] section of the topology file. We also want to change the name Other to avoid duplicating names in future. The described procedures can be performed with the following script:

cp ${SYSTEM_NAME}.top ${SYSTEM_NAME}.itp
sed -i -n '/\[ moleculetype \]/,$p' ${SYSTEM_NAME}.itp
sed -i '/; Include Position restraint file/,$d' ${SYSTEM_NAME}.itp
sed -i "s/Other/${SYSTEM_NAME}/g" ${SYSTEM_NAME}.itp
cp ${SYSTEM_NAME}.itp hydrocarbons/${SYSTEM_NAME}.itp

This will create a molecular topology or include topology C6H6.itp file in hydrocarbons folder. Now, the system topology can just include the molecular topology. For the system of a singular benzene, this topology will be:

; Include forcefield parameters
#include "charmm36.ff/forcefield.itp"
#include "hydrocarbons/C6H6.itp"

[ system ]
; Name
C6H6

[ molecules ]
; Compound     #mols
C6H6      1

Exercise

Create .gro and .top files for a system of two benzene molecules, where one of the molecules is shifted by 0.5 nm from the other in z-direction (perpendicular to the benzene plane).

Conclusions

Creating topologies and coordinates for all the molecules of the mixture allow to re-use them in the future, thus simplifying the process of simulating the arbitrary mixture significantly. The coordinates file can be constructed using pre-generated coordinate files for the molecules (e.g. using Packmol software) and the topologies of molecules can be included into the system topology file. These two files should be kept in-sync though, with the number of molecules and their order in coordinates file strictly correspondent to the number of molecules and their order in the system topology.