.. _chignolin: Fold the mini-protein, chignolin ================================= .. image:: ../../examples/chignolin/data/2RVD.png :width: 200 :alt: folded structure of the chignolin protein Chignolin is an artificial mini-protein consisting of only 10 amino acid residues. Here we use potential contrasting to learn a coarse-grained force field that can capture the protein's folding and unfolding 1. Download the all-atom simulation trajectory ---------------------------------------------- An all-atom simulation of chignolin is used as training data. You can follow the link `pccg-chignolin `_ to download the all-atom simulation trajectory. Then you should uncompress the downloaded data and move them to the directory ``WORK_DIR/data``. Here ``WORK_DIR`` can be any directory and wil be used as the working directory for running commands in the tutorial. In other words, you should change your working directory to ``WORK_DIR`` before running following commands. In the direcotry ``WORK_DIR/data``, there should be three files: * ``cln025_all_atom.dcd``: an all-atom simulation trajectory * ``cln025.prmtop``: the topology file of chignolin * ``cln025_reference.pdb``: a folded conformation of chignolin The all-atom simulation was conducted with implicit solvent and near the folding temperature, so the all-atom simulation trajectory contains both folded and unfolded conformations. You can visualize the trajectory by loading it together with the topology file into softwares such as ``VMD``. 2. Convert the all-atom trajectory into a coarse-grained one ------------------------------------------------------------ First we need to choose a coarse-grained representation for the force field. Here we choose to represent each amino acid residue with just one particle fixed at the residue's :math:`\alpha` carbon atom postion. With this coarse-grained representation, let us convert the all-atom trajectory into a coarse-grained one, which will be used as training data. .. code-block:: python import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt import mdtraj import os import scipy import scipy.optimize as optimize import scipy.cluster.hierarchy from scipy.spatial.distance import squareform import torch import math import openmm import openmm.unit as unit import openmm.app as ommapp import time from sys import exit #### convert the all atom trajectory into a coarse-grained one top_aa = mdtraj.load_prmtop('./data/cln025_all_atom.prmtop') traj_aa = mdtraj.load_dcd('./data/cln025_all_atom.dcd', top_aa, stride = 10) alpha_carbon_atom_indices = [] for atom in top_aa.atoms: if atom.name == 'CA': alpha_carbon_atom_indices.append(atom.index) traj_cg = traj_aa.atom_slice(alpha_carbon_atom_indices) os.makedirs('./output/', exist_ok = True) traj_cg.save_dcd('./output/cln025_cg.dcd') We can visualize the coarse-grained trajectory by projecting conformations into a low dimensional space spanned by collective variables such as the RMSD (root-mean-squared-distance) with respect to the folded conformations. .. code-block:: python top_cg = mdtraj.load_psf('./data/cln025_cg.psf') traj_cg = mdtraj.load_dcd('./output/cln025_cg.dcd', top_cg) ref_pdb = mdtraj.load_pdb('./data/cln025_reference.pdb') ref_pdb = ref_pdb.atom_slice(alpha_carbon_atom_indices) ref_traj = mdtraj.Trajectory(ref_pdb.xyz, topology = top_cg) rmsd = mdtraj.rmsd(traj_cg, ref_traj) fig = plt.figure() fig.clf() plt.hist(rmsd, bins = 30, density = True, range = (0, 0.8), color = 'C1', label = 'All atom') plt.legend() plt.xlabel('RMSD (nm)') plt.ylabel('Probablity density') plt.tight_layout() plt.savefig('./output/rmsd_hist_all_atom.png') plt.close() .. image:: ../../examples/chignolin/output/rmsd_hist_all_atom.png :width: 400 3. Construct a noise distribution and generate noise samples ------------------------------------------------------------ In potential contrasting, we construct the noise distribution by learning an energy function and use the corresponding Boltzmann distribution as the noise distribution. Here we use an energy function that includes terms for bonds, angles, and dihedrals defined as .. math:: u_\mathrm{bonded}(b, a, d) = \sum_{i=1}^{L-1} \frac{1}{2}k_i (b_i - b_i^\circ)^2 + \sum_{i=1}^{L-2} S_\mathrm{angle}(a_i; \boldsymbol{c}^a_i) + \sum_{i=1}^{L-3} S_\mathrm{dihedral}(d_i; \boldsymbol{c}^d_i), where :math:`L = 10` is the number of residues in the protein, and :math:`b_i, a_i`, and :math:`d_i` represent the :math:`i` th bond, angle, and dihedral angle, respectively. A quadratic function is used for energies on bonds. :math:`k_i` and :math:`b_i^\circ` are the force constant and the equilibrium value for the :math:`i` th bond, respectively. Cubic spline functions, :math:`S_\mathrm{angle}` and :math:`S_\mathrm{dihedral}`, are used for energies on angles and dihedral angles. :math:`\boldsymbol{c}^a_i` and :math:`\boldsymbol{c}^d_i` are spline coefficients for the :math:`i` th angle and dihedral angle, respectively. Because individual energy terms in :math:`u_\mathrm{bonded}(b, a, d)` are separable, we can fit their paramters independently such that each energy term will reproduce the marginal distribution of the corresponding internal coordinate. For bond paramters, we can estimate them directly using the mean and the variance of bond length distributions. For angle and dihedral parameters, we use the noise contrastive estimation with uniform distributions as the noise distribution. We note that the unit of energy in :math:`u_\mathrm{bonded}(b, a, d)` is :math:`kT`. .. code-block:: python n_atoms = top_cg.n_atoms bonded_terms = { 'bond': {'indices': np.array([[i,i+1] for i in range(n_atoms - 1)])}, 'angle': {'indices': np.array([[i,i+1,i+2] for i in range(n_atoms - 2)])}, 'dihedral': {'indices': np.array([[i,i+1,i+2,i+3] for i in range(n_atoms - 3)])} } #### fit parameters for bonds bonded_terms['bond']['b0'] = [] bonded_terms['bond']['kb'] = [] for i in range(bonded_terms['bond']['indices'].shape[0]): atom_pair = bonded_terms['bond']['indices'][i] dist = mdtraj.compute_distances(traj_cg, atom_pairs = [atom_pair]) bonded_terms['bond']['b0'].append(dist.mean()) bonded_terms['bond']['kb'].append(1./dist.var()) #### fit parameters for angles angle_knots = torch.linspace(0, math.pi, 10)[1:-1] angle_boundary_knots = torch.tensor([0.0, math.pi]) bonded_terms['angle']['grid'] = [] bonded_terms['angle']['basis_grid'] = [] bonded_terms['angle']['theta'] = [] bonded_terms['angle']['energy_grid'] = [] for i in range(bonded_terms['angle']['indices'].shape[0]): angle_atom_indices = bonded_terms['angle']['indices'][i] angle_data = mdtraj.compute_angles(traj_cg, [angle_atom_indices]) angle_data = np.squeeze(angle_data).astype(np.float64) angle_data = torch.from_numpy(angle_data) angle_data.clamp_(0, math.pi) angle_noise = torch.rand(len(angle_data))*math.pi basis_data = pccg.spline.bs(angle_data, angle_knots, angle_boundary_knots) basis_noise = pccg.spline.bs(angle_noise, angle_knots, angle_boundary_knots) log_q_data = torch.ones_like(angle_data)*math.log(1./math.pi) log_q_noise = torch.ones_like(angle_noise)*math.log(1./math.pi) theta, dF = pccg.NCE(log_q_noise, log_q_data, basis_noise, basis_data, verbose = False) angle_grid = torch.linspace(0, math.pi, 100) basis_grid = pccg.spline.bs(angle_grid, angle_knots, angle_boundary_knots) energy_grid = torch.matmul(basis_grid, theta) bonded_terms['angle']['grid'].append(angle_grid) bonded_terms['angle']['basis_grid'].append(basis_grid) bonded_terms['angle']['theta'].append(theta) bonded_terms['angle']['energy_grid'].append(energy_grid) #### fit parameters for dihedrals dihedral_knots = torch.linspace(-math.pi, math.pi, 12)[1:-1] dihedral_boundary_knots = torch.tensor([-math.pi, math.pi]) bonded_terms['dihedral']['grid'] = [] bonded_terms['dihedral']['basis_grid'] = [] bonded_terms['dihedral']['theta'] = [] bonded_terms['dihedral']['energy_grid'] = [] for i in range(bonded_terms['dihedral']['indices'].shape[0]): dihedral_atom_indices = bonded_terms['dihedral']['indices'][i] dihedral_data = mdtraj.compute_dihedrals(traj_cg, [dihedral_atom_indices]) dihedral_data = np.squeeze(dihedral_data).astype(np.float64) dihedral_data = torch.from_numpy(dihedral_data) dihedral_data.clamp_(-math.pi, math.pi) dihedral_noise = torch.distributions.Uniform(-math.pi, math.pi).sample((len(dihedral_data),)) basis_data = pccg.spline.pbs(dihedral_data, dihedral_knots, dihedral_boundary_knots) basis_noise = pccg.spline.pbs(dihedral_noise, dihedral_knots, dihedral_boundary_knots) log_q_data = torch.ones_like(dihedral_data)*math.log(1./(2*math.pi)) log_q_noise = torch.ones_like(dihedral_noise)*math.log(1./(2*math.pi)) theta, dF = pccg.NCE(log_q_noise, log_q_data, basis_noise, basis_data, verbose = False) dihedral_grid = torch.linspace(-math.pi, math.pi, 100) basis_grid = pccg.spline.pbs(dihedral_grid, dihedral_knots, dihedral_boundary_knots) energy_grid = torch.matmul(basis_grid, theta) bonded_terms['dihedral']['grid'].append(dihedral_grid) bonded_terms['dihedral']['basis_grid'].append(basis_grid) bonded_terms['dihedral']['theta'].append(theta) bonded_terms['dihedral']['energy_grid'].append(energy_grid) The energy function :math:`u_\mathrm{bonded}(b, a, d)` defines a Boltzmann distribution on internal coordinates :math:`p(b, a, d) \propto \exp(-u_\mathrm{bonded}(b, a, d))`. We generate noise samples from the Boltzmann distribution by running molecular dynamics using `OpenMM `_. .. code-block:: python #### make an openmm system with bonded parameters #### add particles aa_mass = {'ALA': 71.08, 'ARG': 156.2, 'ASN': 114.1, 'ASP': 115.1, 'CYS': 103.1, 'GLN': 128.1, 'GLU': 129.1, 'GLY': 57.05, 'HIS': 137.1, 'ILE': 113.2, 'LEU': 113.2, 'LYS': 128.2, 'MET': 131.2, 'PHE': 147.2, 'PRO': 97.12, 'SER': 87.08, 'THR': 101.1, 'TRP': 186.2, 'TYR': 163.2, 'VAL': 99.07} sys_im = openmm.System() for res in top_cg.residues: print(res.name, aa_mass[res.name]) sys_im.addParticle(aa_mass[res.name]) k = unit.BOLTZMANN_CONSTANT_kB*unit.AVOGADRO_CONSTANT_NA T = 360.47 * unit.kelvin kT = (k*T).value_in_unit(unit.kilojoule_per_mole) #### add bond force bond_force = openmm.CustomBondForce("(0.5*kb*(r - b0)^2 + 2*log(r))*kT") bond_force.addGlobalParameter('kT', kT) bond_force.addPerBondParameter('b0') bond_force.addPerBondParameter('kb') for k in range(bonded_terms['bond']['indices'].shape[0]): p1,p2 = bonded_terms['bond']['indices'][k] b0 = bonded_terms['bond']['b0'][k] kb = bonded_terms['bond']['kb'][k] bond_force.addBond(p1,p2, [b0, kb, alpha]) bond_force.setForceGroup(0) sys_im.addForce(bond_force) #### add angle force ua = torch.stack(bonded_terms['angle']['energy_grid']).numpy() func = openmm.Continuous2DFunction(xsize = ua.shape[0], ysize = ua.shape[1], values = ua.reshape(-1, order = 'F'), xmin = 0.0, xmax = float(ua.shape[0] - 1), ymin = 0.0, ymax = math.pi, periodic = False) angle_force = openmm.CustomCompoundBondForce( 3, f"(ua(idx, angle(p1, p2, p3)) + log(sin(pi - angle(p1, p2, p3))) )*kT" ) angle_force.addGlobalParameter('pi', math.pi) angle_force.addGlobalParameter('kT', kT) angle_force.addTabulatedFunction("ua", func) angle_force.addPerBondParameter('idx') for k in range(bonded_terms['angle']['indices'].shape[0]): p1, p2, p3 = bonded_terms['angle']['indices'][k] angle_force.addBond([p1, p2, p3], [float(k), alpha]) angle_force.setForceGroup(0) sys_im.addForce(angle_force) #### add dihedral force ud = torch.stack(bonded_terms['dihedral']['energy_grid']).numpy() ud = np.concatenate([ud, ud[[0]]]) tmp = (ud[:,0] + ud[:,-1])/2 ud[:,0] = tmp ud[:,-1] = tmp func = openmm.Continuous2DFunction(xsize = ud.shape[0], ysize = ud.shape[1], values = ud.reshape(-1, order = 'F'), xmin = 0.0, xmax = float(ud.shape[0] - 1), ymin = -math.pi, ymax = math.pi, periodic = True) dihedral_force = openmm.CustomCompoundBondForce( 4, f"ud(idx, dihedral(p1, p2, p3, p4))*kT" ) dihedral_force.addGlobalParameter('kT', kT) dihedral_force.addTabulatedFunction("ud", func) dihedral_force.addPerBondParameter('idx') for k in range(bonded_terms['dihedral']['indices'].shape[0]): p1, p2, p3, p4 = bonded_terms['dihedral']['indices'][k] dihedral_force.addBond([p1, p2, p3, p4], [float(k)]) dihedral_force.setForceGroup(0) sys_im.addForce(dihedral_force) sys_im.addForce(openmm.CMMotionRemover()) #### run simulation with the system that only contains #### bonded interaction energy terms fricCoef = 1./unit.picoseconds stepsize = 5. * unit.femtoseconds integrator = openmm.LangevinMiddleIntegrator(T, fricCoef, stepsize) pdb = mdtraj.load_pdb('./data/cln025_reference.pdb') pdb = pdb.atom_slice(alpha_carbon_atom_indices) xyz_init = pdb.xyz[0] platform = openmm.Platform.getPlatformByName('Reference') context = openmm.Context(sys_im, integrator, platform) context.setPositions(xyz_init) os.makedirs(f"./output/", exist_ok = True) file_handle = open(f"./output/traj_im.dcd", 'wb') dcd_file = ommapp.DCDFile(file_handle, top_cg.to_openmm(), dt = 200*unit.femtoseconds) num_frames = len(traj_aa) for i in range(num_frames): integrator.step(100) state = context.getState(getPositions = True) pos = state.getPositions() dcd_file.writeModel(pos) if (i + 1) % 1000 == 0: print(i, flush = True) file_handle.close() .. note:: When defining the energy function :math:`u_\mathrm{bonded}(b, a, d)` in ``OpenMM``, we added extra energy terms for bonds and angles (e.g. :math:`\log(r)` for bonds and :math:`\log(\sin(\pi - \mathrm{angle}(p1, p2, p3)))` for angles). This is because of the Jacobian factor for transformation between Cartesian coordiantes and internal coordinates. We can compare distributions of internal coordinates between data (the all-atom simulation) and noise samples to make sure our noise energy function is defined correctly in ``OpenMM``. .. code-block:: python #### compare traj_im and traj_cg bond_cg = mdtraj.compute_distances(traj_cg, bonded_terms['bond']['indices']) angle_cg = mdtraj.compute_angles(traj_cg, bonded_terms['angle']['indices']) dihedral_cg = mdtraj.compute_dihedrals(traj_cg, bonded_terms['dihedral']['indices']) traj_im = mdtraj.load_dcd('./output/traj_im.dcd', top_cg) bond_im = mdtraj.compute_distances(traj_im, bonded_terms['bond']['indices']) angle_im = mdtraj.compute_angles(traj_im, bonded_terms['angle']['indices']) dihedral_im = mdtraj.compute_dihedrals(traj_im, bonded_terms['dihedral']['indices']) fig = plt.figure(figsize = (6.4*4, 4.8*6)) fig.clf() idx_plot = 1 for j in range(bond_cg.shape[1]): plt.subplot(6,4,idx_plot) bond_min = min(bond_cg[:,j].min(), bond_im[:,j].min()) bond_max = max(bond_cg[:,j].max(), bond_im[:,j].max()) plt.hist(bond_cg[:,j], bins = 30, range = (bond_min, bond_max), density = True, color = 'C0', alpha = 0.5, label = 'All atom') plt.hist(bond_im[:,j], bins = 30, range = (bond_min, bond_max), density = True, color = 'C1', alpha = 0.5, label = 'CG_im') plt.title(f'Bond {j}-{j+1}') plt.legend() plt.tight_layout() idx_plot += 1 for j in range(angle_cg.shape[1]): plt.subplot(6,4,idx_plot) plt.hist(angle_cg[:,j], bins = 30, range = (0, math.pi), density = True, color = 'C0', alpha = 0.5, label = 'All atom') plt.hist(angle_im[:,j], bins = 30, range = (0, math.pi), density = True, color = 'C1', alpha = 0.5, label = 'CG_im') plt.title(f'Angle {j}-{j+1}-{j+2}') plt.legend() plt.tight_layout() idx_plot += 1 for j in range(dihedral_cg.shape[1]): plt.subplot(6,4,idx_plot) plt.hist(dihedral_cg[:,j], bins = 30, range = (-math.pi, math.pi), density = True, color = 'C0', alpha = 0.5, label = 'All atom') plt.hist(dihedral_im[:,j], bins = 30, range = (-math.pi, math.pi), density = True, color = 'C1', alpha = 0.5, label = 'CG_im') plt.title(f'Dihedral {j}-{j+1}-{j+2}-{j+3}') plt.legend() plt.tight_layout() idx_plot += 1 plt.savefig('./output/bad_hist_im.pdf') plt.close() .. image:: ../../examples/chignolin/output/bad_hist_im.png From the above plot, we see that the coarse-grained energy function with just bonded terms reproduces the marginal distribution of every internal coordinate. However, reproducing marginal distributions does not guarantee that the energy function reproduces the conformational ensemble. One way to see that is comparing the distribution of RMSD between data and noise samples, as shown below. .. image:: ../../examples/chignolin/output/rmsd_hist_aa_vs_im.png