Abstract
We revisit the problem of determining protein structure from geometrical restraints obtained by NMR, using convex optimization. It is well known that the NP-hard distance geometry problem of determining atomic positions from pairwise distance restraints can be relaxed into a convex semidefinite program (SDP). However, the NOE distance restraints are often too imprecise and sparse for accurate structure determination. Residual dipolar coupling (RDC) measurements provide additional geometric information on the angles between atom-pair directions and the axes of the principal axis frame. The optimization problem involving RDC is highly non-convex and requires a good initialization even within the simulated annealing framework. In this paper, we model the protein backbone as an articulated structure composed of rigid units. Determining the rotation of each rigid unit gives the full protein structure. We propose solving the non-convex optimization problems using the sum-of-squares (SOS) hierarchy, a hierarchy of convex relaxations with increasing complexity and approximation power. Unlike classical global optimization approaches, SOS optimization returns a certificate of optimality if the global optimum is found. Based on the SOS method, we propose two algorithms, RDC-SOS and RDC–NOE-SOS, that have polynomial time complexity in the number of amino-acid residues and run efficiently on a standard desktop. In many instances, the proposed methods exactly recover the solution to the original non-convex optimization problem. To the best of our knowledge, this is the first time SOS relaxation has been introduced to solve non-convex optimization problems in structural biology. We further introduce a statistical tool, the Cramér–Rao bound (CRB), to provide an information-theoretic bound on the highest resolution one can hope to achieve when determining protein structure from noisy measurements using any unbiased estimator. Our simulation results show that when the RDC measurements are corrupted by Gaussian noise of realistic variance, both SOS-based algorithms attain the CRB. We successfully apply our method in a divide-and-conquer fashion to determine the structure of ubiquitin from experimental NOE and RDC measurements obtained in two alignment media, achieving more accurate and faster reconstructions compared to the current state of the art.
References
Alipanahi B, Krislock N, Ghodsi A, Wolkowicz H, Donaldson L, Li M (2013) Determining protein structures from NOESY distance constraints by semidefinite programming. J Comput Biol 20(4):296–310
Andriluka M, Roth S, Schiele B (2009) Pictorial structures revisited: people detection and articulated pose estimation. In: IEEE conference on computer vision and pattern recognition (CVPR 2009), pp 1014–1021. IEEE
MOSEK ApS (2010) The MOSEK optimization software
Bax A, Kontaxis G, Tjandra N (2001) Dipolar couplings in macromolecular structure determination. Methods Enzymol 339:127
Biswas P, Liang T-C, Toh K-C, Ye Y, Wang T-C (2006) Semidefinite programming approaches for network localization with noisy distance measurements. IEEE Trans Autom Sci Eng 3(4):360–371
Blackledge M (2005) Recent progress in the study of biomolecular structure and dynamics in solution from residual dipolar couplings. Prog Nucl Magn Reson Spectrosc 46(1):23–61
Blekherman G, Parrilo PP, Thomas RR (2011) Semidefinite optimization and convex algebraic geometry. Society for Industrial and Applied Mathematics, Philadelphia
Bonvin AMJJ, Brünger AT (1996) Do NOE distances contain enough information to assess the relative populations of multi-conformer structures? J Biomol NMR 7(1):72–76
Boumal N, Mishra B, Absil P-A, Sepulchre R (2014) Manopt, a Matlab toolbox for optimization on manifolds. J Mach Learn Res 15(1):1455–1459
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Bryson M, Tian F, Prestegard JH, Valafar H (2008) REDCRAFT: a tool for simultaneous characterization of protein backbone structure and motion from RDC data. J Magn Reson 191(2):322–334
Case DA (1994) Normal mode analysis of protein dynamics. Curr Opin Struct Biol 4(2):285–290
Casella G, Berger RL (2002) Statistical inference. Duxbury, Pacific Grove
Cassioli A, Bardiaux B, Bouvier G, Mucherino A, Alves R, Liberti L, Nilges M, Lavor C, Malliavin TE (2015) An algorithm to enumerate all possible protein conformations verifying a set of distance constraints. BMC Bioinform 16(1):23
Chen K, Tjandra N (2012) The use of residual dipolar coupling in studying proteins by NMR. In: Zhu G (ed) NMR of proteins and small biomolecules. Springer, Heidelberg, pp 47–67
Clore GM, Gronenborn AM, Tjandra N (1998) Direct structure refinement against residual dipolar couplings in the presence of rhombicity of unknown magnitude. J Magn Reson 131(1):159–162
Cornilescu G, Marquardt JL, Ottiger M, Bax A (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J Am Chem Soc 120(27):6836–6837
Cucuringu M, Singer A, Cowburn D (2012) Eigenvector synchronization, graph rigidity and the molecule problem. Inf Inference 1(1):21–67
De Castro Y, Gamboa F, Henrion D, Lasserre J-B (2017) Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Trans Inf Theory 63(1):621–630
Ding Y, Krislock N, Qian J, Wolkowicz H (2010) Sensor network localization, Euclidean distance matrix completions, and graph realization. Optim Eng 11(1):45–66
Donald BR (2011) Algorithms in structural molecular biology. MIT Press, Cambridge
Gavrila MD (1999) The visual analysis of human movement: a survey. Comput Vis Image Underst 73(1):82–98
Goemans MX, Williamson DP (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J ACM (JACM) 42(6):1115–1145
Gorman JD, Hero AO (1990) Lower bounds for parametric estimation with constraints. IEEE Trans Inf Theory 36(6):1285–1301
Grant M, Boyd S (2014) CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx
Güntert P (2004) Automated NMR structure calculation with CYANA. In: Protein NMR techniques. Springer, pp 353–378
Güntert P, Mumenthaler C, Wüthrich K (1997) Torsion angle dynamics for NMR structure calculation with the new program DYANA. J Mol Biol 273(1):283–298
Havel TF (1998) Distance geometry: theory, algorithms, and chemical applications. Encycl Comput Chem 120:723–742
Henrion D, Garulli A (2005) Positive polynomials in control, vol 312. Springer, Berlin
Henrion D, Lasserre J-B (2005) Detecting global optimality and extracting solutions in GloptiPoly. In: Henrion D, Garulli A (eds) Positive polynomials in control. Springer, Berlin, pp 293–310
Jackson B (2007) Notes on the rigidity of graphs (Levico conference notes), vol 4. Citeseer
Joo K, Joung I, Cheng Q, Lee SJ, Lee J (2016) Contact-assisted protein structure modeling by global optimization in CASP11. Proteins
Joo K, Joung IS, Lee J, Lee J, Lee W, Brooks B, Lee SJ, Lee J (2015) Protein structure determination by conformational space annealing using NMR geometric restraints. Proteins 83(12):2251–2262
Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization by simulated annealing. Science 220(4598):671–680
Kontaxis G, Delaglio F, Bax A (2005) Molecular fragment replacement approach to protein structure determination by chemical shift and dipolar homology database mining. Methods Enzymol 394:42–78
Krishnan K, Terlaky T (2005) Interior point and semidefinite approaches in combinatorial optimization. In: Graph theory and combinatorial optimization. Springer, Dordrecht, pp 101–157
Krislock N (2010) Semidefinite facial reduction for low-rank Euclidean distance matrix completion. Ph.D. thesis, University of Waterloo
Kumar A, Ernst RR, Wüthrich K (1980) A two-dimensional nuclear Overhauser enhancement (2d NOE) experiment for the elucidation of complete proton-proton cross-relaxation networks in biological macromolecules. Biochem Biophys Res Commun 95(1):1–6
Lasserre JB (2001) Global optimization with polynomials and the problem of moments. SIAM J Optim 11(3):796–817
Lasserre JB (2015) An introduction to polynomial and semi-algebraic optimization. Cambridge University Press, Cambridge
Li F, Grishaev A, Ying J, Bax A (2015) Side chain conformational distributions of a small protein derived from model-free analysis of a large set of residual dipolar couplings. J Am Chem Soc 137(46):14798–14811
Liberti L, Lavor C, Maculan N, Mucherino A (2014) Euclidean distance geometry and applications. SIAM Rev 56(1):3–69
Lipsitz RS, Tjandra N (2004) Residual dipolar couplings in NMR structure analysis. Annu Rev Biophys Biomol Struct 33:387–413
Liwo A, Lee J, Ripoll DR, Pillardy J, Scheraga HA (1999) Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci 96(10):5482–5485
Losonczi JA, Andrec M, Fischer MWF, Prestegard JH (1999) Order matrix analysis of residual dipolar couplings using singular value decomposition. J Magn Reson 138(2):334–342
Lovell SC, Word JM, Richardson JS, Richardson DC (2000) The penultimate rotamer library. Proteins 40(3):389–408
Mareuil F, Malliavin TE, Nilges M, Bardiaux B (2015) Improved reliability, accuracy and quality in automated NMR structure calculation with ARIA. J Biomol NMR 62(4):425–438
Moré JJ, Wu Z (1999) Distance geometry optimization for protein structures. J Glob Optim 15(3):219–234
Mukhopadhyay R, Irausquin S, Schmidt C, Valafar H (2014) DYNAFOLD: A dynamic programming approach to protein backbone structure determination from minimal sets of residual dipolar couplings. J Bioinform Comput Biol 12(1):1450002
Leung NHZ, Toh K-C (2009) An SDP-based divide-and-conquer algorithm for large-scale noisy anchor-free graph realization. SIAM J Sci Comput 31(6):4351–4372
Nie J (2009) Sum of squares method for sensor network localization. Comput Optim Appl 43(2):151–179
Nie J (2014) Optimality conditions and finite convergence of Lasserre's hierarchy. Math Progr 146(1–2):97–121
Parrilo PA (2003) Semidefinite programming relaxations for semialgebraic problems. Math Progr 96(2):293–320
Prestegard JH, Agard DA, Moremen KW, Lavery LA, Morris LC, Pederson K (2014) Sparse labeling of proteins: structural characterization from long range constraints. J Magn Reson 241:32–40
Ramachandran GN, Ramakrishnan CT, Sasisekharan V (1963) Stereochemistry of polypeptide chain configurations. J Mol Biol 7(1):95–99
Saupe A, Englert G (1963) High-resolution nuclear magnetic resonance spectra of orientated molecules. Phys Rev Lett 11(10):462
Saxe JB (1980) Embeddability of weighted graphs in k-space is strongly NP-hard. Carnegie-Mellon University, Department of Computer Science
Schmidt E, Güntert P (2012) A new algorithm for reliable and general NMR resonance assignment. J Am Chem Soc 134(30):12817–12829
Schwieters CD, Kuszewski JJ, Tjandra N, Clore GM (2003) The Xplor-NIH NMR molecular structure determination package. J Magn Reson 160(1):65–73
Shen Y, Delaglio F, Cornilescu G, Bax A (2009) TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J Biomol NMR 44(4):213–223
Singer A (2011) Angular synchronization by eigenvectors and semidefinite programming. Appl Comput Harmon Anal 30(1):20–36
So AMC, Ye Y (2007) Theory of semidefinite programming for sensor network localization. Math Progr 109(2–3):367–384
Stoica P, Ng BC (1998) On the Cramér–Rao bound under parametric constraints. IEEE Signal Process Lett 5(7):177–179
Sturm JF (1999) Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 11(1–4):625–653
Tang G, Shah P (2015) Guaranteed tensor decomposition: a moment approach
Tjandra N, Bax A (1997) Direct measurement of distances and angles in biomolecules by NMR in a dilute liquid crystalline medium. Science 278(5340):1111–1114
Toh K-C, Todd MJ, Tütüncü RH (1999) SDPT3: a MATLAB software package for semidefinite programming, version 1.3. Optim Methods Softw 11(1–4):545–581
Tolman JR, Ruan K (2006) NMR residual dipolar couplings as probes of biomolecular dynamics. Chem Rev 106(5):1720–1736
Tolman JR, Flanagan JM, Kennedy MA, Prestegard JH (1995) Nuclear magnetic dipole interactions in field-oriented proteins: information for structure determination in solution. Proc Natl Acad Sci 92(20):9279–9283
Tripathy C, Zeng J, Zhou P, Donald BR (2012) Protein loop closure using orientational restraints from NMR data. Proteins 80(2):433–453
Vijay-Kumar S, Bugg CE, Cook WJ (1987) Structure of ubiquitin refined at 1.8Å resolution. J Mol Biol 194(3):531–544
Wang L, Donald BR (2004) Exact solutions for internuclear vectors and backbone dihedral angles from NH residual dipolar couplings in two media, and their application in a systematic search algorithm for determining protein backbone structure. J Biomol NMR 29(3):223–242
Wang L, Mettu RR, Donald BR (2006) A polynomial-time algorithm for de novo protein backbone structure determination from nuclear magnetic resonance data. J Comput Biol 13(7):1267–1288
Wang Z, Zheng S, Ye Y, Boyd S (2008) Further relaxations of the semidefinite programming approach to sensor network localization. SIAM J Optim 19(2):655–673
Wasserman L (2013) All of statistics: a concise course in statistical inference. Springer, New York
Hu W, Wang L (2006) Residual dipolar couplings: measurements and applications to biomolecular studies. Annu Rep NMR Spectrosc 58:231–303
Weinberger KQ, Saul LK (2006) An introduction to nonlinear dimensionality reduction by maximum variance unfolding. AAAI 6:1683–1686
Whiteley W (2005) Counting out to the flexibility of molecules. Phys Biol 2(4):S116
Wüthrich K (2003) NMR studies of structure and function of biological macromolecules (Nobel lecture). Angew Chem Int Ed 42(29):3340–3363
Xu Y, Zheng Y, Fan J-S, Yang D (2006) A new strategy for structure determination of large proteins in solution without deuteration. Nat Methods 3(11):931–937
Yershova A, Tripathy C, Zhou P, Donald BR (2011) Algorithms and analytic solutions using sparse residual dipolar couplings for high-resolution automated protein backbone structure determination by NMR. In: Algorithmic foundations of robotics IX. Springer, pp 355–372
Zeng J, Boyles J, Tripathy C, Wang L, Yan A, Zhou P, Donald BR (2009) High-resolution protein structure determination starting with a global fold calculated from exact solutions to the RDC equations. J Biomol NMR 45(3):265–281
Zweckstetter M (2008) NMR: Prediction of molecular alignment from structure using the PALES software. Nat Protoc 3(4):679–690
Acknowledgements
The authors would like to thank James Saunderson for discussions related to unit quaternion parameterization for optimization problems on \(\mathbb {SO}(3)\). The authors are grateful to João M. Pereira, Roy R. Lederman and Yutong Chen for discussions regarding this problem, and to Nicolas Boumal for discussions on manifold optimization and for proofreading an earlier version of this manuscript. The authors also want to thank Richard Harris and Roberto Tejero for assisting with interpreting and reading NMR restraint files. The research of AS was partially supported by award FA9550-12-1-0317 from AFOSR, by the Simons Foundation Investigator award and the Simons Foundation Collaboration on Algorithms and Geometry, and by the Moore Foundation Data-Driven Discovery Investigator award.
Appendix
The residual dipolar coupling and Saupe tensor
We give here a brief introduction to RDC and the Saupe tensor; a detailed exposition can be found, for example, in Tolman and Ruan (2006). Let \(\varvec{v}_{nm}\) be the unit vector denoting the direction of the bond between nuclei n and m, and let \(\varvec{b}\) be the unit vector denoting the direction of the magnetic field. The RDC \(D_{nm}\) due to the interaction between nuclei n and m is
\[ D_{nm} = D_{nm}^\text{max} \left\langle \frac{3(\varvec{b}^T \varvec{v}_{nm})^2 - 1}{2} \right\rangle_{t,e}, \tag{67} \]
where \(D_{nm}^\text{max}\) is a constant determined by the gyromagnetic ratios \(\gamma _n,\gamma _m\) of the two nuclei, the bond length \(r_{nm}\), and Planck's constant h, and \(\langle \ \cdot \rangle _{t,e}\) denotes the ensemble and time averaging operator. As presented, the RDC depends on the relative angle between the magnetic field and the bond. By extension of terminology, we refer to the normalized RDC
\[ r_{nm} = \frac{D_{nm}}{D_{nm}^\text{max}} \tag{69} \]
as simply the RDC.
It is conventional to interpret the RDC measurement in the molecular frame. More precisely, we treat the molecule as static in some coordinate system and the magnetic field direction as a time- and sample-varying vector. In this case the RDC becomes
\[ r_{nm} = \varvec{v}_{nm}^T \varvec{S} \varvec{v}_{nm}, \tag{70} \]
where the Saupe tensor \(\varvec{S}\) is defined as
\[ \varvec{S} = \frac{1}{2}\left( 3 \langle \varvec{b}\varvec{b}^T \rangle_{t,e} - \varvec{I}_3 \right). \tag{71} \]
We note that \(\varvec{S}\) is symmetric and \(\text {Tr}(\varvec{S}) = 0\). In order to use RDC for structural refinement of a protein, \(\varvec{S}\) is usually first determined from a known structure (known \(\varvec{v}_{nm}\)) that is similar to the protein.
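The relation between the averaged form of the normalized RDC and the quadratic form in the Saupe tensor can be checked numerically. The sketch below assumes the common convention \(\varvec{S} = \frac{1}{2}(3\langle \varvec{b}\varvec{b}^T\rangle - \varvec{I}_3)\) and replaces the ensemble and time average by a sample average over synthetic field directions; the function names and the data are hypothetical and are not taken from the paper.

```python
import numpy as np

def saupe_tensor(field_dirs):
    """Saupe tensor S = (3 <b b^T> - I) / 2 from sampled field directions (one per row)."""
    b = np.asarray(field_dirs, dtype=float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)  # enforce unit length
    second_moment = b.T @ b / b.shape[0]               # sample average of b b^T
    return 0.5 * (3.0 * second_moment - np.eye(3))

def rdc_by_averaging(field_dirs, bond_dir):
    """Normalized RDC computed directly as the average of (3 (b.v)^2 - 1) / 2."""
    b = np.asarray(field_dirs, dtype=float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    v = np.asarray(bond_dir, dtype=float)
    v = v / np.linalg.norm(v)
    return float(np.mean((3.0 * (b @ v) ** 2 - 1.0) / 2.0))

def rdc_from_saupe(S, bond_dir):
    """Normalized RDC computed as the quadratic form v^T S v."""
    v = np.asarray(bond_dir, dtype=float)
    v = v / np.linalg.norm(v)
    return float(v @ S @ v)

# Synthetic, weakly anisotropic sample of field directions (assumed data).
rng = np.random.default_rng(0)
field_dirs = rng.normal(size=(500, 3)) * np.array([1.0, 1.0, 1.5])
S = saupe_tensor(field_dirs)
bond = np.array([1.0, 2.0, -1.0])
```

Both routes give the same number for any bond direction, and the resulting tensor is symmetric and traceless up to floating-point error.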
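We now detail a classical way of obtaining the Saupe tensor from a known template structure (Losonczi et al. 1999). Using the fact that \(\varvec{S}\) is symmetric and \(\text {Tr}(\varvec{S}) = 0\), Eq. (70) can be rewritten as
\[ r_{nm} = \left({\varvec{v}_{nm}}_y^2 - {\varvec{v}_{nm}}_x^2\right)\varvec{S}(2,2) + \left({\varvec{v}_{nm}}_z^2 - {\varvec{v}_{nm}}_x^2\right)\varvec{S}(3,3) + 2{\varvec{v}_{nm}}_x{\varvec{v}_{nm}}_y\varvec{S}(1,2) + 2{\varvec{v}_{nm}}_x{\varvec{v}_{nm}}_z\varvec{S}(1,3) + 2{\varvec{v}_{nm}}_y{\varvec{v}_{nm}}_z\varvec{S}(2,3), \tag{72} \]
where \({\varvec{v}_{nm}}_i\), \(i=x,y,z\) are the different components of \(\varvec{v}_{nm}\) in the molecular frame. When there are L RDC measurements, Eq. (72) results in L linear equations in five unknowns (\(\varvec{S}(2,2),\varvec{S}(3,3),\varvec{S}(1,2),\varvec{S}(1,3)\) and \(\varvec{S}(2,3)\)), which can be written in matrix form as
\[ \varvec{A}\varvec{s} = \varvec{r}, \tag{73} \]
where \(\varvec{s} = [\varvec{S}(2,2),\varvec{S}(3,3),\varvec{S}(1,2),\varvec{S}(1,3),\varvec{S}(2,3)]^T\), \(\varvec{r}\in {\mathbb {R}}^{L}\) collects the measured RDCs, and \(\varvec{A}\in {\mathbb {R}}^{L\times 5}\). An ordinary least squares procedure can be used to estimate \(\varvec{s}\) if \(\varvec{A}\) has full column rank. This is also referred to as the SVD procedure in Losonczi et al. (1999).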
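The least-squares step just described can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the column ordering of the design matrix follows the five unknowns listed above, the function names are hypothetical, and `np.linalg.lstsq` (an SVD-based solver) stands in for the SVD procedure.

```python
import numpy as np

def design_matrix(bond_dirs):
    """Rows of A for unknowns [S22, S33, S12, S13, S23], using S11 = -S22 - S33."""
    v = np.asarray(bond_dirs, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit bond directions
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    return np.column_stack([y**2 - x**2, z**2 - x**2, 2*x*y, 2*x*z, 2*y*z])

def fit_saupe(bond_dirs, rdcs):
    """Least-squares estimate of the Saupe tensor from normalized RDCs."""
    A = design_matrix(bond_dirs)
    s, *_ = np.linalg.lstsq(A, np.asarray(rdcs, dtype=float), rcond=None)
    s22, s33, s12, s13, s23 = s
    # Rebuild the symmetric, traceless 3 x 3 tensor from the five unknowns.
    return np.array([[-s22 - s33, s12, s13],
                     [s12, s22, s23],
                     [s13, s23, s33]])
```

With at least five bond directions in general position the design matrix has full column rank, and noise-free RDCs reproduce the tensor that generated them.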
Sum-of-squares relaxation
In this section, we explain why the convex relaxation presented in the “Convex relaxation with only RDC constraints” section is called an SOS relaxation. The polynomial optimization problem
where \(f(\varvec{x}),h(\varvec{x})\) are polynomial functions, can be expressed equivalently as
This is equivalent to
[(Blekherman et al. 2011), Chapter 3], which is the dual problem to (74). However, due to the NP-hardness in testing the non-negativity of a polynomial (Blekherman et al. 2011), we further restrict the search space from the set of non-negative polynomials to the set of SOS polynomials:
This results in a standard SDP
for some specific choices of t. Since \(p_1 = d_1\ge d_2\), solving (78) provides a lower bound to (74). Indeed, the dual of (78) is exactly the type of convex relaxations presented in “Convex relaxation with only RDC constraints” section for optimization problems of the form (74).
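A toy univariate case may help make the SOS lower bound concrete. The sketch below is an assumed example, not part of the paper: for \(f(x) = x^2 - 2x + 3\), the polynomial \(f(x) - \gamma\) is a sum of squares exactly when its Gram matrix in the monomial basis \([1, x]\) is positive semidefinite, and the largest such \(\gamma\) is a lower bound on the minimum of f. Here the Gram matrix is unique, so a bisection on \(\gamma\) suffices; in the multivariate problems of the paper the Gram matrix is not unique and an SDP solver searches over it.

```python
import numpy as np

def gram_matrix(gamma):
    """Gram matrix G with f(x) - gamma = [1, x] G [1, x]^T for f(x) = x^2 - 2x + 3."""
    return np.array([[3.0 - gamma, -1.0],
                     [-1.0, 1.0]])

def is_sos(gamma, tol=1e-9):
    """f - gamma is a sum of squares iff its Gram matrix is positive semidefinite."""
    return np.linalg.eigvalsh(gram_matrix(gamma)).min() >= -tol

def best_lower_bound(lo=-10.0, hi=10.0, iters=60):
    """Bisection for the largest gamma such that f - gamma is SOS."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_sos(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In this example the bound coincides with the true minimum of f, attained at x = 1, because every non-negative univariate polynomial is a sum of squares.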
Cramér–Rao lower bound
In this section, we introduce a classical tool from statistics, the Cramér–Rao bound (CRB) (Casella and Berger 2002), to give perspective on the lowest possible error any unbiased estimator can achieve when estimating coordinates from noisy RDC measurements. We first describe the CRB for general point estimators. Let \(\varvec{\theta }\in {\mathbb {R}}^n\) be a multidimensional parameter to be estimated from measurements \(\varvec{x} \in {\mathbb {R}}^{m}\). Suppose \(\varvec{x}\) is generated from the distribution \(p(\varvec{x}\vert \varvec{\theta }).\) The Fisher information matrix (FIM) is defined as the \(n\times n\) matrix
\[ \varvec{I}(\varvec{\theta }) = \mathbb {E}\left[ \left(\nabla _{\varvec{\theta }} \log p(\varvec{x}\vert \varvec{\theta })\right) \left(\nabla _{\varvec{\theta }} \log p(\varvec{x}\vert \varvec{\theta })\right)^T \right], \tag{79} \]
where the expectation is taken with respect to the distribution \(p(\varvec{x}\vert \varvec{\theta })\) and the gradient \(\nabla _{\varvec{\theta }}\) is taken with respect to \(\varvec{\theta }\). For any unbiased estimator \(\hat{\varvec{\theta }}\) of \(\varvec{\theta }\), that is \(\mathbb {E}(\hat{\varvec{\theta }}) = \varvec{\theta }\), the following relationship holds:
\[ \mathbb {E}\left[ (\hat{\varvec{\theta }} - \varvec{\theta })(\hat{\varvec{\theta }} - \varvec{\theta })^T \right] \succeq \varvec{I}(\varvec{\theta })^{-1} \tag{80} \]
if \(\varvec{I}(\varvec{\theta })\) is invertible. Therefore the total variance of the estimator \(\hat{\varvec{\theta }}\) is lower bounded by \(\text {Tr}(\varvec{I}(\varvec{\theta })^{-1})\). We remark that for an unbiased estimator the variance and the mean squared error are the same, so we often use these terms interchangeably.
We also introduce the CRB in the case when \(\varvec{\theta }\) and \(\hat{\varvec{\theta }}\) are constrained to be in the set \(\{\varvec{\theta }\vert \ f(\varvec{\theta })= 0\}\) where \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}^k\) (Stoica and Ng 1998). Let \(\varvec{Df}(\varvec{\theta })\in {\mathbb {R}}^{k\times n}\) be the gradient matrix of f at \(\varvec{\theta }\) with full row rank, and \(\varvec{Q}\in {\mathbb {R}}^{n\times (n-k)}\) be a set of orthonormal vectors satisfying
\[ \varvec{Df}(\varvec{\theta })\varvec{Q} = 0, \tag{81} \]
i.e. \(\varvec{Q}\) is an orthonormal basis of the null space of \(\varvec{Df}(\varvec{\theta })\). In this case, for any unbiased estimator \(\hat{\varvec{\theta }}\) satisfying \(f(\hat{\varvec{\theta }}) = 0\), the CRB is then
\[ \mathbb {E}\left[ (\hat{\varvec{\theta }} - \varvec{\theta })(\hat{\varvec{\theta }} - \varvec{\theta })^T \right] \succeq \varvec{Q}\left(\varvec{Q}^T \varvec{I}(\varvec{\theta }) \varvec{Q}\right)^{-1}\varvec{Q}^T \tag{82} \]
if \(\varvec{Q}^T \varvec{I}(\varvec{\theta }) \varvec{Q}\) is invertible.
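The constrained bound can be evaluated numerically once a Fisher information matrix and a constraint gradient are available. The following is a minimal sketch assuming both are given as dense arrays; the helper names are hypothetical, and the null-space basis is taken from an SVD, which is one of several reasonable choices.

```python
import numpy as np

def null_space_basis(Df, tol=1e-10):
    """Orthonormal basis Q of the null space of Df, so that Df @ Q = 0."""
    Df = np.atleast_2d(np.asarray(Df, dtype=float))
    _, svals, vt = np.linalg.svd(Df, full_matrices=True)
    rank = int(np.sum(svals > tol))
    return vt[rank:].T  # remaining right singular vectors span the null space

def constrained_crb(fim, Df):
    """Constrained CRB matrix Q (Q^T I Q)^{-1} Q^T; requires Q^T I Q to be invertible."""
    Q = null_space_basis(Df)
    reduced = Q.T @ np.asarray(fim, dtype=float) @ Q
    return Q @ np.linalg.inv(reduced) @ Q.T

# Assumed toy inputs: a diagonal FIM and one constraint that freezes the third coordinate.
fim = np.diag([1.0, 2.0, 4.0])
Df = np.array([[0.0, 0.0, 1.0]])
bound = constrained_crb(fim, Df)
```

In this toy case the constraint removes all uncertainty along the third coordinate, so the bound has no variance in that direction and its trace is smaller than that of the unconstrained bound.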
CRB for the variance of coordinate estimator
We are now ready to investigate the CRB for estimating atomic positions from RDC data. Let \(\varvec{\zeta }= [\varvec{\zeta }_1,\ldots ,\varvec{\zeta }_K]\in {\mathbb {R}}^{3\times K}\) be the coordinates of the atoms we want to estimate. We aim to derive a lower bound \(\text {Tr}(\varvec{Q}(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q})^{-1} \varvec{Q}^T)\) of \(\mathbb {E}[\text {Tr}((\hat{\varvec{\zeta }}-\varvec{\zeta })^T(\hat{\varvec{\zeta }}-\varvec{\zeta }))]\) for any unbiased estimator \(\hat{\varvec{\zeta }}\) of \(\varvec{\zeta }\), and compare
\[ \sqrt{\frac{\text {Tr}\left(\varvec{Q}(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q})^{-1} \varvec{Q}^T\right)}{K}} \tag{83} \]
with the RMSD of the solutions from RDC-SOS and RDC–NOE-SOS in Fig. 3.
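We assume that the RDC measurements are generated through the noise model in (64). This noise model is used to get an expression for \(\varvec{I}(\varvec{\theta })\). There are several sets of equality constraints that need to be considered when deriving \(\varvec{Q}\). We assume that within each rigid unit, the distance between any pair of atoms is fixed. We therefore have a set of equality constraints
\[ \Vert \varvec{\zeta }_i - \varvec{\zeta }_j \Vert _2^2 = d_{ij}^2, \quad (i,j)\in E_\text {fixed}, \tag{84} \]
where \(d_{ij}\) is the fixed distance between atoms i and j, and \(E_\text {fixed}\) consists of all atom pairs within each and every rigid unit. Without loss of generality, we also consider the constraint
\[ \varvec{\zeta }\varvec{1} = 0, \tag{85} \]
which implies the points \(\varvec{\zeta }_1,\ldots ,\varvec{\zeta }_K\) are centered at zero. This is due to the fact that
\[ \text {Tr}((\hat{\varvec{\zeta }}-\varvec{\zeta })^T(\hat{\varvec{\zeta }}-\varvec{\zeta })) = \text {Tr}((\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c)^T(\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c)) + K\Vert \varvec{t}\Vert _2^2 \ge \text {Tr}((\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c)^T(\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c)), \tag{86} \]
where \(\varvec{\zeta }_c\) and \(\hat{\varvec{\zeta }}_c\) denote the zero-centered coordinates and coordinate estimators, and \(\varvec{t}\) is the relative translation between \(\varvec{\zeta }\) and \(\hat{\varvec{\zeta }}\). Equation (86) implies that deriving a lower bound for \(\mathbb {E}[\text {Tr}((\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c)^T(\hat{\varvec{\zeta }}_c - \varvec{\zeta }_c))]\) is sufficient for obtaining a lower bound for \(\mathbb {E}[\text {Tr}((\hat{\varvec{\zeta }}-\varvec{\zeta })^T(\hat{\varvec{\zeta }}-\varvec{\zeta }))]\). When there are atoms that are constrained to lie on the same plane, we need to add the constraint that any three vectors in the plane span a space with zero volume, i.e.
\[ \det \left[ \varvec{\zeta }_i - \varvec{\zeta }_j,\ \varvec{\zeta }_k - \varvec{\zeta }_l,\ \varvec{\zeta }_m - \varvec{\zeta }_n \right] = 0 \tag{87} \]
for atoms i, j, k, l, m, n in the same plane.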
We first start with deriving an expression for the Fisher information matrix when RDC data are generated through (64). From (64) and (65), the likelihood function for the coordinates is
and the log-likelihood is (up to an additive constant)
where \(\varvec{e}_{nm} = \varvec{e}_n - \varvec{e}_m\). The derivative of l with respect to \(\text {vec}(\varvec{\zeta })\) is then
It follows from the noise model (64) and the independence of \(\varvec{\epsilon }_{nm}^{(j)}\)’s that the Fisher information matrix
Having the Fisher information matrix, we now incorporate the constraints in (84) and (85) in order to obtain a bound as in (82). Stacking the equality constraints (84) into a \(\vert E_\text {fixed}\vert \times 1\) matrix, we get
The gradient matrix is thus
where \(\varvec{Df}(\text {vec}(\varvec{\zeta })) \in {\mathbb {R}}^{\vert E_\text {fixed} \vert \times 3K}\). We note that \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\) is known as the rigidity matrix (Jackson 2007), and the vectors in its null space indicate the direction of infinitesimal motion the atoms can take without violating (84). Even in the case when all pairwise distances between the atoms are known, there is still a 6-dimensional null space for \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\), corresponding to an infinitesimal global rotation and translation to the coordinates \(\varvec{\zeta }\) that preserves all pairwise distances. We now augment \(f(\text {vec}(\varvec{\zeta }))=0\) with the centering constraint \(\varvec{\zeta }\varvec{1} = 0\), and this augments \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\) with three rows \(\varvec{1}^T \otimes \varvec{I}_3\), i.e.
The inclusion of this centering constraint eliminates the three-dimensional subspace in the kernel of the rigidity matrix that corresponds to the translational degrees of freedom. Let \(\varvec{Q}\) be an orthonormal basis that spans the null space of \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\). Together with (91) and (82) we obtain the desired CRB. We omit the derivative for constraint (87) and simply note that including such constraints eliminates the out-of-plane infinitesimal motion of atoms lying on a rigid planar unit.
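The null-space dimensions quoted above can be checked on a small example. The sketch below assumes distance constraints of the form \(\Vert \varvec{\zeta }_i - \varvec{\zeta }_j\Vert ^2 = \text{const}\) and a column-major \(\text {vec}(\varvec{\zeta })\), so that the coordinates of atom k occupy entries 3k to 3k+2; the tetrahedron and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def rigidity_matrix(coords, edges):
    """Gradient of the squared-distance constraints; coords is 3 x K, edges lists (i, j)."""
    K = coords.shape[1]
    R = np.zeros((len(edges), 3 * K))
    for row, (i, j) in enumerate(edges):
        diff = coords[:, i] - coords[:, j]
        R[row, 3 * i:3 * i + 3] = 2.0 * diff    # derivative with respect to atom i
        R[row, 3 * j:3 * j + 3] = -2.0 * diff   # derivative with respect to atom j
    return R

def with_centering(R, K):
    """Append the three rows 1^T (Kronecker) I_3 that encode the centering constraint."""
    return np.vstack([R, np.kron(np.ones((1, K)), np.eye(3))])

def nullity(M, tol=1e-9):
    """Dimension of the null space of M."""
    return M.shape[1] - np.linalg.matrix_rank(M, tol=tol)

# Assumed example: four atoms at the corners of a tetrahedron, all six distances fixed.
tetra = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
edges = list(combinations(range(4), 2))
R = rigidity_matrix(tetra, edges)
```

For this fully constrained tetrahedron the rigidity matrix has a 6-dimensional null space, and appending the centering rows leaves only the three rotational directions; dropping one edge adds one internal flex.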
Inclusion of NOE constraints
We have so far neglected the use of NOE measurements when deriving the CRB. Unlike RDC, the NOE restraints remain more qualitative, with imprecise upper and lower bounds (Bonvin and Brünger 1996) due to the \(r^{-6}\) scaling of the interaction. In protein structure calculation, it is customary to include a flat, potential-well-like penalty [e.g. (12)] in addition to the log-likelihood function derived from RDC, or to treat the backbone NOE as inequality constraints on the distances. In either case, when the coordinates \(\varvec{\zeta }\) strictly satisfy both the upper and lower bounds on the distances, the CRB is exactly the same as the CRB derived in the “CRB for the variance of coordinate estimator” section (Gorman and Hero 1990), since the CRB only depends on the local curvature of the log-likelihood function around \(\varvec{\zeta }.\) However, when the noise on RDC is large and the NOE restraints are active in determining a coordinate estimator \(\hat{\varvec{\zeta }}\), the estimator is in general biased, and the CRB may no longer serve as a lower bound for the mean squared error of \(\hat{\varvec{\zeta }}\). In particular, it is possible for \(\hat{\varvec{\zeta }}\) to have a mean squared error lower than the CRB due to the bias introduced by the NOE (by favoring solutions that satisfy the distance bounds), as observed in Fig. 3. A fundamental result in statistical estimation theory, the bias-variance trade-off (Wasserman 2013), states that the mean squared error of an estimator is the sum of its variance and its squared bias. It is possible that, at the expense of some bias, the variance of an estimator can be greatly reduced, resulting in a mean squared error that is lower than the CRB (Wasserman 2013, Chapter 7).
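The bias-variance argument can be illustrated with a textbook model that is much simpler than the NOE setting and is used here only as an assumed analogy: n independent Gaussian samples with mean \(\theta\) and standard deviation \(\sigma\). The sample mean is unbiased and attains the CRB \(\sigma ^2/n\), while the shrunk estimator \(c\,\bar{x}\) with \(0<c<1\) is biased but can have a smaller mean squared error when \(\theta\) is small. The function names are hypothetical.

```python
def crb_gaussian_mean(sigma, n):
    """CRB for unbiased estimators of a Gaussian mean from n independent samples."""
    return sigma ** 2 / n

def mse_shrunk_mean(theta, sigma, n, c):
    """Mean squared error of the estimator c * sample_mean: variance plus squared bias."""
    variance = (c * sigma) ** 2 / n
    bias = (c - 1.0) * theta
    return variance + bias ** 2

# Assumed numbers: a small true mean and noisy data favour the biased estimator.
theta, sigma, n = 0.5, 2.0, 4
unbiased_mse = mse_shrunk_mean(theta, sigma, n, c=1.0)
shrunk_mse = mse_shrunk_mean(theta, sigma, n, c=0.5)
```

With c = 1 the expression reduces to the CRB, whereas shrinking trades a small squared bias for a large drop in variance, which is the mechanism by which a restrained estimator can fall below the bound.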
Observed Fisher information matrix and protein variability
We remark that since the LHS of (80) (or (82)) is the covariance matrix for the estimator \(\hat{\varvec{\theta }}\), the leading eigenvectors of \({\varvec{I}(\varvec{\theta })}^{-1}\) give the directions of greatest variation of the protein based on the observed data, whereas the corresponding eigenvalues give the variance (amplitude) of the variations. When deriving the CRB, we use the FIM (91), which is obtained by averaging over the distribution of the data. An estimator \(\widehat{\varvec{I}(\varvec{\theta })}\) of the FIM can be obtained from the observed data by replacing \(\varvec{\theta }\) in the FIM by its maximum-likelihood estimator \(\hat{\varvec{\theta }}\) and plugging in the observed data (in our case the observed RDC \(r_{nm}\)) instead of taking the expectation over the distribution of the data. The direction for which an estimator \(\hat{\varvec{\theta }}\) has the greatest variance can be estimated by the top eigenvector of \(\widehat{\varvec{I}(\varvec{\theta })}^{-1}\). In the constrained case, consistent with (82), we compute the top eigenvector of \(\hat{\varvec{Q}} (\hat{\varvec{Q}}^T \widehat{\varvec{I}(\varvec{\theta })} \hat{\varvec{Q}})^{-1} \hat{\varvec{Q}}^T\), where \(\hat{\varvec{Q}}\) is computed based on \(\hat{\varvec{\theta }}\) instead of \(\varvec{\theta }\). In Fig. 5a, we demonstrate the variation of the ubiquitin fragment for residues 1–7 (with 159 atoms) using the eigenvector of the estimated FIM. In Fig. 5b we show the largest 10 eigenvalues of the inverse of the estimated FIM. As we see, there is one prominent mode of variation for this protein fragment. We note that this procedure for determining the modes of protein variation bears resemblance to normal mode analysis (Case 1994). In such an analysis, the Hessian of the pseudo-energy function of a protein near a local minimum is first determined. The normal modes are then given by the eigenvectors of the Hessian matrix. If we treat the log-likelihood function as a pseudo-energy function, our FIM-based analysis of the modes of atomic displacement corresponds to classical normal mode analysis.
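Extracting the dominant modes from a covariance bound amounts to a symmetric eigendecomposition. The sketch below assumes the bound (for instance the inverse of an estimated FIM, or its constrained counterpart) is already available as a symmetric matrix; the function name and the diagonal test matrix are hypothetical.

```python
import numpy as np

def variation_modes(cov_bound, num_modes=3):
    """Largest eigenvalues and matching eigenvectors of a symmetric covariance bound."""
    evals, evecs = np.linalg.eigh(np.asarray(cov_bound, dtype=float))
    order = np.argsort(evals)[::-1][:num_modes]  # eigh returns ascending order
    return evals[order], evecs[:, order]

# Assumed toy bound with one clearly dominant direction (the second coordinate).
amplitudes, modes = variation_modes(np.diag([0.1, 5.0, 1.0]), num_modes=2)
```

Each returned column is a displacement direction in the space of stacked coordinates, and a single eigenvalue much larger than the rest corresponds to the situation described for the ubiquitin fragment, where one mode dominates.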
Infinitesimal rigidity and invertibility of the Fisher information matrix
In this subsection, we study the infinitesimal rigidity (Liberti et al. 2014) of the protein structure given RDC and distance measurements and how it guarantees invertibility of the Fisher information matrix. Let a framework with coordinates \(\varvec{\zeta }\in {\mathbb {R}}^{3\times K}\) be constrained by
and
In order to derive a condition for infinitesimal rigidity, we first let \(\text {vec}(\varvec{\zeta }(s))\) be a curve in \({\mathbb {R}}^{3K}\) parameterized by s, where \(\varvec{\zeta }(0)\) satisfies (95) and (96). Taking the derivative of the constraints in (95) and (96) with respect to s at \(s=0\), we have
The null space of the generalized rigidity matrix \(\varvec{R}(\varvec{\zeta }(0))\) with dimension \((\vert E_\text {fixed} \vert + \vert E_\text {RDC} \vert )\times 3K\) represents the direction of infinitesimal motion such that \(\varvec{\zeta }(s)\) satisfies the constraints (95), (96) for infinitesimally small s. If \(\varvec{R}(\varvec{\zeta }(0))\) only has a three dimensional nullspace, i.e. the global translations in x, y, z-directions, we say the framework \(\varvec{\zeta }(0)\) along with the constraints (95) and (96) is infinitesimally rigid.
Now we verify that the constrained Fisher information matrix is invertible if \(\varvec{R}(\varvec{\zeta }(0))\) has a three-dimensional null space corresponding to global translations of the points. We define \(\text {ker}(\varvec{A})\) to be the kernel of a matrix \(\varvec{A}\) and \(\text {range}(\varvec{A})\) to be the column space of \(\varvec{A}\). Let \(\varvec{Q}\) again be the basis of the null space of \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\) defined in (94), such that \(\varvec{Df}(\text {vec}(\varvec{\zeta }))\varvec{Q} = 0\). Let \(\varvec{v}\) satisfy \(\varvec{Q}^T\varvec{I}(\varvec{\zeta }) \varvec{Q} \varvec{v} =0\). Since \(\varvec{I}(\varvec{\zeta })\) is symmetric positive semidefinite, \(\varvec{v}^T\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q} \varvec{v} = 0\) implies \(\varvec{I}(\varvec{\zeta })\varvec{Q}\varvec{v} = 0\), so \(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q} \varvec{v} = 0\) if and only if \(\varvec{Q}\varvec{v}\in \text {ker}(\varvec{I})\). Since the columns of \(\varvec{Q}\) are linearly independent, \(\varvec{Q}\varvec{v}\ne 0\) unless \(\varvec{v}=0\). This means \(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q} \varvec{v} = 0\) if and only if \(\varvec{v} = 0\) or \(\varvec{Q}\varvec{v}\) is a nonzero vector in \(\text {ker}(\varvec{I})\cap \text {range}(\varvec{Q}) = \text {ker}(\varvec{I})\cap \text {ker}(\varvec{Df}(\text {vec}(\varvec{\zeta }))).\) Therefore if
\[ \text {ker}(\varvec{I}(\varvec{\zeta }))\cap \text {ker}(\varvec{Df}(\text {vec}(\varvec{\zeta }))) = \{0\}, \]
or in other words
\[ \text {range}(\varvec{I}(\varvec{\zeta })) + \text {range}(\varvec{Df}(\text {vec}(\varvec{\zeta }))^T) = {\mathbb {R}}^{3K}, \tag{98} \]
then \(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q}\) is invertible. From the form of (91), it is easy to show that the range condition (98) is satisfied if and only if the range of
is \({\mathbb {R}}^{3K}\). We thus arrive at the conclusion that if the framework \(\varvec{\zeta }\) is infinitesimally rigid, with the null space of \(\varvec{R}(\varvec{\zeta })\) consisting only of the global translations, the constrained Fisher information matrix \(\varvec{Q}^T \varvec{I}(\varvec{\zeta }) \varvec{Q}\) is invertible.
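The kernel-intersection criterion used in this argument is easy to test numerically. The sketch below assumes a symmetric positive semidefinite information matrix and a constraint gradient given as dense arrays: the two kernels intersect trivially exactly when the matrix obtained by stacking them has full column rank. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def constrained_fim_is_invertible(fim, Df, tol=1e-9):
    """True iff ker(fim) and ker(Df) share only the zero vector (fim symmetric PSD)."""
    fim = np.asarray(fim, dtype=float)
    Df = np.atleast_2d(np.asarray(Df, dtype=float))
    stacked = np.vstack([Df, fim])  # a common kernel vector is annihilated by both blocks
    return np.linalg.matrix_rank(stacked, tol=tol) == fim.shape[0]

# Assumed toy case: the data carry no information about the third coordinate.
fim = np.diag([1.0, 1.0, 0.0])
constraint_fixes_gap = np.array([[0.0, 0.0, 1.0]])  # constraint pins the third coordinate
constraint_misses_gap = np.array([[1.0, 0.0, 0.0]])  # constraint pins the first coordinate
```

When the constraint covers the direction that the data cannot resolve, the reduced matrix \(\varvec{Q}^T \varvec{I} \varvec{Q}\) is invertible; when it does not, an unresolved direction survives in the null space and the bound is undefined.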
In Yershova et al. (2011), it is shown that if there exist RDC measurements for a bond in the peptide plane and a bond in the CA-body in a single alignment medium, the solutions of the protein structure form a discrete set. Therefore, under this condition, there is no infinitesimal motion other than global translation such that the protein framework satisfies the RDC and NOE constraints. We can thus compute the CRB safely under this condition.
Cite this article
Khoo, Y., Singer, A. & Cowburn, D. Integrating NOE and RDC using sum-of-squares relaxation for protein structure determination. J Biomol NMR 68, 163–185 (2017). https://doi.org/10.1007/s10858-017-0108-7
DOI: https://doi.org/10.1007/s10858-017-0108-7