
Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis


Abstract

Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more representative of the typical behavior of an algorithm, but remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with first-order methods including random least squares and one-hidden layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest this universality property holds for a more general class of algorithms and problems.


Notes

  1. The signal \({\widetilde{{{\varvec{x}}}}}\) is not the same as the vector to which the iterates of the algorithm converge as \(k \rightarrow \infty \).

  2. The definition of \({\widetilde{R}}^2\) in Assumption 1 does not imply that \(R^2 \approx \frac{1}{d}\Vert {{\varvec{b}}}\Vert ^2 - {\widetilde{R}}^2\). However, the precise definition of \({\widetilde{R}}\) and this intuitive one yield similar magnitudes and both are generated from similar quantities.

  3. In many situations this deterministic quantity \( \underset{d \rightarrow \infty }{{\mathcal {E}}} [\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2]\,\) is in fact the limiting expectation of the squared-norm of the gradient. However, under the assumptions that we are using, this does not immediately follow. It is, however, always the limit of the median of the squared-norm of the gradient.

  4. Technically, there is no need to assume the measure \(\mu \) has a density; the theorem holds just as well for any limiting spectral measure \(\mu \). In fact, a version of this theorem can be formulated at finite n just as well, thus dispensing entirely with Assumption 2; cf. Proposition 4.

  5. Precisely, we show that \(\tfrac{d {\widetilde{R}}^2}{\Vert {{\varvec{x}}}^{\star }-{{\varvec{x}}}_0\Vert ^2}\) is tight (see Sect. 5, Lemma 8).

References

  1. Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R., Wang, R.: On exact computation with an infinitely wide neural net. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)

  2. Bai, Z., Silverstein, J.: No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26(1), 316–345 (1998). https://doi.org/10.1214/aop/1022855421

  3. Bai, Z., Silverstein, J.: Exact separation of eigenvalues of large-dimensional sample covariance matrices. Ann. Probab. 27(3), 1536–1555 (1999). https://doi.org/10.1214/aop/1022677458

  4. Bai, Z., Silverstein, J.: CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32(1A), 553–605 (2004). https://doi.org/10.1214/aop/1078415845

  5. Bai, Z., Silverstein, J.: Spectral analysis of large dimensional random matrices, second edn. Springer Series in Statistics. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-0661-8

  6. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

  7. Benigni, L., Péché, S.: Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090 (2019)

  8. Bhojanapalli, S., Boumal, N., Jain, P., Netrapalli, P.: Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form. In: Proceedings of the 31st Conference On Learning Theory (COLT), Proceedings of Machine Learning Research, vol. 75, pp. 3243–3270. PMLR (2018)

  9. Borgwardt, K.: A Probabilistic Analysis of the Simplex Method. Springer-Verlag, Berlin, Heidelberg (1986)

  10. Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review 60(2), 223–311 (2018). https://doi.org/10.1137/16M1080173

  11. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M., Leary, C., Maclaurin, D., Wanderman-Milne, S.: JAX: composable transformations of Python+NumPy programs (2018)

  12. Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)

  13. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 27 (2014)

  14. Deift, P., Menon, G., Olver, S., Trogdon, T.: Universality in numerical computations with random data. Proc. Natl. Acad. Sci. USA 111(42), 14973–14978 (2014). https://doi.org/10.1073/pnas.1413446111

  15. Deift, P., Trogdon, T.: Universality in numerical computation with random data: Case studies, analytical results, and some speculations. Abel Symposia 13(3), 221–231 (2018)

  16. Deift, P., Trogdon, T.: Universality in numerical computation with random data: case studies and analytical results. J. Math. Phys. 60(10), 103306, 14 (2019). https://doi.org/10.1063/1.5117151

  17. Deift, P., Trogdon, T.: The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic. Quart. Appl. Math. 79(1), 125–161 (2021). https://doi.org/10.1090/qam/1574

  18. Demmel, J.W.: The probability that a numerical analysis problem is difficult. Math. Comp. 50(182), 449–480 (1988). https://doi.org/10.2307/2008617

  19. Durrett, R.: Probability—theory and examples, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 49. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781108591034

  20. Edelman, A.: Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl 9(4), 543–560 (1988). https://doi.org/10.1137/0609045

  21. Edelman, A., Rao, N.R.: Random matrix theory. Acta Numer. 14, 233–297 (2005). https://doi.org/10.1017/S0962492904000236

  22. Engeli, M., Ginsburg, T., Rutishauser, H., Stiefel, E.: Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems. Mitt. Inst. Angew. Math. Zürich 8, 107 (1959)

  23. Fischer, B.: Polynomial based iteration methods for symmetric linear systems, Classics in Applied Mathematics, vol. 68. Society for Industrial and Applied Mathematics (SIAM) (2011). https://doi.org/10.1137/1.9781611971927.fm

  24. Flanders, D., Shortley, G.: Numerical determination of fundamental modes. J. Appl. Phys. 21, 1326–1332 (1950)

  25. Ghorbani, B., Krishnan, S., Xiao, Y.: An investigation into neural net optimization via hessian eigenvalue density. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 97, pp. 2232–2241. PMLR (2019)

  26. Golub, G., Varga, R.: Chebyshev semi-iterative methods, successive over-relaxation iterative methods, and second order Richardson iterative methods. I. Numer. Math. 3, 147–156 (1961). https://doi.org/10.1007/BF01386013

  27. Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 80, pp. 1832–1841. PMLR (2018)

  28. Hachem, W., Hardy, A., Najim, J.: Large complex correlated Wishart matrices: fluctuations and asymptotic independence at the edges. Ann. Probab. 44(3), 2264–2348 (2016). https://doi.org/10.1214/15-AOP1022

  29. Hastie, T., Montanari, A., Rosset, S., Tibshirani, R.: Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560 (2019)

  30. Hestenes, M., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards 49, 409–436 (1952)

  31. Hoare, C.A.R.: Quicksort. Comput. J. 5, 10–15 (1962). https://doi.org/10.1093/comjnl/5.1.10

  32. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: Convergence and generalization in neural networks. In: Advances in neural information processing systems (NeurIPS), vol. 31 (2018)

  33. Knowles, A., Yin, J.: Anisotropic local laws for random matrices. Probab. Theory Related Fields 169(1-2), 257–352 (2017). https://doi.org/10.1007/s00440-016-0730-4

  34. Kuijlaars, A.B.J., McLaughlin, K.T.R., Van Assche, W., Vanlessen, M.: The Riemann-Hilbert approach to strong asymptotics for orthogonal polynomials on \([-1,1]\). Adv. Math. 188(2), 337–398 (2004). https://doi.org/10.1016/j.aim.2003.08.015

  35. Lacotte, J., Pilanci, M.: Optimal randomized first-order methods for least-squares problems. In: Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 119, pp. 5587–5597. PMLR (2020)

  36. Liao, Z., Couillet, R.: The dynamics of learning: A random matrix approach. In: Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 80, pp. 3072–3081. PMLR (2018)

  37. Louart, C., Liao, Z., Couillet, R.: A random matrix approach to neural networks. Ann. Appl. Probab. 28(2), 1190–1248 (2018). https://doi.org/10.1214/17-AAP1328

  38. Marčenko, V., Pastur, L.: Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik (1967)

  39. Martin, C., Mahoney, M.: Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research 22(165), 1–73 (2021)

  40. Mei, S., Montanari, A.: The generalization error of random features regression: Precise asymptotics and double descent curve. Communications on Pure and Applied Mathematics (CPAM) (2019). https://doi.org/10.1002/cpa.22008

  41. Menon, G., Trogdon, T.: Smoothed analysis for the conjugate gradient algorithm. SIGMA Symmetry Integrability Geom. Methods Appl. 12, Paper No. 109, 22 (2016). https://doi.org/10.3842/SIGMA.2016.109

  42. Nemirovski, A.: Information-based complexity of convex programming. Lecture Notes (1995)

  43. Nesterov, Y.: Introductory lectures on convex optimization: A basic course, Applied Optimization, vol. 87. Kluwer Academic Publishers (2004). https://doi.org/10.1007/978-1-4419-8853-9

  44. Nesterov, Y.: How to make the gradients small. Optima 88 pp. 10–11 (2012)

  45. Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D., Pennington, J., Sohl-Dickstein, J.: Bayesian deep convolutional networks with many channels are gaussian processes. In: Proceedings of the 7th International Conference on Learning Representations (ICLR) (2019)

  46. Papyan, V.: The full spectrum of deepnet hessians at scale: Dynamics with sgd training and sample size. arXiv preprint arXiv:1811.07062 (2018)

  47. Paquette, E., Trogdon, T.: Universality for the conjugate gradient and minres algorithms on sample covariance matrices. arXiv preprint arXiv:2007.00640 (2020)

  48. Pedregosa, F., Scieur, D.: Average-case acceleration through spectral density estimation. In: Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119, pp. 7553–7562 (2020)

  49. Pennington, J., Worah, P.: Nonlinear random matrix theory for deep learning. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30 (2017)

  50. Pfrang, C.W., Deift, P., Menon, G.: How long does it take to compute the eigenvalues of a random symmetric matrix? In: Random matrix theory, interacting particle systems, and integrable systems, Math. Sci. Res. Inst. Publ., vol. 65, pp. 411–442. Cambridge Univ. Press, New York (2014)

  51. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 04, 791–803 (1964)

  52. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 20, pp. 1177–1184 (2008)

  53. Sagun, L., Bottou, L., LeCun, Y.: Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476 (2016)

  54. Sagun, L., Trogdon, T., LeCun, Y.: Universal halting times in optimization and machine learning. Quarterly of Applied Mathematics 76(2), 289–301 (2018). https://doi.org/10.1090/qam/1483

  55. Sankar, A., Spielman, D.A., Teng, S.: Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. Matrix Anal. Appl. 28(2), 446–476 (2006). https://doi.org/10.1137/S0895479803436202

  56. Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)

  57. Smale, S.: On the average number of steps of the simplex method of linear programming. Mathematical Programming 27(3), 241–262 (1983). https://doi.org/10.1007/BF02591902

  58. Spielman, D., Teng, S.: Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM 51(3), 385-463 (2004). https://doi.org/10.1017/CBO9780511721571.010

  59. Su, W., Boyd, S., Candès, E.: A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research 17(153), 1–43 (2016)

  60. Tao, T.: Topics in random matrix theory, vol. 132. American Mathematical Soc. (2012). https://doi.org/10.1090/gsm/132

  61. Tao, T., Vu, V.: Random matrices: the distribution of the smallest singular values. Geom. Funct. Anal. 20(1), 260–297 (2010). https://doi.org/10.1007/s00039-010-0057-8

  62. Taylor, A., Hendrickx, J., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1-2, Ser. A), 307–345 (2017). https://doi.org/10.1007/s10107-016-1009-3

  63. Todd, M.J.: Probabilistic models for linear programming. Math. Oper. Res. 16(4), 671–693 (1991). https://doi.org/10.1287/moor.16.4.671

  64. Trefethen, L.N., Schreiber, R.S.: Average-case stability of Gaussian elimination. SIAM J. Matrix Anal. Appl. 11(3), 335–360 (1990). https://doi.org/10.1137/0611023

  65. Walpole, R.E., Myers, R.H.: Probability and statistics for engineers and scientists, second edn. Macmillan Publishing Co., Inc., New York; Collier Macmillan Publishers, London (1978)

  66. Wilson, A., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30 (2017)

Acknowledgements

The authors would like to thank their colleagues Nicolas Le Roux, Ross Goroshin, Zaid Harchaoui, Damien Scieur, and Dmitriy Drusvyatskiy for their feedback on this manuscript, and Henrik Ueberschaer for providing useful random matrix theory references.

Corresponding author

Correspondence to Courtney Paquette.

Additional information

Communicated by Jim Renegar.

C. Paquette is a CIFAR AI Chair. Research by E. Paquette was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Appendices

Derivation of Polynomials

In this section, we construct the residual polynomials for various popular first-order methods, including Nesterov’s accelerated gradient and Polyak momentum.

1.1 Nesterov’s Accelerated Methods

Nesterov’s accelerated method generates its iterates using the recursion

$$\begin{aligned} {{\varvec{x}}}_{k+1} = {{\varvec{y}}}_k - \alpha \nabla f({{\varvec{y}}}_k),\quad&\text {where} \quad \alpha = \frac{1}{\lambda _{{{\varvec{H}}}}^+}\\ {{\varvec{y}}}_{k+1} = {{\varvec{x}}}_{k+1} + \beta _k( {{\varvec{x}}}_{k+1} - {{\varvec{x}}}_k), \quad&\text {where} \quad \beta _k = {\left\{ \begin{array}{ll} \frac{\sqrt{\lambda _{{{\varvec{H}}}}^+} - \sqrt{\lambda _{{{\varvec{H}}}}^-}}{\sqrt{\lambda _{{{\varvec{H}}}}^+} + \sqrt{\lambda _{{{\varvec{H}}}}^-} }, &{} \text {if }\lambda _{{{\varvec{H}}}}^- \ne 0\\ \frac{k}{k+3}, &{} \text {if }\lambda _{{{\varvec{H}}}}^- = 0. \end{array}\right. } \end{aligned}$$
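
To make the iteration concrete, the following NumPy sketch (not part of the original text) runs Nesterov's accelerated method on a synthetic least squares problem \(f({{\varvec{x}}}) = \tfrac{1}{2n}\Vert {{\varvec{A}}}{{\varvec{x}}}-{{\varvec{b}}}\Vert ^2\); the dimensions, the Gaussian data, and the use of the exact extreme eigenvalues of \({{\varvec{H}}}={{\varvec{A}}}^T{{\varvec{A}}}/n\) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative sizes and data) of Nesterov's accelerated method
# on f(x) = ||Ax - b||^2 / (2n).
rng = np.random.default_rng(0)
n, d = 2000, 1000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
H = A.T @ A / n                                     # Hessian of f
eigs = np.linalg.eigvalsh(H)
lam_minus, lam_plus = max(eigs[0], 0.0), eigs[-1]
alpha = 1.0 / lam_plus                              # step size 1/lambda^+

def momentum(k):
    # beta_k: strongly convex choice when lambda^- > 0, otherwise k/(k+3)
    if lam_minus > 1e-10:
        return (np.sqrt(lam_plus) - np.sqrt(lam_minus)) / (np.sqrt(lam_plus) + np.sqrt(lam_minus))
    return k / (k + 3.0)

grad = lambda x: A.T @ (A @ x - b) / n
x = np.zeros(d)
y = x.copy()
for k in range(500):
    x_new = y - alpha * grad(y)                     # x_{k+1} = y_k - alpha*grad f(y_k)
    y = x_new + momentum(k) * (x_new - x)           # y_{k+1} = x_{k+1} + beta_k (x_{k+1} - x_k)
    x = x_new
print("||grad f(x_K)||^2 =", np.linalg.norm(grad(x)) ** 2)
```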

By expanding the recurrence of the iterates on the least squares problem (10), we get the following three-term recurrence

$$\begin{aligned} {{\varvec{x}}}_{k+1}-{\widetilde{{{\varvec{x}}}}}&= (1 + \beta _{k-1}) (I- \alpha {{\varvec{H}}}) ({{\varvec{x}}}_k-{\widetilde{{{\varvec{x}}}}}) - \beta _{k-1} (I - \alpha {{\varvec{H}}})({{\varvec{x}}}_{k-1}-{\widetilde{{{\varvec{x}}}}}) + \alpha \cdot \tfrac{{{\varvec{A}}}^T {\varvec{\eta }}}{n}, \end{aligned}$$

with the initial vector \({{\varvec{x}}}_0 \in {{\mathbb {R}}}^d\) and \({{\varvec{x}}}_1 = {{\varvec{x}}}_0-\alpha \nabla f({{\varvec{x}}}_0)\). Using these standard initial conditions, we deduce from Proposition 1 the following

$$\begin{aligned}&P_{k+1}({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm })({{\varvec{x}}}_0-{\widetilde{{{\varvec{x}}}}}) + Q_{k+1}({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm }) \tfrac{{{\varvec{A}}}^T {\varvec{\eta }}}{n}\\&\quad = \big [ (1+ \beta _{k-1}) ({{\varvec{I}}}- \alpha {{\varvec{H}}}) P_k({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta _{k-1} ({{\varvec{I}}}- \alpha {{\varvec{H}}}) P_{k-1}({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm }) \big ] ({{\varvec{x}}}_0-{\widetilde{{{\varvec{x}}}}})\\&\qquad + \big [ (1+\beta _{k-1})({{\varvec{I}}}-\alpha {{\varvec{H}}}) Q_k({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta _{k-1}({{\varvec{I}}}-\alpha {{\varvec{H}}})Q_{k-1}({{\varvec{H}}}; \lambda _{{{\varvec{H}}}}^{\pm }) + \alpha {{\varvec{I}}}\big ] \tfrac{{{\varvec{A}}}^T {\varvec{\eta }}}{n}. \end{aligned}$$

It immediately follows that the residual polynomials satisfy the same three-term recurrence, namely,

$$\begin{aligned} \begin{aligned}&P_{k+1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = (1+\beta _{k-1}) (1-\alpha \lambda ) P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta _{k-1}(1-\alpha \lambda ) P_{k-1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\&\quad \text {with} \quad P_0(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1, \quad P_1(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1-\alpha \lambda \\&Q_{k+1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = (1+\beta _{k-1})(1-\alpha \lambda ) Q_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta _{k-1} (1 - \alpha \lambda ) Q_{k-1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) + \alpha \\&\quad \text {with} \quad Q_0(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 0, \quad Q_1(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = \alpha . \end{aligned} \end{aligned}$$
(97)

By Proposition 1, we only need to derive an explicit expression for the \(P_k\) polynomials.
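
As a numerical sanity check of Proposition 1 and the recurrence (97) (a sketch, not from the text), one can apply (97) directly to the vectors \({{\varvec{x}}}_0-{\widetilde{{{\varvec{x}}}}}\) and \({{\varvec{A}}}^T {\varvec{\eta }}/n\) and compare the result with the Nesterov iterates; the data model \({{\varvec{b}}}={{\varvec{A}}}{\widetilde{{{\varvec{x}}}}}+{\varvec{\eta }}\) and all dimensions below are illustrative assumptions.

```python
import numpy as np

# Sketch: check x_k - x_tilde = P_k(H)(x_0 - x_tilde) + Q_k(H) A^T eta / n,
# with P_k, Q_k generated by the three-term recurrence (97). Data are illustrative.
rng = np.random.default_rng(1)
n, d = 400, 200                                    # n > d, so lambda^- > 0 (strongly convex case)
A = rng.standard_normal((n, d))
x_tilde = rng.standard_normal(d)
eta = rng.standard_normal(n)
b = A @ x_tilde + eta                              # assumed generative model
H = A.T @ A / n
g = A.T @ eta / n
eigs = np.linalg.eigvalsh(H)
lam_minus, lam_plus = eigs[0], eigs[-1]
alpha = 1.0 / lam_plus
beta = (np.sqrt(lam_plus) - np.sqrt(lam_minus)) / (np.sqrt(lam_plus) + np.sqrt(lam_minus))

M = np.eye(d) - alpha * H                          # the matrix I - alpha*H from (97)
grad = lambda x: A.T @ (A @ x - b) / n

x_prev = np.zeros(d)                               # x_0
x = x_prev - alpha * grad(x_prev)                  # x_1 = x_0 - alpha*grad f(x_0)
p_prev, p = x_prev - x_tilde, M @ (x_prev - x_tilde)   # P_0 = 1, P_1 = 1 - alpha*lambda
q_prev, q = np.zeros(d), alpha * g                 # Q_0 = 0, Q_1 = alpha

for k in range(1, 50):
    y = x + beta * (x - x_prev)                    # y_k = x_k + beta_{k-1}(x_k - x_{k-1})
    x_prev, x = x, y - alpha * grad(y)             # x_{k+1} = y_k - alpha*grad f(y_k)
    p_prev, p = p, (1 + beta) * (M @ p) - beta * (M @ p_prev)
    q_prev, q = q, (1 + beta) * (M @ q) - beta * (M @ q_prev) + alpha * g

print("max deviation:", np.max(np.abs((x - x_tilde) - (p + q))))
```

The printed deviation should be at the level of floating-point round-off.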

1.1.1 Strongly Convex Setting

The polynomial recurrence relationship for Nesterov’s accelerated method in the strongly convex setting is given by

$$\begin{aligned} \begin{aligned}&P_{k+1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = (1 + \beta )(1-\alpha \lambda ) P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta (1-\alpha \lambda ) P_{k-1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\&\quad \text {where} \quad P_0(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1, \quad P_1(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1-\alpha \lambda , \quad \alpha = \tfrac{1}{\lambda _{{{\varvec{H}}}}^+} \quad \text {and}\\&\beta = \tfrac{\sqrt{\lambda _{{{\varvec{H}}}}^+}- \sqrt{\lambda _{{{\varvec{H}}}}^-}}{\sqrt{\lambda _{{{\varvec{H}}}}^+}+ \sqrt{\lambda _{{{\varvec{H}}}}^-}}. \end{aligned} \end{aligned}$$
(98)

We obtain an explicit representation for these polynomials by constructing the generating function for the polynomials \(P_k\), namely

$$\begin{aligned} {\mathfrak {G}}(\lambda , t)&{\mathop {=}\limits ^{\text {def}}}\sum _{k=0}^\infty t^k P_{k}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) \\ \text {(recurrence in }(98)) \, \,&= 1 + \frac{1}{t(1+\beta )(1-\alpha \lambda )} \sum _{k=2}^\infty t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\&\quad + \frac{t \beta }{1+ \beta } \sum _{k=0}^\infty t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\ \text {(initial conditions)} \, \,&= 1 + \frac{\left( {\mathfrak {G}}(\lambda , t) - (1 + t(1-\alpha \lambda )) \right) }{t(1+\beta )(1-\alpha \lambda )} + \frac{t \beta }{1 + \beta } {\mathfrak {G}}(\lambda , t). \end{aligned}$$

We solve this expression for \({\mathfrak {G}}\), which gives

$$\begin{aligned} {\mathfrak {G}}(\lambda , t) = \frac{1-t\beta (1-\alpha \lambda )}{1 + \beta (1-\alpha \lambda ) t^2 - t(1+\beta ) (1-\alpha \lambda )}. \end{aligned}$$
(99)
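
The first few Taylor coefficients of (99) can be checked symbolically against the recurrence (98); the SymPy sketch below (not from the text, truncation order arbitrary) works in the variable \(u = \alpha \lambda \).

```python
import sympy as sp

# Sketch: expand the generating function (99) in t and compare the coefficients
# with the polynomials produced by the recurrence (98), in the variable u = alpha*lambda.
t, u, b = sp.symbols('t u beta')
G = (1 - t * b * (1 - u)) / (1 + b * (1 - u) * t**2 - t * (1 + b) * (1 - u))

P = [sp.Integer(1), 1 - u]                      # P_0, P_1 from (98)
for k in range(1, 6):
    P.append(sp.expand((1 + b) * (1 - u) * P[k] - b * (1 - u) * P[k - 1]))

series = sp.series(G, t, 0, 7).removeO()
for k in range(7):
    print(k, sp.simplify(series.coeff(t, k) - P[k]) == 0)
```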

Ultimately, we want to relate the generating function for the polynomials \(P_k\) to the generating function of known polynomials. Notably, in this case, the generating function of the Chebyshev polynomials of the 1st and 2nd kind, denoted \((T_k(x))\) and \((U_k(x))\) respectively, resembles the generating function for the residual polynomials of Nesterov’s accelerated method. The generating function for Chebyshev polynomials is given as

$$\begin{aligned} \sum _{k=0}^\infty (T_k(x) + \delta U_k(x)) t^k = \frac{1-tx + \delta }{1-2tx + t^2}. \end{aligned}$$
(100)

To give the explicit relationship between (99) and (100), we make the substitution \(t \mapsto \frac{t}{(\beta (1-\alpha \lambda ))^{1/2}}\). A simple calculation yields the following

$$\begin{aligned} \begin{aligned} \sum _{k=0}^\infty \frac{t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })}{(\beta (1-\alpha \lambda ))^{k/2}}&= \frac{1- \frac{\beta \sqrt{1-\alpha \lambda }}{\sqrt{\beta }}t}{ 1 - \frac{(1+\beta )\sqrt{1-\alpha \lambda }}{2 \sqrt{\beta }} \cdot 2 t + t^2 }\\&= \frac{1- \tfrac{2\beta }{1+\beta } t x}{1-2tx + t^2} = \frac{\frac{2\beta }{1+\beta } \left( 1-tx \right) + \left( 1 - \frac{2\beta }{1+\beta } \right) }{1-2tx + t^2}\\ \text {where} \quad x&= \frac{(1+\beta ) \sqrt{1 - \alpha \lambda }}{2 \sqrt{\beta }}. \end{aligned} \end{aligned}$$
(101)

We can compare (100) with (101) to derive an expression for the polynomials \(P_k\)

$$\begin{aligned} \begin{aligned} P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) =&\big (\beta (1-\alpha \lambda ) \big )^{k/2} \Big [ \frac{2\beta }{1+\beta } T_k \left( \frac{(1+\beta ) \sqrt{1 - \alpha \lambda }}{2 \sqrt{\beta }} \right) \\&+ \left( 1 - \frac{2\beta }{1 + \beta } \right) U_k \left( \frac{(1+\beta ) \sqrt{1 - \alpha \lambda }}{2 \sqrt{\beta }} \right) \Big ], \end{aligned} \end{aligned}$$
(102)

where \(T_k\) is the Chebyshev polynomial of the first kind and \(U_k\) is the Chebyshev polynomial of the second kind.
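
A short numerical check of (102) (a sketch, not part of the original text) evaluates the closed form with SciPy's Chebyshev evaluators and compares it against the recurrence (98) on a grid of \(\lambda \in [\lambda _{{{\varvec{H}}}}^-, \lambda _{{{\varvec{H}}}}^+]\); the spectral edges and the degree are illustrative.

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

# Sketch: recurrence (98) versus the closed form (102); edges are illustrative.
lam_minus, lam_plus = 0.1, 4.0
alpha = 1.0 / lam_plus
beta = (np.sqrt(lam_plus) - np.sqrt(lam_minus)) / (np.sqrt(lam_plus) + np.sqrt(lam_minus))
lam = np.linspace(lam_minus, lam_plus, 101)
K = 30

P_prev, P = np.ones_like(lam), 1 - alpha * lam      # P_0, P_1
for _ in range(1, K):
    P_prev, P = P, (1 + beta) * (1 - alpha * lam) * P - beta * (1 - alpha * lam) * P_prev

x = 1 - alpha * lam
y = (1 + beta) * np.sqrt(x) / (2 * np.sqrt(beta))
closed = (beta * x) ** (K / 2) * (
    2 * beta / (1 + beta) * eval_chebyt(K, y)
    + (1 - 2 * beta / (1 + beta)) * eval_chebyu(K, y)
)
print("max |recurrence - closed form|:", np.max(np.abs(P - closed)))
```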

1.1.2 Convex Setting: Legendre Polynomials and Bessel Asymptotics

When the objective function is convex (i.e., \(\lambda _{{{\varvec{H}}}}^- = 0\)), the recurrence for the residual polynomial associated with Nesterov’s accelerated method reduces to

$$\begin{aligned}&P_{k+1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = (1+\beta _{k-1}) (1-\alpha \lambda ) P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) - \beta _{k-1}(1-\alpha \lambda ) P_{k-1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) \nonumber \\&\quad \text {with} \quad P_0(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1, \quad P_1(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1-\alpha \lambda , \quad \alpha = \frac{1}{\lambda _{{{\varvec{H}}}}^+}, \quad \text {and} \quad \beta _k = \frac{k}{k+3}.\nonumber \\ \end{aligned}$$
(103)

We now seek to solve this recurrence.

Nesterov’s polynomials as Legendre polynomials. First we observe that these polynomials are also polynomials in \(u =\alpha \lambda \), so we can define new polynomials \({\widetilde{P}}_k(u)\) such that \({\widetilde{P}}_k(\alpha \lambda ) = P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\). Let us define new polynomials \({\widetilde{R}}_k(u) = {\widetilde{P}}_k(u)(1-u)^{-k/2}\). Then the recurrence in (103) can be reformulated as

$$\begin{aligned} {\widetilde{R}}_{k+1}(u)= & {} (1+\beta _{k-1})(1-u)^{1/2} {\widetilde{R}}_k(u) - \beta _{k-1} {\widetilde{R}}_{k-1}(u),\nonumber \\ {\widetilde{R}}_0(u)= & {} 1, \, \, {\widetilde{R}}_1(u) = (1-u)^{1/2}. \end{aligned}$$
(104)

A simple computation shows that the polynomials \(\{{\widetilde{R}}_k\}\) are polynomials in \(v = (1-u)^{1/2}\). Because of this observation, we define new polynomials \(R_k(v)\) where \(R_k( (1-u)^{1/2}) = {\widetilde{R}}_k(u)\). Now we will find a formula for the polynomials \(R_k\) by constructing their generating function,

$$\begin{aligned} {\mathfrak {G}}(v,t) {\mathop {=}\limits ^{\text {def}}}\sum _{k=0}^\infty R_k(v) t^k \end{aligned}$$

The recurrence in (104) together with the definition of \(R_k\) yields the following differential identity

$$\begin{aligned} 2 \partial _t (v t^{1/2} {\mathfrak {G}}(v,t) )&= \sum _{k=0}^\infty (2k + 1) R_k(v) v t^{k-1/2}\\&= v t^{-1/2} + \sum _{k=1}^\infty (2k +1) R_k(v) v t^{k-1/2}\\&= vt ^{-1/2} + \sum _{k=1}^\infty (k+2) \cdot \frac{2k+1}{k+2} \cdot v \cdot R_k(v) t^{k-1/2} \\ \text {(recurrence in }(104)) \quad&= v t^{-1/2} + \sum _{k=1}^\infty (k-1) R_{k-1}(v) t^{k-1/2} \\&\qquad + \sum _{k=1}^\infty (k+2) R_{k+1}(v) t^{k-1/2} \\&= vt^{-1/2} + t^{3/2} \sum _{k=0}^\infty k R_k(v) t^{k-1}\\&\qquad + \sum _{k=2}^\infty (k+1) t^{k-3/2} R_k(v)\\ \big (\partial _t (t {\mathfrak {G}}) = \sum _{k=0}^\infty (k+1) t^k R_k(v) \big ) \quad&= v t^{-1/2} + t^{3/2} \partial _t( {\mathfrak {G}} ) + t^{-3/2} \partial _t (t ({\mathfrak {G}}-(1+vt))). \end{aligned}$$

One can see this is a first-order linear ODE with initial conditions given by

$$\begin{aligned} \partial _t({\mathfrak {G}}) + \frac{1-tv}{t^3 - 2vt^2 +t} {\mathfrak {G}} = \frac{1+tv}{t^3-2vt^2 +t}, \quad \text {with} \quad {\mathfrak {G}}(v,0) = 1, \quad \partial _t {\mathfrak {G}}(v,0) = v. \end{aligned}$$

Using an integrating factor of \(\mu (t) = \tfrac{t}{\sqrt{t^2-2tv +1}}\) , the solution to this initial value problem is

$$\begin{aligned} {\mathfrak {G}}(v,t) = \frac{2v \sqrt{t^2 -2vt +1} + tv^2 + t -2v}{t(1-v^2)}. \end{aligned}$$

At first glance, this does not seem related to any known generating function for a polynomial; however, if we differentiate this function, we get that

$$\begin{aligned} \sum _{k=1}^\infty k R_k(v) t^k= & {} t\partial _t ({\mathfrak {G}}) = \frac{2v( vt - 1 + \sqrt{t^2-2tv+1})}{t(1-v^2) \sqrt{t^2-2tv + 1} } = \frac{2v(vt-1)}{t(1-v^2) \sqrt{t^2 -2tv + 1}} \\&+ \frac{2v}{t(1-v^2)}, \end{aligned}$$

and it is known that the generating function for the Legendre Polynomials \(\{L_k\}\) is exactly

$$\begin{aligned} \sum _{k=0}^\infty L_k(v) t^k = \frac{1}{\sqrt{t^2 - 2vt + 1}}. \end{aligned}$$

Hence, it follows that

$$\begin{aligned} \frac{2v(vt-1)}{t(1-v^2) \sqrt{t^2-2tv +1} }&= \left( \frac{2v^2}{1-v^2} - \frac{2v}{t(1-v^2)} \right) \frac{1}{\sqrt{t^2 - 2tv +1}}\\&= \frac{2v^2}{1-v^2} \sum _{k=0}^\infty L_k(v) t^k - \frac{2v}{1-v^2} \sum _{k=0}^\infty L_k(v) t^{k-1}, \end{aligned}$$

and so for \(k \ge 1\), by comparing coefficients we deduce that

$$\begin{aligned} k R_k(v) = \frac{2v^2}{1-v^2} L_k(v)- \frac{2v}{1-v^2} L_{k+1}(v). \end{aligned}$$

Undoing all of our substitutions, we obtain the following representation of the Nesterov polynomials for \(k \ge 1\)

$$\begin{aligned} P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = \frac{2(1-\alpha \lambda )^{(k+1)/2}}{k \alpha \lambda } \left( \sqrt{1-\alpha \lambda } \cdot L_k(\sqrt{1-\alpha \lambda }) - L_{k+1}(\sqrt{1-\alpha \lambda }) \right) ,\nonumber \\ \end{aligned}$$
(105)

where \(\{L_k\}\) are the Legendre polynomials.
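
The representation (105) can be checked numerically against the convex-case recurrence (103) using SciPy's Legendre evaluator; the sketch below (not from the text) uses an illustrative value of \(\lambda ^+\) and avoids \(\lambda = 0\), where (105) has a removable singularity.

```python
import numpy as np
from scipy.special import eval_legendre

# Sketch: convex-case recurrence (103) versus the Legendre closed form (105).
lam_plus = 4.0                                      # illustrative
alpha = 1.0 / lam_plus
lam = np.linspace(1e-3, lam_plus, 200)
K = 25

P_prev, P = np.ones_like(lam), 1 - alpha * lam      # P_0, P_1
for k in range(1, K):
    b = (k - 1) / (k + 2)                           # beta_{k-1} = (k-1)/((k-1)+3)
    P_prev, P = P, (1 + b) * (1 - alpha * lam) * P - b * (1 - alpha * lam) * P_prev

s = np.sqrt(1 - alpha * lam)
closed = (2 * (1 - alpha * lam) ** ((K + 1) / 2) / (K * alpha * lam)) * (
    s * eval_legendre(K, s) - eval_legendre(K + 1, s)
)
print("max |recurrence - closed form|:", np.max(np.abs(P - closed)))
```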

Bessel asymptotics for Nesterov’s residual polynomials. In this section, we derive an asymptotic for the residual polynomials of Nesterov’s accelerated method in the convex setting. We will show that the polynomials \(P_k\) in (105) satisfy in a sufficiently strong sense

$$\begin{aligned} P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) \sim \frac{2J_1(k\sqrt{\alpha \lambda })}{k\sqrt{\alpha \lambda }} e^{-\alpha \lambda k / 2}, \end{aligned}$$
(106)

where \(J_1\) is the Bessel function of the first kind. Another possible way to derive this asymptotic is to extract the second-order asymptotics from [34]. We will show that the Bessel asymptotic (106) follows directly from the Legendre representation (105). To see this, recall the integral representation of the Legendre polynomials,

$$\begin{aligned} L_k(\sqrt{1-u}) = \frac{1}{\pi } \int _0^\pi \left( \sqrt{1-u} + i \sqrt{u} \cos (\phi ) \right) ^k d \phi , \end{aligned}$$

and so we have

$$\begin{aligned}&\frac{1}{\sqrt{u}} \big ( \sqrt{1-u} \, L_k(\sqrt{1-u}) - L_{k+1}(\sqrt{1-u}) \big ) \\&\quad = \frac{-i}{\pi } \int _0^{\pi } \big (\sqrt{1-u} + i \sqrt{u} \cos (\phi ) \big )^k \cos (\phi ) \, d\phi . \\&\quad = \frac{1}{\pi } \int _0^{\pi } \text {Im} \{ \big ( \sqrt{1-u} + i \sqrt{u} \cos (\phi ) \big )^k \} \cos (\phi ) \, d\phi \\&\qquad \text {(symmetry about }\tfrac{\pi }{2}) \quad = \frac{2}{\pi } \int _0^{\pi /2} \text {Im} \{ \big ( \sqrt{1-u} + i \sqrt{u} \cos (\phi ) \big )^k \} \cos (\phi ) \, d\phi . \end{aligned}$$

Now define the polynomial \({\widetilde{P}}_k(u) = P_k( \lambda ^+ u; \lambda _{{{\varvec{H}}}}^{\pm })\), where the polynomials \(P_k\) satisfy Nesterov’s recurrence (97). Using the derivation of Nesterov’s polynomial from the previous section (105), we obtain the following expression

$$\begin{aligned} \begin{aligned} {\widetilde{P}}_k(u)&= \frac{2(1-u)^{(k+1)/2}}{ku} \left( \sqrt{1-u} \cdot L_k(\sqrt{1-u}) - L_{k+1}(\sqrt{1-u}) \right) \\&= \frac{4 (1-u)^{(k+1)/2}}{k \pi \sqrt{u}} \int _0^{\pi /2} \text {Im} \{ \big ( \sqrt{1-u} + i \sqrt{u} \cos (\phi ) \big )^k \} \cos (\phi ) \, d\phi . \end{aligned} \end{aligned}$$
(107)

We can get an explicit expression for the imaginary part of the k-th power in (107) by expressing \(\sqrt{1-u} + i \sqrt{u} \cos (\phi )\) in terms of its polar form. In particular, we have that

$$\begin{aligned}&\theta (u, \phi ) {\mathop {=}\limits ^{\text {def}}}\tan ^{-1} \left( \sqrt{\tfrac{u}{1-u}} \cos (\phi ) \right) \quad \text {and} \\&R(u,\phi ) {\mathop {=}\limits ^{\text {def}}}\sqrt{1-u + u \cos ^2(\phi )} = \sqrt{1- u \sin ^2(\phi )}. \end{aligned}$$

Hence, we have the following

$$\begin{aligned} \begin{aligned} {\widetilde{P}}_k(u)&= \frac{4 (1-u)^{(k+1)/2}}{k \pi \sqrt{u}} \int _0^{\pi /2} \text {Im} \{ \big ( \sqrt{1-u} + i \sqrt{u} \cos (\phi ) \big )^k \} \cos (\phi ) \, d\phi \\&= \frac{4 (1-u)^{(k+1)/2}}{k \pi \sqrt{u}} \int _0^{\pi /2} R(u, \phi )^k \sin (k \theta (u, \phi )) \cos (\phi ) \, d\phi . \end{aligned} \end{aligned}$$
(108)

Define the following integral

$$\begin{aligned} I_k(u) {\mathop {=}\limits ^{\text {def}}}\frac{2}{\pi } \int _0^{\pi /2} R(u, \phi )^k \sin (k \theta (u, \phi )) \cos (\phi ) \, d\phi , \end{aligned}$$
(109)

and note the similarity of this integral with the Bessel function, namely

$$\begin{aligned} J_1(k \sqrt{u}) = \frac{2}{\pi } \int _0^{\pi /2} \sin (k \sqrt{u} \cos (\phi )) \cos (\phi ) \, d\phi . \end{aligned}$$

Using this definition, the polynomial can be written as \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k\sqrt{u}}I_k(u)\). Since \(I_k\) is always bounded, for \(u \ge \log ^2(k)/k\) the magnitude \(|{\widetilde{P}}_k(u)|\) is smaller than any power of k. This follows from the bound \((1-x)^k \le \text {exp}(-kx)\) and the fact that \(\text {exp}(-\log ^2(k))\) decays faster than any polynomial in k. So the interesting asymptotic regime is \(u \le \log ^2(k)/k\), and for this we show the following.

Lemma 9

There is an absolute constant C so that for all \(k \ge 1\) and \(0 \le u \le \log ^2(k)/k\)

$$\begin{aligned} |I_k(u) - J_1(k \sqrt{u}) | \le {\left\{ \begin{array}{ll} C k^{1/3} \sqrt{u}, &{} \text {if } u \le k^{-4/3}\\ C k^{-1/3}, &{} \text {if } u > k^{-4/3}. \end{array}\right. } \end{aligned}$$

Corollary 1

(Nesterov’s polynomial asymptotic) There is an absolute constant C so that for all \(k \ge 1\) and all \(0 \le u \le \tfrac{\log ^2(k)}{k}\), the following holds

$$\begin{aligned} \big | {\widetilde{P}}_k(u) - \frac{2 e^{-uk/2}}{k \sqrt{u}} J_1(k \sqrt{u}) \big | \le {\left\{ \begin{array}{ll} C e^{-u k/2} k^{-2/3}, &{} \text {if } u \le k^{-4/3}\\ C e^{-uk/2} u^{-1/2} k^{-4/3}, &{} \text {if } u > k^{-4/3}. \end{array}\right. } \end{aligned}$$
(110)

In particular, the following result holds for all \(0 \le u \le \frac{\log ^2(k)}{k}\)

$$\begin{aligned} \Big | {\widetilde{P}}_k^2(u) - \frac{4 e^{-uk} J_1^2(k \sqrt{u})}{k^2 u} \Big | \le {\left\{ \begin{array}{ll} C(k^{-4/3} + k^{-13/6} u^{-3/4}), &{} \text {if } u \le k^{-4/3}\\ Ce^{-uk} ( u^{-1} k^{-8/3} + u^{-5/4} k^{-17/6}), &{} \text {if } u \ge k^{-4/3}. \end{array}\right. }\nonumber \\ \end{aligned}$$
(111)

Proof of Corollary 1

First, we have that \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k \sqrt{u}} I_k(u)\). A simple triangle inequality shows that

$$\begin{aligned} \big |{\widetilde{P}}_k(u) - \frac{2e^{-uk/2}}{k \sqrt{u}} J_1(k \sqrt{u}) \big |&\le \big | {\widetilde{P}}_k(u) - \frac{2e^{-uk/2}}{k \sqrt{u}} I_k(u) \big |\\&\quad + \big | \frac{2e^{-uk/2}}{k \sqrt{u}} I_k(u) - \frac{2e^{-uk/2}}{k \sqrt{u}} J_1(k \sqrt{u}) \big |. \end{aligned}$$

The first difference is small because \(|(1-u)^{(k+1)/2} - e^{-uk/2}| \le C e^{-uk/2} (u + ku^2)\) for some absolute constant C and \(I_k\) is bounded. The second difference follows directly from Lemma 9. The second inequality (111) follows from \(|a^2-b^2| \le |a-b| (|a-b| + 2|b|)\) and \(J_1(x) \le \frac{C}{\sqrt{x}}\). \(\square \)
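
Corollary 1 can also be probed numerically (a sketch, not from the text): compute \({\widetilde{P}}_k(u)\) from the convex-case recurrence (103) in the variable \(u\) and compare it with the Bessel approximation \(\tfrac{2}{k\sqrt{u}}e^{-uk/2}J_1(k\sqrt{u})\) on \(0 < u \le \log ^2(k)/k\); the value of k below is illustrative.

```python
import numpy as np
from scipy.special import j1

# Sketch: P~_k(u) from the recurrence (103) versus the Bessel asymptotic (106).
k_final = 400                                        # illustrative
u = np.linspace(1e-6, np.log(k_final) ** 2 / k_final, 500)

P_prev, P = np.ones_like(u), 1 - u                   # P~_0 = 1, P~_1 = 1 - u
for k in range(1, k_final):
    b = (k - 1) / (k + 2)                            # beta_{k-1}
    P_prev, P = P, (1 + b) * (1 - u) * P - b * (1 - u) * P_prev

bessel = 2 * j1(k_final * np.sqrt(u)) / (k_final * np.sqrt(u)) * np.exp(-u * k_final / 2)
print("max |P~_k(u) - Bessel approximation|:", np.max(np.abs(P - bessel)))
```

The printed error should be on the order of \(k^{-2/3}\), consistent with (110).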

Proof of Lemma 9

First, we observe that \(z^k\) is k-Lipschitz on the interval [0, 1]. Since \(e^{-u}\) and \(1-u\) lie in the interval [0, 1] for any \(u \in [0,1]\), the Lipschitz property of \(z^k\) and a second-order Taylor approximation of the exponential imply that there exists a \(\xi \in [0,1]\) such that

$$\begin{aligned} \begin{aligned}&|e^{-ku\sin ^2(\phi )/2} - R(u, \phi )^k| \le \tfrac{k}{2} |e^{-u\sin ^2(\phi )}-(1-u\sin ^2(\phi ))|\\&\quad = \tfrac{k}{2} |1-u\sin ^2(\phi ) + \tfrac{(u\sin ^2(\phi ))^2}{2} - \tfrac{e^{-\xi }(u\sin ^2(\phi ))^3}{3!} - (1-u\sin ^2(\phi )) |\\&\quad = \tfrac{k}{2}| \tfrac{(u\sin ^2(\phi ))^2}{2} - \tfrac{e^{-\xi }}{6} (u\sin ^2(\phi ))^3| \le \frac{k u^2}{4} \le \frac{C \log ^4(k)}{k}. \end{aligned} \end{aligned}$$
(112)

Here we used that \(u \le \tfrac{\log ^2(k)}{k}\). Similarly, we have that

$$\begin{aligned} |R(u, \phi )^k-1| = |(1-u\sin ^2(\phi ))^{k/2}-1| \le \tfrac{k}{2}|1-u\sin ^2(\phi )-1| = \tfrac{k}{2} u \sin ^2(\phi ). \nonumber \\ \end{aligned}$$
(113)

We also know that \(\sin (k x)\) is k-Lipschitz. Therefore, again by Taylor approximation on \(\tan ^{-1}(x)\), we deduce the following bound for some \(\xi _{u, \phi } \in [0, \sqrt{\tfrac{u}{1-u}}]\)

$$\begin{aligned} \big | \sin (k \theta (u, \phi )) - \sin \left( k \sqrt{\tfrac{u}{1-u}} \cos (\phi ) \right) \big |\le & {} k \big | \tan ^{-1}\left( \sqrt{\tfrac{u}{1-u}} \cos (\phi ) \right) - \sqrt{\tfrac{u}{1-u}} \cos (\phi ) \big | \nonumber \\\le & {} k \left| \frac{6\xi _{u, \phi }^2-2}{(\xi _{u, \phi }^2 + 1)^3} \right| \left( \sqrt{ \tfrac{u}{1-u} } \right) ^3. \end{aligned}$$
(114)

Moreover, set \(v = \sqrt{u}\) and consider the function \(\sqrt{\frac{u}{1-u}} = \tfrac{v}{\sqrt{1-v^2}}\). Using a Taylor approximation at \(v = 0\) and the k-Lipschitz continuity of \(\sin (kx)\), we obtain that

$$\begin{aligned} \begin{aligned} \big |\sin \left( k \sqrt{\tfrac{u}{1-u}} \cos (\phi ) \right) - \sin (k \sqrt{u} \cos (\phi )) \big | \le k \big | \sqrt{\tfrac{u}{1-u}} - \sqrt{u} \big | \le k u^{3/2}. \end{aligned} \end{aligned}$$
(115)

Since \(0 \le u \le \tfrac{\log ^2(k)}{k}\), we have that \(\sqrt{\frac{u}{1-u}}\) is bounded by some constant C independent of k, and hence the constant \(\xi _{u, \phi }\) is bounded. This implies that \(\frac{6 \xi _{u, \phi }^2 -2}{(\xi _{u,\phi }^2+1)^3}\) is bounded by some absolute constant \({\widetilde{C}}\). By putting together (114) and (115), we deduce that

$$\begin{aligned} \begin{aligned}&\big | \sin (k \theta (u, \phi )) - \sin \left( k \sqrt{u} \cos (\phi ) \right) \big |\\&\quad \le k \left| \frac{6\xi _{u, \phi }^2-2}{(\xi _{u, \phi }^2 + 1)^3} \right| \left( \sqrt{ \tfrac{u}{1-u} } \right) ^3 + k u^{3/2} \le C k u^{3/2} \end{aligned} \end{aligned}$$
(116)

where C is some absolute constant. Here we used that \(u \le \tfrac{\log ^2(k)}{k}\) and \(\log ^2(k) \le C k^{1/3}\). We now consider three cases. Suppose \(u \le k^{-4/3}\). Using \(u \le k^{-4/3}\) and (113) we deduce that

$$\begin{aligned} \big | R(u, \phi )^k-1 \big | \le \tfrac{k}{2} u \le \tfrac{k^{1/3} \sqrt{u}}{2}. \end{aligned}$$
(117)

By putting together (116) and (117), we get the following bound for all \(u \le k^{-4/3}\).

$$\begin{aligned}&|I_k(u) - J_1(k \sqrt{u})|\\&\quad \le \frac{2}{\pi } \big | \int _0^{\pi /2} R(u, \phi )^k \sin (k \theta (u, \phi )) \cos (\phi ) \, d\phi - \int _0^{\pi /2} \sin (k \theta (u, \phi )) \cos (\phi ) \, d\phi \big |\\&\qquad + \frac{2}{\pi } \big | \int _0^{\pi /2} \sin (k \theta (u,\phi )) \cos (\phi ) \, d\phi - \int _0^{\pi /2} \sin (k \sqrt{u} \cos (\phi )) \cos (\phi ) \, d\phi \big |\\&\quad \le \frac{k^{1/3} \sqrt{u}}{2} + C k^{1/3} \sqrt{u}. \end{aligned}$$

The result immediately follows. On the other hand, for \(k^{-4/3} \le u \le \log ^2(k)/k\), we cut the range of \(\phi \). Set \(\phi _0 = k^{-2/3} u^{-1/2}\). We know from \(u \ge k^{-4/3}\) that

$$\begin{aligned} \phi _0 = k^{-2/3} u^{-1/2} \le k^{-2/3} k^{2/3} = 1. \end{aligned}$$
(118)

Now for \(\phi \le \phi _0\), we have in this range that

$$\begin{aligned}&\big | \int _0^{\phi _0} R(u, \phi )^k \sin (k\theta (u, \phi )) \cos (\phi ) \, d\phi - \int _0^{\phi _0} \sin (k \sqrt{u} \cos (\phi )) \cos (\phi ) \, d\phi \big | \nonumber \\&\quad \le \int _0^{\phi _0} \big | R(u, \phi )^k-1| \, d\phi + \int _0^{\phi _0} | \sin (k \theta (u, \phi )) - \sin (\sqrt{u} k \cos (\phi )) | d\phi \nonumber \\&\quad \le \tfrac{k}{2} u \phi _0^2 + C k u^{3/2} \le \frac{1}{2k^{1/3}} + \frac{C\log ^3(k)}{k^{1/2}}. \end{aligned}$$
(119)

In the last inequality we used that \(\phi _0 = k^{-2/3} u^{-1/2}\) and \(u \le \log ^2(k)/k\). Since \(\frac{\log ^3(k)}{k^{1/2}} \le C k^{-1/3}\), this portion of the integral is bounded by \(C k^{-1/3}\).

For larger \(\phi \), we use integration by parts with \(F(\phi ) = \cos (k \sqrt{u} \cos (\phi ))\) and \(G(\phi ) = e^{-k \sin ^2(\phi )u/2} \cot (\phi )\) to express

$$\begin{aligned} I_1&{\mathop {=}\limits ^{\text {def}}}\int _{\phi _0}^{\pi /2} \sin (k \sqrt{\tfrac{u}{1-u}} \cos (\phi )) e^{-k \sin ^2 (\phi ) u /2} \cos (\phi ) \, d\phi \\&= \frac{1}{k \sqrt{\tfrac{u}{1-u}}} \int _{\phi _0}^{\pi /2} F'(\phi ) G(\phi ) \, d\phi \\&= \frac{\sqrt{1-u}}{k \sqrt{u}} F(\phi ) G(\phi ) \Big |_{\phi _0}^{\pi /2} - \frac{\sqrt{1-u}}{k \sqrt{u}} \int _{\phi _0}^{\pi /2} G'(\phi ) F(\phi ) \, d\phi \\ \text {(def. of }G'(\phi )) \,&= \frac{\sqrt{1-u}}{k \sqrt{u}} \Big [ F(\phi ) G(\phi ) \Big |_{\phi _0}^{\pi /2} \\&\quad - \int _{\phi _0}^{\pi /2} F(\phi ) e^{-k \sin ^2(\phi ) u/2} \left( \cot (\phi ) (-k \sin (2\phi )u) - \csc ^2(\phi ) \right) \, d\phi \Big ] \\&= \frac{\sqrt{1-u}}{k \sqrt{u}} F(\phi ) G(\phi ) \Big |_{\phi _0}^{\pi /2} \\&\quad + \sqrt{u(1-u)} \int _{\phi _0}^{\pi /2} F(\phi ) e^{-k \sin ^2(\phi ) u/2} \cot (\phi ) \sin (2\phi ) d\phi \\&\quad + \frac{\sqrt{1-u}}{k \sqrt{u}} \int _{\phi _0}^{\pi /2} F(\phi ) e^{-k \sin ^2(\phi ) u/2} \csc ^2(\phi ) \, d\phi . \end{aligned}$$

Since \(u \in [0,1]\), we get the following bound

$$\begin{aligned} \begin{aligned} |I_1|&\le \underbrace{\frac{C}{k \sqrt{u}} |F(\phi _0)G(\phi _0)|}_{\text {(a)}} + \underbrace{\Big | C \sqrt{u} \int _{\phi _0}^{\pi /2} F(\phi ) e^{-k \sin ^2\phi u/2} \cot (\phi ) \sin (2 \phi ) \, d\phi \Big |}_{\text {(b)}}\\&\quad + \underbrace{\Big | \frac{C}{k \sqrt{u}} \int _{\phi _0}^{\pi /2} F(\phi ) e^{-k \sin ^2 \phi u/2} \csc ^2(\phi ) \, d\phi \Big |}_{\text {(c)}} \end{aligned} \end{aligned}$$
(120)

for some \(C > 0\). We will bound each of the terms in (120) independently. For (a), Taylor’s approximation yields that \(|\cot (\phi _0) - \tfrac{1}{\phi _0} | \le C\) which implies that \(|\cot (\phi _0)| \le \tfrac{C}{\phi _0}\) for some positive constants. Therefore, we deduce that the quantity \(\text {(a)}\) is bounded by \(\frac{C}{k \sqrt{u} \phi _0}\). For (b) since \(|F(\phi )| \le 1\), \(|\cot (\phi )\sin (2\phi )| = |2\cos ^2(\phi )| \le 2\) and, of course, \(| e^{-k \sin ^2 \phi u/2} | \le 1\), we have that the quantity (b) is bounded by \(C \sqrt{u}\). As for the quantity (c), we use the following approximation \(|\csc ^2(\phi )- \tfrac{1}{\phi ^2}| \le C\) so that \(|\csc ^2(\phi )| \le \frac{C}{\phi ^2}\). Hence, the integral (c) is bounded by \(\tfrac{C}{k\sqrt{u} \phi _0}\). Therefore, we conclude that

$$\begin{aligned} |I_1| \le \frac{C}{k \sqrt{u} \phi _0} + C \sqrt{u} + \frac{C}{k \sqrt{u} \phi _0} \le Ck^{-1/3}. \end{aligned}$$
(121)

Here we used that \(k \sqrt{u} \phi _0 = k^{1/3}\) and \(\sqrt{u} \le \log (k)/\sqrt{k} \le C k^{-1/3}\).

We now repeat this process, replacing \(G(\phi ) = e^{-k \sin ^2(\phi ) u/2} \cot (\phi )\) with \(G(\phi ) = \cot (\phi )\). This time we have that

$$\begin{aligned} I_2 {\mathop {=}\limits ^{\text {def}}}\int _{\phi _0}^{\pi /2} \sin (k \sqrt{\tfrac{u}{1-u}} \cos (\phi ) ) \cos (\phi ) \, d\phi = \frac{1}{k \sqrt{\frac{u}{1-u} }} \int _{\phi _0}^{\pi /2} F'(\phi ) G(\phi ) \, d\phi . \end{aligned}$$

Using the same bounds as before, we deduce the following

$$\begin{aligned} |I_2| \le \frac{C}{k \sqrt{u} \phi _0} + \frac{C}{k \sqrt{u} \phi _0} \le C k^{-1/3}. \end{aligned}$$
(122)

For \(u \ge k^{-4/3}\), we have the following result

$$\begin{aligned} |I_k(u)-J_1(k \sqrt{u})|&\le \frac{2}{\pi } \big | \int _0^{\phi _0} R(u, \phi )^k \sin (k \theta (u, \phi )) \cos (\phi ) \, d \phi \\&\qquad - \int _0^{\phi _0} \sin (k \sqrt{u}\cos (\phi ) ) \cos (\phi ) \, d\phi \big |\\&\qquad + \frac{2}{\pi } \big | \int _{\phi _0}^{\pi /2} R(u, \phi )^k \sin (k \theta (u, \phi )) \cos (\phi ) \, d\phi \\&\qquad - \int _{\phi _0}^{\pi /2} \sin (k \sqrt{u} \cos (\phi )) \cos (\phi )) \, d\phi \big |\\ \text {(by }(119)) \quad&\le \frac{2}{\pi } \int _{\phi _0}^{\pi /2} \big | (R(u, \phi )^k - e^{-k \sin ^2\phi u /2} ) \sin (k \theta (u, \phi )) \cos (\phi ) \big | \, d\phi \\&\qquad + C k^{-1/3}+ \frac{2}{\pi } |I_1| + \frac{2}{\pi } |I_2|\\&\qquad + \frac{2}{\pi } \int _{\phi _0}^{\pi /2} \big | \big [ \sin (k \theta (u, \phi ) ) - \sin (k \sqrt{u} \cos (\phi )) \big ] \cos (\phi ) \big | \, d\phi \\ \text {(by }(121)\text { and }(122)) \quad&\le Ck^{-1/3} \\&\qquad + \frac{2}{\pi } \int _{\phi _0}^{\pi /2} \big | (R(u, \phi )^k - e^{-k \sin ^2\phi u /2} ) \sin (k \theta (u, \phi )) \cos (\phi ) \big | \, d\phi \\&\qquad + \frac{2}{\pi } \int _{\phi _0}^{\pi /2} \big | \big [ \sin (k \theta (u, \phi ) ) - \sin (k \sqrt{u} \cos (\phi )) \big ] \cos (\phi ) \big | \, d\phi \\ \text {(by }(112)\text { and }(116)) \qquad&\le C k^{-1/3}. \end{aligned}$$

This finishes the proof of the lemma. \(\square \)

1.2 Polyak Momentum (Heavy-Ball) Method

The residual polynomials generated by Polyak’s heavy-ball method satisfy the following three-term recursion

$$\begin{aligned} P_{k+1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })= & {} (1-m + \alpha \lambda ) P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) + m P_{k-1}(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }),\nonumber \\ P_0= & {} 1, \, \, \text {and} \, \, P_1(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = 1- \beta \lambda \nonumber \\ \text {where} \, \, m= & {} - \left( \tfrac{\sqrt{\lambda _{{{\varvec{H}}}}^+} - \sqrt{\lambda _{{{\varvec{H}}}}^-}}{\sqrt{\lambda _{{{\varvec{H}}}}^+} + \sqrt{\lambda _{{{\varvec{H}}}}^-}} \right) ^2, \, \, \alpha = \tfrac{-4}{(\sqrt{\lambda _{{{\varvec{H}}}}^+} + \sqrt{\lambda _{{{\varvec{H}}}}^-})^2}, \, \, \text {and} \, \, \beta = \frac{2}{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^-}.\nonumber \\ \end{aligned}$$
(123)

As in the previous examples, we will construct the generating function for the polynomials \(P_k\) using the recurrence in (123)

$$\begin{aligned} {\mathfrak {G}}(\lambda ,t) {\mathop {=}\limits ^{\text {def}}}\sum _{k=0}^\infty t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })&= 1 + \frac{1}{t(1-m + \alpha \lambda )} \sum _{k=2}^\infty t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\&\qquad - \frac{m t}{1-m + \alpha \lambda } \sum _{k=0}^\infty t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\\&= 1 + \frac{1}{t(1 - m + \alpha \lambda )} \big [ {\mathfrak {G}}(\lambda , t) - 1 - t(1-\beta \lambda ) \big ] \\&\qquad - \frac{mt}{1-m+\alpha \lambda } {\mathfrak {G}}(\lambda , t). \end{aligned}$$

We solve for the generating function

$$\begin{aligned} {\mathfrak {G}}(\lambda , t) = \frac{1 + t(m - (\alpha + \beta ) \lambda )}{1 - t(1 - m + \alpha \lambda ) - m t^2}. \end{aligned}$$

This generating function for Polyak’s momentum resembles the generating function for the Chebyshev polynomials of the first and second kind (100). First, we set \(t \mapsto \frac{t}{\sqrt{-m}}\) (note that \(m < 0\) by definition in (123)). Under this transformation, we have the following

$$\begin{aligned} \sum _{k=0}^\infty \frac{t^k P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })}{(-m)^{k/2}}&= \frac{1 - \frac{t}{\sqrt{-m}} (-m + (\alpha + \beta ) \lambda )}{1 - 2t \left( \tfrac{1-m + \alpha \lambda }{2 \sqrt{-m}} \right) + t^2 } = \frac{1-\sigma (\lambda )t \cdot \frac{-m + (\alpha + \beta ) \lambda }{ \sqrt{-m} \cdot \sigma (\lambda )}}{1-2 \sigma (\lambda ) t + t^2}\nonumber \\&= \frac{ \frac{-m + (\alpha + \beta ) \lambda }{ \sqrt{-m} \cdot \sigma (\lambda )} (1-\sigma (\lambda ) t) + 1 - \frac{-m + (\alpha + \beta ) \lambda }{ \sqrt{-m} \cdot \sigma (\lambda )} }{ 1- 2 \sigma (\lambda ) t + t^2}, \end{aligned}$$
(124)

where \(\sigma (\lambda ) = \frac{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^- - 2 \lambda }{\lambda _{{{\varvec{H}}}}^+ - \lambda _{{{\varvec{H}}}}^-}\). A simple computation shows that

$$\begin{aligned} \frac{-m + (\alpha + \beta ) \lambda }{\sqrt{-m} \sigma (\lambda )} = \tfrac{(\sqrt{\lambda _{{{\varvec{H}}}}^+} - \sqrt{\lambda _{{{\varvec{H}}}}^-})^2}{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^-} \quad \text {and} \quad 1- \frac{-m + (\alpha + \beta ) \lambda }{\sqrt{-m} \sigma (\lambda )} = \tfrac{2\sqrt{\lambda _{{{\varvec{H}}}}^- \lambda _{{{\varvec{H}}}}^+}}{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^-}. \end{aligned}$$

By matching terms in the generating function for Chebyshev polynomials (100) and Polyak’s generating function (124), we derive an expression for the polynomials \(P_k\) in Polyak’s momentum

$$\begin{aligned} P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm }) = \left( \tfrac{\sqrt{\lambda _{{{\varvec{H}}}}^+}-\sqrt{\lambda _{{{\varvec{H}}}}^-}}{\sqrt{\lambda _{{{\varvec{H}}}}^+} + \sqrt{\lambda _{{{\varvec{H}}}}^-}} \right) ^k \big [ \tfrac{ ( \sqrt{\lambda _{{{\varvec{H}}}}^+}-\sqrt{\lambda _{{{\varvec{H}}}}^-})^2}{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^-} \cdot T_k(\sigma (\lambda )) + \tfrac{2 \sqrt{\lambda _{{{\varvec{H}}}}^- \lambda _{{{\varvec{H}}}}^+}}{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^-} \cdot U_k(\sigma (\lambda )) \big ], \quad \text {where} \quad \sigma (\lambda ) = \tfrac{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^- -2 \lambda }{\lambda _{{{\varvec{H}}}}^+ - \lambda _{{{\varvec{H}}}}^-}, \end{aligned}$$
(125)

and \(T_k\) (resp. \(U_k\)) is the Chebyshev polynomial of the 1st (resp. 2nd) kind.
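
A quick numerical check of (125) against the recurrence (123), with illustrative spectral edges (a sketch, not from the text):

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

# Sketch: heavy-ball recurrence (123) versus the Chebyshev closed form (125).
lam_minus, lam_plus = 0.1, 4.0                       # illustrative
rho = (np.sqrt(lam_plus) - np.sqrt(lam_minus)) / (np.sqrt(lam_plus) + np.sqrt(lam_minus))
m = -rho ** 2
alpha = -4.0 / (np.sqrt(lam_plus) + np.sqrt(lam_minus)) ** 2
beta = 2.0 / (lam_plus + lam_minus)
lam = np.linspace(lam_minus, lam_plus, 101)
K = 30

P_prev, P = np.ones_like(lam), 1 - beta * lam        # P_0, P_1 from (123)
for _ in range(1, K):
    P_prev, P = P, (1 - m + alpha * lam) * P + m * P_prev

sigma = (lam_plus + lam_minus - 2 * lam) / (lam_plus - lam_minus)
c1 = (np.sqrt(lam_plus) - np.sqrt(lam_minus)) ** 2 / (lam_plus + lam_minus)
c2 = 2 * np.sqrt(lam_minus * lam_plus) / (lam_plus + lam_minus)
closed = rho ** K * (c1 * eval_chebyt(K, sigma) + c2 * eval_chebyu(K, sigma))
print("max |recurrence - closed form|:", np.max(np.abs(P - closed)))
```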

Average-Case Complexity

In this section, we compute the average-case complexity for various first-order methods. To do so, we integrate the residual polynomials found in Table 3 against the Marčenko-Pastur density.
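
As a point of reference for these computations (a sketch, not part of the original text), spectral averages of the form appearing below concentrate around their integrals against the Marčenko-Pastur density for large random matrices, regardless of whether the entries are Gaussian or Rademacher; the sizes, variance, and test statistic in the following sketch are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

# Sketch: compare (1/d) * sum_i f(lambda_i(H)) for H = A^T A / n with the integral
# of f against the Marchenko-Pastur density, for two entry distributions.
rng = np.random.default_rng(0)
n, d, sigma = 2000, 1000, 1.0
r = d / n
lam_minus = sigma ** 2 * (1 - np.sqrt(r)) ** 2
lam_plus = sigma ** 2 * (1 + np.sqrt(r)) ** 2
f = lambda lam: lam * (1 - lam / lam_plus) ** 20     # an illustrative spectral statistic

def empirical(entries):
    A = entries((n, d))
    eigs = np.linalg.eigvalsh(A.T @ A / n)
    return np.mean(f(eigs))

mp_density = lambda lam: np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2 * np.pi * sigma ** 2 * r * lam)
mp_value, _ = quad(lambda lam: f(lam) * mp_density(lam), lam_minus, lam_plus)

print("Gaussian entries   :", empirical(lambda s: rng.standard_normal(s)))
print("Rademacher entries :", empirical(lambda s: rng.choice([-1.0, 1.0], size=s)))
print("Marchenko-Pastur   :", mp_value)
```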

Lemma 10

(Average-case: Gradient descent) Let \(\mathrm {d}\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k, Q_k\) be the residual polynomials for gradient descent.

  1. For \(r = 1\) and \(\ell \in \{1,2\}\), the following holds

    $$\begin{aligned} \int \lambda ^\ell P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}&= \frac{(\lambda ^+)^{\ell +1}}{2 \pi \sigma ^2} \cdot \frac{\varGamma (2k + \tfrac{3}{2}) \varGamma (\ell + \tfrac{1}{2})}{\varGamma (2k+\ell +2)}\\&\sim \frac{(\lambda ^+)^{\ell + 1}}{2 \pi \sigma ^2} \cdot \frac{\varGamma (\ell + \tfrac{1}{2})}{(2k + 3/2)^{\ell + 1/2}}. \end{aligned}$$
  2. For \(r \ne 1\), the following holds

    $$\begin{aligned} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}=&\frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \frac{\varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{3}{2})}{\varGamma (2k + 3)}\\&\sim \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \cdot \frac{\varGamma (\tfrac{3}{2})}{(2k + \tfrac{3}{2})^{3/2}}. \end{aligned}$$
  3. For \(r \ne 1\), the following holds

    $$\begin{aligned}&\int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\\&\quad = \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \left( \frac{\lambda ^- \cdot \varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{3}{2})}{\varGamma (2k + 3)} + \frac{(\lambda ^+-\lambda ^-) \cdot \varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{5}{2})}{\varGamma (2k + 4)} \right) \\&\qquad \sim \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \left( \frac{\lambda ^- \cdot \varGamma (\tfrac{3}{2})}{(2k + \tfrac{3}{2})^{3/2}} + \frac{(\lambda ^+-\lambda ^-) \varGamma (\tfrac{5}{2})}{(2k + \tfrac{3}{2})^{5/2}} \right) . \end{aligned}$$

Proof

The proof relies on writing the integrals in terms of \(\beta \)-functions. Let \(\ell \in \{1,2\}\). Using the change of variables \(\lambda = \lambda ^- + (\lambda ^+-\lambda ^-) w\), we deduce the following expression

$$\begin{aligned} \begin{aligned}&\int \lambda ^\ell P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= \frac{1}{2\pi \sigma ^2 r} \int _{\lambda ^-}^{\lambda ^+} \lambda ^{\ell -1} \big (1-\tfrac{\lambda }{\lambda ^+} \big )^{2k} \sqrt{(\lambda -\lambda ^-)(\lambda ^+-\lambda )} \, d\lambda \\&\quad = \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \int _0^1 (1-w)^{2k} (\lambda ^- + (\lambda ^+ - \lambda ^-)w)^{\ell -1} \sqrt{w(1-w)} \, dw. \end{aligned} \end{aligned}$$
(126)

We consider cases depending on whether or not \(\lambda ^- = 0\) (i.e., whether \(r = 1\)). First, suppose \(\lambda ^- = 0\); then by equation (126) we have

$$\begin{aligned}&\frac{1}{2\pi \sigma ^2 r} \int _{\lambda ^-}^{\lambda ^+} \lambda ^{\ell -1} \big (1-\tfrac{\lambda }{\lambda ^+} \big )^{2k} \sqrt{(\lambda -\lambda ^-)(\lambda ^+-\lambda )} \, d\lambda \\&\quad = \frac{(\lambda ^+)^{\ell + 1}}{2 \pi \sigma ^2 r} \int _0^1 (1-w)^{2k+1/2} w^{\ell - 1/2} \, dw. \end{aligned}$$

The result follows after noting that the integral is a \(\beta \)-function with parameters \( 2k + 3/2\) and \(\ell + 1/2\), together with the asymptotics of \(\beta \)-functions, \(\beta (x,y) \sim \varGamma (y) x^{-y}\) for x large and y fixed.

Next consider when \(r \ne 1\) and \(\ell =1\). Using (126), we have that

$$\begin{aligned} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r}&\left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \int _0^1 (1-w)^{2k + 1/2} w^{1/2} \, dw. \end{aligned}$$

The integral is a \(\beta \)-function with parameters \(2k + 3/2\) and 3/2. Applying the asymptotics of \(\beta \)-functions finishes this case.

Lastly consider when \(r \ne 1\) and \(\ell = 2\). Similar to the previous case, using (126), the following holds

$$\begin{aligned} \int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}=&\frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \Big (\lambda ^- \int _0^1 (1-w)^{2k + 1/2} w^{1/2} \, dw\\&+ (\lambda ^+-\lambda ^-) \int _0^1 (1-w)^{2k + 1/2} w^{3/2} \, dw \Big ). \end{aligned}$$

The first integral is a \(\beta \)-function with parameters \(2k + 3/2\) and 3/2 and the second term is a \(\beta \)-function with parameters \(2k + 3/2\) and 5/2. Again using the asymptotics for \(\beta \)-functions yields the result. \(\square \)
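
The identities of Lemma 10 are straightforward to confirm numerically (a sketch, not from the text): compare the quadrature of \(\lambda ^\ell P_k^2\) against the Marčenko-Pastur density with the closed \(\varGamma \)-function expression, here in the case \(r = 1\) and with illustrative values of \(\sigma \) and k.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Sketch: numerical check of Lemma 10, part 1 (r = 1, lambda^- = 0, lambda^+ = 4*sigma^2).
sigma, k = 1.0, 10                                   # illustrative
lam_plus = 4 * sigma ** 2

def integrand(lam, ell):
    # lambda^ell * P_k(lambda)^2 * MP density, with P_k(lambda) = (1 - lambda/lambda^+)^k
    # for gradient descent and d(mu_MP) = sqrt(lambda(lambda^+ - lambda))/(2*pi*sigma^2*lambda) dlambda.
    density = np.sqrt(lam * (lam_plus - lam)) / (2 * np.pi * sigma ** 2 * lam)
    return lam ** ell * (1 - lam / lam_plus) ** (2 * k) * density

for ell in (1, 2):
    numerical, _ = quad(integrand, 0.0, lam_plus, args=(ell,))
    closed = (lam_plus ** (ell + 1) / (2 * np.pi * sigma ** 2)
              * gamma(2 * k + 1.5) * gamma(ell + 0.5) / gamma(2 * k + ell + 2))
    print(f"ell = {ell}: quadrature = {numerical:.6e}, closed form = {closed:.6e}")
```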

Lemma 11

(Average-case: Nesterov accelerated method (strongly convex)) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k\) be the residual polynomial for Nesterov accelerated method on a strongly convex objective function (102). Then the following holds

$$\begin{aligned} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= & {} \tfrac{(\lambda ^+-\lambda ^-)^2}{2^5 4^k \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \nonumber \\&\times \Big [ \tfrac{4\beta ^2}{(1+\beta )^2} \left( -k^2+\frac{k}{2}+1 - \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) \nonumber \\&+ \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) \left( 2k +1 - \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) \nonumber \\&+ 2\big (1-\tfrac{2\beta }{1+\beta } \big )^2 \left( \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) -1 \right) \Big ] \nonumber \\&\sim \tfrac{(\lambda ^+-\lambda ^-)^2}{4 \sigma ^2 r \sqrt{\pi }} \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \left( \beta \big (1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \frac{1}{k^{1/2}}, \end{aligned}$$
(127)

and the corresponding integral against \(\lambda ^2\) equals

$$\begin{aligned}&\int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\\&\quad = \lambda ^- \int \lambda P_k^2(\lambda ; \lambda ^{\pm })\, d\mu _{\mathrm {MP}}+ \tfrac{(\lambda ^+-\lambda ^-)^3}{2^7 4^k \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \\&\qquad \Big [\tfrac{4\beta ^2}{(1+\beta )^2} \left( \tfrac{1}{3}(2k^3-9k^2+k+6) -4 \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k -1\end{array}}\right) + 3 \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) \\&\qquad + \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) \left( -2k^2+3k+2 - 4 \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k-1\end{array}}\right) +3 \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) \\&\qquad + 4 \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \left( k- \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) \Big ]\\&\qquad \sim \tfrac{\lambda ^- (\lambda ^+-\lambda ^-)^2}{4 \sigma ^2 r \sqrt{\pi }} \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \frac{1}{k^{1/2}}, \end{aligned}$$

where \(\beta = \frac{\sqrt{\lambda ^+}-\sqrt{\lambda ^-}}{\sqrt{\lambda ^+} + \sqrt{\lambda ^-}}\) and \(\alpha = \frac{1}{\lambda ^+}\).

Proof

Throughout this proof, we define \(P_k(\lambda )\) to be \(P_k(\lambda ; \lambda ^{\pm })\) in order to simplify the notation. In order to integrate the Chebyshev polynomials, we reduce our integral to trigonometric functions via a series of changes of variables. Under the change of variables that sends \(\lambda = \lambda ^{-} + (\lambda ^+-\lambda ^-) w\), we have that for any \(\ell \ge 1\)

$$\begin{aligned} \begin{aligned}&\int \lambda ^{\ell } P_k^2(\lambda ) \, d\mu _{\mathrm {MP}}\\&\quad = \tfrac{(\lambda ^+-\lambda ^-)^2}{2\pi \sigma ^2 r} \int _0^1 P_k^2(\lambda ^- + (\lambda ^+-\lambda ^-) w) (\lambda ^- + (\lambda ^+-\lambda ^-) w)^{\ell -1} \sqrt{w (1-w)} \, dw. \end{aligned} \end{aligned}$$
(128)

We note that under this transformation \(1-\alpha \lambda = (1-\tfrac{\lambda ^-}{\lambda ^+}) (1-w)\) and \(\tfrac{1+\beta }{2 \sqrt{\beta }} \sqrt{1-\alpha \lambda } = (1-w)^{1/2}\). Moreover, by expanding out Nesterov’s polynomial (102), we deduce the following

$$\begin{aligned}&P_k^2(\lambda ) = (\beta x)^k \left( \tfrac{4\beta ^2}{(1+\beta )^2} T_k^2(y) + \tfrac{4\beta }{1+\beta } \left( 1 - \tfrac{2\beta }{1+\beta } \right) T_k (y) U_k(y)\right. \nonumber \\&\quad + \left. \left( 1 - \tfrac{2\beta }{1+\beta } \right) ^2 U_k^2(y) \right) , \nonumber \\&\quad \text {where} \qquad y = \tfrac{1+\beta }{2 \sqrt{\beta }} \sqrt{x} \quad \text {and} \quad x = 1-\alpha \lambda . \end{aligned}$$
(129)

First, we consider the setting where \(\ell = 1\) in (128) and hence we deduce that

$$\begin{aligned} \int \lambda P_k^2(\lambda ) \, d\mu _{\mathrm {MP}}=&\tfrac{(\lambda ^+-\lambda ^-)^2}{2\pi \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \nonumber \\&\times \int _0^1 (1-w)^k \sqrt{w(1-w)} \big [ \tfrac{4\beta ^2}{(1+\beta )^2} T_k^2((1-w)^{1/2}) \nonumber \\&+ \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) T_k((1-w)^{1/2}) U_k( (1-w)^{1/2}) \nonumber \\&+ \big (1-\tfrac{2\beta }{1+\beta } \big )^2 U_k^2((1-w)^{1/2}) \big ] \, dw \nonumber \\ (1-w = \cos ^2(\theta )) \,\, =&\tfrac{2(\lambda ^+-\lambda ^-)^2}{2\pi \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \int _0^{\pi /2} \cos ^{2k+2}(\theta ) \sin ^2(\theta ) \big [ \tfrac{4\beta ^2}{(1+\beta )^2} \cos ^2(k\theta ) \nonumber \\&+ \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) \cos (k\theta ) \cdot \tfrac{\sin ((k+1)\theta )}{\sin (\theta )} + \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \tfrac{\sin ^2((k+1)\theta )}{\sin ^2(\theta )} \big ] \, d\theta \nonumber \\ \text {(by symmetry)} \, \, =&\tfrac{2(\lambda ^+-\lambda ^-)^2}{8 \pi \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \int _0^{2\pi } \cos ^{2k+2}(\theta ) \sin ^2(\theta ) \big [ \tfrac{4\beta ^2}{(1+\beta )^2} \cos ^2(k\theta ) \nonumber \\&+ \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) \cos (k\theta ) \cdot \tfrac{\sin ((k+1)\theta )}{\sin (\theta )} + \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \tfrac{\sin ^2((k+1)\theta )}{\sin ^2(\theta )} \big ] \, d\theta . \end{aligned}$$
(130)

We will treat each term in the summand separately. Because \(\int _0^{2\pi } e^{i k\theta } \, d\theta =0 \) for any nonzero integer \(k\), we only need to keep track of the constant terms. From this observation, we get the following

$$\begin{aligned} \begin{aligned}&\cos ^{2k+2}(\theta ) \sin ^2(\theta )\cos ^2(k \theta ) \\&\quad = \frac{1}{2^4 4^k} \left( -k^2+\frac{k}{2} + 1 - \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) + \text {terms}\\&\cos ^{2k+2}(\theta ) \sin (\theta ) \cos (k \theta ) \sin ((k+1)\theta ) \\&\quad = \frac{1}{2^4 4^k} \left( 2k +1 - \left( {\begin{array}{c} 2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) + \text {terms}\\&\cos ^{2k+2}(\theta ) \sin ^2((k+1) \theta ) = \frac{1}{2^{4} 4^k} \left( -2+ 2\left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) + \text {terms} \end{aligned} \end{aligned}$$
(131)

We note that \(\tfrac{1}{4^k} \left( \left( {\begin{array}{c}2k+2\\ k+1\end{array}}\right) - \left( {\begin{array}{c}2k+2\\ k\end{array}}\right) \right) \sim \tfrac{4}{\sqrt{\pi } k^{3/2}}\) and \(\tfrac{1}{4^k} \left( {\begin{array}{c}2k+2\\ k+1\end{array}}\right) \sim \frac{4}{\sqrt{\pi } k^{1/2}}\).
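These binomial asymptotics can also be checked numerically (a quick sanity check, not part of the argument):

# Sanity check of the binomial asymptotics above (illustrative only)
from math import comb, pi, sqrt

for k in [10, 100, 1000]:
    diff = (comb(2 * k + 2, k + 1) - comb(2 * k + 2, k)) / 4 ** k
    central = comb(2 * k + 2, k + 1) / 4 ** k
    # both ratios approach 1 as k grows
    print(k, diff / (4 / (sqrt(pi) * k ** 1.5)), central / (4 / sqrt(pi * k)))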

Next, we consider the setting where \(\ell = 2\) and we observe from (128) that

$$\begin{aligned} \int (\lambda ^2-\lambda ^- \lambda ) P_k^2(\lambda ) d\mu _{\mathrm {MP}}= \tfrac{(\lambda ^+-\lambda ^-)^3}{2\pi \sigma ^2 r} \int _0^1 P_k^2(\lambda ^- + (\lambda ^+-\lambda ^-) w) w \sqrt{w (1-w)} \, dw. \end{aligned}$$

Since we know how to evaluate \(\int \lambda P_k^2(\lambda ) d\mu _{\mathrm {MP}}\), we only need to analyze the integral on the right-hand side. An analysis similar to that in (130) applies

$$\begin{aligned}&\int (\lambda ^2-\lambda ^- \lambda ) P_k^2(\lambda ) d\mu _{\mathrm {MP}}\\&\quad = \tfrac{2(\lambda ^+-\lambda ^-)^3}{8 \pi \sigma ^2 r} \left( \beta \big ( 1-\tfrac{\lambda ^-}{\lambda ^+} \big ) \right) ^k \int _0^{2\pi } \cos ^{2k+2}(\theta ) \sin ^4(\theta ) \big [ \tfrac{4\beta ^2}{(1+\beta )^2} \cos ^2(k\theta )\\&\qquad + \tfrac{4\beta }{1+\beta } \big (1-\tfrac{2\beta }{1+\beta } \big ) \cos (k\theta ) \cdot \tfrac{\sin ((k+1)\theta )}{\sin (\theta )} + \big (1-\tfrac{2\beta }{1+\beta } \big )^2 \tfrac{\sin ^2((k+1)\theta )}{\sin ^2(\theta )} \big ] \, d\theta . \end{aligned}$$

As before, we will treat each term in the summand separately and use that \(\int _0^{2\pi } e^{ik\theta } \, d\theta = 0\) to only keep track of the constant terms:

$$\begin{aligned} \begin{aligned}&\cos ^{2k+2}(\theta ) \sin ^4(\theta )\cos ^2(k \theta ) = \frac{1}{2^6 4^k} \Bigg ( \tfrac{1}{3}(2k^3-9k^2+k+6)\\&\quad -4 \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k -1\end{array}}\right) + 3 \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \Bigg ) + \text {terms}\\&\cos ^{2k+2}(\theta ) \sin ^3(\theta ) \cos (k \theta ) \sin ((k+1)\theta ) = \frac{1}{2^7 4^k} \Bigg (-4k^2+6k+4\\&\quad - 8 \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + 2 \left( {\begin{array}{c}2k +2\\ k-1\end{array}}\right) +6 \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \Bigg ) + \text {terms}\\&\cos ^{2k+2}(\theta ) \sin ^2(\theta ) \sin ^2((k+1) \theta ) = \frac{1}{2^{4} 4^k} \left( k- \left( {\begin{array}{c}2k +2\\ k\end{array}}\right) + \left( {\begin{array}{c}2k +2\\ k +1\end{array}}\right) \right) + \text {terms}.\\ \end{aligned} \end{aligned}$$
(132)

The result immediately follows. \(\square \)

Lemma 12

(Average-case: Polyak Momentum) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k\) be the residual polynomial for Polyak’s (heavy-ball) method (125). Then the following holds

$$\begin{aligned} \begin{aligned}&\int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= \tfrac{(\lambda ^+-\lambda ^-)^2}{32 r \sigma ^2} \left( \tfrac{\sqrt{\lambda ^+}-\sqrt{\lambda ^-}}{\sqrt{\lambda ^+}+\sqrt{\lambda ^-}} \right) ^{2k}\\&\quad \times \big [ \left( \tfrac{(\sqrt{\lambda ^+}-\sqrt{\lambda ^-})^2}{\lambda ^+ + \lambda ^-} \right) ^2 + 2 \tfrac{(\sqrt{\lambda ^+}-\sqrt{\lambda ^-})^2}{\lambda ^+ + \lambda ^-} \tfrac{2 \sqrt{\lambda ^-\lambda ^+}}{\lambda ^+ + \lambda ^-} + 2 \left( \tfrac{2 \sqrt{\lambda ^-\lambda ^+}}{\lambda ^+ + \lambda ^-} \right) ^2 \big ]\\&\quad \text {and} \quad \int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= \frac{\lambda ^++\lambda ^-}{2} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}. \end{aligned} \end{aligned}$$
(133)

Proof

In order to simplify notation, we define the following

$$\begin{aligned} \begin{aligned} \beta =&\tfrac{\sqrt{\lambda ^+}-\sqrt{\lambda ^-}}{\sqrt{\lambda ^+}+\sqrt{\lambda ^-}}, \quad c = \tfrac{(\sqrt{\lambda ^+}-\sqrt{\lambda ^-})^2}{\lambda ^+ + \lambda ^-}, \quad d = \tfrac{2 \sqrt{\lambda ^-\lambda ^+}}{\lambda ^+ + \lambda ^-}, \quad \text {and} \quad \sigma (\lambda ) = \tfrac{\lambda ^++\lambda ^- -2\lambda }{\lambda ^+- \lambda ^-}\\&\text {with} \quad {\widetilde{P}}^2_k(x) {\mathop {=}\limits ^{\text {def}}}\beta ^{2k} \big [ c^2 T_k^2(x) + 2cd \cdot T_k(x)U_k(x) + d^2 U_k^2(x) \big ] \\&\text {and} \quad {\widetilde{P}}_k(\sigma (\lambda )) = P_k(\lambda ; \lambda ^{\pm }). \end{aligned} \end{aligned}$$
(134)

Under the change of variables \(u = \sigma (\lambda )\), we deduce the following for any \(\ell \ge 1\)

$$\begin{aligned} \begin{aligned}&\int \lambda ^{\ell } P_k^2(\lambda ; \lambda ^{\pm }) \, \mathrm {d}\mu _{MP} = \frac{1}{2 \pi \sigma ^2 r} \int _{\lambda ^-}^{\lambda ^+} \lambda ^{\ell -1} {\widetilde{P}}_k^2( \sigma (\lambda )) \sqrt{(\lambda ^+-\lambda )(\lambda -\lambda ^-)} \, d\lambda \\&\quad = \frac{(\lambda ^+-\lambda ^-)^2}{8 \pi \sigma ^2 r} \int _{-1}^1 (\tfrac{\lambda ^+ + \lambda ^-}{2} - \tfrac{\lambda ^+ - \lambda ^-}{2}u)^{\ell -1} {\widetilde{P}}_k^2(u) \sqrt{1-u^2} \, du. \end{aligned} \end{aligned}$$
(135)

First, we consider when \(\ell = 1\). We convert this into a trigonometric integral using the substitution \(u = \cos (\theta )\) and its relationship with the Chebyshev polynomials. In particular, we deduce the following

$$\begin{aligned}&\int \lambda P_k^2(\lambda ; \lambda ^{\pm }) d\mu _{\mathrm {MP}}\\&\quad = \frac{(\lambda ^+-\lambda ^-)^2}{8 \pi \sigma ^2 r} \beta ^{2k} \int _0^{\pi } \big [c^2 \cos ^2(k\theta )+ 2 cd \tfrac{\cos (k\theta ) \sin ((k+1) \theta )}{\sin (\theta )} + d^2 \tfrac{\sin ^2((k+1) \theta )}{\sin ^2(\theta )} \big ] \sin ^2(\theta ) \, d\theta . \end{aligned}$$

Treating each term in the summand separately, we can evaluate each integral

$$\begin{aligned} \begin{aligned}&\int _0^{\pi } \cos ^2(k \theta ) \sin ^2(\theta ) \, d\theta = \frac{\pi }{4}, \quad \int _0^{\pi } \sin ^2((k+1) \theta ) \, d\theta = \frac{\pi }{2}, \\&\quad \text {and} \quad \int _0^{\pi } \cos (k\theta ) \sin ((k+1)\theta ) \sin (\theta ) d\theta = \frac{\pi }{4}. \end{aligned} \end{aligned}$$
(136)

The result follows. Next we consider when \(\ell = 2\). A quick calculation using (135) shows that

$$\begin{aligned} \int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}&= \frac{\lambda ^++ \lambda ^-}{2} \int \lambda P_k^2(\lambda ) \, d\mu _{\mathrm {MP}}\\&\quad - \frac{(\lambda ^+-\lambda ^-)^3}{16 \pi \sigma ^2 r} \int _{-1}^1 u {\widetilde{P}}_k^2(u) \sqrt{1-u^2} \, du. \end{aligned}$$

Since we evaluated the first integral, it suffices to analyze the second one. Again, we use the trigonometric substitution \(u = \cos (\theta )\) and deduce the following

$$\begin{aligned}&\int _{-1}^1 u {\widetilde{P}}_k^2(u) \sqrt{1-u^2} \, du\\&\quad = \beta ^{2k} \int _0^{\pi } \cos (\theta ) \sin ^2(\theta ) \big [c^2 \cos ^2(k\theta )+ 2 cd \tfrac{\cos (k\theta ) \sin ((k+1) \theta )}{\sin (\theta )} + d^2 \tfrac{\sin ^2((k+1) \theta )}{\sin ^2(\theta )} \big ] \, d\theta . \end{aligned}$$

Treating each term in the summand separately, we can evaluate each integral

$$\begin{aligned} \begin{aligned}&\int _0^{\pi } \cos ^2(k \theta ) \sin ^2(\theta ) \cos (\theta ) \, d\theta = \int _0^{\pi } \sin ^2((k+1) \theta ) \cos (\theta ) \, d\theta = 0, \\&\quad \text {and} \quad \int _0^{\pi } \cos (k\theta ) \sin ((k+1)\theta ) \sin (\theta ) \cos (\theta ) d\theta = 0. \end{aligned} \end{aligned}$$
(137)

Hence the second integral vanishes and \(\int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}= \frac{\lambda ^++\lambda ^-}{2} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\), which completes the proof. \(\square \)
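The elementary trigonometric integrals (136) and (137) used in this proof can be verified numerically; a small sketch (the values of \(k\) and the grid size are arbitrary):

import numpy as np

theta = np.linspace(0.0, np.pi, 400001)
dtheta = theta[1] - theta[0]
for k in [3, 7, 25]:
    s, c = np.sin(theta), np.cos(theta)
    vals = [
        np.sum(np.cos(k * theta) ** 2 * s ** 2) * dtheta,                  # (136): pi/4
        np.sum(np.sin((k + 1) * theta) ** 2) * dtheta,                     # (136): pi/2
        np.sum(np.cos(k * theta) * np.sin((k + 1) * theta) * s) * dtheta,  # (136): pi/4
        np.sum(np.cos(k * theta) ** 2 * s ** 2 * c) * dtheta,              # (137): 0
    ]
    print(k, [round(v, 4) for v in vals],
          "targets:", round(np.pi / 4, 4), round(np.pi / 2, 4), 0.0)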

Lemma 13

(Average-case: Nesterov accelerated method (convex)) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2). Suppose the polynomials \(P_k\) are the residual polynomials for Nesterov’s accelerated gradient (105). If the ratio \(r = 1\), the following asymptotics hold

$$\begin{aligned} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\sim \frac{(\lambda ^+)^2}{\pi ^2 \sigma ^2} \frac{\log (k)}{k^3} \quad \text {and} \quad \int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\sim \frac{2 (\lambda ^+)^3}{\pi ^2 \sigma ^2} \frac{1}{k^4}.\nonumber \\ \end{aligned}$$
(138)

Proof

Define the polynomial \({\widetilde{P}}_k(u) = P_k( \lambda ^+ u; \lambda ^{\pm })\) where the polynomial \(P_k\) satisfies Nesterov’s recurrence (105). Using the change of variables \(u = \frac{\lambda }{\lambda ^+}\), we get the following relationship

$$\begin{aligned} \begin{aligned}&\int _0^{\lambda ^+} \lambda ^{\ell } P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\\&\quad = \frac{(\lambda ^+)^{\ell +1}}{2 \pi \sigma ^2} \int _0^1 u^{\ell -1} {\widetilde{P}}_k^2(u) \sqrt{u(1-u)} \, du\\&\quad = \frac{(\lambda ^+)^{\ell +1}}{2 \pi \sigma ^2} \int _0^1 u^{\ell -1} \frac{4J_1^2(k\sqrt{u})}{k^2u} e^{-uk} \sqrt{u(1-u)} \, du\\&\qquad + \frac{(\lambda ^+)^{\ell +1}}{2 \pi \sigma ^2} \int _0^1 u^{\ell -1} \big [ {\widetilde{P}}_k^2(u) - \frac{4J_1^2(k\sqrt{u})}{k^2u} e^{-uk} \big ] \sqrt{u(1-u)} \, du. \end{aligned} \end{aligned}$$
(139)

In the equality above, the first integral will give the asymptotic, while the second integral is bounded using Corollary 1. We start by bounding the second integral. We break this integral into three components based on the value of u

$$\begin{aligned} \Big ( \int _0^1 = \underbrace{\int _0^{k^{-4/3}}}_{\text {(i)}} + \underbrace{\int _{k^{-4/3}}^{\log ^2(k)/k}}_{\text {(ii)}} + \underbrace{\int _{\log ^2(k)/k}^1}_{\text {(iii)}} \Big ) u^{\ell -1} \big [{\widetilde{P}}_k^2(u) - \frac{4J_1^2(\sqrt{u}k)}{u k^2} e^{-uk} \big ] \sqrt{u(1-u)} \, du.\nonumber \\ \end{aligned}$$
(140)

For (i) in (140), we bound the integrand using Corollary 1 such that for all \(u \le k^{-4/3}\)

$$\begin{aligned} \begin{aligned} u^{\ell -1/2} \, \Big |{\widetilde{P}}_k^2(u) - \frac{4J_1^2(k \sqrt{u})}{u k^2} e^{-uk} \Big |&\le C ( u^{\ell -1/2} k^{-4/3} + k^{-13/6} u^{\ell -5/4} ). \end{aligned} \end{aligned}$$
(141)

Therefore, we get that the integral (i) is bounded by

$$\begin{aligned} \begin{aligned}&\int _0^{k^{-4/3}} u^{\ell -1} \big |{\widetilde{P}}_k^2(u)- \frac{4J_1^2(\sqrt{u}k)}{u k^2} e^{-uk} \big | \sqrt{u(1-u)} \, du\\&\le C \int _0^{k^{-4/3}}u^{\ell -1/2} k^{-4/3} + k^{-13/6} u^{\ell -5/4} \, du\\&\quad = C( k^{-2 - 4/3 \ell } + k^{-11/6 - 4/3 \ell })\\&\quad \le C {\left\{ \begin{array}{ll} k^{-19/6}, &{} \text {if } \ell =1\\ k^{-9/2}, &{} \text {if } \ell = 2, \end{array}\right. } \end{aligned} \end{aligned}$$
(142)

for sufficiently large k and absolute constant C.

For (ii) in (140), we bound the integrand using Corollary 1 to get for all \(k^{-4/3} \le u \le \log ^2(k)/k\) we have

$$\begin{aligned} u^{\ell -1/2} \, \Big |{\widetilde{P}}_k^2(u) - \frac{4J_1^2(k\sqrt{u})}{k^2 u} e^{-uk} \Big | \le C e^{-uk} (u^{\ell -3/2} k^{-8/3} + u^{\ell -7/4} k^{-17/6}).\nonumber \\ \end{aligned}$$
(143)

Therefore, we get that the integral (ii) is bounded by

$$\begin{aligned}&\int _{k^{-4/3}}^{\log ^2(k)/k} u^{\ell -1} \big |{\widetilde{P}}_k^2(u)- \frac{4J_1^2(k\sqrt{u})}{k^2 u} e^{-uk} \big | \sqrt{u(1-u)} \, du \nonumber \\&\quad \le C \int _{k^{-4/3}}^{\log ^2(k)/k} e^{-uk} (u^{\ell -3/2} k^{-8/3} + u^{\ell -7/4} k^{-17/6}) \, du \nonumber \\&\qquad {(v = uk)} \quad \le C \int _0^\infty e^{-v} v^{\ell -3/2} k^{-(\ell +13/6)} \, dv + C \int _0^\infty e^{-v} v^{\ell -7/4} k^{-(\ell + 25/12)} \, dv \nonumber \\&\quad = C(k^{-(\ell + 13/6)} + k^{-(\ell + 25/12)} ) \nonumber \\&\quad \le C {\left\{ \begin{array}{ll} k^{-37/12}, &{} \text {if } \ell = 1,\\ k^{-49/12}, &{} \text {if } \ell =2. \end{array}\right. } \end{aligned}$$
(144)

For (iii) in (140), for \(u \ge \log ^2(k)/k\) we use a simple bound on the functions \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k\sqrt{u}} I_k(u)\), where \(I_k(u)\) is defined in (109), and \(J_1(k\sqrt{u})\)

$$\begin{aligned} \big | {\widetilde{P}}_k^2(u) - \frac{4 J_1^2(k\sqrt{u})}{k^2u} e^{-uk} \big | \le \frac{4(1-u)^{k+1}}{k^2 u} I_k^2(u) + e^{-uk} \frac{4J_1^2(k\sqrt{u})}{k^2u} \le \frac{C e^{-\log ^2(k)}}{k \log ^2(k)}. \nonumber \\ \end{aligned}$$
(145)

In the last inequality, we used that the functions \(I_k(u)\) and \(J_1(k\sqrt{u})\) are bounded and \((1-u)^{k+1} \le e^{-uk}\). Since \(e^{-\log ^2(k)}\) decays faster than any polynomial, we have that

$$\begin{aligned} \int _{\log ^2(k)/k}^1 \big |{\widetilde{P}}_k^2(u) - \frac{4J_1^2(k\sqrt{u})}{k^2u} e^{-ku} \big | u^{\ell -1} \sqrt{u(1-u)} \, du \le C e^{-\log ^2(k)} \end{aligned}$$
(146)

for sufficiently large k and some absolute constant C.

Combining (142), (144), and (146) in (140), we have for \(\ell = 1, 2\) the following

$$\begin{aligned} \Big | \int _0^1 \Big [{\widetilde{P}}_k^2(u)- \frac{4J_1^2(k\sqrt{u})}{k^2u} e^{-ku} \Big ] u^{\ell -1} \sqrt{u(1-u)} \, du \Big | \le Ck^{-(\ell + 25/12)}. \end{aligned}$$
(147)

All that remains is to integrate the Bessel part in (139) to derive the asymptotic. Here we must consider cases when \(\ell =1\) and \(\ell =2\) separately. For \(\ell = 1\) using the change of variables \(v = k \sqrt{u}\) we have that

$$\begin{aligned}&\frac{(\lambda ^+)^2}{2\pi \sigma ^2} \int _0^1 \left( \frac{2J_1(k\sqrt{u})}{k \sqrt{u}} \right) ^2 e^{-uk} \sqrt{u(1-u)} \, du \\&\quad = \frac{2 \cdot 4 (\lambda ^+)^2}{2 \pi \sigma ^2} \frac{1}{k^3} \int _0^k J_1^2(v) e^{-v^2/k} \sqrt{1-v^2/k^2} \, dv\\&\qquad (\sqrt{1-x} \approx 1-x\text { for }x\text { small)} \qquad \sim \frac{2 \cdot 4 \cdot (\lambda ^+)^2}{2 \pi \sigma ^2} \frac{1}{k^3} \int _1^{\infty } J_1^2(v) e^{-v^2/k} \, dv\\&\qquad \text {(Bessel asymptotic, }J_1^2(v) \sim \tfrac{1}{\pi v}) \qquad \sim \frac{2 \cdot 4 \cdot (\lambda ^+)^2}{2 \pi \sigma ^2} \frac{1}{k^3} \int _1^\infty \frac{1}{\pi v} e^{-v^2/k} \, dv\\&\quad = \frac{2 \cdot 4 \cdot (\lambda ^+)^2}{2 \pi \sigma ^2} \cdot \frac{1}{k^3} \cdot \frac{-1}{2\pi } {\mathcal {E}}_i(-\sqrt{k}), \end{aligned}$$

where \({\mathcal {E}}_i\) is the exponential integral. It is known that the exponential integral \(\frac{-1}{2 \pi } {\mathcal {E}}_i(-\sqrt{k}) \sim \frac{\log (k)}{4\pi }\).

For \(\ell =2\) using the change of variables \(v = uk\) we have the following

$$\begin{aligned}&\frac{(\lambda ^+)^{3}}{2 \pi \sigma ^2} \int _0^1 \frac{4 J_1^2(k\sqrt{u} )}{k^2u} e^{-uk} u \sqrt{u(1-u)} \, du \\&\quad = \frac{4 \cdot (\lambda ^+)^3}{2 \pi \sigma ^2} \cdot \frac{1}{k^{7/2}} \int _0^k e^{-v} J_1^2( \sqrt{vk}) v^{1/2} \sqrt{1-\frac{v}{k}} \, dv\\&\qquad \sim \frac{2 \cdot 4 \cdot (\lambda ^+)^3 }{2 \pi ^2 \sigma ^2} \cdot \frac{1}{k^{4}} \int _0^{\infty } \cos ^2(\sqrt{vk} + C) e^{-v} \, dv\\&\qquad \sim \frac{4 \cdot (\lambda ^+)^3}{2 \pi ^2 \sigma ^2} \cdot \frac{1}{k^4} \int _0^\infty e^{-v} \, dv = \frac{4 \cdot (\lambda ^+)^3}{2 \pi ^2 \sigma ^2} \cdot \frac{1}{k^4}. \end{aligned}$$

The results follow. \(\square \)

Adversarial Model Computations

In this section, we derive the adversarial guarantees for gradient descent and Nesterov’s accelerated method.

Lemma 14

(Adversarial model: Gradient descent) Suppose Assumption 1 holds. Let \(\lambda ^+\) (\(\lambda ^-\)) be the upper (lower) edge of the Marčenko-Pastur distribution (2) and \(P_k\) the residual polynomial for gradient descent. Then the adversarial model for the maximal expected squared norm of the gradient is the following.

  1.

    If there is no noise \({\widetilde{R}} = 0\), then

    $$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k)\Vert ^2 \big ] \sim {\left\{ \begin{array}{ll} \frac{R^2 (\lambda ^+)^2}{(k+1)^2} e^{-2}, &{} \text {if }\lambda ^- = 0\\ R^2 (\lambda ^-)^2 \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k}, &{} \text {if }\lambda ^- > 0. \end{array}\right. } \end{aligned}$$
  2.

    If \({\widetilde{R}} > 0\), then the following holds

    $$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k) \Vert ^2 \big ] \sim {\left\{ \begin{array}{ll} \left[ \frac{R^2 (\lambda ^+)^2}{4} \frac{1}{k^2} + \frac{ {\widetilde{R}}^2 \lambda ^+}{2} \frac{1}{k} \right] e^{-2}, &{} \text {if }\lambda ^- = 0\\ \big [ R^2 (\lambda ^-)^2 + r {\widetilde{R}}^2 \lambda ^- \big ] \big (1- \frac{\lambda ^-}{\lambda ^+} \big )^{2k}, &{} \text {if }\lambda ^- > 0. \end{array}\right. } \end{aligned}$$

Proof

Suppose we are in the noiseless setting. By a change of variables, setting \(u = \lambda /\lambda ^+\), the following holds

$$\begin{aligned} \max _{\lambda \in [\lambda ^-, \lambda ^+]} \lambda ^2 \big (1- \frac{\lambda }{\lambda ^+} \big )^{2k} = \max _{u \in \big [\frac{\lambda ^-}{\lambda ^+}, 1 \big ]} (\lambda ^+)^2 u^2 (1-u)^{2k}. \end{aligned}$$
(148)

Taking derivatives, we get that the maximum of the RHS occurs when \(u = \frac{1}{k+1}\). For sufficiently large k and \(\lambda ^- > 0\), this critical point lies outside the constraint set \(\big [\tfrac{\lambda ^-}{\lambda ^+}, 1 \big ]\). Hence, the maximum occurs on the boundary, or equivalently, where \(u = \tfrac{\lambda ^-}{\lambda ^+}\). The result in the setting when \(\lambda ^- > 0\) immediately follows from this. When \(\lambda ^- = 0\), then the maximum does occur at \(\frac{1}{k+1}\). Plugging this value into the RHS of (148) and noting that for sufficiently large k, \((1-1/(k+1))^{2k} \rightarrow e^{-2}\), we get the other result for the noiseless case.

Now suppose that \({\widetilde{R}} > 0\). By a change of variables, setting \(u = \lambda / \lambda ^+\), we have that

$$\begin{aligned} \begin{aligned}&\max _{\lambda \in [\lambda ^-, \lambda ^+]}~ \left( R^2 \lambda ^2 + r{\widetilde{R}}^2 \lambda \right) \left( 1- \frac{\lambda }{\lambda ^+} \right) ^{2k}\\&\quad = \max _{u \in \big [\tfrac{\lambda ^-}{\lambda ^+}, 1 \big ]} \Big \{ h(u) {\mathop {=}\limits ^{\text {def}}}\lambda ^+ \big ( R^2 \lambda ^+ u^2 + r {\widetilde{R}}^2 u \big ) (1-u)^{2k} \Big \}. \end{aligned} \end{aligned}$$
(149)

The derivative \(h'(u)\) equals 0 at \(u = 1\) (local minimum) and at solutions to the quadratic

$$\begin{aligned} 2 R^2 \lambda ^+ (k+1) u^2 + [2r {\widetilde{R}}^2 k + r {\widetilde{R}}^2 - 2R^2 \lambda ^+] u - r {\widetilde{R}}^2 = 0. \end{aligned}$$

There is only one positive root of this quadratic so

$$\begin{aligned} \lambda ^* = \frac{\sqrt{(2r {\widetilde{R}}^2 k + r{\widetilde{R}}^2 - 2R^2 \lambda ^+)^2 + 8 r {\widetilde{R}}^2 R^2 \lambda ^+ (k+1)} - \big [2r {\widetilde{R}}^2 k + r {\widetilde{R}}^2 - 2R^2 \lambda ^+ \big ] }{4 R^2 \lambda ^+ (k+1)}. \nonumber \\ \end{aligned}$$
(150)

We can approximate the square root using Taylor approximation to get that

$$\begin{aligned}&\frac{1}{k} \sqrt{(2r {\widetilde{R}}^2 k + r{\widetilde{R}}^2 - 2R^2 \lambda ^+)^2 + 8 r {\widetilde{R}}^2 R^2 \lambda ^+ (k+1)}\\&\quad = 2r{\widetilde{R}}^2 \Big [ 1 + \frac{r{\widetilde{R}}^2-2R^2\lambda ^+}{r {\widetilde{R}}^2k} + \frac{2 R^2 \lambda ^+}{r {\widetilde{R}}^2 k} + {\mathcal {O}} ( k^{-2} ) \Big ]^{1/2} \\&\qquad \text {(Taylor approximation)} \quad \qquad = 2r{\widetilde{R}}^2 \Big [ 1 + \frac{1}{2k} + {\mathcal {O}}(k^{-2}) \Big ]. \end{aligned}$$

Putting this together into (150), we get that

$$\begin{aligned}&\frac{\sqrt{(2r {\widetilde{R}}^2 k + r{\widetilde{R}}^2 - 2R^2 \lambda ^+)^2 + 8 r {\widetilde{R}}^2 R^2 \lambda ^+ (k+1)} - \big [2r {\widetilde{R}}^2 k + r {\widetilde{R}}^2 - 2R^2 \lambda ^+ \big ] }{4 R^2 \lambda ^+ (k+1)}\\&\quad = \frac{\frac{r {\widetilde{R}}^2}{k} + {\mathcal {O}}(k^{-2}) - \frac{r{\widetilde{R}}^2}{k} + \frac{2 R^2 \lambda ^+}{k}}{4 R^2 \lambda ^+ + \frac{4 R^2 \lambda ^+}{k}} \sim \frac{1}{2k}. \end{aligned}$$

As before, for sufficiently large k and \(\lambda ^- > 0\), the root above lies outside the constraint set \(\big [ \frac{\lambda ^-}{\lambda ^+}, 1 \big ]\) and so the maximum occurs on the boundary, or equivalently, at \(u = \frac{\lambda ^-}{\lambda ^+}\). The result immediately follows by plugging this u into (149). When \(\lambda ^- =0\), then the maximum is attained at the root of the above quadratic which asymptotically equals 1/k. Plugging this value into (149) and noting for sufficiently large k that \((1-1/k)^{2k} \approx e^{-2}\), we get the result for the noisy setting. \(\square \)

Lemma 15

(Adversarial model: Nesterov (convex)) Suppose Assumption 1 holds. Let \(\lambda ^+\) be the upper edge of the Marčenko-Pastur distribution (2) and \(P_k\) the residual polynomial for Nesterov’s accelerated gradient (105). Suppose \(r = 1\). Then the adversarial model for the maximal expected squared norm of the gradient is the following.

  1.

    If there is no noise \({\widetilde{R}} = 0\), then

    $$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k)\Vert ^2 \big ] \sim \frac{8e^{-1/2}}{\sqrt{2}\pi } (\lambda ^+)^2 R^2 \frac{1}{k^{7/2}}. \end{aligned}$$
  2.

    If \({\widetilde{R}} > 0\), then the following holds

    $$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k) \Vert ^2 \big ] \sim \Vert J_1^2(x)\Vert _{\infty } (\lambda ^+){\widetilde{R}}^2 \frac{1}{k^2}. \end{aligned}$$

Proof

First, we claim that

$$\begin{aligned} \Vert \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm })\Vert _{\infty } k^{7/2} \rightarrow \frac{8}{\pi } (\lambda ^+)^2 \max _{x \ge 0} \{ x^{1/2}e^{-x} \} = \frac{8e^{-1/2}}{\sqrt{2}\pi } (\lambda ^+)^2 \quad \text {as }k \rightarrow \infty .\nonumber \\ \end{aligned}$$
(151)

Using the definitions in (108) and (109) for \({\widetilde{P}}_k(u)\) and \(I_k(u)\), respectively, we can write

$$\begin{aligned} P_k(\lambda ^+ u; \lambda ^{\pm }) = {\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k \sqrt{u}} I_k(u). \end{aligned}$$

Now by a change of variables we have the following

$$\begin{aligned}&\max _{\lambda \in [0, \lambda ^+]} \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) k^{7/2} = \max _{u \in [0,1]} (\lambda ^+)^2 u^2 {\widetilde{P}}_k^2(u) k^{7/2} \nonumber \\&\quad = \max \big \{ \max _{u \in [0, \frac{\log ^2(k)}{k} ]} (\lambda ^+)^2 k^{7/2} u^2 {\widetilde{P}}_k^2(u), \max _{u \in [\frac{\log ^2(k)}{k}, 1]} (\lambda ^+)^2 k^{7/2} u^2 {\widetilde{P}}_k^2(u) \, \, \big \}. \end{aligned}$$
(152)

Let us first consider the second term in the maximum. Here we use that \(|I_k(u)|\) is bounded so that

$$\begin{aligned} \max _{u \in [\frac{\log ^2(k)}{k}, 1]} (\lambda ^+)^2 k^{7/2} u^2 {\widetilde{P}}_k^2(u)&= \max _{u \in [\frac{\log ^2(k)}{k}, 1]} 4 (\lambda ^+)^2 k^{3/2} u (1-u)^{k+1} I_k^2(u) \\&\le \max _{u \in [\frac{\log ^2(k)}{k}, 1]} C (\lambda ^+)^2 k^{3/2} u (1-u)^{k+1} \\&{\mathop {=}\limits ^{\text {def}}} \max _{u \in [\frac{\log ^2(k)}{k}, 1]} h(u). \end{aligned}$$

The function h(u) is maximized when \(u = \tfrac{1}{k+2}\) and hence the maximum over the constrained set occurs at the endpoint \(\tfrac{\log ^2(k)}{k}\). With this value, it is immediately clear that the maximum over \(u \in [ \tfrac{\log ^2(k)}{k}, 1]\) of \((\lambda ^+)^2 k^{7/2} u^2 {\widetilde{P}}_k^2(u) \rightarrow 0\). Now we consider the first term in (152). In this regime, the polynomial \({\widetilde{P}}_k^2(u)\) behaves like the Bessel function in (34). We further break up the interval \([0, \log ^2(k)/k]\) into the regions where u is larger or smaller than \(k^{-4/3}\). When \(u \in [0, k^{-4/3}]\), Corollary 1 says there exists a constant C such that

$$\begin{aligned} \max _{u \in [0, k^{-4/3}]} (\lambda ^+)^2 u^2 k^{7/2} \big |{\widetilde{P}}_k^2(u)-\tfrac{4e^{-uk}J_1^2(k \sqrt{u})}{k^2 u} \big | \le C (\lambda ^+)^2 [k^{-1/2} + k^{-1/3}] \rightarrow 0. \end{aligned}$$

Similarly when \(u \in [k^{-4/3}, \log ^2(k)/k]\), Corollary 1 yields that

$$\begin{aligned}&\max _{u \in [k^{-4/3}, \log ^2(k)/k]} (\lambda ^+)^2 u^2 k^{7/2} \big |{\widetilde{P}}_k^2(u) - \tfrac{4 e^{-uk} J_1^2(k \sqrt{u})}{k^2 u} \big | \\&\quad \le \max _{u \in [k^{-4/3}, \log ^2(k)/k]} C (\lambda ^+)^2 (u k^{5/6} + u^{3/4} k^{2/3} ) \rightarrow 0. \end{aligned}$$

Using a change of variables, the relevant asymptotic to compute is

$$\begin{aligned} \max _{u \in [0, \log ^2(k)/k]} 4(\lambda ^+)^2 u k^{3/2} J_1^2(k \sqrt{u})e^{-uk} = \max _{x \in [0, \log ^2(k)]} 4(\lambda ^+)^2 (\sqrt{(x k)}J_1^2(\sqrt{kx})) \sqrt{x} e^{-x}. \end{aligned}$$

From the uniform boundedness of the function \(y \mapsto \sqrt{y}J_1^2(\sqrt{y}),\) there is a constant \({\mathcal {C}} > 0\) so that

$$\begin{aligned} \max _{x \in [0, \delta ]} 4(\lambda ^+)^2 (\sqrt{(x k)}J_1^2(\sqrt{kx})) \sqrt{x} e^{-x} \le 4(\lambda ^+)^2 {\mathcal {C}}\sqrt{\delta }. \end{aligned}$$

Moreover, the Bessel function satisfies

$$\begin{aligned} J_1(z) = \sqrt{\frac{2}{\pi z}}\cos ( z - \tfrac{3\pi }{4}) + O(z^{-3/2}), \end{aligned}$$

and so for any fixed \(\delta >0\)

$$\begin{aligned} \max _{x \in [\delta , \log ^2(k)]} 4(\lambda ^+)^2 (\sqrt{(x k)}J_1^2(\sqrt{kx})) \sqrt{x} e^{-x} \rightarrow \max _{x \in [\delta , \infty ]} \biggl \{ \frac{8}{\pi }(\lambda ^+)^2 \sqrt{x} e^{-x} \biggr \}. \end{aligned}$$

As \(\delta > 0\) is arbitrary, picking it sufficiently small completes the claim.
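The Bessel asymptotic used above can be checked numerically (an illustrative sketch using scipy's \(J_1\); not part of the argument):

import numpy as np
from scipy.special import j1  # Bessel function of the first kind of order 1

for z in [10.0, 100.0, 1000.0]:
    approx = np.sqrt(2 / (np.pi * z)) * np.cos(z - 3 * np.pi / 4)
    # the rescaled error stays bounded, consistent with the O(z^{-3/2}) remainder
    print(z, j1(z), approx, abs(j1(z) - approx) * z ** 1.5)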

Next we claim that the following holds

$$\begin{aligned} \max _{\lambda \in [0, \lambda ^+]} k^2 \lambda P_k^2(\lambda ; \lambda ^{\pm }) = \max _{u \in [0,1]} k^2 \lambda ^+ u {\widetilde{P}}_k^2(u) \sim \lambda ^+ \Vert J_1(u)\Vert ^2_{\infty } \quad \text {as }k \rightarrow \infty .\nonumber \\ \end{aligned}$$
(153)

An argument similar to the one above, using that in this regime the exponential decay dominates the polynomial \({\widetilde{P}}_k^2(u)\), shows that

$$\begin{aligned} \max _{u \in [\log ^2(k)/k, 1]} k^2 \lambda ^+ u {\widetilde{P}}_k^2(u) \rightarrow 0 \quad \text {as }k \rightarrow \infty . \end{aligned}$$

Now we need to consider the regime where the Bessel function (34) becomes important. We use our asymptotic in Corollary 1 to show that the polynomial is close to the Bessel function, namely,

$$\begin{aligned} \begin{aligned} \max _{u \in [0, k^{-4/3}]} \lambda ^+ k^2 u \big | {\widetilde{P}}_k^2(u) - \tfrac{4 e^{-uk} J_1^2(k \sqrt{u})}{k^2 u} \big |&\le C \lambda ^+ [k^{-2/3} + k^{-5/2}] \rightarrow 0 \quad \text {as }k \rightarrow \infty \\ \text {and} \quad \max _{u \in [k^{-4/3}, \log ^2(k)/k]} \lambda ^+ k^2 u&\big | {\widetilde{P}}_k^2(u) - \tfrac{4 e^{-uk} J_1^2(k \sqrt{u})}{k^2 u} \big |\\&\le C \lambda ^+ [k^{-2/3} + k^{-1/2}] \rightarrow 0 \quad \text {as }k \rightarrow \infty . \end{aligned} \end{aligned}$$
(154)

It remains to compute the maximum of the Bessel expression in (34) for u between 0 and \(\log ^2(k)/k\). Now there exists an absolute constant \({\mathcal {C}}\) so that \(|J_1(x)^2| \le \tfrac{{\mathcal {C}}}{|x|}\) and there is also an \(\eta > 0\) so that \(\displaystyle \max _{0 \le x \le \tfrac{1}{\eta }} |J_1^2(x)| > \eta \). Moreover, the maximizer of \(J_1^2(x)\) exists. By picking R sufficiently large, we see that

$$\begin{aligned} \max _{u > \tfrac{R}{k^2}} 4 \lambda ^+ e^{-ku} J_1^2(k \sqrt{u}) \le \frac{4 {\mathcal {C}} \lambda ^+}{k^2 u} \Big |_{u = R/k^2} = \frac{4{\mathcal {C}} \lambda ^+}{R} < \eta . \end{aligned}$$

This means that the maximum must occur for u between 0 and \(R/k^2\). Hence, by picking R sufficiently large, we have the following

$$\begin{aligned} \max _{u \in [0, R/k^2]} 4\lambda ^+ e^{-ku} J_1^2(k \sqrt{u})&\rightarrow \max _{x \in [0,R]} 4 \lambda ^+ J_1^2(x) = 4 \lambda ^+ \Vert J_1^2(x)\Vert _{\infty }. \end{aligned}$$

Consequently, for sufficiently large k, the maximum satisfies

$$\begin{aligned} \begin{aligned} \max _{u \in [0,1]} (\lambda ^+)^2 R^2 u^2 {\widetilde{P}}_k^2(u) + \lambda ^+ {\widetilde{R}}^2 r u{\widetilde{P}}_k^2(u)&= \max _{u \in [0, \log ^2(k)/k]} 4\lambda ^+ e^{-ku} J_1^2(k \sqrt{u})\\&\rightarrow \max _{x \in [0,R]} 4 \lambda ^+ J_1^2(x) = 4 \lambda ^+ \Vert J_1^2(x)\Vert _{\infty }. \end{aligned} \end{aligned}$$
(155)

\(\square \)

Simulation Details

For Figs. 1 and 7, which show that the halting time concentrates, we perform \(\frac{2^{12}}{\sqrt{d}}\) training runs for each value of d. In our initial simulations we observed that the empirical standard deviation decreased as \(d^{-1/2}\) as the model grew. Because the larger models have a significant runtime but very little variance in the halting time, we decided to scale the number of experiments based on this estimate of the variance.

As discussed in the text, the Student’s t-distribution can produce ill-conditioned matrices with large halting times. To make the numerical experiments feasible we limit the number of iterations to 1000 steps for the GD and Nesterov experiments and discard the very few runs that have not converged by this time (less than 0.1%).

For Fig. 3, which shows the convergence rates, we trained 8192 models with \(d = n = 4096\) for \(n\) steps, both with (\({\widetilde{R}}^2 = 0.05\)) and without noise. The convergence rates were estimated by fitting a line to the second half of the log-log curve.

For each run we calculate the worst-case upper bound on \(\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2\) at \(k = n\) using [62, Conjecture 3].

$$\begin{aligned} \Vert \nabla f({{\varvec{x}}}_k)\Vert ^2 \le \frac{L^2 \Vert {{\varvec{x}}}_0 - {{\varvec{x}}}^{\star }\Vert ^2}{(k + 1)^2} {\mathop {=}\limits ^{\text {def}}}\mathrm {UB}_{\text {cvx}}(\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2) \end{aligned}$$

where \({{\varvec{x}}}^{\star }\) is the argmin of f calculated using the linear solver in JAX [11]. To visualize the difference between the worst-case and average-case rates, we draw a log-log histogram of the ratio,

$$\begin{aligned} \frac{\mathrm {UB}_{\text {cvx}}(\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2)}{\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2}. \end{aligned}$$
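A minimal sketch of this computation for a single run (illustrative only: the dimensions, the seed, and the use of numpy.linalg.lstsq as a stand-in for the JAX linear solver [11] are our choices, not the paper's code):

import numpy as np

rng = np.random.default_rng(0)
n = d = 256                                     # small stand-in for d = n = 4096
A = rng.standard_normal((n, d)) / np.sqrt(d)    # random data matrix
x_tilde = rng.standard_normal(d) / np.sqrt(d)   # planted signal
b = A @ x_tilde + np.sqrt(0.05) * rng.standard_normal(n)  # targets with noise level R~^2 = 0.05

H = A.T @ A / n

def grad(x):                                    # gradient of f(x) = ||Ax - b||^2 / (2n)
    return A.T @ (A @ x - b) / n

L = np.linalg.eigvalsh(H).max()                 # Lipschitz constant of the gradient
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # argmin of f (stand-in for the JAX solver)

x0 = rng.standard_normal(d) / np.sqrt(d)        # initialization
x = x0.copy()
for _ in range(n):                              # gradient descent with step size 1/L, k = n steps
    x = x - grad(x) / L

ub_cvx = L ** 2 * np.linalg.norm(x0 - x_star) ** 2 / (n + 1) ** 2
print("UB_cvx / ||grad f(x_n)||^2 =", ub_cvx / np.linalg.norm(grad(x)) ** 2)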

1.1 Step Sizes

In this appendix section, we discuss our choices of step sizes for logistic regression and stochastic gradient descent (SGD).

1.1.1 Logistic Regression

For both gradient descent and Nesterov’s accelerated method (convex) on the least squares problem, we use the step size \(\frac{1}{L}\). The Lipschitz constant, L, is equal to the largest eigenvalue of \({{\varvec{H}}}\), which can be quickly approximated using the power iteration method.

For logistic regression the Hessian is equal to \({{\varvec{A}}}^T{{\varvec{D}}}{{\varvec{A}}}\), where \({{\varvec{D}}}\) is the diagonal Jacobian matrix of the sigmoid activation function. Hence, the Hessian’s eigenvalues are bounded by those of \({{\varvec{H}}}\) scaled by the largest diagonal entry \({{\varvec{D}}}_{ii} = \sigma (({{\varvec{A}}}{{\varvec{x}}})_i)(1 - \sigma (({{\varvec{A}}}{{\varvec{x}}})_i))\). Since the maximum value of these entries is \(\frac{1}{4}\) we use a step size of \(\frac{4}{L}\) for our logistic regression experiments.
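As an illustration (a sketch with our own function names; the number of power iterations and the problem sizes are arbitrary), the step sizes \(1/L\) and \(4/L\) can be obtained as follows:

import numpy as np

def largest_eigenvalue(H, iters=100, seed=0):
    # Approximate the largest eigenvalue of a symmetric PSD matrix H by power iteration.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return float(v @ (H @ v))    # Rayleigh quotient at the approximate top eigenvector

rng = np.random.default_rng(1)
n, d = 1000, 500
A = rng.standard_normal((n, d)) / np.sqrt(d)
H = A.T @ A / n

L = largest_eigenvalue(H)
step_least_squares = 1.0 / L     # gradient descent / Nesterov (convex) on least squares
step_logistic = 4.0 / L          # logistic regression, since max_i D_ii <= 1/4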

1.1.2 Stochastic Gradient Descent (SGD)

The least squares problem (10) can be reformulated as

$$\begin{aligned} \min _{{{\varvec{x}}}\in {{\mathbb {R}}}^d}~ \frac{1}{2n} \Vert {{\varvec{A}}}{{\varvec{x}}}-{{\varvec{b}}}\Vert ^2 = \frac{1}{2n} \sum _{i=1}^n ({{\varvec{a}}}_i {{\varvec{x}}}-b_i)^2, \end{aligned}$$
(156)

where \({{\varvec{a}}}_i\) is the ith row of the matrix \({{\varvec{A}}}\). We perform a mini-batch SGD, i.e., at each iteration we select uniformly at random a subset of the samples \(b_k \subset \{1, \ldots , n\}\) and perform the update

$$\begin{aligned} {{\varvec{x}}}_{k+1} = {{\varvec{x}}}_k - \frac{\alpha }{|b_k|} \sum _{i \in b_k} \nabla f_i({{\varvec{x}}}_k). \end{aligned}$$
(157)

With a slight abuse of notation, we denote by \(\nabla f_i({{\varvec{x}}}_k) = \frac{1}{|b_k|} \sum _{i \in b_k} \nabla f_i({{\varvec{x}}}_k)\) the update direction and use the shorthand \(b = |b_k|\) for the mini-batch size since it is fixed across iterations. The rest of this section is devoted to choosing the step size \(\alpha \) so that the halting time is consistent across dimensions n and d. Contrary to (full) gradient descent, the step size in SGD is dimension-dependent because a typical step size in SGD uses the variance in the gradients which grows as the dimension d increases.
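A minimal sketch of the mini-batch update (157) for the least squares problem (156) (the batch size, dimensions, and the placeholder step size alpha are illustrative; the choice of alpha is the subject of the rest of this section):

import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size, alpha = 2000, 500, 32, 0.1     # alpha is only a placeholder step size here
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = A @ (rng.standard_normal(d) / np.sqrt(d)) + rng.standard_normal(n)

x = rng.standard_normal(d) / np.sqrt(d)
for _ in range(1000):
    b_k = rng.choice(n, size=batch_size, replace=False)        # mini-batch of sample indices
    g = A[b_k].T @ (A[b_k] @ x - b[b_k]) / batch_size          # averaged mini-batch gradient
    x = x - alpha * g                                          # update (157)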

Over-parametrized. If \(n \le d\) we call the model over-parametrized. In this case, the strong growth condition from [56] holds. This implies that training will converge when we use a fixed step size \(\frac{2}{LB^2}\), where B is a constant satisfying, for all \({{\varvec{x}}}\),

$$\begin{aligned} \max _i \left\{ \Vert \nabla f_i({{\varvec{x}}})\Vert \right\} \le B\Vert \nabla f({{\varvec{x}}})\Vert . \end{aligned}$$
(158)

To estimate B we will compute the expected values of \(\Vert \nabla f_i({{\varvec{x}}})\Vert ^2\) and \(\Vert \nabla f({{\varvec{x}}})\Vert ^2\). To simplify the derivation, we will assume that \({\widetilde{{{\varvec{x}}}}}\) and \({\varvec{\eta }}\) are normally distributed. At iterate \({{\varvec{x}}}\) we then have

$$\begin{aligned} \nabla f({{\varvec{x}}})&= \frac{1}{n} {{\varvec{A}}}^T({{\varvec{A}}}({{\varvec{x}}}- {\widetilde{{{\varvec{x}}}}}) - {\varvec{\eta }}) \end{aligned}$$
(159)
$$\begin{aligned} \nabla f({{\varvec{x}}})&\sim N\left( {{\varvec{H}}}{{\varvec{x}}},\frac{1}{d}{{\varvec{H}}}^2+\frac{{\widetilde{R}}^2}{n}{{\varvec{H}}}\right) . \end{aligned}$$
(160)

Hence, the expected value of \(\Vert \nabla f({{\varvec{x}}})\Vert ^2\) is \(\Vert {{\varvec{H}}}{{\varvec{x}}}\Vert ^2 + \text {tr}\left( \frac{1}{d}{{\varvec{H}}}^2+\frac{{\widetilde{R}}^2}{n}{{\varvec{H}}}\right) \). Following [5, Equation 3.1.6 and Lemma 3.1] we know that for large values of n and d, the expected trace \(\frac{1}{d}\text {tr}{{\varvec{H}}}\approx 1\) and \(\frac{1}{d}\mathrm{tr }{{\varvec{H}}}^2 \approx 1 + r\). Further, \({{\mathbb {E}}}\,\left[ \Vert {{\varvec{H}}}{{\varvec{x}}}\Vert ^2\right] = (1 + r)\Vert {{\varvec{x}}}\Vert ^2\) and hence

$$\begin{aligned} \begin{aligned}&{{\mathbb {E}}}\,\left[ \Vert \nabla f({{\varvec{x}}})\Vert ^2\right] \approx (1 + r)\Vert {{\varvec{x}}}\Vert ^2 + (1 + r) + r{\widetilde{R}}^2 \\&\quad = (1 + r)(1 + \Vert {{\varvec{x}}}\Vert ^2) + r{\widetilde{R}}^2. \end{aligned} \end{aligned}$$
(161)

We can approximate the same value for a mini-batch gradient, where

$$\begin{aligned} {{\mathbb {E}}}\,\left[ \Vert \nabla f_i({{\varvec{x}}})\Vert ^2\right] \approx (1 + r')(1 + \Vert {{\varvec{x}}}\Vert ^2) + r'{\widetilde{R}}^2, \end{aligned}$$
(162)

for batch size b and \(r' = \frac{d}{b}\). Note that \(\Vert {{\varvec{x}}}\Vert \approx 1\) because of the normalization of both the initial point and the solution, so for our experiments we set \(B^2 = \frac{2 + r'(2 + {\widetilde{R}}^2)}{2+r(2 + {\widetilde{R}}^2)}\).
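In code, the resulting over-parametrized step size reads as follows (a sketch; the function name, its arguments, and the example values, including the placeholder Lipschitz constant L, are ours):

def sgd_step_size_overparametrized(L, d, n, batch_size, R_tilde_sq):
    # Step size 2 / (L * B^2) with B^2 estimated from (161)-(162); intended for n <= d.
    r, r_prime = d / n, d / batch_size
    B_sq = (2 + r_prime * (2 + R_tilde_sq)) / (2 + r * (2 + R_tilde_sq))
    return 2.0 / (L * B_sq)

# example: d = 1200 >= n = 800, batch size 16, noise level R~^2 = 0.05, illustrative L
print(sgd_step_size_overparametrized(L=4.0, d=1200, n=800, batch_size=16, R_tilde_sq=0.05))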

Under-parametrized. In the under-parametrized case SGD will not converge but reach a stationary distribution around the optimum. Given a step size \({\overline{\alpha }} \le \frac{1}{LM_G}\) the expected square norm of the mini-batch gradients will converge to \({\overline{\alpha }}LM\) where M and \(M_G\) are constants such that \({{\mathbb {E}}}\,\left[ \Vert \nabla f_i({{\varvec{x}}})\Vert ^2\right] \le M + M_G\Vert \nabla f({{\varvec{x}}})\Vert ^2\) [10, Theorem 4.8, Equation 4.28]. We will use rough approximations of both M and \(M_G\). In fact, we will set \(M_G = B^2 = \frac{2 + 3r'}{2+3r}\).

To approximate M we will estimate the norm of the mini-batch gradients at the optimum for our least squares model. Set \({{\varvec{x}}}^* = {{\varvec{A}}}^+{{\varvec{b}}}= {{\varvec{A}}}^+{\varvec{\eta }}+ {\widetilde{{{\varvec{x}}}}}\), where \({{\varvec{A}}}^+\) is the Moore-Penrose pseudoinverse. We write the row-subsampled matrix \({\widetilde{{{\varvec{A}}}}}\) used in mini-batch SGD as \({\widetilde{{{\varvec{A}}}}} = {{\varvec{P}}}{{\varvec{A}}}\), where \({{\varvec{P}}}\) consists of exactly b rows of the identity matrix. Note that \({{\varvec{P}}}^T{{\varvec{P}}}\) is idempotent.

$$\begin{aligned} {\widetilde{\nabla }}f({{\varvec{x}}}^*)&= \frac{1}{b}{\widetilde{{{\varvec{A}}}}}^T({\widetilde{{{\varvec{A}}}}}({{\varvec{A}}}^+{\varvec{\eta }}+{\widetilde{{{\varvec{x}}}}} -{\widetilde{{{\varvec{x}}}}}) - {\widetilde{{\varvec{\eta }}}}) \\&= \frac{1}{b}{{\varvec{A}}}^T{{\varvec{P}}}^T({{\varvec{P}}}{{\varvec{A}}}{{\varvec{A}}}^+{\varvec{\eta }}- {{\varvec{P}}}{\varvec{\eta }}) \\&= \frac{1}{b}{{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}({{\varvec{A}}}{{\varvec{A}}}^+ - {\varvec{I}}){\varvec{\eta }}. \end{aligned}$$

To simplify the derivation, we will again assume that \({\varvec{\eta }}\) is normally distributed and that \({\widetilde{R}} = 1\). Thus, we have

$$\begin{aligned} {\widetilde{\nabla }}f({{\varvec{x}}}^*)&\sim N\left( 0, \frac{1}{b^2} {{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}({{\varvec{A}}}{{\varvec{A}}}^+ - {\varvec{I}})({{\varvec{A}}}{{\varvec{A}}}^+ - {\varvec{I}})^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) . \end{aligned}$$
(163)

By taking the expectation of the squared norm, we derive the following

$$\begin{aligned} {{\mathbb {E}}}\,\left[ \Vert {\widetilde{\nabla }}f({{\varvec{x}}}^*)\Vert ^2\right]&= \frac{1}{b^2} \mathrm {tr}\left( {{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}({{\varvec{A}}}{{\varvec{A}}}^+ - {\varvec{I}})({{\varvec{A}}}{{\varvec{A}}}^+ - {\varvec{I}})^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&= \frac{1}{b^2} \mathrm {tr}\left( {{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}{{\varvec{A}}}^+{{\varvec{A}}}^{+T}{{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&\quad +\frac{1}{b^2}\mathrm {tr}\left( {{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&\quad -\frac{2}{b^2}\mathrm {tr}\left( {{\varvec{A}}}^T{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}{{\varvec{A}}}^+{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&= \mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{A}}}^+{{\varvec{A}}}^{+T}\right) + \frac{1}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}\right) - \frac{2}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}{{\varvec{A}}}^+{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&= \frac{1}{n} \mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+\right) + \frac{1}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}\right) - \frac{2}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}{{{\varvec{A}}}^+{{\varvec{A}}}^{+T}{{\varvec{A}}}^T}{{\varvec{P}}}^T{{\varvec{P}}}{{\varvec{A}}}\right) \\&= \frac{1}{n} \mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+\right) + \frac{1}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}\right) - 2\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{A}}}^+{{\varvec{A}}}^{+T}\right) \\&= \frac{1}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}\right) - \frac{1}{n}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+\right) . \end{aligned}$$

Now, we must find an approximation of \(\frac{1}{n}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+\right) \). For \(b \approx n\) we have \({\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+ \approx {{\varvec{H}}}\), whereas for \(b \approx 1\) we argue that \({\widetilde{{{\varvec{H}}}}}\) and \({{\varvec{H}}}^+\) can be seen as independent matrices with \({{\varvec{H}}}^+ \approx {{\varvec{I}}}\). We can linearly interpolate between these two extremes,

$$\begin{aligned}&{{\mathbb {E}}}\,\left[ \Vert {\widetilde{\nabla }}f({{\varvec{x}}}^*)\Vert ^2\right] \approx \frac{1}{b}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}\right) - \frac{b}{n}\frac{1}{n}\mathrm {tr}\left( {{\varvec{H}}}\right) - \left( 1 - \frac{b}{n}\right) \frac{1}{n}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2\right) \end{aligned}$$
(164)
$$\begin{aligned}&\quad = r' - \frac{b}{n}r - \left( 1-\frac{b}{n}\right) r(1 + r') = (1 - r)(r' - r). \end{aligned}$$
(165)

Experimentally these approximations work well. Hence, in our simulations we set \(M = {\widetilde{R}}^2(1 - r)(r' - r)\).
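Putting the under-parametrized constants together gives the following sketch (the function name and example values are ours; \({\widetilde{R}}^2 = 1\) by default, as assumed above):

def sgd_step_size_underparametrized(L, d, n, batch_size, R_tilde_sq=1.0):
    # Step size 1 / (L * M_G) with M_G = B^2 (at R~^2 = 1) and M = R~^2 (1 - r)(r' - r); for n > d.
    r, r_prime = d / n, d / batch_size
    M_G = (2 + 3 * r_prime) / (2 + 3 * r)
    M = R_tilde_sq * (1 - r) * (r_prime - r)   # expected squared mini-batch gradient norm at the optimum
    return 1.0 / (L * M_G), M                  # gradients stabilize around (step * L * M) = M / M_G

# example: d = 500 < n = 2000, batch size 32, illustrative L
print(sgd_step_size_underparametrized(L=2.0, d=500, n=2000, batch_size=32))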
