Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods


Abstract

In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent: convex quadratic problems. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates, and dual function values. We also show that the primal iterates converge at an accelerated linear rate in a somewhat weaker sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesàro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.


Notes

  1. In addition, these three methods are identical to the stochastic fixed point method (with relaxation) for solving the fixed point problem \(x = {\mathbb {E}\left[ \varPi _{\mathcal{L}_\mathbf{S}}(x)\right] }\), where \(\mathcal{L}_{\mathbf{S}}\) is the set of solutions of \(\mathbf{S}^\top \mathbf{A}x = \mathbf{S}^\top b\), which is a sketched version of the linear system (2), and can be seen as a stochastic approximation of the set \(\mathcal{L}{:}{=}\{x\;:\; \mathbf{A}x = b\}\).

  2. In the rest of the paper we consider projection with respect to an arbitrary Euclidean norm.

  3. Note that for \(\mathbf {B}=\mathbf {I}\) it holds that \(\mathbf {M}^{\dagger _{\mathbf {I}}}=\mathbf {M}^{\dagger }\) and hence the \(\mathbf {I}\)-pseudoinverse reduces to the standard Moore-Penrose pseudoinverse.

  4. A more popular, and certainly theoretically much better understood alternative to Polyak’s momentum is the momentum introduced by Nesterov [60, 62], leading to the famous accelerated gradient descent (AGD) method. This method converges non-asymptotically and globally; with optimal sublinear rate \(\mathcal{O}(\sqrt{L/\epsilon })\) [59] when applied to minimizing a smooth convex objective function (class \(\mathcal{F}^{1,1}_{0,L}\)), and with the optimal linear rate \(\mathcal{O}(\sqrt{L/\mu } \log (1/\epsilon ))\) when minimizing smooth strongly convex functions (class \(\mathcal{F}^{1,1}_{\mu ,L}\)). Recently, variants of Nesterov’s momentum have also been introduced for the acceleration of stochastic gradient descent. We refer the interested reader to [1, 26, 35, 36, 41, 95, 96] and the references therein. Both Nesterov’s and Polyak’s update rules are known in the literature as “momentum” methods. In this paper, however, we focus exclusively on Polyak’s heavy ball momentum.

  5. This choice implies that the Hessian of the stochastic function \(f_{\mathbf{S}}(x) = \tfrac{1}{2}\Vert \mathbf{A}x - b\Vert ^2_{\mathbf{H}}\) is a projection matrix (in the \(\mathbf{B}\) inner product), a fact that is immensely useful throughout the analysis. It also implies that SGD without momentum satisfies the decrease identity

    $$\begin{aligned} \Vert x_{k+1}-x_*\Vert _{\mathbf{B}}^2 = \Vert x_k - x_*\Vert _{\mathbf{B}}^2 - 2\omega (2-\omega ) f_{\mathbf{S}_k}(x_k), \end{aligned}$$

    where \(x_*\) is the projection of \(x_0\) onto the solution space of the linear system \(\mathbf{A}x=b\) [78]. The above identity holds for any matrix \(\mathbf{S}_k\); note that it does not involve any expectation. If \(\mathbf{S}_k\) is chosen randomly, as it is throughout our paper, then this is an identity between two random variables: the left- and the right-hand side. One interesting consequence of the identity is that \(\omega =1\) is a natural (and in some sense optimal) stepsize for SGD without momentum: fixing \(\mathbf{S}_k\) and \(x_k\), the decrease in squared distance is maximized for \(\omega =1\). Lastly, the choice of \(\mathbf{H}\) is what makes all the various methods we consider in this paper equivalent; it is in this sense the canonical choice of the pseudo-metric [78]. A small numerical illustration of this identity, for the randomized Kaczmarz special case, is sketched at the end of these notes.

  6. While the Hessian is not self-adjoint with respect to the standard inner product, it is self-adjoint with respect to the inner product \(\langle \mathbf{B}x, y\rangle \) which we use as the canonical inner product in \(\mathbb {R}^n\).

  7. The gradient is computed with respect to the inner product \(\langle x, y\rangle _{\mathbf{B}} {:}{=}\langle \mathbf{B}x, y\rangle \). Since \(\langle x, y\rangle = \langle \mathbf{B}^{-1}x, y\rangle _{\mathbf{B}}\), this gradient is obtained from the standard gradient by applying to it the linear transformation \(\mathbf{B}^{-1}\).

  8. In this method we take the \(\mathbf{B}\)-pseudoinverse of the Hessian of \(f_{\mathbf{S}_k}\) instead of the classical inverse, as the inverse does not exist. When \(\mathbf{B}=\mathbf{I}\), the \(\mathbf{B}\) pseudoinverse specializes to the standard Moore-Penrose pseudoinverse.

  9. In this case, the equivalence only works for \(0<\omega \le 1\).

  10. In the plots of Fig. 1, the hyperplane of each update is chosen in an alternating fashion for illustration purposes.

  11. The experiments were repeated with various values of the main parameters and initializations, and similar results were obtained in all cases.

  12. Remember that in our setting we have \(f(x_*)=0\) for the optimal solution \(x_*\) of the best approximation problem; thus \(f(x)-f(x_*)=f(x)\). The function values \(f(x_k)\) refer to function (37) in the case of RK and to function (39) in the case of RCD. For block variants the objective function of problem (1) also has a closed-form expression, but it can be very difficult to compute. In these cases one can instead evaluate the quantity \(\Vert \mathbf {A}x-b\Vert ^2_{\mathbf {B}}\).

  13. Note that in the first experiment we use Gaussian matrices, which by construction are full rank with probability 1, and as a result the consistent linear systems have a unique solution. Thus, for any starting point \(x_0\), the vector z used to create the linear system is the solution mSGD converges to. This is not true for general consistent linear systems whose matrix is not of full rank. In that case, the solution \(x_{*}=\varPi _{\mathcal{L}}^{\mathbf {B}}(x_0)\) that mSGD converges to is not necessarily equal to z. For this reason, in the evaluation of the relative error measure \(\Vert x_k-x_*\Vert ^2_\mathbf {B}/ \Vert x_0-x_*\Vert ^2_\mathbf {B}\), one should be careful and use the value \(x_*=x_0+\mathbf {A}^\dagger (b- \mathbf {A}x_0)\overset{x_0=0}{=} \mathbf {A}^\dagger b\).

  14. RCD converges to the optimal solution only in the case of positive definite matrices. For this reason we use \(\mathbf {A}= \mathbf {P}^\top \mathbf {P}\in \mathbb {R}^{n \times n}\), which is a full-rank matrix with probability 1.

  15. To pre-compute the solution \(x_*\) for each linear system \(\mathbf {A}_g x=b_g\) we use the closed-form expression of the projection (14).

  16. The matrix \(\mathbf {A}\) of the linear system is the incidence matrix of the graph; it is known that the Laplacian matrix is equal to \(\mathbf {L}=\mathbf {A}^\top \mathbf {A}\), and that \(\Vert \mathbf {A}\Vert ^2_F=2m\).

  17. The lower bound on \(\beta \) is tight; the upper bound is not. However, we do not care much about the regime of large \(\beta \), since \(\beta \) is the convergence rate and hence is only interesting when smaller than 1.
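
The decrease identity in note 5 can also be checked numerically. Below is a minimal Python sketch (our own illustration, not from the paper) for the randomized Kaczmarz special case, i.e. \(\mathbf{B}=\mathbf{I}\) and \(\mathbf{S}_k=e_i\) with i chosen uniformly at random; the problem data and the value of \(\omega \) are arbitrary.

```python
import numpy as np

# Check the identity ||x_{k+1}-x_*||^2 = ||x_k-x_*||^2 - 2*omega*(2-omega)*f_{S_k}(x_k)
# for SGD without momentum in the randomized Kaczmarz case (B = I, S_k = e_i).
rng = np.random.default_rng(0)
m, n = 20, 10
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)     # unique solution (A has full column rank w.p. 1)
b = A @ x_star                      # consistent system Ax = b

def f_S(x, i):
    """Stochastic function f_S(x) = (a_i^T x - b_i)^2 / (2 ||a_i||^2)."""
    r = A[i] @ x - b[i]
    return r * r / (2 * A[i] @ A[i])

omega = 0.7
x = rng.standard_normal(n)
for k in range(5):
    i = rng.integers(m)
    x_new = x - omega * (A[i] @ x - b[i]) / (A[i] @ A[i]) * A[i]   # Kaczmarz step
    lhs = np.linalg.norm(x_new - x_star) ** 2
    rhs = np.linalg.norm(x - x_star) ** 2 - 2 * omega * (2 - omega) * f_S(x, i)
    print(abs(lhs - rhs))           # zero up to rounding, for every realization of S_k
    x = x_new
```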

References

  1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM (2017)

  2. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. In: International Conference on Machine Learning, pp. 1110–1119 (2016)

  3. Arnold, S., Manzagol, P., Babanezhad, R., Mitliagkas, I., Roux, N.: Reducing the variance in online optimization by transporting past gradients. arXiv preprint arXiv:1906.03532 (2019)

  4. Bertsekas, D.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)

  5. Blatt, D., Hero, A., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  6. Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Randomized gossip algorithms. IEEE Trans. Inf. Theory 14(SI), 2508–2530 (2006)

  7. Byrne, C.: Applied Iterative Methods. AK Peters, Wellesley (2008)

  8. Can, B., Gurbuzbalaban, M., Zhu, L.: Accelerated linear convergence of stochastic momentum methods in wasserstein distances. In: International Conference on Machine Learning, pp. 891–901 (2019)

  9. Chambolle, A., Ehrhardt, M., Richtárik, P., Schönlieb, C.: Stochastic primal–dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)

  10. Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  11. Csiba, D., Richtárik, P.: Global convergence of arbitrary-block gradient methods for generalized Polyak–Lojasiewicz functions. arXiv preprint arXiv:1709.03014 (2017)

  12. De Abreu, N.M.M.: Old and new results on algebraic connectivity of graphs. Linear Algebra Appl. 423(1), 53–73 (2007)

  13. Defazio, A.: A simple practical accelerated method for finite sums. In: Advances in Neural Information Processing Systems, pp. 676–684 (2016)

  14. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  15. Devraj, A., Bušic, A., Meyn, S.: Optimal matrix momentum stochastic approximation and applications to q-learning. arXiv preprint arXiv:1809.06277 (2018)

  16. Devraj, A., Bušić, A., Meyn, S.: Zap meets momentum: stochastic approximation algorithms with optimal convergence rate. arXiv preprint arXiv:1809.06277 (2018)

  17. Dimakis, A., Kar, S., Moura, J., Rabbat, M., Scaglione, A.: Gossip algorithms for distributed signal processing. Proc. IEEE 98(11), 1847–1864 (2010)

  18. Elaydi, S.: An Introduction to Difference Equations. Springer, Berlin (2005)

  19. Eldar, Y., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011)

  20. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  21. Fillmore, J., Marx, M.: Linear recursive sequences. SIAM Rev. 10(3), 342–353 (1968)

  22. Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electron. J. Stat. 12(1), 461–529 (2018)

  23. Geman, S.: A limit theorem for the norm of random matrices. Ann. Probab. 8, 252–261 (1980)

  24. Ghadimi, E., Feyzmahdavian, H., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: Control Conference (ECC), 2015 European, pp. 310–315. IEEE (2015)

  25. Ghadimi, E., Shames, I., Johansson, M.: Multi-step gradient methods for networked optimization. IEEE Trans. Signal Process. 61(21), 5417–5429 (2013)

  26. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

  27. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878 (2016)

  28. Gower, R., Richtárik, P.: Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl. 36(4), 1660–1690 (2015)

  29. Gower, R., Richtárik, P.: Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890 (2015)

  30. Gower, R., Richtárik, P.: Linearly convergent randomized iterative methods for computing the pseudoinverse. arXiv preprint arXiv:1612.06255 (2016)

  31. Gower, R.M., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)

  32. Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)

  33. Hanzely, F., Konečný, J., Loizou, N., Richtárik, P., Grishchenko, D.: Privacy preserving randomized gossip algorithms. arXiv preprint arXiv:1706.07636 (2017)

  34. Hanzely, F., Konečnỳ, J., Loizou, N., Richtárik, P., Grishchenko, D.: A privacy preserving randomized gossip algorithm via controlled noise insertion. In: NeurIPS Privacy Preserving Machine Learning Workshop (2018)

  35. Jalilzadeh, A., Shanbhag, U., Blanchet, J., Glynn, P.: Optimal smoothed variable sample-size accelerated proximal methods for structured nonsmooth stochastic convex programs. arXiv preprint arXiv:1803.00718 (2018)

  36. Jofré, A., Thompson, P.: On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174, 1–40 (2017)

  37. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  38. Kaczmarz, S.: Angenäherte auflösung von systemen linearer gleichungen. Bulletin International de l’Academie Polonaise des Sciences et des Lettres 35, 355–357 (1937)

  39. Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)

  40. Konečný, J., Richtárik, P.: Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3(9), 1–14 (2017)

  41. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689 (2019)

  42. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  43. Lee, Y., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156. IEEE (2013)

  44. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  45. Leventhal, D., Lewis, A.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)

  46. Liu, J., Wright, S.: An accelerated randomized Kaczmarz algorithm. Math. Comput. 85(297), 153–178 (2016)

  47. Loizou, N., Rabbat, M., Richtárik, P.: Provably accelerated randomized gossip algorithms. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7505–7509. IEEE (2019)

  48. Loizou, N., Richtárik, P.: A new perspective on randomized gossip algorithms. In: 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 440–444. IEEE (2016)

  49. Loizou, N., Richtárik, P.: Linearly convergent stochastic heavy ball method for minimizing generalization error. In: NIPS-Workshop on Optimization for Machine Learning (2017)

  50. Loizou, N., Richtárik, P.: Accelerated gossip via stochastic heavy ball method. In: 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 927–934. IEEE (2018)

  51. Loizou, N., Richtárik, P.: Convergence analysis of inexact randomized iterative methods. arXiv preprint arXiv:1903.07971 (2019)

  52. Loizou, N., Richtárik, P.: Revisiting randomized gossip algorithms: general framework, convergence rates and novel block and accelerated protocols. arXiv preprint arXiv:1905.08645 (2019)

  53. Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl. 36(4), 1590–1604 (2015)

  54. Ma, J., Yarats, D.: Quasi-hyperbolic momentum and Adam for deep learning. arXiv preprint arXiv:1810.06801 (2018)

  55. Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT Numer. Math. 50(2), 395–403 (2010)

  56. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent and the randomized Kaczmarz algorithm. Math. Program. Ser. A 155(1), 549–573 (2016)

  57. Needell, D., Tropp, J.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)

  58. Needell, D., Zhao, R., Zouzias, A.: Randomized block Kaczmarz method with projection for solving least squares. Linear Algebra Appl. 484, 322–343 (2015)

  59. Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, Hoboken (1983)

  60. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(o(1/k^2)\). Sov. Math. Dokl. 27, 372–376 (1983)

  61. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  62. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

  63. Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., Koepke, H.: Coordinate descent converges faster with the gauss-southwell rule than random selection. In: International Conference on Machine Learning, pp. 1632–1641 (2015)

  64. Nutini, J., Sepehry, B., Laradji, I., Schmidt, M., Koepke, H., Virani, A.: Convergence rates for greedy Kaczmarz algorithms, and faster randomized Kaczmarz rules using the orthogonality graph. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 547–556. AUAI Press (2016)

  65. Ochs, P., Brox, T., Pock, T.: iPiasco: inertial proximal algorithm for strongly convex optimization. J. Math. Imaging Vis. 53(2), 171–181 (2015)

  66. Ochs, P., Chen, Y., Brox, T., Pock, T.: iPiano: inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sci. 7(2), 1388–1419 (2014)

  67. Penrose, M.: Random Geometric Graphs, vol. 5. Oxford University Press, Oxford (2003)

  68. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  69. Polyak, B.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, New York (1987)

  70. Popa, C.: Least-squares solution of overdetermined inconsistent linear systems using Kaczmarz’s relaxation. Int. J. Comput. Math. 55(1–2), 79–89 (1995)

  71. Popa, C.: Convergence rates for Kaczmarz-type algorithms. Numer. Algorithms 79(1), 1–17 (2018)

  72. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling I: algorithms and complexity. Optim. Methods Softw. 31(5), 829–857 (2016)

  73. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016)

  74. Qu, Z., Richtárik, P., Takáč, M., Fercoq, O.: SDNA: stochastic dual Newton ascent for empirical risk minimization. In: International Conference on Machine Learning (2016)

  75. Qu, Z., Richtárik, P., Zhang, T.: Quartz: randomized dual coordinate ascent with arbitrary sampling. In: Advances in Neural Information Processing Systems, pp. 865–873 (2015)

  76. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)

  77. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)

  78. Richtárik, P., Takáč, M.: Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108 (2017)

  79. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  80. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  81. Schöpfer, F., Lorenz, D.: Linear convergence of the randomized sparse Kaczmarz method. Math. Program. 173, 1–28 (2018)

  82. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  83. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

  84. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning , vol. 28, pp. 1139–1147 (2013)

  85. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)

  86. Tseng, P.: An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

  87. Tu, S., Venkataraman, S., Wilson, A., Gittens, A., Jordan, M., Recht, B.: Breaking locality accelerates block Gauss–Seidel. In: International Conference on Machine Learning (2017)

  88. Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)

  89. Wright, S.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  90. Xiang, H., Zhang, L.: Randomized iterative methods with alternating projections. arXiv preprint arXiv:1708.09845 (2017)

  91. Xu, P., He, B., De Sa, C., Mitliagkas, I., Re, C.: Accelerated stochastic power iteration. In: International Conference on Artificial Intelligence and Statistics, pp. 58–67 (2018)

  92. Yang, T., Lin, Q., Li, Z.: Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257 (2016)

  93. Zavriev, S., Kostyuk, F.: Heavy-ball method in nonconvex optimization problems. Comput. Math. Model. 4(4), 336–341 (1993)

  94. Zhang, J., Mitliagkas, I., Ré, C.: Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471 (2017)

  95. Zhou, K.: Direct acceleration of SAGA using sampled negative momentum. arXiv preprint arXiv:1806.11048 (2018)

  96. Zhou, K., Shang, F., Cheng, J.: A simple stochastic variance reduced algorithm with fast convergence rates. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 5980–5989 (2018)

  97. Zouzias, A., Freris, N.: Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)

Corresponding author

Correspondence to Nicolas Loizou.

Additional information

Work done while the first author was a PhD student at School of Mathematics, The University of Edinburgh.

Appendices

Appendix 1: Technical Lemmas

Lemma 9

Fix \(F_1=F_0\ge 0\) and let \(\{F_k\}_{k\ge 0}\) be a sequence of nonnegative real numbers satisfying the relation

$$\begin{aligned} F_{k+1}\le a_1F_k +a_2 F_{k-1}, \quad \forall k\ge 1, \end{aligned}$$
(42)

where \(a_2 \ge 0 \), \( a_1 + a_2 <1\) and at least one of the coefficients \(a_1,a_2\) is positive. Then the sequence satisfies the relation \( F_{k+1}\le q^{k} (1+ \delta ) F_0\) for all \(k\ge 1,\) where \(q=\frac{a_1+\sqrt{a_1^2+4a_2}}{2}\) and \(\delta =q-a_1\ge 0\). Moreover,

$$\begin{aligned} q \ge a_1 + a_2, \end{aligned}$$
(43)

with equality if and only if \(a_2=0\) (in which case \(q=a_1\) and \(\delta =0\)).

Proof

Choose \(\delta = \frac{-a_1+\sqrt{a_1^2+4a_2}}{2}\). We claim \(\delta \ge 0\) and \(a_2 \le (a_1+\delta )\delta \). Indeed, non-negativity of \(\delta \) follows from \(a_2\ge 0\), while the second relation follows from the fact that \(\delta \) satisfies

$$\begin{aligned} (a_1+\delta )\delta - a_2 = 0. \end{aligned}$$
(44)

In view of these two relations, adding \(\delta F_k\) to both sides of (42), we get

$$\begin{aligned} F_{k+1} + \delta F_k \le (a_1+\delta )F_k + a_2 F_{k-1} \le (a_1+\delta )(F_k + \delta F_{k-1}) = q(F_k+\delta F_{k-1}). \end{aligned}$$
(45)

Let us now argue that \(0<q<1\). Non-negativity of q follows from non-negativity of \(a_2\). Clearly, as long as \(a_2>0\), q is positive. If \(a_2=0\), then \(a_1>0\) by assumption, which implies that q is positive. The inequality \(q<1\) follows directly from the assumption \(a_1+a_2<1\). By unrolling the recurrence (45), we obtain \(F_{k+1} \le F_{k+1} + \delta F_k \le q^k (F_1+ \delta F_0) = q^{k}(1+\delta ) F_{0}.\)

Finally, let us establish (43). Noting that \(a_1 = q-\delta \), and since in view of (44) we have \(a_2=q\delta \), we conclude that \(a_1+a_2 = q + \delta (q-1) \le q\), where the inequality follows from \(q <1\). \(\square \)
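
A quick numerical illustration of Lemma 9 (our own sketch; the coefficient values below are arbitrary admissible choices) runs the extremal recurrence \(F_{k+1}=a_1F_k+a_2F_{k-1}\), which in particular satisfies (42), and compares it with the bound \(q^{k}(1+\delta )F_0\).

```python
import numpy as np

# Illustration of Lemma 9: the recurrence stays below q^k (1 + delta) F_0,
# and q >= a1 + a2 (inequality (43)).
a1, a2 = 0.6, 0.3                    # a2 >= 0 and a1 + a2 < 1
q = (a1 + np.sqrt(a1**2 + 4 * a2)) / 2
delta = q - a1

F_prev, F_curr = 1.0, 1.0            # F_0 = F_1 = 1
for k in range(1, 25):
    F_next = a1 * F_curr + a2 * F_prev
    assert F_next <= q**k * (1 + delta) * 1.0   # the bound of Lemma 9
    F_prev, F_curr = F_curr, F_next
print("q =", q, "  a1 + a2 =", a1 + a2)         # q >= a1 + a2
```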

The following identities were established in [78]. For completeness, we include different (and somewhat simpler) proofs here.

Lemma 10

([78]) For all \(x \in \mathbb {R}^n\) we have

$$\begin{aligned} f_{\mathbf {S}}(x) = \frac{1}{2}\Vert \nabla f_{\mathbf {S}}(x)\Vert ^2_{\mathbf {B}}. \end{aligned}$$
(46)

Moreover, if \(x_*\in \mathcal{L}\) (i.e., if \(x_*\) satisfies \(\mathbf{A}x_* =b\)), then for all \(x\in \mathbb {R}^n\) we have

$$\begin{aligned} f_{\mathbf {S}}(x) = \frac{1}{2}\langle \nabla f_{\mathbf {S}}(x),x-x_* \rangle _{\mathbf {B}}, \end{aligned}$$
(47)

and

$$\begin{aligned} f(x) = \frac{1}{2}\langle \nabla f(x),x-x_* \rangle _{\mathbf {B}}. \end{aligned}$$
(48)

Proof

In view of (10), and since \(\mathbf{Z}\mathbf{B}^{-1} \mathbf{Z}= \mathbf{Z}\) (see [78]), we have

$$\begin{aligned} \Vert \nabla f_{\mathbf {S}}(x)\Vert ^2_{\mathbf {B}}&\overset{(10)}{=} \Vert \mathbf {B}^{-1} \mathbf{Z}(x-x_*)\Vert ^2_{\mathbf {B}} \\&= (x-x_*)^\top \mathbf{Z}\mathbf {B}^{-1} \mathbf {Z}(x-x_*) = (x-x_*)^\top \mathbf{Z}(x-x_*) \\&\overset{(7)}{=} (x-x_*)^\top \mathbf{A}^\top \mathbf{H}\mathbf{A}(x-x_*) = (\mathbf{A}x-b)^\top \mathbf{H}(\mathbf{A}x- b ) \quad \overset{(6)}{=}\quad 2f_{\mathbf {S}}(x). \end{aligned}$$

Moreover,

$$\begin{aligned} \langle \nabla f_{\mathbf {S}}(x),x-x_* \rangle _{\mathbf {B}}&\overset{(10)}{=} \langle \mathbf {B}^{-1} \mathbf {Z}(x-x_*),x-x_* \rangle _{\mathbf {B}}\\&= (x-x_*)^\top \mathbf {Z}\mathbf {B}^{-1} \mathbf {B}(x-x_*) \quad = \quad 2f_{\mathbf {S}}(x). \end{aligned}$$

By taking expectations in the last identity with respect to the random matrix \(\mathbf {S}\), we get \( \langle \nabla f(x),x-x_* \rangle _{\mathbf {B}}=2f(x). \) \(\square \)

Lemma 11

([78]) For all \(x \in \mathbb {R}^n\) and \(x_* \in \mathcal{L}\)

$$\begin{aligned} \lambda _{\min }^+ f(x) \le \frac{1}{2} \Vert \nabla f(x) \Vert ^2_{\mathbf {B}} \le \lambda _{\max } f(x) \end{aligned}$$
(49)

and

$$\begin{aligned} f(x) \le \frac{\lambda _{\max }}{2} \Vert x-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}$$
(50)

Moreover, if exactness is satisfied, and we let \(x_* =\varPi ^{\mathbf {B}}_{\mathcal{L}}(x)\), we have

$$\begin{aligned} \frac{\lambda _{\min }^+}{2} \Vert x-x_*\Vert ^2_{\mathbf {B}} \le f(x) . \end{aligned}$$
(51)

Finally, let us present a simple lemma stating an identity that we use in our main proofs. This preliminary result is well known for the case of the Euclidean norm (\(\mathbf {B}=\mathbf {I}\)); we provide a proof for the more general \(\mathbf {B}\)-norm for completeness.

Lemma 12

Let \(a,b,c\) be arbitrary vectors in \(\mathbb {R}^n\) and let \(\mathbf {B}\) be a positive definite matrix. Then the following identity holds: \(2 \langle a-c,c-b \rangle _{\mathbf {B}}=\Vert a-b\Vert ^2_{\mathbf {B}}-\Vert c-b\Vert ^2_{\mathbf {B}}-\Vert a-c\Vert ^2_{\mathbf {B}}.\)

Proof

$$\begin{aligned} LHS= & {} 2 \langle a-c,c-b \rangle _{\mathbf {B}} = 2(a-c)^\top \mathbf {B}(c-b)\\= & {} 2a^\top \mathbf {B}c-2a^\top \mathbf {B}b-2c^\top \mathbf {B}c+2c^\top \mathbf {B}b \end{aligned}$$

and

$$\begin{aligned} RHS= & {} \Vert a-b\Vert ^2_{\mathbf {B}}-\Vert c-b\Vert ^2_{\mathbf {B}}-\Vert a-c\Vert ^2_{\mathbf {B}}\\= & {} (a-b)^\top \mathbf {B}(a-b)- (c-b)^\top \mathbf {B}(c-b)-(a-c)^\top \mathbf {B}(a-c)\\= & {} a^\top \mathbf {B}a-a^\top \mathbf {B}b-b^\top \mathbf {B}a+b^\top \mathbf {B}b-c^\top \mathbf {B}c+c^\top \mathbf {B}b+b^\top \mathbf {B}c-b^\top \mathbf {B}b\\&- a^\top \mathbf {B}a+a^\top \mathbf {B}c+c^\top \mathbf {B}a-c^\top \mathbf {B}c \\= & {} 2a^\top \mathbf {B}c-2a^\top \mathbf {B}b-2c^\top \mathbf {B}c+2c^\top \mathbf {B}b \end{aligned}$$

Hence the left-hand side equals the right-hand side, which completes the proof. \(\square \)
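
The identity of Lemma 12 is straightforward to verify numerically as well; the following sketch (ours, with arbitrary data) checks it for a random positive definite matrix \(\mathbf {B}\).

```python
import numpy as np

# Check: 2 <a-c, c-b>_B = ||a-b||_B^2 - ||c-b||_B^2 - ||a-c||_B^2.
rng = np.random.default_rng(1)
n = 5
P = rng.standard_normal((n, n))
B = P @ P.T + n * np.eye(n)          # positive definite B
a, b, c = rng.standard_normal((3, n))

lhs = 2 * (a - c) @ B @ (c - b)
rhs = (a - b) @ B @ (a - b) - (c - b) @ B @ (c - b) - (a - c) @ B @ (a - c)
print(abs(lhs - rhs))                # zero up to rounding
```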

Appendix 2: Proof of Theorem 1

First, we decompose

$$\begin{aligned} \Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}} = \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)+\beta (x_k-x_{k-1})-x_*\Vert ^2_{\mathbf {B}} = \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2_{\mathbf {B}} + 2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}} + \beta ^2\Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}}. \end{aligned}$$
(52)

We will now analyze the three expressions separately. The first expression can be written as

$$\begin{aligned} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2_{\mathbf {B}} = \Vert x_k-x_*\Vert ^2_{\mathbf {B}} - 2\omega \langle \nabla f_{\mathbf {S}_k}(x_k), x_k-x_* \rangle _{\mathbf {B}} + \omega ^2\Vert \nabla f_{\mathbf {S}_k}(x_k)\Vert ^2_{\mathbf {B}} \overset{(46)+(47)}{=} \Vert x_k-x_*\Vert ^2_{\mathbf {B}} - 2\omega (2-\omega )f_{\mathbf {S}_k}(x_k). \end{aligned}$$
(53)

We will now bound the second expression. First, we have

$$\begin{aligned} 2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}} = 2\beta \Vert x_k-x_*\Vert ^2_{\mathbf {B}} + 2\beta \langle x_k-x_*, x_*-x_{k-1} \rangle _{\mathbf {B}} + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle _{\mathbf {B}}. \end{aligned}$$
(54)

Using the identity from Lemma 12 for the vectors \(x_k, x_*\) and \(x_{k-1}\) we obtain:

$$\begin{aligned} 2 \langle x_k-x_*, x_*-x_{k-1} \rangle _{\mathbf {B}}= \Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}}- \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}-\Vert x_k-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}$$

Substituting this into (54) gives

$$\begin{aligned} 2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}} = \beta \Vert x_k-x_*\Vert ^2_{\mathbf {B}} + \beta \Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}} - \beta \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle _{\mathbf {B}}. \end{aligned}$$
(55)

The third expression can be bounded as

$$\begin{aligned} \beta ^2\Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}} = \beta ^2\Vert (x_k-x_*)+(x_*-x_{k-1})\Vert ^2_{\mathbf {B}} \le 2\beta ^2\Vert x_k-x_*\Vert ^2_{\mathbf {B}} + 2\beta ^2\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}$$
(56)

By substituting the bounds (53), (55), (56) into (52) we obtain

$$\begin{aligned}&\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}\\&\quad \le \Vert x_k-x_*\Vert ^2_{\mathbf {B}}-2\omega (2-\omega )f_{\mathbf {S}_k}(x_k)\\&\qquad + \beta \Vert x_k-x_*\Vert ^2_{\mathbf {B}}+\beta \Vert x_{k}-x_{k-1}\Vert ^2_{\mathbf {B}}-\beta \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&\qquad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}} + 2\beta ^2\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}+2\beta ^2\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}\\&\quad \le (1+3\beta + 2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta + 2\beta ^2 )\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}-2\omega (2-\omega )f_{\mathbf {S}_k}(x_k)\\&\qquad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}}. \end{aligned}$$

Now by first taking expectation with respect to \(\mathbf{S}_k\), we obtain:

$$\begin{aligned} \mathbb {E}_{\mathbf{S}_k}[\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}]\le & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&-2\omega (2-\omega )f(x_k) + 2\omega \beta \langle \nabla f(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}}\\\le & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&-2\omega (2-\omega )f(x_k) + 2\omega \beta (f(x_{k-1})-f(x_k))\\= & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&- (2\omega (2-\omega ) +2\omega \beta )f(x_k) + 2\omega \beta f(x_{k-1}). \end{aligned}$$

where in the second step we used the convexity inequality \(\langle \nabla f(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}} \le f(x_{k-1})-f(x_k)\) and the fact that \(\omega \beta \ge 0\), which follows from the assumptions. We now apply inequalities (50) and (51), obtaining

$$\begin{aligned} \mathbb {E}_{\mathbf{S}_k}[\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}]\le & {} \underbrace{(1+3\beta +2\beta ^2 - (\omega (2-\omega ) +\omega \beta )\lambda _{\min }^+)}_{a_1}\Vert x_k-x_*\Vert ^2_{\mathbf {B}} \\&\quad + \underbrace{(\beta +2\beta ^2 + \omega \beta \lambda _{\max })}_{a_2}\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}$$

By taking expectation again, and letting \(F_k{:}{=}\mathbb {E}[\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}]\), we get the relation

$$\begin{aligned} F_{k+1} \le a_1 F_k + a_2 F_{k-1} . \end{aligned}$$
(57)

It suffices to apply Lemma 9 to the relation (57). The conditions of the lemma are satisfied. Indeed, \(a_2\ge 0\), and if \(a_2=0\), then \(\beta =0\) and hence \(a_1=1-\omega (2-\omega )\lambda _{\min }^+>0\). The condition \(a_1+a_2<1\) holds by assumption.

The convergence result in function values, \(\mathbb {E}[f(x_k)]\), follows as a corollary by applying inequality (50) to (23).
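
For concreteness, here is a minimal Python sketch (ours; the problem data and the parameter values are illustrative and not taken from the paper's experiments) of the method analyzed above, i.e. update (22), specialized to the randomized Kaczmarz setting (\(\mathbf {B}=\mathbf {I}\), single-row sketches). The printed squared errors decay linearly, in line with Theorem 1.

```python
import numpy as np

# mSGD (SGD with heavy ball momentum) for a consistent linear system,
# in the randomized Kaczmarz special case.
rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)       # consistent system
x_star = np.linalg.pinv(A) @ b       # projection of x_0 = 0 onto the solution set

omega, beta = 1.0, 0.4               # stepsize and momentum parameter
x_prev = x = np.zeros(n)
for k in range(3001):
    i = rng.integers(m)              # sketch S_k = e_i, chosen uniformly
    grad = (A[i] @ x - b[i]) / (A[i] @ A[i]) * A[i]   # gradient of f_{S_k} at x_k
    x, x_prev = x - omega * grad + beta * (x - x_prev), x
    if k % 500 == 0:
        print(k, np.linalg.norm(x - x_star) ** 2)
```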

Appendix 3: Proof of Theorem 3

Let \(p_t=\frac{\beta }{1-\beta }(x_t-x_{t-1})\) and \(d_t = \Vert x_t + p_t -x_*\Vert _{\mathbf{B}}^2\). In view of (22), we can write

$$\begin{aligned} x_{t+1}+p_{t+1}&= x_{t+1}+\frac{\beta }{1-\beta }(x_{t+1}-x_{t}) \nonumber \\&\overset{(22)}{=} x_{t}-\omega \nabla f_{\mathbf {S}_t}(x_t)+\beta (x_t-x_{t-1})\nonumber \\&\qquad +\frac{\beta }{1-\beta }\left( -\omega \nabla f_{\mathbf {S}_t}(x_t)+\beta (x_t-x_{t-1})\right) \nonumber \\&= x_{t}-[\omega +\frac{\beta }{1-\beta }\omega ] \nabla f_{\mathbf {S}_t}(x_t)+[\beta +\frac{\beta ^2}{1-\beta }](x_t-x_{t-1})\nonumber \\&= x_{t}-\frac{\omega }{1-\beta }\nabla f_{\mathbf {S}_t}(x_t)+\frac{\beta }{1-\beta }(x_t-x_{t-1})\nonumber \\&= x_t+p_t-\frac{\omega }{1-\beta } \nabla f_{\mathbf {S}_t}(x_t) \end{aligned}$$
(58)

and therefore

$$\begin{aligned} d_{t+1}&\overset{(58)}{=} \left\| x_t+p_t-\frac{\omega }{1-\beta } \nabla f_{\mathbf {S}_t}(x_t) -x_* \right\| ^2_{\mathbf {B}} \\&= d_t -2 \frac{\omega }{1-\beta } \langle x_t+p_t-x_*, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}} + \frac{\omega ^2}{(1-\beta )^2} \Vert \nabla f_{\mathbf {S}_t}(x_t)\Vert ^2_{\mathbf {B}}\\&= d_t -\frac{2\omega }{1-\beta } \langle x_t-x_*, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}} - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}}\\&\quad + \frac{\omega ^2}{(1-\beta )^2} \Vert \nabla f_{\mathbf {S}_t}(x_t)\Vert ^2_{\mathbf {B}}. \end{aligned}$$

Taking expectation with respect to the random matrix \(\mathbf {S}_t\) we obtain:

$$\begin{aligned} \mathbb {E}_{\mathbf{S}_t}[d_{t+1}]&= d_t -\frac{2\omega }{1-\beta } \langle x_t-x_*, \nabla f(x_t) \rangle _{\mathbf {B}} - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f(x_t) \rangle _{\mathbf {B}} \\&\quad + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t) \\&\overset{(48)}{=} d_t -\frac{4\omega }{1-\beta } f(x_t) - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f(x_t) \rangle _{\mathbf {B}} + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t)\\&\le d_t -\frac{4\omega }{1-\beta } f(x_t) - \frac{2 \omega \beta }{(1-\beta )^2} [f(x_t)-f(x_{t-1})] + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t)\\&= d_t + \left[ -\frac{4\omega }{1-\beta } - \frac{2 \omega \beta }{(1-\beta )^2} +\frac{2 \omega ^2}{(1-\beta )^2}\right] f(x_t) + \frac{2 \omega \beta }{(1-\beta )^2} f(x_{t-1}), \end{aligned}$$

where the inequality follows from convexity of f. After rearranging the terms we get

$$\begin{aligned} \mathbb {E}_{\mathbf{S}_t}[d_{t+1}] + \frac{2 \omega \beta }{(1-\beta )^2} f(x_t) + \alpha f(x_t) \le d_t + \frac{2 \omega \beta }{(1-\beta )^2} f(x_{t-1}), \end{aligned}$$

where \(\alpha = \frac{4\omega }{1-\beta } -\frac{2 \omega ^2}{(1-\beta )^2} > 0\). Taking expectations again and using the tower property, we get

$$\begin{aligned} \theta _{t+1} + \alpha \mathbb {E}[f(x_t)] \le \theta _t, \qquad t=1,2,\dots , \end{aligned}$$
(59)

where \(\theta _t = \mathbb {E}[d_t] + \frac{2 \omega \beta }{(1-\beta )^2}\mathbb {E}[ f(x_{t-1})]\). By summing up (59) for \(t=1,\dots , k\) we get

$$\begin{aligned} \sum _{t=1}^k \mathbb {E}[f(x_t)] \le \frac{\theta _1-\theta _{k+1}}{\alpha } \le \frac{\theta _1}{\alpha }. \end{aligned}$$
(60)

Finally, using Jensen’s inequality, we get

$$\begin{aligned} \mathbb {E}[f(\hat{x}_k)] = \mathbb {E}\left[ f\left( \frac{1}{k}\sum _{t=1}^k x_t\right) \right] \le \mathbb {E}\left[ \frac{1}{k}\sum _{t=1}^k f(x_t)\right] = \frac{1}{k}\sum _{t=1}^k \mathbb {E}[f(x_t)] \overset{(60)}{\le } \frac{\theta _1}{\alpha k}. \end{aligned}$$

It remains to note that \(\theta _1 = \Vert x_0-x_*\Vert _{\mathbf{B}}^2 + \frac{2\omega \beta }{(1-\beta )^2 }f(x_0).\)
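
The quantity controlled by Theorem 3 is easy to monitor in the same Kaczmarz setting as above. The sketch below (ours; data and parameters are again illustrative) tracks \(f(\hat{x}_k)\) at the Cesàro average \(\hat{x}_k=\frac{1}{k}\sum _{t=1}^k x_t\), which decays at a sublinear \(\mathcal{O}(1/k)\) rate.

```python
import numpy as np

# Track the objective f at the Cesaro average of the mSGD iterates.
rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
row_norms2 = np.einsum('ij,ij->i', A, A)

def f(x):
    """f(x) = E[f_S(x)] for uniformly chosen single-row sketches."""
    return np.mean((A @ x - b) ** 2 / (2 * row_norms2))

omega, beta = 0.5, 0.3
x_prev = x = np.zeros(n)
running_sum = np.zeros(n)
for k in range(1, 2001):
    i = rng.integers(m)
    grad = (A[i] @ x - b[i]) / row_norms2[i] * A[i]
    x, x_prev = x - omega * grad + beta * (x - x_prev), x
    running_sum += x
    if k % 500 == 0:
        print(k, f(running_sum / k))   # f at the Cesaro average of x_1, ..., x_k
```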

Appendix 4: Proof of Theorem 4

In the proof of Theorem 4 the following two lemmas are used.

Lemma 13

([78]) Assume exactness. Let \(x\in \mathbb {R}^n\) and \(x_* = \varPi _\mathcal {L}^\mathbf {B}(x)\). If \(\lambda _i=0\), then \(u_i^\top \mathbf {B}^{1/2} (x-x_*)=0\).

Lemma 14

([18, 21]) Consider the second degree linear homogeneous recurrence relation:

$$\begin{aligned} r_{k+1}= a_1r_k+a_2 r_{k-1} \end{aligned}$$
(61)

with initial conditions \(r_0,r_1 \in \mathbb {R}\). Assume that the constant coefficients \(a_1\) and \(a_2\) satisfy the inequality \(a_1^2 +4a_2<0\) (the roots of the characteristic equation \(t^2-a_1t-a_2=0\) are imaginary). Then there are complex constants \(C_0\) and \( C_1\) (depending on the initial conditions \(r_0\) and \(r_1\)) such that:

$$\begin{aligned} r_k=2 M^k (C_0 \cos ( \theta k) + C_1 \sin (\theta k)) \end{aligned}$$

where \(M= \bigg (\sqrt{\frac{a_1^2}{4}+\frac{(-a_1^2-4a_2)}{4}} \bigg )=\sqrt{-a_2}\) and \(\theta \) is such that \(a_1=2 M \cos (\theta )\) and \(\sqrt{-a_1^2-4a_2}=2 M \sin (\theta )\).
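
As a quick sanity check of Lemma 14, the following sketch (ours; the coefficients are arbitrary values satisfying \(a_1^2+4a_2<0\)) compares the recurrence (61) with the stated closed form.

```python
import numpy as np

# Closed form of a second degree linear homogeneous recurrence with imaginary roots.
a1, a2 = 1.2, -0.5                   # a1^2 + 4 a2 = -0.56 < 0
M = np.sqrt(-a2)
theta = np.arccos(a1 / (2 * M))      # so that a1 = 2 M cos(theta)

r_prev, r_curr = 1.0, 0.7            # initial conditions r_0, r_1
# Constants from r_0 = 2 C0 and r_1 = 2 M (C0 cos(theta) + C1 sin(theta)).
C0 = r_prev / 2
C1 = (r_curr / (2 * M) - C0 * np.cos(theta)) / np.sin(theta)

for k in range(2, 15):
    r_next = a1 * r_curr + a2 * r_prev
    closed_form = 2 * M**k * (C0 * np.cos(theta * k) + C1 * np.sin(theta * k))
    print(k, abs(r_next - closed_form))   # zero up to rounding
    r_prev, r_curr = r_curr, r_next
```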

We can now turn to the proof of Theorem 4. Plugging in the expression for the stochastic gradient, mSGD can be written in the form

$$\begin{aligned} x_{k+1}= & {} x_k -\omega \nabla f_{\mathbf {S}_k}(x_k) + \beta (x_k - x_{k-1}) \nonumber \\&\overset{(10)}{=}&x_k- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k(x_k-x_*) + \beta (x_k - x_{k-1}). \end{aligned}$$
(62)

Subtracting \(x_*\) from both sides of (62), we get

$$\begin{aligned} x_{k+1}-x_*= & {} (\mathbf {I}- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k)(x_k-x_*) + \beta (x_k -x_* +x_* - x_{k-1})\\= & {} \left( (1+\beta )\mathbf {I}- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k\right) (x_k-x_*) - \beta (x_{k-1}-x_*). \end{aligned}$$

Multiplying the last identity from the left by \(\mathbf {B}^{1/2}\), we get

$$\begin{aligned} \mathbf {B}^{1/2} (x_{k+1}-x_*)= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbf {Z}_k \mathbf {B}^{-1/2}\right) \mathbf {B}^{1/2}(x_{k} -x_*) \\&- \beta \mathbf {B}^{1/2}(x_{k-1}-x_*). \end{aligned}$$

Taking expectations, conditioned on \(x_k\) (that is, the expectation is with respect to \(\mathbf {S}_k\)):

$$\begin{aligned} \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_* \;|\; x_k]= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2}\right) \mathbf {B}^{1/2}(x_{k} -x_*) \\&- \beta \mathbf {B}^{1/2}(x_{k-1}-x_*) . \end{aligned}$$

Taking expectations again, and using the tower property, we get

$$\begin{aligned} \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_*]= & {} \mathbf {B}^{1/2}\mathbb {E}\left[ \mathbb {E}[x_{k+1} -x_* \;|\; x_k]\right] \\= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2} \right) \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \\&- \beta \mathbf {B}^{1/2} \mathbb {E}[x_{k-1}-x_*]. \end{aligned}$$

Plugging the eigenvalue decomposition \({\mathbf {U}}\varvec{\varLambda } {{\mathbf {U}}}^\top \) of the matrix \(\mathbf {W}=\mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2}\) into the above, and multiplying both sides from the left by \({{\mathbf {U}}}^\top \), we obtain

$$\begin{aligned} {{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_*]&= {{\mathbf {U}}}^\top \left( (1+\beta )\mathbf {I}- \omega {\mathbf {U}}\varvec{\varLambda } {{\mathbf {U}}}^\top \right) \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*]\nonumber \\&\quad \, - \beta {{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k-1}-x_*]. \end{aligned}$$
(63)

Let us define \(s_k{:}{=}{{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \in \mathbb {R}^n\). Then relation (63) takes the form of the recursion

$$\begin{aligned} s_{k+1}= [(1+\beta )\mathbf{I}- \omega \varvec{\varLambda } ] s_k - \beta s_{k-1}, \end{aligned}$$

which can be written in a coordinate-by-coordinate form as follows:

$$\begin{aligned} s_{k+1}^i= [(1+\beta ) - \omega \lambda _i ] s_k^i - \beta s_{k-1}^i \quad \text {for all} \quad i= 1,2,3,\ldots ,n, \end{aligned}$$
(64)

where \(s_k^i\) indicates the ith coordinate of \(s_k\).

We will now fix i and analyze recursion (64) using Lemma 14. Note that (64) is a second degree linear homogeneous recurrence relation of the form (61) with \(a_1=1+\beta - \omega \lambda _i \) and \(a_2=- \beta \). Recall that \(0\le \lambda _i \le 1\) for all i. Since we assume that \(0< \omega \le 1/\lambda _{\max }\), we know that \(0\le \omega \lambda _i \le 1\) for all i. We now consider two cases:

  1. \( \lambda _i =0\).

    In this case, (64) takes the form:

    $$\begin{aligned} s_{k+1}^i=(1+\beta )s_k^i-\beta s_{k-1}^i. \end{aligned}$$
    (65)

    Applying Proposition 2, we know that \(x_*=\varPi _\mathcal {L}^\mathbf {B}(x_0)=\varPi _\mathcal {L}^\mathbf {B}(x_1)\). Using Lemma 13 twice, once for \(x=x_0\) and then for \(x=x_1\), we observe that \(s_0^i=u_i^\top \mathbf {B}^{1/2} (x_0-x_*)=0\) and \(s_1^i=u_i^\top \mathbf {B}^{1/2} (x_1-x_*)=0\). Finally, in view of (65) we conclude that

    $$\begin{aligned} s_k^i=0 \quad \text {for all} \quad k\ge 0 . \end{aligned}$$
    (66)
  2. \( \lambda _i >0\).

    Since \(0<\omega \lambda _i \le 1\) and \(\beta \ge 0\), we have \(1+\beta - \omega \lambda _i \ge 0\) and hence

    $$\begin{aligned} a_1^2+4a_2=(1+\beta -\omega \lambda _i)^2-4\beta \le (1+\beta -\omega \lambda _{\min }^+)^2-4\beta < 0, \end{aligned}$$

    where the last inequality can be shown to hold (see note 17) for \(\left( 1-\sqrt{\omega \lambda _{\min }^+}\right) ^2< \beta < 1 \). Applying Lemma 14, the following bound can be deduced:

    $$\begin{aligned} s_k^i= & {} 2(-a_2)^{k/2} (C_0 \cos (\theta k) +C_1 \sin (\theta k)) \; \le \; 2 \beta ^{k/2} P_i, \end{aligned}$$
    (67)

    where \(P_i\) is a constant depending on the initial conditions (we can simply choose \(P_i = |C_0| + |C_1|\)).

Now putting the two cases together, for all \(k\ge 0\) we have

$$\begin{aligned} \Vert \mathbb {E}[x_{k} -x_*]\Vert _{\mathbf {B}}^2&= \mathbb {E}[x_{k} -x_*]^\top \mathbf {B}\mathbb {E}[x_{k} -x_*] \; = \; \mathbb {E}[x_{k} -x_*]^\top \mathbf {B}^{1/2} \mathbf {U}{\mathbf {U}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \\&= \Vert {\mathbf {U}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \Vert _2^2 = \Vert s_k\Vert ^2 = \sum _{i=1}^{n} (s_k^i)^2 \\&= \sum _{i: \lambda _i=0} (s_k^i)^2 + \sum _{i: \lambda _i>0} (s_k^i)^2 \; \overset{(66)}{=}\; \sum _{i: \lambda _i>0} (s_k^i)^2\\&\overset{(67)}{\le } \sum _{i: \lambda _i >0} 4 \beta ^k P_i^2 \\&= \beta ^k C, \end{aligned}$$

where \(C=4\sum _{i: \lambda _i >0} P_i^2\).

Appendix 5: Proof of Theorem 7

The proof follows a similar pattern to that of Theorem 1. However, stochasticity in the momentum term introduces an additional layer of complexity, which we shall tackle by utilizing a more involved version of the tower property.

For simplicity, let \(i=i_k\) and \(r_{k}^i {:}{=}e_i^\top (x_k-x_{k-1})e_i\). First, we decompose

$$\begin{aligned} \Vert x_{k+1}-x_*\Vert ^2= & {} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)+\gamma r_k^i -x_*\Vert ^2 \nonumber \\= & {} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2+2\langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)\nonumber \\&-x_*, \gamma r_k^i \rangle + \gamma ^2\Vert r_k^i\Vert ^2. \end{aligned}$$
(68)

We shall use the tower property in the form

$$\begin{aligned} \mathbb {E}[\mathbb {E}[\mathbb {E}[ X \;|\; x_k, \mathbf{S}_k] \;|\; x_k]] = \mathbb {E}[X], \end{aligned}$$
(69)

where X is some random variable. We shall perform the three expectations in order, from the innermost to the outermost. Applying the inner expectation to the identity (68), we get

$$\begin{aligned} \mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ] = \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2 + 2\gamma \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] \rangle + \gamma ^2\, \mathbb {E}[\Vert r_k^i\Vert ^2 \;|\; x_k, \mathbf{S}_k]. \end{aligned}$$
(70)

We will now analyze the three expressions separately. The first expression is constant under the expectation, and hence we can write

$$\begin{aligned} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2 = \Vert x_k-x_*\Vert ^2 - 2\omega \langle \nabla f_{\mathbf {S}_k}(x_k), x_k-x_* \rangle + \omega ^2\Vert \nabla f_{\mathbf {S}_k}(x_k)\Vert ^2 = \Vert x_k-x_*\Vert ^2 - 2\omega (2-\omega )f_{\mathbf {S}_k}(x_k). \end{aligned}$$
(71)

We will now bound the second expression. Using the identity

$$\begin{aligned} \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] = \mathbb {E}_i [r_k^i] = \sum _{i=1}^n \frac{1}{n}r_k^i = \frac{1}{n}(x_k-x_{k-1}), \end{aligned}$$
(72)

we can write

$$\begin{aligned} 2\gamma \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] \rangle \overset{(72)}{=} \tfrac{2\gamma }{n} \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle = \tfrac{2\gamma }{n}\Vert x_k-x_*\Vert ^2 + \tfrac{2\gamma }{n}\langle x_k-x_*, x_*-x_{k-1} \rangle + \tfrac{2\omega \gamma }{n}\langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle . \end{aligned}$$
(73)

Using the fact that for arbitrary vectors \(a,b,c \in \mathbb {R}^n\) we have the identity \(2 \langle a-c,c-b \rangle =\Vert a-b\Vert ^2-\Vert c-b\Vert ^2-\Vert a-c\Vert ^2,\) we obtain

$$\begin{aligned} 2 \langle x_k-x_*, x_*-x_{k-1} \rangle = \Vert x_k-x_{k-1}\Vert ^2- \Vert x_{k-1}-x_*\Vert ^2-\Vert x_k-x_*\Vert ^2. \end{aligned}$$

Substituting this into (73) gives

$$\begin{aligned} 2\gamma \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] \rangle = \tfrac{\gamma }{n}\Vert x_k-x_*\Vert ^2 + \tfrac{\gamma }{n}\Vert x_{k}-x_{k-1}\Vert ^2 - \tfrac{\gamma }{n}\Vert x_{k-1}-x_*\Vert ^2 + \tfrac{2\omega \gamma }{n}\langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle . \end{aligned}$$
(74)

The third expression can be bounded as

$$\begin{aligned} \gamma ^2\, \mathbb {E}[\Vert r_k^i\Vert ^2 \;|\; x_k, \mathbf{S}_k] = \tfrac{\gamma ^2}{n}\Vert x_{k}-x_{k-1}\Vert ^2 \le \tfrac{2\gamma ^2}{n}\Vert x_k-x_*\Vert ^2 + \tfrac{2\gamma ^2}{n}\Vert x_{k-1}-x_*\Vert ^2. \end{aligned}$$
(75)

By substituting the bounds (71), (74), (75) into (70) we obtain

$$\begin{aligned} \mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ]&\le \Vert x_k-x_*\Vert ^2-2\omega (2-\omega ) f_{\mathbf {S}_k}(x_k)\nonumber \\&\quad + \tfrac{\gamma }{n} \Vert x_k-x_*\Vert ^2+ \tfrac{\gamma }{n} \Vert x_{k}-x_{k-1}\Vert ^2 -\tfrac{\gamma }{n} \Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\quad + 2\omega \tfrac{\gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}- x_k \rangle + 2\tfrac{\gamma ^2}{n} \Vert x_{k}-x_*\Vert ^2 \nonumber \\&\quad + 2\tfrac{\gamma ^2}{n}\Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\overset{(56)}{\le } \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\quad - 2\omega (2-\omega )f_{\mathbf {S}_k}(x_k) + 2\omega \tfrac{\gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle . \end{aligned}$$
(76)

We now take the middle expectation (see (69)) and apply it to inequality (76):

$$\begin{aligned}&\mathbb {E}[\mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ] \;|\; x_k] \\&\quad \le \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad -2\omega (2-\omega )f(x_k) + 2\omega \tfrac{\gamma }{n} \langle \nabla f(x_k),x_{k-1}- x_k \rangle \\&\quad \le \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad -2\omega (2-\omega )f(x_k) + 2\omega \tfrac{\gamma }{n}(f(x_{k-1})-f(x_k))\\&\quad = \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad - \left( 2\omega (2-\omega ) +2\omega \tfrac{\gamma }{n}\right) f(x_k) + 2\omega \tfrac{\gamma }{n} f(x_{k-1}). \end{aligned}$$

where in the second step we used the inequality \(\langle \nabla f(x_k),x_{k-1}- x_k \rangle \le f(x_{k-1})-f(x_k)\) and the fact that \(\omega \gamma \ge 0\), which follows from the assumptions. We now apply inequalities (50) and (51), obtaining

$$\begin{aligned}&\mathbb {E}[\mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ] \;|\; x_k]\\&\quad \le \underbrace{\left( 1+3\tfrac{\gamma }{n}+2\tfrac{\gamma ^2}{n} - \left( \omega (2-\omega ) +\omega \tfrac{\gamma }{n}\right) \lambda _{\min }^+ \right) }_{a_1}\Vert x_k-x_*\Vert ^2 \\&\qquad + \underbrace{\tfrac{1}{n}\left( \gamma +2\gamma ^2 + \omega \gamma \lambda _{\max }\right) }_{a_2}\Vert x_{k-1}-x_*\Vert ^2. \end{aligned}$$

By taking expectation again (outermost expectation in the tower rule (69)), and letting \(F_k{:}{=}\mathbb {E}[\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}]\), we get the relation

$$\begin{aligned} F_{k+1} \le a_1 F_k + a_2 F_{k-1} . \end{aligned}$$
(77)

It suffices to apply Lemma 9 to the relation (77). The conditions of the lemma are satisfied. Indeed, \(a_2\ge 0\), and if \(a_2=0\), then \(\gamma =0\) and hence \(a_1=1-\omega (2-\omega )\lambda _{\min }^+>0\). The condition \(a_1+a_2<1\) holds by assumption.

The convergence result in function values follows as a corollary by applying inequality (50) to (32).
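
To make the method analyzed in this appendix concrete, here is a minimal Python sketch (ours; data and parameters are illustrative) of SGD with stochastic momentum in the randomized Kaczmarz setting: at every iteration only a single, uniformly chosen coordinate of the momentum term \(x_k-x_{k-1}\) is applied.

```python
import numpy as np

# SGD with stochastic momentum: x_{k+1} = x_k - omega * grad + gamma * e_j e_j^T (x_k - x_{k-1}).
rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x_star = np.linalg.pinv(A) @ b

omega, gamma = 1.0, 0.4              # stepsize and stochastic momentum parameter
x_prev = x = np.zeros(n)
for k in range(3001):
    i = rng.integers(m)              # row sketch S_k = e_i
    grad = (A[i] @ x - b[i]) / (A[i] @ A[i]) * A[i]
    j = rng.integers(n)              # coordinate i_k used for the momentum term
    mom = np.zeros(n)
    mom[j] = x[j] - x_prev[j]
    x, x_prev = x - omega * grad + gamma * mom, x
    if k % 500 == 0:
        print(k, np.linalg.norm(x - x_star) ** 2)
```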

Appendix 6: Notation glossary

For the frequently used notation, see Table 8.

Table 8 Frequently used notation

Cite this article

Loizou, N., Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. Comput Optim Appl 77, 653–710 (2020). https://doi.org/10.1007/s10589-020-00220-z
