Optimizing the Efficiency of First-Order Methods for Decreasing the Gradient of Smooth Convex Functions

Journal of Optimization Theory and Applications

Abstract

This paper optimizes the step coefficients of first-order methods for smooth convex minimization in terms of the worst-case convergence bound (i.e., efficiency) of the decrease in the gradient norm. This work is based on the performance estimation problem approach. The worst-case gradient bound of the resulting method is optimal up to a constant for large-dimensional smooth convex minimization problems, under an initial bound on the cost function value. This paper then illustrates that the proposed method has a computationally efficient form that is similar to that of the optimized gradient method.
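
The class of methods whose step coefficients are optimized can be made concrete with a short sketch. The following Python snippet (illustrative only, not taken from the paper; the indexing convention, the quadratic test function, and the coefficient choice are assumptions made for this example) implements the generic fixed-step first-order method template in which each iterate moves along a weighted combination of all previously computed gradients; the weights play the role of the step coefficients discussed in this paper.

```python
import numpy as np

def fixed_step_first_order_method(grad_f, x0, h, L):
    """Run x_{i+1} = x_i - (1/L) * sum_{j<=i} h[i][j] * grad_f(x_j) for i = 0,...,N-1,
    where row i of h holds the step coefficients used to produce iterate i+1.
    Specific coefficient choices recover the gradient method, OGM, etc."""
    x = np.asarray(x0, dtype=float)
    grads = []                                    # all gradients observed so far
    for i, row in enumerate(h):
        grads.append(grad_f(x))
        x = x - (1.0 / L) * sum(row[j] * grads[j] for j in range(i + 1))
    return x

# Example: plain gradient descent (h[i][j] = 1 if j == i, else 0) on a simple quadratic.
d = np.array([1.0, 2.0, 4.0])                     # f(x) = 0.5 * sum(d * x**2), so L = max(d)
grad_f = lambda x: d * x
N = 5
h_gm = [[0.0] * i + [1.0] for i in range(N)]      # lower-triangular rows of length i+1
x_N = fixed_step_first_order_method(grad_f, np.ones(3), h_gm, L=4.0)
print(np.linalg.norm(grad_f(x_N)))                # gradient norm after N iterations
```

Gradient descent corresponds to the identity-like coefficient pattern above; the paper's contribution is a choice of coefficients, obtained via the performance estimation problem, that optimizes the worst-case decrease of \(||\nabla f(\varvec{x}_N)||\).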


Notes

  1. We found that the set of constraints in (P1) is sufficient for the exact worst-case gradient analysis of GM and OGM-G for (IFC), as illustrated in later sections. In other words, the resulting worst-case rates of GM and OGM-G in this paper are tight with our specific choice of the set of inequalities. Note that this relaxation choice in (P1) differs from the choice in [1, Problem (G\('\))].

  2. The inequality (8) for the pair \(\{(N,*)\}\) simplifies to \(\frac{1}{2L}||\nabla f(\varvec{x}_N)||^2 \le f(\varvec{x}_N) - f_*\) under the condition \(X_*(f) \ne \emptyset \) (see the derivation sketched after these notes). Such an inequality is not used under the assumption (IFC\('\)) in Corollaries 5.1 and 6.1.

  3. In the PESTO toolbox [28], we used the SDP solver SeDuMi [26] interfaced through YALMIP [27]. The OGM-G method is implemented in the PESTO toolbox.
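
For the reader's convenience, the inequality quoted in footnote 2 follows from the standard descent lemma for \(L\)-smooth functions (a standard argument, not specific to this paper): evaluating the smoothness upper bound at the point \(\varvec{x}_N - \frac{1}{L}\nabla f(\varvec{x}_N)\) and using \(f_* \le f(\varvec{y})\) for any \(\varvec{y}\) when \(X_*(f) \ne \emptyset \) gives

$$\begin{aligned} f_* \le f\left( \varvec{x}_N - \tfrac{1}{L}\nabla f(\varvec{x}_N)\right) \le f(\varvec{x}_N) - \tfrac{1}{L}||\nabla f(\varvec{x}_N)||^2 + \tfrac{L}{2}\cdot \tfrac{1}{L^2}||\nabla f(\varvec{x}_N)||^2 = f(\varvec{x}_N) - \tfrac{1}{2L}||\nabla f(\varvec{x}_N)||^2 , \end{aligned}$$

which rearranges to \(\frac{1}{2L}||\nabla f(\varvec{x}_N)||^2 \le f(\varvec{x}_N) - f_*\).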

References

  1. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–82 (2014). https://doi.org/10.1007/s10107-013-0653-0

  2. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Dokl. Akad. Nauk. USSR 269(3), 543–7 (1983)

  3. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, New York (2004). https://doi.org/10.1007/978-1-4419-8853-9

  4. Nemirovsky, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–75 (1992). https://doi.org/10.1016/0885-064X(92)90013-2

  5. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016). https://doi.org/10.1007/s10107-015-0949-3

  6. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017). https://doi.org/10.1016/j.jco.2016.11.001

  7. Kim, D., Fessler, J.A.: Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions (2018). arXiv:1803.06600

  8. Nesterov, Y., Gasnikov, A., Guminov, S., Dvurechensky, P.: Primal-dual accelerated gradient methods with small-dimensional relaxation oracle. Optim. Methods Softw. (2020). https://doi.org/10.1080/10556788.2020.1731747

  9. Nesterov, Y.: How to make the gradients small. Optima 88 (2012). http://www.mathopt.org/?nav=optima_newsletter

  10. Allen-Zhu, Z.: How to make the gradients small stochastically: even faster convex and nonconvex SGD. In: NIPS (2018)

  11. Drori, Y., Shamir, O.: The complexity of finding stationary points with stochastic gradient descent. In: ICML (2020)

  12. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: first-order methods. Math. Program. (2019). https://doi.org/10.1007/s10107-019-01431-x

  13. Kim, D., Fessler, J.A.: Another look at the Fast Iterative Shrinkage/Thresholding Algorithm (FISTA). SIAM J. Optim. 28(1), 223–50 (2018). https://doi.org/10.1137/16M108940X

  14. Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM J. Optim. 28(2), 1920–50 (2018). https://doi.org/10.1137/17M112124X

  15. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1), 59–99 (2016). https://doi.org/10.1007/s10107-015-0871-8

  16. Monteiro, R.D.C., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013). https://doi.org/10.1137/110833786

  17. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1), 307–45 (2017). https://doi.org/10.1007/s10107-016-1009-3

  18. Nacson, M.S., Lee, J.D., Gunasekar, S., Savarese, P.H.P., Srebro, N., Soudry, D.: Convergence of gradient descent on separable data. In: AISTATS (2019)

  19. Soudry, D., Hoffer, E., Nacson, M.S., Srebro, N.: The implicit bias of gradient descent on separable data. In: Proc. Intl. Conf. on Learning Representations (2018)

  20. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

  21. Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. (2019). https://doi.org/10.1007/s10107-019-01410-2

  22. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. J. Optim. Theory Appl. 178(2), 455–76 (2018)

  23. CVX Research Inc.: CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx (2012)

  24. Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008). http://stanford.edu/~boyd/graph_dcp.html

  25. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient methods. J. Optim. Theory Appl. 172(1), 187–205 (2017). https://doi.org/10.1007/s10957-016-1018-7

  26. Sturm, J.: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim. Methods Softw. 11(1), 625–53 (1999). https://doi.org/10.1080/10556789908805766

  27. Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proc. of the CACSD Conference (2004)

  28. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: Proc. Conf. Decision and Control, pp. 1278–83 (2017). https://doi.org/10.1109/CDC.2017.8263832

  29. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–52 (2005). https://doi.org/10.1007/s10107-004-0552-5


Acknowledgements

Part of this work was carried out while the first author was affiliated with the University of Michigan. The first author would like to thank Ernest K. Ryu for pointing out related references. The authors would like to thank the associate editor and the referees for useful comments, especially regarding the case where a finite minimizer does not exist. The first author was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A5A1028324) and by the POSCO Science Fellowship of the POSCO TJ Park Foundation. The second author was supported in part by NSF grant IIS 1838179.

Author information

Corresponding author

Correspondence to Donghwan Kim.

Additional information

Communicated by Alexander Mitsos.


Appendix: Proof of Eqs. (25) and (26)

This appendix proves properties (25) and (26) of the step coefficients \(\{\tilde{h}_{i,j}\}\) defined in (22).

We first show (25). We can easily derive

$$\begin{aligned} \tilde{h}_{i,i-2} = \frac{(\tilde{\theta}_{i-1}-1)(2\tilde{\theta}_i-1)}{\tilde{\theta}_{i-2}\tilde{\theta}_{i-1}} = \frac{\tilde{\theta}_i^2(2\tilde{\theta}_i-1)}{\tilde{\theta}_{i-2}\tilde{\theta}_{i-1}^2} \end{aligned}$$

for \(i=2,\ldots,N\) using (27). Again using the definition (22) and (27), we have

$$\begin{aligned} \tilde{h}_{i,j}&= \frac{\tilde{\theta}_{j+1}-1}{\tilde{\theta}_j}\tilde{h}_{i,j+1} = \cdots = \left( \prod_{l=j+1}^{i-2}\frac{\tilde{\theta}_l-1}{\tilde{\theta}_{l-1}}\right) \tilde{h}_{i,i-2} = \left( \prod_{l=j+1}^{i-1}\frac{\tilde{\theta}_l-1}{\tilde{\theta}_{l-1}}\right) \frac{2\tilde{\theta}_i-1}{\tilde{\theta}_{i-1}} \\&= \frac{1}{\tilde{\theta}_j}\frac{1}{\tilde{\theta}_{j+1}} \frac{\tilde{\theta}_{j+1}-1}{\tilde{\theta}_{j+2}} \cdots \frac{\tilde{\theta}_{i-3}-1}{\tilde{\theta}_{i-2}} (\tilde{\theta}_{i-2}-1)(\tilde{\theta}_{i-1}-1) \frac{2\tilde{\theta}_i-1}{\tilde{\theta}_{i-1}} \\&= \frac{1}{\tilde{\theta}_j}\frac{1}{\tilde{\theta}_{j+1}} \frac{\tilde{\theta}_{j+2}}{\tilde{\theta}_{j+1}} \cdots \frac{\tilde{\theta}_{i-2}}{\tilde{\theta}_{i-3}} (\tilde{\theta}_{i-2}-1)(\tilde{\theta}_{i-1}-1) \frac{2\tilde{\theta}_i-1}{\tilde{\theta}_{i-1}} \\&= \frac{\tilde{\theta}_{i-2}(\tilde{\theta}_{i-2}-1)(\tilde{\theta}_{i-1}-1)(2\tilde{\theta}_i-1)}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2\tilde{\theta}_{i-1}} = \frac{\tilde{\theta}_i^2(2\tilde{\theta}_i-1)}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2}, \end{aligned}$$

for \(i=2,\ldots ,N,\;j=0,\ldots ,i-3\), which concludes the proof of (25).
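
As a concrete illustration (not part of the original argument), the smallest case covered by the last display, \(i=3\) and \(j=0\) (so \(N\ge 3\)), reads

$$\begin{aligned} \tilde{h}_{3,0} = \frac{\tilde{\theta}_1-1}{\tilde{\theta}_0}\tilde{h}_{3,1} = \frac{\tilde{\theta}_1-1}{\tilde{\theta}_0}\cdot \frac{(\tilde{\theta}_2-1)(2\tilde{\theta}_3-1)}{\tilde{\theta}_1\tilde{\theta}_2} = \frac{\tilde{\theta}_2^2}{\tilde{\theta}_0\tilde{\theta}_1}\cdot \frac{\tilde{\theta}_3^2(2\tilde{\theta}_3-1)}{\tilde{\theta}_1\tilde{\theta}_2^2} = \frac{\tilde{\theta}_3^2(2\tilde{\theta}_3-1)}{\tilde{\theta}_0\tilde{\theta}_1^2} , \end{aligned}$$

where the third equality uses \(\tilde{\theta}_1 - 1 = \tilde{\theta}_2^2/\tilde{\theta}_1\) and \(\tilde{\theta}_2 - 1 = \tilde{\theta}_3^2/\tilde{\theta}_2\), i.e., (27) as used in the displays above; this agrees with (25) for \(i=3\), \(j=0\).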

We next prove the first two lines of (26) by induction. For \(N=1\), we have \(\tilde{\theta}_1 = 1\) and

$$\begin{aligned} \tilde{h}_{1,0} = 1 + \frac{2\tilde{\theta}_1-1}{\tilde{\theta}_0} = 1 + \frac{\tilde{\theta}_1^2}{\tilde{\theta}_0} = 1 + \frac{\frac{1}{2}(\tilde{\theta}_0^2 - \tilde{\theta}_0)}{\tilde{\theta}_0} = \frac{1}{2}(\tilde{\theta}_0+1) , \end{aligned}$$

where the third equality uses (27). For \(N>1\), we have

$$\begin{aligned} \tilde{h}_{N,N-1} = 1 + \frac{2\tilde{\theta}_N-1}{\tilde{\theta}_{N-1}} = 1 + \frac{\tilde{\theta}_N^2}{\tilde{\theta}_{N-1}} = 1 + \frac{\tilde{\theta}_{N-1}^2 - \tilde{\theta}_{N-1}}{\tilde{\theta}_{N-1}} = \tilde{\theta}_{N-1} , \end{aligned}$$

where the third equality uses (27). Assuming \(\sum_{l=j+1}^N\tilde{h}_{l,j} = \tilde{\theta}_j\) for \(j=n,\ldots,N-1\) and \(n\ge 1\), we get

$$\begin{aligned} \sum_{l=n}^N\tilde{h}_{l,n-1}&= 1 + \frac{2\tilde{\theta}_n-1}{\tilde{\theta}_{n-1}} + \frac{\tilde{\theta}_n-1}{\tilde{\theta}_{n-1}}(\tilde{h}_{n+1,n}-1) + \frac{\tilde{\theta}_n-1}{\tilde{\theta}_{n-1}}\sum_{l=n+2}^N\tilde{h}_{l,n} \\&= 1 + \frac{\tilde{\theta}_n}{\tilde{\theta}_{n-1}} + \frac{\tilde{\theta}_n-1}{\tilde{\theta}_{n-1}}\sum_{l=n+1}^N\tilde{h}_{l,n} = \frac{\tilde{\theta}_{n-1} + \tilde{\theta}_n + (\tilde{\theta}_n-1)\tilde{\theta}_n}{\tilde{\theta}_{n-1}} = \frac{\tilde{\theta}_{n-1} + \tilde{\theta}_n^2}{\tilde{\theta}_{n-1}} \\&= {\left\{ \begin{array}{ll} \frac{1}{2}(\tilde{\theta}_0 + 1), & n = 1, \\ \tilde{\theta}_{n-1}, & n=2,\ldots,N-1, \end{array}\right. } \end{aligned}$$

where the last equality uses (27), which concludes the proof of the first two lines of (26).
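
For clarity (spelling out a step left implicit above), the last equality uses the two forms of (27) that appear in the earlier displays, namely \(\tilde{\theta}_n^2 = \tilde{\theta}_{n-1}^2 - \tilde{\theta}_{n-1}\) for \(n\ge 2\) and \(\tilde{\theta}_1^2 = \frac{1}{2}(\tilde{\theta}_0^2 - \tilde{\theta}_0)\) for \(n=1\):

$$\begin{aligned} \frac{\tilde{\theta}_{n-1} + \tilde{\theta}_n^2}{\tilde{\theta}_{n-1}} = {\left\{ \begin{array}{ll} \frac{\tilde{\theta}_0 + \frac{1}{2}(\tilde{\theta}_0^2 - \tilde{\theta}_0)}{\tilde{\theta}_0} = \frac{1}{2}(\tilde{\theta}_0 + 1), & n = 1, \\ \frac{\tilde{\theta}_{n-1} + \tilde{\theta}_{n-1}^2 - \tilde{\theta}_{n-1}}{\tilde{\theta}_{n-1}} = \tilde{\theta}_{n-1}, & n=2,\ldots,N-1. \end{array}\right. } \end{aligned}$$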

We finally prove the last line of (26) by induction. For \(i\ge 1\), we have

$$\begin{aligned} \sum_{l=i+1}^N\tilde{h}_{l,i-1} = \sum_{l=i}^N\tilde{h}_{l,i-1} - \tilde{h}_{i,i-1} = \tilde{\theta}_{i-1} - \left( 1+\frac{2\tilde{\theta}_i-1}{\tilde{\theta}_{i-1}}\right) = \frac{(\tilde{\theta}_i-1)^2}{\tilde{\theta}_{i-1}} = \frac{\tilde{\theta}_{i+1}^4}{\tilde{\theta}_{i-1}\tilde{\theta}_i^2} , \end{aligned}$$

where the third and fourth equalities use (27). (For \(i=1\), the second equality instead uses \(\sum_{l=1}^N\tilde{h}_{l,0} = \frac{1}{2}(\tilde{\theta}_0+1)\), and the same final expression follows since \(\tilde{\theta}_1^2 = \frac{1}{2}(\tilde{\theta}_0^2-\tilde{\theta}_0)\).) Then, assuming \(\sum_{l=i+1}^N\tilde{h}_{l,j}=\frac{\tilde{\theta}_{i+1}^4}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2}\) for \(i=n,\ldots,N-1\), \(j=0,\ldots,i-1\) with \(n\ge 1\), we get

$$\begin{aligned} \sum_{l=n}^N\tilde{h}_{l,j}&= \sum_{l=n+1}^N\tilde{h}_{l,j} + \tilde{h}_{n,j} = \frac{\tilde{\theta}_{n+1}^4}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2} + \frac{\tilde{\theta}_n^2(2\tilde{\theta}_n-1)}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2} = \frac{\tilde{\theta}_n^2(\tilde{\theta}_n-1)^2 + \tilde{\theta}_n^2(2\tilde{\theta}_n-1)}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2}\\&= \frac{\tilde{\theta}_n^4}{\tilde{\theta}_j\tilde{\theta}_{j+1}^2} , \end{aligned}$$

where the second equality uses the induction hypothesis and (25), and the third equality uses (27), which concludes the proof. \(\square\)
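
The identities above are also easy to verify numerically. The sketch below is illustrative only; since (22) and (27) are not reproduced in this appendix, the recursions for \(\tilde{\theta}_i\) and \(\tilde{h}_{i,j}\) are reconstructed from the displays above, and the script checks (25) together with the column-sum and tail-sum identities established in this proof for a few values of \(N\).

```python
import math

def build_theta(N):
    """theta_N = 1, with theta_i^2 = theta_{i-1}^2 - theta_{i-1} for i >= 2 and
    theta_1^2 = (theta_0^2 - theta_0)/2, solved backwards for theta_{N-1},...,theta_0
    (recursion reconstructed from the displays in this appendix)."""
    th = [0.0] * (N + 1)
    th[N] = 1.0
    for i in range(N - 1, 0, -1):
        th[i] = (1 + math.sqrt(1 + 4 * th[i + 1] ** 2)) / 2
    th[0] = (1 + math.sqrt(1 + 8 * th[1] ** 2)) / 2
    return th

def build_h(th, N):
    """h_{i,j} for 1 <= i <= N, 0 <= j <= i-1, via the recursion used above:
    h_{i,i-1} = 1 + (2 theta_i - 1)/theta_{i-1},
    h_{i,i-2} = (theta_{i-1} - 1)/theta_{i-2} * (h_{i,i-1} - 1),
    h_{i,j}   = (theta_{j+1} - 1)/theta_j * h_{i,j+1} for j <= i-3."""
    h = {}
    for i in range(1, N + 1):
        h[i, i - 1] = 1 + (2 * th[i] - 1) / th[i - 1]
        if i >= 2:
            h[i, i - 2] = (th[i - 1] - 1) / th[i - 2] * (h[i, i - 1] - 1)
        for j in range(i - 3, -1, -1):
            h[i, j] = (th[j + 1] - 1) / th[j] * h[i, j + 1]
    return h

for N in (1, 2, 3, 6):
    th = build_theta(N)
    h = build_h(th, N)
    # (25): h_{i,j} = theta_i^2 (2 theta_i - 1) / (theta_j theta_{j+1}^2) for j <= i-2
    ok25 = all(abs(h[i, j] - th[i]**2 * (2*th[i] - 1) / (th[j] * th[j+1]**2)) < 1e-9
               for i in range(2, N + 1) for j in range(i - 1))
    # column sums: sum_{l=j+1}^N h_{l,j} = (theta_0 + 1)/2 if j = 0, else theta_j
    col = lambda j: sum(h[l, j] for l in range(j + 1, N + 1))
    ok_col = (abs(col(0) - (th[0] + 1) / 2) < 1e-9
              and all(abs(col(j) - th[j]) < 1e-9 for j in range(1, N)))
    # tail sums, as used in the induction: sum_{l=i+1}^N h_{l,j} = theta_{i+1}^4 / (theta_j theta_{j+1}^2)
    ok_tail = all(abs(sum(h[l, j] for l in range(i + 1, N + 1))
                      - th[i+1]**4 / (th[j] * th[j+1]**2)) < 1e-9
                  for i in range(1, N) for j in range(i))
    print(N, ok25, ok_col, ok_tail)
```

All three checks print True for the tested values of \(N\), corroborating the closed forms derived above.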


Cite this article

Kim, D., Fessler, J.A. Optimizing the Efficiency of First-Order Methods for Decreasing the Gradient of Smooth Convex Functions. J Optim Theory Appl 188, 192–219 (2021). https://doi.org/10.1007/s10957-020-01770-2

