Efficient first-order methods for convex minimization: a constructive approach

Mathematical Programming, Series A (Full Length Paper)

Abstract

We describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, non-smooth, and strongly convex minimization. The technique builds upon a certain variant of the conjugate gradient method to construct a family of methods such that (a) all methods in the family share the same worst-case guarantee as the base conjugate gradient method, and (b) the family includes a fixed-step first-order method. We demonstrate the effectiveness of the approach by deriving optimal methods for the smooth and non-smooth cases, including new methods that forego knowledge of the problem parameters at the cost of a one-dimensional line search per iteration, and a universal method for the union of these classes that requires a three-dimensional search per iteration. In the strongly convex case, we show how numerical tools can be used to perform the construction, and show that the resulting method offers an improved worst-case bound compared to Nesterov’s celebrated fast gradient method.

References

  1. Arjevani, Y., Shalev-Shwartz, S., Shamir, O.: On lower and upper bounds in smooth and strongly convex optimization. J. Mach. Learn. Res. 17(126), 1–51 (2016)

  2. Beck, A.: Quadratic matrix programming. SIAM J. Optim. 17(4), 1224–1238 (2007)

  3. Beck, A., Drori, Y., Teboulle, M.: A new semidefinite programming relaxation scheme for a class of quadratic matrix problems. Oper. Res. Lett. 40(4), 298–302 (2012)

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Bubeck, S., Lee, Y.T., Singh, M.: A geometric alternative to Nesterov’s accelerated gradient descent (2015). arXiv preprint arXiv:1506.08187

  6. De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)

  7. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)

  8. Devolder, O., Glineur, F., Nesterov, Y.: Intermediate gradient methods for smooth convex problems with inexact oracle. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), Technical report (2013)

  9. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

  10. Diehl, M., Ferreau, H.J., Haverbeke, N.: Efficient numerical methods for nonlinear MPC and moving horizon estimation. Nonlinear Model Predict. Control 384, 391–417 (2009)

  11. Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. thesis, Tel-Aviv University (2014)

  12. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)

  13. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

  14. Drori, Y., Teboulle, M.: An optimal variant of Kelley’s cutting-plane method. Math. Program. 160(1–2), 321–351 (2016)

  15. Drusvyatskiy, D., Fazel, M., Roy, S.: An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)

  16. Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM J. Optim. 28(3), 2654–2689 (2018)

  17. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming. version 2.0 beta. http://cvxr.com/cvx (2013)

  18. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bureau Stand. 49(6), 409–436 (1952)

  19. Hu, B., Lessard, L.: Dissipativity theory for Nesterov’s accelerated method. In: International Conference on Machine Learning (ICML), pp. 1549–1557 (2017)

  20. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)

  21. Karimi, S., Vavasis, S.A.: A unified convergence bound for conjugate gradient and accelerated gradient. (2016). arXiv preprint arXiv:1605.00320

  22. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)

  23. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)

  24. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2663–2671 (2012)

  25. Lemaréchal, C., Sagastizábal, C.: Variable metric bundle methods: from conceptual to implementable forms. Math. Program. 76(3), 393–410 (1997)

  26. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  27. Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)

  28. MOSEK ApS: The MOSEK Optimization Software (2010). http://www.mosek.com

  29. Narkiss, G., Zibulevsky, M.: Sequential subspace optimization method for large-scale unconstrained problems. Technical report, Technion-IIT, Department of Electrical Engineering (2005)

  30. Nemirovski, A.: Orth-method for smooth convex optimization. Izvestia AN SSSR 2, 937–947 (1982). (in Russian)

  31. Nemirovski, A.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)

  32. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)

  33. Nemirovski, A., Yudin, D.: Information-based complexity of mathematical programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika 1 (1983) (in Russian)

  34. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)

  35. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Mathematics Doklady 27, 372–376 (1983)

  36. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, London (2004)

  37. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  38. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)

  39. Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)

  40. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  41. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  42. Ruszczyński, A.P.: Nonlinear Optimization, vol. 13. Princeton University Press, Princeton (2006)

  43. Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection (2018). arXiv preprint arXiv:1812.00146

  44. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems (NIPS), pp. 1458–1466 (2011)

  45. Scieur, D., Roulet, V., Bach, F., d’Aspremont, A.: Integration methods and optimization algorithms. In: Advances in Neural Information Processing Systems (NIPS), pp. 1109–1118 (2017)

  46. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)

  47. Taylor, A.: Convex interpolation and performance estimation of first-order methods for convex optimization. Ph.D. thesis, Université catholique de Louvain (2017)

  48. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

  49. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: IEEE 56th Annual Conference on Decision and Control (CDC), pp. 1278–1283 (2017)

  50. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

  51. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. J. Optim. Theory Appl. 178(2), 455–476 (2018)

  52. Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)

  53. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization. (2016). arXiv preprint arXiv:1611.02635

  54. Wright, S.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  55. Wright, S., Nocedal, J.: Numerical Optimization. Springer, New York (1999)

Author information

Corresponding author

Correspondence to Adrien B. Taylor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Adrien B. Taylor was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement 724063).

Appendices

Appendix A: Proof of Lemma 1

We start the proof of Lemma 1 with the following technical lemma.

Lemma 5

Let \({\mathcal {F}}\) be a class of contraction-preserving c.c.p. functions (see Definition 3), and let \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) be an \({\mathcal {F}}\)-interpolable set satisfying

$$\begin{aligned}&{\left\langle g_i, g_j\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N,\end{aligned}$$
(23)
$$\begin{aligned}&{\left\langle g_i, x_j-x_0\right\rangle }=0,\quad \text {for all } 1\le j\le i=1,\ldots ,N, \end{aligned}$$
(24)

then there exists \(\{{\hat{x}}_i\}_{i\in I^*_N}\subset \mathbb {R}^d\) such that the set \({\hat{S}}=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}\)-interpolable, and

$$\begin{aligned}&{\left||{\hat{x}}_0 - {\hat{x}}_*\right||}\le {\left||x_0-x_*\right||}, \end{aligned}$$
(25)
$$\begin{aligned}&{\hat{x}}_i \in {\hat{x}}_0 + \mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad {i=0,\ldots , N}. \end{aligned}$$
(26)

Proof

By the orthogonal decomposition theorem there exist \(\{h_{i,j}\}_{0\le j<i\le N} \subset \mathbb {R}\) and \(\{v_i\}_{0\le i\le N} \subset \mathbb {R}^d\) with \({\left\langle g_k, v_i\right\rangle }=0\) for all \(0\le k<i \le N\) such that

$$\begin{aligned} x_i&=x_0-\sum _{j=0}^{i-1} h_{i,j}g_j +v_i, \quad { i=0,\ldots , N}, \end{aligned}$$

furthermore, there exist \(r_*\in \mathbb {R}^d\) satisfying \({\left\langle r_*, v_j\right\rangle }=0\) for all \(0\le j \le N\) and some \(\{\nu _{j}\}_{0\le j\le N}\subset \mathbb {R}\), such that

$$\begin{aligned} x_*=x_0 + \sum _{j=0}^N \nu _{j}v_j + r_*. \end{aligned}$$

By (23) and (24) it then follows that for all \(k\ge i\)

$$\begin{aligned} {\left\langle g_k, v_i\right\rangle } = {\left\langle g_k, x_i-x_0+\sum _{j=0}^{i-1} h_{i,j} g_j\right\rangle } = 0, \end{aligned}$$

hence, together with the definition of \(v_i\), we get

$$\begin{aligned} {\left\langle g_k, v_i\right\rangle }=0, \quad {i,k=0,\ldots ,N}. \end{aligned}$$
(27)

Let us now choose \(\{{\hat{x}}_i\}_{i\in I^*_N}\) as follows:

$$\begin{aligned}&{\hat{x}}_0:=x_0,\\&{\hat{x}}_i:=x_0-\sum _{j=0}^{i-1} h_{i,j} g_j, \quad { i =0,\ldots , N}, \\&{\hat{x}}_* := x_0+r_*. \end{aligned}$$

It follows immediately from this definition that (26) holds; it thus remains to show that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable and that (25) holds.

In order to establish that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable, from Definition 3 it is enough to show that the conditions in (4) are satisfied. This is indeed the case, as \({\left\langle g_j, {\hat{x}}_i - {\hat{x}}_0\right\rangle }={\left\langle g_j, x_i-x_0\right\rangle }\) follows directly from definition of \(\{{\hat{x}}_i\}\) and (27), whereas \({\left||{\hat{x}}_i - {\hat{x}}_j\right||}\le {\left||x_i-x_j\right||}\) in the case \(i,j\ne *\) follows from

$$\begin{aligned} {\left||x_i-x_j\right||}^2&={\left||x_0-\sum _{k=0}^{i-1} h_{i,k} g_k+v_i-x_0+\sum _{k=0}^{j-1} h_{j,k} g_k-v_j\right||}^2\\&={\left||{\hat{x}}_i - {\hat{x}}_j\right||}^2+{\left||v_i-v_j\right||}^2\\&\ge {\left||{\hat{x}}_i - {\hat{x}}_j\right||}^2, \quad {i,j=0,\ldots , N}, \end{aligned}$$

and in the case \(j=*\), follows from

$$\begin{aligned} {\left||x_i-x_*\right||}^2&={\left||x_0-\sum _{k=0}^{i-1} h_{i,k} g_k+v_i-x_0 -\sum _{j=0}^N \nu _{j}v_j - r_*\right||}^2\\&={\left||{\hat{x}}_i - {\hat{x}}_*\right||}^2+{\left||v_i-\sum _{j=0}^N \nu _{j}v_j\right||}^2\\&\ge {\left||{\hat{x}}_i - {\hat{x}}_*\right||}^2, \quad {i=0,\ldots , N}, \end{aligned}$$

where for the second equality we used (27) and \({\left\langle r_*, v_j\right\rangle }=0\) for all \(0\le j\le N\). The last inequality also establishes (25), which completes the proof. \(\square \)

Proof of Lemma 1

By the first-order necessary and sufficient optimality conditions (see e.g., [42, Theorem 3.5]), the iterates \(x_i\) and subgradients \(f'(x_i)\) defined in (5) and (6) can be equivalently characterized as a solution to the problem of finding \(x_i\in \mathbb {R}^d\) and \(f'(x_i)\in \partial f(x_i)\) (\(0\le i\le N\)) that satisfy:

$$\begin{aligned}&{\left\langle f'(x_i), f'(x_j)\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \\&x_i\in x_0+\mathrm {span}\{f'(x_0),\ldots ,f'(x_{i-1})\}, \quad \text {for all } i=1,\ldots ,N, \end{aligned}$$

hence the problem (PEP) can be equivalently expressed as follows:

$$\begin{aligned} \sup _{ f, \left\{ x_i\right\} _{i \in I^*_N}, \{f'(x_i)\}_{i\in I^*_N}}&f(x_N)-f_*\nonumber \\ \text {subject to: }&f\in {\mathcal {F}}(\mathbb {R}^d),\ x_* \text { is a minimizer of } f, \nonumber \\&f'(x_i) \in \partial f(x_i), \quad \text {for all } i\in I^*_N, \nonumber \\&{\left||x_0-x_*\right||}\le R_x, \nonumber \\&{\left\langle f'(x_i), f'(x_j)\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \nonumber \\&x_i\in x_0+\mathrm {span}\{f'(x_0),\ldots ,f'(x_{i-1})\}, \quad \text {for all } i=1,\ldots ,N. \end{aligned}$$
(28)

Now, since all constraints in (28) depend only on the first-order information of f at \(\{x_i\}_{i\in I^*_N}\), by taking advantage of Definition 2 we can denote \(f_i:=f(x_i)\) and \(g_i:=f'(x_i)\) and treat these as optimization variables, thereby reaching the following equivalent formulation

$$\begin{aligned} \sup _{\{(x_i,g_i,f_i)\}_{i\in I^*_N}}&\ f_N-f_* \nonumber \\ \text { subject to: }&\{(x_i,g_i,f_i)\}_{i\in I^*_N} \text { is }{\mathcal {F}}(\mathbb {R}^d)\text {-interpolable}, \nonumber \\&{\left||x_0-x_*\right||}\le R_x, \nonumber \\&g_*=0, \nonumber \\&{\left\langle g_i, g_j\right\rangle }= 0, \ \text {for all } 0\le j<i=1,\ldots N,\nonumber \\&x_i\in x_0+\mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad \text {for all } i=1,\ldots ,N. \end{aligned}$$
(29)

Since (PEP-GFOM) is a relaxation of (29), we get

$$\begin{aligned} f(x_N) - f_*\le {{\,\mathrm{val}\,}}\mathrm{(PEP)} \le {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)}, \end{aligned}$$

which establishes the bound (13).

In order to establish the second part of the claim, let \(\varepsilon >0\). We will proceed to show that there exists some valid input \((f, x_0)\) for GFOM such that \(f(\mathrm {GFOM}_N(f, x_0)) - f_*\ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \).

Indeed, by the definition of (PEP-GFOM), there exists a set \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) that satisfies the constraints in (PEP-GFOM) and reaches an objective value \(f_N-f_* \ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \). Since S satisfies the requirements of Lemma 5 [as these requirements are constraints in (PEP-GFOM)], there exists a set of vectors \(\{{\hat{x}}_i\}_{i\in I^*_N}\) for which

$$\begin{aligned}&{\left||{\hat{x}}_0- {\hat{x}}_*\right||}\le R_x, \\&{\hat{x}}_i\in {\hat{x}}_0 + \mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad i=0,\dots ,N, \end{aligned}$$

hold, and in addition, \({\hat{S}}:=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable. By definition of an \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable set, it follows that there exists a function \({\hat{f}}\in {\mathcal {F}}(\mathbb {R}^d)\) such that \({\hat{f}}({\hat{x}}_i) = f_i\), \(g_i \in \partial {\hat{f}}({\hat{x}}_i)\), hence satisfying

$$\begin{aligned}&{\left\langle {\hat{f}}'({\hat{x}}_i), {\hat{f}}'({\hat{x}}_j)\right\rangle } = 0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \\&{\hat{x}}_i\in {\hat{x}}_0+\mathrm {span}\{{\hat{f}}'({\hat{x}}_0),\ldots , {\hat{f}}'({\hat{x}}_{i-1})\}, \quad \text {for all } i=1,\ldots ,N. \end{aligned}$$

Furthermore, since \(g_*=0\) we have that \({\hat{x}}_*\) is an optimal solution of \({\hat{f}}\).

We conclude that the sequence \({\hat{x}}_0, \dots , {\hat{x}}_N\) forms a valid execution of GFOM on the input \(({\hat{f}}, {\hat{x}}_0)\), that the requirement \({\left||{\hat{x}}_0 - {\hat{x}}_*\right||}\le R_x\) is satisfied, and that the output of the method, \({\hat{x}}_N\), attains the absolute inaccuracy value of \({\hat{f}}({\hat{x}}_N) -{\hat{f}}({\hat{x}}_*) = f_N - f_* \ge {{\,\mathrm{val}\,}}(PEP-GFOM)-\varepsilon \). \(\square \)
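
As an aside, the orthogonality conditions (23) and (24), which drive both Lemma 5 and the argument above, are exactly the first-order optimality conditions of an exact minimization over the affine span of the past gradients. The sketch below reproduces them numerically for a greedy subspace-minimization iteration of this kind; the quadratic test function and the use of scipy's BFGS as the inner solver are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, N = 6, 4
A = rng.standard_normal((d, d))
Q = A @ A.T + 0.1 * np.eye(d)                 # assumed smooth convex quadratic test function
f  = lambda x: 0.5 * x @ Q @ x
df = lambda x: Q @ x

x0 = rng.standard_normal(d)
xs, gs = [x0], [df(x0)]
for i in range(1, N + 1):
    S = np.stack(gs, axis=1)                  # columns span {g_0, ..., g_{i-1}}
    # inner step: exact minimization of f over x_0 + span{g_0, ..., g_{i-1}}
    res = minimize(lambda c: f(x0 + S @ c), np.zeros(S.shape[1]),
                   jac=lambda c: S.T @ df(x0 + S @ c), method="BFGS")
    xi = x0 + S @ res.x
    xs.append(xi)
    gs.append(df(xi))

# residuals of the optimality conditions (23) and (24), up to solver tolerance
print(max(abs(gs[i] @ gs[j]) for i in range(1, N + 1) for j in range(i)))
print(max(abs(gs[i] @ (xs[j] - x0)) for i in range(1, N + 1) for j in range(1, i + 1)))
```

Both printed residuals should be on the order of the inner solver's tolerance.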

Appendix B: Proof of Theorem 3

Lemma 6

Suppose there exists a pair \((f,x_0)\) such that \(f\in {\mathcal {F}}\), \({\left||x_0-x_*\right||}\le R_x\) and \(\mathrm {GFOM}_{2N+1}(f, x_0)\) is not optimal for f. Then (sdp-PEP-GFOM) satisfies Slater’s condition. In particular, no duality gap occurs between the primal-dual pair (sdp-PEP-GFOM), (dual-PEP-GFOM), and the dual optimal value is attained.

Proof

Let \((f,x_0)\) be a pair satisfying the premise of the lemma and denote by \(\{x_i\}_{i\ge 0}\) the sequence generated according to GFOM and by \(\{f'(x_i)\}_{i\ge 0}\) the subgradients chosen at each iteration of the method, respectively. By the assumption that the optimal value is not obtained after \(2N+1\) iterations, we have \(f(x_{2N+1})>f_*\).

We show that the set \(\{({\tilde{x}}_i,{\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) with

$$\begin{aligned}&{\tilde{x}}_i:=x_{2i}, \quad i=0,\ldots ,N, \\&{\tilde{x}}_*:=x_*, \\&{\tilde{g}}_i:=f'(x_{2i}), \quad i=0,\ldots ,N, \\&{\tilde{g}}_*:=0, \\&{\tilde{f}}_i:=f(x_{2i}), \quad i=0,\ldots ,N, \\&{\tilde{f}}_*:=f(x_*), \end{aligned}$$

corresponds to a Slater point for (sdp-PEP-GFOM).

In order to proceed, we consider the Gram matrix \({\tilde{G}}\) and the vector \({\tilde{F}}\) constructed from the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) as in Sect. 3.2. We then continue in two steps:

  (i)

    we show that \(({\tilde{G}}, {\tilde{F}})\) is feasible for (sdp-PEP-GFOM),

  (ii)

    we show that \({\tilde{G}}\succ 0\).

The proofs follow.

  (i)

    First, we note that the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) satisfies the interpolation conditions for \({\mathcal {F}}\), as it was obtained by taking the values and gradients of a function in \({\mathcal {F}}\). Furthermore, since \({\tilde{x}}_0 = x_0\) and \({\tilde{x}}_*=x_*\) we also get that the initial condition \({\left||{\tilde{x}}_0-{\tilde{x}}_*\right||}\le R_x\) is respected, and since \(\{x_i\}\) correspond to the iterates of GFOM, we also have by Lemma 5 that

    $$\begin{aligned}&{\left\langle {\tilde{g}}_i, {\tilde{g}}_j\right\rangle }= 0, \quad \text {for all } 0\le j<i=1,\ldots N, \\&{\left\langle {\tilde{g}}_i, {\tilde{x}}_j-{\tilde{x}}_0\right\rangle }= 0, \quad \text {for all } 1\le j \le i=1,\ldots N. \end{aligned}$$

    It then follows from the construction of \({\tilde{G}}\) and \({\tilde{F}}\) and from (10) that \({\tilde{G}}\) and \({\tilde{F}}\) satisfy the constraints of (sdp-PEP-GFOM).

  (ii)

    In order to establish that \({\tilde{G}}\succ 0\) it suffices to show that the vectors

    $$\begin{aligned} \{{\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots ,{\tilde{x}}_N- {\tilde{x}}_0 ; {\tilde{x}}_*- {\tilde{x}}_0 \} \end{aligned}$$

    are linearly independent. Indeed, this follows from Lemma 5, since these vectors are all non-zero, and since \({\tilde{x}}_*\) does not fall in the linear space spanned by \({\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots , {\tilde{x}}_N- {\tilde{x}}_0\) (as otherwise \(x_{2N+1}\) would be an optimal solution).

We conclude that \(({\tilde{G}}, {\tilde{F}})\) forms a Slater point for (sdp-PEP-GFOM).\(\square \)
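
The fact used in step (ii), namely that a Gram matrix is positive definite precisely when the underlying vectors are linearly independent, is easy to sanity-check numerically; the short illustration below uses random stand-in vectors rather than the ones constructed in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
# columns of V stand in for a family of linearly independent vectors in some R^d
V = rng.standard_normal((10, 7))              # 7 random vectors in R^10: independent w.p. 1
G = V.T @ V                                   # their Gram matrix
print(np.linalg.eigvalsh(G).min() > 0)        # True: the Gram matrix is positive definite
```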

Proof of Theorem 3

The bound follows directly from

$$\begin{aligned} f(\mathrm {GFOM}_{N}(f, x_0)) - f_*\le {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)} \le {{\,\mathrm{val}\,}}\mathrm{(sdp-PEP-GFOM)}, \end{aligned}$$

established by Lemmas 1 and 2. The tightness claim follows from the tightness claims of Lemmas 1, 2 and 6. \(\square \)

Appendix C: Proof of Theorem 4

We begin the proof of Theorem 4 by recalling a well-known lemma on constraint aggregation, showing that it is possible to aggregate the constraints of a minimization problem while keeping the optimal value of the resulting program bounded from below.

Lemma 7

Consider the problem

$$\begin{aligned} \min _{x\in \mathbb {R}^d}\ f(x) \quad \text {subject to: } h(x)=0,\ g(x)\le 0, \end{aligned}$$
(P)

where \(f:\mathbb {R}^d\rightarrow \mathbb {R}\), \(h:\mathbb {R}^d\rightarrow \mathbb {R}^n\), \(g:\mathbb {R}^d\rightarrow \mathbb {R}^m\) are some (not necessarily convex) functions, and suppose \(({\tilde{\alpha }}, {\tilde{\beta }})\in \mathbb {R}^{n}\times \mathbb {R}_+^{m}\) is a feasible point for the Lagrangian dual of (P) that attains the value \({\tilde{\omega }}\). Let \(k\in {\mathbb {N}}\), and let \(M\in \mathbb {R}^{n \times k}\) be a linear map such that \({\tilde{\alpha }} \in \mathrm {range}(M)\), then

$$\begin{aligned} w':=\min _{x\in \mathbb {R}^d}\ f(x) \quad \text {subject to: } M^\top h(x)=0,\ g(x)\le 0, \end{aligned}$$
(P\('\))

is bounded from below by \({\tilde{\omega }}\).

Proof

Let

$$\begin{aligned} L(x, \alpha , \beta ) = f(x)+\alpha ^\top h(x) + \beta ^\top g(x) \end{aligned}$$

be the Lagrangian of the problem (P); then by the assumption on \(({\tilde{\alpha }}, {\tilde{\beta }})\) we have \( \min _x L(x, {\tilde{\alpha }}, {\tilde{\beta }}) = {\tilde{\omega }}. \) Now, let \(u\in \mathbb {R}^k\) be some vector such that \(Mu = {\tilde{\alpha }}\); then for every \(x\) that is feasible for (P\('\))

$$\begin{aligned}&{\tilde{\alpha }}^\top h(x) = u^\top M^\top h(x) = 0, \\&{\tilde{\beta }}^\top g(x)\le 0, \end{aligned}$$

where the last inequality follows from the nonnegativity of \({\tilde{\beta }}\). We get

$$\begin{aligned} f(x) \ge f(x) + {\tilde{\alpha }}^\top h(x) + {\tilde{\beta }}^\top g(x) = L(x, {\tilde{\alpha }}, {\tilde{\beta }}) \ge {\tilde{\omega }}, \quad \forall x: M^\top h(x)=0, g(x)\le 0, \end{aligned}$$

and thus the desired result \(w'\ge {\tilde{\omega }}\) holds. \(\square \)
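
To make the aggregation mechanism concrete, the following sketch builds a toy convex quadratic instance with affine constraints (all data here is illustrative and not from the paper), picks a dual-feasible point \(({\tilde{\alpha }}, {\tilde{\beta }})\), aggregates the equality constraints through a map \(M\) whose range contains \({\tilde{\alpha }}\), and checks numerically that the value of the aggregated problem stays above \({\tilde{\omega }}\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, m, k = 3, 2, 2, 1                       # sizes of x, h(x), g(x), and of the map M

# Toy data: f convex quadratic, h and g affine (all purely illustrative).
W = rng.standard_normal((d, d))
P = W @ W.T + np.eye(d)
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)     # h(x) = A x - b
C = rng.standard_normal((m, d))

f = lambda x: 0.5 * x @ P @ x
h = lambda x: A @ x - b

# A dual-feasible point (alpha free, beta >= 0) and the aggregation map M,
# chosen so that alpha lies in range(M) (here simply M = alpha as a column).
alpha, beta = rng.standard_normal(n), np.abs(rng.standard_normal(m))
M = alpha.reshape(n, k)

# Choose e so that (P') is strictly feasible: take x_feas with M^T h(x_feas) = 0.
a, c = (M.T @ A).ravel(), (M.T @ b).item()
x_feas = a * c / (a @ a)
e = C @ x_feas + 1.0
g = lambda x: C @ x - e

# Dual value omega_tilde = min_x L(x, alpha, beta), in closed form for this instance.
x_L = np.linalg.solve(P, -(A.T @ alpha + C.T @ beta))
omega_tilde = f(x_L) + alpha @ h(x_L) + beta @ g(x_L)

# Aggregated problem (P'):  min f(x)  s.t.  M^T h(x) = 0,  g(x) <= 0.
res = minimize(f, x_feas, method="SLSQP",
               constraints=[{"type": "eq",   "fun": lambda x: M.T @ h(x)},
                            {"type": "ineq", "fun": lambda x: -g(x)}])
print(f"w' = {res.fun:.4f}  >=  omega_tilde = {omega_tilde:.4f}")
```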

Before proceeding with the proof of the main results, let us first formulate a performance estimation problem for the class of methods described by (14).

Lemma 8

Let \( R_x\ge 0\) and let \(\{\beta _{i,j}\}_{1\le i\le N, 0\le j\le i-1}\), \(\{\gamma _{i,j}\}_{1\le i\le N, 1\le j\le i}\) be some given sets of real numbers. Then, for any pair \((f, x_0)\) such that \(f\in {\mathcal {F}}(\mathbb {R}^d)\) and \({\left||x_0-x_*\right||}\le R_x\) (where \(x_*\in {{\,\mathrm{argmin}\,}}_x f(x)\)), and for any sequence \(\{x_i\}_{1\le i\le N}\) that satisfies

$$\begin{aligned} {\left\langle f'(x_i), \sum _{j=0}^{i-1}\beta _{i,j} f'(x_j) + \sum _{j=1}^{i} \gamma _{i,j}(x_j-x_0)\right\rangle }=0, \quad i=1,\ldots ,N \end{aligned}$$
(30)

for some \(f'(x_i)\in \partial f(x_i)\), the following bound holds:

$$\begin{aligned}&f(x_N)-f_*\le \sup _{ F\in {\mathbb {R}}^{N+1}, G\in {\mathbb {R}}^{2N+2\times 2N+2}} F^\top \mathbf {f}_N - F^\top \mathbf {f}_* \\&\quad \begin{array}{lrl} \text {subject to: } &{}{{{\,\mathrm{Tr}\,}}\left( A^{\mathrm {ic}}_kG\right) }+(a^{\mathrm {ic}}_k)^\top F+b^{\mathrm {ic}}_k\le 0, &{} \quad \text {for all } k\in K_N,\\ &{}{\left\langle \mathbf {g}_i, \sum \limits _{j=0}^{i-1} \beta _{i,j}\mathbf {g}_j + \sum \limits _{j=1}^{i} \gamma _{i,j}(\mathbf {x}_j-\mathbf {x}_0)\right\rangle }_G = 0, &{}\quad \text {for all } i=1,\ldots N,\\ &{}{\left||\mathbf {x}_0-\mathbf {x}_*\right||}_G^2-R_x^2\le 0, &{}\\ &{} G\succeq 0. \end{array} \end{aligned}$$

We omit the proof, as it follows exactly the same lines as the derivation of (sdp-PEP-GFOM) (cf. the derivations in [13, 50]).
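
For readers who want to see what such a Gram-matrix performance estimation problem looks like when set up and solved numerically, here is a small self-contained sketch for the simpler setting of fixed-step gradient descent on \(L\)-smooth convex functions (so not GFOM, and not the formulation of Lemma 8), written with cvxpy and the smooth convex interpolation conditions in the spirit of [50]; all naming choices in the code are our own.

```python
import numpy as np
import cvxpy as cp

L, N, R = 1.0, 3, 1.0            # smoothness constant, number of steps, ||x_0 - x_*|| bound
h = 1.0 / L                      # fixed step size of plain gradient descent
dim = N + 2                      # Gram basis: [x_0 - x_*, g_0, ..., g_N]

G = cp.Variable((dim, dim), PSD=True)   # Gram matrix of the basis vectors
F = cp.Variable(N + 1)                  # F[i] = f(x_i) - f(x_*)

def x_vec(i):                    # coordinates of x_i - x_* in the basis
    v = np.zeros(dim)
    v[0] = 1.0
    for j in range(i):
        v[1 + j] -= h            # x_i = x_0 - h * sum_{j<i} g_j
    return v

def g_vec(i):                    # coordinates of g_i in the basis
    v = np.zeros(dim)
    v[1 + i] = 1.0
    return v

def inner(u, v):                 # <u, v> expressed through the Gram matrix G
    return cp.sum(cp.multiply(np.outer(u, v), G))

pts = list(range(N + 1)) + ["*"]                 # iterates plus the minimizer
fv = lambda i: 0 if i == "*" else F[i]           # f_* normalized to 0
gv = lambda i: np.zeros(dim) if i == "*" else g_vec(i)
xv = lambda i: np.zeros(dim) if i == "*" else x_vec(i)

# Smooth convex interpolation conditions over all ordered pairs of points.
constraints = [inner(x_vec(0), x_vec(0)) <= R ** 2]
for i in pts:
    for j in pts:
        if i != j:
            constraints.append(
                fv(i) >= fv(j) + inner(gv(j), xv(i) - xv(j))
                + inner(gv(i) - gv(j), gv(i) - gv(j)) / (2 * L)
            )

prob = cp.Problem(cp.Maximize(F[N]), constraints)
prob.solve()
print("worst-case f(x_N) - f_*:", prob.value)    # about L * R**2 / (4*N + 2)
```

With these parameters the solver returns a value close to \(LR_x^2/(4N+2)\), which matches the known tight worst-case bound for this method from [13].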

Proof of Theorem 4

The key observation underlying the proof is that by taking the PEP for GFOM (sdp-PEP-GFOM) and aggregating the constraints that define its iterates, we can reach a PEP for the class of methods (14). Furthermore, by Lemma 7, this aggregation can be done in a way that maintains the optimal value of the program, thereby reaching a specific method in this class whose corresponding PEP attains an optimal value that is at least as good as that of the PEP for GFOM.

We perform the aggregation of the constraints as follows: for all \(i=1,\dots ,N\) we aggregate the constraints which correspond to \(\{\beta _{i,j}\}_{0\le j<i}\), \(\{\gamma _{i,j}\}_{1\le j\le i}\) (weighted by \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\), respectively) into a single constraint, reaching

$$\begin{aligned} w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x) := \sup _{ F\in {\mathbb {R}}^{N+1}, G\in {\mathbb {R}}^{2N+2\times 2N+2}}&\ F^\top \mathbf {f}_N - F^\top \mathbf {f}_* \\ \text {subject to: }&\ {{\,\mathrm{Tr}\,}}\left( A^{\mathrm {ic}}_kG\right) +(a^{\mathrm {ic}}_k)^\top F+b^{\mathrm {ic}}_k\le 0, \quad \text {for all } k\in K_N,\\&\ {\left\langle \mathbf {g}_i, \sum \limits _{j=0}^{i-1} {\tilde{\beta }}_{i,j}\mathbf {g}_j + \sum \limits _{j=1}^{i} {\tilde{\gamma }}_{i,j}(\mathbf {x}_j-\mathbf {x}_0)\right\rangle }_G = 0, \quad \text {for all } i=1,\ldots ,N,\\&\ {\left||\mathbf {x}_0-\mathbf {x}_*\right||}_G^2-R_x^2\le 0, \\&\ G\succeq 0. \end{aligned}$$

By Lemma 7 and the choice of weights \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\) it follows that

$$\begin{aligned} w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x) \le {\tilde{\omega }}. \end{aligned}$$

Finally, by Lemma 8, we conclude that \(w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x)\) forms an upper bound on the performance of the method (14), i.e., for any valid pair \((f, x_0)\) and any \(\{x_i\}_{i\ge 0}\) that satisfies (14) we have

$$\begin{aligned} f(x_N)-f_*\le w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x)\le {\tilde{\omega }}. \end{aligned}$$

\(\square \)

Cite this article

Drori, Y., Taylor, A.B. Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184, 183–220 (2020). https://doi.org/10.1007/s10107-019-01410-2
