Abstract
We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.
Notes
This work appeared concurrently with our manuscript.
A function is called semi-algebraic if its graph decomposes into a finite union of sets, each defined by finitely many polynomial inequalities.
Perhaps more appropriate would be the terms active strict saddle and the active strict saddle property. For brevity, we omit the word “active.”
Weak convexity is not essential here, provided one modifies the definitions appropriately. Moreover, this guarantee holds more generally for functions definable in an o-minimal structure.
The domain of \(d^2 f_{\mathcal {M}}(\bar{y})(u|\cdot )\) consists of w satisfying \((\langle \nabla ^2 G_1(\bar{y})u,u\rangle ,\ldots , \langle \nabla ^2 G_{n-r}(\bar{y})u,u\rangle )=-\nabla G(\bar{y})w\), where \(G_i\) are the coordinate functions of G.
What we call an active manifold here is called an identifiable manifold in [19]—the reference we most closely follow. The term active is more evocative in the context of the current work.
Note that due to the convention \(\inf _{\emptyset }=+\infty \), the entire space \(\mathcal {M}=\mathbb {R}^d\) is the active manifold for a globally \(C^p\)-smooth function f around any of its critical points.
Better terminology would be the terms active strict saddle and the active strict saddle property. To streamline the notation, we omit the word active, as it should be clearly understood from context.
A function is semi-algebraic if its graph can be written as a finite union of sets each cut out by finitely many polynomial inequalities.
For example, let F be a \(C^2\) function defined on a neighborhood U of \(\bar{x}\) that agrees with f on \(U\cap \mathcal {M}\). Using a partition of unity (e.g., [36, Lemma 2.26]), one can extend F from a slightly smaller neighborhood to be \(C^2\) on all of \(\mathbb {R}^d\).
References
F. Al-Khayyal and J. Kyparisis. Finite convergence of algorithms for nonlinear programs and variational inequalities. J. Optim. Theory Appl., 70(2):319–332, 1991.
P. Albano and P. Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.
H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
D. Avdiukhin, C. Jin, and G. Yaroslavtsev. Escaping saddle points with inequality constraints via noisy sticky projected gradient descent. Optimization for Machine Learning Workshop, 2019.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
J.F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer, New York, 2000.
J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming, 33(3):260–279, 1985.
J.V. Burke. On the identification of active constraints. II. The nonconvex case. SIAM J. Numer. Anal., 27(4):1081–1103, 1990.
J.V. Burke and J.J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
P.H. Calamai and J.J. Moré. Projected gradient methods for linearly constrained problems. Math. Prog., 39(1):93–116, 1987.
V. Charisopoulos, Y. Chen, D. Davis, M. Díaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. Foundations of Computational Mathematics, pages 1–89, 2021.
F.H. Clarke, Yu. Ledyaev, R.I. Stern, and P.R. Wolenski. Nonsmooth Analysis and Control Theory. Texts in Math. 178, Springer, New York, 1998.
C. Criscitiello and N. Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News, 26(2), 2018.
D. Drusvyatskiy, A.D. Ioffe, and A.S. Lewis. Generic minimizing behavior in semialgebraic optimization. SIAM Journal on Optimization, 26(1):513–534, 2016.
D. Drusvyatskiy and A.S. Lewis. Optimality, identifiablity, and sensitivity. Math. Program., 147(1-2, Ser. A):467–498, 2014.
D. Drusvyatskiy and A.S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.
S.S. Du, C. Jin, J.D. Lee, M.I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
J.C. Dunn. On the convergence of projected gradient processes to singular critical points. J. Optim. Theory Appl., 55(2):203–216, 1987.
M.C. Ferris. Finite termination of the proximal point algorithm. Math. Program., 50(3, (Ser. A)):359–366, 1991.
S.D. Flåm. On finite convergence and constraint identification of subgradient projection methods. Math. Program., 57:427–437, 1992.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1233–1242. JMLR.org, 2017.
R. Ge, J.D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
N. Hallak and M. Teboulle. Finding second-order stationary points in constrained minimization: A feasible direction approach. Journal of Optimization Theory and Applications, 186(2):480–503, 2020.
W.L. Hare and A.S. Lewis. Identifying active manifolds. Algorithmic Oper. Res., 2(2):75–82, 2007.
C. Jin, P. Netrapalli, and M. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, and M.I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR.org, 2017.
J.D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M.I. Jordan, and B. Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.
J.D. Lee, M. Simchowitz, M.I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
J.M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.
Sangkyun Lee and Stephen J Wright. Manifold identification in dual averaging for regularized stochastic online learning. Journal of Machine Learning Research, 13(Jun):1705–1744, 2012.
C. Lemaréchal, F. Oustry, and C. Sagastizábal. The U-Lagrangian of a convex function. Trans. Amer. Math. Soc., 352:711–729, 1996.
A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002.
A.S. Lewis and S.J. Wright. A proximal method for composite minimization. Math. Program., pages 1–46, 2015.
A.S. Lewis and S. Zhang. Partial smoothness, tilt stability, and generalized Hessians. SIAM Journal on Optimization, 23(1):74–94, 2013.
B. Martinet. Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Rech. Opérationnelle, 4(Sér. R-3):154–158, 1970.
B. Martinet. Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox. C. R. Acad. Sci. Paris Sér. A-B, 274:A163–A165, 1972.
A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629–3639, 2018.
B.S. Mordukhovich. Variational analysis and generalized differentiation. I, volume 330 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2006. Basic theory.
J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
Yu. Nesterov. Modified Gauss–Newton scheme with worst case guarantees for global performance. Optimisation Methods and Software, 22(3):469–483, 2007.
Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
M. Nouiehed, J.D. Lee, and M. Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.
E.A. Nurminskii. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.
I. Panageas and G. Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
E. Pauwels. The value function approach to convergence analysis in composite optimization. Operations Research Letters, 44(6):790–795, 2016.
R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996.
R.T. Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optimization, 14(5):877–898, 1976.
R.T. Rockafellar. Favorable classes of Lipschitz-continuous functions in subgradient optimization. In Progress in nondifferentiable optimization, volume 8 of IIASA Collaborative Proc. Ser. CP-82, pages 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg, 1982.
R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.
S. Rolewicz. On paraconvex multifunctions. In Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, volume 31 of Operations Res. Verfahren, pages 539–546. Hain, Königstein/Ts., 1979.
A. Shapiro. Second order sensitivity analysis and asymptotic theory of parametrized nonlinear programs. Mathematical Programming, 33(3):280–299, 1985.
M. Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.
J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
Y. Sun, N. Flammarion, and M. Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
S.J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31:1063–1079, 1993.
Acknowledgements
We thank John Duchi for his insightful comments on an early version of the manuscript. We also thank the anonymous referees for numerous suggestions that have improved the readability of the paper.
Additional information
Communicated by Michael Overton.
D. Drusvyatskiy: Research of Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.
Appendices
Proofs of Theorems 2.9 and 5.2
In this section, we prove Theorems 2.9 and 5.2. Theorem 2.9, appropriately restated, holds well beyond the class of weakly convex functions. To reduce the notational overhead, however, we impose the weak convexity assumption throughout.
We will require some basic notation from variational analysis; for details, we refer the reader to [57]. A set-valued map \(F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m\) assigns to each point \(x\in \mathbb {R}^d\) a set F(x) in \(\mathbb {R}^m\). The graph of F is defined by
$$\begin{aligned} \mathrm{gph}\,F:=\{(x,v)\in \mathbb {R}^d\times \mathbb {R}^m: v\in F(x)\}. \end{aligned}$$
A map \(F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m\) is called metrically regular at \((\bar{x},\bar{v})\in \mathrm{gph}\,F\) if there exists a constant \(\kappa >0\) such that the estimate holds:
$$\begin{aligned} \mathrm{dist}(x,F^{-1}(v))\le \kappa \cdot \mathrm{dist}(v,F(x)) \end{aligned}$$
for all x near \(\bar{x}\) and all v near \(\bar{v}\). If the graph \(\mathrm{gph}\,F\) is a \(C^1\)-smooth manifold around \((\bar{x},\bar{v})\), then metric regularity at \((\bar{x},\bar{v})\) is equivalent to the condition [57, Theorem 9.43(d)]:
$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,F}(\bar{x},\bar{v})\quad \Longrightarrow \quad u=0. \end{aligned}\tag{A.1}$$
We begin with the following lemma.
Lemma A.1
(Subdifferential metric regularity in smooth minimization). Consider the optimization problem
$$\begin{aligned} \min _{x\in \mathcal {M}}\, f(x), \end{aligned}$$
where \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) is a \(C^2\)-smooth function and \(\mathcal {M}\) is a \(C^2\)-smooth manifold. Let \(\bar{x}\in \mathcal {M}\) satisfy the criticality condition \(0\in \partial f_{\mathcal {M}}(\bar{x})\) and suppose that the subdifferential map \(\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Then, the guarantee holds:
$$\begin{aligned} \inf _{u\in T_{\mathcal {M}}(\bar{x}),\ \Vert u\Vert =1} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}\tag{A.2}$$
Proof
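First, appealing to (A.1), we conclude that the implication holds:
$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,\partial f_{\mathcal {M}}}(\bar{x},0)\quad \Longrightarrow \quad u=0. \end{aligned}\tag{A.3}$$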
Let us now interpret the condition (A.3) in Lagrangian terms. To this end, let \(G=0\) be the local defining equations for \(\mathcal {M}\) around \(\bar{x}\). Define the Lagrangian function
$$\begin{aligned} \mathcal {L}(x,\lambda ):=f(x)+\langle \lambda ,G(x)\rangle , \end{aligned}$$
and let \(\bar{\lambda }\) be the unique Lagrange multiplier vector satisfying \(\nabla _x \mathcal {L}(\bar{x},\bar{\lambda })=0\). According to [41, Corollary 2.9], we have the following expression:
where \(L:=\nabla ^2_{xx}\mathcal {L}(\bar{x},\bar{\lambda })\) denotes the Hessian of the Lagrangian. Combining (A.3) and (A.4), we deduce that the only vector \(u\in T_{\mathcal {M}}(\bar{x})\) satisfying \(L u\in N_{\mathcal {M}}(\bar{x})\) is the zero vector \(u=0\).
Now for the sake of contradiction, suppose that (A.2) fails. Then, the quadratic form \(Q(u)=\langle L u,u\rangle \) is nonnegative on \(T_{\mathcal {M}}(\bar{x})\) and there exists \(0\ne \bar{u}\in T_{\mathcal {M}}(\bar{x})\) satisfying \(Q(\bar{u})=0\). We deduce that \(\bar{u}\) minimizes \(Q(\cdot )\) on the linear subspace \(T_{\mathcal {M}}(\bar{x})\). First-order optimality then shows that the gradient \(\nabla Q(\bar{u})=2L\bar{u}\) is orthogonal to \(T_{\mathcal {M}}(\bar{x})\), and therefore, the inclusion \(L\bar{u}\in N_{\mathcal {M}}(\bar{x})\) holds. Since \(\bar{u}\ne 0\), this contradicts the conclusion of the previous paragraph. \(\square \)
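The quadratic form \(Q(u)=\langle L u,u\rangle \) on the tangent space can be inspected numerically. The following is a minimal sketch, not part of the original argument: it assumes the Lagrangian convention \(f(x)+\langle \lambda ,G(x)\rangle \), and the helper names (`tangent_basis`, `restricted_eigs`, `circle_example`) and the test problem (minimizing \(f(x)=x_1\) over the unit circle) are hypothetical choices made for illustration.

```python
import numpy as np

def tangent_basis(jac_G):
    """Orthonormal basis (as columns) of ker(jac_G), the tangent space of {G = 0}."""
    jac_G = np.atleast_2d(jac_G)
    _, s, vt = np.linalg.svd(jac_G)
    rank = int(np.sum(s > 1e-12))
    return vt[rank:].T

def restricted_eigs(grad_f, hess_f, jac_G, hess_G_list):
    """Eigenvalues of the Lagrangian Hessian L restricted to the tangent space.

    Assumed convention (an assumption of this sketch):
    Lagrangian(x, lam) = f(x) + <lam, G(x)>.
    """
    jac_G = np.atleast_2d(jac_G)
    # Multiplier solving grad_f + jac_G^T lam = 0 (least squares; exact at a critical point).
    lam = np.linalg.lstsq(jac_G.T, -grad_f, rcond=None)[0]
    L = hess_f + sum(l * H for l, H in zip(lam, hess_G_list))
    B = tangent_basis(jac_G)
    return np.linalg.eigvalsh(B.T @ L @ B)

def circle_example(x_bar):
    """Hypothetical test problem: f(x) = x_1 on the circle G(x) = x_1^2 + x_2^2 - 1 = 0."""
    grad_f = np.array([1.0, 0.0])
    hess_f = np.zeros((2, 2))
    jac_G = 2.0 * np.asarray(x_bar, dtype=float)
    hess_G = 2.0 * np.eye(2)
    return restricted_eigs(grad_f, hess_f, jac_G, [hess_G])

eigs_at_min = circle_example([-1.0, 0.0])  # Q is positive on the tangent line
eigs_at_max = circle_example([1.0, 0.0])   # Q is negative on the tangent line
print(eigs_at_min, eigs_at_max)
```

At both critical points of this toy problem the restricted quadratic form is nondegenerate, so neither point exhibits the degenerate situation that the proof by contradiction rules out.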
The following corollary for active manifolds will now quickly follow.
Corollary A.2
(Subdifferential metric regularity and active manifolds). Consider a closed and weakly convex function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{\infty \}\). Suppose that f admits a \(C^2\)-smooth active manifold around a critical point \(\bar{x}\) and that the subdifferential map \(\partial f:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Then, \(\bar{x}\) is either a strong local minimizer of f or satisfies the curvature condition \(d^2 f_{\mathcal {M}}(\bar{x})(u)<0\) for some \(u\in T_{\mathcal {M}}(\bar{x})\).
Proof
The result [19, Proposition 10.2] implies that \(\mathrm{gph}\,\partial f\) coincides with \(\mathrm{gph}\,\partial f_{\mathcal {M}}\) on a neighborhood of \((\bar{x},0)\). Therefore, the subdifferential map \(\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Using Lemma A.1, we obtain the guarantee:
$$\begin{aligned} \inf _{u\in T_{\mathcal {M}}(\bar{x}),\ \Vert u\Vert =1} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}$$
If the infimum is strictly negative, the proof is complete. Otherwise, the infimum is strictly positive. In this case, \(\bar{x}\) is a strong local minimizer of \(f_{\mathcal {M}}\), and therefore by [19, Proposition 7.2] a strong local minimizer of f. \(\square \)
We are now ready for the proofs of Theorems 2.9 and 5.2.
Proof of Theorem 2.9
The result [18, Corollary 4.8] shows that for almost all \(v\in \mathbb {R}^d\), the function \(g(x):=f(x)-\langle v,x\rangle \) has at most finitely many critical points. Moreover, each such critical point \(\bar{x}\) lies on some \(C^2\) active manifold \(\mathcal {M}\) of g and the subdifferential map \(\partial g:\mathbb {R}^d\rightrightarrows \mathbb {R}^d\) is metrically regular at \((\bar{x},0)\). Applying Corollary A.2 to g for such generic vectors v, we deduce that every critical point \(\bar{x}\) of g is either a strong local minimizer or a strict saddle of g. The proof is complete. \(\square \)
Proof of Theorem 5.2
The proof is identical to that of Theorem 2.9 with [18, Theorem 5.2] playing the role of [18, Corollary 4.8]. \(\square \)
Pathological Example
Theorem B.1
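Consider the following function
$$\begin{aligned} f(x,y)=\frac{1}{2}(|x|+|y|)^2-\frac{\rho }{2}x^2. \end{aligned}$$
Assume that \(\lambda > \rho \). Define a mapping \(S :\mathbb {R}^2 \rightarrow \mathbb {R}^2\) by the following formula.
$$\begin{aligned} S(x,y)={\left\{ \begin{array}{ll} \left( 0, \frac{\lambda }{1+\lambda }y\right) &{} \text {if } |x| \le \frac{1}{1+\lambda }|y|,\\ \left( \frac{\lambda }{1+\lambda -\rho }x, 0\right) &{} \text {if } |y| \le \frac{1}{1+\lambda -\rho }|x|, \end{array}\right. } \end{aligned}$$
and if \(\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|\), we have
$$\begin{aligned} S(x,y)={\left\{ \begin{array}{ll} \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} &{} \text {if } \mathrm {sign}(x) = \mathrm {sign}(y),\\ \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix} &{} \text {if } \mathrm {sign}(x) \ne \mathrm {sign}(y). \end{array}\right. } \end{aligned}$$
Then, \(\mathrm{prox}_{(1/\lambda ) f}(x, y) = S(x,y)\).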
Proof
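Let us denote the components of S(x, y) by \((x_+, y_+) = S(x,y)\). By first-order optimality conditions, we have \(\mathrm{prox}_{(1/\lambda ) f}(x, y) = (x_+, y_+) \) if and only if
$$\begin{aligned} \lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) \in {\left\{ \begin{array}{ll} \{(x_+ + \mathrm {sign}(x_+) |y_+|,\ y_+ + \mathrm {sign}(y_+) |x_+|)\} &{} \text {if } x_+ \ne 0 \text { and } y_+ \ne 0,\\ ([-1, 1]y_+) \times \{y_+\} &{} \text {if } x_+ = 0,\\ \{x_+\}\times ([-1, 1]x_+) &{} \text {if } y_+ = 0. \end{array}\right. } \end{aligned}$$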
Let us show that \((x_+, y_+)\) indeed satisfies this inclusion.
1. If \((x,y) = 0\), then \(x_+ = y_+ = 0\), and the pair satisfies the inclusion.
2. If \(|x| \le \frac{1}{1 + \lambda }|y|\) and \(y \ne 0\), then \(x_+ = 0\), \(y_+ = \frac{\lambda }{1+\lambda }y\), and
$$\begin{aligned} \lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) = \lambda \left( x, \frac{1}{1 + \lambda }y\right) \in ([-1, 1]y_+) \times \{y_+\}. \end{aligned}$$
Thus, the pair satisfies the inclusion.
3. If \(|y| \le \frac{1}{1+\lambda -\rho }|x|\) and \(x \ne 0\), then \(x_+ = \frac{\lambda }{(1+\lambda - \rho )}x\), \(y_+ = 0\), and
$$\begin{aligned}&\lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+)\\&= \lambda \left( x - \frac{\lambda -\rho }{(1+\lambda - \rho )}x, y\right) \in \{x_+\}\times ([-1, 1]x_+). \end{aligned}$$
Thus, the pair satisfies the inclusion.
For the remaining two cases, let us assume that \(\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|\).
4. If \(\mathrm {sign}(x) = \mathrm {sign}(y)\), let \(s = \mathrm {sign}(x)\) and note that
$$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ -|x| + (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$
From this equation we learn \(\mathrm {sign}(x_+) = \mathrm {sign}(y_+) = s\). Inverting the matrix, we also learn
$$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} 1 \\ 1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ + y_+ \\ x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$
Thus, the pair satisfies the inclusion.
5. If \(\mathrm {sign}(x) \ne \mathrm {sign}(y)\), let \(s = \mathrm {sign}(x)\) and note that
$$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ |x| - (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$
From this equation we learn \(\mathrm {sign}(x_+) \ne \mathrm {sign}(y_+) \). Inverting the matrix, we also learn
$$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} -1 \\ -1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ - y_+ \\ -x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$
Thus, the pair satisfies the inclusion.
Therefore, the proof is complete. \(\square \)
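The closed form can also be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes \(f(x,y)=\frac{1}{2}(|x|+|y|)^2-\frac{\rho }{2}x^2\), which is the function consistent with the inclusions verified in cases 1 to 5, and the parameter values \(\lambda =3\), \(\rho =2\) are arbitrary choices with \(\lambda >\rho \). It compares the proximal objective \(f(z)+\frac{\lambda }{2}\Vert z-(x,y)\Vert ^2\) at \(S(x,y)\) against randomly sampled competitors.

```python
import numpy as np

def f(x, y, rho):
    # Assumed form of the function in Theorem B.1 (see the lead-in).
    return 0.5 * (abs(x) + abs(y)) ** 2 - 0.5 * rho * x ** 2

def S(x, y, lam, rho):
    """Piecewise-linear candidate for prox_{(1/lam) f}(x, y), following cases 1-5."""
    a, b = 1.0 + lam, 1.0 + lam - rho
    if abs(x) <= abs(y) / a:          # cases 1 and 2
        return 0.0, lam / a * y
    if abs(y) <= abs(x) / b:          # case 3
        return lam / b * x, 0.0
    # cases 4 and 5: off-diagonal sign depends on whether the signs agree
    sigma = 1.0 if np.sign(x) == np.sign(y) else -1.0
    c = lam / (a * b - 1.0)
    return c * (a * x - sigma * y), c * (-sigma * x + b * y)

def prox_objective(z, p, lam, rho):
    return f(z[0], z[1], rho) + 0.5 * lam * ((z[0] - p[0]) ** 2 + (z[1] - p[1]) ** 2)

rng = np.random.default_rng(0)
lam, rho = 3.0, 2.0                    # arbitrary test values with lam > rho
worst_gap = np.inf
for _ in range(200):
    p = rng.uniform(-2.0, 2.0, size=2)
    z = np.array(S(p[0], p[1], lam, rho))
    best = prox_objective(z, p, lam, rho)
    trials = z + rng.normal(scale=0.5, size=(500, 2))
    vals = np.array([prox_objective(t, p, lam, rho) for t in trials])
    worst_gap = min(worst_gap, vals.min() - best)

# No sampled competitor should beat S(p), up to rounding.
print(worst_gap >= -1e-9)
```

Because \(\lambda >\rho \), the proximal objective is strongly convex, so a point that no sampled competitor improves upon is consistent with \(S(x,y)\) being the unique minimizer. Random sampling of this kind is only a spot check and does not replace the case analysis above.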
Corollary B.2
(Convergence to Saddles). Assume the setting of Theorem B.1. Let \(\alpha \in (0, 1]\) and define the operator \(T := (1-\alpha ) I + \alpha S\) on \(\mathbb {R}^2\). Then, the cone \( \mathcal {K}= \{(x,y) :|x| \le (1+\lambda )^{-1}y\} \) satisfies \(T\mathcal {K}\subseteq \mathcal {K}\). Moreover, for any \((x, y) \in \mathcal {K}\), the iterates \(T^k(x,y) = ((1-\alpha )^k x, (1 - \alpha (1 - \lambda (1+\lambda )^{-1}))^ky)\) converge linearly to the origin as k tends to infinity.
Proof
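Since \(\mathcal {K}\) is convex and \(T(x,y)\) is a convex combination of \((x,y)\) and \(S(x,y)\), it suffices to show that \(S \mathcal {K}\subseteq \mathcal {K}\). This follows from Theorem B.1: for \((x,y)\in \mathcal {K}\) we have \(y\ge 0\) and \(S(x,y)=(0,\frac{\lambda }{1+\lambda }y)\in \mathcal {K}\). The formula for \(T^k\) then follows by induction. \(\square \)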
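The behavior described in Corollary B.2 is easy to observe numerically. The following minimal sketch iterates \(T\) from a starting point inside \(\mathcal {K}\) and compares the iterates with the closed form stated in the corollary. The values \(\lambda =3\), \(\alpha =0.5\), the starting point \((0.2, 1)\), and the helper names are arbitrary choices for illustration. On \(\mathcal {K}\) the sketch uses \(S(x,y)=(0,\frac{\lambda }{1+\lambda }y)\), which is the branch of Theorem B.1 that applies there.

```python
def S_on_cone(x, y, lam):
    """Theorem B.1 restricted to K: S(x, y) = (0, lam / (1 + lam) * y)."""
    return 0.0, lam / (1.0 + lam) * y

def T(x, y, lam, alpha):
    """Relaxed step T = (1 - alpha) I + alpha S, valid for points in K."""
    sx, sy = S_on_cone(x, y, lam)
    return (1.0 - alpha) * x + alpha * sx, (1.0 - alpha) * y + alpha * sy

def in_cone(x, y, lam):
    return abs(x) <= y / (1.0 + lam)

lam, alpha = 3.0, 0.5          # arbitrary illustrative values
x0, y0 = 0.2, 1.0              # starting point inside K
x, y = x0, y0
stayed_in_cone = in_cone(x, y, lam)
max_err = 0.0
for k in range(1, 41):
    x, y = T(x, y, lam, alpha)
    stayed_in_cone = stayed_in_cone and in_cone(x, y, lam)
    # Closed form from Corollary B.2.
    cx = (1.0 - alpha) ** k * x0
    cy = (1.0 - alpha * (1.0 - lam / (1.0 + lam))) ** k * y0
    max_err = max(max_err, abs(x - cx), abs(y - cy))

print(stayed_in_cone, max_err, (x, y))
```

Every point of \(\mathcal {K}\) therefore generates a sequence that contracts to the origin, so the set of initial conditions attracted to the origin has nonempty interior.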
Cite this article
Davis, D., Drusvyatskiy, D. Proximal Methods Avoid Active Strict Saddles of Weakly Convex Functions. Found Comput Math 22, 561–606 (2022). https://doi.org/10.1007/s10208-021-09516-w