Proximal Methods Avoid Active Strict Saddles of Weakly Convex Functions

Davis, Damek; Drusvyatskiy, Dmitriy

doi:10.1007/s10208-021-09516-w

Published: 03 May 2021

Volume 22, pages 561–606, (2022)

Foundations of Computational Mathematics

Abstract

We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.

Notes

This work appeared concurrently with our manuscript.
Weakly convex functions also go by other names, such as lower-$C^2$, uniformly prox-regular, paraconvex, and semiconvex. We refer the reader to the seminal works on the topic [2, 50, 53, 56, 58].
A function is called semi-algebraic if its graph decomposes into a finite union of sets, each defined by finitely many polynomial inequalities.
Perhaps more appropriate would be the terms active strict saddle and the active strict saddle property. For brevity, we omit the word “active.”
Weak convexity is not essential here, provided one modifies the definitions appropriately. Moreover, this guarantee holds more generally for functions definable in an o-minimal structure.
The domain of $d^2 f_{\mathcal {M}}(\bar{y})(u|\cdot )$ consists of w satisfying $(\langle \nabla ^2 G_1(\bar{y})u,u\rangle ,\ldots , \langle \nabla ^2 G_{n-r}(\bar{y})u,u\rangle )=-\nabla G(\bar{y})w$, where $G_i$ are the coordinate functions of G.
What we call an active manifold here is called an identifiable manifold in [19]—the reference we most closely follow. The term active is more evocative in the context of the current work.
Note that due to the convention $\inf _{\emptyset }=+\infty $, the entire space $\mathcal {M}=\mathbb {R}^d$ is the active manifold for a globally $C^p$-smooth function f around any of its critical points.
Better terminology would be the terms active strict saddle and the active strict saddle property. To streamline the notation, we omit the word active, as it should be clearly understood from context.
A function is semi-algebraic if its graph can be written as a finite union of sets each cut out by finitely many polynomial inequalities.
For example, let F be a $C^2$ function defined on a neighborhood U of $\bar{x}$ that agrees with f on $U\cap \mathcal {M}$. Using a partition of unity (e.g., [36, Lemma 2.26]), one can extend F from a slightly smaller neighborhood to be $C^2$ on all of $\mathbb {R}^d$.
We should note that metric regularity of F at $(\bar{x},\bar{v})$ is equivalent to (A.1) for an arbitrary set-valued map F with closed graph, provided we interpret $N_{\mathrm{gph}\,F}(\bar{x},\bar{v})$ as the limiting normal cone [57, Definition 6.3].

References

F. Al-Khayyal and J. Kyparisis. Finite convergence of algorithms for nonlinear programs and variational inequalities. J. Optim. Theory Appl., 70(2):319–332, 1991.
P. Albano and P. Cannarsa. Singularities of semiconcave functions in Banach spaces. In Stochastic analysis, control, optimization and applications, Systems Control Found. Appl., pages 171–190. Birkhäuser Boston, Boston, MA, 1999.
H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
D. Avdiukhin, C. Jin, and G. Yaroslavtsev. Escaping saddle points with inequality constraints via noisy sticky projected gradient descent. Optimization for Machine Learning Workshop, 2019.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
J.F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer, New York, 2000.
J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Math. Programming, 33(3):260–279, 1985.
J.V. Burke. On the identification of active constraints. II. The nonconvex case. SIAM J. Numer. Anal., 27(4):1081–1103, 1990.
J.V. Burke and J.J. Moré. On the identification of active constraints. SIAM J. Numer. Anal., 25(5):1197–1211, 1988.
P.H. Calamai and J.J. Moré. Projected gradient methods for linearly constrained problems. Math. Prog., 39(1):93–116, 1987.
V. Charisopoulos, Y. Chen, D. Davis, M. Díaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. Foundations of Computational Mathematics, pages 1–89, 2021.
F.H. Clarke, Yu. Ledyaev, R.I. Stern, and P.R. Wolenski. Nonsmooth Analysis and Control Theory. Texts in Math. 178, Springer, New York, 1998.
C. Criscitiello and N. Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News, 26(2), 2018.
D. Drusvyatskiy, A.D. Ioffe, and A.S. Lewis. Generic minimizing behavior in semialgebraic optimization. SIAM Journal on Optimization, 26(1):513–534, 2016.
D. Drusvyatskiy and A.S. Lewis. Optimality, identifiability, and sensitivity. Math. Program., 147(1-2, Ser. A):467–498, 2014.
D. Drusvyatskiy and A.S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.
S.S. Du, C. Jin, J.D. Lee, M.I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in neural information processing systems, pages 1067–1077, 2017.
J.C. Duchi and F. Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
J.C. Dunn. On the convergence of projected gradient processes to singular critical points. J. Optim. Theory Appl., 55(2):203–216, 1987.
M.C. Ferris. Finite termination of the proximal point algorithm. Math. Program., 50(3, (Ser. A)):359–366, 1991.
S.D. Flåm. On finite convergence and constraint identification of subgradient projection methods. Math. Program., 57:427–437, 1992.
R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1233–1242. JMLR.org, 2017.
R. Ge, J.D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
N. Hallak and M. Teboulle. Finding second-order stationary points in constrained minimization: A feasible direction approach. Journal of Optimization Theory and Applications, 186(2):480–503, 2020.
W.L. Hare and A.S. Lewis. Identifying active manifolds. Algorithmic Oper. Res., 2(2):75–82, 2007.
C. Jin, P. Netrapalli, and M. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
C. Jin, R. Ge, P. Netrapalli, S.M. Kakade, and M.I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR.org, 2017.
J.D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M.I. Jordan, and B. Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.
J.D. Lee, M. Simchowitz, M.I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
J.M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer, 2013.
Sangkyun Lee and Stephen J Wright. Manifold identification in dual averaging for regularized stochastic online learning. Journal of Machine Learning Research, 13(Jun):1705–1744, 2012.
C. Lemaréchal, F. Oustry, and C. Sagastizábal. The U-Lagrangian of a convex function. Trans. Amer. Math. Soc., 352:711–729, 1996.
A.S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM J. Optim., 13(3):702–725 (electronic) (2003), 2002.
A.S. Lewis and S.J. Wright. A proximal method for composite minimization. Math. Program., pages 1–46, 2015.
A.S. Lewis and S. Zhang. Partial smoothness, tilt stability, and generalized Hessians. SIAM Journal on Optimization, 23(1):74–94, 2013.
B. Martinet. Régularisation d’inéquations variationnelles par approximations successives. Rev. Française Informat. Rech. Opérationnelle, 4(Sér. R-3):154–158, 1970.
B. Martinet. Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox. C. R. Acad. Sci. Paris Sér. A-B, 274:A163–A165, 1972.
A. Mokhtari, A. Ozdaglar, and A. Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629–3639, 2018.
B.S. Mordukhovich. Variational analysis and generalized differentiation. I, volume 330 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2006. Basic theory.
J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
Yu. Nesterov. Modified Gauss–Newton scheme with worst case guarantees for global performance. Optimisation Methods and Software, 22(3):469–483, 2007.
Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
M. Nouiehed, J.D. Lee, and M. Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.
E.A. Nurminskii. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.
I. Panageas and G. Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.
E. Pauwels. The value function approach to convergence analysis in composite optimization. Operations Research Letters, 44(6):790–795, 2016.
R.A. Poliquin and R.T. Rockafellar. Prox-regular functions in variational analysis. Trans. Amer. Math. Soc., 348:1805–1838, 1996.
R.T. Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Optimization, 14(5):877–898, 1976.
R.T. Rockafellar. Favorable classes of Lipschitz-continuous functions in subgradient optimization. In Progress in nondifferentiable optimization, volume 8 of IIASA Collaborative Proc. Ser. CP-82, pages 125–143. Int. Inst. Appl. Sys. Anal., Laxenburg, 1982.
R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.
S. Rolewicz. On paraconvex multifunctions. In Third Symposium on Operations Research (Univ. Mannheim, Mannheim, 1978), Section I, volume 31 of Operations Res. Verfahren, pages 539–546. Hain, Königstein/Ts., 1979.
A. Shapiro. Second order sensitivity analysis and asymptotic theory of parametrized nonlinear programs. Mathematical Programming, 33(3):280–299, 1985.
M. Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.
J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
Y. Sun, N. Flammarion, and M. Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
S.J. Wright. Identifiable surfaces in constrained optimization. SIAM J. Control Optim., 31:1063–1079, 1993.

Acknowledgements

We thank John Duchi for his insightful comments on an early version of the manuscript. We also thank the anonymous referees for numerous suggestions that have improved the readability of the paper.

Author information

Authors and Affiliations

School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, 14850, USA
Damek Davis
Department of Mathematics, University of Washington, Seattle, WA, 98195, USA
Dmitriy Drusvyatskiy

Corresponding author

Correspondence to Dmitriy Drusvyatskiy.

Additional information

Communicated by Michael Overton.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

D. Drusvyatskiy: Research of Drusvyatskiy was supported by the NSF DMS 1651851 and CCF 1740551 awards.

Appendices

Proofs of Theorems 2.9 and 5.2

In this section, we prove Theorems 2.9 and 5.2. We note that Theorem 2.9, appropriately restated, holds well beyond the weakly convex function class. To limit the notational overhead, however, we impose the weak convexity assumption throughout.

We will require some basic notation from variational analysis; for details, we refer the reader to [57]. A set-valued map $F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m$ assigns to each point $x\in \mathbb {R}^d$ a set F(x) in $\mathbb {R}^m$. The graph of F is defined by

$$\begin{aligned} \mathrm{gph}\,F:=\{(x,v):v\in F(x)\}.\end{aligned}$$

A map $F:\mathbb {R}^d\rightrightarrows \mathbb {R}^m$ is called metrically regular at $(\bar{x},\bar{v})\in \mathrm{gph}\,F$ if there exists a constant $\kappa >0$ such that the estimate holds:

$$\begin{aligned} \mathrm{dist}(x,F^{-1}(v))\le \kappa \mathrm{dist}(v,F(x))\end{aligned}$$

for all x near $\bar{x}$ and all v near $\bar{v}$. If the graph $\mathrm{gph}\,F$ is a $C^1$-smooth manifold around $(\bar{x},\bar{v})$, then metric regularity at $(\bar{x},\bar{v})$ is equivalent to the condition [57, Theorem 9.43(d)] (see the last entry of the Notes):

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,F}(\bar{x},\bar{v})\quad \Longrightarrow \quad u=0. \end{aligned}$$

(A.1)
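To make the definition concrete, consider the simplest case of a single-valued linear map $F(x)=Ax$. This example is ours and is not taken from the paper. Here $F^{-1}(v)$ is the affine subspace $\{z: Az=v\}$, the estimate holds with $\kappa =1/\sigma _{\min }(A)$ whenever A has full row rank, and condition (A.1) reduces to the requirement that $A^\top u=0$ forces $u=0$. The Python sketch below checks both statements numerically; the matrix size, random seed, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: F(x) = A x with A a random 2x3 matrix (full row rank).
A = rng.standard_normal((2, 3))
A_pinv = np.linalg.pinv(A)
sigma_min = np.linalg.svd(A, compute_uv=False).min()
kappa = 1.0 / sigma_min  # candidate metric regularity constant


def dist_to_preimage(x, v):
    """Distance from x to F^{-1}(v) = {z : A z = v}, an affine subspace."""
    return np.linalg.norm(A_pinv @ (A @ x - v))


def dist_to_image(x, v):
    """Distance from v to the singleton F(x) = {A x}."""
    return np.linalg.norm(v - A @ x)


# Check dist(x, F^{-1}(v)) <= kappa * dist(v, F(x)) on random samples.
worst_ratio = 0.0
for _ in range(1000):
    x, v = rng.standard_normal(3), rng.standard_normal(2)
    worst_ratio = max(worst_ratio, dist_to_preimage(x, v) / dist_to_image(x, v))

# Condition (A.1) for this map: (0, u) is normal to gph F iff A^T u = 0,
# so (A.1) holds exactly when A has full row rank.
a1_holds = np.linalg.matrix_rank(A) == A.shape[0]

print(f"worst ratio = {worst_ratio:.4f}, kappa = {kappa:.4f}, (A.1): {a1_holds}")
```

For a matrix without full row rank, the normal cone test fails and no finite $\kappa $ works, which matches the equivalence stated above in this linear setting.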

We begin with the following lemma.

Lemma A.1

(Subdifferential metric regularity in smooth minimization). Consider the optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x)\quad \text {subject to}\quad x\in \mathcal {M},\end{aligned}$$

where $f:\mathbb {R}^d\rightarrow \mathbb {R}$ is a $C^2$-smooth function and $\mathcal {M}$ is a $C^2$-smooth manifold. Let $\bar{x}\in \mathcal {M}$ satisfy the criticality condition $0\in \partial f_{\mathcal {M}}(\bar{x})$ and suppose that the subdifferential map $\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d$ is metrically regular at $(\bar{x},0)$. Then, the guarantee holds:

$$\begin{aligned} \inf _{u\in \mathbb {S}^{d-1}\cap T_{\mathcal {M}}(\bar{x})} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}$$

(A.2)

Proof

First, appealing to (A.1), we conclude that the implication holds:

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,\partial f_{\mathcal {M}}}(\bar{x},0)\quad \Longrightarrow \quad u=0. \end{aligned}$$

(A.3)

Let us now interpret the condition (A.3) in Lagrangian terms. To this end, let $G=0$ be the local defining equations for $\mathcal {M}$ around $\bar{x}$. Define the Lagrangian function

$$\begin{aligned} \mathcal {L}(x,\lambda )=f(x)+\langle G(x),\lambda \rangle ,\end{aligned}$$

and let $\bar{\lambda }$ be the unique Lagrange multiplier vector satisfying $\nabla _x \mathcal {L}(\bar{x},\bar{\lambda })=0$. According to [41, Corollary 2.9], we have the following expression:

$$\begin{aligned} (0,u)\in N_{\mathrm{gph}\,\partial f_{\mathcal {M}}}(\bar{x},0)\quad \Longleftrightarrow \quad u\in T_{\mathcal {M}}(\bar{x})\quad \text {and}\quad L u \in N_{\mathcal {M}}(\bar{x}), \end{aligned}$$

(A.4)

where $L:=\nabla ^2_{xx}\mathcal {L}(\bar{x},\bar{\lambda })$ denotes the Hessian of the Lagrangian. Combining (A.3) and (A.4), we deduce that the only vector $u\in T_{\mathcal {M}}(\bar{x})$ satisfying $L u\in N_{\mathcal {M}}(\bar{x})$ is the zero vector $u=0$.

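Now, for the sake of contradiction, suppose that (A.2) fails, so that the infimum in (A.2) equals zero. Identifying $d^2 f_{\mathcal {M}}(\bar{x})(u)$ with the quadratic form $Q(u)=\langle L u,u\rangle $ for $u\in T_{\mathcal {M}}(\bar{x})$, we see that Q is nonnegative on $T_{\mathcal {M}}(\bar{x})$ and, since the infimum is attained by compactness, there exists $0\ne \bar{u}\in T_{\mathcal {M}}(\bar{x})$ satisfying $Q(\bar{u})=0$. We deduce that $\bar{u}$ minimizes $Q(\cdot )$ on $T_{\mathcal {M}}(\bar{x})$, and therefore, first-order optimality yields the inclusion $L\bar{u}\in N_{\mathcal {M}}(\bar{x})$. Since $\bar{u}\ne 0$, this contradicts the conclusion of the previous paragraph. $\square $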
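The last step of the proof is a linear-algebra fact: if a symmetric quadratic form is nonnegative on a subspace and vanishes at a nonzero vector of that subspace, then the image of that vector is orthogonal to the subspace. The sketch below checks this on a small hypothetical instance in $\mathbb {R}^3$. The matrix `L` and the subspace are made up for illustration and stand in for the Lagrangian Hessian and the tangent space.

```python
import numpy as np

# Hypothetical data: T = span{e1, e2} in R^3 plays the role of the tangent
# space, so the normal space is span{e3}.
T_basis = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.0]])  # columns span T
L = np.array([[0.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, -5.0]])  # symmetric stand-in for the Hessian

# Restrict Q(u) = <L u, u> to T; its eigenvalues show Q >= 0 on T.
L_restricted = T_basis.T @ L @ T_basis
eigvals, eigvecs = np.linalg.eigh(L_restricted)

# Unit vector u_bar in T with Q(u_bar) = 0 (eigenvector of the zero eigenvalue).
u_bar = T_basis @ eigvecs[:, 0]
Q_u_bar = u_bar @ L @ u_bar

# First-order optimality of u_bar for Q over T: the tangential component of
# L u_bar vanishes, i.e. L u_bar lies in the normal space.
tangential_part = T_basis.T @ (L @ u_bar)

print("eigenvalues of Q on T:", eigvals)
print("Q(u_bar) =", Q_u_bar, " tangential part of L u_bar =", tangential_part)
```

In this instance $L\bar{u}$ is a nonzero multiple of $e_3$, so $\bar{u}$ is a nonzero tangent vector with $L\bar{u}$ normal, which is exactly the situation that (A.3) and (A.4) together rule out.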

The following corollary for active manifolds will now quickly follow.

Corollary A.2

(Subdifferential metric regularity and active manifolds). Consider a closed and weakly convex function $f:\mathbb {R}^d\rightarrow \mathbb {R}\cup \{\infty \}$. Suppose that f admits a $C^2$-smooth active manifold $\mathcal {M}$ around a critical point $\bar{x}$ and that the subdifferential map $\partial f:\mathbb {R}^d\rightrightarrows \mathbb {R}^d$ is metrically regular at $(\bar{x},0)$. Then, $\bar{x}$ is either a strong local minimizer of f or satisfies the curvature condition $d^2 f_{\mathcal {M}}(\bar{x})(u)<0$ for some $u\in T_{\mathcal {M}}(\bar{x})$.

Proof

The result [19, Proposition 10.2] implies that $\mathrm{gph}\,\partial f$ coincides with $\mathrm{gph}\,\partial f_{\mathcal {M}}$ on a neighborhood of $(\bar{x},0)$. Therefore, the subdifferential map $\partial f_{\mathcal {M}}:\mathbb {R}^d\rightrightarrows \mathbb {R}^d$ is metrically regular at $(\bar{x},0)$. Using Lemma A.1, we obtain the guarantee:

$$\begin{aligned} \inf _{u\in \mathbb {S}^{d-1}\cap T_{\mathcal {M}}(\bar{x})} d^2 f_{\mathcal {M}}(\bar{x})(u)\ne 0. \end{aligned}$$

If the infimum is strictly negative, the proof is complete. Otherwise, the infimum is strictly positive. In this case, $\bar{x}$ is a strong local minimizer of $f_{\mathcal {M}}$, and therefore by [19, Proposition 7.2] a strong local minimizer of f. $\square $

We are now ready for the proofs of Theorems 2.9 and 5.2.

Proof of Theorem 2.9

The result [18, Corollary 4.8] shows that for almost all $v\in \mathbb {R}^d$, the function $g(x):=f(x)-\langle v,x\rangle $ has at most finitely many critical points. Moreover, each such critical point $\bar{x}$ lies on some $C^2$ active manifold $\mathcal {M}$ of g, and the subdifferential map $\partial g:\mathbb {R}^d\rightrightarrows \mathbb {R}^d$ is metrically regular at $(\bar{x},0)$. Applying Corollary A.2 to g for such generic vectors v, we deduce that every critical point $\bar{x}$ of g is either a strong local minimizer or a strict saddle of g. The proof is complete. $\square $
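A one-dimensional illustration of the role of the generic tilt v, chosen by us and not taken from the paper: the convex (hence weakly convex) semi-algebraic function $f(x)=x^4$ has a minimizer at the origin with zero curvature, so it is not a strong local minimizer. For every $v\ne 0$, the tilted function $g(x)=x^4-vx$ has a unique critical point with strictly positive second derivative. The Python sketch below tabulates this; the function names and the list of tilts are illustrative assumptions.

```python
import numpy as np


def critical_point(v):
    """Unique real solution of g'(x) = 4 x**3 - v = 0 for g(x) = x**4 - v*x."""
    return float(np.cbrt(v / 4.0))


def curvature(x):
    """Second derivative of g(x) = x**4 - v*x, which does not depend on v."""
    return 12.0 * x ** 2


# Tilts to try; v = 0 is the single exceptional value in this example.
tilts = [0.0, 1e-3, -0.5, 2.0]
report = {v: curvature(critical_point(v)) for v in tilts}

for v, c in report.items():
    kind = "strong local minimizer" if c > 0 else "degenerate minimizer"
    print(f"v = {v:+.3f}: g''(x_bar) = {c:.4f} ({kind})")
```

The exceptional set of tilts here is the single point $v=0$, which has measure zero, in line with the "almost all v" conclusion.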

Proof of Theorem 5.2

The proof is identical to that of Theorem 2.9 with [18, Theorem 5.2] playing the role of [18, Corollary 4.8]. $\square $

Pathological Example

Theorem B.1

Consider the following function

$$\begin{aligned} f(x, y) = \frac{1}{2}(|x| + |y|)^2 - \frac{\rho }{2} x^2 \end{aligned}$$

Assume that $\lambda > \rho $. Define a mapping $S :\mathbb {R}^2 \rightarrow \mathbb {R}^2$ by the following formula.

$$\begin{aligned} S(x, y) = {\left\{ \begin{array}{ll} 0 &{} \text {if }(x, y) = 0;\\ \left( 0, \frac{\lambda }{1+\lambda } y\right) &{} \text {if }|x| \le \frac{1}{1+\lambda } |y|; \\ \left( \frac{\lambda }{1+\lambda - \rho }x, 0\right) &{} \text {if }|y| \le \frac{1}{1+\lambda -\rho }|x|,\\ \end{array}\right. } \end{aligned}$$

and if $\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|$, we have

$$\begin{aligned} S(x,y) = {\left\{ \begin{array}{ll} \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} &{} \text {if }\mathrm {sign}(x) = \mathrm {sign}(y);\\ \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )-1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}&\text {if }\mathrm {sign}(x) \ne \mathrm {sign}(y). \end{array}\right. } \end{aligned}$$

Then, $\mathrm{prox}_{(1/\lambda ) f}(x, y) = S(x,y)$.
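As a numerical sanity check on this formula, the following Python sketch implements S and compares the proximal objective $f(u,v)+\frac{\lambda }{2}\Vert (u,v)-(x,y)\Vert ^2$ at $S(x,y)$ with its value at randomly perturbed points. The parameter values $\lambda =2$ and $\rho =1.5$, the sampling ranges, and all function names are illustrative assumptions rather than part of the paper.

```python
import numpy as np


def f(x, y, rho):
    """The function of Theorem B.1."""
    return 0.5 * (abs(x) + abs(y)) ** 2 - 0.5 * rho * x ** 2


def S(x, y, lam, rho):
    """Piecewise-linear map of Theorem B.1 (assumes lam > rho)."""
    if x == 0 and y == 0:
        return 0.0, 0.0
    if abs(x) <= abs(y) / (1 + lam):
        return 0.0, lam / (1 + lam) * y
    if abs(y) <= abs(x) / (1 + lam - rho):
        return lam / (1 + lam - rho) * x, 0.0
    # Remaining region: |x|/(1+lam-rho) < |y| < (1+lam)|x|.
    c = lam / ((1 + lam) * (1 + lam - rho) - 1)
    off = -1.0 if np.sign(x) == np.sign(y) else 1.0  # off-diagonal entry
    return (c * ((1 + lam) * x + off * y),
            c * (off * x + (1 + lam - rho) * y))


def prox_objective(u, v, x, y, lam, rho):
    """Objective whose minimizer defines prox_{(1/lam) f}(x, y)."""
    return f(u, v, rho) + 0.5 * lam * ((u - x) ** 2 + (v - y) ** 2)


def max_violation(lam=2.0, rho=1.5, trials=200, probes=200, seed=0):
    """Largest amount by which a random probe beats S(x, y); 0 if none does."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x, y = rng.uniform(-3, 3, size=2)
        xp, yp = S(x, y, lam, rho)
        best = prox_objective(xp, yp, x, y, lam, rho)
        for _ in range(probes):
            du, dv = rng.normal(scale=0.5, size=2)
            probe = prox_objective(xp + du, yp + dv, x, y, lam, rho)
            worst = max(worst, best - probe)
    return worst


print("max violation:", max_violation())
```

Random probing cannot prove the identity, but a positive violation would flag an error in the formula or in its transcription. Because $\lambda >\rho $, the proximal objective is strongly convex, so a point satisfying the optimality conditions is its unique minimizer.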

Proof

Let us denote the components of S(x, y) by $(x_+, y_+) = S(x,y)$. By first-order optimality conditions, we have $\mathrm{prox}_{(1/\lambda ) f}(x, y) = (x_+, y_+) $ if and only if

$$\begin{aligned}&\lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) \in \\&{\left\{ \begin{array}{ll} \{x_+ + \mathrm {sign}(x_+)|y_+|\}\times \{\mathrm {sign}(y_+)|x_+| + y_+\} &{} \text {if }x_+ \ne 0\text { and }y_+ \ne 0;\\ ([-1, 1]y_+)\times \{y_+\} &{} \text {if }x_+ = 0\text { and }y_+ \ne 0;\\ \{x_+\}\times ([-1, 1]x_+) &{} \text {if }x_+ \ne 0\text { and }y_+ = 0;\\ \{0\}\times \{0\} &{} \text {if }x_+ = 0\text { and }y_+ = 0.\\ \end{array}\right. } \end{aligned}$$

Let us show that $(x_+, y_+)$ indeed satisfies this inclusion.

1. If $(x,y) = 0$, then $x_+ = y_+ = 0$, and the pair satisfies the inclusion.
2. If $|x| \le \frac{1}{1 + \lambda }|y|$ and $y \ne 0$, then $x_+ = 0$, $y_+ = \frac{\lambda }{1+\lambda }y$, and
$$\begin{aligned} \lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+) = \lambda \left( x, \frac{1}{1 + \lambda }y\right) \in ([-1, 1]y_+) \times \{y_+\}. \end{aligned}$$
Thus, the pair satisfies the inclusion.
3. If $|y| \le \frac{1}{1+\lambda -\rho }|x|$ and $x \ne 0$, then $x_+ = \frac{\lambda }{(1+\lambda - \rho )}x$, $y_+ = 0$, and
$$\begin{aligned}&\lambda (x - (1-(1/\lambda )\rho )x_+, y - y_+)\\&= \lambda \left( x - \frac{\lambda -\rho }{(1+\lambda - \rho )}x, y\right) \in \{x_+\}\times ([-1, 1]x_+). \end{aligned}$$
Thus, the pair satisfies the inclusion.

For the remaining two cases, let us assume that $\frac{1}{ (1+\lambda - \rho )} |x|< |y| < (1+\lambda ) |x|$.

4. If $\mathrm {sign}(x) = \mathrm {sign}(y)$, let $s = \mathrm {sign}(x)$ and note that
$$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} -1 \\ -1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ -|x| + (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$
From this equation we learn $\mathrm {sign}(x_+) = \mathrm {sign}(y_+) = s$. Inverting the matrix, we also learn
$$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} 1 \\ 1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ + y_+ \\ x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$
Thus, the pair satisfies the inclusion.
5. If $\mathrm {sign}(x) \ne \mathrm {sign}(y)$, let $s = \mathrm {sign}(x)$ and note that
$$\begin{aligned} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \frac{\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda ) &{} 1 \\ 1 &{} (1+\lambda - \rho ) \end{bmatrix}\begin{bmatrix} x\\ y \end{bmatrix}\\&= \frac{s\lambda }{(1+\lambda )(1+\lambda - \rho )- 1} \begin{bmatrix} (1+\lambda )|x| -|y| \\ |x| - (1+\lambda - \rho )|y| \end{bmatrix} \end{aligned}$$
From this equation we learn $\mathrm {sign}(x_+) \ne \mathrm {sign}(y_+) $. Inverting the matrix we also learn
$$\begin{aligned} \lambda \begin{bmatrix} x \\ y \end{bmatrix} =\begin{bmatrix} (1+\lambda - \rho ) &{} -1 \\ -1 &{} (1+\lambda ) \end{bmatrix} \begin{bmatrix} x_+ \\ y_+ \end{bmatrix}&= \begin{bmatrix} x_+ + \lambda (1 - \rho /\lambda )x_+ - y_+ \\ -x_+ + y_+ + \lambda y_+ \end{bmatrix} \\&= \begin{bmatrix} x_+ + \mathrm {sign}(x_+) |y_+| + \lambda (1 - \rho /\lambda )x_+ \\ \mathrm {sign}(y_+) |x_+| + y_+ + \lambda y_+ \end{bmatrix}. \end{aligned}$$
Thus, the pair satisfies the inclusion.

Therefore, the proof is complete. $\square $

Corollary B.2

(Convergence to Saddles). Assume the setting of Theorem B.1. Let $\alpha \in (0, 1]$ and define the operator $T := (1-\alpha ) I + \alpha S$ on $\mathbb {R}^2$. Then, the cone $ \mathcal {K}= \{(x,y) :|x| \le (1+\lambda )^{-1}y\} $ satisfies $T\mathcal {K}\subseteq \mathcal {K}$. Moreover, for any $(x, y) \in \mathcal {K}$, the sequence $T^k(x,y) = ((1-\alpha )^k x, (1 - \alpha (1 - \lambda (1+\lambda )^{-1}))^ky)$ converges linearly to the origin as k tends to infinity.

Proof

Since $\mathcal {K}$ is convex, it suffices to show that $S \mathcal {K}\subseteq \mathcal {K}$. This follows from Theorem B.1. $\square $
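The behavior described in Corollary B.2 can be simulated directly. On the cone $\mathcal {K}$, Theorem B.1 gives $S(x,y)=(0,\frac{\lambda }{1+\lambda }y)$, so the relaxed step T has a simple closed form. The Python sketch below iterates T from a point of $\mathcal {K}$ and compares the result with the formula for $T^k$. The values $\lambda =2$, $\alpha =0.5$, the starting point, and the function name are illustrative assumptions; $\rho $ does not enter on $\mathcal {K}$.

```python
def T_on_cone(x, y, lam, alpha):
    """Relaxed step T = (1 - alpha) I + alpha S at a point of the cone K,
    where S(x, y) = (0, lam / (1 + lam) * y) by Theorem B.1."""
    assert abs(x) <= y / (1 + lam), "point must lie in K"
    return (1 - alpha) * x, (1 - alpha) * y + alpha * lam / (1 + lam) * y


lam, alpha = 2.0, 0.5      # illustrative values
x0, y0 = 0.1, 1.0          # lies in K because 0.1 <= 1/3
num_steps = 50

x, y = x0, y0
for _ in range(num_steps):
    # The assertion inside T_on_cone confirms that each iterate stays in K.
    x, y = T_on_cone(x, y, lam, alpha)

# Closed form for T^k from Corollary B.2.
closed_form = ((1 - alpha) ** num_steps * x0,
               (1 - alpha * (1 - lam / (1 + lam))) ** num_steps * y0)

print("iterate:    ", (x, y))
print("closed form:", closed_form)
```

The y-coordinate contracts by the factor $1-\alpha /(1+\lambda )$ per step, so every starting point in $\mathcal {K}$, a set with nonempty interior, is driven to the origin.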

About this article

Cite this article

Davis, D., Drusvyatskiy, D. Proximal Methods Avoid Active Strict Saddles of Weakly Convex Functions. Found Comput Math 22, 561–606 (2022). https://doi.org/10.1007/s10208-021-09516-w

Received: 22 January 2020
Revised: 16 February 2021
Accepted: 08 April 2021
Published: 03 May 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s10208-021-09516-w
