Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme Under Weak Strong Convexity Assumption

Liu, Jie; Takáč, Martin

doi:10.1007/978-3-319-66616-7_7

Jie Liu³ &
Martin Takáč³

Part of the book series: Springer Proceedings in Mathematics & Statistics ((PROMS,volume 213))

Included in the following conference series:

Modeling and Optimization: Theory and Applications

617 Accesses
1 Citations
1 Altmetric

Abstract

We propose a projected semi-stochastic gradient descent method with mini-batch for improving both the theoretical complexity and practical performance of the general stochastic gradient descent method (SGD). We are able to prove linear convergence under weak strong convexity assumption. This requires no strong convexity assumption for minimizing the sum of smooth convex functions subject to a compact polyhedral set, which remains popular across machine learning community. Our PS2GD preserves the low-cost per iteration and high optimization accuracy via stochastic gradient variance-reduced technique, and admits a simple parallel implementation with mini-batches. Moreover, PS2GD is also applicable to dual problem of SVM with hinge loss.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Even though the concept was first proposed by Liu and Wright in [15] as optimally strong convexity, to emphasize it as an extended version of strong convexity, we use the term weak strong convexity as in [6] throughout our paper.
2.
It is possible to finish each iteration with only b evaluations for component gradients, namely $\{\nabla f_{i}(y_{k,t})\}_{i\in A_{kt}}$, at the cost of having to store {∇f _i(x _k)}_{i ∈ [n]}, which is exactly the way that SAG [14] works. This speeds up the algorithm; nevertheless, it is impractical for big n.
3.
We only need to prove the existence of β and do not need to evaluate its value in practice. Lemma 4 provides the existence of β.
4.
rcv1 and news20 are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
5.
Available at http://users.cecs.anu.edu.au/~xzhang/data/.
6.
In practice, it is impossible to ensure that evaluating different component gradients takes the same time; however, Fig. 2 implies the potential and advantage of applying mini-batch scheme with parallelism.
7.
Note that this quantity is never computed during the algorithm. We can use it in the analysis nevertheless.
8.
For simplicity, we omit the E[⋅ | y _k, t] notation in further analysis.
9.
$\bar{y}_{k,t+1}$ is constant, conditioned on y _k, t.

References

Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Article MathSciNet MATH Google Scholar
Calamai, P.H., Moré, J.J.: Projected gradient methods for linearly constrained problems. Math. Program. 39, 93–116 (1987)
Article MathSciNet MATH Google Scholar
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: NIPS (2014)
Google Scholar
Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799 (2013)
Google Scholar
Fercoq, O., Qu, Z., Richtárik, P., Takáč, M.: Fast distributed coordinate descent for non-strongly convex losses. In: IEEE Workshop on Machine Learning for Signal Processing (2014)
Book Google Scholar
Gong, P., Ye, J.: Linear convergence of variance-reduced projected stochastic gradient without strong convexity. arXiv:1406.1102 (2014)
Google Scholar
Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)
Article MathSciNet Google Scholar
Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: NIPS, pp. 3068–3076 (2014)
Google Scholar
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: NIPS, pp. 315–323 (2013)
Google Scholar
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In: ECML PKDD, pp. 795–811 (2016)
Google Scholar
Kloft, M., Brefeld, U., Laskov, P., Müller, K.-R., Zien, A., Sonnenburg, S.: Efficient and accurate lp-norm multiple kernel learning. In: NIPS, pp. 997–1005 (2009)
Google Scholar
Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Sign. Proces. 10, 242–255 (2016)
Article Google Scholar
Konečný, J., Richtárik, P.: Semi-stochastic gradient descent methods. arXiv:1312.1666 (2013)
Google Scholar
Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS, pp. 2672–2680 (2012)
Google Scholar
Liu, J., Wright, S.J.: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. 25(1), 351–376 (2015)
Article MathSciNet MATH Google Scholar
Mareček, J., Richtárik, P., Takáč, M.: Distributed block coordinate descent for minimizing partially separable functions. In: Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics, pp. 261–286 (2014)
Google Scholar
Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimization: convergence analysis and error bounds. SIAM J. Optim. 26(1), 197–226 (2016)
Article MathSciNet MATH Google Scholar
Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Optim. Appl. 57(2), 307–337 (2014)
Article MathSciNet MATH Google Scholar
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. arXiv:1504.06298 (2015)
Google Scholar
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004)
Book MATH Google Scholar
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22, 341–362 (2012)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Article MathSciNet MATH Google Scholar
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. arXiv:1703.00102 (2017)
Google Scholar
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Article MathSciNet MATH Google Scholar
Richtárik, P., Takáč, M.: Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res. 17, 1–25 (2016)
MathSciNet MATH Google Scholar
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. Ser. A 156, 1–52 (2016)
Article MathSciNet MATH Google Scholar
Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: NIPS, pp. 378–385 (2013)
Google Scholar
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
MathSciNet MATH Google Scholar
Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. Ser. A, B Spec. Issue Optim. Mach. Learn. 127, 3–30 (2011)
Article MathSciNet MATH Google Scholar
Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In: ICML, pp. 71–79. Springer, New York (2013)
Google Scholar
Takáč, M., Bijral, A.S., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: ICML, pp. 537–552. Springer (2013)
Google Scholar
Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
MathSciNet MATH Google Scholar
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Article MathSciNet MATH Google Scholar
Zhang, T.: Solving large scale linear prediction using stochastic gradient descent algorithms. In: ICML, pp. 919–926. Springer (2004)
Google Scholar
Zhang, H.: The restricted strong convexity revisited: analysis of equivalence to error bound and quadratic growth. Optim. Lett. 11(4), 817–833 (2016)
Article MathSciNet MATH Google Scholar
Zhang, L., Mahdavi, M., Jin, R.: Linear convergence with condition number independent access of full gradients. In: NIPS, pp. 980–988 (2013)
Google Scholar

Download references

Acknowledgements

This research of Jie Liu and Martin Takáč was supported by National Science Foundation grant CCF-1618717. We would like to thank Ji Liu for his helpful suggestions on related works.

Author information

Authors and Affiliations

Lehigh University, Bethlehem, PA, 18015, USA
Jie Liu & Martin Takáč

Authors

Jie Liu
View author publications
You can also search for this author in PubMed Google Scholar
Martin Takáč
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Jie Liu or Martin Takáč .

Editor information

Editors and Affiliations

Industrial and Systems Engineering Department, Lehigh University, Bethlehem, Pennsylvania, USA
Martin Takáč
Industrial and Systems Engineering Department, Lehigh University, Bethlehem, Pennsylvania, USA
Tamás Terlaky

Appendices

Appendix 1: Technical Results

Lemma 1.

Let set $\mathcal{W}\subseteq \mathbb{R}^{d}$ be nonempty, closed, and convex, then for any $x,y \in \mathbb{R}^{d}$,

$$\displaystyle{ \|\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)\| \leq \| x - y\|. }$$

Note that the above contractiveness of projection operator is a standard result in optimization literature. We provide proof for completeness.

Inspired by Lemma 1 in [34], we derive the following lemma for projected algorithms.

Lemma 2 (Modified Lemma 1 in [34]).

Let Assumption 1 hold and let $w_{{\ast}}\in \mathcal{W}^{{\ast}}$ be any optimal solution to Problem (1). Then for any feasible solution $w \in \mathcal{W}$ , the following holds:

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}\|a_{ i}[\nabla g_{i}(a_{i}^{T}w)-\nabla g_{ i}(a_{i}^{T}w_{ {\ast}})]\| = \frac{1} {n}\sum _{i=1}^{n}\|\nabla f_{ i}(w)-\nabla f_{i}(w_{{\ast}})\| \leq 2L[F(w)-F(w_{{\ast}})]. }$$

(14)

Lemmas 3 and 4 come from [12] and [33], respectively. Please refer to the corresponding references for complete proofs.

Lemma 3 (Lemma 4 in [12]).

Let {ξ _i}_{i = 1} ⁿ be a collection of vectors in $\mathbb{R}^{d}$ and $\mu \stackrel{{\it \mathit{\text{def}}}}{=} \frac{1} {n}\sum _{i=1}^{n}\xi _{ i} \in \mathbb{R}^{d}$ . Let $\hat{S}$ be a τ-nice sampling. Then

$$\displaystyle{ \mathbf{E}\left [\left \|\frac{1} {\tau } \sum _{i\in \hat{S}}\xi _{i}-\mu \right \|^{2}\right ] = \frac{1} {n\tau } \frac{n-\tau } {(n - 1)}\sum _{i=1}^{n}\left \|\xi _{ i}\right \|^{2}. }$$

(15)

Following from the proof of Corollary 3 in [34], by applying Lemma 3 with ξ _i: = ∇f _i(y _k, t−1) −∇f _i(w _k) = a _i[∇g _i(a _i ^T y _k, t−1) −∇g _i(a _i ^T w _k)] and Lemma 2, we have the bound for variance as follows.

Theorem 3 (Bounding Variance).

Considering the definition of G _k, t in Algorithm 1 , conditioned on y _k, t , we have $\mathbf{E}[G_{k,t}] = \frac{1} {n}\sum _{i=1}^{n}\nabla g_{ i}(y_{k,t}) + q = \nabla F(y_{k,t})$ and the variance satisfies,

$$\displaystyle{ \mathbf{E}\left [\|G_{k,t} -\nabla F(y_{k,t})\|^{2}\right ] \leq \mathop{\underbrace{ \frac{n-b} {b(n-1)}}}\limits _{\alpha (b)}4L[F(y_{k,t}) - F(w_{{\ast}}) + F(w_{k}) - F(w_{{\ast}})]. }$$

(16)

Lemma 4 (Hoffman Bound, Lemma 15 in [33]).

Consider a non-empty polyhedron

$$\displaystyle{ \{w_{{\ast}}\in \mathbb{R}^{d}\vert Cw_{ {\ast}}\leq c,Aw_{{\ast}} = r\}. }$$

For any w, there is a feasible point w _∗ such that

$$\displaystyle{ \|w-w_{{\ast}}\|\leq \theta (A,C)\left \Vert \begin{array}{*{10}c} [Cw - c]^{+} \\ Aw - r \end{array} \right \Vert, }$$

where θ(A, C) is independent of x,

$$\displaystyle\begin{array}{rcl} \theta (A,C) =\sup _{u,v}\left \{\left \Vert \begin{array}{*{10}c} u\\ v \end{array} \right \Vert \left \vert \begin{array}{c} \|C^{T}u + A^{T}v\| = 1,u \geq 0.\mathit{\text{ The corresponding rows of }}C,A \\ \mathit{\text{to }}u,v\mathit{\text{'s non-zero elements are linearly independent. }} \end{array} \right.\right \}& &{}\end{array}$$

(17)

Lemma 5 (Weak Strong Convexity).

Let $w \in \mathcal{W}:=\{ w \in \mathbb{R}^{d}: Cw \leq c\}$ be any feasible solution (Assumption 3 ) and $w_{{\ast}} =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}^{{\ast}}}(w)$ which is an optimal solution for Problem (1). Then under Assumptions 2 – 3 , there exists a constant β > 0 such that for all $w \in \mathcal{W}$ , the following holds,

$$\displaystyle{ F(w) - F(w_{{\ast}}) \geq \frac{\mu } {2\beta }\|w - w_{{\ast}}\|^{2}, }$$

where μ is defined in Assumption 2 . β can be evaluated by β = θ ² where θ is defined in (17).

Appendix 2: Proofs

2.1 Proof of Lemma 1

For any $x,y \in \mathbb{R}^{d}$, by Projection Theorem, the following holds:

$$\displaystyle{ [y -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)]^{T}[\mathop{\mathrm{proj}}\nolimits _{ \mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)] \leq 0, }$$

(18)

similarly, by symmetry, we have

$$\displaystyle{ [x -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x)]^{T}[\mathop{\mathrm{proj}}\nolimits _{ \mathcal{W}}(y) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x)] \leq 0. }$$

(19)

Then (18) + (19) gives

$$\displaystyle{ [\left (\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)\right ) - (x - y)]^{T}[\mathop{\mathrm{proj}}\nolimits _{ \mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)] \leq 0, }$$

or equivalently,

$$\displaystyle{ \|\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)\|^{2} \leq (x - y)^{T}[\mathop{\mathrm{proj}}\nolimits _{ \mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)], }$$

and by Cauchy-Schwarz inequality, we have

$$\displaystyle{ \|\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x)\| \leq \| x - y\|, }$$

when $\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}}(y)$ are distinct; in addition, when $\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}}(y)$, the above inequality also holds. Hence, for any $x,y \in \mathbb{R}^{d}$, which is the same to

$$\displaystyle{ \|\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(x) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y)\| \leq \| x - y\|. }$$

2.2 Proof of Lemma 2

For any i ∈ {1, …, n}, consider the function

$$\displaystyle{ \phi _{i}(w) = f_{i}(w) - f_{i}(w_{{\ast}}) -\nabla f_{i}(w_{{\ast}})^{T}(w - w_{ {\ast}}), }$$

(20)

then it should be obvious that ∇ϕ _i(w _∗) = ∇f _i(w _∗) −∇f _i(w _∗) = 0, hence $\min _{w\in \mathbb{R}^{d}}\phi _{i}(w) =\phi _{i}(w_{{\ast}})$ because of the convexity of f _i. By Assumption 1 and Remark 1, ∇ϕ _i(w) is Lipschitz continuous with constant L, hence by Theorem 2.1.5 from [21] we have

$$\displaystyle{ \frac{1} {2L}\|\nabla \phi _{i}(w)\|^{2} \leq \phi _{ i}(w) -\min _{w\in \mathbb{R}^{l}}\phi _{i}(w) =\phi _{i}(w) -\phi _{i}(w_{{\ast}}) =\phi _{i}(w), }$$

which, by (20), suggests that

$$\displaystyle{ \|\nabla f_{i}(w) -\nabla f_{i}(w_{{\ast}})\|^{2} \leq 2L[f_{ i}(w) - f_{i}(w_{{\ast}}) -\nabla f_{i}(w_{{\ast}})^{T}(w - w_{ {\ast}})]. }$$

By averaging the above equation over i = 1, …, n and using the fact that $F(w) = \frac{1} {n}\sum _{i=1}^{n}f_{ i}(w)$, we have

$$\displaystyle{ \frac{1} {n}\sum _{i=1}^{n}\|\nabla f_{ i}(w) -\nabla f_{i}(w_{{\ast}})\|^{2} \leq 2L[F(w) - F(w_{ {\ast}}) -\nabla F(w_{{\ast}})^{T}(w - w_{ {\ast}})], }$$

which, together with ∇F(w _∗)^T(w − w _∗) ≥ 0 indicated by the optimality of w _∗ for Problem (1), completes the proof for Lemma 2.

2.3 Proof of Lemma 5

First, we will prove by contradiction that there exists a unique r such that $\mathcal{W}^{{\ast}} =\{ w \in \mathbb{R}^{d}: Cw \leq c,Aw = r\}$ which is non-empty. Assume that there exist distinct $w_{1},w_{2} \in \mathcal{W}^{{\ast}}$ such that Aw ₁ ≠ Aw ₂. Let us define the optimal value to be F ^∗ which suggests that F ^∗ = F(w ₁) = F(w ₂). Moreover, convexity of function F and feasible set $\mathcal{W}$ suggests the convexity of $\mathcal{W}^{{\ast}}$, then $\frac{1} {2}(w_{1} + w_{2}) \in \mathcal{W}^{{\ast}}$. Therefore,

$$\displaystyle{ \begin{array}{rl} F^{{\ast}} = F\left (\frac{1} {2}(w_{1} + w_{2})\right )&\mathop{ =}\limits^{ \text{(1)}}g\left (A\frac{1} {2}(w_{1} + w_{2})\right ) + \frac{1} {2}q^{T}(w_{ 1} + w_{2}) \\ & = g\left (\frac{1} {2}Aw_{1} + \frac{1} {2}Aw_{2}\right ) + \frac{1} {2}q^{T}(w_{ 1} + w_{2}). \end{array} }$$

(21)

Strong convexity indicated in Assumption 2 suggests that

$$\displaystyle\begin{array}{rcl} F^{{\ast}}& =& \frac{1} {2}(F(w_{1}) + F(w_{2}))\mathop{ =}\limits^{ \text{(1)}}\frac{1} {2}[g(Aw_{1}) + q^{T}w_{ 1}] + \frac{1} {2}[g(Aw_{2}) + q^{T}w_{ 2}] {}\\ & =& \left (\frac{1} {2}g(Aw_{1}) + \frac{1} {2}g(Aw_{2})\right ) + \frac{1} {2}q^{T}(w_{ 1} + w_{2}) {}\\ &>& g\left (\frac{1} {2}Aw_{1} + \frac{1} {2}Aw_{2}\right ) + \frac{1} {2}q^{T}(w_{ 1} + w_{2})\mathop{ =}\limits^{ \text{(21)}}F^{{\ast}}, {}\\ \end{array}$$

which is a contradiction, so there exists a unique r such that $\mathcal{W}^{{\ast}}$ can be represented by $\{w \in \mathbb{R}^{d}: Cw \leq c,Aw = r\}$.

For any $w \in \mathcal{W} =\{ x \in \mathbb{R}^{d}: Cw \leq c\},[Cw - c]^{+} = 0$, then by Hoffman’s bound in Lemma 4, for any $w \in \mathcal{W}$, there exists $w' \in \mathcal{W}^{{\ast}}$ and a constant θ > 0 defined in (17), dependent on A and C, such that

$$\displaystyle\begin{array}{rcl} \|w - w'\| \leq \theta \left \Vert \begin{array}{*{10}c} [Cw - c]^{+} \\ Aw - r \end{array} \right \Vert =\theta \| Aw - r\| =\theta \| Aw - Aw_{{\ast}}\|,\forall w_{{\ast}}\in \mathcal{W}^{{\ast}}.& &{}\end{array}$$

(22)

Being aware of that by choosing $w_{{\ast}} =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}_{{\ast}}}(w)$, we have that ∥w − w _∗∥ ≤ ∥w − w′∥, which suggests that

$$\displaystyle{ \|w - w_{{\ast}}\|\leq \| w - w'\|\mathop{ \leq }\limits^{\text{(22)}}\theta \|Aw - Aw_{{\ast}}\|, }$$

or equivalently,

$$\displaystyle{ \|Aw - Aw_{{\ast}}\|^{2} \geq \frac{1} {\beta } \|w - w_{{\ast}}\|^{2},\forall w_{ {\ast}}\in \mathcal{W}^{{\ast}}, }$$

(23)

where β = θ ² > 0.

Optimality of w _∗ for Problem (1) suggests that

$$\displaystyle{ \nabla F(w_{{\ast}})^{T}(w - w_{ {\ast}})\mathop{ =}\limits^{ \text{(1)}}[A^{T}g(Aw_{ {\ast}}) + q]^{T}(w - w_{ {\ast}}) \geq 0, }$$

(24)

then we can conclude the following:

$$\displaystyle{ g(Aw)\mathop{ \geq }\limits^{\ \text{(3)}}g(Aw_{{\ast}}) + \nabla g(Aw_{{\ast}})^{T}(Aw - Aw_{ {\ast}}) + \frac{\mu } {2}\|Aw - Aw_{{\ast}}\|^{2}, }$$

(25)

which, by considering F(w) = g(Aw) + q ^T w in Problem (1), is equivalent to

$$\displaystyle\begin{array}{rcl} F(w) &-& F(w_{{\ast}}) \mathop{=}\limits^{ \text{(1)}}g(Aw) - g(Aw_{{\ast}}) + q^{T}(w - w_{{\ast}}) {}\\ &&\qquad\quad \mathop{\geq }\limits^{\text{(25)}}[A^{T}\nabla g(Aw_{{\ast}}) + q]^{T}(w - w_{{\ast}}) + \frac{\mu } {2}\|Aw - Aw_{{\ast}}\|^{2} {}\\ &&\qquad\quad \mathop{\geq }\limits^{\text{(24)}} \frac{\mu }{2}\|Aw - Aw_{{\ast}}\|^{2} {}\\ &&\qquad\quad \mathop{\geq }\limits^{\text{(23)}} \frac{\mu }{2\beta }\|w - w_{{\ast}}\|^{2}. {}\\ {}\\ \end{array}$$

2.4 Proof of Theorem 1

The proof is following the steps in [12, 34]. For convenience, let us define the stochastic gradient mapping

$$\displaystyle{ d_{k,t} = \frac{1} {h}(y_{k,t} - y_{k,t+1}) = \frac{1} {h}(y_{k,t} -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y_{k,t} - hG_{k,t})), }$$

(26)

then the iterate update can be written as

$$\displaystyle{ y_{k,t+1} = y_{k,t} - hd_{k,t}. }$$

Let us estimate the change of ∥y _k, t+1 − w _∗∥. It holds that

$$\displaystyle\begin{array}{rcl} \|y_{k,t+1} - w_{{\ast}}\|^{2}& =& \|y_{ k,t} - hd_{k,t} - w_{{\ast}}\|^{2} \\ & =& \|y_{k,t} - w_{{\ast}}\|^{2} - 2hd_{ k,t}^{T}(y_{ k,t} - w_{{\ast}}) + h^{2}\|d_{ k,t}\|^{2}.{}\end{array}$$

(27)

By the optimality condition of $y_{k,t+1} =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}}(y_{k,t} - hG_{k,t}) =\arg \min _{w\in \mathcal{W}}\{\tfrac{1} {2}\|w - (y_{k,t} - hG_{k,t})\|^{2}\}$, we have

$$\displaystyle{ [y_{k,t+1} - (y_{k} - hG_{k,t})]^{T}(w^{{\ast}}- y_{ k,t+1}) \geq 0, }$$

then the update y _k, t+1 = y _k, t − hd _k, t suggests that

$$\displaystyle{ G_{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}) \geq d_{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}). }$$

(28)

Moreover, Lipschitz continuity of the gradient of F implies that

$$\displaystyle{ F(y_{k,t}) \geq F(y_{k,t+1}) -\nabla F(y_{k,t})^{T}(y_{ k,t+1} - y_{k,t}) -\frac{L} {2} \|y_{k,t+1} - y_{k,t}\|^{2}. }$$

(29)

Let us define the operator Δ _k, t = G _k, t −∇F(y _k, t), so

$$\displaystyle{ \nabla F(y_{k,t}) = G_{k,t} -\varDelta _{k,t} }$$

(30)

Convexity of F suggests that

$$\displaystyle\begin{array}{rcl} & & F(w^{{\ast}}) \geq F(y_{ k,t}) + \nabla F(y_{k,t})^{T}(w^{{\ast}}- y_{ k,t}) {}\\ & & \mathop{\geq }\limits^{\text{(29)}}F(y_{k,t+1}) -\nabla F(y_{k,t})^{T}(y_{ k,t+1} - y_{k,t}) -\frac{L} {2} \|y_{k,t+1} - y_{k,t}\|^{2} + \nabla F(y_{ k,t})^{T}(w^{{\ast}}- y_{ k,t}) {}\\ & & = F(y_{k,t+1}) -\frac{L} {2} \|y_{k,t+1} - y_{k,t}\|^{2} + \nabla F(y_{ k,t})^{T}(w^{{\ast}}- y_{ k,t+1}) {}\\ & & \mathop{=}\limits^{ \text{(26)},\ \text{(30)}}F(y_{k,t+1}) -\frac{Lh^{2}} {2} \|d_{k,t}\|^{2} + (G_{ k,t} -\varDelta _{k,t})^{T}(w^{{\ast}}- y_{ k,t+1}) {}\\ & & \mathop{\geq }\limits^{\text{(28)}}F(y_{k,t+1}) -\frac{Lh^{2}} {2} \|d_{k,t}\|^{2} + d_{ k,t}^{T}(w^{{\ast}}- y_{ k,t} + y_{k,t} - y_{k,t+1}) -\varDelta _{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}) {}\\ & & \mathop{=}\limits^{ \text{(26)}}F(y_{k,t+1}) -\frac{Lh^{2}} {2} \|d_{k,t}\|^{2} + d_{ k,t}^{T}(w^{{\ast}}- y_{ k,t} + hd_{k,t}) -\varDelta _{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}) {}\\ & & = F(y_{k,t+1}) + \frac{h} {2}(2 - Lh)\|d_{k,t}\|^{2} + d_{ k,t}^{T}(w^{{\ast}}- y_{ k,t}) -\varDelta _{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}) {}\\ & & \mathop{\geq }\limits^{ h \leq 1/L}F(y_{k,t+1}) + \frac{h} {2}\vert d_{k,t}\|^{2} + d_{ k,t}^{T}(w^{{\ast}}- y_{ k,t}) -\varDelta _{k,t}^{T}(w^{{\ast}}- y_{ k,t+1}), {}\\ \end{array}$$

then equivalently,

$$\displaystyle{ -d_{k,t}^{T}(y_{ k,t} - w_{{\ast}}) + \frac{h} {2}\|d_{k,t}\|^{2}\leq F(w_{ {\ast}}) - F(y_{k,t+1}) -\varDelta _{k,t}^{T}(y_{ k,t+1} - w_{{\ast}}). }$$

(31)

Therefore,

$$\displaystyle\begin{array}{rcl} \|y_{k,t+1} - w_{{\ast}}\|^{2}& \mathop{\leq }\limits^{\text{(27)},\text{(31)}}& \|y_{ k,t} - w_{{\ast}}\|^{2} + 2h\left (F(w_{ {\ast}}) - F(y_{k,t+1}) -\varDelta _{k,t}^{T}(y_{ k,t+1} - w_{{\ast}})\right ) \\ & = & \|y_{k,t} - w_{{\ast}}\|^{2} - 2h\varDelta _{ k,t}^{T}(y_{ k,t+1} - w_{{\ast}}) - 2h[F(y_{k,t+1}) - F(w_{{\ast}})].{}\end{array}$$

(32)

In order to bound −Δ _k, t ^T(y _k, t+1 − w _∗), let us define the proximal full gradient update as^{Footnote 7}

$$\displaystyle{ \bar{y}_{k,t+1} =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}}(y_{k,t} - h\nabla F(y_{k,t})), }$$

with which, by using Cauchy-Schwartz inequality and Lemma 1, we can conclude that

$$\displaystyle\begin{array}{rcl} & & -\varDelta _{k,t}^{T}(y_{ k,t+1} - w_{{\ast}}) = -\varDelta _{k,t}^{T}(y_{ k,t+1} -\bar{ y}_{k,t+1}) -\varDelta _{k,t+1}^{T}(\bar{y}_{ k,t+1} - w_{{\ast}}) \\ & & \qquad = -\varDelta _{k,t}^{T}\left [\mathop{\mathrm{proj}}\nolimits _{ \mathcal{W}}(y_{k,t} - hG_{k,t}) -\mathop{\mathrm{proj}}\nolimits _{\mathcal{W}}(y_{k,t} - h\nabla F(y_{k,t}))\right ] -\varDelta _{k,t}^{T}(\bar{y}_{ k,t+1} - w_{{\ast}}) \\ & & \qquad \leq \|\varDelta _{k,t}\|\|(y_{k,t} - hG_{k,t}) - (y_{k,t} - h\nabla F(y_{k,t}))\| -\varDelta _{k,t}^{T}(\bar{y}_{ k,t+1} - w_{{\ast}}), \\ & & \qquad = h\|\varDelta _{k,t}\|^{2} -\varDelta _{ k,t}^{T}(\bar{y}_{ k,t+1} - w_{{\ast}}). {}\end{array}$$

(33)

So we have

$$\displaystyle\begin{array}{rcl} & & \|y_{k,t+1} - w_{{\ast}}\|^{2} {}\\ & & \mathop{\leq }\limits^{\text{(32)},\text{(33)}}\left \|y_{k,t} - w_{{\ast}}\right \|^{2} + 2h\left (h\|\varDelta _{ k,t}\|^{2} -\varDelta _{ k,t}^{T}(\bar{y}_{ k,t+1} - w_{{\ast}}) - [F(y_{k,t+1}) - F(w_{{\ast}})]\right ).{}\\ \end{array}$$

By taking expectation, conditioned on y _k, t ^{Footnote 8} we obtain

$$\displaystyle{ \mathbf{E}[\|y_{k,t+1}-w_{{\ast}}\|^{2}]\mathop{ \leq }\limits^{\text{(33)},\text{(32)}}\left \|y_{ k,t} - w_{{\ast}}\right \|^{2}+2h\left (h\mathbf{E}[\|\varDelta _{ k,t}\|^{2}] -\mathbf{E}[F(y_{ k,t+1}) - F(w_{{\ast}})]\right ), }$$

(34)

where we have used that E[Δ _k, t] = E[G _k, t] −∇F(y _k, t) = 0 and hence $\mathbf{E}[-\varDelta _{k,t}^{T}(\bar{y}_{k,t+1} - w_{{\ast}})] = 0$.^{Footnote 9} Now, if we put (16) into (34) we obtain

$$\displaystyle\begin{array}{rcl} & & \mathbf{E}[\|y_{k,t+1} - w_{{\ast}}\|^{2}] \leq \left \|y_{ k,t} - w_{{\ast}}\right \|^{2} \\ & & \qquad + 2h\left (4Lh\alpha (b)(F(y_{k,t}) - F(w_{{\ast}}) + F(w_{k}) - F(w_{{\ast}})) -\mathbf{E}[F(y_{k,t+1}) - F(w_{{\ast}})]\right ),{}\end{array}$$

(35)

where $\alpha (b) = \frac{m-b} {b(m-1)}$.

Now, if we consider that we have just lower-bounds ν _F ≥ 0 of the true strong convexity parameter μ _F, then we obtain from (35) that

$$\displaystyle\begin{array}{rcl} & & \mathbf{E}[\|y_{k,t+1} - w_{{\ast}}\|^{2}] \leq \left \|y_{ k,t} - w_{{\ast}}\right \|^{2} {}\\ & & \qquad + 2h\left (4Lh\alpha (b)(F(y_{k,t}) - F(w_{{\ast}}) + F(w_{k}) - F(w_{{\ast}})) -\mathbf{E}[F(y_{k,t+1}) - F(w_{{\ast}})]\right ),{}\\ \end{array}$$

which, by decreasing the index t by 1, is equivalent to

$$\displaystyle\begin{array}{rcl} \mathbf{E}[\|y_{k,t}& -& w_{{\ast}}\|^{2}] + 2h\mathbf{E}[F(y_{ k,t}) - F(w_{{\ast}})] \leq \left \|y_{k,t-1} - w_{{\ast}}\right \|^{2} \\ & & \qquad \qquad + 8h^{2}L\alpha (b)(F(y_{ k,t-1}) - F(w_{{\ast}}) + F(w_{k}) - F(w_{{\ast}})).{}\end{array}$$

(36)

Now, by the definition of w _k we have that

$$\displaystyle\begin{array}{rcl} \mathbf{E}[F(w_{k+1})]& =& \frac{1} {M}\sum _{t=1}^{M}\mathbf{E}[F(y_{ k,t})].{}\end{array}$$

(37)

By summing (36) multiplied by (1 − hν _F)^M−t for t = 1, …, M, we can obtain the left-hand side

$$\displaystyle\begin{array}{rcl} LHS =\sum _{ t=1}^{M}\mathbf{E}[\|y_{ k,t} - w_{{\ast}}\|^{2}] + 2h\sum _{ t=1}^{M}\mathbf{E}[F(y_{ k,t}) - F(w_{{\ast}})]& &{}\end{array}$$

(38)

and the right-hand side

$$\displaystyle\begin{array}{rcl} RHS& =& \,\sum _{t=1}^{M}\mathbf{E}\|y_{ k,t-1} - w_{{\ast}}\|^{2}\,+\,8h^{2}L\alpha (b)\sum _{ t=1}^{M}\mathbf{E}[F(y_{ k,t-1})\,-\,F(w_{{\ast}})\,+\,F(w_{k})\,-\,F(w_{{\ast}})] \\ & =& \sum _{t=0}^{M-1}\mathbf{E}\|y_{ k,t} - w_{{\ast}}\|^{2} + 8h^{2}L\alpha (b)\left (\sum _{ t=0}^{M-1}\mathbf{E}[P(y_{ k,t}) - P(w_{{\ast}})]\right ) \\ & & \qquad \qquad + 8h^{2}L\alpha (b)M\mathbf{E}[F(w_{ k}) - F(w_{{\ast}})] \\ & \leq & \sum _{t=0}^{M-1}\mathbf{E}\|y_{ k,t} - w_{{\ast}}\|^{2} + 8h^{2}L\alpha (b)\left (\sum _{ t=0}^{M}\mathbf{E}[F(y_{ k,t}) - F(w_{{\ast}})]\right ) \\ & & \qquad \qquad + 8Mh^{2}L\alpha (b)\mathbf{E}[F(w_{ k}) - F(w_{{\ast}})]. {}\end{array}$$

(39)

Combining (38) and (39) and using the fact that LHS ≤ RHS we have

$$\displaystyle\begin{array}{rcl} & & \mathbf{E}[\|y_{k,M} - w_{{\ast}}\|^{2}] + 2h\sum _{ t=1}^{M}\mathbf{E}[F(y_{ k,t}) - F(w_{{\ast}})] {}\\ & & \leq \mathbf{E}\|y_{k,0} - w_{{\ast}}\|^{2} + 8Mh^{2}L\alpha (b)\mathbf{E}[F(w_{ k}) - F(w_{{\ast}})] {}\\ & & +8h^{2}L\alpha (b)\left (\sum _{ t=1}^{M}\mathbf{E}[F(y_{ k,t}) - F(w_{{\ast}})]\right ) {}\\ & & +8h^{2}L\alpha (b)\mathbf{E}[F(y_{ k,0}) - F(w_{{\ast}})]. {}\\ \end{array}$$

Now, using (37) we obtain

$$\displaystyle\begin{array}{rcl} & & \mathbf{E}[\|y_{k,M} - w_{{\ast}}\|^{2}] + 2Mh(\mathbf{E}[F(w_{ k+1})] - F(w_{{\ast}})) \\ & & \leq \mathbf{E}\|y_{k,0} - w_{{\ast}}\|^{2} + 8Mh^{2}L\alpha (b)\mathbf{E}[F(w_{ k}) - F(w_{{\ast}})] \\ & & +8Mh^{2}L\alpha (b)\left (\mathbf{E}[F(w_{ k+1})] - F(w_{{\ast}})\right ) \\ & & +8h^{2}L\alpha (b)\mathbf{E}[F(y_{ k,0}) - F(w_{{\ast}})]. {}\end{array}$$

(40)

Note that all the above results hold for any optimal solution $w_{{\ast}}\in \mathcal{W}^{{\ast}}$; therefore, they also hold for $w_{{\ast}}' =\mathop{ \mathrm{proj}}\nolimits _{\mathcal{W}^{{\ast}}}(w_{k})$, and Lemma 5 implies that, under weak strong convexity of F, i.e., ν _F = 0,

$$\displaystyle{ \|w_{k} - w_{{\ast}}'\|^{2} \leq \frac{2\beta } {\mu } [F(w_{k}) - F(w_{{\ast}}')]. }$$

(41)

Considering E∥y _k, M − w _∗′∥² ≥ 0, y _k, 0 = w _k, and using (41), the inequality (40) with w _∗ replaced by w _∗′ gives us

$$\displaystyle\begin{array}{rcl} & & 2Mh\left \{1 - 4hL\alpha (b)\right \}[\mathbf{E}[F(w_{k+1})] - F(w_{{\ast}}')] {}\\ & & \qquad \qquad \qquad \qquad \leq \left \{\frac{2\beta } {\mu } + 8Mh^{2}L\alpha (b) + 8h^{2}L\alpha (b)\right \}[F(w_{ k}) - F(w_{{\ast}}')], {}\\ \end{array}$$

or equivalently,

$$\displaystyle{ \mathbf{E}[F(w_{k+1}) - F(w_{{\ast}}')] \leq \rho [F(w_{k}) - F(w_{{\ast}}')], }$$

when 1 − 4hLα(b) > 0 (which is equivalent to $h \leq \frac{1} {4L\alpha (b)}$ ), and when ρ is defined as

$$\displaystyle{ \rho = \frac{\beta /\mu + 4h^{2}L\alpha (b)(M + 1)} {h\left (1 - 4hL\alpha (b)\right )M} }$$

The above statement, together with assumptions of h ≤ 1∕L, implies

$$\displaystyle{0 <h \leq \min \left \{ \frac{1} {4L\alpha (b)}, \frac{1} {L}\right \}.}$$

Applying the above linear convergence relation recursively with chained expectations and realizing that F(w _∗′) = F(w _∗) for any $w_{{\ast}}\in \mathcal{W}^{{\ast}}$ since $w_{{\ast}},w_{{\ast}}'\in \mathcal{W}^{{\ast}}$, we have

$$\displaystyle{ \mathbf{E}[F(w_{k}) - F(w_{{\ast}})] \leq \rho ^{k}[F(w_{ 0}) - F(w_{{\ast}})]. }$$

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liu, J., Takáč, M. (2017). Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme Under Weak Strong Convexity Assumption. In: Takáč, M., Terlaky, T. (eds) Modeling and Optimization: Theory and Applications. MOPTA 2016. Springer Proceedings in Mathematics & Statistics, vol 213. Springer, Cham. https://doi.org/10.1007/978-3-319-66616-7_7

Download citation

DOI: https://doi.org/10.1007/978-3-319-66616-7_7
Published: 01 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66615-0
Online ISBN: 978-3-319-66616-7
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)

Publish with us

Policies and ethics

Projected Semi-Stochastic Gradient Descent Method with Mini-Batch Scheme Under Weak Strong Convexity Assumption

Abstract

Access this chapter

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Editor information

Editors and Affiliations

Appendices

Appendix 1: Technical Results

Lemma 1.

Lemma 2 (Modified Lemma 1 in [34]).

Lemma 3 (Lemma 4 in [12]).

Theorem 3 (Bounding Variance).

Lemma 4 (Hoffman Bound, Lemma 15 in [33]).

Lemma 5 (Weak Strong Convexity).

Appendix 2: Proofs

2.1 Proof of Lemma 1

2.2 Proof of Lemma 2

2.3 Proof of Lemma 5

2.4 Proof of Theorem 1

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation