Stochastic Gauss–Seidel type inertial proximal alternating linearized minimization and its application to proximal neural networks

  • Original Article
  • Published in Mathematical Methods of Operations Research

A Correction to this article was published on 12 March 2024

Abstract

In many optimization problems arising from the machine learning, image processing, and statistics communities, the objective function involves huge amounts of data in a special form, which encourages the application of stochastic algorithms. In this paper, we study a broad class of nonconvex nonsmooth minimization problems of this type, whose objective function is the sum of a smooth function of all variables and two nonsmooth functions, each depending on one block of variables. We propose to solve this problem with a stochastic Gauss–Seidel type inertial proximal alternating linearized minimization (SGiPALM) algorithm. We prove that under the Kurdyka–Łojasiewicz (KŁ) property and some mild conditions, each bounded sequence generated by SGiPALM with a variance-reduced stochastic gradient estimator either globally converges to a critical point after a finite number of iterations or almost surely satisfies the finite length property. We also apply the SGiPALM algorithm to a 4-layer proximal neural network (PNN) for classification tasks on the MNIST dataset and compare it with other deterministic and stochastic optimization algorithms; the results illustrate the effectiveness of the proposed algorithm.
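To fix ideas, the update scheme can be sketched in a few lines of Python: each block takes a proximal-linearized step at an inertial (extrapolated) point using a stochastic gradient of the coupling function H, and the y-block already uses the freshly extrapolated x (the Gauss–Seidel feature). This is a minimal illustrative sketch based on the steps (3)–(6) referenced in the appendix; prox_f, prox_g, and the variance-reduced estimators grad_x_est, grad_y_est are user-supplied placeholders, and all names are ours rather than the paper's code.

def sgipalm_step(x_bar, y_bar, prox_f, prox_g, grad_x_est, grad_y_est,
                 c1, c2, alpha1, alpha2):
    # x-block: proximal-linearized step at the inertial point (step (3))
    x_new = prox_f(x_bar - c1 * grad_x_est(x_bar, y_bar), c1)
    # inertial extrapolation for the x-block (step (4))
    x_bar_new = x_new + alpha1 * (x_new - x_bar)
    # y-block: Gauss-Seidel update that already uses x_bar_new (step (5))
    y_new = prox_g(y_bar - c2 * grad_y_est(x_bar_new, y_bar), c2)
    # inertial extrapolation for the y-block (step (6))
    y_bar_new = y_new + alpha2 * (y_new - y_bar)
    return x_new, x_bar_new, y_new, y_bar_new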

Notes

  1. https://www.python.org.

  2. https://www.tensorflow.org.

  3. http://yann.lecun.com/exdb/mnist.

  4. For \(d \le n\), the (compact) Stiefel manifold is defined as \(\text{ St }(d, n):=\{W\in {\mathbb {R}}^{n\times d}: W^{T}W=I_{d}\}\). For \(d=1\) this reduces to the sphere \({\mathbb {S}}^{n-1}\), and for \(d=n\) we obtain the orthogonal group \(\text{ O }(n)\); a numerical projection sketch is given after these notes.

  5. We know that the ELU function is differentiable with a 1-Lipschitz gradient, i.e., \(\Vert \nabla \sigma (x)-\nabla \sigma (y)\Vert \le \Vert x-y\Vert .\)

  6. In particular, we have that \(\Vert W_{i}\Vert _{F} \le 2\sqrt{d}\), \(i = 1, \dots ,4\) and \(\Vert W_{5}\Vert _{F} \le 20\sqrt{10d}\).

  7. The exact Lipschitz constant is difficult to obtain when the function H acts on a high-dimensional space, so we compute an approximate Lipschitz constant based on the sampled batch; see Griewank and Walther (2008) for details.
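As a companion to note 4, the nearest point to a full-column-rank matrix W on \(\text{ St }(d, n)\) in the Frobenius norm is the orthogonal polar factor of W, computable from a thin SVD (see, e.g., Higham 2008). A minimal NumPy sketch, with illustrative names:

import numpy as np

def project_stiefel(W):
    # Nearest Q in St(d, n) = {Q : Q^T Q = I_d} to W in the Frobenius norm:
    # if W = U diag(s) V^T is a thin SVD, the minimizer is the polar factor U V^T.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# sanity check: the result has orthonormal columns
Q = project_stiefel(np.random.randn(8, 3))
assert np.allclose(Q.T @ Q, np.eye(3), atol=1e-10)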

References

  • Absil P-A, Mahony R, Sepulchre R (2008) Optimization algorithms on matrix manifolds. Princeton University Press, Princeton

  • Attouch H, Bolte J (2009) On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math Program 116(1–2):5–16

  • Attouch H, Bolte J, Redont P, Soubeyran A (2010) Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Lojasiewicz inequality. Math Oper Res 35(2):438–457

  • Attouch H, Bolte J, Svaiter BF (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods. Math Program 137(1–2):91–129

  • Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494

  • Bot RI, Csetnek ER, Nguyen D (2019) A proximal minimization algorithm for structured nonconvex and nonsmooth problems. SIAM J Optim 29(2):1300–1328

  • Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: 19th international conference on computational statistics, COMPSTAT, pp 177–186

  • Cai X, Guo K, Jiang F, Wang K, Wu Z, Han D (2022) The developments of proximal point algorithms. J Oper Res Soc China 10:197–239

  • Chao M, Han D, Cai X (2021) Convergence of the Peaceman–Rachford splitting method for a class of nonconvex programs. Numer Math Theory Methods Appl 14(2):438–460

  • Chouzenoux É, Pesquet J, Repetti A (2016) A block coordinate variable metric forward–backward algorithm. J Glob Optim 66(3):457–485

  • Danilova M, Dvurechensky P, Gasnikov A, Gorbunov E, Guminov S, Kamzolov D, Shibaev I (2020) Recent theoretical advances in non-convex optimization. arXiv:2012.06188

  • Davis D (2016) The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv:1604.00526

  • Davis D, Edmunds B, Udell M (2016) The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. Adv Neural Inf Process Syst 29:226–234

  • Defazio A, Bach FR, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv Neural Inf Process Syst 27:1646–1654

  • Driggs D, Tang J, Liang J, Davies ME, Schönlieb C (2021) A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J Imaging Sci 14(4):1932–1970

  • Gao X, Cai X, Han D (2020) A Gauss–Seidel type inertial proximal alternating linearized minimization for a class of nonconvex optimization problems. J Glob Optim 76(4):863–887

  • Griewank A, Walther A (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation, 2nd edn. SIAM, Philadelphia

  • Han D (2022) A survey on some recent developments of alternating direction method of multipliers. J Oper Res Soc China 10(1):1–52

  • Hasannasab M, Hertrich J, Neumayer S, Plonka G, Setzer S, Steidl G (2020) Parseval proximal neural networks. J Fourier Anal Appl 26(59):1–31

  • Hertrich J, Steidl G (2022) Inertial stochastic PALM and applications in machine learning. Sampl Theory Signal Process Data Anal 20(4):1–33

  • Higham NJ (2008) Functions of matrices: theory and computation. SIAM, Philadelphia

  • Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge

  • Jia Z, Gao X, Cai X, Han D (2021) Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems. J Optim Theory Appl 188(1):1–25

  • Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv Neural Inf Process Syst 26:315–323

  • Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500

  • Lan G (2020) First-order and stochastic optimization methods for machine learning. Springer, Berlin

  • Li Z, Li J (2022) Simple and optimal stochastic gradient methods for nonsmooth nonconvex optimization. J Mach Learn Res 23:1–61

  • Ma Y, Hu X, He T, Jiang X (2020) Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with Laplacian regularization. IEEE/ACM Trans Comput Biol Bioinform 17(3):788–795

  • Naanaa W, Nuzillard J (2005) Blind source separation of positive and partially correlated data. Signal Process 85(9):1711–1722

  • Nesterov Y (2018) Lectures on convex optimization, 2nd edn. Springer, Cham

  • Nguyen LM, Liu J, Scheinberg K, Takáč M (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th international conference on machine learning, pp 2613–2621

  • Nikolova M, Tan P (2019) Alternating structure-adapted proximal gradient descent for nonconvex nonsmooth block-regularized problems. SIAM J Optim 29(3):2053–2078

  • Pock T, Sabach S (2016) Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J Imaging Sci 9(4):1756–1787

  • Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407

  • Robbins H, Siegmund D (1971) A convergence theorem for non-negative almost supermartingales and some applications. In: Optimizing methods in statistics, pp 233–257

  • Rockafellar RT (1970) Convex analysis. Princeton University Press, Princeton

  • Rockafellar RT, Wets RJ-B (1998) Variational analysis. Springer, Berlin

  • Roux NL, Schmidt M, Bach FR (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. Adv Neural Inf Process Syst 25:2672–2680

  • Wang Q, Han D (2023) A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl Numer Math 189:66–87

  • Wang Q, Cui C, Han D (2023a) Accelerated doubly stochastic gradient descent for tensor CP decomposition. J Optim Theory Appl 197:665–704

  • Wang Q, Liu Z, Cui C, Han D (2023b) Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J Comput Appl Math 423:114948

  • Wen J, Fowler JE, He M, Zhao Y, Deng C, Menon V (2016) Orthogonal nonnegative matrix factorization combining multiple features for spectral-spatial dimensionality reduction of hyperspectral imagery. IEEE Trans Geosci Remote Sens 54(7):4272–4286

  • Xu Y, Yin W (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J Imaging Sci 6(3):1758–1789

  • Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716

Acknowledgements

We thank the reviewers and Associate Editor for many helpful suggestions to improve the quality of the paper.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) Grants 12126603, 11926358, 12131004, and the R&D project of Pazhou Lab (Huangpu) under Grant 2023K0604.

Author information

Corresponding author

Correspondence to Deren Han.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Lemma 3.1

Proof

The outline of the proof is similar to that of other first-order methods (Han 2022). From Definition 1.1 and the iterative step (3), we have

$$\begin{aligned} x_{k+1}\in \underset{x\in {\mathbb {R}}^{m_1}}{\arg \min }\{f(x)+\langle x-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x-{\bar{x}}_{k}\Vert ^{2}\}. \end{aligned}$$
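Completing the square shows that this update is exactly a proximal step at the extrapolated point:

$$\begin{aligned} x_{k+1}\in \underset{x\in {\mathbb {R}}^{m_1}}{\arg \min }\left\{ f(x)+\frac{1}{2c_{1}^{k}}\Vert x-({\bar{x}}_{k}-c_{1}^{k}{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k}))\Vert ^{2}\right\} ={\text {prox}}_{c_{1}^{k}f}\left( {\bar{x}}_{k}-c_{1}^{k}{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\right) . \end{aligned}$$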

Letting \(x=x_{k}\), it follows that

$$\begin{aligned} \begin{aligned}&f(x_{k+1})+\langle x_{k+1}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\\&\quad \le f(x_{k})+\langle x_{k}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

or equivalently,

$$\begin{aligned} f(x_{k+1})\le & {} f(x_{k})+\langle x_{k}-x_{k+1},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \nonumber \\{} & {} +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}. \end{aligned}$$
(18)

Similarly, using Definition 1.1 and the iterative step (5), and taking \(y=y_{k}\), we obtain

$$\begin{aligned} g(y_{k+1})\le & {} g(y_{k})+\langle y_{k}-y_{k+1},{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \nonumber \\{} & {} +\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}. \end{aligned}$$
(19)

On the other hand, from Lemma 1.1, we have

$$\begin{aligned} H(x_{k+1},y_{k})\le H(x_{k},y_{k})+\langle \nabla _{x}H(x_{k},y_{k}),x_{k+1}-x_{k}\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-x_{k}\Vert ^{2},\nonumber \\ \end{aligned}$$
(20)

and

$$\begin{aligned} H(x_{k+1},y_{k+1})\le & {} H(x_{k+1},y_{k})+\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}\nonumber \\{} & {} +\langle \nabla _{y}H(x_{k+1},y_{k}),y_{k+1}-y_{k}\rangle . \end{aligned}$$
(21)

Adding (18), (19), (20) and (21), we get

$$\begin{aligned}{} & {} \Psi (x_{k+1},y_{k+1})\nonumber \\{} & {} \quad \le \langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\nonumber \\{} & {} \qquad +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2} +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})\nonumber \\{} & {} \qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \nonumber \\{} & {} \qquad +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \quad =\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\nonumber \\{} & {} \qquad +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}\nonumber \\{} & {} \qquad +\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}\nonumber \\{} & {} \qquad -\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle . \end{aligned}$$
(22)

From Assumption 1.1, we know that \(\nabla H\) is Lipschitz continuous on bounded subsets of \({\mathbb {R}}^{m_{1}}\times {\mathbb {R}}^{m_{2}}\). Since we assume that \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is bounded, there exists a constant \(M>0\) such that, for any \(s_{2}^{k}>0\) and \(s_{4}^{k}>0\), we have

$$\begin{aligned} \begin{aligned}&\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\quad \le \frac{s_{2}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}+\frac{1}{2s_{2}^{k}}\Vert \nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2}\\&\quad \le \frac{s_{2}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}+\frac{M^{2}}{2s_{2}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\frac{M^{2}}{2s_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\quad \le \frac{s_{4}^{k}}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{M^{2}}{2s_{4}^{k}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +\frac{M^{2}}{2s_{4}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

where the first inequality follows from Young's inequality, \(\langle a,b\rangle \le \frac{s}{2}\Vert a\Vert ^{2}+\frac{1}{2s}\Vert b\Vert ^{2}\) for any \(s>0\), and the second from the Lipschitz continuity of \(\nabla H\). Similarly, for any \(s_{1}^{k}>0\) and \(s_{3}^{k}>0\), we can get that

$$\begin{aligned} \begin{aligned}&\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\quad \le \frac{1}{2s_{1}^{k}}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{s_{1}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\quad \le \frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2} +\frac{s_{3}^{k}}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}. \end{aligned} \end{aligned}$$

From the inertial step (4) in Algorithm 1, it follows that

$$\begin{aligned} \begin{aligned} {\bar{x}}_{k+1}-x_{k+1}&=\alpha _{1}^{k}(x_{k+1}-{\bar{x}}_{k}) =\alpha _{1}^{k}(x_{k+1}-x_{k})+\alpha _{1}^{k}(x_{k}-{\bar{x}}_{k}). \end{aligned}\end{aligned}$$

Thus, applying \(\Vert u+v\Vert ^{2}\le (1+\varepsilon )\Vert u\Vert ^{2}+(1+\varepsilon ^{-1})\Vert v\Vert ^{2}\) with \(\varepsilon =\alpha _{1}^{k}\), we can get

$$\begin{aligned} \begin{aligned} \Vert x_{k+1}-x_{k}\Vert ^{2}&=\frac{1}{(\alpha _{1}^{k})^{2}}\Vert ({\bar{x}}_{k+1}-x_{k+1})-\alpha _{1}^{k}(x_{k}-{\bar{x}}_{k})\Vert ^{2}\\&\le \frac{1+\alpha _{1}^{k}}{(\alpha _{1}^{k})^{2}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\frac{1+\alpha _{1}^{k}}{\alpha _{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$
(23)

and

$$\begin{aligned} \Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}=\frac{1}{(\alpha _{1}^{k})^{2}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}. \end{aligned}$$

Similarly, using the inertial step (6) in Algorithm 1, we also have

$$\begin{aligned} \begin{aligned} \Vert y_{k+1}-y_{k}\Vert ^{2}&=\frac{1}{(\alpha _{2}^{k})^{2}}\Vert ({\bar{y}}_{k+1}-y_{k+1})-\alpha _{2}^{k}(y_{k}-{\bar{y}}_{k})\Vert ^{2}\\&\le \frac{1+\alpha _{2}^{k}}{(\alpha _{2}^{k})^{2}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1+\alpha _{2}^{k}}{\alpha _{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$
(24)

and

$$\begin{aligned} \Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}=\frac{1}{(\alpha _{2}^{k})^{2}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}. \end{aligned}$$
(25)

Now we consider the term \(\frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2}+\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\). Applying the conditional expectation operator \({\mathbb {E}}_{k}\), we can bound the MSE terms using (9) in Definition 2.1. This gives

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}\left[ \frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\right] \\&\quad \le \frac{1}{2s_{13}^{k}}{\mathbb {E}}_{k}[\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}]\\&\quad \le \frac{1}{2s_{13}^{k}}(\Gamma _{k}+V_{1}({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}] +(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))), \end{aligned} \end{aligned}$$
(26)

where \(s_{13}^{k}=\min \{s_{1}^{k},s_{3}^{k}\}\). Next, from (11) in Definition 2.1, we can get

$$\begin{aligned} \begin{aligned} \frac{1}{2s_{13}^{k}}\Gamma _{k}&\le \frac{1}{2s_{13}^{k}\tau }(-{\mathbb {E}}_{k}\Gamma _{k+1}+\Gamma _{k}+V_{\Gamma }({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\\&\quad +\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}]+(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))). \end{aligned} \end{aligned}$$
(27)

From the above two inequalities (26) and (27), it shows that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}\left[ \frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\right] \\&\quad \le \left( \frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau }\right) ({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}] \\&\qquad +(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))-\frac{1}{2s_{13}^{k}\tau }{\mathbb {E}}_{k}\Gamma _{k+1}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}. \end{aligned} \end{aligned}$$

Applying the conditional expectation operator \({\mathbb {E}}_{k}\) on the inequality (22), we have that

$$\begin{aligned}&{\mathbb {E}}_{k}\Psi (x_{k+1},y_{k+1})\\&\quad \le {\mathbb {E}}_{k}[\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\\&\qquad -\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}+\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}+\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\\&\qquad +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle +\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\qquad +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}]\\&\quad \le \Psi (x_{k},y_{k})+(\frac{1}{2c_{1}^{k}}+\frac{(L_{1}(y_{k})+s_{2}^{k}+s_{1}^{k})(1+\alpha _{1}^{k})}{2\alpha _{1}^{k}}\\&\qquad +\frac{M^{2}}{2s_{2}^{k}}+\frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau })\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2s_{13}^{k}\tau }{\mathbb {E}}_{k}\Gamma _{k+1}\\&\qquad +(\frac{1}{2c_{2}^{k}}+\frac{(L_{2}(x_{k+1})+s_{3}^{k}+s_{4}^{k})(1+\alpha _{2}^{k})}{2\alpha _{2}^{k}}+\frac{M^{2}}{2s_{2}^{k}} +\frac{M^{2}}{2s_{4}^{k}}\\&\qquad +\frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau })\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}\\&\qquad -(\frac{1}{2c_{1}^{k}(\alpha _{1}^{k})^{2}}-\frac{(L_{1}(y_{k})+s_{2}^{k}+s_{1}^{k})(1+\alpha _{1}^{k})}{2(\alpha _{1}^{k})^{2}}-\frac{M^{2}}{2s_{4}^{k}}\\&\qquad -\frac{V_{1}}{2s_{13}^{k}}-\frac{V_{\Gamma }}{2s_{13}^{k}\tau }){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\\&\qquad -\left( \frac{1}{2c_{2}^{k}(\alpha _{2}^{k})^{2}}-\frac{(L_{2}(x_{k+1})+s_{3}^{k}+s_{4}^{k})(1+\alpha _{2}^{k})}{2(\alpha _{2}^{k})^{2}} -\frac{V_{1}}{2s_{13}^{k}}-\frac{V_{\Gamma }}{2s_{13}^{k}\tau }\right) {\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}. \end{aligned}$$

Denote by \(M_{1}^{k}\), \(M_{2}^{k}\), \(M_{3}^{k}\) and \(M_{4}^{k}\) the coefficients of \(\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\), \(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}\), \(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\) in the above inequality, respectively. We can rewrite the above inequality as

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +M_{2}^{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k+1}]\\&\quad \le {\Psi }(x_{k},y_{k})+M_{3}^{k}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}. \end{aligned} \end{aligned}$$

This completes the proof. \(\square \)
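The descent inequality above relies only on the abstract estimator properties (9)–(11) in Definition 2.1, which variance-reduced estimators such as SAGA (Defazio et al. 2014) satisfy (see Driggs et al. 2021). For concreteness, a minimal SAGA-style estimator for a finite-sum gradient \(\nabla H=\frac{1}{n}\sum _{i=1}^{n}\nabla H_{i}\) might look as follows; grad_i is a user-supplied per-sample gradient, batch holds distinct sample indices, and all names are illustrative rather than the paper's code.

import numpy as np

class SagaEstimator:
    # SAGA-style variance-reduced gradient estimator (illustrative sketch).
    # Keeps a table of the most recent per-sample gradients; the estimator
    # is unbiased, and its mean-squared error shrinks as the iterates
    # settle, which is what properties (9)-(11) formalize.
    def __init__(self, grad_i, n, dim):
        self.grad_i = grad_i             # grad_i(i, z): gradient of H_i at z
        self.table = np.zeros((n, dim))  # stored per-sample gradients
        self.avg = np.zeros(dim)         # average of the stored gradients
        self.n = n

    def estimate(self, z, batch):
        new = np.stack([self.grad_i(i, z) for i in batch])
        old = self.table[batch]
        # unbiased estimator: minibatch correction plus the table average
        grad = new.mean(axis=0) - old.mean(axis=0) + self.avg
        # refresh the table entries and their running average
        self.avg += (new - old).sum(axis=0) / self.n
        self.table[batch] = new
        return grad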

Proof of Lemma 3.2

Proof

If we let \(L=\max \{L_{1}(y_{k}),L_{2}(x_{k+1})\}\), \(\alpha _{1}^{k}=\alpha _{2}^{k}=\alpha _{k}\), \(c_{1}^{k}=c_{2}^{k}=c_{k}\) and \(s_{1}^{k}=s_{2}^{k}=s_{3}^{k}=s_{4}^{k}=M\), then combining these choices with inequality (12) in Lemma 3.1 shows that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +M_{2}^{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+M_{3}^{k}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2\tau M}\Gamma _{k}. \end{aligned} \end{aligned}$$
(28)

We can rewrite inequality (28) as

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +(M_{2}^{k}-\frac{M}{2})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+(M_{3}^{k}+\frac{M}{2})\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2\tau M}\Gamma _{k}\\&\qquad -\frac{M}{2}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}-\frac{M}{2}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

where we know that \(M_{1}^{k}=M_{2}^{k}-\frac{M}{2}=:{\tilde{M}}_{k}\) and \(M_{3}^{k}+\frac{M}{2}=M_{4}^{k}=:{\bar{M}}_{k}\) from the definitions in (13). That is,

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+{\tilde{M}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+{\bar{M}}_{k}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k} -\frac{M}{2}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}-\frac{M}{2}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

We can rewrite \({\tilde{M}}_{k}=\frac{1}{2c_{k}\alpha _{k}^{2}}-h_{1}(\alpha _{k})\) and \({\bar{M}}_{k}=\frac{1}{2c_{k}}+h_{2}(\alpha _{k})\), where \(h_{1}(\alpha _{k})\) and \(h_{2}(\alpha _{k})\) are two positive functions of \(\alpha _{k}\). If \({\tilde{M}}_{k}\ge {\bar{M}}_{k}\), then the Lyapunov function \(\Phi _{k}\) is monotonically nonincreasing. Thus we have

$$\begin{aligned} \begin{aligned}&\frac{1}{2c_{k}\alpha _{k}^{2}}-h_{1}(\alpha _{k})\ge \frac{1}{2c_{k}}+h_{2}(\alpha _{k}) \Longrightarrow c_{k}\le \frac{1-\alpha _{k}^{2}}{2\alpha _{k}^{2}(h_{1}(\alpha _{k})+h_{2}(\alpha _{k}))}. \end{aligned} \end{aligned}$$

In short, if we let \(c_{k}\le \frac{1-\alpha _{k}^{2}}{2\alpha _{k}^{2}(h_{1}(\alpha _{k})+h_{2}(\alpha _{k}))}\), we obtain the decrease property. \(\square \)
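As a small illustration (not part of the original proof), the admissible stepsize bound just derived can be evaluated numerically; h1 and h2 stand for the problem-dependent positive functions \(h_{1}\) and \(h_{2}\) above and are passed in as callables:

def max_stepsize(alpha, h1, h2):
    # Largest c_k compatible with the decrease property:
    # c_k <= (1 - alpha^2) / (2 * alpha^2 * (h1(alpha) + h2(alpha)))
    assert 0.0 < alpha < 1.0, "inertial parameter must lie in (0, 1)"
    return (1.0 - alpha**2) / (2.0 * alpha**2 * (h1(alpha) + h2(alpha)))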

Proof of Lemma 3.4

Proof

Since \(\Psi \) is assumed to be bounded below in Assumption 1.1, the auxiliary function \(\Phi \) is also bounded below. Since the sequence \(\{\Phi _{k}\}_{k\in {\mathbb {N}}}\) is monotonically non-increasing in expectation, it converges in expectation to some real number \({\bar{\Phi }}\). From (16) in Lemma 3.2, the following inequality holds for any positive integer N:

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{N-1}{\mathbb {E}}(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}) \le \frac{2}{M}(\Phi _{0}-\Phi _{N}), \end{aligned}\end{aligned}$$

where \(\frac{2}{M}\) is a positive constant. Taking the limit as \(N\rightarrow +\infty \) on the above inequality, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{+\infty }{\mathbb {E}}(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}) \le \frac{2}{M}(\Phi _{0}-{\bar{\Phi }}) <+\infty . \end{aligned} \end{aligned}$$

Hence we can obtain \(\sum _{k=0}^{+\infty }{\mathbb {E}}[\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}]=\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}<+\infty \), which yields

$$\begin{aligned} \lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert =0 \end{aligned}$$

a.s. From (23) and (24), we know that

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert ^{2}<+\infty \end{aligned}$$

and

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert ^{2}<+\infty . \end{aligned}$$

Thus we can easily get

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert ^{2} =\sum _{k=0}^{+\infty }{\mathbb {E}}(\Vert x_{k+1}-x_{k}\Vert ^{2}+\Vert y_{k+1}-y_{k}\Vert ^{2}) <+\infty , \end{aligned}\end{aligned}$$

which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert =0\) a.s. We also know that

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert {\bar{z}}_{k+1}-{\bar{z}}_{k}\Vert ^{2}&=\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert ({\bar{z}}_{k+1}-z_{k+1})+(z_{k+1}-z_{k})+(z_{k}-{\bar{z}}_{k})\Vert ^{2}\\&\quad \le 3\sum _{k=0}^{+\infty }{\mathbb {E}}[\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}+\Vert z_{k+1}-z_{k}\Vert ^{2}+\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}]\\&\quad <+\infty , \end{aligned} \end{aligned}$$

which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert {\bar{z}}_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. From (4) and (6) in Algorithm 1, it follows that

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k}\Vert ^{2}&=\sum _{k=0}^{+\infty }\frac{1}{(\alpha _{1}^{k})^{2}}{\mathbb {E}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\sum _{k=0}^{+\infty }\frac{1}{(\alpha _{2}^{k})^{2}}{\mathbb {E}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}\\&\quad \le \alpha \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}\\&\quad <+\infty , \end{aligned} \end{aligned}$$

where \(\alpha :=\underset{k\in {\mathbb {N}}}{\lim \sup }\,\{\max (\frac{1}{(\alpha _{1}^{k})^{2}},\frac{1}{(\alpha _{2}^{k})^{2}})\}\). Thus we can deduce that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. \(\square \)

Proof of Lemma 3.5

Proof

From the first-order optimality conditions of the proximal steps (3) and (5), we have

$$\begin{aligned} 0\in \partial f(x_{k+1})+\frac{1}{c_{1}^{k}}(x_{k+1}-{\bar{x}}_{k})+{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k}), \end{aligned}$$

and

$$\begin{aligned} 0\in \partial g(y_{k+1})+\frac{1}{c_{2}^{k}}(y_{k+1}-{\bar{y}}_{k})+{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k}). \end{aligned}$$

Combining this with the fact that

$$\begin{aligned} \begin{aligned}&\partial \Psi (x_{k+1},y_{k+1}) =\left( \begin{matrix} \partial _{x}\Psi (x_{k+1},y_{k+1})\\ \partial _{y}\Psi (x_{k+1},y_{k+1}) \end{matrix}\right) = \left( \begin{matrix} \nabla _{x}H(x_{k+1},y_{k+1})+\partial f(x_{k+1})\\ \nabla _{y}H(x_{k+1},y_{k+1})+\partial g(y_{k+1}) \end{matrix}\right) , \end{aligned} \end{aligned}$$

we can know that

$$\begin{aligned} A_{k+1}\in \partial \Psi (z_{k+1}). \end{aligned}$$
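Here, rearranging the two optimality conditions above, the components of \(A_{k+1}=(A_{x}^{k+1},A_{y}^{k+1})\) take the explicit form

$$\begin{aligned} A_{x}^{k+1}&=\nabla _{x}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})-\frac{1}{c_{1}^{k}}(x_{k+1}-{\bar{x}}_{k}),\\ A_{y}^{k+1}&=\nabla _{y}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-\frac{1}{c_{2}^{k}}(y_{k+1}-{\bar{y}}_{k}), \end{aligned}$$

which is the form used in the norm bounds below.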

All that remains is to bound the norm of \(A_{k+1}\). Because the sequence \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is bounded and \(\nabla H\) is M-Lipschitz continuous on bounded sets, we have

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{x}^{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}\Vert \nabla _{x}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +\frac{1}{c_{1}^{k}}{\mathbb {E}}_{k}\Vert {\bar{x}}_{k}-x_{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert \nabla _{x}H(x_{k+1},y_{k+1})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert \\&\qquad +\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ] +\frac{1}{c_{1}^{k}}{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k}\Vert \\&\quad \le (\frac{1}{c_{1}^{k}}+M){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k}\Vert +M{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert +{\mathbb {E}}_{k}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert \\&\quad ={\mathbb {E}}_{k}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}\\&\qquad +\frac{M}{\alpha _{1}^{k}}){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +\frac{M}{\alpha _{2}^{k}}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert , \end{aligned}$$

and

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{y}^{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}\Vert \nabla _{y}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +\frac{1}{c_{2}^{k}}{\mathbb {E}}_{k}\Vert {\bar{y}}_{k}-y_{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert \nabla _{y}H(x_{k+1},y_{k+1})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ]\\&\qquad +\frac{1}{c_{2}^{k}}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert \\&\quad \le (\frac{1}{c_{2}^{k}}+M){\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert +M{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +{\mathbb {E}}_{k}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert \\&\quad ={\mathbb {E}}_{k}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{M}{\alpha _{2}^{k}}){\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert \\&\qquad +M{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert . \end{aligned}$$

Combining the above two inequalities and using (10) to bound the MSE terms, we have

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{k+1}\Vert \\&\quad ={\mathbb {E}}_{k}\Vert (A_{x}^{k+1},A_{y}^{k+1})\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert A_{x}^{k+1}\Vert +\Vert A_{y}^{k+1}\Vert ]\\&\quad \le {\mathbb {E}}_{k}[(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}+\frac{M}{\alpha _{1}^{k}}+M)\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{2M}{\alpha _{2}^{k}})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ]\\&\qquad +{\mathbb {E}}_{k}[\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ]\\&\quad \le {\mathbb {E}}_{k}[(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}+\frac{M}{\alpha _{1}^{k}}+M+V_{2})\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{2M}{\alpha _{2}^{k}}+V_{2})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ]\\&\qquad +V_{2}\Vert x_{k}-{\bar{x}}_{k}\Vert +V_{2}\Vert y_{k}-{\bar{y}}_{k}\Vert +\Upsilon _{k}\\&\quad \le \rho _{1}{\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ] +V_{2}(\Vert x_{k}-{\bar{x}}_{k}\Vert +\Vert y_{k}-{\bar{y}}_{k}\Vert )+\Upsilon _{k}\\&\quad \le \sqrt{2}\rho _{1}{\mathbb {E}}_{k}[\sqrt{\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}}]\\&\qquad +\sqrt{2}V_{2}\sqrt{\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}}+\Upsilon _{k}\\&\quad =\sqrt{2}\rho _{1}{\mathbb {E}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert +\sqrt{2}V_{2}\Vert z_{k}-{\bar{z}}_{k}\Vert +\Upsilon _{k}\\&\quad \le {\hat{\rho }}({\mathbb {E}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert +\Vert z_{k}-{\bar{z}}_{k}\Vert )+\Upsilon _{k}. \end{aligned}$$

This completes the proof. \(\square \)

Proof of Lemma 3.6

Proof

(I): From Lemma 3.1 and Lemma 3.2, there exist non-negative constants \(\delta _{1}, \delta _{2}\) and \(\delta _{3}\) such that \(\Phi _{k}\) is monotonically decreasing. Thus we can get that

$$\begin{aligned} {\mathbb {E}}_{k}\Phi _{k+1}+{\mathcal {O}}(\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2})\le \Phi _{k}. \end{aligned}$$

And from Lemma 1.2, the supermartingale convergence theorem implies that the sequence \(\{\Phi _{k}\}\) converges a.s. to a finite positive random variable. We get \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) (that is, \(\Vert x_{k}-{\bar{x}}_{k}\Vert \rightarrow 0\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert \rightarrow 0\)) a.s. from the supermartingale convergence theorem or Lemma 3.4. Since the stochastic gradient estimator \({\tilde{\nabla }}\) is variance-reduced, \({\mathbb {E}}\Gamma _{k}\rightarrow 0\), and it follows that

$$\begin{aligned} \lim _{k\rightarrow +\infty }{\mathbb {E}}\Phi _{k}=\lim _{k\rightarrow +\infty }{\mathbb {E}}\Psi _{k}\in [{\bar{\Psi }},+\infty ). \end{aligned}$$

(II): From Lemma 3.5, we know that

$$\begin{aligned} {\mathbb {E}}\Vert A_{k}\Vert \le {\hat{\rho }}{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert +\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ]+{\mathbb {E}}\Gamma _{k-1}. \end{aligned}$$

We have shown that \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) a.s. and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\). This ensures that \({\mathbb {E}}\Vert A_{k}\Vert \rightarrow 0\); the details follow from Lemma 3.5.

Let the point \(z_{*}\) be a cluster point of the sequence \(\{z_{k}\}\). This means there exists a subsequence \(\{z_{k_{l}}\}\) satisfying \(\lim _{l\rightarrow \infty }z_{k_{l}}=z_{*}\). Since the functions f and g are lower semicontinuous, we have

$$\begin{aligned} \underset{l\rightarrow \infty }{\lim \inf }\,f(x_{k_{l}})\ge f(x_{*})\quad \text {and}\quad \underset{l\rightarrow \infty }{\lim \inf }\,g(y_{k_{l}})\ge g(y_{*}). \end{aligned}$$
(29)

From Algorithm 1, we know

$$\begin{aligned} x_{k+1}\in \underset{x}{\arg \min }\{f(x)+\langle x-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{k}}\Vert x-{\bar{x}}_{k}\Vert ^{2}\}. \end{aligned}$$

If we set \(x=x_{*}\), we have

$$\begin{aligned} \begin{aligned}&f(x_{k+1})+\langle x_{k+1}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\\&\quad \le f(x_{*})+\langle x_{*}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\qquad +\frac{1}{2c_{k}}\Vert x_{*}-{\bar{x}}_{k}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle .\\ \end{aligned} \end{aligned}$$

Setting \(k=k_{l}\) leads to

$$\begin{aligned} \begin{aligned}&f(x_{k_{l}+1})+\langle x_{k_{l}+1}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle +\frac{1}{2c_{k_{l}}}\Vert x_{k_{l}+1}-{\bar{x}}_{k_{l}}\Vert ^{2}\\&\quad \le f(x_{*})+\langle x_{*}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \\&\qquad +\frac{1}{2c_{k_{l}}}\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k_{l}},\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle .\\ \end{aligned} \end{aligned}$$

Taking the limit superior as \(l\rightarrow +\infty \), we obtain

$$\begin{aligned} \begin{aligned}&\underset{l\rightarrow +\infty }{\lim \sup }f(x_{k_{l}})\\&\quad \le \underset{l\rightarrow +\infty }{\lim \sup }\{\langle x_{*}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \\&\qquad +f(x_{*})+\frac{1}{2c_{k_{l}}}\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k_{l}},\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \}.\\ \end{aligned} \end{aligned}$$

Since \({\bar{x}}_{k_{l}}\rightarrow x_{*}\) and \(\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\) is bounded, the last term above goes to zero. The first term on the right is bounded above by \(\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert \,\Vert {\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\Vert \) by the Cauchy–Schwarz inequality, and \({\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rightarrow 0\). Then we have

$$\begin{aligned} \underset{l\rightarrow \infty }{\lim \sup }\,\,f(x_{k_{l}})\le f(x_{*}). \end{aligned}$$
(30)

Similarly, we have

$$\begin{aligned} \underset{l\rightarrow +\infty }{\lim \sup }\,\,g(y_{k_{l}})\le g(y_{*}). \end{aligned}$$
(31)

By combining relations (29), (30) and (31), we know that \(f(x_{k_{l}})\rightarrow f(x_{*})\) and \(g(y_{k_{l}})\rightarrow g(y_{*})\) as \(l\rightarrow +\infty \). Together with the continuity of the function H, we can get

$$\begin{aligned} \begin{aligned} \underset{l\rightarrow +\infty }{\lim }\Psi (z_{k_{l}})&=\underset{l\rightarrow +\infty }{\lim }\{H(x_{k_{l}},y_{k_{l}})+f(x_{k_{l}})+g(y_{k_{l}})\}\\&=H(x_{*},y_{*})+f(x_{*})+g(y_{*})\\&=\Psi (x_{*},y_{*}). \end{aligned} \end{aligned}$$

Assertion (II) ensures that \(z_{*}\) is a critical point of \(\Psi \), because \({\mathbb {E}}\text {dist}(0,\partial \Psi (z_{k_{l}}))\rightarrow 0\) as \(l\rightarrow +\infty \) and the graph of \(\partial \Psi \) is closed. Thus this assertion holds. Moreover, there exists a subsequence \(\{{\hat{z}}_{k_{i}}\}_{i\in {\mathbb {N}}}\) of \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) converging to \({\hat{z}}_{*}\) with \({\hat{z}}_{*}\in \omega ({\hat{z}}_{0})\). Thus we have

$$\begin{aligned} \begin{aligned} \underset{i\rightarrow +\infty }{\lim }\Phi ({\hat{z}}_{k_{i}})&=\underset{i\rightarrow +\infty }{\lim }\{H(x_{k_{i}},y_{k_{i}})+f(x_{k_{i}})+g(y_{k_{i}})\\&\quad +\delta _{1}\Vert x_{k_{i}}-{\bar{x}}_{k_{i}}\Vert ^{2} +\delta _{2}\Vert y_{k_{i}}-{\bar{y}}_{k_{i}}\Vert ^{2}+\delta _{3}\Gamma _{k_{i}}\}\\&=H(x_{*},y_{*})+f(x_{*})+g(y_{*})\\&=\Psi (x_{*},y_{*})\\&=\Phi ({\hat{z}}_{*}). \end{aligned} \end{aligned}$$

(III): From Remark 5 in Bolte et al. (2014) or Remark 5.1 in Davis (2016), this claim holds for any sequence satisfying \(\Vert z_{k}-z_{k-1}\Vert \rightarrow 0\) a.s.

(IV): Finally, we take an arbitrary point \(z_{*}\in \omega (z_{0})\); then there exists a subsequence \(\{z_{k_{l}}\}\) such that \(z_{k_{l}}\rightarrow z_{*}\) as \(l\rightarrow +\infty \). We have that \({\mathbb {E}}\Psi (z_{k})\rightarrow \Psi _{*}\), which implies that \({\mathbb {E}}\Psi (z_{k_{l}})\rightarrow \Psi _{*}\). We have shown that \(\Psi (z_{k_{l}})\rightarrow \Psi (z_{*})\), so \({\mathbb {E}}\Psi (z_{*})=\Psi _{*}\) for all \(z_{*}\in \omega (z_{0})\). Hence \(\Psi \) has constant expectation over \(\omega (z_{0})\). \(\square \)

Proof of Theorem 3.1

Proof

If \(\theta \in (0,\frac{1}{2})\), then \(\Psi \) satisfies the KŁ property with exponent \(\frac{1}{2}\), so we only consider the case \(\theta \in [\frac{1}{2},1)\). By the KŁ property in Lemma 3.3, there exists a desingularizing function \(\phi (s)=cs^{1-\theta }\) such that, almost surely, the following inequality holds:

$$\begin{aligned} \phi '({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})]){\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\ge 1,\,\,\forall k>m. \end{aligned}$$

Combining with the bound of \({\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\) in Lemma 3.5, we have

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\\&\quad \le {\mathbb {E}}\Vert A_{k}\Vert \\&\quad \le {\hat{\rho }}{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert +\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ]+{\mathbb {E}}{\Upsilon _{k-1}}\\&\quad \le {\hat{\rho }}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}})+\sqrt{t{\mathbb {E}}\Gamma _{k-1}}, \end{aligned} \end{aligned}$$
(32)

where the final inequality follows from Jensen’s inequality, and we know that \({\Upsilon }_{k}=\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert \) for some vectors \(v_{k}^{i}\) such that

$$\begin{aligned} {\mathbb {E}}{\Upsilon }_{k}={\mathbb {E}}\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert \le {\mathbb {E}}\sqrt{t\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert ^{2}}\le \sqrt{t{\mathbb {E}}\Gamma _{k}}. \end{aligned}$$

Thus the final inequality holds by the Cauchy–Schwarz and Jensen inequalities. From inequality (10) in Definition 2.1, we can bound the term \(\sqrt{{\mathbb {E}}\Gamma _{k}}\) and get

$$\begin{aligned} \begin{aligned}&\sqrt{{\mathbb {E}}\Gamma _{k}}\\&\quad \le \sqrt{(1-\tau ){\mathbb {E}}\Gamma _{k-1}+V_{\Gamma }{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}]}\\&\quad \le \sqrt{(1-\tau ){\mathbb {E}}\Gamma _{k-1}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}\\&\quad \le (1-\frac{\tau }{2})\sqrt{{\mathbb {E}}\Gamma _{k-1}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}, \end{aligned} \end{aligned}$$
(33)

where the second and final inequalities follow from \(\sqrt{\sum _{i=1}^{m}\lambda _{i}a_{i}}\le \sum _{i=1}^{m}\sqrt{\lambda _{i}a_{i}}\) (\(\lambda _{i}, a_{i}\ge 0,i=1,\ldots ,m\)) and \(\sqrt{1-\tau }=1-\frac{\tau }{2}-\frac{\tau ^{2}}{8}-\cdots \le 1-\frac{\tau }{2}\), respectively. Thus, from (33), it follows that

$$\begin{aligned} \begin{aligned} \sqrt{t{\mathbb {E}}\Gamma _{k-1}}&\le \frac{2\sqrt{t}}{\tau }(\sqrt{{\mathbb {E}}\Gamma _{k-1}}-\sqrt{{\mathbb {E}}\Gamma _{k}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\\&\quad +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}). \end{aligned} \end{aligned}$$

Combining the above inequality and (32), we can obtain

$$\begin{aligned}{} & {} {\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\nonumber \\{} & {} \quad \le N_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}) +\frac{2\sqrt{t}}{\tau }(\sqrt{{\mathbb {E}}\Gamma _{k-1}}-\sqrt{{\mathbb {E}}\Gamma _{k}})\nonumber \\{} & {} \quad =:C_{k}, \end{aligned}$$
(34)

where \(N_{1}={\hat{\rho }}+\frac{2\sqrt{tV_{\Gamma }}}{\tau }\). Then we have

$$\begin{aligned} \phi '({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])C_{k}\ge 1,\,\,\forall k>m. \end{aligned}$$

From \(\phi (s)=cs^{1-\theta }\) in Lemma 3.3, it shows that

$$\begin{aligned} \frac{c(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }}\ge 1,\quad \forall k>m. \end{aligned}$$

From the definition of \(C_{k}\) in (34), we have

$$\begin{aligned} C_{k}&\ge N_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}})+\frac{2\sqrt{t}}{\tau }\sqrt{{\mathbb {E}}\Gamma _{k-1}}\\&\quad \ge c_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Gamma _{k-1}}) \end{aligned}$$

for some \(c_{1}>0\). We know that \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\rightarrow 0\) and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\) from Definition 2.1. Combined with \(\theta \ge \frac{1}{2}\), it follows that there exist an index m and positive constants \(c_{2}\) and \({\bar{c}}\) such that

$$\begin{aligned} \begin{aligned}&({\mathbb {E}}[\delta _{1}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\delta _{2}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\delta _{3}\Gamma _{k}])^{\theta }\\&\quad \le {c_{2}}({\mathbb {E}}[\Gamma _{k-1}+\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}])^{\theta }\\&\quad \le {\bar{c}} C_{k},\quad \forall k>m, \end{aligned} \end{aligned}$$

where the first inequality above follows from (11) in Definition 2.1. Thus there exists a constant \(0<{\bar{c}}<d<+\infty \) such that

$$\begin{aligned} \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }+({\mathbb {E}}[T_{k}])^{\theta }}\ge 1 \end{aligned}$$

for all \(k>m\), where \(T_{k}:=\delta _{1}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\delta _{2}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\delta _{3}\Gamma _{k}\). From the fact that \((a+b)^{\theta }\le a^{\theta }+b^{\theta }\) with \(\theta \in [\frac{1}{2},1)\), we can get

$$\begin{aligned} \begin{aligned} \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Phi (z_{k})-\Psi (z_{k}^{*})])^{\theta }}&=\frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})+T_{k}])^{\theta }}\\&\quad \ge \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }+({\mathbb {E}}[T_{k}])^{\theta }}\\&\quad \ge 1,\quad \forall k>m. \end{aligned} \end{aligned}$$

Therefore, with \(\phi (s)=cds^{1-\theta }\), we have that

$$\begin{aligned} \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])C_{k}\ge 1,\quad \forall k>m. \end{aligned}$$

And from the concavity of \(\phi \), we can get

$$\begin{aligned} \begin{aligned}&\phi ({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])-\phi ({\mathbb {E}}[\Phi _{k+1}-\Psi _{k+1}^{*}])\\&\quad \ge \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}+\Psi _{k+1}^{*}-\Phi _{k+1}])\\&\quad \ge \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])({\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]), \end{aligned} \end{aligned}$$

where the last inequality follows from the fact that the sequence \(\{\Psi _{k}^{*}\}\) is non-decreasing. Denote \(\triangle _{p,q}:=\phi ({\mathbb {E}}[\Phi _{p}-\Psi _{p}^{*}])-\phi ({\mathbb {E}}[\Phi _{q}-\Psi _{q}^{*}])\), and it shows that

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge {\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]. \end{aligned}$$

From Lemma 3.2, we can bound \({\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]\) below by both \({\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}\) and \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\), then we have

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge K_{1}{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}, \end{aligned}$$

as well as

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge K_{2}{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}, \end{aligned}$$
(35)

where \(K_{1}\) and \(K_{2}\) are two positive constants; the details of how to choose \(K_{1}\) and \(K_{2}\) follow from Lemma 3.2. We have that

$$\begin{aligned} \begin{aligned} 2\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}&\le 2\sqrt{C_{k}K_{1}^{-1}\triangle _{k,k+1}} \le \frac{C_{k}}{2N_{1}}+\frac{2N_{1}\triangle _{k,k+1}}{K_{1}} \end{aligned}\end{aligned}$$
(36)

where \(N_{1}\) is the constant from (34) (any positive constant would work here, but this choice is used below), and the inequalities follow from the bound \(\triangle _{k,k+1}C_{k}\ge K_{1}{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\) above and Young's inequality, respectively. Summing inequality (36) from \(k=m\) to \({\bar{m}}\), we can get

$$\begin{aligned}&2\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\frac{C_{k}}{2N_{1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}\\&\qquad +\frac{\sqrt{t}}{N_{1}\tau }(\sqrt{{\mathbb {E}}\Gamma _{m-1}}-\sqrt{{\mathbb {E}}\Gamma _{{\bar{m}}}})+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad =\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}}-\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{{\bar{m}}}-{\bar{z}}_{{\bar{m}}}\Vert ^{2}}\\&\qquad +\frac{\sqrt{t}}{N_{1}\tau }(\sqrt{{\mathbb {E}}\Gamma _{m-1}}-\sqrt{{\mathbb {E}}\Gamma _{{\bar{m}}}})+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}. \end{aligned}$$

It also shows that

$$\begin{aligned} \begin{aligned} \sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} \le \frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}. \end{aligned}\nonumber \\ \end{aligned}$$
(37)

Similarly, applying the same argument to inequality (35) shows that

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}}\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{2}}. \end{aligned} \end{aligned}$$
(38)

Combining (37) and (38) together, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}})\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{2\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}\\&\qquad +\frac{2N_{1}(K_{1}+K_{2})\triangle _{m,{\bar{m}}+1}}{K_{1}K_{2}}. \end{aligned} \end{aligned}$$

From Jensen’s inequality, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert +{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ) \le \sum _{k=m}^{{\bar{m}}}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}}). \end{aligned} \end{aligned}$$

Thus from the above two inequalities, we can get

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert +{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert )\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}}\\&\qquad +\frac{2\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}(K_{1}+K_{2})\triangle _{m,{\bar{m}}+1}}{K_{1}K_{2}}, \end{aligned} \end{aligned}$$

and letting \({\bar{m}}\rightarrow +\infty \), we have

$$\begin{aligned} \sum _{k=m}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert <+\infty . \end{aligned}$$

From (23) and (24), it follows that \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert <+\infty \) and \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert <+\infty \). It is obvious that

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert \le \sum _{k=m}^{+\infty }({\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert +{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert ) <+\infty , \end{aligned} \end{aligned}$$

and since the sequence \(\{z_{k}\}\) is bounded, the finitely many initial terms are finite as well; thus we can conclude that

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert <+\infty . \end{aligned}$$

This completes the proof. \(\square \)

Proof of Theorem 3.2

Proof

From the proof of Theorem 3.1, we know that if \(\theta \in (0,1/2)\), then \(\Psi \) satisfies the KŁ property with exponent 1/2, so we only consider the cases \(\theta \in [1/2,1)\) and \(\theta =0\).

  1. (1)

    When \(\theta =0\), the KŁ inequality shows that exactly one of the following two conditions holds:

    1. (a)

      \({\mathbb {E}}\Psi (z_{k})\ne \Psi _{k}^{*}\) and \(0<C\le {\mathbb {E}}\Vert A_{k}\Vert , \forall A_{k}\in \partial \Psi (z_{k})\);

    2. (b)

      \(\Psi (z_{k})=\Psi _{k}^{*}\).

    Based on Theorem 4.8 in Driggs et al. (2021) and Attouch et al. (2010), this case holds.

  2. (2)

    When \(\theta \in (1/2,1)\), from the proof of Theorem 3.1 and Theorem 4.8 in Driggs et al. (2021), we have

    $$\begin{aligned} T_{m}^{\frac{\theta }{1-\theta }}\le t_{1}(T_{m-1}-T_{m}) \end{aligned}$$

    for some positive constant \(t_{1}\), where

    $$\begin{aligned} T_{m}:=\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}+t_{2}\sqrt{{\mathbb {E}}\Gamma _{k-1}} \end{aligned}$$

    which is bounded for all m because \(\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\) is bounded. From the proof of (Attouch and Bolte 2009, Theorem 5) and (Driggs et al. 2021, Theorem 4.8), there exists a \(t_{3}>0\) such that

    $$\begin{aligned} t_{3}\le T_{m}^{\frac{1-2\theta }{1-\theta }}-T_{m-1}^{\frac{1-2\theta }{1-\theta }}. \end{aligned}$$

    Summing this inequality from \(m=M_{1}\) to \(M_{2}\) shows that \((M_{2}-M_{1})t_{3}\le T_{M_{2}}^{\frac{1-2\theta }{1-\theta }}-T_{M_{1}-1}^{\frac{1-2\theta }{1-\theta }}\), which implies \(T_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\) for some constant \(t_{4}\). By Jensen's inequality, we have \(\sum _{k=M_{2}}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert \le T_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\). Combined with the proof of Theorem 3.1, we can say that (3) in Theorem 3.2 holds.

  3. (3)

    When \(\theta =\frac{1}{2}\), by a proof similar to that of (Driggs et al. 2021, Theorem 4.8), we also get the desired result.

Combining the three cases completes the proof. \(\square \)

About this article


Cite this article

Wang, Q., Han, D. Stochastic Gauss–Seidel type inertial proximal alternating linearized minimization and its application to proximal neural networks. Math Meth Oper Res 99, 39–74 (2024). https://doi.org/10.1007/s00186-024-00851-6

