Stochastic Gauss–Seidel type inertial proximal alternating linearized minimization and its application to proximal neural networks

  • Original Article
  • Published in Mathematical Methods of Operations Research

A Correction to this article was published on 12 March 2024

Abstract

In many optimization problems arising from the machine learning, image processing, and statistics communities, the objective function involves huge amounts of data in a special form, which encourages the application of stochastic algorithms. In this paper, we study a broad class of nonconvex nonsmooth minimization problems of this type, whose objective function is the sum of a smooth function of all variables and two nonsmooth functions, each depending on one block of variables. We propose to solve this problem with a stochastic Gauss–Seidel type inertial proximal alternating linearized minimization (SGiPALM) algorithm. We prove that under the Kurdyka–Łojasiewicz (KŁ) property and some mild conditions, each bounded sequence generated by SGiPALM with a variance-reduced stochastic gradient estimator either globally converges to a critical point after a finite number of iterations or almost surely satisfies the finite length property. We also apply the SGiPALM algorithm to a 4-layer proximal neural network (PNN) for classification tasks on the MNIST dataset and compare it with other deterministic and stochastic optimization algorithms; the results illustrate the effectiveness of the proposed algorithm.
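To fix ideas, the update scheme can be sketched in a few lines of Python: each block takes a proximal-linearized step at an inertial (extrapolated) point using a stochastic gradient of the coupling function H, and the y-block already uses the freshly extrapolated x (the Gauss–Seidel feature). This is a minimal illustrative sketch based on the steps (3)–(6) referenced in the appendix; prox_f, prox_g, and the variance-reduced estimators grad_x_est, grad_y_est are user-supplied placeholders, and all names are ours rather than the paper's code.

def sgipalm_step(x_bar, y_bar, prox_f, prox_g, grad_x_est, grad_y_est,
                 c1, c2, alpha1, alpha2):
    # x-block: proximal-linearized step at the inertial point (step (3))
    x_new = prox_f(x_bar - c1 * grad_x_est(x_bar, y_bar), c1)
    # inertial extrapolation for the x-block (step (4))
    x_bar_new = x_new + alpha1 * (x_new - x_bar)
    # y-block: Gauss-Seidel update that already uses x_bar_new (step (5))
    y_new = prox_g(y_bar - c2 * grad_y_est(x_bar_new, y_bar), c2)
    # inertial extrapolation for the y-block (step (6))
    y_bar_new = y_new + alpha2 * (y_new - y_bar)
    return x_new, x_bar_new, y_new, y_bar_new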

Notes

  1. https://www.python.org.

  2. https://www.tensorflow.org.

  3. http://yann.lecun.com/exdb/mnist.

  4. For \(d \le n\), the (compact) Stiefel manifold is defined as \(\text{ St }(d, n):=\{W\in {\mathbb {R}}^{n\times d}: W^{T}W=I_{d}\}\). For \(d=1\) this reduces to the sphere \({\mathbb {S}}^{n-1}\), and for \(d=n\) we obtain the orthogonal group \(\text{ O }(n)\); a numerical projection sketch is given after these notes.

  5. We know that the ELU function is differentiable with a 1-Lipschitz gradient, i.e., \(\Vert \nabla \sigma (x)-\nabla \sigma (y)\Vert \le \Vert x-y\Vert .\)

  6. In particular, we have that \(\Vert W_{i}\Vert _{F} \le 2\sqrt{d}\), \(i = 1, \dots ,4\) and \(\Vert W_{5}\Vert _{F} \le 20\sqrt{10d}\).

  7. The exact Lipschitz constant is difficult to obtain when the function H acts on a high-dimensional space, so we compute an approximate Lipschitz constant based on the sampled batch; see Griewank and Walther (2008) for details.
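As a companion to note 4, the nearest point to a full-column-rank matrix W on \(\text{ St }(d, n)\) in the Frobenius norm is the orthogonal polar factor of W, computable from a thin SVD (see, e.g., Higham 2008). A minimal NumPy sketch, with illustrative names:

import numpy as np

def project_stiefel(W):
    # Nearest Q in St(d, n) = {Q : Q^T Q = I_d} to W in the Frobenius norm:
    # if W = U diag(s) V^T is a thin SVD, the minimizer is the polar factor U V^T.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# sanity check: the result has orthonormal columns
Q = project_stiefel(np.random.randn(8, 3))
assert np.allclose(Q.T @ Q, np.eye(3), atol=1e-10)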

References

  • Absil P-A, Mahony R, Sepulchre R (2008) Optimization algorithms on matrix manifolds. Princeton University Press, Princeton

  • Attouch H, Bolte J (2009) On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math Program 116(1–2):5–16

  • Attouch H, Bolte J, Redont P, Soubeyran A (2010) Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Lojasiewicz inequality. Math Oper Res 35(2):438–457

  • Attouch H, Bolte J, Svaiter BF (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods. Math Program 137(1–2):91–129

  • Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494

  • Bot RI, Csetnek ER, Nguyen D (2019) A proximal minimization algorithm for structured nonconvex and nonsmooth problems. SIAM J Optim 29(2):1300–1328

  • Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: 19th international conference on computational statistics, COMPSTAT, pp 177–186

  • Cai X, Guo K, Jiang F, Wang K, Wu Z, Han D (2022) The developments of proximal point algorithms. J Oper Res Soc China 10:197–239

  • Chao M, Han D, Cai X (2021) Convergence of the Peaceman–Rachford splitting method for a class of nonconvex programs. Numer Math Theory Methods Appl 14(2):438–460

  • Chouzenoux É, Pesquet J, Repetti A (2016) A block coordinate variable metric forward–backward algorithm. J Glob Optim 66(3):457–485

  • Danilova M, Dvurechensky P, Gasnikov A, Gorbunov E, Guminov S, Kamzolov D, Shibaev I (2020) Recent theoretical advances in non-convex optimization. arXiv:2012.06188

  • Davis D (2016) The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv:1604.00526

  • Davis D, Edmunds B, Udell M (2016) The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. Adv Neural Inf Process Syst 29:226–234

  • Defazio A, Bach FR, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv Neural Inf Process Syst 27:1646–1654

  • Driggs D, Tang J, Liang J, Davies ME, Schönlieb C (2021) A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J Imaging Sci 14(4):1932–1970

  • Gao X, Cai X, Han D (2020) A Gauss–Seidel type inertial proximal alternating linearized minimization for a class of nonconvex optimization problems. J Glob Optim 76(4):863–887

  • Griewank A, Walther A (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation, 2nd edn. SIAM, Philadelphia

  • Han D (2022) A survey on some recent developments of alternating direction method of multipliers. J Oper Res Soc China 10(1):1–52

  • Hasannasab M, Hertrich J, Neumayer S, Plonka G, Setzer S, Steidl G (2020) Parseval proximal neural networks. J Fourier Anal Appl 26(59):1–31

  • Hertrich J, Steidl G (2022) Inertial stochastic PALM and applications in machine learning. Sampl Theory Signal Process Data Anal 20(4):1–33

  • Higham NJ (2008) Functions of matrices: theory and computation. SIAM, Philadelphia

  • Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge

  • Jia Z, Gao X, Cai X, Han D (2021) Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems. J Optim Theory Appl 188(1):1–25

  • Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv Neural Inf Process Syst 26:315–323

  • Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500

  • Lan G (2020) First-order and stochastic optimization methods for machine learning. Springer, Berlin

  • Li Z, Li J (2022) Simple and optimal stochastic gradient methods for nonsmooth nonconvex optimization. J Mach Learn Res 23:1–61

  • Ma Y, Hu X, He T, Jiang X (2020) Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with Laplacian regularization. IEEE/ACM Trans Comput Biol Bioinform 17(3):788–795

  • Naanaa W, Nuzillard J (2005) Blind source separation of positive and partially correlated data. Signal Process 85(9):1711–1722

  • Nesterov Y (2018) Lectures on convex optimization, 2nd edn. Springer, Cham

  • Nguyen LM, Liu J, Scheinberg K, Takáč M (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th international conference on machine learning, pp 2613–2621

  • Nikolova M, Tan P (2019) Alternating structure-adapted proximal gradient descent for nonconvex nonsmooth block-regularized problems. SIAM J Optim 29(3):2053–2078

  • Pock T, Sabach S (2016) Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J Imaging Sci 9(4):1756–1787

  • Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407

  • Robbins H, Siegmund D (1971) A convergence theorem for non-negative almost supermartingales and some applications. In: Optimizing methods in statistics, pp 233–257

  • Rockafellar RT (1970) Convex analysis. Princeton University Press, Princeton

  • Rockafellar RT, Wets RJ-B (1998) Variational analysis. Springer, Berlin

  • Roux NL, Schmidt M, Bach FR (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. Adv Neural Inf Process Syst 25:2672–2680

  • Wang Q, Han D (2023) A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl Numer Math 189:66–87

  • Wang Q, Cui C, Han D (2023a) Accelerated doubly stochastic gradient descent for tensor CP decomposition. J Optim Theory Appl 197:665–704

  • Wang Q, Liu Z, Cui C, Han D (2023b) Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J Comput Appl Math 423:114948

  • Wen J, Fowler JE, He M, Zhao Y, Deng C, Menon V (2016) Orthogonal nonnegative matrix factorization combining multiple features for spectral-spatial dimensionality reduction of hyperspectral imagery. IEEE Trans Geosci Remote Sens 54(7):4272–4286

  • Xu Y, Yin W (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J Imaging Sci 6(3):1758–1789

  • Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716

Acknowledgements

We thank the reviewers and Associate Editor for many helpful suggestions to improve the quality of the paper.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) Grants 12126603, 11926358, 12131004, and the R&D project of Pazhou Lab (Huangpu) under Grant 2023K0604.

Author information

Corresponding author

Correspondence to Deren Han.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Lemma 3.1

Proof

The outline of the proof is similar to that of other first-order methods (Han 2022). From Definition 1.1 and the iterative step (3), we have

$$\begin{aligned} x_{k+1}\in \underset{x\in {\mathbb {R}}^{m_1}}{\arg \min }\{f(x)+\langle x-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x-{\bar{x}}_{k}\Vert ^{2}\}. \end{aligned}$$
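Completing the square shows that this update is exactly a proximal step at the extrapolated point:

$$\begin{aligned} x_{k+1}\in \underset{x\in {\mathbb {R}}^{m_1}}{\arg \min }\left\{ f(x)+\frac{1}{2c_{1}^{k}}\Vert x-({\bar{x}}_{k}-c_{1}^{k}{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k}))\Vert ^{2}\right\} ={\text {prox}}_{c_{1}^{k}f}\left( {\bar{x}}_{k}-c_{1}^{k}{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\right) . \end{aligned}$$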

Letting \(x=x_{k}\), it follows that

$$\begin{aligned} \begin{aligned}&f(x_{k+1})+\langle x_{k+1}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\\&\quad \le f(x_{k})+\langle x_{k}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

or equivalently,

$$\begin{aligned} f(x_{k+1})\le & {} f(x_{k})+\langle x_{k}-x_{k+1},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \nonumber \\{} & {} +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}. \end{aligned}$$
(18)

Similarly, using Definition 1.1 and the iterative step (5), and taking \(y=y_{k}\), we obtain

$$\begin{aligned} g(y_{k+1})\le & {} g(y_{k})+\langle y_{k}-y_{k+1},{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \nonumber \\{} & {} +\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}. \end{aligned}$$
(19)

On the other hand, from Lemma 1.1, we have

$$\begin{aligned} H(x_{k+1},y_{k})\le H(x_{k},y_{k})+\langle \nabla _{x}H(x_{k},y_{k}),x_{k+1}-x_{k}\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-x_{k}\Vert ^{2},\nonumber \\ \end{aligned}$$
(20)

and

$$\begin{aligned} H(x_{k+1},y_{k+1})\le & {} H(x_{k+1},y_{k})+\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}\nonumber \\{} & {} +\langle \nabla _{y}H(x_{k+1},y_{k}),y_{k+1}-y_{k}\rangle . \end{aligned}$$
(21)

Adding (18), (19), (20) and (21), we get

$$\begin{aligned}{} & {} \Psi (x_{k+1},y_{k+1})\nonumber \\{} & {} \quad \le \langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\nonumber \\{} & {} \qquad +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2} +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})\nonumber \\{} & {} \qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \nonumber \\{} & {} \qquad +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \quad =\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\nonumber \\{} & {} \qquad +\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}\nonumber \\{} & {} \qquad +\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}\nonumber \\{} & {} \qquad -\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}\nonumber \\{} & {} \qquad +\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle . \end{aligned}$$
(22)

From Assumption 1.1, we know that \(\nabla H\) is Lipschitz continuous on bounded subsets of \({\mathbb {R}}^{m_{1}}\times {\mathbb {R}}^{m_{2}}\). Since we assume that \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is bounded, there exists a constant \(M>0\) such that, for any \(s_{2}^{k}>0\) and \(s_{4}^{k}>0\), we have

$$\begin{aligned} \begin{aligned}&\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\quad \le \frac{s_{2}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}+\frac{1}{2s_{2}^{k}}\Vert \nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2}\\&\quad \le \frac{s_{2}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}+\frac{M^{2}}{2s_{2}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\frac{M^{2}}{2s_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\quad \le \frac{s_{4}^{k}}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{M^{2}}{2s_{4}^{k}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +\frac{M^{2}}{2s_{4}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

where the first inequality follows from Young's inequality, \(\langle a,b\rangle \le \frac{s}{2}\Vert a\Vert ^{2}+\frac{1}{2s}\Vert b\Vert ^{2}\) for any \(s>0\), and the second from the Lipschitz continuity of \(\nabla H\). Similarly, for any \(s_{1}^{k}>0\) and \(s_{3}^{k}>0\), we can get that

$$\begin{aligned} \begin{aligned}&\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\quad \le \frac{1}{2s_{1}^{k}}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{s_{1}^{k}}{2}\Vert x_{k+1}-x_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\quad \le \frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2} +\frac{s_{3}^{k}}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}. \end{aligned} \end{aligned}$$

From the inertial step (4) in Algorithm 1, it follows that

$$\begin{aligned} \begin{aligned} {\bar{x}}_{k+1}-x_{k+1}&=\alpha _{1}^{k}(x_{k+1}-{\bar{x}}_{k}) =\alpha _{1}^{k}(x_{k+1}-x_{k})+\alpha _{1}^{k}(x_{k}-{\bar{x}}_{k}). \end{aligned}\end{aligned}$$

Thus, applying \(\Vert u+v\Vert ^{2}\le (1+\varepsilon )\Vert u\Vert ^{2}+(1+\varepsilon ^{-1})\Vert v\Vert ^{2}\) with \(\varepsilon =\alpha _{1}^{k}\), we can get

$$\begin{aligned} \begin{aligned} \Vert x_{k+1}-x_{k}\Vert ^{2}&=\frac{1}{(\alpha _{1}^{k})^{2}}\Vert ({\bar{x}}_{k+1}-x_{k+1})-\alpha _{1}^{k}(x_{k}-{\bar{x}}_{k})\Vert ^{2}\\&\le \frac{1+\alpha _{1}^{k}}{(\alpha _{1}^{k})^{2}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\frac{1+\alpha _{1}^{k}}{\alpha _{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$
(23)

and

$$\begin{aligned} \Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}=\frac{1}{(\alpha _{1}^{k})^{2}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}. \end{aligned}$$

Similarly, using the inertial step (6) in Algorithm 1, we also have

$$\begin{aligned} \begin{aligned} \Vert y_{k+1}-y_{k}\Vert ^{2}&=\frac{1}{(\alpha _{2}^{k})^{2}}\Vert ({\bar{y}}_{k+1}-y_{k+1})-\alpha _{2}^{k}(y_{k}-{\bar{y}}_{k})\Vert ^{2}\\&\le \frac{1+\alpha _{2}^{k}}{(\alpha _{2}^{k})^{2}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1+\alpha _{2}^{k}}{\alpha _{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$
(24)

and

$$\begin{aligned} \Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}=\frac{1}{(\alpha _{2}^{k})^{2}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}. \end{aligned}$$
(25)

Now we consider the term \(\frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2}+\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\). Applying the conditional expectation operator \({\mathbb {E}}_{k}\), we can bound the MSE terms using (9) in Definition 2.1. This gives

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}\left[ \frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\right] \\&\quad \le \frac{1}{2s_{13}^{k}}{\mathbb {E}}_{k}[\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}]\\&\quad \le \frac{1}{2s_{13}^{k}}(\Gamma _{k}+V_{1}({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}] +(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))), \end{aligned} \end{aligned}$$
(26)

where \(s_{13}^{k}=\min \{s_{1}^{k},s_{3}^{k}\}\). Next, from (11) in Definition 2.1, we can get

$$\begin{aligned} \begin{aligned} \frac{1}{2s_{13}^{k}}\Gamma _{k}&\le \frac{1}{2s_{13}^{k}\tau }(-{\mathbb {E}}_{k}\Gamma _{k+1}+\Gamma _{k}+V_{\Gamma }({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\\&\quad +\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}]+(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))). \end{aligned} \end{aligned}$$
(27)

From the above two inequalities (26) and (27), it shows that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}\left[ \frac{1}{2s_{1}^{k}}\Vert {\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})}-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2} +\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\right] \\&\quad \le \left( \frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau }\right) ({\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}] \\&\qquad +(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}))-\frac{1}{2s_{13}^{k}\tau }{\mathbb {E}}_{k}\Gamma _{k+1}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}. \end{aligned} \end{aligned}$$

Applying the conditional expectation operator \({\mathbb {E}}_{k}\) on the inequality (22), we have that

$$\begin{aligned}&{\mathbb {E}}_{k}\Psi (x_{k+1},y_{k+1})\\&\quad \le {\mathbb {E}}_{k}[\langle x_{k+1}-x_{k},\nabla _{x}H(x_{k},y_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\Psi (x_{k},y_{k})\\&\qquad -\frac{1}{2c_{1}^{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2c_{2}^{k}}\Vert y_{k+1}-{\bar{y}}_{k}\Vert ^{2}+\langle x_{k+1}-x_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{L_{1}(y_{k})}{2}\Vert x_{k+1}-{x_{k}}\Vert ^{2}+\frac{1}{2c_{2}^{k}}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\\&\qquad +\langle y_{k+1}-y_{k},\nabla _{y}H(x_{k+1},y_{k})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle +\langle y_{k+1}-y_{k},\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\rangle \\&\qquad +\frac{L_{2}(x_{k+1})}{2}\Vert y_{k+1}-y_{k}\Vert ^{2}+\frac{1}{2c_{1}^{k}}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}]\\&\quad \le \Psi (x_{k},y_{k})+(\frac{1}{2c_{1}^{k}}+\frac{(L_{1}(y_{k})+s_{2}^{k}+s_{1}^{k})(1+\alpha _{1}^{k})}{2\alpha _{1}^{k}}\\&\qquad +\frac{M^{2}}{2s_{2}^{k}}+\frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau })\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}-\frac{1}{2s_{13}^{k}\tau }{\mathbb {E}}_{k}\Gamma _{k+1}\\&\qquad +(\frac{1}{2c_{2}^{k}}+\frac{(L_{2}(x_{k+1})+s_{3}^{k}+s_{4}^{k})(1+\alpha _{2}^{k})}{2\alpha _{2}^{k}}+\frac{M^{2}}{2s_{2}^{k}} +\frac{M^{2}}{2s_{4}^{k}}\\&\qquad +\frac{V_{1}}{2s_{13}^{k}}+\frac{V_{\Gamma }}{2s_{13}^{k}\tau })\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}\\&\qquad -(\frac{1}{2c_{1}^{k}(\alpha _{1}^{k})^{2}}-\frac{(L_{1}(y_{k})+s_{2}^{k}+s_{1}^{k})(1+\alpha _{1}^{k})}{2(\alpha _{1}^{k})^{2}}-\frac{M^{2}}{2s_{4}^{k}}\\&\qquad -\frac{V_{1}}{2s_{13}^{k}}-\frac{V_{\Gamma }}{2s_{13}^{k}\tau }){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\\&\qquad -\left( \frac{1}{2c_{2}^{k}(\alpha _{2}^{k})^{2}}-\frac{(L_{2}(x_{k+1})+s_{3}^{k}+s_{4}^{k})(1+\alpha _{2}^{k})}{2(\alpha _{2}^{k})^{2}} -\frac{V_{1}}{2s_{13}^{k}}-\frac{V_{\Gamma }}{2s_{13}^{k}\tau }\right) {\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}. \end{aligned}$$

Denote by \(M_{1}^{k}\), \(M_{2}^{k}\), \(M_{3}^{k}\) and \(M_{4}^{k}\) the coefficients of \(\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\), \(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}\), \(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\) in the above inequality, respectively. We can rewrite the above inequality as

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +M_{2}^{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2s_{13}^{k}\tau }\Gamma _{k+1}]\\&\quad \le {\Psi }(x_{k},y_{k})+M_{3}^{k}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2s_{13}^{k}\tau }\Gamma _{k}. \end{aligned} \end{aligned}$$

This completes the proof. \(\square \)
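The descent inequality above relies only on the abstract estimator properties (9)–(11) in Definition 2.1, which variance-reduced estimators such as SAGA (Defazio et al. 2014) satisfy (see Driggs et al. 2021). For concreteness, a minimal SAGA-style estimator for a finite-sum gradient \(\nabla H=\frac{1}{n}\sum _{i=1}^{n}\nabla H_{i}\) might look as follows; grad_i is a user-supplied per-sample gradient, batch holds distinct sample indices, and all names are illustrative rather than the paper's code.

import numpy as np

class SagaEstimator:
    # SAGA-style variance-reduced gradient estimator (illustrative sketch).
    # Keeps a table of the most recent per-sample gradients; the estimator
    # is unbiased, and its mean-squared error shrinks as the iterates
    # settle, which is what properties (9)-(11) formalize.
    def __init__(self, grad_i, n, dim):
        self.grad_i = grad_i             # grad_i(i, z): gradient of H_i at z
        self.table = np.zeros((n, dim))  # stored per-sample gradients
        self.avg = np.zeros(dim)         # average of the stored gradients
        self.n = n

    def estimate(self, z, batch):
        new = np.stack([self.grad_i(i, z) for i in batch])
        old = self.table[batch]
        # unbiased estimator: minibatch correction plus the table average
        grad = new.mean(axis=0) - old.mean(axis=0) + self.avg
        # refresh the table entries and their running average
        self.avg += (new - old).sum(axis=0) / self.n
        self.table[batch] = new
        return grad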

Proof of Lemma 3.2

Proof

If we let \(L=\max \{L_{1}(y_{k}),L_{2}(x_{k+1})\}\), \(\alpha _{1}^{k}=\alpha _{2}^{k}=\alpha _{k}\), \(c_{1}^{k}=c_{2}^{k}=c_{k}\) and \(s_{1}^{k}=s_{2}^{k}=s_{3}^{k}=s_{4}^{k}=M\), then combining these choices with inequality (12) in Lemma 3.1 shows that

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +M_{2}^{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+M_{3}^{k}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2\tau M}\Gamma _{k}. \end{aligned} \end{aligned}$$
(28)

We can rewrite inequality (28) as

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+M_{1}^{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2} +(M_{2}^{k}-\frac{M}{2})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+(M_{3}^{k}+\frac{M}{2})\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+M_{4}^{k}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2} +\frac{1}{2\tau M}\Gamma _{k}\\&\qquad -\frac{M}{2}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}-\frac{M}{2}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

where we know that \(M_{1}^{k}=M_{2}^{k}-\frac{M}{2}=:{\tilde{M}}_{k}\) and \(M_{3}^{k}+\frac{M}{2}=M_{4}^{k}=:{\bar{M}}_{k}\) from the definitions in (13). That is,

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{k}[\Psi (x_{k+1},y_{k+1})+{\tilde{M}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k+1}]\\&\quad \le \Psi (x_{k},y_{k})+{\bar{M}}_{k}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\frac{1}{2\tau M}\Gamma _{k} -\frac{M}{2}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}-\frac{M}{2}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}, \end{aligned} \end{aligned}$$

We can rewrite \({\tilde{M}}_{k}=\frac{1}{2c_{k}\alpha _{k}^{2}}-h_{1}(\alpha _{k})\) and \({\bar{M}}_{k}=\frac{1}{2c_{k}}+h_{2}(\alpha _{k})\), where \(h_{1}(\alpha _{k})\) and \(h_{2}(\alpha _{k})\) are two positive functions of \(\alpha _{k}\). If \({\tilde{M}}_{k}\ge {\bar{M}}_{k}\), then the Lyapunov function \(\Phi _{k}\) is monotonically nonincreasing. Thus we have

$$\begin{aligned} \begin{aligned}&\frac{1}{2c_{k}\alpha _{k}^{2}}-h_{1}(\alpha _{k})\ge \frac{1}{2c_{k}}+h_{2}(\alpha _{k}) \Longrightarrow c_{k}\le \frac{1-\alpha _{k}^{2}}{2\alpha _{k}^{2}(h_{1}(\alpha _{k})+h_{2}(\alpha _{k}))}. \end{aligned} \end{aligned}$$

In short, if we let \(c_{k}\le \frac{1-\alpha _{k}^{2}}{2\alpha _{k}^{2}(h_{1}(\alpha _{k})+h_{2}(\alpha _{k}))}\), we obtain the decrease property. \(\square \)
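As a small illustration (not part of the original proof), the admissible stepsize bound just derived can be evaluated numerically; h1 and h2 stand for the problem-dependent positive functions \(h_{1}\) and \(h_{2}\) above and are passed in as callables:

def max_stepsize(alpha, h1, h2):
    # Largest c_k compatible with the decrease property:
    # c_k <= (1 - alpha^2) / (2 * alpha^2 * (h1(alpha) + h2(alpha)))
    assert 0.0 < alpha < 1.0, "inertial parameter must lie in (0, 1)"
    return (1.0 - alpha**2) / (2.0 * alpha**2 * (h1(alpha) + h2(alpha)))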

Proof of Lemma 3.4

Proof

Since \(\Psi \) is assumed to be bounded below in Assumption 1.1, the auxiliary function \(\Phi \) is also bounded below. Since the sequence \(\{\Phi _{k}\}_{k\in {\mathbb {N}}}\) is monotonically non-increasing in expectation, it converges in expectation to some real number \({\bar{\Phi }}\). From (16) in Lemma 3.2, the following inequality holds for any positive integer N:

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{N-1}{\mathbb {E}}(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}) \le \frac{2}{M}(\Phi _{0}-\Phi _{N}), \end{aligned}\end{aligned}$$

where \(\frac{2}{M}\) is a positive constant. Taking the limit as \(N\rightarrow +\infty \) on the above inequality, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{+\infty }{\mathbb {E}}(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}+\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}) \le \frac{2}{M}(\Phi _{0}-{\bar{\Phi }}) <+\infty . \end{aligned} \end{aligned}$$

Hence we can obtain \(\sum _{k=0}^{+\infty }{\mathbb {E}}[\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}]=\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}<+\infty \), which yields

$$\begin{aligned} \lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert =0 \end{aligned}$$

a.s. From (23) and (24), we know that

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert ^{2}<+\infty \end{aligned}$$

and

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert ^{2}<+\infty . \end{aligned}$$

Thus we can easily get

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert ^{2} =\sum _{k=0}^{+\infty }{\mathbb {E}}(\Vert x_{k+1}-x_{k}\Vert ^{2}+\Vert y_{k+1}-y_{k}\Vert ^{2}) <+\infty , \end{aligned}\end{aligned}$$

which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert =0\) a.s. We also know that

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert {\bar{z}}_{k+1}-{\bar{z}}_{k}\Vert ^{2}&=\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert ({\bar{z}}_{k+1}-z_{k+1})+(z_{k+1}-z_{k})+(z_{k}-{\bar{z}}_{k})\Vert ^{2}\\&\quad \le 3\sum _{k=0}^{+\infty }{\mathbb {E}}[\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}+\Vert z_{k+1}-z_{k}\Vert ^{2}+\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}]\\&\quad <+\infty , \end{aligned} \end{aligned}$$

which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert {\bar{z}}_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. From (4) and (6) in Algorithm 1, it follows that

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k}\Vert ^{2}&=\sum _{k=0}^{+\infty }\frac{1}{(\alpha _{1}^{k})^{2}}{\mathbb {E}}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\sum _{k=0}^{+\infty }\frac{1}{(\alpha _{2}^{k})^{2}}{\mathbb {E}}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}\\&\quad \le \alpha \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}\\&\quad <+\infty , \end{aligned} \end{aligned}$$

where \(\alpha :=\underset{k\in {\mathbb {N}}}{\lim \sup }\,\{\max (\frac{1}{(\alpha _{1}^{k})^{2}},\frac{1}{(\alpha _{2}^{k})^{2}})\}\). Thus we can deduce that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. \(\square \)

Proof of Lemma 3.5

Proof

From the first-order optimality conditions of the proximal steps (3) and (5), we have

$$\begin{aligned} 0\in \partial f(x_{k+1})+\frac{1}{c_{1}^{k}}(x_{k+1}-{\bar{x}}_{k})+{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k}), \end{aligned}$$

and

$$\begin{aligned} 0\in \partial g(y_{k+1})+\frac{1}{c_{2}^{k}}(y_{k+1}-{\bar{y}}_{k})+{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k}). \end{aligned}$$

Combining this with the fact that

$$\begin{aligned} \begin{aligned}&\partial \Psi (x_{k+1},y_{k+1}) =\left( \begin{matrix} \partial _{x}\Psi (x_{k+1},y_{k+1})\\ \partial _{y}\Psi (x_{k+1},y_{k+1}) \end{matrix}\right) = \left( \begin{matrix} \nabla _{x}H(x_{k+1},y_{k+1})+\partial f(x_{k+1})\\ \nabla _{y}H(x_{k+1},y_{k+1})+\partial g(y_{k+1}) \end{matrix}\right) , \end{aligned} \end{aligned}$$

we can know that

$$\begin{aligned} A_{k+1}\in \partial \Psi (z_{k+1}). \end{aligned}$$
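Here, rearranging the two optimality conditions above, the components of \(A_{k+1}=(A_{x}^{k+1},A_{y}^{k+1})\) take the explicit form

$$\begin{aligned} A_{x}^{k+1}&=\nabla _{x}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})-\frac{1}{c_{1}^{k}}(x_{k+1}-{\bar{x}}_{k}),\\ A_{y}^{k+1}&=\nabla _{y}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-\frac{1}{c_{2}^{k}}(y_{k+1}-{\bar{y}}_{k}), \end{aligned}$$

which is the form used in the norm bounds below.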

All that remains is to bound the norm of \(A_{k+1}\). Because the sequence \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is bounded and \(\nabla H\) is M-Lipschitz continuous on bounded sets, we have

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{x}^{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}\Vert \nabla _{x}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +\frac{1}{c_{1}^{k}}{\mathbb {E}}_{k}\Vert {\bar{x}}_{k}-x_{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert \nabla _{x}H(x_{k+1},y_{k+1})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert \\&\qquad +\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ] +\frac{1}{c_{1}^{k}}{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k}\Vert \\&\quad \le (\frac{1}{c_{1}^{k}}+M){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k}\Vert +M{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert +{\mathbb {E}}_{k}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert \\&\quad ={\mathbb {E}}_{k}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}\\&\qquad +\frac{M}{\alpha _{1}^{k}}){\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +\frac{M}{\alpha _{2}^{k}}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert , \end{aligned}$$

and

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{y}^{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}\Vert \nabla _{y}H(x_{k+1},y_{k+1})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +\frac{1}{c_{2}^{k}}{\mathbb {E}}_{k}\Vert {\bar{y}}_{k}-y_{k+1}\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert \nabla _{y}H(x_{k+1},y_{k+1})-\nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ]\\&\qquad +\frac{1}{c_{2}^{k}}{\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert \\&\quad \le (\frac{1}{c_{2}^{k}}+M){\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k}\Vert +M{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +{\mathbb {E}}_{k}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\\&\qquad -{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert \\&\quad ={\mathbb {E}}_{k}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{M}{\alpha _{2}^{k}}){\mathbb {E}}_{k}\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert \\&\qquad +M{\mathbb {E}}_{k}\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert . \end{aligned}$$

Combining the above two inequalities and using (10) to bound the MSE terms, we have

$$\begin{aligned}&{\mathbb {E}}_{k}\Vert A_{k+1}\Vert \\&\quad ={\mathbb {E}}_{k}\Vert (A_{x}^{k+1},A_{y}^{k+1})\Vert \\&\quad \le {\mathbb {E}}_{k}[\Vert A_{x}^{k+1}\Vert +\Vert A_{y}^{k+1}\Vert ]\\&\quad \le {\mathbb {E}}_{k}[(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}+\frac{M}{\alpha _{1}^{k}}+M)\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{2M}{\alpha _{2}^{k}})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ]\\&\qquad +{\mathbb {E}}_{k}[\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert +\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ]\\&\quad \le {\mathbb {E}}_{k}[(\frac{1}{c_{1}^{k}\alpha _{1}^{k}}+\frac{M}{\alpha _{1}^{k}}+M+V_{2})\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +(\frac{1}{c_{2}^{k}\alpha _{2}^{k}}+\frac{2M}{\alpha _{2}^{k}}+V_{2})\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ]\\&\qquad +V_{2}\Vert x_{k}-{\bar{x}}_{k}\Vert +V_{2}\Vert y_{k}-{\bar{y}}_{k}\Vert +\Upsilon _{k}\\&\quad \le \rho _{1}{\mathbb {E}}_{k}[\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert +\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ] +V_{2}(\Vert x_{k}-{\bar{x}}_{k}\Vert +\Vert y_{k}-{\bar{y}}_{k}\Vert )+\Upsilon _{k}\\&\quad \le \sqrt{2}\rho _{1}{\mathbb {E}}_{k}[\sqrt{\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}+\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}}]\\&\qquad +\sqrt{2}V_{2}\sqrt{\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}}+\Upsilon _{k}\\&\quad =\sqrt{2}\rho _{1}{\mathbb {E}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert +\sqrt{2}V_{2}\Vert z_{k}-{\bar{z}}_{k}\Vert +\Upsilon _{k}\\&\quad \le {\hat{\rho }}({\mathbb {E}}_{k}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert +\Vert z_{k}-{\bar{z}}_{k}\Vert )+\Upsilon _{k}. \end{aligned}$$

This completes the proof. \(\square \)

Proof of Lemma 3.6

Proof

(I): From Lemma 3.1 and Lemma 3.2, there exist non-negative constants \(\delta _{1}, \delta _{2}\) and \(\delta _{3}\) such that \(\Phi _{k}\) is monotonically decreasing. Thus we can get that

$$\begin{aligned} {\mathbb {E}}_{k}\Phi _{k+1}+{\mathcal {O}}(\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2})\le \Phi _{k}. \end{aligned}$$

And from Lemma 1.2, the supermartingale convergence theorem implies that the sequence \(\{\Phi _{k}\}\) converges a.s. to a finite positive random variable. We get \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) (that is, \(\Vert x_{k}-{\bar{x}}_{k}\Vert \rightarrow 0\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert \rightarrow 0\)) a.s. from the supermartingale convergence theorem or Lemma 3.4. Since the stochastic gradient estimator \({\tilde{\nabla }}\) is variance-reduced, \({\mathbb {E}}\Gamma _{k}\rightarrow 0\), and it follows that

$$\begin{aligned} \lim _{k\rightarrow +\infty }{\mathbb {E}}\Phi _{k}=\lim _{k\rightarrow +\infty }{\mathbb {E}}\Psi _{k}\in [{\bar{\Psi }},+\infty ). \end{aligned}$$

(II): From Lemma 3.5, we know that

$$\begin{aligned} {\mathbb {E}}\Vert A_{k}\Vert \le {\hat{\rho }}{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert +\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ]+{\mathbb {E}}\Gamma _{k-1}. \end{aligned}$$

We have shown that \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) a.s. and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\). This ensures that \({\mathbb {E}}\Vert A_{k}\Vert \rightarrow 0\); the details follow from Lemma 3.5.

Let the point \(z_{*}\) be a cluster point of the sequence \(\{z_{k}\}\). This means there exists a subsequence \(\{z_{k_{l}}\}\) satisfying \(\lim _{l\rightarrow \infty }z_{k_{l}}=z_{*}\). Since the functions f and g are lower semicontinuous, we have

$$\begin{aligned} \underset{l\rightarrow \infty }{\lim \inf }\,f(x_{k_{l}})\ge f(x_{*})\quad \text {and}\quad \underset{l\rightarrow \infty }{\lim \inf }\,g(y_{k_{l}})\ge g(y_{*}). \end{aligned}$$
(29)

From Algorithm 1, we know

$$\begin{aligned} x_{k+1}\in \underset{x}{\arg \min }\{f(x)+\langle x-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{k}}\Vert x-{\bar{x}}_{k}\Vert ^{2}\}. \end{aligned}$$

If we set \(x=x_{*}\), we have

$$\begin{aligned} \begin{aligned}&f(x_{k+1})+\langle x_{k+1}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle +\frac{1}{2c_{k}}\Vert x_{k+1}-{\bar{x}}_{k}\Vert ^{2}\\&\quad \le f(x_{*})+\langle x_{*}-{\bar{x}}_{k},{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})-\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle \\&\qquad +\frac{1}{2c_{k}}\Vert x_{*}-{\bar{x}}_{k}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k},\nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})\rangle .\\ \end{aligned} \end{aligned}$$

Setting \(k=k_{l}\) leads to

$$\begin{aligned} \begin{aligned}&f(x_{k_{l}+1})+\langle x_{k_{l}+1}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle +\frac{1}{2c_{k_{l}}}\Vert x_{k_{l}+1}-{\bar{x}}_{k_{l}}\Vert ^{2}\\&\quad \le f(x_{*})+\langle x_{*}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \\&\qquad +\frac{1}{2c_{k_{l}}}\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k_{l}},\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle .\\ \end{aligned} \end{aligned}$$

Taking the limit superior as \(l\rightarrow +\infty \), we obtain

$$\begin{aligned} \begin{aligned}&\underset{l\rightarrow +\infty }{\lim \sup }f(x_{k_{l}})\\&\quad \le \underset{l\rightarrow +\infty }{\lim \sup }\{\langle x_{*}-{\bar{x}}_{k_{l}},{\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \\&\qquad +f(x_{*})+\frac{1}{2c_{k_{l}}}\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert ^{2}+\langle x_{*}-{\bar{x}}_{k_{l}},\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rangle \}.\\ \end{aligned} \end{aligned}$$

Since \({\bar{x}}_{k_{l}}\rightarrow x_{*}\) and \(\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\) is bounded, the last term above goes to zero. The first term on the right is bounded above by \(\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert \,\Vert {\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\Vert \) by the Cauchy–Schwarz inequality, and \({\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rightarrow 0\). Then we have

$$\begin{aligned} \underset{l\rightarrow \infty }{\lim \sup }\,\,f(x_{k_{l}})\le f(x_{*}). \end{aligned}$$
(30)

Similarly, we have

$$\begin{aligned} \underset{l\rightarrow +\infty }{\lim \sup }\,\,g(y_{k_{l}})\le g(y_{*}). \end{aligned}$$
(31)

By combining relations (29), (30) and (31), we know that \(f(x_{k_{l}})\rightarrow f(x_{*})\) and \(g(y_{k_{l}})\rightarrow g(y_{*})\) as \(l\rightarrow +\infty \). Together with the continuity of the function H, we can get

$$\begin{aligned} \begin{aligned} \underset{l\rightarrow +\infty }{\lim }\Psi (z_{k_{l}})&=\underset{l\rightarrow +\infty }{\lim }\{H(x_{k_{l}},y_{k_{l}})+f(x_{k_{l}})+g(y_{k_{l}})\}\\&=H(x_{*},y_{*})+f(x_{*})+g(y_{*})\\&=\Psi (x_{*},y_{*}). \end{aligned} \end{aligned}$$

Assertion (II) ensures that \(z_{*}\) is a critical point of \(\Psi \), because \({\mathbb {E}}\text {dist}(0,\partial \Psi (z_{k_{l}}))\rightarrow 0\) as \(l\rightarrow +\infty \) and the graph of \(\partial \Psi \) is closed. Thus this assertion holds. Moreover, there exists a subsequence \(\{{\hat{z}}_{k_{i}}\}_{i\in {\mathbb {N}}}\) of \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) converging to \({\hat{z}}_{*}\) with \({\hat{z}}_{*}\in \omega ({\hat{z}}_{0})\). Thus we have

$$\begin{aligned} \begin{aligned} \underset{i\rightarrow +\infty }{\lim }\Phi ({\hat{z}}_{k_{i}})&=\underset{i\rightarrow +\infty }{\lim }\{H(x_{k_{i}},y_{k_{i}})+f(x_{k_{i}})+g(y_{k_{i}})\\&\quad +\delta _{1}\Vert x_{k_{i}}-{\bar{x}}_{k_{i}}\Vert ^{2} +\delta _{2}\Vert y_{k_{i}}-{\bar{y}}_{k_{i}}\Vert ^{2}+\delta _{3}\Gamma _{k_{i}}\}\\&=H(x_{*},y_{*})+f(x_{*})+g(y_{*})\\&=\Psi (x_{*},y_{*})\\&=\Phi ({\hat{z}}_{*}). \end{aligned} \end{aligned}$$

(III): From Remark 5 in Bolte et al. (2014) or Remark 5.1 in Davis (2016), this claim holds for any sequence satisfying \(\Vert z_{k}-z_{k-1}\Vert \rightarrow 0\) a.s.

(IV): Finally, we take an arbitrary point \(z_{*}\in \omega (z_{0})\); then there exists a subsequence \(\{z_{k_{l}}\}\) such that \(z_{k_{l}}\rightarrow z_{*}\) as \(l\rightarrow +\infty \). We have that \({\mathbb {E}}\Psi (z_{k})\rightarrow \Psi _{*}\), which implies that \({\mathbb {E}}\Psi (z_{k_{l}})\rightarrow \Psi _{*}\). We have shown that \(\Psi (z_{k_{l}})\rightarrow \Psi (z_{*})\), so \({\mathbb {E}}\Psi (z_{*})=\Psi _{*}\) for all \(z_{*}\in \omega (z_{0})\). Hence \(\Psi \) has constant expectation over \(\omega (z_{0})\). \(\square \)

Proof of Theorem 3.1

Proof

If \(\theta \in (0,\frac{1}{2})\), then \(\Psi \) satisfies the KŁ property with exponent \(\frac{1}{2}\), so we only consider the case \(\theta \in [\frac{1}{2},1)\). By the KŁ property in Lemma 3.3, there exists a desingularizing function \(\phi (s)=cs^{1-\theta }\) such that, almost surely, the following inequality holds:

$$\begin{aligned} \phi '({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})]){\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\ge 1,\,\,\forall k>m. \end{aligned}$$

Combining with the bound of \({\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\) in Lemma 3.5, we have

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\\&\quad \le {\mathbb {E}}\Vert A_{k}\Vert \\&\quad \le {\hat{\rho }}{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert +\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ]+{\mathbb {E}}{\Upsilon _{k-1}}\\&\quad \le {\hat{\rho }}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}})+\sqrt{t{\mathbb {E}}\Gamma _{k-1}}, \end{aligned} \end{aligned}$$
(32)

where the final inequality follows from Jensen’s inequality, and we know that \({\Upsilon }_{k}=\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert \) for some vectors \(v_{k}^{i}\) such that

$$\begin{aligned} {\mathbb {E}}{\Upsilon }_{k}={\mathbb {E}}\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert \le {\mathbb {E}}\sqrt{t\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert ^{2}}\le \sqrt{t{\mathbb {E}}\Gamma _{k}}. \end{aligned}$$

Thus the final inequality holds by the Cauchy–Schwarz and Jensen inequalities. From inequality (10) in Definition 2.1, we can bound the term \(\sqrt{{\mathbb {E}}\Gamma _{k}}\) and get

$$\begin{aligned} \begin{aligned}&\sqrt{{\mathbb {E}}\Gamma _{k}}\\&\quad \le \sqrt{(1-\tau ){\mathbb {E}}\Gamma _{k-1}+V_{\Gamma }{\mathbb {E}}[\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}]}\\&\quad \le \sqrt{(1-\tau ){\mathbb {E}}\Gamma _{k-1}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}\\&\quad \le (1-\frac{\tau }{2})\sqrt{{\mathbb {E}}\Gamma _{k-1}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}, \end{aligned} \end{aligned}$$
(33)

where the second and final inequalities follow from \(\sqrt{\sum _{i=1}^{m}\lambda _{i}a_{i}}\le \sum _{i=1}^{m}\sqrt{\lambda _{i}a_{i}}\) (\(\lambda _{i}, a_{i}\ge 0,i=1,\ldots ,m\)) and \(\sqrt{1-\tau }=1-\frac{\tau }{2}-\frac{\tau ^{2}}{8}-\cdots \le 1-\frac{\tau }{2}\), respectively. Thus, from (33), it follows that

$$\begin{aligned} \begin{aligned} \sqrt{t{\mathbb {E}}\Gamma _{k-1}}&\le \frac{2\sqrt{t}}{\tau }(\sqrt{{\mathbb {E}}\Gamma _{k-1}}-\sqrt{{\mathbb {E}}\Gamma _{k}}+\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\\&\quad +\sqrt{V_{\Gamma }}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}). \end{aligned} \end{aligned}$$

Combining the above inequality and (32), we can obtain

$$\begin{aligned}{} & {} {\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\nonumber \\{} & {} \quad \le N_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}) +\frac{2\sqrt{t}}{\tau }(\sqrt{{\mathbb {E}}\Gamma _{k-1}}-\sqrt{{\mathbb {E}}\Gamma _{k}})\nonumber \\{} & {} \quad =:C_{k}, \end{aligned}$$
(34)

where \(N_{1}={\hat{\rho }}+\frac{2\sqrt{tV_{\Gamma }}}{\tau }\). Then we have

$$\begin{aligned} \phi '({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])C_{k}\ge 1,\,\,\forall k>m. \end{aligned}$$

From \(\phi (s)=cs^{1-\theta }\) in Lemma 3.3, it shows that

$$\begin{aligned} \frac{c(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }}\ge 1,\quad \forall k>m. \end{aligned}$$

From the definition of \(C_{k}\) in (34), we have

$$\begin{aligned} C_{k}&\ge N_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}})+\frac{2\sqrt{t}}{\tau }\sqrt{{\mathbb {E}}\Gamma _{k-1}}\\&\quad \ge c_{1}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Gamma _{k-1}}) \end{aligned}$$

for some \(c_{1}>0\). We know that \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\rightarrow 0\) and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\) from Definition 2.1. Combined with \(\theta \ge \frac{1}{2}\), it follows that there exist an index m and positive constants \(c_{2}\) and \({\bar{c}}\) such that

$$\begin{aligned} \begin{aligned}&({\mathbb {E}}[\delta _{1}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\delta _{2}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\delta _{3}\Gamma _{k}])^{\theta }\\&\quad \le {c_{2}}({\mathbb {E}}[\Gamma _{k-1}+\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}+\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}])^{\theta }\\&\quad \le {\bar{c}} C_{k},\quad \forall k>m, \end{aligned} \end{aligned}$$

where the first inequality above follows from (11) in Definition 2.1. Thus there exists a constant \(0<{\bar{c}}<d<+\infty \) such that

$$\begin{aligned} \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }+({\mathbb {E}}[T_{k}])^{\theta }}\ge 1 \end{aligned}$$

for all \(k>m\), where \(T_{k}:=\delta _{1}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\delta _{2}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\delta _{3}\Gamma _{k}\). From the fact that \((a+b)^{\theta }\le a^{\theta }+b^{\theta }\) with \(\theta \in [\frac{1}{2},1)\), we can get

$$\begin{aligned} \begin{aligned} \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Phi (z_{k})-\Psi (z_{k}^{*})])^{\theta }}&=\frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})+T_{k}])^{\theta }}\\&\quad \ge \frac{cd(1-\theta )C_{k}}{({\mathbb {E}}[\Psi (z_{k})-\Psi (z_{k}^{*})])^{\theta }+({\mathbb {E}}[T_{k}])^{\theta }}\\&\quad \ge 1,\quad \forall k>m. \end{aligned} \end{aligned}$$

Therefore, with \(\phi (s)=cds^{1-\theta }\), we have that

$$\begin{aligned} \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])C_{k}\ge 1,\quad \forall k>m. \end{aligned}$$

And from the concavity of \(\phi \), we can get

$$\begin{aligned} \begin{aligned}&\phi ({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])-\phi ({\mathbb {E}}[\Phi _{k+1}-\Psi _{k+1}^{*}])\\&\quad \ge \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}+\Psi _{k+1}^{*}-\Phi _{k+1}])\\&\quad \ge \phi '({\mathbb {E}}[\Phi _{k}-\Psi _{k}^{*}])({\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]), \end{aligned} \end{aligned}$$

where the last inequality follows from the fact that the sequence \(\{\Psi _{k}^{*}\}\) is non-decreasing. Denote \(\triangle _{p,q}:=\phi ({\mathbb {E}}[\Phi _{p}-\Psi _{p}^{*}])-\phi ({\mathbb {E}}[\Phi _{q}-\Psi _{q}^{*}])\), and it shows that

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge {\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]. \end{aligned}$$

From Lemma 3.2, we can bound \({\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]\) below by both \({\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}\) and \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\), then we have

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge K_{1}{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}, \end{aligned}$$

as well as

$$\begin{aligned} \triangle _{k,k+1}C_{k}\ge K_{2}{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}, \end{aligned}$$
(35)

where \(K_{1}\) and \(K_{2}\) are two positive constants; the details of how to choose \(K_{1}\) and \(K_{2}\) follow from Lemma 3.2. We have that

$$\begin{aligned} \begin{aligned} 2\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}&\le 2\sqrt{C_{k}K_{1}^{-1}\triangle _{k,k+1}} \le \frac{C_{k}}{2N_{1}}+\frac{2N_{1}\triangle _{k,k+1}}{K_{1}} \end{aligned}\end{aligned}$$
(36)

where \(N_{1}\) is the constant from (34) (any positive constant would work here, but this choice is used below), and the inequalities follow from the bound \(\triangle _{k,k+1}C_{k}\ge K_{1}{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\) above and Young's inequality, respectively. Summing inequality (36) from \(k=m\) to \({\bar{m}}\), we can get

$$\begin{aligned}&2\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\frac{C_{k}}{2N_{1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}\\&\qquad +\frac{\sqrt{t}}{N_{1}\tau }(\sqrt{{\mathbb {E}}\Gamma _{m-1}}-\sqrt{{\mathbb {E}}\Gamma _{{\bar{m}}}})+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad =\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}}-\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{{\bar{m}}}-{\bar{z}}_{{\bar{m}}}\Vert ^{2}}\\&\qquad +\frac{\sqrt{t}}{N_{1}\tau }(\sqrt{{\mathbb {E}}\Gamma _{m-1}}-\sqrt{{\mathbb {E}}\Gamma _{{\bar{m}}}})+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}\\&\quad \le \sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}. \end{aligned}$$

It also shows that

$$\begin{aligned} \begin{aligned} \sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}} \le \frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{1}}. \end{aligned}\nonumber \\ \end{aligned}$$
(37)

Similarly, applying the same argument to inequality (35) shows that

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}}\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\frac{1}{2}\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}\triangle _{m,{\bar{m}}+1}}{K_{2}}. \end{aligned} \end{aligned}$$
(38)

Combining (37) and (38) together, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}})\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}} +\frac{2\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}\\&\qquad +\frac{2N_{1}(K_{1}+K_{2})\triangle _{m,{\bar{m}}+1}}{K_{1}K_{2}}. \end{aligned} \end{aligned}$$

From Jensen’s inequality, we have

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert +{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ) \le \sum _{k=m}^{{\bar{m}}}(\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}}). \end{aligned} \end{aligned}$$

Thus from the above two inequalities, we can get

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{{\bar{m}}}({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert +{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert )\\&\quad \le \sqrt{{\mathbb {E}}\Vert z_{m}-{\bar{z}}_{m}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{m-1}-{\bar{z}}_{m-1}\Vert ^{2}}\\&\qquad +\frac{2\sqrt{t}}{N_{1}\tau }\sqrt{{\mathbb {E}}\Gamma _{m-1}}+\frac{2N_{1}(K_{1}+K_{2})\triangle _{m,{\bar{m}}+1}}{K_{1}K_{2}}, \end{aligned} \end{aligned}$$

and letting \({\bar{m}}\rightarrow +\infty \), we have

$$\begin{aligned} \sum _{k=m}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert <+\infty . \end{aligned}$$

From (23) and (24), it follows that \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert <+\infty \) and \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert <+\infty \). It is obvious that

$$\begin{aligned} \begin{aligned}&\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert \le \sum _{k=m}^{+\infty }({\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert +{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert ) <+\infty , \end{aligned} \end{aligned}$$

and since the sequence \(\{z_{k}\}\) is bounded, the finitely many initial terms are finite as well; thus we can conclude that

$$\begin{aligned} \sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert <+\infty . \end{aligned}$$

This completes the proof. \(\square \)

Proof of Theorem 3.2

Proof

From the proof of Theorem 3.1, we know that if \(\theta \in (0,1/2)\), then \(\Psi \) satisfies the KŁ property with exponent 1/2, so we only consider the cases \(\theta \in [1/2,1)\) and \(\theta =0\).

  1. (1)

    When \(\theta =0\), the KŁ inequality shows that exactly one of the following two conditions holds:

    1. (a)

      \({\mathbb {E}}\Psi (z_{k})\ne \Psi _{k}^{*}\) and \(0<C\le {\mathbb {E}}\Vert A_{k}\Vert , \forall A_{k}\in \partial \Psi (z_{k})\);

    2. (b)

      \(\Psi (z_{k})=\Psi _{k}^{*}\).

    Based on Theorem 4.8 in Driggs et al. (2021) and Attouch et al. (2010), this case holds.

  2. (2)

    When \(\theta \in (1/2,1)\), from the proof of Theorem 3.1 and Theorem 4.8 in Driggs et al. (2021), we have

    $$\begin{aligned} T_{m}^{\frac{\theta }{1-\theta }}\le t_{1}(T_{m-1}-T_{m}) \end{aligned}$$

    for some positive constant \(t_{1}\), where

    $$\begin{aligned} T_{m}:=\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}+t_{2}\sqrt{{\mathbb {E}}\Gamma _{k-1}} \end{aligned}$$

    which is bounded for all m because \(\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\) is bounded. From the proof of (Attouch and Bolte 2009, Theorem 5) and (Driggs et al. 2021, Theorem 4.8), there exists a \(t_{3}>0\) such that

    $$\begin{aligned} t_{3}\le T_{m}^{\frac{1-2\theta }{1-\theta }}-T_{m-1}^{\frac{1-2\theta }{1-\theta }}. \end{aligned}$$

    Summing this inequality from \(m=M_{1}\) to \(M_{2}\) shows that \((M_{2}-M_{1})t_{3}\le T_{M_{2}}^{\frac{1-2\theta }{1-\theta }}-T_{M_{1}-1}^{\frac{1-2\theta }{1-\theta }}\), which implies \(T_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\) for some constant \(t_{4}\). By Jensen's inequality, we have \(\sum _{k=M_{2}}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert \le T_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\). Combined with the proof of Theorem 3.1, we can say that (3) in Theorem 3.2 holds.

  3. (3)

    When \(\theta =\frac{1}{2}\), by a proof similar to that of (Driggs et al. 2021, Theorem 4.8), we also get the desired result.

Combining the three cases completes the proof. \(\square \)

About this article


Cite this article

Wang, Q., Han, D. Stochastic Gauss–Seidel type inertial proximal alternating linearized minimization and its application to proximal neural networks. Math Meth Oper Res 99, 39–74 (2024). https://doi.org/10.1007/s00186-024-00851-6

