Abstract
In many optimization problems arising from the machine learning, image processing, and statistics communities, the objective functions possess a special form involving huge amounts of data, which encourages the application of stochastic algorithms. In this paper, we study a broad class of nonconvex nonsmooth minimization problems, whose objective function is the sum of a smooth function of all variables and two nonsmooth functions, one for each block of variables. We propose to solve this problem with a stochastic Gauss–Seidel type inertial proximal alternating linearized minimization (SGiPALM) algorithm. We prove that, under the Kurdyka–Łojasiewicz (KŁ) property and some mild conditions, each bounded sequence generated by SGiPALM with a variance-reduced stochastic gradient estimator either globally converges to a critical point after a finite number of iterations or almost surely satisfies the finite-length property. We also apply the SGiPALM algorithm to proximal neural networks (PNNs) with four layers for classification tasks on the MNIST dataset and compare it with other deterministic and stochastic optimization algorithms; the results illustrate the effectiveness of the proposed algorithm.
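For concreteness, and in the notation used in the appendix below (f and g nonsmooth, H smooth), the problems considered are of the composite form
$$\begin{aligned} \min _{x\in {\mathbb {R}}^{m_{1}},\, y\in {\mathbb {R}}^{m_{2}}}\ \Psi (x,y):=f(x)+g(y)+H(x,y), \end{aligned}$$
where H is typically a finite sum or expectation over a large dataset, which is what motivates the use of stochastic (variance-reduced) gradient estimators; this is a restatement of the description above, not an additional assumption.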


Change history
12 March 2024
A Correction to this paper has been published: https://doi.org/10.1007/s00186-024-00855-2
Notes
For \(d \le n\), the (compact) Stiefel manifold is defined as \(\text{ St }(d, n):=\{W\in {\mathbb {R}}^{n\times d}: W^{T}W=I_{d}\}\). For \(d=1\) this reduces to the sphere \({\mathbb {S}}^{n-1}\), and for \(d=n\) we obtain the special orthogonal group \(\text{ SO }(n)\).
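As a concrete illustration (not taken from the paper), the nearest point on \(\text{ St }(d, n)\) to a matrix \(W\in {\mathbb {R}}^{n\times d}\) with full column rank, measured in the Frobenius norm, is its polar factor, computable from a thin SVD; a minimal NumPy sketch:

```python
import numpy as np

def project_to_stiefel(W):
    # Nearest point on St(d, n) to W (n x d) in the Frobenius norm:
    # the polar factor U @ Vt from the thin SVD W = U diag(s) Vt
    # (unique when W has full column rank).
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# quick check: the projection is a Stiefel point, i.e. X^T X = I_d
W = np.random.randn(50, 10)
X = project_to_stiefel(W)
assert np.allclose(X.T @ X, np.eye(10), atol=1e-10)
```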
We know that the ELU function is differentiable with a 1-Lipschitz gradient, i.e., \(\Vert \nabla \sigma (x)-\nabla \sigma (y)\Vert \le \Vert x-y\Vert .\)
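For illustration only, under the common parameterization with \(\alpha =1\) (the paper's exact setup may differ), the ELU activation and its derivative can be written as below, and the 1-Lipschitz property of the gradient can be checked numerically:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: x for x >= 0, alpha*(exp(x) - 1) for x < 0
    return np.where(x >= 0, x, alpha * np.expm1(x))

def elu_grad(x, alpha=1.0):
    # derivative: 1 for x >= 0, alpha*exp(x) for x < 0 (continuous at 0 when alpha = 1)
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

# with alpha = 1, |elu_grad(a) - elu_grad(b)| <= |a - b|, i.e. the gradient is 1-Lipschitz
a, b = np.random.randn(1000), np.random.randn(1000)
assert np.all(np.abs(elu_grad(a) - elu_grad(b)) <= np.abs(a - b) + 1e-12)
```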
In particular, we have that \(\Vert W_{i}\Vert _{F} \le 2\sqrt{d}\), \(i = 1, \dots ,4\) and \(\Vert W_{5}\Vert _{F} \le 20\sqrt{10d}\).
The exact Lipschitz constant is more difficult to obtain when the function H acts on a high-dimensional space, so we compute an approximate Lipschitz constant based on the sampled batch, see Griewank and Walther (2008) for details.
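The exact estimation procedure is not spelled out here, so the following is only a sketch of one common heuristic (not necessarily the authors' method): probe the mini-batch gradient with small random perturbations and take the largest observed difference quotient; `grad_batch` is a hypothetical callable returning the sampled-batch gradient at a point.

```python
import numpy as np

def estimate_lipschitz(grad_batch, x, num_probes=5, eps=1e-3, rng=None):
    # Heuristic local Lipschitz estimate of x -> grad_batch(x):
    # the largest observed ratio ||g(x + d) - g(x)|| / ||d|| over small random probes d.
    rng = np.random.default_rng() if rng is None else rng
    g0 = grad_batch(x)
    L = 0.0
    for _ in range(num_probes):
        d = rng.standard_normal(x.shape)
        d *= eps / (np.linalg.norm(d) + 1e-16)
        L = max(L, np.linalg.norm(grad_batch(x + d) - g0) / np.linalg.norm(d))
    return L

# usage sketch with a quadratic whose gradient is exactly 2-Lipschitz
print(estimate_lipschitz(lambda v: 2.0 * v, np.ones(8)))  # approximately 2.0
```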
References
Absil P-A, Mahony R, Sepulchre R (2008) Optimization algorithms on matrix manifolds. Princeton University Press, Princeton
Attouch H, Bolte J (2009) On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math Program 116(1–2):5–16
Attouch H, Bolte J, Redont P, Soubeyran A (2010) Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Lojasiewicz inequality. Math Oper Res 35(2):438–457
Attouch H, Bolte J, Svaiter BF (2013) Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods. Math Program 137(1–2):91–129
Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494
Bot RI, Csetnek ER, Nguyen D (2019) A proximal minimization algorithm for structured nonconvex and nonsmooth problems. SIAM J Optim 29(2):1300–1328
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: 19th international conference on computational statistics, COMPSTAT, pp 177–186
Cai X, Guo K, Jiang F, Wang K, Wu Z, Han D (2022) The developments of proximal point algorithms. J Oper Res Soc China 10:197–239
Chao M, Han D, Cai X (2021) Convergence of the Peaceman–Rachford splitting method for a class of nonconvex programs. Numer Math Theory Methods Appl 14(2):438–460
Chouzenoux É, Pesquet J, Repetti A (2016) A block coordinate variable metric forward–backward algorithm. J Glob Optim 66(3):457–485
Danilova M, Dvurechensky P, Gasnikov A, Gorbunov E, Guminov S, Kamzolov D, Shibaev I (2020) Recent theoretical advances in non-convex optimization. arXiv:2012.06188
Davis D (2016) The asynchronous PALM algorithm for nonsmooth nonconvex problems. arXiv:1604.00526
Davis D, Edmunds B, Udell M (2016) The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. Adv Neural Inf Process Syst 29:226–234
Defazio A, Bach FR, Lacoste-Julien S (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv Neural Inf Process Syst 27:1646–1654
Driggs D, Tang J, Liang J, Davies ME, Schönlieb C (2021) A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J Imaging Sci 14(4):1932–1970
Gao X, Cai X, Han D (2020) A Gauss–Seidel type inertial proximal alternating linearized minimization for a class of nonconvex optimization problems. J Glob Optim 76(4):863–887
Griewank A, Walther A (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation, 2nd edn. SIAM, Philadelphia
Han D (2022) A survey on some recent developments of alternating direction method of multipliers. J Oper Res Soc China 10(1):1–52
Hasannasab M, Hertrich J, Neumayer S, Plonka G, Setzer S, Steidl G (2020) Parseval proximal neural networks. J Fourier Anal Appl 26(59):1–31
Hertrich J, Steidl G (2022) Inertial stochastic PALM and applications in machine learning. Sampl Theory Signal Process Data Anal 20(4):1–33
Higham NJ (2008) Functions of matrices: theory and computation. SIAM, Philadelphia
Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge
Jia Z, Gao X, Cai X, Han D (2021) Local linear convergence of the alternating direction method of multipliers for nonconvex separable optimization problems. J Optim Theory Appl 188(1):1–25
Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. Adv Neural Inf Process Syst 26:315–323
Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500
Lan G (2020) First-order and stochastic optimization methods for machine learning. Springer, Berlin
Li Z, Li J (2022) Simple and optimal stochastic gradient methods for nonsmooth nonconvex optimization. J Mach Learn Res 23:1–61
Ma Y, Hu X, He T, Jiang X (2020) Clustering and integrating of heterogeneous microbiome data by joint symmetric nonnegative matrix factorization with Laplacian regularization. IEEE/ACM Trans Comput Biol Bioinform 17(3):788–795
Naanaa W, Nuzillard J (2005) Blind source separation of positive and partially correlated data. Signal Process 85(9):1711–1722
Nesterov Y (2018) Lectures on convex optimization, 2nd edn. Springer, Cham
Nguyen LM, Liu J, Scheinberg K, Takáč M (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th international conference on machine learning, pp 2613–2621
Nikolova M, Tan P (2019) Alternating structure-adapted proximal gradient descent for nonconvex nonsmooth block-regularized problems. SIAM J Optim 29(3):2053–2078
Pock T, Sabach S (2016) Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J Imaging Sci 9(4):1756–1787
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407
Robbins H, Siegmund D (1971) A convergence theorem for non-negative almost supermartingales and some applications. In: Optimizing methods in statistics, pp 233–257
Rockafellar RT (1970) Convex analysis. Princeton University Press, Princeton
Rockafellar RT, Wets RJ-B (1998) Variational analysis. Springer, Berlin
Roux NL, Schmidt M, Bach FR (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. Adv Neural Inf Process Syst 25:2672–2680
Wang Q, Han D (2023) A generalized inertial proximal alternating linearized minimization method for nonconvex nonsmooth problems. Appl Numer Math 189:66–87
Wang Q, Cui C, Han D (2023a) Accelerated doubly stochastic gradient descent for tensor CP decomposition. J Optim Theory Appl 197:665–704
Wang Q, Liu Z, Cui C, Han D (2023b) Inertial accelerated SGD algorithms for solving large-scale lower-rank tensor CP decomposition problems. J Comput Appl Math 423:114948
Wen J, Fowler JE, He M, Zhao Y, Deng C, Menon V (2016) Orthogonal nonnegative matrix factorization combining multiple features for spectral-spatial dimensionality reduction of hyperspectral imagery. IEEE Trans Geosci Remote Sens 54(7):4272–4286
Xu Y, Yin W (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J Imaging Sci 6(3):1758–1789
Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716
Acknowledgements
We thank the reviewers and Associate Editor for many helpful suggestions to improve the quality of the paper.
Funding
This work was supported by the National Natural Science Foundation of China (NSFC) Grants 12126603, 11926358, 12131004, and the R&D project of Pazhou Lab (Huangpu) under Grant 2023K0604.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Proof of Lemma 3.1
Proof
The outline of the proof is similar to that of other first-order methods (Han 2022). From Definition 1.1 and the iterative step (3), we have
Letting \(x=x_{k}\), it follows that
or equivalently,
Similarly, using Definition 1.1 and the iterative step (5), and taking \(y=y_{k}\), we obtain
On the other hand, from Lemma 1.1, we have
and
Adding (18), (19), (20) and (21), we get
From Assumption 1.1, \(\nabla H\) is Lipschitz continuous on bounded subsets of \({\mathbb {R}}^{m_{1}}\times {\mathbb {R}}^{m_{2}}\). Since \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is assumed to be bounded, there exists a constant \(M>0\) such that, for any \(s_{2}^{k}>0\) and \(s_{4}^{k}>0\), we have
and
where the inequality follows from Young's inequality (see the note after this proof). Similarly, for any \(s_{1}^{k}>0\) and \(s_{3}^{k}>0\), we obtain
and
From the inertial step (4) in Algorithm 1, we have
Thus we can get
and
Similarly, using the inertial step (6) in Algorithm 1, we also have
and
Now we consider the term \(\frac{1}{2s_{1}^{k}}\Vert \nabla _{x}H({\bar{x}}_{k},{\bar{y}}_{k})-{\tilde{\nabla }}_{x}H({\bar{x}}_{k},{\bar{y}}_{k})\Vert ^{2}+\frac{1}{2s_{3}^{k}}\Vert \nabla _{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})-{\tilde{\nabla }}_{y}H({\bar{x}}_{k+1},{\bar{y}}_{k})\Vert ^{2}\). Applying the conditional expectation operator \({\mathbb {E}}_{k}\), we can bound the MSE terms using (9) in Definition 2.1, which gives
where \(s_{13}^{k}=\min \{s_{1}^{k},s_{3}^{k}\}\). Next, from (11) in Definition 2.1, we can get
Combining the above two inequalities (26) and (27), we obtain
Applying the conditional expectation operator \({\mathbb {E}}_{k}\) to inequality (22), we have
For simplicity, denote by \(M_{1}^{k}\), \(M_{2}^{k}\), \(M_{3}^{k}\) and \(M_{4}^{k}\) the coefficients of \(\Vert x_{k+1}-{\bar{x}}_{k+1}\Vert ^{2}\), \(\Vert y_{k+1}-{\bar{y}}_{k+1}\Vert ^{2}\), \(\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}\) in the above inequality, respectively. We can rewrite the above inequality as
This completes the proof. \(\square \)
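For the reader's convenience, the Young-type inequality invoked in the proof above is the elementary bound (standard background, not one of the paper's numbered relations)
$$\begin{aligned} 2\langle a,b\rangle \le s\Vert a\Vert ^{2}+\frac{1}{s}\Vert b\Vert ^{2}\quad \text {for all vectors } a,b \text { and all } s>0, \end{aligned}$$
which follows from \(0\le \Vert \sqrt{s}\,a-b/\sqrt{s}\Vert ^{2}\); presumably it is applied with the parameters \(s_{1}^{k},\ldots ,s_{4}^{k}\) to split the cross terms.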
Proof of Lemma 3.2
Proof
If we let \(L=\max \{L_{1}(y_{k}),L_{2}(x_{k+1})\}\), \(\alpha _{1}^{k}=\alpha _{2}^{k}=\alpha _{k}\), \(c_{1}^{k}=c_{2}^{k}=c_{k}\) and \(s_{1}^{k}=s_{2}^{k}=s_{3}^{k}=s_{4}^{k}=M\), then combining these choices with inequality (12) in Lemma 3.1 shows that
We can rewrite inequality (28) as
we know that \(M_{1}^{k}=M_{2}^{k}-\frac{M}{2}=:{\tilde{M}}_{k}\) and \(M_{3}^{k}+\frac{M}{2}=M_{4}^{k}=:{\bar{M}}_{k}\) from the definitions in (13). That is,
we can rewrite \({\tilde{M}}_{k}=\frac{1}{2c_{k}\alpha _{k}^{2}}-h_{1}(\alpha _{k})\) and \({\bar{M}}_{k}=\frac{1}{2c_{k}}+h_{2}(\alpha _{k})\), where \(h_{1}(\alpha _{k})\) and \(h_{2}(\alpha _{k})\) are two positive functions of \(\alpha _{k}\). If \({\tilde{M}}_{k}\ge {\bar{M}}_{k}\), then the Lyapunov function \(\Phi _{k}\) is monotonically nonincreasing. Thus we have
In short, if we let \(c_{k}\le \frac{1-\alpha _{k}^{2}}{2\alpha _{k}^{2}(h_{1}(\alpha _{k})+h_{2}(\alpha _{k}))}\), we obtain the descent property. \(\square \)
Proof of Lemma 3.4
Proof
Since \(\Psi \) is assumed to be bounded below in Assumption 1.1, the auxiliary function \(\Phi \) is also bounded below. Since the sequence \(\{\Phi _{k}\}_{k\in {\mathbb {N}}}\) is monotonically nonincreasing in expectation, it converges in expectation to some real number \({\bar{\Phi }}\). From (16) in Lemma 3.2, the following inequality holds for any positive integer N:
where \(\frac{2}{M}\) is a positive constant. Taking the limit as \(N\rightarrow +\infty \) in the above inequality, we have
Hence we obtain \(\sum _{k=0}^{+\infty }{\mathbb {E}}[\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}]=\sum _{k=0}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}<+\infty \), which implies the following result
a.s. From (23) and (24), we know that
and
Thus we can easily get
which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-z_{k}\Vert =0\) a.s. We also know that
which implies that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert {\bar{z}}_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. From (4) and (6) in Algorithm 1, we have
where \(\alpha :=\underset{k\in {\mathbb {N}}}{\lim \sup }\,\{\max (\frac{1}{(\alpha _{1}^{k})^{2}},\frac{1}{(\alpha _{2}^{k})^{2}})\}\). Thus we deduce that \(\lim _{k\rightarrow +\infty }{\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k}\Vert =0\) a.s. \(\square \)
Proof of Lemma 3.5
Proof
From the first-order optimality conditions of the proximal operators (3) and (5), we have
and
Combining this with the fact that
we know that
All that remains is to bound the norm of \(A_{k+1}\). Because the sequence \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) is bounded and \(\nabla H\) is M-Lipschitz continuous on bounded sets, we have
and
Combining the above two inequalities and using (10) to bound the MSE terms, we have
This completes the proof. \(\square \)
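For reference, the supermartingale convergence theorem of Robbins and Siegmund (1971), invoked in the next proof through Lemma 1.2, can be stated in the following common form (the paper's formulation may differ in minor details): if \(X_{k}\), \(a_{k}\), \(b_{k}\), \(c_{k}\) are nonnegative \({\mathcal {F}}_{k}\)-measurable random variables satisfying
$$\begin{aligned} {\mathbb {E}}[X_{k+1}\mid {\mathcal {F}}_{k}]\le (1+a_{k})X_{k}+b_{k}-c_{k},\qquad \sum _{k}a_{k}<+\infty ,\quad \sum _{k}b_{k}<+\infty \ \text { a.s.}, \end{aligned}$$
then \(X_{k}\) converges a.s. to a finite random variable and \(\sum _{k}c_{k}<+\infty \) a.s.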
Proof of Lemma 3.6
Proof
(I): From Lemma 3.1 and Lemma 3.2, there exist some non-negative constants \(\delta _{1}, \delta _{2}\) and \(\delta _{3}\) such that \(\Phi _{k}\) is monotonically decreasing. Thus we get
From Lemma 1.2, the supermartingale convergence theorem implies that the sequence \(\{\Phi _{k}\}\) converges a.s. to a finite positive random variable. We obtain \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) (\(\Vert x_{k}-{\bar{x}}_{k}\Vert \rightarrow 0\) and \(\Vert y_{k}-{\bar{y}}_{k}\Vert \rightarrow 0\)) a.s. from the supermartingale convergence theorem or Lemma 3.4. Since the stochastic gradient estimator \({\tilde{\nabla }}\) is variance-reduced, \({\mathbb {E}}\Gamma _{k}\rightarrow 0\), and it follows that
(II): From Lemma 3.5, we know that
We have shown that \(\Vert z_{k}-{\bar{z}}_{k}\Vert \rightarrow 0\) a.s. and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\). This ensures that \({\mathbb {E}}\Vert A_{k}\Vert \rightarrow 0\); the details follow from Lemma 3.5.
Let \(z_{*}\) be a cluster point of the sequence \(\{z_{k}\}\); then there exists a subsequence \(\{z_{k_{l}}\}\) satisfying \(\lim _{l\rightarrow \infty }z_{k_{l}}=z_{*}\). Since the functions f and g are lower semicontinuous, we have
From Algorithm 1, we know
If we set \(x=x_{*}\), we have
Setting \(k=k_{l}\) leads to
Taking the limit superior as \(l\rightarrow +\infty \) above, we obtain
Since \({\bar{x}}_{k_{l}}\rightarrow x_{*}\) and \(\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\) is bounded, the last term above goes to zero. The first term on the right is bounded above by \(\Vert x_{*}-{\bar{x}}_{k_{l}}\Vert +\Vert {\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\Vert \), and we have \({\tilde{\nabla }}_{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})-\nabla _{x}H({\bar{x}}_{k_{l}},{\bar{y}}_{k_{l}})\rightarrow 0\). Then we have
Similarly, we have
By combining the relations (29), (30) and (31), we know that \(f(x_{k_{l}})\rightarrow f(x_{*})\) and \(g(y_{k_{l}})\rightarrow g(y_{*})\) as \(l\rightarrow +\infty \). Together with the continuity of the function H, we get
Assertion (II) ensures that \(z_{*}\) is a critical point of \(\Psi \) because \({\mathbb {E}}\text {dist}(0,\partial \Psi (z_{*}))\rightarrow 0\) as \(l\rightarrow +\infty \) and \(\partial \Psi (z_{*})\) is closed. Thus this assertion holds. Moreover, it shows that there exists a subsequence \(\{{\hat{z}}_{k_{i}}\}_{i\in {\mathbb {N}}}\) of \(\{{\hat{z}}_{k}\}_{k\in {\mathbb {N}}}\) converging to \({\hat{z}}_{*}\) with \({\hat{z}}_{*}\in \omega ({\hat{z}}_{0})\). Thus we have
(III): From Remark 5 in Bolte et al. (2014) or Remark 5.1 in Davis (2016), this claim holds directly for any sequence satisfying \(\Vert z_{k}-z_{k-1}\Vert \rightarrow 0\) a.s.
(IV): Finally, take an arbitrary point \(z_{*}\in \omega (z_{0})\); then there exists a subsequence \(\{z_{k_{l}}\}\) such that \(z_{k_{l}}\rightarrow z_{*}\) as \(l\rightarrow +\infty \). We have \({\mathbb {E}}\Psi (z_{k})\rightarrow \Psi _{*}\), which implies that \({\mathbb {E}}\Psi (z_{k_{l}})\rightarrow \Psi _{*}\). We have shown that \(\Psi (z_{k_{l}})\rightarrow \Psi (z_{*})\), so \({\mathbb {E}}\Psi (z_{*})=\Psi _{*}\) for all \(z_{*}\in \omega (z_{0})\). Hence \(\Psi \) has constant expectation over \(\omega (z_{0})\). \(\square \)
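Before turning to the proof of Theorem 3.1, we recall, as standard background (not a restatement of Lemma 3.3), the deterministic Kurdyka–Łojasiewicz inequality with desingularizing function \(\phi (s)=cs^{1-\theta }\), whose derivative is \(\phi '(s)=c(1-\theta )s^{-\theta }\): for z near a critical point \(z_{*}\) with \(\Psi (z_{*})<\Psi (z)<\Psi (z_{*})+\eta \),
$$\begin{aligned} \phi '\big (\Psi (z)-\Psi (z_{*})\big )\,\text {dist}\big (0,\partial \Psi (z)\big )\ge 1. \end{aligned}$$
Lemma 3.3 works with a stochastic analogue of this inequality, which is why the estimates below are stated in expectation and almost surely.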
Proof of Theorem 3.1
Proof
If \(\theta \in (0,\frac{1}{2})\), then \(\Psi \) satisfies the KŁ property with exponent \(\frac{1}{2}\), so we only consider the case of \(\theta \in [\frac{1}{2},1)\). By the KŁ property in Lemma 3.3, there exists a desingularizing function \(\phi (s)=cs^{1-\theta }\) a.s., such that the following inequality holds
Combining with the bound of \({\mathbb {E}}\text{ dist }(0,\partial \Psi (z_{k}))\) in Lemma 3.5, we have
where the final inequality follows from Jensen’s inequality, and we know that \({\Upsilon }_{k}=\sum _{i=1}^{t}\Vert v_{k}^{i}\Vert \) for some vectors \(v_{k}^{i}\) such that
Thus the final inequality holds. From the inequality (10) in Definition 2.1, we can bound the term \(\sqrt{{\mathbb {E}}\Gamma _{k}}\) and get
where the second and final inequalities follow from \(\sqrt{\sum _{i=1}^{m}\lambda _{i}a_{i}}\le \sum _{i=1}^{m}\sqrt{\lambda _{i}a_{i}}\) (\(\lambda _{i}, a_{i}\ge 0,i=1,\ldots ,m\)) and \(\sqrt{1-\tau }=1-\frac{\tau }{2}-\frac{\tau ^{2}}{8}-\cdots \), respectively. Thus, from (33), we have
Combining the above inequality and (32), we can obtain
where \(N_{1}={\hat{\rho }}+\frac{2\sqrt{tV_{\Gamma }}}{\tau }\). Then we have
From \(\phi (s)=cs^{1-\theta }\) in Lemma 3.3, we have
From the definition of \(C_{k}\) in (34), we have
for some \(c_{1}>0\). We also know that \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\rightarrow 0\) and \({\mathbb {E}}\Gamma _{k-1}\rightarrow 0\) from Definition 2.1. Combined with \(\theta \ge \frac{1}{2}\), this shows that there exist an index m and positive constants \(c_{2}\) and \({\bar{c}}\) such that
where the first inequality above follows from (11) in Definition 2.1. Thus there exists a constant \(0<{\bar{c}}<d<+\infty \) such that
for all \(k>m\), where \(T_{k}:=\delta _{1}\Vert x_{k}-{\bar{x}}_{k}\Vert ^{2}+\delta _{2}\Vert y_{k}-{\bar{y}}_{k}\Vert ^{2}+\delta _{3}\Gamma _{k}\). From the fact that \((a+b)^{\theta }\le a^{\theta }+b^{\theta }\) with \(\theta \in [\frac{1}{2},1)\), we can get
Therefore, with \(\phi (s)=cds^{1-\theta }\), we have that
From the concavity of \(\phi \), we get
where the last inequality follows from the fact that the sequence \(\{\Psi _{k}^{*}\}\) is non-decreasing. Denote \(\triangle _{p,q}:=\phi ({\mathbb {E}}[\Phi _{p}-\Psi _{p}^{*}])-\phi ({\mathbb {E}}[\Phi _{q}-\Psi _{q}^{*}])\); then
From Lemma 3.2, we can bound \({\mathbb {E}}[\Phi _{k}-\Phi _{k+1}]\) from below by both \({\mathbb {E}}\Vert z_{k+1}-{\bar{z}}_{k+1}\Vert ^{2}\) and \({\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}\); then we have
as well as
where \(K_{1}\) and \(K_{2}\) are two positive constants. The details of how to choose \(K_{1}\) and \(K_{2}\) are given in Lemma 3.2. We have
with any positive constant \(N_{1}\), where the inequalities follow from Young’s inequality and (35), respectively. Summing the above inequality (36) from \(k=m\) to \({\bar{m}}\), we get
It also shows that
Similarly, applying the same argument to inequality (35), we obtain
Combining (37) and (38) together, we have
From Jensen’s inequality, we have
Thus from the above two inequalities, we can get
and letting \({\bar{m}}\rightarrow +\infty \), we have
From (23) and (24), it follows that \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert x_{k+1}-x_{k}\Vert <+\infty \) and \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert y_{k+1}-y_{k}\Vert <+\infty \). It is obvious that
since the sequence \(\{z_{k}\}\) is bounded; thus we can conclude that
This completes the proof. \(\square \)
Proof of Theorem 3.2
Proof
From the proof of Theorem 3.1, we know that if \(\theta \in (0,1/2)\), then \(\Psi \) satisfies the KŁ property with exponent 1/2, so we only consider the cases \(\theta \in [1/2,1)\) and \(\theta =0\).
(1) When \(\theta =0\), from the KŁ inequality, exactly one of the following two conditions holds:
(a) \({\mathbb {E}}\Psi (z_{k})\ne \Psi _{k}^{*}\) and \(0<C\le {\mathbb {E}}\Vert A_{k}\Vert , \forall A_{k}\in \partial \Psi (z_{k})\);
(b) \(\Psi (z_{k})=\Psi _{k}^{*}\).
Since \({\mathbb {E}}\Vert A_{k}\Vert \rightarrow 0\) by Lemma 3.6, case (a) cannot hold for all sufficiently large k; hence, based on Theorem 4.8 in Driggs et al. (2021) and Attouch et al. (2010), case (b) holds after a finite number of iterations and the claim follows.
(2)
When \(\theta \in (1/2,1)\), from the proof of Theorem 3.1 and Theorem 4.8 in Driggs et al. (2021), we have
$$\begin{aligned} T_{m}^{\frac{\theta }{1-\theta }}\le t_{1}(T_{m-1}-T_{m}) \end{aligned}$$for some positive constant \(t_{1}\), where
$$\begin{aligned} T_{m}:=\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}+\sqrt{{\mathbb {E}}\Vert z_{k-1}-{\bar{z}}_{k-1}\Vert ^{2}}+t_{2}\sqrt{{\mathbb {E}}\Gamma _{k-1}} \end{aligned}$$which is bounded for all m because \(\sum _{k=m}^{+\infty }\sqrt{{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\Vert ^{2}}\) is bounded. From the proof of (Attouch and Bolte 2009, Theorem 5) and (Driggs et al. 2021, Theorem 4.8), there exists a \(t_{3}>0\) such that
$$\begin{aligned} t_{3}\le T_{m}^{\frac{1-2\theta }{1-\theta }}-T_{m-1}^{\frac{1-2\theta }{1-\theta }}. \end{aligned}$$Summing this inequality from \(m=M_{1}\) to \(M_{2}\), it shows that \((M_{2}-M_{1})t_{3}\le T_{M_{2}}^{\frac{1-2\theta }{1-\theta }}-T_{M_{1}-1}^{\frac{1-2\theta }{1-\theta }}\), this implies \(T_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\) for some constant \(t_{4}\). By Jensen’s inequality, we have \(\sum _{k=m}^{+\infty }{\mathbb {E}}\Vert z_{k}-{\bar{z}}_{k}\le S_{M_{2}}\le t_{4}M_{2}^{\frac{1-\theta }{1-2\theta }}\). Combined with the proof of Theorem 3.1, we can say (3) in Theorem 3.2 holds.
(3) When \(\theta =\frac{1}{2}\), by an argument similar to the proof of (Driggs et al. 2021, Theorem 4.8), we also obtain the desired result.
To sum up, this completes the proof. \(\square \)
About this article
Cite this article
Wang, Q., Han, D. Stochastic Gauss–Seidel type inertial proximal alternating linearized minimization and its application to proximal neural networks. Math Meth Oper Res 99, 39–74 (2024). https://doi.org/10.1007/s00186-024-00851-6