Abstract
Solutions of learning problems by Empirical Risk Minimization (ERM) – and almost-ERM when the minimizer does not exist – need to be consistent, so that they may be predictive. They also need to be well-posed in the sense of being stable, so that they can be used robustly. We propose a statistical form of stability, defined as leave-one-out (LOO) stability. We prove that, for bounded loss classes, LOO stability is (a) sufficient for generalization, that is, convergence in probability of the empirical error to the expected error, for any algorithm satisfying it, and (b) necessary and sufficient for consistency of ERM. Thus LOO stability is a weak form of stability that represents a sufficient condition for generalization for symmetric learning algorithms while subsuming the classical conditions for consistency of ERM. In particular, we conclude that a certain form of well-posedness and consistency are equivalent for ERM.
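To make the notion concrete, here is a minimal sketch of the cross-validation leave-one-out condition underlying LOO stability; the notation is assumed for illustration rather than quoted from the paper. Take $S = \{z_1, \dots, z_n\}$ to be the training set drawn i.i.d. from an unknown distribution, $f_S$ the hypothesis returned by the symmetric learning algorithm on $S$, $V$ the bounded loss, and $S^i$ the set $S$ with the point $z_i$ removed:

\[
% Sketch only: the paper's full LOO stability pairs this cross-validation
% condition with a companion condition on the leave-one-out error.
\bigl|\, V(f_{S}, z_i) - V(f_{S^i}, z_i) \,\bigr| \;\le\; \beta_{CV}
\quad \text{with probability at least } 1 - \delta_{CV},
\qquad i = 1, \dots, n,
\]

where $\beta_{CV} \to 0$ and $\delta_{CV} \to 0$ as $n \to \infty$. Read this way, the result says that controlling how much the loss at a held-out point changes when that point is removed from training suffices for generalization, and for ERM is exactly equivalent to consistency.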
Additional information
Communicated by Y. Xu
Dedicated to Charles A. Micchelli on his 60th birthday
Mathematics Subject Classification (2000)
68T05, 68T10, 68Q32, 62M20.
Tomaso Poggio: Corresponding author.
Cite this article
Mukherjee, S., Niyogi, P., Poggio, T. et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv Comput Math 25, 161–193 (2006). https://doi.org/10.1007/s10444-004-7634-z