Abstract
The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of “quasi-additive” algorithms, which includes Perceptron and Winnow as special cases. We give a single proof of convergence that covers a broad subset of algorithms in this class, including not only Perceptron and Winnow but also many new algorithms. Our proof hinges on analyzing a generic measure-of-progress construction that gives insight into when and how such algorithms converge.
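To make the quasi-additive template concrete, the following minimal sketch (Python; not taken from the paper) maintains an additive accumulator z and predicts with a weight vector obtained by applying a link function f componentwise. Taking f to be the identity gives a Perceptron-style additive rule, while f = exp gives a Winnow-style multiplicative rule. The function names, learning rate, and toy data are illustrative assumptions, and the normalization and threshold details used by the actual algorithms are omitted.

```python
import numpy as np

def quasi_additive_train(examples, f, eta=1.0, n_epochs=5):
    """Mistake-driven quasi-additive learner (illustrative sketch).

    Keeps an additive state z and predicts with w = f(z), applied
    componentwise.  f = identity behaves like Perceptron; f = exp behaves
    like (unnormalized) Winnow.  `examples` holds (x, y) pairs, y in {-1, +1}.
    """
    dim = len(examples[0][0])
    z = np.zeros(dim)                      # accumulated additive state
    mistakes = 0
    for _ in range(n_epochs):
        for x, y in examples:
            w = f(z)                       # current linear discriminant
            if y * np.dot(w, x) <= 0:      # update only when a mistake occurs
                z += eta * y * x           # the same additive step for every f
                mistakes += 1
    return f(z), mistakes

# Toy data (hypothetical): the label is the sign of the first coordinate.
data = [(np.array([1.0, 1.0]), 1), (np.array([1.0, -1.0]), 1),
        (np.array([-1.0, 1.0]), -1), (np.array([-1.0, -1.0]), -1)]

w_perc, m_perc = quasi_additive_train(data, f=lambda z: z)   # Perceptron-like
w_winn, m_winn = quasi_additive_train(data, f=np.exp)        # Winnow-like
```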
Our measure-of-progress construction also lets us obtain good mistake bounds for individual algorithms. We apply our unified analysis to new algorithms as well as existing ones. When applied to known algorithms, our method “automatically” produces close variants of existing proofs (recovering similar bounds), showing that, in a certain sense, these seemingly diverse results are fundamentally isomorphic. However, we also demonstrate that the unifying principles apply more broadly, and we analyze a new family of algorithms that smoothly interpolates between the additive-update behavior of Perceptron and the multiplicative-update behavior of Winnow.
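One way such an interpolation can be realized, loosely in the spirit of a p-norm-style transfer function, is sketched below; it reuses the quasi_additive_train routine and toy data from the sketch above, with the exponent p acting as the interpolation knob. The normalization terms used in published p-norm algorithms are omitted, so this is only a rough illustration under those simplifying assumptions.

```python
# Hypothetical p-norm-style link function: p = 2 gives the identity
# (Perceptron-like behaviour), while larger p increasingly favours the
# coordinates with the largest |z_i|, moving towards Winnow-like behaviour.
def p_norm_transfer(p):
    return lambda z: np.sign(z) * np.abs(z) ** (p - 1)

w_mid, m_mid = quasi_additive_train(data, f=p_norm_transfer(4.0))  # between the extremes
```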
Cite this article
Grove, A.J., Littlestone, N. & Schuurmans, D. General Convergence Results for Linear Discriminant Updates. Machine Learning 43, 173–210 (2001). https://doi.org/10.1023/A:1010844028087