


Front. Electr. Electron. Eng. China
DOI 10.1007/s11460-010-0021-2

Shun-ichi AMARI

Information geometry in optimization, machine learning and statistical inference

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2010

Abstract  The present article gives an introduction to information geometry and surveys its applications in the area of machine learning, optimization and statistical inference. Information geometry is explained intuitively by using divergence functions introduced in a manifold of probability distributions and other general manifolds. They give a Riemannian structure together with a pair of dual flatness criteria. Many manifolds are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and related projection theorem are introduced. They provide useful means for various approximation and optimization problems. We apply them to alternative minimization problems, Ying-Yang machines and the belief propagation algorithm in machine learning.

Keywords  information geometry, machine learning, optimization, statistical inference, divergence, graphical model, ying-yang machine

Received January 15, 2010; accepted February 5, 2010

Shun-ichi AMARI
RIKEN Brain Science Institute, Saitama 351-0198, Japan
E-mail: amari@brain.riken.jp

1 Introduction

Information geometry [1] deals with a manifold of probability distributions from the geometrical point of view. It studies the invariant structure by using the Riemannian geometry equipped with a dual pair of affine connections. Since probability distributions are used in many problems in optimization, machine learning, vision, statistical inference, neural networks and others, information geometry provides a useful and strong tool to many areas of information sciences and engineering.

Many researchers in these fields, however, are not familiar with modern differential geometry. The present article intends to give an understandable introduction to information geometry without modern differential geometry. Since underlying manifolds in most applications are dually flat, the dually flat structure plays a fundamental role. We explain the fundamental dual structure and related dual geodesics without using the concept of affine connections and covariant derivatives.

We begin with a divergence function between two points in a manifold. When it satisfies an invariance criterion of information monotonicity, it gives a family of f-divergences [2]. When a divergence is derived from a convex function in the form of Bregman divergence [3], this gives another type of divergence, where the Kullback-Leibler divergence belongs to both of them. We derive a geometrical structure from a divergence function [4]. The Fisher information Riemannian structure is derived from an invariant divergence (f-divergence) (see Refs. [1,5]), while the dually flat structure is derived from the Bregman divergence (convex function).

The manifold of all discrete probability distributions is dually flat, where the Kullback-Leibler divergence plays a key role. We give the generalized Pythagorean theorem and projection theorem in a dually flat manifold, which plays a fundamental role in applications. Such a structure is not limited to a manifold of probability distributions, but can be extended to the manifolds of positive arrays, matrices and visual signals, and will be used in neural networks and optimization problems.

After introducing basic properties, we show three areas of applications. One is application to the alternative minimization procedures such as the expectation-maximization (EM) algorithm in statistics [6–8]. The second is an application to the Ying-Yang machine introduced and extensively studied by Xu [9–14]. The third one is application to the belief propagation algorithm of stochastic reasoning in machine learning or artificial intelligence [15–17]. There are many other applications in analysis of spiking patterns of the brain, neural networks, the boosting algorithm of machine learning, as well as a wide range of statistical inference, which we do not mention here.
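Before moving on, here is a minimal numerical illustration of the remark above that the Kullback-Leibler divergence belongs to both families: it is the f-divergence with f(t) = t log t and also the Bregman divergence generated by the negative entropy. The sketch assumes only NumPy; the helper names are illustrative and not taken from the paper.

```python
import numpy as np

def kl_direct(p, q):
    """Kullback-Leibler divergence D_KL[p : q] of two discrete distributions."""
    return np.sum(p * np.log(p / q))

def f_divergence(p, q, f):
    """f-divergence D_f[p : q] = sum_i q_i f(p_i / q_i)."""
    return np.sum(q * f(p / q))

def bregman(z, w, phi, grad_phi):
    """Bregman divergence D_phi[z : w] = phi(z) - phi(w) - <grad phi(w), z - w>."""
    return phi(z) - phi(w) - np.dot(grad_phi(w), z - w)

def neg_entropy(z):            # convex generator: sum_i (z_i log z_i - z_i)
    return np.sum(z * np.log(z) - z)

def grad_neg_entropy(z):
    return np.log(z)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print(kl_direct(p, q))                                  # direct formula
print(f_divergence(p, q, lambda t: t * np.log(t)))      # as an f-divergence, f(t) = t log t
print(bregman(p, q, neg_entropy, grad_neg_entropy))     # as a Bregman divergence
# The three numbers agree (up to floating-point rounding).
```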

2 Divergence function and information geometry

2.1 Manifold of probability distributions and positive arrays

We introduce divergence functions in various spaces or manifolds. To begin with, we show typical examples of manifolds of probability distributions. A one-dimensional Gaussian distribution with mean μ and variance σ² is represented by its probability density function

    p(x; μ, σ) = (1/(√(2π) σ)) exp{−(x − μ)² / (2σ²)}.    (1)

It is parameterized by a two-dimensional parameter ξ = (μ, σ). Hence, when we treat all such Gaussian distributions, not a particular one, we need to consider the set S_G of all the Gaussian distributions. It forms a two-dimensional manifold

    S_G = {p(x; ξ)},    (2)

where ξ = (μ, σ) is a coordinate system of S_G. This is not the only coordinate system. It is possible to use other parameterizations or coordinate systems when we study S_G.
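To make the idea of different coordinate systems on S_G concrete, here is a minimal sketch (assuming NumPy; the function names and the choice of the exponential-family natural parameters as the alternative chart are illustrative, not taken from the paper). It evaluates the density (1) and checks that two different coordinate labels describe one and the same point of S_G.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density p(x; mu, sigma) of Eq. (1)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def to_natural(mu, sigma):
    """An alternative chart: exponential-family natural parameters (mu/sigma^2, -1/(2 sigma^2))."""
    return mu / sigma ** 2, -1.0 / (2.0 * sigma ** 2)

def from_natural(theta1, theta2):
    """Inverse map back to xi = (mu, sigma)."""
    sigma2 = -1.0 / (2.0 * theta2)
    return theta1 * sigma2, np.sqrt(sigma2)

xi = (0.5, 2.0)                 # a point of S_G in (mu, sigma) coordinates
theta = to_natural(*xi)         # the same point in natural coordinates
xi_back = from_natural(*theta)

x = np.linspace(-5.0, 5.0, 7)
print(np.allclose(gaussian_pdf(x, *xi), gaussian_pdf(x, *xi_back)))  # True:
# both coordinate labels pick out the same distribution, i.e., the same point of S_G.
```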
We show another example. Let x be a discrete random variable taking values on a finite set X = {0, 1, ..., n}. Then, a probability distribution is specified by a vector p = (p_0, p_1, ..., p_n), where

    p_i = Prob{x = i}.    (3)

We may write

    p(x; p) = Σ_i p_i δ_i(x),    (4)

where

    δ_i(x) = { 1, x = i;  0, x ≠ i }.    (5)

Since p is a probability vector, we have

    Σ_i p_i = 1,    (6)

and we assume

    p_i > 0.    (7)

The set of all the probability distributions is denoted by

    S_n = {p},    (8)

which is an n-dimensional simplex because of (6) and (7). When n = 2, S_n is a triangle (Fig. 1). S_n is an n-dimensional manifold, and ξ = (p_1, ..., p_n) is a coordinate system. There are many other coordinate systems. For example,

    θ_i = log(p_i / p_0),  i = 1, ..., n,    (9)

is an important coordinate system of S_n, as we will see later.

[Fig. 1  Manifold S_2 of discrete probability distributions]
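The change of coordinates in Eq. (9) and its inverse can be written down explicitly. A small sketch (assuming NumPy; the function names are my own) converting a point of S_n between the p-coordinates and the θ-coordinates:

```python
import numpy as np

def p_to_theta(p):
    """theta_i = log(p_i / p_0), i = 1, ..., n  (Eq. (9))."""
    return np.log(p[1:] / p[0])

def theta_to_p(theta):
    """Inverse map: p_i proportional to exp(theta_i), with theta_0 = 0."""
    unnorm = np.concatenate(([1.0], np.exp(theta)))   # exp(theta_0) = 1
    return unnorm / unnorm.sum()                      # enforce sum_i p_i = 1

p = np.array([0.2, 0.5, 0.3])              # a point of S_2 (the triangle of Fig. 1)
theta = p_to_theta(p)                      # the same point in theta-coordinates
print(theta)                               # [log 2.5, log 1.5]
print(np.allclose(theta_to_p(theta), p))   # True: the two charts agree
```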
The third example deals with positive measures, not probability measures. When we disregard the constraint Σ_i p_i = 1 of (6) in S_n, keeping p_i > 0, p is regarded as an (n + 1)-dimensional positive array, or a positive measure where x = i has measure p_i. We denote the set of positive measures or arrays by

    M_{n+1} = {z, z_i > 0; i = 0, 1, ..., n}.    (10)

This is an (n + 1)-dimensional manifold with a coordinate system z. S_n is its submanifold derived by the linear constraint Σ_i z_i = 1.

In general, we can regard any regular statistical model

    S = {p(x, ξ)}    (11)

parameterized by ξ as a manifold with a (local) coordinate system ξ. It is a space M of positive measures, when the constraint ∫ p(x, ξ) dx = 1 is discarded. We may treat any other types of manifolds and introduce dual structures in them. For example, we will consider a manifold consisting of positive-definite matrices.
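The relation between M_{n+1} and its submanifold S_n is simply normalization; a tiny sketch (assuming NumPy) of a positive array and the corresponding point of the simplex:

```python
import numpy as np

z = np.array([2.0, 5.0, 3.0])   # a point of M_3: all entries positive, but sum != 1
print(z.sum())                   # 10.0, so z itself is not in S_2

p = z / z.sum()                  # normalization carries the positive array onto the simplex
print(p, p.sum())                # [0.2 0.5 0.3] 1.0  -> a point of S_2
# S_n sits inside M_{n+1} as the slice cut out by the linear constraint sum_i z_i = 1.
```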
2.2 Divergence function and geometry

We consider a manifold S having a local coordinate system z = (z_i). A function D[z : w] between two points z and w of S is called a divergence function when it satisfies the following two properties:

1) D[z : w] ≥ 0, with equality when and only when z = w.

2) When the difference between w and z is infinitesimally small, we may write w = z + dz, and Taylor expansion gives

    D[z : z + dz] = Σ_{i,j} g_ij(z) dz_i dz_j,    (12)

where

    g_ij(z) = (∂² / ∂z_i ∂z_j) D[z : w] |_{w=z}    (13)

is a positive-definite matrix.

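As an illustration of Eq. (13), the following sketch (assuming NumPy; the finite-difference helper and step size are my own choices, not from the paper) numerically builds g_ij(z) for the Kullback-Leibler divergence extended to positive arrays and checks that it is positive definite. For this divergence one expects g_ij(z) ≈ δ_ij / z_i (cf. the Fisher information structure mentioned in the introduction).

```python
import numpy as np

def kl(z, w):
    """Extended KL divergence on positive arrays: sum_i (z_i log(z_i/w_i) - z_i + w_i)."""
    return np.sum(z * np.log(z / w) - z + w)

def metric_from_divergence(D, z, h=1e-4):
    """g_ij(z) = d^2/dz_i dz_j D[z : w] at w = z  (Eq. (13)), by central differences."""
    n = len(z)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            g[i, j] = (D(z + h*ei + h*ej, z) - D(z + h*ei - h*ej, z)
                       - D(z - h*ei + h*ej, z) + D(z - h*ei - h*ej, z)) / (4.0 * h * h)
    return g

z = np.array([0.2, 0.5, 0.3])
g = metric_from_divergence(kl, z)
print(np.round(g, 3))                        # approximately diag(1/z_i)
print(np.all(np.linalg.eigvalsh(g) > 0))     # True: g_ij(z) is positive definite
```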