Prob Level Sets
1 Introduction
The study of distances between probability measures/distributions (PMs) is attracting increasing attention in the fields of Data Analysis and Pattern Recognition. Classical examples are homogeneity, independence and goodness-of-fit testing problems between populations, where the general aim is to determine whether data generated from different random samples come from the same population or not. These problems can be solved by choosing an appropriate distance between PMs. Further examples can be found in Clustering [5,25], Image Analysis [9,10], Time Series Analysis [18,13], Econometrics [24,17] and Text Mining [15,7], to name just a few.
For a review of interesting distances between probability distributions and related theoretical results, see for instance [26,4] and the references therein. Nonparametric estimators often play a role in estimating such distances. In practical situations only a (not huge) data sample is usually available, and the use of purely nonparametric estimators often results in poor performance [11].
An appealing point of view, initiated by Fisher and Rao [6,1,3] and continued with the recent development of Functional Data Analysis and Information Geometry methods (e.g. [20,2]), is to consider probability distributions as points belonging to some manifold, and then take advantage of the manifold structure to derive appropriate metrics for distributions.
In this work we elaborate on the idea of considering PMs as points in a functional space endowed with an inner product, and then obtain different distances for PMs from the metric structure derived from the ambient inner product. We propose a particular instance of such metrics for generalized functions based on
the estimation of density level set regions. The article is organized as follows: in Section 2 we present probability measures as generalized functions and then define general distances acting on the Schwartz distribution space. Section 3 presents a new distance built according to this point of view. Section 4 illustrates the theory with some simulated and real data sets.
2 Probability Measures as Schwartz Distributions

A PM $P$ with density function $f$ acts on the space $\mathcal{D}$ of test functions $\phi: X \to \mathbb{R}$, with $X \subseteq \mathbb{R}^d$ compact, by defining $P(\phi) = \langle P, \phi \rangle = \int \phi \, dP = \int \phi(x) f(x) \, d\mu(x) = \langle \phi, f \rangle$. Thus, the probability distribution is identified with a linear functional obeying the Riesz representer theorem: the representer for $P$ is its density function $f$: $P(\phi) = \langle \phi, f \rangle$.
In particular, the familiar condition $P(X) = 1$ is equivalent to $\langle P, \mathbb{1}_{[X]} \rangle = 1$, where the function $\mathbb{1}_{[X]}$ belongs to $\mathcal{D}$, $X$ being compact. Note that we do not need $f \in \mathcal{D}$; only the integral $\langle \phi, f \rangle$ should be properly defined for every $\phi \in \mathcal{D}$.
A probability measure/distribution is thus a continuous linear functional acting on a given function space. Two given linear functionals $P_1$ and $P_2$ will be the same (or similar) if they act identically (or similarly) on every $\phi \in \mathcal{D}$. For instance, if we choose $\phi = \mathrm{Id}$, then $P_1(\phi) = \langle P_1, x \rangle = \int x \, dP_1 = E_{P_1}[X]$, and if $P_1$ and $P_2$ are similar then $\mu_1 = E_{P_1}[X] \approx E_{P_2}[X] = \mu_2$, because $P_1$ and $P_2$ are continuous. Similar arguments apply for the variance (take $\phi(x) = (x - E[X])^2$) and in general for higher-order moments. For $\phi(x) = e^{i\omega x}$, $\omega \in \mathbb{R}$, we obtain the Fourier transform of the probability measure (called the characteristic function in Statistics), given by $\widetilde{P}(\omega) = \langle P, e^{i\omega x} \rangle = \int e^{i\omega x} \, dP(x)$.
Thus, two PMs can be identified with their action as functionals on the test functions and hence distances between two distributions can be defined from the differences between functional evaluations for appropriately chosen test functions.
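Before introducing level sets, a toy numerical illustration of this functional viewpoint may help. The following Python sketch (the sample sizes, distributions and all names are our own, purely for illustration) compares two one-dimensional samples through their empirical action on a few test functions:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_p = rng.normal(0.0, 1.0, size=2000)   # sample from P
sample_q = rng.normal(0.3, 1.0, size=2000)   # sample from Q

def functional(sample, phi):
    """Monte Carlo estimate of <P, phi> = E_P[phi(X)]."""
    return np.mean(phi(sample))

test_functions = {
    "identity (mean)":      lambda x: x,
    "square (2nd moment)":  lambda x: x**2,
    "char. function, w=1":  lambda x: np.exp(1j * 1.0 * x),
}

# two PMs are similar if their actions on the test functions are close
for name, phi in test_functions.items():
    gap = abs(functional(sample_p, phi) - functional(sample_q, phi))
    print(f"|<P,phi> - <Q,phi>| for {name}: {gap:.4f}")
```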
3 A New Distance Based on Level Sets

Following the preceding discussion, we can define a distance between two PMs $P$ and $Q$ by choosing a family of test functions $\{\phi_i\}_{i \in I}$ and computing $d(\langle P, \phi_i \rangle, \langle Q, \phi_i \rangle)$ for $i \in I$, where $d$ is some distance function. Our test functions will be indicator functions of $\alpha$-level sets, introduced below.
Given a PM $P$ with density function $f_P$, minimum volume sets (or $\alpha$-level sets) are defined by $S_\alpha(f_P) = \{x \in X \mid f_P(x) \geq \alpha\}$, such that $P(S_\alpha(f_P)) = 1 - \nu$, where $0 < \nu < 1$. If we consider an ordered sequence $\alpha_1 < \ldots < \alpha_n$, $\alpha_i \in (0, 1)$, then $S_{\alpha_{i+1}}(f_P) \subseteq S_{\alpha_i}(f_P)$. Let us define $A_i(P) = S_{\alpha_i}(f_P) - S_{\alpha_{i+1}}(f_P)$, $i \in \{1, \ldots, n-1\}$. We can choose $\alpha_1 \approx 0$ and $\alpha_n \approx \max_{x \in X} f_P(x)$ (which exists, given that $X$ is compact and $f_P$ continuous); then $\bigcup_i A_i(P) \approx \mathrm{Supp}(P) = \{x \in X \mid f_P(x) \neq 0\}$ (equality holds in the limit when $n \to \infty$, $\alpha_1 \to 0$ and $\alpha_n \to \max_{x \in X} f_P(x)$). Given the definition of the $A_i$, if $A_i(P) = A_i(Q)$ for every $i$ when $n \to \infty$, then $P = Q$.
Thus, taking $\phi_i = \mathbb{1}_{[A_i]}$, our choice is
$$d\big(\langle P, \phi_i \rangle, \langle Q, \phi_i \rangle\big) = \big|\langle P, \phi_i \rangle - \langle Q, \phi_i \rangle\big| = \left| \int \mathbb{1}_{[A_i]} \, dP - \int \mathbb{1}_{[B_i]} \, dQ \right| \propto \left| \int \mathbb{1}_{[A_i]} \, d\mu - \int \mathbb{1}_{[B_i]} \, d\mu \right|,$$
where $B_i = A_i(Q)$ and $\mu$ is the ambient measure. Indeed, given the definition of level set and the choice of the $A_i$, both $P$ and $Q$ are approximately constant on $A_i$ and $B_i$, respectively, and so we are using the counting (ambient) measure.
Denote by $\triangle$ the symmetric difference operator: $A \triangle B = (A - B) \cup (B - A)$. Consider $\phi_{1i} = \mathbb{1}_{[A_i(P) - A_i(Q)]}$ and $\phi_{2i} = \mathbb{1}_{[A_i(Q) - A_i(P)]}$, and define $d_i(P, Q) = |\langle P, \phi_{1i} \rangle - \langle Q, \phi_{1i} \rangle| + |\langle P, \phi_{2i} \rangle - \langle Q, \phi_{2i} \rangle|$. From the previous discussion, $d_i(P, Q) \propto \mu\big(A_i(P) \triangle A_i(Q)\big)$, which motivates the following definition:
Definition 1 (Weighted level-set distance). Given weights $w = \{w_{(i)}\}_{1}^{n-1}$, we define
$$d_w(P, Q) = \sum_{i=1}^{n-1} w_{(i)} \, \frac{\mu\big(A_i(P) \,\triangle\, A_i(Q)\big)}{\mu\big(A_i(P) \cup A_i(Q)\big)}, \qquad (1)$$
where $\mu$ is the ambient measure. We use $\mu(A_i(P) \triangle A_i(Q))$ in the numerator instead of $d_i(P, Q) \approx \mu(A_i(P) \triangle A_i(Q))$ for compactness; when $n \to \infty$ both expressions are equivalent. Of course, we can calculate $d_w$ in eq. (1) only when we know the distribution function of both PMs. In practice there will be available two data samples generated from $P$ and $Q$, and we need to define some plug-in estimator: consider estimators $\hat{A}_i(P) = \hat{S}_{\alpha_i}(f_P) - \hat{S}_{\alpha_{i+1}}(f_P)$; then we can estimate $d_w(P, Q)$ by
$$\hat{d}_w(P, Q) = \sum_{i=1}^{n-1} w_{(i)} \, \frac{\#\big(\hat{A}_i(P) \,\triangle_S\, \hat{A}_i(Q)\big)}{\#\big(\hat{A}_i(P) \cup \hat{A}_i(Q)\big)}, \qquad (2)$$
where $\#A$ indicates the number of points in $A$ and $\triangle_S$ indicates the set estimate of the symmetric difference, defined below.
Both $d_w(P, Q)$ and $\hat{d}_w(P, Q)$, as currently defined, are semimetrics. The proposal of Euclidean metrics will be addressed in work immediately following the present one.
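To make the estimator concrete, the following minimal sketch (all names are ours; `sym_diff_count`, the covering-based count of $\hat{A} \triangle_S \hat{B}$, is sketched after Section 3.2 below) assembles eq. (2) from precomputed level-set shells:

```python
def level_set_distance(shells_p, shells_q, weights, r):
    """Plug-in estimator (2): weighted sum over level-set shells of
    #(A_i(P) triangle_S A_i(Q)) / #(A_i(P) u A_i(Q)).

    shells_p, shells_q: lists of (m_i, d) arrays with the estimated
    shells A_i(P), A_i(Q); weights: the w_(i); r: covering-ball radius
    used by sym_diff_count (see the sketch after Sec. 3.2)."""
    total = 0.0
    for A, B, w in zip(shells_p, shells_q, weights):
        num = sym_diff_count(A, B, r)   # #(A triangle_S B)
        den = len(A) + len(B)           # the two samples share no points a.s.
        total += w * num / den
    return total
```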
3.1 Level Set Estimation
To estimate level sets from a data sample we present the following definitions and theorems, concerning the One-Class Neighbor Machine [19].
Definition 2. Let $s_n = \{x_1, \ldots, x_n\}$ be a sample drawn from a PM $P$ with density $f$, and let $M(x, s_n)$ be a real-valued function.
(a) If $f(x) < f(y)$ implies $\lim_{n \to \infty} P\big(M(x, s_n) > M(y, s_n)\big) = 1$, then $M$ is a sparsity measure.
(b) If $f(x) < f(y)$ implies $\lim_{n \to \infty} P\big(M(x, s_n) < M(y, s_n)\big) = 1$, then $M$ is a concentration measure.
Example: consider the distance from a point $x \in \mathbb{R}^d$ to its $k$-th nearest neighbour in $s_n$, denoted $x_{(k)}$: $M(x, s_n) = d_k(x, s_n) = d(x, x_{(k)})$ is a sparsity measure. Note that $d_k$ is neither a density estimator nor one-to-one related to a density estimator; thus, the definition of a sparsity measure is not trivial.
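A brute-force sketch of this sparsity measure (function and variable names are our own):

```python
import numpy as np

def knn_sparsity(queries, sample, k):
    """Sparsity measure g(x) = d_k(x, s_n): distance from each query
    point to its k-th nearest neighbour in the sample. Larger values
    indicate sparser (lower-density) regions."""
    # brute-force pairwise distances, shape (n_queries, n_sample)
    d = np.linalg.norm(queries[:, None, :] - sample[None, :, :], axis=-1)
    d.sort(axis=1)
    # when queries are the sample itself, column 0 is the zero
    # self-distance, so column k is the k-th nearest neighbour
    return d[:, k]
```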
The Support Neighbour Machine [19] solves the following optimization problem:
$$\max_{\rho, \xi} \;\; \nu n \rho - \sum_{i=1}^{n} \xi_i \qquad (3)$$
$$\text{s.t.} \quad g(x_i) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n,$$
where $g(x) = M(x, s_n)$ is a sparsity measure, $\nu \in [0, 1]$ is a predefined constant, and the $\xi_i$, $i = 1, \ldots, n$, are slack variables.
Theorem 1. The set $R_n = \{x : h_n(x) = \mathrm{sign}(\rho_n - g_n(x)) \geq 0\}$ converges to a region of the form $S_\alpha(f) = \{x \mid f(x) \geq \alpha\}$, such that $P(S_\alpha(f)) = \nu$. Therefore, the Support Neighbour Machine estimates a density contour cluster $S_\alpha(f)$ (around the mode).

Hence we take $\hat{A}_i(P) = \hat{S}_{\alpha_i}(f_P) - \hat{S}_{\alpha_{i+1}}(f_P)$, where $\hat{S}_{\alpha_i}(f_P)$ is estimated by the set $R_n$ defined above.
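Rather than solving problem (3) with a generic optimizer, a simplified stand-in (our own reduction, not the exact procedure of [19]) thresholds the sparsity values at an empirical quantile, which is what the optimal $\rho_n$ amounts to. Reusing `knn_sparsity` from the sketch above:

```python
import numpy as np

def level_set_points(sample, nu, k=10):
    """Stand-in for the Support Neighbour Machine: keep the fraction nu
    of sample points with the smallest sparsity g(x) = d_k(x, s_n),
    an estimate of a density contour cluster S_alpha(f)."""
    g = knn_sparsity(sample, sample, k)
    rho = np.quantile(g, nu)          # empirical threshold rho_n
    return sample[g <= rho]

def shell(sample, nu_hi, nu_lo, k=10):
    """Estimate A_i(P) = S_{alpha_i}(f_P) - S_{alpha_{i+1}}(f_P):
    points in the larger contour cluster (mass nu_hi) that are not
    in the smaller one (mass nu_lo < nu_hi)."""
    outer = level_set_points(sample, nu_hi, k)
    inner = {tuple(p) for p in level_set_points(sample, nu_lo, k)}
    return np.array([p for p in outer if tuple(p) not in inner])
```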
3.2 Estimation of the Symmetric Difference
We should not estimate the region $A \triangle B$ by $\hat{A} \triangle \hat{B} = (\hat{A} - \hat{B}) \cup (\hat{B} - \hat{A})$, given that with probability one there will be no points in common between $\hat{A}$ and $\hat{B}$ (which implies $\hat{A} \triangle \hat{B} = \hat{A} \cup \hat{B}$). Given that $\hat{A}$ is a set of points estimating the spatial region $A$, we will estimate the region $A$ by a covering of the type $\bar{A} = \bigcup_i B(x_i, r)$, where the $B(x_i, r)$ are closed balls with centres at $x_i \in \hat{A}$ and radius $r$ [8]. The radius is chosen to be constant because we can assume the density to be approximately constant inside the region $A_i(P)$ if the partition $\{\alpha_i\}$ of the interval $(0, 1)$ is fine enough. The problem of calculating $A \triangle B$ therefore reduces to estimating the points of $\hat{B}$ not belonging to the covering estimate of $A$, plus the points of $\hat{A}$ not belonging to the covering estimate of $B$. This will be denoted by $\hat{A} \triangle_S \hat{B}$. Figure 1 illustrates the previous discussion. Notice that $\triangle_S$ implicitly gives rise to kernels for sets; for example, $K(A, B) = 1 - \#(A \triangle_S B)/\#(A \cup B)$, which allows one to consider distances for distributions in the context of kernel methods.
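A direct sketch of the covering-based symmetric-difference count and of the induced set kernel (brute-force distances; names are ours):

```python
import numpy as np

def covered(points, centers, r):
    """Mask of points lying in the covering by closed balls B(c, r)."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return (d <= r).any(axis=1)

def sym_diff_count(A, B, r):
    """#(A triangle_S B): points of B outside the ball covering of A,
    plus points of A outside the ball covering of B."""
    return int((~covered(B, A, r)).sum() + (~covered(A, B, r)).sum())

def set_kernel(A, B, r):
    """K(A, B) = 1 - #(A triangle_S B)/#(A u B), a similarity between
    point-set estimates of two regions (disjoint samples assumed)."""
    return 1.0 - sym_diff_count(A, B, r) / (len(A) + len(B))
```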
Fig. 1. (a) Point sets $\hat{A}$ and $\hat{B}$. (b) $\hat{A} \triangle_S \hat{B}$. (c) $\hat{B} \triangle_S \hat{A}$.
4 Experimental Work
Fig. 2. MDS plot for the texture groups. A representative of each class is plotted on the map.
To conclude, we show an application of the LD measure to evaluate distances between data sets. To this aim, we consider 9 data sets from the Kylberg texture data set [14]: blanket, canvas, seat, oatmeal, rice, lentils, linseeds, stone1 and stone2, belonging to 3 main types. There are 160 × 11 = 1760 texture images with a resolution of 576 × 576 pixels. We represent each image using the 32 parameters of the wavelet coefficient histogram proposed in [16]. Next we calculate the (between-sets) distance matrix with the LD measure and obtain, by multidimensional scaling (MDS), the representation shown in Figure 2, which amounts to a sort of MDS for data sets. It is apparent that the textures get organized in a way highly coherent with human criteria, which seems to indicate that the proposed Level Distance is appropriate for real pattern recognition problems (high-dimensional, with a small number of data samples).
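The embedding step can be reproduced with classical multidimensional scaling applied to the matrix of pairwise LD values; a generic numpy sketch (not the code used in the paper):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS: embed n objects in `dim` dimensions from an
    (n, n) matrix D of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]       # top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```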
5 Future Work
In the near future we will address the study of the geometry induced by the proposed measure and of its asymptotic properties. Exhaustive testing on a variety of data sets following different distributions is also needed. In addition, we are working on a variation of the LD distance that satisfies the Euclidean metric conditions.
Acknowledgments. This work was partially supported by projects DGUCM
2008/00058/002 and MEC 2007/04438/001 (Spain).
References
1. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differential Geometry in Statistical Inference. Lecture Notes-Monograph Series, vol. 10 (1987)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2007)
3. Atkinson, C., Mitchell, A.F.S.: Rao's Distance Measure. Sankhyā: The Indian Journal of Statistics, Series A 43, 345–365 (1981)
4. Müller, A.: Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability 29(2), 429–443 (1997)
5. Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman Divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)
6. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. Journal of Multivariate Analysis 12, 575–596 (1982)
7. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A Comparison of String Distance Metrics for Name-matching Tasks. In: Proceedings of IJCAI 2003, pp. 73–78 (2003)
8. Devroye, L., Wise, G.L.: Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)
9. Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics 3, 1102–1123 (2009)
10. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
11. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample problem. In: Advances in Neural Information Processing Systems, pp. 513–520 (2007)
12. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning, 2nd
edn. Springer (2009)
13. Hayashi, A., Mizuhara, Y., Suematsu, N.: Embedding Time Series Data for Classification. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 356–365. Springer, Heidelberg (2005)
14. Kylberg, G.: The Kylberg Texture Dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Sweden, http://www.cb.uu.se/gustaf/texture/
15. Lebanon, G.: Metric Learning for Text Documents. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(4), 497–508 (2006)
16. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989)
17. Marriott, P., Salmon, M.: Applications of Differential Geometry to Econometrics. Cambridge University Press (2000)
18. Moon, Y.I., Rajagopalan, B., Lall, U.: Estimation of mutual information using kernel density estimators. Physical Review E 52(3), 2318–2321 (1995)
19. Muñoz, A., Moguerza, J.M.: Estimation of High-Density Regions using One-Class Neighbor Machines. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(3), 476–480 (2006)
20. Ramsay, J.O., Silverman, B.W.: Applied Functional Data Analysis. Springer, New
York (2005)
21. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.G.: Non-parametric estimation of integral probability metrics. In: International Symposium on Information Theory (2010)
22. Strichartz, R.S.: A Guide to Distribution Theory and Fourier Transforms. World Scientific (1994)
23. Székely, G.J., Rizzo, M.L.: Testing for Equal Distributions in High Dimension. InterStat (2004)
24. Ullah, A.: Entropy, divergence and distance measures with econometric applications. Journal of Statistical Planning and Inference 49, 137–162 (1996)
25. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning, with Application to Clustering with Side-information. In: Advances in Neural Information Processing Systems, pp. 505–512 (2002)
26. Zolotarev, V.M.: Probability metrics. Teor. Veroyatnost. i Primenen. 28(2), 264–287 (1983)