Academia.eduAcademia.edu

Improving homology estimates with random walks

2011, Inverse Problems

Improving Homology Estimates with Random Walks Paul Bendich1 , Taras Galkovskyi1 , and John Harer2 1 Department of Mathematics, Duke University, Durham, NC, 27708, USA 2 Departments of Mathematics and Computer Science, Center for Systems Biology, Duke University, Durham, NC, 27708, USA E-mail: bendich@math.duke.edu,galkovsky@gmail.com,harer@math.duke.edu Abstract. This experimental paper makes the case for a new approach to the use of persistent homology in the study of shape and feature in datasets. By introducing ideas from diffusion geometry and random walks, we discover that homological features can be enhanced and more effectively extracted from spaces that are sampled densely and evenly, and with a small amount of noise. This study paves the way for a more theoretical analysis of how random walk metrics affect persistence diagrams, and provides evidence that combining topological data analysis with techniques inspired by diffusion geometry holds great promise for new analyses of a wide variety of datasets. 1. Introduction The continually growing importance of analyzing massive datasets has prompted the development of a number of new techniques for finding shape and feature in point clouds. Well established methods from statistics are now augmented with approaches that focus on shape identification, inference, and dimension reduction using methods of geometry and topology, subjects that were largely regarded in the past as too abstract to be applied to data analysis. Computational Topology has emerged as a new discipline in the last ten years [12], and its most significant application to date has been to Topological Data Analysis (TDA) in which datasets are treated as point clouds, and persistent homology is calculated to look for both global and local features in these clouds. In parallel, diffusion geometry [9], [10] has developed into a powerful tool in dimension reduction and the analysis of high-dimensional data. While its methods are much more analytical than those of computational topology, it nonetheless shares the same goals and the same spirit of applying well established methods of geometry to applied problems. Now that these subjects have come of age, the time has come to consider how they can be made to work together and to work with methods from statistics. This experimental article considers the question of how the methods of diffusion geometry can be combined with TDA. A related theoretical article [16] looks at the question of how to do statistics on the persistence diagrams that describe the results of applying TDA methods to repeated samples Improving Homology Estimates with Random Walks 2 from a geometric object. Along with some other recent papers, e.g. [6], these works suggest possible research directions in support of a future merger of techniques. Main idea. The main idea that we explore here is that by using random walks on a dataset one can define metrics that enhance topological properties and increase the capability of persistent homology to capture them. A well-known example in diffusion is a curve in the plane which is topologically a circle, but lies close to a figure-8 (see Figure 2 below). When this curve is sampled densely enough, diffusion geometry recognizes the circle by finding a new embedding based on the eigenfunctions of an appropriate operator derived from the data. Our approach is motivated by this idea. We use the diffusion metric Dm to define a distance between pairs of points in our dataset, and we then construct the Rips complexes using this distance and compute the corresponding persistence diagram. For reasons described below, we also consider our own ad-hoc definition of a different symmetric function ρm , defined below, on our point set and investigate the properties of its associated Rips complexes and persistence diagrams. Results. Here we briefly summarize the experimental results, which are described more fully in Section 5. We ran two main types of experiments on points cloud sampled from the figure in Figure 2. When we sampled points densely without outliers, we found that the diffusion metric Dm caused the main cycle to separate very cleanly from the shorter cycle introduced by the bottle-neck near the origin. Here the advantage held by Dm over the traditional Euclidean distance was striking, while ρm did better than Euclidean distance but certainly not as well as diffusion. We also considered what happens when varying amounts of noise, including a decent number of outliers, were introduced into our dataset. When the density of noisy samples was small relative to the sampling density on the underlying space, we found that our persistence diagrams for Dm were mostly unchanged, something which is impossible with the Euclidean metric. On the other hand, the performance of the diffusion metric rapidly broke down as the noise-to-signal ratio increased. For these very noisy samples, the ρm -diagrams continued to retain at least some of the topological structure of the underlying space. To obtain persuasive results, we needed to take a large number of sample points, large enough so that the persistence diagram computation would have become prohibitively slow. To fix this problem, we employed a sub-sampling method broadly inspired by Witness Complex [11] methodology. This method seems interesting in its own right, and so we describe it fully in Section 4. Prior work. A number of authors have begun to study methods for applying persistent homology to noisy datasets. In one of the first attempts [4], the space of natural images is denoised before computing a topological filtration that discovers a Klein bottle and leads to a new compression methodology for images. A more recent effort to preprocess data before computing persistence is in [15]. In a different direction, [6] defines the concept of distance from a distribution, which provides one of the first true inter-linkings of probabilistic Improving Homology Estimates with Random Walks 3 methods with topological ones and allows a new theoretical analysis of how well a noisy dataset represents the space from which it was sampled. Outline. Necessary background on persistent homology is given in Section 2, while Section 3 introduces the method of using Euclidean distance to assess the homology of a point cloud. This method has obvious limitations, to which we propose our solution in Section 4. Experimental results, as well as some implementation details, are presented in Section 5. Finally, Section 6 concludes the paper with a discussion and ideas for future research directions. 2. Background Persistent homology is described in great detail in the recent textbook [12]. In this section, we confine ourselves to a few brief definitions and examples. All homology groups are assumed to be taken over some field, usually Z/2Z, and are thus just vector spaces over that field. We assume familiarity with homology itself; see for example [18] for more background. Sublevel set filtrations. Suppose that we have a topological space M equipped with a realvalued function f : M → R. For each r ≤ s and each homological dimension p, the inclusion Mr ֒→ Ms of sublevel sets induces linear maps Hp (Mr ) → Hp (Ms ) between homology groups. In a nutshell, persistent homology uses these linear maps to track the evolution of the homology of Mr , as r increases from negative to positive infinity, and then compactly displays this information in persistence diagrams; for an example of the latter, see the right side of Figure 2. We now give some more details as to what this means. As an aside, we note that there is nothing special about homology here, since persistence can be defined for any sequence of vector spaces connected by linear maps; see [5] for a detailed treatment. Critical values. Although we ostensibly have infinitely many homology groups, one for each real number, it often turns out there are only a finite number of changes as we move from negative to positive infinity. More precisely, we say that r is a homological regular value (hrv) of f if there exists some ǫ > 0 such that, for all positive δ < ǫ, the inclusions Mr−δ ֒→ Mr+δ induce homology isomorphisms in every dimension. If r does not satisfy this condition, then it is a homological critical value (hcv). For example, consider the round circle X ⊂ R2 of radius r as drawn on the left side of Figure 1. This circle defines a distance function dX : R2 → R; in words, dX (y) is the Euclidean distance between a point y ∈ R2 and its closest neighbor on X. One can imagine the sublevel sets of dX as being a sequence of thickenings of the circle X within its ambient space R2 . In this case, dX will have only two hcvs, 0 and r. On the other hand, the function dY , where Y is the space on the left side of Figure 2, will have four hcv’s: 0 < s < r < t (the right lobe is slightly larger than the left one). The smallest non-zero hcv of dY is called the homological feature size of Y, e.g in Figure 2 it is s. 4 Improving Homology Estimates with Random Walks r r Figure 1: The persistence diagram Dgm1 (dX ) is on the right, where X is on the left. t r s t r s Figure 2: The persistence diagram Dgm1 (dY ) is on the right, where Y is on the left. Birth, death, diagrams. Returning to the general case, suppose we have a function f : M → R with a finite set of hcvs a1 < a2 < . . . < an . A function which meets this assumption, along with the requirement that the homology groups of all sublevel sets be of finite rank, is called tame. We choose hrvs {bi } with b0 < a1 < b1 < . . . < an < bn , and put Mi = Mri . Fixing a homological dimension p, we adopt the shorthand notation Hip = Hp (Mi ), and we consider the sequence of homology groups connected by linear maps, 0 = H0p → H1p → H2p → . . . → Hnp = Hp (M). (1) For each i ≤ j, we let fpi,j : Hip → Hjp be the map induced on homology by the inclusion Mi ֒→ Mj . A class α ∈ Hip is then said to be born at Mi if α 6∈ im fpi−1,i . Such an α dies entering Mj if fpi,j (α) ∈ im fpi−1,j , but fpi,j−1 (α) 6∈ im fpi−1,j−1 . The persistence of the class α is bj − bi ; this real number gives a measure of how long α lives during the sublevel set filtration. There is one technical issue to address, which we do via the example of the space Y in Figure 2. Letting M1 and M2 be sublevel sets just before and just after the hcv s, respectively, we notice that M1 has the homotopy type of a circle, while M2 has that of a wedge of two circles. There are thus two 1-dimensional classes α, β born at M2 , the ones represented by the two new circles. On the other hand, α and β both represent the same coset of H21 /im f11,2 , since α + β is homologous to the generator of H11 . Both of these new classes die upon entering M3 , a sublevel set just after the hcv r: one of them becomes homologous to zero, while the other one becomes homologous to the generator of H11 . This example illustrates the general fact that whenever a class α is born at some Mi , an entire coset is born with it, but all classes Improving Homology Estimates with Random Walks 5 in this coset will die whenever α dies. For each coset of p-dimensional classes born at Mi and dying entering Mj , we draw the point (ai , aj ) in the plane. The resulting multi-set, Dgmp (f ), is the p-th persistence diagram for the function f . Notice that the persistence of a class α is then just the vertical distance of its representative point from the major diagonal. For technical reasons that will soon become clear, we also add all major diagonal points (a, a) where a ≥ 0, taken with infinite multiplicity, to every persistence diagram. As an example, the right side of Figure 2 displays Dgm1 (dY ), where Y is the space drawn on the left side of the same figure. Notice that the “real” 1-dimensional homology of Y shows up as the point on the vertical axis, while the other non-zero persistence point in the diagram represents the new homology created by a slight thickening of Y. In terms of the persistence diagram, the homological feature size of Y shows up as the minimum of the smallest death value of a point on the y-axis, and the smallest birth value of a point off the y-axis. Diagram stability. It turns out that the persistence diagrams Dgmp (f ) and Dgmp (g), where f and g are two “close” functions defined on the same domain, will also be “close,” as measured by an appropriate metric. Indeed, given the inevitability of noisy or incomplete input, a statement of this kind really has to be true if persistence diagrams are to be a useful tool for any sort of data or shape analysis. The bottleneck distance dB between any two persistence diagrams D, D′ is defined as follows: dB (D, D′ ) = inf sup ||u − Γ(u)||∞ , Γ:D→D′ u∈D (2) where Γ ranges over all bijections between the diagrams D and D′ . Recall that every persistence diagram contains infinitely many copies of every major diagonal point, and thus bijections always exist. In words, the bottleneck distance is defined by considering all matchings between D and D′ , assigning the largest l∞ distance between matched points, and then taking the minimum over all matchings. Note that the matching may not be unique. Under this metric, the persistence diagram Dgmp (f ) is stable to small perturbations of f . Precisely, the following result was proven first in [7] and then given a wider context in [5]: Theorem 1 (Diagram Stability Theorem). If X is a compact topological space, then for every homological dimension p and every pair of tame functions f, g : X → R, dB (Dgmp (f ), Dgmp (g)) ≤ ||f − g||∞ . Filtered simplicial complexes. In our experiments below, we will not strictly have a realvalued function on a topological space. Instead, we will consider some finite simplicial complex K, along with a monotonic function f : K → R that is constant on each (open) simplex; here monotonic means that if τ is a face of σ, then f (τ ) ≤ f (σ). If we consider values b1 < b2 < . . . < bn for f , we let Ki be the subcomplex of K consisting of all simplices Improving Homology Estimates with Random Walks 6 whose f -values are no larger than bi . We then have inclusions Ki ֒→ Kj , for i ≤ j, and hence an application of the homology functor allows us to define persistence and draw persistence diagrams, exactly as above. One example that we will make use of below is the following. Suppose that we are given a finite point set U and a symmetric function ρ : U × U → R on the edges of the complete graph G with vertex set U, such that ρ(u, u) = 0 for every point u; for example, ρ could be a metric on U, but it need not be. Given any subgraph H ⊆ G, we define R(H) to be the largest simplicial complex which has H as its 1-skeleton; note that R(H) is often called a clique complex in the literature and is akin to a Rips Complex. Since U is a finite set, ρ takes finitely many values. For each such value ρi , we let Gi ⊆ G be the subgraph with vertex set U and edge set consisting of all edges with ρ-values less than or equal to ρi . Notice that U ⊂ G0 . We can then filter the simplicial complex R(G) with subcomplexes R(Gi ), leading to persistence diagrams in each homological dimension, which we denote Dgmp (U, ρ). 3. Homology of a Point Cloud Consider the 1-dimensional homology of the point cloud U on the left side of Figure 3. Taken literally, this statement is of course absurd, since U is simply a discrete set of points and thus has homology only in dimension zero. On the other hand, a slight blurring of the point cloud may result in a clear one-cycle being formed, while a further blurring will pinch this cycle into two smaller ones. This can been seen in the persistence diagram Dgm1 (dU ) (right side of Figure 3), where the different blurrings manifest as sublevel sets of the distance function dU . In this diagram, we see two highly persistent points, and then quite a few other points very close to the diagonal. Perhaps one could identify the homology classes associated to the high persistence points as representing, at least in some sense, the 1-dimensional homology of the point cloud; in general, it makes sense to think of classes which are born early and persist for a long time as being in some sense “real.” More humbly, one could just report the entire persistence diagram as an answer to the point cloud homology question, and let the end-user make a thresholding decision as to which classes are the most important. In any case, any persistence-diagram approach will lead to difficulty in two different ways, as we now describe. Homological feature size. It would seem reasonable to demand that our guess at the homology of U should in some sense match with the homology of any topological space Y from which U was sampled, at least for sufficiently good sampling. But this can never work in general. For example, our point cloud U was in fact sampled, with very little noise and absolutely no outliers, from the space Y on the left side of Figure 2. But H1 (Y) is rank one, since Y is topologically a circle. To be more precise, U is an ǫ-sample of Y, where ǫ = ||dY − dU ||∞ is the Hausdorff distance between the space and the point cloud. Appealing to the above theorem, this means the two persistence diagrams Dgm1 (dY ) and Dgm1 (dU ) can be no more than ǫ apart in the bottleneck metric. Recall that the actual first homology of Y is represented by the Improving Homology Estimates with Random Walks 7 Figure 3: On the left, a sampled version U of the space Y shown in figure 2. The persistence diagram Dgm1 (dU ) is on the right; points of higher persistence are drawn larger for clarity. point directly on the y-axis of Dgm1 (dY ), while the other point represents the new onecycle formed by a slight thickening of Y. Although the optimal bijection between the two diagrams certainly matches the two large-persistence points in Dgm1 (dU ) with these two points, sending all other points to the major diagonal, we cannot hope to tell which Dgm1 (dU )point matched to the y-axis point, or indeed how many y-axis points there really were. In a nutshell, the problem is that the homological feature size of Y is very small, and so nothing but the very densest of samples has any hope of picking out the one true cycle without also suggesting the extra cycle. Outliers. There is a second, and in some sense far graver, problem: persistence diagrams are stable to small changes in input set, but not to a small amount of very large noise. Namely, suppose that the vast majority of points in a cloud U seem to trace out a clear space Y, but that there are also a few very bad sampling errors here and there. More formally, suppose that there is a tiny subset U0 of U such that U − U0 is an ǫ-sample of Y, for some very small ǫ, but that the points in U0 lie quite far away from Y. In this case, the Hausdorff distance between U and Y will be quite high, and thus the Diagram Stability Theorem gives us no useful guarantees about the relationship between the respective diagrams. For example, suppose U is the point cloud shown on the left side of Figure 4, which seems to be a very faithful sample of a perfectly round circle X. The persistence diagram Dgm1 (dU ) for this sample consists of the box on the right of the same figure. Compare this to the right side of Figure 1, which shows the persistence diagram Dgm1 (dX ); we see that the two diagrams are quite close in the bottleneck metric. On the other hand, suppose that we add one solitary outlier directly in the center, producing the new point cloud V, which is also shown on the left of Figure 4. Looking at the cross on the right side of the same figure, we find that Dgm1 (dV ) is markedly different √ from Dgm1 (dU ), as the persistence of the main cycle has diminished by a factor of 3. The reason is obvious: suppose the circle has radius r0 ; for U, this main cycle dies when we reach √ the side length of an equilateral triangle inscribed in the circle, which is 3r0 , while it dies at r0 for the sample with just one outlier. 8 Improving Homology Estimates with Random Walks (a) Samples (b) Diagrams Figure 4: Left: dataset U of size 100 sampled from a circle. The noisy dataset V, consisting of U plus the central point (hollow point) is also depicted. Right: persistence diagrams. The symbols  and × mark persistent homology classes corresponding to ’clean’ and ’noisy’ datasets, respectively. Summary. Looking at the above paragraphs in a slightly different light, we see that the idea of doing homology inference by thickening a point sample U into gradually growing Euclidean balls and then computing the resulting persistence diagram has two major flaws. First, the space itself might have very small homological feature size. We can get around this by taking an extremely dense point sample, but this will come at a high computational cost. More seriously, any realistic sample will of course have a few bad outliers, and increasing the density will never hope to eliminate them. 4. Random Walk Here we define the two symmetric functions, each based on random walk, that we use in our experiments. We also describe the witness-like method we employ to achieve a massive increase in computation speed. Diffusion metric. There are various definitions of the diffusion metric to be found in the literature; we use the definition given in [8]. First, we imagine that we are standing at a particular point u ∈ U and are going to choose a random step to another point v ∈ U, where the probability of choosing v is proportional to some kernel function k(u, v) which is a decreasing function of the Euclidean norm ||u − v||; most of the literature uses a kernel 2 function of the form k(x, y) = exp(− ||x−y|| ), where α is some positive tuning parameter. α That is, we choose v with probability φ(u, v) = k(u, v) , Σz k(u, z) (3) where the sum in the denominator runs over all points in z ∈ U except u itself. We also set φ(u, u) = 0 for all u ∈ U. Improving Homology Estimates with Random Walks 9 The assignments φ(u, v) place probability weights on the edges of the directed complete graph G, and we then run random walks on this graph with these weights. Choosing a positive integer m, let φm (u, v) be the probability that a random walk of exactly m steps starting at u will end at v. We then define the diffusion distance Dm : U × U → R by: p Dm (x, y) = Σu [φm (x, u) − φm (y, u)]2 , (4) where the sum runs over all points u ∈ U. Now that we have a symmetric function on the edges of G, we can compute the persistence diagrams Dgmp (U, Dm ) in each homological dimension, as described at the end of section 2. Note that the integer m is of course a parameter that we must select in advance; our preliminary experiments indicate that values of m somewhere in the 10−30 range provide the best results, but we cannot say anything more conclusive than that at this time. Random walk function. For an alternative set of experiments, we also consider the following symmetric function ρm : U × U → R. We again run random walk on U, but this time taking a step from u to v with probability φ(u, v) proportional to the kernel function κ(u, v) = ||u − v||−2 . Choosing a positive integer m, let φ≤m (u, v) be the probability that a random walk of less than or equal to m steps which starts at u will end at v; more precisely, we set X φ≤m (u, v) = φk (u, v), k≤m where φk is defined above. Finally, we define: 1 ρm (u, v) = , min{φ≤m (u, v), φ≤m (v, u)} (5) for u 6= v and ρm (u, u) = 0, and then we compute the persistence diagrams Dgmp (U, ρm ) in each homological dimension, as described at the end of section 2. Our reasons for defining and testing this new function were as follows. First of all, we wanted to use a kernel function which did not depend on some parameter α; in our experiments with Dm , we were able to tune α correctly by exploiting prior knowledge of the data, but this cannot be done in reality. It also seemed more natural to use the random-walk probabilities φm directly, rather than summing over intermediate points as in the definition of Dm . This was especially appealing in the case of outliers: if a data point v lies far way from the rest of the data points u, then the numbers φm (u, v) will all be quite small, and hence ρm (u, v) will be large. As we see below, our intuitions were partially but not fully justified by our experimental results. Sub-sampling. In order to produce a persistence diagram from a point cloud endowed with a metric, one first has to construct the filtered Rips complex and then feed the resulting simplices to a matrix reduction algorithm [13] which runs in worst-case cubic time in the number of input simplices. Thus, assuming we start with N points, and compute only the d-skeleton of the Rips complex (for example, if we are only interested in persistent homology up to dimension (d − 1)), the entire computation takes O(N 3(d+1) ) time; hence a large number of Improving Homology Estimates with Random Walks 10 points will certainly lead to a very slow diagram computation. On the other hand, any randomwalk-based metric must require a fairly large number of points sampled from the underlying space: if one wishes to “see” the true geometry, then one must be able to choose from a large number of paths that lie along the space. To resolve this conflict, we propose the following technique. From an initial large sample U of size N , we draw a much smaller sub-sample U0 of size n << N . We compute all pairwise distances Dm (u, v) for all pairs from the large sample, but we then build a filtered Rips complex using only the U0 -points as vertices and compute the resulting persistence diagrams. The complexity is then much better: the entire algorithm from point cloud to metric ρm /Dm to d-skeleton of Rips complex to persistence diagram runs in O(n3(d+1) + mN 3 ). Equivalently, one may think of defining a new metric on the space U0 which is based on random walk, but where the random walker is allowed to explore within a much larger set of available paths. In this way, we satisfy two goals: a small filtered simplicial complex, but one with edge weights derived from a much larger sample that more faithfully reflects the underlying space. 5. Experiments At the end of this section, we present our main experimental results, and we argue that they provide strong evidence for random-walk methods being a solution to the two problems discussed at the end of Section 3. We begin the section with details on the implementation of the algorithms used to compute persistent homology, as well as the computation of the numbers φm , which then lead directly to the Dm and ρm edge weights. 5.1. Implementation To compute persistent homology, we used the algorithm library Dionysus [17]. In particular, we took advantage of the implementations of the persistent homology algorithm from [13] and the Bron-Kerbosch algorithm [3] for the construction of the Rips complex. We also produced a MATLAB wrapper of Dionysus to streamline the design and execution of these and future experiments [14]. Probabilities. Here we describe the computation of the metrics Dm and ρm given a dataset of witnesses and a sub-set of landmarks. In each case, we represent the metric as a matrix of all pair-wise distances between all points in the dataset, without making a distinction between witness and landmark points. We then construct the filtered Rips complex using only the distances between landmark points. The probability φm (u, v) of a random walk of exactly m steps from a point u to a point v is recursively defined using the value of φm−1 (u, v) as follows: X φm (u, v) = φm−1 (u, z)φ(z, v) (6) z Improving Homology Estimates with Random Walks 11 Thinking of each φm as an n × n matrix, where n is the total number of points, we can rewrite (6) in matrix form: φm = φm−1 ◦ φ, X φ≤m = φk (7) (8) k≤m Implementing (8) in a straightforward way gives us a total complexity of O(n3 m) to compute our ad-hoc metrics ρm . However, for the diffusion metrics Dm , we need only compute random walks for step size exactly m. In this case, rising the matrix to the power m can be done by a dichotomy algorithm that computes the result in log m multiplications, giving a total complexity of O(n3 log m). Because of the relatively small values of m that were used in experiments, numerical precision errors due to matrix multiplication are small and can be ignored. Sub-sampling. We now describe the procedure that we used to sub-sample a witness set U0 from a initial point sample U. In a nutshell, we iteratively draw one landmark point from a very dense sampling of the underlying space. For each landmark point we define a no-inclusion ball of a radius δ. Each point from the dataset that is inside this ball will not be considered later, and could not be drawn in future iterations. It is obvious that no two landmark points drawn by this algorithm will lie closer than δ. 5.2. Results Our experiments can be divided into two main groups. First we investigated the performance of our random-walk metrics versus the Euclidean metric on several sets of points densely sampled without noise from the nearly figure-eight space; our goal was to see which metric did the best job amplifying the true topological signal from the underlying space. Then we began to progressively add noise to our sample to see which metrics preserved which type of signal. Amplifying the signal. For this experiment we drew samples from three versions of the nearly figure eight space, with neck size varying from large to small, and for each sample we computed the one-dimensional persistence diagrams associated to the diffusion metric Dm , our metric ρm , and the Euclidean metric. The goal was to explore how easy it would be to distinguish signal from noise on the persistence diagram. In this synthetic case, we can of course precisely define the terms signal and noise, since we have a priori information about the original space. The point clouds consisted of 1800 points total on average, each with about 225 landmark points. For subselecting landmarks from the whole point cloud we used the algorithm as defined in section 5.1, with a higher value of δ than for the initial sampling of the underlying space cloud. Our primary criterion for choosing δ was to draw approximately 225 landmark points each time. 12 Improving Homology Estimates with Random Walks (a) Large neck size, m = 10 (a) Large neck size, m = 20. (b) Medium neck size, m = 10 (b) Medium neck size, m = 20. (c) Small neck size, m = 10 (c) Small neck size, m = 20. Figure 5: Dm metric Figure 6: ρm metric Figure 7: Witnesses appear as gray points, with landmarks as solid black points. The overlaid cycles are the canonical representatives for the most persistent homology classes. The persistence diagram for each point cloud appears to its immediate right. We then ran a series of computations using various choices for parameter values (step size m for both metrics Dm and ρm ; α for Dm only), and we found that these choices greatly affected the diagram outcome. Moreover, for different point clouds, the “best” parameter values differed. A table which summarizes the experimental results for all chosen parameter values can be found at http://www.math.duke.edu/˜elimgta/metrics/ [2]. Here we show only the results for the best choice of parameters. Our results for Dm and ρm are presented in Figure 7, and the results for Euclidean distance appear in Figure 8. As we can see, the Dm -diagrams closely resemble those of (a) Large size neck (b) Medium neck size (c) Small neck size Figure 8: Persistence diagrams computed for the same point clouds as in figure 7, this time with the Euclidean metric. 13 Improving Homology Estimates with Random Walks a round circle (compare with figure 1) in that there is only one point of non-negligible persistence; this point does move closer to the diagonal as the neck size decreases, but the topological signal remains quite clear, even for the smallest neck, since no other classes appear in the diagram. The Euclidean diagrams, on the other hand, all feature two high-persistence points. As the neck size decreases, these points move closer to one another, making the topological signal harder to discern. A nearly identical situation holds for the ρm -diagrams, although the statistics in [2] show that the signal/noise ratio is in fact marginally better for ρm versus Euclidean in most cases. It is worth noting that one of the biggest problems that we have stumbled upon was the proper choice for α in the kernel used in the Dm metric. As can be seen in [2], ”bad” choices of α lead either to incorrect homology estimates, or to signal/noise ratios close to that of the Euclidean metric. For this particular series of figure-eight datasets, we found that the best choice of α was always around one-tenth of the neck size. In general, though, we have an open question: what strategy should be used to choose the α value for datasets where we have no a priori information about the underlying topological space. (a) Small level of noise, m = 20. (a) Small level of noise, m = 20. (b) Medium level of noise, m = 20. (b) Medium level of noise, m = 20. (c) High level of noise, m = 30. Figure 9: Dm metric (c) High level of noise, m = 30. The cycle shown here is NOT the most persistent cycle, but the second most persistent one. The most persistent one includes almost solely noise points and has very late birth time Figure 10: ρm metric Figure 11: Witnesses appear as gray points, with landmarks as solid black points. The overlaid cycles are the canonical representatives for the homology classes corresponding to the points marked by circles in the diagrams. The persistence diagram for each point cloud appears to its immediate right. Improving Homology Estimates with Random Walks 14 Exploring robustness. As mentioned above, persistence diagrams computed using the Euclidean metric are extremely sensitive to even a small amount of very bad noise. In a second set of experiments, we investigated the robustness of the Dm and ρm diagrams in the presence of a increasing number of outliers. For these experiments, we drew three different samples from the same nearly figure-eight space; each sample contained some outliers, with the number of outliers increasing with each draw. Each time, the goal was the same as for the first set of experiments: draw out the true topological signal. The sampling and sub-sampling procedures were the same as before: the point clouds consisted of 1800 points on average, from which we drew out around 225 landmark points. Note that most of the outlier points became landmarks, a clear consequence of the subsampling procedure described above. As before, we ran a series of computations using various parameter values, and summarized all results in a table [2]. The results for the best parameter choices are shown in Figure 11. In the left column, we see the results for the Dm -metric. These diagrams show a weaker topological signal than in the no-noise experiments. The canonical representative for the most persistent class hints at the underlying space, but also contains quite a few outliers; Reading the left column top-to-bottom, we see that the persistence of spurious classes increases with the number of outliers. The right column of the same figure displays the ρm -results. Again, there is certainly no crystal-clear topological signal, at least compared to the no-noise experiments. However, one feature of the ρm metric that inspired us to publish its results is the fact that its diagrams do contain a point with non-negligible persistence whose canonical representative cycle traces out the underlying space without including any outliers. This compares well with the cycles for Dm , which always seem to pick out outliers. 6. Discussion The ideas and experiments described above demonstrate how the use of methods on random walks rather than the usual Euclidean distance provides a more robust and powerful way to use topological data analysis (TDA) in making measurements of dataset shape. The approach we have put forward stands in sharp contrast to methods where noise is only addressed in a preprocessing step. Instead it integrates the process of de-noising and homological calculation, and incorporates the important ideas put forward in [9] that use local proximity information in the determination of global structure. Combined with other new approaches of this form, e.g. [6], we believe that TDA is moving to a new level. Most real world datasets have some level of noise, and the challenge now is to fine tune the random walk metric to different types of data. We are particularly interested in discovering stratified structure, based on the ideas of [1]. We sum up with the visual image of a triangle, with vertices corresponding to TDA, diffusion geometry (DG) and statistical methodology (S); see Figure 12. This paper makes a first stab at the edge from TDA to DG, and the papers [6] and [16] make steps towards the Improving Homology Estimates with Random Walks 15 Figure 12: A triangle of future research directions. edge from TDA to S. These methods can be used together to make homological inference the effective shape analysis tool that it promises to be. References [1] P. Bendich, D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and D. Morozov. Inferring local homology from sampled stratified spaces. In Proceedings 48th Annual IEEE Symposium on Foundations of Computer Science, pages 536–546, 2007. [2] Paul Bendich, Taras Galkovskyi, and John Harer. Experimental results for diffusion and random walk metrics. http://www.math.duke.edu/˜elimgta/metrics/, 2011. [3] C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM, 16(9):575577, 1973. [4] G. Carlsson and T. Ishkhanov. A topological analysis of the space of natural images. Preprint, 2011. [5] F. Chazal, D. Cohen-Steiner, M. Glisse, L. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proceedings 25th Annual Symposium on Computational Geometry, pages 237–246, 2009. [6] F. Chazal, D. Cohen-Steiner, and Q. Merigot. Geometric inference for measures based on distance functions. Research Report 6930, INRIA, 2010. [7] D. Cohen-Steiner, H. Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete and Computational Geometry, 37:103–120, 2007. [8] R. Coifman and S. Lafon. Diffusion maps. Applied and Comput. Harm. Analysis, 21:5–30, 2006. [9] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data. part i: Diffusion maps. Proc. of Nat. Acad. Sci, 102:7426–7431, 2005. [10] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data. part ii: Multiscale methods. Proc. of Nat. Acad. Sci, 102:7432–7438, 2005. [11] V. de Silva and G. Carlsson. Topological estimation using witness complexes. Symposium on Point-Based Graphics, pages 157–166, 2004. [12] H. Edelsbrunner and J. Harer. Computational Topology: An Introduction. American Mathematical Society, 2010. [13] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002. [14] Taras Galkovskyi. Kapitoshka: Matlab wrappers for dionysus. http://www.math.duke.edu/ ˜elimgta/kapitoshka/. Improving Homology Estimates with Random Walks 16 [15] J. Kloke and G. Carlsson. Topological de-noising: Strengthening the topological signal. Preprint, 2011. [16] Y. Mileyko, S. Mukherjee, and J. Harer. Probability measures on the space of persistence diagrams. Preprint, 2011. [17] D. Morozov. Dionysus: C++ library for computing persistent homology. http://www.mrzv.org/ software/dionysus/. [18] J. Munkres. Elements of algebraic topology. Addison-Wesley, Redwood City, California, 1984.