Academia.eduAcademia.edu

Topological Data Analysis through alignment of Persistence Landscapes

2021

Persistence landscapes are functional summaries of persistence diagrams designed to enable analysis of the diagrams using tools from functional data analysis. They comprise a collection of scalar functions such that birth and death times of topological features in persistence diagrams map to extrema of functions and intervals where they are non-zero. As a consequence, topological information is encoded in both amplitude and phase components of persistence landscapes. Through functional data analysis of persistence landscapes under an elastic Riemannian metric, we show how meaningful statistical summaries of persistence landscapes (e.g., mean, dominant directions of variation) can be obtained by decoupling topological signal present in amplitude and phase variations. The estimated phase functions are tied to the resolution parameter that determines the filtration of simplicial complexes used to construct persistence diagrams. For a dataset obtained under scale and sampling variabilit...

arXiv:2106.15436v1 [stat.ME] 29 Jun 2021 Topological Data Analysis through alignment of Persistence Landscapes James Matuk Department of Statistics The Ohio State University Columbus, OH, USA matuk.3@osu.edu Sebastian Kurtek Department of Statistics The Ohio State University Columbus, OH, USA kurtek.1@stat.osu.edu Karthik Bharath School of Mathematical Sciences University of Nottingham Nottingham, UK Karthik.Bharath@nottingham.ac.uk Abstract Persistence landscapes are functional summaries of persistence diagrams designed to enable analysis of the diagrams using tools from functional data analysis. They comprise a collection of scalar functions such that birth and death times of topological features in persistence diagrams map to extrema of functions and intervals where they are non-zero. As a consequence, topological information is encoded in both amplitude and phase components of persistence landscapes. Through functional data analysis of persistence landscapes under an elastic Riemannian metric, we show how meaningful statistical summaries of persistence landscapes (e.g., mean, dominant directions of variation) can be obtained by decoupling topological signal present in amplitude and phase variations. The estimated phase functions are tied to the resolution parameter that determines the filtration of simplicial complexes used to construct persistence diagrams. For a dataset obtained under scale and sampling variabilities, the phase function prescribes an optimal rate of increase of the resolution parameter for enhancing the topological signal in a persistence diagram. We demonstrate benefits of alignment through several simulation examples and a real data example concerning structure of brain artery trees represented as 3D point clouds. 1 Introduction Persistent homology is a prominent tool within Topological Data Analysis (TDA) that provides a multiresolution view of the topological features of data. As a resolution parameter changes, so do the features of the data, and these changes are recorded in persistence diagrams or barcodes. Statistical analysis of samples of persistence diagrams are based on distances, such as the Wasserstein distance or bottleneck distance, which enable computation of descriptive statistics (Mileyko et al., 2011; Turner et al., 2014; Wasserman, 2018) and confidence regions (Fasy et al., 2014). Carrying out TDA directly on the space of persistence diagrams is difficult since they are multisets of planar points. This motivates using functional representations of diagrams, or functional summaries, that are more amenable for statistical analysis (Berry et al., 2020) using tools from functional data analysis (Ramsay and Silverman, 2005). In this paper, we consider the persistence landscape (Bubenik, 2015), but other functional summaries can also be used (silhouettes (Chazal et al., 2014); density estimates (Anirudh et al., 2016); rank functions (Robins and Turner, 2016); persistence entropy functions (Atienza et al., 2020); persistence intensity functions and images (Chen et al., 2015; Adams et al., 2017)). In general, there are two main sources of variation in a functional dataset: amplitude, which captures y-axis variation, and phase, which tracks variation in the relative timing of amplitude features, e.g., extrema. 1 Figure 1: Example of topological noise: Point clouds X1 , X2 , X3 (with different sampling) from topologically identical spaces (differing only in scale) lead to different persistence diagrams D1 , D2 , D3 and hence landscapes Λ1 , Λ2 , Λ3 . Our approach: Construct aligned landscapes Λa1 , Λa2 , Λa3 ; use alignment information to get transformed/denoised diagrams D1a , D2a , D3a ; use aligned diagrams/landscapes for further statistical analysis. The perils of not accounting for both sources of variation when computing summaries such as the mean or exploring dominant directions of variation via (functional) PCA are well-documented within the statistics literature (e.g., Marron et al., 2015; Srivastava and Klassen, 2016). Accordingly, there is a need to understand how topo-geometric information in persistence diagrams manifests in amplitude and phase components of the corresponding persistence landscapes, and if their alignment can mitigate the effects of topological noise in the observed data. As a first step, in this paper we examine topological noise due to scaling and sampling variabilities. Given a dataset X, the persistence landscape ΛX = {λi : I → R+ ; i = 1, . . . , K} corresponding to a degree-p persistence diagram consists of K triangular-like functions λi , defined on an interval I, that start and end at zero. We thus view ΛX as a parameterized closed curve in RK , and obtain curves ΛXi , i = 1, . . . , n as landscapes from n datasets. Our main contributions are as follows. 1. We establish an explicit link between the rate of increase of the resolution parameter t in a simplicial filtration that generates a persistence diagram and magnitude of phase variation present in the component functions λi , as captured through a reparameterization s 7→ γ(s) of the curve ΛX (s). Specifically, we show how γ is related to noise in persistence diagrams induced by (i) arbitrary (global) scaling of the data (Figure 2) and (ii) sampling variability of data (Figure 3). 2. We show how alignment of landscapes {ΛXi } by determining optimal reparameterizations {γi } leads to a mean (average) landscape that better preserves the structures of the sample of landscapes. A key consequence is the denoising of points in the corresponding persistence diagrams by transforming the points using {γi }; therefore, computing persistence diagrams for datasets {Xi } using simplicial filtrations with balls of transformed radii {t → γi (t)} enhances topological information in the persistence diagrams (Figure 4). 3. We demonstrate the benefits of carrying out this program on simulated examples and a real dataset consisting of brain artery trees, also studied in Bendich et al. (2016) (Figure 5). We summarize our approach through a simple example in Figure 1 with three point clouds X1 , X2 , X3 with degree p = 1 and K = 1-dimensional landscapes Λ1 , Λ2 , Λ3 . Topological noise is induced purely through scale (radii of circles) and sampling variabilities. Notice how transforming the diagrams {Di } using {γi } from alignment of {Λi } collapses the three points to a single one (denoising), as it should be since spaces from which {Xi } are sampled are topologically identical. What is the relationship between denoising a persistence diagram and aligning the corresponding persistence landscapes? While there are several factors that induce noise in a point cloud X = {x1 , . . . , xN }, two can be readily linked to the radius or resolution parameter t of balls {Bxi (t)} used in Čech and Rips filtrations. The first one concerns scaling of X uniformly, say by a factor greater than 1. This will ensure that a topological feature in X will be discovered at a larger radius t∗ > t; this will also result in a birth-death pair 2 (b, d) in the persistence diagram that has larger b and d coordinates when compared to the unscaled case, and thus shifts the location of the peak (of a component) of the persistence landscape to the right. This represents topological noise since the underlying topological signal is invariant to scaling. Matching peaks across persistence diagrams is achieved through their alignment, and the resulting phase functions γ can be used to transform the (b, d) pairs accordingly. The second factor is how densely or sparsely are the points xi sampled. This is related to the scaling issue, since sparse sampling relates to requiring balls with larger radius t to discover topological features, and hence result in a larger b coordinate. In both cases, alignment removes topological noise. To our knowledge, this is the first work in TDA literature to establish a concrete link between misalignment of persistence landscapes to noise in persistence diagrams. The only other related work we are aware of, although unrelated to use of persistence landscapes, is by Yoon and Ghrist (2020) who use a multiscale approach by allowing each ball Bxi (ti ) to have a possibly different radius ti . In a certain sense, our approach in determining an optimal rate of increase of t, given by γ(t), falls between the standard approach of fixing a t for each xi and having t change with xi . 2 2.1 Persistence diagrams, landscapes and scaling Persistent Homology through diagrams and landscapes Persistent homology is a tool that tracks homological features, such as connected components (degree-0), loops (degree-1), voids (degree-2), etc., of data at different resolutions (Wasserman, 2018). As an example of persistent homology, consider a point cloud X = {x1 , . . . , xN }, with each xi ∈ Rd . A ball of radius t can be drawn around each of the points, Bxi (t), and the union of the balls, ∪N i=1 Bxi (t), can be used to compute the homology. In practice, homology is computed from a simplicial complex, where the points in the point cloud are viewed as nodes in a graph and edges are drawn between them based on how the balls intersect. Two common simplicial complexes based on the balls are the Čech complex and Vietoris-Rips complex. The Čech complex consists of k-simplices whose nodes have k + 1 many balls with a non-empty intersection. The Vietoris-Rips complex is easier to compute and consists of k-simplices whose nodes have k + 1 many balls with a non-empty pairwise intersection. At each fixed radius t, the homology of the simplicial complexes is a snapshot of the features of the point cloud. However, considering all radii, t > 0, provides a multi-resolution view of the features of the data where features are born and die at different values of t. Persistent homology tracks the features with birth-death pairs, (bj , dj ), the times at which the j th feature was born and its corresponding death time. A persistence diagram is thus a multiset consisting of these points and represents a multi-resolution summary of the homology of data. For more general treatments of persistent homology and persistence diagrams, we refer the reader to Edelsbrunner et al. (2002) and Zomorodian and Carlsson (2005). Persistence landscapes are functional representations of persistent homology computed from persistence diagrams (Bubenik, 2015). For data X = {x1 , . . . , xN }, let Dp (X) denote its degree-p persistence diagram consisting of m birth-death pairs {(bj , dj )}m j=1 . The basic unit of persistence landscapes are triangular functions based on points in a persistence diagram, ℓpj (t) = (t − bj )I{bj ≤t≤ 12 (bj +dj )} + (dj − t)I{ 12 (bj +dj )≤t≤dj } . (1) For k ∈ N, the k th landscape function is defined as λpk (t) = k th max ℓpj (t), which is the k th maximum of the j=1,...,m triangular functions with λk (t) = 0 for all k > m by definition. Each function λpk thus begins and ends at zero. In practice, we truncate the number of landscape functions used for data analysis to the K many that have some positive values along their domain. The degree-p persistence landscape for the data X is defined as the collection of landscape functions ΛpX (t) = {λpk (t)}K k=1 , which we represent as a K-dimensional closed Euclidean curve (2) t 7→ ΛpX (t) := (λp1 (t), . . . , λpK (t)) ∈ RK . In this work, we will consider order p persistence landscapes of samples of datasets X1 , . . . , Xn , which we will denote simply by Λ1 (t), . . . , Λn (t), where context will clarify the degree p. 3 2.2 Effect of scaling and sampling variabilities Persistence diagrams are constructed with the Rips or the Čech simplicial filtrations based on distances between data points, which are sensitive to the scale and sampling of the data. To elaborate on intuition offered in Section 1, Figure 2 considers two datasets consisting of 10 equidistant points along circles with radii 1 (blue) and 0.5 (red), respectively. We consider degree p = 1 persistent homology (loops) with K = 1-dimensional landscapes when topological noise is entirely due to scale effects. (e) (a) (b) (c) (d) (f) Figure 2: Same topology with scale variability only: Construction of Rips filtration for two point clouds on circles with radii 0.5 (red) and 1 (blue) at resolutions (a) t = .1545, (b) t = .3090, (c) and t = .5, (d) t = 1. (e) Corresponding persistence diagrams and (f) landscapes. In Figure 2(a)-(d), balls of different radii t are drawn around the points in the point clouds. In (a), when t = .1545, a loop forms for the red point cloud, while there is no loop present for the blue point cloud. In (b), when t = .3090, a loop forms for the blue point cloud, and the loop persists for the red point cloud. In (c), when t = .5, the loop closes for the red point cloud, and persists for the blue point cloud. Finally, in (d), when t = 1, the loop closes for the blue point cloud. The loops can be summarized by the birth-death pairs (.1545, .5) and (.3090, 1) for the red and blue point clouds, respectively. Panel (e) shows them in the persistence diagram while panel (f) displays the corresponding misaligned persistence landscapes. In the persistence diagram, (b, d) coordinates of the red point are half of those for the blue, and this matches the ratio of the radii of the two circles; this implies that the persistence landscape for the red point cloud is shorter and shifted to the left by a commensurate amount as compared to the landscape for the blue point cloud. This scale-induced topological noise thus arises by a common scaling of the triangular functions ℓ1j for each j = 1, . . . , m given by ℓ1j (t) = (t − αbj )I{αbj ≤t≤ α2 (bj +dj )} + (αdj − t)I{ α2 (bj +dj )≤t≤αdj } , (3) where α = 2 (when blue point cloud is derived from the red). If α ∈ (0, 1), the triangular functions will be shifted to the left along the domain and will be shorter relative to α = 1. If α > 1, the functions will be shifted to the right and taller relative to α = 1 (as seen in Figure 2(f)). In this example, we see that when topological noise is due to scaling, alignment of peaks of the persistence landscapes will move points in a persistence diagram towards each other, and thus amplify the topological signal. In other words, such topological noise manifests entirely through phase variation in the landscapes. A similar phenomenon occurs when points in a point cloud have been sampled non-uniformly with sampling variability (see Section 4.2, Figure 4). 3 Elastic functional data analysis of persistence landscapes More generally, in contrast to the above example concerning scale, noise in a persistence diagram can arise from several other factors, including sampling variability and measurement error, and these are confounded in both amplitude and phase variations in the corresponding set of landscapes. However, transforming points in 4 a persistence diagram using the reparamterization functions obtained by aligning or registering the landscapes will continue to mitigate noise in the persistence diagram. Consider a sample of landscapes Λ1 , . . . , Λn from n datasets (for some fixed degree p homology). Without loss of generality, we can assume that they share a common domain [0, 1]: in practice, there always exists an 0 < s < ∞ such that Λi (t) = 0, ∀ t > s, i = 1, . . . , n; we can thus choose [0, s] as a common domain and rescale by 1/s. Under this representation, each landscape Λi is thus a K-dimensional parameterized closed curve [0, 1] ∋ t 7→ Λi (t) ∈ RK with Λi (0) = Λi (1). Alignment of {Λi } amounts to establishing correspondence between (K-dimensional) points in the images t 7→ Λi (t). This is achieved by estimating optimal orientationpreserving diffeomorphisms (reparameterizations) {γi : [0, 1] → [0, 1] : γ̇i > 0, γi (0) = 0, γi (1) = 1} (γ̇ is the derivative of γ) such that the collection {Λi (γi )} are ‘best’ aligned, where optimality is defined with respect to a metric-based matching functional on the landscapes. For each i, γi represents common phase variation in the component functions (λ1i , . . . , λKi ) of Λi . From the definition of a landscape, the parameter t corresponds to the resolution parameter in a simplicial filtration used to compute a persistence diagram; that is, we can map a value of t precisely to a birth-death (b, d) point on the diagram. As a consequence, rescaling the domain of landscapes {Λi } by s corresponds to i 2 considering scaled persistence diagrams {(bi,j /s, di,j /s)}n,m i=1,j=1 , so that the birth-death pairs are now in [0, 1] . Then, this correspondence implies that, for each i = 1, . . . , n, the (group) action (Λi , γi ) → Λi (γi ) induces a transformation {(bij , dij )} → {(γi−1 (bij ), γi−1 (dij ))}. Since the aligned {Λi (γi )} are such that the extrema (mainly peaks) of the component functions (λi1 (γi ), . . . , λiK (γi )) line up, the points {(γi−1 (bij ), γi−1 (dij ))} will tend to cluster, the number of which will depend on the topology of the underlying manifolds (see Figures 3 and 4 below, and Figures 8 and 10 in the Supplementary Material (Appendix B.2)). Another way to summarize the above discussion is as follows: if Čech or Rips filtration for dataset Xi is constructed with balls of radius γi (t), the corresponding persistence diagram will be ‘denoised’. The program described above rests on determining the optimal reparameterizations {γi } from observed landscapes {Λi }. In principle, any registration or alignment procedure for curves in RK can be used. Our choice is the algorithm based on the highly successful Elastic Functional Data Analysis (EFDA) framework based on a Riemannian-geometric approach, described in detail in the book by Srivastava and Klassen (2016). We refer the reader to the book and Supplementary Material (Appendix A) for more details, but provide a terse summary below. The framework is characterized by the square-root velocity function (SRVF) representation/transform Λ(t) 7→ Q(Λ(t)) = q(t) := Λ̇(t)(|Λ̇(t)|)−1/2 , where Λ̇(t) is the component-wise derivative and | · | is the Euclidean norm on RK . The map is invertible with Rt Λ(t) = 0 q(u)|q(u)|du. The motivation for the SRVF representation comes from intractability of computing the distance between, say, Λ1 and Λ2 with respect to the complicated elastic metric, which possesses the desirable property of being invariant to simultaneous reparameterizations Λ1 (γ) and Λ2 (γ); this property is crucial for aligning curves based on reparameterizations. The R 1 elastic distance under the SRVF representation, however, reduces to the usual L2 distance kq1 − q2 k2 = [ 0 |q1 (t) − q2 (t)|2 dt]1/2 , where q1 , q2 are the SRVFs of Λ1 , Λ2 (Section 10.3 in Srivastava and Klassen (2016)). A template is needed in order to obtain the {γi }. In an iterative algorithm, the Karcher mean, µ̂, is used as a template, and optimal {γi } are simultaneously determined with respect to the elastic metric, by using the SRVF representation (Section 10.4.4 in Srivastava and Klassen (2016)). The mean persistence landscape µ̂ will thus be computed using the aligned landscapes {Λi (γi )}, and will be closer in shape to a persistence landscape than one computed without alignment (see, e.g., Figure 3(g)). Once a mean µ̂ on the aligned space is available, it is possible to carry out functional principal component analysis (FPCA) on the space of aligned landscapes by diagonalizing the (empirical) covariance operator to obtain dominant directions of amplitude variation (along godesics from µ̂ with respect to the elastic metric). Then, projections of aligned landscapes {Λi (γi )} along a small number of PC directions result in an efficient finite-dimensional representation of {Λi (γi )}. 4 Numerical examples In this section we present: 5 1. Two simulation examples which demonstrate: (i) denoising of persistence diagrams, obtained under scale and sampling variabilities in the data, through alignment of landscapes; (ii) benefits of computing mean landscape and PC directions on the set of aligned landscapes as opposed to computing a pointwise mean with unaligned ones, as currently done in practice. Versions of both examples when data is sampled with noise, and additional examples on torii and spirals, are available in Appendices B.1 and B.2 respectively of the Supplementary Material. 2. A real data example on 3D brain artery trees which: (i) through alignment of persistence landscapes for male and female groups, uncovers how apparent difference in the unaligned mean landscapes of the two groups can be attributed to a difference in scale; (ii) confirms this finding by comparing the distributions of the total artery lengths of males and females. These substantially add to the findings in Bendich et al. (2016). For the simulated examples, we use the ripsDiag function to compute persistence diagrams using the Vietoris-Rips simplicial complex for point clouds, and the landscape function to compute landscapes from persistence diagrams; both functions are part of the TDA R package (Fasy et al., 2015). In the EFDA framework, registration, mean estimation and PCA for a sample of landscapes are implemented in MATLAB; the entire procedure requires approximately 1 minute on a standard laptop for each example. We have included additional simulated examples in the Supplementary Material (Appendix B.2). Data and MATLAB code to reproduce the simulated examples are also included in the Supplementary Material (the directory entitled ‘code and data’). 4.1 Simulation 1: Mean from aligned landscapes We consider 20 point clouds, where each point cloud is generated by: (i) sampling M from a Discrete-Uniform(10, 30); (ii) sampling r from |N (1, 0.32 )|; (iii) generating M points uniformly on a circle with radius r. Figure 1 shows three (from 20) point clouds along with the corresponding degree p = 1, K = 1-dimensional landscapes. Figure 3(a) shows all 20 landscapes {Λi }20 i=1 . The amplitude and phase variations in the landscapes are related to the variability in the radii, sample size and dispersion. Panels (b)-(f) demonstrate the benefit of alignment of landscapes {Λi }: a visually better mean (c) is obtained by using the aligned landscapes {Λi (γi )} (b) using reparameterizations {γi } (e); transforming points in the persistence diagram (d) using {γi } results in denoising (f) by collapsing all points to a single one, since the topology of each of the 20 point clouds is the same. (a) (b) (c) (d) (e) (f) Figure 3: Same topology with both scale and sampling variabilities: (a) Persistence landscapes {Λi }20 i=1 of 20 point clouds of the type in Figure 1. (b) Aligned persistence landscapes {Λi (γi )}20 i=1 . (c) Mean landscape after (blue) and without (red) alignment. (d) Noisy (rescaled) persistence diagram {(bij , dij )}20 i=1 from 20 point clouds. (e) Estimated phase functions {γi }20 i=1 . (f) Denoised/transformed persistence diagram {(γi−1 (bij ), γi−1 (dij ))}20 i=1 . 4.2 Simulation 2: Principal component analysis on aligned landscapes We consider a more involved setting involving 20 point clouds from two topologically different spaces: (i) one circle and (ii) two connected circles. Point clouds from (i) are drawn in the same manner as in Simulation 1, but for the fact that sample size M is drawn from a Discrete-Uniform(20, 30). For point clouds from (ii), the radius of the larger circle is drawn from a |N (1, 0.32 )|, while the radius of the smaller circle is a random proportion of the larger circle, drawn from a Beta(10, 10). Panels (a) and (c) of Figure 4 show one point cloud each from (i) and (ii). 6 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) Figure 4: Different topology with both scale and sampling variabilities: (a) & (c) Two examples, from 20, of randomly generated point clouds from topologically different spaces (blue and red, respectively, in all relevant panels). (b) & (d) Corresponding degree p = 1, K = 2-dimensional persistence landscapes. (e)-(g) -1, 0, 1, standard deviation from the mean landscape in the first PC direction, and (h) projection of landscapes onto the first two PC directions: following alignment (top) and without alignment (bottom). (i) Noisy and (k) denoised/transformed persistence diagrams. (j) Estimated reparameterizations. We consider degree p = 1 homology and K = 2-dimensional persistence landscapes {Λi = (λi1 , λi2 )}20 i=1 . For point clouds from (i), λi1 will have one peak and λi2 = 0; on the other hand, for point clouds from (ii), λi1 will have two peaks and λi2 will have a single peak. Top and bottom rows of panels (e)-(g) in Figure 4 show mean landscapes and PC directions following alignment and without alignment, respectively. Panel (h) clearly highlights the benefits of alignment of landscapes through better separation of the two settings, (i) and (ii), when projected along two PC directions. Specifically, in the top row, when PCA is carried out on aligned landscapes, all of the point clouds that have two loops have a negative first PC score, while all of the point clouds with only one loop have a positive first PC score. There is no such clear separation of the two groups when PCA is performed on unaligned landscapes, as seen in the bottom row of panel (h). Following results from Simulation 1, we expect to see two clear clusters in the denoised persistence diagram, corresponding to two distinct topological features, using reparamterizations {γi }, shown in panel (j), and this is indeed the case as seen in panel (k). This is explained as follows: points concentrated around (b, d) ≈ (0.25, 0.6) correspond to the single circle in the blue point clouds and the large circle in the red point clouds. This is consistent with the data generating process where the large circles across the two groups correspond to each other. The points associated with the second feature for the red point clouds are concentrated around (b, d) ≈ (0.1, 0.25) and correspond to the additional significant homological feature (smaller circle) that generally has smaller persistence than the larger feature (larger circle). It is very difficult to discern such topological information from the noisy diagram in panel (i). 7 (d) (a) (b) (c) (e) Figure 5: Left panel : Mean persistence landscapes and their differences for males and females in the brain artery example: (a) female mean, (b) male mean, (c) difference between (a) and (b) with alignment (top row) and without, i.e. pointwise (bottom row). Right panel : Relationship between groupwise total artery length and relative phase of groupwise means following alignment, across sexes. (d) Estimated groupwise CDFs of total artery length, and (e) reparameterizations that align groupwise aligned means to a mean computed from alignment of the pooled data. The identity parameterization is shown in black for reference. Additional results showing the persistence diagrams and reparameterizations are available in Appendix C of the Supplementary Material. 4.3 Real data example: 3D brain artery trees We now demonstrate the utility of the proposed approach on real data from 3D point clouds representing brain artery trees discussed in Bendich et al. (2016). For a description of the experiment and data generation see Bullitt et al. (2005). Information regarding human subjects in the experiment is available at http: //insight-journal.org/midas/community/view/21. As per the terms of use listed on the above URL, we do not have explicit permission to share these data. We are thus unable to include the code and data used in this section in the Supplementary Material. Bendich et al. (2016) computed persistence diagrams from trees which represented arteries in 98 healthy 98,100 human subjects, and used persistence diagrams to compute the differences {di,j − bi,j }i=1,j=1 for subjects based on the 100 largest differences sorted from largest to smallest. Restricting focus to the largest differences serves as a denoising step, since points close to the line b = d in a persistence diagram can be thought of as topological noise (Fasy et al., 2014). This dataset is apt to demonstrate our approach for two reasons, one a priori and the other a posteriori : (i) since each point cloud contains a large number (order of 105 ) of points, in order to be able to compute the diagrams, Bendich et al. (2016) subsampled 3000 points from each point cloud, thus creating large sampling variability; (ii) we uncover a significant scale effect between two groups of subjects. We note that this finding is purely exploratory and serves merely as a proof of concept for the proposed methodology. The starting point for our analysis are the persistence diagrams available at https://marron.web.unc. edu/brain-artery-tree-data/, and not the original 3D point clouds. From these, we compute landscapes from degree p = 1 homology and K = 100-dimensional persistence landscapes. Information on the sex of each subject is also available along with the tree data. We investigate differences between mean persistence landscapes, computed from degree p = 1 homology persistence diagrams, grouped by sex. One major finding of Bendich et al. (2016) is the existence of sex differences in their mean degree p = 1 feature vectors {bij − dij }. For convenience, we denote the mean landscape for the male (female) group f following alignment (within each group) by µ̂m a (µ̂a ), and pointwise mean computed without alignment for the males (females) by µ̂m (µ̂f ). f The means µ̂m a and µ̂a are shown in the top of Figure 5(a)-(b), respectively. The difference between the means is obtained by first aligning the group means to the common pooled mean and then taking their difference, where all operations are carried out under the SRVF representation; this difference is shown in the 8 top row of panel (c). The bottom row of Figure 5(a)-(c) shows the pointwise means µ̂m and µ̂f , and the corresponding (µ̂m − µ̂f ), when no alignment is carried out. The difference between the pointwise means has very large features. However, these features are essentially non-existent in the difference of the aligned means. This indicates that the large difference in the pointwise means is potentially due to misalignment, and can be construed as topological noise. We surmise that global scale differences and sampling variability in observed data between the male and female groups may be responsible for this phenomenon. To confirm this, we use the total artery length for each subject as a measure of global scale, which is also available as part of the tree data (Bullitt et al., 2005). For each group, we estimated a cumulative distribution function (CDF) of total artery length using a kernel density estimate using the ksdensity function in MATLAB (default bandwidth). From the estimated CDFs shown in Figure 5(d), it appears that total artery length is stochastically ordered by sex, with females having stochastically longer brain artery trees 1 . Given this global scale disparity and sampling variability, we would expect µ̂fa to be shifted to the right m f relative to µ̂m a . This behavior can be extracted from the phase difference between µ̂a and µ̂a to the mean computed following alignment of the pooled sample, as shown in Figure 5(e). The blue reparameterization shifts µ̂fa to the left while the red shifts µ̂m a to the right. The misalignment caused by differences in global scale and sampling variabilities between the two groups appear to explain the reason behind the large difference between the groupwise pointwise means. In summary, the above analysis suggests that the sex effect detected via pointwise analysis without alignment of the landscapes is due to global scale differences of the observed data rather than differences in homology, and thus makes a compelling case study of the perils in ignoring the distinction between, and thereby confounding of, amplitude and phase variation in landscapes. 5 Discussion The novel approach presented in this paper can be viewed as a first step toward understanding how geometry of the manifold on which point clouds are sampled influences TDA. To see this, suppose e : M ֒→ RD is an equivariant embedding of a d-dimensional manifold M into RD , D ≥ d. Then, a diffeomorphism φ : RD → RD acts on the embedding as φ ◦ e(M ). The map φ does not change the topology of M , but constructing simplicial filtrations for point clouds under the embedding in RD using balls will transform according to φ since the metric is accordingly transformed; that is, for a fixed x ∈ e(M ), {y ∈ e(M ) : kx − ykRD < t} will transform to φ(x) ∈ φ ◦ e(M ), {φ(y) ∈ φ ◦ e(M ) : kφ(x) − φ(y)kRD < t}. In the special case of the setting considered in the paper, where φ corresponds to a (constant) scale change, the radius t changes nonlinearly as t 7→ γ(t), for a reparameterization γ, since t is forced to lie within [0, 1]. This phenomenon also relates to when points are sampled with variability on M , since by judiciously changing the metric depending on the locations of points, balls of different (or differently changing) radii can be used to construct the simplicial filtration, not dissimilar to the multiscale approach considered by Yoon and Ghrist (2020). Much remains to be done in this direction. The limitations of this work inspire directions for future work. As noted in Section 3, there are many factors that influence amplitude and phase variability in landscapes, including signal-to-noise ratio. Notwithstanding the promising results for the noisy setting on the circle (see Supplementary Material, Appendix B.1), robustness of the alignment-based approach to observation noise will strongly depend on the geometry of the manifold M and magnitude of noise in observed data on RD , especially if data have been sampled from a distribution with support only on M . One possible approach would constitute of first estimating M (and its dimension) using a manifold fitting method, and using this information to construct tailored simplicial filtrations; however, additional noise induced by the fitting procedure would have to be accounted for in downstream tasks. Another option for the large noise setting is to use explicit statistical models to align persistence landscapes that account for all sources of uncertainty. For example, Bayesian models based on shape-constraints to infer the pattern and number of extrema in landscapes may be profitably used (Matuk et al., 2021). While the focus of this paper is on persistence landscapes (largely due to space constraints), we expect our approach to be fruitful with silhouettes (Chazal et al., 2014), since similar triangular functions are used 1 We emphasize that this is purely an exploratory/descriptive finding without statistical significance. We do not attribute associative or causative meaning to the finding and therefore none should be construed. 9 in their definition. However, feasibility of the alignment method for other functional summaries mentioned in Section 1 is not clear, and is worthy of further investigation. Finally, for denoising a persistence diagram directly without using landscapes, it is possible to consider generalizations of the one-dimensional transforms γ of points on diagrams to the group of diffeomorphisms of R2 , along the lines of what is done in the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework (Grenander and Miller, 2007). Acknowledgements The MR brain images from healthy volunteers used in this paper were collected and made available by the CASILab at The University of North Carolina at Chapel Hill and were distributed by the MIDAS Data Server at Kitware, Inc. This research was partially funded by NSF DMS-1613054, NIH R37- CA214955 (to SK and KB), EPSRC EP/V048104/1, and NSF DMS-2015374 (to KB), and NSF CCF1740761, NSF DMS-2015226, and NSF CCF-1839252 (to SK). References Adams, H., T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research 18 (8), 1–35. Anirudh, R., V. Venkataraman, K. Ramamurthy, and P. Turaga (2016). A Riemannian framework for statistical analysis of topological persistence diagrams. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1023–1031. Atienza, N., R. Gonzalez-Díaz, and M. Soriano-Trigueros (2020). On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recognition 107, 107509. Bendich, P., J. S. Marron, E. Miller, A. Pieloch, and S. Skwerer (2016). Persistent homology analysis of brain artery trees. The Annals of Applied Statistics 10 (1), 198 – 218. Berry, E., Y.-C. Chen, J. Cisewski, and B. Fasy (2020). Functional summaries of persistence diagrams. Journal of Applied and Computational Topology 4, 211–262. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research 16 (3), 77–102. Bullitt, E., K. Muller, I. Jung, W. Lin, and S. Aylward (2005). Analyzing attributes of vessel populations. Medical Image Analysis 9, 39–49. Bullitt, E., D. Zeng, G. Gerig, S. Aylward, S. Joshi, J. Smith, W. Lin, and M. Ewend (2005). Vessel tortuosity and brain tumor malignancy. Academic Radiology 12, 1232–40. Chazal, F., B. T. Fasy, F. Lecci, A. Rinaldo, and L. Wasserman (2014). Stochastic convergence of persistence landscapes and silhouettes. In ACM Symposium on Computational Geometry, pp. 474–483. ACM. Chen, Y.-C., D. Wang, A. Rinaldo, and L. Wasserman (2015). Statistical analysis of persistence intensity functions. arXiv:1510.02502v1 . Edelsbrunner, H., D. Letscher, and A. Zomorodian (2002). Topological persistence and simplification. Discrete & Computational Geometry 28 (4), 511–533. Fasy, B. T., J. Kim, F. Lecci, and C. Maria (2015). Introduction to the R package TDA. arXiv:1411.1830 . Fasy, B. T., F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, and A. Singh (2014). Confidence sets for persistence diagrams. The Annals of Statistics 42 (6), 2301 – 2339. 10 Grenander, U. and M. Miller (2007). Pattern Theory: From Representation to Inference. Oxford University Press. Marron, J. S., J. O. Ramsay, L. M. Sangalli, and A. Srivastava (2015). Functional data analysis of amplitude and phase variation. Statistical Science 30 (4), 468–484. Matuk, J., K. Bharath, O. Chkrebtii, and S. Kurtek (2021). Bayesian framework for simultaneous registration and estimation of noisy, sparse, and fragmented functional data. Journal of the American Statistical Association, In Press. Mileyko, Y., S. Mukherjee, and J. Harer (2011). Probability measures on the space of persistence diagrams. Inverse Problems 27 (12), 124007. Ramsay, J. O. and B. W. Silverman (2005). Functional Data Analysis. Springer. Robins, V. and K. Turner (2016). Principal component analysis of persistent homology rank functions with case studies of spatial point patterns, sphere packing and colloids. Physica D: Nonlinear Phenomena 334, 99–117. Srivastava, A. and E. Klassen (2016). Functional and Shape Data Analysis. Springer. Srivastava, A., W. Wu, S. Kurtek, E. Klassen, and J. S. Marron (2011). Registration of functional data using Fisher-Rao metric. arXiv 1103.3817 . Tucker, J. D., W. Wu, and A. Srivastava (2013). Generative models for functional data using phase and amplitude separation. Computational Statistics & Data Analysis 61, 50–66. Turner, K., Y. Mileyko, S. Mukherjee, and J. Harer (2014). Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry 52 (1), 44–70. Wasserman, L. (2018). Topological data analysis. Annual Review of Statistics and Its Application 5 (1), 501–532. Yoon, H. R. and R. Ghrist (2020). Persistence by parts: Multiscale feature detection via distributed persistent homology. arXiv: 2001.01623 . Zomorodian, A. and G. Carlsson (2005). Computing persistent homology. Discrete and Computational Geometry 33, 249–274. 11 Appendix This appendix contains three sections for the paper entitled Topological Data Analysis through alignment of Persistence Landscapes. In Appendix A, we provide additional background material for statistical analysis of curves via the Elastic Functional Data Analysis paradigm. In Appendix B, we discuss additional simulated examples. In Appendix C, we report additional results for the brain artery tree data considered in Section 4.3 in the main article. A Brief review of Elastic Functional Data Analysis In this section, we provide several key definitions that enable alignment and statistical analysis of curves under the Elastic Functional Data Analysis paradigm, and refer to Srivastava and Klassen (2016) for more details. Let Λ : [0, 1] → RK ∈ F, where F is the space of all absolutely continuous curves in RK , denote a K-dimensional curve, e.g., a persistence landscape. Further, define a mapping (Q : F → Q) of a curve Λ ∈ F to its corresponding SRVF as Q ∋ Q(Λ) = q = Λ̇(|Λ̇|)−1/2 , where Λ̇ is the derivative of Λ and | · | is the Euclidean norm in RK . Our definition of amplitude and subsequent statistical analysis approach are analogous to the definitions presented in Srivastava et al. (2011) for univariate functions. The amplitude of a curve Λ is an equivalence class defined through its SRVF q as follows: [q] := {(q, γ) | γ ∈ Γ}, (4) damp (Λ1 , Λ2 ) = d([q1 ], [q2 ]) = min kq1 − (q2 , γ)k2 , (5) √ where (q, γ) = Q(Λ(γ)) = (q(γ)) γ̇ (recall that Γ is the set of orientation-preserving diffeomorphisms or reparameterizations of [0, 1]). Under this definition, two curves, Λ1 , Λ2 , have the same amplitude if their corresponding SRVFs are in the same equivalence class, i.e., there exists a γ ∈ Γ such that q1 = (q2 , γ). The set of all equivalence classes forms a partition of Q and is the quotient space Q/Γ. Hence, Q/Γ defines the amplitude space. The amplitude distance between two curves Λ1 , Λ2 ∈ F is defined as the distance between their corresponding SRVF equivalence classes [q1 ], [q2 ] ∈ Q/Γ: γ∈Γ where q1 ∈ [q1 ] and q2 ∈ [q2 ]. Here, the invariance of the L2 metric, under the SRVF representation, to simultaneous reparameterization of curves is key to the definition of the distance. The argmin of the right hand side of Equation 5 gives the optimal reparameterization of Λ2 to register or align it to Λ1 . For a sample of curves Λ1 , . . . , Λn , with corresponding SRVFs q1 , . . . , qn , the amplitude mean is defined as the quantity that minimizes the sum of squared amplitude distances: [µ̂q ] = argmin n X min kq − (qi , γ)k22 . (6) [q]∈Q/Γ i=1 γ∈Γ An element of [µ̂q ] is found by iteratively aligning q1 , . . . , qn to the current estimate of the mean and averaging the aligned SRVFs to produce a new mean estimate; this is repeated until convergence. For identifiability, we use the center of the orbit of [µ̂q ] as the representative element of the elastic mean; henceforth, we simply refer to this element of the mean orbit as µ̂q . For additional algorithmic details and the orbit centering step, we refer the interested reader to Srivastava et al. (2011). The corresponding mean curve µ̂ ∈ F is defined Rt through the inverse SRVF mapping: µ̂(t) = Q−1 (µ̂q )(t) = 0 µ̂q (s)|µ̂q (s)|ds. The joint alignment of Λ1 , . . . , Λn can be achieved via pairwise alignment of each Λi , i = 1, . . . , n to the mean µ̂ via Equation 5. The resulting optimal reparameterizations, γi = argminkµ̂q − (qi , γ)k2 , i = 1, . . . , n, γ∈Γ can be used to study phase variability. Methods for statistical analysis of reparameterization functions are detailed in Tucker et al. (2013) and omitted here for brevity. The aligned curves Λi (γi ), i = 1, . . . , n, or equivalently their SRVFs, (qi , γi ), i = 1, . . . , n, describe amplitude variability in the sample. Based on the aligned curves, a sample amplitude covariance function can be defined as 1 X ((qi , γi )(t) − µ̂q (t))((qi , γi )(u) − µ̂q (u))⊤ . n − 1 i=1 n C\ q (t, u) := 12 (7) Amplitude principal components analysis is carried out via eigendecomposition of C\ q (t, u), C\ q (t, u) = ∞ X τ̂b φ̂b (t)φ̂b (u)⊤ , (8) b=1 where φ̂b , b ∈ N are the primary directions of amplitude variability (amplitude principal components) and τ̂b , b ∈ N are variances in the corresponding directions. Typically, one selects a finite number, B, of principal components that describe a large portion of amplitude variability. The aligned SRVFs can then be projected onto the B directions of amplitude variability with largest variance: Z 1 h(qi , γ)(t) − µ̂q (t), φ̂b (t)idt b = 1, . . . , B, i = 1, . . . , n, (9) βi,b := 0 where h·, ·i is the Euclidean inner product in RK . The PC scores, βi = βi,1 , . . . βi,B , i = 1, . . . , n, serve as a low-dimensional Euclidean representation of the amplitude of curves. To visualize the primary directions of √ amplitude variability, we compute F ∋ Q−1 (µ̂q + ν τ̂b φ̂b ), which is a curve that is ν standard deviations from µ̂q in the direction of φ̂b (see Figure 4 in the main article and Figure 7 in Appendix B.1 for examples). B Additional numerical examples In this section, we (1) study the potential impacts of observation noise (away from the manifold of interest) through simulation examples in Appendix B.1, and (2) provide three additional mean estimation examples for point clouds sampled from different topological spaces in Appendix B.2. B.1 Simulations 1 and 2 with added noise We repeat Simulation 1, as discussed in Section 4.1 in the main article, with additive pointwise noise, i.e., for each point cloud, we independently generate additive noise from a zero-mean bivariate Gaussian distribution with covariance r(0.1)2 I2 , where r is the radius of the circle that underlies the “noiseless" point cloud. Two examples of point clouds and their corresponding degree p = 1 K = 1-dimensional landscape functions are shown in panels (a)&(c) and (b)&(d) of Figure 6, respectively. The degree p = 1 K = 1-dimensional landscape functions for all 20 point clouds are shown in panel (e). Most of the landscapes generated from the noisy point clouds are similar in shape to those displayed in Figure 3 in the main article. However, sometimes there is an extremely small peak, corresponding to noise in some of the landscape functions. This is most visually apparent in panel (f), where there is an extremely small peak at t ≈ 0.25 prior to the major peak in each landscape. Due to the small magnitude of this noise-induced feature, the addition of additive noise appears to have little effect on landscape alignment and mean computation. The mean based on aligned landscapes, visualized in panel (g), is consistent with that of a circle. As highlighted in the main article, there is considerable variance reduction in the denoised/transformed persistence diagram in panel (j), via reparameterizations shown in panel (i). Importantly, points corresponding to the main feature of the data are collapsed to a single point, while points near the diagonal, corresponding to features created by the additive noise, remain near the diagonal. Such clarity is absent in the noisy persistence diagram in panel (h). Next, we also repeat Simulation 2, discussed in Section 4.2 in the main article, with additive noise. For point clouds in the blue group (single circle), we independently generate additive noise from a zero-mean bivariate Gaussian distribution with covariance r(0.1)2 I2 , where r is the radius of the circle that underlies the “noiseless" point cloud. We repeat this procedure for point clouds in the red group, but noise is generated such that the covariance depends on the radius of one of the two circles that it belongs to. A single example of a point cloud in the blue and red group are shown in Figure 7(a)&(c); the corresponding degree p = 1 K = 2-dimensional landscapes are shown in panels (b)&(d). The shapes of the landscapes in each group are similar to those displayed in Figure 4 in the main article. However, here, we notice some differences in PCA carried out on aligned landscapes. The first direction of variability, viewed in the top row of panels (e)-(g) appears to be associated with scale variability in the point cloud data. On the other hand, as confirmed in the top row of panel (h), the second direction of variability appears to be associated with the homology of the point clouds, i.e., most of the red point clouds, generated from two connected circles, have a positive second 13 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 6: Same topology with scale and sampling variabilities as well as additive noise: (a)&(c): Two examples, from 20, of randomly generated point clouds on circles with randomly chosen radii, random sample sizes and additive noise. (b) & (d): Corresponding persistence landscapes. (e) Persistence landscapes {Λi }20 i=1 of 20 point clouds. (f) Aligned persistence landscapes {Λi (γi )}20 . (g) Mean landscape after (blue) and without i=1 (red) alignment. (h) (Rescaled) Noisy persistence diagram {(bij , dij )}20 from 20 point clouds. (i) Estimated i=1 −1 −1 20 reparameterizations {γi }20 i=1 . (j) Denoised/transformed persistence diagram {(γi (bij ), γi (dij ))}i=1 . PC score, while most of the blue point clouds, generated from a single circle, have a negative second PC score. Since noise can distort topological features, there does appear to be some overlap between the two classes based on the first two PC scores. This observation is in contrast to that in Section 4.2 in the main article, where the two classes are clearly separated based on the first PC score alone. The corresponding PC scores computed based on unaligned landscapes, shown in the bottom of panel (h), provide no such distinction between the two classes. A comparison of the denoised/transformed persistence diagram presented in Figure 7(k) to its noisy counterpart in panel (h) shows the benefits of our approach. While the clustering of features in panel (k) is not as clear as in Figure 4 in the main article, due to the additive noise, one can still extract useful homological information from the denoised/transformed persistence diagram. On the other hand, this is not possible based on the noisy diagram in panel (i). B.2 Additional mean estimation examples In Figure 8, we consider mean estimation based on degree p = 1 K = 2-dimensional persistence landscapes computed from 20 point clouds that consist of uniformly sampled points along two circles with different radii. The point clouds in this example are generated in the same exact way as the data in the red group in Section 4.2 in the main article. Panels (a) and (c) show two examples of randomly generated point clouds with their corresponding degree p = 1 K = 2-dimensional landscapes in panels (b) and (d). Panels (e)-(g) show the K = 1 (top) and K = 2 landscape functions, their alignment, and a comparison of the mean before (red) and with (blue) alignment, respectively. The proposed alignment procedure results in a mean landscape that better preserves the major features along both landscape components. On the other hand, the unaligned mean landscape destroys the prominent two peak structure in the first component and results. In panel (j), the points in the denoised/transformed persistence diagram, corresponding to the two loops in the point clouds, form two separate clusters making the presence of these features in the data much clearer; the noisy 14 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) Figure 7: Different topology with scale and sampling variabilities as well as additive noise: (a) & (c) Two examples, from 20, of randomly generated point clouds from topological different spaces (blue and red, respectively, in all relevant panels). (b) & (d) Corresponding degree p = 1, K = 2-dimensional persistence landscapes. (e)-(g) -1, 0, 1, standard deviation from the mean landscape in the first PC direction, and (h) projection of landscapes onto the first two PC direction: following alignment (top) and without alignment (bottom). Noisy (i) and denoised/transformed (k) persistence diagrams. (j) Estimated reparameterizations. diagram shown in panel (h) does not provide such a distinction. In Figure 9, we consider mean estimation based on degree p = 0 K = 1-dimensional persistence landscapes computed from 20 point clouds that consist of 2000 points uniformly sampled along two interwoven spirals. The tightness of the spirals is random, so that the spirals complete Uniform(2, 5) revolutions. Panels (a) and (c) show two examples of point clouds generated in such a manner with their corresponding degree p = 0 K = 1-dimensional landscapes in panels (b) and (d). The tighter spirals in panel (a) have points closer together, and the resulting landscape is smaller and shifted to the left as compared to the spirals in panel (c). When computing landscape functions, we only considered the point in persistence diagrams that corresponded to the death time that coincided with the intersection of the two spirals present in each point cloud. Panels (e)-(g) show landscapes for all 20 point clouds, their alignment, and a comparison of the mean before (red) and with (blue) alignment. The mean based on aligned landscapes appears to have sharper features that are consistent with the observed landscapes. Based on the denoised/transformed persistence diagram in panel (j), in contrast to the noisy persistence diagram in panel (h), it is evident that warping completely accounts for the scale variability associated with the tightness of the spirals, i.e., all points collapsed to a single point in the denoised/transformed persistence diagram. In Figure 10, we consider mean estimation based on degree p = 1 K = 2-dimensional persistence landscapes computed from 20 point clouds that consist of 1000 points sampled uniformly on a ringed torus. The major radius of each torus is sampled from a |N (2, .32 )|, while the the minor radius is a proportion Beta(10, 10) of the major radius. Two example point clouds are shown in panels (a) and (c). We preprocess the persistence diagrams used to compute the degree p = 1 K = 2-dimensional persistence landscapes by disregarding all 15 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 8: Same topology with both scale and sampling variabilities: (a)&(c) Two examples, from 20, of randomly generated point clouds. (b)&(d) corresponding persistence landscapes. (e) Persistence landscapes {Λi }20 i=1 of 20 point clouds. (f) Aligned persistence landscapes {Λi (γi )}20 . (g) Mean landscape after (blue) and without i=1 (red) alignment. (h) (Rescaled) Noisy persistence diagram {(bij , dij )}20 i=1 from 20 point clouds. (i) Estimated . (j) Denoised/transformed persistence diagram {(γi−1 (bij ), γi−1 (dij ))}20 reparameterizations {γi }20 i=1 . i=1 points in the persistence diagram except for the two points with longest persistence, as these points correspond to the two loops formed by the torii that underlie the point clouds. The landscapes corresponding to point clouds in (a) and (c) are shown in (b) and (d), respectively. Clearly, the estimated landscapes can vary widely depending on the relationship between the major and minor radii. Panels (e)-(g) show the first (top) and second (bottom) landscape functions, their alignment, and a comparison of the mean without (red) and after (blue) alignment. The first landscape function is automatically weighted higher during alignment due to the relatively large magnitude of the peak as compared to the second landscape function. In panel (j), the points in the denoised/transformed persistence diagram, using the reparameterizations shown in (i), concentrate to make the presence of the features more clear; the two detected features describe the two loops. In comparison, the noisy diagram shown in panel (h) does not provide a clear distinction. 16 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 9: Same topology with scale variability: (a)&(c) Two examples, from 20, of randomly generated point clouds. (b)&(d) Corresponding persistence landscapes. (e) Persistence landscapes {Λi }20 i=1 of 20 point clouds. (f) Aligned persistence landscapes {Λi (γi )}20 i=1 . (g) Mean landscape after (blue) and without (red) alignment. (h) (Rescaled) Noisy persistence diagram {(bij , dij )}20 i=1 from 20 point clouds. (i) Estimated −1 −1 20 reparameterizations {γi }20 i=1 . (j) Denoised/transformed persistence diagram {(γi (bij ), γi (dij ))}i=1 . C Persistence diagrams for brain artery trees In this section, we display three randomly chosen examples of persistence diagrams, for females and males, that were used to generate persistence landscapes used in the real data brain artery tree example presented in Section 4.3 in the main article. We also show the estimated reparameterizations and transformed/denoised diagrams. We estimated three different sets of reparameterizations during the alignment process: {γi,F }47 i=1 and{γi,M }49 i=1 were used to align subjects within sex to group specific means, and γF̄ , γM̄ were used to align the group specific means to a mean estimated from pooled data. In order to compare reparameterizations for subjects in different groups, we further computed the alignment of each subject to the pooled mean as 49 ∗ ∗ {γi,F = γi,F ◦ γF̄ }47 i=1 for females and {γi,M = γi,M ◦ γM̄ }i=1 for males. The corresponding results, for three randomly selected females and three randomly selected males, are shown in Figures 11 and 12, respectively; each row corresponds to a different subject. In panels (a)-(c) of each figure, we show the noisy persistence diagram, the estimated reparameterization and the denoised/transformed diagram, respectively. Regions where reparameterizations are above the identity (diagonal line γ(t) = t) correspond to regions where the inverse reparameterizations are below the identity, and vice versa. Consequently, the effect of transforming persistence diagrams for female subjects results in points in denoised/transformed diagrams that are generally smaller in magnitude than the noisy diagrams, as seen in panels (a) and (c) in Figure 11. Conversely, the effect of transforming persistence diagrams for male subjects results in points in transformed diagrams that are generally larger in magnitude than the noisy diagrams, as seen in panels (a) and (c) in Figure 11. The 17 (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) Figure 10: Same topology with scale and sampling variabilities: (a)&(c) Two examples, from 20, of randomly generated point clouds. (b)&(d) Corresponding persistence landscapes. (e) Persistence landscapes {Λi }20 i=1 of 20 point clouds. (f) Aligned persistence landscapes {Λi (γi )}20 i=1 . (g) Mean landscape after (blue) and without (red) alignment. (h) (Rescaled) Noisy persistence diagram {(bij , dij )}20 i=1 from 20 point clouds. (i) Estimated −1 −1 20 reparameterizations {γi }20 i=1 . (j) Denoised/transformed persistence diagram {(γi (bij ), γi (dij ))}i=1 . effects of denoising are not as clear in this real data example as in the simulations. This is due to the large sampling and geometric variability across different subjects. 18 (a) (b) (c) Figure 11: Alignment results for three random subjects in the female group (each row corresponds to a different subject): (a) Noisy and (c) denoised/transformed persistence diagrams. (b) Estimated reparameterizations. The identity reparameterization is shown in black for reference. 19 (a) (b) (c) Figure 12: Alignment results for three random subjects in the male group (each row corresponds to a different subject): (a) Noisy and (c) denoised/transformed persistence diagrams. (b) Estimated reparameterizations. The identity reparameterization is shown in black for reference. 20