A Fractal Dimension for Measures via Persistent Homology

Henry Adams, Manuchehr Aminian, Elin Farnell, Michael Kirby, Joshua Mirth, Rachel Neville, Chris Peterson, and Clayton Shonkwiler

Abstract We use persistent homology in order to define a family of fractal dimensions, denoted dim^i_PH(μ) for each homological dimension i ≥ 0, assigned to a probability measure μ on a metric space. The case of zero-dimensional homology (i = 0) relates to work by Steele (Ann Probab 16(4):1767–1787, 1988) studying the total length of a minimal spanning tree on a random sampling of points. Indeed, if μ is supported on a compact subset of Euclidean space R^m for m ≥ 2, then Steele's work implies that dim^0_PH(μ) = m if the absolutely continuous part of μ has positive mass, and otherwise dim^0_PH(μ) < m. Experiments suggest that similar results may be true for higher-dimensional homology 0 < i < m, though this is an open question. Our fractal dimension is defined by considering a limit, as the number of points n goes to infinity, of the total sum of the i-dimensional persistent homology interval lengths for n random points selected from μ in an i.i.d. fashion. To some measures μ, we are able to assign a finer invariant, a curve measuring the limiting distribution of persistent homology interval lengths as the number of points goes to infinity. We prove this limiting curve exists in the case of zero-dimensional homology when μ is the uniform distribution over the unit interval, and conjecture that it exists when μ is the rescaled probability measure for a compact set in Euclidean space with positive Lebesgue measure.

This work was completed while Elin Farnell was a research scientist in the Department of Mathematics at Colorado State University.

H. Adams · M. Aminian · M. Kirby · J. Mirth · C. Peterson · C. Shonkwiler
Colorado State University, Fort Collins, CO, USA
e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

E. Farnell
Amazon, Seattle, WA, USA
e-mail: [email protected]

R. Neville
University of Arizona, Fort Collins, CO, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
N. A. Baas et al. (eds.), Topological Data Analysis, Abel Symposia 15, https://doi.org/10.1007/978-3-030-43408-3_1

1 Introduction

Let X be a metric space equipped with a probability measure μ. While fractal dimensions are most classically defined for a space, there are a variety of fractal dimension definitions for a measure, including the Hausdorff or packing dimension of a measure [24, 30, 54]. In this paper we use persistent homology to define a fractal dimension dim^i_PH(μ) associated to a measure μ for each homological dimension i ≥ 0. Roughly speaking, dim^i_PH(μ) is determined by how the lengths of the persistent homology intervals for a random sample, Xn, of n points from X vary as n tends to infinity.

Our definition should be thought of as a generalization, to higher homological dimensions, of fractal dimensions related to minimal spanning trees, as studied, for example, in [63]. Indeed, the lengths of the zero-dimensional (reduced) persistent homology intervals corresponding to the Vietoris–Rips complex of a sample Xn are equal to the lengths of the edges in a minimal spanning tree with Xn as the set of vertices. In particular, if X is a subset of Euclidean space R^m with m ≥ 2, then [63, Theorem 1] by Steele implies that dim^0_PH(μ) ≤ m, with equality when the absolutely continuous part of μ has positive mass (Proposition 1). Independent generalizations of Steele's work to higher homological dimensions are considered in [26, 61, 62].

To some metric spaces X equipped with a measure μ we are able to assign a finer invariant that contains more information than just the fractal dimension. Consider the set of the lengths of all intervals in the i-dimensional persistent homology for Xn.
Experiments suggest that when the probability measure μ is absolutely continuous with respect to the Lebesgue measure on X ⊆ R^m, the scaled set of interval lengths in each homological dimension i converges distribution-wise to some fixed probability distribution (depending on μ and i). This is easy to prove in the simple case of zero-dimensional homology when μ is the uniform distribution over the unit interval, in which case we can also derive a formula for the limiting distribution. Experiments suggest that when μ is the rescaled probability measure corresponding to a compact set X ⊆ R^m of positive Lebesgue measure, then a limiting rescaled distribution exists that depends only on m, i, and the volume of μ (see Conjecture 2). We would be interested to know the formulas for the limiting distributions with higher Euclidean and homological dimensions.

Whereas Steele in [63] studies minimal spanning trees on random subsets of a space, Kozma et al. in [42] study minimal spanning trees built on extremal subsets. Indeed, they define a fractal dimension for a metric space X as the infimum over all powers d such that for any minimal spanning tree T on a finite number of points in X, the sum of the edge lengths in T, each raised to the power d, is bounded. They relate this extremal minimal spanning tree dimension to the box-counting dimension. Their work is generalized to higher homological dimensions by Schweinhart [60]. By contrast, we instead generalize Steele's work [63] on measures to higher homological dimensions. Three differences between [42, 60] and our work are the following.

• The former references define a fractal dimension for metric spaces, whereas we define a fractal dimension for measures.
• The fractal dimension in [42, 60] is defined using extremal subsets, whereas we define our fractal dimension using random subsets.
• We can estimate our fractal dimension computationally using log-log plots as in Sect. 5, whereas we do not know a computational technique for estimating the fractal dimensions in [42, 60].

After describing related work in Sect. 2, we give preliminaries on fractal dimensions and on persistent homology in Sect. 3. We present the definition of our fractal dimension and prove some basic properties in Sect. 4. We demonstrate example experimental computations in Sect. 5; our code is publicly available at https://github.com/CSU-PHdimension/PHdimension. Section 6 describes how limiting distributions, when they exist, form a finer invariant. Sections 7 and 8 discuss the computational details involved in sampling from certain fractals and estimating asymptotic behavior, respectively. Finally, we present our conclusion in Sect. 9. One of the main goals of this paper is to pose questions and conjectures, which are shared throughout.

2 Related Work

2.1 Minimal Spanning Trees

The paper [63] studies the total length of a minimal spanning tree for random subsets of Euclidean space. Let Xn be a random sample of points from a compact subset of R^d according to some probability distribution. Let Mn be the sum of all the edge lengths of a minimal spanning tree on vertex set Xn. Then for d ≥ 2, Theorem 1 of [63] says that

    Mn ∼ Cn^((d−1)/d)  as n → ∞,    (1.1)

where the relation ∼ denotes asymptotic convergence, with the ratio of the terms approaching one in the specified limit. Here, C is a fixed constant depending on d and on the volume of the absolutely continuous part of the probability distribution. (If the compact subset has Hausdorff dimension less than d, then [63] implies C = 0.) There has been a wide variety of related work, including for example [5–7, 38, 64–67]. See [41] for a version of the central limit theorem in this context. The papers [51, 52] study the length of the longest edge in the minimal spanning tree
for points sampled uniformly at random from the unit square, or from a torus of dimension at least two. By contrast, [42] studies Euclidean minimal spanning trees built on extremal finite subsets, as opposed to random subsets.

2.2 Umbrella Theorems for Euclidean Functionals

As Yukich explains in his book [72], there are a wide variety of Euclidean functionals, such as the length of the minimal spanning tree, the length of the traveling salesperson tour, and the length of the minimal matching, which all have scaling asymptotics analogous to (1.1). To prove such results, one needs to show that the Euclidean functional of interest satisfies translation invariance, subadditivity, superadditivity, and continuity, as in [21, Page 4]. Superadditivity does not always hold; for example, it does not hold for the minimal spanning tree length functional, but there is a related "boundary minimal spanning tree functional" that does satisfy superadditivity. Furthermore, the boundary functional has the same asymptotics as the original functional, which is enough to prove scaling results. It is intriguing to ask if these techniques will work for functionals defined using higher-dimensional homology.

2.3 Random Geometric Graphs

In this paper we consider simplicial complexes (say Vietoris–Rips or Čech) with randomly sampled points as the vertex set. The 1-skeleta of these simplicial complexes are random geometric graphs. We recommend the book [50] by Penrose as an introduction to random geometric graphs; related families of random graphs are also considered in [53]. Random geometric graphs are often studied when the scale parameter r(n) is a function of the number of vertices n, with r(n) tending to zero as n goes to infinity. Instead, in this paper we are more interested in the behavior over all scale parameters simultaneously.
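The scaling law (1.1) is straightforward to probe numerically. The following sketch (our illustration, not code from the paper; it assumes NumPy and SciPy are available) computes Mn for uniform samples in the unit square, where d = 2, and fits the log-log slope, which should be near (d − 1)/d = 1/2.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_total_length(points):
    """Sum of the edge lengths of a Euclidean minimal spanning tree."""
    dist = squareform(pdist(points))          # dense pairwise distance matrix
    return minimum_spanning_tree(dist).sum()  # sparse matrix of MST edge weights

rng = np.random.default_rng(0)
ns = [100, 200, 400, 800, 1600]
lengths = [mst_total_length(rng.random((n, 2))) for n in ns]

# Fit log M_n against log n; the slope estimates (d - 1)/d.
slope, _ = np.polyfit(np.log(ns), np.log(lengths), 1)
print(f"estimated slope {slope:.3f} (expected 0.5 for d = 2)")
```

Doubling n a few times already gives a slope close to 1/2, which is the same log-log procedure used for the experiments in Sect. 5.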
From a slightly different perspective, the paper [40] studies the expected Euler characteristic of the union of randomly sampled balls (potentially of varying radii) in the plane.

2.4 Persistent Homology

Vanessa Robins' thesis [58] contains many related ideas; we describe one such example here. Given a set X ⊆ R^m and a scale parameter ε ≥ 0, let

    Xε = {y ∈ R^m | there exists some x ∈ X with d(y, x) ≤ ε}

denote the ε-offset of X. The ε-offset of X is equivalently the union of all closed ε-balls centered at points in X. Furthermore, let C(Xε) ∈ N denote the number of connected components of Xε. In Chapter 5, Robins shows that for a generalized Cantor set X in R with Lebesgue measure 0, the box-counting dimension of X is equal to the limit

    lim_{ε→0} log(C(Xε)) / log(1/ε).

Here Robins considers the entire Cantor set, whereas we study random subsets thereof.

The paper [46], which heavily influenced our work, introduces a fractal dimension defined using persistent homology. This fractal dimension depends on thickenings of the entire metric space X, as opposed to random or extremal subsets thereof. As a consequence, the computed dimension of some fractal shapes (such as the Cantor set cross the interval) disagrees significantly with the Hausdorff or box-counting dimension.

Schweinhart's paper [60] takes a slightly different approach from ours, considering extremal (as opposed to random) subsets. After fixing a homological dimension i, Schweinhart assigns a fractal dimension to each metric space X equal to the infimum over all powers d such that for any finite subset X′ ⊆ X, the sum of the i-dimensional persistent homology bar lengths for X′, each raised to the power d, is bounded. For low-dimensional metric spaces Schweinhart relates this dimension to the box-counting dimension. More recently, Divol and Polonik [26] obtain generalizations of [63, 72] to higher homological dimensions in the case when X is a cube.
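Robins' component-counting limit above is closely related to ordinary box counting, which can be checked exactly for the middle-thirds Cantor set. Here is a minimal sketch of ours (not from the paper), using integer ternary arithmetic so that the box counts N(ε) = 2^j at ε = 3^(−j) come out exactly:

```python
import numpy as np

def cantor_indices(depth):
    """Integers i such that [i / 3**depth, (i + 1) / 3**depth] is one of the
    2**depth intervals at stage `depth` of the middle-thirds construction
    (equivalently: base-3 integers using only the digits 0 and 2)."""
    idx = [0]
    for _ in range(depth):
        idx = [3 * i for i in idx] + [3 * i + 2 for i in idx]
    return idx

depth = 10
idx = cantor_indices(depth)

# Occupied boxes of side eps = 3**-j, counted exactly in integer arithmetic.
js = np.arange(1, depth + 1)
counts = [len({i // 3 ** (depth - j) for i in idx}) for j in js]

# Box-counting slope: log N(eps) against log(1/eps) = j log 3.
slope = np.polyfit(js * np.log(3), np.log(counts), 1)[0]
print(f"estimated dimension {slope:.4f}; log(2)/log(3) = {np.log(2) / np.log(3):.4f}")
```

Because the counts are exact powers of two, the fitted slope recovers log(2)/log(3) ≈ 0.6309 to machine precision.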
Related results are obtained in [62] when X is a ball or sphere, and afterwards in [61] when points are sampled according to an Ahlfors regular measure.

There is a growing literature on the topology of random geometric simplicial complexes, including in particular the homology of Vietoris–Rips and Čech complexes built on top of random points in Euclidean space [3, 13, 39]. The paper [14] shows that for n points sampled from the unit cube [0, 1]^d with d ≥ 2, the maximally persistent cycle in dimension 1 ≤ k ≤ d − 1 has persistence of order Θ((log n / log log n)^(1/k)), where the asymptotic notation big Theta means both big O and big Omega. The homology of Gaussian random fields is studied in [4], which gives the expected k-dimensional Betti numbers in the limit as the number of points increases to infinity, and also in [12]. The paper [29] studies the number of simplices and critical simplices in the alpha and Delaunay complexes of Euclidean point sets sampled according to a Poisson process. An open problem about the birth and death times of the points in a persistence diagram coming from sublevel sets of a Gaussian random field is stated in Problem 1 of [28]. The paper [18] shows that the expected persistence diagram, from a wide class of random point clouds, has a density with respect to the Lebesgue measure.

The paper [15] explores what attributes of an algebraic variety can be estimated from a random sample, such as the variety's dimension, degree, number of irreducible components, and defining polynomials; one of their estimates of dimension is inspired by our work. In an experiment in [1], persistence diagrams are produced from random subsets of a variety of synthetic metric space classes. Machine learning tools, with these persistence diagrams as input, are then used to classify the metric spaces corresponding to each random subset. The authors obtain high classification rates between the different metric spaces.
It is likely that the discriminating power is based not only on the underlying homotopy types of the shape classes, but also on the shapes' dimensions as detected by persistent homology.

3 Preliminaries

This section contains background material and notation on fractal dimensions and persistent homology.

3.1 Fractal Dimensions

The concept of fractal dimension was introduced by Hausdorff to describe spaces like the Cantor set, and it later found extensive application in the study of dynamical systems. The attracting set of a simple dynamical system is often a submanifold, with an obvious dimension, but in non-linear and chaotic dynamical systems the attracting set may not be a manifold. The Cantor set, defined by removing the middle third from the interval [0, 1] and then recursing on the remaining pieces, is a typical example. It has the same cardinality as R, but it is nowhere dense, meaning it at no point resembles a line. The typical fractal dimension of the Cantor set is log_3(2). Intuitively, the Cantor set has "too many" points to have dimension zero, but also should not have dimension one.

We speak of fractal dimensions in the plural because there are many different definitions. In particular, fractal dimensions can be divided into two classes, which have been called "metric" and "probabilistic" [31]. The former describe only the geometry of a metric space. Two widely-known definitions of this type, which often agree on well-behaved fractals but are not in general equal, are the box-counting and Hausdorff dimensions. For an inviting introduction to fractal dimensions see [30]. Dimensions of the latter type take into account both the geometry of a given set and a probability distribution supported on that set (originally the "natural measure" of the attractor given by the associated dynamical system, but in principle any probability distribution can be used). The information dimension is the best known example of this type. For detailed comparisons, see [32].
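Probabilistic dimensions require a way to sample from the measure in question. As a toy illustration of ours (anticipating the fractal-sampling discussion in Sect. 7), points can be drawn from the natural measure on the middle-thirds Cantor set by choosing random ternary digits in {0, 2}:

```python
import numpy as np

def sample_cantor(n, digits=32, rng=None):
    """Draw n i.i.d. points from the natural measure on the middle-thirds
    Cantor set: each ternary digit is 0 or 2 with equal probability."""
    rng = np.random.default_rng(rng)
    d = 2 * rng.integers(0, 2, size=(n, digits))   # digits in {0, 2}
    scales = 3.0 ** -np.arange(1, digits + 1)
    return d @ scales

pts = sample_cantor(1000, rng=0)
# Every sample lies in [0, 1] and avoids the removed middle third (1/3, 2/3).
print(pts.min(), pts.max())
```

Truncating at 32 ternary digits is far below floating-point resolution, so the samples are indistinguishable from exact Cantor points for numerical purposes.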
Our persistent homology fractal dimension, Definition 6, is of the latter type.

For completeness, we exhibit some of the common definitions of fractal dimension. The primary definition for sets is given by the Hausdorff dimension [33].

Definition 1 Let S be a subset of a metric space X, let d ∈ [0, ∞), and let δ > 0. The Hausdorff measure of S is

    H^d(S) = inf_δ ( inf { Σ_{j=1}^∞ diam(Bj)^d | S ⊆ ∪_{j=1}^∞ Bj and diam(Bj) ≤ δ } ),

where the inner infimum is over all coverings of S by balls Bj of diameter at most δ. The Hausdorff dimension of S is

    dimH(S) = inf{d | H^d(S) = 0}.

The Hausdorff dimension of the Cantor set, for example, is log_3(2). In practice it is difficult to compute the Hausdorff dimension of an arbitrary set, which has led to a number of alternative fractal dimension definitions in the literature. These dimensions tend to agree on well-behaved fractals, such as the Cantor set, but they need not coincide in general. Two worth mentioning are the box-counting dimension, which is relatively simple to define, and the correlation dimension.

Definition 2 Let S ⊆ X for a metric space X, and let Nε denote the infimum of the number of closed balls of radius ε required to cover S. Then the box-counting dimension of S is

    dimB(S) = lim_{ε→0} log(Nε) / log(1/ε),

provided this limit exists. Replacing the limit with a lim sup gives the upper box-counting dimension, and a lim inf gives the lower box-counting dimension.

The box-counting definition is unchanged if Nε is instead defined by taking the number of open balls of radius ε, or the number of sets of diameter at most ε, or (for S a subset of R^n) the number of cubes of side-length ε [70, Definition 7.8], [30, Equivalent Definitions 2.1]. It can be shown that dimB(S) ≥ dimH(S). This inequality can be strict; for example, if S = Q ∩ [0, 1] is the set of all rational numbers between zero and one, then dimH(S) = 0 < 1 = dimB(S) [30, Chapter 3]. In Sect.
4 we introduce a fractal dimension based on persistent homology which shares key similarities with the Hausdorff and box-counting dimensions. It can also be easily estimated via log-log plots, and it is defined for arbitrary metric spaces (though our examples will tend to be subsets of Euclidean space). A key difference, however, will be that ours is a fractal dimension for measures, rather than for subsets.

There are a variety of classical notions of a fractal dimension for a measure, including the Hausdorff, packing, and correlation dimensions of a measure [24, 30, 54]. We give the definitions of two of these.

Definition 3 ((13.16) of [30]) The Hausdorff dimension of a measure μ with total mass one is defined as

    dimH(μ) = inf{dimH(S) | S is a Borel subset with μ(S) > 0}.

We have dimH(μ) ≤ dimH(supp(μ)), and it is possible for this inequality to be strict [30, Exercise 3.10]. (See also [31] for an example of a measure whose information dimension is less than the Hausdorff dimension of its support.) We also give the example of the correlation dimension of a measure.

Definition 4 Let X be a subset of R^m equipped with a measure μ, and let Xn be a random sample of n points from X. Let θ : R → R denote the Heaviside step function, meaning θ(x) = 0 for x < 0 and θ(x) = 1 for x ≥ 0. The correlation integral of μ is defined (for example in [35, 69]) to be

    C(r) = lim_{n→∞} (1/n²) Σ_{x,x′ ∈ Xn, x ≠ x′} θ(r − ‖x − x′‖).

It can be shown that C(r) ∝ r^ν, and the exponent ν is defined to be the correlation dimension of μ.

In [35, 36] it is shown that the correlation dimension gives a lower bound on the Hausdorff dimension of a measure. The correlation dimension can be easily estimated from a log-log plot, similar to the methods we use in Sect. 5. A different definition of the correlation dimension is given and studied in [23, 47]. The correlation dimension is a particular example of the family of Rényi dimensions, which also includes the information dimension as a particular case [56, 57].
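The log-log estimation of ν just mentioned can be sketched as follows (our illustration, assuming NumPy and SciPy): estimate C(r) empirically from the pairwise distances of a uniform sample on the unit square, whose correlation dimension is 2.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
x = rng.random((2000, 2))   # uniform sample on the unit square
d = pdist(x)                # all pairwise distances

# Empirical correlation integral at a few small scales r (the normalization
# by pairs rather than n**2 is a constant factor and does not affect the slope).
rs = np.array([0.02, 0.04, 0.08, 0.16])
C = np.array([(d <= r).mean() for r in rs])

# Slope of log C(r) against log r estimates the correlation dimension nu.
nu = np.polyfit(np.log(rs), np.log(C), 1)[0]
print(f"estimated correlation dimension {nu:.2f} (expected 2 for the unit square)")
```

Boundary effects bias the estimate slightly below 2 at these sample sizes, a finite-size artifact also visible in the log-log experiments of Sect. 5.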
A collection of possible axioms that one might like such a fractal dimension to satisfy is given in [47].

3.2 Persistent Homology

The field of applied and computational topology has grown rapidly in recent years, with the topic of persistent homology gaining particular prominence. Persistent homology has enjoyed a wealth of meaningful applications to areas such as image analysis, chemistry, natural language processing, and neuroscience, to name just a few examples [2, 10, 20, 25, 44, 45, 71, 73]. The strength of persistent homology lies in its ability to characterize important features in data across multiple scales. Roughly speaking, homology provides the ability to count the number of independent k-dimensional holes in a space, and persistent homology provides a means of tracking such features as the scale increases. We provide a brief introduction to persistent homology in this preliminaries section, but we point the interested reader to [8, 27, 37] for thorough introductions to homology, and to [16, 22, 34] for excellent expository articles on persistent homology.

Geometric complexes, which are at the heart of the work in this paper, associate to a set of data points a simplicial complex: a combinatorial space that serves as a model for an underlying topological space from which the data has been sampled. The building blocks of simplicial complexes are called simplices, which include vertices as 0-simplices, edges as 1-simplices, triangles as 2-simplices, tetrahedra as 3-simplices, and their higher-dimensional analogues as k-simplices for larger values of k. An important example of a simplicial complex is the Vietoris–Rips complex.

Definition 5 Let X be a set of points in a metric space and let r ≥ 0 be a scale parameter.
We define the Vietoris–Rips simplicial complex VR(X; r) to have as its k-simplices those collections of k + 1 points in X that have diameter at most r.

In constructing the Vietoris–Rips simplicial complex we translate our collection of points in X into a higher-dimensional complex that models topological features of the data. See Fig. 1 for an example of a Vietoris–Rips complex constructed from a set of data points, and see [27] for an extended discussion.

Fig. 1 An example of a set of data points in R^m with an associated Vietoris–Rips complex at a fixed scale

It is readily observed that for various data sets, there is not necessarily an ideal choice of the scale parameter so that the associated Vietoris–Rips complex captures the desired features in the data. The perspective behind persistence is to instead allow the scale parameter to increase and to observe the corresponding appearance and disappearance of topological features. To be more precise, each hole appears at a certain scale and disappears at a larger scale. Those holes that persist across a wide range of scales often reflect topological features in the shape underlying the data, whereas the holes that do not persist for long are often considered to be noise. However, in the context of this paper (estimating fractal dimensions), the holes that do not persist are perhaps better described as measuring the local geometry present in a random finite sample.

For a fixed set of points, we note that as the scale increases, simplices can only be added and cannot be removed. Thus, for r0 < r1 < r2 < · · · , we obtain a filtration of Vietoris–Rips complexes

    VR(X; r0) ⊆ VR(X; r1) ⊆ VR(X; r2) ⊆ · · · .

The associated inclusion maps induce linear maps between the corresponding homology groups Hk(VR(X; ri)), which are algebraic structures whose ranks count the number of independent k-dimensional holes in the Vietoris–Rips complex.
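Definition 5 translates directly into code. The following brute-force sketch (ours, purely illustrative and impractical for large point sets) enumerates the simplices of VR(X; r) up to a fixed dimension using the diameter criterion verbatim:

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, r, max_dim=2):
    """Simplices of VR(points; r) up to dimension max_dim: a k-simplex is any
    set of k + 1 points whose diameter (largest pairwise distance) is at most r."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = []
    for k in range(max_dim + 1):
        for idx in combinations(range(n), k + 1):
            if all(dist[a, b] <= r for a, b in combinations(idx, 2)):
                simplices.append(idx)
    return simplices

# Three vertices of a unit-side equilateral triangle: at r = 1.1 every pair is
# within distance r, so all vertices, all edges, and the single 2-simplex appear.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
print(vietoris_rips(tri, r=1.1))
```

Production persistent homology software (such as the packages cited in Sect. 5) builds this filtration far more efficiently, but the membership rule is exactly the one above.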
A technical remark is that homology depends on the choice of a group of coefficients; it is simplest to use field coefficients (for example R, Q, or Z/pZ for p prime), in which case the homology groups are furthermore vector spaces. The corresponding collection of vector spaces and linear maps is called a persistent homology module.

A useful tool for visualizing and extracting meaning from persistent homology is a barcode. The basic idea is that each generator of persistent homology can be represented by an interval, whose start and end times are the birth and death scales of a homological feature in the data. These intervals can be arranged as a barcode graph in which the x-axis corresponds to the scale parameter. See Fig. 2 for an example. If Y is a finite metric space, then we let PHi(Y) denote the corresponding collection of i-dimensional persistent homology intervals.

Fig. 2 An example of Vietoris–Rips complexes at increasing scales, along with associated persistent homology intervals. The zero-dimensional persistent homology intervals show how 21 connected components merge into a single connected component as the scale increases. The one-dimensional persistent homology intervals show two one-dimensional holes, one short-lived and the other long-lived

Zero-dimensional barcodes always produce one infinite interval, as in Fig. 2, which is problematic for our purposes. Therefore, in the remainder of this paper we will always use reduced homology, which has the effect of simply eliminating the infinite interval from the zero-dimensional barcode while leaving everything else unchanged. As a consequence, there will never be any infinite intervals in the persistent homology of a Vietoris–Rips simplicial complex, even in homological dimension zero.
Remark 1 It is well-known (see for example [58]) and easy to verify that for any finite metric space X, the lengths of the zero-dimensional (reduced) persistent homology intervals of the Vietoris–Rips complex of X correspond exactly to the lengths of the edges in a minimal spanning tree with vertex set X.

4 Definition of the Persistent Homology Fractal Dimension for Measures

Let X be a metric space equipped with a probability measure μ, and let Xn ⊆ X be a random sample of n points from X distributed independently and identically according to μ. Build a filtered simplicial complex K on top of vertex set Xn, for example a Vietoris–Rips complex VR(X; r) (Definition 5), an intrinsic Čech complex Č(X, X; r), or an ambient Čech complex Č(X, R^m; r) if X is a subset of R^m [17]. Denote the i-dimensional persistent homology of this filtered simplicial complex by PHi(Xn). This persistent homology barcode decomposes as a direct sum of interval summands; we let Li(Xn) be the sum of the lengths of the intervals in PHi(Xn). In the case of homological dimension zero, the sum L0(Xn) is simply the sum of all the edge lengths in a minimal spanning tree with Xn as its vertex set (since we are using reduced homology).

Definition 6 (Persistent Homology Fractal Dimension) Let X be a metric space equipped with a probability measure μ, let Xn ⊆ X be a random sample of n points from X distributed according to μ, and let Li(Xn) be the sum of the lengths of the intervals in the i-dimensional persistent homology for Xn. We define the i-dimensional persistent homology fractal dimension of μ to be

    dim^i_PH(μ) = inf_{d>0} { d | there exists a constant C(i, μ, d) such that Li(Xn) ≤ Cn^((d−1)/d) with probability one as n → ∞ }.

The constant C can depend on i, μ, and d. Here "Li(Xn) ≤ Cn^((d−1)/d) with probability one as n → ∞" means that we have lim_{n→∞} P[Li(Xn) ≤ Cn^((d−1)/d)] = 1.
This dimension may depend on the choices of filtered simplicial complex (say Vietoris–Rips or Čech), and on the choice of field coefficients for homology computations; for now those choices are suppressed from the definition.

Proposition 1 Let μ be a measure on X ⊆ R^m with m ≥ 2. Then dim^0_PH(μ) ≤ m, with equality if the absolutely continuous part of μ has positive mass.

Proof By Theorem 2 of [63], we have that

    lim_{n→∞} n^(−(m−1)/m) L0(Xn) = c ∫_{R^m} f(x)^((m−1)/m) dx,

where c is a constant depending on m, and where f is the absolutely continuous part of μ. To see that dim^0_PH(μ) ≤ m, note that

    L0(Xn) ≤ ( c ∫_{R^m} f(x)^((m−1)/m) dx + ε ) n^((m−1)/m)

with probability one as n → ∞ for any ε > 0. ⊓⊔

We conjecture that the i-dimensional persistent homology of compact subsets of R^m has the same scaling properties as the functionals in [63, 72].

Conjecture 1 Let μ be a probability measure on a compact set X ⊆ R^m with m ≥ 2, and let μ be absolutely continuous with respect to the Lebesgue measure. Then for all 0 ≤ i < m, there is a constant C ≥ 0 (depending on μ, m, and i) such that Li(Xn) ∼ Cn^((m−1)/m) with probability one as n → ∞.

Let μ be a probability measure with compact support that is absolutely continuous with respect to the Lebesgue measure in R^m for m ≥ 2. Note that Conjecture 1 would imply that the persistent homology fractal dimension of μ is equal to m. The tools of subadditivity and superadditivity behind the umbrella theorems for Euclidean functionals, as described in [72] and Sect. 2.2, may be helpful towards proving this conjecture. In some limited cases, for example when X is a cube or ball, or when μ is Ahlfors regular, Conjecture 1 is closely related to [26, 61, 62].

One could alternatively define birth-time or death-time fractal dimensions by replacing Li(Xn) with the sum of the birth times, or alternatively the sum of the death times, in the persistent homology barcodes PHi(Xn).
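Combining Remark 1 with Definition 6 gives a simple recipe for estimating dim^0_PH(μ): fit the log-log slope s of L0(Xn) against n and invert s = (d − 1)/d. A sketch of ours (not the paper's code; it uses chaos-game sampling of the Sierpiński triangle's natural measure, with NumPy and SciPy assumed):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def sierpinski_sample(n, rng, burn=20):
    """Chaos game: repeatedly average the current point with a random triangle
    vertex; after a short burn-in the iterates follow the natural measure on
    the Sierpinski triangle."""
    v = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    x = np.array([0.25, 0.25])
    pts = []
    for k in range(n + burn):
        x = (x + v[rng.integers(3)]) / 2
        if k >= burn:
            pts.append(x.copy())
    return np.array(pts)

def L0(points):
    """Sum of reduced PH0 interval lengths = total MST edge length (Remark 1)."""
    return minimum_spanning_tree(squareform(pdist(points))).sum()

rng = np.random.default_rng(2)
ns = [125, 250, 500, 1000, 2000]
lengths = [L0(sierpinski_sample(n, rng)) for n in ns]
slope = np.polyfit(np.log(ns), np.log(lengths), 1)[0]
d_est = 1 / (1 - slope)  # invert slope = (d - 1)/d from Definition 6
print(f"slope {slope:.3f}, dimension estimate {d_est:.2f}; log2(3) = {np.log2(3):.2f}")
```

At these modest sample sizes the estimate is rough, which is consistent with the finite-size effects reported in the experiments of Sect. 5.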
5 Experiments

A feature of Definition 6 is that we can use it to estimate the persistent homology fractal dimension of a measure μ. Indeed, suppose we can sample from X according to the probability distribution μ. We can therefore sample collections of points Xn of size n, compute the statistic Li(Xn), and then plot the results in a log-log fashion as n increases. In the limit as n goes to infinity, we expect the plotted points to be well-modeled by a line of slope (d − 1)/d, where d is the i-dimensional persistent homology fractal dimension of μ. In many of the experiments in this section, the measures μ are simple enough (or self-similar enough) that we would expect the persistent homology fractal dimension of μ to be equal to the Hausdorff dimension of μ.

In our computational experiments, we have used the persistent homology software packages Ripser [9], Javaplex [68], and code from Duke (see the acknowledgements). For the case of zero-dimensional homology, we can alternatively use well-known algorithms for computing minimal spanning trees, such as Kruskal's algorithm or Prim's algorithm [43, 55]. We estimate the slope of our log-log plots (of Li(Xn) as a function of n) using both a line of best fit, and alternatively a technique designed to approximate the asymptotic scaling described in Sect. 8. Our code is publicly available at https://github.com/CSU-PHdimension/PHdimension.

5.1 Estimates of Persistent Homology Fractal Dimensions

We display several experimental results, for shapes of both integral and non-integral fractal dimension. In Fig. 3, we show the log-log plots of Li(Xn) as a function of n, where Xn is sampled uniformly at random from a disk, a square, and an equilateral triangle, each of unit area in the plane R^2. Each of these spaces constitutes a manifold of dimension two, and we thus expect these shapes to have persistent homology fractal dimension d = 2 as well.
Experimentally, this appears to be the case, both for homological dimensions i = 0 and i = 1. Indeed, our asymptotically estimated slopes lie in the range 0.49–0.54, which is fairly close to the expected slope of (d − 1)/d = 1/2.

In Fig. 4 we perform a similar experiment for the cube in R^3 of unit volume. We expect the cube to have persistent homology fractal dimension d = 3, corresponding to a slope in the log-log plot of (d − 1)/d = 2/3. This appears to be the case for homological dimension i = 0, where the slope is approximately 0.65. However, for i = 1 and i = 2, our estimated slope is far from 2/3, perhaps because our computational limits do not allow us to take n, the number of randomly chosen points, to be sufficiently large.

In Fig. 5 we use log-log plots to estimate some persistent homology fractal dimensions of the Cantor set cross the interval (expected dimension d = 1 + log_3(2)), of the Sierpiński triangle (expected dimension d = log_2(3)), of Cantor dust in R^2 (expected dimension d = log_3(4)), and of Cantor dust in R^3 (expected dimension d = log_3(8)). As noted in Sect. 3, various notions of fractal dimension tend to agree for well-behaved fractals. Thus, in each case above, we provide the Hausdorff dimension d in order to define an expected persistent homology fractal dimension. The Hausdorff dimension is well-known for the Sierpiński triangle, Cantor dust in R^2, and Cantor dust in R^3. The Hausdorff dimension for the Cantor set cross the interval can be shown to be 1 + log_3(2), which follows from [30, Theorem 9.3] or [48, Theorem III]. In Sect. 5.2 we define these fractal shapes in detail, and we also explain our computational technique for sampling points from them at random.

Summarizing the experimental results for self-similar fractals, we find reasonably good estimates of fractal dimension for homological dimension i = 0.
Fig. 3 Log scale plots and slope estimates of the number n of sampled points versus L^0(X_n) (left) or L^1(X_n) (right). Subsets X_n are drawn uniformly at random from (top) the unit disc in R^2, (middle) the unit square, and (bottom) the unit triangle. All cases have slope estimates close to 1/2, which is consistent with the expected dimension. The asymptotic scaling estimates of the slope are computed as described in Sect. 8. (Slope estimates — disc, i = 0: linear fit 0.4942, asymptotic 0.49998; disc, i = 1: linear fit 0.58686, asymptotic 0.4925; square, i = 0: linear fit 0.49392, asymptotic 0.49249; square, i = 1: linear fit 0.5943, asymptotic 0.53521; triangle, i = 0: linear fit 0.49133, asymptotic 0.48066; triangle, i = 1: linear fit 0.5919, asymptotic 0.49755.)

Fig. 4 Log scale plots of the number n of sampled points from the cube versus L^0(X_n) (left), L^1(X_n) (right), and L^2(X_n) (bottom). (Slope estimates — i = 0: linear fit 0.65397; i = 1: linear fit 0.85188; i = 2: linear fit 1.0526.)
The dimension estimate from zero-dimensional persistent homology is reasonably good, while the one- and two-dimensional cases are less accurate, likely due to computational limitations.

More specifically, for the Cantor set cross the interval, we expect (d − 1)/d ≈ 0.3869, and we find slope estimates from a linear fit of all data and an asymptotic fit to be 0.3799 and 0.36488, respectively. In the case of the Sierpiński triangle, the estimate is quite good: we expect (d − 1)/d ≈ 0.3691, and the slope estimates from both a linear fit and an asymptotic fit are approximately 0.37. Similarly, the estimates for Cantor dust in R^2 and R^3 are close to the expected values: (1) For Cantor dust in R^2, we expect (d − 1)/d ≈ 0.2075 and estimate (d − 1)/d ≈ 0.25. (2) For Cantor dust in R^3, we expect (d − 1)/d ≈ 0.4717 and estimate (d − 1)/d ≈ 0.49. For i > 0 many of these estimates of the persistent homology fractal dimension are not close to the expected (Hausdorff) dimensions, perhaps because the number of points n is not large enough. The experiments in R^2 are related to [61, Corollary 1], although our experiments are with the Vietoris–Rips complex instead of the Čech complex.

It is worth commenting on the Cantor set, which is a self-similar fractal in R. Even though the Hausdorff dimension of the Cantor set is log_3(2), it is not hard to
see that the zero-dimensional persistent homology fractal dimension of the Cantor set is 1.

Fig. 5 (Top) Cantor set cross the unit interval for i = 0, 1. (Second row) Sierpiński triangle in R^2 for i = 0, 1. (Third row) Cantor dust in R^2 for i = 0, 1. (Bottom) Cantor dust in R^3 for i = 0, 1, 2. In each case, the zero-dimensional estimate is close to the expected dimension. The higher-dimensional estimates are not as accurate; we speculate that this is due to computational limitations. (Slope estimates — Cantor set cross interval, i = 0: linear fit 0.3799, asymptotic 0.36488; i = 1: linear fit 0.43391, asymptotic 0.46707. Sierpiński triangle, i = 0: linear fit 0.3712, asymptotic 0.37853; i = 1: linear fit 0.47645, asymptotic 0.43541. Cantor dust in R^2, i = 0: linear fit 0.26506, asymptotic 0.24543; i = 1: linear fit 0.34733, asymptotic 0.28639. Cantor dust in R^3, i = 0: linear fit 0.49075, asymptotic 0.48565; i = 1: linear fit 0.56443, asymptotic 0.49887; i = 2: linear fit 0.62552, asymptotic 0.5559.)
This is because as n → ∞ a random sample of points from the Cantor set will contain points in R arbitrarily close to 0 and to 1, and hence L^0(X_n) → 1 as n → ∞. This is not surprising—we do not necessarily expect to be able to detect a fractional dimension less than one by using minimal spanning trees (which are one-dimensional graphs). For this reason, if a measure μ is defined on a subset of R^m, we sometimes restrict attention to the case m ≥ 2. See Fig. 6 for our experimental computations on the Cantor set.

Fig. 6 Log scale plot of the number n of sampled points from the Cantor set versus L^0(X_n). Note that L^0(X_n) approaches one, as expected

Finally, we include one example with data drawn from a two-dimensional manifold in R^3. We sample points from a torus with major radius 5 and minor radius 3. We expect the persistent homology fractal dimensions to be 2, and this is supported in the experimental evidence for zero-dimensional homology shown in Fig. 7.

5.2 Randomly Sampling from Self-Similar Fractals

The Cantor set C = ∩_{l=0}^∞ C_l is a countable intersection of nested sets C_0 ⊇ C_1 ⊇ C_2 ⊇ ···, where the set C_l at level l is a union of 2^l closed intervals, each of length 1/3^l. More precisely, C_0 = [0, 1] is the closed unit interval, and C_l is defined recursively via
$$C_l = \frac{C_{l-1}}{3} \cup \left(\frac{2}{3} + \frac{C_{l-1}}{3}\right) \quad \text{for } l \ge 1.$$
In our experiment for the Cantor set (Fig. 6), we do not sample from the Cantor distribution on the entire Cantor set C, but instead from the left endpoints of level C_l of the Cantor set, where l is chosen to be very large (we use l = 100,000). More precisely, in order to sample points, we choose a binary sequence {a_i}_{i=1}^l uniformly at random, meaning that each term a_i is equal to either 0 or 1 with probability 1/2, and furthermore the value a_i is independent from the value of a_j for i ≠ j.
Fig. 7 Log scale plot of the number n of sampled points from a torus with major radius 5 and minor radius 3 versus L^0(X_n). Estimated lines of best fit from L^0(X_n) have slope approximately equal to 1/2 (linear fit 0.50165, asymptotic estimate 0.50034), suggesting a dimension estimate of d = 2. We restrict to zero-dimensional homology in this setting due to computational limitations

The corresponding random point in the Cantor set is ∑_{i=1}^{l} 2a_i/3^i. Note that this point is in C and furthermore is the left endpoint of some interval in C_l. So we are selecting left endpoints of intervals in C_l uniformly at random, but since l is large this is a good approximation to sampling from the entire Cantor set according to the Cantor distribution.

We use a similar procedure to sample at random for our experiments on the Cantor set cross the interval, on Cantor dust in R^2, on Cantor dust in R^3, and on the Sierpiński triangle (Fig. 5). The Cantor set cross the interval is C × [0, 1] ⊆ R^2, equipped with the Euclidean metric. We computationally sample by choosing a point from C_l as described in the paragraph above for l = 100,000, and by also sampling a point from the unit interval [0, 1] uniformly at random. Cantor dust is the subset C × C of R^2, which we sample by choosing two points from C_l as described previously. The same procedure is done for the Cantor dust C × C × C in R^3.

The Sierpiński triangle S ⊆ R^2 is defined in a similar way to the Cantor set, with S = ∩_{l=0}^∞ S_l a countable intersection of nested sets S_0 ⊇ S_1 ⊇ S_2 ⊇ ···. Here each S_l is a union of 3^l triangles. We choose l = 100,000 to be large, and then sample points uniformly at random from the bottom left endpoints of the triangles in S_l. More precisely, we choose a ternary sequence {a_i}_{i=1}^l uniformly at random, meaning that each term a_i is equal to either 0, 1, or 2 with probability 1/3.
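The digit-sampling scheme for the Cantor set is easy to sketch in code. The version below is our own illustration (not the repository code); note that in double-precision arithmetic, digits beyond roughly i = 30 no longer affect the floating-point value, so we truncate there rather than using l = 100,000:

```python
import numpy as np

def sample_cantor(n, l=30, rng=None):
    """Sample n approximate Cantor set points: each point is
    sum_{i=1}^{l} 2*a_i / 3^i for i.i.d. uniform binary digits a_i."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.integers(0, 2, size=(n, l))     # digits a_i in {0, 1}
    powers = 3.0 ** -np.arange(1, l + 1)    # 1/3^i
    return (2 * a) @ powers

rng = np.random.default_rng(0)
x = sample_cantor(5000, rng=rng)
# Cantor dust in R^2 is sampled coordinate-wise: two independent draws.
dust = np.column_stack([sample_cantor(1000, rng=rng) for _ in range(2)])
```

Every sampled point is a left endpoint of an interval of C_l, so in particular it lies in [0, 1] and avoids the removed middle third (1/3, 2/3).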
The corresponding random point in the Sierpiński triangle is ∑_{i=1}^{l} v_i/2^i ∈ R^2, where the vector v_i is given by
$$v_i = \begin{cases} (0, 0)^T & \text{if } a_i = 0 \\ (1, 0)^T & \text{if } a_i = 1 \\ \left(\tfrac{1}{2}, \tfrac{\sqrt{3}}{2}\right)^T & \text{if } a_i = 2. \end{cases}$$
Note this point is in S and furthermore is the bottom left endpoint of some triangle in S_l.

6 Limiting Distributions

To some metric measure spaces (X, μ), we are able to assign a finer invariant that contains more information than just the persistent homology fractal dimension. Consider the set of the lengths of all intervals in PH_i(X_n), for each homological dimension i. Experiments suggest that for some X ⊆ R^m, the scaled set of interval lengths in each homological dimension converges distribution-wise to some fixed probability distribution which depends on μ and on i.

More precisely, for a fixed probability measure μ, let F_n^{(i)} be the cumulative distribution function of the i-dimensional persistent homology interval lengths in PH_i(X_n), where X_n is a sample of n points from X drawn in an i.i.d. fashion according to μ. If μ is absolutely continuous with respect to the Lebesgue measure on some compact set, then the function F_n^{(i)}(t) converges pointwise to the Heaviside step function as n → ∞, since the fraction of interval lengths less than any fixed ε > 0 converges to one as n → ∞. More interestingly, for μ a sufficiently nice measure on X ⊆ R^m, the rescaled cumulative distribution function F_n^{(i)}(n^{-1/m} t) may converge to a non-constant curve. A back-of-the-envelope motivation for this rescaling is that if L^i(X_n) = C n^{(m-1)/m} with probability one as n → ∞ (Conjecture 1), then the average length of a persistent homology interval is
$$\frac{L^i(X_n)}{\#\,\text{intervals}} = \frac{C n^{(m-1)/m}}{\#\,\text{intervals}},$$
which is proportional to n^{-1/m} if the number of intervals is proportional to n. We make this precise in the following conjectures.
Conjecture 2 Let μ be a probability measure on a compact set X ⊆ R^m, and let μ be absolutely continuous with respect to the Lebesgue measure. Then the limiting distribution F^{(i)}(t) = lim_{n→∞} F_n^{(i)}(n^{-1/m} t), which depends on μ and i, exists.

In Sect. 6.1 we show that Conjecture 2 holds when μ is the uniform distribution on an interval, and in Sect. 6.2 we perform experiments in higher dimensions.

Question 1 Assuming Conjecture 2 is true, what is the limiting rescaled distribution when μ is the uniform distribution on an m-dimensional ball, or alternatively an m-dimensional cube?

Conjecture 3 Let the compact set X ⊆ R^m have positive Lebesgue measure, and let μ be the corresponding probability measure (i.e., μ is the restriction of the Lebesgue measure to X, rescaled to have mass one). Then the limiting distribution F^{(i)}(t) = lim_{n→∞} F_n^{(i)}(n^{-1/m} t) exists and depends only on m, i, and the volume of X.

Question 2 Assuming Conjecture 3 is true, what is the limiting rescaled distribution when X has unit volume?

Remark 2 Conjecture 3 is false if μ is not a uniform measure (i.e., a rescaled Lebesgue measure). Indeed, the uniform measure on a square (experimentally) has a different limiting rescaled distribution than a (nonconstant) beta distribution on the same unit square, as seen in Fig. 8.

6.1 The Uniform Distribution on the Interval

In the case where μ is the uniform distribution on the unit interval [0, 1], Conjecture 2 is known to be true, and furthermore a formula for the limiting rescaled distribution is known. If X_n is a subset of [0, 1] drawn uniformly at random, then (with probability one) the points in X_n divide [0, 1] into n + 1 pieces. The joint probability distribution function for the lengths of these pieces is given by the flat Dirichlet distribution, which can be thought of as the uniform distribution on the n-simplex (the set of all (t_0, …, t_n) with t_i ≥ 0 for all i, such that ∑_{i=0}^n t_i = 1).
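In this one-dimensional case the persistent homology can be computed directly: the PH_0 interval lengths are exactly the gaps between consecutive sorted sample points. The sketch below is our own numerical check (illustrative Python/NumPy, not code from the paper's repository); it compares the rescaled empirical distribution of these gaps against the exponential CDF 1 − e^{−t} that this section derives as the limit:

```python
import numpy as np

def ph0_interval_lengths(n, rng):
    """PH_0 interval lengths for n uniform points on [0, 1]: the gaps
    between consecutive sorted points (the minimal spanning tree of
    points on a line is the chain of nearest neighbours)."""
    return np.diff(np.sort(rng.random(n)))

rng = np.random.default_rng(1)
n = 200_000
t = np.sort(n * ph0_interval_lengths(n, rng))    # rescale lengths by n
ecdf = np.arange(1, t.size + 1) / t.size
# Kolmogorov-Smirnov distance to the Exp(1) CDF, 1 - exp(-t)
ks = np.max(np.abs(ecdf - (1.0 - np.exp(-t))))
```

In our runs the distance `ks` is well under 0.01, consistent with convergence of the rescaled interval lengths to Exp(1).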
Note that the intervals in PH_0(X_n) have lengths t_1, …, t_{n−1}, omitting t_0 and t_n, which correspond to the two subintervals on the boundary of the interval. The probability distribution function of each t_i, and therefore of each interval length in PH_0(X_n), is the marginal of the Dirichlet distribution, which is given by the Beta distribution B(1, n) [11]. After simplifying, the cumulative distribution function of B(1, n) is given by [59]
$$F_n^{(0)}(t) = \frac{B(t; 1, n)}{B(1, n)} = \frac{\int_0^t s^0 (1-s)^{n-1}\,ds}{\Gamma(1)\Gamma(n)/\Gamma(n+1)} = 1 - (1-t)^n.$$

As n goes to infinity, F_n^{(0)}(t) converges pointwise to the constant function 1. However, after rescaling, F_n^{(0)}(n^{-1} t) converges to a more interesting distribution independent of n. Indeed, we have F_n^{(0)}(t/n) = 1 − (1 − t/n)^n, and the limit as n → ∞ is
$$\lim_{n\to\infty} F_n^{(0)}\left(\frac{t}{n}\right) = 1 - e^{-t}.$$
This is the cumulative distribution function of the exponential distribution with rate parameter one. Therefore, the rescaled interval lengths in the limit as n → ∞ are distributed according to the exponential distribution Exp(1).

Fig. 8 Empirical CDF's for the H_0 and H_1 interval lengths computed from 10,000 points sampled from the unit square according to the uniform distribution and a beta distribution with shape and size parameter both set to 2. The limiting distributions appear to be different

6.2 Experimental Evidence for Conjecture 2 in the Plane

We now move to the case where μ is the uniform distribution on the unit square in R^2. It is known that the sum of the edge lengths of the minimal spanning tree, given by L^0(X_n) where X_n is a random sample of n points from the unit square, converges as n → ∞ to C n^{1/2} for a constant C [63]. However, to our knowledge the limiting distribution of all (rescaled) edge lengths is not known. We instead analyze this example empirically. The experiments in Fig. 9 suggest that as n increases, it is plausible that both F_n^{(0)}(n^{-1/2} t) and F_n^{(1)}(n^{-1/2} t) converge in distribution to a limiting probability distribution.

Fig. 9 Empirical CDF's for H_0 interval lengths, H_1 birth times, H_1 death times, and H_1 interval lengths computed from an increasing number of n points drawn uniformly from the two-dimensional unit square, and rescaled by n^{1/2}

6.3 Examples where a Limiting Distribution Does Not Exist

In this section we give experimental evidence that the assumption of being a rescaled Lebesgue measure in Conjecture 2 is necessary. Our example computation is done on a separated Sierpiński triangle. For a given separation value δ ≥ 0, the separated Sierpiński triangle can be defined as the set of all points in R^2 of the form ∑_{i=1}^∞ v_i/(2+δ)^i, where each vector v_i ∈ R^2 is either (0, 0), (1, 0), or (1/2, √3/2). The Hausdorff dimension of this self-similar fractal shape is log_{2+δ}(3) ([30, Theorem 9.3] or [48, Theorem III]), and note that when δ = 0, we recover the standard (non-separated) Sierpiński triangle. See Fig. 10 for a picture when δ = 2. Computationally, when we sample a point from the separated Sierpiński triangle, we sample a point of the form ∑_{i=1}^{l} v_i/(2+δ)^i, where in our experiments we use l = 100,000.

Fig. 10 Plot of 20,000 points sampled at random from the Sierpiński triangle of separation δ = 2

In the following experiment we sample random points from the separated Sierpiński triangle with δ = 2. As the number of random points n goes to infinity, it appears that the rescaled³ CDFs of H_0 interval lengths are not converging to a fixed probability distribution, but instead to a periodic family of distributions, in the following sense. If you fix k ∈ N, then the distributions on n = k, 3k, 9k, 27k, …, 3^j k, …
points appear to converge as j → ∞ to a fixed distribution. Indeed, see Fig. 11 for the limiting distribution on 3^j points, and for the limiting distribution on 3^j · 2 points. However, the limiting distribution for 3^j k points and the limiting distribution for 3^j k′ points appear to be the same if and only if k and k′ differ by a power of 3. See Fig. 12, which shows four snapshots from one full periodic orbit.

Here is an intuitively plausible explanation for why the rescaled CDFs for the separated Sierpiński triangle converge to a periodic family of distributions, rather than a fixed distribution. Imagine focusing a camera at the origin of the Sierpiński triangle and zooming in. Once you get to (2 + δ)× magnification, you see the same image again. This is one full period. However, for magnifications between 1× and (2 + δ)× you see a different image. In our experiments sampling random points, zooming in by a factor of (2 + δ)× is the same thing as sampling three times as many points (indeed, the Hausdorff dimension is log_{2+δ}(3)). When zooming in you see the same image only when the magnification is at a multiple of 2 + δ, and analogously when sampling random points perhaps we should expect to see the same probability distribution of interval lengths only when the number of points is multiplied by a power of 3.

³ Since the separated Sierpiński triangle has Hausdorff dimension log_{2+δ}(3), the rescaled distributions we plot are F_n^{(0)}(n^{-1/m} t) with m = log_{2+δ}(3).

Fig. 11 This figure shows the empirical rescaled CDFs of H_0 interval lengths for n = 3^j points (left) and for n = 3^j · 2 points (right) sampled from the separated Sierpiński triangle with δ = 2. Each figure appears to converge to a fixed limiting distribution as j → ∞, but the two limiting distributions are not equal

Fig. 12 Empirical rescaled CDF's for H_0 interval lengths and H_1 interval lengths computed from an increasing number of n = k · 3^6 points from the separated Sierpiński triangle with δ = 2, moving left to right (k = 1, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0). Note that as k increases between adjacent powers of three, the "bumps" in the distribution shift to the right, until the starting distribution reappears

7 Another Way to Randomly Sample from the Sierpiński Triangle

An alternate approach to constructing a sequence of measures converging to the Sierpiński triangle is using a particular Lindenmayer system, which generates a sequence of instructions in a recursive fashion [49, Figure 7.16]. Halting the recursion at any particular level l will give a (non-fractal) approximation to the Sierpiński triangle as a piecewise linear curve with a finite number of segments; see Fig. 13.

Fig. 13 The Sierpiński triangle as the limit of a sequence of curves. We can uniformly randomly sample from the curve at level l to generate a sequence of measures μ_l converging to the Sierpiński triangle measure as l → ∞

Fig. 14 Scaling behaviors for various "depths" of the Sierpiński arrowhead curves visualized in Fig. 13

Let μ_l be the uniform measure on the piecewise linear curve at level l. In Fig. 14 we sample n points from μ_l and compute L^i(X_n), displayed in a log-log plot. Since each μ_l for l fixed is non-fractal (and one-dimensional) in nature, the ultimate asymptotic behavior will be d = 1 once the number of points n is sufficiently large (depending on the level l). However, for level l sufficiently large (depending on the number of points n) we see that there is an intermediate regime in the log-log plots which scales with the expected fractal dimension near log_2(3).
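Returning to the digit-based construction of Sect. 6.3, the separated Sierpiński triangle sampler can be sketched as follows. This is our own illustrative Python (the repository code may differ), with the digit count truncated to l = 40 since later digits fall below double-precision resolution:

```python
import numpy as np

# Digit vectors: the three corners used in the Sierpinski construction.
V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, np.sqrt(3) / 2]])

def sample_separated_sierpinski(n, delta=2.0, l=40, rng=None):
    """Sample n points of the form sum_{i=1}^{l} v_i / (2 + delta)^i,
    where each v_i is a uniformly random row of V."""
    rng = np.random.default_rng() if rng is None else rng
    digits = rng.integers(0, 3, size=(n, l))          # ternary digits a_i
    weights = (2.0 + delta) ** -np.arange(1, l + 1)   # 1/(2+delta)^i
    return (V[digits] * weights[None, :, None]).sum(axis=1)

pts = sample_separated_sierpinski(20_000, delta=2.0,
                                  rng=np.random.default_rng(0))
# Setting delta = 0 recovers the standard Sierpinski triangle sampler.
```

Since the weights sum to 1/(1 + δ), for δ = 2 every sampled point lies in the triangle with vertices (0, 0), (1/3, 0), and (1/6, √3/6).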
We expect a similar relationship between the number of points n and the level l to hold for many types of self-similar fractals.

8 Asymptotic Approximation of the Scaling Exponent

From Definition 6 we consider how to estimate the exponent (d − 1)/d numerically for a given metric measure space (X, μ). For a fixed number of points n, a pair of values (n, ℓ_n) is produced, where ℓ_n = L^i(X_n) for a sampling X_n from (X, μ) of cardinality n. If the scaling holds asymptotically once n is sufficiently large, then we can approximate the exponent by sampling for a range of n values and observing the rate of growth of ℓ_n. A common technique used to estimate power law behavior (see for example [19]) is to fit a linear function to the log-transformed data. The reason for doing this is that a hypothesized asymptotic scaling y ∼ e^C x^α as x → ∞ becomes a linear function after taking the logarithm: log(y) ∼ C + α log(x).

However, the expected power law in the data only holds asymptotically as n → ∞. We observe in practice that the trend for small n is subdominant to its asymptotic scaling. Intuitively we would like to throw out the non-asymptotic portion of the sequence, but deciding where to threshold depends on the sequence. We propose the following approach to address this issue.

Suppose in general we have a countable set of measurements (n, ℓ_n), with n ranging over some subset of the positive integers. Create a sequence in monotone increasing order of n, so that we have (n_k, ℓ_{n_k})_{k=1}^∞ with n_k > n_j for k > j. For any pair of integers p, q with 1 ≤ p < q, we denote the log-transformed data of the corresponding terms in the sequence as
$$S_{pq} = \left\{ \left(\log(n_k), \log(\ell_{n_k})\right) \mid p \le k \le q \right\} \subseteq \mathbb{R}^2.$$
Each finite collection of points S_{pq} has an associated pair of linear least-squares coefficients (C_{pq}, α_{pq}), where the line of best fit to the set S_{pq} is given by y = C_{pq} + α_{pq} x. For our purposes we are more interested in the slope α_{pq} than the intercept C_{pq}.
We expect that we can obtain the fractal dimension by considering the joint limit in p and q: if we define α as
$$\alpha = \lim_{p,q\to\infty} \alpha_{pq},$$
then we can recover the dimension by solving α = (d − 1)/d. A possibly overly restrictive assumption is that the asymptotic behavior of ℓ_{n_k} is monotone. If this is the case, we may expect any valid joint limit p, q → ∞ to be defined and produce the same value. For example, setting q = p + r we expect the following to hold:
$$\alpha = \lim_{p\to\infty} \lim_{r\to\infty} \alpha_{p,p+r}.$$
In general, the joint limit may exist under a wider variety of ways in which one allows q to grow relative to p.

Now define a function A : R^2 → R which takes on values A(1/p, 1/q) = α_{pq}, and define A(0, 0) so that A is continuous at the origin. Assuming α_{pq} → α as above, any sequence (x_k, y_k) → (0, 0) will produce the same limiting value A(0, 0), and the limit lim_{(x,y)→(0,0)} A(x, y) is well-defined. This suggests an algorithm for finite data:

1. Obtain a collection of estimates α_{pq} for various values of p, q, and then
2. use the data {(1/p, 1/q, A(1/p, 1/q))} to extrapolate an estimate for A(0, 0) = α, from which we can solve for the fractal dimension d.

For simplicity, we currently fix q = n_max and collect estimates varying only p; i.e., we only collect estimates of the form α_{p,n_max}. In practice it is safest to use a low-order estimator to limit the risks of extrapolation. We use a linear fit for the two-dimensional data A(1/p, 1/q) to produce a linear approximation Â(ξ, η) = a + bξ + cη, giving an approximation α = A(0, 0) ≈ Â(0, 0) = a.

Shown in Fig. 15 is an example applied to the function
$$f(x) = 100x + \frac{1}{10}x^2 + 0.1\,\varepsilon(x) \tag{1.2}$$
with ε = dW(x), where W(x) is standard Brownian noise. The theoretical asymptotic exponent is α = 2 and should be attainable for sufficiently large x and enough sample points to overcome noise.
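A minimal version of this extrapolation can be sketched as follows. This is our own simplification (variable names are ours): q is fixed at the largest index, and the Brownian noise term of Eq. (1.2) is omitted so the run is reproducible; here α is simply the power-law exponent of the synthetic data, which is 2.

```python
import numpy as np

def window_slopes(ns, ells):
    """Least-squares slope alpha_{p,q} of the log-log data over each
    window [p, q], with q fixed at the last index."""
    logn, logl = np.log(ns), np.log(ells)
    q = len(ns) - 1
    ps = np.arange(0, q - 1)          # ensure at least 3 points per fit
    alphas = np.array([np.polyfit(logn[p:], logl[p:], 1)[0] for p in ps])
    return ps + 1, alphas             # report p as 1-indexed, as in the text

def extrapolated_exponent(ns, ells):
    """Fit alpha_p = a + b*(1/p) and report the intercept a ~ alpha."""
    ps, alphas = window_slopes(ns, ells)
    b, a = np.polyfit(1.0 / ps, alphas, 1)
    return a

# Synthetic data f(n) = 100 n + n^2/10: the true asymptotic exponent is 2.
ns = np.geomspace(10, 1e6, 30)
ells = 100 * ns + ns**2 / 10
naive = np.polyfit(np.log(ns), np.log(ells), 1)[0]   # fit to all data
alpha = extrapolated_exponent(ns, ells)
```

The naive fit to all of the data is dragged down by the pre-asymptotic regime, while the extrapolated estimate lands much closer to 2; solving α = (d − 1)/d would then convert the exponent to a dimension.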
Note that there is a balance needed to both keep a sufficient number of points for a robust estimation (we want q − p to be large) and to avoid including data in the pre-asymptotic regime (thus p must be relatively large). Visually, this is seen near the top side of the triangular region, where the error drops to roughly the order of 10^{−3}. The challenge for an arbitrary function is not knowing precisely where this balance is; see [19, Sections 1, 3.3–3.4] in the context of estimating x_min (in their language) for the tails of probability density functions.

Fig. 15 Left panel: approximations α_{pq} for selections of (p, q) in an artificial function 100x + (1/10)x^2(1 + ε(x)). Center panel: log-absolute-error of the coefficients. Note that the approximation is generally poor for |p − q| small, due to a small number of sample points. Right panel: same values, with the coordinates mapped as ξ = 1/p, η = 1/q. The value to be extrapolated is at (ξ, η) = (0, 0)

9 Conclusion

When points are sampled at random from a subset of Euclidean space, there are a wide variety of Euclidean functionals (such as the minimal spanning tree, the traveling salesperson tour, and the optimal matching) which scale according to the dimension of the Euclidean space [72]. In this paper we explore whether similar properties are true for persistent homology, and how one might use these scalings in order to define a persistent homology fractal dimension for measures. We provide experimental evidence for some of our conjectures, though that evidence is limited by the sample sizes on which we are able to compute. Our hope is that our experiments are only a first step toward inspiring researchers to further develop the theory underlying the scaling properties of persistent homology.

Acknowledgements We would like to thank Visar Berisha, Vincent Divol, Al Hero, Sara Kališnik, Benjamin Schweinhart, and Louis Scharf for their helpful conversations.
We would like to acknowledge the research group of Paul Bendich at Duke University for allowing us access to a persistent homology package, which can be accessed via GitLab after submitting a request to Paul Bendich.

References

1. Henry Adams, Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta, Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. The Journal of Machine Learning Research, 18(1):218–252, 2017.
2. Aaron Adcock, Daniel Rubin, and Gunnar Carlsson. Classification of hepatic lesions using the matching metric. Computer Vision and Image Understanding, 121:36–42, 2014.
3. Robert J Adler, Omer Bobrowski, Matthew S Borman, Eliran Subag, and Shmuel Weinberger. Persistent homology for random fields and complexes. In Borrowing strength: theory powering applications—a Festschrift for Lawrence D. Brown, pages 124–143. Institute of Mathematical Statistics, 2010.
4. Robert J Adler, Omer Bobrowski, and Shmuel Weinberger. Crackle: The persistent homology of noise. arXiv preprint arXiv:1301.1466, 2013.
5. David Aldous and J Michael Steele. Asymptotics for Euclidean minimal spanning trees on random points. Probability Theory and Related Fields, 92(2):247–258, 1992.
6. David Aldous and J Michael Steele. The objective method: probabilistic combinatorial optimization and local weak convergence. In Probability on discrete structures, pages 1–72. Springer, 2004.
7. Kenneth S Alexander. The RSW theorem for continuum percolation and the CLT for Euclidean minimal spanning trees. The Annals of Applied Probability, 6(2):466–494, 1996.
8. Mark A Armstrong. Basic topology. Springer Science & Business Media, 2013.
9. Ulrich Bauer. Ripser: A lean C++ code for the computation of Vietoris–Rips persistence barcodes. Software available at https://github.com/Ripser/ripser, 2017.
10.
Paul Bendich, J S Marron, Ezra Miller, Alex Pieloch, and Sean Skwerer. Persistent homology analysis of brain artery trees. The Annals of Applied Statistics, 10(1):198–218, 2016.
11. Martin Bilodeau and David Brenner. Theory of multivariate statistics. Springer Science & Business Media, 2008.
12. Omer Bobrowski and Matthew Strom Borman. Euler integration of Gaussian random fields and persistent homology. Journal of Topology and Analysis, 4(01):49–70, 2012.
13. Omer Bobrowski and Matthew Kahle. Topology of random geometric complexes: A survey. Journal of Applied and Computational Topology, 2018.
14. Omer Bobrowski, Matthew Kahle, and Primoz Skraba. Maximally persistent cycles in random geometric complexes. arXiv preprint arXiv:1509.04347, 2015.
15. Paul Breiding, Sara Kalisnik Verovsek, Bernd Sturmfels, and Madeleine Weinstein. Learning algebraic varieties from samples. arXiv preprint arXiv:1802.09436, 2018.
16. Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
17. Frédéric Chazal, Vin de Silva, and Steve Oudot. Persistence stability for geometric complexes. Geometriae Dedicata, pages 1–22, 2013.
18. Frédéric Chazal and Vincent Divol. The density of expected persistence diagrams and its kernel based estimation. arXiv preprint arXiv:1802.10457, 2018.
19. Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
20. Anne Collins, Afra Zomorodian, Gunnar Carlsson, and Leonidas J. Guibas. A barcode shape descriptor for curve point cloud data. Computers & Graphics, 28(6):881–894, 2004.
21. Jose A Costa and Alfred O Hero. Determining intrinsic dimension and entropy of high-dimensional shape spaces. In Statistics and Analysis of Shapes, pages 231–252. Springer, 2006.
22. Justin Michael Curry. Topological data analysis and cosheaves. Japan Journal of Industrial and Applied Mathematics, 32(2):333–371, 2015.
23. Colleen D Cutler.
Some results on the behavior and estimation of the fractal dimensions of distributions on attractors. Journal of Statistical Physics, 62(3–4):651–708, 1991. 24. Colleen D Cutler. A review of the theory and estimation of fractal dimension. In Dimension estimation and models, pages 1–107. World Scientific, 1993. 25. Yuri Dabaghian, Facundo Mémoli, Loren Frank, and Gunnar Carlsson. A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS computational biology, 8(8):e1002581, 2012. 26. Vincent Divol and Wolfgang Polonik. On the choice of weight functions for linear representations of persistence diagrams. arXiv preprint arXiv: arXiv:1807.03678, 2018. 27. Herbert Edelsbrunner and John L Harer. Computational Topology: An Introduction. American Mathematical Society, Providence, 2010. 28. Herbert Edelsbrunner, A Ivanov, and R Karasev. Current open problems in discrete and computational geometry. Modelirovanie i Analiz Informats. Sistem, 19(5):5–17, 2012. 29. Herbert Edelsbrunner, Anton Nikitenko, and Matthias Reitzner. Expected sizes of Poisson– Delaunay mosaics and their discrete Morse functions. Advances in Applied Probability, 49(3):745–767, 2017. 30. Kenneth Falconer. Fractal geometry: mathematical foundations and applications; 3rd ed. Wiley, Hoboken, NJ, 2013. 31. J.D. Farmer. Information dimension and the probabilistic structure of chaos. Zeitschrift für Naturforschung A, 37(11):1304–1326, 1982. 32. J.D. Farmer, Edward Ott, and James Yorke. The dimension of chaotic attractors. Physica D: Nonlinear Phenomena, 7(1):153–180, 1983. 30 H. Adams et al. 33. Gerald Folland. Real Analysis. John Wiley & Sons, 1999. 34. Robert Ghrist. Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008. 35. Peter Grassberger and Itamar Procaccia. Characterization of strange attractors. Physics Review Letters, 50(5):346–349, 1983. 36. Peter Grassberger and Itamar Procaccia. 
Measuring the Strangeness of Strange Attractors. In The Theory of Chaotic Attractors, pages 170–189. Springer, New York, NY, 2004. 37. Allen Hatcher. Algebraic Topology. Cambridge University Press, Cambridge, 2002. 38. Patrick Jaillet. On properties of geometric random problems in the plane. Annals of Operations Research, 61(1):1–20, 1995. 39. Matthew Kahle. Random geometric complexes. Discrete & Computational Geometry, 45(3):553–573, 2011. 40. Albrecht M Kellerer. On the number of clumps resulting from the overlap of randomly placed figures in a plane. Journal of Applied Probability, 20(1):126–135, 1983. 41. Harry Kesten and Sungchul Lee. The central limit theorem for weighted minimal spanning trees on random points. The Annals of Applied Probability, pages 495–527, 1996. 42. Gady Kozma, Zvi Lotker, and Gideon Stupp. The minimal spanning tree and the upper box dimension. Proceedings of the American Mathematical Society, 134(4):1183–1187, 2006. 43. Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7(1):48–50, 1956. 44. H Lee, H Kang, M K Chung, B N Kim, and D S Lee. Persistent brain network homology from the perspective of dendrogram. IEEE Transactions on Medical Imaging, 31(12):2267–2277, 2012. 45. Javier Lamar Leon, Andrea Cerri, Edel Garcia Reyes, and Rocio Gonzalez Diaz. Gaitbased gender classification using persistent homology. In José Ruiz-Shulcloper and Gabriella Sanniti di Baja, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 366–373, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. 46. Robert MacPherson and Benjamin Schweinhart. Measuring shape with topology. Journal of Mathematical Physics, 53(7):073516, 2012. 47. Pertti Mattila, Manuel Morán, and José-Manuel Rey. Dimension of a measure. Studia Math, 142(3):219–233, 2000. 48. Pat A .P. Moran. Additive functions of intervals and Hausdorff measure. 
Proceedings of the Cambridge Philosophical Society, 42(1):15–23, 1946. 49. Heinz-Otto Peitgen, Hartmut Jürgens, and Dietmar Saupe. Chaos and fractals: New frontiers of science. Springer Science & Business Media, 2006. 50. Mathew Penrose. Random geometric graphs, volume 5. Oxford University Press, Oxford, 2003. 51. Mathew D Penrose. The longest edge of the random minimal spanning tree. The annals of applied probability, pages 340–361, 1997. 52. Mathew D Penrose et al. A strong law for the longest edge of the minimal spanning tree. The Annals of Probability, 27(1):246–260, 1999. 53. Mathew D Penrose and Joseph E Yukich. Central limit theorems for some graphs in computational geometry. Annals of Applied probability, pages 1005–1041, 2001. 54. Yakov B Pesin. Dimension theory in dynamical systems: contemporary views and applications. University of Chicago Press, 2008. 55. Robert Clay Prim. Shortest connection networks and some generalizations. Bell Labs Technical Journal, 36(6):1389–1401, 1957. 56. Alfréd Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Hungarica, 10(1–2):193–215, 1959. 57. Alfréd Rényi. Probability Theory. North Holland, Amsterdam, 1970. 58. Vanessa Robins. Computational topology at multiple resolutions: foundations and applications to fractals and dynamics. PhD thesis, University of Colorado, 2000. 59. M.J. Schervish. Theory of Statistics. Springer Series in Statistics. Springer New York, 1996. A Fractal Dimension for Measures via Persistent Homology 31 60. Benjamin Schweinhart. Persistent homology and the upper box dimension. arXiv preprint arXiv:1802.00533, 2018. 61. Benjamin Schweinhart. The persistent homology of random geometric complexes on fractals. arXiv preprint arXiv:1808.02196, 2018. 62. Benjamin Schweinhart. Weighted persistent homology sums of random Čech complexes. arXiv preprint arXiv:1807.07054, 2018. 63. J Michael Steele. Growth rates of Euclidean minimal spanning trees with power weighted edges. 
The Annals of Probability, pages 1767–1787, 1988. 64. J Michael Steele. Probability and problems in Euclidean combinatorial optimization. Statistical Science, pages 48–56, 1993. 65. J Michael Steele. Minimal spanning trees for graphs with random edge lengths. In Mathematics and Computer Science II, pages 223–245. Springer, 2002. 66. J Michael Steele, Lawrence A Shepp, and William F Eddy. On the number of leaves of a Euclidean minimal spanning tree. Journal of Applied Probability, 24(4):809–826, 1987. 67. J Michael Steele and Luke Tierney. Boundary domination and the distribution of the largest nearest-neighbor link in higher dimensions. Journal of Applied Probability, 23(2):524–528, 1986. 68. Andrew Tausz, Mikael Vejdemo-Johansson, and Henry Adams. Javaplex: A research software package for persistent (co)homology. In International Congress on Mathematical Software, pages 129–136, 2014. Software available at http://appliedtopology.github.io/javaplex/. 69. James Theiler. Estimating fractal dimension. JOSA A, 7(6):1055–1073, 1990. 70. Robert W Vallin. The elements of Cantor sets: with applications. John Wiley & Sons, 2013. 71. Kelin Xia and Guo-Wei Wei. Multidimensional persistence in biomolecular data. Journal of Computational Chemistry, 36(20):1502–1520, 2015. 72. Joseph E Yukich. Probability theory of classical Euclidean optimization problems. Springer, 2006. 73. Xiaojin Zhu. Persistent homology: An introduction and a new text representation for natural language processing. In IJCAI, pages 1953–1959, 2013.