A Riemannian Approach to Multivariate Geostatistical
Modeling
Álvaro I. Riquelme*
Abstract In geosciences, the use of classical Euclidean methods is unsuitable for treating and analyzing some types of data, as these may not belong to a vector space. This is the case for correlation matrices, which belong to a subfamily of the symmetric positive definite matrices and which in turn form a cone-shaped Riemannian manifold. We propose two novel applications that deal with the non-linear behavior usually present in multivariate geological data by exploiting the manifold structure of correlation matrices. First, we employ an extension of the linear model of coregionalization (LMC) that alters the linear mixture, which is assumed fixed over the domain, making it locally varying according to the local strength of the dependency between the coregionalized variables. The main challenge, once this relaxation of the LMC is assumed, is to interpolate the correlation matrices, known at sample locations, throughout the domain in a reliable and coherent fashion. The present work adopts a non-Euclidean framework to achieve this objective, locally averaging and interpolating the correlations between the variables while retaining the intrinsic geometry of correlation matrices. A second application deals with the problem of clustering multivariate data.
Keywords Geostatistical modeling · Linear model of coregionalization · Eigen-decomposition · geodesics ·
Riemannian manifold · Symmetric positive definite
1 Introduction
In geosciences, classical Euclidean methods are not suitable for treating and analyzing some types of data, as they may not belong to a vector space. A common example is weather data, commonly assumed to be restricted to the sphere. Viewing data as lying in different submanifolds of a Riemannian space is an increasingly used approach that has been highly successful over the past decades. In geostatistics, these concepts have been used in particular for the modeling of spatial non-stationarity in the data (Sampson and Guttorp 1992; Almendral et al. 2008; Boisvert et al. 2009; Fouedjio et al. 2015). The concepts have also been applied exhaustively by Taylor et al. (2003), Taylor et al. (2006) and Adler and Taylor (2007) in order to understand the topology of random fields (RF) on manifolds.
We address the problem of capturing and incorporating a second source of non-stationarity in geological phenomena, which stems from the fact that the different variables describing ore deposits, 𝑍1(u), . . . , 𝑍𝑝(u), cannot be modeled independently of one another, since they are mineralogically and physically related in complex ways. As simple linear multivariate features rarely occur among the geological variables composing sampling databases, which usually show nonlinear features instead, the correct reproduction of such characteristics becomes a problem when employing traditional estimation and geostatistical simulation techniques.
*E-mail: [email protected]. Robert M. Buchan Department of Mining, Queen's University, Kingston, Canada.
When relationships are simple and linear, one can rely on the linear model of coregionalization (LMC) (Journel and Huijbregts 1978; Chilès and Delfiner 2012), which can be interpreted, in the standard Gaussian setting, as assigning a constant correlation 𝜌𝑖𝑗 to the pair of variables {𝑍𝑖(u), 𝑍𝑗(u)} throughout the domain. This correlation parameter fixes the theoretical direct and cross covariances in space, even at different positions {𝑍𝑖(u), 𝑍𝑗(u + h)}, and must be modeled beforehand in order to proceed with estimation techniques such as co-kriging (Wackernagel 2013), or before applying decorrelation of the data through linear transformations such as Principal Component Analysis (PCA) (Pearson 1901) or Minimum/Maximum Autocorrelation Factors (Switzer 1985).
Since geological variables rarely show linear Gaussian characteristics, it is hard to give the LMC a geological interpretation, especially when non-linear multivariate behavior among the different attributes is present, reducing the rate of success of the traditional methods. To overcome this limitation, several approaches have been proposed that generalize the Gaussian transformation of the geological variables (raw variables hereafter) into independent standard Gaussian variables that can be treated individually (Leuangthong and Deutsch 2003; Barnett et al. 2014; van den Boogaart et al. 2017). We follow a different path, which is to modify the LMC by altering the fixed correlation among geological features over the domain. This linear mixture can be made locally varying according to the local strength of the dependency between the variables, leading to a locally varying linear model of coregionalization (LVLMC), first introduced by Gelfand et al. (2003) in the context of spatial non-stationary models.
The main challenge, once this relaxation of the LMC is assumed, is to properly carry the correlation matrix, computed first at the data positions, to unknown locations of the spatial domain. Correlation matrices belong to the family of symmetric positive definite (SPD) matrices, which in turn forms a cone-shaped Riemannian manifold. Building upon earlier studies showing that a Riemannian framework is appropriate for interpolating between correlation matrices, the present work adopts this non-Euclidean framework to achieve our objective by interpolating the correlations between the variables throughout the geological domain while retaining the intrinsic geometry of correlation matrices.
2 Background

2.1 Review of Riemannian Manifolds

A differentiable manifold 𝑀 of dimension 𝑝 is a topological space that is locally similar to a Euclidean space, with every point on the manifold having a neighborhood for which there exists a homeomorphism (a continuous bijection whose inverse is also continuous) mapping the neighborhood to ℝ^𝑝. The tangent space 𝑇x𝑀 at x is the vector space that contains the tangent vectors to all 1-D curves on 𝑀 passing through x. Fig. 1 shows an example of a two-dimensional manifold, a smooth surface living in ℝ³. A Riemannian metric on a manifold 𝑀 is a bilinear form which associates to each point x ∈ 𝑀 a differentiably varying inner product ⟨·, ·⟩x on the tangent space 𝑇x𝑀 at x. The norm of a vector v ∈ 𝑇x𝑀 is given by ‖v‖²x = ⟨v, v⟩x. The Riemannian distance between two points x_i and x_j that lie on the manifold, 𝑑(x_i, x_j), is defined as the minimum length over all possible smooth curves on the manifold between x_i and x_j. The smooth curve with minimum length is known as the geodesic curve 𝛾.

Fig. 1 The exponential map.

Given a tangent vector v ∈ 𝑇x𝑀, locally there exists a unique geodesic 𝛾_v(𝑡) starting at x with initial velocity v, and this geodesic has constant speed equal to ‖v‖x. The exponential map exp_x : 𝑇x𝑀 → 𝑀 maps a tangent vector v to the point on the manifold that is reached at time 1 by the geodesic 𝛾_v(𝑡). The inverse of exp_x is known as the logarithm map and is denoted by log_x : 𝑀 → 𝑇x𝑀. Now, if we have two points x_i and x_j on the manifold 𝑀, the tangent vector to the geodesic curve from x_i to x_j is defined as v = log_{x_i}(x_j), and the exponential map takes v to the point x_j = exp_{x_i}(log_{x_i}(x_j)). In addition, 𝛾_v(0) = x_i and 𝛾_v(1) = x_j. The Riemannian distance between x_i and x_j is defined as 𝑑(x_i, x_j) = ‖log_{x_i}(x_j)‖_{x_i}.

Given data x_1, . . . , x_k ∈ 𝑀, the geometric or Fréchet mean 𝝁 is defined as a minimizer of the sum of squared distances:

$$\boldsymbol{\mu} = \arg\inf_{\mathbf{x} \in M} \sum_{i=1}^{k} d^2(\mathbf{x}, \mathbf{x}_i).$$
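For later use, and assuming the data lie in a neighborhood of the manifold where this mean is unique, a standard first-order characterization of the minimizer is that the logarithm maps of the data, taken at the mean, balance out:

$$\sum_{i=1}^{k} \log_{\boldsymbol{\mu}}(\mathbf{x}_i) = \mathbf{0}.$$

The fixed-point schemes of Sect. 3.1.2 iterate precisely towards this balance by repeatedly averaging in the tangent space and mapping back to the manifold.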
2.2 The Riemannian manifold of SPD matrices

Let Sym⁺(𝑝) denote the set of symmetric, positive definite matrices of size 𝑝 × 𝑝, that is, the set of all symmetric 𝑝 × 𝑝 matrices X such that the quadratic form vᵀXv > 0 for all non-zero v ∈ ℝ^𝑝. A crucial aspect of the set Sym⁺(𝑝) is that it is not a vector space but forms a cone-shaped Riemannian manifold. As a consequence of the manifold structure of Sym⁺(𝑝), computational methods that simply rely on Euclidean distances between SPD matrices are generally suboptimal, with low performance (Tuzel et al. 2008). To exploit the manifold structure of Sym⁺(𝑝) it is necessary to consider the notion of geodesic distance, which is the length of the shortest curve connecting two points, in this case two matrices, on the manifold. Among the different Riemannian metrics that have been considered on Sym⁺(𝑝), the one that has been most studied and analyzed is the classical affine-invariant metric, in which the geodesic distance on the manifold between two SPD matrices P₁ and P₂ is defined as

$$d_{\mathrm{Sym}^+}^2(\mathbf{P}_1, \mathbf{P}_2) = \left\| \operatorname{Log}\!\left(\mathbf{P}_1^{-1/2}\mathbf{P}_2\mathbf{P}_1^{-1/2}\right) \right\|^2 = \operatorname{tr}\!\left[ \operatorname{Log}^2\!\left(\mathbf{P}_1^{-1/2}\mathbf{P}_2\mathbf{P}_1^{-1/2}\right) \right],$$

with Log(·) denoting the matrix logarithm and ‖A‖² = tr(AᵀA) denoting the Frobenius matrix norm.

Furthermore, given a tangent vector Y_P ∈ 𝑇_P Sym⁺(𝑝) at a point P ∈ Sym⁺(𝑝), the Riemannian exponential map exp_P : 𝑇_P Sym⁺(𝑝) → Sym⁺(𝑝) is given by

$$\mathbf{X} = \exp_{\mathbf{P}}(\mathbf{Y}_{\mathbf{P}}) = \mathbf{P}^{1/2} \operatorname{Exp}\!\left(\mathbf{P}^{-1/2}\mathbf{Y}_{\mathbf{P}}\mathbf{P}^{-1/2}\right) \mathbf{P}^{1/2}, \qquad (1)$$

where Exp(·) denotes the matrix exponential. Given two positive definite matrices P, X ∈ Sym⁺(𝑝), the Riemannian logarithmic map log_P : Sym⁺(𝑝) → 𝑇_P Sym⁺(𝑝) of X in relation to P is given by

$$\mathbf{Y}_{\mathbf{P}} = \log_{\mathbf{P}}(\mathbf{X}) = \mathbf{P}^{1/2} \operatorname{Log}\!\left(\mathbf{P}^{-1/2}\mathbf{X}\mathbf{P}^{-1/2}\right) \mathbf{P}^{1/2}. \qquad (2)$$

Finally, the geodesic passing through P in the direction of Y_P is uniquely given by

$$\gamma_{\mathbf{P}}(t; \mathbf{Y}_{\mathbf{P}}) = \mathbf{P}^{1/2} \operatorname{Exp}\!\left(\mathbf{P}^{-1/2}\mathbf{Y}_{\mathbf{P}}\mathbf{P}^{-1/2}\, t\right) \mathbf{P}^{1/2}. \qquad (3)$$

2.3 The Riemannian manifold of Correlation matrices

The affine-invariant structure of Sym⁺(𝑝) is not only intrinsically linked with Corr(𝑝) but also imposes symmetry on its structure as a quotient manifold.

2.3.1 Visualizing Corr(2) and Corr(3)

Let us begin by visualizing Corr(2) as a subset of Sym⁺(2):

$$\mathrm{Corr}(2) := \left\{ \begin{pmatrix} 1 & x \\ x & 1 \end{pmatrix} : x \in (-1, 1) \right\}.$$

We can then realize this space to be a manifold of dimension 1, parameterized by the map 𝜑 : (−1, 1) → Corr(2) given by

$$\varphi(x) = \begin{pmatrix} 1 & x \\ x & 1 \end{pmatrix}.$$

This is a smooth map into the symmetric matrices which restricts to Corr(2), and whose inverse is simply given by projection onto one of the off-diagonal entries. We can visualize any C ∈ Corr(2) by associating to it the ellipsoid vᵀC⁻¹v = 1. Because of the global parametrization 𝜑 : (−1, 1) → Corr(2), we can visualize the manifold Corr(2) as the interval (−1, 1), where at each point of the interval we attach the ellipsoid corresponding to the positive-definite form associated with the matrix. We see this in Fig. 2. Another visualization we will consider is to see the correlation matrices embedded inside the symmetric positive-definite matrices (Fig. 3).

Fig. 2 The manifold Corr(2).

Fig. 3 The manifold Corr(2) visualized as an embedded submanifold of Sym⁺(2). Points in Sym⁺(2) (black) are sampled independently of those in Corr(2) (red).

In the case of correlation matrices of dimension 3, the shape formed by the set is named the 3-dimensional elliptope, which can be represented as a linear matrix inequality and characterized by

$$\mathrm{Corr}(3) := \left\{ \begin{pmatrix} 1 & x & y \\ x & 1 & z \\ y & z & 1 \end{pmatrix} : \det\!\begin{pmatrix} 1 & x & y \\ x & 1 & z \\ y & z & 1 \end{pmatrix} = 1 + 2xyz - x^2 - y^2 - z^2 > 0 \right\}.$$

The boundary of the elliptope (Fig. 4) is the cubic surface defined by

$$1 + 2xyz - x^2 - y^2 - z^2 = 0.$$

Fig. 4 The boundary of the elliptope.
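Before turning to the quotient structure, the geometric operations above can be made concrete in a few lines of code. The sketch below is my own illustration (not code from the paper): it evaluates the affine-invariant distance and the maps of Eqs. (1)–(2) with NumPy, computing matrix functions through the eigendecomposition of their symmetric arguments.

```python
import numpy as np

def _sym_fun(A, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spd_dist(P1, P2):
    """Affine-invariant distance: || Log(P1^{-1/2} P2 P1^{-1/2}) ||_F."""
    P1_ih = _sym_fun(P1, lambda w: w ** -0.5)          # P1^{-1/2}
    return np.linalg.norm(_sym_fun(P1_ih @ P2 @ P1_ih, np.log), "fro")

def spd_exp(P, Y):
    """Exponential map exp_P(Y) of Eq. (1)."""
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ Y @ Pih, np.exp) @ Ph

def spd_log(P, X):
    """Logarithm map log_P(X) of Eq. (2)."""
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ X @ Pih, np.log) @ Ph

if __name__ == "__main__":
    P = np.array([[1.0, 0.6], [0.6, 1.0]])
    X = np.array([[2.0, 0.3], [0.3, 1.5]])
    Y = spd_log(P, X)
    print(spd_dist(P, X))                  # geodesic distance between P and X
    print(np.allclose(spd_exp(P, Y), X))   # exp_P(log_P(X)) recovers X -> True
```

The same helpers are reused, with minor variations, in the sketches of the following sections.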
2.3.2 Quotient Geometry

Consider an element 𝚺 ∈ Sym⁺(𝑝). The orbit of 𝚺, that is, the set of images of 𝚺 under the action of the group of diagonal matrices with positive entries Diag⁺(𝑝), Diag⁺(𝑝) × Sym⁺(𝑝) → Sym⁺(𝑝), given by (D, 𝚺) ↦ D𝚺D,

$$\mathrm{Diag}^+(p) \cdot \boldsymbol{\Sigma} = \left\{ \mathbf{D} \cdot \boldsymbol{\Sigma} : \mathbf{D} \in \mathrm{Diag}^+(p) \right\}, \qquad \mathbf{D} \cdot \boldsymbol{\Sigma} := \mathbf{D}\boldsymbol{\Sigma}\mathbf{D},$$

is a smooth manifold of dimension equal to dim Diag⁺(𝑝) = 𝑝. This can be seen explicitly by taking an element C ∈ Corr(2) ⊂ Sym⁺(2) and sampling the orbit space by applying DCD, where D ∈ Diag⁺(𝑝) is generated randomly (Fig. 5).

Fig. 5 A foliation of the cone Sym⁺(2). Each leaf is an embedded two-dimensional submanifold, obtained by translating a given correlation matrix C (red dot) by the action DCD.

Subsequently, the quotient manifold Sym⁺(𝑝)/Diag⁺(𝑝) is a smooth manifold on which one can take, as representative of the equivalence relation, an element of Corr(𝑝), with dim Corr(𝑝) = dim Sym⁺(𝑝) − dim Diag⁺(𝑝) (David 2019). Intuitively, this corresponds to a "retraction" along the leaves to the one-dimensional line Corr(2), in the case of Sym⁺(2).

The representative that we take on Sym⁺(𝑝)/Diag⁺(𝑝) corresponds to the element given by the projection

$$\pi : \mathrm{Sym}^+(p) \to \mathrm{Corr}(p), \qquad \boldsymbol{\Sigma} \mapsto \mathbf{D}_{\boldsymbol{\Sigma}}\, \boldsymbol{\Sigma}\, \mathbf{D}_{\boldsymbol{\Sigma}} = \mathbf{C}_{\boldsymbol{\Sigma}},$$

where D_𝚺 = (I_𝑝 ∘ 𝚺)^{−1/2} and ∘ is the Hadamard product. Since more than one element can be projected onto the same correlation matrix, we call the leaf 𝜋⁻¹(C_𝚺) projected onto the correlation matrix C_𝚺 the fiber of C_𝚺:

$$\pi^{-1}(\mathbf{C}_{\boldsymbol{\Sigma}}) = \left\{ \boldsymbol{\Sigma} \in \mathrm{Sym}^+(p) : \mathbf{D}_{\boldsymbol{\Sigma}}\, \boldsymbol{\Sigma}\, \mathbf{D}_{\boldsymbol{\Sigma}} = \mathbf{C}_{\boldsymbol{\Sigma}} \right\}.$$

2.3.3 Accounting for a distance in Corr(𝑝)

While the result that Corr(𝑝) exhibits this particular quotient manifold structure is meaningful in itself, this fact alone does not yield results that are suitable for algorithms and computation, as closed-form expressions are not available for computing distances on Corr(𝑝). Thus, one must rely on the Riemannian structure that Corr(𝑝) inherits from Sym⁺(𝑝) in order to obtain an algorithm that computes distances through an optimization procedure.

In order to come up with such an algorithm, we use a very helpful theorem, proved by Huckemann et al. (2010), showing that the geodesic connecting two points in the quotient can be expressed as the geodesic in the ambient manifold from the starting point to an optimal representative of the end point, lying on the fiber over the desired endpoint:

Theorem 1 (Huckemann et al. 2010). Let 𝑀 be a Riemannian manifold with an isometric action of a Lie group 𝐺. Then a geodesic 𝛾 in the quotient 𝑀/𝐺 with end points 𝑎, 𝑏 ∈ 𝑀/𝐺 can be obtained from the projection of a geodesic 𝛾̃ on 𝑀 (i.e. 𝛾 = 𝜋 ∘ 𝛾̃) such that

– 𝛾̃ has end points 𝑝, 𝑞 with 𝜋(𝑝) = 𝑎, 𝜋(𝑞) = 𝑏, and
– 𝑞 is the solution to the problem

$$\min\; d_M(p, c) \quad \text{such that} \quad c \in \pi^{-1}(b).$$

This last point can be rephrased, for fixed 𝑐 ∈ 𝜋⁻¹(𝑏), as

$$\min\; d_M(p, g \cdot c) \quad \text{such that} \quad g \in G.$$
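As a simple numerical illustration of this optimization over the group G = Diag⁺(𝑝) (a sketch of my own, using a generic optimizer rather than the dedicated gradient descent of David (2019) described in Sect. 3.1.2.2), one can parameterize D = diag(e^{u₁}, . . . , e^{u_p}) so that the search over the fiber becomes unconstrained:

```python
import numpy as np
from scipy.optimize import minimize

def _sym_fun(A, fun):
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spd_dist(P1, P2):
    P1_ih = _sym_fun(P1, lambda w: w ** -0.5)
    return np.linalg.norm(_sym_fun(P1_ih @ P2 @ P1_ih, np.log), "fro")

def corr_dist(C1, C2):
    """Quotient distance d_Corr(C1, C2) = inf over D in Diag+(p) of d_Sym+(C1, D C2 D),
    with D = diag(exp(u)) so that the minimization over u is unconstrained."""
    p = C1.shape[0]

    def objective(u):
        D = np.diag(np.exp(u))
        return spd_dist(C1, D @ C2 @ D) ** 2

    res = minimize(objective, x0=np.zeros(p), method="Nelder-Mead")
    return np.sqrt(res.fun)

C1 = np.array([[1.0, 0.2], [0.2, 1.0]])
C2 = np.array([[1.0, 0.7], [0.7, 1.0]])
# the quotient distance can never exceed the ambient Sym+ distance (D = I is feasible)
print(corr_dist(C1, C2), "<=", spd_dist(C1, C2))
```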
Adapting Eq. 3 to the current scenario, let C₁, C₂ ∈ Corr(𝑝). Then the geodesic and the corresponding distance in Sym⁺(𝑝) connecting these two points are given by the following:

$$\gamma_{\mathrm{Sym}^+}(t) = \mathbf{C}_1^{1/2} \operatorname{Exp}\!\left(\operatorname{Log}\!\left(\mathbf{C}_1^{-1/2}\mathbf{C}_2\mathbf{C}_1^{-1/2}\right) t\right) \mathbf{C}_1^{1/2},$$
$$d_{\mathrm{Sym}^+}^2(\mathbf{C}_1, \mathbf{C}_2) = \left\| \operatorname{Log}\!\left(\mathbf{C}_1^{-1/2}\mathbf{C}_2\mathbf{C}_1^{-1/2}\right) \right\|^2.$$

In order to adapt this Riemannian structure to Corr(𝑝), we need to find the optimal representative of C₂ with respect to the starting point C₁. This is done by finding the unique element C̃₂ in the fiber 𝜋⁻¹(C₂) which minimizes the Sym⁺-distance between C₁ and C̃₂. This can be written as

$$d_{\mathrm{Corr}}^2(\mathbf{C}_1, \mathbf{C}_2) = \inf_{\mathbf{D} \in \mathrm{Diag}^+(p)} d_{\mathrm{Sym}^+}^2(\mathbf{C}_1, \mathbf{D}\mathbf{C}_2\mathbf{D}).$$

Using the equation above, we then aim to solve the following minimization problem:

$$\text{minimize } d_{\mathrm{Sym}^+}^2(\mathbf{C}_1, \mathbf{D}\mathbf{C}_2\mathbf{D}) \quad \text{subject to } \mathbf{D} \in \mathrm{Diag}^+(p). \qquad (4)$$

Assuming D* is a solution to the above problem, we define C̃₂ as the element in the fiber 𝜋⁻¹(C₂) which minimizes the Sym⁺-distance between C₁ and C̃₂,

$$\widetilde{\mathbf{C}}_2 = \mathbf{D}^* \mathbf{C}_2 \mathbf{D}^*.$$

The corresponding geodesic can be taken as the projection of the Sym⁺-geodesic connecting C₁ and C̃₂:

$$\gamma_{\mathrm{Corr}}(t) = \pi\!\left( \mathbf{C}_1^{1/2} \operatorname{Exp}\!\left(t \operatorname{Log}\!\left(\mathbf{C}_1^{-1/2}\widetilde{\mathbf{C}}_2\mathbf{C}_1^{-1/2}\right)\right) \mathbf{C}_1^{1/2} \right).$$

3 Applications

In this section we present two novel applications based on the theory introduced previously. The first one deals with the prediction of geological attributes at unknown locations when complex non-linear multivariate features are shown by the data. The second one is related to the clustering of data.

3.1 Extending The Linear Model of Coregionalization

As we mentioned earlier, the main contribution of this work is to see any geological process globally as a mixture of multivariate RVs on a given spatial domain 𝐷, acting locally with different properties that change smoothly throughout the positions u ∈ 𝐷. The correlation among attributes is the property that we consider as a function of u; since we consider standard Gaussian RVs, this yields the reproduction of the complex non-linear features among the variables. Therefore, this idea is a simple, linear, and geologically meaningful approach to estimation and uncertainty quantification at unknown locations when the mentioned characteristics are exhibited by the data.

3.1.1 The Model

The following model relies on the assumption that "simple" non-linear multivariate features can be reconstructed in a straightforward way by mapping the original 𝑝-variate cumulative distribution function to a 𝑝-variate Gaussian distribution equipped with a proper prior covariance matrix. This procedure is also known as the Nataf transformation (Nataf 1962) or NORTA (NORmal To All), and several properties of the transformation have been studied in different contexts, for instance in Cario and Nelson (1997), Ayadi et al. (2019), Xie et al. (2015), Xiao (2014), Li and Hammond (1975) and, in the geostatistical context, in Bourgault (2014). We start with a brief motivation, followed by the relevant theoretical aspects of the transformation.

Let Z̃ = [𝑍̃₁(u), . . . , 𝑍̃_𝑝(u)]ᵀ be the vector-valued random function (RF) considering 𝑝 simultaneous RFs 𝑍̃ᵢ = {𝑍̃ᵢ(u) : u ∈ 𝐷 ⊆ ℝⁿ, 𝑛 ≥ 1}, indexed by 𝑖 ranging in the set 𝐼 = {1, . . . , 𝑝} and defined on a fixed continuous domain of interest 𝐷 of the Euclidean space ℝⁿ. Let the sampling data be given by the multivariate vectors z̃_𝛼 = [𝑧̃₁(u_𝛼), . . . , 𝑧̃_𝑝(u_𝛼)]ᵀ, 𝛼 ∈ {1, . . . , 𝑘}. We face, as a main problem, that blindly transforming the values 𝑧̃ᵢ of the different RVs 𝑍̃ᵢ, each variable into univariate Gaussian values 𝑧ᵢ through

$$z_i = G^{-1}\!\left(F(\tilde{z}_i)\right),$$

does not translate into independent Gaussian variables 𝑍ᵢ. This is shown in the cross plots of Fig. 6, which give a comparative illustration of the original data and the data normally transformed in a univariate way: the data, previously correlated in raw values, remain correlated after the transformation.

Therefore, when modeling two or more variables by using a non-correlated multi-Gaussian pdf, the path is
prone to give bad results when estimating or simulating values and back-transforming into raw values, since the separate transformation entails an incorrect map of the multivariate probability densities. However, this problem is quickly fixed when a correlated Gaussian distribution is considered instead. This simple method works as a multivariate transformation, by coupling the univariate transformations 𝜙ᵢ.

Fig. 6 Correlation of variables before and after applying a normal score transformation. The variables are still correlated.

We define the non-coupled transformation of the initial multivariate RF Z̃ into a stationary 𝑝-variate Gaussian RF with zero mean vector 𝝁 = (0, . . . , 0)ᵀ = 0 and covariance matrix equal to the identity matrix I_𝑝, i.e., Z = [𝑍₁(u), . . . , 𝑍_𝑝(u)]ᵀ ∼ N(0, I_𝑝) (with (I_𝑝)ᵢᵢ = 1 and (I_𝑝)ᵢⱼ = 0 for 𝑖 ≠ 𝑗, 𝑖, 𝑗 ∈ 𝐼), by using the anamorphosis function 𝜙ᵢ⁻¹ on each of the components, so that

$$\widetilde{\mathbf{Z}} = [\widetilde{Z}_1(\mathbf{u}), \ldots, \widetilde{Z}_p(\mathbf{u})]^T = \left(\phi_1[Z_1(\mathbf{u})], \ldots, \phi_p[Z_p(\mathbf{u})]\right)^T = \Phi_{I_p}(\mathbf{Z}).$$

The coupled prior distribution of Z is still a 𝑝-variate Gaussian distribution N(0, 𝚺̂), with zero mean vector and correlation matrix given by

$$\widehat{\boldsymbol{\Sigma}} = \begin{pmatrix} 1 & \hat{\rho}_{12} & \cdots & \hat{\rho}_{1p} \\ \hat{\rho}_{21} & 1 & \cdots & \hat{\rho}_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\rho}_{p1} & \hat{\rho}_{p2} & \cdots & 1 \end{pmatrix}.$$

Then, the random variables 𝑍ᵢ are correlated and their pairwise relationships are quantified by the correlation coefficients 𝜌̂_{𝑍ᵢ𝑍ⱼ} (or simply 𝜌̂ᵢⱼ), 𝑖, 𝑗 ∈ 𝐼, which have to be inferred; we proceed to do this in the next section. The 𝑝-variate ccdf over the original variables is then retrieved simply as

$$F_{\widetilde{Z}_1(\mathbf{u}), \ldots, \widetilde{Z}_p(\mathbf{u})}(\tilde{z}_1, \ldots, \tilde{z}_p) = G_{\mathbf{0}, \widehat{\boldsymbol{\Sigma}}}\!\left( \phi_1^{-1}(\tilde{z}_1), \ldots, \phi_p^{-1}(\tilde{z}_p) \right). \qquad (5)$$

We will say that Z follows a coupled anamorphosis function, i.e., Z ∼ 𝚽(0, 𝚺̂). The transformation (or coupling process) is conceptually illustrated, for the bivariate case, in Figure 7.

It is important to mention that this transformation is well-defined, in the sense that the order of the variables does not play a role, and a permutation of them just translates into a permutation of the correlation coefficients in 𝚺̂. However, this procedure entails the severe hypothesis that the multivariate behavior of geological attributes can be modeled by assuming a correlated Gaussian distribution, which may be a lousy model globally. Instead, we take this hypothesis for granted locally in the geological domain.

Given the different RVs that describe ore deposits, Z̃ = [𝑍̃₁(u), . . . , 𝑍̃_𝑝(u)]ᵀ, we proceed to transform the variables into Gaussian RVs jointly, according to Eq. 5, in order to get the vector Z = [𝑍₁(u_𝛼), . . . , 𝑍_𝑝(u_𝛼)]ᵀ.
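A minimal sketch of this local transformation follows; it is my own illustration, with a rank-based normal-score transform standing in for the anamorphosis 𝜙ᵢ⁻¹ and the local correlation matrix estimated from the resulting Gaussian scores.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(x):
    """Rank-based normal-score (Gaussian anamorphosis) transform of one variable."""
    n = len(x)
    return norm.ppf(rankdata(x) / (n + 1.0))

def local_gaussianization(raw_neigh):
    """raw_neigh: array of shape (n_samples, p) with the raw values in a neighborhood
    V(u_alpha). Returns the Gaussian scores and the local correlation matrix Sigma(u_alpha)."""
    scores = np.column_stack([normal_scores(raw_neigh[:, j])
                              for j in range(raw_neigh.shape[1])])
    return scores, np.corrcoef(scores, rowvar=False)

# toy neighborhood: two positively related, non-linear (lognormal-like) variables
rng = np.random.default_rng(0)
a = rng.lognormal(size=800)
b = np.exp(0.8 * np.log(a) + 0.5 * rng.normal(size=800))
scores, Sigma = local_gaussianization(np.column_stack([a, b]))
print(Sigma)   # estimated local correlation of the coupled Gaussian scores
```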
Once we have a device to perform this transformation, we have to decide between two possible modeling options. On the one hand, to perform the Gaussianization in a "global" fashion, that is, gathering all the z̃_𝛼 data and performing only one transformation by running Eq. 5 once. On the other hand, to perform the Gaussianization "locally", which means collecting chunks of data in a vicinity 𝑉 of the location under study, z̃_𝛼, 𝛼 ∈ 𝑉 ⊂ {1, . . . , 𝑘}, in a moving-neighborhood fashion, noticing that 𝑉 may be a function of the location under study, 𝑉(u). This is a non-trivial choice to make. We take the second path in our model, since there is no loss of generality and it contains the case in which Z̃ is stationary, as performing the Gaussianization locally should not be theoretically biased. As the next step in the methodology is the inference of a local correlation matrix 𝚺(u_𝛼) at the sampling locations, taking the first path of performing a global transformation and later taking chunks of data would give, as a result, the inference of a 𝚺(u_𝛼) matrix based on data with a locally non-zero mean. Finally, the second path does not contradict the traditional methodology for uncertainty modeling, which consists in partitioning the data into stationary clusters and continuing the work individually on each of the clusters separately.

Fig. 7 Conceptual bi-variate picture of the adjustment in the correlated behavior of an independent raw distribution through the Gaussian coupled anamorphosis. Panels: uncoupled raw distribution 𝚽(0, I), coupled raw distribution 𝚽(0, 𝚺̂), uncoupled Gaussian distribution N(0, I) and coupled Gaussian distribution N(0, 𝚺̂). The surfaces show the corresponding cdfs, and the contour plots the pdfs.

Now we proceed to deal with the problem of interpolating the different known correlation matrices in space. The LMC is brought into play, assuming that each variable consists of a sum of 𝑝 independent factors:

$$Z_i(\mathbf{u}) = \sum_{j=1}^{p} a_{ij}(\mathbf{u})\, Y_j(\mathbf{u}), \qquad (6)$$

with the number of factors equal to the number of attributes in order to avoid ill-definition as a linear system (the problem of working in the stationary set-up with a number of factors different from the number of attributes has been recently tackled by Pinto et al. (2021), and including their methodology into the presented one is a topic of further research). We thus obtain a model of the form Z = AY ∼ N(0, 𝚺(u)).

A second difficulty is to find an appropriate decomposition 𝚺(u) = A(u)Aᵀ(u), in order to proceed later with the decoupling of Z and be able to work with independent variables Y(u) = A(u)⁻¹Z(u) ∼ N(0, I_𝑝). One could suggest the use of the eigen-decomposition 𝚺 = UDUᵀ in order to take Z = UD^{1/2}Y, but used in an automated way it may result in a model with spatial discontinuities, owing to the non-uniqueness of this decomposition. 𝚺 can, however, be uniquely decomposed as the product of a positive-diagonal lower triangular matrix and its transpose by the Cholesky decomposition, 𝚺 = LLᵀ, which is a suitable choice for our purposes.

Once a continuous decomposition of 𝚺(u) and the independent variables are obtained, the overall process of estimation and simulation becomes straightforward, by working individually on the spatial behavior of each of the variables separately.

One last difficulty to overcome comes from the fact that, once the Cholesky decomposition is applied, one notices that the following transformation also works: Z = LRY, with R a rotation matrix, as any decomposition of the form 𝚺 = LRRᵀLᵀ is valid. This is problematic since the model acquires an extra free parameter, which is a source of ill-definition for our model if different spatial models are involved in the 𝑌ᵢ(u) variables (Fig. 8). Whether there is a way of finding a suitable R and fixing this parameter is a topic of further research. In order to simplify these issues and the overall methodology at this point, we take 𝑌ᵢ(u) to follow the same variogram model for all 𝑖 ∈ 𝐼.

Fig. 8 Effect of the rotation and mixing of two independent Gaussian RFs Y = [𝑌₁(u), 𝑌₂(u)]ᵀ ∼ N(0, I₂) with different spatial continuity (top) to obtain Z = LRY for two different choices of R (middle and bottom), taking L as the Cholesky factor of the correlation map with off-diagonal entry 𝜌₁₂(u) shown on top.
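The decoupling and recoupling just described amount to a triangular solve and a matrix product; a minimal sketch (my own illustration, not the author's code):

```python
import numpy as np

def decorrelate(Z, Sigma):
    """Y = L^{-1} Z with Sigma = L L^T (Cholesky); Y ~ N(0, I_p) when Z ~ N(0, Sigma)."""
    return np.linalg.solve(np.linalg.cholesky(Sigma), Z)

def recorrelate(Y, Sigma):
    """Z = L Y: re-impose the local correlation Sigma(u) on independent factors Y."""
    return np.linalg.cholesky(Sigma) @ Y

Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Z = recorrelate(np.random.default_rng(1).normal(size=(2, 5000)), Sigma)
Y = decorrelate(Z, Sigma)
print(np.corrcoef(Y))                          # close to the identity matrix
print(np.allclose(recorrelate(Y, Sigma), Z))   # True: the operation is invertible
```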
3.1.2 Interpolation of the Correlation Matrices

We present a fixed-point and a gradient descent algorithm which seek to minimize the mean-squared distance to Sym⁺(𝑝)- and Corr(𝑝)-valued observations, respectively, with respect to the affine-invariant distance. The general optimization procedure for Corr(𝑝) is proposed by David (2019) and takes the following steps:

1. At the current iterate C_𝑡 ∈ Corr(𝑝), find all appropriate distances utilizing the fiber structure of Sym⁺/Diag⁺.
2. Interpret C_𝑡 ∈ Sym⁺(𝑝) and perform the update to a point P_{𝑡+1} ∈ Sym⁺(𝑝).
3. Obtain the next iterate of the algorithm by projecting back to Corr(𝑝), that is, C_{𝑡+1} = 𝜋(P_{𝑡+1}).

We begin by summarizing the optimization methods on Sym⁺(𝑝) and Corr(𝑝).

3.1.2.1 Optimizing on Sym⁺(𝑝)

Given the observations P₁, . . . , P_𝑘 ∈ Sym⁺(𝑝), one could consider the arithmetic mean of the 𝑘 labeled covariance matrices {Pᵢ}ᵢ₌₁ᵏ,

$$\widehat{\boldsymbol{\Sigma}} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{P}_i,$$

which does not account for any intrinsic geometric property of Sym⁺(𝑝). We consider, instead, the use of the geometric or Fréchet mean, introduced in the Sym⁺(𝑝) context by Moakher (2005). Such a matrix is defined as follows:

$$\widehat{\boldsymbol{\Sigma}} = \arg\inf_{\boldsymbol{\Sigma}} \sum_{i=1}^{k} d_{\mathrm{Sym}^+}^2(\mathbf{P}_i, \boldsymbol{\Sigma}). \qquad (7)$$

Recall that the Riemannian distance between two SPD matrices is defined as

$$d_{\mathrm{Sym}^+}^2(\mathbf{P}_i, \boldsymbol{\Sigma}) = \left\| \log_{\mathbf{P}_i}(\boldsymbol{\Sigma}) \right\|^2 = \left\| \operatorname{Log}\!\left(\mathbf{P}_i^{-1/2}\boldsymbol{\Sigma}\mathbf{P}_i^{-1/2}\right) \right\|^2 = \operatorname{tr}\!\left[ \operatorname{Log}^2\!\left(\mathbf{P}_i^{-1/2}\boldsymbol{\Sigma}\mathbf{P}_i^{-1/2}\right) \right],$$

and, therefore, minimizing Eq. 7 seems impossible to solve in closed form, according to Moakher (2006). The same author describes a fixed-point algorithm to numerically compute the geometric mean of a set of symmetric positive-definite matrices. Other methods, such as Newton's method on Riemannian manifolds (David 2019), could also be used for the numerical computation of the geometric mean. However, the fixed-point algorithm described below is simple to implement, does not require sophisticated machinery, and converges rapidly.

The geometric mean 𝚺̂ can be computed efficiently by an iterative procedure consisting in: projecting the covariance matrices into the tangent space, estimating the arithmetic mean in the tangent space, and projecting the arithmetic mean back onto the manifold. These three steps are iterated until convergence.

If we want to account for the spatial configuration of the data, we need to consider the use of the weighted Fréchet mean,

$$\widehat{\boldsymbol{\Sigma}} = \arg\inf_{\boldsymbol{\Sigma}} \sum_{i=1}^{k} \lambda_i\, d_{\mathrm{Sym}^+}^2(\mathbf{P}_i, \boldsymbol{\Sigma}), \qquad \sum_{i=1}^{k} \lambda_i = 1,$$

with 𝜆ᵢ the weights obtained from the kriging interpolation. The algorithm in this case is given by slightly modifying the one taken from Moakher (2006):
Algorithm 1 Weighted mean of 𝑘 SPD matrices
Require: a set of 𝑘 SPD matrices P₁, . . . , P_𝑘 ∈ Sym⁺(𝑝), weights 𝜆₁, . . . , 𝜆_𝑘, and 𝜖 > 0.
1: Initialize 𝚺̂⁽¹⁾ = Σᵢ₌₁ᵏ 𝜆ᵢ Pᵢ
2: repeat
3:   P̄ = Σᵢ₌₁ᵏ 𝜆ᵢ log_{𝚺̂⁽ᵗ⁾}(Pᵢ)   ⊳ Weighted mean in the tangent space
4:   𝚺̂⁽ᵗ⁺¹⁾ = exp_{𝚺̂⁽ᵗ⁾}(P̄)
5: until ‖P̄‖ < 𝜖
6: return 𝚺̂⁽ᵗ⁺¹⁾

3.1.2.2 Optimizing Along Fibers

In the same fashion as previously, given the observations C₁, . . . , C_𝑘 ∈ Corr(𝑝), we are interested in finding

$$\widehat{\mathbf{C}} = \arg\inf_{\mathbf{C}} \sum_{i=1}^{k} \lambda_i\, d_{\mathrm{Corr}}^2(\mathbf{C}_i, \mathbf{C}), \qquad \sum_{i=1}^{k} \lambda_i = 1. \qquad (8)$$

Recall again that the distance between Cᵢ, C ∈ Corr(𝑝) is given by

$$d_{\mathrm{Corr}}^2(\mathbf{C}_i, \mathbf{C}) = \inf_{\mathbf{D} \in \mathrm{Diag}^+(p)} d_{\mathrm{Sym}^+}^2(\mathbf{C}_i, \mathbf{D}\mathbf{C}\mathbf{D}) = \inf_{\mathbf{D} \in \mathrm{Diag}^+(p)} \operatorname{tr}\!\left[ \operatorname{Log}^2\!\left( \mathbf{C}_i^{-1/2}\, \mathbf{D}\mathbf{C}\mathbf{D}\, \mathbf{C}_i^{-1/2} \right) \right],$$

where we note that one can fix C and then optimize over the fiber of Cᵢ as well, by symmetry. The minimization of the distance between an iterate C_𝑡 of the algorithm and all of the observations C₁, . . . , C_𝑘 is preferred. Hence, the algorithm is arranged to always keep the iterate fixed and then optimize along the fibers of the given observations. In this way, it is guaranteed that the iterated point is updated consistently; otherwise, one would end up with different optimal points, not yielding a consistent base point.

In finding the optimal point, a gradient descent method is employed on the set Diag⁺(𝑝) with respect to the objective function

$$g_i(\mathbf{D}) = d_{\mathrm{Sym}^+}^2(\mathbf{C}, \mathbf{D}\mathbf{C}_i\mathbf{D}).$$

The gradient descent algorithm to find the optimal D in the above expression is proposed by David (2019). The algorithm's derivation is long and tedious, and we refer to the mentioned author for further details. One ends up, however, with a brief two-step iterative algorithm: using a step size 𝛿 > 0 and initializing D₀ = I_𝑝, one follows the iterative steps

$$\boldsymbol{\Delta}_t = \mathbf{I} \circ 2\,\mathrm{Sym}\!\left[ \mathbf{D}_t \operatorname{Log}\!\left( \mathbf{C}_i \mathbf{D}_t \mathbf{C}^{-1} \mathbf{D}_t \right) \right], \qquad \mathbf{D}_{t+1} = \mathbf{D}_t \operatorname{Exp}\!\left( -\delta\, \mathbf{D}_t^{-1} \boldsymbol{\Delta}_t \right),$$

with Sym(A) = ½(A + Aᵀ), until a desired stopping criterion is reached. Once we find an optimal element D* ∈ Diag⁺(𝑝) as a result of minimizing 𝑔ᵢ(D), we define C̃ᵢ = D*CᵢD* as the element over the fiber 𝜋⁻¹(Cᵢ) which minimizes the Sym⁺-distance to C.

We summarize the proposed algorithm which finds the Fréchet mean on Corr(𝑝):

Algorithm 2 Weighted mean of 𝑘 Correlation matrices
Require: a set of 𝑘 correlation matrices C₁, . . . , C_𝑘 ∈ Corr(𝑝), 𝜖 > 0, initial point C₀ = Σᵢ₌₁ᵏ 𝜆ᵢ Cᵢ, step size 𝛿 > 0.
1: 𝑡 = 0
2: while stopping criterion not met do
3:   for 𝑖 = 1, . . . , 𝑘 do
4:     Require: initial point D₀; set 𝑛 = 1
5:     while stopping criterion not met do
6:       𝚫ₙ = I ∘ 2 Sym[Dₙ Log(Cᵢ Dₙ C_𝑡⁻¹ Dₙ)]
7:       Dₙ₊₁ = Dₙ Exp(−𝛿 Dₙ⁻¹ 𝚫ₙ)
8:       𝑛 = 𝑛 + 1
9:     end while
10:    C̃ᵢ = D_{𝑛_max} Cᵢ D_{𝑛_max}
11:  end for
12:  P̄_𝑡 = Σᵢ₌₁ᵏ 𝜆ᵢ log_{C_𝑡}(C̃ᵢ)   ⊳ Mean in the tangent space of Sym⁺(𝑝)
13:  𝚺_{𝑡+1} = exp_{C_𝑡}(P̄_𝑡)
14:  C_{𝑡+1} = 𝜋(𝚺_{𝑡+1}) = (I_𝑝 ∘ 𝚺_{𝑡+1})^{−1/2} 𝚺_{𝑡+1} (I_𝑝 ∘ 𝚺_{𝑡+1})^{−1/2}   ⊳ Project back to Corr(𝑝)
15:  𝑡 = 𝑡 + 1
16: end while
17: return C_𝑡

3.1.3 Methodology

Now that we have gone through the steps for interpolating the correlation matrices, we summarize the proposed methodology for extending the LMC, consisting of the following steps, where the first and the last step are optional and suggested when the data is compositional:

1) (Perform the log-ratio transformation on the data, if constraint conditions are present).
2) At each location u_𝛼 with an observation, find the nearest 𝑙 samples.
3) Perform the Gaussian transformation individually for each variable, 𝜙ᵢ⁻¹[𝑍̃ᵢ(u_𝛼)] = 𝑍ᵢ(u_𝛼), locally using the nearest 𝑙 samples.
4) Compute the correlation matrix 𝚺(u_𝛼) of the vector Z = [𝑍₁(u_𝛼), . . . , 𝑍_𝑝(u_𝛼)].
   a) Take the Cholesky decomposition 𝚺(u_𝛼) = LLᵀ and apply L⁻¹Z = Y to decorrelate the Gaussian variables.
   b) Interpolate 𝚺(u_𝛼) over the domain 𝐷 using the weighted Fréchet mean and your favorite set of weights 𝜆ᵢ; kriging weights given by the variogram modeling of Y work appropriately (a sketch of this weighted mean is given after this list).
5) Perform variogram modeling and simulation of 𝑌ᵢ, assuming the same model for all 𝑖 ∈ 𝐼.
6) At each unsampled location u, take the estimated correlation matrix 𝚺̂(u), perform its Cholesky decomposition, and recover Z(u) = L̂(u)Y(u).
7) At the unsampled location u, find the nearest 𝑙 samples, perform the Gaussian transformation individually for each variable, 𝜙̂ᵢ⁻¹[𝑍̃ᵢ(u)] = 𝑍ᵢ(u), and recover the value 𝑧̃ᵢ(u) = 𝜙̂ᵢ[𝑧ᵢ(u)].
8) (Perform the log-ratio back-transformation on the data).
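For the interpolation in step 4b, the sketch below mirrors Algorithm 1: a weighted Fréchet mean of SPD matrices obtained by iterating a weighted average in the tangent space. The helpers follow Eqs. (1)–(2); the tolerance is an arbitrary choice of mine, the weights are assumed positive and summing to one, and the final re-projection onto Corr(𝑝) corresponds to step 14 of Algorithm 2.

```python
import numpy as np

def _sym_fun(A, fun):
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spd_exp(P, Y):
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ Y @ Pih, np.exp) @ Ph

def spd_log(P, X):
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ X @ Pih, np.log) @ Ph

def weighted_frechet_mean(mats, weights, tol=1e-10, max_iter=100):
    """Algorithm 1: start from the weighted Euclidean mean, then repeatedly average
    the observations in the tangent space at the current estimate and map back."""
    mu = sum(w * P for w, P in zip(weights, mats))
    for _ in range(max_iter):
        T = sum(w * spd_log(mu, P) for w, P in zip(weights, mats))
        mu = spd_exp(mu, T)
        if np.linalg.norm(T, "fro") < tol:
            break
    return mu

C1 = np.array([[1.0, 0.9], [0.9, 1.0]])
C2 = np.array([[1.0, -0.3], [-0.3, 1.0]])
mu = weighted_frechet_mean([C1, C2], [0.5, 0.5])   # geodesic midpoint in Sym+(2)
D = np.diag(1.0 / np.sqrt(np.diag(mu)))
print(D @ mu @ D)   # projection pi back onto Corr(2), as in step 14 of Algorithm 2
```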
3.2 Geological Domaining
Now we move to a second novel application of the concepts presented in the previous sections. We consider the classical problem in geostatistics of clustering data which carry continuous information in space, 𝑧(u) (such as grades), where u is the location vector in three-dimensional space (u ∈ ℝ³). However, 𝑧 has only been sampled at a discrete set of points {u_𝛼, 𝛼 ∈ 𝑁 = 1, . . . , 𝑛}. From these measurements, we have some intuition that there is an unknown finite collection 𝐴₁, 𝐴₂, . . . , 𝐴_𝑘 of disjoint subsets of 𝑁, with ∪ᵢ₌₁ᵏ 𝐴ᵢ = 𝑁, such that the measurements 𝑧(u_𝛼), 𝛼 ∈ 𝐴_𝑙, and 𝑧(u_𝛽), 𝛽 ∈ 𝐴_𝑚, for all 𝑙 ≠ 𝑚, have a low relationship (or none at all) between them (for example, because they have a different genesis) and, therefore, should be clustered into different categories (typical examples are lithofacies types). We want to find the collection 𝐴₁, 𝐴₂, . . . , 𝐴_𝑘.
Methodologies able to deal with this problem have several significant applications. One of the most important is the definition of stationary spatial domains, where the assumption of a fairly constant mean within a given spatial domain is critical for some aspects of resource estimation; no resource estimation is done without the definition of stationary units. This process is known, in geoscientific terms, as the definition of geological units. Most of the time, geological domaining is done based on non-continuous attributes (lithology and alteration of the rock) which are related to, or somehow explain, the values of the continuous data. Sometimes, however, the categorical information is not enough to define the units by itself.

The methodology presented next implements the 𝐾-means algorithm on Corr(𝑝) and is an alternative way to include the spatial information carried by continuous data; it should help the geo-modeler to decide boundaries for geological units in cases of fuzzy or contradictory categorical data.
3.2.1 Methodology
Following an idea proposed in You and Park (2021) for the SPD case, we implemented a 𝐾-means algorithm modified to our context. The 𝐾-means algorithm (MacQueen et al. 1967) is one of the most famous clustering algorithms for data analysis. As pointed out in Goh and Vidal (2008), the method is easily extensible to non-Euclidean data, as it solely depends on the distance measure used to determine class memberships. The algorithm proceeds as follows:
1) Randomly choose 𝐾 correlation matrices as cluster means, 𝝁₁⁽¹⁾, . . . , 𝝁_𝐾⁽¹⁾, where the upper index refers to the iteration number.
2) Repeat the following steps until convergence:
   a) Assign each observation to the cluster whose center is at the smallest distance,

$$S_i^{(t)} = \left\{ \boldsymbol{\Sigma}_\alpha : d\big(\boldsymbol{\Sigma}_\alpha, \boldsymbol{\mu}_i^{(t)}\big) \le d\big(\boldsymbol{\Sigma}_\alpha, \boldsymbol{\mu}_j^{(t)}\big) \text{ for all } 1 \le j \le K \right\},$$

   and when it comes to a situation where an observation could belong to more than one cluster, assign the cluster randomly.
   b) Update the cluster centroids by the Fréchet means of each class,

$$\boldsymbol{\mu}_i^{(t+1)} = \arg\min_{\mathbf{P} \in \mathrm{Corr}(p)} \sum_{\alpha \in S_i^{(t)}} d^2\big(\mathbf{P}, \boldsymbol{\Sigma}_\alpha\big), \qquad i = 1, \ldots, K.$$

We have tested the algorithm (a compact sketch is given below) on Corr(3), that is, on the elliptope (Fig. 9), and later on a set of correlation matrices interpolated in space (Fig. 10), in both cases with good results in terms of the continuity of the clusters.
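The sketch below is a compact version of this loop (my own simplification: the ambient affine-invariant distance on Sym⁺(𝑝) is used in the assignment step instead of the full quotient distance of Sect. 2.3.3, and the centroid update is a Fréchet-mean iteration followed by the projection 𝜋 onto Corr(𝑝)); swapping in the quotient distance recovers the full method.

```python
import numpy as np

def _sym_fun(A, fun):
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spd_dist(P1, P2):
    P1_ih = _sym_fun(P1, lambda w: w ** -0.5)
    return np.linalg.norm(_sym_fun(P1_ih @ P2 @ P1_ih, np.log), "fro")

def spd_exp(P, Y):
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ Y @ Pih, np.exp) @ Ph

def spd_log(P, X):
    Ph, Pih = _sym_fun(P, np.sqrt), _sym_fun(P, lambda w: w ** -0.5)
    return Ph @ _sym_fun(Pih @ X @ Pih, np.log) @ Ph

def proj_corr(S):
    D = np.diag(1.0 / np.sqrt(np.diag(S)))
    return D @ S @ D

def frechet_mean_corr(mats, n_iter=30):
    """Fréchet-mean iteration kept on Corr(p) via the projection pi at each step."""
    mu = proj_corr(sum(mats) / len(mats))
    for _ in range(n_iter):
        mu = proj_corr(spd_exp(mu, sum(spd_log(mu, P) for P in mats) / len(mats)))
    return mu

def kmeans_corr(corrs, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = [corrs[i] for i in rng.choice(len(corrs), K, replace=False)]
    for _ in range(n_iter):
        labels = np.array([np.argmin([spd_dist(C, m) for m in centers]) for C in corrs])
        centers = [frechet_mean_corr([corrs[i] for i in np.where(labels == k)[0]])
                   if np.any(labels == k) else centers[k] for k in range(K)]
    return labels, centers

# toy run: three groups of 2x2 correlation matrices clustered around rho = -0.6, 0.0, 0.7
rng = np.random.default_rng(1)
rhos = np.clip(np.concatenate([r + 0.05 * rng.normal(size=20) for r in (-0.6, 0.0, 0.7)]), -0.95, 0.95)
corrs = [np.array([[1.0, r], [r, 1.0]]) for r in rhos]
labels, _ = kmeans_corr(corrs, K=3)
```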
Fig. 9 Clustering of random samples on the Corr(3) manifold using the 𝐾-means algorithm with 𝐾 = 4. Iterations 1, 2 and 5 are displayed from the top, respectively.
4 Case Study
4.1 The Data
Fig. 10 Clustering on the Corr(3) manifold using the 𝐾-means algorithm with 𝐾 = 4, on a distribution of interpolated ellipses in space. Iterations 1, 2 and 11 are displayed from the top, respectively.

In order to show the capabilities of the techniques described previously, a data set obtained from a blast-hole campaign pertaining to a nickel-laterite deposit is considered, with six cross-correlated variables isotopically assayed at each sample point: Fe, Ni, MgO, SiO2, Al2O3, and Cr. Isotopic sampling ensures that all the variables are available at all sample locations (Wackernagel 2013). The data set includes 9990 samples with a very dense sampling pattern. The name and location of the data set cannot be disclosed for confidentiality reasons.

The primary inspection of the multivariate relations (scatter plots shown in Fig. 11) exposed many aspects of complexity, such as non-linearity and heteroscedasticity. A map of the samples for each variable is presented in Fig. 12. In order to assess the predictability of the proposed methodology, 500 random samples are selected and set aside for testing purposes.
Fig. 11 Display of multivariate features of the sampling data of the nickel-laterite deposit. Four out of the six variables can be seen in the scatter plots, by adding a color and a variable bullet diameter proportional to the amount of alumina (top). The geological codes provided are displayed as well (bottom).
4.2 Variography
As described in the proposed methodology, we begin by applying the additive log-ratio transformation to the data, taken with respect to the Rest variable (Rest = 100% − Ni% − · · · − Cr%), as this extra variable may give us further information for the prediction (a small sketch of this transform is given at the end of this subsection). The Gaussian transformation is applied at each sample location by selecting a neighborhood of the closest 800 samples. This parameter was calibrated several times, showing that working with less data drastically reduces the ability to reproduce the multivariate behavior shown by the data, as the correlation matrix gets distorted with a low amount of data.

Fig. 12 Isometric view showing the sampling grade information.

Once the data is Gaussianized and de-correlated after obtaining the correlation matrix, the experimental direct and cross omni-directional variograms are calculated. Variogram analysis in different directions was not considered, as the amount of data in the vertical direction is much smaller than in the horizontal ones. This last aspect, however, is included later in the search radius for estimation.
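The additive log-ratio step mentioned above is standard practice for compositional data; a small sketch of my own, with the Rest variable used as the common denominator and a toy composition in weight percent:

```python
import numpy as np

def alr(comp):
    """Additive log-ratio of a composition whose rows sum to 100 %;
    the last column is the 'Rest' variable used as the common denominator."""
    return np.log(comp[:, :-1] / comp[:, -1:])

def alr_back(y):
    """Back-transform alr coordinates to a composition closed to 100 %."""
    expanded = np.hstack([np.exp(y), np.ones((y.shape[0], 1))])
    return 100.0 * expanded / expanded.sum(axis=1, keepdims=True)

# toy grades (%): Fe, Ni, MgO, SiO2, Al2O3, Cr, and Rest = 100 % minus their sum
grades = np.array([[38.0, 1.2, 12.0, 30.0, 4.0, 1.5]])
comp = np.hstack([grades, 100.0 - grades.sum(axis=1, keepdims=True)])
print(np.allclose(alr_back(alr(comp)), comp))   # True: the transform is invertible
```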
Variogram analysis and calibration is the weakest point of the methodology. A first complication is that the de-correlation breaks the marginal Gaussianity of the factors Y = L⁻¹Z, suggesting that the assumption of multi-Gaussianity of Z is not a perfect model at every location. As a consequence, the experimental variance of the factors Y does not attain the value of 1, although it gets close for a couple of factors. This can be seen in the sills of the experimental variograms in Fig. 13. Nevertheless, a one-structure exponential variogram with a range of 10 m is fixed as the final model, fitting relatively well most of the direct variograms. Cross variograms show low correlation among the variables, as expected; however, the sill does not attain 1 in some of the cases, in the same fashion as previously described. This last effect was neither considered nor included.
4.3 Results
Once the single variogram model is derived, one can simply work individually on each of the factors. A grid with a mesh dimension of 2×2×2 (in meters) and a mesh size of 75, 90 and 25 along the east, north, and elevation coordinates is considered. We proceed to generate 1000 geostatistical simulations using the turning bands algorithm (Chilès and Delfiner 2012), picking 1200 directions. A moving neighborhood is selected, with a search radius of 100 m, up to 25 data, and without considering octants. This number is chosen arbitrarily, as the scope of this study is mainly focused on the examination of uncertainty, the number of data chosen for simulation not being relevant for the study. The simulated factors are later correlated according to the estimated correlation, interpolated by ordinary kriging (in order to get weights adding up to 1) at each location of the grid using the same variogram model as for the factors, and then back-transformed from Gaussian values and from log-ratios into the raw distribution. The filtered grid for the mean of the simulations in the nickel case, excluding the nodes far from sample data, is shown in Fig. 14. The estimated correlations at sample locations and their interpolation on the regular grid are shown in Fig. 15. The maps showing the mean of the simulations at level 95 m are given in Fig. 16 for the six back-transformed cross-correlated variables. The results reproduce the cross-correlation trends in the maps. For instance, there is a strong negative correlation between Fe and MgO, which can be corroborated by visual inspection.
The non-linear behavior among the variables is well reproduced. This is shown in Fig. 17 for the mean of the simulations and for one individual simulation, around level 95 m. Scatter plots showing all the bi-variate relations for the mean of the simulations are shown in Fig. 18, together with the results of the variography. Variograms are well reproduced, apart from the issues discussed previously. It is quite impressive how well fitted most of the direct and cross variograms are, given the fact that only one variogram model was considered for the purposes of the presented methodology.
In order to test the predictive and uncertainty-assessment capabilities of the methodology, we bring back the testing data left out in the first part of the case study. Each testing datum was linked to the closest node of the grid, retaining only the data within less than 2.5 m of the corresponding node to avoid distortions in the results, finally leaving 366 of the initial 500 samples to be considered in the uncertainty analysis.
The resulting pdfs from the simulations are obtained and shown in Fig. 19 for 50 samples; we pick this small window to inspect the results in detail. The realizations are displayed as light gray lines, and the mean estimation of the simulations (black dots) is shown for the seven variables at the 366 samples. Red dots represent the true grades of the samples. The 5% and 95% percentile lines are displayed in black to give a 90% confidence area.
Figure 20 shows the scatter plots comparing the estimated mean of the volumes versus their ground-truth value, as the mean value is often taken as a predictor of the real value. Low bias in the prediction and high correlation values are obtained, varying from a lowest value of 0.78 (in the case of nickel) to 0.95 (in the case of iron).
The main advantage of simulating is that we can validate whether the decisions made in the previous steps were correct. The validation is completed with the generation of an accuracy plot to check that the uncertainty given by the pdfs effectively represents the experimental frequencies of the ground truth of the testing data (Fig. 21).
Finally, in Figs. 22 and 23 we present the clustering algorithm results, with 𝐾 = 5, which show high consistency between the clusters obtained at the last stage and the geological units tagged in the database. The algorithm even seems to capture the directions of continuity shown by the grades in the data. There is spatial continuity in the units provided by the algorithm and, despite starting with some artifacts at the initial step, already by iteration three one can anticipate the definitive zones, demonstrating it to be a promising tool for the delimitation of stationary units.

Fig. 13 Experimental variograms of the Gaussian factors, and the final model used.

Fig. 14 Grid used for the study showing the estimated mean and sampling data, for the nickel case.

5 Conclusions
We have shown how multivariate data can be modeled and understood as a RF lying on a correlation manifold, where one can map every data sample into this topological space. By using this tool, two applications follow: first, the interpolation of the different known correlation matrices throughout the domain with the purpose of reproducing the non-linear multivariate features of the data; and second, an application which deals with the problem of clustering multivariate data. In summary, a conceptually simple and novel methodology has been proposed to account for non-linearity in multivariate data, with reasonably good results that reproduce the multivariate behavior. By integrating some basic aspects of Riemannian geometry and the well-known machinery for handling SPD matrices into the geostatistical setting, we gain enough flexibility to reproduce the mentioned complex multivariate behavior and, at the same time, improve our understanding of the geological data. Implementing the interpolation of correlation matrices for carrying the local linear multivariate relationships is a key step in the good reproduction of the data behavior.

Fig. 15 Estimated correlation matrices at sample positions (correlation of Ni, Fe and Mg), represented by ellipses (left). Interpolation of the correlation matrices on a regular grid (right). The color of the ellipsoids is related to their anisotropy (bottom). From left to right: isotropic tensor, planar tensor (flat ellipsoid) (𝜆1 ≃ 𝜆2 > 𝜆3), elongated ellipsoid (𝜆1 ≫ 𝜆2 ≥ 𝜆3).
Among the limitations, we mention that the proposed methodology only works with enough data to estimate the correlation locally. As with other methodologies that try to handle non-stationarity, when limited data is available it is better to simplify the problem and assume stationarity of the data, as the calibration of hyperparameters, such as the correlation matrix at the different locations, may become hard to achieve. The variography also becomes theoretically challenging to handle and interpret under the assumption of different underlying structures, since working with different models of spatial continuity for the different structures is a tricky decision to make, as a "rotation" of the structures is a free parameter and a valid model that also fits the spatial correlation among variables. A third issue is that the definition of stationary geological domains beforehand replaces the presented methodology; if the multivariate behavior changes "continuously", the proposed methodology may be a promising approach for handling non-stationarity.
As part of future research, we propose developing a synthetic study to fully understand some crucial details of the methodology, such as the sensitivity analysis of the main parameters involved and the impact of their adjustment. Among these parameters, which may be critical for the method, we mention the local neighborhood from which the correlation matrix is obtained at sample locations. How sensitive the estimation of this correlation matrix is to the number of data used, and the possibility of using a variable amount of data, are some of the open questions to be answered. Further efforts have to be made to give meaning to the variogram modeling step when including different models of spatial structure, and to the effect of performing the linear mixing. A base case using traditional alternatives, such as splitting the data into stationary domains, is pending as well, in order to have a way of comparing the improvement in estimation made by the presented methodology, if any.

Finally, the possibility of investigating a statistical approach to the modeling of correlation matrices is an interesting path to follow, setting as an objective to obtain a probability distribution of correlation matrices at unknown locations, improving our capabilities and understanding when modeling uncertainty by building different scenarios sampled from such distributions.
Acknowledgements The author acknowledges the funding provided by the International Association for Mathematical Geosciences (IAMG) student grant, funding reference number MG-2020-14, and by the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference numbers RGPIN-2017-04200 and RGPAS-2017-507956.
References
Adler RJ, Taylor JE (2007) Random fields and geometry.
Springer New York
Almendral A, Abrahamsen P, Hauge R (2008) Multidimensional scaling and anisotropic covariance functions. In: Proceedings of the Eighth International Geostatistics Congress, GECAMIN Ltd, pp 187–196
Ayadi MA, Ben-Ameur H, Channouf N, Tran QK (2019)
Norta for portfolio credit risk. Annals of Operations Research 281(1):99–119
Barnett RM, Manchuk JG, Deutsch CV (2014) Projection pursuit multivariate transform. Mathematical Geosciences 46(3):337–359
Fig. 16 Plan view showing the estimated mean of the simulations at level 95 m and sampling data.
Boisvert J, Manchuk J, Deutsch C (2009) Kriging in the presence of locally varying anisotropy using non-euclidean distances. Mathematical Geosciences 41(5):585–601
van den Boogaart KG, Mueller U, Tolosana-Delgado R (2017)
An affine equivariant multivariate normal score transform for compositional data. Mathematical Geosciences
49(2):231–251
Bourgault G (2014) Revisiting multi-gaussian kriging with
the nataf transformation or the bayes’ rule for the estimation of spatial distributions. Mathematical Geosciences
46(7):841–868
Cario MC, Nelson BL (1997) Modeling and generating random vectors with arbitrary marginal distributions and
correlation matrix. Tech. rep., Citeseer
Chilès JP, Delfiner P (2012) Geostatistics: Modeling Spatial
Uncertainty
David P (2019) A riemannian quotient structure for correlation matrices with applications to data science. PhD
thesis, The Claremont Graduate University
Fouedjio F, Desassis N, Romary T (2015) Estimation of space
deformation model for non-stationary random functions.
Spatial statistics 13:45–61
Gelfand AE, Kim HJ, Sirmans C, Banerjee S (2003) Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association
98(462):387–396
Fig. 17 Scatter plots for the sampling data, the mean of the grid values, and one simulation, respectively from the left, around level 95 m.
Goh A, Vidal R (2008) Clustering and dimensionality reduction on riemannian manifolds. In: 2008 IEEE Conference
on Computer Vision and Pattern Recognition, IEEE, pp
1–7
Huckemann S, Hotz T, Munk A (2010) Intrinsic shape analysis: Geodesic pca for riemannian manifolds modulo isometric lie group actions. Statistica Sinica pp 1–58
Journel AG, Huijbregts CJ (1978) Mining geostatistics, vol
600. Academic press London
Leuangthong O, Deutsch CV (2003) Stepwise conditional
transformation for simulation of multiple variables. Mathematical Geology 35(2):155–173
Li ST, Hammond JL (1975) Generation of pseudorandom
numbers with specified univariate distributions and correlation coefficients. IEEE Transactions on Systems, Man,
and Cybernetics (5):557–561
MacQueen J, et al. (1967) Some methods for classification and
analysis of multivariate observations. In: Proceedings of
the fifth Berkeley symposium on mathematical statistics
and probability, Oakland, CA, USA, vol 1, pp 281–297
Moakher M (2005) A differential geometric approach to
the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications
26(3):735–747
Moakher M (2006) On the averaging of symmetric positivedefinite tensors. Journal of Elasticity 82(3):273–296
Nataf A (1962) Détermination des distributions dont les marges sont données. Comptes Rendus de l'Académie des Sciences 225:42–43
Pearson K (1901) Liii. on lines and planes of closest fit
to systems of points in space. The London, Edinburgh,
and Dublin philosophical magazine and journal of science
2(11):559–572
Pinto FC, Manchuk JG, Deutsch CV (2021) Decomposition
of multivariate spatial data into latent factors. Computers
& Geosciences 153:104773
Sampson PD, Guttorp P (1992) Nonparametric estimation of
nonstationary spatial covariance structure. Journal of the
American Statistical Association 87(417):108–119
Switzer P (1985) Min/max autocorrelation factors for multivariate spatial imagery. Computer science and statistics
Taylor JE, Adler RJ, et al. (2003) Euler characteristics for
gaussian fields on manifolds. The Annals of Probability
31(2):533–563
Taylor JE, et al. (2006) A gaussian kinematic formula. The
Annals of Probability 34(1):122–158
Tuzel O, Porikli F, Meer P (2008) Pedestrian detection
via classification on riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence
30(10):1713–1727, DOI 10.1109/TPAMI.2008.75
Wackernagel H (2013) Multivariate geostatistics: an introduction with applications. Springer Science & Business Media
Xiao Q (2014) Evaluating correlation coefficient for nataf
transformation. Probabilistic Engineering Mechanics
37:1–6
Xie W, Sun H, Li C (2015) Quantifying statistical uncertainty
for dependent input models with factor structure. In: 2015
Winter Simulation Conference (WSC), IEEE, pp 667–678
You K, Park HJ (2021) Re-visiting riemannian geometry of
symmetric positive definite matrices for the analysis of
functional connectivity. Neuroimage 225:117464
Fig. 18 Scatter plots and variograms comparing the ground truth and the mean of the simulations. Variograms from the simulations are shown as light gray lines, the mean of the variograms as a black line, and the variogram of the ground truth as a red line. The scatter plots showing the bi-variate relations are shown as green dots, superposed on the mean of the simulations shown as red dots.
Fig. 19 Uncertainty assessment for 50 samples taken from testing data, showing 1000 simulations in gray lines, the lower 5%
and the upper 95% percentile as confidence boundary in black lines, the estimated mean in black dots and, in red dots, the
ground truth.
Fig. 20 Scatter plots comparing the estimated mean of the simulations at locations close to the testing data with the ground truth.
Fig. 21 Accuracy plot, which shows the proportion of locations where the true value falls within the symmetric p-probability interval.
Fig. 23 Scatter plots showing the progress of the clustering when using the 𝐾-means algorithm on the sampling data, for 𝐾 = 5 and iterations 1, 3, and 30, respectively from the top.

Fig. 22 Isometric view of the ellipses at sample locations showing the progress of the clustering when using the 𝐾-means algorithm on the sampling data, for 𝐾 = 5 and iterations 1, 3, and 30, respectively from the top.