Scalable GWR: A Linear-Time Algorithm For Large-Scale Geographically Weighted Regression With Polynomial Kernels
Scalable GWR: A Linear-Time Algorithm For Large-Scale Geographically Weighted Regression With Polynomial Kernels
Scalable GWR: A Linear-Time Algorithm For Large-Scale Geographically Weighted Regression With Polynomial Kernels
Daisuke Murakami1, Narumasa Tsutsumida2, Takahiro Yoshida3, Tomoki Nakaya4, Binbin Lu5
1
Department of Data Science, Institute of Mathematical Statistics, Japan
2
Graduate School of Global Environmental Studies, Kyoto University, Japan
3
Center for Global Environmental Studies, National Institute for Environmental Studies, Japan
4
Graduate School of Environmental Studies, Tohoku University, Japan
5
School of Remote Sensing and Information Engineering, Wuhan University, China
Abstract
While a number of studies have developed fast geographically weighted regression (GWR)
algorithms for large samples, none of them achieves the linear-time estimation that is considered
requisite for big data analysis in machine learning, geostatistics, and related domains. Against this
backdrop, this study proposes a scalable GWR (ScaGWR) for large datasets. The key development is
the calibration of the model through a pre-compression of the matrices and vectors whose size
depends on the sample size, prior to the execution of leave-one-out cross-validation (LOOCV) that is
the heaviest computational step in conventional GWR. This pre-compression allows us to run the
proposed GWR extension such that its computation time increases linearly with sample size,
whereas conventional GWR algorithms take at most quad–quadratic-order time. With this
development, the ScaGWR can be calibrated with more than one million samples without
parallelization. Moreover, the ScaGWR estimator can be regarded as an empirical Bayesian
estimator that is more stable than the conventional GWR estimator. This study compared the
ScaGWR with the conventional GWR in terms of estimation accuracy, predictive accuracy, and
computational efficiency using a Monte Carlo simulation. Then, we apply these methods to a
residential land analysis in the Tokyo Metropolitan Area. The code for ScaGWR is available in the R
package scgwr, and is going to be incorporated into another R package, GWmodel.
Keywords
Geographically weighted regression, large spatial data, computational complexity, scalability,
pre-processing
1. Introduction
Geographically weighted regression (GWR; Brunsdon et al., 1998; Fotheringham, 2002) is
a well-known approach for spatially varying coefficient (SVC) modeling. Due to its flexibility and
simplicity, GWR has widely been accepted in regional science (e.g., Cahill and Mulligan, 2007; Páez
et al., 2008), epidemiology (e.g., Nakaya et al., 2005), and environmental science (e.g., Dong et al.,
2018) to investigate spatial heterogeneity in regression coefficients. In spite of its popularity, GWR
is limited in application to big spatial data because of the computational complexity of model
estimation. Efficient GWR algorithms have been developed by Harris et al. (2010), Tran et al. (2016),
and Li et al. (2019) through parallelization. Li et al. (2019) optimized the linear algebra in the GWR
algorithm to reduce the required memory storage from O(N2) to O(NK), where N and K denote the
sample size and the number of covariates, respectively. The computational complexity of their
proposed fast GWR is O(K2N2logN), meaning that the computational burden grows in a
quasi-quadratic manner with N. However, modeling in O(K2N2logN) time is not fast enough as
linear-time modeling (i.e., O(N)) is expected in the era of big data in geostatistics (see Heaton et al.,
2018 for review), machine learning (see Liu et al., 2018 for review), and other fields.
Large spatial data modeling has been studied intensively in geostatistics. Fixed-rank
kriging (Cressie and Johannesson, 2008), predictive process modeling (Banerjee et al., 2008), and
nearest-neighbor Gaussian process modeling (NNGP; Datta et al., 2016) are instances of linear-time
geostatistical approaches. While these approaches aim to model the residual spatial process and
interpolate data, Finley et al. (2011) and Datta et al. (2016) extended the predictive process and
NNGP modeling to SVC modeling, respectively. However, as noted by Finley et al. (2011), these
approaches are still slow for large samples because of the need for the Markov chain Monte Carlo
(MCMC) simulation, although the burden is lightened by applying an efficient MCMC procedure
(e.g., Finley et al., 2018) or alternatives of MCMC, such as the variational Bayes approach (see
Bishop, 2006) and the integrated nested Laplace approximation (Rue et al., 2009). Other
computationally efficient SVC modeling approaches include (i) the spatial expansion method
(Casetti, 1972); (ii) the bivariate spline-based approach (Mu et al., 2018); and (iii) the Moran
eigenvector approach (Murakami et al., 2017; Murakami and Griffith, 2019). (i) and (ii) estimate
SVCs by fitting deterministic functions while (iii) estimates them by fitting approximate Gaussian
processes that are interpretable in terms of the Moran coefficient (Moran, 1950). All these
approaches assume that SVCs are expressed as linear combinations of 𝐿(≪ 𝑁) spatial basis
functions. However, approaches of this kind that use L basis functions are known to suffer from the
degeneracy problem (Stein, 2014). That is, they fail to capture small-scale spatial variations when N
is large (e.g., 20,000 ≤ 𝑁) and the resulting SVCs become overly smooth.
In summary, a practical approach to SVC modeling that fulfills the following requirements
is required:
(i) Linear-time estimation of SVCs;
(ii) No reliance on MCMC;
(iii) Not suffering from the degeneracy problem, i.e., small-scale spatial variations should be
effectively captured.
With the focus on (iii), GWR, which is a local approach that does not suffer from the degeneracy
problem, is an appropriate approach. Thus, we extend GWR to the scalable GWR (ScaGWR) to
satisfy all the three requirements above. The remainder of this paper is organized as follows: Section
2 introduces the original GWR and Section 3 develops the ScaGWR. Section 4 compares GWR and
ScaGWR through a Monte Carlo simulation, and Section 5 compares them in a residential land price
analysis. Finally, Section 6 provides the conclusions of this study.
where 𝑖 ∈ {1, ⋯ , 𝑁} is an index of sample sites distributed across a geographical space, yi is the i-th
explained variable, xi,k is the k-th covariate, 𝛽𝑖,0 is the intercept parameter, and 𝛽𝑖,𝑘 is the local
regression coefficient for the k-th covariate. The regression coefficients 𝛃𝑖 = [𝛽𝑖,0 , 𝛽𝑖,1 ⋯ 𝛽𝑖,𝐾 ]′ are
estimated by a weighted least squares method, where “ ′ ” denotes the matrix transpose. The
estimator is given by
̂𝑖 = [𝐗′𝐆𝑖 (𝑏)𝐗]−1 𝐗 ′ 𝐆𝑖 (𝑏)𝐲,
𝛃 (2)
where y is a vector of the explained variables and X is a matrix of covariates with a column of 1s for
the intercept. 𝐆𝑖 (𝑏) is a diagonal matrix whose j-th element is the geographical weight 𝑔𝑖,𝑗 (𝑏) for
the j-th sample, and can be calculated via a distance-decaying kernel function. For example, the
Gaussian kernel in Eq. (3) is a usual choice:
2
𝑑𝑖,𝑗
𝑔𝑖,𝑗 (𝑏) = exp [− ( ) ], (3)
𝑏
where 𝑑𝑖,𝑗 is the Euclidean distance between si and sj. The distance may be substituted with
Minkowski distance (Lu et al., 2016), network distance, travel time, or other non-Euclidean
distances (Lu et al., 2014). b is a bandwidth parameter that takes a small (near-zero) value if the
regression coefficients have small-scale map patterns and a large value when they have large-scale
̂𝑖 asymptotically converges to the ordinally least squares (OLS) estimator:
patterns. As b grows, 𝛃
̂𝑂𝐿𝑆 = (𝐗′𝐗)−1 𝐗 ′ 𝐲.
𝛃
The bandwidth b can be optimized by leave-one-out cross-validation (LOOCV) that
minimizes the cross-validation (CV) score, which is defined as follows:
𝑁
where 𝜀̂𝑖 = 𝑦𝑖 − ∑𝐾 ̂ ̂
𝑘=1 𝑥𝑖,𝑘 𝛽−𝑖,𝑘 . 𝛽−𝑖,𝑘 is estimated using the 𝑁 − 1 samples other than the i-th
̂−𝑖 = [𝛽−𝑖,1 ⋯ 𝛽−𝑖,𝐾 ]′ is estimated using Eq. (2), where 𝐆𝑖 (𝑏) is replaced
sample. Specifically, 𝛃
with 𝐆−𝑖 (𝑏), which is a diagonal matrix whose elements are summarized as a vector as 𝐠 −𝑖 (𝑏) =
[𝑔1,𝑗 (𝑏), ⋯ 𝑔𝑖−1,𝑗 (𝑏), 0, 𝑔𝑖+1,𝑗 (𝑏), ⋯ 𝑔𝑁,𝑗 (𝑏)]′.
Figure 1 summarizes the LOOVC procedure in the basic GWR. The grey color shows
matrices or vectors whose dimensions depend on sample size. As illustrated in the figure, large
matrices must be repeatedly processed to find the optimal b. This is the main reason for why GWR is
slow for large N.
Figure 1: LOOCV routine in the standard GWR. Grey squares represent matrices and vectors whose
sizes depend on N, and red squares represent other small elements.
Figure 1 suggests that GWR can be accelerated if we let the large matrices/vectors out of
the LOOCV iteration. However, the bandwidth parameter b is not separable from 𝐠 −𝑖 (𝑏) because
of the non-linearity of this function; thus, 𝐠 −𝑖 (𝑏) and other large matrices and vectors must be
processed in every LOOCV iteration. We overcome this bottleneck by introducing a linear multiscale
kernel that allows us to linearly separate the internal parameters from the kernel function and
pre-process the large matrices and vectors before the iteration. The resulting procedure estimates a
regularized GWR model in quasi-linear time. Section 3.1 introduces our ScaGWR model with the
multiscale kernel, Section 3.2 explains properties of the SVC estimator, and Section 3.3 shows how
to accelerate the LOOCV through the pre-processing.
3.1. Model
The ScaGWR model is identical to the basic GWR model Eq. (1) except for the use of a
linear multiscale kernel for fast computation. Section 3.1.1 introduces a linear polynomial kernel that
approximates standard kernels and Section 3.1.2 introduces the multiscale kernel, which is an
extension of the polynomial kernel.
where 𝑔𝑖,𝑗 (𝑏0 ) is a base kernel with known bandwidth 𝑏0 . We assume that 𝑔𝑖,𝑗 (𝑏0 ) is a
non-negative decreasing function with respect to 𝑑𝑖,𝑗 , and is one if 𝑑𝑖,𝑗 = 0 . 𝑝 ∈ {1, ⋯ , 𝑃}
represents the degree of the polynomials. This means that the Gaussian kernel (Eq. 3) and the
exponential kernel are applicable to the 𝑔𝑖,𝑗 (𝑏0 ) function while kernels with hard thresholds, such
as the bi-square and tri-cube kernels, are not. Such a polynomial kernel has often been used for
regression problems (e.g., Fan et al., 1995). The means of determining P is unclear. We perform a
Monte Carlo simulation to clarify this.
Instead of the known 𝑏0 , we estimate 𝑏̃ indicating the spatial scale of the SVCs. For
𝑝
instance, consider a case with three polynomials. In this case, Eq. (5) is ∑3𝑝=1 𝑏̃ 𝑝 𝑔𝑖,𝑗 (𝑏0 )4⁄2 with
known base kernels {𝑔𝑖,𝑗 (𝑏0 )2 , 𝑔𝑖,𝑗 (𝑏0 ), 𝑔𝑖,𝑗 (𝑏0 )1⁄2 } and their coefficients {𝑏̃1 , 𝑏̃ 2 , 𝑏̃ 3 } .
𝑔𝑖,𝑗 (𝑏0 )2 < 𝑔𝑖,𝑗 (𝑏0 ) < 𝑔𝑖,𝑗 (𝑏0 )1⁄2 holds if 𝑑𝑖,𝑗 > 0 . In other words,
{𝑔𝑖,𝑗 (𝑏0 )2 , 𝑔𝑖,𝑗 (𝑏0 ), 𝑔𝑖,𝑗 (𝑏0 )1⁄2 } describes a short-range, moderate-range, and long-range decay of a
kernel function. The coefficients {𝑏̃1 , 𝑏̃ 2 , 𝑏̃ 3 } assign weights to these kernels. For example, if 𝑏̃ =
0.1, the weights of the three kernels become {0.1,0.01,0.001} ∝ {90.1, 9.0, 0.9}. The resulting
̃ 0𝑖,𝑗 (𝑏̃) yields a short-range kernel. By contrast, if 𝑏̃ = 10, the weights become {10,100,1000} ≈
𝑔
{0.9, 9.0, 90.1}, implying a long-range kernel. Thus, 𝑏̃ replaces the usual bandwidth parameter b.
See Figure 2 for another example with P = 5.
Figure 2: Example of the polynomial kernel with P = 5 and b0 = 10
where
(𝑄)
(𝑄) 𝑔 (𝑏 ) 𝑖𝑓 𝑑𝑖,𝑗 ≤ 𝐷𝑖
𝑔𝑖,𝑗 (𝑏0 ) = { 𝑖,𝑗 0 ,
0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
(𝑄)
𝛼 is an unknown parameter, and 𝐷𝑖 is the distance between the i-th site and the Q-th nearest
neighbor. 𝑔̃𝑖,𝑗 (𝑏̃, 𝛼) weighs the Q-nearest neighbors by [global weight: 𝛼 ] + [local weight:
𝑝
∑𝑃𝑝=1 𝑏̃ 𝑝 𝑔𝑖,𝑗 (𝑏0 )4⁄2 ], and the other samples by [global weight: α ]. Thus, 𝑔̃𝑖,𝑗 (𝑏̃, 𝛼) is a
multiscale kernel in which the global and local weights are determined by α and 𝑏̃, respectively.
Unlike the polynomial kernel in Eq. (5), the complexity of calculating all 𝑔̃𝑖,𝑗 (𝑏̃, 𝛼)s is O(QN),
which is trivial as long as 𝑄 is small (e.g., 50). Furthermore, as we explain below, the resulting
GWR estimator, which is known to suffer from multicollinearity (Wheeler and Tiefelsdorf, 2005), is
stabilized if this multiscale specification is used. For these reasons, we prefer the multiscale kernel.
We need to determine the value of the fixed 𝑏0 a priori. If 𝑔𝑖,𝑗 (𝑏0 ) is defined by the
(𝑄) (𝑄)
Gaussian kernel, we assume 𝑏0 = 𝐷𝑚𝑒𝑑 ⁄√3, where 𝐷𝑚𝑒𝑑 is the median of the Q-nearest neighbor
(𝑄)
distances. This assumption implies 𝐷𝑚𝑒𝑑 = √3𝑏0 = [the effective bandwidth of 𝑔𝑖,𝑗 (𝑏0 )], which is
the distance at which 95% of the weight vanishes (see Cressie et al., 1993). That is, the base kernel
𝑔𝑖,𝑗 (𝑏0 ) is assumed to decay gradually across the Q-nearest neighbors, and the speed of decay of the
resulting 𝑔̃𝑖,𝑗 (𝑏̃, 𝛼) is optimized by estimating 𝑏̃ . Because 𝑏̃ substitutes 𝑏0 , the ScaGWR
estimator is likely to be less sensitive to 𝑏0 , although a sensitivity analysis is an important future
(𝑄)
topic. In the same manner, we assume 𝐷𝑚𝑒𝑑 = 3𝑏0 for the exponential kernel.
3.2. SVC estimator
The ScaGWR estimator, which is identical to the standard GWR estimator except for the
kernel specification, yields
̂ = [𝐗′𝐆
𝛃 ̃𝑖 (𝑏̃, 𝛼)𝐗]−1 𝐗′𝐆
̃𝑖 (𝑏̃, 𝛼)𝐲, (7)
𝑖
̃𝑖 (𝑏̃, 𝛼) is a diagonal matrix whose j-th entry is 𝑔̃𝑖,𝑗 (𝑏̃, 𝛼). 𝑏̃ and 𝛼 are estimated by
where 𝐆
LOOCV minimizing Eq. (4).
The estimator may be rewritten by substituting Eq. (6) into Eq. (7) as
−1
̂ = [𝛼𝐗′𝐗 + 𝐗′𝐆(𝑖 𝑄) (𝑏̃)𝐗]
𝛃
(𝑄)
[𝛼𝐗′𝐲 + 𝐗′𝐆𝑖 (𝑏̃)𝐲], (8)
𝑖
(𝑄) (𝑄) 𝑝
where 𝐆𝑖 (𝑏̃) is a diagonal matrix whose j-th entry is ∑𝑃𝑝=1 𝑏̃ 𝑝 𝑔𝑖,𝑗 (𝑏0 )4⁄2 . Based on Eq. (8), 𝛃
̂𝑖
𝑝𝑟𝑖𝑜𝑟 ̂𝑂𝐿𝑆 , 𝛼 −1 (𝐗′𝐗)−1 ).
can be viewed as an empirical Bayes estimator with prior distribution 𝛃 ~𝑁(𝛃
In other words, ScaGWR stabilizes the GWR estimator by shrinking it toward the OLS estimator,
where α determines the degree of shrinkage.
Thus, once the parameters {𝑏̃, 𝛼} are given, 𝛃
̂𝑖 is easily estimated. The next section
explains our LOOCV procedure to optimize {𝑏̃, 𝛼} computationally efficiently.
(ii) LOOCV : {𝑏̃, 𝛼} are optimized by the LOOCV in which the CV score is calculated by
substituting the inner products in 𝐌−𝑖 into Eq. (14). As explained above, the complexity of
this step is only O(K3Nlog(N)) (see Figure 3).
(iii) Estimation ̂𝑖 is estimated by substituting the estimated {𝑏̃, 𝛼} into Eq. (15), which is
: 𝛃
equal to Eq. (8):
−1
𝑷 𝑷
(0) (𝑝) (0) (𝑝)
̂𝑖 = [𝛼𝐌
𝛃 + ∑𝑏 ̃𝑝 𝐌𝑖 ] [𝛼𝐦 + ∑ 𝑏̃ 𝑝 𝐦𝑖 ]. (15)
𝑝=1 𝑝=1
(0) (𝑝)
where the (k, k')-th elements of the matrices 𝐌 and 𝐌𝑖 are given by ∑𝑗 𝑥𝑗,𝑘 𝑥𝑗,𝑘′ and
4⁄2𝑝 (0) (𝑝)
∑𝑗 𝑔𝑖,𝑗 (𝑏0 ) 𝑥𝑗,𝑘 𝑥𝑗,𝑘′ respectively, and the k-th elements of the vectors 𝐦 and 𝐦𝑖 by
𝑝
∑𝑗 𝑥𝑗,𝑘 𝑦𝑗 and ∑𝑗 𝑔𝑖,𝑗 (𝑏0 )4⁄2 𝑥𝑗,𝑘 𝑦𝑗 , respectively.
(0) (𝑝) (𝑄)
Note that the equality of Eqs. (8) and (15) means that 𝐌 = 𝐗′𝐗, ∑𝑃𝑝=1 𝑏̃ 𝑝 𝐌𝑖 = 𝐗′𝐆𝑖 (𝑏̃)𝐗,
(0) (𝑝) (𝑄)
𝐦 = 𝐗′𝐲, and ∑𝑃𝑝=1 𝑏̃ 𝑝 𝐦 = 𝐗′𝐆 (𝑏̃)𝐲.
𝑖 𝑖
It is possible to calculate the standard errors of the coefficients, degrees of freedom,
corrected Akaike Information Criterion (AICc), and other statistics to infer the modeling results in
linear time (see Appendix 1). The LOOCV in step (ii) can be substituted with an AICc
minimization-based optimization that is widely used for bandwidth selection (e.g., Nakaya et al.,
2005). See Appendix 1 for further details.
Figure 3: LOOCV routine in the ScaGWR. Grey squares represent large matrices and vectors whose
sizes depend on N, and the red squares represent the other small elements. Our multiscale kernel
(𝑄) ̃ (𝑄)
𝐠 −𝑖 𝐛 + 𝛼𝟏 includes parameters α and 𝑏̃ optimized by the LOOCV. 𝐠̃ −𝑖 is a matrix of P
𝐲 = 𝛃0 + 𝐱1 ∙ 𝛃1 + 𝐱 2 ∙ 𝛃2 + 𝛆 𝛆~𝑁(𝟎, 𝜎 2 𝐈),
(16)
𝛃0 ~𝑁(𝟏, 0.52 𝐆
̂ ), ̂ ),
𝛃1 ~𝑁(𝟏, 22 𝐆 𝛃2 ~𝑁(𝟏, 0.52 𝐆
̂ ),
where “ ∙ ” is the element-wise product. 𝛃1 , whose variance is 22, explains more spatial variations
than 𝛃0 and 𝛃2 , whose variance is 0.52. Later, we refer to 𝛃1 as strong SVC and {𝛃0 , 𝛃2 } as
̂ is a Nystrom approximation (see Grineas and Mahoney, 2005) of a kernel matrix G,
weak SVCs. 𝐆
which in itself is difficult to handle when N is large. The (i, j)-th element of G is 𝑔(𝑑𝑖,𝑗 ) =
2
𝑑𝑖,𝑗
𝑒𝑥𝑝 (− ). Following Dray et al. (2006) and Murakami et al. (2017), b is given by the maximum
𝑏2
distance in the minimum spanning tree connecting samples. ScaGWR was estimated while varying
𝑁 ∈ {500, 1000, 3000, 5000, 7000, 10000, 20000, 40000, 60000, 80000} while GWR was
estimated in the six cases with 𝑁 ≤ 10000. For each N, the estimation was iterated 200 times.
To calculate the SVC estimation error, we used the root mean-squared error (RMSE):
200 𝑁
1
𝑅𝑀𝑆𝐸[𝛽̂𝑖,𝑘 ] = √ ∑ ∑(𝛽̂𝑖,𝑘
𝑖𝑡𝑒𝑟
− 𝛽𝑖,𝑘 )2 , (17)
200𝑁
𝑖𝑡𝑒𝑟=1 𝑖=1
where 𝛽̂𝑖,𝑘
𝑖𝑡𝑒𝑟
is the coefficient estimated in the iter-th iteration.
4.2. Results
Figures 4 and 5 summarize the RMSEs for small to moderate samples (𝑁 ≤ 10,000) and
larger samples (80,000 ≤ 𝑁), respectively. They show that the strong and weak SVCs had different
tendencies.
When 𝑁 < 10,000, ScaGWR accurately estimated the strong SVC (𝛃1 ) if P and Q were
small. A smaller P reduced the number of polynomials while a smaller Q implied less dependence on
local weights. Thus, the accurateness is attributable to the parsimony of this specification. Because
GWR easily suffers from local collinearity, the result indicating the accuracy of the parsimonious
specification is reasonable. As N increased, the accuracy of the ScaGWR estimates for the strong
SVC improved and, roughly when 10,000 ≤ 𝑁, the RMSE values were similar across values of P
and Q.
Weak SVCs (𝛃0 , 𝛃3 ) indicate better accuracy when P and Q are large. This was opposite to
the tendency in case of the strong SVC. This suggests that more parameters are needed to identify
weak spatially varying signals. Based on Figure 5, Q determined the scalability of the estimation
accuracy. Specifically, Q = 50, 100, and 200 yielded slower, moderate, and faster decays of RMSE
values as N increased. A large Q is desirable to estimate weak SVCs accurately from large samples.
Still, Q = 100 achieved reasonable accuracy across cases (see Figure 5).
Overall, accuracy of the ScaGWR estimates tended to outperform the classical GWR. This
might be because ScaGWR estimates both local and global structure using 𝑏̃ and 𝛼, whereas GWR
considers only the former. Among the ScaGWR specifications, we prefer P = 4 and Q = 100 because
it achieves reasonable estimation accuracy regardless of whether N is small or large. Figure 6 plots
the true and estimated SVCs in first iterations with 𝑁 ∈ {500, 5,000, 80,000}. This figure confirms
that ScaGWR with P = 4 and Q = 100 accurately estimates the SVCs. ScaGWR successfully
captures small-scale variations even if N = 80,000 whereas GWR is not feasible because of the large
samples. It is important to note that other fast approaches, which specify each SVC using linear
combinations of L spatial bases, typically fail to capture such small variations due to the degeneracy
problem (see Section 1).
Figure 7 compares the CV score, which is a measure of out-of-sample prediction accuracy.
ScaGWR achieved higher prediction accuracy than GWR in many cases, especially on large samples.
ScaGWR is thus a faster and more accurate alternative to the standard GWR for spatial prediction.
Finally, Figure 8 summarizes the average computation (CP) time. The CP time of the
standard GWR rapidly increased with N. It took 2022.3 seconds when N = 10,000. By contrast, the
CP time of ScaGWR increased only linearly with N. On average, ScaGWR with Q = 100 and P = 4
took 47.7 seconds for execution when N = 10,000 and 338.3 seconds when N = 80,000. Even if N =
1,000,000, the ScaGWR took only 5,172 seconds (86.2 minutes) on average in five trials without
parallelization.
In summary, the ScaGWR with Q = 100 and P = 4 is more accurate and faster than the
basic GWR.
Figure 4: RMSE of SVCs (10,000 ≤ N)
Figure 5: RMSE of SVCs (80,000 ≤ N)
Figure 6: True and estimated SVCs (𝛃1 ) in each first iteration. P = 4 and Q = 100 was assumed for
ScaGWR.
Figure 7: CV scores of GWR (black line) and ScaGWR (colored lines)
5.2. Result
Figure 10 summarizes the estimated SVCs. For each panel, darker red represents a larger
negative value while darker blue represents a larger positive value. This figure suggests that the
GWR and ScaGWR estimates were similar.
For both models, βAgri and βForest indicate negative values suggesting the preference for
convenient urban areas over green areas. βAgri had a negative impact in the center of Tokyo as well,
as an increase in agricultural area was not preferable in terms of residential land price. By contrast,
βForest had a strong positive impact near the center. This shows greenery increased the attractiveness
of neighborhoods near the center but not in suburban neighborhoods.
βRail had a negative impact across the study area. The coefficients were especially large
near the center. This result confirmed the popularity of nearby stations. βBus also had a strong
negative impact in the Tokyo prefecture. The bus network managed by the prefecture was densely
stretched across Tokyo. The result shows that the Tokyo bus network successfully increased the
attractiveness of neighborhoods. Such a positive impact is not conceivable in major cities outside
Tokyo, including Saitama and Chiba. Still, a negative βBus, implying higher values of nearby areas
with bus stops, was estimated in most other suburban areas.
Thus, this section confirms that ScaGWR estimates interpretable coefficients just like the
standard GWR.
Figure 10: Estimation results
6. Concluding remarks
This study proposed the ScaGWR for large-scale SVC modeling. Unlike current GWR
algorithms whose computational complexity is at best quasi-quadratic with respect to N, ScaGWR
achieves a quasi-linear computational complexity as illustrated in Figure 8. This property allows us
to estimate SVCs from millions of samples even without parallelization. Of course, the
parallelization of ScaGWR is an interesting topic for even larger samples in future research.
As reviewed in Section 1, many of efficient approaches for SVC modeling assume that
each SVC obeys a linear combination of L spatial basis functions, and suffer from the degeneracy
problem; i.e., estimated SVCs tend to have over-smoothed map patterns (Stein, 2014). By contrast,
ScaGWR, which performs local estimation, does not suffer from this problem. Figure 6 confirms
that ScaGWR successfully captures local variations in SVCs. Thus, it satisfies all the requirements
listed in Section 1, including (i) a linear-time computational burden; (ii) no reliance on MCMC; and
(iii) no suffering the degeneracy problem
A drawback of GWR is its tendency to suffer from local multicollinearity. Even if a
regularization parameter is imposed, GWR, which considers only samples within a kernel window,
might be unstable if samples within the window are few. By contrast, ScaGWR implicitly imposes
𝑝𝑟𝑖𝑜𝑟 ̂𝑂𝐿𝑆 , 𝛼 −1 (𝐗′𝐗)−1 ) as a prior (see Section 3.3). Therefore, even if the kernel window is
𝛃 ~𝑁(𝛃
small (i.e., 𝑏̃ is small), the estimator, which is shrunk to the OLS estimator, is likely to be stable.
The robustness of ScaGWR needs to be studied in future work.
Extending ScaGWR is an interesting subject for future research. Its computational
efficiency might be preserved even if it is extended to multiscale GWR (Fotheringham et al., 2017;
Lu et al. 2019; Oshan et al. 2019), like GWR with parameter-specific distance matrices (Lu et al.,
2017; 2018), GWR with flexible bandwidth (Yang, 2014; Murakami et al., 2019), and conditional
GWR (Leong and Yue 2017). Furthermore, ScaGWR might be important in combining GWR with a
global model (see, Geniaux and Martinetti, 2018; Harris, 2019). The ScaGWR model can be
considered an integration of the GWR model and a global linear regression model. The linear
regression is potentially substituted with a mixed effects model, additive model, temporal model, and
so on. Such multi-model integration is an interesting development toward the fast, stable, and
flexible reformulation of GWR.
ScaGWR was implemented in the R package scgwr
(https://cran.r-project.org/web/packages/scgwr/). We will also consider implementing ScaGWR in
the GWmodel package embedded into C++ code via the Rcpp package (Eddelbuettel, 2013).
where S is a large matrix (N × N) that cannot be stored for large values of N. However, 𝑡𝑟[𝐒] can be
calculated without explicitly processing S as follows:
𝑃
−1 (A3)
(0) (𝑝)
= ∑ 𝐱 𝑖 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 𝐱′𝑖 𝑔̃𝑖,𝑖 (𝑏̃, 𝛼),
𝑖 𝑝=1
(0) (𝑝) −1
where [𝛼𝐌 + ∑𝑃𝑝=1 𝑏̃ 𝑝 𝐌𝑖 ] is a small matrix (K × K) that has already been calculated when
−1 −1
𝑃 𝑃
(0) (𝑝) (0) (𝑝)
= ∑ 𝐱 𝑖 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] ̃𝑖 (𝑏̃, 𝛼)2 𝐗 [𝛼𝐌
𝐗′𝐆 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 𝐱′𝑖
𝑖 𝑝=1 𝑝=1
−1 −1
𝑃 𝑃
(0) (𝑝) (𝑄) (0) (𝑝)
= ∑ 𝐱 𝑖 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 𝐗′(𝛼𝐈 + 𝐆𝑖 (𝑏̃))2 𝐗 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 𝐱′𝑖
𝑖 𝑝=1 𝑝=1
−1
𝑃
(0) (𝑝) (𝑄)
= ∑ 𝐱 𝑖 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] (𝛼 2 𝐗′𝐗 + 2𝛼𝐗′𝐆𝑖 (𝑏̃)𝐗
𝑖 𝑝=1
(A4)
−1
𝑃
(𝑄) (0) (𝑝)
+ 𝐗′𝐆𝑖 (𝑏̃)2 𝐗) [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 𝐱′𝑖
𝑝=1
−1
𝑃 𝑃
(0) (𝑝) (0) (𝑝)
= ∑ 𝐱 𝑖 [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] (𝛼 𝐌 2
+ 2𝛼 ∑ 𝑏̃ 𝑝 𝐌𝑖
𝑖 𝑝=1 𝑝=1
−1
𝑃 𝑃
̃ 2𝑝 (2𝑝) (0) ̃𝑝 (𝑝)
+∑𝑏 𝐌𝑖 ) [𝛼𝐌 + ∑𝑏 𝐌𝑖 ] 𝐱′𝑖 .
𝑝=1 𝑝=1
(2𝑝)
Although we also need to calculate 𝐌𝑖 , which is a K × K matrix whose (k, k')-th element is
8⁄2𝑝 (𝑝)
∑𝑗 𝑔𝑖,𝑗 (𝑏0 ) 𝑥𝑗,𝑘 𝑥𝑗,𝑘′ , the computational complexity is identical to that of 𝐌𝑖 . Thus, the
calculation is still trivial.
Given 𝑁 ∗ , the unbiased estimate of the residual variance yields
𝐾 𝐾 2
1
𝜎̂ 2 = ∗ ∑ (𝑦𝑖 − ∑ 𝑥𝑖,𝑘 𝛽̂𝑖,𝑘 ) . (A5)
𝑁
𝑘=1 𝑘=1
The variance estimates for the SVCs are derived as
̃𝑖 (𝑏̃, 𝛼)𝐗)−1 𝐗′𝐆
̂𝑖 ] = 𝜎̂ 2 (𝐗′𝐆
𝑉𝑎𝑟[𝛃 ̃𝑖 (𝑏̃, 𝛼)𝐗)−1 ,
̃𝑖 (𝑏̃, 𝛼)2 𝐗(𝐗′𝐆
−1
𝑃 𝑃
(0) (𝑝) (0) (𝑝)
2
= 𝜎̂ [𝛼𝐌 + ∑ 𝑏̃ 𝑝 𝐌𝑖 ] 2
(𝛼 𝐌 + 2𝛼 ∑ 𝑏̃ 𝑝 𝐌𝑖
𝑝=1 𝑝=1
(A6)
−1
𝑃 𝑃
̃ 2𝑝 (2𝑝) (0) ̃𝑝 (𝑝)
+ ∑𝑏 𝐌𝑖 ) [𝛼𝐌 + ∑𝑏 𝐌𝑖 ] ,
𝑝=1 𝑝=1
which is again trivial to calculate. Thus, diagnostics of the ScaGWR are available for large samples.
̂𝑖 and
Note that the CP time presented in Section 4 includes the time taken to estimate both 𝛃
̂𝑖 ].
𝑉𝑎𝑟[𝛃
By calculating 𝑡𝑟[𝐒], 𝑡𝑟[𝐒′𝐒], and 𝜎̂ 2 in the way explained above, the corrected Akaike
Information Criterion (AICc), which is formulated as follows, is calculated computationally
efficiently:
𝑁 + 𝑡𝑟[𝐒]
AICc = 𝑁𝑙𝑜𝑔(𝜎̂ 2 ) + 𝑁𝑙𝑜𝑔(2𝜋) + 𝑁 . (A7)
𝑁 − 2 − 𝑡𝑟[𝐒]
The LOOCV can be substituted with AICc-based bandwidth optimization by using Eq. (A7).
References
- Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008) Gaussian predictive process
models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 70 (4), 825-848.
- Bishop, C. M. (2006) Pattern Recognition and Machine Learning. Springer, New York.
- Brunsdon, C., Fotheringham, S., and Charlton, M. (1998) Geographically Weighted Regression.
Journal of the Royal Statistical Society: Series D (The Statistician), 47 (3), 431-443.
- Cahill, M. and Mulligan, G. (2007) Using geographically weighted regression to explore local
crime patterns. Social Science Computer Review, 25 (2), 174-193.
- Casetti, E. (1972) Generating models by the expansion method: applications to geographical
research. Geographical analysis, 4 (1), 81-91.
- Cressie, N. (1993) Statistics for Spatial Data. John Willy and Sons, New York.
- Cressie, N. and Johannesson, G. (2008) Fixed rank kriging for very large spatial data sets.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (1), 209-226.
- Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016) Hierarchical nearest-neighbor
Gaussian process models for large geostatistical datasets. Journal of the American Statistical
Association, 111 (514), 800-812.
- Dong, G., Nakaya, T., and Brunsdon, C. (2018) Geographically weighted regression models for
ordinal categorical response variables: An application to geo-referenced life satisfaction data.
Computers, Environment and Urban Systems, 70, 35-42.
- Dray, S., Legendre, P., and Peres-Neto, P. R. (2006) Spatial modelling: a comprehensive
framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological
Modelling, 196 (3-4), 483-493.
- Eddelbuettel, D. 2013. Seamless R and C++ Integration with Rcpp. New York: Springer.
- Fan, J., Heckman, N. E., and Wand, M. P. (1995) Local polynomial kernel regression for
generalized linear models and quasi-likelihood functions. Journal of the American Statistical
Association, 90 (429), 141-150.
- Finley, A. O. (2011) Comparing spatially ‐ varying coefficients models for analysis of
ecological data with non‐stationary and anisotropic residual dependence. Methods in Ecology
and Evolution, 2 (2), 143-154.
- Finley, A. O., Datta, A., Cook, B. C., Morton, D. C., Andersen, H. E., and Banerjee, S. (2018)
Efficient algorithms for Bayesian nearest-neighbor Gaussian processes. Journal of
Computational and Graphical Statistics, DOI: 10.1080/10618600.2018.1537924.
- Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2002) Geographically Weighted
Regression: The Analysis of Spatially Varying Relationships. Wiley, New York.
- Fotheringham, A. S., Yang, W., and Kang, W. (2017) Multiscale geographically weighted
regression (mgwr). Annals of the American Association of Geographers, 107 (6), 1247-1265.
- Geniaux, G. and Martinetti, D. (2018) A new method for dealing simultaneously with spatial
autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban
Economics, 72, 74-85.
- Gollini, I., B. Lu, M. Charlton, C. Brunsdon & P. Harris (2015) GWmodel: an R Package for
Exploring Spatial Heterogeneity using Geographically Weighted Models. Journal of Statistical
Software, 63, 1-50.
- Harris, R., Singleton, A., Grose, D., Brunsdon, C., and Longley, P. (2010) Grid‐enabling
geographically weighted regression: a case study of participation in higher education in England.
Transactions in GIS, 14 (1), 43-61.
- Harris, P. (2019) A simulation study on specifying a regression model for spatial data: Choosing
between autocorrelation and heterogeneity effects. Geographical Analysis, 51 (2), 151-181.
- Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F.,
Gramacy R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and
Zammit-Mangion, A. (2018) A case study competition among methods for analyzing large
spatial data. Journal of Agricultural, Biological and Environmental Statistics, DOI:
10.1007/s13253-018-00348-w.
- Leong, Y.-Y. & J. C. Yue (2017) A modification to geographically weighted regression.
International Journal of Health Geographics, 16, 11.
- Li, Z., Fotheringham, A. S., Li, W., and Oshan, T. (2019) Fast Geographically Weighted
Regression (FastGWR): a scalable algorithm to investigate spatial process heterogeneity in
millions of observations. International Journal of Geographical Information Science, 33 (1),
155-175.
- Liu, H., Ong, Y. S., Shen, X., and Cai, J. (2018) When Gaussian process meets big data: A
review of scalable GPs. ArXiv, 1807.01065.
- Lu, B., Charlton, M., Harris, P., and Fotheringham, A. S. (2014a). Geographically weighted
regression with a non-Euclidean distance metric: a case study using hedonic house price data.
International Journal of Geographical Information Science, 28 (4), 660-681.
- Lu, B., P. Harris, M. Charlton & C. Brunsdon (2014b) The GWmodel R package: further topics
for exploring spatial heterogeneity using geographically weighted models. Geo-spatial
Information Science, 17, 85-101.
- Lu, B., M. Charlton, C. Brunsdon & P. Harris (2016) The Minkowski approach for choosing the
distance metric in Geographically Weighted Regression. International Journal of Geographical
Information Science, 30, 351-368.
- Lu, B., Brunsdon, C., Charlton, M., and Harris, P. (2017) Geographically weighted regression
with parameter-specific distance metrics. International Journal of Geographical Information
Science, 31 (5), 982-998.
- Lu, B., Yang, W., Ge, Y., and Harris, P. (2018) Improvements to the calibration of a
geographically weighted regression with parameter-specific distance metrics and bandwidths.
Computers, Environment and Urban Systems, 71, 41-57.
- Lu, B., C. Brunsdon, M. Charlton & P. Harris (2019) A response to ‘A comment on
geographically weighted regression with parameter-specific distance metrics’. International
Journal of Geographical Information Science, 1-13.
- Mu, J., Wang, G., and Wang, L. (2018) Estimation and inference in spatially varying coefficient
models. Environmetrics, 29 (1), e2485.
- Murakami, D., and Griffith, D. A. (2019) Spatially varying coefficient modeling for large
datasets: Eliminating N from spatial regressions. Spatial Statistics, 30, 39-64.
- Murakami, D., Lu, B., Harris, P., Brunsdon, C., Charlton, M., Nakaya, T., and Griffith, D. A.
(2019) The importance of scale in spatially varying coefficient modeling. Annals of the
American Association of Geographers, 109, 1-21.
- Murakami, D., Yoshida, T., Seya, H., Griffith, D. A., and Yamagata, Y. (2017) A Moran
coefficient-based mixed effects approach to investigate spatially varying relationships. Spatial
Statistics, 19, 68-89.
- Nakaya, T., Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2005) Geographically
weighted Poisson regression for disease association mapping. Statistics in Medicine, 24 (17),
2695-2717.
- Oshan, T., L. J. Wolf, A. S. Fotheringham, W. Kang, Z. Li & H. Yu (2019) A comment on
geographically weighted regression with parameter-specific distance metrics. International
Journal of Geographical Information Science, 1-12.
- Páez, A., Long, F., and Farber, S. (2008) Moving window approaches for hedonic price
estimation: an empirical comparison of modelling techniques. Urban Studies, 45 (8), 1565-1581.
- Rue, H., Martino, S., and Chopin, N. (2009) Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the royal
statistical society: Series b (statistical methodology), 71 (2), 319-392.
- Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial
data. Spatial Statistics, 8, 1-19.
- Tran, H. T., Nguyen, H. T., and Tran, V. T. (2016) Large-scale geographically weighted
regression on Spark. Proceedings of the 2016 Eighth International Conference on Knowledge
and Systems Engineering, pp. 127-132. IEEE.
- Tsutsumida, N., Rodríguez-Veiga, P., Harris, P., Balzter, H., and Comber, A. (2019) Investigating
spatial error structures in continuous raster data. International Journal of Applied Earth
Observation and Geoinformation, 74, 259-268.
- Wheeler, D. and Tiefelsdorf, M. (2005) Multicollinearity and correlation among local regression
coefficients in geographically weighted regression. Journal of Geographical Systems, 7 (2),
161-187.
- Wolf, L. J., Oshan, T. M., and Fotheringham, A. S. (2018) Single and multiscale models of
process spatial heterogeneity. Geographical Analysis, 50 (3), 223-246.
- Yang, W. (2014) An extension of geographically weighted regression with flexible bandwidths.
Doctoral dissertation, University of St Andrews.