Becoming a Spatial Data Scientist
Copyright © 2019 by CartoDB Inc. All rights reserved. No part of this publication may be uploaded
or posted online without the prior written permission of the publisher. For permission requests,
write to the publisher, addressed “Attention: Permissions Request,” to [email protected].
Foreword
Julia Koschinsky, PhD, Executive Director, Center for Spatial Data Science,
University of Chicago
Data scientists usually treat the location attributes of data like any other attribute and apply the same non-spatial methods and tools from the regular data science toolkit. They know how to get the computational aspects to work and how to scale them. Spatial analysts, on the other hand, are more likely to use specialized spatial methods and tools from spatial econometrics, spatial statistics, and geovisualization. But they might not know how to implement new spatial methods computationally and get them to run at scale.
More often than not, data science and spatial analytics are separate worlds
with little interaction. Universities are trying to catch up on establishing and
modernizing their data science curriculum to meet the growing demand but the
traditional separation between computer science and geography departments
usually prevails. Data science bootcamps also tend to maintain these divides.
Often that leaves data scientists and spatial analysts to fend for themselves in
bridging these worlds as they seek to spatially analyze location data at scale.
At the same time, there are spatial data scientists in industry, academia, and
other institutions that have been working on integrating the data science and
spatial analytics communities. They are moving towards establishing a field
of spatial data science. Our Center for Spatial Data Science at the University of
Chicago and CARTO have been part of these efforts. Along the same lines,
the goal of this ebook is to help data scientists bridge this gap by adding spatial
methods to their traditional data science toolkits.
“Spatial data science can be viewed as a subset of generic ‘data science’ that focuses on the special characteristics of spatial data, i.e., the importance of ‘where.’ Data science is often referred to as the science of extracting meaningful information from data” (Anselin, 2019). Spatial analytics, then, is relevant because it helps make sense of spatial data in ways that account for its special characteristics, including spatial dependence and spatial heterogeneity (Anselin, 1989), geographic projections, and zonation and scale problems.
To give a few examples, by accounting for spatial structure in data, spatial
models can produce more precise and less biased estimates than non-spatial
models. They can quantify whether there are spillover or interaction effects between neighboring areas and identify whether correlations vary across space. Spatial
optimization models are useful for solving location-allocation problems such
as where to best site new stores. Further, customer segmentation analysis can
be improved through spatially constrained cluster methods. And spatial access
metrics help identify mismatches in where supply and demand are concentrated.
One of the key current opportunities in spatial data science is the development
of a next generation of spatial methods that builds on the lessons learned from
earlier methods and finds new ways to model the special characteristics of
georeferenced big data (Anselin, 2019; Rey 2019). Hopefully, data scientists who
are reading this ebook on their path to becoming spatial data scientists will help
take on this challenge.
Chapter 1: What is Spatial Data Science and Why is it Important?
"Spatial data science can be viewed as a subset of generic “data science” that
focuses on the special characteristics of spatial data, i.e., the importance
of “where.” Data science is often referred to as the science of extracting
meaningful information from data. In this context, it is useful to stress
the difference between standard (i.e., non-spatial) data science applied to
spatial data on the one hand and spatial data science on the other. The
former treats spatial information, such as the latitude and longitude of
data points as simply an additional variable, but otherwise does not adjust
analytical methods or software tools. In contrast, “true” spatial data science
treats location, distance, and spatial interaction as core aspects of the data
and employs specialized methods and software to store, retrieve, explore,
analyze, visualize and learn from such data. In this sense, spatial data science
relates to data science as spatial statistics to statistics, spatial databases
to databases, and geocomputation to computation." -- Luc Anselin
This quote from Professor Luc Anselin, a founding father of the field of spatial data science, provides not only a definition of the field and the background any data science practitioner needs in order to understand the relationship between spatial and non-spatial data, but also the raison d'être for this resource.
Data science is the fastest growing profession in the United States, with opportunities expanding exponentially, year-over-year. These opportunities stem from the realization among corporate and governmental leadership across all sectors that, in order to remain competitive, business and societal decisions must be informed and reinforced by data.
More recently, though, these same leaders have come to recognize how impactful spatial analysis can be in this decision-making process, providing an additional level of insight. Nearly every business in the world has a spatial component. And as Professor Anselin noted, unlocking insights from spatial data involves distinct tools and techniques. By arming themselves with these methodologies, data scientists can provide greater value and investigate the spatial relationships that underpin every facet of our world.
Spatial data is typically categorized into the following types (Cressie, 1993):
Point-referenced data
Data associated with a spatial index that varies continuously across space (Figure 1). Examples include data from GPS tracking, fixed devices, and high resolution satellites. This data is often useful for model inference and prediction at unsampled locations (Banerjee et al., 2014).
Areal data
Data associated with a discrete set of areal units, such as counties or census tracts, over which measurements are aggregated (Figure 1).
Point patterns
Data where the locations of the events themselves are the object of interest, for example the positions of crime incidents or disease cases.
Network data
Data associated with the nodes and edges of a network, for example traffic measurements along a road network.
Figure 1. Types of spatial data. (left): Example of point-referenced data: Sea Surface Temperature in situ observations from the International Comprehensive Ocean-Atmosphere Data Set (Freeman et al., 2016). (right): Example of areal data: total population in the state of NY from US Census data (source: US Census, https://www.census.gov/).
The Earth is (almost) round
Dealing with spatial data also means that we need to be able to project an uneven spheroid, the Earth, onto a plane or a sphere. For a given 3D model of the Earth and an origin relative to its center (a datum), a projection is defined by functions that map longitude and latitude coordinates to planar or spherical coordinates. These projected coordinate reference systems may be global or regional, and have different characteristics depending, among other things, on whether they better preserve distances, scales, or shapes, or instead seek a balance for visualization. For example, the Mercator projection, which is the standard map projection for navigation, preserves shapes, while the Mollweide projection preserves area measures. Knowledge of the coordinate reference system (CRS) is critical in order to establish the units of measurement, compute distances, and describe the relative positions of different regions (e.g. in a neighbourhood structure). A comprehensive list of available CRSs is compiled and updated by several sources, such as http://epsg.io/.
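To make this concrete, here is a minimal sketch of reprojecting coordinates between reference systems with the pyproj package (listed in the appendix), assuming pyproj 2.x. The sample point and the choice of Web Mercator (EPSG:3857) as the target CRS are assumptions for illustration.

```python
# A minimal sketch: projecting WGS84 longitude/latitude to Web Mercator
# (EPSG:3857) with pyproj. The sample coordinates are made up for illustration.
from pyproj import Transformer

# always_xy=True enforces (longitude, latitude) axis order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

lon, lat = -73.9857, 40.7484  # hypothetical point in New York City
x, y = transformer.transform(lon, lat)
print(f"Projected coordinates: {x:.1f}, {y:.1f} (meters)")
```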
One of the useful properties of spatial data is that data at nearby locations
“tends” to be similar. This was first recognized by Geographer and Cartographer
Waldo Tobler in 1970 through his “First Law of Geography,” which states that
“everything is related to everything else, but near things are more related than
distant things.” In other words, spatial data is spatially dependent or correlated,
and independence between the observations, which is a common assumption
for many statistical techniques, is not satisfied.
So how is spatial dependence generated? Spatial dependence can arise for various reasons (Diggle et al., 2013). An observed spatial pattern may arise from variables that depend strictly on location, or from direct interactions between the points. In practice, it can be difficult, or even impossible, to distinguish empirically between these different processes.
$$C(h) = \sigma^2 \exp\left(-\frac{\|h\|}{\phi}\right) \qquad (3)$$

where $\sigma^2$ is a scale parameter, and $\phi$ controls the range of the spatial process (small values will imply a fast decay in the correlation with distance). When the stationarity condition is also satisfied by the variance function, we can also define the semivariogram as

$$\gamma(h) = C(0) - C(h) \qquad (4)$$
Measure 2: Moran’s I
For discrete spatial processes, the spatial dependence relationship is characterized in terms of adjacency. Given observations $y_i$ associated with a discrete index $i = 1, \dots, n$, we can construct a neighbourhood structure $W$ with entries $w_{ij}$ which connect units $i$ and $j$ in some fashion (e.g. $w_{ij} = 1$ if $i$ and $j$ are neighbours and zero otherwise). Moran's I is then defined as

$$I = \frac{n}{\sum_i \sum_j w_{ij}} \; \frac{\sum_i \sum_j w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_i (y_i - \bar{y})^2} \qquad (5)$$

where $\bar{y}$ represents the mean. By comparing the computed $I$ with the mean and variance of its asymptotic distribution under the null hypothesis that the $y_i$ are IID (which is a normal distribution), we can use this coefficient as an exploratory
measure of spatial association. However, if the aim is to run a test of statistical significance, a Monte Carlo permutation-based approach, in which the values of $y_i$ are randomly assigned to the spatial entities, is typically recommended (Bivand et al., 2013). Moran's coefficient can also be applied in a local fashion (Anselin, 1995) to identify local clusters and local spatial outliers.
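As a concrete illustration, here is a minimal sketch of computing global Moran's I with PySAL's libpysal and esda packages; the input file and column name are hypothetical.

```python
# A minimal sketch of global Moran's I with PySAL; gdf is assumed to be a
# GeoDataFrame of polygons with a numeric column "value".
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

gdf = gpd.read_file("areas.geojson")           # hypothetical input file
w = Queen.from_dataframe(gdf)                  # queen-contiguity neighbourhood
w.transform = "r"                              # row-standardize the weights

mi = Moran(gdf["value"], w, permutations=999)  # Monte Carlo permutation test
print(mi.I, mi.p_sim)                          # coefficient and pseudo p-value
```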
A very basic form of point pattern analysis involves summary statistics such as the mean center and a measure of dispersion given, for example, by the standard deviational ellipse, which computes the dispersion separately along each axis:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i, \quad \sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2} \qquad (6)$$
Figure 2. Measures of spatial dependence. (left): Local Moran's I plot for the Boston housing data (Harrison and Rubinfeld, 1978). (right): Mean center and ellipsoid for the London Police crime data (http://data.police.uk/).
For those looking to perform their own analysis, the below table includes
examples of common packages used for exploratory analysis for measures of
spatial dependence:
Package | Language | Use | Reference
scikit-gstat | Python | Variogram analysis | https://pypi.org/project/scikit-gstat/
Chapter 2: Spatial Modeling - Leveraging Location in Prediction
Spatial modeling consists of the analysis of spatial data (i.e. data that exhibits
spatial dependence) to make inferences about the model parameters, to predict
at unsampled locations, or for downscaling/upscaling applications (Anselin,
1988; Banerjee et al., 2014).
With new techniques and technologies, increased processing power, and the
proliferation of spatial expertise across industries, spatial modeling is evolving
to meet new challenges. Areas of application are many, including climatology, epidemiology, real estate, and marketing. Possible questions include the following: How does the revenue of my store depend on socio-demographic patterns? Are my clients more likely to churn if their neighbours are also churning? How are the spatial patterns of road incidents related to road and demographic features?
A basic model for an observation $y_i$ is

$$y_i = \mu_i + \varepsilon_i \qquad (1)$$

where $\mu_i$ is the mean structure, which can depend on some covariates (also known as fixed effects, e.g. $\mu_i = X_i \beta$), and $\varepsilon_i$ represents an IID process. When dealing with spatial data $y(s_i)$, observed at locations $s_i$, we might add to Equation (1) an extra term $w(s_i)$ representing a spatial random effect:

$$y(s_i) = \mu(s_i) + w(s_i) + \varepsilon_i \qquad (2)$$
Continuous Spatial Error Models
Point patterns can also be modeled as a continuous spatial process. In this case, the interest lies in modeling the intensity $\lambda(s)$, which varies spatially and may also depend on some covariates. The intensity can be modeled non-parametrically using kernel smoothing (Diggle, 2014), after which logistic regression can be used to estimate the model coefficients. Alternatively, the intensity can be directly modeled as a Log-Gaussian Cox process and the model parameters estimated using the INLA/SPDE approach (Simpson et al., 2010).
Examples of R and Python packages that can be used in the context of modeling
continuous spatial processes are provided in the table below.
Package | Language | Method | Reference
PyKrige | Python | Kriging | https://pypi.org/project/PyKrige/
spBayes | R | Bayesian (MCMC) | https://rdrr.io/cran/spBayes/
R-INLA | R | Bayesian (INLA/SPDE, sparse approximation method) | http://www.r-inla.org
spatstat | R | Intensity estimation by kernel smoothing (point patterns) | https://rdrr.io/cran/spatstat
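As an illustration of the first row of the table, here is a minimal ordinary kriging sketch with PyKrige; the sampled field and the choice of an exponential variogram model are assumptions made purely for the example.

```python
# A minimal ordinary kriging sketch with PyKrige; x, y, z are 1-D arrays of
# sample coordinates and measured values (synthetic here).
import numpy as np
from pykrige.ok import OrdinaryKriging

rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 50), rng.uniform(0, 1, 50)  # hypothetical locations
z = np.sin(4 * x) + np.cos(4 * y)                    # hypothetical field values

ok = OrdinaryKriging(x, y, z, variogram_model="exponential")
gridx = np.linspace(0, 1, 25)
gridy = np.linspace(0, 1, 25)
z_pred, z_var = ok.execute("grid", gridx, gridy)     # predictions + variances
```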
Discrete Spatial Error Models
i.e. the conditional distribution for the GMRF component for the $i$-th area is normal, with a mean that depends, with strength $\rho$, on the average of its neighbors. The construction of the spatial adjacency matrix determines the class of the CAR model structure: for example, Intrinsic CAR (ICAR) models provide spatial smoothing by averaging measurements of directly adjoining regions. Another common option is to use a Simultaneous Autoregressive (SAR) model, based instead on a spatial autoregressive error term (Banerjee et al., 2014). CAR and SAR models can also be implemented in a Bayesian framework, where they can be used as priors as part of a hierarchical model (cf. for example the ICAR specification in the Besag, York, Mollié model, Besag et al., 1991).
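As a sketch of fitting a SAR-type error model in Python, the spreg package (part of the PySAL ecosystem) provides a maximum likelihood routine; the file and column names below are hypothetical.

```python
# A minimal sketch of a spatial error model fit by maximum likelihood with
# PySAL's spreg; gdf is assumed to hold the dependent variable "y" and
# covariates "x1", "x2".
import geopandas as gpd
from libpysal.weights import Queen
from spreg import ML_Error

gdf = gpd.read_file("areas.geojson")   # hypothetical input file
w = Queen.from_dataframe(gdf)
w.transform = "r"                      # row-standardized weights

model = ML_Error(
    gdf[["y"]].values,                 # dependent variable (n x 1)
    gdf[["x1", "x2"]].values,          # covariates (n x k)
    w=w,
)
print(model.summary)                   # coefficients plus the spatial parameter
```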
that unit, with the degree of smoothness controlled by the rank of the GMRF (e.g. full rank corresponds to as many knots as units).
Figure 3. Predictions (mean) for the concentration of Zinc near the Meuse river in the Netherlands obtained using kriging (left), and predictions (mean) for the owner-occupied housing value in Boston obtained using INLA and a Besag, York, Mollié (BYM) model for the spatial random effects (right).
Spatially Varying Coefficient Models
In some cases it can be attractive to allow the coefficients in the model to vary by location, envisioning for a particular coefficient a spatial surface $\beta(s)$. Geographically weighted regression (GWR; Brunsdon et al., 2010) is the representative approach for spatially varying coefficient (SVC) models for point-referenced data. GWR estimates one set of coefficient values for every observation, using all of the data falling within a fixed window (bandwidth) around this location and giving the most weight to the data that is closest. Because the results tend to depend on the choice of the bandwidth, this method is mainly used as an exploratory technique to indicate where non-stationarity is taking place. Better options are represented by spline-based methods (Wood, 2010) and, although more computationally demanding, Bayesian methods (Gelfand et al., 2003; Gamerman et al., 2003). Bayesian SVC models for areal data are also available using the GMRF specification, for example as implemented in the R package R-INLA (Bakka et al., 2018).
Package | Language | Method | Reference
PySAL (mgwr) | Python | GWR | https://pysal.org/notebooks/model/mgwr/intro.html

Examples of common packages used for modeling spatially varying coefficient models.
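A minimal GWR sketch with the mgwr package referenced in the table; the synthetic coordinates, covariates, and spatially varying effect below are fabricated purely to make the example runnable.

```python
# A minimal GWR sketch using mgwr (part of the PySAL ecosystem); coords, y,
# and X are numpy arrays built from synthetic data.
import numpy as np
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, (100, 2))          # hypothetical (x, y) locations
X = rng.normal(size=(100, 2))                  # hypothetical covariates
y = (1 + coords[:, 0] * X[:, 0] + X[:, 1]      # spatially varying effect on X1
     + rng.normal(size=100)).reshape(-1, 1)

bw = Sel_BW(coords, y, X).search()             # data-driven bandwidth selection
results = GWR(coords, y, X, bw).fit()
print(results.params.shape)                    # one coefficient set per location
```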
Spatial Confounding
Spatial confounding occurs when adding a spatially-correlated error term changes the estimates of the fixed-effect coefficients (Hodges, 2010), especially when the fixed effects are highly correlated with the spatially structured random effect. To avoid this effect, a solution known as restricted spatial regression is used, which consists of restricting the spatial random effect to the space orthogonal to the fixed effects.
Validation Tools
To assess the predictive performance of a spatial model, traditional validation
tools are typically adopted. These rely on graphical methods (e.g. graphical inspection of the residuals) or on computing discrepancy measures such as the Root Mean Square Error (RMSE), the pseudo-$R^2$, and the Logarithmic and Continuous Ranked Probability Scores, either splitting the data into a train and a test subset or using k-fold cross validation (Hastie et al., 2017). However, extra care must be taken with spatial data since, in this case, observations that are closer in space have stronger dependencies, which can result in biased measures of discrepancy. Equivalent spatial cross-validation and bootstrap strategies based on spatial resampling methods can be implemented using, for example, the R package sperrorest (Brenning, 2012). In practice, both non-spatial and spatial cross-validation methods can be very computationally expensive when working with spatial models, and their use is still not very common.
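One simple way to approximate spatial cross-validation in Python is to assign observations to coarse spatial blocks and treat the blocks as groups, so that train and test folds are spatially separated; the 1-degree block size and the random data below are assumptions for illustration.

```python
# A minimal sketch of spatial block cross-validation: points sharing a grid
# cell always land in the same fold, so folds are spatially separated.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
lon = rng.uniform(-10, 10, 500)          # hypothetical coordinates
lat = rng.uniform(40, 50, 500)
X = rng.normal(size=(500, 3))            # hypothetical features
y = rng.normal(size=500)

# assign each point to a 1-degree grid cell, used as its "group" label
blocks = np.floor(lon).astype(int) * 100 + np.floor(lat).astype(int)

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=blocks):
    # fit on spatially distinct blocks, evaluate on the held-out blocks
    pass
```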
Spatio-temporal Models
A temporal dimension is also common when working with spatial data, as for example in the case of moving devices such as sensors, vehicles, or mobile phones. Methods for analysing spatio-temporal data model the field accounting for both the spatial dimension, as described in the previous sections, and the temporal dimension, which fundamentally differs because time flows only in one direction. For a review of spatio-temporal models see Banerjee et al. (2014) and Cressie and Wikle (2011).
Chapter 3: Spatial Clustering and Regionalization
Like clustering in traditional data science, spatial clustering covers a wide range
of methods and applications. Some traditional clustering methods can easily be
adapted to spatial problems while others require a reformulation to account for
the spatial relationships inherent in spatial data. Others are not well-suited for
spatial problems.
This chapter will cover some of the more common and powerful spatial clustering methods used in spatial data science. While the landscape of methods is large, many clustering problems don't neatly fall into existing methods and need to be custom programmed using linear programming, graphs (e.g., min cost flow), heuristic methods (e.g., genetic algorithms, tabu search, etc.), or other approaches. We will not be covering these here.
desired. Additionally, this method is fast and robust, even when working on higher dimensional datasets.
Language/Platform | Reference
R (stats) | https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
PostGIS | https://postgis.net/docs/ST_ClusterKMeans.html
While k-means can be used as a quick and dirty method directly on latitude and longitude, that should be discouraged, as it may not yield reliable results, especially if there is a strong geographical boundary such as a river. K-means works by minimizing within-cluster variance (the sum of squared Euclidean distances to each cluster centroid), not by minimizing geographic distances themselves. As such, it is generally fine for spatially close clusters (within a city, for example), but distances should be computed using the haversine formula instead of raw latitude-longitude differences, since geographical distances change with latitude. Again, this can be okay for rough clusters, but the algorithm's strategy can actually work against the goal of creating good clusters.
For example, if clusters are imbalanced (one cluster has more samples than another), the cluster with more samples will tend to have more seed centers randomly selected within its area, while areas with fewer samples are more likely to be assigned to the more heavily sampled areas. The result is that significant clusters that are spatially separated but have fewer samples tend to be grouped with other clusters which have more samples and are farther away. In such cases, algorithms like DBSCAN are a better approach (Boeing, 2018).
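A minimal sketch of this alternative, following the approach described by Boeing (2018): DBSCAN run on latitude/longitude converted to radians, using scikit-learn's haversine metric. The 1.5 km radius and minimum cluster size are illustrative assumptions.

```python
# A minimal sketch of DBSCAN on raw latitude/longitude with the haversine
# metric; the three sample points are made up for illustration.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([[40.75, -73.99], [40.76, -73.98],  # hypothetical points,
                   [51.51, -0.13]])                    # as (lat, lon) pairs

kms_per_radian = 6371.0088            # mean Earth radius in km
epsilon = 1.5 / kms_per_radian        # 1.5 km neighbourhood radius, in radians

db = DBSCAN(eps=epsilon, min_samples=2, algorithm="ball_tree",
            metric="haversine").fit(np.radians(coords))
print(db.labels_)                     # -1 marks noise points
```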
DBSCAN was also used in a CARTO blog post about Safegraph's data. Follow
along with the example notebook: https://github.com/CartoDB/data-science-
book/blob/master/Chapter%202/dbscan.ipynb
Language/Platform | Reference (DBSCAN)
R | https://rdrr.io/cran/dbscan/
PostGIS | https://postgis.net/docs/ST_ClusterDBSCAN.html

Language/Platform | Reference (UMAP)
Python | https://umap-learn.readthedocs.io
R | https://rdrr.io/cran/umap/
Regionalization
SKATER
The SKATER algorithm enables regionalization by constructing a contiguity-based minimum spanning tree that ensures homogeneity within trees by minimizing costs that are the inverse of the similarity of joined regions (Assunção et al., 2006). This means that a cost is associated with each neighbor pair. This cost can be based on one or more standardized attribute values that are reduced by some distance metric (e.g., Manhattan, Euclidean). Larger distances in the attribute space indicate greater dissimilarity, so such pairs are less likely to be joined. The contiguity is represented as a minimum spanning tree where cuts are made to ensure that individual regions are homogeneous.

These properties of SKATER allow one to construct regions that are similar within a cluster and dissimilar from other nearby clusters.
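A minimal SKATER sketch using the spopt package from the PySAL ecosystem (a newer package than those tabulated below); the input file, attribute columns, and number of regions are assumptions.

```python
# A minimal SKATER sketch with spopt; gdf is assumed to be a GeoDataFrame
# of contiguous polygons with numeric columns "income" and "density".
import geopandas as gpd
from libpysal.weights import Queen
from spopt.region import Skater

gdf = gpd.read_file("tracts.geojson")     # hypothetical input file
w = Queen.from_dataframe(gdf)             # contiguity graph between polygons

model = Skater(gdf, w, attrs_name=["income", "density"], n_clusters=5)
model.solve()
gdf["region"] = model.labels_             # contiguous, homogeneous regions
```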
SKATER has advantages over similar methods in that it is relatively efficient.
Language/Platform | Reference
R (spdep) | https://cran.r-project.org/web/packages/spdep/index.html
Max-p
The Max-p-regions algorithm provides spatially constrained regions that are homogeneous and meet a minimum threshold requirement: for example, finding regions constructed from census tracts that have similar median incomes, while ensuring that each region contains a minimum number of households.
The drawbacks to using this method are that max-p can be slow to run for larger datasets. For this reason, heuristic-based solutions have been developed so that approximate solutions can be calculated. Max-p can also lead to the construction of non-compact regions, as can be seen in the map below. This can be a problem for some applications, since having spatially compact regions matters where the efficiency of travel within a region depends strongly on the shape of the constructed region.
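A minimal max-p sketch, again using spopt's heuristic implementation; the homogeneity attribute, the household threshold, and the file name are assumptions for illustration.

```python
# A minimal max-p sketch with spopt: regions are grown to be homogeneous in
# "income" while each must contain at least 5,000 households.
import geopandas as gpd
from libpysal.weights import Queen
from spopt.region import MaxPHeuristic

gdf = gpd.read_file("tracts.geojson")    # hypothetical input file
w = Queen.from_dataframe(gdf)

model = MaxPHeuristic(gdf, w, attrs_name=["income"],
                      threshold_name="households", threshold=5000,
                      top_n=2)           # candidate areas tried per expansion
model.solve()
gdf["region"] = model.labels_
```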
Agglomerative Clustering
Agglomerative clustering is a type of hierarchical clustering where clusters are built from the bottom up. This algorithm starts with each object in its own cluster; clusters are then recursively merged (agglomerated) using a "linkage strategy" such as minimizing the sum of squared distances within a cluster. Similar to k-means, the number of clusters is specified at the beginning of a run.

The linkage strategy in agglomerative clustering depends on the use case, but four main ones are used: single, complete, average, and Ward linkage.
What makes agglomerative clustering well suited for spatial problems is that the clusters can be built with a pre-defined connectivity graph, such that only connected clusters can be joined into larger clusters, and distances between units can be pre-calculated according to different metrics (e.g., Euclidean or an arbitrary distance from an external service like a routing engine). Such graphs can be created using PySAL's weights interface.
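A minimal sketch of this pattern: a PySAL contiguity matrix passed to scikit-learn's AgglomerativeClustering as the connectivity graph, so that only touching polygons can be merged. The file and column names are assumptions.

```python
# A minimal sketch of spatially constrained agglomerative clustering; gdf is
# assumed to hold contiguous polygons with numeric attribute columns.
import geopandas as gpd
from libpysal.weights import Queen
from sklearn.cluster import AgglomerativeClustering

gdf = gpd.read_file("tracts.geojson")    # hypothetical input file
w = Queen.from_dataframe(gdf)

model = AgglomerativeClustering(
    n_clusters=5,
    connectivity=w.sparse,               # contiguity as a sparse matrix
    linkage="ward",                      # minimize within-cluster variance
)
gdf["cluster"] = model.fit_predict(gdf[["income", "density"]].values)
```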
Language/Platform | Reference
Python | https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
R | https://rdrr.io/cran/cluster/
Chapter 4: Logistics Optimization with Spatial Analysis
All of these logistics problems have a very strong spatial component that must
be considered as part of any optimization solution.
Points
Lines
Represent the transportation network. They are mainly used for visualization purposes; to store information on the network's characteristics (connections between points, distances, etc.), matrix or graph structures are used.
Distance/time
An optimization problem consists of two main components, the model and the
search.
The Model
The model is the formulation of the problem. It can be a traditional mathematical
formulation with equations, or a more conceptual formulation not necessarily
expressed in mathematical terms.
1) Decision Variables
Decision variables represent the decisions that need to be made and that will
lead to an optimal solution. Depending on the logistics problem at hand, our
decision variables could be whether to open a distribution center (DC) at a
specific location, whether a zip code is served by a DC, or which truck will serve
one customer and when.
The most frequently used variables are those with integer, binary, and
continuous domains. For some specific problems, such as task assignment
problems, it can also be useful to work with set variables.
2) Objective Function
3) Constraints:
Constraints define which solutions are feasible from different points of view. We can have, for example, physical constraints (a truck cannot transport more than its capacity) and business constraints (every client should be no further than 20 miles away from the closest DC).
All optimization problems have decision variables, but not all of them necessarily have an objective function or constraints. For example, when scheduling tasks, the problem is so complicated that just finding a solution which satisfies all constraints is considered a success.
The Search
The search is responsible for finding the best possible solution. It is called search
as a reference to the exploration of the solution space (defined by the problem’s
constraints) performed seeking the optimal solution.
Exact Algorithms
Exact algorithms are those which solve a problem to optimality, i.e., they find the
actual optimal solution. Ideally, we would always like to use an exact algorithm
to be sure we have the best possible solution. However, we are constrained by
time and computational capacity, so not all problems can be solved with exact
algorithms.
center (DC) opening/closing, and distribution area definition.
Several solvers, both commercial and open source, implement the Simplex Algorithm and some of its variants. A good choice is Google OR-Tools, an open source software suite for optimization. This suite provides you with an API to model your optimization problem and later connect to different solvers, so you can compare their performance on your specific problem.
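A minimal sketch of the OR-Tools modeling API (in recent versions of the library), using a two-variable toy linear program; the objective and constraint are made up purely to show how variables, constraints, and the objective are declared.

```python
# A minimal sketch of a linear program with Google OR-Tools' pywraplp.
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver("GLOP")  # GLOP = Google's LP solver

x = solver.NumVar(0, 10, "x")                  # decision variables with bounds
y = solver.NumVar(0, 10, "y")

solver.Add(x + 2 * y <= 14)                    # a constraint
solver.Maximize(3 * x + 4 * y)                 # the objective function

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print(x.solution_value(), y.solution_value())
```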
Approximate algorithms:
edges are the possible connections between pairs of nodes. Nodes and edges
can be assigned weights representing elements of the problem. For example, a
node can be assigned the value of its demand, and an edge can be assigned the
time from the origin to the destination nodes. Some very well known graph algorithms applied to routing are Dijkstra's algorithm for finding the shortest path between two nodes, and the Christofides algorithm for solving the Traveling Salesman Problem, as seen later in this chapter.
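A minimal Dijkstra sketch using the networkx package (one common option, not specifically endorsed by this text); the toy road network and its weights are hypothetical.

```python
# A minimal sketch of Dijkstra's shortest path on a weighted graph; edge
# weights might represent travel times in minutes.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("depot", "a", 4), ("depot", "b", 2),   # hypothetical road segments
    ("a", "c", 5), ("b", "c", 8), ("c", "customer", 3),
])

path = nx.dijkstra_path(G, "depot", "customer", weight="weight")
cost = nx.dijkstra_path_length(G, "depot", "customer", weight="weight")
print(path, cost)
```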
Perhaps the most famous and prominent routing optimization problem, with
multiple methods developed for finding a solution, the Traveling Salesman
Problem is defined thusly:
“Given a list of cities and the distances between each pair of cities, what is the
shortest possible route that visits each city and returns to the origin city?”
The number of possible routes grows factorially with the number of cities. Because of this combinatorial complexity, exact algorithms are rarely the best approach. There is a very powerful iterative algorithm that uses integer linear programming (an exact technique) at each iteration, which ensures optimality and has been proven to work very efficiently on instances of up to 1000 cities from one of the very well known TSP benchmarks. You can find the formulation using Gurobi here. However, when your business problem requires additional constraints, this algorithm is no longer an option. This is why the most common approach is to use approximate algorithms.
Among the different approximate algorithms, the most common are the family of local search algorithms and population-based algorithms such as genetic algorithms. One example of a local search algorithm is Simulated Annealing. Ant Colony Optimization (ACO) is one of the best known population-based algorithms. The main difference between these two families of algorithms is in how the problems are formulated, and, of course, the logic behind them. While Simulated Annealing starts from one solution and keeps moving to neighboring solutions with some randomness, Ant Colony Optimization can be seen as a simulation technique in which artificial ants (simulated agents) move through the graph.
Both families of algorithms have been proven to be very powerful on routing problems, and in particular on the TSP. The main criterion for choosing one or the other usually depends on the data scientist's proficiency with each of them, and the requirements in terms of open source vs. commercial software. With local search algorithms, it is very common to find libraries with several of this family's algorithms implemented, so normally two or three of these techniques will be tested to find the one that suits our problem best.
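A minimal TSP sketch with the simanneal package (listed in the appendix): the annealer subclass needs only a move() method, which generates a neighboring solution, and an energy() method, which scores a tour. The four-city distance matrix is made up.

```python
# A minimal TSP sketch with simanneal: anneal() explores neighboring tours
# with decreasing randomness and returns the best tour found.
import random
from simanneal import Annealer

DIST = {("A", "B"): 2, ("A", "C"): 9, ("A", "D"): 10,
        ("B", "C"): 6, ("B", "D"): 4, ("C", "D"): 8}   # made-up distances

def dist(a, b):
    return DIST.get((a, b)) or DIST[(b, a)]

class TSPAnnealer(Annealer):
    def move(self):
        # swap two random cities in the tour (a neighboring solution)
        i, j = random.sample(range(len(self.state)), 2)
        self.state[i], self.state[j] = self.state[j], self.state[i]

    def energy(self):
        # total length of the closed tour
        return sum(dist(self.state[k - 1], self.state[k])
                   for k in range(len(self.state)))

tour, length = TSPAnnealer(["A", "B", "C", "D"]).anneal()
print(tour, length)
```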
Finally, many of the algorithms used to solve the Traveling Salesman Problem require an initial solution to start from. The quality of this first solution (understanding quality as proximity to the optimal solution) can save us many hours of testing our algorithms. A very well known algorithm for finding this first solution is the Christofides Algorithm (explained in detail below). For instances where distances satisfy the triangle inequality, this algorithm guarantees that its solutions will be within a factor of 3/2 of the optimal solution, so oftentimes its solution is good enough and no further improvement is performed with local search algorithms.
Christofides Algorithm
One method for solving this problem is the Christofides algorithm. Step-by-step
instructions for this method are as follows:
The example aims to apply Christofides Algorithm to find the shortest path of visiting 73 retail stores in Minnesota.
Create a minimum spanning tree T (right) of Complete Graph G (left)
Vertices with odd degree O (left) and subgraph G’ (right) of G using only the vertices of O
Find a minimum-weight perfect matching M (above) in the induced subgraph given by the vertices from O.
Form an Eulerian circuit E (above) in the multigraph H obtained by combining the edges of T and M.
Make the circuit found in the previous step into a Hamiltonian circuit by skipping repeated vertices (shortcutting).
Explore our notebook for further details on the
Christofides Algorithm and alternative methods
for solving the Traveling Salesman Problem
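For reference, recent versions of networkx (2.8+) ship a Christofides implementation that operates on a complete weighted graph; the four-city distances below are made up.

```python
# A minimal sketch of the Christofides heuristic via networkx; the returned
# closed tour is guaranteed to be within 3/2 of the optimal length.
import networkx as nx
from networkx.algorithms.approximation import christofides

G = nx.complete_graph(4)                     # cities 0..3, all pairs connected
weights = {(0, 1): 2, (0, 2): 9, (0, 3): 10,
           (1, 2): 6, (1, 3): 4, (2, 3): 8}  # made-up pairwise distances
for (u, v), w in weights.items():
    G[u][v]["weight"] = w

tour = christofides(G, weight="weight")
print(tour)
```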
Chapter 5: Continue Your Spatial Education
The need for spatial data science professionals and practitioners across governments, organizations, companies, and cities is growing as these institutions continue to further recognize the importance of deriving new insights from growing location data assets. To meet this demand, many programs have sprung up, from higher education to tech bootcamps, that seek to train the next generation of spatial experts.
The University of Wisconsin-Madison - Post-grad study
Geospatial Data Science Lab - Research
Arizona State University (SPARC: Spatial Analysis Research Center) - Undergraduate study, Post-grad study, Research
References
Anselin, Luc. (Forthcoming). “Spatial Data Science”, in The International
Encyclopedia of Geography: People, the Earth, Environment, and Technology.
Arribas-Bel, D and Rey, S. “Geographic Data Science with PySAL and the pydata
stack” Retrieved from http://darribas.org/gds_scipy16/
Assunção, R.M., Neves, M.C., Câmara, G., and Da Costa Freitas, C. (2006). "Efficient Regionalization Techniques for Socio-Economic Geographical Units Using Minimum Spanning Trees." International Journal of Geographical Information Science.
Banerjee, Sudipto, et al. Hierarchical Modeling and Analysis for Spatial Data.
Chapman & Hall/CRC, 2014.
Barthélemy, Marc. “Spatial Networks.” Physics Reports, vol. 499, no. 1-3, 2011,
pp. 1–101., doi:10.1016/j.physrep.2010.11.002.
Bivand, Roger S., et al. “Applied Spatial Data Analysis with R.” 2013,
doi:10.1007/978-1-4614-7618-4.
Boeing, G. (2018, March 22). Clustering to Reduce Spatial Data Set Size. https://
doi.org/10.31235/osf.io/nzhdc
Brenning, Alexander. “Spatial Cross-Validation and Bootstrap for the
Assessment of Prediction Rules in Remote Sensing: The R Package Sperrorest.”
2012 IEEE International Geoscience and Remote Sensing Symposium, 2012,
doi:10.1109/igarss.2012.6352393.
Diggle, Peter J., et al. “Spatial and Spatio-Temporal Log-Gaussian Cox Processes:
Extending the Geostatistical Paradigm.” Statistical Science, vol. 28, no. 4, 2013,
pp. 542–563., doi:10.1214/13-sts441.
Finley, Andrew O., et al. "spBayes: An R Package for Univariate and Multivariate Hierarchical Point-Referenced Spatial Models." Journal of Statistical Software, vol. 19, no. 4, 2007, doi:10.18637/jss.v019.i04.
Freeman, Eric, et al. “ICOADS Release 3.0: a Major Update to the Historical
Marine Climate Record.” International Journal of Climatology, vol. 37, no. 5, 2016,
pp. 2211–2232., doi:10.1002/joc.4775.
Harrison, David, and Daniel L Rubinfeld. “Hedonic Housing Prices and the
Demand for Clean Air.” Journal of Environmental Economics and Management, vol.
5, no. 1, 1978, pp. 81–102., doi:10.1016/0095-0696(78)90006-2.
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer, 2017.
Hodges, James S., and Brian J. Reich. “Adding Spatially-Correlated Errors Can
Mess Up the Fixed Effect You Love.” The American Statistician, vol. 64, no. 4,
2010, pp. 325–334., doi:10.1198/tast.2010.10052.
Simpson, D., et al. “Going off Grid: Computationally Efficient Inference for Log-
Gaussian Cox Processes.” Biometrika, vol. 103, no. 1, 2016, pp. 49–70., doi:10.1093/
biomet/asv064.
Singleton, Alexander D. & Spielman, Seth E. “The Past, Present, and Future
of Geodemographic Research in the United States and United Kingdom,” The
Professional Geographer, 66:4, 558-567,2014. DOI: 10.1080/00330124.2013.848764
Links for helpful packages and tools:
• Geopandas - http://geopandas.org/
• cartoframes==1.0b1 - https://carto.com/developers/cartoframes/
• Carto-print - https://github.com/CartoDB/carto-print
• Matplotlib - https://matplotlib.org/
• Seaborn - https://seaborn.pydata.org/
• Pandas - https://pandas.pydata.org/
• Dask - https://dask.org/
• netCDF4 - https://pypi.org/project/netCDF4/
• Jupyter - https://jupyter.org/
• NumPy - https://www.numpy.org/
• SciPy - https://www.scipy.org/
• Sklearn - https://pypi.org/project/sklearn/
• Shapely - https://pypi.org/project/Shapely/
• Fiona - https://pypi.org/project/Fiona/
• Scikit-gstat - https://scikit-gstat.readthedocs.io/en/latest/
• Pyproj - https://pypi.org/project/pyproj/
• Utm - https://pypi.org/project/utm/
• PySAL - https://pysal.org/
• Pointpats - https://pypi.org/project/pointpats/
• Rpy2 - https://pypi.org/project/rpy2/
• Ipywidgets - https://ipywidgets.readthedocs.io/en/latest/
• GeoPy - https://github.com/geopy/geopy
• Simanneal - https://pypi.org/project/simanneal/
Authors
Steve Isaac is CARTO’s Content Marketing Manager,
in charge of content strategy, content creation, and
copyediting across multiple channels including social
media, the CARTO blog, newsletters, and more.