
Chapter 6

Spatial data analysis

The discussion up until this point has sought to prepare the reader for the ‘data
analysis’ phase. So far, we have discussed the nature of spatial data, georefer-
encing, notions of data acquisition and preparation, and issues relating to data
quality and error.
Before we move on to discuss a range of analytical operations, we should begin
with some clarifications. We know from preceding discussions that the analyt-
ical capabilities of a GIS use spatial and non-spatial (attribute) data to answer
questions and solve problems that are of spatial relevance. It is important to
make a distinction between analysis (or analytical operations) as discussed in
Section 3.3.3, and analytical models (often referred to just as ‘models’). By
analysis we mean only a subset of what is usually implied by the term: we do
not specifically deal with statistical analysis (such as cluster detection, for example). These are advanced concepts and techniques which are outside the scope of this book.
All knowledge of the world is based on models of some kind - whether they
are simple abstractions, culturally-based stereotypes or complex equations that
describe a physical phenomenon. We have already seen in Section 1.2.1 that there
are different types of model, and that the word itself means different things in
different contexts. Section 2.1 noted that even spatial data itself is a kind of
‘model’ of some part of the real world.
In this chapter we will focus on analytical functions that can form the build-
ing blocks for application models. It will hopefully become clear to the reader
that these operations can be combined in various ways for increasingly complex
analyses. Later in the chapter we present an overview of different types of ana-
lytical models and related concepts of which the user should be aware, as well
as an examination of how various errors may degrade the results of our models
or analyses.

6.1 Classification of analytical GIS capabilities

There are many ways to classify the analytical functions of a GIS. The classifi-
cation used for this chapter is essentially the one put forward by Aronoff [3]. It
makes the following distinctions, which are addressed in subsequent sections of
the chapter:

1. Classification, retrieval, and measurement functions. All functions in this category are performed on a single (vector or raster) data layer, often using
the associated attribute data.

• Classification allows the assignment of features to a class on the basis of attribute values or attribute ranges (definition of data patterns). On
the basis of reflectance characteristics found in a raster, pixels may be
classified as representing different crops, such as potato and maize.
• Retrieval functions allow the selective search of data. We might thus
retrieve all agricultural fields where potato is grown.
• Generalization is a function that joins different classes of objects with common characteristics to a higher level (generalized) class.1 For example, we might generalize fields where potato or maize, and possibly other crops, are grown as ‘food produce fields’.
• Measurement functions allow the calculation of distances, lengths, or areas.

1 The term generalization has different meanings in different contexts. In geography the term ‘aggregation’ is often used to indicate the process that we call generalization. In cartography, generalization means either the process of producing a graphic representation of smaller scale from a larger scale original (cartographic generalization), or the process of deriving a coarser resolution representation from a more detailed representation within a database (model generalization). Finally, in computer science, generalization is one of the abstraction mechanisms in object-orientation.

More detail can be found in Section 6.2.

2. Overlay functions. These are among the most frequently used functions in a GIS application. They allow the combination of two (or more) spa-
tial data layers comparing them position by position, and treating areas of
overlap—and of non-overlap—in distinct ways. Many GISs support over-
lays through an algebraic language, expressing an overlay function as a
formula in which the data layers are the arguments. In this way, we can
find

• The potato fields on clay soils (select the ‘potato’ cover in the crop
data layer and the ‘clay’ cover in the soil data layer and perform an
intersection of the two areas found),
• The fields where potato or maize is the crop (select both areas of
‘potato’ and ‘maize’ cover in the crop data layer and take their union),
• The potato fields not on clay soils (take the difference of the areas with ‘potato’ cover and the areas having clay soil),
• The fields that do not have potato as crop (take the complement of the
potato areas).

These are discussed further in Section 6.3.

3. Neighbourhood functions. Whereas overlays combine features at the same
location, neighbourhood functions evaluate the characteristics of an area
surrounding a feature’s location. A neighbourhood function ‘scans’ the
neighbourhood of the given feature(s), and performs a computation on it.

• Search functions allow the retrieval of features that fall within a given
search window. This window may be a rectangle, circle, or polygon.
• Buffer zone generation (or buffering) is one of the best known neigh-
bourhood functions. It determines a spatial envelope (buffer) around
(a) given feature(s). The created buffer may have a fixed width, or a
variable width that depends on characteristics of the area.
• Interpolation functions predict unknown values using the known val-
ues at nearby locations. This typically occurs for continuous fields,
like elevation, when the data actually stored does not provide the di-
rect answer for the location(s) of interest. Interpolation of continuous
data was discussed in Section 5.4.2.
• Topographic functions determine characteristics of an area by looking
at the immediate neighbourhood as well. Typical examples are slope
computations on digital terrain models (i.e. continuous spatial fields).
The slope in a location is defined as the plane tangent to the topogra-
phy in that location. Various computations can be performed, such
as:
– determination of slope angle,
– determination of slope aspect,
– determination of slope length,
– determination of contour lines. These are lines that connect points
with the same value (for elevation, depth, temperature, baromet-
ric pressure, water salinity etc).

We discuss these topics more fully in Section 6.4.

4. Connectivity functions. These functions work on the basis of networks, including road networks, water courses in coastal zones, and communica-
tion lines in mobile telephony. These networks represent spatial linkages
between features. Main functions of this type include:

• Contiguity functions evaluate a characteristic of a set of connected spatial units. One can think of the search for a contiguous area of forest
of certain size and shape in a satellite image.
• Network analytic functions are used to compute over connected line fea-
tures that make up a network. The network may consist of roads, pub-
lic transport routes, high voltage lines or other forms of transportation
infrastructure. Analysis of such networks may entail shortest path com-
putations (in terms of distance or travel time) between two points in
a network for routing purposes. Other forms are to find all points
reachable within a given distance or duration from a start point for
allocation purposes, or determination of the capacity of the network
for transportation between an indicated source location and sink lo-
cation.
• Visibility functions also fit in this list as they are used to compute the
points visible from a given location (viewshed modelling or viewshed
mapping) using a digital terrain model.

Details are discussed in Section 6.5.

6.2 Retrieval, classification and measurement

6.2.1 Measurement

Geometric measurement on spatial features includes counting, distance and area size computations. For the sake of simplicity, this section discusses such measurements in a planar spatial reference system. We limit ourselves to geometric measurements, and do not include attribute data measurement. In general, measurements on vector data are more advanced, and thus also more complex, than those on raster data. We discuss each group in turn.

Measurements on vector data

The primitives of vector data sets are point, (poly)line and polygon. Related
geometric measurements are location, length, distance and area size. Some of
these are geometric properties of a feature in isolation (location, length, area
size); others (distance) require two features to be identified.
The location property of a vector feature is always stored by the GIS: a single
coordinate pair for a point, or a list of pairs for a polyline or polygon boundary.
Occasionally, there is a need to obtain the location of the centroid of a polygon;
some GISs store these also, others compute them ‘on-the-fly’.
Length is a geometric property associated with polylines, by themselves, or in
their function as polygon boundary. It can obviously be computed by the GIS—
as the sum of lengths of the constituent line segments—but it quite often is also
stored with the polyline.
Area size is associated with polygon features. Again, it can be computed, but
usually is stored with the polygon as an extra attribute value. This speeds up
the computation of other functions that require area size values.
The attentive reader will have noted that all of the above ‘measurements’ do not
actually require computation, but only retrieval of stored data.
Measuring distance between two features is another important function. If both features are points, say p and q, the computation in a Cartesian spatial reference system is given by the well-known Pythagorean distance function:

dist(p, q) = \sqrt{(x_p − x_q)^2 + (y_p − y_q)^2}.

If one of the features is not a point, or both are not, we must be precise in defin-
ing what we mean by their distance. All these cases can be summarized as com-
putation of the minimal distance between a location occupied by the first and a
location occupied by the second feature. This means that features that intersect or meet, or where one contains the other, have a distance of 0. We leave a further case analysis, including polylines and polygons, to the reader as an exercise. It is not possible to store all distance values for all possible combinations of two features in any reasonably sized spatial database. As a result, the system must compute distances ‘on the fly’ whenever a distance computation request is made.
Another geometric measurement used by the GIS is the minimal bounding box computation. It applies to polylines and polygons, and determines the minimal rectangle—with sides parallel to the axes of the spatial reference system—that covers the feature. This is illustrated in Figure 6.1. Bounding box computation is an important support function for the GIS: for instance, if the bounding boxes of two polygons do not overlap, we know the polygons cannot possibly intersect each other. Since polygon intersection is a complicated function and bounding box computation is not, the GIS will always first apply the latter as a cheap test to see whether it must perform the former at all.
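As an illustration of this test, the following minimal Python sketch (not the implementation of any particular GIS) computes bounding boxes for polygons given as lists of (x, y) vertex tuples and uses them as a cheap filter before any expensive intersection computation.

def bounding_box(polygon):
    # polygon: list of (x, y) vertex tuples
    xs = [x for x, y in polygon]
    ys = [y for x, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)    # (xmin, ymin, xmax, ymax)

def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def may_intersect(poly_a, poly_b):
    # If the bounding boxes do not overlap, the polygons cannot intersect,
    # so the expensive polygon intersection computation can be skipped.
    return boxes_overlap(bounding_box(poly_a), bounding_box(poly_b))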

Figure 6.1: The minimal bounding box of (a) a polyline, and (b) a polygon.

For practical purposes, it is important to be aware of the measurement unit that
applies to the spatial data layer that one is working on. This is determined by
the spatial reference system that has been defined for it during data preparation.
A common use of area size measurements is when one wants to sum up the
area sizes of all polygons belonging to some class. This class could be crop type:
What is the size of the area covered by potatoes? If our crop classification is in a
stored data layer, the computation would include (a) selecting the potato areas,
and (b) summing up their (stored) area sizes. Clearly, little geometric computation is required in the case of stored features. This is not the case when we are
interactively defining our vector features in GIS use, and we want measurements
to be performed on these interactively defined features. Then, the GIS will have
to perform complicated geometric computations.

Measurements on raster data

Measurements on raster data layers are simpler because of the regularity of the
cells. The area size of a cell is constant, and is determined by the cell resolution.
Horizontal and vertical resolution may differ, but typically do not. Together with
the location of a so-called anchor point, this is the only geometric information
stored with the raster data, so all other measurements by the GIS are computed.
The anchor point is fixed by convention to be the lower left (or sometimes upper
left) location of the raster.
Location of an individual cell derives from the raster’s anchor point, the cell reso-
lution, and the position of the cell in the raster. Again, there are two conventions:
the cell’s location can be its lower left corner, or the cell’s midpoint. These con-
ventions are set by the software in use; in the case of low-resolution data, it becomes more important to be aware of which convention applies.
The area size of a selected part of the raster (a group of cells) is calculated as the
number of cells multiplied by the cell area size.
The distance between two raster cells is the standard distance function applied
to the locations of their respective mid-points, obviously taking into account
the cell resolution. Where a raster is used to represent line features as strings
of cells through the raster, the length of a line feature is computed as the sum
of distances between consecutive cells. This computation is prone to error, as
already discovered in Chapter 2 (Question 11).
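The sketch below, a simplified Python illustration with hypothetical function names, shows these raster measurements under the assumption of a lower-left anchor point, square cells and the cell-midpoint location convention.

def cell_midpoint(anchor_x, anchor_y, resolution, col, row, n_rows):
    # column and row are counted from the upper-left cell (row 0 at the top);
    # the anchor point is assumed to be the lower-left corner of the raster
    x = anchor_x + (col + 0.5) * resolution
    y = anchor_y + (n_rows - row - 0.5) * resolution
    return x, y

def selection_area(n_selected_cells, resolution):
    # area of a group of cells = number of cells times the area of one cell
    return n_selected_cells * resolution * resolution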

6.2.2 Spatial selection queries

When exploring a spatial data set, the first thing one usually wants is to select
certain features, to (temporarily) restrict the exploration. Such selections can be
made on geometric/spatial grounds, or on the basis of attribute data associated
with the spatial features. We discuss both techniques below.

Interactive spatial selection

In interactive spatial selection, one defines the selection condition by pointing at or drawing spatial objects on the screen display, after having indicated the spatial data layer(s) from which to select features. The interactively defined objects are called the selection objects; they can be points, lines, or polygons. The GIS then selects the features in the indicated data layer(s) that overlap (i.e. intersect, meet, contain, or are contained in; see Figure 2.15) with the selection objects.
These become the selected objects.
As we have seen in Section 3.5, spatial data stored in a geodatabase is associated
with its attribute data through a key/foreign key link. Selections of features lead
to selections on the records. Vice versa, selection of records may lead to selection
of features.
Interactive spatial selection answers questions like “What is at . . . ?” In Fig-
ure 6.2, the selection object is a circle and the selected objects are the red poly-
gons; they overlap with the selection object.


Figure 6.2: All city wards that overlap with the selection object—here a circle—are selected (left), and their corresponding attribute records are highlighted (right; only part of the table is shown). Attribute columns include Area, Perimeter, Ward_id, Ward_nam, District, Pop88 and Pop92. Data from an urban application in Dar es Salaam, Tanzania. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Spatial selection by attribute conditions

It is also possible to select features by using selection conditions on feature attributes. These conditions are formulated in SQL if the attribute data reside in a geodatabase. This type of selection answers questions like “Where are the features with . . . ?”

Area         IDs  LandUse
174308.70     2   30
2066475.00    3   70
214582.50     4   80
29313.86      5   80
73328.08      6   80
53303.30      7   80
614530.10     8   20
1637161.00    9   80
156357.40    10   70
59202.20     11   20
83289.59     12   80
225642.20    13   20
28377.33     14   40
228930.30    15   30
986242.30    16   70

Figure 6.3: Spatial selection using the attribute condition Area < 400000 on land use areas in Dar es Salaam. Spatial features on left, associated attribute data (in part) on right. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Figure 6.3 shows an example of selection by attribute condition. The query ex-
pression is Area < 400000, which can be interpreted as “select all the land use areas whose size is less than 400,000.” The polygons in red are the selected areas; their associated records are also highlighted in red. We can use this selected set of features as the basis of further selection. For instance, if we are interested in land use areas of size less than 400,000 that are of land use type 80, the selected features of Figure 6.3 are subjected to a further condition, LandUse = 80.
The result is illustrated in Figure 6.4. Such combinations of conditions are fairly
common in practice, so we devote a short paragraph below to the theory of combining conditions.

Figure 6.4: Further spatial selection from the already selected features of Figure 6.3, using the additional condition LandUse = 80 on land use areas. Observe that fewer features are now selected; the attribute table shown is the same as in Figure 6.3. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Combining attribute conditions

When multiple criteria have to be used for selection, we need to carefully express
all of these in a single composite condition. The tools for this come from the field of mathematical logic known as propositional calculus.
Above, we have seen simple, atomic conditions such as Area < 400000, and LandUse =
80. Atomic conditions use a predicate symbol, such as < (less than) or = (equals).
Other possibilities are <= (less than or equal), > (greater than), >= (greater than
or equal) and <> (does not equal). Any of these symbols is combined with an
expression on the left and one on the right. For instance, LandUse <> 80 can be used to select all areas with a land use class different from 80. Expressions are
either constants like 400000 and 80, attribute names like Area and LandUse, or
possibly composite arithmetic expressions like 0.15 × Area, which would com-
pute 15% of the area size.
Atomic conditions can be combined into composite conditions using logical connec-
tives. The most important ones are AND, OR, NOT and the bracket pair (· · ·). If
we write a composite condition like

Area < 400000 AND LandUse = 80,

we can use it to select areas for which both atomic conditions hold true. This is
the meaning of the AND connective. If we had written

Area < 400000 OR LandUse = 80

instead, the condition would have selected areas for which either condition holds, so effectively those with an area size less than 400,000, but also those with land
use class 80. (Included, of course, will be areas for which both conditions hold.)
The NOT connective can be used to negate a condition. For instance, the condi-
tion NOT (LandUse = 80) would select all areas with a different land use class
than 80. (Clearly, the same selection can be obtained by writing LandUse <> 80,
but this is not the point.) Finally, brackets can be applied to force grouping
amongst atomic parts of a composite condition. For instance, the condition

(Area < 30000 AND LandUse = 70) OR (Area < 400000 AND LandUse = 80)

will select areas of class 70 less than 30,000 in size, as well as class 80 areas less than 400,000 in size.
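As a small worked illustration, the sketch below evaluates this composite condition against a few of the attribute records shown in Figure 6.3; it is plain Python rather than GIS or SQL syntax, and the record structure is assumed for the example.

# In a geodatabase the same selection could be written in SQL as
# WHERE (Area < 30000 AND LandUse = 70) OR (Area < 400000 AND LandUse = 80)
records = [
    {"IDs": 3, "Area": 2066475.00, "LandUse": 70},
    {"IDs": 4, "Area": 214582.50, "LandUse": 80},
    {"IDs": 8, "Area": 614530.10, "LandUse": 20},
    {"IDs": 13, "Area": 225642.20, "LandUse": 20},
]

selected = [r for r in records
            if (r["Area"] < 30000 and r["LandUse"] == 70)
            or (r["Area"] < 400000 and r["LandUse"] == 80)]
# selected now holds only the record with IDs = 4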

Spatial selection using topological relationships

Various forms of topological relationship between spatial objects were discussed in Section 2.3.4. These relationships can be useful to select features as well. The steps carried out are:

1. To select one or more features as the selection objects, and


2. To apply a chosen spatial relationship function to determine the selected
features that have that relationship with the selection objects.

Selecting features that are inside selection objects This type of query uses the
containment relationship between spatial objects. Obviously, polygons can contain
polygons, lines or points, and lines can contain lines or points, but no other
containment relationships are possible.
Figure 6.5 illustrates a containment query. Here, we are interested in finding the
location of medical clinics in the area of Ilala District. We first selected all areas of
Ilala District, using the technique of selection by attribute condition District =
“Ilala”. Then, these selected areas were used as selection objects to determine
which medical clinics (as point objects) were within them.
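The containment test for point features ultimately rests on a point-in-polygon computation. The sketch below shows one common approach, ray casting, as a simplified illustration; a production implementation would also have to deal with points lying exactly on a boundary.

def point_in_polygon(px, py, polygon):
    # polygon: list of (x, y) vertices; a horizontal ray from (px, py) to the
    # right is intersected with every edge, and each crossing toggles 'inside'
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):                        # edge crosses the ray's y-level
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:                              # crossing lies to the right of the point
                inside = not inside
    return inside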

Selecting features that intersect The intersect operator identifies features that
are not disjoint in the sense of Figure 2.15, but now extended to include points
and lines. Figure 6.6 provides an example of spatial selection using the intersect
relationship between lines and polygons. We selected all roads intersecting Ilala
District.

Figure 6.5: Spatial selection using containment. In dark green, all wards within Ilala District as the selection objects. In red, all medical clinics located inside these areas, and thus inside the district. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Selecting features adjacent to selection objects Adjacency is the meet relationship of Section 2.3.4. It expresses that features share boundaries, and therefore
it applies only to line and polygon features. Figure 6.7 illustrates a spatial adja-
cency query. We want to select all parcels adjacent to an industrial area. The first
step is to select that area (in dark green) and then apply the adjacency function
to select all land use areas (in red) that are adjacent to it.

Selecting features based on their distance One may also want to use the dis-
tance function of the GIS as a tool in selecting features. Such selections can be
searches within a given distance from the selection objects, at a given distance, or
even beyond a given distance. There is a whole range of applications to this type
of selection, e.g.:


Figure 6.6: Spatial selection using intersection. The wards of Ilala District function as the selection objects (in dark green), and all roads (partially) in the district are selected (in red). Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

• Which clinics are within 2 kilometres of a selected school? (Information needed for the school emergency plan.)

• Which roads are within 200 metres of a medical clinic? (These roads must
have a high road maintenance priority.)

Figure 6.8 illustrates a spatial selection using distance. Here, we executed the
selection of the second example above. Our selection objects were all clinics, and we selected the roads that pass within 200 metres of a clinic.
In situations in which we know the distance criteria to use—for selections within,
at or beyond that distance value—the GIS has many (straightforward) compu-
tations to perform. Things become more complicated if our distance selection condition involves the word ‘nearest’ or ‘farthest’. The reason is that not only must the GIS compute distances from a selection object A to all potentially selectable features F, but also it must find that feature F that is nearest to (resp.,
farthest away from) object A. So, this requires an extra computational step to
determine minimum (maximum) values. Most GIS packages support this type
of selection, though the mechanics (‘the buttons to use’) differ.
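A ‘nearest’ selection can be sketched as follows for point features, using the Pythagorean distance function of Section 6.2.1; the clinic coordinates are made up for the example.

import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest(selection_object, candidates):
    # compute the distance to every candidate feature and keep the minimum
    return min(candidates, key=lambda f: dist(selection_object, f))

clinics = [(3.0, 4.0), (10.0, 1.0), (6.0, 8.0)]   # hypothetical coordinates
print(nearest((5.0, 5.0), clinics))               # -> (3.0, 4.0)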

Afterthought on selecting features So far we have discussed a number of different techniques for selecting features. We have also seen that selection conditions on attribute values can be combined using logical connectives like AND, OR and NOT. In fact, the other techniques of selecting features can usually also be combined.

Figure 6.7: Spatial selection using adjacency. Our selection object is an industrial area near downtown Dar es Salaam, Tanzania; our adjacency selection finds all adjacent land use areas. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.


Figure 6.8: Spatial selection using the distance function. With all clinics being our selection objects, we searched for roads that pass by within 200 metres. Observe that this also selects road segments that are far away from any clinic, simply because they belong to a road of which a segment is nearby. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Any set of selected features can be used as the input for a subsequent selection procedure. This means, for instance, that we can select
all medical clinics first, then identify the roads within 200 metres, then select
from them only the major roads, then select the nearest clinics to these remain-
ing roads, as the ones that should receive our financial support. In this way, we
are combining various techniques of selection.

6.2.3 Classification

Classification is a technique of purposefully removing detail from an input data set, in the hope of revealing important patterns (of spatial distribution). In the process, we produce an output data set, so that the input set can be left intact.
We do so by assigning a characteristic value to each element in the input set,
which is usually a collection of spatial features that can be raster cells or points,
lines or polygons. If the number of characteristic values is small in comparison
to the size of the input set, we have classified the input set.
The pattern that we look for may be the distribution of household income in a
city. Household income is called the classification parameter. If we know for
each ward in the city the associated average income, we have many different
values. Subsequently, we could define five different categories (or: classes) of
income: ‘low’, ‘below average’, ‘average’, ‘above average’ and ‘high’, and pro-
vide value ranges for each category. If these five categories are mapped in a
sensible colour scheme, this may reveal interesting information. This has been
done for Dar es Salaam in Figure 6.9 in two ways.
The input data set may itself have been the result of a classification, and in such a case we call it a reclassification. For example, we may have a soil map that shows different soil type units and we would like to show the suitability of units for a specific crop. In this case, it is better to assign to the soil units an attribute
of suitability for the crop. Since different soil types may have the same crop
suitability, a classification may merge soil units of different type into the same
category of crop suitability.

In classification of vector data, there are two possible results. In the first, the


Figure 6.9: Two classifications of average annual household income per ward in Dar es Salaam, Tanzania. Higher income areas in darker greens. Five categories were identified: (a) with original polygons left intact; (b) with original polygons merged when in the same category. The data used for this illustration are not factual.

input features may become the output features in a new data layer, with an ad-
ditional category assigned. In other words, nothing changes with respect to the
spatial extents of the original features. Figure 6.9(a) is an illustration of this first
type of output. A second type of output is obtained when adjacent features with
the same category are merged into one bigger feature. Such post-processing
functions are called spatial merging, aggregation or dissolving. An illustration of
this second type is found in Figure 6.9(b). Observe that this type of merging is
only an option in vector data, as merging cells in an output raster on the basis
of a classification makes little sense. Vector data classification can be performed
on point sets, line sets or polygon sets; the optional merge phase is sensible only for lines and polygons. Below, we discuss two kinds of classification: user-
controlled and automatic.

Household income range     New category value
391–2474                   1
2475–6030                  2
6031–8164                  3
8165–11587                 4
11588–21036                5

Table 6.1: Classification table used in Figure 6.9.

User-controlled classification

In user-controlled classification, a user selects the attribute(s) that will be used as the classification parameter(s) and defines the classification method. The latter
involves declaring the number of classes as well as the correspondence between
the old attribute values and the new classes. This is usually done via a classifi-
cation table. The classification table used for Figure 6.9 is displayed in Table 6.1.
Such a table is rather typical for cases in which the parameter domain used is continuous
(as in household income). Then, the table indicates value ranges to be mapped to
the same category. Observe that categorical values are ordinal data, in the sense
of Section 2.2.3.
Another case exists when the classification parameter is nominal or at least dis-
crete. Such an example is given in Figure 6.10.
We must also define the data format of the output, as a spatial data layer, which
will contain the new classification attribute. The data type of this attribute is
always categorical (i.e. integer or string), regardless of the data type of the attribute(s) from which the classification was obtained.
Sometimes, one may want to perform classification only on a selection of fea-
tures. In such cases, there are two options for the features that are not selected.
One option is to keep their original values, while the other is to assign a null
value to them in the output data set. A null value is a special value that means
that no applicable value is present. Care must be taken to deal with these values
correctly, both in computation and in visualization.
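A minimal sketch of this kind of user-controlled classification, using the value ranges of Table 6.1, might look as follows in Python; values falling outside all ranges receive a null value (None).

classification_table = [
    ((391, 2474), 1),
    ((2475, 6030), 2),
    ((6031, 8164), 3),
    ((8165, 11587), 4),
    ((11588, 21036), 5),
]

def classify(income):
    # map a household income value onto its new category value
    for (lower, upper), category in classification_table:
        if lower <= income <= upper:
            return category
    return None   # null value: no applicable class

print(classify(7500))   # -> 3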


Code  Old category           New category
10    Planned residential    Residential
20    Industry               Commercial
30    Commercial             Commercial
40    Institutional          Public
50    Transport              Public
60    Recreational           Public
70    Non built-up           Non built-up
80    Unplanned residential  Residential

Figure 6.10: An example of a classification on a discrete parameter, namely land use unit in the city of Dar es Salaam, Tanzania. Colour scheme: Residential (brown), Commercial (yellow), Public (olive), Non built-up (orange). Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

Automatic classification

User-controlled classifications require a classification table or user interaction. GIS software can also perform automatic classification, in which a user only
specifies the number of classes in the output data set. The system automati-
cally determines the class break points. Two main techniques of determining
break points are in use.

1. Equal interval technique: The minimum and maximum values vmin and vmax
of the classification parameter are determined and the (constant) interval
size for each category is calculated as (vmax − vmin )/n, where n is the num-
ber of classes chosen by the user. This classification is useful in revealing distribution patterns, since the number of features that falls into each category then reflects how the parameter values are spread.

2. Equal frequency technique: This technique is also known as quantile classification. The objective is to create categories with roughly equal numbers of
features per category. The total number of features is determined first; dividing it by the required number of categories gives the target number of features per category. The class break points are then determined by counting off
the features in order of classification parameter value.

Both techniques are illustrated on a small 5 × 5 raster in Figure 6.11.
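The sketch below shows one way such break points could be computed in Python; boundary conventions differ between packages, but applied to the 25 cell values of Figure 6.11 with n = 5 these functions reproduce the classes shown there.

def equal_interval_class(v, vmin, vmax, n):
    # constant class width (vmax - vmin)/n; the maximum value goes to class n
    width = (vmax - vmin) / n
    return min(int((v - vmin) / width) + 1, n)

def equal_frequency_table(values, n):
    # count off the sorted values, starting a new class roughly every
    # len(values)/n values; ties (equal values) are never split
    ordered = sorted(values)
    per_class = len(values) / n
    table = {}
    for i, v in enumerate(ordered):
        if v not in table:
            table[v] = min(int(i / per_class) + 1, n)
    return table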

When to use which? Which of these techniques should be applied to a given dataset depends upon the purpose of the analysis (what the user is trying to
achieve) as well as the characteristics of the data itself. The reader is encouraged
to experiment with their data and compare the results given by each method.
Other (and possibly better) techniques exist.
While these two types of classification can be used in spatial analysis, they are
also frequently used to develop visualizations of the same phenomena. In terms
of analytical operations we refer to some kind of calculation or function which
will use these categories. In terms of visualization, we refer to the graphical
representation of the data using these classifications. Just as the two techniques yield different numeric results, they will also yield different visual impressions.
Please refer to Chapter 7 for more discussion on issues relating to mapping and
visualization.


(a) original raster     (b) equal interval classification     (c) equal frequency classification
1 1 1 2 8               1 1 1 1 4                             1 1 1 2 5
4 4 5 4 9               2 2 3 2 5                             3 3 4 3 5
4 3 3 2 10              2 2 2 1 5                             3 2 2 2 5
4 5 6 8 8               2 3 3 4 4                             3 4 4 5 5
4 2 1 1 1               2 1 1 1 1                             3 2 1 1 1

Equal interval:  original values 1,2 → new value 1 (9 cells); 3,4 → 2 (8); 5,6 → 3 (3); 7,8 → 4 (3); 9,10 → 5 (2).
Equal frequency: original values 1 → new value 1 (6 cells); 2,3 → 2 (5); 4 → 3 (6); 5,6 → 4 (3); 8,9,10 → 5 (5).

Figure 6.11: Example of two automatic classification techniques: (a) the original raster with cell values; (b) classification based on equal intervals; (c) classification based on equal frequencies. Below, the respective classification tables, with a tally of the number of cells involved.

6.3 Overlay functions

In the previous section, we saw various techniques of measuring and selecting spatial data. We also discussed the generation of a new spatial data layer from an
old one, using classification. In this section, we look at techniques of combining
two spatial data layers and producing a third from them. The binary operators
that we discuss are known as spatial overlay operators. We will firstly discuss
vector overlay operators, and then focus on the raster case.
Standard overlay operators take two input data layers, and assume they are geo-
referenced in the same system, and overlap in study area. If either of these re-
quirements is not met, the use of an overlay operator is senseless. The principle
of spatial overlay is to compare the characteristics of the same location in both
data layers, and to produce a result for each location in the output data layer.
The specific result to produce is determined by the user. It might involve a cal-
culation, or some other logical function to be applied to every area or location.
In raster data, as we shall see, these comparisons are carried out between pairs
of cells, one from each input raster. In vector data, the same principle of com-
paring locations applies, but the underlying computations rely on determining
the spatial intersections of features from each input layer.

6.3.1 Vector overlay operators

In the vector domain, overlay is computationally more demanding than in the raster domain. Here we will only discuss overlays from polygon data layers, but
we note that most of the ideas also apply to overlay operations with point or line
data layers.
Figure 6.12: The polygon intersect (overlay) operator. Two polygon layers A and B produce a new polygon layer (with associated attribute table) that contains all intersections of polygons from A and B. Figure after [8].

The standard overlay operator for two layers of polygons is the polygon intersec-
tion operator. It is fundamental, as many other overlay operators proposed in
the literature or implemented in systems can be defined in terms of it. The prin-
ciples are illustrated in Figure 6.12. The result of this operator is the collection of
all possible polygon intersections; the attribute table result is a join—in the rela-
tional database sense of Chapter 3—of the two input attribute tables. This output


Figure 6.13: The residential areas of Ilala District, obtained from polygon intersection. Input for the polygon intersection operator were (a) a polygon layer with all Ilala wards, (b) a polygon layer with the residential areas, as classified in Figure 6.10. Data source: Dept. of Urban & Regional Planning and Geo-information Management, ITC.

attribute table only contains one tuple for each intersection polygon found, and
this explains why we call this operator a spatial join.
A more practical example is provided in Figure 6.13, which was produced by
polygon intersection of the ward polygons with land use polygons classified as
in Figure 6.10. This has allowed us to select the residential areas in Ilala District.
Two more polygon overlay operators are illustrated in Figure 6.14. The first is
known as the polygon clipping operator. It takes a polygon data layer and restricts
its spatial extent to the generalized outer boundary obtained from all (selected)
polygons in a second input layer. Besides this generalized outer boundary, no
other polygon boundaries from the second layer play a role in the result.
A second overlay operator is polygon overwrite. The result of this binary operator is defined as a polygon layer with the polygons of the first layer, except where
polygons existed in the second layer, as these take priority. The principle is
illustrated in the lower half of Figure 6.14. Most GISs do not force the user to
apply overlay operators to the full polygon data set. One is allowed to first select
relevant polygons in the data layer, and then use the selected set of polygons as
an operator argument.
The fundamental operator of all these is polygon intersection. The others can be
defined in terms of it, usually in combination with polygon selection and/or
classification. For instance, the polygon overwrite of A by B can be defined as
polygon intersection between A and B, followed by a (well-chosen) classification
that prioritizes polygons in B, followed by a merge. The reader is asked to verify
this.
Vector overlays are usually also defined for point or line data layers. Their defi-
nition parallels the definitions of operators discussed above. Different GISs use
different names for these operators, and one is advised to carefully check the
documentation before applying any of these operators.
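As an indication of what such an operation involves outside a full GIS, the following sketch uses the shapely Python library to intersect two hypothetical one-polygon layers and to join their attributes, in the spirit of Figure 6.12; it is an illustration only, not the way any particular GIS implements overlay.

from shapely.geometry import Polygon

layer_a = [("A1", Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]))]
layer_b = [("B1", Polygon([(2, 2), (6, 2), (6, 6), (2, 6)]))]

layer_c = []
for id_a, poly_a in layer_a:
    for id_b, poly_b in layer_b:
        piece = poly_a.intersection(poly_b)
        if not piece.is_empty:
            # one output polygon per intersection, carrying the joined attributes
            layer_c.append((id_a, id_b, piece))

print(layer_c[0][0], layer_c[0][1], layer_c[0][2].area)   # A1 B1 4.0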


Figure 6.14: Two more polygon overlay operators: (a) polygon clip overlay clips down the left hand polygon layer to the generalized spatial extent of the right hand polygon layer; (b) polygon overwrite overlay overwrites the left hand polygon layer with the polygons of the right hand layer.

6.3.2 Raster overlay operators

Vector overlay operators are useful, but geometrically complicated, and this
sometimes results in poor operator performance. Raster overlays do not suf-
fer from this disadvantage, as most of them perform their computations cell by
cell, and thus they are fast.
GISs that support raster processing—as most do—usually have a language to
express operations on rasters. These languages are generally referred to as map
algebra [54], or sometimes raster calculus. They allow a GIS to compute new
rasters from existing ones, using a range of functions and operators. Unfor-
tunately, not all implementations of map algebra offer the same functionality.
The discussion below is to a large extent based on general terminology, and
attempts to illustrate the key operations using a logical, structured language.
Again, the syntax often differs for different GIS software packages.
When producing a new raster we must provide a name for it, and define how it
is computed. This is done in an assignment statement of the following format:

Output raster name := Map algebra expression.

The expression on the right is evaluated by the GIS, and the raster in which it
results is then stored under the name on the left. The expression may contain
references to existing rasters, operators and functions; the format is made clear
below. The raster names and constants that are used in the expression are called
its operands. When the expression is evaluated, the GIS will perform the calculation on a pixel by pixel basis, starting from the first pixel in the first row, and continuing until the last pixel in the last row. There is a wide range of operators
and functions that can be used in map algebra, which we discuss below.

Arithmetic operators

Various arithmetic operators are supported. The standard ones are multiplica-
tion (×), division (/), subtraction (−) and addition (+). Obviously, these arith-
metic operators should only be used on appropriate data values, and for in-
stance, not on classification values.
Other arithmetic operators may include modulo division (MOD) and integer di-
vision (DIV ). Modulo division returns the remainder of division: for instance,
10 MOD 3 will return 1 as 10 − 3 × 3 = 1. Similarly, 10 DIV 3 will return 3.
Other operators are trigonometric: sine (sin), cosine (cos), tangent (tan), and their inverse functions asin, acos and atan, which return radian angles as real values.

Figure 6.15: Examples of arithmetic map algebra expressions (C1 := A + 10, C2 := A + B, and C3 := ((A − B)/(A + B)) × 100) applied to two input rasters A and B.

Some simple map algebra assignments are illustrated in Figure 6.15. The assignment:

C1 := A + 10

will add a constant factor of 10 to all cell values of raster A and store the result
as output raster C1. The assignment:

C2 := A + B

will add the values of A and B cell by cell, and store the result as raster C2.
Finally, the assignment

C3 := (A − B)/(A + B) × 100

will create output raster C3, as the result of the subtraction (cell by cell, as usual)
of B cell values from A cell values, divided by their sum. The result is multi-
plied by 100. This expression, when carried out on AVHRR channel 1 (red) and
AVHRR channel 2 (near infrared) of NOAA satellite imagery, is known as the
NDVI (Normalized Difference Vegetation Index). It has proven to be a good indica-
tor of the presence of green vegetation.
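The same three assignments can be sketched with NumPy arrays standing in for the rasters; the reflectance values below are made up, and with a taken as the near-infrared channel and b as the red channel the last expression corresponds to a scaled version of the usual NDVI formulation.

import numpy as np

a = np.array([[0.60, 0.35], [0.30, 0.45]])   # hypothetical near-infrared values
b = np.array([[0.10, 0.30], [0.25, 0.05]])   # hypothetical red values

c1 = a + 10                       # C1 := A + 10
c2 = a + b                        # C2 := A + B
c3 = (a - b) / (a + b) * 100      # C3 := ((A - B)/(A + B)) * 100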

Comparison and logical operators

Map algebra also allows the comparison of rasters cell by cell. To this end, we
may use the standard comparison operators (<, <=, =, >=, > and <>) that we
introduced before.
A simple raster comparison assignment is:

C := A <> B.

It will store truth values—either true or false—in the output raster C. A cell
value in C will be true if the cell’s value in A differs from that cell’s value in B.
It will be false if they are the same.
Logical connectives are also supported in most implementations of map algebra.
We have already seen the connectives of AND, OR and NOT in Section 6.2.2. An-
other connective that is commonly offered in map algebra is exclusive OR (XOR).
The expression a XOR b is true only if either a or b is true, but not both. Examples of the use of these comparison operators and connectives are provided in Fig-
ure 6.16 and Figure 6.17. The latter figure provides various raster computations
in search of forests at specific elevations. In that figure, raster D1 indicates forest below 500 m, raster D2 areas that are either forest or below 500 m (or both), raster D3 areas that are either forest or below 500 m (but not both at the same time), and raster D4 forests at or above 500 m.
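A NumPy sketch of these four queries, with boolean arrays playing the role of true/false rasters and made-up land use and elevation values, might read:

import numpy as np

landuse = np.array([["forest", "grass"], ["forest", "lake"]])
elevation = np.array([[400, 700], [600, 300]])

is_forest = landuse == "forest"
below_500 = elevation < 500

d1 = is_forest & below_500      # forest below 500 m
d2 = is_forest | below_500      # forest, or below 500 m, or both
d3 = is_forest ^ below_500      # forest or below 500 m, but not both (XOR)
d4 = is_forest & ~below_500     # forest at or above 500 m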


Figure 6.16: Examples of logical expressions in map algebra: A AND B, A OR B, A XOR B, A AND NOT B, (A AND B) OR C, and A AND (B OR C), applied to three boolean rasters A, B and C. Green cells represent true values, white cells represent false values.


D1 := (A = "forest") AND (B < 500)
D2 := (A = "forest") OR (B < 500)
D3 := (A = "forest") XOR (B < 500)
D4 := (A = "forest") AND NOT (B < 500)

Figure 6.17: Examples of complex logical expressions in map algebra. A is a classified raster for land use (F = forest), and B holds elevation values in metres.

Conditional expressions

The above comparison and logical operators produce rasters with the truth val-
ues true and false. In practice, we often want to use them inside a conditional expression that tests whether a condition is fulfilled and acts on the outcome. The general format is:2

Output raster := CON (condition, then expression, else expression).

Here, condition is the tested condition, then expression is evaluated if condition holds, and else expression is evaluated if it does not hold.
This means that an expression like CON (A = “forest”, 10, 0) will evaluate to 10 for each cell in the output raster where the same cell in A is classified as forest. In each cell where this is not true, the else expression is evaluated, resulting in 0. Another example is provided in Figure 6.18, showing that values for the
then expression and the else expression can be some integer (possibly derived
from another calculation) or values derived from other rasters. In this example,
the output raster C1 is assigned the values of input raster B wherever the cells
of input raster A contain forest. The cells in output raster C2 are assigned 10
wherever the elevation (B) is equal to 7 and the groundcover (A) is forest.

2 We have already noted that specific software packages may differ in the specifics of the syntax that makes up an expression. This extends to the actual commands—some packages using “IFF” instead of “CON”.


C1 := CON (A = "F", B, 0)
C2 := CON ((A = "F") AND (B = 7), 10, 0)

Figure 6.18: Examples of conditional expressions in map algebra. Here A is a classified raster holding land use data (F = forest), and B is an elevation value raster (7 = 700 m, 6 = 600 m, 4 = 400 m).
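The two CON expressions of Figure 6.18 can be emulated with NumPy's where function, as in the sketch below; the array contents are made up, and elevations are written out in metres rather than the shorthand codes used in the figure.

import numpy as np

a = np.array([["F", "F", "G"], ["G", "F", "F"]])     # hypothetical land use codes
b = np.array([[700, 700, 400], [400, 600, 700]])     # hypothetical elevations (m)

c1 = np.where(a == "F", b, 0)                  # C1 := CON (A = "F", B, 0)
c2 = np.where((a == "F") & (b == 700), 10, 0)  # C2 := CON ((A = "F") AND (B = 700), 10, 0)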

6.3.3 Overlays using a decision table

Conditional expressions are powerful tools in cases where multiple criteria must
be taken into account. A small example may illustrate this. Consider a
suitability study in which a land use classification and a geological classification
must be used. The respective rasters are illustrated in Figure 6.19 on the left. Do-
main expertise dictates that some combinations of land use and geology result
in suitable areas, whereas other combinations do not. In our example, forests
on alluvial terrain and grassland on shale are considered suitable combinations,
while the others are not.
We could produce the output raster of Figure 6.19 with a map algebra expression
such as:

Suitability := CON ((Landuse = "Forest" AND Geology = "Alluvial") OR
                    (Landuse = "Grass" AND Geology = "Shale"),
                    "Suitable", "Unsuitable")

and consider ourselves lucky that there are only two ‘suitable’ cases. In practice,
many more cases must usually be covered, and then writing up a complex CON
expression is not an easy task.

To this end, some GISs accommodate setting up a separate decision table that
will guide the raster overlay process. This extra table carries domain expertise,

and dictates which combinations of input raster cell values will produce which
output raster cell value. This gives us a raster overlay operator using a decision
table, as illustrated in Figure 6.19. The GIS will have supporting functions to
generate the additional table from the input rasters, and to enter appropriate
values in the table.

Figure 6.19: The use of a decision table in raster overlay. The overlay is computed in a suitability study, in which land use and geology are important factors. The meaning of values in both input rasters, as well as the output raster, can be understood from the decision table:

                        Geology
Land use    Alluvial       Shale
Forest      Suitable       Unsuitable
Grass       Unsuitable     Suitable
Lake        Unsuitable     Unsuitable
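Where no built-in decision-table overlay is available, the same lookup can be scripted. The sketch below (Python/NumPy) encodes the decision table of Figure 6.19 as a dictionary and applies it cell by cell; the small input rasters are invented for illustration.

import numpy as np

landuse = np.array([["Forest", "Forest", "Lake"],
                    ["Grass",  "Forest", "Grass"],
                    ["Grass",  "Lake",   "Forest"]])
geology = np.array([["Alluvial", "Shale",    "Shale"],
                    ["Shale",    "Alluvial", "Alluvial"],
                    ["Shale",    "Shale",    "Alluvial"]])

# Decision table of Figure 6.19: (land use, geology) -> suitability.
decision = {
    ("Forest", "Alluvial"): "Suitable",
    ("Forest", "Shale"):    "Unsuitable",
    ("Grass",  "Alluvial"): "Unsuitable",
    ("Grass",  "Shale"):    "Suitable",
    ("Lake",   "Alluvial"): "Unsuitable",
    ("Lake",   "Shale"):    "Unsuitable",
}

# Apply the table cell by cell to produce the output raster.
suitability = np.empty(landuse.shape, dtype=object)
for i in range(landuse.shape[0]):
    for j in range(landuse.shape[1]):
        suitability[i, j] = decision[(landuse[i, j], geology[i, j])]

print(suitability)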

6.4 Neighbourhood functions

In our section on overlay operators, the guiding principle was to compare or


combine the characteristic value of a location from two data layers, and to do
so for all locations. This is what map algebra, for instance, gave us: cell by cell
calculations, with the results stored in a new raster.
There is another guiding principle in spatial analysis that can be equally useful.
The principle here is to find out the characteristics of the vicinity, here called
neighbourhood, of a location. After all, many suitability questions, for instance,
depend not only on what is at the location, but also on what is near the location.
Thus, the GIS must allow us ‘to look around locally’.
To perform neighbourhood analysis, we must:

1. State which target locations are of interest to us, and define their spatial
extent,

2. Define how to determine the neighbourhood for each target,

3. Define which characteristic(s) must be computed for each neighbourhood.

For instance, our target might be a medical clinic. Its neighbourhood could be
defined as:

• An area within 2 km distance as the crow flies, or

• An area within 2 km travel distance, or

• All roads within 500 m travel distance, or

• All other clinics within 10 minutes travel time, or

• All residential areas, for which the clinic is the closest clinic.

The alert reader will note the increasingly complex definitions of ‘neighbour-
hood’ used here. This is to illustrate that different ways of measuring neighbour-
hoods exist, and some are better (or more representative of real neighbourhoods)
than others, depending on the purpose of the analysis.
Then, in the third step we indicate what it is we want to discover about the
phenomena that exist or occur in the neighbourhood. This might simply be its
spatial extent, but it might also be statistical information like:

• The total population of the area,

• Average household income, or

• The distribution of high-risk industries located in the neighbourhood.

The above are typical questions in an urban setting. When our interest is more in
natural phenomena, different examples of locations, neighbourhoods and neigh-
bourhood characteristics arise. Since raster data are more commonly used in
such cases, neighbourhood characteristics are often obtained via statistical sum-
mary functions that compute values such as the average, minimum, maximum, and
standard deviation of the cells in the identified neighbourhood.

Determining neighbourhood extent To select target locations, one can use the
selection techniques that we discussed in Section 6.2.2. To obtain characteristics
from the identified neighbourhood, the same techniques apply. So what remains
to be discussed here is the proper determination of a neighbourhood.
One way of determining a neighbourhood around a target location is by making
use of the geometric distance function. We discuss some of these techniques in
Section 6.4.1. Geometric distance does not take direction into account, yet certain
phenomena, such as pollution spread by rivers, groundwater flow, or prevailing
weather systems, can only be studied by doing so.
The more advanced techniques for computation of flow and diffusion are dis-
cussed in Section 6.4.2. Diffusion functions are based on the assumption that
the phenomenon spreads in all directions, though not necessarily equally eas-
ily in all directions. Hence, they use local terrain characteristics to compute the
local resistance against diffusion. In flow computations, the assumption is that
the phenomenon will choose a least-resistance path, and not spread in all direc-
tions. This, as we will see, involves the computation of preferred local direction
of spread. Both flow and diffusion computations take local characteristics into
account, and are therefore more easily performed on raster data.

6.4.1 Proximity computations

In proximity computations, we use geometric distance to define the neighbour-


hood of one or more target locations. The most common and useful technique
is buffer zone generation. Another technique based on geometric distance that we
discuss is Thiessen polygon generation.

Buffer zone generation

The principle of buffer zone generation is simple: we select one or more target
locations, and then determine the area around them, within a certain distance.
In Figure 6.20(a), a number of main and minor roads were selected as targets,
and a 75 m (resp., 25 m) buffer was computed from them. In some case stud-
ies, zonated buffers must be determined, for instance in assessments of traffic
noise effects. Most GISs support this type of zonated buffer computation. An
illustration is provided in Figure 6.20(b).

Figure 6.20: Buffer zone generation: (a) buffer zones around main and minor roads, using different distances: 25 metres for minor roads, 75 metres for main roads; (b) zonated buffer zones around main roads, with three zones: at 100 metres from the main road, at 200, and at 300 metres.

In vector-based buffer generation, the buffers themselves become polygon fea-


tures, usually in a separate data layer, that can be used in further spatial analysis.

Buffer generation on rasters is a fairly simple function. The target location or lo-
cations are always represented by a selection of the raster’s cells, and geometric
distance is defined, using cell resolution as the unit. The distance function ap-
plied is the Pythagorean distance between the cell centres. The distance from a
non-target cell to the target is the minimal distance one can find between that
non-target cell and any target cell.
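On a raster, this amounts to a distance transform followed by a threshold. The sketch below uses Python with NumPy and SciPy: scipy.ndimage.distance_transform_edt computes, for every cell, the Euclidean distance to the nearest target cell; the cell size, buffer distance and the toy 'road' are arbitrary illustration values.

import numpy as np
from scipy import ndimage

cell_size = 25.0          # metres per cell (illustrative)
buffer_distance = 75.0    # buffer width in metres (illustrative)

# Target cells (e.g. rasterized roads): True where a target is present.
target = np.zeros((6, 6), dtype=bool)
target[2, :] = True       # a horizontal 'road' through the raster

# distance_transform_edt measures the distance to the nearest zero cell,
# so we invert the mask: non-target cells get their distance to the target.
distance = ndimage.distance_transform_edt(~target, sampling=cell_size)

buffer_zone = distance <= buffer_distance   # boolean buffer raster
print(buffer_zone.astype(int))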

Thiessen polygon generation

Thiessen polygon partitions make use of geometric distance for determining neigh-
bourhoods. This is useful if we have a spatially distributed set of points as target
locations, and we want to know for each location in the study to which target
it is closest. This technique will generate a polygon around each target location
that identifies all those locations that ‘belong to’ that target. We have already
seen the use of Thiessen polygons in the context of interpolation of point data, as
discussed in Section 5.4.1. Given an input point set that will be the polygon’s
midpoints, it is not difficult to construct such a partition. It is even much easier
to construct if we already have a Delaunay triangulation for the same input point
set (see Section 2.3.3 on TINs).
Figure 6.21 repeats the Delaunay triangulation of Figure 2.9(b). The Thiessen
polygon partition constructed from it is on the right. The construction first cre-
ates the perpendicular bisectors of all the triangle sides; observe that the perpendicular
bisector of a triangle side that connects point A with point B is the divide between the
area closer to A and the area closer to B. These bisectors become part of the
boundary of each Thiessen polygon.
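For a raster study area, the same partition can be approximated by assigning every cell to its nearest target point. A minimal sketch with Python, NumPy and SciPy's cKDTree follows; the point coordinates and grid size are invented for illustration, and a vector Thiessen/Voronoi partition could instead be built with scipy.spatial.Voronoi.

import numpy as np
from scipy.spatial import cKDTree

# Target points (x, y) that will generate the Thiessen polygons (illustrative).
points = np.array([[2.0, 3.0], [7.0, 8.0], [5.0, 1.0]])

# Build a grid of cell-centre coordinates for a 10 x 10 study area.
xs, ys = np.meshgrid(np.arange(10) + 0.5, np.arange(10) + 0.5)
cells = np.column_stack([xs.ravel(), ys.ravel()])

# For each cell, find the index of the nearest target point.
tree = cKDTree(points)
_, nearest = tree.query(cells)

thiessen_raster = nearest.reshape(xs.shape)   # cell value = id of nearest point
print(thiessen_raster)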


Figure 6.21: Thiessen polygon construction (right) from a Delaunay triangulation (left): the perpendicular bisectors of the triangle sides form the boundaries of the polygons.

6.4.2 Computation of diffusion

The determination of neighbourhood of one or more target locations may de-


pend not only on distance—cases which we discussed above—but also on di-
rection and differences in the terrain in different directions. This typically is the
case when the target location contains a ‘source material’ that spreads over time,
referred to as diffusion. This ‘source material’ may be air, water or soil pollution,
commuters exiting a train station, people from an opened-up refugee camp, a
water spring uphill, or the radio waves emitted from a radio relay station. In
all these cases, one will not expect the spread to occur evenly in all directions.
There will be local terrain factors that influence the spread, making it easier or
more difficult. Many GISs provide support for this type of computation, and we
discuss some of its principles here, in the context of raster data.
Diffusion computation involves one or more target locations, which are better
called source locations in this context. They are the locations of the source of
whatever spreads. The computation also involves a local resistance raster, which
for each cell provides a value that indicates how difficult it is for the ‘source
material’ to pass through that cell. The value in each cell must be normalized, i.e. valid
for a standardized length (usually the cell’s width) of spread path. From the
source location(s) and the local resistance raster, the GIS will be able to com-
pute a new raster that indicates how much minimal total resistance the spread has
witnessed for reaching a raster cell. This process is illustrated in Figure 6.22.
While computing total resistances, the GIS takes proper care of the path lengths.
Obviously, the diffusion from a cell c_src to its neighbour cell to the east c_e is
shorter than to the cell that is its northeast neighbour c_ne. The distance ratio
between these two cases is 1 : √2. If val(c) indicates the local resistance value

Figure 6.22: Computation of diffusion on a raster. The lower left green cell is the source location, indicated in the local resistance raster (a). The raster in (b) is the minimal total resistance raster computed by the GIS. (The GIS will work in higher precision real arithmetic than what is illustrated here.)

(a) local resistance raster        (b) minimal total resistance raster
1  1  1  2   8                     14.50  14.95  15.95  17.45  22.45
4  4  5  4   9                     12.00  12.45  14.61  16.66  21.44
4  3  3  2  10                      8.00   8.95  11.95  13.66  19.66
4  5  6  8   8                      4.00   6.36   8.00  10.00  11.00
4  2  1  1   1                      0.00   3.00   4.50   5.50   6.50

for cell c, the GIS computes the total incurred resistance for diffusion from c_src to
c_e as 1/2 (val(c_src) + val(c_e)), while the same for c_src to c_ne is
1/2 (val(c_src) + val(c_ne)) × √2. The accumulated resistance along a path of cells is simply the sum of these
incurred resistances from pairwise neighbour cells.
Since ‘source material’ has the habit of taking the easiest route to spread, we
must determine at what minimal cost (i.e. at what minimal resistance) it may
have arrived in a cell. Therefore, we are interested in the minimal cost path. To
determine the minimal total resistance along a path from the source location csrc
to an arbitrary cell cx , the GIS determines all possible paths from csrc to cx , and
then determines which one has the lowest total resistance. This value is found,
for each cell, in the raster of Figure 6.22(b).
For instance, there are three paths from the green source location to its northeast
neighbour cell (with local resistance 5). We can define them as path 1 (N–E),

path 2 (E–N) and path 3 (NE), using compass directions to define the path from
the green cell. For path 1, the total resistance is computed as:

    1/2 (4 + 4) + 1/2 (4 + 5) = 8.5.

Path 2, in similar style, gives us a total value of 6.5. For path 3, we find

    1/2 (4 + 5) × √2 = 6.36,

and thus it obviously is the minimal cost path. The reader is asked to verify one
or two other values of minimal cost paths that the GIS has produced.
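The minimal total resistance raster can be reproduced with a shortest-path search over the cells, treating each pair of neighbouring cells as an edge whose cost is the average of their resistances (times √2 for diagonal moves). The sketch below is a plain Dijkstra implementation in Python using heapq, written for this illustration; it is not the algorithm of any particular GIS package.

import heapq
import math

# Local resistance raster of Figure 6.22(a), with row 0 at the top.
resistance = [
    [1, 1, 1, 2,  8],
    [4, 4, 5, 4,  9],
    [4, 3, 3, 2, 10],
    [4, 5, 6, 8,  8],
    [4, 2, 1, 1,  1],
]
source = (4, 0)   # lower left cell
rows, cols = len(resistance), len(resistance[0])

def neighbours(r, c):
    """Yield neighbouring cells and the length factor of the step (1 or sqrt 2)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield (nr, nc), math.sqrt(2) if dr and dc else 1.0

# Dijkstra: accumulate the minimal total resistance from the source to every cell.
total = {source: 0.0}
heap = [(0.0, source)]
while heap:
    cost, (r, c) = heapq.heappop(heap)
    if cost > total.get((r, c), math.inf):
        continue
    for (nr, nc), length in neighbours(r, c):
        step = 0.5 * (resistance[r][c] + resistance[nr][nc]) * length
        new_cost = cost + step
        if new_cost < total.get((nr, nc), math.inf):
            total[(nr, nc)] = new_cost
            heapq.heappush(heap, (new_cost, (nr, nc)))

print(round(total[(3, 1)], 2))   # northeast neighbour of the source: 6.36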

6.4.3 Flow computation

Flow computations determine how a phenomenon spreads over the area, in


principle in all directions, though with varying difficulty or resistance. There
are also cases where a phenomenon does not spread in all directions, but moves
or ‘flows’ along a given, least-cost path, determined again by local terrain char-
acteristics. The typical case arises when we want to determine the drainage pat-
terns in a catchment: the rainfall water ‘chooses’ a way to leave the area.
This principle is illustrated with a simple elevation raster, in Figure 6.23(a). For
each cell in that raster, the steepest downward slope to a neighbour cell is com-
puted, and its direction is stored in a new raster (Figure 6.23(b)). This compu-
tation determines the elevation difference between the cell and a neighbour cell,
and takes
√ into account cell distance—1 for neighbour cells in N–S or W–E direc-
tion, 2 for cells in NE–SW or NW-SE direction. Among its eight neighbour Determining flow direction
cells, it picks the one with the steepest path to it. The directions in raster (b),
thus obtained, are encoded in integer values, and we have ‘decoded’ them for
the sake of illustration. Raster (b) can be called the flow direction raster. From
raster (b), the GIS can compute the accumulated flow count raster, a raster that for
each cell indicates how many cells have their water flow into the cell.
Cells with a high accumulated flow count represent areas of concentrated flow,
and thus may belong to a stream. By using some appropriately chosen threshold
value in a map algebra expression, we may decide whether they do. Cells with
an accumulated flow count of zero are local topographic highs, and can be used
to identify ridges.
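A minimal version of the flow direction step can be sketched as follows (Python/NumPy): for each cell, the drop to each of the eight neighbours is divided by the step length (1 or √2 cell widths), and the direction of the steepest positive value is kept. This is only the basic D8 idea under simplifying assumptions (no handling of pits or flat areas), not the exact procedure of any specific GIS; the elevation values are copied from Figure 6.23(a).

import numpy as np

# Elevation raster of Figure 6.23(a).
dem = np.array([
    [156, 144, 138, 142, 116,  98],
    [148, 134, 112,  98,  92, 100],
    [138, 106,  88,  74,  76,  96],
    [128, 116, 110,  44,  62,  48],
    [136, 122,  94,  42,  32,  38],
    [148, 106,  68,  24,  22,  24],
])

# The eight D8 directions as (row offset, col offset).
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
rows, cols = dem.shape
flow_dir = np.full((rows, cols), -1)   # index into 'offsets'; -1 = no downslope neighbour

for r in range(rows):
    for c in range(cols):
        best, steepest = -1, 0.0
        for k, (dr, dc) in enumerate(offsets):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                dist = np.hypot(dr, dc)                   # 1 or sqrt(2) cell widths
                slope = (dem[r, c] - dem[nr, nc]) / dist  # positive = downhill
                if slope > steepest:
                    best, steepest = k, slope
        flow_dir[r, c] = best

print(flow_dir)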


Figure 6.23: Flow computations on a raster: (a) the original elevation raster, (b) the flow direction raster computed from it, (c) the accumulated flow count raster.

6.4.4 Raster based surface analysis

Continuous fields have a number of characteristics not shared by discrete fields.


Since the field changes continuously, we can talk about slope angle, slope aspect
and concavity/convexity of the slope. These notions are not applicable to discrete
fields.
The discussions in this section use terrain elevation as the prototypical exam-
ple of a continuous field, but all issues discussed are equally applicable to other
types of continuous fields. Nonetheless, we regularly refer to the continuous
field representation as a DEM, to conform with the most common situation.
Throughout the section we will assume that the DEM is represented as a raster.

Applications

There are numerous examples where more advanced computations on continu-


ous field representations are needed. A short list is provided below.

• Slope angle calculation The calculation of the slope steepness, expressed as


an angle in degrees or percentages, for any or all locations.

• Slope aspect calculation The calculation of the aspect (or orientation) of the
slope in degrees (between 0 and 360 degrees), for any or all locations.

• Slope convexity/concavity calculation Slope convexity—defined as the change


of the slope (negative when the slope is concave and positive when the
slope is convex)—can be derived as the second derivative of the field.

• Slope length calculation With the use of neighbourhood operations, it is pos-


sible to calculate for each cell the nearest distance to a watershed boundary
(the upslope length) and to the nearest stream (the downslope length). This
information is useful for hydrological modelling.

• Hillshading is used to portray relief difference and terrain morphology in


hilly and mountainous areas. The application of a special filter to a DEM
produces hillshading. Filters are discussed later in this section. The colour tones
in a hillshading raster represent the amount of reflected light in each loca-
tion, depending on its orientation relative to the illumination source. This
illumination source is usually chosen at an angle of 45◦ above the horizon
in the north-west.

• Three-dimensional map display With GIS software, three-dimensional views
of a DEM can be constructed, in which the location of the viewer, the angle
under which s/he is looking, the zoom angle, and the amplification fac-
tor of relief exaggeration can be specified. Three-dimensional views can
be constructed using only a predefined mesh, covering the surface, or us-
ing other rasters (e.g. a hillshading raster) or images (e.g. satellite images)
which are draped over the DEM.

• Determination of change in elevation through time The cut-and-fill volume of


soil to be removed or to be brought in to make a site ready for construction
can be computed by overlaying the DEM of the site before the work begins
with the DEM of the expected modified topography. It is also possible to
determine landslide effects by comparing DEMs of before and after the
landslide event.

• Automatic catchment delineation Catchment boundaries or drainage lines


can be automatically generated from a good quality DEM with the use
of neighbourhood functions. The system will determine the lowest point
in the DEM, which is considered the outlet of the catchment. From there,
it will repeatedly search the neighbouring pixels with the highest altitude.
This process is continued until the highest location (i.e. cell with highest
value) is found, and the path followed determines the catchment bound-
ary. For delineating the drainage network, the process is reversed. Now,
the system will work from the watershed downwards, each time looking
for the lowest neighbouring cells, which determines the direction of water
flow.

• Dynamic modelling Apart from the applications mentioned above, DEMs

are increasingly used in GIS-based dynamic modelling, such as the com-
putation of surface run-off and erosion, groundwater flow, the delineation
of areas affected by pollution, the computation of areas that will be covered
by processes such as debris flows and lava flows.

• Visibility analysis A viewshed is the area that can be ‘seen’—i.e. is in the


direct line-of-sight—from a specified target location. Visibility analysis de-
termines the area visible from a scenic lookout, the area that can be reached
by a radar antenna, or assesses how effectively a road or quarry will be
hidden from view.

Some of the more important computations mentioned above are further dis-
cussed below. All of them apply a technique known as filtering, so we will first
examine this principle in more detail.

Filtering

The principle of filtering is quite similar to that of moving window averaging,


which we discussed in Section 5.4.2. Again, we define a window and let the
GIS move it over the raster cell-by-cell. For each cell, the system performs some
computation, and assigns the result of this computation to the cell in the output
raster.3 The difference with moving window averaging is that the moving win-
dow in filtering is itself a little raster, which contains cell values that are used in
the computation for the output cell value. This little raster is a filter, also known
as a kernel, which may be square (such as a 3 × 3 kernel), but it does not have to
be. The values in the filter are used as weight factors.
As an example, let us consider a 3 × 3 cell filter, in which all values are equal
to 1, as illustrated in Figure 6.24(a). The use of this filter means that the nine
cells considered are given equal weight in the computation of the filtering step.
Let the input raster cell values, for the current filtering step, be denoted by r_ij
and the corresponding filter values by w_ij. The output value for the cell under
consideration will be computed as the sum of the weighted input values divided
by the sum of weights:

    Σ_{i,j} (w_ij · r_ij) / Σ_{i,j} |w_ij|,

where one should observe that we divide by the sum of absolute weights.
3
Please refer to Chapter Five of Principles of Remote Sensing for a discussion of image-related
filter operations.

Figure 6.24: Moving window rasters for filtering: (a) raster for a regular averaging filter; (b) raster for an x-gradient filter; (c) raster for a y-gradient filter.

(a)  1 1 1     (b)   0 0 0     (c)   0  1  0
     1 1 1          -1 0 1           0  0  0
     1 1 1           0 0 0           0 -1  0

Since the w_ij are all equal to 1 in the case of Figure 6.24(a), the formula can be
simplified to

    (1/9) Σ_{i,j} r_ij,

which is nothing but the average of the nine input raster cell values. We see, then,
that an ‘all-1’ filter computes a local average value; its application amounts to
moving window averaging. More advanced filters have been devised to extract
other types of information from raster data. We will look at some of these in the
context of slope computations.
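As an illustration, the weighted-filter formula above can be written as a small Python/NumPy function. The loop-based version below trades speed for readability and assumes the window stays inside the raster (border cells are left untouched), whereas real packages apply explicit edge rules; the small DEM is invented for illustration.

import numpy as np

def apply_filter(raster, kernel):
    """Apply a moving-window filter: sum(w * r) / sum(|w|) at each interior cell."""
    raster = np.asarray(raster, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kr, kc = kernel.shape[0] // 2, kernel.shape[1] // 2
    out = raster.copy()                      # border cells keep their input value
    norm = np.abs(kernel).sum()
    for i in range(kr, raster.shape[0] - kr):
        for j in range(kc, raster.shape[1] - kc):
            window = raster[i - kr:i + kr + 1, j - kc:j + kc + 1]
            out[i, j] = (kernel * window).sum() / norm
    return out

dem = np.array([[10, 12, 13, 15],
                [11, 12, 14, 16],
                [12, 13, 15, 17],
                [13, 14, 16, 18]], dtype=float)

averaging = np.ones((3, 3))                                 # the 'all-1' filter of Figure 6.24(a)
x_gradient = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]])   # Figure 6.24(b)

print(apply_filter(dem, averaging))
print(apply_filter(dem, x_gradient))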

Computation of slope angle and slope aspect

A different choice of weight factors may provide other information. Special fil-
ters exist to perform computations on the slope of the terrain. Before we look at
these filters, let us define various notions of slope.

Figure 6.25: Slope angle defined. Here, δp stands for length in the horizontal plane, δf stands for the change in field value, where the field usually is terrain elevation. The slope angle is α.

Slope angle, which is also known as slope gradient, is the angle α, illustrated in
Figure 6.25, between a path p in the horizontal plane and the sloping terrain.
The path p must be chosen such that the angle α is maximal. A slope angle can
be expressed as elevation gain in a percentage or as a geometric angle, in degrees
or radians. The two respective formulas are:

    slope perc = 100 · δf/δp        and        slope angle = arctan(δf/δp).

The path p must be chosen to provide the highest slope angle value, and thus
it can lie in any direction. The compass direction, converted to an angle with
the North, of this maximal down-slope path p is what we call the slope aspect.

Let us now look at how to compute slope angle and slope aspect in a raster
environment.
From an elevation raster, we cannot ‘read’ the slope angle or slope aspect di-
rectly. Yet, that information can be extracted. After all, for an arbitrary cell, we
have its elevation value, plus those of its eight neighbour cells. A simple ap-
proach to slope angle computation is to make use of x-gradient and y-gradient
filters. Figure 6.24(b) and (c) illustrate an x-gradient filter, and y-gradient filter,
respectively. The x-gradient filter determines the slope increase ratio from west
to east: if the elevation to the west of the centre cell is 1540 m and that to the
east of the centre cell is 1552 m, then apparently along this transect the elevation
increases 12 m per two cell widths, i.e. the x-gradient is 6 m per cell width. The
y-gradient filter operates entirely analogously, though in south-north direction.
Observe that both filters express elevation gain per cell width. This means that
we must divide by the cell width—given in metres, for example—to obtain
(approximations to) the true derivatives δf /δx and δf /δy. Here, f stands for the
elevation field as a function of x and y, and δf /δx, for instance, is the elevation
gain per unit of length in the x-direction.
To obtain the real slope angle α along path p, observe that both the x- and y-
gradient contribute to it. This is illustrated in Figure 6.26. A not-so-simple
geometric derivation shows that always

    tan(α) = √((δf/δx)² + (δf/δy)²).

Now what does this mean in the practice of computing local slope angles from
an elevation raster? It means that we must perform the following steps:

1. Compute from (input) elevation raster R the non-normalized x- and y-
gradients, using the filters of Figure 6.24(b) and (c), respectively.

2. Normalize the resulting rasters by dividing by the cell width, expressed in


units of length like metres.

3. Use both rasters for generating a third raster, applying the tan(α) formula
above, possibly even applying an arctan function to the result to obtain the
slope angle α for each cell.

It can also be shown that for the slope aspect ψ we have

    tan(ψ) = (δf/δx) / (δf/δy),

Figure 6.26: Slope angle and slope aspect defined. Here, p is the horizontal path in the maximal slope direction and α is the slope angle. The plane tangent to the terrain in the origin is also indicated. The angle ψ is the slope aspect. See the text for further explanation.

so slope aspect can also be computed from the normalized gradients. We must
warn the reader that this formula should not trivially be replaced by using

    ψ = arctan((δf/δx) / (δf/δy)),

the reason being that the latter formula does not account for southeast and
southwest quadrants, nor for cases where δf /δy = 0. (In the first situation,
one must add 180◦ to the computed angle to obtain an angle measured from
North; in the latter situation, ψ equals either 90◦ or −90◦ , depending on the sign
of δf /δx.)
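The full recipe, from gradient filters to slope angle and aspect, fits in a few lines of Python/NumPy. The sketch below uses the x- and y-gradient filters of Figure 6.24, normalizes by the cell width, and uses arctan2 so that all four quadrants and the δf/δy = 0 case are handled at once; the tiny DEM, the 10 m cell size, and the convention of measuring the downslope azimuth clockwise from North are assumptions of this sketch.

import numpy as np

cell_size = 10.0                         # metres per cell (illustrative)
dem = np.array([[100, 104, 109],
                [101, 106, 112],
                [103, 109, 116]], dtype=float)

# Gradients of the centre cell via the filters of Figure 6.24(b) and (c),
# expressed per cell width and then normalized by the cell width in metres.
dfdx = (dem[1, 2] - dem[1, 0]) / 2.0 / cell_size   # west-to-east gradient
dfdy = (dem[0, 1] - dem[2, 1]) / 2.0 / cell_size   # south-to-north gradient

slope_angle = np.degrees(np.arctan(np.hypot(dfdx, dfdy)))

# arctan2 handles all quadrants; the expression below gives the downslope
# direction as an azimuth measured clockwise from North.
aspect = np.degrees(np.arctan2(-dfdx, -dfdy)) % 360

print(f"slope angle: {slope_angle:.1f} degrees, aspect: {aspect:.1f} degrees")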

6.5 Network analysis

A completely different set of analytical functions in GIS consists of computations


on networks. A network is a connected set of lines, representing some geographic
phenomenon, typically of the transportation type. The ‘goods’ transported can
be almost anything: people, cars and other vehicles along a road network, com-
mercial goods along a logistic network, phone calls along a telephone network,
or water pollution along a stream/river network.
Network analysis can be performed on either raster or vector data layers, but
they are more commonly done in the latter, as line features can be associated
with a network, and hence can be assigned typical transportation characteristics
such as capacity and cost per unit. A fundamental characteristic of any network
is whether the network lines are considered directed or not. Directed networks
associate with each line a direction of transportation; undirected networks do not.
In the latter, the ‘goods’ can be transported along a line in both directions. We
discuss here vector network analysis, and assume that the network is a set of
connected line features that intersect only at the lines’ nodes, not at internal ver-
tices. (But we do mention under- and overpasses.)
For many applications of network analysis, a planar network, i.e. one that can
be embedded in a two-dimensional plane, will do the job. Many networks are
naturally planar, like stream/river networks. A large-scale traffic network, on
the other hand, is not planar: motorways have multi-level crossings and are con-
structed with underpasses and overpasses. Planar networks are easier to deal
with computationally, as they have simpler topological rules.
Not all GISs accommodate non-planar networks, or can do so only using ‘tricks’.

These may involve the splitting of overpassing lines at the intersection vertex
and the creation of four lines out of the two original lines. Without further at-
tention, the network will then allow one to make a turn onto another line at this
new intersection node, which in reality would be impossible. In some GISs we
can associate a cost with turning at a node—see our discussion on turning costs
below—and that cost, in the case of the overpass, can be made infinite to en-
sure it is prohibited. But, as mentioned, this is a workaround to fit a non-planar
situation into a data layer that presumes planarity.
The above is a good illustration of geometry not fully determining the network’s
behaviour. Additional application-specific rules are usually required to define
what can and cannot happen in the network. Most GISs provide rule-based
tools that allow the definition of these extra application rules.
Various classical spatial analysis functions on networks are supported by GIS
software packages. The most important ones are:

1. Optimal path finding, which generates a least-cost path on a network be-


tween a pair of predefined locations using both geometric and attribute
data.

2. Network partitioning, which assigns network elements (nodes or line seg-


ments) to different locations using predefined criteria.

We discuss these two typical functions in the sections below.

Optimal path finding

Optimal path finding techniques are used when a least-cost path between two
nodes in a network must be found. The two nodes are called origin and desti-
nation, respectively. The aim is to find a sequence of connected lines to traverse
from the origin to the destination at the lowest possible cost.
The cost function can be simple: for instance, it can be defined as the total length
of all lines on the path. The cost function can also be more elaborate and take into
account not only length of the lines, but also their capacity, maximum transmis-
sion (travel) rate and other line characteristics, for instance to obtain a reasonable
approximation of travel time. There can even be cases in which the nodes visited
add to the cost of the path as well. These may be called turning costs, which are
defined in a separate turning cost table for each node, indicating the cost of turn-
ing at the node when entering from one line and continuing on another. This is
illustrated in Figure 6.27.

Figure 6.27: Network neighbourhood of node N with associated turning costs at N. Turning at N onto c is prohibited because of direction, so no costs are mentioned for turning onto c. A turning cost of infinity (∞) means that the turn is also prohibited. The turning cost table for N is:

from    onto    cost
a       a       ∞
a       b       5
b       a       8
b       b       15
c       a       3
c       b       6

The attentive reader will notice that it is possible to travel on line b in Figure 6.27,

then take a U-turn at node N , and return along a to where one came from. The
question is whether doing this makes sense in optimal path finding. After all,
to go back to where one comes from will only increase the total cost. In fact,
there are situations where it is optimal to do so. Suppose it is node M that
is connected by line b with node N , and that we actually wanted to travel to
another node L from M. The turn at M towards node L when coming via another line
may be prohibitively expensive, whereas turning towards L at M after returning to
M via the U-turn at N may not be so expensive.
Problems related to optimal path finding are ordered optimal path finding and
unordered optimal path finding. Both have an extra requirement that a num-
ber of additional nodes needs to be visited along the path. In ordered optimal
path finding, the sequence in which these extra nodes are visited matters; in
unordered optimal path finding it does not. An illustration of both types is pro-
vided in Figure 6.28. Here, a path is found from node A to node D, visiting nodes
B and C. Obviously, the length of the path found under non-ordered require-
ments is at most as long as the one found under ordered requirements. Some
GISs provide support for these more complicated path finding problems.
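Outside a GIS, the basic (unconstrained) optimal path problem is a standard shortest-path computation on a weighted graph. The sketch below uses Python with the networkx package on a made-up five-node network whose edge weights stand for travel cost; turning costs and the ordered/unordered variants would need additional modelling on top of this.

import networkx as nx

# A small, invented road network: nodes A..E, edge weights are travel costs.
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 4), ("A", "C", 2), ("B", "C", 1),
    ("B", "D", 5), ("C", "D", 8), ("C", "E", 10), ("D", "E", 2),
])

path = nx.shortest_path(G, source="A", target="E", weight="weight")
cost = nx.shortest_path_length(G, source="A", target="E", weight="weight")
print(path, cost)   # ['A', 'C', 'B', 'D', 'E'] with total cost 10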


Figure 6.28: Ordered (a) and unordered (b) optimal path finding. In both cases, a path had to be found from A to D: in (a) by visiting B and then C, in (b) by visiting both nodes as well, but in arbitrary order.

Network partitioning

In network partitioning, the purpose is to assign lines and/or nodes of the net-
work, in a mutually exclusive way, to a number of target locations. Typically,
the target locations play the role of service centre for the network. This may be
any type of service: medical treatment, education, water supply. This type of
network partitioning is known as a network allocation problem.
Another problem is trace analysis. Here, one wants to determine that part of the
network that is upstream (or downstream) from a given target location. Such
problems exist in pollution tracing along river/stream systems, but also in net-
work failure chasing in energy distribution networks.

Network allocation In network allocation, we have a number of target loca-


tions that function as resource centres, and the problem is which part of the net-
work to exclusively assign to which service centre. This may sound like a simple
allocation problem, in which a service centre is assigned those line (segments)
to which it is nearest, but usually the problem statement is more complicated.
These further complications stem from the requirements to take into account

• The capacity with which a centre can produce the resources (whether they
are medical operations, school pupil positions, kilowatts, or bottles of milk),
and

• The consumption of the resources, which may vary amongst lines or line seg-
ments. After all, some streets have more accidents, more children who

live there, more industry in high demand of electricity or just more thirsty
workers.

Figure 6.29: Network allocation on a pupil/school assignment problem. In (a), the street segments within 2 km of the school are identified; in (b), the selection of (a) is further restricted to accommodate the school's capacity for the new year.

The service area of any centre is a subset of the distribution network, in fact, a
connected part of the network. Various techniques exist to assign network lines,
or their segments, to a centre. In Figure 6.29(a), the green star indicates a pri-
mary school and the GIS has been used to assign streets and street segments
to the closest school within 2 km distance, along the network. Then, using de-
mographic figures of pupils living along the streets, it was determined that too
many potential pupils lived in the area for the school’s capacity. So in part (b),
the already selected part of the network was reduced to accommodate precisely
the school’s pupil capacity for the new year.

Trace analysis Trace analysis is performed when we want to understand which


part of a network is ‘conditionally connected’ to a chosen node on the network,

known as the trace origin. For a node or line to be conditionally connected, it
means that a path exists from the node/line to the trace origin, and that the
connecting path fulfills the conditions set. What these conditions are depends
on the application, and they may involve direction of the path, capacity, length,
or resource consumption along it. The condition typically is a logical expression,
as we have seen before, for instance:

• The path must be directed from the node/line to the trace origin,

• Its capacity (defined as the minimum capacity of the lines that constitute
the path) must be above a given threshold, and

• The path’s length must not exceed a given maximum length.

Tracing is the computation that the GIS performs to find the paths from the trace
origin that obey the tracing conditions. It is a rather useful function for many
network-related problems.
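A bare-bones upstream trace can be written as a graph traversal that only follows edges in the right direction and respects a capacity threshold and a maximum path length. The Python sketch below works on an invented directed network stored as an edge list; it is a simplification of what GIS tracing tools do, not a reproduction of any particular implementation.

from collections import deque

# Invented stream network: directed edges (from_node, to_node, capacity, length).
edges = [
    ("A", "C", 5, 1.0), ("B", "C", 2, 1.5), ("C", "E", 5, 2.0),
    ("D", "E", 4, 1.0), ("E", "F", 6, 1.0),
]

def trace_upstream(origin, min_capacity=3, max_length=5.0):
    """Return nodes from which a directed path reaches 'origin' under the conditions."""
    # Index incoming edges per node so we can walk against the flow direction.
    incoming = {}
    for frm, to, cap, length in edges:
        incoming.setdefault(to, []).append((frm, cap, length))

    reached = set()
    queue = deque([(origin, 0.0)])          # (node, accumulated path length)
    while queue:
        node, dist = queue.popleft()
        for frm, cap, length in incoming.get(node, []):
            if cap >= min_capacity and dist + length <= max_length and frm not in reached:
                reached.add(frm)
                queue.append((frm, dist + length))
    return reached

print(trace_upstream("F"))   # nodes reached: {'E', 'C', 'D', 'A'}; 'B' is excluded by its low capacity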

Figure 6.30: Tracing functions on a network: (a) tracing upstream, (b) tracing downstream, (c) tracing without conditions on direction.

In Figure 6.30 our trace origin is indicated in red. In part (a), the tracing condi-
tions were set to trace all the way upstream; part (b) traces all the way down-
stream, and in part (c) there are no conditions on direction of the path, thereby
tracing all connected lines from the trace origin. More complex conditions are
certainly possible in tracing.

6.6 GIS and application models

We have discussed the notion that real world processes are often highly complex.
Models are simplified abstractions of reality representing or describing its most
important elements and their interactions. Modelling and GIS are more or less
inseparable, as GIS is itself a tool for modelling ‘the real world’ (or at least some
part of it).
The solution to a (spatial) problem usually depends on a (large) number of pa-
rameters. Since these parameters are often interrelated, their interaction is made
more precise in an application model.
Here we define application models to include any kind of GIS based model (in-
cluding so-called analytical and process models) for a specific real-world appli-
cation. Such a model, in one way or other, describes as faithfully as possible
how the relevant geographic phenomena behave, and it does so in terms of the
parameters.
The nature of application models varies enormously. GIS applications for famine
relief programs, for instance, are very different from earthquake risk assessment
applications, though both can make use of GIS to derive a solution. Many kinds
of application models exist, and they can be classified in many different ways.
Here we identify five characteristics of GIS-based application models:

1. The purpose of the model,


2. The methodology underlying the model,
3. The scale at which the model works,

4. Its dimensionality - i.e. whether the model includes spatial, temporal or spa-
tial and temporal dimensions, and

5. Its implementation logic - i.e. the extent to which the model uses existing
knowledge about the implementation context.

It is important to note that the categories above are merely different characteristics
of any given application model. Any model can be described according to these
characteristics. Each is briefly discussed below.

Purpose of the model refers to whether the model is descriptive, prescriptive


or predictive in nature. Descriptive models attempt to answer the “what is”
question. Prescriptive models usually answer the “what should be” question by
determining the best solution from a given set of conditions.

Models for planning and site selection are usually prescriptive, in that they
quantify environmental, economic and social factors to determine ‘best’ or op-
timal locations. So-called Predictive models focus upon the “what is likely to be”
questions, and predict outcomes based upon a set of input conditions. Exam-
ples of predictive models include forecasting models, such as those attempting
to predict landslides or sea–level rise.

Methodology refers to the operational components of the model. Stochastic


models use statistical or probability functions to represent random or semi-ran-
dom behaviour of phenomena. In contrast, deterministic models are based upon
a well-defined cause and effect relationship. Examples of deterministic models

include hydrological flow and pollution models, where the ‘effect’ can often be
described by numerical methods and differential equations.
Rule-based models attempt to model processes by using local (spatial) rules. Cel-
lular Automata (CA) are examples of models in this category. These are often
used to understand systems which are generally not well understood, but for
which their local processes are well known. For example, the characteristics of
neighbourhood cells (such as wind direction and vegetation type) in a raster-
based CA model might be used to model the direction of spread of a fire over
several time steps.
Agent-based models (ABM) attempt to model movement and development of mul-
tiple interacting agents (which might represent individuals), often using sets of
decision-rules about what the agent can and cannot do. Complex agent-based
models have been developed to understand aspects of travel behaviour and
crowd interactions which also incorporate stochastic components.
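To make the rule-based idea concrete, here is a deliberately tiny cellular-automaton sketch in Python/NumPy for fire spread on a raster: each time step, a burning cell may ignite its burnable four-neighbours, with a wind bias expressed as a simple per-direction probability. The grid, rules and probabilities are all invented for illustration and are far simpler than an operational fire model.

import numpy as np

rng = np.random.default_rng(42)
EMPTY, FUEL, BURNING = 0, 1, 2

grid = np.full((8, 8), FUEL)
grid[4, 4] = BURNING                      # ignition point

# Probability that fire jumps to a neighbour, per direction (invented values):
# the wind blows towards the east, so eastward spread is more likely.
p_spread = {(0, 1): 0.9, (0, -1): 0.3, (1, 0): 0.5, (-1, 0): 0.5}

for step in range(5):
    new_grid = grid.copy()
    burning_cells = np.argwhere(grid == BURNING)
    for r, c in burning_cells:
        for (dr, dc), p in p_spread.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]:
                if grid[nr, nc] == FUEL and rng.random() < p:
                    new_grid[nr, nc] = BURNING
    grid = new_grid

print(grid)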

Scale refers to whether the components of the model are individual or aggre-
gate in nature. Essentially this refers to the ‘level’ at which the model operates.
Individual-based models are based on individual entities, such as the agent-based
models described above, whereas aggregate models deal with ‘grouped’ data,
such as population census data. Aggregate models may operate on data at the
level of a city block (for example, using population census data for particular
social groups), at the regional, or even at a global scale.

Dimensionality is the term chosen to refer to whether a model is static or dy-


namic, and spatial or aspatial. Some models are explicitly spatial, meaning they

operate in some geographically defined space. Some models are aspatial, mean-
ing they have no direct spatial reference.
Models can also be static, meaning they do not incorporate a notion of time or
change. In dynamic models, time is an essential parameter (see Section 2.5). Dy-
namic models include various types of models referred to as process models or
simulations. These types of models aim to generate future scenarios from ex-
isting scenarios, and might include deterministic or stochastic components, or
some kind of local rule (for example, to drive a simulation of urban growth and
spread). The fire spread example given above is a good example of an explicitly
spatial, dynamic model which might incorporate both local rules and stochastic
components.

Implementation logic refers to how the model uses existing theory or knowl-
edge to create new knowledge. Deductive approaches use knowledge of the over-
all situation in order to predict outcome conditions. This includes models that
have some kind of formalized set of criteria, often with known weightings for
the inputs, and existing algorithms are used to derive outcomes. Inductive ap-
proaches, on the other hand, are less straightforward, in that they try to gener-
alize (often based upon samples of a specific data set) in order to derive more
general models. While an inductive approach is useful if we do not know the
general conditions or rules which apply in a given domain, it is typically a trial-
and-error approach which requires empirical testing to determine the parame-
ters of each input variable.
Most GIS only come equipped with a limited range of tools for modelling. For
complex models, or functions which are not natively supported in our GIS, exter-

nal software environments are frequently used. In some cases, GIS and models
can be fully integrated (known as embedded coupling) or linked through data and
interface (known as tight coupling). If neither of these is possible, the external
model might be run independently of our GIS, and the output exported from
our model into the GIS for further analysis and visualization. This is known as
loose coupling.
It is important to compare our model results with previous experiments and to
examine the possible causes of inconsistency between the output of our models
and the expected results. The following section discusses these aspects further.

6.7 Error propagation in spatial data processing

6.7.1 How errors propagate

In Section 5.2, we discussed a number of sources of error that may be present


in source data. It is important to note that the acquisition of base data to a high
standard of quality still does not guarantee that the results of further, complex
processing can be treated with certainty. As the number of processing steps
increases, it becomes difficult to predict the behaviour of error propagation. These
various errors may affect the outcome of spatial data manipulations. In addition,
further errors may be introduced during the various processing steps discussed
earlier in this chapter, as illustrated in Figure 6.31.

Figure 6.31: Error propagation in spatial data handling. The diagram links the real world (and the gap between reality and a model of it), spatial data acquisition (source error), spatial data sets, spatial data analysis and modelling (process error), produced geoinformation, human decision-making (use error), and planning and management.

One of the most commonly applied operations in geographic information sys-


tems is analysis by overlaying two or more spatial data layers. As discussed
above, each such layer will contain errors, due to both inherent inaccuracies in

the source data and errors arising from some form of computer processing, for
example, rasterization. During the process of spatial overlay, all the errors in the
individual data layers contribute to the final error of the output. The amount of
error in the output depends on the type of overlay operation applied. For exam-
ple, errors in the results of overlay using the logical operator AND are not the
same as those created using the OR operator.
Table 6.2 lists common sources of error introduced into GIS analyses. Note that
these are from a wide range of sources, and include various common tasks relat-
ing to both data preparation and data analysis. It is the combination of different
errors that are generated at each stage of preparation and analysis which may
bring about various errors and uncertainties in the eventual outputs.
Consider another example. A land use planning agency is faced with the prob-
lem of identifying areas of agricultural land that are highly susceptible to ero-
sion. Such areas occur on steep slopes in areas of high rainfall. The spatial data
used in a GIS to obtain this information might include:

• A land use map produced five years previously from 1 : 25,000 scale aerial
photographs,

• A DEM produced by interpolating contours from a 1 : 50,000 scale topo-


graphic map, and

• Annual rainfall statistics collected at two rainfall gauges.

The reader is invited to assess what sort of errors are likely to occur in this anal-
ysis.

Referring back to Figure 6.31, the reader is also encouraged to reflect on errors in-
troduced in components of the application models discussed in the previous section,
specifically the methodological aspects of representing geographic phenomena.
What might be the consequences of using a random function in an urban trans-
portation model (when, in fact, travel behaviour is not purely random)?

Table 6.2: Some of the most common causes of error in spatial data handling. Source: Hunter & Beard [23].

Coordinate adjustments: rubber sheeting/transformations, projection changes, datum conversions, rescaling.
Generalization: linear alignment, line simplification, addition/deletion of vertices, linear displacement.
Feature editing: line snapping, extension of lines to intersection, reshaping, moving/copying, elimination of spurious polygons.
Raster/vector conversions: raster cells to polygons, polygons to raster cells, assignment of point attributes to raster cells, post-scanner line thinning.
Attribute editing: numeric calculation and change, text value changes/substitution, re-definition of attributes, attribute value update.
Data input and management: digitizing, scanning, topological construction/spatial indexing, dissolving polygons with same attributes.
Boolean operations: polygon on polygon, polygon on line, polygon on point, line on line, overlay and erase/update.
Surface modelling: contour/lattice generation, TIN formation, draping of data sets, cross-section/profile generation, slope/aspect determination.
Display and analysis: cluster analysis, calculation of surface lengths, shortest route/path computation, buffer creation, display and query, adjacency/contiguity, class intervals choice, areal interpolation, perimeter/area size/volume computation, distance computation, spatial statistics, label/text placement.

6.7.2 Quantifying error propagation

Chrisman [13] noted that “the ultimate arbiter of cartographic error is the real
world, not a mathematical formulation”. It is an unavoidable fact that we will
never be able to capture and represent everything that happens in the real world
perfectly in a GIS. Hence there is much to recommend the use of testing proce-
dures for accuracy assessment.
Various perspectives, motives and approaches to dealing with uncertainty have
given rise to a wide range of conceptual models and indices for the description
and measurement of error in spatial data. All these approaches have their ori-
gins in academic research and have strong theoretical bases in mathematics and
statistics. Here we identify two main approaches for assessing the nature and
amount of error propagation:

1. Testing the accuracy of each state by measurement against the real world, and

2. Modelling error propagation, either analytically or by means of simulation


techniques.

Modelling of error propagation has been defined by Veregin [56] as: “the ap-
plication of formal mathematical models that describe the mechanisms whereby
errors in source data layers are modified by particular data transformation op-
erations.” In other words, we would like to know how errors in the source data
behave under manipulations that we subject them to in a GIS. If we are able
to quantify the error in the source data as well as their behaviour under GIS
manipulations, we have a means of judging the uncertainty of the results.
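One common simulation approach is a Monte Carlo experiment: perturb the input data according to its assumed error model, rerun the analysis many times, and inspect the spread of the outputs. The Python/NumPy sketch below does this for a deliberately simple 'analysis' (the slope between two elevation points) with an assumed 0.5 m standard deviation of elevation error; every number in it is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)

# 'True' inputs and an assumed error model (standard deviation of 0.5 m).
elev_a, elev_b = 120.0, 132.0      # elevations at two points, in metres
distance = 200.0                   # horizontal distance between them, in metres
sigma = 0.5

def slope_percent(a, b, d):
    """The analysis step whose error behaviour we want to study."""
    return 100.0 * (b - a) / d

runs = 10_000
results = np.empty(runs)
for i in range(runs):
    a = elev_a + rng.normal(0.0, sigma)    # perturb inputs with the error model
    b = elev_b + rng.normal(0.0, sigma)
    results[i] = slope_percent(a, b, distance)

print(f"mean slope: {results.mean():.2f} %, std: {results.std():.2f} %")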

Error propagation models are very complex and valid only for certain data types
(e.g. numerical attributes). Initially, they described only the propagation of at-
tribute error [21, 56]. More recent research has addressed the spatial aspects of
error propagation and the development of models incorporating both attribute
and locational components. These topics are outside the scope of this book, and
readers are referred to [2, 27] for more detailed discussions. Rather than explic-
itly modelling error propagation, it is often more practical to test the results of each
step in the process against some independently measured reference data.

Summary

This chapter has examined various ways of manipulating both raster and vector
based spatial data sets. It is certainly true that some types of manipulations are
better accommodated in one, and not so well in the other. Usually, one chooses
the format to work with on the basis of many more parameters, including the
availability of source data.
We have identified several classes of data manipulations or functions. The first
of these does not generate new spatial data, but rather extracts—i.e. ‘makes
visible’—information from existing data sets. Amongst these are the measurement
functions. These allow us to determine scalar values such as length, distance,
and area size of selected features. Spatial selections allow us to selectively
identify features on the basis of conditions, which may be spatial in character.
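To make the measurement functions concrete, here is a minimal sketch, with invented coordinates, that computes the perimeter and area of a simple polygon; the area uses the standard shoelace formula.

```python
import math

# Hypothetical polygon, stored as a closed ring of (x, y) vertices
ring = [(0.0, 0.0), (40.0, 0.0), (40.0, 30.0), (0.0, 30.0), (0.0, 0.0)]

# Perimeter: sum of the lengths of consecutive edges
perimeter = sum(math.dist(p, q) for p, q in zip(ring, ring[1:]))

# Area: shoelace formula over consecutive vertex pairs of the closed ring
area = abs(sum(x1 * y2 - x2 * y1
               for (x1, y1), (x2, y2) in zip(ring, ring[1:]))) / 2.0

print(perimeter, area)   # 140.0 and 1200.0 for this 40 x 30 rectangle
```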
A second class of spatial data manipulations generates new spatial data sets.
Classification functions assign a new characteristic value to each feature in a
set of (previously selected) features. Spatial overlay functions go a step
further and combine two spatial data sets by location. What is produced as an
output spatial data set depends on user requirements and the data format with
which one works. Most of the vector spatial overlays are based on
polygon/polygon intersection, or polygon/line intersections. In the raster
domain, we have seen the powerful tool of raster calculus, which allows all
sorts of spatial overlay conditions and output expressions, all based on
cell-by-cell comparisons and computations.
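The cell-by-cell character of raster calculus can be sketched with plain array arithmetic; the two input layers, the class codes and the elevation threshold below are all invented for the illustration.

```python
import numpy as np

# Hypothetical input rasters on the same grid
elevation = np.array([[120, 135, 150],
                      [110, 125, 140],
                      [100, 115, 130]])
landuse = np.array([[1, 1, 2],
                    [2, 1, 2],
                    [3, 3, 1]])   # 1 = forest, 2 = agriculture, 3 = urban

# A raster-calculus style expression, evaluated cell by cell:
# "forest above 120 m" becomes a Boolean output raster
suitable = (landuse == 1) & (elevation > 120)
print(suitable.astype(int))
```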
Going beyond spatial overlays are the neighbourhood functions. Their principle
is not comparison of equal locations; instead they focus on the definition of
the vicinity of one or more features. This is useful for applications that
attempt to assess the effect of some phenomenon on its environment. The simplest
neighbourhood functions are insensitive to direction, i.e. they treat all
directions equally. Good examples are buffer computations on vector data. More
advanced neighbourhood functions take local context into account, and therefore
are sensitive to direction. Since such local factors are more easily represented
in raster data, raster is then the preferred format. Flow and diffusion
functions are examples.
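A direction-insensitive neighbourhood function can be sketched as a raster buffer: compute each cell's distance to a source cell and keep those within a threshold. The grid size, cell size and 20 m threshold are invented for the example.

```python
import numpy as np

# Hypothetical 5 x 5 raster with a single source cell; cell size 10 m
rows, cols = 5, 5
src_r, src_c = 2, 2
cell_size = 10.0

# Euclidean distance from every cell centre to the source cell centre
rr, cc = np.indices((rows, cols))
dist = np.hypot(rr - src_r, cc - src_c) * cell_size

# Buffer zone: all cells whose centres lie within 20 m of the source
buffer_zone = dist <= 20.0
print(buffer_zone.astype(int))
```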
We also looked at a special type of spatial data, namely (line) networks, and
the functions that are used on these. Optimal path finding is one such function,
useful in routing problems. The use of this function can be constrained or
unconstrained. Another function often used on networks is network partitioning:
how to assign respective parts of the network to resource locations.
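Optimal path finding can be sketched with Dijkstra's algorithm on a small, invented road network; the node names and edge costs are purely illustrative.

```python
import heapq

# Hypothetical network: node -> list of (neighbour, travel cost) pairs
network = {
    "A": [("B", 4), ("C", 2)],
    "B": [("A", 4), ("C", 1), ("D", 5)],
    "C": [("A", 2), ("B", 1), ("D", 8)],
    "D": [("B", 5), ("C", 8)],
}

def shortest_cost(graph, start, goal):
    """Dijkstra: repeatedly settle the cheapest reachable node."""
    queue, settled = [(0, start)], set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph[node]:
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour))
    return None   # goal not reachable

print(shortest_cost(network, "A", "D"))   # 8, via A - C - B - D
```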
Various combinations of the analytical functions discussed above can be used in
an application model to simulate a given geographical process or phenomenon. The
output generated by these models can then be used in various ways, including
decision support and planning. Many different kinds of models exist, and the
type of model used will depend on the process or phenomenon under study, the
nature of the data, and the type of output desired from the model.
The final section of this chapter discussed the issue of error propagation. It
was noted that at each stage of working with spatial data, errors can be
introduced which can propagate through the different operations. These errors
can range from simple mistakes in data entry through to inappropriate estimation
techniques or functions in operational models, and can serve to degrade the
‘end result’ of our analyses significantly.

Questions

1. On page 352, we discussed the measurement function of distance between
vector features. Draw six diagrams, each of which contains two arbitrary
vector features, being either a point, a polyline, or a polygon. Then,
indicate the minimal distance, and provide a short description of how this
could have been computed.

2. On page 352, we mentioned that two polygons can only intersect when their
minimal bounding boxes overlap. Provide a counter-example to the converse
statement; in other words, show that even if their minimal bounding boxes
overlap, the two polygons may still not intersect (or meet, or have one
contained in the other).

3. In Figure 6.11 we provided an example of automatic classification. Rework
the example and show what the results would be for three (instead of five)
classes, both with equal interval classification and equal frequency
classification.

4. In Figure 6.9, we provided a classification of average household income per
ward in the city of Dar es Salaam. Provide a (spatial) interpretation of
that figure.

5. Observe that the equal frequency technique applied to the raster of
Figure 6.11 does not really produce categories with equal frequencies.
Explain why this is. Would we expect a better result if our raster had been
5,000 × 5,000 cells?


6. When discussing vector overlay operators, we observed that the one
fundamental operator was polygon intersection, and that other operators were
expressible in terms of it. The example we gave showed this for polygon
overwrite. Draw up a series of sketches that illustrates the procedure.
Then, show how polygon clipping can be expressed in terms of intersection,
and illustrate this too.

7. Argue why diffusion computations are much more naturally supported by
raster data than by vector data.

8. In Figure 6.22(b), each cell was assigned the minimum total resistance of
a path from the source location to that cell. Verify the two values of 14.50
and 14.95 in the top left cells by doing the necessary computations.

9. In Figure 6.23, we illustrated drainage pattern computations on the basis
of an elevation raster. Pick two arbitrary cells, and determine how water
from those cells will flow through the area described by the raster. Which
raster cell can be called the ‘water sink’ of the area?

10. In Section 6.4.4, we more or less tacitly assumed throughout that we were
operating on elevation rasters. All the techniques discussed, however, apply
equally well to other continuous field rasters, for instance for NDVI,
population density, or groundwater salinity. Explain what slope angle and
slope aspect computations mean for such fields.
