Ontologies and Big
Data Considerations for
Effective Intelligence
Joan Lu
University of Huddersfield, UK
Qiang Xu
University of Huddersfield, UK
A volume in the Advances in Information Quality
and Management (AIQM) Book Series
Published in the United States of America by
IGI Global
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail:
[email protected]
Web site: http://www.igi-global.com
Copyright © 2017 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
CIP Data Pending
ISBN: 978-1-5225-2058-0
eISBN: 978-1-5225-2059-7
This book is published in the IGI Global book series Advances in Information Quality and Management (AIQM) (ISSN:
2331-7701; eISSN: 2331-771X)
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
For electronic access to this publication, please contact:
[email protected].
Chapter 3
Spatial Databases:
An Overview
Grace L. Samson
University of Huddersfield, UK
Mistura M. Usman
University of Abuja, Nigeria
Joan Lu
University of Huddersfield, UK
Qiang Xu
University of Huddersfield, UK
ABSTRACT
Spatial databases maintain spatial information, which makes them appropriate for applications that need
to monitor the position of an object or event over space. They describe the fundamental
representation of the objects of a dataset that comes from spatial or geographic entities. A spatial database supports aspects of space and offers spatial data types in its data model and query language.
The spatial or geographic referencing attributes of the objects in a spatial database permit them to be
positioned within a two-dimensional (2D) or three-dimensional (3D) space. This chapter looks into the fundamentals of spatial databases and describes their basic components, operations and architecture. The
study focuses on the data models, query language, query processing, indexes and query optimization
of spatial databases, which together establish spatial databases as a necessary tool for the storage and retrieval
of multidimensional data in high-dimensional spaces.
INTRODUCTION
The extensive and increasing availability of data collected from geographical information system devices
and technology has made it excessively difficult to manage this information using existing spatial
database methods, and this has led to research advances in behavioural aspects of monitored subjects.
Geographic information systems (GIS) can efficiently handle all the major tasks of information extraction from large datasets (data input and verification, storage and manipulation, output and presentation,
data transformation and even interaction with end users). This signifies that
geographic information systems are complete database management systems which can handle all the tasks
mentioned above (Rigaux et al., 2003). Notwithstanding, depending on the purpose of the application
under a user's consideration, the objects in these (GIS) databases must be properly modelled for real
DOI: 10.4018/978-1-5225-2058-0.ch003
Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
world data simplification (i.e. simplifying real-world data so as to create an actual prototype of it) in
order to enhance the performance of the database. To achieve this, data analysts constantly seek an
appropriate data structure that efficiently stores the objects' data in the database and allows for better
database management. Spatial databases maintain spatial information, which makes them appropriate for applications
that need to monitor the position of an object or event over space. They describe
the fundamental representation of the objects of a dataset that comes from spatial or geographic entities.
A spatial database supports aspects of space and offers spatial data types in its data model and query
language. The spatial or geographic referencing attributes of the objects in a spatial database permit
them to be positioned within a two-dimensional (2D) or three-dimensional (3D) space. Unlike a classical
database, a spatial database does not query data on their attributes alone; it can also
query data elements with respect to their locations. Spatial databases are built to complement the classical database using some defined architecture (Figure 2a and b). This chapter looks into
the fundamentals of spatial databases and describes their basic components, operations and architecture.
Figure 1 shows a typical classical database system environment.
BACKGROUND
Spatial Database Management System Architecture
According to Ester et al. (1999), Spatial Database Systems (SDBS) are relational databases plus a concept of spatial location and spatial extension, and the explicit location and extension of objects define
implicit relations of spatial neighbourhood. Ester et al. (1997) argued that the efficiency of many knowledge discovery in databases (KDD)
algorithms for SDBS depends heavily on efficient processing of these neighbourhood relationships,
since the neighbours of many objects have to be investigated in a single run of a KDD algorithm. The
DBMS architecture is frequently employed in the construction of spatial databases. Nevertheless, while
typical databases can understand various numeric and character types of data, according to Shekhar (1999)
additional functionality needs to be added for them to process spatial data types, which are typically called
geometry or features. Figure 2 shows three basic architectures for designing a spatial database management system; two of them, according to Güting (1994), are the layered and the dual architecture. In the layered architecture
Figure 1. Diagram showing a typical environment of database management
Figure 2. The diagram of spatial database architecture (a) the layered architecture (b) the dual architecture (Güting, 1994); (c) the integrated architecture in full detail
the database uses the standard DBMS, with spatial tools added as a layer on top of
it, while in the dual architecture the top layer is an integration layer that integrates the standard
DBMS with a spatial subsystem in the bottom layer.
Spatial databases describe the fundamental representation of the objects of a dataset that comes from
spatial or geographic entities. The spatial or geographic referencing attributes of the objects permit
them to be positioned within a two-dimensional (2D) or three-dimensional (3D) space. Basically, there are
two fundamental aspects of a spatial database that need to be modelled: the spatial component (where)
and the attributes (what). These two factors determine the spatial data and attribute data that make
up the spatial database. Spatial data describe the location of the object of concern, while attribute data
specify characteristics at that location (e.g. how much, when etc.). However, representing these
data in a form that the computer can understand requires grouping the data into layers according
to the individual components with similar features (example layers could be waterlines, elevation, temperature, topography etc.). The data properties of each layer (such as scale, projection, accuracy, and resolution) must also be set appropriately. This
is where the logical layer of a spatial database management system (SDBMS) comes in. The logical layer
(Rigaux et al., 2003) carries the definition of the spatial database schema, which describes the structure of
the information as managed by the application. It also carries the constraints to be respected by the data
in the database. Defining the schema allows for further database operations (including data insertion and
deletion) and database queries using an appropriate spatial query language. In other words, the
particular structures, constraints, and operations provided by an SDBMS depend wholly
on the logical data model supported by that SDBMS (Rigaux et al., 2003).
Data Representation in a Spatial Database
The content of spatial data, according to Densham and Goodchild (1989) and Li et al. (2006), describes
spatial objects' specific geographic orientation and spatial distribution in the real world. Spatial data
also include the attributes of spatial entities, their number, their location, and their mutual relations. The data can
be the value of a point, a height, a road length, a polygon area, a building volume or a pixel grey level. It can be
a string such as a geographical name or annotation. It can also be graphics, images, multimedia, spatial
relationships or other topologies. Spatial phenomena are described using dimensional objects such as
points, lines, and polygons or areas; this complexity of spatial data and its intrinsic spatial relationships
limits the usefulness of conventional data mining techniques for extracting spatial patterns. Figure 3 can
be considered a typical example of a spatial dataset. The picture is a satellite image of a county (the
wards in Yorkshire and the Humber for the 2011 census) showing the county's boundary (the dashed
white line), the census blocks including name, area, population and boundary (shown by a dark line), and
water bodies (dark polygons) as obtained from Census (2011).
Given a d-dimensional space ℝ^d with a Euclidean distance, assume that d = 2 and that
the space is a big rectangle with edges parallel to the axes of the
coordinate system (see Figure 3); from such a space we obtain our spatial dataset. Therefore, in order
to store the data in a spatial database table, we start by creating a table with the item sets obtained
from the image using a classical relational database model.
If we bring out a single block of the census_area in its 2-dimensional space from Figure 3(c), we
can obtain the information held on each record. Figure 4 is a typical example of a single object as it would be
represented as a record in the database table.
Traditional databases (Shekhar and Chawla, 2003) do not support the boundary (polygon) data type
seen in our illustration above. As such, there arises the need to create a separate relation using a spatial data
model and then map the new table into the classical database, which requires us to create additional
tables that can store the spatial data types. For each rectangular block in the study area, the following are identified: polygon, edge and point, and a separate table is created for each, as shown in Figure 5.
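The decomposition into polygon, edge and point tables described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: apart from census_area, the table names, column names and sample values below are hypothetical and are not taken from Figure 5.

```python
import sqlite3

# In-memory database; every table and column name except census_area is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE census_area (area_id INTEGER PRIMARY KEY, name TEXT,
                          population INTEGER, boundary_id INTEGER);
CREATE TABLE polygon (boundary_id INTEGER, seq INTEGER, edge_id INTEGER);
CREATE TABLE edge (edge_id INTEGER PRIMARY KEY, start_pt INTEGER, end_pt INTEGER);
CREATE TABLE point (point_id INTEGER PRIMARY KEY, x REAL, y REAL);
""")

# One rectangular block with made-up attribute values and four corner points.
conn.execute("INSERT INTO census_area VALUES (1, 'Block 1', 1200, 10)")
conn.executemany("INSERT INTO point VALUES (?, ?, ?)",
                 [(1, 0.0, 0.0), (2, 4.0, 0.0), (3, 4.0, 3.0), (4, 0.0, 3.0)])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, 1, 2), (2, 2, 3), (3, 3, 4), (4, 4, 1)])
conn.executemany("INSERT INTO polygon VALUES (?, ?, ?)",
                 [(10, 1, 1), (10, 2, 2), (10, 3, 3), (10, 4, 4)])

def boundary_vertices(conn, area_id):
    """Rebuild the ordered boundary of one area by joining the four tables."""
    return conn.execute("""
        SELECT p.x, p.y
        FROM census_area a
        JOIN polygon g ON g.boundary_id = a.boundary_id
        JOIN edge e    ON e.edge_id = g.edge_id
        JOIN point p   ON p.point_id = e.start_pt
        WHERE a.area_id = ?
        ORDER BY g.seq
    """, (area_id,)).fetchall()

vertices = boundary_vertices(conn, 1)
print(vertices)
```

Recovering a single boundary already takes a three-way join, which suggests why a dedicated polygon data type is more convenient than mapping spatial objects onto plain relations.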
Spatial Data Types
Unlike classical relational database management systems, which store data as numbers, alphabetic characters,
alphanumeric strings or even symbols, spatial databases store data as abstract data types (ADTs) such as points,
Figure 3. Example spatial dataset: two thematic maps of Yorkshire County and the relational database
table describing them (Census 2011): (a) indicates the areas where the actual census was conducted; (b)
shows the areas in Yorkshire where English is spoken as a major language; (c) a relation representing
the two spatial objects
Figure 4. Representation of a single record in the census_area database table
Figure 5. Relational database table storing the spatial properties of the object
lines, polygons, coordinates (latitude and longitude coordinates which define a given location on the
surface of the earth), topology or other data types that can be mapped (Samson et al., 2013; Ernest and
Djaoen, 2015). The definition and implementation of spatial data types is the most fundamental issue
in the development of spatial database systems (Güting and Schneider, 1993). Schneider (1999)
shows that spatial data types are necessary to model geometry and to suitably represent geometric
data in a database system. The basic data types, according to the author, include point, line and region;
the more complex types are partitions and graphs, including networks (roads, rivers etc.). In Güting
and Schneider (1993), spatial data types (points, lines and regions) are described as elements of a spatial
object which describe the object's attributes regardless of whether the database management system uses
a relational, complex object, object-oriented or some other data model. A spatial database (made up of a collection of both spatial and non-spatial data) is optimized to store and query data
objects located in space. Compared with normal databases, which work only with numeric, character or
calendar data, spatial databases offer additional functions that allow processing of spatial data types (Velicanu & Olaru, 2010). According to Ernest and Djaoen (2015), spatial data types, which generally describe
the physical location and shape of geometric objects, are classified into two types: geometry and
geography data types. The geometry data types allow data to be stored using the x and y (Euclidean)
coordinate system; the xy coordinates position the spatial object (point,
polygon/region or line) in a 2-dimensional Euclidean space. The geography data types (Ernest
and Djaoen, 2015) store data based on a round-earth coordinate system. In this case, the spatial object
is stored using its latitude and longitude coordinate values. More elucidation on spatial data types can
be found in Samson et al. (2013, 2014), but for a simple example let us look at the illustration below.
Suppose the problem at hand is to find the nearest town to the centre of the Yorkshire counties (marked
P) on the map in Figure 8; then we need to store the towns A through M in our database. We
could store the towns as point locations by taking the values of their x and y coordinates. In doing this we
create a table called Census_town and then follow the steps explained above. Figure 6 highlights different
types of spatial objects and how they are represented, and Figure 7 gives a general overview of the
various spatial data types and how they are described, as presented in Rigaux et al. (2003).
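As a minimal sketch of the nearest-town problem described above, the towns can be stored as xy points in a classical table and ranked by squared Euclidean distance to P. The coordinates, the three sample towns and the position of P are hypothetical; only the table name Census_town comes from the text (written in lower case here).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census_town (name TEXT PRIMARY KEY, x REAL, y REAL)")

# Hypothetical coordinates for three of the towns A through M.
conn.executemany("INSERT INTO census_town VALUES (?, ?, ?)",
                 [("A", 2.0, 9.0), ("B", 5.0, 4.0), ("C", 8.0, 7.0)])

def nearest_town(conn, px, py):
    """Return (name, x, y) of the town closest to the query point (px, py).

    Ordering by squared distance gives the same ranking as ordering by
    distance and avoids needing a square-root function in SQL.
    """
    return conn.execute(
        "SELECT name, x, y FROM census_town "
        "ORDER BY (x - ?) * (x - ?) + (y - ?) * (y - ?) LIMIT 1",
        (px, px, py, py),
    ).fetchone()

# Hypothetical position of the centre P.
nearest = nearest_town(conn, 5.0, 5.0)
print(nearest)
```

This query scans every row of the table, which is acceptable for thirteen towns but is the kind of cost that spatial indexes are meant to avoid on large datasets.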
Figure 6. The different data types for representing spatial objects
Figure 7. Methods for representing spatial data types (Rigaux et al., 2003)
SPATIAL DATABASE MANAGEMENT SYSTEM
Point Data Types
Point data are completely characterized by their locations in a multidimensional space (Velicanu &
Olaru, 2010). Point data types are basically used to represent a single object at a given location in a
multidimensional space. In Samson et al. (2014), points are shown to be efficient when modelling, for
example, cities, forests, or buildings, and also to be ideal objects for thematic maps describing, for instance,
land use/cover or the partitioning of a country into districts. A point is the most important object
type supported by the spatial data types, both geometry and geography (Ernest and Djaoen, 2015), and
it represents a singular position in space. The position of a point in space can be defined by using an
X-coordinate and Y-coordinate value pair based on a planar (geometry) coordinate system or on the
latitude and longitude coordinates from a geographic coordinate system (Ernest and Djaoen, 2015). In
vector spaces, points can also be used to store features extracted from data, e.g. text. The raster data model
is best expressed using points (each point stores one pixel, a single atomic cell of the
image) because the raster says something about every point in the
raster space; in this way the raster image can be modelled as a single collection of spatially related
objects. In general, points are simple geometries of dimension zero: they have
neither length nor breadth. For a better idea of how to analyse point patterns from a spatial dataset,
see Samson et al. (2014), and for a better understanding of sampling point data, see Samson et al. (2013).
To define a point A from the geometric point of view, we refer to the values of its xy coordinates in the
plane. For instance, in a 2-dimensional plane where x = 2 and y = 9, we write point A as A(2, 9); in 3 dimensions A(x, y, z); in 4 dimensions A(x, y, m, z), etc. The Z coordinate refers to the height or elevation
of a point, and the M coordinate represents a measure value, which is a user-defined value (Ernest
and Djaoen, 2015). Figure 8 is a thematic map of the image in Figure 3; the map depicts the percentage of
female usual residents aged 16 to 74 in professional occupations in towns A through
M around the Yorkshire County. In this illustration, the towns have been modelled as point locations and as
such will be represented (stored) in the database using their individual xy coordinates.
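The practical difference between planar (geometry) and latitude/longitude (geography) point coordinates shows up as soon as distances are computed. The sketch below contrasts the two; it assumes a spherical earth with a mean radius of 6371 km and uses the haversine formula, neither of which is prescribed by the chapter.

```python
import math

def planar_distance(a, b):
    """Euclidean distance between two (x, y) points in a planar system."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def great_circle_distance_km(a, b, radius_km=6371.0):
    """Haversine distance between two (latitude, longitude) points in degrees.

    Assumes a spherical earth; radius_km is an assumed mean radius.
    """
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

# Point A(2, 9) from the text and a hypothetical second point.
d_planar = planar_distance((2, 9), (5, 5))

# Two points on the equator, a quarter of the way around the globe apart.
d_geo = great_circle_distance_km((0.0, 0.0), (0.0, 90.0))
print(d_planar, round(d_geo, 1))
```

The same pair of numbers therefore means different things depending on the declared data type, which is why the coordinate system has to be stored with the point.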
Figure 8. Using point locations to represent the towns around the county
Storing Region Data Types
In the previous section we used point representations to model the spatial objects. In this section
we model the objects using rough geometric approximations of their spatial extent; in addition
to an extent, most objects also have a location and a boundary (Mamoulis, 2012). Figure 9 is another version
of the image in Figure 8 in which the spatial objects are modelled using geometric (rectangular) shapes.
This is a vector representation, in which the extent of the spatial objects has a significant
effect on the result and performance of query processing (Shekhar et al., 2011). Regions are abstract
data types (ADTs) that represent the geometric part of a spatial object.
In most cases, representing objects with large areas (like lakes, forests etc.) requires solid
shapes with a surface (for example, two- or three-dimensional objects). In essence, these kinds of large-area
entities are fitted using polygons (mostly rectangles), and the boundaries of the regions are enclosed in
polylines around the polygon.
Converting a Region to Point Data
Representing the regions in Figure 10 with points is not a straightforward transformation; a procedure
has to be applied that converts the features on the map into a set of points.
Notwithstanding, Candan and Sapino (2010) reiterated that this approach is particularly useful for
measuring similarities and distances where it is assumed that there is only one point per feature; it cannot be used, for instance, to express topological relationships between regions. The towns represented by
points A through M (in Figures 8 and 9) have been converted from large geometric areas (of dimension two)
to points (of dimension zero). This conversion is justifiable because their extents (shapes)
are not considered useful when considering their locations on the larger-scale map. In other words, the
result of queries that concern an object's location on the map is not affected by reducing the shape
of the object to a point because, compared with the space that contains the objects, the space they
cover is inconsequential.
Figure 9. Using rectangles (minimum bounding rectangles) to enclose the regions before storage
The basic steps for converting a polygon to a point are itemised as follows:
1. Find a two-dimensional bounding box (MBR) around the points by:
a. Finding the convex hull for the set of points or regions,
b. Eliminating points that are redundant to the solution, and
c. Enclosing the (convex) polygon with the minimum bounding box that contains its m vertices.
2. Estimate the centre (centroid) of the object (on the x-axis) using the MBRs. (This in most cases is
used as the point that represents the object.)
Minimum Bounding Box/Rectangle (MBB/MBR)
The minimum bounding box or minimum bounding rectangle/region (MBR) is a paradigm for clustering
points based on their similarity measure. Points which are close to each other (in space) to a certain extent
are put together as objects in one cluster, also known as a bucket (mostly rectangular in shape). Samet
(2009) describes a bucket as a subspace (of an underlying space) which contains sorted spatial objects that
have been grouped together in a natural way based on their spatial order (also known as
spatial occupancy). In Mamoulis (2012), constructing the MBR of a spatial object or objects is seen as a
technique for handling spatial objects by approximating their geometric extent using the minimum bounding
box that encloses the object's geometry minimally and more efficiently. This idea is an optimal filtering
method for managing spatial databases that preserves the objects' natural locality. When this spatial approximation is achieved effectively (in terms of computational cost), the objects are guaranteed to retain their
exact original geometry. The study in Güting (1994) expresses MBRs as a generic approximation of the
spatial data types (points, lines, polygons) which links them to different spatial access methods and allows
the spatial objects to be organised in a database stored in external memory, using buckets (MBRs) of a certain
size to match the pages of the storage system (Dröge & Schek, 1993).
Figure 10 is the map of the administrative areas of Britain as obtained from GADM (2009). We have used
the map as a spatial dataset to illustrate the various stages involved in converting regions to point data
for spatial data analysis. The files were extracted from the GADM version created in 2009.
Convex Hull
The convex hull of a set of points t is the smallest convex polygon that encloses t (Rigaux et al., 2003).
Computing the convex hull of that set of points, that is, the convex polygon (in
which all vertices point outward from the centre) with the minimum area that includes all these points,
is a way to represent the region occupied by the points (Asaeedi et al., 2013). Note that the polygon is
convex because, for any two of its points a and b, the line segment ab lies entirely
within the polygon. Figure 11 is a sample dataset (a digitised map of one of the towns in Figure
10a) with selected points picked from the town's boundary with another town. The points around
the polygon in Figure 11 are the convex points of the spatial object (town). Thus, if we call the set of
all points p and choose C to be the convex points, then the convex hull of p is the smallest convex
polygon that encloses p. Note that the points that do not fall on the polygon are non-convex points and
are insignificant in finding the convex hull of the object. For any subset of a plane coordinate system (say
points, rectangles, simple polygons), the convex hull is the smallest convex set that contains that subset.
Figure 10. Various stages involved in converting to point location: (a) map representing the study area
(b) regions on the map represented by their convex hull (c) constructed MBR around the convex hull
(d) centroid of regions using their MBR (e) expanded view of the area (in figure 10 d) around the point
marked d (f) region fully represented using points
Figure 11. Digitized map of one of the towns in Figure 10 (a) showing its MBR (red rectangle)
Algorithm for Finding the Convex Hull around a Polygon
Assuming all the points in p are distinct
Let the space be of dimension 2
Then
{
Enter the coordinates representing each point (2 * |p| values)
Set of all points = p; set of all convex points = C // where p = {a, b, c, …, r}
For all p_i ∈ p, i = 1, …, r:
Find {s | s ∈ p ∧ s ∈ C}; start with the point s that has the smallest y-coordinate.
Sort the remaining points by polar angle about s to get a simple polygon.
Consider the points in anticlockwise order, and eliminate those that would create
a clockwise turn.
}
Return convex polygon
// sorted list of points t along the boundary of the convex hull of p (moving anticlockwise)
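The steps above (start from the lowest point, sort by polar angle, discard clockwise turns) correspond to the procedure commonly known as the Graham scan. The following is a minimal sketch rather than the authors' implementation; the function names and the sample points are illustrative, and points are assumed to be (x, y) tuples.

```python
from functools import cmp_to_key

def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for an anticlockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in anticlockwise order, starting at the lowest point."""
    pts = sorted(set(points))           # the algorithm assumes distinct points
    if len(pts) < 3:
        return pts
    start = min(pts, key=lambda p: (p[1], p[0]))   # smallest y, then smallest x

    def dist2(p):
        return (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2

    def by_polar_angle(a, b):
        turn = cross(start, a, b)
        if turn > 0:                    # a has the smaller polar angle
            return -1
        if turn < 0:
            return 1
        return dist2(a) - dist2(b)      # same direction: nearer point first

    rest = sorted((p for p in pts if p != start), key=cmp_to_key(by_polar_angle))
    hull = [start]
    for p in rest:
        # Drop the last vertex while it would create a clockwise (or straight) turn.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

sample = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (2, 0)]
hull = convex_hull(sample)
print(hull)
```

Interior points such as (2, 1) and points lying on an edge such as (2, 0) are discarded, matching the remark above that non-convex points play no part in the hull.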
Fitting the Rectangle
Assuming we have a set of points on a 2-dimensional plane, we can compute the minimum bounding
rectangle that encloses the points. It is only natural that the minimum bounding rectangle for the points
be supported by the convex hull of the points (which is a convex polygon); any points interior to the
polygon have no influence on the bounding rectangle (Eberly, 2015). Moreover, it has been established by Freeman and Shapira (1975) that the most important constraint to be met in fitting a minimum
bounding box (MBB) is that at least one of the edges of the MBB coincides with an edge of
the convex polygon.
Thus the simple algorithm below can be used to construct a minimum bounding box (MBB) around
the convex polygon obtained above (section 3.2.1).
Algorithm for Finding the Minimum Bounding Box around a Convex Polygon
Given:
A set of convex points, say p // the points constitute the convex polygon
Produce:
The minimum bounding box for the set of points, say q // the output is
expected to have fewer points than the input
Start:
// let V be the number of edges on the convex polygon
For i = 1 … V
Draw the enclosing rectangle with edge v_i as the base (i.e. v_i is collinear with an edge
of the rectangle)
Calculate the area A of the rectangle
Repeat for all i
Display the rectangle with the smallest area as the minimum bounding box (MBB)
A simpler method is to find just the xmin, ymin, xmax and ymax values:
Start
Get min x, max x
Get min y, max y
Using these bounds
Draw the rectangle around the points
// despite the fact that Eberly (2015) claimed that it is not necessary, the rectangle produced using this method will be axis-aligned (that is,
its edges are parallel or orthogonal to the coordinate axes)
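Both procedures above can be sketched in a few lines. The code assumes the input is a list of distinct convex hull vertices given in boundary order as (x, y) tuples; the function names and the diamond-shaped sample are illustrative only.

```python
import math

def axis_aligned_mbr(points):
    """Simple method: (xmin, ymin, xmax, ymax) from the extreme coordinates."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def min_area_rectangle(hull):
    """Edge-based method: try a rectangle whose base lies along each hull edge.

    Returns (area, index of the supporting edge) for the smallest rectangle.
    """
    best = None
    n = len(hull)
    for i in range(n):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % n]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length   # unit vector along the edge
        along = [x * ux + y * uy for x, y in hull]        # extent parallel to the edge
        across = [-x * uy + y * ux for x, y in hull]      # extent perpendicular to it
        area = (max(along) - min(along)) * (max(across) - min(across))
        if best is None or area < best[0]:
            best = (area, i)
    return best

# A square rotated by 45 degrees (a diamond) as the convex polygon.
diamond = [(1, 0), (2, 1), (1, 2), (0, 1)]
box = axis_aligned_mbr(diamond)
box_area = (box[2] - box[0]) * (box[3] - box[1])
min_area, edge_index = min_area_rectangle(diamond)
print(box, box_area, round(min_area, 6))
```

For this shape the axis-aligned rectangle is twice as large as the edge-aligned one, which illustrates the trade-off: the simpler method is cheaper to compute and to index, but it can be a looser approximation.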
Baum and Hanebeck (2010) describe axis-aligned rectangles as rectangles in a space of any dimension
which are represented by the extreme values (the minimum and maximum) on each axis; in addition,
they demonstrate how the time complexity of the computation (in terms of the number of steps applied)
can be calculated. Axis-aligned rectangles are very significant in SDBMSs for objects with extents, as we
can see in the elucidation given by Samet (1995). For instance, in order to store such arbitrary objects we
need to represent them as n-dimensional rectangles: each node in the storage structure (say, a tree)
corresponds to the n-dimensional rectangle that holds its children, and, most importantly, the spatial
objects themselves, which are stored in the leaf nodes of the tree, are represented by the smallest axis-aligned
rectangles that contain them. It is important to remember that the nodes we refer to here relate to
pages on a disk (computer memory); as such, building the tree structure should take into consideration
the fact that a minimum number of disk pages should be visited for any query operation, otherwise the indexing
structure is considered suboptimal.
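The idea that a parent node's rectangle encloses its children, and that this lets a query skip whole disk pages, can be illustrated with a two-level toy structure. This is a sketch under stated assumptions, not a full index: rectangles are (xmin, ymin, xmax, ymax) tuples, each leaf list stands in for one disk page, and all names and values are hypothetical.

```python
def union(rects):
    """Smallest axis-aligned rectangle that covers every rectangle in rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def intersects(a, b):
    """True when two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Each leaf holds the MBRs of the objects stored on one (hypothetical) page.
leaves = [
    [(0, 0, 1, 1), (1, 1, 2, 2)],
    [(5, 5, 6, 6), (6, 5, 7, 7)],
]
# A parent entry is the union of the rectangles in its leaf.
nodes = [(union(objs), objs) for objs in leaves]

def window_query(nodes, window):
    """Return the matching object MBRs and the number of leaf pages read."""
    hits, pages_read = [], 0
    for node_mbr, objs in nodes:
        if not intersects(node_mbr, window):
            continue                      # the whole page is pruned unread
        pages_read += 1
        hits.extend(o for o in objs if intersects(o, window))
    return hits, pages_read

hits, pages_read = window_query(nodes, (0.5, 0.5, 1.5, 1.5))
print(hits, pages_read)
```

Only one of the two pages is read for this window, which is the saving the text refers to when it asks that a minimum number of disk pages be visited.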
Finding the Centre (Centroid) of the Spatial Objects
After the rectangle is built, and depending on the task at hand, the next step is to find
the centre of the rectangle, which helps with further processing. In the study of plane geometry,
according to Bourne (2015), the centroid of area of a geometric shape or object in 2 dimensions can
generally be seen as the centre of mass of that object (or shape) when its density is uniform. It differs from
the centre of mass of the object only in that calculating the centroid (Efunda, 2016) involves just the
geometrical shape of the solid. If we want to calculate the centroid from the bounding rectangle, we can summarise
the formula as in Equation 1.
Algorithm for Finding the Centroid of a Spatial Object
For a simple rectangle with opposite corners $(x_1, y_1)$ and $(x_2, y_2)$:

$$C_{x,y} = \left(x_1 + \frac{x_2 - x_1}{2},\; y_1 + \frac{y_2 - y_1}{2}\right) = \left(\frac{x_1 + x_2}{2},\; \frac{y_1 + y_2}{2}\right) \qquad (1)$$

Generally, this equation can be broken down into the sequence below.
i. Get the x and y coordinates of each vertex $v$, in any order.
ii. For each $v_j \in V$ and each coordinate $i$ ($i = 1, 2, \ldots, d$) in a $d$-dimensional space:
iii. Compute the centroid:

$$C_i = \frac{dx_i,\; dy_i}{d} \qquad (2)$$

where $d$ is the number of dimensions.
iv. Return the set of coordinates $C_i$ (e.g. x, y for just xy coordinates).
The formula described above is useful when we want to take the centroid from the bounding rectangle
of the spatial object under consideration, but generally, if we want to get the centroid of a polygon
based on its j vertices, then the formula below proves more useful.
Given any p points
With masses $m_j$ at positions $x_j$ for all j vertices
Then the centroid of that set of points is

$$\bar{x} = \frac{\sum_{j=1}^{p} m_j x_j}{\sum_{j=1}^{p} m_j} \qquad (3)$$

Though Equation 3 is a general equation for finding the centroid of a polygon with masses $m_j$, the
equation transforms to Equation 4 if all these masses are equal.

$$\bar{x} = \frac{\sum_{j=1}^{p} x_j}{p} \qquad (4)$$
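The centroid computations discussed in this section can be sketched as follows. The rectangle centre is taken as the midpoint of two opposite corners of the bounding rectangle; the weighted form follows Equation 3 and reduces to the plain average of Equation 4 when all masses are equal. Function names and sample values are illustrative assumptions.

```python
def rectangle_centre(x1, y1, x2, y2):
    """Centre of an axis-aligned rectangle given two opposite corners."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def weighted_centroid(points, masses):
    """Equation 3: sum of m_j * x_j divided by the sum of m_j, per coordinate."""
    total = sum(masses)
    dims = len(points[0])
    return tuple(
        sum(m * p[i] for p, m in zip(points, masses)) / total
        for i in range(dims)
    )

def centroid(points):
    """Equation 4: the equal-mass case, i.e. the mean of the vertices."""
    return weighted_centroid(points, [1] * len(points))

corners = [(0, 0), (4, 0), (4, 2), (0, 2)]
c_rect = rectangle_centre(0, 0, 4, 2)
c_mean = centroid(corners)
c_weighted = weighted_centroid([(0, 0), (4, 0)], [1, 3])
print(c_rect, c_mean, c_weighted)
```

For a rectangle the two approaches agree, while unequal masses pull the centroid towards the heavier vertex.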
EXTENDING THE CLASSICAL RELATIONAL
DATABASE FOR SPATIAL DATASET
Spatial Data Model
Developing a spatial database normally starts with developing a data model (covering conceptual, logical and physical data modelling) that describes the contents of the database and their relationships (Singh & Singh, 2014). There are two common data models for modelling spatial information:
field-based models and object-based models (Shekhar et al., 1999). The raster data structure (also known as the field-based,
space-based or raster model), according to Gregory et al. (2009), is similar to placing a regular grid
over a study region and representing the geographical feature found in each grid cell numerically. The
model treats spatial information such as rainfall, altitude and temperature as a collection of spatial functions transforming a space partition to an attribute domain. Rasters are associated with image processing
and dynamic modelling and are easily manipulated using map algebra (e.g. multiplying
geographically corresponding cell values in two or more datasets). Spatial data can also be represented
as continuous surfaces (e.g. elevation, temperature, precipitation, pollution, noise etc.) using the grid or
raster data model, in which a mesh of square cells is laid over the landscape and the value of the variable is defined for each cell (Samson et al., 2014). The object-based data model (which can also be called
the vector, feature or entity-based model) treats the information space as if it is populated by discrete,
identifiable, spatially referenced entities. An implementation of a spatial data model in the context of
object-relational databases consists of a set of spatial data types and the operations on those types.
The vector data structure represents geographic objects with the basic elements points, lines and areas (also
called polygons). From the description given by Gregory et al. (2009), vector data is based on recording
point locations (zero dimensions) using x and y coordinates, stored within two columns of a database.
By assigning each feature a unique ID, a relational database can be used to link a location to an attribute
table describing what is found there. Every element in a vector model is described mathematically and
is based on points that are defined by Cartesian coordinates (Neuman et al., 2010).
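The map algebra operation mentioned above, multiplying geographically corresponding cell values in two rasters, can be sketched with plain nested lists. The two small grids and their values are hypothetical; a 0/1 mask layer is used here to show a common use of cell-wise multiplication.

```python
def multiply_cells(raster_a, raster_b):
    """Cell-by-cell product of two rasters with identical dimensions."""
    same_shape = len(raster_a) == len(raster_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(raster_a, raster_b)
    )
    if not same_shape:
        raise ValueError("rasters must cover the same grid")
    return [
        [a * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(raster_a, raster_b)
    ]

# Hypothetical 2 x 2 rainfall grid and a mask selecting the cells of interest.
rainfall = [[1, 2],
            [3, 4]]
study_area_mask = [[1, 0],
                   [0, 1]]

masked_rainfall = multiply_cells(rainfall, study_area_mask)
print(masked_rainfall)
```

Because every cell has a fixed position in the grid, no coordinates need to be stored per cell, which is part of what makes raster layers easy to combine.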
Spatial Data Modelling Paradigm
Typically, the database design process is divided into three main tasks, namely conceptual, logical and
physical data modelling (Elmasri and Navathe, 1989). In the conceptual modelling process, the objects
(or entities) are represented as spatial and non-spatial datasets, and their attributes and the relationships between them are identified (Akinyemi, 2010). The conceptual model describes how a system is
organized and how the system operates. The logical model expresses the conceptual data model in terms of the
data that will be computed. This stage, according to Singh and Singh (2014), defines constructs (entity classes or
object classes), operations (create relationships) and validity constraints (rules). At the stage of physical
modelling, we produce the actual database design based on the requirements established at the logical
modelling stage. Figure 12 shows the various stages involved in designing a spatial database and what
they entail.
Figure 12. Stages in building a spatial database (Singh and Singh, 2014)
Choice of DBMS for Building Spatial Databases
Relational database technology is inadequate for managing spatial data (Mamoulis, 2012). This means
it is quite tedious to build a spatial database directly on a classical database model; therefore the
traditional system needs to be extended to handle spatial data types (abstract data types, ADTs). In other
words, relational databases should be equipped to manage and support spatial data types (or data models),
spatial functions, and spatial indexes. Spatial database management systems are software tools that can
work with a classical DBMS but, in addition, support a query language
from which spatial data types are callable, possess efficient algorithms for processing spatial operations,
and provide rules for query optimization (Shekhar & Chawla, 2003).
Ramakrishnan and Gehrke (2003) describe two basic types of extended DBMS that are designed to
handle the complexity associated with spatial data. The extended systems are basically object oriented and are therefore able to support complex data types. The two object databases described are
object-oriented database systems (OODBMS) and object-relational database systems (ORDBMS),
both of which support user-defined (spatial) abstract data types such as polygons, lines and point
coordinates, and other complex types including partitions and graphs (networks, e.g. roads and rivers).
An efficient spatial data management system is expected to support spatial query operations and at the
same time support relational operations. Object-oriented databases model real-world entities
(considering their behaviours and relationships) unambiguously, without the use of tables and keys, and provide high computational power because users can implement functions and embed them into the database
(Candan & Sapino, 2010). The object-relational counterpart, which is an extension of the relational database, lets users model spatial databases either by
extending existing relational models with object-oriented features or by adding special row- and table-based
data types to object-oriented databases (Stonebraker et al., 1990). According to Shekhar et al.
(2011), spatial types and operations may be integrated into query languages such as SQL, which allows
spatial querying to be combined with object-relational database management systems. Figure 13
conceptualizes both approaches described in this section.
Figure 13. Object oriented paradigm for spatial databases
An interesting property of object-based modelling is that it allows users to define their own new
data types and complex objects like the ones in Figure 6 (the network and partition data types). Object-based
modelling also provides functionality that allows the user to define methods (code that
defines an object’s behaviour, properties and state) and rules (which monitor
events and validate them against certain constraints). The results of queries on such databases are
held in containers (e.g. lists or sets).
Example Object-oriented database implementation
❖ Class Census_Town
Row (tuple) (town_code: integer, town_name: string, town_type: string, population: integer, area: float, shape: region)
❖ Class Language_Spoken
Row (tuple) (language_name: string, shape: region)
Object-relational database implementation
❖ CREATE TABLE censusTown
(town_code integer,
town_name string,
town_type string,
population integer,
area float,
shape region,
PRIMARY KEY (town_code))
❖ CREATE TABLE languageSpoken
(language_code integer,
language_name string,
-- town_code links each language record to its town in censusTown
town_code integer,
town_name string,
shape region,
PRIMARY KEY (language_code),
FOREIGN KEY (town_code) REFERENCES censusTown (town_code))
SPATIAL QUERY PROCESSING
Spatial database management deals with the storage, indexing, and querying of data with spatial features, such as location and geometric extent (Mamoulis 2012). Spatial data structures are therefore
frequently compared with respect to complexity, query support, data type support
and application (Patel and Garg 2012). According to Candan and Sapino (2010), a typical query could
be any problem related to the spatial attributes of a given object, and most data features (spatial or
non-spatial) can be represented in the form of one or more of four common base models: strings,
vectors, fuzzy/probabilistic logic-based representations, and graphs/trees. In many spatial database
applications, object extents can be ignored or approximated (using filter and refine) when
the scale of the query is much larger than the scale of the objects. The filter-and-refine technique used in spatial query processing, according to Shekhar et al. (2011), therefore plays a vital role in managing
multidimensional index structures. Figure 14 shows a typical example of a spatial query process where
w is the query object. The diagram explains the basic steps of filtering and refining in a spatial query.
In Figure 14b, the objects are filtered by finding their minimum bounding rectangles (MBRs), and the MBRs that intersect the query area
are returned. At the stage of Figure 14c, the rectangles that intersect the query window are evaluated (the
evaluation is carried out on the actual objects), and through this refinement process, rectangles
which do not have any topological relationship with the query window
are recorded as false hits.
Figure 14. Steps involved in querying rectangular objects (Mamoulis 2012): (a) the region dataset
with w as the query window, (b) the approximated (filtered) dataset, (c) the final refinement stage,
which contains the actual objects
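The two steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular system: the objects are taken to be circles (so that an exact geometric test is short), and the names filter_and_refine, circles and window, as well as all coordinates, are hypothetical.

```python
from math import hypot


def mbr_of_circle(cx, cy, r):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle."""
    return (cx - r, cy - r, cx + r, cy + r)


def rects_intersect(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circle_intersects_rect(cx, cy, r, rect):
    """Exact test: distance from the centre to the nearest point of rect."""
    nx = min(max(cx, rect[0]), rect[2])
    ny = min(max(cy, rect[1]), rect[3])
    return hypot(cx - nx, cy - ny) <= r


def filter_and_refine(objects, query_window):
    # Filter step: keep only objects whose MBR intersects the query window.
    candidates = [name for name, (cx, cy, r) in objects.items()
                  if rects_intersect(mbr_of_circle(cx, cy, r), query_window)]
    # Refine step: test the actual geometry of each candidate.
    hits = [n for n in candidates
            if circle_intersects_rect(*objects[n], query_window)]
    false_hits = [n for n in candidates if n not in hits]
    return hits, false_hits


# Hypothetical data: (centre_x, centre_y, radius) per object.
window = (0, 0, 10, 10)
circles = {"a": (5, 5, 1), "b": (12, 12, 2.5), "c": (20, 20, 1)}
hits, false_hits = filter_and_refine(circles, window)
```

Object b passes the filter because its MBR touches the corner of the window, but the circle itself does not reach the window, so it is reported as a false hit; object c is discarded by the filter without its geometry ever being examined.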
A simple example query is given below:
SELECT
c.town_name,
l.language_name
FROM
censusTown c, languageSpoken l
WHERE
c.town_name = l.town_name
AND c.population > 10000
AND l.language_name = 'English'
The simple query above takes the relations censusTown and languageSpoken; censusTown stores
the towns in Yorkshire County and languageSpoken stores the language prevalent in each town. The
query returns the towns where English is the major language and the population is greater than
10,000. According to Güting (1994), there are basically three types of queries that arise over spatial (point)
data: spatial range queries, nearest neighbour queries, and spatial join queries. For rectangle
data one is normally faced with queries such as the intersection query (e.g. find all rectangles that intersect
a query rectangle) or the containment query (e.g. find all regions/rectangles that are completely
within a query rectangle) (Güting, 1994). In point and window queries, which are important examples of
spatial queries, we try to find the objects whose geometry contains a given point or overlaps a rectangle
(Rigaux et al., 2003).
Range Queries
In range queries, according to Papadias et al. (2003), the range typically corresponds to a rectangular window (like the one in Figure 15) or a circular area around a query point. In this case one of the attributes
of the spatial object is identified as the range. For example, find all primary schools that fall within
20 miles of the University of Huddersfield. Range queries, also known as similarity/distance threshold
queries (Candan and Sapino, 2010), try to find matches in a database that are within the threshold associated with a given distance or similarity measure. In general, a typical question for a
range query on point and spatial data would be: find all objects within a given range of another object.
Basically, a range (region) search, according to Samet (2006), identifies a set of data points
whose specified keys have specific values or whose values are within given ranges.
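Both forms of range described above, the rectangular window and the circular area around a query point, can be sketched with a linear scan in Python. The function names, the school identifiers and the coordinates (in miles, relative to an assumed origin at the query point) are hypothetical; a real system would answer the same question through a spatial index rather than by scanning every point.

```python
from math import hypot


def window_query(points, xmin, ymin, xmax, ymax):
    """Return the names of all points inside a rectangular window."""
    return sorted(name for name, (x, y) in points.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)


def distance_range_query(points, centre, radius):
    """Return the names of all points within `radius` of `centre`."""
    cx, cy = centre
    return sorted(name for name, (x, y) in points.items()
                  if hypot(x - cx, y - cy) <= radius)


# Hypothetical school locations, in miles from the query point at (0, 0).
schools = {"s1": (3, 4), "s2": (30, 40), "s3": (0, 19), "s4": (12, 15)}

within_20_miles = distance_range_query(schools, (0, 0), 20)
inside_window = window_query(schools, 0, 0, 10, 10)
```

Here within_20_miles contains s1, s3 and s4, while the 10 by 10 window contains only s1.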
Nearest Neighbour Queries
Nearest neighbour (similarity) searching or retrieval (Cazals et al., 2013) is a central computational
problem with significant applications in many fields of study. The similarity between objects, according
to Jain et al. (1999) and Jiawei (2001), is determined using distance measures over the dimensions
of the dataset. Nearest neighbour queries mostly answer distance-based questions such as: find the nearest
point/region to a given query point/region. The nearest neighbour of a query point is the point or object
that is closest to it in Euclidean space. Figure 15 shows points G, H, …, J around the
point p. If we are asked to find the nearest neighbour of p, the query returns J; if the query
is modified to find the k nearest neighbours (say k = 2), the output is E, J.
Mamoulis (2012) describes nearest neighbour search as follows: given a well-defined reference object q,
retrieve from a spatial relation R the nearest object (or the k nearest objects) to q. Nearest neighbour
queries are very important in most database applications with high-dimensional data (Berchtold et al.
1996). When the database is indexed with a tree data structure, the main concern for nearest neighbour
search is CPU time, which is high because the search has to sort the
nodes based on their min-max distance. If the spatial relation is not indexed, according to Mamoulis
(2012), the nearest neighbour algorithm (for clustering, classification or
any other purpose) has to access all objects in the relation in order to find the nearest neighbour of a query
object q. Figure 16 illustrates how to find the nearest neighbour of an object in space by simply
defining a criterion to approximate an optimal neighbour.
Figure 15. Different point locations with their distances to a central point p
Figure 16. Example of how to approximate nearest neighbour search
Basically, the problem of finding the nearest neighbours of an object in space can, in its general form, be defined as follows (Lifshits and Zhang, 2009): let U be a set of elements and let d be
a distance function that maps each pair of elements from U to some positive real number. Often d is a
metric distance function (i.e., it satisfies the triangle inequality), although this need not always be the case.
The task is then to find, for a query element q, the element s of a given dataset S ⊆ U that minimizes d(q, s).
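For an unindexed relation, where every object has to be accessed, nearest neighbour and k-nearest neighbour search can be sketched as below. This is a brute-force illustration that assumes the Euclidean distance for d; the function name k_nearest and the coordinates are hypothetical and are not read off Figure 15, only the labels are reused.

```python
import heapq
from math import dist  # Euclidean distance, available from Python 3.8


def k_nearest(points, q, k=1):
    """Return the names of the k points closest to q, nearest first.

    Every point is examined, which is what an unindexed relation requires.
    """
    return heapq.nsmallest(k, points, key=lambda name: dist(points[name], q))


# Hypothetical coordinates around a query point p at the origin.
points = {"E": (2, 0), "G": (5, 5), "H": (-4, 3), "J": (1, 0)}
p = (0, 0)

nearest = k_nearest(points, p)          # single nearest neighbour
two_nearest = k_nearest(points, p, 2)   # k = 2
```

With these made-up coordinates the nearest neighbour of p is J, and the two nearest neighbours are J and E. A tree index avoids the full scan by discarding whole nodes whose bounding rectangles are too far from q.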
Spatial Join
The spatial join operation combines two or more spatial datasets based on some spatial relationship.
For instance, if we want to get the list of all hospitals around the towns in Huddersfield, then we could
set a query like:
SELECT
Hospitals.Name
FROM
Hospitals, Hudd_town
WHERE
Name_is_within (Hospitals.region, Hudd_town.region)
Spatial join is an important example of a spatial query operation. According to Rigaux et al. (2003),
a spatial join answers the query: given two different sets of spatial objects, find the pairs of objects that satisfy some spatial
relationship. As a simple example, find the languages spoken in different cities in Yorkshire. To answer this query, we would need a thematic map of the cities in
Yorkshire County and a map of the language predominant in each city; we then join the two maps and display
each city and its language on a single map. It is important to note that intersection joins are not useful for
point datasets; the appropriate operation there is the e-distance join, which finds all pairs of objects
(s, t), s ∈ S, t ∈ T, that are within (Euclidean) distance e of each other (Rigaux et al., 2003).
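The e-distance join on two point datasets can be sketched as a nested-loop join in Python. This is the naive formulation, shown only to make the definition concrete; the function name distance_join, the dataset names and the coordinates are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def distance_join(S, T, e):
    """Return all pairs (s, t) with s in S, t in T and distance(s, t) <= e."""
    return [(s, t)
            for s, ps in S.items()
            for t, pt in T.items()
            if dist(ps, pt) <= e]


# Hypothetical point datasets.
hospitals = {"h1": (0, 0), "h2": (10, 10)}
towns = {"t1": (1, 1), "t2": (9, 10), "t3": (50, 50)}

pairs = distance_join(hospitals, towns, 2)
```

The nested loop compares every pair, so its cost grows with the product of the two dataset sizes; index-based join algorithms exist precisely to avoid this.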
Spatial Attributes
The data inputs of spatial database management are made up of two distinct types of attributes: non-spatial attributes and spatial attributes (Shekhar et al., 2011). Spatial attributes are used to define the
spatial location and extent of spatial objects (Bolstad, 2002). The spatial attributes of a spatial object
most often include information related to longitude, latitude and elevation, shape, area, etc., and
the relationships among these objects, such as overlap, intersect and behind, are often implicit (Ganguly
and Steinhaeuser 2008). This is quite unlike the relationships among non-spatial objects, which are explicit in the data inputs
(Agrawal and Srikant, 1994; Jain and Dubes, 1988). One feasible way to deal with implicit spatial relationships is to materialize the relationships into traditional data input columns and then
apply classical data mining techniques, although the materialization may result in loss of information.
Components of a Spatial Database
Geographic objects such as rivers and roads are always related to a geographic area and,
according to Rigaux et al. (2003), are basically made up of the following components: (a) a description and
(b) a spatial component (also known as the spatial attribute or extent).
Together these components describe the size of the objects and their location, orientation and shape in
2-dimensional or 3-dimensional space, and also describe the objects by means of their non-spatial
attributes. The description component describes the object through descriptive alphanumeric attributes (like the name and the population of a city),
whereas the spatial component represents the geometry (location, shape, etc.) and
topology (relationships among spatial objects) of the object (Rigaux et al., 2003).
Operations on Spatial Data
Spatial data objects can be grouped into themes. A theme is the geospatial information corresponding
to a particular topic; the details of spatial datasets are gathered in a theme. A theme is similar to a relation as defined in the relational model: it has a schema and instances. Rivers, cities, and countries are
examples of themes (Rigaux et al., 2003).
Operations on Themes
Ester et al. (1997) claim that, for knowledge retrieval and discovery from spatial databases (which
contain information on specific themes), most algorithms will make use of neighbourhood relationships, since that is the main difference between the classical database and its spatial counterpart.
According to them, this follows from the fact that the behaviour and characteristics of
a spatial object are influenced by neighbouring objects, which may even “cause” the existence of the
spatial object; as such, the attributes of its neighbours may influence the object itself (Ester
et al., 1999). For instance, consider the two themes in Figure 3: Yorkshire towns (with attributes such
as name, capital, population, and a geometric attribute, say boundary) and Languages (expressing the
distribution of the main spoken languages or groups of similar languages). If we express the themes using
the schema below:
• Towns (name, capital, population, region: region)
• Languages (name, location: region)
Then some of the common manipulations of these themes, based on operations from the relational
algebra, include overlay, join (or intersection), union, selection, nearest neighbour, overlap, distance, etc.
A simple example of a join operation
• Find all the towns in Huddersfield where the English language is spoken.
Suggested Solution (Spatial Join)
SELECT
T.name
FROM
Towns T, Languages L
WHERE
name_is_within (T.region, L.location)
AND L.name = 'English'
Schneider (1997) enumerates most of the operations that are applicable to spatial datasets,
together with their return types. These include topological relationships, which are spatial
predicates that return boolean values (e.g. intersect (overlap), meet (touch), equal, covered by, adjacent
(neighbouring), outside), directional relationships, which are also spatial predicates (e.g. north/south,
left/right), and so on. Equally, Candan and Sapino (2010) state that spatial predicates and operations are
broadly categorized into two groups, based on whether the information of interest is a single point in space or
has a spatial extent (e.g. a line or a region), and that the predicates and operators that
need to be supported by the database management system depend on the underlying spatial data model
and on the applications’ needs. In summary, according to Ester et al. (1999), it is a common premise
that the standard operations from relational algebra, such as selection, union, intersection and
difference, are also applicable to evaluating relationships between any set of spatial objects.
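A few of the topological predicates listed above can be sketched for the simplest case of axis-aligned rectangles. This is an illustrative simplification: real systems evaluate these predicates on arbitrary geometries, and the function names and the tuple layout (xmin, ymin, xmax, ymax) are assumptions made for this example.

```python
def intersects(a, b):
    """True if the rectangles share at least one point (boundary included)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def interiors_overlap(a, b):
    """True if the rectangles share interior points, not just boundary."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def meets(a, b):
    """True if the rectangles touch only along their boundaries."""
    return intersects(a, b) and not interiors_overlap(a, b)


def covered_by(a, b):
    """True if rectangle a lies entirely within rectangle b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]


def equal(a, b):
    """True if both rectangles have identical corners."""
    return tuple(a) == tuple(b)


def outside(a, b):
    """True if the rectangles have no point in common."""
    return not intersects(a, b)


# Hypothetical rectangles given as (xmin, ymin, xmax, ymax).
a = (0, 0, 2, 2)
b = (2, 0, 4, 2)      # shares only an edge with a
c = (1, 1, 3, 3)      # overlaps a
d = (0.5, 0.5, 1, 1)  # lies inside a
e = (5, 5, 6, 6)      # far from a
```

Each function returns a boolean, so any of them can serve as the condition of a spatial selection or spatial join.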
SPATIAL INDEXING AND INFORMATION RETRIEVAL
According to Güting (1994), a spatial database is any database that provides at least a spatial
indexing method and a spatial join operation. With these, the system is able to retrieve the objects in a
particular area from a large collection of objects in some space without scanning the whole set. Indexing spatial
data, according to Lungu and Velicanu (2009), is a method of decreasing the number of searches; this
mechanism helps to locate objects in the same area of data (window query) or from different locations.
The main goal of indexing is to optimize the speed of database queries (Ajit and Deepak, 2011). Given
a user query, say “What is the cheapest and fastest path from location Q to P?”, the spatial
database could be indexed (sorted implicitly) so that even if P is changed to another location, say Z, within
the same geometric region, there would be no need to re-sort the data in order to answer the query
efficiently. This facilitates the information retrieval process and could also improve database integrity
and general performance. Since, in general, no ordering exists in dimensions greater than 1
without transforming the data to one dimension (Cazals, 2013), indexing can be
used as a way of finding a better sort procedure on the database in order to accelerate search query performance. Generally speaking, in a database management system every record can be conceptualized
as a point in a multidimensional space, according to Güting (1994). Unfortunately, this analogy is not always
appropriate for spatial data, because the dimensionality of the representative point may be too high
(although we may decide to reduce the dimensionality of the representative point in order to approximate the spatial object). Moreover, under this form of
transformation (merely mapping spatial data into points in another space), proximity is not
preserved. Such a transformation is fine for storage purposes and for queries
that only involve the points that make up the line segments (including their end points), for example
finding all the line segments that intersect a given point or set of points or a given line segment. However,
the method is not good for queries that involve points or sets of points that are not part
of the line segments, as they are not transformed to the higher-dimensional space by the mapping (Samet, 1995).
If we have to use a representative point to represent a line object, each line segment can be represented
by its end points. The line segments are then represented by a tuple of four items (i.e., a pair of x
coordinate values and a pair of y coordinate values). We have therefore constructed a mapping from a
two-dimensional space (the space from which the lines are drawn) to a four-dimensional space
(the space containing the representative point corresponding to the line). The present challenge for the
data analyst is thus to find techniques that overcome the problems of inappropriate mapping of
spatial objects to point data. One such family of techniques uses data structures that are based on
sorting the spatial objects by spatial occupancy (Samet, 2009). Spatial occupancy methods decompose
the space from which the data is drawn (e.g., the two-dimensional space containing the lines described
above) into regions called buckets. Spatial indexing methods preserve order; in other words, objects in
close proximity should be placed in the same bucket, or at least in buckets that are close to each other in
the sense of the order in which they would be accessed (i.e., retrieved from secondary storage in case
of a false hit, etc.).
In large databases, especially spatio-temporal ones, the efficiency of searching
depends on the extent to which the underlying data is sorted (Samet 2009). The sorting is encapsulated
by the data structure known as an index, which is used to represent the spatial data and thereby make it more
accessible. According to Cazals (2013), in order to store objects in these databases, it is common to map
every object to a feature vector in a (possibly high-dimensional) vector space. The feature vector then
serves as the representation of the object. The traditional role of an index is to sort the data, which
means that it orders the data. However, since generally no ordering exists in dimensions greater than
1 without a transformation of the data to one dimension, the role of the sort process is one of differentiating between the data, and what is usually done is to sort the spatial objects with respect to the space
that they occupy. The resulting ordering should be implicit rather than explicit, so that the data need not
be re-sorted (i.e., the index need not be rebuilt) when the queries change. The indexes are said to order
the space, and the characteristics of such indexes are explored further in Samet (2009).
According to Park et al. (2013), spatial indexing techniques are among the most effective optimization methods for improving the quality of
large dynamic databases; this is achieved by applying ordering tools (e.g. the Z-order curve or the Hilbert curve)
which linearize multidimensional data. A key property of these ordering functions is that they map
multidimensional data to one dimension while preserving the locality of the data points. Once the data
is sorted according to such an ordering, a spatial data structure is built on top of it, and query
results are refined, if necessary, using information from the original feature vectors. Any n-dimensional
data structure can be used for indexing the data, such as binary search trees, B-trees, R-trees, X-trees,
etc. (Cazals et al., 2013). According to Güting (1994), rectangles are more difficult to model than points
because they do not fall into a single cell of a bucket partition.
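The linearization performed by a Z-order curve can be sketched in Python by interleaving the bits of the two coordinates. This is a minimal illustration that assumes non-negative integer cell coordinates of at most 16 bits each; the function name z_value and the sample cells are hypothetical. Locality is preserved only approximately: cells that are close on the curve are usually close in space, but some neighbouring cells end up far apart in the one-dimensional order.

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) key.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    Assumes 0 <= x, y < 2**bits.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical grid cells, listed in no particular order.
cells = [(1, 1), (0, 0), (0, 1), (1, 0), (3, 5)]

# Sorting by the one-dimensional key gives the order in which the curve
# visits the cells; a B-tree or sorted array can then index these keys.
ordered = sorted(cells, key=lambda c: z_value(*c))
```

Once the keys are computed, any one-dimensional index can be built on them, and candidate results are refined against the original coordinates where necessary.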
Indexing Points and Rectangular Objects
Because of the non-linearity that exists in large spatial datasets, an effective data structure that
can handle the branched structures within the data is required.
According to Candan and Sapino (2010), such complex spatial datasets are better represented using graphs
and trees, because the larger datasets are always made up of other minor events or objects which are
difficult to order into sequences. Spatial objects can be indexed in the form of points or regions
(rectangles); this can be achieved using a point access method (PAM) or a spatial access method (SAM).
Point Access Methods
Representing multidimensional point data is a central issue in spatial database design; according to Samet
(2006), one dimension is specified for each attribute or key. In a multidimensional
data space, shared data such as documents, music files and images are frequently specified as points
based on expressed features, and therefore require a system that provides efficient multidimensional query
processing (Jagadish et al. 2006; Samet, 2006). A point access method works by defining a space decomposition over disjoint points; as such, in most tree indexing structures for point data,
the leaf nodes are expected not to overlap (Leutenegger et al., 1997). Figure 17 illustrates some well-known point
access methods identified by Mamoulis (2012). Point access methods can simply be seen as data
structures and algorithms that primarily search for points defined in multidimensional space;
examples include EXCELL, Grid file, hB-tree, Twin grid file, Two-level grid file, K-d tree, BSP-tree,
Quad-tree, UB-tree, Buddy tree and Locality-sensitive hashing (LSH) (Paul, 2008).
In Figure 18, two different methods are presented for indexing the same point dataset: (i) the R-tree
and (ii) the Quad-tree. The presentation shows a set of points P = {A, B, …}. Points C, D and E are close to
each other in space, so we have clustered them in the same leaf node (the minimum bounding
rectangle labelled Z1, outlined in red). The next level up consists of the MBRs Z1, Z2 and Z3 (which enclose
leaf nodes 1, …, 9); these non-leaf nodes are in turn grouped into the next higher level, which in this
case is the root. We have assumed the capacity of a node to be four (4) entries for the R-tree and four for
the Quad-tree (following the conventional Quad-tree partitioning).
Figure 17. Efficient point access methods (Mamoulis, 2012)
Figure 18. Comparison between different tree structures to represent the same point location dataset:
(a) and (b) R-tree (c) and (d) Quad-tree
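The quad-tree partitioning with a node capacity of four can be sketched as follows. This is a minimal point quad-tree written for illustration, not the structure drawn in Figure 18: the class name QuadTree, the max_depth safeguard (which stops splitting when many identical points arrive) and the sample coordinates are all assumptions.

```python
class QuadTree:
    """Point quad-tree over the half-open square [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.points = []      # points held while this node is a leaf
        self.children = None  # four sub-quadrants after a split

    def insert(self, p):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= p[0] < x1 and y0 <= p[1] < y1):
            return False  # point lies outside this node
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append(p)
                return True
            self._split()
        return any(child.insert(p) for child in self.children)

    def _split(self):
        """Divide the node into four equal quadrants and push points down."""
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [QuadTree(*q, self.capacity, self.depth + 1,
                                  self.max_depth) for q in quadrants]
        old, self.points = self.points, []
        for p in old:
            any(child.insert(p) for child in self.children)

    def window_query(self, wx0, wy0, wx1, wy1):
        """Return all points inside the closed window, pruning quadrants."""
        x0, y0, x1, y1 = self.bounds
        if wx1 < x0 or wx0 >= x1 or wy1 < y0 or wy0 >= y1:
            return []  # window does not reach this node
        if self.children is None:
            return [p for p in self.points
                    if wx0 <= p[0] <= wx1 and wy0 <= p[1] <= wy1]
        found = []
        for child in self.children:
            found.extend(child.window_query(wx0, wy0, wx1, wy1))
        return found


# Hypothetical point locations in a 16 x 16 region.
towns = [(1, 1), (2, 3), (3, 2), (12, 12), (13, 3), (5, 9), (15, 15)]
tree = QuadTree(0, 0, 16, 16)
inserted = [tree.insert(p) for p in towns]
```

The fifth insertion exceeds the capacity of the root, so the root splits into four quadrants; a window query over the lower-left corner then visits only the quadrant that the window reaches.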
Task 1
Points A through M (in Figure 8) show point locations of towns around the Yorkshire and the Humber County
where the percentage of women in professional occupations is 26 and above (modelling them as points means we have considered them as locations on the map that do not have extents but have a spatial reference in the 2D plane).
Examples of point queries using point data:
1. Find the distance between town A and town C.
2. Which of the locations is closest to the centre of the county (point P)?
3. Find all towns that are < 50 miles from P.
Note that the extent of each point location will not alter the outcome of the query result.
Region Access Methods
For range queries, a data structure that can search for lines, polygons, etc. is required. Point access
methods are known not to fully support region, overlap, enclosure and similar queries, so methods like the cell tree,
extended k-d tree, GBD-tree, multilayer grid file, D-tree, P-tree, R-file, R+-tree, R*-tree, R-tree and Skd-tree
have been proposed, according to Paul (2008), to manage this sort of data. Most spatial objects are better
stored as rectangles. Storing spatial objects as rectangles is an important task of a spatial database, as this
provides a basic approximation of the actual geometry of the object. Spatial approximation, according to
Mamoulis (2012), helps to reduce the computational complexity associated with the actual geometries of the
spatial objects. The usefulness of storing regions as rectangles cannot be overemphasized; for instance,
when handling a range query, we consider all the rectangles that intersect the region of the query. Many
index structures for 2-dimensional spaces represent objects using their MBRs. In so doing, the MBRs identify the
region of the larger data space that each object occupies, and an intersection test is then performed to
determine which objects intersect the spatial query; an R-tree organizes all the spatial objects
so that the search space can be pruned (Jagadish et al., 2006). According to Güting (1994), the use of bounding
boxes demands that most spatial data structures be designed to store either a set of points (for point
values) or a set of rectangles (for line or region values). Güting (1994) added that rectangles are more
difficult to model than points because they do not fall into a single cell of a bucket partition; therefore
three strategies have been developed to handle rectangle data partitioning: (a) the
transformation approach, (b) overlapping bucket regions, and (c) clipping. Figure 19 presents an R-tree for
indexing region data, in this case a set of rectangles {A, B, …, L}. Rectangles A and B are close to each other
in space, so we have clustered them in the same leaf node (the minimum bounding rectangle
labelled 8). We have assumed the capacity of a node to be four entries (M) for the R-tree. The MBRs in
green are the leaf nodes, and they contain index records (b, obj_pointer), where b is the number of the page
of the smallest MBR that contains the spatial object referenced by obj_pointer. The rectangles
in blue (containing M child nodes) are the non-leaf nodes, with entries (b, child_pointer), where b in
this case is the number of the page of the smallest MBR that contains the MBRs in the child node
referred to by child_pointer. Finally, these entry nodes are grouped into the root (labelled 1).
Note that the rectangles marked blue in the actual tree are the rectangles that have some kind of spatial
relationship with the query window P, and this determines the rectangles that are involved in the
query operation.
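The way an R-tree prunes the search space during a window query can be sketched as below. This is a simplified illustration of the search step only: the tree is assembled by hand rather than through the insertion and node-splitting procedure of Guttman (1984), entries hold the MBR itself instead of a page number, and the names Node, search, node_mbr and all rectangles are hypothetical.

```python
from dataclasses import dataclass


def intersects(a, b):
    """True if two rectangles (xmin, ymin, xmax, ymax) share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def enclose(rects):
    """Smallest axis-aligned rectangle containing every rectangle given."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


@dataclass
class Node:
    is_leaf: bool
    entries: list  # leaf: (mbr, object_id); non-leaf: (mbr, child_node)


def node_mbr(node):
    """MBR that a parent entry would store for this node."""
    return enclose([mbr for mbr, _ in node.entries])


def search(node, window):
    """Collect the ids of objects whose MBR intersects the window.

    Subtrees whose MBR misses the window are never visited.
    """
    hits = []
    for mbr, item in node.entries:
        if intersects(mbr, window):
            if node.is_leaf:
                hits.append(item)
            else:
                hits.extend(search(item, window))
    return hits


# A hand-built two-level tree over four hypothetical rectangles.
leaf1 = Node(True, [((0, 0, 2, 2), "A"), ((1, 1, 3, 3), "B")])
leaf2 = Node(True, [((10, 10, 12, 12), "C"), ((11, 9, 13, 11), "D")])
root = Node(False, [(node_mbr(leaf1), leaf1), (node_mbr(leaf2), leaf2)])

result = search(root, (2.5, 2.5, 4, 4))
```

For the window (2.5, 2.5, 4, 4) only the first leaf is opened, and within it only B intersects the window; the second leaf is skipped because its MBR does not reach the window. The ids returned are candidates in the filter sense and would still need to be checked against the actual geometries.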
Tree Indexing Data Structures: A Brief Comparison
One would have noticed the ubiquitous nature of the R-tree indexing structure, this is attributed to the
fact that the R-tree and its variant have proven very useful in management of multidimensional database
and handles both points and spatial data efficiently. According to Mamoulis (2012), the R-tree is the
most dominant spatial access method (SAM). The main idea behind the data structure, according to
Guttman (1984), is to group nearby objects and represent each group by its minimum bounding rectangle
(the “R” in R-tree) in the next higher level of the tree. Samet (1995) establishes that the R-tree
and its variants perform remarkably well when applied to indexing arbitrary spatial objects, especially
rectangles (in two (2) or d dimensions). The most important argument in Samet (1995) is that the smallest
rectangle that represents the object under consideration must be axis-aligned (see Baum and Hanebeck
(2010) for more on axis-aligned rectangles). According to Berchtold et al. (2001), the main reason the
R-tree index structure has been widely accepted and used is that it supports both point data and data
with extent, whereas most other structures, such as kdB-trees and grid files, do not support both types
of data concurrently. The R-tree index structure also proves efficient for spatial clustering (a vital
issue in the performance of tree-based indexing structures), since it does not require point
transformation in order to store spatial data. Guttman (1984) describes the functionality of the R-tree
and, most importantly, identifies the basic strategy for building an efficient R-tree structure with
respect to its clustering technique and partitioning strategy. Nevertheless, despite its popularity and
efficiency, this tree-based index structure has a basic limitation: the overlap between the MBRs of
directory nodes increases as the dimension increases. This limits its usefulness and has motivated
research into new methods for enhancing its performance by refining the way the tree is built. The many
specialized indexes proposed to offer better performance than the R-tree for high-dimensional data
include the X-tree, the VA-file, the TV-tree, the SR-tree, the pyramid-technique, the A-tree, the
X+-tree, the PL-tree and the hybrid-tree. Overall, indexing structures for multidimensional spatial
databases, according to Mamoulis (2012), can only perform optimally if they provide an efficient
underlying paradigm for similarity and nearest neighbour (NN) search in high-dimensional space. In
Figure 20, the authors list some of the basic definitions encountered in indexing a spatial database
using a tree data structure.
Figure 19. R-tree for representing the regions
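The grouping idea attributed to Guttman (1984) can be illustrated with a minimal sketch. It is not Guttman's insertion or split algorithm; it only shows, under simplifying assumptions, how a group of axis-aligned rectangles is represented by one MBR and how a window query can skip a whole group with a single MBR test. The names Rect, Leaf, mbr_of and window_query are hypothetical and are introduced here purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; a point is a rectangle with zero extent."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def mbr_of(rects: Sequence[Rect]) -> Rect:
    """Smallest axis-aligned rectangle enclosing every rectangle in the group."""
    return Rect(min(r.xmin for r in rects), min(r.ymin for r in rects),
                max(r.xmax for r in rects), max(r.ymax for r in rects))


@dataclass
class Leaf:
    """One group of nearby objects (an illustrative stand-in for a leaf page)."""
    entries: List[Rect]

    @property
    def mbr(self) -> Rect:
        return mbr_of(self.entries)


def window_query(leaves: Sequence[Leaf], window: Rect) -> Tuple[List[Rect], int]:
    """Return the entries intersecting the window and the number of leaves opened."""
    hits: List[Rect] = []
    opened = 0
    for leaf in leaves:
        if not leaf.mbr.intersects(window):
            continue  # the whole group is pruned by one MBR comparison
        opened += 1
        hits.extend(e for e in leaf.entries if e.intersects(window))
    return hits, opened


# Hypothetical data: two well separated groups of rectangles.
leaves = [
    Leaf([Rect(0, 0, 1, 1), Rect(1, 1, 2, 2)]),
    Leaf([Rect(8, 8, 9, 9), Rect(9, 7, 10, 8)]),
]
window = Rect(0.5, 0.5, 1.5, 1.5)
hits, opened = window_query(leaves, window)
```

In this example only the first group is opened, because the MBR of the second group does not intersect the query window. When the MBRs of sibling groups overlap heavily, as happens in higher dimensions, fewer groups can be pruned in this way.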
ISSUES ON SPATIAL DATABASE
Dimensionality
In some application domains, the underlying data (spatial or not) are made up of sets of objects, and
storing these objects in a database usually involves mapping every object to a feature vector in a
(possibly high-dimensional) vector space. The objects are described using a collection of features that
forms the feature vector, and this serves as the representation of the object in the database (Cazals et al., 2013).
Figure 20. Basic terms and terminologies used in constructing a tree structure for spatial indexing
High dimensionality, that is, the measurement of the degree or size of the feature space according to
Katayama and Satoh (2002) and Samet (2006), is one of the properties of feature spaces. In most cases
queries relate to similarity retrieval from such a feature space of vectors, which in multidimensional
information systems boils down to nearest neighbour search in the space; the challenge that faces such
a data mining task is therefore the management of the dimensionality of the space. Furthermore,
according to Böhm et al. (2001), in multidimensional spatial database management systems (SDBMSs),
high dimensionality is characterized by an excess of attributes, numbering at least 15. Some examples
of application areas where the data is of considerably higher dimensionality (though they may not be
spatial) include pattern recognition and image databases, where the data is made up of sets of objects
and the high dimensionality comes as a result of trying to describe the objects via a collection of
features (also known as a feature vector), for example colour, moments, textures, shape descriptions,
and so on (Samet, 2006). High-dimensional also describes a situation where the number of unknown
parameters to be estimated is one or several orders of magnitude larger than the number of samples in
the data (Bickel et al., 2001). In large spatial databases, high-dimensional data can be seen as data
that is described by a large number of attributes. Where this is the case, the complexity of the
computational process tends to increase as the dimensionality increases, which renders various existing
spatial data mining algorithms ineffective (Assent, 2012). Bouveyron et al. (2007) state that many
scientific domains consider measured observations as high-dimensional, and that clustering in such a
high-dimensional (feature) space always proves difficult because the high-dimensional data are embedded
in different low-dimensional subspaces that are hidden in the original space.
It is worth noting that in a similarity search operation, computing the Euclidean distance between two
points in a high-dimensional space of, for instance, d dimensions involves d multiplication operations
and d − 1 addition operations. Most importantly, the computation requires a definition of what it means
for two objects to be similar, which according to Samet (2006) is not always obvious. Keim et al. (2008)
also ascertained that the feature-based approach has several advantages compared to other approaches for
implementing similarity search: the extraction of features from the source data is usually fast and
easily parametrizable, and metric functions for feature vectors, such as the Minkowski distances, can be
computed efficiently. Novel approaches for computing feature vectors from a wide variety of unstructured
data are proposed regularly. Because the dimensionality of the obtained feature vectors is high in many
practical applications, the X-tree spatial index data structure is a valuable tool for performing
efficient similarity queries in spatial databases. Some examples of high-dimensional variables (mostly
variables with many units or levels), as identified by Mosley (2010), include ZIP code, vehicle
classification, etc. The author also identified the complications associated with these types of
high-dimensional variable, which include:
1. Credibility at individual levels, for example:
a. Convergence errors (from models).
b. Results that do not make sense.
2. Determining proper groupings, for example:
a. Thousands of ZIP codes.
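To make the distance computations mentioned above concrete, the following minimal sketch computes the Minkowski and Euclidean distances between feature vectors and answers a similarity query by a plain linear scan. The function names and the three-dimensional feature vectors are assumptions made only for this illustration; a real system would use an index rather than a scan.

```python
import math
from typing import Sequence


def minkowski(p: Sequence[float], q: Sequence[float], r: float = 2.0) -> float:
    """Minkowski distance of order r; r = 1 is Manhattan, r = 2 is Euclidean."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same dimensionality")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance: one multiplication per dimension, then a square root."""
    return math.sqrt(sum((a - b) * (a - b) for a, b in zip(p, q)))


def nearest(query: Sequence[float], vectors: Sequence[Sequence[float]]):
    """Linear-scan nearest neighbour: every stored vector is compared with the query."""
    return min(vectors, key=lambda v: euclidean(query, v))


# Hypothetical feature vectors (for instance three colour features per image).
database = [(0.9, 0.1, 0.3), (0.2, 0.8, 0.5), (0.85, 0.15, 0.35)]
best = nearest((0.88, 0.12, 0.33), database)
```

The cost of each distance evaluation grows linearly with the number of dimensions d, and the scan repeats it for every stored vector, which is why the dimensionality of the feature space matters for similarity search.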
Handling High Dimensional Data
Although searches through an indexed space usually involve relatively simple comparison tests, searching
in high-dimensional spaces is generally time-consuming according to Samet (2006). Point and range
queries are nevertheless considerably easier to perform (from the standpoint of computational
complexity) than similarity queries, because point and range queries do not involve the computation of
distance. Suggestions have been made regarding the best possible ways to handle the problems of high
dimension in large databases. Cazals et al. (2013) suggest several methods for handling high-dimensional
datasets, including: i) dimension reduction, ii) embedding methods, iii) clustering, and iv) nearest
neighbour methods.
Dimension Reduction (DR)
Dimension reduction has been seen as one major way of tackling the problem of high dimension in largely
correlated spatial data. Cazals et al. (2013) describe dimension reduction as the process of computing a
mapping function f from the high-dimensional feature space R^m to a lower-dimensional space R^k (upon
which a spatial data structure is built), with the goal of preserving the properties of the feature
vectors that are relevant for the application at hand. According to Parsons et al. (2004), dimension
reduction involves two main kinds of task:
1. Feature Selection: This technique selects only the most significant dimensions of a dataset that
shows groups of objects that are similar on only a subset of their attributes. Although this method of
dimension reduction has difficulty when clusters are found in different subspaces, it is quite
successful on many datasets.
2. Feature Transformation: This technique tries to summarize a dataset using fewer dimensions by
creating combinations of the original attributes. The technique is very successful in uncovering hidden
structure in large datasets. Nevertheless, it preserves the relative distances between objects, which
makes it less effective when there are large numbers of irrelevant attributes that hide the genuine
clusters in deep noise.
Another dimension reduction technique, according to Kushilevitz et al. (2000), is principal component
analysis (PCA), a statistical technique that uses a transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated variables called principal
components.
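As a hedged illustration of feature transformation, the sketch below estimates the first principal component of a small dataset and projects the data onto it, i.e. a mapping from R^m to R^1. It uses power iteration on the sample covariance matrix, which is one simple way to obtain the leading component and is not claimed to be the method used in the cited works. The function name is hypothetical, and the sketch assumes that the start vector is not orthogonal to the leading direction and that the data are not constant.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(
    rows: Sequence[Sequence[float]], iterations: int = 200
) -> Tuple[List[float], List[float]]:
    """Return (unit direction, projected scores) of the leading principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the m attributes.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration: repeated multiplication converges to the dominant eigenvector.
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centred]
    return v, scores


# Hypothetical, perfectly correlated data: the second attribute is 2x + 1.
rows = [(float(x), 2.0 * x + 1.0) for x in range(5)]
direction, scores = first_principal_component(rows)
```

For this example the two correlated attributes collapse onto a single direction proportional to (1, 2), so one coordinate per object is enough to reproduce the relative positions of the objects.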
Embedding Methods
These methods for handling high dimensionality deal with data embedding, i.e., they embed the objects
into a vector space within which a distance metric approximating the original one can be used (Samet,
2006). This also tends to ease the calculation of (costly) distances between spatial objects (Cazals et
al., 2013). Weinberger and Saul (2006) observe that automatically discovering clever representations
(which are used to simplify problems in some application areas using symbolic input) from large amounts
of unlabelled data remains a fundamental challenge; they therefore examine and propose an algorithm that
faithfully learns low-dimensional representations of high-dimensional data. The algorithm relies on
modern tools in convex optimization, which are available in non-linear embedding methods. According to
Cazals et al. (2013), non-linear embedding methods for handling high dimensionality lead to
easy-to-implement polynomial-time algorithms, and they prove more efficient with larger data sets than
the ones usually involved in iterative or greedy methods (e.g. those involving EM or EM-like algorithms).
The majority of data embedding methods are isometric. In other words, they take a metric space
(inter-point distances) as input and try to embed the data in a low-dimensional Euclidean space such
that the inter-point distances are preserved as much as possible (Yang 2006). Semidefinite embedding
(SDE), also known as maximum variance unfolding (MVU), is an embedding method (a non-linear
dimensionality reduction technique) that attempts to map high-dimensional data onto a low-dimensional
Euclidean vector space using semidefinite programming. The main intuition behind MVU (Weinberger and
Saul, 2004; 2006) is to exploit the local linearity of manifolds and create a mapping that preserves
local neighbourhoods at every point of the underlying manifold. The method can be used to analyse
high-dimensional data that lies on or near a low-dimensional manifold.
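The general idea of embedding objects into a vector space whose distance approximates the original one can be sketched with a much simpler technique than MVU: a pivot-based (Lipschitz-style) embedding. Each object is mapped to the vector of its distances to a few reference objects (pivots). By the triangle inequality, the L-infinity distance between two embedded vectors never exceeds the original distance, so it can serve as a cheap lower bound. The Hamming metric, the example strings and all function names below are assumptions chosen only for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ (a metric)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def embed(obj: str, pivots: Sequence[str],
          dist: Callable[[str, str], float]) -> Tuple[float, ...]:
    """Map an object to the vector of its distances to the chosen pivots."""
    return tuple(dist(obj, p) for p in pivots)


def chebyshev(u: Sequence[float], v: Sequence[float]) -> float:
    """L-infinity distance between two embedded vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical objects and pivots.
objects = ["karolin", "kathrin", "kerstin", "carolin"]
pivots = ["karolin", "kerstin"]
vectors = {o: embed(o, pivots, hamming) for o in objects}

# The embedded distance is a lower bound on the original distance for every pair.
lower_bound_holds = all(
    chebyshev(vectors[a], vectors[b]) <= hamming(a, b)
    for a, b in combinations(objects, 2)
)
```

Because the embedded distance never overestimates the original one, a candidate whose embedded distance to the query already exceeds the search radius can be discarded without evaluating the costly original distance.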
Clustering Methods
Improvements to traditional clustering algorithms address problems such as the curse of dimensionality
and the sparsity of data with multiple attributes (Paithankar and Tidke, 2015). According to Steinbach
et al. (2004), clustering methods are used to partition a large dataset into groups with similar
functionality as a means of data compression. The process of finding clusters reveals meaningful
homogeneous groups of objects embedded in subspaces of high-dimensional data and also helps to extract
the relevant information from such a high-dimensional dataset (Paithankar and Tidke, 2015). The essence
of this task is to achieve scalability, proper understanding of data mining results, and insensitivity
to the order of input records in a database management system (Agrawal et al., 1998). According to
Berchtold et al. (1997), an object or a data record typically has dozens of attributes, and the domain
of each attribute can be large. It is therefore not meaningful to look for clusters in such a
high-dimensional space, as the average density of points anywhere in the data space is likely to be
quite low. Thus, according to IBM (1996), this problem of high dimensionality is often addressed by
requiring the user to specify the subspace (a subset of the dimensions) for a given cluster analysis,
bearing in mind that user identification of subspaces is quite prone to error. In a high-dimensional
dataset (having millions of data points that exist in many thousands of dimensions, i.e. having many
attributes, and representing many thousands of clusters), dimensionality can be handled by performing a
two-stage clustering (an ensemble method): the data are first divided into a group of overlapping
subsets using one distance metric, and a different distance measure is then used to estimate accurate
clusters (McCallum et al., 2000). Another way of handling high dimensionality with a clustering
algorithm, according to Paithankar and Tidke (2015), is to combine the two basic methods suggested in
this section (subspace clustering and ensemble clustering).
It is worth noting the difference between the two classes of datasets that one can encounter within
a spatial database:
1. Low dimensional dataset:
a. Limited number of clusters.
b. Low feature dimensionality (few attributes for each object).
c. Small number of data points.
2. High dimensional dataset:
a. Large number of data points.
b. Many thousands of dimensions, i.e. having many attributes.
c. Many thousands of clusters.
Generally, in high dimensional spaces, the data are inherently sparse and the distance between each
pair of points is almost the same for a wide variety of data distributions and distance functions (Muller
et al., 2009).
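The near-equality of pairwise distances in high-dimensional spaces can be observed with a small experiment. The sketch below draws uniformly distributed random points, which is an assumption made for the illustration and not a claim about any particular spatial dataset, and reports the relative contrast between the farthest and the nearest point from a random query. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 500, seed: int = 0) -> float:
    """(d_max - d_min) / d_min for distances from a random query to random points."""
    rng = random.Random(seed)  # fixed seed so that the experiment is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(2)     # few dimensions: nearest and farthest differ strongly
high = relative_contrast(200)  # many dimensions: the distances concentrate
```

In such an experiment the contrast in two dimensions is typically much larger than one, while in two hundred dimensions it falls well below one, meaning that the farthest point is less than twice as far away as the nearest one.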
Nearest Neighbour Methods
Hinneburg et al. (2000) argued that there can be several reasons why nearest neighbour search becomes
meaningless in high-dimensional space, especially the unavoidable sparsity of data objects in the
space. Moreover, Beyer et al. (1999) supported this premise by claiming that in high-dimensional space
all pairs of points are almost equidistant from one another for a wide range of data distributions and
distance functions, thereby giving rise to the problem of instability. However, it is also important to
know that searching for a nearest neighbour among a specified database of points is a fundamental
computational task that arises in a variety of application areas (including information retrieval,
data mining, pattern recognition, machine learning, computer vision, data compression, and statistical
data analysis) where the database points are represented as vectors in some high-dimensional space
(Kushilevitz et al., 2000). High-dimensional nearest neighbour problems arise naturally when complex
objects are represented by vectors of d numeric features (Arya et al., 1994). Therefore, one solution
to the problem of high dimensionality using nearest neighbour search techniques applies an
approximation of the nearest neighbour query as a method for high-dimensional filtering, where queries
are compared against their most relevant candidates. The seeming difficulty of obtaining algorithms
that are efficient in the worst case with respect to both space and query time for dimensions higher
than 2, according to Arya et al. (1994) and Cazals et al. (2013), suggests that the alternative approach
of finding approximate nearest neighbours (which does not require heavy resources and is a less
demanding computational task) is worth considering. A description of the approximate nearest neighbour
object search is given in Figure 21.
Curse of Dimensionality: What Is It?
The curse of dimensionality, a term originally coined by Bellman (1957), expresses the fact that the
number of samples needed to estimate an arbitrary function with a given level of accuracy grows
exponentially with the number of variables (dimensions) that it comprises.
Figure 21. Approximated nearest neighbour model
This means that the number of objects (points) in the dataset that need to be examined in deriving the
estimates in a similarity search (i.e., finding nearest neighbours) grows exponentially with the
underlying dimension, thereby giving rise to the question “is nearest neighbour search meaningful in
such a domain?” The curse of dimensionality can be simply described as a situation whereby an extra
dimension is added to an already existing Euclidean space. Kouiroukidis and Evangelidis (2011) also
define the curse of dimensionality as a phenomenon whereby, in high-dimensional spaces, the distances
from a query point to its nearest and farthest points become almost equal, so that nearest neighbour
calculations cannot discriminate between candidate points. Although this problem is faced by many
database systems, suggestions have been put forward for fighting the curse of dimensionality. Ding
(2007) suggested feature selection using SVM-RFE (Support Vector Machine-Recursive Feature
Elimination) as a solution to the problem. More generally, according to Verleysen and François (2005),
the curse of dimensionality is the expression of all phenomena that appear with high-dimensional data
and that most often have unfortunate consequences for the behaviour and performance of learning
algorithms.
Optimizing Spatial Databases
Optimizing spatial databases is one of the most important aspects of working with large volumes of data.
Some of the issues to consider when trying to optimise a large spatial database include reducing the
access rate and access times to external memory, reading and writing in multiples of pages, and
reducing the number of disk accesses. Figure 22 illustrates different possible ways of optimizing a
spatial database. As suggested by Samet (2010), some of the ways of improving the behaviour of search
in a large spatial database include:
Goal 1: Minimizing the number of children of a node that must be visited by search operations (i.e.
minimizing the area common to the children (overlap)).
Goal 2: Reducing the likelihood of each node (say q) being visited by the search (i.e. minimizing the
total area spanned by the bounding box of q (coverage)).
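The two goals can be made measurable with a small sketch that compares two alternative ways of grouping the same rectangles into nodes. Coverage is taken here as the summed area of the node bounding boxes and overlap as the summed pairwise intersection area of those boxes; these exact definitions, the rectangle coordinates and the function names are assumptions adopted for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(rects: Sequence[Rect]) -> Rect:
    """Bounding box of a group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap(a: Rect, b: Rect) -> float:
    """Area common to two rectangles (zero when they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def coverage(groups: Sequence[Sequence[Rect]]) -> float:
    """Goal 2: total area spanned by the bounding boxes of all nodes."""
    return sum(area(mbr(g)) for g in groups)


def total_overlap(groups: Sequence[Sequence[Rect]]) -> float:
    """Goal 1: total area shared between the bounding boxes of sibling nodes."""
    boxes: List[Rect] = [mbr(g) for g in groups]
    return sum(overlap(a, b) for a, b in combinations(boxes, 2))


# Four hypothetical unit squares placed at the corners of a 4 x 4 region.
A, B, C, D = (0, 0, 1, 1), (0, 3, 1, 4), (3, 0, 4, 1), (3, 3, 4, 4)
by_column = [[A, B], [C, D]]   # neighbours grouped together
diagonal = [[A, D], [B, C]]    # distant objects grouped together
```

Grouping by column gives a coverage of 8 and no overlap, whereas the diagonal grouping gives a coverage of 32 and an overlap of 16, so a search is far more likely to have to visit both nodes in the second case.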
In addition to the methods mentioned above, Katayama and Satoh (2002) suggested another way to optimise
spatial databases: reducing the number of disk page accesses, because reading from and writing to
external storage in disk-based data structures is much slower than doing the same in main memory. Bulk
loading disk-based data structures is another way of optimizing a spatial database, because external
storage is organised in fixed data blocks of typically 4 to 16 KB. A good bulk loading method, according
to Mamoulis (2012; 2011), builds quickly for static objects and ensures a smaller amount of wasted empty
space on the tree pages. Giao and Anh (2015) identified the advantages of bulk loading a tree structure
as follows: 1. faster loading of the tree with all spatial objects at once; 2. reduced empty space in
the nodes of the tree; and 3. better splitting of spatial objects into nodes of the tree. Belciu and
Olaru (2011) added that the best way to improve the optimization of spatial databases is through
spatial indexes. Using spatial indexes such as the grid index, Z-order, quadtree, octree, UB-tree,
R-tree, kd-tree, M-tree etc. improves the performance and integrity of spatial databases in terms of
storage and time costs. According to Lungu and Velicanu (2009), indexing spatial data is a way to
decrease the number of searches, and a spatial index (considered logical) is used to locate objects in
the same area of data (window query) or from different locations. Using approximations, spatial
indexing methods organize space and the objects in it in such a way that only parts of the space and a
subset of the objects need to be considered to answer such a query (Güting, 1994).
Figure 22. Different way of optimizing a spatial database
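Bulk loading can be sketched with one pass of the Sort-Tile-Recursive (STR) packing idea named in the title of Giao and Anh (2015). The sketch is a simplified illustration and not the improved algorithm of those authors: it builds only the leaf level, assumes two-dimensional rectangles that fit in memory, and uses a hypothetical page capacity. Upper levels would be obtained by applying the same step to the bounding boxes of the leaves.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def centre(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)


def str_pack_leaves(rects: List[Rect], capacity: int) -> List[List[Rect]]:
    """Pack rectangles into leaf pages holding at most `capacity` entries each."""
    if capacity < 1:
        raise ValueError("capacity must be positive")
    if not rects:
        return []
    n_leaves = math.ceil(len(rects) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))
    slice_size = n_slices * capacity

    # Step 1: sort by x-centre and cut the sorted list into vertical slices.
    by_x = sorted(rects, key=lambda r: centre(r)[0])
    leaves: List[List[Rect]] = []
    for i in range(0, len(by_x), slice_size):
        # Step 2: inside each slice, sort by y-centre and fill pages in order.
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda r: centre(r)[1])
        for j in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[j:j + capacity])
    return leaves


# Sixteen hypothetical point objects on a 4 x 4 grid, four entries per page.
points: List[Rect] = [(float(x), float(y), float(x), float(y))
                      for x in range(4) for y in range(4)]
pages = str_pack_leaves(points, capacity=4)
```

For this input every page is completely full and each page holds spatially adjacent objects, which corresponds to the advantages listed above: the tree is loaded in one pass, little page space is wasted, and nearby objects share a node.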
Autocorrelation
Autocorrelation is basically a measure of the similarity or interdependence of an object in space with
surrounding objects. According to Samson et al. (2014), the presence of spatial autocorrelation and the
fact that continuous data types are always present in spatial data make it important to create methods,
tools and algorithms to mine spatial patterns in a complex spatial data set. Jerrett et al. (2003) note
that failure to control for autocorrelation can lead to false positive significance tests and may
indicate bias resulting from a missing variable or group of variables. Analysing spatial
autocorrelation is an effective way of systematically ascertaining spatial patterns. According to
Legendre and Fortin (1989), spatial autocorrelation frequently occurs in ecological data, and many
ecological theories and models implicitly assume an underlying spatial pattern in the distributions of
organisms and their environment. Autocorrelation arises from the fact that elements of a given
population or community (or even the geographic/social environment as a whole) that are close to one
another in space or time are more likely to be influenced by the same generating process. According to
Chen et al. (2011), spatial autocorrelation shows the correlation of a variable with itself through
space. Rossi and Queneherve (1998) and Legendre (1993) acknowledged that spatial autocorrelation
measures the similarity between samples for a given variable as a function of spatial distance. As seen
by Fortin and Dale (2009), spatial autocorrelation simply portrays the self-dependence of spatial data
(meaning that the individual observations made from the chosen samples include information present in
other observations, so that the effective sample size, say n, is less than the number of observations,
m). According to them, this dependence poses a great problem: when it is positive it affects the
significance rates of statistical tests, and as such it must be corrected in order to produce a better
measurement of goodness-of-fit.
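One widely used statistic for quantifying spatial autocorrelation is Moran's I. The chapter does not name a specific measure, so the choice of Moran's I is an assumption made for this illustration, as are the function names and the binary contiguity weights for sites arranged along a line. Values above zero indicate that neighbouring sites carry similar values, and values below zero that they carry dissimilar values.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Moran's I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_total) * (cross / sum(d * d for d in dev))


def chain_weights(n: int) -> List[List[float]]:
    """Binary contiguity weights for n sites on a line: adjacent sites are neighbours."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
gradient = morans_i([1, 2, 3, 4, 5, 6], w)         # similar neighbours
alternating = morans_i([1, -1, 1, -1, 1, -1], w)   # dissimilar neighbours
```

For the smoothly increasing values the statistic is 0.6 (positive autocorrelation), while for the alternating values it is -1.0 (negative autocorrelation).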
DISCUSSION
The most basic spatial query types are spatial selection, nearest neighbour search, and spatial joins.
Extending a DBMS to support spatial data requires changes at all layers: data modelling, query
languages, storage and indexing, query evaluation and optimization, transaction management, etc.
Nowadays more users are interested in retrieving information related to the locations and geometric
properties of spatial objects. Users of mobile devices may want to find the nearest hotel to their
location; astronomers may want to study the spatial relationships among objects of the universe; army
commanders may want to schedule the movements of their troops according to the geography of the field;
and scientists may want to study the effects of object positions and relationships in a 2D/3D space on
some scientific or social fact (e.g., spatial analysis of protein structures, or the relationship
between the residence of subjects and their psychic behaviour) (Mamoulis 2012). The efficiency of
searching depends on the extent to which the underlying data is sorted; in other words, an appropriate
index structure must be put in place in order to achieve optimal database efficiency (Samet, 2009). In
the search and retrieval process of a spatial database for handling data queries, according to Mamoulis
(2012), two major problems are most likely:
• First, the geometry of the objects could be too complex; therefore, testing a query predicate against
each object in a database would result in a high computational cost.
• Secondly, exhaustively testing all objects of the relation against a spatial query predicate requires
a significant amount of I/O operations for large databases.
SOLUTIONS
1. The first problem is mainly handled by storing, together with the exact geometry of each spatial
object, an appropriate spatial approximation such as the minimum bounding rectangle (MBR). The MBR of
an object is the minimum rectangle that encloses the geometric extent of the object, and therefore also
the convex hull covering its geometry. Thus, in order to overcome the first problem mentioned above:
a. First, the query predicate is tested against the MBR of each object (the filter step); if the MBR
passes the filter step, the refinement step is applied. Filtering selects the objects whose minimum
bounding box satisfies the spatial predicate; intersection, adjacency and containment are examples of
spatial predicates that a pair of objects to be joined must satisfy. The filtering step consists of
traversing the index and applying the spatial test to the MBRs. An MBR might satisfy a query predicate
whereas the exact geometry does not. (In other words, filtering approximates a spatial object’s natural
geometry by a minimum bounding box and then extracts the bounding boxes that, for instance, intersect a
given query window.)
b. Then the exact geometry of the object is tested against the query predicate (the refinement step).
The refinement stage visits each spatial object whose stored MBR was extracted by filtering and tests
the specific geometry of the object against the query predicate. The final spatial test is thus done
on the actual geometries of the objects whose MBRs satisfy the filter step.
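The filter and refinement steps can be sketched for a simple point query over polygons. The filter step compares the query point with precomputed MBRs only, and the refinement step applies an exact point-in-polygon test (here the even-odd ray-casting rule, which does not treat boundary points specially) to the surviving candidates. The polygons, the dictionary layout and all names are hypothetical; a real system would obtain the candidates by traversing a spatial index rather than by scanning a dictionary.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(polygon: Sequence[Point]) -> Box:
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(box: Box, p: Point) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def polygon_contains(polygon: Sequence[Point], p: Point) -> bool:
    """Exact test: count crossings of a horizontal ray from p with the polygon edges."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the height of the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(objects: Dict[str, Sequence[Point]], boxes: Dict[str, Box],
                p: Point) -> Tuple[List[str], List[str]]:
    """Return (candidates after filtering, results after refinement)."""
    candidates = [name for name, box in boxes.items() if mbr_contains(box, p)]
    results = [name for name in candidates if polygon_contains(objects[name], p)]
    return candidates, results


objects = {
    "l_shape": [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)],
    "triangle": [(2, 2), (4, 2), (4, 4)],
    "square": [(5, 5), (7, 5), (7, 7), (5, 7)],
}
boxes = {name: mbr(poly) for name, poly in objects.items()}  # stored approximations
candidates, results = point_query(objects, boxes, (3.5, 2.5))
```

Here the L-shaped polygon passes the filter step, because its MBR contains the query point, but it is rejected by the refinement step; this is the case of an MBR satisfying the predicate while the exact geometry does not. The square is discarded by the cheap MBR test alone, so its exact geometry is never examined.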
For spatial databases, many indexing methods that try to cope with the dimensionality curse in
high-dimensional spaces have been proposed, but these methods usually end up behaving like a sequential
scan over the database in terms of accessed pages when queries such as k-nearest neighbours (kNN) are
examined. Kouiroukidis and Evangelidis (2011) examined multi-attribute indexing methods and
investigated when these methods reach their limits, namely, at what dimensionality a kNN query requires
visiting all the data pages.
CONCLUSION
More users these days are interested in retrieving information related to the locations and geometric
properties of spatial objects. Unlike classical relational database management systems, which store
data as numbers, letters, alphanumeric strings or even symbols, spatial databases store abstract data
types (ADTs) such as points, lines, polygons, coordinates, topology or other data types that can be
mapped. Traditional database systems have great difficulty coping with these kinds of data, because
they have been customised for fixed-length data of very simple internal structure. It is quite tedious
to build a spatial database directly using a classical database model; the traditional system therefore
needs to be extended and equipped to manage and support spatial data types (or data models), spatial
functions, and spatial indexes. The essential features that distinguish spatial data from alphanumeric
data include a complex internal structure and an arbitrary finite representation of shapes. Thus,
domain-specific knowledge is necessary for traditional/classical databases to be able to support
non-standard database applications. In this chapter (for the sake of efficient information retrieval)
we have examined spatial databases, their architecture, optimization techniques, and methods of
improving their internal and optimal storage capacity. Because of the non-linearity that exists in
large spatial data sets, an effective data structure that can handle the branched structures within a
given spatial dataset is required. These complex spatial dataset behaviours are better represented
using graphs and trees, because larger datasets are always made up of other minor events or objects and
are difficult to order into sequences. The main goal of indexing is to optimize the speed of database
queries in order to facilitate the information retrieval process and also to improve database integrity
and general performance. Developing a spatial database normally starts with developing a data model
(which relates to the conceptual, logical and physical data modelling) that describes the contents of
the database and their relationships. To achieve this, data analysts constantly seek an appropriate
data structure that efficiently stores the object data in the database and allows for better and more
efficient database management.
REFERENCES
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high
dimensional data for data mining applications. ACM.
Ajit, S., & Deepak, G. (2011). Implementation and Performance Analysis of Exponential Tree Sorting.
International Journal of Computer Applications, 24(3), 34-38.
Akinyemi, F. A. (2010). Conceptual Poverty Mapping Data Model. Transactions in GIS, 14, 85–100.
doi:10.1111/j.1467-9671.2010.01207.x
Arya, S., Mount, D. M., Netanyahu, N., Silverman, R., & Wu, A. Y. (1994, January). An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proc. 5th ACM-SIAM Sympos.
Discrete Algorithms (pp. 573-582).
Asaeedi, S., Didehvar, F., & Mohades, A. (2013). Alpha-Concave Hull, a Generalization of Convex
Hull. arXiv preprint arXiv:1309.7829
Baum, M., & Hanebeck, U. D. (2010, September). Tracking a minimum bounding rectangle based on
extreme value theory. In Multisensor Fusion and Integration for Intelligent Systems (MFI), 2010 IEEE
Conference on (pp. 56-61). IEEE. doi:10.1109/MFI.2010.5604456
Belciu, A. V., & Olaru, S. (2011). Optimizing spatial databases. Available at SSRN 1800758
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Berchtold, S., & Keim, D. A. (1996). The X-tree: An Index Structure for High-Dimensional Data. Proceedings of the 22nd VLDB Conference.
Berchtold, S., Keim, D. A., & Kriegel, H. P. (2001). An index structure for high-dimensional data. Readings in multimedia computing and networking.
Berchtold, S., Bohm, C., Keim, D., & Kriegel H.-P. (1997). A cost model for nearest neighbor search in
high-dimensional data space. In Proceedings of the 16th Symposium on Principles of Database Systems
(PODS).
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is Nearest Neighbors Meaningful?
Proc. of the Int. Conf. Database Theory, (pp. 217-235).
Böhm, C., Berchtold, S., & Keim, D. A. (2001). Searching in high-dimensional spaces: Index structures
for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373.
doi:10.1145/502807.502809
Bolstad, P. (2002). GIS Fundamentals: A First Text on GIS. Eider Press.
Bourne, M. (2015). Applications of integration. Retrieved from http://www.intmath.com/applicationsintegration/5-centroid-area.php
Candan, K. S., & Sapino, M. L. (2010). Data management for multimedia retrieval. Cambridge University
Press. doi:10.1017/CBO9780511781636
Cazals, F., Emiris, I. Z., Chazal, F., Gärtner, B., Lammersen, C., Giesen, J., & Rote, G. (2013). D2. 1:
Handling High-Dimensional Data. Computational Geometric Learning (CGL) Technical Report No.:
CGL-TR-01.
Census. (2011). Wards in Yorkshire and the Humber. Available at
http://ukdataexplorer.com/census/yorkshireandthehumber/#KS206EW0007
Chen, L., & Brown, S. D. (2014). Use of a tree‐structured hierarchical model for estimation of location
and uncertainty in multivariate spatial data. Journal of Chemometrics, 28(6), 523–538.
doi:10.1002/cem.2611
Ding, Y. (2007). Handling complex, high dimensional data for classification and clustering. University
of Mississippi.
Dröge, G., & Schek. (1993). Query-Adaptive Data Space Partitioning Using Variable-Size Storage Clusters. Proc. 3rd Intl. Symposium on Large Spatial Databases.
Eberly, D. (2015). Minimum-Area Rectangle Containing a Set of Points. Geometric Tools, LLC. Retrieved
from http://www.geometrictools.com/
Efunda. (2016). Solids: Centre of Mass. Available at http://www.efunda.com/math/solids/CenterOfMass.cfm
Ernest, R., & Djaoen, S. (2015). Introduction to SQL Server Spatial Data. Retrieved from
https://www.simple-talk.com/sql/t-sql-programming/introduction-to-sql-server-spatial-data
Ester, M., Kriegel, H. P., & Sander, J. (1997). Spatial data mining: A database approach. In Advances in
spatial databases (pp. 47–66). Springer Berlin Heidelberg. doi:10.1007/3-540-63238-7_24
Ester, M., Kriegel, H. P., & Sander, J. (1999). Knowledge discovery in spatial databases. Springer Berlin
Heidelberg.
Fortin, M. J., & Dale, M. R. (2009). Spatial autocorrelation in ecological studies: A legacy of solutions
and myths. Geographical Analysis, 41(4), 392–397. doi:10.1111/j.1538-4632.2009.00766.x
Freeman, H., & Shapira, R. (1975). Determining the minimum-area encasing rectangle for an arbitrary
closed curve. Communications of the ACM, 18(7), 409–413. doi:10.1145/360881.360919
GADM. (2009). Global Administrative Areas: Boundaries without limit. Available at
http://www.gadm.org/download
Ganguly, A. R., & Steinhaeuser, K. (2008). Data mining for climate change and impacts. In ICDM
Workshops. doi:10.1109/ICDMW.2008.30
Giao, B. C., & Anh, D. T. (2015). Improving Sort-Tile-Recusive algorithm for R-tree packing in indexing time series. In Computing & Communication Technologies-Research, Innovation, and Vision for the
Future (RIVF), 2015 IEEE RIVF International Conference on (pp. 117-122). IEEE.
Güting, R. H. (1994). An introduction to spatial database systems. The VLDB Journal—The International
Journal on Very Large Data Bases, 3(4), 357-399.
Güting, R. H., & Schneider, M. (1993). Realms: A foundation for spatial data types in database systems.
In Advances in Spatial Databases (pp. 14-35). Springer Berlin Heidelberg. doi:10.1007/3-540-56869-7_2
Hinneburg, A., Aggarwal, C. C., & Keim, D. A. (2000). What is the nearest neighbor in high dimensional
spaces? In 26th Internat. Conference on Very Large Databases (pp. 506-515).
International Business Machines. (1996). IBM Intelligent Miner User’s Guide, Version 1 Release 1,
SH12-6213-00 edition. Author.
Jagadish, H. V., Ooi, B. C., Vu, Q. H., Zhang, R., & Zhou, A. (2006). Vbi-tree: A peer-to-peer framework
for supporting multi-dimensional indexing schemes. In Data Engineering, 2006. ICDE’06. Proceedings
of the 22nd International Conference on (pp. 34-34). IEEE. doi:10.1109/ICDE.2006.169
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys,
31(3), 264–323. doi:10.1145/331499.331504
Jerrett, M., Burnett, R., Willis, A., Krewski, D., Goldberg, M., DeLuca, P., & Finkelstein, N. (2003). Spatial
Analysis of the Air Pollution Mortality Relationship in the Context of Ecologic Confounders. Journal of
Toxicology and Environmental Health. Part A, 66(16-19), 1735–1778. doi:10.1080/15287390306438
PMID:12959842
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Katayama, N., & Satoh, S. (2002). Experimental evaluation of disk based data structures for nearest
neighbour searching. AMS DIMACS Series, 59, 87.
Kouiroukidis, N., & Evangelidis, G. (2011). The effects of dimensionality curse in high dimensional
knn search. In Informatics (PCI), 2011 15th Panhellenic Conference on (pp. 41-45). IEEE. doi:10.1109/PCI.2011.45
Kushilevitz, E., Ostrovsky, R., & Rabani, Y. (2000). Efficient search for approximate nearest neighbor in
high dimensional spaces. SIAM Journal on Computing, 30(2), 457–474. doi:10.1137/S0097539798347177
Legendre, P. (1993). Spatial autocorrelation: Trouble or new paradigm? Ecology, 74(6), 1659–1673.
doi:10.2307/1939924
Legendre, P., & Fortin, M. J. (1989). Spatial pattern and ecological analysis. Vegetatio, 80(2), 107–138.
doi:10.1007/BF00048036
Lifshits, Y., & Zhang, S. (2009). Combinatorial algorithms for nearest neighbors, near-duplicates and
small-world design. In Proc. SODA. doi:10.1137/1.9781611973068.36
Lungu, I., & Velicanu, A. (2009). Spatial Database Technology Used In Developing Geographic Information Systems. The 9th International Conference on Informatics in Economy – Education, Research
& Business Technologies. Academy of Economic Studies, Bucharest.
Mamoulis, N. (2012). Spatial data management (1st ed.). Morgan & Claypool Publishers.
McCallum, A., Nigam, K., & Ungar, L. H. (2000, August). Efficient clustering of high-dimensional data
sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 169-178). ACM. doi:10.1145/347090.347123
Mosley, R. C. (2010). Handling High Dimensional Variables. Pinnacle Actuarial Resources, Inc.
Muller, E., Gunnemann, S., Assent, I., & Seidl, T. (2009). Evaluating Clustering in Subspace Projections
of High Dimensional Data. VLDB ’09. Lyon, France: VLDB Endowment.
Paithankar, R., & Tidke, B. (2015). A H-K Clustering Algorithm for High Dimensional Data Using
Ensemble Learning. arXiv preprint arXiv:1501.02431.
Papadias, D., Zhang, J., Mamoulis, N., & Tao, Y. (2003). Query processing in spatial network databases.
In Proceedings of the 29th international conference on Very large data bases (vol. 29, pp. 802-813).
VLDB Endowment.
Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM
SIGKDD Explorations Newsletter, 6(1), 90–105. doi:10.1145/1007730.1007731
Patel, P., & Garg, D. (2012). Comparison of Advance Tree Data Structures. arXiv preprint arXiv:1209.6495.
Paul, E. B. (2008). Point access method. In Dictionary of Algorithms and Data Structures. Available
from: http://www.nist.gov/dads/HTML/pointAccessMethod.html
Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). New York: McGraw-Hill.
Rigaux, P., Scholl, M., & Voisard, A. (2003). Spatial Databases with Application to GIS. SIGMOD
Record, 32(4), 111.
Rossi, J. P., & Quénéhervé, P. (1998). Relating species density to environmental variables in presence
of spatial autocorrelation: A study case on soil nematodes distribution. Ecography, 21(2), 117–123.
doi:10.1111/j.1600-0587.1998.tb00665.x
Samet, H. (1995). Spatial data structures, Modern database systems: the object model, interoperability,
and beyond. New York, NY: ACM Press/Addison-Wesley Publishing Co.
Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
Samet, H. (2009). Sorting spatial data by spatial occupancy. In GeoSpatial Visual Analytics (pp. 31–43).
Springer Netherlands.
Samet, H. (2010, December). Sorting in space: multidimensional, spatial, and metric data structures for computer graphics applications. In ACM SIGGRAPH ASIA 2010 Courses (p. 3). ACM.
doi:10.1145/1900520.1900523
Samson, G. L., Lu, J., & Showole, A. A. (2014). Mining Complex Spatial Patterns: Issues and Techniques.
Journal of Information & Knowledge Management, 13(02), 1450019. doi:10.1142/S0219649214500191
Samson, G. L., Lu, J., Wang, L., & Wilson, D. (2013). An approach for mining complex spatial dataset.
Proceeding of Int’l Conference on Information and Knowledge Engineering. Retrieved from http://worldcompproceedings.com/proc/proc2013/ike/IKE_Papers.pdf
Schneider, M. (1997). Spatial Data Types for Database Systems - Finite Resolution Geometry for Geographic information systems. LNCS, 1288.
Schneider, M. (1999). Spatial Data Types: Conceptual Foundation for the Design and Implementation of
Spatial Database Systems and GIS. In Proceedings of 6th International Symposium on Spatial Databases.
Shekhar, S., & Chawla, S. (2003). Spatial databases: A tour. Upper Saddle River, NJ: Prentice Hall.
Shekhar, S., Chawla, S., Ravada, S., Fetterer, A., Liu, X., & Lu, C. (1999). Spatial Databases - Accomplishments and Research Needs. IEEE Transactions on Knowledge and Data Engineering, 11(1),
45–55. doi:10.1109/69.755614
Shekhar, S., Evans, M. R., Kang, J. M., & Mohan, P. (2011). Identifying patterns in spatial information:
A survey of methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3),
193–214.
Singh, S. P., & Singh, P. (2014). Modelling a Geo-Spatial Database for Managing Travelers’ Demand.
International Journal of Database Management Systems, 6(2), 3–47. doi:10.5121/ijdms.2014.6203
Steinbach, M., Ertöz, L., & Kumar, V. (2004). The challenges of clustering high dimensional data. In
New directions in statistical physics (pp. 273–309). Springer Berlin Heidelberg. doi:10.1007/978-3-662-08968-2_16
Stonebraker, M., Rowe, L. A., Lindsay, B. G., Gray, J., Carey, M. J., Brodie, M. L., & Beech, D. et al. (1990).
Third-generation database system manifesto. SIGMOD Record, 19(3), 31–44. doi:10.1145/101077.390001
Velicanu, A., & Olaru, S. (2010). Optimizing Spatial Databases. Informatica Economica, 14(2), 61–71.
Verleysen, M., & François, D. (2005, June). The curse of dimensionality in data mining and time series
prediction. In International Work-Conference on Artificial Neural Networks (pp. 758–770). Springer
Berlin Heidelberg. doi:10.1007/11494669_93
Weinberger, K. Q., & Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite
programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR-04).