
Spatial Databases: An Overview

Spatial databases maintain spatial information and are appropriate for applications that need to monitor the position of an object or event in space. They describe the fundamental representation of the objects of a dataset derived from spatial or geographic entities. A spatial database supports aspects of space and offers spatial data types in its data model and query language. The spatial or geographic referencing attributes of the objects in a spatial database permit them to be positioned within a two-dimensional (2D) or three-dimensional (3D) space. This chapter looks into the fundamentals of spatial databases and describes their basic components, operations and architecture. The study focuses on the data models, query language, query processing, indexes and query optimization of spatial databases, which establish them as a necessary tool for the storage and retrieval of multidimensional data in high-dimensional spaces.

Chapter 3 of Ontologies and Big Data Considerations for Effective Intelligence, edited by Joan Lu (University of Huddersfield, UK) and Qiang Xu (University of Huddersfield, UK). A volume in the Advances in Information Quality and Management (AIQM) Book Series (ISSN: 2331-7701; eISSN: 2331-771X). Published in the United States of America by IGI Global, Information Science Reference (an imprint of IGI Global), 701 E. Chocolate Avenue, Hershey PA, USA 17033. Web site: http://www.igi-global.com. Copyright © 2017 by IGI Global. ISBN: 978-1-5225-2058-0; eISBN: 978-1-5225-2059-7. DOI: 10.4018/978-1-5225-2058-0.ch003.

Authors: Grace L. Samson (University of Huddersfield, UK), Mistura M. Usman (University of Abuja, Nigeria), Joan Lu (University of Huddersfield, UK), Qiang Xu (University of Huddersfield, UK).

INTRODUCTION

The extensive and increasing availability of data collected from geographical information system devices and technology has made it excessively difficult to manage this information using existing spatial database methods, and this has led to research advances in the behavioural aspects of monitored subjects. Geographic information systems (GIS) can efficiently handle all the major tasks of information extraction from large datasets, including data input and verification, storage and manipulation, output and presentation, data transformation and even interaction with end users. This signifies that geographic information systems are complete database management systems that can handle all the tasks mentioned above (Rigaux et al., 2003). Nevertheless, depending on the purpose of the application under consideration, the objects in these GIS databases must be properly modelled for real-world data simplification (i.e. simplifying real-world data so as to create an actual prototype of it) in order to enhance the performance of the database. To achieve this, data analysts constantly seek an appropriate data structure that efficiently stores the object data in the database and allows for better database management.

Spatial databases maintain spatial information and are appropriate for applications that need to monitor the position of an object or event in space. They describe the fundamental representation of the objects of a dataset derived from spatial or geographic entities. A spatial database supports aspects of space and offers spatial data types in its data model and query language. The spatial or geographic referencing attributes of the objects in a spatial database permit them to be positioned within a two-dimensional or three-dimensional space. Unlike a classical database, a spatial database does not query data on the basis of attributes alone; it can also query data elements with respect to their locations. Spatial databases are built to complement the classical database using a defined architecture (Figure 2a and b). This chapter looks into the fundamentals of spatial databases and describes their basic components, operations and architecture. Figure 1 shows a typical classical database system environment.

BACKGROUND

Spatial Database Management System Architecture

According to Ester et al. (1999), Spatial Database Systems (SDBS) are relational databases plus a concept of spatial location and spatial extension, and the explicit location and extension of objects define implicit relations of spatial neighbourhood. Ester et al. (1997) argued that the efficiency of many knowledge discovery in databases (KDD) algorithms for SDBS depends heavily on efficient processing of these neighbourhood relationships, since the neighbours of many objects have to be investigated in a single run of a KDD algorithm.
DBMS architectures are frequently employed in the construction of spatial databases. Nevertheless, according to Shekhar (1999), while typical databases understand various numeric and character data types, additional functionality must be added for them to process spatial data types, which are typically called geometries or features. Figure 2 shows three basic architectures for designing a spatial database management system: the layered and dual architectures described by Güting (1994), and the integrated architecture. In the layered architecture, the database uses a standard DBMS with spatial tools added as a layer on top of it, while in the dual architecture the top layer is an integration layer that mediates between the standard DBMS and a spatial subsystem in the bottom layer.

Figure 1. Diagram showing a typical environment of database management

Figure 2. The diagram of spatial database architecture: (a) the layered architecture; (b) the dual architecture (Güting, 1994); (c) the integrated architecture in full detail

As already noted, the spatial or geographic referencing attributes of objects allow them to be positioned within a two-dimensional or three-dimensional space. Basically, two fundamental aspects of a spatial database need to be modelled: the spatial component (where) and its attributes (what). These two factors determine the spatial data and attribute data that make up the spatial database. Spatial data describe the location of the object of concern, while attribute data specify characteristics at that location (e.g. how much, when).
However, representing these data in a form that the computer can understand requires grouping them into layers according to individual components with similar features (example layers could be waterlines, elevation, temperature, topography, etc.). The data properties of each layer (such as scale, projection, accuracy and resolution) must also be set appropriately. This is where the logical layer of the spatial database management system (SDBMS) comes in. According to Rigaux et al. (2003), the logical layer carries the definition of the spatial database schema, which describes the structure of the information as managed by the application. It also carries the constraints to be respected by the data in the database. Defining the schema allows for further database operations (including data insertion and deletion) and database queries using an appropriate spatial query language. In other words, the particular structures, constraints and operations provided by an SDBMS depend wholly on the logical data model that it supports (Rigaux et al., 2003).

Data Representation in a Spatial Database

According to Densham and Goodchild (1989) and Li et al. (2006), the content of spatial data describes the specific geographic orientation and spatial distribution of spatial objects in the real world. Spatial data also include the attributes of spatial entities, their number, their location and their mutual relations. The data can be the value of a point, a height, a road length, a polygon area, a building volume or a pixel grey level. They can be strings such as geographical names and annotations. They can also be graphics, images, multimedia, spatial relationships or other topologies.
Spatial phenomena are described using dimensional objects such as points, lines and polygons (areas); this complexity of spatial data and its intrinsic spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. Figure 3 can be considered a typical example of a spatial dataset. The picture is a satellite image of a county (the wards in Yorkshire and the Humber for the 2011 census) showing the county's boundary (the dashed white line), the census blocks with their name, area, population and boundary (shown by dark lines), and water bodies (dark polygons), as obtained from Census (2011). Given a d-dimensional space ℝ^d with a Euclidean distance, assume that d = 2 and that the space is a large rectangle with edges parallel to the axes of the coordinate system (see Figure 3); from such a space we obtain our spatial dataset. To store the data in a spatial database table, we start by creating a table with the item sets obtained from the image, using a classical relational database model. If we take a single block of the census_area relation in its two-dimensional space from Figure 3(c), we obtain the information for one record. Figure 4 is a typical example of a single object as it would be represented as a record in the database table. Traditional databases do not support the boundary (polygon) data type seen in this illustration (Shekhar and Chawla, 2003). There is therefore a need to create a separate relation using a spatial data model and then map the new table into the classical database, which requires additional tables that can store the spatial data types. For each rectangular block in the study area, a polygon, its edges and its points are identified, and a separate table is created for each, as shown in Figure 5.
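A minimal sketch of this table layout follows, using Python's built-in sqlite3 module. The table and column names (point, edge, polygon, census_area, seq and so on) and the unit-square block are illustrative assumptions, not the schema shown in Figure 5. The sketch only shows how a polygon boundary can be decomposed into edges and points held in ordinary relational tables, and then recovered with a join.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding one hypothetical census block."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE point (point_id INTEGER PRIMARY KEY, x REAL, y REAL);
        CREATE TABLE edge (edge_id INTEGER PRIMARY KEY,
                           start_point INTEGER REFERENCES point(point_id),
                           end_point INTEGER REFERENCES point(point_id));
        CREATE TABLE polygon (polygon_id INTEGER, seq INTEGER,
                              edge_id INTEGER REFERENCES edge(edge_id),
                              PRIMARY KEY (polygon_id, seq));
        CREATE TABLE census_area (area_id INTEGER PRIMARY KEY, name TEXT,
                                  population INTEGER, polygon_id INTEGER);
    """)
    # Four corners of a unit square (made-up coordinates).
    conn.executemany("INSERT INTO point VALUES (?, ?, ?)",
                     [(1, 0.0, 0.0), (2, 1.0, 0.0), (3, 1.0, 1.0), (4, 0.0, 1.0)])
    # Each edge joins two points.
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                     [(1, 1, 2), (2, 2, 3), (3, 3, 4), (4, 4, 1)])
    # The polygon is an ordered list of edges.
    conn.executemany("INSERT INTO polygon VALUES (?, ?, ?)",
                     [(10, 1, 1), (10, 2, 2), (10, 3, 3), (10, 4, 4)])
    # The attribute record points at its boundary polygon.
    conn.execute("INSERT INTO census_area VALUES (?, ?, ?, ?)",
                 (100, "Block A", 1500, 10))
    return conn


def boundary_vertices(conn, area_id):
    """Return the boundary of a census area as an ordered list of (x, y)."""
    return conn.execute("""
        SELECT p.x, p.y
        FROM census_area a
        JOIN polygon g ON g.polygon_id = a.polygon_id
        JOIN edge e ON e.edge_id = g.edge_id
        JOIN point p ON p.point_id = e.start_point
        WHERE a.area_id = ?
        ORDER BY g.seq
    """, (area_id,)).fetchall()


conn = build_demo_db()
print(boundary_vertices(conn, 100))
```

Even this small case needs three joins to recover one boundary, which illustrates why a purely relational encoding of polygons becomes tedious and why dedicated spatial data types are attractive.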
Spatial Data Types

Unlike the classical relational database management system, which stores data as numbers, letters, alphanumeric strings or symbols, spatial databases store data as abstract data types (ADTs) such as points, lines, polygons, coordinates (latitude and longitude coordinates that define a given location on the surface of the earth), topology or other data types that can be mapped (Samson et al., 2013; Ernest and Djaoen, 2015). The definition and implementation of spatial data types is the most fundamental issue in the development of spatial database systems (Güting and Schneider, 1993). Schneider (1999) shows that spatial data types are necessary to model geometry and to suitably represent geometric data in a database system. According to that author, the basic data types are point, line and region, and the more complex types are partitions and graphs, including networks (roads, rivers, etc.). Güting and Schneider (1993) describe spatial data types (points, lines and regions) as elements of a spatial object that describe the object's attributes regardless of whether the database management system uses a relational, complex-object, object-oriented or some other data model. A spatial database (made up of a collection of both spatial and non-spatial data) is optimized to store and query data objects located in space.

Figure 3. Example spatial dataset: two thematic maps of Yorkshire County and the relational database tables describing them (Census 2011): (a) indicates the areas where the actual census was conducted; (b) shows the areas in Yorkshire where English is spoken as a major language; (c) a relation representing the two spatial objects

Figure 4. Representation of a single record in the census_area database table

Figure 5. Relational database table storing the spatial properties of the object
Compared with normal databases, which work only with numeric, character or calendar data, spatial databases offer additional functions for processing spatial data types (Velicanu & Olaru, 2010). According to Ernest and Djaoen (2015), spatial data types, which generally describe the physical location and shape of geometric objects, are classified into two kinds: geometry and geography data types. The geometry data type stores data using an x and y (Euclidean) coordinate system; the xy coordinates position the spatial object (point, line or polygon/region) in a two-dimensional Euclidean space. The geography data type stores data based on a round-earth coordinate system (Ernest and Djaoen, 2015); in this case, the spatial object is stored using its latitude and longitude coordinate values. More elucidation on spatial data types can be found in Samson et al. (2013, 2014), but for a simple example consider the following illustration. Suppose the problem at hand is to find the nearest town to the centre of the Yorkshire counties (marked P) on the map in Figure 8; we then need to store the towns A through M in our database. We can store the towns as point locations by taking the values of their x and y coordinates. To do this, we create a table called Census_town and follow the steps explained above. Figure 6 highlights the different types of spatial objects and how they are represented, and Figure 7 gives a general overview of the various spatial data types and how they are described, as presented in Rigaux et al. (2003).

Figure 6. The different data types for representing spatial objects

Figure 7. Methods for representing spatial data types (Rigaux et al., 2003)

SPATIAL DATABASE MANAGEMENT SYSTEM

Point Data Types

Point data are completely characterized by their locations in a multidimensional space (Velicanu & Olaru, 2010).
Point data types are basically used to represent a single object at a given location in a multidimensional space. Samson et al. (2014) show that points are efficient for modelling, for example, cities, forests or buildings, and are also ideal objects of thematic maps describing, for instance, land use/cover or the partitioning of a country into districts. A point is the most important object type supported by the spatial data types, both geometry and geography (Ernest and Djaoen, 2015), and represents a single position in space. The position of a point can be defined by an X-coordinate and Y-coordinate value pair in a planar (geometry) coordinate system, or by latitude and longitude coordinates in a geographic coordinate system (Ernest and Djaoen, 2015). In vector spaces, points can also be used to store features extracted from data, e.g. text. The raster data model is best expressed using points, which store the raster image as pixels where each point represents a single atomic cell of the image; because the raster says something about every point in the raster space, the raster image can be modelled as a single collection of spatially related objects. In general, points are simple geometries of dimension zero; they have neither length nor breadth. For more on analysing point patterns in a spatial dataset see Samson et al. (2014), and for sampling point data see Samson et al. (2013). To define a point A geometrically, we refer to the values of its xy coordinates in the plane. For instance, in a two-dimensional plane where x = 2 and y = 9, we write point A as A(2, 9); in three dimensions we write A(x, y, z), in four dimensions A(x, y, z, m), and so on. The Z coordinate refers to the height or elevation of a point, and the M coordinate represents a measure value, which is user-defined (Ernest and Djaoen, 2015).
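The nearest-town question posed earlier can be sketched with towns stored as planar points. The Town type, the coordinates and the query point P are hypothetical values chosen for illustration; they are not read from Figure 8. The distance used is plain Euclidean distance on xy coordinates (the geometry-style representation), not a round-earth calculation.

```python
import math
from typing import NamedTuple


class Town(NamedTuple):
    """A town modelled as a zero-dimensional point with a name."""
    name: str
    x: float
    y: float


def nearest_town(towns, px, py):
    """Return the town whose (x, y) is closest to the query point (px, py)."""
    if not towns:
        raise ValueError("no towns to search")
    return min(towns, key=lambda t: math.hypot(t.x - px, t.y - py))


# Made-up coordinates for three of the towns and for the centre point P.
towns = [Town("A", 1.0, 1.0), Town("B", 5.0, 5.0), Town("C", 2.0, 9.0)]
p = (2.0, 8.0)
print(nearest_town(towns, *p).name)
```

This linear scan inspects every stored point; it is adequate for thirteen towns, but it is exactly the kind of query that spatial indexes are meant to accelerate on large datasets.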
Figure 8 is a thematic map of the image in Figure 3. The map depicts the percentage of female usual residents aged 16 to 74 in professional occupations in towns A through M around Yorkshire County. In this illustration the towns have been modelled as point locations and will therefore be stored in the database using their individual xy coordinates.

Figure 8. Using point locations to represent the towns around the county

Storing Region Data Types

In the previous section we used point representations to model the spatial objects. In this section we model the objects using rough geometric approximations of their spatial extent; most objects also have a location and a boundary (Mamoulis, 2012). Figure 9 is another version of the image in Figure 8 in which the spatial objects are modelled using geometric (rectangular) shapes. This is a vector representation, in which the extent of the spatial objects has a significant effect on the result and performance of query processing (Shekhar et al., 2011). Regions are abstract data types (ADTs) that represent the geometric part of a spatial object. Objects with large areas (such as lakes or forests) normally require solid shapes with a surface (for example, a two- or three-dimensional object). In essence, such large-area entities are fitted using polygons (mostly rectangles), and the boundary of each region is enclosed by polylines around the polygon.

Converting a Region to Point Data

Representing the regions in Figure 10 with points is not a straightforward transformation; a procedure has to be applied that converts the features on the map into a set of points.
Candan and Sapino (2010) reiterate that this approach is particularly useful for measuring similarities and distances where it is assumed that there is only one point per feature, but it cannot be used, for instance, to express topological relationships between regions. The towns represented by points A through M (in Figures 8 and 9) have been converted from large two-dimensional geometric areas to zero-dimensional points. This conversion is justifiable because their extents (shapes) are not useful when considering their locations on the larger scale map. In other words, the result of a query that concerns an object's location on the map is not affected by reducing the shape of the object to a point, because the space the object covers is inconsequential in comparison with the space that contains it.

Figure 9. Using rectangles (minimum bounding rectangles) to enclose the regions before storage

The basic steps for converting from polygon to point are itemised as follows:

1. Find a two-dimensional bounding box (MBR) around the points by:
   a. finding the convex hull for the set of points or regions,
   b. eliminating points that are redundant to the solution, and
   c. enclosing the (convex) polygon with the minimum bounding box that contains its m vertices.
2. Estimate the centre (centroid) of the object (on the x-axis) using the MBRs. In most cases this is used as the point that represents the object.

Minimum Bounding Box/Rectangle (MBB/MBR)

The minimum bounding box, or minimum bounding rectangle/region (MBR), is a paradigm for clustering points based on a similarity measure. Points that are close to each other in space, to a certain extent, are put together as objects in one cluster, also known as a bucket (mostly rectangular in shape).
Samet (2009) describes a bucket as a subspace (of an underlying space) containing sorted spatial objects that have been grouped together naturally according to their spatial order (also known as spatial occupancy). In Mamoulis (2012), constructing the MBR of one or more spatial objects is seen as a technique for handling spatial objects by approximating their geometric extent with the minimum bounding box that encloses the object's geometry. This is an effective filtering method for managing spatial databases that preserves the objects' natural locality. When this approximation is achieved efficiently (in terms of computational cost), the exact original geometry of the objects is still retained. The study in Güting (1994) expresses MBRs as a generic approximation of the spatial data types (points, lines, polygons) that links them to different spatial access methods and allows the spatial objects to be organised in a database stored in external memory, with bucket (MBR) sizes matched to the pages of the storage system (Dröge & Schek, 1993). Figure 10 is a map of the administrative areas of Britain obtained from GADM (2009). We use the map as a spatial dataset to illustrate the various stages involved in converting regions to point data for spatial data analysis. The files were extracted from the GADM version created in 2009.

Convex Hull

The convex hull of a set of points t is the smallest convex polygon that encloses t (Rigaux et al., 2003). Computing the convex hull, that is, the minimum-area convex polygon (one in which all vertices point outward from the centre) that includes all the points, is a way to represent the region occupied by the points (Asaeedi et al., 2013).
Note that the set is convex because we assume that, for any two of its points a and b, the line segment ab lies entirely within the set. Figure 11 is a sample dataset (a digitised map of one of the towns in Figure 10a) with selected points picked along the town's boundary with another town; the points around the polygon in Figure 11 are the convex points of the spatial object (town). Thus, if we call the set of all points p and let C be the set of convex points, then the convex hull of p is the smallest convex polygon that encloses p. Note that points that do not fall on the polygon are non-convex points and are insignificant in finding the convex hull of the object. For any subset of a plane coordinate system (say points, rectangles or simple polygons), the convex hull is the smallest convex set that contains that subset.

Figure 10. Various stages involved in converting to point location: (a) map representing the study area; (b) regions on the map represented by their convex hull; (c) constructed MBR around the convex hull; (d) centroid of regions using their MBR; (e) expanded view of the area (in Figure 10d) around the point marked d; (f) region fully represented using points

Figure 11. Digitized map of one of the towns in Figure 10(a) showing its MBR (red rectangle)

Algorithm for Finding the Convex Hull around a Polygon

Assuming all the points in p are distinct
Let the space be of dimension 2
Then {
    Enter the coordinates representing each point (2 × |p| values)
    Set of all points = p            // where p = a, b, c, …, r
    Set of all convex points = C
    For all p_i ∈ p, i = 1, …, r:
        Find {s | s ∈ p ∧ s ∈ C}
    Start with the point s that has the smallest y-coordinate.
    Sort the points by polar angle with respect to s to get a simple polygon.
    Consider the points in anticlockwise order, and eliminate those that would create a clockwise turn.
}
Return the convex polygon    // sorted list of points along the boundary of the convex hull of p, moving anticlockwise

Fitting the Rectangle

Assuming we have a set of points on a two-dimensional plane, we can compute the minimum bounding rectangle that encloses them. The minimum bounding rectangle is supported by the convex hull of the points (which is a convex polygon), and any points interior to the polygon have no influence on the bounding rectangle (Eberly, 2015). Moreover, Freeman and Shapira (1975) established that the most important constraint in fitting a minimum bounding box (MBB) is that at least one edge of the MBB coincides with an edge of the convex polygon. The simple algorithm below can therefore be used to construct a minimum bounding box around the convex polygon obtained above (section 3.2.1).

Algorithm for Finding the Minimum Bounding Box around a Convex Polygon

Given: a set of convex points p    // the points constitute the convex polygon
Produce: the minimum bounding box q for the set of points    // the output is expected to have fewer points than the input
Start:    // let V be the number of edges of the convex polygon
    For i = 1, …, V:
        Draw a rectangle with edge v_i as the base (i.e. v_i is collinear with an edge of the rectangle)
        Calculate the area A of the rectangle
    Display the rectangle with the smallest area as the minimum bounding box (MBB)

A simpler method is to find the x_min, y_min, x_max and y_max values:

Start
    Get min x, max x
    Get min y, max y
    Using these bounds, draw the rectangle around the points

Although Eberly (2015) points out that a bounding rectangle need not be axis-aligned, the rectangle produced by this method will be axis-aligned (that is, its edges are parallel to the coordinate axes). Baum and Hanebeck (2010) describe axis-aligned rectangles as rectangles, in a space of any dimension, that are represented by their extreme values (the minimum and maximum) on each axis; they also demonstrate how the time complexity of the computation can be calculated. Axis-aligned rectangles are very significant in SDBMSs for objects with extents, as explained by Samet (1995): to store such arbitrary objects we represent them as n-dimensional rectangles, where each node in the storage structure (say, a tree) corresponds to the n-dimensional rectangle that holds its children, and the spatial objects themselves, which are stored in the leaf nodes of the tree, are represented by the smallest axis-aligned rectangle that contains them. It is important to remember that the nodes referred to here correspond to pages on a disk; building the tree structure should therefore take into account that a minimum number of disk pages should be visited for any query operation, otherwise the indexing structure is considered suboptimal.

Finding the Centre (Centroid) of the Spatial Objects

After the rectangle is built, and depending on the task at hand, the next step is to find the centre of the rectangle, which helps with further processing.
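The hull, bounding-rectangle and centre steps described in this section can be sketched in Python as follows. This is an illustration under stated assumptions: the hull is built with Andrew's monotone chain rather than the polar-angle sort outlined above (both discard points that would create a clockwise turn), the function names are hypothetical, and points are plain (x, y) tuples.

```python
import math


def cross(o, a, b):
    """Cross product of (a - o) and (b - o); positive for an anticlockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in anticlockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def axis_aligned_mbr(points):
    """The simpler method: (xmin, ymin, xmax, ymax) from coordinate extremes."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def min_area_rectangle(hull):
    """Try each hull edge as the base and return the smallest rectangle area."""
    n = len(hull)
    if n < 3:
        return 0.0
    best = math.inf
    for i in range(n):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % n]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length  # unit vector along edge
        # Project every hull vertex onto the edge direction and its normal.
        along = [(x - x1) * ux + (y - y1) * uy for x, y in hull]
        across = [-(x - x1) * uy + (y - y1) * ux for x, y in hull]
        area = (max(along) - min(along)) * (max(across) - min(across))
        best = min(best, area)
    return best


def mbr_centre(mbr):
    """Centre of an axis-aligned rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = mbr
    return (xmin + xmax) / 2, (ymin + ymax) / 2


# Made-up boundary points: four corners plus two interior points.
sample = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(sample)
print(hull, axis_aligned_mbr(hull), mbr_centre(axis_aligned_mbr(hull)))
```

For the diamond with vertices (1, 0), (2, 1), (1, 2) and (0, 1), axis_aligned_mbr gives a box of area 4 while min_area_rectangle finds the rotated box of area 2. The edge-by-edge search is therefore worthwhile when a tight fit matters, whereas the axis-aligned box is cheaper and is the form that tree-based indexes store.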
In plane geometry, according to Bourne (2015), the centroid of the area of a two-dimensional geometric shape or object can generally be seen as the centre of mass of that object (or shape). It is similar to the centre of mass except that calculating the centroid involves only the geometrical shape of the solid (Efunda, 2016). To calculate the centroid from the bounding rectangle, we can summarise the formula as in Equation 1.

Algorithm for Finding the Centroid of a Spatial Object

For a simple rectangle:

$$C_{x,y} = \left(\frac{x_2 - x_1}{2}, \frac{y_2 - y_1}{2}\right) \tag{1}$$

Here (x_1, y_1) and (x_2, y_2) are opposite corners of the rectangle, and the result is the offset of the centre from the corner (x_1, y_1); in absolute coordinates the centre is ((x_1 + x_2)/2, (y_1 + y_2)/2). Generally, this equation can be broken down into the sequence below.

i. Get the x and y coordinates of each vertex v, in any order.
ii. For each v_j ∈ V, take the ith coordinates (i = 1, 2, …, d) in a d-dimensional space.
iii. Compute the centroid:

$$C_i = \left(\frac{dx_i}{d}, \frac{dy_i}{d}\right) \tag{2}$$

where d is the number of dimensions.
iv. Return the set of coordinates C_i (e.g. x, y for just xy coordinates).

The formula described above is useful when we want to take the centroid from the bounding rectangle of the spatial object under consideration. More generally, if we want the centroid of a polygon based on its j vertices, the formula below proves more useful. Given any p points with masses m_j at positions x_j for all j vertices, the centroid of that set of points is

$$\bar{x} = \frac{\sum_{j=1}^{p} m_j x_j}{\sum_{j=1}^{p} m_j} \tag{3}$$

Equation 3 is a general equation for finding the centroid of a polygon with masses m_j; it reduces to Equation 4 if all these masses are equal.

$$\bar{x} = \frac{\sum_{j=1}^{p} x_j}{p} \tag{4}$$

EXTENDING THE CLASSICAL RELATIONAL DATABASE FOR SPATIAL DATASET

Spatial Data Model

Developing a spatial database normally starts with developing a data model (covering conceptual, logical and physical data modelling) that describes the contents of the database and their relationships (Singh & Singh, 2014).
There are two common data models for modelling spatial information: field-based models and object-based models (Shekhar et al., 1999). According to Gregory et al. (2009), the raster data structure (also known as the field-based, space-based or raster model) is similar to placing a regular grid over a study region and representing the geographical feature found in each grid cell numerically. The model treats spatial information such as rainfall, altitude and temperature as a collection of spatial functions that map a partition of space to an attribute domain. Rasters are associated with image processing and dynamic modelling, and are easily manipulated using map algebra (e.g. multiplying geographically corresponding cell values in two or more datasets). Spatial data can also be represented as continuous surfaces (e.g. elevation, temperature, precipitation, pollution, noise, etc.) using the grid or raster data model, in which a mesh of square cells is laid over the landscape and the value of the variable is defined for each cell (Samson et al., 2014). The object-based data model (which can also be called the vector, feature or entity-based model) treats the information space as if it were populated by discrete, identifiable, spatially referenced entities. An implementation of a spatial data model in the context of object-relational databases consists of a set of spatial data types and the operations on those types. The vector data structure represents geographic objects with the basic elements points, lines and areas (also called polygons). From the description given by Gregory et al. (2009), vector data are based on recording point locations (zero dimensions) using x and y coordinates stored in two columns of a database. By assigning each feature a unique ID, a relational database can be used to link a location to an attribute table describing what is found there.
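Both models can be illustrated with a small Python sketch. The grids, feature IDs and attribute values are invented for illustration. The hypothetical multiply_rasters function shows map algebra as a cell-by-cell product of two rasters with the same grid shape, and the hypothetical join_features function shows a vector layer linked to its attribute table through a shared ID.

```python
def multiply_rasters(a, b):
    """Map algebra: multiply geographically corresponding cells of two grids."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("rasters must share the same grid shape")
    return [[ca * cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]


def join_features(locations, attributes):
    """Link each feature's (x, y) location to its attribute record by ID."""
    return {
        fid: {"x": x, "y": y, **attributes.get(fid, {})}
        for fid, (x, y) in locations.items()
    }


# Raster example: a rainfall grid masked by a 0/1 land-cover grid (made-up values).
rainfall = [[10, 12], [8, 15]]
is_forest = [[1, 0], [0, 1]]
print(multiply_rasters(rainfall, is_forest))

# Vector example: point locations keyed by feature ID, plus an attribute table.
locations = {1: (2.0, 9.0), 2: (5.0, 3.0)}
attributes = {1: {"name": "Town A"}, 2: {"name": "Town B"}}
print(join_features(locations, attributes))
```

In the raster case position is implicit in the cell's row and column, while in the vector case position is stored explicitly and the ID carries the link to the descriptive attributes.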
Every element in a vector model is described mathematically and is based on points that are defined by Cartesian coordinates (Neuman et al., 2010).

Spatial Data Modelling Paradigm

Typically, the database design process is divided into three main tasks, namely conceptual, logical and physical data modelling (Elmasri and Navathe, 1989). In the conceptual modelling process, the objects (or entities) are represented as spatial and non-spatial datasets, and their attributes and the relationships between them are identified (Akinyemi, 2010). The conceptual model describes how a system is organized and how the system operates. The logical model expresses the conceptual data model in terms of the data that will be computed. This stage, according to Singh and Singh (2014), defines constructs (entity classes or object classes), operations (creating relationships) and validity constraints (rules). At the stage of physical modelling, we produce the actual database design based on the requirements established at the logical modelling stage. Figure 12 shows the various stages involved in designing a spatial database and what they entail.

Figure 12. Stages in building a spatial database (Singh and Singh, 2014)

Choice of DBMS for Building Spatial Databases

Relational database technology is inadequate for managing spatial data (Mamoulis, 2012). That means it is quite tedious to build a spatial database directly using a classical database model; the traditional system therefore needs to be extended to handle spatial data types (abstract data types, ADTs). In other words, relational databases should be equipped to manage and support spatial data types (or data models), spatial functions, and spatial indexes.
Spatial database management systems are software tools that can work with a classical DBMS but, in addition, support a query language from which spatial data types are callable, possess efficient algorithms for processing spatial operations, and provide rules for query optimization (Shekhar & Chawla, 2003). Ramakrishnan and Gehrke (2003) describe two basic types of extended DBMS that are designed to handle the complexity associated with spatial data. The new extended systems are basically object oriented and are therefore able to provide support for handling complex data types. The two object databases described are object-oriented database systems (OODBMS) and object-relational database systems (ORDBMS), both of which are able to support user-defined (spatial) abstract data types such as polygons, lines, point coordinates and other complex types, including partitions and graphs (networks, e.g. roads, rivers, etc.). An efficient spatial data management system is expected to support spatial query operations and at the same time support relational operations. Object-oriented databases model real-world entities (considering their behaviours and relationships) unambiguously, without the use of tables and keys, providing high computational power where users can implement functions and embed them into the database (Candan & Sapino, 2010). The object-relational counterpart, which is an extension of the relational database, gives users the opportunity to model spatial databases either by extending existing relational models with object-oriented features or by adding special row- and table-based data types into object-oriented databases (Stonebraker et al., 1990). According to Shekhar et al.
(2011), spatial types and operations may be integrated into query languages such as SQL, which allows spatial querying to be combined with object-relational database management systems. Figure 13 is a clear conceptualization of both approaches described in this section. The interesting thing about object-based modelling is that it allows users to define their own new data types and complex objects like the ones in Figure 6 (the network and partition data types). Object-based modelling also benefits from functionalities that allow the user to define methods (in the form of code that defines an object's behaviour, properties and state) and rules (that monitor events and validate them against certain constraints). The results of queries from such databases are held in containers (e.g. lists or sets).

Figure 13. Object oriented paradigm for spatial databases

Example

Object-oriented database implementation

❖ Class Census_Town Row (tuple) (town_code: integer, town_name: string, town_type: string, population: integer, area: float, shape: region)
❖ Class Language_Spoken Row (tuple) (language_name: string, shape: region)

Object-relational database implementation

❖ CREATE TABLE censusTown (town_code: integer, town_name: string, town_type: string, population: integer, area: float, shape: region, Primary Key (town_code))
❖ CREATE TABLE languageSpoken (language_code: integer, language_name: string, town_code: integer, town_name: string, shape: region, Primary Key (language_code), Foreign Key (town_code) References censusTown)

SPATIAL QUERY PROCESSING

Spatial database management deals with the storage, indexing, and querying of data with spatial features, such as location and geometric extent (Mamoulis, 2012). Thus, for a spatial database, the candidate structures are frequently compared with respect to complexity, query support, data type support and application (Patel and Garg, 2012).
According to Candan and Sapino (2010), a typical query could be any problem related to the spatial attributes of a given object; most data features (spatial or non-spatial) can be represented in the form of one (or more) of four common base models: strings, vectors, fuzzy/probabilistic logic-based representations, and graphs/trees. In many spatial database applications, object extents can be ignored or approximated (using filter and refine) where the scale of the query is much larger than the scale of the objects. Thus the filter-and-refine technique used in spatial query processing, according to Shekhar et al. (2011), plays a vital role in managing multidimensional index structures. Figure 14 shows a typical example of a spatial query process, where w is the query object. The diagram explains the basic steps of filtering and refining in a spatial query. In Figure 14b, the objects are filtered by finding their minimum bounding rectangles (MBRs), and the MBRs that intersect the query area are returned. At the stage of Figure 14c, the rectangles that intersect the query window are evaluated (the evaluation is carried out on the actual objects) and, through this refinement process, rectangles containing actual objects which do not have any topological relationship with the query window are recorded as false hits.

Figure 14. Steps involved in querying rectangular objects (Mamoulis 2012): (a) is the region dataset with w as the query window, (b) is the approximated (filtered) dataset, (c) is the final refinement stage

A simple example query is given below:

    SELECT c.town_name, l.language_name
    FROM censusTown c, languageSpoken l
    WHERE c.town_name = l.town_name
      AND c.population > 10000
      AND l.language_name = 'English'

The simple query above takes the relations censusTown and languageSpoken; censusTown stores the towns in Yorkshire County and languageSpoken stores the language prevalent in each town.
The query basically finds the towns where English is the major language and the population is greater than 10,000. According to Güting (1994), there are basically three types of queries that arise over spatial (point) data: spatial range queries, nearest neighbour queries, and spatial join queries. For rectangle data one is normally faced with queries like the intersection query (e.g. find all rectangles that intersect a query rectangle) or the containment query (e.g. find all regions/rectangles that are completely within a query rectangle) (Güting, 1994). In point and window queries, which are important examples of spatial queries, we try to find the objects whose geometry contains a given point or overlaps a rectangle (Rigaux et al., 2003).

Range Queries

In range queries, according to Papadias et al. (2003), the range typically corresponds to a rectangular window (like the one in Figure 15) or a circular area around a query point. In this case one of the attributes of the spatial object is identified as the range. For example, find all primary schools that fall within 20 miles of the University of Huddersfield. Range queries, also known as similarity/distance threshold queries (Candan and Sapino, 2010), try to find matches in a database that are within the threshold associated with a given distance or similarity measure. In general, a typical question for a range query on point and spatial data would be: find all objects within a given range of another object. Basically, range (region) search, according to Samet (2006), identifies a set of data points whose specified keys have specific values or whose values are within given ranges.

Nearest Neighbour Queries

Nearest neighbour (similarity) searching or retrieval (Cazals et al., 2013) is a central computational problem with significant applications in many fields of study. The similarity between the objects according to Jain et al.
(1999) and Jiawei (2001) is determined using distance measures over the numerous dimensions in the dataset. Nearest neighbour queries mostly answer distance-based questions such as: find the nearest point/region to a given query point/region. The nearest neighbour to a query point is the point or object that is closest to it in Euclidean space. Figure 15 shows points G, H, …, J around the point p, and we are expected to find the nearest neighbour to p. This query will return J, but if the query is modified to find the k nearest neighbours (say k = 2) then the output will be E, J. Mamoulis (2012) describes nearest neighbour search as follows: given a well-defined reference object q, find the nearest object (or the k nearest objects) to q in the spatial relation (in other words, the query retrieves from a spatial relation R the nearest object to a query object q). In most database applications with high-dimensional data, nearest neighbour queries are very important (Berchtold et al., 1996). The main concern for nearest neighbour search (if the database is indexed with a tree data structure) is therefore CPU time, which is always higher because the search is required to sort all the nodes based on their min-max distance. If the spatial relation is not indexed then, according to Mamoulis (2012), the nearest neighbour algorithm (for clustering, classification or any other purpose) needs to access all objects in the relation in order to find the nearest neighbour to a query object q. Figure 16 illustrates how to find the nearest neighbour of an object in space by simply defining a criterion to approximate an optimal neighbour.

Figure 15. Different point locations with their distances to a central point p

Figure 16.
Example of how to approximate nearest neighbour search

Basically, the problem of finding the nearest neighbours to an object in space can, in its general form, be defined as follows according to Lifshits and Zhang (2009): let U be a set of elements and let d be a distance function that maps each pair of elements from U to some non-negative real number (where d is a metric distance function, i.e., it satisfies the triangle inequality, although this need not always be the case).

Spatial Join

Spatial join operations combine two or more spatial datasets based on some spatial relationship. For instance, if we want to get the list of all hospitals around the towns in Huddersfield, then we could set a query like:

    SELECT Name
    FROM Hospitals, Hudd_town
    WHERE Name_is_within (Hospitals.region, Hudd_town.region)

Spatial join is an important example of a spatial query operation. According to Rigaux et al. (2003), using a spatial join operation one could answer the query: given two different sets of spatial objects, find the pairs of objects that satisfy some spatial relationship. As a simple example, find the languages spoken in different cities in Yorkshire. To answer this query, we would need a thematic map of the cities in Yorkshire County and one of the language predominant in each city; we then join the two maps and display each city and its language on a single map. It is important to note that intersection joins are not useful for point datasets; the appropriate formulation there is the e-distance join, which finds all pairs of objects (s, t), s ∈ S, t ∈ T, within (Euclidean) distance e of each other (Rigaux et al., 2003).

Spatial Attributes

The data inputs of spatial database management are made up of two distinct types of attributes: non-spatial attributes and spatial attributes (Shekhar et al., 2011). Spatial attributes are used to define the spatial location and extent of spatial objects (Bolstad, 2002).
The spatial attributes of a spatial object most often include information related to spatial location (longitude, latitude and elevation), shape, area, etc., and the relationships among these objects are often implicit, such as overlap, intersect and behind (Ganguly and Steinhaeuser, 2008). This is quite unlike the relationships among non-spatial objects, which are explicit in data inputs (Agrawal and Srikant, 1994; Jain and Dubes, 1988). One feasible way to deal with implicit spatial relationships is to materialize the relationships into traditional data input columns and then apply classical data mining techniques, although the materialization may result in loss of information.

Components of a Spatial Database

Geographic objects such as rivers and roads are always related to the same geographic area and, according to Rigaux et al. (2003), are basically made up of two components: (a) a description component and (b) a spatial component (also known as the spatial attribute or extent). The spatial component describes the size of the object and its location, orientation and shape in 2-dimensional or 3-dimensional space. The description component, by contrast, describes the object by means of its non-spatial attributes, setting out descriptive (alphanumeric) attributes such as the name and the population of a city. The spatial component of a spatial object represents the geometry (location, shape, etc.) and topology (relationships among spatial objects) (Rigaux et al., 2003).

Operations on Spatial Data

Spatial data objects can be grouped into themes. A theme is the geospatial information corresponding to a particular topic. The details of spatial datasets are gathered in a theme. A theme is similar to a relation as defined in the relational model: it has a schema and instances. Rivers, cities, and countries are examples of themes (Rigaux et al., 2003).
Operations on Themes

Ester et al. (1997) claimed that, for knowledge retrieval and discovery from spatial databases (which contain information on specific themes), most algorithms will make use of neighbourhood relationships, since that is the main difference between the classical database and its spatial counterpart. According to them, this follows from the fact that the behaviour and characteristics of a spatial object are influenced by certain neighbouring objects, which may "cause" the existence of the spatial object; as such, the attributes of its neighbours may have an influence on the object itself (Ester et al., 1999). For instance, consider the two themes in Figure 3: Yorkshire towns (with attributes such as name, capital, population, and a geometric attribute, say boundary), and Languages (expressing the distribution of main spoken languages or groups of similar languages). If we express the themes using the schema below:

• Towns (name, capital, population, region: region)
• Languages (name, location: region)

then some of the common manipulations of these themes, based on operations from the relational algebra, include overlay, join (or intersection), union, selection, nearest neighbour, overlap, distance, etc.

A simple example of a join operation:

• Find all the towns in Huddersfield where the English language is spoken.

Suggested Solution (Spatial Join)

    SELECT T.name
    FROM Towns T, Languages L
    WHERE name_is_within (T.region, L.location)
      AND L.name = 'English'

In Schneider (1997), most of the different operations that are applicable to spatial datasets are enumerated together with their return types. These include topological relationships, which are examples of spatial predicates that return Boolean values (e.g. intersect (overlap), meet (touch), equal, covered by, adjacent (neighbouring), outside, etc.), directional relationships, which are also spatial predicates (e.g. north/south, left/right), and so on.
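To illustrate how such topological predicates return Boolean values, the sketch below implements a few of them for axis-aligned rectangles only. The `Rect` type and the function names are hypothetical, rectangles are assumed to have non-zero width and height, and general regions would need a full geometry library rather than these coordinate comparisons.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle, assuming xmin < xmax and ymin < ymax."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def outside(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share no point at all."""
    return a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin


def overlap(a: Rect, b: Rect) -> bool:
    """True when the interiors of the two rectangles share at least one point."""
    return a.xmin < b.xmax and b.xmin < a.xmax and a.ymin < b.ymax and b.ymin < a.ymax


def meet(a: Rect, b: Rect) -> bool:
    """True when only the boundaries touch (no shared interior point)."""
    return not outside(a, b) and not overlap(a, b)


def covered_by(a: Rect, b: Rect) -> bool:
    """True when every point of a also lies in b (boundaries may coincide)."""
    return b.xmin <= a.xmin and a.xmax <= b.xmax and b.ymin <= a.ymin and a.ymax <= b.ymax


def equal(a: Rect, b: Rect) -> bool:
    """True when both rectangles have identical coordinates."""
    return a == b
```

For example, `Rect(0, 0, 2, 2)` and `Rect(2, 0, 4, 2)` share only an edge, so `meet` holds for them while `overlap` and `outside` do not.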
Equally, Candan and Sapino (2010) stated that spatial predicates and operations are broadly categorized into two groups, based on whether the information of interest is a single point in space or has a spatial extent (e.g., a line or a region). They established that the predicates and operators that need to be supported by the database management system depend on the underlying spatial data model and on the applications' needs. In summary, according to Ester et al. (1999), it is a common premise that the standard operations from relational algebra, such as selection, union, intersection and difference, are also applicable to evaluating relationships between any set of spatial objects.

SPATIAL INDEXING AND INFORMATION RETRIEVAL

According to Güting (1994), a spatial database is any database that provides at least a spatial indexing method and a spatial join operation. With these, the system is able to retrieve, from a large collection of objects in some space, those in a particular area without scanning the whole set. Indexing spatial data, according to Lungu and Velicanu (2009), is a method of decreasing the number of searches; this mechanism helps to locate objects in the same area of data (window query) or from different locations. The main goal of indexing is to optimize the speed of database queries (Ajit and Deepak, 2011). Given a user query such as "What is the cheapest and fastest path from location Q to P?", the spatial database could be indexed (sorted implicitly) so that even if P is changed to another location, say Z, within that same geometric region, there would be no need to re-sort the data in order to answer the query efficiently. This facilitates the information retrieval process and could also improve database integrity and general performance.
Since, according to Cazals (2013), no ordering generally exists in dimensions greater than 1 without transforming the data to one dimension, indexing can be used as a way of finding a better sort procedure on the database in order to accelerate search query performance. Generally speaking, in a database management system every record can be conceptualized as a point in a multidimensional space, according to Güting (1994). Unfortunately, this view is not always appropriate for spatial data, because the dimensionality of the representative point may be too high, and that poses a problem when considering spatial data (although we may decide to reduce the dimensionality of the representative point in order to approximate the spatial object). Moreover, with this form of transformation (such as merely mapping spatial data into points in another space) proximity is not preserved. Such a transformation is fine for storage purposes and for queries that only involve the points that comprise the line segments (including their end points), for example finding all the line segments that intersect a given point, a set of points or a given line segment. However, the method is not good for queries that involve points or sets of points that are not part of the line segments, as these are not transformed to the higher-dimensional space by the mapping (Samet, 1995). If we have to use a representative point to represent a line object, each line segment can be represented by its end points. As such, a line segment is represented by a tuple of four items (i.e., a pair of x coordinate values and a pair of y coordinate values). We have therefore constructed a mapping from a two-dimensional space (i.e., the space from which the lines are drawn) to a four-dimensional space (i.e., the space containing the representative point corresponding to the line).
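The loss of proximity under this mapping can be seen with a small experiment. The sketch below, which is only an illustration with made-up segments and hypothetical helper names, maps each segment to the tuple (x1, x2, y1, y2) and compares Euclidean distances between the resulting four-dimensional points.

```python
import math


def to_point4(segment):
    """Map a 2-D segment ((x1, y1), (x2, y2)) to the 4-D point (x1, x2, y1, y2)."""
    (x1, y1), (x2, y2) = segment
    return (x1, x2, y1, y2)


def dist4(seg_a, seg_b):
    """Euclidean distance between the representative points of two segments."""
    return math.dist(to_point4(seg_a), to_point4(seg_b))


# Illustrative segments (not taken from the chapter's figures).
s1 = ((0, 0), (10, 0))   # horizontal segment
s2 = ((5, -5), (5, 5))   # vertical segment that crosses s1 at (5, 0)
s3 = ((0, 3), (10, 3))   # parallel to s1, never touches it

d_crossing = dist4(s1, s2)   # representative points of two intersecting segments
d_parallel = dist4(s1, s3)   # representative points of two disjoint segments
```

With these values the crossing pair ends up farther apart in the four-dimensional space than the disjoint parallel pair, so closeness of representative points says little about closeness of the original segments.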
Thus the present challenge for data analysts is to find techniques suitable for overcoming the problems of inappropriate mapping of spatial objects to point data. One such technique is the use of data structures that are based on sorting the spatial objects by spatial occupancy (Samet, 2009). Spatial occupancy methods decompose the space from which the data is drawn (e.g., the two-dimensional space containing the lines described above) into regions called buckets. Spatial indexing methods preserve order; in other words, objects in close proximity should be placed in the same bucket, or at least in buckets that are close to each other in the sense of the order in which they would be accessed (i.e., retrieved from secondary storage in case of a false hit, etc.). In large databases, especially spatio-temporal ones, the efficiency of searching depends on the extent to which the underlying data is sorted (Samet, 2009). The sorting is encapsulated by the data structure known as an index, which is used to represent the spatial data and thereby make it more accessible. According to Cazals (2013), in order to store objects in these databases it is common to map every object to a feature vector in a (possibly high-dimensional) vector space. The feature vector then serves as the representation of the object. The traditional role of indexes is to sort the data, which means that they order the data. However, since generally no ordering exists in dimensions greater than 1 without a transformation of the data to one dimension, the role of the sort process is one of differentiating between the data, and what is usually done is to sort the spatial objects with respect to the space that they occupy. The resulting ordering should be implicit rather than explicit, so that the data need not be re-sorted (i.e., the index need not be rebuilt) when the queries change.
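One simple spatial occupancy scheme is a fixed grid, sketched below under the assumption of point data and a uniform cell size. The class `GridIndex` and its method names are hypothetical. Points that fall in the same cell share a bucket, and a window query inspects only the buckets whose cells overlap the window, so the index is built once and reused for any query window.

```python
import math
from collections import defaultdict


class GridIndex:
    """Fixed-grid bucket index for 2-D points (illustrative sketch)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.buckets = defaultdict(list)  # (column, row) -> list of (name, x, y)

    def _cell(self, x, y):
        """Return the grid cell that contains the point (x, y)."""
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, name, x, y):
        self.buckets[self._cell(x, y)].append((name, x, y))

    def window_query(self, xmin, ymin, xmax, ymax):
        """Return the names of all points inside the closed query window."""
        col0, row0 = self._cell(xmin, ymin)
        col1, row1 = self._cell(xmax, ymax)
        hits = []
        for col in range(col0, col1 + 1):
            for row in range(row0, row1 + 1):
                # Only buckets overlapping the window are visited.
                for name, x, y in self.buckets.get((col, row), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(name)
        return sorted(hits)


grid = GridIndex(cell_size=10)
for name, x, y in [("A", 1, 1), ("B", 2, 3), ("C", 15, 15), ("D", 27, 4)]:
    grid.insert(name, x, y)
```

The same `grid` object answers `grid.window_query(0, 0, 5, 5)` and `grid.window_query(10, 10, 30, 30)` without any rebuilding, which is the sense in which the ordering is implicit.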
The indexes are said to order the space, and the characteristics of such indexes are explored further in Samet (2009). According to Park et al. (2013), spatial indexing techniques are among the most effective optimization methods for improving the quality of large dynamic databases; this is achieved by applying ordering tools (e.g. the Z-order curve and the Hilbert curve) which linearize multidimensional data. A key property of these ordering functions is that they can map multidimensional data to one dimension while largely preserving the locality of the data points. Once the data is sorted according to such an ordering, a spatial data structure is built on top of it, and query results are refined, if necessary, using information from the original feature vectors. Any n-dimensional data structure can be used for indexing the data, such as binary search trees, B-trees, R-trees, X-trees, etc. (Cazals et al., 2013). According to Güting (1994), rectangles are more difficult to model than points because they do not fall into a single cell of a bucket partition.

Indexing Points and Rectangular Objects

Because of the non-linearity that exists in large spatial data sets, an effective data structure that can handle the branched structures within the data is required. This complex trait of spatial datasets, according to Candan and Sapino (2010), is better represented using graphs and trees, because the larger datasets are always made up of other minor events or objects which are difficult to order into sequences. Spatial objects can be indexed in the form of points or regions (rectangles); this can be achieved using a point access method (PAM) or a spatial access method (SAM).

Point Access Methods

Representing multidimensional point data is a central issue in spatial database design according to Samet (2006); this means that there is one dimension specified for each attribute or key.
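For point data of this kind, the Z-order linearization mentioned in the discussion of ordering tools can be sketched as a bit interleaving of the coordinates. The function below is a minimal illustration that assumes non-negative integer coordinates fitting in `bits` bits; the name `z_order` and the default of 16 bits are choices made for this example only.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order key.

    Bits of x go to even positions and bits of y to odd positions.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Sorting points by their key linearizes the 2-D data into one dimension.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=lambda p: z_order(*p))
```

The four cells of a 2 × 2 block receive the consecutive keys 0 to 3, which is where the locality comes from. The preservation is only approximate, however: the horizontally adjacent cells (1, 0) and (2, 0) receive the keys 1 and 4.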
In a multi-dimensional data space, shared data such as documents, music files and images are frequently specified as points based on expressed features, and as such require a system to provide efficient multi-dimensional query processing (Jagadish et al., 2006; Samet, 2006). Point access methods work by defining a space decomposition of disjoint points; as such, in most tree indexing structures for point data it is expected that the leaf nodes will not overlap (Leutenegger et al., 1997). Figure 17 illustrates some well-known point access methods identified by Mamoulis (2012). Point access methods can simply be seen as data structures and algorithms that primarily search for points defined in multidimensional space; examples include EXCELL, Grid file, hB-tree, Twin grid file, Two-level grid file, K-d tree, BSP-tree, Quad-tree, UB-tree, Buddy tree, Locality-sensitive hashing (LSH)... (Paul, 2008). In Figure 18, two different methods are presented for indexing the same point dataset: (i) the R-tree and (ii) the Quad-tree. The presentation shows a set of points P = {A, B, …}. Points C, D and E are close to each other in space, so we have clustered them in the same leaf node (that is, the minimum bounding rectangle labelled Z1, outlined in red). The next upper level consists of the MBRs Z1, Z2 and Z3 (which enclose leaf nodes 1…9); these non-leaf nodes are in turn grouped into the next higher level, which in this case is the root. We have assumed the capacity of a node to be four (4) entries for the R-tree and four for the Quad-tree (following the conventional Quad-tree partitioning).

Figure 17. Efficient point access methods (Mamoulis, 2012)

Figure 18. Comparison between different tree structures to represent the same point location dataset: (a) and (b) R-tree, (c) and (d) Quad-tree
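The Quad-tree partitioning with a node capacity of four can be sketched as follows. This is a minimal bucket quadtree written for illustration only: the class name, the `max_depth` guard against endless splitting on duplicate points, and the sample coordinates are all assumptions and do not reproduce the dataset of Figure 18.

```python
class QuadTree:
    """Bucket quadtree over the rectangle (xmin, ymin, xmax, ymax)."""

    def __init__(self, xmin, ymin, xmax, ymax, capacity=4, depth=0, max_depth=16):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth  # stops splitting when many points coincide
        self.points = []            # (name, x, y) tuples while the node is a leaf
        self.children = None        # four sub-quadrants once the node has split

    def insert(self, name, x, y):
        xmin, ymin, xmax, ymax = self.bounds
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((name, x, y))
                return True
            self._split()
        # The first child whose bounds contain the point stores it.
        return any(child.insert(name, x, y) for child in self.children)

    def _split(self):
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        quadrants = [(xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
                     (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax)]
        self.children = [QuadTree(*q, self.capacity, self.depth + 1, self.max_depth)
                         for q in quadrants]
        old_points, self.points = self.points, []
        for name, x, y in old_points:
            self.insert(name, x, y)

    def query(self, qxmin, qymin, qxmax, qymax):
        """Return names of stored points inside the closed query window."""
        xmin, ymin, xmax, ymax = self.bounds
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []  # window misses this quadrant, so skip the whole subtree
        if self.children is None:
            return [n for n, x, y in self.points
                    if qxmin <= x <= qxmax and qymin <= y <= qymax]
        found = []
        for child in self.children:
            found.extend(child.query(qxmin, qymin, qxmax, qymax))
        return found


tree = QuadTree(0, 0, 100, 100)
for name, x, y in [("A", 10, 10), ("B", 20, 15), ("C", 80, 80), ("D", 85, 90), ("E", 60, 20)]:
    tree.insert(name, x, y)
```

Inserting the fifth point exceeds the capacity of the root, which then splits into four quadrants and redistributes its points; a window query afterwards descends only into quadrants that the window touches.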
Task 1

Points A through M (in Figure 8) show point locations of towns around the Yorkshire and the Humber County where the percentage of women in professional occupations is 26 and above (modelling them as points means we consider them as locations on the map that have no extent but have a spatial reference in the 2D plane). Examples of point queries using point data:

1. Find the distance between town A and town C.
2. Which of the locations is closest to the centre of the county (point P)?
3. Find all towns that are < 50 miles from P.

Note that the extent of each point location will not alter the outcome of the query.

Region Access Methods

For range queries, a data structure that can search for lines, polygons, etc. is required. Point access methods are known not to fully support region, overlap, enclosure and similar queries, so methods like the cell tree, extended k-d tree, GBD-tree, multilayer grid file, D-tree, P-tree, R-file, R+-tree, R*-tree, R-tree and Skd-tree have, according to Paul (2008), been proposed to manage this sort of data. Most spatial objects are better stored as rectangles. Storing spatial objects as rectangles is an important task of a spatial database, as this provides a basic approximation of the actual geometry of the object. Spatial approximation, according to Mamoulis (2012), helps to reduce the computational complexity associated with the actual geometries of the spatial objects. The usefulness of storing regions as rectangles cannot be overemphasized; for instance, when handling a range query, we consider all the rectangles that intersect the region of the query.
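A range query over rectangle approximations can be sketched as a filter step on MBRs followed by a refinement step on the actual geometry. To keep the exact test short, the sketch below assumes that the actual objects are circles, which is purely an illustrative stand-in for arbitrary regions; the function names and the sample data are hypothetical.

```python
import math


def circle_mbr(cx, cy, r):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle."""
    return (cx - r, cy - r, cx + r, cy + r)


def rects_intersect(a, b):
    """True when two closed axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circle_hits_window(cx, cy, r, window):
    """Exact test: does the circle intersect the rectangular window?"""
    xmin, ymin, xmax, ymax = window
    nearest_x = min(max(cx, xmin), xmax)  # window point closest to the centre
    nearest_y = min(max(cy, ymin), ymax)
    return math.hypot(cx - nearest_x, cy - nearest_y) <= r


def range_query(objects, window):
    """Return (hits, false_hits) for a window query over circular objects."""
    # Filter step: cheap rectangle test on the approximations.
    candidates = [name for name, (cx, cy, r) in objects.items()
                  if rects_intersect(circle_mbr(cx, cy, r), window)]
    # Refinement step: exact test on the actual geometry of each candidate.
    hits = [n for n in candidates if circle_hits_window(*objects[n], window)]
    false_hits = [n for n in candidates if n not in hits]
    return hits, false_hits


# Hypothetical objects given as name -> (centre x, centre y, radius).
objects = {"A": (0, 0, 5), "B": (5, 5, 1), "C": (20, 20, 2)}
hits, false_hits = range_query(objects, window=(4, 4, 6, 6))
```

In this example C is discarded by the filter step alone, B is confirmed by the refinement step, and A passes the filter (its MBR touches the window) but fails the exact test, so it is reported as a false hit.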
Many applications in 2-dimensional spaces represent objects using their MBRs. By so doing, the MBRs identify the region of the larger data space that each object occupies, and an intersection test is then performed to determine which objects intersect the spatial query; an R-tree organizes the spatial objects so that the search space can be pruned (Jagadish et al., 2006). According to Güting (1994), the use of bounding boxes demands that most spatial data structures be designed to store either a set of points (for point values) or a set of rectangles (for line or region values). Güting (1994) added that rectangles are more difficult to model than points because they do not fall into a single cell of a bucket partition; therefore three strategies have been developed to handle rectangle data partitioning: (a) the transformation approach, (b) overlapping bucket regions, and (c) clipping. Figure 19 presents an R-tree for indexing region data, in this case a set of rectangles {A, B, …, L}. Rectangles A and B are close to each other in space, so we have clustered them in the same leaf node (that is, the minimum bounding rectangle labelled 8). We have assumed the capacity of a node (M) to be four entries for the R-tree. The MBRs in green are the leaf nodes, and they contain the index record (b, obj_pointer), where b is the number of the page of the smallest MBR that contains the spatial object referenced by obj_pointer. The rectangles in blue (containing M child nodes) are the non-leaf nodes, with entries (b, child_pointer), where b in this case is the number of the page of the smallest MBR that contains the MBRs in the child node referred to by child_pointer. Finally, these entry nodes are grouped into the root (labelled 1).
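The two kinds of entries and the pruning they enable can be sketched with a small hand-built tree. In this illustration the first component of each entry is stored directly as the rectangle's coordinates rather than as a page number, which is an assumption made to keep the example self-contained; the class `Node`, the helper names and the sample rectangles are hypothetical, and no insertion or node-splitting algorithm is shown.

```python
class Node:
    """R-tree node: leaf entries are (mbr, obj_id), others are (mbr, child)."""

    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries


def node_mbr(node):
    """Smallest rectangle (xmin, ymin, xmax, ymax) enclosing all entries of a node."""
    rects = [mbr for mbr, _ in node.entries]
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a, b):
    """True when two closed axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, window):
    """Return the ids of all objects whose MBR intersects the query window."""
    results = []
    for mbr, ref in node.entries:
        if not intersects(mbr, window):
            continue  # prune: nothing below this entry can match
        if node.is_leaf:
            results.append(ref)
        else:
            results.extend(search(ref, window))
    return results


# Hand-built two-level tree over four illustrative rectangles.
leaf1 = Node(True, [((0, 0, 2, 2), "A"), ((1, 1, 3, 3), "B")])
leaf2 = Node(True, [((10, 10, 12, 12), "C"), ((11, 9, 13, 11), "D")])
root = Node(False, [(node_mbr(leaf1), leaf1), (node_mbr(leaf2), leaf2)])
```

A call such as `search(root, (2.5, 2.5, 4, 4))` never descends into `leaf2`, because the MBR stored for it in the root does not intersect the window.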
Note that the rectangles marked blue in the actual tree are the rectangles that have some kind of spatial relationship with the query window P; this determines the rectangles that are involved in the query operation.

Figure 19. R-tree for representing the regions

Tree Indexing Data Structures: a Brief Comparison

One would have noticed the ubiquitous nature of the R-tree indexing structure. This is attributed to the fact that the R-tree and its variants have proven very useful in the management of multidimensional databases and handle both point and spatial data efficiently. The R-tree is known as the most dominant spatial access method (SAM) (Mamoulis, 2012). The main idea behind the data structure, according to Guttman (1984), is to group nearby objects by representing them using their minimum bounding rectangle (the "R" in R-tree) in the next higher level of the tree. In Samet (1995), it is established that the R-tree and its variants perform remarkably well when applied to indexing arbitrary spatial objects, especially rectangles (in two (2) or d dimensions). The most important argument in Samet (1995) is that the smallest rectangle that represents the object under consideration must be axis-aligned (see Baum and Hanebeck (2010) for more on axis-aligned rectangles). The main reason the R-tree index structure has been widely accepted and used, according to Berchtold et al. (2001), is that it supports both point data and data with extent, whereas most of the other structures, like the kdB-trees, grid files, etc., do not support both types of data concurrently. The R-tree index structure also proves efficient for spatial clustering (which is a vital issue in the performance of tree-based indexing structures), since it does not require point transformation in order to store spatial data.
Guttman (1984) describes the functionalities of the R-tree and, most importantly, identifies the basic strategy for building an efficient R-tree structure with respect to its clustering technique and partitioning strategy. Nevertheless, despite its popularity and efficiency, basic limitations of this tree-based index structure (which mainly have to do with the increase in overlap of the MBRs of directory nodes as dimension increases) have limited its effectiveness and have warranted research into new methods for enhancing its performance by refining the way the tree is built. The many specialized indexes proposed to offer better performance than the R-tree for high-dimensional data include the X-tree, the VA-file, TV-tree, SR-tree, the pyramid technique, A-tree, X+-tree, PL-tree, the hybrid tree, etc. Overall, indexing structures for multidimensional spatial databases, according to Mamoulis (2012), can only perform optimally if they provide an efficient underlying paradigm for similarity and nearest neighbour (NN) search in high-dimensional space. In Figure 20, the authors have listed some of the basic definitions encountered in indexing a spatial database using a tree data structure.

ISSUES ON SPATIAL DATABASE

Dimensionality

In some application domains, the underlying data (spatial or not) are made up of sets of objects, and storing these objects in a database usually involves mapping every object to a feature vector in a (possibly high-dimensional) vector space. The objects are described using a collection of features which forms the feature vector, and this serves as the representation of the object in the database (Cazals et al., 2013).

Figure 20.
Basic terms and terminologies used in constructing a tree structure for spatial indexing

High dimensionality, that is, the measure of the degree or size of the feature space, is according to Katayama and Satoh (2002) and Samet (2006) one of the properties of feature spaces. In most cases, queries relate to similarity retrieval from such a feature space of vectors, which in multidimensional information systems boils down to nearest neighbour search in the space; the challenge facing such a data mining task is therefore the management of the dimensionality of the space. Furthermore, according to Böhm et al. (2001), in multidimensional spatial database management systems (SDBMSs), high dimensionality is characterized by the presence of a large number of attributes, at least 15. Some examples of application areas where the data is of considerably higher dimensionality (though it may not be spatial) include pattern recognition and image databases, where the data is made up of sets of objects and the high dimensionality results from trying to describe the objects via a collection of features (also known as a feature vector), for example colour, moments, textures, shape descriptions, and so on (Samet, 2006). High-dimensional can also mean a situation where the number of unknown parameters to be estimated is one or several orders of magnitude larger than the number of samples in the data (Bickel et al., 2001). In large spatial databases, high-dimensional data can be seen as data described by a large number of attributes; in that case, as the dimensionality increases, the complexity of the computational process is expected to increase as well, rendering various existing spatial data mining algorithms ineffective (Assent, 2012). In Bouveyron et al.
(2007), it is stated that many scientific domains consider measured observations as high-dimensional, and that clustering in such a high-dimensional (feature) space always proves difficult because the high-dimensional data are embedded in different low-dimensional subspaces that are hidden in the original space. It is worth noting that, in a similarity search operation, computing the Euclidean distance between two points in a high-dimensional space of, say, d dimensions involves d multiplication operations and d − 1 addition operations. Most importantly, the computation requires a definition of what it means for two objects to be similar, which according to Samet (2006) is not always obvious. Keim et al. (2008) also ascertained that the feature-based approach has several advantages compared to other approaches for implementing similarity search: the extraction of features from the source data is usually fast and easily parametrizable, and metric functions for feature vectors, such as the Minkowski distances, can be computed efficiently. Novel approaches for computing feature vectors from a wide variety of unstructured data are proposed regularly. Since in many practical applications the dimensionality of the obtained feature vectors is high, the X-tree spatial index data structure is a valuable tool for performing efficient similarity queries in spatial databases. Some examples of high-dimensional variables identified by Mosley (2010) (mostly variables with many units or levels) include ZIP code and vehicle classification. The author also identified the complications associated with these types of high-dimensional variable, which include credibility at individual levels and determining proper groupings:

1. Credibility at individual levels, for example:
   a. Convergence errors (from models).
   b. Results that do not make sense.
2. Determining proper groupings, for example:
   a. Thousands of ZIP codes.
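Returning to the distance computation discussed in this section, the Minkowski family of metrics can be sketched in a few lines. The function name minkowski_distance and the sample vectors are assumptions made for illustration; the point is only that each comparison of two feature vectors performs one term per dimension, so its cost grows with d.

```python
from typing import Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p between two feature vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a metric")
    # One term per dimension: the work for a single comparison is linear in d.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


u = [0.0, 0.0, 0.0]
v = [3.0, 4.0, 12.0]
manhattan = minkowski_distance(u, v, p=1)  # sum of absolute differences
euclidean = minkowski_distance(u, v, p=2)  # square root of the sum of squares
```

With p = 2 the function reduces to the Euclidean distance mentioned above, and with p = 1 to the Manhattan distance.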
Handling High Dimensional Data

Although searches through an indexed space usually involve relatively simple comparison tests, searching in high-dimensional spaces is generally time-consuming according to Samet (2006). Performing point and range queries is nevertheless considerably easier (from the standpoint of computational complexity) than performing similarity queries, because point and range queries do not involve the computation of distance. Suggestions have been made regarding the best possible ways to handle the problems of high dimensionality in large databases. Cazals et al. (2013) suggested several methods for handling high-dimensional datasets, including: i) dimension reduction, ii) embedding methods, iii) clustering, and iv) nearest neighbour methods.

Dimension Reduction (DR)

Dimension reduction is seen as one major way of tackling the problem of high dimensionality in largely correlated spatial data. Cazals et al. (2013) describe dimension reduction as the process of computing a mapping function f from the high-dimensional feature space R^m to a lower-dimensional space R^k (upon which a spatial data structure is built), with the goal of preserving the properties of the feature vectors that are relevant for the application at hand. According to Parsons et al. (2004), dimension reduction involves two main kinds of task:

1. Feature Selection: This technique selects only the most significant dimensions from a dataset, revealing groups of objects that are similar on only a subset of their attributes. Although this method of dimension reduction has difficulty when clusters are found in different subspaces, it is quite successful on many datasets.
2. Feature Transformation: This technique tries to summarize a dataset using fewer dimensions by creating combinations of the original attributes. The technique is very successful in uncovering hidden structure in large datasets.
Nevertheless, such techniques preserve the relative distances between objects, which makes them less effective when there are large numbers of irrelevant attributes that hide the genuine clusters in deep noise.

Another dimension reduction technique, according to Kushilevitz et al. (2000), is principal component analysis (PCA), a statistical technique that uses a transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Embedding Methods

These methods for handling high dimensionality deal with data embedding, i.e., they embed the objects into a vector space within which a distance metric approximating the original one can be used (Samet, 2006). This approach also tends to ease the calculation of (costly) distances between spatial objects (Cazals et al., 2013). Weinberger and Saul (2006) observe that automatically discovering clever representations (which are used to simplify problems in some application areas using symbolic input) from large amounts of unlabelled data remains a fundamental challenge; they therefore examined and proposed an algorithm that faithfully learns low-dimensional representations of high-dimensional data. The algorithm relies on modern tools in convex optimization, as used in non-linear embedding methods. According to Cazals et al. (2013), non-linear embedding methods for handling high dimensionality lead to easy-to-implement polynomial-time algorithms, and they prove more efficient with larger data sets than the ones usually involved in iterative or greedy methods (e.g. those involving EM or EM-like algorithms). A majority of data embedding methods are isometric. In other words, they take a metric space (inter-point distances) as input and try to embed the data in a low-dimensional Euclidean space such that the inter-point distances are preserved as much as possible (Yang, 2006).
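As a small illustration of feature transformation, the following sketch reduces two-dimensional points to a single coordinate along their first principal component. It is a simplified stand-in for PCA, not a full implementation: it uses power iteration on the sample covariance matrix, finds only the first component, and the function names and sample data are assumptions introduced here.

```python
import math
from typing import List

Vector = List[float]


def first_principal_component(rows: List[Vector], iterations: int = 100) -> Vector:
    """Unit vector along the direction of largest variance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeated multiplication by the covariance matrix pulls the vector towards
    # the eigenvector with the largest eigenvalue. The start vector is arbitrary;
    # the method fails if it happens to be orthogonal to that eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate data or unlucky start vector")
        v = [x / norm for x in w]
    return v


def project(rows: List[Vector], component: Vector) -> List[float]:
    """Map each d-dimensional row to one coordinate along the given component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - means[j]) * component[j] for j in range(d)) for r in rows]


# Points lying exactly on the line y = 2x: one dimension is enough to describe them.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(data)
coords = project(data, pc1)
```

Because the sample points lie on a straight line, the one-dimensional coordinates preserve the inter-point distances exactly; with noisy real data the distances would only be approximated.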
Semidefinite embedding (SDE), also known as maximum variance unfolding (MVU), is an embedding method (a non-linear dimensionality reduction technique) that attempts to map high-dimensional data onto a low-dimensional Euclidean vector space using semidefinite programming. The main intuition behind MVU (Weinberger and Saul, 2004; 2006) is to exploit the local linearity of manifolds and create a mapping that preserves local neighbourhoods at every point of the underlying manifold. The method can be used to analyse high-dimensional data that lies on or near a low-dimensional manifold.

Clustering Methods

Improvements to the traditional clustering algorithms address problems such as the curse of dimensionality and the sparsity of data with multiple attributes (Paithankar and Tidke, 2015). According to Steinbach et al. (2004), clustering methods are used to partition a large dataset into groups of similar objects as a means of data compression. The process of finding clusters reveals meaningful homogeneous groups of objects embedded in subspaces of high-dimensional data and also helps to extract the relevant information from such a high-dimensional dataset (Paithankar and Tidke, 2015). The essence of this task is to achieve scalability, comprehensibility of data mining results, and insensitivity to the order of input records in a database management system (Agrawal et al., 1998). According to Berchtold et al. (1997), an object or data record typically has dozens of attributes, and the domain of each attribute can be large. It is therefore not meaningful to look for clusters in such a high-dimensional space, as the average density of points anywhere in the data space is likely to be quite low.
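The remark about low average density can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each attribute is divided into 10 equal intervals and that one million points are spread over the resulting grid; the function name is hypothetical.

```python
def average_points_per_cell(n_points: int, bins_per_dim: int, dims: int) -> float:
    """Average number of points per grid cell when every axis is cut into bins_per_dim intervals."""
    # The number of cells is bins_per_dim ** dims, so it grows exponentially with dims.
    return n_points / (bins_per_dim ** dims)


for d in (2, 6, 10, 20):
    print(d, average_points_per_cell(1_000_000, 10, d))
```

Under these assumptions a cell holds ten thousand points on average in two dimensions, one point in six dimensions, and far less than one point from ten dimensions onwards, which is why dense regions become hard to find in the full space.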
Thus, according to IBM (1996), this problem of high dimensionality is often addressed by requiring the user to specify the subspace (a subset of the dimensions) for a given cluster analysis, bearing in mind that user identification of subspaces is quite prone to error. In a high-dimensional dataset (having millions of data points that exist in many thousands of dimensions, i.e. having many attributes, and representing many thousands of clusters), dimensionality can be handled by performing a two-stage clustering (ensemble method): first dividing the data into a group of overlapping subsets (using one distance metric) and then using a different distance measure to compute accurate clusters (McCallum et al., 2000). Another way of handling high dimensionality with a clustering algorithm, according to Paithankar and Tidke (2015), is to combine the two basic methods suggested in this section (subspace clustering and ensemble clustering). It is worth noting the difference between the two classes of datasets that one can encounter within a spatial database:

1. Low dimensional dataset:
   a. Limited number of clusters.
   b. Low feature dimensionality (few attributes for each object).
   c. Small number of data points.
2. High dimensional dataset:
   a. Large number of data points.
   b. Many thousands of dimensions, i.e. having many attributes.
   c. Many thousands of clusters.

Generally, in high-dimensional spaces, the data are inherently sparse and the distance between each pair of points is almost the same for a wide variety of data distributions and distance functions (Muller et al., 2009).

Nearest Neighbour Methods

Hinneburg et al. (2000) argued that there can be several reasons for the meaninglessness of nearest neighbour search in high-dimensional space, especially the unavoidable sparsity of data objects in the space. Moreover Beyer et al.
(1999) also supported the premise by claiming that in high-dimensional space all pairs of points are almost equidistant from one another for a wide range of data distributions and distance functions, thereby giving rise to the problem of instability. However, it is also important to know that searching for a nearest neighbour among a specified database of points is a fundamental computational task that arises in a variety of application areas (including information retrieval, data mining, pattern recognition, machine learning, computer vision, data compression, and statistical data analysis) where the database points are represented as vectors in some high-dimensional space (Kushilevitz et al., 2000). High-dimensional nearest neighbour problems arise naturally when complex objects are represented by vectors of d numeric features (Arya et al., 1994). Therefore, one solution to the problem of high dimensionality using nearest neighbour search techniques applies an approximation of the nearest neighbour query as a method for high-dimensional filtering, where queries are compared against their most relevant candidates. The seeming difficulty of obtaining algorithms that are efficient in the worst case with respect to both space and query time for dimensions higher than 2, according to Arya et al. (1994) and Cazals et al. (2013), suggests that the alternative approach of finding approximate nearest neighbours (which does not require heavy resources and is a less demanding computational task) is worth considering. A description of the approximate nearest neighbour object search is given in Figure 21.

Curse of Dimensionality: What Is It?

The curse of dimensionality, as explained by Bellman (1957), who originally coined the term, expresses the fact that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with the number of variables (dimensions) that it comprises.

Figure 21.
Approximated nearest neighbour model

This means that the number of objects (points) in the dataset that need to be examined in deriving the estimates in a similarity search (i.e., finding nearest neighbours) grows exponentially with the underlying dimension, thereby giving rise to the question “is nearest neighbour search meaningful in such a domain?” The curse of dimensionality can be simply described as the situation that arises when an extra dimension is added to an already existing Euclidean space. It is also defined by Kouiroukidis and Evangelidis (2011) as the phenomenon whereby, in high-dimensional spaces, the distances between the nearest and farthest points from query points become almost equal, so that nearest neighbour calculations cannot discriminate between candidate points. Although this problem is faced by many database systems, suggestions have been put forward for fighting the curse of dimensionality. Ding (2007) suggested feature selection using SVM-RFE (Support Vector Machine-Recursive Feature Elimination) as a solution to the problem. More generally, according to Verleysen and François (2005), the curse of dimensionality is the expression of all phenomena that appear with high-dimensional data and that most often have unfortunate consequences for the behaviour and performance of learning algorithms.

Optimizing Spatial Databases

Optimizing spatial databases is one of the most important aspects of working with large volumes of data. Some of the issues to consider in trying to optimise a large spatial database include reducing the access rate and access times to external memory, reading and writing in multiples of pages, and reducing the number of disk accesses.
Figure 22 illustrates different ways of optimizing a spatial database as suggested by Samet (2010). Some of the ways of improving the behaviour of search in a large spatial database include:

Goal 1: Minimizing the number of children of a node that must be visited by search operations (i.e. minimizing the area common to children (overlap)).
Goal 2: Reducing the likelihood of each node (say q) being visited by the search (i.e. minimizing the total area spanned by the bounding box of q (coverage)).

In addition to the methods mentioned above, Katayama and Satoh (2002) suggested optimising spatial databases by reducing the number of disk page accesses, because reading from and writing to external storage in disk-based data structures is much slower than doing the same with main memory.

Figure 22. Different ways of optimizing a spatial database

Bulk loading disk-based data structures is another way of optimizing a spatial database, because external storage is organised in fixed data blocks of typically 4 to 16 KB. A good bulk loading method, according to Mamoulis (2011; 2012), builds the tree quickly for static objects and ensures less wasted empty space on the tree pages. Giao and Anh (2015) identified the advantages of bulk loading a tree structure as follows:

1. Faster loading of the tree with all spatial objects at once.
2. Reduced empty space in the nodes of the tree.
3. Better splitting of spatial objects into nodes of the tree.

Belciu and Olaru (2011) added that the best way to improve the optimization of spatial databases is through spatial indexes. Using spatial indexes such as the grid index, Z-order, quadtree, octree, UB-tree, R-tree, kd-tree, M-tree, etc. improves the performance and integrity of spatial databases in terms of storage and time costs.
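The two search goals stated earlier in this section (small overlap and small coverage) can be quantified with a minimal sketch. The measures below are simplifying assumptions made for illustration: coverage is taken as the sum of the areas of the node bounding boxes, and overlap as the sum of their pairwise intersection areas, which counts a region shared by three boxes more than once. All names are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bounding_box(rects: Sequence[Rect]) -> Rect:
    """Smallest axis-aligned rectangle enclosing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def intersection_area(a: Rect, b: Rect) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def coverage(nodes: Sequence[Rect]) -> float:
    """Total area spanned by the node bounding boxes (Goal 2: keep it small)."""
    return sum(area(n) for n in nodes)


def overlap(nodes: Sequence[Rect]) -> float:
    """Area common to pairs of node bounding boxes (Goal 1: keep it small)."""
    return sum(intersection_area(a, b) for a, b in combinations(nodes, 2))


# Four unit squares: two on the left, two on the right.
A, B, C, D = (0, 0, 1, 1), (1, 0, 2, 1), (5, 0, 6, 1), (6, 0, 7, 1)
near = [bounding_box([A, B]), bounding_box([C, D])]   # neighbours grouped together
mixed = [bounding_box([A, C]), bounding_box([B, D])]  # distant objects grouped together
```

Grouping neighbouring squares gives a coverage of 4 and no overlap, whereas grouping distant squares gives a coverage of 12 and an overlap of 5, so a window query would visit both nodes far more often in the second arrangement.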
According to Lungu and Velicanu (2009), indexing spatial data is a way to decrease the number of searches, and a spatial index (considered logical) is used to locate objects in the same area of data (window query) or from different locations. Using approximations, spatial indexing methods organize space and the objects in it in such a way that only parts of the space and a subset of the objects need to be considered to answer such a query (Güting, 1994).

Autocorrelation

Autocorrelation is basically a measure of the similarity or interdependence of an object in space with surrounding objects. According to Samson et al. (2014), the presence of spatial autocorrelation and the fact that continuous data types are always present in spatial data make it important to create methods, tools and algorithms to mine spatial patterns in a complex spatial data set. Jerrett et al. (2003) note that failure to control for autocorrelation can lead to false positive significance tests and may indicate bias resulting from a missing variable or group of variables. Measuring spatial autocorrelation is a systematic way of ascertaining spatial patterns. According to Legendre and Fortin (1989), spatial autocorrelation frequently occurs in ecological data, and many ecological theories and models implicitly assume an underlying spatial pattern in the distributions of organisms and their environment. Autocorrelation arises from the fact that elements of a given population or community (or even the geographic/social environment as a whole) that are close to one another in space or time are more likely to be influenced by the same generating process. According to Chen et al. (2011), spatial autocorrelation shows the correlation of a variable with itself through space. Rossi and Quénéhervé (1998) and Legendre (1993) likewise acknowledged that spatial autocorrelation measures the similarity between samples for a given variable as a function of spatial distance.
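One common way to put a number on spatial autocorrelation is Moran's I. The chapter does not name a specific statistic, so its use here is an assumption made for illustration, as are the function names and the simple neighbourhood structure (locations arranged on a line, each adjacent to the next).

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Moran's I for values observed at n locations with an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of products of deviations over all pairs of locations.
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    squared = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared)


def chain_weights(n: int) -> List[List[float]]:
    """Binary adjacency for n locations on a line: each location neighbours the next one."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(4)
smooth = morans_i([1, 2, 3, 4], w)       # neighbouring values are similar
alternating = morans_i([1, 4, 1, 4], w)  # neighbouring values are dissimilar
```

The smoothly increasing series yields a positive value, indicating that nearby observations resemble one another, while the alternating series yields a negative value.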
Spatial autocorrelation, as seen by Fortin and Dale (2009), simply portrays the self-dependence of spatial data (meaning that the individual observations made from the chosen samples include information present in other observations, so that the effective sample size, say n, is less than the number of observations, m). According to them, this dependence poses a serious problem: when it is positive it affects the significance rates of statistical tests, and it must therefore be corrected in order to produce a better measurement of goodness-of-fit.

DISCUSSION

The most basic spatial query types are spatial selection, nearest neighbour search, and spatial joins. Extending a DBMS to support spatial data requires changes at all layers: data modelling, query languages, storage and indexing, query evaluation and optimization, transaction management, etc. Nowadays more users are interested in retrieving information related to the locations and geometric properties of spatial objects. Users of mobile devices may want to find the nearest hotel to their location; astronomers may want to study the spatial relationships among objects of the universe; army commanders may want to schedule the movements of their troops according to the geography of the field; and scientists may want to study the effects of object positions and relationships in a 2D/3D space on some scientific or social fact (e.g., spatial analysis of protein structures, or the relationship between the residence of subjects and their psychic behaviour) (Mamoulis, 2012). The efficiency of searching depends on the extent to which the underlying data is sorted; in other words, an appropriate index structure must be put in place in order to achieve optimal database efficiency (Samet, 2009).
According to Mamoulis (2012), two major problems are likely to arise in the search and retrieval process when a spatial database handles a query:

• First, the geometry of the objects could be too complex; therefore, testing a query predicate against each object in a database would result in a high computational cost.
• Secondly, exhaustively testing all objects of the relation against a spatial query predicate requires a significant amount of I/O operations for large databases.

SOLUTIONS

1. The first problem is mainly handled by storing the spatial objects, with their exact geometry, together with an appropriate spatial approximation, for example the minimum bounding rectangle (MBR). The MBR of an object is the minimum rectangle that encloses the geometric extent of the object (and hence the convex hull covering its geometry). Thus, in order to overcome the first problem mentioned above:
   a. First, the query predicate is tested against the MBRs of the objects (the filter step); if the MBR passes the filter step, the refinement step is applied. Filtering selects the objects whose minimum bounding box satisfies the spatial predicate; intersection, adjacency, containment, etc. are examples of spatial predicates that a pair of objects to be joined must satisfy. The filtering step consists of traversing the index and applying the spatial test to the MBRs. An MBR might satisfy a query predicate whereas the exact geometry may not. (In other words, filtering is a way of assigning a spatial object’s natural geometry to a minimum bounding box and then extracting the bounding boxes that, for instance, intersect a given query window.)
   b. Then the exact geometry of the object is tested against the query predicate. The refinement stage visits each object whose stored MBR was extracted by filtering and tests the specific geometry of the object against the query predicate.
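The filter and refinement steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the exact geometries are discs (chosen only because their exact intersection test with a rectangle is short), a linear scan over the MBRs stands in for the index traversal, and the names Disc, rects_intersect and window_query are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass(frozen=True)
class Disc:
    """A spatial object whose exact geometry is a disc (illustrative choice)."""
    name: str
    cx: float
    cy: float
    r: float

    def mbr(self) -> Rect:
        return (self.cx - self.r, self.cy - self.r, self.cx + self.r, self.cy + self.r)

    def intersects_window(self, w: Rect) -> bool:
        # Exact test: the disc meets the window if the window point nearest
        # to the centre lies within the radius.
        nx = min(max(self.cx, w[0]), w[2])
        ny = min(max(self.cy, w[1]), w[3])
        return math.hypot(self.cx - nx, self.cy - ny) <= self.r


def rects_intersect(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def window_query(objects: List[Disc], window: Rect) -> Tuple[List[Disc], List[Disc]]:
    # Filter step: cheap test on the MBRs only (an index would prune this scan).
    candidates = [o for o in objects if rects_intersect(o.mbr(), window)]
    # Refinement step: exact geometry test, applied to the candidates only.
    results = [o for o in candidates if o.intersects_window(window)]
    return candidates, results


objects = [Disc("a", 0.0, 0.0, 1.0), Disc("b", 1.5, 1.5, 1.0), Disc("c", 5.0, 5.0, 1.0)]
candidates, results = window_query(objects, (0.8, 0.8, 2.0, 2.0))
```

Object a passes the filter step because the corner of its MBR reaches into the window, but it is rejected during refinement because the disc itself does not; object c is discarded by the filter step without any exact geometry test.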
The final spatial test is done on the actual geometries of the objects whose MBR satisfies the filter step.

For spatial databases, many indexing methods that try to cope with the dimensionality curse in high-dimensional spaces have been proposed, but these methods usually end up behaving like a sequential scan over the database in terms of accessed pages when queries such as k-nearest neighbours (kNN) are examined. Kouiroukidis and Evangelidis (2011) examined multi-attribute indexing methods and investigated when these methods reach their limits, namely, at what dimensionality a kNN query requires visiting all the data pages.

CONCLUSION

More users these days are interested in retrieving information related to the locations and geometric properties of spatial objects. Unlike classical relational database management systems, which store data as numbers, letters, alphanumeric strings or even symbols, spatial databases store abstract data types (ADTs) such as points, lines, polygons, coordinates, topology or other data types that can be mapped. Traditional database systems have great difficulty coping with these kinds of data, because they have been tailored to fixed-length data of very simple internal structure. It is quite tedious to build a spatial database directly using a classical database model; the traditional system therefore needs to be extended and equipped to manage and support spatial data types (or data models), spatial functions, and spatial indexes. The essential features of spatial data that distinguish them from alphanumeric data include a complex internal structure and an arbitrary finite representation of shapes. Thus, domain-specific knowledge is necessary for traditional/classical databases to be able to support non-standard database applications.
In this chapter (for the sake of efficient information retrieval) we have examined spatial databases, their architecture, optimization techniques, and methods of improving their internal and optimal storage capacity. Because of the non-linearity that exists among large spatial data sets, an effective data structure that can handle the branched structures that exist within given spatial data is required. The behaviour of these complex spatial datasets is better represented using graphs and trees, because the larger datasets are always made up of other minor events or objects and are difficult to order into sequences. The main goal of indexing is to optimize the speed of database queries in order to facilitate the information retrieval process and also to improve database integrity and general performance. Developing a spatial database normally starts with developing a data model (covering conceptual, logical and physical data modelling) that describes the contents of the database and their relationships. To achieve this, data analysts constantly seek an appropriate data structure that efficiently stores the object data in the database and allows for better and more efficient database management.

REFERENCES

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. ACM.
Ajit, S., & Deepak, G. (2011). Implementation and Performance Analysis of Exponential Tree Sorting. International Journal of Computer Applications, 24(3), 34-38.
Akinyemi, F. A. (2010). Conceptual Poverty Mapping Data Model. Transactions in GIS, 14, 85–100. doi:10.1111/j.1467-9671.2010.01207.x
Arya, S., Mount, D. M., Netanyahu, N., Silverman, R., & Wu, A. Y. (1994, January). An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proc. 5th ACM-SIAM Sympos. Discrete Algorithms (pp. 573-582).
Asaeedi, S., Didehvar, F., & Mohades, A. (2013).
Alpha-Concave Hull, a Generalization of Convex Hull. arXiv preprint arXiv:1309.7829
Baum, M., & Hanebeck, U. D. (2010, September). Tracking a minimum bounding rectangle based on extreme value theory. In Multisensor Fusion and Integration for Intelligent Systems (MFI), 2010 IEEE Conference on (pp. 56-61). IEEE. doi:10.1109/MFI.2010.5604456
Belciu, A. V., & Olaru, S. (2011). Optimizing spatial databases. Available at SSRN 1800758
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Berchtold, S., & Keim, D. A. (1996). The X-tree: An Index Structure for High-Dimensional Data. Proceedings of the 22nd VLDB Conference.
Berchtold, S., Keim, D. A., & Kriegel, H. P. (2001). An index structure for high-dimensional data. Readings in multimedia computing and networking.
Berchtold, S., Bohm, C., Keim, D., & Kriegel H.-P. (1997). A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th Symposium on Principles of Database Systems (PODS).
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is Nearest Neighbors Meaningful? Proc. of the Int. Conf. Database Theorie, (pp. 217-235).
Böhm, C., Berchtold, S., & Keim, D. A. (2001). Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373. doi:10.1145/502807.502809
Bolstad, P. (2002). GIS Fundamentals: A First Text on GIS. Eider Press.
Bourne, M. (2015). Applications of integration. Retrieved from http://www.intmath.com/applicationsintegration/5-centroid-area.php
Candan, K. S., & Sapino, M. L. (2010). Data management for multimedia retrieval. Cambridge University Press. doi:10.1017/CBO9780511781636
Cazals, F., Emiris, I. Z., Chazal, F., Gärtner, B., Lammersen, C., Giesen, J., & Rote, G. (2013). D2. 1: Handling High-Dimensional Data. Computational Geometric Learning (CGL) Technical Report No.: CGL-TR-01.
Census. (2011).
Wards in Yorkshire and the Humber. Available at http://ukdataexplorer.com/census/yorkshireandthehumber/#KS206EW0007
Chen, L., & Brown, S. D. (2014). Use of a tree‐structured hierarchical model for estimation of location and uncertainty in multivariate spatial data. Journal of Chemometrics, 28(6), 523–538. doi:10.1002/cem.2611
Ding, Y. (2007). Handling complex, high dimensional data for classification and clustering. University of Mississippi.
Dröge, G. & Schek. (1993). Query-Adaptive Data Space Partitioning Using Variable-Size Storage Clusters. Proc. 3rd Intl. Symposium on Large Spatial Databases.
Eberly, D. (2015). Minimum-Area Rectangle Containing a Set of Points. Geometric Tools, LLC. Retrieved from http://www.geometrictools.com/
Efunda. (2016). Solids: Centre of Mass. Available at http://www.efunda.com/math/solids/CenterOfMass.cfm
Ernest, R., & Djaoen, S. (2015). Introduction to SQL Server Spatial Data. Retrieved from https://www.simple-talk.com/sql/t-sql-programming/introduction-to-sql-server-spatial-data
Ester, M., Kriegel, H. P., & Sander, J. (1997). Spatial data mining: A database approach. In Advances in spatial databases (pp. 47–66). Springer Berlin Heidelberg. doi:10.1007/3-540-63238-7_24
Ester, M., Kriegel, H. P., & Sander, J. (1999). Knowledge discovery in spatial databases. Springer Berlin Heidelberg.
Fortin, M. J., & Dale, M. R. (2009). Spatial autocorrelation in ecological studies: A legacy of solutions and myths. Geographical Analysis, 41(4), 392–397. doi:10.1111/j.1538-4632.2009.00766.x
Freeman, H., & Shapira, R. (1975). Determining the minimum-area encasing rectangle for an arbitrary closed curve. Communications of the ACM, 18(7), 409–413. doi:10.1145/360881.360919
GADM. (2009). Global Administrative Areas: Boundaries without limit. Available at http://www.gadm.org/download
Ganguly, A. R., & Steinhaeuser, K. (2008). Data mining for climate change and impacts. In ICDM Workshops. doi:10.1109/ICDMW.2008.30
Giao, B.
C., & Anh, D. T. (2015). Improving Sort-Tile-Recusive algorithm for R-tree packing in indexing time series. In Computing & Communication Technologies-Research, Innovation, and Vision for the Future (RIVF), 2015 IEEE RIVF International Conference on (pp. 117-122). IEEE.
Güting, R. H. (1994). An introduction to spatial database systems. The VLDB Journal—The International Journal on Very Large Data Bases, 3(4), 357-399.
Güting, R. H., & Schneider, M. (1993). Realms: A foundation for spatial data types in database systems. In Advances in Spatial Databases (pp. 14-35). Springer Berlin Heidelberg. doi:10.1007/3-540-56869-7_2
Hinneburg, A., Aggarwal, C. C., & Keim, D. A. (2000). What is the nearest neighbor in high dimensional spaces? In 26th Internat. Conference on Very Large Databases (pp. 506-515).
International Business Machines. (1996). IBM Intelligent Miner User’s Guide, Version 1 Release 1, SH12-6213-00 edition. Author.
Jagadish, H. V., Ooi, B. C., Vu, Q. H., Zhang, R., & Zhou, A. (2006). Vbi-tree: A peer-to-peer framework for supporting multi-dimensional indexing schemes. In Data Engineering, 2006. ICDE’06. Proceedings of the 22nd International Conference on (pp. 34-34). IEEE. doi:10.1109/ICDE.2006.169
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323. doi:10.1145/331499.331504
Jerrett, M., Burnett, R., Willis, A., Krewski, D., Goldberg, M., DeLuca, P., & Finkelstein, N. (2003). Spatial Analysis of the Air Pollution Mortality Relationship in the Context of Ecologic Confounders. Journal of Toxicology and Environmental Health. Part A., 66(16-19), 1735–1778. doi:10.1080/15287390306438 PMID:12959842
Jiawei-Han, M. K. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Katayama, N., & Satoh, S. (2002). Experimental evaluation of disk based data structures for nearest neighbour searching. AMS DIMACS Series, 59, 87.
Kouiroukidis, N., & Evangelidis, G. (2011).
The effects of dimensionality curse in high dimensional knn search. In Informatics (PCI), 2011 15th Panhellenic Conference on (pp. 41-45). IEEE. doi:10.1109/ PCI.2011.45 Kushilevitz, E., Ostrovsky, R., & Rabani, Y. (2000). Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2), 457–474. doi:10.1137/S0097539798347177 Legendre, P. (1993). Spatial autocorrelation: Trouble or new paradigm? Ecology, 74(6), 1659–1673. doi:10.2307/1939924 Legendre, P., & Fortin, M. J. (1989). Spatial pattern and ecological analysis. Vegetatio, 80(2), 107–138. doi:10.1007/BF00048036 Lifshits, Y., & Zhang, S. (2009). Combinatorial algorithms for nearest neighbors,near-duplicates and small-world design. In Proc. SODA. doi:10.1137/1.9781611973068.36 Lungu, I., & Velicanu, A. (2009). Spatial Database Technology Used In Developing Geographic Information Systems. The 9th International Conference on Informatics in Economy – Education, Research & Business Technologies. Academy of Economic Studies, Bucharest. Mamoulis, N. (2012). Spatial data management (1st ed.). Morgan & Claypool Publishers. McCallum, A., Nigam, K., & Ungar, L. H. (2000, August). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 169-178). ACM. doi:10.1145/347090.347123 Mosley, R. C. (2010). Handling High Dimensional Variables. Pinnacle Actuarial Resources, Inc. Muller, E., Gunnemann, S., Assent, & Seidl, T. (2009). Evaluating Clustering in Subspace Projections of High Dimensional Data. VLDB ’09. Lyon, France: VLDB Endowment. Paithankar, R., & Tidke, B. (2015). A H-K Clustering Algorithm for High Dimensional Data Using Ensemble Learning. arXiv preprint arXiv:1501.02431 Papadias, D., Zhang, J., Mamoulis, N., & Tao, Y. (2003). Query processing in spatial network databases. 
In Proceedings of the 29th international conference on Very large data bases (vol. 29, pp. 802-813). VLDB Endowment. Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter, 6(1), 90–105. doi:10.1145/1007730.1007731 147 Spatial Databases Patel, P., & Garg, D. (2012). Comparison of Advance Tree Data Structures. arXiv preprint arXiv:1209.6495. Paul, E. B. (2008). Point access method. In Dictionary of Algorithms and Data Structures. Available from: http://www.nist.gov/dads/HTML/pointAccessMethod.html Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). New York: McGraw-Hill. Rigaux, P., Scholl, M., & Voisard, A. (2003). Spatial Databases with Application to GIS. SIGMOD Record, 32(4), 111. Rossi, J. P., & Quénéhervé, P. (1998). Relating species density to environmental variables in presence of spatial autocorrelation: A study case on soil nematodes distribution. Ecography, 21(2), 117–123. doi:10.1111/j.1600-0587.1998.tb00665.x Samet, H. (1995). Spatial data structures, Modern database systems: the object model, interoperability, and beyond. New York, NY: ACM Press/Addison-Wesley Publishing Co. Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. Samet, H. (2009). Sorting spatial data by spatial occupancy. In GeoSpatial Visual Analytics (pp. 31–43). Springer Netherlands. Samet, H. (2010, December). Sorting in space: multidimensional, spatial, and metric data structures for computer graphics applications. In ACM SIGGRAPH ASIA 2010 Courses (p. 3). ACM. doi:10.1145/1900520.1900523 Samson, G. L., Lu, J., & Showole, A. A. (2014). Mining Complex Spatial Patterns: Issues and Techniques. Journal of Information & Knowledge Management, 13(02), 1450019. doi:10.1142/S0219649214500191 Samson, G. L., Lu, J., Wang, L., & Wilson, D. (2013). An approach for mining complex spatial dataset. 
Proceeding of Int’l Conference on Information and Knowledge Engineering. Retrieved from http:// worldcompproceedings.com/proc/proc2013/ike/IKE_Papers.pdf Schneider, M. (1997). Spatial Data Types for Database Systems - Finite Resolution Geometry for Geographic information systems. LNCS, 1288. Schneider, M. (1999). Spatial Data Types: Conceptual Foundation for the Design and Implementation of Spatial Database Systems and GIS. In Proceedings of 6th International Symposium on Spatial Databases. Shekhar, S., & Chawla, S. (2003). Spatial databases: A tour. Upper Saddle River, NJ: Prentice Hall. Shekhar, S., Chawla, S., Ravada, S., Fetterer, A., Liu, X., & Lu, C. (1999). Spatial Databases - Accomplishments and Research Needs. IEEE Transactions on Knowledge and Data Engineering, 11(1), 45–55. doi:10.1109/69.755614 Shekhar, S., Evans, M. R., Kang, J. M., & Mohan, P. (2011). Identifying patterns in spatial information: A survey of methods. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3), 193–214. Singh, S. P., & Singh, P. (2014). Modelling a Geo-Spatial Database for Managing Travelers ‘demand. International Journal of Database Management Systems, 6(2), 3–47. doi:10.5121/ijdms.2014.6203 148 Spatial Databases Steinbach, M., Ertöz, L., & Kumar, V. (2004). The challenges of clustering high dimensional data. In New directions in statistical physics (pp. 273–309). Springer Berlin Heidelberg. doi:10.1007/978-3662-08968-2_16 Stonebraker, M., Rowe, L. A., Lindsay, B. G., Gray, J., Carey, M. J., Brodie, M. L., & Beech, D. et al. (1990). Third-generation database system manifesto. SIGMOD Record, 19(3), 31–44. doi:10.1145/101077.390001 Velicanu, A., & Olaru, S. (2010). Optimizing Spatial Databases. Informatica Economica, 14(2), 61–71. Verleysen, M., & François, D. (2005, June). The curse of dimensionality in data mining and time series prediction. In International Work-Conference on Artificial Neural Networks (pp. 758–770). Springer Berlin Heidelberg. 
doi:10.1007/11494669_93 Weinberger, K. Q., & Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04). 149