Deren Li · Shuliang Wang · Deyi Li

Spatial Data Mining
Theory and Application

Deren Li
Wuhan University
Wuhan, China

Shuliang Wang
Beijing Institute of Technology
Beijing, China

Deyi Li
Tsinghua University
Beijing, China
Translation from the Chinese language second edition: 空间数据挖掘理论与应用 (第二版) by Deren
Li, Shuliang Wang, Deyi Li, © Science Press 2013. All Rights Reserved
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.
Foreword I
In this era of data and knowledge economy, spatial data mining is attracting
increasingly more interest from scholars and is becoming a hot topic. I wish
success to the authors with this book as well as their continuing research. May they
attain even greater achievements! May more and more brightly colored flowers
bloom in the theory and application of spatial data mining.
Foreword II

Rapid advances in the acquisition, transmission, and storage of data are produc-
ing massive quantities of big data at unprecedented rates. Moreover, the sources
of these data are disparate, ranging from remote sensing to social media, and thus
possess all three of the qualities most often associated with big data: volume,
velocity, and variety. Most of the data are geo-referenced—that is, geospatial.
Furthermore, internet-based geospatial communities, developments in volunteered
geographic information, and location-based services continue to accelerate the rate
of growth and the range of these new sources. Geospatial data have become essen-
tial and fundamental resources in our modern information society.
In 1993, Prof. Deyi Li, a scientist in artificial intelligence, talked with his older
brother, Prof. Deren Li, a scientist in geographic information science and remote
sensing, about data mining in computer science. At that time, Deren believed that
the volume, variety, and velocity of geospatial data were increasing rapidly and
that knowledge might be discovered from raw data through spatial data mining.
At the Fifth Annual Conference on Geographic Information Systems (GIS) in
Canada in 1994, Deren proposed the idea of knowledge discovery from GIS data-
bases (KDG) for uncovering the rules, patterns, or outliers from spatial datasets
in intelligent GIS. Subsequently, both brothers were farsighted in co-supervising
two doctoral research students, Mr. Kaichang Di (from 1995 to 1999) and Mr.
Shuliang Wang (from 1999 to 2002), to study spatial data mining by combining
the principles of data mining and geographic information science. Shuliang’s doc-
toral thesis was awarded one of China’s National Excellent Doctoral Thesis prizes
in 2005.
It is rare that Deren and Deyi, brothers but from different disciplines, collabo-
rated and co-supervised a graduate student, Shuliang. It is even rarer for them to
have co-authored a monograph. Together, they have fostered a research team to
undertake their pioneering work on the theories and applications of spatial data
mining.
Their knowledge and wisdom about spatial data mining is manifested in this
book, which offers a systematic and practical approach to the relevant topics and
is designed to be readable by specialists in computer science, spatial statistics,
geographic information science, data mining, and remote sensing. There are vari-
ous new concepts in this book, such as data fields, cloud models, mining views,
mining pyramids, clustering algorithms, and the Deren Li methods. Application
examples of spatial data mining in the context of GIS and remote sensing also are
provided. Among other innovations, they have explored spatiotemporal video data
mining for protecting public security and have analyzed the brightness of night-
time light images for assessing the severity of the Syrian Crisis.
This monograph is an updated version of the authors’ books published earlier
by Science Press of China in Chinese: the first edition in 2006 and the second in
2013. The authors tried their best to write this book in English, taking nearly 10
years to complete it. I understand that Springer, the publisher, was very eager to
publish the monograph, and one of its Vice Presidents came to China personally
to sign the publication contract for this English version when the draft was ready.
After reading this book, I believe that readers with a background in computer
science will have gained good knowledge about geographic information science;
readers with a background in geographic information science will have a greater
appreciation of what computer science can do for them; and readers interested in
data mining will have discovered the unique and exciting potential of spatial data
mining. I am pleased to recommend this monograph.
Foreword III

The technical progress in computerized data acquisition and storage has resulted
in the growth of vast databases, about 80 % of which are geo-referenced (i.e.,
spatial data). Although a computer user now has less difficulty understanding the
increasingly large amounts of data available in these spatial databases, there is an
impending bottleneck where the data are excessive while the knowledge to man-
age it is scarce. In order to overcome this bottleneck, spatial data mining, proposed
under the umbrella of data mining, is receiving growing attention.
In addition to the common shared properties of data mining, spatial data min-
ing has its own unique characteristics. Spatial data include not only positional and
attribute data, but also the spatial relationships between spatial entities. Moreover,
the structure of spatial data is more complex than the tables in ordinary relational
databases. In addition to tabular data, there are vector and raster graphic data in
spatial databases. However, insufficient attention has been paid to the spatial char-
acteristics of the data in the context of data mining.
Professor Deren Li is an academician of the Chinese Academy of Sciences
and an academician of the Chinese Academy of Engineering in Geo-informatics.
He proposed knowledge discovery from geographical databases (KDG) at the
International Conference on GIS in 1994. Professor Deyi Li is an academician
of the Chinese Academy of Engineering in computer science, specifically for
his contributions to data mining. The brothers co-supervised Dr. Shuliang Wang,
whose thesis was honored as one of the best national Ph.D. theses in China in
2005. Moreover, their research group has successfully applied for and completed many
sponsored projects exploring spatial data mining, funded by sources such as the National
Natural Science Foundation of China (NSFC), the National Key Fundamental Research
Plan of China (973), the National High Technology Research and Development Plan of
China (863), and the China Postdoctoral Science Foundation. Their research work
focuses on the fundamental theories of data mining in computer science in com-
bination with the spatial characteristics of the data. At the same time, their theo-
retical and technical results are being applied concurrently to support and improve
spatial data-referenced decision-making in the real world. Their work is widely
accepted by scholars worldwide, including me.
In this monograph, there are several contributions. Some new methods are pro-
posed, such as cloud model, data field, mining view, and pyramid of spatial data
mining. The discovery mechanism is believed to be a process of uncovering a form
of rules plus exceptions at hierarchical mining views with various thresholds. Their
spatial data cleaning algorithm is also presented in this book: the weighted itera-
tion method (i.e., the Deren Li method). Three clustering techniques are also demon-
strated: clustering discovery with cloud models and data fields, fuzzy clustering
under data fields, and mathematical morphology-based clustering. Mining image
databases with spatial statistics, inductive learning, and concept lattice are dis-
cussed and application examples are explored, such as monitoring landslides near
the Yangtze River, deformation recognition on train wheels, land classification
based on remote-sensed image data, land resources evaluation, uncertain reason-
ing, and bank location selection. Finally, a prototype spatial data mining system is
introduced and developed.
As can be seen in the information above, Spatial Data Mining: Theory and
Application is the fruit of their collective work. The authors not only approached
spatial data mining as an interdisciplinary subject but pursued its real-world appli-
cability as well.
I thank the authors for inviting me to share their monograph in its draft stage
and am very pleased to recommend it to you.
Foreword IV

The rapid growth in the size of data sets brought with it much difficulty in using
the data. Data mining emerged as a solution to this problem, and spatial data
emerged as its main application. For a long time, the data mining method was used
for spatial data mining without recognition of the unique characteristics of spatial
data, such as location-based attributes, topological relationships, and spatiotempo-
ral complexity.
Professor Deren Li is an expert in geospatial information science, and he con-
ceived the “Knowledge Discovery from GIS databases (KDG).” Professor Deyi Li
is an expert in computer science who also studied data mining early. In essence,
they combined the principles of data mining and geospatial information science
to bring the spatial data mining concept to fruition. They had the foresight to co-
supervise two doctoral students: Kaichang Di and Shuliang Wang, and thereafter
fostered a spatial data mining research team to continue carrying out their pioneer-
ing research. They have completed many successful projects, such as those funded by
the National Natural Science Foundation of China, the results of which are successfully being
applied in a wide variety of areas, such as landslide disaster monitoring. Many of
their achievements have received the acclaim of scholars in peer-to-peer exchanges
and have attracted international academic attention.
This book is the collective wisdom of Deren Li, Deyi Li, and
Shuliang Wang regarding spatial data mining, presented in a methodical fashion
that is easy to read and navigate. The many highlights of this book center on the
authors’ innovative technology, such as the cloud model, data field, mining view,
mining pyramid, and mining mechanisms, as well as data cleaning methods and
clustering algorithms. The book also examines spatial data mining applications in
various areas, such as remote sensing image classification, Baota landslide moni-
toring, and checking the safety of train wheels.
After reading this book, readers with a computer science background will know
more about geospatial information science, readers from the geospatial informa-
tion science area will know more about computer science, readers with data min-
ing experience will discover the uniqueness of spatial data mining, and readers
who are doing spatial data mining jobs will find this book a valuable reference.
It was a great honor to have been asked to read this book in its earlier draft
stages and to express my appreciation of its rich academic achievements. I whole-
heartedly recommend it!
Preface
Sometimes it may be necessary for spatial data mining to take advantage of population
data instead of sample data.
In this monograph, we present our novel theories and methods of spatial data
mining as well as our successful applications of them in the realm of big data. A
data field depicts object interactions by diffusing the data contribution from the
universe of samples to the universe of population. The cloud model bridges the
mutual transformation between qualitative concepts and quantitative data; and the
mining view of spatial data mining hierarchically distinguishes the mining require-
ments with different scales or granularities. The weighted iteration method is used
to clean spatial data of errors using the principles of posterior variance estimation. A
pyramid of spatial data mining visually illustrates the mining mechanism. All of
these applications concentrate on the bottlenecks that occur in spatial data mining
in areas such as GIS and remote sensing.
We were urged by scholars throughout the world to share our innovative inter-
disciplinary approach to spatial data mining; this monograph in English is the
result of our 10-year foray into publishing our collective wisdom. The first Chinese
edition was published in 2006 by Science Press and was funded by the National
Foundation for Academy Publication in Science and Technology (NFAPST) in
China. Simultaneously, we began writing an English edition. The first Chinese edi-
tion, meanwhile, was well received by readers and sold out in a short time, and
an unplanned second printing was necessary. In 2013, writing a second Chinese
edition was encouraged for publication in Science Press on the basis of our new
contributions to spatial data mining. As a result, the English edition, although
unfinished, was reorganized and updated. In 2014, Mr. Alfred Hofmann, the Vice
President of Publishing for Springer, came to Beijing Institute of Technology per-
sonally to sign the contract for the English edition for worldwide publication. In
2015, the second Chinese edition won the Fifth China Outstanding Publication
Award, a unique honor for Science Press; Ms. Zhu Haiyai, the President
of Publishing on Geomatics for Science Press, also personally wrote a long arti-
cle on the book’s publication process, which appeared in the Chinese Publication
Newspaper. Following the criteria of contribution to the field, originality of the
research, practicality of research/results, quality of writing, rigor of the research,
substantive research and methodology, the data field method was awarded the Fifth
Annual InfoSci®-Journals Excellence in Research Award by IGI Global. The contributions
were further reported by VerticalNews journalists and collected in Issues in Artificial
Intelligence, Robotics and Machine Learning by ScholarlyEditions and in Developments
in Data Extraction, Management, and Analysis by IGI Global.
To finish writing the monograph in English as soon as possible, we made it
a priority. We tried many translation methods, which taught us that only we, the
authors, could most effectively represent the monograph. Although it took nearly
10 years to finish it, the final product was worth the wait!
Deren Li
Shuliang Wang
Deyi Li
Acknowledgments
The authors are most grateful to the following institutions for their support: the
National Natural Science Foundation of China (61472039, 61173061, 60743001,
40023004, 49631050, 49574201, 49574002, 71201120, 70771083, 70231010, and
61310306046), the National Basic Research Program of China (2006CB701305,
2007CB310804), the National High Technology Research and Development Plan
of China (2001AA135081), China’s National Excellent Doctoral Thesis prizes
(2005047), the New Century Excellent Talents Foundation (NCET-06-0618),
Development of Infrastructure for Cyber Hong Kong (1.34.37.9709), Advanced
Research Centre for Spatial Information Technology (3.34.37.ZB40), the Doctoral
Fund of Higher Education (No. 20121101110036), and the Yunnan Nengtou project on new
techniques for big energy data, among others.
Knowing this book would not have become a reality without the support and
assistance of others, we acknowledge and thank our parents, families, friends, and
colleagues for their assistance throughout this process. In particular, we are very
grateful to Professors Shupeng Chen, Michael Goodchild, Lotfi A. Zadeh, and
Jiawei Han for taking the time to write the forewords for this book; the follow-
ing colleagues for providing their assistance in various ways: Professors Xinzhou
Wang, Kaichang Di, Wenzhong Shi, Guoqing Chen, Kevin P. Chen, Zongjian
Lin, Ying Chen, Yijiang Zou, Renxiang Wang, Chenghu Zhou, Jianya Gong, Xi
Li, Kun Qin, Hongchao Ma, Yan Zou, Zhaocong Wu, Benjamin Zhan, Jie Shan,
Xiaofang Zhou, Liangpei Zhang, Wenyan Gan, Xuping Zeng, Yangsheng You,
Liefei Cai, etc.; and the following students for their good work: Yasen Chen,
Jingru Tian, Dakui Wang, Yan Li, Caoyuan Li, Likun Liu, Hehua Chi, Xiao Feng,
Ying Li, Linglin Zeng, Wei Sun, Hong Jin, Jing Geng, Jiehao Chen, Jinzhao Liu,
and so on.
The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote
Sensing at Wuhan University, the International School of Software at Wuhan
University, the School of Software at Beijing Institute of Technology, and the
School of Economics and Management at Tsinghua University provided comput-
ing resources and a supportive environment for this project.
Deren Li
Shuliang Wang
Deyi Li
Contents
1 Introduction  1
   1.1 Motivation for SDM  1
      1.1.1 Superfluous Spatial Data  2
      1.1.2 Hazards from Spatial Data  4
      1.1.3 Attempts to Utilize Data  6
      1.1.4 Proposal of SDM  8
   1.2 The State of the Art of SDM  9
      1.2.1 Academic Activities  9
      1.2.2 Theoretical Techniques  10
      1.2.3 Applicable Fields  11
   1.3 Bottleneck of SDM  13
      1.3.1 Excessive Spatial Data  13
      1.3.2 High-Dimensional Spatial Data  13
      1.3.3 Polluted Spatial Data  14
      1.3.4 Uncertain Spatial Data  16
      1.3.5 Mining Differences  17
      1.3.6 Problems to Represent the Discovered Knowledge  17
      1.3.7 Monograph Contents and Structures  18
   1.4 Benefits to a Reader  20
   References  20
2 SDM Principles  23
   2.1 SDM Concepts  23
      2.1.1 SDM Characteristics  23
      2.1.2 Understanding SDM from Different Views  25
      2.1.3 Distinguishing SDM from Related Subjects  26
      2.1.4 SDM Pyramid  27
      2.1.5 Web SDM  29
   2.2 From Spatial Data to Spatial Knowledge  30
      2.2.1 Spatial Numerical  30
      2.2.2 Spatial Data  31
      2.2.3 Spatial Concept  31
About the Authors
In the context of numerous and changeable big data, the spatial discovery mech-
anism is believed to be based on a form of rules plus exceptions at hierarchical
mining views with various thresholds. This monograph explores the spatiotemporal
specialties of big data and introduces the reader to the data field, cloud model, min-
ing view, and Deren Li methods. The data field method captures the interactions
between spatial objects by diffusing the data contribution from a universe of sam-
ples to a universe of population, thereby bridging the gap between the data model
and the recognition model. The cloud model is a qualitative method that utilizes
quantitative numerical characters to bridge the gap between pure data and linguis-
tic concepts. The mining view method discriminates the different requirements by
using scale, hierarchy, and granularity in order to uncover the anisotropy of spatial
data mining. Finally, the Deren Li method performs data preprocessing to prepare
it for further knowledge discovery by selecting a weight for each iteration in order
to clean the observed spatial data as much as possible. In these methods, the spa-
tial association, distribution, generalization, and clustering rules are extracted from
geographical information system (GIS) datasets while simultaneously discovering
knowledge from the images to conduct image classification, feature extraction, and
expression recognition.
The authors’ spatial data mining research explorations are extensive and
include projects such as evaluating the use of spatiotemporal video data mining
for protecting public security, analyzing the brightness of nighttime light images
for assessing the severity of the Syrian Crisis, and tracking nighttime light
dynamics along the Belt and Road to help promote beneficial cooperation among
the participating countries toward global sustainability. All of their concepts and methods are
currently being applied in practice and are providing valuable results for a wide
variety of applications, such as landslide protection, public safety, humanitarian
aid, recognition of train wheel deformation, selection of new bank locations, spec-
ification of land uses, and human facial recognition. GISDBMiner, RSImageMiner,
and EveryData are computerized applications of these concepts and methods.
This book presents a rich blend of novel methods and practical applications
and also exposes the reader to the interdisciplinary nature of spatial data mining.
Chapter 1
Introduction
Spatial data mining (SDM) extracts implicit knowledge from explicit spatial datasets.
In this chapter, the motivation for SDM and an overview of its processes are
presented. The contents and contributions of this monograph are also summarized.
The applied cycle of new technologies is shortening. Humans are different from
animals in their ability to walk upright and make tools; human social properties
and natural properties may be separated based on physical performance, skills, and
intelligence. Human intelligence broadens the scope of activities in society, eco-
nomics, and culture and is constantly shortening the applied cycle of new tech-
nical innovations or inventions. Under this umbrella of evolving intelligence, the
instruments used in scientific and socio-economic fields to generate, collect, store,
process, and transmit spatial data are qualitatively changing constantly, which has
greatly shortened the cycle of their use (Fig. 1.1). For example, the integration
degree of a computer chip, the power of a central processing unit (CPU), and the
transfer rate of communications channels are doubling every 18 months. Reaching
50 million users in the United States took 38 years for radio broadcasting, 13 years
for television, and only four years for the internet. As a resource gateway for infor-
mation interdependence and interaction, the network space of the internet and the
World Wide Web (WWW) has greatly accelerated the computerized accumulation
of interactive data.
In the real world, many of these vast datasets are spatially referenced (Li and
Guan 2000) and provide geo-support for decision-making in areas such as trans-
portation services, urban planning, urban construction, municipal infrastructure,
resource allocation, hazard emergency response, capital optimization, product
marketing, and medical treatment. This spatial data infrastructure has accumulated
Fig. 1.1 Instruments and equipment changes with data accumulation (Li et al. 2006)
and multi-frequency and are being marketed with high space, high dynamics, high
spectral efficiency, and high data capacity. Based on these characteristics, a three-
dimensional global network of earth observations is becoming a reality (Wang and
Yuan 2014); this is comprised of large, medium, and small satellites that orbit the
earth at high, medium, and low levels. The sensors on the satellites enable the obser-
vation of spatial entities on earth by using multi-level, multi-angle, all-weather, or
all-around methods in a pan-cubic network. The observed resolution of satellite
imagery may be coarse or fine. The ground resolution can range from the kilometer
level to the centimeter level, the sensed spectrum can range from ultraviolet rays to
infrared rays, the sensed time interval can range from once every 10 days to three
times a day, and the detected depth can range from a few meters to over 10,000 m.
Furthermore, the capacity of sensed data is growing larger and larger while the
cycle gets shorter and shorter. The National Aeronautics and Space Administration
(NASA) Earth Observing System (EOS) is projected to generate some 50 Gbits of
remotely sensed data per hour. The sensors of the EOS-AM1 and PM1 satellites alone
can collect remotely sensed data at the terabyte level daily. Landsat satellites obtain
global coverage of satellite imaging data every two weeks, which has accumulated
more than 20 years of global data. Some satellites may give more than one type
of imagery data; for example, EOS may provide MODIS imaging spectral data,
ASTER thermal infrared data, and measurable four-dimensional simulation of the
CERES data, MOPITT data, and MISR data. Nowadays, remotely sensed data from
new types of satellites, such as QuickBird, IRS, and IKONOS, are being marketed
and applied; these computerized data are stored on magnetic media.
In order to define, manipulate, and analyze these data, spatial information sys-
tems such as GIS were developed for a personal computer. Because the local limi-
tation of a personal computer is not an issue within the network, data for the same
objects also can be globally networked, analyzed, or shared with authorization.
The Internet of Things (Höller et al. 2014) enables the internet to enter the real
world of physical objects. Ubiquitous devices are mutually connected and exten-
sively used (e.g., laptops, palmtops, cell phones, and wearable computers), with
an enormous quantity of data being created, transmitted, maintained, and stored.
Digital Earth is becoming a smart planet covered by a data skin.
The above-mentioned spatial data closely match the potential demand of
humans to recognize and use natural resources for sustainable development.
Moreover, these data are continuously increasing and are being amassed instanta-
neously with both attribute depth and the scope of objects.
newspaper when not detected by any editor. What’s more, when the advantages
and disadvantages of VCDs, CVDs, and DVDs are compared on the home appli-
ances page, the living page, and the technology page of the same issue of a daily
newspaper on the same day, three different conclusions may be drawn and intro-
duced to the readers.
The situation is much worse for spatial data because they are more complex,
more changeable, and larger than common transactional datasets. Spatial data
include not only positional data and attribute data but also the spatial relationships
among spatial entities in the universe, on the earth’s surface, and in other space.
Moreover, the database structure for spatial data is more complex than the tables
in ordinary relational databases. In addition to tabular data, there are raster images
and vector graphics in spatial databases, the attributes for which are not explic-
itly stored in the database. Furthermore, spatial data are heterogeneous, uncer-
tain, time-sensitive, and multi-dimensional; spatial data for the same entity may
be distributed in different locations and departments with different structures and
standards. Handling these data is complicated. For example, faced with the endless
images coming from satellite reconnaissance, the U.S. Department of Defense has
been unable to fully deal with these vast remotely sensed data.
Conventionally, RS focuses on the fundamental theories and methods of math-
ematical model analysis, which continues to be the primary stage of spatial infor-
mation processing. In this stage, data become information instead of knowledge,
and the amount of processed data is also very limited. Current commercial software
on image processing (e.g., ENVI, PCI, and ERDAS) cannot meet the need to take
advantage of new types of imagery data from RS satellites. Because these software
programs lack new techniques to deal with the new types of imagery data (partially
or completely), it is impossible for them to achieve intelligent mixed-pixel segmenta-
tion, automatic spectral match, and automatic feature extraction of the target objects.
We now face the challenge of processing excessive quantities of spatial data
while humans are clamoring for maximum use of the data. With the continuous
expansion of spatial data in terms of its scale, scope, and depth, more magnetic
media for storage, appropriate instruments for processing, and trained tech-
nicians for maintaining them are continually needed. Although awash in this
sea of spatial data, humans cannot make full use of it to aid spatially referenced
applications. Thus, the huge amounts of computerized datasets have far exceeded
human ability to completely interpret and use them appropriately, as shown in
Fig. 1.3 (Li et al. 2006).
Eventually, new bottlenecks will appear in geomatics, such as how to distin-
guish the data of interest from the meaningless data, how to understand spatial
data, how to extract useful information from the data, how to uncover the implicit
patterns in spatial data, and how to summarize uncovered patterns into useful
knowledge. It is also very difficult to answer time-sensitive questions such as the
following using these massive spatial data: Will it rain in the coming year? Where
will an earthquake happen tomorrow? The excessive data seem to be an unsolvable
problem. In the words of Naisbitt (1991), “We are drowning in information but
starved for knowledge.”
Fig. 1.3 The capacity of spatial data production greatly exceeds the capacity available
for data analysis (Li et al. 2006)
the database, along with some simple functions for data analysis and reporting
(Connolly 2004). GIS is one such specific software that has a spatial database.
In support of the management decision-making process, a subject-oriented, inte-
grated, time-variant, and non-volatile collection of data constitutes a data ware-
house (Inmon 2005). Efficient organization of the data in a data warehouse,
coupled with efficient and scalable tools, allows the data to be used correctly and
efficiently to support decision-making. During the process of data utilization, the
industry-standard data warehouses and online analytical processing (OLAP) plat-
forms are directly integrated. OLAP explains why certain relationships exist using
a graphical multi-dimensional hypercube and can provide useful information for
databases with a small number of variables, but problems arise when there are tens
or hundreds of variables (Giudici 2003).
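The hypercube view behind OLAP can be made concrete with a few lines of code. The minimal sketch below is not from the book; the region/year/land-use records and the roll_up helper are invented purely for illustration. It aggregates a tiny fact table along chosen dimensions, which is essentially the roll-up a multi-dimensional hypercube supports before the number of variables grows too large.

```python
from collections import defaultdict

# A tiny hypothetical fact table: (region, year, land_use, area in km^2).
facts = [
    ("East", 2012, "urban",    120.0),
    ("East", 2012, "farmland", 340.0),
    ("East", 2013, "urban",    135.0),
    ("West", 2012, "urban",     80.0),
    ("West", 2013, "farmland", 295.0),
]

def roll_up(facts, dims):
    """Aggregate the measure over the chosen dimensions (an OLAP-style roll-up)."""
    cube = defaultdict(float)
    for region, year, land_use, area in facts:
        record = {"region": region, "year": year, "land_use": land_use}
        key = tuple(record[d] for d in dims)
        cube[key] += area
    return dict(cube)

# The same data sliced along different dimension subsets of the cube.
print(roll_up(facts, ["region"]))          # total area per region
print(roll_up(facts, ["region", "year"]))  # finer cells: region x year
```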
Artificial intelligence (AI) and machine learning (ML) were proposed to
replace the manual calculation of the data by using computers. There are three
main schools of thought on the fundamental theories and methodologies for AI:
symbolism, connectionism, and behaviorism. Symbolism maintains that the basic
element of human cognition is a symbol and the cognitive process is equal to the
operating process of the symbol. Symbolism is based on the physical symbol sys-
tem hypothesis—that is, a symbol operating system and limited rationality theory.
Connectionism maintains that the basic element of the human brain is the neuron
rather than a symbol, and the working mode of the brain is to use the neural net-
work, which is connected by ample multilayered parallel neurons. Connectionism
concerns itself with artificial neural networks and evolutional computation.
Behaviorism believes that intelligence depends on perception and behavior and
emphasizes the interactions between the real world and its surroundings. The three
strategies are all simulations of human deterministic intelligence. To automati-
cally acquire the knowledge from expert systems, ML was introduced to simulate
human learning. Using centralized datasets, computerized devices learn to improve
their performance based on their past performance. However, it is necessary to
establish an auxiliary computer expert system and to work hard at training the
computer to learn on its own. Whatever the differences between AI and ML, the
issues that they deal with are only a small category of human intelligence or
human learning (Li and Du 2007). Thus, datasets cannot be used
sufficiently by independently utilizing statistics, a database system, AI, or ML.
Moreover, besides the conventional querying and reporting of explicit data, new
demands on spatial data are increasing steadily.
First, most techniques focus on a database or data warehouse, which is physi-
cally located in one place. However, the rapid advance of web techniques broad-
ens the data scope of the same entity—that is, from local data to global data. As
increasingly more dynamic sources of data and information become available
online, the web is becoming an important part of the computing community and
many data may be distributed on heterogeneous sites. This growth has an accom-
panying need for providing improved computational engines in a cost-effective
way with parallel multi-processor computer technology.
Second, before data are used for a given task, the existing data first should be
filtered by abandoning the meaningless data, leaving only the data of interest for
ultimate use and thereby improving the quality and quantity of the spatial data
used in depth and width.
Third, the increasing heterogeneity and complexity of new forms of data, such
as those arriving from the web and earth observation systems, require new forms
of patterns and models, together with new algorithms to discover such patterns
and models efficiently. Obviously, the vast amounts of accumulated spatial data
have extremely exceeded human capability to make full use of them. For exam-
ple, contemporary GIS analysis functionalities are not intelligent enough to offer
a further implicit description and future trends beyond the explicit spatial datasets
automatically.
Therefore, it is necessary to search for new techniques that will automatically
take advantage of the growing wealth of spatial databases and supervise the use of
data intelligently.
1.1.4 Proposal of SDM
applies the decision rules to the pre-processing of data, selection of data, modeling
of objects, extrapolating the knowledge, interpreting the knowledge, and applying
the knowledge. The knowledge is previously unknown, potentially useful, and ulti-
mately understandable. The datasets hide the spatial distribution, topological rela-
tionships, geographical features, moving trajectory, images, and graphics (Li et al.
2006; Leung 2009; Miller and Han 2009; Giannotti and Pedreschi 2010).
Since its introduction, SDM has continued to attract increasing interest from
individuals and organizations due to its various contributions to the location-
referenced field of academic activities, theoretical techniques, and real-world
applications.
1.2.1 Academic Activities
The academic activities surrounding SDM are rapidly growing. SDM is an impor-
tant theme and a hot topic that draws large audiences for papers, monographs,
demos, competitions, tools, and products. Besides its specific fields of application,
SDM has penetrated related fields, such as DM, knowledge discovery, and geo-
spatial information science. It is a key topic in the Science Citation Index (SCI).
Many well-known international publishing companies are also becoming increas-
ingly interested in SDM.
• Academic organizations: Knowledge Discovery Nuggets, the American Association
for Artificial Intelligence (AAAI), the International Society for Photogrammetry
and Remote Sensing (ISPRS).
• Conferences: IEEE International Conference on Spatial Data Mining and
Geographical Knowledge Services (ICSDM), IEEE International Conference
on Data Mining (ICDM), International Conference on Knowledge Discovery in
Databases (KDD), Symposium on Spatio-Temporal Analysis and Data Mining
(STDM), Advanced Spatial Databases, among others.
• Journals: Data Mining and Knowledge Discovery, WIREs Data Mining and
Knowledge Discovery, International Journal of Data Warehousing and Mining,
IEEE Transactions on Knowledge and Data Engineering, International Journal
of Very Large Databases, International Journal of Geographical Information
Science, International Journal of Remote Sensing, Artificial Intelligence,
Machine Learning, among others.
• Organizations: Spatial Data Mining Lab at Wuhan University, and Beijing
Institute of Technology in China, Spatial Data Mining and Visual Analytics Lab
at University of South Carolina in USA, Spatial Database and Spatial Data
Mining Research Group at University of Minnesota in USA.
Since the inception of SDM, Li Deren has devoted himself and his team to explor-
ing SDM theories and applications funded by organizations such as the National
Natural Science Foundation of China, the National High Technology Research and
Development Plan, and the National Basic Research Plan. The authors have pro-
posed novel techniques such as the cloud model, data fields, and mining views,
which are successfully being implemented to forecast landslide hazards, monitor
train wheel deformation, protect public safety, assess the severity of the Syrian cri-
sis, promote the beneficial cooperation of Asian countries, and understand Mars
images, to name a few. Li Deren and his team have been invited to chair inter-
national conferences, give keynote speeches, edit special issues, and write over-
view papers. They also founded the International Conference of Advanced Data
Mining and Applications (ADMA). During the continuing process of research
and application of SDM, students have received doctorates, and one thesis was
awarded “China’s National Excellent Doctoral Thesis Prizes” by the Chinese gov-
ernment. Their contributions to SDM are inspiring more and more institutions and
industries.
1.2.2 Theoretical Techniques
sets, which obviously solved the “specious” ambiguity. In a rough set on incom-
pleteness, the resulting value range in a set combined the crisp set and the fuzzy
interval—that is, {0, (0, 1), 1}, along with a precise pair of upper and lower
approximations. The unknown decision-making is approached using the known
background attributes. In a cloud model, the randomness and fuzziness are com-
patible with each other—that is, a random case with fuzziness or a fuzzy set with
randomness, which bridges qualitative knowledge and its quantitative data.
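As a minimal illustration of the rough-set value range {0, (0, 1), 1} mentioned above, the sketch below (with hypothetical objects and background attributes, not an example from the book) computes the lower and upper approximations of a target set from the equivalence classes induced by the known attributes: the lower approximation corresponds to membership 1, objects outside the upper approximation to 0, and the boundary region to the open interval (0, 1).

```python
from collections import defaultdict

# Hypothetical decision table: object id -> (background attributes, decision).
table = {
    1: (("clay", "steep"),  "landslide"),
    2: (("clay", "steep"),  "landslide"),
    3: (("clay", "gentle"), "stable"),
    4: (("sand", "steep"),  "landslide"),
    5: (("sand", "steep"),  "stable"),     # conflicts with object 4
    6: (("sand", "gentle"), "stable"),
}

# Equivalence classes: objects indiscernible by the background attributes.
classes = defaultdict(set)
for obj, (attrs, _) in table.items():
    classes[attrs].add(obj)

target = {o for o, (_, d) in table.items() if d == "landslide"}

lower = set().union(*(c for c in classes.values() if c <= target))
upper = set().union(*(c for c in classes.values() if c & target))

print("lower approximation:", sorted(lower))          # certainly in the set -> 1
print("boundary region:", sorted(upper - lower))      # uncertain -> (0, 1)
print("outside upper:", sorted(set(table) - upper))   # certainly not -> 0
```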
In addition to the set methods for SDM, neural networks (Miller 1990)
and genetic algorithms (Buckles and Petry 1994) are connected to bionics.
Visualization is a visual tool in which many abstract datasets for complex entities
are depicted in specific graphics and images with human perception (MacEachren
1999). The decision tree characterizes spatial entities via a tree structure. The rules
are generated by using top-down expansion or bottom-up amalgamation (Quinlan
1993). The spatial data warehouse refines spatial datasets for effective manage-
ment and public distribution on SDM (Inmon 2005). Based on the data warehouse,
online SDM may be implemented on multi-dimensional views and efficient and
timely responses to user commands. The network broadens the scope of SDM
(e.g., web mining). The space of the discovery state provides a methodological
framework for implementing SDM (Li and Du 2007).
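To illustrate the decision-tree rule generation mentioned earlier in this paragraph, the toy sketch below (the attributes, thresholds, and class labels are hypothetical, not taken from the book) reads each root-to-leaf path of a small tree off as an IF-THEN classification rule, which is the essence of generating rules by top-down expansion.

```python
# A hand-built tree: (attribute, threshold, left_subtree, right_subtree) or a class label.
# The split on slope and vegetation cover is only an illustrative example.
tree = ("slope", 25.0,
        "stable",                          # slope <= 25
        ("vegetation", 0.4,
         "landslide-prone",                # slope > 25 and vegetation <= 0.4
         "stable"))                        # slope > 25 and vegetation > 0.4

def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit it as an IF-THEN rule."""
    if isinstance(node, str):                      # leaf: a class label
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    attr, thr, left, right = node
    return (extract_rules(left,  conditions + (f"{attr} <= {thr}",)) +
            extract_rules(right, conditions + (f"{attr} > {thr}",)))

for rule in extract_rules(tree):
    print(rule)
```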
The state of the art of SDM has appeared many times in the literature.
Grabmeier and Rudolph (2002) reviewed clustering techniques in data mining.
Koperski et al. (1999) summarized the association rules discovered in RS, GIS,
computerized mapping, environmental assessment, and resource planning. After
the framework of SDM based on concepts and techniques was proposed (Li 1998),
Li Deren provided an overview of the theories, systems, and products (Li et al.
2001). Moreover, he wrote a monograph entitled “Spatial Data Mining Theory and
Application,” which systematically presented the origin, concepts, data sources,
knowledge to discover, usable techniques, system development, practical cases,
and future perspectives (Li et al. 2006, 2013). This monograph was praised as a
landmark in SDM (Chen 2007). Han et al. (2012) extended the concepts and tech-
niques of DM into SDM at an international conference focusing on geographic
knowledge discovery (Han 2009). Geostatistics-based spatial knowledge discovery
also was summarized (Leung 2009). The above-mentioned techniques are not iso-
lated from the practical application of them. The techniques are used comprehen-
sively and fully draw from the mature techniques of related fields, such as ML, AI,
and pattern recognition.
1.2.3 Applicable Fields
1.3 Bottleneck of SDM
The volume of spatial data is also rapidly increasing—not only by its horizontal
instances but vertical attributes as well. To depict a spatial entity accurately and
completely in a computerized world, increasingly more types of attributes need
to be observed in order to truly represent the objects in the real world. Only in
the spatial data infrastructures of Digital Earth are there attributes such as images,
Spatial data are the root sources for SDM. Poor data quality may directly result in
unreliable knowledge, inferior service, and wrong decision-making (Shi and Wang
2002). However, spatial data collected from the real world are contaminated,
which means that SDM frequently faces data problems such as incompleteness,
dynamic changes, noise, redundancy, and sparsity (Hernández and Stolfo 1998).
Therefore, the United States (U.S.) National Center for Geographic Information
and Analysis has named the accuracy of GIS as its priority theme, and error analy-
sis of spatial data was set as the foremost problem in the 12th working group. The
National Data Standards Committee for Digital Mapping in the U.S. determined
the following quality standards for spatial data: position accuracy, attribute accu-
racy, consistency, lineage, and integrity. However, it is not enough to study and
apply spatial data cleaning without understanding the errors in multi-source SDM.
(1) Incompleteness. Observational data are sampled from their data popula-
tion. When compared to its population, the limited samples are incomplete;
the observational data are also inadequate to understand a true entity. For
example, when monitoring the displacement of a landslide, it is impos-
sible to monitor every point in the landslide. Therefore, representative
points under the geological properties are chosen as the samples. Here, it is
incomplete to monitor the displacement of the landslide on the basis of the
observed data on the representative points. Each observation is not independ-
ent, and it may affect every point with different weights in the universe of
and that there is a determined boundary between two attributes. This approach
is obviously inconsistent with reality in the complicated and ever-changing real
world. Some studies addressing spatial data uncertainty have focused more on
positional uncertainty than attribute uncertainty.
1.3.5 Mining Differences
qualitative concept and quantitative data, and how to measure to what degree the
discovered knowledge is supportable, reliable, and of interest.
In summary, the above-mentioned difficulties may have a direct impact on the
accuracy and reliability of the resulting knowledge in SDM. They further make it
difficult to discover, assess, and interpret a large amount of important knowledge.
Sometimes, they disturb some development of SDM; however, if the difficulties
are reasonably resolved, decision-making errors using SDM may be avoided.
Additionally, spatial uncertainty in measuring the supporting information may
reflect the degree of confidence in the discovered knowledge. When the uncertain-
ties in the process of SDM are ignored, the resulting knowledge may be incom-
plete or prone to errors even if the techniques used are adequate.
Chapter 4: Spatial data cleaning. Whether spatial data are qualified or not, they
may influence the level of confidence in discovered knowledge directly. The errors
that may occur in spatial data are summarized along with cleaning techniques to
remove them.
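Chapter 4 gives the weighted iteration method in full; the sketch below only conveys the general idea of iterative reweighting (re-estimate, then down-weight observations whose residuals look abnormal). The MAD-based scale and the quadratic weight function are illustrative choices of my own, not the book's exact posterior-variance formulas.

```python
import statistics

def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def iterative_reweighting(values, iterations=10, k=2.0):
    """Re-estimate a value, then down-weight observations with abnormal residuals.

    The robust scale (MAD) and the quadratic down-weighting are illustrative
    stand-ins for a posterior-variance-based choice of weights per iteration.
    """
    weights = [1.0] * len(values)
    for _ in range(iterations):
        estimate = weighted_mean(values, weights)
        residuals = [v - estimate for v in values]
        med = statistics.median(residuals)
        sigma = 1.4826 * statistics.median(abs(r - med) for r in residuals) or 1e-12
        weights = [1.0 if abs(r) <= k * sigma else (k * sigma / abs(r)) ** 2
                   for r in residuals]
    return weighted_mean(values, weights), weights

# One gross error (9.50) among otherwise consistent measurements: its weight
# shrinks toward zero over the iterations, so the estimate follows the clean data.
obs = [10.02, 10.01, 9.98, 10.03, 9.99, 9.50]
estimate, final_weights = iterative_reweighting(obs)
print(round(estimate, 3), [round(w, 3) for w in final_weights])
```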
Chapter 5: Usable methods and techniques in SDM. Crisp set theory includes
probability theory, evidence theory, spatial statistics, spatial analysis, and data
fields. The extended set theory includes fuzzy sets, rough sets, and cloud models.
By using artificial neural networks and genetic algorithms, SDM simulates human
thinking and evolution.
Chapter 6: Data fields. Data fields bridge the gap between the mining model and
the data model. Informed by the physical field, a data field depicts the interaction
between data objects. Its field function models how the data are diffused in order
to contribute to the mining task. All the equipotential lines depict the topological
relationships among the interacted objects.
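The field function itself is developed in Chapter 6; as a rough sketch of the idea, the snippet below assumes a Gaussian-style decay (a common choice in the data-field literature, used here only for illustration) and superposes each object's diffused contribution at a query position. Evaluating this potential on a grid would trace the equipotential lines referred to above.

```python
import math

def potential(position, objects, sigma=1.0, masses=None):
    """Superposed data-field potential at `position`.

    Each data object diffuses its contribution with a Gaussian-style decay;
    sigma controls how far an object's influence reaches, and an optional
    mass lets some objects contribute more strongly than others.
    """
    masses = masses or [1.0] * len(objects)
    value = 0.0
    for (x, y), m in zip(objects, masses):
        d2 = (position[0] - x) ** 2 + (position[1] - y) ** 2
        value += m * math.exp(-d2 / (2 * sigma ** 2))
    return value

# Three sample points: the potential is highest where their influences overlap,
# which is how a data field exposes clustering structure in a sample set.
samples = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(round(potential((0.5, 0.0), samples), 3))  # inside the dense pair
print(round(potential((5.0, 5.0), samples), 3))  # near the isolated point
```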
Chapter 7: Cloud model. The cloud model bridges the gap between the qualitative
concept and quantitative data. Based on the numerical characteristics {Ex, En, He},
cloud model generators can create all types of cloud models for reasoning and
control in SDM.
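Chapter 7 develops these generators in full; as a minimal sketch, the standard forward normal cloud construction can be written as below (the Ex/En/He parameter values are illustrative only). Each cloud drop is a quantitative realization of the qualitative concept together with its certainty degree.

```python
import math
import random

def forward_normal_cloud(ex, en, he, n_drops):
    """Forward normal cloud generator.

    ex: expectation of the concept, en: entropy (fuzziness),
    he: hyper-entropy (uncertainty of the entropy itself).
    Each drop is a value x with its certainty degree mu for the concept.
    """
    drops = []
    for _ in range(n_drops):
        en_prime = random.gauss(en, he)       # a sampled entropy
        x = random.gauss(ex, abs(en_prime))   # a quantitative realization
        mu = math.exp(-(x - ex) ** 2 / (2 * en_prime ** 2))
        drops.append((x, mu))
    return drops

# "About 25 degrees Celsius" as a qualitative concept (illustrative numbers).
for x, mu in forward_normal_cloud(ex=25.0, en=2.0, he=0.2, n_drops=5):
    print(f"x = {x:6.2f}, certainty = {mu:.3f}")
```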
Chapter 8: GIS data mining. SDM promotes the automation and intelligent appli-
cation of discovered knowledge, thereby enabling more effective use of the current
and potential value of spatial data in GIS. Spatial association rules are discovered
with an Apriori algorithm, concept lattice, and cloud model. Spatial distribution
rules are determined with inductive learning. Decision-making knowledge is dis-
covered with a rough set. Clustering focuses on fuzzy comprehensive clustering,
1.4 Benefits to a Reader
References
Di KC, Li DR, Li DY (1997) Framework of spatial data mining and knowledge discovery.
Geomatics Inf Sci Wuhan Univ 4:328–332
Eklund PW, Kirkby SD, Salim A (1998) Data mining and soil salinity analysis. Int J Geogr Inf
Sci 12(3):247–268
Ester M et al (2000) Spatial data mining: databases primitives, algorithms and efficient DBMS
support. Data Min Knowl Disc 4:193–216
Fayyad UM, Uthurusamy R (eds) (1995) Proceedings of the first international conference on
knowledge discovery and data mining (KDD-95), Montreal, Canada, Aug 20–21, AAAI Press
Giannotti F, Pedreschi D (eds) (2010) Mobility, data mining and privacy: Geographic knowledge
discovery. Springer, Berlin
Giudici P (2003) Applied data mining: statistical methods for business and industry. Wiley,
Chichester
Goodchild MF (2007) Citizens as voluntary sensors: spatial data infrastructure in the world of
web 2.0. Int J Spat Data Infrastruct Res 2:24–32
Grabmeier J, Rudolph A (2002) Techniques of clustering algorithms in data mining. Data Min
Knowl Disc 6:303–360
Han JW, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. The Morgan
Kaufmann Publishers Inc, Burlington
Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge
problem. Data Min Knowl Disc 2:1–31
Höller J, Tsiatsis V, Mulligan C, Karnouskos S, Avesand S, Boyle D (2014) From machine-to-
machine to the internet of things: introduction to a new age of intelligence. Elsevier
Inmon WH (2005) Building the data warehouse, 4th edn. Wiley, New York
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis.
Wiley, New York
Koperski K (1999) A progressive refinement approach to spatial data mining. PhD thesis, Simon
Fraser University, British Columbia
Leung Y (2009) Knowledge discovery in spatial data. Springer, Berlin
Li DR (1998) On the interpretation of “GEOMATICS”. J Surveying Mapp 27(2):95–98
Li DY, Du Y (2007) Artificial intelligence with uncertainty. Chapman and Hall/CRC, London
Li DR, Cheng T (1994) KDG-Knowledge discovery from GIS. In: Proceedings of the Canadian
Conference on GIS, Ottawa, Canada, June 6-10, pp 1001–1012
Li DR, Guan ZQ (2000) Integration and implementation of spatial information system. Wuhan
University Press, Wuhan
Li DR, Wang SL, Shi WZ, Wang XZ (2001) On spatial data mining and knowledge discovery
(SDMKD). Geomatics Inf Sci Wuhan Univ 26(6):491–499
Li DR, Wang SL, Li DY, Wang XZ (2002) Theories and technologies of spatial data mining and
knowledge discovery. Geomatics Inf Sci Wuhan Univ 27(3):221–233
Li DR, Wang SL, Li DY (2006) Theories and applications of spatial data mining. Science Press,
Beijing
Li DR, Wang SL, Li DY (2013) Theories and applications of spatial data mining, 2nd edn.
Science Press, Beijing
MacEachren AM et al (1999) Constructing knowledge from multivariate spatiotemporal data:
integrating geographical visualization with knowledge discovery in database methods. Int J
Geogr Inf Sci 13(4):311–334
Miller WT et al (1990) Neural networks for control. MIT Press, Cambridge
Miller HJ, Han JW (eds) (2009) Geographic data mining and knowledge discovery, 2nd edn. The
Chapman and Hall/CRC, London
Naisbitt J (1991) Megatrends 2000. Avon Books, New York
Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic
Publishers, London
Pitt L, Reinke RE (1988) Criteria for polynomial time (conceptual) clustering. Mach Learn
2(4):371–396
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
Shi WZ, Wang SL (2002) GIS attribute uncertainty and its development. J Remote Sens
6(5):393–400
Sridharan NS (ed) (1989) Proceedings of the 11th international joint conference on artificial
intelligence, Detroit, MI, USA, August 20–25, Morgan Kaufmann
Tung A et al (2001) Spatial clustering in the presence of obstacles. IEEE Transactions on
Knowledge and Data Engineering, 359–369
Wang SL (2002) Data field and cloud model based spatial data mining and knowledge discovery.
Ph.D. Thesis, Wuhan University
Wang SL, Yuan HN (2014) Spatial data mining: a perspective of big data. Int J Data Warehouse
Min 10(4):50–70
Wynne HW, Lee ML, Wang JM (2007) Temporal and spatio-temporal data mining. IGI
Publishing, New York
Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
Zhang JX, Goodchild MF (2002) Uncertainty in geographical information. Taylor & Francis,
London
Chapter 2
SDM Principles
The spatial data mining (SDM) method is a discovery process of extracting gener-
alized knowledge from massive spatial data, which builds a pyramid from attribute
space and feature space to concept space. SDM is an interdisciplinary subject and
is therefore related to, but different from, other subjects. Its basic concepts are pre-
sented in this chapter, which include manipulating space, SDM view, discovered
knowledge, and knowledge representation.
2.1 SDM Concepts
SDM aims to improve human ability to extract knowledge and insights from large
and complex collections of digital data. It efficiently extracts previously unknown,
potentially useful, and ultimately understandable knowledge from these huge
datasets for a given task with constraints (Li et al. 2001, 2006, 2013; Wang 2002;
Han et al. 2012; Wang and Yuan 2014). To implement a credible, innovative, and
interesting extraction, the SDM method not only relies on the traditional theories
of mathematical statistics, machine learning, pattern recognition, neural networks,
and artificial intelligence, but it also engages new methods, such as data fields,
cloud models, and decision trees.
2.1.1 SDM Characteristics
SDM extracts abstract knowledge from concrete data. The data are explicit
while, in most circumstances, the knowledge is implicit. The extraction may be
a repeated process of human–computer interactions between the user and the
dataset. The original data are raw observations of spatial objects. Because the
data may be dirty, they need to be cleaned, sampled, and converted in accordance
with the SDM measurements and the user-designated thresholds. The discovered
knowledge consists of generic patterns acting as a set of rules and exceptions. The
patterns are further interpreted by professionals when they are utilized for data-
referenced decision-making with various requirements.
SDM’s source is spatial data, which are real, concrete, and massive in volume.
They already exist in the form of digital data stored in spatial datasets, such as
databases, data markets, and data warehouses. Spatial data may be structured (e.g.,
instances in relational data), semi-structured (e.g., text, graphics, images), or non-
structured (e.g., objects distributed in a network). Advances in acquisition hard-
ware, storage capacity, and Central Processing Unit (CPU) speeds have facilitated
the ready acquisition and processing of enormous datasets, and spatiotemporal
changes in networks have accelerated the velocity of spatial data accumulation.
Noises and uncertainties exist in spatial data (e.g., errors, incompleteness, redun-
dancy, and sparseness), which can cause problems for SDM; therefore, polluted
spatial data are often preprocessed for error adjustment and data cleaning.
SDM aims to capture knowledge. The knowledge may be spatial or non-spatial;
it is previously unknown, potentially useful, and ultimately understandable under
the umbrella of spatial datasets. This knowledge can uncover the description and
prediction of the patterns of spatial objects, such as spatial rules, general relation-
ships, summarized features, conceptual classification, and detected exception.
These patterns are hidden in the data along with their internal relationships and
developing trends. However, SDM is a much more complex process of selection,
exploration, and modeling of large databases in order to discover hidden models
and patterns, where data analysis is only one of its capabilities. SDM implements
high-performance distributed computing, seamless integration of data, rational
knowledge expression, knowledge updates, and visualization of results. SDM
also supports hyperlink and media quotas among hierarchical document structures
when mining data.
The SDM process is one of discovery instead of proofing. Aided by SDM
human–computer interaction, the process is automatic or at least semi-automatic.
The methods may be mathematic or non-mathematic, and the reasoning may be
deductive or inductive. The SDM process has the following requirements. First, it
is composed of multiple mutually influenced steps, which require repeated adjust-
ment to spiral up in order to extract the patterns from the dataset. Second, multi-
ple methods are encouraged, including natural languages, to present the process
of discovery and its results. SDM brings together all of the available variables and
combines them in different ways to create useful models for the business world
beyond the visual representation of the summaries in online analytical process-
ing (OLAP) applications. Third, SDM looks for the relationships and associa-
tions between phenomena that are not known beforehand. Because the discovered
knowledge usually only needs to answer a particular spatial question, it is not nec-
essary to determine the universal knowledge, the pure mathematical formula, or a
new scientific theorem.
With the help of networks, SDM can break down the local restrictions of spatial data, using not only the spatial data in its own sector but also larger scopes or even all of the data in the field of space and related fields. It thus becomes possible to discover more universal spatial knowledge and to implement spatial online analytical mining (SOLAM). To meet the needs of decision-making, SDM makes use of decentralized heterogeneous data sources, with timely and accurately extracted information and knowledge through data analysis using the query and analysis tools of the reporting module.
2.1.4 SDM Pyramid
In accordance with its basic concept, SDM’s process includes data preparation
(understanding the prior knowledge in the field of application, generating target
datasets, cleaning data, and simplifying data), data mining (selecting the data min-
ing functions and algorithms; searching for the knowledge of interest in the form
of certain rules and exceptions: spatial associations, characteristics, classifica-
tion, regression, clustering, sequence, prediction, and function dependencies), and
Fig. 2.1 SDM pyramid
The SDM pyramid closely matches the reality of physical-world experience in terms of the spatial concept, spatial data, spatial information, changes in spatial size, and increases in spatial scale, which finally become spatial knowledge. In different disciplines, the definitions of some basic concepts in Fig. 2.1, such as spatial data, spatial information, and spatial knowledge, may differ.
2.1.5 Web SDM
The Web is an enormous distributed parallel information space and valuable infor-
mation source. First, it offers network platform resources, such as network equip-
ment, interface resources, computing and bandwidth resources, storage resources,
and network topology. Second, a variety of data resources use the platform as a
carrier, such as text, sound, video data, network software, and application soft-
ware. The rapid development of network technology provides a new opportu-
nity for a wide range of spatial information sharing, synthesis, and knowledge
discovery.
While the high level of network resources makes network information usable, user-friendly, general, and reusable, how can these distributed, autonomous, heterogeneous data resources be used to obtain, form, and use the required knowledge in a timely manner? With the help of a network, the local restrictions on a spatial dataset are removed, and SDM can make use of not only the internal data of a department but also data on a greater scale, or even all of the spatial data in the field of space or space-related fields. The discovered knowledge thus becomes more meaningful. However, because of the inherent open, distributed, dynamic, and heterogeneous features of a network, it is difficult for the user to accurately and quickly obtain the required information.
Web mining is the extraction of useful patterns and implicit information from artifacts or activities related to the World Wide Web (Liu 2007). As the internet broadens the available data resources, web mining may include web-content mining, web-structure mining, and web-usage mining. Web-content mining discovers knowledge from the content of web-based data, documents, and pages or their descriptions. Web-structure mining uncovers knowledge from the structure of websites and the topological relationships among different websites (Barabási and Albert 1999). Web-usage mining extracts Web-user behavior by modeling and predicting how a user will use and interact with the Web (Watts and Strogatz 1998). One of the most promising mining areas being explored is the extraction of new, never-before encountered knowledge from a body of textual sources (e.g., reports, correspondence, memos, and other paperwork) that now reside on the Web. SDM is one of those promising areas. Srivastava and Cheng (1999) provided a taxonomy of web-mining applications, such as personalization, system improvement, site modification, business intelligence, and usage characterization. Internet-based text mining algorithms and information-based Web server log data mining research are attracting increasingly more attention. For example,
some of the most popular Web server logs are growing at a rate of tens of megabits every day; discovering useful models, rules, or visual structures from them is another area of research and application of data mining. To adapt to the distributed computing environment of networks, SDM systems should also make changes in their architecture and data storage mode, providing different levels of spatial information services, such as data analysis, information sharing, and knowledge discovery, on the basis of the display, service, acquisition, and storage of spatial information on the distributed computing platform. SOLAM and OLAP, which take a multi-dimensional view of SDM on a variety of data sources (Han et al. 2012), stress efficient implementation and timely response to user commands for more universal knowledge.
In the course of the development of a network, users come to realize that the
internet is a complex, non-linear network system. Next-generation internet will
gradually replace the traditional internet and become the information infrastruc-
ture of the future by integrating existing networks with new networks that may
appear in the future. As a result, the internet and its huge information resources
will be the most important basic strategic resource of a country. By efficiently and
reasonably using those resources, the massive consumption of energy and mate-
rial resources can be reduced and sustainable development can be achieved. The
authors believe that networked SDM must take into account the possible space,
time, and semantic inconsistencies in the databases distributed in networks, as
well as the difference in spatiotemporal benchmarks and standards for the seman-
tics. We must make use of spatial information network technology to build a pub-
lic data assimilation platform on these heterogeneous federal spatial databases.
Studies are underway to address solutions to these issues.
2.2 From Spatial Data to Spatial Knowledge
2.2.1 Spatial Numerical Value
2.2.2 Spatial Data
Spatial data are important references that help humans to understand the nature
of objects and utilize that nature by numerically describing spatial objects with
the symbol of attributes, amounts, and positions and their mutual relationships.
Spatial data can be numerical values, such as position elevation, road length, pol-
ygon coverage, building volume, and pixel grayscale; character strings, such as
place name and notation; or multimedia information, such as graphics, images,
videos, and voices. Spatial data are rich in content from the microcosmic world to
the macroscopic world, such as molecular data, surface data, and universal data.
Compared to common data, spatial data contain more specifics, such as spatiotem-
poral change, location-based distribution, large volume, and complex relation-
ships. In spatial datasets, there are both spatial data and non-spatial data. Spatial
data describe a geographic location and distribution in the real world, while
non-spatial data consist of all the other kinds of data. Sometimes, a spatial data-
base is regarded as a generic database—one special case of which is a common
database. Spatial data can be divided into raw data and processed data or digital
data and non-digital data. Raw data may be numbers, words, symbols, graphics,
images, videos, language, etc. Generally, SDM utilizes digital datasets after they
are cleaned.
2.2.3 Spatial Concept
A spatial concept defines and describes a spatial object, along with its connotation
and extension. Connotation refers to the essence reflected by the concept, while
extension refers to the scope of the concept. Generally, the definition of a spatial concept and its interpretation of connotation and extension are applied to explain the phenomena and the states of spatial objects, as well as to resolve problems.
2.2.4 Spatial Information
2.2.5 Spatial Knowledge
Spatial knowledge is a useful structure for associating one or more pieces of infor-
mation. The spatial knowledge obtained by SDM mainly includes patterns such
as correlation, association, classification, clustering, sequence, function, excep-
tions, etc. As a set of concepts, regulations, laws, rules, models, functions, or constraints, they describe the attributes, models, frequencies, and clusters discovered from spatial datasets and predict their trends, such as the association “IF the road and river intersect, THEN the intersection is a bridge over the river with 80 % possibility.” Spatial knowledge is different from isolated information, such as “Beijing,
the capital of China.” In many practical applications, it is not necessary to strictly
distinguish information from knowledge. It should be noted that data processing
to obtain professional information is a basis of SDM but is not equal to SDM,
such as image processing, image classification, spatial query, and spatial analy-
sis. The conversion from spatial data to spatial information—a process of data
2.2.6 Unified Action
Spatial numerical values act as digital carriers during the processes of object collection, object transmission, and object application. Objects in the spatial world are first transferred into forms of data by macro- and micro-sensors and other equipment based on certain conceptual models, approximately described according to some theories and methods, and finally stored as physical models in the physical medium (e.g., hard drives, disks, tapes, videos) of a database in a spatial information system (e.g., GIS) or as a separate spatial database. In fact, during the
process of data mining, the numerical data participate in the actual calculations,
and the unit of measurement in a certain space is only used to give those numerals
different spatial meanings.
A spatial numerical value is a kind of spatial data, while spatial data are the carriers of spatial information, which refer to the properties, quantities, locations, and relationships of the spatial entities represented by spatial symbols, such as numerical values, strings, graphics, images, etc. Spatial data represent the objects,
and spatial information looks for the content and interpretation of spatial data.
Spatial information is the explanation of the application values of spatial data in
a specific environment, and spatial data are the carrier of spatial information. The
conversion from spatial objects to spatial data, and then to spatial information, is a
big leap in human recognition. The same data may represent different information on different occasions, while different data may represent the same information on the same occasion. For example, “the landslide displaced 20 mm southward” is spatial data, while “the landslide displaced about 20 mm southward” is a spatial concept.
Spatial data are also the key elements to generate spatial concepts. A spatial
concept is closely related to spatial data; for example, “around 3,000 km” is a spa-
tial concept, but “3,000 km” is spatial data. The conversion between spatial con-
cept and spatial data is the cornerstone of the uncertain conversion between the
qualitative and the quantitative. A spatial concept is the description and definition
of a spatial entity. It is an important method used to represent spatial knowledge.
Although spatial data and spatial information are limited, spatial knowledge is
infinite. The applicable information structure that is formed by one or more pieces
2.3 SDM Space
With the generalization of a large spatial dataset, SDM runs in different spaces,
such as attribute space to recognize attribute data, feature space to extract features,
and concept space to represent knowledge. SDM organizes emergent spatial pat-
terns according to data interactions by using various techniques and methods. Fine
patterns are discovered in the microcosmic concrete space, and coarse patterns are
uncovered in the macrocosmic abstract space.
2.3.1 Attribute Space
Attribute space is a raw space that is composed of the attributes to depict spatial
objects. An object in human thinking is represented in the nervous system as an entity (e.g., a case, instance, record, tuple, event, or phenomenon), and an object in SDM is represented as an entity with attributes in a computerized system. The
dimension of attribute space comes from the attributes of spatial objects. Spatial
objects with multiple attributes create a multi-dimensional attribute space. Every
attribute in the attribute space links to a dimensional attribute of the object. When
it is put into attribute space, an object becomes a specific point with a tuple of its
attribute data in each dimension (e.g., position, cost, benefit). Thousands upon thousands of spatial objects are projected into the attribute space as points. In the context of specific issues, attribute data are primitive, chaotic, shapeless accumulations of natural states, but they are also the sources from which order and rules are generated. Working through the disorganized and seemingly countless attribute data, SDM uncovers the implied rules, orders, outliers, and relevance.
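As an informal illustration (not from the book) of how objects become points in a multi-dimensional attribute space, the sketch below stacks a few hypothetical spatial objects into a matrix; the attribute names and values are invented for the example.

```python
import numpy as np

# Each spatial object becomes one point (tuple) in attribute space.
# Columns are hypothetical attribute dimensions: position, cost, benefit.
attributes = ["position_km", "cost", "benefit"]
objects = {
    "parcel_a": [12.5, 300.0, 4.2],
    "parcel_b": [13.1, 280.0, 4.0],
    "parcel_c": [45.8, 900.0, 1.1],
}

# Stack the tuples into an (n_objects x n_attributes) matrix: the attribute space.
space = np.array(list(objects.values()))
print(space.shape)  # (3, 3): three objects, three attribute dimensions

# Pairwise distances in attribute space hint at which objects are similar,
# which is the raw material for later feature extraction and clustering.
dists = np.linalg.norm(space[:, None, :] - space[None, :, :], axis=-1)
print(np.round(dists, 2))
```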
2.3.2 Feature Space
Feature space is a generalized space that highlights the object essence on the basis
of attribute space. A number of different features of a spatial entity create a multi-
dimensional feature space. The feature may be single attribute, composite attrib-
ute, or derived attribute. It gradually approximates the nature of a great deal of
objects with many attributes. The more generalized the dataset, the more abstract
the feature is. For example, feature-based data reduction is an abstraction of spa-
tial objects by sharply declining dimensions and greatly reducing data processing.
Based on the features, attribute-based object points are regrouped, and as a whole they generate several object groups. The objects in the same group are often
characterized with the same feature. With gradual generalization, datasets are sum-
marized feature by feature, along with the distribution change of objects in fea-
ture space. Diverse distributions result in various object compositions and even
re-compositions. The combination varies with different features in the context of
the discovery task. Jumping from attribute space to feature space, attribute-based
object points are summarized as feature-based object points or clusters. When fur-
ther jumping from one microscopic feature space to another macroscopic feature
space, object points are partitioned into more summarized clusters. Furthermore,
the feature becomes more and more generic until the knowledge is discovered
(i.e., clustering knowledge).
2.3.3 Conceptual Space
Conceptual space may be generated from attribute space or feature space when
concepts are used in SDM. The objective world involves physical objects, and
the subjective world reflects the characteristics of the internal and external links
between physical objects and human recognition. From the existence of the sub-
jective object to the existence of self-awareness, each thinking activity targets a
certain object. The concept is a process of evolution and flow on objects, link-
ing with the external background. When there are massive data distributed in the
conceptual space, various concepts will come into being. Obviously, concepts are
more direct and better understood than data. Reflecting the connotation and extension, all kinds of concepts create a conceptual space. All the data in the conceptual space contribute to a concept. The contribution is related to the distance between
the data and the concept and the value of the data. The greater the value of the data
is, the larger the contributions of the data are. Given a set of quantitative data with
the same scope of attributes or features, how to generalize and represent the quali-
tative concepts is the basis of knowledge discovery. By combining and recombin-
ing the basic concepts in various ways, the cognitive events further uncover the
knowledge. In conceptual space, various concepts to summarize the object dataset
show different discovery states.
Discovery state space is a three-dimensional (3D) operation space for the cogni-
tion and discovery activity (Li and Du 2007). Initially, it is composed of the
2.4 SDM View
SDM view is a mining perspective which assumes that different knowledge may
be discovered from the same spatial data repositories for various mining purposes.
That is, different users with different backgrounds may discover different knowl-
edge from the same spatial dataset in different applications by using different
methods when changing measurement scales at different cognitive hierarchies and
under different resolution granularity.
The view-angle enables the dataset to be illuminated by the light of different purposes from different angles in order to focus on the differences when SDM creates patterns innovatively or uncovers unknown rules. These differences are accomplished via the
elements of SDM view, which include the internal essential elements that drive
SDM (i.e., the user, the application, and the method) and the external factors that
have an impact on SDM (i.e., hierarchy, granularity, and scale). The composition
of the elements and their changes result in various SDM views. As a computerized
simulation of human cognition by which one may observe and analyze the same
entity from very different cognitive levels, as well as actively moving between
the different levels, SDM can discover the knowledge not only in worlds with the
same view-angle but also in worlds with different view-angles from the same data-
sets for various needs. The choice of SDM views must also consider the specific
needs and the characteristics of the information system.
2.4.1 SDM User
A user is anyone interested in SDM, each with his or her own personality. There are all kinds of human users,
such as a citizen, public servant, businessman, student, researcher, or SDM expert.
The user also may be an organization, such as a government, community, enter-
prise, society, institute, or university. In an SDM organization, the user may be
an analyst, system architect, system programmer, test controller, market salesman,
project manager, innovative researcher, or president. The background knowledge
context of the SDM user may be completely unfamiliar, somewhat knowledgea-
ble, familiar, or proficient. The realm indicates the level of human cognition of the
world; different users with different backgrounds have different interests. When an SDM-referenced decision is made, the users are hierarchical. Top decision-mak-
ers macroscopically master the entire dataset for a global development direction
and therefore ask for the most generalized knowledge. Middle decision-makers
take over from the above hierarchy level and introduce and manage information.
Bottom decision-makers microscopically look at a partial dataset for local problem
resolution and therefore ask for the most detailed knowledge.
2.4.2 SDM Method
When users input external data into a system, summarize datasets, and build a
knowledge base, the conventional methods encounter problems because of the
complexity and ambiguity of the knowledge and the difficulties in representing it.
Fortunately, this is not the case for SDM, which is its major advantage. The SDM
model mainly includes dependency relationship analysis, classification, concept
description, and error detection. The SDM methods are closely related to the type
of discovered knowledge, and the quality of the SDM algorithms directly affects
the quality of the discovered knowledge. The operation of the SDM algorithms
is supported by techniques that include rule induction, concept cluster, and asso-
ciation discovery. In practice, a variety of algorithms are often used in combina-
tion. The SDM system can run automatically or through human–computer interaction, using its own spatial database or external GIS databases. The system may be developed as stand-alone, embedded, attached, etc. Various factors need to be taken into
account in SDM; therefore, SDM’s theories, methods, and tools should be selected
in accordance with the specific needs of the user. SDM can handle many technical
difficulties, such as massive data, high dimension, contaminated data, data uncer-
tainty, a variety of view-angles, and difficulties in knowledge representation (Li
and Guan 2000).
2.4.3 SDM Application
SDM is applied in a specific field with constraints and relativity. SDM uncovers
the specific knowledge in a specific field for spatial data-referenced decision-
making. The knowledge involves spatial relationships and other interesting knowl-
edge that is not stored in external storage but is easily accepted, understood, and
utilized. SDM specifically supports information retrieval, query optimization,
machine learning, pattern recognition, and system integration. For example, SDM
will provide knowledge guidance and protection to understand remote sensing
images, discover spatial patterns, create knowledge bases, reorganize image data-
bases, and optimize spatial queries for accelerating the automation, intelligence,
and integration of image processing. SDM also allows the rational evaluation of
the effectiveness of a decision based on the objective data available.
SDM is therefore a kind of decision-making support technology. The knowl-
edge in the decision-making system serves data applications by helping the user
to maximize the efficient use of data and to improve the accuracy and reliability of
their production, management, operation, analysis, and marketing processes. SDM
supports all spatial data-referenced fields and decision-making processes, such as
GIS, remote sensing, GPS, transportation, police, medicine, navigation, and robotics.
2.4.4 SDM Hierarchy
Hierarchy depicts the cognitive level of human beings when dealing with a dataset
in SDM, reflects the level of cognitive discovery, and describes the summarized
transformation from the microscopic world to the macroscopic world (e.g., knowl-
edge with different demands). Human thinking has different levels based on the
person’s cognitive stature. Decision-makers at different levels and under different
knowledge backgrounds may need different spatial knowledge. Simultaneously,
if the same set of data is mined from dissimilar view-angles, there may be some
knowledge of different levels. SDM thoroughly analyzes data, information, and
concepts at various layers by using roll-up and drill-down. Roll-up is for gener-
alizing coarser patterns globally, whereas drill-down is for detecting finer details
locally. Hierarchy conversion in SDM sets up a necessary communication bridge between hardware platforms and software platforms. The refresh and copy technologies include communication and reproduction systems, copying tools defined within the database gateway, and products designated for the data warehouse. The data transmission networks include the network protocol, network management framework, network operating system, type of network, etc. Middleware includes the database gateway, message-oriented middleware, object request broker, etc.
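As a rough sketch (not from the book) of the roll-up and drill-down operations mentioned above, the example below aggregates a small, invented attribute table from a fine district level up to a coarser city level and back; the column names and the use of pandas are illustrative assumptions.

```python
import pandas as pd

# Hypothetical observations at the finest (district) level of the hierarchy.
df = pd.DataFrame({
    "city":     ["Wuhan", "Wuhan", "Beijing", "Beijing"],
    "district": ["Hongshan", "Wuchang", "Haidian", "Chaoyang"],
    "population_k": [1600, 1200, 3100, 3400],
})

# Roll-up: generalize to a coarser level (city) for a more summarized pattern.
rolled_up = df.groupby("city", as_index=False)["population_k"].sum()
print(rolled_up)

# Drill-down: return to the finer level (district) to inspect local detail.
drilled_down = df.sort_values(["city", "district"])
print(drilled_down)
```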
Human thinking is hierarchical. The cognitive activity of human beings may
arouse some physical, chemical, or electrical changes in their bodies. Reductionism
in life science supposes that thinking activities can be divided into such brain hier-
archies as biochemistry and neural structure. However, definite relationships between thinking activities and sub-cellular chemical and electrical activities cannot be established, nor can it be determined which kind of neural structure will produce a certain kind of cognitive model. A good analogy here is the difficulty of monitor-
ing e-mail activities on computer networks by merely detecting the functions of
the most basal silicon CMOS (Complementary metal-oxide-semiconductor) chip
in a computer. As a result, reductionism is questioned by system theory, which
indicates that the characteristics of the system as a whole are not the superposition
of low-level elements. An appropriate unit needs to be found to simulate human
cognition activities. The composition levels of the matter can be seen as the hier-
archy. For example, visible objects are macroscopic, and celestial bodies are cosmological. Objects that are smaller than atoms and molecules are called micro-
scopic objects. The atomic hierarchy is very important because the physical model
of atoms is one of the five milestones for human beings to recognize the world.
Cognitive level is also related to resolution granularity and measurement scale.
Depending on the objects to be mined, SDM has the hierarchies of objects dis-
tributed throughout the world, from the analysis of protein interactions in live cells
to global atmospheric fluctuations. Spatial objects may be stars or satellites dis-
tributed in the universe; various natural or manmade features on the Earth’s surface are also projected and reflected in computerized information. They also can be
2.4.5 SDM Granularity
2.4.6 SDM Scale
2.4.7 Discovery Mechanism
from datasets. To simulate human thinking activities, SDM must find approaches
to establish relationships between the human brain and the computer. Logically,
SDM is an inductive process that goes from concrete data to abstract patterns,
from special phenomena to general rules. The process of discovering the implicit
knowledge from the explicit dataset is similar to the human cognitive process of
uncovering the internal nature from the external phenomena. As a discovery pro-
cess, SDM changes from concrete data to abstract patterns and also from particu-
lar phenomena to general laws. Therefore, the feature interactions among objects
may help knowledge discovery.
The essence of SDM is to observe and analyze the same dataset by reorganiz-
ing datasets at various distances with mining views. A change in the elements of view results in various SDM views. For example, a change in one or all of the hierarchy, granularity, and scale will make SDM differ in terms of high or low cognition, coarse or fine resolution, and small or large measurement. In the SDM process, the spatial concept first
is extracted from its corresponding dataset, and then the preliminary features are
extracted from the concept; finally, the characteristic knowledge is induced from
feature space. At a close distance, the knowledge template is at a low cognitive hierarchy with fine granularity and a large scale; the discovered knowledge is detailed and individualized for carefully distinguishing object differences microscopically. Tiny features may be uncovered by looking at typical examples, such as sharpening an image to highlight local features. At a far distance, the knowledge template is at a high cognitive hierarchy with coarse granularity and a small scale; the discovered knowledge is summarized generally for macroscopically mastering all of the objects as a whole. Subtle details are neglected in order to grasp the key problems, such as smoothing an image to suppress local features.
Regular rules and exceptional outliers are discovered simultaneously. A spatial
rule is a pattern showing the intersection of two or more spatial objects or space-
dependent attributes according to a particular spacing or set of arrangements (Ester
et al. 2000). In addition to the rules, during the discovering process of descrip-
tion or prediction, there may be some exceptions (also called outliers) that devi-
ate very much from other data observations (Shekhar et al. 2003). The following
approaches identify and explain exceptions (surprises). For example, spatial trend predictive modeling first discovers the centers that are local maxima of a certain non-spatial attribute and then determines its theoretical trend when moving away from the centers. Finally, a few deviations are found in that some data are
far from the theoretical trend, which may arouse the suspicion that they are noise
or are generated by a different mechanism. How are these outliers explained?
Traditionally, the detection of outliers has been studied using statistics. A number
of discordancy tests have been developed, most of which treat outliers as noise
and then try to eliminate their effects by removing them or by developing some
outlier-resistant method (Hawkins 1980). These outliers actually prove the rules;
in the context of data mining, they are meaningful input signals rather than noise.
In some cases, outliers represent unique characteristics of the objects that are
Generally, some types of geometric rules can be discovered from GIS databases,
such as geometric shape, distribution, evolution, and bias (Di 2001). This general
knowledge of geometry refers to the common geometric features of a certain
group of target objects (e.g., volume, size, shape). The objects can be divided into
three categories: points (e.g., an independent tree, settlements on a small-scale map), lines (e.g., rivers, roads), and polygons (e.g., residential areas, lakes, squares). The num-
bers or sizes of these objects are calculated by using the methods of mathemati-
cal probability and statistics. The size of linear objects is expressed by length and
width, and the size of a polygon object is represented by its area and perimeter.
The morphological features of objects are expressed by a quantitative eigenvalue
that is easily achieved by a computer through an intuitive and visual graph. The
morphological features of linear objects are characterized by their degree of twists and turns (complexity) and their direction; polygon objects are characterized by their intensity, the degree of twists and turns of their boundaries, and the direction of their major axis; and point objects have no morphological features. However, the morphological features of point objects gathered together as a cluster can be determined by methods similar to those for polygon objects. Generally, GIS databases only store some geometric features (e.g., length, area, perimeter, geometric center), while the calculation of morphological features requires special algorithms. Some statistical values of geometric features (e.g., minimum, maximum, mean, variance, mode) can be calculated; if there are enough samples, the feature histogram data can be used as a prior probability. Therefore, general geometric knowledge at a higher level can be determined according to the background knowledge.
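As an informal illustration (not from the book) of computing such geometric features, the sketch below derives the area, perimeter, and a simple compactness index of a polygon from its vertex coordinates; the compactness formula used here is one common choice among several, and the coordinates are invented.

```python
import math

def polygon_metrics(vertices):
    """Compute area (shoelace formula), perimeter, and a compactness index
    (4*pi*area / perimeter**2; 1.0 for a circle, smaller for irregular shapes)
    for a simple, non-self-intersecting polygon given as (x, y) vertices."""
    n = len(vertices)
    area2 = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(area2) / 2.0
    compactness = 4.0 * math.pi * area / perimeter ** 2 if perimeter else 0.0
    return area, perimeter, compactness

# A hypothetical lake boundary digitized from a map (coordinates in metres).
lake = [(0, 0), (100, 0), (120, 80), (60, 120), (-10, 70)]
print(polygon_metrics(lake))
```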
Spatial association refers to the internal rules among spatial entities that are
present at the same time and describes the conditional rules that frequently appear
in the feature data of spatial entities in a given spatial database. First, an association rule may describe adjacency, connectivity, symbiosis, or containment. A one-dimensional association includes a single predicate, and a multi-dimensional association
includes two or more spatial entities or predicates. Second, the association may
be general and strong. General association is a common correlation among spatial
entities, and a strong association appears frequently (Koperski 1999). The mean-
ings of strong association rules are more profound and their application range is
broader. They also are known as generalized association rules. Third, association
rules are descriptive rules that provide a measurement of the support, confidence,
and interest. For example, “(x, road) → close to (x, river) (82 %)” describes an association between roads and rivers in Chengde. Describing association rules in
a form that is similar to structured query language (SQL) brings SDM into stand-
ard language and engineering. If the attributes of objects in SDM are limited to
the Boolean type, association rules can still be extracted from a database that contains categorical attributes by converting the types and combining some information on the same objects. Fourth, association rules are temporal and transfer-
able and may ask for additional information, such as a valid time and transferable
condition. For example, the association rules on Yellow River water—“IF it rains,
THEN the water level rises (spring and summer)” and “IF it rains, THEN the
water level recedes (autumn and winter)”—cannot be exchanged when they are
used to prevent and reduce the flooding of the Yellow River. Similarly, the association rules obtained in the first round of an election may be different from, or even contrary to, the rules determined in the second round as the voters’ willingness may be transferred under the given conditions. Rationally predicting this transfer can help candidates adjust their election strategy, thereby increasing
the possibility of winning. Finally, an association rule is a simple and practical
rule in SDM that attracts many researchers for normalization, query optimization,
minimizing a decision tree, etc. in spatial databases.
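A rough sketch (not from the book) of evaluating one candidate spatial association rule follows; the predicate names, the close-to distance threshold, and the object coordinates are invented for the example.

```python
from math import hypot

# Hypothetical spatial objects with a type and a point location (x, y in km).
objects = [
    {"id": 1, "type": "road",  "xy": (0.0, 0.0)},
    {"id": 2, "type": "road",  "xy": (5.0, 1.0)},
    {"id": 3, "type": "road",  "xy": (9.0, 8.0)},
    {"id": 4, "type": "river", "xy": (0.5, 0.4)},
    {"id": 5, "type": "river", "xy": (5.2, 1.3)},
]

CLOSE_TO_KM = 1.0  # assumed distance threshold for the close_to predicate

def close_to(a, b):
    return hypot(a["xy"][0] - b["xy"][0], a["xy"][1] - b["xy"][1]) <= CLOSE_TO_KM

# Evaluate the candidate rule: is_a(x, road) -> close_to(x, river).
roads = [o for o in objects if o["type"] == "road"]
rivers = [o for o in objects if o["type"] == "river"]
both = sum(1 for r in roads if any(close_to(r, v) for v in rivers))

support = both / len(objects)   # fraction of all objects satisfying the rule
confidence = both / len(roads)  # fraction of roads that are close to a river
print(f"support={support:.2f}, confidence={confidence:.2f}")
```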
Spatial clustering rules group a set of data in a way that maximizes the feature
similarity within clusters and minimizes the feature similarity between two differ-
ent clusters. Subsequently, spatial objects are partitioned into different groups via
the feature similarity by making the difference in data objects between different
groups as large as possible and the difference between data objects in the same
group as small as possible (Grabmeier and Rudolph 2002). According to the dif-
ferent criteria of similarity measurement and clustering evaluation, the commonly
used clustering algorithms may be based on partition, hierarchy, density, and grid
(Wang et al. 2011). Clustering rules further help SDM to discretize data, refine
patterns, and amalgamate information. For example, continuous datasets are parti-
tioned into discrete hierarchical clusters, and multisource information is amalgamated for the same object.
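As a minimal sketch of partition-based clustering, one of the method families mentioned above, the following k-means implementation groups invented object coordinates so that differences within a group stay small and differences between groups stay large; it is illustrative only and not the book's prescribed algorithm.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """A minimal k-means: partition points so that distances to their cluster
    centroid (within-cluster dissimilarity) are minimized."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster becomes empty.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Hypothetical object locations forming two spatial groups.
pts = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
                [8.0, 8.2], [7.8, 8.5], [8.3, 7.9]])
labels, centroids = kmeans(pts, k=2)
print(labels, centroids)
```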
Predictable spatial rules can forecast an unknown value, label, attribute, or trend
by assigning a spatial-valued output to each input. These rules can determine
the internal dependence between spatial objects and their impact variables in the
future, such as a regressive model or a decision tree. Before forecasting, corre-
lation analysis can be used to identify and exclude attributes or entities that are
useless or irrelevant. For example, the future occurrence of a volcanic eruption can be predicted from the structure, plate movement, and gravity field of the Earth stored in the database. When predicting future values of data based on the trends changing with
time, the specificity of the time factor should be fully considered.
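As a hedged illustration of such a predictive rule, the sketch below fits a shallow decision-tree regressor (using scikit-learn, which the book does not prescribe) to invented attributes and target values; the feature names are assumptions for the example.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training samples: [slope_deg, annual_rainfall_mm, distance_to_fault_km]
X = [
    [30, 1800, 0.5],
    [25, 1600, 1.2],
    [10,  700, 8.0],
    [12,  800, 6.5],
    [35, 2000, 0.3],
]
# Target to predict, e.g. observed ground displacement in mm/year.
y = [22.0, 15.0, 1.5, 2.0, 30.0]

# A shallow decision tree keeps the learned predictive rule easy to interpret.
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the displacement for an unseen location.
print(model.predict([[28, 1700, 0.8]]))
```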
A spatial serial rule summarizes the spatiotemporal pattern of spatial objects
changing during a period of time. It links the relationships among spatial data and
time over a long period of time. For example, in a city over the years, banks and
their branches store their operating income and expenditure accounting records
and the police department records security cases, from which financial and
social trends can be discovered under their geographical distribution. The time
constraints can be depicted with a time window or adjacent sequences. Time-
constrained serial rules also are called evolution rules. Although they are related
to other spatial rules, serial rules mining concentrates more on historical datasets
of the same object in different times, such as series analysis, sequence match, time
reasoning, etc. Spatial serial rule mining is utilized when the user wants to obtain more refined information from the implicit models of a given period. Only by
using a series of values of the existing data changing with time can SDM better
predict future trends based on the mined results. When little change has occurred
in a database, gradual sequence rule mining may speed up the SDM process by
taking advantage of previous results.
Outlier detection, in addition to the commonly used rules, is used to extract inter-
esting exceptions from datasets in SDM via statistics, clustering, classification,
and regression (Wang 2002; Shekhar et al. 2003). Outlier detection can also iden-
tify system faults and fraud before they escalate with potentially catastrophic con-
sequences. Although outlier detection has been used for centuries to detect and
remove anomalous observations from data, there is no rigid mathematical defini-
tion of what constitutes an outlier. Ultimately, it is a subjective exercise to deter-
mine whether or not an observation is an outlier. There are three fundamental
approaches to outlier detection (Hodge and Austin 2004):
(1) Determine the outliers without prior knowledge of the data, which pro-
cesses the data as a static distribution, pinpoints the most remote points, and
flags them as potential outliers. Essentially, it is a learning approach analogous
to unsupervised clustering.
(2) Model both normality and abnormality, which is analogous to supervised
classification and requires pre-labeled data tagged as normal or abnormal.
(3) Model only normality (or in a few cases, abnormality), which may be
considered semi-supervised as the normal class is taught but from which the
algorithm learns to recognize abnormality. It is analogous to a semi-supervised
recognition or detection task.
Statistical methods often assume a normal distribution of the data in order to identify observations that are deemed unlikely on the basis of the mean and standard deviation. Distance-based methods frequently use the distance to the nearest neighbors to label observations as outliers or non-outliers (Ramaswamy et al. 2000). For sequence rules, outlier detection is a heuristic method that uses linear deviation detection to recognize as exceptions the data causing sudden, severe fluctuations in the sequential data. Lee (2000) used a fuzzy neural network to estimate the rules for dealing with distribution abnormality in spatial statistics.
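In the spirit of the distance-based methods cited above (though not their exact algorithms), the following sketch ranks invented points by the distance to their k-th nearest neighbor and flags the largest as candidate outliers.

```python
import numpy as np

def knn_distance_outliers(points, k=2, top_n=1):
    """Rank points by the distance to their k-th nearest neighbor; the largest
    distances are flagged as candidate outliers (a distance-based heuristic)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distance
    kth = np.sort(d, axis=1)[:, k - 1]   # distance to the k-th nearest neighbor
    return np.argsort(kth)[::-1][:top_n], kth

# Hypothetical monitoring points; the last one sits far from the others.
pts = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]])
idx, scores = knn_distance_outliers(pts, k=2, top_n=1)
print("candidate outlier index:", idx, "scores:", np.round(scores, 2))
```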
Spatial exceptions or outliers are the deviations or independent points beyond the common features of most spatial entities. An exception is an abnormal-
ity. If manmade factors have been ruled out, an exception is often the presence of
sudden changes (Barnett 1978). Deviation detection—a heuristic approach to data
mining—can identify the points that have sudden fluctuations in the sequence data
as exceptions (Shekhar et al. 2003). Spatial exceptions are the object features that
are inconsistent with the general actions or universal models of the data in spatial
datasets. They are the descriptions of analogical differences, such as the special
case in a standard class, the isolated points out of various classifications, the dif-
ference between a single attribute value and a set of attribute values in time series,
and a significant difference between the actual value of an observation and the sys-
tem forecasting value. Many data mining methods ignore and discard exceptions as noise or abnormalities. Although excluding such exceptions may be conducive
to highlighting the generality, some rare spatial exceptions may be much more
significant than normal spatial objects (Hawkins 1980). For example, near a monitoring point with a notably large displacement, there may be a potential landslide hazard, which is decisive knowledge for landslide prediction. Spatial
exceptional knowledge can be discovered with data fields, statistical hypothesis
testing, or identifying feature deviations.
2.6.1 Natural Language
Natural language is one of the best methods to describe datasets based on human
thinking and communicating with each other. As a carrier of human thinking, natu-
ral language serves as a powerful tool for thinking by displaying and retaining the subject of thought and by organizing the process of thinking. It is the foundation of a variety of other formal systems or languages that are derived from it, such as computer languages or specific symbolic languages like mathematical language; the formal systems constituted by these symbols further become new formal systems.
The basic language value of natural language is a qualitative concept, cor-
responding to a group of quantitative data. Seen from the process of the atomic
model that evolved from the Kelvin model, the Thomson model, the Lenard model,
the Nagaoka model, and the Nicholson model to Rutherford’s atomic model with
nuclei, it is a universal and effective methodology to work out the model of mate-
rial composition. The concept maps the object from the objective world to subjec-
tive cognition. As for concept generation, regardless of whether the feature-list theory or the prototype theory is used, all conceptual samples are reflected by
a set of data. The smallest unit of natural language is the language value to describe
the concepts. The most basic language value represents the most basic concept—
that is, the linguistic atom. As a result, the linguistic atom forms the atomic model
when human thinking is modeled with the help of natural language.
The difference between spatial knowledge and non-spatial knowledge lies in
spatial knowledge having spatial concepts and spatial relationships; furthermore,
the abstract representation of these spatial concepts and spatial relationships is
most appropriately expressed by language values. Mastering the quantitative data-
set with qualitative language values conforms to human cognitive rules. Obtaining
qualitative concepts from a quantitative dataset reflects the essence of objects more
profoundly, and subsequently fewer resources are spent to deliver adequate infor-
mation and make efficient judgments and reasoning of complex things. When rep-
resenting the definitive properties of discovered knowledge, soft natural language
is more universal, more real, more distinct, more direct, and easier to understand
than exact mathematical language. A lot of knowledge obtained by SDM is quali-
tative knowledge after induction or abstraction, or a combination of qualitative and
quantitative knowledge. The more abstract the knowledge is, the more suitable is
natural language. However, the concepts represented by natural language inevitably carry uncertainty and can even be blind and undisciplined, which is a bottleneck for freely transforming between quantitative data and qualitative concepts.
Support and confidence are two important indicators that reflect the usable level of
the rules: support is the measurement of a rule’s importance, whereas confidence
is the measurement of a rule’s accuracy. The support illustrates how representative the rule is among all transactions; the larger the support, the more important the association rule. A rule with a high confidence level but very small support rarely happens and has little practical value.
In the actual SDM, it is important to define the thresholds of spatial knowl-
edge measurements (e.g., minimum support and minimum confidence). Under
normal circumstances, when the lift of a useful association rule is greater than 1, the confidence of the association rule is greater than the expected confidence; if its lift is not greater than 1, the association rule is meaningless. The admission thresholds are generally based on experience, but they can also be determined by statistics. Only when their measurements are larger than the defined thresholds can the rules be accepted as interesting ones (also called strong rules) for application. Moreover, the defined thresholds should be appropriate for the given situation. If the thresholds are too small, a large number of useless rules will be discovered, which not only reduces the efficiency of the implementation and wastes system resources but also may drown out the main objective. If the thresholds are too large, there may be no rules at all, too few rules, or the expected rules may be filtered out.
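To make these measures concrete, here is a small, self-contained sketch (not from the book) that computes support, confidence, and lift for one hypothetical rule from predicate records and applies assumed thresholds.

```python
# Each record lists the spatial predicates observed for one object (invented data).
records = [
    {"is_road", "close_to_river"},
    {"is_road", "close_to_river"},
    {"is_road"},
    {"is_river"},
    {"is_road", "close_to_river"},
]

antecedent, consequent = "is_road", "close_to_river"
n = len(records)
n_a = sum(antecedent in r for r in records)
n_c = sum(consequent in r for r in records)
n_ac = sum(antecedent in r and consequent in r for r in records)

support = n_ac / n              # how representative the rule is
confidence = n_ac / n_a         # how accurate the rule is
lift = confidence / (n_c / n)   # >1 means confidence exceeds the expected confidence

MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6   # assumed user-defined thresholds
is_strong = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE and lift > 1
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f} strong={is_strong}")
```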
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of
international conference on very large databases (VLDB), Santiago, Chile, pp 487–499
Barnett V (1978) Outliers in statistical data. Wiley, New York
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
Di KC (2001) Spatial data mining and knowledge discovering. Wuhan University Press, Wuhan
Ester M et al (2000) Spatial data mining: databases primitives, algorithms and efficient DBMS
support. Data Min Knowl Disc 4:193–216
Frasconi P, Gori M, Soda G (1999) Data categorization using decision trellises. IEEE Trans
Knowl Data Eng 11(5):697–712
Grabmeier J, Rudolph A (2002) Techniques of clustering algorithms in data mining. Data Min
Knowl Disc 6:303–360
Han JW, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. Morgan
Kaufmann Publishers Inc., Burlington
Hawkins D (1980) Identifications of outliers. Chapman and Hall, London
Hodge VJ, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Koperski K (1999) A progressive refinement approach to spatial data mining. Ph.D. thesis, Simon
Fraser University, British Columbia
Lee ES (2000) Neuro-fuzzy estimation in spatial statistics. J Math Anal Appl 249:221–231
Li DR, Guan ZQ (2000) Integration and Implementation of spatial information system. Wuhan
University Press, Wuhan
Li DR, Wang SL, Shi WZ, Wang XZ (2001) On spatial data mining and knowledge discovery
(SDMKD). Geomatics Inf Sci Wuhan Univ 26(6):491–499
Li DR, Wang SL, Li DY (2006) Theory and application of spatial data mining, 1st edn. Science
Press, Beijing
Li DR, Wang SL, Li DY (2013) Theory and application of spatial data mining, 2nd edn. Science
Press, Beijing
Li DY, Du Y (2007) Artificial intelligence with uncertainty. Chapman and Hall/CRC, London
Liu B (2007) Web data mining: exploring hyperlinks, contents, usage data, 2nd edn. Springer,
Heidelberg
Piatetsky-Shapiro G (1994) An overview of knowledge discovery in databases: recent progress
and challenges. In: Ziarko Wojciech P (ed) Rough sets, fuzzy sets and knowledge discovery.
Springer, Berlin, pp 1–10
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, pp 427–438
Reshef DN et al (2011) Detecting novel associations in large data sets. Science 334:1518
Shekhar S, Lu CT, Zhang P (2003) A unified approach to detecting spatial outliers.
GeoInformatica 7(2):139–166
Srivastava J, Cheng PY (1999) Warehouse creation-a potential roadblock to data warehousing.
IEEE Trans Knowl Data Eng 11(1):118–126
Wang SL (2002) Data field and cloud model based spatial data mining and knowledge discovery. Ph.D. thesis, Wuhan University, Wuhan
Wang SL, Shi WZ (2012) Data mining and knowledge discovery. In: Kresse Wolfgang, Danko
David (eds) Handbook of geographic information. Springer, Berlin
Wang SL, Yuan HN (2014) Spatial data mining: a perspective of big data. Int J Data Warehouse Min
10(4):50–70
Wang SL, Gan WY, Li DY, Li DR (2011) Data field for hierarchical clustering. Int J Data
Warehouse Min 7(4):43–63
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small world’ networks. Nature 393:440–442
Witten I, Frank E (2000) Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann Publishers, San Francisco
Chapter 3
SDM Data Source
SDM would be like water without a source or a tree without roots if it were separated from its data resources. The development of new techniques promotes the service features of spatial data. This chapter will explain spatial data based on their contents and characteristics; review the techniques for acquiring spatial data; introduce the structure of spatial data based on vectors, raster structures, and their integration; discuss the process of modeling spatial data; explain spatial databases and data warehouses in the context of seamless organization and fusion; introduce the National Spatial Data Infrastructures (NSDI) of nations and regions, highlighting China’s NSDI; and, based on NSDI, distinguish among Digital Earth, Smart Earth, and Big Data, in which SDM plays an important role.
3.1 Contents and Characteristics of Spatial Data
3.1.1 Spatial Objects
Spatial objects are the core of an object-oriented data model. They are abstracted rep-
resentations of the entities (ground objects or geographical phenomena) that are natural
or artificial in the real world (e.g., building, river, grassland). In a spatial dataset, a large number of spatial objects are classified as point, line, area, or complex objects; a minimal data-structure sketch of these categories follows the list below.
1. Point objects refer to points on Earth’s surface—that is, single points (chimneys,
control points), directed points (bridges, culverts), and group points (street lamps,
scattered trees). A point object contains a spatial location without shape and size
and occupies only one location data point in a computerized system.
2. Linear objects refer to the spatial curve of Earth’s surface. They are either a
point-to-point string or an arc string that can be reticulated, but they should be
interconnected (e.g., rivers, river grids, roads). They often have a shape but no
size; the shape is either a continuous straight line or a curve when projected
on a flat surface. In a computerized system, a set of elements are required to
fill the whole path.
3. Areal objects refer to curved regions of Earth’s surface and contain shape and size. A flat surface comprises the compact space surrounded by its boundaries and a set of elements that fill the path. Areal objects are composed
of closed but non-crossing polygons or rings, which may contain a number of
islands that represent objects such as land, housing, etc.
4. Complex objects consist of any two or more basic objects (point, line, or area)
or the more complicated geo-objects. A simple ground object can be defined
as the composition of a single point, a single line, or a single area. Likewise,
a complex ground object can be defined as a composition of several simple
ground objects (e.g., the centerline, surface, traffic lights, overpasses, etc.
on a road) or elemental objects such as roads, houses, trees, and water tow-
ers. Elemental objects are depicted with geometric data and their interrelated
semantics.
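As a minimal, illustrative sketch (not prescribed by the book), the four categories above might be modeled as simple data structures; the class names and fields are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coordinate = Tuple[float, float]

@dataclass
class PointObject:            # e.g., a chimney or control point: location only
    location: Coordinate

@dataclass
class LinearObject:           # e.g., a road or river: an ordered path of points
    path: List[Coordinate]

@dataclass
class ArealObject:            # e.g., a lake: a closed boundary ring plus islands
    boundary: List[Coordinate]
    islands: List[List[Coordinate]] = field(default_factory=list)

@dataclass
class ComplexObject:          # e.g., a road with its centerline, lights, overpasses
    parts: List[object]

road_centerline = LinearObject([(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)])
traffic_light = PointObject((1.0, 0.5))
road = ComplexObject(parts=[road_centerline, traffic_light])
print(len(road.parts))
```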
For SDM, a spatial dataset comprises positional, attribute, graphic and image, network, text, and multimedia data.
Positional data describe the specific position of a spatial object in the real world
and are generally the quantitative coordinate and location reference observed by
certain equipment and methodologies. Survey adjustment refers to the process of
confirming the estimate and accuracy of unknown parameters through observed
data obtained according to the positional data. To reduce data redundancy, the essential organized data can be made up of nodes and arcs; that is, only the nodes and arcs contain the spatial location, and the other spatial objects are represented by node-arc structures. Modern systems for acquiring spatial data are now able to provide spatial data at micron resolution, which can result in a large number of positional coordinates.
Attribute data are the quantitative and qualitative description index of sub-
stances, characteristics, variables, or certain geographic objects. They can be
regarded as the facts of some point, point set, or features (Goodchild 1995). The
attribute data can be divided into quantitative data and qualitative data, which can
be discrete values or continuous values. A discrete value is the limited elements
of a finite set, while a continuous value is any value in a certain interval (Shi and
Wang 2002). The attribute data of GIS describe the attributes of a point, line, poly-
gon, or remote sensing images.
Graphics and images are the visual representation and manipulation of spatial
objects on a certain surface. Their main texture or objects constitute the body of
spatial objects. Generally, graphic data are vector structures and imagery data are
raster structures. In recent years, with the increase in various satellites, the amount
of available remote-sensing images has grown dramatically. Finding interesting
objects from complex remote-sensing images for tasks such as environmental
inspection, resources investigation, and supervision of ground objects, which all
rely on remote sensing, is becoming increasingly cumbersome and time-consum-
ing. Remote-sensing images may be gray or pseudo-color and include the geo-
metric and spectral data of the ground objects. Images at different scales are often
used in different fields. For example, a small-scale image for environment inspec-
tion is mainly about texture and color, and a large-scale image for ground target
supervision is chiefly about shapes and structures. The similarity of image features
reflects the similarity between the entire image and its main objects. In addition
to the feature extraction that common image retrieval provides, remote-sensing
image retrieval also considers the comprehensive metadata, including geographical
location, band, different sensor parameters, the relationship between the scale fac-
tors and the image content, as well as the storage costs and efficiency of queries that require expansive and numerous details. Consequently, content-based remote-sensing image retrieval has its own characteristics in terms of feature selection, similarity comparison, query mechanism, system structure, and many other aspects.
Network data are in a distributed and paralleled data space. The development of
network technology and its extensive use have made it possible to exchange data
in different countries, different regions, different departments in the same regions,
and the interior parts of a department. Web-based applications are infiltrating all
aspects of human life. In the context of a network, human–computer interaction
has become more electronic, informative, and massive, including e-commerce,
e-government, banking networks, network industries, network cultures, and a series
of new things. Web-based services have become distributed, shared, and collabo-
rative, and the quantity and quality of network data consumption are improving.
Because a network is open, dynamic, distributed, and heterogeneous, it is difficult
for its users to acquire data accurately and in a timely manner. The spatial data in
a network provide SDM, which originated from bar-code technology and massive
storage technology, with a whole new field for research and application. In addi-
tion, there are records of network access and the log data of the use of the web
server, from which useful patterns, rules, or topological relationships can be found.
Text data may be plain or rich documents to describe spatial objects. Plain
text is a pure sequence of character codes that are public, standardized, and uni-
versally readable as textual material without much processing. Rich text is any
text representation containing plain text complemented by additional data, such as a language identifier, font size, color, or hypertext link. In SDM, text data are the electronic text of any document that is readable in digital form, as opposed to such binary data as encoded integers, real numbers, images, etc. The integration of text data and
sequential data can result in the extraction of more useful knowledge.
Multimedia data cover a broad range of text, sound, graphics, images, anima-
tions, and forms. The problems of multimedia databases include the effective stor-
age of different types of data as well as targeting different types of objects into a
single framework. Users need to have the ability to input different types of data
and to use a direct approach to scan heterogeneous data. In SDM, multimedia
databases make use of their methods of data storage and processing to address the
integrated relationship between the multimedia data and the spatial data with the
spatial objects as the framework and the multimedia data attached to the objects.
Spatial data mainly describe the location, time, and theme characteristics of vari-
ous objects on Earth. Location is the geographic identification of the data associ-
ated with a region. Time indicates the temporal factor or sequential order of the
data occurrence or process (i.e., dynamic changes). Theme is the multi-dimen-
sional structure of the same position with a number of subjects and attributes.
Location refers to the position, azimuth, shape, and size of a spatial object as
well as the topological relationship of its neighboring objects. The position and
topological relationship only relate to a spatial database system. Location can be
described by different coordinate systems, such as latitude and longitude coordi-
nates, standard map projection coordinates, or any set of rectangular coordinates.
With coordinate conversion software, data can be converted between different
coordinate systems. There are two types of topological rela-
tionships between spatial objects: the first is the elemental relationship between
point, line, and polygon (e.g., polygon–polygon, line–line, point–point, line–pol-
ygon, point–line, and point–polygon), which illustrates the topological structure
among geometric elements; the second type is the object relationship between
spatial objects (e.g., disjoint, inclusion, exclusion, union, and intersection), which
is implicitly represented through positional relationships and queried by related
methods. GIS software has the functions to generate and query spatial topology. In
general, the positioning of a certain object is confirmed by determining the loca-
tion relationship between it and the referenced objects, especially the topological
relationships, rather than memorizing spatial coordinates. For example, location-
based services (LBS) are a general class of services that use location and time
data as control features in computer programs.
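To make these object-level relationships concrete, the short sketch below tests disjointness, inclusion, and intersection between a few hypothetical objects. It assumes the open-source Shapely library merely as a convenient stand-in; the book does not prescribe any particular toolkit, and the objects and coordinates are illustrative.

```python
# Minimal sketch of querying topological relationships between spatial
# objects, assuming the open-source Shapely library is available.
from shapely.geometry import Point, LineString, Polygon

# Hypothetical objects: a park boundary, a road, and a building location.
park = Polygon([(0, 0), (100, 0), (100, 80), (0, 80)])
road = LineString([(-20, 40), (120, 40)])
building = Point(30, 20)

print(park.contains(building))   # inclusion: the building lies inside the park
print(park.intersects(road))     # intersection: the road crosses the park
print(building.disjoint(road))   # disjoint: the building does not touch the road
```

GIS software generally exposes the same kind of predicates through its spatial query functions.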
Time is a specific time or a particular period of time when spatial data are col-
lected or computed. Spatial data may be treated as the functions of time on spatial
objects. Some of the data change slowly and others rapidly. It is hard to collect the
changes at any time by ordinary methods. Satellite remote sensing is an alternative
method for capturing the dynamic change features of spatial objects with time
effectively and continuously. For example, every 30–60 min, data from a geosta-
tionary meteorological satellite can be used to forecast weather and ocean storms.
The high temporal resolution provided by satellite remote sensing is of great signifi-
cance in the study of the dynamic changes in the natural process.
Theme is a subject-oriented aspect of attribute data and refers to the character-
istics apart from location and time, such as terrain slope, terrain exposure, annual
precipitation, land endurance, pH value, land cover, population density, traffic
flow, annual productivity, per capita income, disease distribution, pollution level,
and resource distribution. Such themes can be extracted through aviation and aero-
space remote-sensing images manually, automatically, or semi-automatically and
stored and handled in other database systems, such as a relational database man-
agement system (RDBMS). Further development of remote sensing technology is
expected to obtain more thematic data.
SDM has a wide range of data sources, including observation, maps, remote sens-
ing, and statistics. These sources can be divided into direct data and indirect data.
Direct data are the first-hand raw data from original spatial objects—that is, data
observed with senses, such as vision and touch. Indirect data, on the other hand,
are the second-hand processed data from the computing analysis of direct data,
such as a map contour, recommended road, and buffer zone. Indirect data is a rela-
tive concept because additional indirect data can be extracted from indirect data.
Spatial data are stored as analog data and digital data (Chen and Gong 1998).
During the processing of spatial data, in addition to general databases, other
different types of databases with distinguishable natures and requests are involved.
Some of these databases focus on the characteristics of time and location, some
concentrate on the mathematical and physical characteristics of the data, and some
pay more attention to the characteristics of the data themselves. It is difficult to illustrate the
spatial relationships between objects in a general database, but a spatial database
can produce a detailed description of the length, width, and height of objects and
the distance between one object and another object as well as the boundaries of
the objects. Time is also very important for such databases because users want to
know when the imagery data are obtained, such as when certain buildings were
demolished and the various changes over time. Other significant concepts include
adjacency and inclusion. For example, a specific building may belong to a park,
and its adjacent buildings may include a government office building. Here, the
adjacency indicates the nearest distance, such as the school or fire station nearest
to a house. All of these special requirements constitute the characteristics of spa-
tial databases.
For the integration of a spatial database system, a live data acquisition sys-
tem focusing on the collection, organization, and management of environmental
As the spatial data collected through diverse sensors or methods are easily affected
by various factors, their uncertainty is ubiquitous. In data analysis and data
identification, it is inadequate to process the data one by one without considering
their connections. To address this problem, it is necessary to integrate the spatial
data so that the representation of a spatial object remains consistent.
Spatial data can be divided into at least two categories: complementary data and
cooperative data. Complementary data refer to spatial data derived from different
data sources whose environmental features are independent of one another; they
reflect different sides of a spatial object. The integration of complementary data
reduces the misunderstanding of features that results from missing object
characteristics, so the representation of the spatial object becomes more consistent
and correct. Because complementary data are collected by dissimilar sensors, they
differ greatly in measurement accuracy, scope, and output form. Thus, it is
extremely important to unify the representation of data extracted from different
sensors before integrating them. In spatial data processing, if the processing of
certain data depends on the processing of other data, the data are called cooperative
data. The integration of cooperative data is largely governed by the time and order
in which the respective data are processed.
One approach to complementary data fusion is that, guided by evidence theory,
the resulting representation can be synthesized from the evidence of spatial objects
and dynamic environment monitoring, but it also will perfect the timeliness and
reliability of remote sensing data extraction. In addition, it will improve the data
utilization rate, which lays a good foundation for large-scale remote sensing appli-
cation research.
The schemes of multi-source remote sensing data fusion may be weighted
fusion, HSI (hue-saturation-intensity) transform fusion, feature fusion with wave-
let technique, classification fusion with Bayesian rule, and image fusion with local
histogram match filtering (Li and Guan 2000). The quality of the fused spectral
imaging content can be tested by comparing the fused image with the radiation
values of the original spectral image. If two images match each other, their global
statistical parameters should be approximately equal. Among those parameters,
the average gradient and the bias index are relatively important. The average
gradient reflects the contrast of small details and changing texture features and
thus indicates image definition, while the bias index is the ratio of the absolute
difference between the fused image and the low-resolution image to the
low-resolution image itself.
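As a hedged illustration of how such global statistics can be computed, the sketch below implements one common formulation of the average gradient and a bias (deviation) index with NumPy; the exact formulas vary in the literature, and the test arrays are synthetic stand-ins rather than real imagery.

```python
# Sketch of two global statistics used to compare a fused image with the
# original low-resolution image: average gradient (sharpness of detail) and a
# bias (deviation) index. This is one common formulation, not a prescribed one.
import numpy as np

def average_gradient(img):
    """Mean magnitude of the local gray-level gradient."""
    gx = np.diff(img.astype(float), axis=1)[:-1, :]   # horizontal differences
    gy = np.diff(img.astype(float), axis=0)[:, :-1]   # vertical differences
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def bias_index(fused, original, eps=1e-6):
    """Mean relative deviation of the fused image from the original."""
    f, o = fused.astype(float), original.astype(float)
    return np.mean(np.abs(f - o) / (o + eps))

rng = np.random.default_rng(0)
original = rng.integers(50, 200, size=(64, 64))
fused = original + rng.normal(0, 5, size=(64, 64))   # stand-in for a fused result
print(average_gradient(fused), bias_index(fused, original))
```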
Traditional database technology uses a single data source contained in one
database for data processing, such as transaction processing and decision-making.
The rapid development of computer applications in recent years has extended in
two directions. The first is breadth calculation, which aims to expand computer
applications and to realize extensive data exchange; the internet is one feature of
breadth calculation. The second is depth calculation, which enables computers to
participate more deeply in data analysis, decision-making, and so on. In addition,
database processing can be classified into operative processing and analytical
processing, a division that clearly distinguishes the two and lays the groundwork
for separate system environments (Codd 1995).
Therefore, research efforts are needed to reach the requirements of SDM and to
improve its analysis and decision-making efficiency within the foundation of net-
work and data processing technology. In addition, analytical processing, as well as
its data, need to be separated from operative processing. A method also is needed
to extract analytical data from transaction processing and then reorganize it in
accordance with SDM requirements. To establish such a new environment for data
storage, organization, analysis, and processing, together with powerful decision-
making technology, data warehouses have emerged to meet the need.
For example, the Yangtze River can be regarded as a polygon or linear object or can
be constructed into a complex geographical entity for management and description.
To meet the need of seamless organization in logic and physics, all spa-
tial attribute data should be under engineering management. Adopting RDBMS
to manage and save the attribute data in the form of tables is suggested. Taking the
rules as reference, users can classify the features. For example, the features can be
divided into control points, roads, residential area, water system, and vegetation of
various levels; then, an exclusive classification code of features can be assigned.
Any complete feature usually corresponds to a piece of an attribute data record.
The spatial features are connected with their corresponding attribute records. To
effectively organize and describe spatial entities, the features can be extracted into
multi-grades based on the size of a feature (the largest coverage area).
The selection and representation of a spatial object and its attributes is very
important. For example, many attributes exist in urban areas, such as age, green
area, educational level of the citizens, and main roads. The selection and repre-
sentation of those attributes have a direct influence on geometric operation and
predicate evaluation in a position network. Considering the fact that a position net-
work can combine its geometric and analog data with a semantic network, it can
be brought into the spatial database to realize a seamless combined organization.
A position network helps fulfill the logically seamless organization of spatial
objects (Tan 1998). It is a set of geometric point network representations. The set
of geometric points is linked together by set theory and certain set operations cor-
responding to position limitations determined by a way of thinking or physical
factors. The internal links of a position network contain geometric operations,
variables, and results. For example, a node may represent the union of two
variables, and the result is a point set. Reasoning proceeds by evaluation over the
network. The expected relative positions of object features are stored in the
network, in which a model is established to carry out the fundamental calculations
of the geometric relationships between objects (a minimal sketch follows the list
below):
1. Computing direction (left, right, south, north, upper, lower, reflection, etc.): A
point set is derived from the relative position and direction of other point groups.
2. Region operations (close to, within the quadrangle, inside the circle, etc.): A
point set is established in relation to other point groups without regard to direction.
3. Set operations: Union, intersection, and subtraction.
4. Predicate evaluation: An operation that removes certain points by computing
features of the objects.
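The sketch referred to above illustrates the four kinds of operations on point sets held as plain Python sets of coordinate tuples; the thresholds, attribute values, and operation names are illustrative assumptions rather than the actual interface of a position network.

```python
# Illustrative sketch of the four operation types on point sets, with point
# sets held as Python sets of (x, y) tuples. Names and thresholds are
# hypothetical, not the position network's actual interface.
import math

a = {(1, 1), (2, 3), (5, 5), (8, 2)}
b = {(2, 3), (5, 5), (9, 9)}

# 1. Direction operation: points of `a` lying to the right of x = 4.
right_of = {p for p in a if p[0] > 4}

# 2. Region operation: points of `a` inside a circle of radius 3 around (2, 2).
inside_circle = {p for p in a if math.dist(p, (2, 2)) <= 3}

# 3. Set operations: union, intersection, and subtraction.
union, inter, diff = a | b, a & b, a - b

# 4. Predicate evaluation: remove points that fail an attribute test.
elevation = {(1, 1): 10, (2, 3): 55, (5, 5): 70, (8, 2): 5}
above_50 = {p for p in a if elevation[p] > 50}

print(right_of, inside_circle, union, inter, diff, above_50)
```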
Spatial data play an important role in SDM. For example, during the construction
of a durable and practical GIS, spatial data account for 50–70 % of the cost. In
developing countries, this proportion may be only about 50 % because of cheaper
labor; however, the quality of the data would suffer if the cost proportion were to
decrease further (Gong 1999).
Spatial data are collected via various macro- and micro-sensors or equipment
(e.g., radar, infrared, photoelectric, satellite, multi-spectral scanners, digital cam-
eras, imaging spectrometer, total station, telescopes, television cameras, electronic
imaging, computed tomography imaging) or by accessing spatial data collected
by conventional field surveying, population census, land investigation, scanning
maps, digitizing maps, and statistical charts. Technology applications and analysis
of spatial data include other sources, such as computers, networks, GPS, RS, and
GIS. These sources record the origin of the spatial data as well as the steps,
formats, conversions, dates, times, locations, staff, environment, transport, and
history involved in collecting, editing, and storing spatial data.
Spatial data can be obtained by utilizing data acquisition methods, which
include point acquisition, area acquisition, and mobility acquisition. Point acquisi-
tion refers to the use of total station, GPS receivers, and other conventional surface
measurements to collect the coordinates and attributes of surface points on Earth
point by point. Area acquisition refers to the use of aviation and aerospace remote
sensing to acquire large areas of images from which geometrical and physical
features can be extracted. Mobility acquisition integrates GPS, RS, and GIS into
Earth’s observation system to obtain, store, manage, update, and apply spatial data.
3.2.1 Point Acquisition
Point acquisition refers mainly to total station and GPS. The theodolite, level
instrument, distance measuring instrument, total station, and GPS receivers used in
surface measurement can acquire geospatial data point by point. Two of the most
widely used types—GPS and total station—are introduced here.
A total station, also called an electronic total station, integrates electronic angle
and distance measurement in one instrument. All of the data col-
lected can be automatically transmitted to card records, e-books by hand, directly
mapped to the indoor computer, or transmitted to an e-pad, which maps auto-
matically on-site in the field. This method can be applied in cities with numerous
skyscrapers or in situations where high accuracy is required to obtain and update
spatial data as supplementary information for aviation and space remote sensing.
GPS is a satellite navigation system that provides precise timing and positioning.
GPS consists of three parts: the space, which is made up of GPS satellites;
the control, which consists of a number of ground stations; and the user, which
has a receiver as the main component. The three parts have independent functions
but act as a whole when GPS is operating. As technology has advanced, GPS has
expanded from static to dynamic, from post-processing to real-time/quasi-real-
time positioning and navigation, which has greatly broadened its scope of applica-
tion. In the 1980s, China introduced GPS technology and receivers on a large scale
for the measurement of controls at all levels. China also developed a number of
GPS data processing software programs, such as GPSADJ, which has processed
thousands of GPS points to high accuracy, including the national GPS A-level and
B-level networks, the control networks of many cities, and various types of
engineering control networks.
3.2.2 Area Acquisition
Area acquisition refers to aviation and aerospace remote sensing, which is the
main rapid, large-scale access to spatial data, including aviation and space pho-
togrammetry. Because they make use of a variety of sensors to scan spatial
objects on Earth’s surface for taking photos, the obtained images contain a large
amount of geometrical and physical data that indicate the real and present situa-
tion. Compared to GPS and other ground measurements that acquire data point by
point, this areal access to geospatial data is highly efficient.
Aerial photogrammetry and satellite photogrammetry acquire color or
panchromatic images, from which highly accurate geometric data can be extracted due to
their high-resolution and geometrical stability. On the other hand, aerial and aero-
space remote sensing is characterized by multi-spectrum and hyper-spectrum or
multi-polarization and multiband, from which more physical data can be extracted,
which therefore means that they are complementary tools.
1. The combination of photogrammetry and remote sensing.
• Remote sensing can greatly impact photogrammetry by breaking its limit on
focusing too much on the geometrical data of observed objects, such as their
shape and size; in particular, aerial photogrammetry for a long time only empha-
sized the mapping portion of a terrain diagram. Remote sensing technology, in
addition to the visible-light frame black-and-white camera, offers color infrared
photography, panoramic photography, infrared scanners, multi-spectral scanners,
imaging spectrometers, CCD array scanners, matrix cameras, and synthetic
aperture radar. Remote sensing was further developed after the 1980s, once again
demonstrating its tremendous influence on photogrammetry. For example, the
space shuttle, as a remote sensing platform or means of launch, can return to the
ground and be reused, which has greatly enhanced the performance of remote
sensing and is cost-effective. In addition, the ground resolution (spatial resolution),
radiometric resolution (number of gray levels), spectral resolution (number of
spectral bands), and time resolution (repetition cycle) of many new sensors have
been enhanced significantly.
Remote sensing’s improved spectral resolution spawned the imaging spectrom-
eter and hyper-spectrum remote sensing. The so-called hyper-spectrum remote
sensing identifies multiband images at nanometer-level spectral resolution
through the breakdown of the spectrum, which forms a cube image and is char-
acterized by the unity of the images and spectrum. Thus, it has greatly enhanced
a user’s ability to study surface objects, recognize the object’s type, identify the
object’s material composition, and analyze its conditions and trends.
GIS basic data is automatically acquired with full digital photogrammetry. The
birth of full digital photogrammetry enabled computers to obtain GIS basic data
from aerial photographs or space remote sensing images. The process starts from
digital images or digitalized images, which are then processed on computers to
obtain the necessary data to establish GIS maps. GIS basic data can be simply
divided into the terrain data (three-dimensional coordinates of any point), graphic
data (thematic elements and basic graphics; i.e., point, line, and polygon) and
attribute data. As a result, it is possible to provide terrain data by generating a digi-
tal elevation model (DEM), graphic data by producing an orthophoto and extract-
ing its structure data, and attribute data by doing image interpretation and thematic
classification.
Automatic interpretation and thematic classification of images, also known as
automatic pattern recognition of images, refer to the recognition and classifica-
tion of the attributes of images by using computers, certain mathematical methods,
and geographic elements (i.e., certain features of the subjects), so as to classify the
real geographic elements corresponding to the image data. Through the interpre-
tation and thematic classification of images, the attributes of basic graphics and
images of thematic elements can be acquired, which are necessary for establishing
GIS. To improve the accuracy and reliability of the automatic interpretation and
classification of images, texture analysis of the images and use of the nearby land-
scape should be considered apart from the spectral characteristics of the image.
With the introduction of a large number of high-resolution satellite images,
remote sensing images will become the main source of GIS data. To obtain data
quickly and effectively from remote sensing images, a video-based GIS should be
considered in the future to combine the analysis system of remote sensing images
and the standard geographic data system.
3.2.3 Mobility Acquisition
Mobile access to spatial data refers to the integration of global positioning system
(GPS), remote sensing (RS), and geographic information system (GIS). GPS, RS,
and GIS are the three major technical supports (hereinafter referred to as 3S) in the
Earth Observation System to acquire, store, manage, update, analyze, and apply
spatial data. 3S is an important technical way to realize sustainable development
of modern society, reasonable use of resources, smart planning and management
of urban and rural areas, and dynamic supervision and prevention of natural dis-
asters. At the same time, 3S is one of the scientific methods that is moving geo-
information science toward quantitative science.
To achieve an authentic 3S, it is necessary to study and resolve some common
basic problems in the process of design, implementation, and application of the
integrated 3S system in order to further design and develop practical theories,
methods, and tools, such as real-time system positioning, integrated data man-
agement, semantic and non-semantic data extraction, automatic data updates,
GPS-assisted aerial triangulation uses a GPS signal receiver on the plane and a GPS
receiver at one or more base stations on the ground to observe GPS satellite sig-
nals constantly at the same time. It can obtain 3D coordinates of the station at the
exposure time of the aerial surveying camera through the GPS carrier phase meas-
urement with differential positioning technology, and then can introduce them as
added observations into the photogrammetric area network adjustment to position
the point location and evaluate its quality through a unified method of mathemati-
cal models and algorithms. If the ground control station is replaced by an onboard
GPS aerial photo instrument and the artificial measurement of point coordinates for
GPS auxiliary automatic aerial triangulation is replaced by a full digital photogram-
metry system for automatic matching of the point changes of multi-chip images,
the time and manual labor necessary will be dramatically less than conventional
aerial triangulation. On this basis, construction and automatic updating will be real-
ized by generating a digital elevation model (DEM) and orthophotos through full
digital photogrammetry work stations and providing the necessary geometrical and
thematic data for GIS. For example, for the construction of an industrial-area road
needed by different, highly integrated customers, photogrammetry can be
completed quickly and the resulting spatial databases can be used efficiently with a
system that not only follows the trends of aerial photogrammetry but also supports
the sustainable development of society and lays a solid foundation for the
realization of Digital Earth.
3.3.1 Vector Data
Vector data represent spatial objects as precisely as possible with points, lines, and
polygons by continuously recording the coordinates. The point is located by the
coordinates. The location and shape of the line are represented by the coordinate
strings of the sampling points on its central axis. The location and range of the
polygon are indicated by the coordinate strings of the sampling points on the out-
lines of their range. The vector format can store complex data by minimizing the
data redundancy. Limited by the accuracy of the digital equipment and the length
of the data record, the vector records the coordinates of the sampling points, which
are taken directly from the spatial coordinates so that the objects can be repre-
sented in an accurate way. There are two vector methods that depict the character-
istics of point, line, and polygon: path topology and network topology. The vector
format features “explicit positions but implicit attributes.”
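A minimal sketch of this coordinate-string representation is given below, using simple Python data classes; the class names are illustrative and do not correspond to any standard GIS data structure.

```python
# Minimal sketch of vector data as coordinate strings: a point is a single
# coordinate pair, a line is the string of sampling points on its central axis,
# and a polygon is the closed string of sampling points on its outline.
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]

@dataclass
class PointFeature:
    location: Coord

@dataclass
class LineFeature:
    vertices: List[Coord]          # sampling points along the central axis

@dataclass
class PolygonFeature:
    outline: List[Coord]           # closed ring of sampling points

well = PointFeature((114.30, 30.60))
river = LineFeature([(114.1, 30.5), (114.3, 30.6), (114.5, 30.7)])
lake = PolygonFeature([(114.2, 30.4), (114.4, 30.4), (114.4, 30.5),
                       (114.2, 30.5), (114.2, 30.4)])
```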
3.3.2 Raster Data
Raster data represent spatial objects by dividing Earth’s surface into a regular
array that is uniform in size (i.e., grid, cell, pixel). They depict the relationships
between spatial objects implicitly and have a simple data format. Respectively, point,
line, and polygon are indicated by a raster, a series of neighboring rasters along
with lines, and neighboring rasters that share the same properties in an area. The
basic unit of raster data is generally shaped like a square, a triangle, or a hexa-
gon. While they share similarities, they also differ from each other in some geo-
metrical features, such as directivity, subdivision, and symmetry. The raster data
may be plane or surface, which are organized by the raster unit-oriented method,
the instance variable-oriented method, and the homogeneous location-ori-
ented method. The raster format is featured with “explicit attributes but implicit
location.”
Each raster has a unique pair of rows and columns and a code that determines
the identity of a spatial location, and the code represents the pointer that correlates
the attribute type with its pixel record. The number of rows and columns depends
on the resolution of a raster and the attributes of a spatial object. Generally, the
more complicated the attributes are, the smaller the raster cell and the higher the
resolution. The amount of raster data increases with the square of the resolution.
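The sketch below illustrates these two points: how a coordinate maps to a raster row and column, and how the cell count grows with the square of the resolution. The grid origin, cell size, and extent are illustrative assumptions.

```python
# Sketch of coordinate-to-cell mapping and of data volume growing with the
# square of the resolution. The grid origin and cell size are illustrative.
def to_cell(x, y, x0=0.0, y0=1000.0, cell=10.0):
    """Return (row, col) of the raster cell containing point (x, y)."""
    col = int((x - x0) // cell)
    row = int((y0 - y) // cell)        # rows count downward from the top edge
    return row, col

print(to_cell(235.0, 742.0))           # -> (25, 23)

# Halving the cell size quadruples the number of cells for the same extent.
for cell in (10.0, 5.0, 2.5):
    n_cells = int((1000 / cell) * (1000 / cell))
    print(cell, n_cells)
```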
3.3.3 Vector-Raster Data
Vectors and rasters are compared in Table 3.1; along with Fig. 3.1, the table shows
that the vector curve, represented by a series of points with coordinates, is able to
Fig. 3.2 Vector-raster data
location of a certain point, it can be directly entered into the index record, and the
corresponding record in the linear quad-tree then can be determined. Thereafter,
with the help of the indicators, its starting record number can be found in the lin-
ear quad-tree as well as the attribute value of the leaf node. In addition, due to
the direct link between the coarse grid Morton code and the record number, the
Morton code in the index documents also can be omitted and replaced by implicit
record numbers.
When designing the integrated vector-raster data format, the fundamental focus is
on point, linear, and areal objects. The use of a sub-grid means that vector data
converted into this format can keep the precision of the original sampling points.
Furthermore, all location data adopt the linear quad-tree address code as the basic
data format, which ensures direct correspondence among the manifold geometric
objects.
Point objects possess location without shape and area. It is unnecessary to
regard the point-like objects as cover to decompose the quad-tree. Rather, they
are only used to convert the coordinates of the points into Morton address codes,
regardless of whether or not the entire configuration is quad-tree.
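The conversion of a point's row and column into a Morton address code is usually done by bit interleaving, as in the hedged sketch below; the number of quad-tree levels is an illustrative assumption.

```python
# Sketch of converting a raster (row, column) pair into a Morton address code
# by bit interleaving, the usual basis of a linear quad-tree key.
def morton_code(row, col, bits=16):
    """Interleave the bits of row and column into a single Morton key."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)        # even bit positions: column
        code |= ((row >> i) & 1) << (2 * i + 1)    # odd bit positions: row
    return code

print(morton_code(0, 0), morton_code(0, 1), morton_code(1, 0), morton_code(1, 1))
# -> 0 1 2 3, the familiar quadrant ordering of a quad-tree
```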
Linear objects have no area, and their shapes contain the whole path. We only
need to express the path of each linear object by a series of numbers instead of
decomposing the whole quad-tree. When it comes to the entire path, all the raster
addresses through which the linear object has gone are supposed to be recorded.
A linear object may be composed of a few arcs with a starting point and ending
point. The string of the mid-points contains not only the original sampling points
but also all the intersection points of the grid’s boundaries through which the path
of the arc has gone, whose codes fill the entire path. Such a format also takes full
account of the spatial characteristics that linear objects have on the ground. If a
linear object passes over rough terrain, only by recording the grid elevation value
of the DEM boundaries through which the curve has gone can its spatial shape and
length be better expressed. Although this data format increases a certain amount
of storage capacity compared to simple vector formats, it solves the quad-tree rep-
resentation problem of the linear objects and enables it to establish an integrated
data format based on the linear quad-tree code together with the point and area
objects. This makes the query issue quite simple and fast in terms of the inter-
section between a point and linear objects, the mutual intersection between linear
objects, and the intersection between linear and area objects. With the data files of
the arc, the data format of linear objects is represented as a collection of arcs.
Areal objects should contain borders and the whole region. The borders are
composed of arcs that surround regional data. Manifold objects may form multiple
covering layers. For example, visual objects such as buildings, squares, farmland,
and lakes can form one covering layer while the administrative divisions and the
soil types can form another two covering layers. If every covering layer is sin-
gle-valued, each grid only has the attribute value of one area object. One covering
layer represents a space, and even an island has its corresponding attributes. In
object-oriented data, the grid value of the leaf nodes is based on the identification
number of the object instead of the attributes of the objects. By using the cycle
indicators, the leaf nodes belonging to the same object are linked together to be
object-oriented. A rounding method is used to determine the values of boundary
grids in the areal objects. The value of the interacted grid of two ground objects is
based on which one occupies a larger area in the grid. In order to carry out precise
area or overlay calculations of the areal feature, it is feasible to further cite the
boundary data of the arcs. These object-oriented data are featured with both vector
and raster. The identification numbers of the areal object make it easy to determine
its boundary arc and extract all the middle blocks along the arc.
The spatial data of various types with public geographical coordinates are intrinsi-
cally linked, reflecting the logic organization of spatial data when they are placed
in a spatial database structure. A spatial data model mathematically depicts spa-
tial data content and presents the interrelation between two entities. At present, the
hierarchical model, network model, relational model, and object-oriented model
are in common use.
3.4.2 Relational Model
A relational model depicts spatial data with their entities and relationships by
using the connected two-dimensional table in certain conditions. Such a table is
an entity, in which a row is an entity record while a column is an entity attrib-
ute. There may be many columns as the entity is depicted with many attributes.
A unique keyword (single attribute or composited attributes) is selected to recog-
nize the entity. A foreign keyword connects one table to another table, showing
the natural relationships between entities (i.e., one to one, one to many, many to
many). Based on Boolean logic and arithmetic rules, a relational model allows
definition and manipulation of the data by using structured query language (SQL).
The data description is of high consistency and independent, and complicated data
become clear in terms of structure. Figure 3.3 shows a digital map and its rela-
tional database.
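The sketch below illustrates such a relational organization with two tables linked by a foreign keyword and queried through SQL; the table names and columns are illustrative, and SQLite (via Python's standard sqlite3 module) merely stands in for whatever RDBMS is actually used.

```python
# Sketch of the relational model for spatial entities: two tables linked by a
# foreign keyword, queried with SQL. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE road (
    road_id  INTEGER PRIMARY KEY,                 -- unique keyword for the entity
    name     TEXT,
    class    TEXT
);
CREATE TABLE road_segment (
    seg_id   INTEGER PRIMARY KEY,
    road_id  INTEGER REFERENCES road(road_id),    -- foreign keyword (one road, many segments)
    length_m REAL
);
""")
conn.execute("INSERT INTO road VALUES (1, 'Ring Road', 'primary')")
conn.executemany("INSERT INTO road_segment VALUES (?, ?, ?)",
                 [(10, 1, 520.0), (11, 1, 310.5)])

# The one-to-many relationship is resolved with a join.
for row in conn.execute("""
    SELECT r.name, COUNT(s.seg_id), SUM(s.length_m)
    FROM road r JOIN road_segment s ON s.road_id = r.road_id
    GROUP BY r.road_id"""):
    print(row)
```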
3.4.3 Object-Oriented Model
The object-oriented model was the product of combining the object-oriented tech-
nique and the database technique. Since the late 1980s and early 1990s, applica-
tion of the object-oriented technique in GIS has been highly valued. In Fig. 3.4,
three kinds of spatial objects—point-like, linear, and areal—are represented along
with their annotations.
1. Object-oriented contents
• Spatial objects may be single or composite in GIS. If various spatial objects
are merged appropriately, there will be 13 classes of spatial objects and a
data structure: node, point object, arc, linear object, area object, digital terrain
model, fracture surface, image pixel, cubic object, digital 3D model, voxel,
columnar objects, complex objects, and location coordinate. Each object has
an identifier in the class; however, there is no identifier in location coordinates
(e.g., a 2-tuple of floating-point numbers in 2D GIS, a 3-tuple in 3D GIS). The
obvious advantage of an object-oriented data model
is that each class corresponds with one data structure and each object corre-
sponds with one record. No matter how complicated the objects are or how
many nested object relationships exist, there is still a structure table used for
representation. It is easily understandable as each spatial object has one class,
making it available to express the relationship by generalization, union, aggre-
gation, etc.
data models. In many cases of 3D spatial phenomena, vector data alone cannot
solve all the problems, such as non-uniform density within an entity or the
irregularity of a 3D surface; nor can a grid approach alone, where high accuracy
of measurement and representation is required. More complete vector and grid
data models are obtained by combining the two methods in three-dimensional
data model construction.
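Returning to the class structure listed above, the following sketch shows in outline how each object carries a class identifier while plain location coordinates do not; the class names and fields are illustrative and follow the thirteen classes only loosely.

```python
# Sketch of the object-oriented model: every spatial object carries an object
# identifier, while plain location coordinates do not. Class names are illustrative.
from dataclasses import dataclass, field
from itertools import count
from typing import List, Tuple

Coord = Tuple[float, float]          # a location coordinate has no identifier

_ids = count(1)

@dataclass
class SpatialObject:
    oid: int = field(default_factory=lambda: next(_ids), init=False)

@dataclass
class Node(SpatialObject):
    location: Coord = (0.0, 0.0)

@dataclass
class Arc(SpatialObject):
    start: Node = None
    end: Node = None
    vertices: List[Coord] = field(default_factory=list)

@dataclass
class AreaObject(SpatialObject):
    boundary: List[Arc] = field(default_factory=list)   # aggregation of arcs

n1, n2 = Node((0, 0)), Node((10, 0))
edge = Arc(n1, n2, [(0, 0), (5, 1), (10, 0)])
parcel = AreaObject([edge])
print(parcel.oid, edge.oid, n1.oid)
```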
3.5 Spatial Databases
When a spatial database is properly managed and is equipped with the appropriate
tools, it performs organization, management, and input/output tasks (Ester et al.
2000). Different from the common transactional data which only have a few fixed
data models with simple data conversion, spatial data are distinct in the under-
standings, definitions, representation, and storage needs of spatial phenomena.
As a result, spatial data sharing is very complicated.
A surveying and mapping database is the system responsible for collecting, man-
aging, processing, and using digital surveying and mapping data (Fig. 3.6).
town planning needs; in rural areas, the data system of 1:2,000~1:5,000 serves
for land consolidation, agribusiness conglomerate, farmland planning, and irriga-
tion works. As for cadastral management, data related to real estate and cadastral
registration should be added. Also, in order to solve the problems of diverse and
changeable data, the system should be interactive. The success of establishing
image databases is due to the available real contents of various digital images and
the development of digital sensors that are part of the measurement database and
national spatial data infrastructure.
A DEM (digital elevation model) database aims to organize related data efficiently
and build a unified spatial index according to spatial distribution so that the user
can find the data of any realm quickly (i.e., seamlessly roaming terrain). Usually,
data display and computational efficiency are contradictory to each other in terms
of the performance of large-scale databases. Due to the limitations of both
computer display resolution and human visual perception, the amount of content
that can be presented is always restricted; in other words, the additional detail a
display can convey approaches zero once the content reaches a certain level of
complexity. At present, 2D GIS focuses on a
great number of vector data, along with the management and operations of attrib-
ute data. However, establishing a DEM system with many details is very difficult
compared to representing the 2D hierarchical relations of vector data. Fortunately,
DEM helps not only to improve the function of 3D surface analysis, but also to
add supplementary 3D representation data for the understanding of various spatial
distributions. Vector data are identified by object identification without duplicate
storage. In contrast, DEMs of different levels belong to different databases: the
lowest-level DEM, which holds the original data, is the fundamental multi-scale,
multi-resolution, multi-source database, while the DEMs in the other layers are
derived from the fundamental data. After the process of data
fusion, data layers of the same scale are of one spatial resolution. To improve the
efficiency of multi-scale data query, display, and analysis, a hierarchical multi-scale
DEM is of great importance, making flexible roaming operations and highly
efficient interaction possible.
A hierarchical structure may enhance the browsing productivity of DEM
databases. The spatial index of DEM data is constructed as a hierarchy composed
of "project–workplace–layer–block–array." This hierarchical structure provides a
reliable way to locate an elevation within a DEM database. The spatial index
opens seamless access to the data: users can view both the whole and its parts,
and the sub-arrays of one layer provide a mutually referable mechanism among
the different layers. It becomes easy to automatically retrieve and transfer data of
different levels according to the display range.
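As a rough illustration of the project–workplace–layer–block–array hierarchy, the sketch below descends a nested index to a single elevation value; the level names echo the text, while the keys and data are purely illustrative.

```python
# Sketch of the "project-workplace-layer-block-array" hierarchy as a nested
# index leading to an elevation value. Keys and data are illustrative.
import numpy as np

dem_index = {
    "river_basin_project": {                       # project
        "sheet_A01": {                             # workplace (map sheet)
            "layer_1_25000": {                     # layer (scale level)
                (0, 0): np.full((4, 4), 102.5),    # block -> elevation array
                (0, 1): np.full((4, 4), 98.0),
            }
        }
    }
}

def elevation(project, workplace, layer, block, i, j):
    """Descend the hierarchy to a single elevation value."""
    return dem_index[project][workplace][layer][block][i, j]

print(elevation("river_basin_project", "sheet_A01", "layer_1_25000", (0, 1), 2, 3))
```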
3.5.3 Image Pyramid
With the rapid development of remote sensing technology, the number of imagery
data observed is growing geometrically. It is extremely urgent to establish a large
spatial database to manage and integrate the multi-source, multi-scale, and multi-
temporal images as a whole. However, due to the huge volume of images, it is
necessary to distribute them in different segments. By following the pointer when
indexing the recorded blocks, the data segments can be retrieved directly. Another
problem is retrieving a more abstract image, extracted and transferred from the
bottom layer, without speed penalties. As a solution, an image pyramid with
hierarchical scales should be provided so that data can be transferred at every layer.
The image pyramid stores and manages the remotely sensed images accord-
ing to the resolution. In light of all the map scale levels, as well as the workplace
and project division, the bottom layer is of the finest resolution and the largest
scale. The top layer is of the coarsest resolution and the smallest scale—that is,
the lower the layer, the larger the amount of data. As a result, an image database
in the image pyramid structure makes it easy to organize, index, and browse the
multi-source, multi-scale, and cross-resolution images (Fig. 3.7). When the image
pyramid is used, there are three issues of concern:
1. The query is very likely associated with multiple maps and themes; for exam-
ple, a road may exist in several maps.
2. Challenges are posed to maintaining the consistency and completeness of spa-
tial objects.
3. Individual management hampers the exchange of data and the security of
spatial objects.
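To make the layered structure described above concrete, the following sketch builds a small pyramid in which each layer halves the resolution of the one below it by 2 × 2 block averaging; the averaging scheme and layer count are illustrative choices, not the method used by any particular image database.

```python
# Sketch of an image pyramid: each layer halves the resolution of the one
# below it by 2x2 block averaging, so the bottom layer is the finest and the
# top the coarsest.
import numpy as np

def build_pyramid(image, levels):
    """Return [finest, ..., coarsest] layers of the pyramid."""
    layers = [image.astype(float)]
    for _ in range(levels - 1):
        img = layers[-1]
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = img[:h, :w]                              # crop to even dimensions
        coarser = (img[0::2, 0::2] + img[0::2, 1::2] +
                   img[1::2, 0::2] + img[1::2, 1::2]) / 4.0
        layers.append(coarser)
    return layers

base = np.arange(64 * 64, dtype=float).reshape(64, 64)   # stand-in image
for level, layer in enumerate(build_pyramid(base, 4)):
    print(level, layer.shape)    # (64, 64), (32, 32), (16, 16), (8, 8)
```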
As a solution, researchers at GIS companies (e.g., MG of Intergraph, ArcInfo of
ESRI) have been engaged in establishing seamless geographic databases that
complete geometric or logical edge matching. As the physical storage is based on
map sheets, the correlations of all geometric identifications of the same spatial
objects are processed invisibly to the users, which is called "logical seamless
organization." To realize seamless organization of geographical
data both logically and physically in a large geographical database, Li et al.
(1998) proposed a better way to display images in different maps of various scales.
The main organizational form of GIS databases is the basic working unit
established on the average map-sheet size; when a workplace is delimited by
image division, it is regarded as an image workplace.
For the sake of saving time in storage and image compression (or decom-
pression), the workplace can be reorganized when it is put in storage. As for the
image database, a workspace file index is established so that the system can sup-
port the internal effective image files. The system can name the workspace files
with sheet numbers. Within a spatial area of a certain scale, an image database
whose unit is at least one digital image of the same plotting scale placed in an
image workplace is called an image project.
To accomplish fast retrieval and historical data archive management, several image
projects of the same plotting scale, called image sub-projects, can be established.
Because an image sub-project has independent spatial index data, data
management and data product distribution become more convenient.
Contributing to the application of vector data and DEM data, this method stim-
ulates the development and sharing of new image products and standardizes the
database system application.
Suppose that we define an average size sheet as the workplace and take multi-
source and multi-resolution remote sensing image data as the target data objects
(mainly orthographic images) so that the image pyramid and multi-hierarchical file
system can be combined to control image geometrical space and tonal space. At
the same time, a seamless spatial database system can be established by stitch-
ing because this spatial database is cross-scale, cross-projection zone, cross-image
project, and cross-image workspace. Then, it becomes easy and quick in terms of
spatial indexing, dispatching, seamless browsing, distribution of image products,
and sharing national imagery spatial data. By virtue of distributed object
management technology, which regards an image sub-project as an object, not
only can system security be improved but image distribution and data access can
be fully controlled. Under this system, the maintenance and management of
topological data are not confined to one server or client: data files can be
transferred into an image (sub-)project registry that dynamically records the
topological data from the distributed image databases' file systems, and the
topological data can be transformed on the servers of local networks. Network
operation by synchronization replication
establishes a seamless spatial database system for real-time query and roaming.
To better use and distribute geospatial data, a spatial database should be seam-
less with distinct features:
1. Retrieval, roaming, and inquiries should be conducted in the entire database.
2. Vector graphics should be overlaid translucently on raster images, and it
should be possible to update traditional graphic data with new images,
including amendments, supplements, and deletions.
3. The adaptive, multi-level resolution display based on the image pyramid
should be realizable; that is, reading images from the corresponding layer on
the image pyramid according to the chosen display scales automatically.
4. Its basic functions should include 3D display, query, roaming, analysis of
images, and a DEM.
5. All the spatial data should be transferable to other data systems supported by
GIS hardware and software according to the needs of users.
6. Visual products can be output according to map sheets or the scopes given by
the user.
3.6.1 Data Warehouse
The essence of a data warehouse is to assemble the data of daily business
according to certain design principles, transform them as needed into refined data
products, and finally store them in the database. The basic idea of a data
warehouse is to pick up the required data for applications from multiple opera-
tional databases and transform them into a uniform format; with the use of a multi-
dimensional classification mechanism, the large amount of operational data and
historical data can be organized so as to speed up the inquiry. Usually, the data
stored in the operating databases are too detailed for decision-making, so gener-
ally a data warehouse will characterize and aggregate them before storage. In fact,
a data warehouse contains a large amount of data from a variety of data sources,
each of which is in charge of their respective databases. In general, transactional
databases serve the online transactional process (OLTP), with standard consistency
and recoverability while dealing with rather small transactions and a small amount
of data. In addition, they are mainly used for the management of the current ver-
sion of the data. On the other hand, a data warehouse stores historical, compre-
hensive, and stable data sets for online analysis and processing (OLAP) (Codd
1995). At present, there are two ways to store data in a data warehouse: relational
database storage structure and multi-dimensional database storage structure. A star
schema, snowflake schema, or their hybrids are often used to organize data in the
relational database structure, whose products include Oracle, Sybase IQ, Redbrick,
and DB2. In a multi-dimensional database storage structure, the data cube is
employed; these products include Pilot, Essbase, and Gentia.
A spatial data cube organizes the multi-dimensional data from different fields
(i.e., geospatial data and other thematic data from a variety of spheres) into an
accessible data cube or hypercube according to dimensions, using three or more
dimensions to describe an object. These dimensions are perpendicular to each
other because the analysis result the user needs is just in the cross-points (Inmon
1996). A data cube is a kind of multi-dimensional data structure with a hierarchy
of multi-dimensions to identify and gather data representing the requirements of
making a multi-faceted, multi-angle analysis for processing data in management
and decision-making. For example, the time dimension can express a hierarchy of
current, daily, weekly, monthly, quarterly, biannually, and annually.
In statistical analysis, a variety of complex statistical charts are often used, whose
units are sorted into sub-totals, monthly sub-totals, annual totals, and so on. Such a
chart is essentially a two-dimensional representation of a multi-dimensional data cube.
Long before the appearance and application of computers, this way of thinking
and method of calculation prevailed in China. This concept of data analysis is
Fig. 3.8 Data cube
they can be directly observed and analyzed when making decisions, from differ-
ent perspectives and to different extents, which can greatly enhance the efficiency
of data analysis. The typical operations of a data cube are drill down, roll up, slice
and dice, and pivot. All of these operations are called OLAP methods. OLAP is
the main data processing and analysis technology of data warehouses, whose main
functions are to analyze multi-dimensional data and generate reports. It general-
izes data and can be seen as a simple data mining technique. However, data min-
ing techniques such as induction, relevance, classification, and trend analysis are
more powerful analysis tools than OLAP because the data in OLAP are aggre-
gated through simple statistics while those in data mining require more complex
techniques (Srivastava and Cheng 1999).
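The sketch below illustrates roll-up, slice, and pivot on a tiny cube of region × land cover × quarter with an area measure, using pandas only as an assumed convenience; the dimensions, measures, and values are illustrative.

```python
# Sketch of OLAP-style roll-up, slice, and pivot on a small data cube, using
# pandas as an assumed (not prescribed) tool with illustrative dimensions.
import pandas as pd

facts = pd.DataFrame({
    "region":   ["east", "east", "west", "west", "west"],
    "cover":    ["forest", "water", "forest", "water", "urban"],
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "area_km2": [120.0, 30.0, 200.0, 45.0, 60.0],
})

# Roll up: aggregate the cube along the land-cover dimension.
rollup = facts.groupby(["region", "quarter"], as_index=False)["area_km2"].sum()

# Slice: fix one dimension (quarter = Q1) and keep the rest.
q1_slice = facts[facts["quarter"] == "Q1"]

# Pivot: rotate the cube to view region against quarter.
pivot = facts.pivot_table(index="region", columns="quarter",
                          values="area_km2", aggfunc="sum")
print(rollup, q1_slice, pivot, sep="\n\n")
```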
this book focuses on the theories and methods of SDM, spatial databases are gen-
erally the source for the purposes here. In spatial data warehouses, these theories
and methods can be applied directly or even after minor improvement.
more difficult. Thus, better entry standards and completely compatible digitaliza-
tion will improve the well-being of our society.
The solution to these problems is the establishment of a NSDI. The Federal
Geographical Data Committee (FGDC) of the United States proposed a strategy
for NSDI in April 1997. In Executive Order 12906, former President William
Clinton recognized the urgent need for the country to find methods to build and
share geographic data. The document called for the establishment of a coordinated
NSDI to support the application of geographic data. Hence, NSDI is seen as a part
of the developing NII, in order to provide citizens access to important government
data and strengthen the democratic process.
To develop NSDI, federal agencies and various nongovernmental organizations
undertook major activities to establish the National Geospatial Data Clearinghouse
(NGDC) between data producers and users, as well as standards on the collection
and exchange of spatial data. A national digital spatial data network was set up
by NGDC for data producers and cooperative groups, which included the impor-
tant data classes useful to a wide range of user types. A new relationship may
allow the institutions and individuals from all fields to work together and share
geospatial data. Their mission is to make it easy for spatial data to be utilized for
economic growth and improvement and protection of environmental quality at the
local, national, and global levels, as well as contribute to social progress. There are
four goals.
The first goal is to enhance knowledge and understanding of NSDI's vision,
concepts, and benefits through education and to highlight its position. The
tasks are: (1) to prove the benefits gained through participation in the existing and
expected NSDI; (2) to develop the principles and practical use of NSDI through
formal or informal education and training; and (3) to improve the state of develop-
ment of NSDI and promote the helpful activities.
The second goal is to find common solutions for the discovery, storage, and
application of geospatial data to meet the needs of a wide range of groups. The
tasks are: (1) to continue to develop seamless national geospatial data interchange
stations; (2) to support the reform of the common methods used to describe geo-
spatial data sets; (3) to support the study of the tools for making applications
easier to use and to exchange data and results; and (4) to research, develop, and
implement the structures and technologies of data supply.
The third goal is to use community-based approaches to develop and maintain
the common collection of geospatial data by selecting the right factors. Its tasks
are: (1) to continue to develop the national network of geospatial
data; (2) to provide citizens, government, and industry the required additional geo-
spatial data; (3) to accelerate the development of general classification systems,
content standards, and data models, as well as other common models in order to
facilitate the improvement, sharing, and use of data; and (4) to provide the mecha-
nism and stimulus that bring the multi-resolution data from many different organi-
zations into the NSDI.
The fourth goal is to establish a relationship between organizations to sup-
port the continuous development of NSDI. The tasks include: (1) to establish a
procedure allowing the stakeholder groups to formulate the logic and complemen-
tation that support NSDI; (2) to establish a network and make contacts with the
public who are interested in NSDI through conferences; (3) to eliminate the obsta-
cles in management and administration to unify a format; (4) to search for new
resources for data production, collection, and maintenance; (5) to recognize and
support the political and legal organizations, technologies, individuals, schools and
economics which promote the development of NSDI; and (6) to participate in the
international geospatial data groups that develop global geospatial data systems.
1. FGDC and NSDI of the United States
• FGDC is the leader in the coordination and development of national spatial
data infrastructure. It was organized according to Circular No: A-16 of the
United States Office of Management and Budget (OMB), whose responsi-
bilities are to develop, maintain, and manage the national distributed data-
base of geospatial data, to encourage the study and use of relevant standards
and exchange formats, to promote the development of spatial data-related
technologies and transfer them into production, to communicate with other
federal coordination agencies working on processing geospatial data, and to
publish relevant technical and management reports.
• FGDC and its working groups provide the basic structure for every research
institute and private communities, cooperating to discuss the various issues
related to the installation of NSDI. The standard working groups of FGDC,
who study and set standards through structural procedures, make great efforts
to unify their standards as much as possible. They obtain support from experts
through public channels but are not tied to any particular body of expertise. The
reference model of FGDC standards sets the principles to guide the standard projects
of FGDC. It defines the expectations of FGDC standards, describes different
types of geospatial standards, and explains the FGDC standard process.
• FGDC’s mission is to formulate geospatial data standards so as to achieve the
data sharing between manufacturers and users, as well as to support NSDI.
According to Executive Order 12906, NSDI stresses the collection, processing,
storage, distribution, and improvement of the technologies, policies, standards
and human resources needed for using geospatial data; while FGDC focuses on
how to strengthen its coordinative role between NSDI and state and county gov-
ernments, academic organizations and private groups. Through its committees
and working groups, FGDC supports the activities of the NSDI in four ways:
to develop a national spatial data exchange network, to formulate spatial data-
sharing standards, to create a national geospatial data framework composed
by basic thematic data, and to promote geospatial data protocol of cooperative
investment and cost-sharing among partners outside federal organizations.
The formulation and application of FGDC standards go through five stages:
application, planning, drafting, public review, and final approval. These five
stages ensure the generation of standards in an open or consistent way and
that the nonfederal organizations’ participation in the formulation should be
as wide as possible and their standards should be in accordance with other
the data in the framework to all organizations, not only a society or a group
of customers. DGDF will take full advantage of the superiority of the geo-
spatial data that are being established by local and regional governments, public
utilities, non-governmental organizations, and the offices of state and federal
agencies. Most of these geospatial data are built to solve a specific problem
and meet local needs, so gathering, maintaining, and distributing the data will
involve many organizations and departments. To make the data consistent and
easy to integrate, much work must be done to collect, synthesize, and validate the
local data. To achieve this goal, six institutional responsibilities have been
identified: policy-making, project evaluation,
framework management, regional integration, data producers, and distribu-
tors. These tasks can be assigned to a number of different organizations to
undertake them; the organizations with policy, mission, and authority will be
the most successful participants.
3. OGC
• The Open GIS Forum (OGF) was set up in the United States in 1992. It
deals with the problems encountered by governmental and industrial geodata
users and the expenses they incur. These users tend to share and distribute
spatial data; however, for technical or organizational reasons, they are
prevented from doing so. OGF later changed its name to the Open GIS
Consortium (OGC).
OGC is a non-profit organization whose purpose is to promote the use of new
technologies and professional methods to improve the interoperability of geo-
graphic data and then reduce the negative effects of the non-interoperability on
industry, education, and arts. The membership of OGC includes strategic mem-
bership, principal membership, application integration membership, techni-
cal committee membership, and associate membership. They share a common
goal to establish a national and global data infrastructure in order to facilitate
people’s free use of geographic data and the resources they deal with—even
to open a new market for communities that have not yet become involved in
geographic data processing, bringing a new business model for the whole
community.
• To share data, the core of OGC is an open GIS data model. There are six kinds of geographic data: digital maps, raster image data, point vector data, vector data, 3D data, and spatiotemporal serial data. Because of growing environmental problems, government and commercial organizations have an increasingly high demand for effective operation, making the need to integrate geographic data from different sources more and more important. However, sharing geospatial data is troublesome, tedious, error-prone, and sometimes completely impossible. Due to the extensiveness of the content and scope of geographic data, its formatting is more complicated than that of any other digital data. Moreover, it becomes much more complicated as software platforms, methods of data acquisition, data representation and transformation, and users (individuals and organizations) vary. OGC's software specification
The Ordnance Survey (OS) is the national agency authorized by the Queen
of England to take charge of surveying and mapping in England, Scotland, and
Wales. The OS headquarters are located in Southampton, while Northern Ireland's Ordnance Survey is based in Belfast and the Republic of Ireland's OS is in Dublin. All three are completely independent but use the same coordinate origin.
Britain attaches great importance to the comprehensive application of geo-
graphic data. To make the most of the advantages of GIS and to promote its
application, the Department of the Environment (DOE) conducted an extensive
investigation in 1987 and published the Chorley Report, “Handling of Geographic Data: Report of the Government Committee of Inquiry.” Among government
departments, there are more than 40 major geographic data producers and users,
but a low level of data sharing exists and there is a lack of uniform national stand-
ards in all communities. Stimulated by the American NSDI, the British govern-
ment proposed the National Spatial Data Framework (NSDF) in 1995 to organize
and coordinate cooperation in all aspects and to use an acceptable way to share
spatial data.
The NSDF encourages collaboration in collecting, providing, and applying geospatial data; promotes the use of standards and good practice in the collection, provision, and application of geospatial data; and promotes access to geospatial data. It is a superset of NSDI, and the two share some similarities. The difference is that NSDF does not demand that different databases share the same public data, although this is very difficult to put into practice.
The German Interior Ministry Survey Office decided in 1989 to establish the
Amtliches Topographisch-Kartographisches Information System (ATKIS). ATKIS
is a national authoritative topographic-cartographic information system that is
basically for spatial distribution-related information systems.
ATKIS consists of a number of Digitale Landschafts Modelle (DLM) and Digitale
Kartographischen Modelle (DKM). DLM is a digital landscape model described by
geometry properties, while DKM is a digital mapping model described by visualiza-
tion. In ATKIS, DLM and DKM have clear specifications. DLM includes a catalogue
of ground objects, ATKIS-OK, and a DLM data model, whereas DKM includes a
catalogue of ground symbols, ATKIS-SK, and a DKM data model.
As Britain and Germany have rather small territories, the mapping and updat-
ing task had been performed well. The basic topographic maps at 1:5,000-scale
had been completed; thus, they did not stipulate that digital orthophoto maps must be included in the NSDF. As high-resolution orthophoto maps are important, the two countries have also made orthophoto maps at a 1:5,000 scale.
Just like the DRG products of the American USGS, Germany also has digital
raster data of topographic maps, which are shared by each state survey office. In
addition, it has a total of 91 topographic maps at a 1:50,000-scale (TK50). The
digital raster data of maps at a 1:25,000-scale (TK25) also was completed later in
1991. The raster data are mainly used as the background data of thematic maps to
build the database systems for the environment, transportation, disaster prevention,
and nature protection. Combined with DEMs and remote sensing images, it can
generate 3D landscape maps as well as provide statistical data.
In 1963, R. Tomlinson developed the first GIS in the world, the “Canada Geographic Information System (CGIS).” CGIS is a thematic GIS in practice that brought Canada a good reputation in the GIS field and extended the development of GIS into Canadian private companies. Since the 1990s, Canada’s spatial information technology has clearly pursued the integration of RS, GPS, and GIS, and its scope of application is expanding.
Canada is entering the era of digital products and data sharing, including the
emergence and standardization of digital data and the dissemination and provision
of data. Canada also has a wide range of GIS data products, including street net-
work files, digital borderline files, post data exchange files, and a variety of prop-
erty files including census documents. Due to the complexity of the data, it is still often difficult to use most Canadian digitized maps for a given project: the 1:50,000-scale index image of the spatial database in the NTDB of EMR shows that the data remain nonconforming, non-uniform, and incomplete—far from the standards of the 4D products of the USGS. For DEMs, Canada has only national contours at a 1:250,000 scale recorded as discrete points, rather than raster DEM data.
The lack of an acceptable standard is the main reason why Canada has no uni-
fied digital spatial data. Thus, Canada is working to resolve this issue and hopes
to achieve the transformation between different GIS data through a common data
transformation standard. The Inter Agency Committee of Geomatics (IACG) provides initial national data products and, being widely representative, works on the development of national geographic standards. The Gulf Intracoastal Canal Association (GICA) is a national business organization whose members include advanced departments in GIS, remote sensing, and mapping. The main Canadian organizations engaged in GIS are the Canadian Institute of Surveying and Mapping (CISM), the Canadian Committee of Land Survey (CCLS), the Association of Canadian Land Surveyors (ACLS), the Canadian Hydrographic Association (CHA), the Canadian Cartographic Association (CCA), the Canadian
Remote Sensing Society (CRSS), the Urban and Regional Data System Association
(URISA) in Canada, the Urban Data Systems Association of Ontario, the
Association of Geography in Municipal Quebec (AGMQ), and the Geographic
Society in the province of Nova Scotia.
Australia began working on digitized maps in the early 1970s. The basic surveying, mapping, and land management of Western Australia is the responsibility of the Department of Land Administration (DOLA), which started the digitization of topographic maps and the establishment of a Land Information System (LIS). So that the whole society can share the state’s geospatial data, a number of departments are involved in the establishment of the Western Australia Land Information System (WALIS). The digitized product is known as map data, which can be used in GIS after structural transformation. Later, a reproduction based on a GIS data model, known as Geodata, was used. Then, following the development of the NSDI policy in the United States, Australia adopted the same approach to create a spatial data framework. There are many kinds of AUSLIG digital maps, whose production has the following characteristics: compatibility with GIS, national uniformity, quality assurance, comprehensive documentation, and regular maintenance.
The Australian government has focused on the application of GIS in support
of the land bureau and its land management operations. To coordinate the land-
related activities between the national and regional levels, the Australian Land Information Council (ALIC) was established to deal with land issues and deliberate related policies at the national level, support the formulation and use of the
national guidelines and standards in land data management, provide a forum at the
national level where experience and data of land data management policies can
be exchanged, and publish the annual report of the development of the Australian
Land Information System. ALIC is committed to building a national land data management strategy that encourages all sectors of the Australian economy to obtain land data effectively, provides a solid basis for all levels of government and for private organizations to make effective decisions about land use, and develops an effective mechanism for data transformation. As a result of the wide use of GIS and the rapid development of web technologies, the land data system announced by ALIC is actually a service system for a spatial data warehouse and data distribution center.
management services and access to the related data. This need stimulated the
establishment of a joint ministerial committee on GIS under the supervision of
the cabinet, whose members are the 21 representatives of the Japanese govern-
ment agencies, including the Ministry of International Trade and Industry. The
Office of the Cabinet Secretariat was responsible for the committee, with assis-
tance from the National Mapping Bureau and National Land Bureau. In 1996, the
joint committee published its implementation plan running through the beginning of the 21st century. The first phase of the plan, which ended in 1999, included the standardization of metadata and clarification of the roles of all levels of government and of private companies in promoting the building of NSDI; a federation for the advancement of NSDI, whose members include more than 80 private companies, was established to support these activities.
APSDI will be based on the NSDI of all the countries in the region, but it will
be closely connected to other international projects, such as Agenda 21, the global
map, and global spatial data infrastructure. The establishment of this infrastruc-
ture and its standards and data management policies will help maximize the whole
region’s investment returns on spatial data and build a viable geographic data
industry. The establishment of APSDI is a huge project, but PCGIAP believes that
the project has enormous potential benefits and all efforts possible must be made
to complete it. The determination, goodwill, and cooperation shown by PCGIAP
will ensure that this goal is achieved. Over time, a detailed implementation plan will continue to develop and mature, adapting to the ever-changing technological environment, and the technical and administrative obstacles will be overcome. Due to the increasing awareness of the importance of spatial data resources,
investment in data collection, management, and development will increase as well.
A vibrant spatial data industry thus will appear, serving the government, indus-
try, and society. The sharing, synergy, and benefits of knowledge and experience
brought by the APSDI project will have a much greater effect than that of an indi-
vidual country in the Asia-Pacific region.
European Spatial Data Infrastructure (ESDI) integrates the diversity of spatial data
available for a multitude of European organizations. The European Union (EU)
project Humboldt contributes to its implementation. Its vision is that geographic
information (GI) with all its aspects should become a fully integrated component
of the European knowledge-based society (i.e., EUROGI). The EUROGI/eSDI-
Net initiative offers a sustainable network as a platform to exchange experiences
in the field of spatial data infrastructure (SDI) and GI as well as international con-
tacts to SDIs, GI and SDI key players, GI associations, and other related networks
and projects.
Although many people think that the coordination of GIS in Europe and even
around the globe is a good thing, in reality, political, historical, and cultural fac-
tors affect the design and implementation of this project. Because levels of economic development and training facilities are uneven, the skill level and the awareness of geospatial information among the European countries also differ, partly due to the different status that geography and related disciplines have in schools and universities and to technical differences between the countries, such as the use of different coordinate systems and benchmarks.
The greatest obstacle to implementing SDI in Europe and the whole world
arises from a large number of local issues rather than from outright opposition. The resistance stems from government restrictions on other organizations working on geographic information, which limit their overseas activities; the lack of working capital for overseas activities; the lack of interest and awareness; the impact of politics and other factors; the limitations of the links between disciplines; the lack of clear leadership from the EU-DG (Directorate-General) and other EU institutions; and doubts and difficulties concerning copyright and other legal matters.
OGC came up with the idea of OGIS to address the technical problems of the
data exchange between different spatial databases. In a very short period of time,
OGC has convinced geographic information software vendors and others that there
will be better business prospects for the geographic information industry if spatial
data can be shared and exchanged without the restrictions of professional stand-
ards. The OGIS of OGC aims to provide a comprehensive set of open interface
specifications that will allow software developers to write interoperable compo-
nents and provide clear access to geographic data and geographic resources in the
network. What is more worthy of attention is that the business members of OGC
in North America are very active in Europe, such as ESRI, ERDAS, Intergraph,
MapInfo, Autodesk, Genasys, PCI, Trimble, Oracle, Informix, Microsoft, Digital,
Hewlett Packard, Silicon Graphics, Sun, and IBM. These companies provide the vast majority of the required geographic information software, hardware, and database technology for European countries, and they have several thousand European employees who clearly hold decision-making rights in their companies.
Before OGIS, the interoperable method in the ESDI was considered a top-down
approach. The national geographic information organizations in EUROGI have
indirect links with the EU Council and the European Commission so that when
the senior leadership has approved certain treaty on ESDI, the relevant standards,
orders, and agreements will be handed to the national geographic information
organization through EUROGI, who will distribute them to its government and
business members. However, in essence, the OGC model is a bottom-up approach based on commercial interests. Because Europeans buy most of their geographic information products from vendors in the United States, any interoperation standards and procedures adopted by the American business community are most likely to become the modus vivendi in Europe, whatever Europe’s own hopes may be. It also shows that the international market has far-reaching influences on the success and the goals of the infrastructure described by GI2000.
All of these show that European standards and interoperability specifications
cannot be divorced from those of the world and they are especially dependent
on the geographic information business in the United States. If market forces can push European users to accept the interoperability specifications, it will not be necessary for the EU to invest heavily in its own geographic information standards; instead, the funds can go to GI2000. The participation of Europeans is needed in the OGIS, and they will influence the discussion on interoperation through EUROGI, the European business community, and the various activities of the EU.
In view of the interests of European businesses and geographic information, the
European people should be encouraged to become involved in the OGIS.
3.8.2 CNSDI Contents
3.8.3 CNGDF of CNSDI
China’s National Geospatial Data Framework (CNGDF) was established as a multi-resolution combination of point, line, and area data because of the vast size of Chinese territory, the different degrees of development between the east and the west, the different natural and social conditions in different regions, and limited national resources. Producing a geospatial data framework is more efficient than producing full-element vectorized maps, because available storage capacity and computing speed can handle a large volume of geospatial data.
First of all, CNGDF should follow the principles of speed and efficiency and also take into account the integrity, importance, processability, and expandability of the data. At present, establishing a database with all elements is time-consuming and cannot meet the demand for rapid updating of basic geographic data. As different devices can use different techniques to meet product quality demands, a variety of data acquisition programs are presented below for generating the basic geographical data—the digital elevation model (DEM), digital orthophoto map (DOM), and digital line graph (DLG).
1. DEM. Fully digital automatic photogrammetry, interactive digital photogrammetry, analytic photogrammetry, scanning and vectorizing contours, and DEM interpolation.
2. DOM. Digital photogrammetry, single-image digital differential rectification, orthophoto map scanning, digital aerial orthophotos, and remote sensing image processing (TM images, SPOT images, control point images).
3. DLG. 3D digital photogrammetry, computer-aided digital mapping, map scanning and vectorization, and manual or semi-automatic tracking of feature elements on the DOM—for example, roads, contours, settlements, water systems, and administrative boundaries.
DEM, DOM, and DLG constitute a geospatial data framework with a scale from
1:10,000 to 1:150,000. The database can display the images at different image resolutions and scales; when the scale is enlarged, the details of the ground can still be seen clearly. The database also supports dynamic windowing, zooming, and roaming in any direction. For example, the 1:50,000 spatial data infrastructure of the National
Geographic Information System includes seven databases: DEM, DOM, DLG,
digital raster graphic (DRG), place name, land cover, and meta-data.
3.8.4 CSDTS of CNSDI
database management system every minute, the data he or she uses may be a few
days old or a few months old. Currently, GIS software cannot directly manipulate
the data of other GIS software, and it is necessary to go through a data exchange.
According to statistics, data exchange costs a great deal of manpower and capi-
tal resources. The cost of GIS spatial data exchange in developed countries has
reached 30 %. Thus, it is important to set spatial data standards in order to pro-
duce spatial data compliant with the standards and allow other industry communi-
ties to share data and avoid duplicate collection of basic spatial data as well. The
exchange standards address the conversion issues of spatial data among different
GIS software.
There are currently three data exchange methods for sharing data: external data
exchange, data interoperability specification, and data sharing platform.
An external data exchange format is usually defined as ASCII files so that data can be converted for use by other software, because many GIS packages do not allow users to read and write their internal data directly. However, external data exchange formats are defined by software vendors, and their contents and representation methods differ—for example, AutoCAD DXF, the MGE ASC Loader format, and the ARC/INFO E00 format. It is difficult to update spatial data in a timely manner and to maintain data consistency through an external data exchange. For the sake of standardization and consistency, many countries and industries set their own external data exchange standards that call for using public data exchange formats within a country or a department, such as SDTS in the United States.
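As a minimal illustration of the external data exchange idea, the following Python sketch reads a hypothetical vendor-specific ASCII point file (one "id x y attribute" record per line—an invented layout, not any real vendor format) and rewrites it in a neutral comma-separated form that another system could import:

# Minimal sketch: convert a hypothetical ASCII point exchange file into a
# neutral layout that a second system could import. The file format and
# field names are invented for illustration only.
def read_ascii_points(path):
    records = []
    with open(path, encoding="ascii") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue                       # skip comments and blank lines
            pid, x, y, attr = line.split()
            records.append({"id": pid, "x": float(x),
                            "y": float(y), "attr": attr})
    return records

def write_neutral(records, path):
    with open(path, "w", encoding="ascii") as f:
        for r in records:
            f.write(f"{r['id']},{r['x']:.3f},{r['y']:.3f},{r['attr']}\n")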
A data interoperability specification consists of drawing up a set of spatial data manipulation functions (an API) that can be accepted by all parties. All of the software vendors follow this standard to provide driver software conforming to the API functions. In this way, different software packages are able to manipulate each other's data. At present, this work has progressed smoothly as the Open Geodata Interoperability Specification (OGIS) in several GIS companies in the United States. For example, GeoMedia, which was launched by Intergraph Corporation, can access ARC/INFO data directly. This method is more convenient than an external data exchange. However, the data provided through the defined API functions may amount to the lowest common denominator of the spatial data stored by different GIS software. The data also may be inconsistent or out of date because each software program mainly manages its own system.
The data sharing platform makes it possible for all spatial data and each applica-
tion software module of a community to share a platform; all of the data exist on the
server and all of the software are the programs of one client. Data can be accessed
from the server through this platform. Any application’s updated data are reflected
in the database in a timely manner so as to avoid inconsistency of the data. However,
this approach is more difficult to achieve. Right now the owners of a number of GIS
software programs are unwilling to discard the software’s base to adopt a public
platform. Only when the base server of the software proves to be absolutely superior
to other systems and has shown that it can manage a large number of basic spatial
data will the data sharing platform be possible. A spatial data sharing platform also
can adopt general data management software, such as Oracle SDO.
3.9.1 GGDI
3.9.2 Digital Earth
Digital Earth was presented by Al Gore, the former U.S. vice president (Al 1998).
The digital geospatial data framework provides the basic and public data sets for
researching and observing Earth and conducting geographical analysis. Users can attach their own data on top of the Digital Earth. The large amount of data in this framework, along with the additional user data, can be used for mining and for decision making.
Digital Earth is the unified digital reproduction and recognition of real Earth
and its relevant phenomena on computer networks. It is a multi-dimensional, multi-resolution, multi-scale, multi-space description of Earth of various types, built by applying massive Earth information through broadband networks on the basis of computer technology, communication technology, multimedia technology, and large-capacity storage devices. Digital Earth can be used to support human activities and promote the quality of life, such as for global warming, sustainable economic and social development, precision farming, intelligent transportation systems, digitized battlefields, and so on (Grossner et al. 2008). To put it simply, Digital Earth is a multi-resolution virtual globe based on a terrestrial coordinate system that integrates massive geographical information and can be presented through 3D visualization.
Digital Earth is an inevitable outcome of global informatization, as it depicts the features of human living, working, and learning on Earth in this data era, and it can be applied to many fields, including politics, economy, military affairs, culture, education, social life, and entertainment. Since it was proposed, research on and applications of Digital Earth have been undertaken all around the world and extended to multiple levels, such as the digital region, digital city, digital urban area, and digital enterprise. At the same time, the construction and development of Digital Earth have in turn accelerated the pace of global informatization and, to a large extent, changed people's lifestyles.
Digital Earth is the digital record of the real Earth in a computer network sys-
tem. Its realization in electronic computers needs to be supported by plenty of
technologies, including information infrastructure, high-speed networks, high-res-
olution satellite images, spatial information infrastructures, massive data process-
ing and high-capacity storage, scientific computation, visualization technology,
and virtual reality. Among them, geospatial information technology is indispensa-
ble, such as GPS, GIS, RS, and their integration. Digital Earth provides the most
basic data sets for researching and observing Earth. It requires wide sharing of spatial information, to which user data are attached. From the above, we can see that SDM is one of the key technologies for Digital Earth because it can recognize and analyze the massive data accumulated in Digital Earth to discover laws as well as new knowledge.
SDM is one of the key technologies of Digital Earth (Goodchild 2007). Earth is a complex giant system in which most processes are nonlinear and vary across different spans of time and space. Only by using data mining technology based on high-speed computing can the rules and knowledge hidden in the massive data gathered in Digital Earth be found. The construction of Digital Earth
requires sharing data on a large scale. The spatial data warehouse provides effec-
tive tools for managing and distributing spatial data effectively. In Digital Earth,
the object of SDM generally is the spatial data warehouse. The next generation of
the Great Globe Grid (GGG) and Spatial Data Grids based on network comput-
ing may create a better environment for SDM in Digital Earth. Therefore, Digital
Earth would definitely bring great benefits to the use of framework data if SDM
technology can be used in a digital geospatial data framework.
3.9.3 Smart Planet
3.9.4 Big Data
Big data are complicated data that are tremendous in volume, variety, velocity, and veracity (Wu et al. 2014). Big data offer the information world an opportunity to observe the whole world completely instead of just partial samples. Before big data, statistics could only be computed from random samples, and conclusions were drawn from the sampled data because of the limitations of spatial data collection, computation, and transmission; this yielded only partial information, like the proverbial blind man grasping one part of an elephant and taking that part for the whole. The deficiency of data sampling and the scattering of sample data therefore made it difficult to recognize overall laws and extraordinary changes when they occurred.
In September 2008, a special issue on big data was published in Nature. In
May 2009, the Global Pulse project within the United Nations released “Big Data
May 2010, China Mobile established a massive distributed system and structured
massive data management system on the cloud. Huawei analyzes data based on
mobile terminals and stores massive data through the cloud to obtain valuable
information. Alibaba analyzes transactional data in business data through big data
technology to conduct credit approval operations. In March 2012, when China’s
Ministry of Science and Technology released their list of key national alterna-
tive projects for science and technology in 2013, big data research was the first
priority. According to a document released on September 5, 2015, “China Seeks
to Promote Big Data,” the State Council has called for a boost in the develop-
ment and application of big data in restructuring the economy and improving
governance.
Under the current conditions of big data, data can be created, copied, and computed on a massive scale, overcoming the deficiencies of data sampling. The overall data can literally reproduce the original appearance of the real world, describe the whole appearance of spatial objects, and imply the overall laws and development trends, thereby improving human efficiency in understanding the world and predicting the future. The United States employed professional knowledge and modern information technology to predict the influence of disasters accurately and in a timely manner and to release early-warning information (Vatsavai et al. 2012). The high-resolution images of the tsunami around the railway station in Galle produced by the IKONOS 2 and QuickBird satellites revealed the condition of the buildings. In the Google Earth precipitation monitoring system, users now need only open the Google Earth 3D terrain image, which is automatically accompanied by a satellite cloud picture, a precipitation chart, single-station rainfall data, soil data, and onsite photos provided by the weather bureau; the disaster can be displayed in stereo, an inundation analysis performed, and a basis for decision analysis provided. ArcGIS can make any kind of disaster map, and ArcGIS Mobile can meet the needs of quickly reporting disasters and collecting information for any kind of disaster.
Therefore, SDM focuses on the value of big data and on using it efficiently by providing a process that extracts information from data, discovers knowledge from the information, and gains intelligence from the knowledge; improves the self-learning, self-feedback, and self-adaptation of these systems; and realizes human–computer interaction.
GIS is being applied increasingly as the need for geo-information services grows.
Since its introduction in the 1960s, GIS has gone through a long process of
development and accomplished remarkable achievements in mapping sciences,
resources management, environmental monitoring, transportation, urban planning,
precision farming, etc. The promulgation, distribution, and publishing of geospa-
tial information are growing rapidly. Michael F. Goodchild proposed “Citizens as
Fig. 3.10 Application system in ITS based on sensor web. Reprinted from Li and Shao (2009), with kind permission from Springer Science+Business Media
Fig. 3.11 Reference architecture for an interoperable sensor web. Reprinted from Li and Shao (2009), with kind permission from Springer Science+Business Media
web model. When an end user sends a request on the client side, the decision-
making support system (DSS) transfers the received request to a service chain and
searches the corresponding registered services in the catalog services through a
workflow sequence. Then, the registered services acquire the information of inter-
est and send it to the end user. The second way is based on direct feedback from
the sensor web, which is applied when the corresponding registered service cannot
be found in the catalog services through the workflow sequence. In this case, new
sensor web feedback is further searched; if the required service can be found, it
will be sent to the end user and registered at the registration center. The third way
is based on retrieval of the digital products, which is also applied when the corre-
sponding registered service in catalog services cannot be found through the work-
flow sequence. In this case, the requested information is further searched through
the sensor web node instead of the sensor web feedback.
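The three ways of satisfying a request can be viewed as a fall-through lookup. The following Python sketch is a simplification under assumed interfaces—the catalog, sensor web, and product archive objects are hypothetical stand-ins for the registered services, live feedback, and digital products described above—and is not the actual service chain of Li and Shao (2009):

# Simplified fall-through request handling for the sensor web model.
# `catalog`, `sensor_web`, and `product_archive` are hypothetical objects.
def handle_request(request, catalog, sensor_web, product_archive):
    # 1. Try the registered services found through the workflow sequence.
    service = catalog.find_service(request)
    if service is not None:
        return service.acquire(request)

    # 2. Fall back to direct sensor web feedback; register it if found.
    feedback = sensor_web.search_feedback(request)
    if feedback is not None:
        catalog.register(feedback)
        return feedback.acquire(request)

    # 3. Finally, retrieve an existing digital product via a sensor web node.
    return product_archive.retrieve(request)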
The new spatial information systems may be distinguished from traditional systems from the viewpoints of data providers, data users, geospatial data, measurement, and data sharing.
The new geo-information systems provide services not only to professionals
but also to all public users; a great many of these users require fundamental infor-
mation regarding professional and individual applications. It is unfortunate that
this information cannot be discovered from traditional 4D products directly, which
cannot satisfy the need for integrity, richness, accuracy, and reality in geospatial
information. In this new geo-information era, such information can be acquired
from DMIs, which are released on the internet according to specific requirements.
The users of these new geo-information systems are data and information pro-
viders as well. Data users can upload or annotate new geospatial information in
Web 2.0 in addition to the traditional downloading of information of interest. The
provided data and services are transferred from fixed updating at regular intervals
to more popular updating forms (i.e., from static updating to dynamic updating).
The boundary between the data provider and the data user thus is quickly blurring
thanks to this total open accessibility to spatial information.
Geospatial data move from outdated static sources to a live smart sensor web. Sensor web technology provides access to services that meet the specific needs of professionals and general users in a multimedia and dynamic service environment. All the sensors can be integrated to build a large smart sensor web that provides real-time data updates, information extraction, and services.
Measurement moves from measurement by specification to measurement on demand. Geospatial data in service are DMIs instead of simple image maps. DMIs are digital stereo images appended with six exterior orientation elements acquired by Mobile Measurement Systems (MMSs). By using DMIs on the internet accompanied by measuring software kits, measurement of a specific object at centimeter-level precision is available.
GIS data makes the representation of geographic objects more comprehensive and
vivid and facilitates visible, searchable, measurable, and minable functions.
Sharing moves from data-driven to application-driven. Service-Oriented Architecture
(SOA) is a software structure that achieves interoperability by packaging the
References
Aji A et al (2013) Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. In: Proceedings of the 39th international conference on very large data bases (August 26–30, 2013, Riva del Garda, Trento, Italy), VLDB Endowment, vol 6(11), pp 1009–1020
Al G (1998) The digital earth: understanding our planet in the 21st century. Speech at the California Science Center, Los Angeles, California, January 31, 1998. http://www.isde5.org/al_gore_speech.htm
Codd E (1995) Twelve rules for on-line analytic processing. Computerworld, April 1995
There are always problems in spatial data, which makes spatial data cleaning
the most important preparation for data mining. If spatial data are input without
cleaning, the subsequent discovery may be unreliable and thus produce inaccurate
output knowledge as well as erroneous decision-making results. This chapter dis-
cusses the problems that occur in spatial datasets and the various data cleaning
techniques to remediate these problems. Spatial observation errors mainly include
stochastic errors, systematic errors, and gross errors, such as incompleteness, inac-
curacy, repetitiveness, inconsistency, and deformation in spatial datasets from
multiple sources with heterogeneous characteristics. Classical stochastic error
models are further categorized as indirect adjustment, condition adjustment, indi-
rect adjustment with conditions, condition adjustment with parameters, and condi-
tion adjustment with conditions. All of these models treat parameters as non-random variables in a generalized error model. By selecting weights iteratively, Li Deren assumed two multi-dimensional alternative hypotheses of the Gauss-Markov model when he established the distinguishability and reliability theory of adjustment. As a result, Li extended Baarda's theory into multiple dimensions and realized the unification of robust estimation and least squares.
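The idea of selecting weights for iteration can be illustrated by a generic iteratively reweighted least-squares loop. This is only a minimal numerical sketch of the general technique of iterative reweighting for robust estimation, not Li Deren's specific formulation:

import numpy as np

def irls(A, l, iters=10, eps=1e-6):
    """Generic iteratively reweighted least squares: observations with large
    residuals are progressively down-weighted, so the estimate moves from the
    ordinary least-squares solution toward a robust one."""
    w = np.ones(len(l))
    x = None
    for _ in range(iters):
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ l)
        v = l - A @ x                          # residuals of this iteration
        w = 1.0 / np.maximum(np.abs(v), eps)   # simple illustrative reweighting rule
    return x

Ordinary least squares corresponds to the first iteration with unit weights; as iterations proceed, observations with large residuals are down-weighted, so the solution gradually becomes insensitive to gross errors.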
Spatial data vary by type, unit, and form with different applications, and problems
inevitably occur that affect the quality of spatial data. In recorded items, for exam-
ple, incompleteness, repetition, mistakes, inaccuracy, and abnormalities may occur
(Koperski 1999; Wang et al. 2002; Shi et al. 2002), which may be additionally
influenced by gross, systematic, and stochastic errors. The process of data acqui-
sition brings along with it the data’s systematic errors, such as instrument system
In the real world, most spatial data are polluted, and this pollution is revealed in a variety of ways, such as disunity of scale and projection among different data sources, inconsistency
of sampled data and their forms, mismatch among different data sheets, redundancy
of data, discrepancy between different coordinate systems, deformation of graphics
and images, inconformity between the map scale and the unit of length of digitizers,
loss and repetition of spatial points and lines, missing identifiers for regional centers,
too-long or too-short lines when taking input, and discordance with the consistency
requirements of topology in nodal codes or attribute codes. The most common of
these are incompleteness, inaccuracy, repetition, and inconsistency.
The completeness (or integrity) of spatial data reflects its level of generalization and abstraction. Spatial data may be missing or incomplete for various reasons (Goodchild 2007). First, the defect of incompleteness is caused by errors such as omission (Smithson 1989). For example, the necessary domains and instances are
not recorded in a spatial database when they are designed; the decision rule of col-
lecting and editing spatial data does not take all the variables and impact factors
into account; the data in the spatial database cannot fully present all the possible
attributes and variable features of objects; and the spatial database does not include
all possible objects. Second, not all the characteristics required by the measuring standards are collected according to the criteria, definitions, and other rules, and some important characteristics for recognizing spatial objects are lost because of the evaluation standards. For example, the data will be incomplete if the boundary points of a land parcel are omitted. Third, some elements are lacking in the processes intended to guarantee the continuity of data analysis and the future development of systematic documentation and the spatial system. For example, careless input habits or the different demands that different users place on the data can lead to missing necessary data values. In most cases, the missing values must be entered by hand, although some can be deduced from the spatial data source and related data.
There are various methods to deal with the data noises caused by unknown
attribute values (Clark 1997; Kim et al. 2003). The first method is ignorance,
where the records whose attributes have unknown values are ignored. The sec-
ond method is adding values, where the unknown value is treated as another value
of the attribute. The third method is likelihood estimation, where the unknown
value is replaced with the most possible value of the attribute. The fourth method
is Bayesian estimation, where the unknown value is replaced with the maximum
possible value of the attribute under its function distribution. The fifth method is
a decision tree, where the unknown value is replaced with the classification of
the object. The sixth method is rough set, where rules for the unknown values are deduced from an inconsistent decision table. The seventh method is a binary
model, where the vector of the attribute values is used to represent the data in the
context of transition probability between symbolic attributes. The average value,
maximum value, and minimum value are also used.
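A minimal sketch of the simplest of these strategies—ignoring incomplete records, or filling an unknown value with the attribute mean or the most frequent value—might look as follows in Python (the record structure and attribute name are hypothetical):

from statistics import mean, mode

def clean_missing(records, attr, strategy="mean"):
    """records: list of dicts; a missing attribute value is stored as None."""
    known = [r[attr] for r in records if r[attr] is not None]
    if strategy == "ignore":                 # method 1: drop incomplete records
        return [r for r in records if r[attr] is not None]
    fill = mean(known) if strategy == "mean" else mode(known)
    for r in records:                        # fill-value strategies
        if r[attr] is None:
            r[attr] = fill
    return records

# Example: impute a missing elevation with the mean of the known values (416.0).
pts = [{"id": 1, "elev": 412.0}, {"id": 2, "elev": None}, {"id": 3, "elev": 420.0}]
pts = clean_missing(pts, "elev", strategy="mean")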
The inaccuracy of spatial data measures the discrepancy between the observed data
and its true value or the proximity of the measurement information and the actual
information. Many inaccuracy problems occur in the corresponding environments
that are related to the data types, such as processing methods, sorting algorithm,
positional precision, time variation, image resolution, and spectral features of spa-
tial objects. The inaccuracy may be quantitative or qualitative, such as incorrect data whose values are contrary to the attributes of real objects, outdated data that are not updated in a timely manner, data from inaccurate calculation or acquisition, inexact category data that are difficult or impossible to interpret, and vague data with fake values or eccentric formats (Smets 1996). Topographic features can
be acquired by accurate measurement while the accuracy of forest or soil bounda-
ries may be low due to the influence of line identification errors and measurement
errors. If the accuracy of an attribute is high, the classification should be expected
to be rigorously in accordance with the real world. However, it is difficult to ensure
the complete accuracy of classification, and cartographic generalization might fur-
ther classify various attributes as one and the same attribute. A large amount of the
spatial data is not well used because of its low accuracy. Generally, spatial databases
are designed for a specific application or business step and are managed for different
purposes, which makes it difficult to unify all the related data and therefore makes it
easy to induce errors into the integrated spatial data.
The basic methods of dealing with inaccuracy include statistical analysis to detect possible errors or abnormal values (deviation analysis to identify values that do not follow the expected distributions or regression equations), spell checking in document processing, a simple rule base (e.g., common-sense rules and business-specific rules) to examine spatial data values, the use of external data, and the reorganization of spatial data. Spatial data reorganization refers to the process of extracting
spatial data from the detached primitive data for a certain purpose, transforming
spatial data into more meaningful and integrated information, and mapping data
into a target spatial database.
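Deviation analysis of this kind can be sketched, for example, as a check of how far each value lies from the sample mean in units of the standard deviation; the three-standard-deviation threshold below is an illustrative convention, not a rule prescribed in the text:

import statistics

def deviation_outliers(values, k=3.0):
    """Flag values lying more than k standard deviations from the mean
    (a simple form of deviation analysis)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - m) > k * s]

For small samples, a robust center such as the median may be preferable, because a single gross error inflates both the mean and the standard deviation and can thereby mask itself.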
The presence of repetitive spatial data indicates that duplicate records of the same
real object exist in one data source or in several systems. Records with similar attribute values are considered duplicate records. Due to errors and differing expressions in spatial data, such as spelling errors and different abbreviations, records that do not exactly match may also be duplicates. Duplicate records are more common in multi-source SDM. The spatial data provided by each source usually include identifiers or string data, which may vary between spatial data sources or contain errors for various reasons, such as errors caused by printed or typed-in information or by aliases (Inmon 2005). Therefore, deciding whether two values are similar is not a simple arithmetic problem; it requires a set of defined equivalence rules and some fuzzy matching techniques. Identifying two or more records that relate to one object in the real world
and eliminating the duplicate records can not only save storage and computing
resources, but also improve the speed, accuracy, and validity of the SDM-based
approaches. Merge or purge is the basic method.
Merge or purge is also called record linkage, semantic integration, instance
identification, data cleansing, and match or merge. This method detects and elimi-
nates duplicate records when integrating spatial data sources presented in heter-
ogeneous information (Hernàndez and Stolfo 1998). The current main approach
is to use the multi-attributes of the records and let the user define the equivalent
rules as to whether or not the records correspond to one spatial entity, thereby ena-
bling the computer to auto-match the possible records that correspond to the same
entity with the defined rules. The sorted-neighborhood algorithm and the fuzzy match/merge algorithm are currently popular for this process. The sorted-neighborhood algorithm sorts the entire spatial dataset according to a user-defined key and groups possibly matching records together; repeated sorting on different keys can improve the accuracy of the matching results. The fuzzy match/merge algorithm adopts fuzzy techniques to compare all the records in pairs after normalizing the spatial data of all the attributes and finally merges the comparison results.
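A minimal sketch of the sorted-neighborhood idea is shown below: records are sorted on a user-defined key, and only records falling within a small sliding window are compared in detail. The key construction, similarity rule, and window size here are illustrative assumptions:

def sorted_neighborhood_duplicates(records, key, similar, window=5):
    """records: list of dicts; key(r) builds the sorting key; similar(a, b) is a
    user-defined equivalence or fuzzy-match rule. Returns candidate duplicate pairs."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:   # compare only within the window
            if similar(rec, other):
                pairs.append((rec, other))
    return pairs

# Example rules: sort by rounded coordinates and a name prefix, and treat records
# within 1 m that share the prefix as candidate duplicates.
key = lambda r: (round(r["x"]), round(r["y"]), r["name"][:3].lower())
similar = lambda a, b: (abs(a["x"] - b["x"]) < 1.0 and
                        abs(a["y"] - b["y"]) < 1.0 and
                        a["name"][:3].lower() == b["name"][:3].lower())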
Inconsistency in spatial data values often happens between the internal data
structure and the spatial data source during spatial data cleaning, which can
include a variety of problems that can be classified in two types. The first type
is context-related conflicts, such as different spatial data types, formats, patterns,
The errors in spatial data are as relevant as the data themselves, the measurement, and the data structure. Errors mainly originate from the uncertainty of the attribute definitions, the data sources, the data modeling, and the analysis process; the uncertainty of the data sources arises because measuring depends on subjective human judgment and on the hypotheses made during data collection (Burrough and Frank 1996). The attribute uncertainty of land cover classification data obtained by remote sensing comes from the uncertainty of the spatial, spectral, and temporal characteristics (Wang et al. 2002). In 1992, an MIT report pointed out that quality problems in data were not uncommon: in most of the 50 departments or institutions around the world involved in the sample survey, the accuracy of the spatial data was below 95 %, mainly because of the rates of inaccurate and repetitive data (Fig. 4.1).
In surveying, users spend a great deal of time dividing errors into gross errors,
systematic errors, and stochastic errors, according to the size, characteristics, and
reasons for the errors.
ε = εg + εS + εn (4.1)
In Eq. (4.1), ε is the total observation error, εg is the gross error, εS is the system-
atic error, and εn is the stochastic error. In many years of empirical experience,
Fig. 4.1 Statistical figure on the average of inaccurate and repetitive data
it has become a custom to classify errors into these three types. From the view
of statistical theory, however, there is no common and clear definition. The best
approach is to analyze and sort the errors from different perspectives.
First, all three types of errors can be considered as model errors and can be
described by the following mathematical model:
εS = HS s,    s ∼ M(s0, CSS)    (4.2a)
εg = Hg Δl,   Δl ∼ M(Δl̄, Cgg)    (4.2b)
εn = En εn,   εn ∼ M(0, Cnn)    (4.2c)
In Eqs. (4.2a–4.2c), M(μ, C) denotes any distribution with expectation μ and variance-covariance matrix C, while the matrices HS, Hg, and En determine the influence of the systematic errors, the gross errors, and the stochastic errors on the observation values. The features of these three coefficient matrices differ (Fig. 4.2).
1. The elements in the coefficient matrix of the systematic errors are usually functions of position and time and occupy the matrix either universally or group by group. For instance, when the additional parameters are considered regional invariants, the coefficient matrix HS is fully occupied; when they are strip invariants, the coefficient matrix is occupied group by group, and the systematic errors of the photo are a function of (x, y).
2. The gross error coefficient matrix Hg is sparsely occupied; normally there are only one or a few nonzero elements in each column. For the p1 different gross errors, Eq. (4.3) gives

Hg = (ei+1, ei+2, . . ., ei+p1)    (4.3)
It is well known that the stochastic error always has a nonzero variance-covariance matrix. The systematic error can be treated either as a functional model error (s = s0, CSS = 0) or as a stochastic model error (s = 0, CSS ≠ 0), and of course as both at the same time (Eq. 4.2a). When determining reliability, the gross error is always considered a functional model error (Δl ≠ 0, Cgg = 0), while for locating it, it is better treated as a stochastic model error, which is beneficial for the effective discovery and correction of gross errors.
In addition, the three types of errors also can be distinguished according to the
reasons why the errors appear. The systematic errors brought out in data acqui-
sition are due to certain physical, mechanical, technical, instrumental, or operator causes; normally they follow certain rules or change regularly. The gross errors happen because of irregular mistakes in data acquisition, data transmission, and data processing, which cannot be accepted by the assumed and estimated error model as acceptable observations. As for stochastic errors, they are generated by the observation conditions (instruments, field environment, and observers). Unlike the systematic errors, they show no regularity in size or sign, and only the aggregate of a large number of errors follows certain statistical rules.
The mathematical model for the observation errors in spatial data is described
with a functional model and a stochastic model. Obtaining a set of observation
values from the subject of interest and using it to estimate the relevant unknown
parameters representing the subjects is called parameter estimation in mathemati-
cal statistics and adjustment in surveying. Realization of the errors first needs a
mathematical model reflecting the relationship between the observation values and
the unknown parameters (Li and Yuan 2002).
As a set of random variables, the observation vector can be described by its first
moment (expectation) and its second central moment (variance-covariance). Hereafter, the model used to describe the observation expectation is called the functional model, the model for the precision characteristics of the observation values is the stochastic model, and their combination is known as the mathematical model for adjustment.
The full-rank Gauss-Markov linear model is defined as follows. Assume that A is the known n × u coefficient matrix (usually called the first design matrix), x is the u × 1 unknown parameter vector, l is the n × 1 random observation vector whose variance-covariance matrix is D(l) = σ2P−1 (σ2 is the variance of unit weight), the matrix A has full column rank, and the weight matrix P is a positive definite matrix. Therefore, the full-rank Gauss-Markov linear model is:
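In standard adjustment notation, the model defined by these assumptions is commonly written as

\[
E(\boldsymbol{l}) = A\boldsymbol{x}, \qquad D(\boldsymbol{l}) = \sigma^{2} P^{-1}, \qquad \operatorname{rank}(A) = u ,
\]

or, equivalently, \( \boldsymbol{l} = A\boldsymbol{x} + \boldsymbol{\varepsilon} \) with \( E(\boldsymbol{\varepsilon}) = \boldsymbol{0} \) and \( D(\boldsymbol{\varepsilon}) = \sigma^{2} P^{-1} \).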
In statistics, the model errors can be defined as the difference between the models
built (including the functional model and the stochastic model) and the objective
reality, which in the form of an equation is
F1 = M0 − W (4.5)
In Eq. (4.5), F1 is the true model error, M0 is the mathematical model used, W is
the unknown reality, and M0 ≠ W.
If, according to hypothesis-testing theory in mathematical statistics, a mathematical model is considered a hypothesis about the real world (the null hypothesis), then the starting point when determining the model is to make the model errors zero for both the expectation and the variance of the observation values. To test the null hypothesis, one or more alternative hypotheses are needed. Such an alternative hypothesis is a more precise extension of the model built, intended to reduce model errors.
As the objective reality W is unknown, users must use a mathematical model
extended and refined as much as possible to replace it. Thus, the definition of the
difference between model M and the refined model M0 as plausible model errors is
meaningful for factual research:
F2 = M0 − M (4.6)
The mathematical model M can be extended and refined as much as possible to
make it very close to the objective reality ((M − W) → 0). For instance, in terms
of bundle adjustment with self-calibration, Schroth introduced a model in which
both the functional model and the stochastic model are expanded. Under this
premise, an equation is obtained:
F2 = M0 − M = (M0 − W ) − (M − W ) ≈ M0 − W (4.7)
Further discussion can be derived from the model errors so defined. In a hypothesis test, the difference between the model M0 (the null hypothesis H0) and an extended model M1 (the alternative hypothesis Ha) must be verifiably distinguishable (Fig. 4.3). If M0 = M1, the model is untestable.
When choosing between two alternative hypotheses, it must be ensured that the two extended models M1 and M2 proposed from the original model M0 can be distinguished. If they are identical to each other or one model is contained in the other, then they are indistinguishable (Fig. 4.4).
Fig. 4.3 Relationship between an original assumption and its alternative assumption under a single alternative hypothesis
Fig. 4.4 Relationship between an original assumption and its alternative assumption under two alternative hypotheses
matrix is known. If there are some errors in the mathematical model, how
would the adjustment results be affected? The following three situations may
appear:
1. When too few unknown parameters are chosen in the functional model describ-
ing the observation expectation, there is a deviation in the estimation of the
observation values, the covariance matrix of the unknowns is small, and the
estimation of unit weight variance increases.
2. When too many unknown parameters are chosen, there is no deviation in the
estimate of the observation values, the covariance matrix of the unknown
increases, and the estimate of unit weight variance has no deviation.
3. When a wrong weight matrix is chosen, there is still no deviation in the estimate of the observation values. If the weight of the observation values is small, the cofactor matrix of the observation values decreases; otherwise, it increases. If the weights of the observation values in the model introduce errors, the estimate of the unit weight variance is biased.
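To make the first of these situations concrete, the following minimal numerical sketch (synthetic data assumed for illustration, not an example from the text) fits observations containing a linear trend both with and without the trend parameter; omitting the parameter biases the estimate and inflates the estimated unit-weight variance:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
l = 10.0 + 2.0 * t + rng.normal(0.0, 0.05, t.size)   # truth: offset + trend + noise

def adjust(A, l):
    """Ordinary least squares; returns the estimate and the unit-weight variance."""
    x, *_ = np.linalg.lstsq(A, l, rcond=None)
    v = l - A @ x
    s2 = (v @ v) / (len(l) - A.shape[1])
    return x, s2

A_full = np.column_stack([np.ones_like(t), t])   # offset and trend parameters
A_poor = np.ones((t.size, 1))                    # trend parameter omitted

x_full, s2_full = adjust(A_full, l)   # unbiased; s2 close to the true 0.05**2
x_poor, s2_poor = adjust(A_poor, l)   # offset absorbs part of the trend; s2 inflated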
Therefore, the target of SDM is the dataset, which always involves a large amount of spatial data from many sources that in reality may be polluted.
In the complicated designs and operations of spatial datasets, if the integrity of
spatial data cannot be ensured, all the construction work and the following SDM
will be meaningless. It is very important to remember that just a 4 % increase in
data integrity could bring millions of dollars to any large enterprise.
If good data do not exist, it is not likely that SDM will be able to provide relia-
ble knowledge, quality service, and good decision support. Moreover, spatial rules
are very often hidden in large complex data, where some data may be redundant
and others are totally irrelevant. Their existence may affect the discovery of valu-
able spatial rules. All of these reasons necessarily require SDM to provide a data
cleaning tool that can select correct data items and transform data values (Dasu
2003).
This stage was the simultaneous processing of the stochastic errors and the sys-
tematic errors. In the adjustment of photogrammetry, the unknown systematic
errors in images need to be considered first. The processing method introduced them into the adjustment as additional unknown parameters and solved for these parameters, which characterize the systematic errors, at the same time as the other unknown parameters were calculated. Filtering and estimation were also used.
Certainly, self-calibration adjustment with additional parameters can effectively compensate for the influence of systematic errors in images, allowing the unit-weight mean square error of analytic aerotriangulation to reach the high precision level of 3–5 μm. Bundle block adjustment programs
with additional parameters (e.g., PAT-B and WuCAPS) are effective in this stage.
For further information about compensating for systematic errors, please refer to
Li and Yuan (2002).
This stage further processed the systematic errors and the gross errors at the same time in the adjustment, from two aspects: (1) in theory, the reliability of the adjustment system was studied (i.e., its ability to find gross errors and the influence of undiscoverable gross errors); and (2) in practice, a practical, effective, and automatic gross error location method was sought. Many useful requirements and suggestions for the optimal design of surveying networks are now available thanks to research on the reliability theory of the adjustment system; moreover, the exploration of automatic gross error location has introduced automatic gross error location methods into some computer utilities, achieving important results that are not possible with manual methods.
The second and third stages have generally developed in parallel since 1967.
This stage, which began developing in the 1980s, deals with all of the possi-
ble observation errors simultaneously in the adjustment. This is a very practical
approach because all three types of observation errors do exist at the same time in
reality.
In this stage, a distinguishability theory for discriminating different model errors was proposed. Here, the two model errors to be distinguished may be different gross errors, different systematic errors, or a combination of the two. While the theoretical distinguishability of the adjustment system was studied, this work also informed the optimal design of surveying networks. In practice, this stage still seeks adjustment methods that can process different model errors at the same time.
In 1988, the National Science Foundation of the United States sponsored and
founded the National Center for Geographic Information and Analysis. Among
the 12 research subjects of the center, the first subject designated the accuracy of
GIS as the first priority, while the twelfth subject designated the error problem in
GIS as the first priority among its six topics. Meanwhile, also in 1988, the U.S.
National Committee for Digital Cartographic Data Standards published spatial
data standards for lineage, positional accuracy, attribute accuracy, logical consist-
ency, and completeness. Later, currency (the degree to which a spatial database meets current demands) and thematic accuracy (defined by quantitative attribute accuracy, classification accuracy, and timing accuracy) were proposed, and temporal accuracy was added to the five standards to express time. As lineage describes the
procedure of data acquisition and processing, spatial data accuracy should contain
spatial accuracy, temporal accuracy, attribute accuracy, topology consistency, and
completeness.
Data cleanliness in SDM does not receive adequate attention from the public.
Data access, query, and cleanliness are the three important topics in spatial data
processing, but for a long time users have focused only on the solutions to the
first two while seldom attending to the third. With the C/S structure, the dedicated
spatial data warehouse hardware and software and a complete set of communi-
cation mechanisms link users with spatial data to solve the data access problem.
When querying data, there are many choices in various impromptu query tools,
report writers, application development environments, end-user tools of the
second-generation spatial data warehouse, and multi-dimensional OLAP/ROLAP
tools. Currently, only a small amount of basic theoretical research and application development on spatial data cleaning has been conducted. The international academic journal Data Mining and Knowledge Discovery observed that only a few articles had been available on data cleaning, such as “Real-world Data Is Dirty: Data Cleansing and the Merge/Purge Problem” (Hernández and Stolfo 1998). The development of data cleaning lags far behind that of data mining; even
academic conferences lacked sessions addressing data cleaning. At the same time,
far from the richness and completeness of spatial data access and query tools,
there are only a few tools to solve the problem of spatial data cleanliness, mainly
because of the enormous workload and costs. The present products providing
limited spatial data cleaning functions are QDB/Analyze of QDB Solutions Inc.,
WizRules of WizSoft Inc., Integrity Data Re-Engineering Environment of Vitality
Technology, Name and Address Data Integrity Software of Group ScrabMaster
Software, Trillium Software System, i.d. Centric and La Crosse of Trillium
4.3.1 Fundamental Characteristics
Spatial data cleaning is definitely not just about simply updating previous records
to correct spatial data. A serious spatial data cleaning process involves the analysis
and re-distribution of spatial data to solve spatial data duplication, both inside a
single spatial data source and among multiple spatial data sources, as well as the
inconsistency of the data content itself, which is not just inconsistency in form.
For example, the inconsistency of models and codes can be handled in combi-
nation with the spatial data extraction process in which the conversion of mod-
els and codes can be completed. It is a relatively simple and mechanical process.
Because SDM is a case-by-case application, it is difficult to build mathematical
models. Also, because the cleaning methods are closely related to spatial data sam-
ples, even the same method may produce very different experimental results in
various contexts of application. It is therefore difficult to conclude that a general
procedure and method is best. The typical data cleaning process models include
Enterprise/Integrator, Trillium, Bohn, and Kimball.
4.3.2 Essential Contents
Spatial data cleaning mainly consists of confirming data input, eliminating null
values errors, making sure spatial data values are set in the range of definition,
removing excessive spatial data, solving the conflicts in spatial data, ensuring the
reasonable definition and use of spatial data values, and formulating and employ-
ing standards.
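A minimal sketch of several of these checks, assuming a tabular spatial dataset handled with pandas, is shown below; the column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical parcel records; in practice these would be read from a spatial database.
df = pd.DataFrame({
    "parcel_id": [1, 2, 2, 3, 4],
    "x": [512300.1, 512410.7, 512410.7, 512980.3, None],
    "y": [3378010.2, 3378120.9, 3378120.9, 3378555.4, 3378601.0],
    "land_price": [1250.0, 980.0, 980.0, -50.0, 1430.0],  # negative value is dirty
})

# 1. Eliminate records with null coordinate values.
df = df.dropna(subset=["x", "y"])

# 2. Keep values inside their range of definition (prices must be positive here).
df = df[df["land_price"] > 0]

# 3. Remove excessive (duplicate) spatial data.
df = df.drop_duplicates(subset=["parcel_id", "x", "y"])

print(df)
```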
The available tools fall into three classifications: (1) data migration, which
allows simple transformation rules, such as Warehouse Manager from the com-
pany Prism; (2) data scrubbing, which uses domain-specific knowledge to clean
spatial data and employs parsing and fuzzy matching technology to carry out
multi-source spatial data cleaning (e.g., Integrity and Trillium can specify the “rel-
ative cleanliness” of sources); and (3) data auditing, which finds rules and rela-
tions through statistical analysis of spatial data.
Choosing the right target spatial data is also a necessary part of spatial data
cleaning; in fact, it is the top priority. A feature of SDM is its case-by-case analy-
sis. For a given task, not all attributes of the entities in spatial databases have an impact; rather, the impact varies with the attribute. Consequently, it is
necessary to consider SDM as system engineering and to choose target data and to
ascertain the weight of the role played by the different attributes of target spatial
data. For example, for mining land prices, the point, line, and area factors have dif-
ferent impacts on the results.
There are two types of indirect compensation methods, which often cooperate with
other compensation methods.
The self-calibration method uses the concurrent overall adjustment of addi-
tional parameters. It chooses a systematic error model consisting of several param-
eters and regards these additional parameters as unknown numbers or addresses
them as weighted observation data to calculate them with other parameters in
order to self-calibrate and self-cancel the effects of systematic errors in the adjust-
ment process. This method does not increase any actual workload of photography
and measurement; it also could avoid some inconvenient laboratory test work.
Because the compensation is conducted by itself in the adjustment process, the
accuracy would obviously increase if it is addressed properly. Its disadvantages are
that it can compensate only the systematic errors reflected by the available connec-
tion points and control points; the selection of additional parameters is man-made
or experiential and the results therefore would be different according to different
selections; there is the possibility of worsening the calculation results due to the
strong correlation of the additional parameters and the strong correlation between
the additional parameters and other unknown parameters; and the calculation
work obviously increases. Because both the advantages and disadvantages of this
method are salient, there are many researchers who work on this method.
Another indirect compensation method is the post-test compensation method,
which was first introduced by Masson D’Autume. This method aims at the residual
errors of the observation value of the original photos or model coordinates and does
not change the original adjustment program. It analytically processes the residual
errors of the picture point (or model point) after several iterative computations and cal-
culates the systematic errors’ correction values of several sub-block interior nodes of
the picture point (or model point). Then, the systematic errors’ correction values of
all the picture points (or model points) are calculated by the two-dimensional inter-
polation method and its next calculation is performed after correction. This post-test
method of determining the systematic errors’ correction value by repeatedly analyzing
the residual errors can reliably modify the results and is easier to insert in all kinds
of existing adjustment systems. It is called post-processing if least square filter and
estimation of the coordinate residual errors of the ground control points are conducted
after adjustment. Therefore, in a broad sense, this method also is a post-test method.
This post-processing requires the ground control to have a certain density, which
mainly removes the stress produced in the ground control network in order to better
represent the photogrammetric coordinates in the geodetic coordinate system.
All of the above-mentioned methods also can be combined. For example, the
post-test compensation method can be used in a self-calibration regional network
and self-calibration adjustment can be used in duplicate photographic areas; these
two methods can be combined with the proving ground calibration method. Through
such combinations, the best strategy results can be achieved and the accuracy can be
increased as much as possible while the workload is increased accordingly.
The classic stochastic error processing model includes indirect adjustment and con-
dition adjustment, indirect adjustment with conditions, condition adjustment with
parameters, and condition adjustment with conditions (summary models). All of the
above models treat the parameters as non-random variables. If the prior information
of the parameters is considered, many models can be used, such as the least square
collocation model, the least square filter estimation model, and the Bayesian estima-
tion model. Wang (2002) summarized the above-mentioned models by using gener-
alized linear models. Its principles are described in the following sections.
4.5.1 Function Model

The generalized linear (summary) function model can be written as

L = BX + AY + \Delta, \quad CX + C_0 = 0   (4.8)

where L is the observation vector of n × 1; X is the non-random parameter vector of u × 1; Y = (S', S_W')' is the random parameter vector, in which S is the observed random parameter vector of m1 × 1 and S_W is the not-observed random parameter vector of m2 × 1; Δ is the observation error vector of n × 1; n is the number of observations; u is the number of non-random parameters, u ≥ t; m is the number of random parameters, with m1 + m2 = m; and d = u − t is the number of non-random parameters that are not independent.
4.5.2 Random Model
\mathrm{Cov}(\Delta, Y) = 0, \quad \mathrm{Cov}(Y, \Delta) = 0
4.5.3 Estimation Equation
Let

L_y = \begin{bmatrix} L_S \\ L_{S_W} \end{bmatrix} = E(Y) = \begin{bmatrix} \mu_S \\ \mu_{S_W} \end{bmatrix}   (4.9)

Let

\bar{L} = \begin{bmatrix} L_y \\ L \end{bmatrix}, \quad \bar{V} = \begin{bmatrix} V_y \\ V \end{bmatrix}, \quad \hat{Z} = \begin{bmatrix} \hat{X} \\ \hat{Y} \end{bmatrix}, \quad \hat{Y} = \begin{bmatrix} \hat{S} \\ \hat{S}_W \end{bmatrix},

\bar{\Delta} = \begin{bmatrix} \Delta_y \\ \Delta \end{bmatrix}, \quad \bar{B} = \begin{bmatrix} 0 & E \\ B & A \end{bmatrix}, \quad \bar{C} = \begin{bmatrix} C & 0 \end{bmatrix}

We obtain

\mathrm{Var}(\bar{\Delta}) = \begin{bmatrix} Q_{yy} & 0 \\ 0 & Q_{\Delta\Delta} \end{bmatrix}, \quad \bar{P} = \begin{bmatrix} Q_{yy} & 0 \\ 0 & Q_{\Delta\Delta} \end{bmatrix}^{-1} = \begin{bmatrix} P_y & 0 \\ 0 & P_\Delta \end{bmatrix}   (4.12)

\bar{V} = \bar{B}\hat{Z} - \bar{L}   (4.13)

\bar{C}\hat{Z} + C_0 = 0

Let

\bar{N} = \bar{B}'\bar{P}\bar{B}, \quad \bar{U} = \bar{B}'\bar{P}\bar{L} = \begin{bmatrix} B'PL \\ P_y L_y + A'PL \end{bmatrix}

Then

\begin{bmatrix} \bar{N} & \bar{C}' \\ \bar{C} & 0 \end{bmatrix} \begin{bmatrix} \hat{Z} \\ K \end{bmatrix} - \begin{bmatrix} \bar{U} \\ -C_0 \end{bmatrix} = 0   (4.18)

Let

\bar{B}'\bar{P}\bar{B} = \begin{bmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{bmatrix} = \begin{bmatrix} B'PB & B'PA \\ A'PB & P_y + A'PA \end{bmatrix}

Then

\begin{bmatrix} N_{11} & N_{12} & C' \\ N_{21} & N_{22} & 0 \\ C & 0 & 0 \end{bmatrix} \begin{bmatrix} \hat{X} \\ \hat{Y} \\ K \end{bmatrix} = \begin{bmatrix} B'PL \\ P_y L_y + A'PL \\ -C_0 \end{bmatrix}   (4.19)

When there is no constraint (C = 0, C_0 = 0), Eq. (4.19) reduces to

\begin{bmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{bmatrix} \begin{bmatrix} \hat{X} \\ \hat{Y} \end{bmatrix} = \begin{bmatrix} B'PL \\ P_y L_y + A'PL \end{bmatrix}   (4.21)
It is known that coefficient matrix B is full column rank from u = t. Therefore, the
solution to Eq. (4.21) is
\begin{bmatrix} \hat{X} \\ \hat{Y} \end{bmatrix} = \begin{bmatrix} N_{11}^{-1} + N_{11}^{-1} N_{12} R^{-1} N_{21} N_{11}^{-1} & -N_{11}^{-1} N_{12} R^{-1} \\ -R^{-1} N_{21} N_{11}^{-1} & R^{-1} \end{bmatrix} \begin{bmatrix} B'PL \\ A'PL + P_y L_y \end{bmatrix}   (4.22)

where R = P_y + A'PA - A'PB(B'PB)^{-1}B'PA.
That is,

\hat{X} = \left[B'(Q_{\Delta\Delta} + A_1 Q_{SS} A_1')^{-1}B\right]^{-1} B'(Q_{\Delta\Delta} + A_1 Q_{SS} A_1')^{-1}(L - A_1\mu_S)

\hat{Y} = L_Y + Q_{YY}A'(Q_{\Delta\Delta} + AQ_{YY}A')^{-1}(L - B\hat{X} - AL_Y)   (4.23)

Consider

A = (A_1 \;\; 0), \quad \hat{Y} = \begin{bmatrix} \hat{S} \\ \hat{S}_W \end{bmatrix}, \quad Q_{YY} = \begin{bmatrix} Q_{SS} & Q_{SS_W} \\ Q_{S_W S} & Q_{S_W S_W} \end{bmatrix}, \quad L_Y = \begin{bmatrix} \mu_S \\ \mu_{S_W} \end{bmatrix}
We obtain

\hat{S} = \mu_S + Q_{SS}A_1'(Q_{\Delta\Delta} + A_1 Q_{SS} A_1')^{-1}(L - B\hat{X} - A_1\mu_S)

\hat{S}_W = \mu_{S_W} + Q_{S_W S}A_1'(Q_{\Delta\Delta} + A_1 Q_{SS} A_1')^{-1}(L - B\hat{X} - A_1\mu_S)   (4.24)
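As a numerical check on Eqs. (4.23) and (4.24), the following minimal NumPy sketch evaluates the collocation estimates on small synthetic matrices; the dimensions and all values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, u, m1 = 6, 2, 3             # toy sizes: observations, non-random and random parameters
B = rng.normal(size=(n, u))    # design matrix of the non-random parameters X
A1 = rng.normal(size=(n, m1))  # design matrix of the observed random parameters S
Q_dd = np.eye(n)               # cofactor matrix of the observation errors
Q_ss = 0.5 * np.eye(m1)        # cofactor matrix of the random parameters S
mu_s = np.zeros(m1)            # prior expectation of S
L = rng.normal(size=n)         # observation vector

# Eq. (4.23): generalized least squares for X with covariance Q_dd + A1 Q_ss A1'
Q = Q_dd + A1 @ Q_ss @ A1.T
Qi = np.linalg.inv(Q)
X_hat = np.linalg.solve(B.T @ Qi @ B, B.T @ Qi @ (L - A1 @ mu_s))

# Eq. (4.24): least squares collocation estimate of the random parameters S
S_hat = mu_s + Q_ss @ A1.T @ Qi @ (L - B @ X_hat - A1 @ mu_s)

print("X_hat =", X_hat)
print("S_hat =", S_hat)
```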
4.5.4.3 Bayesian Estimation
In Eq. (4.19), when B = 0, C = 0, and C_0 = 0, then N_{11} = 0, N_{12} = 0, and N_{21} = 0; considering N_{22} = P_y + A'PA, Eq. (4.19) changes to

(A'PA + P_y)\hat{Y} = A'PL + P_y L_y   (4.27)

We obtain

\hat{Y} = (A'PA + P_y)^{-1}(A'PL + P_y L_y)   (4.28)
When m = 0—that is, Eq. (4.8) does not include random parameters—then A = 0. At this moment, Eq. (4.8) changes to

L = BX + \Delta, \quad CX + C_0 = 0   (4.29)

and the normal equations become

\begin{bmatrix} B'PB & C' \\ C & 0 \end{bmatrix}\begin{bmatrix} \hat{X} \\ K \end{bmatrix} = \begin{bmatrix} B'PL \\ -C_0 \end{bmatrix}

Its solution is

\begin{bmatrix} \hat{X} \\ K \end{bmatrix} = \begin{bmatrix} B'PB & C' \\ C & 0 \end{bmatrix}^{-1}\begin{bmatrix} B'PL \\ -C_0 \end{bmatrix}
It is clear that the linear model with linear constraints is also a special case of the
generalized linear summary model.
When m = 0—that is, Eq. (4.8) does not include random parameters, so A = 0—and u = t—that is, there are only t independent parameters—then d = u − t = 0; thus C = 0 and C_0 = 0, and Eq. (4.8) changes to

L = BX + \Delta

This is the general linear model, whose solution is

\hat{X}_{LS} = (B'PB)^{-1}B'PL
From the above derivation, Eq. (4.8) includes the least squares collocation model, the least squares filter estimation model, the Bayesian estimation model, the linear model with linear constraints, and the general linear model as its special cases; it is therefore a generalized linear summary model. As for more complex nonlinear model parameter estimation theories and their applications, readers can refer to Wang (2002).
Reliability gives the adjustment system the ability to detect gross errors, indicat-
ing the impact of the undetectable gross errors on the adjustment results as well
as the statistical measurements in the detection and discovery of the gross errors.
Internal reliability refers to the ability to discover the gross errors in detectable
observations, which is commonly measured by the minimum or the lower limit
of detectable gross errors. The smaller and lower the limit is, the stronger the reli-
ability is. External reliability, on the other hand, represents the impact of undetectable gross errors (those below the detection limit) on the adjustment results or on functions of the adjustment results. The smaller the impact of the undetectable
gross errors on the results is, the stronger the external reliability is. The redundant
observations also are key in the detection of gross errors. As for both internal and
external reliability, the more redundant the observations are, the better.
Another demand of reliability research is to search for the location of gross
errors. Locating gross errors is about finding a way of automatically discovering
the existence of gross errors in the process of adjustments and pointing out their
locations precisely so as to reject them from the adjustment. This is an algorithmic
problem more than a theoretical issue because an algorithm is required to carry
out the process of automatic detection controlled by programs based on different
adjustment systems and the manifold types of gross errors that are likely to arise. In measurements, the rejection and treatment of gross errors can generally be categorized in two specific ways.
Reliability research has two major tasks. One is a capacity study of the adjust-
ment system in discovering and distinguishing the gross errors of different mod-
els as well as the impact of the undetectable and indistinguishable model errors
on the adjustment results in theory. The other task is the quest for methods that
automatically discover and distinguish model errors while determining the loca-
tion of model errors in the process of adjustments. The former task supports reliability analysis and optimal design methods for the adjustment system, which can design the best network configuration in the surveying area while integrating the requirements of precision, reliability, and economy. The latter task can perfect the current manifold adjustment programs by lifting the adjustment calculation to a higher level of automation. As a result, the measurement results provided to the different sectors of the national economy will include not only the geodetic coordinates of the points but also their numerical precision and the numerical value of their reliability.
Reliability research is based on the mathematical statistical hypothesis test.
The classical hypothesis testing theory was introduced by Neyman and Pearson in 1933. In the field of survey adjustment, reliability theory was proposed by
Baarda in 1967–1968. Driven by a single one-dimensional alternative hypoth-
esis, Baarda’s reliability theory studies the capacity of the adjustment system in
discovering single model errors and the impact of the undetectable model errors
on the adjustment results. The former is called internal reliability and the latter is
called external reliability. Here, the model errors refer to the gross errors and sys-
tem errors. In addition, starting from a known unit weight variance, Baarda also derived data snooping to test gross errors, which uses standardized residuals that follow the normal distribution as test statistics.
Förstner and Koch later applied this theory to a single multidimensional alternative hypothesis so as to enable it to discover multiple model errors. From the perspective of the statistics of the single gross error test, Förstner and Koch derived the test statistic for an unknown variance factor, Pope and Koch calculated the test variable, and among several gross error tests the F test variable was calculated by Förstner and Koch. In 1983, Förstner introduced the possibility of distinguishing different model errors.
\nabla_0 S = \frac{\sigma_0 \delta_0}{\sqrt{S' P_{SS} S}}\, S   (4.30)
P_{SS} - P_{S S_t} = \frac{\delta_0^2(S)}{\delta_0^2}\, P_{S S_t}   (4.31)

In this way, δ_0^2(S) can be evaluated in different directions.
In his book entitled Sequel to Principle of Photogrammetry, Wang Zhizhuo
(2007), a member of the Chinese Academy of Science, wrote: “In the aspect of
distinguishable theory, i.e., a reliability theory under double alternative hypothesis,
German Förstner and Chinese Li Deren have made effective researches.” He fur-
ther specifically introduced the iteration method with variable weights, which is
presented in Sect. 7 of Chap. 4 (Wang 2007), and named it the “Li Deren method.”
4.6.2 Data Snooping
Data snooping tests each observation by means of the standardized residual

w_i = \frac{v_i}{\sigma_{v_i}}, \quad \sigma_{v_i} = \sigma_0 \sqrt{q_{ii} - B_i (B'PB)^{-1} B_i'}

where v_i is the correction of the ith observation, calculated from the error equation; σ_{v_i} is the mean square error of v_i; σ_0 is the mean square error of unit weight; q_{ii} is the ith element on the main diagonal of the inverse weight matrix P^{-1}; B_i is the ith row of the error equation matrix; and B'PB is the coefficient matrix of the normal equations.

For the statistic w_i used to detect gross errors, Baarda's commonly adopted significance level α = 0.001 gives a critical value of 3.3. With N(0, 1) as the null hypothesis, if |v_i| < 3.3σ_{v_i}, the null hypothesis is accepted and the observation is assumed to be free of gross errors; however, if |v_i| ≥ 3.3σ_{v_i}, the null hypothesis is rejected and a gross error is assumed.
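A minimal NumPy sketch of Baarda's data snooping test on a synthetic adjustment is given below; the design matrix, weights, and the injected gross error are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear adjustment L = B X + noise, with one injected gross error.
n, u = 10, 2
B = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
X_true = np.array([2.0, 0.5])
L = B @ X_true + rng.normal(scale=0.01, size=n)
L[4] += 0.2                      # gross error in observation 5
P = np.eye(n)                    # equal weights
sigma0 = 0.01                    # a priori mean square error of unit weight

# Least squares solution and residuals v = B X_hat - L.
N = B.T @ P @ B
X_hat = np.linalg.solve(N, B.T @ P @ L)
v = B @ X_hat - L

# Cofactor matrix of the residuals: Q_vv = P^-1 - B N^-1 B'.
Q_vv = np.linalg.inv(P) - B @ np.linalg.solve(N, B.T)
sigma_v = sigma0 * np.sqrt(np.diag(Q_vv))

# Baarda's data snooping: flag |w_i| >= 3.3 (alpha = 0.001).
w = v / sigma_v
print("suspected gross errors at indices:", np.where(np.abs(w) >= 3.3)[0])
```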
It can be seen from the discussion in the last section that it is difficult to posi-
tion the gross error, especially several gross errors, when they are incorporated in
a function model. If the observations containing gross errors are regarded as the
sample of large variance of the same expectation, then it can lead to the iteration
method with selected weights for positioning gross errors.
The basic theory of the method is that, due to unknown gross errors, the adjust-
ment method is still least squares. However, after each adjustment, according to
the residuals, other parameters, and the selected weight function, the weights for the next iterative adjustment are calculated and included in the adjustment calculation. If the weight function is chosen properly, the weights of the observations containing gross errors become smaller and smaller until they reach zero, so that the gross errors can be positioned. When the iteration stops, the corresponding residuals indicate the values of the gross errors, and the adjustment results are no longer affected by them. Thus, automatic positioning and correction are realized. The iteration method with selected weights can be used to study the discoverability and measurability of the model errors (gross errors, systematic errors, or deformations) in any adjustment system, and the influence of the unmeasurable model errors on the adjustment result can thus be calculated.
The method starts from the minimum condition \sum p_i v_i^2 \rightarrow \min, in which the weight function is p_i^{(v+1)} = f(v_i^{(v)}, \ldots), (v = 1, 2, 3, \ldots). Some known functions include the iterative method with minimum norm, the Danish method, the weighted data snooping method, the option iteration method proceeding from the robust principle, and the option iteration method deduced from the posteriori variance estimation principle. The various types of functions can be divided into the residual function, the standardized residual function, and the variance valuation function according to content, and into the power function and the exponential function according to form.
To position gross errors effectively, the selection of a weight function should
meet the following conditions:
1. The weight of an observation containing gross errors should tend to zero through the iteration; that is, its redundancy component should tend to one.
2. The weight of an observation containing no gross errors should equal the weight of its observation group (given in advance or calculated by the posteriori variance test) when the iteration stops. The weight is one when there is only one group of observation values with equal precision.
3. The selection of the weight function should guarantee that the iteration converges quickly. For the iteration method with selected weights, an ideal gross-error-positioning weight function takes the posteriori variances of the observations as its basic component in the form of an exponential function.
However, the usual least squares and maximum likelihood estimates are strongly affected by gross errors and therefore cannot resist external disturbances; that is, ordinary least squares estimation is not a robust method.
The equation of robust estimation is as follows (Wang 2002):

V = B\hat{X}_R - L = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}\hat{X}_R - \begin{bmatrix} L_1 \\ L_2 \\ \vdots \\ L_n \end{bmatrix}   (4.32)

where b_i is the ith row of the design matrix and \hat{X}_R is the robust estimate of the unknown parameter X. Suppose the weight of the ith observation is p_i; the unknown parameter is then calculated by solving the optimization problem

\sum_{i=1}^{n} p_i\,\rho(v_i) = \sum_{i=1}^{n} p_i\,\rho\!\left(b_i\hat{X}_R - L_i\right) = \min   (4.33)
In Eq. (4.33), take the derivative with respect to \hat{X}_R, set it to zero, and write \varphi(v_i) = \partial\rho/\partial v_i. Then

\sum_{i=1}^{n} p_i\,\varphi(v_i)\, b_i = 0   (4.34)

Let

\varphi(v_i)/v_i = W_i, \quad \bar{P}_{ii} = P_i W_i   (4.35)

In Eq. (4.35), W_i is called the weight factor and \bar{P}_{ii} is the equivalent weight. Equation (4.34) can then be written as

B'\bar{P}V = 0   (4.36)
Substituting Eq. (4.32) into Eq. (4.36) leads to

B'\bar{P}B\hat{X}_R - B'\bar{P}L = 0   (4.37)

Then,

\hat{X}_R = (B'\bar{P}B)^{-1}B'\bar{P}L   (4.38)

Because of the introduction of \bar{P}, Eq. (4.38) can both resist the disturbance of gross errors and keep the form of least squares estimation. Equation (4.38) was called robust least squares estimation by Zhou Jiangwen.
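The following minimal sketch illustrates Eqs. (4.35)–(4.38) as an iteratively reweighted least squares loop; the Huber-type weight factor, the threshold, and the synthetic data are assumptions chosen for illustration, not the book's specific weight function.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic straight-line fit with two gross errors.
n = 20
B = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
L = B @ np.array([1.0, 3.0]) + rng.normal(scale=0.02, size=n)
L[3] += 0.5
L[15] -= 0.4

P = np.ones(n)           # a priori weights p_i
k = 0.05                 # Huber threshold (assumed)

X_R = np.zeros(2)
for _ in range(10):
    # Equivalent weights P_bar_ii = p_i * W_i with a Huber-type weight factor W_i.
    v = B @ X_R - L
    W = np.where(np.abs(v) <= k, 1.0, k / np.maximum(np.abs(v), 1e-12))
    P_bar = P * W
    # Eq. (4.38): X_R = (B' P_bar B)^-1 B' P_bar L
    BtP = B.T * P_bar
    X_R = np.linalg.solve(BtP @ B, BtP @ L)

print("robust estimate:", X_R)
```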
It can be observed from Eq. (4.36) that the weight factor Wi is a nonlinear func-
tion of the residuals. To make the equivalent weight more practical, it is necessary
to improve Wi through the iteration calculation. The maximum likelihood estima-
tion can be made robust through the following actions (Fig. 4.5):
1. The impact of gross errors on the adjustment results must have an upper limit.
2. The impact of minor gross errors on the adjustment results should not reach the upper limit at once; instead, it should increase gradually with the growth of the errors. That is, the rate of increase of the influence function (the influence of an added observation on the estimate) should have an upper limit.
3. Gross errors above a certain limit should not affect the adjustment result at all; that is, an influence value of zero should be set.
4. Changes in minor gross errors should not bring about big changes in the results; that is, the decline of the influence function should not be too abrupt.
5. To make the estimate robust, the rate of increase of the influence function should also have a lower limit to guarantee fast convergence during the calculation.
With the above five principles, we can obtain the admissible robust range of the influence function. If the chosen influence function lies within this range, the estimate is called a robust estimate.
It must be pointed out that these five principles are formulated in terms of the true errors (i.e., the gross errors themselves). In practice, users can work only with the observation residuals and must convert the influence function and the weight function into functions of the corrections; this is a fundamental deficiency of robust estimation. The influence function of the classical least squares adjustment meets only robust conditions (2) and (5) and offers no protection against gross errors, so it is not a robust adjustment method; its results are influenced in proportion to the gross errors. The influence function with a small power exponent in the minimum-norm methods basically meets conditions (1) and (2) and therefore acquires part of the robust characteristics. The influence function of the Huber method meets the most important conditions (1), (2), and (5), but large gross errors still have some impact on the adjustment result. The Hampel method meets all the requirements, with condition (4) met only approximately. The Danish method introduced by Krarup meets almost all the requirements except condition (3). The Stuttgart method satisfies all five robust conditions: at the beginning of the iteration, when many gross errors may be present, observations with large residuals are not discarded casually; at the end of the iteration, however, even the smallest gross errors are removed strictly.
The above-mentioned weight functions are usually chosen empirically, and the weight is represented as a function of the corrections. Because the corrections are only the visible part of the true errors, these weight functions (except the Stuttgart method) do not take into account the geometric conditions of the adjustment. In fact, gross errors can be regarded as a subsample of a normal distribution whose expectation is zero and whose variance is large. The posteriori variance calculated through the posteriori variance estimation of the least squares method can then be used to find the unusually large observations that contain gross errors. Next, according to the classic definition, when an observation's posteriori variance is out of proportion to its weight, a relatively small weight is assigned to it for the iterative adjustment, which makes it possible to position the gross errors. This method was proposed by Li Deren in 1983.
\hat{\sigma}_i^2 = \frac{V_i' V_i}{r_i} \quad (i = 1, 2, \ldots, k;\ k \text{ is the number of groups})   (4.39)
The statistical parameter is

T_{i,j} = \frac{\hat{\sigma}_{i,j}^2}{\hat{\sigma}_0^2} = \frac{v_{i,j}^2\, p_{i,j}}{\hat{\sigma}_0^2\, r_{i,j}} = \frac{v_{i,j}^2\, p_{i,j}}{\hat{\sigma}_0^2\, q_{v_i,jj}\, p_{i,j}}   (4.42)
As for the adjustment containing only one group of observations of the same pre-
cision, its statistic and weight functions are as follows:
T_i = \frac{v_i^2}{\hat{\sigma}_0^2\, q_{v_{ii}}\, p_i}   (4.44)
p_i^{(v+1)} = \begin{cases} 1 & T_i < F_{\alpha,1,r} \\ \dfrac{\hat{\sigma}_0^2\, r_i}{v_i^2} & T_i \ge F_{\alpha,1,r} \end{cases}   (4.45)
To compare this with Baarda’s data snooping, the statistic parameters in Eq. (4.42)
are taken. In the first iteration pi = pi, j. Therefore,
T_i^{1/2} = \frac{v_i}{\hat{\sigma}_0\sqrt{q_{v_{ii}}}} = \tau_i   (4.46)

This is the τ statistic used in data snooping when the unit weight variance is unknown.
It can be observed that data snooping equals the first iteration of the method. In the first iteration, the weight of the observation containing the gross error is still incorrect, and all the residuals and σ̂_0 are affected; therefore, the estimate is not precise. However, if the posteriori variance estimates are used to change the observation weights during the iteration, the weights of the observations containing gross errors will decline gradually to zero and ultimately have no influence on the adjustment results, making the estimate more precise.
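A minimal sketch of the selection-weight iteration using the posteriori variance test of Eqs. (4.44) and (4.45) is given below; the synthetic observations, the significance level, the number of iterations, and the final detection threshold are assumptions for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)

# Synthetic adjustment with one gross error.
n, u = 15, 2
B = np.column_stack([np.ones(n), np.linspace(0, 2, n)])
L = B @ np.array([0.5, 1.5]) + rng.normal(scale=0.01, size=n)
L[7] += 0.3
p = np.ones(n)                      # initial weights
r_total = n - u                     # overall redundancy

for _ in range(8):
    P = np.diag(p)
    N = B.T @ P @ B
    X_hat = np.linalg.solve(N, B.T @ P @ L)
    v = B @ X_hat - L
    sigma0_sq = (v @ (p * v)) / r_total
    # Redundancy numbers r_i = q_vii * p_i, with Q_vv = P^-1 - B N^-1 B'.
    Q_vv = np.linalg.inv(P) - B @ np.linalg.solve(N, B.T)
    r_i = np.diag(Q_vv) * p
    # Eq. (4.44): T_i = v_i^2 / (sigma0^2 * q_vii * p_i); Eq. (4.45): new weights.
    T = v**2 / (sigma0_sq * np.diag(Q_vv) * p)
    crit = f_dist.ppf(0.99, 1, r_total)
    p = np.where(T < crit, 1.0, sigma0_sq * r_i / v**2)

print("final weights:", np.round(p, 3))
print("suspected gross error indices:", np.where(p < 0.01)[0])
```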
Spatial graphics and images from multiple sources may be derived from different
time periods or sensors. During the imaging procedure, the spectral radiance and
geometric distortion of spatial entities determine the quality of its graphics and
images to a great extent. In order to make spatial graphics and images represent the original spatial entity as closely as possible, it is necessary to correct and clean the radiometric and geometric distortions (Wang et al. 2002).
The output of the sensors that generate graphics and images has a close relation-
ship with the spectral radiance of the target. Cleaning the graphic and image data
of radiation deformation mainly corrects the radiation factors that affect the qual-
ity of remote-sensing images, which include the spectral distribution characteris-
tics of solar radiation, atmospheric transmission characteristics, solar altitudinal
angle, position angle, spectral characteristics of surface features, altitude and posi-
tion of sensors, and properties and record modes of sensors.
These methods mainly correct the errors of the sensor systems and the missing
data in accessing and transmitting. The manufacturing components and the proper-
ties (e.g., the spectral sensitivity and energy conversion sensitivity of every sensor)
are different; therefore, the methods to correct systematic errors are different. For
example, the systematic errors of the MSS sensor are caused by non-uniformity in the gain and drift of the detecting elements arrayed in the sensor's detector and by their possible changes during operation. Compensation for the noise in the radiation measurement values of the MSS sensor can be performed according to a statistical analysis of the calibration grey wedge and the image data generated from satellites. After the calibration parameters are obtained and the gain and drift of the voltage values are calculated, each pixel of every scanning line can be corrected one by one. When the calibration data are absent or unreliable, statistical analysis can be
used for calibration, which means using spatial entity data in the scanning field to
calculate the total and average total of all the samples of every detecting element,
determining the gain and drift of all the detecting elements in each band, and then
correcting the data. Moreover, a signal transmission is needed before recording the
signals received by the internal sensors. In the process of signal transmission, dark
current is easy to generate, which decreases the signal-to-noise ratio. The sensitiv-
ity characteristics, response characteristics, position, altitude, and attitude of the
sensors also can affect the quality of spatial graphics and images and need to be
corrected.
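A minimal sketch of such a per-detector gain and drift (offset) correction is shown below; the image and calibration values are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy image: 6 scan lines acquired by 6 detectors with different gain and drift.
true_radiance = rng.uniform(50, 200, size=(6, 100))
gain = np.array([1.00, 0.95, 1.05, 0.90, 1.10, 1.00])[:, None]
drift = np.array([0.0, 3.0, -2.0, 5.0, -4.0, 1.0])[:, None]
recorded = gain * true_radiance + drift          # what the sensor records

# Correction with calibration parameters: radiance = (DN - drift) / gain, line by line.
corrected = (recorded - drift) / gain

print("max correction error:", np.abs(corrected - true_radiance).max())
```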
The atmosphere absorbs and scatters electromagnetic waves, which produces a frequency low-pass filtering effect and atmospheric oscillation effects and changes the radiometric properties of satellite remote sensing images. The main factor is atmospheric scattering. Scattering is caused by molecules and particles in the atmosphere, which can deflect electromagnetic waves several times. The increased intensity value caused by the scattering effect does not carry any information about the target but decreases the contrast of images, reduces image resolution, and can have the following three serious effects on the radiometric properties of images: the loss of useful ground information in some short-wave bands; radiometric interference between neighboring pixels; and the formation of sky light with cloud reflections. Therefore, the frequency low-pass filtering effect must be corrected; here, we present three methods to do so:
1. Correction by solving radiation equations entails substituting atmospheric data
into radiation equations to calculate the gray value equivalent to the atmos-
pheric scattering as the atmospheric approximate correction value. However,
this method is not often used.
2. The field spectral measurement method relates the image intensity values to field-measured spectral data of land features and obtains the radiometric correction value by regression analysis.
3. Multi-band image comparison addresses the effects of scattering, which mainly occur in the short-wave bands and rarely influence the infrared band. This approach takes an infrared image as the standard image without apparent scattering influences and compares it with the other band images using the histogram method and regression analysis in certain fields. The difference is the scattering radiance value that needs correction (a minimal sketch of this idea follows the list).
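The following minimal sketch illustrates the band-comparison idea in method (3): a short-wave band is regressed against a near-infrared reference band, and the intercept is taken as an estimate of the additive scattering component. The synthetic bands and the use of a simple linear regression are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic reflectance: a NIR reference band and a blue band with additive haze.
nir = rng.uniform(0.05, 0.6, size=10_000)          # reference band, little scattering
haze = 0.08                                         # additive path radiance in the blue band
blue = 0.7 * nir + haze + rng.normal(scale=0.01, size=nir.size)

# Least squares regression blue = a * nir + b; the intercept b approximates the haze.
a, b = np.polyfit(nir, blue, deg=1)
blue_corrected = blue - b                           # remove the estimated scattering offset

print(f"estimated scattering offset: {b:.3f} (true {haze})")
```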
The lighting conditions in photography (solar position, altitude angle, solar posi-
tion angle, the irradiance of the light source caused by direct solar light and sky-
light, etc.) also can affect the quality of images. The solar altitudinal angle and
the position angle are related to the irradiance and optical path length of the hori-
zontal surface and affect the magnitude of directional reflectivity. When the solar
altitude angle is 25–30°, the images acquired by photography can form the most
stereoscopic shadow and appropriate images; however, such results are difficult
to guarantee in actual photography. For satellite images under good atmospheric
conditions, the changes in lighting conditions mainly refer to the changes of solar
altitude while imaging solar altitude can be determined by the imaging time, sea-
son, and position and then corrected through these parameters. The correction of
lighting conditions is realized by adjusting the average brightness in an image. The
correction results can be obtained by determining the solar altitude angle, knowing
the imaging season and position, calculating the correction constants, and then
multiplying the constants by the value of each pixel. The solar light spot—the phenomenon in which the area around the solar point appears brighter than its surroundings when sunlight is reflected and diffused by the surface—can be corrected together with the reduced border light. This can be achieved by calculating the shaded curve surfaces (the distortion components caused by the solar light spot and the reduced border light in the shading change areas of images). Generally, the stable changing components extracted from images by Fourier analysis can be regarded as shaded curve surfaces.
4.7.1.4 Noise Removal
During the process of obtaining remote sensing images, the abnormal stripes (generally following the scanning cycle) and the spots caused by differences in the properties, interference, and breakdown of detectors not only introduce erroneous information directly but also lead to poor processing results, and therefore need to be removed.
1. Periodic noises. Periodic noise originates from periodic signals in the raster scan and digital sampling units being coupled into the image electronic signals of electron-optical scanners, or from mechanical oscillation in the electronic scanner and magnetic tape recorder. The recorded image is then superimposed with periodic interference of varying amplitude, frequency, and phase, which produces periodic noise. In two-dimensional images, periodic noise appears with the scanning periodicity, is distributed perpendicular to the scanning lines, and can be seen in the two-dimensional Fourier spectrum. Therefore, it can be reduced by Fourier transform in the frequency domain through a band-pass or notch filter.
2. Striping noises. Striping noise is produced by the equipment, such as changes in the gain and drift of the sensors' detectors, data interruptions, and missing tape records. This type of noise presents obvious horizontal strips in the images. The correction of striping noise usually compares the average density of the stripe with that of the adjacent lines parallel to the stripe and then chooses a gain factor that compensates for the difference, or uses a method similar to that for periodic noise.
3. Isolated noises. Isolated noise is caused by error codes in the data transmission process or by temperature perturbations in analog circuits. Because the affected image elements deviate numerically from the neighboring data, they are called isolated noise. They can be dealt with through median filtering or a noise removal algorithm (see the sketch after this list).
4. Random noises. Random noise is attached to the images, and its values and positions are not fixed, such as the granularity noise of photographic negatives. It can be restrained by averaging several sequentially photographed images of the same scene. Moreover, there are also other correction methods, such as bidirectional reflectance correction, emissivity correction, terrain correction, and inversion of remote sensing physical variables.
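A minimal sketch of two of these corrections—a simple line-mean destriping for striping noise and a median filter for isolated noise—is given below; the synthetic image, offsets, and window size are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(6)

img = rng.normal(100.0, 5.0, size=(64, 64))

# Striping noise: every 4th scan line gets an extra offset.
img[::4, :] += 20.0
# Isolated noise: a few pixels with wildly wrong values.
idx = rng.integers(0, 64, size=(10, 2))
img[idx[:, 0], idx[:, 1]] = 255.0

# Destriping: shift each scan line so its mean matches the global mean.
line_means = img.mean(axis=1, keepdims=True)
destriped = img - (line_means - img.mean())

# Isolated-noise removal: 3 x 3 median filter.
cleaned = median_filter(destriped, size=3)

print("line-mean spread before/after:",
      line_means.std().round(2), cleaned.mean(axis=1).std().round(2))
```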
The graphic and image cleaning of geometric deformation corrects image matching errors, which chiefly refer to the relative displacement or shape change between images, mainly the displacement parallax in the x direction and in the y direction. If the left and right images are resampled along the epipolar lines, there is no vertical parallax in the corresponding epipolar lines. Assuming the radiometric distortion has been corrected, the geometric deformation of a pixel is mainly the displacement p in the x direction, and the gray-scale functions g1(x, y) and g2(x, y) of the left and right images should satisfy
g1 (x, y) + n1 (x, y) = g2 (x + p, y) + n2 (x, y) (4.47)
where, n1 (x, y), n2 (x, y) respectively are the random noises of the left and right
images. Its error equation is
v(x, y) = g_2(x', y') - g_1(x, y) = g_2(x + p, y) - g_1(x, y)   (4.48)
For any point P(x_0, y_0) that falls in the parallactic grid cell (i, j) of column i and row j, according to the bilinear finite element interpolation method, the parallactic value p_0 of p can be calculated by bilinearly interpolating the parallaxes p_{i,j}, p_{i+1,j}, p_{i,j+1}, and p_{i+1,j+1} of the four vertexes P(x_i, y_j), P(x_{i+1}, y_j), P(x_i, y_{j+1}), and P(x_{i+1}, y_{j+1}) of the cell. That is,

p_0 = \frac{p_{i,j}(x_{i+1}-x_0)(y_{j+1}-y_0) + p_{i+1,j}(x_0-x_i)(y_{j+1}-y_0) + p_{i,j+1}(x_{i+1}-x_0)(y_0-y_j) + p_{i+1,j+1}(x_0-x_i)(y_0-y_j)}{(x_{i+1}-x_i)(y_{j+1}-y_j)}   (4.49)
where x_i ≤ x_0 ≤ x_{i+1} and y_j ≤ y_0 ≤ y_{j+1}. Substituting Eq. (4.49) into error Eq. (4.48), linearizing, and solving yields the parallax values at the regular grid points P(i, j), from which the parallactic grid can be formed to correct the geometric deformation.
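A minimal sketch of the bilinear interpolation of Eq. (4.49) is shown below; the grid coordinates and parallax values are invented for illustration.

```python
def bilinear_parallax(x0, y0, xi, xi1, yj, yj1, pij, pi1j, pij1, pi1j1):
    """Eq. (4.49): bilinear interpolation of the parallax inside one grid cell."""
    num = (pij * (xi1 - x0) * (yj1 - y0)
           + pi1j * (x0 - xi) * (yj1 - y0)
           + pij1 * (xi1 - x0) * (y0 - yj)
           + pi1j1 * (x0 - xi) * (y0 - yj))
    return num / ((xi1 - xi) * (yj1 - yj))

# Example: a 10 x 10 pixel cell with parallaxes given at its four corners.
p0 = bilinear_parallax(x0=3.0, y0=7.0, xi=0.0, xi1=10.0, yj=0.0, yj1=10.0,
                       pij=1.0, pi1j=2.0, pij1=1.5, pi1j1=2.5)
print(p0)
```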
Fig. 4.6 The aerial images before and after cleaning. a Original image. b Cleaned image
To better understand spatial graphic and image cleaning, this section uses an aerial image of a district in Nanning city, Guangxi province as an example (Fig. 4.6a) and cleans it with the above-mentioned methods. This image was taken in 1998 at a scale of 1:2000. Figure 4.6b shows the result of the image data cleaning.
Comparing Fig. 4.6a, b, it can be concluded that the original aerial image before cleaning was strongly affected by radiometric distortion, appearing blurred and difficult to read, with objects hard to recognize and geometric deformation occurring on the left side. The aerial image after cleaning is clearer, its readability is enhanced, and the objects are easier to recognize. The most obvious example is the square in the right corner of the image, where the fog present in the original aerial image was removed by the cleaning.
References
Burrough PA, Frank AU (eds) (1996) Geographic objects with indeterminate boundaries. Taylor
and Francis, Basingstoke
Canters F (1997) Evaluating the uncertainty of area estimates derived from fuzzy land-cover classification. Photogramm Eng Remote Sens 63:403–414
Dasu T (2003) Exploratory data mining and data cleaning. Wiley, New York
Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) (1996) Advances in knowledge
discovery and data mining. AAAI/MIT, Menlo Park, pp 1–30
Goodchild MF (2007) Citizens as voluntary sensors: spatial data infrastructure in the world of
Web 2.0. Int J Spat Data Infrastruct Res 2:24–32
Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge
problem. Data Min Knowl Disc 2:1–31
Inmon WH (2005) Building the data warehouse, 4th edn. Wiley, New York
Kim W et al (2003) A taxonomy of dirty data. Data Min Knowl Disc 7:81–99
Koperski K (1999) A progressive refinement approach to spatial data mining. Ph.D. Thesis,
Simon Fraser University, British Columbia
Li DR, Yuan XX (2002) Error handling and reliability theory. Wuhan University Press, Wuhan
Shi WZ, Fisher PF, Goodchild MF (eds) (2002) Spatial data quality. Taylor & Francis, London
Smets P (1996) Imperfect information: imprecision and uncertainty. Uncertainty management in
information systems. Kluwer Academic Publishers, London
Smithson MJ (1989) Ignorance and uncertainty: emerging paradigms. Springer, New York
Wang XZ (2002) Parameter estimation with nonlinear model. Wuhan University Press, Wuhan
Wang ZZ (2007) Sequel to principles of photogrammetry. Wuhan University Press, Wuhan
Wang SL, Shi WZ (2012) Chapter 5 data mining, knowledge discovery. In: Kresse W, Danko D
(eds) Handbook of geographic information. Springer, Berlin, pp 123–142
Wang SL, Wang XZ, Shi WZ (2002) Spatial data cleaning. In: Zhang S, Yang Q, Zhang C (eds)
Proceedings of the first international workshop on data cleaning and preprocessing, Maebashi
TERRSA, Maebashi City, 9–12 Dec, pp 88–98
Chapter 5
Methods and Techniques in SDM
Crisp set theory, which Cantor introduced in the nineteenth century, is the basis
of modern mathematics. Probability and spatial statistics classically target the ran-
domness in SDM. Evidence theory is an expansion of probability theory; spatial
clustering and spatial analysis are extensions of spatial statistics. The data field
concept will be introduced in Chap. 6.
5.1.1 Probability Theory
Probability theory (Arthurs 1965) is suitable for SDM with randomness on the
basis of stochastic probabilities in the context of adequate samples and back-
ground information. Probability theory originated from research by European
theory with rough set in the system of probabilistic rough classifiers generation
(ProbRough) by applying conditional attributes to deduce decision knowledge.
Shi Wenzhong proposed a probability vector in which different probabilities are used to show the distribution of spatial data uncertainty (Shi and Wang 2002). In the probability field model, the probability of a categorical variable is the probability that the class observed at one location is positive (Zhang and Goodchild 2002). With the decision-tree-based probability graphical model, Frasconi et al. (1999) mined a database with graphic attributes, and the discovered knowledge was used to supervise machine learning. Generally, the probability sample is used to carry out spatial
analysis for the attributes of regional soil. For an area without probability samples,
Brus (2000) designed a method using regressively interpolating probability sam-
ples to assess the general pattern of soil properties. According to the raster-struc-
tured land-cover classification model, evaluating how the uncertainty of input data
impact the output results, Canters (1997) put forward the probability-based image
membership vector that all the elements in one pixel element would affect and
decide the category of the pixel together. Taking advantage of supervised machine
learning, Sester (2000) supervised the interpretation of spatial data from a data-
base with a given sample space automatically. In addition, Journel (1996) studied
the uncertainty of random images.
5.1.2 Evidence Theory
Based on evidence theory, spatial knowledge can be extracted from a dataset with more than one uncertain attribute. In addition, the comparative method also can be used
in knowledge discovery with uncertain attributes (Chrisman 1997). The framework
of evidence data mining (EDM) has two parts: mass functions and mass operators
(Yang et al. 1994). The mass function represents the data and knowledge—that
is, the data mass function and the rule mass function; several of the mass func-
tion operators are for knowledge discovery. EDM has been used to discover strong
rules in relational databases, as well as to identify volcanoes from a large number
of images of the Venus surface. The EDM framework makes the process of knowl-
edge discovery equivalent to the operation of the mass function for data and rules.
The methods based on evidence theory can easily deal with a null value or a miss-
ing attribute value and incorporate domain-knowledge in the course of knowledge
discovery. At the same time, the essentials of the algorithms are parallel, which
is advantageous when dealing with parallel, distributed, and heterogeneous data-
bases. Evidence theory’s advantages when dealing with uncertainty have potential
applications in SDM.
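A minimal sketch of the central EDM operator—Dempster's rule of combination for two mass functions over a small frame of discernment—is given below; the frame (volcano versus ridge, echoing the Venus example) and the mass values are invented for illustration.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dictionaries."""
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    # Normalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Frame of discernment for a pixel class: {volcano, ridge}.
V, R = frozenset({"volcano"}), frozenset({"ridge"})
VR = V | R
m_texture = {V: 0.6, R: 0.1, VR: 0.3}   # evidence from image texture
m_shape = {V: 0.5, R: 0.2, VR: 0.3}     # evidence from shape analysis

print(dempster_combine(m_texture, m_shape))
```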
5.1.3 Spatial Statistics
5.1.4 Spatial Clustering
Spatial clustering groups a set of data in a way that maximizes the similar-
ity within clusters and minimizes the similarity between two different clusters
(Han et al. 2000). It sorts through spatial raw data and groups them into clusters
based on object characteristics. Without background knowledge, spatial cluster-
ing can directly find meaningful clusters in a spatial dataset on the basis of the
object attribute. To discover spatial distribution laws and typical models for the
entire dataset, spatial clustering, by using the measurement of a certain distance or
similarity, is able to delineate clusters or dense areas in a large-scale multidimen-
sional spatial dataset, which is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in other clusters
(Kaufman and Rousseeuw 1990; Murray and Shyy 2000). As a branch of statistics, clustering is mainly based on geometric distances, such as the Minkowski distance, Manhattan distance, Euclidean distance, Chebyshev distance, Canberra distance, and Mahalanobis distance. Different clustering distances may show some objects
close to one another according to one distance but farther away according to the
other distance (Ester et al. 2000). Derived from the matching matrix, some meas-
urements therefore are given to compare various clustering results when different
clustering algorithms perform on a set of data. Concept clustering is used when the
similarity measurement among data objects is defined based on a concept descrip-
tion instead of a geometric distance (Grabmeier and Rudolph 2002).
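A minimal sketch computing several of these clustering distances for two spatial feature vectors is shown below; the vectors and the Minkowski order are assumptions for illustration.

```python
import numpy as np

a = np.array([3.0, 7.5, 120.0])
b = np.array([4.5, 6.0, 90.0])

manhattan = np.sum(np.abs(a - b))
euclidean = np.sqrt(np.sum((a - b) ** 2))
chebyshev = np.max(np.abs(a - b))
minkowski3 = np.sum(np.abs(a - b) ** 3) ** (1 / 3)       # Minkowski distance of order 3
canberra = np.sum(np.abs(a - b) / (np.abs(a) + np.abs(b)))

print(manhattan, euclidean, chebyshev, minkowski3, canberra)
```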
Spatial clustering is different from classification in SDM. It discovers signifi-
cant cluster structures from spatial databases without background knowledge. The
clustering algorithms for pattern recognition group the data points in a multidimensional feature space, whereas spatial clustering directly clusters graphs or images, whose shapes are complex and whose numbers are large. If multivariate statistical analysis is used in this case, the clustering will be inefficient and slow. Therefore, the clustering algorithms in SDM should be able to handle point, line, polygon, or arbitrarily shaped objects, along with self-adapted or user-defined parameters.
5.1.5 Spatial Analysis
Spatial analysis consists of the general methods that analyze spatial topology
structure, overlaying objects, buffer zones, image features, distance measurements,
etc. (Haining 2003). Exploratory data analysis uncovered the non-visual charac-
teristics and exceptions by using dynamic statistical graphics and links to dem-
onstrate data characteristics (Clark and Niblet 1987). Inductive learning, which is
exploratory spatial analysis (Muggleton 1990) combined with attribute-oriented
induction (AOI), explores spatial data to determine the preliminary characteristics
for SDM. Image analysis can directly explore a large number of images or can
perform as a preprocess phase of other knowledge discovery methods.
Reinartz (1999) addressed regional SDM, and Wang et al. (2000) introduced
an approach to active SDM based on statistical information. Taking a spatial point
as a basic unit, Ester et al. (2000) integrated SDM algorithms and a spatial data-
base management system by processing the neighborhood relationships of many
objects. Spatial patterns were discovered with neighborhood graphs, paths, and a
small set of database primitives for their manipulation, along with neighborhood
indices to speed up the primitive’s processing. To reduce the search space for the
SDM algorithms while searching for significant patterns, the filtering process
was defined so that only certain classes of paths “leading away” from a starting
object were relevant. When diagnosing attribute exceptions from possible causal relations, Mouzon et al. (2001) found that attribute uncertainty influenced the exception diagnosis via the consistency algorithm and the inductive algorithm.
5.2.1 Fuzzy Sets
A fuzzy set (Zadeh 1965) is used in SDM with fuzziness on the basis of a fuzzy
membership function that depicts an uncertain probability (Li et al. 2006).
Fuzziness is an objective existence. The more complicated the system is, the more
difficult it is to describe it accurately, which creates more fuzziness. In SDM, a
class and an object are treated as a fuzzy set and an element, respectively. Each
object is assigned a membership in the class. If there are many objects with inter-
mediate boundaries in the universe of interest, an object may be assigned a group
of memberships. The closer the membership approximates 1, the more it is pos-
sible that the element belongs to the class. For instance, if there are soil, river,
and vegetation in an image, a pixel will be assigned three memberships. The class
uncertainty mostly derives from subjective supposition and objective vagueness.
Without a precisely defined boundary, a fuzzy set is more suitable for fuzzy clas-
sification with spatial heterogeneous distribution of geographical uncertainty. To
classify remote-sensing images, fuzzy classification can produce different inter-
mediate results according to the classifier. For example, in statistical classifiers,
there are likelihood values of a certain pixel belonging to the alternative classes;
in neural network classifiers, there are class activation level values (Zhang and
Goodchild 2002). Burrough and Frank (1996) presented a fuzzy Boolean logic
model of uncertain data. Canters (1997) evaluated uncertain rules of estimating an
area in fuzzy land cover classification. Wang and Wang (1997) proposed a fuzzy
comprehensive method that combined fuzzy comprehensive evaluation with fuzzy
cluster analysis for land price evaluation. Vazirgiannis and Halkidi (2000) used
fuzzy logic to deal with the uncertainty in SDM.
Fuzzy comprehensive evaluation and fuzzy clustering analysis are the two essential techniques in fuzzy sets. When they are integrated into SDM, the knowledge is discovered well (Wang and Wang 1997) in the following steps (a minimal sketch follows the list):
1. A fuzzy set acquires the fuzzy evaluation matrix for each influential factor.
2. All of the fuzzy evaluation matrices multiply the corresponding weight matri-
ces, the product matrix of which is the comprehensive matrix of all factors.
3. The comprehensive matrix is further used to create a fuzzy similar matrix, on
the basis of which a fuzzy equivalent matrix is obtained.
4. Fuzzy clustering is implemented via the proposed maximum remainder
algorithms.
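A minimal sketch of steps 1–3 is given below; the influential factors, their weights, and the use of cosine similarity to build the fuzzy similar matrix are assumptions for illustration (the maximum remainder clustering of step 4 is not reproduced here).

```python
import numpy as np

# Step 1 (assumed data): fuzzy evaluation matrices for two influential factors.
# Rows are land parcels, columns are grades (high, medium, low); entries are memberships.
eval_location = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6],
                          [0.4, 0.4, 0.2]])
eval_transport = np.array([[0.6, 0.3, 0.1],
                           [0.2, 0.5, 0.3],
                           [0.3, 0.5, 0.2]])

# Step 2: multiply each evaluation matrix by its factor weight and combine them.
weights = {"location": 0.6, "transport": 0.4}
comprehensive = weights["location"] * eval_location + weights["transport"] * eval_transport

# Step 3: build a fuzzy similarity matrix between parcels (here via cosine similarity).
normed = comprehensive / np.linalg.norm(comprehensive, axis=1, keepdims=True)
similarity = normed @ normed.T

print("comprehensive matrix:\n", comprehensive.round(2))
print("fuzzy similarity matrix:\n", similarity.round(2))
```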
However, by the time a fuzzy membership is determined, the subsequent calcu-
lation has abandoned the fuzziness that should be propagated to the final results
continuously. Furthermore, because it focuses only on fuzzy uncertainty, fuzzy set
may become invalid when there is more than one uncertainty in SDM (i.e., ran-
domness and fuzziness).
5.2.2 Rough Sets
A rough set is used in SDM with incompleteness via lower and upper approxima-
tions (Pawlak 1991). It is an incompleteness-based reasoning method that creates
a decision-making table by characterizing both the certainties and uncertainties. A rough set consists of an upper approximation set and a lower approximation set, and the standard for classifying objects is whether or not the information is sufficient
in a given universe. Objects in the lower approximation set that have the neces-
sary information definitely belong to the class, and objects within the universe but
outside of the upper approximation set and without the necessary information defi-
nitely do not belong to the class. The difference set between the upper approxima-
tion set and the lower approximation set is the indeterminate boundary, in which
the objects have insufficient information to determine whether or not they belong
to the class. If two objects have exactly the same information, they are equivalent
and cannot be distinguished from one another. Depending on whether statistical
information is used, the existent rough set model may be algebraic and probabilistic
(Yao et al. 1997). The basic unit of rough set is the equivalent class, as a grid in ras-
ter data, a point in vector data, or a pixel in an image. The more detailed the equiva-
lent class is classified, the more accurately that a rough set can describe the objects,
but at the expense of larger storage space and computing time (Ahlqvist et al.
2000). During its application in the field of approximate reasoning, machine learn-
ing, artificial intelligence, pattern recognition, and knowledge discovery, a rough
set becomes more sophisticated and accomplished from the initial qualitative analy-
sis (creating a minimum decision-making set) to the current focus on both quali-
tative analysis and quantitative calculation (computing rough probability, rough
function and rough calculus). There is also a further interdisciplinary relationship,
such as a rough fuzzy set, rough probabilistic set, and rough evidence theory.
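A minimal sketch of the lower and upper approximations computed from equivalence classes of an attribute table is shown below; the parcels, attributes, and target concept are invented for illustration.

```python
from collections import defaultdict

# Hypothetical attribute table: parcel -> (soil type, slope class).
attributes = {
    "p1": ("clay", "flat"), "p2": ("clay", "flat"),
    "p3": ("sand", "steep"), "p4": ("sand", "steep"),
    "p5": ("loam", "flat"),
}
# Target concept: parcels labelled suitable for construction.
suitable = {"p1", "p2", "p3", "p5"}

# Equivalence classes: parcels that are indiscernible by their attribute values.
classes = defaultdict(set)
for parcel, attrs in attributes.items():
    classes[attrs].add(parcel)

lower = set().union(*(c for c in classes.values() if c <= suitable))
upper = set().union(*(c for c in classes.values() if c & suitable))

print("lower approximation:", sorted(lower))   # definitely suitable
print("upper approximation:", sorted(upper))   # possibly suitable
print("boundary region:", sorted(upper - lower))
```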
Rough set-based SDM is a process of an intelligent decision-making analysis
that can deal with inaccurate and incomplete information in spatial data. In SDM,
a rough set may analyze attribute importance, attribute table consistency, and attrib-
ute reliability in spatial databases. The influences of attribute reliability on decision-
making can be determined to simplify the spatial data, the attribute table, and the
attribute dependence. By evaluating the absolute and relative uncertainty in the
decision-making algorithm, the paradigm and causal relationship may be ascer-
tained to generate the minimum decision-making and classification. To determine
the patterns in the decision-making table, the soft computing of rough sets can sim-
plify the attributes and regionalize the attribute values by deleting redundant rules
(Pal and Skowron 1999). A rough set has been employed in applications such as
distinguishing inaccurate spatial images and object-oriented software assessment;
describing the uncertainty model; assessing land planning in a suburban area;
selecting a bank location based on attribute simplification; roughly classifying
remote-sensing images together with fuzzy membership functions, rough neighbor-
hood, and rough precision; and generating the minimum decision-making knowl-
edge of national agricultural data. These methods and applications in data mining
have been summarized in the literature, and data mining systems based on rough sets also have been developed (Polkowski and Skowron 1998a, b; Ahlqvist et al. 2000; Polkowski et al. 2000).
At present, the applicability of a rough set for the above uses has yet to be
proven definitively. In addition, a rough set is not a specialized theory that origi-
nated from SDM. It provides the derivative topological relationships of the con-
cept definition but does not address spatial relationships by undertaking actions
such as superimposing multiple layers. As a result, it is necessary to further
improve a rough set in the field of geospatial information sciences, such as geo-
rough space (Wang et al. 2002).
5.3 Bionic Method
Artificial neural networks (ANNs) and genetic algorithms are the typical bionic
methods in SDM. The first simulates the neural network of a human brain to dis-
cover the networked patterns by using a self-adaptive nonlinear dynamic system,
whereas the second imitates the process of biological evolution to determine the
optimal solutions by using selection, crossover, and mutation as basic operators.
5.3.1 Artificial Neural Networks
ANNs reason about nonlinear systems against a blurred background (Miller et al.
1990). Rather than explicitly stating the distinctive characteristics of a specific analysis
under specific conditions, ANNs express that information implicitly, which is why they
are also called the connectionist method in artificial intelligence. Compared to traditional methods,
ANNs afford high fault tolerance and robustness by reducing noise disturbance in
pattern recognition. Moreover, the self-organizing and self-adapting capabilities of
neural networks greatly relax the constraints on the data, and their classification is
more accurate than symbolic classification. Some commercial remote-
sensing image processing software now offers neural network classification modules.
Neural networks also can be used to classify, cluster, and forecast GIS data.
However, given a complex nonlinear system with many input variables, the
convergence, stability, local minimum, and parameter adjustments of the network
may encounter problems (Lu et al. 1996). For example, it is difficult to incorporate
professional knowledge when defining network parameters (e.g., the number of
neurons in the middle layer) and training parameters (e.g., learning rate and error
threshold). Neural network classification requires scanning training data more
often and hence takes more time. In addition, the discovered knowledge is hid-
den in the network structure instead of explicit rules. For intermediate results, it is
unnecessary to convert the discovered knowledge into more refined rules, and the
network structure itself can be treated as a kind of knowledge representation at this
stage. However, at the final results stage of SDM, the discovered knowledge is not
easily understood and explained for decision-making.
To overcome this weakness, the forward ANN algorithm can be improved to
avoid slow training speeds, long learning times, and possible local minima
in the process of data mining by adding training data and hidden nodes at each
step (Lee 2000). Lu et al. (1996) introduced NeuroRule to extract classification rules
from neural networks via network training, network pruning, and rule extraction.
Network training is similar to general back-propagation (BP) network training. Network pruning aims
at deleting redundant nodes and connections to obtain a concise network without
increasing the error rate of classification. When extracting the rules, the activa-
tion values of the hidden units first are changed into a few discrete values without
reducing classification accuracy. Then, the rules can be obtained according to the
interdependences between the outputs and the hidden node values and between the
activation values of the hidden units and the inputs. Taking full advantage of the
capabilities of ANNs and using them in combination with other methods to over-
come their shortcomings will facilitate their wider use in SDM.
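As a minimal illustration of the classification use sketched above (not part of the original text), the following Python fragment trains a small feed-forward network on synthetic attribute records; it assumes scikit-learn is available and is only a generic classifier, not the NeuroRule procedure itself.

```python
# Minimal sketch: training a small feed-forward ANN to classify spatial
# attribute records (synthetic data; scikit-learn assumed available).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # e.g. elevation, slope, NDVI, distance to road
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic two-class label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer; the number of hidden neurons and the learning rate are
# exactly the kind of parameters the text says are hard to choose a priori.
clf = MLPClassifier(hidden_layer_sizes=(8,), learning_rate_init=0.01,
                    max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```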
5.3.2 Genetic Algorithms
A genetic algorithm (GA) uncovers optimal solutions for classification, clustering,
and prediction in SDM (Buckless and Petry 1994) by simulating in software the
evolutionary process of "survival of the fittest" in natural selection, with selection,
crossover, and mutation as the genetic operators. In the context of biological
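The following is a minimal, generic sketch of the selection, crossover, and mutation loop on a toy fitness function (maximizing the number of 1-bits in a binary string); it illustrates the basic operators only and is not a GA formulation specific to SDM.

```python
# Toy genetic algorithm: maximize the number of 1-bits in a binary string,
# using selection, crossover, and mutation as the basic operators.
import random

random.seed(0)
LENGTH, POP, GENERATIONS = 20, 30, 50

def fitness(ind):
    return sum(ind)

def select(pop):                         # tournament selection
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                   # single-point crossover
    point = random.randrange(1, LENGTH)
    return p1[:point] + p2[point:]

def mutate(ind, rate=0.02):              # bit-flip mutation
    return [1 - g if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```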
5.4 Others
5.4.1 Rule Induction
Rule induction summarizes spatial data to uncover patterns among spatial datasets
and to acquire high-level patterns in the form of a concept tree, such as a GIS
attribute tree or a spatial relationship tree, under a certain background. It focuses on
searching for generic rules in large amounts of empirical data. The background
knowledge can be provided by users or extracted automatically from the datasets as
one of the SDM tasks. When reasoning about rules, induction differs from deduction
and commonsense reasoning: induction reasons on the basis of a large number of
statistical facts and instances, deduction reasons from axioms, and commonsense
reasoning from acknowledged knowledge (Clark and Niblet 1987). As an extension of
deductive and inductive reasoning, a decision rule expresses the relevance among all
or part of the data in a database; its condition is the premise of the induction, and its result is the
conclusion of induction. Decision rule reasoning under rough sets makes proper
or similar decisions according to the condition. Decision rules include association
rules (if customers buy milk, then they will also buy bread 42 % of the time),
sequential rules (an equipment item with one failure will also exhibit another
failure 65 % of the time within one month), and serial rules (stock A and stock B
share a similar law of fluctuation within a season).
Association rules in SDM may be common or strong. Unlike common association
rules, strong association rules are applied more widely and frequently and carry
more profound significance. Algorithms for association rules mainly focus on two
aspects: increasing efficiency and discovering more rules. For example, the SQL-like
language that SDM uses to describe the mining process is in accordance with the
international standard query language, thereby making SDM somewhat standardized
and engineering-oriented. In the model of association rules, temporal information can
be added to describe their time-effectiveness. Records of the same kind may be
amalgamated according to the time interval between data records and the item
category of adjacent records. When association rules cannot describe the patterns in
datasets, transfer rules may depict the system state transferred from the present
moment to the next moment with a certain probability; the state of the next moment
depends on the previous state and the transfer probability. Attribute-oriented induction
(AOI) is well suited to data classification (Han et al. 1993). Learning
from examples, also known as supervised learning, classifies the training samples
beforehand in the context of the background knowledge. The selection of sample
features has a direct impact on the effectiveness of its learning efficiency, results
expression, and process intelligibility. On the other hand, unsupervised learning
does not know the classification of the training samples in advance, and therefore
learns from observation and discovery. The sequential rule is closely related to time;
the time interval constraint between adjacent itemsets can be used to extend the
discovery of sequential patterns from a single-layer concept to a multi-layer concept,
proceeding level by level from top to bottom. When the database changes slightly,
the progressive discovery of sequential rules can use the previous results to
accelerate the current mining process.
Koperski (1999) discovered strong association rules in geographical databases
with a two-step optimization. Han et al. (2000) uncovered association rules with
two parallel Apriori algorithms, such as intelligent data distribution and hybrid
distribution. Eklund et al. (1998) discovered association rules for environmental
planning and the monitoring of secondary soil salinization by integrating decision
support systems with GIS data in the analysis of soil salinity. Aspinall
and Pearson (2000) discovered association rules for protecting environments in
Yellowstone National Park by comprehending the ecological landscape, GIS, and
environmental models together. Clementini et al. (2000) explored association rules
in spatial objects with broad boundaries at different levels. Levene and Vincent
(2000) discovered the rules of functional dependency and inclusion dependency in
relational databases.
5.4.2 Decision Trees
5.4.3 Visualization Techniques
Visualization explores and represents spatial data and knowledge visually in SDM. A
visible picture is worth thousands of invisible words because SDM involves massive
data, complex models, unintelligible programs, and professional results. By develop-
ing a computerized system, visualization converts the abstract patterns, relations, and
tendencies of the spatial data to concrete visual representations that can be directly
sensed by human vision (Slocum 1999). There are many techniques that make
the process and results of SDM easily visible and understandable by users, such
as charts, pictures, maps, animations, graphics, images, trees, histograms, tables,
multi-media, and data cubes (Soukup and Davidson 2002). The visual expression and
analysis of spatial data can guide discovery operations, data location, result
representation, and pattern evaluation; it allows users to explore spatial data more
clearly and reduces the complexity of modeling. If multi-dimensional visualization is
processed in time according to the respective sequence, a comprehensive animation
can be created to reflect the process and knowledge of SDM. Ankerst et al. (1999)
produced a 3D-shaped histogram to express a similar search and classification in a
spatial database. Maceachren et al. (1999) structuralized the knowledge from multi-
source spatiotemporal datasets by combining geographical visualization with SDM.
5.5 Discussion
This chapter presented many methods and techniques that are usable in SDM,
including crisp set theory, extended set theory, bionic methods, and others.
5.5.1 Comparisons
Crisp set theory, probability theory, and spatial statistics address the randomness in
SDM. Evidence theory extends probability theory, and spatial clustering and spatial
analysis are extensions of spatial statistics. Within extended set theory, the fuzzy set
focuses on fuzziness and the rough set emphasizes incompleteness. In terms of
mathematical basis, probability theory and spatial statistics rest on random probability,
the rough set on approximation, the fuzzy set on fuzzy membership, evidence theory
on the evidence function, and the cloud model on the probability distribution of
membership. Rule induction operates under certain background knowledge, whereas
clustering algorithms operate without background knowledge. For classification,
clustering, and prediction, an ANN implements a self-adaptive nonlinear dynamic
system, and a GA determines the optimal solutions. A decision tree produces rules
according to different features to express a classification or decision set in the form
of a tree structure. Visualization extracts the patterns from spatial data in the form of
visual representations. Online SDM, in turn, is built on a multidimensional view; it
serves both to verify data mining results and as a network-based analysis tool that
emphasizes operational efficiency and immediate response to user commands, with
spatial data warehouses as its direct data source.
In addition, SDM not only develops and completes its own methods and tech-
niques, but also learns and absorbs the proven methods from other areas, such
as data mining, database systems, artificial intelligence, machine learning, geo-
graphic information systems, remote sensing, etc.
5.5.2 Usability
Approaches generally are not isolated from one another; SDM usually employs more
than one approach in order to obtain more diverse, more accurate, and more reliable
results for different requirements and degrees of sophistication (Reinartz 1999).
Increasing the complexity of the sys-
tem are the skyrocketing amount of spatial data, continual refinements to spatial
rules, and the higher accuracy demanded by users. SDM therefore requires con-
stant refinement to keep the system relevant and effective. For example, SDM
provides limited types of functions, such as spatial rules, algorithms, visualiza-
tion modeling, and parallel computation. Therefore, the expansion of its functions
would be a useful undertaking. SDM’s efficiency when dealing with large amounts
of data, its capability to resolve complicated problems, and the extensibility of the
system can be improved.
Moreover, because SDM is an interdisciplinary subject, many factors need
to be taken into consideration. In practical applications, the theory, method, and
tools of SDM should be chosen according to the particular requirements of an
References
Ahlqvist Q, Keukelaar J, Oukbir K (2000) Rough classification and accuracy assessment. Int J
Geogr Inf Sci 14(5):475–496
Ankerst M et al (1999) 3D shape histograms for similarity search and classification in spatial
databases. Lecture Notes in Computer Science, 1651, pp 207–225
Arthurs AM (1965) Probability theory. Dover Publications, London
Aspinall R, Pearson D (2000) Integrated geographical assessment of environmental condition in
water catchments: linking landscape ecology, environmental modeling and GIS. J Environ
Manage 59:299–319
Brus DJ (2000) Using nonprobability samples in design-based estimation of spatial means of soil
properties. In: Proceedings of accuracy 2000, Amsterdam, pp 83–90
Buckless BP, Petry FE (1994) Genetic algorithms. IEEE Computer Press, Los Alamitos
Burrough PA, Frank AU (eds) (1996) Geographic objects with indeterminate boundaries. Taylor
& Francis, Basingstoke
Canters F (1997) Evaluating the uncertainty of area estimates derived from fuzzy land-cover
classification. Photogram Eng Remote Sens 63:403–414
Chrisman NC (1997) Exploring geographic information systems. Wiley, New York
Clark P, Niblet TT (1987) The CN2 induction algorithm. Mach Learn 3:261–283
Clementini E, Felice PD, Koperski K (2000) Mining multiple-level spatial association rules for
objects with a broad boundary. Data Knowl Eng 34:251–270
Cressie N (1991) Statistics for spatial data. Wiley, New York
Eklund PW, Kirkby SD, Salim A (1998) Data mining and soil salinity analysis. Int J Geogr Inf
Sci 12(3):247–268
Ester M et al (2000) Spatial data mining: databases primitives, algorithms and efficient DBMS
support. Data Min Knowl Disc 4:193–216
Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) (1996) Advances in knowledge
discovery and data mining. AAAI/MIT, Menlo Park, CA, pp 1–30
Frasconi P, Gori M, Soda G (1999) Data categorization using decision trellises. IEEE Trans
Knowl Data Eng 11(5):697–712
Gallant SI (1993) Neural network learning and expert systems. MIT Press, Cambridge
Grabmeier J, Rudolph A (2002) Techniques of clustering algorithms in data mining. Data Min
Knowl Disc 6:303–360
Haining R (2003) Spatial data analysis: theory and practice. Cambridge University Press,
Cambridge
Han JW, Cai Y, Cercone N (1993) Data driven discovery of quantitative rules in relational data-
bases. IEEE Trans Knowl Data Eng 5(1):29–40
Han EHS, Karypis G, Kumar V (2000) Scalable parallel data mining for association rules. IEEE
Trans Knowl Data Eng 12(3):337–352
Howard CM (2001) Tools and techniques for knowledge discovery. PhD thesis. University of
East Anglia, Norwich
Jiang W et al (2001) Bridging the information gap: computational tools for intermediate resolu-
tion structure interpretation. J Mol Biol 308:1033–1044
Journel AG (1996) Modelling uncertainty and spatial dependence: stochastic imaging. Int J
Geogr Inf Sys 10(5):517–522
Kaufman L, Rousseew PJ (1990) Finding groups in data: an introduction to cluster analysis.
Wiley, New York
Koperski K (1999) A progressive refinement approach to spatial data mining. PhD thesis, Simon
Fraser University, British Columbia
Lee ES (2000) Neuro-fuzzy estimation in spatial statistics. J Math Anal Appl 249:221–231
Levene M, Vincent MW (2000) Justification for inclusion dependency normal form. IEEE Trans
Knowl Data Eng 12(2):281–291
Li DR, Wang SL, Li DY (2006) Theory and application of spatial data mining, 1st edn. Science
Press, Beijing
Li DR, Wang SL, Li DY, Wang XZ (2002) Theories and technologies of spatial data mining and
knowledge discovery. Geomatics Inf Sci Wuhan Univ 27(3):221–233
Lu H, Setiono R, Liu H (1996) Effective data mining using neural networks. IEEE Trans Knowl
Data Eng 8(6):957–961
Maceachren AM et al (1999) Constructing knowledge from multivariate spatiotemporal data:
integrating geographical visualization with knowledge discovery in database methods. Int J
Geogr Inf Sci 13(4):311–334
Marsala C, Bigolin NM (1998) Spatial data mining with fuzzy decision trees. In: Ebecken NFF
(ed) Data mining. WIT Press/Computational Mechanics Publications, Ashurst Lodge, UK, pp
235–248
Miller WT et al (1990) Neural network for control. MIT Press, Cambridge
Mouzon OD, Dubois D, Prade H (2001) Using consistency and abduction based indices in pos-
sibilistic causal diagnosis. IEEE, pp 729–734
Muggleton S (1990) Inductive acquisition of expert knowledge. Turing Institute Press in associa-
tion with Addison-Wesley, Wokingham
Murray AT, Shyy TK (2000) Integrating attribute and space characteristics in choropleth display
and spatial data mining. Int J Geogr Inf Sci 14(7):649–667
Pal SK, Skowron A (eds) (1999) Rough fuzzy hybridization. Springer, Singapore
Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic
Publishers, London
Polkowski L, Skowron A (eds) (1998a) Rough sets in knowledge discovery 1: methodologies and
applications. Studies in fuzziness and soft computing, vol 18. Physica-Verlag, Heidelberg
Polkowski L, Skowron A (eds) (1998b) Rough sets in knowledge discovery 2: applications,
case studies and software systems. Studies in fuzziness and soft computing, vol 19. Physica-
Verlag, Heidelberg
Polkowski L, Tsumoto S, Lin TY (eds) (2000) Rough sets methods and applications: new devel-
opments in knowledge discovery in information systems. Physica-Verlag, Heidelberg
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Reinartz T (1999) Focusing solutions for data mining: analytical studies and experimental results
in real-world domains. Springer, Berlin
Sester M (2000) Knowledge acquisition for the automatic interpretation of spatial data. Int J
Geogr Inf Sci 14(1):1–24
Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
Shi WZ, Wang SL (2002) GIS attribute uncertainty and its development. J Remote Sens
6(5):393–400
Slocum TA (1999) Thematic cartography and visualization. Prentice Hall, Upper Saddle River
Soukup T, Davidson I (2002) Visual data mining: techniques and tools for data visualization and
mining. Wiley, New York
Vazirgiannis M, Halkidi M (2000) Uncertainty handling in the data mining process with fuzzy
logic. In: IEEE, pp 393–398
Wang SL, Li DR, Shi WZ, Wang XZ (2002) Theory and application of Geo-rough space.
Geomatics Inf Sci Wuhan Univ 27(3):274–282
Wang XZ, Wang SL (1997) Fuzzy comprehensive method and its application in land grading.
Geomatics Inf Sci Wuhan Univ 22(1):42–46
Wang J, Yang J, Muntz R (2000) An approach to active spatial data mining based on statistical
information. IEEE Trans Knowl Data Eng 12(5):715–728
Yang JB, Madan G, Singh L (1994) An evidential reasoning approach for multiple-attribute deci-
sion making with uncertainty. IEEE Trans Syst Man Cybern 24(1):1–18
Yao YY, Wong SKM, Lin TY (1997) A review of rough set models. In: Lin Y, Cercone N (eds)
Rough sets and data mining analysis for imprecise data. Kluwer Academic Publishers,
London, pp 47–75
Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
Zeitouni K (2002) A survey of spatial data mining methods databases and statistics point of
views. In: Becker S (ed) Data warehousing and web engineering. IRM Press, London, pp
229–242
Zhang JX, Goodchild MF (2002) Uncertainty in geographical information. Taylor & Francis,
London
Chapter 6
Data Field
A field exists commonly in physical space. The concept of a data field, which
simulates the methodology of a physical field, is introduced in this chapter for
depicting the interaction between the objects associated with each data point of
the whole space. The field function mathematically models how the data contribu-
tions for a given task are diffused from the universe of a sample to the universe of
a population when interacting between objects. In the universe of discourse, all
objects with sampled data not only radiate their data contributions but also receive
data contributions from other objects. Depending on the given task and the physi-
cal nature of the objects in the data distribution, the field function may be derived
from the physical field. The equipotential lines depict the topological relationships of
interest among the objects and visually indicate their interaction characteristics. Thus,
a data field bridges the gap between a mining
model and a data model.
6.1 From a Physical Field to a Data Field
In physical space, interacting particles give rise to various fields. Depending on the
interacting particles, the field can be either a classical field or a quantum field. Depending on
the interacting range, the field may be long-range or short-range. As an interact-
ing range increases, the strength of the long-range field slowly diminishes to the
point of being undetectable (e.g., gravitational fields and electromagnetic fields),
while the strength of the short-range field rapidly attenuates to zero (e.g., nuclear
field). Based on the interacting transformation, the field is often classified as a sca-
lar field, vector field, tensor field, or spinor field. At each point, a scalar field’s
values are given by a single variable on the quantitative difference, a vector field
is specified by attaching a vector with magnitude and direction, a tensor field is
specified by a tensor, and a spinor field is for quantum field theory. A vector field
may be with or without sources. A vector field with constant zero divergence is a
field without sources (e.g., a magnetic field); otherwise, it is a field with sources
(e.g., an electrostatic field). In basic physics, the vector field is often studied as a
field with sources, such as Newton’s law of universal gravitation and Coulomb’s
law. When placed in a field with sources, an arbitrary particle feels a force of a certain
strength (Giachetta et al. 2009). Moreover, according to the values of a field vari-
able, whether or not it changes over time at each point, the field with sources can
be further grouped into time-varying fields and stable active fields. The stable field
with sources is also known as a potential field.
Field strength is often represented by using the potential. In a field, the poten-
tial difference between an arbitrary point and another referenced point is a deter-
minate value on a unit particle. The physical potential is the work performed by a
field force when a unit particle is moved from an arbitrary point to another refer-
enced point in the field. It is a scalar variable of the amount of energy transferred
by the force acting at a distance. The energy is the capacity to do the work, and
the potential energy refers to the ability of a system to do work by virtue of its
position or internal structure. The distribution of the potential field corresponds to
the distribution of the potential energy determined by the relative position between
interacted particles. Specifying the potential value at a point in space requires
parameters such as object mass or electrical charge, and distance between a point
and its field source, while it is independent of the direction of the distance vec-
tor. The strength of the field is visualized via the equipotential, which refers to a
region in space where every point in it is at the same potential.
In physics, it is a fact that a vacuum is free of matter but not free of field.
Modern physics even believes that field is one of the basic forms of particle exist-
ence, such as a mechanical field, nuclear field, gravitational field, thermal field,
electromagnetic field, and crystal field. As field theory developed, field was
abstracted as a mathematical concept to depict the distribution rule of a physical
variable or mathematical function in space (Landau and Lifshitz 2013). The poten-
tial function is often seen simply as a monotonic function of spatial location, hav-
ing nothing to do with the existence of the particle. That is, a field of the physical
variable or mathematical function in a space exists if every point in the space has
an exact value from the physical variable or mathematical function.
Data space can be regarded as a concrete counterpart of physical space. In a data
space, all objects mutually interact via data. From a generic physical space to a
concrete data space, the physical nature of the relationship between the objects is
identical to that of the interaction between the particles. If an object described with
data is taken as a particle with mass, then a data field can be derived from a physical field.
Inspired by the field in physical space, the field was introduced to illuminate
the interaction between objects in data space. In the data space, all objects are
mutually interacting via data. Given a logic database with N records and M attrib-
utes, the process of knowledge discovery can begin with the underlying distribu-
tion of the N data points in the M-dimensional universal space. If each data object
is viewed as a point charge or mass point, it will exert a force on any other objects
in its vicinity, and the interactions of all the data objects will form a field. Thus, an
object described with data is treated as a particle with mass. The data refer to the
attributes of the object determined from observation, while the mass is the physi-
cal variable of matter determined from its weight or from Newton’s second law of
motion. Surrounding the object, there is a virtual field simulating the physical field
around particles. Each object has its own field. Given a task, the field provides a
virtual medium that shows the mutual interaction between objects without their
touching each other. An object transmits its field to other objects and also receives all
of their fields. Depending on their contributions to the given task, the field strengths of
the objects differ, which in turn uncovers different interactions. Through this
interaction, objects may self-organize under the given data and task, and patterns that
are previously unknown, potentially useful, and ultimately understandable may be
extracted from the datasets.
Definition 6.1 In data space Ω ⊆ RP, let dataset D = {x1, x2, …, xn} denote a
P-dimensional independent random sample, where xi = (xi1, xi2, …, xiP)T with
i = 1, 2, …, n. Each data object xi is taken as a particle with mass mi. xi radiates
its data energy and is also influenced by others simultaneously. Surrounding xi, a
virtual field is derived from the corresponding physical field. Thus, the virtual field
is called a data field.
6.2.1 Necessary Conditions
In the context of the definition, a data field exists only if the following neces-
sary conditions are characterized (i.e., short-range with source and temporal
behavior):
1. The data field is a short-range field. For an arbitrary point in the data space
Ω, all the fields from different objects are overlapped to achieve a superposed
field. If there are two points—x1 with enough data samples and x2 with less or
no data samples—the summarized strength of the data field at point x1 must be
stronger than that of x2. That is, the range of the data field is so short that its
magnitude decreases rapidly to 0 as the distance increases. The rapid attenua-
tion may reduce the effect of noisy data or outlier data. The effect of the super-
posed field may further highlight the clustering characteristics of the objects in
close proximity within data-intensive areas. The potential value of an arbitrary
point in space does not rely on the direction from the object, so the data field
is isotropic and spherical in symmetry. The interaction between two faraway
objects can be ignored.
2. The data field is a field with sources. The divergence measures the magnitude
of a vector field at a given point in terms of a signed scalar. If it is positive,
it is called a source. If it is negative, it is called a sink. If it is zero in a domain,
there is no source or sink, and it is called solenoidal (Landau and Lifshitz
2013). The data field is smooth everywhere away from the original source. The
outward flux of a vector field through a closed surface is equal to the volume
integral of the divergence on the region inside the surface.
3. The data field has temporal behavior. Here, the data on objects are taken to be
static and independent of time. For a stable field with sources independent of time in space, there
exists a scalar potential function ϕ(x) corresponding to the vector field function
F(x) that describes the intensity function. Both functions can be interconnected
with a differential operator (Giachetta et al. 2009), F(x) = ∇ϕ(x).
6.2.2 Mathematical Model
The distribution law of a data field can be mathematically described with a poten-
tial function. In a data space Ω ⊆ RP, a data object xi brings about a virtual field
with the mass mi. If an arbitrary point x = (x1, x2, …, xP)T exists in Ω, the scalar
potential function of the data field created by xi is defined as Eq. (6.1):

$$\varphi_i(x) = m_i \times K\left(\frac{\|x - x_i\|}{\sigma}\right) \qquad (6.1)$$

where mi is the mass of xi, K(x) is the unit potential function, ‖x − xi‖ is the distance
between xi and x in the field, and σ is an impact factor.
Each data object xi in D has its own data field in Ω. All of the data fields are
superposed on point x. In other words, any data object is affected by all the other
objects in the data space. Therefore, the potential value at x in the data fields on D
in Ω is defined as follows.
Definition 6.2 Given a data field created by a set of data objects D = (x1, x2, …, xn)
in space Ω ⊆ RP, the potential at any point x ∈ Ω can be calculated as
$$\varphi(x) = \varphi_D(x) = \sum_{i=1}^{n}\varphi_i(x) = \sum_{i=1}^{n} m_i \times K\left(\frac{\|x - x_i\|}{\sigma}\right) \qquad (6.2)$$
Because the vector intensity can always be constructed from the scalar potential by a
gradient operator, the vector intensity F(x) at x can be given as

$$F(x) = \nabla\varphi(x) = \frac{2}{\sigma^2}\sum_{i=1}^{n}(x_i - x)\cdot m_i \cdot K\left(\frac{\|x - x_i\|}{\sigma}\right)$$

Because the term $2/\sigma^2$ is a constant, the above formula can be simply written as

$$F(x) = \sum_{i=1}^{n}(x_i - x)\cdot m_i \cdot K\left(\frac{\|x - x_i\|}{\sigma}\right) \qquad (6.3)$$
6.2.3 Mass
Mass mi is the mass of object xi (i = 1, 2, …, n). It represents the strength of the data
field from xi and satisfies $m_i \ge 0$ and the normalization condition $\sum_{i=1}^{n} m_i = 1$.
If each data object is supposed to be equal in mass, indicating the same influence over
space, then a simplified potential function can be derived as Eq. (6.4):

$$\varphi(x) = \frac{1}{n}\sum_{i=1}^{n} K\left(\frac{\|x - x_i\|}{\sigma}\right) \qquad (6.4)$$
The unit potential function K(x) (with $\int K(x)\,dx = 1$ and $\int xK(x)\,dx = 0$) expresses
the law that xi always radiates its data energy in the same way in its data field. In fact,
the potential function reflects the density of the data distribution. According to the
properties of the probability density function, it can be proven that the difference
between the potential function and the probability density function is only a
normalization constant if K(x) has a finite integral in Ω, that is, $\int K(x)\,dx = M < +\infty$.
For example, with a Gaussian-like unit potential function, the potential of Eq. (6.2) becomes

$$\varphi(x) = \sum_{i=1}^{n} m_i \times e^{-\left(\frac{\|x - x_i\|}{\sigma}\right)^2} \qquad (6.6)$$
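As a minimal numerical sketch (not from the original text), the following Python fragment evaluates the Gaussian potential of Eq. (6.6) and the corresponding force of Eq. (6.3) at an arbitrary point; the five sample points, their equal masses, and σ are made up for illustration.

```python
# Minimal sketch of a Gaussian data field (Eq. 6.6) and its force (Eq. 6.3).
# The five 2-D sample points and sigma below are made up for illustration.
import numpy as np

def potential(x, data, masses, sigma):
    """phi(x) = sum_i m_i * exp(-(||x - x_i|| / sigma)**2)   (Eq. 6.6)"""
    d = np.linalg.norm(data - x, axis=1)
    return np.sum(masses * np.exp(-(d / sigma) ** 2))

def force(x, data, masses, sigma):
    """F(x) = sum_i (x_i - x) * m_i * exp(-(||x - x_i|| / sigma)**2)   (Eq. 6.3)"""
    d = np.linalg.norm(data - x, axis=1)
    w = masses * np.exp(-(d / sigma) ** 2)
    return np.sum((data - x) * w[:, None], axis=0)

data = np.array([[0.2, 0.3], [0.25, 0.35], [0.3, 0.3], [0.7, 0.8], [0.75, 0.75]])
masses = np.full(len(data), 1.0 / len(data))   # equal masses, summing to 1
sigma = 0.1

x = np.array([0.25, 0.32])
print("potential:", potential(x, data, masses, sigma))
print("force    :", force(x, data, masses, sigma))
```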
6.2.5 Impact Factor
Once the form of potential function K(x) is fixed, the distribution of the associated
data field is primarily determined by the impact factor σ under the given data set D
in space Ω.
The impact factor σ ∈ (0, +∞) controls the interaction distance between data objects,
and its value has a great impact on the distribution of the data field. The effect of the
impact factor stems from various sources, such as the radiation brightness, the
radiation factor, the amount of data, the distance between neighboring equipotential
lines, and the grid density of the Cartesian coordinates; they all contribute to the data
field. As a result, the distribution of the potential values is determined mainly by the
impact factor, while the particular choice of potential function has a much smaller
influence on the estimation.
In essence, the optimal choice of σ is a minimization problem of a univariate,
nonlinear function, which can be solved by standard optimization algorithms, such
as a simple one-dimensional searching method, stochastic searching method, or
simulated annealing method. Li and Du (2007) introduced the optimization algo-
rithm of impact factor σ.
Take an example in two-dimensional space. Figure 6.1 shows the equipotential
distributions of the data fields produced by the same five data objects under different values of σ.
When σ is very small (Fig. 6.1a), the interaction distance between two data objects
is very short. ϕ(x) is equivalent to the superposition of n peak functions, each
center of which is the data object. The potential value around every data object is
very small. The extreme case is that there is no interaction between two objects,
and the potential value is 1/n at the position of each data object. Conversely, when σ
is very large (Fig. 6.1c), the data objects are strongly interacting. ϕ(x) is equiva-
lent to the superposition of n basic functions, which change slowly when the width
is large. The potential value around every data object is very large. The extreme
case is that the potential value is approximately 1 at the position of the data object.
When the integral of the unit potential function is finite, the potential function and the
probability density function differ at most by a normalization constant (Giachetta et al.
2009). As can be seen from the two extreme cases above, a poorly chosen σ may
prevent the potential distribution of a data field from uncovering the inherent
characteristics of the data. Thus, the impact factor σ must be selected so that the field
portrays the inherent distribution of the data objects (Fig. 6.1b).
6.3 Depiction of Data Field
In physics, a field can be depicted with the help of field force lines for a vector field
and isolines (or iso-surfaces) for a scalar field. Field lines and equipotential lines (or
surfaces) are likewise adopted to represent the overall distribution of data fields.
6.3.1 Field Lines
A field line is a line with an arrowhead and a length: the arrowhead indicates the
direction, and the length indicates the field strength. It is a locus defined by a vector
field and a starting location within the field. A series of field lines visually portrays the
field of a data object, and the density of the field lines further shows the field strength.
All of the field lines converge on the data object, where the field of the source is
strongest. Approaching the source, the field lines become longer and denser.
Conversely, as the distance ‖x − xi‖ from the field source grows, the field becomes
weaker; accordingly, moving away from the data object, the field lines become shorter
and sparser. Their distribution shows that the data field is spherically symmetric.
Figure 6.2 is a plot of the force lines for a data force field produced by a single object;
the field lines are radial and always point toward the source. The force lines are
uniformly distributed and spherically symmetric, decreasing as the radial distance
‖x − xi‖ increases, which indicates a strong attractive force on the sphere centered on
the object.
6.3.2 Equipotential Lines (Surfaces)
Equipotential lines (or surfaces) come into being when the points with the same
potential value are joined together.
Given potential ψ, an equipotential can be extracted as one where all the points
on it satisfy the implicit equation ϕ(x) = ψ. By specifying a set of discrete poten-
tials {ψ1 , ψ2 , . . . , ψn }, a series of nested equipotential lines or surfaces can be
drawn for better understanding of the associated potential field as a whole form.
Let ψ1 , ψ2 , . . . , ψn respectively be the potentials at the positions of the objects
x1 , x2 , . . . , xn. The potential entropy can be defined as (Li and Du 2007)
Fig. 6.2 The plot of force lines for a data force field created by a single object (m = 1, σ = 1)
a 2-dimensional plot, b 3-dimensional plot
$$H = -\sum_{i=1}^{n}\frac{\psi_i}{Z}\,\log\frac{\psi_i}{Z} \qquad (6.7)$$

where $Z = \sum_{i=1}^{n}\psi_i$ is a normalization factor. For any σ ∈ [0, +∞], the potential
entropy H satisfies 0 ≤ H ≤ log(n), and H = log(n) if and only if ψ1 = ψ2 = ⋯ = ψn.
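A minimal sketch of such a one-dimensional search is given below (not part of the original text): σ is chosen from a coarse grid of candidate values by minimizing the potential entropy H of Eq. (6.7); the clustered two-dimensional data are synthetic.

```python
# Minimal sketch: choose sigma by minimizing the potential entropy H (Eq. 6.7)
# over a coarse grid of candidate values; the clustered 2-D data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.3, 0.05, (50, 2)),
                  rng.normal(0.7, 0.05, (50, 2))])
masses = np.full(len(data), 1.0 / len(data))

def potentials_at_objects(data, masses, sigma):
    # psi_i: superposed Gaussian potential evaluated at each object position
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    return (masses * np.exp(-(d / sigma) ** 2)).sum(axis=1)

def potential_entropy(data, masses, sigma):
    psi = potentials_at_objects(data, masses, sigma)
    p = psi / psi.sum()                      # psi_i / Z
    return -(p * np.log(p)).sum()            # H of Eq. (6.7)

candidates = np.linspace(0.02, 0.5, 40)
entropies = [potential_entropy(data, masses, s) for s in candidates]
best = candidates[int(np.argmin(entropies))]
print("sigma with minimum potential entropy:", round(float(best), 3))
```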
The equipotential lines are mathematical entities that describe the spherical circles at a
distance ‖x − xi‖ on which the field has a constant value, much like contour lines on a
map tracing equal altitude; they always cross the field lines at right angles. The
potential difference compares the potential from one equipotential line to the next. In
the context of equal data masses, an area with densely distributed data will have a
greater potential value when the potential functions are superposed. At the position
with the largest potential value, the data objects are most densely concentrated. Figure 6.3a
is a map of the equipotential lines for a potential field produced by a single data
object in a two-dimensional space, which consists of a family of nested, concen-
tric circles or spheres centered on the data object. As can be seen from Fig. 6.3a,
the equipotentials closer to the object have higher potentials and are farther apart,
which indicates strong field strength near and around the object.
In a multi-dimensional space, the equipotential lines extend to equipotential surfaces
because the field becomes a set of nested spheres centered on the coordinates of the
data object with radius ‖x − xi‖. Figure 6.3b is a corresponding map of the
equipotential surfaces in a 3D space. Obviously, the equipotential surfaces
corresponding to different potentials are quite different in topology.
Fig. 6.3 The equipotential map for a potential field produced by a single data object (σ = 1)
a equipotential lines, b equipotential surfaces
6.3.3 Topological Cluster
Fig. 6.4 Self-organized characteristics of 390 data objects in the field (σ = 0.091) a self-
organized level by level, b characterized level
References
7.1.2 Properties
The cloud model integrates randomness and fuzziness. The uncertainty of the
concept can be represented by multiple numerical characters. We can say that
the mathematical expectation, variance, and high-order moment in probability
theory reflect several numerical characters of the randomness but not the fuzzi-
ness. Membership, the approach to fuzziness, is a precise description that does not
take randomness into account. A rough set measures uncertainty through two precise
sets defined against a background of precise knowledge rather than a background of
uncertain knowledge. In the cloud model approach, we can express the uncertainty of
a concept by higher-order entropies in addition to the expectation, entropy, and
hyperentropy in order to conduct research in more depth.
Figure 7.1 shows an interpretation of the concept "displacement is 9 mm around" with
the cloud model, where x is the displacement and μ(x) is the certainty degree with
which x belongs to the concept. In Fig. 7.1, a piece of cloud is composed of many
drops represented by data; each drop is a stochastic mapping in the universe of
discourse from the qualitative fuzzy concept, together with the membership of that
datum in the concept. The cloud is observable as a whole shape but fuzzy in detail,
similar to a natural cloud in the sky. When a cloud drop is created, its mapping is
random and its membership is fuzzy, and this is the process that integrates
randomness and fuzziness. Concretely, "displacement is 9 mm around" is a piece of
cloud composed of many cloud drops represented by the data {…, 8 mm, 9 mm,
10 mm, …}, each of which is a stochastic mapping in the universe of "displacement"
from the qualitative fuzzy concept "displacement is 9 mm around," together with the
memberships {…, 0.9, 1, 0.9, …} of those data in the concept. The generation of the
data {…, 8 mm, 9 mm, 10 mm, …} is a stochastic process, and their memberships
{…, 0.9, 1, 0.9, …} in the concept are fuzzy.
1. The expected value Ex: the mathematical expectation of the cloud drops distributed
in the universe of discourse. In other words, it is the point that is most representative
of the qualitative concept, or the most classical sample when quantifying the concept.
2. The entropy En: The uncertainty measurement of the qualitative concept. It is
determined by both the randomness and the fuzziness of the concept. In one
aspect, as the measurement of randomness, En reflects the dispersing extent of
the drops; in the other aspect, it is also the measurement of “this and that,” rep-
resenting the value region in which the drop is acceptable by the concept. As
a result, the connection of randomness and fuzziness is reflected by using the
same numerical character.
3. The hyperentropy He: This is the uncertainty measurement of the entropy—that
is, the entropy of the entropy—which is determined by both the randomness
and fuzziness of the entropy.
To express the qualitative concept by the quantitative method, we generate cloud
drops under the numerical characters of the cloud; the reverse (i.e., from the quan-
titative expression to the qualitative concept) extracts the numerical characters from
the group of cloud drops. Figure 7.1 shows the three numerical characteristics of the
linguistic term “displacement is 9 mm around.” In the universe of discourse, Ex is
the position corresponding to the center of the cloud gravity, the elements of which
are fully compatible with the spatial linguistic concept; En is a measure of the con-
cept coverage (i.e., a measure of the spatial fuzziness), which indicates how many
elements could be accepted to the spatial linguistic concept; and He is a measure of
the dispersion on the cloud drops, which also can be considered as the entropy of En.
7.3 The Types of Cloud Models
There are various implementation approaches for the cloud model, resulting in
different kinds of clouds, such as the normal cloud model, the symmetric cloud
model, the half cloud model, and the combined cloud model.
The normal distribution is one of the most important distributions in probability
theory. It is usually represented by two numerical characters: the mean and the
variance. The bell-shaped membership function is the most frequently used function
in fuzzy sets, and it is generally expressed as $\mu(x) = e^{-\frac{(x-a)^2}{2b^2}}$. The
normal cloud is a new model developed on the basis of the normal distribution and
the bell-shaped membership function.
There are various mathematical characters representing uncertainties, and these
variations also exist in the real world. For example, the symmetric cloud model
represents the qualitative concept with symmetric features (Fig. 7.2a). The half
cloud model depicts the concept with uncertainty on one side (Fig. 7.3b). As a
result, we can construct various kinds of cloud models.
7.3 The Types of Cloud Models 191
Fig. 7.2 Various cloud models: a Symmetric cloud model. b Half cloud model. c Combined
cloud model. d Floating cloud. e Geometry cloud. f Multi-dimensional cloud model
Fig. 7.3 Spatial objects represented by a cloud model: a point, b line, c directed line, d polygon
192 7 Cloud Model
The clouds constructed by other given clouds are called virtual clouds, such as
a combined cloud, floating cloud, and geometric cloud. A combined cloud is used
to synthesize linguistic terms into a generalized term. If we use the mechanism of
combined cloud construction recursively from low concept levels to high concept
levels, we can obtain the concept hierarchies for the linguistic variables, which are
very important in data mining (Fig. 7.2c). The floating cloud mechanism is used
to generate default clouds in the blank areas of the universe by other given clouds
(Fig. 7.2d). If we consider a universe as a linguistic variable and we want to rep-
resent linguistic terms by clouds, the only essential work is to specify the clouds
at the key positions. Other clouds can be automatically generated by the floating
cloud construction method. A geometric cloud is a whole cloud constructed by
least-squares fitting on the basis of some cloud drops; that is, the expected curve of
the geometric cloud is generated by least squares (Fig. 7.2e). The cloud model can
be multidimensional as well. Figure 7.2f shows the joint relationship of the multi-
dimensional cloud and its certainty degree.
In geospatial information science, there are three basic objects: point, line, and
area. The line may be non-directed or directed. If they are represented by a combi-
nation of cloud models, the spatial objects may be illustrated as shown in Fig. 7.3.
With the further combination of points, lines, and areas, various types of complex
objects may be represented with cloud models.
7.4 Cloud Generator
Given the numerical characters {Ex, En, He} of a concept, data may be generated to
depict the cloud drops of a piece of cloud; this is called a forward cloud generator.
Conversely, given a dataset, {Ex, En, He} may be estimated to represent the concept;
this is called a backward cloud generator. The backward cloud generator can be used to extract the
conceptual knowledge from numeric datasets, and the forward cloud generator can
be used to represent the qualitative knowledge with quantitative data. The type of
cloud model must be chosen according to the type of data distribution. Because a
normal distribution is ubiquitous, the normal cloud model is taken as an example
in the following sections to present the algorithms of the forward and backward
cloud generators.
Fig. 7.4 Forward cloud generator: a a one-dimensional generator CG that maps (Ex, En, He)
to cloud drops drop(xi, μi); b a two-dimensional generator that maps (Ex, Ey, Enx, Eny, Hex,
Hey) to drops drop(xi, yi, μi)
(4) Let (xi, yi, µi) be a cloud drop, and one quantitative implementation of
the linguistic value is represented by this cloud. In this drop, (xi, yi) is
the value in the universal domain corresponding to this qualitative con-
cept and µi is the degree measure of the extent to which (xi, yi) belongs
to this concept.
(5) Repeat steps (1)–(4) until n cloud drops are generated.
Figure 7.4a shows the algorithm of the forward normal cloud generator. The
corresponding 2D forward normal cloud generator is shown in Fig. 7.4b.
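A minimal sketch of a one-dimensional forward normal cloud generator is given below (not part of the original text); it assumes the standard two-level normal sampling (En′ drawn from N(En, He²) and x drawn from N(Ex, En′²)), and the parameter values for the concept are illustrative.

```python
# Minimal sketch of a 1-D forward normal cloud generator: given (Ex, En, He),
# generate n cloud drops (x_i, mu_i) with the standard two-level sampling:
# En' ~ N(En, He^2), x ~ N(Ex, En'^2), mu = exp(-(x - Ex)^2 / (2 * En'^2)).
import numpy as np

def forward_cloud(Ex, En, He, n, seed=0):
    rng = np.random.default_rng(seed)
    En_prime = rng.normal(En, He, n)          # entropy perturbed by the hyperentropy
    x = rng.normal(Ex, np.abs(En_prime))      # drop positions
    mu = np.exp(-(x - Ex) ** 2 / (2 * En_prime ** 2))
    return x, mu

# Illustrative concept "displacement is 9 mm around".
drops, certainty = forward_cloud(Ex=9.0, En=0.5, He=0.05, n=1000)
print(drops[:5], certainty[:5])
```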
Backward cloud generator (CG−1) is the model for transition from quantitative
value to qualitative concept. It maps a quantity of precise data back into the quali-
tative concept expressed by Ex, En, He, as shown in Fig. 7.5.
There are two kinds of basic algorithms of the CG−1 based on statistics, which
are classified by whether or not the certainty degree is utilized.
Algorithm 7.3 Backward normal cloud generator with the certainty degree
Input: xi and the certainty degree μi, i = 1, 2, …, n
Output: (Ex, En, He) representative of the qualitative concept
Fig. 7.5 Backward cloud generator CG−1, which maps cloud drops drop(xi, yi) to (Ex, En, He)
Steps:
(1) Calculate the mean of xi for Ex, i.e., Ex = MEAN(xi).
(2) Calculate the standard deviation of xi for En, i.e., En = STDEV(xi).
(3) For each pair (xi, μi), calculate
$$En'_i = \sqrt{\frac{-(x_i - Ex)^2}{2\ln\mu_i}}.$$
(4) Calculate the standard deviation of the En′i for He, i.e., He = STDEV(En′i).
In the algorithm, MEAN and STDEV are the functions for the mean and standard
deviation of the samples.
There are drawbacks to this one-dimensional (1D) backward cloud generator
algorithm, which are as follows:
1. The certainty degree μ is required for recovering En and He; however, in prac-
tical applications, only a group of data representative of the concept usually
can be obtained, but not the certainty degree μ;
2. It is difficult to extend this algorithm to higher dimensional situations, and the
error in the higher dimensional backward cloud is larger than that of the 1D
backward cloud.
To overcome the drawbacks of the algorithm requiring the certainty degree, the
following backward cloud algorithm utilizes only the value of xi for backward
cloud generation based on the statistical characters of the cloud.
Algorithm 7.4 Backward normal cloud generator without certainty degree
Input: Samples xi, i = 1, 2, …, n
Output: (Ex, En, He) representative of the qualitative concept
Steps:
(1) $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$ // the sample mean and variance of xi.
(2) $Ex = \bar{X}$.
(3) $En = \sqrt{\frac{\pi}{2}}\times\frac{1}{n}\sum_{i=1}^{n}|x_i - Ex|$.
(4) $He = \sqrt{S^2 - En^2}$.
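A minimal sketch of Algorithm 7.4, as reconstructed above, is given below (not part of the original text); it recovers (Ex, En, He) from the samples alone, and the test data are synthetic.

```python
# Minimal sketch of Algorithm 7.4: backward normal cloud generator that
# estimates (Ex, En, He) from the samples x_i alone (no certainty degrees).
import numpy as np

def backward_cloud(x):
    x = np.asarray(x, dtype=float)
    Ex = x.mean()
    S2 = x.var(ddof=1)                              # sample variance, 1/(n-1)
    En = np.sqrt(np.pi / 2.0) * np.mean(np.abs(x - Ex))
    He = np.sqrt(max(S2 - En ** 2, 0.0))            # guard against a negative radicand
    return Ex, En, He

# Illustrative round trip with normally distributed samples.
samples = np.random.default_rng(0).normal(9.0, 0.5, 2000)
print(backward_cloud(samples))   # Ex close to 9, En close to 0.5, He small
```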
For a given value x, the conditional cloud generator computes the certainty degree
$$\mu_i = e^{-\frac{(x-Ex)^2}{2(En'_i)^2}}$$
and returns the cloud drop (x, μi).
The joint distribution of membership, which is generated by specific value x
and the precondition cloud generator, is illustrated in Fig. 7.6. All of the cloud
drops are located in the straight line of X = x.
7.5 Uncertainty Reasoning
7.5.1 One-Rule Reasoning
If there is only one factor in the rule antecedent, we call the rule a one-factor
rule, that is, "If A, then B." CGA is the X-conditional cloud generator for linguistic
term A, and CGB is the Y-conditional cloud generator for linguistic term B.
Given a certain input x, CGA generates random values μi. These values are con-
sidered as the activation degree of the rule and input to CGB. The final outputs are
cloud drops, which form a new cloud.
Combining the algorithm of the X and Y conditional cloud generators (Li et al.
2013), the following algorithm is introduced for one-factor one-rule reasoning.
Algorithm 7.6 One-factor one-rule reasoning
Input: x, ExA, EnA, HeA, ExB, EnB, HeB
Output: cloud drops drop(yi, μi)
Steps:
(1) En′A = G(EnA, HeA) // Create a random value that follows the normal
distribution with mean EnA and standard deviation HeA.
(2) Calculate $\mu = e^{-\frac{(x - Ex_A)^2}{2(En'_A)^2}}$.
(3) En′B = G(EnB, HeB) // Create a random value that follows the normal
distribution with mean EnB and standard deviation HeB.
(4) Calculate $y = Ex_B \pm \sqrt{-2\ln(\mu)}\,En'_B$ and let (y, μ) be a cloud drop.
If x ≤ ExA, "−" is adopted, while if x > ExA, "+" is adopted.
(5) Repeat steps (1)–(4), generating as many cloud drops as needed.
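A minimal sketch of Algorithm 7.6 is given below (not part of the original text); the rule parameters are illustrative, and the drops (y, μ) are collected in a list.

```python
# Minimal sketch of Algorithm 7.6: one-factor one-rule reasoning ("If A then B")
# with illustrative rule parameters.
import numpy as np

def one_rule_reasoning(x, ExA, EnA, HeA, ExB, EnB, HeB, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    drops = []
    for _ in range(n):
        EnA_p = rng.normal(EnA, HeA)                        # step (1)
        mu = np.exp(-(x - ExA) ** 2 / (2 * EnA_p ** 2))     # step (2)
        EnB_p = rng.normal(EnB, HeB)                        # step (3)
        sign = -1.0 if x <= ExA else 1.0                    # step (4)
        y = ExB + sign * np.sqrt(-2.0 * np.log(mu)) * EnB_p
        drops.append((y, mu))
    return drops

drops = one_rule_reasoning(x=2.4, ExA=2.0, EnA=0.3, HeA=0.02,
                           ExB=10.0, EnB=1.0, HeB=0.05)
print(drops[:3])
```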
Figure 7.7 is the output cloud of a one-factor one-rule generator with one input.
It can be seen that the cloud model-based reasoning generated uncertain results.
The uncertainty of the linguistic terms in the rule is propagated during the reason-
ing process. Because the rule output is a cloud, the final result can be obtained in
several forms: (1) one random value; (2) several random values as sample results;
(3) the expected value, which is the mean of many sample results; and (4) the lin-
guistic term, which is represented by a cloud model. The parameters of the model
are obtained by an inverse cloud generator method.
Fig. 7.7 Output cloud of one-rule reasoning
Fig. 7.8 Input-output response of one-rule reasoning
If we input a number of values to the one-factor rule and draw the inputs and outputs
in a scatter plot, we can obtain the input-output response graph of the one-factor
one-rule reasoning (Fig. 7.8). The graph looks like a cloud band, not a line. The band
is more focused closer to the expected values; farther away from the expected value,
the band is more dispersed. This is consistent with human intuition. The above two
figures and discussion show that the cloud model based on uncertain reasoning is
more flexible and powerful than the conventional fuzzy reasoning method.
If the rule antecedent has two or more factors, such as “If A1, A2,…, An, then
B,” the rule is called a multi-factor rule. In this case, a multi-dimensional cloud
model represents the rule antecedent. Figure 7.9 is a two-factor one-rule genera-
tor, which combines a 2D X-conditional cloud generator and a 1D Y-conditional
cloud generator. It is not difficult to apply the reasoning algorithm on the basis of
the cloud generator; consequently, multi-factor one-rule reasoning is conducted in
a similar way.
7.5.2 Multi-rule Reasoning
Usually, there are many rules in a real knowledge base. Multi-rule reasoning is
frequently used in intelligent GIS or spatial decision-making support systems.
Figure 7.10 is a one-factor multi-rule generator, and the algorithm is as follows.
Algorithm 7.7 One-factor multi-rule generator reasoning
Input: x, ExAi, EnAi, HeAi
Output: y
Fig. 7.9 A two-factor one-rule generator
Fig. 7.10 One-factor five-rule generator
Steps:
(1) Given the input x, determine how many rules are activated: if ExAi − 3EnAi < x
< ExAi + 3EnAi, then rule i is activated by x.
(2) If only one rule is activated, output a random y by the one-factor one-rule
reasoning algorithm. Go to step (4).
(3) If two or more rules are activated, each rule first outputs a random value
by the one-factor one-rule reasoning algorithm, and a virtual cloud is con-
structed by the geometric cloud generation method. A cloud generator
algorithm is run to output a final result y with the three numerical charac-
ters of the geometric cloud. Because the expected value of the geometric
cloud is also a random value, the expected value can be taken as the final
result for simplicity. Go to step (4).
(4) Repeat steps (1)–(3) to generate as many outputs as required.
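A minimal sketch of the activation test of step (1) is given below (not part of the original text). For simplicity, when several rules are activated, their one-rule outputs are combined by an activation-weighted average instead of the geometric virtual cloud of step (3), so it is a simplification rather than the full algorithm; the rule parameters are made up.

```python
# Minimal sketch of one-factor multi-rule reasoning (Algorithm 7.7).  Rules whose
# antecedent cloud covers x within [ExA - 3EnA, ExA + 3EnA] are activated; the
# outputs of several activated rules are combined by an activation-weighted
# average (a simplification of the geometric virtual cloud of step (3)).
import numpy as np

rng = np.random.default_rng(0)

# Each rule: antecedent (ExA, EnA, HeA) -> consequent (ExB, EnB, HeB); values illustrative.
rules = [
    ((0.0, 0.4, 0.02), (100.0, 5.0, 0.3)),
    ((1.0, 0.4, 0.02), (200.0, 5.0, 0.3)),
    ((2.0, 0.4, 0.02), (300.0, 5.0, 0.3)),
]

def fire_rule(x, antecedent, consequent):
    (ExA, EnA, HeA), (ExB, EnB, HeB) = antecedent, consequent
    EnA_p = rng.normal(EnA, HeA)
    mu = np.exp(-(x - ExA) ** 2 / (2 * EnA_p ** 2))
    EnB_p = rng.normal(EnB, HeB)
    sign = -1.0 if x <= ExA else 1.0
    return ExB + sign * np.sqrt(-2.0 * np.log(mu)) * EnB_p, mu

def multi_rule(x):
    activated = [(a, c) for a, c in rules
                 if a[0] - 3 * a[1] < x < a[0] + 3 * a[1]]      # step (1)
    if not activated:
        return None
    outs = [fire_rule(x, a, c) for a, c in activated]
    weights = np.array([mu for _, mu in outs])
    values = np.array([y for y, _ in outs])
    return float(np.sum(weights * values) / np.sum(weights))

print(multi_rule(0.8))   # between rules 1 and 2, closer to the more strongly activated rule
```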
Fig. 7.11 Two activated rules
The main concept of the multi-rule reasoning algorithm is that when several rules are
activated simultaneously, a virtual cloud is created by the geometric cloud method.
Because of the least-squares fitting property, however, the final output is more likely
to be close to the rule with the highest activation degree. This is consistent with
human intuition. Figure 7.11 shows a situation with two rules activated, and Fig. 7.12
one with three rules activated. Only the mathematical expected curves are drawn for
clarity, and the dashed curves are the virtual clouds. The one-factor
multi-rule reasoning can be easily extended to multi-factor multi-rule reasoning on
the basis of multi-dimensional cloud models.
The following is an illustrative example of multi-factor multi-rule reason-
ing. Suppose there are five rules to describe the terrain features qualitatively,
and rule input is the linguistic term of location that is represented by a 2D cloud
(Fig. 7.13a). Rule output is the linguistic term of elevation that is represented by a
1D cloud (Fig. 7.13b).
• Rule 1: If location is southeast, then elevation is low.
• Rule 2: If location is northeast, then elevation is low to medium.
• Rule 3: If location is central, then elevation is medium.
• Rule 4: If location is southwest, then elevation is medium to high.
• Rule 5: If location is northwest, then elevation is high.
Figure 7.13c is the input-output response surface of the rules. The surface is an
uncertain surface and the roughness is uneven. Closer to the center of the rules
(the expected values of the clouds), the surface is smoother, showing the small
uncertainty; at the overlapped areas of the rules, the surface is rougher, showing
large uncertainty. This shows that multi-factor multi-rule reasoning also represents
and propagates the uncertainty of the rules, as the one-factor one-rule reasoning does.
References
Li DR, Wang SL, Li DY (2013) Spatial data mining theories and applications, 2nd edn. Science
Press, Beijing
Li DY, Du Y (2007) Artificial intelligence with uncertainty. Chapman & Hall/CRC, London
Li DY, Liu CY, Gan WY (2009) A new cognitive model: cloud model. Int J Intell Syst
24(3):357–375
Wang SL, Shi WZ (2012) Data mining and knowledge discovery. In: Kresse W, Danko David
(eds) Handbook of geographic information. Springer, Berlin
Chapter 8
GIS Data Mining
SDM can extend finite data in GIS to infinite knowledge. In this chapter, the
Apriori algorithm, concept lattice, and cloud model are discussed and utilized in
case studies to discover spatial association rules and attribute generalization. First,
bank operational income analysis and location selection assessment are used to
demonstrate how inductive learning uncovers the spatial distribution rules; urban
air temperature and agricultural data are used to demonstrate how a rough set dis-
covers decision-making knowledge; and the monitoring process of the Baota land-
slide is used to show the practicality and creditability of applying the cloud model
and data fields under the SDM views to support decision-making. Fuzzy compre-
hensive clustering, which perfectly integrates fuzzy comprehensive evaluation
and fuzzy clustering analysis, and mathematical morphology clustering, which
can handle clusters of arbitrary shapes and can provide knowledge of outliers and
holes, are also discussed. The results show that SDM is practical and credible as
decision-making support.
Spatial association rule mining extracts association rules whose support and con-
fidence are not less than the user-defined min_sup and min_conf, respectively. The
problem to be solved by data mining can be decomposed into two sub-problems:
(1) how to locate all the frequent itemsets whose support is at or more than the
min_sup, and (2) how to generate all the association rules whose confidence is at
or more than the min_conf for the frequent itemsets. The solution to the second
sub-problem is straightforward, so the focus here is to develop a new efficient
algorithm to solve the first sub-problem.
A non-redundant rule cannot be derived from other rules. Generalized association rule mining may produce a large number of rules; if the thresholds are set too small, the generated rules may exceed the processing capability, and many of them are redundant. Because redundant rules can be derived from other rules, it is important to remove them and keep only the non-redundant rules, which have identical support and confidence but fewer antecedent conditions and more consequent conclusions; this helps uncover as much of the most useful information as possible.
Definition 8.2 For a rule r: l1 ⇒ l2, if there exists no rule r′: l1′ ⇒ l2′ such that support(r) ≤ support(r′), confidence(r) ≤ confidence(r′), l1′ ⊆ l1, and l2′ ⊆ l2, then the rule r is called a non-redundant rule (Zaki and Ogihara 2002).
4. Create all the non-void subsets s (s ⊆ l) for each frequent itemset l.
5. Extract the association rules "s ⇒ l − s" if freq(s ∪ (l − s))/freq(s) ≥ min_conf, where freq is the support frequency and min_conf is the minimum confidence threshold.
In the joining process, the items in an itemset are arranged alphabetically. When joining Lk with Lk, if the preceding k − 1 items of two elements are identical, they are joinable; otherwise, they are unjoinable and are left alone. For instance, L1 is joined with L1 to generate C2 (the candidate 2-itemsets); D is then browsed to delete the non-frequent itemsets in C2 and generate L2 (the frequent 2-itemsets). The Apriori algorithm can easily explore the association rules in a dataset. For example, in Fig. 8.1, assume that six customers {T1, T2, T3, T4, T5, T6} bought a total of six kinds of products {a, b, c, d, e, f} in a retail outlet, which created the transactional data D. Let freq count the support frequency. If the support threshold is 0.3 and the confidence threshold is 0.8, the process of generating frequent itemsets proceeds as described above. Table 8.1 shows the 20 extracted association rules.
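As an illustration of the frequent-itemset step just described, here is a small Python sketch of the Apriori loop. The transactions below are only stand-ins, since the contents of Fig. 8.1 and Table 8.1 are not reproduced in the text; the support threshold of 0.3 follows the example.

```python
from itertools import combinations

# Illustrative transactions over the six products {a, b, c, d, e, f};
# these are not the actual data of Fig. 8.1.
transactions = [
    {'a', 'b', 'c'}, {'a', 'c', 'd'}, {'b', 'c', 'e'},
    {'a', 'b', 'c', 'e'}, {'b', 'e', 'f'}, {'a', 'c', 'f'},
]
min_sup = 0.3

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(items, min_sup):
    """Return {frozenset: support} for all frequent itemsets."""
    frequent = {}
    Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    k = 1
    while Lk:
        frequent.update({s: support(s) for s in Lk})
        # join step: merge k-itemsets that differ in one item, then prune by support
        candidates = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k + 1}
        Lk = [c for c in candidates if support(c) >= min_sup]
        k += 1
    return frequent

items = sorted(set().union(*transactions))
for itemset, sup in sorted(apriori(items, min_sup).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(sup, 2))
```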
However, the Apriori algorithm has its weaknesses. First, in the process of generating the frequent k-itemsets from the candidate k-itemsets, the database D is browsed once more, which results in scanning the database too many times; if the database is excessively large, the algorithm becomes very time-consuming. Second, when the association rules are extracted, all the subsets of the frequent itemsets must be calculated, which is time-consuming as well. Third, many redundant rules are generated. We offer the concept lattice as a solution to these problems.
The concept lattice (also called formal concept analysis) was introduced by Rudolf
Wille in 1982. It is a set model that uses a mathematical formula to directly
express human understanding of the concept’s geometric structure. The concept
lattice provides a formal tool to take the intension and extension of a concept as
the concept unit. The extension is the instances covered by the concept, and the
intension is the description of the concept as well as the common characteristics
of the instances covered by the concept. Based on a binary relationship, the con-
cept lattice unifies the intension and extension of a concept and reflects the gen-
eralization/characterization connection between an object and an attribute. Each
vertex of the concept lattice is a formal concept composed of an extension and
an intension. A Lattice Hasse diagram simply visualizes the generalization/charac-
terization relationship between the intension and extension of a concept. Creating
a Hasse diagram is similar to the process of concept classification and clustering.
Discovered from such features as spectrum, texture, shape, and spatial distribution,
the patterns are described by intension sets and the inclusion of extension sets.
Therefore, a concept lattice is suitable as the basic data structure of rules mining.
SDM may be treated as the process of concept formation from a database. A concept lattice first analyzes the formal context triple (G, M, I), where G is the set of (formal) objects, M is the set of (formal) attributes, and I is the relationship between the objects G and the attributes M; that is, I ⊆ G × M. A formal concept of the formal context (G, M, I) is a pair (A, B), where A ⊆ G and B ⊆ M; A and B are the extension and intension of the formal concept (A, B), respectively. The subconcept-superconcept relationship among these concepts is formally denoted as H1 = (A1, B1) ≤ H2 = (A2, B2) :⇔ A1 ⊆ A2 (⇔ B1 ⊇ B2). The set of all the concepts of the formal context (G, M, I) constitutes a complete concept lattice, denoted B(G, M, I). The lattice Hasse diagram is created under this partial order: if H1 < H2 and no other element H3 in the lattice satisfies H1 < H3 < H2, there is an edge from H1 to H2.
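The construction just described can be sketched in a few lines of Python for a tiny, made-up formal context: every attribute subset B is closed by taking its extent and then the intent of that extent, and the resulting (extension, intension) pairs are exactly the formal concepts ordered by extent inclusion. This is a brute-force illustration, not an efficient lattice-building algorithm.

```python
from itertools import combinations

# Illustrative formal context (G, M, I): each object maps to its attributes.
objects = {'g1': {'m1', 'm2'}, 'g2': {'m2', 'm3'}, 'g3': {'m1', 'm2', 'm3'}}
attributes = {'m1', 'm2', 'm3'}

def extent(B):                     # objects possessing all attributes in B
    return frozenset(g for g, attrs in objects.items() if B <= attrs)

def intent(A):                     # attributes shared by all objects in A
    if not A:
        return frozenset(attributes)
    return frozenset.intersection(*(frozenset(objects[g]) for g in A))

concepts = set()
for r in range(len(attributes) + 1):
    for B in combinations(sorted(attributes), r):
        A = extent(frozenset(B))
        concepts.add((A, intent(A)))   # (extension, intension) pairs

# subconcept/superconcept order: (A1, B1) <= (A2, B2) iff A1 is a subset of A2
for A, B in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))
```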
The algorithm that constructs the concept lattice may automatically generate frequent closed itemsets. In the constructed concept lattice, a frequent concept lattice node is denoted as (O, A), where O is the collection of objects sharing the maximum common itemset A; that is, A is a frequent closed itemset and O is the set of all objects that contain it.
Definition 8.3 For the closed itemset l and an itemset g, if g ⊆ I satisfies γ(g) = l and there exists no g′ ⊆ I with g′ ⊂ g such that γ(g′) = l, then g is called a productive subset of l.
The above is the formal definition of a productive subset (Zaki and Ogihara 2002). The productive subset is the sub-itemset designated to construct the frequent closed itemset, that is, the reduced connotation set (Xie 2001). To obtain the productive subsets, all the non-void subsets are first uncovered from the connotation sets of each frequent closed node. Then, a subset is deleted if it belongs to a child node of the node or if another subset is its proper subset. The remaining subsets are the productive subsets of the frequent closed nodes, and these generate the non-redundant association rules. Figure 8.2 illustrates the nodes of the frequent closed concept lattice and their productive subsets Gf, obtained by creating the concept lattice from Fig. 8.1 and Table 8.1.
According to the range of confidence values, there are two kinds of non-redundant association rules: those with a confidence of 100 % and those with a confidence of less than 100 %.
1. Rules with a confidence of 100 %: {r: g ⇒ (f/g) | f ∈ FC ∧ g ∈ Gf ∧ g ≠ f}, where f is a frequent closed itemset, Gf is its set of productive subsets, and f/g is the itemset obtained by removing g from f.
Fig. 8.2 Nodes of the frequent closed concept lattice and their productive subsets
Using the concept lattice, the association rules that meet min_sup and min_conf can be automatically extracted from the formal context. Sometimes it is unnecessary to generate all the rules because only a specific rule is of interest. In that case, the direct extraction method can deduce these rules by analyzing the relationship between the designated concept lattice nodes.
Assume that in the Hasse diagram there are two concept lattice nodes of interest, C1 = (m1, n1) and C2 = (m2, n2), with extensive cardinalities |m1| and |m2|, respectively. Also, fr(n) represents the occurrence frequency of the objects containing connotation n in the dataset. The relationship between the concept nodes is used to directly compute the confidence and support of the association rules (Clementini et al. 2000).
1. The method to compute support is as follows:
(a) If there is a direct connection between the two nodes (i.e., there is a line connecting the two nodes in the Hasse diagram), then s(n1Rn2) = fr(n1 ∪ n2)/fr(∅), where fr(n1 ∪ n2) denotes the co-occurrence frequency of n1 and n2 in the dataset D, and fr(∅) is the frequency of the "void set" connotation attribute in the dataset D; in fact, it equals the cardinality of the dataset D.
(b) If there is no direct connection between the two nodes, then it is
necessary to find a common child node n3 between the two, then:
s(n1Rn2) = fr(n3)/fr(∅).
2. The method to compute confidence is as follows:
(a) If there is a direct connection between the two nodes (i.e., there is a line connecting the two nodes in the Hasse diagram), then the confidence between n1 and n2 is C(n1Rn2) = fr(n1 ∪ n2)/fr(n1) = |m2|/|m1|, where fr(n1) (fr(n1) ≠ 0) is the occurrence frequency of n1 in the dataset D and fr(n1 ∪ n2) is the co-occurrence frequency of n1 and n2 in the dataset D. If n2 ⊆ n1, then C(n1Rn2) = 1.
(b) If there is no direct connection between the two nodes, then it is
necessary to find a common child node n3 between the two, then
C(n1Rn2) = C(n1Rn3) × C(n2Rn3).
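The direct-connection formulas above translate almost literally into code once fr(n) is read as the number of records containing connotation n. The sketch below uses a toy dataset (not the book's Table 8.1 data) and implements s(n1Rn2) = fr(n1 ∪ n2)/fr(∅) and C(n1Rn2) = fr(n1 ∪ n2)/fr(n1); the common-child case can then be obtained by chaining confidences as in 2(b).

```python
# Illustrative dataset of objects, each described by an attribute set.
dataset = [{'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c', 'e'}, {'a', 'b', 'e'}]

def fr(n):
    """Occurrence frequency of connotation (attribute set) n in the dataset."""
    return sum(set(n) <= t for t in dataset)

def support(n1, n2):
    # s(n1 R n2) = fr(n1 ∪ n2) / fr(∅); fr(∅) equals the dataset cardinality
    return fr(set(n1) | set(n2)) / fr(set())

def confidence(n1, n2):
    # C(n1 R n2) = fr(n1 ∪ n2) / fr(n1); equals 1 when n2 is contained in n1
    return fr(set(n1) | set(n2)) / fr(set(n1))

print(support({'a'}, {'c'}), confidence({'a'}, {'c'}))
```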
The confidence and support thresholds are set according to the specific circumstances. When both min_sup and min_conf are met, the association rules are of interest.
Time complexity is a concern for the concept lattice. When the concept lattice is applied in data mining, the formal context of the mining objects is first transformed into a single-valued attribute table; then, the corresponding concept lattice nodes are created from the intersections and unions of the contextual concepts. For a formal context with n attributes, the concept lattice can create a maximum of 2^n concept nodes, which makes the concept lattice seem impractical for data mining. In fact, even if the number of features of the formal context is huge, an actual object often has only a small fraction of the features. The number of actual lattice nodes is also often much less than this upper limit because many feature combinations are unlikely to co-occur in the actual data. Although it takes a little time to manipulate the intersections and unions of the concept nodes, the concept lattice runs much faster than the Apriori algorithm when creating frequent itemsets for association rules.
In general, the frequent itemsets of spatial association rules exist at a high conceptual level, and it is difficult to uncover them at a low conceptual level. In particular, when the attributes are numeric and the mining is performed at the original conceptual level, strong association rules will not be generated if min_sup and min_conf are large, while many uninteresting rules will be produced if the thresholds are relatively small. In this case, the attributes need to be elevated to a higher conceptual level via attribute generalization, and the association rules are then extracted from the generalized data. The cloud model flexibly partitions the attribute space by simulating human language. Every attribute is treated as a linguistic variable; sometimes a linguistic variable is represented by two or more attributes together and is regarded as a multidimensional linguistic variable. For each linguistic variable, several language values are defined, and overlapping is allowed among neighboring language values. In an overlapping area, an identical attribute value may be assigned to different clouds with different probabilities. The cloud representing a linguistic variable can be assigned by the user interactively or acquired automatically by the cloud transform.
Cloud model-based attribute generalization can be a pre-processing step for mining association rules: the Apriori algorithm is implemented after the attributes are first generalized. This pretreatment is highly efficient because the time spent is linear in the size of the database. During attribute generalization, the database only needs to be browsed once, assigning each attribute value to the corresponding linguistic value of maximum membership. After attribute generalization, different tuples from the original data may be merged into one tuple if they become identical at the high concept level, which reduces the size of the database significantly. Also, a new attribute "count" is added to record how many original tuples are combined in a merged tuple. It is not a real attribute but serves to compute the support of the itemsets in the mining process. Due to the characteristics of the cloud and the overlapping between adjacent clouds, the value of "count" in the attribute generalization relationship table may differ slightly from run to run, while the final mining results remain stable.
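The following Python sketch illustrates this pre-processing step under simplifying assumptions: each numeric value is assigned to the linguistic cloud of maximum (randomly perturbed) membership in a single pass, and identical generalized tuples are merged with an added "count." The cloud parameters and records are illustrative only, not the values used in the experiment below.

```python
import math
import random
from collections import Counter

# Illustrative 1D clouds (Ex, En, He) for an "elevation" linguistic variable.
elevation_clouds = {'low': (200, 150, 15), 'medium': (800, 250, 25), 'high': (2500, 600, 60)}

def membership(x, Ex, En, He):
    En_d = abs(random.gauss(En, He))              # cloud-drop entropy
    return math.exp(-(x - Ex) ** 2 / (2 * En_d ** 2))

def generalize(x, clouds):
    """Map a numeric value to the linguistic value of maximum membership."""
    return max(clouds, key=lambda name: membership(x, *clouds[name]))

records = [(120.0,), (950.0,), (180.0,), (2600.0,), (900.0,)]   # toy tuples
generalized = Counter(tuple(generalize(v, elevation_clouds) for v in rec)
                      for rec in records)
for tup, count in generalized.items():
    print(tup, 'count =', count)     # merged tuples with the added "count"
```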
A geographical and economic database is used below as an experimental case to verify the feasibility and efficacy of cloud models in mining association rules. The item of interest in this case is the relationship between geospatial information and economic conditions. Because only a small amount of actual data was available, the actual data were used as a basis, and a large number of simulated records were produced by generating random numbers (10,000 records in total). The six attributes are x, y, z (elevation), road network density, distance to the sea, and per capita income. These attributes are all numeric; the positions x and y are in Cartesian coordinates, and the road density is expressed as road length per km². If the original digital data are used directly, it is difficult to discover the rules, so the cloud model is first used to generalize the attributes. Specifically, attributes x and y are treated together as a language variable called "location," for which eight two-dimensional language values are defined: "southwest," "northeast," "north-by-east," "southeast," "northwest," "north," "south," and "central," most of which are swing clouds. The remaining attributes are regarded as one-dimensional language variables, each with three language values: "low," "medium," and "high" for terrain, road network density, and per capita annual income; and "near," "medium," and "far" for the distance to the ocean. The language values "low" and "near" are semi-fall clouds, and "high" and "far" are semi-rise clouds. There is some overlapping between adjacent cloud models. Due to the irregular shape of the Chinese territory, the numerical characteristic values of the eight clouds expressing "location" are given manually, and the hyper-entropy is taken as 1/10,000 of the entropy. The remaining one-dimensional clouds are generated automatically by the above-mentioned method. After attribute generalization, the amount of data is considerably reduced. The generalized attributes are shown in Table 8.4. Depending on the characteristics of the cloud model, the value of "count" in the generalized attribute table is slightly different at different times, and the digits in the table represent average results over multiple generalizations.
With a min_sup of 6 % and a min_conf of 75 % in the generalized database, eight frequent four-itemsets are obtained. Eight association rules are discovered, in which the consequent is the per capita annual income and the antecedent is the combination of the other three attributes. The association rules are expressed in production form as follows:
• Rule 1: If the location is “southeast,” the road network density is “high,” and the
distance to the ocean is “near,” then the per capita annual income is “high.”
• Rule 2: If the location is “north-by-east,” the road network density is “high,”
and the distance to the ocean is “near,” then the per capita annual income is
“high.”
• Rule 3: If the location is “northeast,” the road network density is “high,” and
the distance to the ocean is “medium,” then the per capita annual income is
“medium.”
• Rule 4: If the location is “north,” the road network density is “medium,” and
the distance to the ocean is “medium,” then the per capita annual income is
“medium.”
• Rule 5: If the location is “northwest,” the road network density is “low,” and the
distance to the ocean is “far,” then the per capita annual income is “low.”
• Rule 6: If the location is “central,” the road network density is “high,” and
the distance to the ocean is “medium,” then the per capita annual income is
“medium.”
• Rule 7: If the location is “southwest,” the road network density is “low,” and the
distance to the ocean is “far,” then the per capita annual income is “low.”
• Rule 8: If the location is “south,” the road network density is “high,” and
the distance to the ocean is “medium,” then the per capita annual income is
“medium.”
These association rules are visualized in Fig. 8.3. Each rule is plotted as a color-gradient ellipse whose membership degree decreases gradually from the center of the oval outward, and the digits marked in the ovals are the rule
numbers. In order to find multi-level association rules, further generalization of the
location attribute by a virtual cloud is implemented. The combined cloud “west”
is from the “northwest” and the “southwest,” and the “central-south” is from the
“south” and “central.” Therefore, the eight rules in Fig. 8.3 are reduced to six rules
in Fig. 8.4, where rules 5 and 7 are merged as new rule 5, rules 6 and 8 are merged
as new rule 6, and the remaining rules 1, 2, 3, and 4 are unchanged.
• Rule 5: If the location is “west,” the road network density is “low,” and the dis-
tance to the ocean is “far,” then the per capita annual income is “low.”
• Rule 6: If the location is “central-south,” the road network density is “high,”
and the distance to the ocean is “medium,” then the per capita annual income is
“medium.”
The results indicate the efficacy of the cloud model as a pretreatment in association rule mining. The combination of the cloud model and the Apriori algorithm also demonstrates that the cloud model is capable of discovering association rules from numeric data at multiple conceptual levels.
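For illustration, the concept-raising step by a virtual cloud can be sketched as follows for one-dimensional clouds. The entropy-weighted combination used here is one commonly quoted form for merging two adjacent clouds in the cloud model; treat both the formula and the parameter values as assumptions rather than the book's exact procedure (the "location" clouds in the experiment are two-dimensional).

```python
# Hedged sketch: merge two adjacent 1D clouds into a broader linguistic term
# (e.g., "northwest" and "southwest" into "west") with an entropy-weighted
# combination of the numerical characteristics (Ex, En, He).
def combine_clouds(c1, c2):
    (Ex1, En1, He1), (Ex2, En2, He2) = c1, c2
    Ex = (Ex1 * En1 + Ex2 * En2) / (En1 + En2)
    En = En1 + En2
    He = (He1 * En1 + He2 * En2) / (En1 + En2)
    return Ex, En, He

northwest = (-60.0, 15.0, 1.5)       # illustrative (Ex, En, He) values
southwest = (-20.0, 12.0, 1.2)
print('west ->', combine_clouds(northwest, southwest))
```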
8.2 Spatial Distribution Rule Mining with Inductive Learning
race percentage, the mean income, the median income, the gender percentage, the mean age, and the median age. The learning granularity is a spatial point object (i.e., each bank). First, the spatial and non-spatial information of the multiple layers is agglomerated into the bank attribute table by spatial analysis, thereby forming the data for learning. Then, inductive learning uncovers the hidden relationship between the bank's operational income and a variety of geographic factors, and spatial analysis and deductive inference are combined to assess the banks' management status and to evaluate the selection of new bank locations. The GIS software utilized is ArcView with secondary development, and the inductive learning software is C5.0.
In order to facilitate the inductive learning, the bank operational income is graded as "good," "average," or "poor" based on the deposits in 1993 and 1994. If "deposits in 1994" > "deposits in 1993" and "deposits in 1994" > "median deposits in 1994," then the operational income is "good"; if "deposits in 1994" < "deposits in 1993" and "deposits in 1994" < "median deposits in 1994," then the operational income is "poor"; the remaining cases are "average." The results include 49 "average" banks and 34 banks each graded "good" and "poor," which are shown in different colors in Fig. 8.6.
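The grading criterion can be written directly as a small function; the argument names are hypothetical stand-ins for the 1993 deposits, 1994 deposits, and the 1994 median.

```python
# A small sketch of the grading criterion described above.
def grade(dep_1993, dep_1994, median_1994):
    if dep_1994 > dep_1993 and dep_1994 > median_1994:
        return 'good'
    if dep_1994 < dep_1993 and dep_1994 < median_1994:
        return 'poor'
    return 'average'

print(grade(3.2, 4.1, 3.8), grade(5.0, 4.2, 4.5), grade(2.0, 2.5, 3.0))
```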
Because bank profit has a pronounced relationship with how many other banks are in the area and how close they are, the distance to the nearest bank and the number of banks within a certain distance (1 km) are first calculated for each bank and added as two indicators to the bank attributes. Then, the closest distance to the road network and the attribute data of the location where the bank is situated are obtained via spatial analysis and added to the bank attribute table. In addition, the coordinates of the bank are also added to the bank attribute table. In this way, data mining is carried out on the newly agglomerated bank attribute table. There are a total of 23 attributes in the inductive learning; with the exception of the bank operational income attribute "class," which is a discrete value, the values of the other attributes are continuous.
The bank operational income is treated as the decision attribute (classifying attribute), and the other attributes are the condition attributes. Applying inductive learning to the bank attribute table yields a total of 22 rules (Table 8.5).
Table 8.5 The association rules between bank operational income and geographical factors
Rule 1 (Cover 19): PCT ASIAN > 3.06, AVG INC > 36483.52, DIST CLOSEST BANK > 0.663234 ⇒ class Good (0.857)
Rule 2 (Cover 4): SQ MILES ≤ 0.312, POP GROWTH > −6.62 ⇒ class Good (0.833)
Rule 3 (Cover 2): NO CLOSE BANK > 18, X COORD > 1065.441 ⇒ class Good (0.750)
Rule 4 (Cover 18): YEAR ESTABLISHED ≤ 1962, POP GROWTH > −6.62, PCT ASIAN > 0.88, X COORD > 1064.672 ⇒ class Good (0.700)
Rule 5 (Cover 17): YEAR ESTABLISHED > 1924, PCT BLACK ≤ 4.09 ⇒ class Good (0.526)
Rule 6 (Cover 8): PCT BLACK > 4.09, MED AGE > 35.43, X COORD ≤ 1064.672 ⇒ class Good (0.900)
Rule 7 (Cover 5): POP GROWTH ≤ −6.62, X COORD ≤ 1065.441 ⇒ class Average (0.857)
Rule 8 (Cover 4): YEAR ESTABLISHED ≤ 1965, PCT BLACK ≤ 4.09, PCT ASIAN ≤ 3.06 ⇒ class Average (0.833)
Rule 9 (Cover 4): PCT OTHER > 1.32, DIST CLOSEST BANK ≤ 0.376229 ⇒ class Average (0.833)
Rule 10 (Cover 4): YEAR ESTABLISHED ≤ 1924, POP GROWTH > −6.62, DIST CLOSEST BANK ≤ 0.179002 ⇒ class Average (0.833)
Rule 11 (Cover 4): POP GROWTH ≤ −6.62, NO CLOSE BANK ≤ 18 ⇒ class Average (0.833)
Rule 12 (Cover 9): 1951 < YEAR ESTABLISHED ≤ 1962, PCT ASIAN ≤ 0.88, AVG AGE > 31.34 ⇒ class Average (0.800)
Rule 13 (Cover 8): YEAR ESTABLISHED > 1951, MIN DIST ROAD > 0.093013, PCT BLACK > 4.09, X COORD > 1064.672 ⇒ class Average (0.800)
Rule 14 (Cover 7): YEAR ESTABLISHED > 1962, PCT BLACK > 4.09, DIST CLOSEST BANK ≤ 0.050138 ⇒ class Average (0.778)
Rule 15 (Cover 2): PCT BLACK ≤ 4.09, PCT MALE ≤ 42.71 ⇒ class Average (0.750)
Rule 16 (Cover 2): PCT ASIAN > 3.06, AVG INC ≤ 36483.52 ⇒ class Average (0.750)
Rule 17 (Cover 5): PCT ASIAN ≤ 3.06, MED AGE ≤ 35.43, DIST CLOSEST BANK > 0.376229, X COORD ≤ 1064.672 ⇒ class Bad (0.857)
Rule 18 (Cover 4): YEAR ESTABLISHED > 1960, PCT OTHER ≤ 1.32, MED AGE ≤ 35.43, X COORD ≤ 1064.672 ⇒ class Bad (0.833)
Rule 19 (Cover 3): PCT ASIAN ≤ 3.06, AVG AGE ≤ 31.34 ⇒ class Bad (0.800)
Rule 20 (Cover 3): 1924 < YEAR ESTABLISHED ≤ 1951, SQ MILES > 0.312, AVG AGE ≤ 36.22 ⇒ class Bad (0.800)
Rule 21 (Cover 20): YEAR ESTABLISHED > 1962, MIN DIST ROAD ≤ 0.093013, SQ MILES > 0.312, PCT BLACK > 4.09, PCT ASIAN ≤ 3.06, AVG AGE > 34.1, DIST CLOSEST BANK > 0.050138, X COORD > 1064.672 ⇒ class Bad (0.773)
Rule 22 (Cover 2): YEAR ESTABLISHED > 1981, PCT BLACK ≤ 4.09, PCT ASIAN > 0.82, PCT MALE > 42.71 ⇒ class Bad (0.750)
Evaluation of the learning result: 22 rules, 12 errors (10.3 %)
Classified as      (a)  (b)  (c)
(a) class Good      33    1    –
(b) class Average    3   42    4
(c) class Bad        3    1   30
The learning error rate is 10.3 %, for a learning accuracy rate of 89.7 %. These rules reveal the relationship between the operational income and the geographical factors of the Bank of Atlanta locations (Fig. 8.6; Table 8.5). These relationships are relatively complex: closely located banks may have "good" as well as "average" or "bad" operational income, and banks established at a similar time may likewise have "good" as well as "average" or "bad" operational income. Precise rules are formed by combining a number of factors. In rule 1, if the bank location has a proportion of Asian population greater than 3.06 %, the average annual income is greater than 36,483.52, and the distance to the nearest other bank is greater than 0.663234 km, then its operational income is "good"; the rule produces five bank cases at a confidence of 0.857. For rule 2, if the area where the bank is located is less than or equal to 0.312 square miles and the population growth rate is greater than −6.62 %, then the bank operational income is "good." Rule 4 reveals that if the bank was established in or before 1962, has a population growth rate greater than −6.62 %, an Asian population proportion greater than 0.88 %, and an x-coordinate greater than 1064.672 (located in the eastern part), then the bank operational income is "good." For rule 19, if the location where the bank is situated has a proportion of Asian population less than or equal to 3.06 % and the average age of the plot is less than or equal to 31.34 years, then the bank operational income is "bad."
When the two bank-related indicators (the distance to the closest bank and the number of nearby banks) are determined for the candidate locations, only existing banks are taken into consideration, as it is assumed there are no other new banks.
The rules obtained by inductive learning can then be applied to the attribute data of the new banks to predict a value of "good," "average," or "bad" for the operational income of each new bank. In the reasoning process, if several rules are activated simultaneously, the output category with the highest summed confidence is regarded as the final output; if no rule is activated, the default output value is "average." An overlay of the new bank layer and other relevant layers is then carried out, with different operational incomes shown in different colors, to produce a forecast map of the new banks' projected operational income, i.e., the new bank location assessment map shown in Fig. 8.8.
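A minimal sketch of this decision step is shown below: every activated rule contributes its confidence to its class, the class with the largest confidence sum is output, and "average" is returned when no rule fires. The two example rules only illustrate the (condition, class, confidence) representation; they are not the learned C5.0 rules of Table 8.5.

```python
from collections import defaultdict

# Illustrative rules: (condition predicate, class label, confidence).
rules = [
    (lambda b: b['pct_asian'] > 3.06 and b['avg_inc'] > 36483.52, 'good', 0.857),
    (lambda b: b['pop_growth'] <= -6.62, 'average', 0.857),
]

def predict(bank, rules, default='average'):
    """Sum confidences of the activated rules per class; fall back to default."""
    score = defaultdict(float)
    for condition, label, conf in rules:
        if condition(bank):
            score[label] += conf
    return max(score, key=score.get) if score else default

candidate = {'pct_asian': 4.0, 'avg_inc': 40000.0, 'pop_growth': 1.5}
print(predict(candidate, rules))
```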
In Fig. 8.8, it can be clearly observed that the northeast and the central urban
areas are the preferable geographical locations for the new bank location, but there
are only a few good plots in the south.
It should be noted that the above speculation about bank management and the location assessment results have only relative accuracy, which means they should be treated as a decision-making guide and reference only.
Various instances of uncertainty exist within the knowledge acquisition and application process. The first source of uncertainty is that the data involved in the learning are not rich enough. For example, if only the total deposits are known, they can be used to represent the operational income; however, since there is no information about the number of bank staff, it is not possible to depict the operational efficiency. Furthermore, if the bank depositor distribution data are not available, only the household statistics of the bank locations can be used in inductive learning in accordance with the proximity principle, which cannot account for situations where the place of work and the place of residence are not identical.
The second source of uncertainty lies in defining the operational income grades of "good," "average," and "poor" from the bank deposits in two consecutive years; different definitions will produce different learning results. In addition, the first uncertainty source is related to the second one: the more abundant the bank operational data are, the more reasonable the definition of the operational income will be; however, a great deal of operational data is not available for commercial and research purposes.
The third source of uncertainty is that the knowledge discovered by inductive learning consists of different rules with different degrees of coverage (the number of covered cases) and therefore different degrees of confidence when the knowledge is applied. On the other hand, the knowledge acquired by inductive learning may also be incomplete, so that no appropriate rule is available when reasoning about a small number of candidate locations during a location investigation; the only possible output then becomes the default value of "average."
The aforementioned example shows that inductive learning can uncover the hid-
den relationships between the bank operational income and a variety of geographi-
cal factors from the bank data and related layer data and that further speculation
about operational status and new location selection assessment can be made with
this knowledge. Despite the uncertainties that exist, these analyses and evaluation
results are significant information for bank management and decision-making. GIS spatial analysis also plays an important role in this example: techniques such as buffer analysis, distance analysis, and inclusion analysis are applied to extract the spatial information needed by the inductive learning from multiple layers. In turn, inductive learning provides knowledge for subsequent spatial analysis and decision-making, so that spatial analysis is no longer dependent on expert input. Inductive learning and spatial analysis thus promote each other and improve the intelligence level of spatial analysis and decision support. Although the example here is based on bank operational analysis and location selection assessment, combining inductive learning and spatial analysis is appropriate for a wide variety of evaluation and spatial decision support applications in the resources and facilities area. The quantitative description of uncertainty and its visualization in the combined process of GIS/inductive learning for SDM are topics worthy of further study.
8.3 Rough Set-Based Decision and Knowledge Discovery

The rough set provides a means for GIS attribute analysis and knowledge discovery under incompleteness via lower and upper approximations. Under rough sets, SDM is executed in the form of a decision-making table (Pawlak 1991).
Let the universe of discourse U be a finite and non-empty set, and suppose an arbitrary set X ⊆ U. Xᶜ is the complement set of X, and X ∪ Xᶜ = U. Let R ⊆ U × U be an equivalence relation on U. R(x) is the equivalence class of R that includes element x, U/R is the set of equivalence classes, composed of the disjoint subsets into which R partitions U, and (U, R) formalizes an approximation space. The definitions of the lower and upper approximations are as follows:
• Lower approximation (interior set) of X on U: Lr(X) = {x ∈ U | [x]R ⊆ X},
• Upper approximation (closure set) of X on U: Ur(X) = {x ∈ U | [x]R ∩ X ≠ ∅}.
If the approximation space is described in terms of regions, then
• Positive region of X on U: Pos(X) = Lr(X),
• Negative region of X on U: Neg(X) = U − Ur(X),
• Boundary region of X on U: Bnd(X) = Ur(X) − Lr(X).
The lower approximation Lr(X) is the set of spatial elements that definitely belong to the spatial entity X, while the upper approximation Ur(X) is the set of spatial elements that possibly belong to X. The difference between the upper approximation and the lower approximation is the uncertain boundary Bnd(X) = Ur(X) − Lr(X). It is impossible to decide whether an element in Bnd(X) belongs to the spatial entity because of the incompleteness of the information. Lr(X) is certainly "Yes," Neg(X) is surely "No," while both Ur(X) and Bnd(X) are uncertainly "Yes or No." With respect to an element x ∈ U, it is certain that x ∈ Pos(X) belongs to X in terms of its features and that x ∈ Neg(X) does not belong to X, while for x ∈ Bnd(X) it cannot be decided from the available information whether it belongs to X or not. Therefore, it can be seen that Lr(X) ⊆ X ⊆ Ur(X) ⊆ U, U = Pos(X) ∪ Bnd(X) ∪ Neg(X), and Ur(X) = Pos(X) ∪ Bnd(X) (Fig. 8.9).
X is definable if Lr(X) = Ur(X), while X is rough with respect to Bnd(X) if Lr(X) ≠ Ur(X). A subset X ⊆ U described by its lower approximation and upper approximation is called a rough set. The rough degree is

δ(X) = Rcard(Ur(X) − Lr(X)) / Rcard(X) = Rcard(Bnd(X)) / Rcard(X)

where Rcard(X) denotes the cardinality of set X. X is crisp when δ(X) = 0.
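For a toy universe and partition, the lower and upper approximations and the rough degree defined above can be computed directly, as in the following sketch.

```python
# Illustrative universe U, equivalence classes U/R, and target set X.
U = set(range(1, 9))
classes = [{1, 2}, {3, 4, 5}, {6}, {7, 8}]
X = {2, 3, 4, 5, 6}

lower = set().union(*(c for c in classes if c <= X))      # Lr(X)
upper = set().union(*(c for c in classes if c & X))       # Ur(X)
boundary = upper - lower                                   # Bnd(X)
negative = U - upper                                       # Neg(X)
rough_degree = len(boundary) / len(X)                      # δ(X) = Rcard(Bnd(X)) / Rcard(X)

print('Lr(X) =', sorted(lower), 'Ur(X) =', sorted(upper))
print('Bnd(X) =', sorted(boundary), 'Neg(X) =', sorted(negative))
print('rough degree =', rough_degree)
```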
8.3.1 Attribute Importance
Table 8.6 shows the relationships among the attributes location, terrain, and road network density, obtained by induction over mainland China; it is used to analyze the dependence and importance of an attribute (Li et al. 2006, 2013).
In Table 8.6, U consists of 13 objects {1, 2, …, 13}, and the attribute set is A = {location, terrain, road network density}. Objects 3 and 4 are indiscernible with respect to the attribute location, Objects 9 and 11 are indiscernible with respect to the attribute terrain, and so on. The partitioning generated by the attributes is as follows:
U/R(location) = {{1}, {2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}, {11}, {12, 13}}
U/R(terrain) = {{1, 4, 6, 8, 10, 13}, {2, 3, 5, 7, 12}, {9, 11}}
U/R(road network density) = {{1}, {2, 3, 5, 6, 7, 8, 12, 13}, {4, 9, 10, 11}}
U/R(location, terrain) = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}, {11}, {12}, {13}}
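These partitions can be computed mechanically by grouping objects with identical values on the chosen attributes; the sketch below does this for a small, made-up attribute table (not Table 8.6 itself).

```python
from collections import defaultdict

# Illustrative attribute table: object id -> attribute values.
table = {
    1: {'location': 'NE', 'terrain': 'high'},
    2: {'location': 'N',  'terrain': 'medium'},
    3: {'location': 'NW', 'terrain': 'high'},
    4: {'location': 'NW', 'terrain': 'high'},
}

def partition(table, attrs):
    """Group objects that are indiscernible on the given attributes."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())

print(partition(table, ['location']))             # U/R(location)
print(partition(table, ['location', 'terrain']))  # U/R(location, terrain)
```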
To study the temperature patterns in various regions, the characteristic rules are
summarized from the inductive results by further analyzing the numeric tempera-
ture data of major cities in China (Di 2001). The experimental data, taken from
The Road Atlas of China Major Cities, are the monthly mean temperatures of a
total of 37 cities. As the dispersion of the data is relatively large and difficult to use for extracting general rules, a histogram of the temperature data was first computed. For simplicity, the histogram equalization method was used to group the temperature data into three clusters: low (−19.4 to 9.8), medium (9.8 to 20.3), and high (20.3 to 29.6). The 12 months are grouped into four quarters, and the temperature of each quarter is the mean of its three months. The city names are summarized into regions based on their positions, such as north, northeast, south, northwest, and southwest. Table 8.7 shows the results. Because
there was a large number of tuples in the data table, the data were further sum-
marized (i.e., medium or low into medium-low and medium or high into medium-
high; Table 8.8). In Table 8.8, the northwest and southwest groups are identical,
which are further summarized as west. The fully summarized results are shown
in Table 8.9, from which the characteristic rules were uncovered; the results are
consistent with subjective feelings about the distribution of temperature patterns in
China.
– (location, north) ⇒ (spring, medium-low) ∧ (summer, high) ∧ (autumn, medium-low) ∧ (winter, low) ∧ (annual mean, medium-low)
– (location, northeast) ⇒ (spring, low) ∧ (summer, high) ∧ (autumn, low) ∧ (winter, low) ∧ (annual mean, low)
Then, each condition attribute is removed in turn to determine whether the decision table remains consistent, that is, whether the condition attribute is omissible relative to the decision attribute. First, the values of the attributes "spring," "autumn," and "annual mean" are identical for every record, so only one of them needs to be preserved. The decision table can then be simplified as Table 8.10, where the attribute "winter" is omissible because the decision table is still consistent after "winter" is removed; however, "spring" and "summer" are not omissible because the decision table would become inconsistent (identical conditions versus different decisions) if either of them were removed. Next, the values of the various condition attributes can be further simplified by removing the omissible attribute values, which are represented as "–"; the final simplified results are shown in Table 8.11, from which the minimum decision rules (distinguishing rules) are finally extracted.
– (spring, medium-low) ∧ (summer, high) ⇒ (location, north);
– (spring, low) ⇒ (location, northeast);
– (spring, medium-high) ⇒ (location, south);
– (summer, medium-high) ⇒ (location, west).
The data for this experiment are agricultural data from the statistical yearbook of
China’s 30 main provinces (cities and districts) from 1982 to 1990 (Di 2001; Li
et al. 2013).
Using scatter plots for exploratory data analysis, every province (city, district) is first observed intuitively for preliminary regular patterns, and its data are selected for further processing. Exploratory data analysis can reveal the relationship between one attribute and the others; for example, when the agricultural population increases, the general trend of total agricultural output also increases. However, it is difficult to discover more rules using scatter plots alone. It is also difficult to describe more rules with the relevant statistical methods because similar populations can vary greatly in their output, and a given output can correspond to very different population numbers. In addition, the regularity of the scatter plot of arable land area versus total agricultural output is poor, as is that of agricultural investment versus agricultural output. Thus, to study the relationship between a variety of agricultural factors and the total agricultural output, the
data for 1990 were selected for creating a decision table. The condition attributes
include arable land area, agricultural population, and agricultural investment; and
the decision attribute is the total agricultural output (Table 8.12).
As shown in Table 8.12, a cloud model and a rough set were combined to discover the implicit classification rules. First, a cloud model was used to generalize the attributes. The values of the province and city name attribute were generalized
into the region in which they are located: northeast, north, northwest, east, central,
south, and southwest. Other numeric attributes were generalized into three lin-
guistic values with the maximum variance method (Di 2001): “small,” “medium,”
and “big” for the arable land area and agricultural investment; “many,” “medium,”
and “few” for the agricultural population; and “high,” “medium,” and “low” for
the agricultural output. These numeric attributes were discretized and represented
with the cloud model. Table 8.13 shows the results of the generalized attributes, in
which a new attribute of “count” is added to record the number of merged records.
Based on Table 8.13, the initial decision rules were generated by a rough set.
U/R(region) = {{1, 2, 3}, {4, 5, 6, 7, 8, 9}, {10, 11, 12, 13, 14, 15}, {16, 17, 18, 19}, {20, 21, 22}, {23, 24, 25, 26}, {27, 28, 29, 30}}
U/R(arable land area) = {{1, 4, 5, 8, 27}, {2, 3, 9, 10, 11, 12, 14, 17, 19, 20, 21, 22, 23, 24, 29}, {6, 7, 13, 15, 16, 18, 25, 26, 28, 30}}
U/R(agricultural population) = {{1, 2, 9, 10, 11, 16, 20, 22, 24, 25, 28, 29}, {3, 6, 7, 12, 13, 14, 15, 18, 26, 30}, {4, 5, 8, 17, 19, 21, 23, 27}}
U/R(agricultural investment) = {{1, 2, 8, 14, 19, 20, 23, 27}, {3, 4, 5, 9, 11, 12, 16, 17, 18, 21, 26, 29}, {6, 7, 10, 13, 15, 22, 24, 25, 28, 30}}
U/R(agricultural output) = {{1, 2, 5, 16, 19, 20, 21, 22, 24, 25, 29}, {3, 6, 7, 9, 10, 11, 12, 13, 14, 15, 18, 26, 28, 30}, {4, 8, 17, 23, 27}}
The calculation of the positive regions is as follows:
Pos{region}(agricultural output) = {10, 11, 12, 13, 14, 15, 20, 21, 22}
Pos{arable land area}(agricultural output) = ∅
Pos{agricultural population}(agricultural output) = {3, 6, 7, 12, 13, 14, 15, 18, 26, 30}
Pos{agricultural investment}(agricultural output) = ∅
Pos{region, agricultural population, agricultural investment}(agricultural output) = {1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}
Pos{region, arable land area, agricultural population}(agricultural output) = Pos{region, agricultural population, agricultural investment}(agricultural output)
Pos{region, arable land area, agricultural population, agricultural investment}(agricultural output) = Pos{region, agricultural population, agricultural investment}(agricultural output)
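The positive regions above follow from a single rule: a condition-attribute equivalence class contributes to the positive region only if it lies entirely inside one decision class. A small sketch, with an illustrative decision table rather than the agricultural data of Table 8.13, is given below; the dependency degree γ is the ratio of the positive-region size to |U|.

```python
from collections import defaultdict

def partition(table, attrs):
    """Group objects that are indiscernible on the given attributes."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())

def positive_region(table, condition_attrs, decision_attr):
    """Union of condition-classes wholly contained in one decision class."""
    decision_classes = partition(table, [decision_attr])
    pos = set()
    for block in partition(table, condition_attrs):
        if any(block <= d for d in decision_classes):
            pos |= block
    return pos

# Illustrative decision table (not the agricultural data).
table = {
    1: {'population': 'many', 'investment': 'big',   'output': 'high'},
    2: {'population': 'many', 'investment': 'big',   'output': 'high'},
    3: {'population': 'few',  'investment': 'big',   'output': 'low'},
    4: {'population': 'few',  'investment': 'small', 'output': 'high'},
    5: {'population': 'few',  'investment': 'small', 'output': 'low'},
}
pos = positive_region(table, ['population', 'investment'], 'output')
print(sorted(pos), 'dependency =', len(pos) / len(table))  # γ < 1 means inconsistent
```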
As seen from the above results, the cardinality of the positive region is always smaller than the cardinality of the universe of discourse (30), so the dependency degree γ never reaches the value of 1; that is, the decision table is inconsistent. The reason for the inconsistency is that record 4 and record 5 share identical condition attributes but entirely different decision attributes. Also, the attribute agricultural population is important among the condition attributes, whereas the two attributes arable land area and agricultural investment are redundant for decision-making; therefore, one of them can be removed, and the attribute arable land area was reduced. Merging the identical records resulted in Table 8.14. For convenience, the tuples were renumbered.
To remove the redundant attribute values of the condition attributes in
Table 8.14, each condition attribute value within each record was carefully evalu-
ated to determine whether the removal of the value would change the decision-
making results; and if there was no change, then the value was redundant and
could be removed. For example, for record 8 and record 12, the removal of the
two attributes of agricultural population and agricultural investment did not
affect the agricultural output; therefore, all the attribute values of the two attrib-
utes were removed. By removing all redundant attribute values and merging
identical records in Table 8.14, the final simplified decision table was obtained
(Table 8.15).
To sum up, rough set is able to refine the decision table greatly without chang-
ing the results. Key elements are preserved but the omissible attributes are reduced,
which may significantly accelerate the speed of decision-making. If richer data are
collected, the discovered rules will be more abundant and rewarding.
8.4 Spatial Clustering
Spatial clustering assigns a set of objects to clusters on the basis of their observations so that objects within the same cluster are similar to one another and dissimilar to the objects in other clusters; in other words, it groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between different clusters. It is an unsupervised technique that requires no knowledge of what causes the grouping or how many groups exist (Grabmeier and Rudolph 2002; Xu and Wunsch 2005). Spatial clustering helps in the discovery of densely
populated regions by exploring spatial datasets. For big data, an effective cluster-
ing algorithm can handle noise properly, detect clusters of arbitrary shapes, and
perform stably and independently of expert experience and the order of the data
input (Fränti and Virmajoki 2006). Clustering algorithms may be implemented on
partition, hierarchy, distance, density, grid, model, etc. (Parsons et al. 2004; Zhang
et al. 2008; Horng et al. 2011; Silla and Freitas 2011).
1. Partition-based algorithms
These algorithms use an iterative distance-based clustering model, which relies on the distance from each object to the cluster center; they are suited to convex clusters that are far apart from each other and show only slight disparity in diameter (Ester et al. 2000). An example is Partitioning Around Medoids (PAM). Ng and Han (2002) proposed an improved k-medoid random search algorithm, the Clustering Large Applications Based on RANdomized Search (CLARANS) algorithm, which is a specialized SDM algorithm. Concept clustering is an extension of partitioning, which groups data into different clusters by a set of object-describing conceptual values, rather than by a geometric distance-based approach, to realize the similarity measurement between data objects (Pitt and Reinke 1988). The common numeric concept hierarchy algorithm discretizes the domain interval of numerical attributes, forming multiple sub-intervals as the leaf nodes of the conceptual level. It carries out conceptual segmentation for all the data in the numerical field, and each segment is represented by a concept value; then, all of the data in the original fields are substituted by the corresponding concept value of each segment, which creates a concept table. The supervised conceptualization methods include ChiMerge (Kerber 1992), Chi2 (Liu and Rudy 1997), minimum entropy, and minimum error; the unsupervised conceptualization methods include uni-interval, uni-frequency, and k-means.
2. Hierarchy-based algorithms
These algorithms iteratively decompose a dataset into the subsets of a tree diagram until every subset contains only one object; the construction proceeds by either splitting or merging. The result is a nested sequence
of clusters with a single, all-inclusive cluster at the top and single-point clus-
ters at the bottom. In the sequence, each cluster is nested into the next cluster.
Hierarchical clustering is either agglomerative or divisive. Agglomerative algo-
rithms start with each element as a disjoint set of clusters and merge them into
successively larger clusters (Sembiring et al. 2010). Divisive algorithms begin
with the whole set and proceed to divide it into successively smaller clusters
(Malik et al. 2010). Examples can be found in Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH) (Zhang et al. 1996), Clustering Using
Representatives CURE) (Guha et al. 1998), and Clustering In Quest (CLIQUE)
(Agrawal and Srikant 1994), all of which utilize the hierarchical algorithm and
the correlation between the dataset to achieve clustering. Accounting for both
interconnectivity and closeness in identifying the most similar pair of clus-
ters, CHAMELEON yields accurate results for highly variable clusters using
dynamic modeling (George et al. 1999); however, it is not suitable for group-
ing large volume data (Li et al. 2008). Some clustering algorithms were further
hybridized, such as clustering feature tree (CBCFT) hybridizing BIRCH with
CHAMELEON (Li et al. 2008), Bayesian hierarchical clustering for evaluating
marginal likelihoods of a probabilistic model (Heller and Ghahramani 2005),
and the support vector machine (Horng et al. 2011). Lu et al. (1996) proposed the automatic generation algorithms AGHC and AGPC for numeric conceptual hierarchies, based on hierarchical and partitioning clustering algorithms.
3. Density-based algorithms
The algorithms detect clusters by calculating the density of data objects nearby.
Because the density of data objects makes no assumptions on the shape of
clusters, density-based algorithms are able to determine the clusters of arbi-
trary shapes. Density-based Spatial Clustering of Applications with Noise
(DBSCAN) detects clusters by counting the number of data objects inside an
area of a given distance (Ester et al. 1996). The Statistical Information Grid-
based Method (STING) and STING+ are continuous improvements of the
DBSCAN (Wang et al. 2000). DENsity-based CLUstEring (DENCLUE) mod-
eled the overall density as the sum of the influential functions of data objects
and then used a hill-climbing algorithm to detect clusters inside the dataset
(Hinneburg and Keim 2003). DENCLUE has high time complexity; however,
more efficient algorithms are sensitive to noise or rely too much on parameters.
4. Grid-based algorithms
These algorithms quantize the feature space into a finite number of grid cells and then perform the calculations on the quantized cells. The performance of grid-based algorithms is independent of the number of data objects and depends only on the number of quantized cells in each dimension. For example, WaveCluster (Sheikholeslami et al. 1998) applies the wavelet transform to discover the dense regions in the quantized feature space. Quantizing the space into cells makes grid-based algorithms very efficient when handling large datasets; at the same time, however, it introduces a bottleneck in accuracy. In addition, there is mathematical morphology clustering for raster data (Li et al. 2006).
5. Model-based algorithms
The algorithms focus on fitting the distribution of given data objects with a
specific mathematic model. As an improvement of k-means, the EM algo-
rithm (Dempster et al. 1977) assigns each data object to a cluster according
to the mean of the feature values in that cluster. Its steps of calculation and
assignment repeat iteratively until the objective function obtains the required
precision. Knorr and Ng (1996) discovered clustering proximity and common
characteristics. Ester et al. (1996) employed clustering methods to investigate
the category interpretation knowledge of mining in large spatial databases.
Tung et al. (2001) proposed an algorithm dealing with the obstruction of a river
and highway barrier in spatial clustering of SDM. Murray and Shyy (2000)
proposed an interactive exploratory clustering technique.
Fig. 8.10 Comparison of results from different clustering algorithms: a original data set, b data field (5 clusters), c k-means (5 clusters), d BIRCH (5 clusters), e CURE (5 clusters), f CHAMELEON (5 clusters)
are selected and their weights are determined by Delphi hierarchical processing (Wang and Wang 1997); then, the memberships are evaluated and the fuzzy comprehensive evaluation matrix X is obtained. Second, the comprehensive evaluation matrix of all the factors is determined by composing the factor weight matrix A with X, Y = A ∘ X, and the grading evaluation results of the assessed spatial entities are thereby integrated into the comprehensive grading evaluation matrix Y. Third, the elements of the comprehensive grading evaluation matrix are used to obtain the fuzzy similar matrix R and the fuzzy equivalent matrix (transitive closure matrix) t(R). Finally, according to t(R), the maximum remainder method or the mean absolute distance method, both based on fuzzy confidence, is used in clustering to obtain the final clustering knowledge.
Suppose that the set of influential factors is U = {u1, u2, …, um}, with the
matrix of weights A = (a1, a2, …, am)T, and the set of sub-factors ui = {ui1, ui2,
…, uiki} with the matrix of weights Ai = (ai1, ai2, …, aiki)T (i = 1, 2, …, m) simul-
taneously (Fig. 8.11). The set of grades is V = {v1, v2, v3, …, vn}, including n
grades.
In the context of the given grades, the fuzzy membership matrix of the sub-factors of factor ui may be described as Xi = (xjk), a ki × n matrix in which xjk (j = 1, 2, …, ki; k = 1, 2, …, n) is the membership of sub-factor uij to grade vk.
Take the elements yij (i = 1, 2, …, l; j = 1, 2, …, n) of the matrix Yl×n as the original data; the fuzzy similar matrix R = (rij)l×l, which indicates the fuzzy similarity relationships among the entities, can then be created as

rij = Σ_{k=1}^{n} (yik × yjk) / [(Σ_{k=1}^{n} yik^2)^{1/2} × (Σ_{k=1}^{n} yjk^2)^{1/2}]
The fuzzy similar matrix R = (rij)l×l needs to be changed into the fuzzy equivalent matrix t(R) for clustering. The fuzzy matrix used for clustering is the fuzzy equivalent matrix t(R), which expresses a fuzzy equivalence relationship among the entities instead of the fuzzy similarity relationships. A fuzzy equivalent matrix satisfies the three characteristic conditions of reflexivity, symmetry, and transitivity, whereas the fuzzy similar matrix R usually satisfies reflexivity and symmetry but not transitivity. The fuzzy similar matrix R may be changed into the fuzzy equivalent matrix t(R) via the self-squaring method, for there must be a minimum natural number k (k = 1, 2, …, l, and k ≤ l) such that t(R) = R^(2^k) if and only if R^(2^k) = R^(2^(k+1)) = R^(2^(k+2)) = ⋯ = R^(2^l) (Wang and Klir 1992).
t(R)l×l = (tij)l×l
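The two matrix steps can be sketched as follows with NumPy: the fuzzy similar matrix R is built from Y with the cosine-style formula above, and R is squared repeatedly under max-min composition until it stops changing, giving t(R); entities whose t(R) entry reaches the confidence level α then fall into the same cluster. The matrix Y and α below are illustrative.

```python
import numpy as np

# Illustrative comprehensive grading evaluation matrix Y (l x n).
Y = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.6, 0.3, 0.1, 0.0],
              [0.1, 0.2, 0.3, 0.4],
              [0.1, 0.1, 0.4, 0.4]])

def fuzzy_similarity(Y):
    """Cosine-style fuzzy similar matrix R = (r_ij)."""
    norms = np.sqrt((Y ** 2).sum(axis=1))
    return (Y @ Y.T) / np.outer(norms, norms)

def max_min_compose(A, B):
    """Max-min composition of two fuzzy matrices."""
    return np.max(np.minimum(A[:, :, None], B[None, :, :]), axis=1)

def transitive_closure(R):
    """Self-squaring until R no longer changes, giving t(R)."""
    while True:
        R2 = max_min_compose(R, R)
        if np.allclose(R2, R):
            return R
        R = R2

R = fuzzy_similarity(Y)
tR = transitive_closure(R)
alpha = 0.99
clusters = (tR >= alpha).astype(int)   # entities with t(R) >= alpha share a cluster
print(np.round(tR, 3))
print(clusters)
```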
The fuzzy confidence level α is the fuzzy probability that two or more entities belong to the same cluster. Under the fuzzy confidence level α, the maximum remainder algorithm is described as follows:

(1) T_max^(1) = max(T1, T2, …, Tl), K_j^(1) = Tj / T_max^(1)

The resulting grade is furthermore the grade of all the entities in cluster Z.
Fuzzy comprehensive clustering was then applied as a case study to land evaluation in Nanning City for 20 land parcels with different, stochastically distributed characteristics (Table 8.16; Fig. 8.12). The set of grades is V = {I, II, III, IV}.
Before the land evaluation process began, the influential factors were selected and weighted using Delphi hierarchical processing in the context of the local reality and the experience of experts. The trapezoidal fuzzy membership function was chosen to compute the membership of each influential factor to every grade. Then, the matrix of fuzzy comprehensive evaluation Y20×4 was obtained (Table 8.16).
Based on the matrix Y20×4, traditional fuzzy comprehensive evaluation determines the land grades by the maximum fuzzy membership (Table 8.17).
Comparatively, fuzzy comprehensive clustering further takes the elements yij (i = 1, 2, …, 20; j = 1, 2, 3, 4) of Y20×4 as the original data to create the fuzzy similar matrix R20×20, from which the fuzzy equivalent matrix t(R20×20) is computed. With a clustering threshold of 0.99, the clustering process based on the maximum remainder algorithms is shown in Table 8.18. In the end, the grade of each land parcel is obtained via the maximum characteristics algorithms (Table 8.18), and the grades are also mapped in Fig. 8.12.
A comparison of the results of the proposed comprehensive clustering
(Table 8.19; Fig. 8.12) and the single fuzzy comprehensive evaluation (Table 8.17)
is presented below.
The results of the traditional fuzzy comprehensive evaluation did not match the
reality of Nanning city for some of the land parcels; for example, the following
errors were made for land parcels 14, 9, and 20:
– Land parcel 14, which is grade I, was misidentified as grade II. Land parcel 14
is the Dragon Palace Hotel and is located in the city center with shops, infra-
structure, and large population density.
– Land parcel 9, which is grade II, was misidentified as grade I. Land parcel 9
is the Provincial Procuratorate, which is obstructed from the city center by the
Yong River.
– Land parcel 20, which is grade IV, was misidentified as grade II. Land parcel 20
is the Baisha Paper Mill, which is in a suburban area with bad infrastructure.
These errors occurred because traditional fuzzy comprehensive evaluation cannot differentiate close membership values, especially when the difference between the first and second maximum memberships is small. When the grade with the maximum membership value is chosen as the land grade, some important land information hidden in the land's influential factors may be lost, such as obstacles formed by a river, lake, road, or railway (Tung et al. 2001) or the spread of land position advantage. The subordinate grade is cut off once the maximum membership value is chosen, even though the maximum membership may represent only the essential grade of a land parcel. For example, as can be seen from the matrix Y20×4 (Table 8.17), land parcel 20, the Baisha Paper Mill, has a membership of 0.622 for grade II and 0.604 for grade IV. This dilemma cannot be overcome by fuzzy comprehensive evaluation alone.
Table 8.18 The clustering process of the maximum remainder algorithms (for each land parcel, the values T(p), K(p), and the cluster joined at each step p = 1, 2, 3, 4)

(3) K_j^(p) = T_j^(p) / T_max^(p), j = 1, 2, …, 20; p = 1, 2, 3, 4
Steps:
(1) Initialize i = 1;
(2) Construct the circular structural element Bi with radius i;
(3) Calculate the closing operation Yi = X · Bi;
(4) Count the connected regions in Yi, i.e., the number of clusters ni; if ni > 1, then set i = i + 1 and go to step (2); otherwise, go to step (5);
(5) Calculate the optimal number of clusters nk according to ni and obtain the radius k of the corresponding structural element;
(6) Calculate Y = X · Bk; and
(7) Show the connected regions in Y with a distinct color for each cluster.
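The steps above can be sketched with SciPy's binary morphology routines. The sketch below assumes a binary point image X (non-zero pixels are data points) and uses a simplified rule for the optimal cluster number, keeping the count that persists over the largest run of radii; the function names and parameters are illustrative, not the original implementation.

```python
import numpy as np
from scipy import ndimage

def disk(radius):
    """Circular structural element B_i with the given radius."""
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (xx ** 2 + yy ** 2) <= radius ** 2

def mmc_clustering(X, max_radius=50):
    """MMC steps above on a binary point image X: enlarge the structural
    element until the closing yields a single connected region, track the
    cluster counts, and keep the radius whose count persists longest
    (a simplified rule for the optimal number of clusters)."""
    counts = []
    for i in range(1, max_radius + 1):
        Y_i = ndimage.binary_closing(X, structure=disk(i))   # Y_i = X . B_i
        _, n_i = ndimage.label(Y_i)                          # connected regions in Y_i
        counts.append(n_i)
        if n_i <= 1:
            break
    values, runs = np.unique(counts, return_counts=True)
    n_k = int(values[np.argmax(runs)])        # optimal cluster number (assumed rule)
    k = counts.index(n_k) + 1                 # radius of the corresponding element
    labels, _ = ndimage.label(ndimage.binary_closing(X, structure=disk(k)))
    return labels, n_k                        # labelled clusters, one label per cluster
```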
In data a, the cluster diameters differ greatly while the distances between the clusters are relatively small; data b and c contain concave clusters; and data d is data c with extra noise added. For data a, b, c, and d, MMC discovered the clustering results, the clustering boundaries, and the holes.
8.5 Landslide Monitoring
Fig. 8.14 SDM views of the landslide monitoring dataset and their pan-hierarchical relationship
Three basic standpoints are considered for the landslide monitoring dataset: monitoring point, monitoring date, and moving direction of the landslide deformation. Each basic standpoint has two values, same and different.
• Point-Set: {same monitoring point, different monitoring point}
• Date-Set: {same monitoring date, different monitoring date}
• Direction-Set: {same moving direction, different moving direction}
The different combinations of these three basic standpoints may produce all the
SDM views of landslide monitoring data mining. The number of SDM views is
$$C_2^1 \cdot C_2^1 \cdot C_2^1 = 8 \quad (8.2)$$
Among these eight SDM views, each view has a different meaning when mining
a dataset from a specific perspective. All the SDM views show a pan-hierarchical
relationship (Fig. 8.14) by using a cloud model.
• View I: same monitoring point, same monitoring date, and same moving
direction
• View II: same monitoring point, same monitoring date, and different moving
direction
• View III: same monitoring point, different monitoring date, and same moving
direction
• View IV: same monitoring point, different monitoring date, and different mov-
ing direction
• View V: different monitoring point, same monitoring date, and same
moving-direction
• View VI: different monitoring point, same monitoring date, and different mov-
ing direction
• View VII: different monitoring point, different monitoring date, and same mov-
ing direction
• View VIII: different monitoring point, different monitoring date, and different
moving direction
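The eight SDM views are simply the Cartesian product of the three two-valued standpoints. A minimal sketch that enumerates them (the numbering follows the order of views I to VIII listed above):

```python
from itertools import product

# The three basic standpoints, each taking the two values "same"/"different"
point_set = ["same monitoring point", "different monitoring point"]
date_set = ["same monitoring date", "different monitoring date"]
direction_set = ["same moving direction", "different moving direction"]

# C(2,1) x C(2,1) x C(2,1) = 8 combinations, i.e., SDM views I-VIII in the order above
views = list(product(point_set, date_set, direction_set))
for number, view in enumerate(views, start=1):
    print(f"View {number}: " + ", ".join(view))
```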
In Fig. 8.14, from central SDM view I to outer view VIII, the observed distance
expands farther and farther while the mining granularity gets bigger and bigger.
In contrast, from the outer SDM view VIII to central view I, the observed distance
becomes closer and closer while the mining granularity gets smaller and smaller.
This occurs because SDM views I, II, III, and IV focus on the different attributes
of the same monitoring point and subsequently discover the individual knowledge
of the landslide in a conceptual space; in contrast, SDM views V, VI, VII, and VIII
pay attention to the different characteristics of multiple monitoring points and may
discover the common knowledge of the landslide in a characteristic space.
In the eight SDM views, SDM view I is the basic SDM view because the other
seven SDM views may be derived from it when one, two, or three basic stand-
points of SDM view I change. While the landslide monitoring data are mined in
the context of SDM view I, the objective is a single datum of an individual moni-
toring point on a given date. If three moving-directions dx, dy, dh are taken as
a tuple (dx, dy, dh), then SDM views IV, VI, and VIII also may be derived from
view II when the monitoring point and/or the monitoring-date change. Thus, SDM
view II is called the basic composed SDM view where the focus is on the total dis-
placement of an individual monitoring point on a given date. In the visual field of
the basic SDM view or basic composed SDM view, the monitoring data are single
isolated data instead of piles of data. In SDM, they are only the fundamental units
or composed units but are not the final destination. So the practical SDM views
are the remaining six SDM views III, IV, V, VI, VII, and VIII. Tables 8.20 and
8.21 describe the examples of the basic SDM view and basic composed SDM view
from cloud model and data field.
Because the Yangtze River flows in the west-east direction, the moving direction of the Baota landslide is geologically north-south, and the displacements of the landslide mainly appear in the X-direction (north-south), which matches the conditions of SDM view III.
It is noted that all of the spatial knowledge that follows was discovered from the
databases with the properties of dx, dy, and dh; and the properties of dx constitute
the major examples. The linguistic terms of different displacements on dx, dy, and
dh may be depicted by the pan-concept hierarchy tree (Fig. 8.15) in the conceptual space, which was formed by the cloud model (Fig. 8.16). For example, the nodes "very small" and "small" both share the same node "9 mm around."
$$\mu(dx_i) = \frac{dx_i - \min(dx)}{\max(dx) - \min(dx)}$$
where, max(dx) and min(dx) are the maximum and minimum of dx = {dx1, dx2,
…, dxi, …, dxn}. Then, the rules for Baota landslide monitoring in X direction can
be discovered from the databases in the conceptual space (Table 8.22).
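A minimal sketch of the min-max normalization above, applied to the X-direction displacements; the example values are illustrative only.

```python
import numpy as np

def membership(dx):
    """Min-max normalization of the X-direction displacements:
    mu(dx_i) = (dx_i - min(dx)) / (max(dx) - min(dx))."""
    dx = np.asarray(dx, dtype=float)
    return (dx - dx.min()) / (dx.max() - dx.min())

# Illustrative values only: membership([3.0, 9.2, 15.5, 40.1]) ~ [0.0, 0.167, 0.337, 1.0]
```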
8.5.3.1 Rules Visualization
Fig. 8.17 Microcosmic knowledge on all landslide-monitoring points
Fig. 8.18 Mid-cosmic knowledge on landslide-monitoring cross-sections
Fig. 8.19 Macrocosmic knowledge
Fig. 8.20 Exception of landslide
small-scale landslide disaster. Actually, the terrain of the Baota landslide is steep, with a chair-like profile that is precipitous at the top. These landslide characteristics are consistent with the above knowledge, which indicates that the internal forces, including the landslide material character, geological structure, and gradient, are the main causes of the formation of landslide disasters. At the same time, the information from patrolling the landslide area, the natural reality in the area, and the spatial knowledge discovered from the SDM views are very similar.
All the above landslide-monitoring points may further create their data field and
isopotential lines spontaneously in a characteristic space. Intuitively, these points
are grouped into clusters to represent different kinds of spatial objects naturally.
Figure 8.21 visualizes the clustering graph from the landslide monitoring points’
potential on dx in the characteristic space.
In Fig. 8.21, all the points’ potentials form the potential field and the isopoten-
tial lines spontaneously. When the hierarchy increases from Level 1 to Level 5,
i.e., from the fine granularity world to the coarse granularity world, these landslide-monitoring points are naturally grouped into different clusters at the different hierarchical levels, as follows.
• No clusters at the hierarchy of Level 1. The displacements of landslide monitor-
ing points are separate at the lowest hierarchy.
• Four clusters at the hierarchy of Level 2: cluster BT14, cluster A (BT13, BT23,
BT24, BT32, BT34), cluster B (BT11, BT12, BT22, BT31, BT33), and cluster
BT21. At the lower hierarchy, the displacements of the landslide-monitoring points (BT13, BT23, BT24, BT32, BT34) share the trend "the displacements are small," and (BT11, BT12, BT22, BT31, BT33) share the trend "the displacements are big"; BT14 and BT21 show trends different from both groups and from each other, i.e., the exceptions "the displacement of BT14 is smaller" and "the displacement of BT21 is extremely big."

Fig. 8.21 Clustering graph from the landslide-monitoring points' potential. a All points' cluster graph. b Points' cluster graph without exceptions
• Three clusters at the hierarchy of Level 3: cluster BT14, cluster (A, B) and clus-
ter BT21. When the hierarchy increases, the displacements of the landslide-
monitoring points (BT13, BT23, BT24, BT32, BT34) and (BT11, BT12, BT22,
BT31, BT33) have the same trend of “the displacements are small,” however,
BT14 and BT21 are still unable to be grouped into this trend.
• Two clusters at the hierarchy of Level 4: cluster (BT14, (A, B)) and cluster
BT21. When the hierarchy increases, the displacements of landslide-monitoring point BT14 can be grouped into the same trend as (BT13, BT23, BT24, BT32, BT34) and (BT11, BT12, BT22, BT31, BT33), i.e., "the displacements are small"; however, BT21 is still an outlier.
• One cluster at the hierarchy of Level 5: cluster ((BT14, (A, B)), BT21). The displacements of the landslide-monitoring points are unified at the highest hierarchy, i.e., the landslide is moving.
These clusters show different "rules plus exceptions" at the different steps from the fine granularity world to the coarse granularity world. The clustering between attributes at different cognitive levels produces many combinations, showing the discovered knowledge with different information granularities. When the exceptions BT14 and BT21 are eliminated, the rules and the clustering process become more obvious (Fig. 8.21b). Simultaneously, these clusters represent different kinds of landslide-monitoring points recorded in the database, and they naturally form the cluster graphs shown in Fig. 8.21. As can be seen in these two figures, the displacements of the landslide-monitoring points (BT13, BT23, BT24, BT32, BT34) and (BT11, BT12, BT22, BT31, BT33) first compose two new clusters, cluster A and cluster B. Then, the two new clusters compose a larger cluster with cluster BT14, and they finally compose the largest cluster with cluster BT21; throughout this process, the mechanism of SDM remains "rules plus exceptions." In other words, SDM provides particular views for searching the spatial database on the Baota landslide displacements at different distances only, and a longer distance leads to more meta-knowledge discovery.
In summary, the above knowledge discovered from the Baota landslide monitoring dataset is closer to human thinking in decision-making. Moreover, it is consistent with the results in the references (Wang 1999; Zeng 2000; Jiang and Zhang 2002), and the discovered knowledge very closely matches the actual facts. When the Committee of the Yangtze River (Zeng 2000) investigated the region where the Baota landslide occurred, they determined that the landslide had moved toward the Yangtze River, and near the landslide-monitoring point BT21, a small landslide hazard had taken place. Currently, two large rifts remain; the wall rift of farmer G. Q. Zhang's house is nearly 15 mm wide. Therefore, SDM in landslide disaster is not only practical and realistic but essential.
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of
international conference on very large databases (VLDB), Santiago, Chile, pp 487–499
Clementini E, Felice PD, Koperski K (2000) Mining multiple-level spatial association rules for
objects with a broad boundary. Data Knowl Eng 34:251–270
Di KC (2001) Spatial data mining and knowledge discovering. Wuhan University Press, Wuhan
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM
algorithm. J R Stat Soc B 39(1):1–38
Ester M et al (1996) A density-based algorithm for discovering clusters in large spatial databases
with noise. In: Proceedings of the 2nd international conference on knowledge discovery and
data mining. AAAI Press, Portland, pp 226–231
Ester M et al (2000) Spatial data mining: databases primitives, algorithms and efficient DBMS
support. Data Min Knowl Discovery 4:193–216
Fränti P, Virmajoki O (2006) Iterative shrinking method for clustering problems. Pattern Recogn
39(5):761–765
George K, Han EH, Kumar V (1999) CHAMELEON: a hierarchical clustering algorithm using
dynamic modeling. IEEE Comput 27(3):329–341
Grabmeier J, Rudolph A (2002) Techniques of clustering algorithms in data mining. Data Min
Knowl Discovery 6:303–360
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases.
In: Proceedings of the ACM SIGMOD international conference on management of data.
ACM Press, Seattle, pp 73–84
Han JW, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. Morgan
Kaufmann Publishers Inc., Burlington
Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Proceedings of the 22nd
international conference on machine learning, Bonn, Germany
Hinneburg A, Keim D (2003) A general approach to clustering in large databases with noise.
Knowl Inf Syst 5:387–415
Horng SJ et al (2011) A novel intrusion detection system based on hierarchical clustering and
support vector machines. Expert Syst Appl 38(1):306–313
Jiang Z, Zhang ZL (2002) Model recognition of landslide deformation. Geomatics Inf Sci Wuhan
Univ 27(2):127–132
Kerber R (1992) ChiMerge: discretization of numeric attributes. In: Proceedings of AAAI-92, the
9th international conference on artificial intelligence. AAAI Press/The MIT Press, San Jose,
pp 123–128
Knorr EM, Ng RT (1996) Finding aggregate proximity relationships and commonalities in spatial
data mining. IEEE Trans Knowl Data Eng 8(6):884–897
Li DR, Wang SL, Li DY (2006) Theory and application of spatial data mining, 1st edn. Science
Press, Beijing
Li J, Wang K, Xu L (2008) Chameleon based on clustering feature tree and its application in cus-
tomer segmentation. Ann Oper Res 168(1):225–245
Li DR, Wang SL, Li DY (2013) Theory and application of spatial data mining, 2nd edn. Science
Press, Beijing
Liu H, Rudy S (1997) Feature selection via discretization. IEEE Trans Knowl Discovery Data
Eng 9(4):642–645
Lu H, Setiono R, Liu H (1996) Effective data mining using neural networks. IEEE Trans Knowl
Data Eng 8(6):957–961
MacQueen J (1967) Some methods for classification and analysis of multivariate observations.
In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability,
vol 1, Berkeley, CA, pp 281–297
Malik HH et al (2010) Hierarchical document clustering using local patterns. Data Min Knowl
Discovery 21:153–185
Murray AT, Shyy TK (2000) Integrating attribute and space characteristics in choropleth display
and spatial data mining. Int J Geogr Inf Sci 14(7):649–667
Ng R, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE
Trans Knowl Data Eng 14(5):1003–1016
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review.
SIGKDD Explor 6(1):90–105
Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic
Publishers, London
Pitt L, Reinke RE (1988) Criteria for polynomial time (conceptual) clustering. Mach Learn
2(4):371–396
Sembiring RW, Zain JM, Embong A (2010) A comparative agglomerative hierarchical clustering
method to cluster implemented course. J Comput 2(12):1–6
Sheikholeslami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering
approach for very large spatial databases. In: Proceedings of the 24th very large databases
conference (VLDB 98), New York, NY
Silla CN Jr, Freitas AA (2011) A survey of hierarchical classification across different application
domains. Data Min Knowl Discovery 22:31–72
Tung A et al (2001) Spatial clustering in the presence of obstacles. IEEE Trans Knowl Data Eng
359–369
Wang SQ (1999) Landslide monitor and forecast on the three gorges of Yangtze River.
Earthquake Press, Beijing
Wang SL (2002) Data field and cloud model based spatial data mining and knowledge discovery.
PhD thesis. Wuhan University, Wuhan
Wang SL et al (2004) Rough spatial interpretation. Lecture notes in artificial intelligence, 3066,
pp 435–444
Wang SL, Chen YS (2014) HASTA: a hierarchical-grid clustering algorithm with data field. Int
J Data Warehouse Min 10(2):39–54
Wang ZY, Klir GJ (1992) Fuzzy measure theory. Plenum Press, New York
Wang XZ, Wang SL (1997) Fuzzy comprehensive method and its application in land grading.
Geomatics Inf Sci Wuhan Univ 22(1):42–46
Wang J, Yang J, Muntz R (2000) An approach to active spatial data mining based on statistical
information. IEEE Trans Knowl Data Eng 12(5):715–728
Wang SL, Gan WY, Li DY, Li DR (2011) Data field for hierarchical clustering. Int J Data
Warehouse Min 7(4):43–63
Wang SL, Fan J, Fang M, Yuan HN (2014) HGCUDF: hierarchical grid clustering using data
field. Chin J Electr 23(1):37–42
Xie ZP (2001) Concept lattice-based knowledge discovery. PhD thesis. Hefei University of
Technology, Hefei
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw
16(3):645–678
Zaki MJ, Ogihara M (2002) Theoretical foundations of association rules. In: Proceedings of the
3rd SIGMOD workshop on research issues in data mining and knowledge discovery, pp 1–8
Zeng XP (2000) Research on GPS application to landslide monitoring and its data processing,
dissertation of master. Wuhan University, Wuhan
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very
large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on
management of data, Montreal, Canada, pp 103–114
Chapter 9
Remote Sensing Image Mining
Remote sensing (RS) images are one of the main spatial data sources. The vari-
ability of image data types and their complex relationships are the main difficul-
ties that face SDM. This chapter covers a combination of inductive learning and
Bayesian classification to classify RS images, the use of rough sets to describe and
classify images and extract thematic information; image retrieval based on spatial
statistics; and image segmentation, facial expression analysis, and recognition uti-
lizing cloud model and data field mapping. Potential applications the authors have
explored are provided in this chapter as well, such as using the brightness of night-
time light imagery data as a proxy for evaluating freight traffic in China, assessing
the severity of the Syrian Crisis, and indicating the dynamics of different countries
in global sustainability.
(3) Mixed pixels: Due to the limited spatial resolution, a single pixel in the image may actually be a synthetic mixture of multiple ground features.
(4) Temporal changes: Over time, RS image data of the same geographical position may incur complexity and uncertainty induced by information changes due to differences in climate, conditions, human activities, etc.
(5) The mutual spatial distribution relationships among unit spaces in RS images are complicated rather than simple mathematical relationships.
Image enhancement integrates visual characteristics and rough sets (Wu 2004).
To suppress noise simultaneously when enhancing images, the noise of a pixel is
Fig. 9.1 Mean filtering results with rough sets. a Original image. b Mean shift filter. c Mean
shift filter with rough set
defined as C2 = {0, 1}, where 0 denotes a noise pixel and 1 a non-noise pixel. The
equivalence relationship of noise is defined as
$$R_n(S) = \bigcup_i \bigcup_j \left\{ S_{ij} \,\middle|\, \left| m(S_{ij}) - m(S_{i\pm1,j\pm1}) \right| > Q \right\}$$
where Rn(S) is the set of all noisy subblocks, Si±1,j±1 denotes the subblocks adjacent to subblock Sij, and Sij and Si±1,j±1 are generalized into a macro block. Q is a threshold value; that is, Sij belongs to Rn(S) if the absolute difference between its mean gray value m(Sij) and the mean gray value of its adjacent Si±1,j±1 is larger than Q.
All of the above partitioned sub-images are merged to obtain A1 = Rt(X) − Rn(S) and A2 = R~t(X) − Rn(S). After the noise is eliminated, A1 and A2 represent the set of all pixels with larger gradients and the set of all pixels with smaller gradients, respectively, both of which need to be enhanced according to their characteristics. To supplement the sub-images, the noise pixels are filled with the mean gray value of the macro blocks. In sub-image A1, the pixels with smaller gradients are filled by using L/2 (L is the maximum gray value; for example, L = 255). In sub-image A2, the pixels with larger gradients are filled by using L/2.
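A minimal sketch of the noise subblock detection above, assuming the image is partitioned into square subblocks; the block size and the threshold Q are illustrative, and the subsequent filling and enhancement steps are not included.

```python
import numpy as np

def noisy_subblocks(image, block=4, Q=20):
    """Flag the noisy subblocks S_ij of a gray image according to the
    equivalence relation above: S_ij is noise when the absolute difference
    between its mean gray value and the mean gray value of its adjacent
    subblocks exceeds the threshold Q (block size and Q are illustrative)."""
    h, w = image.shape
    rows, cols = h // block, w // block
    means = image[:rows * block, :cols * block].reshape(
        rows, block, cols, block).mean(axis=(1, 3))          # m(S_ij)
    noisy = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            nb = means[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            neighbour_mean = (nb.sum() - means[i, j]) / (nb.size - 1)
            noisy[i, j] = abs(means[i, j] - neighbour_mean) > Q
    return noisy   # R_n(S): True marks a noise subblock
```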
When enhancing images, an exponential transformation is performed for A1 to
stretch the gray values to increase the overshoot on both sides of the image edge.
The noise is also amplified, although this is not perceptible to the visual characteristics. A histogram
equalization transform is implemented for A2 to decrease the overshoot on both
sides of the edge of the noisy pixels. Noisy pixels with smaller frequencies are
merged, thereby weakening the noise. Figure 9.2 is an enhanced image produced
by combining the visual characteristics and rough sets. Figure 9.2a–c shows, respectively, the original image (TM), the conventional enhancement result of histogram equalization, and the enhancement result of combining the visual characteristics and rough sets. The result in (c) is obviously better than that in (b).
Fig. 9.2 Image enhancement using visual characteristics and rough sets. a Original image.
b Histogram equalization. c Visual characteristics and rough set
Bayesian classification (the maximum likelihood method) assumes a normal statistical distribution of the spectral data and can theoretically achieve a minimum classification error. It can distinguish water, residential areas, green areas, and other
large area classes from most multi-spectral RS images. However, if further subdi-
vision is needed (e.g., divide water area into rivers, lakes, reservoirs, and ponds or
divide green area into vegetable fields, orchards, forest and grassland), the prob-
lems may appear as different objects with the same spectrum and different spectra
from an identical object (Christopher et al. 2000). To solve the problem, inductive
learning is combined with Bayesian classification for using morphological char-
acteristics, distribution rules, and weak spectral differences in the granularity of
the object and the pixel. Figure 9.3 outlines the combination process of inductive
learning and Bayesian classification (Di 2001; Li et al. 2006, 2013a, b, c).
In Fig. 9.3, the inductive learning for spatial data is implemented at two gran-
ularities of the spatial object and the pixel, with the class probability value as a
learning attribute from the Bayesian preliminary classification. A GIS database
is essential for providing the training area for Bayesian classification, the poly-
gon and pixel sample data for inductive learning, the control points for RS image
rectification, and the test area for evaluating the accuracy of the image classifica-
tion results. To maintain representativeness, the training and test areas are selected
interactively. The data to be classified have the same desired attributes and data format as the learning data, but without the class attribute. The results from SDM are a set of rules and a
default class with the confidence between 0 and 1. The strategy of deductive rea-
soning is as follows:
(1) If only one rule is activated, the rule’s output is the resultant class.
(2) If the activation of multiple rules occurs simultaneously, then the class with
higher confidence is the resultant class.
(3) If the activation of multiple rules occurs simultaneously and the confidences of the classes are identical, then the output of the rule that covers the most learning samples is the resultant class.
(4) If no rule is activated, the default class is the resultant class.
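A minimal sketch of the deductive reasoning strategy above; the tuple layout for an activated rule (class, confidence, number of covering learning samples) is an assumption made for illustration.

```python
def apply_rules(activated, default_class):
    """Deductive reasoning strategy above. `activated` is a list of
    (class, confidence, covered_samples) tuples for the rules fired by one
    pixel or polygon; ties on confidence are broken by the number of
    learning samples a rule covers (the tuple layout is an assumption)."""
    if not activated:                    # (4) no rule fired
        return default_class
    if len(activated) == 1:              # (1) exactly one rule fired
        return activated[0][0]
    # (2)/(3) several rules fired: highest confidence, then widest coverage
    best = max(activated, key=lambda rule: (rule[1], rule[2]))
    return best[0]

# apply_rules([("river", 0.90, 120), ("lake", 0.90, 80)], "shadow") -> "river"
```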
The experimental image is a portion of a SPOT multispectral image with
three bands of a Beijing area acquired in 1996, as shown in Fig. 9.4a. After
it is stretched and corrected, the size of 2,412 × 2,399 pixels becomes
2,834 × 2,824 pixels. The GIS database is the 1:100,000 land use database of the
Beijing area before 1996. The GIS software is ArcView, the RS image process-
ing software is ENVI, and the inductive learning software is See 5 1.10 based on
the C5.0 algorithm. Image classification displays the overlay with the GIS layers.
The selected GIS layers are the land use layer and the contour layer. Because the
contour lines and elevation points are too sparse to be interpolated into a digital
elevation model (DEM), the contour layer is processed into an elevated polygon
layer—specifically, elevation stripes for less than 50 m (<50 m), 50–100 m, 100–200 m, 200–500 m, and greater than 500 m (>500 m).
Only Bayesian classification is utilized first to obtain the eight classes of water,
irrigated paddy, irrigated cropland, dry land, vegetable plot, orchard, forest land,
and residential land shown in the confusion matrix of classification (Table 9.1).
As can be seen, the classification accuracy is relatively high. The vegetable area
is shown in bright green to distinguish it from other green areas. The dry land,
orchard, and woodland (shown in dark green) have less spectral difference, which
leads to serious reciprocal errors and low classification accuracy. Part of the
shadow area in the woodland area is incorrectly classified as water.
Based on the preliminary results of Bayesian classification, inductive learning can
be used for two aspects of improving Bayesian classification: (1) subdividing water
and distinguishing shadows by using polygon granular learning, and (2) improv-
ing the classification accuracy of dry land, orchard, and woodland by using pixel
granular learning. For the learning granularity of a polygon to subdivide water, the
selected attributes are area, geographical location, compactness, and height. Then,
the stripes of 200–500 m and >500 m are merged to >200 m; their class attribute
values are 71 for river, 72 for lake, 73 for reservoir, 74 for pond, and 99 for shadow.
After 604 polygons are learned with 98.8 % accuracy, 10 production rules are extracted regarding the spatial distribution and geometric features (Table 9.2).
In Table 9.2, Rule 1 shows that the compactness attribute plays a key role in iden-
tifying a river, Rule 2 applies location and compactness for lake recognition, and
Rules 9 and 10 manipulate elevation for shadow recognition. When these rules are
used to identify shadows and subdivide water areas, the accuracy of shadow recogni-
tion is improved to 68 % for woodlands and the error in water recognition is reduced.
For the learning granularity of a pixel to distinguish dry land, orchards, and wood-
lands, the selected attributes include coordinate, elevation stripe, dry land probability,
orchard probability, and woodland probability; the output has three classes. Then, (1) 1 % (2,909 pieces) of the massive samples is randomly selected for learning; (2) 63 rules are discovered with a learning accuracy of 97.9 %; and (3) an additional 1 % of the samples is randomly selected for testing, the accuracy of which is 94.4 %.
Figure 9.4b shows the classification results from inductive learning and
Bayesian classification (displayed by samplings), and Table 9.3 is the confusion
matrix and accuracy indicators for the same test area used for Bayesian classifi-
cation. Compared with Bayesian classification, the classification accuracy of
dry land, orchards, and woodlands are increased to 69.8, 78.5, and 91.8 %; their individual increases are 6.2, 29.6, and 32 %, respectively. The overall classification accuracy is improved by 11.2 % (from 77.62 to 88.88 %) and the kappa coefficient is increased by 0.1245 (from 0.7474 to 0.8719). The results also demonstrate
that inductive learning can preferably solve the problem of the same spectrum
from different objects and the same objects with different spectra.
Classification with rough sets is defined as searching for an appropriate classification boundary in the boundary region according to the algorithm and its parameters, which is the deduction process of the set. Figure 9.5 is an example of RS clas-
sification of a volcano with rough sets. The white region is the upper approximate
set of the volcano, showing all the possible areas belonging to the volcano; the
red region is the lower approximation set of the volcano, illustrating all the defi-
nite areas as the volcano body. The boundary region between the upper and lower
approximate set is the uncertain part of the volcano (Wu 2004).
The classification may improve substantially if a rough set is integrated with
other reasonable methods. For example, in a rough neural network integrating rough
sets and artificial neural networks (ANNs) (Fig. 9.6), the rough set infers logic rea-
soning rules from the data on the object’s indiscernibility and knowledge simplifica-
tion, the results of which are used as the input of the knowledge system on ANN.
As discussed in previous chapters, ANN simulates the human visual intui-
tive thinking mechanism using the idea of nonlinear mapping; it also carries out
an implicit function that encodes the input and output correlation knowledge
expressed by the neural network structure, which can improve the precision of RS
image classification based on a rough set. Figure 9.7 shows the original images
(SPOT 5) and classification by the integrating method. In the confusion matrix
Fig. 9.5 The approximate
set of volcanic image
classification
Fig. 9.6 Rough neural
network model for image
classification
Fig. 9.7 Results of
classification with rough
neural network
(Tables 9.4 and 9.5), the overall accuracy of the results of simple rough image
classification is 74.8 % and the coefficient kappa is 0.821 (Table 9.4), while the
overall accuracy of the results of a multilayer perceptron classification with rough
set and the neural network is 91.5 % and the coefficient kappa is 0.895 (Table 9.5).
The overall classification accuracy is improved by 16.71 % (from 74.8 to 91.51 %) and the kappa coefficient is increased by 0.074 (from 0.821 to 0.895). Therefore, the rough neural network takes advantage of the knowledge learning of rough sets and the nonlinear mapping of the neural network for processing the uncertainty of RS images.
The thematic extraction of rivers from the RS images is studied for the purpose of
monitoring the ecological balance of the environment. Figure 9.8 shows an origi-
nal image (TM) and its water extraction results. The gray value of the image is
used to extract rivers from the images, and a rough topological relationship matrix
and a rough membership function are used to deal with the spatial relationships
between the class of river and other adjacent classes in the image (Wang et al.
2004).
The rough relationship matrix identifies and propagates the certainties with
Pos(X), Neg(X), and the uncertainties with Bnd(X). In the matrix, a non-empty intersection is recorded as 1 and an empty intersection as 0.
$$R_{r9}(A, B) = \begin{bmatrix} \mathrm{Pos}(A) \cap \mathrm{Pos}(B) & \mathrm{Pos}(A) \cap \mathrm{Bnd}(B) & \mathrm{Pos}(A) \cap \mathrm{Neg}(B) \\ \mathrm{Bnd}(A) \cap \mathrm{Pos}(B) & \mathrm{Bnd}(A) \cap \mathrm{Bnd}(B) & \mathrm{Bnd}(A) \cap \mathrm{Neg}(B) \\ \mathrm{Neg}(A) \cap \mathrm{Pos}(B) & \mathrm{Neg}(A) \cap \mathrm{Bnd}(B) & \mathrm{Neg}(A) \cap \mathrm{Neg}(B) \end{bmatrix}$$
Fig. 9.8 Rough river thematic maps (in continuum) Reprinted from Wang et al. (2004), with
kind permission from Springer Science+Business Media. a Original image. b Lr(X) image.
c Ur(X) image
The rough membership value is regarded as the probability of x ∈ X given that x belongs to an equivalence class, with $\mu_X(x) + \mu_{\sim X}(x) = 1$:

$$\mu_X(x) = \frac{\mathrm{Rcard}(X \cap [x]_R)}{\mathrm{Rcard}([x]_R)} = \begin{cases} 1 & x \in \mathrm{Pos}(X) \\ (0, 1) & x \in \mathrm{Bnd}(X) \\ 0 & x \in \mathrm{Neg}(X) \\ 1 - \mu_{\sim X}(x) & x \in\, \sim X \end{cases}$$
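A minimal sketch of the rough membership computation above, assuming the equivalence classes of R are given as a partition of pixel identifiers; the example classes and values are illustrative only.

```python
def rough_membership(x, X, partition):
    """Rough membership above: [x]_R is the equivalence class of x in
    `partition` (a list of disjoint sets), and mu_X(x) is the share of
    that class that falls inside the set X."""
    eq_class = next(block for block in partition if x in block)
    return len(X & eq_class) / len(eq_class)

# Pixels {1..6}, equivalence classes {1,2}, {3,4}, {5,6}, river class X = {1, 2, 3}:
partition = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}
print(rough_membership(1, X, partition))   # 1.0 -> x in Pos(X)
print(rough_membership(3, X, partition))   # 0.5 -> x in Bnd(X)
print(rough_membership(5, X, partition))   # 0.0 -> x in Neg(X)
```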
Content-based image retrieval locates the target images by using their visual fea-
tures, such as color, layout, texture, shape, structure, etc. In a RS image retrieval
system, texture, shape, and structure are more favorable features versus levels of
gray or false colors (Ma 2002; Zhou 2003).
The texture feature is widely applied in image retrieval from the main surface
features of the forest, grassland, farmland, and city buildings of RS images. The
visual textures—roughness, contrast, orientation, linearity, regularity and coarse-
ness—are used to calculate image similarity. Different scale images lead to
changeable textures. At present, co-occurrence matrix, wavelet transform, and
Markov random fields are effective in texture analysis.
A shape feature is represented based on the boundary and the region, which make use of the shape of the outer boundary and the shape of the region, respectively, and are invariant under movement such as translation and rotation, and under scale change. Global shape features have
a variety of simple factors, such as roundness, area, eccentricity, principal axis direc-
tion, Fourier descriptor, and moment invariants. Local shape features include line
segment, arc, corner, and high curvature point. Equipped with professional knowl-
edge and application background in the RS image retrieval system, the shape feature
library can be created and the storage space also can be saved by ignoring the shape
feature value. The shape matching accuracy and the retrieval rate can be improved if
the shape of the ground target combines the metadata of the RS image and the dimen-
sion knowledge of typical objects, especially querying ground manmade objects such
as vehicles and airplanes for ground reconnaissance and surveillance missions.
The structure feature includes the layout, adjacency, inclusion, etc. for the distri-
bution of objects and regions and their relationships in images. The spatial topologi-
cal structure among many objects is often implicitly stored in a spatial data structure,
the identical index of which supports the relative position and absolute positions
simultaneously. The image query can be based on either the position or the feature.
Currently, the structure feature is rarely used for image query, and the layout-based
query is a simple structure match with the range information between objects.
Both are within the whole definition domain. γ(h) is the semivariogram. The semi-
variogram is represented in graphics by a variograph—that is, a graph plotted by
the semivariogram versus h. Sill, nugget, and range are the three parameters that
describe the semivariogram completely (Fig. 9.9a). Each has its own characteris-
tics when the semivariogram is used in the context of image processing. Sill is its
limit; nugget reveals noise in an image via an image template window; and range
represents the correlation of the RS images in a direction. Inside the range, the
image contains spatial correlation that decreases when the range increases; outside
the range, its spatial correlation disappears (Franklin et al. 1996). Because an image template window is a two-dimensional array of pixels, its size can be decided via the ranges in the horizontal and vertical directions, which are obtained by calculating the semivariogram in the horizontal and vertical directions.
Images with different structural features have different semivariograms, with
different variographs (Fig. 9.9b). The semivariogram of an image is
$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N} \left[ G(x, y) - G(x + h, y + h) \right]^2$$
Here, G(x,y) is the gray value of the image and N(h) is the number of pixel pairs
separated by h pixels. Figure 9.9c is an actual variograph of an image.
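A minimal sketch of the empirical semivariogram above for a gray image; the lag h can be applied horizontally, vertically, or diagonally (the diagonal case matches the formula as written), and the function and parameter names are illustrative.

```python
import numpy as np

def semivariogram(G, lags, step=(0, 1)):
    """Empirical semivariogram of a gray image G:
    gamma(h) = 1 / (2 N(h)) * sum_i [G(x, y) - G(x + h, y + h)]^2,
    where the lag h is applied along the direction given by `step`
    ((0, 1): horizontal, (1, 0): vertical, (1, 1): diagonal as in the formula)."""
    G = np.asarray(G, dtype=float)
    dy, dx = step
    gamma = []
    for h in lags:
        ry, rx = dy * h, dx * h
        a = G[:G.shape[0] - ry, :G.shape[1] - rx]
        b = G[ry:, rx:]
        diff = a - b
        gamma.append((diff ** 2).sum() / (2 * diff.size))   # N(h) = diff.size pixel pairs
    return np.array(gamma)

# Horizontal and vertical variographs of an image window, used to choose
# the template window size from the two ranges:
# gx = semivariogram(window, lags=range(1, 20), step=(0, 1))
# gy = semivariogram(window, lags=range(1, 20), step=(1, 0))
```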
A new parameter to describe image similarity is defined based on the semivariogram:

$$\rho = \frac{\tan \theta}{\tan \beta}$$
When trains stop at a railway station, train workers perform an inspection to deter-
mine the health of the trains, focusing on the brakes on each wheel and the con-
nections between two railroad carriages. If a problem is uncovered, it must be reported before the train leaves the station so that troubleshooting is timely; railway stations are eager to automate this difficult work. The following approach
was proposed by the authors. Images first were taken with CCD cameras when
the trains entered the station and then the brake of a train wheel was identified and
its thickness measured. Nearly 1,800 images must be processed because it normally takes 180 s to image a train at a station, and about 10 frames of images are taken per second. However, most of the images may be useless because the
train wheels may not appear in some of the images and the image features may
be blurred by train motion. There are two aspects of the effectiveness of the algo-
rithm: (1) whether or not all the useful images are in the selected images and (2)
how the detection ratio is determined.
Traditional methods face difficulties. First, feature recognition of a train wheel is time-consuming and unreliable because the train's motion keeps changing as it approaches the station. Second, image comparison based on the gray correlation
and the histogram intersection proved to be invalid. Third, wavelet transform can
recognize the motion of blurred images successfully, but it cannot tell the differ-
ence between the standard useful image and images with no train wheels.
To find useful images with train wheels quickly and accurately, semivariogram-
based image retrieval was applied by measuring the image similarity parameter ρ.
The useful image includes the features, such as the brake attached to the wheel
and the connection of two railroad carriages, while a useless image is blurred by
train motion (Fig. 9.10).
The threshold of image similarity plays an important practical role. A larger
threshold means useful images may be lost in selection but the computational
speed may be faster. A smaller threshold may lead to an increased number of
selected images but the computational speed may decrease. This experiment shows that there is no fixed threshold that can completely separate useful and useless images in all conditions. In order to search out all the images of train wheels and connec-
tions, the similarity threshold can be empirically determined in the context of the
actual demands and professional experience:
– If |ρ − 1| < 0.01, although all the chosen images are useful, many other useful
images cannot be chosen.
– If |ρ − 1| < 0.1, all the chosen images are useful, but some useful images are
missing.
– If |ρ − 1| < 0.5, all the useful images are chosen, along with some useless
images.
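A minimal sketch of the threshold-based selection above; the frame names and ρ values are illustrative only.

```python
def select_useful(frames, threshold=0.5):
    """Keep the candidate frames whose similarity parameter rho is close
    enough to the standard image (|rho - 1| < threshold); the loose
    threshold of 0.5 retains all useful wheel/connection images at the
    cost of a few useless ones, as described above. Frame names are
    illustrative."""
    return [name for name, rho in frames if abs(rho - 1) < threshold]

# select_useful([("f001", 1.03), ("f002", 2.40), ("f003", 0.91)]) -> ["f001", "f003"]
```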
In the experiment, the train wheel images obtained under different imaging
conditions were utilized for detection using the above algorithm to find all the use-
ful images. The results show that all the useful images can be chosen, but with
several useless images mixed in. For the image similarity parameters in four direc-
tions, the ratio of finding all the useful images was consistently around 90 % for
both the detection of train wheels and connections. Figure 9.11 shows the experi-
mental samples of the standard image and candidate images, as well as the image
similarities. In Fig. 9.11a, the two images are clearly quite different. In Fig. 9.11b,
two images are similar. In Fig. 9.11c, two images are similar but the lighting con-
ditions are different. The selection procedure can be automatically finished within
one minute, which satisfies the practical requirement of less than 6 min.
This technique was successfully applied in the actual system (Figs. 9.12 and
9.13). Supported by the system, a professional operator can easily make an accu-
rate final judgment of whether or not the train wheels are normal.
Fig. 9.10 Useful images versus useless images for train health. a Useful image with the brake
attached to the wheel. b Useful image with the connection of two railroad carriages. c Useless
image blurred by train motion
For Fig. 9.11b, the image similarities are |ρ0° − 1| = 0.083, |ρ45° − 1| = 0.027, |ρ90° − 1| = 0.073, and |ρ135° − 1| = 0.094; for Fig. 9.11c, they are |ρ0° − 1| = 0.023, |ρ45° − 1| = 0.047, |ρ90° − 1| = 0.035, and |ρ135° − 1| = 0.102.
In the {Ex, En, He} of the cloud model for facial expression images, Ex reveals the common features of the women's expressions and can be used to construct the standard facial expression, which reflects the average facial expression state; En reveals the degree of deviation of different expressions from the standard facial expression, reflecting the degree to which face recognition is affected by individual characteristics and different internal and external factors; and He reveals the degree of dispersion of the different expressions from the standard facial expression (i.e., the degree to which individual characteristics and environmental factors affect a facial expression; Table 9.6).
First, the AN expression was selected in the image library as well as the set of 10
different Japanese women (KA, KL, KM, KR, MK, NA, NM, TM, UY, and YM),
whose AN face images become the original images—that is, the input of cloud
drops. With the non-certainty reverse cloud generator algorithm, the {Ex, En, He} for the AN status of the 10 individuals (i.e., the output cloud numerical characters) was obtained; the results are shown in the first column of Table 9.6. The cloud numerical characters for the six remaining expressions (DI, FE, HA, NE, SA, and SU) were obtained in the same way; their results are shown in the 2nd, 3rd, 4th, 5th, 6th, and 7th columns, respectively, of Table 9.6.
From Table 9.6, it can be seen that the cloud drops reveal that the input (i.e.,
the original selected images) reflects the different personalities of the 10 different
individuals for one facial expression; the digital features of an image of {Ex, En,
He} as output reflect the common features of 10 individuals who show one expres-
sion. Although there are 10 different individuals, 10 different human expressions
can become part of the basis for the common features for one expression by add-
ing them to the database. Ex reveals the basic common features of one expression
as the standard facial expression, which can reflect the average expression state;
En shows the degree of deviation from the standard expression for 10 different
individuals for one type of expression; and He reveals the dispersion of the degree
of deviation for one type of expression from the standard facial expression for 10
different individuals with different characteristics and influences by internal and
external factors.
The woman KA in the image library and the set of her seven face images (AN, DI, FE, HA, NE, SA, and SU) were selected as the original images (i.e., the input of cloud drops). With the non-certainty reverse cloud generator algorithm, the {Ex, En, He} of the seven facial images was obtained for KA (i.e., the output cloud numerical characters); the results are shown in the first row of Table 9.6. The cloud numerical characters for the remaining nine women (KL, KM, KR, MK, NA, NM, TM, UY, and YM) were obtained by adopting the same method as for KA, the results of which are shown in the 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, 9th, and 10th rows, respectively, of Table 9.6.
From Table 9.6, it can be seen that the cloud drop of the original image as
the image input reflects the different personality characteristics of seven types of
expressions of one individual; the numerical character image of {Ex, En, He} as
the output reflects the common features of the identical person. Although there
are seven different expressions, the different expressions are based on the com-
mon features of one expression, which can be increased by the features of differ-
ent individuals. Ex reveals the basic common features of one person, which is set
as the standard facial expression, and reflects the undisturbed calm state of one
person; En reveals the degree of deviation for different expressions from the stand-
ard facial expression for one person and reflects the individual’s degree of mood
fluctuation influenced by internal and external factors; He reveals the degree of
difference that expressions deviate from the standard facial expression for one per-
son and reflects the degree of difference between the individual’s mood fluctua-
tions influenced by internal and external factors (i.e., the degree of psychological
stability).
From Table 9.6, it can be seen that the original input images reflect the different features of different expressions, while the output digital feature image of {Ex, En, He} reflects the common features of the different expressions. Although the input images are different facial expression images, these input images are based on one common feature, which is expanded by the addition of different personal-
ity features. Ex reveals the basic common features of a facial expression, which is
set as its standard facial expression and reflects the average state of facial expres-
sion; En reveals the degree of difference that expressions deviate from the stand-
ard facial expression (i.e., the degree of influence by individual characteristics and
environmental factors); and He reveals that the dispersion of the degree of devia-
tion of the different expressions from the standard facial expression and reflects
the degree of difference that expressions deviate from the standard facial expres-
sion (i.e., the effect degree of personal characteristics and environment factors on
facial expression).
where ||xi, xipq|| is the distance between point xi and pixel object xipq, gipq denotes
the energy of xipq, and σ indicates the impact factor. In face recognition, it is sug-
gested that ||xi, xipq|| is the Euclidean norm and gipq is the gray of pixel Xipq when
the image is black and white, varying from black at the weakest intensity to white
at the strongest. The potential of point xi is the sum of the potentials from each
object Xipq.
$$\Phi(x_i) = \sum_{p=1}^{P} \sum_{q=1}^{Q} \varphi(x_{ipq}) = \sum_{p=1}^{P} \sum_{q=1}^{Q} g_{ipq} \times e^{-\frac{\| x_i,\, x_{ipq} \|^2}{2\sigma^2}}$$
If the points with the same potential value are lined up together, equipotential lines
come into being. Furthermore, all of the equipotential lines depict the interesting
topological relationships among the objects, which visually indicate the interaction characteristics of the objects. Figure 9.14 shows a facial image that is partitioned by a grid with rows p (p = 1, 2, …, P) and columns q (q = 1, 2, …, Q) (a) and its equipotential lines of the data field in two-dimensional space (b) and three-dimensional space (c).
In Ω ⊆ R^J for face recognition, the data field of the face Xi to be recognized on the dataset D = {X1, X2, …, XI} receives a more summarized potential:

$$\Phi(X_i) = \sum_{i=1}^{I} \sum_{p=1}^{P} \sum_{q=1}^{Q} g_{ipq} \times e^{-\frac{\| x_i,\, x_{ipq} \|^2}{2\sigma^2}}$$
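A minimal sketch of the pixel-object data field above for one normalized gray image, assuming pixel coordinates are scaled to the unit square and the gray values lie in [0, 1]; the impact factor of 0.05 echoes the value used later for the JAFFE experiment, and the array names are illustrative.

```python
import numpy as np

def image_potential(gray, xi, sigma=0.05):
    """Potential at point xi generated by every pixel object x_pq of one
    normalized gray image (values in [0, 1]), following the data field
    formula above: each pixel contributes g_pq * exp(-||xi - x_pq||^2 / (2 sigma^2)).
    Pixel coordinates are scaled to the unit square (an assumption)."""
    P, Q = gray.shape
    rows, cols = np.mgrid[0:P, 0:Q]
    coords = np.stack([rows / (P - 1), cols / (Q - 1)], axis=-1)  # positions of x_pq
    d2 = ((coords - np.asarray(xi, dtype=float)) ** 2).sum(axis=-1)
    return float((gray * np.exp(-d2 / (2 * sigma ** 2))).sum())

# Potential field over a standardized 32 x 32 face image `face` (hypothetical array):
# field = np.array([[image_potential(face, (r / 31.0, c / 31.0)) for c in range(32)]
#                   for r in range(32)])
```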
Human facial expression recognition with a data field first normalizes the origi-
nal face image to obtain a standardized 32 × 32 pixel facial image. Specifically,
the image is rotated, cut, and scaled based on the center of the left and right eyes
of the original facial image as reference data, along with the elliptical mask to
eliminate the effects of hair and background. Then, the gray transformation for the
standard face image is carried out to extract the important feature points of each
face image using the feature extraction method with the data field, which forms
the simplified human face image. Third, the deviation matrix is simplified and the public eigenface space is constructed; the simplified human face image is projected onto the public eigenface space, and the corresponding projection coefficients serve as the logical features of the human face image. Finally, according to the logical
features, all of the face images in the new feature space create a second-order data
field, and the clustering recognition of human face images is realized under the
interaction between the data and the self-organizing aggregation.
Fig. 9.14 Facial data field of KA’s happy face: a partitioned facial image b 2D field equipoten-
tial c 3D field equipotential
A total of 213 face images in the JAFFE database were processed uniformly, and the clustering algorithm with the data field was then applied to cluster the projection data in the eigenface space (Fig. 9.15).
Figure 9.15a shows the standardized facial images. Figure 9.15b shows the
equipotential of the data field on the facial images with the impact factor of 0.05.
The high potential area in the equipotential line distribution clearly concentrates
on the cheek, forehead, and nose, whose gray is relatively larger. In the distribu-
tion of the data field, each local point with the maximum potential value possesses
all of the contribution radiated from adjacent pixels, and the maximum potential
value and position of these local points can be regarded as the logical feature of
the human face image. The simplified human face image was extracted by the
algorithm based on the features of the facial data field (Fig. 9.15c). The eigenface images corresponding to the first six principal eigenvectors are obtained by the K-L transform from the set of simplified human face images. The simplified face images are projected into the public eigenface space, and the first two principal eigenvectors constitute a two-dimensional eigenface space whose projected data distribution is shown in Fig. 9.15d.
By all appearances, there is preferable separability in the two-dimensional
eigenface space for simplified face images representing different human facial
expressions. The final recognition results are shown in Table 9.7, which show that
this method provides a favorable recognition rate.
Specifically, 10 frontal gray facial images with different expressions were chosen from the JAFFE database, consisting of seven images of the same person and three images of three strangers; their facial topological structures were obtained by natural clustering. It is apparent that the clustering speed of the three strangers I, H, and J was the slowest. The process is shown in Fig. 9.16.
Fig. 9.16 Face recognition with facial data fields. a Different facial images and their normalized
images. b Equipotential of facial images and their hierarchical recognition result
(Tian et al. 2014). DMSP-OLS images have been widely applied in socioeconomic
studies due to their unique capacity to reflect human activities (Li et al. 2013a, b, c).
If the luminosity recorded by nighttime light imagery can be used as a proxy (Chen and Nordhaus 2011), not only the relative amounts of freight traffic (Tian et al. 2014) but also the areal demands for freight traffic can be estimated efficiently. The responses of night-
time light during the Syrian Crisis were monitored and its potential for monitoring the
conflict was analyzed (Li and Li 2014). The nighttime light dynamics of the countries
(regions) indicate that the Belt and Road Initiative will boost infrastructure building,
financial cooperation, and cultural exchanges in those regions.
China’s huge growth has led to a rapid increase in demand for freight traffic.
Timely assessments of past and current amounts of freight traffic are the basis for
predicting future demands of freight traffic and appropriately allocating transpor-
tation resources. The main objective of this study was to investigate the feasibility
of the brightness of nighttime lights as a proxy for freight traffic demand in China
(Tian et al. 2014). The brightness of nighttime light as a proxy for freight traffic in
a region with relatively brighter nighttime lights usually has more business activi-
ties and consequently larger freight transport demand. More developed areas also
usually have brighter nights and larger freight traffic demand.
This section describes the method used for extracting the sum light from nighttime light images. The sum light of each province/municipality was
extracted from the re-valued annual image composite, which is equal to the total dig-
ital number (DN) values of all the pixels in the province/municipality. Then, the sum
light was regressed on total freight traffic (TFT) consisting of railway freight traf-
fic (RFT), highway freight traffic (HFT), and waterway freight traffic (WFT) at the
province level; three groups of regression functions between the sum light and TFT,
RFT, and HFT were developed. The standard error of the estimates (SEEs) was cal-
culated to measure the accuracy of predictions using the regression functions. Third,
each province/municipality's HFT was disaggregated to each pixel based on the pixel's DN value to downscale HFT from the province level to the pixel level, and a 1 km × 1 km resolution map for 2008 was produced. For the DN of the pixels of the
nighttime lights, a minimum threshold was set to eliminate the effects of blooming
and to mask large rural lit areas where dim nighttime lights can be detected but there are few business activities and little freight traffic. Because electric power consump-
tion and freight traffic are both typical socioeconomic indicators that have strong
correlations with the GDP, we selected 10 as the threshold value and re-valued the
annual image composites (Zhao et al. 2012).
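A minimal sketch of the sum light extraction and log-log regression above, assuming the annual composite and the province mask are already co-registered NumPy arrays; the array names and the SEE computed on the log scale are illustrative assumptions rather than the authors' exact implementation, while the threshold of 10 follows the text.

```python
import numpy as np

def sum_light(dn_image, province_mask, threshold=10):
    """Sum light of one province/municipality: the total DN of its pixels
    after re-valuing DN below the threshold of 10 to zero, as described above.
    `dn_image` and `province_mask` are assumed to be co-registered arrays."""
    dn = np.where(dn_image < threshold, 0, dn_image)
    return int(dn[province_mask].sum())

def fit_loglog(sum_lights, freight):
    """Regression of freight traffic on sum light on the log-log scale,
    fitted across provinces, with the standard error of the estimates (SEE)
    computed on the log scale (a simplified reading of the procedure)."""
    x, y = np.log(sum_lights), np.log(freight)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    see = np.sqrt((resid ** 2).sum() / (len(y) - 2))
    return slope, intercept, see
```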
This study was mainly derived from the following data sources. Three version
4 DMSP-OLS stable lights annual image composites for the years 2000, 2004, and
2008 were obtained from the National Oceanic and Atmospheric Administration’s
(NOAA) National Geophysical Data Center (NGDC) (Earth Observation Group,
2011). The annual image composites were produced by all of the available cloud-
free images from satellites F14, F15, and F16 for their corresponding calendar years
in the NGDC’s digital archives. The DN values of the annual image composites rep-
resent the brightness of the nighttime stable lights varying from 0 to 63. Each stable
light’s annual image composite has a spatial resolution of 1 km × 1 km. Acquired
from the National Bureau of Statistics of China (NBSC) (2001, 2005, 2009, 2011),
freight traffic data (China Statistical Yearbook 2009, 2005, 2001) were used as base-
line data, and the GDP was supporting data (Zhou et al. 2011).
The brightness of the nighttime light is a good proxy for TFT and HFT at the
province level. Sum light can be used to estimate the GDP, and freight transport
demand correlates strongly to the GDP. The experiments in Tian et al. (2014)
show strong relationships between the natural logarithm of sum light and the natu-
ral logarithm of the GDP and the natural logarithm of the GDP and the natural
logarithm of TFT; sum light more strongly correlates to HFT than to RFT even
though the relationships between sum light and RFT are significant at the 0.01
level. Rail transport is the traditional primary form of transportation in China.
Normally, goods that are large in mass and volume (e.g., coal, steel, timber) are
shipped via railways, but the price of such goods per unit weight may be very
low comparatively (e.g., a software CD vs. a ton of coal). In a further anomaly,
Heilongjiang and Shanxi are two moderately developed provinces and provide
most of the timber and coal in China. The timber and coal are transported out of
Heilongjiang and Shanxi mainly via railways. Guangdong and Jiangsu are two of
the most developed provinces in China; high-tech and light industries are their pil-
lar economic activities. Materials and finished goods are shipped into and out of
Guangdong and Jiangsu mainly through highways. Although the GDPs and sum
light of Guangdong and Jiangsu are far larger than those of Heilongjiang and
Shanxi, the RFT amounts of Guangdong and Jiangsu are much smaller than those
of Heilongjiang and Shanxi. Thus, when RFT does not have a strong correlation
to the GDP, sum light is not a sound measure of RFT. By contrast, road trans-
port is a burgeoning transportation method in China and has replaced rail trans-
port as the primary mode for freight. The more developed Chinese provinces or
municipalities normally have more developed highway networks and consequently
more goods and materials are shipped on the highways. Therefore, sum light can
be used as a proxy of HFT. Sum light does not strongly correlate to RFT, but the
ratios of RFT to TFT are very small; consequently, sum light still has strong rela-
tionships with TFT in these provinces.
TFT and HFT are also socioeconomic factors that significantly correlate with
the GDP. Because the brightness of nighttime light can reflect a region’s popula-
tion and economic level, the brightness can be used as a proxy of HFT at the pixel
level. To show a specific application of the brightness as a proxy of HFT at the
pixel level, this study produced a Chinese HFT map for 2008 by disaggregating
each province/municipality’s HFT to each pixel in proportion to the DN value of
the pixels of the 2008 nighttime light image composite (Fig. 9.17).
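A minimal sketch of the pixel-level disaggregation above, spreading a province's HFT over its pixels in proportion to their DN values; the arrays and the threshold handling are illustrative assumptions.

```python
import numpy as np

def disaggregate_hft(dn_image, province_mask, province_hft, threshold=10):
    """Spread one province's highway freight traffic (HFT) over its pixels
    in proportion to their DN values, as described above; pixels below the
    blooming threshold (and pixels outside the province) receive none."""
    dn = np.where(dn_image < threshold, 0, dn_image).astype(float)
    dn[~province_mask] = 0.0
    total = dn.sum()
    return province_hft * dn / total if total > 0 else dn
```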
The spatial variation of HFT demand is clearly shown by the map in Fig. 9.17.
The amount of HFT gradually decreases from the urban core regions to subur-
ban regions. In Sichuan and Chongqing, where the population is very high but
the developed area is limited, the HFTs of urban core regions are extremely large.
In Beijing, Shanghai, and Guangzhou, where economic development and urban-
ization levels are both very high, the HFTs of urban core regions are moderate
because their large total HFTs have been distributed to relatively broader devel-
oped areas. Therefore, compared to the traditional census-based HFT data, the
nighttime-light-derived HFT data can provide more detailed geographic infor-
mation for freight traffic demand. Compared to an undeveloped region, a devel-
oped region normally has more and brighter nighttime lights and consequently a
larger sum light. Additionally, the developed region usually has a larger population
and produces and consumes more goods, which lead to a larger freight transport
demand.
This study aimed to analyze the responses of nighttime light to the Syrian Crisis
and to evaluate their potential to monitor the conflict (Li and Li 2014). A direct
evaluation method was adopted for the country and all of its provinces: quantifying
the amount of nighttime light for the country and each provincial region. First, the
sum of nighttime lights (SNL) of Syria was calculated for each month by summing
the digital number (DN) values of all the pixels in the region; the SNL therefore had
no physical unit.
Second, the SNL of each province was calculated. Third, the change in the propor-
tions of SNL was calculated for different regions. Fourth, nighttime light variation
was correlated with statistics summarizing the humanitarian scale of this crisis.
To show the nighttime light trend in a continuous spatial dimension, a data
clustering method was further developed for the time series of nighttime light
images; pixels with similar nighttime light trends were grouped into one class,
irrespective of the overall magnitude of the lights. The method has three steps: extract-
ing the lit pixels, normalizing the time series nighttime light, and clustering the
time series nighttime light images. Because the lit areas at night were the only
focus of this study, the dark areas were excluded first. For each pixel, if its value
was smaller than the threshold in every month, it was labelled as a dark pixel;
otherwise, it was labelled as a lit pixel. The threshold was set to three in this
analysis. Each pixel in the lit
area was normalized. The K-means algorithm was used to cluster the normalized
nighttime light data into classes in the lit areas, and the dark areas were labelled as
the “dark region” class.
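The three steps can be sketched as follows; the z-score normalization of each lit pixel's time series is an assumption (the text states only that the lit pixels were normalized), and the input array of monthly composites is hypothetical.

# Sketch of the clustering method: extract lit pixels, normalize each pixel's time series,
# and cluster with K-means; dark pixels (below the threshold in every month) form their own class.
import numpy as np
from sklearn.cluster import KMeans

def cluster_night_light(images, n_clusters=4, threshold=3.0):
    n_months, rows, cols = images.shape
    series = images.reshape(n_months, -1).T                 # one time series per pixel
    lit = (series >= threshold).any(axis=1)                 # lit: at or above the threshold in some month
    labels = np.full(rows * cols, -1)                       # -1 marks the "dark region" class
    lit_series = series[lit].astype(float)
    mean = lit_series.mean(axis=1, keepdims=True)           # z-score normalization (an assumption)
    std = lit_series.std(axis=1, keepdims=True)
    std[std == 0] = 1.0
    labels[lit] = KMeans(n_clusters=n_clusters, n_init=10).fit_predict((lit_series - mean) / std)
    return labels.reshape(rows, cols)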
The ongoing Syrian Crisis has caused severe humanitarian disasters, including
more than 190,000 deaths (Cumming-Bruce 2014). This study drew mainly on
three data sources: administrative boundaries, satellite-observed nighttime
light images, and statistical data from human rights groups. The international and
provincial boundaries of Syria were derived from the Global Administrative Areas
(http://www.gadm.org/). Syria adjoins Turkey to the north, Iraq to the east, Jordan
to the south, and Israel and Lebanon to the west. Syria is divided into 14 provin-
cial regions, with Damascus, Aleppo, and Homs as its major economic centers.
The DMSP/OLS monthly composites between January 2008 and February 2014
were selected as the nighttime light images for the analysis (Fig. 9.18). A total of
38 monthly composites were used for this analysis. The March 2011 image was
selected as the base image for the intercalibration (Wu et al. 2013).
The changes are first analyzed by administrative region. Figure 9.18 shows that
the nighttime light in Syria has declined sharply since the crisis began, and many
small lit patches have gone dark. The city of Aleppo, the site of fierce battles
(AAAS 2013), has lost most of its lit areas. Although the cities of Homs and
Damascus also lost much of their nighttime light, the intensity of their losses
appears to be less than that of Aleppo. The SNLs between January 2008 and
February 2014 in Syria
show some fluctuations before March 2011, but the nighttime light in Syria has
continuously declined since March 2011. The SNL of each province was calcu-
lated between January 2008 and February 2014. The Golan Heights, part of
Quneitra, has been controlled by Israel since 1967; its nighttime lights were
therefore excluded when calculating the SNL of Quneitra. The SNLs between January
2008 and February 2014 in different provinces of Syria show a sharply declining
trend of nighttime lights for all the provinces since March 2011.
The change in the proportions of SNL was calculated in different regions
between March 2011 and February 2014. In addition, the size of the lit area,
defined as the area where the light value is greater than 3, was also retrieved for
each region, and its change in proportion during the period was also calculated.
The two indices are illustrated in Fig. 9.19.
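Following the definitions above (SNL as the sum of DN values, lit area as the count of pixels with a DN value greater than 3), the two indices for one region can be computed as in the sketch below; the input images are hypothetical.

# Sketch of the two indices in Fig. 9.19, computed for one region from two images.
import numpy as np

def snl(image):
    return float(image.sum())                    # sum of nighttime lights (no physical unit)

def lit_area(image, threshold=3.0):
    return int((image > threshold).sum())        # number of lit pixels

def change_in_proportion(before, after):
    snl_change = (snl(after) - snl(before)) / snl(before)
    la_change = (lit_area(after) - lit_area(before)) / lit_area(before)
    return snl_change, la_change                 # negative values indicate loss

before = np.array([[10, 5], [0, 8]])             # toy "March 2011" image
after = np.array([[2, 1], [0, 3]])               # toy "February 2014" image
print(change_in_proportion(before, after))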
In Fig. 9.19, it can be seen that Syria has lost about 74 and 73 % of its night-
time lights and lit areas, respectively. In addition, most of the provinces lost >60 %
Fig. 9.18 The nighttime light monthly composites. a March 2011. b February 2014. Copyright © 2013, Taylor & Francis Group
Fig. 9.19 The nighttime light change in proportions for Syria and all the provinces between March 2011 and February 2014. a SNL change in proportion. b Lit area (LA) change in proportion. Copyright © 2013, Taylor & Francis Group
of their nighttime lights and lit area. Damascus, the capital of Syria, is an excep-
tion; it only lost about 35 % of the nighttime lights and no lit areas during the
period, and the nighttime light has fluctuated notably, which is different from most
of the provinces that experienced continuous decline. However, Rif Dimashq, in
the countryside of Damascus, lost about 63 % of its lights, showing that its secu-
rity situation is much more severe than in Damascus. This finding is consistent
with the fact that the Assad regime has strongly controlled the capital, although
the battles around the capital were intense (Barnard 2013). Quneitra, another
exception, also lost only 35 % of its lights during the period. Idlib and Aleppo
are the provinces with the most severe decline in nighttime light, losing 95 and
88 % of the nighttime lights and 96 and 91 % of the lit areas, respectively. In fact,
the battles in these two provinces were particularly fierce (Al-Hazzaa 2014; AAAS
2013). We also found that Deir ez-Zor and Rif Dimashq lost 63 and 60 % of their
nighttime lights, respectively, but only 40 and 42 % of their lit area. We can infer
that basic power supplies in these two provinces still were working in most of the
areas, although the battles there also were intense.
Can the nighttime light variation be correlated with statistics summarizing
the humanitarian scale of this crisis? The number of internally displaced per-
sons (IDPs) and the SNL loss of all the Syrian regions between March 2011 and
December 2013 are presented in Table 9.8. A linear regression analysis of these
two data groups showed that the nighttime light decline was correlated with the
number of displaced persons during the Syrian Crisis. This finding supports the
assumption that multi-temporal nighttime light brightness is a proxy for popula-
tion dynamics (Bharti et al. 2011). In brief, nighttime light variation can reflect the
humanitarian disasters in the Syrian Crisis.
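The regression step can be sketched as follows; the SNL-loss and IDP arrays are placeholders, not the figures of Table 9.8.

# Sketch: relate provincial SNL loss to the number of internally displaced persons (IDPs).
import numpy as np
from scipy.stats import linregress

snl_loss = np.array([0.95, 0.88, 0.63, 0.60, 0.35])        # hypothetical fractional SNL loss
idps = np.array([1.1e6, 9.0e5, 4.2e5, 3.9e5, 1.5e5])       # hypothetical IDP counts

fit = linregress(snl_loss, idps)
print(f"r = {fit.rvalue:.2f}, p = {fit.pvalue:.3f}")        # a significant r supports the proxy assumption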
Therefore, the nighttime light experienced a sharp decline as the crisis broke
out. Most of the provinces lost >60 % of the nighttime light and lit areas because
of the war, and the amount of the loss of nighttime light was correlated with the
number of IDPs. Also, the international border of Syria bounds the nighttime light
variation patterns, confirming the earlier conclusion that administrative borders
mark socioeconomic discontinuities (Pinkovskiy 2013).
The “Belt and Road” refers to the Silk Road Economic Belt and the twenty-first-
century Maritime Silk Road. In the twenty-first century—a new era marked by
the theme of peace, development, cooperation and mutual benefit—it is all the
more important for us to carry on the Silk Road Spirit in face of the weak recov-
ery of the global economy and complex international and regional situations.
When Chinese President Xi Jinping visited Central Asia and Southeast Asia in
September and October 2013, he raised the initiative of jointly building the Silk
Road Economic Belt and the twenty-first-century Maritime Silk Road, which have
attracted close attention from all over the world. It is a systematic project, which
should be jointly built through consultation to meet the interests of all, and efforts
should be made to integrate the development strategies of the countries along the
Belt and Road. The Chinese government has drafted and published the Vision and
Actions on Jointly Building Silk Road Economic Belt and twenty-first-Century
Maritime Silk Road to promote the implementation of the initiative; instill vigor
and vitality into the ancient Silk Road; connect Asian, European and African coun-
tries more closely; and promote mutually beneficial cooperation to a new high and
in new forms (The National Development and Reform Commission, Ministry of
Foreign Affairs, and Ministry of Commerce of the People’s Republic of China,
with State Council authorization 2015).
The Belt and Road Initiative not only covers the countries along the ancient
roads, but it also links the Asia-Pacific, European, African and Eurasian eco-
nomic circles. There are three directions of the Silk Road Economic Belt: China—
Central Asia—Russia—Europe, China—Central Asia—West Asia—Persian
Gulf—Mediterranean, and China—Southeast Asia—South Asia—Indian Ocean.
Also, there are two directions of the twenty-first-century Maritime Silk Road:
China—South China Sea—Indian Ocean—Europe, and China—South China Sea—
South Pacific. The Belt and Road passes through more than 60 countries and regions with a total pop-
ulation of 4.4 billion, with the purpose of boosting infrastructure building, financial
cooperation, and cultural exchanges in those regions.
Nighttime light imagery acquired from DMSP can illustrate the socioeconomic
dynamics of countries along the Belt and Road. By comparing the DMSP nighttime
light imagery from 1995 to 2013, urbanization and economic growth patterns
can be uncovered. Most of the countries along the Belt and Road have experienced
significant growth (Fig. 9.20).
By further analyzing the countries within a distance of 500 km of the Belt and
Road (Fig. 9.21), we found that the nighttime light in China and Southeast Asia
(e.g., Cambodia, Laos, Thailand, Vietnam, and Burma) has grown quickly, showing
rapid urbanization and economic growth in these countries. In particular, China
surpassed Russia as the country with the largest total nighttime light. The rapid
growth of nighttime light in some countries (e.g., Bosnia and Herzegovina, Somalia,
and Afghanistan) stems from the peace process and post-war reconstruction.
Nighttime light decline in some developing countries (e.g. Syria and Ukraine) may
Fig. 9.20 DMSP nighttime light imagery in the Belt and Road region. a DMSP nighttime light imagery in 1995. b DMSP nighttime light imagery in 2013
be due to economic decline and war. Nighttime light decline in some developed
countries (e.g. Sweden and Denmark) is because of their control of excessive out-
door lighting.
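This per-country comparison can be sketched as follows; the intercalibrated composites for 1995 and 2013 and the boolean country masks restricted to the 500-km buffer are assumed to be available, which is an illustrative simplification of the actual processing chain.

# Sketch: rank countries along the Belt and Road by growth of sum light between 1995 and 2013.
import numpy as np

def rank_growth(light_1995, light_2013, country_masks):
    growth = {}
    for name, mask in country_masks.items():
        before = light_1995[mask].sum()
        after = light_2013[mask].sum()
        growth[name] = (after - before) / before if before > 0 else float("inf")
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with two "countries" on a 2 x 2 grid
light_1995 = np.array([[1.0, 2.0], [0.5, 0.1]])
light_2013 = np.array([[2.0, 5.0], [1.0, 0.2]])
masks = {"A": np.array([[True, True], [False, False]]),
         "B": np.array([[False, False], [True, True]])}
print(rank_growth(light_1995, light_2013, masks))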
Therefore, resolving regional and global structural contradictions requires all
nations to cooperate, and the solution to domestic structural contradictions in
many countries also depends on cooperation with other nations. The Belt and
Road Initiative will play an important role in helping to establish stability and
recovery and in leading the global economy to its next phase of prosperity. The
Belt and Road
cooperation features mutual respect and trust, mutual benefit and win-win coop-
eration, and mutual learning between civilizations. As long as all countries along
the Belt and Road make concerted efforts to pursue their common goal, there will
be bright prospects for the Silk Road Economic Belt and the twenty-first-century
Maritime Silk Road, and the people of countries along the Belt and Road can all
benefit from this initiative.
Fig. 9.21 Twenty countries (regions) with nighttime light growth in the Belt and Road region
during 1995–2013. a Countries (regions) with the highest nighttime light growth. b Countries
(regions) with the lowest nighttime light growth
Video cameras as sensors have been widely used in intelligent security, intelligent
transportation, and intelligent urban management along with the development of
smart cities. At present, there is a national multilevel online monitoring project,
which is a five-level hierarchical monitoring network. There are more than 20 million
ubiquitous cameras for city management, public security, and health care. Video
monitoring systems are running in more than 600 cities in China (Li et al. 2006,
2013a, b, c).
Digital video data are acquired in an integrated manner under a large-scale
spatiotemporal environment, which includes information about objects, places,
behaviors, and time. With the popularization of high definition (HD) video data,
HD video requires several times more storage space than low-definition video. It is
difficult to store massive video data for a long period of time, and there is also a
lack of fast and effective video retrieval mechanisms for finding meaningful video
information. Furthermore, the large amount of information in HD video
requires much more rapid computing speed. Nowadays, we are unable to make
full use of the surveillance video data in a large-scale spatiotemporal environment
because its storage cost is prohibitive, the retrieval results are inaccurate, and the
data files are too large to analyze. It is critical to find new automatic and real-time
techniques for video data intelligent compression, analysis, and processing.
Spatiotemporal video data mining not only extracts information and processes
data intelligently, but it also distinguishes abnormal behavior from normal behav-
ior of persons, cars, etc. automatically. In this way, we can delete large amounts
of video recordings of the normal activities and private activities of individuals
that need to be protected, while keeping the data about suspicious cars and peo-
ple, as well as data about people who need care and supervision (e.g., elderly with
dementia and children with disabilities). There are no popular techniques and
products for spatiotemporal video data mining due to its technical difficulties.
Spatiotemporal video data mining has some technical difficulties. The data are too
expensive to store. The amount of data to be processed and stored on a city-level
video surveillance network is massive. The amount of geo-spatial data in digital
cities reaches the TB level, while the amount of city-scale video data in smart
cities may reach the PB level. According to current technical and industry
standards, the data collected by an HD camera in one hour, compressed using the
H.264 standard, requires 3.6 GB of storage. Videos typically must be kept for three
months. If a 4 TB storage server costs 50,000 yuan, the cost of the Tianjin city
security system may be as high as 58.32 billion yuan just for storing the videos,
which equals the total GDP of Tibet in 2012. Because the amount of surveillance
video data is huge, the vast majority of cities in China relieve the pressure on
construction funds by reducing video retention time and quality, which also reduces
the value of criminal investigation and identification through video data. The rapid
growth of storage size and investment has become a problem that is restricting city
surveillance systems.
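The arithmetic behind the 58.32 billion yuan figure can be reconstructed as follows; the per-camera data rate, retention period, and server cost come from the text, while the camera count of 600,000 and the use of 1 TB = 1000 GB are assumptions chosen so that the quoted total is reproduced.

# Reconstruction of the storage-cost estimate quoted above.
GB_PER_HOUR = 3.6            # H.264-compressed HD stream (from the text)
RETENTION_DAYS = 90          # three-month retention (from the text)
SERVER_TB = 4                # capacity of one storage server (from the text)
SERVER_COST_YUAN = 50_000    # price of one storage server (from the text)
CAMERAS = 600_000            # assumed camera count; not stated in the text

gb_per_camera = GB_PER_HOUR * 24 * RETENTION_DAYS       # 7,776 GB per camera per retention period
total_tb = CAMERAS * gb_per_camera / 1000               # about 4.67 million TB in total
servers = total_tb / SERVER_TB                          # about 1.17 million servers
cost_billion_yuan = servers * SERVER_COST_YUAN / 1e9
print(f"{cost_billion_yuan:.2f} billion yuan")          # prints 58.32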
The retrieval of video data is inaccurate. It is difficult for people to locate useful
knowledge in massive video data. A large number of experiments have shown that
if an operator watches two monitors at the same time, he will miss 95 % of the
surveillance targets after 22 min. Currently, cross-regional crimes have become the
main modus operandi for international crimes. The Procuratorial Daily indicated
that 70–90 % of all crimes are committed by mobile offenders. As the areas of
crime-related regions expand, increasingly more video data need to be retrieved.
However, traditional monitoring systems can only sample and record; they cannot
analyze or extract critical information and high-level semantic content efficiently.
Once crimes have occurred, investigators must manually browse the suspected
targets one by one. The efficiency is therefore low, and it is easy to miss the best
time to solve the case. To improve the detection rate, it is
vital to utilize video retrieval technology to locate useful information from mas-
sive video data quickly.
The video data are not being utilized to their full advantage in public security.
The lifecycle of social safety incidents generally includes cause, planning, imple-
mentation, occurrence, and escape. Most professional crimes span large areas and
time periods and are highly concealed. Although the behavior in the crime prepa-
ration period is traceable, the existing domestic technology can only analyze
existing local data and make simple judgments, so it is difficult to detect abnormal
spatiotemporal behavior events early from multi-scale and multi-type media data.
Some crime prediction software can detect criminals' abnormal behavior to a
certain extent, which has a great effect on the prevention of crime. In Los Angeles,
for example, the use of the crime prediction software "PredPol" reduced the crime
rate by 13 % in the areas where it was deployed, while the crime rate in the rest of
the city increased by 0.4 %. At present, the video monitoring systems operating
throughout China collect PB-level video data every day, but they cannot produce
analytical alerts to prevent security incidents.
To solve the problem of storage size and investment, video data compression
removes redundant information from video data and can reduce the storage size
of video data. At the same time, cloud storage can store massive video data as it
breaks through the bottleneck of the capacity of traditional storage methods and
realizes the linear expansion in performance and capacity. It provides data stor-
age and business access to the external world based on the technology of applica-
tion clusters, networks, and distributed file systems. Application software enables
a large variety of storage devices to work together in the network. In its virtual
storage, all the equipment is completely transparent to cloud users; any authorized
user can connect to the cloud storage over the network and has access to a storage
capacity equivalent to that of the entire cloud. The monitored district is distrib-
uted widely, and each monitoring point creates a great deal of data. Using cloud
storage systems can realize distributed management easily and can extend capacity
at any time.
9.6.4.1 Behavior Analysis
The behavior analysis of video objects is the basis of event detection; it pre-
processes the video images and studies the temporal and spatial data of the targets
of interest in the video scenes (sometimes in relation to
9.6.4.2 Event Detection
The integration of GIS data and media data (video, audio) can enable the monitoring
of a city in four spatiotemporal dimensions. By combining video with GIS,
continuous information is extracted through automatic video data mining and
located spatially through GIS. The combination of the two methods supports
meaningful spatiotemporal correlation analysis and abnormal behavior analysis
(Fig. 9.24).
Traditional geographic information analysis is time-consuming, labor-intensive,
and unstable. Comparatively, spatiotemporal analysis from both geo-spatial data
and video data offers universality, humanity, and intellectualization. Real-time anal-
ysis and data mining can obtain the background data of static space and continu-
ous data about humans, cars, etc. Through spatiotemporal analysis of geographic
information and video, which shows fixed monitoring points on a map, the position
of a moving car under surveillance, and warning prompts, users can observe a moni-
tor's operating state clearly and obtain the graphic information of monitored sites
quickly and intuitively. If necessary, users can switch to the monitored site instantly,
which then provides a basis for remote command. This approach also can be applied
to various fields such as automatic protection systems, emergency response, road
maintenance, river improvement, city management, mobile surveillance, and tourism.
References
AAAS (American Association for the Advancement of Science) (2013) Conflict in Aleppo, Syria:
a retrospective analysis. http://www.aaas.org/aleppo_retrospective
Al-Hazzaa HA (2014) Opposition and regime forces split Idlib province after ISIS withdrawal. http://
www.damascusbureau.org/?p=6622
Barnard A (2013) Syrian forces recapture Damascus suburb from rebels. http://www.nytimes.
com/2013/11/14/world/middleeast/syrian-forces-recapture-damascus-suburb-from-rebels.html?_
r=0
Bernhard S, John CP, John AT, Alex JS (2001) Estimating the support of a high-dimensional dis-
tribution. Neural Comput 13(7):1443–1471
Bharti N, Tatem AJ, Ferrari MJ, Grais RF, Djibo A, Grenfell BT (2011) Explaining seasonal fluc-
tuations of measles in Niger using nighttime lights imagery. Science 334:1424–1427
Chen X, Nordhaus WD (2011) Using luminosity data as a proxy for economic statistics. Proc
Natl Acad Sci USA 108:8589–8594
Christopher BJ, Ware JM, Miller DR (2000) Bayesian probabilistic methods for change detection
with area-class maps. Proc Accuracy 2000:329–336
Cressie N (1991) Statistics for spatial data. Wiley, New York
Cumming-Bruce N (2014) Death toll in Syria estimated at 191,000. http://www.nytimes.
com/2014/08/23/world/middleeast/un-raises-estimate-of-dead-in-syrian-conflict-to-191000.html?_
r=0
Di KC (2001) Spatial data mining and knowledge discovery. Wuhan University Press, Wuhan
Franklin SE, Wulder MA, Lavigne MB (1996) Automated derivation of geographic window sizes
for use in remote sensing digital image texture analysis. Comput Geosci 22(6):665–673
Fu Y, Guo G, Huang TS (2010) Age synthesis and estimation via faces: a survey. IEEE Trans
Pattern Anal Mach Intell 32(11):1965–1976
Letu H, Hara M, Yagi H, Naoki K, Tana G, Nishio F, Shuher O (2010) Estimating energy con-
sumption from night-time DMSP/OLS imagery after correcting for saturation effects. Int J
Remote Sens 31:4443–4458
Li X, Li DR (2014) Can night-time light images play a role in evaluating the Syrian Crisis? Int J
Remote Sens 35(18):6648–6661
Li DR, Wang SL, Li DY (2006) Theory and application of spatial data mining, 1st edn. Science
Press, Beijing
Li X, Chen F, Chen X (2013a) Satellite-observed nighttime light variation as evidence for global
armed conflicts. IEEE J Sel Topics Appl Earth Obs Remote Sens 6:2302–2315
Spatial data are stored in the form of file documents, while the attribute data are
stored in databases. GISDBMiner is a GIS data-based SDM system (Fig. 10.1)
that supports the common standard data formats. The discovery algorithms are
executed automatically, and there is geographic human–computer interaction. The
user can define the subset of interest in a dataset, provide the context knowledge,
set the threshold, and select the knowledge representation format. If the needed
parameters are not provided, the system will automatically use default parameters.
In GISDBMiner, the user first must trigger the knowledge discovery command,
which signals the spatial database management system to allow access to the data subset
of interest from the spatial database (i.e., the task-related data). Then, the mining
algorithms are executed to discover knowledge from the task-related data accord-
ing to the user’s requirements and professional knowledge. Finally, the discovered
knowledge is output for the user’s application or is added to the knowledge base,
which is used for future knowledge discovery tasks. Generally, knowledge dis-
covery requires repeated interactions to obtain satisfactory final results. To start
the knowledge discovery process, the following steps are necessary: (1) the user
interactively selects the data of interest from the spatial database, (2) a visual
query and the search results are displayed to the user, and (3) the data subset of
interest is gradually refined. The process of knowledge discovery begins thereafter.
GISDBMiner has several key functions, which include data preprocessing, asso-
ciation rule discovery, clustering analysis, classification analysis, sequence analysis,
variance analysis, visualization, and a spatial knowledge base. The association rule
discovery process employs the Apriori algorithm and concept lattice, the clustering
analysis adopts the K-means clustering algorithm and the concept clustering algo-
rithm, and the classification analysis applies decision trees and neural networks.
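As a minimal illustration of the support/confidence reasoning behind the Apriori-based association rule discovery in GISDBMiner, the sketch below mines single-antecedent rules from a toy set of spatial attribute transactions; it is not the system's actual implementation.

# Toy illustration of support/confidence rule generation of the kind performed by the
# Apriori-based association rule module; the transactions are invented spatial attribute sets.
from itertools import combinations

transactions = [
    {"near_road", "high_light", "urban"},
    {"near_road", "high_light", "urban"},
    {"near_road", "low_light", "rural"},
    {"far_road", "low_light", "rural"},
    {"near_road", "high_light", "urban"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support, min_confidence = 0.4, 0.8
items = sorted({i for t in transactions for i in t})

for a, b in combinations(items, 2):              # single-antecedent rules a -> b only, for brevity
    s = support({a, b})
    if s >= min_support and s / support({a}) >= min_confidence:
        print(f"{a} -> {b}  (support={s:.2f}, confidence={s / support({a}):.2f})")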
bands, and the classification rules by which some object classes are established
from certain band intervals. Moreover, there are a large number of complex spatial
objects in remote sensing images that possess complex relationships and distribu-
tion rules between the objects, for which the spatial distribution rule mining fea-
ture of RSImageMiner was designed.
RSImageMiner includes image management and data mining on the image fea-
tures of spectrum (color), texture, shape, and spatial distribution pattern, as well as
image knowledge storage and management and knowledge-based image classifica-
tion, retrieval, and object recognition. During the RSImageMiner process, if the
image is too large, the data are divided into a number of relatively small sample
areas. The extent of each cut subsample image is set in accordance with the
sample image extent setting. Image features reflect the relationships between the
local regions of different imagery gray values, which should be normalized. After
the normalization process, the feature data of the image are stored in a database in
the form of a table. The various functionalities of image processing in the system
are based on integral management within the same project. The functions do not
interfere with each other, and different views can be designed for different pur-
poses by switching views. Saving the file is executed by entering the file name in
the pop-up dialog box. The image is stored using the binary large object (BLOB) mode,
the knowledge is stored as a file, the rules in the knowledge base are stored in the
form of relational tables for data records, and the field text of knowledge is stored
as a whole in the relational table.
Image management may open, store, delete, display, preprocess, and browse
images. Image and feature data are integrally stored and managed by a rela-
tional database management system in binary large object (BLOB) mode.
Texture feature mining is implemented in the imagery texture feature, which
includes cutting samples, normalizing, storing feature data, generating Hasse dia-
grams, and generating rules. The Hasse diagrams are generated by setting the param-
eters of the support threshold and the confidence threshold, the table file name and
text file name of redundant rules, and the table file name and text file name of the
non-redundant rules. The Hasse diagram in Fig. 10.3, which was created with the
concept lattice mining algorithm, reflects the hierarchical relationships within the
data and can be constructed manually or automatically. The manual mode inputs
each sample point by hand, which illustrates the process of constructing the incre-
mental concept lattice and the Hasse diagram. The automatic mode calls all data
records in the database to generate the complete concept lattice by itself. Based on
the generated Hasse diagram, the rules are generated and listed for display
(Fig. 10.4). Frequent nodes are those whose support is greater than the support
threshold.
Redundant rules are generated according to the frequent nodes directly, and the non-
redundant rules are generated according to the algorithm for non-redundant rules.
Shape feature mining is implemented by the imagery shape feature, which
includes detecting the edge, extracting the shape feature, generating Hasse dia-
grams, and generating rules. Boundary extraction delineates the typical sample area
and extracts the shape feature based on those boundaries. Shape feature extraction
calculates the shape feature of the boundary polygon in different sample areas.
Spectral feature mining is applied to the imagery spectral feature and includes
opening the image; acquiring its spectral values by outlining the sample area and
its schema; storing the spectral values in the sample area; normalizing the spec-
tral values, such as the degree of brightness (1, 2, 3, 4, or 5 corresponding to
very dark, dark, general, relatively bright, and very bright); generating the Hasse
Fig. 10.4 Texture feature rules
diagrams; and generating the non-redundant association rules of the spectral fea-
ture according to the generated Hasse diagram. In the association rule mining
algorithm, the spectrum of each band is divided into several sections, and each
spectral band is a data item. Each band value is mapped to its corresponding
interval, establishing a multiple-valued context table.
Spatial distribution rule mining is implemented by the imagery distribution fea-
ture and includes cutting the sample, normalizing, storing feature data, generating
Hasse diagrams, and generating rules. To determine the spatial distribution rule,
all the data should be mined. However, the processing speed for large amounts of
data is very slow, so the system provides the option of choosing a sample training
area to uncover the spatial distribution rule of the sample area. First, a classified
remote sensing image is opened; for example, the image is categorized into four
types: green field, water, residential area, and bare land. Second, the extent of a
sub-sample image is selected. Third, the sample image is cut, and the resulting
image is saved. The gray values of the classified image are then normalized, for
example to 0, 1, 2, and 3, respectively representing the four types of ground
features; and the spatial relationship data of the sample image are stored as a
record in the database.
Finally, in accordance with the generated Hasse diagram, the non-redundant asso-
ciation rules are generated, reflecting the spatial distribution.
The knowledge management function integrally stores and manages the dis-
covered knowledge. When the knowledge is generated, it is stored in the knowl-
edge base; each item of the knowledge from the image is stored as a relational data
record and the knowledge documents are stored in the form of text as a whole. This
function includes edit, query, comparison, and classification. Knowledge addition
refers to a specific rule from the data that is added as a record into the knowledge
base. Knowledge deletion refers to a specific rule that is deleted from the repository.
Knowledge query searches for one or more rules. Knowledge comparison compares
multiple knowledge bases, extracts the common rules, and stores them in a
common-knowledge file.
Multiple knowledge files can be used to generate a knowledge classifier.
The discovered knowledge—that is, texture association rules, spectral (color)
knowledge, shape knowledge, and spatial distribution rules—can be used to assist
image classification, retrieval, and object recognition. The discovered knowledge
also can be applied along with professional knowledge for comprehensive results.
Additionally, various texture association rules are used to determine the image
class by calculating the match degree of a texture image from the sample area.
Spatiotemporal video data mining can analyze abnormal motion trajectories, suspi-
cious objects, moving object classes, multi-camera spatiotemporal video, and
video from different periods.
(1) Analysis of abnormal motion trajectory. The linear trajectory and charac-
teristics are extracted, identified, and classified. Then, the problem of trajectory
crossing or separation is also addressed, as well as the problem of analyzing
multi-object overlapping anomalies.
(2) Target analysis based on the hybrid model. Target analysis includes color
model analysis, shape model analysis, and feature blocks or feature point anal-
ysis, such as pedestrian classification based on the HOG operator (see Fig. 10.5).
(3) Classification combined with pattern recognition. SVM, ANN, and boosting-based
rapid classification technologies are used to classify and identify moving targets.
(4) Spatiotemporal analysis of multi-camera video. This feature captures a tar-
get from multiple views at the same time. Where multiple cameras are installed
at a fixed location, the times for all cameras at which a target appears are inter-
connected. With this association information, the image target matching tech-
nology can be combined to conduct temporal and spatial correlation analysis.
(5) Camera video analysis in different periods. Feature extraction does not depend
on time-related imaging conditions such as environmental light, the contrast
ratio, and other information. Pedestrians or vehicles that
appear in different periods are automatically retrieved and recognized.
10.4 EveryData
The EveryData system is a web-based platform for users to collect, pre-process, ana-
lyze, and visualize the results for spatial data and non-spatial data (Fig. 10.6). The
website uses J2EE-related frameworks such as Struts2, Hibernate3, and Spring 2 to
access the database and to handle HTTP requests. The front end uses D3, the
Bootstrap framework, JavaScript, JSP, and CSS. The data mining algorithms are implemented using
Java, R, and Weka. Some of its available features to date are shown in Fig. 10.7.
Fig. 10.6 EveryData system
Users can upload their dataset files. Before uploading, users are required to set the
parameters for the dataset (e.g., the delimiter of the dataset) and indicate whether the
file has missing values and whether the first row is data or the parameter description.
After uploading the data files, users can choose the dataset of interest and preview it.
Users then may proceed with the data mining tasks using the selected dataset.
Using the Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) as an exam-
ple, users can use the K-means clustering algorithm to process the data. At the
backend, the K-means algorithm is invoked using R. After clustering, users can
see the cluster centers shown in the figures, the percentages of each cluster are
shown using a pie chart, and the total number of elements in each cluster is shown
as a bar chart. Because the Iris dataset has four dimensions (length of sepal, length
of petal, width of sepal, and width of petal), the website also uses 16 two-dimen-
sional images to show the clustering results. Users also can use parallel coordi-
nates to obtain a clear view of every cluster. Figure 10.7a–d are interoperable.
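For readers who want to reproduce this clustering step outside the website, a minimal Python equivalent is sketched below; EveryData itself invokes K-means through R at the backend, so this is an illustration rather than the site's code.

# Sketch: K-means clustering of the Iris dataset, analogous to the EveryData workflow above.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10).fit(iris.data)

print(kmeans.cluster_centers_)                           # the cluster centers shown in the figures
for label in range(3):
    size = int((kmeans.labels_ == label).sum())
    print(f"cluster {label}: {size} records, {size / len(iris.data):.1%} of the dataset")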
KNN is the classification algorithm available in the EveryData system. Users
can upload a training set and a test set separately, the results of which will be
printed in the webpage and will include the parameters of the classifier and the
accuracy and classification results of each record in the test set. The KNN algo-
rithm is implemented by invoking functions provided by Weka. A simplified deci-
sion tree algorithm and an Apriori algorithm for analyzing association rules are
also implemented. Clicking on the nodes expands or collapses the decision tree.
The results of the Apriori algorithm are shown in graph format (see Fig. 10.7e–f).
The clustering algorithm HASTA (Wang and Chen 2014; Wang et al. 2015) and
several image segmentation algorithms by the developers, including the GDF-Ncut
algorithm and the mean shift algorithm (Wang and Yuan 2014; Yuan et al. 2014),
also are available on the website. After uploading and choosing the proper dataset,
the user is required to set three parameters for the algorithms in order to produce
clustering results (Fig. 10.8).
References
Li DR, Wang SL, Li DY (2006) Theory and application of spatial data mining, 1st edn. Science
Press, Beijing
Li DR, Wang SL, Li DY (2013) Theory and application of spatial data mining, 2nd edn. Science
Press, Beijing
Wang SL, Chen YS (2014) HASTA: a hierarchical-grid clustering algorithm with data field. Int J
Data Warehousing Min 10(2):39–54
Wang SL, Yuan HN (2014) Spatial data mining: a perspective of big data. Int J Data Warehousing
Min 10(4):50–70
Wang SL, Chen YS, Yuan HN (2015) A novel method to extract rocks from Mars images. Chin J
Electr 24(3):455–461
Yuan HN, Wang SL, Li Y, Fan JH (2014) Feature selection with data field. Chin J Electr
23(4):661–665