Fatos Xhafa
Leonard Barolli
Admir Barolli
Petraq Papajorgji Editors
Modeling and
Processing for Next-
Generation Big-Data
Technologies
With Applications and Case Studies
Modeling and Optimization in Science
and Technologies
Volume 4
Series editors
Srikanta Patnaik, SOA University, Orissa, India
e-mail: [email protected]
Ishwar K. Sethi, Oakland University, Rochester, USA
e-mail: [email protected]
Xiaolong Li, Indiana State University, Terre Haute, USA
e-mail: [email protected]
Editorial Board
Li Cheng, The Hong Kong Polytechnic University, Hong Kong
Jeng-Haur Horng, National Formosa University, Yulin, Taiwan
Pedro U. Lima, Institute for Systems and Robotics, Lisbon, Portugal
Mun-Kew Leong, Institute of Systems Science, National University of Singapore
Muhammad Nur, Diponegoro University, Semarang, Indonesia
Luca Oneto, University of Genoa, Italy
Kay Chen Tan, National University of Singapore, Singapore
Sarma Yadavalli, University of Pretoria, South Africa
Yeon-Mo Yang, Kumoh National Institute of Technology, Gumi, South Korea
Liangchi Zhang, The University of New South Wales, Australia
Baojiang Zhong, Soochow University, Suzhou, China
Ahmed Zobaa, Brunel University, Uxbridge, Middlesex, UK
About this Series
The book series Modeling and Optimization in Science and Technologies (MOST) publishes basic principles as well as novel theories and methods in the fast-evolving field of modeling and optimization. Topics of interest include, but are not limited to: methods for analysis, design and control of complex systems, networks and machines; methods for analysis, visualization and management of large data sets; use of supercomputers for modeling complex systems; digital signal processing; molecular modeling; and tools and software solutions for different scientific and technological purposes. Special emphasis is given to publications discussing novel theories and practical solutions that, by overcoming the limitations of traditional methods, may successfully address modern scientific challenges, thus promoting scientific and technological progress. The series publishes monographs, contributed volumes and conference proceedings, as well as advanced textbooks. The main targets of the series are graduate students, researchers and professionals working at the forefront of their fields.
Editors
Fatos Xhafa, Universitat Politècnica de Catalunya, Barcelona, Spain
Admir Barolli, University of Salerno, Salerno, Italy
Nowadays, we are witnessing an exponential growth in data sets, an era coined as the Big Data era. The data generated in large Internet-based IT systems is becoming a cornerstone for cyber-physical systems, administration, enterprises, businesses, academia and all fields of human activity. Indeed, data is being generated everywhere: in IT systems, biology, genomics, finance, geospatial applications, social networks, transportation, logistics, telecommunications, engineering, digital content, to name a few. Unlike the recent past, when the focus of IT systems was on functional requirements and services, data is now seen as a new asset, and data technologies are needed to support IT systems with knowledge, analytics and decision support.
Researchers and developers are facing challenges in dealing with this data deluge. Challenges arise from the extremely large volumes of data, their heterogeneous nature (structured and unstructured) and the pace at which data is generated, requiring both offline and online processing of large streams of data, as well as storage, security, anonymity, etc. Obviously, most traditional database solutions cannot cope with such challenges, and non-traditional database and storage solutions are imperative today. Novel modelling, algorithms, software solutions and methodologies covering the full data cycle (from data gathering to visualisation and interaction) need to be investigated.
This Springer book brings together nineteen chapter contributions on new models and analytic approaches for the modelling of large data sets, efficient data processing (online/offline) and analysis (analytics, mining, etc.) to enable next-generation data-aware systems offering quality content and innovative services in a reliable and scalable way. The book chapters critically analyze the state of the art and envision the road ahead on modelling, analysis and optimisation models for next-generation big data technologies. Finally, benchmarking, frameworks, applications, case studies and best practices for big data are also included in the book.
Chakraborty et al. present a novel tree-based scheme for data gathering in Vehicular Sensor Networks. The presented approach enables an efficient design of a data collection protocol through which the delay sensitivity and reliability of the large volume of application data, as well as the scarcity of sensor resources, can be addressed. Matsuo et al. present a data gathering method that considers the geographical distribution of data values for reducing traffic in dense mobile wireless sensor networks. Data replication and the usefulness of P2P techniques to increase availability and reliability are presented in the chapter by Spaho et al.
Morreale et al. present a methodology for data cleaning and preparation to support big data analysis, along with a comparative examination of three widely available data mining tools. The proposed methodology is used for analysis and visualisation of a large-scale time series dataset of environmental data. The research issues related to visualisation of multidimensional data are studied in the chapter by Okada, where the author introduces an interactive visual analysis tool for multidimensional and multi-attribute data. Strohbach et al. analyse the high-level requirements of big data analytics and then provide a Big Data Analytics Framework for IoT and its application to smart cities. Their approach is exemplified through a case study in the smart grid domain. A prototype of the framework addressing the volume and velocity challenges is also presented.
[12] Antonio Scarfò and Francesco Palmieri. How the big data is leading the evolution of ICT technologies and processes
[13] David Simms. Big Data, Unstructured Data and the Cloud: Perspectives on Internal Controls
Scarfò and Palmieri highlight the most important innovation and development trends in the new arising scenarios of Big Data and its impact on the organisation of ICT-related companies and enterprises. The authors make a critical analysis, address the missing links in the ICT big picture, and present the emerging data-driven reference models for the modern information-empowered society. Simms analyses the increasing awareness by businesses and enterprises of the value of Big Data to the world of corporate information systems. His analysis nevertheless shows that, in spite of the potential advantages brought by Big Data and Cloud computing, such as the use of outsourcing, their adoption requires addressing issues of confidentiality, integrity and availability of applications and data. An appropriate understanding of risk and control issues is advocated in the chapter as a prerequisite for the successful adoption of these new technologies.
The following two chapters address the challenges to be faced in achieving a user-centric, aware IoT that brings together people and devices into a sustainable eco-system.
Emerging Applications
Altogether, the chapters of the book present a variety of big data applications from IoT systems, smart cities, traffic control, energy-efficient systems, disaster management, etc., shedding light on the great potential of Big Data and also envisioning the road ahead in this exciting field of data science.
Acknowledgements
The editors of this book wish to sincerely thank all the authors of the chapters for their interesting contributions to this Springer volume, for taking on board all comments and feedback from editors and reviewers, and for their timely efforts to provide high-quality chapter manuscripts. We are grateful to the reviewers of the chapters for their generous
time and for giving useful suggestions and constructive feedback to the authors. We
would like to acknowledge the encouragement received from Prof. Srikanta Patnaik,
the editor in chief of the Springer series “Modeling and Optimization in Science &
Technology” and the support from Dr. Leontina Di Cecco, Springer Editor, and the
whole Springer’s editorial team during the preparation of this book.
Finally, we wish to express our gratitude to our families for their understanding and
support during this book project.
List of Contributors
Patricia Morreale, Kean University, Union, NJ, USA, [email protected]
Sukumar Nandi, Indian Institute of Technology, India, [email protected]
Shojiro Nishio, Osaka University, Japan, [email protected]
Yoshihiro Okada, Kyushu University, Japan, [email protected]
Yusuke Omori, Kyoto University, Japan, [email protected]
Yasuharu Sawada, Kyoto University, Japan, [email protected]
Ryoichi Shinkuma, Kyoto University, Japan, [email protected]
Carlos Silva, Kean University, Union, NJ, USA, [email protected]
David Simms, Haute Ecole de Commerce, University of Lausanne, Switzerland, [email protected]
Fatos Xhafa, Technical University of Catalonia, Spain, [email protected]
Muhammad Younas, Oxford Brookes University, UK, [email protected]
Miguel A. Zamora-Izquierdo, University of Murcia, Spain, [email protected]
Zhang Xueli, Chinese Academy of Telecom Research, China, [email protected]
Holger Ziekow, AGT International, Germany, [email protected]
Exploring the Hamming Distance in Distributed
Infrastructures for Similarity Search
1 Introduction
In the current Big Data scenario, users have become data sources; companies store uncountable information about clients; millions of sensors monitor the real world, creating and exchanging data in the Internet of Things. According to a study from the International Data Corporation (IDC) published in May 2010 [1], the amount of data available on the Internet surpassed 2 ZB in 2010, is doubling every 2 years, and might surpass 8 ZB in 2015. The study also revealed that approximately 90% of this data is unstructured, heterogeneous, and variable in nature, such as texts, images, and videos.
Emerging technologies, such as Hadoop [2] and MapReduce [3], are examples of solutions designed to address the challenges imposed by Big Data in the so-called three Vs: Volume, Variety, and Velocity. Through parallel computing techniques in conjunction with grid computing or, more recently, taking advantage of the infrastructure offered by the cloud computing concept, IT organizations offer means for handling large-scale, distributed, and data-intensive jobs. Usually, such technologies offer a distributed file system and automated tools for adjusting, on the fly, the number of servers involved in the processing tasks. In such cases, large volumes of data are pushed over the networking facility connecting the servers, transferring <key, value> pairs from mappers to reducers in order to obtain the desired results. In this scenario it is desirable to minimize the need for moving data across the network in order to speed up the overall processing task.
While current solutions are unquestionably efficient for handling traditional applications, such as batch processing of large volumes of data, they do not offer adequate support for similarity search [4], whose objective is the retrieval of sets of similar data given a similarity level. As an example, similarity between data may be used in a recommender system based on users' social profiles. In this example, a user profile can be defined as a set of characteristics that uniquely influence how users make their decisions. Users with similar characteristics are more likely to have similar interests and preferences.
In this way, to build a similarity search system based on users' profiles, the characteristics of a user in a social network can be placed in a vector and, using a Vector Space Model (VSM), the similarity between users can be measured through vector distance metrics such as the Euclidean distance, the cosine, and the Hamming distance. However, except for the Hamming distance, all the other metrics are affected by the curse of dimensionality [4]. The high computational cost due to the dimensionality problem is a challenge to be faced by new similarity search systems in a Big Data scenario.
This chapter presents how to support similarity search using the Hamming distance as the similarity metric. To achieve that, data are indexed in a database using a Locality Sensitive Hashing (LSH) function called Random Hyperplane Hashing (RHH) [5]. RHH is a family of LSH functions that uses the cosine similarity between vectors and the Hamming distance as the metric between the generated binary strings, i.e., the greater the cosine similarity between a pair of content vectors, the lower the Hamming distance between their binary strings. These binary strings serve as data identifiers whose similarity can be measured using the Hamming distance. Each query to this database is evaluated through the Hamming distance between the query identifier and each data identifier.
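The following sketch illustrates the general RHH idea (a minimal illustration written for this chapter, not the authors' implementation): each of the m bits of an identifier records on which side of a random hyperplane the content vector falls, so vectors with high cosine similarity tend to produce identifiers at a small Hamming distance.

import math
import random

def rhh_identifier(vector, hyperplanes):
    # One bit per random hyperplane: 1 if the vector lies on its positive side.
    return [1 if sum(v * w for v, w in zip(vector, h)) >= 0 else 0
            for h in hyperplanes]

def hamming_distance(id_a, id_b):
    return sum(a != b for a, b in zip(id_a, id_b))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

random.seed(42)
m, d = 128, 103                       # identifier length and vector dimension
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

u = [random.random() for _ in range(d)]
v = [x + random.gauss(0, 0.05) for x in u]    # a slightly perturbed, similar vector

dh = hamming_distance(rhh_identifier(u, planes), rhh_identifier(v, planes))
print(cosine_similarity(u, v), dh)    # high cosine similarity -> small Hamming distance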
In the similarity search, a query to be evaluated is composed of the same set of characteristics as the content indexed in the database. Each user of the similarity search system enters the desired characteristics and a similarity level according to the desired volume of answers to the query. The greater the similarity level, the lower the number of data items retrieved, because the query becomes more specific. The query is indexed in the database using the RHH function, and the Hamming similarity between the query and all data identifiers in the database is calculated. All those profiles whose Hamming similarity satisfies the desired similarity level are returned as the query response.
To evaluate the similarity search system, the following tests were done: the correlation between the cosine similarity of content vectors and the Hamming similarity of their identifiers is presented for four different similarity levels (0.7, 0.8, 0.9, and 0.95); the frequency distribution of the Hamming distance between content identifiers is analyzed according to their similarity level; and, finally, some results of selected queries and responses are presented. In our experiments, content vectors represent user profiles in the Adult Data Set of the UCI Repository [6].
In our previous work [7], an overlay solution for this similarity search system was developed on top of a Distributed Hash Table (DHT) structure. Essentially, it showed that similar data can be stored on servers that are close in the logical space of the overlay network by using a put(k, v) primitive, and that a set of similar data can be efficiently recovered by using a single get(k, sim) primitive. In another previous work [8], the HCube was presented, a Data Center solution designed to support similarity searches in Big Data scenarios, aiming to reduce the distance needed to recover similar contents in a similarity search. In the HCube, similar data are stored in the same hosting server or in servers located near each other in the Data Center.
This chapter is organized as follows: Section 2 presents some background on the technologies used in the Hamming DHT and the HCube. Section 3 contains a literature review of related work on similarity search in Peer-to-Peer (P2P) networks and Data Centers. Section 4 briefly presents the Hamming DHT and HCube solutions. Section 5 evaluates the proposed similarity search system in distributed scenarios. Section 6 provides some final remarks and future work.
2 Background
This section presents the concept of the VSM, a model to represent data as vectors in
a multidimensional space; the RHH Function, an LSH function used to generate data
identifiers preserving similarity between content vectors; and the Hamming similarity
function, a similarity function that is used to compare the Hamming distance between
binary identifiers.
For the experiments, some adaptations had to be made to the adult vectors. Numerical attributes present in the vectors, such as "age", had to be normalized to the range [0..1]. Such normalization was done by dividing each value by the highest one in the set. Coordinates that represent discrete attributes (for example, "sex", which may be "male" or "female") were split into separate dimensions, one for each possible value. For the "sex" attribute, two dimensions were created: "male" and "female". If the person is a man, his vector has the value "1" for the "male" dimension and "0" for the "female" dimension, and vice versa when a woman is represented.
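A minimal preprocessing sketch is shown below; the attribute names and the maximum age are only illustrative and do not reproduce the exact Adult Data Set handling used in the experiments.

def build_profile_vector(profile, max_age, sex_values=("Male", "Female")):
    vector = []
    # Numerical attributes are scaled to [0..1] by the largest value in the set.
    vector.append(profile["age"] / max_age)
    # Discrete attributes are expanded into one dimension per possible value.
    for value in sex_values:
        vector.append(1.0 if profile["sex"] == value else 0.0)
    return vector

print(build_profile_vector({"age": 31, "sex": "Male"}, max_age=90))
# -> [0.3444..., 1.0, 0.0]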
As stated in [9], this procedure was necessary because the notion of similarity or distance for discrete information is not as straightforward as for numerical values, and this was a major challenge faced here. This is due to the fact that the different values taken by a discrete attribute are not inherently ordered, and hence a notion of ordering between them is not possible. Also, the notion of similarity can differ depending on the particular domain. For this reason, each discrete attribute of a vector had to be expanded into a number of dimensions equal to the number of values it can take. Using this procedure, the adult vectors were extended from 14 to 103 dimensions.
One way to measure the similarity between vectors that represent some data is to calculate the cosine of the angle between them (simcos). The cosine similarity produces high-quality results across several domains, as presented in [10]. To illustrate this, Figure 1 presents an application in which user profiles are represented by two-dimensional vectors, each dimension describing the user's interest in Sports and Literature on a scale in which "0" means no interest and "10" means total interest in a topic. Consider four user profiles represented by the tuples PROFILE1(3,4) - rank 3 for Sports and 4 for Literature, PROFILE2(4,4) - rank 4 for both Sports and Literature, PROFILE3(5,3) - rank 5 for Sports and 3 for Literature, and PROFILE4(7,3) - rank 7 for Sports and 3 for Literature.
In order to provide insights into the use of similarity by cosine (simcos), consider the development of a friendship recommender system based on users' profiles, responsible for indicating friends with similar interests to a new user profile represented by the tuple NEW PROFILE(8,1) - rank 8 for Sports and 1 for Literature. In this case, the recommender system would suggest to NEW PROFILE the following order of preference for the establishment of new friendships: 1) PROFILE4, whose simcos ≈ 0.96; 2) PROFILE3, whose simcos ≈ 0.91; 3) PROFILE2, whose simcos ≈ 0.79; and 4) PROFILE1, whose simcos ≈ 0.69.
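These cosine values can be checked with a few lines of code (a worked verification added here, not part of the original chapter):

import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

new_profile = (8, 1)
profiles = {"PROFILE1": (3, 4), "PROFILE2": (4, 4),
            "PROFILE3": (5, 3), "PROFILE4": (7, 3)}

for name, vec in sorted(profiles.items(),
                        key=lambda kv: cos_sim(new_profile, kv[1]),
                        reverse=True):
    print(name, round(cos_sim(new_profile, vec), 2))
# PROFILE4 0.96, PROFILE3 0.91, PROFILE2 0.79, PROFILE1 0.69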
Fig. 1. User profiles represented as two-dimensional vectors in the Sports-Literature plane (scale 0-10): PROFILE1(3,4), PROFILE2(4,4), PROFILE3(5,3), PROFILE4(7,3), and NEW_PROFILE(8,1).
Table 1. 8-bit profile identifiers, Hamming distance (Dh), Hamming similarity (simh) and cosine similarity (simcos) of user profiles
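For reference, a minimal sketch of the Hamming similarity used to fill such a table is given below, assuming the usual normalisation simh = 1 - Dh/m for m-bit identifiers (the exact definition and the identifier values in Table 1 come from the chapter, so the 8-bit identifiers below are purely hypothetical).

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def hamming_similarity(a, b, m):
    return 1.0 - hamming_distance(a, b) / m

id_a = 0b10110100          # hypothetical 8-bit RHH identifiers
id_b = 0b10110110
print(hamming_distance(id_a, id_b),          # 1
      hamming_similarity(id_a, id_b, m=8))   # 0.875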
3 Literature Review
Among the related works on similarity search available in the literature, most are based on some kind of indexing scheme, such as hashing functions [4] or Space Filling Curves (SFCs) [11]. Both are motivated by the nearest neighbors problem [4], i.e., how to retrieve the most similar data in an indexing space. The adoption of the Hamming similarity property of the RHH function brings an advantage: it is not necessary to use the Hilbert SFC to aggregate similar identifiers.
As noted in [4], SFCs are affected by the curse of dimensionality. To address the dimensionality problem, Indyk [4] proposed the use of LSH functions. These functions reduce the number of dimensions of a vector by creating an identifier for it, represented by a binary string of size m (m ≥ 0). The distance between two binary strings generated by the application of any LSH function to a pair of content vectors is inversely proportional to the similarity between them. Our proposal of using the Hamming distance to measure the similarity among contents, mirroring the corresponding similarities among user profiles, is, to the best of our knowledge, a new approach in the design of similarity search systems.
In the first implementation of the similarity search prototype it was necessary to search the whole database. We are aware that this is not adequate for the Big Data scenario that motivated this proposal. To address this challenging Big Data scenario, we propose two distributed architectures, the Hamming DHT [7] and the HCube [8], that can serve as infrastructure for similarity search in a Big Data context. The Hamming DHT is a P2P network that exploits the Hamming similarity to facilitate the search and retrieval of similar profiles. The HCube is a data center infrastructure specialized in similarity search using the Hamming similarity.
Overlay approaches appear as solutions to help manage large volumes of data. In general, these solutions are based on Distributed Hash Tables (DHTs), sharing data among peers in an overlay network. According to [12], P2P multidimensional indexing methods have emerged as a whole new paradigm over the last few decades. In this scenario, DHTs need to be equipped with multidimensional query and similarity search processing capabilities.
Hycube [13] is an example of a DHT that uses the Hamming distance as its metric and organizes peers in a unit-size hypercube. However, the costs involved in maintaining the hypercube under churn and with incomplete cubes are greater than the costs in a consistent-hashing DHT such as the proposed Hamming DHT.
pSearch, proposed by Tang et al. [14], is a P2P network that also uses content vectors and cosine similarity, but it differs from the Hamming DHT since it is specialized for the similarity search of text documents in a P2P network. pSearch also discusses a way to build a distributed term dictionary in order to index text documents. Bhattacharya [15] uses cosine similarity and LSH functions to propose a framework for similarity search in distributed databases. It extends the get(k) primitive, provided by any DHT implementation, to support a similarity level, get(k, Dh), within the Hamming distance Dh from k. This proposal is not specialized in similarity searches, since it is a patch to be applied over any existing DHT to support similarity search.
From this brief survey we can highlight that the Hamming DHT focuses on reducing the distance (in hops) necessary to recover similar contents and on increasing the recall of similarity search systems. The use of LSH functions to index similar contents is not new, but the exploitation of the Hamming similarity in the organization of content identifiers is an original idea with no similar proposals. This design makes the Hamming DHT much simpler than other solutions.
When compared to these works, HCube is a distributed solution using techniques such as LSH functions and SFCs to support similarity searches, but HCube is not based on an overlay solution. HCube presents a server-centric data center structure specialized for similarity search, where similar data are stored on physically nearby servers, allowing such data to be recovered in a reduced number of hops and with reduced processing requirements. On the other hand, players such as Google [16] and Amazon [17] have developed data centers specialized in storing and processing large volumes of data through MapReduce solutions, but none of them consider similarity search. In this way, HCube opens up a new research field, where applications can benefit from its similarity search structure, avoiding processing-intensive tasks such as those in MapReduce, since similar data are placed on nearby servers during the storing phase.
The next section presents how to perform searches on a distributed infrastructure based on the Hamming distance, considering the similarity of the results.
In the proposed similarity search system, a user issuing a query must select the desired characteristics and a similarity level, in the [0..1] interval, for the query. Then, the content database is searched for the entries that satisfy the selected characteristics. In this first implementation, only complete queries are allowed, with all attributes filled in. Figure 2 shows an example of the main interface used to select the desired characteristics for the search. In the same interface the user inputs a similarity level. As said before, the greater the similarity level, the lower the number of obtained results, because fewer profiles in the database satisfy the query. The results are put in a list as the output of the similarity search.
Figure 3 summarizes the prototype operation. The user of the proposed similarity search selects all the desired characteristics for a query, as well as the desired similarity level. After selecting the desired characteristics, a content vector v is built according to the procedure described in Section 2.1. This vector v is used as input to the RHH function, which generates the query hash identifier. The Hamming similarities between v and all entries in the database are calculated, and all entries satisfying the desired similarity level are listed to the user. A second-level filter could be implemented to refine the results or to select the most recent ones, for example, but such a filter is not present in this implementation.
Some examples of results, using the adult set described in Section 2.1, are presented and analyzed. The prototype interface was used to retrieve all similar profiles in the database according to a desired similarity level. In the evaluation we use four similarity levels to illustrate the results: 0.7, 0.8, 0.9, and 0.95.
Table 2 presents some of the results. In all cases the input to the search was a profile with the following characteristics: "31" years old; working in the "Private" sector; High School graduate, "HS-grad"; "9" years in school; "Divorced"; working in "Sales"; relationship status equal to "Other-relative"; "White"; "Male"; born in the "United-States"; annual income "<=50K" U.S. dollars. A similarity level of 0.7 was used, which means that results with similarity greater than or equal to 0.7 were considered.
Table 2. Some results of the prototype, and their Hamming similarity (simh ), using the same user
profile q as input. q: <31; Private; HS-grad; 9; Divorced; Sales; Other-relative; White; Male;
United-States; <=50K>.
Not all results are presented in Table 2; there were 48842 profiles in the database. In the following tests, 48842 queries were run against the same 48842 profiles, leading to 2385540964 results to be evaluated according to the selected similarity level. The total number of results is variable and depends on the size of the database and the similarity level. The greater the similarity level, the smaller the total number of profiles retrieved.
In the next subsections we present the Hamming DHT and the HCube, two different
ways of distributing the proposed similarity search system.
The Hamming DHT inherits from Chord [18] the consistent hashing approach and the
join and leave procedures, but proposes two new features aimed at extracting the max-
imum benefits from the proposed mechanism for generating similar identifiers for sim-
ilar contents. In short, the two new features are: 1) the use of Gray codes in the orga-
nization of the identifiers in the ring and 2) the establishment of fingers based on the
Hamming distance of peers’ identifiers.
This section details the storage and retrieval aspects of the proposed system, whose
content classification mechanism is developed using the RHH function meeting the
following properties:
– ∀ c1, c2 ∈ C: simcos(c1, c2) → [0..1], where c1 and c2 are content vectors in a content vector space C.
– ∀ c1, c2 ∈ C: Dh(RHH(c1), RHH(c2)) ∝ 1/simcos(c1, c2).
In essence, as shown in [19], the properties inherent to the RHH functions can represent, with a high level of accuracy, the similarity between contents measured as their Hamming similarity. This characteristic, in conjunction with the Gray code organization of the identifiers in the Hamming DHT and the establishment of fingers based on the Hamming distance between peers, provides an efficient system for similarity search, reducing the distance (in hops) between peers storing similar contents.
Next, after obtaining their identifiers, peers join the ring, which is organized according to the Gray code sequence, differently from other DHTs such as Chord, in which the ring is organized in the increasing natural order of identifiers. Figure 4 shows an example of the proposed Hamming DHT using m = 5. As can be seen in this figure, there are four peers (3 - 00011₂, 13 - 01101₂, 30 - 11110₂, and 22 - 10110₂) and three contents (0 - 00000₂, 24 - 11000₂, and 31 - 11111₂). Observe on the right side of Figure 4 the differences between the Gray code sequence and the increasing natural order of identifiers. In the Gray code sequence we have 3 < 13 < 30 < 22.
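This ordering can be reproduced by converting each identifier, read as a binary-reflected Gray code word, back to its rank in the Gray sequence; the sketch below (an illustration, not the chapter's join procedure) confirms the example for m = 5.

def gray_rank(identifier):
    # Inverse of the binary-reflected Gray code: rank of the identifier in the sequence.
    rank = 0
    while identifier:
        rank ^= identifier
        identifier >>= 1
    return rank

peers = [3, 13, 30, 22]               # 00011, 01101, 11110, 10110
print(sorted(peers, key=gray_rank))   # [3, 13, 30, 22]
print([gray_rank(p) for p in peers])  # ranks 2, 9, 20, 27 in the Gray sequence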
4.2 HCube
As seen in Figure 5, the logical organization of HCube is composed of two layers, the
Admission Layer and the Storage Layer.
The Admission Layer provides the interface between the external world and the
HCube structure. This layer is composed of a set of Admission Servers, which operate
on top of the HCube, receiving the queries from the users/applications and preparing
such queries for being injected in the HCube in order to perform the similarity search.
For simplicity, Figure 5 shows only one Admission Server at the Admission Layer.
The top left corner of Figure 5 shows a query vector q, composed of d dimensions, being admitted by the HCube, in conjunction with the desired similarity level sim. The Admission Server in charge of handling such a query may be designated according to, for example, the geographic region from which the query originates.
Once the query vector q is received, the Admission Server obtains the data identifier di according to the process described in Section 2.2; in Figure 5 this identifier is 128 bits long. The di is the reference data identifier for the similarity search process. Next, the Admission Server reduces di, also using the RHH function, to obtain the identifier si, which indicates the Hosting Server responsible for the reference data di. In this example, si is 6 bits long, resulting in an HCube composed of 64 servers (4x4x4 servers).
After the admission of the query q, a get message composed of the reference data identifier di, the hosting server si, and the desired sim level is issued in the Storage Layer. This message is routed toward si using the XOR Routing presented later in Section 4.2.3 and, from the hosting server si, a series of get(di, sim) messages is triggered and forwarded to the neighboring servers, also using the XOR mechanism.
The generation of such get(di, sim) messages is driven by several factors, including the reference data di and the desired sim level. Specifically for the purposes of the evaluations, the get(di, sim) messages are gradually sent to all servers, following an increasing order of Hamming distance between server identifiers, since the objective is to provide a full analysis of the distance between servers storing similar data and the recall with which similar data is recovered.
As the get messages are routed inside the HCube, the servers containing data within the desired sim level forward such data to the Admission Server responsible for handling the request at the Admission Layer. The Admission Server summarizes the answers and delivers the set of similar data to the requesting user/application, concluding the similarity search process. As an alternative implementation, the Admission Server may return a list of references to the similar data instead of returning the entire data set. This option avoids unnecessary movement of huge volumes of data and allows users to choose which pieces of data they really want to retrieve; for example, a doctor may open only a few past diagnoses related to the current treatment.
L1 = [  8  9 11 10 ]     L3 = [ 56 57 59 58 ]
     [ 12 13 15 14 ]          [ 60 61 63 62 ]
     [  4  5  7  6 ]          [ 52 53 55 54 ]
     [  0  1  3  2 ]          [ 48 49 51 50 ]

L2 = [ 24 25 27 26 ]     L4 = [ 40 41 43 42 ]
     [ 28 29 31 30 ]          [ 44 45 47 46 ]
     [ 20 21 23 22 ]          [ 36 37 39 38 ]
     [ 16 17 19 18 ]          [ 32 33 35 34 ]

Fig. 7. Visualization of an HCube with 64 servers (L1, L2, L3, and L4)
be directly connected to a given server, and the remaining servers whose Hamming
distance is 1 will be placed in a higher number of hops.
Figure 8 exemplifies an HCube bigger than 6 bits, showing the layers of an 8x4x4 (7-bit) HCube with 128 servers. Note the Gray SFC represented by the arrows, and consider server 9 (0001001₂), located in L1, as an example. There are seven servers at Hamming distance 1, organized as follows: servers 8 (0001000₂, L1) and 11 (0001011₂, L1) on the x axis; servers 25 (0011001₂, L1) and 1 (0000001₂, L1) on the y axis; servers 41 (0101001₂, L2) and 73 (1001001₂, L4) on the z axis; and, finally, server 13 (0001101₂, L1), which is located a higher number of hops away from server 9. It is important to highlight that the distance in hops to server 13 is reduced thanks to the wrapped links established between the edge servers of the HCube.
d(a, b) div 2^i = 1,  a ≠ b,  0 ≤ i ≤ n − 1,                    (2)

where div denotes the integer division operation on integers.
In order to exemplify the creation of the routing tables, consider the first example given in Section 4.2.2, taking server 29 as a = 011101 and its neighbor 13 as b = 001101. The distance d(a, b) = 010000 and the highest i that satisfies condition (2) is i = 4, which means that the identifier b = 001101 must be stored in the bucket βn−1−i = β1. Basically, condition (2) states that server a stores b in the bucket βn−1−i, in which n − 1 − i is the length of the longest common prefix (lcp) between the identifiers a and b. This can be observed in Table 3, in which the buckets β0, β1, β2, β3, β4, β5 store the identifiers having an lcp of length 0, 1, 2, 3, 4, 5 with server 29 (011101).
Table 3. Hypothetical routing table for server 29 (011101) in an HCube with a server identity
space in which n = 6
β0 β1 β2 β3 β4 β5
111101 001101 010101 011001 011111 011100
100000 000000 010000 011000 011110
100010 000100 010100 011010
111111 001111 010111 011011
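As an illustration only (not the authors' code), the bucket index from condition (2) can be computed directly from the XOR distance, since n − 1 − i is simply the length of the longest common prefix of the two n-bit identifiers:

def bucket_index(a, b, n):
    d = a ^ b                      # XOR distance d(a, b)
    i = d.bit_length() - 1         # largest i with d div 2**i == 1
    return n - 1 - i               # length of the longest common prefix

print(bucket_index(0b011101, 0b001101, n=6))  # server 29, neighbor 13 -> bucket 1
print(bucket_index(0b011101, 0b011100, n=6))  # Hamming-distance-1 neighbor -> bucket 5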
Such a routing table approach is one of the main advantages of the XOR-based mechanism, since a server only needs to know one neighbor per bucket, out of the 2^n possible servers in the network, to successfully route packets. If a server has more than one entry per bucket, such additional entries might optimize the routing process, reducing the number of hops in the path from source to destination. Another important
characteristic of this routing table is the number of servers that fit in each of the buckets. There is only one server that fits in the last bucket (β5), two in β4, four in β3, and so on, doubling until reaching the first bucket (β0), which holds 50% of all servers (32 servers for n = 6). For simplicity of presentation, Table 3 shows at most four example entries per bucket.
Furthermore, given the Gray code distribution of server identifiers adopted in the HCube, filling the buckets is easy, since each of the Hamming distance 1 neighbors fits in exactly one bucket of the routing table, ensuring that all traffic can be forwarded inside the HCube. Note in the first line of Table 3 the intentional presence of all servers whose identifiers are at Hamming distance 1 from server 29 (011101): server 111101 in β0, 001101 in β1, 010101 in β2, 011001 in β3, 011111 in β4, and 011100 in β5. As mentioned before, for HCubes bigger than 64 servers (6 bits), only 6 servers with Hamming distance 1 will be physically connected to a given server. In this case, a signaling process [21] is used to discover the other servers located at distances greater than 1 hop.
The next section presents some evaluations of the proposed similarity search system in distributed environments.
5 Evaluations
First, the correlation between Hamming and cosine similarity was evaluated, showing that the RHH function can generate content identifiers that preserve the similarity of the content vectors in the Hamming distance between the identifiers.
After that, some evaluations regarding the Hamming DHT and HCube are presented.
More results can be found in [7] and [8]. These evaluations show that it is possible to
reduce the distance between similar data in distributed environments using both pro-
posals.
Fig. 9. Correlation between the cosine similarity of content-vector pairs and the Hamming similarity of their 128-bit identifiers, for cosine similarity greater than or equal to 0.7 (a), 0.8 (b), 0.9 (c), and 0.95 (d). Each panel plots the Hamming similarity (128 bits) against the cosine similarity.
The greater the cosine similarity, the greater the correlation. Table 4 shows the average correlation between the cosine similarity (simcos) of pairs of queries and adult profiles and the Hamming similarity (simham) of their 128-bit identifiers. The samples were grouped according to four similarity levels: 0.7, 0.8, 0.9, and 0.95. The results show a strong correlation between cosine and Hamming similarity for 128-bit identifiers of pairs of adult profiles, especially for the highest similarity levels. Still examining this table, the greater the similarity level, the greater the correlation between the cosine similarity of the vectors and the Hamming similarity of the profile identifiers. The standard deviations of these averages for the lowest similarity levels are high due to the probabilistic nature of the RHH function, but the confidence intervals of the samples were calculated and are low and acceptable. For the highest similarity levels the standard deviations are negligible, as are the confidence intervals of the samples.
Another test to assess the feasibility of the similarity search in this scenario is the evaluation of the frequency distribution of the Hamming distance of pairs of content identifiers, according to their cosine similarity. As indicated in Section 5.3.1, each profile in the Adult data set was used as a query vector over the entire database. For each pair of adults, the cosine similarity and the corresponding Hamming distance of their identifiers were measured. The evaluation was done using four different similarity intervals: [0.7..0.8), [0.8..0.9), [0.9..0.95), and [0.95..1). Figure 10(a) shows the frequency distribution of the Hamming distance of the 128-bit user identifiers that have a cosine similarity greater than or equal to 0.7 and less than 0.8, i.e., in [0.7..0.8). As depicted, the results tend to a normal distribution in which most of the distances are between 20% (25 bits) and 30% (38 bits) of the total length of the identifier. The same behavior can be observed in the other ranges: [0.8..0.9) (Figure 10(b)), [0.9..0.95) (Figure 10(c)), and [0.95..1.0) (Figure 10(d)).
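These observed ranges are consistent with the standard analysis of random hyperplane hashing, in which the probability that a single bit differs equals θ/π for an angle θ between the vectors; the short check below is an editorial back-of-the-envelope sketch, not the authors' derivation.

import math

def expected_hamming_distance(sim_cos, m=128):
    # Expected number of differing bits for m-bit RHH identifiers.
    return m * math.acos(sim_cos) / math.pi

print(round(expected_hamming_distance(0.8)), round(expected_hamming_distance(0.7)))
# about 26 and 32 bits, in line with the 25-38 bit range seen in Figure 10(a)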
Fig. 10. Frequency distribution of the Hamming distances between 128-bit identifiers for the similarity intervals [0.7..0.8) (a), [0.8..0.9) (b), [0.9..0.95) (c), and [0.95..1.0) (d). Each panel plots the average percentage of occurrences against the Hamming distance.
• Each profile in this set was used as a query vector to find similar profiles among all profiles of the same data set. There are a total of 48842 adult profiles in the data set, which means that 48842 x 48842 queries were performed over the Adult Data Set;
• An identifier was generated for each query, and a lookup having such an identifier as argument was executed in both DHTs. The lookup message is forwarded to the peer responsible for the query identifier (the hosting peer), i.e., the identifier's successor on the ring;
• From the hosting peer, each profile identifier within the query's similarity level is retrieved, and the distance in number of hops from the hosting peer to the other peers hosting each similar content is measured.
The following metrics were evaluated:
• The frequency distribution of the number of hops needed to retrieve all similar profiles in the set according to their similarity level. This evaluation shows that the Hamming DHT aggregates similar content more than a normal ring-style DHT, such as Chord, reducing the distance, in number of hops, between similar profiles;
• The query recall, corresponding to the fraction of the profiles relevant to the query that are successfully retrieved. This evaluation shows that it is possible to build a more efficient search engine on top of the Hamming DHT, at a lower cost, measured in the number of hops to complete the query.
To perform the tests, we implemented a Chord and a Hamming DHT simulator, including a feature to generate random peers and join them in a ring-style DHT. The simulator indexes and stores each profile identifier k using the put(k, v) operation. The lookup(k) operation returns the successor of the key k on the ring, which represents the peer responsible for storing the profile associated with this key, the hosting peer. The get(k) primitive was extended to handle the proposed similarity level, assuming the format get(k, sim): given a key (k) and a similarity level (sim), all similar profiles stored in the hosting peer are returned. This search is extended to the neighbors of the hosting peer at a distance of 1 hop (or more), aiming to improve the search results.
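A toy, in-memory stand-in for this extended get(k, sim) primitive is sketched below; the Peer class and identifier values are hypothetical and do not reflect the simulator's actual code.

class Peer:
    def __init__(self, profiles, neighbours=()):
        self.profiles = list(profiles)      # profile identifiers stored on this peer
        self.neighbours = list(neighbours)  # peers within 1 hop on the ring

def hamming_similarity(a, b, m=128):
    return 1.0 - bin(a ^ b).count("1") / m

def get(k, sim, hosting_peer, include_neighbours=True):
    # Return every stored identifier whose Hamming similarity to k meets sim.
    peers = [hosting_peer] + (hosting_peer.neighbours if include_neighbours else [])
    return [p for peer in peers for p in peer.profiles
            if hamming_similarity(k, p) >= sim]

neighbour = Peer([0b1010 << 120])
host = Peer([0b1011 << 120, 0b0100 << 120], neighbours=[neighbour])
print(len(get(0b1011 << 120, 0.95, host)))   # 3 profiles within similarity 0.95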
Fig. 11. Frequency distribution of the number of hops needed to retrieve similar contents, for 128-bit keys, comparing the Hamming DHT and Chord: 1000 peers with similarity levels 0.7, 0.8, 0.9, and 0.95, and 10000 peers with the same similarity levels. Each panel plots the percentage of similar contents against the number of hops.
fingers of the Hamming DHT, which privileges the Hamming distance between peers,
and the organization of the identifiers according to the Gray code sequence in the ring,
contributes to the results presented in these tests.
5.2.2 Recall
Figures 12(a), 12(c), and 12(e) show the recall in a similarity search for Chord and the Hamming DHT with 1000 peers, similarity levels 0.7, 0.8, and 0.9, and 128-bit keys. Figures 12(b), 12(d), and 12(f) show the results obtained from the same tests simulated with 10000 peers. From the results of Figure 12, it is possible to see that using the Hamming DHT as an infrastructure to support similarity search is a valuable approach. As an example, from Figure 12(d), a search engine designed over the Hamming DHT using a get(k, 0.8) function with a depth of 4 hops can retrieve about 91% of all similar profiles, while Chord can retrieve only about 50% of them. The better frequency distribution shown in Figure 11 explains the better recall obtained by the Hamming DHT when compared to Chord.
5.3 HCube
This section describes the experiments performed in the evaluation of the HCube proposal. The main idea is to validate our proposal as a valuable approach to support searching for similar data. To evaluate the HCube, the following experiments were performed:
• 128-bit Adult profile identifiers were generated using RHH, then indexed and distributed in HCubes with 1024, 2048, 4096, and 8192 servers;
• Profiles in the query set were used as query vectors over the Adult set. Pairs of queries and adults were randomly selected to be evaluated, considering similarity levels greater than or equal to 0.7, 0.8, 0.9, and 0.95;
• A 128-bit identifier was generated for each query. Lookups having these identifiers and a Hamming similarity level were performed. To map the hosting server for each query, each 128-bit identifier was reduced to a 10-, 11-, 12-, or 13-bit identifier using RHH. These reduced identifiers correspond to the hosting servers of the queries for each evaluated HCube size;
• From the hosting server, all profiles that fall within the Hamming similarity level are retrieved.
The following metrics were evaluated:
• The correlation between the cosine and Hamming similarity of adult identifiers (128 bits), and between the cosine similarity and the Hamming similarity of hosting server identifiers;
• The frequency distribution of the number of hops needed to retrieve all similar profiles in the set according to their similarity level;
• The query recall, corresponding to the fraction of all relevant profiles that were successfully retrieved.
To perform the tests, adult profiles were indexed and distributed in the HCube using the put(k, v) primitive. The get(k, sim) primitive was used to retrieve all similar profiles from a hosting server and its neighbors, given an identifier (k) and a similarity level (sim). Solely to allow evaluating the recall and the number of hops, this search was extended to all neighbors of the hosting server up to a configurable depth.
Fig. 12. Recall of the similarity search, for 128-bit keys, comparing the Hamming DHT and Chord: 1000 peers with similarity levels 0.7, 0.8, 0.9, and 0.95, and 10000 peers with the same similarity levels. Each panel plots the percentage of similar contents retrieved against the number of hops.
5.3.1 Correlation
Table 5 shows the average correlation between the cosine similarity (simcos) of pairs of queries and adult profiles and the Hamming similarity (simham) of their 128-bit identifiers. The correlation between the cosine similarity and the Hamming similarity of the 10-, 11-, 12-, and 13-bit hosting server identifiers is also presented. To calculate these correlations, we randomly selected 10 samples of queries and adult profiles, each sample containing about 1% of the total number of pairs in the Adult set. The samples were grouped according to the similarity levels 0.7, 0.8, 0.9, and 0.95.
The results show a strong correlation between cosine and Hamming similarity for the 128-bit identifiers of pairs of adult profiles, especially for the highest similarity levels. When the similarity level falls below 0.9, a moderate correlation (< 0.8) can be noticed between the cosine similarity of pairs of adults and the Hamming similarity of their hosting server identifiers. Still examining this table, the greater the similarity level, the greater the correlation between the cosine similarity of pairs of adults and the Hamming similarity of their hosting server identifiers. The standard deviations of these averages for the lowest similarity levels are high due to the probabilistic nature of the RHH function, but the confidence intervals of the samples were calculated and are low and acceptable. For the highest similarity levels the standard deviations are negligible, as are the confidence intervals of the samples.
Fig. 13. Frequency distribution of the number of hops needed to retrieve similar profiles in HCubes with 10-, 11-, 12-, and 13-bit server identifiers, for similarity levels 0.7 (a), 0.8 (b), 0.9 (c), and 0.95 (d). Each panel plots the percentage of occurrences against the number of hops.
to recover similar data is n − (sim × n). As an example, with sim = 0.7 and n = 10, the maximum number of hops necessary to recover all 0.7-similar data is 10 − (0.7 × 10) = 3. The higher values observed in the HCube results are due to the reduced number of dimensions available to build an incomplete Hamming cube. The average number of hops needed to retrieve all similar profiles for a given query is presented in Table 6.
5.3.3 Recall
An important metric to indicate the efficiency of any information retrieval system is the recall, which represents the fraction of the profiles relevant to the query that are successfully retrieved in a similarity search. This evaluation shows that it is possible to build an efficient search engine on top of the HCube. Figure 14 presents the results of the recall evaluation; they represent the average over all queries and results in the Adult set.
Fig. 14. Recall in HCubes with 10-, 11-, 12-, and 13-bit server identifiers, for similarity levels 0.7 (a), 0.8 (b), 0.9 (c), and 0.95 (d). Each panel plots the percentage of similar contents retrieved against the number of hops.

From these results, it is possible to see that the best results occur for the highest similarity levels. As an example, taking an HCube with 4096 servers (12 bits) and a similarity level equal to 0.8, 20% of the similar adult profiles are retrieved within 2 hops and 50% of them are retrieved within 4 hops. In the same scenario, using a similarity level of 0.95, 60% of all similar profiles are stored in the hosting server, and expanding the search to its immediate neighbors (1 hop), 90% of all similar profiles are retrieved.
The next section concludes the chapter and discusses future work.
In short, the similarity search based on the Hamming metric is an innovative proposal. The basic assumption is that users with similar profiles have similar interests. A user profile can be obtained in several ways, such as by crawling social networks or by extracting demographic information, curricula vitae, or user accounts. Compared to other related works in the literature, the main advantages of the proposed similarity search are: it does not suffer from the curse of dimensionality; the similarity is evaluated independently of the type and semantics of each dimension of the content vector; and it involves low-cost computational operations, such as the XOR function, to compute the similarity between two identifiers. Also, the similarity search is well suited to being indexed in virtually [7] or physically [8] distributed systems and can be extended into a hybrid approach.
The use of weights in the similarity search proposal is a topic for future investigation. In these initial experiments, the use of different scales in some profile attributes causes them to contribute at different levels to the estimated similarity between profile vectors. As an example, in the Adult set, if the user assigns "age" a greater weight than the other characteristics, result R9 in Table 2 could become more similar to the query q than R10. The way users express their preferences could be implemented in a collaborative way.
Another aspect of the similarity search is that it can easily be adapted into a dissimilarity search system. A dissimilarity system can be used in several applications to contrast different profiles in a recommendation. It is only necessary to search for the negated identifier of a profile. As an example, a dissimilarity search for the following profile <27; Private; Some-college; 11; Never-married; Adm-clerical; Own-child;
White; Female; United-States; >=50K> using 0.9 as a dissimilarity level, returns <38;
Private; HS-grad; 9; Married-civ-spouse; Craft-repair; Husband; White; Male; Cuba;
<=50K> and <37; Private; HS-grad; 9; Married-civ-spouse; Sales; Husband; White;
Male; Haiti; <=50K>, among others.
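A minimal sketch of this "negated identifier" trick is shown below (an editorial illustration with hypothetical values): querying with the bitwise complement of an identifier mirrors every Hamming distance, Dh(x, ¬q) = m − Dh(x, q), so a similarity level becomes a dissimilarity level.

def negate_identifier(identifier, m=128):
    # Bitwise complement of an m-bit identifier.
    return identifier ^ ((1 << m) - 1)

def hamming_similarity(a, b, m=128):
    return 1.0 - bin(a ^ b).count("1") / m

profile_id = 0b1010 << 124            # hypothetical 128-bit identifier
query_id = negate_identifier(profile_id)
print(hamming_similarity(profile_id, query_id))   # 0.0: maximally dissimilar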
The results show that the Hamming DHT can be a useful overlay infrastructure for similarity search. The evaluation compares the Hamming DHT with Chord because Chord is a reference element in the DHT literature, even though it was not proposed for similarity search. Also, to the best of our knowledge, no other work in the DHT literature explores the Hamming similarity of content identifiers to propose a DHT specialized for similarity search, which makes comparisons with other approaches difficult.
HCube is a Data Center solution designed to support similarity searches in Big Data scenarios, aiming to reduce the distance and to improve the recall when retrieving similar content. This chapter shows that the union of the VSM representation, the RHH function, the Gray SFC, the three-dimensional structure used in a server-centric Data Center, and the XOR-based routing solution provides the necessary substrate to efficiently achieve the objectives of HCube. As future work, alternatives for HCubes bigger than 8192 servers will be investigated. A practical approach could be increasing the number of NICs per server. However, as this approach is physically limited, we believe an interesting option could be the introduction of an HCube hierarchy.
References
1. Gantz, J., Reinsel, D.: The Digital Universe Decade - Are You Ready? http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf (2010) (Online; accessed March 2, 2013)
2. The Apache Software Foundation: Apache Hadoop, http://hadoop.apache.org/ (2013) (Online; accessed March 5, 2013)
3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107-113 (2008)
4. Indyk, P., Motwani, R.: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In: STOC 1998: Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613. ACM, New York (1998)
5. Charikar, M.S.: Similarity Estimation Techniques from Rounding Algorithms. In: STOC 2002: Proceedings of the 34th Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 380-388 (2002)
Data Modeling for Socially Based Routing in Opportunistic Networks
R.-I. Ciobanu, C. Dobre, and F. Xhafa
Abstract. Opportunistic networks are the next step in the evolution of mobile networks, especially since the number of human-carried mobile devices such as smartphones and tablets has greatly increased in the past few years. They assume unselfish communication between devices based on a store-carry-and-forward paradigm, in which mobile nodes carry each other's data through the network and exchange it opportunistically. In this chapter, we present opportunistic networks
in detail and show various real-life scenarios where such networks have been suc-
cessfully deployed or are about to be, such as disaster management, smart cities,
wildlife tracking, context-aware platforms, etc. We highlight the challenges in
designing successful data routing and dissemination algorithms for opportunistic
networks, and present some of the most important techniques and algorithms that
have been proposed in the past few years. We show the most important issues for
each of them, and attempt to propose solutions for improving opportunistic rout-
ing and dissemination. Finally, we present what the future trends in this area of
research might be, from information-centric networks to the Internet of Things.
1 Introduction
Opportunistic networks are extensions of the legacy Mobile Ad Hoc Networks
(MANETs) concept. Legacy MANETs are composed of mobile nodes that collabo-
ratively set up a network plane by running a given routing protocol. Therefore, the
sometimes implicit assumption behind MANETs is that the network is well connected,
and nodes’ disconnection is an exception to deal with. Most notably, if the destination
of a given message is not connected to the network when the message is generated,
then that message is dropped after a short time (i.e., the destination is assumed to not
exist). Opportunistic networks are mobile wireless networks in which the presence of
a continuous path between a sender and a destination is not assumed, since two nodes
may never be connected to the network at the same time. The network is assumed to
be highly dynamic, and the topology is, thus, extremely unstable and sometimes com-
pletely unpredictable. Nevertheless, the network must guarantee end-to-end delivery of
messages despite frequent disconnections and partitions.
The opportunistic networking paradigm is particularly well suited to environments that are characterized by frequent and persistent partitions. In the field of wildlife track-
ing, for example, some kinds of sensor nodes are used to monitor wild species. In these
cases it is not easy (and sometimes not possible) to have connectivity between a source sensor node and a destination data collector node. This happens because the animals being monitored move freely and there is no way to control them so as to favor connectivity. Opportunistic networks may also be exploited to bridge the digital divide.
In fact, they can support intermittent connectivity to the Internet for underdeveloped or
isolated regions. This can be obtained by exploiting mobile nodes that collect informa-
tion to upload to the Internet as well as requests for Web pages or any kind of data that
need to be downloaded from the Internet. Both data and requests are uploaded to and
downloaded from the Internet once the mobile data collector node reaches a location
where connectivity is available.
This chapter focuses on the particular problem of data gathering in such challenged
networks. We describe different alternatives and solutions for routing the data from
a source to its destination. Today different techniques could be employed for mobile
data gathering. A basic strategy would be to only allow data delivery when mobile de-
vices are in direct proximity of the sinks. This technique has very little communication
overhead, given that messages are only sent directly from the sensor node generating
messages to the sinks. However, depending on how frequently mobile nodes meet the
sinks, the delivery of the data might be very poor. This is particularly true if the sinks
are very few and spread out. More refined techniques include epidemically inspired ap-
proaches, which would randomly spread the data over the network, so that eventually
a sink could be reached. We analyze both of these approaches, highlighting specific problems and solutions for them in concrete scenarios.
The rest of this chapter is organized as follows. We first introduce in more detail
the domain of opportunistic mobile communication. We present case studies where
opportunistic networks are already being successfully deployed and used, and highlight
specific issues regarding their implementation in particular scenarios. In Section 3 we
perform an analysis of the latest advances in data routing and dissemination solutions
based on the use of opportunistic mobile networks. We highlight the main issues of each
proposal, and attempt to give solutions to some of them in Section 4. Section 5 presents
future trends, and Section 6 concludes our study.
2 Opportunistic Networks
This section presents a detailed view regarding opportunistic networks (including defi-
nitions, benefits, and challenges), as well as a presentation of real-life use cases where
they can bring a significant contribution.
2.1 Definition
Opportunistic networks (ONs) are a natural evolution of MANETs, where most (or
sometimes all) of the nodes are mobile wireless devices. These devices range from small
wireless-capable sensors to smartphones and tablets. The evolution from MANETs to
ONs was necessary because opportunistic networks help transmit data horizontally, i.e.,
using costless inter-device transmissions, taking advantage of the already-existent de-
vice interaction. Moreover, ONs help disseminate data and decongest currently existing
1 In opportunistic networks, data items are generally called messages. From this point on, we
will refer to them thus, making a distinction only where the situation requires it.
2 http://www.haggleproject.org.
(which we discuss in more detail in Section 2.3). Furthermore, the paper also ana-
lyzes several ON routing and forwarding algorithms, while proposing a taxonomy used
to classify them. They are split into algorithms with and without infrastructure. The infrastructure-based algorithms are further split according to their infrastructure type (fixed or mobile), whereas the algorithms without an infrastructure are based on dissemination or on context. Another detailed study of ONs has been performed by Conti et al. [17].
Their paper describes opportunistic networks, stating that the understanding of human
mobility is paramount to designing efficient protocols for opportunistic networking.
Conti et al. also discuss ON architecture, forwarding algorithms, data dissemination,
security, and applications and conclude their work by observing that there is a strong
link between opportunistic networking and mobile social networks. The authors also
show that ONs can be used for both point-to-point communication as well as data
dissemination.
2.2 Challenges
Aside from the benefits of opportunistic networks and their applicability in real-life
(presented in Section 2.3), there are several challenges that must be taken into consid-
eration when designing an ON-based solution. The first and most important caveat of
ONs is that the lack of continuous connectivity leads to a potential lack of end-to-end paths. In other words, deploying an opportunistic network means accepting the fact that not all messages may successfully reach their destinations, and that those that do may arrive with high delays. As stated in Section 2.1, there are many factors that can affect an ON's hit rate3, ranging from the number of devices in the network to the behavior and social grouping of the devices' owners (if we are dealing with an ON where the nodes are
humans carrying mobile devices). Opportunistic network administrators must be aware
of this and only use such networks where delays and loss of messages are acceptable.
The purpose of every ON routing or dissemination algorithm is to increase the hit rate,
because higher hit rates make ONs much more likely to be successfully used in real-life
scenarios.
Closely related to the first challenge is the decision of selecting a message’s next hop.
Ideally, each node should have access to the future behavior of the entire network, and
thus, choose the shortest path between it and a message’s destination, similar to what
is done for classical static networks. Unfortunately, this is not the case for opportunis-
tic networks, and this is why researchers are still proposing new methods of deciding
whether a message should be exchanged when two nodes are in range of each other.
Aside from selecting the next hop, decisions should be made regarding the number of copies a message should have in the network, and whether it should be kept or deleted
by the originating node. Various methods for routing or dissemination algorithms have
been proposed over the years, and the most successful ones are presented in Section 3,
along with their (still existing) issues.
The main bottlenecks of data dissemination in ONs are large inter-contact times and
slow nodes. If a large period of time passes before a node is encountered by any other
3 The percentage of messages that reach their destinations, out of the total number of messages
sent in the network.
device, then there is no chance of that node receiving the information it is interested
in. Therefore, opportunistic networks should be as dense as possible, so that each node
gets a fair chance of receiving data from others. Furthermore, slower nodes may also act as bottlenecks, especially if they are central nodes with many contacts. If this is the case, they are considered suitable forwarders by other nodes and data are sent
to them. Since they download slowly, they prevent the other nodes from being able to
transfer many data items at once, thus slowing down the entire network. This is why
an opportunistic network should be carefully configured to use suitable algorithms that are able to detect bottlenecks caused by slow and selfish nodes and that can take optimal advantage of a contact between two nodes.
Another aspect that should be taken into consideration when deploying an oppor-
tunistic network is that, since the nodes are generally mobile devices, they have a lim-
ited life until they need to be recharged. The more data transfers are performed in a short
period of time, the quicker a device's battery drains, which in turn leads to removing it from the network for a certain period of time (until it is recharged). Furthermore,
congestion also leads to quicker energy consumption, since a node has to retry sending
messages when the receiving node is flooded with data forwards. Asymmetric data rates
can also cause needless power consumption, because the node with the faster connec-
tion is blocked by the slower node until all the data is exchanged, so it operates at a slower rate than it actually could.
An area of opportunistic networks that has not been researched extensively is security. Along with privacy, it is a key condition for people to accept opportunistic networks in which their devices are nodes. This would imply that a message sent by a node can only be decrypted by the intended recipient, and that nodes cannot enter the network and perform malicious deeds (such as flooding nodes with data, reporting false information, etc.). Moreover, nodes should be discouraged from being selfish through incentive mechanisms.
By taking all these challenges into consideration, the main goal of opportunistic net-
works is to achieve real mobile computing without the need for a connected network.
Researchers are not there yet, but the algorithms and solutions proposed have become better and better over the past few years, so the research is heading in the right direction. It probably will not be long until we are able to see and use opportunistic networks for various purposes.
2.3 Use Cases
This subsection presents real-life use cases for ONs. It highlights various areas where opportunistic networks have been or are soon to be deployed.
security missions. One such solution is proposed by Bruno et al. [8], and it implies us-
ing ONs to create an overlay infrastructure for rescue and crisis management services.
The proposed solution uses the unaffected components of the static infrastructure (i.e.,
the ones that have not been damaged by the disaster), by making them act as nodes in
opportunistic networks. Aside from these ad hoc nodes, special networks and connec-
tors are also deployed, and even individual mobile devices belonging to the survivors or to people near the disaster site may be used. The main goal is to offer connectiv-
ity where otherwise there would be none, which leads to a higher efficiency in finding
survivors or organizing the rescue efforts. Moreover, another goal is to lower the con-
gestion rate, since regular network infrastructures tend to become very crowded during
such incidents.
A similar solution is proposed by Lilien et al. [34], where systems that were not
originally nodes of an opportunistic network dynamically join it, with the purpose of
aiding with communication in disaster situations. MAETT (Mobile Agent Electronic
Triage Tag) and Haggle-ETT [36] are two similar methods used to collect the triage data (i.e., location) of disaster victims when regular communication systems are down. They allow this data to be represented in an electronic format, which can then be transmitted to coordination points where it is processed and made available for
the rescue missions. The difference between the two algorithms is that MAETT uses
mobile agents for storing the electronic triage tag, whereas Haggle-ETT is based on the
Haggle architecture.
2.3.4 Advertising
More recently, ONs have been used for other purposes, such as advertising. One such
example is MobiAd [25], an application that presents the user with local advertisements
in a privacy-preserving manner. The ads are selected by the phone from a pool of adver-
tisements which are broadcast by the local mobile base station or received from local
WiFi hotspots. Information about ad views and clicks is encrypted and sent to the ad
channel in an opportunistic way, via other mobile devices or static WiFi hotspots. This
helps ensure privacy, since the other nodes won’t discover which ads were viewed, and
the ad provider isn’t able to know which user saw what ad.
Another way of applying opportunistic networking to advertising is proposed by
Heinemann and Straub [26]. They base their proposition on word-of-mouth (WOM)
communication or viral marketing, which means facilitating advertising where none
of the participants are marketing sources. In classical WOM communication, a user A
passes by a store and sees a promotion for a certain item. Knowing that a friend
B is interested in those types of items, A lets B know about the offer, thus advertising
the product through word-of-mouth, even though there is no commercial interest from
user A. Heinemann and Straub propose using this mechanism over an ON infrastruc-
ture. Instead of showing their promotions in the window, stores have mobile devices
that broadcast information about promotions and offers within a certain radius. When peo-
ple with mobile devices pass near the store, their own device receives the offer and
attempts to match it with the current user’s preferences. If the user is interested, then
an alert is shown. If not, the information is stored for opportunistic transmission. As-
suming that node A received an offer it is not interested in, it carries it around until
it encounters node B. If node B is interested, then the offer is forwarded to it, and its
owner is notified. This mechanism would imply having an application where users can
store their preferences, in order to help match the received offers.
4 An international network of antennas that is able to track data and control the navigation of
interplanetary spacecraft, used by NASA.
the position of the point that communication must be performed with. Moreover, the
satellites can relay data between each other before it reaches Mars.
5 http://www.distributed-systems.net/index.php?id=extreme-wireless-distributed-systems.
consumes less power than both WiFi and 3G). CAPIM would be used to disseminate
announcements to students, signal their location, the classes they take, the interactions
they have, etc. Other examples of context-aware platforms that might benefit from an
ON-based framework include PACE [27], SOCAM [22] or CoWSAMI [3].
3.1.1 Epidemic
The Epidemic algorithm [44] is based on the way a virus spreads: when two potential
carriers meet, the one with the virus infects the other one, if it isn’t already infected.
Thus, when an ON node A encounters a node B, A downloads all the messages from B
that it does not already contain, and vice versa. The simplest version of this algorithm
assumes that a node’s data memory is unlimited, so that it can store all the messages
that can be at once in the opportunistic network. However, this is unfeasible in real-
life, especially as the network grows ever larger, so a modified Epidemic version exists,
where the data memory of a node is limited. Thus, when node A’s memory is full and
it encounters node B, first it has to drop the oldest messages in its memory, in order
to make room for the messages that it will download from node B. This also makes
the algorithm somewhat inefficient, since some older messages may be important (e.g.,
they may be addressed to nodes that A is about to encounter) and some new ones totally
irrelevant (e.g., their destinations may be nodes that A will never meet).
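As an illustration of the buffer-limited variant described above, the following sketch is a toy model under stated assumptions (the data structures, the buffer size, and the drop-oldest policy are illustrative), not the implementation of [44].

```python
from collections import OrderedDict

class EpidemicNode:
    """Toy node for the buffer-limited Epidemic variant: on contact, each node
    copies every message it is missing from the other; when the buffer is full,
    the oldest message is dropped to make room."""

    def __init__(self, node_id, buffer_size=100):
        self.node_id = node_id
        self.buffer_size = buffer_size
        self.messages = OrderedDict()  # msg_id -> payload, oldest first

    def store(self, msg_id, payload):
        if msg_id in self.messages:
            return
        while len(self.messages) >= self.buffer_size:
            self.messages.popitem(last=False)  # drop the oldest message
        self.messages[msg_id] = payload

    def exchange(self, other):
        """Pairwise anti-entropy: both nodes end up with the union of their
        buffers, subject to the buffer limit."""
        for msg_id, payload in list(other.messages.items()):
            self.store(msg_id, payload)
        for msg_id, payload in list(self.messages.items()):
            other.store(msg_id, payload)
```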
40 R.-I. Ciobanu, C. Dobre, and F. Xhafa
3.1.2 Spray-and-Wait
Spray-and-Wait [42] is an improvement to Epidemic that attempts to treat the conges-
tion (and consequently the energy consumption) problem by limiting the total number
of messages sent in the network, while keeping a high hit rate. As the name states, the
algorithm is split into two phases. The Spray phase assumes that, for each message
originating at a source node, a predefined number of copies are transferred to the en-
countered nodes, in the order that they are seen. The nodes that received the message
will then do the same with the nodes that they encounter. Secondly, the Wait phase oc-
curs after the end of the Spray phase (i.e., after the message has been transmitted for
the given number of times) and it implies that, if a message’s destination has not been
encountered yet, then the message will only be relayed when (and if) the destination is
encountered. Thus, Spray-and-Wait combines the speed of Epidemic routing with the
simplicity of direct transmission. However, unlike the basic Epidemic algorithm, Spray-
and-Wait doesn’t guarantee the maximum hit rate. Moreover, it still does not take into
account any context information in its routing decisions.
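The copy-budget mechanism can be sketched as follows. This snippet uses the binary variant, in which a carrier hands over half of its remaining copy budget; that is one common instantiation of the scheme described above, not necessarily the exact policy of [42], and the data layout is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SprayMessage:
    msg_id: str
    destination: str
    copies: int  # remaining copy budget held by this carrier

def on_contact(carrier_msgs, encountered_id, encountered_msgs):
    """Illustrative binary Spray-and-Wait decision on a contact.

    Spray phase: while a carrier holds more than one copy of a message, it
    hands half of its copy budget to the encountered node. Wait phase: with a
    single copy left, the message is relayed only to its destination."""
    for msg in list(carrier_msgs):
        if msg.destination == encountered_id:
            encountered_msgs.append(SprayMessage(msg.msg_id, msg.destination, 1))
            carrier_msgs.remove(msg)  # delivered to the destination
        elif msg.copies > 1:
            handed = msg.copies // 2
            msg.copies -= handed
            encountered_msgs.append(SprayMessage(msg.msg_id, msg.destination, handed))
        # copies == 1 and not the destination: Wait phase, do nothing
```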
called a familiar stranger and has a high number of contacts with the current node, but
the contact durations are short. There are also stranger nodes, where the contact duration
is short and the number of contacts is low, and finally friend nodes, with few contacts,
but high contact durations. In order to construct an overlay for publish/subscribe sys-
tems, community detection is performed in a decentralized fashion. Thus, each node
must detect its own local community. The disadvantage of this method is that broker
nodes tend to be congested, given that all data directed to their community must first
pass through them. If the members of a community are subscribed to many channels,
it would be more suitable to be able to have multiple brokers, in order to increase the
efficiency.
3.2.3 SRSN
The SRSN algorithm [6] is based on the assumption that ad hoc detected communities
may miss important aspects of the true organization of an opportunistic network, where,
for example, a node might have a strong social link to another node that is encountered
rarely. In such a situation, a detected social network might omit this tie and thus yield
suboptimal forwarding paths. Therefore, two types of social networks are considered.
First of all, there is a detected social network (DSN) as given by a community detection
algorithm such as k-CLIQUE, an approach similar to the one taken by BUBBLE Rap.
Secondly, the authors also propose a self-reported social network (SRSN) as given by
social network links (in this case, Facebook relationships). The algorithm follows a
few steps: nodes generate data, carry it around the network and, when they encounter
another node, they only exchange information if the two nodes are in the same network
(either DSN or SRSN). Therefore, there are two versions of this algorithm: one that
uses the DSN, and another one that uses SRSN. Through extensive experiments, the
authors show that using SRSN information instead of DSN decreases the delivery cost and produces a comparable delivery ratio. This happens because the two social networks
differ in terms of structural and role equivalence, with the better approximation being
obtained through the SRSN. The results presented by the SRSN algorithm are relevant
in terms of highlighting the importance of using readily available information (such
as Facebook, Google+, Twitter, or LinkedIn social relationships) for approximating a
user’s social relationships. However, in situations where the social network does not
correctly approximate the network’s behavior, or where social network information is
not available, such an algorithm cannot be used.
3.2.4 ContentPlace
ContentPlace [7] deals with data dissemination in resource-constrained ONs, by mak-
ing content available in regions where interested users are present, without overusing
available resources. To optimize content availability, it exploits learned information
about users’ social relationships to decide where to place user data. The design of
ContentPlace is based on two assumptions: users can be grouped together logically,
according to the type of content they are interested in, and their movement is driven by
social relationships. When a node encounters another node, it decides what information
seen on the other node should be replicated locally. Thus, ContentPlace defines a utility
function by means of which each node can associate a utility value to any data object.
When a node encounters another peer, it selects the set of data objects that maximizes
the local utility of its cache. Due to performance issues, when two nodes meet, they do
not advertise all information about their data objects, but instead they exchange a sum-
mary of data objects in their caches. Finally, the data exchange is accomplished when a user receives a data object it is subscribed to and finds it in an encountered node's cache. To have a suitable representation of users' social behavior, an approach similar to the caveman model is used: a community structure which assumes that users are grouped into home communities, while at the same time having relationships in acquainted communities. The utility is a weighted sum of one component for
each community its user has relationships with. Community detection is done using
k-CLIQUE. By using weights based on the social aspect of opportunistic networking,
ContentPlace offers the possibility of defining different policies. There are five policies
defined: Most Frequently Visited (MFV), Most Likely Next (MLN), Future (F), Present
(P), and Uniform Social (US). These policies allow the network manager to change the
behavior of the nodes, according to the configuration of the network.
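The cache refresh performed on a contact can be sketched as a greedy utility-per-size selection. The snippet below is illustrative only: the summary format and the greedy heuristic are assumptions, and the utility function is treated as a black box rather than the weighted community sum defined by ContentPlace.

```python
def select_objects(local_summary, peer_summary, utility, cache_capacity):
    """Greedy sketch of a ContentPlace-style cache refresh on a contact.

    local_summary / peer_summary: dicts mapping object id -> size.
    utility(obj_id): node-local utility of holding that object.
    Returns the set of object ids to keep, chosen greedily by utility per
    unit of size until the cache capacity is filled."""
    candidates = {**local_summary, **peer_summary}
    ranked = sorted(candidates, key=lambda o: utility(o) / candidates[o], reverse=True)
    kept, used = [], 0
    for obj in ranked:
        if used + candidates[obj] <= cache_capacity:
            kept.append(obj)
            used += candidates[obj]
    return set(kept)
```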
base their routing decisions on the history of contacts between nodes. If a node A has
encountered a node B many times in the recent past, it is assumed that it will encounter
it again in the near future, because the two nodes have similar paths. Moreover, based on the patterns of past contacts, various types of distributions and approximations have
been employed to predict the future behavior of ON nodes and perform optimal routing
decisions.
3.3.1 PROPHET
PROPHET [35] is a prediction-based routing algorithm for ONs which performs proba-
bilistic routing by establishing a metric called delivery predictability (P) at every node A
for a known destination B. This probability signifies A’s chance to successfully deliver
a message to B. When two PROPHET nodes meet, they exchange summary vectors
which (among other information) contain the delivery predictability P. When a node
receives this information, it updates its internal delivery predictability vector (whose
size is equal to the total number of nodes in the ON), and then decides which messages
to request from the other node based on the forwarding strategy used. There are three
steps performed when computing delivery predictability values. First, whenever a node
is encountered, the local value of the metric is updated, which leads to a higher P for
nodes that are encountered more often. Secondly, since nodes that are not in contact
for long periods of time are not very likely to be good forwarders toward each other,
the delivery predictability must age, thus being reduced with the passage of time. The
aging process is based on an aging constant and a given unit of time. Finally, the deliv-
ery predictability also has a transitive property, based on the fact that, if two nodes A
and B meet each other often, and node A also has many encounters with a node C, then
C is also a good forwarding node for A. Based on the scaling constant, the transitivity
property also impacts the computation of the delivery predictability. The forwarding
strategy chosen by the authors is a simple one: when two nodes A and B meet, a message stored at B is transferred to A if A's delivery predictability for that message's destination is higher than B's (and vice versa). The main caveat of
this algorithm is that, since nodes are not being split into communities, there exists the
risk of flooding a popular node (such as a professor in an academic environment, who interacts with students from various study years). This can be avoided by redirecting a part of the messages destined for such a node to other, less popular nodes.
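The three steps above correspond to the update rules of [35]; the sketch below restates them in code, with P_INIT, GAMMA, and BETA standing for the initialization, aging, and scaling constants and k for the number of time units elapsed since the last update (the constant values shown are only examples).

```python
P_INIT, GAMMA, BETA = 0.75, 0.98, 0.25  # example constants

def on_encounter(P, a, b):
    """Direct update when nodes a and b meet: P(a,b) grows toward 1."""
    old = P.get((a, b), 0.0)
    P[(a, b)] = old + (1.0 - old) * P_INIT

def age(P, a, b, k):
    """Aging: predictability decays when the nodes have not met for k units."""
    P[(a, b)] = P.get((a, b), 0.0) * (GAMMA ** k)

def transitive(P, a, b, c):
    """Transitivity: if a meets b often and b meets c often, a is a reasonable
    forwarder toward c."""
    old = P.get((a, c), 0.0)
    P[(a, c)] = old + (1.0 - old) * P.get((a, b), 0.0) * P.get((b, c), 0.0) * BETA
```

With these values in place, the forwarding rule above simply transfers a message from B to A when A's predictability for the message's destination exceeds B's.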
3.3.2 RANK
Hui and Crowcroft [29] study the impact of predictable human interactions on forwarding in pocket switched networks (PSNs). By applying vertex similarity to a dataset extracted from mobility traces,
they observe that adaptive forwarding algorithms can be built by using the history of
past encounters. Furthermore, the authors design a distributed forwarding algorithm
based on node centrality and show that it is efficient in terms of hit rate and delivery
latency. This greedy algorithm is entitled RANK, and (similarly to BUBBLE Rap) it
uses popular nodes to disseminate data. The popularity of a node is quantified by the
Freeman betweenness centrality, which is defined as the number of times a node falls on the shortest paths between other nodes. The authors assume that each node knows its own
centrality and the centrality of the nodes it encounters, but not of the other nodes in the
network, so it cannot know the highest centrality in the system. Therefore, the greedy
algorithm pushes traffic on all paths to nodes that have a higher centrality than the cur-
rent node, until the destination is reached or the messages expire. Since knowing the
individual centrality for each node at any point in time is complicated, the authors pro-
pose analyzing the past activity of a node to see if it was a good carrier in the past,
and then use this information for future forwarding. Therefore, they analyze how well
the past centrality can predict the future centrality for a given node, and for this reason
they extract three consecutive three-week sessions from a mobility trace and run a set of
greedy RANK emulations on the last two data sessions, using centrality values from
the first session. The test results show that human mobility is predictable to a certain
degree and that past contact information can successfully be used to approximate the
future behavior of a node in the ON. However, one of the main limitations of the RANK
algorithm is that it only focuses on a day-to-day analysis, whereas a finer-grained pre-
dictability may prove to be more useful for messages that have a lower tolerance for
delays. Another important caveat of the algorithm is that, although it uses prediction of
future node behavior, it does not consider the ON nodes as belonging to communities,
which may lead to congestion at the nodes that are most popular in terms of centrality.
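The greedy forwarding rule itself is compact; a sketch is given below, with the centrality values assumed to be known or estimated from past activity.

```python
def rank_forward(carrier_centrality, encountered_centrality,
                 encountered_is_destination):
    """Greedy RANK-style decision: hand the message over if the encountered
    node is the destination, or if it is more central than the current
    carrier, so that traffic is pushed up the centrality gradient until the
    destination is reached or the message expires."""
    return encountered_is_destination or encountered_centrality > carrier_centrality

# Example: a carrier with centrality 3 meets a node with centrality 7.
assert rank_forward(3, 7, encountered_is_destination=False)
```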
3.3.3 dLife
dLife [37] is an opportunistic network routing algorithm that is able to capture the dynamics represented by time-evolving social ties between pairs of nodes. The authors highlight the fact that user behavior is dynamic and that the network itself evolves, meaning that network ties are created and broken constantly. This is why dLife focuses on the dif-
ferent behavior users have in different daily periods of time, instead of estimating their
behavior per day. The dynamics of social structures are represented as a weighted con-
tact graph, where the weights are used to express how long a pair of nodes is in contact
over different periods of time. There are two complementary utility functions employed
by dLife: the Time-Evolving Contact Duration (TECD), which is the evolution of social
interaction among pairs of users in the same daily interval over consecutive days, and
the TECD Importance (TECDi ), which is the evolution of a user’s importance, based
on its node degree and social strength toward its neighbors, in different periods of time.
TECD is used to forward messages to nodes that have a stronger social relationship
with the destination than the current carrier. Each node computes the average of its
contact duration with other nodes during the same set of daily time periods over con-
secutive days. If the carrier and the encountered node have no social information toward
the destination, forwarding is done based on TECDi , where the encountered node gets
a message if it has a greater importance than the carrier. The authors also propose a
community-based version of dLife, entitled dLifeComm, where the social communi-
ties are computed similarly to BUBBLE Rap (i.e., using k-CLIQUE), but the decision
whether to forward to a node is done based on TECD and TECDi , thus changing over
time. dLife offers the advantage of using both the history of contacts, as well as social
information, in performing routing decisions.
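A minimal sketch of the per-slot average contact duration that underlies TECD is given below; the slot granularity and the contact-log layout are assumptions made for illustration.

```python
from collections import defaultdict

def tecd_averages(contact_log, slots_per_day=24):
    """contact_log: iterable of (peer_id, day_index, slot_index, duration_seconds).
    Returns {(peer_id, slot_index): average contact duration over the days
    observed in that daily slot}, i.e., the time-evolving contact duration
    that dLife uses to weight social ties."""
    totals = defaultdict(float)
    days = defaultdict(set)
    for peer, day, slot, duration in contact_log:
        totals[(peer, slot)] += duration
        days[(peer, slot)].add(day)
    return {key: totals[key] / len(days[key]) for key in totals}
```

Forwarding then prefers, in the current daily slot, the node whose weight toward the destination is stronger than the carrier's, falling back to node importance when no such information exists.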
4 Potential Solutions
In this section, we present a couple of alternatives to the existing solutions shown in
Section 3 and highlight the improvements they bring.
4.1 SPRINT
SPRINT [14] is a novel ON point-to-point routing algorithm that takes advantage of
both social knowledge, as well as contact prediction, when making decisions. It uses
information about the nodes from the contact history and from existing self-reported
social networks. Moreover, it includes a Poisson-based prediction of a node’s future
behavior. Through extensive experiments, it has been shown that SPRINT performs better than existing socially aware opportunistic routing solutions in terms of hit rate, latency, delivery cost, and hop count. This section presents the motivations behind SPRINT and the functionality of the algorithm.
$$U_1(M, A) = \mathit{freshness}(M) + p(M, A) \cdot \left(1 - \frac{\mathit{enc}(M, A)}{24}\right)$$

$$U_2(M, A) = \mathit{ce}(M, A) \cdot \frac{\mathit{sn}(M) + \mathit{hop}(M) + \mathit{pop}(A) + t(M, A)}{4}$$
The freshness(M) component of U1 favors new messages, being positive if the mes-
sage has been created less than a day ago, and 0 otherwise. p(M, A) is the probability
of node A being able to deliver a message M closer to its destination, and is based on
predicting a node’s behavior, combined with the idea that a node has a higher chance of
interacting with nodes it is socially connected with and/or has encountered before. It is
computed based on the knowledge that the node contacts follow a Poisson distribution.
The first step is to count how many times node A encountered each of the other nodes
in the network. If a node has been previously met in the same day of the week or in the
same 2h interval as the current time, the total encounters value is increased by 1. For the
nodes encountered in the past that are in the same social community as node A, the total
number of contacts is doubled. Then, the probabilities of encountering nodes based on
past contacts are computed by performing a ratio between the number of encounters
per node and the total number of encounters. The next step consists of computing the
number of encounters N that node A will have for each of the next 24 hours by using the
Poisson distribution probabilities and choosing the value with the highest probability as
N. The first N nodes are then picked as potential future contacts for each of the next 24
hours (sorted by probability), and for the rest of them p(M, A) is set to 0. U1 also uses
enc(M, A), which is the time (in hours) until the destination of message M will be met
by A according to the probabilities previously computed. If the destination will never
be encountered, then enc(M, A) is set to 24 (so the product is 0).
The second component of the utility function is U2. ce(M, A) is set to 1 if node A
is in the same community as the destination of message M or if it will encounter a
node that has a social relationship with M, and 0 otherwise. The prediction information
computed for U1 is used to analyze the potential future encounters of a node. The sn(M) component is set to 1 if the source and destination of M do not have a social connection, and 0 otherwise. hop(M) represents the normalized number of nodes that M has
visited, pop(A) is the popularity value of A according to its social network information
(i.e., number of Facebook friends in the opportunistic network), and finally t(M, A) is
the total time spent by node A in contact with M’s destination.
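For illustration, the two utility components can be written directly from the formulas above; the helper values (freshness, p, enc, and the binary ce and sn indicators) are assumed to be supplied by the prediction and social-knowledge steps described in the text, so this is a sketch rather than the SPRINT implementation.

```python
def sprint_u1(freshness, p, enc_hours):
    """U1(M, A) = freshness(M) + p(M, A) * (1 - enc(M, A) / 24).
    enc_hours is set to 24 when the destination is never expected to be met,
    which zeroes the prediction term."""
    return freshness + p * (1.0 - enc_hours / 24.0)

def sprint_u2(ce, sn, hop, pop, t):
    """U2(M, A) = ce(M, A) * (sn(M) + hop(M) + pop(A) + t(M, A)) / 4."""
    return ce * (sn + hop + pop + t) / 4.0
```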
SPRINT is compared to BUBBLE Rap and it is shown that it performs better in terms
of hit rate, delivery latency, hop count, and delivery cost for three mobility traces and
one synthetic mobility model simulation. The complete results, along with their analy-
sis, are presented in detail in [14]. The algorithm’s main advantage over other solutions
is that it does not rely solely on one method for deciding the next hop. It combines social
information, from both offline and online sources, with ad hoc prediction mechanisms
(which can be switched on-the-fly, according to the characteristics of the ON), to offer
a more complete view of a node’s behavior.
4.2 SENSE
SENSE [16] is a collaborative selfish node detection and incentive mechanism for op-
portunistic networks that is not only able to detect the selfish nodes in an ON, but also
has the possibility of improving the network's performance by incentivizing the participating nodes to carry data for other nodes. Altruism is an important component of
ONs, since nodes must rely on each other for a successful transmission of their intended
messages. Thus, nodes refusing to participate in the routing process are punished by the
algorithm and therefore have no way to get their messages delivered, unless they agree to help other nodes route their data as well.
is considered selfish by node A, so A does not send it messages for routing and does
not accept messages from B, either. Node A then notifies B that it considers it selfish,
so B would not end up considering node A selfish. This also functions as an incentive
mechanism, because if a node wants its messages to be routed by other nodes, it should
not be selfish toward them. Therefore, every time a node is notified that it is selfish in
regard to a certain message, it increases its altruism value. If there is a social connection
between the selfish node and the source of the message, the inter-community altruism
is increased. Otherwise, the intra-community altruism value grows.
The formula for computing altruism values for a node N and a message M based on
the list of past forwards O and on the list of past receives I is the following:
$$\mathit{altruism}(N, M) = \sum_{\substack{o \in O,\; i \in I,\; o.m = i.m \\ N.id = o.d,\; N.id = i.s}} \mathit{type}(M, o.m) \cdot \mathit{thr}(o.b)$$
A past encounter x has a field x.m which specifies the message that was sent or
received, x.s is the source of the transfer, x.d is the destination, and x.b is the battery
level of the source. type is a function that returns 1 if the types of the two messages
received as parameters are the same (in terms of communities, priorities, etc.), and 0
otherwise, while thr returns 1 if the value received as a parameter is higher than a preset threshold, and 0 otherwise. Thus, the function counts how many messages of
the same type as M have been forwarded with the help of node N, when N’s battery was
at an acceptable level.
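The altruism sum translates almost literally into code; in the sketch below the record layout (fields m, s, d, b) follows the description in the text, while same_type and battery_ok are stand-ins for the type and thr functions, so the snippet is illustrative rather than the SENSE implementation.

```python
def altruism(node_id, message, forwards, receives, same_type, battery_ok):
    """Literal transcription of the altruism sum: iterate over pairs of a past
    forward o and a past receive i that concern the same message (o.m == i.m),
    with the observed node as the forward's destination (o.d == N.id) and the
    receive's source (i.s == N.id); each pair contributes type(M, o.m) * thr(o.b)."""
    total = 0
    for o in forwards:
        for i in receives:
            if o["m"] == i["m"] and o["d"] == node_id and i["s"] == node_id:
                total += same_type(message, o["m"]) * battery_ok(o["b"])
    return total
```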
Test results show that SENSE can help improve opportunistic network performance
(with metrics such as hit rate, delivery latency, hop count, and delivery cost) when
selfish nodes exist. It is demonstrated that SENSE outperforms a scenario where self-
ish nodes are present, but no selfishness detection and incentive mechanism is avail-
able. Moreover, it even performs better than existing similar algorithms, such as IRON-
MAN [5]. It is also shown that SENSE can successfully differentiate between a node
being selfish on purpose and a node not being able to deliver messages due to low battery power. For the full set of tests and results, we refer the reader to [16]. The main
advantage of SENSE is not only that it can detect and avoid selfish nodes, but also that
it can limit the total number of messages sent in the opportunistic network by carefully
selecting a message’s destination, based on the social connection and history of routing.
5 Future Trends
One of the main limitations of research in this area, so far, is that it has mostly focused
on point-to-point communication. However, we believe that the future of opportunistic
networks is heading toward data dissemination, where communication is done based on
a publish/subscribe paradigm. This is why we are expecting a focus on data dissemina-
tion instead of point-to-point routing in the near future. This includes moving toward
information-centric networks (ICNs) and the Internet of Things (IoT).
An ICN is a novel method of making the Internet more data-oriented and content-
centric [21] and is basically a global-scale version of the publish/subscribe paradigm.
The focus changes from referring to data by its location (and an IP address) to request-
ing Named Data Objects (NDOs) instead. When an ICN network element receives a
request for content, it can respond with the content directly if it has the data cached, or
it can request it from its peers otherwise. This way, an end user is not concerned with
the location of an object, only with its actual name, thus being able to receive it from
any number of hosts. Mobile devices play an important role in ICNs, since they may be
used to cache data as closely as possible to interested users, based on context informa-
tion. Therefore, efficient opportunistic routing and dissemination algorithms have to be
employed in order to move the data accordingly and replicate it as needed.
The Internet of Things [23] aims to improve social connectivity in physical com-
munities by leveraging information detected by mobile devices. It assumes a number
of such devices being able to communicate between each other to gather context data,
which is then used to make automated decisions. The deployment of IoT generally has
three steps. The first one is getting more devices onto the network, the second step is
making them rely on each other, coordinating their actions for simple tasks without hu-
man intervention, and the final step is to understand these devices as a single system
that needs to be programmed. The more devices are connected, the more important the role of the routing and dissemination protocols will be.
It is estimated that IoT will have to accommodate over 50,000 billion objects of very
diverse types by 2020 [40]. Standardization and interoperability will thus be absolute ne-
cessities for interfacing them with the Internet. New media access techniques, commu-
nication protocols, and sustainable standards will need to be developed to make Things
communicate with each other and with people. One approach would be the encapsula-
tion of smart wireless identifiable devices and embedded devices in Web services. We
can also consider the importance of enhancing the quality of service aspects like response
time, resource consumption, throughput, availability, and reliability. The discovery and use of knowledge about service availability and of publish/subscribe/notify mechanisms
would also contribute to enhancing the management of complex Thing structures.
Because of the fast increase of mobile data traffic volume being generated by
bandwidth-hungry smartphone applications, cellular operators are forced to explore
various possibilities to offload data traffic away from their core networks. 3G cellu-
lar networks are already overloaded with data traffic generated by smartphone applica-
tions (e.g., mobile TV). With the advent of IoT, the potentially huge number of Things
will not be easily incorporated by today’s communication protocols and/or Internet ar-
chitecture. Mobile data offloading may relieve the problem, by using complementary
communication technologies (considering the increasing capacity of WiFi), to deliver
traffic originally planned for transmission over cellular networks. Here, opportunistic
networks can bring quick benefits.
Again related to IoT, new services shall be available for persistent distributed knowl-
edge storing and sharing, and new computational resources shall be used for the ex-
ecution of complicated tasks. Current forecasts indicate that in 2015 more than 220
Exabytes of data will be stored [40]. At the same time, optimal distribution of tasks
between smart objects with high capabilities and the IoT infrastructure shall be found.
New mechanisms and protocols will be needed for privacy and security issues at all IoT
levels including the infrastructure. Solutions for stronger security could be based on
models employing the context-aware capability of Things. New methods are required
for energy saving and energy-efficient and self-sustainable systems. Researchers will
look for new power-efficient platforms and technologies and will explore the ability of
smart objects to harvest energy from their surroundings.
The large variety of technologies and designs used in the production of Things is a
main concern when considering interoperability. One solution is the adoption of standards for intercommunication between Things. Adding self-configuration and self-management properties could be necessary to allow Things to interoperate and, in addition, integrate within the surrounding operational environment. This approach is superior to centralized management, which cannot respond to difficulties induced by the dimensions, dynamicity, and complexity of the Internet of Things. Autonomic behavior is important at the operational level as well. Letting autonomic Things react to events generated
by context changes facilitates the construction and structuring of large environments that
support the Internet of Things. Special requirements come from the scarcity of Things’
resources, and are concerned with power consumption. New methods of efficient man-
agement of power consumption are needed and could apply at different levels, from the
architecture level of Things to the level of the network routing. They could substantially
contribute to lowering the cost of Things, which is essential for the rapid expansion of
the Internet of Things.
Some issues come from the distributed nature of the environment in which different
operations and decisions are based on the collaboration of Things. One issue is how
Things converge on a solution and how the quality of the solution can be evaluated. An-
other issue is how to protect against faulty Things, including those exhibiting malicious
behavior. Finally, the way Things can cope with security issues to preserve confidentiality, privacy, integrity, and availability is of high interest. For all these, examples of
mechanisms designed to cope with such problems by actively using any communication
opportunity were presented throughout the chapter.
6 Conclusions
In this chapter, we have provided the definition of opportunistic networks and have
shown the challenges facing the deployment of such networks in real-life. However, we
have also presented several use cases where ONs have been successfully deployed, and
other areas where interesting and valid propositions have been presented. This leads us
to believe that opportunistic networks have good applicability in real life, especially if the algorithms and solutions keep evolving, as they have been doing in the past few years.
We have also presented several ON routing and dissemination algorithms. For each
of them, we have shown both strengths and weaknesses. Some of these algorithms are suitable for a certain type of situation, while others are better for different scenarios. There is no single best algorithm, and this is because opportunistic
networks are so varied and can range from large and dense networks with thousands
of participants, to small and sparse networks that must make the most of any contacts
between nodes. This is why the research area of ONs is so vast and keeps evolving
constantly, with the proposed solutions becoming better and better. Moreover, we have
shown how some of the issues in opportunistic networking might be fixed by leveraging
social networks, node behavior prediction and selfish node detection, and incentive
mechanisms. We concluded our presentation by showing that the future trends in the
area of mobile networking are veering toward data dissemination, through information-
centric networks and the Internet of Things.
Acknowledgments. This work was partially supported by the project “ERRIC - Empow-
ering Romanian Research on Intelligent Information Technologies/FP7-REGPOT-2010-
1”, ID: 264207. The work has been co-funded by the Sectoral Operational Programme
Human Resources Development 2007-2013 of the Romanian Ministry of Labour, Fam-
ily, and Social Protection through the Financial Agreement POSDRU/89/1.5/S/62557.
References
1. Akyildiz, I.F., Akan, Ö.B., Chen, C., Fang, J., Su, W.: InterPlaNetary Internet: state-of-the-art
and research challenges. Computer Networks Journal 43(2), 75–112 (2003)
2. Arnaboldi, V., Conti, M., Delmastro, F., Minutiello, G., Ricci, L.: DroidOppPathFinder: A
context and social-aware path recommender system based on opportunistic sensing. In: Pro-
ceedings of IEEE International Symposium on a World of Wireless, Mobile and Multimedia
Networks, WoWMoM 2012 (2012)
3. Athanasopoulos, D., Zarras, A.V., Issarny, V., Pitoura, E., Vassiliadis, P.: CoWSAMI:
Interface-aware context gathering in ambient intelligence environments. Pervasive and Mo-
bile Computing Journal 4(3), 360–389 (2008)
4. Bharathidasan, A., An, V., Ponduru, S.: Sensor networks: An overview. Technical report,
Department of Computer Science, University of California, Davis (2002)
5. Bigwood, G., Henderson, T.: IRONMAN: Using social networks to add incentives and reputation to opportunistic networks. In: Proceedings of IEEE Third International Conference on Privacy, Security, Risk and Trust, PASSAT 2011 (2011)
6. Bigwood, G., Rehunathan, D., Bateman, M., Henderson, T., Bhatti, S.: Exploiting self-
reported social networks for routing in ubiquitous computing environments. In: Proceedings
of the 2008 IEEE International Conference on Wireless & Mobile Computing, Networking
& Communication, WIMOB 2008, pp. 484–489. IEEE Computer Society, Washington, DC
(2008)
7. Boldrini, C., Conti, M., Passarella, A.: Exploiting users’ social relations to forward data in
opportunistic networks: The HiBOp solution. Pervasive and Mobile Computing Journal 4,
633–657 (2008)
8. Bruno, R., Conti, M., Passarella, A.: Opportunistic networking overlays for ICT services in
crisis management. In: Proceedings of the International Conference on Information Systems
for Crisis Response and Management, ISCRAM 2008 (2008)
9. Chaintreau, A., Hui, P., Crowcroft, J., Diot, C., Gass, R., Scott, J.: Pocket switched networks:
Real-world mobility and its consequences for opportunistic forwarding. Technical report,
University of Cambridge Computer Lab (2005)
10. Chourabi, H., Nam, T., Walker, S., Gil-Garcia, J.R., Mellouli, S., Nahon, K., Pardo, T.A.,
Scholl, H.J.: Understanding smart cities: An integrative framework. In: Proceedings of the
45th Hawaii International Conference on System Science, HICSS 2012, pp. 2289–2297
(2012)
11. Ciobanu, R., Dobre, C.: Data dissemination in opportunistic networks. In: Proceedings
of 18th International Conference on Control Systems and Computer Science, CSCS-18,
pp. 529–536. Politehnica Press (2012)
12. Ciobanu, R.I., Dobre, C.: Predicting encounters in opportunistic networks. In: Proceedings
of the 1st ACM Workshop on High Performance Mobile Opportunistic Systems, HP-MOSys
2012, pp. 9–14. ACM, New York (2012)
13. Ciobanu, R.I., Dobre, C., Cristea, V.: Social aspects to support opportunistic networks in an
academic environment. In: Li, X.-Y., Papavassiliou, S., Ruehrup, S. (eds.) ADHOC-NOW
2012. LNCS, vol. 7363, pp. 69–82. Springer, Heidelberg (2012)
14. Ciobanu, R.I., Dobre, C., Cristea, V.: SPRINT: Social prediction-based opportunistic rout-
ing. In: Proceedings of IEEE 14th International Symposium and Workshops on a World of
Wireless, Mobile and Multimedia Networks, WoWMoM 2013, pp. 1–7 (2013)
15. Ciobanu, R.-I., Dobre, C., Cristea, V., Al-Jumeily, D.: Social aspects for opportunistic com-
munication. In: Proceedings of the 11th International Symposium on Parallel and Distributed
Computing, ISPDC 2012, pp. 251–258 (2012)
16. Ciobanu, R.-I., Dobre, C., Dascălu, M., Trăusan-Matu, S., Cristea, V.: Collaborative self-
ish node detection with an incentive mechanism for opportunistic networks. In: Proceed-
ings of IFIP/IEEE International Symposium on Integrated Network Management, IM 2013,
pp. 1161–1166 (2013)
17. Conti, M., Giordano, S., May, M., Passarella, A.: From opportunistic networks to opportunis-
tic computing. Communications Magazine 48(9), 126–139 (2010)
18. Desta, M.S., Hyytiä, E., Ott, J., Kangasharju, J.: Characterizing content sharing properties
for mobile users in open city squares. In: Proceedings of the 10th Annual Conference on
Wireless on-demand Network Systems and Services, WONS 2013, pp. 147–154 (2013)
19. Dobre, C., Manea, F., Cristea, V.: CAPIM: A context-aware platform using integrated mobile
services. In: Proceedings of IEEE International Conference on Intelligent Computer Com-
munication and Processing, ICCP 2011, pp. 533–540 (2011)
20. Doria, A., Uden, M., Pandey, D.P.: Providing connectivity to the Saami nomadic commu-
nity. In: Proceedings of the 2nd International Conference on Open Collaborative Design for
Sustainable Innovation, DYD 2002, Bangalore, India (December 2002)
21. Ghodsi, A., Shenker, S., Koponen, T., Singla, A., Raghavan, B., Wilcox, J.: Information-
centric networking: seeing the forest for the trees. In: Proceedings of the 10th ACM Work-
shop on Hot Topics in Networks, HotNets-X, pp. 1:1–1:6. ACM, New York (2011)
22. Gu, T., Pung, H.K., Zhang, D.Q.: A service-oriented middleware for building context-aware
services. Journal of Network and Computer Applications 28(1), 1–18 (2005)
23. Guo, B., Yu, Z., Zhou, X., Zhang, D.: Opportunistic IoT: Exploring the social side of the
Internet of Things. In: Proceedings of IEEE 16th International Conference on Computer
Supported Cooperative Work in Design, CSCWD 2012, pp. 925–929 (2012)
24. Guo, S., Derakhshani, M., Falaki, M.H., Ismail, U., Luk, R., Oliver, E.A., Ur Rahman, S.,
Seth, A., Zaharia, M.A., Keshav, S.: Design and implementation of the KioskNet system.
Computer Networks Journal 55(1), 264–281 (2011)
25. Haddadi, H., Hui, P., Henderson, T., Brown, I.: Targeted advertising on the handset: privacy
and security challenges. Human-Computer Interaction Series. Springer (July 2011)
26. Heinemann, A., Straub, T.: Opportunistic networks as an enabling technology for mobile
word-of-mouth advertising. In: Pousttchi, K., Wiedmann, D.G. (eds.) Handbook of Re-
search on Mobile Marketing Management, pp. 236–254. Business Science Reference, PA
(2010)
27. Henricksen, K., Robinson, R.: A survey of middleware for sensor networks: state-of-the-
art and future directions. In: Proceedings of the International Workshop on Middleware for
Sensor Networks, MidSens 2006, pp. 60–65. ACM, New York (2006)
28. Hui, P., Chaintreau, A., Scott, J., Gass, R., Crowcroft, J., Diot, C.: Pocket switched net-
works and human mobility in conference environments. In: Proceedings of the ACM SIG-
COMM Workshop on Delay-Tolerant Networking, WDTN 2005, pp. 244–251. ACM, New
York (2005)
29. Hui, P., Crowcroft, J.: Predictability of human mobility and its impact on forwarding. In:
Proceedings of the Third International Conference on Communications and Networking in
China, ChinaCom 2008, pp. 543–547 (2008)
30. Hui, P., Crowcroft, J., Yoneki, E.: BUBBLE Rap: social-based forwarding in delay tolerant
networks. In: Proceedings of the 9th ACM International Symposium on Mobile Ad Hoc
Networking and Computing, MobiHoc 2008, pp. 241–250. ACM, New York (2008)
31. Hui, P., Yoneki, E., Chan, S.Y., Crowcroft, J.: Distributed community detection in delay
tolerant networks. In: Proceedings of 2nd ACM/IEEE International Workshop on Mobil-
ity in the Evolving Internet Architecture, MobiArch 2007, pp. 7:1–7:8. ACM, New York
(2007)
32. Juang, P., Oki, H., Wang, Y., Martonosi, M., Peh, L.S., Rubenstein, D.: Energy-efficient com-
puting for wildlife tracking: design tradeoffs and early experiences with ZebraNet. SIGOPS
Operating Systems Review 36(5), 96–107 (2002)
33. Le, V.-D., Scholten, H., Havinga, P.: Unified routing for data dissemination in smart city
networks. In: Proceedings of the 3rd International Conference on the Internet of Things, IOT
2012, pp. 175–182. IEEE Press, USA (2012)
34. Lilien, L., Gupta, A., Yang, Z.: Opportunistic networks for emergency applications and their
standard implementation framework. In: Proceedings of IEEE International Performance,
Computing, and Communications Conference, IPCCC 2007, pp. 588–593 (2007)
35. Lindgren, A., Doria, A., Schelén, O.: Probabilistic routing in intermittently connected net-
works. SIGMOBILE Mobile Computing and Communications Review 7(3), 19–20 (2003)
36. Martı́n-Campillo, A., Martı́, R., Yoneki, E., Crowcroft, J.: Electronic triage tag and oppor-
tunistic networks in disasters. In: Proceedings of the Special Workshop on Internet and Dis-
asters, SWID 2011, pp. 6:1–6:10. ACM, New York (2011)
37. Moreira, W., Mendes, P., Sargento, S.: Opportunistic routing based on daily routines. In:
IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks,
WoWMoM 2012, pp. 1–6 (2012)
38. Pelusi, L., Passarella, A., Conti, M.: Opportunistic networking: data forwarding in discon-
nected mobile ad hoc networks. Communications Magazine 44(11), 134–141 (2006)
39. Pentland, A., Fletcher, R., Hasson, A.: DakNet: Rethinking connectivity in developing na-
tions. Computer Journal 37(1), 78–83 (2004)
40. INFSO D.4 Networked Enterprise & RFID, INFSO G.2 Micro & Nanosystems, and Working
Group RFID of the ETP EPoSS. Internet of Things in 2020. Roadmap for the future (2009),
http://www.caba.org/resources/Documents/IS-2008-93.pdf (accessed
December 20, 2013)
41. Small, T., Haas, Z.J.: The shared wireless infostation model: a new ad hoc networking
paradigm (or where there is a whale, there is a way). In: Proceedings of the 4th ACM In-
ternational Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2003, pp.
233–244. ACM, New York (2003)
42. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Spray and wait: an efficient routing scheme
for intermittently connected mobile networks. In: Proceedings of the ACM SIGCOMM
Workshop on Delay-Tolerant Networking, WDTN 2005, pp. 252–259. ACM, New York
(2005)
Data Modeling for Socially Based Routing in Opportunistic Networks 55
43. Thilakarathna, K., Viana, A.C., Seneviratne, A., Petander, H.: The power of hood friendship
for opportunistic content dissemination in mobile social networks. Technical report, INRIA,
Saclay, France (2012)
44. Vahdat, A., Becker, D.: Epidemic Routing for Partially-Connected Ad Hoc Networks. Tech-
nical report, Duke University (April 2000)
45. Yoneki, E., Hui, P., Chan, S., Crowcroft, J.: A socio-aware overlay for publish/subscribe
communication in delay tolerant networks. In: Proceedings of the 10th ACM Symposium
on Modeling, Analysis, and Simulation of Wireless and Mobile Systems, MSWiM 2007,
pp. 225–234. ACM, New York (2007)
Decision Tree Induction Methods and Their Application
to Big Data
Petra Perner
Abstract. Data mining methods are widely used across many disciplines to identify patterns, rules, or associations among huge volumes of data. While in the past mostly black-box methods such as neural nets and support vector machines have been heavily used for the prediction of patterns, classes, or events, methods that have explanation capability, such as decision tree induction methods, have seldom been preferred. Therefore, this chapter gives an introduction to decision tree induction. We first present the basic principle, the advantageous properties of decision tree induction methods, and a description of the representation of decision trees, so that a user can understand and describe a tree in a common way. The overall decision tree induction algorithm is explained, as well as different methods for the most important functions of a decision tree induction algorithm, such as attribute selection, attribute discretization, and pruning, developed by us and others. We explain how the learnt model can be fitted to the expert's knowledge and how the classification performance can be improved. The problem of feature subset selection by decision tree induction is described. The quality of the learnt model should not only be checked based on the overall accuracy; more specific measures are explained that describe the performance of the model in more detail. We present a new quantitative measure that can describe changes in the structure of a tree, in order to help the expert interpret the differences between two trees learnt from the same domain. Finally, we summarize the chapter and give an outlook.
1 Introduction
Data mining methods are widely used across many disciplines to identify patterns, rules, or associations among huge volumes of data. While in the past mostly black-box methods such as neural nets and support vector machines have been heavily used for the prediction of patterns, classes, or events, methods that have explanation capability, such as decision tree induction methods, have seldom been preferred. However, it is very important to understand the classification result, not only in medical applications but increasingly also in technical domains. Nowadays, data mining methods with explanation capability are used more heavily across disciplines, after more work on the advantages and disadvantages of these methods has been done.
Decision tree induction is one of the methods that have explanation capability. Its advantages are ease of use and fast processing of the results. Decision tree induction methods can easily learn a decision tree without heavy user interaction, while in neural nets a lot of time is spent on training the net. Cross-validation methods can be applied to decision tree induction methods, while this is not the case for neural nets. These methods ensure that the calculated error rate comes close to the true error rate. In most domains, such as medicine, marketing, or nowadays even technical domains, the explanation capability, ease of use, and speed of model building are among the most preferred properties of a data mining method.
Several decision tree induction algorithms are known. They differ in the way they select the most important attributes for the construction of the decision tree, in whether they can deal with numerical and/or symbolical attributes, and in how they reduce noise in the tree by pruning. A basic understanding of the way a decision tree is built is necessary in order to select the right method for the actual problem and in order to interpret the results of a decision tree.
In this chapter, we review several decision tree induction methods. We focus on the most widely used methods and on methods we have developed. We rely on generalization methods and do not focus on methods that model subspaces in the decision space, such as decision forests, since in the case of these methods the explanation capability is limited.
The preliminary concepts and the background are given in Section 2. This is fol-
lowed by an overall description of a decision tree induction algorithm in Section 3.
Different methods for the most important functions of a decision tree induction algo-
rithm are described in Section 4 for the attribute selection, in Section 5 for attribute
discretization, and in Section 6 for pruning.
The explanations given by the learnt decision tree must make sense to the domain
expert since he often has already built up some partial knowledge. We describe in Sec-
tion 7 what problems can arise and how the expert can be satisfied with the explanations
about his domain. Decision tree induction is a supervised method and requires labeled
data. The necessity to check the labels by an oracle-based classification approach is also
explained in Section 7 as well as the feature subset-selection problem. It is explained in
what way feature subselection can be used to improve the model besides the normal
outcome of decision tree induction algorithm.
Section 8 deals with the question: How to interpret a learnt Decision Tree. Besides
the well-known overall accuracy different more specific accuracy measures are given
and their advantages are explained. We introduce a new quantitative measure that can
describe changes in the structure of trees learnt from the same domain and help the user
to interpret them. Finally we summarize our chapter in Section 9.
From the whole set of attributes, only those attributes that are most relevant for the classification problem are selected. Therefore, a decision tree induction method can also be seen as a feature selection method.
Once the decision tree has been learnt and the developer is satisfied with the quality of the model, the tree can be used in order to predict the outcome for new samples.
This learning method is also called supervised learning, since samples in the data
collection have to be labeled by the class. Most decision tree induction algorithms
allow using numerical attributes as well as categorical attributes. Therefore, the result-
ing classifier can make the decision based on both types of attributes.
A decision tree is a directed acyclic graph consisting of edges and nodes (see Fig.
2).
The node with no edges entering is called the root node. The root node contains all
class labels. Every node except the root node has exactly one entering edge. A node
having no successor is called a leaf node or terminal node. All other nodes are called
internal nodes.
The nodes of the tree contain decision rules of the form

A ≤ c.

The decision rule is a function f that maps the attribute A to the decision D. The rule described above results in a binary tree. The sample set in each node is split into two subsets based on the constant c for the attribute A. This constant c is called the cut-point.
In the case of a binary tree, the decision is either true or false. In the case of an n-ary tree, the decision is based on several constants ci. Such a rule splits the data set into i subsets. Geometrically, the split describes a partition orthogonal to one of the coordinates of the decision space.
A terminal node should contain only samples of one class. If there is more than one class in the sample set, we say there is class overlap. This class overlap in each terminal node is responsible for the error rate. An internal node always contains more than one class in the assigned sample set.
A path in the tree is a sequence of edges (v1,v2), (v2,v3), ..., (vn-1,vn). We say the path is from v1 to vn and has a length of n. There is a unique path from the root to each node. The depth of a node v in a tree is the length of the path from the root to v. The height of a node v in a tree is the length of the largest path from v to a leaf. The height of a tree is the height of its root. The level of a node v in a tree is the height of the tree minus the depth of v.

Fig. 2. Representation of a Decision Tree

A binary tree is an ordered tree such that each successor of a node is distinguished either as a left son or a right son. No node has more than one left son, nor has it more than one right son. Otherwise, it is an n-ary tree.
Let us now consider the decision tree learnt from Fisher's Iris data set. This data set has three classes (1-Setosa, 2-Versicolor, 3-Virginica) with 50 observations for each class and four predictor variables (petal length, petal width, sepal length, and sepal width). The learnt tree is shown in Figure 3. It is a binary tree.

Fig. 3. Decision Tree learnt from Iris Data Set
The average depth of the tree is (1+3+3+2)/4 = 9/4 = 2.25. The root node contains the attribute petal_length. Along a path, the rules are combined by the AND operator. Following two paths from the root node we obtain, for example, two rules such as:

RULE 1: IF petal_length ≤ 2.45 THEN Setosa
RULE 2: IF petal_length > 2.45 AND petal_length ≤ 4.9 AND petal_width ≤ 1.65 THEN Versicolor

In the latter rule we can see that the attribute petal_length is used two times during the problem-solving process. Each time a different cut-point is used on this attribute.
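Such a tree and its rules can also be reproduced with standard software. The following minimal sketch (in Python, assuming the scikit-learn library is available; the data set loader and function names are those of that library and not of the tool used later in this chapter) learns a binary tree on the Iris data and prints its rules. The exact cut-points may differ slightly from Figure 3 depending on the induction parameters.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learnt tree as a nested set of split conditions on the four predictor variables
print(export_text(tree, feature_names=iris.feature_names))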
The overall procedure of the decision tree building process is summarized in Figure 4.
Decision trees recursively split the decision space (see Fig. 5) into subspaces based on
the decision rules in the nodes until the final stopping criterion is reached or the re-
maining sample set does not suggest further splitting. For this recursive splitting the
tree building process must always pick among all attributes that attribute which shows
the best result on the attribute selection criterion for the remaining sample set. Whereas for categorical attributes the partition of the attribute values is given a priori, the partition of the attribute values for numerical attributes must be determined. This process is called attribute discretization.
The attribute discretization process can be done before or during the tree building process [1]. We will consider the case where the attribute discretization is done during the tree building process. The discretization must be carried out before the attribute selection process, since the selected partition on the attribute values of a numerical attribute highly influences the prediction power of that attribute.
After the attribute selection criterion has been calculated for all attributes based on the remaining sample set at the particular level of the tree, the resulting values are evaluated and the attribute with the best value for the attribute selection criterion is selected for further splitting of the sample set. Then, the tree is extended by two or more further nodes. To each node is assigned the subset created by splitting on the attribute values, and the tree building process repeats.
Attribute splits can be done axis-parallel, based on a single attribute, or based on a linear combination of several attributes.
The influence of the kind of attribute split on the resulting decision surface for two attributes is shown in Figure 6. The axis-parallel decision surface results in a rule such as

IF petal_length ≥ 4.9 THEN Virginica

The linear decision surface discriminates better between the two classes than the axis-parallel one, see Figure 6. However, by looking at the rules we can see that the explanation capability of the tree will decrease in the case of the linear decision surface.
The induced decision tree tends to overfit to the data. This is typically caused by noise in the attribute values and in the class information present in the training set. The tree building process will produce subtrees that fit to this noise. This causes an increased error rate when classifying unseen cases. Pruning the tree, which means replacing subtrees with leaves, can help to avoid this problem.
Now, we can summarize the main subtasks of decision tree induction as follows:
• attribute selection (Information Gain [2], χ2-Statistic [3], Gini-Index [4], Gain Ratio [5], distance measure-based selection criteria [6]),
• attribute discretization (Cut-Point [2], Chi-Merge [3], MDL-principle [7], LVQ-based discretization, histogram-based discretization, and hybrid methods [8]),
• recursively splitting the data set, and
• pruning (Cost-Complexity [4], Reduced Error Pruning [2], Confidence Interval Method [9], Minimal Error Pruning [10]).
Beyond that, decision tree induction algorithms can be distinguished by the way they access the data and into non-incremental and incremental algorithms.
Some algorithms access the whole data set in the main memory of the computer. This is insufficient when the data set is very large. Large data sets of millions of records do not fit into the main memory of the computer. They must be accessed from disk or another storage device so that all these data can be mined. Accessing the data from external storage devices will cause long execution times. However, the user likes to get results fast and, even for exploration purposes, he likes to carry out various experiments quickly and compare them to each other. Therefore, special algorithms have been developed that can work efficiently while using external storage devices.
Incremental algorithms can update the tree according to new data, while non-incremental algorithms go through the whole tree building process again based on the combined old and new data set.
Some standard algorithms are: CART, ID3, C4.5, C5.0, Fuzzy C4.5, OC1, QUEST, and CAL5.
Formally, we can describe the attribute selection problem as follows: Let Y be the full set of attributes A, with cardinality k, and let ni be the number of samples in the remaining sample set i. Let the feature selection criterion function for the attribute be represented by S(A, ni). Without any loss of generality, let us consider a higher value of S to indicate a good attribute A. Formally, the problem of attribute selection is to find an attribute A* based on our sample subset ni that maximizes our criterion S, so that

S(A*, ni) = max_{A ∈ Y} S(A, ni)    (1)
Note that each attribute in the list of attributes (see Fig. 1) is tested against the cho-
sen attribute selection criterion in the sequence it appears in the list of attributes. If
two attributes have equal maximal value then the automatic rule picks the first appear-
ing attribute.
Numerous attribute selection criteria are known. We will start with the most frequently used criterion, the information gain criterion.
Following the theory of the Shannon channel [11], we consider the data set as the
source and measure the impurity of the received data when transmitted via the chan-
nel. The transmission over the channel results in the partition of the data set into sub-
sets based on splits on the attribute values J of the attribute A. The aim should be to
transmit the signal with the least loss of information. This can be described by the following criterion:

I(A) = I(C) − I(C/J) → max

where I(A) is the information gained by branching on attribute A, I(C) is the entropy of the receiver or the expected entropy to generate the messages C1, C2, ..., Cm, and I(C/J) is the loss of entropy when branching on the attribute values J of attribute A.
For the calculation of this criterion, we first consider the contingency table in Table 4, with m the number of classes, k the number of attribute values J, n the number of examples, Li the number of examples with the attribute value Ji, Rj the number of examples belonging to class Cj, and xij the number of examples belonging to class Cj and having the attribute value Ji.
Now we can define the entropy over all classes C by:

I(C) = − Σ_{j=1..m} (Rj/n) · log2(Rj/n)    (2)

and the conditional entropy of the classes given the attribute values J by:

I(C/J) = − Σ_{i=1..k} (Li/n) · Σ_{j=1..m} (xij/Li) · log2(xij/Li)    (3)
The best feature is the one that achieves the lowest value of (2) or, equivalently,
the highest value of the "mutual information" I(C) - I(C/J). The main drawback of this
measure is its sensitivity to the number of attribute values. In the extreme case a fea-
ture that takes N distinct values for the N examples achieves complete discrimination
between different classes, giving I(C/J)=0, even though the features may consist of
random noise and be useless for predicting the classes of future examples. Therefore,
Quinlan [5] introduced a normalization by the entropy of the attribute itself:

GainRatio(A) = I(A) / I(J)    (4)

with

I(J) = − Σ_{i=1..k} (Li/n) · log2(Li/n).
Other normalizations have been proposed by Coppersmith et al. [13] and López de Mántaras [6]. Comparative studies have been done by White and Liu [14].
Another attribute selection criterion is the Gini index [4]. The Gini index of the class distribution is defined as:

G(C) = 1 − Σ_{j=1..m} (Rj/n)²    (5)

The Gini function of the class given the feature values J is defined as:

Gini(C/J) = Σ_{i=1..k} (Li/n) · G(Ci)    (6)

with

G(Ci) = 1 − Σ_{j=1..m} (xij/Li)²
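As an illustration of the selection criteria just described, the following sketch (Python with the NumPy library, assuming the contingency-table notation introduced above; the function name is illustrative) computes the information gain, the gain ratio, and the Gini value of a split from a contingency table:

import numpy as np

def selection_criteria(table):
    """Attribute selection criteria from a contingency table of shape (k, m):
    rows are attribute values J_i, columns are classes C_j, entries are the
    counts x_ij. Returns information gain, gain ratio, and the Gini value of
    the split, following equations (2)-(6) as reconstructed above."""
    x = np.asarray(table, dtype=float)
    n = x.sum()
    L = x.sum(axis=1)                       # examples per attribute value, L_i
    R = x.sum(axis=0)                       # examples per class, R_j

    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    i_c = entropy(R / n)                                              # I(C), eq. (2)
    i_cj = sum((L[i] / n) * entropy(x[i] / L[i]) for i in range(len(L)))  # I(C/J), eq. (3)
    gain = i_c - i_cj                                                 # information gain
    i_j = entropy(L / n)                                              # I(J), entropy of the attribute
    gain_ratio = gain / i_j if i_j > 0 else 0.0                       # eq. (4)
    gini = sum((L[i] / n) * (1.0 - np.sum((x[i] / L[i]) ** 2)) for i in range(len(L)))  # eqs. (5)-(6)
    return gain, gain_ratio, gini

# Example: two attribute values (rows) and three classes (columns)
print(selection_criteria([[50, 5, 0], [0, 45, 50]]))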
A numerical attribute may take any value on a continuous scale between its minimal
value x1 and its maximal value x2. Branching on all these distinct attribute values does
not lead to any generalization and would make the tree very sensitive to noise. Rather
we should find meaningful partitions on the numerical values into intervals. The in-
tervals should abstract the data in such a way that they cover the range of attribute
values belonging to one class and that they separate them from those belonging to
other classes. Then, we can treat the attribute as a discrete variable with k+1 intervals. This process is called discretization of attributes.
The points that split the attribute values into intervals are called cut-points. The k cut-points always lie on the border between the distributions of two classes.
Discretization can be done before the decision tree building process or during deci-
sion tree learning [1]. Here, we want to consider discretization during the tree build-
ing process. We call them dynamic and local discretization methods. They are dynam-
ic since they work during the tree building process on the created subsample sets and
they are local since they work on the recursively created subspaces. If we use the class label of each example, we consider the method a supervised discretization method. If we do not use the class label of the samples, we call it an unsupervised discretization method. We can partition the attribute values into two (k=1) or more intervals (k>1). Therefore, we distinguish between binary and multi-interval discretization methods. The discretization process on numerical attribute values belongs to the attribute-value aggregation process, see Figure 7.
In Figure 8, we see the conditional histogram of the attribute values of the attribute
petal_length of the IRIS data set. In the binary case (k=1) the attribute values would be
split at the cut-point 2.35 into an interval from 0 to 2.35 and into a second interval from
2.36 to 7. If we do multi-interval discretization, we will find another cut-point at 4.8. That groups the values into three intervals (k=2): interval_1 from 0 to 2.35, interval_2 from 2.36 to 4.8, and interval_3 from 4.9 to 7.
Attribute-value aggregation can also be meaningful on categorical attributes. Many attribute values of a categorical attribute will lead to a partition of the sample set into many small subsample sets. This again will result in a quick stop of the tree building process. To avoid this problem, it might be wise to combine attribute values into a more abstract attribute value. We will call this process attribute aggregation. It is also possible to allow the user to combine attribute values interactively during the tree building process. We call this process manual abstraction of attribute values, see Figure 7.
Fig. 8. Histogram of Attribute Petal Length with Binary and Multi-interval Cut-Points
For the binary, entropy-based discretization, the cut-point T is chosen such that the class information entropy of the resulting partition is minimal:

I(A, T; S) → min
with S the subsample set, A the attribute, and T the cut-point that separates the sam-
ples into subset S1 and S2.
I(A, T; S) is the entropy for the separation of the sample set into the subsets S1 and S2:

I(A, T; S) = (|S1|/|S|) · Ent(S1) + (|S2|/|S|) · Ent(S2)    (8)

with

Ent(S1) = − Σ_{j=1..m} p(Cj, S1) · log2 p(Cj, S1)    (9)

and Ent(S2) respectively.
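A minimal sketch of this binary, entropy-based discretization (Python with NumPy; the function names are illustrative only) tests every boundary between adjacent, distinct attribute values and keeps the cut-point with the smallest value of I(A, T; S):

import numpy as np

def class_entropy(labels):
    """Ent(S) as in eq. (9): entropy of the class distribution in a sample set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_cut_point(values, labels):
    """Test every boundary between two adjacent, distinct attribute values as a
    tentative cut-point T and keep the one minimizing I(A, T; S) from eq. (8)."""
    order = np.argsort(values)
    v, y = np.asarray(values, dtype=float)[order], np.asarray(labels)[order]
    best_t, best_i = None, np.inf
    for idx in range(1, len(v)):
        if v[idx] == v[idx - 1]:
            continue
        t = (v[idx] + v[idx - 1]) / 2.0                  # midpoint as tentative cut-point
        i_ats = (idx / len(v)) * class_entropy(y[:idx]) \
              + ((len(v) - idx) / len(v)) * class_entropy(y[idx:])
        if i_ats < best_i:
            best_t, best_i = t, i_ats
    return best_t, best_i

# Example: one numerical attribute with two classes
print(best_cut_point([1.0, 1.2, 1.4, 4.5, 4.7, 5.0], ["a", "a", "a", "b", "b", "b"]))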
The calculation of the cut-point is usually a time-consuming process since each
possible cut-point is tested against the selection criteria. Therefore, algorithms have
been proposed that speedup the calculation of the right cut-point [28].
One such method selects the cut-point from the statistics of the two groups created by a tentative threshold T. The between-group variance is

σ_B² = P0 · (m0 − m)² + P1 · (m1 − m)²    (10)

and

N = Σ_{xi ≤ T} h(xi) + Σ_{xi > T} h(xi)    (11)

with N being the number of all samples and h(xi) the frequency of attribute value xi. T is the threshold that is tentatively moved over all attribute values. The values m0 and m1 are the mean values of the two groups, which give the overall mean:

m = P0 · m0 + P1 · m1    (13)

where P0 and P1 are the probabilities for the values of the subsets S1 and S2:

P0 = Σ_{xi ≤ T} h(xi)/N   and   P1 = Σ_{xi > T} h(xi)/N    (14)

The threshold T is chosen such that the between-group variance is maximal: σ_B² → max.
Some discretization methods require the user to specify a value for the required number of intervals in the remaining data set. This will result in bushy decision trees or will stop the tree building process sooner than necessary. It would be better to calculate the number of intervals from the data.
Fayyad and Irani [7] developed a stopping criterion based on the minimum description length principle. Based on this criterion, the number of intervals is calculated for the remaining data set during the decision tree induction. This discretization procedure is called MDL-based discretization.
Another criterion can use a cluster utility measure to determine the best suitable number of intervals.
MDL-Based Criteria
The MDL-based criterion was introduced by Fayyad and Irani [7]. Discretization is based on the gain Gain(A, T; S) = Ent(S) − I(A, T; S). The gain is tested after each new interval against the MDL criterion; the recursive partitioning stops if

Gain(A, T; S) < log2(N − 1)/N + Δ(A, T; S)/N    (15)

with

Δ(A, T; S) = log2(3^m − 2) − [m · Ent(S) − m1 · Ent(S1) − m2 · Ent(S2)]    (17)

where m, m1, and m2 are the numbers of classes present in S, S1, and S2, respectively, and N is the number of samples in S.
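The stopping test can be sketched as follows (Python with NumPy; based on the reconstructed equations (15) and (17), so the details should be checked against [7]; the function names are illustrative):

import numpy as np

def class_entropy(labels):
    """Ent(S): entropy of the class distribution in a sample set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def accept_split(y, y1, y2):
    """MDL-based stopping test: accept the tentative cut-point only if the
    gain exceeds the MDL threshold, otherwise partitioning stops."""
    n = len(y)
    gain = class_entropy(y) - (len(y1) / n) * class_entropy(y1) \
                            - (len(y2) / n) * class_entropy(y2)
    k, k1, k2 = len(set(y)), len(set(y1)), len(set(y2))
    delta = np.log2(3 ** k - 2) - (k * class_entropy(y)
                                   - k1 * class_entropy(y1)
                                   - k2 * class_entropy(y2))
    return gain > np.log2(n - 1) / n + delta / n

print(accept_split(["a"] * 3 + ["b"] * 3, ["a", "a", "a"], ["b", "b", "b"]))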
LVQ-Based Discretization
Vector quantization methods can also be used for the discretization of attribute values [8]. LVQ [15] is a supervised learning algorithm. This method attempts to define class regions in the attribute space by a set of labeled codebook vectors. After initialization, the codebook vector wc closest to the training sample x is updated in step t as follows:

correct classification: wc(t+1) = wc(t) + α(t) · [x(t) − wc(t)]    (18)
incorrect classification: wc(t+1) = wc(t) − α(t) · [x(t) − wc(t)]    (19)
all other codebook vectors: wl(t+1) = wl(t)    (20)
This behavior of the algorithm can be employed for discretization. The algorithm tries to optimize the misclassification probability. A potential cut-point might be in the middle between the learned codebook vectors of two different classes. However, the proper initialization of the codebook vectors and the choice of the learning rate α(t) are a crucial problem.
Figure 9 shows this method based on the attribute petal_length of the IRIS
domain.
Fig. 9. Class Distribution of an Attribute, Codebook Vectors, and the learnt Cut-Points
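The update step can be sketched as follows (Python with NumPy, for a one-dimensional attribute; an illustrative simplification of LVQ1 rather than a full implementation, with hypothetical function and parameter names):

import numpy as np

def lvq1_step(codebook, code_labels, x, x_label, alpha):
    """One LVQ1 update step (eqs. 18-20): move the winning codebook vector
    towards the sample if the class labels agree, away from it otherwise;
    all other codebook vectors remain unchanged."""
    codebook = np.asarray(codebook, dtype=float).copy()
    c = int(np.argmin(np.abs(codebook - x)))     # index of the nearest codebook vector
    sign = 1.0 if code_labels[c] == x_label else -1.0
    codebook[c] += sign * alpha * (x - codebook[c])
    return codebook

# A cut-point candidate then lies in the middle between two neighbouring
# codebook vectors that belong to different classes.
print(lvq1_step([1.5, 4.5], ["Setosa", "Versicolor"], 2.0, "Setosa", alpha=0.1))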
Histogram-Based Discretization
A histogram-based method was first suggested by Wu et al. [16]. They used this
method in an interactive way during top-down decision tree building. By observing
the histogram, the user selects the threshold which partitions the sample set into groups containing only samples of one class. In Perner and Trautzsch [8], an automatic histogram-based method for feature discretization is described.
The distribution p(a | a ∈ Ck) · P(Ck) of one attribute a according to the classes Ck is calculated. The curve of the distribution is approximated by a first-order polynomial with the coefficients a0 and a1 and the supporting places xi, which are the attribute values. The minimum square error method is used for approximating the real histogram curve by the first-order polynomial:

E = Σ_{i=1..n} (a0 + a1 · xi − yi)²    (21)
The cut-points are selected by finding two maxima of different classes situated
next to each other.
We used this method in two ways: First, we used the histogram-based discretization method as described before. Second, we used a combined discretization method based on the distribution p(a | a ∈ Ck) · P(Ck) and the entropy-based minimization criterion. We followed the corollary derived by Fayyad and Irani [7], which says that the entropy-based discretization criterion for finding a binary partition for a continuous attribute will always partition the data on a boundary point in the sequence of the examples ordered by the value of that attribute. A boundary point partitions the examples into two sets having different classes. Taking this fact into account, we determine potential boundary points by finding the peaks of the distribution. If we find two peaks belonging to different classes, we use the entropy-based minimization criterion in order to find the exact cut-point between these two classes by evaluating each boundary point K with Pi ≤ K ≤ Pi+1 between these two peaks. The resulting cut-points are shown in Figure 10.
Fig. 10. Examples sorted by attribute values for attribute petal length, labeled peaks, and the
selected cut-points
This method is not as time-consuming as the others. Compared to the other methods, it gives reasonably good results (see Table 5).
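A minimal sketch of the peak-based idea (Python with NumPy; the function and parameter names are illustrative) locates the class-specific histogram peaks and places a cut-point between neighbouring peaks of different classes; the entropy-based search between the peaks described above can be plugged in instead of the simple midpoint used here:

import numpy as np

def peak_based_cut_points(values, labels, bins=20):
    """Build a per-class histogram of the attribute, locate the peak of each
    class, and place a cut-point between neighbouring peaks that belong to
    different classes (here simply the midpoint between the peaks)."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    peaks = []
    for c in np.unique(labels):
        hist, edges = np.histogram(values[labels == c], bins=bins)
        centres = (edges[:-1] + edges[1:]) / 2.0
        peaks.append((centres[np.argmax(hist)], c))
    peaks.sort()
    return [(p1 + p2) / 2.0
            for (p1, c1), (p2, c2) in zip(peaks, peaks[1:]) if c1 != c2]

print(peak_based_cut_points([1.0, 1.2, 1.4, 4.5, 4.7, 5.0], ["a", "a", "a", "b", "b", "b"]))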
Chi-Merge Discretization
The ChiMerge algorithm introduced by Kerber [3] consists of an initialization step and a bottom-up merging process, where intervals are continuously merged until a termination condition is met. Kerber used the ChiMerge method statically. In our study, we apply ChiMerge dynamically to discretization. The potential cut-points are investigated by testing two adjacent intervals with the χ² independence test. The statistical test is:

χ² = Σ_{i=1..m} Σ_{j=1..k} (Aij − Eij)² / Eij    (22)

where m equals 2 (the two intervals being compared), k is the number of classes, and Aij is the number of examples in the ith interval and jth class.
The expected frequency Eij is calculated according to:

Eij = (Ri · Cj) / N    (23)

where Ri is the number of examples in the ith interval, Cj the number of examples in the jth class, and N the total number of examples in the two intervals.
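The test for two adjacent intervals can be sketched as follows (Python with NumPy; a sketch of equations (22)-(23), not of Kerber's original implementation):

import numpy as np

def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals, given the per-class
    example counts A_ij of each interval. In ChiMerge, the pair of adjacent
    intervals with the lowest value is merged until the termination condition
    (a significance threshold) is met."""
    a = np.array([counts_a, counts_b], dtype=float)        # shape (2, k)
    expected = a.sum(axis=1, keepdims=True) * a.sum(axis=0, keepdims=True) / a.sum()
    mask = expected > 0                                    # ignore empty classes
    return float(np.sum((a[mask] - expected[mask]) ** 2 / expected[mask]))

print(chi2_adjacent([10, 0], [2, 8]))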
Table 2. Error Rate for Decision Trees based on different Discretization Methods
Automatic Abstraction
However, it is also possible to perform abstractions on symbolic attribute values automatically during the tree-building process, based on the class-attribute interdependence. Then, the discretization process is done bottom-up, starting from the initial attribute intervals. The process stops when the criterion is reached.
6 Pruning
If the tree is allowed to grow to its maximum size, it is likely that it becomes
overfitted to the training data. Noise in the attribute values and class information will
amplify this problem. The tree-building process will produce subtrees that fit to noise.
This unwarranted complexity causes an increased error rate when classifying unseen
cases. This problem can be avoided by pruning the tree. Pruning means replacing
subtrees by leaves based on some statistical criterion. This idea is illustrated in Fig-
ures 15 and 16 on the IRIS data set. The unpruned tree is a large and bushy tree with
an estimated error rate of 6.67%. Subtrees get replaced by leaves up to the second
level of the tree. The resulting pruned tree is smaller and the error rate becomes
4.67% calculated with cross-validation.
Pruning methods can be categorized either as pre- or post-pruning methods. In pre-
pruning, the tree growing process is stopped according to a stopping criterion before
the tree reaches its maximal size. In contrast, in post-pruning the tree is first
developed to its maximum size and afterwards pruned back according to a pruning
procedure.
However, pruning methods are always based on some assumptions. Whether the assumption of a pruning method holds for the particular data set can only be seen from the calculated error rate. There might be data sets where it is better to keep the unpruned tree, and there might be data sets where it is better to use the pruned tree.
Fig. 15. Unpruned Decision Tree for the IRIS Data Set
Fig. 16. Pruned Tree for the IRIS Data Set based on Minimal Error Pruning
Pruning Methods
Cost-Complexity Pruning [4] is based on a cost-complexity measure of the form

CC(T) = E(T)/N(T) + α · Leaves(T)

with E(T) being the number of misclassified samples of the subtree T, N(T) the number of samples belonging to the subtree T, Leaves(T) the number of leaves of the subtree T, and α a freely defined parameter, which is often called the complexity parameter. The subtree whose replacement causes minimal cost is replaced by a leaf; the critical value of α for replacing a subtree T with root node t is

α = (E(t) − E(T)) / (N(T) · (Leaves(T) − 1))

where E(t) is the number of misclassified samples when the subtree T is replaced by the leaf t.
The algorithm tentatively replaces all subtrees by leaves if the calculated value of α is minimal compared to the values of α for the other replacements. This results in a sequence of trees T0 < T1 < ... < Ti < ... < Tn, where T0 is the original tree and Tn is the root. The trees are evaluated on an independent data set. Among this set of tentative trees, the smallest tree that minimizes the misclassifications on the independent data set is selected as the final tree. This is called the 0-SE selection method (0-standard error). Other approaches use a relaxed version, called the 1-SE method, in which the error of the smallest tree does not exceed Emin + SE(Emin). Emin is the minimal number of errors yielded by a decision tree Ti, and SE(Emin) is the standard deviation of the empirical error estimated from the independent data set. SE(Emin) is calculated as follows:

SE(Emin) = sqrt( Emin · (N − Emin) / N )    (25)

where N is the number of samples in the independent data set.
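The selection among the tentative trees can be sketched as follows (Python with NumPy; the error counts on the independent data set and the tree sizes are assumed to be given, and the function name is illustrative):

import numpy as np

def select_pruned_tree(errors, sizes, n_test, one_se=True):
    """0-SE / 1-SE selection: among the pruning sequence T_0, ..., T_n, given
    by their error counts on the independent data set and their sizes (e.g.
    number of nodes), pick the smallest tree whose error does not exceed
    E_min (plus SE(E_min) from eq. (25) if the 1-SE rule is used)."""
    errors, sizes = np.asarray(errors, dtype=float), np.asarray(sizes)
    e_min = errors.min()
    se = np.sqrt(e_min * (n_test - e_min) / n_test)
    bound = e_min + se if one_se else e_min
    candidates = np.where(errors <= bound)[0]
    return int(candidates[np.argmin(sizes[candidates])])

# Example: error counts and sizes of a pruning sequence on 150 test samples
print(select_pruned_tree(errors=[7, 7, 8, 12], sizes=[17, 9, 5, 1], n_test=150))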
Visualization techniques should show the user the location of the class-specific data distribution depending on two attributes, as shown in Figure 11. This helps the user to understand what has changed in the data. From a list of attributes, the user can pick two attributes and the respective graph will be presented.
Decision tree induction is a supervised classification method. Each data entry in
the data table used for the induction process needs a label. Noisy data might be caused
by wrong labels applied by the expert or by other sources to the data. This might re-
sult in low classification accuracy.
The learnt classifier can be used in an oracle-based approach. For this purpose, the data are classified by the learnt model. All samples that are misclassified can be reviewed by the domain expert. If the expert is of the opinion that a sample needs another label, then the sample is relabeled. The tree is learnt again based on the newly labeled data set.
If the user has to label the data, it might be apparent that the subjective decision about the class a sample belongs to might introduce some noise. Depending on the expert's daily form or level of experience, he will label the data more or less well. Oracle-based classification methods [19][20] or similarity-based methods [21][22] might help the user to overcome such subjective factors.
The decision tree induction algorithm is also a feature selection algorithm. According to the criterion given in Section 5, the method selects from the original set of attributes Y with cardinality g the desired number of features o into the selected subset X, with X ⊆ Y.
There are two main methods for feature selection: the filter approach (see Fig. 18) and the wrapper approach (see Fig. 19). While the filter approach attempts to assess the merits of features from the data alone, the wrapper approach attempts to find the best feature subset for use with a particular classification algorithm.
In Perner [23], we have shown that using the filter approach before going into the decision tree induction algorithm will result in a slightly better error rate.
Conversely, based on our experience, it is also possible to run the induction algorithm first and collect from the tree the chosen subset X of attributes. Based on that, we can reduce the database to a data set having only the subset X of attributes and run the decision tree induction algorithm again. The resulting classifier will often have better accuracy than the original one.
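A sketch of this two-pass procedure (Python, assuming the scikit-learn library and a numerical data matrix; the attribute indices used by the first tree are read from its internal node structure, and the function name is illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_based_feature_subset(X, y):
    """Learn a tree, collect the attributes actually used in its internal
    nodes, reduce the data set to this subset, and learn the tree again."""
    first = DecisionTreeClassifier(random_state=0).fit(X, y)
    used = sorted(set(f for f in first.tree_.feature if f >= 0))  # negative values mark leaves
    second = DecisionTreeClassifier(random_state=0).fit(X[:, used], y)
    return used, second

iris = load_iris()
used, clf = tree_based_feature_subset(iris.data, iris.target)
print("attributes used:", [iris.feature_names[i] for i in used])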
For a black box model we only get some quantitative measures such as the error rate
as quality criterion. These quantitative measures can also be calculated for decision
trees. However, the structure of a tree and the rules contained in a tree are some other
information a user can use to judge the quality of the tree. This will sometimes cause
problems since the structure and the rules might change depending on the learning
data set.
Often the user starts to learn a tree based on a data set DSn and after some time he collects more data, so that he gets a new data set DSn+1 (see Figure 20). If he combines data set DSn and data set DSn+1 into a new data set DS', the resulting decision tree will change compared to the initial tree. Even when he learns the new tree only based on the data set DSn+1, he will get a different tree compared to the initial tree. Data sampling will not help him to get around this problem. On the contrary, this information can be used to understand what has changed in the domain over time. For that, he needs some knowledge about the tree building process and what the structure of a tree means in terms of generalization and rule syntax.
In this section, we will describe the quantitative measures for the quality of the de-
cision tree and the measures for comparing two learnt trees.
8.1 Quantitative Measures for the Quality of the Decision Tree Model
One of the most important measures of the quality of a decision tree is the accuracy or, respectively, the error rate:

f = Nf / N    (26)

where Nf is the number of misclassified samples and N the total number of samples.
This measure is judged based on the available data set. Usually, cross-validation is
used for evaluating the model since it is never clear if the available data set is a good
representation of the entire domain. Compared to test-and-train, cross-validation can
provide a measure statistically close to the true error rate. Especially if one has a
small sample set, the prediction of the error rate based on cross-validation is a must.
Although this is a well-known fact by now, there are still frequently results presented
that are based on test-and-train and small sample sets. If a larger data set is available,
cross-validation is also a better choice for the estimation of the error rate since one
can never be sure if the data set covers the property of the whole domain. Faced with
the problem of computational complexity, n-fold cross-validation is a good choice. It
subsequently splits the whole data set into n blocks and runs cross-validation based on these blocks.
The output of cross-validation is the mean accuracy. As is known from statistics, it is much better to estimate a measure from single measurements obtained on a data set split into blocks, and to average over these measurements, than to estimate the measure from a single calculation on the whole data set. Moreover, the variance of the accuracy gives another hint as to how good the measure is. If the variance is high, there is much noise in the data; if the variance is low, the result is much more stable.
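A minimal sketch of n-fold cross-validation reporting the mean accuracy and its standard deviation (Python, assuming the scikit-learn and NumPy libraries; the function name is illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(X, y, n_folds=10):
    """In each cycle a new tree is trained on n-1 blocks and tested on the
    remaining block; the mean accuracy and its standard deviation are reported."""
    accuracies = []
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train, test in folds.split(X, y):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
        accuracies.append(clf.score(X[test], y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))

iris = load_iris()
print(cv_accuracy(iris.data, iris.target))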
The quality of a neural net is often not judged based on cross-validation, since cross-validation requires setting up a new model in each cycle. The mean accuracy over the accuracies of the single cycles is calculated, as well as the standard deviation of the accuracy. Neural nets are not automatically set up, but decision trees are. A neural network needs a lot of training, and people claim that such a neural net, once it is stable in its behavior, is the gold standard. However, its accuracy is judged based on the test-and-train approach, and it is not certain that this is the true accuracy.
Bootstrapping for the evaluation of accuracy is another choice but it is much more
computationally expensive than cross-validation; therefore, many tools do not provide
this procedure.
The mean accuracy and the standard deviation of the accuracy are overall measures. More detailed measures can be calculated that give a deeper insight into the behavior of the model [24].
For that, we use a contingency table in order to show the quality of a classifier, see Table 3. The table contains the class distribution assigned by the classifier and the real class distribution, with the entries cij. The main diagonal contains the numbers of correctly classified samples. The last row shows the number of samples assigned to each class by the classifier, and the last column shows the real class distribution in the data set. Based on this table, we can calculate parameters that assess the quality of the classifier in more detail.
The correctness p is the number of correctly classified samples over the total number of samples:

p = Σ_{i=1..m} cii / Σ_{i=1..m} Σ_{j=1..m} cij    (27)
We can also measure the classification quality pki according to a particular class i and the share of correctly classified samples pti for one class i:

pki = cii / Σ_{j=1..m} cji   and   pti = cii / Σ_{j=1..m} cij    (28)
In the two-class case these measures are known as sensitivity and specificity. Based on the application domain, it must be decided whether a high correctness is required for a particular class or not.
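These measures can be computed directly from the contingency table (Python with NumPy; note that the row/column orientation follows the reconstruction of equation (28) above, which is an assumption, so the table may have to be transposed for a differently oriented layout):

import numpy as np

def quality_measures(confusion):
    """Correctness p (eq. 27) and the class-specific measures p_ki and p_ti
    (eq. 28) from a contingency table c_ij; here the rows are taken as the
    classes assigned by the classifier and the columns as the real classes."""
    c = np.asarray(confusion, dtype=float)
    p = np.trace(c) / c.sum()
    p_k = np.diag(c) / c.sum(axis=0)      # quality with respect to the real class i
    p_t = np.diag(c) / c.sum(axis=1)      # correctly classified share per assigned class i
    return p, p_k, p_t

# Example: three-class contingency table
print(quality_measures([[50, 0, 0], [0, 47, 2], [0, 3, 48]]))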
Other criteria shown in Table 4 are also important when judging the quality of a
model.
Table 4. Criteria for judging the quality of a model

Generalization Capability of the Classifier: Error Rate based on the Test Data Set
Representation Capability of the Classifier: Error Rate based on the Design Data Set
Classification Costs: Number of Features used for Classification; Number of Nodes or Neurons
Explanation Capability: Can a human understand the decision?
Learning Performance: Learning Time
Sensitivity to the Class Distribution in the Sample Set
If a classifier has a good representation capability, which is judged based on the design data set, this does not mean that the classifier will also have a good generalization capability, which is judged on the test data set; it will not necessarily classify unseen samples with high accuracy.
Another criterion is the cost for classification expressed by the number of features
and the number of decisions used during classification. The other criterion is the time
needed for learning. We also consider the explanation capability of the classifier as
another quality criterion, as well as the learning performance. Decision trees can be constructed quickly without heavy user interaction, but they tend to be sensitive to the class distribution in the sample set.
We propose a first similarity measure for the differences between two learnt models as follows:
3. Build substructures of all l rules by decomposing the rules into their substructures.
4. Compare two rules i and j of the two decision trees d1 and d2 for each of the nj and ni substructures with s attributes.
5. Build the similarity measure SIMij according to formulas (29)-(33).
SIMij = (1/s) · (sim1 + sim2 + ... + siml + ... + sims)    (29)

with, for a categorical term,

siml = 1 if the attribute and the attribute value are identical in both substructures, and siml = 0 otherwise.    (30)

If the rule contains a numerical attribute A ≤ k1 and A' ≤ k2 = k1 + x, then the similarity measure is

siml = 1 − |k1 − k2| / (t · k1) = 1 − |x| / (t · k1)   for |x| < t · k1    (31)

and

siml = 0   for |x| ≥ t · k1    (32)

with t a user-chosen value that allows x to lie in a tolerance range of t·100% (e.g., 10%) of k1. That means that as long as the cut-point k2 is within the tolerance range around k1, we consider the term as similar; outside the tolerance range it is dissimilar. Small changes around the first cut-point are allowed, while a cut-point far from the first cut-point means that something serious has happened to the data.
The similarity measure for the whole substructure is:

Sim = (1/s) · Σ_{l=1..s} siml, with siml = 1 if the attribute Al is identical to A'l and siml = 0 otherwise.    (33)
The overall similarity between the two decision trees d1 and d2 is

Sim_{d1,d2} = (1/l) · Σ_{i=1..l} max_{∀j} SIMij    (34)

for comparing the rules i of decision tree d1 with the rules j of decision tree d2. Note that the similarity Sim_{d2,d1} need not be the same.
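A sketch of the comparison (Python; sim_rule is a user-supplied function that compares two rules, for example built from the term similarity of the reconstructed equations (31)-(32); all names are illustrative assumptions):

def term_similarity(k1, k2, t=0.1):
    """Similarity of two numerical terms A <= k1 and A <= k2: within a
    tolerance range of t*k1 around the first cut-point the terms count as
    similar, outside the range they are dissimilar."""
    x = abs(k2 - k1)
    tolerance = t * abs(k1)
    return 1.0 - x / tolerance if x < tolerance else 0.0

def tree_similarity(rules_1, rules_2, sim_rule):
    """Overall similarity (eq. 34): every rule of the first tree is matched to
    its most similar rule in the second tree and the maxima are averaged.
    The measure is not symmetric in the two trees."""
    return sum(max(sim_rule(r1, r2) for r2 in rules_2) for r1 in rules_1) / len(rules_1)

# Example: two cut-points on the same attribute, 10% tolerance
print(term_similarity(2.45, 2.50), term_similarity(2.45, 4.90))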
9 Conclusions
In this chapter, we have described decision tree induction. We first explained the general methodology of decision tree induction and its advantages and disadvantages. Decision tree induction methods are easy to use. They only require a table of data in an attribute-value-based representation. The tree induction process runs fully automatically; user interaction is not necessarily required. The speed of the method allows a model to be built quickly, which is preferable in many domains. Cross-validation is the method of choice to calculate the error rate based on the data set. The error rate calculated in this way comes close to the true error rate. Then we described the main structure of a decision tree and how an expert can read it in a common way. The overall algorithm of a decision tree induction method was given. For the main functions of the decision tree building process, different methods have been explained for attribute selection, attribute discretization, and pruning. We described methods for attribute discretization that are standard methods as well as methods that we have developed.
Most of the methods described in this chapter are implemented in our tool Decision Master© (www.ibai-solutions.de). Decision Master is still one of the most flexible and reliable tools for decision tree induction and fits the user's needs. The examples given in this chapter have been calculated using the tool Decision Master.
Many more decision tree induction algorithms have been developed over time. Most of them strongly depend on the underlying distribution in the data sample. The one that, in our experience, works well on average on all sample sets is Quinlan's C4.5. The other ones outperform C4.5 on specific data sets but might give much worse results on other data sets.
It was explained how the explanations can be fitted to the domain expert's knowledge and what further can be done for feature selection in order to improve the quality of the model.
The quality of the model should be assessed not only by the overall error rate; more specific error rates are necessary for a good evaluation of the model. We introduced a new quality criterion that can evaluate how much the structure of a tree differs from a former tree and how to interpret this difference.
Open problems in decision tree induction include methods that can deal with imbalanced data sets, better visualization techniques for the properties of the data and for big decision tree models, strategies for automatically dealing with competing attributes, and more support on how to interpret decision trees.
References
1. Dougherty, J., Kohavi, R., Sahamin, M.: Supervised and Unsupervised Discretization of
Continuous Features. In: 14th IJCAI Machine Learning, pp. 194–202 (1995)
2. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1, 81–106 (1986)
3. Kerber, R.: ChiMerge: Discretization of Numeric Attributes. In: AAAI 1992 Learning: In-
ductive, pp. 123–128 (1992)
4. Breiman, L., Friedman, J.H., Olshen, R.A.: Classification and Regression Trees. The
Wadsworth Statistics/Probability Series, Belmont California (1984)
5. Quinlan, J.R.: Decision trees and multivalued attributes. In: Hayes, J.E., Michie, D., Rich-
ards, J. (eds.) Machine Intelligence 11. Oxford University Press (1988)
6. de Mantaras, R.L.: A distance-based attribute selection measure for decision tree induc-
tion. Machine Learning 6, 81–92 (1991)
7. Fayyad, U.M., Irani, K.B.: Multi-Interval Discretization of Continuous Valued Attributes
for Classification Learning. In: 13th IJCAI Machine Learning, vol. 2, pp. 1022–1027.
Morgan Kaufmann, Chambery (1993)
8. Perner, P., Trautzsch, S.: Multinterval Discretization for Decision Tree Learning. In: Amin,
A., Pudil, P., Dori, D. (eds.) SPR 1998 and SSPR 1998. LNCS, vol. 1451, pp. 475–482.
Springer, Heidelberg (1998)
9. Quinlan, J.R.: Simplifying decision trees. Machine Learning 27, 221–234 (1987)
10. Niblett, T., Bratko, I.: Construction decision trees in noisy domains. In: Bratko, I., Lavrac,
N. (eds.) Progress in Machine Learning, pp. 67–78. Sigma Press, England (1987)
11. Philipow, E.: Handbuch der Elektrotechnik, Bd. 2: Grundlagen der Informationstechnik, pp. 158–171. Technik Verlag, Berlin (1987)
12. Quinlan, J.R.: Decision trees and multivalued attributes. In: Hayes, J.E., Michie, D., Rich-
ards, J. (eds.) Machine Intelligence 11. Oxford University Press (1988)
13. Coppersmith, D., Hong, S.J., Hosking, J.: Partitioning nominal attributes in decision trees.
Journal of Data Mining and Knowledge Discovery 3(2), 100–200 (1999)
14. White, A.P., Liu, W.Z.: Bias in information-based measures in decision tree induction.
Machine Learning 15, 321–329 (1994)
15. Kohonen, T.: Self-Organizing Maps. Springer (1995)
16. Wu, C., Landgrebe, D., Swain, P.: The decision tree approach to classification, School
Elec. Eng., Purdue Univ., W. Lafayette, IN, Rep. RE-EE 75-17 (1975)
17. Perner, P., Belikova, T.B., Yashunskaya, N.I.: Knowledge Acquisition by Decision Tree
Induction for Interpretation of Digital Images in Radiology. In: Perner, P., Rosenfeld, A.,
Wang, P. (eds.) SSPR 1996. LNCS, vol. 1121, pp. 208–219. Springer, Heidelberg (1996)
18. Kuusisto, S.: Application of the PMDL Principle to the Induction of Classification Trees.
PhD-Thesis, Tampere Finland (1998)
19. Muggleton, S.: Duce - An Oracle-based Approach to Constructive Induction. In: Proceed-
ings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI 1987),
pp. 287–292 (1987)
20. Wu, B., Nevatia, R.: Improving Part based Object Detection by Unsupervised Online
Boosting. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007,
pp. 1–8 (2007)
21. Whiteley, J.R., Davis, J.F.: A similarity-based approach to interpretation of sensor data us-
ing adaptive resonance theory. Computers & Chemical Engineering 18(7), 637–661 (1994)
22. Perner, P.: Prototype-Based Classification. Applied Intelligence 28(3), 238–246 (2008)
23. Perner, P.: Improving the Accuracy of Decision Tree Induction by Feature Pre-Selection.
Applied Artificial Intelligence 15(8), 747–760 (2001)
24. Perner, P., Zscherpel, U., Jacobsen, C.: A Comparison between Neural Networks and Decision Trees based on Data from Industrial Radiographic Testing. Pattern Recognition Letters 22, 47–54 (2001)
25. Georg, G., Séroussi, B., Bouaud, J.: Does GEM-Encoding Clinical Practice Guidelines
Improve the Quality of Knowledge Bases? A Study with the Rule-Based Formalism. In:
AMIA Annu. Symp. Proc. 2003, pp. 254–258 (2003)
26. Lee, S., Lee, S.H., Lee, K.C., Lee, M.H., Harashima, F.: Intelligent performance manage-
ment of networks for advanced manufacturing systems. IEEE Transactions on Industrial
Electronics 48(4), 731–741 (2001)
27. Bazijanec, B., Gausmann, O., Turowski, K.: Parsing Effort in a B2B Integration Scenario -
An Industrial Case Study. In: Enterprise Interoperability II, Part IX, pp. 783–794. Springer
(2007)
28. Seidelmann, G.: Using Heuristics to Speed Up Induction on Continuous-Valued Attributes.
In: Brazdil, P.B. (ed.) ECML 1993. LNCS, vol. 667, pp. 390–395. Springer, Heidelberg
(1993)
Sensory Data Gathering for Road Traffic
Monitoring: Energy Efficiency, Reliability,
and Fault Tolerance
1 Introduction
continuous data streaming a challenging issue from a design perspective. The traditional approaches for data collection in sensor networks use directional flooding, where the sensory data are streamed toward the sink. However, a flooding-based approach is not scalable and reliable for continuous data streaming over low-capacity devices. This is because the network gets overloaded with data packets, which increases the probability of packet loss and consumes extra energy, making the sensors more prone to sudden failure. Tree-based data gathering has many advantages over data gathering through directional flooding. It offers an ordered delivery of application data with minimum redundancy. Each node in a convergecast tree forwards the sensed data as well as the accumulated data from all its children to its parent node, so that all the data are eventually delivered to the root or sink node. Thus, the collision-free, limited communication saves the critical battery power at the sensor nodes. Considering the strip-like physical distribution of the sensors for road traffic management, every set of sensors dedicated to a particular sink can form a Depth First Search (DFS) tree rooted at that sink. As internode communication consumes the maximum power [37], energy saving is a crucial requirement for the resource-constrained sensors. Interference among neighbors, idle listening, and overhearing are other major sources of energy wastage. Past research on sensor power management incorporated wakeup-based schedules for sensor nodes [18]. Based on slotted time intervals, nodes switch between the sleep and wakeup states. To access sensory data at the sink or the gateway, it is required that the network remains connected at any point of time while maintaining the sensing coverage [45].
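A minimal sketch of constructing such a DFS tree rooted at the sink (Python; the adjacency relation, node names, and the function are illustrative assumptions, not the protocol proposed later in this chapter) is given below. Every sensor gets exactly one parent, and data accumulated from the children would be forwarded along the tree edges towards the sink.

from collections import defaultdict

def build_dfs_tree(adjacency, sink):
    """Build a spanning tree rooted at the sink by a depth-first traversal.
    'adjacency' maps a node to the neighbours inside its communication range."""
    parent = {sink: None}
    children = defaultdict(list)
    stack = [sink]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in parent:          # not yet attached to the tree
                parent[neighbour] = node
                children[node].append(neighbour)
                stack.append(neighbour)
    return parent, children

# Example topology: sink S connected to a and b; a connected to c.
print(build_dfs_tree({"S": ["a", "b"], "a": ["S", "c"], "b": ["S"], "c": ["a"]}, "S"))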
There exist a number of chain-based, cluster-based, or tree-based solutions for data gathering in WSNs [32]. As pointed out in [31], most of them do not consider coverage, connectivity, and fault tolerance during the data forwarding. The authors in [26] have proposed a virtual robust spanning tree-based protocol for data gathering maintaining the point coverage in WSN. Here, the sensing and relaying activities are distributed among nodes. Every node is assumed to know
the location of all the targets. The tree construction algorithm selects a tree edge based on the link weight and the hop-count distance. The authors in [25] have proposed an autonomous tree maintenance scheme for WSN with the assumption that all the nodes, including the sink, are homogeneous and can leave the network. The paper lacks a theoretical complexity analysis and does not address coverage issues. Moreover, neighbor information has to be maintained at each node, and the maintenance scheme is not local. Another connected dominating set (CDS)-based topology control scheme has been proposed in [39] that focused mainly on the coverage issues. Here, nodes require the neighbor distance information. Additionally, no scheme for topology maintenance and application data handling has been provided. A comparative study between different graph-based topology control schemes and the CDS-based topology control technique has been reported in [22]. However, most of them require 2-hop neighborhood information.
For highly dense sensor networks, a topology control scheme has been proposed by Iyengar et al. [12] assuring connected coverage and low coordination among sensors. Their wakeup-based scheme constructs multiple node-disjoint topologies to offer fault tolerance with a minimum number of active nodes. The work assumed the position information of every node and did not consider changes in the underlying topology. All the above-mentioned works on tree-based convergecast and topology control for general WSNs are not directly applicable to the road sensor network environment, where continuous data streaming is essential. The limitations of the above works can be summarized as follows:
– Reliability should be ensured during the continuous data streaming over
the resource constrained sensor devices. Duplicate data delivery and data
flooding need to be avoided to assure an improved energy efficiency in the
sensory devices. Further, the sensing coverage, the network connectivity and
the fault-tolerance need to be assured simultaneously for an efficient sensing
and the data collection. Most of the existing works do not ensure the reliabil-
ity, coverage, connectivity and fault tolerance simultaneously [31], which are
essential to be considered for the road sensor network. The existing works that ensure coverage during the data forwarding either require position information or are centralized in nature.
– Most of the existing works, as reported in [22], use a reactive path repairing technique, which is costly for delay-sensitive road sensor networks.
– During the path establishment after a node fails, the reliability of applica-
tion message delivery should be maintained. The existing data forwarding
schemes do not consider the application message reliability during repairing
time.
2 Literature Survey
WSNs have emerged as an integral part of the design and development of Intelligent Transport Systems (ITS) for road and traffic monitoring. Most of the works in the literature based on road sensor networks have mainly focused on the design aspects of low-cost and effective sensors that can detect road conditions and moving vehicles. Different types of sensors have been designed for vehicular traffic monitoring, such as optical remote sensing [27], airborne sensors [28], laser scanner sensors [10], magnetic sensors [1], etc., out of which magnetic sensors
have been proved to be the most reliable and cost-effective solution [6,43]. In [36],
the authors have shown that magnetic impedance sensors provide better low-
field sensitivity to detect the passing vehicles based on the Earth’s magnetic flux.
With the advances in the sensor hardware research for efficient road-surveillance,
many algorithms have been designed for the development of ITS utilizing the
advantages of magnetic sensors. The design objectives of these algorithms mainly
include the detection of passing vehicles [19], the vehicle speed measurement [30],
traffic information prediction [4,29], collision warning [33], and the traffic light
control [35,3] etc.
Though the design and use of sensors for ITS have been widely studied in the literature, their networking aspects are still underexplored. As discussed earlier, a sensor network for road surveillance raises special design issues in terms of the deployment parameters for the network architecture, topology control, sensor scheduling based on connectivity and coverage, and the design of data forwarding protocols based on road traffic characteristics. Two types of network architecture are explored in the literature: the vehicular sensor network [16,23,5,21,11,41], where the sensors are placed inside the vehicles, and the road sensor network [14,40,42,20,13], where the sensor nodes are placed along the roads or the pavements.
In a vehicular sensor network, sensors are placed inside the vehicles, and sen-
sory data are forwarded either to the neighboring nodes or to the roadside sinks,
for further processing and information extraction. The nodes in a vehicular sen-
sor network are mobile in nature, that imposes several research challenges for the
design of data collection protocols. In [16], the authors have proposed a coordina-
tion mechanism among the vehicular sensors to design a data collection protocol
that aims in minimizing the network congestion. They have used a hybrid model
where the mobile vehicular sensors forward data to the static roadside sensors.
For congestion free data collection, their proposed scheme uses a collaboration
mechanism among the static and the mobile sensors, based on the position infor-
mation obtained from the Global Positioning System (GPS). However, their scheme requires absolute position information, and GPS is too costly to use with every sensor node as it consumes a significant amount of power. Momen
et al. [23] have proposed a random structure vehicular sensor system, where the
network coverage is guaranteed based on the vehicular mobility model. In [21],
the authors have developed a multi-hop data dissemination protocol for vehicu-
lar sensor network based on the absolute positioning information obtained from
the GPS server. Though their protocol reduces the transmission delay, it suffers
from extra power consumption required for the GPS functioning. Haddadou et
al. [11] have proposed another data dissemination protocol for vehicular sensor
network using a diffusion-based approach. Though vehicular sensor networks have been regarded as a cost-effective solution for vehicle monitoring,
[Figure: part of the road sensor network showing groups Gi, Gj, Gk, Gl with nodes w, x, y, z, the sensing range Rs, and the communication range Rc(x).]
1. The proposed scheme aims to construct a DFS tree rooted at the sink node, assuring both coverage and connectivity in the network. The tree structure ensures continuous data streaming over the sensor nodes with the minimum energy requirement. A tree management module is proposed to handle the dynamics of the network during data streaming. The objective is to maintain the convergecast tree on the events of a node leaving or joining the tree, either due to a state transition or due to an arbitrary node failure. The focus is on satisfying the connected-coverage criteria even after repairing or recovery, such that the minimum number of nodes remain in the wakeup state at any point of time. The maintenance cost, both in terms of delay and communication, is kept low.
2. The continuous streaming of application data is handled properly with the motivation of assuring no loss or redundancy in the data delivery. A data management module, which works cooperatively with the tree management module, has been designed for this purpose.
3. The effectiveness of the proposed scheme is evaluated through simulation and compared with other naive approaches for data collection.
Magnetic sensors [1,15], which can detect moving vehicles from disturbances in the Earth’s magnetic field, are widely used for in-road deployment due to their high sensitivity, small size, and low power consumption. Every sensor node is
assumed to be static and equipotential in terms of battery power, memory limit,
and processing capacity. Each sensor is assumed to have the data sensing range of
radius Rs and the communication range of radius Rc such that Rc = 2 × Rs [38].
Nodes go to sleep periodically to save critical battery power. A node is called
active when it is in wakeup state and actively participating in data sensing
and forwarding. Again, a sensor node is called inactive when it is sleeping and saving energy. In sleep mode, the transceiver does not send or receive any message. However, a node can receive and respond to a triggered interrupt even while it is sleeping. This chapter focuses on a part of the road sensor network
consisting of a single sink and its set of dedicated sensors as shown by the
bounded region in Fig. 1. It is assumed that the physical distribution of the
sensors along the road is virtually divided into blocks of equal length Rs . The
breadth of each block is typically ≤ Rs . The set of nodes in every block forms a
group. Nodes of a particular group are synchronized with the same schedule such
that only one node stays awake at any time instance. However, the schedules of different groups are different. The assumption of keeping one node active in each
block suffices to keep the road segment sensing-covered. The set of active and
inactive nodes, for a particular group, together is called the set of dedicated nodes
for that group. There exists a small overlapping region, called the timeout interval, between two adjacent time slots of any schedule to accomplish the handover.
3.2 Initialization
The sink node is assumed to be aware of the physical distribution of all the
sensors. The sink node initiates the tree construction and assigns groupID to each
node of the respective block during this phase. Once the groupID is set and the
set of dedicated nodes are identified for a group, the schedule is computed locally
for that particular group. The local schedule computation can be performed by
running a distributed randomized algorithm at each dedicated node, similar to
the one given as Algorithm 1 in [44]. This local computation of the schedule also assures global consistency and synchronization within a group. Once the
active node for the next time slot is selected, remaining nodes of the dedicated
set for that group go to the sleep state. The sink node initiates the tree construction
by 1-hop Token broadcast. All the active nodes from every block participate in
tree construction by broadcasting the Token in 1-hop neighborhood in turn. As
the Token is forwarded from the sink toward the nodes at the farthest distance,
a DFS tree rooted at the sink is constructed level by level, maintaining the connectivity in the network. Every node in the tree senses data as well as forwards it to its parent, and all data are eventually forwarded to the root or sink
node. A part of such a convergecast tree is shown in Fig. 2, where an arrow represents a tree edge and a dark shaded circle represents an active node. The light shaded circles represent the set of redundant nodes and the empty circles denote the set of inactive nodes.
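The token-driven construction just described can be sketched as follows. This is only a minimal illustration under a simplifying assumption: an active node adopts as its parent the sender of the first Token it hears, rather than following the exact DFS ordering of the proposed scheme; the toy topology and names are hypothetical.

```python
from collections import deque

def build_convergecast_tree(active_nodes, neighbors, sink):
    """Sketch of the token passing: starting from the sink, the Token is broadcast
    1-hop at a time; an active node adopts the sender of the first Token it hears
    as its parent and then re-broadcasts the Token itself."""
    parent = {sink: None}
    queue = deque([sink])
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v in active_nodes and v not in parent:
                parent[v] = u          # v joins the tree under u
                queue.append(v)        # v will broadcast the Token in turn
    return parent

if __name__ == "__main__":
    # toy strip topology: sink s, one active node per block, 1-hop links along the road
    links = {"s": ["a"], "a": ["s", "b"], "b": ["a", "c"], "c": ["b"]}
    print(build_convergecast_tree({"s", "a", "b", "c"}, lambda n: links[n], "s"))
```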
[Fig. 3: state transition diagram (STD) of the tree maintenance activities at each node, with states INIT, ACT_LISTEN, SLE_LISTEN, SLE_WAIT, WAIT_L, C_WAIT_J, P_WAIT_J, LOOKUP, WAIT_FJ, WAIT_URG, WAIT_DEC, and RESUME, and the control messages, timeouts (T_SL, T_WT, T_SUCC), and CC signals that trigger the transitions.]
adapt to the changes in the tree such that no data is lost or delivered redundantly. Based on these two cases of possible changes in the underlying convergecast tree, the activities of each node have been modeled by the state transition diagram (STD) given in Fig. 3. Initially, every active node remains in the ACT LISTEN state, whereas all other nodes remain in the SLE LISTEN state. After the successful implementation of all necessary actions for tree maintenance, every node participating in maintenance eventually reaches the RESUME state. The tree maintenance activities are described below in view of the STD, considering two different cases of node leaving; a minimal transition-table sketch follows the state list. The control messages that trigger a transition between two states in the STD are assumed to be 1-hop.
– ACT LISTEN
• On receiving a JOIN message from the new node n, node u in
ACT LISTEN state goes to WAIT L state and waits for a LEAVE mes-
sage from either the parent or child node.
• On receiving a LEAVE message at state ACT LISTEN, node u goes
to either state C WAIT J (if received from the child node c) or state
P WAIT J (if received from the parent node p) and waits for a JOIN
message.
• After sleep timeout T SL, a node goes to SLE WAIT state and broadcasts a LEAVE message.
– SLE WAIT
• If a node receives a JOIN message within T WT timeout, it goes to
SLE LISTEN state and changes its mode from active to sleep.
• If a node does not receive a JOIN message within T WT timeout, that
indicates a possible node crash which will be discussed later.
– WAIT L
• If the LEAVE message is received from the parent node p, node u goes to
INIT state after sending a suspend signal to CC (to stop data forwarding
from buffer).
• If the LEAVE message is received from a child node c, node u goes to
INIT state.
• If no LEAVE message is received within the T SUCC timeout, it is assumed that the sending node has crashed. Node u safely broadcasts a JOIN ACK message and goes to RESUME state to set its parent and child accordingly.
– C WAIT J
• From state C WAIT J, node u sends a close signal to CC and waits for a reset signal from CC to remove the leaving node from its child set.
• On receiving JOIN message from the new node n, u goes to INIT state.
• If no JOIN is received within T SUCC timeout, node u goes to state
LOOKUP, where it performs special maintenance activities.
– P WAIT J
• From state P WAIT J, node u sends a suspend signal to CC to stop data
forwarding from buffer.
• On receiving JOIN message from new node n, u reaches INIT state.
• If no JOIN is received within T SUCC timeout, node u goes to state
LOOKUP, where it performs special maintenance activities.
– INIT
• As node u has received both the LEAVE and JOIN messages, it sends a
JOIN ACK message to the newly added node n and reaches the RESUME state to set the parent or child variable accordingly.
– LOOKUP
• A node reaches the LOOKUP state if no JOIN message has been received
within T SUCC timeout after receiving a LEAVE message. This may
happen if the intended recipient is outside the communication range of
the new node, as shown in Fig. 4. In that case, to continue the uninter-
rupted data forwarding and maintain the tree connectivity, extra nodes
from the set of inactive nodes are awaken forcefully by triggering in-
terrupt. Thus for a particular group, under this kind of situation, more
than one nodes may remain in active state. The challenge is to keep
the count of active nodes for any group as low as possible. Node u at
LOOKUP state, sends an interrupt F WAKEUP to forcefully wakeup
the next node to wakeup in Gu and goes to WAIT FJ state.
– SLE LISTEN
• A node remains in this state while in sleep mode. There are two events on which a node goes from sleep mode to active mode and changes its state.
• After sleep timeout T SL, a node goes to wake-up mode (normal sce-
nario), broadcasts a JOIN message, and goes to ACT LISTEN state.
• On receiving F WAKEUP from a node w, the node goes to wakeup mode
(forced wakeup scenario due to special maintenance activities), transfers
to WAIT URG state and waits for T WT time to determine whether it should perform special maintenance activities.
– WAIT URG
• Node u at state WAIT URG goes to ACT LISTEN state after the T WT timeout by broadcasting an F JOIN message.
– WAIT FJ
• On receiving an F JOIN message from node f , the node at state WAIT FJ sends an F JACK message to node f and goes to RESUME state to set its parent or child accordingly.
– ACT LISTEN (Extended)
• A node that has changed its state from WAIT URG to ACT LISTEN by broadcasting an F JOIN message waits at ACT LISTEN state for an F JACK message. On receiving an F JACK message, it goes to RESUME state and sets its parent or child accordingly.
– RESUME
• On reaching the RESUME state, the node finally sends a resume signal to CC to start data forwarding from the buffer. The node returns to the ACT LISTEN state after successfully changing its parent (or child) node and resuming data forwarding activities.
[Figs. 4 and 5: the special maintenance case in groups Gj, Gk, Gl with nodes r, u, w, x, y, the sensing range Rs, and the communication range Rc(u).]
The maintenance scheme for handling the special case has been described
pictorially in Fig. 4 and Fig. 5. Let node x ∈ Gk remain active for the time slot tm and go to sleep during slot tn as in Fig. 4. Another node, say u ∈ Gk , becomes active in slot tn . τmn denotes the timeout interval between slots m and n. Let w ∈ Gl and y ∈ Gj be the active nodes of the corresponding blocks according to Fig. 4. If y ∈ Gj is outside the range Rc (u), it does not receive the JOIN message from u, as in Fig. 4. On expiry of T SUCC, y sends an F WAKEUP
signal as an interrupt to a node r ∈ Gj where r is the node to become active
for the time slot o (m, n, o are consecutive time slots for the schedule of Gj ).
On receiving the F WAKEUP from y, node r becomes active and broadcasts a
F JOIN message. Node u ∈ Gk on receiving F JOIN from r, sets parent(u) ← r
and sends back F JACK as an acknowledgment. On receiving F JACK, node r
sets parent(r) ← y and Child(r) ← u and thus, both the nodes y and r remain
active for the time slot n. The application data is forwarded through the path
(w, u, r, y) according to Fig. 5. At the end of τno , node y goes to sleep and node
r continues for the time slot o.
mode wakes up and goes to WAIT URG state as discussed earlier. It can be
noted that during tree maintenance for the special case (when two active nodes
are not in the communication range of each other) the F WAKEUP interrupt
signal is used only to forcefully wake up the next node in Gu according to the schedule (which is known to every other node in the group). However, for node
crash, the F WAKEUP interrupt signal is broadcast to wake up all the nodes in the group to select the optimal node based on their level of confidence (conf).
The level of confidence at each node is calculated based on the residual energy
and the distance metric from both the parent and child of the crashed node (can
be calculated from the signal strength and free space path loss model). The node
with the maximum residual energy that lies within the communication range of both the specified nodes is best suited to be elected as the active node and thus sets conf to True. If multiple nodes satisfy the conditions, the one with the minimum
nodeID is selected as the active one. The recovery actions for this case are as
follows.
– WAIT URG (Extended)
• On receiving an URGENT message, every node u in Gk , goes to
WAIT DEC state to decide its conf.
– WAIT DEC
• If conf is set to True, the node broadcasts an URG ACK message and
reaches the RESUME state to set its parent and child accordingly.
• If conf is set to false, the node goes back to SLE LISTEN state.
– ACT LISTEN (Extended)
• On receiving an URG ACK message, a node that has sent an URGENT message goes to the RESUME state to set its parent and child accordingly.
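The confidence-based election described above can be sketched as follows. The candidate tuples and the distance estimates are hypothetical inputs, assumed to come from the signal-strength and free-space path-loss estimation mentioned earlier; this is an illustration, not the authors' code.

```python
def elect_recovery_node(candidates, rc):
    """Pick the replacement for a crashed node from its group.

    Each candidate is (node_id, residual_energy, dist_to_parent, dist_to_child),
    where the distances are assumed to be estimated from received signal strength
    using a free-space path-loss model. The elected node must reach both the
    parent and the child of the crashed node; among those, the one with the
    highest residual energy wins, with the minimum nodeID breaking ties.
    """
    reachable = [c for c in candidates if c[2] <= rc and c[3] <= rc]
    if not reachable:
        return None
    # maximum residual energy first, then minimum nodeID
    return max(reachable, key=lambda c: (c[1], -c[0]))[0]

if __name__ == "__main__":
    nodes = [(4, 3.2, 110.0, 240.0), (7, 4.1, 90.0, 200.0), (9, 4.1, 60.0, 180.0)]
    print(elect_recovery_node(nodes, rc=250.0))  # -> 7 (highest energy, lower nodeID)
```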
Case 2 : The node fails in sleep mode
Let a node x ∈ Gk be scheduled to wake up in time slot tn following the normal schedule. It may happen that, due to some nonrecoverable fault, x fails to wake up on time. The failure detection of node x is delayed until x is expected to participate in
maintenance activities at time t. From the given STD in Fig. 3, a leaving node
that has already sent a LEAVE message waits for a JOIN message from a new
node at state SLE WAIT. If u does not receive any JOIN message within the T WT timeout, it assumes that the new node has failed while sleeping and continues in the active state by moving back into ACT LISTEN.
Again, during the special maintenance activities, at the LOOKUP state, a node sends an F WAKEUP message to the node due to become active in the next slot and waits for an acknowledgment at state WAIT FJ. If the next node to become active fails to acknowledge, it is assumed to have failed, and node u returns to the LOOKUP state to reinitiate the process with another sleeping node after the T WT timeout. Once a new node becomes active, the tree is repaired locally following steps similar to those in the previous case. The following lemmas and theorems show the correctness of the proposed scheme. Let VT denote the set of active nodes that are part of the convergecast tree at any instance of time, and let d(x, y) be the Euclidean distance between any two nodes x and y.
Lemma 2. Let u ∈ Gi and v ∈ Gj be two nodes such that Gi and Gj are consecutive to each other. The maximum distance between u and v is √5 Rs.
Proof. From the assumption, for every group, at least one node is in the active state and participates in the convergecast tree. From Lemma 2, in the worst case, d(u, v) = √5 Rs > Rc. Thus, in the worst case, the JOIN message sent by node u ∈ Gi will never be received by node v ∈ Gj, where Gi and Gj are consecutive.
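The √5 Rs bound follows directly from the block geometry: two consecutive Rs × Rs blocks share a side and form a 2Rs × Rs rectangle, so the farthest pair of points lies on its diagonal. A short check:

```latex
d_{\max} \;=\; \sqrt{(2R_s)^2 + R_s^2} \;=\; \sqrt{5}\,R_s \;\approx\; 2.24\,R_s \;>\; 2R_s = R_c .
```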
Lemma 4. Let v ∈ Gj and w ∈ Gk be two active nodes such that the groups Gj and Gk are consecutive to each other. Now, if d(v, w) > Rc, then there must exist at least one node u in Gj (or in Gk) such that u becomes active to make the path {v, u, w} a part of the tree.
Proof. From Lemma 2, the maximum distance between two nodes v ∈ Gj and w ∈ Gk is √5 Rs. As √5 Rs > Rc, v and w will not be connected. Let v and w be positioned at the end points of the diagonal δ of the rectangle R = 2Rs × Rs, considering Gj and Gk consecutive and R = Bj ∪ Bk, where Bj and Bk are the blocks for the groups Gj and Gk, respectively. Let u be any node positioned at any point within the circular area of radius Rc centered at w, with u inside R. Therefore, d(u, w) < Rc and u and w are connected. Now, to maintain the connectivity such that {v, u, w} is a path of the tree, d(u, v) < Rc must hold. In the worst case, if u is positioned on the diagonal δ, to maintain
objective of the CC is to control the flow of convergecast data to and from the buffer at each node such that no data is lost or delivered redundantly to the sink. Actions performed on receiving each such control signal are stated as follows.
Proof. Let node v ∈ Gj receive the LEAVE message from its parent node u ∈ Gi at time tm < tp. According to the STD given in Fig. 3, at state P WAIT J, node v sends a suspend signal to CC, and the CC at node v stops data forwarding from the buffer. Let the sequence number of the last message sent from the CC at node v to its parent node u be Sm. Now, node v changes state from INIT to
RESUME after sending a JOIN ACK message to the new parent node z ∈ Gi .
As the path {v, z} is established, node v sends a resume signal to CC to start
data forwarding and goes to ACT LISTEN state. Thus, all data stored in the buffer
at node v starting from sequence number Sn = Sm + 1 are forwarded to the new
parent node z. No data is lost at node v.
Again, the leaving node u forwards all the received data from its child node
v up to sequence number Sm to its parent node w before going to sleep. Node v
does not send any message with sequence number greater than Sm to node u.
Hence no message is lost at node u.
Let node w receive the LEAVE message from its child node u ∈ Gi at time tn, where tm < tn < tp. From the STD given in Fig. 3, at state C WAIT J, node w sends a close signal to its CC. On receiving the last message with sequence number Sm
from the child node u, CC sends back a reset signal to TMM of node w such
that node w can safely remove node u from its child set. At this point, node w
has forwarded the last message, with sequence number Sm , received from node
u toward the sink. Thus no message is lost at node w. Let the message with
sequence number Sm be received at the sink at time tp . Let the path {v, z, w}
establishment time be Δ. As node v has forwarded the messages starting from sequence number Sn to node z, and z will eventually forward them to its parent,
node w. The messages starting from sequence number Sn will be received by the
sink node via node w at tq , where tq ≥ tp + Δ. Hence proved.
The proposed scheme assures the correct delivery of all application data with
no loss or redundancy, even during the local maintenance under normal state
transition. However, if a node crashes before forwarding all its buffered data to
its parent, the loss of those buffered data at the crashed node cannot be avoided.
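The sequence-number bookkeeping used in this argument can be sketched as a small CC buffer that reacts to suspend and resume signals; class and method names are illustrative assumptions, not the chapter's interface.

```python
class ConvergecastController:
    """Sketch of the per-node CC buffer that avoids loss or duplication during local
    tree repair. Data messages carry increasing sequence numbers; 'suspend' halts
    forwarding, and 'resume' forwards everything buffered after the last sequence
    number already delivered to the (old) parent."""

    def __init__(self):
        self.buffer = []          # list of (seq, payload) awaiting forwarding
        self.last_sent = 0        # sequence number S_m of the last message forwarded
        self.suspended = False

    def enqueue(self, seq, payload):
        self.buffer.append((seq, payload))

    def suspend(self):
        self.suspended = True

    def resume(self, send):
        """Forward all buffered messages with seq > last_sent to the new parent."""
        self.suspended = False
        for seq, payload in sorted(self.buffer):
            if seq > self.last_sent:
                send(seq, payload)
                self.last_sent = seq
        self.buffer.clear()

if __name__ == "__main__":
    cc = ConvergecastController()
    cc.last_sent = 3                      # S_m = 3 already delivered to the old parent
    for s in range(1, 7):
        cc.enqueue(s, f"reading-{s}")
    cc.suspend()                          # old parent announced LEAVE
    cc.resume(lambda s, p: print("forward", s, p))   # new parent joined: sends 4, 5, 6
```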
4 Simulation Results
The proposed scheme has been simulated with the message-passing interface of
NS2 [2]. The simulation scenario has been designed by considering a strip-like
topology with 25 nodes including the sink. For every sensor, Rc has been considered to be 250 units and Rs half of that. There exist five blocks (corresponding to groups G1 to G5), each with the dimension Rs × Rs. The sensors are distributed
according to the proposed scheme maintaining the redundancy. The MAC layer
protocol is considered to be S-MAC with a 1 Mbps data rate. The transport layer protocol is UDP, and the application data traffic is assumed to be CBR with a data generation rate of 12 Kbps. The first state transition for G1 occurs 1 sec after the simulation starts. The schedule interval between two consecutive blocks is 10 ms, whereas the slot time is 30 ms for every schedule in every block. Each sensor is assumed to have 5 Joules of energy initially. Considering the micaZ sensor mote, the per-node power dissipation is assumed to be 0.053 W for receiving, 0.045 W for transmitting, and 0.0027 W for sleeping. The proposed scheme
has been compared with the diffusion-based convergecast and the chain-based
convergecast under the same topology. For diffusion-based convergecast, all 25 nodes are active all the time. For chain-based convergecast, five nodes, one from each block, participate in chain construction and are active all the time.
For the proposed scheme, only the set of dedicated nodes participates in data forwarding periodically. The results presented in Fig. 6 show that the proposed
scheme outperforms the other two schemes. The slope of the plotted line denoting the average residual energy for the proposed scheme is almost half that of the diffusion- and chain-based solutions. This implies that the proposed scheme would help in increasing the network lifetime. Fig. 7 shows the average energy dissipation with respect to the number of groups. The contention among nodes increases as the
number of groups increases. As a result, the average energy dissipation increases
exponentially for diffusion and linearly for chain-based forwarding. However, for
the proposed scheme, the average energy dissipation is almost constant. The periodic sleep-wakeup schedule and redundancy-based failure handling reduce the contention among neighboring nodes.
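As a back-of-the-envelope check of the power figures quoted above, the sketch below estimates how long a node's 5 J budget lasts for an always-on node versus a duty-cycled one. The duty-cycle fractions are invented for illustration and are not values from the simulation.

```python
# Power figures quoted above for a micaZ-class mote (W); 5 J initial energy per node.
P_RX, P_TX, P_SLEEP = 0.053, 0.045, 0.0027
E_INIT = 5.0  # Joule

def drain_per_second(awake_fraction, tx_fraction):
    """Average power draw for a node that is awake (receiving) for `awake_fraction`
    of the time, transmitting for `tx_fraction`, and sleeping otherwise."""
    sleep_fraction = 1.0 - awake_fraction - tx_fraction
    return awake_fraction * P_RX + tx_fraction * P_TX + sleep_fraction * P_SLEEP

if __name__ == "__main__":
    always_on = drain_per_second(awake_fraction=0.9, tx_fraction=0.1)      # diffusion-like node
    duty_cycled = drain_per_second(awake_fraction=0.18, tx_fraction=0.02)  # roughly 1-of-5 schedule
    print(f"always on : {E_INIT / always_on:6.1f} s of lifetime")
    print(f"duty cycle: {E_INIT / duty_cycled:6.1f} s of lifetime")
```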
The proposed scheme has also been simulated to observe the effect of tree
management delay on the performance of convergecast and the number of packets received at the sink. The average delay for an application message sent from a node to be received by the sink node is called the average end-to-end delay Δ, and it is plotted with respect to every group. Δ is maximum, due to collisions, when the diffusion-based convergecast is used. On the other hand, the chain-based convergecast
[Fig. 6: average residual energy (Joule) vs. simulation time (sec) for Diffusion, Chain-based Forwarding, and the Proposed Scheme. Fig. 7: average energy dissipation (Watt-hour) vs. number of virtual blocks for the three schemes.]
offers the minimum Δ. Fig. 8 shows that the Δ offered by the proposed scheme is almost similar to that of the chain-based solution, even in the presence of the tree maintenance responsibility. Thus, the proposed scheme would serve well for delay-sensitive road sensor networks. It has also been noted that the percentage of loss and redundancy for the application messages received at the sink is 0. Due to the local repairing and recovery technique, the proposed scheme is scalable too. Fig. 9 shows the number of packets received from individual groups. It can be observed from the figures that the proposed scheme outperforms diffusion- and chain-based forwarding.
Cumulative control overhead at time tn can be defined as the ratio of the total number of control messages sent to the total number of messages sent in
[Fig. 8: average end-to-end delay vs. group number for Diffusion, Chain-based Forwarding, and the Proposed Scheme. Fig. 9: number of packets received vs. group number for the three schemes.]
the network during the time interval [t0, tn], where t0 denotes the time at which the nodes started communicating. From Fig. 10, it can be observed that the cumulative control overhead for the proposed scheme is only about 2% and remains stable over time. This shows that the maintenance overhead in terms of control message communication is nominal and constant for the proposed scheme. Fig. 11 shows the probability of simultaneous activation of two nodes in a single block. It can be seen from the figure that the probability is very low. For 16 consecutive blocks, on average only 2.4% of the blocks have two active nodes simultaneously. Note that the above two metrics are not compared with diffusion- and chain-based forwarding, as those schemes do not ensure coverage and connectivity during node failure.
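Written out with C(t0, tn) denoting the number of control messages and M(t0, tn) the total number of messages sent in [t0, tn] (labels introduced here only for clarity), the definition reads:

```latex
\text{Cumulative control overhead at } t_n \;=\; \frac{C(t_0, t_n)}{M(t_0, t_n)} .
```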
[Fig. 10: cumulative control overhead of the Proposed Scheme vs. simulation time (sec). Fig. 11: probability of two nodes being active in a block vs. number of virtual blocks for the Proposed Scheme.]
References
1. Magnetic sensors, Honeywell,
http://www.magneticsensors.com/vehicle-detection-solutions.php
2. NS-2 network simulator, version 2.34, http://www.isi.edu/nsnam/ns/
3. Ceriotti, M., Corra, M., D’Orazio, L., Doriguzzi, R., Facchin, D., Guna, S., Jesi,
G., Lo Cigno, R., Mottola, L., Murphy, A., Pescalli, M., Picco, G., Pregnolato,
D., Torghele, C.: Is there light at the ends of the tunnel? wireless sensor networks
for adaptive lighting in road tunnels. In: Proceedings of the 10th International
Conference on Information Processing in Sensor Networks, pp. 187–198 (2011)
4. Chan, K.Y., Dillon, T.: On-road sensor configuration design for traffic flow pre-
diction using fuzzy neural networks and Taguchi method. IEEE Transactions on
Instrumentation and Measurement 62(1), 50–59 (2013)
5. Chen, L.W., Peng, Y.H., Tseng, Y.C.: An infrastructure-less framework for pre-
venting rear-end collisions by vehicular sensor networks. IEEE Communications
Letters 15(3), 358–360 (2011)
6. Cheung, S.Y., Varaiya, P.: Traffic surveillance by wireless sensor networks: Final
report. Tech. Rep. UCB-ITS-PRR-2007-4, University of California, Berkeley (2007)
7. Coleri, S., Cheung, S.Y., Varaiya, P.: Sensor networks for monitoring traffic. In:
Proceedings of the Allerton Conference on Communication, Control and Comput-
ing (2004)
8. Ergen, S.C., Varaiya, P.: Pedamacs: Power efficient and delay aware medium access
protocol for sensor networks. IEEE Transactions on Mobile Computing 5, 920–930
(2006)
9. Flathagen, J., Drugan, O., Engelstad, P., Kure, O.: Increasing the lifetime of road-
side sensor networks using edge-betweenness clustering. In: Proc. of IEEE ICC, pp.
1–6 (2011)
10. Gallego, N., Mocholi, A., Menendez, M., Barrales, R.: Traffic monitoring: Improv-
ing road safety using a laser scanner sensor. In: Proceedings of the Electronics,
Robotics and Automotive Mechanics Conference, pp. 281–286 (2009)
11. Haddadou, N., Rachedi, A., Ghamri-Doudane, Y.: Advanced diffusion of classified
data in vehicular sensor networks. In: Proceedings of the 7th International Wireless
Communications and Mobile Computing Conference, pp. 777–782 (2011)
12. Iyengar, R., Kar, K., Banerjee, S.: Low-coordination topologies for redundancy in
sensor networks. In: Proc. of the 6th ACM MobiHoc, pp. 332–342 (2005)
13. Jeong, J., Guo, S., He, T., Du, D.: APL: Autonomous passive localization for wire-
less sensors deployed in road networks. In: Proceedings of the 27th IEEE Conference
on Computer Communications, pp. 583–591 (2008)
14. Jeong, J., Guo, S., He, T., Du, D.: Autonomous passive localization algorithm for
road sensor networks. IEEE Transactions on Computers 60(11), 1622–1637 (2011)
15. Knaian, A.N.: A wireless sensor network for smart roadbeds and intelligent trans-
portation systems. Tech. rep., Massachusetts Institute of Technology, USA (2000)
16. Kong, F., Tan, J.: A collaboration-based hybrid vehicular sensor network architec-
ture. In: Proceedings of the International Conference on Information and Automa-
tion, pp. 584–589 (2008)
17. Koyama, A., Honma, Y., Arai, J., Barolli, L.: An enhanced zone-based routing
protocol for mobile ad-hoc networks based on route reliability. In: Proceedings of
the 20th IEEE AINA, pp. 61–68 (2006)
18. Kumar, S., Chauhan, S.: A survey on scheduling algorithms for wireless sensor
networks. International Journal of Computer Applications 20(5), 7–13 (2011)
19. Gu Lee, B., Han Kim, J.: Algorithm for finding the moving direction of a vehicle
using magnetic sensor. In: Proceedings of the IEEE Symposium on Computational
Intelligence in Control and Automation, pp. 74–79 (2011)
20. Li, W., Chan, E., Hamdi, M., Lu, S., Chen, D.: Communication cost minimization
in wireless sensor and actor networks for road surveillance. IEEE Transactions on
Vehicular Technology 60(2), 618–631 (2011)
21. Lim, K.W., Jung, W.S., Ko, Y.B.: Multi-hop data dissemination with replicas
in vehicular sensor networks. In: proceedings of the IEEE Vehicular Technology
Conference, pp. 3062–3066 (2008)
22. Manolopoulos, Y., Katsaros, D., Papadimitriou, A.: Topology control algorithms
for wireless sensor networks: a critical survey. In: Proc. of the 11th CompSysTech,
pp. 1–10 (2010)
23. Momen, A., Azmi, P., Bazazan, F., Hassani, A.: Optimised random structure ve-
hicular sensor network. IET Intelligent Transport Systems 5(1), 90–99 (2011)
24. Ng, E.H., Tan, S.L., Guzman, J.: Road traffic monitoring using a wireless vehicle
sensor network. In: Proceedings on IEEE International Symposium on Intelligent
Signal Processing and Communications Systems, pp. 1–4 (2009)
25. Onodera, K., Miyazaki, T.: An autonomous algorithm for construction of energy-
conscious communication tree in wireless sensor networks. In: Proc. of the 22nd
AINAW, pp. 898–903 (2008)
26. Ostovari, P., Dehghan, M., Wu, J.: Connected point coverage in wireless sensor
networks using robust spanning trees. In: Proc. of the 31st ICDCSW, pp. 287–293
(2011)
27. Palubinskas, G., Kurz, F., Reinartz, P.: Detection of traffic congestion in optical
remote sensing imagery. In: Proceedings of the IEEE International Geoscience and
Remote Sensing Symposium, vol. 2, pp. 426–429 (2008)
28. Pantavungkour, S., Shibasaki, R.: Three line scanner, modern airborne sensor and
algorithm of vehicle detection along mega-city street. In: Proceedings of the 2nd
GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban
Areas, pp. 263–267 (2003)
29. Pascale, A., Nicoli, M., Deflorio, F., Dalla Chiara, B., Spagnolini, U.: Wireless
sensor networks for traffic management and road safety. IET Intelligent Transport
Systems 6(1), 67–77 (2012)
30. Pelczar, C., Sung, K., Kim, J., Jang, B.: Vehicle speed measurement using wireless
sensor nodes. In: Proceedings of the IEEE International Conference on Vehicular
Electronics and Safety, pp. 195–198 (2008)
31. Rothery, S., Hu, W., Corke, P.: An empirical study of data collection protocols
for wireless sensor networks. In: Proceedings of the ACM REALWSN, pp. 16–20
(2008)
32. Santi, P.: Topology control in wireless ad hoc and sensor networks. ACM Comput.
Surv. 37(2), 164–194 (2005)
33. Sung, K., Yoo, J.J., Kim, D.: Collision warning system on a curved road using
wireless sensor networks. In: Proceedings of the IEEE 66th Vehicular Technology
Conference, pp. 1942–1946 (2007)
34. Tang, J., Zhu, B., Zhang, L., Hincapie, R.: Wakeup scheduling in roadside direc-
tional sensor networks. In: Proc. of IEEE GLOBECOM, pp. 1–6 (2011)
35. Tubaishat, M., Qi, Q., Shang, Y., Shi, H.: Wireless sensor-based traffic light con-
trol. In: Proceedings of the 5th IEEE Consumer Communications and Networking
Conference, pp. 702–706 (2008)
36. Uchiyama, T., Mohri, K., Itho, H., Nakashima, K., Ohuchi, J., Sudo, Y.: Car traffic
monitoring system using MI sensor built-in disk set on the road. IEEE Transactions
on Magnetics 36(5), 3670–3672 (2000)
37. Wang, Q., Hempstead, M., Yang, W.: A realistic power consumption model for
wireless sensor network devices. In: Proc. of 3rd Annual IEEE SECON, vol. 1,
pp. 286–295 (2006)
38. Wang, X., Xing, G., Zhang, Y., Lu, C., Pless, R., Gill, C.: Integrated coverage and
connectivity configuration in wireless sensor networks. In: Proc. of the 1st SenSys,
pp. 28–39 (2003)
39. Wightman, P., Labrador, M.: A3Cov: A new topology construction protocol for
connected area coverage in WSN. In: Proc. of IEEE WCNC, pp. 522–527 (2011)
40. Xie, W., Zhang, X., Chen, H.: Wireless sensor network topology used for road
traffic. In: Proceedings of the IET Conference on Wireless, Mobile and Sensor
Networks, pp. 285–288 (2007)
41. Yu, X., Liu, Y., Zhu, Y., Feng, W., Zhang, L., Rashvand, H., Li, V.O.K.: Efficient
sampling and compressive sensing for urban monitoring vehicular sensor networks.
IET Wireless Sensor Systems 2(3), 214–221 (2012)
42. Zeng, Y., Xiang, K., Li, D.: Applying behavior recognition in road detection us-
ing vehicle sensor networks. In: Proceedings of the International Conference on
Computing, Networking and Communications, pp. 751–755 (2012)
43. Zhang, L., Wang, R., Cui, L.: Real-time traffic monitoring with magnetic sensor
networks. Journal of Information Science and Engineering 27, 1473–1486 (2011)
44. Zhou, G., Huang, C., Yan, T., He, T., Stankovic, J.A., Abdelzaher, T.F.: MMSN:
Multi-frequency media access control for wireless sensor networks. In: Proc. of the
25th IEEE INFOCOM, pp. 1–13 (2006)
45. Zhu, C., Zheng, C., Shu, L., Han, G.: A survey on coverage and connectivity issues
in wireless sensor networks. J. Netw. Comput. Appl. 35, 619–632 (2012)
Data Aggregation and Forwarding Route Control
for Efficient Data Gathering
in Dense Mobile Wireless Sensor Networks
1 Introduction
Recently, participatory sensing, where sensor data are gathered from portable sensor
devices such as smart phones, has attracted much attention [2, 9, 14, 15]. In participa-
tory sensing, it is common that sensor data are uploaded to the Internet through some infrastructure such as 3G and LTE networks. It is undesirable for participatory sensing
to generate a large amount of traffic that may exhaust the limited channel bandwidth
shared by a wide variety of applications. For this reason, MWSNs (Mobile Wireless Sen-
sor Networks), which are constructed by mobile sensor nodes held by ordinary people
without any infrastructure [17], have recently attracted much attention as a way of real-
izing participatory sensing. In a MWSN, sensor readings are gathered to a sink through
multi-hop wireless communication [5, 20].
In MWSNs constructed by mobile sensor nodes held by ordinary people, the num-
ber of sensor nodes is generally very large. For example, there are generally more than
10,000 people in the daytime in some stations in major cities such as Tokyo. When
assuming that all of them hold sensor devices with Wi-Fi interface whose communi-
cation range is about 100[m], a sensor node can directly communicate with about 100
sensor nodes. In such an environment, an arbitrary geographical point in the sensing
area can be sensed by many sensor nodes when an application monitors the geograph-
ical distribution of temperature in the station. We call such networks dense MWSNs.
On the other hand, most MWSN applications require a certain geographical granu-
larity of sensing in a specific area (e.g., sensor data of every 100[m]×100[m] square in a
1,000[m]×1,000[m] flatland) at every sensing time. In such a situation, if a sink gathers
sensor data from all sensor nodes in the entire area, the network bandwidth and the bat-
tery of sensor nodes are unnecessarily wasted. Thus, it is desirable to efficiently gather
sensor data from the minimum number of sensor nodes which are necessary to guaran-
tee the geographical granularity required by the application. For this aim, as a data gath-
ering method which efficiently gathers sensor data in dense MWSNs, DGUMA (Data
Gathering method Using Mobile Agents) has been proposed [6]. DGUMA uses mobile
agent, which is an application software that autonomously operates on a sensor node
and moves among sensor nodes. Mobile agents are generated by the sink and allocated
on sensor nodes located near the sensing points, which are determined from the re-
quirement on geographical granularity. For gathering sensor data, DGUMA constructs
a static tree-shaped logical network whose nodes are mobile agents (sensing points).
Every time the sensing time comes, the sensor nodes on which agents reside perform sensing and send the sensor data to the sink according to the tree-shaped network. By
doing so, DGUMA can reduce the traffic for gathering sensor data since mobile agents
control transmissions of sensor data.
Here, sensor readings on environmental information such as sound and temperature
tend to have the same or similar values at adjacent sensing points. However, DGUMA
does not consider such a characteristic of sensor readings, and gathers sensor readings
acquired by all agents even when there are ones with the same value. If we can aggregate
such sensor readings with the same value, further reduction of traffic for gathering sen-
sor data can be expected [1, 10, 11, 13]. In addition, the geographical distribution of data values generally changes over time. In such a case, it is effective to dynami-
cally construct communication routes (tree-shaped network in DGUMA) so that many
sensor readings with the same value are aggregated. However, since DGUMA con-
structs a static tree-shaped network for gathering sensor data, effective data aggregation
cannot be achieved in some geographical distribution of data values.
In this chapter, we present an extended method of DGUMA, named DGUMA/DA
(DGUMA with Data Aggregation), that considers the geographical distribution of data
values in dense MWSNs. In DGUMA/DA, each mobile agent aggregates multiple read-
ings with the same value in order to reduce the traffic for gathering sensor data. More-
over, at every sensing time, each mobile agent searches for adjacent agents which have the same reading during data gathering by changing its direction of forwarding sensor data from lengthwise (up or down) to crosswise (right or left) or vice versa.
When an adjacent agent which has the same reading is found, the mobile agent fixes
its direction of forwarding sensor data so that the adjacent agent can continuously ag-
gregate the readings. In addition, when the reading of the adjacent agent on the fixed
direction becomes different, the mobile agent releases its direction of forwarding sensor
data. This mechanism increases the chances of aggregating multiple readings with the
same value. Using these two mechanisms (i.e., data aggregation and forwarding route
control), DGUMA/DA further reduces the traffic for gathering sensor data to the sink.
Furthermore, we confirm the effectiveness of DGUMA/DA by comparing it with DGUMA through a theoretical analysis and some simulation experiments.
[Fig. 1: the sensing area divided into a lattice of kM × kN sensing points.]
2 Assumptions
We assume dense MWSNs constructed by mobile sensor nodes which are held by ordi-
nary people and equipped with a radio communication facility. These sensor nodes pe-
riodically observe (sense) a physical phenomenon (e.g., sound, temperature, and light),
and communicate with each other using multi-hop radio communication. According to
the requirement from an application, the sink periodically monitors the sensing area
while guaranteeing the geographical granularity of sensing. More specifically, the sink
gathers sensor readings from sensor nodes located near the sensing points which are de-
termined from the requirement of the geographical granularity at the timing of sensing.
We call the interval of data gathering the sensing cycle.
2.2 Geo-Routing
Sensor nodes adopt a geo-routing protocol based on that proposed in [7] to transport a
message to the destination specified as a position (not a node). In this protocol, nodes
perform a transmission process using the information on positions of the transmitter
and the destination specified in the packet header. Specifically, the transmitter records
the coordinates of the destination and itself into the packet header of the message, and
broadcasts the message to its neighboring nodes. Each node which receives this message judges whether it is located within the forwarding area. The forwarding area is deter-
mined based on the positions of the transmitter, the destination and the communication
communication ID : node
range (r)
: message
3
8
7
4
transmitter
1 2 9
the destination the destination
forwarding area 5 6 10
range, so that any node in the forwarding area is closer to the destination than the transmitter and can communicate directly with all nodes in the area (see Fig. 3). A node within the forwarding area sets a waiting time, and forwards the message after the
waiting time elapses. The waiting time is set shorter as the distance between the node
and the destination gets shorter. Each node within the forwarding area cancels its trans-
mission process when it detects the message forwarded by another node. For example,
in Fig. 4, nodes {2, . . ., 6} receive a message from node 1, and set the waiting time.
Since node 2 is the closest node to the destination in the forwarding area of node 1,
it first sends the message. On receiving this message, all nodes in the forwarding area
of node 1 (i.e., nodes {3, . . . , 6}) cancel their transmission process. By repeating this
procedure, the message is forwarded to nodes which are closer to the destination. If the
transmitter node exists within half of the communication range (r/2) from the destina-
tion, each node which received the message sends an ACK to the transmitter node after
the waiting time elapses instead of forwarding the message. As a result, the node nearest to the destination (which sends the ACK first) can find that it is the nearest one, because all nodes within r/2 of the destination can detect that ACK and cancel sending their own. If the transmitter node does not receive an ACK from any node, it can conclude that it is itself the nearest node.
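A minimal sketch of the forwarding-area test and the waiting-time rule described above. The disc-of-radius-r/2 construction used here is one possible region satisfying the two stated properties (mutual reachability and being closer to the destination); it is an assumption, not necessarily the exact geometry defined in [7], and the numeric values are illustrative.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def in_forwarding_area(node, tx, dest, r):
    """Membership test: a disc of radius r/2 centred r/2 along the transmitter->destination
    direction. Every pair of points in such a disc is within r of each other, and we
    additionally require the node to be strictly closer to the destination than the
    transmitter, matching the two properties stated in the text."""
    d_td = dist(tx, dest)
    if d_td == 0:
        return False
    ux, uy = (dest[0] - tx[0]) / d_td, (dest[1] - tx[1]) / d_td
    centre = (tx[0] + ux * r / 2, tx[1] + uy * r / 2)
    return dist(node, centre) <= r / 2 and dist(node, dest) < d_td

def waiting_time(node, tx, dest, t_max=0.010):
    """Shorter wait for nodes closer to the destination, so the best relay fires first."""
    return t_max * dist(node, dest) / max(dist(tx, dest), 1e-9)

if __name__ == "__main__":
    tx, dest, r = (0.0, 0.0), (300.0, 0.0), 100.0
    for node in [(40.0, 20.0), (80.0, 10.0), (120.0, 0.0)]:
        if in_forwarding_area(node, tx, dest, r):
            print(node, "waits", round(waiting_time(node, tx, dest), 4), "s")
        else:
            print(node, "is outside the forwarding area")
```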
3 Related Work
In [16], the authors proposed a traffic reduction method utilizing temporal correlation
of data in wireless sensor networks. In this method, each sensor node stores the sensor reading from the last sensing time and compares the new sensor reading with the previous one at every sensing time. If the difference between the sensor readings is smaller than
the threshold predetermined by the sink, the node does not send its sensor reading to
the sink at the sensing time. Thus, the traffic for data gathering is reduced by utilizing
temporal correlation of data. This differs from our method, which reduces traffic by using the geographical distribution of data values. However, it is similar to ours in that a node does not send its sensor reading if the reading coincides with another one.
In [19, 21–24], the authors proposed data gathering methods which construct an over-
lay network to effectively aggregate sensor data using the spatial correlation of data in
wireless sensor networks. In [19], sensor data are gathered using a routing tree, which
is a combination of the minimum spanning tree and the shortest path tree. Raw data
is sent according to the former tree and aggregated data is gathered according to the
latter tree. In [21], an energy efficient routing tree is constructed using game theory that
considers transmission energy, the effect of wireless interference, and the opportunity
for aggregating correlated sensor data. In [22], clusters are constructed based on the
compression ratio, which indicates the ratio that a node compresses the data received
from its neighbors using spatial correlation. In [23], a routing tree is dynamically con-
structed. In this routing tree, each node sends its holding packet to its neighbors which
hold more packets and exist closer to the sink. This is because a sensor node can have
more opportunities to aggregate sensor data when holding more packets. In [24], as-
suming a circular sensing area, a data gathering method using mobile sensor nodes is
proposed. This method divides the area into polar grids. In each grid, a node is chosen to aggregate the data observed in the grid and to join a routing tree. These meth-
ods are similar to ours in the way that sensor data are gathered using spatial correlation.
However, these methods do not consider the change in the geographical distribution of
data values.
In [18], assuming that sensor nodes which have spatially correlated data tend to exist close to each other, and continuous queries which specify a condition on a sensor
reading to be gathered, a dynamic route construction method for the queries is pro-
posed. This method detects sensor nodes which satisfy the condition, and constructs
some clusters consisting of sensor nodes which exist close to each other. For gathering
sensor data, a minimum spanning tree is constructed by those clusters. When sensor
nodes which satisfy the condition change because of a change in the distribution of data values, the sink computes a new minimum spanning tree and informs all sensor nodes of it. This method is similar to ours in that the routing tree is constructed dynamically according to the distribution of data values. However, this method is conducted in
a centralized manner.
In DGUMA, if a sensor node on which a mobile agent operates moves away from the
sensing point, the mobile agent moves from the current sensor node to another sensor
node which locates closest to the corresponding sensing point. Specifically, a mobile
agent starts moving when the distance between the sensing point and itself becomes
longer than the threshold. This threshold is a system parameter which is set as a constant
smaller than r/2 and the radius of the valid area for sensing, which can guarantee that a
sensor node on which a mobile agent operates can communicate with all sensor nodes
located near (within r/2 of) the sensing point and can sense the data at the sensing
point. In order to move to the sensor node located closest to the sensing point, the mobile
agent issues a message containing the agent data, and broadcasts it to neighboring nodes
within r/2 from the sensing point. As in Section 2.2, the sensor node located closest to
the sensing point first sends an ACK and boots a mobile agent. Other sensor nodes cancel sending the same ACK because they can detect this ACK. Also, the original mobile agent stops its operation when detecting the ACK.
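The migration trigger and the closest-node handover can be sketched as follows; coordinates, node IDs, and the threshold value are illustrative assumptions, not part of DGUMA's specification.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def should_migrate(agent_pos, sensing_point, threshold):
    """The agent starts migrating once it drifts farther than the threshold from its
    sensing point; the threshold is kept below r/2 and the valid sensing radius."""
    return dist(agent_pos, sensing_point) > threshold

def pick_new_host(candidates, sensing_point, r):
    """Among candidate nodes within r/2 of the sensing point, the closest one answers
    first (its ACK suppresses the others) and boots the migrating agent."""
    eligible = [(nid, pos) for nid, pos in candidates if dist(pos, sensing_point) <= r / 2]
    return min(eligible, key=lambda c: dist(c[1], sensing_point))[0] if eligible else None

if __name__ == "__main__":
    sp, r = (100.0, 100.0), 100.0
    print(should_migrate((140.0, 120.0), sp, threshold=40.0))                    # True: drifted away
    print(pick_new_host([("a", (130.0, 110.0)), ("b", (105.0, 95.0))], sp, r))   # -> 'b'
```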
[Figure: an example 2×6 grid of sensing points (gridIDs 1–12) annotated with sensor readings and forwarding directions.]
5.1 Outline
Similar to DGUMA, DGUMA/DA deploys the agent data to k2 · M · N sensing points.
Each mobile agent sets up a timer, which will be described in Section 5.2. When the
timer expires, the mobile agent sends sensor data to the sink. In this process, each
mobile agent aggregates multiple readings with the same value as described in Section
5.3. In addition, DGUMA/DA dynamically constructs the forwarding tree according to
the procedure described in Section 5.4.
When the sink receives the sensor data, it restores aggregated sensor readings, ac-
cording to the procedure described in Section 5.5.
[Figure: a 400[m]×400[m] sensing area divided into 100[m]×100[m] grids, with the farthest sensing point from the sink marked.]
[Fig. 8: the example grid (gridIDs 1–12) with sensor readings and forwarding directions, showing the buffer contents and exchanged sensor data at the aggregating agents.]
(1) When a mobile agent has received multiple packets, it first copies the sensor data included in the first received packet into its buffer.
(2) It checks whether there is a sensor reading in another received packet, which is
identical with that in the buffer. If so, the mobile agent adds only the gridIDs of the
sensor reading after the gridIDs of the corresponding reading in the buffer. After
that, the mobile agent adds other sensor data, whose sensor reading is not identical
with any of those in the buffer, to the end of the buffer.
(3) When its timer expires, the mobile agent adds its own sensor data to the buffer in
the same way. If its sensor reading is identical with one of them in the buffer, it
adds only its gridID after the gridIDs of the corresponding reading in the buffer and
moves the corresponding sensor data to the end of the buffer. After that, the agent
creates a new packet that contains all sensor data in the buffer, and sends the packet
to its parent.
Here, when the mobile agent has received only one packet, or when there is only one sensor data entry in the buffer, it checks whether its own sensor reading and the one at the end of the buffer are identical. If so, the mobile agent adds neither its own sensor reading nor its gridID to the buffer.
According to the above procedure, each mobile agent can aggregate sensor readings
by adding only its gridID when there is a sensor reading identical with its own reading
in the received packet. In addition, when the sensor readings become identical between
adjacent sensing points on the forwarding tree, more traffic can be reduced since nothing
is added to the packet. Note that all sensor readings can be restored at the sink from the
aggregated sensor data. The details are presented in Section 5.5.
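A minimal sketch of the aggregation steps (1)–(3) above, assuming each packet is a list of [reading, [gridIDs]] entries; the representation and function name are assumptions, since the chapter does not fix a concrete packet format.

```python
def aggregate(received_packets, own_reading, own_grid_id):
    """Per-agent aggregation sketch: a packet is a list of [reading, [gridIDs]] entries,
    and the buffer keeps one entry per distinct reading (steps (1)-(3) above)."""
    buffer = []
    for packet in received_packets:                       # steps (1) and (2)
        for reading, grid_ids in packet:
            for entry in buffer:
                if entry[0] == reading:
                    entry[1].extend(grid_ids)             # merge: append gridIDs only
                    break
            else:
                buffer.append([reading, list(grid_ids)])  # new reading goes to the end

    # special case: a single packet (or single buffer entry) whose last reading equals our own
    if (len(received_packets) <= 1 or len(buffer) == 1) and buffer and buffer[-1][0] == own_reading:
        return buffer                                     # add neither reading nor gridID

    for entry in buffer:                                  # step (3): add our own sample
        if entry[0] == own_reading:
            entry[1].append(own_grid_id)
            buffer.remove(entry)
            buffer.append(entry)                          # move merged entry to the end
            return buffer
    buffer.append([own_reading, [own_grid_id]])
    return buffer

if __name__ == "__main__":
    # agent at grid 4 in the Fig. 8 example: packets from grids 3 and 5, own reading 28
    from_grid3 = [[28, [1]]]     # readings of grids 1-3 already collapsed into one entry
    from_grid5 = [[28, [6]]]
    print(aggregate([from_grid3, from_grid5], own_reading=28, own_grid_id=4))  # [[28, [1, 6]]]
```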
Fig. 8 shows an example of the above procedure. First, mobile agents at grids {1,
6, 7, 12} send a packet with its own sensor data (i.e., their own sensor readings and
gridIDs) to their parents. On receiving the packet from the mobile agent at grid 1, the
mobile agent at grid 2 copies the received sensor data to its buffer. After the expiration
of its timer, the agent at grid 2 sends the packet without adding its sensor data (its own
sensor reading and gridID) because its own sensor reading is identical with that at the
end of the buffer. Mobile agents at grids {3, 5, 8, 11} perform in the same way as that at
grid 2. On the other hand, after copying sensor data received from the agent at grid 8 to
its buffer, the mobile agent at grid 9 adds its sensor data to the end of the buffer because
there is no sensor data with the same sensor reading as its own reading (i.e., 28). Then,
the mobile agent at grid 4 receives multiple packets from grids 3 and 5, and aggregates
sensor data included in these packets according to the procedure described in step (2).
After the aggregation, the buffer contains only one sensor data whose sensor reading is
identical with its own reading (i.e., 28). Thus, the mobile agent sends the packet without
adding its sensor data. Also, the agent at grid 10 receives multiple packets from grids
{4, 9, 11}, aggregates them, and sends the packet with the aggregated sensor data to the
sink.
(1) At every sensing time, each mobile agent changes its forwarding direction, and
sends a packet following the procedure in Section 5.3. Note that the lengthwise tree
is used at the first sensing time.
(2) When each mobile agent receives a packet, the mobile agent checks whether the
sensor reading at the end of the received packet is identical with its own reading. If
so, and the path from the child to itself is not fixed, it sends a route fix message to
the child. On the other hand, if its own sensor reading is not identical with that of
the end of the received packet, and when the route from the child to itself is fixed,
the mobile agent sends a fixed-route release message to the child.
(3) When a mobile agent receives a route fix message from its current parent, it fixes
the route from itself to the parent. After that, it does not change its forwarding
direction at the subsequent sensing times.
(4) When a mobile agent receives a fixed-route release message, it stops fixing the
route. After that, it restarts changing its forwarding direction at the subsequent sensing times.
By doing so, DGUMA/DA can construct the forwarding tree which aggregates more
sensor readings even when the geographical distribution of data values dynamically
changes.
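Steps (1)–(4) can be sketched per agent as follows. The message names mirror the route fix and fixed-route release messages above, while the class structure and method names are illustrative assumptions.

```python
class AgentRouteControl:
    """Sketch of the forwarding-route control at one mobile agent (steps (1)-(4) above)."""

    def __init__(self):
        self.direction = "lengthwise"   # the lengthwise tree is used at the first sensing time
        self.route_fixed = False        # whether the route to the current parent is fixed
        self.fixed_children = set()     # children whose route toward us has been fixed

    def next_sensing_time(self):
        if not self.route_fixed:        # step (1): alternate direction unless fixed
            self.direction = "crosswise" if self.direction == "lengthwise" else "lengthwise"
        return self.direction

    def on_packet(self, child, last_reading, own_reading, send):
        # step (2): fix the route when the child's trailing reading matches our own
        if last_reading == own_reading and child not in self.fixed_children:
            self.fixed_children.add(child)
            send(child, "ROUTE_FIX")
        elif last_reading != own_reading and child in self.fixed_children:
            self.fixed_children.discard(child)
            send(child, "FIXED_ROUTE_RELEASE")

    def on_control(self, message):
        # steps (3) and (4): react to control messages from the current parent
        self.route_fixed = (message == "ROUTE_FIX")

if __name__ == "__main__":
    agent = AgentRouteControl()
    agent.on_control("ROUTE_FIX")
    print(agent.next_sensing_time())    # stays 'lengthwise' while the route is fixed
    agent.on_control("FIXED_ROUTE_RELEASE")
    print(agent.next_sensing_time())    # switches to 'crosswise' once released
```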
Fig. 10 shows the detailed procedure of the forwarding route control. In this figure,
grids with the same color indicate the sensor readings with the identical value. In Fig.
10(a), at the first sensing time, each mobile agent sends a packet using the lengthwise
tree according to the procedure described in Section 5.3. In this phase, the agent at
grid 6 sends the route fix message to its child, the agent at grid 1, because its sensor
reading is identical with that of grid 6. On receiving this message, the agent at grid 1
fixes the route to the agent at grid 6. The agents in the other grids perform in the same
way. As a result, the red-colored lengthwise routes as shown in Fig. 10(b) are fixed in
this phase. At the second sensing time, each mobile agent that has not received a route fix message switches its forwarding direction from lengthwise to crosswise, and sends a
packet using the forwarding tree in Fig. 10(c). In this phase, the agent at grid 7 sends the route fix message to the agent at grid 6 because the reading of grid 6 is identical with its own. The agent at grid 6 fixes the route to the agent at grid 7. After performing
this procedure at other grids, the red-colored crosswise routes as shown in Fig. 10(d)
are also fixed.
[Fig. 10: forwarding route control on a 5×5 grid (gridIDs 1–25): (a) forwarding tree at the 1st sensing time; (b) route fixing after the 1st sensing time; (c) forwarding tree at the 2nd sensing time; (d) route fixing after the 2nd sensing time.]
(4) If the sensor readings become different between grids on a fixed route, the sink
recognizes that the fixed route has been released. Thus, the sink stops fixing the
corresponding route in the tree information, and switches the forwarding direction
at the next sensing time.
According to the above procedure, the sink can recognize the topology of the for-
warding tree and restore sensor readings at all grids.
Fig. 11 shows an example of the above procedure when the sink received sensor data
in the environment shown in Fig. 8. First, the sink extracts sensor readings at grids {1,
6, 7, 9, 10, 12} from the received (aggregated) packet. Next, the sensor reading at grid
2 is restored by referring to its current child (i.e., grid 1). The sensor reading at grid
3 is also restored by referring to that at its current child (i.e., grid 2). In addition, the sink fixes the routes between these grids (i.e., from 1 to 2, and from 2 to 3), where the
restoring process is applied. By applying this procedure for other grids {4, 5, 8, 11}, the
sink can restore all the sensor readings. Note that routes on the same column or row as the grid where the sink is located (i.e., from 7 to 10, from 12 to 10, and from 4 to 10) are not fixed.
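A sketch of the sink-side restoration implied by this example, assuming the sink already knows each grid's current child from its copy of the tree information; the child mapping used in the demo is read off the Fig. 8 tree and is only illustrative.

```python
def restore_readings(received, child_of, order):
    """Sink-side restoration sketch: a grid missing from the aggregated packet took the
    same reading as its current child in the forwarding tree, so copy it upward.
    `received` maps gridID -> reading, `child_of` maps gridID -> its current child,
    and `order` lists the grids to restore from the leaves toward the sink."""
    readings = dict(received)
    for grid in order:
        if grid not in readings and child_of.get(grid) in readings:
            readings[grid] = readings[child_of[grid]]
    return readings

if __name__ == "__main__":
    # Fig. 11 example: readings for grids 1, 6, 7, 9, 10, 12 arrive; 2-5, 8, 11 are restored
    received = {1: 28, 6: 28, 7: 30, 9: 28, 10: 28, 12: 29}
    child_of = {2: 1, 3: 2, 4: 3, 5: 6, 8: 7, 11: 12}
    print(restore_readings(received, child_of, order=[2, 3, 4, 5, 8, 11]))
```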
[Fig. 11: restoration at the sink: the readings received for grids 1, 6, 7, 9, 10, and 12, and the fully restored 2×6 grid with the fixed routes marked. Fig. 12: (a) lengthwise and (b) crosswise geographical distributions of readings (e.g., 25℃, 26℃, 29℃).]
6 Discussion
DGUMA/DA can reduce traffic for data gathering by data aggregation and forwarding
route control. On the other hand, the forwarding route control generates overhead, since it needs to send route fix and fixed-route release messages. In this section, we
discuss the performance gain (traffic reduction) and the overhead by the forwarding
route control. Here, it is difficult to cover all situations of distribution of data values.
In order to simplify the discussion, we assume a situation in which the distribution
changes from Fig. 12(a) (lengthwise distribution) to Fig. 12(b) (crosswise distribution).
In the lengthwise distribution, sensor readings can be aggregated efficiently without
forwarding route control (only using the lengthwise tree). On the other hand, in the
crosswise distribution, all routes between grids need to change (the forwarding tree
needs to change to the crosswise tree) in order to efficiently aggregate sensor readings.
In other words, the performance gain and the overhead generated by the forwarding
route control become the largest in this situation.
For the discussion, we assume a D[m]×D[m] flatland as the sensing area. The sink divides the area into G lattice-shaped grids whose size is D/√G [m] × D/√G [m]. In
addition, variables in Table 2 are used in this section. h in Table 2 is expressed by the
following equation:
h = 1 if D/(√G·l) ≤ 1, and h = D/(√G·l) otherwise. (2)
We assume that the mobile agent located in the grid where the sink exists can communicate directly with the sink. The sink is located at the grid of the nth row and the mth column.
Assuming the above situation, we calculate the following theoretical values:
– The overhead which is generated by changing the forwarding tree from Fig. 13(b)
to Fig. 13(c).
– The traffic which is generated by data gathering in Fig. 13(a).
– The traffic which is generated by data gathering in Fig. 13(c).
– The traffic which is generated by data gathering in Fig. 13(b).
At the next sensing time, the forwarding tree becomes the crosswise tree. When gathering data with the crosswise tree, all routes are fixed except for those on the nth row or the mth column. Thus, the number of fixed routes is (√G − 1)^2.

Fig. 13. (a) Lengthwise tree and lengthwise distribution; (b) lengthwise tree and crosswise distribution; (c) crosswise tree and crosswise distribution; (d) lengthwise tree and crosswise distribution.

Fig. 15. An example of data gathering using the lengthwise tree in the lengthwise distribution.

Considering that the size
of a route fix message is also equal to sheader [B], and that these messages are sent using the geo-routing protocol, the total traffic for fixing routes, Sfix, is expressed by the following equation:
Sfix = (√G − 1)^2 (sheader · h + sA) [B]. (4)
As a result, the total overhead generated by the forwarding route control, Soverhead, is derived by the following equation:
Soverhead = Srelease + Sfix = 2(√G − 1)^2 (sheader · h + sA) [B]. (5)
6.2 Traffic for Data Gathering Using Lengthwise Tree in the Lengthwise
Distribution
In this case, the data gathering tree can be treated as optimized. In order to derive the theoretical value of the traffic for data gathering, Sopt, we separately derive the traffic in the following steps (see Fig. 15):
1. Lengthwise packet transmission (at every column).
2. Crosswise packet transmission (at the nth row).
3. Packet transmission from the mobile agent to the sink at the grid where the sink is located.
forwarding tree. The size of this packet is (sheader + sdata + sID)[B]. So, the traffic for delivering this packet to the parent becomes (h · (sheader + sdata + sID) + sA)[B].
The parent sends the packet to its parent without adding its sensor reading or gridID, because its own sensor reading is identical with that included in the received packet. Thus, the traffic for delivering this packet to the next parent is the same, that is, (h · (sheader + sdata + sID) + sA)[B].
This packet is delivered in the same way until a mobile agent at a grid on the nth row receives it. Since the number of mobile agents which send the packet in a column is (√G − 1), the total traffic generated at this column, Sopt,col, is expressed by the following equation:
Sopt,col = ((√G − 1) {(sheader + sdata + sID)h + sA}) [B].
Considering that there are √G columns in the sensing area, the total traffic generated by mobile agents for this phase, Sopt,step1, is expressed by the following equation:
Sopt,step1 = √G · Sopt,col = (√G(√G − 1) {(sheader + sdata + sID)h + sA}) [B].
STEP2. Crosswise Packet Transmission (At nth Row): A mobile agent at the left-end or right-end grid of the nth row receives two packets from its children (at the upper and lower grids). The agent sends the packet to its parent without adding its sensor reading or gridID, because its own sensor reading is identical with that included in the packet. Thus, the size of the packet transmitted by the agent becomes (sheader + sdata + 2sID)[B], and the traffic for delivering this packet to its parent becomes (h · (sheader + sdata + 2sID) + sA)[B].
On the other hand, mobile agents except for those at the left- and right-end grids re-
ceive multiple packets with different sensor readings, and aggregate them. Let us focus
on the parent of the agent at the left end grid (shown in Fig. 16). This agent receives
three packets from its children at the upper, the lower, and the left grids. Here, since
sensor readings in the packets received from upper and lower grids are identical, the
size of sensor data in the buffer is (sdata + 2sID )[B] after aggregating these packets. On
the other hand, since the sensor reading in the packet from the left grid is different from
that in the buffer, sensor data whose size is (sdata + 2sID )[B] is added to the buffer. In
addition, the mobile agent adds only its own gridID after the gridIDs with the sensor
reading which is identical with its own reading. As a result, the agent creates a packet whose size is (sheader + 2(sdata + 2sID) + sID)[B]. Thus, the size of a newly created packet increases by ((sdata + 2sID) + sID)[B] at the agent at each grid on the nth row. Therefore, the size of the packet transmitted by the agent at the kth grid from the left end becomes (sheader + k(sdata + 2sID) + (k − 1)sID)[B], and the traffic for delivering this packet to its parent becomes (h · {sheader + k(sdata + 2sID) + (k − 1)sID} + sA)[B].
There are (m − 1) grids from the left-end grid to the mth grid. Thus, the total traffic generated at these grids, Sopt,step2(left), is expressed by the following equation:
Sopt,step2(left) = (h · {(m − 1)sheader + Σ_{k=1}^{m−1} k(sdata + 2sID) + Σ_{k=1}^{m−1} (k − 1)sID} + (m − 1)sA) [B].
Fig. 16. An example of aggregation by the parent of the agent at the left-end grid in the lengthwise
distribution (G=100, n=5, m=5)
In the same way, the total traffic generated at the (√G − m) grids between the right end and the mth grids, Sopt,step2(right), is expressed by the following equation:
Sopt,step2(right) = (h · {(√G − m)sheader + Σ_{k=1}^{√G−m} k(sdata + 2sID) + Σ_{k=1}^{√G−m} (k − 1)sID} + (√G − m)sA) [B].
STEP3. Packet Transmission from the Mobile Agent to the Sink at the Grid Where the Sink Is Located: As shown in Fig. 17, the mobile agent at the grid where the sink is located records √G different readings, 2√G gridIDs of the top and bottom grids in the sensing area, and (√G − 2) gridIDs of the grids in the nth row except those of the left- and right-end grids, in its buffer. Thus, the size of the packet transmitted by this agent becomes (sheader + √G · sdata + (3√G − 2)sID)[B]. Since we assume that this agent can directly communicate with the sink, the traffic generated at this grid, Sopt,step3, is expressed by the following equation:
Sopt,step3 = (sheader + √G · sdata + (3√G − 2)sID + sA) [B].
Fig. 17. An example of aggregation at the grid where the sink locates in the lengthwise distribu-
tion (G=100, n=5, m=5)
6.3 Traffic for Data Gathering Using Crosswise Tree in the Crosswise
Distribution
In this case, the relation between the forwarding tree and the distribution of data values
is the same as that in Section 6.2. Thus, the total traffic becomes Sopt, which is expressed by Eq. (6).
6.4 Traffic for Data Gathering Using Lengthwise Tree in the Crosswise Distribution
In the same way as described in Section 6.2, in order to derive the theoretical value of the traffic for data gathering in this situation, Sno opt, we separately derive the traffic in the following steps (see Fig. 18):
Fig. 18. An example of data gathering using lengthwise tree in the crosswise distribution
(n − 1)sA)[B]. In the same way, the total traffic generated at the (√G − n) grids between the bottom edge and the nth row becomes (h · {(√G − n)sheader + Σ_{k=1}^{√G−n} k(sdata + sID)} + (√G − n)sA)[B]. Therefore, the total traffic generated at a column, Sno opt,col, is expressed by the following equation:
Sno opt,col = (h · {(√G − 1)sheader + (Σ_{k=1}^{√G−n} k + Σ_{k=1}^{n−1} k)(sdata + sID)} + (√G − 1)sA) [B].
Considering that there are √G columns in the sensing area, the total traffic generated by mobile agents in this phase, Sno opt,step1, is expressed by the following equation:
Sno opt,step1 = √G · Sno opt,col = (√G · h · {(√G − 1)sheader + (Σ_{k=1}^{√G−n} k + Σ_{k=1}^{n−1} k)(sdata + sID)} + √G(√G − 1)sA) [B].
At the nth row, every agent receives a packet from the upper grid (whose size is (sheader + (n − 1)(sdata + sID))[B]) and from the lower grid (whose size is (sheader + (√G − n)(sdata + sID))[B]). Thus, each agent stores sensor data whose size is ((√G − 1)(sdata + sID))[B] in its buffer.
STEP2. Crosswise Packet Transmission (At nth Row): A mobile agent at the left-end or right-end grid of the nth row first adds its own sensor reading and gridID to its buffer, because its own reading is identical to none of the sensor readings in the buffer. Thus, the size of the packet transmitted by the agent becomes (sheader + √G(sdata + sID))[B], and the traffic for delivering the packet to its parent becomes (h · {sheader + √G(sdata + sID)} + sA)[B].

Fig. 19. An example of aggregation by the parent of the agent at the left-end grid in the crosswise distribution (G=100, n=5, m=5)
Every other agent on the nth row also holds sensor data whose size is ((√G − 1)(sdata + sID))[B] in its buffer. On receiving a packet from the child at the nth row (from the left or right grid), an agent first merges the sensor data included in the packet with those in its buffer. Let us focus on the parent of the agent at the left-end grid (Fig. 19). First, every sensor reading included in the packet from the left-end grid (i.e., {30, . . ., 28, 24, . . ., 26}, (b) in Fig. 19) except for the last one (i.e., 27) is already included in the buffer ((a) in Fig. 19). Thus, according to step (2) in Section 5.3, only the gridIDs in the packet are added to the buffer ((c) in Fig. 19), and the size of sensor data in the buffer becomes (√G · sdata + (2√G − 1)sID)[B]. Second, according to step (3) in Section 5.3, the mobile agent adds only its own gridID after the gridIDs with the sensor reading which is identical with its own reading ((d) in Fig. 19). As a result, the agent creates a packet whose size is (sheader + √G · sdata + 2√G · sID)[B] ((e) in Fig. 19). Thus, the size of a newly created packet increases by √G · sID [B] at the agent at each grid on the nth row. Therefore, the size of the packet transmitted by the agent at the kth grid from the left end becomes (sheader + √G · sdata + k√G · sID)[B], and the traffic for delivering this packet to its parent becomes (h · (sheader + √G · sdata + k√G · sID) + sA)[B]. Since there are (m − 1) grids from the left-end grid to the mth grid, the total traffic generated at these grids, Sno opt,step2(left), is expressed by the following equation:
Sno opt,step2(left) = (h · {(m − 1)(sheader + √G · sdata) + Σ_{k=1}^{m−1} k√G · sID} + (m − 1)sA) [B].
In the same way, the total traffic generated at the (√G − m) grids between the right end and the mth grids, Sno opt,step2(right), is expressed by the following equation:
Sno opt,step2(right) = (h · {(√G − m)(sheader + √G · sdata) + Σ_{k=1}^{√G−m} k√G · sID} + (√G − m)sA) [B].
STEP3. Packet Transmission from the Mobile Agent to the Sink at the Grid Where the Sink Is Located: The mobile agent at the grid where the sink is located records √G different readings and G gridIDs. Thus, the size of the packet transmitted by this agent becomes (sheader + √G · sdata + G · sID)[B], and the traffic generated at this grid, Sno opt,step3, is expressed by the following equation:
Sno opt,step3 = (sheader + √G · sdata + G · sID + sA) [B].
6.5 The Relation between the Performance Gain and the Overhead Generated
by the Forwarding Route Control
In the situation discussed in this section, the performance gain by the forwarding route control becomes (Sno opt − Sopt)[B], while the overhead Soverhead[B] is needed to reconstruct the forwarding tree. Assuming that sheader = 21[B], sdata = 2[B], sID = 1[B], sA = 5[B], D = 1,000[m], G = 100, l = r = 100[m], and m = n = 5, the performance gain and the overhead become 0.866[KB] and 4.212[KB], respectively. This indicates that, for a single sensing time, the overhead is larger than the gain.
However, when the distribution of data values does not change during T successive sensing times after changing from the lengthwise to the crosswise distribution, the performance gain becomes larger as T increases. We derive the minimum number of sensing times, T, for which the performance gain becomes larger than the overhead.
First, when the forwarding route control is not applied, the total traffic for T times of data gathering becomes T · Sno opt[B]. On the other hand, in DGUMA/DA, the total traffic for T times of data gathering becomes T · Sopt[B]. Thus, in order for the performance gain to be larger than the overhead, T must satisfy the following condition:
T · Sno opt − T · Sopt > Soverhead, i.e., T > Soverhead / (Sno opt − Sopt). (8)
Assuming the case described above, T must be larger than 4.863. This indicates that, in the situation discussed in this section, the performance gain exceeds the overhead generated by the forwarding route control when the distribution of data values does not change for five successive sensing times.
7 Simulation Experiments
In this section, we show the results of simulation experiments for validating the discussion in Section 6 and for evaluating the performance of DGUMA/DA. For the simulations, we used the network simulator Scenargie 1.5 (Base Simulator revision 8217, Space-Time Engineering, https://www.spacetime-eng.com/).
Fig. 20 shows the experimental results and the theoretical values. The horizontal axis of all graphs is the number of grids, G. The vertical axes indicate the simulation and theoretical values of Soverhead in Fig. 20(a), of Sopt in Fig. 20(b), and of Sno opt in Fig. 20(c), respectively.
(Fig. 20. Comparison of the experimental results and the theoretical values; the horizontal axis of each graph is the number of grids, G.)
Fig. 21. The minimum number of sensing times in order for the performance gain to be larger
than the overhead
From these results, we can see that the theoretical values show a similar tendency to the experimental results. The experimental results become larger than the theoretical values, especially when G is small. This is mainly because the average number of hops in the simulation experiments differs from h calculated by Eq. (2).
In addition, we derived the minimum number of sensing times, T, for the performance gain to be larger than the overhead, using the results in Fig. 20. Fig. 21 shows the result. The horizontal axis of this graph is G. We can see that the difference between the experimental results and the theoretical values becomes smaller as G increases. This is because the average number of hops in the simulation experiments becomes close to h as G increases.
(Figure: traffic [KB], delay [s], and delivery ratio versus the number of grids, G, for DGUMA/DA, DGUMA/DAwoRC, and DGUMA.)
using the fixed (initial) forwarding tree. The simulation time is 3,600[sec] and we eval-
uated the following three criteria:
– Traffic: The traffic is defined as the average of the summation of the size of all
packets sent by the sink and all sensor nodes between two consecutive sensing
times.
– Delay: The delay is defined as the average elapsed time from the start of each
sensing time to the time that the sink successfully receives sensor data.
– Delivery ratio: The delivery ratio is defined as the ratio of the number of sensor
readings which the sink correctly restored to that observed in all grids.
We examined the effects of G, the number of grids. Figs. 22 and 23 show the simu-
lation results. The horizontal axis of all graphs is the number of grids, G.
Lengthwise Distribution: Fig. 22(a) shows the traffic. From this result, we can see that the traffic in all methods increases as G increases. This is because the number of sensor data increases as G increases. We can also see that DGUMA/DA and DGUMA/DAwoRC can gather sensor data with less traffic than DGUMA. This is because DGUMA gathers all sensor data without aggregating them. The traffic in DGUMA/DA is almost the same as that in DGUMA/DAwoRC in the lengthwise distribution. This is because the topology of the forwarding tree incidentally becomes suitable for data aggregation even in DGUMA/DAwoRC. Still, the traffic in DGUMA/DA is slightly larger than that in DGUMA/DAwoRC, because DGUMA/DA has to send messages to fix routes for data aggregation.
Fig. 22(b) shows the delay. From this result, we can see that the delays in all methods
increase as G increases. This is obvious because the number of sensor data increases as
G increases. We can also see that the delay in DGUMA/DA becomes longer than those
in other methods. This is because mobile agents in DGUMA/DA have to wait until their
timers expire before sending a packet, while they can send their packet immediately
after receiving packets from all their child agents in other methods.
(Figure: traffic [KB], delay [s], and delivery ratio versus the number of grids, G, for DGUMA/DA, DGUMA/DAwoRC, and DGUMA.)
Fig. 22(c) shows the delivery ratio. From this result, we can see that the delivery ratio in DGUMA/DA is higher than those in the other methods. In DGUMA and DGUMA/DAwoRC, mobile agents cannot send their packet until they receive packets from all their children. Thus, no packet is sent to the sink once a packet collision occurs. On the other hand, thanks to the introduction of the timer, mobile agents in DGUMA/DA can send their packet even when packet collisions occur at their descendants. As G increases, the delivery ratio in all methods becomes lower. This is because the chance of packet collisions increases with the number of sensor data. Among the three methods, the delivery ratio in DGUMA/DA remains high, since mobile agents at different distances from the sink send their packets at different timings according to their timers. However, DGUMA/DA cannot completely eliminate packet collisions, because mobile agents which have almost the same distance to the sink send their packets at almost the same time when rand in Eq. (1) takes very close values for these agents.
Crosswise Distribution: Fig. 23(a) shows the traffic. From this result, we can see that the traffic in DGUMA/DA remains small even in the crosswise distribution, while the traffic in DGUMA/DAwoRC becomes much larger. This is because, in the crosswise distribution, fewer sensor data are aggregated on the initial (lengthwise) forwarding tree. On the other hand, DGUMA/DA appropriately changes the topology of the forwarding tree according to the geographical distribution of sensor data, so more sensor data can be aggregated. As G increases, the difference in traffic between the methods increases, because the number of sensor data increases as G increases.
Figs. 23(b) and 23(c) show the delay and the delivery ratio, respectively. These results are almost the same as those in Figs. 22(b) and 22(c). This is because the differences in delay and delivery ratio between DGUMA/DA and the other methods are caused not only by the difference of the forwarding route but also by the introduction of the timer.
8 Conclusion
In this chapter, we have presented DGUMA/DA, a data gathering method that considers the geographical distribution of data values in dense MWSNs. DGUMA/DA can reduce the traffic for gathering sensor data by aggregating identical sensor readings and by dynamically reconstructing the forwarding tree for data aggregation. The results of the simulation experiments show that DGUMA/DA can gather sensor data with a high delivery ratio and small traffic.
In this chapter, we assumed that DGUMA/DA gathers sensor readings that have only one attribute. However, in a real environment, each sensor reading may have multiple attributes (e.g., temperature and light intensity). Therefore, it is necessary to extend DGUMA/DA to efficiently gather sensor readings with multiple attributes. In addition, DGUMA and DGUMA/DA do not consider erroneous or missing data; it is thus necessary to extend them to handle such data.
P2P Data Replication:
Techniques and Applications
1 Introduction
Peer-to-peer (P2P) systems have become highly popular in recent times due to
their great potential to scale and the lack of a central point of failure. Thus, P2P
2 P2P Systems
– Peers should have autonomy and be able to decide which services they wish to offer to other peers.
– Peers should be assumed to have temporary network addresses. They should be recognized and reachable even if their network address has changed.
– A peer can join and leave the system at its own discretion.
Initial research work and development in P2P systems considered data replication techniques as a means to ensure availability of static information (typically files) given the highly dynamic nature of computing nodes in P2P systems. However, considering only static information and data might be insufficient for certain types of applications, in which documents generated along the application lifecycle can change over time. The need is then to efficiently replicate dynamic documents and data.
Data replication aims at increasing availability, reliability, and performance
of data accesses by storing data redundantly [16–19]. A copy of a replicated data
object is called a replica. Replication ensures that all replicas of one data ob-
ject are automatically updated when one of its replicas is modified. Replication
involves conflicting goals with respect to guaranteeing consistency, availability,
and performance. Data replication techniques have been extensively used in dis-
tributed systems as an important effective mechanism for storage and access
to distributed data. Data replication is the process of creating copies of data
resources in a network. Data replication is not just copying data at multiple
locations as it has to solve several issues. Data replication can improve the per-
formance of a distributed system in several ways:
In [30], replica control mechanisms are classified using three criteria (see also
[20], [21], [22]): where updates take place (single-master vs. multi-master), when
updates are propagated to all replicas (synchronous vs. asynchronous) and how
replicas are distributed over the network (full vs. partial replication) as shown
in Fig. 1.
find the right replication factor. Careful planning should be done when deciding
which documents to replicate and at which peers.
the local site, and after that the updates are propagated to the remote sites, as shown in Fig. 7. An advantage of asynchronous propagation is that the update does not block due to unavailable replicas, which improves data availability. The asynchronous replication technique can be classified as optimistic or pessimistic in terms of how conflicting updates are handled [35, 36].
Pessimistic Approaches
These approaches combine lazy replication with one-copy serializability. Each replicated data item is assigned a primary copy site and multiple secondary copy sites, and only the primary copy can be modified. But because primary copies are distributed across the system, serializability cannot always be guaranteed [37]. In order to solve this problem, constraints on primary and secondary copy placements must be set. The problem is solved with the help of a graph representation. The Data Placement Graph (DPG) is a graph where each node represents a site and there is a directed edge from Site i to Site j if there is at least one data item for which Site i is the primary site and Site j is a secondary site. The configurations a DPG can have for the system to be serializable are determined with the Global Serialization Graph (GSG). The GSG is obtained by taking the union of the nodes and edges of the Local Serialization Graph (LSG) at each site; the LSG is a partial order over the operations of all transactions executed at that site. A DPG is serializable only if the GSG is acyclic, as illustrated by the sketch below.
The above method can be enhanced so that it allows some cyclic configurations. In order to achieve this, the network must provide FIFO reliable multicast [35]. The time needed to multicast a message from one node to any other node is not greater than Max, and the difference between any two local clocks is not higher than ε. Thus, since a site receives the propagated transaction in at most Max + ε units of time, chronological and total orderings can be assured without coordination among sites. The approach reaches a consistency level equivalent to one-copy serializability for normal workloads, and for bursty workloads it is quite close to it. The solution was extended to work in the context of partial replication too [23]. Pessimistic approaches have the disadvantage that two replicas might not be consistent for some time interval. That is why the criterion of freshness is used, defined as the distance between two replicas.
Optimistic Approaches
Optimistic approaches are used for sharing data efficiently in wide-area or mobile environments. The difference between optimistic and pessimistic replication is that the former does not use one-copy serializability. Pessimistic replication uses synchronization during replica propagation and blocks other users during an update. On the other hand, optimistic replication allows data to be accessed without synchronization, based on the assumption that conflicts will occur only rarely, if at all. Update propagation is performed in the background, so divergences between replicas may temporarily exist. Conflicting updates are reconciled later.
Optimistic approaches have powerful advantages over pessimistic approaches. They improve availability: applications do not block when the local or remote site is down. This type of replication also permits a dynamic configuration of the network; peers can join or leave the network without affecting update propagation. There are techniques, such as epidemic replication, that propagate operations reliably to all replicas. Unlike pessimistic algorithms, optimistic algorithms scale to a large number of replicas, as there is little synchronization among sites. New replicas can be added on the fly without changing the existing sites; examples are FTP and Usenet mirroring. Last but not least, optimistic approaches provide quick feedback, as the system applies updates tentatively as soon as they are submitted [36].
These advantages come at a cost in system consistency. Optimistic replication encounters the challenges of diverging replicas and conflicts between concurrent updates. For these reasons, it is used only in systems in which conflicts are rare and which tolerate inconsistent data. An example is file systems, in which conflicts do not happen often due to the data partitioning and access arbitration that naturally happen between users.
Path replication is a technique that uses active replication: the requested item is replicated to all nodes of the path between the source and destination nodes (a minimal sketch is given below). In this scheme, a peer with a high degree forwards much more data than a peer with a low degree, so a large number of replications occur at high-degree peers. Therefore, the storage load due to writing and/or reading can be concentrated on a few high-degree peers, which thus play an important role in the P2P system. If the system fails due to overload or some other reason, a large amount of time is needed to recover it [25]. However, this scheme has been employed in many distributed systems because of its good search performance and ease of implementation. Path replication is used in systems such as Freenet [26].
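A minimal sketch of the path replication idea follows; the peer and store names are illustrative and are not taken from Freenet or any other concrete system.

import java.util.*;

// Illustrative sketch of path replication: after a successful lookup, the item
// is stored at every peer on the path between the requesting peer and the
// peer that provided it.
class Peer {
    final String id;
    final Map<String, byte[]> store = new HashMap<>();
    Peer(String id) { this.id = id; }
}

class PathReplication {
    // lookupPath lists the peers visited by the query, requester first,
    // provider last; the replica is written at each of them on the way back.
    static void replicateAlongPath(List<Peer> lookupPath, String key, byte[] item) {
        for (Peer peer : lookupPath) {
            peer.store.putIfAbsent(key, item);  // high-degree peers accumulate many replicas this way
        }
    }
}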
Random replication distributes the replicas in random order. With random forwarding using n-walker random walks, random replication is the most effective approach for achieving both smaller search delays and smaller deviations in searches. Random replication is harder to implement, but the performance difference between it and path replication highlights the topological impact of path replication.
In [27], the authors developed two heuristic algorithms (HighlyUpFirst and HighlyAvailableFirst) for solving the replica placement problem and improving the quality of availability. In HighlyUpFirst replication, the nodes with the highest uptime are put in the set that will receive the replica. The HighlyAvailableFirst method fills the so-called replica set with the nodes that have high availability. This technique deals only with availability and does not take the network overhead into consideration. A minimal sketch of this selection idea is given below.
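The following sketch ranks candidate nodes and fills the replica set with the best-ranked ones; the CandidateNode fields and method names are illustrative and do not reproduce the actual implementation from [27].

import java.util.*;

// Illustrative sketch: fill the replica set with the best-ranked nodes
// (by uptime for HighlyUpFirst, by availability for HighlyAvailableFirst);
// network overhead is deliberately ignored, as in [27].
class CandidateNode {
    final String id;
    final double uptime;        // e.g., fraction of time the node has been up
    final double availability;  // e.g., probability of the node being reachable
    CandidateNode(String id, double uptime, double availability) {
        this.id = id; this.uptime = uptime; this.availability = availability;
    }
}

class ReplicaSetSelection {
    static List<CandidateNode> highlyUpFirst(List<CandidateNode> candidates, int replicaCount) {
        List<CandidateNode> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((CandidateNode n) -> n.uptime).reversed());
        return sorted.subList(0, Math.min(replicaCount, sorted.size()));
    }

    static List<CandidateNode> highlyAvailableFirst(List<CandidateNode> candidates, int replicaCount) {
        List<CandidateNode> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((CandidateNode n) -> n.availability).reversed());
        return sorted.subList(0, Math.min(replicaCount, sorted.size()));
    }
}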
Replicating objects to multiple sites raises several issues, such as the selection of objects for replication, the granularity of replicas, and choosing an appropriate site for hosting a new replica [42].
By storing the data at more than one site, a system can operate using replicated data even if a data site fails, thus increasing availability and fault tolerance. At the same time, as the data are stored at multiple sites, a request can find the data close to the site where it originated, thus increasing the performance of the system. But the benefits of replication, of course, do not come without the overheads of creating, maintaining, and updating the replicas. If the application has a read-only nature, replication can greatly improve the performance. But if the application needs to process update requests, the benefits of replication can be neutralized to some extent by the overhead of maintaining consistency among multiple replicas. If an application requires rigorous consistency and has large numbers of update transactions, replication may diminish the performance as a result of synchronization requirements.
(Figure: structure of the fuzzy-based replication scheme with input parameters NDP, RP, and SRP, a fuzzy logic controller (FLC), and output RF; membership functions μ(NDP) with terms Fe, Me, Ma, μ(RP) with terms Lo, Av, Hi, and μ(SRP) with terms Lw, Md, Hg, each defined over the range 0–100.)
The term sets of NDP, RP, and SRP are defined respectively as:
and the term set for the output (RF) is defined as:
Table 1. FRB
(Figure: replication factor RF versus NDP for RP = 10%, 50%, and 100%, shown for different SRP values, including (b) SRP = 60% and (c) SRP = 100%.)
7 Conclusions
applications. However, several issues arise, such as data consistency and designing cost-efficient solutions, due to the highly dynamic nature of P2P systems.
This chapter has conducted a theoretical survey of replication techniques in P2P systems. We described different techniques and discussed their advantages and disadvantages. The choice of replication technique depends on the application in which it will be used. In general, a replication technique should simultaneously take into consideration the reduction of access time and bandwidth consumption, an optimal number of replicas, and a balanced workload among the replicas. Data replication is useful for achieving high data availability, system reliability, and scalability, and can also be used for maximizing the hit probability of access requests for contents in a P2P community, minimizing the content search (look-up) time, minimizing the number of hops visited to find the requested content, minimizing the content cost, and distributing the peer load. As discussed above, however, these benefits do not come without the overheads of creating, maintaining, and updating the replicas; for applications with rigorous consistency requirements and large numbers of update transactions, replication may diminish the performance as a result of synchronization requirements.
Careful planning should be done when deciding which documents to replicate
and at which peers. During replication it is important to find the right replication
factor.
References
1. Xhafa, F., Fernandez, R., Daradoumis, T., Barolli, L., Caballé, S.: Improvement of
JXTA Protocols for Supporting Reliable Distributed Applications in P2P Systems.
In: Enokido, T., Barolli, L., Takizawa, M. (eds.) NBiS 2007. LNCS, vol. 4658, pp.
345–354. Springer, Heidelberg (2007)
2. Barolli, L., Xhafa, F., Durresi, A., De Marco, G.: M3PS: A JXTA-based Multi-
platform P2P System and Its Web Application Tools. International Journal of Web
Information Systems 2(3/4), 187–196 (2006)
3. Arnedo, J., Matsuo, K., Barolli, L., Xhafa, F.: Secure Communication Setup for
a P2P based JXTA-Overlay Platform. IEEE Transactions on Industrial Electron-
ics 58(6), 2086–2096 (2011)
4. Barolli, L., Xhafa, F.: JXTA-Overlay: A P2P Platform for Distributed, Collab-
orative, and Ubiquitous Computing. IEEE Transactions on Industrial Electron-
ics 58(6), 2163–2172 (2011)
5. Enokido, T., Aikebaier, A., Takizawa, M.: Process Allocation Algorithms for Saving
Power Consumption in Peer-to-Peer Systems. IEEE Transactions on Industrial
Electronics 58(6), 2097–2105 (2011)
6. Waluyo, A.B., Rahayu, W., Taniar, D., Srinivasan, B.: A Novel Structure and
Access Mechanism for Mobile Data Broadcast in Digital Ecosystems. IEEE Trans-
actions on Industrial Electronics 58(6), 2173–2182 (2011)
7. Zhang, J., Honeyman, P.: A Replicated File System for Grid Computing. Concur-
rency and Computation: Practice and Experience 20(9), 1113–1130 (2008)
P2P Data Replication: Techniques and Applications 165
8. Elghirani, A., Subrata, R., Zomaya, A.Y.: Intelligent Scheduling and Replica-
tion: a Synergistic Approach. Concurrency and Computation: Practice and Ex-
perience 21(3), 357–376 (2009)
9. Nicholson, C., Cameron, D.G., Doyle, A.T., Millar, A.P., Stockinger, K.: Dynamic
Data Replication in LCG. Concurrency and Computation: Practice and Experi-
ence 20(11), 1259–1271 (2008)
10. Shirky, C.: What is P2P... and What Isn't. O'Reilly Network (November 2000)
11. Gnutella, http://gnutella.wego.com/
12. NAPSTER, http://www.napster.com/
13. WinMX, http://www.frontcode.com/
14. FREENET, http://frenet.sourceforge.net/
15. GROOVE, http://www.groove.net/
16. Martins, V., Pacitti, E., Valduriez, P.: Survey of Data Replication in P2P Systems.
Technical Report (2006)
17. Bernstein, P., Goodman, N.: The Failure and Recovery Problem for Replicated
Databases. In: Proc. of the Second Annual ACM Symposium on Principles of
Distributed Computing, pp. 114–122. ACM Press, New York (1983)
18. Mustafa, M., Nathrah, B., Suzuri, M., Osman, M.: Improving Data Availability Us-
ing Hybrid Replication Technique in Peer-to-Peer Environments. In: Proc. of 18th
International Conference on Advanced Information Networking and Applications
(AINA-2004), pp. 593–598. IEEE CS Press (2004)
19. Loukopoulos, T., Ahmad, I.: Static and Adaptive Data Replication Algorithms for
Fast Information Access in Large Distributed Systems. In: Proc. of 20th Interna-
tional Conference on Distributed Computing Systems (ICDCS 2000), pp. 385–392.
IEEE CS Press (2000)
20. Xhafa, F., Potlog, A., Spaho, E., Pop, F., Cristea, V., Barolli, L.: Evaluation of
Intragroup Optimistic Data Replication in P2P Groupware Systems. Concurrency
Computat.: Pract. Exper (2012), doi:10.1002/cpe.2836
21. Potlog, A.D., Xhafa, F., Pop, F., Cristea, V.: Evaluation of Optimistic Replication
Techniques for Dynamic Files in P2P Systems. In: Proc. of Sixth International
Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC
2011), Barcelona, Spain, pp. 259–165 (2011)
22. Xhafa, F., Kolici, V., Potlog, A.D., Spaho, E., Barolli, L., Takizawa, M.: Data
Replication in P2P Collaborative Systems. In: Proc. of Seventh International Con-
ference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2012),
Victoria, Canada, pp. 49–57 (2012)
23. Coulon, C., Pacitti, E., Valduriez, P.: Consistency Management for Partial Repli-
cation in a High Performance Database Cluster. In: Proc. of the 11th Interna-
tional Conference on Parallel and Distributed Systems (ICPADS 2005), pp. 809–
815 (2005)
24. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and Replication in Unstruc-
tured Peer-to-Peer Networks. In: Proc. of 16th ACM International Conference on
Supercomputing (ICS 2002), pp. 84–95 (2002)
25. Keyani, P., Larson, B., Senthil, M.: Peer Pressure: Distributed Recovery from Attacks
in Peer-to-Peer Systems. In: Gregori, E., Cherkasova, L., Cugola, G., Panzieri, F.,
Picco, G.P. (eds.) NETWORKING 2002. LNCS, vol. 2376, pp. 306–320. Springer,
Heidelberg (2002)
26. Clarke, I., Sandberg, O., Wiley, B., Hong, T.W.: Freenet: A Distributed Anony-
mous Information Storage and Retrieval System. In: Federrath, H. (ed.) Anonymity
2000. LNCS, vol. 2009, pp. 46–66. Springer, Heidelberg (2001)
166 E. Spaho et al.
27. On, G., Schmitt, J., Steinmetz, R.: The Effectiveness of Realistic Replication
Strategies on Quality of Availability for Peer-to-Peer Systems. In: Proc. of the
Third International IEEE Conference on Peer-to-Peer Computing, pp. 57–64
(2003)
28. Leontiadis, E., Dimakopoulos, V.V., Pitoura, E.: Creating and Maintaining Repli-
cas in Unstructured Peer-to-Peer Systems. In: Nagel, W.E., Walter, W.V., Lehner,
W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1015–1025. Springer, Heidelberg
(2006)
29. Kangasharju, J., Ross, K.W., Turner, D.A.: Optimal Content Replication in P2P
Communities. Manuscript, pp. 1–26 (2002)
30. Gray, J., Helland, P., O’Neil, P., Shasha, D.: The Dangers of Replication and a
Solution. In: Proc. of International Conference on Management of Data (SIGMOD
1996), pp. 173–182 (1996)
31. Lubinski, A., Heuer, A.: Configured Replication for Mobile Applications. In:
Databases and Information Systems, pp. 101–112. Kluwer Academic Publishers,
Dordrecht (2000)
32. Rohm, U., Bohm, K., Schek, H., Schuldt, H.: FAS - A Freshness-Sensitive Coordina-
tion Middleware for a Cluster of OLAP Components. In: Proc. of 28th International
Conference on Very Large Data Bases (VLDB 2002), pp. 754–765 (2002)
33. Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery
in Database Systems (1987)
34. Kemme, B., Alonso, G.: A New Approach to Developing and Implementing Eager
Database Replication Protocols. ACM Transactions on Database Systems 25(3),
333–379 (2000)
35. Pacitti, E., Minet, P., Simon, E.: Fast Algorithms for Maintaining Replica Con-
sistency in Lazy Master Replicated Databases. In: Proc. of the 25th International
Conference on Very Large Data Bases (VLDB 1999), pp. 126–137 (1999)
36. Saito, Y., Shapiro, M.: Optimistic Replication. ACM Comput. Surv. 37(1), 42–81
(2005)
37. Chundi, P., Rosenkranz, D.: Deferred Updates and Data Placement in Distributed
Databases (1996)
38. Goel, S., Buyya, R.: Data Replication Strategies in Wide Area Distributed Systems.
In: Enterprise Service Computing: From Concept to Deployment, pp. 211–241. IGI
Global (2007)
39. Yamamoto, H., Maruta, D., Oie, Y.: Replication Methods for Load Balancing on
Distributed Storages in P2P Networks. The Institute of Electronics, Information
and Communication Engineers E-89-D(1), 171–180 (2006)
40. Sheppard, E.: Continuous Replication for Business-Critical Applications. White
Paper, pp. 1–7 (2012)
41. Van Der Lans, R.F.: Data Replication for Enabling Operational BI., White Paper
on Business Value and Architecture, pp. 1–26 (2012)
42. Ulusoy, O.: Research Issues in Peer-to-Peer Data Management. In: Proc. of Inter-
national Symposium on Computer and Information Sciences (ISCIS 2007), pp. 1–8
(2007)
43. Estepa, A.N., Xhafa, F., Caballé, S.: A P2P Replication-Aware Approach for Con-
tent Distribution in e-Learning Systems. In: Proc. of Sixth International Conference
on Complex, Intelligent, and Software Intensive Systems (CISIS 2012), pp. 917–922
(2012)
44. Terano, T., Asai, K., Sugeno, M.: Fuzzy Systems Theory And Its Applications.
Academic Press, Inc., Harcourt Brace Jovanovich Publishers (1992)
Leveraging High-Performance Computing
Infrastructures to Web Data Analytic Applications
by Means of Message-Passing Interface
1 Introduction
The novel trends of the linked and open data [1] have enabled a principally new
dimension of data analysis, which is no longer limited to internal data collections, i.e.,
“local data”, but spans over a number of heterogeneous data sources, in particular
from the Web, i.e., “global data”. Today’s data processing applications have to increasingly access interlinked data from many different sources, e.g., well-known social networks such as Twitter and Facebook, or web-scale knowledge bases such as Linked Life Data [2] or Open PHACTS [3]. The Semantic Web alone – one of the most challenging data-centric application domains – already offers integrated data whose volume has reached the order of magnitude of billions of triples (“subject-predicate-object” relation entities in a semantic graph) [4] and is expected to grow further in the future (see the current Linking Open Data cloud diagram [5]). However, existing data processing and analysis technologies are still far from being able to scale to the demands of global and, in the case of large industrial corporations, even of local data, which makes up the core of the “big data” problem. With regard to this, the design of current data analysis algorithms needs to be reconsidered in order to enable scalability to big-data demands.
The problem has two major aspects:
1. The monolithic design of current algorithms makes it impossible to integrate them with other techniques that would help increase the analysis quality.
2. The sequential design of the algorithms prevents porting them to parallel computing infrastructures, and thus they do not fulfill high-performance and other QoS user requirements.
With regard to the first issue – the low performance of the design patterns used in conventional data analytic and web applications – the SOA approach [6], e.g., as implemented in the LarKC platform [7], enables the execution of the most computation-intensive parts of application workflows (such as the one shown in Figure 1) on a high-performance computing system, whereas less performance-critical parts of the application can run in “usual” places, e.g., a web server or a database.
Fig. 1. Workflow-based design of a Semantic Web reasoning application (LarKC) and main
parallelisation patterns
The second identified issue – the lack of parallelisation models for running the applications on high-performance computing infrastructures – is, however, more essential and poses the major obstacle to endorsing large-scale parallelism in data-centric development. The traditional serial computing architectures increasingly prove ineffective when scaling Web processing algorithms to analyze big data. On the other hand, large-scale High-Performance Computing (HPC) infrastructures, both in the academic domain and in industry, have very special requirements for the applications running on them, which most Web applications do not conform to. While the existing data-centric parallelisation frameworks, such as MapReduce/Hadoop [19], have proved very successful for running on relatively small-scale clusters of workstations and private Clouds (in the literature there is no evidence of Hadoop being evaluated on parallel computing architectures with more than 100 nodes), the
use of HPC systems (offering several hundred thousand nodes and computation performance in the exascale range) remains out of the scope of the current frameworks’ functionality. The reason for this is twofold. First, the existing frameworks (here we basically refer to Hadoop [13], which dominates the current software market) are implemented as a set of services developed in the Java programming language, which has traditionally found quite limited support on HPC systems due to security, performance, and other restrictions. On the other hand, the well-established, mature, optimized, and thus very efficient parallelisation technologies that have been developed for HPC, such as the Message-Passing Interface (MPI) [15], do not support Java – the programming language that is used for developing most of the current data-centric applications and databases, in particular for the Semantic Web.
Whereas the modern parallelisation frameworks cannot take full advantage of deployment in HPC environments due to their design features, the traditional parallelisation technologies fail to meet the requirements of data-centric application development in terms of the offered programming language support as well as service abilities. The known approaches address this problem in two directions [12]. The first is adapting data-centric (MapReduce-based) frameworks for running in HPC environments. One promising effort, done by the Sandia laboratory, implements the MapReduce functionality from scratch on top of an MPI library [8]. Another work worth mentioning, done by the Open MPI consortium, ports some basic Hadoop interfaces to an MPI environment and is called MR+ [9]. The common problem of all these approaches is that they are restricted to a MapReduce-like programming model, which decreases their value when used in a wide range of applications.
The alternative approach is to develop Java bindings for a standard MPI library that is already optimized on a supercomputing system [10]. The benefit of using MPI in parallel Java applications is twofold. First, the MPI programming model is very flexible (supporting several types of domain decomposition, etc.) and allows the developer to go beyond the quite restricted key/value-based model of MapReduce in designing the parallel execution pattern. Second, MPI is the most efficient solution for reaching high application performance, in particular thanks to a number of optimizations at the network interconnect level. Moreover, the HPC community has already made some efforts to standardize a Java interface for MPI [32], so that Java is increasingly being recognized as one of the mainstream programming languages for HPC applications as well.
Both MapReduce- and MPI-based approaches to developing parallel Java applications are promising for data-centric application development, each having certain pro and contra arguments over the other. In this chapter, we concentrate on the MPI-based approach. In particular, we discuss OMPIJava – a new implementation of Java bindings for Open MPI, one of the currently most popular Message-Passing Interface implementations for C, C++, and Fortran supercomputing applications.
The rest of the chapter is organized as follows. Section 2 discusses data-centric
application parallelisation techniques and compares MapReduce- and MPI-based
approaches. Section 3 introduces the basics of the OMPIJava tool and discusses its
main features and technical implementation details. Section 4 shows performance
evaluation results of OMPIJava based on standard benchmark sets. Section 5 presents
an example of a challenging Semantic Web application – Random Indexing –
implemented with OMPIJava and shows performance benchmarks for it. Section 6
discusses the use of performance analysis tools for parallel MPI applications. Section
7 concludes the chapter and discusses directions of future work.
In the map stage, the input data set is split into independent chunks, and each chunk is assigned to an independent task; the tasks are then processed in a completely parallel manner (process stage). In the reduce stage, the output produced by every map task is collected, combined, and the consolidated final output is then produced.
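The following framework-free Java fragment is a minimal sketch of these three stages on a word-count example; it only illustrates the programming model and deliberately does not use the Hadoop API.

import java.util.*;
import java.util.stream.*;

// Minimal sketch of the MapReduce model: the input is split into chunks,
// each chunk is mapped to word occurrences independently and in parallel,
// and the reduce stage combines the occurrences per key into final counts.
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> chunks = List.of(
                "big data needs parallel processing",
                "parallel processing of big data");

        Map<String, Long> counts = chunks.parallelStream()                 // chunks processed in parallel (map)
                .flatMap(chunk -> Arrays.stream(chunk.split("\\s+")))
                .collect(Collectors.groupingBy(word -> word,
                                               Collectors.counting()));    // combine per key (reduce)

        counts.forEach((word, count) -> System.out.println(word + " = " + count));
    }
}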
The Hadoop framework is a service-based implementation of MapReduce for Java. Hadoop considers a parallel system as a set of master and slave nodes, deploying on them services for scheduling tasks as jobs (Job Tracker), monitoring the jobs (Task Tracker), managing the input and output data (Data Node), re-executing failed tasks, etc. This is done in a way that ensures very high service reliability and fault tolerance of the parallel execution. In Hadoop, both the input and the output of the job are stored in a special distributed file system. In order to improve reliability, the file system also provides an automatic replication procedure, which however introduces an additional overhead to the inter-node communication.
The Message-Passing Interface (MPI) is a process-based standard for the implementation of parallel applications. MPI processes are independent execution units that contain their own state information, use their own address spaces, and only interact with each other via inter-process communication mechanisms defined by MPI. Each MPI process can be executed on a dedicated compute node of the high-performance architecture, i.e., without competing with the other processes for access to the hardware, such as CPU and RAM, thus improving the application performance and achieving algorithm speed-up. In the case of a shared file system such as Lustre [21], which is the most widely used file system on modern HPC infrastructures, the MPI processes can effectively access the same file section in parallel without any considerable degradation of the disk I/O bandwidth. With regard to the data decomposition strategy presented in Figure 3a, each MPI process is responsible for processing the data partition assigned to it in proportion to the total number of MPI processes (see Figure 3b). The position of any MPI process within the group of processes involved in the execution is identified by an integer R (rank) between 0 and N-1, where N is the total number of launched MPI processes. The rank R is a unique integer identifier assigned incrementally and sequentially by the MPI run-time environment to every process. Both the MPI process's rank and the total number of MPI processes can be acquired from within the application by using MPI standard functions, such as those presented in Listing 1.
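In the absence of the original listing here, the following minimal sketch illustrates the idea, assuming the mpiJava 1.2-style API that ompiJava follows (the data-set size below is an arbitrary assumption): each process queries its rank and the total number of processes and derives its own share of the input.

import mpi.MPI;
import mpi.MPIException;

public class RankExample {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);

        int rank = MPI.COMM_WORLD.Rank();   // R: position of this process, 0..N-1
        int size = MPI.COMM_WORLD.Size();   // N: total number of launched MPI processes

        // Assumed total number of input records; each process takes a proportional partition.
        int totalRecords = 1_000_000;
        int chunk = (totalRecords + size - 1) / size;
        int begin = rank * chunk;
        int end = Math.min(begin + chunk, totalRecords);
        System.out.println("Process " + rank + " of " + size
                + " handles records [" + begin + ", " + end + ")");

        MPI.Finalize();
    }
}

Such a program would typically be started through the mpirun command discussed next, e.g., with something like mpirun -np 8 java RankExample (the exact launch line depends on the MPI installation).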
A typical data processing workflow with MPI can be depicted as shown in Figure 4. MPI jobs are executed by means of the mpirun command, which is an important part of any MPI implementation. Mpirun controls several aspects of parallel program execution; in particular, it launches the MPI processes under a job scheduling manager such as OpenPBS [22]. The number of MPI processes to be started is provided with the "-np" parameter to mpirun. Normally, the number of MPI processes corresponds to the number of compute nodes reserved for the execution of the parallel job. Once an MPI process is started, it can request its rank as well as the total number of MPI processes associated with the same job. Based on the rank and the total number of processes, each MPI process can calculate the corresponding subset of the input data and process it. The data partitioning problem remains beyond the scope of this work; for RDF in particular, there are a number of well-established approaches, e.g., horizontal [23], vertical [24], and workload-driven [25] decomposition.
Since a single MPI process owns its own memory space and thus cannot access the data of the other processes directly, the MPI standard provides special communication functions, which are necessary, e.g., for exchanging the boundary values of data subdomains or for consolidating the final output from the partial results produced by each of the processes. The MPI processes communicate with each other by sending messages, either in a point-to-point fashion (between two processes) or collectively (involving a group of processes or all of them). More details about MPI communication can be found in a previous publication about OMPIJava [27].
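As a simple illustration of point-to-point messaging (again a sketch using the mpiJava 1.2-style API, not code from the benchmark suites), one process can send a partial result to another:

import mpi.MPI;
import mpi.MPIException;

public class SendRecvExample {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        double[] boundary = new double[4];
        int tag = 99;

        if (rank == 1) {
            // e.g., boundary values of this process's data subdomain
            boundary = new double[] {1.0, 2.0, 3.0, 4.0};
            MPI.COMM_WORLD.Send(boundary, 0, boundary.length, MPI.DOUBLE, 0, tag);
        } else if (rank == 0) {
            MPI.COMM_WORLD.Recv(boundary, 0, boundary.length, MPI.DOUBLE, 1, tag);
            System.out.println("Rank 0 received " + boundary.length + " boundary values from rank 1");
        }
        MPI.Finalize();
    }
}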
Although the official MPI standard only recognizes interfaces for the C, C++, and Fortran languages, there have been a number of standardization efforts toward creating MPI bindings for Java; the most complete API set has been proposed by the mpiJava [28] developers. There are only a few approaches to implementing MPI bindings for Java, which can be classified into the following two categories:
• Pure Java implementations, e.g., based on RMI (Remote Method Invocation) [29], which allows Java objects residing in different virtual machines to communicate with each other, or on the lower-level Java sockets API.
• Wrapped implementations using native methods implemented in C, which are presumably more efficient in terms of performance than code managed by the Java run-time environment.
In practice, none of the above-mentioned approaches satisfies the contradictory user requirements of application portability and efficiency. Whereas the pure Java implementations, such as MPJ Express [30] or MPJ/Ibis [14][18], do not benefit from high-speed interconnects such as InfiniBand, and thus introduce communication bottlenecks and do not demonstrate acceptable performance on the majority of today's production HPC systems [31], a wrapped implementation, such as mpiJava [32], requires a native C library, which can cause additional integration and interoperability issues with the underlying MPI implementation.
In looking for a tradeoff between performance and usability, and in view of the complexity of providing Java support for high-speed cluster interconnects, the most promising solution seems to be to implement the Java bindings directly in a native MPI implementation in C.
Despite the great variety of native MPI implementations, only a few of them address the requirements of Java parallel applications regarding process control, resource management, latency awareness and management, and fault tolerance. Among the known sustainable open-source implementations, we identified Open MPI [33] and MPICH2 [34] as the most suitable for our goal of implementing the Java MPI bindings. Both Open MPI and MPICH2 are open-source, production-quality, and widely portable implementations of the MPI standard (up to its latest 2.0 version). Although both libraries claim to provide a modular and easy-to-extend framework, the software stack of Open MPI seems to better suit the goal of introducing new language bindings, which is what our research aims at. The architecture of Open MPI [16] is highly flexible and defines a dedicated layer used to introduce bindings, which are currently provided for C, F77, F90, and some other languages (see also Figure 6). Extending the OMPI layer of Open MPI with Java language support therefore seems a very promising approach to the discussed integration of Java bindings, taking advantage of all the layers composing Open MPI's architecture.
We have based our Java MPI bindings on the mpiJava code, originally developed in the HPJava [35] project and currently maintained on SourceForge [26]. mpiJava provides a set of Java Native Interface (JNI) wrappers around the native MPI v.1.1 communication methods, as shown in Figure 7. JNI enables programs running inside a Java run-time environment to invoke native C code and thus use platform-specific features and libraries [36], e.g., the InfiniBand software stack. The application-level API is constituted by a set of Java classes designed in conformance with MPI v.1.1 and the specification in [28]. The Java methods internally invoke the MPI-C functions using the JNI stubs. The realization details of mpiJava can be obtained from [17][37].
Open MPI is a high-performance, production-quality, MPI-2 standard compliant implementation. Open MPI consists of three combined abstraction layers that provide a full-featured MPI implementation: (i) OPAL (Open Portable Access Layer), which abstracts away the peculiarities of a specific system to provide a consistent interface and portability; (ii) ORTE (Open Run-Time Environment), which provides a uniform parallel run-time interface regardless of system capabilities; and (iii) OMPI (Open MPI), which provides the application with the expected MPI standard interface. Figure 6 shows the enhanced Open MPI architecture, enabled with Java bindings support.
The design of the Java class collection followed the same strategy as for the C++ class collection: the opaque C objects are encapsulated into suitable class hierarchies, and most of the library functions are defined as class member methods. Along with the classes implementing the MPI functionality (the MPI package), the collection includes classes for error handling (Errhandler, MPIException), datatypes (Datatype), communicators (Comm), etc. More information about the implementation of both the Java classes and the JNI-C stubs can be found in previous publications [17][31].
Fig. 8. Comparison of the message rate for ompiJava and mpj for a) low and b) high message
size range
The main challenges of the Random Indexing algorithm are the following:
• Very large and high-dimensional vector space. A typical random indexing search algorithm performs a traversal over all the entries of the vector space. This means that the size of the vector space to a large extent determines the search performance. Modern data stores, such as Linked Life Data or Open PHACTS, consolidate many billions of statements and result in vector spaces of very large dimensionality. Performing Random Indexing over such large data sets is computationally very costly, with regard to both execution time and memory consumption. The latter poses a hard constraint on the use of random indexing packages on ordinary serial computers. So far, only relatively small parts of the Semantic Web data have been indexed and analyzed.
• High call frequency. Both indexing and search over the vector space are highly dynamic, i.e., the entire indexing.
The MPI implementation of Airhead search [43] is based on a domain
decomposition of the analyzed vector space and involves both point-to-point and
collective gather and broadcast MPI communication (see the schema in Figure 12).
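The following sketch only illustrates the shape of such a communication pattern (it is not the Airhead code; the mpiJava 1.2-style API, the vector dimensionality, and the similarity routine are assumptions): the root broadcasts the query vector, every process scores its own partition of the vector space, and the per-process results are gathered back for the final selection.

import mpi.MPI;
import mpi.MPIException;

public class BroadcastGatherSketch {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();
        int dim = 256;                       // assumed vector dimensionality

        double[] query = new double[dim];
        if (rank == 0) { /* fill query with the search vector */ }

        // Collective broadcast: every process receives the same query vector from rank 0.
        MPI.COMM_WORLD.Bcast(query, 0, dim, MPI.DOUBLE, 0);

        // Each process computes one local best score over its vector-space partition (placeholder).
        double[] localBest = new double[] { computeLocalBestSimilarity(query) };

        // Collective gather: rank 0 collects one value from each process.
        double[] allBest = new double[size];
        MPI.COMM_WORLD.Gather(localBest, 0, 1, MPI.DOUBLE, allBest, 0, 1, MPI.DOUBLE, 0);

        if (rank == 0) { /* select the globally most similar vectors from allBest */ }
        MPI.Finalize();
    }

    static double computeLocalBestSimilarity(double[] query) { return 0.0; }  // placeholder
}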
In order to compare the performance of OMPIJava with MPJ Express, we performed the evaluation on the largest of the available data sets reported in [43] (namely, Wiki2), which comprises one million high-density documents and occupies 16 GB of disk storage space. The overall execution time (wall clock) was measured. Figure 13a shows that both ompijava and mpj scale well until the problem size is large enough to saturate the capacities of a single node. Nevertheless, ompijava was around 10% more efficient than the alternative tool (Figure 13b).
Fig. 14. MPI Global Broadcast Communication visualization for four MPI processes with the Paraver tool
in [47], which are considerably preferable to all the existing Java solutions in terms of performance. Our implementation will allow the developed MPI communication patterns to be integrated into existing Java-based codes, such as Jena [11] or Pellet [48], and thus drastically improve the competitiveness of Semantic Web applications based on such tools. The development activities will mainly focus on extending the Java bindings to full support of the MPI-3 specification. We will also aim at adding Java language-specific bindings to the MPI standard, as a reflection of the Semantic Web's value in supercomputing. The integration activities will concentrate on adapting the performance analysis tools to the specifics of Java applications. Unfortunately, the existing performance analysis tools, such as Extrae discussed in the previous section, do not provide deep insight into the intrinsic characteristics of the Java Virtual Machine, which however might be important for application performance optimization, such as communication profile tailoring. For this purpose, the traditional performance analysis tools for Java applications, such as those provided by the Eclipse framework, must be extended with communication profiling capabilities. Several EU projects, such as JUNIPER [49], are already working in this direction.
Acknowledgment. The authors would like to thank the Open MPI consortium for the support with porting the mpiJava bindings, the EU-ICT JUNIPER project for the support with the Java platform and parallelisation, as well as the developers of the Airhead library, in particular David Jurgens, for the provided use case.
References
1. Gonzalez, R.: Closing in on a million open government data sets (2012),
http://semanticweb.com/
closinginona-millionopengovernmentdatasets_b29994
2. Linked Life Data repository website, http://linkedlifedata.com/
3. OpenPHACTS project website, http://www.openphacts.org/
4. Coffman, T., Greenblatt, S., Marcus, S.: Graph-based technologies for intelligence
analysis. Communications of ACM 47, 45–47 (2004)
5. Linked Open Data initiative, http://lod-cloud.net
6. Cheptsov, A., Koller, B.: A service-oriented approach to facilitate big data analytics on the
Web. In: Topping, B.H.V., Iványi, P. (eds.) Proceedings of the Fourteenth International
Conference on Civil, Structural and Environmental Engineering Computing. Civil-Comp
Press, Stirlingshire (2013)
7. Cheptsov, A.: Semantic Web Reasoning on the internet scale with Large Knowledge
Collider. International Journal of Computer Science and Applications, Technomathematics
Research Foundation 8(2), 102–117 (2011)
8. Plimpton, S.J., Devine, K.D.: MapReduce in MPI for large-scale graph algorithms. Parallel
Computing 37, 610–632 (2011)
9. Castain, R.H., Tan, W.: MR+. A technical overview (2012), http://www.open-
mpi.de/video/mrplus/Greenplum_RalphCastain-2up.pdf
10. Cheptsov, A.: Enabling High Performance Computing for Semantic Web applications by
means of Open MPI Java bindings. In: Proc. the Sixth International Conference on
Advances in Semantic Processing (SEMAPRO 2012) Conference, Barcelona, Spain (2012)
11. McCarthy, P.: Introduction to Jena. IBM Developer Works (2013),
http://www.ibm.com/developerworks/xml/library/j-jena
12. Gonzalez, R.: Two kinds of big data (2011), http://semanticweb.com/
two-kinds-ofbig-datb21925
13. Hadoop framework website, http://hadoop.apache.org/mapreduce
14. Bornemann, M., van Nieuwpoort, R., Kielmann, T.: Mpj/ibis: A flexible and efficient
message passing platform for Java. Concurrency and Computation: Practice and
Experience 17, 217–224 (2005)
15. MPI: A Message-Passing Interface standard. Message Passing Interface Forum (2005),
http://www.mcs.anl.gov/research/projects/mpi/mpistandard/
mpi-report-1.1/mpi-report.htm
16. Gabriel, E., et al.: Open MPI: Goals, concept, and design of a next generation MPI
implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI
2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004)
17. Baker, M., et al.: MPI-Java: An object-oriented Java interface to MPI. In: Rolim, J.D.P.
(ed.) IPPS-WS 1999 and SPDP-WS 1999. LNCS, vol. 1586, pp. 748–762. Springer,
Heidelberg (1999)
18. van Nieuwpoort, R., et al.: Ibis: a flexible and efficient Java based grid programming
environment. Concurrency and Computation: Practice and Experience 17, 1079–1107
(2005)
19. Dean, J., Ghemawat, S.: MapReduce - simplified data processing on large clusters. In:
Proc. OSDI 2004: 6th Symposium on Operating Systems Design and Implementation
(2004)
20. Resource Description Framework (RDF). RDF Working Group (2004),
http://www.w3.org/RDF/
21. Lustre file system - high-performance storage architecture and scalable cluster file system.
White Paper. Sun Microsystems, Inc. (December 2007)
22. Portable Batch System (PBS) documentation, http://www.pbsworks.com/
23. Dimovski, A., Velinov, G., Sahpaski, D.: Horizontal partitioning by predicate abstraction
and its application to data warehouse design. In: Catania, B., Ivanović, M., Thalheim, B.
(eds.) ADBIS 2010. LNCS, vol. 6295, pp. 164–175. Springer, Heidelberg (2010)
24. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web data
management using vertical partitioning. In: Proc. The 33rd International Conference on
Very Large Data Bases (VLDB 2007) (2007)
25. Curino, C., et al.: Workload-aware database monitoring and consolidation. In: Proc.
SIGMOD Conference, pp. 313–324 (2011)
26. OMPIJava tool website, http://sourceforge.net/projects/mpijava/
27. Cheptsov, A., et al.: Enabling high performance computing for Java applications using the
Message-Passing Interface. In: Proc. of the Second International Conference on Parallel,
Distributed, Grid and Cloud Computing for Engineering (PARENG 2011) (2011)
28. Carpenter, B., et al.: mpiJava 1.2: API specification. Northeast Parallel Architecture
Center. Paper 66 (1999), http://surface.syr.edu/npac/66
29. Kielmann, T., et al.: Enabling Java for High-Performance Computing: Exploiting
distributed shared memory and remote method invocation. Communications of the ACM
(2001)
30. Baker, M., Carpenter, B., Shafi, A.: MPJ Express: Towards thread safe Java HPC. In:
Proc. IEEE International Conference on Cluster Computing (Cluster 2006) (2006)
31. Judd, G., et al.: Design issues for efficient implementation of MPI in Java. In: Proc. of the
1999 ACM Java Grande Conference, pp. 58–65 (1999)
32. Carpenter, B., et al.: MPJ: MPI-like message passing for Java. Concurrency and
Computation - Practice and Experience 12(11), 1019–1038 (2000)
33. Open MPI project website, http://www.openmpi.org
34. MPICH2 project website, http://www.mcs.anl.gov/research/
projects/mpich2/
35. HP-JAVA project website, http://www.hpjava.org
36. Liang, S.: Java Native Interface: Programmer’s Guide and Reference. Addison-Wesley
(1999)
37. Vodel, M., Sauppe, M., Hardt, W.: Parallel high performance applications with mpi2java -
a capable Java interface for MPI 2.0 libraries. In: Proc. of the 16th Asia-Pacific
Conference on Communications (APCC), Nagoya, Japan, pp. 509–513 (2010)
38. NetPIPE parallel benchmark website, http://www.scl.ameslab.gov/netpipe/
39. Bailey, D., et al.: The NAS Parallel Benchmarks. RNR Technical Report RNR-94.007
(March 1994), http://www.nas.nasa.gov/assets/pdf/techreports/
1994/rnr-94-007.pdf
40. MPJ-Express tool benchmarking results, http://mpj-express.org/
performance.html
41. Sahlgren, M.: An introduction to random indexing. In: Proc. Methods and Applications of
Semantic Indexing Workshop at the 7th International Conference on Terminology and
Knowledge Engineering (TKE 2005), pp. 1–9 (2005)
42. Jurgens, D.: The S-Space package: An open source package for word space models. In:
Proc. of the ACL 2010 System Demonstrations, pp. 30–35 (2010)
43. Assel, M., et al.: MPI realization of high performance search for querying large RDF
graphs using statistical semantics. In: Proc. The 1st Workshop on High-Performance
Computing for the Semantic Web, Heraklion, Greece (May 2011)
44. Extrae performance trace generation library website,
http://www.bsc.es/computer-sciences/extrae
45. Paraver performance analysis tool website, http://www.bsc.es/
computer-sciences/performance-tools/paraver/general-overview
46. Fensel, D., van Harmelen, F.: Unifying reasoning and search to web scale. IEEE Internet
Computing 11(2), 95–96 (2007)
47. Weaver, J., Hendler, J.A.: Parallel materialization of the finite RDFS closure for hundreds
of millions of triples. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard,
D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 682–697.
Springer, Heidelberg (2009)
48. Sirin, E., et al.: Pellet: a practical owl-dl reasoner. Journal of Web Semantics (2013),
http://www.mindswap.org/papers/PelletJWS.pdf
49. Cheptsov, A., Koller, B.: JUNIPER takes aim at Big Data. inSiDE - Journal of Innovatives
Supercomputing in Deutschland 11(1), 68–69 (2011)
ReHRS: A Hybrid Redundant System for Improving
MapReduce Reliability and Availability
1 Introduction
MapReduce performs a job by breaking it into smaller map tasks and reduce tasks, running these tasks in parallel on a large-scale cluster of commodity machines, called a MapReduce cluster, and utilizing a distributed file system, such as the Google File System [7] or the Hadoop Distributed File System [8], to store the job's input and output data. In general, a MapReduce implementation, e.g., Apache Hadoop [2], has two master servers. One is called JobTracker, which coordinates all jobs running on the MapReduce cluster, performs task assignment for each job, and monitors the progress of all map and reduce tasks. The other is called NameNode, which manages the distributed file-system namespace and processes all read and write requests. These two master servers can run either on the same machine or on two separate machines. But the machine(s) might fail or crash for various reasons, such as hardware and/or software faults, network link issues, and bad configuration [9]. Yahoo has experienced three NameNode failures caused by hardware problems [9]. In a system with ten thousand highly reliable servers with an MTBF of 30 years, on average one node fails each day [10]. Although JobTracker and NameNode are run on reliable hardware, they may fail some day. When JobTracker or NameNode crashes, the operation of the MapReduce cluster is interrupted, i.e., all MapReduce jobs, regardless of their current states, cannot proceed or be completed. This is unacceptable for time-critical data analysis for decision making and business intelligence. Unreliable JobTracker and NameNode will also impact the operations of those companies using MapReduce to process their data.
Redundancy mechanisms are common methods to improve system reliability [11][12]. Some systems [1][13][14][15][17][18] employ cold-standby redundancy. When a master node fails, a cold-standby node takes over for it. This mechanism can significantly enhance system reliability since the failure rate of a node in its cold-standby mode is zero [19]. However, the cold-standby node has to restart the operation of the master node from scratch since it does not hold any states of the master node, consequently leading to a long downtime. Some other systems [2][16] use warm-standby redundancy to improve reliability and shorten system downtime. Hadoop [2] provides a checkpoint node to periodically back up the file-system namespace of NameNode. When NameNode fails, the namespace copy held by the checkpoint node can be used to manually restart NameNode. But the namespace copy might be out of date when NameNode crashes. To solve this problem, some systems [16][20][21][22] utilize hot-standby redundancy to achieve a fast takeover. Hadoop [2] provides a backup node to maintain the up-to-date copy of the namespace of NameNode at all times. However, the backup node might crash before the failure of NameNode, and would then be unable to continue the operation of NameNode. Besides, Hadoop does not offer any redundancy for its other master server, JobTracker, therefore causing another single point of failure.
In this chapter, we propose a master-server redundancy mechanism called the Reliable Hybrid Redundant System (ReHRS for short), which employs a hot-standby server (HSS for short) and a warm-standby server (WSS for short) to enhance the reliability and availability of the MapReduce master server. The functions of the HSS and WSS are the same as those of the master server, but the two servers do not serve clients and workers while in their standby modes. The HSS synchronizes itself with the master server to achieve a fast takeover and continue any unfinished operations when the master server fails. The WSS periodically wakes up to back up the master server's metadata. After that, it sleeps again to reduce its failure probability. To continue the operations of the master server and HSS when any of them unexpectedly fails, we present a failure detection algorithm for the two servers to mutually detect each other's failure and launch an appropriate takeover process to continue each other's operation. In addition, we introduce a dynamic warmup mechanism for the WSS to warm itself up when discovering that the master server or HSS is unstable. The goal is to enable the WSS to quickly play the role of the HSS before the master server or HSS actually fails.
Extensive experiments and simulations were conducted to compare the ReHRS with three existing state-of-the-art schemes, namely the No-Redundant scheme (NR for short), which is similar to the design of Hadoop [2] for JobTracker, the Hot-Standby-Only scheme (HSO for short) [21], and the Warm-Standby-Only scheme (WSO for short) [16], in terms of takeover delays and impacts on the performance and resource consumption of JobTracker and NameNode. In addition, the performance of the WSS is also evaluated.
The rest of this chapter is organized as follows. Section 2 introduces the background and related work of this chapter. Section 3 presents the ReHRS. The simulation and experimental results are described and discussed in Section 4. Section 5 concludes this chapter and outlines our future studies.
2.1 Background
Figure 1 shows the execution flow of a job J on a MapReduce cluster. In step 1, a client requests a job ID from JobTracker. Then he/she requests a set of worker locations from NameNode in step 2. In step 3, based on the worker locations replied by NameNode, the client stores all his/her job resources, including a job JAR file, configuration files, and a number of input chunks, in the corresponding worker storages. In steps 4 and 5, the client submits J to JobTracker, and JobTracker initiates J, respectively. After retrieving the chunk information of J from NameNode in steps 6 and 7, JobTracker in step 8 assigns each map task of J to a worker, called a mapper, and assigns each reduce task of J to a worker, called a reducer. Before executing its assigned task, a mapper or reducer has to retrieve the corresponding job resources from the distributed file system by consulting NameNode. Each mapper runs the assigned map task to produce intermediate <key,value> results, stores the results locally, and then notifies JobTracker of the location of the results. Once all map tasks are finished, JobTracker informs all the reducers to start their reduce tasks. Each reducer then acquires a part of the intermediate <key,value> results from all mappers, runs the assigned reduce task, and stores the result it generates into the distributed file system with the help of NameNode. After all reducers finish their tasks, JobTracker informs the client of the completion of J.
It is clear that JobTracker and NameNode are two critical components during job execution. If JobTracker fails, clients cannot submit jobs, and mappers cannot notify JobTracker of the locations of the intermediate results that they generate, consequently leaving all subsequent reducers unable to start their tasks. On the other hand, a failed NameNode cannot provide input chunk information for JobTracker to perform task assignments, worker locations for mappers and reducers to obtain their required job resources, or available worker locations for reducers to store the final results, implying that the corresponding jobs cannot proceed or be completed. To ensure normal operation of a MapReduce cluster, both JobTracker and NameNode must be reliable and available from the startup of the system. That is why we would like to enhance the reliability and availability of these two master servers.
Fig. 1. The execution flow of a job on a MapReduce cluster (control and data flows among the client, JobTracker, NameNode, mappers, and reducers)
Cold-Standby Redundancy
Many systems [1][13][14][15][17][18] used cold-standby mechanisms to achieve higher reliability. For example, Zheng [17] enhanced MapReduce fault tolerance by assigning several backup nodes to each map task based on data locality [26]. When a map task fails on one node, one of its backup nodes will retrieve the required data chunk from local or nearby disks and perform the task immediately. But the task has to be executed from the very beginning. To prevent a straggler, i.e., a node with poor performance, from slowing down a job execution, MapReduce [1] additionally executes a backup task for each task on another available cold-standby node so that the corresponding job can be completed more quickly. The current MapReduce implementation only adopts a cold-standby mechanism to provide fault tolerance for its workers, rather than for its master server. The main reason is that a cold-standby node does not keep any state of the master server. Hence, when the master server fails, the cold-standby node cannot take over for it and recover it to the state before the failure occurred, implying that the takeover is useless.
Warm-Standby Redundancy
In a warm-standby redundancy mechanism, the states of the master server are periodi-
cally replicated to a warm-standby node. After that, the node sleeps to reduce its fail-
ure probability so as to improve system reliability. When the master server fails, the
state replica can be used to restart the operation of the master server. The checkpoint
node provided by Hadoop [2] is an example. It periodically backs up the namespace
of NameNode. When NameNode fails, the namespace copy held by the checkpoint
node can be used to restart NameNode, but the checkpoint node might not be able to
provide the latest namespace when NameNode fails, consequently failing to provide a
complete takeover. Besides, automatic takeover is not supported by the checkpoint
node. This scheme, as the WSO mentioned above, will be evaluated later in this chap-
ter. The warm-standby redundancy has been also employed by other systems. For
instance, Leu et al. [16] improved the intrusion detectors’ fault-tolerant capability in
their grid intrusion detection platform by deploying a set of detectors. When one de-
tector fails, another will be assigned to continue the failed detector’s unfinished task.
Hot-Standby Redundancy
Some systems [2][16][20][21][22] utilized hot-standby mechanisms (also called ordi-
nary parallel [12]) to speedup their failover processes. Hadoop [2] provides a backup
node to maintain an in-memory, up-to-date copy of the file-system namespace of
NameNode. The backup node can continue the operation of NameNode when
NameNode fails. Nevertheless, the backup node might crash before the failure of
NameNode since these two nodes run simultaneously in parallel, therefore, resulting
in insufficient reliability and availability for MapReduce. Besides, Hadoop does not
offer any redundant mechanism for its another master server JobTracker, thus leading
to another single point of failure. Paratus [20] introduced an instantaneous failover
approach by frequently replicating a primary virtual machine’s states to a backup
virtual machine. When the primary one crashes, the backup can immediately takeover
for it. In the backup scheme [21], one of a set of hot-standby nodes is elected to
takeover for a failed primary server so that Hadoop operation can be continued. How-
ever, the takeover process might not be able to finish in a short period of time since
the new primary server before providing its services has to setup IP configuration and
retrieve complete transient metadata from other worker nodes. This scheme, as the
HSO mentioned above, will be evaluated later in this chapter. Alvaro et al. [22] also
used several hot-standby nodes to provide NameNode with a fast failover in their
presented data-centric programming model, but they did not address JobTracker
failure.
Figure 2 depicts the architecture of the ReHRS, in which the master server and HSS periodically send heartbeats to each other and detect each other's failure. When discovering that the other server is not stable, they request the WSS to warm itself up. When making sure that the other server has failed, the surviving server initiates a corresponding takeover process to continue the operation of the failed one.
Fig. 2. The architecture of the ReHRS: the master server and the HSS exchange heartbeats, with the WSS on standby
Sections 3.1, 3.2, and 3.3 describe, respectively, what the master server's metadata is composed of, how the HSS synchronizes its status with the master server, and how the WSS periodically backs up the metadata. The proposed failure detection algorithm, warmup mechanism, and takeover processes are detailed in Sections 3.4, 3.5, and 3.6.
3.1 Metadata
In the ReHRS, the master server maintains two types of metadata. The first is persistent metadata (P-metadata for short), which is the execution result of an operation and is infrequently or never changed after it is generated. The master server usually keeps it in its P-metadata file/database. For example, the information concerning jobs and the access control lists of files are, respectively, JobTracker's and NameNode's typical P-metadata.
The other type is transient metadata (T-metadata for short), which refers to frequently updated data. Hence, it is usually kept in the master server's memory to accelerate all required processing. For instance, the progresses/statuses of jobs reported by the cluster workers that run these jobs and the storage locations of data chunks reported by the cluster workers that hold these chunks are, respectively, JobTracker's and NameNode's T-metadata.
A log record Lv, as shown in Table 1, consists of five fields. The timestamp keeps the time point when Lv is generated, the client/worker ID indicates the client/worker that requested the operation, and the operation type field shows which type Lv belongs to. Lv can be of any of the following three operation types: initiated operation (IOP for short), response-required finished operation (RR-FOP for short), and response-unrequired finished operation (RU-FOP for short).
IOP means that Lv records information concerning an operation initiated by the master server. Since the execution result (i.e., the P-metadata) has not been generated yet, the P-metadata field is null. RR-FOP means that Lv records information concerning an operation requested by a client/worker that has been finished by the master server, so the P-metadata and client/worker ID fields must be filled with the corresponding values. RU-FOP means that Lv records information concerning an operation both initiated and finished by the master server. Hence, the client/worker ID field must be null, but the P-metadata field must be non-null. The synchronization process between the master server and the HSS is as follows.
1. The master server first inserts Lv into its journal and sends Lv to the HSS.
2. On receiving Lv, the HSS inserts it into its journal and checks the operation type of Lv.
(a) If it is "IOP", the HSS marks the state of Lv as "completed" and replies to the master server with a message <Lv, "IOP", "cmp">, where cmp stands for completion, telling the master server that it has successfully stored Lv. With the record, the HSS can realize that the corresponding operation has been initiated.
(b) If it is "RR-FOP", implying that the P-metadata and client/worker ID of Lv are not null, the HSS updates its own P-metadata file/database with the P-metadata recorded in Lv, marks the state of Lv as "synchronous", replies to the master server with a message <Lv, "RR-FOP", "syn">, and waits for the completion message of the operation returned by the master server.
(c) If it is "RU-FOP", the HSS updates its own P-metadata file/database with the P-metadata recorded in Lv, replies to the master server with a message <Lv, "RU-FOP", "cmp">, and marks the state of Lv as "completed", indicating that this operation has been completely performed and can be ignored when the HSS acts as the master server in the future.
3. On receiving the reply R from the HSS, the master server checks R.
(a) If R = <Lv, "IOP", "cmp">, the master server marks the state of Lv as "completed" to remind itself that the HSS has recorded Lv.
(b) If R = <Lv, "RR-FOP", "syn">, the master server immediately updates its own P-metadata file/database with the P-metadata recorded in Lv, responds to the corresponding client/worker with the P-metadata, and marks the state of Lv as "completed" to remind itself that the HSS has recorded the operation. After that, it replies a message <Lv, "cmp"> to the HSS.
(c) If R = <Lv, "RU-FOP", "cmp">, the master server immediately updates its own P-metadata file/database with the P-metadata recorded in Lv and marks the state of Lv as "completed".
4. Upon receiving <Lv, "cmp">, the HSS marks the state of Lv as "completed", indicating that it can ignore this operation when taking over for the master server.
The above process has three synchronization flows, as illustrated in Figure 3. With this process, the HSS can synchronize itself with the master server. Further, through all the log records it maintains, the HSS can realize which operation has been initiated or finished and whether the corresponding response has been returned to the requester. This information also enables the HSS to continue any unfinished operations when the master server fails.
Fig. 3. The three synchronization flows: (a) if Lv is the "IOP" type, (b) if Lv is the "RU-FOP" type, (c) if Lv is the "RR-FOP" type
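For illustration only, the record structure and the states used above could be modelled as in the following hypothetical sketch (all names are our assumptions, not the authors' implementation); the same class is reused in the later sketches below.

// Hypothetical model of a journal log record with the five fields described above.
public class LogRecord {
    public enum OperationType { IOP, RR_FOP, RU_FOP }
    public enum State { PENDING, SYNCHRONOUS, COMPLETED }

    public long operationId;        // identifies the operation this record belongs to
    public long timestamp;          // time point at which the record was generated
    public String clientWorkerId;   // requester of the operation; null for RU-FOP records
    public OperationType type;      // IOP, RR-FOP, or RU-FOP
    public byte[] pMetadata;        // execution result (P-metadata); null for IOP records

    public State state = State.PENDING;  // set to SYNCHRONOUS/COMPLETED during synchronization
}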
On the other hand, upon receiving T-metadata TM from cluster workers, the master server uses TM to update its in-memory T-metadata and forwards TM to the HSS without generating a log record. On receiving TM, the HSS accordingly updates its in-memory T-metadata without replying with a synchronization message. The purpose is to reduce the burden on the master server and the underlying network, since T-metadata is updated frequently.
Note that the WSS does not back up the master server's T-metadata, for the reason stated above.
Due to unpredictable errors or faults, both the master server and the HSS might fail at any moment. In the ReHRS, the two servers mutually send a heartbeat to each other every predefined sending time period (STP) to indicate that they are still alive and available. We assume that at least one network link is available for the master server, HSS, and WSS to communicate, and that the heartbeat transmission delays between the master server and the HSS are constant. In other words, the two servers can receive each other's heartbeats in each receiving time period (RTP) when both of them operate normally, where RTP≈STP. However, due to being busy or unstable, any of them might delay sending its heartbeats.
The master server and HSS use the failure detection algorithm listed in Figure 4 to detect each other's failure. Whenever an RTP times out, each of them, denoted by E, checks whether it has received a heartbeat from the other, denoted by Q, during that RTP. If not, a counter of consecutive missed heartbeats is increased by 1; this counter represents the total number of consecutive heartbeats that E has not received from Q.
When the counter reaches a first predefined threshold, implying that E has not received Q's heartbeats for that many consecutive RTPs, E assumes that Q is unstable, claims itself a commander, and requests the WSS to warm itself up. This mitigates the risk that, due to holding out-of-date metadata, the WSS might be unable to make itself ready to act as the HSS in a short period of time. The details of the WSS's warmup mechanism will be described later. When the counter reaches a second, larger threshold, E assumes that Q has failed. If E is the master server, it informs the WSS to take over for Q, i.e., the HSS, without changing its own role. However, if E is the HSS, it immediately takes over for the master server and requests the WSS to act as the HSS.
This dynamic property enables the WSS to update its metadata/status to the latest one before it is requested to act as the HSS, so as to speed up its takeover process.
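A minimal sketch of this counting logic is given below (the threshold values and method names are assumptions for illustration; the actual algorithm is the one listed in Figure 4).

// Hypothetical sketch of the mutual failure detection loop run by E with respect to its peer Q.
public class FailureDetector {
    private static final int THRESHOLD_WARMUP = 3;   // assumed value: Q is considered unstable
    private static final int THRESHOLD_FAILED = 6;   // assumed value: Q is considered failed

    private int missedHeartbeats = 0;   // consecutive heartbeats not received from Q

    /** Called by E whenever an RTP expires. */
    void onReceivingPeriodTimeout(boolean heartbeatReceived) {
        if (heartbeatReceived) {
            missedHeartbeats = 0;
            return;
        }
        missedHeartbeats++;
        if (missedHeartbeats == THRESHOLD_WARMUP) {
            // Q looks unstable: E claims itself a commander and asks the WSS to warm up.
            requestWssWarmup();
        } else if (missedHeartbeats == THRESHOLD_FAILED) {
            // Q is assumed failed: launch the appropriate takeover process.
            initiateTakeover();
        }
    }

    void requestWssWarmup() { /* send a warm-up request to the WSS */ }
    void initiateTakeover() { /* master: let the WSS replace the HSS; HSS: take over the master and promote the WSS */ }
}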
When both the master server and the HSS are unstable, the WSS may be requested to warm itself up by both of them. In this situation, the WSS might receive redundant P-metadata sent by the two commanders. To prevent the WSS from storing duplicate log records and disordering the sequence of P-metadata, each time the WSS receives a log record from a commander, it checks whether it has already received the record. The process is as follows (a sketch of the duplicate filtering is given after this list).
1. The WSS sends a message <"warm-up", s> to the commander, where s is the largest ID of the log records the WSS currently holds.
2. Upon receiving <"warm-up", s>, the commander retrieves log record Li from its journal, computes a hash value hi for Li, and sends the pair <Li, hi> to the WSS, where i ranges from s + 1 to m, and m is the largest ID of the log records collected in the commander's journal. Meanwhile, the commander sends the T-metadata that it currently has to the WSS.
3. On receiving the T-metadata, the WSS immediately loads it into its memory.
4. On receiving <Li, hi>, the WSS compares hi with all the hash values previously stored in its hash pool. If hi equals a hash value already in the hash pool, Li will be dropped. Otherwise, the WSS inserts Li into its journal and stores hi in the hash pool.
5. Following the sequence i = s + 1, s + 2, ..., m, the WSS checks the operation type of Li to see whether it needs to update its P-metadata file/database. If Li is of the "IOP" type, the WSS skips the update. Otherwise, it performs the update.
When the commander requests the WSS to stop warming up, it stops sending data to the WSS. Then the WSS goes to sleep once it completes the current warm-up process.
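The sketch below illustrates the duplicate filtering of step 4 under assumed names (the hash representation and methods are hypothetical); since records arrive from a commander in increasing ID order, applying them on arrival preserves the P-metadata sequence.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the WSS's handling of <log record, hash> pairs during warm-up.
public class WarmupReceiver {
    private final Set<String> hashPool = new HashSet<>();

    void onLogRecord(LogRecord record, String hash) {
        // A hash already present means the record was received from the other commander: drop it.
        if (!hashPool.add(hash)) {
            return;
        }
        insertIntoJournal(record);
        // IOP records carry no P-metadata; RR-FOP and RU-FOP records update the file/database.
        if (record.type != LogRecord.OperationType.IOP) {
            applyPMetadata(record);
        }
    }

    void insertIntoJournal(LogRecord r) { /* append to the WSS journal */ }
    void applyPMetadata(LogRecord r)   { /* update the P-metadata file/database */ }
}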
nodes will act as the WSS. The state {N1M, N2H, N3W} transiting to {N1M, N2n/a, N3H}
implies that N2 fails, N1 continues serving as the master server, and N3 acts as the
HSS. Thus <Maft, Haft, Waft>=<N1, N3, n/a>. When {N1n/a, N2M, N3H} transits to {N1n/a,
N2M, N3n/a}, or {N1M, N2n/a, N3H} changes to {N1M, N2n/a, N3n/a}, no takeover will be
performed since no nodes can substitute for the failed ones. When a state changes to
{N1n/a, N2n/a, N3M}, N3 will be the master server, i.e., <Maft, Haft, Waft>=<N3, n/a, n/a>.
If a state turns to {N1n/a, N2n/a , N3n/a}, <Maft, Haft, Waft>=<n/a, n/a, n/a>.
Fig. 5. The state transition graph for nodes N1, N2, and N3 during the lifetime of the ReHRS, where the number 0 shown on an arrow means that all nodes operate normally, and 1, 2, and 3 represent that N1, N2, and N3 fail, respectively
The Process of Taking over for the HSS
Haft takes over for Hbef only when Haft ≠ Hbef, Hbef fails, and Haft is not n/a. The takeover process is as follows. Haft changes its IP address to Hbef's and notifies Maft (recall, the master server) of the completion of the takeover. Then it starts all the HSS's functions, including receiving the P-metadata and T-metadata sent by Maft, synchronizing P-metadata with Maft, sending its heartbeats to Maft, and initiating the failure detection algorithm to monitor Maft's heartbeats.
However, Haft might still be warming itself up when it is requested to take over for Hbef, i.e., it has not yet finished the warmup process. In this situation, Haft keeps updating its in-memory T-metadata with the new T-metadata received from Maft. For each newly received log record sent by Maft, Haft buffers it in its memory and instantly responds to Maft with a corresponding synchronization or completion message. All buffered log records will be sequentially inserted into Haft's journal, and the non-null P-metadata conveyed in these log records will be sequentially applied to Haft's P-metadata file/database when the warmup process completes. With this takeover process, the response delays of Maft can be dramatically reduced, and the sequence of P-metadata can be preserved.
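A hypothetical sketch of this buffering behaviour (method names assumed; not the authors' code, and reusing the LogRecord sketch above) could look as follows.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of Haft's behaviour when the takeover request arrives before its
// warm-up has completed: new log records are buffered and acknowledged immediately,
// then applied in order once the warm-up finishes.
public class HssTakeoverDuringWarmup {
    private final Deque<LogRecord> buffered = new ArrayDeque<>();
    private boolean warmupFinished = false;

    void onLogRecordFromMaster(LogRecord record) {
        if (!warmupFinished) {
            buffered.addLast(record);          // keep the arrival order
        } else {
            applyToJournalAndPMetadata(record);
        }
        acknowledge(record);                   // respond to Maft right away to keep its delays low
    }

    void onWarmupCompleted() {
        while (!buffered.isEmpty()) {
            applyToJournalAndPMetadata(buffered.pollFirst());   // sequential replay preserves order
        }
        warmupFinished = true;
    }

    void applyToJournalAndPMetadata(LogRecord r) { /* insert into journal, apply non-null P-metadata */ }
    void acknowledge(LogRecord r) { /* send the synchronization/completion message to Maft */ }
}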
The Process of Taking over for the Master Server
Maft takes over for Mbef only when Maft ≠ Mbef, Mbef fails, and Maft is not n/a. The process is as follows. First, Maft changes its IP address to Mbef's and starts serving clients
and workers. Maft will also replicate new T-metadata to Haft, synchronize new log
records with Haft, send its heartbeats to Haft, and initiate the failure detection algorithm to monitor Haft's heartbeats if Haft is not n/a. Further, Maft has to continue those operations left unfinished by Mbef. As mentioned above, a completed operation has two log records that indicate its beginning and its completion. Maft first extracts the operation ID from each "IOP" type log record and looks for the other log record with the same operation ID. If Maft cannot find such a record, Maft re-performs the corresponding operation, since it considers that Mbef has not finished it. If Maft finds that the second record is an "RU-FOP" record, Maft ignores it, because the corresponding operation has been completed by Mbef. Similarly, if it is an "RR-FOP" record and its state is "completed", the record will be neglected. But if the state is "synchronous", implying that Mbef failed before responding to the corresponding client/worker, Maft immediately responds to the client/worker with the P-metadata conveyed in the record. By dealing with the unfinished operations in this way, Maft continues what has not been done by Mbef. Therefore, all running jobs can proceed and be completed successfully.
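Reusing the hypothetical LogRecord sketch introduced earlier, the journal replay described above could be sketched as follows (again an illustration under assumed names, not the actual implementation).

import java.util.List;

// Hypothetical sketch of how Maft might replay the journal after taking over for Mbef.
public class MasterTakeover {
    void continueUnfinishedOperations(List<LogRecord> journal) {
        for (LogRecord iop : journal) {
            if (iop.type != LogRecord.OperationType.IOP) continue;
            LogRecord finish = findFinishRecord(journal, iop.operationId);
            if (finish == null) {
                reperform(iop);                                   // Mbef never finished the operation
            } else if (finish.type == LogRecord.OperationType.RR_FOP
                       && finish.state == LogRecord.State.SYNCHRONOUS) {
                respondToClient(finish.clientWorkerId, finish.pMetadata);  // reply was never sent
            }
            // RU-FOP records and "completed" RR-FOP records are ignored: already done by Mbef.
        }
    }

    LogRecord findFinishRecord(List<LogRecord> journal, long opId) {
        for (LogRecord r : journal) {
            if (r.operationId == opId && r.type != LogRecord.OperationType.IOP) return r;
        }
        return null;
    }

    void reperform(LogRecord r) { /* re-execute the operation */ }
    void respondToClient(String clientWorkerId, byte[] pMetadata) { /* send the stored result */ }
}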
4 Performance Evaluation
We implemented the ReHRS, NR [2], HSO [21], and WSO [16] in Java and built a test cluster consisting of 1030 virtual nodes and 27 switches, as illustrated in Figure 6. All nodes are connected through a 1 Gbps Ethernet network created by the Network Simulator 2 (NS2) [29], which is a simulation tool supporting TCP, routing, and multicast protocols over wired and wireless networks. Each node runs Ubuntu 11.04 with an AMD Athlon(tm) 2800+ CPU, 1 GB of memory, and an 80 GB disk drive. During the experiments, 1024 nodes are deployed as workers. The remaining six nodes, connected to different switches, act as JobTracker, JobTracker's HSS, JobTracker's WSS, NameNode, NameNode's HSS, and NameNode's WSS when the ReHRS is tested. When the HSO (WSO) is evaluated, the six nodes act as JobTracker, JobTracker's two hot-standby servers (two warm-standby servers), NameNode, and NameNode's two hot-standby servers (two warm-standby servers). When the NR is employed, two of the six nodes act as JobTracker and NameNode.
because the HSS is always synchronous with JobTracker. The takeover delay only comprises the time of detecting JobTracker's failure and configuring the HSS's IP address to JobTracker's. According to our estimations, repeated 30 times, the average IP configuration time is about 294 ms with a standard deviation of 24 ms for these STPs, implying that the ReHRS's takeover delays are mainly determined by the STP. The speedup of the ReHRS's takeover delay on STP=1 ms is 9.4 (=2.82/0.30) times that on STP=1000 ms, 1.83 (=0.55/0.30) times that on STP=100 ms, and 1.07 (=0.32/0.30) times that on STP=10 ms, showing that the ReHRS with a smaller STP can considerably speed up its takeover process.
                     JT-500   JT-1000   JT-2000   JT-4000
ReHRS on STP=1         0.30      0.30      0.30      0.30
ReHRS on STP=10        0.32      0.32      0.32      0.32
ReHRS on STP=100       0.55      0.55      0.55      0.55
ReHRS on STP=1000      2.82      2.82      2.82      2.82
HSO on STP=1           2.68      4.99      9.66     18.96
HSO on STP=10          2.70      5.02      9.69     18.98
HSO on STP=100         2.93      5.25      9.91     19.21
HSO on STP=1000        5.25      7.50     12.12     21.52
WSO on STP=1          17.92     20.50     25.30     34.76
WSO on STP=10         18.13     20.76     24.95     34.61
WSO on STP=100        18.21     20.25     24.98     34.23
WSO on STP=1000       20.72     23.20     27.54     37.49

Fig. 7. Average takeover delays (sec) for the ReHRS, HSO, and WSO on different JobTracker failure types and STPs (ms). Note that the takeover delays of the NR are not shown since this scheme does not provide any redundancy mechanism.
The average takeover delay of the HSO is higher as the number of running jobs increases, since the HSO's hot-standby server only maintains P-metadata rather than T-metadata, i.e., the HSO's takeover delay consists not only of the JobTracker failure detection time and the IP address configuration time, but also of JobTracker's T-metadata retrieval time. The speedups of the takeover delays on STP=1 ms range from 1.96 (=5.25/2.68) to 1.14 (=21.52/18.96) times those on STP=1000 ms, range from 1.09 (=2.93/2.68) to 1.01 (=19.21/18.96) times those on STP=100 ms, range from 1.01
Table 3. The speedups of the takeover delays of the ReHRS as compared with those of the HSO on different JobTracker failure types

STP (ms)   JT-500   JT-1000   JT-2000   JT-4000
1            8.93     16.63     32.20     63.20
10           8.44     15.69     30.28     59.31
100          5.33      9.55     18.02     34.93
1000         1.86      2.66      4.30      7.63
Table 4. The speedups of the takeover delays of the ReHRS as compared with those of the WSO on different JobTracker failure types

STP (ms)   JT-500   JT-1000   JT-2000   JT-4000
1           59.73     68.33     84.33    115.87
10          56.66     64.88     77.97    108.16
100         33.11     36.82     45.42     62.24
1000         7.35      8.23      9.77     13.29
Figure 8 shows the average takeover delays for all tested schemes (except the NR)
on the four NameNode failure types. Similarly, the size of T-metadata occupying
NameNode’s memory does not influence the ReHRS’s takeover delays on each STP.
Hence, the ReHRS’s results are identical to those shown in Figure 7. But these
NameNode failure types considerably impact the takeover delays of the HSO and
WSO, especially when NameNode holds more and more T-metadata in its memory.
The results also indicate that these STPs are unable to lower the HSO’s and WSO’s
takeover delays.
To show the impacts of these schemes with different STPs on the MapReduce master server, we measured the average CPU utilization, memory utilization, read-request processing time, and write-request processing time of the master server. Here, NameNode was chosen as the target since its performance in processing read/write requests is easily influenced by the STP. Note that NameNode is tested under the same workload mentioned above.
Figures 9 and 10, respectively, plot the average CPU and memory utilizations of NameNode when these schemes are employed with different STPs. The NR, for any STP, did not affect the CPU and memory consumption of NameNode in processing the workload, for the reason stated above. When the HSO is employed, more and more of NameNode's resources are utilized as the STP decreases. This is because NameNode has to send its heartbeats to the two hot-standby servers and meanwhile detect their failures. The WSO causes the least overhead for NameNode as compared with the ReHRS and HSO. Nevertheless, NameNode has a heavy load when STP=1 ms, because NameNode has to frequently send its heartbeats to the monitor node employed by the WSO.
                     NN-50%   NN-60%   NN-70%   NN-80%
ReHRS on STP=1         0.30     0.30     0.30     0.30
ReHRS on STP=10        0.32     0.32     0.32     0.32
ReHRS on STP=100       0.55     0.55     0.55     0.55
ReHRS on STP=1000      2.82     2.82     2.82     2.82
HSO on STP=1          12.53    14.93    17.37    17.37
HSO on STP=10         12.55    14.95    17.39    17.39
HSO on STP=100        12.78    15.17    17.62    17.62
HSO on STP=1000       15.04    17.46    19.86    19.86
WSO on STP=1          27.77    30.68    32.92    32.92
WSO on STP=10         27.84    30.49    32.94    32.94
WSO on STP=100        28.32    30.98    33.25    33.25
WSO on STP=1000       30.04    33.21    35.96    35.96

Fig. 8. Average takeover delays (sec) for the ReHRS, HSO, and WSO on different NameNode failure types and STPs (ms). Similarly, the takeover delays of the NR are absent for the same reason stated above
Fig. 9. The CPU utilizations (%) of NameNode when the ReHRS, HSO, and WSO are employed with different STPs
Fig. 10. The memory utilizations (%) of NameNode when the ReHRS, HSO, and WSO are employed with different STPs
Figures 11 and 12, respectively, show the average read-request and write-request processing times of NameNode when these schemes are employed with different STPs. Because the NR does not run any failure detection, its performance exactly reflects the performance of NameNode and is therefore used as a baseline for comparison. When STP=1 ms, the ReHRS, HSO, and WSO all result in high request processing times, but the HSO consumed the most processing time. When STP=10 ms, the processing times of the ReHRS and WSO decreased dramatically, but the processing time of the HSO was still very high. The reason is that the frequent heartbeat sending and receiving in the HSO burdens NameNode, consequently decreasing NameNode's performance. When the STP increases to 100 ms or 1000 ms, NameNode's request-processing times with the four schemes are very close, implying that the ReHRS, HSO, and WSO with these two STP settings did not impact the performance of NameNode.
The above results show that the STP is an influential factor determining NameNode's performance in processing read and write requests. A smaller STP consumes more of NameNode's resources and decreases NameNode's performance, while a larger STP has less impact on NameNode's resource consumption and performance.
          STP=1   STP=10   STP=100   STP=1000
ReHRS     124.7      3.4       1.5        1.0
HSO       189.3    102.9       2.1        1.0
WSO       122.6      3.0       1.4        1.1
NR          1.0      1.0       1.0        1.0

Fig. 11. The read-request processing times (ms) of NameNode when all tested schemes are employed with different STPs
          STP=1   STP=10   STP=100   STP=1000
ReHRS     196.9      5.7       5.8        1.9
HSO       312.6    199.0      11.0        1.9
WSO       153.6      5.4       3.0        1.8
NR          1.8      1.8       1.8        1.8

Fig. 12. The write-request processing times (ms) of NameNode when all tested schemes are employed with different STPs
where Ttotal, as shown in Figure 13, is the total time required by the WSS to finish its warmup process when it receives a warmup request from the commander. If the failure occurs right after the WSS finishes its latest P-metadata update, Ttotal comprises only the T-metadata retrieval time. However, if the failure occurs when the WSS has just started to update its P-metadata, Ttotal includes both the P-metadata and the T-metadata retrieval times. Tearly is the time period from the moment when the WSS is requested to warm itself up to the moment when it is requested to take over for the HSS. During Tearly, the WSS keeps warming itself up to retrieve the P-metadata and T-metadata that it lacks. Given a fixed Ttotal, the remaining warmup time Tremain will be longer if Tearly is shorter. On the contrary, if the master server/HSS is unstable for a long time before it fails, Tearly will be longer, meaning that the WSS has more time to make itself ready for acting as the HSS before the master server/HSS actually fails.
Fig. 13. The Ttotal, Tearly, and Tremain for the ReHRS's WSS, in which Q is the master server if E is the HSS, and Q is the HSS if E is the master server
We evaluated the Ttotal of the WSS for each failure type listed in Table 2 with STP=1000 ms, 30 times each, to obtain the average Ttotal. The results are presented in Table 5. Then we use the five different cases shown in Table 6 to estimate the average Tremain for the WSS. Each case represents how long JobTracker and NameNode behave unstably before they crash.

Case   Description
1      Tearly is 2 STPs, i.e., Tearly = 2 sec.
2      Tearly is 4 STPs, i.e., Tearly = 4 sec.
3      Tearly is 8 STPs, i.e., Tearly = 8 sec.
4      Tearly is 16 STPs, i.e., Tearly = 16 sec.
5      Tearly is 32 STPs, i.e., Tearly = 32 sec.
Figures 14 and 15 illustrate the average Tremain for the WSS. For each case, Tremain increases when more jobs are running or more T-metadata occupies NameNode's memory. This is because the WSS needs more time to retrieve the T-metadata it lacks. For each failure type, it is clear that the cases with a longer Tearly lead to a shorter Tremain. In Case 5, the WSS can almost finish its warmup process before JobTracker and NameNode actually fail. The results demonstrate that the warmup mechanism can adapt to the unstable behaviors of JobTracker and NameNode and enables the WSS to proactively warm itself up and act as the HSS quickly.
During Tremain, the WSS is unlikely to fail since its reliability is high. For example, assume that the WSS in its warmup mode follows a Poisson process with a failure rate λ = 0.0001 per hour; then, for Tremain = 33.05 sec (i.e., the maximum Tremain in our results), the reliability of the WSS is about 0.99999 (= e^(-0.0001 × 33.05/3600)), implying that the WSS has a very high probability of finishing its warmup process and fully acting as the HSS. Similarly, the surviving server (i.e., E shown in Figure 13) is unlikely to crash during Tremain if it has the same failure rate, implying that the operation of the MapReduce master server can be continued.
           JT-500   JT-1000   JT-2000   JT-4000
Case 1      15.60     18.20     23.02     32.48
Case 2      13.60     16.22     21.02     30.48
Case 3       9.60     12.22     17.02     26.48
Case 4       1.60      4.22      9.02     18.48
Case 5       0.00      0.00      0.00      2.48

Fig. 14. The average Tremain (sec) for the ReHRS's WSS on different JobTracker failure types with STP=1000 ms
           NN-50%   NN-60%   NN-70%   NN-80%
Case 1      25.42     28.34     30.56     33.05
Case 2      23.43     26.34     28.56     31.05
Case 3      19.43     22.34     24.56     27.05
Case 4      11.43     14.34     16.56     19.05
Case 5       0.00      0.00      0.56      3.05

Fig. 15. The average Tremain (sec) for the ReHRS's WSS on different NameNode failure types with STP=1000 ms
In this chapter, we proposed the ReHRS to conquer the single-point-of-failure problem for MapReduce and to improve MapReduce reliability and availability by employing the HSS to maintain the latest metadata of the master server and utilizing the WSS to further extend the MapReduce lifetime. To show that the ReHRS is capable of providing a fast takeover, we evaluated the ReHRS and three schemes, the NR, HSO, and WSO, in a large simulated MapReduce cluster given eight different JobTracker and NameNode failure types and four different STPs. The simulation and experimental results show that the ReHRS's takeover delays are shorter than those of the NR, HSO, and WSO regardless of which failure type or STP is employed, implying that the ReHRS can effectively raise the availability of the MapReduce master server. In addition, the results demonstrate that the warmup mechanism enables the WSS to warm itself up while JobTracker and NameNode are unstable and shortens the time required for it to act as the HSS.
Analysis and Visualization of Large-Scale Time Series
Network Data
Abstract. Large amounts of data (“big data”) are readily available and collected
daily by global networks worldwide. However, much of the real-time utility of
this data is not realized, as data analysis tools for very large datasets, particular-
ly time series data, are cumbersome. A methodology for data cleaning and prep-
aration needed to support big data analysis is presented, along with a compara-
tive examination of three widely available data mining tools. This methodology
and the selected tools are used for the analysis of a large-scale time series dataset of en-
vironmental data. The case study of environmental data analysis is presented as
visualization, providing future direction for data mining on massive data sets
gathered from global networks, and an illustration of the use of big data tech-
nology for predictive data modeling and assessment.
1 Introduction
Increasingly large data sets are resulting from global data networks. For example, the
United States government’s National Oceanic and Atmospheric Administration
(NOAA) compiles daily readings of weather conditions from monitoring stations
located around the world. These records are freely available. While such a large
amount of data is readily available, the value of the data is not always evident. In this
chapter, large time series data was mined and analyzed using data mining algorithms
to find patterns. In this specific instance, the patterns identified could result in better
weather predictions in the future.
Building on earlier work [1,2,3,4], two separate datasets from NOAA were used.
One was the Global Summary of Day (GSOD) dataset [5], which currently has data
from 29,620 stations. The second was the Global Historical Climatology Network
(GHCN) dataset [6], which currently has data from 77,468 stations. In previous pro-
jects, the datasets were downloaded from NOAA’s FTP servers and made locally
available on our local database server. Each of the GSOD stations collects 10 different
types of data (precipitation, snow depth, wind speeds, etc.), while the GHCN stations
can collect over 80 different types of data, although the majority only collect 3 or 4
types. Some stations have been collecting data for over 100 years. Both datasets com-
bined consist of over 2.62 billion rows (records) in our database.
Data from both datasets was mined using Weka, RapidMiner, and Orange, which
are free data mining programs. Each of these programs has a variety of data mining
algorithms which were applied to our data. However, before mining, the data had to
be converted into a format which allowed it to be properly mined. The problem was
solved by developing custom Java programs to rearrange the data.
In this illustrated example, the goal was to find any patterns existing in the data,
help predict future significant weather events (snowstorm, hurricane, etc.), and visual-
ize results in a meaningful format. It is also hoped that this work mining large-scale
datasets will help others do the same with any dataset of similar magnitude. Although
global data was available, the portion of the dataset used was New Jersey, as it has its
own special microclimate. By using data from some of New Jersey’s extreme weather
events as starting points, this research was able to look for patterns that may assist in
predicting such events in the future. Examples of extreme events in New Jersey in-
clude the unpredicted snowstorm of October 2011, the December 26, 2010 snowstorm
(24”-30” accumulation), and a tornado Supercell that hit the state in August 2008.
The objective was to use the very large amount of data previously imported into a
local database server and run data mining algorithms against it to find patterns. This
approach was similar to one taken to find relationships in medical data from patients
with diabetes [7]. Fields such as telemedicine and environmental sustainability offer
great opportunities for big data analysis and visualization. For environmental big data
analysis, a variety of popular data mining software was used to evaluate the data to
see if one product provided superior results. The software products used were Weka
[8], RapidMiner [9], and Orange [10]. Additionally, the raw data was graphed
[11, 12] to see if any patterns were identified through visual inspection which the
mining software might overlook. By mining the data, a trend was expected in envi-
ronment/weather. Possible results could be evidence of global warming, colder win-
ters, warmer summers (heat waves), stronger/weaker storms, or more/less large
storms. This chapter is an extension of [1, 2], with a specific application of environ-
mental sustainability.
Neighborhood residents and regular visitors to the area may generally know
the hazards of a particular urban spot, but sharing the knowledge of destructive or
hazardous patterns with organizations which might be able to remediate or prevent
such regular environmental degradation is not easily done. Rather, at each instance of
an environmental threat, the flooded underpass or poor air is addressed as a public
safety crisis and personnel and resources are deployed in an emergency manner to
provide appropriate traffic rerouting or medical attention.
In addition to critical events which threaten urban environmental conditions, more
insidious, slowly evolving circumstances which may result in future urban crises are
not monitored. For example, traffic volume on highly used intersections or bridges is
not monitored for increasing noise or volume, which might result in a corresponding
increase in hazardous emissions or stress fractures. When urban threats are identified,
the response process can be aggravated as traffic comes to a standstill or is rerouted,
hindering or delaying emergency response personnel.
Both immediate and slower threats to urban environmental sustainability are dealt
with on an ‘as occurring’ basis, with no anticipation or preventive action taken to
avert or decrease the impact of the urban threat. The result, over the past years, has
been an increase in urban crisis management, rather than an increased understanding
of how our urban environments could be better managed for best use of all our re-
sources – including resources for public safety and environmental sustainability.
The increasing age of urban infrastructures and the growing awareness of environ-
mental hazards in our midst highlight that the management of environmental threats
on an ‘as needed’ basis is no longer feasible, particularly as the cost of managing an
environmental crisis can exceed the cost of preventing an environmental crisis. With
the potential to gather data in a real-time manner from urban sites, the opportunity for
anticipatory preparation and preventive action prior to urban environmental events
has become possible. Specifically, street level mapping, using real-time information,
is now possible, with the integration of new tools and technology, such as geographic
information systems and sensors.
Environmental sustainability in an urban environment is challenging. While nu-
merous measures of environmental sustainability, including air quality, rainfall, and
temperature, are possible, the group assessment of these measured parameters is not
as easily done. Air quality alone is composed of a variety of measurements, such as
airborne particulate matter (PM10), nitrogen dioxide (NO2), Ozone (O3), carbon
monoxide (CO) and carbon dioxide (CO2). Road traffic is the main cause of NO2 and
CO. While simple environmental solutions such as timing traffic lights are identified
as saving billions in fuel consumption and reducing air-pollution (i.e., improving air
quality) by as much as 20%, the technological underpinnings to accomplish this have
not been developed and deployed on an appropriate scale for urban data gathering and
correlation. Furthermore, the use of predictive models and tools, such as data mining,
to identify patterns in support or opposed to environmental sustainability is not com-
monly done in an urban setting. While hurricane, earthquake, and other extreme
weather events occur and the aftermath is dramatically presented, more mundane but
no less impactful events such as urban flash flooding, chemical spills on city roads,
and other environmental events are not anticipated or measured while occurring or
developing with the intent of reducing their scope and damage. By gathering data locally for
assessment and prediction, areas and events can be identified that might harm envi-
ronmental sustainability. This knowledge can be used to avoid or prevent what might
have previously been an urban environmental disaster.
• Cluster analysis
• Predictive modeling
• Anomaly detection
• Association analysis
Both predictive modeling and anomaly detection are used in the big data analysis pro-
ject detailed here. “Predictive modeling” can be further defined by two types of tasks:
classification, used for discrete target variables, and regression, which is used for
continuous target variables.
Forecasting the future value of a variable, such as would be done in a model of an
urban ecosystem, is a regression task of predictive modeling, as the values being
measured and forecast are continuous-valued attributes. In both tasks of predictive
modeling, the goal is to develop a model which minimizes the error between predict-
ed and true values of the target variables. By doing so, the objective is to identify
crucial thresholds that can be monitored and assessed in real-time so that any action
or alert may be automatic and highly responsive.
“Anomaly detection” is also crucial to the success of big data modeling. Formally
stated, anomaly detection is the task of identifying events or measured characteristics
which differ from the rest of the data or from the expected measurement. These
anomalies are often the source of the understanding of rare or infrequent events.
However, not all anomalies are critical events meriting escalation and further inves-
tigation. A good anomaly detection mechanism must be able to detect non-normal
events or measurements, and then validate such events as being outside of expecta-
tions – a high detection rate and low false alarm rate is desired, as these define the
critical success rate of the application.
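The chapter does not prescribe a particular detection rule, so the following sketch is only an
illustration of flagging measurements "outside of expectations": it marks readings that lie more
than k standard deviations from the interval mean. It is not the mechanism used in this project.

import java.util.ArrayList;
import java.util.List;

public class SimpleAnomalyDetector {
    /** Flags readings that deviate from the mean by more than k standard deviations. */
    public static List<Integer> flagAnomalies(double[] readings, double k) {
        double mean = 0.0;
        for (double r : readings) mean += r;
        mean /= readings.length;

        double var = 0.0;
        for (double r : readings) var += (r - mean) * (r - mean);
        double std = Math.sqrt(var / readings.length);

        List<Integer> anomalies = new ArrayList<>();
        for (int i = 0; i < readings.length; i++) {
            if (std > 0 && Math.abs(readings[i] - mean) > k * std) {
                anomalies.add(i);   // index of a measurement outside expectations
            }
        }
        return anomalies;
    }
}

Raising k lowers the false-alarm rate but also lowers the detection rate, which is exactly the
trade-off described above.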
The overall objective of the big data environmental network implementation is the
gathering of environmental information in real-time and storing the data in a database
so that the data can be visually presented in a geographic context for maximum un-
derstanding. Ideally, the big data environmental network is an implementation of a
wireless environmental sensing network for urban ecosystem monitoring and envi-
ronmental sustainability. By measuring environmental factors and storing the data for
comparison with future data gathered, the changes in data measured over time can be
assessed. Furthermore, if a change in one measured variable is detected, examination
of another measured variable may be needed to correlate the information, and deter-
mine if the measured conditions are declining or advancing over time. Known as
‘exception mining,’ this assessment can also be visually presented in a geographical
context, for appropriate understanding and preventive or divertive action.
A. Visual Presentation
Application development included a collection of data, which was archived into a
database. This was accomplished by using SQL and a relational database system. To
clearly understand and visualize the importance, functionality, and advantages pro-
vided by the wireless sensor network, the data must be clearly represented. In order to
do so, a programming language or framework is needed that provides the ability to
quickly gather data and accurately represent each of our sensors.
After consideration of availability, scalability, and recognition, as well as under-
standing the well-defined API and number of tutorials available, the Google Maps
framework was selected. The Google Maps API allows the user to use Google Maps
on individual websites, with JavaScript. In addition, a number of different utilities
are available.
Google Maps provides an additional advantage, as the nodes represented on the
map and the data contained in each one of the nodes can be set up using XML, and
additional nodes or markers can be added with ease. Once the decision to use Google
Maps was made, the software development effort shifted from data representation
efforts to working on accessing the actual real-time data from the wireless sensor
network server. For Google Maps to process this new data and properly represent it,
an XML file is generated. Fig. 1 shows the exchange between the web server and the
wireless sensor network server.
As data is sent in from the nodes, it is passed through the base station to the Perl
XML Parser, which parses the incoming data and filters out unwanted packets. The
result remaining is the desired data packet set.
Fig. 1. Data exchange between web server and wireless sensor network server
The wireless sensor network data is gathered in a database, which is then presented
in a visual context. Google Maps can be used to present the information in a geo-
graphical context. By clicking on the sensor node, real-time information is presented
to the viewer, in appropriate context. The raw data, collected from the sensor, is ap-
propriately converted to standard units for display and understanding.
In addition to real-time data presentation in a geographical context, a temporal
presentation, using time and date information, has also been developed. An illustra-
tion of this can be seen in Fig. 2. A query by sensor presents the specific details of
that sensor at one point in time, as well as providing a comparison of the sensor's
status for prior dates and times. More than one sensor measurement can be overlaid
on the chart, which permits correlation of events and times precisely with sensors.
While data has been gathered by sensors before, the correlation and presentation of
this environmental real-time information in a geographical context, mined from a very
large dataset, in addition to providing support for historical and temporal comparison,
is innovative.
Visualization of massively large datasets presents two significant problems. First, the
dataset must be prepared for visualization, and traditional dataset manipulation meth-
ods fail due to lack of temporary storage or memory. The second problem is the
presentation of the data in the visual media, particularly real-time visualization of
streaming time series data. Visualization of data patterns, particularly 3D visualiza-
tion, represents one of the most significant emerging areas of research. Particularly
for geographic and environmental systems, knowledge discovery and 3D visualization
are a highly active area of inquiry. Recent advances in association rule mining for time
series data or data streams make 3D visualization and pattern identification on time
series data possible.
In streaming time series data the problem is made more challenging due to the dy-
namic nature of the data. Emerging algorithms permit the identification of time-series
motifs [25] which can be used as part of a real-time data mining visualization applica-
tion. Geographic and environmental systems frequently use sensor networks or other
unmanned reporting stations to gather large volumes of data which are archived in
very large databases [26]. Not all the data gathered is important or significant. How-
ever, the sheer volume of data often clouds and obscures critical data, which causes it
to be ignored or missed.
The research presented here outlines an ongoing research project working to visu-
alize the data from national repositories in two very large datasets. Problems encoun-
tered include dataset navigation, including storage and searching, data preparation for
visualization, and presentation.
Data filtering and analysis are critical tasks in the process of identifying and visual-
izing the knowledge contained in large datasets, which is needed for informed deci-
sion making. This research is developing approaches for time series data which will
permit pattern identification and 3D visualization. Research outcomes include as-
sessment of data mining techniques for streaming time series data, as well as interpre-
tive algorithms, and visualization methods which will permit relevant information to
be extracted and understood quickly and appropriately.
Searching for temporal association rules in time series databases is important for dis-
covering relationships between various attributes contained in the database and time.
Association rules mining provides information in an “if-then” format. Because time
series data is being analyzed for this research, time lags are included to find more
interesting relationships within the data. A software package from Universidad de La
Rioja's EDMANS Group was used to preprocess and analyze the time series from
the NOAA datasets. The software package is called KDSeries and was created using
R, a language for statistical computing.
The KDSeries package contains several functions that preprocess the time series data
so knowledge discovery becomes easier and more efficient. The first step in prepro-
cessing is filtering. The time series are filtered using a sliding-window filter chosen by
the user. The filters included in KDSeries are Gaussian, rectangular, maximum, mini-
mum, median, and a filter based on the Fast Fourier Transform. Important minimum
and maximum points of the filtered time series are then identified. The optima are used
to identify important episodes in the time series. The episodes include increasing, de-
creasing, and horizontal, as well as over, below, or between a user-defined threshold.
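KDSeries itself is an R package; the Java sketch below, which is not part of KDSeries, merely
illustrates the sliding-window idea with a moving-average filter and a simple local-maximum
test. All method names are ours.

public class SlidingWindowFilter {
    /** Moving-average filter with half-width w points on each side (edges use a shorter window). */
    public static double[] movingAverage(double[] series, int w) {
        double[] filtered = new double[series.length];
        for (int i = 0; i < series.length; i++) {
            int lo = Math.max(0, i - w), hi = Math.min(series.length - 1, i + w);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) sum += series[j];
            filtered[i] = sum / (hi - lo + 1);
        }
        return filtered;
    }

    /** A point is a local maximum if it is larger than both filtered neighbours. */
    public static boolean isLocalMax(double[] filtered, int i) {
        return i > 0 && i < filtered.length - 1
                && filtered[i] > filtered[i - 1] && filtered[i] > filtered[i + 1];
    }
}

The detected optima play the role of the "important minimum and maximum points" from which
episodes are derived.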
After simple and complex episodes are defined, each episode is viewed as an item to
create a transactional database. Another R-based software package, arules, makes this
possible. The arules package provides algorithms that seek out items that appear within a window
of a width defined by the user. From there, temporal association rules are then
extracted from the database.
The first algorithm being used to extract the temporal association rules is the Eclat
algorithm. Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is an
efficient algorithm that generates frequent item sets in a depth-first manner. Other
algorithms such as Apriori and FP-growth will then be used to extract association
rules, which will be compared and contrasted with each other. This work is ongoing.
3.3 Methodology
The data used is located on NOAA's FTP site in the form of .dly files. Each station
has its own .dly file which is updated daily (if the station still collects data). Each .dly
file has all the data that has ever been collected for that station. Whenever new data is
added for a station, it is appended to the end of the current .dly file, which presents
the problem that each file must be downloaded over again to keep our local database
current. To obtain the data, a Java-based program was built that would download
every file in the folder holding the .dly files. The Java program used the Apache
Commons Net library to download the files from the NOAA FTP server.
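A minimal sketch of such a downloader using Apache Commons Net's FTPClient; the host name,
remote directory, and anonymous credentials are placeholders rather than values taken from the
chapter.

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class DlyDownloader {
    public static void main(String[] args) throws Exception {
        String host = "ftp.example.gov";                 // placeholder NOAA FTP host
        String remoteDir = "/pub/data/dly";              // placeholder folder holding the .dly files
        FTPClient ftp = new FTPClient();
        ftp.connect(host);
        ftp.login("anonymous", "anonymous@example.org"); // assumed anonymous access
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        for (FTPFile f : ftp.listFiles(remoteDir)) {
            if (f.isFile() && f.getName().endsWith(".dly")) {
                try (OutputStream out = new FileOutputStream(f.getName())) {
                    ftp.retrieveFile(remoteDir + "/" + f.getName(), out);
                }
            }
        }
        ftp.logout();
        ftp.disconnect();
    }
}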
After downloading all of the .dly files, the Java program opens an input stream to
each of the downloaded files (one at a time). Each line of the .dly files contains a
separate data record, so the Java program would read in each line and use it to form a
MySQL "INSERT" statement that would be used to place data into a local database.
At one point, space was exhausted on the local machine, and researchers had to up-
grade the hard disk from ~200 GB to 256 GB to continue inserting data into the rela-
tional database.
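A sketch of how each .dly line could be turned into a parameterized MySQL INSERT through
JDBC; the table name, its columns, the connection URL, and the fixed-width offsets noted in the
comments are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DlyImporter {
    public static void main(String[] args) throws Exception {
        // Hypothetical table: ghcn_raw(station, yearmonth, element, raw_line)
        String sql = "INSERT INTO ghcn_raw (station, yearmonth, element, raw_line) VALUES (?, ?, ?, ?)";
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/ghcn", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql);
             BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed fixed-width layout: station id, then year+month, then element code
                ps.setString(1, line.substring(0, 11));
                ps.setString(2, line.substring(11, 17));
                ps.setString(3, line.substring(17, 21));
                ps.setString(4, line);
                ps.addBatch();
            }
            ps.executeBatch();   // one batch per .dly file keeps the round trips down
        }
    }
}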
Once all of the data was placed into the local database, a web interface was built
that allowed users to search the dataset (Fig. 3). The interface allows users to search
by country, state (within the United States), date range, and values that are <, <=, >,
>=, !=, or == to any chosen value. Because the dataset contains the value -9999 for
any record that is invalid or was not collected, the web interface also has the option to
exclude any -9999 values from the results. The results are output with each line con-
taining a different data result, and each result consisting of month, day, year, and data
value.
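A rough sketch of how such a search might be issued from Java; the table and column names are
hypothetical, and the operator handling is simplified (the web form's == is mapped to SQL's =).

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StationSearch {
    // Hypothetical table ghcn_values(station, country, state, obs_date, value)
    public static void search(Connection con, String country, String op,
                              double threshold, boolean excludeMissing) throws Exception {
        if (op.equals("==")) op = "=";                    // web-form operator to SQL operator
        if (!op.matches("<|<=|>|>=|!=|=")) {
            throw new IllegalArgumentException("unsupported operator: " + op);
        }
        String sql = "SELECT obs_date, value FROM ghcn_values WHERE country = ? AND value " + op + " ?";
        if (excludeMissing) {
            sql += " AND value <> -9999";                 // -9999 marks invalid or uncollected readings
        }
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, country);
            ps.setDouble(2, threshold);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getDate("obs_date") + "\t" + rs.getDouble("value"));
                }
            }
        }
    }
}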
For visualizing the data, the first step was to plot the location of each station using
Google Earth. The NOAA FTP server also has a file that lists the longitude, latitude,
and elevation of each station, so this information was placed into a separate table in
our database. Next, a PHP script from the Google Earth website was customized to
support queries to the database for the location of each station and then format the
results in KML. This KML data is then loaded into the browser-based version of
Google Earth on the same webpage. A separate PHP script was built that allows the
user to search for stations in the entire world, by country, or by state (if searching
within the US) (Fig. 4), and results from the query are graphed (Fig. 5).
This visualization can be integrated into a Google Earth display (Fig. 6).
Fig. 6. NOAA GHCN reporting stations in Google Earth
A Java program was then written to convert the retrieved data into a
data mining friendly format. The program queries for data within a specific date range
and station range (numbers were assigned to the stations), and for the specific data
types which we wanted to retrieve. The program then compiles the results into a file,
in which each station has a single record for each day. The file can be read like a table
in which each data type has its own column. This lets data be efficiently retrieved
from the database, while still allowing the data to be properly organized for mining.
A subset of the data is requested at one time as the conversion process can take
hours or days if too much data is requested. Most mining programs cannot handle
such large amounts of data, and not all element types are commonly used. Some min-
ing algorithms will not run if a data type is missing too many entries, which would be
the case if some of the less commonly collected data types were used. This program is
used to retrieve data from both the GSOD and GHCN datasets to reduce time wasted
on retrieving the same data multiple times, and to remove the need to have mining
programs connect to our database server.
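A sketch of the pivoting step just described, assuming the query returns (station, date, element,
value) rows; the record shape, class names, and the reuse of -9999 as the missing-value marker
are our assumptions for illustration.

import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiningFormatWriter {
    /** One raw database row: station number, date, element type, and its value (hypothetical shape). */
    public record Row(int station, String date, String element, double value) {}

    /** Pivots row-per-observation data into one line per station/day with a column per element. */
    public static void write(List<Row> rows, List<String> elements, String outFile) throws Exception {
        Map<String, Map<String, Double>> table = new TreeMap<>();
        for (Row r : rows) {
            table.computeIfAbsent(r.station() + "," + r.date(), k -> new LinkedHashMap<>())
                 .put(r.element(), r.value());
        }
        try (PrintWriter out = new PrintWriter(outFile)) {
            out.println("station,date," + String.join(",", elements));
            for (Map.Entry<String, Map<String, Double>> e : table.entrySet()) {
                StringBuilder line = new StringBuilder(e.getKey());
                for (String el : elements) {
                    Double v = e.getValue().get(el);
                    line.append(',').append(v == null ? "-9999" : v);   // keep the missing marker
                }
                out.println(line);
            }
        }
    }
}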
A second Java program was written that converts commas to tabs. The mining pro-
gram Orange does not read CSV files, but does read tab-delimited files. Originally,
commas were used to separate data values, so this program was made to
quickly convert all commas to tabs so that the data could be analyzed with Orange.
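A minimal version of such a converter; it assumes the CSV files contain no quoted fields with
embedded commas, and the file names are taken from the command line.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class CsvToTsv {
    public static void main(String[] args) throws Exception {
        // args[0] = input CSV file, args[1] = output tab-delimited file for Orange
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line.replace(',', '\t'));   // simple field-separator swap
            }
        }
    }
}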
Three programs were used to mine the data. The most success came from
RapidMiner, with quite a bit less success experienced with Weka and Orange. All
three programs have a similar setup in which different functions are dragged and
dropped onto the interface screens and then connected together to run the chosen al-
gorithms. Each of the programs had operational issues at times, but RapidMiner
seemed to be the most stable and useful.
Initially, mining began with Weka, but a few key issues made it clear that Weka
should be dropped early on. First, Weka’s “Knowledge Flow” program, which contains
mining functions, was not able to connect to our local database server (before we built
our conversion program). Second, Weka has a lot of mining algorithms, but little expla-
nation of how to use them. Frustrated with these problems, we tried RapidMiner.
RapidMiner has a large amount of mining functions, and it has an extension that
gives it some functions from Weka. It also has a file import wizard that helps ensure that
data is correctly imported into the program. Most importantly, there is a small guide for
each mining algorithm on the bottom right-hand corner of the screen that is automatical-
ly shown whenever a function is selected. The guide explains exactly what a function is
for, how to use it, and what its inputs and outputs are. RapidMiner also has a search
feature that lets users quickly find the algorithms they want to use. In addition, it has a
wizard called “Automatic System Construction” that runs different algorithms on data
to determine which ones may yield results [11]. We did not have much success with this
feature. Out of all the programs, only RapidMiner provided results.
Despite those positive features, there were two main problems with RapidMiner.
The first problem was that it would crash if functions with more than ~5MB of input
data were used. The second problem was that many functions would only run if nu-
meric data were converted to nominal data. This means that those numeric values
would be treated as though they were words, thus entirely removing their important
numeric properties. The nominal values have no relation to each other, such as differ-
ence (between two values), but are treated as separate, equal instances.
Orange was the last program we tried using. Before using Orange the CSV files
were converted to tab-delimited files, but this was not really an issue. A feature that
stands out in Orange is that Orange will only permit the addition of a new function if
it can be attached to one that has already been chosen. This removes some guesswork,
allows us to easily see what we can use, and identify functions that we may have oth-
erwise overlooked. Like RapidMiner, Orange would sometimes crash if it received
too much data.
Fig. 8. Graph showing temperature changes in New Jersey over the last 7 years (2005-2012)
using data from the GHCN dataset. This graph was generated from the results of the Naïve
Bayes algorithm.
7 Visual Inspection of Raw Data
8 Visualization and Presentation in Context
Overall, the data mining programs did not yield results of great significance on the
datasets used thus far. However, the work with the data mining tools yielded more
information. There was little new information learned from the algorithms or from the
actual results. Most of the association rules were either nonsense due to conversion
from numeric to nominal types or very basic rules that are already commonly known
(like colder temperatures are seen in the winter). Most decision trees also gave similar
results or refused to give anything at all (many times the results were a tree with a
single node). Some of this may also be due to the fact that there were only a few
commonly collected data types (precipitation, minimum/maximum temperature) in
both datasets. To make matters worse, many data types in the GHCN dataset may
have been similar (value of 0) or missing too many values to form accurate associa-
tions or predictions. To obtain results of significance or find previously unknown
patterns there are two things needed:
1. many more data types that are continuously collected by all weather stations, and
2. mining algorithms that support a greater range of numeric data types.
9 Conclusions
Although no new patterns were identified by this project, the greater benefit was
learning how to organize data for mining and how to mine large amounts of data. The
approach taken here for data mining is valid, however:
1. stronger connections between the data types were needed for pattern identification,
2. the mining programs did not handle the data properly, and
3. a greater range of data types is needed to obtain more significant results.
The methodology outlined here can be applied to a wide range of other fields, as
long as a dataset with a large number of continuously collected data types is used. The
health industry in particular may benefit from mining patient data to find hidden links
between different patients with the same diseases or illnesses. Such medical data is
almost continuously being collected in large amounts.
Reorganizing the data for use by mining algorithms is an important part of the pro-
cess. Writing a program to retrieve the data from the NOAA database and saving it
locally in a different format was not difficult, but the conversion process sometimes
took a very long time. The bulk of this time was taken to retrieve the files from the
database. To reduce such time, researchers can either build the database in the needed
format to begin with (not always possible), or convert the data from a file stored on
the same machine. The process to request and fetch the data from the database is the
most time consuming portion of the large dataset analysis. Storing the converted data
in a local file will make sure that the request-and-fetch process only has to be per-
formed once for each set of test data.
As for the different programs used, RapidMiner was far easier to use than Orange
and Weka. RapidMiner had its drawbacks with data handling and occasional crashing,
but it was simply easier to use thanks to its search feature, its documentation for every
function, and the fact that it gave results. Overall, Weka and Orange were lacking as
programs, as they were troublesome and not useful.
The three main algorithms that were used (association rules, Naïve Bayes,
and decision trees) all took approximately the same amount of time to run against a
very large amount of data. To remedy most of the crashes experienced, a computer
upgrade to a 64-bit operating system, more RAM, and more/faster processors, is
planned. Even without a more capable system, this problem was somewhat overcome
by splitting the files into subsets that were mined separately.
Future plans for this research include additional comparative experience with larg-
er datasets, involving data from other areas and case studies from other disciplines.
References
1. Holtz, S., Valle, G., Howard, J., Morreale, P.: Visualization and Pattern Identification in
Large Scale Time Series Data. In: IEEE Symposium on Large Scale Data Analysis and
Visualization (LDAV 2011), Providence, RI, pp. 17–18 (2011)
2. Morreale, P., Qi, F., Croft, P.: A Green Wireless Sensor Network for Environmental Moni-
toring and Risk Identification. International Journal on Sensor Networks 10(1/2), 73–82
(2011)
3. Shyu, C., Klaric, M., Scott, G., Mahamaneerat, W.: Knowledge Discovery by Mining As-
sociation Rules and Temporal-Spatial Information from Large-Scale Geospatial Image Da-
tabases. In: Proceedings of the IEEE International Symposium on Geoscience and Remote
Sensing (IGARSS 2006), pp. 17–20 (2006)
4. Zhu, C., Zhang, X., Sun, J., Huang, B.: Algorithm for Mining Sequential Pattern in Time
Series Data. In: Proceedings of the IEEE 2009 WRI International Conference on Commu-
nications and Mobile Computing, pp. 258–262 (2009)
5. NOAA Integrated Surface Database (GSOD), http://www.ncdc.noaa.gov/oa/
climate/isd/index.php (retrieved June 12, 2013)
6. NOAA Global Historical Climatology Network (GHCN) Database,
http://www.ncdc.noaa.gov/oa/climate/ghcn-daily/ (retrieved June 12,
2013)
7. Han, J., Rodriguez, J.C., Beheshti, M.: Diabetes Data Analysis and Prediction Model Dis-
covery Using RapidMiner. In: IEEE Proceedings of the 2nd International Conference on
Future Generation Communication and Networking (FGCN 2008), pp. 96–99 (2008)
8. Weka’s website, http://www.cs.waikato.ac.nz/ml/weka/ (retrieved June 12,
2013)
9. RapidMiner’s website, http://rapid-i.com/content/view/181/190/ (re-
trieved June 12, 2013)
10. Orange’s website, http://orange.biolab.si/ (retrieved June 12, 2013)
11. Shafait, F., Reif, M., Kofler, C., Breuel, T.R.: Pattern Recognition Engineering. In:
RapidMiner Community Meeting and Conference (RMiner 2010), Dortmund, Germany
(2010)
12. DPlot’s website, http://www.dplot.com/ (retrieved June 12, 2013)
13. Thuraisingham, B., Khan, L., Clifton, C., Maurer, J., Ceruti, M.: Dependable Real-time
Data Mining. In: Proceedings of the 8th IEEE International Symposium on Object-
Oriented Real-Time Distributed Computing (ISORC 2005), pp. 158–165 (2005)
14. Martinez, K., Hart, J.K., Ong, R.: Environmental Sensor Networks. IEEE Computer,
50–56 (August 2004)
15. Lewis, F.L.: Wireless Sensor Networks. In: Cooke, D.J., Das, S.K. (eds.) Smart Environ-
ments: Technologies, Protocols, and Applications. John Wiley, New York (2004)
16. Zimmerman, A.T., Lynch, J.P.: Data Driven Model Updating using Wireless Sensor Net-
works. In: Proceedings of the 3rd Annual ANCRiSST Workshop (2006)
17. Chang, N., Guo, D.: Urban Flash Flood Monitoring, Mapping, and Forecasting via a Tai-
lored Sensor Network System. In: Proceedings of the 2006 IEEE International Conference
on Networking, Sensing and Control, pp. 757–761 (2006)
18. Cordova-Lopez, L.E., Mason, A., Cullen, J.D., Shaw, A., Al-Shamma’a, A.I.: Online vehi-
cle and atmospheric pollution monitoring using GIA and wireless sensor networks. Journal
of Physics: Conference Series 76(1) (2007)
19. Gahegan, M., Wachowicz, M., Harrower, M., Rhyne, T.-M.: The Integration of geographic
visualization with knowledge discovery in databases and geocomputation. Cartography
and Geographic Information Science 28(1), 29–44 (2001)
20. Arici, T., Akgu, T., Altunbasak, Y.: A Prediction Error-Based Hypothesis Testing Method
for Sensor Data Acquisition. ACM Transactions on Sensor Networks 2(4), 529–556 (2006)
21. Monmonier, M.: Geographic brushing: Enhancing exploratory analysis of the scatter plot
matrix. Geographical Analysis 21(1), 81–84 (1989)
22. MacEachren, A.M., Polsky, C., Haug, D., Brown, D., Boscoe, F., Beedasy, J., Pickle, L.,
Marrara, M.: Visualizing spatial relationships among health, environmental, and demo-
graphic statistics: interface design issues. In: 18th International Cartographic Conference
Stockholm, pp. 880–887 (1997)
23. Monmonier, M.: Strategies for the visualization of geographic time-series data.
Cartographica 27(1), 30–45 (1990)
24. Harrower, M.: Visual Benchmarks: Representing Geographic Change with Map Anima-
tion. Ph.D. dissertation, Pennsylvania State University (2002)
25. Mueen, A., Keogh, E.: Online Discovery and Maintenance of Time Series Motifs. In: Pro-
ceedings of 16th ACM Conference on Knowledge Discovery and Data Mining (KDD
2010), pp. 1089–1098 (2010)
26. Morreale, P., Qi, F., Croft, P., Suleski, R., Sinnicke, B., Kendall, F.: Real-Time Environ-
mental Monitoring and Notification for Public Safety. IEEE Multimedia 17(2),
4–11 (2010)
Parallel Coordinates Version of Time-Tunnel (PCTT)
and Its Combinatorial Use for Macro to Micro Level
Visual Analytics of Multidimensional Data
Yoshihiro Okada
Abstract. This chapter treats an interactive visual analysis tool called PCTT,
Parallel Coordinates Version of Time-tunnel, for multidimensional data and
multi-attributes data. Especially, in this chapter, the author introduces the com-
binatorial use of PCTT and 2Dto2D visualization functionality for visual ana-
lytics of network data. 2Dto2D visualization functionality displays multiple
lines those represent four-dimensional (four attributes) data drawn from one
(2D, two attributes) plane to the other (2D, two attributes) plane in a 3D space.
Network attacks like the intrusion have a certain access pattern strongly related
to the four attributes of IP packet data, i.e., source IP, destination IP, source
Port, and destination Port. So, 2Dto2D visualization is useful for detecting such
access patterns. Although it is possible to investigate access patterns of network
attacks at the attributes level of IP packets using 2Dto2D visualization function-
ality, statistical analysis is also necessary to find out suspicious periods of time
that seem to be attacked. This is regarded as the macro level visual analytics
and the former is regarded as the micro level visual analytics. In this chapter,
the author also introduces such combinatorial use of PCTT for macro level to
micro level visual analytics of network data as an example of multidimensional
data. Furthermore, the author introduces another visual analytics example, about
sensor data, to clarify the usefulness of PCTT.
1 Introduction
This chapter treats an interactive visual analysis tool for multidimensional and multi-
attributes data called PCTT, Parallel Coordinates Version of Time-tunnel (PCTT)
[1-3]. Originally, Time-tunnel [1, 2] visualizes any number of multidimensional data
records as individual charts in a virtual 3D space. Each chart is displayed on a rectan-
gular plane, and the user can easily overlap several different planes
to compare their data represented as charts in order to recognize the similari-
ty or the difference among them. Simultaneously, a radar chart among those data
on any attribute is displayed in the same 3D space to recognize the similarity and the
correlation among them. In this way, the user can visually analyze multiple multidi-
mensional data through interactive manipulations on a computer screen. However, in
Time-tunnel, only one chart is displayed on one rectangular plane. So, if there are a
huge number of data records, the user has to prepare accordingly such a huge number
of rectangular planes and practically it becomes impossible to interactively manipu-
late them. To deal with this problem, we enhanced the functionality of Time-tunnel to
enable it to display multiple charts like Parallel Coordinates [4] on each rectangular
plane. This is called Parallel Coordinates version of Time-tunnel (PCTT) [3]. With
this enhanced functionality, the user can visually analyze a huge number of multidi-
mensional data records through interactive manipulations on a computer screen. The
user can easily recognize the similarity or the difference among those data visually
and interactively.
Parallel Coordinates version of Time-tunnel (PCTT) can be used for the visualiza-
tion of network data because IP packet data have many attributes and such multiple
attribute data can be visualized using Parallel Coordinates. Furthermore, we also in-
troduced 2Dto2D visualization functionality to PCTT for intrusion detection of net-
work data. 2Dto2D visualization functionality displays multiple lines that represent
four-dimensional (four-attribute) data drawn from one (2D, two-attribute) plane to
the other (2D, two-attribute) plane. Using 2Dto2D visualization, it is easy to under-
stand relationships of four attributes of each data. Network attacks have a certain
access pattern strongly related to the four attributes of IP packet data, i.e., source IP,
destination IP, source Port, and destination Port. So, 2Dto2D visualization is useful
for detecting such access patterns. In this chapter, we show several network-attack
patterns visualized using PCTT with 2Dto2D visualization.
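As a rough sketch of the idea behind 2Dto2D visualization (not the PCTT implementation), each
IP packet can be reduced to a line segment whose end points are normalized (IP, Port)
coordinates on the source-side plane and the destination-side plane; all names below are ours.

public class TwoDToTwoD {
    /** Normalizes an IPv4 address to [0,1] for use as one plane coordinate. */
    static double ipToUnit(String ip) {
        long v = 0;
        for (String octet : ip.split("\\.")) v = (v << 8) | Long.parseLong(octet);
        return v / (double) 0xFFFFFFFFL;
    }

    /** Normalizes a port number (0-65535) to [0,1]. */
    static double portToUnit(int port) {
        return port / 65535.0;
    }

    /** Returns the line end points: (x,y) on the source plane and (x,y) on the destination plane. */
    static double[] packetToLine(String srcIp, int srcPort, String dstIp, int dstPort) {
        return new double[] {
            ipToUnit(srcIp), portToUnit(srcPort),   // point on the (source IP, source Port) plane
            ipToUnit(dstIp), portToUnit(dstPort)    // point on the (destination IP, destination Port) plane
        };
    }
}

In such a mapping, a port scan appears as a fan of lines sharing one source-plane point, which is
the kind of access pattern discussed in Section 5.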
Using PCTT with 2Dto2D visualization functionality, it is possible to investigate
access patterns of network attacks at the attributes level of IP packets. However, sta-
tistical analysis is also necessary to find out suspicious periods of time that seem to be
attacked. This is regarded as the macro level visual analytics and the former is regard-
ed as the micro level visual analytics. In this chapter, we introduce such combinatorial
use of PCTT for macro level to micro level visual analytics of network data as an
example of multidimensional data. We also introduce another visual analytics example,
about sensor data related to our cyber-physical systems research project, to clarify the
usefulness of PCTT.
The remainder of this chapter is organized as follows. First of all, Section 2 de-
scribes related work and points out the difference of our tool from the others. Next,
we explain essential mechanisms of IntelligentBox [5] in Section 3 because
IntelligentBox is a constructive visual software development system for 3D graphics
applications and Time-tunnel is developed as one of its applications. Therefore, Time-
tunnel can be combined with other Time-tunnel or other visualization tools. Section 4
describes details of Time-tunnel and its Parallel Coordinates version. And then, Sec-
tion 5 presents actual network data analysis using PCTT with 2Dto2D visualization as
examples of the micro level visual analytics. We also introduce examples of the mac-
ro level visual analytics and the combinatorial use of several PCTT in Section 6. In
Section 7, we also introduce other visualization examples carried out as the part of our
research project about cyber-physical systems. Finally we conclude the chapter in
Section 8.
2 Related Work
After the proposal of Parallel Coordinates, many modified versions having a variety of
additional features were proposed [6-11]. Our Parallel Coordinates version of Time-
tunnel (PCTT) can be used as the same visual analysis tool as original Parallel Coordi-
nates. Furthermore, PCTT visualizes multiple charts like Parallel Coordinates on one
individual rectangular plane and it originally provides multiple rectangular planes in a
virtual 3D space so that if the user has a huge amount of data records, he/she can ana-
lyze them by separating into several groups using multiple rectangular planes to recog-
nize the similarity or the difference among those data visually and interactively. This is
one of the advantages of our PCTT. Another popular data analysis method beside Paral-
lel Coordinates is based on star chart or radar chart. As the similar tools, there are Star
Glyphs of XmdvTool [12] and Stardinates Tool [13]. Stardinates Tool has combined
feature of Parallel Coordinates and Glyphs [12]. There is also research [14, 15] simi-
lar to this. Our PCTT has combinatorial features of Parallel Coordinates and star chart
(radar chart) visualization tool with interactive interfaces.
There are many visualization tools of network data for intrusion detection
[16-30]. The paper [17] proposes several interactive visualization
methods for network data and the port scan detection based on PortVis [16], a tool for
port-based detection of security events. Most of them are 2D and only volume visualiza-
tion method uses 3D axes (Port high byte, Port low byte, and Time). The paper [18]
proposes a visual querying system for network monitoring and anomaly detection using
entropy-based features. The paper [19] proposes ClockView for monitoring large IP
spaces, which is a glyph in style of a clock to represent multiple attributes of time-series
traffic data in a 2D time table. The paper [20] proposes the use of CLIQUE, a visualiza-
tion tool of statistical models of expected network flow patterns for individual IP ad-
dresses or collections of IP addresses, and Traffic Circle, a standard circle plot tool. As
Parallel Coordinates-based visualization tools, there are VisFlowConnect [21] and trellis
plots of Parallel Coordinates [22]. As treemap-based visualization tools, there are
NAVIGATOR [23], which displays detail information like IP addresses, ports, etc.
inside each node of a treemap, and hierarchical visualization [24], which is a 2D map
similar to a treemap in a 3D space. Also, there are visualization methods for network
data using 3D plots [25] or lines in a 3D space [26-29]. DAEDALUS [29] is a 3D visual
monitoring tool of the darknet data. However, there have not been any visualization
tools like our PCTT. In this chapter, we also propose 2Dto2D visualization functionality
used with PCTT. The concept of 2Dto2D visualization functionality was derived from
the visualization tool called nicter Cube [30], and there have not been any visualization
tools like our PCTT with 2Dto2D visualization.
Boxes exchange data through three standard messages, i.e., a set message, a gimme message, and an update mes-
sage. These messages have the following formats:
(1) Parent box set <slotname> <value>.
(2) Parent box gimme <slotname>.
(3) Child box update.
A <value> in a format (1) represents any value, and a <slotname> in formats (1) and
(2) represents a user-selected slot of the parent box that receives these two messages.
A set message writes a child box slot value into its parent box slot. A gimme message
reads a slot value from a parent box and sets the value into its child box slot. Update
messages are issued from a parent box to all of its child boxes to tell them that the
parent box slot value has changed.
Each box has three main flags that control the above message flow, i.e., a set flag,
a gimme flag, and an update flag. These flags are properties of a display object. A box
works as an input device if its set flag is set to true. Conversely, a box works as an out-
put device if its gimme flag is set to true. A box sends update messages if its update
flag is set to true. Then, child boxes take an action depending upon the states of the
set flag and the gimme flag after they receive an update message or after they individ-
ually change their slot values.
When database records have too many attributes, it is impossible to visualize them
in one rectangular area as one Parallel Coordinates due to the width size limitation of
a display screen. Using multiple data-wings of Time-tunnel, the user can divide at-
tributes into several groups and assign each group to one of the multiple data-wings.
In this case, the user can visualize database records with a huge number of attributes
using multiple data-wings. For this case, we also extended radar chart visualization
functionality to visualize relationships among different attributes of all multidimen-
sional data as multiple radar charts as shown in Figure 9. This visualization is possible
because the number of data in each data-wing is the same in this case. As Figure 9
shows, it is possible to understand relationships between the two attributes corre-
sponding to any two adjacent data-wings about all data at a glance.
Fig. 14. Screen images of PCTT with 2Dto2D visualization for IP packet data
We use darknet flow data of IP packets sent from the outside of our university and
captured as pcap format files. Each file includes IP packet data in one hour and the
average number of them in a file is around 3,500. PCTT can read 24 hours of files at
once so that it can visualize IP packet data for one day at maximum. Also, we can
specify an interval time and its begin time for visualizing IP packet data using the
GUI of PCTT as previously explained. There is an automatic change mode for the
begin time. In this mode, visualization results are automatically changed according to
the begin time. When the interval time is 30 seconds, a begin time will be shifted
every 30 seconds, and one shift needs around 0.1 seconds of real execution time,
although several hundreds of IP packets are included in each of these intervals. So,
since a whole day contains 2,880 such 30-second intervals (2,880 × 0.1 sec ≈ 5 minutes),
even if you want to check visualization results of IP packets for a whole day, you need
only about 5 minutes. This value is reasonable, although it depends on the specification of
the PC you use; we used a standard PC whose specification is as follows:
CPU: Intel Core i5, Memory: 4 GB, and no special graphics card. How many polylines
can be displayed at a time is regarded as the performance of PCTT. Its number is a few
thousand. This number is enough for practical cases because it is difficult for a hu-
man to understand the features of data represented as more than a few thousand
polylines. The following are a couple of network-attack patterns that were actually
detected using PCTT with 2Dto2D visualization.
Port Scanning
Port Scanning is one of the most popular techniques attackers use to discover services
that they can exploit to break into systems. During checking IP packet data of four
days, we found only one case like port scanning, as shown in Figure 15. In this case, a
certain computer located outside of our university sequentially accessed different
source IPs and source Ports of our darknet in a very short period.
DoS Attacks
DoS attack means Denial of Service attack. There are several modes of the attack.
In the most popular access pattern, one computer of an attacker accesses
his/her target computer many times in a very short period. As a result,
the target computer becomes unable to provide the services that it
originally provided. Sometimes, the computer malfunctions. Figure 17
is regarded as showing access patterns of DoS attacks because it indicates such a case.
Indeed, the upper figure of Figure 17 shows a different 2Dto2D visualization, i.e., 2D
(time, time) to 2D (destination IP, destination Port). Therefore, the blue points running
from the lower left to the upper right represent the transition of time for the corre-
sponding IP packets, which all tried to access one target computer. As shown in the
lower figure of Figure 17, their source IP and source Port are the same.
DDoS Attack
DDoS attack means Distributed Denial of Service attack. In the most popular access
pattern, multiple computers controlled by an attacker simultaneously access
the attacker's target computer many times in a very short period. As a result, the target
computer becomes unable to provide the services that it originally
provided. Sometimes, the computer malfunctions. Figure 18 shows such
access patterns.
In the way described in the previous subsection, you can find the characteristics
of IP packet data using PCTT with 2Dto2D visualization at the attribute level, i.e., at
the micro level. However, statistical analysis is also necessary to find out suspicious
periods of time that seem to be attacked. This is regarded as the macro level visual
analytics. Figure 19 shows statistics visualization results of IP packet data in several
different interval times, from 30 minutes to 0.5 minutes. As shown in the figures, the
granularity of the interval time is significant because the visualization results are strongly
dependent on the interval time. The interval times of one minute or 30 seconds are
suitable for the visualization of the IP packet data of our darknet because both results
are almost the same. Besides the total number of data in a certain interval, the system
provides several functionalities for calculating statistical values. For instance, these
include the information entropy, the variance, and the number of different kinds of data.
The information entropy h is calculated using the following expression:
h = −∑_i p_i log p_i .                                    (1)
Here, p_i is the ratio of the number of occurrences of the i-th kind of data to n, the total
number of data, in a certain interval. Indeed, we use H as the normalized entropy because the
maximum value of h depends on n. H is calculated by the following expression:
H = (h / log n) × 100 .                                   (2)
The variance v of the per-kind counts is calculated as
v = (1/m) ∑_{i=1..m} (x_i − x̄)² .                        (3)
Here, x_i is the number of occurrences of the i-th kind of data, x̄ is the average of x_1 to x_m,
and m is the number of different kinds of data in a certain interval.
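As a sketch of how these per-interval statistics might be computed (for example, over the source
IPs seen in one 30-second interval), the following code mirrors expressions (1)-(3) as
reconstructed above; the class and method names are illustrative.

import java.util.HashMap;
import java.util.Map;

public class IntervalStatistics {
    /** Counts how often each distinct value (e.g., a source IP) occurs in one interval. */
    public static Map<String, Integer> countKinds(String[] values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }

    /** Normalized entropy H = (h / log n) * 100, with h = -sum(p_i * log p_i). */
    public static double normalizedEntropy(Map<String, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * Math.log(p);
        }
        return n > 1 ? h / Math.log(n) * 100.0 : 0.0;
    }

    /** Variance of the per-kind counts x_1..x_m around their mean. */
    public static double variance(Map<String, Integer> counts) {
        int m = counts.size();
        double mean = counts.values().stream().mapToInt(Integer::intValue).average().orElse(0.0);
        double v = 0.0;
        for (int c : counts.values()) v += (c - mean) * (c - mean);
        return m > 0 ? v / m : 0.0;
    }
}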
Fig. 19. Total number of IP packets in different interval times during a certain day
(a) Numbers of different kinds of source IPs, destination IPs, source Ports, and desti-
nation Ports in each 30 seconds during a certain day.
(b) Variances of numbers of different kinds of source IPs, destination IPs, source
Ports, and destination Ports in each 30 seconds during a certain day.
Fig. 20. Different statistics results of source IPs, destination IPs, source Ports, and destination Ports in each 30 seconds during a certain day
(c) Information Entropies of different kinds of source IPs, destination IPs, source
Ports, and destination Ports in each 30 seconds during a certain day.
Figure 20 shows different statistics results of source IPs, destination IPs, source Ports, and destination Ports in each 30 seconds during a certain day, using four data-wings. In this way, using multiple data-wings, it is possible to analyze the statistics of multiple attributes together at once. Regarding intrusion detection, DoS and DDoS attacks show lower values of information entropy for both source IPs and destination IPs. Port scanning shows a lower value of information entropy for source IPs and a higher value for destination Ports. Security hole attacks show lower values of information entropy for both source IPs and destination Ports.
Figure 21 shows an actual example of the combinatorial use of PCTT with 2Dto2D visualization, moving from macro-level to micro-level visual analytics. At the macro level, the time interval [16:24:30] shows a much higher value than the others, as seen in the upper figure, which suggests that suspicious actions occurred in this time period. If you then examine the same time period using PCTT with 2Dto2D visualization, you can see that DDoS attacks occurred, as shown in the lower figure of Figure 21.
(1) Macro level visualization: total number of IP packets in each 30 seconds during a certain day.
(2) Micro level visualization: 2Dto2D visualization of PCTT over the same interval as above.
Fig. 21. Combinatorial use of PCTT with 2Dto2D visualization from macro level to micro level
visualization
We also have a project on cyber-physical systems. One important topic in research on cyber-physical systems is the analysis of big data collected from the physical world by various sensors, and information visualization is a useful analysis method. We are currently developing two types of visualization tools based on the IntelligentBox system for this project. The first targets the analysis of human movements, since human movements are important cues for analyzing human activities. Figure 22 shows a screen image of this visualization tool, displaying the building of our graduate school, which consists of several floors. To simplify the 3D model of the building, we employ 2D floor map images together with a texture-mapping mechanism called the 2.5D inside building visualizer, which makes it very easy to visualize the building and display human movements. The glyphs differ in color, size, and shape to represent attribute values of the corresponding persons and move on the floors, so persons' activities can be understood from the glyphs' movements.
The other tool is PCTT, because our human activity data consist of several attributes collected by various sensors from physical-world activities, as shown in Figure 23. Although we have not yet realized the combinatorial use of PCTT and the 2.5D inside building visualizer of Figure 22, combining them will be feasible in the near future because all the visualization tools are implemented as composite components of IntelligentBox.
As future work, we will investigate in more detail the suspicious accesses in network data that the proposed visualization tools indicate as intrusions. We will also apply the proposed visualization tools to various types of data, such as human activity data, to clarify their usefulness.
Towards a Big Data Analytics Framework
for IoT and Smart City Applications
AGT International
Hilpertstrasse 35, 64295 Darmstadt, Germany
{mstrohbach,hziekow,vgazis,nakiva}@agtinternational.com
1 Introduction
mined by applying data analytics techniques and generate value by offering innova-
tive services that increase citizens’ quality of life. Data may be provided by all stake-
holders of a Smart City [10], i.e., society, represented by citizens and businesses, and governments, represented by policy makers and administrations.
On the one hand, data sources may include traditional information held by public bodies (Public Sector Information, PSI), including anonymous data such as cartography, meteorology, traffic, and all kinds of statistical data, as well as personal data, e.g., from public registries, inland revenue, health care, social services, etc. [67].
On the other hand, citizens themselves create a constant stream of data in and about cities by using their smartphones. By using apps like Twitter and Facebook, or apps provided by the city administration, they leave digital traces related to their activities in the physical city that have the potential to yield valuable insights for urban planners.
With the advent of deployed sensor systems such as mobile phone networks, camera networks in the context of intelligent transportation systems (ITS), or smart meters for metering electricity usage, new data sources are emerging that are often discussed in the context of the Internet of Things (IoT), i.e., the extension of the Internet to virtually every artifact of daily life through the use of identification and sensing technologies.
Thus, we can summarize that the key components required for Smart City applications are available: 1) an abundance of data sources; 2) infrastructure, networks, interfaces, and architectures being defined in the IoT and M2M community; 3) a vast range of Big Data technologies that support the processing of large data volumes; and 4) ample and wide knowledge about algorithms as well as toolboxes [62] that can be used to mine the data.
Although all the necessary conditions for a Smart City are met, there is still no analytical framework that pulls all these components together such that services for urban decision makers can easily be developed.
In this chapter, we address this need by proposing an initial version of such an ana-
lytical framework that we derived based on existing state of the art, initial findings
from our participation in the publicly funded projects Big Data Public Private Forum
(BIG) [7] and Peer Energy Cloud (PEC) [46] as well as our own experiences with
analytical applications for Smart Cities.
The European project BIG is a Coordinated Support Action (CSA) that seeks to build an industrial community around Big Data in Europe, with the ultimate goal of developing a technology roadmap for Big Data in relevant industrial sectors. As part of this effort, the BIG project gathers requirements on Big Data technologies in industry-driven working groups. Groups relevant for Smart Cities include the health, public sector, energy, and transport working groups. The project has released both an initial version of the requirements in these and other sectors [67] and a set of technical white papers providing an overview of the state of the art in Big Data technologies [31]. In this chapter, we draw on these results of the BIG project and complement them with findings from a concrete use case in the energy sector carried out in the PEC project.
The remainder of this chapter is structured as follows: Section 2 provides back-
ground about the technical challenges associated with the Big Data and Internet of
Things topics. Section 3 summarizes the state of the art in Big Data technologies. In
Section 4 we elaborate on the concrete Big Data challenges that need to be addressed
in the context of Smart City applications. Section 5 presents a case study from the
smart grid domain that demonstrates how we applied big data analytics in a realistic
setting. In Section 6, we extend the analytics presented in this case study towards an
initial big data analytics framework. In Section 7 we report on our lessons learned. In Section 8 we summarize further research directions required to extend the framework and fully implement it. Finally, Section 9 concludes this chapter.
In this section we describe the technologies concerned with connecting everyday artefacts, i.e., the Internet of Things, and relate them to Big Data technologies that address the challenges of managing and processing large and complex data sets. For an extensive discussion and definition of the term Big Data we refer to the respective report of McKinsey [37].
The Volume Challenge. The Volume challenge refers to storing, processing, and quickly accessing large amounts of data. While it is hard to quantify the boundary for a volume challenge, data sets in the order of hundreds of terabytes or more are commonly considered to be big. In contrast to traditional storage technologies such as relational database management systems (RDBMS), new Big Data technologies such as Hadoop are designed to scale easily with the amount of data to be stored and processed. In its most basic form, the Hadoop system uses its Hadoop Distributed File System (HDFS) to store raw data. Parallel processing is facilitated by means of its MapReduce framework, which is highly suitable for solving embarrassingly parallel processing problems. With Hadoop it is possible to scale by simply adding more processing nodes to the Hadoop cluster without any reprogramming, as the framework takes care of using additional resources as they become available.
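To illustrate why such workloads scale so naturally, the following minimal Python sketch mimics the map and reduce steps of a MapReduce job that counts records per key; it is a conceptual illustration only, does not use the Hadoop API, and the record format is assumed for the example.

from collections import defaultdict

def map_phase(record):
    # Emit a (key, 1) pair for every input record; here the key is the record itself.
    yield record, 1

def reduce_phase(key, values):
    # Aggregate all values emitted for the same key.
    return key, sum(values)

def run_job(records):
    grouped = defaultdict(list)
    for record in records:                    # a framework would split this across nodes
        for key, value in map_phase(record):
            grouped[key].append(value)        # shuffle: group intermediate pairs by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["port80", "port443", "port80", "port22", "port80"]))

Because the map calls are independent and each reduce call only sees records sharing one key, adding nodes increases throughput without changing the job logic.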
Summarizing the trends in the volume challenge one can observe a paradigm shift
with respect to the way the data is handled. In traditional database management sys-
tems the database design is optimized for the specific usage requirements, i.e., data is
preprocessed and only the information that is considered relevant is kept. In contrast,
in a truly data-driven enterprise that builds on Big Data technologies, there is an awareness that data may contain value beyond its current use. Thus, a master data set of the raw data is kept, which allows data scientists to discover further relationships in the data, relationships that may lie beyond today's requirements. As a side effect it also reduces the cost of human errors such as erroneous data extraction or transformation.
The Velocity Challenge. Velocity refers to the fact that data is streaming into the
data infrastructure of the enterprise at a high rate and must be processed with minimal
latency. To this end, different technologies are applicable, depending on the amount
of state and complexity of analysis [57]. In cases where only little state is required
(e.g., maintaining a time window of incoming values), but complex calculations need
to be performed over a temporally scoped subset of the data, Complex Event Processing (CEP) engines (see Section 3.3) offer efficient solutions for processing incoming data in a streaming manner. In contrast, when each new incoming data set needs to be
related to a large number of previous records, but only simple aggregations and value
comparisons are required, noSQL databases offer the necessary write performance.
The required processing performance can then be achieved by using streaming infra-
structures such as Storm [56] or S4 [51].
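A minimal sketch of such window-based processing is shown below; it is our own Python illustration of the idea, not tied to any particular CEP engine or to Storm/S4, and the window length is an assumed value. It keeps only a small time window of recent values per key in memory and recomputes an aggregate whenever a new event arrives.

from collections import defaultdict, deque

WINDOW_SECONDS = 60                      # temporal scope kept in memory per key (assumed)
windows = defaultdict(deque)             # key -> deque of (timestamp, value)

def on_event(key, timestamp, value):
    # Process one incoming event and return the current windowed average.
    window = windows[key]
    window.append((timestamp, value))
    while window and window[0][0] < timestamp - WINDOW_SECONDS:
        window.popleft()                 # drop values that fall outside the time window
    return sum(v for _, v in window) / len(window)

# Example: three readings from the same sensor
print(on_event("sensor-1", 0, 10.0))
print(on_event("sensor-1", 30, 14.0))
print(on_event("sensor-1", 90, 12.0))    # the reading at t=0 has expired by now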
Veracity. Apart from the original 3V's described above, an almost inexhaustible list of further Big Data V's is discussed. For instance, veracity relates to the trust in and truthfulness of the data. Data may not be fully trusted because of the way it has been acquired, for instance by unreliable sensors or imperfect natural language extraction algorithms, or because of human manipulation. Assessing and understanding data veracity is a key requirement when deriving any insights from data sets.
M2M Standardization. Making rapid progress over the last couple of years, the
ETSI Technical Committee (TC) on Machine to Machine (M2M) communications
has recently published its first version of M2M specifications. The objective is to
define the end-to-end system architecture that enables integration of a diverse range
of M2M devices (e.g., sensors, actuators, gateways, etc.) into a platform that exposes
to applications a standardized interface for accessing and consuming the data and
services rendered through these (typically last mile) devices [16]. To this end, ETSI
M2M standards define the architecture, interfaces, protocols, and interaction rules that
govern the communication between M2M compliant devices.
The M2M logical architecture under development in ETSI comprises two high-
level domains:
1. The Network and Application (NA) domain, composed of the following elements:
─ M2M Access Network (AN) providing for communication between the Device
Domain and the Core Network (CN).
─ M2M Core Network (CN) providing for IP connectivity and the associated con-
trol functions to accommodate roaming and network interconnection.
─ M2M Service Capabilities (SC) providing functions shared by M2M applica-
tions by exposing selected infrastructure functionalities through network inter-
faces while hiding realization details.
─ M2M Applications running the actual application logic.
• mIa, for the NA domain, allowing access to and use of the Service Capabilities therein.
• dIa, for the D domain, allowing an M2M application residing in an M2M host (i.e., Device or Gateway) to access and use different Service Capabilities in the same M2M host. When the M2M host is an M2M Device, access to and use of different Service Capabilities in an M2M Gateway is also supported.
• mId, for the communication between M2M Service Capabilities residing in different M2M domains.
Over these reference points, resource management procedures adopt the RESTful style for the exchange and update of data values on the basis of CRUD (Create, Read, Update, Delete) and NE (Notify, Execute) primitives.
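For illustration, the snippet below sketches what such a RESTful exchange could look like from a client's perspective, using Python and the requests library; the endpoint URL and resource paths are purely hypothetical and do not reproduce the actual ETSI M2M resource tree.

import requests

BASE = "https://m2m.example.org/api"     # hypothetical service capability endpoint

# Create: post a new content instance for a temperature container
requests.post(f"{BASE}/containers/temperature/contentInstances",
              json={"value": 21.5, "unit": "C"})

# Read: get the latest value back
latest = requests.get(f"{BASE}/containers/temperature/contentInstances/latest")
print(latest.json())

# Update an application attribute and Delete an obsolete resource
requests.put(f"{BASE}/applications/thermostat", json={"reportingInterval": 30})
requests.delete(f"{BASE}/containers/obsolete")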
The role of wireless technologies in ETSI M2M is that of connectivity with mini-
mal infrastructure investment both in the Device domain and the Network and Appli-
cations domain.
On a global scale, the oneM2M Partnership Project, established by seven of the world's leading information and communications technology (ICT) Standards Development Organizations (SDOs), is chartered to enable the efficient deployment of M2M systems. To this end, existing ETSI M2M standards are to be transferred to oneM2M and ratified as global M2M standards.
Smart Cities. By amassing large numbers of people, urban environments have long exhibited high population densities and now account for more than 50% of the world's population [58]. With 60% of the world population projected to live in urban areas by 2025, the number of megacities (i.e., cities with a minimum population of 10 million people) is also expected to increase. It is estimated that, by 2023, there will be 30 megacities globally.
Considering that cities currently occupy 2% of global land area, consume 75% of
global energy resources and produce 80% of global carbon emissions, the benefit of
even marginally better efficiency in their operation will be substantial [58]. For in-
stance, the Confederation of British Industries estimates that the cost of road conges-
tion in the UK is GBP 20 billion (i.e., USD 38 billion) annually. In London alone,
introduction of an integrated ICT solution for traffic management resulted in a 20%
reduction of street traffic, 150 thousand tons of CO2 less emissions per year and a
37% acceleration in traffic flow [18].
Being unprecedentedly dense venues for the interactions – economic, social and of
other kind – between people, goods, and services, megacities also entail significant
challenges. These relate to the efficient use of resources across multiple domains
(e.g., energy supply and demand, building and site management, public and private
transportation, healthcare, safety, and security, etc.). To address these challenges, a
more intelligent approach in managing assets and coordinating the use of resources is
envisioned, based on the pervasive embodiment of sensing and actuating technologies
throughout the city fabric and supported by ubiquitous communication networks and
the ample processing capacity of data centers. The umbrella term Smart City [68]
refers to the application of this approach in any of six dimensions:
• Smart economy
• Smart mobility
• Smart environment
• Smart people
• Smart living
• Smart governance
By aggregating data feeds across these domains and applying data processing algo-
rithms to surface the dominant relationships in the data, the situational awareness of
the Smart City at the executive level becomes possible. For instance, by leveraging its
open data initiative, the city of London provides a dashboard application demonstrat-
ing the kind of high-level oversight achievable by cross-silo data integration and the
use of innovative analytic applications [35].
The footprint of our current cities' impact is growing at 8% annually, which means it more than doubles every 10 years (1.08^10 ≈ 2.16). Thus, not surprisingly, NIKKEI estimates that USD 3.1 trillion will be invested globally in Smart City projects over the next 20 years [69].
Relationship to Big Data. The popularity of data mashup platforms, as evident today
for human-to-machine and machine-to-human information, is expected to extend to
machine-to-machine information [16]. Data generated in the context of machine-to-
machine communication are typically not constrained by the processing capacities of
human entities in terms of volume, velocity, and variety. Particularly in regard to
velocity, the ongoing deployment of a large number of smart metering devices and
their supporting infrastructures across urban areas increases the percentage of fre-
quently updated small-volume data in the overall data set of the Smart City. Thus, M2M data exchanges in the context of IoT applications for a Smart City impact the requirements of data handling through an increase in the volume, variety, and velocity of the data.
Considering the trinity of IoT, M2M, and Smart Cities from the standpoint of cloud
technologies, it becomes apparent that the scalability to a large number of M2M de-
vices (i.e., sensors, actuators, gateways) and data measurements will be a prime (non-
functional) application requirement. It is therefore apparent that IoT, M2M, and Smart Cities are, from a requirements perspective, right at the core of what Big Data technologies provide.
Increasing urbanization and M2M deployment bring on significant increases in the
data generated by IoT applications deployed in the Smart City fabric. For instance, the
London Oyster Card data set amounts to 7 million data records per day and a total of
160 million data records per month [4]. Given that a Smart City generates a wide
spectrum of data sets of similar – and even larger – size, challenges characteristic of
Big Data arise in collecting, processing, and storing Smart City data sets.
In this section we describe the state of the art of Big Data, focusing on the volume and velocity challenges. We describe it mainly from an industrial perspective, i.e., we provide examples of available technologies that can also be used in a production environment.
Fig. 2. Lambda Architecture (Source: Big Data – Principles and best practices of scalable
realtime data systems, ISBN 9781617290343 [38])
environments, introduces data-related challenges that are at the focus of Big Data
toolsets. The latter include technologies and tools that support solving the volume
challenge. A range of noSQL databases are able to keep up with high update rates.
Programming frameworks such as MapReduce [14] help process large data sets in batches, and stream processing infrastructures such as Storm [56] and S4 [51] provide support for scalable processing of high-velocity data. In practice, both kinds of technology are required in order to design a low-latency query system. Marz and Warren have devised the term Lambda architecture for an architectural pattern that defines the interplay between the batch and speed layers in order to provide low-latency queries [38].
New data (A) is provided both to the batch and speed layer as depicted in Fig. 2.
The batch layer (C) stores all incoming data in its raw format as master data set (B). It
is also responsible for running batch jobs that create batch views in a serving layer
(D) that is optimized for efficient querying (F). The batch layer is optimized for processing large amounts of data, e.g., by using the MapReduce framework, and may require hours to process the data. Consequently, the batch view will not be updated between
repeated executions of the batch jobs and the corresponding data cannot be included
in the result set of a query to the serving layer.
The speed layer (E) addresses this information gap by providing a real-time view
on the data that has arrived since the last executed batch job. This way an application
will always have up-to-date information by querying both the serving and speed layer.
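The essence of this interplay can be captured in a few lines. The Python sketch below is our own simplified illustration of the pattern (not code from [38]); it answers a query by merging a precomputed batch view with the real-time view maintained by the speed layer, and the keys and numbers are assumed.

# Batch view: precomputed by the batch layer from the master data set.
batch_view = {"household-1": 1200, "household-2": 980}     # e.g., totals until the last batch run

# Real-time view: increments for data that arrived after the last batch run.
realtime_view = {"household-1": 3, "household-3": 7}

def query(key):
    # Low-latency query that combines the serving (batch) and speed layers.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("household-1"))    # 1203: batch result plus recent updates
print(query("household-3"))    # 7: only seen since the last batch job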
The need for CEP technology is rooted in various domains that require fast analysis of incoming information. Examples can be found, among others, in the finance domain, where CEP technologies are used for applications like algorithmic trading or the detection of credit card fraud [27]. In these cases CEP is very well suited because the applications require fast analysis of high-volume data streams involving temporal patterns. For instance, credit card fraud may be detected if multiple transactions are executed within a short time from locations far apart. Other application domains for CEP include fields like logistics [60], business process management [61], and security [22]. The IoT domain has also sparked a range of applications that require event-driven processing and is one of the drivers of CEP technology. Specifically, the field of sensor networks has early on led to the development of systems designed for processing in an event-driven manner (e.g., TinyDB [36], Aurora/Borealis [1]).
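The credit card example above corresponds to a simple temporal pattern. The following Python sketch is a generic illustration of such a pattern (not the query language of any particular CEP engine); the time and distance thresholds, and the one-dimensional location, are assumptions made for brevity.

from collections import defaultdict

MAX_SECONDS = 600         # assumed: two transactions within 10 minutes ...
MIN_DISTANCE_KM = 500     # ... from locations more than 500 km apart are suspicious

last_seen = defaultdict(lambda: None)    # card id -> (timestamp, location)

def on_transaction(card, timestamp, location_km):
    # Return True if this transaction matches the fraud pattern.
    previous = last_seen[card]
    last_seen[card] = (timestamp, location_km)
    if previous is None:
        return False
    prev_time, prev_loc = previous
    return (timestamp - prev_time <= MAX_SECONDS and
            abs(location_km - prev_loc) >= MIN_DISTANCE_KM)

print(on_transaction("card-1", 0, 0))        # False: first transaction on this card
print(on_transaction("card-1", 300, 800))    # True: 800 km away after only 5 minutes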
Over the last years a number of CEP solutions have emerged in academia as well
as in industry. Some of the known early academic projects are TelegraphCQ [8], Ti-
nyDB [36], STREAM [41], Aurora/Borealis [1], and Padres [30]. The research initiatives were followed by (or directly led to) the emergence of several startups in this domain. For instance, the company StreamBase is based on the Aurora/Borealis project, and results from the STREAM project fueled the startup Coral8 and the CEP solution of Oracle. Other software vendors like Microsoft and IBM have created CEP solutions based on their own internal research projects (CEDR [3], System S [24]). In addition, several major vendors have strengthened or established CEP capabilities through acquisitions in recent years. Examples include the acquisition of Apama by Software AG, of StreamBase Systems by TIBCO, and of Sybase by SAP.
Next to purely commercial offerings, the market includes solutions that are available as open source. Examples include engines like Esper [19], ruleCore [50], or Siddhi [55]. Notable additions to the open source domain are the solutions Storm [56] and S4 [51]. These are not classical CEP engines in the sense that they do not provide a dedicated query language. Instead, Storm and S4 provide event processing platforms that focus on support for distributing logic to achieve scalability.
In this section, we briefly describe the challenges related to an integrated solution for the scalable analysis of Smart City data sources. We consider these challenges mainly from an integration point of view along two dimensions. First, there is the question of how batch and stream processing should be integrated in a modern Smart City environment (Section 4.1). Second, there is the challenge of how the variety of data sources should be handled in order to efficiently deliver new services and analyze these data sets as a whole rather than in isolation (Section 4.2). As social media sources provide a potentially rich source of information, we describe them separately (Section 4.3).
will require real-time processing of frequently updated small-volume structured data sets, while others will require complex analytic operations on large volumes of infrequently updated but semantically enriched unstructured data sets. The wide range of operational requirements entailed by this disparity suggests that the proper instrumentation of the data processing stage will be a paramount concern for IoT applications in Smart Cities. Such instrumentation matters need to be addressed in conjunction with the instrumentation options arising from Section 4.1 above.
The initial challenge when considering the integration of social network reports as sensorial insights is the extraction of such signals. Extracted sensorial data might include people's stated opinions and statements regarding a particular sentiment or topic of interest, reported facts about oneself (e.g., illness, vacation, event attendance), and various additional subjective reports. Natural language processing (NLP) along with machine learning algorithms are applied to extract the relevant signals from the data posted by the users of the social medium.
An additional challenge is the reliability and credibility of the social sensors. Naturally, the signals inferred by the models carry a certainty level produced by the model, and handling this data uncertainty is not trivial. Moreover, the reporting user could also be assigned a credibility score that measures the overall worthiness of considering his or her data in general.
The sparsity of geo-tagged data in social networks is another challenge when considering its value for integration with the IoT. For example, only 1% of all Twitter messages are explicitly geo-tagged by their users. Recent work suggests several methodologies to overcome this challenge by inferring users' locations from their context [13] [17].
Finally, there is a need for an architecture that combines both batch and stream processing over social data, in order to achieve two goals:
1. Enabling the combination of offline modeling over vast amounts of data and applying the resulting model over streaming data in real time. Most complex semantic analysis tasks, such as Sentiment Analysis [34] [45], require batch modeling. Feature extraction could be done in real time over a data stream by applying a sliding time window (e.g., 20 seconds) over measures such as term frequency or TF-IDF [52]; a minimal sketch of such windowed feature extraction is given after this list. For addressing the challenge of evaluating models over data streams whose distribution changes constantly, a sliding-window kappa-based measure has been proposed [6].
2. Analyzing data of all social sensor types, either streaming (e.g., Twitter) or non-
streaming (e.g., blog posts) using the same architecture.
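As announced in item 1 above, the following Python sketch illustrates such windowed feature extraction; it is our own illustration under simplified assumptions (whitespace tokenization, a 20-second window) and not the implementation used in the cited work.

from collections import Counter, deque

WINDOW_SECONDS = 20        # sliding window length, as in the example above
posts = deque()            # (timestamp, list of tokens) for posts inside the window

def on_post(timestamp, text):
    # Add an incoming post and return term frequencies over the current window.
    posts.append((timestamp, text.lower().split()))
    while posts and posts[0][0] < timestamp - WINDOW_SECONDS:
        posts.popleft()                    # expire posts older than the window
    counts = Counter(token for _, tokens in posts for token in tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}   # normalized term frequency

print(on_post(0, "traffic jam downtown"))
print(on_post(10, "heavy traffic on main street"))
print(on_post(35, "traffic cleared"))      # the first two posts have expired by now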
In this section, we discuss a case study from the smart grid domain that illustrates the application of big data technologies to smart home sensor data. The case is taken from the Peer Energy Cloud (PEC) project [46], which runs a smart grid pilot in a German city. The pilot includes installations of smart home sensors in private homes that measure energy consumption and power quality, such as voltage and frequency, at several power outlets in each home. Each sensor takes measurements every two seconds and streams the results into a cloud-based infrastructure that runs analytics for several different use cases. In this section, we discuss the technical details of three scenarios that we implemented in our labs using data from deployed sensors:
1. Power quality analytics – shows the benefits of big data batch processing tech-
nologies.
2. Real-time grid monitoring – shows the benefits of in-stream analytics.
3. Forecasting energy demand – shows the need to combine both batch and stream
processing.
The big data challenges in these use cases arise from the high data volumes. Every
household produces almost 2 million energy-related measurements a day. This num-
ber has to be multiplied by the number of households that use the technology and
accumulates over time. For instance, about 200,000 million voltage measurements per
month would be available in a full rollout in the small pilot city of Saarlouis.
The query logic for power quality analytics is relatively simple but poses challenges due to the high data volumes. Experiments in the pilot project revealed operational challenges already when implementing the analytics for the first four pilot households. Already at this limited scope, relational databases required tweaking to handle the queries. However, using the MapReduce programming model and the Hadoop framework, it was straightforward to implement the power quality analytics in a scalable way. This is because the underlying analytics problem is of an embarrassingly parallel nature and hence very well suited for parallel processing. For instance, it is trivial to partition the data for analyzing household-specific voltage fluctuations by household, and the Hadoop-based implementation achieved close to linear scalability.
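To show how naturally this problem partitions, the short Python sketch below (our own simplified illustration, not the project's MapReduce jobs) groups voltage readings by household and computes a per-household fluctuation statistic; in Hadoop, each household's partition would simply be handled by a different reducer.

from collections import defaultdict
from statistics import pstdev

def voltage_fluctuations(readings):
    # readings: iterable of (household_id, voltage) pairs
    by_household = defaultdict(list)
    for household, voltage in readings:          # partitioning by household is trivial
        by_household[household].append(voltage)
    # Each household can be processed independently (embarrassingly parallel).
    return {h: pstdev(v) for h, v in by_household.items()}

print(voltage_fluctuations([("h1", 229.8), ("h1", 230.4), ("h2", 231.0), ("h2", 228.9)]))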
measurements per second. A full rollout in the pilot city would result in about
360,000 new measurements every second. Continuously inferring an accurate live
state of the grid poses challenges to the throughput and latency of the analytics sys-
tem. The addressed analytics for power quality and consumption analysis are charac-
terized by incremental updates as well as temporal aggregates. Specifically for such a
setting, stream processing and CEP technologies provide an answer to these challeng-
es. Inferring the live state only requires computations over the latest sensor information. Thus, the state needed for processing is relatively small. CEP engines keep this state in memory and thereby enable high throughput. We found that CEP engines are suitable (a) to support the query logic for live analysis of power quality measures and power consumption and (b) to provide the required throughput. Using the open source
CEP engine Esper [19] we could run the required analytics for thousands of house-
holds in parallel on a single machine. The performance depends on implementation
details of the specific analysis. However, the processing paradigm of CEP significant-
ly eased the development of high throughput analytics over the pilot data.
Suitable technologies exist for the challenges of model learning as well as the real-time application of the prediction models. However, to our knowledge no off-the-shelf solutions directly meet the twofold challenges of this use case. Instead, a combination of big data technologies is required. We discuss such a combination in the following section.
Using Hadoop Eased Development. By using the Hadoop framework, we found that the benefits of out-of-the-box scalability materialized very early in the project. Even for rather simple analytics, and after only a few months of data collection, the development team struggled to make the corresponding queries scale sufficiently on relational databases. However, using Hadoop it was straightforward to achieve sufficient scalability and performance without the need to tune the implementation. This does not mean that solutions based on relational databases could not have achieved the required performance and scalability. Yet, the burden for the development team was significantly lower using Hadoop.
simply mishandling of the system are among the incidents that must be expected. Therefore, the analytics solution must cope with some degree of error and uncertainty in the input data.
Missing Best Practices for Combining Batch and Stream Processing. Regarding (4), the challenge of combining batch and stream processing, we found that the application of existing big data technologies is straightforward when doing batch and stream processing in isolation. Both worlds have mature tool chains that work to a large degree out of the box. However, the design space for combining batch and stream processing is more open, and best practices are less explored. While adapters for data exchange exist, the details of the interplay between batch and stream processing leave significant effort to development. For instance, we found that the need for batch-driven model learning and stream-driven application of the models recurs in many use cases. However, the most suitable technologies for these tasks do not provide off-the-shelf support for this integrated scenario.
As a first step toward addressing the Volume and Velocity challenge in the context of
Smart Cities, we devised an Analytical Stream Processing framework for handling
large quantities of IoT-related data. It extends the basic Lambda architecture by supporting statistical and machine-learning-based model learning in batches and the use of the learned models in the streaming layer.
Fig. 4 shows an initial draft of our framework. New data is dispatched both to the batch layer and to the stream processing layer. The main responsibility of the batch layer is to calculate models that characterize the incoming data, e.g., by describing recurring patterns or creating a prediction model. For instance, the batch layer can realize the learning of models for household-specific load prediction in the PEC project. The model is provided in the serving layer. Beyond the basic Lambda architecture, it is important to note that the serving layer must also serve the streaming layer, which needs the model data in order to provide its analytical results to the application layer. Online load prediction in the PEC project is an example of such a situation. Here the speed layer extracts features (i.e., temporal aggregates) from the incoming load measurements and calls the previously learned prediction models to obtain a prediction value.
Model Learning. In the model learning step the extracted features are used to calculate the actual model. The model parameters representing the model depend on the algorithms used. In our use case above this would include temporal aggregates of
household- and device-specific load measurements. For a statistical model the model
parameters would represent the parameters of a statistical distribution function such as
the mean and standard deviation for a normal distribution. The model would also
include thresholds on the variance based on which an anomaly is considered to be
detected.
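A minimal end-to-end sketch of this pattern is given below; it is our own Python illustration of the framework's idea, with assumed parameter names, data, and an assumed three-sigma threshold, not the PEC implementation. The batch layer fits a simple statistical model per household, the serving layer holds the model parameters, and the speed layer applies them to incoming measurements.

from statistics import mean, pstdev

# Batch layer: learn a per-household model (mean, standard deviation) from historical data.
def learn_model(history):
    return {"mean": mean(history), "std": pstdev(history), "k": 3.0}   # k: assumed threshold

# Serving layer: learned models, keyed by household.
models = {"household-1": learn_model([1.2, 1.1, 1.3, 1.2, 1.4])}       # e.g., kW load values

# Speed layer: apply the model to each incoming measurement.
def on_measurement(household, value):
    m = models[household]
    if m["std"] == 0:
        return False
    return abs(value - m["mean"]) > m["k"] * m["std"]    # anomaly if outside k standard deviations

print(on_measurement("household-1", 1.25))    # False: within the learned range
print(on_measurement("household-1", 3.00))    # True: flagged as an anomaly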
As in the basic Lambda architecture, the application layer accesses the data from both the serving and the speed layer. The application uses the data from the batch and streaming layers in two ways. First, the streaming data can be used to provide a real-time view of the features calculated in the batch layer (10). As in the basic Lambda architecture, this can be useful to provide a real-time dashboard view, for instance
reflecting the current state of the power grid. Second, the application may only con-
sume events generated in the stream layer that are based on applying incoming data to
the model as described above (11). This way applications and human operators
can receive events about detected anomalies or continuous predictions about energy
consumption.
For further work we plan to extend our analytical framework and apply it to other
domains in the context of Smart Cities. This includes in particular adding more ma-
chine learning algorithms for the batch layer and the corresponding logic for the
streaming layer.
In this chapter we have focused on an analytical framework for processing large
volumes of data in real-time, i.e., we addressed mainly the volume and velocity chal-
lenge. As we extend our work to different data sets and application domains, it will,
however, be increasingly important to cope with the variety of data sources and their
data.
It is, therefore, a key requirement for the proposed analytical framework that both
the model learning algorithms as well as the stream logic can be applied uniformly
across different data sets and application domains. On one hand this will maximize
the value that can be extracted from available data sets and on the other hand the pro-
cessing chain can then easily be applied to new data sets, thus saving effort, time, and
costs during the development process.
A future research direction is therefore to extend the analytical framework with the
necessary mechanisms to achieve such uniform processing. This could for instance be
realized by a metadata model on which the corresponding logic operates. As we see the application of this analytical framework mainly in the context of Smart Cities and the Internet of Things, an entity-based framework that naturally models real-world entities such as sensors, people, buildings, etc. appears to be a suitable choice [23]. Such an information model along
with a corresponding architecture has been defined by the IoT-A project [5].
8 Conclusions
In this chapter, we have proposed an initial draft of a Big Data analytical framework
for IoT and Smart City applications. The framework is based on existing state of the
art, initial findings from our participation in the publicly funded projects BIG [7] and
PEC [46] as well as our own experiences with analytical applications for Smart Cities.
Our work is motivated by the fact that key components such as data sources, algorithms, IoT architectures, and Big Data technologies are available today, but considerable effort is still required to turn them into operational value.
A significant part of this effort is due to missing standards (cf., for instance, the SQL query language for relational databases), the wide variety of different technologies in the Big Data domain, and the required integration effort. But the application of
advanced analytical computation at scale and speed also requires considerable design
effort and experience. We believe that an analytical Big Data framework along with appropriate toolboxes can add significant value both to the required development effort and to the insights that can be derived from the data. While there are Big Data machine
learning libraries such as Mahout [42], as well as frameworks for model learning on
top of Hadoop [48], we are not aware of fully integrated analytical frameworks that
combine model learning and stream processing.
In order to fully benefit from the framework its overall design and associated
toolboxes need to support a variety of data sources and algorithms. While this re-
quirement does not change the high level architecture of the framework, such exten-
sions do have significant impact on the interface level and overall design of the indi-
vidual processing components. A carefully planned and sound conceptual design as
well as pragmatic implementation decisions will be an important enabler to reduce
development costs and create innovative services in the IoT and Smart City domain.
Acknowledgments. This work has partly been funded by the EU funded project Big
Data Public Private Forum (BIG), grant agreement number 318062, and by the Peer
Energy Cloud project which is part of the Trusted Cloud Program funded by the Ger-
man Federal Ministry of Economics and Technology. We would also like to thank
Max Walther, who implemented the batch-driven power quality analytics MapReduce
jobs as well as Alexander Bauer and Melanie Hartmann who supported us with the
design of the analytical models.
References
1. Abadi, D.J., et al.: The Design of the Borealis Stream Processing Engine. In: CIDR, vol. 5,
pp. 277–289 (2005)
2. Atzori, L., Iera, A., Morabito, G.: The internet of things: A survey. Computer Net-
works 54(15), 2787–2805 (2010)
3. Barga, R.S., et al.: Consistent streaming through time: A vision for event stream pro-
cessing. arXiv preprint cs/0612115 (2006)
4. Batty, M.: Smart Cities and Big Data, http://www.spatialcomplexity.info/
5. Bauer, M., Bui, N., Giacomin, P., Gruschka, N., Haller, S., Ho, E., Kernchen, R., Lischka,
M., Loof, J.D., Magerkurth, C., Meissner, S., Meyer, S., Nettsträter, A., Lacalle, F.O., Se-
gura, A.S., Serbanati, A., Strohbach, M., Toubiana, V., Walewski, J.W.: IoT-A Project De-
liverable D1.2 – Initial Architectural Reference Model for IoT (2011),
http://www.iot-a.eu/public/public-documents/d1.2/view
(last accessed September 18, 2013)
6. Bifet, A., Frank, E.: Sentiment knowledge discovery in twitter streaming data. In:
Pfahringer, B., Holmes, G., Hoffmann, A. (eds.) DS 2010. LNCS (LNAI), vol. 6332, pp.
1–15. Springer, Heidelberg (2010)
7. BIG Project Website, http://www.big-project.eu/ (last accessed September 19,
2013)
8. Chandrasekaran, S., et al.: TelegraphCQ: continuous dataflow processing. In: ACM
SIGMOD International Conference on Management of Data, pp. 668–668. ACM (2003)
9. Chu, C.-T., Kim, S.K., Lin, Y.A., Yu, Y.Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-
Reduce for Machine Learning on Multicore. In: Schölkopf, B., Platt, J.C., Hoffman, T.
(eds.) Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 281–288.
MIT Press, Cambridge (2007)
10. Correia, Z.P.: Toward a Stakeholder Model for the Co-Production of the Public Sector In-
formation System. Information Research 10(3), paper 228 (2005),
http://InformationR.net/ir/10-3/paper228.html
(last accessed February 27, 2013)
11. DataMarket, http://datamarket.com/ (last accessed September 21, 2013)
12. Data.gov, http://www.data.gov/ (last accessed September 21, 2013)
13. Davis, J.R., Clodoveu, A., et al.: Inferring the Location of Twitter Messages based on User
Relationships. Transactions in GIS 15(6), 735–751 (2011)
14. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters.
Communications of the ACM 51(1), 1–13 (2008), doi:10.1145/1327452.1327492
15. Directive 2003/98/EC of the European Parliament and of the Council of 17 November
2003 on the re-use of public sector information, http://eur-lex.europa.eu/
LexUriServ/LexUriServ.do?uri=CONSLEG:2003L0098:20130717:EN:P
DF (last accessed January 13, 2014)
16. Dohler, M.: Machine-to-Machine Technologies, Applications & Markets. In: 27th IEEE Inter-
national Conference on Advanced Information Networking and Applications (AINA) (2013)
17. Dredze, M., Paul, M.J., Bergsma, S., Tran, H.: Carmen: A Twitter Geolocation System
with Applications to Public Health (2013)
18. The Economist, Running out of road (November 2006)
19. EsperTech, http://esper.codehaus.org (last accessed September 22, 2013)
20. Etzion, O.: On Off-Line Event Processing. Event Processing Thinking Online Blog (2009),
http://epthinking.blogspot.de/2009/02/on-off-line-event-
processing.html (last accessed September 17, 2013)
21. European Open Data Portal, http://open-data.europa.eu/ (last accessed Sep-
tember 21, 2013)
22. Farroukh, A., Sadoghi, M., Jacobsen, H.-A.: Towards vulnerability-based intrusion detec-
tion with event processing. In: 5th ACM International Conference on Distributed Event-
based System, pp. 171–182. ACM (2011)
23. Gazis, V., Strohbach, M., Akiva, N., Walther, M.: A Unified View on Data Path Aspects
for Sensing Applications at a Smart City Scale. In: IEEE 27th International Conference
onAdvanced Information Networking and Applications Workshops (WAINA 2013), pp.
1283–1288. IEEE Computer Society, Barcelona (2013), doi:10.1109/WAINA.2013.66
24. Gedik, B., Andrade, H., Wu, K.L., Yu, P.S., Doo, M.: SPADE: The System S declarative
stream processing engine. In: ACM SIGMOD International Conference on Management of
Data, pp. 1123–1134. ACM (2008)
25. Giraph Project, http://giraph.apache.org/ (last accessed September 21, 2013)
26. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (IoT): A vision, ar-
chitectural elements, and future directions. Future Generation Computer Systems (2013)
27. Hinze, A., Sachs, K., Buchmann, A.: Event-based applications and enabling technologies.
In: Third ACM International Conference on Distributed Event-Based Systems (2009)
28. INFSO D.4 Networked Enterprise & RFID INFSO G.2 Micro & Nanosystems, Internet of
Things in 2020 – A roadmap for the Future (September 2008), report available at
http://www.smart-systems-integration.org/public/
internet-of-things
29. ITU, The Internet of Things (2005)
30. Fidler, E., Jacobsen, H.A., Li, G., Mankovski, S.: The PADRES Distributed Pub-
lish/Subscribe System. In: FIW, pp. 12–30 (2005)
31. van Kasteren, T., Ravkin, H., Strohbach, M., Lischka, M., Tinte, M., Pariente, T., Becker,
T., Ngonga, A., Lyko, K., Hellmann, S., Morsey, M., Frischmuth, P., Ermilov, I., Martin,
M., Zaveri, A., Capadisli, S., Curry, E., Freitas, A., Rakhmawati, N.A., Ul Hassan, U.,
Iqbal, A.: BIG Project Deliverable D2.2.1 – First Draft of Technical White Papers (2013),
http://big-project.eu/deliverables (last accessed September 19, 2013)
32. Laney, D.: 3D Data Management: Controlling Data Volume, Velocity and Variety. Meta
Group Research Report (2001), http://blogs.gartner.com/doug-laney/
files/2012/01/ad949-3D-Data-Management-Controlling-Data-
Volume-Velocity-and-Variety.pdf (last accessed September 21, 2013)
33. Leeds, D.J.: THE SOFT GRID 2013-2020:Big Data & Utility Analytics for Smart Grid.
GTM Research Report (2012), http://www.greentechmedia.com/
research/report/the-soft-grid-2013 (last accessed September 21, 2013)
34. Liu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language
Technologies 5(1), 1–167 (2012)
35. The London Dashboard, http://data.london.gov.uk/london-dashboard
(last accessed September 21, 2013)
36. Madden, S.R., Franklin, M.J., Hellerstein, J.M., Hong, W.: TinyDB: An acquisitional que-
ry processing system for sensor networks. ACM Transactions on Database Systems 30,
122–173 (2005)
37. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.:
Big data: The next frontier for innovation, competition, and productivity. McKinsey Glob-
al Institute (2013), http://www.mckinsey.com/insights/
business_technology/big_data_the_next_frontier_for_
innovation (last accessed August 20, 2013)
38. Marz, N., Warren, J.: A new paradigm for Big Data. In: Big Data – Principles and Best
Practices of Scalable Real-time Data Systems, ch. 1, Manning Publications Co. (to appear),
http://www.manning.com/marz/ (last accessed August 16, 2013), ISBN
9781617290343
39. Miorandi, D., Sicari, S., De Pellegrini, F., Chlamtac, I.: Internet of things: Vision, applications
and research challenges. Ad Hoc Networks 10(7), 1497–1516 (2012)
40. Microsoft BI Team, Big Data, Hadoop and StreamInsightTM,
http://blogs.msdn.com/b/microsoft_business_intelligence1/
archive/2012/02/22/big-data-hadoop-and-streaminsight.aspx (last
accessed September 09, 2013)
41. Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar, M., Manku, G., Olston,
C., Rosenstein, J., Varma, R.: Query processing, approximation, and resource management
in a data stream management system. In: CIDR Conference, pp. 1–16 (2002)
42. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications
Co. (2011) ISBN 9781935182689
43. Neubauer, P.: Neo4j and some graph problems, http://www.slideshare.net/
peterneubauer/neo4j-5-cool-graph-examples-
4473985?from_search=2 (last accessed September 21, 2013)
44. Palensky, P., Dietrich, D.: Demand side management: Demand response, intelligent energy
systems, and smart loads. IEEE Transactions on Industrial Informatics 7(3), 381–388
(2011)
45. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in In-
formation Retrieval 2(1-2), 1–135 (2008)
1 Introduction
Nowadays, the proliferation of the data sources available on the Internet and the
widespread deployment of network-based applications are fostering the emergence of new architectures referred to as “big data”, characterized by the need to capture, combine, and process an ever-growing amount of heterogeneous and unstructured data coming from new mobile devices, emerging social
media, and human/machine-to-machine communication. This implies managing
data at volumes and rates that push the frontiers of current archival and process-
ing technologies. For this reason, such architectures usually require the orches-
trated usage of geographically sparse resources to satisfy their immense compu-
tational and storage requirements. Consequently, the distributed nature of these
resources makes remote data access and movement, often involving petabytes
of data, the major performance bottleneck for the involved end-to-end applica-
tions. However, big data processing architectures are now widely recognized as
one of the most significant ICT innovations in the last decade, since they can
"#
"
!#
!
#
#
#
!
Fig. 1. ITU Global Statistics: telecom services subscription trends. Data from [1].
Mobility is another great enabler for both data production and processing demand. This is directly associated with the considerable increase in mobile telecommunication service subscriptions experienced in recent years (see fig. 1). These services, also fostered by network pervasiveness and availability, have proved to be a formidable accelerator for data production in almost every IT sector.
The constantly evolving social media are one of the most powerful catalysts of
unstructured data, mainly produced by mobile devices. The potential of social
networks is enormous. Their growth is driven by their widespread use not only
in the consumer arena but also in enterprise scenarios, so that they are even
now ready for marketing and social communication. In fact the future of unified
collaboration technologies is to become social-like.
To give an idea of the amount of data flowing through social media, consider that every minute 100 hours of video are uploaded to YouTube [2]; since an hour of standard video takes up about 1 GB on average, this results in a production of about 100 GB/min, i.e., roughly 50 PB/year.
Science and research are other fields that contribute significantly to big data. In recent years, research centers have made large investments in technologies in order to perform experiments and to study or simulate physical phenomena. Many scientific activities continuously gather and analyze huge amounts of
unstructured data. As an example, we can consider the Large Hadron Collider
(LHC) experiments, involving about 150 million sensors delivering data at the
rate of 40 million samples/second. The information produced by all four main
LHC experiments sums to about 700MB/sec (or 20.5 PB/year) before replica-
tion on geographically distant processing sites, reaching about 200 petabytes
after replication [3][4]. While the LHC is one of the most impressive experiments in terms of data produced, there are many other initiatives that are sources of huge amounts of data, such as human genome sequencing, climate simulation, the Sloan Digital Sky Survey, and so on.
2.6 Clouds
Clouds are also strongly involved in big data generation, simply because many services relying on unstructured data are delivered in the cloud style. For instance, e-mail, instant messaging, and many social networks rely on cloud infrastructures because they provide the only scalable way of allowing such a huge number of end-users to access these services through the web. On the other side, there are enterprise applications and services, for which the cloud paradigm (also in its private cloud vision) is an excellent accelerator, because it makes it extremely easy and affordable to deploy new large-scale data warehousing architectures supporting data generation, processing, replication, or low-cost storage. The SaaS (Software as a Service), PaaS (Platform as a Service), and DaaS (Data as a Service) cloud service models cover almost all the needs associated with big data processing and make big data opportunities more affordable. In particular, DaaS can be seen as a form of managed service, similar to Software as a Service or Infrastructure as a Service, delivering data analysis facilities offered by an outside provider in order to help organizations understand and use the insights gained from large datasets, with the goal of acquiring a competitive advantage. Big data as a service often relies upon cloud storage (see fig. 2) to provide effective and flexible data access to the organization that owns the information as well as to the provider working with it.
Cloud data services can also be the cheapest way to store data. The chart reported in fig. 3 shows that the data really relevant to an organization's objectives may be a small part of the whole amount. Consequently, a tiering strategy based on relevance can be an affordable solution for storing a huge amount of data, in which the bulk of the data, characterized by limited relevance, is stored in the cloud.
Historically, data has been one of the most important assets for firms and governments, and data analysis is the first element needed for creating strategies and measuring their effects. So, big data processing and archival technologies may bring great opportunities with them to improve the performance and competitiveness of modern organizations, resulting in significant benefits for the overall society. There is a significant difference between data and big data beyond the purely dimensional factors. Exploitation of big data potentialities creates the opportunity for a real "quantum jump" or "sharp transition", that is, a significant and unusual level of improvement, in the traditional IT scenario. Mainly, it changes the way we manage and share our data by making available a huge amount of heterogeneous information coming from a wide variety of sources that can be correlated with each other in order to create new added value.
Several sectors of modern society have a great potential for drawing value from big data. According to [6], both the electronic devices production sector and the information management sector can considerably gain from big data processing technologies and produce added value even in the near term. In particular, they can rely on big data analytics to generate insights to improve their strategies, products, and services. On the other hand, other strategic sectors, such as transportation, manufacturing, and healthcare, are characterized by a slightly lower potential for gaining value from big data, essentially because of the fragmentation of initiatives that, until now, has not yet allowed the involved organizations to collect, integrate, and analyze significant amounts of mission-critical data. Also government and financial organizations could better use their data in order to refine their growth and evolution plans and to forecast economic events. In addition, big data correlation could provide a multidimensional view of their ecosystem, generating powerful advice that can be helpful in optimizing their operations.
Clearly, data availability is the most obvious enabler for all the above opportunities. However, the widespread availability of massive amounts of information introduces new challenges related to data security and privacy. In fact, when coping with big data, the source is a fundamental matter, not only to recognize what or who produces the data, but also to understand which data can be collected/archived, how and where to store them, and what kind of data can be provided to whom. As digital data travel across organizational boundaries, several policy issues, such as privacy, security, intellectual property, and liability, become critical. In particular, privacy is a concern whose importance grows as the big data value becomes more evident. Healthcare and financial data could be the most significant examples in terms of introduced benefits while being extremely privacy sensitive. In these cases it is necessary to deal with the trade-off between privacy and utility. Recent and past history teaches that data breaches can expose personal consumer information, confidential corporate information, and even national secrets or classified information, so that security issues become more and more important. With the emergence of social networking technologies and new media for information sharing, the importance of property rights has also grown significantly. Clarifying who "owns" information and what rights come attached to a dataset becomes fundamental for a fair use of the involved data. Finally, another basic question is related to liability: who is responsible for data utilization? Generally speaking, as the economic importance of big data increases, it also raises a number of legal issues, making the resulting scenario very complex to manage because data, being immaterial, are fundamentally different from many other assets, and it is impossible to compare them with traditional physical assets from the legal point of view.
Professional skills are another fundamental factor, since companies need new professions, together with the related educational courses, in order to generate value from big data technologies. Experts in data mining, machine learning, statistics, management, and analysis can significantly help organizations in the above task.
In order to face the main technological challenges related to big data, it is necessary to acquire a deep understanding of the fundamental features and dynamics characterizing the involved scenario. These features (see fig. 4) can essentially be identified by the five "V" criteria:
– Volume: big data are huge in quantity and ever increasing. According to the recent IDC survey [9], the volume of data to be managed by 2020 will increase more than 40 times over current levels.
– Velocity: more data implies increased speed in accessing, transmitting, and processing them, so that proper technological and architectural solutions are needed to capture, understand, categorize, prioritize, and analyze big data at the maximum possible speed.
– Variety: big data may come from a large number of sources and may be archived by using a wide variety of formats, structured, unstructured, and
costs related to storage risk growing out of control. Fortunately, the cost of each kind of storage device, even SSDs, is decreasing as the amount of data to be stored grows, due to the optimization of manufacturing processes and to the large diffusion of the involved devices. This introduces a partial rebalancing effect that, however, is not enough to compensate for the data growth. In addition, increasing the quantity of available storage is not a final response to data proliferation, since more devices, or higher-capacity ones, usually imply not only more physical space but also more power consumption, more risk, and lower performance.
Improving storage efficiency could be the key. There are several initiatives focused on storage features and aimed at improving their efficiency: Deduplication, Compression, Scaling out, Auto-tiering, and Thin Provisioning. Each of them can substantially improve the efficiency of storage utilization and, accordingly, reduce power consumption, bandwidth demand, and management overhead. The effectiveness of each option depends on the specific application case. At the same time, each technique presents its own tradeoffs, limiting its utilization scope.
Data Deduplication
Deduplication is an excellent way of optimizing data storage, especially in multi-tenant and centralized storage systems. While data volumes are continuously growing, a lot of the data stored around the world is the same, or very similar. We can better appreciate this phenomenon by thinking about information published via social media, which is very often re-posted by several users. Deduplication avoids storing the same piece of data several times by partitioning an incoming data stream into data chunks and comparing such chunks with previously stored data. If a chunk is unique, it is stored. Otherwise, if an incoming chunk is a duplicate of some block of data that has already been stored, a reference to that block is associated with it and the data chunk is not stored again. In other words, deduplication algorithms analyze the data to be archived on disk and store only the unique blocks of a file or a set of files (e.g., present within a backup). This can reduce the storage capacity needed by a factor of 10 to 30, bringing significant economic benefits. The fundamental dynamics behind deduplication are sketched in fig. 5.
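To make the mechanism concrete, the following minimal Python sketch illustrates fixed-size chunking with hash-based duplicate detection; the 4 KB chunk size and the in-memory index are illustrative assumptions, not a description of any specific commercial product:

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size (real systems often use variable-size chunking)

chunk_store = {}   # maps chunk fingerprint -> chunk data (the "unique blocks")

def deduplicate(stream: bytes):
    """Split a data stream into chunks and store only the unique ones.

    Returns the list of fingerprints ("references") needed to rebuild the stream.
    """
    references = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in chunk_store:   # unique chunk: store it
            chunk_store[fingerprint] = chunk
        references.append(fingerprint)       # duplicate or not, keep only a reference
    return references

def rebuild(references):
    """Reconstruct the original stream from its chunk references."""
    return b"".join(chunk_store[f] for f in references)

# Example: two "files" sharing most of their content are stored almost once.
file_a = b"A" * 8192 + b"header-1"
file_b = b"A" * 8192 + b"header-2"
refs_a, refs_b = deduplicate(file_a), deduplicate(file_b)
assert rebuild(refs_a) == file_a and rebuild(refs_b) == file_b
print("unique chunks stored:", len(chunk_store))  # far fewer than the total chunks written
```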
The choice of storing only unique chunks of data can be very effective in terms of storage space utilization, although it introduces the need for additional computing power in order to perform deduplication (on writes) and data reconstruction (on reads) in each single I/O operation. This can introduce non-negligible latencies, so that online deduplication can be performed only in the presence of adequately performing storage architectures. The deduplication ratio is the measure of deduplication efficiency. It can assume very different values on a case-by-case basis, depending on the kind of data stored as well as on the devices' occupation and hardware equipment. In a typical storage system, deduplication efficiency can vary significantly over time. Initially, the storage occupation increases linearly, since the device is not yet able to perform deduplication. As the amount of available data grows, the probability of finding replicas of some pieces of data increases, so that deduplication can start, introducing significant savings in storage occupation. Several commercial deduplication solutions are available, such as EMC Data
Domain and HP StoreOnce, but deduplication facilities have also been added to the ZFS [11] open source implementation.
Data Compression
With the success of big data technologies, the demand for effective structured and unstructured data compression techniques is ever growing, in order to reduce storage space requirements and hence increase storage efficiency. Performance in compression and decompression functions is achieved through a sophisticated balancing of hardware and software solutions, working on top of a sound data storage architecture. In particular, the right algorithm should be chosen for each kind of data, by considering the tradeoff between the reduction in storage space occupied and the efficiency of compression/decompression activities, which can take place online or offline depending on the chosen strategy. For example, lossless compression tools derived from the Lempel-Ziv-Welch [12][13][14] scheme (such as Gzip [15]) or from the Burrows-Wheeler transform [16] (e.g., Bzip2 [17]) can be used for generic data, whereas specialized lossless or lossy solutions (e.g., those provided in JPEG [18] or PNG [19]) can be more effective for compressing specific kinds of data, such as photographic images. However, this may increase processing times by a factor of 3–4, also introducing an additional overhead when data are inserted and updated.
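As a rough illustration of the space/time tradeoff mentioned above, the following hedged Python sketch compares a Lempel-Ziv based codec (zlib, as used by Gzip) with a Burrows-Wheeler based one (bz2) on the same input; the sample payload is invented for the example, and the observation that bz2 usually compresses text better but more slowly is only a typical outcome, since actual figures depend entirely on the data:

```python
import bz2
import time
import zlib

# Illustrative payload: repetitive text compresses well with either codec.
payload = b"sensor_id,timestamp,value\n" + b"42,2013-11-01T00:00:00,3.14\n" * 50_000

for name, compress in (("zlib (LZ77/Gzip family)", zlib.compress),
                       ("bz2 (Burrows-Wheeler)", bz2.compress)):
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"{name}: ratio {ratio:.1f}x in {elapsed * 1000:.1f} ms")
# Typically bz2 achieves a higher ratio than zlib on text-like data,
# at the cost of noticeably longer compression time.
```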
Scale-Out Storage
Another strategic element for storage efficiency is scale-out storage, that is, a storage architecture relying on a scaling methodology to build up a dynamic
Thin Provisioning
Thin provisioning is essentially a flexible storage allocation strategy achieving just-in-time storage space provisioning, that is, allocating new room on disks only when it is strictly needed, and not at system setup time, in order to avoid any kind of storage over-dimensioning at any time. In such a way, storage
Storage Tiering
Storage efficiency is also a concept related to the access performance needed by applications which, according to their mission, purposes, operating environment, and constraints (see fig. 6), may be characterized by specific demands, usually expressed in terms of throughput, latency, and protection, through specific Service Level Agreements (SLAs). In order to honor these SLAs, the last generation of storage solutions uses the concept of tiering.
In detail, storage tiering manages the migration of active data onto higher-performance storage devices (the topmost tiers) and of inactive data onto low-cost, high-capacity ones. The result is increased performance, lower cost, and a denser footprint than conventional systems. By moving data onto disk devices
suitable for ensuring a specific SLA, such architectures can realize a direct link between data and their related SLAs. For instance, if a piece of a database requires an access performance of 200 IOPS, it must be placed on devices (SSD or SAS disks) which are able to fulfill such an SLA. A similar approach is also very useful for balancing performance and costs, by getting the most from a limited amount of high-performance storage devices (e.g., SSD units) while owning a virtually unlimited storage space built by using low-cost devices (e.g., NL-SATA disks). That is, through the concept of a storage hierarchy, one can rely on the abstraction of a virtual storage space characterized by the capacity of the whole aggregate and the performance of the SSD devices.
Usually three tiers are enough, associated with data classification:
– Tier 1: Mission-critical data, when the highest degree of performance, reliability, and accessibility is needed
– Tier 2: Seldom-used data, when mid-level performance is acceptable
– Tier 3: Archive data, when only retention is required for compliance
The migration of data between storage tiers is based on fully automated or manual policies. Auto-tiering, or dynamic storage tiering, seems to be the most interesting way to manage data movement across disks. Clearly, such a feature simplifies operations and therefore reduces the related costs. A typical criterion for deciding when to move data is access frequency: the most frequently accessed data are moved to higher-performance storage layers, and vice versa. At the same time, applications and people can play a role in moving data across storage layers, deciding when pieces of data have to be moved, or deciding that some pieces of data must be stored on certain devices. That can happen, for instance, when access to data requires particularly high performance or when data are permanently archived.
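A minimal sketch of such an access-frequency-based auto-tiering policy is given below; the tier names, the promotion/demotion thresholds, and the access counters are purely illustrative assumptions rather than the behavior of any specific storage product:

```python
from dataclasses import dataclass

# Illustrative tiers, ordered from fastest/most expensive to slowest/cheapest.
TIERS = ["tier1_ssd", "tier2_sas", "tier3_nl_sata"]
PROMOTE_THRESHOLD = 100   # accesses per period that justify moving data one tier up
DEMOTE_THRESHOLD = 5      # accesses per period below which data moves one tier down

@dataclass
class DataBlock:
    name: str
    tier: str = "tier2_sas"   # new data lands on the middle tier by default
    accesses: int = 0         # access counter for the current observation period
    pinned_tier: str = ""     # non-empty when an application explicitly pins the block

def rebalance(blocks):
    """Promote hot blocks and demote cold ones once per observation period."""
    for block in blocks:
        if block.pinned_tier:                      # manual placement wins over the policy
            block.tier = block.pinned_tier
        elif block.accesses >= PROMOTE_THRESHOLD:  # hot: move up (towards SSD)
            block.tier = TIERS[max(TIERS.index(block.tier) - 1, 0)]
        elif block.accesses <= DEMOTE_THRESHOLD:   # cold: move down (towards NL-SATA)
            block.tier = TIERS[min(TIERS.index(block.tier) + 1, len(TIERS) - 1)]
        block.accesses = 0                         # start a new observation period

# Example: a hot block is promoted to SSD, a cold one demoted to NL-SATA.
blocks = [DataBlock("orders_index", accesses=500), DataBlock("old_backup", accesses=1)]
rebalance(blocks)
print([(b.name, b.tier) for b in blocks])
# [('orders_index', 'tier1_ssd'), ('old_backup', 'tier3_nl_sata')]
```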
– Real-time reporting and access, drastically reducing the expected ETL (extract, transform, load) processing times
In-memory architectures have several advantages compared to legacy solutions, especially when related to business intelligence and data analytics. Real-time analytics is one of the most appealing features offered by in-memory solutions. It allows very fast information navigation, correlation, and extraction.
Furthermore, while developing a traditional business intelligence tool can take more than 15–17 months, in-memory solutions allow this time to be reduced considerably, since there is no longer any need to introduce sophisticated optimization techniques. On the other hand, the amount of RAM needed increases as the number of users grows, affecting the costs of the solution. For this reason, several vendors are introducing high-performance solid state memory in place of DRAM.
Many market-leading vendors are investing in such technologies and releasing new products based on in-memory solutions (e.g., SAP HANA, WebDNA, H2, Hazelcast, UnQLite, Ehcache, Oracle TimesTen), and several available benchmarks show that in-memory databases provide near-linear scalability. For instance, eXtremeDB by McObject claims linear scalability within a benchmark consisting of 160 64-bit processor cores and over 1 terabyte of data entirely in memory. On the other hand, an open source MySQL Cluster tested on a 16-node cluster achieved 500,000 reads/second, and increasing the number of nodes to 32 tripled the performance. However, performance is just the most visible aspect of in-memory storage; it should also be considered that almost all the available solutions can provide replication to ensure availability of the data as well as load balancing.
In order to scale out in-memory storage architectures, the new concept of the in-memory data grid (IMDG) has been proposed. It differs from traditional in-memory systems in that it distributes and stores data across multiple storage nodes scattered throughout the network. It is also based on a quite different data representation model, usually object-oriented (serialized) and non-relational. At first glance, an IMDG is a distributed database providing an interface similar to a concurrent hash map. It stores objects by associating them with keys according to the traditional key-value mapping scheme. There are also some other features in IMDGs that distinguish them from other products, such as NoSQL and in-memory databases. One of the main differences is truly scalable data partitioning across clusters. Essentially, IMDGs in their purest form can be viewed as distributed hash maps with every key cached on a particular cluster node: the bigger the cluster, the more data you can cache. In a well-designed IMDG there is no, or minimal, data movement; the only movement, aimed at repartitioning data across the cluster, should occur when nodes are added or removed. In this situation, processing should be performed only on the nodes where data is cached.
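The partitioning idea can be illustrated with a minimal sketch of a distributed hash map that assigns each key to a cluster node via consistent hashing; the node names and the hashing scheme are illustrative assumptions and are not taken from any particular IMDG product:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Stable hash used to place both nodes and keys on the same ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class InMemoryDataGrid:
    """Toy distributed hash map: every key is cached on exactly one node."""

    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)   # consistent-hash ring
        self.storage = {n: {} for n in nodes}               # per-node key-value store

    def _node_for(self, key: str) -> str:
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        self.storage[self._node_for(key)][key] = value

    def get(self, key):
        return self.storage[self._node_for(key)].get(key)

# Example: keys spread across three nodes; lookups go straight to the owning node.
grid = InMemoryDataGrid(["node-a", "node-b", "node-c"])
for i in range(6):
    grid.put(f"customer:{i}", {"id": i})
print({node: sorted(kv) for node, kv in grid.storage.items()})  # key placement per node
print(grid.get("customer:3"))                                   # {'id': 3}
```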
Generally speaking, the big data landscape is getting very complex, due to the emerging needs and opportunities coming from such a rapidly evolving scenario. It essentially encompasses two main classes of technologies: real-time data management systems that provide operational capabilities for real-time, interactive workloads, and systems offering analytical capabilities that can be used to perform retrospective and complex analysis tasks that usually need to access most or even all of the data of interest. Although these classes are complementary, they are frequently deployed together.
New kinds of database management systems have been designed to take advantage of new distributed computing and storage architectures (e.g., clouds and grids) in order to exploit massive computational power and archival capacity. This also allows managing real-time big data workloads in a much easier and cheaper way, making implementation efforts considerably faster. This is fundamental to allowing some degree of direct interaction with the data as soon as they are produced, for example, in financial or environmental monitoring applications. On the other hand, big data analytics, by allowing processing and correlation of huge amounts of data in reasonable time, can foster the emergence of previously hidden insights and support effective forecasting by revealing unknown trends and evolution patterns, removing the need for sampling and polls, as well as promoting a new, more investigative and deterministic approach to data analysis, leading to more reliable and precise results.
Due to the volumes of data involved, big data analytics workloads tend to be addressed by using parallel architectures such as massively parallel processing (MPP) and MapReduce-based systems. These technologies are also a reaction to the limitations of traditional databases and their inability to scale beyond the resources granted by a single cluster of servers. Furthermore, the MapReduce strategy provides a new way of analyzing data that is complementary to the capabilities provided by SQL.
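To illustrate the MapReduce strategy in its simplest form, the following minimal Python sketch runs the classic word-count job in a single process; on a real cluster the map and reduce calls would be executed in parallel on the nodes holding the data splits, but the structure of the computation is the same:

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (key, value) pair for every word in the input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for the same key."""
    return key, sum(values)

# Example: two "splits" of an input dataset, processed independently by the mappers.
splits = ["big data needs big storage", "big data needs parallel processing"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)
# {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'parallel': 1, 'processing': 1}
```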
Traditional databases have substantial limitations when working with big data. In summary, the most critical ones are:
In order to cope with the above problems, a higher degree of parallelism in both
data storage and query processing is needed.
Parallel Architectures
Traditional parallel database systems are structured according to a Symmetric Multiprocessor (SMP) architecture, where multiple CPUs are available to run the data retrieval and processing software and to manage the storage resources, memory, and disks, but only a single CPU is used to perform database searches within the context of a single query.
The new massively parallel processing (MPP) architectures are designed to allow faster query operations on very large volumes of data. These architectures are built by combining multiple independent server/storage units working in parallel, in order to achieve linear increases in processing performance and scalability. Spreading data across many independent units in small slices results in more efficient database searches, whose performance increases roughly proportionally to the number of involved units.
All the interactions between the server units are accomplished through a high-performance network, so that there is no disk-level sharing or any kind of contention to be handled. For this reason, such architectures are also known as "shared-nothing" schemes.
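The benefit of spreading data in small slices across independent units can be sketched as follows; this hedged Python example scans several partitions in parallel with a process pool, standing in for the independent server/storage units of an MPP system (the partition layout and the query predicate are illustrative assumptions):

```python
from concurrent.futures import ProcessPoolExecutor

def scan_partition(partition):
    """Worker: each independent unit scans only its own slice of the data."""
    rows, threshold = partition
    return [row for row in rows if row["amount"] > threshold]

def parallel_query(partitions, threshold):
    """Fan the query out to all units, then merge the partial results."""
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(scan_partition,
                                        [(rows, threshold) for rows in partitions]))
    return [row for partial in partial_results for row in partial]

if __name__ == "__main__":
    # Example: the table is pre-split into 4 slices held by 4 independent units.
    partitions = [[{"id": i, "amount": i % 500} for i in range(n, 100_000, 4)]
                  for n in range(4)]
    hits = parallel_query(partitions, threshold=495)
    print(len(hits), "matching rows found across all units")  # 800 matching rows
```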
NoSQL Systems
For this reason, new kinds of databases, known as NoSQL systems [21], are
emerging, based on architectures less constrained than the traditional relational
The above savings in disk space and access performance also affect the activities of retrieving data and then storing it in memory: by combining columnar and in-memory technologies, larger amounts of data can be loaded entirely into RAM, so that both transactional and decision-support queries can achieve almost zero-latency responses.
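A very small sketch of why a columnar layout saves space and speeds up analytical access is given below; the table contents are invented for illustration, and real column stores add compression and vectorized execution on top of this basic idea:

```python
# Row-oriented layout: every record stores all of its fields together.
rows = [
    {"id": 1, "country": "IT", "amount": 120},
    {"id": 2, "country": "IT", "amount": 80},
    {"id": 3, "country": "DE", "amount": 200},
]

# Column-oriented layout: each attribute is stored (and scanned) on its own.
columns = {
    "id": [1, 2, 3],
    "country": ["IT", "IT", "DE"],
    "amount": [120, 80, 200],
}

# An analytical query such as SUM(amount) touches a single column,
# instead of reading every full row as in the row-oriented layout.
print(sum(columns["amount"]))              # 400
print(sum(r["amount"] for r in rows))      # 400, but only after reading whole rows

# Repetitive columns also compress well (e.g., run-length encoding of "country"),
# which is one source of the disk and memory savings mentioned above.
```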
Finally, document-based databases store and organize data as collections of entire documents in the form of JSON objects [25], rather than as structured tables. These objects can be seen as key-value pairs that can be nested as deeply as needed. They can model arrays and understand different data types, such as strings, numbers, and Boolean values.
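For illustration, a minimal sketch of such a document follows; the field names and values are invented for the example, and Python's standard json module is used only to show the nested key-value structure:

```python
import json

# A hypothetical customer document: nested objects, arrays, and mixed data types
# stored together as a single record, instead of being split across several tables.
customer = {
    "customer_id": "c-1024",
    "name": "Example S.p.A.",
    "active": True,                       # Boolean value
    "contacts": [                         # array of nested objects
        {"type": "email", "value": "info@example.com"},
        {"type": "phone", "value": "+39 000 000000"},
    ],
    "orders": [
        {"order_id": 1, "total": 120.50, "items": ["disk", "controller"]},
    ],
}

print(json.dumps(customer, indent=2))     # the document as it would be stored
```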
Currently, the most popular NoSQL solutions available are the open source Apache Cassandra DB [26], once used as the Facebook database, as well as Google BigTable [27], LinkedIn Voldemort [28], Twitter FlockDB [29], Amazon Dynamo [30], Yahoo! PNUTS [31], and MongoDB [32].
The whole Hadoop architecture is sketched in fig. 8, where the entire system is managed by a couple of master servers running:
– a Name Node that manages the HDFS hierarchical namespace, by controlling the distributed Data Nodes running on a large number of machines. The Name Node also manages block replica placement according to a rack-aware strategy, and keeps block metadata in memory in order to provide faster access. Name Nodes can be arranged in a federated way.
– a MapReduce engine JobTracker, governing the execution of data processing jobs and managing the task scheduling decisions, to which client applications submit MapReduce jobs. The JobTracker pushes work out to the available TaskTracker nodes, distributed throughout the network or within a local cluster, striving to keep the runtime workload as close to the data as possible.
The Data Nodes, running on multiple machines equipped with their own storage resources, are responsible for serving HDFS read and write requests by performing block creation, retrieval, deletion, and replication under the control of the Name Node. Thus, client applications achieve data access through the Name Node; however, once the Name Node has provided the location information of the data, they can interface directly with one or more Data Nodes. The data blocks can be read in parallel from several nodes simultaneously.
The TaskTrackers govern the execution of tasks and periodically report the progress of such tasks by using a heartbeat message. TaskTracker instances should be deployed on the same nodes that host Data Node instances, so that Map and Reduce operations are performed close to the data.
The Hadoop File System appears as a single disk and can address petabytes of data on top of native file systems; it can be extremely effective for storing large files and streaming data, and for managing write-once, read-many workloads, all implemented on commodity hardware. However, it is not so efficient in managing small files, low-latency access operations, and multiple writers.
HBase and Hive are also part of the Apache data management framework. HBase [36] is a column-oriented data store known as the Hadoop database. It is fully distributed in order to serve very large tables, is horizontally scalable, is integrated with MapReduce, and is built on top of the HDFS file system. It also supports real-time CRUD (Create, Read, Update, Delete) operations, unlike native HDFS. HBase is the platform of choice when there are lots of data and large numbers of clients and requests to be handled. However, it is not so efficient when traditional relational database retrieval operations are needed or in the presence of text-based searches. Hive [37] is a data warehousing solution built on top of Hadoop (see fig. 9). It provides a SQL-like query language called HiveQL, is able to structure various types of data formats, and can access data stored in various solutions such as HDFS and HBase. It has been explicitly designed for ease of use and scalability, and not for low-latency or real-time queries.
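As a rough illustration of HiveQL's SQL-like flavor, the sketch below embeds a hypothetical query in a Python string and submits it through the Hive command line's -e option; the table name, schema, and HDFS path are invented for the example, and in practice a client library or the Hive shell would typically be used directly:

```python
import subprocess

# Hypothetical HiveQL: define an external table over files already in HDFS,
# then run an aggregate query on it. Hive compiles the query into MapReduce jobs.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    ip      STRING,
    url     STRING,
    status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/web_logs';

SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
"""

# 'hive -e' executes the given HiveQL string from the command line.
subprocess.run(["hive", "-e", hiveql], check=True)
```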
Analytics reporting processes are fundamental elements within the big data framework, since they are the means by which valuable information is extracted from
collected data. Through analytics and visualization tools, it is possible to gain competitiveness from big data by:
– making the right decisions on the basis of all the available information
– predicting changes and behaviors and reacting preemptively to strategy mutations
– discovering new opportunities, such as new businesses or market segments
– increasing efficiency by changing, for instance, processes or product characteristics
– quantifying current and potential risks associated with activities and processes
Analytics reporting is a classic IT function, but the ideas, and above all the results, coming from big data analytics are very different from the past. The picture reported in fig. 10 gives an effective summary of analytics reporting in the big data era. Clearly, what is totally different from the past is not just the huge amount of data, but also the level of complexity, speed, and accuracy required by the analysis process. In the past, analytics on transactional or structured data have helped organizations gain competitiveness. Nowadays, data coming from social, sensor, video, image, and machine-to-machine sources represent a major opportunity for organizations and a big challenge for analytics tasks. In fact, in-depth examination of such data may support organizations in better understanding their customers, partners, operations, and, more generally, their business or efficiency improvement opportunities. Emerging systems that deal with big data have to cope with the enormous quantity of data available and with its variety. Also, the rate at which new data are created and stored is increasing more and more and, at the same time, the life of data is dropping from years and months to hours and even seconds. Hence, there is a need for analyzing data at the same speed in order to achieve up-to-date results.
If we take a closer look at the evolution of analytics, we perceive the emergence of new kinds of analysis processes that are the result of new datasets. Figure 11 shows how the most significant analytics processes are evolving in order to cope with big data.
Several old and new techniques can be employed for advanced analytics, such as machine learning, data mining, artificial intelligence, natural language processing, genetic algorithms, and so on. The main differences between old analytics and big data analytics are basically two: the data formats and structures involved, and the ability to predict the future. In fact, big data analytics processes provide a progressive view, enabling organizations to anticipate opportunities. They are able to perform complex correlations and cross-channel analysis, together with real-time forecasting. Moreover, they can make extensive use of new-generation collaborative technologies, whereas classic analytics are only able to provide a rear view on historical and structured data.
The availability of new kinds of data allows a more in-depth understanding of the associated phenomena. Thus, the new generation of analytics allows exploring granular details and answering questions considered beyond reach in the past, such as: Why did it happen? When will it happen again? What caused it to happen?
What can be done to avoid it? Emerging analytics promise to be predictive and prescriptive, improving decision making and efficiency by discovering valuable insights that would otherwise remain hidden.
Just to give an idea of the difference, big data analytics can detect and report in real time a consumer's emotions during a service call, for example when a competitor is mentioned, which is a piece of information that is most useful precisely in real time. Another example, from the personal business arena, comes from the new offers of car insurance companies which, just by installing a small control unit on the user's vehicle, allow an insurance contract to be tailored to the specific user's habits, in a win-win business model where both the company and the user can take advantage of the data produced by the sensors and analyzed by specific analytics tools.
It is always necessary to match the analytics solution with the current organization model in order to assign outcomes to the right organizational elements. This answers questions like: Which outcomes can be seen by whom? What are the priorities in using analytics? The example in fig. 12 shows that analytics can be deployed in a traditional organization according to three fundamental approaches, indicated as A), B), and C):
– A) Embedded shared mode: analytics are deployed in a centralized mode and serve the entire organization. In embedded mode, analytics support processes and increase efficiency. It is useful for the standardization of processes and methodologies, and for sharing practices and services among functions, but it is not directly related to specific business issues and customers.
– B) Stand-alone shared mode: analytics are deployed as in the A) model but outside organizational functions. The outcomes of the stand-alone mode are executive-level reports focused on the core competencies of the organization. It can enable the development of standardized processes and methodologies
– Understanding business goals and potential benefits: the first step in the big data modeling approach is taking a step back and thinking about the key business drivers, by focusing on which ones can take advantage of big data.
– Defining data and outcomes: organizations must have a clear idea of what kind of information they need and, consequently, they should be aware of what they already have, what is relevant, and what is missing. It is important to define the data sources and the volumes of data involved. In large organizations, potentially valuable data often exist in multiple silos. All information about data should be clearly related to the outcomes useful for achieving the defined business goals.
– Analyzing compliance, privacy, security, and liability requirements: organizations should be aware of the consequences of data utilization; this is a critical factor to be addressed in terms of both processes and technologies.
– Defining data management processes and practices: it is very important, in the first phase of the model adoption, to keep the traditional production operational environment coexisting with a new environment where big data are sifted in order to achieve the defined business goals. This approach is useful
to minimize the adoption risks and to compare the models. Later, the new environment will be integrated with the original one. The new model has to cover, from sources to outcomes, the delivery of data across the organization.
– Choosing technological infrastructures: organizations have to define the infrastructures to be used in order to take advantage of big data and, at the same time, balance costs and benefits and protect the investment over the medium term. The chosen technological infrastructures should be scalable, adaptable, able to fulfill the performance requests coming from the business, and able to allow customers to consume big data outcomes. The adaptability of the infrastructure could be a critical factor for a data-driven business organization, since modifying the running business model is inherent in the data-driven model itself. The delivery of the outcomes is another key factor involving technological choices: once it has been decided who can access outcomes and how, the technological infrastructures have to deliver those services. Finally, the technological infrastructures have to support the privacy and compliance requirements.
– Knowing the people involved: once the data management model has been defined, in addition to the technological infrastructure details, it is necessary to address the missing skills by identifying roles and profiles.
– Evolving the organization: the organization that invests in a big data-based business model has to be ready to embrace a new culture driven by data flowing across its departments. In addition, it should be aware of the impacts of outcomes and be ready to react to them. For instance, if a predictive analysis forecasts the decay of a certain kind of business in a geographic area, the organization has to be ready to change its supply chain for the sake of being more cost-effective. In order to take the most advantage of big data, outcomes could be delivered to all the people who are potentially interested in them, powering an engine of ideas (self-service query model).
– Managing outcomes: once the model starts producing outcomes, they have to be consumed. Outcomes can imply modifications of processes, behaviors, products, communications, and so on. Each modification should be evaluated in terms of costs and potential benefits.
– Performing evaluation and evolution: once the model is deployed, its results have to be evaluated continuously in order to understand aspects such as:
• whether it is delivering the expected outcomes;
• whether outcomes are bringing true value;
• the cost of the big data model and the effectiveness of the infrastructures;
• how the infrastructures are supporting the model.
The whole big data modeling framework, observed from the processes and technologies perspective, is sketched in fig. 17.
7 Conclusion
In the previous sections, we have discussed big data and its implications in terms of benefits and technological evolution. It now appears clearer that big data is a great opportunity, bringing plenty of challenges and even risks. Organizations could use big data in order to improve their competitiveness, improve their efficiency, and manage systemic risks. In order to obtain those strategic advantages, they have to change their data management systems, their internal processes, and even their culture. In terms of IT infrastructure, whoever wants to take advantage of big data has to deal with a large amount of information and, above all, with several kinds of data sources and formats, to be carefully considered in a properly crafted big data model. The right analysis and visualization tools, tightly coupled with the end-user applications, must then be adopted to properly consume the outcomes. Consequently, achieving a correct alignment between IT and business is a very critical factor in ensuring success. Several departments can be involved in a big data project, but the initiative should pervade the whole organization according to a holistic approach.
References
1. International Telecommunication Union: World telecommunication/ict indicators
database, 16th edn. (2012)
2. YouTube: Statistics (November 2013),
http://www.youtube.com/yt/press/statistics.html
3. Brumfiel, G.: Down the petabyte highway. Nature 469(20), 282–283 (2011)
4. Lefevre, C.: Lhc: the guide (January 2008), http://cds.cern.ch/record/
1092437/files/CERN-Brochure-2008-001-Eng.pdf
5. Open Government Initiative: Open government data,
http://opengovernmentdata.org/
6. McKinsey Global Institute: Big data: The next frontier for innovation, competi-
tion, and productivity (2011), http://www.mckinsey.com/
insights/business technology/big data the next frontier for innovation
7. McKinsey Global Institute: Disruptive technologies: Advances that will transform
life, business, and the global economy (2013), http://www.mckinsey.com/
insights/business technology/disruptive technologies
8. Loecher, M., Jebara, T.: Citysense: Multiscale space time clustering of gps points
and trajectories. In: Proceedings of the Joint Statistical Meeting (2009)
9. IDC iView: Big data, bigger digital shadows, and biggest growth in the far east
(2012), http://www.emc.com/collateral/analyst-reports/
idc-the-digital-universe-in-2020.pdf
10. Meeker, M., Wu, L.: Kpcb internet trends (2013),
http://www.kpcb.com/insights
11. Bonwick, J., Ahrens, M., Henson, V., Maybee, M., Shellenbaum, M.: The
zettabyte file system. In: Proc. of the 2nd Usenix Conference on File and Storage
Technologies (2003)
12. Nelson, M.R.: Lzw data compression. Dr. Dobb’s Journal 14(10), 29–36 (1989)
13. Welch, T.A.: A technique for high-performance data compression. Com-
puter 17(6), 8–19 (1984)
14. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding.
IEEE Transactions on Information Theory 24(5), 530–536 (1978)
15. Gailly, J.L., Adler, M.: The gzip compressor (1999), http://www.gzip.org/
16. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm
(1994)
17. Seward, J.: The bzip2 algorithm (2000), http://sources.redhat.com/bzip2
18. Wallace, G.K.: The jpeg still picture compression standard. Communications of
the ACM, 30–44 (1991)
19. Boutell, T.: Png (portable network graphics) specification version 1.0 (1997)
20. Schmuck, F.B., Haskin, R.L.: Gpfs: A shared-disk file system for large computing
clusters. In: FAST, vol. 2, p. 19 (2002)
21. Leavitt, N.: Will nosql databases live up to their promise? Computer 43(2), 12–14
(2010)
22. Copeland, G.P., Khoshafian, S.N.: A decomposition storage model. ACM SIG-
MOD Record 14(4), 268–279 (1985)
23. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M.,
Lau, E., Lin, A., Madden, S., O’Neil, E., et al.: C-store: a column-oriented dbms.
In: Proceedings of the 31st International Conference on Very Large Data Bases,
pp. 553–564. VLDB Endowment (2005)
24. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: How dif-
ferent are they really? In: Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, pp. 967–980. ACM (2008)
25. Crockford, D.: The application/json media type for javascript object notation
(json) (2006)
26. Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system.
ACM SIGOPS Operating Systems Review 44(2), 35–40 (2010)
27. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M.,
Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: A distributed storage system for
structured data. ACM Transactions on Computer Systems (TOCS) 26(2) (2008)
28. Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., Gao, L.,
Ghosh, B., Gopalakrishna, K., Harris, B., et al.: Data infrastructure at linkedin.
In: 2012 IEEE 28th International Conference on Data Engineering (ICDE),
pp. 1370–1381. IEEE (2012)
29. Gupta, P., Goel, A., Lin, J., Sharma, A., Wang, D., Zadeh, R.: Wtf: The who
to follow service at twitter. In: Proceedings of the 22nd International Confer-
ence on World Wide Web, International World Wide Web Conferences Steering
Committee, pp. 505–514 (2013)
30. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin,
A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: amazon’s highly avail-
able key-value store. In: SOSP, vol. 7, pp. 205–220 (2007)
31. Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P.,
Jacobsen, H.A., Puz, N., Weaver, D., Yerneni, R.: Pnuts: Yahoo!’s hosted data
serving platform. Proceedings of the VLDB Endowment 1(2), 1277–1288 (2008)
32. Chodorow, K.: MongoDB: the definitive guide. O’Reilly (2013)
33. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters.
Communications of the ACM 51(1), 107–113 (2008)
34. White, T.: Hadoop: the definitive guide. O’Reilly (2012)
35. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file sys-
tem. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies
(MSST), pp. 1–10. IEEE (2010)
36. George, L.: HBase: the definitive guide. O’Reilly Media, Inc. (2011)
37. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H.,
Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce frame-
work. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)
Big Data, Unstructured Data, and the Cloud:
Perspectives on Internal Controls
David Simms
Abstract. The concepts of cloud computing and the use of Big Data have be-
come two of the hottest topics in the world of corporate information systems in
recent years. Organizations are always interested in initiatives that allow them
to pay less attention to the more mundane areas of information system man-
agement, such as maintenance, capacity management, and storage management,
and free up time and resources to concentrate on more strategic and tactical is-
sues that are commonly perceived as being of higher value. Being able to mine
and manipulate large and disparate datasets, without necessarily needing to pay
excessive attention to the storage and management of all the data that are being
used, sounds in theory like an ideal situation. A moment’s consideration re-
veals, however, that the use of cloud computing services, like the use of out-
sourcing facilities, is not necessarily a panacea. Management will always retain
responsibility for the confidentiality, integrity, and availability of its applica-
tions and data, and being able to develop the confidence that these issues have
been addressed. Similarly, the use of Big Data approaches offers many ad-
vantages to the creative and the visionary, but such activities do require an ap-
propriate understanding of risk and control issues.
1 Introduction
This chapter will set out the risks related to the management of data, with particular
reference to the traditional security criteria of confidentiality, integrity, and availabil-
ity, in the contexts of the wider use of unstructured data for the creation of value to
the organization and of the use of Big Data to gain greater insights into the behaviors
of markets, individuals, and organizations. Of particular interest are the questions of
identifying what data are held internally that could be of value and of identifying
external data sources, be these formal datasets or collections of data obtained from
sources, such as social networks, for example, and how these disparate data collec-
tions can be linked and interrogated while ensuring data consistency and quality.
Much of the current debate around big data technologies and applications concerns the opportunities that these technologies can provide; where issues of security and management are addressed, there is a tendency for them to be considered somewhat in isolation. This chapter will set out the risks, both to the owners and the subjects of the data, of the use of these technologies from the perspectives of security, consistency, and compliance. It will illustrate the areas of concern, ranging from internal
Many of the terms used in this chapter are reasonably recent coinages and definitions
can still be flexible and varied.
For the purposes of this chapter, we will follow Bernard Marr [1] with the defini-
tion that “Big data refers to our ability to collect and analyze the vast amounts of data
we are now generating in the world. The ability to harness the ever-expanding
amounts of data is completely transforming our ability to understand the world and
everything within it.” The fundamental idea is of the accumulation of datasets from
different sources and of different types that can be exploited to yield insights.
According to an article by Mario Bojilov in the ISACA Now journal [2], the ori-
gins of the term come from a 2001 paper by Doug Laney of Meta Group. In the
paper, Laney defines big data as datasets where the three Vs—volume, velocity, and
variety—present specific challenges in managing these data sets.
Unstructured data as a concept has been identified and discussed since the 1990s
but finding a definitive and all-encompassing definition of what this might be is surprisingly difficult. Unstructured data have been defined, somewhat vaguely, by Manyika et al. [3] as "data that do not reside in fixed fields. Examples include free-form text (e.g., books, articles, body of e-mail messages), untagged audio, image and video data. Contrast with structured data and semi-structured data", which provides a definition, but one which relates more to the characteristics that the data do not possess than to what they are.
To complicate matters slightly, Blumberg and Atre [4] discussed the basic prob-
lems inherent in the use of unstructured data a decade ago. They wrote “The term
unstructured data can mean different things in different contexts” and that “a more
accurate term for many of these data types might be semi-structured data because,
with the exception of text documents, the formats of these documents generally con-
form to a standard that offers the option of meta data” (p. 42).
Cloud computing is described by NIST as “a model for enabling ubiquitous, con-
venient, on-demand network access to a shared pool of configurable computing re-
sources (e.g., networks, servers, storage, applications, and services) that can be rapid-
ly provisioned and released with minimal management effort or service provider in-
teraction” [5], while the Cloud, according to Manyika et al, is “a computing paradigm
in which highly scalable computing resources, often configured as a distributed sys-
tem, are provided as a service through a network.” For corporate users, the immediate
impact of using cloud services is not necessarily clearly distinguishable from that of
using traditional outsourced services, with services and/or data being available
through an external network connection. The key distinction is that unlike conven-
tional outsourcing, where typically the service provider contracts to store, process,
and manage data at a specific facility or group of facilities, in the context of the cloud
the data could be stored anywhere, perhaps split into chunks, with no external visibil-
ity over how that was organized.
For the individual, private user, the cloud, represented by such services as
Dropbox, iCloud, or Google Services, for example, is exactly as its name suggests, a
virtual and distant and slightly opaque facility by which services are provided without
significant identifying features or geographical links.
Internal controls are the mechanisms – the policies, procedures, measures, and activities – employed by organizations to address the risks with which they are confronted. To define this a little further, these activities fall within a framework of control objectives, which are the specific targets defined by management to evaluate the effectiveness of controls; a control objective for internal controls over a business activity or IT process will generally relate to a relevant and defined assertion and provide a criterion for evaluating whether the internal controls in place do actually provide reasonable assurance that an error or omission would be prevented or detected on a timely basis [6].
The traditional triad of information security objectives consists of Confidentiality,
Integrity, and Availability [7].
To give a concrete example of how control objectives and internal controls fit to-
gether, an organization might be concerned about unauthorized access to its data. The
control objective might be to ensure the confidentiality of the data, and one of many
From a point of view of completeness, it should also be mentioned that there are a
number of other classifications of information security criteria. In 2002, for example,
Donn Parker proposed an extended model encompassing the classic CIA triad that he
called the six atomic elements of information [10]. These elements are confidentiality,
possession, integrity, authenticity, availability, and utility. Similarly, the OECD's
Guidelines for the Security of Information Systems and Networks, first published in
1992 and revised in 2002 [11], set out nine generally accepted principles: Awareness,
Responsibility, Response, Ethics, Democracy, Risk Assessment, Security Design and
Implementation, Security Management, and Reassessment.
variable costs of service provision and marginal costs related to the acquisition of
additional clients or providing additional services or additional capacity to existing
clients should be straightforward to manage and reasonably simple to recover (and
exceed, of course) through appropriate pricing. Thus the provision of such services
can be viewed as a potentially lucrative business, with important initial investment
but steady and permanent future revenue streams.
Conceptually, the techniques of cloud computing are not significantly different from those of traditional outsourcing of IT services. Many service providers offer outsourced data processing, system management, monitoring, and control services, and these services are very popular with many organizations that either do not wish to have internal IT services and competences for organizational reasons or prefer to outsource for financial or logistical purposes.
Key elements of any outsourcing agreement are the contract terms and the service
level agreements. With these terms, it is clear to both parties which services are being
provided, what resources are being made available, how these resources are being
managed, and how the quality of service can be evaluated and managed.
Many structured cloud services will provide such terms and conditions but might
not be able to specify every element of interest to the customer, such as the precise
location where data are stored or processed.
It is in the nature of any kind of outsourcing or service provision activity that both parties will wish to maximize their revenues and benefits from the agreement while at the same time accepting the minimum level of responsibility for addressing the risks involved and for handling any issues that arise. In the context of cloud computing, customers need to pay particular attention to the clear definition of roles and responsibilities in order to try to avoid situations of blame-shifting and cost avoidance should problems subsequently arise.
A key driver for the use of cloud facilities is cost: organizations might not wish to tie up capital in IT infrastructure that might not be used to capacity and which might only have a short life before obsolescence, when there exists instead the possibility of renting services as a regular P&L charge. Along with the resource and competency questions, which of course also incur ongoing costs and can be expensive to update or replace, this has long been a prime mover for outsourcing services. Experience in the domain of outsourcing, however, reveals that the cost savings may not always be as
domain of outsourcing, however, reveals that the cost savings may not always be as
significant as hoped for. As mentioned above, the prudent management team will look
to ensure that its systems and data are secure and reliable by implementing additional
internal controls to generate evidence that there are no weaknesses in the service pro-
vider’s controls that can be or are being exploited. Typically, these internal controls
will take the form of data and transaction analysis to identify exceptions, or the ap-
pointment of specialist staff that can manage the relationship with the service provider
and ensure that trends are identified, that service levels are maintained, and that issues
are identified, reported, and rectified. Very often the costs associated with implement-
ing and operating such additional internal controls in an effective and robust manner
are such that they can eat into the margins created by the whole outsourcing initiative.
From an internal controls perspective the presence and use of unstructured data
can pose numerous problems in respect of the confidentiality and integrity of data,
two-thirds of the famous “CIA” triad of information security objectives (the third
being availability). When data are located in a structured central database, they can, at
least in theory, be controlled, managed, verified, and secured. Once extracted from
the database, though, they can easily escape the internal control environment [17].
They can be used for decision taking without the assurance that they are still current,
complete, or valid. They can also, depending on the security measures in place and
the efficiency with which these are enforced, leave the organization easily, typically
on the USB media that have become ubiquitous, on laptop hard drives, or even as
attachments to emails.
as discussed above). A Type II service auditor’s report includes the information con-
tained in a Type I service auditor's report and includes the service auditor's opinion on
whether the specific controls were operating effectively during the period under re-
view. Because this opinion has to be supported by evidence of the operation of those
controls, a Type II report therefore also includes a description of the service auditor's
tests of operating effectiveness and the results of those tests.
SAS 70 was introduced in 1993 and effectively superseded in 2010 when the Au-
diting Standards Board of the American Institute of Certified Public Accountants
(AICPA) restructured its guidance to service auditors, grouping it into Statements on
Standards for Attestation Engagements (SSAE), and naming the new standard “Re-
porting on Controls at a Service Organization”. The related guidance for User Audi-
tors (that is, those auditors making use of service auditors’ reports in the evaluation of
the business practices or financial statements of organizations making use of the facil-
ities provided by service providers) would remain in AU section 324 (codified loca-
tion of SAS 70) but would be renamed Audit Considerations Relating to an Entity
Using a Service Organization. The updated and restructured guidance for Service
Auditors to the Statements on Standards for Attestation Engagements No. 16 (SSAE
16) was formally issued in June 2010 and became effective on 15 June 2011. SSAE
16 reports (also known as "SOC 1" reports) are produced in line with these standards,
which retain the underlying principles and philosophy of the SAS 70 framework. One
significant change is that management of the service organization must now provide a
written assertion regarding the effectiveness of controls, which is now included in the
final service auditor's report [19].
Internationally, the International Standard on Assurance Engagements (ISAE) No.
3402, Assurance Reports on Controls at a Service Organization, was issued in De-
cember 2009 by the International Auditing and Assurance Standards Board (IAASB),
which is part of the International Federation of Accountants (IFAC). ISAE 3402 was
developed to provide a first international assurance standard for allowing public ac-
countants to issue a report for use by user organizations and their auditors (user audi-
tors) on the controls at a service organization that are likely to impact or be a part of
the user organization’s system of internal control over financial reporting, and thus
corresponds very closely to the old SAS 70 and the American SSAE 16. ISAE 3402
also became effective on 15 June, 2011.
Understanding the related guidance for user auditors (which also applies, by extension, to user management) is critical. When a service audit report is received, the reader needs to follow a number of careful steps to be certain that the report is both useful and valid before drawing any conclusions from it.
These steps include:
1. Confirming that the report applies to the totality of the period in question. This is
of particular importance when using a service audit report in the context of obtain-
ing third-party audit comfort for a specific accounting period, but also applies to
more general use of the report. If management’s concern is over the effectiveness
of internal controls for the period from 1 January to 31 December 2012, say, and
the audit report covers the period from 1 April 2012 to 31 March 2013, how valid
is it for management’s purposes and how much use can they make of it? In the
simplest of cases, a prior period report will also be available that will provide the
necessary coverage, but in other circumstances this may not be the case and man-
agement will have to turn to other methods to acquire the comfort that they need.
These could include obtaining representations from the service manager that no
changes had been made to the control environment during the period outside the
coverage of the audit report and that no weaknesses in internal control effective-
ness, either design or operational, had been identified during that period. Other
methods could involve the performance and review of internal controls within the
client organization, a subject to which we will return below.
2. Confirming that all the systems and environments of operational significance to the
organization were included in the scope of the audit report. Management will need
to have an understanding of the platforms and applications that are being used for
their purposes at the service provider, including operating systems, databases, mid-
dleware, application systems, and network technologies, and be able to confirm
that these were all appropriately evaluated. Very often in the case of large service
providers a common control environment will exist, under which they apply iden-
tical internal control procedures to all of their environments: if the auditors have
been able to confirm that this is the case, then it is not inappropriate for them to
test the internal controls in operation around a sample of the operating environ-
ments, and management can accept the validity of their conclusions without need-
ing the confirmation that their particular instance of the database, for example, had
been tested. Should the coverage of the audit not meet management’s require-
ments, again management would need to evaluate the size and significance of the
gap and consider means by which they could obtain the missing assurance.
3. Understanding the results of the work done and the significance of any exceptions or
weaknesses noted by the auditors. Generally speaking, if the work has been per-
formed to appropriate standards and documented sufficiently, and if the conclusions
drawn by the auditors are solidly based on the evidence, this step should be reasona-
bly straightforward. Management should, however, guard against skipping to the
conclusion and, if the report contains it, the service provider’s management attesta-
tion, and blindly accepting the absence of negative conclusions as being sufficient for
their purposes. Each weakness in the design or the operation of controls should be
considered, both individually and cumulatively, to identify any possible causes for
concern. This is because weaknesses identified during the audit and considered to be insignificant within the overall framework of internal controls could nevertheless be significant in respect of the specific circumstances (the use of systems and combination of technologies, for example) of one particular customer.
First, internal controls can be designed that allow local management to monitor and
evaluate the activities of a third party. These can take the form, for example, of pro-
cedures to track the responses of the service provider to requests for changes: if there
is a process by which the customer asks the service provider to change access rights
to an application or a datastore to correspond to the arrival or departure of a member
of staff, management can track these change requests, and the responses of the service
provider in order to ensure that the correct actions have been undertaken.
Very often the contract terms with the service provider will include regular meetings
and reporting mechanisms through which the provider will present status updates, usu-
ally in the form of progress against KPIs and lists of open points, and management can
ask questions and ensure that everything is under control. These meetings can form the
central points of control activities for management, as a structured means of ensuring
that they are monitoring the performance of the service provider in a regular and
consistent way, observing long-term trends, and identifying anomalies.
Secondly, audit procedures can be designed along similar principles to the above,
based on the expected results from the service provider’s activities. An example of
this would be extracting at the period end a list of user access rights to the organiza-
tion's applications and comparing these to expectations based on management's understanding of the access requirements and on the instructions given
during the year to create, modify, or delete access rights. If the rights correspond, this
can provide a layer of assurance to management that both the service provider’s inter-
nal procedures and controls are operational and that access to their applications and
data is being appropriately managed.
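To make the shape of such a procedure concrete, the following sketch (in Python, with purely hypothetical file names and column layouts) builds the expected set of access rights from a start-of-period baseline plus the change requests sent to the service provider during the year, and reconciles it against the period-end extract; any differences are reported as exceptions for follow-up.

```python
import csv

# Hypothetical inputs; file names, columns, and actions are illustrative only.
# baseline.csv            user,application,role   (rights at the start of the period)
# change_requests.csv     date,action,user,application,role   (action: GRANT or REVOKE)
# period_end_extract.csv  user,application,role   (rights reported by the service provider)

def load_rights(path):
    with open(path, newline="") as f:
        return {(r["user"], r["application"], r["role"]) for r in csv.DictReader(f)}

def expected_rights(baseline_path, changes_path):
    """Build the rights management expects, based on the instructions sent during the year."""
    expected = load_rights(baseline_path)
    with open(changes_path, newline="") as f:
        for r in csv.DictReader(f):
            key = (r["user"], r["application"], r["role"])
            if r["action"].upper() == "GRANT":
                expected.add(key)
            elif r["action"].upper() == "REVOKE":
                expected.discard(key)
    return expected

def reconcile(expected, actual):
    """Exceptions: rights present but never requested, and requested rights not applied."""
    return {
        "unexpected_rights": sorted(actual - expected),
        "missing_rights": sorted(expected - actual),
    }

if __name__ == "__main__":
    exceptions = reconcile(
        expected_rights("baseline.csv", "change_requests.csv"),
        load_rights("period_end_extract.csv"),
    )
    for category, items in exceptions.items():
        print(f"{category}: {len(items)}")
        for user, app, role in items:
            print(f"  {user} / {app} / {role}")
```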
With an appropriately selected range of internal controls and audit procedures,
management can, therefore, obtain a certain level of comfort over the existence and
quality of the control environment in place at the service provider.
Arguments could be made on the grounds of costs and logistics that the data prepara-
tion process should be performed offsite at a third-party site, on purpose-built infra-
structure and away from the organization’s internal networks. This would be in order,
for example, to prevent excessive strain on resources caused by intensive processing
and by large quantities of data passing across the network. Such a solution would,
however, introduce the additional complication of ensuring adequate security over the
data once the datasets have left the organization’s security perimeter.
The “by whom” aspect will concern the use of internal resources, insofar as they
can be spared, and external resources. Depending on the amount and the nature of the
data to be processed, it may be appropriate to bring in resources from outside to per-
form aspects of the work. Internal resources will always be necessary, however, both
from IT to provide technical input, and from the business side as users who under-
stand where the data come from, what they represent, and how they relate to each
other. A common failing in any data cleaning or migrating exercise is to view it as a
purely technical procedure, whereas in practice the input from experienced and
knowledgeable data owners and end-users is critical.
The timescale for such a project will depend on several factors, including the quan-
tity and the nature of the data to be prepared, the availability of resources, and the
priority set by senior management for the process. Experience of such projects would
indicate that management should be setting their expectations in terms of months
rather than weeks, however, and that if data quality and security are really expected,
there is no scope for cutting corners.
Survey responses (41 respondents; Q5A was answered only by the 6 respondents who answered Yes to Q5):

            Yes   No   Yes %   No %
Section A
  Q1         32    9    78%    22%
  Q2         22   19    54%    46%
  Q3         17   24    41%    59%
  Q4          9   32    22%    78%
  Q5          6   35    15%    85%
  Q5A         6    0   100%     0%
  Q6         14   27    34%    66%
  Q7         12   29    29%    71%
  Q8          4   37    10%    90%
Section B
  Q9         39    2    95%     5%
In respect of the use of cloud computing facilities for the storage of data, 95% of
the responses reported that the organization was using or was planning to use such
facilities. The reasons most frequently given were cost management, flexibility, and a
desire to streamline IT activities to concentrate on more value-added activities
in-house. Of these organizations, however, only 15% had drawn up policies and
guidelines concerning the nature of data that could be stored in this way, and only
18% were adopting an approach based on the centralized management, monitoring,
and retrieval of data rather than leaving it to departments or individuals to manage.
Once again, the widespread absence of overall policies and guidelines, and the reported tendency to adopt potentially uncoordinated approaches to project scope and man-
agement runs counter to established good practice in respect of internal controls.
Without clear and enforced structure, inconsistency is likely to become a significant
barrier to success. In addition, delegating down to departments or individuals
increases the risks of decisions being taken without sufficient skills, experience, or
perspective.
Overall it was possible to draw a clear distinction between large and small organi-
zations and publicly and privately owned ones in respect of their approach to struc-
tured and formalized control environments. It was also possible to identify with high
accuracy the responses received from publicly owned corporations and those from
organizations subject to other strict and demanding controls requirements such as
Sarbanes-Oxley or local industry or environment-specific regimes. It was also possi-
ble to identify organizations that had significant internal audit or internal controls
functions, or that had been alerted to the risks involved in going through a process of
discovery in the context of a legal dispute.
The tentative conclusions that can be drawn from this very specific and targeted sur-
vey are the following.
First, the importance of having a structured and documented approach to data man-
agement and security is widely understood among the organizations surveyed, with a
particularly positive attitude toward risk management and compliance from larger
organizations and those subject to definite compliance regimes because of their own-
ership or industry. How this understanding actually translates into positive measures
designed to ensure compliance is another question, however, and IT security manag-
ers in particular reported seeing greater enthusiasm for establishing policies within
their organizations than for implementing and complying with the necessary proce-
dures. There was also a question of priorities and resources raised by smaller organi-
zations that did not feel that such policies corresponded to or were a part of their core
daily activities.
Secondly, the concept of unstructured data and the particular challenges posed by
such data are reasonably widely understood, but there has been little activity outside large publicly owned corporations to address the issue in any systematic way. In general, IT departments had a good grasp of the nature and impact of the matter, and the
subject was frequently raised by internal and external auditors and by legal advisers,
but it was rarely considered to be a subject of great priority by senior management.
Thirdly, even if thought has been given within some organizations to managing
employee access to and extractions of sensitive data, in the majority of cases monitor-
ing is weak and compliance cannot be ensured. Generally speaking, users who have
access to data stored in centralized databases tend to have the ability and the oppor-
tunity, and frequently the encouragement, to extract those data and use them for ana-
lytical or reporting purposes. Once the data have been extracted, access controls
around them are usually weaker, often being restricted to network or workstation
access controls, and even these restrictions reach their limits once data are copied
onto portable devices such as USB keys.
Fourthly, very few organizations are in a position of being able to detect reliably
whether their security has been breached or their data compromised, even when all
their systems and data are hosted and managed internally. Indeed, in respect of both
security breaches and employee misuse of data, it was reported that incidents were
typically identified either by chance or on the basis of information received, rather
than on the basis of regular and reliable compliance measures. Of course, in practice
being able to design, implement and monitor the operation of such control activities is
frequently nontrivial and requires competence, resources, and careful planning.
Whether organizations would be able to apply such controls to outsourced data, or be
satisfied by the monitoring and reporting services provided by their cloud service
provider, is the same question with an added layer of complexity.
Fifthly, the idea of cloud computing as a financially attractive option for outsourc-
ing a number of traditional IT activities including data storage is widespread and there
is a great deal of enthusiasm for it across a wide range of organizations, but this en-
thusiasm is not yet being widely and systematically backed up by detailed risk as-
sessments and careful consideration of the approaches needed to identify, classify,
manage, and monitor the data being transferred into the cloud.
The comments provided alongside the answers also yielded useful information. In
particular, several respondents referred to the different types of cloud that are begin-
ning to exist: in certain industries such as financial services it would be unthinkable to
use cloud services in which confidential data might be stored outside Switzerland or
for which it would be difficult to obtain adequate audit comfort over key concerns,
but the use of some kind of industry-specific Swiss cloud, perhaps set up as a joint
venture, with appropriate controls and safeguards in place, might be conceivable.
Several respondents also flagged up the importance of the proper management of
backup media, which of course need to be subject to the same policies and procedures
for data management and security as live datasets. From the perspective of the man-
agement who will retain the responsibility for ensuring the availability and reliability
of system and data backups for good practice and going concern reasons, and for the
auditors who will be verifying this, it will be a challenge to identify exactly which
data are backed up where, how this is managed from a security and availability per-
spective, and what the timelines, sequences, and interdependencies would be for re-
storing part or all of a missing dataset.
Within all businesses there is constant pressure to reduce costs, and cloud compu-
ting could be seen as an effective method of managing and reducing costs, particular-
ly in the short term. If there are no significant in-house IT systems, a case will always
be made for reducing to a bare minimum, or even eliminating entirely, the IT func-
tion, thereby reducing staff costs alongside the operational costs of monitoring and
maintaining systems.
The decision to choose cloud computing services is not one to be taken lightly. For
individuals, the use of personal services in the cloud (such as Facebook, Gmail, and
Dropbox) is, or rather should be, a matter of a calculated assessment of risks and ben-
efits, for the potential negative impacts of breaches in security, for example, can be
significant. For businesses and other organizations, the same concerns apply but on a
larger scale. For all organizations that have a responsibility to keep their, and others’,
data secure and confidential, the decision can only be taken after a detailed analysis of
how they will obtain, and continue to obtain, the necessary comfort that this is the
case. If they cannot build into their own procedures, into enforceable contract terms,
and into audit plans, the means of confirming the confidentiality, integrity, and avail-
ability of their systems and data, they should not consider externalizing them.
In addition, organizations should not forget the immediate costs and efforts in-
volved in moving onto the cloud. Datastores need to be identified, classified, cleaned
up, and archived, and serious technical and operational decisions are needed to deter-
mine what data will sit where. This will frequently be a project of a significant size
requiring expertise, resources and input from a number of people across the business
who understand the business, the systems, the data, and their use.
Experience of traditional outsourcing suggests that it is very easy for organizations
to overestimate the cost savings generated by a move toward service providers and to
underestimate the amount of internal competence and dedicated management required
to make a success of such initiatives. As long as the organization relies on its data and
retains responsibility for all aspects of its business from a regulatory perspective, it
will need to ensure that its management of the relationship with its service providers
and its access to critical operational information are both adequate and appropriate.
Typically this will require retaining or recruiting skilled, experienced, and reasonably
senior staff to liaise with and monitor the performance of the service provider.
The long-term consequences of opting for a cloud solution also need to be exam-
ined. Once on the cloud, systems and data are likely to stay there, quite possibly with
the same provider. What begins as a simple and cost-effective solution to a small
problem could develop into a long-term strategic commitment with little scope for
alteration.
concern for controls managers, or perhaps the take-up of the technologies will not be
great, because of perceived controls issues or questions of cost or access.
It will also be interesting to study the impact of the uptake of such services on au-
dit opinions and compliance reports. This will surely be a major driver in the devel-
opment and the use of these services: if organizations find themselves subject to ad-
verse comments from regulators or external auditors, that will necessarily cause a
slow-down in adoption. On the other hand, if clean reports are issued and no concerns
are raised, take-up will only be encouraged.
8 Conclusion
In common with all IT-related projects, initiatives in respect of data collation and accumulation, and in respect of data migration and transfer, require a great deal of planning, strategic awareness, and effective controls in order to ensure that the key security objectives defined by the organization continue to be met. Both managing unstructured data and managing the migration onto, and ongoing monitoring of, cloud services present countless opportunities for the loss of the confidentiality, integrity, and availability of data, to cite once again just the three most famous security objectives.
The use of large datasets for competitive advantage is highly tempting for many
organizations from a tactical perspective, while the use of cloud services is attractive
for a number of reasons, financial, operational, and strategic. The senior management
of organizations tempted by such initiatives should be aware, however, that neither
type of project can be successfully completed overnight, and that they should be pre-
pared to provide the necessary resources, guidance, oversight, and supervision to
ensure that the advantages obtained through the initiatives are not outweighed by
decreased security, increased costs, or reduced comfort from internal controls.
References
[1] Marr, B.: Is This the Ultimate Definition of "Big Data"?,
http://smartdatacollective.com/node/128486 (accessed September 2,
2013)
[2] Bojilov, M.: Big Data Defined. In: ISACA Now (2013),
http://www.isaca.org/Knowledge-Center/Blog/Lists/
Posts/Post.aspx?ID=299 (accessed September 2, 2013)
[3] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.:
Big data: the next frontier for innovation, competition and productivity. McKinsey Glob-
al Institute, Washington DC (2011)
[4] Blumberg, R., Atre, S.: The Problem with Unstructured Data. DM Review (2003)
[5] National Institute of Standards and Technology. The NIST Definition of Cloud Compu-
ting. Special Publication 800-145, NIST, Gaithersburg (2011)
[6] Committee of Sponsoring Organizations of the Treadway Commission. Guidance on
Monitoring Internal Control Systems. AICPA, New York (2009)
[7] Krutz, R., Vines, D.: Cloud Security: A Comprehensive Guide to Secure Cloud Compu-
ting. Wiley Publishing Inc., Hoboken (2010)
[8] International Organization for Standardization. ISO/IEC 27001 Information Technology - Security Techniques - Information Security Management Systems. ISO/IEC, Geneva (2005)
[9] National Institute of Standards and Technology, Information Security. Special Publica-
tion 800-100. NIST, Gaithersburg (2006)
[10] Parker, D.: Toward a New Framework for Information Security. In: Bosworth, S., Kabay,
M. (eds.) Computer Security Handbook, 4th edn. John Wiley & Sons, New York (2002)
[11] Organisation for Economic Co-operation and Development (2002) OECD Guidelines for
the Security of Information Systems and Networks: Towards a Culture of Security,
http://www.oecd.org/sti/ieconomy/15582260.pdf
(accessed January 8, 2014)
[12] Adolph, M.: Distributed Computing: Utilities, Grids and Clouds. ITU-T Technology
Watch Report 9, ITU, Geneva (2009)
[13] Gantz, J., Reinsel, D.: Extracting Value from Chaos. IDC iView (2011)
[14] Information Systems Audit and Control Association, Big Data: Impacts and Benefits.
ISACA, Chicago (2013)
[15] Bittman, T.: A Better Cloud Computing Analogy. Gartner Blogs (2009),
http://blogs.gartner.com/thomas_bittman/2009/09/22/
a-better-cloud-computing-analogy/ (accessed January 8, 2014)
[16] Information Systems Audit and Control Association, Security Considerations for Cloud
Computing. ISACA, Chicago (2012)
[17] Ghernaouti-Hélie, S., Tashi, I., Simms, D.: Optimizing security efficiency through effec-
tive risk management. Paper presented at the 25th IEEE International Conference on Ad-
vanced Information Networking and Applications (AINA 2011), Biopolis, Singapore,
March 22-25 (2011)
[18] American Institute of Certified Public Accountants, Quick Reference Guide to Service
Organizations: Control Reports. AICPA, New York (2012)
[19] American Institute of Certified Public Accountants, Service Organizations: Reporting on
Controls at a Service Organization Relevant to User Entities’ Internal Control over Fi-
nancial Reporting Guide. AICPA, New York (2011)
Future Human-Centric Smart Environments
1 Introduction
The smart concept is flooding our lives. Currently, everybody speaks about smart cities, smart companies, smart transport, etc., whose main enablers are the latest advances in Information and Communication Technologies (ICT) as well as proposals for the integration of sensors, actuators, and control processes. The world is being transformed at such speed that it is expected that by 2015 over 50 billion devices will be interconnected into a full ecosystem known as the Internet of Things (IoT) [1].
IoT represents a key enabler for smart environments, allowing the interaction between smart things and the effective integration of real-world information and
knowledge into the digital world. Smart things, instrumented with sensing and
interaction capabilities or identification technologies, will provide the means to
capture information about the real world in much more detail than ever before, making it possible to influence real-world entities and other actors in real time.
The initial roll-out of IoT devices has been fueled primarily by industrial and enterprise-centric use cases. For instance, a set of IoT application scenarios has been identified for their expected high impact on business and social benefits. These scenarios are shown in Fig. 1. Since some knowledge and services from one IoT scenario can be shared and used in other scenarios, all of them can be considered to be linked, as reflected in Fig. 1.
Fig. 1. IoT application scenarios with high expected business and societal impact
However, the exploitation potential of IoT for smart services that address the
needs of individual users, user communities, or society at large, is limited at this
stage and not obvious to many people. Unleashing the full potential of IoT means going beyond enterprise-centric systems and moving toward a user-inclusive IoT, in which IoT devices and information flows contributed by people are encouraged. This will allow us to unlock a wealth of new user-centric IoT information, and a new generation of services of high value for society will be
built. The main strength of this idea is the high impact on many aspects of
everyday life, affecting the behavior of humans.
To achieve smart and sustainable environments it is fundamental, first, that people understand and participate actively in this ecosystem, so that the goals considered during the system design are actually met. Additionally, by encouraging people to contribute through their active participation (by agreeing to share the information sensed by their devices, by interacting with the system, etc.), it becomes possible to build the technosocial foundations needed to unlock billions of new user-centric information streams. Therefore, human perception and understanding are key requirements for a successful uptake of ICT and IoT in all areas of society.
Furthermore, as individuals produce the majority of content on the Internet today, we can expect that users and their smart devices (such as their smartphones) will be responsible for generating the majority of IoT content as well. Crowdsourcing [2] is a very good example of this trend. Through such collective effort, the continuous capture of detailed snapshots of the physical world around us would become possible, providing a vast amount of data that can be leveraged to improve the population's quality of life.
From the point of view of individuals, one of the most obvious impacts of IoT
applications will be present indoors, making smart buildings a reality. A smart
building provides occupants with customized services, thanks to its intelligent capabilities, in offices, houses, industrial plants, or leisure environments. The
smart buildings field is currently undergoing a rapid transformation toward a
technology-driven sector with rising productivity. This paradigm will ultimately
create a solid foundation for continuous innovation in the building sector, fos-
tering an innovation ecosystem as the foundation stone for smart cities.
Bearing all these aspects in mind, in this chapter we present our proposal for a user-centric smart system based on the optimal integration and use of the information provided by, among others, the users themselves. This system is applicable to different smart environments such as transportation, security, health assistance, etc. As an example of a smart environment, in this work we focus on the smart buildings field, where the system makes decisions and uses behavior-based techniques to determine appropriate control actions (such as the control of appliances, lighting, power, air conditioning, access control, security, and energy-aware comfort services), and where user intervention and participation are promoted in real time.
The content of this chapter is organized as follows: since the emerging horizon
of IoT still presents a variety of technological and socioeconomic barriers that
have to be overcome, in Section 2 we enumerate the main challenges for turning this ideal concept into a livable and sustainable reality. Section 3 reviews the potential of IoT technologies applied to user-centric systems, where the central role of the user is reflected in all aspects of the ecosystem. Section 4 presents our IoT-based architecture, designed and developed at the University of Murcia, which is able to offer user-centric services in different smart environments. Later on we focus on the instantiation of this system in two real use cases within the smart
The IoT enables a broad range of applications in the context of smart cities, covering very diverse aspects of modern life, including: smart energy provisioning (e.g., monitoring and control of city-wide power distribution and generation), intelligent transport systems (e.g., reactive traffic management), smart home deployments (e.g., the control of household devices and entertainment systems), and assisted living (e.g., remote monitoring of aged and/or ill people). For each of these applications it is important to understand the user requirements in order to design an optimal interaction with humans, to consider assurance needs (reliability, confidentiality, privacy, auditability, authentication, and safe operation), and, finally, to understand dependencies between the implemented systems.
From the above observations, four main challenges can be formulated for enabling real-life smart city services by means of IoT technologies while considering security and privacy requirements: IoT technologies, trusted IoT, smart city management and services, and user involvement. These are further described next.
dementia patients who are sensed by the system and provided with customized
assisted-living services). In contrast, active involvement of individuals in producing content for the Internet and social networks became prominent only with the advent of the so-called Web 2.0, i.e., when simple and easy-to-use tools for content generation, publishing, and sharing were supported by the ubiquitous availability of Internet connectivity and of personal devices connected to the Internet. In the IoT case, the threshold for active participation of users in a joint IoT system is much higher, due to the lack of adequate tools, mechanisms, and incentives, as well as the fear of disclosure of private information that could potentially be abused in as yet unknown ways.
The move toward a citizen-inclusive IoT, in which citizen-provided IoT devices and contributed information flows are encouraged, will have a significant impact on people and on society in general. Thus, a variety of technological and socioeconomic barriers will have to be overcome to enable such inclusive IoT solutions. In particular, the human perception of IoT is critical for its successful uptake in all areas of our society.
Perceived levels of trust and confidence in the technology are crucial for
forming a public opinion on IoT. This is a real challenge for IoT solutions, which
are expected to behave seamlessly and act in the background, invisible to their
users.
In order to ensure a wide-scale uptake of IoT in all areas of society, the architectures and protocols of an inclusive IoT ecosystem must be simple and must provide motivation for every citizen to contribute an increasing number of IoT devices and information flows in their households, thereby making them available to their immediate community and to the IoT at large. In addition to simplicity in terms of how the system is used, and to the immediate and clear benefits provided by the system to each individual user, the implementation must ensure adequate control and transparency in order to increase confidence and to allow a better understanding of what is happening with the information and devices contributed. If transparency and user control are not treated adequately in such a community-grown IoT system, there is a real danger that the systems will be perceived with suspicion and mistrust by users, which may result in opposition to, and refusal of, such technology, thus hindering its widespread deployment.
Bearing in mind this last challenge to achieve user-centric smart city services,
the next section presents a complete description of the issue from different points of
view.
The motivation for such cooperation between humans and technology can
be seen from different perspectives. Fig. 2 shows a schema of different aspects
affecting user-centric smart environments.
Nowadays one can already talk about smart people, smart computer systems and, in recent years, about smart objects too. Following this approach, in the near future "intelligence" will be so disseminated around us - integrated in the objects people use daily - that people may end up living passively in a reality where "smart entities" are in charge of taking all decisions for them. Computer systems already have capacities similar to those of humans, such as memory and decision-making processes. At this speed, in a few years a large part of our personal memory, as well as our capability for making decisions, could be stored outside ourselves. Some experts have already stated that the human role in the near future could be reduced to that of a large sensed system at the service of computer systems.
Computer science is one of the most revolutionary scientific creations of mankind, but its advances should not be fully dehumanizing, and their human-centric perspective should be considered. Allen Newell, one of the fathers of artificial intelligence, published a book titled Unified Theories of Cognition [6], which was warmly received by the scientific community. In this book Newell described the goal of intelligence as relating two independent systems: knowledge and goals. According to this approach, when a problem is solved, intelligence uses all the knowledge available to reach a concrete goal, i.e., the solution of that problem. But this theory excludes two essential functions of human intelligence: creating information and inventing goals. Therefore, Newell's definition of intelligence is not valid for humans. The main capability of human intelligence consists of selecting its own information, focusing on its own reality, and establishing its own goals. Therefore, human intelligence is genuine thanks to its capability of directing mental activity to adjust ourselves to reality - and even to extend it.
For all these reasons, computer systems providing people with human-centric services should include the active participation of humans in order to be properly recognized as intelligent, and thus to make sustainable and livable ecosystems at the service of society possible.
Therefore, since the intelligence of computer systems depends on the human intelligence of the users interacting with them, it is necessary to propose an effective process for involving humans during the whole lifecycle of this type of system. In this sense, from the beginning to the end of such a system's operation, it is necessary to do the following (a minimal sketch of how the resulting user feedback might be represented is given after the list):
1. Include and take into account the data sensed and handled by user devices.
2. Provide users with all the sensed data and the predictions carried out according to their context.
3. Consider the information provided by user interactions with the system. In this sense, people should be able to:
– Accept, change, or reject the solutions provided by the system.
– Communicate their preferences related to a specific service.
– Indicate their satisfaction or dissatisfaction with the provided solutions.
– Inform the system about their habits and goals.
– Specify new services that cover new needs or desires.
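As a minimal sketch of how the feedback gathered through these interactions might be represented - assuming a hypothetical data model whose class and field names are purely illustrative and are not part of any existing platform - the following code records whether a user accepted, changed, or rejected a proposed action, together with stated preferences and satisfaction levels:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class Verdict(Enum):          # how the user reacted to a system-proposed action
    ACCEPTED = "accepted"
    CHANGED = "changed"
    REJECTED = "rejected"

@dataclass
class Feedback:
    service: str                          # e.g. "lighting", "hvac" (illustrative names)
    proposed_value: float                 # what the system suggested
    verdict: Verdict
    user_value: Optional[float] = None    # the value the user set instead, if any
    satisfaction: Optional[int] = None    # e.g. 1 (dissatisfied) .. 5 (satisfied)

@dataclass
class UserModel:
    user_id: str
    preferences: Dict[str, float] = field(default_factory=dict)  # stated preferences per service
    history: List[Feedback] = field(default_factory=list)

    def record(self, fb: Feedback) -> None:
        """Store feedback and, if the user overrode the system, update the stated preference."""
        self.history.append(fb)
        if fb.verdict is Verdict.CHANGED and fb.user_value is not None:
            self.preferences[fb.service] = fb.user_value

# Example: a user lowers the proposed temperature setpoint from 23.5 to 22.0 degrees.
model = UserModel("user-42")
model.record(Feedback("hvac", 23.5, Verdict.CHANGED, user_value=22.0, satisfaction=4))
print(model.preferences)   # {'hvac': 22.0}
```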
The general trend is that the goals considered during the system design can be questioned by users at various levels of detail and automatically updated for optimum operation of the system. Consequently, this gives rise to a knowledge milieu in which all the members of a social network, more or less experienced, may proactively share and refine their knowledge - in terms of goal generation - with other people who own the same or different devices under similar contexts. Below we describe in more detail this social point of view of the emerging user-centric IoT paradigm.
According to the definition made by Wasserman & Faust in 1994 [7], a Social
Network (SN) is generally a system with a set of social actors and a collection
of social relations that specify how these actors are relationally tied together.
In more recent years, due to the advent of new web technologies and platforms
(blogs, wikis, Facebook, and Twitter, for instance) - which allow the users to
take an active part in content creation - the definition of SNs has been updated to include content production and sharing aspects [8]. Thanks to this information generation, smart systems can use these data to provide more efficient services according to the user requirements and the context conditions.
The actions and relationships of users in a proactive SN provide a rich source
of observable user behavior (for example, related to what, and with whom, con-
tent is shared) that modeling approaches can leverage. In order to exploit this
additional information, a user model for SNs, in addition to the attributes of
classic and context-aware user models, must also include attributes modeling
the user’s social behavior in terms of user relationships and content production
and sharing (or limitations to it) [9].
More in detail, user relationships in SNs are more commonly expressed as
friendship (or like, or trust) statements between users. Taken together, the trust
sets of the users in the SN can be seen as a user graph (i.e., a social graph), in
which the nodes are the users and the edges are the trust connections [10].
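As a simple illustration of this idea (a hedged sketch, not tied to any particular SN platform or library), trust statements can be stored as a directed graph from which, for example, a user's trust set or the number of users trusting a given node can be read off:

```python
from collections import defaultdict

class SocialGraph:
    """Directed trust graph: an edge (a, b) means user a states trust in user b."""

    def __init__(self):
        self.trusts = defaultdict(set)      # outgoing trust statements per user
        self.trusted_by = defaultdict(set)  # incoming trust statements per user

    def add_trust(self, truster: str, trustee: str) -> None:
        self.trusts[truster].add(trustee)
        self.trusted_by[trustee].add(truster)

    def trust_set(self, user: str) -> set:
        """Users that the given user states trust in (outgoing edges)."""
        return self.trusts[user]

    def popularity(self, user: str) -> int:
        """Number of users who state trust in this user (in-degree)."""
        return len(self.trusted_by[user])

g = SocialGraph()
g.add_trust("alice", "bob")
g.add_trust("alice", "carol")
g.add_trust("bob", "carol")
print(g.trust_set("alice"))   # {'bob', 'carol'} (set order may vary)
print(g.popularity("carol"))  # 2
```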
Since in an online community people are motivated to participate by different
things in different ways, it is to be expected that personalizing the incentives
and the way rewards for participation are presented to the individual would
increase the effect of the incentives on their motivation. Modeling the changing
needs of communities and adapting the incentive mechanisms accordingly can
help attract the kind of contributions when they are most needed. Therefore, user
modeling inside a community or a SN can be seen as an area that provides valu-
able insights and techniques in the design of adaptive incentive mechanisms [11].
Such mechanisms are able to motivate a user to participate or perform an action
inside the network. Therefore, the purpose of incentive mechanisms is to change
the state of the user (goals, motivations, etc.) to adapt the individual user to the
benefit of the overall system or community (which is the opposite of the purpose
of user adaptive environments, where the objective is to adapt the system to the
needs of the individual user).
Switching from the user level to the community level, an incentive mechanism can be viewed as an adaptation mechanism oriented toward the behavior of a community of users. This incentive adaptation mechanism monitors the actions of the community, represented in a community model or in a collection of individual user models, and makes adaptations to the interface, the information layout, the functionality of the community, etc., to respond to changes in the user model according to some predefined goals (e.g., maximizing participation and content sharing).
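A minimal sketch of such an adaptive incentive mechanism, assuming a hypothetical community model that simply counts recent contributions per activity: when participation falls below a predefined goal, the reward offered for that activity is increased, and it is decreased again once the goal is exceeded.

```python
def adapt_incentives(participation, goals, rewards, step=0.1, floor=0.5, cap=3.0):
    """
    participation: observed contributions per activity, e.g. {"sharing": 40, "rating": 5}
    goals:         target contribution levels per activity (the predefined goals)
    rewards:       current reward multipliers per activity
    Returns an updated copy of rewards: raised where the community under-performs,
    lowered where it over-performs, clamped between floor and cap.
    """
    updated = dict(rewards)
    for activity, target in goals.items():
        observed = participation.get(activity, 0)
        current = updated.get(activity, 1.0)
        if observed < target:
            updated[activity] = min(cap, current + step)
        elif observed > target:
            updated[activity] = max(floor, current - step)
    return updated

# Example: content sharing exceeds its goal, rating activity falls short.
rewards = adapt_incentives(
    participation={"sharing": 120, "rating": 5},
    goals={"sharing": 100, "rating": 50},
    rewards={"sharing": 1.0, "rating": 1.0},
)
print(rewards)   # e.g. {'sharing': 0.9, 'rating': 1.1}
```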
Therefore, the integration of suitable computational techniques that allow us to deal with all these aspects of modeling and adaptation is necessary. The next part introduces the main concepts related to the computational intelligence facet of user-centric IoT systems.
[Figure: middleware architecture labels - complex event processing, publish-subscribe middleware, databases (BBDD), OCP extensions, and a data abstraction stack based on SOUPA, OWL, RDF-S, RDF, and XML.]
traditional electricity bills. Now there is a huge opportunity to improve the offer
of cost-effective, user-friendly, healthy, and safe products for smart buildings,
which provide users with increased awareness (mainly concerning the energy
they consume), and permit them to be an input of the underlying processes of
the system.
Bearing these aspects in mind, the rest of the chapter focuses on the human-centric perspective of emergent IoT systems in the context of smart buildings, where users are both the final deciders of actions and system co-designers, since their feedback conditions the behavior of the software. Next, more details about the architecture shown in Fig. 3 are given; at the same time, we describe its implementation over a real platform: City Explorer.
Looking at the lower part of Fig. 3, input data are acquired from a plethora
of sensor and network technologies such as Web, local, and remote databases,
wireless sensor networks, or user tracking, all of which form an IoT framework.
Sensors and actuators can be self-configured and controlled remotely through
the Internet, enabling a variety of monitoring and control applications.
6LowPAN) and Bluetooth can be used to avoid wiring in already-built buildings, for instance, and to connect new devices through a wireless sensor network.
In addition, a LAN installation is used in buildings to connect all IP-based elements with the HAMs, whereas a changeable communication technology can be used to connect the in-building network with the Internet. Optical fiber, common
ADSL, ISDN, or cable-modem connections could be enough to offer remote mon-
itoring/management and a basic security system.
Given the heterogeneity of data sources and the necessity of seamless integra-
tion of devices and networks, a middleware mediator is proposed to deal with this
issue. Therefore, the transformation of the collected data from the different data
sources into a common language representation is performed in the middleware
layer.
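As a rough illustration of what this translation step could look like - the device payload formats and the target record below are invented for the example and are not the actual OCP/SOUPA representation used by the platform - readings arriving from heterogeneous sources are mapped into one common record before being handed to the upper layers:

```python
from datetime import datetime, timezone

# Hypothetical common representation consumed by the upper layers.
def common_record(source, kind, value, unit, timestamp=None):
    return {
        "source": source,
        "kind": kind,                 # e.g. "temperature", "presence", "illuminance"
        "value": value,
        "unit": unit,
        "timestamp": (timestamp or datetime.now(timezone.utc)).isoformat(),
    }

# Adapters for two invented device payload formats.
def from_wsn_payload(payload):
    # e.g. {"node": "wsn-7", "temp_c": 22.4}
    return common_record(payload["node"], "temperature", payload["temp_c"], "degC")

def from_building_db_row(row):
    # e.g. ("lab-2", "illuminance", 412, "lux")
    location, kind, value, unit = row
    return common_record(location, kind, value, unit)

readings = [
    from_wsn_payload({"node": "wsn-7", "temp_c": 22.4}),
    from_building_db_row(("lab-2", "illuminance", 412, "lux")),
]
for r in readings:
    print(r["source"], r["kind"], r["value"], r["unit"])
```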
During this stage new context information can be generated, which is provided
to the middleware for its registration in the ontology containing the data con-
text (acting then as context producer). Therefore, different algorithms must be
applied for the intelligent processing of data, events and decisions, depending on
the final desired operation of the system (i.e., the addressed services).
Considering the field of smart buildings, and as Fig. 4 shows, the data processing techniques implemented in this layer should cover services such as security, tele-assistance, energy efficiency, comfort, and remote control, among others. In this context, intelligent decisions are made through behavior-based techniques to determine appropriate control actions, such as appliance and lighting control, power management, air conditioning adjustment, etc.
Finally, the specific features for service provisioning, which are abstracted from
the final service implementations, can be found in the upper layer in Fig. 3.
This way, our approach is to offer a framework with transparent access to the
underlying functionalities to facilitate the development of different types of final
applications.
Fig. 4 shows a schema of our automation platform offering some ubiquitous services in the smart buildings field. It is divided into the indoor part of the platform and all the connections with external elements for remote access, technical tele-assistance, security, and energy efficiency/comfort provision.
Additionally, in order to provide a local human-machine interface (HMI),
which can be considered trustworthy by users and lets them interact easily with
the system, several control panels have been distributed throughout the building to
manage automated spaces. This comprises an embedded solution with an HMI
adapted to the controlled devices and able to provide any monitored data in a
transparent way. Bearing users in mind, the HMI of the control panels of City
Explorer manages to reduce the risk of injury, fatigue, error, and discomfort, as well as improving productivity and the quality of the user interactions.
Taking into account the common services provided in the smart buildings
context, in the next section we present two examples of real use cases where our
proposed smart system (City Explorer) is deployed and working to provide the
main common building services, such as remote access, technical tele-assistance,
security, energy efficiency, and comfort provision in two different contexts: in a
technology transfer center and in the main campus buildings of the University
of Murcia.
Buildings represent one of the fields where energy sustainability must be addressed to ensure the energy sustainability of modern cities, mainly due to the increasing
amount of time that people spend indoors. For instance - and as a reference
Scenario. A reference building where our smart system is already deployed and
working is the Technology Transfer Center of the University of Murcia, where
City Explorer is installed. Fig. 5 depicts one of the floors of this reference build-
ing, where a set of laboratories is present on the lower part of the map. This
screenshot has been obtained from our SCADA-web, which also offers the possi-
bility of consulting any monitored data from the heterogeneous sensor network
deployed in the building.
Fig. 5. SCADA-web view of the ground floor of the reference smart building
– High comfort level: learn the comfort zone from users’ preferences, guaran-
tee a high comfort level (thermal, air quality, and illumination) and a good
dynamic performance.
– Energy savings: combine the comfort conditions control with an energy sa-
ving strategy.
– Air quality control: provide CO2-based demand-controlled ventilation sys-
tems.
Satisfying the above requirements implies controlling the
following actuators:
– Shading systems to control incoming solar radiation and natural light as well
as to reduce glare.
– Windows opening for natural ventilation or mechanical ventilation systems
to regulate natural airflow and indoor air changes, thus affecting thermal
comfort and indoor air quality.
– Electric lighting systems.
– Heating/cooling (HVAC) systems.
also takes into account user interactions with the system using the control panel associated with that room or using the SCADA-web access.
Looking at Fig. 5, we have taken the second laboratory starting from the left
as the reference testbed for carrying out the experiments. In this test laboratory
we have allocated different room spaces where sensors are distributed. Fig. 6
provides an overview of such deployments as well as the contexts of an office, a
dining room, a living room, a corridor, and a bedroom.
[Figure: layout of the test laboratory spaces (rooms of approx. 1.5 m x 1.7 m, 4.3 m x 1.3 m, and 4.6 m x 1.25 m) showing temperature, presence, and lux sensors together with controllable blinds, HVAC, switches, and ceiling lights.]
comfort conditions in our test building is fully integrated into the management
layer of the IoT architecture presented in the previous section.
The parameters that have been identified to affect comfort and energy perfor-
mance are: indoor temperature and humidity, natural lighting, user activity level,
and power consumed by electrical devices. Environmental parameters (temper-
ature, humidity, and natural lighting) have a direct influence on energy and comfort conditions but, in addition to them, thermal conditions also depend on
the user activity level and the number of users in the same space. Depending on
the indoor space (such as a corridor or a dining room), the comfort conditions
are different and, therefore, so is the energy needed. Moreover, the heat dissipated
by electrical devices also affects thermal conditions.
One of the most relevant inputs to our energy efficiency management is the
human activity level, which is provided by our indoor localization mechanism
integrated in City Explorer and detailed in [24]. This is based on an RFID/IR
data fusion mechanism able to provide information about the occupants’ identity,
indoor locations, and activity level. Hence, location data allow the system to
control the HVAC and lighting equipment accordingly.
After the identification and localization of occupants inside the building, diffe-
rent comfort profiles for each user are generated with default settings according
to their preferences. In this way, considering accurate user positioning informa-
tion (including user identification) as well as user comfort preferences for the
management process of the appliances involved, energy wastage derived from
overestimated or inappropriate settings is avoided. Nevertheless, occupants are
free to change the default values for their own preferences when they do not feel
comfortable. For this, users can communicate their preferences to the system
through the control panel of the HAM associated with their location, or through
the SCADA-web access of City Explorer. Then, our management system is able
to update the corresponding user profiles as long as these values are within
the comfort intervals defined according to the minimum levels of comfort in
the building context. On the other hand, when occupants are distributed in such a way that the same appliance is providing comfort service to more than one occu-
pant, our intelligent system is able to provide them with comfort conditions that
satisfy the greatest number of them (always considering the minimum levels of
comfort).
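The following sketch illustrates the two behaviours just described under simplified, assumed rules (the comfort interval and the averaging policy are illustrative, not the actual City Explorer logic): a requested setpoint is accepted only if it lies within the minimum comfort interval defined for the space, and when several occupants share the same appliance a compromise setpoint is computed from their profiles.

```python
# Assumed minimum comfort interval for thermal comfort in this space (degrees Celsius).
COMFORT_MIN, COMFORT_MAX = 20.0, 26.0

def update_profile(profiles, user, requested_setpoint):
    """Accept a user's requested setpoint only if it stays within the comfort interval."""
    if COMFORT_MIN <= requested_setpoint <= COMFORT_MAX:
        profiles[user] = requested_setpoint
    return profiles

def shared_setpoint(profiles, occupants):
    """Compromise for occupants served by the same appliance: average of their preferences,
    clamped to the comfort interval so minimum comfort levels are always respected."""
    prefs = [profiles[u] for u in occupants if u in profiles]
    if not prefs:
        return (COMFORT_MIN + COMFORT_MAX) / 2   # default when no profile is known
    return min(COMFORT_MAX, max(COMFORT_MIN, sum(prefs) / len(prefs)))

profiles = {"anna": 22.0, "boris": 24.5}
update_profile(profiles, "carla", 19.0)    # rejected: below the comfort interval
update_profile(profiles, "carla", 21.0)    # accepted
print(shared_setpoint(profiles, ["anna", "boris", "carla"]))   # 22.5
```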
Regarding user interaction with the system to communicate comfort preferences and energy control strategies, besides letting users explore monitored data by navigating through the different automated areas or rooms of the building, City Explorer's intuitive graphic editor also allows users to easily design any monitoring/control task and/or actions over the actuators (appliances) deployed in the building. The setting of the overall system can also be carried out by users through City Explorer, without any need to program any controller by code. In this way, it is possible to set up the whole system by simply adding maps and pictures over which users can place the different elements of the system (sensors, HAM units, etc.), and to design monitoring and control actions through arrows, in a similar way to that in which a flowchart is built. Therefore, our system gives users integral control of any aspect involved in the management of the building. An example of the graphic editor of City Explorer in which some rules have been defined by users is shown in Fig. 7.
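To give a flavour of the kind of monitoring/control rule a user might compose in such an editor, the sketch below encodes one hypothetical rule in code form (the sensor names, threshold, and actions are invented for illustration; in City Explorer the rule would be drawn graphically rather than programmed):

```python
def lighting_rule(presence_detected, illuminance_lux, lights_on, lux_threshold=300):
    """If someone is present and natural light is insufficient, switch the lights on;
    if nobody is present, switch them off to avoid unnecessary consumption."""
    if presence_detected and illuminance_lux < lux_threshold and not lights_on:
        return "TURN_LIGHTS_ON"
    if not presence_detected and lights_on:
        return "TURN_LIGHTS_OFF"
    return "NO_ACTION"

print(lighting_rule(presence_detected=True,  illuminance_lux=150, lights_on=False))  # TURN_LIGHTS_ON
print(lighting_rule(presence_detected=False, illuminance_lux=150, lights_on=True))   # TURN_LIGHTS_OFF
```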
The system can detect inappropriate settings indicated by users according to
both their comfort requirements and associated energy consumption parameters.
Therefore, with the aim of offering users information about any unsuitable de-
sign or setting of the system, as well as to help them to easily understand the
link between their everyday actions and environmental impact, City Explorer
is able to notify them about such matters (i.e., acting as a learning tool). On
the other hand, when the system detects disconnections and/or failures in the
system during operation, it sends alerts by email/messages to notify users to
check these issues. All these features, included in our management system, contribute to changes in user behavior, increase user awareness over time, and help to detect unnecessary stand-by consumption of the controllable subsystems of the building.
[Bar chart: daily percentage energy savings over a 31-day period, ranging from roughly 6% to 12% per day (axes: Days vs. Percentage Energy Savings).]
Fig. 8. Percentage of energy savings considering a user-centric approach for the comfort
appliances management in smart buildings
6 Conclusion
The proliferation of ICT solutions, in general, and IoT approaches, in parti-
cular, represents new opportunities for the development of intelligent services
to achieve more efficient and sustainable environments. In this sense, persua-
sive energy monitoring technologies have the potential to encourage sustainable
energy lifestyles within buildings, as the proposal described in this chapter has
demonstrated. Nevertheless, to effect positive ecological behavior changes, user-
driven approaches are needed, whereby design requirements are accompanied by
an analysis of intended user behavior and motivations. Moreover, large data
samples are needed to understand user preferences and habits for the case of
indoor services.
In this way, building management systems that are able to satisfy energy efficiency requirements while also taking user comfort conditions into account are considered necessary. However, to date, studies have tended to bring users into the loop after the design is completed, rather than including them in the system design process.
In this work, after providing the main challenges that still have to be faced
to achieve smart and sustainable environments from the IoT perspective, and
References
1. Atzori, L., Iera, A., Morabito, G.: The internet of things: A survey. Computer
Networks 54(15), 2787–2805 (2010)
2. Ganti, R.K., Fan, Y., Hui, L.: Mobile crowdsensing: Current state and future chal-
lenges. IEEE Communications Magazine 49(11), 32–39 (2011)
3. SENSEI EU PROJECT, http://www.sensei-project.eu
4. Bélissent, J.: Getting clever about smart cities: new opportunities require new
business models (2010)
5. Ducatel, K., et al.: Scenarios for ambient intelligence 2010, ISTAG report,
European Commission. Institute for Prospective Technological Studies, Seville,
ftp://ftp.cordis.lu/pub/ist/docs/istagscenarios2010.pdf (November 2001)
6. Newell, A.: Unified theories of cognition, vol. 187. Harvard University Press (1994)
7. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications, vol. 8. Cambridge University Press (1994)
8. ISTAG: Report on Revising Europe's ICT Strategy. Technical report, European Commission (2009)
9. Spiliotopoulos, T., Oakley, I.: Applications of Social Network Analysis for User
Modeling
10. Shi, Y., Larson, M., Hanjalic, A.: Towards understanding the challenges facing
effective trust-aware recommendation. Recommender Systems and the Social Web,
40 (2010)
11. Vassileva, J.: Motivating participation in social computing applications: a user
modeling perspective. User Modeling and User-Adapted Interaction 22(1-2),
177–201 (2012)
12. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and tech-
niques. Morgan Kaufmann (2005)
13. Bin, S., Yuan, L., Xiaoyi, W.: Research on data mining models for the internet of
things. In: 2010 International Conference on Image Analysis and Signal Processing
(IASP). IEEE (2010)
14. Reilly, D., Taleb-Bendiab, A.: A Jini-based infrastructure for networked appliance
management and adaptation. In: Proceedings of the 2002 IEEE 5th International
Workshop on Networked Appliances, Liverpool. IEEE (2002)
15. Sarikaya, B., Ohba, Y., Moskowitz, R., Cao, Z., Cragie, R.: Security Bootstrapping
Solution for Resource-Constrained Devices. IETF Internet-Draft (2012)
16. Tschofenig, H., Gilger, J.: A Minimal (Datagram) Transport Layer Security Imple-
mentation. IETF Internet-Draft (2012)
17. Kivinen, T.: Minimal IKEv2, IETF Internet-Draft (2012)
18. Moskowitz, R.: HIP Diet EXchange (DEX), IETF Internet-Draft (2012)
19. Zamora-Izquierdo, M.A., Santa, J., Gomez-Skarmeta, A.F.: An Integral and Net-
worked Home Automation Solution for Indoor Ambient Intelligence. IEEE Perva-
sive Computing 9, 66–77 (2010)
20. Nieto, I., Botía, J.A., Gómez-Skarmeta, A.F.: Information and hybrid architec-
ture model of the OCP contextual information management system. Journal of
Universal Computer Science 12(3), 357–366 (2006)
21. Centre Europeen de Normalisation: Indoor Environmental Input Parameters for
Design and Assesment of Energy Performance of Buildings - Addressing Indoor
Air Quality, Thermal Environment, Lighting and Acoustics. EN 15251 (2006)
22. Handbook, A. S. H. R. A. E. Fundamentals. American Society of Heating, Refrig-
erating and Air Conditioning Engineers. Atlanta (2001)
23. Perez-Lombard, L., Ortiz, J., Pout, C.: A review on buildings energy consumption
information. Energy and Buildings 40(3), 394–398 (2008)
24. Moreno-Cano, M.V., Zamora-Izquierdo, M.A., Santa, J., Skarmeta, A.F.: An In-
door Localization System Based on Artificial Neural Networks and Particle Filters
Applied to Intelligent Buildings. Neurocomputing 122, 116–125 (2013)
25. Berglund, L.: Mathematical models for predicting the thermal comfort response of
building occupants. ASHRAE Transactions 84(1), 1848–1858 (1978)
Automatic Configuration of Mobile Applications
Using Context-Aware Cloud-Based Services
1 Introduction
We believe that mobile devices, especially in the last few years, have evolved considerably. Not only have they increased in performance, but they also provide more features than before, such as new and improved touch screens. However, in addition to the opportunities they bring, mobile devices also present new challenges that are not present in standard desktop computing; energy consumption, varying network coverage, and relatively small screen sizes are examples of this. Moreover, in 2011, for the first time, smartphones exceeded PCs in terms of devices sold,1 which further highlights the importance of mobile devices and, in this case, smartphones specifically.
Innovations in hardware capabilities open up new opportunities and challenges
when developing systems that run on or integrate with mobile devices. When com-
bined with the wireless capabilities of high-speed Internet through EDGE, 3G, and 4G
as well as Bluetooth and WLAN, many new research possibilities appear. Further-
more, with the updated network infrastructure and more affordable payment options
from the ISPs (Internet Service Providers), the always-connected devices are becom-
ing mainstream.
While smartphones are becoming increasingly powerful, the software they run has
also gone through some major evolutionary steps. Particularly the Android and iOS
marketplaces have been a very important factor in the platforms’ success. In 2011
both Android Market2 and the App Store3 reached over 500,000 available applica-
tions. The maturity of the platforms and the popularity of apps are giving businesses a
new channel to promote products, offer new features, and generally expand their
methods of reaching out to potential customers.
Moreover, the ability to purchase and download native applications directly to the
smartphones has proven to be a popular service for both consumers and developers.
Developers are able to publish their applications quickly and users can navigate
through a library consisting of many thousands of applications, providing everything
from games and educational software to enterprise solutions. Additionally, the rating
systems also provide end users with the ability to directly give feedback on the quality
and price of the offered application. These features have certainly made people use
their phones for more tasks than before. With the increase in the usage and capabilities of smartphones, new and interesting research opportunities have emerged. These include context-aware solutions and applications, which have been around for some time now and have successfully enriched mobile applications. In this work, we seek to build on these achievements and utilize context as a source of information for user information and interface tailoring [14]. The issue with many of the earlier approaches is that they have either looked at only one source of context-aware information or treated each context separately in the case of multiple sources. We propose a different approach, which combines context-aware information from several dimensions in order to build a rich foundation on which to base our algorithms. We exploit cloud computing technology to create a new user experience and a new way to invoke control over the user's mobile phone. Our solution is a remote configuration of an Android phone, which uses the context-aware information foundation to constantly adapt to the environment and change in accordance with the user's implicit requirements. Cloud-based service providers and developers are increasingly looking toward the mobile domain, having their expectations focused on
1 http://www.smartplanet.com/blog/business-brains/milestone-more-smartphones-than-pcs-sold-in-2011/21828
2 http://www.research2guidance.com/android-market-reaches-half-a-million-successful-submissions/
3 http://www.apple.com/iphone/built-in-apps/app-store.html
the access and consumption of services from mobile devices [13]. Hence, integrating an application running on a mobile device with cloud computing services is becoming an increasingly important factor. Potentially, by utilizing such connectivity to offload computation to the cloud, we could greatly amplify mobile application performance at minimal cost [3]. Our work focuses on data access transparency (where clients transparently push/pull data to/from the cloud) and on the adaptive behavior of cloud applications. We adapted the behavior of the Google App Engine server application based on context information sent from the users' devices, thus integrating context and cloud on a secure mobile platform [1].
The main contribution of our work in this chapter is a cross-source integration of cloud-based, context-aware information. This solution incorporates remote, web-based configuration of smartphones and advances the research area of context-aware information and web applications. By expanding and innovating on our existing work, we propose a novel solution to the multidimensional harvesting of contextual information and allow for automatic web application execution and tailoring.
4 http://www.britishairways.com/travel/iphone-app/public/en_gb
Because mobile devices are usually carried around everywhere and have the capability to communicate with external resources, they provide an ideal match for these kinds of applications. Such an application replaces other items and tasks, like boarding passes and check-in procedures, with a simple and well-integrated application that is more practical. Moreover, one does not need to print out a boarding pass or stand in a queue at the check-in counter.
These features of mobile devices are the result of many different components cooperating. Most applications communicate with various resources; these can be local to the phone, like sensors, or backend services that provide the wanted information. There are several research areas that have made significant contributions toward the technological advances we are able to use today. These areas are presented in the next section of this chapter. We will concentrate on four concepts that are particularly important when it comes to mobile devices and the integration of network communication, namely 1) Distributed Computing, 2) Mobile Computing, 3) Pervasive Computing, and 4) the Internet of Things.
As presented in Figure 2, these major steps in mobility and computing are related both in terms of technology and research challenges. When moving toward the right of the figure, there is a tendency either to add new problems or to make existing ones more challenging [17]. In the figure, we have marked (with a dashed box) the most important issues investigated in this chapter.
Although new problems appear with the paradigms on the right, it is important to
build on previous research findings. Distributed computing has a considerable knowl-
edge base that helps both mobile and pervasive computing moving forward. Internet
of Things is also building on knowledge learned from the previous paradigms. We
will not go into detail on the Internet of Things, as this is outside the scope of our
work, but we include a general description to complete the overall picture of the four
research areas.
This definition consists of two parts, which are the hardware and software. The
hardware must cooperate to complete specific tasks, while the software should try to
unify these hardware resources into one coherent system. Distributed computing thus
takes advantage of networked resources, trying to share information, or even have
different resources join forces to be able to achieve complex tasks that might take too
long for just one standalone computer. Compared to implementing a single-machine
system, distributed systems have their own challenges and issues.
One important concept within distributed computing is remote network communication between devices. Different components of the system need to communicate; this is at the core of the distributed computing area. There have been several attempts at making the development and overall design of such systems easier and more reliable, from RPC to peer-to-peer communication and remote method invocation.
One common example of a feature that is implemented to minimize the battery usage
is the light sensor, which registers the amount of light in the room and adjusts the
screen brightness accordingly.
Based on these characteristics of mobile computing, Satyanarayanan [17] identified
five main research challenges. These are Mobile Information Access, Mobile Net-
working, Energy-Aware Systems, Location Sensitivity, and Adaptive Applications, all
of which form the bulk of research interest in the area today.
5 http://www.bbc.co.uk/news/technology-13613536
themes, among which we mention cloud system design [12], benchmarking of the cloud [10], and provider response time comparisons. Mei et al. [11] have pointed out four main research areas in cloud computing that they find particularly interesting: pluggable computing entities, data access transparency, adaptive behavior of cloud applications, and automatic discovery of application quality.
The Internet of Things is an up-and-coming area, which will undoubtedly attract more research interest in the near future. Its popularity is increasing, as shown by Google Trends graphs.
this feature is that the user did not need to manually update each device; users have a
“master configuration” stored externally that can be directly pushed to their phone or
tablet. It is also easier to add more advanced configuration options when the user can
take advantage of the bigger screen, mouse, and keyboard on a desktop/laptop PC for
entering configuration values than those found on mobile devices. On the webpage, when the user selects the applications to store on the mobile device and presses the "save configuration" button, a push message is sent to the client application.
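To make this flow concrete, the sketch below illustrates, in Python, one way the server side of such a remote configuration could assemble the "master configuration" and hand it to a push service when the "save configuration" button is pressed. The payload format and the push_to_device() helper are purely hypothetical and are not taken from the chapter's implementation.

```python
import json

def build_master_configuration(selected_apps, settings):
    """Assemble the externally stored 'master configuration'."""
    return {"apps": selected_apps, "settings": settings}

def push_to_device(device_token, payload):
    """Placeholder for the cloud push call; the concrete push service
    used by the chapter is not assumed here."""
    print(f"pushing to {device_token}: {payload}")

def on_save_configuration(device_token, selected_apps, settings):
    # Triggered when the "save configuration" button is pressed on the web page.
    config = build_master_configuration(selected_apps, settings)
    push_to_device(device_token, json.dumps(config))

on_save_configuration("device-123",
                      ["calendar", "contacts", "mail"],
                      {"ringer": "silent"})
```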
4.2 Meta-Tagging
To make it possible for users to tag their appointments and contacts with context information, we added special meta-tags. By adding a type tag, for example, $[type=work] or $[type=leisure], we were able to know whether the user had a business meeting or a leisure activity. We then filtered the contacts based on this information. If the tag $[type=work] was added, the application knows that the user is in a work setting and automatically adapts the contacts based on this input; in a work context, only work-related contacts would be shown. To add and edit these tags we used the web interface of Google Contacts and Calendar.
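A minimal sketch of this tag-based filtering is shown below, assuming the $[type=...] tags are stored in a free-text notes field; the helper names and data layout are illustrative, not the chapter's actual code.

```python
import re

TAG_PATTERN = re.compile(r"\$\[type\s*=\s*(\w+)\]")

def extract_type_tag(text):
    """Return the value of a $[type=...] meta-tag, or None if absent."""
    match = TAG_PATTERN.search(text)
    return match.group(1) if match else None

def filter_contacts(contacts, current_context):
    """Keep only contacts whose tag matches the current context."""
    return [c for c in contacts
            if extract_type_tag(c["notes"]) == current_context]

contacts = [
    {"name": "Alice", "notes": "$[type=work]"},
    {"name": "Bob", "notes": "$[type=leisure]"},
]
appointment = "Project meeting $[type = work]"
context = extract_type_tag(appointment)       # -> 'work'
print(filter_contacts(contacts, context))     # -> only Alice
```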
[Table 1 (continued): survey statements grouped by domain (web application, context-awareness, cloud computing), with the mean and standard deviation of the responses to each statement.]
[Bar chart: number of responses per category (SD, D, A, SA) for statements S1-S3.]
5.2 Context-Awareness
In terms of context-aware information, participants were asked to take a stand with respect to four statements, with results shown below (Figure 4). For the first statement (S4), although a clear majority supported the assertion, opinions were somewhat spread and this answer was not statistically significant. For the next two statements a very positive bias was registered, indicating correctly computed context-awareness and correct presentation to the users. For S7, users again indicated that they are eager to see more cloud-based services and integration.
[Figure 4: number of responses per category (SD, D, A, SA) for statements S4-S7.]
[Bar chart: number of responses per category (SD, D, A, SA) for statements S8-S11.]
From the literature, we point to the ability of modern applications to adapt to their environment as a central feature [4]. Edwards [5] argued that such tailoring of data and sharing of contextual information would improve user interaction and eliminate manual tasks. Results from the user evaluation support this. Users both find it attractive and have positive attitudes toward the automation of tasks, such as push updates of information and interface tailoring. This work has further elaborated on
context-aware integration and has shown how it is possible to arrange interplay be-
tween device context-aware information, such as sensors, and cloud-based context-
aware information, such as calendar data, contacts, and applications, building upon
suggestions for further research on adaptive cloud behavior as identified by Christen-
sen [2] and Mei et al. [10][11].
To register the tags, the standard Google Calendar and Contacts web interfaces were used. Such tight integration with Google services and exposure of private information was not regarded as a negative issue; as shown in the results, most of the users surveyed disagreed that this was an inconvenience. This perception makes room for further integration with Google services in future research, where the Google+ platform will be particularly interesting, as it may bring opportunities for integrating the social aspect and possibly merging context-awareness with social networks.
Sensors are an important source of information input in any real-world context and
several previous research contributions have looked into this topic. The work present-
ed in this chapter follows in the footsteps of research such as that of Parviainen et al.
[31], and extends sensor integration to a new level. By taking advantage of the rich
hardware available on modern smartphones, the developed application is able to have
tighter and more comprehensively integrated sensors in the solution. Although sensor integration as a source for context-awareness is well received, it still needs to be further enhanced. In particular, it would be useful to determine the appropriate extent and thresholds that should be used for sensor activation and deactivation. We have shown that it is feasible to implement sensors and extend their context-aware influence by having them cooperate with cloud-based services in a cross-source web application scenario. Further research includes investigating sensor thresholds and the management of different sources by different people in a web scenario.
In this chapter, we investigated context-aware and cloud-based adaptation of mobile devices and the user experience. Our research has added a novel contribution to the area of context-awareness in the cloud setting. We have proposed and demonstrated principles in implemented applications, whereby context-aware information is harvested from several dimensions to build a rich foundation on which to base our algorithms for context-aware computation. Furthermore, we have exploited and combined this with cloud computing technology to create a new user experience and a new way to invoke control over the user's mobile phone. Through a developed application suite, we have shown the feasibility of such an approach, reinforced by a generally positive user evaluation. Moreover, we believe our solution, incorporating remote and automatic configuration of Android phones, advances the research area of context-aware information.
7 Future Research
It can be very expensive to write separate applications in the native language of each platform. Therefore, when developing multi-platform systems there is also the possibility of using HTML5 technology to create a shared application. While we acknowledge this fact, we also see that in today's environment there are scenarios where HTML5 does not meet the desired performance requirements or lacks specific features. On the topic of heterogeneity, we see potential for future research providing more detail on this idea of a common platform, such as a closer investigation of HTML5 compared to native applications. Large cloud-computing providers have access to an enormous amount of data, and the lack of transparency about how this information is used has provoked concerns. This would certainly be an interesting topic for future research on cloud computing acceptance with regard to personal information sharing.
From the several scenarios described in this chapter, we see that future research should continue to innovate and expand the notion of context-awareness, enabling further automatic adaptation and behavior alteration in accordance with implicit user needs.
References
[1] Binnig, C., Kossmann, D., Kraska, T., Loesing, S.: How is the weather tomorrow?: to-
wards a benchmark for the cloud. In: Proceedings of the Second International Workshop
on Testing Database Systems. ACM, Providence (2009)
[2] Christensen, J.H.: Using RESTful web-services and cloud computing to create next gen-
eration mobile applications. In: Proceedings of the 24th ACM SIGPLAN Conference
Companion on Object Oriented Programming Systems Languages and Applications.
ACM, Orlando (2009)
[3] Cidon, A., et al.: MARS: adaptive remote execution for multi-threaded mobile devices.
In: Proceedings of the 3rd ACM SOSP Workshop on Networking, Systems, and Applica-
tions on Mobile Handhelds, MobiHeld 2011, pp. 1:1–1:6. ACM, New York (2011)
[4] Abowd, G.D., Dey, A.K.: Towards a better understanding of context and context-
awareness. In: Gellersen, H.-W. (ed.) HUC 1999. LNCS, vol. 1707, pp. 304–307. Springer,
Heidelberg (1999)
[5] Edwards, W.K.: Putting computing in context: An infrastructure to support extensible
context-enhanced collaborative applications. ACM Transactions on Computer-Human
Interaction (TOCHI) 12, 446–474 (2005)
[6] Elsenpeter, R.C., Velte, T., Velte, A.: Cloud Computing, A Practical Approach, 1st edn.
McGraw-Hill Osborne Media (2009)
[7] Grønli, T.-M., Hansen, J., Ghinea, G., Younas, M.: Context-Aware and Cloud Based Ad-
aptation of the User Experience. In: Proceedings of the 2013 Advances in Networking
and Applications (AINA), pp. 885–891. IEEE Computer Society (2013)
[8] Grønli, T.-M., Ghinea, G., Younas, M.: Context-aware and Automatic Configuration of
Mobile Devices in Cloud-enabled Ubiquitous Computing. Journal of Personal and
Ubiquitous Computing (2013)
[9] Khajeh-Hosseini, A., et al.: The Cloud Adoption Toolkit: supporting cloud adoption decisions in the enterprise. Software: Practice and Experience 42(4), 447–465 (2012)
[10] Mei, L., Chan, W.K., Tse, T.H.: A Tale of Clouds: Paradigm Comparisons and Some
Thoughts on Research Issues. In: Proceedings of the 2008 IEEE Asia-Pacific Services
Computing Conference, pp. 464–469. IEEE Computer Society (2008)
[11] Mei, L., Zhang, Z., Chan, W.K.: More Tales of Clouds: Software Engineering Research
Issues from the Cloud Application Perspective. In: Proceedings of the 2009 33rd Annual
IEEE International Computer Software and Applications Conference (2009)
[12] Mell, P., Grance, T.: The NIST Definition of Cloud Computing (2011)
[13] Paniagua, C., Srirama, S.N., Flores, H.: Bakabs: managing load of cloud-based web ap-
plications from mobiles. In: Proceedings of the 13th International Conference on Infor-
mation Integration and Web-based Applications and Services, iiWAS 2011, pp. 485–490.
ACM, New York (2011)
[14] Strobbe, M., Van Laere, O., Ongenae, F., Dauwe, S., Dhoedt, B., De Turck, F.,
Demeester, P., Luyten, K.: Integrating Location and Context Information for Novel
Personalised Applications. IEEE Pervasive Computing, 1 (2011)
[15] Vaquero, L.M., et al.: A break in the clouds: towards a cloud definition. SIGCOMM
Comput. Commun. Rev. 39(1), 50–55 (2008)
[16] Vermesan, O., et al.: Internet of Things Strategic Research Roadmap. European Research
Cluster on the Internet of Things, Cluster Strategic Research Agenda (2009)
[17] Satyanarayanan, M.: Pervasive computing: vision and challenges. IEEE Personal Com-
munications 8(4), 10–17 (2001)
[18] Zhang, D., Yang, L.T., Huang, H.: Searching in Internet of Things: Vision and Challeng-
es. In: 2011 IEEE 9th International Symposium on Parallel and Distributed Processing
with Applications (ISPA), pp. 201–206 (2011)
[19] Boger, M.: Java in Distributed Systems: Concurrency, Distribution and Persistence, 1st
edn. Wiley (2001)
[20] Saha, D., Mukherjee, A.: Pervasive Computing: A Paradigm for the 21st Century. Com-
puter 36(3), 25–31 (2003)
[21] Tanenbaum, A.S., Van Steen, M.: Distributed Systems: Principles and Paradigms. Prentice Hall (2002)
[22] Kamal, R.: Mobile Computing. Oxford University Press, USA (2008)
[23] Satyanarayanan, M.: Fundamental challenges in mobile computing. In: Proceedings of
the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, PODC
1996, pp. 1–7. ACM, New York (1996)
[24] Weiser, M.: The computer for the 21st century. Scientific American 3(3), 3–11 (1991)
[25] Hansmann, U., et al.: Pervasive Computing: The Mobile World, 2nd edn. Springer (2000)
[26] West, M.T.: Ubiquitous computing. In: Proceedings of the 39th ACM Annual Conference
on SIGUCCS, SIGUCCS 2011, pp. 175–182. ACM, New York (2011)
[27] West, M.T.: Ubiquitous computing. In: Proceedings of the 39th ACM Annual Conference
on User Services Conference, SIGUCCS 2011, pp. 175–182. ACM, New York (2011)
[28] Parkkila, J., Porras, J.: Improving battery life and performance of mobile devices with
cyber foraging. In: 2011 IEEE 22nd International Symposium on Personal Indoor and
Mobile Radio Communications (PIMRC), pp. 91–95 (2011)
[29] Patel, P., et al.: Towards application development for the internet of things. In: Proceed-
ings of the 8th Middleware Doctoral Symposium, MDS 2011, pp. 5:1–5:6. ACM, New
York (2011)
[30] Perkins, C.E.: Mobile networking in the Internet. Mob. Netw. Appl. 3(4), 319–334 (1998)
[31] Parviainen, M., Pirinen, T., Pertilä, P.: A speaker localization system for lecture room en-
vironment. In: Renals, S., Bengio, S., Fiscus, J.G. (eds.) MLMI 2006. LNCS, vol. 4299,
pp. 225–235. Springer, Heidelberg (2006)
A Socialized System for Enabling the Extraction
of Potential Values from Natural and Social
Sensing
1 Background
The recent development of sensing devices has enabled us to collect many kinds
of sensing data from the natural and social environments around us. Sensing
data here include historical data that people generate in society when they visit somewhere, purchase something, or communicate with others, as well as typical sensing data such as temperature and humidity [1][2]. We believe such sensing data contain 'value', which could stimulate our potential abilities and raise our intelligence levels. For instance, such sensing data are expected to let us recognize potential needs of people or society [3]. Although it is still an open issue how 'value' is defined and measured, one possible way could be to define how big a value is according to how large its expected economic impact will be, i.e., how much money the value is expected to produce. However, we face two fundamental problems that must be solved to leverage sensing data. First, it is
hard for humans to understand raw/unprocessed sensing data [4] and, second, it is inefficient in terms of management costs to keep all sensing data 'usable' because the size of sensing data is huge and they even include sensing data that are very unlikely to be referred to [5]. Note that 'usable' here means that sensing data are not just archived but are stored in a database system that accepts queries to discover required data from outside and responds quickly to them. An option for us is not to keep the sensing data themselves usable but to keep their characteristics usable so that, at least, we can extract the values contained in them. A typical example is that it would be inefficient to keep the historical GPS data of all mobile users usable, while it might be sufficient to retain the characteristics of the data. One typical statistical characteristic could be the places and times at which many people stayed at specific locations, which could provide useful information to the marketing departments of retailers. However, this is just one of the hypotheses we humans could expect among the values extracted from sensing data. If we only select a part of the whole sensing data according to hypotheses predefined by humans and archive the rest of the data, we might lose numerous opportunities to obtain values that could be extracted from them. Therefore, our objective here should be to model the characteristics of sensing data that could produce values, without any hypotheses predetermined by humans, and keep them usable.
We propose a new-generation system to solve these problems, called a socialized system [6][7][8][10], that models the characteristics of sensing data and extracts values from these characteristics. The socialized system produces network graphs from various kinds of sensing data. Suppose that, when you send and receive e-mails to and from your colleagues, you are connected with each of them via a link; this is a network-graph representation of your relationships extracted from the historical data of e-mail communications [11]. If we apply this to other people and integrate your network graph with the other obtained network graphs into one network graph, the structure of the integrated network graph represents your characteristics, i.e., who has a close relationship with you, how many people have close relationships with you, and how 'central' you are in a community. The socialized system develops the above concept so that it can be applied
to any general sensing data. For example, if we produced a network graph from
historical data about locations people visited and what they purchased, the net-
work graph would not only represent people but also locations and products as
nodes and the relationship between any pair of them as a link. Then, we could
expect that the structure of the produced network graph would model the char-
acteristics of people, locations, and products and it would retrieve the values
extracted from the historical data of people’s movements and purchases.
The rest of this chapter is organized as follows. Sections 2 and 3 discuss the system model and algorithms of the socialized system. Section 4 then provides network-graph examples of the socialized system. We also describe SocialCast, a new content delivery paradigm built on the socialized system, in Section 5. The last section concludes this chapter.
2 System Model
Before discussing the system model of the socialized system, let us first consider a simple system model that enables us to record data and then to reproduce their characteristics. As seen in Fig. 1(a), a real space is first captured by a high-resolution camera device. The input image of the real space is encoded so that it can be recorded as digital data. Then, the recorded data are decoded so that the output image can be displayed on a high-resolution display device. A question raised here is what is required of the encoder. The answer should be not to lose those characteristics of the input image that are expected to be reproduced when it is decoded and displayed. A simple example is that, if color is expected to be reproduced on the display device, the color characteristics have to be encoded without loss. As seen in Fig. 1(b), the socialized system can be illustrated in the same way as in Fig. 1(a). The encoder in the socialized system produces network graphs from the input sensing data; these are called relational graphs, and their structure models the characteristics of the original sensing data. The decoder reproduces the characteristics of the original sensing data.
[Fig. 3: example of a full-mesh partial graph with nodes for an author, a workshop location, a technical committee, and keywords.]
with another node via a direct link. Seven examples of decodable characteristics
are listed below:
1. Connectivity between a pair of nodes: no connection, direct connection via
a link, or indirect connection via other nodes
2. Link strength between two directly connected nodes
3. Number of common nodes between two directly connected nodes
4. Shortest route between two indirectly connected nodes
5. Number of links each node has (degree)
6. Centrality of each node
7. Clustering characteristics of the graph
The characteristics listed above have been conventionally discussed for human network graphs [9], in which nodes represent people. Therefore, discussing the centrality of a node means discussing the centrality of the person who corresponds to that node. However, relational graphs represent many kinds of objects other than people, so, for example, discussing the centrality of a node might mean discussing the centrality of a product, a location, or something else. If some product or some location has a high centrality, that product or location might have a significant impact on the market. Note that the centrality of the location or the product explains how significant it is for the other related objects, i.e., other locations, other products, and other people in the relational graph, which is different from simply counting visits to the location or sales of the product.
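As an illustration of how such decodable characteristics can be computed on a relational graph, the hedged sketch below uses the networkx library on a toy graph mixing people, a product, and a shop; the node names and link weights are invented for the example.

```python
import networkx as nx

# Toy relational graph: nodes can be people, products, or locations.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Alice", "ProductX", 5.0),
    ("Alice", "Bob", 2.0),
    ("Bob", "ProductX", 1.0),
    ("Bob", "ShopA", 3.0),
])

# 1) connectivity and 4) shortest route between two nodes
print(nx.has_path(G, "Alice", "ShopA"))            # True (indirect, via Bob)
print(nx.shortest_path(G, "Alice", "ShopA"))       # ['Alice', 'Bob', 'ShopA']

# 2) link strength and 3) number of common neighbours
print(G["Alice"]["ProductX"]["weight"])            # 5.0
print(len(list(nx.common_neighbors(G, "Alice", "ProductX"))))  # 1 (Bob)

# 5) degree, 6) centrality, 7) clustering characteristics
print(dict(G.degree()))
print(nx.degree_centrality(G))
print(nx.clustering(G))
```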
users distributed in 3500 folders. Each message in the folders contains the senders' and receivers' e-mail addresses, dates and times, subjects, body text, and some other technical details specific to e-mails. We produced a relational graph from the dataset as Shetty and Adibi [11] did, even though they called it a social network. A link between two people is only established if they exchange e-mails at least five times. We only considered bidirectional links, which means that we only considered that a contact had occurred if both people sent e-mails to each other. What we did additionally, and which Shetty and Adibi [11] did not do, was to remove unsent or duplicated e-mails. Furthermore, we assumed the number of e-mails sent/received between two people to equal the link strength between them.
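The sketch below shows one possible reading of this linking rule in Python: a link is kept only if contact is bidirectional and the total number of exchanged e-mails reaches five, and that total is used as the link strength. The input format (a list of sender/receiver pairs) is an assumption made for illustration.

```python
from collections import Counter

def build_email_graph(emails, min_exchanges=5):
    """emails: iterable of (sender, receiver) pairs.
    Keep a link only when both directions occur (bidirectional contact)
    and the total number of e-mails reaches min_exchanges; that total
    is used as the link strength."""
    directed = Counter(emails)
    links = {}
    for (a, b), n_ab in directed.items():
        n_ba = directed.get((b, a), 0)
        if n_ba == 0:                       # not bidirectional -> no link
            continue
        links[tuple(sorted((a, b)))] = n_ab + n_ba
    return {pair: n for pair, n in links.items() if n >= min_exchanges}

emails = [("ann", "bo")] * 3 + [("bo", "ann")] * 2 + [("ann", "cy")] * 6
print(build_email_graph(emails))            # {('ann', 'bo'): 5}
```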
McNett and Voelker [12] collected movement trace data from approximately
275 mobile users who were students at the University of California in San Diego
for an 11-week period in 2002. Each mobile device they used was equipped
with a Symbol Wireless Networker 802.11b Compact Flash card. They identified
users according to their registered wireless card MAC address, and assumed
that there was a fixed one-to-one mapping between users and wireless cards.
In our case, we produced a relational graph in which people, locations, and days (weekday/weekend) were mixed as nodes, and we assumed that the strength of the
We also used a dataset of technical reports published by the Institute of
Electronics, Information, and Communication Engineers (IEICE) in Japan [13],
where each report included four kinds of objects on authors, technical commit-
tees, workshop locations, and keywords. We obtained a full-mesh partial graph
from each technical report, in which objects were represented as nodes and connected with one another as in Fig. 3. Then, all the partial graphs generated from the technical reports could be integrated into one big relational graph, because partial graphs could be connected via the common nodes between them. The link strength between two nodes was determined by how many times the link appeared when the partial graphs were integrated. We used the datasets from the technical reports from 2008 to 2010 of the Communications Society of IEICE.
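A hedged sketch of this integration step is given below: every report yields a full-mesh partial graph over its objects, and the partial graphs are merged so that the link strength equals the number of reports in which a link appears. The report representation used here is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations

def integrate_reports(reports):
    """reports: list of object lists (authors, committee, location, keywords).
    Each report contributes a full-mesh partial graph; link strength equals
    the number of partial graphs in which the link appears."""
    strength = Counter()
    for objects in reports:
        for a, b in combinations(sorted(set(objects)), 2):
            strength[(a, b)] += 1
    return strength

reports = [
    ["Author1", "CommitteeA", "Tokyo", "keyword1"],
    ["Author1", "CommitteeA", "Kyoto", "keyword2"],
]
graph = integrate_reports(reports)
print(graph[("Author1", "CommitteeA")])     # 2: link appeared in both reports
```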
Fig. 4. Degree distribution of the relational graph produced from the Enron dataset, which is approximated by y = 10^{3.25} x^{-1.54}, where the correlation factor is -0.944
Fig. 5. Degree distribution of the relational graph produced from the UCSD dataset, which is approximated by y = 10^{1.37} x^{-0.54} after the top 20% of nodes with the largest degree are removed, where the correlation factor is -0.679
Fig. 6. Degree distribution of the relational graph produced from the IEICE dataset, which is approximated by y = 10^{4.02} x^{-1.64} after the top 10% of nodes with the largest/smallest degree are removed, where the correlation factor is -0.894
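The fits reported in Figs. 4-6 have the form y = 10^a x^b. The sketch below shows how such a fit and its correlation factor can be obtained by linear regression in log-log space with NumPy; the degree counts used here are synthetic, not the actual dataset values.

```python
import numpy as np

# Synthetic degree distribution: counts[i] nodes have degree degrees[i].
degrees = np.array([1, 2, 3, 4, 5, 8, 10, 20])
counts = np.array([1800, 620, 310, 190, 130, 60, 40, 12])

# Least-squares fit of log10(y) = a + b*log10(x), i.e. y = 10^a * x^b.
b, a = np.polyfit(np.log10(degrees), np.log10(counts), 1)
corr = np.corrcoef(np.log10(degrees), np.log10(counts))[0, 1]
print(f"y = 10^{a:.2f} * x^({b:.2f}), correlation factor = {corr:.3f}")
```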
5.1 Objective
Much research has focused on how to achieve load balancing to deal with the problem of increasing traffic on networks. Open Shortest Path First (OSPF) [21] is a common routing protocol used in intra-domain Internet settings. In this protocol, the network operator assigns a metric to each physical link. In this section, 'physical routers' and 'physical links' mean the routers in a computer network and the links between those routers; we use 'link' or 'node' alone only when referring to a link or a node in the relational graph. As recommended by Cisco [22], the metric is often set inversely proportional to the bandwidth of each physical link. In such cases, the metric of physical links with high bandwidth decreases, so they are frequently chosen as part of the shortest path, which is then likely to be overloaded. Fortz and Thorup [25] suggested a way of optimizing the metrics used in OSPF to prevent physical links from such overloading. SocialCast uses relational metrics as the metric for physical links. A relational metric is produced from a relational graph and represents the degree to which two or more objects are related. For example, if user A frequently downloads a certain type of content, e.g., content X, the relational metric between user A and content X should be large. As another example, if user A and user B share private content Y, the relational metric between user A, user B, and content Y should be large. Since relational metrics differ depending on the content to be distributed, the physically shortest path is not always chosen as the distribution path. Load can thus be balanced without special techniques by using relational metrics, similarly to the demand projection approach in [25].
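To contrast the two metric styles, the sketch below places a conventional inverse-bandwidth cost, in the spirit of the Cisco recommendation, next to a content-dependent relational metric; the reference bandwidth and the lookup-table form of f(i, j, c) are assumptions made for illustration.

```python
def ospf_cost(link_bandwidth_bps, reference_bps=1e8):
    """Conventional metric: inversely proportional to link bandwidth.
    The reference bandwidth value is an assumption for illustration."""
    return max(1, int(reference_bps / link_bandwidth_bps))

# Content-dependent relational metric f(i, j, c), here backed by a toy table.
RELATIONAL = {("routerA", "routerB", "contentX"): 8.0}

def relational_metric(i, j, content):
    """Return the strength of the relation between routers i, j and the
    content to be distributed; zero means no relation."""
    return RELATIONAL.get((i, j, content), 0.0)

print(ospf_cost(10e6))                                      # slow link -> cost 10
print(ospf_cost(1e9))                                       # fast link -> cost 1
print(relational_metric("routerA", "routerB", "contentX"))  # 8.0
print(relational_metric("routerA", "routerB", "contentY"))  # 0.0
```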
Reducing retrieval latency is also an important issue. Caching can be one ap-
proach for achieving this. Popularity is frequently used as a metric to determine
which content is to be cached; the more popular content is, the more frequently
it will be cached. Naturally, popular content has a high likelihood of being re-
quested by many users, but that does not mean that retrieval latency for all
kinds of content will be reduced. If popular content is preferentially cached, less
popular content is less frequently cached. As the number of requests for content
on the Web is known to follow a Zipf-like or power-law distribution [26], there is
a considerable amount of unpopular content, and popularity-based caching can-
not handle this. Coordinated caching has been proposed [27] to use distributed
caches more efficiently; neighboring routers should not cache the same content
and even low-popularity content should be cached. However, the main drawback
of coordinated caching is its complexity. Relational metrics represent the relationships between content and users; therefore, the metric used for cache management differs depending on the distributed content and on the users who cache the content. The caches of the content can thus be effectively distributed without any complex mechanisms.
SocialCast also effectively ensures privacy and security. Considering relational
metrics and calculating a path that has a large metric value means that dis-
tributed content will only be shared with those who have some relationship with
the content. Calculating a distribution path on the lower layer also has an ad-
vantage because the range of disclosures will be physically limited.
the content and determines the distribution path using the relational metrics
provided by the upper layer.
The second function is to manage content, which includes content discovery and cache replacement. The network controller stores the physical location of all content in the network, i.e., which content exists in the cache of each router. When published content is delivered, the caches of the routers on the distribution route may be replaced. The updated cache status must be made known to the network controller for the next content discovery.
The third function is to direct a specific router to redistribute the content in its cache. In essence, the router acts like a cache server in a content delivery network (CDN) [24]. After the distribution path is established, the controller checks whether any router on the path has the content in its cache. If some router has the content to be distributed in its cache, the content is distributed from that router instead of from the original source of the content.
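A minimal sketch of this third controller function might look as follows, assuming the controller keeps a simple mapping from routers to the names of the content they cache; all names here are hypothetical.

```python
def select_source(path, content, cache_index, original_source):
    """cache_index: router -> set of content names cached at that router.
    Return the first router on the distribution path that already caches
    the content; otherwise fall back to the original publisher."""
    for router in path:
        if content in cache_index.get(router, set()):
            return router
    return original_source

path = ["r1", "r2", "r3", "r4"]
cache_index = {"r3": {"videoX"}, "r4": set()}
print(select_source(path, "videoX", cache_index, "publisher"))  # -> 'r3'
```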
m(e_{ij}) = \begin{cases} f(i,j,c) & \text{if } e_{ij} \in E(G) \\ 0 & \text{otherwise,} \end{cases} \qquad (1)

where m(e_{ij}) is the relational metric value assigned to directed edge e_{ij}, i represents the tail and j the head of edge e_{ij}, E(G) is the set of all edges in graph G, c is the content that will be distributed, and f() is the function that generates the relational metric value. This equation means that m(e_{ij}) is used as the metric to forward content c from router i to the next router, j.
First, assign the reciprocal of the relational metric of the physical link as the distance of the physical link if the physical link exists in the graph, and +∞ otherwise. This means the distance of edge e_{ij} is calculated by

d(e_{ij}) = \begin{cases} 1/f(i,j,c) & \text{if } e_{ij} \in E(G) \\ +\infty & \text{otherwise,} \end{cases} \qquad (2)

where d(e_{ij}) represents the distance of a directed edge from tail router i to head router j.
Second, calculate the shortest path to the destination router u. The shortest path p_u is built using the algorithm described in Problem 2 of Dijkstra [29].
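The sketch below puts Eqs. (1) and (2) together: edge distances are set to 1/f(i, j, c) for edges present in the relational graph, and Dijkstra's algorithm (here via networkx) then selects the path most strongly related to the content. The toy relational metric is invented for the example.

```python
import networkx as nx

def build_distance_graph(physical_links, f, content):
    """Apply Eq. (2): distance 1/f(i, j, c) when the link is in the
    relational graph (f > 0); links with f = 0 are simply omitted,
    which plays the role of the +infinity distance."""
    G = nx.DiGraph()
    for i, j in physical_links:
        metric = f(i, j, content)
        if metric > 0:
            G.add_edge(i, j, distance=1.0 / metric)
    return G

# Toy relational metric: the detour r1 -> r3 -> r4 is strongly related
# to the content, while the route r1 -> r2 -> r4 is not.
strengths = {("r1", "r2"): 1, ("r2", "r4"): 1,
             ("r1", "r3"): 5, ("r3", "r4"): 5}
f = lambda i, j, c: strengths.get((i, j), 0)

G = build_distance_graph(strengths.keys(), f, "contentX")
print(nx.dijkstra_path(G, "r1", "r4", weight="distance"))  # ['r1', 'r3', 'r4']
```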
C_i^{t+1} = C_i^t \cup \{c\} \qquad (3)

If the cache of a router is about to exceed its capacity when caching content, cache replacement will occur. This replacement is done by solving the optimization problem written as

\max_{c \in C_i^{t+1}} \; \sum_{j \in C_i^{t+1} \setminus \{c\}} v(j) \qquad (4)
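Under one reading of Eqs. (3) and (4), inserting the new content and then evicting the item whose removal keeps the total remaining value maximal is equivalent to evicting the item with the smallest value v(j). The sketch below implements that reading; v is supplied as an arbitrary value function, since its definition is given elsewhere in the chapter.

```python
def insert_and_replace(cache, new_content, capacity, value):
    """Eq. (3): add the distributed content to the cache set.
    Eq. (4) (one reading): if capacity is exceeded, evict the item whose
    removal maximizes the summed value of what remains, i.e. the item
    with the smallest value v(j)."""
    cache = set(cache) | {new_content}
    if len(cache) > capacity:
        cache.discard(min(cache, key=value))
    return cache

# 'v' is whatever value function the chapter defines; a toy table is used here.
v = {"a": 3.0, "b": 1.0, "c": 2.5}.get
print(insert_and_replace({"a", "b"}, "c", capacity=2, value=v))  # {'a', 'c'}
```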
1. The relational graph was created. Each link between objects was calculated using the method described in Subsection 4.1.
2. The physical network was created. Each router had connectivity with the routers within the range of radius r. All users were randomly assigned to routers in the physical network.
3. After the physical network was created, distribution was processed until the caches of all routers were filled. The cache replacement method followed Eqs. (3) and (5).
4. One of the users was randomly selected in time slot t as a requester of content. The content to be distributed was chosen from content that the requester was directly interested in; the chosen content was regarded as the content that the requester asked for in the simulation. The probability of each content being chosen was based on the relational metric between that content and the requester.
5. The distribution path was constructed based on the proposed method described in Subsection 5.3 or the conventional method, which will be described in Subsection 5.4. The content was then distributed along the path to the routers that the destination users were assigned to. During the distribution, all routers placed and replaced their own caches according to Eqs. (3) and (5). We recorded which physical links were used in each distribution, the average latency until targeted users retrieved the content, and the relational metrics of the physical links on the distribution path.
6. We incremented time slot t and repeated steps 3 and 4 until t reached threshold t_stop.
physical link. The physical link distance was normalized to a range from 0.0 to 1.0. Other simulation parameters are listed in Table 2. The bandwidth of all physical links of the physical network and the size of all content to be distributed in the simulation were set to the constants B and C_s, respectively. Also, all routers had the same constant cache capacity C_i in the simulation. Threshold t_stop was the time-slot threshold at which the simulation stopped.
Minimum physical distance path distribution. The distance d(e_{ij}) of every edge e_{ij} ∈ E(G) was set to 1.0, which means the length of the distribution path was regarded as the hop count of the path. Apart from this, the steps to construct the distribution path were the same as those described in Subsection 5.3.
Popularity-based cache mechanism. If the cache did not exceed capacity, the distributed content was cached according to Eq. (3). If the cache did exceed capacity, the cache was replaced by solving

\max_{c \in C_i^{t+1}} \; \sum_{j \in C_i^{t+1} \setminus \{c\}} \mathrm{popularity}(j), \qquad (5)

where C_i^{t+1} represents the cache set on router i at time slot t+1, c and j are content, and popularity(j) represents the popularity of content j. The popularity of content j was assigned in the simulation according to the number of users who directly had an interest in content j. Since the distribution of the number of users assigned to each content followed a power law, the popularity of each content also follows a power law. Therefore, we assumed that the model for users and content and the model for assigning popularity to all content were appropriate for considering a situation in which the distributed contents have power-law popularity [31].
Simulation Results. This section compares the results obtained with the conventional and proposed methods for each item when the cache size C_i of all routers in the physical network was equal to one and to three. We then discuss the possible reasons for the results.
Load balance. We counted the number of times each physical link was used for content distribution and calculated the ratio of this count to the total number of times all physical links were used, in order to evaluate load balance. We also compared the cumulative distribution function of the number of times each physical link was used between the conventional and proposed methods.
Figs. 8(a) and 8(b) plot the results for load balancing for the conventional and proposed methods when the cache sizes of the routers were C_i = 1 and C_i = 3. The proportion of physical links only used once in the distribution
[Two CDF panels (C_i = 1 and C_i = 3): cumulative distribution of how many times each link is used, for the proposed and conventional methods.]
Fig. 8. Ratio of the times each physical link was used to the sum of the times all physical links were used during the simulation trial
[Fig. 9: two CDF panels (C_i = 1 and C_i = 3) of the average retrieval latency (s) for the proposed and conventional methods.]
using the conventional method is lower than that using the proposed method.
That means many physical links are used multiple times in the distribution with
the conventional method. This is because the conventional method always uses
the physically shortest path, which does not change from content to content.
However, since the proposed method takes into consideration the relationship
between intermediate routers and the content to be distributed, it works so that
the distribution path will differ.
Average retrieval latency for users. We measured the retrieval latency for each user to evaluate the average retrieval latency. Here, retrieval latency means the time between content being published and a user obtaining it. We then calculated the average time and compared the cumulative distribution functions of the average retrieval latency.
Figs. 9(a) and 9(b) plot the results for average retrieval latencies with the conventional and proposed methods when the cache sizes of the routers were C_i = 1 and C_i = 3. Here, the proposed method outperformed the conventional method: 50% of all content distributed reached targeted users within about 3.0 s
Fig. 10. Cumulative distribution function of the average relational metric value of each physical link in the distribution path, for the proposed and conventional methods; (a) C_i = 1 and (b) C_i = 3 for all routers i
with the conventional method, while it only took about 2.0 s with the proposed method when the cache size was equal to one. A possible reason for this is that the caches of content are well distributed with the proposed method. As described in Subsection 5.4, the distribution path differs depending on content c, so different content is cached on different routers. If caches are well distributed, the probability that a router holding a cached copy of the requested content exists near the requesting routers increases, and the latency then becomes smaller because the content can be distributed from the cache. The results for a cache size equal to three are even better. This is because, if routers have a bigger cache, the probability that such a router exists nearby increases even more.
6 Conclusion
This chapter has tackled two problems we face when extracting values from
sensing data: 1) it is hard for humans to understand raw/unprocessed sensing
data and 2) it is inefficient in terms of management costs to keep all sensing data
‘usable’. This chapter also discussed a solution, i.e., the socialized system, which encodes the characteristics of sensing data in relational graphs so as to extract, from the relational graphs, the values originally contained in the sensing data. The system model, the encoding/decoding logic, and real-dataset examples were presented.
We also proposed a content distribution paradigm built on the socialized system, called SocialCast. SocialCast can achieve load balancing, low retrieval latency, and privacy-conscious delivery by distributing content using relational metrics produced from the relational graph of the socialized system. We performed simulations and presented the results to demonstrate the effectiveness of this approach.
We here list the four remaining issues that need to be addressed to commer-
cialize the socialized system:
viewpoints. The authors would also like to thank the industrial forum for Mobile Socialized System (MSSF), Japan, for their contributions from industrial viewpoints.
References
1. Aizawa, K., Tancharoen, D., Kawasaki, S., Yamasaki, T.: Efficient retrieval of life
log based on context and content. In: Proceedings of the the 1st ACM Workshop
on Continuous Archival and Retrieval of Personal Experiences (CARPE 2004),
pp. 22–31 (2004)
2. Laurila, J., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Dousse, D.O., Eberle, J.,
Miettinen, M.: The mobile data challenge: Big data for mobile computing research.
In: Proceedings of Mobile Data Challenge by Nokia Workshop (2012)
3. LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big Data, Ana-
lytics and the Path From Insights to Value. MIT Sloan, Management Review 52(2)
(2011)
4. Lymberopoulos, D., Bamis, A., Savvides, A.: Extracting spatiotemporal human
activity patterns in assisted living using a home sensor network. In: Proceedings
of the 1st International Conference on PErvasive Technologies Related to Assistive
Environments (PETRA 2008), Article No. 29 (2008)
5. Lynch, C.: Big data: How do your data grow? Nature 455, 28–29 (2008)
6. Shinkuma, R., Kasai, H., Yamaguchi, K., Mayora, O.: Relational Metric: A New
Metric for Network Service and In-network Resource Control. In: Proceedings
of IEEE Consumer Communications and Networking Conference (CCNC 2012),
Work-In-Progress session (2012)
7. Kida, A., Shinkuma, R., Takahashi, T., Yamaguchi, K., Kasai, H., Mayora, O.:
System Design for Estimating Social Relationships from Sensing Data. In: Pro-
ceedings of IEEE International Conference on Advanced Information Networking
and Applications (AINA 2013), Workshop on Data Management for Wireless and
Pervasive Communications (2013)
8. Yogo, K., Kida, A., Shinkuma, R., Kasai, H., Yamaguchi, K., Takahashi, T.: Ex-
traction of Hidden Common Interests between People Using New Social-graph
Representation. In: Proceedings of International Conference on Computer Com-
munications and Networks (ICCCN 2011), Workshop on Social Interactive Media
Networking and Applications (August 2011)
9. Borgatti, S.P.: Centrality and network flow. Social Networks 27(1), 55–71 (2005)
10. Nishio, T., Shinkuma, R., Pellegrini, F.D., Kasai, H., Yamaguchi, K., Takahashi, T.:
Trigger Detection Using Geographical Relation Graph for Social Context Aware-
ness. Mobile Networks and Applications 17(6), 831–840 (2012)
11. Shetty, J., Adibi, J.: The Enron email dataset: database schema and brief statistical report. Information Sciences Institute, vol. 4 (2004)
12. McNett, M., Voelker, G.M.: Access and mobility of wireless pda users. Technical
report, Computer Science and Engineering, UC San Diego (2004)
13. The Institute of Electronics, Information and Communication Engineers (IEICE),
Japan, http://www.ieice.org/jpn/
14. Newman, M.E.: Models of the Small World. Journal of Statistical Physics 101(3-4),
819–841 (2000)
15. Fronczak, A., Holyst, J.A., Jedynak, M., Sienkiewicz, J.: Higher order clustering
coefficients in Barabási-Albert networks. Physica A: Statistical Mechanics and its
Applications 316 (1), 688–694 (2002)
16. Barabási, A.-L., Albert, R., Jeong, H.: Mean-field theory for scale-free random
networks. Physica A: Statistical Mechanics and its Applications 272(1), 173–187
(1999)
17. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item col-
laborative filtering. IEEE Internet Computing 7(1), 76–80 (2003)
18. Pan, J., Paul, S., Jain, R.: A survey of the research on future internet architectures.
IEEE Communications Magazine 49(7), 26–36 (2011)
19. Ahlgren, B., Dannewitz, C., Imbrenda, C., Kutscher, D., Ohlman, B.: A survey
of information-centric networking. IEEE Communications Magazine 50(7), 26–36
(2012)
20. Carzaniga, A., Papalini, M., Wolf, A.L.: Content-based publish/subscribe network-
ing and information-centric networking. In: Proceedings of the ACM SIGCOMM
Workshop on Information-centric Networking (ICN 2011), pp. 56–61 (2011)
21. Moy, J.: OSPF Version 2. RFC 1247 (Draft Standard). Obsoleted by RFC 1583,
updated by RFC 1349 (July 1991)
22. Cisco, Configuring OSPF, http://www.cisco.com/en/US/docs/ios/120/np1/
configuration/guide/1cospf.html
23. Borst, S., Gupta, V., Walid, A.: Distributed Caching Algorithms for Content Distribution Networks. In: Proceedings of IEEE International Conference on Computer Communications (INFOCOM 2010), pp. 1–9 (2010)
24. Vakali, A., Pallis, G.: Content delivery networks: Status and trends. IEEE Internet
Computing 7(6), 68–74 (2003)
25. Fortz, B., Thorup, M.: Optimizing OSPF/IS-IS weights in a changing world. IEEE
Journal on Selected Areas in Communications 20(4), 756–767 (2002)
26. Breslau, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: evi-
dence and implications. In: Proceedings of 18th Annual Joint Conference of the IEEE
Computer and Communications Societies (INFOCOM 1999), vol. 1, pp. 126–134
(1999)
27. Korupolu, M., Dahlin, M.: Coordinated placement and replacement for large-scale
distributed caches. IEEE Transactions on Knowledge and Data Engineering 14(6),
1317–1329 (2002)
28. Appa, G., Kotnyek, B.: A bidirected generalization of network matrices. Net-
works 47(4), 185–198 (2006)
29. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische
Mathematik 1(1), 269–271 (1959)
30. Shinkuma, R., Jain, S., Yates, R.: In-network caching mechanisms for intermit-
tently connected mobile users. In: Proceedings of 34th IEEE Sarnoff Symposium,
pp. 1–6 (2011)
31. Adamic, L.A., Huberman, B.A.: Zipf’s law and the Internet. Glottometrics 3,
143–150 (2002)
Providing Crowd-Sourced and Real-Time Media
Services through an NDN-Based Platform
1 Introduction
Thanks to online social networks (like Facebook, Google+, LinkedIn, Myspace, Twitter, and Spotify), which have been experiencing explosive growth over the past few years, people can always stay in touch with each other and exchange messages, thoughts, photos, videos, files, and any other type of content. This growth is sustained by the widespread adoption of new-generation devices (such as notebooks, smartphones, and tablets) [1], together with emerging broadband wired and wireless technologies, and, without any doubt, it will become more and more evident in the coming years. As a final result, people continuously generate and request a massive amount of data, provided by other users worldwide.
Horizons of online social networks can be further extended, thus enhancing
users’ interaction and data discovery, by introducing crowd-sourcing approaches,
which refer to the practice of obtaining services and contents by soliciting contri-
butions from a group of people who, in most cases, form an online community [2].
When used together, online social networks, mobile devices, and crowd-sourcing
platforms have all the potential to create connected communities scattered all over
the world, thus paving the way to several novel applications. Their joint exploita-
tion, for example, can be used to capture media contents, i.e., audio and video, from a very large number of users during an event (think, for example, of a football match, a concert, a public event, a dangerous situation, and so on) and to deliver them in real time to any other user. In this way, anyone can see (and hear) what others see (and hear) with their eyes (and ears) in a specific place, while being physically far away. To the best of our knowledge, network architectures enabling this kind of social audio-video real-time service have not yet been fully standardized. Hence, novel ideas and technologies have to be promoted by the research community in order to make this very attractive application achievable as soon as possible.
In parallel, the so-called Information Centric Network (ICN) approach is
emerging to foster a transition from a host-centric to a content-centric Future
Internet [3]. The ICN approach is currently being investigated and developed in several projects, such as Data-Oriented Network Architecture (DONA) [4], Publish Subscribe Internet Technology (PURSUIT) [5], Scalable and Adaptive Internet Solutions (SAIL) [6], CONVERGENCE [7], Named Data Networking (NDN) [8], COntent Mediator architecture for content-aware nETworks (COMET), and MobilityFirst [9].
Despite some distinctive differences (e.g., content naming scheme, security-
related aspects, routing strategies, cache management), they share a common
receiver-driven data exchange model based on content names [10,3].
In the authors’ opinion, ICN represents a powerful technological basis for distributing crowd-sourced contents. Unfortunately, despite the number of works available in the literature that investigate and propose innovative techniques to deliver contents in ICN, solutions properly designed for the considered scenario have not been proposed yet.
To bridge this gap, we conceive herein a network platform based on the NDN rationale, which is able to efficiently discover and distribute crowd-sourced multimedia contents within a distributed and online social network. The devised platform is composed of (i) a community of users who, being in the same place to take part in an event, record and broadcast data and multimedia streams from their multiple points of view, (ii) a number of remote users interested in such information, (iii) a distributed Event Management System, which creates events and handles the social community, and (iv) an NDN communication infrastructure able to efficiently manage users' requests and distribute contents. Moreover, it addresses four different tasks: event announcement, event discovering, media discovering, and media delivering. In addition, to offer an optimal management of multiple and heterogeneous events, we also design a hierarchical name-space and a sliding-window scheme to efficiently download real-time media contents through NDN primitives. We evaluate the performance of our proposal through the ccnSim simulator [11]. In particular, focusing the attention
The past 10 years have witnessed the rise of Online Social Networks as the predominant form of communication over the Internet. They are software platforms, in the form of a mobile application or an Internet website, that enable social relations among people and the exchange of information between them.
Users of an online social network usually maintain a list of first-degree contacts (i.e., friends and family, or direct colleagues), with whom they carry out direct interactions (i.e., the exchange of contents). The type of contents being published depends on the scope of the social network. Accordingly, we can have:
The fields of the Interest packet1 are the following:
– Content Name: specifies the requested item. It is formed by several components that identify a subtree in the name space.
– Min Suffix Components: enables the access to a specific collection of elements according to the prefix stored in the Content Name field.
– Max Suffix Components: enables the access to a specific collection of elements according to the prefix stored in the Content Name field.
– Publisher Public Key Digest: imposes that only a specific user can answer the considered Interest.
– Exclude: defines a set of components, forming the content name, that should not appear in the response to the Interest.
– Child Selector: in the presence of multiple answers, it expresses a preference for which of these should be returned.
– Answer Origin Kind: several bits that alter the usual response to an Interest (i.e., the answer can be "stale" or the answer can be generated).
– Scope: limits where the Interest may be propagated.
– Interest Lifetime: indicates, approximately, the time interval after which the Interest will be considered deprecated.
– Nonce: a randomly generated byte string used for detecting duplicates.
A user may ask for a content by issuing an Interest, which is routed toward the nodes in possession of the required information (e.g., the permanent repository, namely the publisher, or any other node that contains a valid copy in its cache), thus triggering them to reply with Data packets. Routing operations are executed by the strategy layer only for Interest packets, whereas Data messages just follow the reverse path toward the requesting user, allowing every intermediate node to cache the forwarded content.
1 The table has been taken from the official software implementation of the NDN communication protocol suite, i.e., CCNx, available at http://www.ccnx.org.
The fields of the Data packet are the following (a small sketch of both packet types, as Python data classes, follows this list):
– Content Name: identifies the requested item.
– Signature: guarantees the publisher authentication.
– Publisher Public Key Digest: identifies the user that generated the data.
– Time Stamp: expresses the generation time of the content.
– Type: specifies what the Data packet contains (i.e., data, encrypted data, public key, link, or NACK with no content).
– Freshness Seconds: defines the lifetime of the carried payload. It is used to schedule caching operations (which may even be prohibited).
– Final Block ID: indicates the identifier of the final block in a sequence of fragments.
– Key Locator: specifies where to find the key to verify this content.
– Content: the content itself, with an arbitrary length.
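To make these message formats more tangible, the fields of the two lists above can be modeled as plain data classes. The sketch below is only an illustration (field types, default values, and naming are our own assumptions and do not reproduce the actual CCNx wire format):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interest:
    """Request for a named content (fields follow the Interest list above)."""
    content_name: List[str]                               # components identifying a subtree in the name space
    min_suffix_components: Optional[int] = None
    max_suffix_components: Optional[int] = None
    publisher_public_key_digest: Optional[bytes] = None   # restricts who may answer
    exclude: List[str] = field(default_factory=list)      # components excluded from answers
    child_selector: Optional[int] = None                  # preference among multiple matching answers
    answer_origin_kind: Optional[int] = None              # bits altering the normal answer behavior
    scope: Optional[int] = None                           # limits Interest propagation
    interest_lifetime: float = 4.0                        # seconds before the Interest is deprecated
    nonce: bytes = b""                                    # random string used to detect duplicates

@dataclass
class Data:
    """Named content returned in answer to an Interest."""
    content_name: List[str]
    signature: bytes                                      # guarantees publisher authentication
    publisher_public_key_digest: bytes                    # identifies the user that generated the data
    timestamp: float                                      # generation time of the content
    type: str = "data"                                    # data, encrypted data, public key, link, or NACK
    freshness_seconds: Optional[int] = None               # payload lifetime, used to schedule caching
    final_block_id: Optional[int] = None                  # identifier of the last fragment
    key_locator: Optional[str] = None                     # where to find the verification key
    content: bytes = b""                                  # payload of arbitrary length
```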
To accomplish these activities, each NDN node exploits three main data structures: (i) the Content Store (CS), which is a cache memory, (ii) the Forwarding Information Base (FIB), containing the list of faces through which Interest packets asking for specific contents are forwarded, and (iii) the Pending Interest Table (PIT), which is used to keep track of the Interest packets that have been forwarded upstream toward content sources, together with their respective arrival faces, thus allowing the proper delivery of the Data packets sent back in response to Interests.
Basically, when an Interest packet arrives at an NDN node, the CS is first checked to discover whether the requested data item is already available. If so, the node may generate an answer (i.e., a Data packet) and send it immediately back to the requesting user. Otherwise, the PIT is consulted to check whether other Interest packets requesting the same content have already been forwarded toward potential sources of the required data. In this case, the Interest's arrival face is simply added to the existing PIT entry. Else, the FIB is examined to search for a matching entry indicating the list of faces through which the Interest should be forwarded. Finally, if no FIB entry is found, the Interest is discarded. On the other hand, when a Data packet is received, the PIT comes into play: it keeps track of all previously forwarded Interest packets and allows the establishment of a backward path to the nodes that requested the data.
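The forwarding behavior just described can be summarized by the following simplified sketch. It is a minimal, illustrative model of the CS/PIT/FIB lookups (class and method names are our own, and details such as selectors, nonces, and timers are deliberately ignored):

```python
from collections import defaultdict

class NdnNode:
    """Toy model of the Interest/Data processing pipeline of an NDN node."""

    def __init__(self):
        self.cs = {}                      # Content Store: name -> Data packet
        self.pit = defaultdict(set)       # Pending Interest Table: name -> set of arrival faces
        self.fib = {}                     # Forwarding Information Base: prefix -> list of faces

    def on_interest(self, name, arrival_face):
        # 1) If a matching Data item is cached, answer immediately.
        if name in self.cs:
            self.send_data(name, self.cs[name], [arrival_face])
            return
        # 2) If the same content has already been requested, just record the new face.
        if name in self.pit and self.pit[name]:
            self.pit[name].add(arrival_face)
            return
        # 3) Otherwise, look up the FIB and forward the Interest upstream.
        faces = self.fib.get(self.longest_prefix(name))
        if not faces:
            return                        # no route: the Interest is discarded
        self.pit[name].add(arrival_face)
        self.forward_interest(name, faces)

    def on_data(self, name, data):
        # Deliver the Data packet back over every face recorded in the PIT,
        # cache it, and remove the satisfied PIT entry.
        faces = self.pit.pop(name, set())
        if faces:
            self.cs[name] = data          # caching policy (LRU, FIFO, ...) omitted
            self.send_data(name, data, faces)

    # Placeholders standing in for the real routing and link layers.
    def longest_prefix(self, name):
        return name

    def forward_interest(self, name, faces):
        pass

    def send_data(self, name, data, faces):
        pass
```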
Recently, some research activities have been focusing on the management of multimedia applications in NDN [44, 45]. The most interesting proposals related to real-time applications are presented in the sequel:
forwarded to the remote user (i.e., the callee). The Content Name of the first Interest packet is built by appending to the aforementioned namespace the content, optionally encrypted, of the SIP INVITE message. The callee will answer this request by generating a Data packet containing the SIP response. From this moment on, the exchange of media contents is done using the RTP protocol. Each media chunk is assigned a unique sequence number and, in order to fetch voice packets quickly, the caller can generate and send multiple Interest packets at the same time. Every time a new Data packet is received, a new Interest is released, thereby restoring the total number of pending Interests.
– Audio Conference Tool (ACT) [47]. It is a more complex architecture enabling audio conference services, which exploits the named data approach to discover ongoing conferences, as well as speakers in each conference, and to fetch voice data from individual speakers. A specific namespace and an algorithm to create names have been tailored to support all of these tasks. Before joining a conference, a user issues specific requests to collect information about the list of ongoing conferences and to learn the list of speakers in a conference (i.e., the group of participants that produce voice data from which Data packets are fetched). Similarly to VoCCN [46], each Data packet is identified by a unique segment number and a user can request multiple Data packets at the same time. Hence, whenever a new Data packet comes back, the user will issue a new Interest.
– The MERTS platform [48]. It has been designed to handle real-time and non-real-time flows at the same time in an NDN architecture. To this end, a new field in the Interest message, namely the Type of Service (TOS), has been introduced to differentiate real-time from non-real-time traffic, thus allowing each NDN node to classify the type of service to which each packet belongs. In addition, a flexible transport mode selection scheme has been devised to adapt the behavior of the NDN node according to the TOS associated with a specific packet. In order to serve real-time applications, the one-request-n-packets strategy is proposed, according to which the user issues a Special Interest (SI) asking for n consecutive Data packets. Such a request can be satisfied by multiple nodes inside the network that may store in their repository/cache one, several, or all of the requested chunks. When all the n chunks have been received by the user, a new SI is generated. To ensure that the end-to-end route is not deleted after the reception of only a subset of the chunks requested with the SI, the normal functionality of the PIT is modified by imposing that the SI can be erased only after the expiration of its lifetime. Finally, with the aim of optimizing the memory utilization of the cache and improving the performance of non-real-time services, the caching of real-time contents is completely disabled.
– Adaptive retransmission scheme for real-time video streaming [49]. This work proposes a novel and efficient retransmission scheme for Interest packets, which has been conceived to reduce video packet losses. In particular, the retransmission of requests that have not yet been satisfied after a given timeout is introduced to offer a minimum level of reliability in NDN. On the one hand, a
lifetime estimation algorithm captures the RTT variation caused by in-network caching and dynamically evaluates the value of the retransmission timeout. On the other hand, an Explicit Congestion Notification (ECN) field is added to the Data packet for signaling ongoing network congestion to the end user. This information is used by the retransmission control scheme to differentiate channel errors from network congestion episodes. Hence, based on the cause of packet losses, this algorithm adaptively adjusts the retransmission window size, i.e., the total number of Interests that can be retransmitted by the client.
– Time-based Interest protocol [50]. This proposal tries to avoid the waste of uplink bandwidth due to the generation of multiple simultaneous Interests. To reach this goal, it introduces a new Interest packet, which is sent by the user in order to ask for a group of contents generated by the publisher during a specific time interval. During such a time interval, all chunks generated by the remote server or transferred from other nodes can be delivered to the user. This novel scheme requires a modification of the normal behavior of an NDN router: the Interest packet should not be deleted from the PIT until its lifetime expires.
– The NDNVideo architecture [51]. It has been designed and implemented on top of CCNx, in order to offer both real-time and non-real-time video streaming services. A first important issue addressed by the NDNVideo project is the design of a namespace that enables the publisher to uniquely identify every chunk of the multimedia content and allows the consumer to easily seek a specific place in the stream. In particular, the Content Name is built in order to provide information about the video content, the encoding algorithm, and the sequence number associated with a given chunk. Moreover, to facilitate the seeking procedure, the user can specify in its first request a timecode that will be used by the server to select the most suitable Data packet within the video stream. Then, after the reception of the first Data packet, the user will ask for video data using consecutive segment numbers. To support real-time streaming services, the client may issue multiple Interest packets at the same time. However, to avoid fetching data too quickly and requesting segments that do not yet exist, the client estimates the generation rate of Interest packets by knowing the time at which the previous Data packet was generated by the publisher (this information is stored in a specific field of the Data packet). If data are not received fast enough to play back the video at the correct rate, the client may skip to the most recent segment, thus continuing to watch the video from there instead of pausing the playback. Finally, a low-pass filter similar to the one defined in the TCP protocol is adopted to adjust the retransmission timeout based on previous RTT values.
flows) within an online social community. The scenario considered in our work is composed of a group of users who capture and transmit multimedia contents to a second group of consumers distributed worldwide. It covers a number of significant use-cases, such as:
– real-time broadcasting of social events: users participating in a given event capture media contents from different points of view and share such media streams with remote clients;
– virtual tourism: users visiting monuments (or other kinds of tourist attractions) capture media contents and share them with other users;
– real-time broadcasting of activities and environmental conditions during dangerous situations.
In such a scenario, multimedia contents are not provided by a media server, but they are shared by a (potentially large) number of users. This scheme fully reflects the crowd-sourcing paradigm presented in Sec. 2.
In our opinion, the design of a network architecture able to efficiently distribute crowd-sourced multimedia contents within a social community spread around the world raises a number of issues and challenges that need to be carefully investigated.
the set of available multimedia contents (i.e., those captured by users of the social community who are participating in a given event). According to the NDN paradigm, this challenge has to be addressed through a completely host-less protocol, which should allow any user to download a media stream for a given event by adopting NDN primitives.
– Management of content requests. According to the NDN paradigm, audio and video contents should be divided into a list of consecutive chunks that are requested by the client during the service execution. Unlike Video-On-Demand, the distribution of a real-time stream has to deal with a specific class of problems to ensure the timely delivery of an ordered stream of chunks. In detail, video and audio chunks have to be received in playing order and within a given time interval (the playout delay) before they are actually played, after which they "expire". A chunk not delivered before its expiration results in a degradation of the rendered video, impacting the end-user QoE. To address these challenges, client nodes implement a receiving buffer queue, where the chunks are stored in order and which is emptied while the video is being played. Therefore, any chunk not received before its playing instant becomes useless. To reduce the chance of chunk loss, an efficient mechanism that controls the retransmission of the user's requests (e.g., requests for chunks close to expiration and not yet received) should be conceived.
The main goal of the service architecture described in this chapter is to discover, organize, store, process, and deliver media contents, which are captured according to the crowd-sourcing paradigm, within a social community spread around the world. Such an architecture has been conceived as an extension of our previous contributions discussed in [56] and [57].
As shown in Fig. 1, the service platform is composed of:
1. a community of users who, being in the same place to take part in an event (i.e., a concert, a football match, sightseeing, a city accident, and so on), record and broadcast it from their multiple points of view;
2. a number of users interested in seeing what the aforementioned social community is capturing;
3. a distributed Event Management System, which creates events and handles the social community;
4. an NDN-based communication infrastructure able to efficiently manage users' requests and distribute multimedia contents.
Fig. 1. The conceived service platform: the server of the Event Management System, the NDN network architecture, and the community of users that broadcast the event
In line with the theoretical suggestions presented in [46], the name space we conceive herein is based on both contractable names and on-demand publishing concepts. The contractable names approach identifies the ability of a user to construct the name of a desired content through specific and well-known algorithms (e.g., the user knows the structure of the name tree and the values that each field of the Content Name may assume). The on-demand publishing criterion, instead, defines the possibility of requesting a content that has not yet been published, but that has to be created in answer to the received request (this may occur very often during a real-time communication). The name tree structure adopted in our platform is:
/domain/ndn streaming/activity/details
Similarly to [58], starting from the root of the tree, which is identified with "/", we introduce the domain field in order to explicitly indicate the domain in which the service is offered. The second field adopts the keyword ndn streaming to make the considered service explicit (i.e., the streaming of a video content over the conceived NDN-based platform). The activity field specifies the task to which the Interest or Data packet belongs. As anticipated before, it may assume four different values, i.e., event announcement, event discovering, media discovering, and media delivering. Finally, the last field, i.e., details, is used for appending to the Content Name specific values that can be exchanged among nodes during the service execution.
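As a small illustration of how contractable names can be assembled by a client, the helper below builds Content Names following the /domain/ndn streaming/activity/details structure. The underscore spelling of the keyword, the separator handling, the function name, and the example values are assumptions made for this sketch only:

```python
ACTIVITIES = ("event_announcement", "event_discovering",
              "media_discovering", "media_delivering")

def build_name(domain, activity, *details):
    """Build a Content Name of the form /domain/ndn_streaming/activity/details."""
    if activity not in ACTIVITIES:
        raise ValueError(f"unknown activity: {activity}")
    components = [domain, "ndn_streaming", activity, *map(str, details)]
    return "/" + "/".join(components)

# Hypothetical example: registration of a user providing media for an event.
name = build_name("example.org", "event_announcement",
                  "registration", "concert_2014", "45.06N_7.66E", "720p_h264")
print(name)
# /example.org/ndn_streaming/event_announcement/registration/concert_2014/45.06N_7.66E/720p_h264
```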
A user who wants to provide multimedia contents for a specific event should register itself and announce its media capabilities to the Event Management System. To this end, it first sends an Interest packet asking for the list of events active within a given location (i.e., the geographical area in which the user is located). The Content Name of this request is set to:
This message will reach the closest node (i.e., a server of the Event Management System or an NDN router storing this data within its cache) able to provide this information, which will answer with the corresponding Data packet.
Then, the user will generate a new Interest packet, whose Content Name will contain, in the details field, its position and its media capabilities, as reported in the sequel:
/domain/ndn streaming/event announcement/
registration/event/position/media capabilities
This message will be processed by the Event Management System, which will update the Event Data Base and answer with a Data packet of confirmation.
This message will be routed toward the first node in possession of this information, which will respond with a Data packet containing the requested information.
Once the user has identified the event of its interest, it will retrieve the list of available multimedia contents, together with their characteristics, by sending an Interest packet with the Content Name equal to:
To ensure that this request is handled by a node of the Event Management System, which is the only device storing updated information, the PublisherPublicKeyDigest and AnswerOriginKind fields of the Interest packet contain the hash of the public key of the Event Management System and the numerical value 3, respectively. The corresponding Data packet, generated in answer to this request, will allow the user to select its preferred content among those available. From this moment on, the user can start fetching the multimedia content from a specific source.
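The fragment below sketches how such a media discovery request could be assembled, with the PublisherPublicKeyDigest carrying the digest of the Event Management System public key and AnswerOriginKind set to 3, as described above. The Content Name used here is a hypothetical placeholder, since the exact name of this request is not reproduced in the text:

```python
import hashlib

def make_media_discovery_interest(ems_public_key: bytes, event_name: str) -> dict:
    """Build an Interest (as a plain dict) that only the EMS is allowed to satisfy."""
    return {
        # Placeholder name: the real Content Name follows the platform's name tree.
        "content_name": f"/domain/ndn_streaming/media_discovering/{event_name}",
        # Digest of the EMS public key: only this publisher may answer.
        "publisher_public_key_digest": hashlib.sha256(ems_public_key).digest(),
        # Value 3, as specified above, alters the normal answer behavior.
        "answer_origin_kind": 3,
    }

interest = make_media_discovery_interest(b"<EMS public key bytes>", "concert_2014")
```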
With the aim of designing as flexible a platform as possible, we also assume that the user can change the source of the media stream over time. To support this feature, it is necessary to periodically send the aforementioned Interest packet, thus continuously updating the information about the available multimedia contents.
The media delivering process consists of a channel bootstrap phase, a flow control strategy, and an efficient mechanism for retransmitting Interest packets. It is performed after the user has selected the event of its interest, i.e., the event name, and the media source, identified through its social nickname (i.e., source nick name), from which to fetch media data. To enable these functionalities, we extend the basic structure of the Interest packet by introducing an additional Status field marking whether the Interest is related to the channel bootstrap phase or to a retransmission.
As detailed in [56] and [57], the bootstrap phase is in charge of retrieving the first valid chunkID of the video stream, which is the latest generated I-frame. To this end, the client sends an Interest packet in which a timecode (e.g., HH:MM:SS:FF) is appended to the Content Name (this approach has already been introduced in [51]):
/domain/ndn streaming/media delivering/
The Status field is set to BOOTSTRAP and the Nonce field is set by the client. An Interest with Status = BOOTSTRAP travels unblocked until it reaches the first good stream repository (i.e., a node that can provide a continuous real-time flow of chunks, not just cached ones). As soon as the node receives the bootstrap Data message, it can initiate the sliding window mechanism to request the subsequent chunks.
Each chunk of the video stream is identified by a unique chunkID and can be retrieved by issuing an Interest packet having the Content Name set to:
Each node has a window to store W pending chunks. We define as pending chunk a chunk whose Interest has been sent by the node, and we call the window containing the pending chunks the Pending Window. Together with the chunkID, we store in the Pending Window other information, such as the timestamp of the first request and the timestamp of the last retransmission. Whenever a new Data message is received, the algorithm described in Fig. 2 runs over the Pending Window to execute the following operations:
1. purge from the Pending Window all the chunks that have expired, i.e., that have already been played, in order to free new space in the sliding window;
2. retransmit the Interests for all chunks that have not been received within a given timeout (henceforth denoted as windowTimeout);
3. transmit, for each slot freed by received or expired chunks, the Interest for a new chunk.
Furthermore, the same operations are performed if a node does not receive any data for at least windowTimeout seconds; in this case, all the Interests for non-expired chunks in the Pending Window are retransmitted, together with those for new chunks if new slots have been freed by expired chunks.
Fig. 2 details the implemented algorithm; for the purpose of brevity and readability, the variable names have been contracted: PW is the Pending Window, W is the aforementioned system parameter indicating how many Interests a node should have ongoing, WinT is the window timeout, after which Interests in the Pending Window are resent, Int is a new Interest message, CID is a chunkID in the Pending Window, lastTx is the transmission time of the most recent Interest for a given chunkID, LC is the chunkID of the most recently requested chunk, and NNC is the number of new chunks to request after the Pending Window has been purged. Moreover, to provide further insight, we report in Fig. 3 an example of the conceived sliding window algorithm, in which we have set the value of W equal to 3.
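To make the procedure above more concrete, the following sketch re-implements the Pending Window maintenance in Python. It is a schematic rendering of the three steps listed above, not a transcription of the pseudo-code in Fig. 2; the helper names, the callback, and the time source are assumptions:

```python
import time

class PendingWindow:
    """Schematic Pending Window handler (W outstanding Interests, WinT timeout)."""

    def __init__(self, send_interest, W=10, win_timeout=1.0, playout_delay=10.0):
        self.send_interest = send_interest   # callback issuing an Interest for a chunkID
        self.W = W                           # maximum number of pending chunks
        self.win_timeout = win_timeout       # retransmission timeout (WinT)
        self.playout_delay = playout_delay   # chunks older than this are expired
        self.pending = {}                    # chunkID -> {'first_tx': t, 'last_tx': t}
        self.last_chunk = 0                  # LC: most recently requested chunkID

    def on_data(self, chunk_id):
        now = time.time()
        self.pending.pop(chunk_id, None)     # the chunk has been received
        # 1) Purge chunks whose playout instant has already passed.
        expired = [c for c, s in self.pending.items()
                   if now - s['first_tx'] > self.playout_delay]
        for c in expired:
            del self.pending[c]
        # 2) Retransmit Interests that have been pending for more than WinT.
        for c, s in self.pending.items():
            if now - s['last_tx'] > self.win_timeout:
                self.send_interest(c, retransmitted=True)
                s['last_tx'] = now
        # 3) Fill the freed slots with Interests for new chunks.
        while len(self.pending) < self.W:
            self.last_chunk += 1
            self.pending[self.last_chunk] = {'first_tx': now, 'last_tx': now}
            self.send_interest(self.last_chunk, retransmitted=False)
```

A periodic timer invoking the same routine when no Data arrives for windowTimeout seconds would complete the behavior described in the text.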
As described in Sec. 2, NDN nodes along the routing path of an Interest will stop its propagation if they have previously routed another Interest for the same resource and the corresponding data has not been sent back yet (in that case, they will simply update their Pending Interest Table by adding the face from which this newcomer Interest originated, so as to reroute the data back recursively along the path the Interest has traversed). However, to enable retransmission
By itself, ccnSim models a complete data distribution system, with a high degree of fidelity concerning catalogs, requests and repository distributions, and network topologies. Unfortunately, in its original version, ccnSim does not support real-time video transmissions. Hence, to evaluate the performance of the proposed platform, we extended the simulator in several respects, by adding:
Fig. 3. Example of the conceived sliding window algorithm: after the window timeout, the Interests for chunkIDs 2, 3, and 4 are retransmitted with status=retransmitted; Interests are deleted from the Pending Interest Table when they are out of delay, while some Data packets (e.g., those for chunkIDs 2 and 3) may be lost
– support for links with bounded capacity and packets with a well-defined size, which was missing in ccnSim, so as to be able to estimate the NDN behavior under bandwidth constraints;
– a transmission queue for each face of each node, in order to properly manage packet transmission under the constraints imposed by the channel data rates;
– support for synthetic video traces, so as to be able to transmit and receive chunks of real videos and, consequently, to reconstruct the received video and evaluate its Peak Signal-to-Noise Ratio (PSNR);
– a cleanup mechanism for each node's PIT, to avoid long-term stale entries due to expired chunks;
– an improved logging system, so as to be able to record each node's received chunks and reconstruct the received video;
– more controls on the server side, so as to send Data packets only for those chunks that have already been generated;
– the sliding window mechanism described above, and all the related data structures;
– the forceful propagation of Interests in case of retransmission.
it is assumed that only one event is present, in which a number of users participate and produce, in real-time, video streams with different encoding characteristics. In particular, without loss of generality, in every simulation round each video content is mapped to a video stream compressed using H.264 [60] at an average coding rate randomly chosen in the range 250–2000 kbps. We also suppose that these users are connected to the network through the same access point, which is identified by one router of the aforementioned topology, randomly chosen in every run among the available ones. On the other hand, the clients of the social community, who are interested in downloading the video contents generated by the previous group of users, are connected to the remaining nodes (1 client per node). Further, the selection of a media stream has been modeled by considering that content popularity follows a Zipf distribution, which is commonly adopted for user-generated contents [61].
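For illustration purposes only, the snippet below shows one way in which such a Zipf-based content selection could be reproduced outside the simulator; the function and parameter names are ours, since ccnSim implements this step internally:

```python
import random

def zipf_choice(num_contents: int, alpha: float) -> int:
    """Pick a content index in [1, num_contents] with Zipf-distributed popularity."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_contents + 1)]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for rank, w in enumerate(weights, start=1):
        acc += w
        if r <= acc:
            return rank
    return num_contents

# Example: 67 clients choosing among 50 available streams with alpha = 1.
requests = [zipf_choice(50, alpha=1.0) for _ in range(67)]
```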
In our study, we adopted the optimal routing strategy already available within the ccnSim framework. According to it, Interest packets are routed along the shortest path toward the router to which the users generating video data are attached. On the other hand, three caching strategies have been considered in our study: no-cache, LRU, and FIFO [62]. When the well-known LRU or FIFO policies are adopted, we set the size of the cache to 10000 chunks. The no-cache policy is intended to evaluate the performance of NDN without any caching mechanism. Furthermore, a baseline scenario, in which the no-cache policy is enabled and the PIT is totally disabled (meaning that each user establishes a unicast communication with the service provider and the server has to generate a dedicated Data packet for each generated Interest), has been considered as a reference configuration.
Once a client selects the video content of its interest, it performs the bootstrap process described in the previous section and then starts sending Interest packets following the designed sliding window mechanism. The window size W has been set to 10, ensuring that the faces of the server are almost fully loaded in all considered scenarios. Also, the transmission queue length associated with each face, Q, has been set to be larger than
Q = 2 · Lc · PD.
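As a brief worked example, if Lc denotes the link capacity and PD the playout delay, a 50 Mbps link and a 10 s playout delay give Q = 2 · 50 Mbit/s · 10 s = 1 Gbit, i.e., about 125 Mbytes, which corresponds to roughly 12,500 chunks of 10 Kbytes each.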
The main simulation parameters are summarized below:
– Topology: Deutsche Telekom with 68 routers
– Link capacity: 50 Mbps and 100 Mbps
– Number of active events: 1
– Number of user-generated contents: 10, 20, 50, 100
– Number of clients: 67
– Chunk size: 10 Kbytes
– Video average bit rate: 250 kbps, 600 kbps, 1000 kbps, and 2000 kbps
– W (window size): 10
– Playout delay: 10 s and 20 s
– Window timeout: 1/10, 1/5, and 1/2 of the playout delay
– Protocol configuration: no cache, LRU, FIFO, baseline scenario
– Cache size: 10000 chunks
– Simulation time: 600 s
– Number of seeds: 15
Fig. 4. Chunk loss ratio (scenario with α = 1) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 5. Chunk loss ratio (scenario with α = 1) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
role. In the presence of live video streaming services, clients that are connected to a channel request the same chunks simultaneously. In this case, an NDN router has to handle multiple Interest messages that, even though sent by different users, are related to the same content. According to the NDN paradigm, such a node will store all of these requests in the PIT while waiting for the corresponding Data packet. As soon as the packet is received, the router will forward it to all the users that have requested the chunk in the past. According to these considerations, the use of the cache does not produce a relevant gain in network performance. Indeed, it is the PIT that helps reduce the burden at the publisher side, by preventing many Interest packets for the same chunk from being routed to the server.
Fig. 6. Chunk loss ratio (scenario with α = 1.5) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 7. Chunk loss ratio (scenario with α = 1.5) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Another important finding is related to the impact that content popularity has on the amount of requests reaching the publishers. From Figs. 8–11, in fact, it is possible to observe that the higher the α value, the lower the percentage of requests that reach the publishers. The reason is that when α increases, the probability that a user of the social community is interested in one of the most popular contents increases as well, thus amplifying the capability of the PIT to block the propagation of multiple requests asking for the same Data packet.
To conclude our study, we have computed the PSNR, which is nowadays one of the most widely used metrics for evaluating user satisfaction, i.e., the QoE, together with the interactivity level, in real-time video applications [63].
Fig. 8. Percentage of interests received by publishers (scenario with α = 1) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Results shown in Figs. 12–15, which report the PSNR computed when α is set to 1 and 1.5, respectively, are in line with those reported for the chunk loss ratio: the PSNR is higher in the cases in which the chunk loss ratio is lower. Hence, also in this case we can verify that when both the link capacities and the playout delay increase, the total amount of chunks received by end users increases, thus yielding a higher satisfaction level. In addition, while the absence of the cache leads to only a limited worsening of the PSNR, the baseline scenario always achieves the lowest performance.
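As a reference for how this metric can be obtained, the sketch below computes the PSNR of the luminance (Y) component between an original and a received frame using the standard definition; it is not taken from the evaluation scripts used in this study:

```python
import math

def psnr_y(original, received, max_value=255):
    """PSNR (dB) of the Y component between two equally sized luminance frames.

    `original` and `received` are sequences of rows of 8-bit luma samples.
    """
    num, sq_err = 0, 0.0
    for row_o, row_r in zip(original, received):
        for o, r in zip(row_o, row_r):
            sq_err += (o - r) ** 2
            num += 1
    mse = sq_err / num
    if mse == 0:
        return float("inf")                 # identical frames
    return 10 * math.log10((max_value ** 2) / mse)

# Example with two tiny 2x2 frames.
print(psnr_y([[120, 121], [119, 118]], [[121, 121], [117, 118]]))
```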
Fig. 9. Percentage of interests received by publishers (scenario with α = 1) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 10. Percentage of interests received by publishers (scenario with α = 1.5) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 11. Percentage of interests received by publishers (scenario with α = 1.5) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 12. PSNR of the Y component (scenario with α = 1) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 13. PSNR of the Y component (scenario with α = 1) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 14. PSNR of the Y component (scenario with α = 1.5) when the PD is set to 10 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
Fig. 15. PSNR of the Y component (scenario with α = 1.5) when the PD is set to 20 s and the winT is equal to (a) 1/10 PD and (b) 1/2 PD, respectively
References
1. Cisco: Cisco visual networking index: Forecast and methodology, 2012–2017. White
Paper (May 2013)
2. Chatzimilioudis, G., Konstantinidis, A., Laoudias, C., Zeinalipour-Yazti, D.:
Crowdsourcing with smartphones. IEEE Internet Computing 16(5), 36–44 (2012)
3. Matsubara, D., Egawa, T., Nishinaga, N., Kafle, V., Shin, M.K., Galis, A.: Toward
future networks: A viewpoint from ITU-T. IEEE Communication Magazine 51(3),
112–118 (2013)
4. Koponen, T., Chawla, M., Chun, B.G., Ermolinskiy, A., Kim, K.H., Shenker, S.,
Stoica, I.: A data-oriented (and beyond) network architecture. In: Proc. of the ACM
Conf. on Applications, Technologies, Architectures, and Protocols for Computer
Communications (SIGCOMM), Kyoto, Japan (2007)
5. Fotiou, N., Nikander, P., Trossen, D., Polyzos, G.C.: Developing information net-
working further: From PSIRP to PURSUIT. In: Tomkos, I., Bouras, C.J., Elli-
nas, G., Demestichas, P., Sinha, P. (eds.) Broadnets 2010. LNCS, SITE, vol. 66,
pp. 1–13. Springer, Heidelberg (2012)
6. Dannewitz, C., Kutscher, D., Ohlman, B., Farrell, S., Ahlgren, B., Karl, H.: Net-
work of Information (NetInf), An information-centric networking architecture.
Computer Communications 36(7), 721–735 (2013)
7. Melazzi, N., Salsano, S., Detti, A., Tropea, G., Chiariglione, L., Difino, A., Anadio-
tis, A., Mousas, A., Venieris, I., Patrikakis, C.: Publish/subscribe over information
centric networks: A Standardized approach in CONVERGENCE. In: Proc. of Fu-
ture Network Mobile Summit (FutureNetw), Berlin, Germany (July 2012)
8. NDN: Project website (2011), www.named-data.net/ (accessed: July 08, 2013)
9. Xylomenos, G., Ververidis, C., Siris, V., Fotiou, N., Tsilopoulos, C., Vasilakos, X.,
Katsaros, K., Polyzos, G.: A Survey of Information-Centric Networking Research.
IEEE Communications Surveys Tutorials PP(99), 1–26 (2013)
10. Bari, M., Chowdhury, S., Ahmed, R., Boutaba, R., Mathieu, B.: A survey of nam-
ing and routing in information-centric networks. IEEE Communications Maga-
zine 50(12), 44–53 (2012)
11. Rossini, G., Rossi, D.: Large scale simulation of ccn networks. In: Algotel (2012)
12. Myspace, http://www.myspace.com (accessed: January 7, 2014)
13. Facebook, http://www.facebook.com/ (accessed: January 7, 2014)
14. Sina Weibo, http://www.weibo.com/ (accessed: January 7, 2014)
15. Orkut, http://www.orkut.com/ (accessed: January 7, 2014)
16. Hi5, http://www.hi5.com/ (accessed: January 7, 2014)
17. NK, http://www.nk.pl/ (accessed: January 7, 2014)
18. VKontakte, http://www.vk.com/ (accessed: January 7, 2014)
19. Twitter, http://www.twitter.com/ (accessed: January 7, 2014)
20. Thumblr, http://www.thumblr.com/ (accessed: January 7, 2014)
21. Identi.ca, http://www.identi.ca/ (accessed: January 7, 2014)
22. Vine, http://www.vine.com/ (accessed: January 7, 2014)
23. Flickr, http://www.flickr.com (accessed: January 7, 2014)
24. Spotify, https://www.spotify.com/it (accessed: January 7, 2014)
25. Linkedin, http://www.linkedin.com (accessed: January 7, 2014)
26. Ceballos, M.R., Gorricho, J.L.: P2P file sharing analysis for a better performance.
In: Proc. of ACM International Conference on Software Engineering (2006)
27. Liu, B., Cui, Y., Lu, Y., Xue, Y.: Locality-awareness in bittorrent-like P2P appli-
cations. IEEE Transactions on Multimedia 3(11) (April 2009)
28. Li, J.: Peer-to-Peer multimedia applications. In: Proc. of ACM International Con-
ference on Multimedia (2006)
29. Liu, J., Rao, S.G., Li, B., Zhang, H.: Opportunities and Challenges of Peer-to-Peer
Internet Video Broadcast. In: Proc. of IEEE, Special Issue on Recent Advances in
Distributed Multimedia Communications (2008)
30. Xiao, X., Shi, Y., Gao, Y.: On Optimal Scheduling for Layered Video Streaming in
Heterogeneous Peer-to-Peer Networks. In: Proc. of ACM International Conference
on Multimedia (2008)
31. da Silva, A., Leonardi, E., Mellia, M., Meo, M.: A Bandwidth-Aware Scheduling
Strategy for P2P-TV Systems. In: Proc. of IEEE International Conference on Peer-
to-Peer Computing, pp. 279–288 (2008)
32. Ciullo, D., Garcia, M.A., Horvath, A., Leonardi, E., Mellia, M., Rossi, D., Telek, M.,
Veglia, P.: Network awareness of P2P live streaming applications: a measurement
study. IEEE Transaction on Multimedia (12) (2010)
33. Jacobson, V., Smetters, D.K., Thornton, J.D., Plass, M.F., Briggs, N.H., Braynard,
R.L.: Networking named content. In: ACM CoNEXT 2009 (2009)
34. Zhang, L., Estrin, D., Burke, J., Jacobson, V., Thornot, J., Smatters, D., Zhang,
B., Tsudik, G., Krioukov, D., Massey, D., Papadopulos, C., Abdelzaher, T., Wang,
L., Crowley, P., Yeh, E.: Named data networking (NDN) project. PARC Technical
Report TR-2010-02 (October 2010)
35. Lin, W.S., Zhao, H.V., Liu, K.R.: Incentive cooperation strategies for peer-to-
peer live multimedia streaming social networks. IEEE Transactions on Multime-
dia 11(3), 396–412 (2009)
36. Cheng, X., Liu, J.: Nettube: Exploring social networks for peer-to-peer short video
sharing. In: Proc. of IEEE INFOCOM 2009, pp. 1152–1160. IEEE (2009)
37. Wang, X., Chen, M., Kwon, T., Yang, L., Leung, V.: Ames-cloud: A framework of
adaptive mobile video streaming and efficient social video sharing in the clouds.
IEEE Transactions on Multimedia 15(4), 811–820 (2013)
38. Wang, Z., Wu, C., Sun, L., Yang, S.: Peer-assisted social media streaming with
social reciprocity. IEEE Transactions on Network and Service Management 10(1),
84–94 (2013)
39. Hoßfeld, T., Seufert, M., Hirth, M., Zinner, T., Tran-Gia, P., Schatz, R.: Quan-
tification of YouTube QoE via crowdsourcing. In: Proc. of IEEE International
Symposium on Multimedia (ISM), pp. 494–499 (2011)
40. Recursive fact-finding: A streaming approach to truth estimation in crowdsourcing
applications, pp. 530–539 (2013)
41. Hei, X., Liang, C., Liang, J., Liu, Y., Ross, K.W.: A measurement study of a
large-scale P2P IPTV system. IEEE Transactions on Multimedia 9(8), 1672–1687
(2007)
42. Magharei, N., Rejaie, R.: Prime: Peer-to-peer receiver-driven mesh-based stream-
ing. IEEE/ACM Transactions on Networking (TON) 17(4), 1052–1065 (2009)
43. Jimenez, R.: Distributed Peer Discovery in Large-Scale P2P Streaming Systems:
Addressing Practical Problems of P2P Deployments on the Open Internet. PhD
thesis, KTH, Network Systems Laboratory (NS Lab), QC 20131203 (2013)
44. Grieco, L.A.: Emerging topics: special issue on multimedia services in information
centric networks (guest editorial). IEEE COMSOC MMTC E-letter, 4–5 (July
2013)
45. Piro, G., Grieco, L.A., Boggia, G., Chatzimisios, P.: Information-centric network-
ing and multimedia services: present and future challenges. ETT, Transactions on
Emerging Telecommunications Technologies (2013) (to be published)
46. Jacobson, V., Smetters, D.K., Briggs, N.H., Plass, M.F., Stewart, P., Thornton,
J.D., Braynard, R.L.: Voccn: voice-over content-centric networks. In: ACM ReArch
2009 (2009)
47. Zhu, Z., Wang, S., Yang, X., Jacobson, V., Zhang, L.: ACT: audio conference tool
over named data networking. In: Proceedings of the ACM SIGCOMM Workshop
on Information-Centric Networking, pp. 68–73. ACM, New York (2011)
48. Li, H., Li, Y., Lin, T., Zhao, Z., Tang, H., Zhang, X.: MERTS: A more efficient
real-time traffic support scheme for Content Centric Networking. In: Proc. in IEEE
Int. Conf. on Computer Sciences and Convergence Information Technology, ICCIT,
pp. 528–533 (2011)
49. Han, L., Kang, S.S., Kim, H., In, H.: Adaptive retransmission scheme for video
streaming over content-centric wireless networks. IEEE Communications Let-
ters 17(6), 1292–1295 (2013)
50. Park, J., Kim, J., Jang, M.W., Lee, B.J.: Time-based interest protocol for real-
time content streaming in content-centric networking (CCN). In: Proc. of IEEE
Int. Conf. on Consumer Electronics, ICCE, pp. 512–513 (2013)
51. Kulinsky, D., Burke, J., Zhang, L.: Video streaming over named data networking.
IEEE COMSOC MMTC E-letter, 6–9 (July 2013)
52. Pallis, G., Vakali, A.: Insight and perspectives for content delivery networks. ACM
Communication Magazine, 101–106 (January 2006)
53. Vakali, A., Pallis, G.: Content delivery networks: status and trends. IEEE Internet
Computing 7(6), 68–74 (2003)
54. Ahlgren, B., Dannewitz, C., Imbrenda, C., Kutscher, D., Ohlman, B.: A survey
of information-centric networking. IEEE Communications Magazine 50(7), 26–36
(2012)
55. Melazzi, N.B., Chiariglione, L.: The Potential of Information Centric Networking in
Two Illustrative Use Scenarios: Mobile Video Delivery and Network Management
in Disaster Situations. IEEE COMSOC MMTC E-letter, 17–20 (July 2013)
56. Ciancaglini, V., Piro, G., Loti, R., Grieco, L.A., Liquori, L.: CCN-TV: a data-centric approach to real-time video services. In: Proc. of IEEE International Conference on Advanced Information Networking and Applications, AINA, Barcelona, Spain (March 2013)
57. Piro, G., Ciancaglini, V.: Enabling real-time TV services in CCN networks. IEEE
COMSOC MMTC E-letter, 17–20 (July 2013)
58. Piro, G., Cianci, I., Grieco, L.A., Boggia, G., Camarda, P.: Information centric
services in smart cities. Elsevier Journal of Systems and Software 88, 169–188
(2014)
59. Omnet++, http://www.omnetpp.org/ (accessed: January 7, 2014)
60. Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A.: Overview of the
H.264/AVC video coding standard. IEEE Transaction on Circuits and Systems
for Video Technology 13(7), 560–576 (2003)
61. Chiocchetti, R., Rossi, D., Rossini, G., Carofiglio, G., Perino, D.: Exploit the known
or explore the unknown?: hamlet-like doubts in ICN. In: Proc. of ACM ICN Work-
shop on Information-Centric Networking, pp. 7–12 (2012)
62. Rossi, D., Rossini, G.: Caching performance of content centric networks under multi-path routing (and more). Technical report, Telecom ParisTech (2011)
63. Piro, G., Grieco, L., Boggia, G., Fortuna, R., Camarda, P.: Two-level Downlink
Scheduling for Real-Time Multimedia Services in LTE Networks. IEEE Transaction
on Multimedia 13, 1052–1065 (2011)
Linked Open Data as the Fuel for Smarter Cities
Abstract. In the last decade, great efforts have been made to move towards the Smart City concept, from both the academic and industrial points of view, encouraging researchers and data stakeholders to find new solutions on how to cope with the huge amount of generated data. Meanwhile, Open Data has arisen as a way to freely share contents to be consumed without restrictions from copyright, patents or other mechanisms of control. Nowadays, Open Data is an achievable concept thanks to the World Wide Web, and it has been re-defined for its application in different domains. Regarding public administrations, the concept of Open Government has found an ally in Open Data concepts, defending citizens' right to access the data, documentation and proceedings of their governments.
We propose the use of Linked Open Data, a set of best practices to publish data on the Web recommended by the W3C, in a new data life cycle management model, allowing governments and individuals to handle their data better, easing its consumption by anybody, including both companies and third parties interested in the exploitation of the data, and citizens as end users receiving relevant curated information and reports about their city. In summary, Linked Open Data uses the previous Openness concepts to evolve from an infrastructure thought for humans to an architecture for the automatic consumption of big amounts of data, providing relevant and high-quality data to end users with low maintenance costs. Consequently, smart data can now be achievable in smart cities.
1 Introduction
In the last decade, cities have been sensorised, protocols are constantly refined to deal with the possibilities that new hardware offers, communication networks are offered in all flavours and so on, generating lots of data that need to be dealt with. Public administrations are rarely able to process all the data they generate in an efficient way, resulting in large amounts of data going unanalysed and limiting the benefits end users could get from them.
Citizens are also being encouraged to adopt the role of Linked Open Data providers. User-friendly Linked Open Data applications should allow citizens to easily contribute new trustworthy data which can be linked to the already existing (generally more static) published Linked Open Data provided by city councils.
with an explicitly defined semantic meaning, linked to other datasets and allowed to be searched for [4]. In 2006, Sir Tim Berners-Lee described a set of principles to publish Linked Data on the Web:
Over the years, an enormous number of applications based on Linked Data have been developed. Big companies like Google2, Yahoo!3 or Facebook4 have lately invested resources in deploying Semantic Web and Linked Data technologies. Several governments, from countries like the United States of America or the United Kingdom, have published a large amount of Open Data from different administrations in their Open Data portals.
Linked Data is real and can be the key to data management in smart cities. Following these recommendations, data publishers can move towards a new data-powered space, in which data scientists and application developers can research new uses for Linked Open Data.
Throughout the literature, a broad variety of definitions of data life cycle models can be found. Although they have been developed for different actuation domains, we describe here some of them which could be applied to generic data, independently of its original domain.
2 http://www.google.com/insidesearch/features/search/knowledge.html
3 http://semsearch.yahoo.com/
4 http://ogp.me/
The first model to be analysed is the one proposed by the Data Documentation Initiative (DDI). The DDI introduced a Combined Life Cycle Model for data management [5]. As Figure 1 shows, this model has eight elements or steps, which can be summarised as follows, according to [6]:
– Study concept. At this stage, apart from choosing the research question and the methodology to collect data, the processing and analysis of the data needed to answer the question are planned.
– Data collection. This model proposes different methods to collect data, like surveys, health records, statistics or Web-based collections.
– Data processing. At this stage, the collected data are processed to answer the proposed research question. The data may be recorded in both machine-readable and human-readable form.
– Data archiving. Both data and metadata should be archived to ensure long-term access to them, guaranteeing confidentiality.
– Data distribution. This stage involves the different ways in which data are distributed, as well as questions related to the terms of use of the data or the citation of the original sources.
– Data discovery. Data may be published in different manners, through publications, web indexes, etc.
– Data analysis. Data can be used by others to achieve different goals.
– Repurposing. Data can be used outside of their original framework, restructuring or combining them to satisfy diverse purposes.
– Preserve. Data preservation implies the storage of the data and metadata, ensuring that these data can be verified, replicated and actively curated over time.
– Discover. The authors describe the data discovery process as one of the greatest challenges, as many data are not immediately available because they are stored on individual laptops. The main challenges in publishing the data in a proper way are related to the creation of catalogues and indexes, and to the implementation of proper search engines.
– Integrate. Integrating data from different and heterogeneous sources can become a difficult task, as it requires understanding methodological differences, transforming data into a common representation and manually converting and recording data to compatible semantics before analysis can begin.
– Analyse. Besides the importance of a clear analysis step, this model remarks the importance of documenting the analysis with sufficient detail to enable its reproduction in different research frameworks.
– Creating data. Creating the data involves the design of the research question, planning how data are going to be managed and their sharing strategy. If we want to reuse existing data, we have to locate them and collect them. Whether data are new or existing, at this stage the metadata have to be created.
– Processing data. As in other models, at this stage the data are translated, checked, validated and cleaned. In the case of confidential data, they need to be "anonymized". The UK Data Archive recommends the creation of metadata at this stage too.
– Analysing data. At this stage, data are interpreted and derived into visualisations or reports. In addition, the data are prepared for preservation, as mentioned in the following stage.
– Preserving data. To preserve data properly, they are migrated to the best format and stored in a suitable medium. In addition to the previously created metadata, the creating, processing, analysing and preserving processes are documented.
– Giving access to data. Once the data are stored, we have to distribute them. Data distribution may involve controlling the access to the data and establishing a sharing license.
– Re-using data. Finally, the data can be re-used, enabling new research topics.
the entire life cycle of Linked Data [9]. As Figure 4 shows, the proposed life cycle phases are the following:
– Storage. As RDF data presents more challenges than relational data, they propose the collaboration between known and new technologies, like column-store technology, dynamic query optimisation, adaptive caching of joins, optimised graph processing and cluster/cloud scalability.
– Authoring. LOD2 provides provenance about data collected through distributed social, semantic collaboration and networking techniques.
– Interlinking. At this phase, LOD2 offers approaches to manage the links between different data sources (a minimal sketch of what such interlinking looks like is given after this list).
– Classification. This stage deals with the transformation of raw data into Linked Data. This transformation implies the linkage and integration of data with upper-level ontologies.
– Quality. Like other models, LOD2 develops techniques for assessing quality based on different metrics.
– Evolution/Repair. At this stage, LOD2 deals with the dynamism of data on the Web, managing changes and modifications over the data.
– Search/Browsing/Exploration. This stage is focussed on offering Linked Data to final users through different search, browsing, exploration and visualisation techniques.
The different stages of this model, which will be explained in detail in the following sections, are:
– Discovery: The first step in our model consists of discovering where data can be obtained, identifying the available datasets which contain the data needed to accomplish our task. Data sources can be maintained either by us or by external entities, so the more metadata we can gather about the datasets, the easier the further steps will become.
– Capture: Once the data sources are identified, the data need to be collected. In a Smart City environment, there are many alternatives for capturing data, like sensors, data published by public administrations, social networks, mobile sensing, user-generated data, or more traditional means like surveys.
– Curate: After the required data are captured, they have to be prepared for storage and for the methods that will later be used to explore them. This processing involves analysing, refining, cleaning, formatting and transforming the data.
– Store: The storage of data is probably the most delicate action in the life cycle. All the analysis tools are built on top of the storage, and it is the “final endpoint” when someone requests our data. A suitable storage should offer indexing, replication, distribution and backup features, among other services.
– Publish: Most of the previously mentioned models prioritise the analysis stage over the publication stage. In our model, we defend the opposite approach for a very simple reason: when you exploit your data before publishing them, using different processes from the rest of the people who are going to use them, you are not putting enough emphasis on publishing these data correctly. Everybody has come across a research paper or an application in which accessing the data was difficult, or in which the data, once collected, turned out to be totally incomprehensible. To avoid this issue, we propose to publish the data before consuming them, following the same process as everybody else.
– Linkage: Before consuming the data, we suggest searching for links and relationships with other datasets found in the discovery step. Current solutions do not allow linkage with unknown datasets, but tools have been developed to ease the link discovery process between two or more given data sources.
– Exploit: Once the data are published, we use the provided methods to make use of them. This data consumption involves data mining, analytics or reasoning.
– Visualise: To understand data properly, designing suitable visualisations is
essential to show correlations between data and the conclusions of the data
analysis in a human-understandable way.
4 Identified Challenges
Taking into consideration the large amounts of data present in Smart Cities,
data management’s complexity can be described in terms of:
– Volume
– Variety
– Veracity
– Velocity
These four variables can also be found in Big Data-related articles (they are also known as the Big Data Vs) [10, 11], so it is not surprising at all that Smart Cities are going to face Big Data problems in the near future (if they are not dealing with them already).
Data scientists need to take these variables into account, as they may overlap in certain environments. Should this happen, each scenario will determine the most relevant factors of the process, generating unwelcome drawbacks in the others.
4.1 Volume
The high amount of data generated and used by cities nowadays needs to be
properly analysed, processed, stored and eventually accessible. This means con-
ventional IT structures need to evolve, enabling scalable storage technologies,
distributed querying approaches and massively parallel processing algorithms
and architectures.
There is a growing trend which argues that only the minimum amount of data should be stored and analysed, without significantly affecting the overall knowledge that could be extracted from the whole dataset. Based on Pareto’s Principle (also known as the 80-20 rule), the idea is to focus on the 20% of the data from which up to 80% of the knowledge can be extracted. Even though this is a solid research challenge in Big Data, there are occasions where we cannot exclude data from being stored or analysed (e.g. sensor data about the monitoring of a building may be used only temporarily, while patient monitoring data should be kept for historical records).
However, big amounts of data should not be seen as a drawback attached to Smart Cities. The larger the datasets, the better analysis algorithms can perform, so deeper insights and conclusions should be expected as an outcome. These can ease the decision-making stage.
As management consultant Peter Drucker once said, “If you cannot measure it, you cannot manage it”, thus leaving no way to improve it either. This adage states that if you want to take care of some process but cannot measure it or cannot access the data, you will not be able to manage that process. That being said, the higher the amount of data available, the greater the opportunities of obtaining useful knowledge become.
4.2 Variety
Data are rarely found in a perfectly ordered, ready-for-processing format. Data scientists are used to working with diverse sources, which seldom fall into neat relational structures: embedded sensor data, documents, media content, social data, etc. As can be seen in Section 5.2, there are many different sources which data can come from in a smart city. Although the presented data cycle can be applied to all kinds of data, the different steps of the cycle have to be planned carefully, avoiding the overloading of the implemented system. For example, data from social media talking about an emergency situation may be prioritised over the rest of the data, to allow Emergency Response Teams (ERTs) to react as soon as possible.
Moreover, different data sources can describe the same real-world entities in such different ways that conflicting information, different data types, etc. are found. Taking care of how data sources describe their contents will lead to an easier integration step, lowering development, analytics and maintenance costs over time.
4.3 Veracity
Several efforts are trying to convert existing data into high-quality data, providing an extra confidence layer on which data analysts can rely. In a previous research work [14], we introduced a provenance data model to be used in user-generated Linked Data datasets, which follows the W3C’s PROV-O ontology7. Other works such as [15] or [16] also provide mechanisms to measure quality and trust in Linked Data.
4.4 Velocity
Finally, we must assume that data generation is experiencing exponential growth. That forces our IT structure to tackle not only volume issues, but also high processing rates. A widespread idea among data businesses is that sometimes you cannot rely on five-minute-old data for your business logic.
That is why streaming data processing has moved from academic fields to industry to solve velocity problems. There are two main reasons to consider stream processing (a minimal sketch follows the list below):
– Sometimes, input data arrive too fast to store them in their entirety without skyrocketing costs.
– If applications demand an immediate response to the data, batch processes are not suitable. Due to the rise of smartphone applications, this is increasingly becoming a common scenario.
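A generic illustration of the stream-processing idea, assuming no particular platform: instead of storing every reading for a later batch job, a constant-memory aggregate is updated as each value arrives. The station identifiers and values below are invented for the example.

from collections import defaultdict

class RunningMean:
    """Constant-memory running mean, updated one reading at a time."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count

def simulated_stream():
    # Stand-in for a real feed (message broker, sensor API, social stream)
    yield ("BIZ01", 41.0)
    yield ("BIZ02", 18.5)
    yield ("BIZ01", 45.5)

aggregates = defaultdict(RunningMean)
for station, value in simulated_stream():
    aggregates[station].update(value)

for station, agg in aggregates.items():
    print(station, round(agg.mean, 2))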
5.1 Discovery
Before starting any process related to data management, we must know where the data can be found. Identifying the data sources that can be queried is a fundamental first step in any data life cycle. These data sources can be divided into two main groups: a) internal, when the team in charge of creating and maintaining the data is the same one that makes use of it, or b) external, when the data are provided by a third party.
The first scenario usually provides a good understanding of the data, as their generation and structure are designed by the same people who are going to use them.
7 http://www.w3.org/TR/prov-o/
However, in real applications, it is becoming more common to turn to external data sources to feed the business logic algorithms. Data scientists and developers analyse external datasets expecting to get new insights and create new opportunities from existing data. Luckily, some initiatives help greatly when searching for new Open Data sources.
The projects described above can establish the basis to search for external
data sources, on top of which further analysis and refinement processes can be
built.
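As a hedged sketch of this discovery step, the snippet below queries a CKAN-style catalogue API (CKAN is the software behind portals such as datahub.io) for candidate datasets. The catalogue URL, query terms and field names are assumptions made for illustration; real portals may differ in configuration or availability.

import requests

# Assumed CKAN-style catalogue endpoint
CATALOGUE = "https://datahub.io/api/3/action/package_search"

def discover_datasets(query, rows=5):
    """Return the title and resource formats of datasets matching the query."""
    response = requests.get(CATALOGUE, params={"q": query, "rows": rows}, timeout=30)
    response.raise_for_status()
    datasets = response.json()["result"]["results"]
    return [(ds["title"], [res.get("format") for res in ds.get("resources", [])])
            for ds in datasets]

if __name__ == "__main__":
    for title, formats in discover_datasets("air quality"):
        print(title, formats)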
5.2 Capture
Data are undoubtedly the basis of Smart Cities: services offered to citizens and decisions offered to city rulers by Decision Support Systems all work thanks to big amounts of data inputs. These data are captured from a wide variety of sources, like sensor networks installed across the city, social networks, publicly available government data or citizens/users who prosume data through their devices (they crowdsource data, or the devices themselves automatically generate sensing data). In most cases, these sources publish data in a wide set of heterogeneous formats, forcing data consumers to develop different connectors for each source. As can be seen in Section 5.3, there are many different and widely adopted ontologies which can represent data acquired from the sources found in Smart Cities, easing the capture, integration and publication of data from heterogeneous domains. In this section, the different sources of data which can be found in Smart Cities are described, while in Section 5.3 the transformation process from their raw data to Linked Data is presented.
8 http://datahub.io/
9 http://ckan.org/
10 http://lod-cloud.net/
Social Networks. Since the adoption of the Web 2.0 paradigm [24], users have become more and more active when interacting with the Web. The clearest example of this transformation of the Web can be found in social networks and the high growth of their user bases. For example, at the end of the second quarter of 2013, Facebook had almost 1.2 billion users16, while at the end of 2012, Twitter reached more than 200 million monthly active users17. Although users of social networks generate a lot of data, it is hard to manipulate them because users write in a language not easily understood by machines. To solve this issue, many authors have worked with different Natural Language Processing (NLP) techniques. For example, NLP and Named Entity Recognition (NER) systems [25] can be used to detect tweets which talk about some emergency situation like a car crash, an earthquake and so on, and to recognise different properties of the emergency situation like its place or magnitude [26, 27]. Extracting data from relevant tweets could help emergency teams when planning their response to different types of situations, as can be seen in [28–30].
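The following is a minimal NER sketch in Python using spaCy, one possible off-the-shelf toolkit (the cited works rely on their own NLP/NER pipelines). The sample tweets and the pretrained en_core_web_sm model are assumptions made for the example.

import spacy

# Requires a pretrained model, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

tweets = [
    "Huge traffic jam after a car crash on Gran Via, Bilbao",
    "Small earthquake felt near the city centre this morning",
]

for tweet in tweets:
    doc = nlp(tweet)
    # Place names (GPE/LOC) and similar entities help locate the reported situation
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    print(tweet, "->", entities)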
Government Open Data. Government Open Data has gained a lot of value in recent years, thanks to the proliferation of Open Data portals from administrations around the world. In these portals, the governments publish relevant data for the citizens, in a heterogeneous set of formats like CSV, XML or RDF. Usually, data from these portals can be consumed by developers in an easy way thanks to the provided APIs, so there are a lot of applications developed on top of these data. As citizens are the most important part of Smart Cities, these applications make them an active part in the governance of the city.
To illustrate the importance of Government Open Data, some Open Data portals are shown in Table 1.
11 http://helheim.deusto.es/bizkaisense/
12 http://traintimes.org.uk/map/tube/
13 http://www.arduino.cc/
14 http://www.raspberrypi.org/
15 https://xively.com
16 http://techcrunch.com/2013/07/24/facebook-growth-2/
17 https://twitter.com/twitter/status/281051652235087872
Mobile Sensing and User Generated Data. Another important data source from which Smart Cities can capture data is the citizens themselves. Citizens can interact with the city in multiple ways: for example, in [14] an interactive 311 service is described. In this work, the authors propose a model to share and validate, through provenance and reputation analysis, reports about city issues published by its citizens. In Urbanopoly [31] and Urbanmatch [32], different games are presented to tourists with the aim of gathering data and photographs of tourist points of interest in the city using their smartphones’ cameras. In the same line, csxPOI [33] creates semantically annotated POIs from data gathered by citizens, allowing the semi-automatic merging of duplicate POIs and the removal of incorrect POIs.
As has been shown, many data sources can be found in a Smart City, publishing an abundant stream of interesting data in different and heterogeneous manners. In Section 5.3, how to transform these data into standard formats is shown.
5.3 Curate
As can be seen in Section 2, the Linked Data paradigm proposes the Resource Description Framework (RDF) as the best format to publish data and encourages the reuse of widely adopted ontologies. In this section we explain what an ontology is, which are the most popular ontologies, and how we can map previously captured raw data to a proper ontology. At the end of this section, a set of best practices for constructing suitable URIs for Linked Data is presented.
As defined by [34], an ontology is a formal explicit description of concepts in a domain of discourse, the properties of each concept describing various features and attributes of the concept, and restrictions on slots. According to this definition, an ontology has Classes which represent the concepts, Properties which represent different characteristics of the Classes, and Restrictions on the values of these properties and on the relationships among different Classes. An ontology allows modelling data while avoiding most of the ambiguities that originate when fusing data from different sources, stimulating the interoperability among different sources. As seen in Section 5.2, data may come from a wide variety of sources in Smart Cities, which is why ontologies seem to be a suitable option to model these data.
The following works use ontologies to model different data sources which can be found in a Smart City. In the Bizkaisense project [35], diverse ontologies like the Semantic Sensor Network ontology (SSN) [36], the Semantic Web for Earth and Environmental Terminology (SWEET) [37] or the Unified Code for Units of Measure ontology (UCUM)18 are used to model raw data from air quality stations of the Basque Country. The AEMET Linked Data project19 has developed a network of ontologies composed of the SSN ontology, the OWL-Time ontology20, the wgs84_pos ontology21, the GeoBuddies ontology network22 and its own AEMET ontology, to describe measurements taken by meteorological stations of AEMET (the Spanish National Weather Service). In [38] the authors extend the SSN ontology to model and publish as Linked Data the data stream generated by the sensors of an Android-powered smartphone.
Another example of the semantic modelling of infrastructures of a city can be found in LinkedQR [39]. LinkedQR is an application that eases the management of an art gallery, allowing the elaboration of interactive tourism guides through third-party Linked Data and manual curation. LinkedQR uses the Music Ontology [40] to describe the audioguides and Dublin Core [41], the DBpedia Ontology and Yago [42] to describe other basic information.
The LinkedStats project23 takes data about waste generation and population in Biscay to develop a statistical analysis of the correlation between these two dimensions of the data. It models these statistical data with the RDF Data Cube Vocabulary [59], an ontology developed for modelling multi-dimensional data in RDF. Finally, in [43] the authors show how Linked Data enables the integration of data from different sensor networks.
18 http://idi.fundacionctic.org/muo/ucum-instances.html
19 http://aemet.linkeddata.es/models.html
20 http://www.w3.org/TR/owl-time/
21 http://www.w3.org/2003/01/geo/wgs84_pos
22 http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/ontologies/83-geobuddies-ontologies
23 http://helheim.deusto.es/linkedstats/
The mapping between raw data and ontologies is usually done by applications created ad hoc for each case; Bizkaisense, AEMET Linked Data and LinkedStats have their own Python scripts to generate proper RDF files from the raw data. In the case of LinkedQR, it has a control panel where the manager can manually type data and map them to the desired ontology. However, there are also generic tools designed for transforming raw data into structured data. One of them is Open Refine24 (formerly Google Refine). Open Refine is a web tool which can apply different manipulations to data (facets, filters, splits, merges, etc.) and export data in different formats based on custom templates. Additionally, Google Refine’s RDF Extension allows exporting data in RDF.
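The snippet below sketches the kind of ad hoc Python script mentioned above, mapping a raw air-quality reading to RDF with rdflib. The example namespace, property names and the CSV layout are invented for illustration and do not reproduce the actual Bizkaisense mapping.

import csv
import io
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

# Assumed namespaces and property names, invented for this example
SSN = Namespace("http://purl.oclc.org/NET/ssnx/ssn#")
EX = Namespace("http://example.org/airquality/")

# Stand-in for a raw file downloaded from an air quality station
raw = io.StringIO("station,pollutant,value,unit\nBIZ01,NO2,43.2,ug/m3\n")

g = Graph()
g.bind("ssn", SSN)
g.bind("ex", EX)

for row in csv.DictReader(raw):
    obs = URIRef(EX["observation/" + row["station"] + "-" + row["pollutant"]])
    g.add((obs, RDF.type, SSN.Observation))
    g.add((obs, EX.observedPollutant, Literal(row["pollutant"])))
    g.add((obs, EX.value, Literal(row["value"], datatype=XSD.decimal)))
    g.add((obs, EX.unit, Literal(row["unit"])))

print(g.serialize(format="turtle"))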
Another interesting tool is Virtuoso Sponger, a component of OpenLink Virtuoso25 which generates Linked Data from different data sources through a set of extractors called Cartridges. There are different Cartridges which support a wide variety of input formats (CSV, Google KML, xHTML, XML, etc.), as well as vendor-specific Cartridges (Amazon, Ebay, BestBuy, Discogs, etc.).
After modelling the data, one of the most important concepts in Linked Data is the URI or Uniform Resource Identifier. As shown in Section 2, to publish a data resource as Linked Data it has to be identified by an HTTP URI which satisfies these conditions:
– An HTTP URI is unique and consistent.
– An HTTP URI can be accessed from everywhere by everyone.
– The URI and its hierarchy are self-descriptive.
Designing valid URIs is a very important step in the publication of Linked Data: if you change the URIs of your data, all the incoming links from external sources are going to be broken. To avoid this issue there is a set of good practices, proposed in [44]:
– Be on the web. Return RDF for machines and HTML for humans through the standard HTTP protocol.
– Do not be ambiguous. Use a URL to describe a document and a different URI to identify real-world objects. A URI cannot stand for both the document and the real-world object.
To apply these good practices correctly, the authors propose two solutions, which nowadays have been widely adopted by the Linked Data community (a small dereferencing sketch follows the list):
– 303 URIs. Use the 303 See Other status code to redirect to the proper RDF description of the real-world object or to the HTML document. For example:
• http://helheim.deusto.es/hedatuz/resource/biblio/5112 - A URI
identifying a bibliographic item.
• http://helheim.deusto.es/hedatuz/page/biblio/5112 - The HTML
view of the item.
24 http://openrefine.org/
25 http://virtuoso.openlinksw.com/
• http://helheim.deusto.es/hedatuz/data/biblio/5112 - An RDF document describing the item.
– Hash URIs. The URI contains a fragment, a special part separated by the # symbol. The client has to strip off the fragment from the URI before requesting it from the server. The server returns an RDF document in which the client has to search for the fragment.
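As a small illustration of how such URIs can be dereferenced in practice (the sketch referenced in the list above), the snippet below requests one of the example URIs with an RDF Accept header and inspects any 303 redirect. The offered formats and redirect behaviour depend on the actual server, so this is only indicative.

import requests

# One of the example URIs above; the offered formats depend on the actual server
uri = "http://helheim.deusto.es/hedatuz/resource/biblio/5112"

response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

# requests follows redirects by default; any 303 hop is kept in response.history
for hop in response.history:
    print(hop.status_code, "->", hop.headers.get("Location"))

print("Final URL:", response.url)
print("Content-Type:", response.headers.get("Content-Type"))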
5.4 Store
Once the data are mapped to a proper ontology and the RDF files are generated, it is time to store them. Due to the big amount of data generated in a city, an appropriate storage has to:
In this section, the first three points are discussed, while the fourth is discussed in Section 5.5. Before a detailed description of each analysed datastore is given, a brief overview is presented in Table 2.
allowing web agents to access the data. Virtuoso supports the SPARQL UPDATE syntax, allowing the datastore to be updated through HTTP POST requests, and it provides connectors for different Java-powered RDF engines, like Jena26, Sesame27 or Redland28. Further, it supports some OWL properties for reasoning. According to the Berlin SPARQL Benchmark [45], Virtuoso 7 can load one billion triples in 27:11 minutes.
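A minimal sketch of such an update sent over HTTP, following the SPARQL 1.1 Protocol convention of posting an "update" form parameter. The endpoint URL and the inserted triple are assumptions; a real Virtuoso installation will typically also require authentication and appropriate graph permissions.

import requests

ENDPOINT = "http://localhost:8890/sparql"  # assumed local Virtuoso endpoint

update = """
PREFIX ex: <http://example.org/airquality/>
INSERT DATA {
  ex:observation-BIZ01-NO2 ex:value 43.2 .
}
"""

# SPARQL 1.1 Protocol: an update sent as a URL-encoded 'update' form parameter
response = requests.post(ENDPOINT, data={"update": update}, timeout=30)
print(response.status_code, response.text[:200])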
Another datastore which is becoming popular is Stardog29. Developed by Clark & Parsia, Stardog is an RDF database which supports SPARQL querying and OWL reasoning. It offers a Command Line Interface to manage the different databases (create, remove, add/remove data, SPARQL queries, etc.), while they can also be queried through HTTP. Furthermore, it has its own query syntax which indexes the RDF literals, and its own Java library to manage databases from Java applications. It supports OWL 2 reasoning, covering different OWL profiles30 like QL, RL, EL or DL.
The last analysed datastore is Fuseki31. Part of the Jena framework, Fuseki (formerly known as Joseki) offers RDF data over HTTP in a REST style. Fuseki implements the W3C’s SPARQL 1.1 Query, Update, Protocol and Graph Store HTTP Protocol specifications. It has a web panel to manage the datastore and can interact with the rest of the Jena components.
As can be seen in Table 2, all the analysed datastores are similar in terms of input/output formats and offered APIs. But there are differences in other aspects, like security: Fuseki does not support users or access roles. Another difference is the installation and execution complexity: while Fuseki and Stardog are launched as a Java JAR file, Virtuoso can be installed through Debian’s package system and launched as a UNIX daemon. On the other hand, Virtuoso is more than a “simple” RDF store: Virtuoso is a relational database engine, an application server in which both preinstalled applications and our own applications can be launched, and much more. Furthermore, Virtuoso can be installed on a cluster formed by multiple servers.
Concluding this section, we can say that Fuseki can be used in lightweight installations, when hardware and data are limited; Stardog in more complex systems, due to its fast query execution times; meanwhile, Virtuoso offers more services like Sponger (described in Section 5.3) or a Semantic Wiki, whereby it can be suitable for environments which need more than the simple storage of RDF triples.
5.5 Publish
The publication stage is one of the most important stages in the life cycle of Linked Data in Smart Cities, because this stage determines how citizens or developers can acquire Linked Data to exploit (Section 5.7) through different discovery methods (Section 5.1). As we saw in Section 5.4, the three proposed RDF stores include some publication API or SPARQL endpoint, but sometimes the specifications of the system to be deployed require additional features at this publication stage.
26 http://jena.apache.org/
27 http://www.openrdf.org/
28 http://librdf.org/
29 http://stardog.com/
30 http://www.w3.org/TR/owl2-profiles/
31 http://jena.apache.org/documentation/serving_data/index.html
One of these features can be the 303 Redirection explained in Section 5.3. Although all the datastores mentioned in Section 5.4 offer a SPARQL endpoint to explore the data, the Linked Data paradigm demands resolvable HTTP URIs as resource identifiers. Fortunately, there are tools which fulfil this demand. One of them, Pubby32, adds a Linked Data interface to SPARQL endpoints. Pubby queries the proper SPARQL endpoint to retrieve the data related to a given URI and manages the 303 Redirection mechanism. Depending on the Accept header of the HTTP request, Pubby redirects the client to the HTML view of the data or to the RDF document describing the resource. Pubby can export data in RDF/XML, NTriples, N3 and Turtle. In Figure 6, an example of the HTML view of a resource is shown.
D2R Server [46] allows the publication of relational databases as Linked Data. First, D2R requires a mapping from the tables and columns of the database to the selected ontologies using the D2RQ Mapping Language. Once this mapping is done, D2R offers a SPARQL endpoint to query the data and a Pubby-powered interface.
32 http://wifo5-03.informatik.uni-mannheim.de/pubby/
Besides the publication of the data itself, it is important to consider the publication of provenance information about them. In [47] the authors identify the publication of provenance data as one of the main factors that influence web content trust. When publishing provenance information, two approaches can be taken: the first is to publish basic metadata like when or by whom the data were created and published; the second is to provide a more detailed description of where the data come from, including versioning information or a description of the data transformation workflow, for example. Some ontologies help us in the process of providing provenance descriptions of Linked Data. Basic provenance metadata can be provided using Dublin Core terms33, like dcterms:contributor, dcterms:creator or dcterms:created. Other vocabularies like the Provenance Vocabulary [48] or the Open Provenance Model (OPM) [49] provide ways to publish detailed provenance information like that mentioned before. The W3C has recently created the PROV Data Model [50], a new vocabulary for provenance interchange on the Web. This PROV Data Model is based on OPM and describes the entities, activities and people involved in the creation of a piece of data, allowing the consumer to evaluate the reliability of the data based on their provenance information. Furthermore, PROV was deliberately kept extensible, allowing various extended concepts and custom attributes to be used. For example, the Uncertainty Provenance (UP) [51] set of attributes can be used to model the uncertainty of data aggregated from a heterogeneous mix of trusted and untrusted sources, or with varying confidence.
33 http://purl.org/dc/terms/
Below is an example of how bio2rdf.org (an atlas of post-genomic data and one of the biggest datasets in the LOD Cloud) represents the provenance data of its datasets, using some of the mentioned ontologies in conjunction with VoID, a dataset description vocabulary:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

<http://bio2rdf.org/bio2rdf_dataset:bio2rdf-affymetrix-20121002>
    a void:Dataset ;
    dcterms:created "2012-10-02"^^xsd:date ;
    rdfs:label "affymetrix dataset by Bio2RDF on 2012-10-02" ;
    dcterms:creator <affymetrix> ;
    dcterms:publisher <http://bio2rdf.org> ;
    void:dataDump <datadump> ;
    prov:wasDerivedFrom <http://bio2rdf.org/bio2rdf_dataset:affymetrix> ;
    dcterms:rights "restricted-by-source-license", "attribution", "use-share-modify" ;
    void:sparqlEndpoint <http://affymetrix.bio2rdf.org/sparql> .
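A publisher could generate this kind of VoID and PROV description programmatically; the sketch below does so with rdflib, using placeholder URIs and literals rather than Bio2RDF's actual pipeline.

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import DCTERMS, RDFS, XSD

VOID = Namespace("http://rdfs.org/ns/void#")
PROV = Namespace("http://www.w3.org/ns/prov#")

# Placeholder dataset URI and values, invented for this example
dataset = URIRef("http://example.org/dataset/airquality-20140101")

g = Graph()
g.bind("void", VOID)
g.bind("prov", PROV)
g.bind("dcterms", DCTERMS)

g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, RDFS.label, Literal("Air quality dataset, 2014-01-01")))
g.add((dataset, DCTERMS.created, Literal("2014-01-01", datatype=XSD.date)))
g.add((dataset, DCTERMS.publisher, URIRef("http://example.org")))
g.add((dataset, PROV.wasDerivedFrom, URIRef("http://example.org/dataset/airquality")))
g.add((dataset, VOID.sparqlEndpoint, URIRef("http://example.org/sparql")))

print(g.serialize(format="turtle"))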
In the process of publishing data from Smart Cities as Linked Data, new ontologies are going to be created to model the particularities of each city. The creators of these ontologies have to publish suitable documentation, allowing their proper reuse. A tool for publishing ontologies and their documentation is Neologism34. Neologism shows ontologies in a human-readable manner, representing class, subclass and property relationships through diagrams.
5.6 Linkage
Connecting existing data with other available resources is a major challenge in easing data integration. Due to its interlinked nature, Linked Data provides a perfect basis to connect the data present in a given dataset with other sources.
The linkage stage starts a loop in the model after the publishing step, establishing relationships between existing data and external datasets, in order to provide links to new information stores.
Different frameworks have been developed to deal with class and property matching. The basis of these frameworks is to provide data discovery features through links to external entities related to the items used in the analysis.
The Silk Link Discovery Framework [52] offers a flexible tool for discovering links between entities within different Web data sources. Silk makes use of the declarative Silk Link Specification Language (Silk-LSL) to specify which conditions data items must fulfil in order to be interlinked.
34 http://neologism.deri.ie/
With a similar approach, LIMES (LInk discovery framework for MEtric Spaces) [53] can be used for the discovery of links between Linked Data knowledge bases, focusing on a time-efficient approach, especially when working with large-scale matching tasks. LIMES relies on the triangle inequality for distance calculations, which reduces the number of comparisons necessary to complete a mapping by several orders of magnitude. This approach helps to detect the pairs that will not fulfil the requirements at an early stage, thus avoiding more time-consuming processing. The architecture followed by the LIMES framework is depicted in Figure 9.
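To illustrate the pruning idea that LIMES builds on (not the LIMES implementation itself), the toy sketch below uses distances to a reference exemplar and the triangle inequality to skip candidate pairs whose distance cannot fall below the acceptance threshold.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def link_candidates(sources, targets, exemplar, threshold):
    """Yield (source, target) pairs whose distance is below the threshold."""
    target_dists = [(t, euclidean(t, exemplar)) for t in targets]
    for s in sources:
        d_se = euclidean(s, exemplar)
        for t, d_te in target_dists:
            # Reverse triangle inequality: d(s, t) >= |d(s, e) - d(t, e)|,
            # so this pair can be discarded without computing d(s, t)
            if abs(d_se - d_te) > threshold:
                continue
            if euclidean(s, t) <= threshold:
                yield s, t

if __name__ == "__main__":
    sources = [(0.0, 0.0), (5.0, 5.0)]
    targets = [(0.1, 0.1), (9.0, 9.0)]
    print(list(link_candidates(sources, targets, exemplar=(0.0, 0.0), threshold=0.5)))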
5.7 Exploit
At this stage, the focus is on exploiting the data for business-logic processes, whether they involve data mining algorithms, analytics, reasoning, etc.
Whereas complex processing algorithms can be used independently of the dataset format, Linked Open Data can greatly help for reasoning purposes. Linked Open Data describes entities using ontologies, semantic constraints and
restriction rules (belonging, domain, range, etc.) which favour the inference of new information from the existing one. Thanks to the semantics present in Linked Data, algorithms are not fed with raw values (numbers, strings...), but with semantically meaningful information (heights in cm, world countries, company names...), thus resulting in higher quality outputs and making algorithms more error aware (e.g. if a given algorithm is in charge of mapping the layout of a mountainous region and finds the height of one of the mountains to be 3.45, it is possible to detect that a conversion has failed at some point, as the height datatype was expected to be given in meters).
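A small sketch of this exploitation step: querying a SPARQL endpoint and feeding the typed results into a simple analysis. The endpoint URL and vocabulary are assumptions made for the example, not a real city deployment.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # assumed endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/airquality/>
    SELECT ?station ?level WHERE {
        ?obs ex:station ?station ;
             ex:value ?level .
    } LIMIT 100
""")

results = sparql.query().convert()
levels = [float(b["level"]["value"]) for b in results["results"]["bindings"]]
if levels:
    print("Mean pollutant level:", sum(levels) / len(levels))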
As seen in Section 5.2, sensors and social networks are common data input resources, generating huge amounts of data streamed in real time. The work done in [54] comprises a set of best practices to publish and link stream data so that they become part of the Semantic Web.
However, when it comes to exploiting Linked Data streams, SPARQL can reach its limits [55]. Stream-querying languages such as the CQELS35 language (an extension of the declarative SPARQL 1.1 language, specified using EBNF notation) can greatly help in this task. CQELS [56] (Continuous Query Evaluation over Linked Stream) is a native and adaptive query processor for unified query processing over Linked Stream Data and Linked Data, developed at DERI Galway.
Initially, a query pattern is added to represent window operators on RDF streams:
GraphPatternNotTriples ::= GroupOrUnionGraphPattern | OptionalGraphPattern |
    MinusGraphPattern | GraphGraphPattern | *StreamGraphPattern* |
    ServiceGraphPattern | Filter | Bind
35 https://code.google.com/p/cqels/
Eventually, both batch and streaming results can be consumed through REST
services by web or mobile applications, or serve as input for more processing
algorithms until they are finally presented to end users.
5.8 Visualise
In order to make meaning out of data, humans have developed a great ability to understand visual representations. The main objective of data visualisation is to communicate information in a clean and effective way through graphical means. It has also been suggested that visualisation should encourage user engagement and attention.
The saying “A picture is worth a thousand words” reflects the power that images and graphics have when expressing information, as they can condense big datasets into a couple of representative, powerful images.
As Linked Data is based on subject-predicate-object triples, graphs are a natural way to represent triple stores, where subject and object nodes are interconnected through predicate links. When further analysis is applied to the triples, a diverse variety of representations can be chosen to show the processed information: charts, infographics, flows, etc. [57]
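As a hedged example of this graph-oriented view, the snippet below loads a few triples with rdflib and draws them as a node-link diagram with networkx and matplotlib; the data and layout choices are purely illustrative.

import matplotlib.pyplot as plt
import networkx as nx
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:Bilbao ex:hasStation ex:BIZ01 .
ex:BIZ01 ex:measures ex:NO2 .
"""

def local_name(term):
    """Shorten a URI to its last path segment for readable labels."""
    return str(term).rsplit("/", 1)[-1]

rdf_graph = Graph().parse(data=turtle_data, format="turtle")

nx_graph = nx.DiGraph()
for s, p, o in rdf_graph:
    nx_graph.add_edge(local_name(s), local_name(o), predicate=local_name(p))

pos = nx.spring_layout(nx_graph, seed=42)
nx.draw_networkx(nx_graph, pos, node_color="lightblue")
nx.draw_networkx_edge_labels(
    nx_graph, pos, edge_labels=nx.get_edge_attributes(nx_graph, "predicate"))
plt.axis("off")
plt.show()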
Browser-side visualisation technologies such as d3.js36 (by Michael Bostock) and Raphaël37 are JavaScript-based libraries that allow the visual representation of data on modern web browsers, allowing anybody with a minimal internet connection to explore data patterns in graphical form.
For developers not familiar with visualisation techniques, some investigations are trying to enable the automatic generation of graphical representations of Linked Data query results. The LDVM (Linked Data Visualisation Model) [58] is proposed as a model to rapidly create visualisations of RDF data. LODVisualisation38 is an implemented prototype which supports the LDVM.
Visualbox39 is a simplified edition of LODSPeaKr40 focused on allowing people to create visualisations using Linked Data. In Figure 10, a SPARQL query to
36 http://d3js.org/
37 http://raphaeljs.com/
38 http://lodvisualisation.appspot.com/
39 http://alangrafu.github.io/visualbox/
40 http://lodspeakr.org/
6 Conclusions
In this chapter, we have proposed Linked Data as a suitable paradigm to manage the entire data life cycle in Smart Cities. Throughout this chapter we have presented a set of guidelines for public or private managers who want to contribute data from their administration or enterprise to a Smart City, introducing existing tools and sharing practical knowledge acquired by the authors while working with Linked Data technologies. The proposed data life cycle for Smart Cities covers the entire path travelled by data inside a Smart City, and the mentioned tools and technologies fulfil all the tasks needed to move forward along this path.
But Linked Open Data is not only about technology. The Open term in Linked Open Data is about the awareness of public (and private) administrations of the need to provide citizens with all the data which belong to them, making the governance process more transparent; the awareness of developers to discover the gold behind the data; and the awareness of fully informed citizens participating in decision-making processes: Smart Cities, smart business and Smart Citizens.
Urban Linked Data applications also empower citizens in their role of first-level data providers. Thanks to smartphones, each citizen is equipped with a full set of sensors which are able to measure the city’s pulse at every moment: traffic status, the speed of each vehicle to identify how people are moving, reports of roadworks or malfunctioning public systems, and so forth. Citizens are moving from being data consumers to data prosumers, an aspect that data scientists and application developers can benefit from to provide new services for Smart Cities.
References
1. World Health Organization: Urbanization and health. Bull World Health Organ.
88, 245–246 (2010)
2. Bizer, C., Boncz, P., Brodie, M.L., Erling, O.: The meaningful use of big data: four
perspectives – four challenges. SIGMOD Rec. 40(4), 56–60 (2012)
3. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific Ameri-
can 284(5), 28–37 (2001)
4. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. International
Journal on Semantic Web and Information Systems (IJSWIS) 5(3), 1–22 (2009)
5. Initiative, D.D.: Overview of the DDI version 3.0 conceptual model (April 2008)
6. Ball, A.: Review of data management lifecycle models (2012)
7. Burton, A., Treloar, A.: Designing for discovery and re-use: the ‘ANDS data sharing
verbs’ approach to service decomposition. International Journal of Digital Cura-
tion 4(3), 44–56 (2009)
8. Michener, W.K., Jones, M.B.: Ecoinformatics: supporting ecology as a data-
intensive science. Trends in Ecology & Evolution 27(2), 85–93 (2012)
9. Auer, S., et al.: Managing the life-cycle of linked data with the LOD2 stack. In:
Cudré-Mauroux, P., et al. (eds.) ISWC 2012, Part II. LNCS, vol. 7650, pp. 1–16.
Springer, Heidelberg (2012)
10. deRoos, D., Eaton, C., Lapis, G., Zikopoulos, P., Deutsch, T.: Understanding big
data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Os-
borne Media (2011)
11. Russom, P.: Big data analytics. TDWI Best Practices Report, Fourth Quarter
(2011)
12. Li, X., Dong, X.L., Lyons, K., Meng, W., Srivastava, D.: Truth finding on the deep
web: is the problem solved? In: Proceedings of the 39th International Conference
on Very Large Data Bases, PVLDB 2013, pp. 97–108. VLDB Endowment (2013)
13. Buneman, P., Davidson, S.B.: Data provenance–the foundation of data quality
(2013)
14. Emaldi, M., Pena, O., Lázaro, J., López-de-Ipiña, D., Vanhecke, S., Mannens, E.:
To trust, or not to trust: Highlighting the need for data provenance in mobile apps
for smart cities. In: Proceedings of the 3rd International Workshop on Information
Management for Mobile Applications, pp. 68–71 (2013)
15. Hartig, O., Zhao, J.: Using web data provenance for quality assessment. In: Pro-
ceedings of the International Workshop on Semantic Web and Provenance Man-
agement, Washington DC, USA (2009)
16. Bizer, C., Cyganiak, R.: Quality-driven information filtering using the WIQA pol-
icy framework. Web Semantics: Science, Services and Agents on the World Wide
Web 7(1), 1–10 (2009)
17. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: Dbpedia:
A nucleus for a web of open data. In: Aberer, K., et al. (eds.) ISWC/ASWC 2007.
LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
18. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hell-
mann, S.: Dbpedia-a crystallization point for the web of data. Web Semantics:
Science, Services and Agents on the World Wide Web 7(3), 154–165 (2009)
19. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the open linked data.
In: Aberer, K., et al. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 552–565.
Springer, Heidelberg (2007)
20. Tummarello, G., Cyganiak, R., Catasta, M., Danielczyk, S., Delbru, R., Decker,
S.: Sig.ma: Live views on the web of data. Web Semantics: Science, Services and
Agents on the World Wide Web 8(4), 355–364 (2010)
21. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: A survey on sensor
networks. IEEE Communications Magazine 40(8), 102–114 (2002)
22. Sanchez, L., Galache, J.A., Gutierrez, V., Hernandez, J., Bernat, J., Gluhak, A.,
Garcia, T.: SmartSantander: the meeting point between future internet research
and experimentation and the smart cities. In: Future Network & Mobile Summit
(FutureNetw 2011), pp. 1–8 (2011)
23. Le-Phuoc, D., Quoc, H.N.M., Parreira, J.X., Hauswirth, M.: The linked sensor
middleware–connecting the real world and the semantic web. In: Proceedings of
the Semantic Web Challenge (2011)
24. O’Reilly, T.: What is web 2.0: Design patterns and business models for the next
generation of software. Communications & Strategies (1), 17 (2007)
25. Maynard, D., Tablan, V., Ursu, C., Cunningham, H., Wilks, Y.: Named entity
recognition from diverse text types. In: Recent Advances in Natural Language
Processing 2001 Conference, pp. 257–274 (2001)
26. Sixto, J., Pena, O., Klein, B., López-de-Ipiña, D.: Enable tweet-geolocation
and don’t drive ERTs crazy! improving situational awareness using twitter. In:
SMERST 2013: Social Media and Semantic Technologies in Emergency Response,
Coventry, UK, vol. 1, pp. 27–31 (2013)
27. Martins, B., Anastácio, I., Calado, P.: A machine learning approach for resolving
place references in text. In: Geospatial Thinking, pp. 221–236. Springer (2010)
28. Abel, F., Hauff, C., Houben, G.J., Stronkman, R., Tao, K.: Twitcident: fighting fire
with information from social web streams. In: Proceedings of the 21st International
Conference Companion on World Wide Web, pp. 305–308 (2012)
29. Vieweg, S., Hughes, A.L., Starbird, K., Palen, L.: Microblogging during two natural
hazards events: what twitter may contribute to situational awareness. In: Proceed-
ings of the SIGCHI Conference on Human Factors in Computing Systems, pp.
1079–1088 (2010)
30. Hughes, A.L., Palen, L.: Twitter adoption and use in mass convergence and
emergency events. International Journal of Emergency Management 6(3), 248–260
(2009)
31. Celino, I., Cerizza, D., Contessa, S., Corubolo, M., Dell’Aglio, D., Valle, E.D.,
Fumeo, S.: Urbanopoly – a social and location-based game with a purpose to
crowdsource your urban data. In: Proceedings of the 2012 ASE/IEEE International
Conference on Social Computing and 2012 ASE/IEEE International Conference
on Privacy, Security, Risk and Trust, SOCIALCOM-PASSAT 2012, pp. 910–913.
IEEE Computer Society, Washington, DC (2012)
32. Celino, I., Contessa, S., Corubolo, M., Dell’Aglio, D., Valle, E.D., Fumeo, S.,
Krüger, T.: UrbanMatch - linking and improving smart cities data. In: Bizer, C.,
Heath, T., Berners-Lee, T., Hausenblas, M. (eds.) Linked Data on the Web. CEUR
Workshop Proceedings, CEUR-WS, vol. 937 (2012)
33. Braun, M., Scherp, A., Staab, S.: Collaborative semantic points of interests. In:
Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L.,
Tudorache, T. (eds.) ESWC 2010, Part II. LNCS, vol. 6089, pp. 365–369. Springer,
Heidelberg (2010)
34. Noy, N.F., McGuinness, D.L.: Ontology development 101: A guide to creating your
first ontology. Stanford knowledge systems laboratory technical report KSL-01-05
and Stanford medical informatics technical report SMI-2001-0880 (2001)
35. Emaldi, M., Lázaro, J., Aguilera, U., Peña, O., López-de-Ipiña, D.: Short paper:
Semantic annotations for sensor open data. In: Proceedings of the 5th International
Workshop on Semantic Sensor Networks, SSN 2012, pp. 115–120 (2012)
36. Lefort, L., Henson, C., Taylor, K., Barnaghi, P., Compton, M., Corcho, O., Garcia-
Castro, R., Graybeal, J., Herzog, A., Janowicz, K.: Semantic sensor network XG
final report. W3C Incubator Group Report (2011)
37. Raskin, R.G., Pan, M.J.: Knowledge representation in the semantic web for
earth and environmental terminology (SWEET). Computers & Geosciences 31(9),
1119–1125 (2005)
38. d’Aquin, M., Nikolov, A., Motta, E.: Enabling lightweight semantic sensor net-
works on android devices. In: The 4th International Workshop on Semantic Sensor
Networks (SSN 2011) (October/Autumn 2011)
39. Emaldi, M., Lázaro, J., Laiseca, X., López-de-Ipiña, D.: LinkedQR: improving
tourism experience through linked data and QR codes. In: Bravo, J., López-de-
Ipiña, D., Moya, F. (eds.) UCAmI 2012. LNCS, vol. 7656, pp. 371–378. Springer,
Heidelberg (2012)
40. Raimond, Y., Abdallah, S., Sandler, M., Giasson, F.: The music ontology. In:
ISMIR 2007: 8th International Conference on Music Information Retrieval, Vi-
enna, Austria, pp. 417–422 (September 2007)
41. Weibel, S., Kunze, J., Lagoze, C., Wolf, M.: Dublin core metadata for resource
discovery. Internet Engineering Task Force RFC 2413, 222 (1998)
42. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowl-
edge. In: Proceedings of the 16th International Conference on World Wide Web,
pp. 697–706. ACM (2007)
43. Stasch, C., Schade, S., Llaves, A., Janowicz, K., Bröring, A.: Aggregating linked sensor data. Semantic Sensor Networks, 46 (2011)
44. Ayers, A., Völkel, M.: Cool URIs for the semantic web. Working Draft, W3C (2008)
45. Bizer, C., Schultz, A.: The berlin sparql benchmark. International Journal on Se-
mantic Web and Information Systems (IJSWIS) 5(2), 1–24 (2009)
46. Bizer, C., Cyganiak, R.: D2r server-publishing relational databases on the semantic
web. In: Proceedings of the 5th International Semantic Web Conference, p. 26
(2006)
47. Gil, Y., Artz, D.: Towards content trust of web resources. Web Semantics: Science,
Services and Agents on the World Wide Web 5(4), 227–239 (2007)
48. Hartig, O.: Provenance information in the web of data. In: Proceedings of the
WWW 2009 Workshop on Linked Data on the Web, LDOW 2009 (2009)
49. Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska,
N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E., Van den
Bussche, J.: The open provenance model core specification (v1.1). Future Genera-
tion Computer Systems 27(6), 743–756 (2011)
50. Belhajjame, K., B’Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth,
P., Klyne, G., Lebo, T., McCusker, J., Miles, S., Myers, J., Sahoo, S., Tilmes, C.:
PROV-DM: The PROV data model (2013)
51. De Nies, T., Coppens, S., Mannens, E., Van de Walle, R.: Modeling uncertain
provenance and provenance of uncertainty in W3C PROV. In: Proceedings of the
22nd International Conference on World Wide Web Companion, Rio de Janeiro,
Brazil, pp. 167–168 (2013)
52. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Silk-a link discovery framework for
the web of data. In: Proceedings of the International Semantic Web Conference
2010 Posters & Demonstrations Track. Citeseer (2009)
53. Ngomo, A.C.N., Auer, S.: Limes: a time-efficient approach for large-scale link dis-
covery on the web of data. In: Proceedings of the Twenty-Second International
Joint Conference on Artificial Intelligence, vol. 3, pp. 2312–2317. AAAI Press
(2011)
54. Sequeda, J., Corcho, O., Taylor, K., Ayyagari, A., Roure, D.D.: Linked stream
data: A position paper. In: Proceedings of the 2nd International Workshop on Se-
mantic Sensor Networks (SSN 2009) at ISWC 2009. CEUR Workshop Proceedings,
vol. 522, pp. 148–157 (November 2009)
55. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It’s a streaming world!
reasoning upon rapidly changing information. IEEE Intelligent Systems 24(6),
83–89 (2009)
56. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and
adaptive approach for unified processing of linked streams and linked data. In:
Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N.,
Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 370–388. Springer,
Heidelberg (2011)
57. Khan, M., Khan, S.S.: Data and information visualization methods, and interactive
mechanisms: A survey. International Journal of Computer Applications 34(1), 1–14
(2011)
58. Brunetti, J.M., Auer, S., García, R.: The linked data visualization model. In: In-
ternational Semantic Web Conference (Posters & Demos) (2012)
59. Cyganiak, R., Reynolds, D.: The RDF Data Cube Vocabulary (2013),
http://www.w3.org/TR/2013/CR-vocab-data-cube-20130625/
Benchmarking Internet of Things Deployment:
Frameworks, Best Practices, and Experiences
1 Introduction
Tackling global challenges is nowadays considered an important driver of Information and Communication Technologies (ICT) development policies, research programs, and technological innovations. Emerging ICT technologies such as the Internet of Things (IoT) promise to alleviate or solve many of our planet’s problems in areas such as resource shortage and sustainability, logistics and healthcare, by an efficient integration of the real world with the digital world of modern computer systems on the Internet. The underlying vision is that they will enable the creation of a new breed of so-called smart services and applications, able to make existing business processes more effective and efficient and the delivery of services more personalized to the needs and situation of individual users and society at large.
During recent years, IoT has undergone a vast amount of research and initiatives driven by individual organizations to experiment with the potential of IoT while trying to gain a ‘large’ footprint on the envisioned large market. This resulted in the creation of many experiments, driven either by research consortia or by large industries. While technologies are becoming mature, recent analyses demonstrated a number of shortcomings:
Nevertheless, there is consensus on the breadth of the IoT over many of its characteristics:
Several issues hamper the large-scale deployment of IoT along both technical and nontechnical dimensions. The table below lists some of the issues as they are documented in the IoT-related literature [2][3][4]:
1 “The Internet of Things will include 26 billion units installed by 2020. IoT product and service suppliers will generate incremental revenue exceeding $300 billion, mostly in services, in 2020”, according to Gartner analysts [1].
2 Preliminary Concept
The smart city concept and the challenges encountered when deploying it in the field show similarities with a public policy or program, as it is a multi-stakeholder process which involves a public decision, operators, and beneficiaries, each of whom may have orthogonal expectations and varying involvement in the different stages of the process. Another element that makes the analogy significant is the long duration required to realize the impacts corresponding to long-term social and environmental expectations.
The perspective of analysis followed by this work to characterize smart city deployments aims at encompassing the global dimensions of smart cities. The main focus of the analysis relates to the added value produced by the implementation of such a concept for the various stakeholders involved, which is also equivalent to analysing the production of outputs, outcomes, and impacts in these cases.
The classical Patton evaluation framework [5] has been adapted for the smart cities case. The following paragraphs explain the main concepts of public policy evaluation and their mapping to smart city components.
Public policy is defined through its programming cycle, based on three main steps:
• Policy design, which consists of identifying the public (or societal) problem that needs to be addressed and defining the objectives of the policy;
• Policy implementation, which consists of the implementation and operational inputs needed to achieve the objectives. Inputs can come from various sources: resources (human and financial), but also physical facilities, legal aspects, etc.
• Policy completion, which consists of the production of the outcomes and impacts which are expected to match the original objectives.
• Equivalent to policy design is the definition of the general purpose of the smart city. This covers a large number of societal expectations such as energy, health, inclusion, climate change, and all the great challenges the European Union wants to address through its programmes and policies. These objectives are subdivided into several specific objectives. Taking the example of the climate change priority, the overall objective of the policy could be translated into decreasing the carbon footprint. The actions undertaken to achieve this are broad and cover several areas: energy efficiency, mobility, etc.
• Equivalent to policy implementation is the deployment process: deployment of sensors, platforms, applications, and services, but also nontechnical inputs such as regulations and legal capabilities, human and financial resources, and many more. Following the above example, on the topic of energy efficiency, the implementation could be split into a diversity of actions, each of them implying dedicated realisations:
─ refurbishing public buildings;
─ incentivizing renovation of private buildings where possible;
─ improving public transport and overall urban transport management;
─ increasing energy efficiency of public lighting;
─ etc.
Defining these actions is the heart of the policy design and is operationalized into
specifications that the deployments should answer.
• Finally, equivalent to policy completion is the achievement of the impacts expected to be induced by the deployment of the smart city components (for instance based on IoT) and their applications and services. As for a public policy or program, an IoT deployment is here considered as a means to produce intended impacts.
Fig. 1. Smart city implementation as a public policy
Considering a smart city as a public policy then enables the use of the classical evaluation tools and methods to observe, monitor, and assess it. Evaluation methods have been proposed and improved over the years. They consist of examining the overall consistency of the policy cycle and the extent of the match between the preliminary intervention logic and the observed outcomes and impacts. They allow highlighting the causal hypotheses made under the intervention logic, which are often unclear and unshared among stakeholders.
Conceptually, as shown in Fig. 2, evaluation is based on six evaluation criteria: (1) Effectiveness, which consists of analysing to what extent the real observed outcomes and impacts correspond to the objectives, i.e., the primary expected impacts, and of highlighting the unexpected effects of the implementation of the policy or programme; (2) Efficiency, which consists of analysing to what extent the “amount” of observed effects is consistent with the inputs: could the policy or program have the same effects with fewer inputs? Or, inversely, could the policy have more effects with more inputs? (3) Relevance consists of asking or re-asking about the logic of the intervention: do the hypotheses made on causal relationships actually answer the needs and challenges, or are some of them false and should be removed? (4) Utility consists of analysing the nature of the observed effects with regard to social needs and demand. This criterion is seldom considered in evaluations mandated by policy makers, as the existence of the policy itself is seldom questioned. (5) Coherence consists of analysing the consistency of inputs and implementations with regard to the objectives of the programme or policy: are the realizations and actions undertaken in line with the objectives, or have critical aspects been missed that impede the correct implementation of the programme?
Fig. 2. Evaluation criteria in the context of a public policy/program, adapted from Means [6]
This conceptual framework has been used as a starting point to build our own “benchmarking framework” containing the dimensions, criteria, and indicators (or metrics) needed to characterize IoT deployments in the context of smart cities. Modeling IoT deployments in such a way has the benefit of including the external context of the IoT deployment along its entire lifecycle: (1) upstream, regarding the challenges and objectives expected to be achieved, (2) downstream, regarding the outcomes and impacts (even unpredicted ones) it produces.
This is something new in the context of technological deployment, where evaluations are mainly based on technological performance and on improving the technical characteristics to do better in a more efficient way or at a lower cost, for instance.
A key element in the development toward smarter cities is that the whole value network of actors and stakeholders is involved in the creation of new service offerings in cities. This cooperation is often referred to as value network or ecosystem analysis [7], but what is actually at stake is the creation of business models for these new services that incorporate an active role for public bodies, rather than simply assuming a purely commercial logic. Looking at the overall dimensions, and not only the technological one, offers a new perspective of analysis based on a systemic approach which includes the diverse and sometimes diverging interests of the involved stakeholders, including public authorities seeking a harmonious and sustainable development of their city. Beyond technological challenges, all of these aspects are of paramount importance to allow a transfer from technology development to market exploitation in a sustainable way.
Therefore, the developmment of a benchmarking framework for IoT deploymentts in
the framework of smart citties requires to go beyond pure technological approachh to
include the business ecosy ystem dimension, including the specific role of public au-
thorities. In order to undersstand business models related to IoT deployments in sm mart
cities, we need to think about structured approaches to provide answers for the diveerse
and complex questions co ompanies, citizens, and governments face there. Relaated
considerations are detailed d in the following paragraph, starting from the originss of
business modeling literaturee.
3 Methodology
Defining the metrics to be captured is the basis of any benchmarking framework, which needs comparable datasets to build the comparative analysis. Before carrying out the comparative exercise of a benchmark, there is thus a need to obtain a standalone measure of the deployment under evaluation.
In the present framework, the specific interest is in industrial Internet of Things deployments. Such deployments exhibit diverse and broad characteristics; it is thus necessary to structure the information to be collected.
The preliminary intention is to propose a benchmarking framework based on a holistic model that could serve different purposes. Each person willing to develop a benchmark could then select a subset of the proposed metrics, depending on their own intentions.
This section proposes a classification for the metrics together with their sublevels.
After defining the overall approach for the benchmarking framework and the nature of the targeted indicators (objectives, input, process, and output) relevant to the evaluation, there is a need to identify the areas of interest of the benchmark, later referred to as benchmarking dimensions.
To identify these dimensions, the proposed work started from the STEEPLE analysis framework: the PEST (Political, Economic, Social, and Technological) framework is a traditional approach for the strategic analysis of a macro-environment. One of its recent extensions is known as STEEPLE, covering the Social, Technological, Economic, Environmental, Political, Legislative, and Ethical dimensions. This framework is classically used to characterize the external influences on a business system and has been retained as a source of inspiration when building the benchmarking framework. In addition, recent work of the ‘European Research Cluster on the Internet of Things’ acknowledged the contribution of standardization and testing to increasing the interoperability of IoT technologies and thus accelerating their industrial deployment.
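As an illustration of how such dimensions and indicator types might be organized before a benchmark designer selects a subset, the Python sketch below builds a nested structure. The STEEPLE dimension names and the four indicator types come from the text; the example indicators and the helper function are hypothetical, not the framework's actual metric list.

# Minimal sketch of how the benchmarking dimensions could be organized in code.
STEEPLE = ["social", "technological", "economic", "environmental",
           "political", "legislative", "ethical"]
INDICATOR_TYPES = ["objective", "input", "process", "output"]

benchmark = {dim: {t: [] for t in INDICATOR_TYPES} for dim in STEEPLE}

# Hypothetical indicators for two dimensions:
benchmark["technological"]["input"].append("number of deployed sensor nodes")
benchmark["technological"]["output"].append("share of standards-based, interoperable interfaces")
benchmark["economic"]["objective"].append("new businesses attracted to the city")

def select_subset(framework, dimensions):
    """A benchmark designer picks only the dimensions relevant to their own purpose."""
    return {d: framework[d] for d in dimensions}

print(select_subset(benchmark, ["technological", "economic"]))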
While defining the framework, after preliminary interviews, it rapidly appeared
that the overall benchmarking was closely related to the overall business ecosystem of
IoT deployments. Since the general adoption of this concept in the literature related to
the rise of internet-based e-commerce [8], the focus of business modeling has gradual-
ly shifted from the single firm to networks of firms, and from simple concepts of
interaction or revenue generation to extensive concepts encompassing the value net-
work, the functional architecture, the financial model, and the eventual value proposi-
tion made to the user [9][10]. In an attempt to capture these various elements, one
approach has been to consider business modeling as the development of an unambig-
uous ontology that can serve as the basis for business process modeling and business
case simulations [11][12]. This corresponds with related technology design approach-
es [13] aimed at the mapping of business roles and interactions onto technical mod-
ules, interfaces, and information streams. Due to the shifting preoccupation from
single-firm revenue generation towards multi-firm control and interface issues, the
guiding question of a business model has become “Who controls the value network
and the overall system design?” just as much as “Is substantial value being produced
by this model (or not)?” [14].
Based on the tension between these questions, Ballon proposes a holistic business
modeling framework that is centered around control on the one hand and creating
value on the other. It examines four different aspects of business models: (1) the way
in which the value network is constructed or how roles and actors are distributed in
the value network, (2) the functional architecture, or how technical elements play a
role in the value creation process, (3) the financial model, or how revenue streams run
between actors and the existence of revenue sharing deals, and (4) the value proposi-
tion parameters that describe the product or service that is being offered to end users.
For each of these four business model design elements, three underlying factors are
important, which are represented in Fig. 3.
Each of the parameters in the business model matrix is explained in more detail in Table 3. These parameters discuss the various important parts required in understanding ICT-related business models.
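A minimal sketch of that matrix as a data structure is given below. The four design elements and their guiding questions follow the text; since the three underlying parameters per element appear only in Fig. 3 and Table 3 (not reproduced here), the parameter slots are placeholders rather than the actual factor names.

# Sketch only: element names and questions are from the text, parameter names are placeholders.
BUSINESS_MODEL_MATRIX = {
    "value network": {
        "question": "how roles and actors are distributed in the value network",
        "parameters": ["parameter_1", "parameter_2", "parameter_3"],
    },
    "functional architecture": {
        "question": "how technical elements play a role in the value creation process",
        "parameters": ["parameter_1", "parameter_2", "parameter_3"],
    },
    "financial model": {
        "question": "how revenue streams run between actors, including revenue sharing deals",
        "parameters": ["parameter_1", "parameter_2", "parameter_3"],
    },
    "value proposition": {
        "question": "what product or service is being offered to end users",
        "parameters": ["parameter_1", "parameter_2", "parameter_3"],
    },
}

def describe(deployment, answers):
    """Qualitative description of a deployment against every matrix cell (cells default to 'not documented')."""
    return {elem: {p: answers.get((elem, p), "not documented") for p in spec["parameters"]}
            for elem, spec in BUSINESS_MODEL_MATRIX.items()}

profile = describe("hypothetical-smart-city",
                   {("value network", "parameter_1"): "city council + utility + telecom operator"})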
Using these inputs, five dimensions have been proposed to characterize the IoT deployments, in addition to two transversal ones providing the general description of the deployment and its overall appreciation2. These five dimensions are:
2
This framework has been produced on behalf of the PROBE-IT project and is documented in the project deliverables.
deployments. As the set of technologies related to IoT is broad, the framework categorizes the problem domain into the following four IoT technology dimensions:
─ IoT services and applications – that leverage an underlying IoT infrastructure in order to improve the effectiveness and efficiency of existing services in an organization, or enable the introduction of novel services made possible by the IoT.
4 Current Situation for IoT Deployments in Smart Cities
A subset of the developed taxonomy has been used to analyze 12 IoT deployments around the world (Europe, Africa, Asia, and Latin America). These deployments have been chosen for their maturity and their smart city orientation. Most of those deployments operate at a citywide scale and provide a variety of smart services (energy savings, light monitoring, etc.). Fig. 4 portrays the smart cities participating in the study, showing their geographical distribution.
• Asia: Beijing (China), Hengqin (China), Abu Dhabi (UAE)
• Africa: Johannesburg (South Africa)
• Latin America: Aparecida (Brazil), Rio de Janeiro (Brazil)
• Europe: Aarhus (Denmark), Barcelona (Spain), Nice (France), Birmingham (England), Manchester (England), Santander (Spain), Karlsruhe (Germany)
Based on the maturity of actual deployments and related services and on the data available, three areas – technology, economy, and human factors – were explored. The socio-environmental dimension and legal aspects could not be considered in depth, as no or only poor information was available on these aspects. For each case, literature research and interviews were conducted. The data collected was synthesized into a common matrix, illustrated in Table 4.
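As a rough illustration of this synthesis step, the following Python sketch (using pandas, which is an implementation choice and not something prescribed by the study) arranges per-city findings into one comparable matrix. The cell values are placeholders loosely paraphrasing the cases discussed below, not the project's actual data.

import pandas as pd

# One dict per studied deployment; one column per explored area (placeholder values).
cases = [
    {"city": "Barcelona",    "technology": "city-wide sensing and ICT infrastructure",
     "economy": "business attractiveness", "human_factors": "public transport users"},
    {"city": "Santander",    "technology": "IoT testbed",
     "economy": "open-data ecosystem",      "human_factors": "participatory sensing"},
    {"city": "Johannesburg", "technology": "transport-related sensing",
     "economy": "investor attraction",      "human_factors": "congestion relief"},
]

matrix = pd.DataFrame(cases).set_index("city")
print(matrix)   # one comparable row per deployment, one column per area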
Looking through the 12 smart city cases, it is possible to discern three high-level
objectives that articulate the smart city vision, each of which addresses specific
challenges:
• Societal improvement: Some cities would like to create a better life for their citizens and are primarily interested in leveraging IoT technologies to address the frustrations of daily life and improve citizens’ well-being. These cities are increasingly looking to improve the effectiveness and efficiency of city public services like education, healthcare, public safety, transportation, utilities, etc.
This is exemplified by the smart city agenda of Birmingham, where the city’s objective is to provide citizens with a better quality of life and economic prosperity. Other examples include Barcelona and Johannesburg, where the challenge is transport congestion and the improvement of public transport services is a building block of the smart city agendas.
• Economic growth: Other cities create a high quality of life and a robust city infrastructure with the aim of being a business hub that attracts businesses and employees to their areas and creates new business and employment opportunities. The attractiveness and reputation given by this kind of smart city project help cities move toward global competitiveness.
This was clearly the case in the cities of Johannesburg, Barcelona, and Hengqin, which are building an advanced ICT infrastructure to attract companies and investors. Other cities, such as Santander, created a testbed for smart city services and IoT technologies focused on open data and made it available to attract businesses and stimulate innovation in technologies based on real-time data about the city’s infrastructure.
• Environmental sustainability: Most cities share a set of challenges related to environmental sustainability, such as increasing energy needs, pollution, and waste production. They are under pressure to use energy more efficiently and to improve the environment through lower pollution and carbon emissions.
To cope with the environmental challenges, cities are looking to use IoT technolo-
gies to increase the efficiency and effectiveness of key municipal services, such as
waste and water management and street lighting, and to monitor cities’ progress to-
ward climate change mitigation. The best example of this is Amsterdam city, where
reducing energy consumption and more efficient energy usage were the key objec-
tives for the initiation of the smart city project.
It is worth mentioning that the three high-level objectives are not mutually exclusive. They are all major reasons behind the initiation of smart city projects, and in a specific smart city context another objective may also be present but considered less important.
The study of smart city cases revealed a variety of players involved in the
development of smart city projects: the mayor, the city council, the city
municipalities, the utilities authority, the service providers, the network operator, the
networking equipment suppliers, etc. This large ecosystem of stakeholders can be
classified into four main roles: (1) Policy makers (e.g., city council, city
municipalities, top-level government officials, etc.), (2) Operators / service providers (e.g., city services, etc.), (3) Technology providers (e.g., big industries, innovative SMEs, etc.), (4) Users (e.g., citizens), confirming the preliminary assumptions made in
the benchmarking framework [15].
Table 4. Stakeholder roles (policy makers; technology/service providers; users) mapped to the high-level city objectives: citizen well-being (education, etc.), business attractiveness / creation of new business, good governance / business transformation, resources monitoring, and energy saving.
Table 4 links each stakeholder role to the high-level objectives of the city. Policy makers generally seek to deliver a better life for businesses and citizens with a limited and shrinking budget – they are looking to promote the use of IoT technologies to effectively provide public services like education, healthcare, public safety, transportation, good governance, etc., while achieving energy savings and monitoring city resources. They also play a key role in launching smart city initiatives and attracting sponsors, whereas businesses and technology providers are looking to develop innovative city services and create new business models.
At the same time, citizens are expecting more from their cities. They are looking for a high quality of life and optimal conditions for professional development. They are seeking personalized services that make them more efficient and effective, and they are ready to pay for services that remove some of the frustrations of everyday life.
Despite the disparity of stakeholders’ interests, it is worth mentioning that strong partnership building is a must for the achievement of a common city vision, flagship projects, effective collaboration, and synergy; otherwise, competing interests can result in delays and even the cancelation of smart city projects.
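The sketch below illustrates, in Python, how the role-to-objective mapping of Table 4 could be encoded and queried to spot shared interests. The role and objective names follow the text, but the exact assignment of objectives to roles is an indicative reading of the discussion above, not the original table's cells.

# Indicative mapping only; adjust per deployment.
STAKEHOLDER_OBJECTIVES = {
    "policy makers": ["citizen well-being", "good governance",
                      "energy saving", "resources monitoring"],
    "operators / service providers": ["creation of new business", "business transformation"],
    "technology providers": ["creation of new business", "business attractiveness"],
    "users (citizens)": ["citizen well-being"],
}

def roles_pursuing(objective):
    """Which stakeholder roles share an interest in a given city objective?"""
    return [role for role, objs in STAKEHOLDER_OBJECTIVES.items() if objective in objs]

# Shared objectives hint at where partnership building matters most.
print(roles_pursuing("citizen well-being"))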
As shown in the table above, a wide range of smart services can be deployed as part of smart city initiatives. These can be grouped into the following domains:
• Transportation - to reduce traffic congestion and make travel more efficient, se-
cure, and safe.
• Environment – to manage and protect city resources, control costs, and deliver
only as much energy or water as is required while reducing waste.
• Building – to improve the quality of life in city buildings (e.g., home, commercial
buildings, etc.).
• Education – to increase access to educational resources, improve quality of educa-
tion, and reduce costs.
• Tourism – to improve access to cultural sites and improve entertainment services.
• Healthcare – to improve healthcare services availability, provide wellness and
preventive care, and become more cost-effective.
• Public safety – to use real-time information to anticipate and respond rapidly to
emergencies.
• Agriculture – to improve the management of agriculture.
• City management – to streamline management and deliver new services in an
efficient way.
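For illustration, the sketch below encodes these domains as a small taxonomy and checks how many of them a given deployment touches. The domain names follow the list above, while the example services and the helper function are hypothetical.

# Illustrative grouping of concrete smart services into the listed domains.
SERVICE_DOMAINS = {
    "transportation":  ["smart parking", "traffic-flow monitoring"],
    "environment":     ["smart metering", "water-leak detection"],
    "building":        ["home energy dashboards"],
    "education":       ["open educational platforms"],
    "tourism":         ["connected cultural sites"],
    "healthcare":      ["remote wellness monitoring"],
    "public safety":   ["real-time incident alerting"],
    "agriculture":     ["soil-moisture sensing"],
    "city management": ["street-lighting control", "waste-collection routing"],
}

def domain_coverage(deployed_services):
    """Which of the listed domains a given deployment touches."""
    return sorted(d for d, examples in SERVICE_DOMAINS.items()
                  if set(examples) & set(deployed_services))

print(domain_coverage(["smart parking", "smart metering"]))  # -> ['environment', 'transportation']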
The benchmarking of the smart cities highlighted the infant nature of smart city deployments so far. Most of the cities are still at the stage of their initiators’ vision, and even where some services are provided, in most cases they remain poor in terms of value creation and far away from the revolution promised by these smart cities.
Bringing IoT solutions to life and making them economically viable requires a proper understanding of the genuine business opportunities that may be created out of this new emerging market. The attempts currently being experimented with do not seem sustainable over time, and the foundations that are beginning to appear seem to fall short of building a full ecosystem.
Looking at the IoT deployment cases, we may easily notice that the manufacturing, deployment, and management of smart devices, as well as the analytical opportunities arising from big streams of real-time data, all represent huge business opportunities. All of these elements, alone or combined, interest many stakeholders, but we observe a lack of a common view of a ‘generic’ value chain. The creation of value is implicit and not shared among the various stakeholders involved in IoT deployments, and questions such as “who is paying for what?” and “who is getting money for what?” have no clear answer.
Consequently, there is a need for a conceptual framework which identifies the different elements composing this value chain, the potential actors, their functions, and their interactions in order to organize an overall ecosystem.
Based on the cases studied, Fig. 6 depicts a proposed value chain which identifies the elements that create and bring value to the ecosystem.
Within this value chain, three underlying components can be sketched out: (1) infrastructure, (2) data management, and (3) applications & services.
On one hand, the infrastructure layer includes the IT and IoT infrastructure which make it possible to collect, store, and circulate data. Three different kinds of components are distinguished:
• IoT infrastructure provisioning covers the wide range of emerging providers of smart devices such as sensors, actuators, cameras, traffic lights, smart meters, etc. The value created by this element in the chain is related to manufacturing and deploying such devices as well as improving, for instance, their power, their size (smaller), their energy consumption, and their connectivity capacities.
• Connectivity provisioning covers all aspects related to the network infrastructure used to connect smart devices. Numerous technologies are available, such as ZigBee, Z-Wave, WLAN, WiFi, GPRS, HSPA, DSL, FTTx, LTE, MTC, and 3G/4G. The value created by this element in the chain relies on the transmission of information from devices to servers and between devices. This element is not specific to IoT deployments, and for now, the rules in place follow the scheme of traditional mobile communications. As a consequence, network operators are trying to offer services that go beyond connectivity provisioning in order to increase their revenues and value share.
• IT infrastructure covers the aspects related to the storage and recording of the transmitted data (e.g., cloud computing). IoT and mobile services based on real-time communication increase the need for such infrastructures to enable real-time access to data as well as network-based computing services. These infrastructures are gaining in importance based on the new services and opportunities offered by the new connectivity capacity and the ever larger generation of data.
On the other hand, the services and applications layer includes the services delivered to users. The connected devices enable (or are expected to enable) the creation of applications and services which address the specific needs of vertical markets (e.g., supply chain management in manufacturing and logistics management in the transportation industry). But in the case of IoT, business opportunities are perceived as even larger, particularly at the crossroads between diverse vertical domains and services (e.g., smart home and smart meter in the Brazil case). This appears as a means to propel the emergence of new IoT service opportunities from cross-application synergies (for instance the “smart-life” concept3) through platforms supporting multiple services. These kinds of platforms are gaining interest and importance for various stakeholders, and noticeably for big companies which initiate discussions with cities to propose ‘city operating systems’ to them, as is the case between Cisco and Barcelona. But cities themselves also want to invest in this, for instance to promote the usage and accessibility of open data and to allow companies (and in particular SMEs) to take benefit from this and propose new services.
The value created by this layer is related to the services delivered to the end user based on sets of data coming from a single source or from multiple sources. But the business models behind it remain very complex, due among other things to the fragmentation of the stakeholders or of the data.
If we summarize these two layers according to their function, the infrastructure layer makes it possible to generate large amounts of data and the services and applications layer uses this data to produce services once it is aggregated, combined, and analyzed. But the link between the two is missing. Based on the analysis of the IoT deployment cases, data management appears to be a “puzzle”, as many questions remain unclear, such as who owns the data, how to pay the data producers and by whom, what is the value of raw data, etc.
Our proposed value chain aims to fill the gap between the infrastructure layer and the services layer by introducing a “data layer” which includes three different elements:
• Raw data consists of data coming directly from sensors and devices (data producers) which may operate in different sectors. These raw data streams trace the information and communication networks formed across a city, record citizens’ identities and their movement patterns, monitor the behaviors of users and machines (e.g., transport systems), etc. They are not necessarily usable in this state, but their aggregation makes it possible to offer services.
• Intelligent data consists of handling the amounts of data generated by several sources. Data collation and aggregation is becoming the heart of value creation. The aim of data aggregation is to deliver to applications only the information relevant to specific requests.
• Data broker consists of organizing the value flow of data. As in a real market, it consists of linking data producers, data managers, and data users, determining the value of data sets, fixing the price and conditions of exchange between stakeholders, and thereby contributing to organizing an ecosystem. Based on the interviews carried out, this particular element appears to be the cornerstone which is missing today. In the different cases analyzed, either this function is not covered, or it is covered by different stakeholders that do not share a common view. This is, for instance, the case in different cities where the city services want to take care of this function in order to foster open data, but have to deal with technology providers which are pushing for their own standards.
3
Terminology proposed by the European project BUTLER, www.butler-iot.eu
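The following Python sketch illustrates the intent of this data layer: infrastructure-side producers register datasets with a broker, which links them to service-side users and attaches a price. Every name, the catalogue structure, and the pricing rule are illustrative assumptions; the chapter proposes the broker as a missing role in the ecosystem, not a specific implementation.

class DataBroker:
    """Organizes the value flow of data between producers and users (sketch)."""
    def __init__(self):
        self.catalogue = {}  # dataset name -> (producer, price per request)

    def register(self, producer, dataset, price):
        # Data producers (infrastructure layer) publish datasets with their conditions.
        self.catalogue[dataset] = (producer, price)

    def request(self, user, dataset):
        # Data users (services/applications layer) obtain data; value flows back to the producer.
        producer, price = self.catalogue[dataset]
        return {"dataset": dataset, "from": producer, "to": user, "fee": price}

def aggregate(raw_streams):
    """'Intelligent data': keep only what is relevant to a specific request."""
    return [record for record in raw_streams if record.get("relevant")]

broker = DataBroker()
broker.register("traffic-sensor-operator", "junction-counts", price=0.01)
print(broker.request("parking-app", "junction-counts"))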
This model is a first approach and could be enriched with what might exist in the world of ‘big data’. It nevertheless reflects the vision of the current stakeholders, and it is noticeable that stakeholders outside the IT or IoT circles (for instance banks or insurers) might also have the opportunity to play a role in organizing this value chain and the business ecosystem.
On another side, the involvement of public stakeholders introduces additional complexity into the conceptual value chain, as their rules are different from those of private actors. The public value concept for IoT deployments is introduced by defining the core principles of a public business model, which come down to the questions “Who governs the value network?” and “Is public value being generated by this network?” Governance and public value are then proposed as the two fundamental elements in business models that involve public actors. The governance parameters align with the value network and functional architecture already included in the benchmarking framework, whereas the public value parameters detail the financial architecture and value proposition.
The identified public value parameters related to the financial architecture are:
• Return on public investment: This refers to the question whether the expected
value generated by a public investment is purely financial, public, direct, indirect,
or combinations of these, and - with relation to the earlier governance parameters –
how a choice is justified [16]. A method, which is often used in this respect, is the
calculation of so-called multiplier effects, i.e., the secondary effects a government
investment or certain policy might have, which are not directly related to the origi-
nal policy goal.
• Public partnership model: The organizational parameter to consider in this case is how the financial relationships between the private and public participants in the value network are constructed and under which legal entities they set up cooperation [17][18]. One example of such a model is the public-private partnership (PPP).
• Public value creation: This parameter examines public value [19][20] from the
perspective of the end user and refers to the justification a government provides in
taking the initiative to deliver a specific service, rather than leaving its deployment
to the market [21]. One such motivation could be the use of market failure as a
concept and justification for government intervention.
• Public value evaluation: The core of this parameter is the question whether or not an evaluation [17] is performed of the public value the government sets out to create, and whether this evaluation is executed before or after the launch of the service.
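By way of illustration, the sketch below lists the four public value parameters as annotations that could extend the business model matrix, together with a toy multiplier-effect calculation for the return-on-public-investment parameter. The parameter names follow the text; the multiplier value and the investment figure are invented for illustration only.

# Sketch: public value parameters as matrix annotations (names from the text, summaries paraphrased).
PUBLIC_VALUE_PARAMETERS = {
    "return on public investment": "financial / public, direct / indirect value, and how the choice is justified",
    "public partnership model":    "financial relationships and legal entities between public and private actors, e.g. a PPP",
    "public value creation":       "why government, rather than the market, delivers the service",
    "public value evaluation":     "whether the intended public value is evaluated, before or after launch",
}

def multiplier_effect(direct_investment, multiplier=1.5):
    """Toy estimate of secondary effects of a public investment beyond its original policy goal."""
    return direct_investment * multiplier - direct_investment

print(multiplier_effect(1_000_000))  # -> 500000.0 of estimated secondary effects (illustrative numbers)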
Now that we have established which parameters are important in a context where a public entity becomes part of the value network in offering a service, and how we interpret the different terms, we propose to include them in a business model matrix that takes public value into account (Fig. 7).
The detailed, qualitative description of all the parameters of this expanded matrix allows for the thorough analysis and direct comparison of complex business models that involve public actors in the value network. Such parameters need to be included in the benchmark of IoT deployments involving public actors, such as in the case of smart cities.
This chapter focused on the analysis of IoT deployments in the context of smart cities, based on a benchmarking of twelve smart city deployments worldwide. The analysis method used consisted of considering IoT deployments as public policies or programmes, which enables the use of classical evaluation tools. This perspective of analysis allowed the consideration of IoT deployments in all their dimensions, not only through their technological characteristics and performance. In particular, this analogy permitted the analysis of IoT deployments through the perspective of the production of effects and impacts with respect to their preliminary objectives. Through this exploratory work, it appears that IoT deployments and smart cities are far from delivering their promised objectives, particularly in fulfilling end users’ expectations. The main problem encountered concerns the lack of identification of the value creation along the chain, from the infrastructure to the services and applications provisioning. In this chapter, we proposed a model of the IoT data value chain organized around the data generation process.
The cornerstone of this value chain relies on a data broker that structures, coordinates, and manages the flow of data from the data producers (at the infrastructure layer) to the data users (at the services and applications layer), thereby making the link between raw data and intelligent data. Public stakeholders introduce an additional complexity that should be taken into account by injecting public value concepts.
This work is of course limited by its exploratory nature, the size and nature of the cases studied, and the people interviewed. Only two categories of stakeholders were interviewed: policy makers and technology providers. These two categories are the ones who actively act to implement smart cities and deploy IoT-based solutions. They act as initiators of such experimentations, and it was perfectly relevant to meet them, especially to gather data on the objectives pursued and on the pitfalls and key success factors encountered in making their vision concrete. But this work could be largely enriched by complementing and confronting their vision (and the observations made through that prism) with the vision of the other stakeholders involved, such as the services and applications providers and the end users. The work reveals that the end users are for now mainly out of the loop, and gathering information about how they perceive the changes that appear in smart cities and the value they attach to these new services could also significantly enrich the analyses.
Another limitation lies in the scope of IoT deployments in the smart city context. It appears that this scope was not straightforward for everyone and covers various realities in the different cases. IoT deployments and smart cities are still immature. In most of the cases it is still a concept and does not correspond to a concrete reality. In this context, gathering information on objectives and impacts suffers from a lack of accuracy and reflects a vision more than a reality. It would be interesting to pursue these observations over time in order to analyze how smart cities are “becoming real” and evolving with respect to their objectives.
Finally, the proposed models of the data value chain and of the business model should be considered as a first step to foster a way for the different stakeholders to build a comprehensive business strategy (or business strategies) taking into account the divergence of interests among them. Clarifying the value creation at this stage, and the value flows among players, gives them a new perspective to define their positioning and to identify the other players they are competing with, but also those with whom they may collaborate. The introduction of public value into the loop also offers a new way to understand the motivations and drivers of the different stakeholders and helps to clarify the rules of the game. Introducing more transparency and building a common and shared vision of the value chain and the ecosystem is, in our view, the best way to enable sustainable and economically viable IoT deployments and smart cities.
The proposed model could be further enriched by including Big Data-related considerations, which exist independently of the context of smart cities.
References
[1] MacGillivray, C., Turner, V., Lund, D.: Worldwide Internet of Things (IoT) 2013–2020 Forecast: Billions of Things, Trillions of Dollars. IDC Market Analysis (2013)
[2] Smith, I.G. (ed.): The Internet of Things 2012, New Horizons (2012) IERC, Casagras2,
ISBN 978-0-9553707-9-3
[3] Gauthier, P., Gonzales, L.: L’Internet des Objets... Internet, mais en mieux (2011) ISBN
978-2-12-465316-4
[4] Rifkin, J.: The Third Industrial Revolution: How Lateral Power Is Transforming Energy, the Economy, and the World (2009) ISBN 978-0230115217
[5] Patton, M.Q.: Utilization-Focused Evaluation. Sage, ISBN 978-1-4129-5861-5
[6] European Commission, MEANS Collection - Evaluation of socio-economic programmes
(1999) ISBN 92-828-6626-2 CX-10-99-000-EN-C
[7] Mazhelis, O., Warma, H., Leminen, S., Ahokangas, P., Pussinen, P., Rajahonka, M.,
Siuruainen, R., Okkonen, H., Shveykovskiy, A., Myllykoski, J.: Internet-of-Things Mar-
ket, Value Networks, and Business Models: State of the Art Report, ch. 2 (2013) ISBN
978-951-39-5249-5
[8] Hawkins, R.: The Business Model as a Research Problem in Electronic Commerce, So-
cio-economic Trends Assessment for the digital Revolution (STAR), IST project, Issue
Report No. 4, SPRU – Science and Technology Policy Research, Brighton (2001)
[9] Linder, J.C., Cantrell, S.: Changing Business Models: Surveying the Landscape. Institute
for Strategic Change, Accenture, New York (2000)
[10] Faber, E., Ballon, P., Bouwman, H., Haaker, T., Rietkerk, O., Steen, M.: Designing busi-
ness models for mobile ICT services. In: Proceedings of 16th Bled E-Commerce Confer-
ence, Bled, Slovenia (2003)
[11] Pigneur, Y.: An ontology for m-business models. In: Spaccapietra, S., March, S.T.,
Kambayashi, Y. (eds.) ER 2002. LNCS, vol. 2503, pp. 3–6. Springer, Heidelberg (2002)
[12] Osterwalder, A.: The business model ontology: a proposition in a design science ap-
proach. PhD thesis, HEC Lausanne, Lausanne (2004)
[13] Gordijn, J., Akkermans, J.M.: E3-value: design and evaluation of e-business models.
IEEE Intelligent Systems 16(4), special issue on E-business, 11–17 (2001)
[14] Ballon, P.: Control and Value in Mobile Communications: A Political Economy of the
Reconfiguration of Business Models in the European Mobile Industry. PhD thesis, De-
partment of Communications, Vrije Universiteit Brussel (2009),
http://papers.ssrn.com/paper=1331439
[15] Vallet Chevillard, S., Le Gall, F., Zhang, X., Gluhak, A., Marao, G., Amazonas, J.R.:
Benchmarking framework for IoT deployment evaluation, Project Deliverable (2013)
[16] Margolis, J.: Benefits, External Economies, and the Justification of Public Investment.
The Review of Economics and Statistics 39(3), 284–291 (1957),
[17] http://www.jstor.org/stable/10.2307/1926044
[18] Bovaird, T.: Public-private Partnerships: From Contested Concepts to Prevalent Practice.
International Review of Administrative Sciences 70(2), 199–215 (2004)
[19] Bovaird, T.: Developing New Forms of Partnership With the ‘Market’ in the Procurement
of Public Services. Public Administration 84(1), 81–102 (2006)
[20] Moore, M.: Creating Public Value: Strategic Management in Government. Harvard Uni-
versity Press (1995)
[21] Talbot, C.: Measuring Public Value: A Competing Values Approach. The Work Founda-
tion Research Report (2008), http://www.theworkfoundation.com/
Assets/Docs/measuring_PV_final2.pdf
[22] Benington, J.: From Private Choice to Public Value? In: Benington, J., Moore, M. (eds.)
Public Value: Theory and Practice, pp. 31–39. Palgrave MacMillan (2011)
Subject Index
D Ecoinformatics 447
ecosystem 474, 475, 479, 480, 488, 491,
Dashboards 311 493, 494, 496
Data 473, 475, 482–488, 490–494, 496 Education 491
data aggregation 114, 122 electric lighting 358
Data analysis 446, 449 encoder 387
Data as a service 287 encounter 31, 32, 36–44, 46–49
Data availability 145, 149, 152, 163 energy management 352
Data Center 1, 3, 27 energy savings 358
data cleaning 211 Enron (Enron Corp.) 389
Data Collection 103 enroute caching 397
Data Compression 294 Environment 491
Data curation 458 environmental 476, 484, 486, 488, 490
Data Deduplication 293 environmental sustainability 212
Data discovery 446, 451, 454 epidemic 30, 39, 40
Data Gathering 90 ETL 299
Data integration 448 Euclidean distance 2
Data life cycle 443, 445, 447, 448, 450, 451
Data management 444, 447, 452, 455 evaluation criteria 478
data mining 211 evaluation framework 476
Data publication 462 evolution 289
Data quality 454
Data replication 145–148, 153, 156 F
data restoration 125
Data storage 460 Failure detection 196
data-centric applications on big-scale, 167 Failure detection algorithm 196
decoder 387 Financial model 482
delivery 29, 30, 38, 39, 41–43, 45, 47–50 fingers 10, 21
dense MWSNs 113 fixed-route release message 125
deployment 473-477, 479–486, 491–496 forwarding 31–33, 36, 41, 43, 44, 48, 49
detection 33, 40–42, 45, 47–50, 52 forwarding area 116
devices 30, 32–41, 46, 48, 50, 474, 475, forwarding route control 114, 124
483, 484, 491–493 forwarding tree 119
DGUMA 114, 119 frequency distribution 3, 19, 20, 22
DGUMA/DA 114, 121 Fully parallel redundancy 192
DGUMA/DAwoRC 139 Fuzzy logic 157
Linked Data O
Linked Open Data 443–445, 454
Linked Data 444, 450, 458, 464 OCP platform 355
load balance 399 on-line social network 405
Locality Sensitive Hashing 3, 5–7 Ontology 458
lookup 20 Open Data 443, 445, 457
opendata project 362
M opportunistic 29–42, 44–51
Orange 228
machine learning 351 Ordinary parallel 191
machine-to-machine 287 OSPF 393
MANET 116 overlay 34, 40, 41
Mapper 189
MapReduce 187, 303 P
MapReduce cluster 188
massively parallel processing 300, 301
P2P 1, 3, 6, 7
memory 39, 48
P2P applications 145, 147, 156,
message 29–33, 39–41, 43–49
P2P Systems 145, 146, 147, 148
meta-business 312
packet 120, 122
Metadata 447
packet collision 141
middleware 355
packet header 116
mobile 29–38, 40, 41, 45, 46, 48, 50, 52
parent 120
mobile agent 114, 119
participatory sensing 113
mobile telecommunication services 285
Payment Card Industry 292
Mobility 31, 32, 40, 43, 44, 46, 47, 288
Performance 475, 483
model 38, 42, 47, 48, 50
performance analysis and optimization tools
multi-hop wireless communication
169
113
performance evaluation 407
MWSNs 113
Periodical P-metadata backup/update 195
MySQL 221, 299 physical link 393
physical phenomenon 115
N physical router 393
Platform as a Service 287
Naïve Bayes Algorithm 225 P-metadata 193
name space design 406 Policy makers 473, 478, 489, 496
NameNode 188 popularity 41, 43–45, 47, 394
NDN-based service platform 405, 424 prediction 29, 34, 39, 43–47, 52
nearest neighbors 6 predictive modeling 214
network bandwidth 114 Privacy 290, 385, 475
network controller 394 privacy-conscious delivery 402
network graph 386 profiles 2–9, 17–20, 22, 24–27
network-attached storage 295 Provenance 450, 454, 457, 463
networking 29–36, 38–52 pub/sub-based controller 394
NFS 295 Public
NL-SATA 295 Public partnership model, 476–479, 483,
NOAA 228 484, 487–490, 494–496
NoSQL 301 Public safety 491
NR 189 Public Sector 289
NS2 199 publish/subscribe 40, 41, 49, 50
put 3, 20, 22 security 32–34, 50, 51, 290, 352, 394, 475,
power law 391 490
selfishness 33, 47–49, 52
Q sensing area 115
sensing cycle 115
sensing point 115
query 2, 3, 7, 8, 12, 13, 17, 19, 20, 22, 24, sensing time 114, 119
25, 27 sensor 29, 30, 34, 36, 37
sensor data 120, 385
R Sensor Networks 89, 214
sensor node 115
Random Hyperplane Hashing 1–3, 5, 6, 13, sensor reading 120
17–19 Sensors 451, 456, 466
random waypoint mobility model 137 servers 2, 3, 7, 12–17, 22, 25, 27
RapidMiner 224 Service Level Agreements 296
RDF 445, 450, 455 services and applications 474, 483, 493,
Real-time analytics 299 496
real-time media services 405 shared-nothing 301
Recall 1, 7, 13, 15, 20–23, 25–27 Similarity Search 1–3, 6, 7, 9, 10, 12–15,
Reducer 189 17, 19, 21, 22, 25–27
Regulation 485 sink 115
regulatory 473, 475 Small world 391
ReHRS 187 smart buildings 343
relational databases 301 smart cities 343
relational graph 385 Smart City 443, 444, 451, 452, 454,
retrieval latency 385 476-478, 485, 487–491, 496
relational metric 385 Smart Data 444
relational metric-based controller 394 smart environments 341
Reliability 100 smart management 346
remote control 356 smartphone 30, 35, 37, 50
Replica distribution 153, Smartphones 444, 457
Replication factor 145, 146, 151, 157, 158, social 32, 34, 37–42, 44–50, 52
160 SocialCast 385
Replication requirements 146, 155 social communication 285
Replication techniques 145–148, 153 Social networks 285, 451, 456
retrieval 2, 6, 9, 11, 13, 20, 22, 25, 26 social perspective 350
RFID 286 socio-economic 476
route fix message 125 Software as a Service 287
routing 11, 12, 16, 17, 27, 29–32, 35, Space Filling Curve 6, 7, 14
38–41, 43–51 SPARQL 445, 461, 466
RR-FOP 193 Spatial information flow 311
RTP 196 spatial sensing range 402
RU-FOP 193 SSD 295
stakeholders 473-476, 478, 479, 483–485,
487–489, 491, 493, 494, 496
S Stakeholders 475, 489
standardisation 480, 483
SAS 295 State/metadata synchronization 193
SCADA 354 STEEPLE 480
Scale-out storage 294 storage area networks 295
RP Replication Percentage
RPC Remote Procedure Call
RSS Really Simple Syndication.
SaaS Software as a Service
SAIL Scalable and Adaptive Internet Solutions
SAN Storage Area Network
SAS Serial Attached SCSI
SCADA Supervisory Control and Data Acquisition
SCF Store-Carry-and-Forward
SFC Space Filling Curve
SLA Service Level Agreement
SLC/MLC-SSD Single-level cell/ Multi-level cell solid state drive
SN Social Network
SNC Saami Network Connectivity
SOA Service Oriented Architectures
SOAP Simple Object Access Protocol.
SP Super-Peer
SPARQL SPARQL Protocol and RDF Query Language.
SQL Structured Query Language, computer language to
manipulate relational database
SRP Scale of Replication per Peer
SRSN Self-Reported Social Network
SSD solid state drives/devices
SWIM Shared Wireless Infostation Model
TECD Time-Evolving Contact Duration
TMM Tree Management Module
TSV Tab Separated Values.
URI Uniform Resource Identifier.
US Uniform Social
VANET Vehicular Ad-Hoc Network
VSM Vector Space Model
W3C World Wide Web Consortium.
WLAN Wireless Local Area Network
WMS Web Map Service.
WOM Word-of-Mouth
XML eXtensible Markup Language.
XOR eXclusive OR
Glossary
α: The total number of consecutive heartbeats that a server has not received from an-
other server.
Hr : A hash value stored in the hash pool.
Hr : The hash value of log record Lr where r ranges from maxID+1 to Z.
Lv : A log record generated by the ReHRS master server whenever a write operation is
initiated or completed, where v≥1 is an incremental record ID.
Mbef , Hbef , and Wbef : They are nodes that act as the master server, HSS, and WSS
before a state transition, respectively.
Maf t , Haf t , and Waf t : They are nodes that act as the master server, HSS, and WSS
after a state transition, respectively.
NuS: It represents that node Nu's state (also called role) is S, where u = 1, 2, 3 and S ∈ {M, H, W, n/a}, in which M, H, W, and n/a respectively stand for the master server, HSS, WSS, and unavailable.
Tearly : The time period from the moment when the WSS is requested to warm itself up
to the moment when it is requested to take over for the HSS.
Tremain : The time required by the ReHRS’s WSS to finish its warm-up process
Ttotal : The total time required by the WSS to finish its warm-up process when it re-
ceives a warm-up request from the commander.
T-metadata: The metadata that is frequently updated.
K: It refers to the largest record ID of the log records currently collected in the HSS's journal.
maxID: The maximum ID of the log records currently collected in the WSS's journal.
Analytics: Using software-based algorithms and statistics to derive meaning from data
Anomaly detection: is the task of identifying events or measured characteristics which
differ from other data or the expected measurements.
Assurance: the process of obtaining and using accurate and current information about
the efficiency and effectiveness of policies and operations, and the status of compliance
with the statutory obligations, in order for management to control an organization’s ac-
tivities.
Audit comfort: the support needed to draw conclusions provided by audit procedures.
Audit procedures: the various tests of details, analytical procedures, confirmations and
other activities performed to gain audit evidence.
Batch layer: Refer to the corresponding processing layer of the Lambda architecture
[38] that processes data in batches.
Batch processing: Refers to processing methods that collect data over a certain amount
of time and then process several data sets as a whole.
Big data: the accumulation of datasets from different sources and of different types that
can be exploited to yield insights.
Big Data analytics framework: An architectural reference model including architec-
tural principles and a set of tools suitable for analytical challenges that require model
learning and near real-time decision making based on large data sets.
Big data: refers to very large datasets such that the data cannot be processed using techniques for smaller data collections; large data management and data processing tools have been developed for big data analysis.
Big Data: We use the term Big Data both to refer to "datasets whose size is beyond the
ability of typical database software tools to capture, store, manage, and analyze" [37].
Business Intelligence: software systems that report, analyze, and present data. These tools use data previously stored in a data warehouse/Data Mart.
Cassandra: an open-source database management system for huge amounts of data on distributed systems. Currently part of the Apache Software Foundation.
CEP: See Complex Event Processing.
Clickstream analytics: The analysis of users’ Web activity through the items they click
on a page
Cloud computing: a computing paradigm in which highly scalable computing re-
sources, often configured as a distributed system, are provided as a service through
a network.
Cold-standby redundancy: When this scheme is employed, a task that fails on one node will be executed by another node from the very beginning.
Complex Event Processing: Refers to processing methods that are tailored for appli-
cation of logic over continuous data flows. Complex event processing derives complex
events from more simple events.
Dashboard: A graphical reporting of static or real-time data on a desktop or mobile
device. The data represented is typically high-level to give managers a quick report on
status or performance
Data center: A physical facility that houses a large number of servers and data storage
devices. Data centers might belong to a single organization or sell their services to many
organizations
Data cleaning: refers to the process of detecting incorrect data in a dataset and remov-
ing or correcting the data in such a manner that it would be useable, such as correcting
the data format.
Data Mart: subset of Data Warehouse ready to be used by Business Intelligence tools
Data mining: refers to retrieving data from a dataset, looking for patterns within this
data, and displaying it in an intelligible way.
Data mining: The process of deriving patterns or knowledge from large data sets
Data set: A collection of data, typically in tabular form
Data Warehouse: a suite of applications and databases optimized for storing and processing large amounts of structured data.
DB, DBMS: Database management system. Software that collects and provides access
to data in a structured format
Design effectiveness: that internal control procedures are appropriate in respect of the
control objective they are intended to achieve and the risk they are intended to address.
Device: Technical physical component (hardware) with communication capabilities to
other IT systems. A device can be either attached to or embedded inside a Physical En-
tity, or monitor a Physical Entity in its vicinity.
E: E could be either the HSS or the WSS. If E is the HSS, then whenever the RTP times out, it checks whether it has received a heartbeat from the WSS during the RTP. If E is the WSS, it checks whether it has received a heartbeat from the HSS during the RTP.
ETL process: Extract, Transform and Load; tools to extract data from sources, transform it for operational needs, and load it into data store architectures (e.g., a data warehouse).
Fully parallel redundancy: It refers to a set of schemes that consists of several servers.
Each of them can offer a transparent takeover when another server fails.
Hadoop: An open-source MapReduce implementation developed by Apache.
Hadoop: Open-source framework to process large data sets on distributed systems. Currently part of the Apache Software Foundation.
Hot-standby redundancy: This scheme provides a backup node to maintain an up-to-date copy of the state of a master server. When the master server fails, the backup node can continue the operation of the master server.
HSO: A hot-standby-only scheme.
HSS: A hot-standby server employed by the ReHRS.
Hybrid Redundant System: The full name of ReHRS.
In-Stream processing: Refers to methods that process data continuously (cf batch pro-
cessing).
Intelligent Transportation Systems: A distributed architecture based on ICT compo-
nents (e.g., sensors, actuators, servers, etc.) and M2M communications that is used to
manage and control a transportation infrastructure.
Internal controls: the activities performed to ensure that policies and procedures are
implemented and operated consistently and effectively, allowing errors and omissions
to be prevented or detected.
Internet of Things: A collection of things having identities and virtual personalities op-
erating in smart spaces using intelligent interfaces to connect and communicate within
social, environmental, and user contexts.
Internet of Things: A global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.
Interoperability: The ability to share information and services. The ability of two or
more systems or components to exchange and use information. The ability of systems to
provide and receive services from other systems and to use the services so interchanged
to enable them to operate effectively together.
represent time series data, if the temperatures were gathered in sequence, over time.
Unstructured data: data sitting outside structured and organized data repositories such
as relational databases.
Unstructured data: data that does not reside in fixed structures like matrices, tables, etc.
Visualization: tools for providing a synoptic view of information
Warm-standby redundancy: When this scheme is employed, the states of the master
server are periodically replicated to a warm-standby node. When the master server fails,
the state replica can be used to restart the operation of the master server.
WSO: A warm-standby-only scheme.
WSS: A warm-standby server employed by the ReHRS
x: A pre-defined threshold to request the WSS to warm itself up.
y: A pre-defined threshold of taking over the failed server.
Z: The largest ID of the log records collected in the journal of the commanders
ZFS: open-source file system and volume manager implemented by Sun Microsystems