
A Framework for Participatory Sensing Systems

2012


University of South Florida
Scholar Commons, Graduate Theses and Dissertations, Graduate School
January 2012

Scholar Commons Citation: Mendez Chaves, Diego, "A Framework for Participatory Sensing Systems" (2012). Graduate Theses and Dissertations. http://scholarcommons.usf.edu/etd/4135

A Framework for Participatory Sensing Systems

by

Diego Méndez Chaves

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Miguel A. Labrador, Ph.D.
Wilfrido Moreno, Ph.D.
Rafael Perez, Ph.D.
Kandethody Ramachandran, Ph.D.
Sudeep Sarkar, Ph.D.

Date of Approval: May 7, 2012

Keywords: spatial interpolation, kriging, PCA, ICA, spatial outliers, user's location, spatial data-mining

Copyright © 2012, Diego Méndez Chaves

DEDICATION

To my family, Víctor, Patty and Lilita, who always support and encourage me to improve every day. To Caitrin and Satrajit, who are always there for me.

ACKNOWLEDGEMENTS

I would like to thank Dr. Miguel Labrador for his invaluable guidance and support, and for his patience and belief in my work. I would also like to thank the committee members, Dr. Wilfrido Moreno, Dr. Rafael Perez, Dr. Kandethody Ramachandran, and Dr. Sudeep Sarkar, for taking the time to be a part of my committee and providing important input. I would like to thank the Environmental Protection Commission of Hillsborough County and emQbit for their technical support and contributions. I would like to thank everyone in my research group for their important feedback regarding my research. And to all of my friends, Idálides, Óscar, Ingo, Mehrgan, Luis and Alfredo, for sharing all these great experiences over the past years.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 Contributions
    1.3.1 A Framework for the Design and Implementation of PS Systems
    1.3.2 Data Verification
    1.3.3 Data Visualization
    1.3.4 Density Maps
    1.3.5 Structure of the Dissertation

CHAPTER 2 LITERATURE REVIEW
  2.1 Data Verification
    2.1.1 Neighborhood Generation
    2.1.2 Spatial Outlier Detection Algorithms
  2.2 Data Visualization
  2.3 Density Maps
  2.4 Related PS Applications

CHAPTER 3 P-SENSE AND A FRAMEWORK FOR PS SYSTEMS
  3.1 P-Sense: A PS System for Air Pollution Monitoring
    3.1.1 Sensing Devices
    3.1.2 First-Level Integrator
    3.1.3 Data Transport Network
    3.1.4 Server
    3.1.5 Pollution Data Collected
  3.2 A Framework for PS Systems
    3.2.1 Sample Size Determination
    3.2.2 Data Collection
    3.2.3 Data Verification
    3.2.4 Data Visualization
    3.2.5 Density Maps

CHAPTER 4 DATA VERIFICATION
  4.1 Modeling the Generation of Invalid Data
    4.1.1 Location Sampling
    4.1.2 Sensor Malfunctioning and Lack of Accuracy
    4.1.3 Malicious Attacks
      4.1.3.1 Random Attacks
      4.1.3.2 Cooperative Attacks
  4.2 The Hybrid Algorithm for Data Verification
    4.2.1 Local Outlier Detection Algorithm
    4.2.2 Neighborhood Definition
      4.2.2.1 Delaunay Triangulation
      4.2.2.2 Gaussian Mixture Models
      4.2.2.3 Selection of the Neighbors for Each Location
    4.2.3 Spatial Outlier Detection for Multiple Attributes
      4.2.3.1 General S-Outlier Detection and Removal
      4.2.3.2 Hybrid S-Outlier Detection and Removal
    4.2.4 Computational Complexity
  4.3 Experiments and Analysis
    4.3.1 Random Attacks
      4.3.1.1 Conspiracy Level
      4.3.1.2 Conspiracy Percentage
    4.3.2 Execution Time
    4.3.3 Cooperative Attacks
  4.4 In Conclusion: Data Verification

CHAPTER 5 DATA VISUALIZATION
  5.1 MRF vs Kriging for PS
    5.1.1 Markov Random Fields and Gibbs Sampling
    5.1.2 The Kriging Technique
    5.1.3 Modeling a PS Scenario for Data Interpolation
    5.1.4 Results and Comparison
  5.2 One Variable Spatial Interpolation for PS
  5.3 Multivariate Spatial Interpolation for PS
    5.3.1 Kriging and PCA
    5.3.2 Kriging and ICA
  5.4 Interpolation in Time and Space
    5.4.1 The Grid and Delta Parameters
    5.4.2 Multivariate Kriging in 3D
  5.5 In Conclusion: Data Visualization

CHAPTER 6 DENSITY MAPS
  6.1 Density Map of Users
    6.1.1 The Univariate Case
    6.1.2 A Multivariate Extension
  6.2 Experiments and Analysis
    6.2.1 The Univariate Case
    6.2.2 Increasing the Number of Users per Region
    6.2.3 The Multivariate Case
  6.3 In Conclusion: Density Maps

CHAPTER 7 CONCLUSIONS

REFERENCES

APPENDICES
  Appendix A: P-Sense - Permission for Reuse
  Appendix B: G-Sense - Permission for Reuse

LIST OF TABLES

Table 4.1 The accuracy and range of the pollution sensors.
Table 4.2 Parameters used in the experiments.
Table 5.1 Maximum R2 for each variable with the corresponding grid size and delta time parameters.
Table 5.2 Configuration of the expert predictors for all the variables.
Table 6.1 Coefficient of determination R2 and total number of users.

LIST OF FIGURES

Figure 1.1 Resolution of the traditional system (each cell is 2.5 × 2.5 km) versus the PS system (each cell is 50 × 50 m).
Figure 1.2 Participatory Sensing areas of research.
Figure 1.3 CO interpolated using real data collected at the USF campus, with (left) and without (right) data verification.
Figure 3.1 P-Sense's system architecture.
Figure 3.2 The ArduinoBT board and the environmental sensors utilized in P-Sense.
Figure 3.3 Graphical user interface.
Figure 3.4 Mobile-side software architecture.
Figure 3.5 A Bluetooth environmental data frame.
Figure 3.6 Server application software architecture.
Figure 3.7 Visualization Web service.
Figure 3.8 Effect of the aggregation process.
Figure 3.9 A general framework for PS applications and its associated techniques.
Figure 4.1 A snapshot of the variables of interest over the USF campus after the data verification process.
Figure 4.2 The generation of the areas of influence.
Figure 4.3 The hybrid algorithm for data verification and removal.
Figure 4.4 The local and spatial outlier detection methods.
Figure 4.5 The two steps of the Hybrid approach.
Figure 4.6 The neighborhood for the Sparse (left) and Dense (right) locations.
Figure 4.7 The ratio between dense and sparse locations versus the conspiracy percentage.
Figure 4.8 R2 versus the conspiracy level (left) and the conspiracy percentage (right) for all four variables and the three algorithms.
Figure 4.9 The execution time of the three algorithms for data verification.
Figure 4.10 The R2 for the cooperative attacks where only some of the users surrounding the center cooperate.
Figure 4.11 The R2 for the cooperative attacks where all of the users surrounding the center cooperate.
Figure 4.12 The effect of the verification process in the final interpolation for temperature, RH, CO2 and CO.
Figure 5.1 Cliques definition.
Figure 5.2 Experimental semivariogram and the fitted curve of an isotropic spherical model.
Figure 5.3 The map (top row) and surface (bottom row) representation of the synthetic data created for the experiments.
Figure 5.4 A comparison between the results for the spatial interpolation using kriging (top row) and MRF (bottom row).
Figure 5.5 R2 for MRF under different conditions of noise and missing ratios.
Figure 5.6 R2 for kriging under different conditions of noise and missing ratios.
Figure 5.7 An example of the result of spatial interpolation using kriging.
Figure 5.8 The proposed kriging plus PCA/ICA method, versus multivariate cokriging.
Figure 5.9 Contour map for the spatial interpolation using kriging with PCA (top row) and kriging with ICA (bottom row).
Figure 5.10 Modeling time using an additional dimension.
Figure 5.11 R2 as a function of the grid size and the delta time.
Figure 5.12 Known and predicted values after using kriging interpolation along with PCA and time information.
Figure 6.1 Contour maps for all variables (temperature, RH, CO2, CO, air quality, and combustible gasses) in the USF campus.
Figure 6.2 Gradient field and contour maps for CO2 in the USF campus.
Figure 6.3 The linear mapping from the Mean Magnitude of Change (MMC) per region to the number of locations per region.
Figure 6.4 R2 for all variables in the univariate case using CO as the variable of control.
Figure 6.5 Location of the users after 10 iterations for all variables, when using only CO as the control variable.
Figure 6.6 Increasing the number of users (max loc) per region.
Figure 6.7 R2 for all variables for the multivariate case, using RH and CO as the control variables.

ABSTRACT

Participatory sensing (PS) systems are an emerging sensing paradigm based on the participation of cellular users in a cooperative way.
Due to the spatio-temporal granularity that a PS system can provide, it is now possible to detect and analyze events that occur at different scales, at a low cost. While PS systems present interesting characteristics, they also create new problems. Since the measuring devices are cheaper and they are in the hands of the users, PS systems face several design challenges related to the poor accuracy and high failure rate of the sensors, the possibility of malicious users tampering with the data, the violation of the privacy of the users, methods to encourage the participation of the users, and the effective visualization of the data. This dissertation presents four main contributions that address some of these challenges.

The first contribution is a framework to guide the design and implementation of PS applications considering all these aspects. The framework consists of five modules: sample size determination, data collection, data verification, data visualization, and density maps generation. The remaining contributions map one-on-one to three of the modules of this framework: data verification, data visualization, and density maps.

Data verification, in the context of PS, consists of the process of detecting and removing spatial outliers to properly reconstruct the variables of interest. A new algorithm for spatial outlier detection and removal is proposed, implemented, and tested. This hybrid neighborhood-aware algorithm considers the uneven spatial density of the users, the number of malicious users, the level of conspiracy, and the lack of accuracy and malfunctioning sensors. The experimental results show that the proposed algorithm performs as well as the best estimator while reducing the execution time considerably.

The problem of data visualization in the context of PS applications is also of special interest. The characteristics of a typical PS application imply the generation of multivariate time-space series with many gaps in time and space. Considering this, a new method is presented based on the kriging technique along with Principal Component Analysis and Independent Component Analysis. Additionally, a new technique to interpolate data in time and space is proposed, which is more appropriate for PS systems. The results indicate that the accuracy of the estimates improves with the amount of data, i.e., one variable, multiple variables, and space and time data. Also, the results clearly show the advantage of a PS system compared with a traditional measuring system in terms of the precision and spatial resolution of the information provided to the users.

One key challenge in PS systems is determining the number of users and the locations from which to obtain samples, so that the variables of interest can be accurately represented with a low number of participants. To address this challenge, the use of density maps, a technique based on the current estimations of the variable, is proposed. The density maps are then utilized by the incentive mechanism in order to encourage the participation of those users indicated in the map. The experimental results show how the density maps greatly improve the quality of the estimations while maintaining a stable and low total number of users in the system.

P-Sense, a PS system to monitor pollution levels, has been implemented and tested, and is used as a validation example for all the contributions presented here.
P-Sense integrates gas and environmental sensors with a cell phone in order to monitor air quality levels.

CHAPTER 1
INTRODUCTION

Participatory Sensing (PS) is a new sensing paradigm that relies on the voluntary cooperation of cellular users equipped with embedded or integrated sensors. Given the large and increasing number of cellular users, this new data collection method provides the opportunity to address large-scale societal problems like never before [1, 2, 3]. Compared with traditional monitoring systems, PS systems promise to be more cost effective and precise: the data collector does not have to invest capital in the user equipment nor in its deployment, maintenance, and operation, while being able to collect a considerably larger number of samples, and from places that cannot be afforded today, providing better spatial resolution and precision in the estimations. Examples of PS systems are applications that collect pollutant measurements to determine the air quality index (AQI) [4], measure radioactivity levels and pollen, collect real-time GPS locations to assess traffic congestion and travel times, and collect noise, humidity, and temperature measurements to show real-time maps of these variables.

1.1 Motivation

PS systems have several important advantages when compared to traditional monitoring systems. Traditional monitoring systems usually consist of very few static and, most of the time, expensive monitoring stations located in strategic places in a county, city, or state. As a result, with traditional systems, spatial estimations for regions far from the monitoring stations could be too coarse, which might not be very helpful to those individuals who are very sensitive to the variables of interest, or to those organizations trying to assess, monitor, and control these variables or activities. On the other hand, PS systems consist of many mobile and cheap stations, limited by the number of cellular phone users located in the region of interest. Therefore, a PS application is capable of providing considerably larger amounts of data and with better spatial resolution, i.e., from many more places, than current measurement systems.

It is argued that the spatial resolution that a PS system can provide is a strong motivation for considering this approach as a replacement of traditional measurement systems. By using a PS system, it is now possible to measure, detect, and analyze events that occur at different spatial scales, ranging from blocks and neighborhoods up to counties and states. As an example, Figure 1.1(a) shows the interpolation of Carbon Monoxide (CO) measurements in Los Angeles County in California, which has a high density of static pollution stations, and Figure 1.1(b) shows the interpolation of the data collected using a PS system at the University of South Florida (USF). It is important to emphasize that one little cell in the LA plot is similar in size to the entire area of measurements shown in the USF plot, and those variations and different CO levels at USF are impossible to detect and visualize using interpolations from distant stations.

Finally, another important motivation in favor of participatory systems is cost. In a PS system, sensors are meant to be cheap and easy to deploy, and since the units are in the hands of the users, the data collector does not incur capital, deployment, operational, and maintenance costs.
Figure 1.1: Resolution of the traditional system (each cell is 2.5 × 2.5 km) versus the PS system (each cell is 50 × 50 m). (a) Traditional monitoring system, LA CO levels. (b) PS system, USF CO level.

In the same way as any other real system, before any PS system is deployed and used in a large-scale application, several important research challenges need to be addressed. These challenges, which are categorized in five areas, are described next:

• Privacy: mechanisms are needed to protect the personal information of the user and the places she visited as part of the PS application. If the privacy of the users is not protected, they may decline to participate in the application. For instance, the work in [5] develops a correlated noise model that can be utilized to perturb location-tagged data in a way that protects both user data and the location while ensuring accurate reconstruction of community statistics.

• Security: the transmission, processing, and storage of the information must be secured to protect delicate data of the users and provide data integrity. These mechanisms need to do so while being simple and energy efficient, so they can run on a resource-constrained device.

• Data Validity: a PS application must provide a method to verify the validity of the data. For example, in the air pollution context, users may fake the readings of the sensors (e.g., to avoid contamination fines). Mechanisms must be in place to detect short- and long-term erroneous measurements. The work in the area of multivariate statistical analysis and diagnosis of air pollution monitoring systems [6, 7] usually does not consider the location correlation of the measurements that a PS system with massive numbers of users would have. Also, the non-stationary characteristics of the air quality measurements call for sophisticated algorithms to infer and detect doubtful measurements.

• Visualization: PS applications will collect huge amounts of data, more precisely, 2-dimensional data series with gaps in time and space. Interpolation techniques to effectively visualize all this information are also needed. This problem is strongly connected with data validity, since an inference of a value could help fill the gaps of the series. Techniques such as kriging interpolation are useful when the correlation between variables, locations, and time (previous measurements) should be considered.

• Incentives: an additional problem involves the provision of incentives for participation. Why should I participate in a PS application? Why should I bother to get samples and transmit them? What would I gain? Why should I use my time, my battery, etc., to collect data for others? Appropriate mechanisms need to be included to encourage the participation of the users who will collect the data of interest. The work in [8] is one of the first to address the incentives problem in PS.

In addition to the software and hardware design and implementation that are also required, all these five areas need to be addressed when designing a real PS application. Figure 1.2 is a pictorial representation of how these areas together create a complete PS system.

Figure 1.2: Participatory Sensing areas of research. Taken from [9] © 2011 IEEE.

In this dissertation, a PS system called P-Sense [9] is utilized to study PS-related problems, with a focus on environmental applications.
P-Sense brings the possibility of having not only pollution data with considerably better spatial resolution [10], from places not possible before, such as your community, school, and neighborhood, but also the possibility to assess your exposure to pollution according to the places visited during your daily activities [11]. Further, with the appropriate sensors integrated into a cellular phone, P-Sense promises to achieve these goals in a continuous and invisible manner, as the application can run in the background of cellular users' phones. P-Sense has been collecting environmental data (CO (ppm), CO2 (ppm), combustible gases (ppm), air quality (4 discrete levels), temperature (F), and relative humidity (%)) across the University of South Florida (USF) campus for several months. This dataset has been used to validate all the proposed methods and techniques presented in this dissertation.

1.2 Problem Statement

Most of the challenges in PS systems are consequences of the fact that now the measuring devices are cheaper (of lesser quality than the static ones) and they are in the hands of the users: 1) the accuracy of the individual samples might not be as good as the quality of the samples from expensive static stations; 2) the measuring devices may fail with higher frequency when compared to the expensive ones; 3) the data are more prone to being intentionally tampered with; 4) the cellular phone users need to have a good motivation to participate in the data collection process; and 5) the PS system needs to implement security and privacy mechanisms in order to guarantee the protection of the user's identity and the integrity of the data.

In addition, there is no unifying approach to guide the design of an entire PS system considering all these challenges in a comprehensive manner. Part of the work of this dissertation is meant to fill this gap, presenting a general framework to successfully guide the designer in the creation of a PS system. This framework splits the design process into five modules: sample size determination, data collection, data verification, data visualization, and density maps generation. The remaining part of the dissertation aims to solve three of these stages of the framework, more specifically, data verification, data visualization, and density maps generation.

The problem of sensor data verification is addressed first, which considers the detection and removal of invalid data, whether intentionally or unintentionally produced, in a computationally efficient manner. The computational problem is very challenging given the large amounts of data and the fact that current inference tools use all samples to produce estimates in other locations. Invalid data can be the result of two possible scenarios: 1) unintentional errors produced by malfunctioning sensors that generate abnormal values; or 2) intentional modifications inserted by the users, e.g., polluters (companies, countries, etc.) who must comply with certain maximum levels of emitted pollutants to the atmosphere in order to avoid penalty fees, and would like to report modified values of the collected pollution data.

The problem of wrong data detection and removal is also important because of its strong connection with the problem of sensor data visualization and analysis. Visualization is largely used to find spatial patterns in the variables [12].
In general, inconsistent or invalid data can be the cause of an incorrect (or biased) analysis of the global status of the variable due to the strong influence of a small set of data on the statistical model [13]. Figure 1.3 shows this problem when the same interpolation technique is applied to the data with and without the removal of the outliers. From the graph, it can be seen how the variable of interest, in this case the spatial interpolation of carbon monoxide data collected on the University of South Florida (USF) campus, is completely different when the outliers are detected and removed (left) than when data verification is not performed (right).

Figure 1.3: CO interpolated using real data collected at the USF campus, with (left) and without (right) data verification.

After looking at Figure 1.3, one could argue that particular outliers could be visually detected and manually removed from the data; however, it is better if an automated procedure can take care of these problems. Further, in some cases, such as those maliciously introduced by the users, the visual detection might not be so obvious. This also shows the importance of creating an application that automatically detects and removes outliers using not only the location information but also the behavior of the variables [11].

Although it could be argued that a PS system might not need data interpolation techniques, the reality is that the density and mobility of the users, and the user selection techniques given by incentive mechanisms for participation, produce time series with many measurement gaps both in time and space. Thus, interpolation techniques are also needed. However, current interpolation techniques might not be appropriate. Since the dataset collected by PS systems normally consists of measurements from multiple variables at different locations and times, these interpolation techniques should be modified to successfully exploit the advantages that such a dataset brings. When compared to the spatial interpolation of a single variable, including the correlation among variables, close locations, and time strongly improves the quality of the final estimations.

This dissertation also addresses the problem of how to define the ideal location of the users, which for the first time is connected to the problem of incentives. The question to be answered is the following: given the large number of potential participants, where should the system take samples from, or which cellular phone users should the system select to take samples from, so that the variable of interest is well characterized and at a low cost? Once this question is answered, incentive mechanisms should use that information to motivate those users, in a region of interest, to actively participate and collect the data. Although some incentive mechanisms in the literature have included some embedded algorithms to address this particular question, most of them focus on one particular aspect, like maintaining the level of user participation over time, lowering the total sampling cost, or covering the area of interest [14], and none has determined the number and locations of the participants based on the quality of the variable being sampled.

1.3 Contributions

The main contributions of this dissertation are presented next. Each one of these contributions will be elaborated in depth in subsequent chapters.
1.3.1 A Framework for the Design and Implementation of PS Systems

The first contribution of this dissertation is the design of a framework to guide the design and implementation of PS systems that considers the challenges explained before [15]. The proposed framework splits the design of a PS application into five modules: sample size determination, data collection, data verification, data visualization, and density maps generation. Chapter 3 describes this framework in detail. The following sections present the remaining contributions, which are mapped one-on-one to three different modules of this framework: data verification, data visualization, and density maps.

1.3.2 Data Verification

Chapter 4 of the dissertation includes mechanisms to detect and remove invalid data, whether introduced by malfunctioning sensors or malicious users. The detection and removal of outliers that result from malfunctioning sensors generating abnormal values is fairly simple if multiple measurements for the same location are available. A solution to this problem is proposed based on the Mahalanobis distance on multivariate data [16, 17]; a minimal sketch of this kind of local screening is shown below. The results show that this technique is very successful at removing this type of outlier. On the other hand, intentional modifications to the data by the users are harder to detect. To address this problem, known spatial data mining techniques for spatial outlier detection and removal are utilized, and a new hybrid method based on the classification of the data into neighborhoods is proposed and implemented [18]. An observation is considered a spatial outlier if its value is notably different compared to the values of the locations around it [19]. Pollution data are classified into Sparse (distant neighbors or heterogeneous measurements) or Dense (nearby neighbors and homogeneous measurements) neighborhoods, considering spatial (location of the samples) and non-spatial (environmental variables) information. A more complex adversary model is presented in [20], where the users cooperate with each other in order to confuse the data verification system, but the hybrid algorithm still successfully detects and removes the spatial outliers in this scenario. This new method considers different characteristics of a typical PS application, such as the density of the neighborhood of each location and the behavior of the variables, to reconstruct the original variables regardless of the intrusion of malicious users modifying their measurements. The results indicate that this new hybrid approach detects malicious modifications of the data in a very efficient manner with conspiracy levels of up to 22% of the total number of users, and it is capable of reconstructing the variables of interest even if the users are cooperatively conspiring against the system. This hybrid approach also reduces the computational time by applying computationally expensive techniques for invalid data detection only in those neighborhoods where it makes sense, while using very inexpensive and simple techniques in the rest of the neighborhoods.
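As an illustration of the local screening idea mentioned above (multiple multivariate measurements available for the same location), the following is a minimal sketch, not the dissertation's actual implementation, of Mahalanobis-distance-based outlier detection. It assumes NumPy and SciPy are available; the chi-square cutoff and the example data are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """Flag rows of X (n_samples x n_variables) whose squared Mahalanobis
    distance from the sample mean exceeds a chi-square quantile."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))   # pseudo-inverse for stability
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances
    return d2 > chi2.ppf(alpha, df=X.shape[1])            # boolean mask of local outliers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))        # e.g., temperature, RH, CO2, CO readings
    X[:5] += 8.0                         # inject a few abnormal readings
    print(np.flatnonzero(mahalanobis_outliers(X)))        # indices of flagged samples
```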
1.3.3 Data Visualization

Depending on the dataset under analysis, the optimal data interpolation technique can be different. With this in mind, and considering different densities of users in the area of interest, two techniques were studied. Kriging [21, 22] and Markov Random Fields (MRF) [23] are widely used for spatial interpolation and image restoration, respectively. This study shows that, because of the computational requirements of kriging, MRF are preferred when there is a high density of measurements in the space, but kriging clearly outperforms MRF under low density conditions (a low number of measurements in the map), since it generates estimations with a higher coefficient of determination. Since for the P-Sense case, and for almost any PS application, the density of measurements is really low, kriging is utilized for the spatial interpolation process.

Kriging is utilized to interpolate the data collected by P-Sense, using one (univariate) or multiple (multivariate) variables [24]. The visualization results for the univariate case show that the spatial resolution of the information is quite good, which is now very useful for those people sensitive to the variables of interest, or those agencies or organizations in need of more precise data. The multivariate case of kriging is very important because correlations among the variables are supposed to provide additional information that should improve the quality of the estimations. However, the use of cokriging (the multivariate extension of kriging) is very expensive [25], especially for systems with large amounts of data, like PS systems. To address this problem, the use of kriging along with Principal Component Analysis (PCA) and Independent Component Analysis (ICA) for multivariate spatial interpolation is proposed. Finally, a new method to make interpolations in space and time is presented, which is more appropriate for PS systems.

The results indicate that the accuracy of the estimates improves with the amount of data, i.e., one variable, multiple variables, and space and time data. Also, the results clearly show the advantage of a PS system compared with a traditional measuring system in terms of the precision and spatial resolution of the information provided to the users. With a PS system like P-Sense, it is now possible to detect critical environmental conditions that strongly impact different aspects of life in a precise, unobtrusive, and cost-effective manner.

1.3.4 Density Maps

This dissertation also proposes the use of a separate module in PS systems to calculate and generate density maps [26]. Density maps provide incentive mechanisms with input information so they can encourage the participation of users located in those specific areas where it is most important to take samples, so that the variable of interest is properly characterized (provides good information) at a lower cost, avoiding the selection of users who do not provide new or additional valuable data.

Once the data have been collected and spatially estimated, the PS system should be able to generate the desired density map based upon the measurements and the location of the users. This map is nothing more than the desired locations from which to collect measurements in the next iteration. The idea behind the density maps is based on the observation that if a variable is fairly constant in a region of the map, a low density of users is required to reconstruct it. On the other hand, if the pollutant has high variability in a region, more locations (measurements) are needed to characterize it. A univariate and a multivariate mechanism to generate these maps, based on the magnitude of the gradient, have been proposed, implemented, and tested; a minimal sketch of this idea is shown below. Consequently, the incentive mechanism should encourage the displacement of the users by raising or decreasing the amount of reward, in order to follow the requirements established by the density maps. The experiments show how the density maps greatly improve the quality of the estimations while maintaining a stable and low total number of users in the system.
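The sketch below illustrates how such a gradient-based density map could be computed, under illustrative assumptions: a gridded estimate of a single variable, square regions, and a linear mapping from mean gradient magnitude to a user count, in the spirit of the MMC mapping of Figure 6.3. It is not the dissertation's implementation, and the region size and user bounds are placeholders.

```python
import numpy as np

def density_map(field, region=10, min_users=1, max_users=20):
    """Turn a gridded estimate of one variable into a desired number of sampling
    locations per region, proportional to the mean gradient magnitude (MMC)."""
    gy, gx = np.gradient(field)                    # spatial gradient of the estimate
    mag = np.hypot(gx, gy)                         # gradient magnitude per grid cell
    rows, cols = field.shape[0] // region, field.shape[1] // region
    mmc = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = mag[r * region:(r + 1) * region, c * region:(c + 1) * region]
            mmc[r, c] = block.mean()               # mean magnitude of change per region
    span = mmc.max() - mmc.min() or 1.0            # avoid division by zero on flat fields
    scaled = (mmc - mmc.min()) / span              # 0 = flattest region, 1 = most variable
    return min_users + np.rint(scaled * (max_users - min_users)).astype(int)

if __name__ == "__main__":
    y, x = np.mgrid[0:100, 0:100]
    co = 2.0 + 5.0 * np.exp(-((x - 70) ** 2 + (y - 30) ** 2) / 200.0)  # synthetic CO hot spot
    print(density_map(co))                         # users requested per 10x10-cell region
```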
1.3.5 Structure of the Dissertation

The rest of the dissertation is structured as follows. Chapter 2 contains the literature review on areas related to data verification, data visualization, and density maps, and also PS systems that are relevant to this dissertation. Chapter 3 presents the details about P-Sense and the proposed framework for PS applications. Following this framework, the subsequent chapters are mapped one-to-one to some of its stages. Chapter 4 details the problem of data verification and the proposed solutions. Chapter 5 presents the techniques developed for data visualization and interpolation in PS applications. Chapter 6 introduces the concept and techniques related to density maps. Finally, Chapter 7 concludes the dissertation and presents possible areas of future research.

CHAPTER 2
LITERATURE REVIEW

A significant amount of work has been done defining the general aspects of Participatory Sensing, including general architectures, the role of the users, incentive mechanisms, privacy and security issues, data management, and general strengths and challenges [1, 27, 28, 9, 29]. However, the work in the specific areas of data verification, data visualization, and density maps in the context of PS systems is scarce, or rather, non-existent. The following sections introduce the most significant works in these three areas, establishing their pros and cons if used in a PS context, and how the work presented in this dissertation solves the challenges related to data verification, data visualization, and density maps. Some interesting applications related to pollution monitoring and PS systems are also presented at the end of the chapter.

2.1 Data Verification

The problem of data validation has not been addressed yet in the context of participatory sensing. There is work on data verification in similar areas, such as wireless sensor networks, but this work cannot be applied directly to PS systems. For instance, a survey by Chandola et al. [30] presents the challenges for anomaly detection in sensor networks, generated by sensor faults or intrusion detection. Some of these challenges are the different types of data that are being collected, the noisy environment, and inaccurate sensors and transmission channels. A multivariate approach to solve the problem of data verification in sensor networks is presented in [31], where local outliers, called elliptical anomalies, are removed in a distributed way. While some of these requirements still apply, in PS applications the data are not processed on the mobile devices, so the computational restrictions of the devices and the online/distributed operation are no longer a concern. In this section, the main focus will be on the most relevant work in the areas of neighborhood generation and spatial outlier detection algorithms. This work was the starting point of most of the contributions in this dissertation in the area of data verification for PS applications.

2.1.1 Neighborhood Generation

The idea of finding a neighborhood for each location has been considered before, but in different contexts.
Depending on the system, the requirements might change and different techniques need to be applied. In the research area of geology [32], the nearest neighbor method of strain analysis is improved by using the Delaunay triangulation to calculate the natural neighbors. This technique has proven practical and computationally efficient when compared to similar techniques, such as the Fry or the Crespi methods. In the area of boundary element simulation, the Petrov-Galerkin method [33], which is based on Voronoi maps (a dual graph of the Delaunay triangulation) and the Delaunay triangulation, is compared to the finite element method (FEM), a well-established numerical method used in different fields. It was shown that the numerical results agree with the ones found by using FEM, while being more computationally efficient because of the use of the Delaunay triangulation. The Delaunay triangulation has also been used for spatial interpolation, being the basis of the Height technique as presented in [34].

In the specific area of spatial outlier detection, initial approaches used k-nearest neighbors to define the neighborhood for each location, but these approaches did not consider the spatial or non-spatial attributes of the data to generate the neighborhoods. Taking this into consideration, more sophisticated techniques have been designed. The work in [35] uses self-organizing maps (SOM) to define the neighborhood of each location. Voronoi maps (the dual graph of the Delaunay triangulation) are used in [36] to generate what they call micro and macro neighborhoods, and the work in [37] utilizes the Delaunay triangulation in the generation of the neighborhoods. In [38], neighborhoods are generated using a technique based on random walks only on the spatial attributes. As can be seen, all these previous techniques rely on the spatial attributes of the data to generate the neighborhood for each location.

As part of this dissertation, a hybrid algorithm classifies each location into a Dense (close neighbors with similar measurements) or a Sparse (distant neighbors or very different measurements) neighborhood. In order to do this, the generation of the neighborhood should also consider the non-spatial attributes. With this in mind, a clustering technique on the non-spatial attributes is also included in order to successfully classify each location. The proposed hybrid technique calculates an initial estimate of the neighborhood using the Delaunay triangulation and also utilizes a Gaussian Mixture Model (GMM) as the clustering algorithm on the non-spatial attributes; a minimal sketch of these two ingredients is given below. GMM have been identified as a promising machine learning technique for pattern recognition [39]. For instance, the work in [40] presents a comparative study where GMM outperform radial basis functions and similar models for voice recognition. In [41], a GMM is utilized to detect clusters embedded in different feature subspaces, showing good performance on different standard datasets. Even in the area of PS applications, Gaussian Mixture Models have been used along with Support Vector Machines (SVM) in order to infer the user's context and properly control the way the data can be collected [42]. In this dissertation, Gaussian Mixture Models are utilized instead of a simpler clustering algorithm, such as K-means, since a GMM takes into consideration the variability of the attributes through the covariance matrix.
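The sketch below illustrates the two ingredients just named, Delaunay natural neighbors on the spatial attributes and GMM clustering on the non-spatial attributes, under illustrative assumptions: it is not the dissertation's algorithm, it assumes SciPy and scikit-learn, and the Dense/Sparse labeling rule (distance threshold and majority-cluster test) is only a placeholder for the actual selection criteria developed in Chapter 4.

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.mixture import GaussianMixture

def delaunay_neighbors(points):
    """Candidate neighborhood per location: its natural (Delaunay) neighbors."""
    tri = Delaunay(points)
    neighbors = {i: set() for i in range(len(points))}
    for simplex in tri.simplices:              # each triangle links its three vertices
        for a in simplex:
            for b in simplex:
                if a != b:
                    neighbors[int(a)].add(int(b))
    return neighbors

def classify_locations(points, attributes, n_clusters=3, dist_thresh=100.0):
    """Label each location Dense (nearby neighbors, same attribute cluster)
    or Sparse (distant neighbors or dissimilar measurements)."""
    neigh = delaunay_neighbors(points)
    labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(attributes)
    kinds = {}
    for i, nbrs in neigh.items():
        dists = [np.linalg.norm(points[i] - points[j]) for j in nbrs]
        same = [labels[j] == labels[i] for j in nbrs]
        kinds[i] = "Dense" if np.mean(dists) < dist_thresh and np.mean(same) > 0.5 else "Sparse"
    return kinds

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.uniform(0, 500, size=(60, 2))    # user locations (meters)
    attrs = rng.normal(size=(60, 4))           # e.g., temperature, RH, CO2, CO
    print(classify_locations(pts, attrs))
```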
2.1.2 Spatial Outlier Detection Algorithms

Extracting interesting and important patterns from spatial data is more difficult than extracting the corresponding patterns from traditional numerical and categorical data, due to the spatial autocorrelation, relationships, and data types [12]. This complexity limits the usefulness of traditional data mining techniques. The data inputs of spatial data mining have two distinct types of attributes: spatial and non-spatial attributes. The relationships among non-spatial attributes are explicit, for instance, ordering, membership, subclass, etc. On the other hand, the relationships among spatial attributes are often implicit, such as overlap, intersect, etc. This particular property implies the use of special data mining techniques.

The spatial statistics (also known as geostatistics) literature provides two different types of tests for detecting spatial outliers: graphical and quantitative tests. Graphical tests, based on data visualization, work by highlighting outliers in order for a trained analyst to detect them. Some examples of graphical tests are variogram clouds and Moran scatterplots. These methods are limited by the lack of an exact criterion to classify the spatial outliers. On the other hand, quantitative tests provide a concise technique to detect outliers based, mainly, on the comparison of each datum to its neighborhood [12]. Due to its advantages, the quantitative approach was utilized in this dissertation in order to detect spatial outliers, and therefore, verify data.

After generating a neighborhood, all the neighbors have the same impact on the estimation of the real value for that location. This is not a problem when a neighborhood is Dense, in the sense of having close neighbors with similar measurements. However, when the neighbors are really far away from the estimation point, they should not affect the estimation of the real value for that location in the same way. This is the reason why, when creating the neighborhood, not only the spatial but also the non-spatial attributes should be considered.

Two main aspects are important for the analysis of the previous work in the area of spatial outlier detection and removal: does the technique consider univariate or multivariate non-spatial attributes, and how is the neighborhood generated? In general, a global approach, where an outlier is defined as an observation that appears to be inconsistent with the remainder of the data, is not utilized. A value that appears to be an outlier from a global point of view might not be an outlier in its vicinity and, on the contrary, it can represent an important and interesting event to be considered [43, 44]. A considerable number of papers [16, 44, 12] introduce the general concepts of spatial outlier detection and removal by comparing the non-spatial attributes of each location with the estimations made from its neighborhood. This technique is also known as local differences. These initial approaches only consider one variable, and the generation of the neighborhood for each location is based solely on distance, e.g., the k-closest neighbors. Another univariate approach is presented in [36], but in this case the neighborhoods are generated through the use of Voronoi maps. The technique presented in [45] uses a local differences approach with a Mahalanobis distance criterion, but again, the generation of the neighborhoods relies only on distance (k-closest neighbors).
The work in [35] also presents a multivariate approach to detect outliers through the use of the Mahalanobis distance, but the definition of the neighborhood only considers spatial attributes through self-organizing maps. After testing, it was found that these techniques based only on the Mahalanobis distance performed poorly and were very sensitive to the final tuning of the parameters. The work in [38] uses random walks and K-means to estimate the neighborhood, but the final technique to calculate the neighborhood function per location remains the same across all locations. Other inconveniences of this technique are its sensitivity to the selection of the number of neighbors and the univariate nature of the approach. The work in [13] presents an interesting use of the local kriging technique along with local differences, similar to the approach presented in this dissertation, but it is designed only for the univariate case and it does not estimate a proper neighborhood for each location. Another disadvantage is that the authors assume the possibility of grid sampling over the area of interest, something that might not be possible in a PS scenario where the users are moving freely. The algorithms presented in [46] use ordinary kriging in an incremental global approach to determine the spatial outliers. However, this approach only applies to the univariate case and depends on the estimation of a clean initial dataset (a dataset with no spatial outliers whatsoever), which might not occur in cases when a lot of users are conspiring (tampering with data) against the system. The work in [19] extended this approach to the multivariate case by using cokriging, but it again depends on the selection of a good initial dataset. Finally, the work in [43] gives a general framework for spatial outlier detection, estimating a neighborhood based on spatial and non-spatial attributes, but it does not consider the variability of the attributes when using K-means, nor a different neighborhood function for each location. It is important to note that none of these algorithms considers an irregular density of the users as is done in this dissertation; in the best case, they consider a semi-regular grid.

For the area of data verification, this dissertation presents a technique that not only considers multivariate non-spatial attributes, but also generates the neighborhood based on spatial and non-spatial attributes of the data. Moreover, this technique is computationally efficient, classifying each location into Dense or Sparse and then applying robust but computationally demanding techniques only when necessary.

2.2 Data Visualization

The specific problem of spatial interpolation in the context of participatory sensing applications has not been addressed yet, but in this section a review of the related work and similar techniques is presented. In general, the focus of this section will be on predictive monitoring of pollution data and the kriging technique, with some extensions considering multivariate data and computationally efficient execution. It is necessary to consider the special characteristics of these applications, since the mobility of the users and the use of noisy sensors produce a low density of measurements in time and space. There is work in the area of data inference to fill measurement gaps and predictive monitoring of time series, but most of these approaches only use the information from one location. In this dissertation, kriging, a spatial interpolation technique, is used, and some modifications are also proposed in order to include information from multiple variables and time; a minimal sketch of plain ordinary kriging is shown below for reference.
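The following is a minimal ordinary-kriging sketch in NumPy, not the dissertation's code: a fixed isotropic spherical semivariogram, the standard kriging system with a Lagrange multiplier, and the resulting weights applied to the observed values. In practice the semivariogram parameters would be fitted to the data, as in Figure 5.2; the sill, range, and example data here are illustrative.

```python
import numpy as np

def spherical_variogram(h, nugget=0.0, sill=1.0, vrange=300.0):
    """Isotropic spherical semivariogram model (cf. the fitted model of Figure 5.2)."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / vrange - 0.5 * (h / vrange) ** 3)
    return np.where(h >= vrange, sill, np.where(h == 0.0, 0.0, g))

def ordinary_kriging(points, values, targets, **vparams):
    """Ordinary kriging prediction of a single variable at each target location."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical_variogram(d, **vparams)   # semivariances between samples
    A[n, n] = 0.0                                   # Lagrange-multiplier row/column
    preds = []
    for t in targets:
        b = np.ones(n + 1)
        b[:n] = spherical_variogram(np.linalg.norm(points - t, axis=1), **vparams)
        w = np.linalg.solve(A, b)[:n]               # kriging weights (sum to 1)
        preds.append(w @ values)
    return np.array(preds)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    pts = rng.uniform(0, 500, size=(40, 2))         # scattered CO measurements (meters)
    vals = 2.0 + 0.002 * pts[:, 0] + rng.normal(scale=0.05, size=40)
    grid = np.array([[x, y] for x in range(0, 501, 100) for y in range(0, 501, 100)])
    print(ordinary_kriging(pts, vals, grid, sill=0.5, vrange=250.0).round(2))
```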
In this dissertation, kriging a spatial interpolation technique is used, and some modifications are also proposed in order to include information from multiple variables and time. 17 It is interesting to have a general idea of the work on predictive monitoring on environmental data. The work in [47] includes an illustrative example to show how to use PCA for air quality variables analysis. In [48], the authors include a statistical analysis of air pollution data using meteorological information using factor analysis and VARIMAX rotation to detect underlying behaviors in the variables. The work in [49] includes the VARIMAX rotation to the work on PCA, and the eigenvalues are the criterion to define the number of principal components. A recent work in [6] presents a predictive monitoring system that is able to detect abnormal fluctuations of the data, but they only considered data from a single station. Data visualization is essentially a problem of spatial inference based upon irregularly spaced points of measurements. This is specially interesting for pollution monitoring agencies in order to predict the effects of emissions withing non-monitored areas [11, 50]. Many of the traditional monitoring stations are positioned to detect high concentrations of the pollutants and support decision making relative to these values. Nevertheless, when these values are viewed in an independent way, they give an incomplete picture of the potential pollution problems even in areas of high monitor density. The work made in [51] gives a nice presentation of the general aspects, the importance and benefits of data extraction and visualization for wireless sensor networks (WSN), which gives a good reference point to PS application, even though mobility of the users imposes stronger restrictions. For the spatial interpolation process, the simple Sheppard technique is used on a single variable. Finally, the work in [52] presents an implementation of an online web-based visualization system for sensor data that considers the different resolution of the sensor data and also plots the mobility of the sensors. All this previous work shows the importance of appropriate spatial interpolation techniques that consider the special characteristics of PS data. The kriging technique is known as a BLUE (Best Linear Unbiased Estimator) interpolator, i.e., it is a linear estimator that matches the correct expected value of the population (unbiased), also minimizing the variance of the observations (best) [53]. Nevertheless, there are other spatial interpolation techniques that could be used for much simpler scenarios, 18 such as, electronic imaging systems [54], stochastic interpolation using radial basis functions [55], bayesian inference [56], splines[57, 58], just to cite some; but after considering them, all of these options failed for not considering either multiple variables, or time correlation, or low density of points. That is the reason why an initial comparison between a well known and widely used probabilistic technique for spatial interpolation and image reconstruction, Markov Random Fields (MRF), and kriging was made first. Kriging and Markov Random Fields (MRF) are widely used techniques for spatial interpolation and image restoration, respectively. For instance, a comparison in the use of kriging and MRF for the prediction of particulate matter PM10 concentrations around Pittsburgh was made in [59]. 
Some algorithmic techniques are presented in [60] to improve the belief propagation in Markov Random Fields for image restoration. Good introductory material about MRF and their application to image processing can be found in [61], and the work in [23] introduces the basic theory about MRF, stochastic relaxation, and the annealing schedule for image restoration. The work in [62] uses GMRF for spatial interpolation, and a procedure to estimate the parameters of the model is presented. The work presented in [53] gives a clear presentation of the kriging technique, its computational problems, and the implementation difficulties. Also, the work in [63] is an interesting starting point to understand most of the details about geostatistical tools, such as kriging and cokriging. The analysis for cokriging shows the practical difficulty in specifying all the parameters for the cokriging case. Even though it was not used, the work in [25] presents an implementation of cokriging. Finally, in [64] the authors give insight into important considerations when selecting kriging for the analysis of experimental data. For a complete review of the kriging technique, its aspects, and its applications, the interested reader can follow the work in [65, 66].

As mentioned before, recent work on the kriging technique has focused on its computational time and memory requirements. Although the time constraint for a PS application in air pollution is not too strong, the ideas and tools in this work are useful. The work in [67] presents an interesting mixture of kriging with Gaussian Markov Random Fields (GMRF) to partially solve the computational problem of kriging, but the time correlation between multiple variables is not included. The work in [68] presents an efficient implementation of the kriging technique for real-time applications. In a similar way, an efficient implementation of kriging via matrix-vector products is presented in [69]. None of these previous works considers multivariate information in the spatial interpolation process.

In order to consider the multivariate characteristics of PS data, the use of PCA and ICA along with kriging has been adopted in this dissertation. Nevertheless, there is prior work on the use of kriging and PCA for these purposes. The work in [70] uses fixed measuring stations to monitor the temperature of water streams at a climatological time scale. Principal component analysis is employed to construct the physiographical space and build semivariograms for characterizing the correlation structure. In [71], the authors also use principal component analysis for summarizing the relationships among the soil variables at different spatial scales. In the area of soil properties analysis, the work in [72] uses principal components along with kriging to interpolate fields of 16 soil variables. In [73], kriging and PCA are used to assess the horizontal and vertical distribution and transport of an industrial contaminant in soil. The work presented in this dissertation differs from these others in the use of not only principal component analysis (PCA) but also independent component analysis (ICA) along with the traditional kriging technique; a minimal sketch of the kriging-plus-PCA idea is given below. Further, in this work not only one variable, but multiple variables and interpolation in space and time, are presented.
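The sketch below shows one plausible reading of the kriging-plus-PCA pipeline, not the dissertation's implementation: project the multivariate measurements onto principal components, krige each component score field independently, and rotate the interpolated scores back to the original variables. It assumes scikit-learn and the third-party pykrige package are available; the grid, variogram model, and synthetic data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from pykrige.ok import OrdinaryKriging   # assumed third-party kriging package

def pca_kriging(x, y, Z, grid_x, grid_y, n_components=2):
    """Multivariate interpolation: krige principal-component scores separately,
    then rotate the interpolated scores back to the original variables."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(Z)                      # (n_samples, n_components)
    interp = []
    for k in range(n_components):
        ok = OrdinaryKriging(x, y, scores[:, k], variogram_model="spherical")
        zk, _ = ok.execute("grid", grid_x, grid_y)     # kriged component k on the grid
        interp.append(np.asarray(zk))
    stacked = np.stack(interp, axis=-1)                # (ny, nx, n_components)
    fields = pca.inverse_transform(stacked.reshape(-1, n_components))
    return fields.reshape(len(grid_y), len(grid_x), Z.shape[1])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x, y = rng.uniform(0, 500, 50), rng.uniform(0, 500, 50)
    Z = np.column_stack([75 + 0.01 * x,                # temperature
                         55 - 0.02 * y,                # relative humidity
                         400 + 0.05 * x,               # CO2
                         2 + 0.002 * (x + y)])         # CO
    gx = gy = np.linspace(0, 500, 20)
    print(pca_kriging(x, y, Z, gx, gy).shape)          # (20, 20, 4) interpolated maps
```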
2.3 Density Maps

Most of the previous work on PS systems does not analyze the importance of the location of the users in the final quality of the estimation of the variable(s) of interest. In general, the work has focused on how the users are encouraged to participate in the collection process and how the recruitment process selects the "best" users for the system. There are similar applications on pollution and environmental monitoring, such as [74, 75], that keep track of the behavior of the participants and their measurements in order to estimate (directly or indirectly) the variable of interest in other locations. While this approach can be useful for some applications, it can poorly estimate or completely miss events that occur in small localities or at different scales.

Some applications are starting to consider the location and mobility of the users in the data collection process. In [76], the authors present a set of metrics to evaluate the individual participants' fit with any given project, which can be useful in order to improve the quality of the collected data. This approach can keep a profile for each user, creating at the same time a set of "trusted" users, which can be useful for the data verification process. The work in [77] considers the mobility profile of the users and the spatio-temporal coverage to create coordination services, enabling the efficient selection of the users. It also considers the level of commitment of each participant to modify the selection process. A disadvantage of this system is the lack of feedback to the incentives mechanism. Considering this, the work in [78] improves on the previous application by feeding information from the current performance of the system back into the recruitment process for the next iteration, something that is also considered in this dissertation since it is very important for the correct reconstruction or estimation of the variables of interest. One strong assumption is the complete coverage of the area by the users, something that might not always happen.

As can be seen, most of the current PS applications simply collect the information from the participants without considering any quality metric based upon the performance of the estimation process. In this dissertation, not only the locations but also the values of the measurements are considered in order to generate a set of desired locations of the users (density maps). These locations should be used by the incentive mechanisms to collect the data, either by selecting only a subset of the users or by encouraging their displacement.

Through the work on density maps, this dissertation has shown that the displacement of the users is imperative for a PS system. In order to achieve this, incentive mechanisms using reverse auctions, like the one presented in [79], could be utilized to encourage this desired displacement without hurting the privacy protection of the user. By offering a higher reward for the regions where an increase in the number of users is needed, and a lower reward for the regions with low density requirements, the users would be encouraged to move to the desired locations. This data collection can be performed in two ways: opportunistic or interactive [80]. With opportunistic data collection, the users are not aware of the exact moments at which the data will be collected and, in general, their interaction is not required. On the other hand, with interactive collection, the users have to actively interact with the cellphone in order to measure the variable of interest.
The algorithms proposed in this dissertation will not be affected by the way the data are collected, since the incentives mechanisms and the phone applications deal with these issues. 2.4 Related PS Applications This last section presents some Participatory Sensing systems that in one way or another are connected to the work developed in this dissertation. The focus is mainly on PS applications addressing the problem of pollution monitoring, in the same way P-Sense [9] does it. The work in [81] presents an interesting comparison between sensor networks and sensor grids for air pollution monitoring. This work is a good reference to show the advantages of a participatory sensing approach compared to these static networks. Without using gas sensors, some work has focused on estimating the air quality in indirect ways. In [74], by tracking the behavior of the users , the authors can indirectly estimate the air quality levels in a region of interest. In a similar approach, the PEIR project [82] relies on demographics and location data to estimate the environmental exposure of the users. The PS system presented in [75] tracks the behavior of car drivers in order to estimate the most efficient paths for gas saving, allowing drivers to find the most fuel efficient routes between two arbitrary points in the map. Even though it does not include a direct or indirect measuring of the pollution levels, it shows the importance of “green” approaches to reduce the pollution levels in the cities. 22 The initial approaches to monitor air pollution utilized sensor arrays, GPRS modems and static networks. In [83] the authors present the challenges and importance of air pollution monitoring, and executed a field study, recruiting 7 taxi drivers and 3 students in Accra, Ghana to collect data with a GPS logger integrated with SO2 and NO2 sensors. The work in [84] presents an array of sensors with GPRS modems capable of collecting air pollution data (CO, NO2 and SO2 ) and send it back to an Internet enabled pollution server. In a similar way, the work in [85] presents a GPRS based system to collect atmospheric pollution data (CO, CO2 , SO2 , NO2 and O3 ), also generating email and SMS messages if toxic levels are detected. The main problem again is the fact that the location of the sensors is fixed. However, the approach made with P-Sense exploits the capabilities and mobility of cell phones, integrating them with multiple gas sensors, bringing an additional flexibility compared to static arrays of sensors. The work in [86] presents a handheld device that collects pollution variables (CO, NOx and O3 ), and by integrating a GPRS modem and a GPS module, it reports air quality levels. In the Haze Watch project [87], cell phones and pollution sensors (CO, O3 , SO2 , and NO2 ) were integrated in order to measure the air quality, and then transmit the data to a database server. It also allows a web interface overlaying pollution maps on Google maps, in a similar ways like P-Sense does. The work in [88] proposes a mobile system that monitors carbon monoxide and also includes a good analysis of the visualization problem. Also, in the area of data visualization, the work in [89] presents a framework for Web-based visualization of a large dataset of mobile sensing devices for air pollution. 23 CHAPTER 3 P-SENSE AND A FRAMEWORK FOR PS SYSTEMS In this chapter, P-Sense, the PS system that was developed to collect environmental data in order to have a real dataset for all of the experiments, is presented. 
It also helps introducing most of the problems related to real PS applications. After introducing P-Sense and all its technical details, a new framework for PS applications is introduced. 3.1 P-Sense: A PS System for Air Pollution Monitoring P-Sense is a participatory sensing system for air pollution monitoring and control. P- Sense provides large amounts of pollution data in time and space with different spatial resolutions. With P-Sense, government officials will have data to monitor and control the Air Quality Index (AQI) [4, 90] of a city, state, or country; doctors will be able to correlate respiratory problems of their patients to the AQI they are exposed to during their daily activities, in the places they work and live; county officials, community developers, and real state agents will have data to determine the best place where to build a new school, hospital, or community and advertise properties according to the AQI where they are located. PSense achieves this massive data collection in time and space in a simple and cost-effective manner. Figure 3.1 shows P-Sense’s system architecture. As it can be seen from the figure, P-Sense consists of four main components: the sensing devices, the first-level integrator, the data transport network, and the servers. P-Sense is based on the G-Sense system architecture presented in [91] and the doctoral dissertation of Alfredo Perez [92]. Each of the components of this architecture are described next. 24 Figure 3.1: P-Sense’s system architecture. Taken from [9] c 2011 IEEE. 3.1.1 Sensing Devices The environmental data are collected by a set of individual external sensors integrated in a board and working like one single sensor device with multi-sensing features. These sensors, integrated to the cellphone via Bluetooth, become the mobile sensing devices or the mobile wireless sensor network. The sensing system can be carried along with the cellphone, or can work as a standalone device due to its independent power supply. Similarly, wireless sensor networks made of static environmental sensors can be integrated into the system. For this case the expensive static pollution stations would be part of this static wireless network. All the environmental data acquired from the sensors are transmitted to a first-level integrator device. P-Sense integrates the following external sensors (shown in Figure 3.2): • The ArduinoBT development board [93]: a Bluetooth capable AVR based board that integrates digital and analog sensors. • Gas sensors by Figaro Sensors [94]: carbon dioxide (CDM-4161), combustible gas (FCM-6812), carbon monoxide (TGS-5042), and air quality (AM 1-2602). • Temperature and relative humidity sensor by Sensirion [95]: SHT-75 digital sensor. 25 Figure 3.2: The ArduinoBT board and the environmental sensors utilized in P-Sense. 3.1.2 First-Level Integrator The cellular phone is the first level integrator of the mobile sensing network. In PSense a PRO200 Sanyo cellular phone with GPS and Bluetooth capabilities was utilized. It receives the Bluetooth frames coming from the ArduinoBT board and captures the corresponding GPS fixes. This component has the capability to perform initial data analysis and a graphical user interface (shown in Figure 3.3) to show the measurements to the user in real-time and provide feedback. In the case of the static sensor network, another first-level integrator device and the appropriate software interfaces are needed to collect and integrate the environmental data from these networks into P-Sense. 
This device is the usual sink node available in most wireless sensor networks. In addition to the graphical user interface, the cellular phone runs an application to manage the reception of data from the sensors, the reception and transmission of data between the phone and severs, and other important functions. This application was implemented using the Java ME platform for resource-constrained devices. Figure 3.4 shows the software architecture of the cellular phone application, which is based entirely upon the work in [92], but only a subset of its functionality is used here. The most important modules of the architecture are described next. 26 Figure 3.3: Graphical user interface. Taken from [9] c 2011 IEEE. Figure 3.4: Mobile-side software architecture. Taken from [91] c 2011 IEEE. The Sensor Communications and OS Layer module abstracts the communication between the sensors and the mobile device. In P-Sense, the Bluetooth interface communicates with the acquisition board to receive the air quality measurements. A connection is opened with the ArduinoBT module through which the environmental data frames (see Figure 3.5) are constantly received. The Location Management component is divided into two modules, the location acquisition module and the critical point management module. The former module acquires the GPS fixes, monitors the location information at all times, and displays this information in the GUI. The critical point management module is meant to reduce the amount of traffic 27 Figure 3.5: A Bluetooth environmental data frame. on the network, and the transfer costs and energy consumption in the client [92], but this functionality is not used in P-Sense. For the Sensor Management component, the data acquisition module obtains the information from the sensor communication layer, transforms the data into meaningful units (Fahrenheit, percentage of humidity, and parts-per-million) and displays the data for immediate monitoring. The Server Communication Management component handles the communication with the servers and consists of the communication management and the session management modules. The communication management module provides the way to transfer data from the mobile device to the servers. This communication uses UDP, but other standard interfaces, such as TCP, can be used to implement this data transmission. UDP was selected since it is more efficient in terms of energy consumption when no high reliability is needed. The session management module provides the methods to exchange session data with the servers. A basic session management module was implemented in the system allowing the servers to recognize the identity of the user. 3.1.3 Data Transport Network The first level integrators transmit the environmental data using networks based on IP technology, such as current Internet links available in existing static and cellular networks. 28 3.1.4 Server This part of the architecture serves as the data repository for the entire system as well as the platform for data processing, analysis, and visualization. Users of the environmental data will “connect” to these servers to get access and visualize the data. In order to perform these tasks, the server is equipped with a database for data storage and an application server to develop and run methods for data analysis. The server application was developed using the Java SE platform (version 6 update 13 or better) and deployed using Sun Glassfish Enterprise Server V 2.1. 
The database utilized were Postgres 8.3.7-1 and Postgis 1.3.6-1 along with the JDBC version 3 drivers for Postgres. For the visualization and data access part, Google Web Toolkit 1.5.3, Google Web Toolkit Maps API, and Eclipse Europa (or newer) were utilized. With these tools, any data user with Internet access and a web browser is able to access and visualize the data in a customized Google map. The modules of the software architecture of the server application are based completely upon the work in [92] and are shown in Figure 3.6. Figure 3.6: Server application software architecture. Taken from [91] c 2011 IEEE. 29 The Mobile Wireless Sensor Network Management component takes care of the connectivity with the mobile sensing devices. The functionality of this component matches that described in the server communication management component of the mobile-side software architecture. The Data Management component handles the data from the sensors. It consists of two modules: data collection and data visualization. The data collection module obtains the data from the packets and interacts with the database communication module in order to store the data. The data visualization module allows clients to see the variables of interest in real time. This feature was implemented using the Google Web Toolkit and Web 2.0 tools. The visualization service allows the users to choose the variables to be visualized from a selectable list box. The collected information for one specific variable is displayed over the Google Map using colored markers (see Figure 3.7). The users can get information, such as range of variation, information about the user who collected the data, the time when it was collected and the exact value. If the zoom level is large enough, a new representation is added, showing each value as a colored circle shadow around the point. Figure 3.7: Visualization Web service. Taken from [9] c 2011 IEEE. The Database Management component allows the definition, implementation, access, and maintenance of the information in the database. The database communication module provides access to the database, including insertion and query operations. To perform these 30 tasks a Java Database Connectivity (JDBC) resource was created in the server application. The spatial and relational database module consists of the specific database to store the system data. In this system the data definition is based on PostgresSQL and its extension PostGis, which adds support for geographic objects. SQL is the manipulation language and pgAdmin III the software engine. 3.1.5 Pollution Data Collected The campus of the University of South Florida, in Tampa, Florida was the location for the collection of the data. The area of the campus is approximately 3.9km2 . During three months data were collected 3 times a day for one hour, 3 to 4 days a week. Two users carrying the sensor board and the corresponding cellular phones measured the air quality on predefined points in the campus. These points were defined in order to have an uniform density of measurements around the campus. Since in a real PS system the density of the measurements will not be uniform, the use of incentives is necessary in order to force the users to collect data in the required locations, and in some way, control the density of the measurements per area. The first assumption for the collection of the data is that all measurements collected through the same hour are assumed to be simultaneous, and belonging to the same snapshot. 
This helps increase the number of effective measurements per snapshot (the set of all measurements across the campus) and therefore improves the accuracy of the spatial interpolation process. P-Sense measures CO (ppm), CO2 (ppm), combustible gases (ppm), air quality (4 discrete levels), temperature (F), and relative humidity (%).

Once the measurements are available in the central server, they are aggregated and averaged. In this stage, the data are aggregated based on a virtual grid with equal-size cells built on top of the campus map. As it will be shown later, the size of the grid affects the accuracy of the predictions in the spatial interpolation process (refer to Chapter 5). Once the measurements are aggregated, an average is calculated by using all values that fall in the same cell of the grid, leaving only one valid value per cell. The aggregation process and the calculation of the average are determined by the data verification process (refer to Chapter 4). After experimenting, it was seen that the use of an average value per cell, instead of using all possible values, does not affect the performance of the system, but it does improve the speed of the execution of the algorithms. For instance, using the kriging technique for the interpolation of a single point, the use of an average value per cell reduces the number of values used for the interpolation from around 600 to only 20 to 30 values. A complete description of these techniques can be found in Chapter 5 of this dissertation. Figure 3.8 shows the effect of the aggregation stage. All collected measurements are shown at the left; the same scenario after the aggregation and averaging of the values is shown at the right. The darker circles in the left graph indicate more than one value at that location.

Figure 3.8: Effect of the aggregation process.

3.2 A Framework for PS Systems

The right hand side of Figure 3.9 shows the modules of the proposed framework, which consists of the sample size determination module, the data collection module, the data verification and visualization modules, and the density map generation module. The left hand side of the figure identifies some of the techniques used to address the challenges and goals of each module. Each of the modules is described in more detail next. Some techniques associated with each module are introduced here, and some others will be presented in subsequent chapters as they are part of the contributions of this dissertation.

Figure 3.9: A general framework for PS applications and its associated techniques.

3.2.1 Sample Size Determination

The Sample Size Determination Module needs to find the appropriate sample size so that a PS system can be comparable in accuracy to traditional measurement systems. By exploiting the large number of cellular users, a PS application should be able to improve the low accuracy of a single sensor by using multiple samples and spatial data aggregation. This aggregation process requires the definition of a grid in the map and a minimum number of samples per cell in the grid. By using the given accuracy of the sensors, the sample size determination module should be able to define the sample size per cell to achieve a determined accuracy level. The sample size determination is based on the fact that the arithmetic average of a measurement repeated n times approximates the real value with a standard deviation proportional to 1/√n (a small numerical illustration follows).
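As a quick sanity check of this relation, the following is a minimal sketch in Python, assuming roughly normal, independent sensor errors and the availability of SciPy; the sensor standard deviation, target accuracy, and confidence level in the example are illustrative values only, not the ones used by P-Sense.

import math
from scipy.stats import norm

def samples_needed(sigma, target_error, confidence=0.95):
    """Number of repeated measurements needed so that their average falls
    within target_error of the true value with the given confidence,
    assuming independent errors (standard error = sigma / sqrt(n))."""
    z = norm.ppf(0.5 + confidence / 2.0)
    return math.ceil((z * sigma / target_error) ** 2)

# Illustrative example: a sensor with a 1 ppm standard deviation and a
# desired accuracy of +/- 0.5 ppm at 95% confidence requires 16 samples.
print(samples_needed(sigma=1.0, target_error=0.5))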
This means that by having multiple samples the accuracy of the measurements can be increased. Obviously, the number of samples cannot be increased indefinitely, since it would cost more than what is affordable for the system. 33 The sensors can be characterized using the accuracy reported by the manufacturers or through the standard deviation from the real values. Therefore, given a desired accuracy and confidence level, the number of samples can be determined. The work in [96] uses the Lagrange multiplier method to solve this optimization problem, i.e., the cost of a single measurement versus the overall accuracy. Since it only depends on the sensor’s accuracy, this is a one time calculation and the sample size per cell will not change unless the whole accuracy of the system is required to change. There are two ways to achieve this number of measurements in practice. The first one is to use the cellular phones to sample the sensors multiple times and then send a complete packet to the server. The second way uses the measurements of several users that are close to each other and aggregates the measurements in space, as explained previously. This increases the effective number of measurements per location, requiring fewer samples from individual users. The aggregation process is also essential to reduce data redundancy, improve data accuracy, and reduce the computational requirements of posterior processing. 3.2.2 Data Collection The second module in the framework is the Data Collection Module, which has to deal with the problems of incentives, privacy, and security for user participation. Without appropriate incentives, users will not be willing to participate. Having a target density map, incentive mechanisms need to collect enough samples per region. Further, these mechanisms need to keep enough users participating at all times, otherwise, the system may die or it may not be able to collect enough samples. Incentive mechanisms need to consider coverage and location as well. It is very important to know where the users are and how much of the area of interest they are covering. The system will not be very effective if most of the samples are collected from users who are located very close to each other. This module is also in charge of protecting the users’ privacy and the integrity of the data. Therefore, this module must include privacy protecting mechanisms, so no one can determine who took the 34 samples and from where, as well as security mechanisms to guarantee that the data comes from the right user and was not modified by someone else. In general, users need to gain something, be rewarded somehow, for their participation, otherwise, they will not be willing to carry extra sensors, spend their cellular phone batteries, and incur in extra costs delivering the data. There are several ways in which users can be motivated to participate in a PS system. One obvious method is to offer users some sort of monetary reward for their data. In this category many schemes based on micropayments [27] and auctions can be found. For example, the reverse auction scheme presented in [79] utilizes virtual credits and return on investments in order to keep prices low and a reasonable number of participants over time. Other methods are based on the benefit that the users receive from collecting the data, as the participants are also the users of the collected data. 
For example, a person with allergies might be willing to collect pollen data for free if the system provides her with access to the pollen map built by all participants. Other incentive schemes are of a different type, such as entertainment and reputation. If the data collection process somehow is part of a game that users enjoy playing, they might also be willing to participate without any economic reward. Similarly, if the system ranks the participants according to the quality or quantity of information that they collect, the users might participate given the recognition they get among their peers. The practical implementation of incentive mechanisms is not easy. These mechanisms need to be economically feasible and aware of budget limits to acquire the data; fair, in the sense that all users should be able to participate if they so desire; include mechanisms to attract new participants and keep a minimum level of user participation over time; and cover the area of interest, i.e., encourage users to collect data from the desired locations only, and therefore control the density of the measurements per area. These are only a few of the most important challenges in this area. Another key issue in PS systems is privacy. Even with good incentives for participation, users might not be willing to collect data if their privacy is not guaranteed, i.e., there should be no possibility to infer the identification and location of the user by looking at 35 the collected data. Privacy preserving mechanisms are very challenging in PS systems due to the computational and energy limitations of cellular phones and the complexity of the algorithms needed. Privacy can be achieved using encryption, obfuscation, and anonymization mechanisms, and combinations of them [97, 98]. In addition to the energy consumption, the efficacy of these schemes is evaluated based on the degree of privacy protection given by the Information Loss (IL) metric, which evaluates how far the reported locations are from the real ones. Finally, security mechanisms must be included to guarantee the integrity of the data and make sure that the data come from the authorized users. Digital signatures are commonly used here but more work is needed given the computational complexity and energy consumption of these mechanisms, among the most important issues. 3.2.3 Data Verification The third module is the Data Verification Module. Since the sensors are cheap units in the hand of the users, PS systems are more prone for unintentional and/or intentional invalid measurements. Unintentional invalid measurements can be attributed to malfunctioning sensors, which report abnormal levels of the variables. Intentional invalid measurements are caused by non-trusted users inserting fake values for their own benefit. This last case can occur, for example, in the scenario where industries must comply with certain maximum levels of emitted pollutants to the atmosphere in order to avoid penalty fees. Therefore, before drawing conclusions from the data, i.e., before applying other procedures to analyze the data (e.g., data visualization), the PS system needs to verify the data and remove these abnormal measurements. Considering that outliers could come from malfunctioning sensors, which generate outof-range values, or from malicious users who modify the measurements for their own benefit, appropriate mechanisms need to be included to deal with both aspects. The complete details about data verification are given in Chapter 4. 
36 3.2.4 Data Visualization The next module is the Data Visualization Module, which is in charge of showing the data in an easy to understand and efficient manner. Here, the selection of appropriate interpolation techniques to efficiently visualize the data is very important, as they have different prediction accuracies and computational complexities. Further, considering that the raw data in a PS is a 2-dimensional time series, widely-used interpolation techniques, like kriging, need to be revised to interpolate in time and space. For the expert analyst, as well for the end-users, a complete visual representation of the variables of interest is desired (see figure 4.1). Since only scattered data across the map are available, spatial interpolation techniques are necessary to estimate the values of the variables for the locations on the map where no measurements exist. For those situations, ordinary kriging, considered the best linear unbiased estimator [68], is the best choice for the spatial interpolation process. However, kriging implementations normally estimate each variable independently and without using temporal information. Given that PS users can be sensing several environmental variables at the same time, it is beneficial to consider multiple variables in the interpolation process because correlations among the variables provide additional information to improve the quality of the estimations. Nonetheless, the use of cokriging (a multivariate extension of kriging) is very expensive, especially for systems with large amounts of data. To address this problem, the use of kriging along with PCA and ICA for multivariate spatial interpolation is recommended. The complete details of these mechanisms will be explained in depth in Chapter 5. 3.2.5 Density Maps The last module of the framework is the Density Maps Generation Module. Once the invalid measurements are removed and the data are spatially interpolated, the PS system should be able to generate a desired density map. This map is nothing more than the desired locations from where to collect measurements in the next iteration, which is based upon the current behavior of the variable. For example, the density of measurements for the 37 next round should be decreased in a region of the map where the variable/event/factor of interest is fairly constant; however, it should be increased in a region where the variable presents high variability. That is why the output of this module is fed back into the data collection module, i.e., the incentive mechanism should use this density map to increment and decrement the rewards to the users in those particular regions. As it can be seen from Figure 3.9, the density maps close the loop of the framework providing feedback to the incentive mechanism in the data collection module. In the next iteration, the incentive mechanism should consider this information to encourage the displacement of the users to the regions where new measurements are needed to properly characterize the behavior of the variables. The idea behind the generation of the density maps is that if a variable is fairly constant in a region of the map, a low number of locations would be required to reconstruct the variable. On the other hand, if the pollutant has high variability in a region, more locations (measurements) would be needed to better characterize it. 
Considering this, a gradientbased metric is used to define the number of users that are required to properly characterize the variable of interest, taking into consideration a limited budget. The complete details about this module will be presented in Chapter 6. 38 CHAPTER 4 DATA VERIFICATION This chapter studies the problem of sensor data verification in Participatory Sensing (PS) systems using and P-Sense, an air quality/pollution monitoring application, as a validation example. Data verification, in the context of PS, consists of the process of detecting and removing spatial outliers to properly reconstruct the variables of interest. This dissertation proposes, implements, and tests a hybrid neighborhood-aware algorithm for outlier detection that considers the uneven spatial density of the users, the number of malicious users, the level of conspiracy, and the lack of accuracy and malfunctioning sensors. The algorithm utilizes the Delaunay triangulation and Gaussian Mixture Models to build neighborhoods based on the spatial and non-spatial attributes of each location. With this neighborhood definition, it can be shown that it is not necessary to apply accurate but computationally expensive estimators to the entire dataset to obtain good results, as equally accurate but computationally cheaper methods can also be applied to part of the data and obtain good results as well. The experimental results show that this new hybrid algorithm performs as good as the best estimator while reducing the execution time considerably. The chapter is organized as follows. Section 4.1 describes how invalid data is generated. Section 4.2 introduces the proposed hybrid algorithm for data verification and removal, and finally, Section 4.3 presents the results of the experiments and their corresponding analysis. 4.1 Modeling the Generation of Invalid Data This section describes the models used to generate erroneous data due to sensor malfunc- tioning and lack of accuracy and malicious attacks. Like it was presented before, P-Sense 39 was developed in order to collect environmental data to create a dataset that characterizes PS-related problems. 4.1.1 Location Sampling The data collected during one hour on a randomly chosen day is used as the dataset for the experiments presented here. Nevertheless, the training of the parameters was done in different sets for different days of the week and hours of the day. Figure 4.1 shows the measurements of the six environmental variables after applying a spatial interpolation using a multivariate approach with the combination of the widely-known ordinary kriging technique and PCA (Principal Component Analysis) and ICA (Independent Component Analysis) [24] (refer to Chapter 5). The figure shows the fine granularity of the measurements provided by P-Sense, definitively not attainable by static expensive stations located miles away from the campus. This is one of the most important advantages of a PS system when compared with current measuring systems. In this case, it is now possible to identify that the peaks in the CO2 plot corresponded to parking lot areas, definitively with higher levels of the gas compared to the rest of the campus. In order to be able to properly characterize the problem of location sampling, the data across the USF campus have been spatially interpolated. The interpolated data is used now as the data that the PS system would like to estimate, (see figure 4.1) and the sampling process selects locations and values from these data. 
In almost all cases, spatial outlier detection algorithms assume a constant spatial density and/or a grid distribution of the users/locations [46, 13, 16, 19]. Nevertheless, this might not be true for a real PS application. With an ideal incentives policy, it could be argued that the location of the measurements can be controlled, generating a perfect grid. Nevertheless, in a real application, not all areas in the space can be covered since the users have the tendency to concentrate in specific areas, such as downtown. This problem is presented in more detail in Chapter 6.

Figure 4.1: A snapshot of the variables of interest over the USF campus after the data verification process.

In order to model this behavior a Gaussian sampling method has been created. Using four 2-dimensional Gaussian functions, the centers of concentration and the corresponding dispersion of the users are defined through the corresponding means and covariance matrices. Following this spatial distribution, N random locations are generated, and the corresponding values for all six variables are picked from the original interpolated dataset. These measurements are the ideal values, but in the following sections a process for adding noise to model real sensors is introduced. By using these four Gaussian functions a good coverage of the campus is achieved, while still having different location densities (refer to Figures 4.5 and 4.6). It is also worth noticing that on every instance of the experiments, different locations are selected by using the previous sampling method. This approach brings a more complex test scenario, providing dynamics and mobility to the sensors, instead of having a static dataset for all experiments.

4.1.2 Sensor Malfunctioning and Lack of Accuracy

The original measurements of the sensors are modified in two ways, due to lack of accuracy and malfunctioning. In order to model the lack of accuracy of the sensors, once a location is selected, some noise is added to all six variables using a uniform distribution with parameters given by the accuracy of each sensor. In order to model a malfunctioning sensor, a set of locations is randomly selected, and one of the sensor's measurements is changed to a random value chosen from a uniform distribution with parameters given by the range of the sensor. The values for the accuracy and range are taken from the manufacturers' datasheets, after the characterization of the sensor, or by considering the natural range of the variables (see Table 4.1). It is assumed that when a user participates in the application, the cellphone collects more than one measurement per sensor and sends this packet to the central server. Also, more than one user might be participating at the same time in the same area. Considering this, for every selected location, several new noisy measurements are generated (15 for the experiments presented here).

Table 4.1: The accuracy and range of the pollution sensors.

Variable            Range            Accuracy
Temperature         [0 : 150]        ±0.54 °F
RH                  [0 : 100]        ±1.8%
Air Quality         [0 : 31]         ±0.2
CO2                 [200 : 9000]     ±5 ppm
CO                  [0 : 1000]       ±1 ppm
Combustible Gases   [0 : 14000]      ±80 ppm

After all locations are selected and the noise model is applied to the sensors, a process takes place in which an imaginary grid divides the area in small cells and an aggregation mechanism is applied. The aggregation process is essential to reduce data redundancy, to improve data accuracy, and to reduce the computational requirements of posterior processing [99, 51].
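The following is a minimal sketch of this data-generation model in Python, assuming NumPy; the Gaussian centers, covariance matrices, and the malfunction probability are illustrative placeholders rather than the values used in the experiments, and only the accuracy and range figures are taken from Table 4.1.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative placeholders: four 2-D Gaussian "centers of concentration"
# (in local x/y meters) and their covariance matrices.
centers = np.array([[200.0, 300.0], [800.0, 900.0], [1500.0, 400.0], [600.0, 1600.0]])
covs = [np.diag([150.0**2, 150.0**2])] * 4

# Accuracy (uniform noise half-width) and range per variable, from Table 4.1.
accuracy = {"temp": 0.54, "rh": 1.8, "aq": 0.2, "co2": 5.0, "co": 1.0, "comb": 80.0}
span = {"temp": (0, 150), "rh": (0, 100), "aq": (0, 31),
        "co2": (200, 9000), "co": (0, 1000), "comb": (0, 14000)}

def sample_locations(n):
    """Draw n user locations from the mixture of the four Gaussians."""
    idx = rng.integers(0, len(centers), size=n)
    return np.array([rng.multivariate_normal(centers[i], covs[i]) for i in idx])

def noisy_measurement(true_values, malfunction_prob=0.1):
    """Add accuracy noise to all variables; with some probability one sensor
    malfunctions and reports a random value within its full range."""
    out = {var: v + rng.uniform(-accuracy[var], accuracy[var])
           for var, v in true_values.items()}
    if rng.random() < malfunction_prob:
        var = rng.choice(list(span))
        out[var] = rng.uniform(*span[var])
    return out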
After the aggregation process, a local outlier removal algorithm is executed 42 (see Section 4.2) and the average for each variable for each cell is left. For the test scenarios presented here, the size of the cell is approximately 50 × 50 meters. 4.1.3 Malicious Attacks In this dissertation, an adversary model is also created, where users maliciously change the values of their measurements. Two cases are considered: one completely random, in which users do not know if anyone else is forging the data, and one scenario in which users attack the system in a cooperative manner. 4.1.3.1 Random Attacks In order to model random attacks, two parameters are used: conspiracy percentage and conspiracy level, both in a range of [0 : 1]. The conspiracy percentage is the number of malicious users compared with the total number N of participants in the system. A value of 0 represents a system with no malicious users and a value of 1 means that all users are conspiring against the system. Therefore, a conspiracy ratio of 0.3 means that thirty percent of the users are modifying their original measurements. After the data have been aggregated, a set of locations is selected to comply with the ratio defined by the conspiracy percentage parameter, and then the corresponding measurements are reduced by an amount defined by the conspiracy level. The conspiracy level determines the amount of reduction that the users apply to the acquired measurements. For instance, a value of 0 implies no change to the original measurements whereas a value of 0.4 implies that the measurements of all sensors are reduced by 40% their original values. Only a reduction of the values is considered because this is what makes sense in a pollution application, i.e., no one would like to cheat increasing the pollution level. The data verification mechanisms presented here do not know of the exact value of these parameters or how the data was manipulated, and therefore, the outliers detection process is not trivial. 43 4.1.3.2 Cooperative Attacks The previous case presents a simple scenario where users are blind, in the sense that they do not know how the rest of the users are modifying their values. In the previous scenario, due to its random nature, in some occasions these users will cluster and make the data verification a difficult task, but in some other cases they will appear isolated and surrounded by trusted users, making the detection process simpler. In cooperative attacks, a more complex case is considered where users forge their data in a more intelligent way, cooperating in order to confuse the data verification system and succeed in their objective. In order to model this scenario, two new concepts are introduced: the center of conspiracy, as the point with the maximum level of conspiracy that the attackers will use, and the area of influence, as the region where the malicious users are conspiring. For example, the center of conspiracy might be the location of a factory that would like to fake the pollution values around it in order to avoid penalty fees. In order to confuse the verification system, the malicious users surrounding this center will cooperate, modifying their values in a decreasing way with respect to the distance to the center of conspiracy. In other words, the closer a user is to the center, the higher the conspiracy value she uses, considering the conspiracy level of the center, as the maximum allowable value. 
In this way, the closest users will decrease their values by an amount similar to the center of conspiracy, while the furthest users will decrease their values by a very small amount. Additionally, in this cooperative case, two possible scenarios are created. In the first one, some of the surrounding users of the center conspire, while others do not. This implies that some trusted users will be mixed with malicious users. This scenario is shown in Figure 4.2(a), where crosses represent the locations of the users participating in the system, the centers of conspiracy are represented with a circle, and the malicious neighbors with squares. For the data verification process, this is a simpler condition since the values reported by the trusted users will help detect the malicious users. The second scenario considers that all surrounding nearby users will cooperate and conspire against the system. This situation, shown in Figure 4.2(b), is more difficult for the data verification system.

Figure 4.2: The generation of the areas of influence. (a) Some neighbors cooperate to conspire. (b) All neighbors cooperate to conspire.

4.2 The Hybrid Algorithm for Data Verification

This section describes the new hybrid algorithm proposed in this dissertation for the detection and removal of invalid data in a computationally efficient manner. The steps of the entire algorithm are shown in Figure 4.3. Initially, the data are aggregated in space using an imaginary grid, as explained before. On this new dataset, a local outlier detection algorithm is applied first in order to remove the outliers due to sensor malfunctioning. With the resulting measurements, the average per cell is calculated, leaving one valid value per variable per cell. The next stage is crucial. Here, a new technique is proposed to classify each location as Dense or Sparse. A Sparse location implies either distant neighbors or a strongly heterogeneous neighborhood, i.e., neighbors with very different measurement values of the same variables. On the other hand, a Dense location implies either nearby neighbors or a homogeneous neighborhood. This classification is achieved using the Delaunay triangulation and Gaussian Mixture Models.

Figure 4.3: The hybrid algorithm for data verification and removal.

Using this new classification of the locations, the hybrid algorithm calculates the neighborhood function, which estimates the value of a location based on its neighbors. This is one of the fundamental contributions of this dissertation. If the location is classified as Dense, a simple algorithm, such as the median, is used to detect and remove the outliers. Otherwise, multivariate local kriging is used. Finally, after the hybrid algorithm removes the suspicious locations, the remaining locations are used to interpolate the data and measure the quality of the interpolation using the coefficient of determination (R^2):

R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}    (4.1)

where y_i are the known values, f_i are the predicted values, and \bar{y} is the mean of the known data. An R^2 value close to 1 implies an accurate fit of the estimations to the original variables, and a value of zero, or negative, implies a poor estimation.

4.2.1 Local Outlier Detection Algorithm

Traditional statistical outlier detection methods can be either univariate or multivariate. Multivariate statistical methods are more robust and selective than univariate methods [17]. As presented in [16, 17], a widely used multivariate method is based on the Mahalanobis distance.
The Mahalanobis distance takes into account the covariance among the variables when calculating distances. With this measure, the problems of scale and correlation inherent in the Euclidean distance are no longer an issue. The square root \Delta of the quadratic form

\Delta^2 = (\vec{x} - \vec{\mu})' \Sigma^{-1} (\vec{x} - \vec{\mu})    (4.2)

is the Mahalanobis distance from the (d × 1) vector \vec{x} to the (d × 1) vector \vec{\mu} representing the expected value of the random vector X, and the (d × d) matrix \Sigma is the variance-covariance matrix of X, X being the collection of all observations \vec{x}. For the density of a d-dimensional normal variable, the paths of \vec{x} values yielding a constant height for the density are ellipsoids. That is, the multivariate normal density is constant on surfaces where the Mahalanobis distance is constant. These paths are called contours:

(\vec{x} - \vec{\mu})' \Sigma^{-1} (\vec{x} - \vec{\mu}) = c^2    (4.3)

These ellipsoids are centered at \vec{\mu} and have axes \pm c \sqrt{\lambda_i} \vec{e}_i, where \Sigma \vec{e}_i = \lambda_i \vec{e}_i for i = 1, 2, ..., d. In other words, (\lambda_i, \vec{e}_i) is one of the d eigenvalue-eigenvector pairs of \Sigma. For multivariate normally distributed data, the squared Mahalanobis distances are approximately chi-square distributed with d degrees of freedom (\chi^2_d). Multivariate outliers can now simply be defined as observations having a large Mahalanobis distance. In other words, the solid ellipsoid of \vec{x} values satisfying

(\vec{x} - \vec{\mu})' \Sigma^{-1} (\vec{x} - \vec{\mu}) \leq \chi^2_d(\alpha)    (4.4)

has probability (1 − \alpha).

Using the data aggregation process explained before, the algorithm detects abnormal correlations among the measurements in the same cell. The correlation among all random variables in the same cell should follow the same behavior. Therefore, the Mahalanobis distance is calculated for each measurement in the same cell, and consequently, the algorithm removes the measurements that do not satisfy \chi^2_d(\alpha), the upper 100\alpha-th percentile of a chi-square distribution with d degrees of freedom. For this case d = 6 (the number of variables), and after exhaustive tests, \alpha = 0.05 was found to be the optimal parameter to detect around 90% of the local outliers caused by sensor malfunctioning. The remaining local outliers get caught by the second processing stage (spatial outlier detection) or they do not significantly affect the quality of the final estimations. A graphical representation of this method is shown in Figure 4.4(a), where only CO and combustible gases are used. Points with the same Mahalanobis distance create ellipses (ellipsoids in higher-dimensional data), which become the thresholds to detect abnormal measurements, i.e., points outside the ellipse are considered local outliers. Using only the points inside the ellipse, the arithmetic average is calculated, leaving one valid value per variable per cell.

Figure 4.4: The local and spatial outlier detection methods. (a) Ellipse for the two-variable case (CO and combustible gases). (b) Spatial outlier on the CO data for the USF campus.

In the second stage, a spatial outlier detection algorithm takes care of the invalid data introduced by malicious users. If all measurements in the cell come from the same user, the previous method will not be able to detect measurements that have been consistently modified by a malicious user. As it can be seen from Figure 4.4(b), the insertion of a spatial outlier by a malicious user could generate wrong conclusions about the levels of CO in the USF campus.
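Before moving to the spatial stage, the following is a minimal sketch of the per-cell Mahalanobis filter described above, assuming NumPy and SciPy are available; the function and variable names are illustrative.

import numpy as np
from scipy.stats import chi2

def remove_local_outliers(cell, alpha=0.05):
    """Filter the raw measurements aggregated into one grid cell.
    cell: (m x 6) array, one row per raw measurement, one column per variable.
    Keeps the rows whose squared Mahalanobis distance is below the chi-square
    threshold with d = 6 degrees of freedom, and returns them with their mean."""
    mu = cell.mean(axis=0)
    cov = np.cov(cell, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse guards against singular covariances
    diff = cell - mu
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared Mahalanobis distances
    threshold = chi2.ppf(1 - alpha, df=cell.shape[1])   # upper 100*alpha-th percentile
    keep = cell[d2 <= threshold]
    return keep, keep.mean(axis=0)  # one averaged value per variable per cell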
The proposed method is based on the observation that spatial outliers can be detected if their values are notably different compared to the values from close-by locations, or nearby users. Spatial outliers are removed in this second stage using spatial data mining techniques and a new hybrid method based on the classification of the data in neighborhoods. 4.2.2 Neighborhood Definition The input data for a spatial data mining algorithm have two different types of attributes: spatial, i.e., locations, coordinates, and the like, and non-spatial attributes, e.g., the values of the measured variables of interest. In the proposed approach, both attributes are considered to define a neighborhood and calculate the neighborhood function for the outlier detection algorithm. It is important to emphasize that similar approaches found in the literature consider a fixed size of K-closest neighbors for each location. The algorithm presented here is more realistic since it does not assume an a priori known number of neighbors per location. To classify each location as dense or sparse according to its neighborhood, the neighborhoods must be defined or created first. This technique executes in two steps: First, the Delaunay triangulation is used to cluster the data based on the spatial attributes. Second, Gaussian Mixture Models are applied to cluster the data based on the non-spatial attributes of the dataset. The use of these two clustering methods is very important since the classification of the locations is not straightforward. A location can have really close neighbors, but if the measurements of the neighbors are strongly heterogeneous, i.e., very dissimilar values of 49 the variables of interest, the use of a median algorithm might not be the right choice to detect the outliers. This can be the result when using only a distance-based classification, such as the Delaunay triangulation. To solve this problem, an additional classification stage that considers the behavior of the measurements in the neighborhood, is required. This is performed by the Gaussian Mixture Models. 4.2.2.1 Delaunay Triangulation The Delaunay triangulation (a dual graph of the Voronoi diagram) has been used before to determine the natural neighbors [32, 33]. A Delaunay triangulation is the set of edges connecting a set of points such that each location is joined to its nearest/natural neighbors. Here, the Delaunay triangulation is used as the first estimate to classify each location, which is based only on the spatial attributes of the dataset. Using this triangulation, for each location the average distance to all neighbors is calculated, resulting in N averages, one per location. A location is classified i as having a Sparse neighborhood if its average distance to its neighbors µdi complies with µdi > (µavg + σavg ), where µavg and σavg are the mean and standard deviation of the average distance of all locations. Figure 4.5(a) shows the initial classification of the locations as Sparse or Dense after using this technique. 4.2.2.2 Gaussian Mixture Models The second step clusters the data based on the non-spatial attributes of the dataset. For this purpose, a Gaussian Mixture Model (GMM) is trained on the variables without using the location information. This clusters the locations based upon how similar their measurements are. Normally, is is expected nearby locations to have similar measurements. 
The Expectation-Maximization (EM) algorithm [100, 101] is utilized, with an initial estimation of the means and covariance matrices using K-means. As with any other clustering technique, the number of components kc to be used needs to be defined. Considering this, the distortion metric and the jump criterion presented in [102] are utilized. Applying this technique on different scenarios always generated consistent results, finding that 5 or 6 components for the GMM create a good clustering of the locations, in the sense that it is possible to classify the neighbors as homogeneous or heterogeneous. The result of the GMM clustering technique can be seen in Figure 4.5(b).

The homogeneity of each neighborhood is determined by a metric based on the mode component of each neighborhood. The mode component (the component with the highest frequency of appearance) of the neighborhood of location i is defined as mBi, the number of neighbors as nei, and the number of neighbors that belong to component mBi as nmi. If nmi < (0.4 · nei), location i is classified as Sparse. The previous criterion establishes that at least 40% of the locations must belong to the mode component to classify a location as Dense. After experimenting, it was found that using a threshold greater than 0.5 (the existence of a majority component) is a very strong condition and most locations would end up being classified as Sparse. After applying this additional classification criterion, those additional Sparse locations are added to the previous classification shown in Figure 4.5(a). The final classification (Figure 4.5(c)) clearly shows how, after applying this additional density metric, more locations are classified as Sparse. In both Figure 4.5(a) and 4.5(c), Dense locations are represented as blue triangles and Sparse locations as red dots.

Figure 4.5: The two steps of the Hybrid approach. (a) Classification based only on spatial attributes (distance); locations classified as Dense are 69.6% and as Sparse are 30.4%. (b) Classification of the locations by using a GMM with 5 components. (c) Classification based on distance and the similarity of the measurements; Dense locations are 55.7% and Sparse are 44.3%.

4.2.2.3 Selection of the Neighbors for Each Location

Up to this point, each location has been classified as Dense or Sparse. Now the neighbors for each location need to be selected to calculate the neighborhood function. In the case of Sparse locations, local kriging is utilized to calculate the neighborhood function. For this, the K-closest neighbors to each Sparse location are considered, where K = N/10 has been found to be a minimum value to create a good regression using a spherical model for the semi-variogram. The interested reader will find more details about the kriging technique in Chapter 5. In the case of Dense locations, a location j is a neighbor of location i if the following two conditions hold:

1. dji < (µavg − σavg), where dji is the distance between j and i.

2. γ(j, mBi) > 1 × 10^{-3}, where γ(j, mBi) is the probability that location j belongs to component mBi.

Figure 4.6: The neighborhood for the Sparse (left) and Dense (right) locations.

Even though the value 1 × 10^{-3} seems small, it was found to be a good threshold to consider nearby locations that have "similar" measurements but do not belong to the same component. Figure 4.6 presents the neighbors for each location, Sparse and Dense; a minimal sketch of the complete two-step classification is given next.
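The following is a minimal sketch of this two-step Sparse/Dense classification, assuming SciPy and scikit-learn are available; the thresholds follow the text (the µavg + σavg distance criterion and the 40% mode-component rule), while the number of components and the variable names are illustrative.

import numpy as np
from scipy.spatial import Delaunay
from sklearn.mixture import GaussianMixture

def classify_locations(coords, values, kc=5, mode_ratio=0.4):
    """coords: (N x 2) aggregated locations; values: (N x 6) measurements.
    Returns a boolean array where True means Sparse and False means Dense."""
    n = len(coords)
    tri = Delaunay(coords)

    # Natural neighbors taken from the Delaunay edges.
    neighbors = [set() for _ in range(n)]
    for simplex in tri.simplices:
        for a in simplex:
            for b in simplex:
                if a != b:
                    neighbors[a].add(b)

    # Step 1: distance-based classification (spatial attributes only).
    mean_dist = np.array([np.mean([np.linalg.norm(coords[i] - coords[j])
                                   for j in neighbors[i]]) for i in range(n)])
    mu_avg, sigma_avg = mean_dist.mean(), mean_dist.std()
    sparse = mean_dist > (mu_avg + sigma_avg)

    # Step 2: GMM on the non-spatial attributes (EM with k-means initialization,
    # which is scikit-learn's default, matching the text).
    comp = GaussianMixture(n_components=kc, random_state=0).fit_predict(values)
    for i in range(n):
        neigh = list(neighbors[i])
        _, counts = np.unique(comp[neigh], return_counts=True)
        if counts.max() < mode_ratio * len(neigh):  # no mode component reaches 40%
            sparse[i] = True
    return sparse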
In Figure 4.6 it can be seen how the Dense neighborhoods cluster following the information from the GMM components (see Figure 4.5(b)). The final purpose of this two-stage procedure is to have a classification that is sensitive to the level of conspiracy in the system. Figure 4.7 shows the ratio between Dense and Sparse locations versus the conspiracy percentage. When there is no conspiracy, the ratio depends only on the distance, and ultimately, on the Delaunay triangulation. As expected, when the conspiracy percentage is incremented, the number of Sparse locations also increments, since the heterogeneity of the locations in the same neighborhood also increases. This special behavior is what allows this algorithm to adapt to an increasing number of malicious users.

Figure 4.7: The ratio between dense and sparse locations versus the conspiracy percentage. The conspiracy level is fixed at 0.4.

All these previous parameters, such as the definition of 40% as the threshold to classify a location as Dense instead of the majority criterion, have been selected after a training stage on different datasets. In the same way, filtering data outside one or two standard deviations is a known technique to remove outliers and non-significant data.

4.2.3 Spatial Outlier Detection for Multiple Attributes

After the neighborhood for each location is defined, a spatial outlier removal algorithm is needed. There are two alternatives to detect outliers using spatial statistics: a quantitative or a graphical approach [12]. The graphical approach basically highlights the outliers for later interpretation by the user. Some examples of these methods are variogram clouds and Moran scatterplots. The graphical tests lack a precise criterion to decide on the outliers and depend strongly on the perception of the analyst, which is why a quantitative approach is considered here.

4.2.3.1 General S-Outlier Detection and Removal

The proposed hybrid algorithm is based upon the work of Lu et al. in [16, 43, 45] about local differences. The basic general algorithm is presented first and the hybrid modification is presented afterwards. Given a set of points X = {x1, x2, . . . , xN} in a space with dimension p ≥ 1, an attribute function is defined as a mapping of X to ℜ (the set of all real numbers). The attribute function f(xi) is the value of that attribute in location xi. For each point xi, NNk(xi) is denoted as the set of the k-closest neighbors of xi, where k can be location-dependent, but for simplicity in this case, k is a constant for all locations. A neighborhood function g(xi) is a mapping of X to ℜ, such that for each location, it returns a summary statistic of the attribute values in NNk(xi). To detect s-outliers, the attribute function f(xi) is compared to the neighborhood function g(xi). Such a comparison is done through a function h, which can be a ratio (f/g), a difference (f − g), etc. Given yi = h(xi), a point xi is an s-outlier if yi is an extreme value in {y1, y2, . . . , yN}. This definition can be very general, and there are multiple variations depending on how the neighborhood function g(xi) is calculated, or which comparison function h is used. Three basic algorithms are: the iterative r (g is the average of the attribute in NNk(xi), and h = f/g), the iterative z (g is the average, but h = f − g), and the median algorithm (g is the median and h = f − g) [16]. After extensive testing, the median algorithm was found to give the best results among these three.
This performance is because the median is a robust estimator of the "center" of a sample. Algorithm 1 presents general pseudo-code for the median algorithm.

Algorithm 1: The Median Algorithm.
Input: X, f
Output: Spatial outlier-free data
foreach spatial point xi do
    Compute the k-neighbor set NNk(xi), the neighborhood function g(xi) = median of the data set {f(x) : x ∈ NNk(xi)}, and the comparison function hi = h(xi) = f(xi) − g(xi);
end
Let µ and σ denote the mean and standard deviation of the dataset {h1, h2, . . . , hN};
Standardize the dataset and compute the absolute values: y_i = |(h_i − µ)/σ|;
For a given positive integer m, let {i1, i2, . . . , im} be the indexes such that their y values are the m largest;
Locations {xi1, xi2, . . . , xim} are the m s-outliers of the dataset X;

As it can be seen, this approach presents several problems. First, the number of neighbors k is fixed for every location, and the selection of these neighbors is only based on distance, without considering the non-spatial attributes. Second, it is necessary to know beforehand the number m of outliers in the dataset (which is not feasible), and that also fixes the stopping point of the algorithm regardless of the dataset. And third, it does not consider the possibility of a multivariate attribute function. A multivariate extension has been made in [45], but after testing, the results were poor.

4.2.3.2 Hybrid S-Outlier Detection and Removal

In order to detect spatial outliers, a multivariate (multiple non-spatial attributes) quantitative approach is considered here. Given a set of points X = {x1, x2, . . . , xN}, a multivariate attribute function is defined as F = {F1, F2, . . . , F6}, representing the measurements for all 6 variables in all N locations. A neighborhood function gi(j) and a comparison function hi(j) = Fi(j) − gi(j) are also defined for each attribute i and every location j. Almost all previous work in the area of spatial outlier detection considers only one neighborhood function. However, this approach is different, as the proper estimation is selected based on the classification of the neighborhood. Proper in the sense of finding a good trade-off between computational requirements and quality of the final fit of the spatial interpolation. The idea is to use a simple local method (iterative r, iterative z, or median algorithms) [16, 45] when a location has a Dense neighborhood, and use a linear estimation of the local value using the kriging technique, which is considered the best linear unbiased spatial predictor [19], for the Sparse case. This combination improves the accuracy and execution time compared to existing methods, as the most appropriate detection algorithm is applied for each neighborhood, i.e., a simple method in dense neighborhoods and a costly but more accurate method in sparse neighborhoods only. For the case of locations in dense neighborhoods, after initial tests, the median algorithm was chosen over the iterative r or z algorithms due to its higher detection ratio, simplicity, and lower computational complexity. The reader may find and use similar algorithms and probably obtain similar results; however, the main interest here is to show that our hybrid algorithm improves the final results of a single gi(.) approach. For the sparse case, a global estimation using kriging is avoided; instead, local kriging is utilized.
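Before describing local kriging in detail, the following is a minimal sketch of the detection loop of Algorithm 1 with the median as the neighborhood function g(.), i.e., the estimator later used for the Dense case; NumPy is assumed, and the neighbor lists are taken as given (for example, from the classification stage described earlier).

import numpy as np

def median_s_outliers(f, neighbors, m):
    """f: (N,) values of one attribute; neighbors: list of index arrays, one
    per location; m: number of s-outliers to report (fixed, as in Algorithm 1)."""
    g = np.array([np.median(f[nbrs]) for nbrs in neighbors])  # neighborhood function
    h = f - g                                                 # comparison function
    y = np.abs((h - h.mean()) / h.std())                      # standardized |h|
    return np.argsort(y)[-m:]                                 # indexes of the m largest values

The hybrid approach of Algorithm 2, presented next, replaces the fixed number of outliers m with the threshold θ and recomputes g(.) for the affected neighborhoods after each removal.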
Local kriging is a kriging estimation that only uses a subset (nearby locations) of the total number of locations, in order to reduce the computational complexity. For this local estimation, Ordinary kriging is preferred, since Simple kriging does not adapt well to local trends [103]. Ordinary kriging uses a local mean µz to re-estimate the mean at each grid node from the data within the search neighborhood. In order to have a multivariate criterion, either PCA or ICA is utilized (depending on the variable of interest) as a stage prior to the interpolation process [24]. The complete details about the use of kriging along with PCA or ICA can be found in Chapter 5.

Algorithm 2 presents the proposed hybrid approach in pseudo-code. As can be seen, the hybrid approach removes a location and all its attributes if the estimation of at least one of the attributes (either using the median or local kriging) is considered a spatial outlier. The parameter θ can vary between 2.5 and 3.0, corroborating the results presented in [16].

Algorithm 2: The Hybrid Algorithm.
Input: Aggregated data
Output: Spatial outlier-free data
foreach attribute Fi do
    foreach location j do
        Calculate gi(j) using the median or the local kriging estimation, depending on the neighborhood classification;
        Calculate hi(j) = Fi(j) − gi(j);
    end
    hn(.) = Normalize(hi(.));
    Find a location b with maximum hn(.);
    while hn(b) > θ do
        Remove location b and all its attributes Fi(b);
        Update gi(x) and hi(x) for every location x that has b in its neighborhood;
        Find a new location b with maximum hn(.);
    end
end

Some approaches consider data modification instead of removal. In that case, the location is not removed, but the attribute function for that location is changed to the value predicted by the neighborhood function. After testing, it was found that this approach works well when there is a low number of outliers and a low density of malicious users per neighborhood. On the other hand, when there is a large number of malicious users, the s-outliers affect the estimation process of their neighbors and the final performance is poor. The removal of the locations classified as s-outliers leads to a better quality of the final estimations, and that is the reason why that approach is used here.

4.2.4 Computational Complexity

The hybrid algorithm will be compared against the median algorithm and a pure local Ordinary kriging approach. To ease the mathematical analysis, C denotes the number of neighbors of each location, and the number of spatial outliers is ms = cons_perc × N. For a dataset of l points, the median requires O(l) time and Ordinary kriging requires O(l²) [69]. For the general spatial outlier removal algorithm, the strongest computational requirement consists of the initial calculation of the gi(.) function for all N locations. Using the median algorithm for this stage requires O(CN) time, while the kriging approach takes O(C²N). Ideally, the algorithm removes one spatial outlier per cycle, therefore executing ms additional cycles. When removing one spatial outlier, the algorithm recalculates gi(.) for C locations, requiring O(Cms) for the median and O(C²ms) for the kriging case. The Delaunay triangulation requires O(N logN) [34], the training of a GMM using the EM algorithm requires O(kc N d) [41], and applying the classification criteria (spatial and non-spatial) requires O(N). Therefore, the complete neighborhood definition stage executes in O(N logN) time.
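Before combining these bounds, the per-location neighborhood function gi(.) at the core of Algorithm 2 can be sketched as follows. This is an illustrative sketch, not the actual implementation: the exponential covariance and its parameters are placeholders for the model fitted from the data (see Chapter 5), and neigh and is_dense are assumed to be the outputs of the neighborhood definition stage described above.

    import numpy as np

    def ok_estimate(xy, z, x0, sill=1.0, rang=0.1):
        """Local ordinary kriging of z at x0 from its neighbors (exponential covariance)."""
        cov = lambda d: sill * np.exp(-d / rang)
        d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
        n = len(z)
        A = np.ones((n + 1, n + 1))                 # kriging system with Lagrange multiplier
        A[:n, :n] = cov(d)
        A[n, n] = 0.0
        b = np.ones(n + 1)
        b[:n] = cov(np.linalg.norm(xy - x0, axis=1))
        lam = np.linalg.solve(A, b)
        return lam[:n] @ z

    def neighborhood_function(coords, values, neigh, is_dense):
        """gi(j): median in Dense neighborhoods, local ordinary kriging in Sparse ones."""
        g = np.empty(len(values))
        for j, nb in enumerate(neigh):              # nb: indexes of the neighbors of location j
            if is_dense[j]:
                g[j] = np.median(values[nb])
            else:
                g[j] = ok_estimate(coords[nb], values[nb], coords[j])
        return g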
As explained before, the hybrid algorithm has the median and the kriging algorithm as lower and upper execution bounds, respectively. Ideally, when the system has no malicious users (ms ≈ 0), the hybrid algorithm depends on the initial calculation of gi (.) for all N using the median (O(CN )), but no outliers are removed. Therefore, the total complexity would be O(N logN + CN ) = O(N logN ). On the other hand, when all users are conspiring (ms → N ), the hybrid algorithm would calculate the gi (.) for all N using kriging (O(C 2 N )) and additionally would remove N outliers (O(C 2 N )). The total complexity would be O(N logN + C 2 N + C 2 N ) = O(C 2 N ). In practice, the hybrid algorithm would be somewhere in between these two computational bounds, as it will be corroborated with the experimental results in the following section. 4.3 Experiments and Analysis In this section, the experiments will compare the performance of three algorithms for spatial outlier detection: the median algorithm alone (refer to Algorithm 1), the local kriging algorithm alone, and the proposed hybrid algorithm (refer to Algorithm 2). For the local kriging algorithm, a structure similar to Algorithm 1 is utilized, but the local kriging estimation is calculated for the g(.) neighborhood function, regardless of the classification of the neighborhood, in a similar way as presented in [16, 43]. Initially, the results for the malicious random attacks are presented. After this, the execution time of these three algorithms is measured, and the results are compared. Finally, the case for cooperative attacks is presented and analyzed. 58 4.3.1 Random Attacks The evaluation consists of two different scenarios. The first scenario fixes the conspiracy level and increases the conspiracy percentage. The second scenario uses a fixed conspiracy percentage and increases the conspiracy level. In both cases the execution time was measured to show that the hybrid algorithm is faster than a pure kriging approach. Table 4.2 shows the parameters used in the experiments. Table 4.2: Parameters used in the experiments. Parameter Number of Locations (N) Conspiracy Level Conspiracy Percentage Local Outliers Repetitions Range 100 [0.1 : 0.70] [0.0 : 0.25] 10 4 For each instance of the experiment, the coefficient of determination is calculated, and since every experiment is repeated 4 times, the average of R2 is reported as the final performance metric. As explained before, the detection ratio for the local outliers is around 90%, and no more results are presented here since the final goal is the quality of the final interpolation, measured through R2 . Due to formatting restrictions, the results for four variables (temperature, relative humidity, carbon monoxide (CO) and carbon dioxide (CO2 )) are reported, but the performance of the algorithms is similar for the remaining variables (combustible gases and air quality). 4.3.1.1 Conspiracy Level For this first experiment, the conspiracy level is varied while maintaining a fixed value of the conspiracy percentage (0.15). The results of these experiments are shown in the left part of Figure 4.8. In general, a negative value of R2 represents a model-fitting procedure for a completely different data. One of the most notable results is the behavior of the median algorithm for this conspiracy percentage. Regardless of the conspiracy level, the median algorithm 59 Figure 4.8: R2 versus the conspiracy level (left) and the conspiracy percentage (right) for all four variables and the three algorithms. 
60 always produces a negative coefficient of determination, which implies a poor result for the final spatial interpolation process. The only exception is in the case of the carbon monoxide (CO), which presents similar results to local kriging and the hybrid algorithm. This is because CO presents really low values across the map and the conspiracy of the user does not affect this variable much. On the other hand, the local kriging and the hybrid algorithms generate similar and almost constant performance for the complete range of the conspiracy level. This clearly proves how robust and resilient the local kriging algorithm is, but also that the new hybrid approach matches this performance. The reader must remember that kriging is, statistically speaking, the best linear unbiased spatial predictor [19]. Achieving a performance close to kriging is optimal, and above this value could imply over fitting of the data. Nevertheless, as it will be shown later, the hybrid algorithm not only matches this accuracy, but also outperforms the local kriging approach in the execution time. 4.3.1.2 Conspiracy Percentage In this set of experiments, the conspiracy level is fixed at 0.5 and the conspiracy percentage is varied. The results of these experiments are shown in the right part of Figure 4.8. The median algorithm achieves a poor performance when the conspiracy percentage is greater than 0.10 (1 out of 10 users is modifying the measurements). Even though its performance is really good before this point, this condition should not be assumed as the worst case scenario for a real PS system. Nevertheless, the local kriging and hybrid algorithms again outperform the median algorithm. As expected, the accuracy of the final spatial interpolation suffers with the number of malicious users. However, both algorithms present a good coefficient of determination up to a conspiracy level of 0.22, a notable result compared to the case for the median algorithm. As in the previous experiment, the hybrid algorithm has a similar performance compared to the local kriging, but with a faster execution time. 61 4.3.2 Execution Time Finally, it is important to compare the execution time of the three algorithms. Figure 4.9 plots the different execution times for both experiments: on the left, the conspiracy level is varied while the conspiracy percentage is fixed at 0.20, and on the right, the conspiracy percentage is varied while the conspiracy level is fixed at 0.40. As it can be seen from the figure, the execution times are roughly constant for all instances of the experiments. As expected, the median algorithm is the fastest of all three because of the simplicity in the calculation of the median; however, the limitations of its performance have been presented already. Figure 4.9: The execution time of the three algorithms for data verification. On the other hand, the figure shows how the hybrid algorithm executes in less than half the time of the local kriging, but still matching its performance. Under the simulated conditions and after extensive experiments, it was found that the neighborhood generation is quite fast, and the bulk of the computation comes from the spatial outlier detection and removal. These results corroborate the theoretical analysis of the execution time of these three algorithms. 62 4.3.3 Cooperative Attacks This section evaluates the proposed algorithms in the case of cooperative attacks. 
Two different conditions are considered: one where all of the users surrounding the center of conspiracy cooperate, and another where only some of these users cooperate. Two centers of conspiracy are considered and 5 surrounding users cooperate, as shown in Figure 4.2. The experiments consider values of the conspiracy level of the center between 0.1 and 0.8. For all three algorithms, local kriging (left), Hybrid (middle), and median (right), the verified case is compared with the non-verified result.

Figure 4.10: The R2 for the cooperative attacks where only some of the users surrounding the center cooperate.

Figure 4.10 presents the coefficient of determination corresponding to the case where only some of the users surrounding the centers of conspiracy cooperate. As expected, the hybrid algorithm achieves a better performance than the median case and comes very close to the optimal estimator (local kriging). Also, the median algorithm loses accuracy as the conspiracy level increases, something that does not happen to the hybrid and local kriging cases, which exhibit an almost stable performance. Although not shown here, the quality of the spatial estimation for the rest of the variables without data verification is also very poor for conspiracy levels above 0.1.

Figure 4.11 presents the case where all users surrounding the centers of conspiracy cooperate to confuse the verification system even further. For all three algorithms, local kriging (left), Hybrid (middle), and median (right), the verified case is compared with the non-verified result. Again, local kriging achieves the best results, followed very closely by the hybrid algorithm, while the median approach has a mediocre performance. If these results are compared with the ones shown in Figure 4.10, it is easy to see that this case is more demanding for the data verification system than the case where only some of the users conspire. In addition, it can be observed that this cooperative attack has a larger impact as the conspiracy level increases.

Figure 4.11: The R2 for the cooperative attacks where all of the users surrounding the center cooperate.

Finally, Figure 4.12 shows the effect and importance of the verification process. The left panel shows the normal interpolation when no malicious users are attacking the system; the middle panel shows the resulting spatial estimation after all the surrounding users have cooperatively conspired, without data verification; and the right panel shows the interpolation when the Hybrid algorithm is applied. In this case, there are 5 cooperative users and two centers of conspiracy, each with a conspiracy level of 40%. For the non-verified case, it is clear that the spatial estimation is very poor and the visualization can be misleading. On the other hand, the quality of the spatial estimation after the verification process is quite good and reconstructs the original variable in such a way that the details and most of the meaningful information can be extracted.

Figure 4.12: The effect of the verification process in the final interpolation for temperature, RH, CO2 and CO.

4.4 In Conclusion: Data Verification

The main conclusions that can be drawn from this chapter about data verification are:

• The adversary model that has been designed and implemented successfully simulates complex scenarios in which the users attack the system in a random and a cooperative way, while also considering the low accuracy and the high failure rate of the sensors.
• The neighborhood-aware mechanism properly classifies each location into Dense and Sparse, based on the spatial and non-spatial attributes. This was achieved through the use of the Delaunay triangulation (applied to the spatial attributes) and Gaussian Mixture Models (applied to the non-spatial attributes).

• By using this classification, it is possible to apply computationally expensive estimation techniques only in the cases where they are necessary, while still matching the performance of kriging (the best spatial estimator) under different test scenarios. The experimental results show that the proposed hybrid algorithm executes in less than half of the time required by the kriging technique.

• It has been shown that the data verification module is fundamental for the appropriate analysis of the data, since it removes invalid measurements that could affect the conclusions drawn from the complete spatial interpolation.

CHAPTER 5
DATA VISUALIZATION

This chapter studies the problem of data interpolation in Participatory Sensing (PS) systems using the data collected by P-Sense as a validation example. While traditional environmental monitoring systems consist of very few static measuring stations, PS systems rely on the participation of many mobile stations. As a result, the structure of the data provided by each system is different: instead of a multivariate time series with a few gaps in the same space, it consists of a multivariate time-space series with many gaps in time and space. First, two data interpolation techniques, Markov Random Fields and kriging, are analyzed. After showing the trade-offs and the superiority of kriging, this technique is used to perform a one-variable data interpolation. The problems of cokriging for multivariate interpolation are introduced, and then a new method based on Principal Component Analysis and Independent Component Analysis is proposed. Finally, a new method to interpolate data in time and space is proposed, which is more appropriate for PS systems. The results indicate that the accuracy of the estimates improves with the amount of data, i.e., one variable, multiple variables, and space and time data. Also, the results clearly show the advantage of a PS system compared with a traditional measuring system in terms of the precision and granularity of the information provided to the users.

The chapter is organized as follows. Section 5.1 compares Markov Random Fields versus kriging, two important techniques for spatial interpolation. Section 5.2 shows the use of kriging to interpolate one variable in space. The work on multivariate spatial interpolation and the corresponding results are presented in Section 5.3, and the proposed interpolation method for space and time is presented in Section 5.4.

5.1 MRF vs Kriging for PS

First, a comparative analysis between two techniques for spatial interpolation (MRF and kriging) is presented. After introducing the basic theory of both techniques, a PS scenario is simulated to compare the performance of these approaches.

5.1.1 Markov Random Fields and Gibbs Sampling

The theory of Markov Random Fields (MRF) applied to image restoration is used here for spatial interpolation. Every location in the map is considered equivalent to a pixel in an image, and the idea is to estimate and fill the missing locations in the map, or the noisy pixels in the image. In [23], an analogy between images and statistical mechanics is made.
The pixel gray levels and the orientation of the edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function to the physical system determines the Gibbs distribution [104]. The algorithm for image restoration adopts a Bayesian approach and introduces a hierarchical stochastic model for the original image based on the Gibbs distribution, leading to a procedure that allows the computation of the maximum a posteriori (MAP) estimate. The restoration algorithm also uses stochastic relaxation and annealing scheduling. The stochastic relaxation algorithm can be described as follows:

1. A local change is made in the image based upon the current values of the pixels in the immediate neighborhood. This change is generated by sampling from a local conditional probability distribution.

2. The local conditional distributions depend on a global control parameter T, known as temperature. At low temperatures, the local conditional distributions concentrate on states that increase the objective function, whereas at high temperatures, the distribution is essentially uniform.

3. The algorithm avoids local maxima by using the "annealing schedule". It begins by using high temperatures, where many of the stochastic changes will actually decrease the objective function, and then the temperature is gradually decreased.

The energy function is relaxed in the local environment of the variable, formed by the cliques of the MRF, or fully connected subsets of variables. The Markovian assumption makes the value of each variable depend only on its vicinity instead of on the values of all pixels. Since the conditional distribution of every location is unknown, N replicated images (maps) with noisy and missing pixels are generated. At each iteration, the energy of each pixel, which depends on its value and its neighborhood, is compared with the energy of replacing that pixel with a new pixel drawn from one of these generated images. The new value for the pixel is randomly assigned to the current image based on a uniform distribution and the energy exponential ratio. Finally, the "annealing schedule" updates the current temperature at each iteration using:

T = T0 / log(1 + K)   (5.1)

where T0 is the initial temperature and K is the current iteration.

The proposed model for the MRF is an 8-pixel/location neighborhood (see Figure 5.1), as follows: horizontal cliques = {(S, 4), (S, 5)}, vertical cliques = {(S, 2), (S, 7)}, and diagonal cliques = {(S, 1), (S, 3), (S, 6), (S, 8)}. For the energy function, 2-pixel cliques are utilized in three different configurations: horizontal, vertical, and diagonal. The energy function (based on the Ising model [105]) is the following:

U(s) = α V(s) + βh Σ_{SH} [V(s) − V(s′)] + βv Σ_{SV} [V(s) − V(s′)] + βd Σ_{SD} [V(s) − V(s′)]   (5.2)

where V(s) is the value of the pixel at location s, α measures the external field, (βh, βv, βd) measure the bonding strengths for horizontal, vertical, and diagonal cliques, and (SH, SV, SD) are the sets of horizontal, vertical, and diagonal cliques, respectively. The definition of these parameters depends on the characteristics of the variable to be estimated (or the image to be restored).

Figure 5.1: Cliques definition.

Since the idea here is to prove the concept of MRF versus kriging interpolation, the parameters of the energy function (α and βh,v,d) were optimized for the test map.
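For illustration, the following sketch computes the local energy of Eq. (5.2) for one pixel and the annealing temperature of Eq. (5.1). It is only a sketch under simplifying assumptions: the clique potential is taken as the absolute difference of the pixel values (one common smoothness choice), and the α and β values shown are placeholders rather than the parameters tuned for the test map.

    import numpy as np

    # Placeholder external-field and bonding parameters (tuned per map in the experiments)
    ALPHA, BETA_H, BETA_V, BETA_D = 0.0, 1.0, 1.0, 0.5

    def local_energy(img, r, c):
        """U(s) for the pixel at (r, c) using the 8-pixel clique structure of Figure 5.1."""
        v = img[r, c]
        horiz = [(r, c - 1), (r, c + 1)]
        vert = [(r - 1, c), (r + 1, c)]
        diag = [(r - 1, c - 1), (r - 1, c + 1), (r + 1, c - 1), (r + 1, c + 1)]
        def clique_sum(pairs, beta):
            # absolute pixel difference used as the 2-pixel clique potential
            return beta * sum(abs(v - img[i, j]) for i, j in pairs
                              if 0 <= i < img.shape[0] and 0 <= j < img.shape[1])
        return (ALPHA * v + clique_sum(horiz, BETA_H)
                + clique_sum(vert, BETA_V) + clique_sum(diag, BETA_D))

    def temperature(t0, k):
        """Annealing schedule of Eq. (5.1); k is the current iteration (k >= 1)."""
        return t0 / np.log(1.0 + k)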
5.1.2 The Kriging Technique

Kriging is one of the most widely used techniques in geostatistics (a branch of statistics that focuses on spatio-temporal datasets). Kriging is a BLUE (Best Linear Unbiased Estimator) interpolator, i.e., it is a linear estimator that matches the correct expected value of the population (unbiased) while minimizing the variance of the estimation error (best). The basic theory of kriging for spatially interpolating one variable is presented here. Sections 5.2 and 5.3 present the use of kriging, along with some modifications, to spatially interpolate the dataset of air pollution measurements collected by P-Sense. For the interested reader, a detailed explanation of the kriging technique can be found in [22, 53]; the brief introduction presented here is based on these references.

Let Z(~x, ξ) be a random function representing any of the variables of interest (temperature, RH, CO, etc.). In this notation, ~x denotes a point in space (for this case a 2D or 3D space), while ξ denotes the state variable in the space of realizations. Therefore, given a set of sample values Z(~xi, ξ1), with i = 1, 2, . . . , N, the main idea is to reconstruct all possible locations, i.e., Z(~x, ξ1), a realization of Z(~x, ξ). A first assumption about the random function is that it is second-order stationary, implying that the first two statistical moments (mean and covariance) are translation invariant:

1. Constant mean: E[Z(~x, ξ)] = m

2. The autocovariance is a function of the distance between the reference points ~x1 and ~x2: cov[x1, x2] = E[(Z(~x1, ξ) − m)(Z(~x2, ξ) − m)] = C(~h)

The usual definition of variance (also called dispersion variance) is obtained when ~h = 0: C(0) = var[Z] = σ². For simplicity, the variable ξ and the vector notation for x and h are dropped. A new variable Y(x) with zero mean can now be defined:

Y(x) = Z(x) − m   (5.3)
E[Y(x)] = 0   (5.4)
Yi = Y(xi)   (5.5)

It is now desired to find a linear estimator Y0* of Y0 at point x0 using the observed values. The general form of this estimator is given by:

Y0* = Σ_{i=1}^{n} λi Yi   (5.6)

The weights λi are calculated by imposing the restriction that the statistical error ǫ(x0) = (Y0 − Y0*) has zero expected value and minimum variance. After applying these restrictions, a linear system of equations is obtained:

Cλ = b   (5.7)

where the matrix C is the n × n matrix of covariances among the sample points:

C = [C(xi − xj)], for i, j = 1, . . . , n, with diagonal entries C(xi − xi) = C(0)   (5.8)

and the vector b contains the covariances between each sample point and the estimation point:

b = [C(x1 − x0), C(x2 − x0), . . . , C(xn − x0)]^T   (5.9)

Once the weights are calculated, the estimation of Y0 at point x0 can be found. Nevertheless, the second-order stationarity property does not always hold, and in that case it is necessary to use the intrinsic hypothesis, in which it is assumed that the first-order increments:

δ = Y(x + h) − Y(x)   (5.10)

are second-order random functions with:

E[Y(x + h) − Y(x)] = m(h) = 0   (5.11)

and

var[Y(x + h) − Y(x)] = 2γ(h)   (5.12)

The function γ(h) is the variogram, and 2γ(h) is called the semi-variogram. The variogram is related to the covariance function in the following way:

γ(h) = C(0) − C(h)   (5.13)

It is clear that by calculating the variogram, the covariance can also be calculated, and therefore the weights λ.
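As an illustration of Eqs. (5.6)–(5.9), the following sketch builds the covariance matrix C and the vector b and solves for the weights λ for the zero-mean variable Y. It is a minimal example: the exponential covariance function and its parameter values are assumptions standing in for the model fitted from the experimental semivariogram discussed next.

    import numpy as np

    def kriging_estimate(xy, y, x0, cov):
        """Solve C lam = b (Eqs. 5.7-5.9) and return Y0* = sum(lam_i * Y_i) at x0."""
        d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)  # pairwise distances
        C = cov(d)                                                   # matrix of Eq. (5.8)
        b = cov(np.linalg.norm(xy - x0, axis=1))                     # vector of Eq. (5.9)
        lam = np.linalg.solve(C, b)                                  # kriging weights
        return lam @ y

    # Assumed covariance model standing in for the fitted semivariogram model
    cov = lambda h: 1.0 * np.exp(-h / 2.0)

    rng = np.random.default_rng(1)
    xy = rng.random((20, 2)) * 10.0
    z = np.sin(xy[:, 0]) + rng.normal(0.0, 0.1, 20)
    y = z - z.mean()                                 # zero-mean variable Y = Z - m
    print(kriging_estimate(xy, y, np.array([5.0, 5.0]), cov) + z.mean())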
In order to calculate a robust experimental semivariogram 2γ̄(h), the work in [106] defines:

2γ̄(h) = [ (1/N) Σ_{i=1}^{N} |Z(xi) − Z(xj)|^{1/2} ]^4 / (0.457 + 0.494/N)   (5.14)

where the sum is taken over the N pairs of points separated by the lag h. Given this experimental semivariogram, its values need to be fitted to a theoretical model, with the spherical, Gaussian, and exponential models being the most commonly used. For the dataset that was used, the model that worked best is the spherical model:

γ(h) = (s − n) [ (3h/(2r) − h³/(2r³)) 1(0,r)(h) + 1[r,∞)(h) ] + n 1(0,∞)(h)   (5.15)

where n is the nugget, or the height of the jump of the semivariogram at the discontinuity at the origin; s is the sill, or the limit of the variogram as the lag distance h tends to infinity; and r is the range, or the distance at which the difference between the variogram and the sill becomes negligible. The function 1A(h) is 1 if h ∈ A, and 0 otherwise. The use of either of the other two models would not change the methods presented here, only the fitting process of the semivariogram.

For the semivariograms, it is assumed that the variability of the random function does not depend on the direction in space (isotropic semivariograms). In the future, for the case of air pollution, if wind speed and direction data are available, anisotropic semivariograms could be considered. As mentioned before, the isotropic spherical model fits the calculated experimental semivariograms well. Figure 5.2 shows the semivariogram for the simulated PS scenario (presented next), which includes the fitted spherical model. Even though the model does not seem to do a perfect job due to the dispersion of the points, this fit is a normal result when using noisy data.

Figure 5.2: Experimental semivariogram and the fitted curve of an isotropic spherical model. Range = 13.628, Sill = 4866, Nugget = 0.0.

A software application has been implemented to automate the entire interpolation process. The application generates the experimental semivariogram, fits the values to the spherical model, and makes use of mGstat [107] to calculate the interpolated data using three different types of kriging: simple, ordinary, and trend. The use of kriging with trends leads to the best results.

5.1.3 Modeling a PS Scenario for Data Interpolation

A square (L × L) map was generated using a mixture of 2-dimensional Gaussians in space, where L is the size of the side of the map (or the number of pixels per side). For each Gaussian, the mean (µ), the covariance (Σ), and a weight (w) are defined. This map is scaled in such a way that the final map has a maximum of 255 and a minimum of 0 (following the normal convention for gray values in an image). As explained before, this original map is replicated N times, one per iteration, representing different snapshots of the system. All maps have the same basic configuration for the mixture of Gaussians (µ and Σ), but for each replicated map the removed locations are different, in order to model the dynamics of a real PS application. The monitored variable (e.g., air pollution) is assumed to be stable during the N snapshots of the system.

In order to model the characteristics of a real PS application, two sources of disturbance were added: Gaussian noise per location and missing locations per snapshot. The Gaussian noise, with zero mean and a standard deviation σ that is varied during the experiments, models the lack of accuracy (noise) of the sensors in the cellular phones.
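A compact sketch of this scenario generator is shown below, including the missing-location step described in the next paragraph. It is a sketch under simplifying assumptions, not the exact generator used in the experiments: the Gaussians are isotropic and all parameter ranges are placeholders.

    import numpy as np

    def synthetic_map(L=35, n_gauss=4, seed=0):
        """L x L map from a mixture of 2-D (isotropic) Gaussians, scaled to [0, 255]."""
        rng = np.random.default_rng(seed)
        xx, yy = np.meshgrid(np.arange(L), np.arange(L))
        f = np.zeros((L, L))
        for _ in range(n_gauss):
            mu = rng.uniform(0, L, 2)
            s = rng.uniform(L / 10, L / 4)          # spread of this component
            w = rng.uniform(0.5, 1.0)               # weight of this component
            f += w * np.exp(-((xx - mu[0]) ** 2 + (yy - mu[1]) ** 2) / (2 * s ** 2))
        return 255.0 * (f - f.min()) / (f.max() - f.min())

    def snapshot(base, noise_sigma=6.0, missing_ratio=0.6, seed=1):
        """One snapshot: Gaussian sensor noise plus randomly removed (missing) locations."""
        rng = np.random.default_rng(seed)
        obs = base + rng.normal(0.0, noise_sigma, base.shape)
        missing = rng.random(base.shape) < missing_ratio   # True where the value is unknown
        return obs, missing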
On the other hand, due to the mobility of the users, only some locations in the map will be available at any given time. To model this, randomly selected locations were removed on each iteration. A ratio nr = lun/lkn (from 0 to 1) between the number of missing (lun) and known (lkn) locations is defined. In this case, a ratio nr = 0 represents the case where no locations are removed, while a ratio nr = 1 means that all (L × L) locations are deleted. The locations to be removed are randomly chosen (following a uniform distribution) from the set of known locations until the desired ratio is met. When using kriging, only the known data are considered, but for MRF, all locations are required to be defined, since they are supposed to be pixels of an image. For the locations that are selected to be removed, the known value is replaced by a random value following a uniform distribution between 0 and 255. If the removed data were replaced by a fixed value, say the mean of the map, the final estimates obtained with the MRF-Gibbs algorithm would tend to this fixed value when the ratio nr is low.

5.1.4 Results and Comparison

In order to measure the quality of the estimations, the coefficient of determination (R2) is utilized. By calculating the coefficient of determination, the performance of both techniques (MRF and kriging) can be compared under different noise levels and ratios nr. In this section, the results and corresponding analysis for both approaches, MRF and kriging, are presented. The variables to be estimated are assumed to be continuous and smooth in space, i.e., temperature does not change abruptly between two close-by locations in the map. Using this, a low-pass smoothing filter is applied to the final estimations in both cases, MRF and kriging.
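The two post-processing and evaluation steps just mentioned can be sketched as follows, assuming NumPy and SciPy are available; the 5 × 5 averaging window matches the one reported in the experiments below, and r_squared can return negative values when the estimate fits worse than the mean of the data.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def smooth(estimate, window=5):
        """Low-pass filter the interpolated map with a simple averaging window."""
        return uniform_filter(estimate, size=window, mode="nearest")

    def r_squared(y_true, y_pred):
        """Coefficient of determination; negative when the fit is worse than the mean."""
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1.0 - ss_res / ss_tot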
It is necessary to quantify the accuracy of the estimations while varying the noise σ and the missing ratio nr . For this purpose, the coefficient of determination R2 is used. It is worth mentioning that the additional stage of filtering (low-pass filter) always improved the R2 since it removes high frequency components not present in the original map. This smoothing filter uses a 5 × 5 averaging window. Experiments were run with different values for the noise σ and varying the missing ratio nr between 0 and 1. Figure 5.5 shows the performance, as given by R2 , for the MRF case. 78 Three different values for the standard deviation of the noise are considered: 6, 15 and 30. On the left, the range for the missing ratio is [0.0 : 1.0], and on the right, the range is only [0.8 : 1.0] in order to have a closer look at the conditions for a typical PS application. As it can be expected, the higher the noise and the higher the missing ratio on the signal, the lower the accuracy of the estimates. For a missing ratio nr < 0.7, the performance of MRF case is still acceptable (R2 > 0.6). Figure 5.5: R2 for MRF under different conditions of noise and missing ratios. The results of the performance of the kriging technique in the spatial interpolation process are shown in Figure 5.6. Again, three different values for the standard deviation of the noise are considered: 6, 15 and 30. On the left, the range for the missing ratio is [0.0 : 1.0], and on the right, the range is only [0.8 : 1.0]. As it is shown in the figure, kriging has a great performance for almost all the different missing ratios. Although the performance also decreases with the increase of the noise σ, the performance is much better than the one observed in the MRF case. Nevertheless, the PS case needs to be considered now. For a typical PS application, the number of known locations, compared to the number of points in the space to be estimated, is really low. For the scenario proposed here, that would imply a missing ratio always higher 79 Figure 5.6: R2 for kriging under different conditions of noise and missing ratios. than 0.9. In Section 3.1, a real PS application for air pollution monitoring is presented. For that specific dataset, the missing ratio is always greater than 0.9, and this is a typical value for real PS applications. For this case, a closer look to the condition nr = [0.8 : 1.0] is considered. For the condition where nr = [0.8 : 1.0], Figures 5.5 and 5.6 present a second graph on the right. There is a big difference between MRF and kriging in this range of nr . Kriging clearly outperforms MRF for missing ratios greater than 0.8, and the strong fall of kriging only occurs for values greater than 0.975, and even for that value, the coefficient of determination is around 0.8. Several conclusions can be drawn from this quantitative analysis. Based on the coefficient of determination, even though MRF has a good performance for high densities (low missing ratios), for the case of low densities (high missing ratios) kriging clearly outperforms MRF. However, this conclusion does not consider the computational requirements of the techniques. Because of the calculation of the semi-variogram and the estimation of each location based on all other known locations, kriging is computationally more expensive than MRF, for which the estimation happens in a small neighborhood. 
Therefore, it can 80 be concluded that MRF is a good choice for spatial interpolation when a high density of measurements is available, but for small densities, kriging is clearly the right technique. One of the possible solutions for the poor performance of MRF for large values of nr is to increase the size of the neighborhood (see Figure 5.1). By doing this, the estimation of each pixel/location will now depend on farther locations, reducing the effect of gaps in the space. The downside of this approach is that the computational requirements will increase as well. For example, if the size of the neighborhood is increased continuously, at the end, the energy function will include all possible known locations in the map, arriving to a case similar to kriging. Kriging determines the weights by using the semi-variogram, and the MRF with a neighborhood as big as the map/image should use a distance criterion to define the parameters of the energy function. As explained before, Section 3.1 presents a PS application for air pollution, and due to the density of the measurements, kriging will be use for the spatial interpolation process. Nonetheless, the work will be focused on the extreme missing ratio conditions, and some modifications for kriging using multiple variables and time are presented and utilized. 5.2 One Variable Spatial Interpolation for PS From this point, the dataset will be the one generated after the collection of real pollution data with the use of P-Sense. This creates a complex and interesting test scenario for the presented algorithms. To measure the quality of the interpolated data, the coefficient of determination (R2 ) is calculated. This calculation is computationally expensive since it has to interpolate every known location in the map without using the information for that location. Once the estimation is finished, the coefficient of determination using the known values is calculated. Figure 5.7 shows an example of the mesh and contour maps of two interpolated variables, temperature and combustible gases. As it can be seen from the figure, the interpolated data follows a smooth function in space. Nonetheless, there is a really interesting result: the map for combustible gases shows an almost constant behavior across most of the campus, but there is a lump in one of 81 Figure 5.7: An example of the result of spatial interpolation using kriging. Temperature and combustible gases. the corners. The location on the campus where this increase on the combustible gases is occurring is filled with big buildings that require powerful air conditioning plants and parking facilities. These conditions might be the cause of this unexpected behavior, and thanks to P-Sense it is now possible to alert the corresponding agencies to take care of these situations. This previous result is one of the most important contributions of this dissertation. By using traditional air pollution stations, it would be impossible to generate spatial estimations with this resolution and accuracy because this behavior occurs on small localities. For a governmental agency it would be crucial to know these details and behaviors of the pollution variables to decide where to build new schools, hospitals, or similar important buildings. For an individual, this information could be very useful to avoid allergic or respiratory situations. 
For instance, a PS user could set threshold values in the mobile application to specify his or her tolerance to a particular variable of interest and receive automatic alert messages when the thresholds are exceeded. This feedback mechanism can be used as an incentive mechanism to promote user participation.

5.3 Multivariate Spatial Interpolation for PS

An extension of the univariate case is necessary to consider the correlation information of multiple variables, as the use of correlation information should improve the accuracy of the prediction process. For such a scenario, the cokriging technique considers the variability in space of multiple variables to interpolate every variable. Nonetheless, one of the problems of using cokriging is that it not only needs to calculate a semivariogram for each variable (which is the same case for kriging), but also the cross-semivariograms (a function that represents the cross-spatial dependence between two different variables) for each pair of variables. In this case, the number of regressions for the semivariogram model grows quadratically with the number of variables, the complexity of the system increases, and the estimates depend on many different fitted parameters, compromising the final accuracy. As a result, a modification of kriging is proposed instead of the cokriging technique.

The proposed modification works as follows. It first pre-processes the PS pollution measurements using principal component analysis or independent component analysis. After that, it uses the regular one-variable kriging technique for each of the components, and then the original variables are reconstructed. It will be shown that this simple procedure improves the accuracy of the system by using the correlation between the variables.

5.3.1 Kriging and PCA

The objectives of principal component analysis (PCA) are data reduction and interpretation [17]. The selected k principal components can replace the original p variables, reducing the dimensionality of the input data. Furthermore, the principal components are uncorrelated linear combinations, which helps reduce the redundancy of the input data. PCA is based on the analysis of the covariance (or, in other cases, the correlation) matrix Σ. Let Σ be the covariance matrix associated with the random vector X′ = [X1, X2, . . . , Xp]. Let Σ have the eigenvalue-eigenvector pairs (λ1, e1), (λ2, e2), . . . , (λp, ep), where λ1 > λ2 > . . . > λp. After having the sorted pairs of eigenvalues and eigenvectors, the ith principal component is given by:

Yi = e′i X = ei1 X1 + ei2 X2 + · · · + eip Xp, for i = 1, 2, . . . , p   (5.16)

If more principal components are included, more of the variability of the system will be captured, but the dimensionality will not be reduced by much. For this specific application, dimensionality reduction is not the objective. The fact that all principal components are uncorrelated linear combinations is the key property of this approach. Because of the additional complexity of having to model the cross-semivariograms, it is suggested to first apply PCA to the data. The use of the principal components instead of the original variables provides uncorrelated new variables for the kriging process. Since the variables are uncorrelated, the use of cokriging brings no improvement compared to simple kriging. The idea now is to use the principal components for the regular one-variable kriging.
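A sketch of this kriging-plus-PCA pipeline (see Figure 5.8), including the inverse transformation described next, is shown below. Here krige_component is a hypothetical stand-in for the one-variable kriging of Section 5.2, and all K components are kept, as in the proposed method.

    import numpy as np

    def krige_with_pca(xy, Z, targets, krige_component):
        """Krige each principal component, then map back to the original variables.

        xy: (N, 2) known locations; Z: (N, K) measurements of the K variables;
        targets: (M, 2) grid locations; krige_component(xy, y, targets) -> (M,) estimates.
        """
        mean = Z.mean(axis=0)
        Zc = Z - mean
        eigvals, eigvecs = np.linalg.eigh(np.cov(Zc, rowvar=False))
        E = eigvecs[:, np.argsort(eigvals)[::-1]]   # eigenvectors sorted by eigenvalue
        Y = Zc @ E                                  # uncorrelated principal components
        Y_hat = np.column_stack([krige_component(xy, Y[:, i], targets)
                                 for i in range(Y.shape[1])])
        return Y_hat @ E.T + mean                   # inverse transformation to the variables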
To reconstruct the interpolated original variables, the inverse transformation (using the eigenvectors from the correlation matrix of the original data) is applied to the kriged values. In this approach, if there are K original variables, all K principal components will be used, therefore, there will be no dimensionality reduction. A pictorial representation of this method versus cokriging is shown in Figure 5.8, and the red arrow shows the flow of the processing for the spatial interpolations. Figure 5.8: The proposed kriging plus PCA/ICA method, versus multivariate cokriging. 84 By using kriging and PCA the complexity of the spatial-interpolation process is reduced compared to the cokriging case. Nevertheless, by using cokriging a cokriging variance for each variable can be generated, which is not the case for the kriged values on the principal components. The generated variance for that case will correspond to the variance of the principal components, and not the original variables under analysis. For the interested reader, an illustrative example on the use Principal Component Analysis (PCA) to analyze air quality variables is included in [47]. For the kriging and PCA combination, the accuracy of the system improves when compared with the case of one-variable kriging. Figure 5.9 shows the results of applying kriging on the principal components of the original pollution variables, and including time information, as it will be presented later in Section 5.4. As it can be seen, the smoothness of the map is maintained, and as it is shown in Table 5.1, the quality of the interpolation, measured using the coefficient of determination (R2 ), is higher than in the previous case when using kriging independently for each variable. The meaning of grid size and delta time in this model, and the corresponding choice for the optimal values, will be presented in Section 5.4. 5.3.2 Kriging and ICA Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of non-gaussian data so that the components are statistically independent, or as independent as possible [108]. To define the concept of independence, let us first consider two scalar-valued random variables y1 and y2 . The variables y1 and y2 are said to be independent if information on the value of y1 does not give any information on the value of y2 , and vice versa. This is the most important reason why using ICA along with kriging is a good idea, and a feasible solution versus cokriging. 85 Figure 5.9: Contour map for the spatial interpolation using kriging with PCA (top row) and kriging with ICA (bottom row). To define ICA a statistical “latent variables” model can be used. Assume that n linear mixtures x1 , x2 , . . . , xn of n independent components s1 , s2 , . . . , sn , are observed: xj = aj1 s1 + aj2 s2 + . . . ajn sn for all j (5.17) Without loss of generality, it is assumed that both the mixtures and the independent components have zero mean. If this condition does not hold, the means can be subtracted from them, having zero mean mixtures or components. The independent components are latent variables, meaning that they cannot be directly observed. In this case, the mixtures are the measures collected by P-Sense (6 variables in total) and the independent components are unknown variables to be estimated. It is convenient to have a matrix notation of the ICA model [109, 110]. Denoting X the random vector whose elements are the mixtures x1 , x2 , . . . 
, xn, and likewise S the random vector with elements s1, s2, . . . , sn, the ICA model can be formulated as:

X = AS   (5.18)

where X is an observed multidimensional vector, S is a multidimensional random vector whose components are assumed to be mutually independent, and A is a mixing matrix to be estimated. After estimating A, its inverse W can be computed, and the independent components are simply:

S = WX   (5.19)

An important measure of non-gaussianity is given by negentropy. Negentropy is based on the information-theoretic quantity of (differential) entropy. The entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more random, i.e., unpredictable and unstructured, the variable is, the larger its entropy. More rigorously, entropy is closely related to the coding length of the random variable; in fact, it can be said that the entropy is the coding length of the random variable. Considering this, negentropy J is defined as follows:

J(y) = H(ygauss) − H(y)   (5.20)

where H(X) denotes the entropy of a random variable X, and ygauss is a Gaussian random variable with the same covariance matrix as y. The negentropy is always non-negative, and it is zero if and only if y has a Gaussian distribution.

Similar to PCA, ICA finds a linear subspace transformation, but with different constraints. The goal of ICA is to minimize the mutual information. Using the concept of differential entropy, the mutual information I between m scalar random variables yi can be defined as follows:

I(y1, y2, . . . , ym) = Σ_{i=1}^{m} H(yi) − H(y)   (5.21)

This equation can be interpreted using the definition of code length. The terms H(yi) represent the lengths of the codes for each yi when they are coded separately, and H(y) is the code length when y is coded as a random vector, in other words, when all the components are coded with the same code. Mutual information is a natural measure of the dependence among random variables. Given an invertible linear transformation S = WX, an important property of mutual information is:

I(s1, s2, . . . , sm) = Σ_i H(si) − H(X) − log|det W|   (5.22)
                    = C − Σ_i J(si)   (5.23)

considering that all independent components si are uncorrelated. This shows the fundamental relation between negentropy and mutual information. FastICA [111], a popular ICA algorithm based on a fixed-point iteration scheme, is utilized in this project.

Again, after applying ICA to the original variables, the univariate version of kriging can be utilized to interpolate the independent components and then return to the original variables, as shown in Figure 5.8. Figure 5.9 shows some interpolated maps using kriging along with ICA. A comparison against the other techniques for spatial interpolation (see Table 5.1) shows that ICA improves the accuracy of the system in a similar way as PCA does. Moreover, PCA and ICA complement each other in the sense that the maximum R2 is achieved by using PCA for some variables and ICA for the others. Again, the meaning of grid size and delta time, and the choice of the optimal values, will be presented in Section 5.4.

5.4 Interpolation in Time and Space

Suppose that, for a given day, consecutive snapshots of the university campus are available, say at 9AM, 10AM, and 11AM (see Figure 5.10). It is clear that the information contained in the snapshots at 9AM and 11AM can give an idea of what the state of the variables was at 10AM.

Figure 5.10: Modeling time using an additional dimension.
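A minimal sketch of this idea is shown below: the time stamp of each snapshot becomes a third coordinate, scaled by the delta time parameter introduced in the next subsection, so that the same univariate (or PCA/ICA-based) kriging machinery can be reused on the augmented points. The helper krige3d is a hypothetical stand-in for whichever kriging variant is applied to those points.

    import numpy as np

    def spacetime_points(xy, t, delta):
        """Append a scaled time axis so each snapshot becomes a layer in a 3-D space."""
        return np.column_stack([xy, delta * np.asarray(t)])

    def interpolate_middle_snapshot(xy9, z9, xy11, z11, grid_xy, delta, krige3d):
        """Estimate the 10AM layer from the 9AM and 11AM snapshots (Figure 5.10)."""
        pts = np.vstack([spacetime_points(xy9, np.zeros(len(xy9)), delta),
                         spacetime_points(xy11, 2.0 * np.ones(len(xy11)), delta)])
        vals = np.concatenate([z9, z11])
        targets = spacetime_points(grid_xy, np.ones(len(grid_xy)), delta)  # middle layer
        return krige3d(pts, vals, targets)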
So far, the performance of the inference process has been improved by using the correlation information between the given variables, and that is as far as most of the geostatistical methods go. Nevertheless, given the nature of the PS application, the correlation information from consecutive snapshots of the system can be used. Here, a modification of the kriging technique to make use of this available correlation is proposed. 5.4.1 The Grid and Delta Parameters The environmental conditions, and also the nature of the variables, affect the way how each variable changes in space. Not all variables change in the same manner under the same environmental conditions, therefore some important characteristics of the pollutants can happen on different scales. Because of this, the resolution required to characterize each variable can be different. Here, a broad range of values for the grid size, which defines the resolution of the grid, is considered. For the case in which only spatial information is considered, the application interpolates the variables using each grid size value, and the value for which the highest R2 is achieved determines the optimal size. Due to the different characteristics of the pollutants and the quality and density of the measurements, this optimal parameter is not the same for all variables. The spatial interpolation using kriging can be seen as the process of finding the unknown values of a variable in a 2D grid with some known locations. If this concept is extended 89 by adding one dimension to model time, information of consecutive snapshots, as having adjacent layers, can be included. Figure 5.10 gives a pictorial representation of this concept of interpolating the values for 10AM by using the information from the adjacent snapshots at 9AM and 11AM. An additional parameter delta time is used here to model the distance between consecutive grids. A snapshot will affect more if it is closer to the target grid. Since each variable changes differently in time, i.e., the temperature does not change as fast as the carbon monoxide or the combustible gases, it might be necessary to have a different “distance in time” for each variable. The experiments for this case will again consider a broad range of values for the parameter delta time to model the different dynamics for the variables. If a model with the dynamics of the air pollutants was available, it could be used to determine the optimal parameters delta time and grid size in a more elegant way. Since this kind of model is not available, the developed application interpolates the variables for all possible combinations of grid size and delta time, using only kriging for one variable and using kriging along with PCA or ICA. The application picks the duple (grid size, delta time) that achieves the maximum R2 . The results show that for different pollutants, the optimal grid size and delta time can be different (see Table 5.1), which is normal due to the differences in the nature of the variables being estimated. 5.4.2 Multivariate Kriging in 3D Part of the dataset that was collected using P-Sense in the university campus includes sets of three consecutive snapshots. Using this dataset and the modifications presented before, the given pollution variables were interpolated. Figure 5.11 shows the coefficient of determination as a function of the grid size and the delta time parameters for two selected variables. As expected, the behavior of R2 depends on the variable under analysis. 
In general, very good results (R2 > 0.8) can be achieved for grid size > 100 and delta time > 1 × 10^4 for these two specific variables.

Figure 5.11: R2 as a function of the grid size and the delta time. Kriging and PCA were used to interpolate temperature and combustible gases.

Table 5.1 shows the complete comparison of all variables for all the interpolation techniques presented in this dissertation. The table gives the maximum coefficient of determination when using only kriging for one variable and when using kriging along with PCA or ICA. Both conditions, using only spatial information and using spatio-temporal information (consecutive snapshots), are tested. From Table 5.1 it can be seen that, in general, the use of time information always improves the accuracy of the inference process. As expected, the best performance of the system is obtained when using the correlation in time along with PCA or ICA. The comparison between ICA and PCA is very interesting, because the combination of both techniques gives a robust inference process with coefficients of determination in the range [0.77 : 0.98] for almost all variables.

Figure 5.12: Known and predicted values after using kriging interpolation along with PCA and time information.

Table 5.1: Maximum R2 for each variable with the corresponding grid size and delta time parameters.

                            Only Space           Space and Time
  Method      Variable      R2       Grid        R2       Grid    Delta
  Krig        Temp          0.6407   100         0.7294   180     30000
              RH            0.6607   80          0.7078   40      400
              CO            0.6541   160         0.8381   160     1000
              CO2           0.4832   80          0.5561   80      25
              Comb Gas      0.5903   80          0.6947   100     300
  Krig-PCA    Temp          0.6812   180         0.7708   180     50000
              RH            0.7308   60          0.7978   80      10
              CO            0.8485   140         0.9868   180     500
              CO2           0.5819   160         0.7367   100     100
              Comb Gas      0.7109   80          0.8273   140     500
  Krig-ICA    Temp          0.6900   120         0.7758   180     500
              RH            0.7580   120         0.7656   100     500
              CO            0.8725   180         0.9867   180     1000
              CO2           0.5716   160         0.7507   160     5
              Comb Gas      0.7063   100         0.8199   140     500

Figure 5.12 plots the known values and the predicted ones. This figure was generated using the optimal values shown in Table 5.1 for PCA and time information. As the figure shows, the predicted values follow the known values very closely, leading to the high coefficients of determination.

As explained before, and as has now been shown, the parameters grid size and delta time are different for each variable and strongly determine the quality of the spatial interpolations. It is clear that a single configuration (only kriging, or kriging with PCA, or kriging with ICA) will not work properly for estimating all possible variables under the restrictions of a PS system such as P-Sense, and that is the reason why expert predictors should be used to interpolate the variables. An expert predictor per variable can be used to consider these different configurations of the grid size and delta time for the kriging technique. Using the optimal values of grid size and delta time, the expert predictors will use these same values for all future estimations in the process of aggregation and 3-dimensional estimation. These optimal values can be found by looking at the maximum R2 for each variable in Table 5.1. The configuration for each expert predictor is summarized in Table 5.2. It is worth noticing that all expert predictors, whether using kriging with PCA or with ICA, always consider the spatio-temporal information to interpolate the variables.

Table 5.2: Configuration of the expert predictors for all the variables.
  Variable    Maximum R2   Grid Size   Delta Time   Method Used
  Temp        0.7758       180         500          Krig-ICA
  RH          0.7978       80          10           Krig-PCA
  CO          0.9868       180         500          Krig-PCA
  CO2         0.7507       160         5            Krig-ICA
  Comb Gas    0.8273       140         500          Krig-PCA

5.5 In Conclusion: Data Visualization

The main conclusions that can be drawn from this chapter about data visualization are:

• By using kriging (the best spatial estimator) along with Principal Component Analysis (PCA) or Independent Component Analysis (ICA), a multivariate approach was created in order to improve the quality of the spatial interpolations. The experimental results show that by considering the correlation among multiple variables, the quality of the spatial interpolation improves when compared to the univariate case.

• Through the use of distance to model time, the temporal correlation of the variables has been effectively included in the spatial interpolation process. The experimental results also show that by considering not only multiple variables but also the temporal correlations of adjacent snapshots of the system, the quality of the final spatial interpolation improves when compared to the multivariate approach alone.

• The final configuration of the interpolation parameters strongly depends on the variable under analysis. Taking this into consideration, the use of expert predictors is necessary in order to achieve the best quality of the spatial interpolations.

CHAPTER 6
DENSITY MAPS

One key challenge in PS systems is determining the locations and the number of users from which to obtain samples in order to accurately represent the variable of interest with a low number of participants. This chapter presents the use of density maps, based on the current estimations of the variable, to address this challenge. The density maps are then utilized by the incentive mechanism in order to encourage the participation of the users indicated in the map. The results show how the density maps greatly improve the quality of the estimations while maintaining a stable and low total number of users in the system [26].

The chapter is organized as follows. Section 6.1 details the proposed technique, both for the univariate and the multivariate cases, and Section 6.2 presents the experiments and analyzes the results.

6.1 Density Map of Users

As in the previous chapters, the information collected by P-Sense is used here as the dataset to validate the proposed techniques and mechanisms. As an example, Figure 6.1 shows the contour maps of the spatial interpolation for CO and CO2 across the campus. This resulting interpolation is the dataset that will be used in the experiments presented here. The idea is to characterize the system depending on how well these variables can be reconstructed by sampling the maps with a reduced number of users. Through the generation of a density map (a concept first introduced here in the PS context), the desired locations for the next measurements are defined.

Figure 6.1: Contour maps for all variables (temperature, RH, CO2, CO, air quality, and combustible gases) in the USF campus.

As explained before, this mechanism provides feedback and affects the process of data collection, both for the incentive strategies and the privacy protection mechanisms. The incentive strategies must encourage users to move to the locations where it is more convenient to sample the variable of interest.
Also, the privacy mechanisms should guarantee not only that the identification of the user is not possible, but also that the user's location and displacement cannot be tracked.

The idea behind the density maps is based on the observation that if a variable is fairly constant in a region of the map, a low number of locations would be required to reconstruct it. On the other hand, if the pollutant has high variability in a region, more locations (measurements) would be needed to better characterize it. This study analyzes two cases. The first case is the univariate case, where only one variable, e.g., CO or CO2, or any other, is used to generate the desired locations of the users. The second case is the multivariate case, in which a set of control variables is used to create the desired density map. These cases are discussed next.

6.1.1 The Univariate Case

In order to measure the level of variability of the variable under investigation, a gradient-based metric has been defined. The general definition of the gradient is that of a vector field that points in the direction of maximum change and whose magnitude is the largest rate of change. Considering a two-dimensional scalar function f(x, y) (for the pollution application case, f is either temperature, RH, CO, CO2, air quality, or combustible gases), the gradient is calculated from the partial derivatives of f as:

\nabla f = \frac{\partial f}{\partial x}\,\hat{i} + \frac{\partial f}{\partial y}\,\hat{j}    (6.1)

where \partial f / \partial x is the partial derivative of f along the x axis, e.g., longitude, and \partial f / \partial y is the partial derivative of f along the y axis, e.g., latitude. Therefore, the magnitude of the gradient is:

|\nabla f| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^{2} + \left(\frac{\partial f}{\partial y}\right)^{2}}    (6.2)

Since only discrete points of the function are available, due to the resolution used during the spatial interpolation process (38 × 38 points for Figure 6.2(a)), the gradient is also calculated in a discrete way. In other words, the gradient and its magnitude are calculated in the x direction and in the y direction on a per-point basis. Figure 6.2(a) shows how every point in the gradient map has a corresponding magnitude and direction. Nonetheless, in this approach, only the magnitude of the gradient is considered for each point.

Figure 6.2: Gradient field and contour maps for CO2 in the USF campus. (a) Gradient field for CO2. (b) Contour map and corresponding locations for CO2.

The density map is generated as follows. First, as shown in Figure 6.2(a), the area of interest is divided into regions using a virtual grid. For the pollution dataset, the area was divided into 25 (5 × 5) regions. Each region is identified by the pair (i, j), where i is the coordinate along the x axis and j along the y axis, each varying from 1 to 5. Then, the Mean Magnitude of Change, MMC_{f,ij}, is calculated for each region (i, j) of the scalar function f, defined as the mean of the magnitude of the gradient over that region:

MMC_{f,ij} = \overline{|\nabla f_{ij}(x, y)|}    (6.3)

Remember that this is a discrete case and the magnitude of the gradient is calculated on a per-point basis. Here, MMC_{f,ij} is the mean of the magnitude of the gradient over the points that belong to the region (i, j). For instance, since 38 × 38 points and 5 × 5 regions are used, each region has approximately 7 × 7 points over which to calculate the average.
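As a concrete illustration of Equations 6.1–6.3, the following sketch computes the per-region MMC from an interpolated field using NumPy. It is a minimal example under the assumptions of this section (a square interpolation grid and a square virtual grid of regions); the function and parameter names are illustrative and are not part of the P-Sense implementation.

```python
import numpy as np

def mmc_per_region(field, n_regions=5):
    """Mean Magnitude of Change (Eq. 6.3) for each region of an interpolated field.

    field: 2-D array with the interpolated variable (e.g., a 38x38 CO2 grid).
    Returns an (n_regions x n_regions) array with the mean gradient magnitude
    of the points that fall inside every region (i, j).
    """
    # Discrete gradient along the y (latitude) and x (longitude) axes, Eq. 6.1
    gy, gx = np.gradient(field)
    # Magnitude of the gradient on a per-point basis, Eq. 6.2
    magnitude = np.sqrt(gx ** 2 + gy ** 2)

    mmc = np.zeros((n_regions, n_regions))
    # Split the grid into regions and average the magnitude inside each one
    for i, row_block in enumerate(np.array_split(magnitude, n_regions, axis=0)):
        for j, block in enumerate(np.array_split(row_block, n_regions, axis=1)):
            mmc[i, j] = block.mean()
    return mmc
```

For the 38 × 38 CO2 grid of Figure 6.2(a), a call such as mmc_per_region(co2_grid) would return the 5 × 5 matrix of mean gradient magnitudes from which the density map is built (co2_grid being a placeholder for the interpolated field).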
Finally, the number of locations per region from which to take samples during the next round is determined using the MMC per region and two input parameters (min_loc and max_loc) that limit the minimum and maximum number of locations required for each region. These are input parameters given by the administrator of the PS system because they are application dependent and vary according to the dynamics of the variable, the type of events to be detected, the total budget for the data collection process, etc. In Figure 6.2(b) the minimum and maximum numbers of users per region are 0 and 3, respectively; consequently, 0 users are assigned to the region with the lowest MMC and 3 users to the region with the highest MMC. The number of locations for the other regions is assigned using a linear mapping between these two limits. Figure 6.3 gives a pictorial representation of this linear mapping, and a short sketch of it is included at the end of this subsection. Again, from Figure 6.2(b), it can be seen how, after 8 iterations, the density map clearly follows the regions of the map where CO2 exhibits higher variability, and reduces the density (number of users) in the regions of the map with an almost constant behavior.

It is important to explain the reasons for, and implications of, defining regions rather than the exact locations of the users. A desired density of users is generated on a per-region basis because the exact location of the users cannot be controlled due to their natural mobility. Also, some of the privacy mechanisms guarantee the protection of the users' information through the use of anonymization or obfuscation techniques that modify the users' locations, making it impossible to assume that the users will be located at specific points on the map. That is why regions are created: to encourage users to be in an area of the map, but not exactly at a certain position at a certain time.

Figure 6.3: The linear mapping from the Mean Magnitude of Change (MMC) per region to the number of locations per region.

The PS system administrator, then, has two important decisions to make. First, she has to decide the number of regions into which the area is going to be divided. Then, she has to decide the maximum and minimum number of users per region. Increasing the size of the regions would reduce the total number of users in the system, but it might imply that some events could be missed or not detected. On the other hand, small regions would guarantee the detection and visualization of small events in the map, but would imply a generous budget for the data collection process. Therefore, it is necessary to find the optimal trade-off between these two extremes considering parameters such as budget, speed of stabilization, and dynamics of the variables.
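The linear mapping sketched in Figure 6.3 can be written directly from the MMC matrix produced by the previous sketch. The snippet below is a minimal illustration under the same NumPy conventions; min_loc and max_loc are the administrator-chosen limits described above, and the function name is illustrative.

```python
import numpy as np

def locations_per_region(mmc, min_loc=0, max_loc=3):
    """Linearly map the MMC of every region to a number of sampling locations.

    The region with the lowest MMC gets min_loc users and the one with the
    highest MMC gets max_loc; the remaining regions are interpolated linearly
    between these two limits and rounded to the nearest integer.
    """
    lo, hi = mmc.min(), mmc.max()
    if hi == lo:                        # flat field: every region gets min_loc
        return np.full(mmc.shape, min_loc, dtype=int)
    scaled = (mmc - lo) / (hi - lo)     # normalized MMC in [0, 1]
    return np.rint(min_loc + scaled * (max_loc - min_loc)).astype(int)

# Example (hypothetical field): density = locations_per_region(mmc_per_region(co2_grid), 0, 3)
```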
6.1.2 A Multivariate Extension

So far, a gradient-based metric has been presented, but it only considers the dynamics of a single variable to generate the desired locations of the users. Unless all variables are highly correlated, this density map might not properly characterize the other variables of interest, as will be shown in Section 6.2. Taking this into account, the metric is now extended to the multivariate case. With this in mind, control variables must be defined. The control variables are the set of variables that influence, and are considered in, the generation of the density map. The new multivariate MMC (MMCm_{ij}) is defined as the average of the univariate MMC of the set of control variables (f, g, ..., n) for each region (i, j):

MMCm_{ij} = \overline{MMC_{f,ij}, MMC_{g,ij}, \ldots, MMC_{n,ij}}    (6.4)

In a similar way to the univariate case, min_loc users are assigned to the region (i, j) with the lowest MMCm_{ij}, and max_loc users to the region with the highest MMCm_{ij}. The number of locations for the other regions is assigned using a linear mapping between these two limits.

6.2 Experiments and Analysis

In this section the experiments and the corresponding analysis are presented, both for the univariate and the multivariate cases. To determine the number of users for the first iteration in a real application, a default value per region can be calculated from historical records of the behavior of that variable in that area of the map. Nevertheless, in order to create a consistent, deliberately poor starting point for the experiments, the initial locations of the users are predefined to be on the surroundings of the campus, which initially makes the quality of the estimation very poor. Based on these measurements, a complete map of the area can be generated using known interpolation techniques, such as multivariate kriging [24, 68]. Once the complete status of the variables has been estimated, it is possible to generate the density maps for the next round of the data collection process. Using the number of users defined by the density map from the previous iteration, the locations are selected from each region following a uniform distribution. When sampling each region, if a location is selected twice, an additional random location would be necessary to achieve the desired density of users for that region. Nevertheless, in that case a new location is not selected and that region is simply left with a lower density than expected. This decision models the case where the incentive mechanism is not successful in obtaining the desired number of users for a region. In order to measure the quality of the estimations, the coefficient of determination (R²) is utilized, in the same way it has been used in the previous chapters. A sketch of one round of this procedure is given below.
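The following sketch ties Equation 6.4 to the sampling model just described: the multivariate MMC is the average of the univariate MMC maps of the control variables, and each region is then sampled uniformly, silently dropping duplicated picks. It continues the earlier sketches (reusing mmc_per_region and locations_per_region); all names are illustrative, and the grid and region sizes are the ones assumed in this chapter.

```python
import numpy as np

def multivariate_mmc(control_fields, n_regions=5):
    """Multivariate MMC (Eq. 6.4): average of the univariate MMC maps of the
    control variables, e.g., [rh_grid, co_grid]."""
    return np.mean([mmc_per_region(f, n_regions) for f in control_fields], axis=0)

def sample_grid_points(density, n_points=38, n_regions=5, rng=None):
    """Choose the grid points to sample in the next round.

    For each region (i, j), density[i, j] points are drawn uniformly from the
    grid points that belong to that region.  A point drawn twice is simply
    dropped, leaving the region with a lower density than requested; this
    models an incentive mechanism that does not recruit every desired user.
    """
    rng = rng or np.random.default_rng()
    row_blocks = np.array_split(np.arange(n_points), n_regions)
    col_blocks = np.array_split(np.arange(n_points), n_regions)
    chosen = set()
    for i, rows in enumerate(row_blocks):
        for j, cols in enumerate(col_blocks):
            for _ in range(int(density[i, j])):
                chosen.add((int(rng.choice(rows)), int(rng.choice(cols))))
    return sorted(chosen)

# One round of the loop, assuming rh_grid and co_grid were interpolated in the
# previous round (hypothetical variables):
#   density = locations_per_region(multivariate_mmc([rh_grid, co_grid]), 0, 3)
#   next_points = sample_grid_points(density)
```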
6.2.1 The Univariate Case

The results for the univariate case consider the generation of the density map using only one control variable, while sampling all variables following the desired density per region. First, the evolution of the system is presented during 10 iterations of the whole process of collecting, estimating, and generating maps. The mean value of the coefficient of determination (R²_mean) and the total number of users in the system (N_mean) over the last 5 iterations are used to characterize the performance of the experiment. Figure 6.4 presents the coefficient of determination for all the environmental data when using carbon monoxide (CO) as the control variable. The parameters max_loc and min_loc have been fixed to 3 and 0, respectively. It is clear that after 4 to 5 iterations the system achieves its maximum performance. It is also important to notice how the control variable (CO) maintains a stable behavior after 3 iterations and its coefficient of determination never decreases, while the quality of the estimations for the other variables is very poor in some cases. This is expected behavior, since the system is tracking the characteristics of this single variable in order to improve the quality of its estimation. Figure 6.5 presents the final locations of the users after 10 iterations.

Figure 6.4: R² for all variables in the univariate case using CO as the control variable. The mean number of users for the last 5 iterations is 23.

The quality of these estimations should be compared against the original variables in Figure 6.1. The system clearly generates optimal locations for CO, but these locations might not be appropriate to estimate the rest of the variables, such as CO2 in the graph, which has a low value of R². Once the system achieves its maximum performance, the mean total number of users in the system is stable as well; for this specific case it was 23.

Figure 6.5: Locations of the users after 10 iterations for all variables, when using only CO as the control variable.

6.2.2 Increasing the Number of Users per Region

As mentioned previously, the parameters max_loc and min_loc are application dependent. Assuming that the system administrator has a generous budget, the maximum number of users per region could be increased, and an increase in the quality of the estimations would be expected. In order to simulate this, min_loc is fixed to zero and max_loc is gradually increased. For instance, Figure 6.6 shows the results of this experiment when using relative humidity (RH) as the control variable; the behavior is similar for the rest of the variables. It can be seen from Figure 6.6(a) that not only does the quality of the estimations improve with the number of users, but the number of iterations necessary to achieve the maximum performance also decreases. Also, from Figure 6.6(b), it can be seen that the total number of users in the system increases when incrementing max_loc, and that it maintains a stable value after a small number of iterations. Moreover, the speed at which the stable state is reached also increases with the maximum number of users per region. Nevertheless, while changing max_loc from 2 to 4 improves the accuracy from 0.6012 to 0.8654 (with a total of 39 users), incrementing max_loc from 4 to 6 only raises the accuracy to 0.9456 (with a total of 62 users). It is clear that, beyond some point, increasing the maximum number of users per region does not bring a noticeable benefit to the system, but it does significantly affect the available budget. Table 6.1 summarizes these results when using different control variables. The system administrator can use these results to define the optimal parameters based upon the desired estimation quality and the available budget.

Figure 6.6: Increasing the number of users (max_loc) per region. The min_loc value is zero in all cases. (a) Coefficient of determination R² for RH. (b) Total number of users N for RH.

Table 6.1: Coefficient of determination R² and total number of users for different values of max_loc.

                       max_loc = 2          max_loc = 4          max_loc = 6
Variable            R²_mean   N_mean     R²_mean   N_mean     R²_mean   N_mean
Temperature         0.9027    16.2       0.9431    31.4       0.9731    49.8
Relative Humidity   0.6012    18.0       0.8654    39.6       0.9456    62.4
Air Quality         0.9243    21.6       0.9712    35.2       0.9324    44.2
CO2                 0.8257    20.4       0.9078    38.0       0.9380    58.0
CO                  0.7891    15.6       0.9592    33.8       0.9707    44.6
Combustion Gases    0.7951    15.6       0.9582    37.0       0.9725    50.0

6.2.3 The Multivariate Case

The multivariate case is presented now, considering the results when using two control variables. Figure 6.7 shows the coefficient of determination for all variables when using relative humidity (RH) and carbon monoxide (CO) as the control variables of the system.

Figure 6.7: R² for all variables in the multivariate case, using RH and CO as the control variables.
When these results are compared to the univariate case presented in Figure 6.4, where CO was the only control variable, it can be seen that while for CO the R²_mean remains almost at the same value (from 0.89 to 0.90), the performance improves significantly for the rest of the variables. Temperature improves from 0.86 to 0.95, RH from 0.63 to 0.82, air quality from 0.75 to 0.95, CO2 from 0.52 to 0.73, and combustible gases from 0.84 to 0.92. The most impressive result is that this improvement is gained in all variables while only incrementing the total number of users from 23 to 27.

After experimenting, it has been found that including a large number of control variables requires an increment in the number of users, and also generates a density map with an almost uniform distribution of the users in space. Even though this is the ideal case, complete coverage of the area by the users, it also implies a larger or unlimited budget for the PS system. The previous behavior is the result of a wise selection of the control variables. If two variables are highly correlated, it is not necessary to use both, since selecting only one of them brings enough information to effectively generate the density map. After experimenting, it was possible to identify that RH and CO bring enough information to track the changes of almost all variables.

6.3 In Conclusion: Density Maps

The main conclusions that can be drawn from this chapter about density maps are:

• Through the use of a gradient-based metric, it is possible to establish the density of users per region in order to properly characterize the variables of interest, while maintaining a low and stable number of users. The experimental results show how this univariate approach successfully determines the optimal locations to characterize a single variable.

• By using a multivariate extension through the use of "control variables", it is now possible to characterize the dynamics and spatial events of several variables at the same time. The experimental results show how the quality of the estimations for all variables improves when using this multivariate approach.

CHAPTER 7

CONCLUSIONS

In this dissertation, different aspects of Participatory Sensing applications have been tackled. A framework for the design and implementation of PS systems has been presented, which addresses the unique characteristics and challenges related to this type of application. This has been the first attempt to analyze PS applications from a system viewpoint. The framework proposed here describes the main modules and challenges involved in designing and implementing a PS system. However, it provides a systemic view without looking at other issues and challenges that may arise in the first integrator device. For example, if several PS applications run simultaneously on the same device, they might need to use the same sensors and compete for the same resources. In that case, new mechanisms are needed in the cell phone to use its resources in the most efficient manner, perhaps even sharing the data among applications. This is a non-trivial problem, recently recognized in [112].

Having presented the framework, this dissertation focused on three of its modules: data verification, data visualization, and density map generation.

For data verification, a new hybrid algorithm for spatial outlier detection and removal has been proposed and implemented.
It considers aspects that can be found in real PS systems, such as the uneven spatial density of the users, malicious users, and inaccurate or malfunctioning sensors. While the median algorithm is really fast, it fails to detect the spatial outliers when the number of malicious users is increased. After experimenting, it has been shown how the Delaunay triangulation and Gaussian Mixture Models can be used to build neighborhoods based on the spatial and non-spatial attributes of each location. Using these neighborhoods, it is possible to detect and remove the spatial outliers while remaining computationally efficient.

In order to simulate users attacking the system, a complete adversary model has been generated. This model not only considers noisy and faulty sensors, but also different scenarios, such as random and cooperative attacks made by malicious users. Under the different test scenarios, the hybrid algorithm always matches the performance of the best spatial estimator (kriging) while significantly reducing the total execution time, which is particularly important given the large amount of data generated by participatory systems.

In the area of data visualization for PS systems, a comparative study between the use of Markov Random Fields and kriging techniques has been presented. After showing the superiority of kriging in terms of the quality of the estimations, the technique was applied to make spatial interpolations for one variable. The results are very revealing in the sense that the technique interpolates data with high granularity, i.e., it provides more precise information about a variable of interest in a particular place than the information obtained when interpolating data from traditional measuring systems. This conclusion is extremely important from the user's and the system's point of view. Users will be able to obtain more precise information about a variable that affects them directly. This feature justifies, and to some extent proves, the usefulness and superiority of participatory sensing systems compared with traditional measuring systems.

This dissertation has also introduced a new technique for the spatial interpolation process (kriging along with PCA or ICA), and presents a solution using expert predictors for each of the variables of interest. The use of real pollution data poses a difficult problem due to the high levels of noise in the data and the continuously changing conditions of the environment, but the implementation of PCA and ICA has given the necessary flexibility to achieve high accuracy in all measured variables. With these spatial estimations, the data coming from the participants can be validated and visualized for easy analysis.

In order to close the loop in the framework, the use of density maps has been proposed. These maps determine the number of users, and the locations from which to take samples, in order to accurately reconstruct the variable of interest. Density maps are calculated on a round-by-round basis and fed back to the incentive mechanism so that only those users located in the indicated locations are encouraged to participate. The experiments have shown how the density maps greatly improve the quality of the estimations while maintaining a stable and low total number of users in the system.

Even though the multivariate extension of the density maps has proven useful, it still requires more research and experimentation.
A mechanism to automatically select the control variables is still necessary in order to successfully estimate the variables of interest without greatly increasing the total number of users, and therefore incurring additional costs for the system. Some approaches could include metrics based on the correlation matrix and factor analysis.

Future work on this and similar participatory sensing systems is needed in several directions. Sensor integration and minimization is a very important area: users will not participate in the application if they have to carry another (bulky) device. Participatory sensing systems will also generate huge amounts of data. This presents a scalability problem that needs to be addressed from different directions; finding ways to eliminate redundant or useless data, transmit less data, and process and store data efficiently is of paramount importance. Finally, appropriate techniques to merge both systems (PS and traditional) must be developed in order to exploit the advantages that each has (the spatial resolution of PS and the accuracy of traditional stations).

REFERENCES

[1] J. Burke, D. Estrin, M. Hansen, A. Parker, N. Ramanathan, S. Reddy, and M. B. Srivastava, “Participatory sensing,” in ACM Sensys World Sensor Web Workshop, 2006.

[2] H. Weinschrott, F. Durr, and K. Rothermel, “Streamshaper: Coordination algorithms for participatory mobile urban sensing,” in IEEE International Conference on Mobile Adhoc and Sensor Systems, 2010.

[3] S. Kanhere, “Participatory sensing: Crowdsourcing data from mobile smartphones in urban spaces,” in IEEE International Conference on Mobile Data Management (MDM), 2011.

[4] Airnow, Air Quality Index, 2011. [Online]. Available: http://www.airnow.gov/

[5] N. Pham, R. Ganti, Y. Uddin, S. Nath, and T. Abdelzaher, “Privacy-preserving reconstruction of multidimensional data maps in vehicular participatory sensing,” in Wireless Sensor Networks, ser. Lecture Notes in Computer Science. Springer Berlin, 2010, vol. 5970, pp. 114–130.

[6] Y. Kim, M. Kim, J. Lim, J. T. Kim, and C. Yoo, “Predictive monitoring and diagnosis of periodic air pollution in a subway station,” Journal of Hazardous Materials, vol. 183, no. 1-3, pp. 448–459, 2010.

[7] H. Janes, L. Sheppard, and K. Shepherd, “Statistical analysis of air pollution panel studies: an illustration,” Elsevier, Ann Epidemiol, vol. 18, pp. 792–802, Jun. 2008.

[8] J.-S. Lee and B. Hoh, “Sell your experiences: a market mechanism based incentive for participatory sensing,” in IEEE International Conference on Pervasive Computing and Communications, April 2010, pp. 60–68.

[9] D. Mendez, A. Perez, M. Labrador, and J. Marron, “P-sense: A participatory sensing system for air pollution monitoring and control,” in IEEE International Conference on Pervasive Computing and Communications, WiP workshop, 2011, pp. 344–347.

[10] N. Lane, E. Miluzzo, H. Lu, D. Peebles, T. Choudhury, and A. Campbell, “A survey of mobile phone sensing,” Communications Magazine, IEEE, vol. 48, no. 9, pp. 140–150, 2010.

[11] D. Holland, R. Scheffe, W. Cox, A. Cimorelli, D. Nychka, and P. Hopke, “Spatial prediction of air quality data,” Environmental Manager, 2003.

[12] S. Shekhar, P. Zhang, Y. Huang, and R. Vatsavai, Trends in Spatial Data Mining, ser. Data Mining: Next Generation Challenges and Future Directions, 2003, ch. 3.

[13] Z. Shuyu and Z.
Zhongying, “Detection of outliers in spatial data by using local difference,” in International Conference on Intelligent Mechatronics and Automation, 26-31, 2004, pp. 400–405. [14] L. Jaimes, I. Vergara-Laurens, and M. Labrador, “A location-based incentive mechanism for participatory sensing systems with budget constraints,” in IEEE Pervasive Computing (PerCom), 2012. [15] D. Mendez and M. Labrador, “A framework for participatory sensing systems,” Submitted to IEEE Communications Magazine, 2012. [16] C.-T. Lu, D. Chen, and Y. Kou, “Algorithms for spatial outlier detection,” in IEEE International Conference on Data Mining, 2003. [17] R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Pearson Prentice Hall, 2007. [18] D. Mendez and M. Labrador, “On sensor data verification for participatory sensing systems,” International Journal of Communication Systems (submitted), 2012. [19] A. Militino, M. Palacios, and M. Ugarte, “Outliers detection in multivariate spatial linear models,” Journal of Statistical Planning and Inference, vol. 136, no. 1, pp. 125–146, 2006. [20] D. Mendez and M. Labrador, “Removing spatial outliers in ps applications,” in International Conference on Selected Topics in Mobile and Wireless Networking (iCOST), jul 2012. [21] D. G. Krige, “A statistical approach to some basic mine valuation problems on the witwatersrand,” Operational Research, vol. 4, no. 1, 1953. [22] P. Goovaerts, Geostatistics for Natural Resources Evaluation, 4th ed. versity Press, USA, 1997, vol. 4th printing. Oxford Uni- [23] S. Geman and D. Geman, “Stochastic relaxation, gibbs distributions, and the bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, November 1984. [24] D. Mendez, M. Labrador, and K. Ramachandran, “Data interpolation for participatory sensing systems,” Pervasive and Mobile Computing (submitted), 2012. [25] D. Marcotte, “Cokriging with matlab,” Computers and Geosciences, vol. 17, pp. 1265– 1280, September 1991. [26] D. Mendez and M. Labrador, “Density maps: Determining where to sample in participatory sensing systems,” in FTRA International Conference on Mobile, Ubiquitous, and Intelligent Computing (MUSIC), June 2012. 114 [27] S. Reddy, D. Estrin, M. Hansen, and M. Srivastava, “Examining micro-payments for participatory sensing data collections,” in ACM International Conference on Ubiquitous Computing, 2010. [28] K. Shilton, N. Ramanathan, S. Reddy, V. Samanta, J. Burke, D. Estrin, M. Hansen, and M. Srivastava, “Participatory design of sensing networks: strengths and challenges,” in Conference on Participatory Design (PDC). Indiana University, 2008, pp. 282–285. [29] K. Shilton, J. Burke, D. Estrin, M. Srivastava, and M. Hansen, “Participatory privacy in urban sensing,” Center for Embedded Network Sensing, UC Los Angeles, Tech. Rep., 2008. [30] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surv., vol. 41, no. 3, pp. 15:1–15:58, Jul. 2009. [31] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, “Elliptical anomalies in wireless sensor networks,” ACM Trans. Sen. Netw., vol. 6, no. 1, pp. 7:1–7:28, jan 2010. [32] K. F. Mulchrone, “Application of delaunay triangulation to the nearest neighbour method of strain analysis,” Journal of Structural Geology, vol. 25, no. 5, pp. 689–702, 2003. [33] S. Li, K. Liu, S. Long, and G. Li, “The natural neighbour petrov-galerkin method for thick plates,” Engineering Analysis with Boundary Elements, vol. 35, no. 4, pp. 
616–622, 2011. [34] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, Computational Geometry, 3rd ed. Springer, 2008. [35] Q. Cai, H. He, and H. Man, “Somso: A self-organizing map approach for spatial outlier detection with multiple attributes,” in International Joint Conference on Neural Networks, 2009. [36] N. R. Adam, V. P. Janeja, and V. Atluri, “Neighborhood based detection of anomalies in high dimensional spatio-temporal sensor datasets,” in ACM symposium on Applied computing, 2004. [37] Z. Min-qi, C. Chong-cheng, L. Jia-xiang, F. Ming-hui, and J. Tamas, “An algorithm for spatial outlier detection based on delaunay triangulation,” in International Conference on Computational Intelligence and Security, 2008, pp. 102–107. [38] X. Liu, C.-T. Lu, and F. Chen, “Spatial outlier detection: random walk based approaches,” in International Conference on Advances in Geographic Information Systems. ACM, 2010. [39] S. Chandrakala and C. Sekhar, “Classification of varying length time series using example-specific adapted gaussian mixture models and support vector machines,” in International Conference on Signal Processing and Communications, 2010. 115 [40] F. Anifowose, “A comparative study of gaussian mixture model and radial basis function for voice recognition,” International Journal of Advanced Computer Sciences and Applications, vol. 3, no. 1, sep 2010. [41] Y. Li, M. Dong, and J. Hua, “A gaussian mixture model to detect clusters embedded in feature subspace,” Communications in Information and Systems, vol. 7, no. 4, pp. 337–352, 2007. [42] E. Miluzzo, M. Papandrea, N. Lane, H. Lu, and A. Campbell, “Pocket, bag, hand, etc. - automatically detecting phone context through discovery,” in International Workshop on Sensing for App Phones (PhoneSense), nov 2010. [43] F. Chen, C.-T. Lu, and A. P. Boedihardjo, “Gls-sod: a generalized local statistical approach for spatial outlier detection,” in International conference on Knowledge discovery and data mining. ACM, 2010. [44] S. Shekhar, C.-T. Lu, and P. Zhang, “A unified approach to detecting spatial outliers,” Geoinformatica, vol. 7, pp. 139–166, June 2003. [45] C.-T. Lu, D. Chen, and Y. Kou, “Detecting spatial outliers with multiple attributes,” in IEEE International Conference on Tools with Artificial Intelligence, 2003, pp. 122– 128. [46] A. Cerioli and M. Riani, “The Ordering of Spatial Data and the Detection of Multiple Outliers,” Journal of Computational and Graphical Statistics, vol. 8, no. 2, pp. 239– 258, 1999. [47] R. Henry and G. Hidy, “Multivariate analysis of particulate sulfate and other air quality variables by principal components-part i: Annual data from los angeles and new york,” Atmospheric Environment, vol. 13, no. 11, pp. 1581–1596, 1979. [48] F. Gonzalez, M. Galan, A. Umbria, and K. Gervilla, “Multivariate statistical analysis of meteorological and air pollution data in the campo de gibraltar region, spain,” Environmental Monitoring and Assessment, vol. 119, pp. 405–423, 2006. [49] L. Poissant, J. W. Bottenheim, P. Roussel, N. W. Reid, and H. Niki, “Multivariate analysis of a 1992 sontos data subset,” Atmospheric Environment, vol. 30, no. 12, pp. 2133–2144, 1996, a WMA International Specialty Conference on Regional Photochemical Measurements and Modeling. [50] L. Berliner, D. Nychka, and T. Hoar, Studies in the atmospheric sciences, ser. Lecture notes in statistics. Springer, 2000. [51] M. Hammoudeh, R. Newman, and S. 
Mount, “An approach to data extraction and visualisation for wireless sensor networks,” in International Conference on Networks (ICN), march 2009, pp. 156–161. [52] A. Talukder and A. Panangadan, “Online visualization of adaptive distributed sensor webs,” in IEEE Aerospace conference, march 2009, pp. 1–8. 116 [53] R. Moore, “Geostatistics in hydrology: Kriging interpolation,” Mathematics Department, Macquarie University, Sydney., Tech. Rep., 1999. [54] I. Amidror, “Scattered data interpolation methods for electronic imaging systems: a survey,” Journal of Electronic Imaging, 2002. [55] J. Kolibal and D. Howard, “Stochastic interpolation: A probabilistic view,” in ECSIS Symposium on Bio-inspired Learning and Intelligent Systems for Security, BLISS, aug. 2008, pp. 129–135. [56] R. Lemos and B. Sanso, “Conditionally linear models for non-homogeneous spatial random fields,” Statistical Methodology, pp. 1–14, 2011. [57] G. Jekabsons, ARESLab: Adaptive Regression Splines toolbox for Matlab/Octave, 2010. [Online]. Available: http://www.cs.rtu.lv/jekabsons/ [58] D. W. Nychka, “Spatial process estimates as smoothers,” North Carolina State University, Raleigh, North Carolina, USA. National Center for Atmospheric Research, Boulder, Colorado, USA., Tech. Rep., 1999. [59] N. Cressie, M. Kaiser, M. Daniels, J. Aldworth, J. Lee, S. Lahiri, and L. Cox, “Spatial analysis of particulate matter in an urban environment,” in Conference on Geostatistics for Environmental Applications (geoENV), 1999, pp. 41–52. [60] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” in Computer Vision and Pattern Recognition, 2004, pp. 261–268. [61] M. Li, “Markov random field edge-centric image/video processing,” University of California, San Diego. Electrical Engineering (Signal and Image Proc), Tech. Rep., 2007. [62] L. Fontanella, L. Ippoliti, R. Martin, and S. Trivisonno, “Interpolation of spatial and spatio-temporal gaussian fields using gaussian markov random fields,” Advances in Data Analysis and Classification, vol. 2, pp. 63–79, 2008. [63] C. Wikle and J. Royle, “Spatial statistical modeling in biology,” In Encyclopedia of Life Support Systems (EOLSS), 2004. [Online]. Available: http://www.eolss.net [64] D. Ginsbourger, D. Dupuy, A. Badea, L. Carraro, and O. Roustant, “A note on the choice and the estimation of kriging models for the analysis of deterministic computer experiments,” Appl. Stoch. Model. Bus. Ind., vol. 25, no. 2, pp. 115–131, mar 2009. [65] M. Moore, Spatial statistics: methodological aspects and applications, ser. Lecture notes in statistics. Springer, 2001. [66] N. A. Cressie, Statistics for spatial data, ser. Wiley series in probability and mathematical statistics: Applied probability and statistics. J. Wiley, 1993. [67] L. Hartman and O. Hössjer, “Fast kriging of large data sets with gaussian markov random fields,” Computational Statistics and Data Analysis, vol. 52, pp. 2331–2349, January 2008. 117 [68] B. Srinivasan, R. Duraiswami, and R. Murtugudde, “Efficient kriging for real-time spatio-temporal interpolation,” in Conference on Probability and Statistics in the Atmospheric Sciences, 2010. [69] N. Memarsadeghi, V. Raykar, R. Duraiswami, and D. Mount, “Efficient kriging via fast matrix-vector products,” in IEEE Aerospace Conference, 2008. [70] N. Guillemette, A. St-Hilaire, T. B. Ouarda, N. Bergeron, É. Robichaud, and L. 
Bilodeau, “Feasibility study of a geostatistical modeling of monthly maximum stream temperatures in a multivariate space,” Journal of Hydrology, vol. 364, no. 1, pp. 1–12, 2009. [71] S. Bocchi, A. Castrignano, F. Fornaro, and T. Maggiore, “Application of factorial kriging for mapping soil variation at field scale,” European Journal of Agronomy, vol. 13, pp. 295–308, 2000. [72] S. Grunwalda, K. Reddya, J. Prengera, and M. Fisher, “Modeling of the spatial variability of biogeochemical soil properties in a freshwater ecosystem,” Ecological Modelling, vol. 201, pp. 521–535, 2007. [73] C. Carlona, A. Crittoa, A. Marcomini, and P. Nathanail, “Risk based characterisation of contaminated industrial site using multivariate and geostatistical tools,” Environmental Pollution, vol. 111, pp. 417–427, 2001. [74] M. Demirbas, C. Rudra, A. Rudra, and M. Bayir, “Imap: Indirect measurement of air pollution with cellphones,” in IEEE International Conference on Pervasive Computing and Communications, 2009. [75] R. K. Ganti, N. Pham, H. Ahmadi, S. Nangia, and T. F. Abdelzaher, “Greengps: a participatory sensing fuel-efficient maps application,” in International conference on mobile systems, applications, and services., 2010. [76] S. Reddy, K. Shilton, J. Burke, D. Estrin, M. Hansen, and M. Srivastava, “Evaluating participation and performance in participatory sensing,” in International Workshop on Urban, Community, and Social Applications of Networked Sensing Systems (UrbanSense), Nov. 2008. [77] S. Reddy, V. Samanta, K. Shilton, J. Burke, D. Estrin, M. Hansen, and M. Srivastava, “Mobisense, mobile network services for coordinated participatory sensing,” in International Symposium on Autonomous Decentralized Systems (ISADS), 2009. [78] S. Reddy, K. Shilton, J. Burke, D. Estrin, M. Hansen, and M. Srivastava, “Using context annotated mobility profiles to recruit data collectors in participatory sensing,” in International Symposium on Location and Context Awareness, 2009. [79] J. Lee and B. Hohm, “Dynamic pricing incentive for participatory sensing,” Pervasive and Mobile Computing, vol. 6, pp. 693–708, 2010. 118 [80] N. D. Lane, S. B. Eisenman, M. Musolesi, E. Miluzzo, and A. T. Campbell, “Urban sensing systems: opportunistic or participatory?” in Workshop on Mobile computing systems and applications, ser. HotMobile. ACM, 2008, pp. 11–16. [81] M. Ghanem, Y. Guo, J. Hassard, M. Osmond, and M. Richards, “Sensor grids for air pollution monitoring,” in UK e-Science All Hands Meeting, 2004. [82] CENS, “Personal environmental impact report,” University of California, Los Angeles, Tech. Rep., 2009. [Online]. Available: http://urban.cens.ucla.edu/projects/peir/ [83] E. Paulos, R. Honicky, and E. Goodman, “Sensing atmosphere,” in Workshop position paper for the Sensing on Everyday Mobile Phones in Support of Participatory Research, ser. SenSys. ACM, 2008. [84] A. Al-Ali, I. Zualkernan, and F. Aloul, “A mobile gprs-sensors array for air pollution monitoring,” Sensors Journal, IEEE, vol. 10, no. 10, pp. 1666 –1671, oct. 2010. [85] S. Constantin, F. Moldoveanu, R. Campeanu, I. Baciu, S. Grigorescu, B. Carstea, and V. Voinea, “Gprs based system for atmospheric pollution monitoring and warning,” in IEEE International Conference on Automation, Quality and Testing, Robotics, vol. 2, may 2006, pp. 193 –198. [86] P. Dutta, P. M. Aoki, N. Kumar, A. Mainwaring, C. Myers, W. Willett, and A. 
Woodruff, “Common sense: participatory urban sensing using a network of handheld air quality monitors,” in Conference on Embedded Networked Sensor Systems (SenSys). ACM, 2009, pp. 349–350. [87] J. Carrapetta, N. Youdale, A. Chow, and V. Sivaraman, “Haze watch project,” University of New South Wales, Tech. Rep., 2007, poster. [Online]. Available: http://www.pollution.ee.unsw.edu.au. [88] P. Rudman, S. North, and M. Chalmers, “Mobile pollution mapping in the city,” in UK-UbiNet workshop on eScience and ubicomp, 2005. [89] W. Hedgecock, P. Volgyesi, A. Ledeczi, and X. Koutsoukos, “Dissemination and presentation of high resolution air pollution data from mobile sensor nodes,” in Annual Southeast Regional Conference. ACM, 2010. [90] AirData, Database from the Environmental Protection Agency, 2011. [Online]. Available: www.epa.gov/air/data/ [91] A. Perez, M. Labrador, and S. Barbeau, “G-sense: a scalable architecture for global sensing and monitoring,” Network, IEEE, vol. 24, no. 4, 2010. [92] A. Perez, “An architecture for global ubiquitous sensing,” Ph.D. dissertation, University of South Florida, April 2011. [93] Arduino, Open-source electronics, 2011. [Online]. Available: www.arduino.cc/en/ [94] Figaro, Gas Sensors, 2011. [Online]. Available: www.figarosensor.com/ 119 [95] Sensirion, General Sensors, 2011. [Online]. Available: www.sensirion.com [96] H. Nguyen, O. Kosheleva, V. Kreinovich, and S. Ferson, “Trade-off between sample size and accuracy: Case of measurements under interval uncertainty,” University of Texas at El Paso, Tech. Rep. 195, 2008. [Online]. Available: http://digitalcommons.utep.edu/cs techrep/195 [97] C. Bettini, S. Jajodia, P. Samarati, and X. S. Wang, “Privacy in location-based applications,” in Lecture Notes, Springer LNCS 5599, 2009. [98] D. Christin, A. Reinhardt, S. Kanhere, and M. Hollick, “A survey on privacy in mobile participatory sensing applications,” Journal of Systems and Software, vol. 84, no. 11, 2011. [99] S. Ozdemir and H. Cam, “Integration of false data detection with data aggregation and confidential transmission in wireless sensor networks,” Networking, IEEE/ACM Transactions on, vol. 18, pp. 736–749, 2010. [100] Y. Chen and M. Gupta, “Em demystified: An expectation-maximization tutorial,” University of Washington. Department of Electrical Engineering, Tech. Rep. UWEETR-2010-0002, Feb. 2010. [Online]. Available: http://www.ee.washington.edu [101] C. Tomasi, “Estimating gaussian mixture densities with em - a tutorial,” Duke University. Computer Science Department, Tech. Rep. UWEETR-2010-0002. [102] C. A. Sugar and G. M. James, “Finding the number of clusters in a dataset: An information-theoretic approach,” Journal of the American Statistical Association, vol. 98, no. 463, pp. 750–763, 2003. [103] D. Ginsbourger, D. Dupuy, A. Badea, L. Carraro, and O. Roustant, “A note on the choice and the estimation of kriging models for the analysis of deterministic computer experiments,” Appl. Stoch. Model. Bus. Ind., vol. 25, pp. 115–131, March 2009. [104] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988. [105] E. Ising, Zeitschrift Physik, vol. 31, p. 253, 1925. [106] N. Cressie and D. Hawkins, “Robust estimation of the variogram: I,” Mathematical Geology, vol. 12, pp. 115–125, 1980. [107] T. Hansen, mGstat, a Geostatistical Matlab toolbox, 2004. [Online]. Available: http://mgstat.sourceforge.net/ [108] A. Hyvarinen and E. 
Oja, “Independent component analysis: algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000. [109] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, 1st ed. College Station, Texas: John Wiley and Sons., 2001, vol. 1. 120 [110] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. 1998. Prentice Hall, [111] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, May 1999. [112] R. Ganti, F. Ye, and H. Lei, “Mobile crowdsensing: Current state and future challenges,” IEEE Communications Magazine, vol. 49, no. 11, Nov. 2011. 121 APPENDICES 122 Appendix A: P-Sense - Permission for Reuse 123 Appendix B: G-Sense - Permission for Reuse 124