2017 Book DifferentialPrivacyAndApplicat PDF
Tianqing Zhu
Gang Li
Wanlei Zhou
Philip S. Yu
Differential
Privacy and
Applications
Advances in Information Security
Volume 69
Series editor
Sushil Jajodia, George Mason University, Fairfax, VA, USA
More information about this series at http://www.springer.com/series/5576
Tianqing Zhu, Deakin University, Melbourne, Australia
Gang Li, Deakin University, Melbourne, Australia
ISSN 1568-2633
Advances in Information Security
ISBN 978-3-319-62002-2 ISBN 978-3-319-62004-6 (eBook)
DOI 10.1007/978-3-319-62004-6
Preface
touch the topic of differential privacy or only focus on the theoretical analysis of differential privacy. Instead of using abstract and complex notions to describe the concepts, methods, algorithms, and analysis of differential privacy, this book presents these difficult topics in combination with applications, in order to help students, researchers, and engineers with less mathematical background understand the new concepts and framework, enabling a wider adoption and implementation of differential privacy in the real world. The striking features of this book, which distinguish it from others, can be illustrated from three basic aspects:
• A detailed coverage of differential privacy from the perspective of engineering, rather than computing theory. The most difficult part in comprehending differential privacy is the complexity and level of abstraction of the theory. This book presents the theory of differential privacy in a more natural and easy-to-understand way.
• A rich set of state-of-the-art examples in various application areas that helps readers understand how to implement differential privacy in real-world scenarios. Each application example includes a brief introduction to the problem and its challenges, a detailed implementation of differential privacy to solve the problem, and an analysis of privacy and utility.
• A comprehensive collection of contemporary research results and issues in differential privacy. Apart from the basic theory, most of the content of the book comes from publications in the last 5 years, reflecting the state of the art of research and development in the area of differential privacy.
This book intends to enable readers, especially postgraduate and senior undergraduate students, to study up-to-date concepts, methods, algorithms, and analytic skills for building modern privacy-preserving applications through differential privacy. It enables students not only to master the concepts and theories of differential privacy but also to readily put the material introduced into implementation practice. Therefore, the book is divided into two parts: theory and applications.
In the theory part, after an introduction to the differential privacy preliminaries, the book presents detailed descriptions, from an engineering viewpoint, of the areas of differentially private data publishing and differentially private data analysis where research on differential privacy has been conducted. In the applications part, after a summary of the steps to follow when solving the privacy-preserving problem in a particular application, the book presents a number of state-of-the-art application areas of differential privacy, including differentially private social network data publishing, differentially private recommender systems, differential location privacy, spatial crowdsourcing with differential privacy preservation, and correlated differential privacy for non-IID datasets. The book also includes a final chapter on the future directions of differential privacy and its applications.
Contents

1 Introduction
  1.1 Privacy Preserving Data Publishing and Analysis
  1.2 Privacy Violations
  1.3 Privacy Models
  1.4 Differential Privacy
  1.5 Outline and Book Overview
2 Preliminary of Differential Privacy
  2.1 Notations
  2.2 Differential Privacy Definition
    2.2.1 The Privacy Budget
  2.3 The Sensitivity
    2.3.1 The Global Sensitivity
    2.3.2 The Local Sensitivity
  2.4 The Principal Differential Privacy Mechanisms
    2.4.1 The Laplace Mechanism
    2.4.2 The Exponential Mechanism
  2.5 Utility Measurement of Differential Privacy
3 Differentially Private Data Publishing: Settings and Mechanisms
  3.1 Interactive and Non-interactive Settings
  3.2 Publishing Mechanism
4 Differentially Private Data Publishing: Interactive Setting
  4.1 Transaction Data Publishing
    4.1.1 Laplace
    4.1.2 Transformation
    4.1.3 Query Separation
    4.1.4 Iteration
    4.1.5 Discussion
References
Index
Chapter 1
Introduction
Over the past two decades, digital information collected by corporations, organizations and governments has resulted in a huge number of datasets, and the speed of such data collection has increased dramatically over the last few years. A data collector, also known as a curator, is in charge of publishing data for further analysis [8].
Figure 1.1 illustrates the data publishing and analysis process, in which there are roughly two stages. In the first stage, the data collection stage, individuals submit their personal information to a dataset. Large amounts of data are collected in their respective areas, such as medical data, banking data, social network data, etc. The curator has full management of this dataset. The second stage is the data publishing or analysis stage. Data publishing aims to share datasets or some query results with public users. In some literature, this scenario is known as data sharing or data release. Data analysis provides data models to the public users. It may be associated with particular machine learning or data mining algorithms. Both data publishing and analysis bring social benefits, such as providing better services, publishing official statistics, and supporting data mining or machine learning tasks.
As shown in Fig. 1.1, most of the collected datasets are personally related and contain private or sensitive information. Privacy violations may occur in both stages. If curators are not trustworthy, personal information will be disclosed directly in the data collection stage. Even when a curator can be trusted and applies several simple anonymization techniques, personal information may still be disclosed when he/she publishes aggregate information to the public, as the public is normally not trustworthy [50, 109, 236]. Privacy preservation has therefore become an urgent issue that needs to be addressed [84].
Fig. 1.1 The data publishing and analysis process: individuals (Alice, Bob, Cathy, ...) submit personal information to an original dataset held by the curator, who releases aggregate information to users.
Australian telecom company, breached the privacy of 15,775 customers when their information was made publicly available on the internet between February 2012 and May 2013 [3].
All of these examples show that simple anonymization is insufficient for privacy preservation. An adversary with background information on an individual still has a chance to identify that individual's records. People need to seek a more rigorous way to protect sensitive personal information.
Research communities have proposed various methods to preserve privacy [8, 148], and have designed a number of metrics to evaluate the privacy level of these methods. The methods and their privacy criteria are defined as the privacy model. To preserve privacy in the data collection stage, the privacy model is inserted between the curator and the individuals who submit their personal information. The privacy model processes the personal information before it is submitted to the untrusted curator. To preserve privacy in the data publishing and analysis stage, the privacy model is placed between the curator and public users. All data going through the privacy model should satisfy the requirements of the model. Figure 1.2 shows the different placements of privacy models.
The most popular privacy model is k-anonymity [197, 211], which requires that an individual should not be identifiable within a group of size smaller than k. There are also a number of other privacy models, such as l-diversity [151], t-closeness [142, 143], and δ-presence [169].
However, the vulnerability of these models lies in the fact that they can be easily compromised by uncontrolled background information [231]. From the published dataset, users, including the adversary, can possibly figure out the privacy requirement and the anonymization operations the curator used. Wong et al. [231] show that such additional background information can lead to extra information that facilitates privacy compromise. For example, when a dataset is generalized
Fig. 1.2 Placement of privacy models (PM): between individuals (Alice, Bob, Cathy, ...) and the curator in the collection stage, and between the curator and users in the publishing stage.
until it minimally meets the k-anonymity requirement, the attacker can exploit this minimality of the equivalence groups to reverse the anonymization and explore possible versions of the original table. This process is called the minimality attack [50, 109, 236], a special method in the record linkage attack.
Furthermore, several other new attacks are emerging against traditional privacy models. The composition attack [86] refers to the scenario in which the adversary gleans background information from other channels, such as the Web, public records or domain knowledge. The authors explore how a published dataset is at risk in the face of rich, realistic sources of background information. Kifer presents the deFinetti attack [125] against popular data anonymization schemes. The attacker only needs to know the insensitive attributes of an individual in the dataset, and can then carry out this attack by building a machine learning model over the sanitized dataset. The attack exploits a subtle flaw in the way that prior work computes the probability of disclosure of a sensitive attribute [125]. Moreover, Wong et al. [232] propose a new attack named the foreground knowledge attack. If a dataset is not properly anonymized, patterns can be generated from the published dataset and utilized by the adversary to breach individual privacy. This kind of attack is referred to as the foreground knowledge attack, in contrast to attacks based on background information.
In general, researchers assume that the attacker has limited background information and try to profile this information when designing privacy models, which makes these models tightly coupled with background information. However, background information is hard to predict or model. It is impractical to know exactly what kind of background information the adversary might have, so people have little opportunity to depict or control the information the adversary holds. Difficulty in modeling background information is the inherent weakness of those traditional privacy models. Consequently, researchers have sought a solid privacy model that is robust enough to provide a provable privacy guarantee against background information.
This is where the differential privacy model comes in: a solid privacy model that provides a provable privacy guarantee for individuals. It assumes that even if an adversary knows all the other records in a dataset except one record, he/she still cannot infer the information contained in that unknown record. In other words, even if the adversary has the maximum background information, everything except the record he/she wants to know, he/she cannot identify that specific record. Under this assumption, differential privacy theoretically proves that there is a low probability of the adversary figuring out the unknown record. Compared to previous privacy models, differential privacy can successfully resist background attacks and provide a provable privacy guarantee.
The first steps toward differential privacy were taken in 2003. But it was
not until 2006 that Dwork et al. discussed the technology in its present form.
Fig. 1.3 Key components of differential privacy research. Research directions: data publishing and data analysis. Research issues: nature of input data, nature of output data, mechanism, and mechanism setting. Research theory: machine learning, data mining, statistics, learning theory, etc. Application domains: social network, location privacy, recommender system, healthcare data release, etc.
Since then, a significant body of work has been developed by scientists around the world, and differential privacy has caught the attention of academics. In 2012, Microsoft released a whitepaper titled Differential Privacy for Everyone [1], striving to translate this research into new privacy-enhancing technologies.
Differential privacy has recently been considered a promising privacy-preserving model. Interest in this area is very high, and the notion spans a range of research areas, from the privacy community to the data science communities including machine learning, data mining, statistics and learning theory. Figure 1.3 shows the key components associated with differential privacy research. There are roughly two major directions in differential privacy research: data publishing and data analysis. In both directions, the main research issues include the nature of input and output data, mechanism design and settings. The research methods and theories involve machine learning, data mining, statistics, and cryptography. Much work has been conducted in a number of application domains, including social networks, location privacy, recommender systems and other applications.
This book provides a structured and comprehensive overview of the extensive
research on differential privacy, spanning multiple research areas and application
domains. It will facilitate a better understanding of the areas of data publishing and data analysis in which research on differential privacy has been conducted, and of how techniques developed in one area can be applied in other domains.
The initial work on differential privacy was pioneered by Dwork et al. [61] in 2006.
The basic idea can be found in a series of works [62–64, 72, 75, 198].
1. The first survey by Dwork et al. [62] recalled the definition of differential privacy and its two principal mechanisms. The survey aimed to show how to apply these techniques in data publishing.
2. The report [63] explored the difficulties that arise when data publishing meets statistical analysis, together with prospective solutions. It identified several research issues on data analysis that had not been adequately explored at that time.
3. In the review [64], Dwork et al. provided an overview of the principal motivating scenarios, together with a summary of future research directions.
4. Sarwate et al. [198] focused on privacy preservation for continuous data to solve problems in signal processing.
5. More recently, Dwork et al. [75] surveyed attacks on private data, summarizing possible attacks that compromise personal privacy.
6. A book by Dwork et al. [72] presented an accessible starting place for anyone looking to learn about the theory of differential privacy.
Those surveys and the book focus on the concepts and theories of differential privacy; however, the mathematical theories are not easily implemented in applications directly. Meanwhile, after more than 10 years of theoretical development, a significant number of new technologies and applications have appeared in this area. We believe that now is a good time to summarize the new technologies and address the gap between theory and application.
Here we attempt to find a clearer way to present the concepts and practical aspects
of differential privacy for the data mining research community.
• We avoid detailed theoretical analysis of related differentially private algorithms and instead place more focus on their practical aspects, which may benefit applications in the real world.
• We try to avoid repeating many references that have already been analyzed
extensively in the above well-cited surveys.
• Even though differential privacy covers multiple research directions, we restrict
our observations to data publishing and data analysis scenarios, which are the
most popular scenarios in the data mining community.
Table 1.1 defines the scope of these two major research directions in differential privacy. Mechanism design for data publishing is normally independent of the publishing target, as the goal of publishing is to release query answers or a dataset for further usage that is, hence, unknown to the curator. Mechanism design for data analysis aims to preserve privacy during the analysis process. The curator already knows the details of the analysis algorithm, so the mechanism is associated with the analysis algorithm.
Table 1.1 Comparison between differentially private data publishing and analysis

            Differentially private data publishing   Differentially private data analysis
Mechanism   Independent mechanism                    Coupled with a particular algorithm
Input       Various data types                       Transaction dataset (training samples)
Output      Query answers or datasets                Various models
Chapter 2
Preliminary of Differential Privacy
2.1 Notations
Figure 2.1 shows a basic attack model. Suppose a curator would like to preserve privacy for n records in a dataset D, but an attacker has all information about n − 1 records in D, every record except the nth. These n − 1 records can be defined as background information. The attacker can make a query on the dataset D to get aggregate information about the n records in D. After comparing the query result with the background information, the attacker can easily identify the information of record n.
Differential privacy aims to resist this attack. It builds on the intuition that releasing an aggregated result should not reveal too much information about any individual record in the dataset. Figure 2.2 shows how differential privacy resists the attack. We define a dataset D′ that differs from D in only one record, say, xn. When the attacker makes the query f on both datasets, he/she has a very high probability of getting the same result s. Based on the results, he/she cannot identify whether xn is in D or not. When the attacker cannot tell the difference between the
Fig. 2.1 A basic attack model: the attacker holds background information on records (x1, ..., x(n−1)) and differences the result of a query on (x1, ..., xn) against it to obtain xn.
Fig. 2.2 How differential privacy resists the attack: for neighbouring datasets D and D′, the query returns f(D) = s and f(D′) = s with very high probability, so the attacker is not sure whether xn is in D.
Fig. 2.3 For a query f, the true answer s is unique, but the differentially private answer f(D) is a probability distribution centred on s.
outputs of D and D′, then xn is safe. If this property holds for all records in D, the dataset D preserves privacy for all records.
In differential privacy, the curator does not publish a dataset directly; instead, public users submit statistical queries to the curator, and the curator replies with query answers. For a particular query, the true answer is unique, but the differentially private answer is a distribution. As shown in Fig. 2.3, datasets D and D′ have a very high probability of outputting the same result.
To present the definition of differential privacy formally, we use the following key notations in the book. We consider a finite data universe X with size |X|. Let r represent a record with d attributes; a dataset D is an unordered set of n records sampled from the universe X. Two datasets D and D′ are defined as neighboring datasets if they differ in one record. A query f is a function that maps the dataset D to an abstract range R: f : D → R. A group of queries is denoted by F; normally, we use the symbol m to denote the number of queries in F. There are various types of queries, such as count, sum, mean and range queries.
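These notions can be made concrete with a small sketch. The dataset values and the `count_over_60` query below are invented for illustration; any count query of this shape works the same way.

```python
# Neighboring datasets: D and D_prime differ in exactly one record.
D = [34, 45, 62, 71, 28]        # ages of five individuals
D_prime = [34, 45, 62, 71]      # the same dataset with one record removed

def count_over_60(dataset):
    """A count query f: D -> R that counts records with age above 60."""
    return sum(1 for age in dataset if age > 60)

print(count_over_60(D))        # 2
print(count_over_60(D_prime))  # 2: removing the record aged 28 changes nothing
```

Note that for any pair of neighboring datasets, a count query's answer changes by at most 1, a fact the sensitivity discussion below relies on.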
The target of differential privacy is to mask the difference in a query f between the neighboring datasets [64]. The maximal difference in the results of the query f is defined as the sensitivity Δf, which determines how much perturbation is required for the privacy-preserving answer. To achieve the perturbation target, differential privacy provides a mechanism M that accesses the dataset and implements some functionality. A perturbed output is denoted by a 'hat' over the notation; for example, f̂(D) denotes the randomized answer of querying f on D. Table 2.1 summarizes the major notations used in the book. Some other Greek symbols will be used temporarily in different chapters.
In Definition 2.1, the parameter ε is defined as the privacy budget [89], which controls the privacy guarantee level of mechanism M. A smaller ε represents stronger privacy. In practice, ε is usually set to less than 1, such as 0.1 or ln 2. Two privacy
Fig. 2.4 The output distributions M(D) and M(D′) on neighbouring datasets: their ratio is bounded by e^ε.
Fig. 2.5 Composition properties. Parallel composition: mechanisms M1 (ε1-DP), ..., Mm (εm-DP) applied to disjoint subsets of D provide max(ε1, ..., εm)-differential privacy. Sequential composition: M1, ..., Mm applied to the same dataset D provide (Σ εi)-differential privacy.
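The parallel and sequential composition properties illustrated above amount to simple budget arithmetic. The helper names below are ours, not from the book; this is a minimal accounting sketch, not a full mechanism.

```python
def sequential_budget(epsilons):
    """Sequential composition: mechanisms run on the SAME dataset
    consume the sum of their individual privacy budgets."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel composition: mechanisms run on DISJOINT subsets of the
    dataset consume only the maximum individual budget."""
    return max(epsilons)

eps = [0.1, 0.2, 0.3]
print(sequential_budget(eps))  # sum of the budgets (0.6, up to float rounding)
print(parallel_budget(eps))    # 0.3
```

This is why partitioning a dataset and querying each part separately (as in the publishing mechanisms of Chap. 3) is cheaper, in budget terms, than repeatedly querying the whole dataset.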
The global sensitivity is related only to the type of the query f. It considers the maximum difference between query results on neighboring datasets. The formal definition is as below:
Definition 2.2 (Global Sensitivity) For a query f : D → R, the global sensitivity of f is defined as

Δf_GS = max_{D,D′} ||f(D) − f(D′)||₁,

where D and D′ are any pair of neighboring datasets.
Global sensitivity works well when queries have relatively low sensitivity values, such as count or sum queries. For example, the count query normally has Δf_GS = 1. When the true answer is in the hundreds or thousands, the sensitivity is much lower than the true answer. But for queries such as the median or average, the global sensitivity yields high values compared with the true answers. We then resort to local sensitivity for those queries [172].
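The contrast between count and median can be checked empirically. The sketch below (function names ours) measures the local sensitivity of a specific dataset under the remove-one-record notion of neighboring datasets; for count it is always 1, while for the median it can be as large as the data range.

```python
def count_query(data):
    return len(data)

def median_query(data):
    s = sorted(data)
    return s[len(s) // 2]

def local_sensitivity(query, data):
    """Empirical local sensitivity: the maximum change in the query
    result when any single record is removed from this dataset."""
    base = query(data)
    return max(abs(base - query(data[:i] + data[i + 1:]))
               for i in range(len(data)))

data = [0, 0, 0, 1000, 1000]
print(local_sensitivity(count_query, data))   # 1: a count changes by at most 1
print(local_sensitivity(median_query, data))  # 1000: the median can jump
```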
Fig. 2.6 The global sensitivity, the smooth bound, and the local sensitivity.
Δf_LS = max_{D′} ||f(D) − f(D′)||₁.   (2.3)
For many queries, such as the median, the local sensitivity is much smaller than the global sensitivity. However, as changes in the local sensitivity may themselves result in information disclosure, it cannot be used in mechanisms directly. The value of the local sensitivity should change smoothly, so a smooth bound is added.
Definition 2.4 (Smooth Bound) For β > 0, a function B : D → R is a β-smooth upper bound on the local sensitivity of f if it satisfies the following requirements,
Figure 2.6 shows the relationship between the local sensitivity, the smooth bound and the global sensitivity. For some queries, the local sensitivity is lower than the global sensitivity; for queries such as count or range, the local sensitivity is identical to the global sensitivity. Because most of the literature is concerned with the global sensitivity, the term sensitivity refers to global sensitivity in this book unless otherwise specified.
mechanism [157]. The Laplace and Gaussian mechanisms are suitable for numeric
queries and the exponential mechanism is suitable for non-numeric queries.
The Laplace mechanism relies on adding controlled Laplace noise to the query result before returning it to the user. The noise is sampled from the Laplace distribution, which is centered at 0 with scale b. The noise is denoted by Lap(b), in which a larger b indicates more noise. The corresponding probability density function is:

Lap(x) = (1/(2b)) exp(−|x|/b).   (2.6)
The mechanism is defined as follows:
Definition 2.5 (Laplace Mechanism) Given a function f : D → R over a dataset D, mechanism M provides ε-differential privacy if it follows Eq. (2.7):

M(D) = f(D) + Lap(Δf/ε).   (2.7)

The mechanism shows that the size of the noise is related to the sensitivity of the query f and the privacy budget ε. A larger sensitivity leads to a higher volume of noise.
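Eq. (2.7) can be sketched in a few lines. This is a minimal illustration, not a production implementation (which would need care around floating-point side channels); the function names are ours, and the Laplace sampler uses standard inverse transform sampling.

```python
import math
import random

def laplace_noise(b):
    """Sample Lap(b) noise, centred at 0, by inverse transform sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1 - 2 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """M(D) = f(D) + Lap(sensitivity / epsilon), per Eq. (2.7)."""
    return true_answer + laplace_noise(sensitivity / epsilon)

# A count query has sensitivity 1; with epsilon = 1 the noise scale is 1.
noisy = laplace_mechanism(10, sensitivity=1, epsilon=1.0)
print(noisy)  # a randomized answer near 10, e.g. roughly between 8 and 12
```

A smaller ε enlarges the noise scale Δf/ε, trading utility for privacy, exactly as the definition states.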
To achieve (ε, δ)-differential privacy, one can use Gaussian noise [72]. In this case, rather than scaling the noise to the ℓ1 sensitivity, one instead scales it to the ℓ2 sensitivity of Definition 2.6:
Definition 2.6 (ℓ2-Sensitivity) For a query f : D → R, the ℓ2-sensitivity of f is defined as

Δ2f = max_{D,D′} ||f(D) − f(D′)||₂.
The Gaussian mechanism with parameter σ adds zero-mean Gaussian noise with variance σ².
Definition 2.7 (Gaussian Mechanism) Given a function f : D → R over a dataset D, if σ = Δ2f · √(2 ln(2/δ))/ε and N(0, σ²) are i.i.d. Gaussian random variables, mechanism M provides (ε, δ)-differential privacy when, analogously to Eq. (2.7),

M(D) = f(D) + N(0, σ²).

The Gaussian mechanism follows the same privacy composition as the Laplace mechanism.
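A sketch of Definition 2.7, using the calibration σ = Δ2f·√(2 ln(2/δ))/ε stated above (the function name is ours; this is an illustration, not a hardened implementation):

```python
import math
import random

def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
    """(epsilon, delta)-DP answer: add zero-mean Gaussian noise whose
    standard deviation is scaled to the l2-sensitivity of the query."""
    sigma = l2_sensitivity * math.sqrt(2 * math.log(2 / delta)) / epsilon
    return true_answer + random.gauss(0.0, sigma)

noisy = gaussian_mechanism(10, l2_sensitivity=1.0, epsilon=1.0, delta=1e-5)
print(noisy)  # a randomized answer near 10
```

Unlike the Laplace mechanism, the guarantee is relaxed by the failure probability δ, which is why δ is usually set very small (e.g. 1e-5).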
M(D) = { return r with probability ∝ exp(ε·q(D, r) / (2Δq)) }.   (2.10)
Lap(1) will be added to the true answer f1(D). Lastly, the mechanism M will output a noisy answer M(D) = f1(D) + Lap(1). If the true answer is 10, the noisy answer might be 11.
Suppose we have another query f2: what is the most common disease in this district? This query generates a non-numeric result, so we can apply the exponential mechanism. Table 2.3 lists all the diseases and their true numbers in the first two columns. We first define the score function of f2: we adopt the number of people with each disease as the score function q. As deleting a person has a maximum impact of 1 on the result of q, the sensitivity of q is Δq = 1. The probability of each output can then be calculated by Definition 2.8. Table 2.3 lists the results when ε = 0, ε = 0.1 and ε = 1.
In the third column of Table 2.3, ε = 0 means that the mechanism chooses an answer uniformly from the four options; the output probabilities are equal. Obviously, ε = 0 provides the highest privacy level, but it loses almost all utility. When ε = 0.1, Flu has the highest probability of being chosen and HIV has the lowest. The gap is not very large, which indicates that this setting can provide acceptable privacy and utility levels. When ε = 1, the probability gap between HIV and the other diseases is significant, which means that the mechanism retains high utility but provides a lower privacy level.
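The disease example can be reproduced with a short sketch of Eq. (2.10). The counts below are hypothetical stand-ins (the actual values of Table 2.3 are not reproduced here), and the function names are ours; the qualitative behaviour across ε matches the discussion above.

```python
import math
import random

def exponential_probabilities(scores, epsilon, sensitivity=1.0):
    """Per-candidate probability, proportional to exp(eps * q / (2 * dq))."""
    weights = {k: math.exp(epsilon * s / (2 * sensitivity))
               for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def exponential_mechanism(scores, epsilon, sensitivity=1.0):
    """Sample one candidate from the exponential-mechanism distribution."""
    probs = exponential_probabilities(scores, epsilon, sensitivity)
    r, acc = random.random(), 0.0
    for k, p in probs.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding at the boundary

# Hypothetical disease counts: Flu most common, HIV least common.
disease_counts = {"Flu": 30, "Cold": 20, "Fever": 15, "HIV": 5}

for eps in (0.0, 0.1, 1.0):
    probs = exponential_probabilities(disease_counts, eps)
    print(eps, {k: round(p, 3) for k, p in probs.items()})
```

With ε = 0 all four probabilities are 0.25; with ε = 0.1 Flu leads by a modest margin; with ε = 1 Flu dominates, mirroring the utility/privacy trade-off described above.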
When the privacy level is fixed to ε, several utility measurements have been used in both data publishing and analysis to evaluate the performance of differential privacy mechanisms.
• Noise size measurement: the easiest way is to calculate how much noise has been added to the query results. Smaller noise indicates higher utility. This utility measurement has been widely used in data publishing.
• Error measurement: utility can be evaluated by the difference between the non-private output and the private output. For example, the utility of single query publishing can be measured by |f(D) − f̂(D)|. A smaller distance shows higher utility. The error measurement is normally represented by a bound with accuracy parameters [29]:
Pr( max_{f∈F} |f(D) − f̂(D)| ≤ α ) > 1 − β,   (2.11)
For data analysis, the utility normally depends on the analysis algorithm. Suppose the algorithm is denoted by M and its private version by M̂; Eq. (2.11) can then be interpreted as

Pr( |M(D) − M̂(D)| ≤ α ) > 1 − β.   (2.13)

Equation (2.13) has several implementations in data analysis, such as risk bounds and sample complexity.
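An accuracy bound of this (α, β) form can be checked empirically for the Laplace mechanism: estimate the probability that the noise stays within α. The parameters and function names below are invented for illustration; for Lap(1/ε) the exact probability is 1 − exp(−εα).

```python
import math
import random

def laplace_noise(b):
    """Sample Lap(b) noise by inverse transform sampling."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1 - 2 * abs(u))

def empirical_accuracy(epsilon, alpha, trials=20000):
    """Estimate Pr(|f(D) - f_hat(D)| <= alpha) for a sensitivity-1 query
    answered by the Laplace mechanism with budget epsilon."""
    hits = sum(abs(laplace_noise(1.0 / epsilon)) <= alpha
               for _ in range(trials))
    return hits / trials

# With epsilon = 1 and alpha = 3, the exact value is 1 - e^(-3), about 0.95.
print(empirical_accuracy(epsilon=1.0, alpha=3.0))
```

Reading the estimate as 1 − β recovers the bound: with probability greater than 1 − β, the private answer is within α of the truth.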
Chapter 3
Differentially Private Data Publishing: Settings
and Mechanisms
Fig. 3.1 The interactive and non-interactive settings of differentially private data publishing.
The correlation between queries leads to a higher sensitivity. Therefore, the non-interactive setting normally incurs more noise than the interactive setting.
The above example presents the difference between the two settings, and shows that the size of noise increases dramatically when queries are correlated with each other. In addition, for a dataset of size n, the Laplace mechanism can only answer, at most, a number of queries sub-linear in n at a certain level of accuracy [59]. To simplify the problem, most papers on the interactive setting assume that queries are independent of each other. These weaknesses make the Laplace mechanism impractical in scenarios that require answering large numbers of queries. New mechanisms are required.
To fix the weaknesses of the Laplace mechanism, researchers have concerned themselves with new mechanism designs, aiming to publish various types of data with limited noise. Table 3.3 summarizes the problem characteristics of differentially private data
Fig. 3.2 Transformation: the original dataset D (records x1, ..., xn) is transformed into a new structure A (rows A1, ..., Az) before answering users' queries.
In the new structure, f2 can be answered by the second row with a sensitivity equal to 1, and we can get the result for diabetes from age 60 to 79 independently from the first row in Table 3.2. As both results are independent of each other, the result of f1 can be answered by the result of f̂2 and the first row. The total noise for the two queries will be 3 × Lap(1/ε), which is lower than with the non-interactive Laplace mechanism. In this example, the new structure is used to decompose the correlation between queries, so the sensitivity can be decreased as well. The challenge is to find a new structure. In this example, the new structure is a meaningful frequency table; however, in most cases, the new structure is meaningless and only used to decrease the sensitivity.
• Partition of dataset: the original dataset is divided into several parts to decrease noise. In the above example, suppose we need to answer f1 with Table 3.2; the noise Lap(1/ε) needs to be added twice: once to the first row and once to the second row. The total added noise will be 2 × Lap(1/ε). If we partition the dataset in another way, for example, arranging the age range as 40–79, the total noise decreases to Lap(1/ε). The challenge is how to design the partition strategy for multiple queries.
• Query separation: query separation assumes that a query set can be separated into
several groups, and that some queries can be answered by reusing the answers to
other queries.
Interactive settings operate on various aspects of the input data, including transac-
tions, histograms, streams and graph datasets. In the following sections, we discuss
publishing scenarios involving these types of input data.
4.1.1 Laplace
Dwork et al. [68] proposed the Laplace mechanism to publish transaction datasets
in their initial work. As mentioned earlier, this mechanism can efficiently answer all
types of real-valued queries, but the maximum number of queries is limited to
sub-linear in the size of the dataset. Dinur et al. [59] proved that answering substantially
more than a linear number of subset sum queries with error o(n^(1/2)) yields blatant
non-privacy. This lower bound implies that for a dataset of size n, the Laplace
mechanism can only answer at most a sub-linear number of queries at a certain level of
accuracy. Otherwise, adversaries can reconstruct a 1 − o(1) fraction of the original
database.
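As a concrete illustration of the calibration described above, the following sketch (Python with NumPy; the data and parameter values are purely illustrative) adds Laplace noise with scale sensitivity/ε to a count query, whose global sensitivity is 1:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to a query answer."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has global sensitivity 1: adding or removing one record
# changes the count by at most 1.
ages = [23, 45, 67, 34, 71, 52]
true_count = sum(1 for a in ages if a >= 60)
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Each additional query answered this way spends more of the privacy budget, which is exactly why the number of queries is limited.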
4.1.2 Transformation
The goal of query separation is to design a separation strategy for given types of
queries to decrease noise. Roth [195] presented the median mechanism and found
that, among any set of m queries, there are O(log m · log |X|) queries that can
determine the answers of all other queries. Based on this observation, all queries
are separated into hard and easy queries. Hard queries are answered directly
by the Laplace mechanism, while easy queries are answered by the median values
of the hard query results. Therefore, easy queries do not consume any privacy budget.
By separating the queries, the median mechanism can answer exponentially many
more queries with acceptable accuracy; however, it is inefficient, with a time
complexity exponential in the dataset size n.
4.1.4 Iteration
Hardt et al. [92] proposed private multiplicative weights (PMW), which considers
datasets as a histogram with positive weight on each bin. By updating the weights,
PMW constructs a histogram sequence to answer a set of queries. Specifically, the
initial histogram x0 was set as a uniform distribution over the domain. The mech-
anism then maintained a sequence of histogram x0 , x1 ,. . . , xt in t iterations, which
gave increasing approximation to the original histogram x. After the parameters
have been calibrated for complexity and accuracy, this mechanism p is able to answer
each query with a sampling error approximately to O..log m/= n/. This means that
the sampling error grows logarithmically with an increase in the number of queries
being answered, while the Laplace mechanism’s error is linear, increasing by m. In
addition, PMW can accurately answer an exponential number of queries.
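The PMW update loop can be sketched as follows. This toy version spends noise only on "hard" queries, for which the synthetic histogram is far from the truth, and answers "easy" queries from the synthetic histogram for free; the learning rate `eta` and `threshold` are illustrative values, not the calibrated parameters from Hardt et al.:

```python
import numpy as np

def pmw(true_hist, queries, epsilon, eta=0.1, threshold=0.05, seed=0):
    """Toy sketch of the PMW update loop. `true_hist` is the (private)
    normalized histogram; each query is a 0/1 vector whose answer is its
    dot product with the histogram."""
    rng = np.random.default_rng(seed)
    x = np.full(len(true_hist), 1.0 / len(true_hist))  # uniform x0
    answers = []
    for q in queries:
        noisy = q @ true_hist + rng.laplace(scale=1.0 / epsilon)
        synth = q @ x
        if abs(noisy - synth) > threshold:          # "hard" query: update x
            sign = 1.0 if noisy > synth else -1.0
            x = x * np.exp(eta * sign * q)          # multiplicative weights
            x = x / x.sum()                         # renormalize
            answers.append(noisy)
        else:                                       # "easy": answer from x
            answers.append(synth)
    return answers, x
```

Because the number of hard rounds can be bounded, exponentially many queries can be served from the evolving synthetic histogram.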
Similarly, Gupta et al. [88] presented a general iteration framework termed
iterative database construct (IDC), which can implement other release mechanisms.
In each round of the iteration, when a significant difference
between the current dataset and the original dataset is witnessed for a given query,
the mechanism updates the current dataset for the next round. The update function
was defined by the Frieze/Kannan low-rank matrix decomposition algorithm. The
effectiveness of the framework was evaluated by cut queries on a social network graph
dataset. IDC is a more general framework that can incorporate various other
mechanisms, including PMW and the median mechanism.
4.1.5 Discussion
The transformation and query separation mechanisms can answer more queries
than the Laplace mechanism under a fixed privacy budget. However, they place some
very restrictive criteria on the dataset or the queries. For example, the transformation
mechanism requires the dataset to have some special properties, and the query separation
mechanism assumes the queries can be divided into distinct types. These criteria make
these mechanisms impractical.
The iteration mechanism makes few assumptions about the dataset and has several
advantages over the various publishing mechanisms mentioned above: it
decreases the noise when confronted with many queries; it has a lower running time
in practice for low-dimensional datasets; and it can be easily implemented and performed
in parallel on datasets. Many subsequent works therefore followed this mechanism. For
example, Ullman [222] extended PMW to convex minimization to answer exponentially
many convex minimization queries. Based on IDC, Huang et al. [103]
presented an efficient query on distances defined over an arbitrary metric. However,
since the iteration mechanism uses a histogram to represent the dataset, it cannot
be applied to complex or high-dimensional datasets.
Even though most mechanisms focus on count queries, significant attention has also
been paid to specific types of queries, such as conjunction queries, cut queries,
distance queries [103], range queries, and halfspace queries. It turns out that, excluding
the Laplace mechanism, the other mechanisms can answer large numbers of queries
(exponential in n) with acceptable accuracy.
However, these mechanisms suffer from a serious problem in computational
efficiency. Most works have an exponential running time, which implies that
computational efficiency needs to be sacrificed for acceptable accuracy. This
conclusion was confirmed by Dwork et al. [66], who claimed that a computationally
inefficient mechanism can accurately answer an exponential number of adaptively
chosen statistical queries. Ullman [221] quantitatively analyzed the bound and
showed that a differentially private algorithm requires exponential running time to
answer n^(2+o(1)) statistical queries. Hardt et al. [94] improved the bound and showed
that there is no computationally efficient algorithm that can give valid answers to
n^(3+o(1)) adaptively chosen statistical queries from an unknown distribution.
4.2.1 Laplace
This is a direct mechanism that adds Laplace noise to the frequency of each bin
that a query covers. When a count or range query covers only a small number of
bins, this mechanism retains high utility for the query result; however, if the original
dataset contains multiple attributes, the combination of these attributes and their
related ranges of values will lead to a large number of bins. The answers to queries
then become meaningless due to the large amount of accumulated error.
Apart from the traditional Laplace mechanism, the most popular mechanism for
histogram release is partition of the dataset. As the number of bins is derived
from the partition of attribute values, one method for decreasing error is to optimize
the partition strategy. Figure 4.1a shows a histogram example. When publishing
the histogram under the constraint of differential privacy, as shown in Fig. 4.1b,
noise drawn from Lap(1/ε) has to be added to each bin. Multiple bins lead to large
noise. Figure 4.1c illustrates that if bins can be re-partitioned, for example, by
merging the age ranges [0–20] and [20–40] into [0–40], the noise will be diminished.
After that, the original frequency of bin [0–20] can be estimated by dividing the
frequency of bin [0–40], as shown in Fig. 4.1d. However, splitting a large bin into
smaller bins leads to more Laplace noise, while estimating the proportion of the
large bin's frequency may introduce estimation error. Therefore, optimizing the
partition strategy to obtain a trade-off between splitting and merging bins is a
challenge that needs to be addressed.
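The splitting-versus-merging trade-off can be seen in a small simulation; the bin ranges, frequencies, and ε below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.5
# Illustrative age histogram: bins [0-20), [20-40), [40-60), [60-80)
freqs = np.array([12.0, 14.0, 40.0, 30.0])

# Plain Laplace release: one Lap(1/epsilon) draw per bin.
noisy_all = freqs + rng.laplace(scale=1.0 / epsilon, size=freqs.size)

# Merged release: combine the two similar bins [0-20) and [20-40) into
# [0-40), noise the merged bin once, then estimate each original bin by
# dividing the merged frequency. Less Laplace noise overall, but the
# uniform split is only accurate when the merged bins really are similar.
merged = np.array([freqs[0] + freqs[1], freqs[2], freqs[3]])
noisy_merged = merged + rng.laplace(scale=1.0 / epsilon, size=merged.size)
est_bin_0_20 = noisy_merged[0] / 2.0    # estimation error enters here
```

Merging trades Laplace error for estimation error, which is exactly the balance the partition strategies below try to optimize.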
Xiao et al. [240] provided a kd-tree based partition method that generates nearly
uniform partitions to decrease the average error. Their idea is to partition the original
histogram and merge bins with similar frequencies. The average frequency will then
be close to the original frequencies, which reduces the average error. Xiao et al.
apply the kd-tree to identify bins with similar frequencies when answering
queries. A kd-tree is a binary tree in which every non-leaf node can be considered a
splitting hyperplane that divides the space into two half-spaces. Generating a balanced
kd-tree on the histogram frequencies helps to divide the original histogram into
a small number of almost uniform structures. The authors achieved more accurate
results than the Laplace mechanism.
In their subsequent work [239], Xiao et al. extended the two-dimensional
histogram into a d-dimensional (d > 2) one using the DPCube algorithm. The
authors implemented a set of query matrices to generate an optimal query strategy
on a d-dimensional histogram to test performance. When the parameters (frequency
closeness threshold, partition times) are estimated properly, the DPCube achieves a
good balance between the maximum number of queries and the introduced errors.
Fig. 4.1 Histogram publishing. (a) Histogram. (b) Noisy histogram. (c) Bin partition. (d) Estimated
frequency
When noise is added to bins, the consistency of the query results may be destroyed.
For example, the sum of two bins, A + B, may be less than the frequency of
bin A alone. To maintain the consistency of the histogram release, Hay et al. [96] defined
a constrained inference to adjust the released output. Two types of consistency
constraints were explored. The first, sorted constraints, requires query results to
be in sorted order.
4.3 Stream Data Publishing
4.3.1 Laplace
Works on continual data publishing define two privacy levels in terms of differential
privacy, namely event-level and user-level privacy [69]. The former hides a single
event while the latter masks all events of a user. User-level privacy
leads to more error than event-level privacy because a user may have several
events to hide. In practice, most works focus on event-level privacy.
There are two ways the Laplace mechanism can be directly utilized for continual
release. The first method divides the privacy budget into pieces ε' = ε/T
and allocates one to each time step t. The mechanism samples fresh independent
random noise η_t ~ Lap(1/ε') at each time step, and releases a noisy count
b̂_t = c(t) + η_t. Here, the total volume of noise is T · Lap(T/ε), which is a dramatically
large volume of noise that results in inferior utility.
The second method adds independent Laplace noise η_t ~ Lap(1/ε) to each bit σ(t)
in the stream: at each time step t, it computes a_t = σ(t) + η_t. The final noisy count
is b̂_t = Σ_{i≤t} a_i. Over a fixed total of T time steps, the introduced noise is
Σ_{i<T} η_i, which is bounded by O(√T · log(T/δ)/ε). This result is better than that
achieved by the first method, but the volume of noise is still too large.
Chan et al. [34] presented a p-sum mechanism that computes the partial sums of
consecutive bits in a stream. The results can be considered intermediate results
from which an observer can estimate the count at every time step. Laplace noise
is added to a p-sum result rather than to an individual count answer. This guarantees
an error bound of O(log^(3/2) T), reducing the noise from linear to polylogarithmic
in T.
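The p-sum idea can be sketched as follows: noise is added to partial sums over dyadic intervals, and each prefix count is assembled from at most O(log T) noisy p-sums. The per-level budget split here is a simplification, not the exact allocation of Chan et al.:

```python
import numpy as np

def tree_counts(stream, epsilon, rng=None):
    """Sketch of the p-sum (binary tree) mechanism for continual counting:
    noisy partial sums over dyadic intervals, each prefix count assembled
    from O(log T) of them."""
    rng = rng or np.random.default_rng(0)
    T = len(stream)
    levels = max(1, int(np.ceil(np.log2(T))) + 1)
    eps_level = epsilon / levels          # naive budget split per level
    cache = {}                            # reuse each noisy p-sum

    def psum(start, h):                   # dyadic interval [start, start + 2^h)
        if (start, h) not in cache:
            true = sum(stream[start:start + 2 ** h])
            cache[(start, h)] = true + rng.laplace(scale=1.0 / eps_level)
        return cache[(start, h)]

    counts = []
    for t in range(1, T + 1):
        # decompose [0, t) into dyadic intervals from t's binary representation
        total, start, rem = 0.0, 0, t
        while rem > 0:
            h = rem.bit_length() - 1      # largest power of two <= rem
            total += psum(start, h)
            start += 2 ** h
            rem -= 2 ** h
        counts.append(total)
    return counts
```

Because every bit contributes to only O(log T) p-sums, the noise per released count stays polylogarithmic in T instead of growing linearly.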
4.3.3 Iteration
4.3.4 Discussion
Even though continual data publishing is a popular topic in the data mining and
machine learning communities, there is still little research that focuses on the
privacy-preserving perspective. In fact, there are many unsolved problems in this
area. For example, the issues of how to periodically release a multi-dimensional
dataset and how to deal with other statistical queries in continual release
need further exploration.
With the significant growth of Online Social Networks (OSNs), the increasing
volumes of data collected in those OSNs have become a rich source of insight into
fundamental societal phenomena, such as epidemiology, information dissemination,
and marketing. Most OSN data are in the form of graphs, which represent
information such as the relationships between individuals. We follow the convention
of modeling an input OSN dataset as a simple undirected graph G = (V, E), where
V is the set of nodes and E ⊆ V × V is the set of edges. The number of connections
of a node is defined as its degree. Figure 4.3 shows an example graph.
Analyzing such graph data has enormous potential social benefits, but the fact
that the same graph data in OSNs might be used to infer sensitive information
about a particular individual [13] has raised concern among OSN participants. The
analysis of OSNs is a rapidly growing field, and many models require the operation
of subgraph counting, which counts the number of occurrences of a query subgraph
in an input graph. For example, the subgraph counts of k-star and triangle, as well
as the number of edges, are sufficient statistics for several popular exponential
random graph models. Moreover, many descriptive statistics of graphs are functions
of subgraph counts [113].
The privacy issue is more serious for graph data. For transaction dataset records,
at least the attributes are given, while in graph data the attributes are derived, not
given. For example, given a graph, what are the important attributes to hide? The
node degree is an obvious attribute, but there can be many others, such as the degree
pair of a link or the common neighbors between each pair of nodes. These attributes
are all derived. It is unclear which attributes attackers can use to mount an
attack, and it is extremely difficult to decide which graph attributes can be derived and
need to be hidden.
In this case, two notions of neighboring graphs can be defined in the context
of differential privacy for graph data: edge neighboring and node neighboring.
Accordingly, there are two concepts of differential privacy for graphs: (a) Edge
differential privacy means that adding or deleting a single edge between two nodes in
the graph makes a negligible difference to the result of the query. Figure 4.4 shows
edge differential privacy. (b) Node differential privacy ensures the privacy of a
query over two neighboring graphs that can differ in up to all edges connected to
one node. Figure 4.5 shows node differential privacy. If
the differential privacy mechanism is adopted for graph data, the research problem is
then to design efficient algorithms to release statistics about a graph while satisfying
the definition of differential privacy.
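For the simplest graph statistic, the number of edges, edge differential privacy is easy to achieve, since adding or deleting one edge changes the count by exactly 1; the toy graph below is illustrative. Subgraph statistics such as triangle counts are much harder, which motivates the smooth-sensitivity work discussed next:

```python
import numpy as np

def noisy_edge_count(edges, epsilon, rng=None):
    """Release |E| under edge differential privacy. The edge global
    sensitivity of the edge count is 1, so Lap(1/epsilon) noise suffices."""
    rng = rng or np.random.default_rng()
    return len(edges) + rng.laplace(scale=1.0 / epsilon)

# Toy undirected graph G = (V, E) with V = {1, 2, 3, 4}
edges = {(1, 2), (2, 3), (3, 4), (1, 4), (1, 3)}
private_count = noisy_edge_count(edges, epsilon=1.0)
```

Under node differential privacy, by contrast, one node can carry up to |V| − 1 edges, so even this simple statistic has a much larger sensitivity.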
The first differential privacy research over graph data was conducted by Nissim
et al., who showed how to estimate the number of triangles in a graph with
edge differential privacy [172]. They introduced the idea of instance-dependent
noise, and utilized the smooth sensitivity, which upper bounds the local sensitivity
tightly, especially when the number of triangles varies smoothly in a neighborhood
of the input graph. They showed how to efficiently calibrate noise in the context of
subgraph counts, based on the smooth sensitivity of the number of triangles. The
results of this technique were further extended by Karwa et al. [113] to release the
counts of k-triangles and k-stars. They achieved ε-differential privacy for k-star
counting, and (ε, δ)-differential privacy for k-triangle counting.
Rastogi et al. [190] studied the release of more general subgraph counts
under edge adversarial privacy. They considered a Bayesian adversary whose prior
knowledge is drawn from a specified family of distributions. By assuming that the
presence of an edge does not affect the presence of other edges, they computed a high-
probability upper bound on the local sensitivity, and then added noise proportional
to that bound. Rastogi et al.'s method can release more general graph statistics, but
its privacy guarantee protects only against a specific class of adversaries, and the
magnitude of noise grows exponentially with the number of edges in the subgraph.
Hay et al. [95] considered publishing degree distributions. They showed
that the global sensitivity approach can still be useful when combined with post-
processing of the released output to remove some of the added noise, and constructed an
algorithm for releasing the degree distribution of a graph under edge differential
privacy. They also proposed the notion of node differential privacy and highlighted
the challenges in achieving it.
Zhang et al. [249] claimed that if one can find an isomorphic graph whose
statistical properties are similar to those of the original graph, the isomorphic graph
can be used to generate accurate query answers. Given a subgraph S, they adopted an
exponential mechanism to search a number of isomorphic copies of S to answer subgraph
queries.
Blocki et al. [27] proceeded with a similar intuition. They showed that Lipschitz
extensions exist for all real-valued functions, and gave a specific projection from
any graph to a particular degree-bounded graph, along with a smooth upper bound on
its local sensitivity. Kasiviswanathan et al. [117] proposed efficient constructions of
such extensions for the degree distribution as well as for subgraph counts such as
triangles, k-cycles, and k-stars. These techniques can generate statistics with better
accuracy for graphs that satisfy an expected condition, but may yield poor results on
other graphs. This makes the setting of the degree bound d a difficult task.
4.4.3 Discussion
Existing methods work reasonably well for basic graph statistics under edge
differential privacy, or even under node differential privacy. However, releasing
more specific statistics, such as cuts or pairwise distances between nodes, or
handling hyper-graphs, remains an open issue.
In interactive settings, the privacy mechanism receives a user's query and replies
with a noisy answer to preserve privacy. The traditional Laplace mechanism can only
answer a sub-linear (in n) number of queries, which is insufficient in many scenarios.
Researchers have proposed different mechanisms to fix this essential weakness.
The proposed interactive publishing mechanisms improve performance in terms
of query type, the maximum number of queries, accuracy, and computational
efficiency. Upon analysis, we conclude that these measurements in interactive
publishing are associated with one another. For example, given a fixed privacy
budget, higher accuracy usually results in a smaller number of queries. On the
other hand, with fixed accuracy, a larger number of queries normally leads to
computationally inefficient mechanisms. Therefore, the goal of publishing
mechanism design is to achieve a result that balances the above mentioned
measurements. In practice, the choice of mechanism depends on the requirements of
the application.
Chapter 5
Differentially Private Data Publishing:
Non-interactive Setting
In non-interactive settings, all queries are given to the curator at one time.
The key challenge for non-interactive publishing is the sensitivity measurement:
correlation between queries will dramatically increase the sensitivity. Two possible
methods have been proposed to fix this problem: one is decomposing the correlation
between batch queries, which is presented in Sect. 5.1; the other is publishing a
synthetic dataset under the constraint of differential privacy to answer the proposed
queries. Related methods are presented in the synthetic dataset publishing sections.
5.1.1 Laplace
In this situation, Laplace noise can be added in two different ways, but both
introduce a high volume of noise. The first method directly adds noise to each query
result. The sensitivity of F is O(n^2) and the variance of the noise per answer is
O(n^4/ε^2). The second method adds noise to the frequency table, and then generates
the range query results accordingly. In this case, the sensitivity is 1, but the noise
in F accumulates more quickly than in the first method. Hence, both methods
introduce a large volume of noise and lead to inaccurate results, and how to
reduce the noise remains a challenge in batch query release.
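The second method can be sketched as follows (the frequency table, queries, and ε are illustrative); note in the last comment how the noise in a range answer grows with the number of cells the range covers:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0
freq = np.array([5.0, 8.0, 3.0, 9.0, 6.0])   # frequency table, 5 cells
ranges = [(0, 2), (1, 4), (0, 5)]            # a batch of range queries

# Second method: perturb the frequency table once (sensitivity 1), then
# answer every range query from the noisy table.
noisy_freq = freq + rng.laplace(scale=1.0 / epsilon, size=freq.size)
answers = [noisy_freq[lo:hi].sum() for lo, hi in ranges]
# A range covering L cells accumulates L independent Lap(1/epsilon)
# draws, so the noise in the answers grows with the range length.
```

The transformation mechanisms below aim to avoid both failure modes by answering the batch through a structure with low sensitivity and short reconstruction paths.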
5.1.2 Transformation
There are other mechanisms for publishing batch queries, such as partition and
iteration. Kellaris et al. [122] decomposed the dataset columns into disjoint groups
and added Laplace noise to smooth each group's average count, setting the result as
the new count of every column in the group. The final result is generated using the
new column counts. Because the maximum number of original counts in a group
affected by a single user is limited, the sensitivity of each group is decreased, and the
Laplace noise required for ε-differential privacy also diminishes. The advantage of this
mechanism is that it can limit the sensitivity in numerical datasets.
5.1.4 Iteration
By recursively approximating the true answer, the noise in the output can be
effectively diminished. Xiao et al. [235] aimed to decrease the error of the released
output. They argued that the Laplace mechanism adds noise with a fixed scale to
every query answer regardless of the value of the answer. Thus, queries with small
answers have much higher expected relative error. In practice, larger
answers can tolerate more noise, and for some applications relative errors are more
important than absolute errors.
To decrease the relative error, Xiao et al. proposed a mechanism named iReduct,
which initially obtains rough error estimates of the query answers and subsequently
utilizes this information to iteratively refine these error evaluations. The algorithm
consists of three stages: (1) it divides the privacy budget ε into two parts and utilizes
the first part to run the Laplace mechanism to answer the queries; (2) it estimates new
noise according to the noisy answers; (3) it iteratively identifies the query answers with
small values but relatively large noise and refines them until the errors can be
diminished no further. In general, iReduct takes advantage of parallel composition,
which decreases the errors of the query answers.
5.1.5 Discussion
The key problem in batch query publishing is how to decrease the sensitivity
between correlated queries. Currently, transformation is the most popular way to
tackle the problem. Current works mainly focus on range queries, and have developed
appropriate structures to answer linear combinations of those queries. More types of
structures need to be developed to answer various types of queries. Iteration and
partition of the dataset may not be effective for correlation decomposition, but they
have the potential to answer more types of queries.
5.2 Contingency Table Publishing
The contingency table is a popular data structure for data analysis in the medical
and social sciences [80]. It displays the frequencies of the combined attributes
in a dataset. Suppose D is a frequency dataset over the 2^d possible combinations of
these attributes. The curator does not normally release the entire contingency table
because when d is large, the contingency table is likely to be sparse and the noise
may outweigh the true answers of queries. Instead, the curator will release subsets
containing parts of the attributes, which are defined as k-way marginal frequency tables
(k ≤ d). One contingency table may contain several overlapping marginal frequency
tables. Privately answering marginal queries is a special case of counting queries.
For example, a query may have the form, "What fraction of individual records in D
satisfies certain properties of d1 and d2?"
Table 5.4 shows a contingency table with d = 4. Each record ri has a combination
of these four attributes, and there are 2^4 = 16 possible combinations. When k = 2,
as shown in the gray area in Table 5.4, the differential privacy mechanism only needs
to release 2^2 = 4 combinations of attributes. The chosen pairs of attributes might
be {d1, d2}, {d2, d3}, {d3, d4} and {d4, d1}. Contingency table release not only
requires the privacy of the k-way marginal frequency tables, but also the consistency
of these tables. For example, the counts of the same attributes in different marginal
frequency tables should be equal and should not violate common sense.
5.2.1 Laplace
Two methods exist to release k-way marginal frequency tables, both of which
involve the direct implementation of the Laplace mechanism. The most intuitive
method adds noise to the frequencies of the whole contingency table [118]. Users
can create any k-way marginal frequency table from the noisy contingency table
while maintaining the consistency of all tables. However, this method leads to a noise
magnitude of O(2^d). When d is large, the noise increases dramatically and
renders the perturbed results unrealistic.
The other method is to extract the marginal frequency tables from the original
dataset and add noise to the marginal frequencies. This method achieves
excellent accuracy when n is large compared to the number of marginal tables,
but it can lead to violations of consistency. For example, after adding noise, the
maximum score may be lower than the average score. Given the weaknesses of
the traditional Laplace mechanism, researchers have attempted to find new solutions
to the accuracy and consistency problems. The most popular solutions use the
iteration and transformation mechanisms.
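The two methods can be contrasted on a toy binary dataset (the counts and ε are illustrative): deriving marginals from one noisy table preserves consistency at the cost of accumulated noise, while noising each marginal directly is more accurate per table but not mutually consistent:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0
# Full contingency table over d = 3 binary attributes: 2^3 = 8 cells.
table = rng.integers(0, 20, size=(2, 2, 2)).astype(float)

# Method 1: perturb the whole table, then derive any k-way marginal.
# All marginals are consistent by construction, but each marginal cell
# sums many noisy cells, so the noise magnitude grows with 2^d.
noisy_table = table + rng.laplace(scale=1.0 / epsilon, size=table.shape)
marginal_d1_d2 = noisy_table.sum(axis=2)     # 2-way marginal over (d1, d2)

# Method 2: perturb each requested marginal directly. More accurate per
# table, but independently noised marginals need not be consistent.
direct_d1_d2 = table.sum(axis=2) + rng.laplace(
    scale=1.0 / epsilon, size=(2, 2))
```

Re-running Method 2 for an overlapping marginal, say (d2, d3), can easily produce totals that disagree with `direct_d1_d2`, which is the consistency violation described above.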
5.2.2 Iteration
Ding et al. [58] proposed differentially private data cubes to generate marginal
tables. The proposed method first organizes all possible marginal frequency tables
in a lattice. For a d-dimensional binary dataset, there are 2^d marginal
frequency tables in the lattice. The method then iteratively selects which tables
should be released. In each iteration, it traverses all 2^d tables in the lattice and
greedily selects one. Obviously, this method requires time polynomial in 2^d and
is impractical. Qardaji et al. [186] improved the above method in a practical way:
instead of generating 2^d tables, they choose sets of attributes and use them as a
synopsis from which one can reconstruct any desired k-way marginal frequency
table.
To improve accuracy, Chandrasekaran et al. [35] proposed an iterative dataset
construction mechanism, which maintains a sequence of queries, f1(D), f2(D), ...,
that give increasingly improved approximations to F(D). The algorithm is
capable of answering marginal queries with a worst-case accuracy guarantee for
datasets containing poly(d, k) records in time exp(o(d)).
5.2.3 Transformation
Anonymized dataset publishing retains the format of the original records. This
is an attractive property since, in some scenarios, people need to know the details
of the attributes to determine further analysis methods. Query-based data release cannot
achieve this goal, as users are only allowed to access the dataset by submitting queries.
Anonymized dataset publishing guarantees that the published records have
the same attributes as the original records. The problem of anonymized dataset
publishing can be stated as follows: given an input dataset D with d attributes
di, the curator would like to publish a dataset D̂ with the same attributes.
Anonymized dataset publishing has a long research history and has been
investigated heavily in the privacy-preserving community. Differentially private
anonymized dataset publishing, however, is a tough problem, as publishing a specific
record is considered a violation of differential privacy. Researchers face the
difficulty that, on the one hand, they have to release specific records, while on the
other hand they must meet the requirements of differential privacy.
To address this difficulty, Mohammed et al. [161] observed that if the anonymization
process follows the requirements of differential privacy at each step, the result
will satisfy differential privacy. Based on this observation, they proposed an
anonymization algorithm, DiffGen, to preserve privacy for data mining purposes. The
anonymization procedure consists of two major steps: partition and perturbation.
Every attribute of the original dataset is generalized to its topmost state. The
partition step then splits these attributes into more specific groups according to
attribute taxonomy trees, applying an exponential mechanism to select a candidate
for specialization. After that, the perturbation step adds noise to the true count of
each group of records.
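The exponential-mechanism selection step can be sketched as follows; the candidate specializations and their utility scores are hypothetical stand-ins for DiffGen's taxonomy-tree candidates and its actual score function:

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, sensitivity=1.0,
                          seed=None):
    """Select a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity)), the standard exponential
    mechanism used by DiffGen-style partition steps."""
    rng = np.random.default_rng(seed)
    logits = epsilon * np.asarray(scores, dtype=float) / (2.0 * sensitivity)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical candidate specializations with utility scores (e.g. the
# information gain of splitting on that attribute).
candidates = ["split Job", "split Age"]
choice = exponential_mechanism(candidates, [3.2, 1.1], epsilon=1.0)
```

Higher-utility specializations are chosen more often, but every candidate retains non-zero probability, which is what makes the selection differentially private.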
Table 5.5 shows an example on a transaction dataset. To release this dataset,
DiffGen generalizes each attribute to its topmost state according to pre-defined taxonomy
trees, which are shown in Fig. 5.1. Specifically, for the attribute Job, Fig. 5.2 shows
that all types of jobs are generalized to Any_Job and then partitioned according
to the job taxonomy tree. Similarly, all ages are generalized to [18–65) and then
split into [18–40) and [40–65). In Fig. 5.2, the root of the partition tree is
the count of fully generalized records (Age = [18–65), Job = Any_Job).
Fig. 5.1 Taxonomy tree examples. (a) Taxonomy tree of job. (b) Taxonomy tree of age
Chen et al. proposed DiffPart to publish set-valued data. The difference
between DiffPart and DiffGen is that DiffPart partitions the dataset based on
a context-free taxonomy tree, while DiffGen utilizes a taxonomy tree based on the
underlying dataset. Figure 5.3 shows an example of DiffPart, and Fig. 5.4 shows the
taxonomy tree of the itemset.
DiffPart has a computational time complexity of O(n·|I|). It publishes
datasets for frequent itemset mining purposes, so the accuracy is measured based on
the accuracy of the frequent itemset mining.
A synthetic dataset follows the distribution of the original dataset but does not
necessarily share the same format. Synthetic dataset publishing was considered a
difficult problem for a long time due to the large noise introduced, which leads to
inaccurate output. From this perspective, publishing a synthetic dataset using the
Laplace mechanism is hard.
Learning theory is one of the main tools used in differential privacy. The most
prevalent concept is Probably Approximately Correct (PAC) learning. It helps to
measure lower bounds on sample complexity and accuracy, both of which are
widely used in differential privacy to evaluate the performance of private algorithms.
Suppose a set of n samples is drawn from a distribution D over a universe X. The samples
are labeled according to a concept c : X → {0, 1} from a concept class C. The goal of the
learning process is to find a hypothesis h : X → {0, 1} that agrees with c on
almost the entire universe. To quantify the accuracy guarantee α of the learning
algorithm, a lower bound on the sample complexity is provided to evaluate how many
samples are needed to guarantee an accuracy of α.
A concept c : X → {0, 1} is a predicate that labels samples by either 0 or
1. A group of concepts is defined as a concept class C. The learning algorithm is
successful when it outputs a hypothesis h that approximates the target concept c
over samples from D. More formally,
Definition 5.1 (PAC Learning) Algorithm A is a PAC learner for a concept class
C over X using hypothesis class H if, for all concepts c ∈ C and all distributions D
on X, given an input of n samples, each ri drawn i.i.d. from D,
algorithm A outputs a hypothesis h ∈ H satisfying Pr[err_{D,c}(h) ≤ α] ≥ 1 − β.
Accuracy
Subsequent works made progress in proving accuracy. Dwork et al. [74] utilized
boosting to improve the accuracy of the released dataset, improving the accuracy
lower bound to O(√(n log |X|) · log^(2/3) m). Hardt et al. [91] combined the exponential
mechanism with a multiplicative weight updating approach to achieve a nearly
optimal accuracy guarantee.
Efficiency
Although these mechanisms provide a tight bound on accuracy, none of them can
be performed in polynomial time. The time cost is super-polynomial in the size of the
universe X and the concept class C, namely |X|^poly(n) log |C|. Blum et al. [29] claimed
that if polynomial time is required, the definition of privacy has to be relaxed. This
claim was confirmed by Dwork et al. [71], who proved that only after relaxing the
notion of ε-differential privacy to (ε, δ)-differential privacy could the run-time be
linear in the size of the data universe and the size of the query set.
Moreover, Ullman et al. [221] showed that no algorithm can answer more than
O(n^2) arbitrary predicate queries in polynomial time. The Laplace mechanism
is almost optimal among all computationally efficient algorithms for privately
answering queries. This result suggests that for private query release, it is difficult
to design mechanisms that answer arbitrary queries efficiently; only classes of
queries with a particular structure can be exploited.
Gaboardi et al. [85] presented a practical way to deal with the inefficiency
problem for high-dimensional datasets. The algorithm packages the computationally
hard step into a concisely defined integer program, which can be solved non-
privately using a standard solver. The optimization step does not require a solution
that is either private or exact: it can be solved quickly in practice by existing,
off-the-shelf optimization packages. Even though the authors do not solve the
inefficiency problem in synthetic data release, their work provides a practical way to
handle high-dimensional data.
Discussion
Table 5.7 summarizes the key works on synthetic datasets in terms of accuracy,
efficiency, and privacy level. For simplicity, the sensitivity is predefined as 1, the
dependence on β is omitted, and the run-time considers only the size of the universe.
In summary, computational learning theory extends the research work on
synthetic data release, proving it is possible to maintain an acceptable utility
while preserving differential privacy. Nevertheless, the issues of how to reduce the
computational complexity and how to support various types of queries on these
datasets remain challenges.
5.4 Synthetic Dataset Publishing 47
Neither anonymized dataset publishing nor learning theory based publishing can
effectively handle high-dimensional datasets. For the anonymized method, when
the input dataset contains many attributes, existing methods inject a prohibitive
amount of noise, which leads to inferior utility. For the learning theory based
method, the computational complexity is exponential in the dimension of the
dataset, making publication infeasible for high-dimensional datasets. One
promising way to address the high-dimensional problem is to decompose the high-
dimensional data into a set of lower-dimensional marginal datasets, along with an
inference method that infers the joint data distribution from these marginals.
Zhang et al. [248] followed this rationale and applied a Bayesian network to
deal with the high-dimensional problem. They assumed that there are correlations
between attributes; if these correlations can be modeled, the model can be used
to generate a set of marginals that simulate the distribution of the original dataset.
Specifically, given a dataset D, they first construct a Bayesian network to model the
correlations among the attributes in D. This model will approximate the distribution
of data in D using a set of low-dimensional marginals of D. After that, they inject
noise into each marginal to ensure differential privacy. These noisy marginals and
the Bayesian network are used to build an approximation of the data distribution
of D. Lastly, they sample records from the approximate distribution to construct
a synthetic dataset. The disadvantage of the solution is that it consumes too
much privacy budget in the Bayesian network construction process, making the
approximation of the distribution inaccurate.
Chen et al. [44] addressed this disadvantage by proposing a clustering based
method. They first learn the pairwise correlations of all attributes and generate
a dependency graph. Secondly, they apply the junction tree algorithm to the
dependency graph to identify a collection of attribute clusters from which to derive
all the noisy marginals. At last, they make use of the noisy marginal tables and the
inference model to generate a synthetic dataset. Compared with [248], their limited
access to the dataset in the first two steps saves privacy budget, yielding a better
result in the last step.
The interactive setting has attracted attention due to advances in statistical databases.
In interactive settings, the privacy mechanism receives a user’s query and replies
with a noisy answer to preserve privacy. The traditional Laplace mechanism can only
answer O(n) queries, which is insufficient in many scenarios. Researchers have to
provide different mechanisms to fix this essential weakness.
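As a point of reference, the interactive Laplace mechanism for a single count query can be sketched as follows; the sampling routine and dataset are illustrative, and a count query has sensitivity 1:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_count(dataset, predicate, epsilon):
    """Answer one count query under epsilon-differential privacy.
    A count query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in dataset if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
ages = [23, 35, 41, 58, 62, 29, 47]                    # illustrative records
answer = noisy_count(ages, lambda a: a >= 40, epsilon=1.0)   # true count is 4
```

Each such answer consumes part of the total privacy budget, which is why only a bounded number of queries can be answered at a fixed accuracy.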
The proposed interactive publishing mechanisms improve performance in terms
of query type, the maximum number of queries, accuracy, and computational
efficiency. Upon analysis, we conclude that these measurements in interactive
releases are associated with one another. For example, given a fixed privacy budget,
a higher accuracy usually results in a smaller number of queries. On the other hand,
with a fixed accuracy, a larger number of queries normally leads to computationally
inefficient mechanisms. Therefore, the goal of data publishing mechanism design is
to achieve a better result that can balance the above mentioned measurements. The
choice of mechanism depends on the requirement of the application.
High sensitivity presents a big challenge in non-interactive settings. Batch query
publishing methods can only publish limited types of queries. Publishing a synthetic
dataset seems appealing because, in some scenarios, people require details of the
attributes to determine further analysis methods. The research on synthetic data
publishing, however, is still in its early stages and there are many open problems
in this area. The essential problem is efficiency: given that most publishing
mechanisms need to sample datasets from the entire data universe, it is hard to
search for a suitable dataset in polynomial time.
Another problem is that synthetic dataset publishing can only publish datasets for
particular purposes. For example, if an anonymized dataset is tailored to a decision
tree algorithm, the published dataset obtains an acceptable result for decision tree
tasks, yet the method does not guarantee the performance for other types
of tasks. Learning-based methods have the same disadvantage, or worse, as they
only guarantee learning performance for a particular concept class. Publishing a
dataset for multiple purposes needs further investigation.
The third problem is dealing with high-dimensional datasets. Even though [248]
and [44] have undertaken some initial work, both consume too much of
the privacy budget when building the distribution model, making the results less
accurate than those for lower-dimensional datasets.
Chapter 6
Differentially Private Data Analysis
6.1.1.1 SuLQ
One of the limitations of SuLQ lies in its lack of the exponential mechanism.
When dealing with selection operation of an algorithm, SuLQ is unable to provide
sufficient privacy guarantees [83]. Therefore, SuLQ needs further improvement to
support diverse algorithms.
6.1.1.2 PINQ
The Privacy Integrated Queries platform (PINQ) [155] is an interface that provides
more operations than SuLQ. It applies Laplace noise to numeric queries and the
exponential mechanism to selection operations. It not only defines numeric queries such
as count (NoisyCount), sum (NoisySum), and average (NoisyAvg) with
independent noise, but also provides a Partition operation that allows queries
to execute on disjoint datasets. The Partition operation takes advantage of
parallel composition by partitioning the dataset into multiple disjoint sets; its privacy
level is determined by the maximum privacy budget consumed across these sets.
Recently, Proserpio et al. [184] extended the platform into wPINQ (weighted PINQ),
which uses weighted datasets to avoid worst-case sensitivities.
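The parallel-composition idea behind Partition can be sketched as follows; this is a plain-Python illustration, not PINQ's actual C#/LINQ API, and the function names are hypothetical:

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def partitioned_counts(dataset, key, epsilon):
    """Partition the dataset into disjoint subsets by `key`, then answer one
    noisy count per partition. Because each record falls into exactly one
    partition, parallel composition lets every count use the full epsilon;
    the overall privacy cost is the maximum spent in any one partition."""
    parts = {}
    for r in dataset:
        parts.setdefault(key(r), []).append(r)
    return {k: len(v) + laplace(1.0 / epsilon) for k, v in parts.items()}

random.seed(1)
records = [("M", 23), ("F", 35), ("M", 41), ("F", 58), ("F", 62)]
noisy = partitioned_counts(records, key=lambda r: r[0], epsilon=0.5)
```

Without the disjointness guarantee, sequential composition would instead charge the sum of the per-query budgets.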
In summary, based on these private operations, SuLQ and PINQ can create
private algorithms such as the ID3 classifier, singular value decomposition, k-means,
and statistical query learning models [28]. However, as these interfaces do not
consider the objectives and properties of various algorithms, the performance of
the analyzed results may be suboptimal.
Supervised learning refers to prediction methods that extract models describing
data classes from a set of labeled training records [108]. As one of the most popular
supervised learning algorithms, decision tree learning has been extensively studied
in Laplace/exponential frameworks.
Decision trees are built by an iterative process that recursively partitions the training
samples, with each leaf representing a class label. Assuming there is an input
dataset D with d categorical attributes {a_1, …, a_d}, a decision tree is constructed
from the root, which holds all the training records; the algorithm then chooses the
attribute a_i that maximizes the information gain to partition the records into child
nodes. The procedure is performed recursively on each subset of the training records
until a stopping criterion is met.
The first differentially private decision tree algorithm was developed on the SuLQ
platform [28]. Noise is added to the information gain, and an attribute a_i is chosen
to partition a node according to its noisy information gain. However, as the
information gain is evaluated separately for each attribute in each iteration, the
privacy budget is consumed several times per iteration, which results in a large
volume of noise. In addition, SuLQ fails to deal with continuous attributes: if those
attributes are simply discretized into intervals, the basic concept of differential
privacy is violated, because the split values of continuous attributes would reveal
information about the records.
To overcome the SuLQ platform's disadvantages, Friedman et al. [83] improved
the algorithm in two ways. First, they implemented an exponential mechanism
in the attribute selection step. The score function is defined by the information
gain or the gain ratio, so attributes with top scores have a higher probability of
being selected; in this way, less of the privacy budget is consumed than in SuLQ.
Second, the proposed method can deal with continuous attributes: an exponential
mechanism is employed to select splitting values, and the continuous attribute's
domain is divided into the resulting intervals. Compared to SuLQ, they obtain
better performance.
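The exponential-mechanism selection step can be sketched as follows, with toy information-gain scores standing in for the real score function; the sensitivity value is a placeholder:

```python
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity):
    """Select a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    keys = list(scores)
    weights = [math.exp(epsilon * scores[k] / (2 * sensitivity)) for k in keys]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for k, w in zip(keys, weights):
        acc += w
        if r <= acc:
            return k
    return keys[-1]

random.seed(0)
# Toy (non-private) information-gain scores for three candidate attributes.
gains = {"age": 0.32, "income": 0.25, "zip": 0.05}
chosen = exponential_mechanism(gains, epsilon=1.0, sensitivity=1.0)
```

A single invocation selects one splitting attribute while charging the privacy budget only once, rather than once per candidate attribute.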
Jagannathan et al. [104] provided an algorithm for building random private
decision trees, which randomly selects attributes to create nodes. The algorithm first
creates a tree in which all the leaves are on the same level and then builds a leaf
count vector. Once independent Laplace noise is added to the count vector,
a differentially private random decision tree can be generated from the noisy leaf
vector. The algorithm iteratively produces multiple random decision trees and uses
an ensemble method to combine them. As attributes are randomly selected,
this step saves privacy budget; however, as the noise in each tree is scaled up
with the number of trees in the final ensemble step, the utility of the ensemble
remains a problem.
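The leaf-count perturbation step of the random-tree approach can be sketched as follows; since each record lands in exactly one leaf, the count vector has sensitivity 1 under the usual argument (the counts below are illustrative):

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_leaf_vector(leaf_counts, epsilon):
    """Add independent Laplace(1/epsilon) noise to each leaf count.
    One record affects exactly one leaf, so the whole count vector has
    sensitivity 1 and costs a single epsilon."""
    return [c + laplace(1.0 / epsilon) for c in leaf_counts]

random.seed(0)
counts = [12, 0, 7, 3, 25, 1]          # counts at the leaves of one random tree
private = noisy_leaf_vector(counts, epsilon=1.0)
```

A private random decision tree is then read off from the noisy vector, and the process is repeated to build the ensemble.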
Rana et al. [188] proposed a practical approach to ensemble decision trees in a
random forest. They do not strictly follow the notion of differential privacy, which
requires the output distribution on neighboring datasets to remain approximately
invariant; instead, they only keep the statistical features invariant. A privacy attack
model is defined to prove the
reliability of the proposed decision tree. As less budget has been consumed in the
ensemble process, this relaxation of differential privacy can lead to higher utility
compared to other algorithms.
Discussion Differentially private decision tree algorithms were the earliest algo-
rithms to be investigated in the differential privacy community. Table 6.2 compares
6.1 Laplace/Exponential Framework 53
Therefore, the challenge for clustering is to evaluate and minimize the sensitivity of
the cluster centers.
Nissim et al. [172] used local sensitivity with k-means clustering to circumvent
the large sensitivity problem, relying on the following intuition: in a well-
clustered scenario, a point with noise should be assigned approximately the same
center as before, and moving a few "well-clustered" records would not
substantially change the centers. As such, they define a local sensitivity to measure
the record-based sensitivity of the cluster centers, which is much lower than the
traditional global sensitivity. Since the value of local sensitivity is difficult to
measure, they provide a sample-and-aggregate method to approximate it.
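A simplified sketch of the sample-and-aggregate idea, using a noisy mean of block results rather than the smooth-sensitivity aggregation of the original paper (the statistic, block count, and output range are illustrative assumptions):

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def sample_and_aggregate(data, statistic, num_blocks, epsilon, output_range):
    """Split the data into disjoint blocks, evaluate the statistic on each
    block, and privately aggregate the block results. One record influences
    only one block, so only the aggregation step drives the sensitivity."""
    random.shuffle(data)
    size = len(data) // num_blocks
    results = [statistic(data[i * size:(i + 1) * size]) for i in range(num_blocks)]
    # Each record can shift at most one block result, so the average of the
    # block results moves by at most output_range / num_blocks.
    sensitivity = output_range / num_blocks
    return sum(results) / num_blocks + laplace(sensitivity / epsilon)

random.seed(0)
points = [random.uniform(0.0, 10.0) for _ in range(1000)]
mean_est = sample_and_aggregate(points, lambda b: sum(b) / len(b),
                                num_blocks=20, epsilon=1.0, output_range=10.0)
```

The same framework applies to cluster centers: each block yields candidate centers, and the aggregation step is what must be made private.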
Feldman et al. [79] proposed a novel private clustering method by defining private
coresets. A coreset P is a small weighted set of records that captures certain geometric
properties of the original records. They use the private coreset to preserve differential
privacy for k-median queries: when calculating the sum of distances from P
to the records in some other set R, the input set R can be transformed, under the
constraints of differential privacy, into a private coreset P on which k-median queries
can easily be handled. Because the coreset P is differentially private, the clustering
algorithm performed on P also satisfies differential privacy.
Based on local sensitivity, Wang et al. [227] implemented subspace clustering
algorithms. They introduce the Laplace mechanism into an agnostic k-means clustering
algorithm, and the exponential mechanism into a Gibbs-sampling-based subspace
clustering algorithm. A subsequent paper, Wang et al. [228], adopted a Johnson-
Lindenstrauss transform to guarantee differential privacy in a subspace clustering
algorithm. The Johnson-Lindenstrauss transform reduces the dimensionality of the
dataset, making the clustering problem practical, while preserving the distances
between records.
Discussion In general, even though differentially private platforms such as
SuLQ [28], PINQ [155], PrivGene [251], and GUPT [162] can implement clustering
algorithms automatically, they all assume that the sensitivity of the cluster centers
has been predefined. Even though local sensitivity can partially solve the problem,
further reducing the sensitivity still remains a challenge. Table 6.3 illustrates current
differentially private unsupervised methods.
Frequent itemset mining aims to discover itemsets that frequently appear in a dataset
D. Suppose I is the set of items in D; an itemset refers to a subset of I. Each
record r_i ∈ D denotes a transaction that contains a set of items from I. Given an
itemset I_j, if transaction r_i contains I_j, we say r_i supports I_j, and the proportion
of supporting transactions in the dataset is defined as the support (or frequency) of I_j.
A frequent itemset is one whose support is above a given threshold.
itemset l and the number of basis sets w are large, the cardinality of the candidate set
is still too big to handle. Hence, how to efficiently decrease the size of the candidate
set is still a challenge.
Zeng et al. [246] utilized a random truncation method to decrease the number of
candidate itemsets. They propose an algorithm that randomly truncates transactions
in the dataset according to a predefined maximal cardinality. The algorithm then
iteratively generates candidate itemsets using the a priori property and perturbs the
supports of the candidate itemsets.
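The truncation idea can be sketched as follows. For brevity, the noise scale assumes sensitivity 1 per candidate; a full treatment must account for how many candidate itemsets a single truncated transaction can support. All names and parameters are illustrative:

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def truncate(transaction, max_len):
    """Randomly keep at most max_len items of a transaction, which bounds
    how many candidate itemsets any single record can support."""
    items = list(transaction)
    if len(items) <= max_len:
        return set(items)
    return set(random.sample(items, max_len))

def noisy_supports(transactions, candidates, max_len, epsilon):
    """Truncate every transaction, then report a noisy support count
    for each candidate itemset."""
    truncated = [truncate(t, max_len) for t in transactions]
    return {c: sum(1 for t in truncated if set(c) <= t) + laplace(1.0 / epsilon)
            for c in candidates}

random.seed(0)
db = [{"a", "b", "c", "d"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
supports = noisy_supports(db, [("a", "b"), ("b", "c")], max_len=3, epsilon=1.0)
```

Truncation deliberately trades a small bias in the supports for a much lower sensitivity, and hence much less noise.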
Lee et al. [134] proposed an FP-tree based frequent itemset mining algorithm.
The proposed algorithm first identifies all frequent itemsets; it does not
know their supports, only that the supports are above the threshold. Using
this information, the algorithm builds a differentially private FP-tree with privatized
supports of the frequent itemsets, injecting noise into the data structure at the
intermediate (FP-tree) step. The final output can be further refined through an
optional post-processing step. The advantage of the algorithm is that information
disclosure affecting differential privacy occurs only for count queries above the
threshold; negative answers do not count against the privacy budget.
Shen et al. [200] applied a Markov chain Monte Carlo (MCMC) sampling
method to deal with the challenge of large candidate itemsets. They claim that
an MCMC random walk method can bypass the problem, and the exponential
mechanism can then be applied to select frequent itemsets. They use this method
to mine frequent graph patterns.
Xu et al. [244] applied a binary estimation method to identify all possible
frequent itemsets and then used an exponential mechanism to privately select frequent
itemsets. Through binary search, the number of candidates is eventually reduced to a
logarithm of the original number.
Discussion Frequent itemset mining is a typical data mining task that suffers
from searching large candidate sets, and differential privacy makes the problem worse.
Table 6.4 lists the current key methods: [24, 134, 246] and [144] tried to merge
candidates into groups using different strategies, [200] adopted a sampling
method to search the candidate itemsets, and [244] applied a binary search method.
All of these methods decreased the search space from exponential to polynomial,
which is a significant achievement. However, the noise remains high and needs to
be decreased further.
The key technique utilized to select a hypothesis is ERM, in which an optimal model
is chosen by minimizing the expected loss over the training samples [38].
Suppose h ∈ H is a hypothesis and w is the output model; we define a loss
function ℓ(h(w, r), y) to estimate the expected risk of the hypothesis. The goal of
ERM is to identify a w that minimizes the expected risk over the whole universe.
Equation (6.1) shows the expected risk minimization:

R(w) = min_w E_(r,y)∼D [ℓ(h(w, r), y)].    (6.1)

Because the distribution D of the universe is unknown, we can only estimate the
empirical risk R_n(w) on the training sample D. Equation (6.2) shows the empirical risk
minimization:
R_n(w) = min_w (1/n) Σ_{i=1}^{n} ℓ(h(w, r_i), y_i) + λ(w).    (6.2)
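A minimal non-private sketch of Eq. (6.2) with a logistic loss, an ℓ2 regularizer λ(w) = (λ/2)‖w‖², and plain gradient descent; the data, step size, and regularization weight are illustrative:

```python
import math

def erm(samples, labels, lam=0.1, lr=0.5, steps=500):
    """Minimize (1/n) * sum of logistic losses + (lam/2)*||w||^2
    by full-batch gradient descent; labels are in {-1, +1}."""
    d = len(samples[0])
    w = [0.0] * d
    n = len(samples)
    for _ in range(steps):
        grad = [lam * w[j] for j in range(d)]      # gradient of the regularizer
        for x, y in zip(samples, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))   # d/d(w.x) of log(1+exp(-margin))
            for j in range(d):
                grad[j] += coef * x[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]
w = erm(X, y)   # learns a separating direction with positive components
```

Private learning then asks how to release such a w without leaking any individual training record.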
When incorporating differential privacy into the learning process, current
works apply two methods via the ERM technique: output perturbation and objective
perturbation. Output perturbation inserts noise into the output w, while
objective perturbation adds noise to the objective function prior to learning. Based
on both methods, we will discuss several learning algorithms, including logistic
regression and SVM.
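Output perturbation can be sketched as follows, using the sensitivity bound 2L/(nλ) that Chaudhuri et al. derive for λ-strongly convex, L-Lipschitz ERM; the noise vector draws its norm from a Gamma density as in their analysis, and the trained model below is a placeholder:

```python
import math
import random

def norm_exponential_noise(d, scale):
    """Sample a d-dimensional vector with a uniformly random direction and
    a norm with density proportional to r^(d-1) * exp(-r/scale),
    i.e., Gamma(d, scale), matching the density exp(-||b||/scale)."""
    norm = random.gammavariate(d, scale)
    direction = [random.gauss(0.0, 1.0) for _ in range(d)]
    mag = math.sqrt(sum(v * v for v in direction))
    return [norm * v / mag for v in direction]

def output_perturbation(w, n, lam, lipschitz, epsilon):
    """Add noise with scale 2*L/(n*lam*epsilon) to a trained model w,
    assuming the loss is L-Lipschitz and the regularizer is lam-strongly
    convex, which gives ERM an L2 sensitivity of 2*L/(n*lam)."""
    scale = 2.0 * lipschitz / (n * lam * epsilon)
    noise = norm_exponential_noise(len(w), scale)
    return [wj + bj for wj, bj in zip(w, noise)]

random.seed(0)
w_private = output_perturbation([0.8, -0.3], n=1000, lam=0.1,
                                lipschitz=1.0, epsilon=1.0)
```

The non-private trainer is untouched; only its output is randomized, which is what makes this approach simple but often noisier than objective perturbation.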
Fig. 6.5 (a) Objective perturbation diagram. (b) Objective perturbation
Chaudhuri et al. [38, 39] also presented an objective perturbation solution that adds
noise to the objective function, as illustrated in Fig. 6.5. In particular, Fig. 6.5a
shows that the noise is directly added to the learner, and Fig. 6.5b shows that the
calibration of the noise is associated with the gradient of R_n(w). The formal ERM with
objective perturbation is described in Algorithm 3 [39].
After analyzing the performance of both solutions, they conclude that when
the function in the regression algorithm is convex and doubly differentiable,
the objective perturbation outperforms the output perturbation. This is because
regularization already changes the objective to protect against over-fitting, and
changing the objective will not significantly impact performance. This claim is
confirmed in terms of a risk bound in the next subsection.
Nearly all types of learning problems can be analyzed in terms of their risk
bound. For example, most of the existing papers solve regression problems that have
a single optimal objective. Ullman [222] considered the regression problem as a
multiple objective optimization when the datasets are examined with different aims.
Multiple convex minimization is treated as a set of queries that can be answered
by a prevalent multiplicative weights method. Ullman implemented several single-
objective regression results in a multiple-objective platform, and their risk bounds are
consistent with those of the original papers.
Kasiviswanathan et al. [116] considered private incremental regression on
streaming data. They combined continual release [69] with the ERM technique to
analyze the risk bounds of several algorithms. Taking linear regression as an example,
they continuously update a noisy version of the gradient function to minimize the
MLE loss function. The risk bound depends on the dimension d and the length T of the
stream, O(min{√d, √T}), which is quite close to that of the normal private linear
regression learner when the stream length T is large.
In addition, some papers consider private learning on higher-dimensional datasets.
Kifer et al. [128] gave the first results on private sparse regression in high
dimensions. The authors designed an algorithm based on sub-sampling stability for
support recovery using a LASSO estimator. Their follow-up work [214] extended
and improved the results with an algorithm based on a sample-efficient stability
test. Jain et al. [106] proposed an entropy-regularized ERM using a
sampling technique; this algorithm provides a risk bound that has a logarithmic
dependence on d. Kasiviswanathan et al. [114] considered random projections in
ERM frameworks and provided a new private compressed learning method whose
risk bound is related to the Gaussian width of the parameter space C in the random
projection.
One of the most promising directions is deep learning, and recent research
has been devoted to the design of private deep learning mechanisms. Abadi et al. [5] applied
objective perturbation in a deep learning algorithm by defining the loss function as
the penalty for mismatching training data. As the loss function is non-convex, they
adapted a mini-batch stochastic gradient descent algorithm to solve the problem.
The noise is added to every step of the stochastic gradient descent. In another
pilot work [202], Shokri et al. designed a distributed deep learning model training
system that enables multiple parties to jointly learn an accurate neural network.
They implemented private stochastic gradient descent algorithms to achieve (ε, δ)-
differential privacy across multiple parties. In the work of Phan et al. [180],
the authors perturbed the objective functions of the traditional deep auto-encoder
and designed the deep private auto-encoder algorithm by incorporating a Laplace
mechanism.
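A single step of gradient perturbation in the spirit of Abadi et al.'s differentially private SGD can be sketched as follows: clip each per-example gradient to a norm bound, sum, add Gaussian noise scaled to that bound, and average. The constants are illustrative, and the moments-accountant bookkeeping that tracks the total (ε, δ) cost is omitted:

```python
import math
import random

def clip(gradient, c):
    """Scale a per-example gradient so its L2 norm is at most c."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]

def dp_sgd_step(w, per_example_grads, clip_norm, noise_multiplier, lr):
    """One private step: clip each example's gradient, sum them,
    add Gaussian noise scaled to the clipping norm, average, descend."""
    n = len(per_example_grads)
    d = len(w)
    summed = [0.0] * d
    for g in per_example_grads:
        for j, v in enumerate(clip(g, clip_norm)):
            summed[j] += v
    sigma = noise_multiplier * clip_norm
    noisy = [(s + random.gauss(0.0, sigma)) / n for s in summed]
    return [wj - lr * gj for wj, gj in zip(w, noisy)]

random.seed(0)
grads = [[0.5, -1.2], [3.0, 4.0], [-0.1, 0.2]]   # toy per-example gradients
w_next = dp_sgd_step([0.0, 0.0], grads, clip_norm=1.0,
                     noise_multiplier=1.1, lr=0.1)
```

Clipping bounds each example's influence, which makes the Gaussian noise calibration possible even for non-convex losses.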
6.2.2.4 Discussion
Private learning applies ERM over the hypothesis set to select an optimal
output w. It uses output perturbation and objective perturbation to ensure
that the output satisfies differential privacy. A series of works have proven that
private learning is tractable for certain learning tasks, such as linear regression,
SVM, and deep learning.
Risk bounds are highly associated with the dimension of the dataset. Current
research has decreased the dependence on dimension to O(d). Under certain
assumptions, the dimension dependence could be further relaxed.
The second problem, how many samples are needed for a bounded accuracy, is
associated with sample complexity analysis. Sample complexity interprets the
distance utility measurement in another way, showing how many samples are needed
to achieve at least a particular accuracy α. Probably approximately correct (PAC)
learning [121] helps to measure the sample complexity of learning algorithms.
Based on this theory, Blum et al. [29] and Kasiviswanathan et al. [115] proved that
every finite concept class can be learned privately using a generic construction with
a sample complexity of O(VC(C) log|X|) (omitting the other parameters). This is
a higher sample complexity than that of a non-private learner, which needs only a
constant number of samples. Improving the sample complexity is therefore an
essential issue.
The gap in sample complexity was studied in the follow-up papers and several
methods were provided, which can be categorized into the following three groups:
• the privacy requirement relaxation;
• the hypothesis relaxation;
• and semi-supervised learning.
6.2 Private Learning Framework 63
Beimel et al. [20] showed that when relaxing ε-differential privacy to (ε, δ)-
differential privacy, the sample complexity can be significantly decreased. Their
follow-up paper decreases the sample complexity to O(log √(d · 1/δ)) [206].
Another way to relax the privacy requirement is to preserve privacy only for the
labels of samples, rather than for all of their attributes. Chaudhuri et al. [37] assumed
that, apart from the labels, the attributes of samples are insensitive. They showed
that any learning algorithm for a given hypothesis set that guarantees label privacy
requires at least Ω(d₀) examples. Here, d₀ is the doubling dimension of the
disagreement metric at a certain scale; the doubling dimension of X is the smallest
positive integer d₀ such that every ball of X can be covered by 2^d₀ balls of half
the radius.
The key idea of label privacy is to decrease the dimension of the sensitive
information. It may give enough protection in a scenario where the content of the
underlying samples is publicly known except for their labels. In other scenarios,
however, the attributes may be highly sensitive; for example, in healthcare data,
the identities (attributes) of people should be protected as well as their diseases
(labels). Chaudhuri et al.'s relaxation is not applicable in such cases.
This gap can also be closed by relaxing the requirement on the hypothesis. If the
output hypothesis is selected from the learning concept class, H ⊆ C, the learning
process is defined as proper learning; otherwise, it is called improper learning. For
proper learning, the sample complexity is approximately Θ(d). By choosing improper
learning, the sample complexity can be further decreased.
Beimel et al. [18] confirmed that when selecting a hypothesis that is not in C, the
sample complexity can be decreased to a constant. Their subsequent paper [19]
proposed a probabilistic representation of C to improve the sample complexity.
They considered a list of hypothesis collections {H_1, …, H_m}, rather than just one
collection H, to represent C. The authors assumed that, when sampling H_i from
the hypothesis list, there will be an h ∈ H_i close to c with high probability. The
sample complexity can then be reduced to O(max_i ln|H_i|).
This improvement in sample complexity comes at the cost of an increased
evaluation workload, however: the learner will have to evaluate exponentially
many points that are far from the concept set C. In general, for a private learner,
if H = C, the sample complexity is O(d) and the evaluation time is constant;
if H ≠ C, the sample complexity is constant but the evaluation time is O(exp(d)).
supervised learning to active learning. The method starts with an unlabeled dataset
to create a synthetic dataset for a class C. This synthetic dataset is then used to
choose a subset of the hypotheses of size 2^O(VC(C)). In the last step, the authors
apply O(VC(C)) labeled examples to choose the target synthetic dataset according
to the hypothesis set.
In this process, the sample complexity for the labeled samples is O(VC(C)/α³),
while for the unlabeled samples it is O(d·VC(C)/α³). Comparing the two, this private
learner uses a constant number of labeled samples and O(d) unlabeled samples.
6.2.3.4 Discussion
Work in this direction has established accuracy bounds on the private learner and
tried to decrease the sample complexity within a fixed accuracy bound. Current
research narrows the sample complexity gap between the private and non-private
learning processes, which means that we can learn privately using an acceptable
number of samples.
The private learning framework is only concerned with supervised learning
algorithms, and these algorithms must be PAC learnable. These strict
constraints hinder the development of the private learning framework: it is
developing actively in theory but remains impractical for real applications.
7.1 Introduction
In recent years, deep learning has rapidly become one of the most successful
approaches to machine learning. Deep learning takes advantage of the increasing
amount of available computation to process big data. In addition, the new algorithms
and architectures being developed for deep neural networks are accelerating the
progress of various fields, including image classification, speech recognition, natural
language understanding, social network analysis, bioinformatics, and language
translation [132].
The essential idea of deep learning is to apply a multiple-layer structure to
extract complex features from high-dimensional data and use those features to build
models. The multiple-layer structure consists of neurons. The neurons in each layer
receive the outputs of a finite number of neurons in the previous layer along with
their associated weights. The aim of the training process is to adjust the weights of
these neurons to fit the training samples. In practice, stochastic gradient descent
(SGD) is one of the most popular ways to achieve this goal.
Like other machine learning models, deep learning models are susceptible to
several types of attacks. For example, a centralized collection of photos, speech,
and video clips from millions of individuals might meet with privacy risks when
shared with others [202]. Learning models can also disclose sensitive information.
Fredrikson et al. proposed a model-inversion attack that recovers images from a
facial recognition system [81]. Abadi et al. [6] assumed that the adversaries would
not only have access to the trained model but might also have full knowledge of
the training mechanism and the model parameters. Phan et al. [180] considered a
general adversarial setting in which potential privacy leaks can stem from malicious
inference on the model's inputs and outputs.
Differential privacy can be integrated with deep learning to tackle these privacy
issues. However, directly applying Laplace noise within a deep learning model
yields inferior performance for several reasons:
• High sensitivity: in deep learning training processes, the sensitivity of both the
objective function and the model’s output during perturbation is quite high.
• Limited privacy budget: iterative training processes divide the privacy budget into
several pieces, which leads to high levels of noise in the final result.
Several possible solutions to these challenges have been explored. For example,
Abadi et al. [6] clipped the objective function to bound its sensitivity and applied
a moments accountant method to form an optimal privacy composition.
Phan et al. [180] applied a functional mechanism to perturb the objective
function and decrease the noise. In general, the current research on differentially
private deep learning can be classified according to three criteria, as shown in
Table 7.1:
Its adversarial setting. Shokri et al. [202] and Abadi et al. [6] assumed that
adversaries would not only have access to the trained model but would also have
full knowledge of the training mechanism and the model's parameters. In contrast,
Phan et al. [180] considered a more general adversarial setting, where potential
privacy leaks might arise from malicious inference on a model's inputs and
outputs.
Its system setting. In a pilot study [202], Shokri et al. designed a distributed deep
learning model training system that enables multiple parties to jointly learn an
accurate neural network. In contrast, both Abadi et al. [6] and Phan et al. [180]
considered a centralized system setting where the data are held centrally.
The privacy guarantee method used. The methods for guaranteeing differential
privacy can be classified into two types: the first adds noise to the execution
process of the optimization algorithm, while the second perturbs the objective by
adding differentially private noise to the objective functions before the learning
procedure. Shokri et al. [202] designed a differentially private SGD algorithm by
introducing a sparse vector technique [73], while Abadi et al.'s [6] SGD approach
relied on a Gaussian mechanism. Phan et al. [180] perturbed the objective
functions of a conventional deep auto-encoder and designed a deep private
auto-encoder algorithm that incorporates a functional mechanism.
7.2 Preliminary
Based on artificial neural networks and the rapid development of cloud computing
and big data techniques, deep learning aims to extract nonlinear features and func-
tions from massive data to train complex models and their numerous parameters.
The main difference between deep learning and traditional machine learning is that
the former involves learning feature representation, while the latter always leverages
hand-designed features. Thus, the two main challenges in deep learning are: how to
automatically learn the values of the parameters, or weights, of each neuron from
the training data; and how to minimize the objective functions of the neural network
with respect to the desired outputs, such as classification labels. This process can be
interpreted as the right-hand side of the flowchart shown in Fig. 7.1, which provides
a high-level schematic of how each layer works.
The circles in each layer represent neurons. Each neuron receives a finite number
of outputs from the neurons in the previous layer, along with their associated
weights. A neuron's output is calculated by applying a nonlinear activation function
to all the input values. Figure 7.2 shows a neuron's inputs and its output. Suppose
the inputs of a neuron are $x = (x_1, x_2, x_3, x_4)$ and the weights are
$w = (w_1, w_2, w_3, w_4)$; the output of the neuron is $a(\sum_i w_i x_i)$,
where $a(\cdot)$ is the activation function.
The activation function is normally a nonlinear function that transforms the
weighted sum of the inputs. The most popular activation functions are the sigmoid,
tanh, and rectified linear functions.
• Sigmoid function: $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
• Tanh function: $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
• Rectified linear function:
$$\mathrm{rec}(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0. \end{cases}$$
The tanh function is a re-scaled version of the sigmoid function with an output
range of $[-1, 1]$ instead of $[0, 1]$. The rectified linear function is a piece-wise linear
function that saturates at exactly 0 whenever the input $x$ is less than 0.
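The three activation functions and the weighted-sum neuron output described above can be sketched in a few lines of Python (the function names are illustrative, not from the book):

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # tanh(x) = 2 / (1 + e^(-2x)) - 1, a re-scaled sigmoid with output in (-1, 1)
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def rec(x):
    # rectified linear: saturates at exactly 0 for negative inputs
    return x if x > 0 else 0.0

def neuron_output(w, x, activation=sigmoid):
    # a neuron applies its activation to the weighted sum of its inputs: a(sum_i w_i * x_i)
    return activation(sum(wi * xi for wi, xi in zip(w, x)))

print(neuron_output([0.5, -0.25], [1.0, 2.0], activation=rec))  # prints 0.0
```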
In designing multiple-layer structures, the weights $w$ are derived from
an objective function $J(w, x)$. Finding $w$ is an optimization process that yields an
acceptably small loss $J(w, x)$. There are several possible definitions for the function
$J(w, x)$. The most prevalent one is the average of the loss over the training samples
$\{x_1, x_2, \ldots, x_n\}$, where $J(w) = \frac{1}{n}\sum_i J(w, x_i)$. As $J(w, x)$ is normally a non-convex
function, a gradient descent method is applied to estimate $w$.
The following sections briefly describe SGD and deep auto-encoders, two repre-
sentative concepts in this field. SGD, presented in Sect. 7.2.2, plays an important
role in most deep learning optimization algorithms. Deep auto-encoders are a fun-
damental part of deep learning model structures and are presented in Sect. 7.2.2.1.
$$w_{t+1} = w_t - \alpha_t g_S, \qquad g_S = \frac{1}{|S|} \sum_{x \in S} \nabla_w J(w, x), \qquad (7.1)$$
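The update in Eq. (7.1) can be sketched directly in Python; the function names and the toy one-parameter objective below are illustrative, not from the book:

```python
def sgd_step(w, batch, grad_fn, lr):
    """One SGD update: w_{t+1} = w_t - lr * g_S,
    where g_S averages the per-sample gradients over the batch S."""
    g = [0.0] * len(w)
    for x in batch:
        gx = grad_fn(w, x)                 # nabla_w J(w, x)
        for j in range(len(w)):
            g[j] += gx[j] / len(batch)
    return [wj - lr * gj for wj, gj in zip(w, g)]

# Toy objective J(w, x) = (w[0]*feature - target)^2 for samples x = (feature, target)
grad = lambda w, x: [2.0 * x[0] * (w[0] * x[0] - x[1])]
w = [0.0]
for _ in range(100):
    w = sgd_step(w, [(1.0, 2.0), (2.0, 4.0)], grad, lr=0.05)
print(round(w[0], 4))  # converges towards 2.0
```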
From the hidden layer to the output layer, the decoder uses the weight vector $w$.
The auto-encoder model is trained to minimize the difference between the input
and the output. There are two types of loss functions. For a binary input $x$, the
auto-encoder minimizes the negative log-likelihood of the reconstruction, as shown
in Eq. (7.4):
$$J(w, x) = -\log \Pr(x \mid \bar{x}, w) = -\sum_{i=1}^{n} \big( x_i \log \bar{x}_i + (1 - x_i) \log(1 - \bar{x}_i) \big). \qquad (7.4)$$
This loss function actually measures the cross-entropy error of a binomial problem.
When the inputs are real values, the loss function normally measures the sum
of the squared Euclidean distances between the output and input values, as shown
in Eq. (7.5).
7.3 Differentially Private Deep Learning 73
$$J(w, x) = \frac{1}{2} \sum_i (\bar{x}_i - x_i)^2. \qquad (7.5)$$
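The two loss functions in Eqs. (7.4) and (7.5) can be sketched in Python; the helper names are illustrative, and the reconstruction $\bar{x}$ is assumed to be given:

```python
import math

def cross_entropy_loss(x, x_bar):
    # Eq. (7.4): negative log-likelihood for a binary input x and reconstruction x_bar
    return -sum(xi * math.log(ri) + (1.0 - xi) * math.log(1.0 - ri)
                for xi, ri in zip(x, x_bar))

def squared_loss(x, x_bar):
    # Eq. (7.5): half the squared Euclidean distance for real-valued inputs
    return 0.5 * sum((ri - xi) ** 2 for xi, ri in zip(x, x_bar))

print(round(squared_loss([1.0, 0.0], [0.8, 0.2]), 6))  # 0.5 * (0.04 + 0.04) = 0.04
```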
For the binomial prediction task, the cross-entropy error over the output labels $Y_T$ is
$$E(Y_T, \theta) = -\sum_{i=1}^{|Y_T|} \big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big]. \qquad (7.6)$$
The common interests and permission requirements of multiple parties demand that
the privacy preserving goals of deep learning include the protection of both the
training datasets and the training parameters. The main methods for guaranteeing
differential privacy in deep learning models are: (1) adding noise to the execution
process of an existing optimization algorithm; or (2) perturbing the objective
functions of the given optimization problem. A differentially private SGD algorithm
is representative of the former method, while the latter method typically uses a deep
private auto-encoder. This section briefly reviews the rationales for these related
works.
One straightforward way to protect the privacy of training data against leaks is
to simply add noise to the trained parameters resulting from the learning algorithm.
That is, the privacy-enhancing techniques are not applied to the internal training
process but rather to its outputs. However, due to the difficulties in fine-grained noise
analysis when adopting this approach, adding too much noise would further distort
the utility of the learning algorithm. Motivated by this problem, the extant literature
has presented several sophisticated approaches that aim to guarantee the differential
privacy of training data during the training process. As an essential component of the
training process, the SGD algorithm has also been enhanced by various differentially
private techniques. The following sections first present the basic Laplace method,
then private SGD, the private auto-encoder, and distributed private SGD.
The most direct method is to add noise to the SGD process. Laplace noise can be
added in each iterative gradient descent round $t$, as shown in Eq. (7.7):
$$\hat{g}(x_i) = \nabla_{w_t} J(w_t, x_i) + \mathrm{Lap}(\Delta J / \epsilon_t). \qquad (7.7)$$
The weights are then updated using the noisy gradient:
$$\hat{w}_{t+1} = \hat{w}_t - \alpha_t \hat{g}_t(w; x_i). \qquad (7.8)$$
Algorithm 2 shows the procedure for a basic Laplace method to train a model
with weight set $w$ by minimizing the empirical objective function $J(w, x)$. The
weight set is initialized with a set of random numbers in Step 1. Supposing there are
a total of $T$ rounds in this iterative gradient descent model, the privacy budget is
divided into $T$ pieces in Step 2. Step 3 takes a random batch sample set $S_t$ from
the training set. For a single sample $x_i \in S_t$, the gradient $g_t(x_i)$ of the function
$J(w_t, x_i)$ is computed in Step 4. Step 5 adds Laplace noise to $g_t(x_i)$ based on the
sensitivity of $J(w_t, x_i)$ and the privacy budget $\epsilon_t$. In the descent Step 6, the weights
are estimated from the weights of the last round and the noisy gradient $\hat{g}(x_i)$. After
all rounds are finalized, the weight set $\hat{w}$ is determined by the weights of the last round.
Clearly, Step 5 guarantees that the output result $\hat{w}$ satisfies $\epsilon$-differential privacy.
However, as mentioned in Sect. 7.1, this basic Laplace method produces an inferior
learning model based on $\hat{w}$. The noise $\mathrm{Lap}(\Delta J/\epsilon_t)$ is quite large, as there is no a
priori bound on the size of the gradients, which leads to a high $\Delta J$. In addition,
due to the vast number of iterative rounds in the gradient process, $\epsilon_t$ is quite small,
which leads to a high level of noise. If there were 1000 iterations and the total privacy
budget for each sample were set to 1, each iteration would only be allocated 0.001 of
the privacy budget. Therefore, applying the basic Laplace method to deep learning
models is impractical in real-world cases. However, researchers have proposed
various methods to improve the quality of these models.
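The basic Laplace procedure can be sketched as follows. This is an illustrative toy (one sample per round, hypothetical function names), not the book's Algorithm 2 verbatim; note how a total budget of 1 spread over 1000 rounds leaves only $\epsilon_t = 0.001$ per round, so the noise scale $\Delta J/\epsilon_t$ becomes huge:

```python
import math
import random

def laplace_noise(scale):
    # inverse-CDF sampling from Lap(0, scale)
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def basic_laplace_sgd(w0, data, grad_fn, lr, epsilon, sensitivity, rounds):
    eps_t = epsilon / rounds      # Step 2: split the total budget over T rounds
    w = list(w0)
    for _ in range(rounds):
        x = random.choice(data)   # Step 3: sample from the training set
        g = grad_fn(w, x)         # Step 4: per-sample gradient
        # Step 5: perturb the gradient with Lap(sensitivity / eps_t)
        g_hat = [gj + laplace_noise(sensitivity / eps_t) for gj in g]
        # Step 6: descend using the noisy gradient
        w = [wj - lr * gj for wj, gj in zip(w, g_hat)]
    return w
```

With `epsilon=1.0` and `rounds=1000`, each noise draw has scale `sensitivity / 0.001`, which swamps the true gradients and illustrates why this basic method performs poorly.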
Abadi et al. [6] extended the basic Laplace method and proposed a private SGD
algorithm. Similar to the basic Laplace mechanism, noise is added to the gradients
of the objective function $J(w, x)$. However, three additional techniques improve the
performance of the output model:
• Norm clipping of the gradients to bound the sensitivity of $J$.
• Grouping several batches together, then adding noise to the group.
• Applying a moments accountant to the objective functions to form an optimal
privacy composition.
A brief outline of their algorithm is presented before analyzing the above three
techniques in detail.
Algorithm 3 presents the differentially private SGD algorithm. It has some
differences from Algorithm 2. First, Algorithm 3 includes several new parameters,
including the noise scale $\sigma$, the group size $L$, and the gradient norm bound $C$. These
parameters are used when adding noise, grouping batches, and norm clipping,
respectively. Similar to the basic Laplace method, the private SGD method initializes
$w_0$ with random values in Step 1. However, in Step 2, instead of using a batch of
samples $S_t$, a larger sample group $L_t$ is sampled. This can be considered as a merger
of several batch samples. The aim of Step 2 is to decrease the total noise added to
the private SGD method. The gradient computation in Step 3 is similar to the basic
Laplace method. In Step 4, the private SGD method clips the gradient with a constant
$C$, aiming to bound the sensitivity of $g_t$. Step 5 adds calibrated Gaussian noise to $g_t$,
and the weights are calculated using the noisy version of $g_t$ in Step 6. Finally, in
addition to the output of the weights in Step 7, the privacy loss of the method is
derived using a privacy accounting technique in Step 8.
One of the challenges in differentially private deep learning is the high sensitivity
of the objective function. To bound the influence of each individual sample on $g(x)$,
which is the gradient of $J(w, x)$, the private SGD method clips each gradient in $\ell_2$
norm using a predefined threshold $C$, as shown in Eq. (7.9):
$$\|g(x)\|_2 = \begin{cases} \|g(x)\|_2 & \text{for } \|g(x)\|_2 \le C \\ C & \text{for } \|g(x)\|_2 > C. \end{cases} \qquad (7.9)$$
In the norm clipping step, the gradient vector $g_t(x_i)$ in round $t$ is replaced
by $g_t(x_i)/\max(1, \|g_t(x_i)\|_2/C)$, so the sensitivity of $g_t(x_i)$ is bounded by $C$: $\Delta_2 g = C$.
It is worth noting that in a multi-layer neural network, each layer can be set to a
different clipping threshold $C$.
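The clipping rule amounts to rescaling each gradient by $\max(1, \|g\|_2/C)$; a one-function Python sketch (the name is illustrative):

```python
import math

def clip_gradient(g, C):
    # replace g by g / max(1, ||g||_2 / C), bounding its L2 norm by C (Eq. 7.9)
    norm = math.sqrt(sum(gj * gj for gj in g))
    factor = max(1.0, norm / C)
    return [gj / factor for gj in g]

print(clip_gradient([3.0, 4.0], C=1.0))  # norm 5 is clipped: [0.6, 0.8]
```

Gradients already within the bound pass through unchanged, so the clipping only affects outlier samples.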
The proposed private SGD method applies a Gaussian mechanism with a sensitivity
equal to $C$. According to the Gaussian mechanism defined in Chap. 2, each lot in
private SGD preserves $(T\epsilon, T\delta)$-differential privacy when $\sigma = \sqrt{2\ln(2/\delta)}/\epsilon$.
As each lot is sampled with probability $L/n$, the final output preserves $(TL\epsilon/n, TL\delta/n)$-
differential privacy. This forms a loose bound for the output in terms of sequential
composition.

However, to improve the bound on the privacy loss, the private SGD method
introduces a stronger accounting method, the moments accountant, which improves the
bound to $(O(\sqrt{T}L\epsilon/n), \delta)$-differential privacy for an appropriately chosen noise scale
and clipping threshold $C$.
Theorem 7.1 There exist constants $c_1$ and $c_2$ such that, given the sampling probability
$q = L/n$ and the number of iterative rounds $T$, for any $\epsilon < c_1 q^2 T$, the private SGD is
$(\epsilon, \delta)$-differentially private for any $\delta > 0$ if we choose
$$\sigma \ge c_2 \frac{(L/n)\sqrt{T \log(1/\delta)}}{\epsilon}. \qquad (7.10)$$
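To build intuition for Eq. (7.10), the bound can be evaluated numerically. The sketch below treats the unspecified constant $c_2$ as 1 purely for illustration, and the lot size, dataset size, and budget figures are invented:

```python
import math

def noise_scale_lower_bound(L, n, T, delta, epsilon, c2=1.0):
    # sigma >= c2 * q * sqrt(T * log(1/delta)) / epsilon, with q = L/n (Eq. 7.10)
    q = L / n
    return c2 * q * math.sqrt(T * math.log(1.0 / delta)) / epsilon

# e.g. lots of 600 samples out of 60,000, T = 10,000 rounds, delta = 1e-5, epsilon = 2
print(round(noise_scale_lower_bound(600, 60000, 10000, 1e-5, 2.0), 2))
```

The small sampling probability $q$ is what keeps the required noise scale moderate even for many rounds $T$.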
Phan et al. [180] perturbed the objective functions of a traditional deep auto-encoder
and designed a deep private auto-encoder algorithm that incorporates a Laplace
mechanism.

Phan et al.'s [180] deep private auto-encoder algorithm perturbs the loss func-
tion $J(w, x)$. But, unlike the basic Laplace method or the previously mentioned private
SGD method, this deep private auto-encoder algorithm applies a functional mecha-
nism [252] to perturb $J(w, x)$. The proposed deep private auto-encoder model
is used to make binomial predictions.
Algorithm 4 shows the key steps of the deep private auto-encoder algorithm.
The first step applies a Taylor expansion to approximate the loss function $J(w, x)$,
and the Taylor expansion is then perturbed using a functional mechanism in Step
2. After creating the perturbed loss function $\hat{J}(w, x)$, Step 3 computes the weight
vector using gradient descent. Step 4 stacks private auto-encoders to construct a
deep private auto-encoder. Here, the previous hidden layer can be considered as the
input of the next auto-encoder. The first four steps aim to create a deep private auto-
encoder model. Step 5 focuses on a binomial prediction problem in the top layer,
which includes a single binomial variable to predict, $y$, and applies a loss function
based on the cross-entropy error $E(\theta)$. The function $E(\theta)$ is also perturbed by the
functional mechanism. The final step computes $\hat{\theta}$ by minimizing $\hat{E}(\theta)$.
Figure 7.4 shows the resulting structure of the deep private auto-encoder model.
It is worth noting that, before stacking multiple hidden layers, a normalization layer
is inserted after each hidden layer. The normalization layer guarantees that all data
assumptions hold and that differential privacy will be satisfied when training the
next hidden layer.
The deep private auto-encoder algorithm contains two perturbations. The first
perturbs the loss function $J(w, x)$ during the training process; the second perturbs the
loss function $E(\theta)$ in the prediction process. As both perturbations apply a Laplace
mechanism, a different sensitivity, shown in Eqs. (7.11) and (7.12), respectively, is
used in each perturbation:
$$\Delta J = d\left(b + \frac{1}{4}b^2\right), \qquad (7.11)$$
$$\Delta E = |\Omega| + \frac{1}{4}|\Omega|^2, \qquad (7.12)$$
where $|\Omega|$ is the number of variables in the normalization layer. Both sensitivities are
highly related to the variables in the hidden layer, or normalization layer. Increasing
the variables in the hidden layer may increase the prediction accuracy in the learning
process; however, this may also increase the amount of noise added to the loss
functions.
In addition to the potential privacy leaks yielded by non-private algorithms, the basic
architecture of centralized deep learning systems can also cause several privacy
issues, especially when considering the concerns of everyone involved. For example,
once an individual user has contributed information to a massive dataset that is
subsequently used to develop a complex learning model, it is nearly impossible for
that person to control or influence what happens to their data. It is relatively hard for
users to completely trust companies in the big data era. Further, when multiple
companies and institutes co-operate to construct models using joint data, achieving
proprietary data privacy for all parties while jointly sharing the data greatly
compounds these difficulties. Distributed private deep learning architectures are
therefore in need of development. Fortunately, SGD algorithms can be parallelized
and executed asynchronously [55, 191, 261]. These advantages led Shokri et al. [202]
to propose a distributed selective SGD (DSSGD) protocol, in which selective
parameter sharing was the key novelty.
At the outset, all the participants involved in the DSSGD agree on a common
learning objective. A parameter server is used to maintain the settings for all the
global parameters $w^{(global)}$ and their latest values.
Each participant $i$ downloads the parameters $w^{(i)}$, replacing their local parameters
with the latest values from the server. Each participant then executes the SGD
algorithm and uploads a portion of the computed gradients $\nabla w^{(i)}$ to the parameter
server. The server collates the updates from multiple participants, maintains the
latest parameter values, and makes the results available to all participants. Pseudo-
code for the participant's side of the DSSGD protocol is shown below. Some of
the notations from the original description [202] have been replaced to maintain
consistency with the previous contents of this chapter. Note that the parameters
participant $i$ uploads to the server are selected according to one of the following
criteria:
• largest values: sort the gradients in $\nabla w^{(i)}$ and upload a $\beta_u$ fraction of them,
starting with the largest.
• random with threshold: randomly subsample the gradients with values above a
threshold $\tau$.
These selection criteria are fixed for the entire training set.
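The two selection criteria can be sketched as follows. This is a toy illustration over flattened gradient vectors; the names `beta_u` and `tau` follow the bullets above, and the function names are invented:

```python
import random

def select_largest(grads, beta_u):
    # 'largest values': indices of the beta_u fraction of gradients with largest magnitude
    k = max(1, int(beta_u * len(grads)))
    by_magnitude = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return sorted(by_magnitude[:k])

def select_random_with_threshold(grads, tau, k, rng=random):
    # 'random with threshold': randomly subsample up to k gradients above threshold tau
    candidates = [i for i, g in enumerate(grads) if abs(g) > tau]
    rng.shuffle(candidates)
    return sorted(candidates[:k])

print(select_largest([0.1, -0.9, 0.5, 0.05], beta_u=0.5))  # [1, 2]
```

Either criterion keeps the upload sparse, which is what limits how much of a participant's local information leaves the device.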
The above DSSGD protocol is only able to prevent direct privacy leaks that occur
as a result of sharing a participant’s local training data with others. Therefore,
Shokri et al. [202] also adopted a sparse vector technique (SVT) [73] to prevent
indirect privacy leaks caused by participants sharing locally updated parameters.
SVT was originally proposed as a way of answering certain queries in interactive
settings while consuming only a small portion of the privacy budget. This method is
based on the intuition that, in some situations, only the queries with answers above
a certain threshold need to be considered; those below the threshold do not. In this
way, SVT first compares a noisy query answer with a noisy threshold, then releases
an output vector that indicates whether the answer is above or below the given
threshold. Algorithm 6 shows the pseudo-code for a differentially private DSSGD
for participant i using an SVT.
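A minimal above-threshold sketch of SVT in Python; the budget split between threshold noise and query noise follows common SVT presentations and is an assumption here, not Shokri et al.'s exact instantiation:

```python
import math
import random

def laplace_noise(scale):
    # inverse-CDF sampling from Lap(0, scale)
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def above_threshold(answers, threshold, epsilon, sensitivity=1.0, max_releases=1):
    """Compare noisy answers with a noisy threshold; report which exceed it,
    stopping after max_releases positive reports."""
    out = []
    noisy_t = threshold + laplace_noise(2.0 * sensitivity / epsilon)
    released = 0
    for a in answers:
        if a + laplace_noise(4.0 * max_releases * sensitivity / epsilon) >= noisy_t:
            out.append(True)
            released += 1
            if released >= max_releases:
                break
        else:
            out.append(False)
    return out
```

In the standard analysis, only the above-threshold reports consume budget, which is what makes SVT attractive for deciding which parameters are worth uploading.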
This section briefly introduces the typical experimental methods, including bench-
mark datasets, learning objectives and computing frameworks.
There are some benchmark datasets commonly used in the differentially private deep
learning domain. These include:
• MNIST¹ [133]. This dataset consists of 60,000 training samples and 10,000
testing samples for handwritten digit recognition. Each example is a 28 × 28
image. Both [202] and [6] evaluated their proposed algorithms using this
dataset.
• CIFAR [47]. The CIFAR dataset comprises labeled subsets of the 80 million Tiny
Images² dataset. There are two versions with different numbers of image classes.
The CIFAR-10 dataset consists of 50,000 training samples and 10,000
test samples belonging to ten classes. The CIFAR-100 dataset also has 50,000
training samples, but they are separated into 100 classes. Abadi et al. [6] mainly
used the CIFAR-10 dataset to conduct their experiments and used the CIFAR-100
dataset as a public dataset to train the network.
• SVHN³ [170]. This dataset is a collection of house numbers from Google
Street View images. It contains over 600,000 samples; each sample is a 32 × 32
image with a single character in the center. Shokri et al. [202] used 100,000
samples for training and 10,000 samples for testing.
The learning objective stated in Shokri et al.'s work was to classify input data
from the MNIST and SVHN datasets into ten classes [202]. The accuracy of the
differentially private DSSGD protocol was compared to SGD and to a standalone
scenario. In Abadi et al.'s work [6], the classification accuracy of
the proposed algorithm was evaluated on a training set and a test set by varying
various parameters, such as the privacy budget, the learning rate and so on. In Phan
et al.'s work [180], the learning objective was to execute the proposed deep private
auto-encoder for a binomial classification task on health data from a social network.
The prediction accuracy of the compared algorithms was evaluated by varying the
dataset's cardinality and the privacy budget.
1. http://yann.lecun.com/exdb/mnist/
2. http://groups.csail.mit.edu/vision/TinyImages/
3. http://ufldl.stanford.edu/housenumbers/
7.5 Summary
4. https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
5. https://github.com/torch/nn
6. https://www.tensorflow.org
7. http://blog.revolutionanalytics.com/2016/08/deep-learning-part-1.html
Chapter 8
Differentially Private Applications: Where to Start?
Many differentially private applications have been proposed in recent years. The var-
ious steps that can be followed when solving a privacy preservation problem for a
particular application are shown in Fig. 8.1. The dark boxes in the flowchart show
the steps, and the orange boxes illustrate the possible choices. First, it is necessary
to identify the scenario: data publishing or data analysis. Data publishing aims to
release answers to queries, or entire datasets, to public users, whereas data analysis
normally releases a private version of a model. Because private learning frameworks
solve privacy preservation problems using optimization, an optimization objective
normally has to be determined. The second step is identifying the challenges in the
application. Although differential privacy is considered to be a promising solution
for privacy preservation issues, its implementation in some applications still presents a
number of challenges. These challenges, and their possible solutions, are introduced
in the next subsection.
The third step is selecting the most suitable mechanisms. Some major mecha-
nisms, such as transformation and iteration, have already been presented in Chap. 3.
In the real world, a curator can develop more useful mechanisms based on the
specific requirements of the application. Once the solution, or algorithm, for a
problem has been determined, a utility and privacy analysis are required. As men-
tioned in Chap. 2, the most popular utility measurement is an error measurement,
which evaluates the distance between the non-private output and the private output.
Various applications may interpret these errors in different ways. For example, in
recommender systems, the error could be a mean average error, while in location-
based services, the error is normally a Euclidean distance error. The error is bounded
by the accuracy parameter $\alpha$, which can be estimated using tools like a union bound.
The most serious challenge in many applications is that queries with high sensitivity
lead to high levels of noise, which may significantly decrease the utility of
the applications. The noise added by differential privacy is calibrated according
to the sensitivity of the query. Simple queries, such as count or sum, introduce
minor noise, which has a low effect on utility. However, queries with high sensitivity
are found in many real-world applications, such as the similarity measurements in
recommender systems or the cluster center measurements in a clustering algorithm.
In these cases, deleting one record in the dataset will have a significant impact on the
similarity result. The essential challenge for differential privacy is how to decrease
the noise in queries with high sensitivity.
One of the most useful ways to deal with high sensitivity is to change the
definition of global sensitivity to be application-aware, which can create local
sensitivity. The notion of global sensitivity, which considers worst-case scenarios,
has quite a rigorous definition, but in many applications it may not be necessary
to use such strict sensitivity. If using global sensitivity generates some redundant
noise, the definition of sensitivity can be relaxed to achieve a trade-off between
privacy and utility.
One possible solution is to shrink the domain of randomization. The noise will
decrease when the range that is randomized is limited. For example, clustering is
one possible way to structure existing items into groups and limit the domains to be
randomized within each cluster.
Existing research on differential privacy assumes that the data records in a dataset are
sampled independently. However, in real-world applications, data records are rarely
independent [32]. When a traditional differentially private technique is performed on a
correlated dataset, it discloses more information than expected, which indicates a serious
privacy violation [126]. Recent research has been concerned with this new kind of
privacy violation; however, a solid solution for correlated datasets is still lacking.
Moreover, how to decrease the large amount of noise incurred by differential privacy
in a correlated dataset remains a challenge.
As advances in correlated data analysis are made, it may now be possible
to overcome these research barriers. Correlated information can be modeled by
functions or parameters that can be further defined as background information in
a differential privacy mechanism. For example, Cao et al. [32] use time intervals
and correlation analyses to identify correlated records, and they model correlated
information using inter-behavior functions. This solution may help tackle these
research barriers by incorporating the modeled information into a differential privacy
mechanism. As most records are only partially correlated, the impact
of deleting a particular record may differ among the remaining records. If those
differences can be quantified, they can be used to calibrate the noise. A new sensitivity
can be defined based on the quantified differences.
8.2.6 Summary
1. http://www.netflixprize.com
2. http://www.grouplens.org
3. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
8.3 Useful Public Datasets in Applications
The T-Drive trajectory dataset contains one week of trajectories for 10,357 taxis.
It includes around 15 million data points, with a total trajectory distance
reaching 9 million kilometers.
GeoLife is a location-based social networking service that enables users to share
their life experiences and build connections using their GPS location histories. This
dataset includes 178 users over a period of 4 years (from April 2007 to October
2011). It contains 17,621 trajectories covering a distance of 1,251,654 km, with a
total duration of 48,203 h.
Gowalla is a location-based social networking website where users check in
to share their locations. This dataset contains a total of 6,442,890 user check-ins
made between February 2009 and October 2010. The main objective behind the use
of this dataset is to understand the basic laws that govern human motion and
dynamics. This dataset is available from SNAP.
There are also several well-known websites from which geo-tagged photos can be
crawled, for example, Flickr (www.flickr.com) and Instagram (http://instagram.com).
4. http://archive.ics.uci.edu/ml/
The basic settings of an application based on the flowchart in Fig. 8.1 are illustrated
in Table 8.2. This nomenclature is used in subsequent chapters.
The first row shows the type of application, for example, location privacy, rec-
ommender systems or social networks. The nature of the input and output data is also
listed in the table. The input data, output data and utility measurements are highly
related to the application. For example, the input data of a recommender system are
normally a user-item matrix, while social network data are normally graphs. The
output data of a recommender system are normally rating predictions, while social
network output data can be statistical information or graphs. Challenges and possible
solutions vary across application types. The publishing setting can be interactive or
non-interactive, while the analysis setting can be a Laplace/exponential framework
or a private learning framework. Utility measurements are normally based on the
error measurement, which is the distance between the non-private result and the
private result. For example, in a recommender system, the utility can be measured
by the distance between the original MAE and the MAE based on the private result.
9.1 Introduction
With the significant growth of Online Social Networks (OSNs), the increasing
volumes of data collected in those OSNs have become a rich source of insight into
fundamental societal phenomena, such as epidemiology, information dissemination,
marketing, etc. Much of this OSN data is in the form of graphs, which represent
information such as the relationships between individuals. Releasing these graph
data has enormous potential social benefits. However, graph data can reveal sensitive
information about particular individuals [13], which has raised concern among
social network participants.
If the differential privacy mechanism is adopted in graph data, the research
problem is then to design efficient algorithms to publish statistics about the graph
while satisfying the definition of either node differential privacy or edge differential
privacy. The former protects the node of the graph and the latter protects the edge in
the graph.
Previous works have successfully achieved either node differential privacy or
edge differential privacy when the number of queries is limited. For example, Hay
et al. [95] implemented node differential privacy and pointed out the difficulties
of achieving it. The work in [229] proposed publishing graph datasets using a
dK-graph model. Chen et al. [41] considered the correlation between
nodes and proposed a correlated release method for sparse graphs. However, these
works suffer from a serious problem: as the number of queries increases, a
large volume of noise is introduced. This problem can be tackled by an iteration
mechanism, which is presented in this chapter.
This chapter focuses on both node and edge differential privacy. First, the chapter
presents several basic methods for node and edge differential privacy, then proposes
an iteration method to publish a synthetic graph. Specifically, given a set of queries,
an iteration method is used to generate a synthetic graph that answers these queries
accurately. The iteration process can be considered as a training procedure, in
which the queries are training samples and the synthetic graph is the output learning
model. A synthetic graph is finally generated by iterative updates to answer the set of
queries. As the training process consumes less privacy budget than the state-of-the-
art methods, the total noise is diminished. Table 9.1 shows the basic setting of
the application.
9.2 Preliminaries
Besides the count of nodes and edges in a graph, Fig. 9.2 shows several popular
subgraph queries, including triangle counting, k-triangle counting, k-stars counting,
and degree distribution.
Node differential privacy ensures the privacy of a query over two neighbouring
graphs, where two neighbouring graphs differ in one node and all edges connected
to that node. Figure 9.3 shows the neighbouring graphs in node differential privacy.
Hay et al. [95] first proposed the notion of node differential privacy and pointed
out the difficulties of achieving it. They showed that query results were highly
inaccurate for graph analysis due to the large noise.
Let us use Fig. 9.3 to illustrate the problem. When answering query $f_1$: how many
nodes are there in the graph?, the sensitivity $\Delta f_1$ equals 1, and the noise added to $f_1$
is scaled to $\mathrm{Lap}(1/\epsilon)$, which is quite low. However, when answering query $f_2$: how many
edges are there in the graph?, the sensitivity of $f_2$ equals 5, as the maximum degree
of all nodes is 5. The noise added to the $f_2$ result is quite large compared with
$f_1$.
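The comparison between $f_1$ and $f_2$ can be sketched numerically: both queries receive Laplace noise scaled to their node sensitivity, so the edge count is perturbed far more heavily. The graph statistics below are illustrative, not taken from Fig. 9.3:

```python
import math
import random

def laplace_noise(scale):
    # inverse-CDF sampling from Lap(0, scale)
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_answer(true_value, sensitivity, epsilon):
    # Laplace mechanism: answer + Lap(sensitivity / epsilon)
    return true_value + laplace_noise(sensitivity / epsilon)

epsilon = 1.0
f1 = noisy_answer(7, sensitivity=1, epsilon=epsilon)  # node count: removing a node changes it by 1
f2 = noisy_answer(9, sensitivity=5, epsilon=epsilon)  # edge count: a node of degree 5 removes 5 edges
```

With the same budget, the noise scale for $f_2$ is five times that of $f_1$, which is exactly the utility problem described above.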
The high sensitivity of node differential privacy derives from the degree of a
node. When deleting a node, the maximum change is determined by the largest
node degree in the graph. Theoretically, the sensitivity of a query over a graph $G$ will
be as large as its maximum degree.
94 9 Differentially Private Social Network Data Publishing
The truncation method transforms the input graph into a graph whose maximum
degree is below a certain threshold $\theta$ [117]. The graph $G$ is truncated to $G_\theta$ by
discarding nodes with degree greater than $\theta$. Figure 9.4 shows a truncated graph $G_3$
for the original graph $G$, in which one node with degree 5 is discarded to ensure all
nodes have degree at most 3. In this way, the sensitivity of the edge counting query
is decreased from 5 to 3.
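One simple (illustrative) way to realize such a truncation is to repeatedly drop over-degree nodes until the bound holds; Blocki et al. and Day instead process edges in a fixed order to control how much the graph changes:

```python
def truncate_graph(edges, theta):
    """Return an edge set in which every node has degree <= theta,
    by repeatedly discarding nodes whose degree exceeds theta."""
    edges = set(edges)
    while True:
        degree = {}
        for u, v in edges:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        over = {node for node, d in degree.items() if d > theta}
        if not over:
            return edges
        edges = {(u, v) for u, v in edges if u not in over and v not in over}

# a star with centre 0 and five leaves: the centre (degree 5) is discarded for theta = 3
print(truncate_graph({(0, i) for i in range(1, 6)}, theta=3))  # set()
```

The star example shows the pitfall discussed next: discarding one high-degree node can remove far more edges than expected.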
Kasiviswanathan et al. [117] showed that, given a query $f$ defined on the truncated
graph $G_\theta$, a smooth bound $S$ is necessary for the number of nodes whose degrees
may change due to the truncation. They applied Nissim et al.'s [172] $\beta$-smooth
bound $S(G)$ for local sensitivity, which was discussed in Definition 2.4 in
Chap. 2. One can add noise proportional to smooth bounds on the local sensitivity
using a variety of distributions. Kasiviswanathan et al. used the Cauchy distribution
$\mathrm{Cauchy}(\sqrt{2}S(G)/\epsilon)$. Algorithm 1 shows a typical algorithm to publish degree
distributions for a graph $G$.
9.3 Basic Differentially Private Social Network Data Publishing Methods
There are three major steps in the algorithm. The first step determines the
truncation parameter θ. As we may not know the maximum degree in the graph,
or the maximum degree may be very large, θ is normally approximated.
Therefore, the algorithm randomizes the cutoff to obtain a better bound. Given a
target parameter θ, the algorithm picks a random cutoff within a range bounded by a
constant multiple of θ. The second step creates the truncated graph by discarding
the nodes with degree greater than the cutoff. The final step adds Cauchy-distributed
noise to each entry of the degree distribution.
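The three steps can be sketched as follows. This is a minimal illustration, not the book's Algorithm 1: the randomized cutoff range, the helper names, and the concrete smooth-bound value passed in are assumptions; the Cauchy scale √2·S(G)/ε follows the description above.

```python
import math
import random

def truncate_graph(adj, theta):
    """Step 2: drop every node whose degree exceeds theta (one simple variant)."""
    keep = {v for v, nbrs in adj.items() if len(nbrs) <= theta}
    return {v: [u for u in nbrs if u in keep] for v, nbrs in adj.items() if v in keep}

def publish_degree_histogram(adj, theta_target, epsilon, smooth_bound, rng=random):
    # Step 1: randomize the cutoff within a constant multiple of the target.
    theta = rng.randint(theta_target, 2 * theta_target)
    g = truncate_graph(adj, theta)
    # Step 3: add Cauchy(sqrt(2) * S(G) / epsilon) noise to each histogram entry,
    # sampling a standard Cauchy via tan(pi * (U - 0.5)).
    hist = [0] * (theta + 1)
    for v, nbrs in g.items():
        hist[len(nbrs)] += 1
    scale = math.sqrt(2) * smooth_bound / epsilon
    return [h + scale * math.tan(math.pi * (rng.random() - 0.5)) for h in hist]

# Toy graph: node 0 has degree 4, all other nodes have degree 1.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
noisy = publish_degree_histogram(adj, theta_target=2, epsilon=1.0, smooth_bound=3.0)
```

Computing the actual β-smooth bound S(G) is the hard part and is taken here as a given input.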
Truncating G to Gθ is not easy, as deleting all nodes with degree greater than θ
may ultimately delete more nodes and edges than we expect. Blocki et al. [27]
solved this problem by selecting an arbitrary order among the edges, traversing the
edges, and removing each encountered edge that is connected to a node whose
degree is greater than θ. Day et al. [53] used a reverse approach to create the truncated
graph: they first deleted all edges and then added edges back in a pre-defined order to
achieve Gθ.
Another approach defines a Lipschitz extension via a flow graph. Each node v in G
is split into a left copy vl and a right copy vr; a source s is connected to every left
copy, and every right copy is connected to a sink t, where these edges have capacity
θ, while each edge (u, v) in E is added as (ul, vr) between the left and right copies
with capacity 1. Let vfl(G) denote the value of the maximum flow from s to
t; then vfl(G)/2 is a Lipschitz extension of an edge query f. The global node sensitivity
satisfies Δvfl(G) ≤ 2θ. Accordingly, Kasiviswanathan et al. [117] published the number of
edges by Algorithm 2.
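The flow construction can be sketched with a plain Edmonds-Karp max-flow. This is an illustrative sketch, not the book's Algorithm 2; including both orientations of every undirected edge is an assumption made so that vfl(G)/2 recovers the edge count whenever all degrees are at most θ.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Find the bottleneck along the path, then update residual capacities.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug

def vfl(edges, nodes, theta):
    """Flow value of the truncation flow graph; vfl/2 extends the edge count."""
    cap = defaultdict(lambda: defaultdict(int))
    for v in nodes:
        cap['s'][('L', v)] = theta      # source -> left copy, capacity theta
        cap[('R', v)]['t'] = theta      # right copy -> sink, capacity theta
    for u, v in edges:                  # both orientations, capacity 1
        cap[('L', u)][('R', v)] = 1
        cap[('L', v)][('R', u)] = 1
    return max_flow(cap, 's', 't')

triangle = [(0, 1), (1, 2), (0, 2)]
print(vfl(triangle, [0, 1, 2], theta=2) / 2)  # prints 3.0: the edge count
```

On the triangle every degree equals θ = 2, so the flow saturates all six unit-capacity edges and vfl/2 equals the true edge count; truncation only bites when some degree exceeds θ.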
Blocki et al. [27] proceeded with a similar intuition. They showed that Lipschitz
extensions exist for all real-valued functions, and gave a specific projection from any
graph to a particular degree-bounded graph, along with a smooth upper bound on its
local sensitivity.
The above works can only efficiently compute Lipschitz extensions for one-
dimensional functions, in which the output is a single value. Raskhodnikova
et al. [189] developed Lipschitz extensions for degree distribution queries with
multidimensional vector outputs, via convex programming. Specifically, they
designed Lipschitz extensions with small stretch for the sorted degree list and
for the degree distribution of a graph.
Chen et al. [45] proposed an iteration-based method to achieve node differential
privacy. Given a graph G and any real-valued function f, they defined a sequence of
real-valued functions 0 = f0(G) ≤ f1(G) ≤ … ≤ fm(G) = f(G) with the recursive
monotonicity property that fi(G′) ≤ fi(G) ≤ fi+1(G′) for all neighbors G and G′ and
for all i ∈ {0, 1, …, m}. They then defined a quantity Xθ to approximate the true answer
of f, where the global sensitivity of Xθ is θ: Xθ(G) = min_{i∈[0,m]} (fi(G) + θ·(n − i)),
where Xθ(G) approximates f(G) and is close to f(G) for larger θ. For a carefully chosen θ,
they output Xθ(G) via the Laplace mechanism in the global sensitivity framework,
as an approximation of the real-valued function f(G). This recursive approach can
potentially return more accurate subgraph counts for any kind of subgraph with
node differential privacy. However, constructing the sequence of functions fi(G) is
usually NP-hard, and how to implement it efficiently remains an open problem.
Edge differential privacy means that adding or deleting a single edge between two nodes
in the graph makes a negligible difference to the result of the query. The first research
on edge differential privacy was conducted by Nissim et al. [172], who showed
how to evaluate the number of triangles in a social network with edge differential
privacy. They showed how to efficiently calibrate noise for subgraph counts in terms
of the smooth sensitivity. The results of this technique were extended by Karwa
et al. [113] to release counts of k-triangles and k-stars.
Rastogi et al. [190] studied the release of more general subgraph counts under
a much weaker version of differential privacy, edge adversarial privacy, which
considers a Bayesian adversary whose prior knowledge is drawn from a specified
family of distributions. By assuming that the presence of an edge does not make the
presence of other edges more likely, they computed a high-probability upper bound on the local
sensitivity, and then added noise proportional to that bound. Rastogi et al.'s method
can release more general graph statistics, but their privacy guarantee protects
only against a specific class of adversaries, and the magnitude of noise grows
exponentially with the number of edges in the subgraph.
Hay et al. [95] considered releasing a different statistic about graphs, the degree
distribution. They showed that the global sensitivity approach can still be useful
when combined with post-processing of the released output to remove some of the added
noise, and constructed an algorithm for releasing the degree distribution of a graph
with edge differential privacy.
The proposed method is called the Graph Update method, as the key idea is to update
a synthetic graph until all queries have been answered [260]. For a social network
graph G and a set of queries F = {f1, …, fm}, the publishing goal is to release a
set of query results F̂ and a synthetic graph Ĝ to the public. The general idea is to
define an initial graph Ĝ0 and update it to Ĝm−1 in m rounds according to the m queries
in F. The released answers F̂ and the synthetic graph Ĝ are generated during the iteration.
During the process, four different types of query answers are involved in the iteration:
• True answer at: this is the real answer that a graph gives in response to a query. The true
answer cannot be published directly, as it would raise privacy concerns. The true
answer is normally used as the baseline to measure the utility loss of a privacy-
preserving algorithm. The symbol at is used to represent the true answer for a
single query f, and At = F(G) = {at1, …, atm} is applied to represent the answer
set for a query set F.
• Noisy answer an: when we add Laplace noise to a true answer, the result is
the noisy answer. The traditional Laplace method releases the noisy answer
directly. However, as mentioned in Sect. 9.1, this introduces a large amount
of noise into the released result. A single query answer is represented by an =
f̂(G) = f(G) + Lap(s/ε), and an answer set is represented by An = F̂(G) =
{an1, …, anm}.
• Synthetic answer as: this is the answer generated by a synthetic graph Ĝ. A single
query answer is represented by as = f(Ĝ), and As = F(Ĝ) = {as1, …, asm} is applied to
represent the answer set.
• Release answer ar: this is the answer finally released after the iteration. In the Graph
Update method, the released answer set consists of noisy answers and synthetic
answers. The algorithm applies ar = f̂ and Ar = F̂ = {ar1, …, arm} to represent
the single answer of a query and the answer set, respectively.
These four different query answers control the graph update process. An
overview of the method is presented in Fig. 9.7. On the left side of the figure, the
9.4 Graph Update Method 99
query set F is performed on G to get the true answer set At. Laplace noise is then added
to At to get a set of noisy answers An = {an1, …, anm}. Each noisy answer ani helps
to update the initial Ĝ0 and produce a release answer ari. The method eventually
outputs Ar = {ar1, …, arm} and Ĝm as the final results.
Compared with the traditional Laplace method, the proposed Graph Update
method adds less noise. As some queries are answered by the synthetic graph, these
query answers do not consume any privacy budget. Moreover, the synthetic graph
can be applied to predict answers to new queries without any privacy budget. Eventually,
the Graph Update method can outperform the traditional Laplace method.
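A minimal sketch of this update loop on a vector of query answers follows. The threshold test, the update rule, and the budget split ε′ = ε/m are illustrative assumptions, not the book's Algorithm 4:

```python
import math
import random

def lap(scale, rng):
    """One Laplace(0, scale) sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def graph_update(true_answers, epsilon, sensitivity=1.0, threshold=5.0, seed=0):
    """Answer each query from the synthetic state when it is already close
    to the noisy answer; otherwise release the noisy answer and update."""
    rng = random.Random(seed)
    m = len(true_answers)
    eps0 = epsilon / m                  # per-query budget epsilon' = epsilon/m
    synthetic = [0.0] * m               # initial synthetic answers (from G_hat_0)
    released, spent = [], 0.0
    for i, a_t in enumerate(true_answers):
        a_n = a_t + lap(sensitivity / eps0, rng)   # noisy answer a_n
        a_s = synthetic[i]                         # synthetic answer a_s
        if abs(a_n - a_s) <= threshold:
            released.append(a_s)                   # no budget consumed
        else:
            released.append(a_n)                   # budget consumed
            synthetic[i] = a_n                     # update the synthetic state
            spent += eps0
    return released, spent

ans, spent = graph_update([10.0, 12.0, 11.0, 50.0], epsilon=1.0)
```

The accounting mirrors the analysis below: only the rounds that release a noisy answer spend budget, so the total spent never exceeds ε.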
most of the answers in Ar are noisy answers, and more privacy budget will be consumed
in this configuration. Consequently, the choice of this parameter will have different
impacts in different scenarios.
To analyze the privacy level of the proposed method, sequential composition is
applied. For the traditional Laplace method, when answering F with m queries, ε
is divided into m pieces, one arranged to each query fi ∈ F. Specifically, we
have ε′ = ε/m and, for each query, the noisy answer is ani = fi + Lap(s/ε′).
According to sequential composition, the Laplace method preserves (ε′·m)-
differential privacy, which is equal to ε-differential privacy.
In the Graph Update method, the released answer set Ar is the combination of noisy
answers An and synthetic answers As. Only An consumes privacy budget, while As does
not. In Algorithm 4, even though Step 4 adds Laplace noise to the true answer, the noisy
result is not released directly. Only when the algorithm proceeds to Step 7, in
which an is released, does the algorithm consume privacy budget. Suppose there are
j (0 ≤ j ≤ m) queries in F answered by synthetic answers; then the algorithm preserves
((m − j)·ε′)-differential privacy. As (m − j)·ε′ ≤ m·ε′, the Graph Update method
preserves stricter privacy than the traditional Laplace method.
Error measurement is applied to evaluate the utility. The error is defined by the Mean
Absolute Error (MAE). The MAEr of the release answers Ar is defined as Eq. (9.3):

MAEr = (1/m) Σ_{fi∈F} |f̂i(G) − fi(G)|
     = (1/m) Σ_{ari∈Ar} |ari − ati|
     = (1/m) |Ar − At|.   (9.3)

Similarly, the MAEn of the noisy answers and the MAEs of the synthetic answers are defined as
Eqs. (9.4) and (9.5), respectively:

MAEn = (1/m) |An − At|;   (9.4)

MAEs = (1/m) |As − At|.   (9.5)
It is obvious that for the true answers At, the MAE is zero. MAEn represents the perfor-
mance of the traditional Laplace method. A lower MAE implies better performance.
The target of the Graph Update method is to achieve a lower MAEr under a fixed privacy
budget. A simulated figure, Fig. 9.8, illustrates the relationship between the
MAE values and the size m of the query set.
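Equation (9.3) is a one-line computation; the helper below is an illustrative sketch of it:

```python
def mae(released, true_answers):
    """Mean Absolute Error between a released answer set and the true answers."""
    assert len(released) == len(true_answers)
    m = len(true_answers)
    return sum(abs(r - t) for r, t in zip(released, true_answers)) / m

print(mae([9.0, 14.0], [10.0, 12.0]))  # (1 + 2) / 2 = 1.5
```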
In Fig. 9.8, the x axis is the size of the query set and the y axis is the MAE value.
For the noisy answers An, MAEn rises as m increases. A smooth line
represents MAEn in this simulated figure; in a real case the line
fluctuates, as the noise is drawn from the Laplace distribution. MAEs decreases
at the beginning as m increases. After reaching its lowest point,
MAEs begins to rise with m. This is because, as the graph is updated,
the synthetic graph becomes more and more accurate, so MAEs keeps
decreasing. However, as the iteration procedure is controlled by the noisy answers, the
synthetic graph can never equal the original graph, no matter how large
m is. On the contrary, as m increases further, more noise is introduced into the
iteration and the synthetic graph drifts away from the original graph.
This section evaluates the performance of the proposed Graph Update method
compared with the Laplace mechanism.
The experiments involve four datasets listed in Table 9.2. These datasets are
collected from the Stanford Network Analysis Platform (SNAP) [136].
The experiments consider the degree query on nodes, which is similar to the
count query on a relational dataset. To preserve edge privacy, the degree query has
a sensitivity of 1, which means deleting an edge has a maximum impact of 1
on the query result. The performance is measured by the Mean Absolute Error
(MAE) in Eq. (9.3).
The performance of Graph Update is examined through comparison with the
Laplace method [62] and the Correlated method [41]. The size of the query sets ranges from 1
to 200, in which the queries are independent of each other. The parameters are set to the
optimal ones for each dataset, and ε is fixed at 1 for all methods.
According to Fig. 9.9, it is observed that as the size of the query set increases,
the MAEs of all methods increase approximately linearly. This is
because the queries are independent of each other and the privacy budget is arranged
equally to each query. As the number of queries grows linearly, the noise
added to each query answer also grows linearly.
Second, Fig. 9.9 shows that Graph Update has a lower MAE than the other
two methods, especially when the size of the query set is large. As shown in
Fig. 9.9a, when the size of the query set is 200, the MAE of Graph Update is 99.8500,
while the Laplace method has an MAE of 210.0020 and the Correlated method has an
MAE of 135.2078; the MAE of Graph Update is thus 52.45% and 26.15% lower than
those of the Laplace and Correlated methods, respectively. The same trend can be
observed in Fig. 9.9b-d. The proposed Graph Update
mechanism performs better because part of the query answers do not consume
any privacy budget, while noise is only added in the update procedure. The other
methods, including the Laplace method, consume privacy budget when answering
every query. The experimental results show the effectiveness of Graph Update in
answering a large set of queries.
Third, it is worth mentioning that when the size of the query set is limited,
the proposed Graph Update may not necessarily outperform the Correlated method.
Figure 9.9a shows that when the size is less than 20, the MAEs of Graph Update and the
Correlated method are mixed together. This is because, when the query set is limited,
the synthetic graph cannot be fully updated and may differ largely from the original
graph. Therefore, Graph Update may not significantly outperform the other methods.
This result shows that Graph Update is more suitable for scenarios that
need to answer a large number of queries.
9.5 Summary
Nowadays, privacy problems have attracted people's attention, especially for
online social network data, which contains massive amounts of personal information. How
to release social network data is a hot topic that attracts much attention. This
chapter presented several methods for the graph publishing problem. To overcome
the problem of providing accurate results even when releasing large
numbers of queries, this chapter then presented an iterative method that transfers the
query release problem into an iteration-based update process, so as to provide a
practical solution for publishing a sequence of queries with high accuracy. In the
future, more complicated queries should be investigated, such as cut queries and
triangle queries, which can allow researchers to obtain more information from the dataset
while still guaranteeing users' privacy.
Chapter 10
Differentially Private Recommender System
10.1 Introduction
10.2 Preliminaries
Table 10.2 User-item matrix for a movie recommender system

User    Alien   Logan   Beauty and the Beasts   The Artist
Alice   3       2       4                       5
Bob     3       5       4
Cathy   2       3       4                       0
...     ...     ...     ...                     ...
Eva     3       4
In the Rating Prediction Stage, for any item ti, all ratings on ti by users in Nk(ua)
are aggregated into the predicted rating r̂ai. Specifically, the prediction r̂ai is
calculated as a weighted sum of the neighbors' ratings on item ti. Accordingly, the
determination of a suitable set of weights becomes an essential problem, because
most work relies on the similarity between users or items to determine the weights.
For example, for item-based methods, the prediction r̂ai is formulated as follows:

r̂ai = Σ_{j∈Ia} s(i, j)·raj / Σ_{j∈Ia} |s(i, j)|.   (10.3)
In user-based methods, the active user's prediction is made from the rating data
of many other users whose ratings are similar to the active user's. The predicted rating
of user v on item α is:

r̂vα = r̄v + Σ_{u∈Uv} s(v, u)·(ruα − r̄u) / Σ_u |s(v, u)|,

where r̄v and r̄u are the average ratings given by users v and u, respectively.
Finally, the computed predictions are converted into recommendations; e.g., a
subset of items with the highest predicted ratings is recommended to the user.
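The user-based prediction rule above can be sketched as follows. The cosine similarity over co-rated items and the toy ratings are illustrative assumptions; the book's score functions may differ:

```python
import math

def cosine_sim(ru, rv):
    """Cosine similarity over the items co-rated by users u and v."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    num = sum(ru[i] * rv[i] for i in common)
    den = math.sqrt(sum(ru[i] ** 2 for i in common)) * \
          math.sqrt(sum(rv[i] ** 2 for i in common))
    return num / den

def predict_user_based(ratings, v, item):
    """r_hat(v,item) = mean(v) + sum_u s(v,u)(r_u,item - mean(u)) / sum_u |s(v,u)|"""
    mean_v = sum(ratings[v].values()) / len(ratings[v])
    num = den = 0.0
    for u, ru in ratings.items():
        if u == v or item not in ru:
            continue
        s = cosine_sim(ratings[v], ru)
        mean_u = sum(ru.values()) / len(ru)
        num += s * (ru[item] - mean_u)
        den += abs(s)
    return mean_v if den == 0 else mean_v + num / den

ratings = {"u1": {"a": 4, "b": 2, "c": 5}, "u2": {"a": 4, "b": 2}}
print(predict_user_based(ratings, "u2", "c"))
```

Here u1 and u2 agree exactly on the co-rated items, so the similarity is 1 and the prediction is u2's mean plus u1's mean-centered rating of "c".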
Some of the most successful model-based methods are based on matrix factoriza-
tion. Matrix factorization characterizes both items and users by vectors of factors
inferred from item rating patterns. Matrix factorization models map both users
and items to a joint latent factor space of dimensionality l, such that user-item
interactions are modeled as inner products in that space.
Specifically, the method factorizes D into two latent matrices: the user-factor
matrix P of size n × l and the item-factor matrix Q of size d × l. Each
row pu in P (and qi in Q) represents the relation between user u (item ti) and the
latent factors. The dimension l of the latent matrices is less than d. The dataset D is
approximated as the product of P and Q, and each known rating rui is approximated
by r̂ui = qi^T pu. Figure 10.2 shows the matrix factorization method.
To obtain P and Q, the method minimizes the regularized squared error over all
the available ratings:

min_{P,Q} Σ_{rui∈D} [ (rui − pu·qi^T)² + λ(‖pu‖² + ‖qi‖²) ].   (10.4)

The constant λ regularizes the learned factors and prevents overfitting. Two
common ways to solve this non-convex optimization problem are stochastic gradient
descent (SGD) and alternating least squares (ALS).
In SGD, the factors are learned by iteratively evaluating the error eui = rui − pu·qi^T
for each rating rui, and simultaneously updating the user and item vectors by taking
a step in the direction opposite to the gradient of the regularized loss function:

pu ← pu + γ(eui·qi − λ·pu),
qi ← qi + γ(eui·pu − λ·qi).

The constant γ determines the rate of minimizing the error and is often referred to
as the learning rate.
ALS solves the problem by updating the user and item latent matrices iteratively.
In each iteration, P is fixed and Q is solved by least-squares optimization; then Q is
fixed and P is solved by least-squares optimization.
In both SGD and ALS, once the factorization converges, the latent matrices P
and Q are used to predict unknown user ratings. The resulting latent vectors pu and
qi are multiplied:

r̂ui = qi^T pu.   (10.7)
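The SGD variant can be sketched as follows. The tiny rating matrix, learning rate, factor dimension, and epoch count are illustrative assumptions, not values from the book:

```python
import random

def mf_sgd(ratings, n_users, n_items, l=2, gamma=0.05, lam=0.02,
           epochs=500, seed=0):
    """Factorize known ratings (u, i, r) into P (n x l) and Q (d x l) by SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(l)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(l)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - sum(P[u][k] * Q[i][k] for k in range(l))   # e_ui
            for k in range(l):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += gamma * (e * qi - lam * pu)   # gradient step on p_u
                Q[i][k] += gamma * (e * pu - lam * qi)   # gradient step on q_i
    return P, Q

def sq_error(ratings, P, Q):
    """Total squared reconstruction error over the known ratings."""
    return sum((r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2
               for u, i, r in ratings)

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = mf_sgd(ratings, n_users=2, n_items=2)
```

After training, `sq_error` on the known ratings is small, and Eq. (10.7) is just the inner product used inside `sq_error`.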
The first typical method was provided by McSherry et al. [156], who were the first
to introduce the differential privacy notion into collaborative filtering. They calibrated
Laplace noise in each step to create a covariance matrix, and used a non-private
recommender algorithm on this matrix to predict ratings. In general, it is a synthetic
covariance matrix publishing method.
For a large class of prediction algorithms, it suffices to have the following data: Ḡ,
the average rating over all items by all users; Ī, the average rating for each item by all users;
ūv, the average item rating for each user; and the covariance matrix (Cov). All ratings are
scaled into the interval [−B, B]. McSherry et al.'s method consists of three steps, and
noise is added in each step.
Step 1: Evaluation of the Global and Item Averages, Ḡ and Ī. Suppose there are n
users and d items; the global average rating is estimated as follows with
privacy budget ε1. The sensitivity of a rating r is rmax − rmin = 2B:

Ḡ = ( Σ_{rui∈D} rui + Lap(2B/ε1) ) / |D|.   (10.8)
Then, for each item tj, the average rating for that item by all users is calculated.
The average item rating is calculated by adding a number β of fictitious ratings to
stabilize the item averages, helping to limit the effect of noise for items with few
ratings, while only slightly affecting the averages of items with many ratings:

Ī = ( Σ_{Dj} rui + β·Ḡ + Lap(2B/ε2) ) / ( |Dj| + β ).   (10.9)

The added noise may cause an item average to fall outside the range of input ratings
[rmin, rmax]; in that case the item average is clamped to fit this range.
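Equations (10.8) and (10.9) can be sketched as follows. The rating bound B, the budgets ε1 and ε2, and the value of β are illustrative assumptions:

```python
import math
import random

def lap(scale, rng):
    """One Laplace(0, scale) sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_global_average(ratings, B, eps1, rng):
    """Eq. (10.8): noisy sum of all ratings divided by the number of ratings."""
    total = sum(r for (_, _, r) in ratings)
    return (total + lap(2 * B / eps1, rng)) / len(ratings)

def noisy_item_average(item_ratings, g_avg, beta, B, eps2, rmin, rmax, rng):
    """Eq. (10.9): beta fictitious ratings at the global average, then clamp."""
    avg = (sum(item_ratings) + beta * g_avg + lap(2 * B / eps2, rng)) \
          / (len(item_ratings) + beta)
    return min(max(avg, rmin), rmax)            # clamp to [rmin, rmax]

rng = random.Random(0)
ratings = [(0, 0, 4.0), (1, 0, 2.0), (0, 1, 5.0)]   # (user, item, rating)
g = noisy_global_average(ratings, B=5.0, eps1=1.0, rng=rng)
i0 = noisy_item_average([4.0, 2.0], g, beta=2.0, B=5.0, eps2=1.0,
                        rmin=1.0, rmax=5.0, rng=rng)
```

The β fictitious ratings pull sparsely rated items toward the global average, which is exactly the stabilization described above.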
Step 2: Evaluation of the User Averages and Clamping of the Resulting Ratings.
The same technique is followed to compute the user average ratings. The basis for
evaluating the user averages is the ratings after the item averages have been discounted.
The user effects are stabilized by the addition of βu fictitious ratings with the
newly computed global average. The user average rating is calculated as follows.
Let D′ = {rui − Ī(i) | rui ∈ D}; the adjusted global average is

Ḡ′ = ( Σ_{D′} rui + Lap(2B/ε1) ) / |D′|.
Step 3: Calculation of the Covariance Matrix. The final measurement is the covari-
ance of the centered and clamped user rating vectors ru. Per-user weights
wu equal to the reciprocal of ‖eu‖ (eu being the binary vector indicating the
presence of ratings) are used as follows:

Cov = Σ_u wu·ru·ru^T + b,   (10.10)

W = Σ_u wu·eu·eu^T + b.   (10.11)

The noise b added to Cov could be large if a user's ratings have a large spread or if a user
has rated many items. McSherry et al. provided two solutions: (1) center and clamp
all ratings around the averages; if clamped ratings can be used, then the sensitivity of
the function can be reduced. (2) Carefully weight the contribution of each user to
reduce the sensitivity of the function; users who have rated more items are assigned
lower weights.
After independent Gaussian noise proportional to the sensitivity bound is
added to each entry of the covariance matrix, all related data are sent to a non-
private recommender system.
This clamping adjusts ratings to the range [−α, α], improving the prediction
accuracy.
To perturb SGD, the training samples are used to evaluate the prediction error
resulting from the current factor matrices, and then the matrices are modified in
the direction opposite to the gradient, with magnitude proportional to the learning
rate γ. Suppose there are w iterations. In each iteration, for each rui ∈ D, Laplace
noise is added to the error eui.
Because the noisy error may exceed the maximum or minimum error, the clamping
method is used again to adjust the scale, as in the following equation:

         −emax,  if êui < −emax;
êui =     êui,   if |êui| ≤ emax;
          emax,  if êui > emax.
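A sketch of one perturbed error evaluation with this clamping follows. The per-iteration budget, the sensitivity value, and emax are illustrative assumptions:

```python
import math
import random

def lap(scale, rng):
    """One Laplace(0, scale) sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def clamp_error(e, e_max):
    """Clamp a noisy error into [-e_max, e_max], as in the equation above."""
    if e < -e_max:
        return -e_max
    if e > e_max:
        return e_max
    return e

def noisy_clamped_error(r_ui, pred, eps_per_iter, sensitivity, e_max, rng):
    """Evaluate e_ui, add Laplace noise, then clamp before the gradient step."""
    e = r_ui - pred                                  # e_ui
    e_noisy = e + lap(sensitivity / eps_per_iter, rng)
    return clamp_error(e_noisy, e_max)

rng = random.Random(1)
e_hat = noisy_clamped_error(r_ui=4.0, pred=3.2, eps_per_iter=0.1,
                            sensitivity=2.0, e_max=2.0, rng=rng)
```

The clamped noisy error then replaces eui in the SGD updates, which bounds the influence of any single noisy step.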
ALS perturbation alternately fixes one of the latent matrices P or Q and optimizes the
regularized loss function for the other, non-fixed matrix. When the item-factor matrix Q
is fixed, the overall regularized loss function can be minimized by considering, for
each user u, the loss function defined over the subset of ratings Du =
{rvi ∈ D | v = u}. First, the bounds pmax and qmax are estimated as max ‖pu‖₂ and
max ‖qi‖₂, respectively. In each iteration, and for each user u in P, a noise vector b
is generated with density f(b) ∝ exp( −ε‖b‖₂ / (2k·nu·pmax·2B) ). The noise vector b
is added to the loss function, as shown in Eq. (10.13).
Similarly, for each item ti in Q, a noise vector b is sampled with
f(b) ∝ exp( −ε‖b‖₂ / (2k·ni·qmax·2B) ) and added to the corresponding loss function.
When P and Q are finally obtained, the rating of user u is predicted by Eq. (10.7).
10.4 Private Neighborhood-Based Collaborative Filtering Method 117
Zhu et al. [258] mainly studied trusted recommender systems. Their algorithm
ensures that a user cannot observe sensitive information from the recommendation
outputs; therefore, the proposed algorithm is immune to a particular
attack, the KNN attack.
Calandrino et al. [30] presented the KNN attack. They claim that if a recommenda-
tion algorithm and its parameters are known by an attacker, and supposing he/she
knows the partial rating history of an active user ua on m items, then the attacker can
infer user ua's remaining rating history. The inference process can be summarized
as follows.
The attacker initially creates k fake users known as sybils. He/she populates each
sybil's rating history with the m items in the active user ua's rating history. Then,
with high probability, the k nearest neighbors of each sybil will consist of the
other k − 1 sybils along with the active user ua. The attacker inspects the lists of
items recommended by the system to any of the sybils. Any item on a sybil's list
not belonging to those m items will be an item that ua has rated. The
attacker can thus infer the rating history of the active user, and this process is
considered a serious privacy violation. While this is an example for user-based
methods, a similar inference can also be performed for item-based methods.
A KNN attack can be performed efficiently in CF due to the sparsity of a typical
rating dataset. Approximately m = O(log n) items are sufficient for an attacker to infer
a user, where n is the total number of users in the rating dataset. For example, in
a dataset with thousands of users, m ≈ 8 is sufficient [30]. This is such a small
number that it can easily be collected by an attacker. Furthermore, an attack will be
more serious if an attacker can adaptively change the rating history of his sybils by
observing the output of the CF. This can be easily implemented in a system that allows
users to change previously entered ratings. How to hide similar neighbors is a major
privacy issue that cannot be overlooked.
where s(i, j) is the output of the score function representing the similarity between
ti and tj, I is item ti's candidate list of neighbors, and tj is the selected neighbor.
The probability of selecting tj is arranged according to qi(I, tj).
The recommender mechanism uses the score function q to preserve differential
privacy. However, the naive recommender mechanism fails to provide accurate
predictions because it is too general, and therefore not suitable for recommendation
purposes. Accordingly, two operations are proposed to address this obstacle. The
first operation is to define a new Recommendation-Aware Sensitivity to decrease the
noise, and the second operation is to provide a new recommender mechanism to
enhance the accuracy. Both are integrated to form the proposed PNCF algorithm,
which consequently obtains a better trade-off between privacy and utility.
Lemma 10.1 For any item pair ti and tj, the score function q(I, tj) has Local
Sensitivity

RS(i, j) = max{ max_{ux∈Uij} ( rxi·rxj / (‖ri′‖·‖rj′‖) ),
                max_{ux∈Uij} ( rxi·rxj·(‖ri‖·‖rj‖ − ‖ri′‖·‖rj′‖) / (‖ri‖·‖rj‖·‖ri′‖·‖rj′‖) ) },   (10.19)

where ux is the user who makes the greatest impact on the similarity between ti and tj, and
RS(i, j) = 1 when |Uij| = 1.
Proof
Thus,

RS(i, j) ≤ ‖ri‖·‖rj‖·(ri·rj − ri′·rj′) / (‖ri‖·‖rj‖·‖ri′‖·‖rj′‖),   if ‖ri′‖·‖rj′‖·(ri·rj) ≥ ‖ri‖·‖rj‖·(ri′·rj′);
RS(i, j) ≤ (ri·rj)·(‖ri‖·‖rj‖ − ‖ri′‖·‖rj′‖) / (‖ri‖·‖rj‖·‖ri′‖·‖rj′‖),   otherwise.   (10.21)

Please note that ‖ri‖·‖rj‖·(ri·rj − ri′·rj′) / (‖ri‖·‖rj‖·‖ri′‖·‖rj′‖) = max_{ux∈Uij} rxi·rxj / (‖ri′‖·‖rj′‖).

Thus, sim(i, j) and sim′(i, j) differ by at most

max{ max_{ux∈Uij} ( rxi·rxj / (‖ri′‖·‖rj′‖) ),
     max_{ux∈Uij} ( rxi·rxj·(‖ri‖·‖rj‖ − ‖ri′‖·‖rj′‖) / (‖ri‖·‖rj‖·‖ri′‖·‖rj′‖) ) }.   (10.22)
Both Lemmas 10.1 and 10.2 indicate that, depending on the lengths of the rating vector
pairs, the score function q will only change slightly in the normal case. Compared to
the Global Sensitivity, the Recommendation-Aware Sensitivity can significantly decrease
the noise magnitude. The Recommendation-Aware Sensitivity is used in the PNCF
algorithm. To simplify the notation, RS(i, j) or Recommendation-Aware Sensitivity
is used to represent B(RS(i, j)) in the following formulas.
Private Neighbor Selection is the major private operation in PNCF. This section
presents Private Neighbor Selection based on the exponential mechanism and
shows how it is introduced into CF. Given an active user ua and a target item ti, a can-
didate item list I and a corresponding similarity list s(i) = ⟨s(i, 1), …, s(i, m)⟩
are defined, consisting of the similarities between ti and the other m − 1 items.
According to Eq. (10.17), each element in s(i) is the score of the corresponding
item. The goal of Private Neighbor Selection is to select k neighbors from the candidate
item list I according to the score vector s(i). It should be noted that ti is not included
in I, as ti is the target item.
However, even though the Recommendation-Aware Sensitivity is applied to reduce
noise, the naive exponential mechanism still yields low prediction accuracy. The
reason is that the performance of neighborhood-based CF methods largely depends
on the quality of the neighbors. If the top k nearest neighbors are assumed to be the highest
quality neighbors (sk is used to denote the similarity of the k-th nearest neighbor), the
randomized selection process will have a high probability of picking neighbors
with low scores. The rating patterns of these low quality neighbors may be totally
different from the pattern of the active user, which lowers the accuracy of the
prediction. To address this problem, the best solution is to improve the quality of the
k neighbors under differential privacy constraints.
Motivated by this, the algorithm uses a new notion of truncated similarity as the
score function to enhance the quality of the selected neighbors. The truncated notion
was first mentioned in Bhaskar et al.'s work [24], in which truncated
frequency was used to decrease the computational complexity. The same notion can be used
in the score function q to find high quality neighbors. Specifically, for each
item in the candidate list I, if its similarity s(i, j) is smaller than sk(i, ·) − w, then s(i, j)
is truncated to sk(i, ·) − w, where w is a truncation parameter in the score function;
otherwise it is preserved as s(i, j). The truncated similarity can be denoted as

ŝ(i, j) = max( s(i, j), sk(i, ·) − w ).   (10.24)
6: end for
7: Sample an element t from C1 and C0 without replacement according to their probabilities;
8: Nk(ti) = Nk(ti) ∪ {t};
9: end for
Each element of C1 has probability proportional to exp( ε·s(i, j) / (4k·RS(i, j)) ). In C0,
the algorithm does not deal with the elements one by one. Instead, it considers C0 as a single
candidate with probability proportional to exp( ε·ŝ(i, j) / (4k·RS(i, j)) ). When C0 is chosen,
the items in C0 are selected uniformly. This does not violate differential privacy,
because C0 is used as a single candidate and is assigned its weight according to the
exponential mechanism. This means the probability for each element in C0 still
follows the exponential distribution. The probability of providing the same output for
neighboring datasets is still bounded by exp(ε).
The Private Neighbor Selection operation can provide high quality neighbors and
guarantee differential privacy simultaneously. The choice of w will influence the
utility at a fixed privacy level.
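Private Neighbor Selection can be sketched as repeated exponential-mechanism draws over truncated similarities. This is an illustrative simplification: the C0 tail is sampled element-wise rather than as a single grouped candidate, and RS is taken as one constant rather than a per-pair value.

```python
import math
import random

def truncated_scores(sims, k, w):
    """Eq. (10.24): s_hat(i, j) = max(s(i, j), s_k - w)."""
    s_k = sorted(sims.values(), reverse=True)[k - 1]
    return {j: max(s, s_k - w) for j, s in sims.items()}

def private_neighbor_selection(sims, k, epsilon, rs, w, seed=0):
    """Select k distinct neighbors, one exponential-mechanism draw per round."""
    rng = random.Random(seed)
    scores = truncated_scores(sims, k, w)
    selected = []
    for _ in range(k):
        items = [j for j in scores if j not in selected]
        # Weight each remaining candidate by exp(eps * score / (4 k RS)),
        # subtracting the max log-weight for numerical stability.
        logw = [epsilon * scores[j] / (4 * k * rs) for j in items]
        mx = max(logw)
        weights = [math.exp(x - mx) for x in logw]
        selected.append(rng.choices(items, weights=weights)[0])
    return selected

sims = {"t1": 0.9, "t2": 0.8, "t3": 0.75, "t4": 0.1, "t5": 0.05}
nbrs = private_neighbor_selection(sims, k=3, epsilon=1.0, rs=0.01, w=0.1)
```

Truncation lifts the scores of low-similarity items only up to sk − w, so high-similarity items keep a large advantage while the sampling itself stays randomized.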
The utility of PNCF is measured by the accuracy of the predictions. Given an input
dataset D, the non-private neighbourhood-based CF method is set as a baseline.
By comparing the selected neighbors in PNCF with the corresponding neighbors in
the baseline method, the utility level of the proposed algorithm can be analyzed.
To predict the rating of ti for the active user ua, the non-private algorithm
typically chooses the top k most similar neighbors and then generates an integrated output
as the prediction. Therefore, the closeness between the top k similar neighbors in the
baseline method and the privately selected neighbors in PNCF is the key factor
that determines the utility level. When implementing the naive exponential mechanism,
the randomized selection step will choose low similarity neighbors with high
probability, which significantly lowers the accuracy of recommendations.
The proposed PNCF algorithm preserves the quality of the neighbors in two
respects: every neighbor with a true similarity greater than (sk + w) will be selected;
and no neighbor in the output has a true similarity less than (sk − w), where the
true similarity refers to the original similarity without perturbation or truncation.
Therefore, Private Neighbor Selection guarantees that, with high probability, the k
selected neighbors will be close to the actual top k nearest ones. Hence, the utility
of PNCF is better retained than that of the naive exponential mechanism. Two
theorems and proofs support these claims, as follows.
Theorem 10.1 Given an item ti, let Nk(ti) be the k selected neighbors and |v|
be the maximal length of all the rating vector pairs. Suppose RS is the maximal
Recommendation-Aware Sensitivity between ti and the other items, and
suppose δ is a small constant less than 1. Then, for all ε > 0, with probability at
least 1 − δ, the similarities of all the items in Nk(ti) are larger than sk − w, where

w = min( sk, (4kRS/ε)·ln( k(|v| − k)/δ ) ).
Proof First, the algorithm computes the probability of selecting a neighbor with a
similarity less than $(s_k - w)$ in each round of sampling. This occurs when there is an
unsampled neighbor with similarity no less than $s_k$. Then, with the constraints on $w$
and $\rho$, no neighbor in the output has true similarity less than $(s_k - w)$ after $k$ rounds of
sampling for neighbor selection.
If a neighbor with similarity $s_k$ is still waiting for selection, the probability
of picking a neighbor with similarity less than $(s_k - w)$ is
$$\frac{\exp\big(\frac{\epsilon(s_k - w)}{4kRS}\big)}{\exp\big(\frac{\epsilon s_k}{4kRS}\big)} = \exp\Big(-\frac{\epsilon w}{4kRS}\Big).$$
Since there are at most $|v|-k$ neighbors with similarity less than $(s_k - w)$,
according to the union bound, the probability of choosing a neighbor with similarity
less than $(s_k - w)$ in one round is at most $(|v|-k)\exp\big(-\frac{\epsilon w}{4kRS}\big)$.
Furthermore, by the union bound over the $k$ sampling rounds, the probability of
choosing any neighbor with similarity less than $(s_k - w)$ is at most $k(|v|-k)\exp\big(-\frac{\epsilon w}{4kRS}\big)$.
Let $\rho = k(|v|-k)\exp\big(-\frac{\epsilon w}{4kRS}\big)$.
Then,
$$\frac{\epsilon w}{4kRS} = \ln\frac{k(|v|-k)}{\rho} \;\Rightarrow\; \epsilon w = 4kRS\ln\frac{k(|v|-k)}{\rho} \;\Rightarrow\; w = \frac{4kRS}{\epsilon}\ln\frac{k(|v|-k)}{\rho}. \qquad (10.25)$$
Thus, the probability that neighbors with similarities less than $(s_k - w)$ will be chosen is less
than $\rho$. As defined in Sect. 10.4.2.2, the Recommendation-Aware Sensitivity is $O(1/\|v\|_2)$.
10.4 Private Neighborhood-Based Collaborative Filtering Method 125
The proposed PNCF algorithm contains two private operations: the Private Neighbor
Selection and the Perturbation. The Private Neighbor Selection essentially
applies the exponential mechanism successively: an item is selected without
replacement in each round until $k$ distinct neighbors are chosen, and the score sensitivity
is calibrated by the Recommendation-Aware Sensitivity.
From the definition of the exponential mechanism, each selection round preserves
$\frac{\epsilon}{2k}$-differential privacy. Sequential composition undertakes the privacy
guarantee for a sequence of differentially private computations: when a series of
private analyses is performed sequentially on a dataset, the privacy budgets of the
steps are added up. According to the sequential composition property, the Private
Neighbor Selection guarantees $\frac{\epsilon}{2}$-differential privacy as a whole.
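As a concrete illustration, the $k$-round selection can be sketched in Python. This is a simplified sketch, not the book's reference implementation: the function name, the dict-based interface and the single fixed sensitivity value `rs` are assumptions.

```python
import math
import random

def private_neighbor_select(similarities, k, epsilon, rs):
    """Pick k neighbors without replacement via the exponential mechanism.

    similarities: dict mapping candidate id -> similarity score
    epsilon: total budget for the selection step; each of the k rounds
             uses epsilon/(2k), giving weights exp(eps * s / (4 * k * rs))
    rs: Recommendation-Aware Sensitivity of the similarity score
    """
    candidates = dict(similarities)
    chosen = []
    for _ in range(k):
        ids = list(candidates)
        weights = [math.exp(epsilon * candidates[i] / (4 * k * rs)) for i in ids]
        pick = random.choices(ids, weights=weights)[0]
        chosen.append(pick)
        del candidates[pick]   # selection without replacement
    return chosen
```

Higher-similarity candidates get exponentially larger weights, so with high probability the output stays close to the true top-$k$, as Theorem 10.1 quantifies.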
The Perturbation step adds independent Laplace noise to the $N_k(t_i)$ chosen in the
previous step. Given the item set $N_k(t_i)$, the perturbation adds independent Laplace noise
to their similarities. The noise is calibrated by $\epsilon/2$ and the Recommendation-Aware
Sensitivity:
$$s_{noise}(i,j) = s(i,j) + Lap\Big(\frac{2\,RS(i,j)}{\epsilon}\Big). \qquad (10.28)$$
According to the definition of the Laplace mechanism, this step satisfies $\epsilon/2$-differential privacy.
Consequently, when the two operations are combined, the proposed method preserves
$\epsilon$-differential privacy by applying the sequential composition lemma to the selection
and perturbation steps together.
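A minimal sketch of the Perturbation step of Eq. (10.28), assuming a per-pair sensitivity value $RS(i,j)$ is already available; the helper name is an assumption for illustration:

```python
import random

def perturb_similarity(s, rs, epsilon):
    """Add Laplace noise with scale 2*RS(i,j)/epsilon to one similarity (Eq. 10.28)."""
    scale = 2.0 * rs / epsilon
    # A Laplace(scale) sample is the difference of two iid exponentials
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return s + noise
```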
The datasets are the popular Netflix dataset¹ and the MovieLens dataset.²
The Netflix dataset was extracted from the Netflix Prize dataset, where each
user rated at least 20 movies, and each movie was rated by 20–250 users. The
MovieLens dataset includes around one million ratings collected from 6040 users
on 3900 movies. MovieLens is a standard benchmark dataset for collaborative
filtering research, while the Netflix dataset is a real industrial dataset released
by Netflix. Both datasets contain millions of ratings spanning several years and
are sufficient for investigating the performance of the proposed method from both a
research and an industry perspective. Specifically, the All-But-One strategy is applied,
which randomly selects one rating of each user and then predicts its value using all
the remaining ratings in the dataset.
To measure the quality of recommendations, a popular measurement metric is
applied, the Mean Absolute Error (MAE) [7]:
$$MAE = \frac{1}{|T|}\sum_{(a,i)\in T}|r_{ai} - \hat{r}_{ai}|. \qquad (10.29)$$
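The metric can be computed directly; a small helper (names assumed) might look like:

```python
def mae(predictions):
    """Mean Absolute Error over test pairs (Eq. 10.29).

    predictions: list of (true_rating, predicted_rating) pairs for the test set T.
    """
    return sum(abs(r - r_hat) for r, r_hat in predictions) / len(predictions)
```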
¹ http://www.netflixprize.com.
² http://www.grouplens.org.
Fig. 10.6 Performance of PNCF and DP on the MovieLens dataset. (a) PCC-item. (b) PCC-user
exponential mechanism. On the other hand, compared with the baseline non-private
algorithm, the accuracy cost introduced by PNCF is much smaller than the cost
introduced by DP. This is because PNCF introduces two novel operations to reduce
the magnitude of introduced noise. Moreover, the two-tailed, paired t-test with a
95% confidence level has been applied to evaluate the performance of PNCF under
all configurations. The detailed statistical results on the Netflix dataset are presented
in Table 10.4. This table shows that the difference in performance between PNCF
and DP is statistically significant.
Moreover, Fig. 10.6 shows the results on the MovieLens dataset. Specifically,
Fig. 10.6a, b shows the performance of PNCF and DP with the PCC metric under
an item-based and a user-based manner, respectively. It is clear that PNCF performs
much better than DP. In addition, it is observed that as $k$ increases, both PNCF and
DP achieve better MAE. However, PNCF always performs better than DP across all
$k$ values. This is because PNCF achieves privacy preservation by distinguishing the
quality of potential neighbors, and therefore always selects good-quality neighbors,
as analysed in Sect. 10.4.3.1. Moreover, PNCF's MAE performance is very close
to that of the non-private baseline. This indicates PNCF can retain the accuracy of
recommendations while providing comprehensive privacy for individuals.
10.5 Summary
11.1 Introduction
The widespread success of social network web sites, such as Del.icio.us and
Bibsonomy, introduces a new concept called the tagging recommender system [107].
These social network web sites usually enable users to annotate resources with
customized tags, which in turn facilitates the recommendation of resources. Over
the last few years, a large collection of data has been generated, but the issue of
privacy in the recommender process has generally been overlooked. An adversary
with background information may re-identify a particular user in a tagging dataset
and obtain the user’s historical tagging records [84]. Moreover, in comparison
with traditional recommender systems, tagging recommender systems involve more
semantic information that directly discloses users’ preferences. Hence, the privacy
violation involved is more serious than traditional violations [177]. Consequently,
how to preserve privacy in tagging recommender systems is an emerging issue that
needs to be addressed.
Over the last decade, a variety of privacy preserving approaches have been
proposed for traditional recommender systems [178]. For example, cryptography is
used in the rating data for multi-party data sharing [31, 247]. Perturbation adds noise
to the users’ ratings before rating prediction [181, 182], and obfuscation replaces a
certain percentage of ratings with random values [23]. However, these approaches
can hardly be applied in tagging recommender systems due to the semantic property
of tags. Specifically, cryptography completely erases the semantic meaning of
tags, while perturbation and obfuscation can only be applied to numerical values
instead of words. These deficiencies render these approaches impractical in tagging
recommendation. To overcome these deficiencies, the tag suppression method has
recently been proposed to protect a user’s privacy by modeling users’ profiles and
eliminating selected sensitive tags [177]. However, this method only releases an
incomplete dataset that significantly affects the recommendation performance.
11.2 Preliminaries
11.2.1 Notations
¹ https://del.icio.us/.
² http://www.bibsonomy.org/.
³ http://www.last.fm/.
⁴ https://www.netflix.com/.
134 11 Privacy Preserving for Tagging Recommender Systems
In a tagging dataset $D$, a user's profile is defined by $P(u) = \langle T(u), W(u)\rangle$, in
which the tags are represented by $T(u) = \{t_1, \ldots, t_{|T|}\}$, and the weights are denoted as
$W(u) = \{w_1, \ldots, w_{|T|}\}$, where $w_i = 0$ indicates that $t_i$ is unused. A set of users
is denoted by $U$.
11.3 Private Tagging Publishing Method 135
The PriTop algorithm aims to publish all users’ profiles by masking their exact tags
and weights under the notion of differential privacy. Three private operations are
introduced to ensure each user in the released dataset cannot be re-identified.
Private Topic Model Generation This creates multiple topics according to the
resources and tags by masking the topic distribution on tags. From the output,
the adversary cannot infer to which topic a tag belongs.
Topic Weight Perturbation This operation masks the weights of tags in a user’s
profile to prevent an adversary from inferring how many tags a user has annotated
on a certain topic.
Private Tag Selection This operation replaces the original tags with privately selected tags.
On the basis of these private operations, the proposed PriTop algorithm generates
a new tagging dataset for recommendations, and its pseudocode is provided in
Algorithm 1. Firstly, step 1 divides the privacy budget into three parts for the three
private operations. Step 2 groups all tags into $K$ topics and, in step 3, the weight
for each topic is perturbed by Laplace noise. After privately selecting the new tags
to replace the original ones in step 4, the sanitized dataset $\hat{D}$ is finally released for
recommendation purposes in the last step.
The PriTop algorithm contains three private operations: Private Topic Model
Generation, Topic Weight Perturbation and Private Tag Selection. The privacy
budget is consequently divided into three pieces, as illustrated in Table 11.2.
This operation categorizes unstructured tags into topics to shrink the randomization
domain. The method is based on the idea of a topic model, which considers
a document as a mixture of topics, where a topic is a probability distribution over
words [25]. Latent Dirichlet Allocation (LDA) is the simplest topic model, and the
intuition behind it is that documents exhibit multiple topics [26]. Moreover, it is
concise, clear and easy to extend in terms of privacy preserving. Therefore, we
applied LDA as the baseline model and introduced differential privacy to generate a
private topic model. In this model, a resource is considered as a document and a tag
is interpreted as a word. After discovering a set of topics expressed by tags, the Laplace
mechanism ensures that an adversary cannot infer to which topic a tag belongs.
We conceptualize the Private Topic Model Generation in three steps:
• Generating LDA model using the Gibbs Sampling approach [87].
• Adding Laplace noise to the LDA model to create a private model.
• Creating users' profiles according to the private LDA model.
The first step constructs the LDA model by Gibbs Sampling. LDA makes the
assumption that there are K topics associated with a given set of documents, and
that each resource in this collection is composed of a weighted distribution of
these topics. Let $Z = \{z_1, \ldots, z_K\}$ be a group of topics, with the following equation
representing a standard LDA model to specify the distribution over tag $t$:
$$\Pr(t \mid r) = \sum_{l=1}^{K} \Pr(t \mid z_l)\,\Pr(z_l \mid r), \qquad (11.2)$$
where $\Pr(t \mid z_l)$ is the probability of tag $t$ under a topic $z_l$ and $\Pr(z_l \mid r)$ is the probability
of sampling a tag from topic $z_l$ in the resource $r$.
In the model, the main variables of interest are the topic–tag distribution $\Pr(t\mid z)$
and the resource–topic distribution $\Pr(z\mid r)$. Gibbs sampling is a relatively efficient
method to estimate these variables by extracting a set of topics from
a large document collection [87]. It iterates multiple times over each tag $t$ of
resource $r$ and samples a new topic $z$ for the tag based on the posterior probability
$\Pr(z \mid t_i, r, Z_{-i})$ in Eq. (11.3) until the model converges.
$$\Pr(z \mid t_i, r, Z_{-i}) \propto \frac{C^{TK}_{t_iK} + \beta}{\sum_{t_i}^{|T|} C^{TK}_{t_iK} + |T|\beta}\cdot\frac{C^{RK}_{r_iK} + \alpha}{\sum_{k=1}^{K} C^{RK}_{r_ik} + K\alpha}, \qquad (11.3)$$
where $C^{TK}$ maintains a count of all topic–tag assignments and $C^{RK}$ counts the
resource–topic assignments. $Z_{-i}$ represents all topic–tag assignments and resource–topic
assignments except the current $z$ for $t_i$. $\alpha$ and $\beta$ are hyperparameters for the
Dirichlet priors, which can be interpreted as prior observation counts.
The estimates of $\Pr(t\mid z)$ and $\Pr(z\mid r)$ are formulated as follows:
$$\Pr(t \mid z) = \frac{C^{TK}_{tK} + \beta}{\sum_{t_i}^{|T|} C^{TK}_{t_iK} + |T|\beta}. \qquad (11.4)$$
$$\Pr(z \mid r) = \frac{C^{RK}_{rK} + \alpha}{\sum_{k=1}^{K} C^{RK}_{r_ik} + K\alpha}. \qquad (11.5)$$
After convergence, the LDA model is generated by estimating $\Pr(z\mid t, r)$, $\Pr(t\mid z)$ and
$\Pr(z\mid r)$.
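Assuming the count matrices are held as NumPy arrays, Eqs. (11.4) and (11.5) amount to smoothed normalizations; the function names and array layout below are assumptions for illustration:

```python
import numpy as np

def topic_tag_dist(c_tk, beta):
    """Pr(t|z) from the topic-tag count matrix (Eq. 11.4).

    c_tk: |T| x K count matrix; each column (topic) is normalized."""
    num = c_tk + beta
    return num / num.sum(axis=0, keepdims=True)

def resource_topic_dist(c_rk, alpha):
    """Pr(z|r) from the resource-topic count matrix (Eq. 11.5).

    c_rk: |R| x K count matrix; each row (resource) is normalized."""
    num = c_rk + alpha
    return num / num.sum(axis=1, keepdims=True)
```

Adding $\beta$ (resp. $\alpha$) to every count before normalizing yields exactly the $|T|\beta$ (resp. $K\alpha$) terms in the denominators of Eqs. (11.4) and (11.5).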
The second step introduces differential privacy by adding Laplace noise to the
final counts in the LDA model. There are four different counts in Eq. (11.3):
$C^{TK}_{tK}$, $\sum_{t_i}^{|T|} C^{TK}_{t_iK}$, $C^{RK}_{rK}$ and $\sum_{k=1}^{K} C^{RK}_{r_ik}$.
If we change the topic assignment on the current $t_i$, the $C^{TK}_{tK}$ will decrease by 1 and the
$\sum_{t_i}^{|T|} C^{TK}_{t_iK}$ will increase by 1. Similarly, if the $C^{RK}_{rK}$
decreases by 1, the $\sum_{k=1}^{K} C^{RK}_{r_ik}$ will increase by 1 accordingly. Based on this
observation, we sample two groups of independent random noise from the Laplace
distribution and add them to the four count parameters. The new $\widehat{\Pr}(z\mid t, r)$ is evaluated
as follows:
$$\widehat{\Pr}(z \mid t, r) \propto \frac{C^{TK}_{tK} + \eta_1 + \beta}{\sum_{t_i}^{|T|} C^{TK}_{t_iK} - \eta_1 + |T|\beta}\cdot\frac{C^{RK}_{rK} + \eta_2 + \alpha}{\sum_{k=1}^{K} C^{RK}_{r_ik} - \eta_2 + K\alpha}, \qquad (11.6)$$
where $\eta_1$ and $\eta_2$ are both sampled from the Laplace distribution $Laplace(2/\epsilon)$ with the
sensitivity as 1.
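A sketch of the noisy estimate of Eq. (11.6) for a single tag occurrence. The interface and the clipping of negative probability mass are implementation assumptions, not part of the book's description:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_topic_posterior(c_tk, c_tk_total, c_rk, c_rk_total,
                          alpha, beta, n_tags, epsilon):
    """Noisy topic posterior Pr^(z|t,r) for one tag occurrence (Eq. 11.6).

    c_tk:       length-K array, counts of this tag under each topic
    c_tk_total: length-K array, total tag counts per topic
    c_rk:       length-K array, counts of each topic in this resource
    c_rk_total: scalar, total topic count in this resource
    """
    K = len(c_tk)
    eta1 = rng.laplace(scale=2.0 / epsilon, size=K)  # noise on tag counts
    eta2 = rng.laplace(scale=2.0 / epsilon, size=K)  # noise on resource counts
    p = ((c_tk + eta1 + beta) / (c_tk_total - eta1 + n_tags * beta)
         * (c_rk + eta2 + alpha) / (c_rk_total - eta2 + K * alpha))
    p = np.clip(p, 1e-12, None)  # noise can create negative mass; clip it
    return p / p.sum()
```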
The third step creates topic-based user profiles. For each user with tags $T(u_a) =
\{t_1, \ldots, t_{|T(u_a)|}\}$ and related resources $R(u_a) = \{r_1, \ldots, r_{|R(u_a)|}\}$, each tag can be
assigned to a particular topic $z_l \in Z$ according to $\widehat{\Pr}(z \mid t, r)$, so the user profile
can be represented by a topic-based profile $P_z(u_a)$.
After generating the topic-based user profile $P_z(u_a)$, Laplace noise is added to
mask the counts of tags in each topic:
$$\widehat{W}_z(u_a) = \{\hat{w}_{z_1}(u_a), \ldots, \hat{w}_{z_K}(u_a)\} = W_z(u_a) + Laplace(4/\epsilon), \qquad (11.7)$$
where the noise is drawn independently for each of the $K$ topic weights.
Noise added on the weights $W_z(u_a)$ implies a revision of the tag list $T_z(u_a)$:
positive noise indicates that new tags are added to $T_z(u_a)$, while negative noise
Algorithm 2 Private Topic Model Generation
repeat
  for each resource $r_j$ do
    for each tag $t_i$ do
      Sample a topic $z$ according to
      $\Pr(z \mid t_i, r, Z_{-i}) \propto \frac{C^{TK}_{t_iK}+\beta}{\sum_{t_i}^{|T|}C^{TK}_{t_iK}+|T|\beta}\cdot\frac{C^{RK}_{r_ik}+\alpha}{\sum_{k=1}^{K}C^{RK}_{r_ik}+K\alpha}$;
    end for
  end for
until converge
for each $r_j$ do
  for each $t_i$ do
    Sample noise from the Laplace distribution:
      $\eta_1 \leftarrow Laplace(2/\epsilon)$;
      $\eta_2 \leftarrow Laplace(2/\epsilon)$;
    Estimate $\widehat{\Pr}(z\mid t, r) \propto \frac{C^{TK}_{tK}+\eta_1+\beta}{\sum_{t_i}^{|T|}C^{TK}_{t_iK}-\eta_1+|T|\beta}\cdot\frac{C^{RK}_{rK}+\eta_2+\alpha}{\sum_{k=1}^{K}C^{RK}_{r_ik}-\eta_2+K\alpha}$;
  end for
end for
for each user $u_a$ do
  for each $\langle r, t\rangle$ pair do
    Assign $z_i$ according to $\widehat{\Pr}(z\mid t, r)$;
    Generate $P_z(u_a)$;
  end for
end for
Output $P_z = \{P_z(u_1), \ldots, P_z(u_{|U|})\}$.
indicates that some tags have been deleted from the list. For positive noise in the topic
$z_l$, the operation will choose the tags with the highest probability in the current topic
$z_l$ according to $\Pr(t\mid z)$. For negative noise, the operation will delete the tags with
the lowest probability in the current topic $z_l$. The adding and deleting operations are
defined by Eqs. (11.8) and (11.9):
$$\widetilde{T}_{z_l}(u_a) = T_{z_l}(u_a) + t_{new}, \qquad (11.8)$$
where $t_{new} = \arg\max_{i=1}^{|T|} \Pr(t_i \mid z_l)$;
$$\widetilde{T}_{z_l}(u_a) = T_{z_l}(u_a) - t_{delete}, \qquad (11.9)$$
where $t_{delete} = \arg\min_{i=1}^{|T|} \Pr(t_i \mid z_l)$.
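The perturbation-and-revision logic of Eqs. (11.7)-(11.9) can be sketched as follows; the dict-based interfaces and function name are assumptions for illustration, and the Laplace scale follows Eq. (11.7) as reconstructed here:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb_topic_profile(tags_by_topic, pr_t_given_z, epsilon):
    """Perturb per-topic tag counts and revise the tag lists (Eqs. 11.7-11.9).

    tags_by_topic: dict topic -> list of the user's tags in that topic
    pr_t_given_z:  dict topic -> {tag: Pr(t|z)} over the whole vocabulary
    """
    revised = {}
    for z, tags in tags_by_topic.items():
        tags = list(tags)
        # Eq. (11.7): Laplace noise with scale 4/epsilon on the topic weight
        noise = int(round(rng.laplace(scale=4.0 / epsilon)))
        ranked = sorted(pr_t_given_z[z], key=pr_t_given_z[z].get, reverse=True)
        if noise > 0:    # Eq. (11.8): add the most probable unused tags
            tags.extend([t for t in ranked if t not in tags][:noise])
        elif noise < 0:  # Eq. (11.9): delete the least probable tags
            for t in sorted(tags, key=lambda t: pr_t_given_z[z].get(t, 0.0))[:-noise]:
                tags.remove(t)
        revised[z] = tags
    return revised
```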
After perturbation, we use $\widetilde{P}_z(u_a) = \langle \widetilde{T}_z(u_a), \widehat{W}_z(u_a)\rangle$ to represent the noisy
topic-based user profile. However, $\widetilde{P}_z(u_a)$ still has a high probability of being re-identified
because it retains a major part of the original tags. The next operation will
replace all tags in $\widetilde{T}(u_a)$ to preserve privacy.
The Private Tag Selection operation replaces the original tags with selected
new tags. The challenge is how to select suitable new tags from related topics. For
a tag $t_i \in \widetilde{T}_{z_l}(u_a)$, uniformly random tag selection within the topic $z_l$ is unacceptable
due to the significant utility detriment. The intuitive solution to retain utility is to use
the most similar tag to replace the original one. However, this approach is also
dangerous because the adversary can easily identify the most similar tag using
simple statistical analysis. Consequently, the Private Tag Selection needs to: (1)
retain the utility of tags, and (2) mask the similarities between tags.
To achieve these goals, Private Tag Selection adopts the exponential mechanism to
privately select tags from a list of candidates. Specifically, for a particular tag $t_i$, the
operation first locates the topic $z_l$ to which it belongs, and all tags in that topic are then
included in a candidate list $I$. Each tag in $I$ is associated with a probability based on a
score function and the sensitivity of the function. The selection of tags is performed
according to the allocated probabilities.
The score function is defined by the distance between tags. In the LDA model,
the distance between tags is measured by the extent to which they share the same topics [208].
Using a probabilistic approach, the distance between two tags $t_1$ and $t_2$ is computed
based on the Jensen–Shannon divergence (JS divergence) between $\Pr(z \mid t_i = t_1)$ and
$\Pr(z \mid t_i = t_2)$. The JS divergence is a symmetrized and smoothed version of the Kullback–Leibler
divergence (KL divergence). The KL divergence between two distributions
$\Pr_1$ and $\Pr_2$ is defined to be:
$$D_{KL}(\Pr_1 \| \Pr_2) = \sum_i \Pr_1(i) \ln\frac{\Pr_1(i)}{\Pr_2(i)}, \qquad (11.10)$$
$$D_{JS}(\Pr_1 \| \Pr_2) = \frac{1}{2} D_{KL}(\Pr_1 \| S) + \frac{1}{2} D_{KL}(\Pr_2 \| S), \qquad (11.11)$$
where $S = \frac{1}{2}(\Pr_1 + \Pr_2)$.
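The two divergences can be computed directly from Eqs. (11.10) and (11.11); a small sketch with list-based distributions (names assumed):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (Eq. 11.10); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence (Eq. 11.11), with S the average of p and q."""
    s = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, s) + 0.5 * kl(q, s)
```

Unlike the KL divergence, `js` is symmetric and bounded by $\ln 2$, which is what makes the sensitivity of the score function easy to bound.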
end for
3. Select a tag $t_j$ from $z_l$ without replacement according to the probability;
end for
4. Output $\widehat{T}_z(u_a)$,
where $I$ is tag $t_i$'s candidate list, and $t_j \in I$ are the candidate tags for replacement.
Each tag $t_j$ has a score according to Eq. (11.12).
The sensitivity of the score function $q$ is measured by the maximal change in the
distance of two tags when removing a topic shared by both $t_i$ and $t_j$. Let $D'_{JS}(\Pr_i \| \Pr_j)$
denote the new distance between $t_i$ and $t_j$ after deleting a topic; the maximal
difference between $D'_{JS}(\Pr_i \| \Pr_j)$ and $D_{JS}(\Pr_i \| \Pr_j)$ is bounded by 1. Therefore, the
sensitivity of the score function is 1.
On the basis of the score function and its sensitivity, the probability assigned to
each tag $t_j$ is computed by Eq. (11.13) with the privacy budget $\epsilon/4$. The pseudocode
of Private Tag Selection is presented in Algorithm 3:
$$\Pr(t_j) = \frac{\exp\big(\frac{\epsilon\, q_i(I, t_j)}{8}\big)}{\sum_{t_j \in z_l}\exp\big(\frac{\epsilon\, q_i(I, t_j)}{8}\big)}. \qquad (11.13)$$
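A sketch of one selection round under Eq. (11.13). The concrete score $q = 1 - D_{JS}$ and the max-subtraction for numerical stability are assumptions not stated in the book:

```python
import math
import random

def private_tag_select(ti, candidates, js_dist, epsilon):
    """Privately pick a replacement for tag ti from its topic's candidate
    list via the exponential mechanism (Eq. 11.13).

    js_dist: dict (ti, tj) -> JS divergence between the two tags
    epsilon: the whole algorithm's budget; this step uses epsilon/4 with
             sensitivity 1, hence the exp(eps * q / 8) weights
    """
    q = {tj: 1.0 - js_dist[(ti, tj)] for tj in candidates}
    m = max(q.values())  # subtract the max score to avoid overflow in exp
    weights = [math.exp(epsilon * (q[tj] - m) / 8.0) for tj in candidates]
    return random.choices(list(candidates), weights=weights)[0]
```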
The proposed PriTop aims to obtain acceptable utility at a fixed differential
privacy level. This section first proves that the algorithm satisfies
$\epsilon$-differential privacy, and then analyzes the utility cost.
To analyze the privacy guarantee, we apply the two composition properties of the privacy
budget: sequential composition and parallel composition. Based on the above lemmas
and the privacy budget allocation in Table 11.2, we measure the privacy level of our
algorithm as follows:
• The Private Topic Model Generation operation is performed on the whole dataset
with the privacy budget $\epsilon/2$. This operation preserves $\epsilon/2$-differential privacy.
• The Topic Weight Perturbation applies the Laplace mechanism to the $K$ topic
weights of each user's profile $T(u)$. The noise is calibrated by $Lap(4/\epsilon)$ and preserves
$\epsilon/4$-differential privacy for each user. Furthermore, as users' profiles are
independent of one another, perturbing one user's weights has no effect on other users'
profiles, so the Topic Weight Perturbation preserves $\epsilon/4$-differential privacy as a whole.
• The Private Tag Selection processes the exponential mechanism successively. For
one user $u$, each tag in the profile is replaced by a privately selected tag until all
tags are replaced. Each selection is performed on an individual tag, therefore the
selection for each user guarantees $\epsilon/4$-differential privacy. Similar to the previous
operation, every user can be considered a disjoint subset of the entire dataset. Thus, the
Private Tag Selection guarantees $\epsilon/4$-differential privacy.
Consequently, the proposed PriTop algorithm preserves $\epsilon$-differential privacy.
Given a target user $u_a$, the utility level of the proposed PriTop algorithm is
determined by the accuracy of the tagging recommendation, which is highly
dependent on the distance between $P(u_a)$ and $\widehat{P}(u_a)$ [177]. The distance between
$P(u_a)$ and $\widehat{P}(u_a)$ is referred to as the semantic loss [177]:
$$SLoss = \frac{1}{|U|}\sum_{u \in U}\Bigg(\frac{\sum_{t \in P(u_a)} d(t, \hat{t})}{\max d \cdot |T(u)|}\Bigg), \qquad (11.14)$$
where $\hat{t}$ is the new tag replacing the tag $t$.
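Given precomputed tag distances, Eq. (11.14) is a straightforward average; a sketch (interface assumed):

```python
def semantic_loss(profiles, max_d=1.0):
    """Average semantic loss over users (Eq. 11.14).

    profiles: list of per-user lists of distances d(t, t_hat), one entry
    per original tag in P(u); max_d is the maximal tag distance."""
    total = 0.0
    for dists in profiles:
        total += sum(dists) / (max_d * len(dists))
    return total / len(profiles)
```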
If we consider each private step as a query $f$, then the difference between $f(D)$ and
$f(\widehat{D})$ is the semantic loss. We then apply a widely used error utility definition and
demonstrate that the semantic loss is bounded by a certain value $\alpha$ with high probability.
All three private steps affect the semantic loss, but the first step, Private Topic
Model Generation, only affects the distance measurement between tags. Therefore,
we only need to measure the $SLoss_1$ in the perturbation step and the $SLoss_2$ in the
selection step.
Theorem 3 For any user $u \in U$ and for all $\delta > 0$, with probability at least $1-\delta$, the
$SLoss_1$ of the user in the perturbation step is less than $\alpha$. When
$$|T(u)| \geq \frac{K\exp\big(-\frac{\epsilon\alpha}{4}\big)}{\delta},$$
the perturbation operation satisfies $(\alpha, \delta)$-usefulness.
Proof The perturbation adds Laplace noise with the privacy budget $\epsilon/4$ to the weight
of each topic in a user's profile. According to the property of the Laplace distribution
$Lap(b)$:
$$\Pr(|\eta| > t) = \Pr(\eta > t) + \Pr(\eta < -t) = 2\int_t^{\infty}\frac{1}{2b}\exp\Big(-\frac{x}{b}\Big)dx \qquad (11.15)$$
$$= \exp\Big(-\frac{t}{b}\Big). \qquad (11.16)$$
We have
$$\Pr(SLoss_1 > \alpha_a) = \frac{2K\, d(t_{ai}, \hat{t}_{ai})}{\max d \cdot |T(u_a)|}\int_{\alpha_a}^{\infty}\frac{\epsilon}{8}\exp\Big(-\frac{\epsilon x}{4}\Big)dx \qquad (11.17)$$
$$= \frac{K\, d(t_{ai}, \hat{t}_{ai})}{\max d \cdot |T(u_a)|}\exp\Big(-\frac{\epsilon\alpha_a}{4}\Big). \qquad (11.18)$$
As the perturbation step adds new tags or deletes tags, $d(t_{ai}, \hat{t}_{ai})$ will be less
than the maximal value. When we use the JS divergence, the maximal $d(t_{ai}, \hat{t}_{ai})$ is
1, so we obtain the following evaluation on $SLoss_1$:
$$\Pr(SLoss_1 < \alpha_a) \geq 1 - \frac{K\exp\big(-\frac{\epsilon\alpha_a}{4}\big)}{|T(u_a)|}. \qquad (11.19)$$
Let
$$1 - \frac{K\exp\big(-\frac{\epsilon\alpha_a}{4}\big)}{|T(u_a)|} \geq 1 - \delta;$$
thus
$$|T(u_a)| \geq \frac{K\exp\big(-\frac{\epsilon\alpha_a}{4}\big)}{\delta}. \qquad (11.20)$$
The average semantic loss over all users is less than the maximal value $\alpha =
\max_{u_a \in U}\alpha_a$, so we have
$$|T(u)| \geq \frac{K\exp\big(-\frac{\epsilon\alpha}{4}\big)}{\delta}. \qquad (11.21)$$
Theorem 3 reveals that the semantic loss of the perturbation depends on the number of
tags a user has: more tags result in a lower semantic loss.
Theorem 4 For any user $u \in U$ and for all $\delta > 0$, with probability at least $1-\delta$, the
$SLoss_2$ of the user in the private selection step is less than $\alpha$. When
$$Q \leq \frac{\exp\big(\frac{\epsilon}{8}\big)}{1 - \delta\alpha},$$
where $Q$ is the normalization factor that depends on the topic that $t \in T(u)$ belongs
to, the private selection operation satisfies $(\alpha, \delta)$-usefulness.
Proof According to Markov's inequality, we obtain
$$\Pr(SLoss_2 > \alpha_a) \leq \frac{E(SLoss_2)}{\alpha_a}. \qquad (11.22)$$
For each tag $t_{ai}$ in $\widehat{P}_a$, the probability of remaining unchanged in the private selection is
proportional to $\frac{\exp(\epsilon/8)}{Q_i}$, where $Q_i$ is the normalization factor depending on the topic
$t_{ai}$ belongs to. Therefore, we obtain
$$E(SLoss_2) = \sum_{t_i \in T(u_a)}\frac{d(t_{ai}, \hat{t}_{ai})}{\max d \cdot |T(u_a)|}\Bigg(1 - \frac{\exp\big(\frac{\epsilon}{8}\big)}{Q_i}\Bigg). \qquad (11.23)$$
Let
$$1 - \frac{1 - \frac{1}{Q}\exp\big(\frac{\epsilon}{8}\big)}{\alpha_a} \geq 1 - \delta;$$
thus
$$Q \leq \frac{\exp\big(\frac{\epsilon}{8}\big)}{1 - \delta\alpha_a}. \qquad (11.24)$$
The Del.icio.us dataset was retrieved from the Del.icio.us web site by the Distributed
Artificial Intelligence Laboratory (DAI-Labor),⁵ and includes around 132 million
resources and 950,000 users. We extracted a subset with 3000 users, 34,212
bookmarks and 12,183 tags.
The Bibsonomy dataset was provided by the ECML/PKDD Discovery Challenge 2009.⁶
The dataset contains 3000 individual users, 421,928 resources and 93,756 tags.
The MovieLens and Last.fm datasets were both obtained from HetRec 2011,⁷ and
were generated by the Information Retrieval Group at Universidad Autónoma de
Madrid.
⁵ http://www.dai-labor.de/.
⁶ http://www.kde.cs.uni-kassel.de/ws/dc09/.
⁷ http://ir.ii.uam.es/hetrec2011.
$$precision(T(u,r), T'(u,r)) = \frac{|T(u,r) \cap T'(u,r)|}{|T'(u,r)|}. \qquad (11.26)$$
$$recall(T(u,r), T'(u,r)) = \frac{|T(u,r) \cap T'(u,r)|}{|T(u,r)|}. \qquad (11.27)$$
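Both metrics reduce to set operations; a minimal helper (names assumed):

```python
def precision_recall(true_tags, rec_tags):
    """Precision and recall of a recommended tag set (Eqs. 11.26-11.27).

    true_tags: the tags the user actually assigned, T(u, r)
    rec_tags:  the recommended tags, T'(u, r)"""
    true_tags, rec_tags = set(true_tags), set(rec_tags)
    hits = len(true_tags & rec_tags)
    return hits / len(rec_tags), hits / len(true_tags)
```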
The following experiments compare PriTop with tag suppression when $N$
varies from 1 to 10. For PriTop, we chose the number of topics $K = 100$ and test
the performance when $\epsilon = 1$ and $\epsilon = 0.5$. For tag suppression, we fix the elimination
parameter to $\sigma = 0.8$ and $\sigma = 0.6$, which correspond to suppression rates of
0.2 and 0.4, respectively.
Figure 11.3 presents the recall of the recommendation results. It is observed that the
proposed PriTop algorithm significantly outperforms the tag suppression method
on both privacy budgets. Specifically, as shown in Fig. 11.3a, when $N = 1$, PriTop
achieves a recall of 0.0704 with $\epsilon = 1$, which outperforms the result of
tag suppression with $\sigma = 0.6$, 0.0407, by 42.19%. This trend is retained as $N$
increases. For example, when $N = 5$, PriTop achieves a recall of 0.1799
with $\epsilon = 1$, which outperforms the result of tag suppression with $\sigma = 0.6$,
0.113, by 37.19%. When $N$ reaches 10, PriTop still retains a 36.09% higher
recall than tag suppression. Even when we choose the lower privacy budget $\epsilon = 0.5$
and a higher elimination parameter $\sigma = 0.8$, the improvement of PriTop is still
significant: PriTop has a recall of 0.1382, which is 7.67% higher than
tag suppression with a recall of 0.1276. Moreover, the improvement of PriTop is
more obvious when $N = 10$. It achieves recalls of 0.2408 and 0.1882 when $\epsilon = 1$
and $\epsilon = 0.5$, respectively, while tag suppression only achieves recalls of 0.1538
and 0.1881 with $\sigma = 0.6$ and $\sigma = 0.8$. Similar trends can also be observed in
Fig. 11.3b–d. For example, in the MovieLens dataset, when $N = 10$ and $\epsilon = 1.0$,
the recall of PriTop is 0.4445, which is 27.33% higher than tag suppression with
$\sigma = 0.8$. With the same configuration, PriTop is 22.43% and 25.22% higher than
tag suppression in the Last.fm and Bibsonomy datasets. The experimental results show that
the PriTop algorithm outperforms tag suppression for a variety of $N$ values, which implies
that PriTop can retain more useful information for recommendations than simply
deleting tags.
Moreover, it is clear that the performance of PriTop is very close to the non-private
baseline. For example, in Fig. 11.3a, when $\epsilon = 1$, the recall on the Del.icio.us
dataset is 0.2408, which is only 3.00% lower than the non-private recommender
result. Other datasets show the same trend. As shown in Fig. 11.3b–d, with the
same configuration, the PriTop result is 3.62% lower than the non-private result
in the MovieLens dataset, 7.58% lower in the Last.fm dataset and 1.4% lower in
the Bibsonomy dataset. The results indicate that the PriTop algorithm can achieve the
privacy preserving objective while retaining a high accuracy of recommendations.
Figure 11.4 supports the above claims by plotting the precision results, which
also show the improvement of PriTop compared to tag suppression and the high
recommendation accuracy of PriTop. However, the curves in Fig. 11.4d are not
as smooth as the others. This may be caused by the statistical properties of the Bibsonomy
dataset, which contains a large number of tags that appear only once. When the test
samples include a large proportion of these tags, the precision fluctuates.
Fig. 11.3 FolkRank recall result. (a) Del.icio.us. (b) MovieLens. (c) Last.fm. (d) Bibsonomy
11.4 Summary
Fig. 11.4 FolkRank precision result. (a) Del.icio.us. (b) MovieLens. (c) Last.fm. (d) Bibsonomy
and shrinks the randomized domain to decrease noise. The PriTop algorithm
publishes all users' profiles by masking their exact tags and weights under the notion
of differential privacy, and a novel private topic model-based method is used to
structure the tags and to shrink the randomized domain.
The current evaluation only concentrates on one recommender algorithm, FolkRank;
other recommendation techniques, such as tensor decomposition methods,
require further investigation. We would also like to explicitly explore the
semantic similarity of tags to help improve the trade-off between privacy and utility.
Chapter 12
Differentially Location Privacy
12.1 Introduction
12.2 Preliminary
Let $D$ be a location dataset as shown in Table 12.2. Each record contains a user with
all the locations he/she has visited. $U$ is the set containing all the users, and $|U|$ is
the number of users. A location set is denoted as $X$ and each location is a point
$x \in X$.
12.3 Basic Location Privacy Methods 153
$T(u_a)$ is the set of locations where $u_a$ has ever been, and $|T(u_a)|$ is the number
of those locations. A query $f$ is a function that maps the dataset $D$ to an abstract
range $R$: $f: D \to R$. We typically consider $R$ as the set of possible outputs of a
mechanism performed on a dataset. For example, if $f$ is a count query, the abstract
range will fall into the real number domain $\mathbb{R}$. A group of queries is represented by $F$.
Table 12.2 shows an example of a location dataset, in which each user may
travel to different locations and a location can appear several times for the same
user. Apparently, the statistical information on a location dataset, say, a histogram
indicating the frequency of each location, is potentially valuable for location
knowledge discovery. In a data analysis or mining mechanism, aggregate results are
abstracted from the statistical information to respond to queries such as "which
locations are the most attractive?" and "how many users requested LBS at an
attractive location?"
To achieve privacy preservation, answers to these queries should be obfuscated.
The goal of this work is to propose an efficient method for location data release with
differential privacy while maintaining adequate utility.
There are two possible queries for a location-based service: snapshot and continuous
queries [207]. A snapshot query is a request submitted once by the user. For
example, Where is the closest Sushi bar? A continuous query is submitted at discrete
time points by the same user. For example, Continuously send me gas price coupons
as I travel along the interstate highway. Both types of queries are prevalent nowadays in
location-based systems. Location privacy preservation ensures that, from these queries,
an adversary cannot infer the current location (from a snapshot query) or the trajectory
(from a continuous query) of the user. For location-based services, location privacy
calls for methods that preserve as much of the quality of the desired services as possible,
while hindering the undesired tracking capacities of those services.
154 12 Differentially Location Privacy
Some types of snapshot queries can use existing differential privacy mechanisms.
For example, when a location based server would like to hide the number of
people in a particular region, a typical range query publishing mechanism can be
used. Cormode [49] applied spatial decomposition methods, which is a type of
dataset partitioning mechanism, to decrease noise. They instantiated a hierarchical
tree structure to decompose a geometric space into smaller areas with data points
partitioned among the leaves. Noise is added to the count for each node. Similarly,
Zhang et al. [253] applied spatial decomposition methods in the problem of private
location recommendations, which is an extension of range queries. A quadtree is used
to partition the region and noise is added to protect users' historical trajectories.
Some location-based applications need to hide the exact locations of individuals.
Chatzikokolakis et al. [36] proposed a new notion, geo-indistinguishability, which
protects an individual's exact location while disclosing enough location information
to obtain the desired service. Its main idea is a differential privacy level that scales
with the radius around the individual's location: for any radius $r > 0$, the individual
has $\epsilon r$-privacy within $r$.
Most of the aforementioned works are concerned with applications involving aggregate
location information about several people. For the protection of a single
person's location, Dewri [56] incorporated differential privacy into k-anonymity by
fixing an anonymity set of $k$ locations. The proposed method requires that the
probability of outputting the same obfuscated location from any set of anonymized
k locations should be similar. This property is achieved by adding Laplace noise to
each coordinate independently.
Geo-indistinguishability considers a user querying a provider for a nearby sushi bar in a private way [36], i.e., by disclosing some approximate information z instead of his exact location x. The method addresses the question: what level of privacy guarantee can the user expect in this scenario? They considered the level of privacy within a radius r. A user has ℓ-privacy within r, where ℓ represents the user's privacy level, proportional to the radius. The smaller ℓ is, the higher the privacy. Therefore, Andrés et al. define geo-indistinguishability as follows: a mechanism satisfies ε-geo-indistinguishability iff for any radius r > 0, the user has εr-privacy within r.
This definition shows that a user is protected within any radius r, but with a level ℓ = εr that grows with the distance. Suppose ε is fixed; when r = 1 km, ℓ is small, which means the user enjoys a higher privacy level. Figure 12.1 illustrates the idea of privacy levels decreasing with the radius r.
The key idea of geo-indistinguishability is to make two points in radius r
indistinguishable, which is defined by a probabilistic model.
12.3 Basic Location Privacy Methods
Fig. 12.1 Geo-indistinguishability privacy level
Suppose X is a set of points of interest, which are the user's possible locations, and Z is the set of the user's reported values, which can be arbitrary. Assume the user has visited x ∈ X but reports z ∈ Z. The attacker's background information is modeled by a prior distribution π on X, where π(x) is the probability assigned to location x. As the perturbed location z can be obtained by adding random noise to the actual location x, z can be considered a probabilistic value. The mechanism K is a probabilistic function assigning to each location a probability distribution on Z. For example, K(x)(Z) is the probability that the point reported for x lies in Z ⊆ Z. Each observation Z ⊆ Z of a mechanism K induces a posterior distribution σ = Bayes(π, K, Z) on X, defined as

σ(x) = K(x)(Z) π(x) / Σ_{x′} K(x′)(Z) π(x′).   (12.1)
Fig. 12.2 Geo-indistinguishability definition
The distance between the two distributions, K(x) and K(x′), should be less than ℓ. Figure 12.2 illustrates the definition. When x and x′ are fixed, two circles are used to represent the distributions of K(x) and K(x′) over the point set Z. The distance between K(x) and K(x′) should be bounded by the privacy level ℓ. When dealing with two tuples of points, x₁ = (x₀, …) and x₂ = (x₀′, …), the distance between the two tuples is defined as the maximum distance between corresponding points in the two tuples.
Definition 12.1 (Geo-Indistinguishability) A mechanism K satisfies ε-geo-indistinguishability iff for all x, x′, the distance between the output distributions K(x) and K(x′) is bounded by ε·d(x, x′). Such a mechanism can be obtained by reporting a point drawn from a planar Laplace distribution centred at the true location x₀:

K(x₀)(x) = (ε²/2π) e^{−ε d(x₀,x)},   (12.5)

where ε²/2π is a normalization factor.
Step 2: Discretization Step to Achieve Geo-Indistinguishability in the Discrete Domain One common way to analyze trajectories on a continuous domain (and to limit the size of the model) is to discretize the space. After defining a grid G of discrete Cartesian coordinates, the reported point x̂ drawn from K(x₀) is remapped to the closest point on G.
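The two steps can be sketched as follows. The radial marginal of the planar Laplace density is a Gamma(2, 1/ε) distribution, so a report can be drawn by sampling a uniform angle and a Gamma radius, then snapping to the grid; all names and parameter values are illustrative:

```python
import math
import random

def planar_laplace(x, y, epsilon):
    """Sample a reported point from a planar Laplace distribution
    centred at the true location (x, y)."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    # The radius of a planar Laplace sample follows Gamma(shape=2, scale=1/eps).
    r = random.gammavariate(2.0, 1.0 / epsilon)
    return x + r * math.cos(theta), y + r * math.sin(theta)

def snap_to_grid(x, y, step):
    """Step 2: remap the continuous report to the closest point of a
    discrete grid G with the given cell width."""
    return round(x / step) * step, round(y / step) * step

# Hypothetical usage: report a location with epsilon = 0.01 per metre
# and discretise onto a 50 m grid.
zx, zy = planar_laplace(1200.0, 800.0, epsilon=0.01)
gx, gy = snap_to_grid(zx, zy, step=50.0)
```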
The trajectory has several properties that make privacy preservation a challenge [97]. Firstly, individuals' trajectories are highly unique. This indicates the difficulty of achieving properties such as k-anonymity, where at least k individuals must have the same trajectory. Secondly, the trajectory is predictable in most cases. The release of an individual's location data can make him susceptible to privacy attacks. Existing techniques for privacy-preserving location data release are derived from partition-based privacy models, which have been shown to fail to provide sufficient privacy protection. Differential privacy is considered a suitable privacy model that can provide a rigorous privacy guarantee.
Discretization
The size of v in grid G is tricky, as there are several problems: (1) a user may travel at different speeds, so some anchor points may not be estimated on the same scale; (2) a large v leads to huge utility loss, while a small v leads to a significant number of parameters in the Markov process. Therefore, a hierarchical reference system, shown in Fig. 12.5, is applied to vary the size of v, with geometrically increasing resolutions. Given a point x, there are two ways to map x to the next anchor point a_{i+1}: (1) a_{i+1} is a neighboring cell of a_i in the same G; or (2) a_{i+1} is the parent or a child of a_i in a different G. After mapping the trajectory onto the grids, step 1 outputs a mapped trajectory such as t = a_1, …, a_d, in which anchor points may be derived from different grids with various v.
Step 2 selects the private model. One particular trajectory may have various mappings from the first step. This step selects the most suitable perturbed trajectory based on the transition probabilities of the Markov process. To preserve privacy, Laplace noise is added to the count of locations, which changes the transition probabilities.
As the noise may hide the direction of the trajectory, weights are calculated from previous directions within a window of size w. A weighted sampling process then helps to improve the utility.
When addressing privacy issues in location datasets, the main challenge is to obtain a trade-off between the level of privacy and the utility of the released dataset. Most traditional privacy-preserving methods lack a rigorous privacy guarantee on the released dataset. Hence, Xiong et al. [242] proposed a Private Location Publishing algorithm, aiming to preserve privacy for individuals while maximizing the utility of the released dataset. Specifically, they attempt to address the following issues:
• How to decrease the large magnitude of noise when using differential privacy? In a sparse location dataset, the key to decreasing the noise is to shrink the scale of the randomization mechanism. Previous work focuses on methods that generalize a particular location to a vague one, but this results in high utility loss.
• How to calibrate the sensitivity for a hierarchically structured dataset? As mentioned earlier, the traditional sensitivity will be very large due to the large distance between locations.
• How to design a release mechanism that retains the semantic information of the
location data?
Before proposing the hierarchical sensitivity, they first analyze the source of redundant noise according to the definition of sensitivity in differential privacy. For a location dataset, the distance between locations is continuous; in order to preserve privacy, the sensitivity of a query has to be calibrated by the maximal distance between locations, which could completely mask the true distance and render the dataset useless. They observe that a location dataset maintains a hierarchical structure, which defines different semantics at each level. For example, country, city and street can be considered levels of the location dataset, and each level has its own semantics. In this hierarchical structure, users may have diverse requirements at different levels of location. Some of them just need to hide the street or the city; few of them need to hide the country. Hence, they can associate the hierarchical structure of the dataset with users' various requirements.
12.4 Hierarchical Snapshot Location Publishing
Based on this observation, they define the hierarchical sensitivity for the differential privacy mechanism; it is calibrated by the maximal distance between two locations on the same level of the structure. Let L represent the level of the structure that the user intends to preserve; they then have the following definition:

Definition 12.3 (Hierarchical Sensitivity) For a given L, the hierarchical sensitivity of L is

HS_L = max_{t_i, t_j ∈ L} d(t_i, t_j).
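The hierarchical sensitivity of a level, i.e. the maximal distance between two locations on that level, can be computed with a simple pairwise scan. A sketch, assuming planar (x, y) coordinates; the function name and layout are illustrative:

```python
import math

def hierarchical_sensitivity(locations_at_level):
    """Hierarchical sensitivity for a chosen level L: the maximal
    Euclidean distance between any two locations on that level.
    `locations_at_level` is a list of (x, y) pairs."""
    hs = 0.0
    for i, (xi, yi) in enumerate(locations_at_level):
        for xj, yj in locations_at_level[i + 1:]:
            hs = max(hs, math.hypot(xi - xj, yi - yj))
    return hs

# Example: streets inside one city block have far smaller pairwise
# distances than arbitrary locations in the full dataset.
streets = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
hs_street = hierarchical_sensitivity(streets)  # 10.0
```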
The PriLocation algorithm aims to publish all users' profiles by masking the exact locations and weights under the notion of differential privacy. Three private operations are introduced to ensure that each individual in the released dataset cannot be re-identified by an adversary.
• Private Location Clustering: This creates location clusters and masks the exact
number of locations as well as the centers of each cluster. From the clustering
output, the adversary cannot infer to which cluster a location exactly belongs.
• Cluster Weight Perturbation: This operation masks the weights of the locations
in a user’s profile to prevent an adversary from inferring how many locations a
user has visited in a certain cluster.
• Private Location Selection: This aims to mask a user's profile so that an adversary cannot infer the locations visited by this user.
Based on these private operations, the proposed PriLocation algorithm generates a sanitized location dataset; its pseudocode is provided in Algorithm 1. Firstly, step 1 divides the privacy budget into three parts, corresponding to the three private operations. Step 2 groups all locations into clusters, and in step 3 the weight for each cluster is perturbed by Laplace noise. After privately selecting the new locations to replace the original ones in step 4, the sanitized dataset D̂ is released in the last step.
In this subsection, they describe the Private Location Clustering operation, which privately groups locations into clusters. Private Location Clustering categorizes unstructured locations on each level into groups to shrink the randomized domain for privacy purposes. According to the notion of differential privacy, this operation ensures that deleting a location will not significantly affect the clustering results. This means that an adversary cannot infer from the clustering output to which group a location belongs. This objective is achieved by adding Laplace noise to the distance measurement during the clustering process.
As one of the most popular models, k-means is easy to extend for privacy preservation, especially for differential privacy [28]. Therefore, they apply k-means as the baseline clustering algorithm and introduce differential privacy to generate a private clustering algorithm. Blum et al.'s initial work on the SuLQ framework claims that a private clustering method should mask the cluster centers and the number of records in each cluster. Following this work, they conceptualize Private Location Clustering in two steps:
• Defining a quantitative measure of distance between locations.
• Adding noise in each iteration to privately cluster all locations.
As a location is defined by its longitude and latitude, the distance between locations can be measured by the Euclidean distance:

d(t_i, t_j) = √((x_i − x_j)² + (y_i − y_j)²).   (12.8)
The next step introduces differential privacy into k-means, which is essentially an iterative algorithm for grouping data observations into clusters. Let c_l denote the center of the cluster C_l. Equation (12.9) shows the objective function g, which measures the total distance between each location and the cluster center it belongs to:

g = Σ_{i=1}^{|T|} Σ_l γ_il d(t_i, c_l),   (12.9)

where γ_il indicates whether location t_i belongs to cluster C_l. The noisy objective is

Ĝ = Σ_{i=1}^{m} Σ_l γ_il d(t_i, c_l) + Laplace(2p·HS/ε),   (12.11)

where p is the number of iterations.
After generating the cluster-based user profile P_C(u_a), Laplace noise is added to mask the counts of locations in each cluster:

Ŵ_C(u_a) = W_C(u_a) + Laplace(4/(ε·|T(u_a)|)).   (12.12)
Noise added to the weight W_C(u_a) implies a revision of the locations in T_C(u_a). Positive noise indicates that new locations are added to T_C(u_a), while negative noise indicates that some locations are deleted from the list. For positive noise in the cluster C_l, the operation chooses the location closest to the cluster center. For negative noise, the operation deletes the location with the largest distance to the cluster center. Namely,

T̃_{C_l}(u_a) = T_{C_l}(u_a) + t_new,   (12.13)
T̃_{C_l}(u_a) = T_{C_l}(u_a) − t_delete.   (12.14)
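The perturb-then-reconcile step can be sketched as follows, assuming weights are raw visit counts with unit sensitivity (the book's calibration over normalized weights would change only the noise scale); all names are illustrative:

```python
import math
import random

def laplace(scale):
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def perturb_cluster(locations, center, weight, epsilon):
    """Sketch of Cluster Weight Perturbation: add Laplace noise to the
    visit count of a cluster, then reconcile the location list with the
    noisy count. Positive noise duplicates the location closest to the
    cluster centre; negative noise drops the farthest locations.
    The scale 4/epsilon reflects the epsilon/4 budget for this step."""
    noisy_weight = max(0, round(weight + laplace(4.0 / epsilon)))
    locs = sorted(locations,
                  key=lambda p: math.hypot(p[0] - center[0], p[1] - center[1]))
    diff = noisy_weight - len(locs)
    if diff > 0 and locs:
        locs += [locs[0]] * diff      # positive noise: add nearest location
    elif diff < 0:
        locs = locs[:noisy_weight]    # negative noise: drop farthest ones
    return locs, noisy_weight
```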
After perturbation, they use P̃_C(u_a) = ⟨T̃_C(u_a), Ŵ_C(u_a)⟩ to represent the noisy cluster-based user profile. However, P̃_C(u_a) still has a high probability of being re-identified because it retains a major part of the original locations. The next operation replaces all locations in T̃_C(u_a) to preserve privacy.
The Private Location Selection operation replaces original locations with selected new locations. The challenge is how to select a new location from the related clusters. For a location t_i ∈ T̃_{C_l}(u_a), uniformly random selection within T̃_{C_l}(u_a) is unacceptable due to the significant loss of utility. The intuitive approach to retaining utility is to replace t_i with the most similar location. However, this approach is insecure, because the adversary can easily identify the most similar location using simple statistical analysis. Consequently, Private Location Selection needs to: (1) retain the utility of locations, and (2) mask the similarities between locations.
To achieve these requirements, Private Location Selection adopts the exponential mechanism to privately select locations from a list of candidates. Specifically, for a particular location t_i, the operation first locates the cluster C_l to which it belongs, and all the locations in T̃_{C_l}(u_a) are then included in a candidate list I. Each location in I is associated with a probability based on the score function and its related sensitivity. The selection of locations is performed based on the allocated probabilities. The score function is defined by the distance between locations. They define the score function q for a candidate location t_j with respect to the target location t_i as

q(t_i, t_j) = HS − d(t_i, t_j).
They then select a location t_j from C_l to replace t_i according to this probability. Eventually, the algorithm outputs T̂_C(u_a). The pseudocode of Private Location Selection is presented in Algorithm 3.
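The exponential-mechanism selection can be sketched as follows, using the score HS − d and the exponent scaling that appears in the normalization factor of Eq. (12.21); all parameter names are illustrative:

```python
import math
import random

def select_replacement(target, candidates, epsilon, hs, profile_size):
    """Sketch of Private Location Selection: pick a replacement for
    `target` from the candidate list via the exponential mechanism.
    The score of a candidate is hs - d(target, candidate), so nearer
    locations are exponentially more likely to be chosen."""
    def score(t):
        return hs - math.hypot(t[0] - target[0], t[1] - target[1])
    weights = [math.exp(epsilon * score(t) / (8.0 * profile_size * hs))
               for t in candidates]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for t, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return t
    return candidates[-1]  # guard against floating-point rounding
```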
end for
3. Select a location t_j from C_l to replace t_i according to the probability;
end for
4. Output T̂_C(u_a);
Theorem 5 For any user u_a ∈ U and for all δ > 0, with probability at least 1 − β, the distance error of the released dataset is less than α. The lower bound of α is presented by Eq. (12.19):

α ≥ max_{u_a ∈ U} ( Σ_{t_i ∈ T̂_C(u_a), t_j ∈ C_{t_i}} E(d(t_i, t_j)) ) / ( ζ_{HS} · |T̂_C(u_a)| · β ).   (12.19)

By Markov's inequality,

Pr(DE_{u_a} > α_a) ≤ E(DE_{u_a}) / α_a
⇒ Pr(DE_{u_a} ≤ α_a) > 1 − E(DE_{u_a}) / α_a.

According to Definition 14.9, they estimate Pr(DE_{u_a} ≤ α_a). Letting 1 − E(DE_{u_a})/α_a = 1 − β, they have

α ≥ max_{u_a ∈ U} ( Σ_{t_i ∈ T̂_C(u_a), t_j ∈ C_{t_i}} E(d(t_i, t_j)) ) / ( ζ_{HS} · |T̂_C(u_a)| · β ),   (12.20)

where ζ is the normalization factor depending on the cluster to which t_i belongs.
The proof shows that the distance error for each user mainly depends on the privacy budget ε and the normalization factor ζ_i. According to Eq. (12.16), the normalization factor ζ_i for a particular location t_i is defined as

ζ_i = Σ_{t_i ∈ T̂_C(u_a), t_j ∈ C_{t_i}} exp( ε(HS − d(t_i, t_j)) / (8 |T̂_C(u_a)| HS) ).   (12.21)
Therefore, the size of ζ_i depends on the cohesion inside the cluster C_{t_i}: compact cohesion results in a small ζ_i and less distance error. Further analysis shows that the cohesion is determined by the privacy budget in the Private Location Clustering operation. It can be concluded that the privacy budget affects the utility level of PriLocation.
The PriLocation algorithm contains three private operations: Private Location Clustering, Cluster Weight Perturbation and Private Location Selection. The privacy budget ε is consequently divided into three pieces, as illustrated in Table 12.3. Because the Private Location Clustering operation is performed on the entire dataset and will affect all users, they allocate more privacy budget (ε/2) to it than to the other two operations. The other two operations only operate on individuals, so less privacy budget is required. They allocate the remaining ε/2 to these two operations (ε/4 for each).
Based on the privacy compositions and the privacy budget allocation in Table 12.3, they measure the privacy level of the algorithm as follows:
• The Private Location Clustering operation is performed on the whole dataset with the privacy budget ε/2. According to the parallel composition, this operation preserves ε/2-differential privacy.
• The Cluster Weight Perturbation applies the Laplace mechanism to the weights of clusters. The noise is calibrated by Lap(4/(ε·|T(u)|)) and preserves ε/4-differential privacy for each user. Furthermore, as each user's profile is independent, replacing a user's locations has no effect on other user profiles. The Cluster Weight Perturbation preserves ε/4-differential privacy as a whole.
• The Private Location Selection adopts the exponential mechanism. For one user u, each location in the profile is replaced by a privately selected location until all locations have been replaced. Each selection is performed on an individual location; therefore, according to parallel composition, for each user the selection guarantees ε/4-differential privacy. Similar to the previous operation, every user profile can be considered a subset of the entire location dataset. Thus, the Private Location Selection guarantees ε/4-differential privacy.
Consequently, they conclude that the proposed PriLocation algorithm preserves ε-differential privacy.
To maintain consistency with previous research, they thoroughly compare the distance error of PriLocation with the traditional differential privacy (DP) algorithm as well as the k-anonymity approach on four datasets. For the PriLocation algorithm, they first fix the street privacy level and set ε from 0.1 to 1.0 with a step of 0.1. They set the number of clusters η = 10, 40 and 80. For the traditional differential privacy mechanism, they also set ε from 0.1 to 1.0 with a step of 0.1. As the k-anonymity approach adopts k rather than ε to control the privacy level, they set two empirical values, k = 10 and 50.
Figure 12.7 shows the results on the four datasets. It can be observed that the distance error of the PriLocation algorithm over a variety of ε is less than that of traditional differential privacy with different privacy budgets, which indicates that PriLocation outperforms naive differential privacy on all the datasets. Specifically, the PriLocation algorithm obtains a considerably lower distance error when ε = 1. For example, in Fig. 12.7a, when η = 80 and ε = 1, the distance error is 0.0239, which is 69.14% lower than that of traditional differential privacy. Even at a higher privacy level (ε = 0.1), the distance error is still lower than that of traditional differential privacy, by 67.81%, when η = 80. This trend holds for other values of ε, illustrating the effectiveness of PriLocation in terms of the distance information retained in the randomized dataset. Similar trends can also be observed in Fig. 12.7b–d. All figures show that PriLocation obtains a stably reduced distance error compared with naive differential privacy.

Fig. 12.7 Distance error on different datasets with street-level privacy. (a) Distance error in Flickr. (b) Distance error in GeoLife. (c) Distance error in Div400. (d) Distance error in Instagram

¹ http://instagram.com.
They also compare the PriLocation algorithm with the k-anonymity approach. Figure 12.7 shows that k-anonymity performs diversely on different datasets. For the Flickr dataset, Fig. 12.7a shows that the distance error of k-anonymity is 0.2446 when k = 10. This result outperforms PriLocation when ε < 0.4. However, when ε > 0.4 and η = 80, PriLocation outperforms k-anonymity with k = 10. When k = 50, k-anonymity has a higher distance error (DE = 0.6980) than all settings of PriLocation. In Fig. 12.7b, when k = 10, k-anonymity still has a higher distance error compared with PriLocation. When k = 50, k-anonymity performs poorly (DE = 0.7802) compared with PriLocation. They observe that the performance of k-anonymity depends on the size of the dataset: a larger dataset with a smaller value of k performs better than a smaller dataset with a larger k. The Flickr dataset has 1692 individual users while GeoLife only contains 182, so the performance of k-anonymity on Flickr is better than on GeoLife with the same value of k. This observation is confirmed by Fig. 12.7c, d, which show that, compared with PriLocation, k-anonymity achieves a relatively lower distance error on Instagram but a higher one on Div400; this is because Instagram has 2015 individuals while Div400 only has 396. These results also demonstrate that PriLocation has a stable performance that is independent of the size of the dataset.
12.5 Summary
In the last few years, providers of location-based services have collected a wealth of data on the movement of individuals and populations, owing to the popularity of such LBSs. However, the places that people visit can disclose extremely sensitive information, such as their daily behaviours, home and work locations, preferences, and habits. Location privacy is, therefore, an urgent issue that needs to be addressed. The main aim of a location-private LBS is to preserve as much of the quality of the desired services as possible, while hindering the undesired tracking capacities of those services.
Although differential privacy has been considered a promising solution for providing a rigorous and provable privacy guarantee for data publishing, the straightforward differential privacy solution for location data still suffers from important deficiencies: (1) it introduces a large volume of noise due to the sparsity of location datasets, which significantly affects the performance; (2) the definition of sensitivity in differential privacy cannot be used directly in location data release, because it causes unacceptable noise.
This chapter describes three methods that apply differential privacy, from various angles, to achieve the aim of a location-private LBS. Finally, a private publishing algorithm is proposed to randomize a location dataset under differential privacy, with the goal of preserving users' identities and sensitive information. The algorithm aims to mask the exact locations of each user, as well as the frequency with which the user visits those locations, under a given privacy budget.
Chapter 13
Differentially Private Spatial Crowdsourcing
13.1 Introduction
Crowdsourcing has been a successful business model ever since it was first introduced in 2006 [179]. It refers to employers outsourcing tasks to a group of people [179]. With the increasing use of smart equipment with multi-modal sensors, spatial crowdsourcing (SC) [119] has emerged as a particular form of crowdsourcing in which workers perform their assigned tasks at an exact location. The tasks might include taking photos, recording videos, reporting temperatures, etc. [216].
Within crowdsourcing platforms, tasks are uploaded by a requester to a spatial crowdsourcing server (SC-server), which assigns the tasks to registered workers using task-assignment algorithms. Workers submit their locations, in longitude/latitude form, to the SC-server, and, taking the probability of a worker accepting a task into consideration, the server assigns each task to the most suitable workers to ensure a high assignment success rate with minimal cost. However, this process poses a privacy threat, as the workers' location data may be released by the SC-server for further analysis, thus compromising the workers' privacy. Studies have demonstrated that published location information may be exploited to deduce sensitive personal information such as home location, political views, and religious inclinations [12, 100, 154].
Differential privacy has attracted attention in recent years in spatial crowdsourc-
ing. Some algorithms, such as DPCube [238] and AG [145], have been proposed
that allow the release of spatial datasets, while preserving privacy. The central
idea of these algorithms is to decompose a large map into many cells of varying
sizes and add noise to the count of users in each cell. These algorithms work well in scenarios that answer range queries on location datasets; however, many weaknesses remain when they are applied directly to SC.
First, existing methods hide a worker’s exact location in a cloaked region
and suppose that workers are distributed uniformly within it. We argue that this
assumption is invalid unless the size of the region is very small. In large regions,
people may gather into small areas, because aggregation is a basic feature of human
society. The algorithms also tend to decompose the whole map into a group of cells
of varying sizes, yet maintain a similar worker count for each cell. Given a task
in a large cell, an uneven distribution will cause a huge error when the density of
workers around the task point is evaluated.
Second, existing methods generally consist of two steps—decomposing the map
into a grid and adding noise for each cell—and each step consumes a certain amount
of the privacy budget. Since the budget used for the noise-adding step is only a
portion of the total budget, the noise volume will be significant.
To fix these weaknesses, this chapter first presents a basic method of differentially private spatial crowdsourcing. Based on that, the chapter presents a reward-based spatial crowdsourcing scheme, in which workers accept tasks in exchange for rewards while, at the same time, differential privacy is preserved. Table 13.1 shows the application settings.
SC can be categorized into two modes, worker selected tasks mode (WST) and
server assigned tasks mode (SAT) [119].
In WST, the requester uploads tasks to the SC-server, and workers select their favorite tasks autonomously. WST is easy to implement but inefficient. Given that workers always select the nearest tasks, tasks with few workers nearby are accepted with low probability, which leads to a low global task acceptance rate. SAT is much more efficient. To ensure a high assignment success rate, workers must report their locations to the SC-server, and tasks are assigned to the most suitable workers according to their positions. Obviously, due to location exposure, this mode carries an inherent privacy concern, which is the main concern of the community.
13.2 Basic Method
location using a PSD. The PSD is then published to the SC-server instead of exact
locations. Finally, the SC-server assigns the tasks to the workers according to the
published PSD.
The entire process can be divided into two steps. Step 1: building the worker private spatial decomposition (PSD). A PSD should be designed carefully for task assignment at the SC-server. They used a two-level adaptive grid with variable cell granularity to avoid over- or under-partitioning the space. The space is partitioned at a coarser granularity at the first level, and each cell is further partitioned based on the density of the workers. Figure 13.2 shows an example of the adaptive grid method. The first level uniformly partitions the region, while the second level further partitions each cell into smaller regions according to the number of workers it contains. At this stage, Laplace noise is added to the count of workers at each level.
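A rough sketch of Step 1, assuming a square region, an illustrative first-level granularity, a 2×2 second-level split for dense cells, and an even budget split of ε/2 per level; the threshold and all names are assumptions, not the paper's exact parameters:

```python
import random

def laplace(scale):
    # Laplace(0, scale) sampled as the difference of two exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def build_psd(workers, side, level1, epsilon, threshold=20):
    """Two-level adaptive grid: partition [0, side)^2 coarsely, then
    subdivide only the dense cells, adding Laplace noise to every
    published count."""
    cell = side / level1
    psd = {}
    for i in range(level1):
        for j in range(level1):
            inside = [(x, y) for (x, y) in workers
                      if i * cell <= x < (i + 1) * cell
                      and j * cell <= y < (j + 1) * cell]
            noisy = len(inside) + laplace(2.0 / epsilon)  # epsilon/2 per level
            if noisy > threshold:
                # Dense cell: subdivide 2x2 and publish noisy sub-counts.
                half = cell / 2.0
                subs = {}
                for di in (0, 1):
                    for dj in (0, 1):
                        x0, y0 = i * cell + di * half, j * cell + dj * half
                        n = sum(1 for (x, y) in inside
                                if x0 <= x < x0 + half and y0 <= y < y0 + half)
                        subs[(di, dj)] = n + laplace(2.0 / epsilon)
                psd[(i, j)] = subs
            else:
                psd[(i, j)] = noisy
    return psd
```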
Step 2: task assignment. When a request for a task is posted, the SC-server queries the PSD and determines a geocast region, in which the task is disseminated. The goal is to obtain a high success rate for task assignment while at the same time preserving the system utility. To achieve this goal, three utility measurements have been proposed:
• Assignment Success Rate (ASR): the ratio of tasks accepted by a worker to the total number of task requests.
• Worker Travel Distance (WTD): the total distance workers need to travel to complete the tasks.
• Average Number of Notified Workers (ANW): a significant metric for measuring the overhead of the system.
Guided by these utility measurements, the geocast region should (1) contain sufficient workers so that the task is accepted with high probability, and (2) remain small. The geocast region construction algorithm initially selects a cell that covers the task and determines its utility values. If the utility values are lower than the thresholds, the algorithm expands the region by adding neighboring cells until the utility of the obtained region exceeds the utility threshold. The construction algorithm is a greedy heuristic, as it always chooses the candidate cell that produces the highest utility increase at each step.
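The greedy expansion just described can be sketched as follows, with the utility crudely approximated by the total noisy worker count of the region; the threshold, cell map, and names are illustrative assumptions:

```python
def build_geocast_region(cells, start, utility_threshold):
    """Greedy geocast construction: start from the cell covering the
    task and repeatedly add the neighbouring cell with the highest
    noisy worker count until the region's total count exceeds the
    threshold. `cells` maps (i, j) -> noisy count."""
    region = {start}
    utility = cells.get(start, 0.0)
    while utility < utility_threshold:
        frontier = {(i + di, j + dj)
                    for (i, j) in region
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))} - region
        frontier = {c for c in frontier if c in cells}
        if not frontier:
            break  # no more cells to add
        best = max(frontier, key=lambda c: cells[c])
        region.add(best)
        utility += cells[best]
    return region
```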
To et al. [217] presented a toolbox, PrivGeoCrowd, to display the framework in a visual, interactive environment. System designers can use the toolbox to obtain satisfactory results by tuning the parameters and the allocation strategy.
Though spatial crowdsourcing is a hot topic, only a limited number of studies have
focused on its rewards. Dang et al. [52] considered the types of tasks given the varied
expertise of workers. They defined a maximum-task minimum cost assignment
problem and, by solving this problem, tasks could be assigned to the most expert
workers, while minimizing the given reward. Wu et al. [233] concentrated on
the relationship between crowdsourcing quality and payment. Their analyses are
based on a 2D interior design task, and the evaluation results show that increasing
monetary rewards does not significantly improve the average design quality; rather, it increases the probability of excellent solutions from the crowd. Luo et al. [150] designed an incentive mechanism based on all-pay auctions. They transformed the task assignment problem into a profit maximization problem and introduced a contribution-dependent prize to model workers' contributions. Similarly, Shah-Mansouri et al. [199] proposed an auction-based incentive mechanism, ProMoT, to maximize the profit of the platform while providing satisfying rewards to smartphone users. Thepvilojanapong et al. [215] took advantage of microeconomics, proposing SenseUtil, a participation-aware incentive mechanism with various utility functions to increase the number of sensors while keeping payments low.
While all the aforementioned works make contributions to their specified scenarios, few consider privacy issues when assigning a task. Moreover, most of these works regard reward allocation as an optimization problem and do not integrate it well with task assignment.
Xiong et al. [241] presented a novel method for reward-based SC with a differential privacy guarantee that uses a contour plot as a new sanitized representation of the location distribution. They also proposed a reward-based allocation strategy for achieving a specified assignment success probability in reward-based SC applications.
The major idea of this method is to split the entire area into smaller regions and apply a differential privacy mechanism to each region. As mentioned in Sect. 13.1, splitting regions is a non-trivial task. Previous literature assumes that workers' locations are distributed uniformly, but this may not be the case in real-world scenarios. The proposed method adopts a contour plot to describe workers' locations and to determine a task assignment radius. In this way, the method arranges tasks with a better success rate and a privacy guarantee.
They follow the framework provided by To et al. [217], which is shown
in Fig. 13.1. The CSP collects the true locations of workers and generates a contour
plot with a DP guarantee. The plot is then submitted to the SC-server. Because
the exact worker locations will not be submitted, the privacy of each worker is
preserved. The SC-server estimates the location region for each task, according to
the contour plot (a circle is used to represent this region), and all workers within the
circle are selected for the task.
To perform task assignment and guarantee a high ASP, the SC-server must be
aware of workers’ location distribution, but submitting accurate worker locations
may violate personal privacy. Traditional privacy-preserving algorithms assume that
workers are uniformly distributed in an area, yet in real-world scenarios this may
not be the case.
13.3 Differential Privacy in Reward-Based Crowdsourcing 179
where S is the sensitivity. After this step, a noisy count n_noisy is obtained for each
unit cell. Finally, a contour plot can easily be created by connecting the cells with
the same noisy count.
Figure 13.3 illustrates an example of building a contour plot with 6 × 6 unit cells.
The entire area is split into 6 × 6 disjoint cells, shown with dashed lines. The exact
number of workers in each cell is labeled at the bottom right. Noise is added to each
count during the privacy-preserving process, resulting in the noisy counts shown
in Fig. 13.3b. Finally, the cells with the same noisy count are connected, e.g.,
the cells with noisy counts 36, 58, and 65 in Fig. 13.3c. In this way, the contour plot in
Fig. 13.3d is generated. This contour plot helps the server to allocate tasks with a
differential privacy guarantee.
As an example, a worker distribution and its counts are shown in Fig. 13.4.
The horizontal and vertical axes represent longitude and latitude. The solid
curves are contour lines with positive noisy counts, and the exact noisy counts are
labeled on these contour lines.
It can be proven that the proposed contour method provides a solid DP guarantee.
As shown in Fig. 13.3, the entire data domain is split into disjoint cells. Laplace
Fig. 13.3 Generation of a contour plot with 6 × 6 cells. (a) Exact count. (b) Noisy count.
(c) Contour lines. (d) Contour plot
noise is added to the count of workers in each cell i independently, with a privacy
budget of ε_i = ε. According to parallel composition (Definition 2.1), the sequence
of Laplace mechanisms over the set of disjoint cells provides max{ε_i}-differential
privacy. As each ε_i is equal to ε, the contour method is therefore ε-differentially
private.
The contour plot is constructed at the CSP and submitted to the untrusted SC-
server, serving as an approximate distribution of workers. It prevents the SC-server
from identifying any worker by his/her location, while still allowing the SC-server
to assign tasks to workers according to the contour plot.
After splitting the entire area into small cells, a radius, r, and a reward, w, are
specified for each task. These two factors have a great impact on both the acceptance
probability and the ASP.
Intuitively, whether a worker will accept a task mainly depends on the reward, w,
and the distance, d. Therefore, the acceptance probability p (0 ≤ p ≤ 1) can be
calculated by a function, f, of d and w, as shown in Eq. (13.2):

p = f(w, d). (13.2)

To model f, consider the hyperbolic tangent function

y = th(x) = (e^x − e^{−x}) / (e^x + e^{−x}). (13.3)
It has the natural property that y ∈ [0, 1) when x ∈ [0, +∞), with the horizontal
asymptote y = 1 as x → +∞. This means the function can be regarded
as a mapping from non-negative numbers to probabilities. Hyperbolic tangent
functions have been widely investigated because of these properties. In the field of
engineering, Wang et al. [224] proposed a nonlinear transformation strategy based
on hyperbolic tangent functions. Given a specific hyperbolic tangent function, the
curve can be determined, and the peak-to-average power ratio of transmitted signals
in orthogonal frequency division multiplexing systems can be reduced effectively,
with low degradation in the bit error rate. In addition, through translation and
stretching, hyperbolic tangent functions can be transformed into the famous logistic
function (th(x) = 2·logistic(x) − 1), which has been widely studied and applied in the
field of statistics [218]. The curve of the function is depicted in Fig. 13.5.
To model workers' acceptance probability, the hyperbolic tangent function can be
applied as the framework of f. However, it converges to 1 rapidly as x increases,
i.e., the function value grows to its supremum within a small interval from the origin.
Since the function's values are treated as probabilities, this fast convergence would
make the probabilities approach 1 for many x, which does not fit real-world
scenarios. To overcome this shortcoming, a parameter c1 is introduced to control the
scale of the function, i.e.,

y = th(c1 · x). (13.4)
In economics, the benefit-cost ratio of a project measures the benefit it brings
relative to the cost it takes to execute the project. The higher the benefit-cost ratio,
the better the project. In spatial crowdsourcing, the ratio γ is proportional to the
ratio of w and d. It measures how attractive a task is for workers:

γ = c2 · w / d, (13.5)

where c2 is used for tuning the scale of the ratio. A larger value of γ implies a more
attractive task, which may be accepted with a higher probability.
According to Eq. (13.4), when w ≥ w0, we have

y = th(c1 · γ) = th(c1 · c2 · w/d) = th(c · w/d), (13.6)

where c is the product of c1 and c2, namely, c = c1 · c2.
Finally, based on the hyperbolic tangent function and the distance-reward ratio
γ, the acceptance probability model is created by Eq. (13.7):

p = f(w, d) =
    th(c · w/d),  if w ≥ w0 and d ≤ MTD;
    0,            otherwise.        (13.7)
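Eq. (13.7) translates directly into code. The parameter values c, w0, and MTD below are arbitrary illustrative choices, not values from the text:

```python
import math

# Hypothetical parameter values for illustration.
C = 2.0       # c = c1 * c2, scale of the reward-distance ratio
W0 = 1.0      # minimum acceptable reward w0
MTD = 5.0     # maximum travel distance a worker tolerates

def acceptance_probability(w, d, c=C, w0=W0, mtd=MTD):
    """Eq. (13.7): p = th(c * w / d) when the reward is at least w0 and the
    distance is within MTD, and 0 otherwise."""
    if w >= w0 and d <= mtd:
        return math.tanh(c * w / d)
    return 0.0
```

As expected from the model, the probability grows with the reward and shrinks with the distance.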
In Eq. (13.9), 1 − p represents the probability that a worker refuses t, and (1 − p)^n is
the probability that all n workers in the circle refuse the task. Therefore, 1 − (1 − p)^n
indicates the probability that at least one worker will accept the task, namely, the
ASP of t.
To assign tasks efficiently, we must obtain the minimum radius, r, for a task, t.
Clearly, when the ASP of each task is set equal to a given threshold E_ASP, the
minimum radius, r, for t can be solved.
We summarize our proposed strategy in the following steps, and then provide the
details of each step with theoretical analyses:
1. Input E_ASP and a contour plot, and evaluate the worker density den_i for each task t_i
according to the task location and the contour lines in the contour plot;
2. Set w_i = W/|T| and compute r_i with Eq. (13.16) for the given tasks;
3. Reset w_i = w0 + (r_i / Σ_{i=1}^{|T|} r_i) · (W − w0·|T|), where w0 is the lower bound of the reward.
Then return to Step 2 to re-compute r_i with the tuned w_i.
Specifically, the worker density is evaluated for each task in Step 1: if the location
of t_i is on a contour line, the density is the value of that contour; when t_i is between
two contour lines, the density is approximated by the mean of the two contour
values. Step 2 calculates r_i with the mean reward. Step 3 tunes w_i for each task in
proportion to the pre-computed r_i, namely, increasing the reward for tasks with
large r_i while decreasing the reward for those with small r_i; then r_i is calculated
again. Finally, the workers located in the circle of radius r_i around t_i will be
informed to perform task t_i with the reward w_i.
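A minimal sketch of the three-step strategy follows. It assumes a simplified stand-in radius model (a distance-independent acceptance probability p = th(c·w)) in place of the full Eq. (13.16), and takes the densities of Step 1 as given; all names and parameter values are hypothetical:

```python
import math

def radius_for_task(density, w, e_asp=0.9, c=1.0):
    """Simplified stand-in for Eq. (13.16): radius needed so that
    1 - (1 - p)**n >= E_ASP with n = density * pi * r**2 workers and a
    distance-independent acceptance probability p = tanh(c * w)."""
    p = math.tanh(c * w)
    n_needed = math.log(1.0 - e_asp) / math.log(1.0 - p)  # workers required
    return math.sqrt(n_needed / (math.pi * density))

def allocate_rewards(densities, total_w, w0=0.5, e_asp=0.9):
    """Steps 2-3: uniform rewards first, then rewards re-tuned in
    proportion to each task's pre-computed radius."""
    m = len(densities)
    rewards = [total_w / m] * m                       # Step 2: w_i = W/|T|
    radii = [radius_for_task(den, w, e_asp) for den, w in zip(densities, rewards)]
    total_r = sum(radii)
    # Step 3: w_i = w0 + (r_i / sum_i r_i) * (W - w0*|T|)
    rewards = [w0 + (r / total_r) * (total_w - w0 * m) for r in radii]
    radii = [radius_for_task(den, w, e_asp) for den, w in zip(densities, rewards)]
    return rewards, radii
```

Note that the re-tuned rewards still sum to W, and the task in the sparsest area (largest radius) receives the largest reward, as Step 3 intends.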
Before solving Eq. (13.9), the SC-server has to measure the distance, d, and the
number of workers within the circle of radius r around task t. Without additional
prior knowledge, it is natural to assume all workers are distributed uniformly within
this circle. Suppose worker p is at the coordinates (x, y); the expectation of
(x² + y²)^{1/2} can then be considered with respect to d, i.e.,

E(d) = (1/(πr²)) ∬ (x² + y²)^{1/2} dx dy = (2/3)·r. (13.10)

The total number of workers in the area is calculated from the worker density, den,
within the circle, as shown in Eq. (13.11):

n = den · π · r². (13.11)

Setting the ASP of Eq. (13.9) equal to the threshold E_ASP and rearranging then gives

ln(1 − E_ASP) / den = π · r² · ln(1 − p). (13.13)
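The factor 2r/3 in Eq. (13.10) is easy to verify numerically. This quick Monte Carlo check, with arbitrary sample size and seed, estimates the expected distance from the centre for a point uniform in a disk:

```python
import math
import random

def mean_distance_uniform_disk(r, samples=200_000, seed=42):
    """Monte Carlo estimate of the expected distance from the centre for
    points uniform in a disk of radius r (rejection sampling)."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    while n < samples:
        x, y = rng.uniform(-r, r), rng.uniform(-r, r)
        if x * x + y * y <= r * r:       # keep only points inside the disk
            total += math.sqrt(x * x + y * y)
            n += 1
    return total / n
```

For r = 3 the estimate is close to 2r/3 = 2, as Eq. (13.10) predicts.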
Denote the right-hand side of Eq. (13.13), which depends on r both directly and
through p, by h(r). Taylor's formula gives the first-order approximation

h(r) ≈ h(r0) + h′(r0)·(r − r0). (13.15)
Linearizing h in this way yields

ln(1 − E_ASP)/den ≈ π [ k·e^k/(e^k + 1) − 2·ln((e^k + 1)/2) ] · r
                  + π [ ln((e^k + 1)/2) − k·e^k/(e^k + 1) ],
(13.16)

where k = 3cw. Equation (13.16) is affine in r and can be solved directly.
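Instead of the affine approximation, the minimal radius can also be found numerically from Eqs. (13.7)-(13.13). The sketch below bisects on the exact ASP, ignoring the w0 and MTD cut-offs for simplicity; all parameter values are chosen for illustration:

```python
import math

def asp(r, den, w, c=1.0):
    """ASP for radius r: p = th(c*w/d) with d = 2r/3 (Eq. (13.10)),
    n = den * pi * r**2 workers (Eq. (13.11)), and ASP = 1 - (1 - p)**n."""
    d = 2.0 * r / 3.0
    p = math.tanh(c * w / d)
    p = min(p, 1.0 - 1e-12)  # guard: float tanh saturates to exactly 1.0
    n = den * math.pi * r * r
    return 1.0 - (1.0 - p) ** n

def min_radius(den, w, e_asp=0.9, c=1.0, r_lo=1e-3, r_hi=50.0, iters=80):
    """Bisection for the smallest r with ASP(r) >= E_ASP, assuming ASP
    increases with r over the search interval."""
    lo, hi = r_lo, r_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if asp(mid, den, w, c) >= e_asp:
            hi = mid
        else:
            lo = mid
    return hi
```

The returned radius satisfies the target ASP, while a noticeably smaller radius does not.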
In summary, the proposed method first builds a contour plot with the full privacy
budget, then models the acceptance probability and the ASP based on the hyperbolic
tangent function. Taylor's formula is used to solve the model's equations. Finally, the
task-informing radius is calculated with an optimized reward distribution, as opposed
to the traditional uniform reward.
13.3.4.1 Performance on DE
The variation tendencies of DE on the Gowalla dataset, as the other
parameters change, are shown in Fig. 13.7. Specifically, Fig. 13.7a shows that DE increases
as E_ASP increases, because achieving a larger E_ASP requires more
workers to participate in task assignment, which leads to a larger distance error. In
addition, the distance errors of the proposed optimized-reward method are generally
smaller than those of the uniform-reward method, which indicates that the total distance
error can be reduced by shrinking the distribution of task radii into a smaller
interval.
Figure 13.7b shows that the distance error decreases when the total reward is
increased. Obviously, when a task's reward is much higher, the probability of a
worker accepting it will be higher too. Thus, fewer workers are needed to achieve the
E_ASP, which leads to a smaller distance error. However, once the reward is increased
to a certain level, the probability of a worker accepting the task becomes stable,
leading to an approximately fixed number of workers who will accept the task. Thus
the latter half of the curve flattens. Regardless of the difference in the amount of the
total rewards, the distance errors of the proposed optimized-reward method were always
smaller than those of the uniform-reward method.
Fig. 13.7 DE estimation with the Gowalla dataset. (a) DE VS E_ASP. (b) DE VS Reward. (c) DE VS ε
Fig. 13.8 DE estimation with the T-Driver dataset. (a) DE VS E_ASP. (b) DE VS Reward. (c) DE VS ε
Figure 13.7c illustrates the change in DE when the privacy budget ε varies.
A smaller ε means that more noise is added to each cell, which leads to a
larger distance error. Increasing the privacy budget causes the volume of noise to
decrease to a stable level; thus the distance error levels off at the end of the curve.
Experimental results on the T-Driver dataset are shown in Fig. 13.8,
where the total reward is 0.1W and ε = 0.3 in Fig. 13.8a, E_ASP = 0.88 and ε = 0.4
in Fig. 13.8b, and E_ASP = 0.9 with a total reward of 0.1W in Fig. 13.8c. The results show
the same variation tendency of the DE with respect to the other parameters, and
demonstrate the reliability of the proposed method.
13.3.4.2 Performance on RejR
This experiment studies the RejR under different values of E_ASP, the total reward,
and the privacy budget, for the optimized-reward and uniform-reward strategies,
respectively. The results show that the optimized-reward method significantly
outperformed the uniform-reward method on the RejR metric.
Figure 13.9 shows the results for the Gowalla dataset, where the total reward is
0.1W and the privacy budget is 0.3. As shown in Fig. 13.9a, the average travel distance
of each task, namely 2r/3, was distributed over a wide interval when the
uniform-reward method was applied, while the distribution was compressed into a quite
narrow range by the optimized-reward method. When the MTD is defined as the
distance that 90% of workers are willing to travel, represented by the horizontal
line in Fig. 13.9a, the RejR of the uniform-reward method was approximately 0.1,
while it was 0 with the optimized-reward method. We achieved the same results
when the value of E_ASP was varied from 0.80 to 0.90, as shown in Fig. 13.9b.
We also conducted the experiments with different total rewards and privacy
budgets; the results were almost the same as in Fig. 13.9b. This demonstrates that the
proposed optimized-reward method can efficiently decrease the average travel distance of
a task by increasing its reward, thus ensuring that tasks in areas with sparse workers
can still be accepted with a high probability.
Fig. 13.9 RejR with E_ASP. (a) The boxplot of (2/3)·r. (b) RejR VS E_ASP
Fig. 13.10 EC estimation with the Gowalla dataset. (a) EC VS E_ASP. (b) EC VS Reward. (c) EC VS ε
13.3.4.3 Performance on EC
Experimental results on the Gowalla dataset show that EC increases when
E_ASP is increased, and decreases when the total reward or the privacy budget is
increased.
Figure 13.10a illustrates the relationship between EC and E_ASP, with a total
reward of 0.1W and ε = 0.2. The reason that EC increases with an increasing
E_ASP is that a higher E_ASP is achieved by assigning the task to more workers, which
leads to a larger assignment radius. Given a fixed E_ASP, the radius generated by
the proposed optimized-reward method varies within a small range, resulting in a
smaller EC compared to that of the uniform-reward method.
Figure 13.10b shows the value of EC with the total reward varying from 0.1W to
W. When the total reward is doubled or tripled from its initial value, the EC decreases
significantly, because a higher reward, logically, increases the probability that a
worker will accept the task. When the probability reaches a high enough level to
become stable, the stimulating effect of the reward gradually fades and the EC curve
flattens.
We also investigated the behavior of EC under different values of ε, with a fixed
reward and a fixed E_ASP.
Fig. 13.11 EC estimation with the T-Driver dataset. (a) EC VS E_ASP. (b) EC VS Reward. (c) EC VS ε
As shown in Fig. 13.10c, when E_ASP was set to 0.9 and the total
reward was 0.1W, increasing ε caused the EC to decrease, but it
tended to become stable quickly. This implies that the total execution cost, as an
aggregate metric, is insensitive to noise. The results also show that the total EC
of the proposed optimized-reward method is always smaller than that of the
uniform-reward method, and that the EC can be reduced by tuning the reward
allocation properly.
Experiments on the T-Driver dataset are given in Fig. 13.11. They show that EC
has a similar variation tendency when the parameters are changed, which demonstrates
the efficiency of the proposed method.
In the above discussions, we presented and analysed the variation tendencies under
various E_ASP, W and ε. We conclude that the optimized reward outperforms the uniform
reward in terms of execution cost.
13.4 Summary
Privacy issues are becoming an increasing concern with the growing popularity of
spatial crowdsourcing. The main challenge in applying differential privacy to spatial
crowdsourcing is to achieve an optimal trade-off between privacy and effectiveness.
This chapter presented an existing privacy-preserving method based on a trusted
third party. Building on that, a reward-based SC method is proposed to address this
challenge. The method first constructs a contour plot with a differential privacy
guarantee to minimize the magnitude of noise by fully using the given privacy budget,
so as to achieve accurate task assignments. Then, two models, for the assignment
success probability and the acceptance probability respectively, are constructed to
ensure efficient task assignment. The method can dramatically enhance the task
acceptance ratio by adjusting each task's reward. Future work will extend the proposed
method to scenarios with redundant task assignments, and frameworks for SC without a
trusted third party will be explored.
Chapter 14
Correlated Differential Privacy for Non-IID
Datasets
14.1 Introduction
Although differential privacy has been widely accepted, previous work has mainly
focused on independent datasets, assuming all records were sampled from
a universe independently. In reality, a real-world dataset often exhibits strong
coupling relations: some records are correlated with each other, and this
may disclose more information than expected. For example, differential privacy
ensures that deleting a user will not noticeably affect an aggregated result. However, in a
social network dataset, users are always interacting with other users, and this kind
of relationship may provide helpful information for identifying deleted users.
As another example, members of the same family may have a high probability
of catching the same infectious disease. If an adversary knows one person has the
flu, he can infer the health of this person's family with high probability. We refer
to this relationship as correlated information, and the involved records related to
each other as correlated records. An adversary with knowledge of the correlated
information will have a higher chance of obtaining private information [126],
violating the definition of differential privacy. Hence, how to preserve rigorous
differential privacy in a correlated dataset is an emerging issue that needs to be
addressed.
Over the last decade, limited research has been concerned with correlated
differential privacy. A pioneering study by Kifer et al. [126] confirmed that if
correlated records are ignored, the released data will have a lower-than-expected
privacy guarantee. Their subsequent paper proposed a new privacy definition named
Pufferfish [127], which takes correlated records into consideration, but it does
not meet the requirement of differential privacy. Chen et al. [41] dealt with the
correlation problem in social networks by multiplying the original sensitivity by
the number of correlated records. This straightforward method is not optimal
because it introduces a large amount of noise into the output, overwhelming
the true answer and destroying the utility of the dataset. Hence, a major research
barrier in correlated differential privacy is that a correlated dataset can provide
extra information to the adversary, which cannot be modeled by the traditional
mechanism. In such a situation, satisfying the definition of differential privacy is a
more complicated task.
As advances in correlated data analysis are made, especially with recent
developments in the research of non-IID data [32], it is now possible to overcome the
research barrier mentioned above. The correlated information can be modeled by
functions or parameters that can be further defined as background information
in the differential privacy mechanism. For example, Cao et al. [32] utilized the
time interval and correlation analysis to identify correlated records, and modeled
correlated information by inter-behavior functions. This solution can help tackle the
research barrier by incorporating the modeled information into the differential privacy
mechanism.
However, there are still three main challenges with this approach:
• The first challenge is how to identify and represent correlated records. Records
are often correlated in terms of certain relationships that are not obvious. A
deep exploration of the relationship is necessary to understand which records
are correlated and how they interact with others.
• The second challenge lies in the fact that if an algorithm simply increases the
sensitivity by multiplying it by the number of correlated records, the new
sensitivity will be large, especially when many records are coupled with each other.
To achieve a rigorous privacy guarantee, a large magnitude of noise has to be
added, and this will significantly decrease the utility of the correlated dataset.
• The third challenge occurs when answering a large number of queries. When the
number of queries is large, the privacy budget has to be divided into many small
parts, which increases the noise for each query. This problem is more serious in
a correlated dataset because the more queries that need to be answered, the more
correlated records that will be involved, and the larger the amount of noise that
will be introduced.
All these challenges imply that correlated information should not be incorporated
into differential privacy in a straightforward manner, and a novel mechanism is
in high demand. This chapter first presents two basic definitions related to correlated
differential privacy, Pufferfish [127] and Blowfish [98], and then proposes a
comprehensive correlated differential privacy solution, including sensitivity measurement
and mechanism design. Table 14.1 shows the application setting for correlated
differential privacy.
14.2 An Example: Correlated Records in a Dataset
Most existing differential privacy works assume the dataset consists of independent
records. However, in real-world applications, records are often correlated with each
other. Kifer et al. [126] pointed out that differential privacy without considering
correlation between records will decrease the privacy guarantee on the correlated
dataset. For example, suppose a record r has influence on a set of other records;
this set of records will provide evidence about r even if record r is deleted from
the dataset. In this scenario, traditional differential privacy fails to provide as much
privacy as it claims.
Suppose we want to publish a dataset D with n records. To simplify the example,
we assume there is only one attribute in D, whose values are A, B and C. Dataset
D can then easily be transformed into the frequency dataset x in Table 14.2, where the
Attribute column stores the attribute values and the Count column represents the
number of records with each value. The goal of privacy preservation is to hide the
true counts in x.
To preserve ε-differential privacy, the randomization mechanism M adds
independent noise to each count. Since deleting a record will change a count
by at most 1, the sensitivity of the count query is 1, and the independent noise will
be sampled from the Laplace distribution Laplace(1/ε).
This process works well when records are sampled independently from the domain
X. However, if some records are correlated with each other, traditional differential
privacy may underestimate the privacy risk [126]. For example, let the frequency
dataset x in Table 14.2 represent a medical report in which Attribute represents the
address and Count denotes the number of patients who have the flu. Suppose a
patient named Alice and her nine immediate family members live at the same
address B. When Alice contracts the flu, the entire family will also be infected. In
this case, deleting the record of Alice at address B will impact nine other records,
and the count of address B will change to 90 (Alice got the flu) or remain 100
(Alice is healthy). Suppose the noisy count returns 99, with the noise sampled from
Laplace(1/ε). There is then a high probability that Alice is healthy, because the query
answer is close to 100. Specifically, the answer 99 is e^{10ε} times more likely to be
observed when Alice is healthy than when Alice has the flu. Compared to independent
records, whose privacy loss is bounded by e^ε, the privacy loss of correlated records
grows tenfold in the exponent. In this instance, traditional differential privacy seriously
mismatches the reality for correlated records.
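The privacy degradation in this example can be checked directly from the Laplace density; the particular output value below is an arbitrary illustration:

```python
import math

def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centred at mu."""
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

eps = 0.5
scale = 1.0 / eps   # noise calibrated for a sensitivity of 1

# Count at address B: 100 if Alice is healthy, 90 if she has the flu
# (her nine family members' records change together with hers).
z = 105.0           # one possible noisy output on the "healthy" side
ratio = laplace_pdf(z, 100, scale) / laplace_pdf(z, 90, scale)
# For such outputs the likelihood ratio reaches e**(10 * eps): ten times
# the exponent of the e**eps bound a single independent record would enjoy.
```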
We define this as the correlated differential privacy problem. To deal with
it, one possible solution is to design a new sensitivity measurement based on the
relationship between records. A naive way to measure the sensitivity is to multiply
the global sensitivity by the number of correlated records. In the above-mentioned
example, since deleting Alice will impact at most ten records, the sensitivity is
re-measured as 1 × 10 = 10, and the noise will be sampled from Laplace(10/ε).
This naive sensitivity measurement can be extended to different scenarios.
For instance, if A, B, C in Table 14.2 are linearly dependent, that is, A + B = C,
deleting a record in A will eliminate at most 1 count in A and 1 count in C,
and the sensitivity will be measured as 2. If instead A × B = C, deleting a record
in A will change the count of C by up to count(B), so the sensitivity is measured as
max(count(A), count(B)). It is obvious that in some cases the sensitivity will be very
high, leading to considerable redundant noise. This naive solution is not optimal
for correlated differential privacy. How to define a new sensitivity properly is a
problem of critical importance.
In summary, a major disadvantage of traditional differential privacy is that it
overlooks the relationships between records, which means the query result leaks more
information than is allowed. If we deal with the problem by simply multiplying
the sensitivity by the number of correlated records, the query result will contain
redundant noise that damages the utility of the dataset. Consequently, a sophisticated
solution to the correlated differential privacy problem is urgently needed.
To deal with this problem, the subsequent paper of Kifer et al. proposed a new privacy
framework, Pufferfish [127], which allows application domain experts to add extra
data relationships to develop a customized privacy definition.
14.3.1 Pufferfish
If the distribution class Θ is too restrictive, the privacy may not be sufficiently
guaranteed; if Θ is too broad, the privacy mechanisms will yield little utility.
Definition 14.1 (Pufferfish Privacy) A privacy mechanism M is said to be ε-
Pufferfish private with parameters S, Q and Θ if, for all datasets D with distribution
θ ∈ Θ, for all secret pairs (s_i, s_j) ∈ Q, and for all possible outputs ω, we have

e^{−ε} ≤ P(M(D) = ω | s_i, θ) / P(M(D) = ω | s_j, θ) ≤ e^{ε}. (14.1)
14.3.2 Blowfish
e^{−ε} ≤ P(M(D) = ω) / P(M(D′) = ω) ≤ e^{ε}, (14.2)

e^{−ε·d(x,y)} ≤ P(M(D) = ω) / P(M(D′) = ω) ≤ e^{ε·d(x,y)}, (14.3)
However, most records are only partially correlated, and deleting a record may have
different impacts on different records. These impacts are defined as the correlated
degree of the records.
Definition 14.5 (Correlated Degree) Suppose two records r_i and r_j are correlated
with each other. The relationship between them is represented by the
correlated degree δ_ij ∈ [−1, 1] with |δ_ij| ≥ δ_0, where δ_0 is the threshold of the
correlated degree.
Corollary 14.1 If δ_ij < 0, r_i and r_j have a negative coupling; if δ_ij > 0, they have a
positive coupling; δ_ij = 0 indicates no relationship. If |δ_ij| = 1, records r_i and r_j are
fully correlated with each other.
The correlated degree represents the impact of one record on another. A
smaller absolute value of δ_ij indicates a weaker coupling, meaning that deleting
r_i has a low possibility of impacting r_j. When δ_ij is close to 1 or −1, the
coupling is strong, and deleting r_i will greatly impact r_j. However, in real-world
applications, few records are fully correlated, and this observation is useful in
the proposed method.
From the perspective of correlated data analysis, it is possible to list all relationships
between records and maintain a correlated degree matrix Δ, in which δ_ij ∈ Δ:

    Δ = | δ_11 δ_12 ... δ_1n |
        | δ_21 δ_22 ... δ_2n |
        | ...  ...  ... ...  |
        | δ_n1 δ_n2 ... δ_nn |        (14.4)
Here are four properties of Δ: (1) it is symmetric, with δ_ij = δ_ji, which indicates
that the relationship between two records is irrelevant to their order; (2) the elements on
the diagonal are equal to 1, which implies that every record is fully correlated with itself;
(3) a threshold δ_0 is defined to filter out weak correlated degrees: in Δ, |δ_ij| ≥ δ_0, and if
|δ_ij| < δ_0, δ_ij is set to zero; (4) it is sparse, as only some of the records are correlated with
each other.
The above terms and the correlated degree matrix Δ will help solve the correlated
differential privacy problem.
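The four properties can be enforced mechanically when building Δ; the raw scores, the threshold value, and the function name below are illustrative assumptions:

```python
import numpy as np

def correlated_degree_matrix(raw, delta0=0.3):
    """Build Δ with the four listed properties: symmetric, unit diagonal,
    weak entries (|δ| < δ0) zeroed out, hence typically sparse."""
    delta = (raw + raw.T) / 2.0          # enforce symmetry
    delta[np.abs(delta) < delta0] = 0.0  # filter weak correlations
    np.fill_diagonal(delta, 1.0)         # every record fully correlated with itself
    return np.clip(delta, -1.0, 1.0)     # degrees stay in [-1, 1]

# Hypothetical raw pairwise correlation scores for 5 records.
raw = np.random.default_rng(7).uniform(-1, 1, size=(5, 5))
Delta = correlated_degree_matrix(raw)
```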
several records may mix together and have exponentially many possible relationships,
thus making correlated analysis very complex.
• How to calibrate sensitivity for correlated records?
Traditional global sensitivity may not be suitable for correlated datasets because
of the large noise it requires. In our previous flu example, global sensitivity introduces
ten times more noise into the count output on the correlated dataset. In addition, local
sensitivity cannot be used because it still relates only to an individual record,
without considering coupling information.
• How to re-design the differential privacy mechanism?
Even though correlated sensitivity can significantly decrease the noise compared
with global sensitivity, the traditional mechanism may still not be suitable for a
correlated dataset when answering a large set of queries. A new mechanism is
expected to satisfy differential privacy, as well as retain sufficient utility for future
applications.
Several studies on correlated data analysis have attempted to identify and model
correlated information. Correlated information can be identified by correlation
analysis, including time-interval analysis, attribute analysis, or similarity analysis.
Cao et al. [33] presented a correlated hidden Markov detection model to detect
abnormal group-based trading behaviors. They defined a time interval and assumed
behaviors falling into the same interval to be correlated. Song et al. [204]
proposed a hybrid coupling framework, which applies particular attributes
to identify relationships among records. Zhang et al. [250] identified correlated
network-traffic records using IP addresses and presented a correlated network
traffic classification algorithm.
Correlated information can be modeled in various ways. Cao et al. [33] modeled
correlated information using inter-couple and intra-couple behavior functions.
These functions were adopted in the correlated framework to represent the
correlated degree between behaviors. Zhou et al. [255] mapped correlated records to
an undirected graph and proposed a multi-instance learning algorithm.
These approaches help to model background information for differential privacy.
An advanced differential privacy releasing mechanism will be proposed with the
aim of guaranteeing a sufficient privacy level while decreasing extra noise.
Correlated analysis is carried out to generate the correlated degree matrix Δ for
a correlated dataset. This can be done in various ways, depending on the background
knowledge of the curator or the characteristics of the dataset. Typical methods fall
into two categories.
The first type of correlated analysis assumes the curator or the attacker has obtained
the background knowledge in advance. The correlated degree matrix Δ is pre-
defined as background knowledge. Taking Table 14.2 as an example, the curator
or attacker discovers that there are full coupling relationships among A, B and C, e.g.
14.4 Correlated Differential Privacy
Traditional global sensitivity will result in redundant noise, derived from both
records and queries. For records, as analyzed earlier, the traditional method
assumes records are fully correlated with each other, and therefore simply multiplies
the global sensitivity by the maximal number of correlated records, leading to
large noise. For queries, the traditional method uses a fixed global sensitivity
without considering the properties of different queries. In actual fact, only some
of the responding records are correlated with others, and the curator only needs
to consider the correlated information within these responding records. Hence,
the sensitivity should adapt to both the correlated records and the query.
Based on this observation, correlated sensitivity can be defined, which takes both
records and queries into consideration. The notion of record sensitivity relates to
the correlated degree of each record. Based on this notion, the correlated sensitivity
associated with the query is proposed.
Definition 14.6 (Record Sensitivity) For a given Δ and a query f, the record
sensitivity of r_i is

CS_i = Σ_{j=1}^{n} |δ_ij| · ||f(D^j) − f(D^{−j})||_1, (14.5)

where δ_ij ∈ Δ, and D^j, D^{−j} are neighbouring datasets differing in record r_j.
The record sensitivity measures the effect on all records in D of deleting a
record r_i. δ_ij ∈ Δ estimates the correlated degree between records r_i and r_j ∈ D.
This notion combines the number of correlated records and the correlated degree
together. If D is an independent dataset, CS_i is equal to the global sensitivity.
Definition 14.7 (Correlated Sensitivity) For a query f, the correlated sensitivity is
determined by the maximal record sensitivity:

CS_q = max_{i ∈ q} CS_i,

where q denotes the set of records responding to query f.
The correlated sensitivity CS_q will be smaller than the global sensitivity GS and the
local sensitivity LS, both of which assume that each record is fully correlated with
every other record and ignore the correlated degree.
Lemma 14.1 For a query f, the correlated sensitivity is equal to or less than the global
sensitivity GS and the local sensitivity LS.
Proof Suppose there are at most k correlated records in a dataset D. Then we have

GS = k \cdot \max_{D,D'} \| f(D) - f(D') \|_1, \qquad CS_i = \sum_{j=1}^{n} \delta_{ij} \, \| f(D^j) - f(D^{-j}) \|_1 .

Because at most k records are correlated, we have \sum_{j=1}^{n} \delta_{ij} = \sum_{j=1}^{k} \delta_{ij} \le k. As \| f(D^j) - f(D^{-j}) \|_1 \le \max_{D,D'} \| f(D) - f(D') \|_1, we have CS_i \le GS. As every CS_i is less than or equal to GS, for a query f we have CS_q \le GS.

For the local sensitivity LS = k \cdot \max_{D'} \| f(D) - f(D') \|_1, we also have \| f(D^j) - f(D^{-j}) \|_1 \le \max_{D'} \| f(D) - f(D') \|_1 and \sum_{j=1}^{n} \delta_{ij} = \sum_{j=1}^{k} \delta_{ij} \le k, so CS_q \le LS.
Correlated sensitivity CS can be used in various types of data releasing mechanisms. If the records in the dataset are independent, CS will be equal to the global sensitivity, while for a correlated dataset, CS will introduce less noise than GS and LS.
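The definitions above can be sketched numerically. The following is a toy illustration, not the book's code: the dataset, the correlated degree matrix `delta`, and the count query are all assumptions made for the example.

```python
# Record sensitivity CS_i = sum_j |delta_ij| * ||f(D^j) - f(D^-j)||_1 and
# correlated sensitivity CS_q = max_i CS_i for a simple count query,
# compared against the traditional global sensitivity GS = k * max-change.

def count_over_threshold(records, threshold=50):
    """Query f: how many records exceed a threshold."""
    return sum(1 for r in records if r > threshold)

def record_sensitivity(records, delta, i, f):
    """CS_i: correlation-weighted effect of deleting each record r_j."""
    total = 0.0
    for j in range(len(records)):
        d_minus_j = records[:j] + records[j + 1:]       # D^-j: delete r_j
        change = abs(f(records) - f(d_minus_j))         # ||f(D^j) - f(D^-j)||_1
        total += abs(delta[i][j]) * change
    return total

def correlated_sensitivity(records, delta, f, responding):
    """CS_q: maximal record sensitivity over records answering the query."""
    return max(record_sensitivity(records, delta, i, f) for i in responding)

records = [62, 55, 48, 71]              # toy dataset
# delta[i][j]: correlated degree; records 0 and 1 are partially coupled
delta = [[1.0, 0.6, 0.0, 0.0],
         [0.6, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]

responding = [i for i, r in enumerate(records) if r > 50]
cs_q = correlated_sensitivity(records, delta, count_over_threshold, responding)
k = 2                                   # largest correlated group size
gs = k * 1                              # k * max per-record change for a count
print(cs_q, gs)                         # CS_q <= GS, as Lemma 14.1 states
```

On this toy data CS_q = 1.6 while GS = 2, matching the claim that correlated sensitivity introduces less noise than global sensitivity.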
Even though correlated sensitivity decreases the noise compared with global sensitivity, when dealing with a large number of queries the answers still carry high noise, because the privacy budget has to be divided into many small parts. This is especially so when the records are strongly correlated with one another, in which case the noise is significantly higher than for an independent dataset. To tackle this problem, an iteration-based mechanism is adopted to limit the noise in the query answers.
The iteration-based mechanism was first proposed by Hardt et al. [92], who constructed a sequence of datasets to answer all queries by iteratively updating the datasets. When a given query witnesses a significant difference between the current dataset and the true dataset, the mechanism updates the current dataset in the next iteration [92]. The main advantage of this mechanism is that it saves privacy budget and decreases the noise when confronting a large number of queries. Hence, it is suitable for data releasing on a correlated dataset.
In this section, a Coupled Iteration Mechanism (CIM) is proposed to answer a set of queries on a correlated dataset. The CIM aims to release the results of a set of queries by iteratively updating the dataset under the notion of differential privacy. In this procedure, a dataset is represented by a histogram x of length N. Let t be the round index, and let the histogram at the end of round t be x_t. The curator is given a query set F and selects an f_t in each round t. Let a_t denote the true answer and \hat{a}_t the noisy answer:
a_t = f_t(x),   (14.8)

\hat{a}_t = f_t(x) + \mathrm{Laplace}\!\left( CS_{q_t} / \varepsilon_0 \right),   (14.9)

where \varepsilon_0 is the privacy budget allocated to each round (Eq. (14.14)). The difference between the answer given by x_{t-1} and the noisy answer is denoted by \hat{d}_t:

\hat{d}_t = f_t(x_{t-1}) - \hat{a}_t.   (14.10)
This difference is utilized to control the update rounds. At a high level, CIM maintains a sequence of histograms x_0, x_1, ..., x_t, which gives increasingly accurate approximations of the original dataset x.
The mechanism is shown in Algorithm 1. First, the privacy budget is divided into several parts and the histogram is initialized to the uniform distribution x_0. In each round t, the curator selects a query f_t \in F and generates the true answer a_t = f_t(x) and the noisy answer \hat{a}_t. The distance \hat{d}_t between the answer of f_t on x_{t-1} and the noisy answer \hat{a}_t is then computed. If |\hat{d}_t| is less than a threshold T, x_{t-1} is considered a good approximation of x on query f_t; the curator releases f_t(x_{t-1}) directly and carries x_{t-1} into the next iteration. If the distance is larger than the threshold, the histogram x_{t-1} is improved in this round: the curator releases \hat{a}_t and uses a correlated update function U to generate the new histogram x_t.
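As a rough sketch (not the book's implementation: the exact correlated update function U of Eq. (14.11) is not reproduced here, so a simplified multiplicative-weights-style update stands in for it, and all inputs are toy assumptions), the CIM loop can be expressed as:

```python
import math
import random

def cim(x, queries, eps, cs_q, T, u_max, seed=0):
    """Release answers to linear `queries` on histogram x (sums to 1)."""
    rng = random.Random(seed)
    xt = [1.0 / len(x)] * len(x)      # x_0: uniform initial histogram
    eps0 = eps / u_max                # per-round privacy budget
    answers = []
    for f in queries:                 # f: weight vector of a linear query
        true_ans = sum(w * v for w, v in zip(f, x))   # a_t = f_t(x)
        u = rng.random() - 0.5        # Laplace(CS_q/eps0) via inverse CDF
        noise = -(cs_q / eps0) * math.copysign(math.log(1 - 2 * abs(u)), u)
        noisy = true_ans + noise                      # noisy answer
        approx = sum(w * v for w, v in zip(f, xt))    # f_t(x_{t-1})
        d = approx - noisy                            # distance d_t
        if abs(d) <= T:
            answers.append(approx)    # non-update round: budget untouched
        else:
            answers.append(noisy)     # update round: improve the histogram
            xt = [v * math.exp(-w * d) for w, v in zip(f, xt)]
            s = sum(xt)
            xt = [v / s for v in xt]  # re-normalize so frequencies sum to 1
    return answers

x = [0.1, 0.4, 0.3, 0.2]                              # toy "true" histogram
queries = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
ans = cim(x, queries, eps=1.0, cs_q=1.0, T=0.2, u_max=10)
print([round(a, 3) for a in ans])
```

With the small toy budget the released answers are very noisy; the point of the sketch is only the control flow: a round spends budget only when the current histogram fails the threshold test.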
The CIM aims to answer a large group of queries with a limited privacy budget on a correlated dataset. In summary, this mechanism has the following features:
• First, it takes the relationships between records into consideration. It applies not only correlated sensitivity but, more importantly, a correlated update function that improves the histogram in each iteration.
• Second, it decreases the total amount of noise. The CIM maintains a histogram sequence to answer a set of queries F, rather than using a single histogram to answer all queries. One histogram in the sequence roughly corresponds to one query in F, so each histogram can closely approximate the true answer to its query.
• Finally, more queries can be answered than with the traditional mechanism under the same privacy budget, because only the update steps consume privacy budget. Algorithm 1 indicates that even for a very large set of queries, the number of update rounds is still less than the total number of queries.
This section defines a correlated update function U in the histogram context. For a histogram x_{t-1}, the function U first identifies all responding records r \in q_t. For each record in q_t, all correlated records are listed and denoted as the superset \bar{q}_t. The update function U then identifies the set of bins b that contain \bar{q}_t and re-arranges the frequency of each bin in b according to Eq. (14.11). The final frequencies of x_t are normalized so they sum to 1.

Definition 14.8 (Correlated Update Function) Let x_0, x_1, ..., x_t be a histogram sequence. A function U is a correlated update function if it satisfies x_t = U(x_{t-1}), where U re-weights the bins according to Eq. (14.11).

Lemma 14.2 The maximal number of update rounds is

u_{max} = \frac{\log N}{2\delta_0^2}.   (14.12)
Proof Given the original histogram x and the initial histogram x_0, the CIM updates x_0 in each round t until x_t = x. The value of u_max depends on how many steps are needed to transfer x_0 to x. The method follows the update strategy of Hardt et al. [92], who define the distance between x and x_t in terms of a relative-entropy potential:

\Psi_t = RE(x \,\|\, x_t) = \sum_b x(b) \log \frac{x(b)}{x_t(b)}.   (14.13)

Based on Eq. (14.13), 0 \le \Psi_t \le \log N. When \Psi_t drops to zero, the update terminates and x_t = x. According to Hardt et al. [92], the potential drop in each round is at least 2\delta_0^2; therefore there are at most \log N / (2\delta_0^2) rounds in the CIM.
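The potential of Eq. (14.13) can be checked numerically. A toy illustration, with the natural logarithm assumed and made-up histograms:

```python
import math

def potential(x, xt):
    """Relative entropy RE(x || x_t), Eq. (14.13)."""
    return sum(p * math.log(p / q) for p, q in zip(x, xt) if p > 0)

x = [0.5, 0.25, 0.125, 0.125]           # true histogram, N = 4 bins
x0 = [0.25] * 4                          # uniform initial histogram
assert potential(x, x0) <= math.log(4)   # psi_0 <= log N
assert potential(x, x) == 0.0            # update terminates when x_t = x
print(round(potential(x, x0), 4))
```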
The value of u_max is utilized to determine the privacy budget \varepsilon_0 in each round. Equation (14.14) shows the calculation of \varepsilon_0:

\varepsilon_0 = \frac{2\delta_0^2 \, \varepsilon}{\log N}.   (14.14)

Compared to the traditional data releasing mechanism, which divides the privacy budget according to the number of queries, it is easy to show that \varepsilon_0 \ge \varepsilon / |F|.
Lemma 14.2 also indicates that u_max is associated with the histogram length N and the couple parameter threshold \delta_0. If the curator wants to successfully answer more queries, he/she can choose a smaller \delta_0 to allow more rounds. However, this will lead to larger noise in each query answer, because the privacy budget \varepsilon_0 in each round will also be diminished.
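Under the assumption that log denotes the natural logarithm, a quick numeric illustration of Eqs. (14.12) and (14.14) with made-up parameter values:

```python
import math

# Toy values: histogram length N, couple-parameter threshold delta0,
# total budget eps, and a query set far larger than u_max.
N, delta0, eps, num_queries = 1024, 0.6, 1.0, 10000
u_max = math.log(N) / (2 * delta0 ** 2)        # Eq. (14.12)
eps0 = 2 * delta0 ** 2 * eps / math.log(N)     # Eq. (14.14): eps0 = eps/u_max
print(round(u_max, 2), round(eps0, 4))
assert eps0 >= eps / num_queries               # far less noise per answer
```

With roughly ten update rounds, each round keeps about a tenth of the total budget, instead of the 1/10000 share a per-query split would leave.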
Second, to estimate the possible number of update rounds u_F for a query set F, let the probability of updating be p_1 and the probability of not updating be p_2; the algorithm has the following lemma:
Lemma 14.3 When both the privacy budget \varepsilon_0 in each round and the parameter T are fixed, the probability of an update is

p_1 = \exp\!\left( - \frac{\varepsilon_0 |T - \alpha|}{CS_q} \right).   (14.15)

Proof Suppose we have |F| queries and altogether |F| rounds. The probability of an update is \Pr(|\hat{d}_t| > T). We have

\Pr(|\hat{d}_t| > T) = \Pr(|f_t(x) + \lambda_t - f_t(x_{t-1})| > T).

Let |f_t(x) - f_t(x_{t-1})| \le \alpha, and let \lambda_t be sampled from a Laplace distribution with scale b = CS_q / \varepsilon_0. By the properties of the Laplace distribution,

\Pr(|\lambda| > t) = \Pr(\lambda > t) + \Pr(\lambda < -t) = 2 \int_t^{\infty} \frac{1}{2b} \exp\!\left(-\frac{x}{b}\right) dx   (14.18)
= \exp\!\left(-\frac{t}{b}\right).   (14.19)

Because b = CS_q / \varepsilon_0, we have

\Pr(|\hat{d}_t| > T) \le \exp\!\left( - \frac{\varepsilon_0 (\alpha - T)}{CS_q} \right).

If there are |F| queries, the algorithm will update in at most |F| \exp( - \varepsilon_0 (\alpha - T) / CS_q ) rounds.
Lemma 14.3 shows that the probability of an update is related to T and \alpha. If the parameter T is much smaller than \alpha, the update probability will be very high and the noise will increase simultaneously, which affects the accuracy of the answers. However, if T is very large, even though the number of update rounds decreases, the output answer is always far from the true answer, which also decreases the accuracy. We can conclude that the accuracy of CIM is related to T. Section 14.4.7 uses experiments to demonstrate the trade-off between T and the accuracy of CIM.
The proposed CIM aims to obtain acceptable utility with a fixed differential privacy budget. This section first proves that the algorithm satisfies \varepsilon-differential privacy, and then analyzes the utility cost.

To prove that CIM satisfies differential privacy, one needs to analyze which steps in CIM consume the privacy budget. Algorithm 1 accesses the histogram and generates a noisy answer in each round. However, the noisy answer is only used to check whether the current histogram is accurate enough to answer the query. In most rounds the algorithm does not release the noisy answer, and therefore consumes no privacy budget. The noisy answer is only released in an update round, when the current histogram is not accurate enough. Consequently, the privacy budget is only consumed in the update steps, and the privacy analysis can be limited to the correlated update function.
Sequential composition accumulates the privacy budget of each step when a series of private analyses is performed sequentially on the dataset. As mentioned earlier, given an x we have u_max = \log N / (2\delta_0^2), and the privacy budget allocated to each round is \varepsilon_0 = 2\delta_0^2 \varepsilon / \log N. According to sequential composition, the released answers for the query set F consume u_F \cdot \varepsilon_0 of the privacy budget. Because u_F \le u_max, we have u_F \cdot \varepsilon_0 \le \varepsilon. Consequently, the proposed CIM algorithm preserves \varepsilon-differential privacy.
For the utility analysis, an error measurement is used. In CIM, the utility is measured over the set of released query answers; accordingly, the error is measured by the maximal distance between the original and the released answers.

Definition 14.9 ((\alpha, \beta)-Accuracy for CIM) The CIM is (\alpha, \beta)-accurate for a set of queries F if, with probability 1 - \beta, for every query f \in F and every x we have \max_{t \le L} |CIM_t(x_t) - f_t(x)| \le \alpha, where

\alpha \le \frac{CS_q}{2\varepsilon_0} \log \frac{p_1 p_2 L}{\beta} + \frac{T}{2},

with L the total number of rounds.
Proof If |\hat{d}_t| \le T, it is a non-update round and CIM outputs f_t(x_{t-1}). Because

|\hat{d}_t| = |f_t(x_{t-1}) - \hat{a}_t| \le T,

the error of a non-update round exceeds \alpha with probability at most \exp(-|\alpha - T|\varepsilon_0 / CS_q), so over L_1 non-update rounds the failure probability is at most

L_1 \exp\!\left( - \frac{|\alpha - T| \varepsilon_0}{CS_q} \right);

in an update round, CIM outputs \hat{a}_t, whose error exceeds \alpha with probability at most \exp(-\alpha \varepsilon_0 / CS_q), so over L_2 update rounds the failure probability is at most

L_2 \exp\!\left( - \frac{\alpha \varepsilon_0}{CS_q} \right).

Accordingly,

\Pr\!\left( \max_{f \in F,\, t \le L} |CIM_t(x_t) - f_t(x)| > \alpha \right) \le L_1 \exp\!\left( - \frac{|\alpha - T| \varepsilon_0}{CS_q} \right) + L_2 \exp\!\left( - \frac{\alpha \varepsilon_0}{CS_q} \right).

Let

L_1 \exp\!\left( - \frac{|\alpha - T| \varepsilon_0}{CS_q} \right) + L_2 \exp\!\left( - \frac{\alpha \varepsilon_0}{CS_q} \right) \le \beta;

writing L_1 = p_1 L and L_2 = p_2 L, we have

p_1 \exp\!\left( - \frac{|\alpha - T| \varepsilon_0}{CS_q} \right) + p_2 \exp\!\left( - \frac{\alpha \varepsilon_0}{CS_q} \right) \le \frac{\beta}{L}

\Rightarrow \frac{(T - 2\alpha)\varepsilon_0}{CS_q} \le \log p_1 p_2 + \log \frac{\beta}{L}

\Rightarrow \alpha \le \frac{CS_q}{2\varepsilon_0} \log \frac{p_1 p_2 L}{\beta} + \frac{T}{2}.
1 http://archive.ics.uci.edu/ml/.
NLTCS The National Long-Term Care Survey (NLTCS) dataset was derived from StatLib.2 It contained 21,574 records with 16 binary attributes, corresponding to 16 functional disability measures: six activities of daily living and ten instrumental activities of daily living.
Three other datasets were derived from Hay's work [96]; they have been widely used in differentially private histogram release tests.
Search Logs This synthetic dataset was generated by interpolating Google Trends data and America Online search logs. It contained 32,768 records, each of which stored the frequency of searches for the keyword Obama within a 90-min interval between January 1, 2004 and August 9, 2009.
NetTrace This contained an IP-level network trace collected at a border gateway of a university. Each record reported the number of external hosts connected to an internal host. There were 65,536 records, with connection numbers ranging from 1 to 1423.
Social Network This dataset records the friendship relations among 11,000 students from the same institution, sampled from an online social network web site. Each record contains the number of friends of a certain student. There were 32,768 students, each of whom had at most 1678 friends.
All datasets contained no pre-defined correlation information. The Pearson correlation coefficient was used to generate the correlated degree matrix with the threshold \delta_0 = 0.6. In all datasets, approximately a quarter of the records were correlated with some other records, and the maximal size of a correlated group was around 10. For each dataset, a query set F was generated with 10,000 random linear queries, and each answer fell into [0, 1]. The accuracy of the results was measured by the Mean Square Error (MSE):

MSE = \frac{1}{|F|} \sum_{f_i \in F} \left( \hat{f}_i(x) - f_i(x) \right)^2.   (14.21)
The performance of the CIM mechanism was examined through comparison with the naive Laplace Mechanism (LM) [62]. To show its effectiveness, correlated sensitivity (CS) was tested in both CIM and LM, denoted as CIM&CS and LM&CS, respectively. The experiments were conducted on all datasets with privacy budgets varying from 0.1 to 1. The parameter T was set to 0.3000, and a second parameter to 7.0000.
2 http://lib.stat.cmu.edu/.
As shown in Fig. 14.1, CIM has a lower MSE than LM on all datasets. Specifically, for the Adult dataset in Fig. 14.1a, when \varepsilon = 0.4 the CIM achieves an MSE of 0.3593 while LM achieves 0.5171; thus, CIM outperforms LM by 43.84%. When \varepsilon = 1, CIM achieves an MSE of 0.0491, outperforming LM by 73.12%. These results illustrate that on correlated datasets CIM outperforms LM when answering a large set of queries. The improvement of CIM can also be observed in Fig. 14.1b-f. The proposed CIM performs better because it only consumes the privacy budget in the update rounds, whose number is less than the total number of queries |F|, while the traditional LM mechanism consumes privacy budget when answering every query, which leads to inaccurate answers. The experimental results show the effectiveness of CIM when answering a large set of queries.
In the context of differential privacy, the privacy budget \varepsilon serves as a key parameter in determining the privacy level. Figure 14.1 also shows the impact of \varepsilon on the performance of CIM. According to Dwork [62], \varepsilon = 1 or less is suitable for privacy-preserving purposes. For a comprehensive investigation, CIM's performance is evaluated under various privacy-preserving levels by varying the privacy budget \varepsilon from 0.1 to 1 in steps of 0.1 on six datasets. It was observed that as \varepsilon increases, the MSE decreases, which means the lower the privacy-preserving level, the larger the utility. In Fig. 14.1a, the MSE of CIM is 5.5350 when \varepsilon = 0.1: even though this preserves a strict privacy guarantee, the query answers are quite inaccurate. When \varepsilon = 0.7, the MSE drops to 0.1025, retaining acceptable utility. The same trend can be observed on the other datasets; for example, when \varepsilon = 0.7, the MSE is 0.1894 in Fig. 14.1b and 0.1733 in Fig. 14.1c. These results confirm that the utility is enhanced as the privacy budget increases.
Moreover, the MSE decreases much faster when \varepsilon ascends from 0.1 to 0.4 than when \varepsilon ascends from 0.4 to 1. This indicates that a larger utility cost is needed to achieve a higher privacy level (\varepsilon = 0.1). The performance of both the CIM and LM mechanisms was stable when \varepsilon \ge 0.7, indicating that CIM is capable of retaining the utility of data releasing while satisfying a suitable privacy-preserving requirement.
In addition, the MSE of CIM is compared with a non-private release. If the curator answers all queries without any privacy guarantee, the MSE is 0. Figure 14.1 shows that the MSE of CIM was very close to 0 when \varepsilon \ge 0.7. This is because CIM applies iterative steps and correlated sensitivity to reduce the magnitude of the introduced noise. This result confirms that correlated differential privacy can ensure rigorous privacy with an acceptable utility loss.
The evaluation shows the effectiveness of CIM in several aspects:
1. The proposed CIM retains a higher utility of the released data than the LM.
2. As the privacy budget increases, the performance of CIM is significantly enhanced; a suitable privacy guarantee can be selected to achieve a better trade-off.
3. With a sufficient privacy budget, the utility loss of the released data is small.
[Six panels plot MSE against the privacy budget ε (0.1-1), comparing CIM&CS with LM&CS; insets zoom in on ε between 0.6 and 1.]
Fig. 14.1 Effectiveness of CIM. (a) Adult. (b) NLTCS. (c) IDS. (d) Search Log. (e) NetTrace. (f) Social Network
[Six panels plot MSE against the privacy budget ε (0.1-1), comparing GS with CS; insets zoom in on ε between 0.6 and 1.]
Fig. 14.2 Effectiveness of Correlated Sensitivity. (a) Adult. (b) NLTCS. (c) IDS. (d) Search Log. (e) NetTrace. (f) Social Network
14.5 Summary
This line of research proves that differential privacy has a generalization property that can be applied in statistical analysis and machine learning [16, 174], though at this early stage many new possibilities for generalization remain to be explored.
It is common that data owners have quite different expectations regarding the acceptable level of privacy for their data. Consequently, differential privacy may lead to insufficient privacy protection for some users while over-protecting others. If we can relax the level of privacy for some data owners, a higher level of utility can often be achieved. To this end, a personalized differential privacy is required, in which users can specify a personal privacy level for their data [111].
Ebadi [76] proposed a Personalized Differential Privacy (PDP) framework that assigns a separate privacy budget to each record. When a query is performed on the dataset, the privacy level is determined by the sum of the privacy budgets of all responding records. Alaggan et al. [10] considered the privacy expectation not only for the individual user but also for the individual item; they introduced the concept of heterogeneous differential privacy, as opposed to previous models that implicitly assume uniform privacy requirements. Koufogiannis [129] considered releasing sensitive data when the privacy level is subject to change over time; this intuition can guide privacy-level allocation in the personalized privacy problem.
These are tentative frameworks for personalized privacy, but how to determine the privacy budget for each record or individual still needs to be tackled.
Most existing work is concerned with the centralized model of differential privacy, in which a trusted curator holds the entire private dataset and computes on it in a differentially private way. If a dataset is divided among multiple mutually untrusting curators, however, how can they compute differentially private messages for communication among themselves? Mironov et al. [158] explored a two-party scenario by showing a lower bound for the problem of computing the Hamming distance between two datasets. Mohammed et al. [159] presented a two-party protocol based on the exponential mechanism. Both solutions are unlikely to be valid when the number of curators is more than two.
How to preserve distributed differential privacy among multiple parties is a future topic for open-ended theoretical exploration. The solution will involve interdisciplinary techniques, including privacy preservation, security protocol design, and cryptography.
15.5 Differential Privacy in Genetic Data 217
Mechanism design is the field of algorithm design in which the inputs to the mechanism are controlled by strategic agents who may manipulate their inputs [15]. In this setting, the design of the mechanism must convince agents to provide their correct inputs to the mechanism.
A typical application of mechanism design is auction design. Consider a scenario in which a data analyst wishes to buy information from a population to estimate some statistic, while the owners of the private data experience some cost for their loss of privacy, and agents between the sellers and the buyers wish to maximize their profit. The goal is to design a truthful auction mechanism that preserves the privacy of the dataset.
Differential privacy limits any individual’s influence on the result. If the mech-
anism satisfies differential privacy, agents will have little incentive to deviate from
truthful behavior since they can only change the selected equilibria to a small degree.
McSherry [157] first proposed designing auction mechanisms using differentially
private mechanisms as the building blocks. This private mechanism is only approx-
imately truthful. Nissim et al. [173] showed how to convert differentially private
mechanisms into exactly truthful mechanisms. Cummings et al. [51] studied the
multi-dimensional aggregative games and solved the equilibrium selection problem.
Barthe et al. [15] introduced a relational refinement type system for verifying
mechanism design and differential privacy.
This series of research works focuses on mechanism design based on differential privacy, taking advantage of its properties. The direction lies at the intersection of differential privacy, mechanism design, and probabilistic programming languages.
attack. Naveed et al. [168] reviewed mitigating strategies for such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy.
In the context of genome-wide association studies (GWAS), several initial studies have explored the capability of differential privacy methods in releasing statistics for GWAS data, or in shifting the original locations of variants. Johnson and Shmatikov [110] developed a framework for ab initio exploration of case-control GWAS that provides privacy-preserving answers to key GWAS queries. They designed operators to output, in a differentially private manner, the number of SNPs associated with a disease, the locations of the most significant SNPs, and the p-values of any statistical test between a given SNP and the disease. By considering protection against set-membership disclosure, Tramer et al. proposed a relaxation of the adversarial model of differential privacy and showed that this weaker setting achieves higher utility [220].
However, differentially private data release is currently still impractical, because it introduces a large amount of noise even for a few single-nucleotide polymorphism (SNP) locations in a given population. It is uncertain whether there is a calibrated noise-adding mechanism for GWAS data that satisfies the requirements of differential privacy [77].
noisy user data, then conducts post-processing on it to obtain acceptable statistics which can be further shared with the public. Here, the types of statistics available to the data curator are limited to population statistics and depend on the design of both the local perturbation mechanism and the post-processing mechanism. That is to say, with the support of the LDP model, the untrusted data curator can estimate approximate information about the whole user population, such as the prevalence rate of a disease, without inferring which user has suffered from that sensitive disease.
The local privacy model was first formalized in [115], and a theoretical upper bound under the LDP model was then given by Duchi et al. [60]. Typically, two main research questions are investigated:
1. How to design acceptable LDP mechanisms for the different original data types generated by distributed users? For example, data types range from single-attribute data to multi-attribute data and even set-valued data.
2. How to design acceptable LDP mechanisms that can achieve different analysis targets, such as value estimation, distribution estimation, and more complex machine learning tasks?
In the local differential privacy setting, the local perturbation mechanism conducted by each user should be a non-trivial differentially private noise-adding mechanism, such that it provides the data curator with the ability to estimate certain statistics of the user population while satisfying differential privacy. In existing research, the non-trivial differentially private noise-adding mechanisms referred to above are all based on the Randomized Response technique. In the past several years, LDP work has covered all three aspects of the data life cycle: the data type that exists at the user node (Original Data Type), the data type submitted to the data curator by users (Uploaded Data Type), and the target to be achieved by analysing the uploaded data (Post-process Target). Table 15.1 gives an overview of these three aspects.
220 15 Future Directions and Conclusion
Table 15.1 Three aspects of data life cycle in local differential privacy

Original data type | Uploaded data type | Post-process target
Single numeric or categorical attribute | Noisy attribute value | Value estimation [230]; distribution estimation [78]
Multiple numeric or categorical attributes | Noisy attribute values | Value estimation [230]; mean and frequency estimation, complex machine learning tasks [171]; multi-dimensional joint distribution estimation [192]
Set-valued data | A random bit of noisy set-valued data [187]; noisy set-valued data [223] | Frequent item set estimation [187]; discrete distribution estimation [223]
Location data | A random bit of the encoding of the location | Count estimation [42]
The difficulty in data publishing lies in the high correlation among queries when a large set of queries must be answered [102]. High correlation between queries leads to a large volume of noise: according to the definition of sensitivity, correlations between m queries lead to a higher sensitivity (normally m times the original sensitivity) than independent queries, and the noise calibrated by this higher sensitivity is added to each query answer. The accuracy of the results is dramatically decreased compared to independent queries [140].
Current solutions aim to break the correlation by using a transformation mechanism or by publishing a synthetic dataset, both of which have been discussed in Chap. 5: Non-interactive setting. However, both solutions only partly solve the problem, and at the same time a new challenge arises with the non-interactive setting: how to deal with unknown, fresh queries. As the curator cannot know what users will ask after the data has been published, he/she has to consider all possible queries and add pre-defined noise. When the dimension of the dataset is high, it is impossible to list all queries. Even if the curator were able to list all queries, this pre-defined noise would dramatically decrease the utility of the published results.
Zhu et al. [257] observed that these two challenges can be overcome by transforming the data publishing problem into a machine learning problem. They treated queries as training samples used to generate a prediction model, rather than publishing a set of queries or a synthetic dataset. To address correlations between queries, they apply a limited set of queries to train the model; these limited queries have lower correlation than the original query set. The model is then used to predict the remaining queries, including fresh ones. Consequently, the model publishing method uses a limited number of queries to train an approximately accurate model and answers other fresh queries.
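A heavily simplified sketch of this idea (not Zhu et al.'s actual algorithm: the query set, the fixed noise values standing in for private perturbation, and the plain SGD fit are all assumptions):

```python
def fit_linear(queries, answers, dim, lr=0.1, epochs=2000):
    """Fit weights w so that w . q approximates the noisy training answers."""
    w = [0.0] * dim
    for _ in range(epochs):
        for q, a in zip(queries, answers):
            pred = sum(wi * qi for wi, qi in zip(w, q))
            err = pred - a
            w = [wi - lr * err * qi for wi, qi in zip(w, q)]
    return w

x = [0.1, 0.4, 0.3, 0.2]    # hidden histogram, never published
train_queries = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
                 [0, 0, 0, 1], [1, 1, 0, 0]]
noise = [0.01, -0.01, 0.02, -0.02, 0.0]      # stand-in for private noise
noisy_answers = [sum(qi * xi for qi, xi in zip(q, x)) + e
                 for q, e in zip(train_queries, noise)]

w = fit_linear(train_queries, noisy_answers, dim=4)
fresh = [0, 0, 1, 1]        # a query never asked during training
pred = sum(wi * qi for wi, qi in zip(w, fresh))
print(round(pred, 2))
```

The model, trained only on released noisy answers, predicts the fresh query without touching the dataset again, which is the core of the model publishing method contrasted with private learning in Table 15.2.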
15.7 Learning Model Publishing 221
Table 15.2 The difference between the model publishing method and private learning

| Model publishing method | Private learning
Model | The model is used to predict fresh query answers for public users | The model is used for traditional machine learning
Training set | Queries on the dataset | Records in the dataset
Protect target | Preserve the privacy of all queries | Do not protect future samples
Learning algorithms | Prediction | Classification
15.8 Conclusion
16. R. Bassily, A. Smith, T. Steinke, and J. Ullman. More general queries and less generalization
error in adaptive data analysis. CoRR, abs/1503.04843, 2015.
17. R. Bassily, A. D. Smith, and A. Thakurta. Private empirical risk minimization: Efficient
algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
18. A. Beimel, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for
private learning and private data release. In TCC, pages 437–454. 2010.
19. A. Beimel, K. Nissim, and U. Stemmer. Characterizing the sample complexity of private
learners. In ITCS, pages 97–110, 2013.
20. A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs.
approximate differential privacy. CoRR, abs/1407.2674, 2014.
21. A. Beimel, K. Nissim, and U. Stemmer. Learning privately with labeled and unlabeled
examples. In SODA, pages 461–477, 2015.
22. Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning,
2(1):1–127, 2009.
23. S. Berkovsky, Y. Eytani, T. Kuflik, and F. Ricci. Enhancing privacy and preserving accuracy
of a distributed collaborative filtering. RecSys ’07, pages 9–16, New York, NY, USA, 2007.
ACM.
24. R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive
data. In SIGKDD, pages 503–512, 2010.
25. D. M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, Apr. 2012.
26. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine
Learning Research, 3:993–1022, Mar. 2003.
27. J. Blocki, A. Blum, A. Datta, and O. Sheffet. Differentially private data analysis of social
networks via restricted sensitivity. In ITCS, pages 87–96, 2013.
28. A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In
PODS, pages 128–138, 2005.
29. A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database
privacy. In STOC, pages 609–618, 2008.
30. J. A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten, and V. Shmatikov. "You might also like:" privacy risks of collaborative filtering. In SP'11, pages 231–246, 2011.
31. J. Canny. Collaborative filtering with privacy via factor analysis. SIGIR ’02, pages 238–245,
New York, NY, USA, 2002. ACM.
32. L. Cao. Non-iidness learning in behavioral and social data. The Computer Journal, 2013.
33. L. Cao, Y. Ou, and P. S. Yu. Coupled behavior analysis with applications. IEEE Transactions
on Knowledge and Data Engineering, 24(8):1378–1392, 2012.
34. T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. ACM Trans.
Inf. Syst. Secur., 14(3):26:1–26:24, 2011.
35. K. Chandrasekaran, J. Thaler, J. Ullman, and A. Wan. Faster private release of marginals on
small databases. In ITCS’14, pages 387–402, 2014.
36. K. Chatzikokolakis, C. Palamidessi, and M. Stronati. Location privacy via geo-
indistinguishability. ACM SIGLOG News, 2(3):46–69, 2015.
37. K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In
COLT, pages 155–186, 2011.
38. K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS,
pages 289–296, 2008.
39. K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk
minimization. Journal of Machine Learning Research, 12(2):1069–1109, 2011.
40. R. Chen, B. C. Fung, B. C. Desai, and N. M. Sossou. Differentially private transit data
publication: a case study on the Montreal transportation system. In SIGKDD, pages 213–221,
2012.
41. R. Chen, B. C. Fung, P. S. Yu, and B. C. Desai. Correlated network data publication via
differential privacy. The VLDB Journal, 23(4):653–676, 2014.
42. R. Chen, H. Li, A. K. Qin, S. P. Kasiviswanathan, and H. Jin. Private spatial data aggregation
in the local setting. In ICDE, pages 289–300, 2016.
References 225
95. M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of
private networks. In ICDM, pages 169–178, 2009.
96. M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private
histograms through consistency. Proc. VLDB Endow., 3(1):1021–1032, 2010.
97. X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and D. Srivastava. DPT:
Differentially private trajectory synthesis using hierarchical reference systems. Proc. VLDB
Endow., 8(11):1154–1165, 2015.
98. X. He, A. Machanavajjhala, and B. Ding. Blowfish privacy: Tuning privacy-utility trade-
offs using policies. In Proceedings of the 2014 ACM SIGMOD International Conference on
Management of Data, SIGMOD ’14, pages 1447–1458, New York, NY, USA, 2014. ACM.
99. S. Ho and S. Ruan. Differential privacy for location pattern mining. In Proceedings of the
4th ACM SIGSPATIAL International Workshop on Security and Privacy in GIS and LBS,
SPRINGL 2011, November 1st, 2011, Chicago, IL, USA, pages 17–24, 2011.
100. B. Hoh, M. Gruteser, H. Xiong, and A. Alrabady. Enhancing security and privacy in traffic-
monitoring systems. IEEE Pervasive Computing, 5(4):38–46, Oct. 2006.
101. Y. Hong, J. Vaidya, H. Lu, P. Karras, and S. Goel. Collaborative search log sanitization:
Toward differential privacy and boosted utility. IEEE Trans. Dependable Sec. Comput.,
12:504–518, 2015.
102. D. Huang, S. Han, X. Li, and P. S. Yu. Orthogonal mechanism for answering batch queries
with differential privacy. In SSDBM, pages 24:1–24:10, 2015.
103. Z. Huang and A. Roth. Exploiting metric structure for efficient private query release. In
SODA, pages 523–534, 2014.
104. G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private
random decision tree classifier. Transactions on Data Privacy, 5(1):273–295, 2012.
105. P. Jain and A. Thakurta. Differentially private learning with kernels. In ICML, pages 118–126,
2013.
106. P. Jain and A. G. Thakurta. (near) dimension independent risk bounds for differentially private
learning. In ICML, pages 476–484, 2014.
107. R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recom-
mendations in folksonomies. PKDD 2007, pages 506–514, Berlin, Heidelberg, 2007.
Springer-Verlag.
108. J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann, San
Francisco, CA, 2001.
109. X. Jin, N. Zhang, and G. Das. Algorithm-safe privacy-preserving data publishing. EDBT ’10,
pages 633–644, New York, NY, USA, 2010. ACM.
110. A. Johnson and V. Shmatikov. Privacy-preserving data exploration in genome-wide associa-
tion studies. In SIGKDD, pages 1079–1087, 2013.
111. Z. Jorgensen, T. Yu, and G. Cormode. Conservative or liberal? personalized differential
privacy. In ICDE 2015, pages 1023–1034, 2015.
112. P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In
ICML, pages 1376–1385, 2015.
113. V. Karwa, S. Raskhodnikova, A. Smith, and G. Yaroslavtsev. Private analysis of graph
structure. ACM Trans. Database Syst., 39(3):22:1–22:33, 2014.
114. S. P. Kasiviswanathan and H. Jin. Efficient private empirical risk minimization for high-
dimensional learning. In ICML, pages 488–497, 2016.
115. S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we
learn privately? In FOCS, pages 531–540, 2008.
116. S. P. Kasiviswanathan, K. Nissim, and H. Jin. Private incremental regression. CoRR,
abs/1701.01093, 2017.
117. S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith. Analyzing graphs with
node differential privacy. In TCC, pages 457–476, 2013.
118. S. P. Kasiviswanathan, M. Rudelson, A. Smith, and J. Ullman. The price of privately releasing
contingency tables and the spectra of random matrices with correlated rows. In STOC 2010,
pages 775–784, 2010.
119. L. Kazemi and C. Shahabi. Geocrowd: Enabling query answering with spatial crowdsourcing.
In Proceedings of the 20th International Conference on Advances in Geographic Information
Systems, SIGSPATIAL ’12, pages 189–198, New York, NY, USA, 2012. ACM.
120. M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45:983–1006,
1998.
121. M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT
Press, Cambridge, MA, 1994.
122. G. Kellaris and S. Papadopoulos. Practical differential privacy via grouping and smoothing.
In PVLDB, pages 301–312, 2013.
123. G. Kellaris, S. Papadopoulos, X. Xiao, and D. Papadias. Differentially private event sequences
over infinite streams. Proc. VLDB Endow., 7(12):1155–1166, 2014.
124. H. Kido, Y. Yanagisawa, and T. Satoh. Protection of location privacy using dummies
for location-based services. In Proceedings of the 21st International Conference on Data
Engineering Workshops, ICDEW '05, page 1248, Washington, DC, USA, 2005. IEEE
Computer Society.
125. D. Kifer. Attacks on privacy and DeFinetti’s theorem. SIGMOD ’09, pages 127–138, New
York, NY, USA, 2009. ACM.
126. D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, pages 193–204,
2011.
127. D. Kifer and A. Machanavajjhala. Pufferfish: A framework for mathematical privacy
definitions. ACM Trans. Database Syst., 39(1):3:1–3:36, 2014.
128. D. Kifer, A. D. Smith, and A. Thakurta. Private convex optimization for empirical risk
minimization with applications to high-dimensional regression. In COLT, pages 25.1–25.40,
2012.
129. F. Koufogiannis, S. Han, and G. J. Pappas. Gradual release of sensitive data under differential
privacy. CoRR, abs/1504.00429, 2015.
130. R. Krestel, P. Fankhauser, and W. Nejdl. Latent Dirichlet allocation for tag recommendation.
RecSys ’09, pages 61–68, New York, NY, USA, 2009. ACM.
131. S. Le Blond, C. Zhang, A. Legout, K. Ross, and W. Dabbous. I know where you are and
what you are sharing: exploiting p2p communications to invade users’ privacy. IMC ’11,
pages 45–60, New York, NY, USA, 2011. ACM.
132. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
133. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
134. J. Lee and C. W. Clifton. Top-k frequent itemsets via differentially private FP-trees. In
SIGKDD, pages 931–940, 2014.
135. J. Lee, Y. Wang, and D. Kifer. Maximum likelihood postprocessing for differential privacy
under consistency constraints. In KDD, pages 635–644, 2015.
136. J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://
snap.stanford.edu/data, June 2014.
137. C. Li, M. Hay, G. Miklau, and Y. Wang. A data- and workload-aware query answering
algorithm for range queries under differential privacy. Proc. VLDB Endow., 7(5):341–352,
2014.
138. C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries
under differential privacy. In PODS, pages 123–134, 2010.
139. C. Li and G. Miklau. An adaptive mechanism for accurate query answering under differential
privacy. Proc. VLDB Endow., 5(6):514–525, 2012.
140. C. Li and G. Miklau. Optimal error of query sets under the differentially-private matrix
mechanism. In ICDT, pages 272–283, 2013.
141. C. Li, G. Miklau, M. Hay, A. McGregor, and V. Rastogi. The matrix mechanism: optimizing
linear counting queries under differential privacy. The VLDB Journal, 24(6):1–25, 2015.
142. N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-
diversity. In ICDE, pages 106–115, April 2007.
143. N. Li, T. Li, and S. Venkatasubramanian. Closeness: A new privacy measure for data
publishing. IEEE Transactions on Knowledge and Data Engineering, 22(7):943–956, July
2010.
144. N. Li, W. Qardaji, D. Su, and J. Cao. Privbasis: Frequent itemset mining with differential
privacy. Proc. VLDB Endow., 5(11):1340–1351, 2012.
145. N. Li, W. Yang, and W. Qardaji. Differentially private grids for geospatial data. In
Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013),
ICDE ’13, pages 757–768, Washington, DC, USA, 2013. IEEE Computer Society.
146. B. Lin and D. Kifer. Information preservation in statistical privacy and Bayesian estimation
of unattributed histograms. In SIGMOD, pages 677–688, 2013.
147. J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on
Information Theory, 37(1):145–151, 1991.
148. Y. Lindell and B. Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54, 2000.
149. H. Lu, C. S. Jensen, and M. L. Yiu. Pad: Privacy-area aware, dummy-based location privacy
in mobile services. In Proceedings of the Seventh ACM International Workshop on Data
Engineering for Wireless and Mobile Access, MobiDE ’08, pages 16–23, New York, NY,
USA, 2008. ACM.
150. T. Luo, H. P. Tan, and L. Xia. Profit-maximizing incentive for participatory sensing. Advances
in artificial intelligence, pages 127–135, 2014.
151. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: Privacy
beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1(1), Mar. 2007.
152. A. Machanavajjhala, A. Korolova, and A. D. Sarma. Personalized social recommendations -
accurate or private? PVLDB, 4(7):440–450, 2011.
153. L. Marinho, A. Hotho, R. Jäschke, A. Nanopoulos, S. Rendle, L. Schmidt-Thieme,
G. Stumme, and P. Symeonidis. Recommender Systems for Social Tagging Systems.
SpringerBriefs in Electrical and Computer Engineering, pages 75–80. Springer US, 2012.
154. Y. Matsuo, N. Okazaki, K. Izumi, Y. Nakamura, T. Nishimura, K. Hasida, and H. Nakashima.
Inferring long-term user properties based on users’ location history. In Proceedings of the
20th International Joint Conference on Artificial Intelligence, IJCAI’07, pages 2159–2165,
San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
155. F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data
analysis. Commun. ACM, 53(9), 2010.
156. F. McSherry and I. Mironov. Differentially private recommender systems: Building privacy
into the net. In SIGKDD, pages 627–636, 2009.
157. F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–
103, 2007.
158. I. Mironov, O. Pandey, O. Reingold, and S. Vadhan. Computational differential privacy. In
S. Halevi, editor, Advances in Cryptology - CRYPTO 2009, volume 5677 of Lecture Notes in
Computer Science, pages 126–142. Springer Berlin Heidelberg, 2009.
159. N. Mohammed, D. Alhadidi, B. C. M. Fung, and M. Debbabi. Secure two-party differentially
private data release for vertically partitioned data. IEEE Trans. Dependable Sec. Comput.,
11(1):59–71, 2014.
160. N. Mohammed, D. Alhadidi, B. C. M. Fung, and M. Debbabi. Secure two-party differentially
private data release for vertically partitioned data. IEEE Trans. Dependable Sec. Comput.,
11:59–71, 2014.
161. N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu. Differentially private data release for data
mining. In SIGKDD, pages 493–501, 2011.
162. P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. GUPT: Privacy preserving data
analysis made easy. In SIGMOD, pages 349–360, 2012.
163. M. F. Mokbel, C.-Y. Chow, and W. G. Aref. The new Casper: Query processing for location
services without compromising privacy. In Proceedings of the 32Nd International Conference
on Very Large Data Bases, VLDB ’06, pages 763–774. VLDB Endowment, 2006.
164. J. Murtagh and S. Vadhan. The complexity of computing the optimal composition of
differential privacy. CoRR, abs/1507.03113, 2015.
165. S. Muthukrishnan and A. Nikolov. Optimal private halfspace counting via discrepancy. In
STOC, pages 1285–1292, 2012.
166. A. Narayanan and V. Shmatikov. How to break anonymity of the Netflix prize dataset. CoRR,
abs/cs/0610105, 2006.
167. A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. SP ’08,
pages 111–125, Washington, DC, USA, 2008. IEEE Computer Society.
168. M. Naveed, E. Ayday, E. W. Clayton, J. Fellay, C. A. Gunter, J.-P. Hubaux, B. A. Malin, and
X. Wang. Privacy in the genomic era. ACM Comput. Surv., 48(1):6:1–6:44, 2015.
169. M. E. Nergiz, M. Atzori, and C. Clifton. Hiding the presence of individuals from shared
databases. SIGMOD ’07, pages 665–676, New York, NY, USA, 2007. ACM.
170. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in
natural images with unsupervised feature learning. NIPS Workshop on Deep Learning &
Unsupervised Feature Learning, 2012.
171. T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin. Collecting and analyzing
data from smart device users with local differential privacy. CoRR, abs/1606.05053, 2016.
172. K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data
analysis. In STOC, pages 75–84, 2007.
173. K. Nissim, R. Smorodinsky, and M. Tennenholtz. Approximately optimal mechanism design
via differential privacy. In Innovations in Theoretical Computer Science, pages 203–213,
2012.
174. K. Nissim and U. Stemmer. On the generalization properties of differential privacy. CoRR,
abs/1504.05800, 2015.
175. X. Pan, J. Xu, and X. Meng. Protecting location privacy against location-dependent attack in
mobile services. In Proceedings of the 17th ACM Conference on Information and Knowledge
Management, CIKM ’08, pages 1475–1476, New York, NY, USA, 2008. ACM.
176. R. Parameswaran and D. Blough. Privacy preserving collaborative filtering using data
obfuscation. In IEEE International Conference on Granular Computing (GRC 2007),
page 380, Nov. 2007.
177. J. Parra-Arnau, A. Perego, E. Ferrari, J. Forne, and D. Rebollo-Monedero. Privacy-preserving
enhanced collaborative tagging. IEEE Transactions on Knowledge and Data Engineering,
99(PrePrints):1, 2013.
178. J. Parra-Arnau, D. Rebollo-Monedero, and J. Forne. Measuring the privacy of user profiles in
personalized information systems. Future Generation Computer Systems, 2013.
179. S. Peng, Y. Yang, Z. Zhang, M. Winslett, and Y. Yu. DP-tree: Indexing multi-dimensional
data under differential privacy (abstract only). In Proceedings of the 2012 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’12, pages 864–864, New York,
NY, USA, 2012. ACM.
180. N. Phan, Y. Wang, X. Wu, and D. Dou. Differential privacy preservation for deep auto-
encoders: an application of human behavior prediction. In AAAI, pages 1309–1316, 2016.
181. H. Polat and W. Du. Privacy-preserving collaborative filtering using randomized perturbation
techniques. In ICDM 2003, pages 625–628, Nov. 2003.
182. H. Polat and W. Du. Achieving private recommendations using randomized response
techniques. PAKDD’06, pages 637–646, Berlin, Heidelberg, 2006. Springer-Verlag.
183. L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd
sensing with cloaked locations. In Proceedings of the 2014 IEEE 15th International
Conference on Mobile Data Management - Volume 01, MDM ’14, pages 73–82, Washington,
DC, USA, 2014. IEEE Computer Society.
184. D. Proserpio, S. Goldberg, and F. McSherry. Calibrating data to sensitivity in private data
analysis. PVLDB, 7(8):637–648, 2014.
185. W. Qardaji, W. Yang, and N. Li. Understanding hierarchical methods for differentially private
histograms. Proc. VLDB Endow., 6(14):1954–1965, 2013.
186. W. H. Qardaji, W. Yang, and N. Li. Preview: practical differentially private release of marginal
contingency tables. In SIGMOD 2014, pages 1435–1446, 2014.
187. Z. Qin, Y. Yang, T. Yu, I. Khalil, X. Xiao, and K. Ren. Heavy hitter estimation over set-valued
data with local differential privacy. In SIGSAC, pages 192–203, 2016.
188. S. Rana, S. K. Gupta, and S. Venkatesh. Differentially private random forest with high utility.
In ICDM, pages 955–960, 2015.
189. S. Raskhodnikova and A. Smith. Efficient Lipschitz extensions for high-dimensional graph
statistics and node private degree distributions. FOCS, 2016.
190. V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship privacy: output perturbation for
queries with joins. In PODS, pages 107–116, 2009.
191. B. Recht, C. Ré, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In NIPS, pages 693–701, 2011.
192. X. Ren, C. Yu, W. Yu, S. Yang, X. Yang, J. A. McCann, and P. S. Yu. LoPub:
High-dimensional crowdsourced data publication with local differential privacy. CoRR,
abs/1612.04350, 2016.
193. Y. Ren, G. Li, and W. Zhou. Learning rating patterns for top-n recommendations. In
ASONAM, pages 472–479, 2012.
194. Y. Ren, G. Li, and W. Zhou. A learning method for top-n recommendations with incomplete
data. Social Network Analysis and Mining, pages 1–14, 2013.
195. A. Roth and T. Roughgarden. Interactive privacy via the median mechanism. In STOC,
pages 765–774, 2010.
196. B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space:
Privacy-preserving mechanisms for SVM learning. CoRR, abs/0911.5708, 2009.
197. P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing
information. In PODS, page 188, 1998.
198. A. D. Sarwate and K. Chaudhuri. Signal processing and machine learning with differential
privacy: Algorithms and challenges for continuous data. IEEE Signal Processing Magazine,
30(5):86–94, 2013.
199. H. Shah-Mansouri and V. W. Wong. Profit maximization in mobile crowdsourcing: A truthful
auction mechanism. In Communications (ICC), 2015 IEEE International Conference on,
pages 3216–3221. IEEE, 2015.
200. E. Shen and T. Yu. Mining frequent graph patterns with differential privacy. In SIGKDD,
pages 545–553, 2013.
201. A. Shepitsen, J. Gemmell, B. Mobasher, and R. Burke. Personalized recommendation in
social tagging systems using hierarchical clustering. RecSys ’08, pages 259–266, New York,
NY, USA, 2008. ACM.
202. R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In SIGSAC, pages 1310–1321,
2015.
203. B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective
knowledge. WWW ’08, pages 327–336, New York, NY, USA, 2008. ACM.
204. Y. Song, L. Cao, X. Wu, G. Wei, W. Ye, and W. Ding. Coupled behavior analysis for capturing
coupling relationships in group-based market manipulations. KDD ’12, pages 976–984, New
York, NY, USA, 2012. ACM.
205. M. Srivatsa and M. Hicks. Deanonymizing mobility traces: using social network as a side-
channel. CCS ’12, pages 628–637, New York, NY, USA, 2012. ACM.
206. T. Steinke and J. Ullman. Between pure and approximate differential privacy. CoRR,
abs/1501.06095, 2015.
207. L. Stenneth and P. S. Yu. Mobile systems privacy: 'MobiPriv', a robust system for snapshot
or continuous querying location based mobile systems. Transactions on Data Privacy,
5(1):333–376, 2012.
208. M. Steyvers and T. Griffiths. Probabilistic topic models. In Handbook of Latent Semantic
Analysis, pages 424–440. 2007.
209. S. Su, P. Tang, X. Cheng, R. Chen, and Z. Wu. Differentially private multi-party high-
dimensional data publishing. In ICDE, pages 205–216, 2016.
210. X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in
artificial intelligence, 2009:4, 2009.
211. L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
212. P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. Tag recommendations based on tensor
dimensionality reduction. RecSys ’08, pages 43–50, New York, NY, USA, 2008. ACM.
213. T. G. Eschenbach. Engineering economy: applying theory to practice. Oxford University Press, USA,
third edition, 2010.
214. A. G. Thakurta and A. Smith. Differentially private feature selection via stability arguments,
and the robustness of the lasso. In Conference on Learning Theory, pages 819–850, 2013.
215. N. Thepvilojanapong, K. Zhang, T. Tsujimori, Y. Ohta, Y. Zhao, and Y. Tobe. Participation-
aware incentive for active crowd sensing. In 2013 IEEE 10th International Conference on
High Performance Computing and Communications & 2013 IEEE International Conference
on Embedded and Ubiquitous Computing (HPCC_EUC), pages 2127–2134. IEEE, 2013.
216. H. To, G. Ghinita, and C. Shahabi. A framework for protecting worker location privacy in
spatial crowdsourcing. Proc. VLDB Endow., 7(10):919–930, June 2014.
217. H. To, G. Ghinita, and C. Shahabi. PrivGeoCrowd: A toolbox for studying private spatial
crowdsourcing. In Proceedings of the 31st IEEE International Conference on Data
Engineering, 2015.
218. K.-A. Toh and W.-Y. Yau. Combination of hyperbolic functions for multimodal biometrics
data fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics),
34(2):1196–1209, 2004.
219. Torch7: A scientific computing framework for LuaJIT. http://torch.ch.
220. F. Tramèr, Z. Huang, J.-P. Hubaux, and E. Ayday. Differential privacy with bounded priors:
Reconciling utility and privacy in genome-wide association studies. In CCS, pages 1286–
1297, 2015.
221. J. Ullman. Answering n^(2+o(1)) counting queries with differential privacy is hard. In STOC,
pages 361–370, 2013.
222. J. Ullman. Private multiplicative weights beyond linear queries. In PODS, pages 303–312,
2015.
223. S. Wang, L. Huang, P. Wang, Y. Nie, H. Xu, W. Yang, X. Li, and C. Qiao. Mutual information
optimally local private discrete distribution estimation. CoRR, 2016.
224. Y. Wang, J.-H. Ge, L.-H. Wang, and B. Ai. Nonlinear companding transform using hyperbolic
tangent function in OFDM systems. In Wireless Communications, Networking and Mobile
Computing (WiCOM), 2012 8th International Conference on, pages 1–4. IEEE, 2012.
225. Y. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability
and the sufficiency and necessity of ERM principle. CoRR, abs/1502.06309, 2015.
226. Y. Wang, S. Song, and K. Chaudhuri. Privacy-preserving analysis of correlated data. CoRR,
abs/1603.03977, 2016.
227. Y. Wang, Y. Wang, and A. Singh. Differentially private subspace clustering. In NIPS,
pages 1000–1008, 2015.
228. Y. Wang, Y. Wang, and A. Singh. A theoretical analysis of noisy sparse subspace clustering
on dimensionality-reduced data. CoRR, abs/1610.07650, 2016.
229. Y. Wang and X. Wu. Preserving differential privacy in degree-correlation based graph
generation. Transactions on Data Privacy, 6(2):127, 2013.
230. Y. Wang, X. Wu, and D. Hu. Using randomized response for differential privacy preserving
data collection. In EDBT/ICDT Workshops, 2016.
231. R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving
data publishing. VLDB ’07, pages 543–554. VLDB Endowment, 2007.
232. R. C.-W. Wong, A. W.-C. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized
data be used for privacy breaches? ACM Trans. Knowl. Discov. Data, 5(3):16:1–16:24, Aug.
2011.
233. H. Wu, J. Corney, and M. Grant. Relationship between quality and payment in crowdsourced
design. In Computer Supported Cooperative Work in Design (CSCWD), Proceedings of the
2014 IEEE 18th International Conference on, pages 499–504. IEEE, 2014.
234. Q. Xiao, R. Chen, and K. Tan. Differentially private network data release via structural
inference. In SIGKDD, pages 911–920, 2014.
235. X. Xiao, G. Bender, M. Hay, and J. Gehrke. iReduct: differential privacy with reduced relative
errors. In SIGMOD, pages 229–240, 2011.
236. X. Xiao, Y. Tao, and N. Koudas. Transparent anonymization: Thwarting adversaries who
know the algorithm. ACM Trans. Database Syst., 35(2):8:1–8:48, May 2010.
237. X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. IEEE Trans.
on Knowl. and Data Eng., 23(8):1200–1214, 2011.
238. Y. Xiao, J. Gardner, and L. Xiong. DPCube: Releasing differentially private data cubes for
health information. In Proceedings of the 2012 IEEE 28th International Conference on Data
Engineering, ICDE ’12, pages 1305–1308, Washington, DC, USA, 2012. IEEE Computer
Society.
239. Y. Xiao, L. Xiong, L. Fan, S. Goryczka, and H. Li. DPCube: Differentially private histogram
release through multidimensional partitioning. Transactions on Data Privacy, 7(3):195–222,
2014.
240. Y. Xiao, L. Xiong, and C. Yuan. Differentially private data release through multidimensional
partitioning. In SDM, pages 150–168, 2010.
241. P. Xiong, L. Zhang, and T. Zhu. Reward-based spatial crowdsourcing with differential privacy
preservation. Enterprise Information Systems, pages 1–18 (in press).
242. P. Xiong, T. Zhu, W. Niu, and G. Li. A differentially private algorithm for location data
release. Knowl. Inf. Syst., 47(3):647–669, 2016.
243. J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram
publication. The VLDB Journal, 22(6):797–822, 2013.
244. S. Xu, S. Su, L. Xiong, X. Cheng, and K. Xiao. Differentially private frequent subgraph
mining. In ICDE, pages 229–240, 2016.
245. G. Yuan, Z. Zhang, M. Winslett, X. Xiao, Y. Yang, and Z. Hao. Optimizing batch linear
queries under exact and approximate differential privacy. ACM Trans. Database Syst.,
40(2):11:1–11:47, 2015.
246. C. Zeng, J. F. Naughton, and J.-Y. Cai. On differentially private frequent itemset mining.
Proc. VLDB Endow., 6(1):25–36, 2012.
247. J. Zhan, C.-L. Hsieh, I.-C. Wang, T.-S. Hsu, C.-J. Liau, and D.-W. Wang. Privacy-
preserving collaborative recommender systems. IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews, 40(4):472–476, July 2010.
248. J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. PrivBayes: private data
release via Bayesian networks. In SIGMOD, pages 1423–1434, 2014.
249. J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. Private release of graph
statistics using ladder functions. In SIGMOD, pages 731–745, 2015.
250. J. Zhang, Y. Xiang, Y. Wang, W. Zhou, Y. Xiang, and Y. Guan. Network traffic classification
using correlation information. IEEE Transactions on Parallel and Distributed Systems,
24(1):104–117, Jan. 2013.
251. J. Zhang, X. Xiao, Y. Yang, Z. Zhang, and M. Winslett. PrivGene: Differentially private model
fitting using genetic algorithms. In SIGMOD, pages 665–676, 2013.
252. J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression
analysis under differential privacy. Proc. VLDB Endow., 5(11):1364–1375, July 2012.
253. J. D. Zhang, G. Ghinita, and C. Y. Chow. Differentially private location recommendations in
geosocial networks. In MDM, volume 1, pages 59–68, July 2014.
254. Z. Zhou and J. Feng. Deep forest: Towards an alternative to deep neural networks. CoRR,
abs/1702.08835, 2017.
255. Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-i.i.d.
samples. In Proceedings of the 26th Annual International Conference on Machine Learning,
ICML ’09, pages 1249–1256, New York, NY, USA, 2009. ACM.
256. T. Zhu, G. Li, W. Zhou, P. Xiong, and C. Yuan. Privacy-preserving topic model for tagging
recommender systems. Knowledge and Information Systems, 46(1):33–58, 2016.
257. T. Zhu, G. Li, W. Zhou, and P. S. Yu. Differentially private query learning: from data
publishing to model publishing. CoRR, abs/1606.05053, 2017.
258. T. Zhu, Y. Ren, W. Zhou, J. Rong, and P. Xiong. An effective privacy preserving algorithm
for neighborhood-based collaborative filtering. Future Generation Comp. Syst., 36:142–155,
2014.
259. T. Zhu, P. Xiong, G. Li, and W. Zhou. Correlated differential privacy: Hiding information in
non-iid data set. IEEE Transactions on Information Forensics and Security, 10(2):229–242,
2015.
260. T. Zhu, M. Yang, P. Xiong, Y. Xiang, and W. Zhou. An iteration-based differentially private
social network data release. CoRR, abs/1606.05053, 2017.
261. M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In
NIPS, pages 2595–2603. Curran Associates, Inc., 2010.
Index

C
Cryptography, 131, 216

D
Data
  analysis, 5–6, 16, 39, 49–65, 83, 87, 153, 192, 196–198, 215–216, 222
  mining, 1, 5, 6, 31, 41, 42, 49, 51, 56, 86, 107, 208, 217
  release, 1, 5, 24, 41, 44, 46, 47, 136, 151, 153, 157, 160, 168, 172, 191, 210, 218
  sharing, 1, 131, 218
Differentially private data analysis, 6, 49–65
Differentially private data publishing, 6, 17–21, 23–48
Differential policy, 195
Differential privacy, 4–17, 26, 29, 32–35, 40, 41, 43–44, 46, 47, 50, 52, 54–60, 62, 63, 65, 67, 73, 74, 76, 77, 82, 83, 85–87, 91, 93–98, 101, 102, 108, 112, 113, 115, 119, 122, 123, 125–127, 132, 135–138, 143, 150–154, 157, 158, 160–165, 169–174, 177–186, 189, 191–222

L
Location privacy, 5, 90, 151–172, 175, 214

M
Machine learning, 1, 4, 5, 31, 49, 51, 57, 60, 67, 69, 89, 107, 208, 215, 216, 219–222

O
Online social networks (OSNs), 31, 91

P
Privacy, 1, 7, 17, 24, 38, 50, 67, 83–85, 92, 107, 131–173, 191, 216
Privacy preserving, 1–3, 5, 6, 10, 31, 41, 49, 64, 73, 108, 118–120, 128, 129, 131–150, 153, 157, 160, 173, 175, 178, 179, 189, 193, 197, 210, 212, 215, 216, 218
Private learning, 49, 50, 53, 57–65, 83, 90, 221, 222

R
Recommender system, 5, 29, 83, 85, 88, 90, 107–129, 131–150, 214

S
Security, 151, 216
Statistical learning, 51, 216, 221