Differential Privacy

Protecting privacy while performing statistical analysis is quite challenging. On the one hand,
the goal of statistics and machine learning is to be as informative as possible; protecting
privacy pulls in the opposite direction.

How do we formally define privacy? Can we protect privacy and still do an informative
analysis? We address these questions in these notes.

1 Introduction

The definition of privacy that has become most common lately is differential privacy (Dwork
et al. 2006).

2 Randomized Response

The predecessor to differential privacy is randomized response, which is a method used in
surveys. It was proposed by Warner in 1965.

I want to know how many students have ever cheated on a test. Suppose that proportion
is p. If I ask this question directly I will not get truthful responses. Instead, I tell everyone to flip a coin C
with P(C = 1) = θ and P(C = 0) = 1 − θ. To protect their privacy, I tell them: if the coin
is tails answer YES, and if the coin is heads answer the question “have you ever cheated?”
The observation Y is thus Y = (1 − C) + CZ where Z = 1 if they have cheated and Z = 0
otherwise. So π ≡ P(Y = 1) is π = (1 − θ) + θp, so that p = (π − 1 + θ)/θ. I can then
estimate p by estimating π.
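For concreteness, here is a small simulation sketch of this estimator; the survey size, θ, and p below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, p = 10_000, 0.5, 0.3          # hypothetical survey size, coin bias, true rate

C = rng.binomial(1, theta, size=n)      # C = 1: answer truthfully; C = 0: say YES
Z = rng.binomial(1, p, size=n)          # true (unobserved) cheating indicator
Y = (1 - C) + C * Z                     # observed randomized response

pi_hat = Y.mean()                       # estimate of pi = P(Y = 1)
p_hat = (pi_hat - 1 + theta) / theta    # plug-in estimate of p
print(p_hat)                            # should be close to 0.3
```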

3 Differential Privacy

Suppose we have a dataset X1, . . . , Xn where Xi ∈ X. Knowing the sample space X explicitly
is critical for differential privacy. The data set D = {X1, . . . , Xn} is in X^n. Our goal is to
report some function Z = T(D) of the data. We will be using some sort of randomization
to do this. That is, we will take Z ∼ Q(· | X1, . . . , Xn).

Two datasets D and D′ are neighbors if they differ in one random variable. In other words
D = {X1, . . . , Xi−1, Xi, Xi+1, . . . , Xn} and D′ = {X1, . . . , Xi−1, X′i, Xi+1, . . . , Xn}. In this
case we write D ∼ D′.

We say that Q satisfies ε-differential privacy if

Q(Z ∈ A | D) ≤ e^ε Q(Z ∈ A | D′)

for all A and all pairs D ∼ D′. If Q has density q this means that

sup_z q(z | D) / q(z | D′) ≤ e^ε.

What does the definition mean? It means that whether or not you are in the database has
little effect on the output Z. For example, suppose I think you are person i in the database
and I want to guess whether your value is Xi = a or Xi = b. Before I see any information, suppose
my odds are P(Xi = a)/P(Xi = b). After I see Z,

P(Xi = a | Z) / P(Xi = b | Z) = [ p(z | Xi = a) / p(z | Xi = b) ] × [ P(Xi = a) / P(Xi = b) ]

and so

e^{−ε} P(Xi = a)/P(Xi = b) ≤ P(Xi = a | Z)/P(Xi = b | Z) ≤ e^ε P(Xi = a)/P(Xi = b).

Since e^ε ≈ 1 + ε when ε is small, we see that knowing Z does not change my odds much. It is
also possible to show that we cannot construct any test with non-trivial power about what
your value of Xi is.

So, when differential privacy holds, we cannot learn much about whether a particular person
is in the dataset.

4 Queries

The computer science model of privacy is that some curator keeps the data and users send
queries about the data. The goal is to release answers that are differentially private.

Let f be a function of the data and define the sensitivity

∆ = sup_{D∼D′} |f(D) − f(D′)|.

Suppose we release

Z = f(D) + W

where W has density p(w) ∝ e^{−|w|/λ}. Note that W has a Laplace distribution with mean 0 and variance 2λ².
If we set λ = ∆/ε then

p(z | D) / p(z | D′) ≤ e^{|f(D) − f(D′)|/λ} ≤ e^ε

so that differential privacy holds.

For example, suppose that X1, . . . , Xn ∈ [−B, B] and that f(D) = X̄, the sample mean. Then ∆ = 2B/n so
we need to add noise with standard deviation O(B/(nε)). As a function of n this is good.
As a function of B it is bad.
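Here is a minimal sketch of the Laplace mechanism for the bounded mean, assuming data in [−B, B] and using numpy's Laplace sampler; the function name and example values are mine.

```python
import numpy as np

def private_mean(x, B, eps, rng=np.random.default_rng()):
    """Release the mean of data in [-B, B] with eps-differential privacy by adding
    Laplace noise with scale lambda = Delta/eps, where Delta = 2B/n is the sensitivity."""
    n = len(x)
    lam = (2 * B / n) / eps
    return np.mean(x) + rng.laplace(scale=lam)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)       # hypothetical data with B = 1
print(private_mean(x, B=1.0, eps=0.5, rng=rng))
# the noise standard deviation is sqrt(2) * 2B/(n*eps) = O(B/(n*eps))
```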

5 How Informative is Z?

Obviously we lose information when we use differential privacy.

As an extreme example, suppose the data are on [0, 1] and suppose the true distribution F
is a point mass at x ∈ [0, 1]. So the dataset is X = (X1, . . . , Xn) = (x, x, . . . , x). Suppose we
output Z1, . . . , Zk from a differentially private Q(Z|X). Let F̂ be the empirical distribution
of Z. Then it can be shown that F̂ must be inconsistent, that is, there exists δ > 0 such that

lim inf_{n→∞} P( sup_s |F(s) − F̂(s)| > δ ) > 0.

See Blum, Ligett and Roth (2008) and Wasserman and Zhou (2010).

Barber and Duchi (2014) studied differential privacy from the minimax point of view. Suppose
we observe X1, . . . , Xn ∼ P where Xi ∈ [0, 1]^d. Consider the simple task of estimating
the mean µ. If we ignore privacy, then we can use X̄, whose risk is E[||X̄ − µ||²] ≍ d/n. They
showed that any differentially private estimator µ̃ satisfies the lower bound

E[||µ̃ − µ||²] ≳ d/n + d³/(n²ε²) = (d/n)(1 + d²/(nε²)).

So the price we pay for privacy is d³/(n²ε²), which is quite steep.

6 Releasing a Whole Dataset

Statisticians have been less enthusiastic about differential privacy than computer scientists.
One of the reasons for this is the heavy dependence on the notion of using privatized queries.
The idea that we would analyze data by sending queries to a curator is unrealistic. Real
data analysis involves: looking at the data, fitting models, testing fit, making predictions,
constructing confidence sets, etc. This requires access to the whole data set. This leads to
the following question: can we release a privatized version of the whole dataset? In fact,
there are several ways to do this.

6.1 Exponential Mechanism

The exponential mechanism, due to McSherry and Talwar (2007), is a general method for
preserving differential privacy. Here, I’ll discuss the special case where we want to release a
private data set Z = (Z1, . . . , Zk). Let ξ(x, z) be some function that measures the distance
between two data sets x = (x1, . . . , xn) and z = (z1, . . . , zk). Define the sensitivity

∆ = sup_{x∼y} sup_z |ξ(x, z) − ξ(y, z)|.

Now draw Z = (Z1, . . . , Zk) from the density

q(z|x) ∝ exp( −ε ξ(x, z) / (2∆) ).

It is easy to check that this satisfies ε-differential privacy.

As an example, suppose that X is compact and define ξ(x, z) = sup_t |Fx(t) − Fz(t)| =
||Fx − Fz||∞ where Fx is the empirical cdf of x = (x1, . . . , xn) and Fz is the empirical cdf of
z = (z1, . . . , zk). So ξ is the Kolmogorov-Smirnov distance. In this case, ∆ = 1/n and so we
draw z from the density

q(z|x) ∝ exp( −ε n ||Fx − Fz||∞ / 2 ).

Wasserman and Zhou (2010) showed that, for this scheme, the optimal choice of k is k ≍ n^{2/3}
and that ||F − Fz||∞ = O_P(n^{−1/3}). Without privacy we have ||F − Fx||∞ = O_P(n^{−1/2}). So
we see that we have lost accuracy.
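The density q(z|x) is over the whole candidate dataset z, so drawing from it has no closed form. As a rough illustration (not the algorithm used in the paper), one could run a Metropolis sampler that resamples one coordinate of z at a time; note that a finite-length chain only approximates the target, so the privacy guarantee would only hold approximately. The function names and tuning constants below are assumptions.

```python
import numpy as np

def ks_distance(x, z):
    """Kolmogorov-Smirnov distance between the empirical cdfs of x and z."""
    grid = np.concatenate([x, z])
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fz = np.searchsorted(np.sort(z), grid, side="right") / len(z)
    return np.abs(Fx - Fz).max()

def exponential_mechanism_ks(x, k, eps, n_iter=5000, rng=np.random.default_rng()):
    """Approximate draw from q(z|x) ∝ exp(-eps * n * ||Fx - Fz||_inf / 2) for data on [0, 1],
    using a Metropolis chain over candidate datasets z of size k."""
    n = len(x)
    z = rng.uniform(0, 1, size=k)                 # initial candidate dataset
    xi = ks_distance(x, z)
    for _ in range(n_iter):
        j = rng.integers(k)
        z_new = z.copy()
        z_new[j] = rng.uniform(0, 1)              # symmetric proposal: redraw one point
        xi_new = ks_distance(x, z_new)
        if rng.uniform() < np.exp(-eps * n * (xi_new - xi) / 2):
            z, xi = z_new, xi_new                 # Metropolis accept/reject
    return z
```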

More generally, they showed that

P(||F − Fz||∞ > δ) ≤ (sup_x p(x))^k e^{−ε³δn/16} / S(k, δ/2)

where S(k, δ/2) is the small ball probability, that is, P(||F − Fz||∞ ≤ δ/2). However, it is not
known if these bounds are tight.

6.2 Density Estimation I

Another way to release a privatized dataset is to compute a privatized density estimate p̂.
Then we can draw a sample Z1, . . . , ZN ∼ p̂. It is easy to show that if p̂ is differentially
private then so is Z = (Z1, . . . , ZN).

Dwork et al. (2006) suggested using a privatized histogram which was analyzed in Wasserman
and Zhou (2010). Suppose that the data are on [0, 1]^d. Divide the space into m = 1/h^d bins
B1, . . . , Bm and form the usual histogram estimator

p̂(x) = Σ_j (p̂_j / h^d) I(x ∈ Bj)

where p̂_j = Cj/n and Cj is the number of observations in bin Bj. To privatize p̂, define

q̂(x) = Σ_j (q̂_j / h^d) I(x ∈ Bj)

where q̂_j = D̃_j / Σ_s D̃_s, D̃_j = max{Dj, 0} and Dj = Cj + νj where νj is drawn from a Laplace
density with mean 0 and variance 8/ε². Wasserman and Zhou (2010) showed that, if X has
a Lipschitz density, and if we choose m ≍ n^{d/(2+d)}, the histogram of the privatized density
achieves the minimax rate n^{−2/(2+d)}. So in this case, there is no loss in rate by releasing the
privatized data or the privatized histogram.
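A sketch of the privatized histogram for d = 1 on [0, 1] might look as follows; the Laplace scale 2/ε (variance 8/ε²) matches the description above, and the second helper shows how a synthetic dataset could then be drawn from q̂. The function names are mine, not from the paper.

```python
import numpy as np

def private_histogram(x, m, eps, rng=np.random.default_rng()):
    """Privatized histogram on [0, 1] (d = 1): add Laplace noise with scale 2/eps
    (variance 8/eps^2) to each bin count, clip negatives, renormalize."""
    counts, edges = np.histogram(x, bins=m, range=(0.0, 1.0))
    D = counts + rng.laplace(scale=2.0 / eps, size=m)   # D_j = C_j + nu_j
    D_tilde = np.maximum(D, 0.0)                        # D~_j = max{D_j, 0}
    q = D_tilde / D_tilde.sum()                         # q^_j: privatized bin probabilities
    return q, edges

def sample_from_histogram(q, edges, N, rng=np.random.default_rng()):
    """Draw a synthetic dataset Z_1, ..., Z_N from the privatized histogram."""
    bins = rng.choice(len(q), size=N, p=q)
    return rng.uniform(edges[bins], edges[bins + 1])    # uniform within the chosen bin
```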

However, the minimax rate is not the whole story. Suppose that the original histogram p̂
is sparse, i.e., it has many empty cells. The privatized histogram q̂ is forced to “fill in” these
empty cells. So in these cases, q̂ will look very different from p̂. In particular, much of the
clustering structure will be lost. And if there is any lower dimensional structure in the data,
this will be destroyed.

6.3 Density Estimation II

A second approach is based on orthogonal series. For simplicity assume that X = [0, 1].
Write

p(x) = 1 + Σ_{j=1}^∞ βj ψj(x)

where {1, ψ1, ψ2, . . .} is an orthonormal basis. Suppose that Σ_j βj² j^{2γ} ≤ C². This is a
Sobolev ellipsoid. The minimax rate is n^{−2γ/(2γ+1)}.

The usual density estimator in this framework is

p̂(x) = 1 + Σ_{j=1}^m β̂j ψj(x)

where m = n^{1/(2γ+1)} and β̂j = n^{−1} Σ_i ψj(Xi), which achieves the minimax rate. A privatized
estimator is

q̂(x) = 1 + Σ_{j=1}^m (β̂j + νj) ψj(x)

where νj is Laplace with mean 0 and standard deviation m c0/(nε) where c0 = sup_j sup_x |ψj(x)|.
It turns out that q̂ also achieves the minimax rate.
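As a sketch, take the cosine basis ψj(t) = √2 cos(πjt) (my choice of basis, so c0 = √2); the Laplace scale m c0/(nε) below is chosen to match the order stated above, though the exact constant may differ.

```python
import numpy as np

def private_series_estimator(x, gamma, eps, rng=np.random.default_rng()):
    """Privatized orthogonal series density estimate on [0, 1] using the cosine basis
    psi_j(t) = sqrt(2) cos(pi j t), so c0 = sup_j sup_t |psi_j(t)| = sqrt(2)."""
    n = len(x)
    m = max(1, int(round(n ** (1.0 / (2 * gamma + 1)))))     # number of coefficients kept
    j = np.arange(1, m + 1)
    beta_hat = np.sqrt(2) * np.cos(np.pi * np.outer(x, j)).mean(axis=0)  # beta^_j
    c0 = np.sqrt(2)
    nu = rng.laplace(scale=m * c0 / (n * eps), size=m)       # Laplace noise per coefficient
    beta_priv = beta_hat + nu

    def q_hat(t):
        """Evaluate the privatized density estimate at the points t."""
        t = np.atleast_1d(t)
        return 1.0 + np.sqrt(2) * np.cos(np.pi * np.outer(t, j)) @ beta_priv

    return q_hat
```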

6.4 Density Estimation III

The most commonly used density estimator is the kernel density estimator

p̂(x) = (1/n) Σ_i (1/h^d) K( (x − Xi)/h ).

Is there a way to privatize p̂?

This is trickier than histograms and orthogonal series estimators since p̂ is not easily described
by a finite set of parameters. In principle, we want to draw a random function g such that
P(g ∈ A | D) ≤ e^ε P(g ∈ A | D′). But the sets A are now subsets of some function space and
it is not immediately clear how to do this.

So far, I know of only two ways to do this. Hall, Rinaldo and Wasserman (2013) suggested
using

g = p̂ + (C / (n h^{d/2})) G

where C is an appropriate constant and G is a mean 0 Gaussian process with a certain
covariance structure. The resulting density estimator g is very wiggly but it does satisfy
differential privacy. Moreover, it has the same rate of convergence as the original density
estimator.
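A rough sketch of this idea on a grid, for d = 1: the Gaussian smoothing kernel, the squared-exponential covariance for G, and the explicit 1/ε scaling below are all illustrative assumptions; the paper specifies the particular constant C and covariance that make the release ε-differentially private.

```python
import numpy as np

def noisy_kde_release(x, h, eps, grid, C=1.0, ell=0.1, rng=np.random.default_rng()):
    """Sketch of g = p_hat + (C / (n h^{1/2})) G evaluated on a grid, for d = 1.
    Assumptions: Gaussian kernel for the KDE, squared-exponential covariance with
    length-scale ell for G, and an extra 1/eps factor on the noise."""
    n = len(x)
    # kernel density estimate evaluated on the grid
    K = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2) / np.sqrt(2 * np.pi)
    p_hat = K.mean(axis=1) / h
    # mean-zero Gaussian process G on the grid (assumed covariance)
    cov = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / ell) ** 2)
    G = rng.multivariate_normal(np.zeros(len(grid)), cov + 1e-10 * np.eye(len(grid)))
    return p_hat + (C / (eps * n * np.sqrt(h))) * G
```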

A second approach was recently presented by Alda and Rubinstein (2017). They first create
a grid on the sample space. Next, they approximate p̂ using Bernstein polynomials. Then they
add Laplace noise to the coefficients of the polynomials.

I should add that the methods in Hall, Rinaldo and Wasserman (2013) and Alda and Ru-
binstein (2017) are quite general and can be used for the private release of fairly general
functions. In fact, Alda and Rubinstein (2017) also apply their approach to classification,
logistic regression and empirical risk minimization. They also provide a lower bound which
shows that we must, in general, introduce an error of size ∆/ε when privately releasing a
function, where ∆ is the sensitivity defined earlier.

7 Conclusion

Differential privacy (DP) is a very active area of research. Here is a summary of the strengths
and weaknesses of this approach:

Strengths:

1. DP gives a very rigorous, precise notion of privacy.

2. Many methods in machine learning and statistics can be made differentially private.
3. DP can be used for other purposes. For example, Dwork et al. (2015) created a method
called the reusable holdout that allows an interactive approach to data analysis, with
repeated looks at the data, without introducing too much bias. The heart of the
method is to impose a sort of differential privacy on each step of the analysis.

Weaknesses:

1. DP has dominated the research in privacy. It seems that there is not much research in
other approaches.
2. DP is very strong. You need to add a lot of noise to the data.
3. When there is structure in the data, such as voids, manifolds, etc., it is destroyed by
DP.
4. I have not seen it really used in much practical data analysis.
