Physics 116C

The Distribution of the Sum of Random Variables


Peter Young
(Dated: November 24, 2009)
Consider a random variable with a probability distribution P(x). The distribution is normalized,
i.e.

    \int_{-\infty}^{\infty} P(x)\, dx = 1,    (1)

and the average of x^n (the n-th moment) is given by

    \langle x^n \rangle = \int_{-\infty}^{\infty} x^n P(x)\, dx.    (2)
The mean, \mu, and variance, \sigma^2, are given in terms of the first two moments by

    \mu = \langle x \rangle,    (3a)
    \sigma^2 = \langle x^2 \rangle - \langle x \rangle^2.    (3b)

The standard deviation, a common measure of the width of the distribution, is just the square
root of the variance, i.e. \sigma.
A commonly studied distribution is the Gaussian,

    P_{Gauss}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right].    (4)
We have studied Gaussian integrals before, and so you should be able to show that the distribution
is normalized, and that the mean and standard deviation are \mu and \sigma respectively. The width
of the distribution is \sigma, since the value of P(x) at x = \mu \pm \sigma has fallen to a fixed fraction,
1/\sqrt{e} \simeq 0.6065, of its value at x = \mu. The probability of getting a value which deviates by more
than n\sigma falls off very fast with n, as shown in the table below.

    n    P(|x - \mu| > n\sigma)
    1    0.3173
    2    0.0455
    3    0.0027
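These values follow from P(|x - \mu| > n\sigma) = \mathrm{erfc}(n/\sqrt{2}), where erfc is the complementary
error function (which reappears in Eq. (26) at the end of this handout). A minimal check in Python,
assuming NumPy and SciPy are available:

    import numpy as np
    from scipy.special import erfc

    # Gaussian tail probability: P(|x - mu| > n*sigma) = erfc(n / sqrt(2))
    for n in (1, 2, 3):
        print(n, erfc(n / np.sqrt(2)))   # 0.3173..., 0.0455..., 0.0026...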
In statistics, we often meet problems where we pick N random numbers x_i (this set of numbers
we call a sample) from a distribution, and are interested in the statistics of the mean \bar{x} of this
sample, where

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.    (5)
The x_i might, for example, be data points which we wish to average over. We would be interested
in knowing how the sample average \bar{x} deviates from the true average over the distribution, \langle x \rangle. In
other words, we would like to find the distribution of the sample mean \bar{x} if we know the distribution
of the individual points P(x). The distribution of the sample mean would tell us about the expected
scatter of results for \bar{x} obtained if we repeated the determination of the sample of N numbers x_i
many times.
We will actually find it convenient to determine the distribution of the sum

    X = \sum_{i=1}^{N} x_i,    (6)

and then trivially convert it to the distribution of the mean at the end. We will denote the
distribution of the sum of N variables by P_N(X) (so P_1(X) = P(X)).
We consider the case where the distributions of all the x_i are the same (a restriction which can
easily be lifted) and where the x_i are statistically independent (a restriction which is not
easily lifted). The latter condition means that there are no correlations between the numbers, so

    P(x_i, x_j) = P(x_i) P(x_j).
The distribution P_N(X) is obtained by integrating over all the x_i with the constraint that
\sum_i x_i is equal to the prescribed value X, i.e.

    P_N(X) = \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_N\, P(x_1) \cdots P(x_N)\, \delta(x_1 + x_2 + \cdots + x_N - X).    (7)
We can't easily do the integrals because of the constraint imposed by the delta function. As dis-
cussed in class and in the handout on singular Fourier transforms, we can eliminate this constraint
by going to the Fourier transform[1] of P_N(X), which we call G_N(k) and which is defined by

    G_N(k) = \int_{-\infty}^{\infty} e^{ikX} P_N(X)\, dX.    (8)
[1] In statistics, the Fourier transform of a distribution is called its characteristic function.
Substituting for P_N(X) from Eq. (7) gives

    G_N(k) = \int_{-\infty}^{\infty} e^{ikX} \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_N\, P(x_1) \cdots P(x_N)\, \delta(x_1 + x_2 + \cdots + x_N - X)\, dX.    (9)
The integral over X is easy and gives

    G_N(k) = \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_N\, P(x_1) \cdots P(x_N)\, e^{ik(x_1 + x_2 + \cdots + x_N)},    (10)
which is just the product of N identical independent Fourier transforms of the single-variable
distribution P(x), i.e.

    G_N(k) = \left[ \int_{-\infty}^{\infty} P(t)\, e^{ikt}\, dt \right]^N,    (11)

or

    G_N(k) = G(k)^N,    (12)

where G(k) is the Fourier transform of P(x).
Hence, to determine the distribution of the sum, P_N(X), the procedure is:

1. Fourier transform the single-variable distribution P(x) to get G(k).

2. Determine G_N(k), the Fourier transform of P_N(X), from G_N(k) = G(k)^N.

3. Perform the inverse Fourier transform on G_N(k) to get P_N(X). (A numerical sketch of the
   whole procedure is given below.)
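When G(k) or the inverse transform is not known in closed form, the three steps can be carried
out numerically. The following is a minimal sketch using NumPy's FFT; the grid parameters are
illustrative, and the test distribution is the rectangular one introduced in Eq. (19) below:

    import numpy as np

    N = 4                       # number of variables in the sum
    L, M = 20.0, 2**12          # half-width of the x grid and number of points (illustrative)
    x = np.linspace(-L, L, M, endpoint=False)
    dx = x[1] - x[0]

    # Test case: the rectangular distribution of Eq. (19), with mean 0 and variance 1
    p = np.where(np.abs(x) < np.sqrt(3), 1.0 / (2.0 * np.sqrt(3)), 0.0)

    # Step 1: characteristic function G(k). (NumPy's FFT sign convention differs from
    # Eq. (8), but any convention works as long as forward and inverse transforms match.)
    G = np.fft.fft(np.fft.ifftshift(p)) * dx

    # Step 2: G_N(k) = G(k)^N
    G_N = G**N

    # Step 3: inverse transform to recover P_N(X) on the same grid
    p_N = np.fft.fftshift(np.fft.ifft(G_N)).real / dx

    print(np.sum(p_N) * dx)          # normalization: close to 1
    print(np.sum(p_N * x**2) * dx)   # variance of the sum: close to N

The grid must be wide enough to contain the support of P_N(X) (here |X| \le 4\sqrt{3}), otherwise
the periodic convolution implicit in the FFT wraps the tails around.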
Let's see what this gives for the case of the Gaussian distribution in Eq. (4). The Fourier
transform G_{Gauss}(k) is given by

    G_{Gauss}(k) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] e^{ikx}\, dx
                 = \frac{e^{ik\mu}}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-t^2/2\sigma^2}\, e^{ikt}\, dt
                 = e^{ik\mu}\, e^{-\sigma^2 k^2/2}.    (13)

In the second line we made the substitution x = \mu + t, and in the third line we completed the
square in the exponent, as discussed elsewhere in the 116 sequence. The Fourier transform of a
Gaussian is therefore a Gaussian.
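As a quick numerical check of Eq. (13), one can compare a quadrature of the defining integral
with the closed form; the parameter values below are arbitrary, and SciPy is assumed to be
available:

    import numpy as np
    from scipy.integrate import quad

    mu, sigma, k = 0.7, 1.3, 2.0   # arbitrary illustrative parameters

    def gauss(x):
        return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    # Real and imaginary parts of the characteristic function, Eq. (8) with N = 1
    re = quad(lambda x: gauss(x) * np.cos(k * x), -np.inf, np.inf)[0]
    im = quad(lambda x: gauss(x) * np.sin(k * x), -np.inf, np.inf)[0]

    exact = np.exp(1j * k * mu - sigma**2 * k**2 / 2)   # closed form, Eq. (13)
    print(re + 1j * im, exact)                          # the two should agree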
We immediately find

    G_{Gauss,N}(k) = G_{Gauss}(k)^N = e^{iN\mu k}\, e^{-N\sigma^2 k^2/2},    (14)
which you will see is the same as G_{Gauss}(k) except that \mu is replaced by N\mu and \sigma^2
is replaced by N\sigma^2. Hence, when we do the inverse transform to get P_{Gauss,N}(X), we must
get a Gaussian as in Eq. (4) apart from these replacements[2], i.e.

    P_{Gauss,N}(X) = \frac{1}{\sigma\sqrt{2\pi N}} \exp\left[ -\frac{(X - N\mu)^2}{2N\sigma^2} \right].    (15)
In other words, if the distribution of the individual data points is Gaussian, the distribution of
the sum is also Gaussian, with mean and variance given by

    \mu_X = N\mu, \qquad \sigma_X^2 = N\sigma^2.    (16)
In fact this last equation is true quite generally for statistically independent data points, not
only for a Gaussian distribution, as we shall now show quite simply. We have

    \mu_X = \left\langle \sum_{i=1}^{N} x_i \right\rangle = N \langle x \rangle = N\mu.    (17)

    \sigma_X^2 = \left\langle \left( \sum_{i=1}^{N} x_i \right)^2 \right\rangle - \left\langle \sum_{i=1}^{N} x_i \right\rangle^2    (18a)
               = \sum_{i,j=1}^{N} \left( \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle \right)    (18b)
               = \sum_{i=1}^{N} \left( \langle x_i^2 \rangle - \langle x_i \rangle^2 \right)    (18c)
               = N \left( \langle x^2 \rangle - \langle x \rangle^2 \right)    (18d)
               = N\sigma^2.    (18e)
To get from Eq. (18b) to Eq. (18c) we note that, for i \neq j, \langle x_i x_j \rangle = \langle x_i \rangle \langle x_j \rangle, since x_i and x_j are
assumed to be statistically independent. (This is where the statistical independence of the data is
needed.)
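Equations (17) and (18e) are easy to verify by simulation. The sketch below (assuming NumPy)
draws many sets of N independent numbers from an exponential distribution, chosen arbitrarily
so that \mu = 1 and \sigma^2 = 1, and compares the mean and variance of the sum with N\mu and
N\sigma^2:

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 100, 200_000   # sample size and number of repetitions (illustrative)

    # Exponential distribution with mean mu = 1 and variance sigma^2 = 1
    samples = rng.exponential(scale=1.0, size=(trials, N))
    X = samples.sum(axis=1)    # the sum of Eq. (6), one value per trial

    print(X.mean())   # close to N*mu      = 100, Eq. (17)
    print(X.var())    # close to N*sigma^2 = 100, Eq. (18e)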
We now come to an important theorem, the central limit theorem, which will be derived in
another handout (using the same methods as above, i.e. using Fourier transforms). It applies to
any distribution for which the mean and variance exist (i.e. are finite), not just a Gaussian. The
data must be statistically independent. It states that:

- The mean of the distribution of the sum is N times the mean of the single-variable distri-
  bution (shown by more elementary means in Eq. (17) above).
- The variance of the distribution of the sum is N times the variance of the single-variable
  distribution (shown by more elementary means in Eq. (18e) above).

- For any single-variable distribution P(x), not necessarily a Gaussian, for N \to \infty the
  distribution of the sum, P_N(X), becomes a Gaussian, i.e. is given by Eq. (15). This amazing
  result is the reason why the Gaussian distribution plays such an important role in statistics.

[2] Of course, one can also do the integral explicitly by completing the square to get the same result.
We will illustrate the convergence to a Gaussian distribution with increasing N for the rectan-
gular distribution

    P_{rect}(x) = \begin{cases} \dfrac{1}{2\sqrt{3}}, & |x| < \sqrt{3}, \\[4pt] 0, & |x| > \sqrt{3}, \end{cases}    (19)
where the parameters have been chosen so that \mu = 0, \sigma = 1. Fourier transforming gives

    G(k) = \frac{1}{2\sqrt{3}} \int_{-\sqrt{3}}^{\sqrt{3}} e^{ikx}\, dx = \frac{\sin(\sqrt{3}\,k)}{\sqrt{3}\,k}    (20)
         = 1 - \frac{k^2}{2} + \frac{3k^4}{40} - \cdots
         = \exp\left[ -\frac{k^2}{2} - \frac{k^4}{20} - \cdots \right].
Hence

    P_N(X) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikX} \left[ \frac{\sin(\sqrt{3}\,k)}{\sqrt{3}\,k} \right]^N dk,

which can be evaluated numerically.
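For instance, a direct numerical evaluation might look as follows (a sketch assuming NumPy; the
cutoff kmax and the number of grid points are illustrative, and since the integrand decays only
slowly for N = 1, larger N converges much faster):

    import numpy as np

    def p_N(X, N, kmax=60.0, nk=200001):
        """P_N(X) = (1/2pi) * integral of e^{-ikX} [sin(sqrt(3)k)/(sqrt(3)k)]^N dk."""
        k = np.linspace(-kmax, kmax, nk)
        # np.sinc(t) = sin(pi*t)/(pi*t), so sin(sqrt(3)k)/(sqrt(3)k) = sinc(sqrt(3)k/pi).
        # The imaginary part of the integrand is odd in k and integrates to zero.
        integrand = np.sinc(np.sqrt(3) * k / np.pi)**N * np.cos(k * X)
        return integrand.sum() * (k[1] - k[0]) / (2 * np.pi)

    print(p_N(0.0, 4))   # peak of the N = 4 distribution, close to the Gaussian 1/sqrt(2*pi*4)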
As we have seen, quite generally P_N(X) has mean N\mu (= 0 here) and standard deviation
\sqrt{N}\,\sigma (= \sqrt{N} here). Hence, to illustrate that the distributions really do tend to a
Gaussian for N \to \infty, I plot below the distribution of Y = X/\sqrt{N}, which has mean 0 and
standard deviation 1 for all N.

[Figure: the distribution of Y = X/\sqrt{N} for the rectangular distribution.]

Results are shown for N = 1 (the original distribution), N = 2 and 4, and the Gaussian (N = \infty).
The approach to a Gaussian for large N is clearly seen. Even for N = 2, the distribution is much
closer to Gaussian than the original rectangular distribution, and for N = 4 the difference from a
Gaussian is quite small on the scale of the figure. This figure therefore provides a good illustration
of the central limit theorem.
Equation (15) can also be written as a distribution for the sample mean \bar{x} (= X/N) as[3]

    P_{Gauss,N}(\bar{x}) = \sqrt{\frac{N}{2\pi}}\, \frac{1}{\sigma} \exp\left[ -\frac{N(\bar{x} - \mu)^2}{2\sigma^2} \right].    (22)
Let us denote the mean (obtained from many repetitions of choosing the set of N numbers x_i) of
the sample average by \mu_{\bar{x}}, and the standard deviation of the sample average distribution by \sigma_{\bar{x}}.
Equation (22) tells us that

    \mu_{\bar{x}} = \mu,    (23a)
    \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}.    (23b)
Hence the mean of this distribution is the exact mean \mu, and its standard deviation, which is a
measure of its width, is \sigma/\sqrt{N}, which becomes small for large N.

These statements tell us that an average over many measurements will be close to the exact
average, which is intuitively what one expects. For example, if one tosses a coin, it should come up
heads on average half the time. However, if one tosses a coin a small number of times, N, one would
not expect to necessarily get heads for exactly N/2 of the tosses. Six heads out of 10 tosses (a fraction
of 0.6) would intuitively be quite reasonable, and we would have no reason to suspect a biased
coin. However, if one tosses a coin a million times, intuitively the same fraction, 600,000 heads out of
1,000,000 tosses, would be most unlikely. From Eq. (23b) we see that these intuitions are correct,
because the characteristic deviation of the sample average from the true average (1/2 in this case)
goes down proportionally to 1/\sqrt{N}.
To be precise, we assign x_i = 1 to heads and x_i = 0 to tails, so \langle x_i \rangle = 1/2 and \langle x_i^2 \rangle = 1/2, and
hence, from Eqs. (3),

    \mu = \frac{1}{2}, \qquad \sigma = \sqrt{ \frac{1}{2} - \left( \frac{1}{2} \right)^2 } = \sqrt{\frac{1}{4}} = \frac{1}{2}.    (24)
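A quick simulation (again assuming NumPy) confirms that the sample mean of N tosses scatters
around 1/2 with standard deviation \sigma/\sqrt{N} = 1/(2\sqrt{N}), as Eqs. (23) predict:

    import numpy as np

    rng = np.random.default_rng(1)
    N, trials = 10, 100_000    # tosses per sample and number of samples (illustrative)

    tosses = rng.integers(0, 2, size=(trials, N))   # 1 = heads, 0 = tails
    xbar = tosses.mean(axis=1)                      # sample mean of each set of N tosses

    print(xbar.mean())   # close to mu = 1/2,                      Eq. (23a)
    print(xbar.std())    # close to 1/(2*sqrt(10)) ~ 0.158,        Eq. (23b)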
[3] We obtained Eq. (22) from Eq. (15) by the replacement X = N\bar{x}. In addition, the factor of 1/\sqrt{N} multiplying the
exponential in Eq. (15) becomes \sqrt{N} in Eq. (22). Why did we do this? The reason is as follows. If we have a
distribution of y, P_y(y), and we write y as a function of x, we want to know the distribution of x, which we call
P_x(x). Now the probability of getting a result between x and x + dx must equal the probability of a result between
y and y + dy, i.e. P_y(y)|dy| = P_x(x)|dx|. In other words,

    P_x(x) = P_y(y) \left| \frac{dy}{dx} \right|.    (21)

As a result, when transforming the distribution of X in Eq. (15) into the distribution of \bar{x}, one needs to multiply
by N. You should verify that the factor of |dy/dx| in Eq. (21) preserves the normalization of the distribution, i.e.
\int_{-\infty}^{\infty} P_y(y)\, dy = 1 if \int_{-\infty}^{\infty} P_x(x)\, dx = 1.
For a sample of N tosses, Eqs. (23) give the sample mean and standard deviation to be

    \mu_{\bar{x}} = \frac{1}{2}, \qquad \sigma_{\bar{x}} = \frac{1}{2\sqrt{N}}.    (25)
Hence the magnitude of a typical deviation of the sample mean from 1/2 is \sigma_{\bar{x}} = 1/(2\sqrt{N}). For
N = 10 this is about 0.16, so a deviation of 0.1 is quite possible (as discussed above), while for
N = 10^6 it is 0.0005, so a deviation of 0.1 (200 times \sigma_{\bar{x}}) is most unlikely (as also discussed above).
In fact, we can calculate the probability of a deviation (of either sign) of 200 or more times \sigma_{\bar{x}},
since the central limit theorem tells us that the distribution is Gaussian for large N. The result is

    \frac{2}{\sqrt{2\pi}} \int_{200}^{\infty} e^{-t^2/2}\, dt = \mathrm{erfc}\left( \frac{200}{\sqrt{2}} \right) = 5.14 \times 10^{-8689},    (26)

(where erfc is the complementary error function), i.e. very unlikely!
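Double-precision arithmetic underflows around 10^{-308}, far above numbers this small, but the
value in Eq. (26) can be reproduced with an arbitrary-precision package such as mpmath (an
assumption; any multiprecision library would do):

    from mpmath import mp, mpf, erfc, sqrt

    mp.dps = 15                        # working precision; the tiny exponent is tracked exactly
    print(erfc(mpf(200) / sqrt(2)))    # approximately 5.14e-8689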
