
Chapter 5.

Multiple Random Variables


5.8: The Multinomial Distribution
(From “Probability & Statistics with Applications to Computing” by Alex Tsun)

As you’ve seen, the Binomial distribution is extremely commonly used, and probably the most important
discrete distribution. The Normal distribution is certainly the most important continuous distribution. In
this section, we’ll see how to generalize the Binomial, and in the next, the Normal.

Why do we need to generalize the Binomial distribution? Sometimes, we don’t just have two outcomes
(success and failure), but we have r > 2 outcomes. In this case, we need to maintain counts of how many
times each of the r outcomes appeared. A single random variable is no longer sufficient; we need a vector of
counts!

Actually, the example problems at the end could have been solved in Chapter 1. We will just formalize
this situation so that we can use it later!

5.8.1 Random Vectors (RVTRs) and Covariance Matrices

We will first introduce the concept of a random vector, which is just a collection of random variables stacked
on top of each other.

Definition 5.8.1: Random Vectors


Let $X_1, \ldots, X_n$ be arbitrary random variables, and stack them into a vector as follows:

$$\mathbf{X} = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}$$

We call $\mathbf{X}$ an $n$-dimensional random vector (RVTR).


We define the expectation of a random vector just as we would hope, coordinate-wise:

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \vdots \\ \mathbb{E}[X_n] \end{bmatrix}$$

What about the variance? We cannot just compute a single scalar Var(X): what would that even mean for a random vector? Instead, we define an n × n covariance matrix, which stores all the pairwise covariances. It is often denoted in one of three ways: Σ = Var(X) = Cov(X).

Definition 5.8.2: Covariance Matrices

The covariance matrix of a random vector $\mathbf{X} \in \mathbb{R}^n$ with $\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}$ is the matrix denoted $\Sigma = \text{Var}(\mathbf{X}) = \text{Cov}(\mathbf{X})$ whose entries are $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. The formula for this is:

$$\Sigma = \text{Var}(\mathbf{X}) = \text{Cov}(\mathbf{X}) = \mathbb{E}\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \mathbb{E}\left[\mathbf{X}\mathbf{X}^T\right] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$

$$= \begin{bmatrix} \text{Cov}(X_1, X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Cov}(X_2, X_2) & \cdots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \cdots & \text{Cov}(X_n, X_n) \end{bmatrix}$$

$$= \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \cdots & \text{Var}(X_n) \end{bmatrix}$$
Notice that the covariance matrix is symmetric (Σij = Σji ), and contains variances along the
diagonal.

Note: If you know a bit of linear algebra, you might like to know that covariance matrices
are always symmetric positive semi-definite.

We will not be doing any linear algebra in this class; just think of the covariance matrix as a place to store all the pairwise covariances. Now let us look at an example of a covariance matrix.

Example(s)

If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then find the mean vector and covariance matrix of the random vector $\mathbf{X} = (X_1, \ldots, X_n)$.

Solution: The mean vector is:

$$\mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \vdots \\ \mathbb{E}[X_n] \end{bmatrix} = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \mu \mathbf{1}_n$$

where $\mathbf{1}_n$ denotes the $n$-dimensional vector of all 1's. The covariance matrix is (since the diagonal is just the individual variances $\sigma^2$ and the off-diagonal entries ($i \neq j$) are all $\text{Cov}(X_i, X_j) = 0$ due to independence):

$$\Sigma = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 I_n$$

where $I_n$ denotes the $n \times n$ identity matrix.
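To make this concrete, here is a minimal NumPy sketch (not part of the original text) that estimates the mean vector and covariance matrix from simulated iid samples; the values $\mu = 2$, $\sigma = 3$, $n = 4$, and the Normal sampling distribution are arbitrary choices for illustration.

```python
import numpy as np

# A minimal sketch (not from the text): estimate the mean vector and covariance
# matrix of a random vector with iid coordinates by simulation. The values
# mu = 2, sigma = 3, n = 4, and the Normal sampling distribution are arbitrary.
rng = np.random.default_rng(seed=0)
mu, sigma, n, num_samples = 2.0, 3.0, 4, 200_000

# Each row is one realization of the random vector X = (X_1, ..., X_n).
samples = rng.normal(loc=mu, scale=sigma, size=(num_samples, n))

print(samples.mean(axis=0))           # approximately mu * 1_n = [2, 2, 2, 2]
print(np.cov(samples, rowvar=False))  # approximately sigma^2 * I_n = 9 * I_4
```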



Theorem 5.8.1: Properties of Expectation and Variance Hold for RVTRs

An important theorem is that properties of expectation and variance still hold for RVTRs.

Let $\mathbf{X}$ be an $n$-dimensional RVTR, $A \in \mathbb{R}^{n \times n}$ be a constant matrix, and $\mathbf{b} \in \mathbb{R}^n$ be a constant vector. Then:

$$\mathbb{E}[A\mathbf{X} + \mathbf{b}] = A\,\mathbb{E}[\mathbf{X}] + \mathbf{b}$$

$$\text{Var}(A\mathbf{X} + \mathbf{b}) = A\,\text{Var}(\mathbf{X})\,A^T$$

Since we aren’t expecting any linear algebra background, we won’t prove this.

5.8.2 The Multinomial Distribution

Suppose we have a scenario where there are $r = 3$ outcomes, with probabilities $p_1, p_2, p_3$ respectively, such that $p_1 + p_2 + p_3 = 1$. Suppose we have $n = 7$ independent trials, and let $\mathbf{Y} = (Y_1, Y_2, Y_3)$ be the RVTR of counts of each outcome. Suppose we define each $\mathbf{X}_i$ as a one-hot vector (exactly one 1, and the rest 0) recording which of the three outcomes occurred on trial $i$, so that $\mathbf{Y} = \sum_{i=1}^{n} \mathbf{X}_i$ (this is exactly like how adding indicators/Bernoullis gives us a Binomial).

Now, what is the probability of a particular outcome, say two of outcome 1, one of outcome 2, and four of outcome 3; that is, $(Y_1 = 2, Y_2 = 1, Y_3 = 4)$? We get the following:

$$p_{Y_1, Y_2, Y_3}(2, 1, 4) = \frac{7!}{2!\,1!\,4!} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 \qquad \text{[recall from counting]}$$

$$= \binom{7}{2, 1, 4} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4$$

This describes the joint distribution of the random vector $\mathbf{Y} = (Y_1, Y_2, Y_3)$, and its PMF should remind you of the Binomial PMF. We just count the number of ways $\binom{7}{2, 1, 4}$ to get these counts (the multinomial coefficient), and make sure we get each outcome that many times: $p_1^2 p_2^1 p_3^4$.
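As a quick check of the arithmetic, here is a short Python computation (not part of the original text) of this PMF value; the probabilities p = (0.2, 0.3, 0.5) are an arbitrary choice, since the text leaves $p_1, p_2, p_3$ symbolic.

```python
from math import factorial

# A minimal sketch (not from the text): evaluate p_{Y1,Y2,Y3}(2, 1, 4) for n = 7.
# The probabilities p = (0.2, 0.3, 0.5) are an arbitrary illustrative choice.
p = (0.2, 0.3, 0.5)
counts = (2, 1, 4)

coef = factorial(sum(counts))          # 7!
for k in counts:
    coef //= factorial(k)              # divide by 2!, 1!, 4!  ->  coefficient 105

prob = coef
for p_i, k_i in zip(p, counts):
    prob *= p_i ** k_i                 # multiply in p_i^{k_i}

print(coef, prob)                      # 105  0.07875
```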
Now let us define the Multinomial Distribution more generally.

Definition 5.8.3: The Multinomial Distribution


Suppose there are $r$ outcomes, with probabilities $\mathbf{p} = (p_1, p_2, \ldots, p_r)$ respectively, such that $\sum_{i=1}^{r} p_i = 1$. Suppose we have $n$ independent trials, and let $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_r)$ be the RVTR of counts of each outcome. Then, we say:

$$\mathbf{Y} \sim \text{Mult}_r(n, \mathbf{p})$$

The joint PMF of $\mathbf{Y}$ is:

$$p_{Y_1, \ldots, Y_r}(k_1, \ldots, k_r) = \binom{n}{k_1, \ldots, k_r} \prod_{i=1}^{r} p_i^{k_i}, \qquad k_1, \ldots, k_r \geq 0 \ \text{ and } \ \sum_{i=1}^{r} k_i = n$$

Notice that each $Y_i$ is marginally $\text{Bin}(n, p_i)$. Hence, $\mathbb{E}[Y_i] = np_i$ and $\text{Var}(Y_i) = np_i(1 - p_i)$. Then, we can specify the entire mean vector $\mathbb{E}[\mathbf{Y}]$ and covariance matrix:

$$\mathbb{E}[\mathbf{Y}] = n\mathbf{p} = \begin{bmatrix} np_1 \\ \vdots \\ np_r \end{bmatrix}, \qquad \text{Var}(Y_i) = np_i(1 - p_i), \qquad \text{Cov}(Y_i, Y_j) = -np_i p_j \quad (\text{for } i \neq j)$$

Notice the covariance is negative, which makes sense: as the count $Y_i$ increases, the count $Y_j$ should decrease, since outcomes $i$ and $j$ cannot both occur on the same trial.
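A quick way to see these formulas in action is to sample from the distribution. The sketch below (not part of the original text) uses NumPy's built-in multinomial sampler; n = 7 and p = (0.2, 0.3, 0.5) are arbitrary illustration choices.

```python
import numpy as np

# A minimal sketch (not from the text): compare empirical moments of
# Y ~ Mult_3(n = 7, p = (0.2, 0.3, 0.5)) to the formulas above.
rng = np.random.default_rng(seed=2)
n, p = 7, np.array([0.2, 0.3, 0.5])

Y = rng.multinomial(n, p, size=300_000)    # each row is one draw (Y1, Y2, Y3)

print(Y.mean(axis=0), n * p)               # mean vector ~ n p
print(np.cov(Y, rowvar=False))             # empirical covariance matrix
print(n * (np.diag(p) - np.outer(p, p)))   # diag: n p_i (1 - p_i); off-diag: -n p_i p_j
```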

Proof of Multinomial Covariance. Recall that marginally, $Y_i$ and $Y_j$ are binomial random variables; let's decompose them into their Bernoulli trials. We'll use different dummy indices as we're dealing with covariances.

Let $X_{ik}$ for $k = 1, \ldots, n$ be indicator/Bernoulli RVs of whether the $k$th trial resulted in outcome $i$, so that $Y_i = \sum_{k=1}^{n} X_{ik}$.

Similarly, let $X_{j\ell}$ for $\ell = 1, \ldots, n$ be indicators of whether the $\ell$th trial resulted in outcome $j$, so that $Y_j = \sum_{\ell=1}^{n} X_{j\ell}$.

Before we begin, we should argue that $\text{Cov}(X_{ik}, X_{j\ell}) = 0$ when $k \neq \ell$, since $k$ and $\ell$ are different trials and are independent. Furthermore, $\mathbb{E}[X_{ik} X_{jk}] = 0$ since it's not possible that both outcome $i$ and outcome $j$ occur at trial $k$.

$$\text{Cov}(Y_i, Y_j) = \text{Cov}\left(\sum_{k=1}^{n} X_{ik}, \sum_{\ell=1}^{n} X_{j\ell}\right) \qquad \text{[indicators]}$$

$$= \sum_{k=1}^{n} \sum_{\ell=1}^{n} \text{Cov}(X_{ik}, X_{j\ell}) \qquad \text{[covariance works like FOIL]}$$

$$= \sum_{k=1}^{n} \text{Cov}(X_{ik}, X_{jk}) \qquad \text{[independent trials, cross terms are 0]}$$

$$= \sum_{k=1}^{n} \left(\mathbb{E}[X_{ik} X_{jk}] - \mathbb{E}[X_{ik}]\,\mathbb{E}[X_{jk}]\right) \qquad \text{[def of covariance]}$$

$$= \sum_{k=1}^{n} (-p_i p_j) \qquad \text{[first expectation is 0]}$$

$$= -np_i p_j$$

Note that in the third line we dropped one of the sums because the indicators across different trials $k \neq \ell$ are independent (zero covariance). Hence, we just need to sum when $k = \ell$.
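The decomposition into indicator trials can also be simulated directly. This sketch (not part of the original text) builds each count from one-hot trials exactly as in the proof and estimates Cov(Y1, Y2); n and p are the same arbitrary choices as before.

```python
import numpy as np

# A minimal sketch (not from the text): build Y_i = sum_k X_ik from one-hot
# indicator trials, as in the proof, and estimate Cov(Y_1, Y_2).
rng = np.random.default_rng(seed=3)
n, p = 7, np.array([0.2, 0.3, 0.5])
num_samples = 200_000

# outcomes[s, k] is the outcome (0, 1, or 2) of trial k in sample s.
outcomes = rng.choice(len(p), size=(num_samples, n), p=p)
X = np.eye(len(p))[outcomes]            # X[s, k, i] = indicator that trial k gave outcome i
Y = X.sum(axis=1)                       # counts: Y[s] = (Y_1, Y_2, Y_3) for sample s

print(np.cov(Y[:, 0], Y[:, 1])[0, 1])   # empirical Cov(Y_1, Y_2)
print(-n * p[0] * p[1])                 # theory: -n p_1 p_2 = -0.42
```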
There is an example of the Multinomial distribution at the end of the section!

5.8.3 The Multivariate Hypergeometric (MVHG) Distribution

Suppose there are r = 3 political parties (Green, Democratic, Republican). The senate consists of N = 100
senators: K1 = 45 Green party members, K2 = 20 Democrats, and K3 = 35 Republicans.

We want to choose a committee of n = 10 senators.


Let $\mathbf{Y} = (Y_1, Y_2, Y_3)$ be the number of each party's members in the committee (G, D, R in that order). What is the probability we get 1 Green party member, 6 Democrats, and 3 Republicans? It turns out it is just the following:

$$p_{Y_1, Y_2, Y_3}(1, 6, 3) = \frac{\binom{45}{1}\binom{20}{6}\binom{35}{3}}{\binom{100}{10}}$$

This is very similar to the univariate Hypergeometric distribution! For the denominator, there are $\binom{100}{10}$ ways to choose 10 senators. For the numerator, we need 1 from the 45 Green party members, 6 from the 20 Democrats, and 3 from the 35 Republicans.
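This probability is easy to evaluate numerically; a short Python check (not part of the original text, assuming Python 3.8+ for math.comb) is below.

```python
from math import comb

# A minimal sketch (not from the text): evaluate the committee probability above.
prob = comb(45, 1) * comb(20, 6) * comb(35, 3) / comb(100, 10)
print(prob)   # P(Y1 = 1, Y2 = 6, Y3 = 3)
```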
Once again, let us define the MVHG Distribution more generally.

Definition 5.8.4: The Multivariate Hypergeometric Distribution

Suppose there are $r$ different colors of balls in a bag, having $\mathbf{K} = (K_1, \ldots, K_r)$ balls of each color $i$, $1 \leq i \leq r$. Let $N = \sum_{i=1}^{r} K_i$ be the total number of balls in the bag, and suppose we draw $n$ without replacement. Let $\mathbf{Y} = (Y_1, \ldots, Y_r)$ be the RVTR such that $Y_i$ is the number of balls of color $i$ we drew. We write that:

$$\mathbf{Y} \sim \text{MVHG}_r(N, \mathbf{K}, n)$$

The joint PMF of $\mathbf{Y}$ is:

$$p_{Y_1, \ldots, Y_r}(k_1, \ldots, k_r) = \frac{\prod_{i=1}^{r} \binom{K_i}{k_i}}{\binom{N}{n}}, \qquad 0 \leq k_i \leq K_i \text{ for all } 1 \leq i \leq r \ \text{ and } \ \sum_{i=1}^{r} k_i = n$$

Notice that each $Y_i$ is marginally $\text{HypGeo}(N, K_i, n)$, so $\mathbb{E}[Y_i] = n\frac{K_i}{N}$ and $\text{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1}$. Then, we can specify the entire mean vector $\mathbb{E}[\mathbf{Y}]$ and covariance matrix:

$$\mathbb{E}[\mathbf{Y}] = n\frac{\mathbf{K}}{N} = \begin{bmatrix} n\frac{K_1}{N} \\ \vdots \\ n\frac{K_r}{N} \end{bmatrix}, \qquad \text{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1}, \qquad \text{Cov}(Y_i, Y_j) = -n \cdot \frac{K_i}{N} \cdot \frac{K_j}{N} \cdot \frac{N - n}{N - 1} \quad (\text{for } i \neq j)$$
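As with the Multinomial, we can sanity-check these formulas by simulation. The sketch below (not part of the original text) assumes NumPy >= 1.18, which provides Generator.multivariate_hypergeometric, and reuses the senate example's N = 100, K = (45, 20, 35), n = 10.

```python
import numpy as np

# A minimal sketch (not from the text), assuming NumPy >= 1.18 for
# Generator.multivariate_hypergeometric: compare empirical moments of
# Y ~ MVHG_3(N = 100, K = (45, 20, 35), n = 10) to the formulas above.
rng = np.random.default_rng(seed=4)
K = np.array([45, 20, 35])
N, n = K.sum(), 10

Y = rng.multivariate_hypergeometric(K, n, size=300_000)     # each row is one draw

print(Y.mean(axis=0), n * K / N)                            # mean vector ~ n K / N
fpc = (N - n) / (N - 1)                                     # finite population correction
print(np.cov(Y, rowvar=False))                              # empirical covariance matrix
print(n * fpc * (np.diag(K / N) - np.outer(K / N, K / N)))  # theoretical covariance matrix
```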

Proof of Hypergeometric Variance. We'll finally prove the variance of a univariate Hypergeometric (the variance of $Y_i$), but leave the covariance matrix to you (it can be approached similarly to the multinomial covariance matrix).

Let $X \sim \text{HypGeo}(N, K, n)$ (univariate hypergeometric). For $i = 1, \ldots, n$, let $X_i$ be the indicator of whether or not we got a success on trial $i$ (these are not independent indicators). Then, $\mathbb{E}[X_i] = P(X_i = 1) = \frac{K}{N}$ for every trial $i$, so $\mathbb{E}[X] = n\frac{K}{N}$ by linearity of expectation.

First, we have that since $X_i \sim \text{Ber}\left(\frac{K}{N}\right)$:

$$\text{Var}(X_i) = p(1 - p) = \frac{K}{N}\left(1 - \frac{K}{N}\right)$$

Second, for $i \neq j$, $\mathbb{E}[X_i X_j] = P(X_i X_j = 1) = P(X_i = 1)\,P(X_j = 1 \mid X_i = 1) = \frac{K}{N} \cdot \frac{K - 1}{N - 1}$, so

$$\text{Cov}(X_i, X_j) = \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\,\mathbb{E}[X_j] = \frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}$$

Finally,

$$\text{Var}(X) = \text{Var}\left(\sum_{i=1}^{n} X_i\right) \qquad \text{[def of } X\text{]}$$

$$= \text{Cov}\left(\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j\right) \qquad \text{[covariance with self is variance]}$$

$$= \sum_{i=1}^{n} \sum_{j=1}^{n} \text{Cov}(X_i, X_j) \qquad \text{[bilinearity of covariance]}$$

$$= \sum_{i=1}^{n} \text{Var}(X_i) + 2\sum_{i < j} \text{Cov}(X_i, X_j) \qquad \text{[split diagonal]}$$

$$= n\frac{K}{N}\left(1 - \frac{K}{N}\right) + 2\binom{n}{2}\left(\frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}\right) \qquad \text{[plug in]}$$

$$= n\frac{K}{N} \cdot \frac{N - K}{N} \cdot \frac{N - n}{N - 1} \qquad \text{[algebra]}$$
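Since the Hypergeometric has finite support, we can also verify this variance formula exactly, with no simulation needed. Here is a short Python check (not part of the original text); N = 100, K = 45, n = 10 are arbitrary choices.

```python
from math import comb

# A minimal sketch (not from the text): compute Var(X) for X ~ HypGeo(N, K, n)
# exactly from the PMF and compare it to the closed form derived above.
N, K, n = 100, 45, 10

support = range(max(0, n - (N - K)), min(n, K) + 1)
pmf = {k: comb(K, k) * comb(N - K, n - k) / comb(N, n) for k in support}

mean = sum(k * p for k, p in pmf.items())
var = sum((k - mean) ** 2 * p for k, p in pmf.items())
closed_form = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))

print(var, closed_form)   # the two should agree up to floating-point error
```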

5.8.4 Exercises

These won’t be very interesting since this could’ve been done in chapter 1 and 2!

1. Suppose you are fishing in a pond with 3 red fish, 4 green fish, and 5 blue fish.

(a) You use a net to scoop up 6 of them. What is the probability you scooped up 2 of each?
(b) You “catch and release” until you have caught 6 fish (catch 1, throw it back, catch another, throw it
back, etc.). What is the probability you caught 2 of each?

Solution:

(a) Let X = (X1 , X2 , X3 ) be how many red, green, and blue fish I caught respectively. Then,
X ∼ MVHG3 (N = 12, K = (3, 4, 5), n = 6), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \frac{\binom{3}{2}\binom{4}{2}\binom{5}{2}}{\binom{12}{6}}$$

(b) Let X = (X1 , X2 , X3 ) be how many red, green, and blue fish I caught respectively. Then,
X ∼ Mult3 (n = 6, p = (3/12, 4/12, 5/12)), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \binom{6}{2, 2, 2}\left(\frac{3}{12}\right)^2 \left(\frac{4}{12}\right)^2 \left(\frac{5}{12}\right)^2$$
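For completeness, here is a short Python evaluation of both answers (not part of the original text, assuming Python 3.8+ for math.comb).

```python
from math import comb, factorial

# A minimal sketch (not from the text): evaluate both exercise answers numerically.
# (a) MVHG: scoop 6 fish without replacement from 3 red, 4 green, 5 blue.
p_a = comb(3, 2) * comb(4, 2) * comb(5, 2) / comb(12, 6)

# (b) Multinomial: catch-and-release makes each catch an independent trial.
coef = factorial(6) // (factorial(2) ** 3)                  # 6!/(2!2!2!) = 90
p_b = coef * (3 / 12) ** 2 * (4 / 12) ** 2 * (5 / 12) ** 2

print(p_a, p_b)   # approximately 0.1948 and 0.1085
```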
