5.8: Multinomial Random Variables
As you’ve seen, the Binomial distribution is extremely commonly used, and probably the most important
discrete distribution. The Normal distribution is certainly the most important continuous distribution. In
this section, we’ll see how to generalize the Binomial, and in the next, the Normal.
Why do we need to generalize the Binomial distribution? Sometimes, we don’t just have two outcomes
(success and failure), but we have r > 2 outcomes. In this case, we need to maintain counts of how many
times each of the r outcomes appeared. A single random variable is no longer sufficient; we need a vector of
counts!
Actually, the example problems at the end could have been solved in Chapter 1. We will just formalize
this situation so that we can use it later!
We will first introduce the concept of a random vector (RVTR), which is just a collection of random variables stacked on top of each other. If $\mathbf{X} = (X_1, \dots, X_n)$ is a random vector, its expectation is just the vector of expectations, taken entrywise:
\[
\operatorname{E}[\mathbf{X}] = \begin{pmatrix} \operatorname{E}[X_1] \\ \vdots \\ \operatorname{E}[X_n] \end{pmatrix}
\]
What about the variance? We can't just compute a single scalar $\operatorname{Var}(\mathbf{X})$: what would that even mean for a random vector? Instead, we need to define an $n \times n$ covariance matrix, which stores all the pairwise covariances. It is often denoted in one of three ways: $\Sigma = \operatorname{Var}(\mathbf{X}) = \operatorname{Cov}(\mathbf{X})$.
The covariance matrix of a random vector $\mathbf{X} \in \mathbb{R}^n$ with $\operatorname{E}[\mathbf{X}] = \boldsymbol{\mu}$ is the matrix denoted $\Sigma = \operatorname{Var}(\mathbf{X}) = \operatorname{Cov}(\mathbf{X})$ whose entries are $\Sigma_{ij} = \operatorname{Cov}(X_i, X_j)$. The formula for this is:
\[
\Sigma = \begin{pmatrix}
\operatorname{Cov}(X_1, X_1) & \operatorname{Cov}(X_1, X_2) & \dots & \operatorname{Cov}(X_1, X_n)\\
\operatorname{Cov}(X_2, X_1) & \operatorname{Cov}(X_2, X_2) & \dots & \operatorname{Cov}(X_2, X_n)\\
\vdots & \vdots & \ddots & \vdots\\
\operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \dots & \operatorname{Cov}(X_n, X_n)
\end{pmatrix}
= \begin{pmatrix}
\operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) & \dots & \operatorname{Cov}(X_1, X_n)\\
\operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) & \dots & \operatorname{Cov}(X_2, X_n)\\
\vdots & \vdots & \ddots & \vdots\\
\operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \dots & \operatorname{Var}(X_n)
\end{pmatrix}
\]
Notice that the covariance matrix is symmetric ($\Sigma_{ij} = \Sigma_{ji}$), and contains the variances along the diagonal.
Note: If you know a bit of linear algebra, you might like to know that covariance matrices
are always symmetric positive semi-definite.
We will not be doing any linear algebra in this class - think of it as just a place to store all the pairwise
covariances. Now let us look at an example of a covariance matrix.
Example(s)
If $X_1, X_2, \dots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, find the mean vector and covariance matrix of the random vector $\mathbf{X} = (X_1, \dots, X_n)$.

Solution: The mean vector is
\[
\operatorname{E}[\mathbf{X}] = \begin{pmatrix} \operatorname{E}[X_1] \\ \vdots \\ \operatorname{E}[X_n] \end{pmatrix} = \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix} = \mu \mathbf{1}_n
\]
where $\mathbf{1}_n$ denotes the $n$-dimensional vector of all 1's. The covariance matrix is (since the diagonal is just the individual variances $\sigma^2$ and the off-diagonals ($i \ne j$) are all $\operatorname{Cov}(X_i, X_j) = 0$ due to independence)
\[
\Sigma = \begin{pmatrix}
\sigma^2 & 0 & \dots & 0\\
0 & \sigma^2 & \dots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \dots & \sigma^2
\end{pmatrix} = \sigma^2 I_n
\]
where $I_n$ is the $n \times n$ identity matrix.
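If you'd like to see this concretely, here is a minimal NumPy sketch (the particular choices $\mu = 2$, $\sigma = 3$, $n = 4$, and the Normal distribution are arbitrary): the sample mean vector and sample covariance matrix of many iid draws should come out close to $\mu \mathbf{1}_n$ and $\sigma^2 I_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 3.0, 4          # arbitrary illustrative values
num_samples = 200_000

# Each row is one realization of X = (X_1, ..., X_n) with iid entries
# of mean mu and variance sigma^2 (Normal here, but any distribution works).
X = rng.normal(mu, sigma, size=(num_samples, n))

print(X.mean(axis=0))               # approximately mu * 1_n
print(np.cov(X, rowvar=False))      # approximately sigma^2 * I_n
```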
An important theorem is that properties of expectation and variance still hold for RVTRs: if $A$ is a constant matrix and $\mathbf{b}$ a constant vector, then
\[
\operatorname{E}[A\mathbf{X} + \mathbf{b}] = A\operatorname{E}[\mathbf{X}] + \mathbf{b}, \qquad \operatorname{Var}(A\mathbf{X} + \mathbf{b}) = A \operatorname{Var}(\mathbf{X}) A^T
\]
Since we aren't expecting any linear algebra background, we won't prove this.
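You can, however, check these two properties by simulation. Here is a quick sketch (the matrix $A$, vector $\mathbf{b}$, and distribution of $\mathbf{X}$ below are all arbitrary choices, assuming NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(1)
num_samples = 500_000

# Arbitrary constant matrix A and vector b for the affine map Y = AX + b.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([5.0, -3.0])

# X has independent entries with standard deviations 1, 2, 3.
X = rng.normal(size=(num_samples, 3)) * np.array([1.0, 2.0, 3.0])
Y = X @ A.T + b

mu_X = X.mean(axis=0)
Sigma_X = np.cov(X, rowvar=False)

print(Y.mean(axis=0), A @ mu_X + b)     # E[AX + b] vs A E[X] + b
print(np.cov(Y, rowvar=False))          # Var(AX + b) ...
print(A @ Sigma_X @ A.T)                # ... vs A Var(X) A^T
```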
Suppose we have a scenario where there are $r = 3$ outcomes, with probabilities $p_1, p_2, p_3$ respectively, such that $p_1 + p_2 + p_3 = 1$. Suppose we have $n = 7$ independent trials, and let $\mathbf{Y} = (Y_1, Y_2, Y_3)$ be the RVTR of counts of each outcome. Suppose we define each $\mathbf{X}_i$ as a one-hot vector (exactly one 1, and the rest 0s) recording the outcome of trial $i$, so that $\mathbf{Y} = \sum_{i=1}^{n} \mathbf{X}_i$ (this is exactly like how adding indicators/Bernoullis gives us a Binomial).
Now, what is the probability of seeing two of outcome 1, one of outcome 2, and four of outcome 3; that is, $(Y_1 = 2, Y_2 = 1, Y_3 = 4)$? We get the following:
\begin{align*}
p_{Y_1, Y_2, Y_3}(2, 1, 4) &= \frac{7!}{2!\,1!\,4!} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 && \text{[recall from counting]}\\
&= \binom{7}{2, 1, 4} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4
\end{align*}
This describes the joint distribution of the random vector $\mathbf{Y} = (Y_1, Y_2, Y_3)$, and its PMF should remind you of the Binomial PMF. We just count the number of ways $\binom{7}{2,1,4}$ to get these counts (a multinomial coefficient), and multiply by the probability $p_1^2 p_2^1 p_3^4$ of getting each outcome that many times.
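As a sanity check, here is a small sketch that evaluates this PMF for some made-up probabilities $\mathbf{p} = (0.2, 0.3, 0.5)$, both by hand and with SciPy's multinomial distribution (assuming SciPy is available):

```python
from math import factorial
from scipy.stats import multinomial

p = [0.2, 0.3, 0.5]        # hypothetical probabilities p1, p2, p3 (sum to 1)
counts = [2, 1, 4]         # the outcome (Y1, Y2, Y3) = (2, 1, 4), with n = 7 trials

# By hand: multinomial coefficient 7!/(2! 1! 4!) times p1^2 * p2^1 * p3^4.
coef = factorial(7) // (factorial(2) * factorial(1) * factorial(4))
by_hand = coef * p[0]**2 * p[1]**1 * p[2]**4

print(by_hand)
print(multinomial.pmf(counts, n=7, p=p))   # same value from scipy
```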
Now let us define the Multinomial distribution more generally. If $\mathbf{Y} = (Y_1, \dots, Y_r)$ counts the outcomes of $n$ independent trials, where each trial results in outcome $i$ with probability $p_i$ (and $\sum_{i=1}^{r} p_i = 1$), we write $\mathbf{Y} \sim \text{Mult}_r(n, \mathbf{p})$, and its joint PMF is
\[
p_{\mathbf{Y}}(y_1, \dots, y_r) = \binom{n}{y_1, \dots, y_r} \prod_{i=1}^{r} p_i^{y_i}, \qquad y_i \ge 0 \text{ and } \sum_{i=1}^{r} y_i = n
\]
Notice that each $Y_i$ is marginally $\text{Bin}(n, p_i)$. Hence, $\operatorname{E}[Y_i] = np_i$ and $\operatorname{Var}(Y_i) = np_i(1 - p_i)$.
Then, we can specify the entire mean vector $\operatorname{E}[\mathbf{Y}]$ and covariance matrix:
\[
\operatorname{E}[\mathbf{Y}] = n\mathbf{p} = \begin{pmatrix} np_1 \\ \vdots \\ np_r \end{pmatrix}, \qquad \operatorname{Var}(Y_i) = np_i(1 - p_i), \qquad \operatorname{Cov}(Y_i, Y_j) = -np_i p_j \quad (\text{for } i \ne j)
\]
Notice that the covariance is negative, which makes sense: as the number of occurrences of outcome $i$ increases, the number of occurrences of outcome $j$ should decrease, since a single trial cannot result in both outcomes simultaneously.
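In code, the mean vector and covariance matrix can be assembled directly from these formulas; the one-line expression $\Sigma = n(\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T)$ below is just a compact way of writing the diagonal and off-diagonal entries above (the values of $n$ and $\mathbf{p}$ are arbitrary):

```python
import numpy as np

n = 7
p = np.array([0.2, 0.3, 0.5])   # hypothetical probabilities summing to 1

mean_vec = n * p                                   # E[Y] = n p
Sigma = n * (np.diag(p) - np.outer(p, p))          # covariance matrix
# Diagonal entries:     n * p_i * (1 - p_i)
# Off-diagonal entries: -n * p_i * p_j

print(mean_vec)
print(Sigma)
```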
Proof of Multinomial Covariance. Recall that marginally, $Y_i$ and $Y_j$ are Binomial random variables; let's decompose them into their Bernoulli trials. We'll use different dummy indices as we're dealing with covariances.
Let $X_{ik}$ for $k = 1, \dots, n$ be indicator/Bernoulli RVs of whether the $k$-th trial resulted in outcome $i$, so that $Y_i = \sum_{k=1}^{n} X_{ik}$.

Similarly, let $X_{j\ell}$ for $\ell = 1, \dots, n$ be indicators of whether the $\ell$-th trial resulted in outcome $j$, so that $Y_j = \sum_{\ell=1}^{n} X_{j\ell}$.
Before we begin, we should argue that $\operatorname{Cov}(X_{ik}, X_{j\ell}) = 0$ when $k \ne \ell$, since $k$ and $\ell$ are different trials and hence independent.

Furthermore, $\operatorname{E}[X_{ik} X_{jk}] = 0$, since it's not possible that both outcome $i$ and outcome $j$ occur at trial $k$.
\begin{align*}
\operatorname{Cov}(Y_i, Y_j) &= \operatorname{Cov}\left(\sum_{k=1}^{n} X_{ik}, \sum_{\ell=1}^{n} X_{j\ell}\right) && \text{[indicators]}\\
&= \sum_{k=1}^{n} \sum_{\ell=1}^{n} \operatorname{Cov}(X_{ik}, X_{j\ell}) && \text{[covariance works like FOIL]}\\
&= \sum_{k=1}^{n} \operatorname{Cov}(X_{ik}, X_{jk}) && \text{[independent trials, cross terms are 0]}\\
&= \sum_{k=1}^{n} \left(\operatorname{E}[X_{ik} X_{jk}] - \operatorname{E}[X_{ik}]\operatorname{E}[X_{jk}]\right) && \text{[def of covariance]}\\
&= \sum_{k=1}^{n} (-p_i p_j) && \text{[first expectation is 0]}\\
&= -n p_i p_j
\end{align*}
Note that in the third line we dropped one of the sums because the indicators across different trials $k \ne \ell$ are independent (zero covariance); hence, we just need to sum over the terms where $k = \ell$.
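If you'd like to convince yourself of this result by simulation, here is a small sketch (arbitrary $n$ and $\mathbf{p}$, assuming NumPy) comparing the empirical covariance matrix of many multinomial samples against the formula we just derived:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 7, np.array([0.2, 0.3, 0.5])    # hypothetical parameters

# Each row of `samples` is one draw of (Y1, Y2, Y3) ~ Mult_3(n, p).
samples = rng.multinomial(n, p, size=300_000)

print(np.cov(samples, rowvar=False))           # empirical covariance matrix
print(n * (np.diag(p) - np.outer(p, p)))       # theoretical: n(diag(p) - p p^T)
```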
There is an example of the Multinomial distribution at the end of the section!
Suppose there are $r = 3$ political parties (Green, Democratic, Republican). The senate consists of $N = 100$ senators: $K_1 = 45$ Green party members, $K_2 = 20$ Democrats, and $K_3 = 35$ Republicans. Suppose we choose $n = 10$ senators at random, without replacement, to form a committee. What is the probability we get exactly 1 Green party member, 6 Democrats, and 3 Republicans?

This is very similar to the univariate Hypergeometric distribution! For the denominator, there are $\binom{100}{10}$ ways to choose 10 senators. For the numerator, we need 1 from the 45 Green party members, 6 from the 20 Democrats, and 3 from the 35 Republicans:
\[
P(Y_1 = 1, Y_2 = 6, Y_3 = 3) = \frac{\binom{45}{1}\binom{20}{6}\binom{35}{3}}{\binom{100}{10}}
\]
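For concreteness, here is this probability computed with Python's math.comb:

```python
from math import comb

# P(Y1 = 1, Y2 = 6, Y3 = 3): choose 1 of 45 Greens, 6 of 20 Democrats,
# and 3 of 35 Republicans, out of all ways to choose 10 of 100 senators.
prob = comb(45, 1) * comb(20, 6) * comb(35, 3) / comb(100, 10)
print(prob)
```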
Once again, let us define the MVHG Distribution more generally.
Suppose there are $r$ different colors of balls in a bag, with $K_i$ balls of color $i$ for $1 \le i \le r$; write $\mathbf{K} = (K_1, \dots, K_r)$. Let $N = \sum_{i=1}^{r} K_i$ be the total number of balls in the bag, and suppose we draw $n$ without replacement. Let $\mathbf{Y} = (Y_1, \dots, Y_r)$ be the RVTR such that $Y_i$ is the number of balls of color $i$ we drew. We write that:
\[
\mathbf{Y} \sim \text{MVHG}_r(N, \mathbf{K}, n)
\]
Notice that each $Y_i$ is marginally $\text{HypGeo}(N, K_i, n)$, so $\operatorname{E}[Y_i] = n\frac{K_i}{N}$ and $\operatorname{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1}$. Then, we can specify the entire mean vector $\operatorname{E}[\mathbf{Y}]$ and covariance matrix:
\[
\operatorname{E}[\mathbf{Y}] = n\frac{\mathbf{K}}{N} = \begin{pmatrix} n\frac{K_1}{N} \\ \vdots \\ n\frac{K_r}{N} \end{pmatrix}, \qquad \operatorname{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1}, \qquad \operatorname{Cov}(Y_i, Y_j) = -n \cdot \frac{K_i}{N} \cdot \frac{K_j}{N} \cdot \frac{N - n}{N - 1} \quad (\text{for } i \ne j)
\]
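As with the multinomial, these formulas can be assembled in code; the matrix expression $\Sigma = n\frac{N-n}{N-1}\left(\operatorname{diag}(\mathbf{q}) - \mathbf{q}\mathbf{q}^T\right)$ with $\mathbf{q} = \mathbf{K}/N$ is just a compact rewriting of the entrywise formulas above (here using the senator numbers as an example):

```python
import numpy as np

K = np.array([45, 20, 35])    # K_i balls (senators) of each color (party)
N = K.sum()                   # N = 100 total
n = 10                        # number drawn without replacement

q = K / N
mean_vec = n * q                                               # E[Y_i] = n K_i / N
Sigma = n * (N - n) / (N - 1) * (np.diag(q) - np.outer(q, q))
# Diagonal:     n * (K_i/N) * ((N - K_i)/N) * ((N - n)/(N - 1))
# Off-diagonal: -n * (K_i/N) * (K_j/N) * ((N - n)/(N - 1))

print(mean_vec)
print(Sigma)
```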
Proof of Hypergeometric Variance. We'll finally prove the variance of a univariate Hypergeometric (the variance of $Y_i$), but leave the covariance matrix to you (it can be approached similarly to the multinomial covariance matrix).
Let $X \sim \text{HypGeo}(N, K, n)$ (univariate hypergeometric). For $i = 1, \dots, n$, let $X_i$ be the indicator of whether or not we got a success on trial $i$ (not independent indicators). Then, $\operatorname{E}[X_i] = P(X_i = 1) = \frac{K}{N}$ for every trial $i$, so $\operatorname{E}[X] = n\frac{K}{N}$ by linearity of expectation.
First, we have that since $X_i \sim \text{Ber}\left(\frac{K}{N}\right)$:
\[
\operatorname{Var}(X_i) = p(1 - p) = \frac{K}{N}\left(1 - \frac{K}{N}\right)
\]
Second, for $i \ne j$, $\operatorname{E}[X_i X_j] = P(X_i X_j = 1) = P(X_i = 1)\,P(X_j = 1 \mid X_i = 1) = \frac{K}{N} \cdot \frac{K - 1}{N - 1}$, so
\[
\operatorname{Cov}(X_i, X_j) = \operatorname{E}[X_i X_j] - \operatorname{E}[X_i]\operatorname{E}[X_j] = \frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}
\]
Finally,
\begin{align*}
\operatorname{Var}(X) &= \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) && \text{[def of $X$]}\\
&= \operatorname{Cov}\left(\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j\right) && \text{[covariance with self is variance]}\\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(X_i, X_j) && \text{[bilinearity of covariance]}\\
&= \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2\sum_{i < j} \operatorname{Cov}(X_i, X_j) && \text{[split diagonal]}\\
&= n\frac{K}{N}\left(1 - \frac{K}{N}\right) + 2\binom{n}{2}\left(\frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}\right) && \text{[plug in]}\\
&= n \cdot \frac{K}{N} \cdot \frac{N - K}{N} \cdot \frac{N - n}{N - 1} && \text{[algebra]}
\end{align*}
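If you'd like to check this variance formula by simulation, here is a quick sketch using NumPy's hypergeometric sampler (the senator numbers below are just a convenient example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, n = 100, 45, 10          # population size, number of successes, draws

# Number of successes when drawing n of N without replacement, many times.
draws = rng.hypergeometric(K, N - K, n, size=500_000)

print(draws.var())                                        # empirical variance
print(n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1)))  # formula just derived
```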
5.8.4 Exercises
These won't be very interesting, since they could've been done with the tools from Chapters 1 and 2!
1. Suppose you are fishing in a pond with 3 red fish, 4 green fish, and 5 blue fish.
(a) You use a net to scoop up 6 of them. What is the probability you scooped up 2 of each?
(b) You "catch and release" until you've caught 6 fish (catch 1, throw it back, catch another, throw it back, etc.). What is the probability you caught 2 of each?
Solution:
(a) Let $\mathbf{X} = (X_1, X_2, X_3)$ be how many red, green, and blue fish I caught, respectively. Then, $\mathbf{X} \sim \text{MVHG}_3(N = 12, \mathbf{K} = (3, 4, 5), n = 6)$, and
\[
P(X_1 = 2, X_2 = 2, X_3 = 2) = \frac{\binom{3}{2}\binom{4}{2}\binom{5}{2}}{\binom{12}{6}}
\]
(b) Let $\mathbf{X} = (X_1, X_2, X_3)$ be how many red, green, and blue fish I caught, respectively. Then, $\mathbf{X} \sim \text{Mult}_3(n = 6, \mathbf{p} = (3/12, 4/12, 5/12))$, and
\[
P(X_1 = 2, X_2 = 2, X_3 = 2) = \binom{6}{2, 2, 2}\left(\frac{3}{12}\right)^2 \left(\frac{4}{12}\right)^2 \left(\frac{5}{12}\right)^2
\]
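If you want to verify these answers numerically, here is a short sketch using math.comb and math.factorial:

```python
from math import comb, factorial

# (a) Net scoop of 6 from 3 red, 4 green, 5 blue: multivariate hypergeometric.
p_a = comb(3, 2) * comb(4, 2) * comb(5, 2) / comb(12, 6)

# (b) Catch and release 6 times: multinomial with p = (3/12, 4/12, 5/12).
coef = factorial(6) // (factorial(2) ** 3)          # (6 choose 2,2,2) = 90
p_b = coef * (3 / 12) ** 2 * (4 / 12) ** 2 * (5 / 12) ** 2

print(p_a, p_b)
```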