PROBABILITY LECTURE NOTES
HAYK ALEKSANYAN
Contents
1. Events and probabilities
1.1. Notion of a probability
1.2. Outcomes and events
1.3. Probabilities
1.4. Probability space
2. Combinatorial analysis
2.1. Multiplication principle
2.2. Permutations
2.3. Sampling
2.4. Binomial theorem
2.5. Multinomial coefficient
3. Basic properties of probability
3.1. Continuity of probability measure
3.2. Boole's inequality
3.3. Graph coloring and Ramsey numbers
4. Inclusion-exclusion principle and related inequalities
4.1. Inclusion-exclusion principle
4.2. Bonferroni's inequalities
5. Independent events and conditional probability
5.1. Independence
5.2. Conditional probability
5.3. Law of total probability
5.4. Bayes rule
6. Discrete probability distributions
6.1. Important examples
6.2. Bernoulli approximation to Poisson
7. Discrete random variable
7.1. Discrete random variable and its probability mass function
7.2. Functions of random variables
7.3. Expectation and variance
7.4. Conditional expectation
7.5. Indicator functions
8. Multivariate discrete distribution. Expectation in the multivariate case
8.1. Expectation in the multivariate case
9. Independent random variables
9.1. Independence of two random variables
9.2. Independence of n random variables
10. Inequalities
10.1. Jensen's inequality
10.2. Arithmetic mean - Geometric mean inequality
10.3. Cauchy-Schwarz inequality
10.4. Covariance and correlation
10.5. Information entropy
11. Weak Law of Large Numbers
11.1. Markov and Chebyshev
11.2. WLLN - easy version
11.3. Probabilistic proof of the Weierstrass approximation theorem
12. Simple symmetric random walk
12.1. Random walks on Z and the reflection principle
12.2. Arcsin law
References
1.2. Outcomes and events. We now formalize the above discussion about experiments and their outcomes. Let E be some experiment whose outcome is not deterministic: for the sake of illustration consider rolling a six-sided dice. Then the set of all possible outcomes of the experiment, which is usually denoted by Ω and referred to as the sample space, is
Ω = {1, 2, 3, 4, 5, 6},
in correspondence with the numbers on the dice. Questions one may ask about experiment E can be as follows: “is the outcome equal to 3?”, “is it smaller than 5?”, “is it a prime number?”, etc. Such questions can be reformulated in terms of subsets of Ω. For example, the last question is equivalent to asking if the outcome of the experiment lies in {2, 3, 5}, which is a subset of Ω. Next, we collect all subsets of Ω that are of interest to us into a list; each member of such a list will be referred to as an event. There is a natural link between events and subsets of Ω: indeed, take two subsets A and B of Ω, then
The two examples above represent the edge cases: for the given Ω the first is the largest event space and the second is the smallest, both in terms of inclusion.
Remark 1.2. Informally, we may think of F as the collection of all events whose occurrence or non-occurrence we can verify in a particular realization of the experiment. In other words, F is the set of all questions one may ask about the outcome of the experiment that can be answered.
Example 1.2.1. Let our experiment be rolling a fair 6-sided dice. Assume that the
outcome of the roll is not directly revealed to us, but one can only know if the roll
was even or odd (e.g. we want to build a simplistic random-bit generator). Then Ω =
{1, 2, 3, 4, 5, 6} and
F = {∅, {1, 3, 5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}}.
Exercise 1.2.2. Given A1, ..., An ∈ F, let B be the set of all points in Ω that belong to exactly k of the Ai-s, where 1 ≤ k ≤ n is fixed. Show that B ∈ F.
Exercise 1.2.3. Let Ω be finite. Show that F has an even number of elements.
1.3. Probabilities. Getting back to the building blocks of a probability space discussed in subsection 1.1, notice that we have not yet allocated any quantitative measure to an event. In this section, for each event A ∈ F we assign a probability and write P(A) for its value. The axiomatic approach that comes next is due to A. N. Kolmogorov.
Definition 1.2. (Probability measure) A mapping P : F → R is called a probability
measure on (Ω, F) if
1. 0 ≤ P(A) ≤ 1 for any A ∈ F,
2. P(∅) = 0 and P(Ω) = 1,
3. for any collection A1, A2, ... ∈ F of disjoint events, i.e. Ai ∩ Aj = ∅ for i ≠ j, we have
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
PROBABILITY LECTURE NOTES 5
(a) The condition that P(∅) = 0 is redundant and can be inferred from the rest of the axioms. Indeed, take A1 = Ω and Ai = ∅ for i ≥ 2. In view of condition 3 of Definition 1.2 we have
$$P(\Omega) = P(\Omega) + \sum_{i=2}^{\infty} P(\emptyset),$$
hence P(∅) = 0.
(b) P is finitely additive, i.e. for a collection of disjoint events Ai, i = 1, 2, ..., n, we have
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i).$$
This follows from countable additivity taking Ai = ∅ for all i ≥ n + 1.
(c) The condition that P(A) ≤ 1 for any A ∈ F is also redundant and follows from
the monotonicity of the probability measure. Indeed, take any A, B ∈ F where
A ⊂ B. Then B = A ∪ (B \ A) and since the two sets in the union are disjoint
the finite additivity proved above implies
(1.2) P(B) = P(A) + P(B \ A) ≥ P(A),
where the inequality is due to non-negativity of P. Since for any A ∈ F we have
A ⊂ Ω, using monotonicity of P we get P(A) ≤ P(Ω) = 1.
Exercise 1.2.4. Let Ω = {ω1, ω2, ...} be an at most countable set (finite or countable) and let pi ≥ 0 be such that $\sum_i p_i = 1$. Let F be an event space defined on Ω and for A ∈ F set
$$P(A) := \sum_{i:\, \omega_i \in A} p_i.$$
Show that P is a probability measure on (Ω, F).
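As an illustrative sketch (not part of the notes' formalism), such a weighted measure can be modeled in a few lines of Python; the sample space and the weights pi below are arbitrary choices for the example.

```python
from fractions import Fraction

# Weighted probability measure on a finite sample space, as in Exercise 1.2.4.
# The sample space and the weights below are illustrative choices.
omega = {1, 2, 3, 4}
p = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}

def P(A):
    """P(A) = sum of the weights p_i over the outcomes in A."""
    return sum(p[w] for w in A if w in p)

# The axioms of Definition 1.2 hold: P(empty) = 0, P(Omega) = 1,
# and additivity over disjoint events.
assert P(set()) == 0 and P(omega) == 1
A, B = {1, 3}, {2}  # disjoint events
assert P(A | B) == P(A) + P(B)
```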
1.4. Probability space. We combine the definitions of the event space and probability
measure into the following.
Definition 1.3. A probability space is a triplet (Ω, F, P) such that
(a) Ω is a non-empty set,
(b) F is an event space of subsets of Ω,
(c) P is a probability measure on (Ω, F).
Figure 1. In the union A ∪ B the subset A ∩ B contributes to both P(A) and P(B), thus we need to subtract P(A ∩ B) to compensate for this doubled contribution.
In all three equalities the left-hand side is represented as a union of disjoint events. Thus, employing the finite additivity of the probability measure, we get
$$P(A) + P(B) = P(A \cap B) + P(A \setminus B) + P(A \cap B) + P(B \setminus A) = P(A \cap B) + P(A \cup B),$$
completing the proof.
2. Combinatorial analysis
In this section we discuss various counting principles with an eye on their probabilistic interpretation.
2.1. Multiplication principle. Consider this simple question: in how many different ways can we form a 3-question exam ticket out of 5 probability, 7 analysis and 4 geometry questions, if all 3 topics must be represented on the ticket? The answer is 5 × 7 × 4 in view of the multiplication principle, also known as the fundamental rule of counting, described below.
The rule allows us to compute the number of outcomes of a sequence of k experiments Ei with i = 1, 2, ..., k, when there are m1 possible outcomes for the first experiment, m2 for the second, and so on until mk scenarios for the last one. Then the outcome of the joint experiment is an ordered k-tuple, and the total number of such tuples equals (see Figure 2)
$$m_1 \cdot m_2 \cdots m_k.$$
Figure 2. The tree of outcomes of the joint experiment: the root has m1 children corresponding to the outcomes of the first experiment, each of which has m2 children corresponding to the outcomes of the second, giving m1 · m2 nodes at the second level, and so on.
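As a quick sanity check of the multiplication principle, the exam-ticket question from the beginning of this subsection can be enumerated directly; this is only an illustrative sketch, with made-up question labels.

```python
from itertools import product

# One question from each topic: 5 probability, 7 analysis, 4 geometry.
probability = [f"P{i}" for i in range(5)]
analysis = [f"A{i}" for i in range(7)]
geometry = [f"G{i}" for i in range(4)]

# Each ticket is an ordered 3-tuple, one question per topic.
tickets = list(product(probability, analysis, geometry))

# The multiplication principle predicts 5 * 7 * 4 = 140 tickets.
assert len(tickets) == 5 * 7 * 4 == 140
```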
Exercise 2.0.1. In how many different ways can n different hats be assigned to n different people so that each person gets exactly one hat?
We will prove the weak version of Stirling's approximation, which is still useful, and refer to analysis textbooks for the proof of the full version stated above.
Since log is increasing, for k = 1, 2, ... we have
$$\log k \le \int_k^{k+1} \log x \, dx \le \log(k+1),$$
hence $\int_{k-1}^{k} \log x \, dx \le \log k \le \int_{k}^{k+1} \log x \, dx$ for k = 2, 3, ... . Summing these inequalities over k and using the fact that log 1 = 0, we get
$$(2.2) \qquad \int_1^n \log x \, dx \le \log n! \le \int_1^{n+1} \log x \, dx.$$
Using integration by parts we get $\int_1^n \log x \, dx = n \log n - n + 1$, which plugged into (2.2) completes the proof.
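These bounds are easy to verify numerically; the sketch below (purely illustrative) uses math.lgamma, for which lgamma(n + 1) = log n!.

```python
import math

def log_factorial(n):
    """log n!, computed via the log-gamma function: log Gamma(n + 1)."""
    return math.lgamma(n + 1)

def integral_log(m):
    """Integral of log x over [1, m]; by parts it equals m log m - m + 1."""
    return m * math.log(m) - m + 1

# Check the weak Stirling bounds (2.2) for a range of n.
for n in range(2, 200):
    assert integral_log(n) <= log_factorial(n) <= integral_log(n + 1)
```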
→ Problem 2.1.1. How many trailing 0-s does n! have? For example, there are none
for 0 ≤ n ≤ 4, there is one for 5 ≤ n ≤ 9, there are two for 10 ≤ n ≤ 14, etc.
→ Problem 2.1.2. Prove that the fractional parts of log n! are dense in [0, 1]. Namely, show that for any interval [a, b] ⊂ [0, 1] there exists n ∈ N such that {log n!} ∈ [a, b].
2.3. Sampling. For some experiments we will be given a set X and will need to take elements from X following some rule. For instance, we have a deck of cards (the set X) and the experiment consists in choosing three cards from the deck. When choosing a card we might put it back into the deck so that it can be chosen again, or set it aside and enforce uniqueness of the selected items. In this generality we will refer to such a process as sampling from X, and will discuss the following schemes of sampling from a set:
ordered with replacement,
ordered without replacement,
without order without replacement,
without order with replacement.
Here sampling with (without) replacement refers to allowing (not allowing) the use of the same element more than once, and with (without) order means that the arrangement of the elements does (does not) matter. For example, (1, 2, 3) is different from (2, 1, 3) if order is enforced (i.e. in ordered schemes); otherwise they are the same. The aim of this section is to introduce counting principles that address the four cases listed above. Table 2.3 summarizes the number of possibilities of sampling k times from n objects with the schemes mentioned above. We next proceed to the detailed analysis of these schemes.
                 | with replacement        | without replacement
ordered          | n^k                     | n!/(n-k)!
without order    | (k+n-1)!/(k!(n-1)!)     | n!/(k!(n-k)!)
2.3.1. Ordered with replacement. This is perhaps the easiest case of the four. We have n objects and need to sample k of them, taking order into account and allowing replacement. Then we are looking for tuples of length k where for each position we have n possible choices. Thus, in view of the multiplication principle, we have n^k total choices here.
Example 2.1.1. Let us compute the probability of getting at least one 6 in 10 throws of a fair 6-sided dice. Set X = {1, 2, 3, 4, 5, 6}; then each outcome of the experiment of throwing the dice 10 times can be represented by a tuple of the form (a1, ..., a10) where ai ∈ X. Thus the sample space becomes
Ω = {(a1, a2, ..., a10) : ai ∈ X for i = 1, 2, ..., 10},
with F = 2^Ω and P defined as the equiprobable measure on F, i.e.
$$P(E) = \frac{|E|}{|\Omega|} \quad \text{for any } E \in F.$$
Elements of Ω can equivalently be described as outcomes of ordered sampling with replacement from X, hence |Ω| = 6^10. We will compute the probability of not getting a 6 in 10 throws, which in our notation is the event
E = {(a1, a2, ..., a10) : ai ∈ X \ {6} for i = 1, 2, ..., 10}.
Footnote 3: For x ∈ R let ⌊x⌋ ∈ Z be the largest integer with ⌊x⌋ ≤ x, i.e. ⌊x⌋ is the closest integer to x not exceeding it. The difference x − ⌊x⌋ is called the fractional part of x and is denoted by {x}. By definition {x} ∈ [0, 1) for any x ∈ R. For example, {5.6} = 0.6 while {8} = 0.
Similarly to Ω, the elements of E are the results of sampling with replacement and with regard to order from the set X \ {6}, which has 5 elements. Hence |E| = 5^10. In particular, the probability of getting at least one 6 in 10 throws becomes
$$1 - P(E) = 1 - \frac{|E|}{|\Omega|} = 1 - \left(\frac{5}{6}\right)^{10} \approx 0.838.$$
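The exact value is straightforward to reproduce, and a Monte Carlo simulation (an illustrative sketch, not part of the notes) agrees with it closely.

```python
import random

# Exact probability of at least one 6 in 10 throws of a fair dice.
p_exact = 1 - (5 / 6) ** 10

# Monte Carlo estimate: repeat the 10-throw experiment many times.
random.seed(0)
trials = 100_000
hits = sum(
    any(random.randint(1, 6) == 6 for _ in range(10))
    for _ in range(trials)
)
p_estimate = hits / trials

assert abs(p_exact - 0.838) < 1e-3
assert abs(p_estimate - p_exact) < 0.01  # close for this many trials
```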
2.3.2. Ordered without replacement. So far when discussing permutations we were
shuffling n objects and taking all n of them. In certain scenarios we might need to
choose only a fixed part of the n elements, say 0 ≤ k ≤ n of them. For example,
consider a code of length 3 that is made of digits 0 to 9. Assume in this code one can
use the same digit only once: what is the total number of such codes?
In this problem we are sampling from the set D := {0, 1, ..., 9} and hence n = 10.
Observe that when k = n, then the set of the codes coincides with the set of permutations
of D. For k < n we can use the multiplication principle from which we have n items for
the first character of the code, n − 1 for the second, so on until n − (k − 1) for the last
one. Hence the total number becomes
n!
n · (n − 1) · ... · (n − (k − 1)) = .
(n − k)!
This is usually denoted by P (n, k) or Akn . The sampling scheme discussed here is also
referred to as permutations of n objects taken k at a time.
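For the 3-digit code example above (digits 0-9, no repeated digit) the count is 10 · 9 · 8 = 720, which the following sketch confirms by direct enumeration.

```python
from itertools import permutations
from math import perm

# Codes of length 3 over the digits 0-9 with no repeated digit.
codes = list(permutations(range(10), 3))

# Permutations of n = 10 objects taken k = 3 at a time: n! / (n - k)!.
assert len(codes) == perm(10, 3) == 10 * 9 * 8 == 720
```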
2.3.3. Birthday problem. We now discuss a classical problem on counting, often referred to as the birthday paradox, which can be modeled using the scheme of ordered sampling without replacement. Here is the statement of the problem.
What is the minimum number of people that should be present in the same
room, so that the probability that at least two of them have their birthdays
on the same day of the year is at least 1/2?
Since we have 365 days in a year (disregarding the leap year’s 1 additional day) it
might be tempting to guess that we need at least half of that, 182, as our answer. The
correct answer, however, is 23, significantly less than that!
To arrive at the answer, assume there are k people in the room and let n = 365 be the number of days in a year. There are n^k arrangements of the birthdays k people can have (ordered sampling with replacement), and we now compute the number of such arrangements that result in all people having birthdays on different days. The latter is clearly a problem about permutations of n taken k at a time as discussed in subsection 2.3.2. Thus, there are $A_n^k$ of them. Hence the probability of all k having birthdays on different days is
$$\frac{A_n^k}{n^k} = \frac{1}{n^k}\, n (n-1) \cdots (n-(k-1)) = \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) =: f(k).$$
The probability that there are at least two people having birthdays on the same day is
1−f (k). Plugging n = 365 in the above equality one may estimate that 1−f (22) ≈ 0.47
and 1 − f (23) ≥ 0.5 thus the answer is k = 23.
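The estimate is easy to reproduce; the sketch below computes f(k) directly and searches for the smallest k with 1 − f(k) ≥ 1/2.

```python
def f(k, n=365):
    """Probability that k people all have birthdays on different days."""
    prob = 1.0
    for i in range(k):
        prob *= (n - i) / n
    return prob

# Smallest k for which a shared birthday is at least as likely as not.
k = 1
while 1 - f(k) < 0.5:
    k += 1

assert k == 23
assert 1 - f(22) < 0.5 < 1 - f(23)
```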
2.3.4. Without order without replacement. For permutations of n objects taken 0 ≤ k ≤ n at a time, as discussed in the previous section, the order of the elements matters. For example, (0, 1, 2) and (1, 0, 2) are treated as different elements. There are cases when we might need to disregard the order. For example, in how many ways can 10
players can form a single group of 4 members? In this case the order of the players forming a group does not matter. In general, assume we have n objects and need to choose k of them. Then, in view of the results of Section 2.3.2, we have n!/(n−k)! ways to do this if we take into account the order of the elements. Each choice is formed of an ordered tuple of length k, which itself can be ordered in k! different ways. Thus, if we disregard the order, then each of these n!/(n−k)! tuples will be counted k! times. Hence dividing by k! gives
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
which is the number of ways k elements can be chosen out of n without order and without replacement. This number is called a binomial coefficient and reads "n choose k". This method of sampling is called combinations of n objects taken k at a time.
2.3.5. Without order with replacement. This is arguably the most challenging scenario of counting discussed so far: combinations of k objects from n with replacement and without order. The setting of this sampling scheme is as follows: given n distinct objects, in how many ways can we select k ≥ 0 of them if replacement is allowed (i.e. we can reuse the same element more than once) and the order of the elements is disregarded?
Before formulating the main result consider a simple case when n = 3 and k = 2. Assuming that we are selecting from {1, 2, 3}, the options we have are
11, 12, 13, 22, 23, 33.
Observe that each sample in the list is characterized uniquely by the number of 1-s, 2-s and 3-s it contains (recall that we do not take the order of the tuple into account). In the light of this observation we may rewrite the list above as
(2.3) (2, 0, 0), (1, 1, 0), (1, 0, 1), (0, 2, 0), (0, 1, 1), (0, 0, 2),
Footnote 4: if instead we considered ordered tuples, then there would be 3² = 9 options.
where the i-th entry of the 3-tuple indicates the count of i-s used, for i = 1, 2, 3. We see that for each 3-tuple in (2.3) the sum of its entries is 2; thus what we do, effectively, is equivalent to representing k as a sum of n non-negative integers. We arrive at the following problem: find the number of integer solutions to x1 + x2 + x3 = 2 with 0 ≤ xi ≤ k for all i = 1, 2, 3. This observation settles the general case.
Proof. Assume the n objects are enumerated as 1, 2, ..., n. Then any collection of size k from n, with replacement and disregarding the order, is uniquely represented by an n-tuple of integers (x1, ..., xn) where xi is the number of times i was used in the collection
of size k. This correspondence implies that the count we are interested in equals the
number of integer solutions to the equation
(2.4) x1 + x2 + ... + xn = k, where the xi are non-negative integers.
Every solution of (2.4) can be put in one-to-one correspondence with a sequence of length k + n − 1 composed of | and + symbols: write x1 symbols |, then +, then x2 symbols |, then +, and so on (a group is empty when xi = 0). Thus, for example, the corresponding sequences for (2.3), with the full sequence of reductions we did, become
11 → (2, 0, 0) → ||++
12 → (1, 1, 0) → |+|+
13 → (1, 0, 1) → |++|
22 → (0, 2, 0) → +||+
23 → (0, 1, 1) → +|+|
33 → (0, 0, 2) → ++||
(here x1 + x2 + x3 = 2 with xi ∈ {0, 1, 2} for all i = 1, 2, 3)
We see that the number of solutions to (2.4) equals the number of sequences of length k + n − 1 made of | and + symbols, where the number of | symbols equals k and the number of + symbols is n − 1. In view of section 2.3.4 the number of such sequences equals
$$\binom{n+k-1}{k},$$
which completes the proof.
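The correspondence can be checked by brute force; the following sketch counts the multisets directly and compares the count with the stars-and-bars formula.

```python
from itertools import combinations_with_replacement
from math import comb

def multisets_count(n, k):
    """Number of ways to sample k items from n without order, with replacement."""
    return sum(1 for _ in combinations_with_replacement(range(n), k))

# Agrees with the stars-and-bars formula C(n + k - 1, k).
for n in range(1, 8):
    for k in range(0, 6):
        assert multisets_count(n, k) == comb(n + k - 1, k)

# The small case from the text: n = 3, k = 2 gives 6 options.
assert multisets_count(3, 2) == 6
```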
Exercise 2.2.2. We have n keys at our disposal and one and only one of them is the right key for the given lock. At each step we randomly pick a key and try to unlock with it. If we fail, we put the key back and select another one at random. What type of sampling is this, and in how many ways can we unlock the door for the first time at step k?
This is sampling with replacement. There are n^k total ways to reach step k, out of which (n − 1)^(k−1) are successful: the first k − 1 picks are among the n − 1 wrong keys and the k-th pick is the right one.
2.4. Binomial theorem.
Theorem 2.3 (Binomial theorem). For any x, y ∈ R and integer n ≥ 0,
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
Proof. There are many proofs of this result, but here we will follow the combinatorial
approach. We will prove by induction on n that
$$(2.5) \qquad (x+y)^n = \sum a_1 a_2 \cdots a_n,$$
where the sum is over all ordered n-tuples of the form (a1, a2, ..., an) where each ai is either x or y, i.e. over all strings of length n formed of {x, y}. The cases n = 0, 1 are trivial.
The claim is obviously true for n = 2 since
(x + y)2 = xx + xy + yx + yy.
Assume that (2.5) is true for n − 1. To prove it for n, notice that each string of length n composed of {x, y} is constructed from a string of length n − 1 by adding either x or y at the beginning. Hence writing (x + y)^n = x(x + y)^(n−1) + y(x + y)^(n−1) and applying the inductive hypothesis to (x + y)^(n−1) completes the proof of (2.5). Now each summand of (2.5) can be rearranged into the form x^k y^(n−k) for some 0 ≤ k ≤ n. But for given 0 ≤ k ≤ n there are exactly $\binom{n}{k}$ summands of the form x^k y^(n−k), since they are formed by choosing the k indices out of 1, 2, ..., n for which ai = x in (2.5), the rest of the ai-s being equal to y. Thus grouping the terms of the form x^k y^(n−k) in (2.5), of which there are $\binom{n}{k}$, completes the proof of the theorem.
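The counting step of the proof can be illustrated by expanding the product symbolically; the sketch below generates all 2^n strings over {x, y} and checks that exactly C(n, k) of them contain k copies of x.

```python
from itertools import product
from math import comb

n = 6
# All ordered n-tuples with entries in {x, y}, i.e. the summands in (2.5).
strings = list(product("xy", repeat=n))
assert len(strings) == 2 ** n

# Exactly C(n, k) strings contain k letters x, matching the binomial theorem.
for k in range(n + 1):
    count = sum(1 for s in strings if s.count("x") == k)
    assert count == comb(n, k)
```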
2.5. Multinomial coefficient. Assume we are sampling from {1, 2, ..., k} with replacement n times. In how many ways can we do this if the number i is sampled ni times, where i = 1, ..., k and n1 + ... + nk = n? To answer this question we will use the multiplication principle. The result of this experiment is an n-tuple of the form (a1, ..., an) where ai ∈ {1, ..., k}, and there are $\binom{n}{n_1}$ ways to fix the places of 1. When all 1-s are placed there are $\binom{n-n_1}{n_2}$ ways to fix the places of 2, and so on until $\binom{n-n_1-\dots-n_{k-1}}{n_k}$ ways to place k. Thus, due to the multiplication principle, the total number of outcomes of such sampling equals
$$\binom{n}{n_1} \binom{n-n_1}{n_2} \cdots \binom{n-n_1-\dots-n_{k-1}}{n_k}.$$
Using the value of the binomial coefficient $\binom{n}{k} = \frac{n!}{k!(n-k)!}$, the last expression simplifies to
$$(2.6) \qquad \frac{n!}{n_1!\, n_2! \cdots n_k!} =: \binom{n}{n_1\ n_2\ \dots\ n_k}.$$
The number in (2.6) is called a multinomial coefficient.
Example 2.3.1. In how many ways can we arrange 5 red balls, 3 blue balls and 4 green balls in a line? For this problem n = 5 + 3 + 4 = 12 and we are asking for
$$\binom{12}{5\ 3\ 4} = \frac{12!}{5!\, 3!\, 4!} = 27720.$$
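A brief sketch confirming the computation, together with a brute-force cross-check on a smaller arrangement problem (2 red, 1 blue, 1 green balls):

```python
from itertools import permutations
from math import factorial

def multinomial(*parts):
    """n! / (n1! n2! ... nk!) for n = n1 + ... + nk."""
    n = sum(parts)
    result = factorial(n)
    for p in parts:
        result //= factorial(p)
    return result

# Example 2.3.1: arrangements of 5 red, 3 blue and 4 green balls.
assert multinomial(5, 3, 4) == 27720

# Brute-force cross-check on a smaller case: 2 red, 1 blue, 1 green.
arrangements = set(permutations("RRBG"))
assert len(arrangements) == multinomial(2, 1, 1) == 12
```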
An argument similar to the one used to prove Theorem 2.3 leads to the following.
Theorem 2.4 (Multinomial theorem). For any x1, ..., xk ∈ R and integer n ≥ 0,
$$(x_1 + x_2 + \dots + x_k)^n = \sum \binom{n}{n_1\ n_2\ \dots\ n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k},$$
where the sum is over all integers (n1, ..., nk) with 0 ≤ ni ≤ n and n1 + ... + nk = n.
Exercise 2.4.1. How many terms does the sum in Theorem 2.4 have?
Exercise 2.4.2. Adapt the proof of Theorem 2.3 to prove Theorem 2.4.
3. Basic properties of probability
3.1. Continuity of probability measure.
Theorem 3.1. Let A1, A2, ... ∈ F be an increasing sequence of events, i.e. A1 ⊂ A2 ⊂ ... . Then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n\to\infty} P(A_n).$$
Similarly, if A1 ⊃ A2 ⊃ ... is a decreasing sequence of events, then
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n\to\infty} P(A_n).$$
Proof. We consider the case of an increasing sequence first. Define a new sequence Bn as follows:
$$B_1 = A_1, \qquad B_n = A_n \setminus \left(\bigcup_{i=1}^{n-1} A_i\right) = A_n \cap A_{n-1}^c \quad \text{for } n \ge 2.$$
Then the events Bn are disjoint and
$$\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i \quad \text{and} \quad \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i \quad \text{for all } n \ge 1.$$
We thus get
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) \quad \text{(by countable additivity)}$$
$$= \lim_{n\to\infty} \sum_{i=1}^{n} P(B_i) = \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} B_i\right) \quad \text{(by finite additivity)}$$
$$= \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} A_i\right) = \lim_{n\to\infty} P(A_n) \quad \text{(since the } A_i \text{ are increasing)}.$$
To prove the theorem for decreasing events {Ai} notice that {A_i^c} is increasing, hence from the already proved case we have
$$P\left(\bigcup_{i=1}^{\infty} A_i^c\right) = \lim_{i\to\infty} P(A_i^c) = 1 - \lim_{i\to\infty} P(A_i).$$
On the other hand, by De Morgan's law $\bigcup_{i=1}^{\infty} A_i^c = \left(\bigcap_{i=1}^{\infty} A_i\right)^c$, hence
$$P\left(\bigcup_{i=1}^{\infty} A_i^c\right) = 1 - P\left(\bigcap_{i=1}^{\infty} A_i\right),$$
which combined with the previous equality settles the case of decreasing sequences.
→ Problem 3.1.1. Prove that axiom 3 of Definition 1.2, namely the countable additivity of the probability measure, can be replaced by finite additivity and continuity in the sense of Theorem 3.1. More precisely, prove that axioms 1 and 2 of Definition 1.2 together with finite additivity and the following axiom: for any A1, A2, ... ∈ F with A1 ⊂ A2 ⊂ ... one has
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i),$$
imply countable additivity.
3.2. Boole's inequality. For any events A1, A2, ... ∈ F we have
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} P(A_i).$$
Recall that for disjoint events we have equality in the above inequality, thanks to the σ-additivity of the probability measure.
Proof. Set B1 = A1 and $B_i = A_i \setminus \left(\bigcup_{k=1}^{i-1} A_k\right)$. Clearly Bi ⊂ Ai and hence P(Bi) ≤ P(Ai) for all i = 1, 2, ..., in view of the monotonicity of the probability measure. Notice also that the {Bi} are disjoint and ∪i Ai = ∪i Bi, hence
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) \le \sum_{i=1}^{\infty} P(A_i).$$
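As a quick numeric illustration (a sketch with an arbitrary finite sample space of our own choosing), the union bound can be checked against the equiprobable measure:

```python
import random

# Equiprobable measure on a finite sample space Omega = {0, ..., 99}.
omega = set(range(100))

def P(A):
    return len(A & omega) / len(omega)

# Random events: Boole's inequality P(union) <= sum P(A_i) always holds.
random.seed(1)
events = [set(random.sample(sorted(omega), 30)) for _ in range(5)]
union = set().union(*events)
assert P(union) <= sum(P(A) for A in events)
```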
3.3. Graph coloring and Ramsey numbers. In this section we discuss an application of Boole's inequality in graph theory. Consider a complete graph (Footnote 5) on n vertices. A theorem due to F. Ramsey states that for any positive integers r and s there is an integer N such that any red-blue coloring of the edges (Footnote 6) of a complete graph on N vertices contains a subgraph on r vertices with all edges blue, or a subgraph on s vertices with all edges red. The smallest such N is denoted by R(r, s) and is called a Ramsey number.
Figure 3. A particular red-blue coloring (one of the $2^{\binom{5}{2}} = 1024$ possible variants) of a complete graph on 5 vertices. Notice that we are coloring the edges of the graph and not the vertices. Observe as well that there is a single monochromatic subgraph on 3 vertices, with all edges red.
The exact value of R(r, s) is not known for general r, s except for some special cases.
A particularly striking example of applications of probability to combinatorics is a lower
bound on R(s, s) due to P. Erdős (1947). The technique that we will use here is called
a probabilistic method and has far reaching applications beyond the graph coloring
problem (see [1] for more).
Proposition 3.3. If $\binom{n}{k} \cdot 2^{1-\binom{k}{2}} < 1$ then R(k, k) > n. In particular, $R(k, k) > \lfloor 2^{k/2} \rfloor$ for all k ≥ 3.
Proof. Consider a complete graph G on n vertices. By definition it has $\binom{n}{2} = \frac{n(n-1)}{2} =: N$ edges, which we denote by e1, ..., eN. We now define the experiment of randomly coloring an edge blue (+1) or red (−1) and describe the probability space corresponding to this experiment. The sample space of all possible colorings of the graph becomes
Ω = {c = (c1, ..., cN) : ci ∈ {−1, 1}},
and the event space is F = 2^Ω, the power set of Ω. We assume that coloring a single edge either red or blue is equally likely, and for any E ∈ F define the probability measure
Footnote 5: A graph is called complete if there is an edge between any pair of vertices.
Footnote 6: Each edge of the graph is colored either red or blue without any restrictions.
as
$$(3.1) \qquad P(E) := \frac{|E|}{2^N},$$
where |E| is the number of elements in E. As discussed in Example 1.2.4 the triplet
(Ω, F, P) is indeed a probability space.
k
Take any subgraph on k verticies. Since G is a complete graph, the subgraph has
2 =: K edges. Without loss of generality assume edges of this subgraph are e1 , ..., eK .
Then, the event that this particular subgraph is monochromatic (i.e. all its edges have
the same color) is
E := {( c, ..., c , cK+1 , ..., cN ) : ci ∈ {−1, 1} for K < i ≤ N and c ∈ {−1, 1}}.
| {z }
K times
Now enumerate all $\binom{n}{k}$ subgraphs of G on k vertices and let Ai be the event that the i-th such subgraph is monochromatic. By the above, P(Ai) = 2^(1−K), so Boole's inequality gives
$$P\left(\bigcup_i A_i\right) \le \binom{n}{k}\, 2^{1-\binom{k}{2}} < 1.$$
Thus with positive probability none of the events Ai occurs, hence there is a red-blue coloring of G without a monochromatic subgraph on k vertices. From this, in view of the definition of the Ramsey number, we get R(k, k) > n.
Finally, setting n = ⌊2^(k/2)⌋ we obtain
$$\binom{n}{k}\, 2^{1-\binom{k}{2}} = \frac{(n-k+1)\cdots n}{k!}\, 2^{1-\binom{k}{2}} \le \frac{n^k}{k!}\, 2^{1-\binom{k}{2}} \le 2\, \frac{2^{k^2/2}}{k!} \cdot \frac{2^{k/2}}{2^{k^2/2}} = \frac{2^{k/2+1}}{k!} < 1 \quad \text{for } k \ge 3,$$
where the last inequality is the subject of Exercise 3.3.1 below.
Exercise 3.3.1. Show that $2^{k/2+1} < k!$ for k ≥ 3.
Exercise 3.3.2. Prove that among any 6 people there are 3 who all know each other or 3 no two of whom know each other (Footnote 7).
Construct an example of a group of 5 people such that there is no subgroup of 3 people who all know each other and no subgroup of 3 people no two of whom know each other. This, coupled with the first claim, shows that R(3, 3) = 6.
7 In terms of the Ramsey numbers we are trying to show that $R(3, 3) \leq 6$. Notice also that Proposition 3.3 cannot be applied here, as for $n = 6$ and $k = 3$ the condition $\binom{n}{k} \, 2^{1-\binom{k}{2}} < 1$ is not satisfied.
18 HAYK ALEKSANYAN
Hint: Notice that $2^i$ is the total number of binary strings of length $i$. Then interpret what it means that a shorter string is not a prefix of a longer string.
Note 1: This is called Kraft's inequality for prefix-free codes.
Note 2: There is an interesting realization of the above inequality on prefix codes generated by binary trees. Consider a binary tree with a finite number of nodes. Then prove that
$$\sum_{v \text{ is a leaf}} 2^{-\mathrm{depth}(v)} \leq 1,$$
where $\mathrm{depth}(v)$ is the length of the path from the root of the tree to the leaf $v$. This claim is related to the main inequality of this problem as follows: each path from the root to a leaf can be identified with a binary string of $\{0, 1\}$-s, where we put 0 if we move to the left child of a node and 1 if we go to the right. Such a correspondence between binary strings and the leaves sets the stage for applying the inequality of the problem. Try to fill in the details.
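The inequality is easy to observe numerically. The following sketch (the example code is illustrative, not from the notes) checks that a code is prefix-free and evaluates the Kraft sum for it:

```python
# Kraft's inequality for a prefix-free code: if no codeword is a prefix of
# another, then the sum of 2^(-length) over the codewords is at most 1.

def is_prefix_free(codewords):
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

code = ["0", "10", "110", "111"]  # the leaves of a small binary tree
assert is_prefix_free(code)
kraft_sum = sum(2.0 ** (-len(w)) for w in code)
print(kraft_sum)  # 1/2 + 1/4 + 1/8 + 1/8 = 1.0
```

Here the code corresponds exactly to a binary tree whose leaves sit at depths 1, 2, 3, 3, so the sum attains the extreme value 1.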
(4.1) $\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \mathbb{P}(A_i) - \sum_{1 \leq i < j \leq n} \mathbb{P}(A_i \cap A_j) + \sum_{1 \leq i < j < k \leq n} \mathbb{P}(A_i \cap A_j \cap A_k) - \ldots$
A few remarks are in order. When Ω is a finite set, F = 2Ω and P is the equiprobable
measure (see Exercise 1.2.4), then the inclusion-exclusion principle becomes a counting
formula that computes the number of elements in the union of the $A_i$-s. Also note that the terms on the right-hand side of (4.1) are the missing components that turn Boole's inequality into an equality.
A more concise way of writing (4.1), emphasizing the fact that we are summing over all possible intersections of the events $A_i$, is the following:
(4.2) $\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{\substack{I \subset \{1, 2, \ldots, n\} \\ I \neq \emptyset}} (-1)^{|I|-1} \, \mathbb{P}\left(\bigcap_{j \in I} A_j\right).$
Proof. The proof is by induction on $n$. The formula for $n = 2$ was proved in (1.3). Using the case of $n = 2$ for the union $\bigcup_{i=1}^{n-1} A_i$ and $A_n$ we get
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) = \mathbb{P}\left(A_n \cup \left(\bigcup_{i=1}^{n-1} A_i\right)\right) = \mathbb{P}(A_n) + \mathbb{P}\left(\bigcup_{i=1}^{n-1} A_i\right) - \mathbb{P}\left(\bigcup_{i=1}^{n-1} (A_i \cap A_n)\right) \quad \text{(using induction)}$$
$$= \mathbb{P}(A_n) + \sum_{\substack{I \subset \{1, \ldots, n-1\} \\ I \neq \emptyset}} (-1)^{|I|-1} \, \mathbb{P}\left(\bigcap_{j \in I} A_j\right) + \sum_{\substack{I \subset \{1, \ldots, n-1\}, \ I \neq \emptyset \\ \tilde{I} = I \cup \{n\}}} (-1)^{|\tilde{I}|-1} \, \mathbb{P}\left(\bigcap_{j \in \tilde{I}} A_j\right).$$
The first sum in the above equality is the part of (4.1) without $A_n$, and the second one together with $\mathbb{P}(A_n)$ covers all terms with $A_n$. This concludes the inductive step and completes the proof.
since we fix $|I|$ positions and permute the remaining $n - |I|$ arbitrarily. Notice also that there are $\binom{n}{|I|}$ ways to choose the $|I|$ positions that will be fixed. Hence, by (4.1),
(4.3) $\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) = 1 - \binom{n}{2} \frac{(n-2)!}{n!} + \binom{n}{3} \frac{(n-3)!}{n!} + \ldots + (-1)^{n-1} \frac{1}{n!} = 1 - \frac{1}{2!} + \frac{1}{3!} + \ldots + (-1)^{n-1} \frac{1}{n!} \approx 1 - e^{-1},$
8 A permutation $\sigma : \{1, 2, \ldots, n\} \to \{1, 2, \ldots, n\}$ is called a derangement if $\sigma(i) \neq i$ for all $1 \leq i \leq n$. Notice that we may assume that the fixed deck of cards corresponds to the identity permutation. Indeed, in any enumeration of all $n!$ permutations, if $\sigma_1$ is the permutation corresponding to the fixed deck, the complement of the event we are interested in is the set of all permutations $\sigma$ such that $\sigma_1(i) \neq \sigma(i)$ for all $i = 1, 2, \ldots, n$. Applying the inverse of $\sigma_1$ on both sides reduces the problem to the case when the fixed deck corresponds to the identity.
where the last approximation is due to Taylor’s expansion of e−x . We conclude that the
probability of seeing the same card at some point from the two decks is approximately
1 − e−1 .
Exercise 4.1.1. Assume each of the two decks is shuffled randomly (instead of one being fixed) and then one by one we pick a card from each deck until they are empty. Model the probability space corresponding to this experiment and show that the probability of drawing the same card from both decks at some point is the same as in (4.3).
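For small $n$ the matching probability in (4.3) can be confirmed by brute force over all permutations. A sketch (not part of the notes):

```python
from itertools import permutations
from math import exp, factorial

# Probability that a random permutation of 1..n has at least one fixed point,
# compared against the inclusion-exclusion series 1 - 1/2! + 1/3! - ... of (4.3).
def match_probability(n: int) -> float:
    hits = sum(any(sigma[i] == i for i in range(n)) for sigma in permutations(range(n)))
    return hits / factorial(n)

def series(n: int) -> float:
    return sum((-1) ** (k - 1) / factorial(k) for k in range(1, n + 1))

for n in range(2, 8):
    assert abs(match_probability(n) - series(n)) < 1e-12
print(match_probability(7), 1 - exp(-1))  # already very close to 1 - 1/e
```

Even at $n = 7$ the truncated series is within about $10^{-5}$ of the limit $1 - e^{-1}$, illustrating how fast the Taylor approximation kicks in.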
→ Problem 4.1.1. Let A = {1, 2, ..., k} and B = {1, 2, ..., n} where k, n ∈ N. Use the
inclusion-exclusion principle to compute the number of surjective (onto) functions from
A to B.
Hint: notice that the number of all functions from $A$ to $B$ is $n^k$ (sampling with replacement and with regard to order). It is easier to compute the number of functions that are not surjective. To that end, for each $i \in B$ define $A_i$ as the set of all functions $f : A \to B$ that do not contain $i$ in their image, i.e. $f(x) \neq i$ for any $x \in A$. Then use the inclusion-exclusion principle with the $A_i$-s.
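The count that the hint leads to, $\sum_{i=0}^{n} (-1)^i \binom{n}{i} (n-i)^k$, can be verified against direct enumeration on small cases. A sketch (illustrative, assuming the reader has already solved the problem):

```python
from math import comb
from itertools import product

# Number of surjections A -> B with |A| = k, |B| = n, via inclusion-exclusion
# over the events A_i = "i is missing from the image of f".
def surjections(k: int, n: int) -> int:
    return sum((-1) ** i * comb(n, i) * (n - i) ** k for i in range(n + 1))

# Brute-force verification: enumerate all n^k functions on small cases.
def surjections_brute(k: int, n: int) -> int:
    return sum(set(f) == set(range(n)) for f in product(range(n), repeat=k))

for k in range(1, 6):
    for n in range(1, 5):
        assert surjections(k, n) == surjections_brute(k, n)
print(surjections(3, 2))  # 2^3 - 2 = 6
```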
→ Problem 4.1.2. For $n \in \mathbb{N}$ let $\varphi(n)$ be Euler's totient function, i.e. $\varphi(n)$ is the number of integers $1 \leq k \leq n$ that are coprime with $n$. For example, $\varphi(4) = 2$, $\varphi(5) = 4$, $\varphi(11) = 10$. Using the inclusion-exclusion principle prove that
$$\varphi(n) = n \prod_{\substack{p \text{ is prime} \\ p \mid n}} \left(1 - \frac{1}{p}\right).$$
Hint: let $p_1, \ldots, p_k$ be the distinct prime divisors of $n$. Then an integer $1 \leq m < n$ is coprime with $n$ if and only if its prime factorization does not contain any of the $p_i$-s. Now define $A_i = \{1 \leq m \leq n : p_i \mid m\}$ where $i = 1, 2, \ldots, k$ and use the inclusion-exclusion principle.
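The product formula is easy to check against the definition of $\varphi$ as a coprimality count. A sketch (trial-division factorization is an illustrative choice):

```python
from math import gcd, prod

def prime_divisors(n: int):
    # distinct prime divisors of n by trial division
    p, ps = 2, []
    while p * p <= n:
        if n % p == 0:
            ps.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        ps.append(n)
    return ps

def phi_formula(n: int) -> int:
    # n * prod(1 - 1/p) computed in integers: n * prod(p - 1) / prod(p)
    ps = prime_divisors(n)
    return n * prod(p - 1 for p in ps) // prod(ps)

def phi_count(n: int) -> int:
    # the definition: number of 1 <= k <= n coprime with n
    return sum(gcd(k, n) == 1 for k in range(1, n + 1))

assert all(phi_formula(n) == phi_count(n) for n in range(1, 200))
print(phi_formula(4), phi_formula(5), phi_formula(11))  # 2 4 10
```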
A few clarifying remarks are in order. Notice that $S_k$ is simply the $k$-th term in the inclusion-exclusion formula disregarding the sign. In particular, the summation in $S_k$ is over all $k$-element subsets of $\{1, 2, \ldots, n\}$ and hence there are $\binom{n}{k}$ terms in the sum. Observe also that the case of $k = 1$ in (4.4) corresponds to Boole's inequality for a finite number of terms.
$$\mathbb{P}(A_n) + \sum_{\substack{J \subset \{1, \ldots, n-1\} \\ 1 \leq |J| \leq k}} (-1)^{|J|-1} \, \mathbb{P}\left(\bigcap_{j \in J} A_j\right) + \sum_{\substack{J \subset \{1, \ldots, n-1\}, \ 1 \leq |J| \leq k-1 \\ \tilde{J} = J \cup \{n\}}} (-1)^{|\tilde{J}|-1} \, \mathbb{P}\left(\bigcap_{j \in \tilde{J}} A_j\right).$$
The first sum in the above inequality covers all terms of (4.4) which do not contain $A_n$, and the second sum coupled with $\mathbb{P}(A_n)$ covers all terms with $A_n$. This completes the induction step.
The proof of (4.5) is similar and we leave it as an exercise.
5.1. Independence.
Definition 5.1. (Independent events) Given a probability space (Ω, F, P) and two
events A, B ∈ F, call A and B independent if P(A ∩ B) = P(A)P(B). Otherwise
A and B are called dependent.
Property 5.1. If A and B are independent, then so are Ac and B, A and B c , and Ac
and B c .
Proof. We start with the first pair, namely Ac and B. Representing B as a union of
two disjoint events as
B = (Ac ∩ B) ∪ (A ∩ B),
Exercise 5.1.1. Prove that if A is independent of itself, then P(A) ∈ {0, 1}.
Example 5.1.1. Two fair dice are thrown. Let $A_1$ be the event that the first one shows an odd number, and let $A_2$ be the event that the second one shows an even number. Clearly the sample space of this experiment is the following:
$$\Omega = \{(i, j) : i, j \in \{1, 2, 3, 4, 5, 6\}\}.$$
Then $\mathcal{F} = 2^\Omega$ and $\mathbb{P}((i, j)) = 1/36$ for each $1 \leq i, j \leq 6$. From here we have $\mathbb{P}(A_1) = \mathbb{P}(A_2) = 1/2$ and $\mathbb{P}(A_1 \cap A_2) = \frac{3 \cdot 3}{36} = \frac{1}{4}$. It follows that $A_1$ and $A_2$ are independent.
Let $A_3$ be the event that the sum of the first and the second throws is 4. Since the only possibilities are $\{(1, 3), (2, 2), (3, 1)\}$, we have $\mathbb{P}(A_3) = \frac{3}{36} = \frac{1}{12}$. We also have $\mathbb{P}(A_1 \cap A_3) = \frac{2}{36} = \frac{1}{18} \neq \mathbb{P}(A_1)\mathbb{P}(A_3)$, hence $A_1$ and $A_3$ are not independent.
The family {Ai : i ∈ I} is called pairwise independent if (5.1) holds for any J ⊂ I
with |J| = 2.
Figure 4. A schematic view of the extension of $A \subset \Omega_1$ into $\Omega_1 \times \Omega_2$ as a cylindrical set with base $A$.
9 The requirement that the $\Omega_i$ be at most countable makes it easier to define a probability measure on their cross product. Later on we will study the concept of product measure in detail, which will allow us to eliminate the restriction that the $\Omega_i$ be discrete.
10 The event $A_i$ of the single experiment is put in correspondence with the cylindrical set $\tilde{A}_i$. Notice that the projection $\pi_i : \Omega \to \Omega_i$ defined as $\pi_i(\omega_1, \ldots, \omega_n) = \omega_i$ maps $\tilde{A}_i$ to $A_i$.
where $i = 1, 2, \ldots, n$. In view of the definition of $\mathbb{P}$ we have $\mathbb{P}(\tilde{A}_i) = \mathbb{P}_i(A_i)$. The same argument used for the case $n = 2$ shows that (cylindrical) events of the form $\tilde{A}_1, \ldots, \tilde{A}_n$ are independent in the sense of Definition 5.2.
We conclude that a finite number of different experiments can be joined together, with the construction described above, into a single probability space such that the joint experiment is represented as a series of independent experiments.
Exercise 5.1.2. Show that the cross product F1 × F2 of two event spaces is
not an event space in general.
Exercise 5.1.3. Fill in the necessary details for the construction for n ≥ 2
experiments.
Exercise 5.1.4. Get back to Example 5.1.1 and recover the probability space
discussed there using the construction with cross products described above. Individual
experiments will be two (independent) throws of a dice.
Exercise 5.1.5. Convince yourself that the probability space defined in Proposi-
tion 3.3 can be obtained by cross product as above from independent colorings of the edges
of the graph. Namely, if we enumerate the edges as e1 , ..., eN and for each 1 ≤ i ≤ N let
Ωi = {−1, 1}, Fi = 2Ωi and Pi (−1) = Pi (1) = 1/2 then the cross product of (Ωi , Fi , Pi )
coincides with the probability space described in the proposition.
→ Problem 5.1.1. Consider the matrix multiplication operation over the field Z2 =
{0, 1}. Namely, let A, B, C all be n × n matrices with elements from Z2 and consider
the problem of verifying whether $AB = C$. Prove that if $AB \neq C$, then
$$\mathbb{P}\left(\vec{b} \in \{0, 1\}^n : (AB)\vec{b} = C\vec{b}\right) \leq \frac{1}{2}.$$
Here the equality is understood in $\mathbb{Z}_2$, i.e. modulo 2, and the probability measure on $\{0, 1\}^n$ is the equiprobable measure. Once you prove the above inequality, investigate how one can verify with high confidence whether $AB = C$ by doing independent tests, each choosing a vector $\vec{b} \in \{0, 1\}^n$ uniformly at random and checking whether $A(B\vec{b}) = C\vec{b}$. Compare the computational complexity of a series of such tests with direct computation of the matrix product and element-wise comparison.
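The inequality can be observed directly by exhausting all $2^n$ vectors for a small $n$. The following sketch (matrix sizes and the corrupted entry are illustrative choices) counts the fraction of vectors $\vec{b}$ that fail to expose $AB \neq C$:

```python
import random
from itertools import product

# Matrix-vector and matrix-matrix products over Z2 (all arithmetic mod 2).
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) % 2 for i in range(len(M))]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % 2 for j in range(n)] for i in range(n)]

random.seed(0)
n = 4
A = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
B = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
C = [row[:] for row in matmul(A, B)]
C[0][0] ^= 1  # corrupt one entry so that AB != C

# Enumerate all b in {0,1}^n and count those with A(Bb) = Cb.
good = sum(matvec(A, matvec(B, list(b))) == matvec(C, list(b))
           for b in product((0, 1), repeat=n))
print(good / 2 ** n)  # at most 1/2, as the problem claims
```

The set of "bad" vectors is the kernel of the non-zero matrix $AB - C$ over $\mathbb{Z}_2$, a subspace of dimension at most $n - 1$, which is where the bound $1/2$ comes from.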
Definition 5.3. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $A, B \in \mathcal{F}$ with $\mathbb{P}(B) > 0$. The conditional probability of $A$ given $B$ is denoted by $\mathbb{P}(A \mid B)$ and is defined as
(5.2) $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$
Notice that the proportionality constant P(B) is chosen so that P(B | B) = 1, i.e.
the probability of B given that B occurred is 1.
Theorem 5.2. (Conditional probability induces a new measure) If B ∈ F and P(B) > 0
then (Ω, F, PB ) is a probability space where PB : F → R is defined as PB (A) = P(A | B).
Proof. We only need to check that PB is a probability measure for which we need to
verify that PB satisfies all requirements of Definition 1.2.
We clearly have PB (A) ≥ 0 for any A ∈ F, and the fact that PB (A) ≤ 1 follows
from the monotonicity of the probability measure (see (1.2)). The conditions PB (∅) = 0
and PB (Ω) = 1 simply follow from (5.2). The only thing that remains to check is the
σ-additivity of PB . To that end fix a sequence of disjoint events A1 , A2 , ... ∈ F. Then
$$\mathbb{P}_B\left(\bigcup_{i=1}^{\infty} A_i\right) = \frac{1}{\mathbb{P}(B)} \, \mathbb{P}\left(\bigcup_{i=1}^{\infty} (A_i \cap B)\right) = \frac{1}{\mathbb{P}(B)} \sum_{i=1}^{\infty} \mathbb{P}(A_i \cap B) = \sum_{i=1}^{\infty} \mathbb{P}(A_i \mid B) = \sum_{i=1}^{\infty} \mathbb{P}_B(A_i),$$
where we used the fact that the events $\{A_i \cap B\}_{i=1}^{\infty}$ are disjoint and that $\mathbb{P}$ is $\sigma$-additive.
Hence PB is a probability measure and the proof is complete.
5.3. Law of total probability. Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ we say that the collection $\{B_i : i \in I\} \subset \mathcal{F}$ forms a partition of $\Omega$ if $B_i \cap B_j = \emptyset$ for $i \neq j$ (the events are disjoint) and $\bigcup_{i \in I} B_i = \Omega$.
Let us see this result in action. Assume a student enters an exam which has a single question from one of the subjects: analysis, probability, geometry. The student knows $\frac{1}{2}$ of the analysis questions, $\frac{2}{3}$ of the probability questions and $\frac{4}{5}$ of the geometry questions. If all subjects are equally probable to be on the question list and, once the subject is selected, all its
questions are equally probable, what is the probability that the student will pass the
exam?
Let $B_1$ be the event that the question is from analysis, $B_2$ from probability, and $B_3$ from geometry. This defines a partition of the probability space. Let also $A$ be the event that the student passes the exam. Since all subjects have equal probability to be on the exam, we have $\mathbb{P}(B_i) = \frac{1}{3}$ for $i = 1, 2, 3$, and from the law of total probability we get
$$\mathbb{P}(A) = \mathbb{P}(A \mid B_1)\mathbb{P}(B_1) + \mathbb{P}(A \mid B_2)\mathbb{P}(B_2) + \mathbb{P}(A \mid B_3)\mathbb{P}(B_3) = \frac{1}{3}\left(\frac{1}{2} + \frac{2}{3} + \frac{4}{5}\right) = \frac{59}{90}.$$
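The arithmetic of the law of total probability can be reproduced exactly with rational numbers. A quick sketch:

```python
from fractions import Fraction

# Law of total probability for the exam example:
# P(B_i) = 1/3 for each subject; conditional pass probabilities as stated.
priors = [Fraction(1, 3)] * 3
pass_given = [Fraction(1, 2), Fraction(2, 3), Fraction(4, 5)]
p_pass = sum(p * q for p, q in zip(priors, pass_given))
print(p_pass)  # 59/90
```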
Exercise 5.3.1. What is the probability space in the above example with exam
tickets?
Clearly $\mathbb{P}(A_N \mid B_1) = 1$ and $\mathbb{P}(A_N \mid B_N) = 0$. Now observe that when the first person takes the $i$-th seat, with $1 < i < N$, everyone up to the $(i-1)$-th person will take their own seat, and it is the $i$-th person who will choose uniformly at random one of the $N - i + 1$ seats that remain free (the first person's seat and the seats $i + 1, \ldots, N$). In the light of this observation, if the event $B_i$ holds, then the $i$-th person acts as the first person for the same problem but with $N - i + 1$ seats. Hence,
$$a_N = \frac{1}{N} \left(1 + a_2 + \ldots + a_{N-1}\right).$$
Observe that $a_2 = \frac{1}{2}$. Using induction on $N$, it follows from the last equality that $a_N = \frac{1}{2}$ for all $N = 2, 3, \ldots$; in particular, the probability that the last person's seat will be empty is $1/2$ regardless of the number of seats11 $N$.
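The recursion itself can be unrolled exactly, which makes the constant answer easy to believe before doing the induction. A sketch:

```python
from fractions import Fraction

# Exact evaluation of a_N = (1 + a_2 + ... + a_{N-1}) / N from the
# theater-seat problem, confirming a_N = 1/2 for every N >= 2.
a = {2: Fraction(1, 2)}
for N in range(3, 50):
    a[N] = (1 + sum(a[i] for i in range(2, N))) / N

assert all(v == Fraction(1, 2) for v in a.values())
print(a[49])  # 1/2
```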
Exercise 5.3.2. What is the probability space in the above example with theater
seats?
Exercise 5.3.3. In the problem with theater seats assume everyone up to the $k$-th person takes their own seat, and it is the $k$-th person who chooses a seat uniformly at random, where $1 \leq k < N$. What is the probability that the last person will find their seat free?
→ Problem 5.3.1. Assume in the problem with theater seats two people, the first and
the second, mix their seats in the same way as in the original problem. What is the
probability that the last person will find their seat unoccupied? How about the first k
people mixing their seats?
5.4. Bayes rule. The result discussed in this section is of fundamental importance and
has far reaching applications beyond probability theory.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $A, B \in \mathcal{F}$ satisfy $\mathbb{P}(A), \mathbb{P}(B) > 0$. Then,
(5.4) $\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B)\mathbb{P}(B)}{\mathbb{P}(A)},$
which is called the Bayes rule. To see why (5.4) is true, observe that from the definition of the conditional probability (5.2) we have
$$\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \cdot \frac{\mathbb{P}(B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A \mid B)\mathbb{P}(B)}{\mathbb{P}(A)},$$
and hence (5.4) follows. We next prove a slightly more general form of this rule.
Theorem 5.4. (Bayes' theorem) Let $\{B_i : i \in I\}$ be a partition of $\Omega$ with $\mathbb{P}(B_i) > 0$ for all $i \in I$ and let $A \in \mathcal{F}$ be such that $\mathbb{P}(A) > 0$. Then
$$\mathbb{P}(B_j \mid A) = \frac{\mathbb{P}(A \mid B_j)\mathbb{P}(B_j)}{\sum_{i \in I} \mathbb{P}(A \mid B_i)\mathbb{P}(B_i)}.$$
11 Here is an alternative approach. Notice that people entering the theater have no identity. Hence, when the $i$-th person enters the hall and finds their place occupied, instead of taking a seat uniformly at random from the remaining seats, the $i$-th person forces the person seated at the $i$-th seat out, takes the $i$-th seat, and then the evicted person takes a seat uniformly at random. Since the identity of the people entering the hall is irrelevant, this step is equivalent to the process described in the formulation of the problem. With this modified rule of taking seats, everyone at steps $2, \ldots, N - 1$ will take their own seat. Hence the person mixing their seat will have two choices, either their own seat or the seat of the last person, both with probability $1/2$. This implies that the probability of the last person finding their seat free equals $1/2$.
28 HAYK ALEKSANYAN
[Tree diagram: the root splits into G (85%) and B (15%); G splits into WG (80%) and WB (20%); B splits into WG (20%) and WB (80%).]
Figure 6. It is helpful to describe this problem schematically as a tree. The first level represents the partition of the underlying probability space and the last level is the witness' testimony. We then look at the ways of arriving at WB (the witness claims the car is Blue) from the root of the tree. The value of each path from the root to WB is the product of the weights of its edges (see the denominator of (5.5)).
Solution to Example 5.4.1. Let G be the event that the cab involved in the accident is Green, and B the event that the cab is Blue. Let also WB be the event that the witness testifies that the cab was blue. We are interested in $\mathbb{P}(B \mid W_B)$. Since the events $\{G, B\}$ form a partition of the probability space underlying this question, by Theorem 5.4 we get
(5.5) $\mathbb{P}(B \mid W_B) = \frac{\mathbb{P}(W_B \mid B)\mathbb{P}(B)}{\mathbb{P}(W_B \mid B)\mathbb{P}(B) + \mathbb{P}(W_B \mid G)\mathbb{P}(G)} = \frac{\frac{80}{100} \cdot \frac{15}{100}}{\frac{80}{100} \cdot \frac{15}{100} + \frac{20}{100} \cdot \frac{85}{100}} \approx 0.413.$
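Carrying out (5.5) with exact arithmetic shows the posterior is the rational number $12/29$. A sketch:

```python
from fractions import Fraction

# Bayes rule for the cab example, with the priors and likelihoods
# from the tree: P(G) = 85/100, P(B) = 15/100,
# P(WB | B) = 80/100, P(WB | G) = 20/100.
pB, pG = Fraction(15, 100), Fraction(85, 100)
wb_given_B, wb_given_G = Fraction(80, 100), Fraction(20, 100)

posterior = wb_given_B * pB / (wb_given_B * pB + wb_given_G * pG)
print(posterior, float(posterior))  # 12/29, approximately 0.4138
```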
The following terminology is widely used in Bayesian theory (cf. Theorem 5.4):
• the events of the partition {Bi : i ∈ I} are called the hypotheses,
• the set A is called data,
• probabilities P(Bk ) are called prior probabilities,
• probabilities P(Bk | A) are called posterior probabilities,
• probabilities P(A | Bk ) are called likelihoods,
• the transition from priors to posteriors via Bayes theorem is called Bayesian
update.
Getting back to the example with cabs, notice that without the witness' testimony $\mathbb{P}(B) = \frac{15}{100} = 0.15$ (the prior probability). With the data (new information), we can adjust our estimate to $\mathbb{P}(B \mid W_B) \approx 0.413$ (the posterior probability).
Example 5.4.2. We have ten coins in our pocket: 9 of them are fair, and one of them is biased with $\mathbb{P}(H) = 0.3$ and $\mathbb{P}(T) = 0.7$. We randomly pick a coin from the pocket, flip it and it lands H (heads). What is the conditional probability that the selected coin is biased?
Let B be the event that the coin is biased. The prior probability of B is 0.1, since out of 10 coins exactly one is biased. Applying Bayes' rule we have
$$\mathbb{P}(B \mid H) = \frac{\mathbb{P}(H \mid B)\mathbb{P}(B)}{\mathbb{P}(H \mid B)\mathbb{P}(B) + \mathbb{P}(H \mid B^c)\mathbb{P}(B^c)} = \frac{0.3 \cdot 0.1}{0.3 \cdot 0.1 + 0.5 \cdot 0.9} = \frac{3}{48} = \frac{1}{16}.$$
Example 5.4.3. A certain disease occurs in about 0.05% of the population. A diag-
nostic test was developed for it, which correctly detects a person with the disease 99%
of the time. However, in 3% of the cases it will falsely diagnose a healthy patient as
having the disease (false positive cases).
A person is selected at random from the population, and the test indicates that this
person has the disease. What are the conditional probabilities that
(a) the person has the disease?
(b) the person does not have the disease?
Define $A = \{\text{test is positive}\}$ and $B = \{\text{patient has the disease}\}$. We are given
$$\mathbb{P}(B) = 0.0005, \quad \mathbb{P}(A \mid B) = 0.99, \quad \mathbb{P}(A \mid B^c) = 0.03.$$
We need to compute $\mathbb{P}(B \mid A)$ for part (a) and $\mathbb{P}(B^c \mid A)$ for part (b). From the Bayes formula we have
$$\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B)\mathbb{P}(B)}{\mathbb{P}(A \mid B)\mathbb{P}(B) + \mathbb{P}(A \mid B^c)\mathbb{P}(B^c)} = \frac{0.99 \cdot 0.0005}{0.99 \cdot 0.0005 + 0.03 \cdot 0.9995} \approx 0.016.$$
For part (b) we get P(B c | A) = 1 − P(B | A) ≈ 0.984.
Observe that while the test has high accuracy (99%), a positive result on a randomly selected individual only gives a 0.016 (1.6%) chance that the person actually has the disease, which is rather small. This phenomenon is called the base rate fallacy. It happens when
• there is a large population,
• only a small fraction of the population has the disease,
• there is a large scale testing of mixed population of healthy and sick individuals.
One should take into account that the characteristics of the sampled population
are a contributing factor, along with the accuracy of the test itself, in determining the
probability of a positive test result. Even for a very accurate test the number of false
positives can potentially be large.
Exercise 5.4.1. In Example 5.4.3 let $0 < p < 1$ be the fraction of the population that has the disease. Assume that the accuracy of the test is $a\%$ (i.e. the test gives a positive result on a sick individual in $a\%$ of the cases), and let $b\%$ be the false positive percentage of the test (i.e. in $b\%$ of the cases the test gives a positive result on a healthy individual).
A person, randomly selected from the population, gets a positive result on the test.
What is the probability that the person actually has the disease? Play with the parameters
p, a, b to see when the results of the test can be reliable.
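One way to play with the parameters is to package the Bayes computation as a function (the name `posterior` and the convention of passing $a, b$ as fractions rather than percentages are illustrative choices):

```python
# Posterior probability of having the disease given a positive test,
# for base rate p, sensitivity a and false-positive rate b (all fractions).
def posterior(p: float, a: float, b: float) -> float:
    return a * p / (a * p + b * (1 - p))

# Example 5.4.3: p = 0.0005, a = 0.99, b = 0.03.
print(posterior(0.0005, 0.99, 0.03))  # about 0.016
# A higher base rate makes the same test far more informative:
print(posterior(0.05, 0.99, 0.03))    # about 0.63
```

Varying `p` while keeping `a` and `b` fixed shows directly that the reliability of a positive result is driven by the base rate as much as by the accuracy of the test.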
5.4.1. Naïve Bayes. We conclude the section with a toy example of using the Bayesian approach for classification problems. For the sake of the exposition, in this subsection we do away with mathematical rigor and concentrate on describing the general approach in a somewhat hand-waving way. Everything here can be made rigorous and mathematically sound with a slight effort.
A typical setup of a classification problem asks to determine the class label of an object given a set of features. Here is an example.
Getting back to (5.6) and using the assumption on conditional independence, we get
$$\mathbb{P}(C_k \mid \vec{x}) = \frac{1}{\mathbb{P}(\vec{x})} \, \mathbb{P}(C_k) \prod_{i=1}^{n} \mathbb{P}_{C_k}(x_i).$$
We now define our classifier: given a data point with features $\vec{x}$, assign it to the class $C_k$, where $1 \leq k \leq K$, for which the probability $\mathbb{P}(C_k \mid \vec{x})$ is maximized. From the above relation we get13
$$\arg\max_{1 \leq k \leq K} \mathbb{P}(C_k \mid \vec{x}) = \arg\max_{1 \leq k \leq K} \left( \mathbb{P}(C_k) \prod_{i=1}^{n} \mathbb{P}_{C_k}(x_i) \right).$$
12 This is an extremely crude assumption, which is not true in most practical cases, and this is the reason the classifier is called naïve. Nevertheless, the idea is still useful and can lead to meaningful results. Getting back to our example with spam classification, we note that there are more efficient ways of classification based on Machine Learning principles.
13 Given a finite list of numbers $A = (a_1, a_2, \ldots, a_n)$, by $\arg\max A$ we denote an index $1 \leq k \leq n$ such that $\max A = a_k$. If the maximum in $A$ is not unique, then $\arg\max$ returns any of the indices of the existing maximums.
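The arg-max rule above fits in a few lines of code. The following toy sketch uses a made-up "spam" dataset and add-one smoothing for unseen words; both the dataset and the smoothing are illustrative assumptions, not taken from the notes.

```python
from collections import Counter, defaultdict

# A tiny hypothetical training set: (class label, list of word features).
train = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize"]),
    ("ham",  ["meeting", "tomorrow"]),
    ("ham",  ["lunch", "tomorrow", "now"]),
]
class_counts = Counter(label for label, _ in train)
word_counts = defaultdict(Counter)
for label, words in train:
    word_counts[label].update(words)
vocab = {w for _, ws in train for w in ws}

def score(label, words):
    # P(C_k) * prod_i P_{C_k}(x_i), with add-one smoothing for unseen words
    total = sum(word_counts[label].values())
    p = class_counts[label] / sum(class_counts.values())
    for w in words:
        p *= (word_counts[label][w] + 1) / (total + len(vocab))
    return p

def classify(words):
    # the arg-max rule: pick the class maximizing P(C_k) * prod P_{C_k}(x_i)
    return max(class_counts, key=lambda label: score(label, words))

print(classify(["win", "money"]))       # spam
print(classify(["lunch", "tomorrow"]))  # ham
```

Note that the normalizing factor $1/\mathbb{P}(\vec{x})$ is the same for every class, which is why `score` can drop it entirely.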
The sequence $\{p_k\}_{k=0}^{n}$ is called the binomial distribution and is denoted by $B(n, p)$.
A common example of Bernoulli trials is a series of independent experiments with a coin toss (see subsection 5.1.1 for the underlying probability space).
Notice that the sum of the sequence $p_k$ defined in (6.1) equals 1, since there can be only $0, 1, \ldots, n$ successes, and these outcomes provide a partition of the probability space underlying the trials. Another way to see why the sum equals 1 is through Theorem 2.3 (the Binomial theorem).
Example 6.0.1. (Selling extra tickets for a flight) An airline sells 100 tickets for a certain flight on an airplane that has only 98 seats. On average, 1% of passengers do not show up for their flight, and assume the passengers choose to show up or not for the flight independently of each other. What is the probability that everyone who appears for the departure will have a seat?
Call a passenger not appearing for the flight a success (for the airline). Now, the probability of success is $p = 0.01$ and hence the number $X$ of successes in $n = 100$ trials follows the binomial distribution $B(100, 0.01)$ (recall the independence condition in the problem). Thus everyone will have a seat if and only if there are at least 2 successes in $n = 100$ trials. The complement of this event is easier to compute, i.e. the probability of at most 1 success, which equals
$$\mathbb{P}(X = 0) + \mathbb{P}(X = 1) = \binom{100}{0} (1 - 0.01)^{100} + \binom{100}{1} \, 0.01 \, (1 - 0.01)^{99} \approx 0.7358,$$
hence the probability in question becomes $1 - 0.7358 = 0.2642$.
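As a cross-check of the binomial computation, here is a small sketch for the stated numbers (100 tickets, 98 seats, $p = 0.01$), written generically so the parameters can be varied:

```python
from math import comb

# Probability that everyone who shows up gets a seat: we need at least
# (tickets - seats) no-shows among the independently acting passengers.
def seat_everyone(tickets: int, seats: int, p: float) -> float:
    need = tickets - seats  # at least this many successes (no-shows)
    at_most = sum(comb(tickets, k) * p ** k * (1 - p) ** (tickets - k)
                  for k in range(need))
    return 1 - at_most

print(seat_everyone(100, 98, 0.01))  # about 0.264
```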
6.1.3. Hypergeometric distribution. Consider an urn with $n_1$ red balls and $n_2$ black balls. Suppose $n$ balls are drawn without replacement, where $n \leq n_1 + n_2$. The probability of drawing exactly $k$ red balls equals
$$p_k := \frac{\binom{n_1}{k} \binom{n_2}{n-k}}{\binom{n_1 + n_2}{n}}, \quad \text{where } \max(0, n - n_2) \leq k \leq \min(n, n_1),$$
7.1. Discrete random variable and its probability mass function. Often we deal with situations involving some function $X$ defined on the sample space $\Omega$ of a given probability space $(\Omega, \mathcal{F}, \mathbb{P})$. For example, consider the experiment of throwing a fair dice and assume we gamble on the outcome of this dice in such a way that we win 1 if the outcome is a prime number, lose 1 in case of 4 or 6, and stay even if 1 shows up. Then our profit is a function $X$ defined on $\Omega$, where
$$X(\omega) = \begin{cases} 1, & \text{if } \omega \in \{2, 3, 5\}, \\ -1, & \text{if } \omega \in \{4, 6\}, \\ 0, & \text{if } \omega = 1. \end{cases}$$
Definition 7.2. (Probability mass function) The probability mass function (or pmf )
of a discrete random variable X defined on (Ω, F, P) is the function pX : R → [0, 1]
defined as
pX (x) = P(X = x), x ∈ R.
Here $p_X(x)$ is simply the probability that $X$ takes the value $x$, and it is sometimes referred to as the discrete density function. We have
$$p_X(x) = 0 \quad \text{for } x \notin \mathrm{Im}X,$$
and, by the $\sigma$-additivity of $\mathbb{P}$,
$$\sum_{x \in \mathrm{Im}X} p_X(x) = \sum_{x \in \mathrm{Im}X} \mathbb{P}(X = x) = \mathbb{P}\left(\bigcup_{x \in \mathrm{Im}X} \{\omega \in \Omega : X(\omega) = x\}\right) = \mathbb{P}(\Omega) = 1.$$
The power of this theorem is that it allows us to forget about the probability space
in many practical scenarios where we have a discrete probability distribution and need
to consider a random variable that takes certain values following the given distribution.
The theorem ensures that such a random variable exists.
7.2. Functions of random variables. Consider a probability space (Ω, F, P) and a
discrete random variable X : Ω → R, let also g : R → R. It is an easy check following
Definition 7.1 that Y := g(X) defined as Y (ω) = g(X(ω)) with ω ∈ Ω, is also a discrete
random variable. Simple examples are $Y = aX + b$ for $g(x) = ax + b$, or $Y = aX^2$ for $g(x) = ax^2$.
The probability mass function of $Y = g(X)$ is obtained from the pmf of $X$ as follows:
$$p_Y(y) = \mathbb{P}(g(X) = y) = \mathbb{P}(X \in g^{-1}(y)) = \sum_{x \in g^{-1}(y)} \mathbb{P}(X = x),$$
since there are at most countably many non-zero contributions to this sum14. For example, when $Y = aX + b$ and $a \neq 0$, then
$$\mathbb{P}(Y = y) = \mathbb{P}(aX + b = y) = \mathbb{P}(X = a^{-1}(y - b)) \quad \text{for } y \in \mathbb{R}.$$
If $Y = X^2$, then
$$\mathbb{P}(Y = y) = \begin{cases} \mathbb{P}(X = \sqrt{y}) + \mathbb{P}(X = -\sqrt{y}), & y > 0, \\ \mathbb{P}(X = 0), & y = 0, \\ 0, & y < 0. \end{cases}$$
14 Here $g^{-1}(y)$ is the preimage of $y$ under $g$, i.e. the set $\{x \in \mathbb{R} : g(x) = y\}$.
7.3. Expectation and variance. Consider a long series of independent throws of a fair dice. As each of the outcomes $1, 2, \ldots, 6$ is equally likely to appear, the average of the observed numbers will be approximately
$$\frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \ldots + \frac{1}{6} \cdot 6 = \frac{7}{2},$$
which is the mean value of the observed variable, the value of the dice. This notion of mean value is at the heart of the next definition.
whenever this sum converges absolutely, i.e. $\sum_{x \in \mathrm{Im}X} |x \, \mathbb{P}(X = x)| < \infty$. If the sums of the positive and of the negative terms in (7.1) both diverge, then $E(X)$ is not defined for such $X$. If one of the sums above converges but the other diverges to infinity, we define $E(X) = \infty$ with the sign equal to the sign of the diverging sum. For example, if $X \geq 0$ and the sum in (7.1) diverges to $+\infty$, then we define $E(X) = +\infty$.
Equation (7.1) is often written as
$$E(X) = \sum_{x} x \, \mathbb{P}(X = x) = \sum_{x} x \, p_X(x),$$
and the expectation of $X$ is often called the expected value or mean value of $X$. Notice that when $X$ takes only finitely many values, the requirement of absolute convergence of the series in (7.1) becomes redundant. However, when $\mathrm{Im}X$ is infinite (though countable), the absolute convergence of the series guarantees that its sum is the same for any rearrangement of the series (i.e. irrespective of the summation order).
The physical analogy of expectation is the idea of "centre of gravity". If masses with weights $\pi_1, \pi_2, \ldots$ are placed at points $x_1, x_2, \ldots \in \mathbb{R}$, the position of the centre of gravity is
$$\frac{\sum_i \pi_i x_i}{\sum_i \pi_i} = \sum_i x_i \, \frac{\pi_i}{\sum_j \pi_j}.$$
Setting $p_i := \pi_i / \sum_j \pi_j$ gives a probability distribution that accounts for the proportion of the total weight assigned to the point $x_i$.
Then
$$E(X) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=1}^{n} \frac{n!}{(k-1)!(n-k)!} \, p^k (1-p)^{n-k}$$
$$= np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)! \, (n-1-(k-1))!} \, p^{k-1} (1-p)^{n-1-(k-1)} = np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{n-1-k} = np.$$
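The identity $E(X) = np$ can also be confirmed exactly, term by term, with rational arithmetic. A quick sketch:

```python
from fractions import Fraction
from math import comb

# Exact check of E(X) = np for the binomial distribution B(n, p):
# sum of k * C(n, k) * p^k * (1-p)^(n-k) over k = 0..n.
def binomial_mean(n: int, p: Fraction) -> Fraction:
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1))

for n in (1, 5, 10):
    for p in (Fraction(1, 2), Fraction(1, 3), Fraction(3, 7)):
        assert binomial_mean(n, p) == n * p
print(binomial_mean(10, Fraction(1, 3)))  # 10/3
```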
Exercise 7.1.1. Assume we toss a fair coin until it comes up Heads. Let N
be the first time in the trials when we see a Head. Compute E(N ).
Note: Notice that N takes values 1, 2, ... and ∞ and observe that N follows the geometric distribution.
Notice that (c) and (d) together imply that E is a linear operator on discrete random
variables that have expectation.
Proof. (a) follows by definition of the expectation as all terms in (7.1) become non-
negative.
(b) Assume, toward a contradiction, that $\mathbb{P}(X = 0) < 1$. Since $\sum_{x \in \mathrm{Im}X} \mathbb{P}(X = x) = 1$, we get that there exists $x > 0$ such that $\mathbb{P}(X = x) > 0$. Plugging this into the definition (7.1) we get $EX \geq x \, \mathbb{P}(X = x) > 0$, which is a contradiction; hence $\mathbb{P}(X = 0) = 1$.
15 Recall from elementary calculus that for a continuous function $f \in C[a, b]$, if $\int_a^b |f(x)| \, dx = 0$, then $f \equiv 0$ on $[a, b]$. When $f$ is Lebesgue integrable on $(a, b)$ and $\int_a^b |f| \, d\mu = 0$, where $\mu$ is the Lebesgue measure, then $f = 0$ $\mu$-almost everywhere on $[a, b]$.
where we used the absolute convergence of the series in (7.1) to rearrange the sums
above.
(d) Let us first show that $E(X + Y)$ exists, for which we need to show that the sum in (7.1) corresponding to $X + Y$ converges absolutely. We have
$$\sum_{z \in \mathrm{Im}(X+Y)} |z| \, \mathbb{P}(X + Y = z) = \sum_{z \in \mathrm{Im}(X+Y)} |z| \sum_{x \in \mathrm{Im}X} \mathbb{P}(X = x, Y = z - x)$$
$$= \sum_{x \in \mathrm{Im}X} \sum_{z \in \mathrm{Im}(X+Y)} |z| \, \mathbb{P}(X = x, Y = z - x) \quad (\text{set } z - x =: y)$$
$$= \sum_{x \in \mathrm{Im}X} \sum_{y \in \mathrm{Im}Y} |x + y| \, \mathbb{P}(X = x, Y = y)$$
$$\leq \sum_{x \in \mathrm{Im}X} |x| \sum_{y \in \mathrm{Im}Y} \mathbb{P}(X = x, Y = y) + \sum_{y \in \mathrm{Im}Y} |y| \sum_{x \in \mathrm{Im}X} \mathbb{P}(X = x, Y = y)$$
$$= \sum_{x \in \mathrm{Im}X} |x| \, \mathbb{P}(X = x) + \sum_{y \in \mathrm{Im}Y} |y| \, \mathbb{P}(Y = y) < \infty,$$
where the convergence of the last two sums is due to the existence of E(X) and E(Y ),
and the passage to the second equality follows from the fact that the family of events
{{X = x} : x ∈ ImX} (similarly for Y ) is a partition of the underlying probability
space. We thus get that E(X + Y ) exists and the same argument as above shows that
E(X + Y ) = E(X) + E(Y ).
(e) For any c ∈ R, using the linearity proved in (c) and (d), we get
Exercise 7.2.1. Construct a discrete random variable $X$ for which $E|X| < +\infty$, but $E(X^2) = +\infty$.
Note: the example shows that for two discrete random variables X and Y , having the existence of
E(X) and E(Y ) does not imply the existence of E(XY ) in general.
Property 7.4. Let X be a discrete random variable and assume E(X) exists. Then
(a) var(X) = E(X 2 ) − (E(X))2 ,
(b) var(aX) = a2 var(X) for any a ∈ R,
(c) var(X + Y ) = var(X) + var(Y ) + 2 (E(XY ) − E(X)E(Y )), where Y is a discrete
random variable for which E(Y ) exists.
The waiting time of the student follows a geometric distribution with parameter 1/6
(see subsection 6.1.4). Using it, one may compute the expected waiting time directly
from the definition of expectation. But there is an alternative approach which we
consider below that becomes useful in certain scenarios.
Claim 7.5. Let $X$ be a discrete random variable with pmf (probability mass function) $\mathbb{P}(X = k) = p_k$, where $k = 0, 1, 2, \ldots$ and $\sum_{k=0}^{\infty} p_k = 1$. Then
(7.5) $E(X) = \sum_{k=1}^{\infty} \mathbb{P}(X \geq k).$
Proof. Notice that $\mathbb{P}(X = k) = \mathbb{P}(X \geq k) - \mathbb{P}(X \geq k + 1)$ for all $k = 0, 1, 2, \ldots$. Using this and (7.1) we have
$$E(X) = \sum_{k=0}^{\infty} k \, \mathbb{P}(X = k) = \sum_{k=0}^{\infty} k \left( \mathbb{P}(X \geq k) - \mathbb{P}(X \geq k + 1) \right) = \mathbb{P}(X \geq 1) + \sum_{k=2}^{\infty} \mathbb{P}(X \geq k) \left( k - (k - 1) \right) = \sum_{k=1}^{\infty} \mathbb{P}(X \geq k),$$
16Notice that variance is non-negative by (7.4). This in particular implies that the right-hand side of
(a) is non-negative. Independently of the definition of the variance the non-negativity of the right-hand
side of (a) is a consequence of Jensen’s inequality which we will study later.
where we used Abel's summation rule17 to pass from the first line to the second.
Solution to Example 7.4.1. Define $X$ to be the first day that 1 shows up when throwing the dice. Clearly $\mathbb{P}(X = 1) = \frac{1}{6}$, $\mathbb{P}(X = 2) = \frac{5}{6} \cdot \frac{1}{6}$, and in general $\mathbb{P}(X = k) = \left(\frac{5}{6}\right)^{k-1} \frac{1}{6}$ for $k \geq 1$; thus $X$ follows a geometric distribution with parameter $p = 1/6$. From here, for each $n \geq 1$ we obtain
$$\mathbb{P}(X \geq n) = \sum_{k=n}^{\infty} \left(\frac{5}{6}\right)^{k-1} \frac{1}{6} = \left(\frac{5}{6}\right)^{n-1} \frac{1}{6} \sum_{k=0}^{\infty} \left(\frac{5}{6}\right)^{k} = \left(\frac{5}{6}\right)^{n-1},$$
hence by (7.5), $E(X) = \sum_{n=1}^{\infty} \left(\frac{5}{6}\right)^{n-1} = 6$.
Exercise 7.5.1. Confirm the answer to Example 7.4.1 directly from the definition of expectation, by showing that
$$\sum_{k=1}^{\infty} k \, \mathbb{P}(X = k) = 6.$$
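Both routes to the expected waiting time, the definition and the tail-sum formula (7.5), can be checked numerically by truncating the series far enough that the remainder is negligible. A sketch:

```python
# Geometric distribution with p = 1/6: P(X = k) = (5/6)^(k-1) / 6,
# so P(X >= n) = (5/6)^(n-1), and both expressions for E(X) give 6.
p = 1 / 6
terms = 600  # (5/6)^600 is astronomically small, so truncation is harmless

mean_direct = sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms))
mean_tails = sum((1 - p) ** (n - 1) for n in range(1, terms))
print(mean_direct, mean_tails)  # both approximately 6
```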
Solution to Example 7.5.1. One strategy for the example above is to split the population into groups of $k$, for some fixed $k \geq 1$, then mix the samples of the individuals of the same group and test the mixed sample. If the test on the mix is negative, then no one in the group has the disease; otherwise we test everyone in the group one by one, using another $k$ tests. Thus, in the worst case we will need $k + 1$ tests in total for a single group to know with certainty who in the group has the disease. Let $X$ be the number of tests required for a single group. Then $X$ takes two values: 1 with probability $(1 - p)^k$, and $k + 1$ with probability $1 - (1 - p)^k$. Hence
$$E(X) = (1 - p)^k + (k + 1) \left( 1 - (1 - p)^k \right).$$
whenever the series converge. This can be considered as the analogue of integration by parts rule for
the integrals and is often referred to as Abel’s summation by parts.
For simplicity assume that k divides n, hence there will be n/k groups. Consequently,
(7.6) E\left( \sum_{i=1}^{n/k} X_i \right) = \frac{n}{k} \left[ (1 - p)^k + (k + 1) \left( 1 - (1 - p)^k \right) \right] = n \left( 1 - (1 - p)^k + \frac{1}{k} \right).
Recall that k is a parameter of the testing scheme, and we now optimize the value of the expectation in (7.6) with respect to k. To that end, differentiate the right-hand side of (7.6) with respect to k and consider the value of k that makes the derivative vanish. We thus get
\frac{1}{k^2} + (1 - p)^k \log(1 - p) = 0.
If p is much smaller than \frac{1}{k}, then we can approximate (1 - p)^k by 1 and \log(1 - p) by -p, arriving at p \approx \frac{1}{k^2}. Thus k = \frac{1}{\sqrt{p}}, and the resulting expectation becomes of order n \sqrt{p}, which is significantly less than n for small values of p.
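The expected-count formula and the heuristic k ≈ 1/√p can be explored numerically; a sketch with hypothetical values of n and p, where `expected_tests` implements the right-hand side of (7.6):

```python
import math

def expected_tests(n, k, p):
    # Expected total number of tests with groups of size k (formula (7.6)),
    # assuming k divides n.
    return n * (1 - (1 - p) ** k + 1 / k)

n, p = 1000, 0.01                   # hypothetical population size and disease rate
k_star = round(1 / math.sqrt(p))    # heuristic optimum k ~ 1/sqrt(p)
best_k = min(range(1, 101), key=lambda k: expected_tests(n, k, p))
print(k_star, best_k, expected_tests(n, best_k, p))
```

The exact integer minimizer sits next to the heuristic value, and the optimal expected number of tests is far below n.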
→ Problem 7.5.1. Assume we toss a fair coin until we get two Heads in a row, and
then we stop (i.e. the first time we see the pattern HH, we stop). Let N be the first
time when we see two Heads.
• Compute the expectation of N.
• Do the same computation for the pattern T H (i.e. seeing for the first time a Tail followed immediately by a Head) and observe that the expected times for the two patterns HH and T H are different.
• How about getting the pattern HHT for the first time?
→ Problem 7.5.2. A fair 6-sided die is thrown. We can either take the number that appears on the die as our winnings (e.g. if it is 3 we win $3) and the game stops, or we can ask to throw it again. We are given n ≥ 1 chances to ask for a new round (this includes the initial round). If we play up to the last possible throw, then we must take whatever appears on the die on that throw. What is the optimal strategy for playing this game, and what are the expected winnings if we play according to that strategy?
Note: For example, when n = 1, there is only 1 throw, which we must accept, and thus our expected winnings equal \frac{1}{6}(1 + 2 + ... + 6) = 3.5. If n = 2 we throw once, and after seeing the result we can decide whether to keep it or move to the second round, in which case we must take whatever appears there.
Hint: The case n = 1 is described above. Start with n = 2 and think about your expected winnings if you pass on the first throw.
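The computation suggested by the hint can be organized as a backward recursion; a sketch (it computes the value of the game under the strategy of keeping a roll only if it beats the value of continuing — an illustrative aid, not the full solution):

```python
def game_value(n):
    # v = expected winnings with n rounds left; with 1 round we accept
    # the throw, so v = 3.5. With more rounds, keep a face value d only
    # if d exceeds the value of continuing with n - 1 rounds.
    v = 3.5
    for _ in range(n - 1):
        v = sum(max(d, v) for d in range(1, 7)) / 6
    return v

print(game_value(1), game_value(2))
```

The n = 1 value reproduces the 3.5 from the note above.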
→ Problem 7.5.3. Let N > 1 be a fixed integer. You are offered to play a game where the referee picks an integer n from 1 to N inclusive; if you guess n correctly you win n, otherwise you win 0. How much would you be willing to pay to play this game?
Note: First, try to think about the optimal strategy for both the referee and the player. The question asks about the entry fee that makes the game fair for both the player and the organizer, in the sense that the expected gain of the player equals the expected loss of the organizer.
7.4. Conditional expectation. Recall from section 5.2 that given a probability space
(Ω, F, P) and an event B ∈ F, knowing that B occurred can affect the probabilities of
other events. To formalize this idea, we introduced the concept of conditional probability.
In the same spirit, knowing that some B ∈ F occurred can affect the values of the probability mass function (pmf) of a given random variable X, as there we are dealing with the events {X = x} for x ∈ ImX. More precisely, if B ∈ F and P(B) > 0, then the events {X = x} can be conditioned on B.
Notice that the only difference of (7.7) from (7.1) is that we are taking conditional
probabilities of X taking the value x. The analogue of Theorem 5.3 (law of total
probability) is the following.
Proof. We have
\sum_{i} E(X \mid B_i) P(B_i) = \sum_{i} \sum_{x \in ImX} x P(X = x \mid B_i) P(B_i)   (by (7.7))
= \sum_{x \in ImX} x \sum_{i} P(X = x \mid B_i) P(B_i)   (by absolute convergence)
= \sum_{x \in ImX} x P(X = x)   (by Theorem 5.3)
= E(X).
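As a sanity check of the computation above, a small sketch with an illustrative partition of our own choosing (a fair die conditioned on even/odd outcomes):

```python
from fractions import Fraction as F

# X = outcome of a fair die; partition B1 = {even}, B2 = {odd}.
omega = [1, 2, 3, 4, 5, 6]
p = F(1, 6)

def cond_exp(pred):
    # E(X | B) = sum_x x P(X = x | B) for the event B = {pred holds}
    pb = sum(p for x in omega if pred(x))
    return sum(x * p for x in omega if pred(x)) / pb

pb_even = F(3, 6)
total = cond_exp(lambda x: x % 2 == 0) * pb_even + \
        cond_exp(lambda x: x % 2 == 1) * (1 - pb_even)
ex = sum(x * p for x in omega)
print(total, ex)  # both equal 7/2
```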
7.5. Indicator functions. An important class of random variables are the indicator
functions of events. For A ∈ F define
(7.8) I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{otherwise}, \end{cases}
and call I_A the indicator function of A (sometimes it is called the characteristic function of A). It follows from the definition that I_A is a random variable in the sense of Definition 7.1. Moreover, it enjoys the following nice properties.
Property 7.7. (Basic properties of indicators) Let A ∈ F and IA be its indicator
function. Then
1. E(IA ) = P(A),
2. IA = 1 − IAc ,
3. IA∩B = IA · IB for any B ∈ F,
4. IA∪B = IA + IB − IA∩B for any B ∈ F,
5. E(IA · IB ) = EIA · EIB for any B ∈ F if A and B are independent.
Proof. All claims follow directly from definitions.
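The identities of Property 7.7 can be verified mechanically on a small finite sample space; a sketch with arbitrary illustrative choices of A and B and the uniform measure:

```python
# Verify the indicator identities of Property 7.7 on a small sample space
# with uniform probabilities (A and B are illustrative choices).
omega = set(range(8))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def ind(S, w):
    # indicator function I_S(w) as in (7.8)
    return 1 if w in S else 0

for w in omega:
    assert ind(A, w) == 1 - ind(omega - A, w)                      # property 2
    assert ind(A & B, w) == ind(A, w) * ind(B, w)                  # property 3
    assert ind(A | B, w) == ind(A, w) + ind(B, w) - ind(A & B, w)  # property 4

# property 1: E(I_A) = P(A) under the uniform measure
assert sum(ind(A, w) for w in omega) / len(omega) == len(A) / len(omega)
print("indicator identities hold")
```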
Indicator functions become very useful in many counting problems. Below we discuss
a few examples of the so-called method of indicators.
18 In (7.9) we have a product of the form (a_1 - b_1) \cdot ... \cdot (a_n - b_n). To expand it we take a single factor from each bracket, a_i or -b_i, for each i = 1, 2, ..., n. Let I \subset \{1, 2, ..., n\} be the set of indices where we choose a_i; the complement of I comprises the indices where -b_i is chosen. Thus, we get a sum of the form
\sum_{I \subset \{1, 2, ..., n\}} \prod_{i \in I} a_i \prod_{i \in \{1, 2, ..., n\} \setminus I} (-b_i).
Applying this expansion to (7.9), the part with I_A becomes I_A thanks to I_A I_A = I_A, and I_A I_{A_k} = I_{A_k} for all k = 1, 2, ..., n.
PROBABILITY LECTURE NOTES 45
Example 7.7.1. (Kings and queens at a round table) There are n ≥ 2 pairs of kings/queens seated at a round table, where kings take the even-numbered seats and queens the odd-numbered ones; the point being that kings and queens sit interleaved. On average, how many kings will have their queen seated next to them?
Solution to Example 7.7.1. Enumerate the pairs by 1, 2, ..., n and for 1 ≤ i ≤ n let A_i be the event that the queen of the i-th pair sits next to the king of the i-th pair. Let also N be the number of correctly seated pairs. Observe that N is a random variable taking values in {0, 1, ..., n} and we are asked to compute E(N). It is clear that
N = I_{A_1} + ... + I_{A_n},
and thanks to the linearity of expectation (Theorem 7.2 parts (c) and (d)) and Property 7.7 part 1 of indicators we get E(N) = \sum_{i=1}^{n} P(A_i). Notice that for each 1 ≤ i ≤ n we have P(A_i) = \frac{2}{n}, since if the position of the king is fixed, then out of the n available seats the queen can take only the 2 seats next to her king. We thus get E(N) = n \cdot \frac{2}{n} = 2 for all n ≥ 2.
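A quick simulation sketch of the round-table example (fixing the kings' seats and shuffling the queens over the odd seats, which is equivalent by symmetry):

```python
import random

def seated_next_count(n, rng):
    # Kings sit in order at the even seats of a round table with 2n seats;
    # queens are a uniformly random permutation over the odd seats.
    queens = list(range(n))
    rng.shuffle(queens)
    # the queen at odd seat 2j+1 neighbours the kings at seats 2j and
    # 2j+2 (mod 2n), i.e. kings j and (j+1) mod n
    count = 0
    for j, q in enumerate(queens):
        if q in (j, (j + 1) % n):
            count += 1
    return count

rng = random.Random(1)
n, trials = 10, 20_000
avg = sum(seated_next_count(n, rng) for _ in range(trials)) / trials
print(avg)  # close to 2, independently of n
```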
Exercise 7.7.2. A fair coin is tossed n ≥ 1 times. What is the expected number of Heads? What if the coin turns up Heads with probability 0 ≤ p ≤ 1?
Notice that we are considering Bernoulli trials here, see subsection 6.1.2. Compute the expected
number of Heads using two methods: first, directly from the definition of the expectation and Bernoulli
trials, second, representing the number of Heads as a sum of indicators of Heads for each toss and using
the linearity of expectation (cf. Example 7.7.1).
Example 7.7.2. (Parallel series of coin flips) Assume two coins, one with probability
0 ≤ p < 1 of turning up Heads, another with probability q > p are tossed repeatedly.
Show that for any k ∈ N the probability that in a series of n independent tosses the first
coin will produce at least k heads is less than or equal to the probability that the second
one will produce at least k heads.
Solution to Example 7.7.2. The claim is of course intuitive, since the second
coin has a greater chance of landing up Heads. Nevertheless, proving this by counting
arguments is not entirely straightforward. We will instead use a method related to
indicator functions. For 1 ≤ i ≤ n define
X_i = \begin{cases} 1, & \text{if the } i\text{-th toss of the first coin is Heads}, \\ 0, & \text{otherwise}, \end{cases}
which is simply the indicator of the i-th toss of the first coin being Heads. Consider an auxiliary series of coin tosses where the probability of landing Heads equals \frac{q-p}{1-p}, and let Z_i be the indicator of the i-th toss turning up Heads in this new series, with i = 1, 2, ..., n. We thus have P(Z_i = 1) = \frac{q-p}{1-p}. Also assume that the new series of tosses is independent of the tosses of the first coin (we know from section 5.1.1 that such series of independent experiments exist). Now for the second coin define Y_i := X_i + (1 - X_i) Z_i, where i = 1, 2, ..., n. Clearly Y_i takes only the two values 0 and 1, and by the definition of Z_i we have P(Y_i = 1) = p + (1 - p) \frac{q-p}{1-p} = q and P(Y_i = 0) = 1 - q; hence Y_i has the same distribution as the indicator of Heads in a toss of the second coin. By construction we have X_i ≤ Y_i for all i = 1, 2, ..., n, and hence
P(X_1 + ... + X_n \ge k) \le P(Y_1 + ... + Y_n \ge k).
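The coupling19 can be sketched directly; a simulation with illustrative values p = 0.3 and q = 0.6, following the construction of Y_i above:

```python
import random

# Sketch of the coupling of Example 7.7.2: X_i ~ Bernoulli(p) and
# Y_i := X_i + (1 - X_i) Z_i with Z_i ~ Bernoulli((q-p)/(1-p)) independent.
# Then Y_i ~ Bernoulli(q) and X_i <= Y_i pointwise.
def coupled_pair(p, q, rng):
    x = 1 if rng.random() < p else 0
    z = 1 if rng.random() < (q - p) / (1 - p) else 0
    y = x + (1 - x) * z
    return x, y

rng = random.Random(2)
p, q, trials = 0.3, 0.6, 100_000
xs, ys = zip(*(coupled_pair(p, q, rng) for _ in range(trials)))
assert all(x <= y for x, y in zip(xs, ys))   # domination holds pointwise
print(sum(xs) / trials, sum(ys) / trials)    # approximately p and q
```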
Then we have two cycles, namely (1 3 4) and (2). It is well-known that any permutation
can be decomposed as a product of disjoint cycles. In this section we will be interested
in the average length of a cycle in a random permutation.
Step 1. The probability that a given element is in a cycle of length m.
Fix an element 1 ≤ i ≤ n and assume it is in a cycle of length m. Then there exist
distinct integers i1 , i2 , ..., im−1 from {1, 2, ..., n} forming the cycle
i \to i_1 \to i_2 \to ... \to i_{m-1} \to i.
Since the sequence i_1, ..., i_{m-1} is ordered, there are \frac{(n-1)!}{(n-1-(m-1))!} = \frac{(n-1)!}{(n-m)!} ways to choose such a sequence (see subsection 2.3.2). When the cycle is fixed, the rest of the n - m integers can be permuted arbitrarily; thus there are \frac{(n-1)!}{(n-m)!} (n-m)! = (n-1)! permutations out of n! such that i is in a cycle of length m. So the probability that i is in a cycle of length m equals \frac{(n-1)!}{n!} = \frac{1}{n}.
19 The technique used here is an example of the coupling method, which is a powerful technique in probability theory and has deep applications that are beyond the scope of these notes.
We conclude that for m > \frac{n}{2} the probability that a random permutation has a cycle of length m equals \frac{1}{m}. Hence the probability that a randomly chosen permutation has a cycle of length greater than n/2 is bounded above by
\sum_{m:\, m > \frac{n}{2}} \frac{1}{m} \approx \ln n - \ln \frac{n}{2} = \ln 2.
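Both the exact sum and a random-permutation simulation of the long-cycle probability can be sketched (n = 100 is an illustrative choice):

```python
import math
import random

def has_long_cycle(perm, threshold):
    # Check whether the permutation (perm[i] = image of i) contains a
    # cycle of length strictly greater than threshold.
    n = len(perm)
    seen = [False] * n
    for i in range(n):
        if not seen[i]:
            length, j = 0, i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
                length += 1
            if length > threshold:
                return True
    return False

rng = random.Random(3)
n, trials = 100, 5_000
hits = 0
for _ in range(trials):
    perm = list(range(n))
    rng.shuffle(perm)
    hits += has_long_cycle(perm, n // 2)
exact = sum(1 / m for m in range(n // 2 + 1, n + 1))  # sum of 1/m over m > n/2
print(hits / trials, exact, math.log(2))
```

The empirical frequency, the exact sum, and ln 2 all sit close together.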
Exercise 7.7.3. What are the chances of survival in the above problem if all
prisoners choose their 50 boxes uniformly at random?
→ Problem 7.7.1. Is there a better strategy to the above problem with names in
boxes? Justify your answer.
and
\sum_{x \in ImX} \sum_{y \in ImY} p_{X,Y}(x, y) = 1.
Notice that the events {{X = x} : x ∈ ImX} and {{Y = y} : y ∈ ImY } are both
partitions of Ω. Thus
(8.1) \sum_{x \in ImX} p_{X,Y}(x, y) = \sum_{x \in ImX} P(X = x, Y = y) = P(Y = y) = p_Y(y),
similarly
(8.2) \sum_{y \in ImY} p_{X,Y}(x, y) = p_X(x).
Thus from the joint mass function we can recover the probability mass functions of
individual variables X and Y . The distributions in (8.1) and (8.2) are called marginal
mass functions of Y and X respectively.
        y_1                 y_2                 ...   y_m                 ...
x_1     p_{X,Y}(x_1, y_1)   p_{X,Y}(x_1, y_2)   ...   p_{X,Y}(x_1, y_m)   ...
x_2     p_{X,Y}(x_2, y_1)   p_{X,Y}(x_2, y_2)   ...   p_{X,Y}(x_2, y_m)   ...
...
x_n     p_{X,Y}(x_n, y_1)   p_{X,Y}(x_n, y_2)   ...   p_{X,Y}(x_n, y_m)   ...
...
Table 1. One may think about the joint mass function p_{X,Y} as a matrix whose (n, m)-th element is the probability of the event {ω ∈ Ω : X(ω) = x_n and Y(ω) = y_m}. Here ImX = {x_1, x_2, ...} and ImY = {y_1, y_2, ...}. To recover the marginal distributions we sum over the columns of this matrix for a fixed row to get p_X, and over the rows of the matrix for a fixed column to recover p_Y.
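A sketch of marginalization for an illustrative joint table (the matrix below is our own example, not from the text):

```python
from fractions import Fraction as F

# Joint pmf as a matrix: entry [i][j] = p_{X,Y}(x_i, y_j).
joint = [
    [F(1, 8), F(1, 8), F(1, 4)],
    [F(1, 8), F(1, 4), F(1, 8)],
]
assert sum(sum(row) for row in joint) == 1

# Marginal of X: sum each row over y (formula (8.2));
# marginal of Y: sum each column over x (formula (8.1)).
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
print(p_x, p_y)
```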
8.1. Expectation in the multivariate case. Given two random variables X and Y
on (Ω, F, P) and a function g : R2 → R define Z(ω) = g(X(ω), Y (ω)) where ω ∈ Ω.
Then, one may easily check that conditions of Definition 7.1 are satisfied for Z, i.e. Z
is a discrete random variable. We now formulate and prove the analogue of Theorem
7.3 for Z.
Proof. The proof is similar to the proof of Theorem 7.3. Set Z = g(X, Y); then
EZ = \sum_{z \in ImZ} z P(Z = z)
= \sum_{z \in ImZ} z P\left( \{ \omega \in \Omega : (X(\omega), Y(\omega)) \in g^{-1}(z) \} \right)
= \sum_{z \in ImZ} z \sum_{\substack{x \in ImX,\, y \in ImY \\ g(x,y) = z}} P(X = x, Y = y)
= \sum_{x \in ImX} \sum_{y \in ImY} \sum_{\substack{z \in ImZ \\ g(x,y) = z}} z P(X = x, Y = y)
= \sum_{x \in ImX} \sum_{y \in ImY} g(x, y)\, p_{X,Y}(x, y).
Note that the result of Theorem 8.1 works similarly for the case of n random variables. More precisely, consider g : R^n → R and let X_1, X_2, ..., X_n be discrete random
variables defined on Ω. Then Z = g(X1 , ..., Xn ) defines a discrete random variable on
Ω and the analogue of (8.4) is true for joint mass function pX1 ,...,Xn , which is defined in
analogy with the case of two variables as follows:
pX1 ,...,Xn (x1 , ..., xn ) = P (X1 = x1 , ..., Xn = xn ) .
We formulate the result for n variables.
Theorem 8.3. Let X1 , X2 , ..., Xn be random variables defined on (Ω, F, P) and g :
Rn → R be any function. Then g(X1 , X2 , ..., Xn ) is a random variable defined on
(Ω, F, P) and
(8.4) E g(X_1, ..., X_n) = \sum_{x_1 \in ImX_1} ... \sum_{x_n \in ImX_n} g(x_1, ..., x_n)\, p_{X_1,...,X_n}(x_1, ..., x_n),
Exercise 8.3.1. Prove Theorem 8.3 following the proof of Theorem 8.1.
Observe also that given independence of X and Y , from (9.2) we get that the joint
mass function pX,Y decomposes into a product of two functions depending only on x or
y. The converse of this statement is also true as we see next.
Theorem 9.1. Two discrete random variables X and Y defined on (Ω, F, P) are inde-
pendent if and only if there exist functions f, g : R → R such that
(9.3) pX,Y (x, y) = f (x)g(y), ∀x, y ∈ R.
Proof. The necessity is due to (9.2). We now prove the sufficiency. Assume pX,Y (x, y) =
f (x)g(y) where x, y ∈ R. We need to check that (9.1) holds. We have
(9.4) 1 = \sum_{x \in ImX} \sum_{y \in ImY} p_{X,Y}(x, y) = \sum_{x \in ImX} \sum_{y \in ImY} f(x) g(y) = \left( \sum_{x \in ImX} f(x) \right) \left( \sum_{y \in ImY} g(y) \right).
We also have
p_X(x) = \sum_{y \in ImY} p_{X,Y}(x, y) = f(x) \sum_{y \in ImY} g(y)
and
p_Y(y) = \sum_{x \in ImX} p_{X,Y}(x, y) = g(y) \sum_{x \in ImX} f(x).
Multiplying the last two equalities and using (9.4) implies
p_X(x) p_Y(y) = f(x) g(y) \left( \sum_{y \in ImY} g(y) \right) \left( \sum_{x \in ImX} f(x) \right) = f(x) g(y) = p_{X,Y}(x, y),
which is (9.1) and completes the proof.
Example 9.1.1. Consider random variables X and Y that take values from non-
negative integers and have the joint mass function
p_{X,Y}(n, m) = \frac{1}{n!\, m!} \lambda^n \mu^m e^{-(\lambda + \mu)}, \quad n, m = 0, 1, 2, ... .
Then, Theorem 9.1 implies that X and Y are independent.
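A numerical sketch of this factorization, truncating the infinite ranges (λ = 2 and µ = 3 are illustrative choices):

```python
import math

lam, mu = 2.0, 3.0

def joint(n, m):
    # p_{X,Y}(n, m) = lambda^n mu^m e^{-(lambda+mu)} / (n! m!)
    return lam**n * mu**m * math.exp(-(lam + mu)) / (math.factorial(n) * math.factorial(m))

N = 60  # truncation; the Poisson tails beyond N are negligible here
p_x = [sum(joint(n, m) for m in range(N)) for n in range(N)]
p_y = [sum(joint(n, m) for n in range(N)) for m in range(N)]
# independence: the joint pmf equals the product of the marginals
for n in range(10):
    for m in range(10):
        assert abs(joint(n, m) - p_x[n] * p_y[m]) < 1e-12
print("joint pmf factorizes into Poisson marginals")
```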
= E(X)E(Y ).
Corollary 9.3. Let X and Y be discrete random variables that are independent. Then
var(X + Y ) = var(X) + var(Y ).
Proof. This follows directly from Property 7.4 (c) and Theorem 9.2.
Example 9.3.1. Let X be a discrete random variable taking values −1, 0, 1, each with probability 1/3, and let Y = |X|. Then
P(X = 0, Y = 1) = 0 \quad \text{and} \quad P(X = 0) P(Y = 1) = \frac{1}{3} \cdot \frac{2}{3} \ne 0,
hence X and Y are not independent. However, E(XY) = 0 = E(X) E(Y).
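The computations of this example can be checked mechanically:

```python
from fractions import Fraction as F

# X uniform on {-1, 0, 1}, Y = |X| (Example 9.3.1): dependent yet E(XY) = EX * EY.
outcomes = [(-1, F(1, 3)), (0, F(1, 3)), (1, F(1, 3))]
ex = sum(x * p for x, p in outcomes)
ey = sum(abs(x) * p for x, p in outcomes)
exy = sum(x * abs(x) * p for x, p in outcomes)
p_joint = sum(p for x, p in outcomes if x == 0 and abs(x) == 1)   # P(X=0, Y=1)
p_prod = sum(p for x, p in outcomes if x == 0) * sum(p for x, p in outcomes if abs(x) == 1)
print(exy, ex * ey, p_joint, p_prod)
```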
Proof. The necessity follows by the same argument as we used in Theorem 9.2, one
only needs to consider the function (x, y) 7→ f (x)g(y) instead of (x, y) 7→ xy.
We now prove sufficiency. Take any a, b ∈ R and let
f(x) = \begin{cases} 1, & x = a, \\ 0, & x \ne a, \end{cases} \qquad g(y) = \begin{cases} 1, & y = b, \\ 0, & y \ne b. \end{cases}
Then E(f (X)g(Y )) = P(X = a, Y = b) and E(f (X)) = P(X = a), E(g(Y )) = P(Y = b).
Combining these with the condition of the theorem implies independence of X and
Y.
Clearly, both sides of (9.5) become zero unless xi ∈ ImXi for all i = 1, 2, ..., n.
We now formulate the analogues of the main results proved for the case of two discrete random variables for n variables.
Theorem 9.5. (cf. Theorem 9.1) Discrete random variables X1 , ..., Xn defined on
(Ω, F, P) are independent if and only if there exist functions f1 , ...fn : R → R such that
pX1 ,...Xn (x1 , ..., xn ) = f1 (x1 ) · ... · fn (xn ), ∀x1 , ..., xn ∈ R.
Theorem 9.6. (cf. Theorem 9.2) Let X1 , ..., Xn be discrete random variables that are
jointly independent, and assume E(X1 ), ..., E(Xn ) exist. Then
E(X1 · ... · Xn ) = E(X1 ) · ... · E(Xn ).
Corollary 9.7. (cf. Corollary 9.3) Let X1 , ..., Xn be discrete random variables that are
pairwise independent. Then
var(X1 + ... + Xn ) = var(X1 ) + ... + var(Xn ).
Theorem 9.8. (cf. Theorem 9.4) Discrete random variables X1 , ..., Xn are jointly
independent if and only if for any functions f1 , ..., fn : R → R we have
E(f1 (X1 ) · ... · fn (Xn )) = E(f1 (X1 )) · ... · E(fn (Xn )),
whenever the expectations on the right-hand side exist.
Proofs of the formulated results follow along the same lines as their counterparts for two random variables.
Exercise 9.8.1. Prove Theorems 9.5, 9.6, 9.8, and Corollary 9.7.
We also include a new result that we will prove for the case of n random variables.
Theorem 9.9. (Preservation of independence under compositions) Let X1 , X2 , ..., Xn be
independent random variables, and let f1 , f2 , ..., fn be any real-valued functions defined
on R. Then f1 (X1 ), f2 (X2 ), ..., fn (Xn ) are independent.
The moral of Theorem 9.9 is that given a sequence of independent variables, any transformation of the individual entries of the sequence will not alter the independence. For example, if X, Y, Z are independent random variables, then so are
X^2, \quad e^Y, \quad (1 + |Z|^{2021})^{-1/2},
a fact which might not seem entirely trivial in this particular example.
→ Problem 9.9.1. Let v_1, v_2, ..., v_n ∈ R^d be unit vectors, i.e. ||v_i|| = 1 for all 1 ≤ i ≤ n. Prove that there exists a sequence of signs ε_i ∈ {−1, 1} such that the vector \sum_{i=1}^{n} ε_i v_i has length at most (at least) \sqrt{n}.
Hint: choose the signs uniformly at random and compute the expectation of the squared length (Euclidean norm) of \sum_i ε_i v_i.
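A simulation sketch along the hint (random unit vectors in R^3 are an illustrative choice; the computation E||Σ ε_i v_i||^2 = n for independent uniform signs is what drives the argument):

```python
import math
import random

rng = random.Random(4)
d, n = 3, 50
# illustrative unit vectors: random directions in R^3
vs = []
for _ in range(n):
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    vs.append([c / norm for c in v])

def signed_length_sq(signs):
    # squared Euclidean norm of sum_i eps_i v_i
    s = [sum(e * v[j] for e, v in zip(signs, vs)) for j in range(d)]
    return sum(c * c for c in s)

trials = 5_000
avg = sum(signed_length_sq([rng.choice((-1, 1)) for _ in range(n)])
          for _ in range(trials)) / trials
print(avg)  # concentrates near n, since E||sum eps_i v_i||^2 = n
```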
10. Inequalities
In this section we state and prove several important inequalities, and discuss the
concept of information entropy.
10.1. Jensen’s inequality. Recall the definition of a convex function. A function
f : Rd → R is called convex if for any x1 , x2 ∈ Rd and 0 ≤ λ ≤ 1 one has
(10.1) f ((1 − λ)x1 + λx2 ) ≤ (1 − λ)f (x1 ) + λf (x2 ).
A function is called strictly convex if the inequality (10.1) is strict whenever x_1 \ne x_2 and 0 < λ < 1.
A function is called concave if −f is convex.
Examples of convex functions are f (x) = |x|, f (x) = x2 , f (x) = ex , etc.
Figure 7. An example of a convex function. For any x_1, x_2 the graph of f lies below the chord joining the points (x_1, f(x_1)) and (x_2, f(x_2)) on the graph. More precisely, the value of f at (1 − λ)x_1 + λx_2, for any λ ∈ [0, 1], is bounded above by the value of the chord at the same point.
where the last equality is due to (7.3). This settles the case of random variables with finitely many values.
To prove the general case assume X has distribution P(X = x_i) = p_i with 0 ≤ p_i ≤ 1 and \sum_{i=1}^{\infty} p_i = 1. We will use the fact that convex functions are continuous. In view of the continuity of f at EX, for any ε > 0 there exists N(ε) ∈ N such that for all N ≥ N(ε) we have
(10.4) f(EX) \le \varepsilon + f\left( \sum_{i=1}^{N} x_i \tilde{p}_i \right),
where \tilde{p}_i = p_i / \sum_{i=1}^{N} p_i. Indeed, by definition EX = \sum_{i=1}^{\infty} x_i p_i and the series is absolutely convergent. Hence \sum_{i=1}^{N} x_i \tilde{p}_i = \frac{1}{\sum_{i=1}^{N} p_i} \sum_{i=1}^{N} x_i p_i \to EX as N \to \infty, since \{p_i\} is a probability distribution and sums to 1. Getting back to (10.4), we use the fact that \{\tilde{p}_i\}_{i=1}^{N} is a finite distribution and apply Jensen's inequality on the right-hand side, arriving at
f(EX) \le \varepsilon + \sum_{i=1}^{N} f(x_i) \tilde{p}_i = \varepsilon + \frac{1}{\sum_{i=1}^{N} p_i} \sum_{i=1}^{N} f(x_i) p_i.
Now both sums in the above inequality are convergent. Passing to the limit as N \to \infty and using Theorem 7.3 we get
f(EX) \le \varepsilon + E f(X).
Since ε > 0 is arbitrary, the last inequality completes the proof of the general case of a discrete random variable.
Applying the inequality above for x = x_i, multiplying both sides by p_i and summing over i = 1, 2, ..., we get
\sum_{i=1}^{\infty} f(x_i) p_i \ge f(EX) + v \cdot \left( \sum_{i=1}^{\infty} p_i x_i - EX \right).
The left-hand side of this inequality equals E f(X) by Theorem 7.3, and the right-hand side is precisely f(EX), as the sum in the brackets equals EX. The proof of inequality (10.3) is complete.
For example, applying Jensen's inequality for specific convex functions f we get
(EX)^{2n} \le E(X^{2n}), \quad \text{taking } f(x) = x^{2n} \text{ for some } n \in N,
whenever the expectations above exist.
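Such instances of Jensen's inequality hold exactly for any empirical (finite discrete) distribution; a sketch with the convex functions x^2 and e^x on an arbitrary sample:

```python
import math
import random

# Empirical illustration of Jensen's inequality f(EX) <= E f(X) for the
# convex functions f(x) = x^2 and f(x) = exp(x).
rng = random.Random(5)
xs = [rng.uniform(-2, 2) for _ in range(10_000)]
mean = sum(xs) / len(xs)

assert mean ** 2 <= sum(x ** 2 for x in xs) / len(xs)
assert math.exp(mean) <= sum(math.exp(x) for x in xs) / len(xs)
print("Jensen's inequality holds on the empirical distribution")
```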
Now the above inequality is a direct consequence of Jensen's inequality and the convexity of f(x) = -\log x, x > 0. The proof is complete.
20In geometric terms, the elements of the sub-differential are precisely the slopes for which the graph
of f lies above the line through (x0 , f (x0 )) with the given slope. The sub-differential is defined for
functions f : Rd → R as well where v ∈ Rd is a sub-derivative if f (x) ≥ f (x0 ) + hv, x − x0 i for all
x ∈ Rd , where h·, ·i is the standard inner product of Rd . Moreover, a convex function f is differentiable
at x0 if and only if the sub-differential is a single element set; in which case the only sub-derivative is
the gradient of f at x0 .
Exercise 10.4.1. Prove that (E|XY |)2 ≤ E(X 2 )E(Y 2 ) for two discrete random
variables X and Y .
10.4. Covariance and correlation. In this section we define covariance and correlation as measures of the joint variability of two random variables.
Definition 10.1. (Covariance) Given two random variables X and Y the covariance
between them is defined as
(10.7) cov(X, Y ) = E ((X − EX)(Y − EY )) ,
whenever the expectations on the right-hand side exist.
Note that the formula for covariance can be simplified to
cov(X, Y ) = E(XY ) − E(X)E(Y )
in view of the linearity of expectation. Hence the formula for the variance of sums of
random variables proved in Property 7.4 (c), in terms of the covariance, becomes
var(X + Y ) = var(X) + var(Y ) + 2cov(X, Y ).
Similarly, for the variance of the sum of n random variables X1 , X2 , ..., Xn we have
(10.8) var(X_1 + ... + X_n) = var(X_1) + ... + var(X_n) + 2 \sum_{1 \le i < j \le n} cov(X_i, X_j).
Exercise 10.4.2. Verify (10.8) following the proof of Property 7.4 (c).
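The identity var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) holds exactly for empirical distributions as well; a sketch on illustrative correlated samples:

```python
import random

# Check var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) on an empirical sample
# (the identity holds exactly for the empirical distribution).
rng = random.Random(6)
xs = [rng.randint(0, 9) for _ in range(1_000)]
ys = [x + rng.randint(0, 4) for x in xs]   # deliberately correlated with xs

def mean(a):
    return sum(a) / len(a)

def var(a):
    m = mean(a)
    return mean([(t - m) ** 2 for t in a])

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return mean([(s - ma) * (t - mb) for s, t in zip(a, b)])

zs = [x + y for x, y in zip(xs, ys)]
lhs = var(zs)
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)
print(lhs, rhs)  # equal up to floating-point error
```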
We next state some basic properties of covariance, proofs of which follow directly from definitions.
Property 10.5. (Basic properties of covariance) Let X and Y be discrete random
variables and a ∈ R be a constant. Whenever the covariances defined below exist, we
have
1. cov(X, a) = 0,
2. cov(X + a, Y ) = cov(X, Y ),
3. cov(aX, Y ) = a cov(X, Y ),
4. cov(X, Y ) = cov(Y, X)
5. cov(X, Y ) = E(XY ) − E(X)E(Y ),
6. cov(X, Y ) = 0 if X and Y are independent,
7. cov(X + Z, Y ) = cov(X, Y ) + cov(Z, Y ),
8. cov\left( \sum_{i=1}^{n} a_i X_i, \sum_{j=1}^{m} b_j Y_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j\, cov(X_i, Y_j) for any discrete random variables X_i, Y_j, and constants a_i, b_j ∈ R, whenever the covariances exist21.
Example 10.5.1. (Zero covariance does not imply independence) The aim of this
example is to show that the converse of Property 10.5 (6) is not true in general. To this
end, consider a random variable X with probability mass function defined as
P(X = 1) = P(X = -1) = \frac{1}{4} \quad \text{and} \quad P(X = 0) = \frac{1}{2}.
Also, define a random variable Y as
Y = \begin{cases} 1, & \text{if } X = 0, \\ 0, & \text{if } X \ne 0. \end{cases}
Exercise 10.5.4. A fair coin is tossed 3 times. Let X be the number of Heads
in the first two tosses, and let Y be the number of Heads in the last two tosses (note the
overlap). Compute cov(X, Y ).
Definition 10.2. (Correlation) For two random variables X and Y such that var(X),
var(Y ) > 0, the correlation (or correlation coefficient) between them is defined as
(10.9) corr(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)\, var(Y)}}.
Theorem 10.6. For any two random variables X and Y for which the correlation is defined, we have
(10.10) |corr(X, Y)| \le 1.
Remark 10.7. From Theorem 10.6 we see that the correlation is a "normalized" version of the covariance, taking values in [−1, 1]. The second part of the theorem suggests that covariance (correlation) measures the extent to which the variables X and Y are related linearly. For instance, if corr(X, Y) = ±1, then there is a linear dependence between X and Y.
10.5. Information entropy. Can one quantify the level of surprise or information
that a certain event bears? For example, let (Ω, F, P) be a probability space and A ∈ F
be an event with probability 0 ≤ p ≤ 1; how surprising is A? We will measure the
level of the surprise as a function of p, and denote that function by I (here I stands for
“information”). If P(A) = 1, then the event A is certain to happen and hence there is
no surprise of it happening, thus we require I(1) = 0. Also note that the more unlikely
the event is (i.e. if p is small), the more surprising its occurrence is, hence p 7→ I(p)
is required to be decreasing in p. For two independent events A and B the occurrence
of A ∩ B should sum up the surprise (or information content) coming from A and B,
thus we require I(P(A ∩ B)) = I(P(A)) + I(P(B)). It turns out these conditions on I
characterize it.
(10.11) I(pq) = I(p) + I(q)
for any p, q ∈ (0, 1]. If I is continuous, then I(p) = c \log p for some c ≤ 0.
Proof. Make a change of variables in I by setting p = e^{-x} where p ∈ (0, 1]. Let f(x) = I(e^{-x}) for x ∈ [0, ∞). Clearly (10.11) transforms to22
(10.12) f(x + y) = f(x) + f(y), \quad x, y \ge 0.
For any p, q ∈ N we have \frac{p}{q} = \underbrace{\frac{1}{q} + ... + \frac{1}{q}}_{p \text{ times}}, hence applying (10.12) we get f\left(\frac{p}{q}\right) = p f\left(\frac{1}{q}\right). Similarly, writing 1 = \underbrace{\frac{1}{q} + ... + \frac{1}{q}}_{q \text{ times}}, from (10.12) we arrive at f(1) = q f\left(\frac{1}{q}\right), and hence
(10.13) f\left(\frac{p}{q}\right) = \frac{p}{q} f(1) \quad \text{for all } p, q \in N.
Now the continuity of f coupled with (10.13) implies that f(x) = x f(1) for all x ≥ 0. Getting back to the relation between f and I, we see that for x ∈ (0, 1] we have
I(x) = I\left( e^{-\log \frac{1}{x}} \right) = f\left( \log \frac{1}{x} \right) = \log \frac{1}{x} \cdot I(e^{-1}) = c \log x,
where c = -I(e^{-1}) \le 0.
Proof of Theorem 10.8 when I is C 2 . We include another proof that the infor-
mation function is a multiple of logarithm, when I is assumed to be C 2 .
Using the fact that I is twice differentiable we take the derivative in (10.11) with
respect to p arriving at
qI 0 (pq) = I 0 (p).
Now differentiate both sides of the last equality with respect to q. Doing so we get
I'(pq) + pq\, I''(pq) = 0.
With a change of variable z := pq, and setting g(z) = I'(z), the last equation becomes g(z) + z g'(z) = 0. Thus \frac{dg}{dz} = -\frac{g}{z}, which is a differential equation with separable variables. Solving it we obtain g(z) = c \frac{1}{z} for some constant c. Recall that g(z) = I'(z), hence from the formula for g we see that I(z) = c \log z + c_1, where c and c_1 are constants. To determine these constants recall that I must satisfy (10.11). In that formula take q = 1, which leads to I(p) = I(p) + I(1), hence I(1) = 0. Hence, setting z = 1 we get 0 = I(1) = c \log 1 + c_1, consequently c_1 = 0. This leads to I(z) = c \log z, and the proof is complete.
→ Problem 10.8.2. Using the concept of Hamel basis and considering R as a vector
space over rationals Q, show that there exists a non-linear solution to (10.12). By the
previous problem, such solutions cannot be measurable. Thus, the condition of measur-
ability cannot be eliminated in the characterization of the information function I.
We choose the constant c in the definition of the information function to get I(p) = -\log_2 p where p ∈ (0, 1]. Now, if X is a random variable with pmf p_X given by p_X(x_i) = P(X = x_i) = p_i, with i = 1, 2, ..., n, then the average level of information contained in X equals
H(X) := \sum_{i=1}^{n} p_i I(p_i) = -\sum_{i=1}^{n} p_i \log_2 p_i,
which is called the (information) entropy of X.
→ Problem 10.8.3. There are 12 coins of identical shape and size, except one of them is fake. It is known that the fake coin has a different weight than a real coin, but it is not known whether the fake is heavier or lighter than a real one. You have a balancing scale, and can weigh any two subsets of the 12 coins with it. When placing two sets of coins on the two sides of the scale (either set of coins can be empty) there are three possible outcomes: the left and right sides have equal weights, the left side is heavier, or the right side is heavier.
Using the scale, find the fake coin and determine whether it is lighter or heavier in as few weighings as possible.
Hint: If we enumerate the coins from 1, 2, ..., 12, then there are 24 scenarios in total. Namely, the fake coin is coin number i for some 1 ≤ i ≤ 12 (12 possibilities) and it is either lighter or heavier than a real one (2 possibilities). Think about a weighing step as a process of gaining information about these scenarios, and when organizing a weighing step think about splitting the aforementioned possibilities as evenly as possible.
Note: If you can do W ≥ 1 weighings, what is the maximal number of coins, of which one is fake, that can be checked as in the original problem?
Corollary 11.3. Let X and ε > 0 be as in Theorem 11.2 and let µ = EX. Then
(11.1) P(|X - \mu| \ge \varepsilon) \le \frac{1}{\varepsilon^2} var(X).
Proof. The proof follows by applying Chebyshev's inequality to X − µ.
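An empirical sketch of (11.1) for an illustrative binomial variable (a sum of 20 fair coin flips, so µ = 10 and var = 5):

```python
import random

# Empirical check of Chebyshev's inequality (11.1):
# P(|X - mu| >= eps) <= var(X) / eps^2, for X = sum of 20 fair coin flips.
rng = random.Random(7)
n, trials = 20, 50_000
samples = [sum(rng.randint(0, 1) for _ in range(n)) for _ in range(trials)]
mu, variance = n / 2, n / 4            # Binomial(20, 1/2): mean 10, variance 5
for eps in (3, 4, 5):
    freq = sum(abs(s - mu) >= eps for s in samples) / trials
    bound = variance / eps**2
    print(eps, freq, bound)            # empirical frequency vs Chebyshev bound
```

The bound is quite loose here, as is typical for Chebyshev's inequality.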
The next exercise is a prelude to the probabilistic proof of the Weierstrass approxi-
mation theorem. It might be beneficial to think about the exercise first before reading
on.
where the first equality is due to Theorem 7.3 and the second one follows by definition
of Bernoulli trials24. For m ∈ N define
A_{m,n,x} := \left\{ \omega \in \Omega : \left| f\left( \frac{S_n(\omega)}{n} \right) - f(x) \right| \ge \frac{1}{m} \right\}.
Let also \bar{S}_n := \frac{S_n}{n}; then using (11.3) we get
24 The intuition behind the argument is the following. By the WLLN we know that S_n/n concentrates around x. Then, the continuity of f implies that f(S_n/n) should concentrate around f(x) (cf. Exercise 11.4.1). But E f(S_n/n) is a polynomial in x, and we get a candidate polynomial that can approximate f.
12.1.1. Counting paths. We can think about the sequence {Sn } as a polygonal line
joining points (n, Sn ) → (n + 1, Sn+1 ) (see Figure 8). Each such path corresponds to
a particular realization of the random walk which can be identified with an outcome of
the joint experiment with X1 , ..., Xn . Thus the number of all paths starting at (0, 0)
and moving for n time-steps equals 2n .
Observe that a walk starting from (0, 0) and moving for n steps cannot terminate at an arbitrary integer of [−n, n]. For example, when n = 2, none of the four paths of the walk ends at 1. In the light of this observation let us compute the number of paths from
(0, 0) to (n, x) for a given point (n, x). Assume the walk does a steps up and b steps
down, i.e. the number of 1 ≤ i ≤ n for which Xi = 1 equals a and for Xi = −1 equals
b. Since we do n steps in total and arrive at x from 0, we must have
a + b = n and a − b = x.
Solving this system we obtain
a = \frac{n + x}{2} \quad \text{and} \quad b = \frac{n - x}{2}.
Let N_{(0,0) \to (n,x)} be the number of paths from (0, 0) to (n, x). If the context is clear we will drop the (0, 0) from the subscript and write N_{(n,x)} for the number of paths from (0, 0) to (n, x). We will think about n as time, and about x as the location of the walk at time n. It is now clear that if n + x is not an even number, then there are no paths ending at x in n steps; otherwise the number of such paths equals
(12.2) N_{(0,0) \to (n,x)} = \begin{cases} 0, & \text{if } n + x \text{ is odd}, \\ \binom{n}{a}, & \text{otherwise, for } a = \frac{n + x}{2}. \end{cases}
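Formula (12.2) can be cross-checked against brute-force enumeration of all 2^n step sequences for small n; a sketch:

```python
from itertools import product
from math import comb

def n_paths(n, x):
    # number of walk paths from (0,0) to (n,x), formula (12.2)
    if abs(x) > n or (n + x) % 2 != 0:
        return 0
    return comb(n, (n + x) // 2)

def n_paths_brute(n, x):
    # brute force over all 2^n sequences of +-1 steps
    return sum(1 for steps in product((1, -1), repeat=n) if sum(steps) == x)

for n in range(0, 9):
    for x in range(-n - 1, n + 2):
        assert n_paths(n, x) == n_paths_brute(n, x)
print("formula (12.2) agrees with enumeration for n <= 8")
```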
Let us also note that we can shift the time and location of the starting position of the path. Namely, for any t ∈ Z and any x_0 ∈ Z we have
(12.3) N_{(0,0) \to (n,x)} = N_{(t,0) \to (n+t,x)} = N_{(t,x_0) \to (n+t,x+x_0)}.
By N^{0}_{(0,0) \to (n,x)} denote the number of paths from (0, 0) to (n, x) that become 0 for some 1 ≤ i ≤ n. The paths that start and end at positive integers but become zero
at some intermediate point are of special interest. The next result links the number of
paths that hit zero at some intermediate point with the number of all paths.
Theorem 12.1. (Reflection principle for random walks) Let x, y ≥ 1 be integers. Then
N^{0}_{(0,x) \to (n,y)} = N_{(0,-x) \to (n,y)},
i.e. the number of paths from (0, x) to (n, y) that become 0 at some intermediate point
1 ≤ i ≤ n − 1 equals the number of all paths from (0, −x) to (n, y).
Proof. Let A be the set of all paths from (0, x) to (n, y) that become 0 at some point
1 ≤ i ≤ n − 1, and let B be the set of all paths from (0, −x) to (n, y). We need to
show that |A| = |B|. Let (0, s0 ) → (1, s1 ) → ... → (n, sn ) be any path from A. By
definition we have s0 = x, sn = y and there exists 1 ≤ i ≤ n − 1 such that si = 0.
Define k = min{1 ≤ i ≤ n − 1 : si = 0}, i.e. the first time the path hits 0. Now define
a new path as follows (see Figure 9):
(12.4) \tilde{s}_i = \begin{cases} -s_i, & \text{for } 0 \le i \le k, \\ s_i, & \text{for } k < i \le n. \end{cases}
Figure 9. One of the paths from (0, 1) to (5, 2) that hits 0 in an intermediate
point (at step 3 in this example). Part of the path that is reflected with respect to
the x axis is depicted by dashed lines. This reflection generates a path from (0, −1)
to (5, 2).
Clearly we have a path from (0, −x) to (n, y). Such reflection defines a one-to-one
(injective) mapping from A to B implying that |A| ≤ |B|.
To show |A| ≥ |B| take any path from B. Since x, y > 0 the path must intersect 0
before reaching y. Let 1 ≤ i ≤ n − 1 be the first time the path becomes 0. Then the
reflected path given by (12.4) defines a one-to-one correspondence between B and A,
hence |B| ≤ |A| and the proof is complete.
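The reflection argument can be verified by enumeration for small parameters; a sketch:

```python
from itertools import product

def paths(start, n):
    # all walk paths of length n starting at height `start`
    for steps in product((1, -1), repeat=n):
        heights = [start]
        for s in steps:
            heights.append(heights[-1] + s)
        yield heights

def check_reflection(x, y, n):
    # paths from (0,x) to (n,y) hitting 0 at an intermediate step 1..n-1
    lhs = sum(1 for h in paths(x, n)
              if h[-1] == y and any(v == 0 for v in h[1:-1]))
    # all paths from (0,-x) to (n,y)
    rhs = sum(1 for h in paths(-x, n) if h[-1] == y)
    return lhs, rhs

for n in range(2, 10):
    for x in range(1, 4):
        for y in range(1, 4):
            lhs, rhs = check_reflection(x, y, n)
            assert lhs == rhs
print("reflection principle verified for small n, x, y")
```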
Notice that each realization of a vote-counting process is a path that leads from
(0, 0) to (α + β, α − β). The counting scenarios where candidate A is always ahead of
candidate B correspond to the paths that stay strictly positive after the first step.
Proof. Define S0 = 0 and for n ≥ 1 set Sn = X1 + ... + Xn, where Xi = 1 if the
i-th vote was for A, and Xi = −1 if the i-th vote was for B.
12.2. Arcsin law. Recall that we assume the walk is symmetric. In this section we
will study distributions of first and last visits to 0 of a walk starting from the origin.
Lemma 12.3. Let Sn be as in (12.1). Then

P(S1 ≠ 0, S2 ≠ 0, ..., S2n ≠ 0) = P(S2n = 0).
Proof. We will show that the numbers of paths on both sides of the equality in question
are equal. This will complete the proof, since the walk is symmetric and hence all paths
of length 2n have the same probability 2^{−2n}.
The walk starts from 0, hence in 2n steps it can reach at most 2n; thus

P(S1 > 0, S2 > 0, ..., S2n > 0) = Σ_{k=1}^{n} P(S1 > 0, ..., S2n−1 > 0, S2n = 2k).
For each 1 ≤ k ≤ n the number of paths with S1 > 0, ..., S2n−1 > 0, S2n = 2k equals

N(1,1)↦(2n,2k) − N⁰(1,1)↦(2n,2k)
  = N(0,1)↦(2n−1,2k) − N⁰(0,1)↦(2n−1,2k)      (shifting time)
  = N(0,1)↦(2n−1,2k) − N(0,−1)↦(2n−1,2k)      (by the reflection principle, Theorem 12.1)
  = N(0,0)↦(2n−1,2k−1) − N(0,0)↦(2n−1,2k+1)    (shifting location),

where the first line uses that such a path must pass through (1, 1) and never return to 0.
Thus the number of paths with S1 > 0, S2 > 0, ..., S2n = 2k for some 1 ≤ k ≤ n equals

Σ_{k=1}^{n} ( N(0,0)↦(2n−1,2k−1) − N(0,0)↦(2n−1,2k+1) )
  = N(0,0)↦(2n−1,1) − N(0,0)↦(2n−1,2n+1) = \binom{2n−1}{n},

since the sum telescopes; here N(0,0)↦(2n−1,2n+1) = 0, as 2n + 1 is unreachable in
2n − 1 steps, and we used (12.2) for the last equality. In view of the symmetry the number
of paths that are always negative equals the number of paths that are always positive. Hence
the number of paths such that S1 ≠ 0, S2 ≠ 0, ..., S2n ≠ 0 becomes

(12.5)  2 \binom{2n−1}{n} = \binom{2n}{n} = N(0,0)↦(2n,0),
where the last equality follows by (12.2). We see that the number of paths that never
hit 0 equals the number of paths that end at 0 at time 2n. This settles the lemma as
noted at the beginning of the proof.
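As a sanity check, the identity of Lemma 12.3 can be verified exactly for small n by enumerating all 2^{2n} equally likely paths. This is an illustration of ours, not part of the notes:

```python
from itertools import product
from math import comb

def never_zero_and_end_zero(n):
    """Enumerate all 2^(2n) paths of the symmetric walk; count those that
    avoid 0 at every step 1..2n, and those that end at 0 at step 2n."""
    never_zero = end_zero = 0
    for steps in product((-1, 1), repeat=2 * n):
        s, hit = 0, False
        for step in steps:
            s += step
            if s == 0:
                hit = True
        never_zero += not hit
        end_zero += (s == 0)
    return never_zero, end_zero

for n in (1, 2, 3, 4):
    a, b = never_zero_and_end_zero(n)
    print(n, a, b)  # a == b == C(2n, n), in agreement with (12.5)
```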
We now consider the last visits to zero. Let L2n := sup{m ≤ 2n : Sm = 0}, and
set u2m := P(S2m = 0).
Lemma 12.4. For each 1 ≤ k ≤ n we have P(L2n = 2k) = u2k u2n−2k .
Proof. Notice that L2n = 2k means the walk is at 0 at step 2k and does not return to 0
at any of the steps 2k + 1, ..., 2n. Hence

P(L2n = 2k) = P(S2k = 0, S2k+1 ≠ 0, ..., S2n ≠ 0)
  = P(S2k = 0) P(S2k+1 ≠ 0, ..., S2n ≠ 0 | S2k = 0)
  = u2k P(S1 ≠ 0, ..., S2n−2k ≠ 0)
  = u2k u2n−2k,   (using Lemma 12.3)
where the passage from the second to the third row is due to the fact that a walk
conditioned to be 0 at step 2k has the same distribution as the walk starting at 0. The
lemma is proved.
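Lemma 12.4 can likewise be checked exactly with rational arithmetic, using u2m = C(2m, m)/4^m. The sketch below is ours, not from the notes; it also confirms the boundary case k = 0, where P(L2n = 0) = u2n by Lemma 12.3.

```python
from itertools import product
from fractions import Fraction
from math import comb

def u(two_m):
    """u_{2m} = P(S_{2m} = 0) = C(2m, m) / 4^m, as an exact rational."""
    m = two_m // 2
    return Fraction(comb(two_m, m), 4 ** m)

def last_zero_distribution(n):
    """Exact law of L_{2n} (last visit to 0 up to time 2n) by enumerating
    all 4^n equally likely paths of length 2n."""
    counts = {2 * k: 0 for k in range(n + 1)}
    total = 4 ** n
    for steps in product((-1, 1), repeat=2 * n):
        s, last = 0, 0
        for i, step in enumerate(steps, start=1):
            s += step
            if s == 0:
                last = i
        counts[last] += 1
    return {m: Fraction(c, total) for m, c in counts.items()}

n = 4
dist = last_zero_distribution(n)
for k in range(n + 1):
    assert dist[2 * k] == u(2 * k) * u(2 * n - 2 * k)  # Lemma 12.4
print("Lemma 12.4 verified for n =", n)
```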
Theorem 12.5. (Arcsin law for the last visit to 0) For any 0 < a < b < 1 we have

P( a < L2n/(2n) < b ) → (1/π) ∫_a^b dx/√(x(1−x)),  as n → ∞.
Remark 12.6. To see why this law is called arcsin, substitute √x = y in the integral,
which leads to

∫_a^b dx/√(x(1−x)) = 2 ∫_{√a}^{√b} dy/√(1−y²) = 2(arcsin √b − arcsin √a).

Since arcsin(1/√2) = π/4 we get

(12.7)  P( L2n/(2n) < 1/2 ) → 1/2  as n → ∞.
Here is an interesting interpretation of this law in gambling terms. Assume two players,
each starting with capital 0, gamble on fair coin tosses. It follows from (12.7) that in a
long series of tosses, with probability close to 1/2, one of the players stays ahead for
the entire second half of the series.
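The limit (12.7) is easy to probe numerically. The sketch below is ours, not from the notes: it computes P(L2n/(2n) < 1/2) exactly from Lemma 12.4, evaluating u2m by the recurrence u2m = u2m−2 · (2m − 1)/(2m), which follows from u2m = C(2m, m)/4^m.

```python
def u_values(n):
    """u_{2m} = P(S_{2m} = 0) for m = 0..n, via u_{2m} = u_{2m-2}*(2m-1)/(2m)."""
    u = [1.0]
    for m in range(1, n + 1):
        u.append(u[-1] * (2 * m - 1) / (2 * m))
    return u

def prob_last_zero_in_first_half(n):
    """P(L_{2n}/(2n) < 1/2) = sum of u_{2k} * u_{2n-2k} over 2k < n (Lemma 12.4)."""
    u = u_values(n)
    return sum(u[k] * u[n - k] for k in range(n + 1) if 2 * k < n)

print(prob_last_zero_in_first_half(5000))  # close to 1/2, as (12.7) predicts
```

The value falls just short of 1/2 because the symmetric distribution of L2n puts a small positive mass at the midpoint 2k = n.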
where the last equality is due to (12.8). Let f(x) = 1/√(x(1−x)), where 0 < x < 1.
Writing the Riemann sum of f on [a, b] for the partition

{ a_n + k/n : k = 0, 1, ..., (b_n − a_n)n }

we get

(1/n) Σ_{k=0}^{(b_n−a_n)n} f(a_n + k/n)
  = Σ_{k=0}^{(b_n−a_n)n} 1/√( (na_n + k)(n − na_n − k) )
  = Σ_{k=na_n}^{nb_n} 1/√( k(n − k) )  →  ∫_a^b f(x) dx,  as n → ∞.

Combining this with (12.9) and taking into account that β_{n,k} → 1, the proof is complete.
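Numerically, the convergence of this sum is quick to observe. The check below is ours (the choice n = 10^5 and the interval [0.2, 0.7] are arbitrary); the limiting integral is evaluated via Remark 12.6.

```python
from math import asin, sqrt, ceil, floor

def walk_sum(n, a, b):
    """sum_{k = ceil(na)}^{floor(nb)} 1/sqrt(k(n-k)), the sum from the proof."""
    return sum(1.0 / sqrt(k * (n - k)) for k in range(ceil(n * a), floor(n * b) + 1))

a, b = 0.2, 0.7
limit = 2 * (asin(sqrt(b)) - asin(sqrt(a)))  # integral of f over [a, b], Remark 12.6
print(walk_sum(10**5, a, b), limit)  # the two values agree to several decimals
```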
Web: https://www.haykaleksanyan.com