
PROBABILITY LECTURE NOTES

HAYK ALEKSANYAN

Contents
1. Events and probabilities
1.1. Notion of a probability
1.2. Outcomes and events
1.3. Probabilities
1.4. Probability space
2. Combinatorial analysis
2.1. Multiplication principle
2.2. Permutations
2.3. Sampling
2.4. Binomial theorem
2.5. Multinomial coefficient
3. Basic properties of probability
3.1. Continuity of probability measure
3.2. Boole’s inequality
3.3. Graph coloring and Ramsey numbers
4. Inclusion-exclusion principle and related inequalities
4.1. Inclusion-exclusion principle
4.2. Bonferroni’s inequalities
5. Independent events and conditional probability
5.1. Independence
5.2. Conditional probability
5.3. Law of total probability
5.4. Bayes rule
6. Discrete probability distributions
6.1. Important examples
6.2. Bernoulli approximation to Poisson
7. Discrete random variable
7.1. Discrete random variable and its probability mass function
7.2. Functions of random variables
7.3. Expectation and variance
7.4. Conditional expectation
7.5. Indicator functions
8. Multivariate discrete distribution. Expectation in the multivariate case
8.1. Expectation in the multivariate case
9. Independent random variables
9.1. Independence of two random variables
9.2. Independence of n random variables
10. Inequalities
10.1. Jensen’s inequality
10.2. Arithmetic mean - Geometric mean inequality
10.3. Cauchy-Schwarz inequality
10.4. Covariance and correlation
10.5. Information entropy
11. Weak Law of Large Numbers
11.1. Markov and Chebyshev
11.2. WLLN - easy version
11.3. Probabilistic proof of the Weierstrass approximation theorem
12. Simple symmetric random walk
12.1. Random walks on Z and the reflection principle
12.2. Arcsin law
References

1. Events and probabilities


This section introduces the notion of probability and discusses how one may formalize
uncertainty in an experiment. We also formulate basic concepts around probability, such
as sample space, event space and probability space.
1.1. Notion of a probability. The outcomes of many actions cannot be predicted
with certainty in advance; simple examples are rolling a die or tossing a coin.
Probability theory studies such actions and their consequences.
The mathematical theory of probability began in seventeenth-century France when
two renowned French mathematicians B. Pascal and P. Fermat developed an interest in
understanding games involving chance. The game they considered is the following:
Two players A and B play a series of games where the winner of each game
gets a point. The player who reaches 10 points first wins the series. When
A had 8 points and B had 7 they were forced to stop the game. How should
they divide the stake if both A and B are equally skilled?

Today, probability theory is a well-established branch of mathematics with applications
in other areas of mathematics and beyond, such as machine learning, mathematical
finance, data science, weather prediction, and medical trials, to name a few.
At the heart of the mathematical theory of probability is the idea of an experiment.
It encapsulates the course of action with an uncertain outcome. The mathematical
object modeling such experiment is called a probability space, which in broad terms
comprises three building blocks:
1. the set of all possible outcomes of the experiment,
2. the list of all events which may possibly occur as a consequence of the experi-
ment,
3. an assessment of the likelihoods of these events.
For instance, consider tossing of a (fair) coin as our experiment. Then, following the
points above
1. the set of all possible outcomes is the set {H, T } for Heads and Tails,
2. the possible events are no toss, toss of H, toss of T, or a toss of H or T,
3. both H and T are equally likely to occur.
Later, we will study such examples in greater detail and with mathematical rigor.

1.2. Outcomes and events. We now formalize the above discussion about experiments
and their outcomes. Let E be some experiment whose outcome is not deterministic:
for the sake of illustration consider rolling a six-sided die. Then the set of all
possible outcomes of the experiment, usually denoted by Ω and referred to as the
sample space, is
Ω = {1, 2, 3, 4, 5, 6},
in correspondence with the faces of the die. Questions one may ask about the experiment
E include: “is the outcome equal to 3?”, “is it smaller than 5?”, “is it a prime
number?”, etc. Such questions can be reformulated in terms of subsets of Ω. For
example, the last question is equivalent to asking whether the outcome of the experiment
lies in {2, 3, 5}, which is a subset of Ω. Next, we collect all subsets of Ω that are
of interest to us into a list; each member of this list corresponds to a statement
about the outcome of the experiment, and we will refer to such a subset as an event.
There is a natural link
between events and subsets of Ω, indeed take two subsets A and B of Ω, then

• A ∪ B corresponds to the event “either A or B occurred”,


• A ∩ B corresponds to the event “both A and B occurred”,
• Ω \ A corresponds to the event “A did not occur”.
We thus see that set-theoretic concepts applied to subsets of Ω can be put in correspon-
dence with the outcomes of the experiment E.
Given an experiment E, let Ω be the set of all possible outcomes of the experiment
and let F := {Ai : i ∈ I} be the collection of subsets of Ω that are of interest to us (the
events) in relation to E. In the light of the above correspondence between set-theoretic
operations and the outcomes of the experiment, we need to require some consistency
from the set F which we formalize next.
Definition 1.1. (Event space) A collection F of subsets of a sample space Ω is called
an event space if
(1.1a) F is not empty,
(1.1b) if A ∈ F then Ω \ A ∈ F,
(1.1c) if A1, A2, ... ∈ F then ∪_{i=1}^∞ Ai ∈ F.
The axioms above simply state that F is a non-empty set of subsets that is closed
under taking complements and countable unions. In measure-theoretic context, the
event space as defined above is identical to the concept of σ-algebra: event space
is simply a σ-algebra defined on Ω. There are several straightforward but important
implications from Definition 1.1.
Property 1.1. Let Ω be a sample space and F be an event space on Ω. Then
1. F contains ∅ and Ω,
2. F is closed under finite unions,
3. F is closed under finite or countable intersections.
Proof. For the first item, observe that by (1.1a) there exists A ⊂ Ω that lies in F.
Then in view of (1.1b) we get Ac = Ω \ A ∈ F. Now set A1 = A, A2 = Ω \ A, and
Ai = A for all i = 3, 4, ... . Thanks to (1.1c) we have
Ω = A ∪ (Ω \ A) = ∪_{i=1}^∞ Ai ∈ F.
Since Ω ∈ F we get ∅ = Ω \ Ω ∈ F due to (1.1b), proving item 1.
To prove that an event space is closed under operations of finite union, take any
collection A1 , ..., An ∈ F and define Ai = ∅ for i ≥ n + 1. We showed above that the
empty set lies in F hence (1.1c) concludes the claim on finite unions.
Finally, to see that an event space is closed under finite or countable intersections
we use de Morgan’s law, which states that for any collection Ai ∈ F, i = 1, 2, ... one has
Ω \ ∩_{i=1}^∞ Ai = ∪_{i=1}^∞ (Ω \ Ai).
The right-hand side of this equality is in F due to (1.1b) and (1.1c), therefore the
left-hand side is also in F, and applying (1.1b) once more gives ∩_{i=1}^∞ Ai ∈ F. □

Let us consider a few examples of Ω and an event space F.

Example 1.1.1. Ω is any non-empty set and F = 2^Ω, the power set¹.

¹ For any set Ω the notation 2^Ω stands for the set of all subsets of Ω, i.e. 2^Ω = {A : A ⊂ Ω}.
This set is called the power set of Ω.

Example 1.1.2. Ω is any non-empty set and F = {∅, Ω}.

The two examples above represent the edge cases: for a given Ω the first is the
largest possible event space and the second is the smallest, in terms of inclusion.
Remark 1.2. Informally, we may think of F as the set of all events whose occurrence
we can verify in a particular realization of the experiment. In other words, F is
the set of all questions one may ask about the outcome of the experiment that can
be answered.

Example 1.2.1. Let our experiment be rolling a fair six-sided die. Assume that the
outcome of the roll is not directly revealed to us; one can only learn whether the roll
was even or odd (e.g. we want to build a simplistic random-bit generator). Then Ω =
{1, 2, 3, 4, 5, 6} and
F = {∅, {1, 3, 5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}}.

Example 1.2.2. Ω is any non-empty set, A ⊂ Ω and F = {∅, Ω, A, Ω \ A}. If A
is a non-empty proper subset of Ω (i.e. A ≠ ∅ and A ≠ Ω), then F is the smallest
event space containing A.

In the following exercises Ω is a set and F is an event space as formulated in
Definition 1.1.

Exercise 1.2.1. For any A, B ∈ F show that A ∩ B ∈ F and A \ B ∈ F.

Exercise 1.2.2. Let A1, ..., An ∈ F and let B be the set of all points in Ω
that belong to exactly k of the Ai-s, where 1 ≤ k ≤ n is fixed. Show that B ∈ F.

Exercise 1.2.3. Let Ω be finite. Show that F has an even number of elements.

→ Problem 1.2.1. Consider the set of positive integers N. A subset A ⊂ N is said
to have Cesaro density γ(A) if the limit
lim_{n→∞} |A ∩ {1, 2, ..., n}| / n
exists and equals γ(A). Show that the set of subsets of N that have Cesaro density is
not closed under intersection.

1.3. Probabilities. Getting back to the building blocks of a probability space discussed
in subsection 1.1, notice that we have not yet allocated any quantitative measure to an
event. In this section we assign to each event A ∈ F a probability, and will write P(A)
for its value. The axiomatic approach that comes next is due to A. N. Kolmogorov.
Definition 1.2. (Probability measure) A mapping P : F → R is called a probability
measure on (Ω, F) if
1. 0 ≤ P(A) ≤ 1 for any A ∈ F,
2. P(∅) = 0 and P(Ω) = 1,
3. for any collection A1, A2, ... ∈ F of disjoint events, i.e. Ai ∩ Aj = ∅ for i ≠ j,
we have
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

The last condition is referred to as countable additivity (σ-additivity for short)
of the measure. Notice that the probability measure P is defined only on the subsets²
of Ω that lie in F. A few remarks are in order:

(a) The condition that P(∅) = 0 is redundant and can be inferred from the rest of
the axioms. Indeed, take A1 = Ω and Ai = ∅ for i ≥ 2. In view of condition 3
of Definition 1.2 we have
P(Ω) = P(Ω) + Σ_{i=2}^∞ P(∅),
hence P(∅) = 0.
(b) P is finitely additive, i.e. for a collection of disjoint events Ai, i = 1, 2, ..., n we
have
P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai).
This follows from countable additivity taking Ai = ∅ for all i = n + 1, ... .
(c) The condition that P(A) ≤ 1 for any A ∈ F is also redundant and follows from
the monotonicity of the probability measure. Indeed, take any A, B ∈ F where
A ⊂ B. Then B = A ∪ (B \ A) and since the two sets in the union are disjoint
the finite additivity proved above implies
(1.2) P(B) = P(A) + P(B \ A) ≥ P(A),
where the inequality is due to non-negativity of P. Since for any A ∈ F we have
A ⊂ Ω, using monotonicity of P we get P(A) ≤ P(Ω) = 1.

Example 1.2.3. Let Ω be a non-empty set and A ⊂ Ω a non-empty proper subset.
Then any probability measure P defined on {∅, Ω, A, Ω \ A} has the form
P(∅) = 0, P(Ω) = 1, P(A) = p and P(Ω \ A) = 1 − p,
for some 0 ≤ p ≤ 1.

Example 1.2.4. (Equiprobable sample space) Let Ω = {ω1, ..., ωn} and F = 2^Ω.
For each A ∈ F let |A| be the cardinality of A (the number of elements in A) and define
P(A) := |A| / n.
Then P is a probability measure on (Ω, F).

Exercise 1.2.4. Let Ω = {ω1, ω2, ...} be an at most countable set (finite or countable)
and let pi ≥ 0 be such that Σ_i pi = 1. Let F be an event space defined on Ω and for
A ∈ F set
P(A) := Σ_{i: ωi ∈ A} pi.
Show that P is a probability measure.

² The structural assumption on F is instrumental in defining P. A result due to S. Banach and
K. Kuratowski (1929) states that under the continuum hypothesis there is no non-vanishing σ-additive
measure µ defined for all subsets of [0, 1] such that µ({x}) = 0 for all x ∈ [0, 1].

1.4. Probability space. We combine the definitions of the event space and probability
measure into the following.
Definition 1.3. A probability space is a triplet (Ω, F, P) such that
(a) Ω is a non-empty set,
(b) F is an event space of subsets of Ω,
(c) P is a probability measure on (Ω, F).

A few important properties of a probability space are stated below.

Property 1.3. Let (Ω, F, P) be a probability space. Then


(1) P(A) + P(Ω \ A) = 1 for any A ∈ F,
(2) for any A, B ∈ F with A ⊂ B one has P(A) ≤ P(B),
(3) for any A, B ∈ F we have P(A ∪ B) + P(A ∩ B) = P(A) + P(B).
Proof. For (1) observe that Ω = A ∪ (Ω \ A). The sets in the union are non-intersecting
events, hence, in view of finite additivity of P and item 2 of Definition 1.2 we get (1).
The proof of (2) - the monotonicity of P, is given in (1.2).
For part (3) notice that
A = (A ∩ B) ∪ (A \ B),
B = (A ∩ B) ∪ (B \ A),
and
A ∪ B = (A \ B) ∪ (B \ A) ∪ (A ∩ B).

Figure 1. In the union A ∪ B the subset A ∩ B contributes to both P(A) and P(B),
so we need to subtract P(A ∩ B) to compensate for this double contribution.

In all three equalities the left-hand side is represented as a union of disjoint events.
Thus, employing the finite additivity of a probability measure we get
P(A) + P(B) = P(A ∩ B) + P(A \ B) + P(A ∩ B) + P(B \ A) = P(A ∩ B) + P(A ∪ B),
completing the proof. □

Rearranging the equality in (3) of Property 1.3 we obtain


(1.3) P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
which is the inclusion-exclusion principle for two sets. We will return to this equality
later on and establish its generalization for the case of n events.

Exercise 1.3.1. Use the inclusion-exclusion principle to count the number of
integers from 1 to 2021 that are divisible by 2 or 7.

Exercise 1.3.2. Consider the coin tossing experiment where we toss a coin 5
times. Describe the probability space for this experiment where
1. we are interested in all outcomes of the 5 tosses (2^(2^5) events),
2. we are only interested in the total number of heads that occur in the 5 tosses
(2^6 events).

2. Combinatorial analysis
In this section we discuss various counting principles with an eye on their probabilis-
tic interpretation.

2.1. Multiplication principle. Consider this simple question: in how many different
ways can we form a 3-question exam ticket out of 5 probability, 7 analysis and 4 geometry
questions, if all 3 topics must be represented on the ticket? The answer is 5 × 7 × 4 in
view of the multiplication principle, also known as the fundamental rule of counting,
described below.
The rule allows us to compute the number of outcomes of a sequence of k experiments
Ei with i = 1, 2, ..., k, when there are m1 possible outcomes for the first experiment, m2
for the second, and so on up to mk for the last one. The outcome of the
joint experiment is an ordered k-tuple, and the total number of such tuples equals (see
Figure 2)
m1 · m2 · ... · mk.

Figure 2. It is convenient to think about such k-tuples as leaves of a tree. Namely,
the root of the tree is the starting position where no experiment has been performed yet;
there are m1 nodes at the first level representing the outcomes of the first step, and
then each node on the first level has m2 children for the second choice, and so on
up to the k-th level. The set of all sequences of length k is then the set of all paths
starting from the root of the tree and ending at a leaf. The image depicts such a tree
corresponding to two steps, when there are m1 outcomes for the first step and m2
outcomes for the second one. Notice that this tree has m1 · m2 leaves.

Example 2.0.1. Consider sequences of length k made of {H, T}, corresponding to k
tosses of a coin. The total number of such sequences is 2^k.

Example 2.0.2. The number of words of length k over an alphabet with n letters is n^k.
Notice that this is a generalization of the previous example. Indeed, if we take {H, T}
as our alphabet, i.e. n = 2, and consider words of length k (i.e. sequences of H-s and
T-s of length k), we get 2^k such words.

2.2. Permutations. Another fundamental concept we will encounter in combinatorics
related to probability theory is the permutation.

Definition 2.1. A permutation of a set S is a bijection from S to itself.

When the set S is finite, any permutation of S is simply an ordered arrangement of
its elements. For example, when S = {1, 2, 3} the set of all permutations of S is the
following:
123, 132, 213, 231, 312, 321.
In general, we can compute the number of ways integers 1, 2, ..., n can be ordered
from the multiplication principle. Indeed, there are n choices for the first element,
n − 1 for the second, so on until 1 choice for the last one, hence the total number of
permutations becomes
n × (n − 1) × ... × 2 × 1 = n!.

Exercise 2.0.1. In how many different ways can n different hats be assigned
to n different people so that each person gets exactly one hat?
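For small n the n! count can be checked by listing the permutations explicitly; a sketch:

```python
from itertools import permutations
from math import factorial

# Each permutation of range(n) is one way to assign n distinct hats
# to n people, one hat per person.
n = 5
assignments = list(permutations(range(n)))
assert len(assignments) == factorial(n)  # 5! = 120
```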

2.2.1. Stirling’s formula. We will frequently encounter factorials in estimating
probabilities. The fundamental tool in dealing with factorials is Stirling’s formula,
which we recall here:
(2.1) n! ∼ √(2πn) · (n/e)^n, as n → ∞.
The notation ∼ means that the ratio of the left and right sides tends to 1 as n → ∞.

We will prove the weak version of Stirling’s approximation which is still useful and
refer to analysis textbooks for the proof of the full version stated above.

Lemma 2.1. (Toy version of Stirling) We have log n! ∼ n log n as n → ∞.


Proof. Notice that log n! = Σ_{k=1}^n log k. By monotonicity of x ↦ log x we have
log k ≤ ∫_k^{k+1} log x dx ≤ log(k + 1), for k = 1, 2, ...,
hence ∫_{k−1}^k log x dx ≤ log k ≤ ∫_k^{k+1} log x dx, for k = 2, 3, ... . Summing these
inequalities over k and using the fact that log 1 = 0 we get
(2.2) ∫_1^n log x dx ≤ log n! ≤ ∫_1^{n+1} log x dx.
Integration by parts gives ∫_1^n log x dx = n log n − n + 1; plugging this (with n and
with n + 1) into (2.2) shows that log n! = n log n + O(n), which completes the proof. □
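The quality of the approximation (2.1) is easy to probe numerically; the sketch below compares n! against the right-hand side of (2.1) for a few values of n.

```python
from math import factorial, sqrt, pi, e

def stirling(n):
    """Right-hand side of (2.1): sqrt(2*pi*n) * (n/e)**n."""
    return sqrt(2 * pi * n) * (n / e) ** n

# The ratio n! / stirling(n) approaches 1 as n grows
# (it is roughly 1 + 1/(12n)).
ratios = {n: factorial(n) / stirling(n) for n in (10, 50, 100)}
assert abs(ratios[100] - 1) < abs(ratios[10] - 1) < 0.01
```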

→ Problem 2.1.1. How many trailing 0-s does n! have? For example, there are none
for 0 ≤ n ≤ 4, there is one for 5 ≤ n ≤ 9, there are two for 10 ≤ n ≤ 14, etc.
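For experimenting with this problem, trailing zeros of n! can be counted directly from the decimal expansion; the brute-force helper below reproduces the values quoted in the problem statement (it does not reveal the closed-form answer).

```python
from math import factorial

def trailing_zeros(n):
    """Number of trailing 0-s of n!, computed directly from its digits."""
    s = str(factorial(n))
    return len(s) - len(s.rstrip("0"))

# Matches the examples in the problem statement.
assert [trailing_zeros(n) for n in (4, 5, 9, 10, 14)] == [0, 1, 1, 2, 2]
```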

→ Problem 2.1.2. Prove that the fractional parts³ of log n! are dense in [0, 1].
Namely, show that for any interval [a, b] ⊂ [0, 1] there exists n ∈ N such that
{log n!} ∈ [a, b].

2.3. Sampling. For some experiments we will be given a set X and will need to take
elements from X following some rule. For instance, we have a deck of cards (the set X)
and the experiment consists in choosing three cards from the deck. When choosing a
card we might put it back into the deck so that it can be chosen again, or set it aside
and enforce uniqueness of the selected items. In this generality we will refer to such a
process as sampling from X, and will discuss the following schemes of sampling from a set:
• ordered, with replacement,
• ordered, without replacement,
• without order, without replacement,
• without order, with replacement.
Here with (without) replacement refers to the same element being allowed (not allowed)
to appear more than once, and with (without) order means that the arrangement of the
elements does (does not) matter. For example, (1, 2, 3) is different from (2, 1, 3) if order
is enforced (i.e. in the ordered schemes); otherwise they are the same. The aim of this
section is to introduce counting principles that address the four cases listed above.
Table 2.3 summarizes the number of possibilities of sampling k times from n objects
with the schemes mentioned above. We next proceed to the detailed analysis of these schemes.
Table 2.3. The number of ways to sample k times from n objects.

                 with replacement        without replacement
ordered          n^k                     n!/(n−k)!
without order    (k+n−1)!/(k!(n−1)!)     n!/(k!(n−k)!)

2.3.1. Ordered with replacement. This is perhaps the easiest case of the four. We
have n objects and need to sample k of them, respecting order and allowing replacement.
Then we are looking for tuples of length k where for each position we have n
possible choices. In view of the multiplication principle, this gives n^k total choices.

Example 2.1.1. Let us compute the probability of getting at least one 6 in 10 throws
of a fair six-sided die. Set X = {1, 2, 3, 4, 5, 6}; then each outcome of
the experiment of throwing the die 10 times can be represented by a tuple of the form
(a1, ..., a10) where ai ∈ X. Thus the sample space becomes
Ω = {(a1, a2, ..., a10) : ai ∈ X for i = 1, 2, ..., 10},
with F = 2^Ω and P defined as the equiprobable measure on F, i.e.
P(E) = |E| / |Ω|, for any E ∈ F.
Elements of Ω can equivalently be described as outcomes of ordered sampling with
replacement from X, hence |Ω| = 6^10. We will compute the probability of not getting
a 6 in 10 throws, which in our notation is the event
E = {(a1, a2, ..., a10) : ai ∈ X \ {6} for i = 1, 2, ..., 10}.
³ For x ∈ R let ⌊x⌋ ∈ Z be the largest integer with ⌊x⌋ ≤ x, i.e. ⌊x⌋ is the closest integer to x not
exceeding it. The difference x − ⌊x⌋ is called the fractional part of x and is denoted by {x}. By
definition {x} ∈ [0, 1) for any x ∈ R. For example {5.6} = 0.6 while {8} = 0.

Similarly to Ω, the elements of E are the results of sampling with replacement and with
regard to order from the set X \ {6}, which has 5 elements. Hence |E| = 5^10. In particular,
the probability of getting at least one 6 in 10 throws becomes
1 − P(E) = 1 − |E| / |Ω| = 1 − (5/6)^10 ≈ 0.838.
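The arithmetic of this example can be verified exactly with rational numbers; a quick sketch:

```python
from fractions import Fraction

# |Ω| = 6**10 equally likely tuples, |E| = 5**10 tuples containing no 6.
p_no_six = Fraction(5 ** 10, 6 ** 10)
p_at_least_one_six = 1 - p_no_six
assert abs(float(p_at_least_one_six) - 0.838) < 0.001
print(float(p_at_least_one_six))  # ≈ 0.8385
```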
2.3.2. Ordered without replacement. So far, when discussing permutations, we were
shuffling n objects and taking all n of them. In certain scenarios we might need to
choose only a fixed part of the n elements, say 0 ≤ k ≤ n of them. For example,
consider a code of length 3 made of the digits 0 to 9, where each digit can be
used only once: what is the total number of such codes?
In this problem we are sampling from the set D := {0, 1, ..., 9}, hence n = 10.
Observe that when k = n the set of codes coincides with the set of permutations
of D. For k < n we can use the multiplication principle: there are n choices for
the first character of the code, n − 1 for the second, and so on until n − (k − 1) for
the last one. Hence the total number becomes
n · (n − 1) · ... · (n − (k − 1)) = n! / (n − k)!.
This quantity is usually denoted by P(n, k) or A_n^k. The sampling scheme discussed here
is also referred to as permutations of n objects taken k at a time.
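The falling product above is simple to compute directly; the sketch below checks the 3-digit code example against the closed form n!/(n−k)! and the standard-library helper math.perm.

```python
from math import factorial, perm

def falling_factorial(n, k):
    """n * (n-1) * ... * (n-k+1): ordered samples of size k without replacement."""
    count = 1
    for i in range(k):
        count *= n - i
    return count

# 3-digit codes over 0-9 with no repeated digit: 10 * 9 * 8 = 720.
assert falling_factorial(10, 3) == 720
# Agrees with n!/(n-k)! and with math.perm.
assert falling_factorial(10, 3) == factorial(10) // factorial(7) == perm(10, 3)
```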
2.3.3. Birthday problem. We now discuss a classical problem on counting, often re-
ferred to as a birthday paradox, which can be modeled using the scheme of ordered
sampling without replacement. Here is the statement of the problem.
What is the minimum number of people that should be present in the same
room, so that the probability that at least two of them have their birthdays
on the same day of the year is at least 1/2?
Since we have 365 days in a year (disregarding the leap year’s 1 additional day) it
might be tempting to guess that we need at least half of that, 182, as our answer. The
correct answer, however, is 23, significantly less than that!
To arrive at the answer, assume there are k people in the room and let n = 365 be
the number of days in a year. There are n^k arrangements of birthdays that k people
can have (ordered sampling with replacement), and we now compute the number of such
arrangements in which all people have birthdays on different days. The latter is
clearly a problem about permutations of n taken k at a time, as discussed in subsection
2.3.2. Thus, there are A_n^k of them. Hence the probability that all k people have
birthdays on different days is
A_n^k / n^k = (1/n^k) · n · (n − 1) · ... · (n − (k − 1))
= (1 − 1/n)(1 − 2/n) · ... · (1 − (k − 1)/n) =: f(k).
The probability that there are at least two people having birthdays on the same day is
1 − f(k). Plugging n = 365 into the above equality one may estimate that 1 − f(22) ≈ 0.47
and 1 − f(23) ≥ 0.5, thus the answer is k = 23.
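The estimates for f(22) and f(23) can be reproduced with a few lines of Python:

```python
def f(k, n=365):
    """Probability that k people all have distinct birthdays (the f(k) above)."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p

assert 1 - f(22) < 0.5    # ≈ 0.476
assert 1 - f(23) >= 0.5   # ≈ 0.507
# The smallest k with collision probability at least 1/2:
answer = next(k for k in range(1, 366) if 1 - f(k) >= 0.5)
print(answer)  # → 23
```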
2.3.4. Without order, without replacement. For permutations of n objects taken
0 ≤ k ≤ n at a time, as discussed in the previous section, the order of the elements
matters. For example, (0, 1, 2) and (1, 0, 2) are treated as different selections. There are
cases when we might need to disregard the order. For example, in how many ways 10

players can form a single group of 4 members? In this case the order of the players
forming the group does not matter. In general, assume we have n objects and need to
choose k of them. In view of the results of Section 2.3.2, there are n!/(n−k)! ways
to do this if we take the order of the elements into account. Each choice is an
ordered tuple of length k, which itself can be ordered in k! different ways. Thus, if we
disregard the order, each of the n!/(n−k)! selections is counted k! times. Hence
dividing this number by k! gives
n! / (k!(n − k)!),
which is the number of ways k elements can be chosen out of n without order and
without replacement. This number is usually denoted by C(n, k), is called a binomial
coefficient, and reads “n choose k”. This method of sampling is called combinations of
n objects taken k at a time.

Exercise 2.1.1. Show that C(n, k) = C(n, n − k) for any integer 0 ≤ k ≤ n.

Exercise 2.1.2. Show that for any 1 ≤ k ≤ n − 1 one has
C(n, k) = C(n − 1, k − 1) + C(n − 1, k).
Here C(n, 0) = C(n, n) = 1.

Exercise 2.1.3. Show that the number of subsets of {1, 2, ..., n} with 0 ≤ k ≤ n
elements equals $\binom{n}{k}$. Use this to prove that
$$\sum_{k=0}^{n} \binom{n}{k} = 2^n.$$
Interpret the result of Exercise 2.1.1 in terms of the subsets of {1, 2, ..., n}.
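The identities above are easy to sanity-check numerically; a small sketch using Python's `math.comb`:

```python
from math import comb

n = 10
# Symmetry (Exercise 2.1.1): C(n, k) == C(n, n-k)
assert all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
# Pascal's rule (Exercise 2.1.2): C(n, k) = C(n-1, k-1) + C(n-1, k)
assert all(comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)
           for k in range(1, n))
# Row sum (Exercise 2.1.3): all subsets of an n-element set
assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n
print("all identities hold for n =", n)
```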

→ Problem 2.1.3. Show that for any n ≥ 1 one has
$$\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}.$$
Hint: use Exercise 2.1.1 and the first part of Exercise 2.1.3. Compare the two sides of the equation
with the number of ways of selecting n-element subsets from a set of 2n elements.

2.3.5. Without order with replacement. This is arguably the most challenging sce-
nario of counting discussed so far: combinations of k objects from n with replacement
and without order. The setting of this sampling scheme is as follows: given n distinct
objects, in how many ways can we select k ≥ 0 of them if replacement is allowed (i.e. we
can reuse the same element more than once) and the order of the elements is disregarded?
Before formulating the main result consider a simple case when n = 3 and k = 2.
Assuming that we are selecting from {1, 2, 3}, the options we have⁴ are
11, 12, 13, 22, 23, 33.
Observe that each sample in the list is characterized uniquely by the number of 1-s, 2-s
and 3-s it contains (recall that we do not take the order of the tuple into account). In the
light of this observation we may rewrite the list above as
(2.3) (2, 0, 0), (1, 1, 0), (1, 0, 1), (0, 2, 0), (0, 1, 1), (0, 0, 2),
⁴If we consider ordered tuples instead, there are 3² = 9 options.
where the i-th entry of the 3-tuple indicates the count of i-s used, where i = 1, 2, 3.
We see that for a given 3-tuple in (2.3) the sum of its elements is 2, thus what we do,
effectively, is equivalent to representing k as a sum of n non-negative integers. We arrive
at the following problem: find the number of integer solutions to x₁ + x₂ + x₃ = 2 with
0 ≤ xᵢ ≤ k for all i = 1, 2, 3. This observation settles the general case.

Theorem 2.2. Given n ≥ 1 and k ≥ 0, there are
$$\binom{n+k-1}{k}$$
ways to select k objects from n with replacement and without order.

Proof. Assume the n objects are enumerated as 1, 2, ..., n. Then any collection of size
k from n with replacement and disregarding the order is uniquely represented by an n-
tuple of integers (x₁, ..., xₙ) where xᵢ is the number of times i was used in the collection
of size k. This correspondence implies that the count we are interested in equals the
number of integer solutions to the following equation

(2.4) x₁ + ... + xₙ = k, where 0 ≤ xᵢ ≤ k for all 1 ≤ i ≤ n.

Each equation of this form can be put in one-to-one correspondence with a sequence
of length k + n − 1 composed of | and + symbols, where each xᵢ > 0 is replaced by
xᵢ many | symbols and if xᵢ = 0 it is replaced by an empty symbol. Thus, for
example, the corresponding sequences for (2.3), with the full sequence of reductions we
did, become

11  (2, 0, 0)  ||++
12  (1, 1, 0)  |+|+
13  (1, 0, 1)  |++|
22  (0, 2, 0)  +||+
23  (0, 1, 1)  +|+|
33  (0, 0, 2)  ++||

(here x₁ + x₂ + x₃ = 2 and xᵢ ∈ {0, 1, 2} for all i = 1, 2, 3). We see that the number
of solutions to (2.4) equals the number of sequences of length k + n − 1 made of | and
+ symbols where the number of | symbols equals k and the number of + symbols is
n − 1. In view of section 2.3.4 the number of such sequences equals $\binom{n+k-1}{k}$,
which completes the proof. □
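Theorem 2.2 can be checked by brute-force enumeration; a small sketch, where `count_solutions` is a hypothetical helper counting the solutions of (2.4):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 5, 3

# Unordered selections with replacement, enumerated directly:
selections = list(combinations_with_replacement(range(n), k))
assert len(selections) == comb(n + k - 1, k)

# The solution-counting view of (2.4): x_1 + ... + x_n = k with x_i >= 0.
def count_solutions(n, k):
    if n == 1:
        return 1  # x_1 = k is forced
    return sum(count_solutions(n - 1, k - x) for x in range(k + 1))

assert count_solutions(n, k) == comb(n + k - 1, k)
print(len(selections))  # 35
```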

 Exercise 2.2.1. Compute the number of injective mappings from {1, 2, 3} to


{a, b, c, d}.

Exercise 2.2.2. We have n keys at our disposal and one and only one of them
is the right key for the given lock. At each step we randomly pick a key and try to unlock
with it. If we fail, we put the key back and select another one at random. What type of
sampling is this and in how many ways can we unlock the door for the first time at step
k?
This is sampling with replacement. There are $n^k$ total ways to reach step k, out of which $(n-1)^{k-1}$
are successful.
2.4. Binomial theorem.

Theorem 2.3. For any x, y ∈ R and any integer n ≥ 0 we have
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$

Proof. There are many proofs of this result, but here we will follow the combinatorial
approach. We will prove by induction on n that
$$(2.5)\quad (x + y)^n = \sum a_1 a_2 \cdots a_n,$$
where the sum is over all ordered n-tuples of the form (a₁, a₂, ..., aₙ) where each aᵢ is
either x or y, i.e. all strings of length n formed of {x, y}. The case of n = 0, 1 is trivial.
The claim is obviously true for n = 2 since
$$(x + y)^2 = xx + xy + yx + yy.$$
Assume that (2.5) is true for n − 1. To prove it for n notice that each string of length n
composed of {x, y} is constructed from a string of length n − 1 by adding either x or y at
the beginning. Hence writing $(x + y)^n = x(x + y)^{n-1} + y(x + y)^{n-1}$ and applying the
inductive hypothesis on $(x + y)^{n-1}$ completes the proof of (2.5). Now each summand
of (2.5) can be rearranged into the form $x^k y^{n-k}$ for some 0 ≤ k ≤ n. But for given
0 ≤ k ≤ n there are exactly $\binom{n}{k}$ summands of the form $x^k y^{n-k}$, since they are
formed by choosing the k indices out of 1, 2, ..., n for which aᵢ = x in (2.5), with the rest
of the aᵢ-s equal to y. Thus grouping the terms of the form $x^k y^{n-k}$ in (2.5), of which
there are $\binom{n}{k}$, completes the proof of the theorem. □
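The theorem is easy to verify numerically for small n; a minimal sketch:

```python
from math import comb

def binomial_expansion(x, y, n):
    """Right-hand side of the binomial theorem."""
    return sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))

# (3 + (-2))^n = 1 for every n:
for n in range(8):
    assert binomial_expansion(3, -2, n) == (3 + (-2)) ** n == 1
assert binomial_expansion(2, 5, 4) == 7 ** 4 == 2401
print("binomial theorem verified")
```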

2.5. Multinomial coefficient. Assume we are sampling from {1, 2, ..., k} with replace-
ment n times. In how many ways can we do this if the number i is sampled nᵢ times,
where i = 1, ..., k and n₁ + ... + nₖ = n? To answer this question we will use the
multiplication principle. The result of this experiment is an n-tuple of the form
(a₁, ..., aₙ) where aᵢ ∈ {1, ..., k}, and there are $\binom{n}{n_1}$ ways to fix the places of 1.
When all 1-s are placed there are $\binom{n-n_1}{n_2}$ ways to fix the places of 2, and so on
until $\binom{n-n_1-\cdots-n_{k-1}}{n_k}$ ways to place k.
Thus, due to the multiplication principle the total number of outcomes of such sampling
equals
$$\binom{n}{n_1} \binom{n-n_1}{n_2} \cdots \binom{n-n_1-\cdots-n_{k-1}}{n_k}.$$
Using the value of the binomial coefficient $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ the last expression simplifies to
$$(2.6)\quad \frac{n!}{n_1!\, n_2! \cdots n_k!} =: \binom{n}{n_1\, n_2\, \ldots\, n_k}.$$
The number in (2.6) is called a multinomial coefficient.

Example 2.3.1. In how many ways can we arrange 5 red balls, 3 blue balls and 4 green
balls in a line? For this problem n = 5 + 3 + 4 = 12 and we are asking for
$$\binom{12}{5\,3\,4} = \frac{12!}{5!\,3!\,4!} = 27720.$$
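A sketch of the computation in (2.6), where `multinomial` is a hypothetical helper (Python 3.8+ has no built-in multinomial):

```python
from math import factorial

def multinomial(*groups: int) -> int:
    """n! / (n_1! n_2! ... n_k!) where n is the sum of the group sizes."""
    n = sum(groups)
    result = factorial(n)
    for g in groups:
        result //= factorial(g)  # exact integer division
    return result

# 5 red, 3 blue, 4 green balls in a line (Example 2.3.1):
print(multinomial(5, 3, 4))  # 27720
```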
A similar argument that we used to prove Theorem 2.3 leads to the following.
Theorem 2.4. For any x₁, x₂, ..., xₖ ∈ R and any integer n ≥ 0 we have
$$(x_1 + x_2 + \cdots + x_k)^n = \sum \binom{n}{n_1\, n_2\, \ldots\, n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k},$$
where the sum is over all integers (n₁, ..., nₖ) with 0 ≤ nᵢ ≤ n and n₁ + ... + nₖ = n.

 Exercise 2.4.1. How many terms does the sum in Theorem 2.4 have?

 Exercise 2.4.2. Adapt the proof of Theorem 2.3 to prove Theorem 2.4.

3. Basic properties of probability


In this section we establish the continuity of the probability measure and prove Boole's
inequality. We then discuss an application of Boole's inequality in graph theory.

3.1. Continuity of probability measure.

Theorem 3.1. Let (Ω, F, P) be a probability space. Consider a sequence of monotone
events A₁, A₂, ... ∈ F. If the sequence is increasing, i.e. A₁ ⊂ A₂ ⊂ ..., then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n\to\infty} P(A_n).$$
For a decreasing sequence A₁ ⊃ A₂ ⊃ ... we have
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n\to\infty} P(A_n).$$

Proof. We consider the case of an increasing sequence first. Define a new sequence Bₙ as
follows,
$$B_1 = A_1, \qquad B_n = A_n \setminus \left(\bigcup_{i=1}^{n-1} A_i\right) = A_n \cap A_{n-1}^c.$$
By definition {Bₙ} is a sequence of disjoint events and
$$\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i \quad \text{and} \quad \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i \quad \text{for all } n \ge 1.$$
We thus get
$$\begin{aligned}
P\left(\bigcup_{i=1}^{\infty} A_i\right) &= P\left(\bigcup_{i=1}^{\infty} B_i\right) \\
&= \sum_{i=1}^{\infty} P(B_i) \quad \text{(by countable additivity)} \\
&= \lim_{n\to\infty} \sum_{i=1}^{n} P(B_i) \\
&= \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} B_i\right) \quad \text{(by finite additivity)} \\
&= \lim_{n\to\infty} P\left(\bigcup_{i=1}^{n} A_i\right) \\
&= \lim_{n\to\infty} P(A_n) \quad \text{(since } A_i \text{ is increasing)}.
\end{aligned}$$

To prove the theorem for decreasing events {Aᵢ} notice that {Aᵢᶜ} is increasing, hence
from the already proved case we have
$$P\left(\bigcup_{i=1}^{\infty} A_i^c\right) = \lim_{i\to\infty} P(A_i^c) = 1 - \lim_{i\to\infty} P(A_i).$$
For the left-hand side, in view of de Morgan's law, we have
$$\bigcup_{i=1}^{\infty} A_i^c = \left(\bigcap_{i=1}^{\infty} A_i\right)^c,$$
which combined with the previous equality settles the case of decreasing sequences. □

→ Problem 3.1.1. Prove that axiom 3 of Definition 1.2, namely the countable
additivity of the probability measure, can be replaced by finite additivity and continuity
in the sense of Theorem 3.1. More precisely, prove that axioms 1 and 2 of Definition 1.2
together with the following axiom: for any A₁, A₂, ... ∈ F with A₁ ⊂ A₂ ⊂ ... one has
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{i\to\infty} P(A_i),$$
and P is finitely additive, imply countable additivity as stated in Definition 1.2.
Hint: Take a sequence {Aᵢ} of disjoint events and define $B_i := \bigcup_{n=i}^{\infty} A_n$, where i = 1, 2, .... Notice
that {Bᵢ} is decreasing hence the continuity of the measure applies to it. Then use the finite additivity
on the events A₁, A₂, ..., Aₙ₋₁, Bₙ to conclude the proof.

3.2. Boole’s inequality.


Theorem 3.2. For any sequence of events A1 , A2 , ... we have
∞ ∞
!
[ X
P An ≤ P(An ).
n=1 n=1

Recall that for disjoint events we have equality in the above inequality thanks to the
σ-additivity of probability measure.
Proof. Set B₁ = A₁ and $B_i = A_i \setminus \left(\bigcup_{k=1}^{i-1} A_k\right)$. Clearly Bᵢ ⊂ Aᵢ and hence P(Bᵢ) ≤ P(Aᵢ)
for all i = 1, 2, ... in view of the monotonicity of the probability measure. Notice also
that {Bᵢ} are disjoint, and $\bigcup_i A_i = \bigcup_i B_i$, hence
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) \le \sum_{i=1}^{\infty} P(A_i). \qquad \square$$


3.3. Graph coloring and Ramsey numbers. In this section we will discuss an ap-
plication of Boole's inequality in graph theory. Consider a complete graph⁵ on n
vertices. A theorem due to F. Ramsey states that for any positive integers r and s
there is an integer N such that any red-blue coloring of the edges⁶ of a complete graph on
N vertices contains a subgraph on r vertices with all edges being blue, or a subgraph
on s vertices with all edges being red. The smallest such N is denoted by R(r, s) and is
called a Ramsey number.

Figure 3. A particular red-blue coloring (one of the $2^{\binom{5}{2}} = 1024$ possible variants)
of a complete graph on 5 vertices. Notice that we are coloring edges of the graph and
not vertices. Observe as well that there is a single monochromatic subgraph with 3
vertices with all edges being red.

The exact value of R(r, s) is not known for general r, s except for some special cases.
A particularly striking example of applications of probability to combinatorics is a lower
bound on R(s, s) due to P. Erdős (1947). The technique that we will use here is called
the probabilistic method and has far reaching applications beyond the graph coloring
problem (see [1] for more).

Proposition 3.3. If $\binom{n}{k} \cdot 2^{1-\binom{k}{2}} < 1$ then R(k, k) > n. In particular, $R(k, k) > \lfloor 2^{k/2} \rfloor$
for all k ≥ 3.
Proof. Consider a complete graph G on n vertices. By definition it has $\binom{n}{2} = \frac{n(n-1)}{2} =: N$
edges which we will denote by e₁, ..., e_N. We now define the experiment of randomly
coloring an edge blue (+1) or red (−1) and describe the probability space corre-
sponding to this experiment. The sample space of all possible colorings of the graph
becomes
$$\Omega = \{\vec{c} = (c_1, ..., c_N) : c_i \in \{-1, 1\}\},$$
and the event space is F = 2^Ω, the power set of Ω. We assume that coloring a single
edge either red or blue is equally likely and for any E ∈ F define the probability measure
⁵A graph is called complete if there is an edge between any pair of vertices.
⁶Each edge of a graph is colored either red or blue without any restrictions.
as
$$(3.1)\quad P(E) := \frac{|E|}{2^N},$$
where |E| is the number of elements in E. As discussed in Example 1.2.4 the triplet
(Ω, F, P) is indeed a probability space.
Take any subgraph on k vertices. Since G is a complete graph, the subgraph has
$\binom{k}{2} =: K$ edges. Without loss of generality assume the edges of this subgraph are e₁, ..., e_K.
Then, the event that this particular subgraph is monochromatic (i.e. all its edges have
the same color) is
$$E := \{(\underbrace{c, ..., c}_{K \text{ times}}, c_{K+1}, ..., c_N) : c_i \in \{-1, 1\} \text{ for } K < i \le N \text{ and } c \in \{-1, 1\}\}.$$
From here and (3.1), we get
$$P(E) = 2 \cdot 2^{-K} = 2^{1-\binom{k}{2}}.$$
The number of subgraphs on k vertices equals $\binom{n}{k} =: M$; let Aᵢ be the event
that the i-th such subgraph is monochromatic. By Boole's inequality we have
$$P\left(\bigcup_{i=1}^{M} A_i\right) \le \sum_{i=1}^{M} P(A_i) = M \cdot 2^{1-\binom{k}{2}} = \binom{n}{k} 2^{1-\binom{k}{2}} < 1.$$
Thus with positive probability none of the events Aᵢ occurs, hence there is a red-blue
coloring of G without a monochromatic subgraph on k vertices. From this, in view of the
definition of the Ramsey number, we get R(k, k) > n.
Finally, setting $n = \lfloor 2^{k/2} \rfloor$ we obtain
$$\binom{n}{k} 2^{1-\binom{k}{2}} = \frac{(n-k+1) \cdots n}{k!}\, 2^{1-\binom{k}{2}} \le \frac{n^k}{k!}\, 2^{1-\binom{k}{2}} \le 2\, \frac{2^{k^2/2}}{k!\, 2^{k(k-1)/2}} = \frac{2^{k/2+1}}{k!} < 1 \quad \text{for } k \ge 3. \qquad \square$$
k! 2 2 k!


Exercise 3.3.1. Show that $2^{k/2+1} < k!$ for k ≥ 3.

Exercise 3.3.2. Prove that among 6 people there are 3 who all know each
other or 3 people none of whom knows another⁷.
Construct an example of a group of 5 people such that there is no subgroup of 3
people that all know each other nor a subgroup of 3 people none of whom knows another.
This, coupled with the first claim, shows that R(3, 3) = 6.

→ Problem 3.3.1. Let A1 , ..., An ⊂ A be subsets of a non-empty set A where |Ai | = k


for all 1 ≤ i ≤ n. Assuming n < 2k−1 show that there is a red-blue coloring of A such
that each subset Ai contains elements of both colors.
Hint: Model the probability space corresponding to the 2-coloring and use probabilistic method with
Boole’s inequality.

⁷In terms of the Ramsey numbers we are trying to show that R(3, 3) ≤ 6. Notice also that Proposition
3.3 cannot be applied here, as for n = 6 and k = 3 the condition $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$ is not satisfied.
→ Problem 3.3.2. Consider a finite number of binary strings (i.e. sequences of
{0, 1}-s) of finite length. For each integer i ≥ 1 let Nᵢ be the number of such strings of
length i. Assume also that no string in the collection is a prefix of another string. Show
that
$$\sum_{i} \frac{N_i}{2^i} \le 1.$$
Hint: Notice that 2ⁱ is the total number of strings of length i. Then interpret what it means that a
shorter string is not a prefix of a longer string.
Note 1: This is called Kraft's inequality for prefix-free codes.
Note 2: There is an interesting realization of the above inequality on prefix codes generated by
binary trees. Consider a binary tree with a finite number of nodes. Then prove that
$$\sum_{v \text{ is a leaf}} 2^{-\mathrm{depth}(v)} \le 1,$$
where depth(v) is the length of the path from the root of the tree to the leaf. This claim is related to the
main inequality of this problem as follows: each path from the root to a leaf can be identified with a string
of {0, 1}-s (binary string) where we put 0 if we move to the left child of a node and 1 if we go to the
right. Such correspondence between binary strings and the leaves sets the stage for applying the inequality
of the problem. Try to fill in the details.
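The inequality is easy to evaluate on concrete prefix-free codes; a sketch with a hypothetical helper `kraft_sum`:

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Exact value of the sum of 2^{-len(w)} over the codewords,
    after checking the collection is prefix-free."""
    assert not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords), "not prefix-free"
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)

# Leaves of a complete binary tree sum to exactly 1:
print(kraft_sum(["0", "10", "110", "111"]))  # 1
# A smaller prefix-free code stays below 1:
print(kraft_sum(["0", "10", "110"]))         # 7/8
```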

4. Inclusion-exclusion principle and related inequalities


In this section we study the inclusion-exclusion principle, a fundamental tool in
counting problems, for a sequence of n events and discuss some of its applications. We
then look at inequalities related to this principle, known as Bonferroni's inequalities.

4.1. Inclusion-exclusion principle.

Proposition 4.1. (Inclusion-exclusion formula) Let (Ω, F, P) be a probability space and
A₁, ..., Aₙ ∈ F. Then
$$(4.1)\quad P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1} P(A_1 \cap \cdots \cap A_n).$$

A few remarks are in order. When Ω is a finite set, F = 2^Ω and P is the equiprobable
measure (see Exercise 1.2.4), then the inclusion-exclusion principle becomes a counting
formula that computes the number of elements in the union of the Aᵢ-s. Also note that
the terms on the right-hand side of (4.1) are the missing components that turn Boole's
inequality into an equality.
A more concise way of writing (4.1), emphasizing the fact that we are summing
over all possible intersections of the events Aᵢ, is the following
$$(4.2)\quad P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{\substack{I \subset \{1,2,...,n\} \\ I \ne \emptyset}} (-1)^{|I|-1} P\left(\bigcap_{j \in I} A_j\right).$$
Proof. The proof is by induction on n. The formula for n = 2 was proved in (1.3).
Using the case of n = 2 for the union up to n − 1 and Aₙ we get
$$P\left(\bigcup_{i=1}^{n} A_i\right) = P\left(A_n \cup \left(\bigcup_{i=1}^{n-1} A_i\right)\right) = P(A_n) + P\left(\bigcup_{i=1}^{n-1} A_i\right) - P\left(\bigcup_{i=1}^{n-1} (A_i \cap A_n)\right),$$
and, using the inductive hypothesis on the last two terms,
$$= P(A_n) + \sum_{\substack{I \subset \{1,...,n-1\} \\ I \ne \emptyset}} (-1)^{|I|-1} P\left(\bigcap_{j \in I} A_j\right) + \sum_{\substack{I \subset \{1,...,n-1\},\ I \ne \emptyset \\ \tilde{I} = I \cup \{n\}}} (-1)^{|\tilde{I}|-1} P\left(\bigcap_{j \in \tilde{I}} A_j\right).$$
The first sum in the above equality is the part of (4.1) without Aₙ, and the second one
together with P(Aₙ) covers all terms with Aₙ. This concludes the inductive step and
completes the proof. □
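Formula (4.1) can be sanity-checked on a small finite sample space with the equiprobable measure; a sketch (the events below are arbitrary illustrative choices, not from the notes):

```python
from itertools import combinations
from fractions import Fraction

# Finite sample space, equiprobable measure: P(E) = |E| / |Omega|.
omega = set(range(12))
events = [{0, 1, 2, 3}, {2, 3, 4, 5, 6}, {5, 6, 7, 0}]

def prob(e):
    return Fraction(len(e), len(omega))

lhs = prob(set.union(*events))
# Inclusion-exclusion: sum over non-empty sub-collections with sign (-1)^(r-1).
rhs = sum((-1) ** (r - 1) * sum(prob(set.intersection(*sub))
                                for sub in combinations(events, r))
          for r in range(1, len(events) + 1))
assert lhs == rhs
print(lhs)  # 2/3
```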

4.1.1. Number of derangements. As an application of the inclusion-exclusion prin-
ciple consider the following problem.
Two identical decks of cards are placed side by side and the first deck is
shuffled randomly. One by one we take a card from each of the decks until
the decks become empty. What is the probability that at some step we will
pick the same card?
Let us model the problem mathematically. Denote the number of cards in a deck
by n. Shuffling a deck corresponds to a permutation of {1, 2, ..., n}. With this in mind
define Ω to be the set of all permutations of {1, 2, ..., n}, set F = 2^Ω and let P be the
equiprobable measure on Ω. Clearly (Ω, F, P) is a probability space and |Ω| = n!. The
problem asks: if we take a random permutation (an element of Ω), what is the probability
that it preserves at least one element at its place⁸? To compute this probability define
Aᵢ = {σ ∈ Ω : σ(i) = i}, where i = 1, 2, ..., n (i.e. the permutation σ leaves i unaltered).
Then, for any non-empty I ⊂ {1, 2, ..., n} we have
$$P\left(\bigcap_{i \in I} A_i\right) = \frac{(n - |I|)!}{n!},$$

since we fix |I| positions and permute the rest n − |I| arbitrarily. Notice also that there
are $\binom{n}{|I|}$ ways to choose the |I| positions that will be fixed. Hence, by (4.1)
$$(4.3)\quad P\left(\bigcup_{i=1}^{n} A_i\right) = 1 - \binom{n}{2}\frac{(n-2)!}{n!} + \binom{n}{3}\frac{(n-3)!}{n!} - \cdots + (-1)^{n-1}\frac{1}{n!} = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!} \approx 1 - e^{-1},$$
⁸A permutation σ : {1, 2, ..., n} → {1, 2, ..., n} is called a derangement if for all 1 ≤ i ≤ n we
have σ(i) ≠ i. Notice that we may assume that the fixed deck of cards corresponds to the identity
permutation. Indeed, in any enumeration of all n! permutations, if σ₁ is the permutation corresponding
to the fixed deck, the complement of the event we are interested in is the set of all permutations σ such
that σ₁(i) ≠ σ(i) for all i = 1, 2, ..., n. Applying the inverse of σ₁ on both sides reduces the problem to
the case when the fixed deck corresponds to the identity.
20 HAYK ALEKSANYAN

where the last approximation is due to the Taylor expansion of $e^{-x}$. We conclude that the
probability of seeing the same card at some point from the two decks is approximately
$1 - e^{-1}$.
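The partial sums in (4.3) converge to 1 − e⁻¹ very quickly; a small sketch for a 52-card deck:

```python
import math
from math import factorial

def match_probability(n: int) -> float:
    """P(a random permutation of n cards has at least one fixed point),
    via the inclusion-exclusion sum 1 - 1/2! + 1/3! - ... from (4.3)."""
    return sum((-1) ** (k - 1) / factorial(k) for k in range(1, n + 1))

print(round(match_probability(52), 6))  # 0.632121
print(round(1 - math.exp(-1), 6))       # 0.632121
```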

 Exercise 4.1.1. Assume each of the two decks is shuffled randomly (instead
of one being fixed) and then one by one we pick a card from each deck until they are
empty. Model the probability space corresponding to this experiment and show that the
probability of getting the same card on each deck at some point is the same as in (4.3).

→ Problem 4.1.1. Let A = {1, 2, ..., k} and B = {1, 2, ..., n} where k, n ∈ N. Use the
inclusion-exclusion principle to compute the number of surjective (onto) functions from
A to B.
Hint: notice that the number of all functions from A to B is $n^k$ (sampling with replacement and
regarding the order). It is easier to compute the number of functions that are not surjective. To that
end, for each i ∈ B define Aᵢ as the set of all functions f : A → B that do not contain i in their image,
i.e. f(x) ≠ i for any x ∈ A. Then use the inclusion-exclusion principle with the Aᵢ-s.

→ Problem 4.1.2. For n ∈ N let φ(n) be Euler's totient function, i.e. φ(n) is
the number of integers 1 ≤ k ≤ n that are coprime with n. For example, φ(4) = 2,
φ(5) = 4, φ(11) = 10. Using the inclusion-exclusion principle prove that
$$\varphi(n) = n \prod_{\substack{p \text{ is prime} \\ p \mid n}} \left(1 - \frac{1}{p}\right),$$
i.e. the product is over all primes that divide n.
Hint: consider the prime factorization of n, let $n = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_k^{\alpha_k}$, where pᵢ is prime and αᵢ ∈ N.
Then an integer 1 ≤ m < n is coprime with n if and only if its prime factorization does not contain
any of the pᵢ-s. Now define Aᵢ = {1 ≤ m ≤ n : pᵢ | m} where i = 1, 2, ..., k and use the inclusion-exclusion
principle.

4.2. Bonferroni’s inequalities.


Proposition 4.2. (Bonferroni’s inequalities) Let (Ω, F, P) be a probability space and
A1 , A2 , ..., An ∈ F. For each 1 ≤ k ≤ n define
!
X \
Sk = P Ai .
I⊂{1,2,...,n} i∈I
|I|=k

Then for 1 ≤ k ≤ n odd we have


n k
!
[ X
(4.4) P Ai ≤ (−1)i−1 Si
i=1 i=1
and for 1 ≤ k ≤ n even we have
n k
!
[ X
(4.5) P Ai ≥ (−1)i−1 Si .
i=1 i=1

A few clarifying remarks are in order. Notice that Sk is simply the k-th term in the
inclusion-exclusion formula disregarding the sign. In particular, the  summation in Sk
is over all k-element subsets of {1, 2, ..., n} and hence there are nk terms in the sum.
Observe also that the case of k = 1 in (4.4) corresponds to Boole’s inequality for finite
number of terms.
Proof. The proof is by induction on n. When n = 2, 3 both inequalities (4.4)
and (4.5) follow from the inclusion-exclusion formula and the non-negativity of a
probability measure. Assume the inequalities are true for n − 1; we prove the case of n.
Let us first eliminate the edge cases when k = 1 or k = n. The former is Boole's
inequality, the latter is the inclusion-exclusion formula. Thus we will assume that
1 < k < n is odd and prove (4.4). Using formula (4.1) for two events, namely Aₙ and
$\bigcup_{i=1}^{n-1} A_i$, we get
$$P\left(\bigcup_{i=1}^{n} A_i\right) = P(A_n) + P\left(\bigcup_{i=1}^{n-1} A_i\right) - P\left(\bigcup_{i=1}^{n-1} (A_i \cap A_n)\right).$$
By the inductive hypothesis we can apply (4.4) to the first union in the above equality
and (4.5) with k − 1 to the second one (note that k − 1 ≥ 1 since we assumed k > 1),
arriving at
$$P\left(\bigcup_{i=1}^{n} A_i\right) \le P(A_n) + \sum_{i=1}^{k} (-1)^{i-1} \sum_{\substack{I \subset \{1,...,n-1\} \\ |I| = i}} P\left(\bigcap_{j \in I} A_j\right) - \sum_{i=1}^{k-1} (-1)^{i-1} \sum_{\substack{I \subset \{1,...,n-1\} \\ |I| = i}} P\left(A_n \cap \bigcap_{j \in I} A_j\right)$$
$$= P(A_n) + \sum_{\substack{J \subset \{1,...,n-1\} \\ 1 \le |J| \le k}} (-1)^{|J|-1} P\left(\bigcap_{j \in J} A_j\right) + \sum_{\substack{J \subset \{1,...,n-1\},\ 1 \le |J| \le k-1 \\ \tilde{J} = J \cup \{n\}}} (-1)^{|\tilde{J}|-1} P\left(\bigcap_{j \in \tilde{J}} A_j\right).$$
The first sum in the above inequality covers all terms of (4.4) which do not contain Aₙ,
and the second sum coupled with P(Aₙ) covers all terms with Aₙ. This completes the
induction step.
The proof of (4.5) is similar and we leave it as an exercise. □


5. Independent events and conditional probability


In this section we introduce the notion of independence of events, conditional prob-
ability and two important results concerning it, namely the law of total probability and
Bayes’ theorem.

5.1. Independence.
Definition 5.1. (Independent events) Given a probability space (Ω, F, P) and two
events A, B ∈ F, call A and B independent if P(A ∩ B) = P(A)P(B). Otherwise
A and B are called dependent.
Property 5.1. If A and B are independent, then so are Ac and B, A and B c , and Ac
and B c .
Proof. We start with the first pair, namely Ac and B. Representing B as a union of
two disjoint events as
B = (Ac ∩ B) ∪ (A ∩ B),
and using the finite additivity of P we get


P(B) = P(Ac ∩ B) + P(A ∩ B) = P(Ac ∩ B) + P(A)P(B),
where the last equality is due to independence of A and B. Rearranging the last equality
leads to
P(Ac ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(Ac ),
which is the independence of Ac and B.
Independence of A and B c follows by symmetry (we can change the places of A
and B). The claim for Ac and B c follows from the independence of A and B c and the
already proved property that we can replace A by its complement. 

 Exercise 5.1.1. Prove that if A is independent of itself, then P(A) ∈ {0, 1}.

Example 5.1.1. Two fair dice are thrown. Let A₁ be the event that the first one shows
an odd number, and let A₂ be the event that the second one shows an even number.
Clearly the sample space of this experiment is the following
$$\Omega = \{(i, j) : i, j \in \{1, 2, 3, 4, 5, 6\}\}.$$
Then, F = 2^Ω and P((i, j)) = 1/36 for each 1 ≤ i, j ≤ 6. From here we have P(A₁) =
P(A₂) = 1/2 and $P(A_1 \cap A_2) = \frac{3 \cdot 3}{36} = \frac{1}{4}$. It follows that A₁ and A₂ are independent.
Let A₃ be the event that the sum of the first and the second throws is 4. Since the
only possibilities are {(1, 3), (2, 2), (3, 1)}, we have $P(A_3) = \frac{3}{36} = \frac{1}{12}$. We also have
$P(A_1 \cap A_3) = \frac{2}{36} = \frac{1}{18} \ne P(A_1)P(A_3)$, hence A₁ and A₃ are not independent.

Independence is also defined for a family of events.

Definition 5.2. (Independence of a family of events) Let (Ω, F, P) be a probability space.
Then a family of events {Aᵢ : i ∈ I} where Aᵢ ∈ F is called independent (or mutually
independent), if for any finite J ⊂ I we have
$$(5.1)\quad P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j).$$
The family {Aᵢ : i ∈ I} is called pairwise independent if (5.1) holds for any J ⊂ I
with |J| = 2.

Example 5.1.2. (Pairwise independence does not imply mutual independence) It is
important to note that pairwise independence does not necessarily imply (mutual)
independence. To see this, consider an experiment of rolling three 6-sided dice. For
1 ≤ i ≠ j ≤ 3 let Aᵢⱼ be the event that dice i and j show the same number. The
family {A₁₂, A₁₃, A₂₃} is pairwise independent but is not mutually independent. Indeed,
$P(A_{ij}) = \frac{1}{6}$ for all pairs (i, j), and $P(A_{12} \cap A_{13}) = P(A_{12} \cap A_{23}) = P(A_{13} \cap A_{23}) = \frac{1}{36}$.
It follows that the Aᵢⱼ are pairwise independent. However,
$$P(A_{12} \cap A_{13} \cap A_{23}) = \frac{1}{36} \ne P(A_{12})P(A_{13})P(A_{23}),$$
hence the family is not mutually independent.
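The example can be reproduced exhaustively over the 216-point sample space; a small sketch with exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space of three 6-sided dice, equiprobable measure.
omega = list(product(range(1, 7), repeat=3))

def prob(event):
    """P(event) = (number of favorable outcomes) / 216."""
    return Fraction(sum(event(w) for w in omega), len(omega))

A12 = lambda w: w[0] == w[1]
A13 = lambda w: w[0] == w[2]
A23 = lambda w: w[1] == w[2]

# Pairwise independent:
assert prob(lambda w: A12(w) and A13(w)) == prob(A12) * prob(A13) == Fraction(1, 36)
# ...but not mutually independent:
triple = prob(lambda w: A12(w) and A13(w) and A23(w))
print(triple)                             # 1/36
print(prob(A12) * prob(A13) * prob(A23))  # 1/216
```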
5.1.1. Independent experiments. Informally, independent events model the idea of
independent experiments. Let us illustrate this on an example of two probability spaces
(Ωᵢ, Fᵢ, Pᵢ) with i = 1, 2, where both Ω₁ and Ω₂ are at most countable⁹, and Fᵢ = 2^{Ωᵢ}.
Set Ω := Ω₁ × Ω₂, F := 2^Ω and define P(ω₁, ω₂) := P₁(ω₁)P₂(ω₂) for (ω₁, ω₂) ∈ Ω₁ × Ω₂.
Then any A ⊂ Ω₁ can be associated with Ã := A × Ω₂ ⊂ Ω and any B ⊂ Ω₂ with
B̃ := Ω₁ × B ⊂ Ω. We get Ã ∩ B̃ = A × B and using the definition of P we obtain
$$P(\tilde{A} \cap \tilde{B}) = P(A \times B) = \sum_{(\omega_1, \omega_2) \in A \times B} P(\omega_1, \omega_2) = \sum_{\omega_1 \in A} \sum_{\omega_2 \in B} P_1(\omega_1)P_2(\omega_2) = \sum_{\omega_1 \in A} P_1(\omega_1) \sum_{\omega_2 \in B} P_2(\omega_2) = P_1(A)P_2(B) = P(\tilde{A})P(\tilde{B}),$$
hence the events Ã and B̃ are independent.

Figure 4. A schematic view of the extension of A ⊂ Ω₁ into Ω₁ × Ω₂ as a cylindrical
set with base A.

The argument used above can be easily extended to n ≥ 2 experiments. Consider
n ≥ 2 experiments Eᵢ with corresponding probability spaces (Ωᵢ, Fᵢ, Pᵢ) where i =
1, 2, ..., n. The aim is to build a probability space that will model the joint experiment
(E₁, ..., Eₙ) as a series of independent experiments. Similar to n = 2 define
$$\Omega := \Omega_1 \times \cdots \times \Omega_n,$$
take F := 2^Ω and set
$$P(\omega_1, ..., \omega_n) := P_1(\omega_1) \cdots P_n(\omega_n).$$
It is easy to see that (Ω, F, P) is a probability space. The events of the individual
experiments Eᵢ can be associated with events of the larger probability space (Ω, F, P)
as follows: any Aᵢ ∈ Fᵢ is associated with¹⁰
$$\tilde{A}_i := \Omega_1 \times \cdots \times \Omega_{i-1} \times A_i \times \Omega_{i+1} \times \cdots \times \Omega_n \in F,$$

⁹The requirement that the Ωᵢ be at most countable makes it easier to define a probability measure on their
cross product. Later on we will study the concept of product measure in detail, which will allow us to
eliminate the restriction on the Ωᵢ to be discrete.
¹⁰The event Aᵢ of the single experiment is put in correspondence with the cylindrical set Ãᵢ. Notice
that the projection πᵢ : Ω → Ωᵢ defined as πᵢ(ω₁, ..., ωₙ) = ωᵢ maps Ãᵢ to Aᵢ.
where i = 1, 2, ..., n. In view of the definition of P we have P(Ãᵢ) = Pᵢ(Aᵢ). The same
argument used for the case n = 2 shows that (cylindrical) events of the form Ã₁, ..., Ãₙ
are independent in the sense of Definition 5.2.
We conclude that a finite number of different experiments can be joined together, with
the construction described above, into a single probability space such that the joint
experiment is represented as a series of independent experiments.

 Exercise 5.1.2. Show that the cross product F1 × F2 of two event spaces is
not an event space in general.

 Exercise 5.1.3. Fill in the necessary details for the construction for n ≥ 2
experiments.

 Exercise 5.1.4. Get back to Example 5.1.1 and recover the probability space
discussed there using the construction with cross products described above. Individual
experiments will be two (independent) throws of a dice.

 Exercise 5.1.5. Convince yourself that the probability space defined in Proposi-
tion 3.3 can be obtained by cross product as above from independent colorings of the edges
of the graph. Namely, if we enumerate the edges as e1 , ..., eN and for each 1 ≤ i ≤ N let
Ωi = {−1, 1}, Fi = 2Ωi and Pi (−1) = Pi (1) = 1/2 then the cross product of (Ωi , Fi , Pi )
coincides with the probability space described in the proposition.

→ Problem 5.1.1. Consider the matrix multiplication operation over the field Z₂ =
{0, 1}. Namely, let A, B, C all be n × n matrices with elements from Z₂ and consider
the problem of verifying whether AB = C. Prove that if AB ≠ C, then
$$P\left(\vec{b} \in \{0, 1\}^n : (AB)\vec{b} = C\vec{b}\right) \le \frac{1}{2}.$$
Here the equality is understood in Z₂, i.e. modulo 2, and the probability measure on
{0, 1}ⁿ is the equiprobable measure. Once you prove the above inequality, investigate how
one can verify with high confidence whether AB = C by doing independent tests, choosing
a vector $\vec{b} \in \{0, 1\}^n$ uniformly at random and checking $A(B\vec{b}) = C\vec{b}$. Compare
the computational complexity of a series of such tests with the direct computation of the
matrix product and element-wise comparison.
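A sketch of the verification procedure the problem describes (often called Freivalds' algorithm); the matrices at the bottom are illustrative choices, not from the notes:

```python
import random

def freivalds_check(A, B, C, trials: int = 20) -> bool:
    """Probabilistic test of A @ B == C over Z_2.  A failing C is caught
    in one trial with probability >= 1/2, so `trials` independent trials
    err with probability at most 2^-trials.  Each trial costs O(n^2)."""
    n = len(A)

    def matvec(M, v):  # M v over Z_2
        return [sum(M[i][j] * v[j] for j in range(n)) % 2 for i in range(n)]

    for _ in range(trials):
        b = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, b)) != matvec(C, b):  # compare A(Bb) with Cb
            return False
    return True

A = [[1, 0], [1, 1]]
B = [[1, 1], [0, 1]]
AB = [[1, 1], [1, 0]]  # the true product A @ B mod 2
print(freivalds_check(A, B, AB))                # True
print(freivalds_check(A, B, [[0, 0], [0, 0]]))  # False (with overwhelming probability)
```

Note the contrast in cost: each trial is two matrix-vector products, O(n²), while computing AB directly costs O(n³) with the schoolbook method.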

→ Problem 5.1.2. For each n ≥ 3 construct a probability space (Ω, F, P) and a


family of events A = {A1 , ..., An } ⊂ F such that A is pairwise independent, but any
2 < k ≤ n events from A are not mutually independent.

5.2. Conditional probability. Consider an experiment E and the associated probability
space (Ω, F, P). Sometimes we might have incomplete information about the
outcome of E. For example, in a throw of a fair dice assume someone told us that the
outcome is a prime number. This will not determine the actual outcome with certainty,
nevertheless it will narrow the space of possible outcomes of the dice and eventually
will affect our calculations of probabilities. In general, for two events A and B, knowing
that B occurred, the new probability of A might no longer be P(A). A reason for this
is that in this case A occurs if and only if A ∩ B occurs, which suggests that the new
probability of A should be linked with P(A ∩ B). This leads to the following definition.

Definition 5.3. Given a probability space (Ω, F, P) and two events A, B ∈ F with
P(B) > 0, the conditional probability of A given B is denoted by P(A | B) and is
defined as
$$(5.2)\quad P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Notice that the proportionality constant P(B) is chosen so that P(B | B) = 1, i.e.
the probability of B given that B occurred is 1.
Theorem 5.2. (Conditional probability induces a new measure) If B ∈ F and P(B) > 0
then (Ω, F, P_B) is a probability space, where P_B : F → R is defined as P_B(A) = P(A | B).
Proof. We only need to check that P_B is a probability measure, for which we need to
verify that P_B satisfies all requirements of Definition 1.2.
We clearly have P_B(A) ≥ 0 for any A ∈ F, and the fact that P_B(A) ≤ 1 follows
from the monotonicity of the probability measure (see (1.2)). The conditions P_B(∅) = 0
and P_B(Ω) = 1 simply follow from (5.2). The only thing that remains to check is the
σ-additivity of P_B. To that end fix a sequence of disjoint events A₁, A₂, ... ∈ F. Then
$$P_B\left(\bigcup_{i=1}^{\infty} A_i\right) = \frac{1}{P(B)} P\left(\bigcup_{i=1}^{\infty} (A_i \cap B)\right) = \frac{1}{P(B)} \sum_{i=1}^{\infty} P(A_i \cap B) = \sum_{i=1}^{\infty} P(A_i \mid B) = \sum_{i=1}^{\infty} P_B(A_i),$$
where we used the fact that the events $\{A_i \cap B\}_{i=1}^{\infty}$ are disjoint and that P is σ-additive.
Hence P_B is a probability measure and the proof is complete. □

5.3. Law of total probability. Given a probability space (Ω, F, P) we say that the
collection {Bi : i ∈ I} ⊂ F forms a partition of Ω if Bi ∩ Bj = ∅ for i ≠ j (the events
are disjoint) and ⋃_{i∈I} Bi = Ω.

Theorem 5.3. (Partition Theorem or the Law of Total Probability) If {Bi : i ∈ I} is
a partition of Ω with P(Bi) > 0 for all i ∈ I, then for any event A ∈ F we have

P(A) = ∑_{i∈I} P(A | Bi) P(Bi).
Proof. Using the fact that ⋃_{i∈I} Bi = Ω we get

P(A) = P(A ∩ (⋃_{i∈I} Bi))
     = P(⋃_{i∈I} (A ∩ Bi))
     = ∑_{i∈I} P(A ∩ Bi)          (by σ-additivity)
     = ∑_{i∈I} P(A | Bi) P(Bi)    (by (5.2)).


Let us see this result in action. Assume a student enters an exam which has a single
question from one of three subjects: analysis, probability, geometry. The student knows
1/2 of the analysis questions, 2/3 of the probability questions and 4/5 of the geometry
questions. If all subjects are equally probable to be on the question list, and once the
subject is selected all its questions are equally probable, what is the probability that
the student will pass the exam?

Figure 5. A schematic view of partitioning Ω into n disjoint events B1, B2, ..., Bn.
The law of total probability says that the probability of any A ∈ F becomes a weighted
sum of its conditional probabilities on the Bi, weighted by the probabilities of the
partition elements.

Let B1 be the event that the question is from the analysis list, B2 from probability,
and B3 from geometry. This defines a partition of the probability space. Let also A be
the event that the student can pass the exam. Since all subjects have equal probability
to be on the exam, we have P(Bi) = 1/3 for i = 1, 2, 3, and from the law of total
probability we get

P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + P(A | B3)P(B3) = (1/3)(1/2 + 2/3 + 4/5) = 59/90.
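The arithmetic above is easy to verify with exact rational numbers; a minimal sketch using the law of total probability (the subject names are just labels):

```python
from fractions import Fraction

# P(B_i): each subject is equally likely to be on the exam.
priors = {"analysis": Fraction(1, 3), "probability": Fraction(1, 3), "geometry": Fraction(1, 3)}
# P(A | B_i): the fraction of questions the student knows in each subject.
known = {"analysis": Fraction(1, 2), "probability": Fraction(2, 3), "geometry": Fraction(4, 5)}

# Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)
p_pass = sum(known[s] * priors[s] for s in priors)
print(p_pass)   # 59/90
```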
 Exercise 5.3.1. What is the probability space in the above example with exam
tickets?

Here is another example around the partition theorem.


In a theater with N seats, N people bought tickets. Assume that the first
person lost their ticket and chooses a seat uniformly at random from
the N seats. The rest of the people then come in one by one; if their
seat is empty, they occupy it, otherwise they choose a seat uniformly
at random from the remaining empty seats and sit there. What is the
probability that the last person to enter the theater will have their seat still
empty?
Let us enumerate the seats by 1, 2, ..., N and let seat i be the seat of person i entering
the theater hall. Let AN be the event that the seat of the last person is empty when
the first N − 1 people become seated. Denote aN := P(AN) and for each 1 ≤ i ≤ N
define Bi to be the event that the first person takes seat i. In view of the condition of
the problem we have P(Bi) = 1/N and clearly {Bi} defines a partition of the probability
space in question. Applying Theorem 5.3 we get

(5.3)   P(AN) = ∑_{i=1}^N P(AN | Bi)P(Bi) = (1/N) ∑_{i=1}^N P(AN | Bi).

Clearly P(AN | B1) = 1 and P(AN | BN) = 0. Now observe that when the first person
takes the i-th seat, with 1 < i < N, everyone up to person i − 1 will take their own seat,
and it is the i-th person who will choose a seat uniformly at random from the N − i + 1
seats that remain empty at that moment. In the light of this observation, if the event Bi
holds, then the i-th person acts as the first person for the same problem but with
N − i + 1 seats. Hence,

aN = (1/N) (1 + a2 + ... + a_{N−1}).

Observe that a2 = 1/2. Using induction on N it follows from the last equality that aN = 1/2
for all N = 2, 3, ..., in particular the probability that the last person's seat will be empty
is 1/2 regardless of the number of seats N.¹¹
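The answer 1/2 is also easy to check by simulation; a short sketch of the seating process (the helper name is ours):

```python
import random

def last_seat_free(n, rng):
    """One run of the seating process with n seats; True if seat n is still free for person n."""
    free = set(range(1, n + 1))
    free.remove(rng.choice(sorted(free)))      # person 1 sits uniformly at random
    for i in range(2, n):                      # persons 2, ..., n - 1
        if i in free:
            free.remove(i)                     # own seat is empty: take it
        else:
            free.remove(rng.choice(sorted(free)))  # otherwise pick a free seat at random
    return n in free

rng = random.Random(0)
trials = 20000
freq = sum(last_seat_free(10, rng) for _ in range(trials)) / trials
print(freq)   # close to 0.5
```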

 Exercise 5.3.2. What is the probability space in the above example with theater
seats?

 Exercise 5.3.3. In the problem with theater seats assume everyone up to the k-th
person takes their own seat, and it is the k-th person who chooses a seat uniformly at
random, where 1 ≤ k < N. What is the probability that the last person will find their
seat free?

→ Problem 5.3.1. Assume in the problem with theater seats two people, the first and
the second, mix their seats in the same way as in the original problem. What is the
probability that the last person will find their seat unoccupied? How about the first k
people mixing their seats?

5.4. Bayes rule. The result discussed in this section is of fundamental importance and
has far-reaching applications beyond probability theory.
Let (Ω, F, P) be a probability space and let A, B ∈ F satisfy P(A), P(B) > 0. Then,

(5.4)   P(B | A) = P(A | B)P(B) / P(A),

which is called the Bayes rule. To see why (5.4) is true, observe that from the definition
of the conditional probability (5.2) we have

P(B | A) = P(A ∩ B)/P(A) = [P(A ∩ B)/P(B)] · [P(B)/P(A)] = P(A | B)P(B)/P(A),

and hence (5.4) follows. We next prove a slightly more general form of this rule.
Theorem 5.4. (Bayes' theorem) Let {Bi : i ∈ I} be a partition of Ω with P(Bi) > 0
for all i ∈ I and let A ∈ F be such that P(A) > 0. Then

P(Bj | A) = P(A | Bj)P(Bj) / ∑_{i∈I} P(A | Bi)P(Bi).

Proof. Rearranging the formula of conditional probability (5.2) we get

P(Bj | A) = P(Bj ∩ A)/P(A) = [P(Bj ∩ A)/P(Bj)] · [P(Bj)/P(A)] = P(A | Bj)P(Bj)/P(A).

Now the claim follows from the law of total probability (Theorem 5.3) applied to the
denominator of the last equation. 

We next discuss some examples where Bayes' theorem can be utilized.

¹¹Here is an alternative approach. Notice that people entering the theater have no identity. Hence,
when the i-th person enters the hall and finds their place occupied, instead of taking a seat uniformly at
random from the remaining seats, the i-th person forces the person seated at the i-th seat out, takes the
i-th seat, and then the evicted person takes a seat uniformly at random. Since the identity of the people
entering the hall is irrelevant, this step is equivalent to the process described in the formulation of the
problem. With this modified rule of taking seats, everyone at steps 2, ..., N − 1 will take their own seat.
Hence the person mixing their seat will have two choices, either sit in their own seat or in the seat of the
last person, both with probability 1/2. This implies that the probability of the last person finding their
seat free equals 1/2.

Example 5.4.1. (Hit-and-run accident) A cab was involved in a hit-and-run accident


at night. There are two types of cabs, Green (85% of the total cabs) and Blue (15%
of the total cabs), that operate in the city. A witness identified the cab as Blue. The
court tested the reliability of the witness under the circumstances that existed on the
night of the accident and concluded that the witness correctly identified each one of the
two colors 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue?

Figure 6. It is helpful to describe this problem schematically as a tree: the root splits
into G (weight 85%) and B (weight 15%), and each of these splits into the witness'
testimonies WG and WB (with weights 80%/20%). The first level represents the
partition of the underlying probability space and the last level is the witness' testimony.
We then look at the ways of arriving at WB (the witness claims the car is Blue) from
the root of the tree. The value of each path from the root to WB becomes the product
of the weights of its edges (see the denominator of (5.5)).

Solution to Example 5.4.1. Let G be the event that the cab involved in the accident
is Green, and B the event that the cab is Blue. Let also WB be the event that the witness
testifies that the cab was Blue. We are interested in P(B | WB). Since the events
{G, B} form a partition of the probability space underlying this question, by Theorem
5.4 we get

(5.5)   P(B | WB) = P(WB | B)P(B) / (P(WB | B)P(B) + P(WB | G)P(G))
                 = (80/100 · 15/100) / (80/100 · 15/100 + 20/100 · 85/100) ≈ 0.413.
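The Bayesian update in (5.5) is a one-line computation; the helper below (the function name is ours, not from the text) recomputes the posterior exactly:

```python
from fractions import Fraction

def posterior(prior_b, p_wb_given_b, p_wb_given_g):
    # Bayes rule (5.4) for the two-element partition {B, G}
    num = p_wb_given_b * prior_b
    return num / (num + p_wb_given_g * (1 - prior_b))

p = posterior(Fraction(15, 100), Fraction(80, 100), Fraction(20, 100))
print(p, float(p))   # 12/29, about 0.41
```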


The following terminology is widely used in Bayesian theory (cf. Theorem 5.4):
• the events of the partition {Bi : i ∈ I} are called the hypotheses,
• the set A is called data,
• probabilities P(Bk ) are called prior probabilities,
• probabilities P(Bk | A) are called posterior probabilities,
• probabilities P(A | Bk ) are called likelihoods,
• the transition from priors to posteriors via Bayes theorem is called Bayesian
update.
Getting back to the example with cabs, notice that without the witness's testimony
P(B) = 15/100 = 0.15 (the prior probability). With the data (new information), we can
adjust our estimate to P(B | WB) ≈ 0.413 (the posterior probability).

Example 5.4.2. We have ten coins in our pocket: 9 of them are fair, and one
of them is biased with P(H) = 0.3 and P(T) = 0.7. We randomly pick a coin from the
pocket, flip it, and it lands H (heads). What is the conditional probability that the selected
coin is biased?

Let B be the event that the coin is biased. The prior probability of B is 0.1 since
out of 10 coins exactly one is biased. Applying Bayes' rule we have

P(B | H) = P(H | B)P(B) / (P(H | B)P(B) + P(H | Bᶜ)P(Bᶜ)) = (0.3 · 0.1) / (0.3 · 0.1 + 0.5 · 0.9) = 3/48 = 1/16.

Example 5.4.3. A certain disease occurs in about 0.05% of the population. A diag-
nostic test was developed for it, which correctly detects a person with the disease 99%
of the time. However, in 3% of the cases it will falsely diagnose a healthy patient as
having the disease (false positive cases).
A person is selected at random from the population, and the test indicates that this
person has the disease. What are the conditional probabilities that
(a) the person has the disease?
(b) the person does not have the disease?

Define A = {test is positive} and B = {patient has the disease}. We are given

P(B) = 0.0005,  P(A | B) = 0.99,  P(A | Bᶜ) = 0.03.

We need to compute P(B | A) for part (a) and P(Bᶜ | A) for part (b). From the Bayes formula
we have

P(B | A) = P(A | B)P(B) / (P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)) = (0.99 · 0.0005) / (0.99 · 0.0005 + 0.03 · 0.9995) ≈ 0.016.

For part (b) we get P(Bᶜ | A) = 1 − P(B | A) ≈ 0.984.
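A short sketch of this computation, with the prevalence left as a parameter so that one can see how strongly it drives the posterior (the function name is ours):

```python
def posterior_sick(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes rule; all arguments are probabilities in [0, 1]."""
    num = sensitivity * prevalence
    den = num + false_positive_rate * (1 - prevalence)
    return num / den

print(round(posterior_sick(0.0005, 0.99, 0.03), 3))   # 0.016
# A larger prevalence makes the same test far more informative:
print(round(posterior_sick(0.05, 0.99, 0.03), 3))
```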

Observe that while the test has high accuracy (99%), a positive result on a randomly
selected individual only provides a 0.016 (1.6%) chance that the person has the disease,
which is rather small. This phenomenon is called the base rate fallacy. It happens when
• there is a large population,
• only a small fraction of the population has the disease,
• there is large-scale testing of a mixed population of healthy and sick individuals.
One should take into account that the characteristics of the sampled population
are a contributing factor, along with the accuracy of the test itself, in determining the
probability of a positive test result. Even for a very accurate test the number of false
positives can potentially be large.

 Exercise 5.4.1. In Example 5.4.3 let 0 < p < 1 be the fraction of the population
that has the disease. Assume that the accuracy of the test is a% (i.e. the test gives a
positive result on a sick individual a% of the time), and let b% be the false positive
percentage of the test (i.e. in b% of the cases the test gives a positive result on healthy
individuals).
A person, randomly selected from the population, gets a positive result on the test.
What is the probability that the person actually has the disease? Play with the parameters
p, a, b to see when the results of the test can be reliable.

5.4.1. Naïve Bayes. We conclude the section with a toy example of using the Bayesian
approach for classification problems. For the sake of the exposition, in this subsection we
do away with mathematical rigor and concentrate on describing the general approach in a
somewhat hand-waving way. Everything here can be made rigorous and mathematically
sound with a slight effort.
A typical setup in a classification problem asks to determine the class label of an
object given a set of features. Here is an example.

Consider the problem of classifying an incoming email as spam or non-spam
(sometimes called ham). The data one might use for this problem include
the text of the email, subject line, sender's address, etc. All these are
features of the email that contribute to it being spam or ham.

In mathematical terms, assume we have K ≥ 2 classes denoted by C1, C2, ..., CK
and data is represented by a sequence of features ~x := (x1, x2, ..., xn). We then aim to
construct a probabilistic model which, given the vector ~x, will compute the conditional
probabilities P(Ck | ~x), i.e. how likely it is that the object is from class Ck if we received
the data ~x. In view of the Bayes rule we have

(5.6)   P(Ck | ~x) = P(~x | Ck)P(Ck) / P(~x).
Recall that we proved in Theorem 5.2 that the conditional probability ~x → P(~x | Ck) is a
probability measure itself. We thus have that PCk(~x) for fixed Ck defines a probability
measure on the set of features ~x. Now the crucial assumption in constructing the naïve
Bayes classifier is that the features x1, ..., xn are conditionally independent¹²
given Ck, i.e.

PCk(~x) = ∏_{i=1}^n PCk(xi).

Getting back to (5.6) and using the assumption on conditional independence, we get

P(Ck | ~x) = (1/P(~x)) P(Ck) ∏_{i=1}^n PCk(xi).

We now define our classifier: given a data point with features ~x, assign it to the class Ck,
1 ≤ k ≤ K, for which the probability P(Ck | ~x) is maximized. Since P(~x) does not
depend on k, from the above relation we get¹³

arg max_{1≤k≤K} P(Ck | ~x) = arg max_{1≤k≤K} ( P(Ck) ∏_{i=1}^n PCk(xi) ).

This decision rule is known as maximum a posteriori or MAP. It picks the
hypothesis that is most probable given the data (see the terminology introduced in
subsection 5.4). For numerical stability one usually applies log on the right-hand side
of the above relation. That will not change the arg max in view of the monotonicity of
the logarithm.
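To make the MAP rule concrete, here is a toy naïve Bayes sketch over a hypothetical four-email training set; the data, the presence/absence word model and the add-one smoothing are our illustrative choices, not from the text:

```python
import math

# Hypothetical training data: (set of words in the email, label).
train = [
    ({"win", "money", "now"}, "spam"),
    ({"win", "prize"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"project", "meeting", "notes"}, "ham"),
]
vocab = set().union(*(words for words, _ in train))
classes = ("spam", "ham")

n_docs = {c: sum(1 for _, lab in train if lab == c) for c in classes}
prior = {c: n_docs[c] / len(train) for c in classes}          # P(C_k)
# P(word present | class) with add-one (Laplace) smoothing, so no likelihood is 0
likelihood = {
    c: {w: (sum(1 for ws, lab in train if lab == c and w in ws) + 1) / (n_docs[c] + 2)
        for w in vocab}
    for c in classes
}

def classify(words):
    """MAP rule: argmax_k  log P(C_k) + sum over features of log P(x_i | C_k)."""
    def score(c):
        s = math.log(prior[c])
        for w in vocab:
            p = likelihood[c][w]
            s += math.log(p if w in words else 1.0 - p)
        return s
    return max(classes, key=score)

print(classify({"win", "money"}))   # spam
```

Note that the log-space sum is exactly the numerically stable form of the MAP product mentioned above.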

→ Problem 5.4.1. Write a computer program, in a language of your choice, to model


a spam, non-spam classifier for emails using Bayesian approach described above. A data
to build a model can be found at this address.

¹²This is an extremely crude assumption, which is not true in most practical cases, and this is
the reason the classifier is called naïve. Nevertheless, the idea is still useful and can lead to meaningful
results. Getting back to our example with spam classification, we note that there are more efficient
ways of classification based on Machine Learning principles.
¹³Given a finite list of numbers A = (a1, a2, ..., an), by arg max A we denote an index 1 ≤ k ≤ n such
that max A = ak. If the maximum in A is not unique then arg max returns any of the indices of existing
maximums.

6. Discrete probability distributions


In this section we define discrete probability distributions, discuss several important
examples, and study the approximation of the Poisson distribution by binomial
(Bernoulli-trial) distributions.
6.1. Important examples.
Definition 6.1. A sequence {pi}_{i∈I}, where the index set I is at most countable, is called
a probability distribution if 0 ≤ pi ≤ 1 for all i ∈ I and ∑_{i∈I} pi = 1.

6.1.1. Bernoulli distribution. Consider a probability space (Ω, F, P) associated with
an experiment of tossing a coin. Here Ω = {H, T} and assume P(H) = p and P(T) =
1 − p, where 0 ≤ p ≤ 1. The distribution {p, 1 − p} is called the Bernoulli distribution and
is denoted by B(1, p). We will sometimes refer to the probability p as the probability
of success of the experiment.
For example, in a clinical trial a treatment given to a particular patient may succeed
or fail. Denoting by 0 ≤ p ≤ 1 the probability of a success, we get a Bernoulli
distribution. Notice that for each p ∈ [0, 1] we get a different distribution, but all from
the same family of Bernoulli distributions.
6.1.2. Bernoulli trials and binomial distribution. Suppose that 5% of the items produced
by a certain machine are defective. Assume also that different items turn out to
be defective or not independently of each other. We sample n items at random and
inspect them. For a single item we may model the experiment of it being defective by
a Bernoulli distribution with probability 5/100 = 0.05 (i.e., taking a non-defective item
as a success, the probability of success p is 0.95). When sampling n items, we get a sequence
of Bernoulli trials, as the experiment of a sampled item being defective follows a Bernoulli
distribution and the experiments are independent in view of our assumption. This leads
to the following definition.
A series of n independent experiments, all following the Bernoulli distribution B(1, p), is
called Bernoulli trials. The number of successes in n experiments is an integer from
{0, 1, ..., n} and

(6.1)   pk := P(k successes) = (n choose k) p^k (1 − p)^{n−k},  where 0 ≤ k ≤ n,

since there are (n choose k) ways to fix the k positions in {1, 2, ..., n} where the success will happen.
The sequence {pk}_{k=0}^n is called the binomial distribution and is denoted by B(n, p).
A common example of Bernoulli trials is the series of independent experiments with
a coin toss (see subsection 5.1.1 for the underlying probability space).
Notice that the sum of the sequence pk defined in (6.1) equals 1, since the events of
having exactly 0, 1, ..., n successes provide a partition of the probability space
underlying the trials. Another way to see why the sum equals 1 is through Theorem
2.3 (the Binomial theorem).
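The sequence (6.1) can be sketched in a few lines, and the fact that it sums to 1 checked numerically (the helper name is ours):

```python
from math import comb

def binomial_pmf(n, p):
    """The sequence p_k from (6.1): P(k successes in n Bernoulli(p) trials)."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

pmf = binomial_pmf(10, 0.3)
print(abs(sum(pmf) - 1.0) < 1e-9)   # True: the p_k sum to 1
```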

Example 6.0.1. (Selling extra tickets for a flight) An airline sells 100 tickets for a
certain flight on an airplane that has only 98 seats. On the average, 1% of passengers
do not show up for their flight and assume the passengers choose to show up or not for
the flight independently of each other. What is the probability that everyone who appears
for the departure will have a seat?

Solution to Example 6.0.1. We need to understand the distribution of the people
that will (or will not) show up for the flight. To that end let X be the number of
purchasers who do not appear for their flight. We will treat a passenger
not appearing for the flight as a success (for the airline). Now, the probability
of success is p = 0.01 and hence the number of successes in n = 100 trials follows a
binomial distribution of the form B(100, 0.01) (recall the independence condition in the
problem). Thus everyone will have a seat if and only if there are at least 2 successes in
n = 100 trials. The complement of this event is easier to compute, i.e. the probability
of at most 1 success, which equals

P(X = 0) + P(X = 1) = (1 − 0.01)^100 + 100 · 0.01 · (1 − 0.01)^99 ≈ 0.7358,

hence the probability in question becomes 1 − 0.7358 = 0.2642. 

6.1.3. Hypergeometric distribution. Consider an urn with n1 red balls and n2 black
balls. Suppose n balls are drawn without replacement, where n ≤ n1 + n2. The
probability of drawing exactly k red balls equals

pk := (n1 choose k)(n2 choose n − k) / (n1 + n2 choose n),  where max(0, n − n2) ≤ k ≤ min(n, n1),

because there are (n1 + n2 choose n) ways to choose n balls out of the total n1 + n2, of which there
are (n1 choose k) choices for exactly k red balls and (n2 choose n − k) choices for exactly n − k black balls.
The multiplication principle then gives the combined choices of red and black balls.
The sequence {pk} defined above is called the hypergeometric distribution.
Observe that if we treat the drawing of the balls one by one as our experiment, then
it is not a sequence of Bernoulli trials, since the separate experiments are not independent.
Indeed, the event of drawing a red ball on the second step depends on the result of the
drawing on the first step. The hypergeometric distribution is used in scenarios where the
total number of possible successes is predetermined.
Summing the probabilities of the hypergeometric distribution we get

∑_{k=0}^n (n1 choose k)(n2 choose n − k) = (n1 + n2 choose n).

This is known as Vandermonde's identity in combinatorics (cf. Problem 2.1.3).
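With exact rational arithmetic the pmf sums to 1 exactly, which is precisely Vandermonde's identity; a minimal sketch (the helper name and the sample parameters are ours):

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(n1, n2, n):
    """p_k for drawing n balls without replacement from n1 red and n2 black balls."""
    lo, hi = max(0, n - n2), min(n, n1)
    return {k: Fraction(comb(n1, k) * comb(n2, n - k), comb(n1 + n2, n))
            for k in range(lo, hi + 1)}

pmf = hypergeometric_pmf(5, 7, 4)
print(sum(pmf.values()))   # 1, i.e. Vandermonde's identity
```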

6.1.4. Geometric distribution. Consider an infinite sequence of Bernoulli trials with
success probability 0 < p ≤ 1. Then the probability of getting the first success on the
k-th trial equals (1 − p)^{k−1} p, where k = 1, 2, ... . The sequence {(1 − p)^{k−1} p}_{k=1}^∞ is called
the geometric distribution with parameter p.
If we consider the coin toss as the single trial, and treat getting Heads as the success,
then the sample space can be modeled as

Ω = {H, TH, TTH, ..., T^k H, ...} ∪ {T^∞},

where T^k H is the sequence of k Tails followed by a Head, and T^∞ is the sequence of
getting no Heads at all.

6.2. Bernoulli approximation to Poisson. In this section we introduce the Poisson
distribution and prove that it is a limit of binomial distributions in a certain sense. We
start with an example.

In a certain store customers arrive at a constant rate of 4.5 customers per
hour on average. The customers' arrivals at different time periods are
assumed to be independent. Under this assumption, the store owner wants
to find out the number X of actual customers that will arrive during a
particular hour of a day.
One approach to this problem is to build an approximate model for the arrivals,
taking into account the independence condition, as Bernoulli trials. To build the model,
we split the 1 hour period into 3600 seconds and distribute the arrival rate over seconds,
getting on average 4.5/3600 = 0.00125 customers per second. Next, for each second
we assume that the arrival of a customer follows a Bernoulli distribution B(1, p) with
probability of success being p = 0.00125. Thus the number of customers in a given hour
follows a binomial distribution of the form B(3600, 0.00125) in view of the independence
assumption on arrivals of the customers.
While this model provides a meaningful estimate of the number of customers in
a given hour, it imposes a bound of 3600 on the maximum number of customers. To
get rid of that bound we split the hour into even smaller pieces and apply the same
model with Bernoulli trials. Namely, fix an integer n ≥ 1 and split an hour into n equal
segments. Let also λ = 4.5 and p = λ/n. Then, in view of the same analysis, the number
of customers in an hour follows a binomial distribution with parameters B(n, p). The
next result provides a way to approximate the values of this distribution.
Theorem 6.1. Consider a binomial distribution B(n, pn) and assume npn → λ as
n → ∞ where λ > 0. Then, for each integer k ≥ 0 we have

(n choose k) pn^k (1 − pn)^{n−k} → (λ^k / k!) e^{−λ}.

Proof. By the definition of the binomial coefficient we have

(n choose k) pn^k (1 − pn)^{n−k} = [n(n − 1) · ... · (n − k + 1) / k!] pn^k (1 − pn)^{n−k}
  = (1/k!) · [n(n − 1) · ... · (n − k + 1) / n^k] · (npn)^k · (1 − npn/n)^{n−k}.

The second term in the product above tends to 1 as n → ∞, since it has a fixed number
of factors and each factor tends to 1. The third one tends to λ^k in view of the condition
npn → λ. For the last one recall that

(1 − (λ + o(1))/n)^n → e^{−λ}, as n → ∞,

which follows from the Taylor expansion of the logarithm. Putting these together completes
the proof of the theorem. 
The sequence {(λ^k / k!) e^{−λ}}_{k=0}^∞ is called the Poisson distribution with parameter λ.
This distribution expresses the probability of a given number of events occurring in a
fixed interval of time or space if these events occur with a known constant mean rate
and independently of the time since the last event. As in the example of the arrival of
customers, many experiments consist in observing the occurrence times of random
arrivals. More examples include arrivals of calls at a switchboard, occurrences of floods
or other natural and man-made disasters, etc. For such cases the family of Poisson
distributions is used to model the number of arrivals occurring in a fixed time period.
The distribution also serves as an approximation to the binomial with very small success
probabilities.
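The convergence in Theorem 6.1 is easy to observe numerically; a sketch with λ = 4.5 as in the store example (the helper names are ours):

```python
from math import comb, exp, factorial

lam = 4.5

def binom_pmf(n, p, k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return lam ** k * exp(-lam) / factorial(k)

# Largest pointwise gap between B(n, lam/n) and Poisson(lam) over k = 0..9
errors = {}
for n in (10, 100, 10000):
    p = lam / n
    errors[n] = max(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(10))
    print(n, errors[n])   # the gap shrinks as n grows
```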

7. Discrete random variable


In this section we introduce the concept of a discrete random variable and its probability
mass function. We then define the mathematical expectation of discrete random variables
and discuss several basic properties of the expectation.

7.1. Discrete random variable and its probability mass function. Often we
deal with situations involving some function X defined on a sample space Ω of a given
probability space (Ω, F, P). For example, consider the experiment of throwing a fair
die and assume we gamble on the outcome E of this die in such a way as to win 1 if the
outcome is a prime number, lose 1 in case of 4 or 6, and stay even if 1 shows up. Then
our profit is a function X defined on Ω where

X(ω) = 1 if ω ∈ {2, 3, 5},   X(ω) = −1 if ω ∈ {4, 6},   X(ω) = 0 if ω = 1.

The mapping X : Ω → R is an example of a discrete random variable.


Definition 7.1. (Discrete random variable) Given a probability space (Ω, F, P), a mapping
X : Ω → R is called a discrete random variable if
(a) the image of X, i.e. the set ImX = {x ∈ R : ∃ ω ∈ Ω s.t. X(ω) = x}, is at most a
countable subset of R,
(b) {ω ∈ Ω : X(ω) = x} ∈ F for any x ∈ R.
The term discrete here refers to the fact that X takes at most countably many
values. For the second condition, notice that {ω ∈ Ω : X(ω) = x} is the preimage
of {x} under the mapping X, i.e. X −1 (x). We may not be able to predict the actual
values of X with certainty, but we would like to measure the probability of X taking a
certain value x. To be able to do that we require X −1 (x) to be an event.
For a random variable X we will write {X = x} for the set {ω ∈ Ω : X(ω) = x}.

 Exercise 7.0.1. Let X be a discrete random variable defined on (Ω, F, P).


Prove that for any A ⊂ R we have {ω ∈ Ω : X(ω) ∈ A} ∈ F.

Definition 7.2. (Probability mass function) The probability mass function (or pmf )
of a discrete random variable X defined on (Ω, F, P) is the function pX : R → [0, 1]
defined as
pX (x) = P(X = x), x ∈ R.
Here pX(x) is simply the probability that X takes the value x, and is sometimes
referred to as the discrete density function. We have

pX(x) = 0 for x ∉ ImX,

and

∑_{x∈ImX} pX(x) = ∑_{x∈ImX} P(X = x) = P(⋃_{x∈ImX} {ω ∈ Ω : X(ω) = x})   (by σ-additivity of P)
               = P(Ω) = 1.

The last equation is sometimes written as

∑_{x∈R} pX(x) = 1,

as there will be at most countably many non-zero values in the sum above due to the
assumption that X is discrete. The last equation also characterizes discrete probability
mass functions, as stated in the next theorem.
Theorem 7.1. Let S = {si : i ∈ I} be a countable set of distinct real numbers and let
{πi : i ∈ I} be a collection of real numbers satisfying

πi ≥ 0 for all i ∈ I and ∑_{i∈I} πi = 1.

Then, there exists a probability space (Ω, F, P) and a discrete random variable X defined
on it such that the probability mass function of X is given by

pX(s) = πi if s = si for some i ∈ I, and pX(s) = 0 if s ∉ S.

Proof. Take Ω = S, F = 2^Ω and

P(A) = ∑_{i: si∈A} πi for A ∈ F.

Then (Ω, F, P) is a probability space. Now define X : Ω → R by X(ω) = ω for ω ∈ Ω.
Then the probability mass function (pmf) of X is pX. 

The power of this theorem is that it allows us to forget about the probability space
in many practical scenarios where we have a discrete probability distribution and need
to consider a random variable that takes certain values following the given distribution.
The theorem ensures that such a random variable exists.
7.2. Functions of random variables. Consider a probability space (Ω, F, P) and a
discrete random variable X : Ω → R, and let also g : R → R. It is an easy check, following
Definition 7.1, that Y := g(X) defined as Y(ω) = g(X(ω)) for ω ∈ Ω, is also a discrete
random variable. Simple examples are Y = aX + b when g(x) = ax + b, and Y = aX²
when g(x) = ax².
The probability mass function of Y = g(X) is obtained from the pmf of X as follows:

pY(y) = P(g(X) = y) = P(X ∈ g⁻¹(y)) = ∑_{x∈g⁻¹(y)} P(X = x),

since there are at most countably many non-zero contributions to this sum¹⁴. For
example, when Y = aX + b and a ≠ 0, then

P(Y = y) = P(aX + b = y) = P(X = a⁻¹(y − b)) for y ∈ R.

If Y = X², then

P(Y = y) = P(X = √y) + P(X = −√y) for y > 0,   P(Y = 0) = P(X = 0),   and P(Y = y) = 0 for y < 0.

¹⁴Here g⁻¹(y) is the preimage of y under g, i.e. the set {x ∈ R : g(x) = y}.
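The pmf of a function of a random variable can be computed mechanically from the pmf of X by grouping values; a sketch using the dice-gambling profit of subsection 7.1 (the helper name is ours):

```python
from collections import defaultdict
from fractions import Fraction

# pmf of the dice-gambling profit X from subsection 7.1
p_X = {1: Fraction(1, 2), -1: Fraction(1, 3), 0: Fraction(1, 6)}

def pmf_of_g(p_X, g):
    # p_Y(y) is the sum of p_X(x) over all x with g(x) = y
    p_Y = defaultdict(Fraction)
    for x, p in p_X.items():
        p_Y[g(x)] += p
    return dict(p_Y)

p_Y = pmf_of_g(p_X, lambda x: x * x)
print(p_Y)   # Y = X^2 merges the values x = 1 and x = -1
```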

7.3. Expectation and variance. Consider a long series of independent throws of a
fair die. As each of the outcomes 1, 2, ..., 6 is equally likely to appear, the
average of the observed numbers will be approximately

(1/6) · 1 + (1/6) · 2 + ... + (1/6) · 6 = 7/2,

which is the mean value of the observable variable, the value of the die. This notion
of mean value is at the heart of the next definition.

Definition 7.3. (Mathematical expectation) If X is a discrete random variable, the
expectation of X is denoted by E(X) and is defined by

(7.1)   E(X) = ∑_{x∈ImX} x P(X = x),

whenever this sum converges absolutely, i.e. ∑_{x∈ImX} |x| P(X = x) < ∞.

If X is a discrete random variable for which

∑_{x∈ImX, x≤0} x P(X = x) = −∞ and ∑_{x∈ImX, x≥0} x P(X = x) = +∞,

then E(X) is not defined for such X. If one of the sums above converges but the other
diverges to infinity, we define E(X) = ∞ with the sign equal to the sign of the diverging
sum. For example, if X ≥ 0 and the sum in (7.1) diverges to +∞, then we define
E(X) = +∞.
Equation (7.1) is often written as

E(X) = ∑_x x P(X = x) = ∑_x x pX(x),

and the expectation of X is often called the expected value or mean value of X. Notice
that when X takes only finitely many values, the requirement of absolute convergence
of the series in (7.1) becomes redundant. However, when ImX is infinite (though
countable), the absolute convergence of the series guarantees that its sum is the
same for any rearrangement of the series (i.e. irrespective of the summation order).
The physical analogy of expectation is the idea of "centre of gravity". If masses with
weights π1, π2, ... are placed at points x1, x2, ... ∈ R, the position of the centre of gravity
is

∑_i πi xi / ∑_i πi = ∑_i xi (πi / ∑_j πj).

Setting pi := πi / ∑_j πj gives a probability distribution that accounts for the proportion
of the total weight assigned to the point xi.

Binomial distribution. Let X be a random variable with probability mass function
equal to the binomial distribution B(n, p) (thanks to Theorem 7.1 such an X exists).
Then

E(X) = ∑_{k=0}^n k (n choose k) p^k (1 − p)^{n−k}
     = ∑_{k=1}^n [n! / ((k − 1)!(n − k)!)] p^k (1 − p)^{n−k}
     = np ∑_{k=1}^n [(n − 1)! / ((k − 1)! (n − 1 − (k − 1))!)] p^{k−1} (1 − p)^{n−1−(k−1)}
     = np ∑_{k=0}^{n−1} (n − 1 choose k) p^k (1 − p)^{n−1−k} = np.
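The identity E(X) = np can also be checked directly from definition (7.1) with exact arithmetic; a minimal sketch (the helper name and parameters are ours):

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    # E(X) computed term by term from (7.1) for the pmf of B(n, p)
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1))

n, p = 12, Fraction(1, 3)
print(binomial_mean(n, p) == n * p)   # True
```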

Poisson distribution. Let X be a random variable with Poisson distribution with
parameter λ > 0. Then

E(X) = ∑_{k=0}^∞ k (λ^k / k!) e^{−λ} = λ e^{−λ} ∑_{k=1}^∞ λ^{k−1} / (k − 1)! = λ e^{−λ} ∑_{k=0}^∞ λ^k / k! = λ e^{−λ} e^λ = λ.

 Exercise 7.1.1. Assume we toss a fair coin until it comes up Heads. Let N
be the first time in the trials when we see a Head. Compute E(N ).
Note: Notice that N takes values 1, 2, ... and ∞ and observe that N follows the geometric distribution.

Theorem 7.2. (Basic properties of expectation) Let X be a discrete random variable


for which E(X) is defined. Then
(a) if X ≥ 0 then E(X) ≥ 0,
(b) if X ≥ 0 and E(X) = 0 then¹⁵ P(X = 0) = 1,
(c) if a, b ∈ R then E(aX + b) = aE(X) + b,
(d) for any random variable Y for which the expectation exists, the expectation of
X + Y exists and E(X + Y ) = E(X) + E(Y ),
(e) if E(X 2 ) is defined then E(X) is the minimizer of c 7→ E[(X − c)2 ] where c ∈ R.

Notice that (c) and (d) together imply that E is a linear operator on discrete random
variables that have expectation.
Proof. (a) follows by definition of the expectation as all terms in (7.1) become non-negative.
(b) Assume P(X = 0) < 1. Since

1 = P(X ≥ 0) = P(X = 0) + P(X > 0),

we get P(X > 0) > 0. Since

P(X > 0) = ∑_{x∈ImX, x>0} P(X = x) > 0,

there exists x > 0 such that P(X = x) > 0. Plugging this into the definition
(7.1) we get E(X) ≥ x P(X = x) > 0, which is a contradiction; hence P(X = 0) = 1.

¹⁵Recall from elementary calculus that for a continuous function f ∈ C[a, b], if ∫ₐᵇ |f(x)| dx = 0, then
f ≡ 0 on [a, b]. When f is Lebesgue integrable on (a, b) and ∫ₐᵇ |f| dµ = 0, where µ is the Lebesgue
measure, then f = 0 µ-almost everywhere on [a, b].

(c) By (7.1) we have

E(aX + b) = ∑_{y∈Im(aX+b)} y P(aX + b = y)
          = ∑_{y∈Im(aX+b)} y ∑_{x∈ImX: ax+b=y} P(X = x)
          = ∑_{x∈ImX} (ax + b) P(X = x) = aE(X) + b,

where we used the absolute convergence of the series in (7.1) to rearrange the sums
above.
(d) Let us first show that E(X + Y ) exists, for which we need to show that the sum
in (7.1) corresponding to X + Y converges absolutely. We have
\sum_{z ∈ Im(X+Y)} |z|\, P(X + Y = z)
= \sum_{z ∈ Im(X+Y)} |z| \sum_{x ∈ ImX} P(X = x, Y = z − x)
= \sum_{x ∈ ImX} \sum_{z ∈ Im(X+Y)} |z|\, P(X = x, Y = z − x) \qquad (set y := z − x)
= \sum_{x ∈ ImX} \sum_{y ∈ ImY} |x + y|\, P(X = x, Y = y)
≤ \sum_{x ∈ ImX} |x| \sum_{y ∈ ImY} P(X = x, Y = y) + \sum_{y ∈ ImY} |y| \sum_{x ∈ ImX} P(X = x, Y = y)
= \sum_{x ∈ ImX} |x|\, P(X = x) + \sum_{y ∈ ImY} |y|\, P(Y = y) < ∞,

where the convergence of the last two sums is due to the existence of E(X) and E(Y ),
and the passage to the second equality follows from the fact that the family of events
{{X = x} : x ∈ ImX} (similarly for Y ) is a partition of the underlying probability
space. We thus get that E(X + Y ) exists and the same argument as above shows that
E(X + Y ) = E(X) + E(Y ).
(e) For any c ∈ R, using the linearity proved in (c) and (d), we get

(7.2)  E[(X − c)²] = E[(X − E(X) + E(X) − c)²] = E[(X − E(X))²] + 2E[(X − E(X))(E(X) − c)] + E[(E(X) − c)²] = E[(X − E(X))²] + (E(X) − c)²,

since

E[(X − E(X))(E(X) − c)] = E[X − E(X)]\,(E(X) − c) = (E(X) − E(X))(E(X) − c) = 0.

Now (e) follows directly from (7.2): the first term does not depend on c, while the second is non-negative and vanishes exactly at c = E(X).


 Exercise 7.2.1. Construct a discrete random variable X, for which E|X| <
+∞, but E(X 2 ) = +∞.

Note: the example shows that, for two discrete random variables X and Y, the existence of E(X) and E(Y) does not, in general, imply the existence of E(XY).

We next establish a formula allowing us to compute the expectation of a function of a random variable.

Theorem 7.3. (Law of the unconscious statistician) Let X be a discrete random variable and g : R → R be any function. Then

(7.3)  E g(X) = \sum_{x ∈ ImX} g(x) P(X = x),

whenever the series converges absolutely.
Proof. Recall from subsection 7.2 that g(X) is a random variable. By (7.1) we have

E g(X) = \sum_{y ∈ Im(g(X))} y\, P(g(X) = y)
= \sum_{y ∈ Im(g(X))} y\, P(ω ∈ Ω : X(ω) ∈ g^{−1}(y))
= \sum_{y ∈ Im(g(X))} y \sum_{x ∈ ImX:\, g(x)=y} P(X = x)
= \sum_{x ∈ ImX} g(x) P(X = x),

where the rearrangement of the series above is due to the absolute convergence of the series in (7.3). □

As an application of Theorem 7.3, consider a random variable X whose probability mass function is a Poisson distribution with parameter λ > 0, i.e. P(X = k) = e^{−λ} \frac{λ^k}{k!} for k = 0, 1, ... . Let also Y = e^X. To compute the expectation of Y, we apply Theorem 7.3 with g(x) = e^x, getting

E(Y) = \sum_{k=0}^{∞} e^k e^{−λ} \frac{λ^k}{k!} = e^{−λ} \sum_{k=0}^{∞} \frac{(eλ)^k}{k!} = e^{−λ} e^{eλ} = e^{λ(e−1)}.
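The identity E(e^X) = e^{λ(e−1)} is easy to corroborate numerically by truncating the series (the function below is our illustration, with each term built from the previous one to avoid large factorials):

```python
import math

def expected_exp_poisson(lam, terms=300):
    """LOTUS for g(x) = e^x:  sum_k e^k * e^{-lam} * lam^k / k!"""
    term = math.exp(-lam)         # k = 0 term: e^0 * P(X = 0)
    total = term
    for k in range(1, terms):
        term *= math.e * lam / k  # next term of the series
        total += term
    return total

lam = 1.5
closed_form = math.exp(lam * (math.e - 1))
assert abs(expected_exp_poisson(lam) - closed_form) < 1e-8
```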
We next define another important concept related to random variables.
Definition 7.4. (Variance of a random variable) The variance of a discrete random variable X for which E(X) exists is denoted by var(X) and is defined as

(7.4)  var(X) = E[(X − E(X))²].

The square root of the variance is called the standard deviation.

Since (X − E(X))² is a non-negative discrete random variable, the expectation in (7.4) always exists, possibly being equal to +∞.

Property 7.4. Let X be a discrete random variable and assume E(X) exists. Then
(a) var(X) = E(X 2 ) − (E(X))2 ,
(b) var(aX) = a2 var(X) for any a ∈ R,
(c) var(X + Y ) = var(X) + var(Y ) + 2 (E(XY ) − E(X)E(Y )), where Y is a discrete
random variable for which E(Y ) exists.

Proof. (a) Setting µ = E(X) and applying Theorem 7.3 we get

var(X) = \sum_{x ∈ ImX} (x − µ)² P(X = x)
= \sum_{x ∈ ImX} (x² − 2xµ + µ²) P(X = x)
= \sum_{x ∈ ImX} x² P(X = x) − 2µ \sum_{x ∈ ImX} x P(X = x) + µ² \sum_{x ∈ ImX} P(X = x)
= E(X²) − 2µ E(X) + µ² = E(X²) − (E(X))²,

completing the proof of (a)16.


Formula (b) is a direct consequence of Theorem 7.2 (c) and the definition of the variance given by (7.4).
It remains to prove (c). Using (a) we get

var(X + Y) = E[(X + Y)²] − (E(X + Y))²
= E(X²) + 2E(XY) + E(Y²) − [(E(X))² + 2E(X)E(Y) + (E(Y))²]   (by linearity of E)
= var(X) + var(Y) + 2(E(XY) − E(X)E(Y)).

The proof is now complete. □
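Properties (a) and (b) can be checked mechanically on any finite pmf; a small sketch (the particular pmf below is an arbitrary example of ours):

```python
def ev(pmf):
    """E(X) for a finite pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    """Definition (7.4): E[(X - E X)^2]."""
    mu = ev(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

pmf = {-1: 0.2, 0: 0.5, 3: 0.3}
# Property 7.4 (a): var(X) = E(X^2) - (E X)^2
ex2 = sum(x * x * p for x, p in pmf.items())
assert abs(var(pmf) - (ex2 - ev(pmf) ** 2)) < 1e-12
# Property 7.4 (b): var(aX) = a^2 var(X)
a = 4.0
scaled = {a * x: p for x, p in pmf.items()}
assert abs(var(scaled) - a * a * var(pmf)) < 1e-12
```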

We next discuss some examples involving expectation.

Example 7.4.1. (Student trying to study probability) A student decided to start studying probability. Every morning she throws a 6-sided dice: if 1 shows up she will start her studies the same day; otherwise she skips that day and repeats her experiment with the dice the next morning. On average, how many days should one wait for her to start studying?

The waiting time of the student follows a geometric distribution with parameter 1/6
(see subsection 6.1.4). Using it, one may compute the expected waiting time directly
from the definition of expectation. But there is an alternative approach which we
consider below that becomes useful in certain scenarios.
Claim 7.5. Let X be a discrete random variable with pmf (probability mass function) P(X = k) = p_k, where k = 0, 1, 2, ... and \sum_{k=0}^{∞} p_k = 1. Then

(7.5)  E(X) = \sum_{k=1}^{∞} P(X ≥ k).

Proof. Notice that P(X = k) = P(X ≥ k) − P(X ≥ k + 1) for all k = 0, 1, 2, ... . Using this and (7.1) we have

E(X) = \sum_{k=0}^{∞} k\, P(X = k) = \sum_{k=0}^{∞} k\, (P(X ≥ k) − P(X ≥ k + 1)) = P(X ≥ 1) + \sum_{k=2}^{∞} P(X ≥ k)\,(k − (k − 1)) = \sum_{k=1}^{∞} P(X ≥ k),

16Notice that variance is non-negative by (7.4). This in particular implies that the right-hand side of
(a) is non-negative. Independently of the definition of the variance the non-negativity of the right-hand
side of (a) is a consequence of Jensen’s inequality which we will study later.

where we used Abel's summation rule17 to pass from the first line to the second. □
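Formula (7.5) is easy to verify numerically, for instance against a Poisson pmf truncated far into its tail (the truncation point K below is our choice):

```python
import math

lam, K = 2.0, 60                   # truncate far into the tail
p = []                             # p[k] ~= P(X = k) for X ~ Poisson(lam)
term = math.exp(-lam)
for k in range(K):
    p.append(term)
    term *= lam / (k + 1)          # next Poisson probability

direct = sum(k * p[k] for k in range(K))        # definition (7.1)
tails = sum(sum(p[k:]) for k in range(1, K))    # right-hand side of (7.5)
assert abs(direct - tails) < 1e-9
assert abs(direct - lam) < 1e-9                 # and indeed E(X) = lam
```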

Solution to Example 7.4.1. Define X to be the first day that 1 shows up when throwing the dice. Clearly P(X = 1) = \frac{1}{6}, P(X = 2) = \frac{5}{6}\cdot\frac{1}{6}, and in general P(X = k) = \left(\frac{5}{6}\right)^{k−1} \frac{1}{6} for k ≥ 1; thus X follows a geometric distribution with parameter p = 1/6. From here, for each n ≥ 1 we obtain

P(X ≥ n) = \sum_{k=n}^{∞} \frac{1}{6} \left(\frac{5}{6}\right)^{k−1} = \frac{1}{6} \left(\frac{5}{6}\right)^{n−1} \sum_{k=0}^{∞} \left(\frac{5}{6}\right)^{k} = \left(\frac{5}{6}\right)^{n−1}.

Using this we compute E(X) from (7.5) as follows:

E(X) = \sum_{n=1}^{∞} P(X ≥ n) = \sum_{n=1}^{∞} \left(\frac{5}{6}\right)^{n−1} = 6.

Thus the expected number of days to wait equals 6. □
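A simulation makes the answer tangible: by the law of large numbers the empirical mean should hover near 6 (the seed and sample size below are arbitrary choices of ours):

```python
import random

def days_until_one(rng):
    """Mornings until the 6-sided dice first shows 1 (geometric, p = 1/6)."""
    days = 1
    while rng.randint(1, 6) != 1:
        days += 1
    return days

rng = random.Random(0)
n = 200_000
empirical = sum(days_until_one(rng) for _ in range(n)) / n
assert abs(empirical - 6.0) < 0.1   # theory: E(X) = 6
```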

 Exercise 7.5.1. Confirm the answer to the Example 7.4.1 directly from the
definition of expectation, by showing that

\sum_{k=1}^{∞} k\, P(X = k) = 6.

Example 7.5.1. (Group testing of a disease) A certain disease in a population can be tested for and identified by a blood sample. Assume the size of the population is n ≥ 1 and each individual has probability 0 < p < 1 of having the disease, independently of the others. A straightforward approach to identifying all people carrying the disease is to collect a blood sample from each individual and test them all. This will require n tests - the size of the population. Is there a procedure that requires a smaller number of tests on average and still identifies the individuals with the disease?

Solution to Example 7.5.1. One strategy toward the example above is to split the
population into groups of k, for some fixed k ≥ 1, then mix the samples from individuals
of the same group and test the mixed sample. If the test on the mix is negative, then
no one in the group has the disease, otherwise we test everyone in the group one by one,
using another k tests. Thus, in the worst case we will need k +1 tests in total for a single
group to know with certainty who has a disease in the group. Let X be the number of
tests required for a single group. Then X takes two values: 1 with probability (1 − p)k
and k + 1 with probability 1 − (1 − p)k . Hence
 
E(X) = (1 − p)k + (k + 1) 1 − (1 − p)k .

17 Given a sum of the form \sum_{k=1}^{∞} a_k (b_k − b_{k+1}), we can rewrite it in the form

a_1(b_1 − b_2) + a_2(b_2 − b_3) + ... = a_1 b_1 + b_2(a_2 − a_1) + b_3(a_3 − a_2) + ...
= a_1 b_1 + \sum_{k=2}^{∞} b_k (a_k − a_{k−1}),

whenever the series converge. This can be considered as the analogue of the integration by parts rule for integrals and is often referred to as Abel's summation by parts.

For simplicity assume that k divides n, hence there will be n/k groups. Consequently,

(7.6)  E\left( \sum_{i=1}^{n/k} X_i \right) = \frac{n}{k} \left[ (1 − p)^k + (k + 1)\left(1 − (1 − p)^k\right) \right] = n \left( 1 − (1 − p)^k + \frac{1}{k} \right).

Recall that k is a parameter of the testing scheme, and we now optimize the value of the expectation in (7.6) with respect to k. To that end, differentiate the right-hand side of (7.6) with respect to k and consider the value of k that makes the derivative become 0. We thus get

\frac{1}{k^2} + (1 − p)^k \log(1 − p) = 0.

If p is much smaller than \frac{1}{k}, then we can approximate (1 − p)^k by 1 and \log(1 − p) by −p, arriving at p ≈ \frac{1}{k^2}. Thus k ≈ \frac{1}{\sqrt{p}}, and the resulting expectation becomes approximately 2n\sqrt{p}, which is significantly less than n for small values of p. □

→ Problem 7.5.1. Assume we toss a fair coin until we get two Heads in a row, and then we stop (i.e. the first time we see the pattern HH, we stop). Let N be the first time when we see two Heads in a row.
• Compute the expectation of N.
• Do the same computation for the pattern TH (i.e. seeing for the first time a Tail followed immediately by a Head) and observe that the expected time for the two patterns HH and TH is different.
• How about getting the pattern HHT for the first time?

→ Problem 7.5.2. A fair 6-sided dice is thrown. We can either take the number that appears on the dice as our winning (e.g. if it is 3 we get $3 as our winning) and the game stops, or we can ask to throw it again. We are given n ≥ 1 chances to ask for a new round (this includes the initial round). If we play up to the last possible throw, then we must take whatever appears on the dice on that throw. What is the optimal strategy for playing this game, and what is the expected winning if we play according to that strategy?
Note: For example, when n = 1 there is only 1 throw, which we must accept, and thus our expected winning is \frac{1}{6}(1 + 2 + ... + 6) = 3.5. If n = 2 we throw once, and after seeing the result we can decide whether to keep it or move to the second round, in which case we must take whatever appears there.
Hint: The case of n = 1 is described above. Start with n = 2 and think about what your expected winning is if you pass on the first throw.

→ Problem 7.5.3. Let N > 1 be a fixed integer. You are offered to play a game where the referee picks an integer n from 1 to N inclusive; if you guess it correctly you win n, otherwise 0. How much would you be willing to pay to play this game?
Note: first try to think about the optimal strategy for both the referee and the player. The question asks about the entry fee that makes the game fair for both the player and the organizer, in the sense that the expected gain of the player equals the expected loss of the organizer.

7.4. Conditional expectation. Recall from section 5.2 that given a probability space (Ω, F, P) and an event B ∈ F, knowing that B occurred can affect the probabilities of other events. To formalize this idea, we introduced the concept of conditional probability. In the same spirit, knowing that some B ∈ F occurred can affect the values of the probability mass function (pmf) of a given random variable X, since there we are dealing with the events {X = x} for x ∈ ImX. More precisely, if B ∈ F and P(B) > 0, then the probabilities P(X = x) can be conditioned on B.

Definition 7.5. (Conditional expectation) Let X be a discrete random variable and B ∈ F with P(B) > 0. The conditional expectation of X given B is denoted by E(X | B) and is defined as

(7.7)  E(X | B) = \sum_{x ∈ ImX} x\, P(X = x | B),

whenever the sum converges absolutely.

Notice that the only difference of (7.7) from (7.1) is that we are taking conditional
probabilities of X taking the value x. The analogue of Theorem 5.3 (law of total
probability) is the following.

Theorem 7.6. (Partition theorem for conditional expectation) Let X be a discrete random variable defined on (Ω, F, P) and let {B_i : i = 1, 2, ...} ⊂ F be a finite or countable partition of Ω with P(B_i) > 0 for all i = 1, 2, ... . Then

E(X) = \sum_i E(X | B_i) P(B_i),

whenever the sum converges absolutely.

Proof. We have

\sum_i E(X | B_i) P(B_i) = \sum_i \sum_{x ∈ ImX} x\, P(X = x | B_i) P(B_i)   (by (7.7))
= \sum_{x ∈ ImX} x \sum_i P(X = x | B_i) P(B_i)   (by absolute convergence)
= \sum_{x ∈ ImX} x\, P(X = x)   (by Theorem 5.3)
= E(X). □
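A minimal illustration of the partition theorem, with X the value shown by a fair dice and Ω partitioned into even and odd outcomes (the setup is a toy example of ours):

```python
# Omega = {1,...,6}, P uniform, X(w) = w; partition into evens and odds.
evens = [2, 4, 6]
odds = [1, 3, 5]

def cond_exp(block):
    """E(X | B) when outcomes are equally likely within Omega."""
    return sum(block) / len(block)

# Partition theorem: E(X) = E(X|evens) P(evens) + E(X|odds) P(odds).
total = sum(cond_exp(B) * (len(B) / 6) for B in (evens, odds))
assert abs(total - 3.5) < 1e-12   # agrees with E(X) for a fair dice
```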

→ Problem 7.6.1. For an integer n ≥ 1 let X(n) be an integer chosen uniformly at random from {0, 1, ..., n − 1}. If X(n) is 0, we stop; otherwise we choose an integer uniformly at random from {0, ..., X(n) − 1}, in other words we consider X(X(n)). Let τ be the number of times X was applied before getting 0, i.e. (X ∘ ... ∘ X)(n) = 0 with τ applications of X. What is the expected value of τ?
Note: For example, when X(n) = 0, i.e. we choose 0 at the first step, then τ = 1. If n = 10 and we choose 5 on the first step, then 0 on the second, then τ = 2. Since on each step we decrease the upper bound n by at least 1, we have τ ≤ n, meaning the process terminates in at most n steps.

7.5. Indicator functions. An important class of random variables is given by the indicator functions of events. For A ∈ F define

(7.8)  I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 otherwise,

and call I_A the indicator function of A (sometimes it is called the characteristic function of A). It follows from the definition that I_A is a random variable in the sense of Definition 7.1. Moreover, it enjoys the following nice properties.
Property 7.7. (Basic properties of indicators) Let A ∈ F and IA be its indicator
function. Then
1. E(IA ) = P(A),
2. IA = 1 − IAc ,
3. IA∩B = IA · IB for any B ∈ F,
4. IA∪B = IA + IB − IA∩B for any B ∈ F,
5. if B ∈ F and the events A and B are independent, then E(I_A · I_B) = E(I_A) · E(I_B).
Proof. All claims follow directly from definitions. 

 Exercise 7.7.1. Prove Property 7.7.

The rest of the subsection is devoted to various applications of indicator functions that are of independent interest. We will start with the inclusion-exclusion principle.
Notice that applying Property 7.7 (4) and the linearity of expectation proved in Theorem
7.2 we get the inclusion-exclusion principle for two sets. We now recover the inclusion-
exclusion for n sets using indicator functions.
Inclusion-exclusion principle (Proposition 4.1) via indicator functions. Let
A1 , ..., An ∈ F and set A = ∪ni=1 Ai . Observe that
(7.9) (IA − IA1 ) · ... · (IA − IAn ) = 0 on Ω.
Indeed, if ω ∈ / A then all factors above are vanishing. If ω ∈ A then there exists
1 ≤ i ≤ n such that ω ∈ A_i, hence I_A(ω) − I_{A_i}(ω) = 0 and we get (7.9). Next, notice that I_A I_{A_i} = I_{A_i} for all 1 ≤ i ≤ n. Thus, expanding the product in (7.9) we get18

I_A + \sum_{J ⊂ \{1,2,...,n\},\, J ≠ ∅} (−1)^{|J|} \prod_{i ∈ J} I_{A_i} = 0 on Ω.

Applying expectation to both sides of the above equation and using items 1 and 3 of Property 7.7, we get (4.2) and complete the proof. □

Indicator functions become very useful in many counting problems. Below we discuss
a few examples of the so-called method of indicators.
18 In (7.9) we have a product of the form (a_1 − b_1) · ... · (a_n − b_n). To expand it we take a single factor from each bracket, a_i or −b_i, for each i = 1, 2, ..., n. Let I ⊂ {1, 2, ..., n} be the set of indices where we choose a_i; the complement of I then consists of the indices where −b_i is chosen. Thus, we get a sum of the form

\sum_{I ⊂ \{1,2,...,n\}} \prod_{i ∈ I} a_i \prod_{i ∈ \{1,2,...,n\} \setminus I} (−b_i).

Applying this expansion to (7.9), every product of I_A factors reduces to I_A thanks to I_A I_A = I_A, and I_A I_{A_k} = I_{A_k} for all k = 1, 2, ..., n.

Example 7.7.1. (Kings and queens at a round table) There are n ≥ 2 king/queen pairs seated at a round table, where the kings take the even-numbered seats and the queens the odd-numbered ones; the point is that kings and queens sit interleaved. On average, how many kings will have their queen seated next to them?

Solution to Example 7.7.1. Enumerate the pairs by 1, 2, ..., n and for 1 ≤ i ≤ n let A_i be the event that the queen of the i-th pair sits next to the king of the i-th pair. Let also N be the number of correctly seated pairs. Observe that N is a random variable taking values in {0, 1, ..., n} and we are asked to compute E(N). It is clear that

N = I_{A_1} + ... + I_{A_n},

and thanks to the linearity of expectation (Theorem 7.2 parts (c) and (d)) and part 1 of Property 7.7 we get E(N) = \sum_{i=1}^{n} P(A_i). Notice that for each 1 ≤ i ≤ n we have P(A_i) = \frac{2}{n}, since if the position of the king is fixed, then out of the n available seats the queen can take only the 2 seats that are next to her king. We thus get E(N) = n · \frac{2}{n} = 2 for all n ≥ 2. □
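The answer E(N) = 2 is easy to corroborate by simulation (the seat layout, seed and sample size below are our choices):

```python
import random

def correctly_seated(n, rng):
    """King i sits at seat 2i on a table with 2n seats; the queens form a
    uniformly random arrangement on the odd seats. Returns the number of
    queens seated next to their own king."""
    queen_seat = [2 * j + 1 for j in range(n)]
    rng.shuffle(queen_seat)            # queen i ends up at queen_seat[i]
    count = 0
    for i in range(n):
        left = (2 * i - 1) % (2 * n)   # seats adjacent to king i
        right = (2 * i + 1) % (2 * n)
        if queen_seat[i] in (left, right):
            count += 1
    return count

rng = random.Random(1)
trials = 50_000
avg = sum(correctly_seated(8, rng) for _ in range(trials)) / trials
assert abs(avg - 2.0) < 0.05           # E(N) = 2 for every n >= 2
```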

 Exercise 7.7.2. A fair coin is tossed n ≥ 1 times. What is the expected number of Heads? What if the coin turns up Heads with probability 0 ≤ p ≤ 1?
Notice that we are considering Bernoulli trials here, see subsection 6.1.2. Compute the expected number of Heads using two methods: first, directly from the definition of expectation and Bernoulli trials; second, representing the number of Heads as a sum of indicators of Heads for each toss and using the linearity of expectation (cf. Example 7.7.1).

Example 7.7.2. (Parallel series of coin flips) Assume two coins, one with probability
0 ≤ p < 1 of turning up Heads, another with probability q > p are tossed repeatedly.
Show that for any k ∈ N the probability that in a series of n independent tosses the first
coin will produce at least k heads is less than or equal to the probability that the second
one will produce at least k heads.

Solution to Example 7.7.2. The claim is of course intuitive, since the second
coin has a greater chance of landing up Heads. Nevertheless, proving this by counting
arguments is not entirely straightforward. We will instead use a method related to
indicator functions. For 1 ≤ i ≤ n define

X_i = 1 if the i-th toss of the first coin is a Head, and X_i = 0 otherwise,

which is simply the indicator of the i-th toss of the first coin being Heads. Consider an auxiliary series of coin tosses where the probability of landing Heads equals \frac{q−p}{1−p}, and let Z_i be the indicator of the i-th toss turning up Heads for this new series of tosses, with i = 1, 2, ..., n. We thus have P(Z_i = 1) = \frac{q−p}{1−p}. Also assume that the new series of tosses is independent of the tosses of the first coin (we know from section 5.1.1 that such series of independent experiments exist). Now for the second coin define Y_i := X_i + (1 − X_i) Z_i, where i = 1, 2, ..., n. Clearly Y_i takes only the two values 0 and 1, and by the definition of Z_i we have P(Y_i = 1) = p + (1 − p)\frac{q−p}{1−p} = q and P(Y_i = 0) = 1 − q; hence Y_i has the same distribution as the indicator of Heads in a toss of the second coin. By construction X_i ≤ Y_i for all i = 1, 2, ..., n, and hence

P(X_1 + ... + X_n ≥ k) ≤ P(Y_1 + ... + Y_n ≥ k),

proving the claim19. 

7.5.1. Cycles in random permutation. Let σ be a permutation of {1, 2, ..., n}. A cycle of length m in σ is a sequence (i_1 i_2 ... i_m) of distinct integers from {1, ..., n} such that σ(i_k) = i_{k+1} for all k = 1, 2, ..., m − 1 and σ(i_m) = i_1. A cycle of length 1 is a fixed point of the permutation σ, i.e. an integer i such that σ(i) = i. For example, let n = 4 and consider the permutation σ defined as

σ = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 2 & 4 & 1 \end{pmatrix}.

Then we have two cycles, namely (1 3 4) and (2). It is well known that any permutation can be decomposed into a product of disjoint cycles. In this section we will be interested in the average length of a cycle in a random permutation.
Step 1. The probability that a given element is in a cycle of length m.
Fix an element 1 ≤ i ≤ n and assume it is in a cycle of length m. Then there exist distinct integers i_1, i_2, ..., i_{m−1} from {1, 2, ..., n} forming the cycle

i ↦ i_1 ↦ i_2 ↦ ... ↦ i_{m−1} ↦ i.

Since the sequence i_1, ..., i_{m−1} is ordered, there are \frac{(n−1)!}{((n−1)−(m−1))!} = \frac{(n−1)!}{(n−m)!} ways to choose such a sequence (see subsection 2.3.2). When the cycle is fixed, the remaining n − m integers can be permuted arbitrarily; thus there are \frac{(n−1)!}{(n−m)!} (n − m)! = (n−1)! permutations out of n! for which i is in a cycle of length m. So the probability that i is in a cycle of length m equals \frac{(n−1)!}{n!} = \frac{1}{n}.

Step 2. The expected number of cycles of length m.
For each 1 ≤ i ≤ n let I_i be the indicator of the event that i lies in a cycle of length m. Let also N be the total number of cycles of length m. Then

N = \frac{1}{m} (I_1 + I_2 + ... + I_n),

where the factor \frac{1}{m} is due to the fact that each cycle is counted exactly m times in the sum of indicators. We thus get

E(N) = \frac{1}{m} \sum_{i=1}^{n} E(I_i) = \frac{1}{m} \sum_{i=1}^{n} P(i is in a cycle of length m) = \frac{1}{m} · n · \frac{1}{n} = \frac{1}{m},

where we used Step 1 in the last equality.

Step 3. Probability of a long cycle.
Clearly for m > \frac{n}{2} there can be at most one cycle of length m. Let p_m be the probability that a randomly chosen permutation has a cycle of length m, where m > \frac{n}{2} is fixed. By Step 2 we have E(N) = \frac{1}{m}, where N is the number of cycles of length m. But in this case N takes only the values 1 and 0, thus

\frac{1}{m} = E(N) = 1 · P(N = 1) = p_m.

19 The technique used here is an example of the coupling method, which is a powerful technique in probability theory and has deep applications that are beyond the scope of these notes.

We conclude that for m > \frac{n}{2} the probability that a random permutation has a cycle of length m equals \frac{1}{m}. Hence the probability that a randomly chosen permutation has a cycle of length greater than n/2 is bounded above by

\sum_{m:\, m > n/2} \frac{1}{m} ≈ \ln n − \ln \frac{n}{2} = \ln 2.

Problem about names in boxes. There is a peculiar application of the above estimates on cycles in permutations.
Assume the names of 100 prisoners are placed randomly in boxes numbered 1, 2, ..., 100 - one name in each box. One by one, the prisoners enter the room where the boxes are placed, with the goal of finding their own name in the boxes. Each prisoner is allowed to open at most 50 boxes. If all prisoners find their names, they are freed. Each prisoner leaves the room without altering its state, and the prisoners cannot communicate with each other once the process starts. What strategy can the prisoners choose to have a non-trivial chance of being freed?
Here is a possible strategy: the prisoners start by forming a line which will not be changed during the process. Starting with the first in the line, the i-th prisoner enters the room and looks into the box with number i. If it contains his name we are done; otherwise it contains the name of the prisoner at position i_1 in the line, and then the current prisoner goes and checks box number i_1. If he finds the name of the prisoner at position i_2 instead, he opens box i_2, and so on. In this process, prisoner i will eventually find his own name unless i is in a cycle of length larger than 50 (the maximal number of boxes one can open). As we proved above, the probability that the permutation of names in boxes has a cycle of length more than 50 is bounded above by ln 2 ≈ 0.693. Thus the probability that all will find their names is at least about 1 − 0.693 = 0.307.
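The cycle-following strategy is easy to simulate: the empirical survival frequency should sit near 1 − ln 2 ≈ 0.307 (the sample size and seed below are arbitrary choices of ours):

```python
import random

def all_survive(n, rng):
    """True iff the random placement of names has no cycle longer than n//2,
    i.e. every prisoner finds his name with the cycle-following strategy."""
    boxes = list(range(n))
    rng.shuffle(boxes)              # boxes[b] = name hidden in box b
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        length, b = 0, start
        while not seen[b]:          # walk the cycle through 'start'
            seen[b] = True
            length += 1
            b = boxes[b]
        if length > n // 2:
            return False
    return True

rng = random.Random(2)
trials = 20_000
rate = sum(all_survive(100, rng) for _ in range(trials)) / trials
assert 0.28 < rate < 0.34           # theory: about 1 - ln 2 ~ 0.307
```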

 Exercise 7.7.3. What are the chances of survival in the above problem if all
prisoners choose their 50 boxes uniformly at random?

→ Problem 7.7.1. Is there a better strategy to the above problem with names in
boxes? Justify your answer.

8. Multivariate discrete distribution. Expectation in the multivariate case

Let X and Y be discrete random variables defined on (Ω, F, P). Often it is necessary to consider X and Y as a pair taking values in R², rather than as individual variables. In analogy with Definition 7.2, for the pair of random variables we have the following.
Definition 8.1. (Joint probability mass function) Let X and Y be discrete random
variables on (Ω, F, P). Their joint probability mass function pX,Y : R2 → [0, 1] is
defined by
pX,Y (x, y) = P ({ω ∈ Ω : X(ω) = x and Y (ω) = y}) ,
which is usually abbreviated to pX,Y (x, y) = P(X = x, Y = y).
Clearly we have
pX,Y (x, y) = 0 if x ∈
/ ImX or y ∈
/ ImY,

and

\sum_{x ∈ ImX} \sum_{y ∈ ImY} p_{X,Y}(x, y) = 1.

Notice that the events {{X = x} : x ∈ ImX} and {{Y = y} : y ∈ ImY} are both partitions of Ω. Thus

(8.1)  \sum_{x ∈ ImX} p_{X,Y}(x, y) = \sum_{x ∈ ImX} P(X = x, Y = y) = P(Y = y) = p_Y(y),

and similarly

(8.2)  \sum_{y ∈ ImY} p_{X,Y}(x, y) = p_X(x).

Thus from the joint mass function we can recover the probability mass functions of the individual variables X and Y. The distributions in (8.1) and (8.2) are called the marginal mass functions of Y and X respectively.

      y_1               y_2               ...   y_m               ...
x_1   p_{X,Y}(x_1,y_1)  p_{X,Y}(x_1,y_2)  ...   p_{X,Y}(x_1,y_m)  ...
x_2   p_{X,Y}(x_2,y_1)  p_{X,Y}(x_2,y_2)  ...   p_{X,Y}(x_2,y_m)  ...
...
x_n   p_{X,Y}(x_n,y_1)  p_{X,Y}(x_n,y_2)  ...   p_{X,Y}(x_n,y_m)  ...
...

Table 1. One may think of the joint mass function p_{X,Y} as a matrix whose (n, m)-th element is the probability of the event {ω ∈ Ω : X(ω) = x_n and Y(ω) = y_m}. Here ImX = {x_1, x_2, ...} and ImY = {y_1, y_2, ...}. To recover the marginal distributions, we sum the entries of a fixed row over all columns to get p_X, and the entries of a fixed column over all rows to get p_Y.
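In code, the matrix view of Table 1 turns the marginalization formulas (8.1)-(8.2) into a pair of row/column sums (the numbers below are an arbitrary made-up example of ours):

```python
# joint[x][y] = P(X = x, Y = y) for a small made-up joint pmf.
joint = {
    0: {0: 0.10, 1: 0.20},
    1: {0: 0.30, 1: 0.15},
    2: {0: 0.05, 1: 0.20},
}

# (8.2): p_X(x) is the sum of the row of x over all y.
p_X = {x: sum(row.values()) for x, row in joint.items()}
# (8.1): p_Y(y) is the sum of the column of y over all x.
p_Y = {}
for row in joint.values():
    for y, p in row.items():
        p_Y[y] = p_Y.get(y, 0.0) + p

assert abs(sum(p_X.values()) - 1.0) < 1e-12   # total mass is 1
assert abs(p_X[1] - 0.45) < 1e-12
assert abs(p_Y[1] - 0.55) < 1e-12
```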

8.1. Expectation in the multivariate case. Given two random variables X and Y
on (Ω, F, P) and a function g : R2 → R define Z(ω) = g(X(ω), Y (ω)) where ω ∈ Ω.
Then, one may easily check that conditions of Definition 7.1 are satisfied for Z, i.e. Z
is a discrete random variable. We now formulate and prove the analogue of Theorem
7.3 for Z.

Theorem 8.1. (Law of the unconscious statistician - 2d case) Let X, Y and g be as above. Then

(8.3)  E g(X, Y) = \sum_{x ∈ ImX} \sum_{y ∈ ImY} g(x, y)\, p_{X,Y}(x, y),

whenever the sum converges absolutely.



Proof. The proof is similar to that of Theorem 7.3. Set Z = g(X, Y); then

E(Z) = \sum_{z ∈ ImZ} z\, P(Z = z)
= \sum_{z ∈ ImZ} z\, P({ω ∈ Ω : (X(ω), Y(ω)) ∈ g^{−1}(z)})
= \sum_{z ∈ ImZ} z \sum_{x ∈ ImX,\, y ∈ ImY:\, g(x,y)=z} P(X = x, Y = y)
= \sum_{x ∈ ImX} \sum_{y ∈ ImY} g(x, y)\, p_{X,Y}(x, y). □

We now recover a result proved in Theorem 7.2.

Corollary 8.2. E acts linearly on random variables; namely, if X and Y are discrete random variables on (Ω, F, P) and a, b ∈ R, then

E(aX + bY) = aE(X) + bE(Y).

Proof. Take g(x, y) = ax + by. Then by Theorem 8.1 we get

E(aX + bY) = \sum_{x ∈ ImX} \sum_{y ∈ ImY} (ax + by)\, p_{X,Y}(x, y)
= a \sum_{x ∈ ImX} x \sum_{y ∈ ImY} p_{X,Y}(x, y) + b \sum_{y ∈ ImY} y \sum_{x ∈ ImX} p_{X,Y}(x, y)
= aE(X) + bE(Y). □

Note that the result of Theorem 8.1 extends similarly to the case of n random variables. More precisely, consider g : Rⁿ → R and let X_1, X_2, ..., X_n be discrete random variables defined on Ω. Then Z = g(X_1, ..., X_n) defines a discrete random variable on Ω, and the analogue of (8.3) holds for the joint mass function p_{X_1,...,X_n}, which is defined in analogy with the case of two variables as follows:

p_{X_1,...,X_n}(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n).
We formulate the result for n variables.
Theorem 8.3. Let X1 , X2 , ..., Xn be random variables defined on (Ω, F, P) and g :
Rn → R be any function. Then g(X1 , X2 , ..., Xn ) is a random variable defined on
(Ω, F, P) and
X X
(8.4) Eg(X1 , ..., Xn ) = ... g(x1 , ..., xn )pX1 ,...,Xn (x1 , ..., xn ),
x1 ∈ImX1 xn ∈ImXn

whenever the sum converges absolutely.

 Exercise 8.3.1. Prove Theorem 8.3 following the proof of Theorem 8.1.

9. Independent random variables


9.1. Independence of two random variables. Given a probability space (Ω, F, P),
recall that two events A, B ∈ F are called independent, if P(A ∩ B) = P(A)P(B). We
now extend the concept of independence from events to random variables.
Definition 9.1. (Independence of discrete random variables) Discrete random variables
X and Y defined on a probability space (Ω, F, P) are called independent, if
(9.1) P(X = x, Y = y) = P(X = x)P(Y = y) for all x, y ∈ R.
Notice that both sides of (9.1) are zero unless x ∈ ImX and y ∈ ImY. Also, the left-hand side of (9.1) is p_{X,Y}(x, y) - the joint mass function - and for the right-hand side we can use (8.1) and (8.2) to arrive at

(9.2)  p_{X,Y}(x, y) = \left( \sum_{ỹ ∈ ImY} p_{X,Y}(x, ỹ) \right) \left( \sum_{x̃ ∈ ImX} p_{X,Y}(x̃, y) \right).

Observe also that given independence of X and Y , from (9.2) we get that the joint
mass function pX,Y decomposes into a product of two functions depending only on x or
y. The converse of this statement is also true as we see next.
Theorem 9.1. Two discrete random variables X and Y defined on (Ω, F, P) are inde-
pendent if and only if there exist functions f, g : R → R such that
(9.3) pX,Y (x, y) = f (x)g(y), ∀x, y ∈ R.
Proof. The necessity is due to (9.2). We now prove the sufficiency. Assume p_{X,Y}(x, y) = f(x)g(y) for x, y ∈ R. We need to check that (9.1) holds. We have

(9.4)  1 = \sum_{x ∈ ImX} \sum_{y ∈ ImY} p_{X,Y}(x, y) = \left( \sum_{x ∈ ImX} f(x) \right) \left( \sum_{y ∈ ImY} g(y) \right).

We also have

p_X(x) = \sum_{y ∈ ImY} p_{X,Y}(x, y) = \left( \sum_{y ∈ ImY} g(y) \right) f(x)

and

p_Y(y) = \sum_{x ∈ ImX} p_{X,Y}(x, y) = \left( \sum_{x ∈ ImX} f(x) \right) g(y).

Multiplying the last two equalities implies

p_X(x) p_Y(y) = f(x) g(y) \left( \sum_{y ∈ ImY} g(y) \right) \left( \sum_{x ∈ ImX} f(x) \right) = f(x) g(y) = p_{X,Y}(x, y),

where we used (9.4) and then (9.3). The proof is now complete. □

Example 9.1.1. Consider random variables X and Y that take values in the non-negative integers and have the joint mass function

p_{X,Y}(n, m) = \frac{λ^n}{n!} \frac{µ^m}{m!} e^{−(λ+µ)}, n, m = 0, 1, 2, ... .

Then Theorem 9.1 implies that X and Y are independent.

An important implication of independence concerns the expectation of a product of random variables.

Theorem 9.2. Let X and Y be discrete random variables that are independent, and assume E(X) and E(Y) exist. Then

E(XY) = E(X)E(Y).

Proof. Applying Theorem 8.1 with g : R² → R defined as g(x, y) = xy we get

E(XY) = \sum_{x ∈ ImX} \sum_{y ∈ ImY} xy\, P(X = x, Y = y)   (by independence)
= \sum_{x ∈ ImX} \sum_{y ∈ ImY} xy\, P(X = x) P(Y = y)
= \left( \sum_{x ∈ ImX} x\, P(X = x) \right) \left( \sum_{y ∈ ImY} y\, P(Y = y) \right)
= E(X)E(Y). □

Corollary 9.3. Let X and Y be discrete random variables that are independent. Then
var(X + Y ) = var(X) + var(Y ).
Proof. This follows directly from Property 7.4 (c) and Theorem 9.2. 

Notice that we have linearity of expectation regardless of the independence of the random variables involved. The additivity of the variance, however, is not true in general and requires additional conditions. The next example shows that the converse of Theorem 9.2 is not true.

Example 9.3.1. Let X be a discrete random variable taking values −1, 0, 1, each with probability 1/3, and let Y = |X|. Then

P(X = 0, Y = 1) = 0 and P(X = 0)P(Y = 1) = \frac{1}{3} \cdot \frac{2}{3} ≠ 0,

hence X and Y are not independent. However, E(XY) = 0 = E(X)E(Y).
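The counterexample can be checked by direct enumeration of the three equally likely pairs (x, |x|):

```python
# X uniform on {-1, 0, 1} and Y = |X|: the joint pmf puts 1/3 on each pair.
pairs = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

EX = sum(x * p for (x, y), p in pairs.items())
EY = sum(y * p for (x, y), p in pairs.items())
EXY = sum(x * y * p for (x, y), p in pairs.items())
assert abs(EXY - EX * EY) < 1e-12        # E(XY) = E(X)E(Y) holds...

p_x0 = sum(p for (x, y), p in pairs.items() if x == 0)
p_y1 = sum(p for (x, y), p in pairs.items() if y == 1)
assert pairs.get((0, 1), 0.0) != p_x0 * p_y1   # ...yet X, Y are dependent
```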

The next theorem provides a condition on expectation that is equivalent to indepen-


dence.
Theorem 9.4. Discrete random variables X and Y are independent if and only if for
any functions f, g : R → R we have
E(f (X)g(Y )) = E(f (X))E(g(Y )),
whenever the expectations on the right-hand side exist.

Proof. The necessity follows by the same argument as in Theorem 9.2; one only needs to consider the function (x, y) ↦ f(x)g(y) instead of (x, y) ↦ xy.
We now prove sufficiency. Take any a, b ∈ R and let

f(x) = 1 if x = a and f(x) = 0 if x ≠ a,   g(y) = 1 if y = b and g(y) = 0 if y ≠ b.

Then E(f(X)g(Y)) = P(X = a, Y = b), and E(f(X)) = P(X = a), E(g(Y)) = P(Y = b). Combining these with the condition of the theorem implies the independence of X and Y. □

9.2. Independence of n random variables. We extend the discussion in the previous subsection to n random variables.
Definition 9.2. (Independence of n random variables) Discrete random variables X1 ,
X2 , ..., Xn defined on a probability space (Ω, F, P) are called (jointly) independent, if
for any x1 , x2 , ..., xn ∈ R we have
n
Y
(9.5) P (X1 = x1 , X2 = x2 , ..., Xn = xn ) = P(Xi = xi ).
i=1

Clearly, both sides of (9.5) become zero unless x_i ∈ ImX_i for all i = 1, 2, ..., n.
We now formulate the analogues of the main results proved for the case of two discrete random variables for n variables.
Theorem 9.5. (cf. Theorem 9.1) Discrete random variables X1 , ..., Xn defined on
(Ω, F, P) are independent if and only if there exist functions f1 , ...fn : R → R such that
pX1 ,...Xn (x1 , ..., xn ) = f1 (x1 ) · ... · fn (xn ), ∀x1 , ..., xn ∈ R.

Theorem 9.6. (cf. Theorem 9.2) Let X1 , ..., Xn be discrete random variables that are
jointly independent, and assume E(X1 ), ..., E(Xn ) exist. Then
E(X1 · ... · Xn ) = E(X1 ) · ... · E(Xn ).

Corollary 9.7. (cf. Corollary 9.3) Let X1 , ..., Xn be discrete random variables that are
pairwise independent. Then
var(X1 + ... + Xn ) = var(X1 ) + ... + var(Xn ).

Theorem 9.8. (cf. Theorem 9.4) Discrete random variables X1 , ..., Xn are jointly
independent if and only if for any functions f1 , ..., fn : R → R we have
E(f1 (X1 ) · ... · fn (Xn )) = E(f1 (X1 )) · ... · E(fn (Xn )),
whenever the expectations on the right-hand side exist.
Proofs of the formulated results follow along the same lines as their counterparts for two random variables.

 Exercise 9.8.1. Prove Theorems 9.5, 9.6, 9.8, and Corollary 9.7.
We also include a new result that we will prove for the case of n random variables.
Theorem 9.9. (Preservation of independence under compositions) Let X1 , X2 , ..., Xn be
independent random variables, and let f1 , f2 , ..., fn be any real-valued functions defined
on R. Then f1 (X1 ), f2 (X2 ), ..., fn (Xn ) are independent.
Proof. Fix any y1 , ..., yn ∈ R; we need to show that

P(f1 (X1 ) = y1 , ..., fn (Xn ) = yn ) = ∏_{i=1}^{n} P(fi (Xi ) = yi ).
To this end define a function g : R^n → R as follows:

g(x1 , ..., xn ) = 1 if fi (xi ) = yi for all i = 1, 2, ..., n, and g(x1 , ..., xn ) = 0 otherwise,

and for i = 1, 2, ..., n define gi : R → R as

gi (x) = 1 if fi (x) = yi , and gi (x) = 0 otherwise.
We have

P(f1 (X1 ) = y1 , ..., fn (Xn ) = yn )
  = E I_{f1 (X1 )=y1 ,...,fn (Xn )=yn}      (by Property 7.7 part 1)
  = E g(X1 , ..., Xn )      (by definition of g)
  = Σ_{x1 ∈ ImX1} ... Σ_{xn ∈ ImXn} g(x1 , ..., xn ) pX1 ,...,Xn (x1 , ..., xn )      (by Theorem 8.3)
  = Σ_{x1 ∈ ImX1 : f1 (x1 )=y1} ... Σ_{xn ∈ ImXn : fn (xn )=yn} P(X1 = x1 , ..., Xn = xn )      (by definition of g)
  = ∏_{i=1}^{n} Σ_{xi ∈ ImXi : fi (xi )=yi} P(Xi = xi )      (by independence of the Xi )
  = ∏_{i=1}^{n} Σ_{x ∈ ImXi} gi (x) P(Xi = x)      (by definition of gi )
  = ∏_{i=1}^{n} E gi (Xi )      (by Theorem 7.3)
  = ∏_{i=1}^{n} E I_{fi (Xi )=yi} = ∏_{i=1}^{n} P(fi (Xi ) = yi ),

completing the proof. ∎

The moral of Theorem 9.9 is that, given a sequence of independent variables, applying (possibly different) transformations to the individual entries of the sequence does not alter the independence. For example, if X, Y, Z are independent random variables, then so are

X^2 ,  e^Y ,  (1 + |Z|^{2021} )^{−1/2} ,

a fact which might not seem entirely trivial in this particular example.

→ Problem 9.9.1. Let v1 , v2 , ..., vn ∈ R^d be unit vectors, i.e. ||vi || = 1 for all 1 ≤ i ≤ n. Prove that there exists a sequence of signs εi ∈ {−1, 1} such that the vector Σ_{i=1}^{n} εi vi has length at most √n (and, similarly, a choice of signs for which the length is at least √n).
Hint: choose the signs uniformly at random and compute the expectation of the squared length (Euclidean norm) of Σ_i εi vi .
10. Inequalities
In this section we state and prove several important inequalities, and discuss the
concept of information entropy.
10.1. Jensen’s inequality. Recall the definition of a convex function. A function f : R^d → R is called convex if for any x1 , x2 ∈ R^d and 0 ≤ λ ≤ 1 one has

(10.1)    f ((1 − λ)x1 + λx2 ) ≤ (1 − λ)f (x1 ) + λf (x2 ).

A function is called strictly convex if the inequality (10.1) is strict whenever x1 ≠ x2 and 0 < λ < 1.
A function f is called concave if −f is convex.
Examples of convex functions include f (x) = |x|, f (x) = x^2 , and f (x) = e^x .
Figure 7. An example of a convex function. For any x1 , x2 the graph of f lies below
the chord joining points (x1 , f (x1 )) and (x2 , f (x2 )) on the graph. More precisely, the
value of f at (1 − λ)x1 + λx2 , for any λ ∈ [0, 1], is bounded above by the value of the
chord at the same point.

Theorem 10.1. (Jensen’s inequality) Let f : R^d → R be a convex function. Then for any x1 , ..., xn ∈ R^d and any probability distribution p1 , ..., pn , i.e. 0 ≤ pi ≤ 1 for i = 1, 2, ..., n and p1 + ... + pn = 1, one has

(10.2)    f (p1 x1 + ... + pn xn ) ≤ p1 f (x1 ) + ... + pn f (xn ).
Proof. The proof is by induction on the number of points. When n = 2 the inequality
(10.2) coincides with the definition of the convex function (10.1). Now assume (10.2)
holds for n − 1 points. To prove the case of n points notice first that the inequality is
trivially true if p1 = 1. Otherwise, if p1 < 1, then q := p2 + ... + pn > 0, and hence
f (p1 x1 + ... + pn xn ) = f (p1 x1 + q ((p2 /q) x2 + ... + (pn /q) xn ))
  ≤ p1 f (x1 ) + q f ((p2 /q) x2 + ... + (pn /q) xn )      (by (10.1))
  ≤ p1 f (x1 ) + q ((p2 /q) f (x2 ) + ... + (pn /q) f (xn ))      (by induction)
  = p1 f (x1 ) + p2 f (x2 ) + ... + pn f (xn ),

completing the inductive step and the proof of the inequality. ∎
An important corollary of Jensen’s inequality is the following.
Theorem 10.2. (Jensen’s inequality for expectation) Let f : R → R be a convex
function, and let X be a discrete random variable. If EX and Ef (X) both exist, then
(10.3) f (EX) ≤ Ef (X).
Proof. We split the proof in two parts, namely when X takes only finitely many values,
and the general case of discrete random variables.
Assume first that X takes finitely many values with the pmf P(X = xi ) = pi , where 0 ≤ pi ≤ 1 for i = 1, 2, ..., n and p1 + ... + pn = 1. Using the definition of expectation we get

f (EX) = f (Σ_{i=1}^{n} xi pi ) ≤ Σ_{i=1}^{n} pi f (xi ) = Ef (X),      (by Jensen’s inequality)
where the last equality is due to (7.3). This settles the case of random variables with
finitely many values.
To prove the general case assume X has distribution P(X = xi ) = pi with 0 ≤ pi ≤ 1 and Σ_{i=1}^{∞} pi = 1. We will use the fact that convex functions are continuous. In view of the continuity of f at EX, for any ε > 0 there exists N (ε) ∈ N such that for all N ≥ N (ε) we have

(10.4)    f (EX) ≤ ε + f (Σ_{i=1}^{N} xi p̃i ),

where p̃i = pi / Σ_{j=1}^{N} pj . Indeed, by definition EX = Σ_{i=1}^{∞} xi pi and the series is absolutely convergent. Hence Σ_{i=1}^{N} xi p̃i = (Σ_{j=1}^{N} pj )^{−1} Σ_{i=1}^{N} xi pi → EX as N → ∞, since {pi } is a probability distribution and sums to 1. Getting back to (10.4), we use the fact that {p̃i }_{i=1}^{N} is a finite distribution and apply Jensen’s inequality on the right-hand side, arriving at

f (EX) ≤ ε + Σ_{i=1}^{N} f (xi ) p̃i = ε + (Σ_{j=1}^{N} pj )^{−1} Σ_{i=1}^{N} f (xi ) pi .

Now both sums in the above inequality are convergent. Passing to the limit as N → ∞ and using Theorem 7.3 we get

f (EX) ≤ ε + Ef (X).

Since ε > 0 is arbitrary, the last inequality completes the proof of the general case of discrete random variables. ∎

Geometric proof of Jensen’s inequality. We also include a geometric proof of inequality (10.3). A number v ∈ R is called a sub-derivative of f at x0 ∈ R if f (x) ≥ f (x0 ) + v(x − x0 ) for all x ∈ R. The set of all sub-derivatives at x0 is denoted by ∂f (x0 ) and is called the sub-differential^20 of f at x0 . It is a well-known fact that for convex functions the sub-differential at any point is a non-empty closed convex set.
Let v be any sub-derivative of f at EX. By definition we have

f (x) ≥ f (EX) + v(x − EX)   for all x ∈ R.

Applying the inequality above for x = xi , multiplying both sides by pi and summing over i = 1, 2, ... we get

Σ_{i=1}^{∞} f (xi ) pi ≥ f (EX) + v · (Σ_{i=1}^{∞} pi xi − EX).
The left-hand side of this inequality equals Ef (X) by Theorem 7.3, and the right-hand
side is precisely f (EX) as the sum in the brackets equals EX. The proof of inequality
(10.3) is complete. 

For example, applying Jensen’s inequality for specific convex functions f we get

e^{EX} ≤ E e^X ,   taking f (x) = e^x ,

and

(EX)^{2n} ≤ E X^{2n} ,   taking f (x) = x^{2n} for some n ∈ N,

whenever the expectations above exist.
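These instances are easy to confirm numerically. The following sketch checks both bounds on a small pmf; the values and probabilities are arbitrary illustrative choices.

```python
import math

# A small illustrative pmf: P(X = x_i) = p_i (values chosen arbitrarily).
xs = [-1.0, 0.0, 2.0]
ps = [0.25, 0.5, 0.25]

EX = sum(x * p for x, p in zip(xs, ps))               # EX
E_exp = sum(math.exp(x) * p for x, p in zip(xs, ps))  # E e^X
E_sq = sum(x ** 2 * p for x, p in zip(xs, ps))        # E X^2

assert math.exp(EX) <= E_exp   # e^{EX} <= E e^X
assert EX ** 2 <= E_sq         # (EX)^2 <= E X^2 (the case 2n = 2)
```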

10.2. Arithmetic mean - Geometric mean inequality.

Theorem 10.3. (AM-GM inequality) Let x1 , x2 , ..., xn ∈ R be non-negative numbers. Then

(10.5)    (x1 x2 · ... · xn )^{1/n} ≤ (x1 + x2 + ... + xn )/n.
Proof. We may assume that xi > 0 for all 1 ≤ i ≤ n, as otherwise the inequality is trivially true. Since x ↦ log x is increasing, proving (10.5) is equivalent to proving the inequality obtained by applying log to both sides, which reads

Σ_{i=1}^{n} (1/n) log xi ≤ log (Σ_{i=1}^{n} (1/n) xi ).

Now the above inequality is a direct consequence of Jensen’s inequality and the convexity of f (x) = − log x, x > 0. The proof is complete. ∎

^20 In geometric terms, the elements of the sub-differential are precisely the slopes for which the graph of f lies above the line through (x0 , f (x0 )) with the given slope. The sub-differential is defined for functions f : R^d → R as well, where v ∈ R^d is a sub-derivative if f (x) ≥ f (x0 ) + ⟨v, x − x0 ⟩ for all x ∈ R^d , where ⟨·, ·⟩ is the standard inner product of R^d . Moreover, a convex function f is differentiable at x0 if and only if the sub-differential is a single-element set, in which case the only sub-derivative is the gradient of f at x0 .
10.3. Cauchy-Schwarz inequality.
Theorem 10.4. (Cauchy-Schwarz inequality) Let X and Y be two discrete random variables. Then

(10.6)    (E(XY ))^2 ≤ E(X^2 ) E(Y^2 ).

The inequality becomes an equality if and only if X and Y are linearly dependent.

Proof. If E(Y^2 ) = 0, by Theorem 7.2 (b) we get that P(Y^2 = 0) = 1 and hence P(Y = 0) = 1. Consequently both sides of (10.6) become zero and the inequality is true. Thus, we may assume that E(Y^2 ) > 0. Define Wt := tX + Y , where t ∈ R. Then

0 ≤ E(Wt^2 ) = t^2 E(X^2 ) + 2t E(XY ) + E(Y^2 ).

Notice that we have a quadratic polynomial in t which is always non-negative. Hence its discriminant must be non-positive, and we get that

4 (E(XY ))^2 − 4 E(X^2 )E(Y^2 ) ≤ 0,

which is equivalent to (10.6).
To see when (10.6) becomes an equality, observe that E(Wt^2 ) = 0 for some t ∈ R if and only if the discriminant of the above quadratic polynomial equals 0, which is precisely the case of equality in (10.6). But if E(Wt^2 ) = 0, then P(Wt = 0) = 1, which is exactly the case of linear dependence between X and Y . The proof is complete. ∎

 Exercise 10.4.1. Prove that (E|XY |)^2 ≤ E(X^2 )E(Y^2 ) for two discrete random variables X and Y .

10.4. Covariance and correlation. In this section we define covariance and correlation as measures of the joint variability of two random variables.
Definition 10.1. (Covariance) Given two random variables X and Y the covariance
between them is defined as
(10.7) cov(X, Y ) = E ((X − EX)(Y − EY )) ,
whenever the expectations on the right-hand side exist.
Note that the formula for covariance can be simplified to
cov(X, Y ) = E(XY ) − E(X)E(Y )
in view of the linearity of expectation. Hence the formula for the variance of sums of
random variables proved in Property 7.4 (c), in terms of the covariance, becomes
var(X + Y ) = var(X) + var(Y ) + 2cov(X, Y ).
Similarly, for the variance of the sum of n random variables X1 , X2 , ..., Xn we have

(10.8)    var(X1 + ... + Xn ) = var(X1 ) + ... + var(Xn ) + 2 Σ_{1≤i<j≤n} cov(Xi , Xj ).

 Exercise 10.4.2. Verify (10.8) following the proof of Property 7.4 (c).

We next state some basic properties of covariance, proofs of which follow directly from the definitions.
Property 10.5. (Basic properties of covariance) Let X and Y be discrete random
variables and a ∈ R be a constant. Whenever the covariances defined below exist, we
have
1. cov(X, a) = 0,
2. cov(X + a, Y ) = cov(X, Y ),
3. cov(aX, Y ) = a cov(X, Y ),
4. cov(X, Y ) = cov(Y, X),
5. cov(X, Y ) = E(XY ) − E(X)E(Y ),
6. cov(X, Y ) = 0 if X and Y are independent,
7. cov(X + Z, Y ) = cov(X, Y ) + cov(Z, Y ),
8. cov(Σ_{i=1}^{n} ai Xi , Σ_{j=1}^{m} bj Yj ) = Σ_{i=1}^{n} Σ_{j=1}^{m} ai bj cov(Xi , Yj ) for any discrete random variables Xi , Yj and constants ai , bj ∈ R, whenever the covariances exist^21.

 Exercise 10.5.1. Prove Property 10.5.

 Exercise 10.5.2. What is cov(X, X) for a given random variable X?

Example 10.5.1. (Zero covariance does not imply independence) The aim of this example is to show that the converse of Property 10.5 (6) is not true in general. To this end, consider a random variable X with probability mass function defined as

P(X = 1) = P(X = −1) = 1/4   and   P(X = 0) = 1/2.

Also, define a random variable Y by Y = 1 if X = 0, and Y = 0 if X ≠ 0.
Clearly E(X) = 0 and XY = 0, hence cov(X, Y ) = 0. However,

P(X = 0, Y = 1) = 1/2 ≠ 1/4 = P(X = 0)P(Y = 1),

implying that X and Y are not independent.
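The computation in this example can be replayed exactly with rational arithmetic; a small sketch:

```python
from fractions import Fraction as F

pmf_X = {1: F(1, 4), -1: F(1, 4), 0: F(1, 2)}   # pmf of X from the example
Y = lambda x: 1 if x == 0 else 0                 # Y as a function of X

EX = sum(x * p for x, p in pmf_X.items())
EY = sum(Y(x) * p for x, p in pmf_X.items())
EXY = sum(x * Y(x) * p for x, p in pmf_X.items())  # XY = 0 with probability 1
cov = EXY - EX * EY

assert cov == 0                    # zero covariance, yet...
# P(X=0, Y=1) = P(X=0) = 1/2, while P(X=0) P(Y=1) = 1/4, so X, Y are dependent
assert pmf_X[0] != pmf_X[0] * EY
```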

 Exercise 10.5.3. Let X be a discrete random variable such that EX = 0 and 0 < E|X|^3 < ∞. Let also Y = X^2 . Show that cov(X, Y ) = 0.
Note: This exercise gives another example of dependent random variables with vanishing covariance. As an example of such X one may take a random variable whose probability mass function (pmf ) is symmetric around zero, choosing the values of X and the pmf so as to guarantee 0 < E|X|^3 < ∞ (by Theorem 7.1 there exists a random variable with a given pmf ).

 Exercise 10.5.4. A fair coin is tossed 3 times. Let X be the number of Heads
in the first two tosses, and let Y be the number of Heads in the last two tosses (note the
overlap). Compute cov(X, Y ).

Definition 10.2. (Correlation) For two random variables X and Y such that var(X), var(Y ) > 0, the correlation (or correlation coefficient) between them is defined as

(10.9)    corr(X, Y ) = cov(X, Y ) / √(var(X) var(Y )).
^21 This is the bilinearity of the covariance.

Theorem 10.6. For any two random variables X and Y for which the correlation is defined, we have

(10.10)    |corr(X, Y )| ≤ 1.

Moreover, corr(X, Y ) = ε with ε ∈ {−1, 1} if and only if Y = aX + b for some a, b ∈ R where a ≠ 0 and a/|a| = ε.

Proof. Applying the Cauchy-Schwarz inequality (Theorem 10.4) to formula (10.7) of the covariance we get

|cov(X, Y )|^2 ≤ E(X − EX)^2 E(Y − EY )^2 = var(X) var(Y ).

This inequality, combined with (10.9), completes the proof of (10.10).
We now establish the second part of the theorem. If Y = aX + b, where a ≠ 0, then corr(X, Y ) = a/|a| by a direct computation (or using Property 10.5). To see the other side of the equivalence, notice that from the proof of (10.10) we have that |corr(X, Y )| = 1 if and only if there is an equality in the Cauchy-Schwarz inequality for X − EX and Y − EY . Hence, by Theorem 10.4 we obtain that Y = aX + b, and the sign of a must coincide with the sign of corr(X, Y ). ∎

Remark 10.7. From Theorem 10.6 we see that the correlation is the “normalized” version of the covariance, taking values in [−1, 1]. The second part of the theorem suggests that covariance (correlation) measures the extent to which the variables X and Y are related linearly. For instance, if corr(X, Y ) = ±1, then there is a linear dependence between X and Y .

10.5. Information entropy. Can one quantify the level of surprise or information
that a certain event bears? For example, let (Ω, F, P) be a probability space and A ∈ F
be an event with probability 0 ≤ p ≤ 1; how surprising is A? We will measure the
level of the surprise as a function of p, and denote that function by I (here I stands for
“information”). If P(A) = 1, then the event A is certain to happen and hence there is no surprise in its happening; thus we require I(1) = 0. Also note that the more unlikely the event is (i.e. the smaller p is), the more surprising its occurrence, hence p ↦ I(p) is required to be decreasing in p. For two independent events A and B the occurrence
of A ∩ B should sum up the surprise (or information content) coming from A and B,
thus we require I(P(A ∩ B)) = I(P(A)) + I(P(B)). It turns out these conditions on I
characterize it.

Theorem 10.8. (Characterization of the information function) Let I : (0, 1] → R+ be a function satisfying

(10.11) I(pq) = I(p) + I(q),

for any p, q ∈ (0, 1]. If I is continuous then I(p) = c log p for some c ≤ 0.

Proof. Make a change of variables in I by setting p = e^{−x} , where p ∈ (0, 1]. Let f (x) = I(e^{−x} ) for x ∈ [0, ∞). Clearly (10.11) transforms to^22

(10.12) f (x + y) = f (x) + f (y) for x, y ∈ R+ .

^22 Equation (10.12) is called Cauchy’s functional equation.
For any p, q ∈ N we have p/q = 1/q + ... + 1/q (p times), hence applying (10.12) we get f (p/q) = p f (1/q). Similarly, writing 1 = 1/q + ... + 1/q (q times), from (10.12) we arrive at f (1) = q f (1/q), and hence

(10.13)    f (p/q) = (p/q) f (1)   for all p, q ∈ N.

Now the continuity of f coupled with (10.13) implies that f (x) = x f (1) for all x ≥ 0. Getting back to the relation between f and I, we see that for x ∈ (0, 1] we have

I(x) = I(e^{−log(1/x)} ) = f (log(1/x)) = log(1/x) · f (1) = −I(e^{−1} ) log x = c log x,

where c = −I(e^{−1} ) ≤ 0 is a constant. The proof is complete. ∎

Proof of Theorem 10.8 when I is C^2 . We include another proof that the information function is a multiple of the logarithm, when I is assumed to be C^2 .
Using the fact that I is twice differentiable, we take the derivative in (10.11) with respect to p, arriving at

q I′(pq) = I′(p).

Now differentiate both sides of the last equality with respect to q. Doing so we get

I′(pq) + pq I′′(pq) = 0.

With a change of variable z := pq, and setting g(z) = I′(z), the last equation becomes g(z) + z g′(z) = 0. Thus dg/dz = −g/z, which is a differential equation with separable variables. Solving it we obtain g(z) = c/z for some constant c. Recall that g(z) = I′(z), hence from the formula for g we see that I(z) = c log z + c1 , where c and c1 are constants. To determine these constants recall that I must satisfy (10.11). In that formula take q = 1, which leads to I(p) = I(p) + I(1), hence I(1) = 0. Setting z = 1 we get 0 = I(1) = c log 1 + c1 , consequently c1 = 0. This leads to I(z) = c log z, and the proof is complete. ∎

→ Problem 10.8.1. Let I be as in Theorem 10.8. Prove that the requirement of continuity of I can be relaxed to measurability of I. More precisely, show that if I is a Lebesgue measurable function satisfying (10.11), then I(p) = c log p for some c ≤ 0.

→ Problem 10.8.2. Using the concept of Hamel basis and considering R as a vector
space over rationals Q, show that there exists a non-linear solution to (10.12). By the
previous problem, such solutions cannot be measurable. Thus, the condition of measur-
ability cannot be eliminated in the characterization of the information function I.

We choose the constant c in the definition of the information function to get I(p) = − log2 p, where p ∈ (0, 1]. Now, if X is a random variable with pmf pX given by pX (xi ) = P(X = xi ) = pi , with i = 1, 2, ..., n, then the average level of information (surprise) obtained on learning X equals^23

H(X) := E I(pX ) = − Σ_{i=1}^{n} pi log2 pi .

The value H(X) is called the information entropy (Shannon entropy) of X, which among other things measures the level of uncertainty in X. In this light it is plausible to expect that the highest level of uncertainty is obtained for the uniform distribution. Indeed, by concavity of x ↦ log x on x > 0, and Jensen’s inequality, we have

H(X) = Σ_{i=1}^{n} pi log2 (1/pi ) ≤ log2 (Σ_{i=1}^{n} pi · (1/pi )) = log2 n.

But log2 n is the entropy of the uniform distribution on n elements, and Jensen’s inequality is strict (for strictly concave functions) unless pi = const. Hence the uniform distribution has the maximal entropy.

→ Problem 10.8.3. There are 12 coins of identical shape and size, except one of them is fake. It is known that the fake coin has a different weight than a real coin, but it is not known whether the fake is heavier or lighter than a real one. You have a balance scale, and can weigh any two subsets of the 12 coins against each other. When placing two sets of coins on the two sides of the scale (either set of coins can be empty) there are three possible outcomes: the left and right sides have equal weights, the left side is heavier, or the right side is heavier.
Using the scale, find the fake coin and determine whether it is lighter or heavier in as few weighings as possible.
Hint: If we enumerate the coins by 1, 2, ..., 12, then there are 24 scenarios in total. Namely, the fake coin is coin number i for some 1 ≤ i ≤ 12 (12 possibilities) and it is either lighter or heavier than a real one (2 possibilities). Think about a weighing step as a process of gaining information about these scenarios, and when organizing a weighing step try to split the aforementioned possibilities as evenly as possible.
Note: If you can do W ≥ 1 weighings, what is the maximal number of coins, of which one is fake, that can be checked as in the original problem?

11. Weak Law of Large Numbers

In this section we discuss Markov’s and Chebyshev’s inequalities. We then prove a version of the Weak Law of Large Numbers (WLLN) and conclude with a probabilistic proof of the Weierstrass approximation theorem.
11.1. Markov and Chebyshev.
Theorem 11.1. (Markov’s inequality) Let X be a discrete random variable with E|X| < ∞ and let a > 0 be fixed. Then

P(|X| ≥ a) ≤ (1/a) E|X|.

Proof. By the linearity of expectation and Property 7.7 (1) of indicators we have

E|X| = E(|X| I_{|X|≥a} ) + E(|X| I_{|X|<a} ) ≥ E(|X| I_{|X|≥a} ) ≥ a E(I_{|X|≥a} ) = a P(|X| ≥ a).

Rearranging the last inequality completes the proof. ∎
^23 In the formula for H(X) we use the convention that 0 · log 0 = 0, based on the fact that p log p → 0 as p → 0+.
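Markov's inequality can be checked exactly on any small non-negative pmf; an illustrative sketch with exact rational arithmetic (the pmf is a hypothetical example):

```python
from fractions import Fraction as F

pmf = {0: F(1, 2), 1: F(1, 4), 4: F(1, 4)}        # a hypothetical non-negative pmf
E_abs = sum(x * p for x, p in pmf.items())         # E|X| = 5/4 here

for a in (1, 2, 4):
    tail = sum(p for x, p in pmf.items() if x >= a)   # P(|X| >= a)
    assert tail <= E_abs / a                           # Markov bound holds
```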
62 HAYK ALEKSANYAN

Theorem 11.2. (Chebyshev’s inequality) Let X be a discrete random variable with E|X|^2 < ∞ and let ε > 0 be fixed. Then

P(|X| ≥ ε) ≤ (1/ε^2 ) E|X|^2 .

Proof. In view of the monotonicity of x ↦ x^2 we have

P(|X| ≥ ε) = P(|X|^2 ≥ ε^2 ) ≤ (1/ε^2 ) E|X|^2 ,

where the last inequality follows by Markov’s inequality. ∎

Corollary 11.3. Let X and ε > 0 be as in Theorem 11.2 and let µ = EX. Then

(11.1)    P(|X − µ| ≥ ε) ≤ (1/ε^2 ) var(X).

Proof. The proof follows by applying Chebyshev’s inequality to X − µ. ∎

11.2. WLLN - easy version.

Theorem 11.4. (Weak Law of Large Numbers - easy version) Let X1 , X2 , ..., Xn , ... be a sequence of independent and identically distributed (i.i.d.) discrete random variables with mean µ and variance σ^2 . Define

Sn = X1 + ... + Xn ,   n = 1, 2, ... .

Then, for any ε > 0 we have

P(|Sn /n − µ| > ε) → 0,   as n → ∞.

Proof. Notice that E(Sn ) = nµ, hence applying inequality (11.1) we get

P(|Sn /n − µ| > ε) ≤ (1/ε^2 ) var(Sn /n) = (1/(ε^2 n^2 )) var(Sn ) = (1/(ε^2 n^2 )) n σ^2 = σ^2 /(ε^2 n) → 0,   as n → ∞,

where var(Sn ) = n σ^2 by independence, and the proof is complete. ∎
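The convergence in Theorem 11.4 is easy to observe in simulation. The sketch below uses fair-coin steps (so µ = 1/2) and estimates the tail probability by Monte Carlo; the sample sizes, ε, and number of trials are arbitrary illustrative choices.

```python
import random

random.seed(1)

def tail_prob(n, eps=0.1, trials=2000):
    """Monte Carlo estimate of P(|S_n/n - 1/2| > eps) for fair Bernoulli steps."""
    hits = 0
    for _ in range(trials):
        s = sum(random.randint(0, 1) for _ in range(n))
        if abs(s / n - 0.5) > eps:
            hits += 1
    return hits / trials

t_small, t_large = tail_prob(100), tail_prob(1000)
assert t_large <= t_small   # the tail probability shrinks as n grows
```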

The next exercise is a prelude to the probabilistic proof of the Weierstrass approximation theorem. It might be beneficial to think about the exercise before reading on.

 Exercise 11.4.1. Consider a sequence of Bernoulli trials, i.e. a sequence of independent random variables X1 , X2 , ..., Xn , ... with

P(Xi = 1) = P(Xi = 0) = 1/2,   for all i = 1, 2, ... .

Prove that for any f ∈ C[0, 1] one has

E f ((X1 + X2 + ... + Xn )/n) → f (1/2),   as n → ∞.
11.3. Probabilistic proof of the Weierstrass approximation theorem. We now put together the ideas developed in this section to prove a classical result due to Weierstrass on the approximation of continuous functions by polynomials.

Theorem 11.5. (Weierstrass approximation theorem) Let f ∈ C[0, 1]. Then, for any ε > 0 there exists a polynomial p(x) such that |f (x) − p(x)| < ε for all x ∈ [0, 1].

Proof. A continuous function on a closed interval is uniformly continuous, hence for any m ∈ N there exists δm > 0 such that

(11.2)    |f (x) − f (y)| < 1/m,   if x, y ∈ [0, 1] and |x − y| < δm .
For n ∈ N and 0 ≤ k ≤ n consider the following polynomial (called a Bernstein polynomial)

pn,k (x) = C(n, k) x^k (1 − x)^{n−k} ,   x ∈ [0, 1],

where C(n, k) is the binomial coefficient, and define

pn (x) := Σ_{k=0}^{n} f (k/n) pn,k (x).

Clearly pn is a polynomial for any n ∈ N. The aim is to show that the sequence pn approximates f . Consider a sequence of Bernoulli trials X1 , X2 , ..., Xn , where Xi ∼ B(1, x), i.e. P(Xi = 1) = x and P(Xi = 0) = 1 − x, and the random variables are independent. Define Sn := X1 + ... + Xn and notice that

(11.3)    E f (Sn /n) = Σ_{k=0}^{n} f (k/n) P(Sn = k) = pn (x),

where the first equality is due to Theorem 7.3 and the second one follows by the definition of Bernoulli trials^24. For m ∈ N define
Am,n,x := { ω ∈ Ω : |f (Sn (ω)/n) − f (x)| ≥ 1/m }.

Let also S̄n := Sn /n; then using (11.3) we get

(11.4)    |pn (x) − f (x)| = |E f (S̄n ) − f (x)| ≤ E|f (S̄n ) − f (x)|
    = E(|f (S̄n ) − f (x)| I_{A^c_{m,n,x}} ) + E(|f (S̄n ) − f (x)| I_{A_{m,n,x}} )
    ≤ 1/m + 2 max_{x∈[0,1]} |f (x)| · P(Am,n,x ).

To estimate P(Am,n,x ), observe that by (11.2) we have

Am,n,x ⊂ {ω ∈ Ω : |S̄n (ω) − x| ≥ δm },

hence by monotonicity of the probability measure we get

P(Am,n,x ) ≤ P({ω ∈ Ω : |S̄n (ω) − x| ≥ δm })      (by Corollary 11.3)
    ≤ (1/δm^2 ) var(S̄n ) = (1/(δm^2 n^2 )) · n x(1 − x) ≤ 1/(δm^2 n).

^24 The intuition behind the argument is the following. By the WLLN we know that Sn /n concentrates around x. Then the continuity of f implies that f (Sn /n) should concentrate around f (x) (cf. Exercise 11.4.1). But E f (Sn /n) is a polynomial in x, and we get a candidate polynomial that can approximate f .
Combining this estimate with (11.4) we obtain^25

(11.5)    |pn (x) − f (x)| ≤ 1/m + 2 max_{x∈[0,1]} |f (x)| · 1/(δm^2 n).

Now, for a given ε > 0 we fix m ∈ N large enough so that 1/m < ε/2, and then take n ∈ N so large that the second summand on the right-hand side of (11.5) is less than ε/2, completing the proof. ∎

12. Simple symmetric random walk

In this section we take a glimpse into the world of random walks. We introduce the simple one-dimensional random walk on the integer lattice Z = {..., −2, −1, 0, 1, 2, ...}, and prove the reflection principle and the arcsin law for it.
12.1. Random walks on Z and the reflection principle. Consider a particle walking on the integer lattice Z, one step at a time, either to the left or to the right of its current position, deciding the direction of the next move by a coin toss. The next definition formalizes this process.
Definition 12.1. (Simple random walk) Let X1 , X2 , ..., Xn , ... be a sequence of independent and identically distributed (i.i.d.) random variables where P(Xi = 1) = p and P(Xi = −1) = 1 − p, for some 0 ≤ p ≤ 1. Define a new sequence {Sn } where

(12.1)    S0 = 0,   Sn = X1 + X2 + ... + Xn , for n ≥ 1.

The sequence {Sn } is called a (simple) random walk on Z. When p = 1/2, the random walk is called symmetric.
Throughout this section we will assume that the walk is symmetric, i.e. p = 1/2
in Definition 12.1.
Random walks serve as a fundamental model for stochastic activity arising in various scenarios, such as fluctuating stock values, the path of a molecule traveling in a liquid, etc. Along with the setting of the integer lattice Z described above, there are a number of variations. For example, a random walk can be defined on the d-dimensional lattice Z^d , with d ≥ 2, where at each step the particle moves by selecting one of its 2d lattice neighbors. There are versions of the walk where the particle can choose to stay at its current position with a certain probability (the so-called lazy walks). Other spaces where a version of the random walk is defined and studied include graphs, Riemannian manifolds, finitely generated groups, etc. There is a vast theory developed for random walks in various settings, with deep and beautiful mathematical results. The aim of this section is to introduce the concept of random walks in the simplest form and to prove, with very basic techniques, some important properties of simple symmetric random walks.

12.1.1. Counting paths. We can think about the sequence {Sn } as a polygonal line joining the points (n, Sn ) → (n + 1, Sn+1 ) (see Figure 8). Each such path corresponds to a particular realization of the random walk, which can be identified with an outcome of the joint experiment with X1 , ..., Xn . Thus the number of all paths starting at (0, 0) and moving for n time-steps equals 2^n .

^25 Notice that by the WLLN we have that P(Am,n,x ) → 0 as n → ∞ for any fixed m and x. While we keep m fixed, we need uniformity with respect to x ∈ [0, 1], which is why we used Corollary 11.3 instead of directly applying Theorem 11.4.
Figure 8. Each of the four colors corresponds to a particular realization of a random walk starting from (0, 0) and moving for 5 steps. An outcome of the joint experiment with (X1 , ..., Xn ), as defined in (12.1), can be put in bijective correspondence with a path similar to those depicted here. In the case of 5 time-steps there are 2^5 different paths.

Observe that a walk starting from (0, 0) and moving for n steps cannot terminate at an arbitrary integer of [−n, n]. For example, when n = 2, none of the four paths of the walk ends at 1. In the light of this observation let us compute the number of paths from (0, 0) to (n, x) for a given point (n, x). Assume the walk makes a steps up and b steps down, i.e. the number of indices 1 ≤ i ≤ n for which Xi = 1 equals a, and for which Xi = −1 equals b. Since we make n steps in total and arrive at x from 0, we must have

a + b = n   and   a − b = x.

Solving this system we obtain

a = (n + x)/2   and   b = (n − x)/2.

Let N(0,0)↦(n,x) be the number of paths from (0, 0) to (n, x). If the context is clear we will drop the (0, 0) from the subscript and write N(n,x) for the number of paths from (0, 0) to (n, x). We will think about n as time, and about x as the location of the walk at time n. It is now clear that if n + x is not an even number, then there are no paths ending at x in n steps; otherwise the number of such paths equals

(12.2)    N(0,0)↦(n,x) = 0 if n + x is odd, and N(0,0)↦(n,x) = C(n, a), where a = (n + x)/2, otherwise.
Let us also note that we can shift the time and location of the starting position of the path. Namely, for any t ∈ Z and any x0 ∈ Z we have

(12.3)    N(0,0)↦(n,x) = N(t,0)↦(n+t,x) = N(t,x0)↦(n+t,x+x0) .
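Formula (12.2) can be validated against brute-force enumeration of all 2^n step sequences; a sketch (the choice n = 8 is arbitrary):

```python
from math import comb
from itertools import product

def n_paths(n, x):
    """Number of n-step walk paths from (0, 0) to (n, x), per formula (12.2)."""
    if (n + x) % 2 != 0 or abs(x) > n:
        return 0
    return comb(n, (n + x) // 2)

n = 8
counts = {}
for steps in product([1, -1], repeat=n):       # enumerate all 2^8 paths
    end = sum(steps)
    counts[end] = counts.get(end, 0) + 1

for x in range(-n, n + 1):
    assert counts.get(x, 0) == n_paths(n, x)
```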
By N⁰(0,0)↦(n,x) denote the number of paths from (0, 0) to (n, x) that become 0 at some step 1 ≤ i ≤ n. The paths that start and end at positive integers but become zero at some intermediate point are of special interest. The next result links the number of paths that hit zero at some intermediate point with the number of all paths.
Theorem 12.1. (Reflection principle for random walks) Let x, y ≥ 1 be integers. Then

N⁰(0,x)↦(n,y) = N(0,−x)↦(n,y) ,
i.e. the number of paths from (0, x) to (n, y) that become 0 at some intermediate point
1 ≤ i ≤ n − 1 equals the number of all paths from (0, −x) to (n, y).
Proof. Let A be the set of all paths from (0, x) to (n, y) that become 0 at some point
1 ≤ i ≤ n − 1, and let B be the set of all paths from (0, −x) to (n, y). We need to
show that |A| = |B|. Let (0, s0 ) → (1, s1 ) → ... → (n, sn ) be any path from A. By
definition we have s0 = x, sn = y and there exists 1 ≤ i ≤ n − 1 such that si = 0.
Define k = min{1 ≤ i ≤ n − 1 : si = 0}, i.e. the first time the path hits 0. Now define
a new path as follows (see Figure 9):

(12.4)    s̃i = −si for 0 ≤ i ≤ k, and s̃i = si for k < i ≤ n.

Figure 9. One of the paths from (0, 1) to (5, 2) that hits 0 in an intermediate
point (at step 3 in this example). Part of the path that is reflected with respect to
the x axis is depicted by dashed lines. This reflection generates a path from (0, −1)
to (5, 2).

Clearly we obtain a path from (0, −x) to (n, y). This reflection defines a one-to-one (injective) mapping from A to B, implying that |A| ≤ |B|.
To show |A| ≥ |B|, take any path from B. Since the path starts at −x < 0 and ends at y > 0, it must hit 0 before reaching y. Let 1 ≤ i ≤ n − 1 be the first time the path becomes 0. Then the reflected path given by (12.4) defines a one-to-one mapping from B to A, hence |B| ≤ |A| and the proof is complete. ∎

Here is a peculiar application of the reflection principle.

Theorem 12.2. (Ballot theorem) Assume that in a certain election candidate A gets α votes and candidate B gets β votes, where α > β > 0. Then the probability that, throughout the counting of the votes, A is always ahead of B equals (α − β)/(α + β).

Notice that each realization of a counting process of votes is a path that leads from
(0, 0) to (α + β, α − β). The counting scenarios where candidate A is always ahead of
candidate B correspond to the paths that never cross 0.
Proof. Define S0 = 0 and for n ≥ 1 set Sn = X1 + ... + Xn , where Xi = 1 if the i-th vote was for A, and Xi = −1 if the i-th vote was for B.
PROBABILITY LECTURE NOTES 67

Set T := α + β and D := α − β representing the total number of votes and the


difference of votes for A and B. By definition {Sn } represents the counting process
that starts at 0 and ends at D. We are interested in the number of paths that do not
intersect 0, i.e. when Sn > 0 for all n = 1, 2, ..., T . The latter is equal to the number
of paths from (1, 1) to (T, D) that do not cross 0. Shifting the time axis by 1 we need
to count paths from (0, 1) to (T − 1, D) that stay positive. By Theorem 12.2 (reflection
principle) the number of such paths equals
N_{(0,1)↦(T−1, D)} − N_{(0,−1)↦(T−1, D)}
    = N_{(0,0)↦(T−1, D−1)} − N_{(0,0)↦(T−1, D+1)}          (by (12.2))
    = C(α+β−1, α−1) − C(α+β−1, α)
    = (α+β−1)!/((α−1)! β!) − (α+β−1)!/(α! (β−1)!)
    = ((α−β)/(α+β)) C(α+β, α)
    = ((α−β)/(α+β)) N_{(0,0)↦(α+β, α−β)}.
We proved that the number of paths along which A always stays ahead of B equals the
fraction (α − β)/(α + β) of the total number of paths, hence the proof is complete. ∎
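The ballot theorem is easy to sanity-check by enumerating all counting orders for small α, β. A short Python sketch (the helper below is ours, not part of the notes): a counting order is encoded by the set of positions of the A-votes among the α + β votes, and we count the orders in which A's lead stays strictly positive.

```python
from itertools import combinations
from fractions import Fraction

def ballot_probability(a, b):
    """Exact probability that A stays strictly ahead throughout the count,
    with a votes for A and b votes for B (a > b > 0), all orders equally likely."""
    total = wins = 0
    for a_positions in combinations(range(a + b), a):
        a_set = set(a_positions)
        lead, always_ahead = 0, True
        for i in range(a + b):
            lead += 1 if i in a_set else -1
            if lead <= 0:          # tie or behind: A is not strictly ahead
                always_ahead = False
                break
        total += 1
        wins += always_ahead
    return Fraction(wins, total)
```

For (α, β) = (3, 2) this returns 1/5 = (3 − 2)/(3 + 2): out of the C(5, 3) = 10 counting orders, only AAABB and AABAB keep A strictly ahead.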
12.2. Arcsin law. Recall that we assume the walk is symmetric. In this section we
study the distributions of the first and the last visit to 0 of a walk starting from the origin.
Lemma 12.3. Let Sn be as in (12.1). Then

P(S1 ≠ 0, S2 ≠ 0, ..., S2n ≠ 0) = P(S2n = 0).
Proof. We will show that the numbers of paths on the two sides of the equality in question
are equal. This will complete the proof, since the walk is symmetric and hence all paths
of length 2n have the same probability 2^{−2n}.
The walk starts from 0, hence in 2n steps it can reach at most 2n; thus

P(S1 > 0, S2 > 0, ..., S2n > 0) = Σ_{k=1}^{n} P(S1 > 0, ..., S2n−1 > 0, S2n = 2k).
For each 1 ≤ k ≤ n the number of paths with S1 > 0, ..., S2n−1 > 0, S2n = 2k equals

N_{(1,1)↦(2n,2k)} − N⁰_{(1,1)↦(2n,2k)}                                  (shifting time)
= N_{(0,1)↦(2n−1,2k)} − N⁰_{(0,1)↦(2n−1,2k)}     (by the reflection principle, Theorem 12.1)
= N_{(0,1)↦(2n−1,2k)} − N_{(0,−1)↦(2n−1,2k)}                            (shifting location)
= N_{(0,0)↦(2n−1,2k−1)} − N_{(0,0)↦(2n−1,2k+1)},

where N⁰ denotes the number of the corresponding paths that hit 0 at some intermediate step.
Thus the number of paths with S1 > 0, S2 > 0, ..., S2n = 2k for some 1 ≤ k ≤ n equals

Σ_{k=1}^{n} ( N_{(0,0)↦(2n−1,2k−1)} − N_{(0,0)↦(2n−1,2k+1)} )
= N_{(0,0)↦(2n−1,1)} − N_{(0,0)↦(2n−1,2n+1)} = C(2n−1, n),
where we used (12.2) for the last equality. In view of the symmetry the number of paths
that are always negative equals the number of paths that are always positive. Hence
the number of paths such that S1 ≠ 0, S2 ≠ 0, ..., S2n ≠ 0 becomes

(12.5)        2 C(2n−1, n) = C(2n, n) = N_{(0,0)↦(2n,0)},
where the last equality follows by (12.2). We see that the number of paths that never
hit 0 equals the number of paths that end at 0 at time 2n. This settles the lemma, as
noted at the beginning of the proof. ∎
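Lemma 12.3 can be verified directly for small n by exhaustive enumeration. The following Python sketch (the function name is ours) counts the 2n-step paths whose partial sums avoid 0 and compares the count with C(2n, n) = 2^{2n} P(S2n = 0).

```python
from itertools import product
from math import comb

def n_avoiding_zero(m):
    """Number of m-step ±1 paths from 0 whose partial sums are all nonzero."""
    count = 0
    for steps in product((-1, 1), repeat=m):
        s = 0
        for x in steps:
            s += x
            if s == 0:
                break
        else:               # no break: the path never returned to 0
            count += 1
    return count
```

Dividing both counts by 2^{2n} gives the two probabilities of the lemma; for instance `n_avoiding_zero(4)` equals `comb(4, 2)` = 6.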
An important corollary of the last lemma is an estimate of the probability of the first
visit to zero. Let T = inf{m ≥ 1 : Sm = 0}, i.e. T is the first time a symmetric
random walk started at 0 revisits it. Then, by Lemma 12.3 we have

(12.6)        P(T > 2n) = P(S2n = 0) = C(2n, n)/2^{2n} ∼ π^{−1/2} n^{−1/2},   as n → ∞,

where we used (12.5) and Stirling's formula (2.1).
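The asymptotics in (12.6) can be checked numerically with exact binomials. A small sketch (the function name is ours): P(T > 2n) is computed exactly, and its ratio against π^{−1/2} n^{−1/2} approaches 1 as n grows.

```python
from math import comb, pi, sqrt

def p_no_return(n):
    """Exact P(T > 2n) = P(S_{2n} = 0) = C(2n, n) / 4^n, as in (12.6)."""
    return comb(2 * n, n) / 4**n
```

For example, `p_no_return(n) * sqrt(pi * n)` evaluated at n = 10, 100, 1000 gives values increasing towards 1, while the probabilities themselves decrease monotonically in n.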
We now consider the last visit to zero. Let L2n := sup{m ≤ 2n : Sm = 0}, and
set u2m := P(S2m = 0).
Lemma 12.4. For each 1 ≤ k ≤ n we have P(L2n = 2k) = u2k u2n−2k .
Proof. Notice that on the event {L2n = 2k} the walk is at 0 at step 2k and does not
visit 0 again until step 2n. Hence

P(L2n = 2k) = P(S2k = 0, S2k+1 ≠ 0, ..., S2n ≠ 0)
= P(S2k = 0) P(S2k+1 ≠ 0, ..., S2n ≠ 0 | S2k = 0)
= u2k P(S1 ≠ 0, ..., S2n−2k ≠ 0)                      (using Lemma 12.3)
= u2k u2n−2k,

where the passage from the second to the third row is due to the fact that a walk
conditioned to be at 0 at step 2k has, from that step on, the same distribution as a
walk starting at 0. The lemma is proved. ∎
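Lemma 12.4 can also be confirmed by brute force for small n. The sketch below (helper names ours) computes the exact distribution of L2n by enumerating all 2^{2n} paths and compares it with u2k u2n−2k; the identity in fact also holds at k = 0, which is precisely Lemma 12.3.

```python
from itertools import product
from fractions import Fraction
from math import comb

def u(m):
    """u_{2m} = P(S_{2m} = 0) = C(2m, m) / 4^m, as an exact fraction."""
    return Fraction(comb(2 * m, m), 4**m)

def last_zero_distribution(n):
    """Exact P(L_{2n} = 2k), k = 0..n, by enumerating every 2n-step path."""
    counts = [0] * (n + 1)
    for steps in product((-1, 1), repeat=2 * n):
        s, last = 0, 0
        for i, x in enumerate(steps, start=1):
            s += x
            if s == 0:
                last = i
        counts[last // 2] += 1
    return [Fraction(c, 4**n) for c in counts]
```

For n = 4 the enumeration runs over 256 paths and the resulting distribution matches u(k) * u(4 − k) exactly for every k.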
Theorem 12.5. (Arcsin law for the last visit to 0) For any 0 < a < b < 1 we have

P( a < L2n/(2n) < b ) → (1/π) ∫_a^b dx/√(x(1 − x)),   as n → ∞.
Remark 12.6. To see why this law is called arcsin, substitute x = y² in the integral,
which leads to

∫_a^b dx/√(x(1 − x)) = 2 ∫_{√a}^{√b} dy/√(1 − y²) = 2(arcsin √b − arcsin √a).

Since arcsin(1/√2) = π/4, taking a → 0 and b = 1/2 in Theorem 12.5 we get

(12.7)        P( L2n/(2n) < 1/2 ) → 1/2,   as n → ∞.
Here is an interesting interpretation of this law in gambling terms. Assume two players,
each with initial capital 0, gamble on fair coin tosses. It follows from (12.7) that in a long
series of tosses, with probability close to 1/2, one of the players will be ahead for the
entire second half of the series.
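Even for moderate n the exact probability in (12.7) is already close to 1/2. A small Python sketch (the function name is ours) computes it via Lemma 12.4, using the recursion u_{2m} = u_{2m−2} (2m − 1)/(2m), which follows from the binomial formula for u_{2m}:

```python
def p_last_zero_in_first_half(n):
    """Exact P(L_{2n}/(2n) < 1/2) via Lemma 12.4, in floating point."""
    u = [1.0]                      # u[m] = u_{2m} = P(S_{2m} = 0)
    for m in range(1, n + 1):
        u.append(u[-1] * (2 * m - 1) / (2 * m))
    # Sum P(L_{2n} = 2k) = u_{2k} u_{2n-2k} over all k with 2k < n:
    return sum(u[k] * u[n - k] for k in range(n + 1) if 2 * k < n)
```

By the symmetry k ↔ n − k of the distribution, the result sits just below 1/2: for n = 1000 it is about 0.4997.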
Proof of Theorem 12.5. By Lemma 12.4 and (12.6) we have

(12.8)        P(L2n = 2k) = (1/π) (1/√(k(n − k))) βn,k,
where βn,k → 1 as both k and n − k tend to ∞. For each n ∈ N let a ≤ an < bn ≤ b be chosen so
that 2nan is the smallest even integer greater than or equal to 2na, and 2nbn is the largest
even integer less than or equal to 2nb. Clearly an → a and bn → b as n → ∞. It follows
that
(12.9)        P( a < L2n/(2n) < b ) = Σ_{k=nan}^{nbn} P(L2n = 2k) = (1/π) Σ_{k=nan}^{nbn} (1/√(k(n − k))) βn,k,
where the last equality is due to (12.8). Let f(x) = 1/√(x(1 − x)), where 0 < x < 1.
Writing the Riemann sum for f on [a, b] over the partition

{ an + k/n : k = 0, 1, ..., (bn − an)n }

we get
(1/n) Σ_{k=0}^{(bn−an)n} f(an + k/n) = (1/n) Σ_{k=0}^{(bn−an)n} 1/√( ((nan + k)/n)(1 − (nan + k)/n) )

= Σ_{k=nan}^{nbn} 1/√(k(n − k)) → ∫_a^b f(x) dx,   as n → ∞.
Combining this with (12.9), and taking into account that βn,k → 1, the proof is complete. ∎
Email address: [email protected]
Web: https://www.haykaleksanyan.com
Complex Analysis
List of questions

1. Complex numbers and operations on them, examples.
2. Geometric characterization and trigonometric form of complex numbers, examples.
3. Euler's formula and the exponential form of a complex number.
4. Properties of the modulus and the conjugate of a complex number.
5. Roots of a complex number.
6. Sequences and series of complex numbers. General properties.
7. The point at infinity and stereographic projection.
8. Curves and sets in the plane.
9. Functions of a complex variable, limit and continuity.
10. Exponential, trigonometric and hyperbolic functions.
11. The derivative of a function of a complex variable. Basic properties of the derivative.
12. Necessary and sufficient conditions for differentiability, the Cauchy–Riemann conditions.
13. Harmonic and conjugate harmonic functions. The existence theorem for a conjugate harmonic function.
14. The integral of a function of a complex variable and its basic properties, reduction to a definite integral, examples.
15. Cauchy's integral theorem (proof under an additional assumption). Corollaries and examples.
16. Cauchy's integral theorem for multiply connected domains, corollaries.
17. Antiderivatives and integrals, the Newton–Leibniz formula.
18. Cauchy's integral formula, examples.
19. The mean value theorem for harmonic and analytic functions.
20. Power series. Abel's first theorem, uniform and absolute convergence.
21. The Cauchy–Hadamard theorem, radius of convergence.
22. Integration and differentiation of power series. Taylor series.
23. Infinite differentiability of analytic functions. The theorems of Morera and Weierstrass.
24. Taylor series of the basic elementary functions. Examples.
25. Zeros of analytic functions and the uniqueness theorem.
26. Laurent series and their domains of convergence.
27. Expansion of an analytic function into a Laurent series, Laurent's theorem. Examples.
28. Classification of singular points. Removable singularity, pole, essential singularity; necessary and sufficient conditions.
29. The Sokhotski–Weierstrass theorem. Examples.
30. Laurent series in a neighborhood of the point at infinity and the classification of singularities at the point at infinity.
31. Definition of the residue and its basic properties. Computing the residue at a pole. Examples.
32. The fundamental theorem of residues and its corollaries. Examples.
33. Computing integrals of the first and second types using residues. Examples.
34. Computing integrals of the 3rd and 4th types using residues. Examples.
Mathematical Analysis 4.2 (2022–2023)

Mathematics

1. Integration of parameter-dependent improper integrals, computation of the Poisson integral.
2. The Euler integral of the first kind, B(a, b).
3. The Euler integral of the second kind, Γ(a).
4. The Euler–Gauss formula.
5. The relation between the B and Γ functions.
6. The complement (reflection) formula for the Γ function.
7. Parametrized curves, natural parametrization.
8. Line integrals of the first kind.
9. Line integrals of the second kind.
10. Goursat's lemma.
11. Integrals over a closed curve.
12. Green's formula.
13. Green's formula for multiply connected domains.
14. Path independence of line integrals.
15. Closed differential forms.
16. Mappings of plane figures.
17. Expressing area in curvilinear coordinates.
18. Parametrized surfaces.
19. The notion of the side (face) of a surface.
20. Schwarz's example.
21. The area of a surface given in explicit form.
22. The area of a parametrized surface.
23. Surface integrals of the first type.
24. Surface integrals of the second type.
25. Stokes' formula.
26. The Gauss–Ostrogradsky formula.
Mathematical Analysis 4.1 (2022–2023)

Mathematics

1. Functions of bounded variation, classes. A continuous function of infinite variation.
2. The variation of an integral with a variable upper limit.
3. The structure of the class BV[a, b].
4. The basic property of functions of bounded variation.
5. The Jordan decomposition of functions of bounded variation.
6. Vector-valued functions of bounded variation, rectifiable curves.
7. Definition and basic properties of the Stieltjes integral.
8. Integration by parts for the Stieltjes integral.
9. A necessary and sufficient condition for the existence of the Stieltjes integral with respect to an increasing function; existence of the Stieltjes integral in the general case.
10. Representation of the Stieltjes integral via the Riemann integral.
11. Cauchy's and Heine's definitions of generalized uniform convergence, Cauchy's principle of generalized uniform convergence.
12. Continuity of the limit function under generalized uniform convergence, and Dini's theorem.
13. Parameter-dependent integrals, passage to the limit, continuity of a parameter-dependent integral.
14. Differentiation of a parameter-dependent integral.
15. Integration of a parameter-dependent integral.
16. Uniform convergence of parameter-dependent improper integrals, Cauchy's principle.
17. The Weierstrass and Dini tests for uniform convergence of parameter-dependent improper integrals.
18. The Abel and Dirichlet tests for uniform convergence of parameter-dependent improper integrals.
19. Continuity of parameter-dependent improper integrals, passage to the limit under the integral sign.
20. Differentiation of parameter-dependent improper integrals, computation of the Dirichlet integral.