Probability & Statistics MTH 2401 STV V54
MTH 2401
Lecture Notes
Instructor: Eugene Dshalalow
Department of Mathematical Sciences
Florida Institute of Technology
Melbourne, FL 32901
[email protected]
CHAPTER I. FOUNDATIONS OF PROBABILISTIC MODELING
Ω – Sample Space (a carrier set from which all pertinent events are to be drawn)
[Figure: Venn diagrams of events A and B, illustrating A ∪ B, A ∩ B, and the complement A^c]
Set-Theoretical Laws
1) Intersection and Union are commutative and associative:

A ∪ B = B ∪ A,  A ∩ B = B ∩ A,
(A ∪ B) ∪ C = A ∪ (B ∪ C),  (A ∩ B) ∩ C = A ∩ (B ∩ C)

2) Distributive Laws

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

3) De Morgan's Laws

(A ∪ B)^c = A^c ∩ B^c
(A ∩ B)^c = A^c ∪ B^c.
Probability (Measure) is a function P whose domain is F(Ω), valued in [0,1], and which satisfies two main axioms. More specifically,

P(A ∪ B) = P(A) + P(B) for disjoint events A and B  (Axiom 2)

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) for pairwise disjoint events A₁, A₂, …  (Axiom 2')

Note that Axiom 2 is a special case of Axiom 2' (if we let A₂ = A₃ = ⋯ = ∅), and Axiom 2' is the original second axiom of probability. However, for convenience, its special case, Axiom 2, is separately formulated. Axiom 2 is called the additivity axiom, whereas Axiom 2' is referred to as the σ-additivity axiom.
Example 1.1. Suppose we are to roll a die and are interested in the event of obtaining an even number of dots. A reasonable and fairly compact model would be one with

F(Ω) = {Ω, ∅, A, A^c}, where A ~ "an even number of dots."

With no further information about the die, we assume that it is homogeneous, with each outcome equally likely. We then postulate P(A) = P(A^c) = 1/2, along with P(Ω) = 1 and P(∅) = 0. (This model may turn out to be unrealistic if the die is biased. The latter can be further investigated by conducting statistical experiments.) It is readily seen that the model is consistent and the probability measure satisfies conditions (or rather "axioms") (a) and (b).
In Example 1.1, suppose we roll a biased die so that P(A) = p and thus P(A^c) = 1 − p, for some p ∈ [0, 1]. Then the probability measure on F(Ω), with

P(Ω) = 1, P(∅) = 0, P(A) = p, P(A^c) = 1 − p,

is called Bernoulli.

Definition 1.1. A generic probability space (Ω, F(Ω), P) is called a Bernoulli probability space if F(Ω) = {Ω, ∅, A, A^c} for some nonempty subset A of Ω and if P is the Bernoulli probability measure on F(Ω), i.e. P: F(Ω) → {1, 0, p, 1 − p}.
A few more properties of every probability measure, which can be easily proved:

1) For A ⊆ B,

P(B − A) = P(B) − P(A),  (1.1)

where B − A is the rest of B after all points of B belonging to A (if any) are removed. In particular, it follows that P(A) ≤ P(B) (monotonicity). Taking B = Ω, we obtain from (1.1) that

P(A^c) = 1 − P(A).

2) For arbitrary events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The latter is due to the partition of A ∪ B into A and [B − (A ∩ B)] and the use of additivity Axiom 2.
The notion of a Laplacian probability space formalizes an experiment with finitely many equally likely outcomes. It requires:

1) Ω = {ω₁, …, ω_n} is a finite set.

2) For each i = 1, …, n, {ω_i} ∈ F(Ω). This means that singletons are events, called elementary events. The latter immediately yields that all subsets of Ω are events. Indeed, all we need to do is include all unions of singletons in F(Ω) and thus run over all subsets of Ω. (In set theory, such a collection is referred to as the power set, in notation 𝒫(Ω).)

3) Elementary events are equally likely. In order for this to happen, we need P({ω_i}) = 1/n, for each i = 1, …, n.
Remark 1.1. In Example 1.1, rolling a homogeneous die can alternatively be modeled by a Laplacian space, because the outcomes of rolling are equally likely. However, we still need to formally include the singletons {ω_i} in the σ-algebra F of the pertinent events; just having this in mind is insufficient. Then,

P(A) = |A|/|Ω|, where |·| denotes the number of elements.  (1.4)

Indeed, if A = {ω₁, …, ω_k} (for example) and Ω = {ω₁, …, ω_n}, then by the additivity axiom,

P(A) = Σ_{i=1}^k P({ω_i}) = k/n = |A|/|Ω|.
Remark 1.2. In Example 1.1, of course, we need not include all singletons in F(Ω) as far as our original model goes. We can change the setting as follows. Introduce a new sample space Ω̃ = {ω̃₁, ω̃₂}, defining {ω̃₁} := A and {ω̃₂} := A^c. Physically speaking, we partition the sample space into the events "even number of dots" and "odd number of dots". In this case, the die turns into a "coin" with two faces.
PROBLEMS
1.1. A pair of fair dice is tossed. What is the probability of the event A ~ "getting a total of 7"? Construct a relevant probability space and then determine P(A). Justify your calculation.
1.2. In an experiment, a busy intersection is observed at which vehicles can turn left, right, or go straight. Suppose two vehicles are moving through the intersection. (i) Describe the sample space and each of the sample points. (ii) If all sample points are equally likely, what is the probability that at least one car goes straight? (iii) Given that all points are equally likely, what is the probability that at most one car turns?
1.3. Let (Ω, F(Ω), P) be a probability space and A and B two events such that P(A) = 1/2 and P(A ∩ B) = 1/4. Find P(A ∩ B^c). Hint: Show that A ∩ B^c = A − (A ∩ B) and then use Property 1, equation (1.1).
1.4. The probability that an American computer company will outsource its technical support to
China is 0.6; the probability that it will outsource its support to India is 0.5, and the probability
that it will outsource its support to either China or India or both is 0.9. What is the probability
that the company will outsource its technical support (a) to both countries, (b) to neither
country?
1.5. Let A, B, and C be three events. Find the expressions for the events:
1.6. Suppose that A and B are mutually exclusive events for which P(A) = 0.2 and P(B) = 0.4. What is the probability that: (a) either A or B occurs; (b) A occurs, but B does not; (c) both A and B occur.
1.7. Suppose A and B are two events such that P(A) = 0.9 and P(B) = 0.6. Can P(A ∩ B) = 0.2?

1.8. Under the conditions of Problem 1.7, what is the smallest possible value of P(A ∩ B)? What is the largest possible value of P(A ∩ B)?
1.9. Show that
a) (A − B)^c = A^c ∪ B.
b) [(A^c ∪ B)^c ∪ (A ∪ B^c)]^c = B − A.
c) (A ∩ B) ∪ (A ∩ B^c) ∪ (A^c ∩ B) = A ∪ B.
2. Combinatorial Probability
MODEL 1. (Multiplication). Suppose there are two boxes filled with balls. Box 1 has m balls and Box 2 has n balls. In how many different ways can two balls, one from Box 1 and another from Box 2, be chosen? The answer is simple. The pairs (b₁, b₂) total m · n, which is the cardinality of the Cartesian product of the two sets of balls.
Now, if we have r boxes, filled with m₁, …, m_r balls, respectively, then clearly the total number of different ordered r-tuples of balls chosen from the boxes equals the cardinality of the Cartesian product of the r sets of balls and is thus m₁ · … · m_r.
Example 2.1. The Florida license test is a multiple-choice test of 20 questions. Each of the questions is supplied with four different answers, of which only one is correct. Suppose a candidate needs to answer all questions correctly in order to pass. What is the probability that the candidate will pass the test?

Solution. In the absence of any other information, we assume that the candidate's choices of answers form a random sequence of trials. The sample space Ω will contain 4²⁰ elementary points, corresponding to the experiment with 20 boxes, each containing four balls. So, the i-th elementary event will contain a 20-tuple of chosen responses: ω_i = (r_{i1}, …, r_{i,20}), i = 1, …, 4²⁰. The space is Laplacian and passing the test would mean choosing just one of the 4²⁰ ≈ 1.1 × 10¹² elementary events. This by (1.4) would yield

P({ω_k}) = 1/4²⁰ ≈ 0.91 × 10⁻¹².
Example 2.2. Suppose four students enter an elevator which serves the six upper floors of the Crawford Science Building. What is the probability that all four students will get off the elevator singly, i.e. that not more than one of them will leave the elevator at any floor?

Solution. First we need to identify the combinatorial problem we are dealing with.

MODEL 2. (Permutations). Suppose there is a box with n numbered empty cells and k (≤ n) numbered balls. Assume that a cell contains a maximum of one ball at a time. The experiment consists of randomly distributing the k balls among the cells. The question arises: how many different placements can be rendered?
Figure 2.1 (a placement of numbered balls among numbered cells)
To perform the experiment, we drop the first ball into the box and see that it has n different placements. After the first ball is situated, the second ball has only n − 1 different possibilities. Using the multiplication theorem, we find that the first two balls have n(n − 1) different placements. It is like pairing two balls out of two boxes, with n and n − 1 balls, respectively, in light of the multiplication theorem.

Continuing the process with the rest of the balls, we obtain that the total number of placements of k numbered balls in n numbered cells is

P(n, k) = n(n − 1)⋯(n − k + 1) = n!/(n − k)!.  (2.2)
Now, back to the elevator problem. First, we identify the sample space Ω = {ω₁, …, ω_N}, where N is the total number of all placements of four students among six floors without restrictions. In this case, each of the four students can be identified as a box and the six floors can be put into each of the students as balls into boxes. So we have four boxes (students), each filled with six balls (floors); if we happen to draw floor number 1 from student 1, floor number 1 from student 2, floor number 3 from student 3, and floor number 3 from student 4, we have them get off on floors 1 and 3, two on each. By the multiplication theorem, we thus have

N = 6⁴

different placements of the students without restrictions. Now, of the N elementary placements, we need to pick out those ω_i's where the students get off singly. So, A = {ω_{i₁}, …, ω_{i_p}}, where p is the number of permutations from (2.2). Because the space is Laplacian,

P(A) = |A|/|Ω|,

all we need is to determine |A|, which by (2.2) is 6!/(6 − 4)!. The final answer is

P(A) = 6!/((6 − 4)! · 6⁴) = 5/18.
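This count is easy to verify numerically in R (a quick sketch using the built-in factorial function):

> (factorial(6)/factorial(6-4))/6^4
[1] 0.2777778

which agrees with 5/18 ≈ 0.2778.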
MODEL 3. (Combinations). Assume that in the permutation box, the balls are not numbered, so that only the location of the balls in the cells matters, but not how the balls permute among themselves. Since k numbered balls occupying any fixed set of k cells can permute among themselves in k! ways, each such unordered placement is counted k! times among the P(n, k) ordered ones, i.e. P(n, k) = k! C(n, k), where the number C(n, k) of combinations is

C(n, k) = n!/((n − k)! k!).  (2.3)
Combinations occur more frequently in applications and play a more important role than permutations. In particular,

(a + b)^n = Σ_{k=0}^n C(n, k) a^k b^{n−k}

is one of the major formulas in mathematics, known as the binomial formula. One important application of this formula is when a = b = 1. Then we have

Σ_{k=0}^n C(n, k) = 2^n.  (2.3b)

For instance, if A is an n-element set, A = {a₁, …, a_n}, in how many different ways can we select one-, two-, etc. element subsets of A, including the empty set? In other words, what is the cardinal number of the power set 𝒫(A), i.e. the quantity of all subsets of A including the empty set and A itself? Since the order within a set does not matter, the selection of a k-element subset is equivalent to the placement of k balls in n numbered cells. The occupied cells, say 2, 5, 7, will correspond to the selection of the subset with elements a₂, a₅, a₇. So, we have C(n, k) different k-element subsets, and summing over k = 0, …, n yields, in accord with (2.3b), 2^n subsets of A altogether.
Example 2.3 (Lottery Game). In a typical 6 from 49 lottery, 6 numbers (in the form of balls) are
drawn from 49. If the 6 numbers on a ticket match the numbers drawn, the ticket holder is a
jackpot winner. This is true regardless of the order in which the numbers are drawn. On the day
of drawing, exactly six numbers are randomly generated. Any participant (of 18 years of age or
older) who buys a state-issued ticket fills it out trying to "guess" a forthcoming sequence of six
numbers. As mentioned, the order in which the six numbers appear during the drawing is
irrelevant. Hence we identify the sample space as
Ω = {ω₁, …, ω_N},

where N = C(49, 6) = 13,983,816, and an elementary event is a combination of six numbers in increasing order.
Since all outcomes are regarded as equally likely, the related probability space (Ω, F(Ω), P) is Laplacian, and thus all singletons are measurable (or elementary events, as noted). Therefore, the probability of being a jackpot winner is P(J) = 1/N = 1/13,983,816, and thus it is very small.

On the other hand, there are also smaller awards for guessing just 4 (or 5) out of the 6 right numbers, which are clearly much more likely to win. So we wonder what is the probability of the event A ~ to guess exactly 4 out of 6. Since the pertinent probability space is Laplacian, P(A) = |A|/|Ω| = |A|/N. To find |A| we note that A contains any combination of 4 out of the 6 winning numbers together with any 2 of the 43 losing numbers:

P(A) = C(6, 4) C(43, 2) / C(49, 6) ≈ 0.000968619.  (2.3c)

> (choose(6,4)*choose(43,2))/choose(49,6)
[1] 0.0009686197
Similarly, for the event B ~ to guess exactly 5 out of 6,

P(B) = C(6, 5) C(43, 1) / C(49, 6) ≈ 0.0000184499.  (2.3d)
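As with (2.3c), R confirms (2.3d):

> (choose(6,5)*choose(43,1))/choose(49,6)
[1] 1.84499e-05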
Example 2.4. Consider a "random walk" of a particle along an integer lattice (grid) within a rectangle [0, r] × [0, u]. The particle moves only to the right or upward, as shown in the figure below.

[Figure: a monotone lattice path from (0,0) to (r, u)]

In how many different ways can the particle move from point (0,0) to point (r, u)? In other words, how many different paths are there connecting points (0,0) and (r, u)?

Solution. Obviously, however the particle moves, it needs to make exactly r right steps and u upward steps. Now, consider the associated one-box model of r + u numbered cells with r balls. Given a particular path of the particle starting from (0,0), let us put a ball in cell 1 if the particle's first step is to the right and leave cell 1 empty if the first step is upward, and so on along the path. Following the path as in the figure above, we then have the associated placement of the balls, so the number of paths is C(r + u, r).
Example 2.5. Under the condition of Example 2.4, in how many ways can the particle move from (0,0) to (r, u) so that on its way it passes through the point (a, b)?

[Figure: lattice paths from (0,0) to (r, u) passing through (a, b)]

Solution. The number of all paths from (0,0) to (a, b) is C(a + b, a), and the number of all paths from (a, b) to (r, u) is C(r − a + u − b, r − a). The total number of pertinent paths is therefore the product

C(a + b, a) · C(r − a + u − b, r − a).
Example 2.6. Under the condition of Example 2.5, find the probability that when the particle moves from point (0,0) to point (r, u), it passes through the point (a, b).
Solution. Let (Ω, F(Ω), P) be the probability space describing the model. Then Ω = {ω₁, …, ω_N}, where ω_i is the i-th path of the particle and N is the total number of different paths, e.g.,

ω_i = ((0,0), (0,1), (1,1), (1,2), …, (r, u)).

Because all paths are equally likely, the space is Laplacian. Therefore, the probability of the event A that when the particle moves from point (0,0) to point (r, u) it passes through the point (a, b) equals

P(A) = |A|/|Ω| = C(a + b, a) · C(r − a + u − b, r − a) / C(r + u, r).
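As a numerical illustration in R (the values r = 5, u = 4, a = 2, b = 2 are hypothetical, chosen only for the check):

> r<-5; u<-4; a<-2; b<-2
> (choose(a+b,a)*choose(r-a+u-b,r-a))/choose(r+u,r)
[1] 0.4761905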
Consider now the following sampling problem: a box contains N balls, of which r are red and w = N − r are white. A sample of n balls is drawn at random. What is the probability that exactly k of them are red?

Solution. First identify the sample space as Ω = {ω₁, …, ω_M}, whose cardinal |Ω| = M equals the number of combinations of n (different samples of balls) out of N, at first without regard to the colors, i.e. M = C(N, n). Take, as usual, F(Ω) = 𝒫(Ω), including the singletons {ω_j}, and postulate that all of them (now elementary events) carry equal measure, i.e. P({ω_j}) = 1/|Ω| = 1/C(N, n). Now, we need to figure out the event A that includes all those elementary events with exactly k red balls out of r (whose number is C(r, k)), while the remaining n − k are white (whose number is C(w, n − k)). Thus, by the multiplication theorem, |A| = C(r, k) C(w, n − k) = C(r, k) C(N − r, n − k). Finally,

P(A) = |A|/|Ω| = C(r, k) C(w, n − k) / C(N, n) = C(r, k) C(N − r, n − k) / C(N, n).  (*)

To reduce this model to the lottery game we take N = 49, r = n = 6, w = 43. Then (*) is fully reduced to (2.3c) if we take k = 4.
To solve this problem, we first rephrase the placement problem in Model 3 (with k balls) by considering n balls of which k are red and n − k are white. Then, during the process of placements, we identify the empty cells in Model 3 with those occupied by white balls to arrive at the same number

C(n, k) = n!/((n − k)! k!)  (2.6)

of different placements of red and white balls (which are distinguishable by colors only).

The latter can be interpreted as follows. To obtain the n! different permutations of n balls (regardless of their colors), we can combine k! different permutations of k red balls with (n − k)! different permutations of n − k white balls, provided that they occupy k and n − k fixed cells. Then, multiplying k!(n − k)! by C(n, k), we take into account all different placements of the balls, so that n! = k!(n − k)! C(n, k), which yields (2.6).
Now, suppose we have a set of n balls of r different colors, so that n₁ balls are of color c₁, …, n_r balls are of color c_r, with n₁ + ⋯ + n_r = n. Emulating the similar experiment with n such balls and denoting by C(n; n₁, …, n_r) the number of different placements of n balls distinguished by colors only, we deduce that

C(n; n₁, …, n_r) = n!/(n₁! ⋯ n_r!).  (2.6b)

The quantities (2.6b) are the multinomial coefficients, which appear in the expansion of (a₁ + ⋯ + a_r)^n. For example,

(a + b + c)^n = Σ_{i=0}^n Σ_{j=0}^{n−i} C(n; i, j, n−i−j) a^i b^j c^{n−i−j}.

Next,

(a + b + c + d)^n = Σ_{i=0}^n Σ_{j=0}^{n−i} Σ_{k=0}^{n−i−j} C(n; i, j, k, n−i−j−k) a^i b^j c^k d^{n−i−j−k},

and similarly for the general (a₁ + ⋯ + a_r)^n.
Example 2.7. In how many different ways can 9 security officers patrol three different sites, with 2 officers assigned to Area 1, 3 to Area 2, and 4 to Area 3? The answer is C(9; 2, 3, 4) = 9!/(2! 3! 4!) = 1260, as per formula (2.6b).

We can explain the above result as follows. Suppose we identify the 9 officers as O₁, …, O₉ prior to the task. Now, we assume that the officers patrolling Area 1 will wear blue uniforms, those patrolling Area 2 will wear red uniforms, and those patrolling Area 3 will wear green uniforms. Consider a particular distribution of the officers: the associated box model will carry 9 cells filled with 9 balls, of which the first, fifth, and eighth are red, the fourth and sixth are blue, and the rest are green. The total number of different patrolling arrangements equals the number of placements of nine balls of which two are blue, three are red, and four are green. The latter equals C(9; 2, 3, 4), as per formula (2.6b).
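A one-line check in R:

> factorial(9)/(factorial(2)*factorial(3)*factorial(4))
[1] 1260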
PROBLEMS
2.1. How many different seven-digit phone numbers can be generated provided that the first digit
cannot be zero?
Show that C(n, 0) − C(n, 1) + C(n, 2) − ⋯ + (−1)^n C(n, n) = 0.
2.5. Suppose three friends enter a train together at one of the stops. Suppose that the train needs
to make five more stops until its final destination. Considering that we have no prior knowledge
where the three friends are going to get off the train, what is the probability that all three of them
get off at different stops?
2.6. In the above lottery game (Example 2.3), find the probability that a single ticket will match
exactly three out of six winning numbers.
Answer. 0.0176504. Note that you can use R operators to calculate it:
> (choose(6,3)*choose(43,3))/choose(49,6)
[1] 0.0176504
2.7. If 5 books are picked at random from a shelf containing 7 novels, 4 books of poems, and 3 dictionaries, what is the probability of the event A that 3 novels and 2 books of poems are selected? Prior to the calculation of P(A), construct the sample space and explain how the probability of A is calculated based on a Laplacian space argument.
2.8. If n students are randomly seated in a row, what is the probability that two of them, say A and B, will sit next to each other?
2.9. In the context of the lottery game (Example 2.3), what is the minimal number of lottery tickets one must buy in order that at least one of them gives at least four right numbers with probability one, provided that all of them are filled out differently, but with no particular strategy?
2.10. In the context of the lottery game (Example 2.3), one buys 1000 lottery tickets and fills them out differently, but with no particular strategy. What is the probability that at least one of them gives at least four right numbers?
2.11. Suppose that a deck of 30 cards, containing 10 red cards, 10 blue cards, and 10 green cards, is distributed at random among three people so that each gets 10 cards. What is the probability that each person receives 10 cards of the same color?
2.12. Suppose a particle randomly moves in a three-dimensional integer lattice bounded by a three-dimensional rectangle, starting from point (0,0,0) and terminating its walk at (P, Q, R).

[Figure: a lattice box from (0,0,0) to (P, Q, R)]

How many different paths are there, provided that the particle can move only in the positive direction along any of the three axes? Explain your steps and justify your answer.

Answer. (P + Q + R)!/(P! Q! R!).
2.13. Calculate the number of different arrangements of all letters in the word

SOCIOLOGICAL

Solution. We use Model 5. If necessary, we can associate letters with balls of different colors. For instance, letter O can be associated with three red balls, etc. Altogether we have three O's, two C's, two I's, two L's, one S, one A, and one G. The answer is

C(12; 3, 2, 2, 2, 1, 1, 1) = 12!/(3! 2! 2! 2! 1! 1! 1!) = 9,979,200.
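Again, R confirms the count:

> factorial(12)/(factorial(3)*factorial(2)^3)
[1] 9979200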
2.14. In the context of Problem 2.12, how many different paths from (0,0,0) to (P, Q, R) pass through an intermediate point (p, q, r)?
3. Conditional Probability

Suppose that in a town T with N inhabitants there are w females, of whom c_w are colorblind. What is the probability that a randomly selected female is colorblind?

Obviously, this probability is c_w/w, where c_w is the number of colorblind females. We will formalize it. If W and C are the sets of all females and of all colorblind people, then the above probability c_w/w can be expressed as |C ∩ W|/|W|, where |A| is the number of elements in a finite set A. This, as the reader sees, agrees with the axioms of a Laplacian probability space. In this case, of course, W represents a sample space and the event C ∩ W is the trace of C on W, thus it is in W.

So we see here that (W, F(W), P_W) is a probability space formed from the original probability space (Ω, F(Ω), P), where Ω represents the population of town T. Also, P_W(C) = |C ∩ W|/|W|, good for C and for any other measurable subset L of Ω, with its trace on W.
We would like to express the measure P_W through the original measure P acting on F(Ω). Dividing the numerator and denominator in the last fraction by N, we arrive at the same result

c_w/w = |C ∩ W|/|W| = (|C ∩ W|/N)/(|W|/N) = (|C ∩ W|/|Ω|)/(|W|/|Ω|)  (3.1)

with a different interpretation. Obviously, the new numerator of (3.1) represents the probability that a randomly selected individual is a colorblind female (i.e. simultaneously a female and colorblind), while the denominator is the probability that a randomly chosen person is a female. Altogether it becomes a ratio of two probabilities, P(C ∩ W) over P(W), thus giving

c_w/w = (|C ∩ W|/N)/(|W|/N) = P(C ∩ W)/P(W).  (3.2)

Notice that the sets C and W have now turned into events from F(Ω). Most importantly, we can say that the ratio in (3.2) is the conditional probability that a randomly chosen person is colorblind given that this person is a female, in notation P(C|W). So we have

P_W(C) = P(C|W) = P(C ∩ W)/P(W).  (3.3)
Among the very important applications of the conditional probability formula are the total probability formula and the Bayes formula.

The Total Probability Formula. If the sample space Ω is partitioned into n disjoint subsets, Ω = H₁ ∪ ⋯ ∪ H_n, referred to as hypotheses, and if A ⊆ Ω is another set (all are events), then A (by the distributive law) will also be partitioned as

A = Ω ∩ A = (H₁ ∪ ⋯ ∪ H_n) ∩ A = (A ∩ H₁) ∪ ⋯ ∪ (A ∩ H_n).

In other words, A ∩ H_i is the trace of A on H_i and all these traces are disjoint. Applying the probability measure P to the left- and right-hand sides of the last equation and using the additivity axiom (b), we have

P(A) = Σ_{i=1}^n P(A ∩ H_i) = Σ_{i=1}^n P(A|H_i) P(H_i)  (3.4)

after applying the conditional probability formula to each of the n summands. The formula we arrived at is called the total probability formula, and it can also be extended in the same way to an infinite sum (series).
The Bayes Formula. Now we turn to the celebrated Bayes formula, which is foundational in probability and statistics, notably in Bayesian statistics. [Thomas Bayes, ca. 1702 – April 17, 1761, was a British mathematician who is the author of the formula.]

Suppose the probability of some hypothesis, say H_k, is known and equals P(H_k). It is referred to as the prior probability of H_k. Suppose an event A related to H_k has occurred and we want to reevaluate P(H_k) after event A occurred, since additional information on H_k via A has entered. More formally, we need to calculate P(H_k|A), called the posterior probability. Using the conditional probability formula (3.3) twice and the total probability formula (3.4) gives

P(H_k|A) = P(A ∩ H_k)/P(A) = P(A|H_k) P(H_k) / Σ_{i=1}^n P(A|H_i) P(H_i).  (3.5)
Example 3.1. Suppose that, based on the symptoms a patient has, his doctor is 60% certain that the patient has a particular disease. If the doctor's suspicion were overwhelming, say at least 85%, he would recommend surgery. Under these circumstances, the doctor opts for quite an invasive and expensive test, which unfortunately is not 100% reliable. In particular, the test can show positive even if the patient does not have the disease (false positive), because of his diabetes. The chance of this is 30%. On the other hand, the test can show negative even if the patient does have the disease (false negative) in 10% of all cases. The question is, in the event the test shows positive, how much the prior estimate of 60% will increase, and whether a positive test will elevate the prior from 60% to 85% or higher. Can we predict this before running the test, in order to see whether the test is worth rendering?

Solution. We start with identifying the reference event A ~ "the test shows positive" and the related hypotheses:

H₁ ~ patient has the disease, with the prior P(H₁) = 0.6
H₂ ~ patient does not have the disease, with the prior P(H₂) = 0.4

The further identification of the related conditional probabilities from (3.5) gives

P(A|H₁) = 1 − 0.1 = 0.9,  P(A|H₂) = 0.3,

so that by the Bayes formula,

P(H₁|A) = (0.9 · 0.6)/(0.9 · 0.6 + 0.3 · 0.4) = 0.54/0.66 ≈ 0.818.  (E3.1)

As we see from (E3.1), even if the result of the test turns out to be positive, the (posterior) probability of the patient having the disease would rise from 0.6 to 0.818, but not high enough to warrant surgery. Consequently, the test will not be recommended.
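The posterior (E3.1) is a one-line computation in R:

> (0.9*0.6)/(0.9*0.6 + 0.3*0.4)
[1] 0.8181818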
PROBLEMS
3.1. Two fair dice are rolled. What is the conditional probability that at least one lands on 3
given that the dice land on different numbers?
3.2. Let A ⊂ B. Using the conditional probability formula, express the following probabilities as simply as possible:
3.3. Peter tries to avoid going to a party to which he was invited. To justify his absence he flips a coin: if the coin shows Heads, he goes. Otherwise, he rolls a die to give the party yet another chance. If the die lands on 6, he goes. Otherwise, he stays home. If Peter ends up being at the party, what is the probability that the coin he flipped showed Heads?

Solution.
Step 1: Identify the reference event A ~ "Peter attends the party".
Step 2: Set up the hypotheses:
H₁ ~ Coin shows Heads; P(H₁) = 1/2 (prior)
H₂ ~ Coin shows Tails; P(H₂) = 1/2 (prior)
Step 3: Identify the conditional probabilities:
P(A|H₁) = 1
P(A|H₂) = 1/6
Step 4: Find the posterior probability using Bayes:
P(H₁|A) = (1 · 1/2)/((1 · 1/2) + (1/6 · 1/2)) = (1/2)/(7/12) = 6/7.
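In R:

> (1*(1/2))/(1*(1/2) + (1/6)*(1/2))
[1] 0.8571429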
3.4. A gambler has in his pocket a fair coin and a biased coin (with chances for Heads 9 out of 10). He selects one of the coins at random and flips it twice. Suppose both times the coin shows Heads. What is the probability that the fair coin was selected?

Solution sketch. Set up the hypotheses:
H₁ ~ Fair coin selected; P(H₁) = 1/2 (prior)
H₂ ~ Biased coin selected; P(H₂) = 1/2 (prior)
3.5. A sports team of 18 archers includes five who hit the target with probability 0.8, seven with probability 0.7, four with probability 0.6, and two with probability 0.5. A randomly selected archer shoots a bow and misses the target. To which group does he most likely belong?
3.6. A student forgot the last digit of a phone number she was about to dial. She decided to dial
the last digit at random. What is the probability that the student needs at most three trials?
Answer: 0.3.
3.7. Urn A contains 2 white and 8 red balls, whereas urn B contains 7 white and 2 red balls. A
ball is drawn at random from urn A and placed in urn B, after which a ball from urn B was drawn
and it happened to be red. What is the probability that the first ball drawn from urn A was also
red?
4. Independent Events

Returning to the colorblindness example of Section 3, suppose the proportion of colorblind people among females is the same as in the whole population, i.e. P(C|W) = P(C). By the conditional probability formula (3.3), this is equivalent to

P(C)P(W) = P(C ∩ W).  (4.1)

By symmetry, (4.1) also yields

P(W|C) = P(W ∩ C)/P(C) = P(W).  (4.2)

Hence, if W does not affect C, then C does not affect W either, and we see that the two events W and C are independent in a mutual way. (4.1) can be used as an equivalent definition of independence of two events.

A very similar notion of independence can be used for more than two events. Say, for three events A, B, C, independence means (a) pairwise independence as in (4.1), and (b) all three at once:

P(A ∩ B ∩ C) = P(A)P(B)P(C).
Example 4.1 (Bernstein Tetrahedron). A tetrahedron is a pyramid with four faces, each an equilateral triangle. Suppose we have a homogeneous tetrahedron whose three faces are painted Red, Yellow, and Green, respectively, while the fourth face is painted in all three colors. Let the events R, Y, G mean the appearance of red, yellow, and green, respectively, on the face of the tetrahedron landing down on the surface. Are the events R, Y, G pairwise independent? Are these events independent (altogether)?

Solution. Check that P(R ∩ Y) = 1/4, because red and yellow appear together only on the one face that contains all three colors. On the other hand, P(R) = P(Y) = 2/4 = 1/2, and thus P(R ∩ Y) = P(R)P(Y), meaning that the appearances of the colors are pairwise independent. (Other combinations of colors obviously give the same results.) However, P(R ∩ Y ∩ G) = 1/4, while P(R)P(Y)P(G) = 1/2³ = 1/8 ≠ 1/4, so the three events are not independent altogether.

Notice that without checking, it would be hard to impossible to figure this out by mere intuition.
The independence of any family of events is defined as independence of any finite subfamily and
the latter, in turn, requires independence of any combinations of the involved events.
PROBLEMS
4.1. Suppose A and B are independent events such that P(A) = 1/2 and P(B) = 3/4. Determine the probability that none of these events will occur. (Justify your actions.)

Solution. Step 1. We need to show first that if A and B are independent, then so are A^c and B^c:

A ∩ B^c = A − B = A − (A ∩ B)
⟹ P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)P(B^c),

so A and B^c are independent; by the same argument, so are A^c and B^c. Step 2. Hence P(A^c ∩ B^c) = P(A^c)P(B^c) = (1/2)(1/4) = 1/8.
4.2. If three balanced dice are rolled, what is the probability that all three numbers will be the same?

Solution sketch. Let
P(A_i) = probability that the first die shows i dots = 1/6,
P(B_i) = probability that the second die shows i dots = 1/6,
P(C_i) = probability that the third die shows i dots = 1/6.
By independence and additivity,

P = Σ_{i=1}^6 P(A_i)P(B_i)P(C_i) = 6 · (1/6 · 1/6 · 1/6) = 1/36.
4.3. Consider two independent tosses of a fair coin. Let A be the event that the first toss lands Heads, B the event that the second toss lands Heads, and C the event that both coins land on the same side. (a) Show that the events A and C are independent. (b) Show that A, B, C are not independent.
4.4. A break in an electric circuit will occur if at least one of the three independent
components connected in series is out of order. Calculate the probability of the event E that the
break will occur if the components fail with respective probabilities 0.1, 0.3, and 0.6.
5. Random Variables

Suppose a coin is tossed twice, so that Ω = {(H₁,H₂), (H₁,T₂), (T₁,H₂), (T₁,T₂)}, and we are interested in the number of heads in these two trials. The relevant function will be X: Ω → {0, 1, 2}, counting the heads in each outcome. If the coin is fair, the probability defined on the "elementary events" {(H₁,H₂)} and alike is "uniform", with 1/4 for each. Correspondingly, declaring {0}, {1}, {2} as elementary events in F({0,1,2}), we will find that

P_X({0}) = P{X = 0} = 1/4 (because 0 corresponds to (T₁,T₂))
P_X({1}) = P{X = 1} = 1/2 (because 1 corresponds to (H₁,T₂) or (T₁,H₂))
P_X({2}) = P{X = 2} = 1/4 (because 2 corresponds to (H₁,H₂)).

So, the function X induces a new probability measure P_X on F({0,1,2}) called the probability distribution of X. The function X is called a random variable (r.v.).
For the above probabilities to make sense, the sets {X = 0}, {X = 1}, {X = 2} must be measurable, i.e., more precisely, we must have them belong to F(Ω). If they do not, then X is not a r.v.

More generally, consider a function

X: Ω → Ω′ = {x₁, x₂, …}

(whose range is Ω′). In other words, x_i = X(ω) for some ω ∈ Ω. We will see a graphical illustration of what makes X a r.v., as opposed to being just a function on Ω.

Recall that X⁻¹ is the inverse of the function X. Furthermore, X⁻¹(x) need not be a point in Ω, but rather a set, as in the above example with the tossing of a coin. For

X: Ω → Ω′ = E = {x₁, x₂, …},

the preimages X⁻¹({x_i}) are subsets of Ω and must be events (i.e. belong to F(Ω)), or else X is not a r.v. If they are, then they are measured by P, and their measures form the probability distribution of X.
Figure 5.1
Consider now

X: Ω → Ω′ = {0, 1}.

For example, take the r.v. that models tossing a biased coin, with Ω = {H, T} and P defined on the σ-algebra {Ω, ∅, {H}, {T}} as {1, 0, p, 1 − p}, respectively. The probability distribution P_X = P ∘ X⁻¹ is defined on F({0, 1}) as

P_X({1}) = p,  P_X({0}) = 1 − p.

Such a r.v. is called Bernoulli. More precisely, any r.v. which lands in Ω′ = {0, 1} with the above distribution, regardless of the nature of the underlying model, is Bernoulli.

Most typically, a generic Bernoulli r.v. X is associated with a single trial which can manifest a success or a failure. The latter in turn is mapped by X to 1 or 0, respectively.

We can also formalize a Bernoulli r.v. as follows. Given the σ-algebra F(Ω) = {Ω, ∅, A, A^c}, we can define the following function,

X(ω) = 1_A(ω) = { 1, ω ∈ A;  0, ω ∉ A }  (5.6)

called the indicator function. We recall that the probability space (Ω, F(Ω), P), with P(Ω) = 1, P(∅) = 0, P(A) = p, and P(A^c) = 1 − p, is called a Bernoulli space. Now, associated with the Bernoulli space is the above function X = 1_A, which is a Bernoulli r.v., because

{X = 1} = A and {X = 0} = A^c.
Binomial R.V. A Bernoulli r.v. is foundational for the formation of two very important classes of r.v.'s: binomial and geometric. Both r.v.'s appear in a series of independent Bernoulli trials (like flipping a coin) manifesting a sequence of successes and failures. In other words, a series of Bernoulli trials is a sequence X₁, X₂, … of independent Bernoulli r.v.'s, each from the equivalence class [(1, p)] of Bernoulli r.v.'s with the same parameter p ∈ [0, 1].

In the binomial case, the above sequence is terminated after the n-th term (trial). The r.v. Y defined as

Y = X₁ + ⋯ + X_n,  (5.7)

is called binomial, in notation Y ∈ [(n, p)]. Clearly, the range of Y is {0, …, n}.
Fix k ∈ {0, …, n}. The intersection

E₁ = S₁ ∩ ⋯ ∩ S_k ∩ F_{k+1} ∩ ⋯ ∩ F_n

(where S_i is the i-th success and F_j is the j-th failure) is an elementary event included in {Y = k}. Any permutation of any two or more successes and failures matters, and it gives us yet another elementary event from {Y = k}, for instance,

E₂ = F₁ ∩ S₂ ∩ ⋯ ∩ S_{k+1} ∩ F_{k+2} ∩ ⋯ ∩ F_n.

To find the quantity of all elements of {Y = k}, we notice that each event E_i can be associated with a placement of k unnumbered balls among n cells, so that there are C(n, k) of them. Now, P(E₁) = p^k (1 − p)^{n−k} if we take into account the independence of the events S₁, …, S_k, F_{k+1}, …, F_n. Consequently,

P{Y = k} = b(n, p; k) := C(n, k) p^k (1 − p)^{n−k}.

The latter is valid for all k = 0, …, n. To verify that the above probabilities form a probability distribution, we sum them up: by the binomial formula,

Σ_{k=0}^n C(n, k) p^k q^{n−k} = (p + q)^n = 1, where q := 1 − p.
Example 5.3. In many complex parallel reliability systems and parallel subsystems, an n-component system stays intact (over a time period) if at least k out of the n components do so. For example, in many power-generating systems with at least two generators, k generators are sufficient to provide the power requirements. Also, in a typical wire cable for cranes and bridges, the cable may contain thousands of wires and only a fraction of them is required to carry the desired load.

Assuming that all units have identical and independent life distributions and that the probability that a unit is functioning is p, the probability that exactly j out of a total of n units function is b(n, p; j) = C(n, j) p^j q^{n−j}. In the context of a k-out-of-n system, at least k of them have to function (over a time period), and such probability is

R(n) = Σ_{j=k}^n C(n, j) p^j q^{n−j}.

Here p stands for the reliability of one component (in fact, of each of the n components) and R(n) is the system reliability.
Consider a k-out-of-n reliability system of n components, each of which can fail independently of the others with probability q = 0.75. Suppose that at least 20 components of this system must function, i.e. k = 20. What is the minimal number of components N needed so that the reliability R of this system is at least 0.95?

Solution. We will use an R program to calculate N. Note that the pertinent command in R to calculate the sum Σ_{j=0}^k C(n, j) p^j q^{n−j} of the k + 1 lowest binomial probabilities of X: Ω → {0, 1, …, n} is

pbinom(k, n, p)
# start just below the required 20 functioning components
N=19;
R<-0;
while (R<0.95) {
  N=N+1
  # reliability: P{at least 20 of the N components function}, with p = 1 - q = 0.25
  R<-1-pbinom(19,N,p=0.25);
}
print(R);
print(N)
> print(R);
[1] 0.9510116
> print(N)
[1] 107
Here 107 is the minimum number of components for which the system reliability R is 0.9510116.
For illustration, the binomial pmf can be plotted in R, e.g. for n = 12 and p = 0.3:

barplot(dbinom(0:12,12,0.3),names=as.character(0:12),xlab="x",ylab="f(x)")

[Figure: barplot of the binomial(12, 0.3) pmf f(x), x = 0, …, 12]

barplot(dbinom(0:12,12,0.5),names=as.character(0:12),xlab="x",ylab="f(x)",col="lightblue2")

[Figure: barplot of the binomial(12, 0.5) pmf f(x), x = 0, …, 12]
Example 5.5. Suppose it is known that people over the age of 40 can develop hypertension (a systolic blood pressure reading of 130 or higher) with probability p. Let Y be the r.v. recording the number of people in a sample of n individuals that have hypertension. Then Y = X₁ + ⋯ + X_n: Ω → {0, 1, …, n} has the binomial probability distribution

P{Y = k} = C(n, k) p^k (1 − p)^{n−k}, k = 0, …, n.

Here {X_i = 1} corresponds to the event that the i-th individual in the sample has hypertension.
Geometric R.V. We return to the series of independent Bernoulli trials of Remark 5.1: X₁, X₂, … on the same probability space (Ω, F(Ω), P), each mapping Ω = {failure, success} onto Ω′ = {0, 1}, with the same Bernoulli probability distribution P{X_i = 1} = p, P{X_i = 0} = q = 1 − p. The series of trials now continues until a first success occurs. For instance, if the first success takes place at the k-th trial, we have the outcome

F₁, F₂, …, F_{k−1}, S_k.

(As an example, we can flip a coin until we obtain a Head for the first time.) Then we stop the series. Let Z be the r.v. that gives the number of trials needed to attain the first success. Then the event {Z = k} equals

{Z = k} = F₁ ∩ ⋯ ∩ F_{k−1} ∩ S_k,

so that, by independence,

p_k := P{Z = k} = q^{k−1} p, k = 1, 2, … .

The r.v. Z is called geometric of type I, or just geometric, because of its association with the terms of a geometric series. It is easy to show that over k = 1, 2, …, the p_k's sum up to 1. We will say that Z ∈ Geo(p).

The r.v. W = Z − 1 counts the number of failures prior to the first success in a series of Bernoulli trials. It is readily seen that the distribution of W is similar to that of Z and is

P{W = k} = p q^k, k = 0, 1, … .
fx<-dgeom(0:20,0.2)
barplot(fx,names=as.character(0:20),xlab="x",ylab="f(x)")
[Figure: barplot of the type II geometric pmf f(x) = dgeom(x, 0.2), x = 0, …, 20]
The r.v. W possesses the following property: for k, n = 0, 1, …,

P{W = k + n | W ≥ n}
= P({W = k + n} ∩ {W ≥ n})/P{W ≥ n}
= P{W = k + n}/Σ_{i ≥ n} P{W = i}
= p q^{k+n}/(p q^n/(1 − q)) = p q^{k+n}/q^n = p q^k = P{W = k}.

This means that if a time process, like a telephone conversation, is observed from some moment prior to which the conversation has lasted n or more units of time (seconds), the probability that it will last k units of time longer thereafter is the same as the probability that a conversation observed from the very beginning lasts k seconds. In this process, we do not count the very last second, so as to associate it with the type II geometric r.v. The type I geometric r.v. has a similar property. We call it the memoryless property. It can be proved that if a discrete-valued r.v. has the memoryless property, then it is geometric. Thus the geometric r.v. is the only discrete r.v. with the memoryless property.
Example 5.6. Under the condition of Example 5.4, if the sample size n is large (say over 500) and p is small (we can replace the above age group by individuals of 25 years of age and younger), then (5.1) can be approximated by

P{X = k} = e^{−λ} λ^k/k!, k = 0, 1, …  (5.16)

(np converges to some λ > 0 as n → ∞ and p → 0). Indeed, let λ = np for a large n and small p. Then, for a fixed k, we can represent b(n, p; k) as

b(n, p; k) = [n(n − 1)⋯(n − k + 1)/n^k] · (λ^k/k!) · (1 − λ/n)^n · (1 − λ/n)^{−k},

of which the first factor approaches 1 as n → ∞ and the fourth factor approaches 1 as p → 0. The third factor converges to e^{−λ} as n → ∞.

Ideally, k runs over the countable set of nonnegative integers to qualify for a distribution (since only the infinite series Σ_{k=0}^∞ λ^k/k! equals e^λ), but because the convergence is very rapid, the probabilities in (5.16), for k larger than, say, 300, are negligibly small. The distribution defined in (5.16) is called Poisson with parameter λ. Poisson r.v.'s are among the most important r.v.'s in biostatistics and stochastic (i.e. random) processes. A very prominent class of stochastic processes (with applications including genetics and ecology) is the Poisson process, which is related to the Poisson r.v. We will denote X ∈ π(λ).
fx<-dpois(0:15,5)
barplot(fx,names=as.character(0:15),col="lightgoldenrod3")
[Figure: barplot of the Poisson(5) pmf, x = 0, …, 15]
Remark 5.1. If Y is a binomial r.v. with parameters (n, p), then, as we know from (5.7), Y = X₁ + ⋯ + X_n, where the X_i's are Bernoulli r.v.'s valued 0 or 1 each, with X_i = 1 with probability p. The "average" or "expected value" of X_i, in notation EX_i, will be p, calculated as EX_i = p · 1 + (1 − p) · 0. For instance, when flipping a fair coin, the average value of one outcome is 1/2. Informally, np then appears to be the average value of the r.v. Y, as the sum of the expected values of n Bernoulli r.v.'s. (More about this in Section 6.)

Consequently, in the Poisson case, λ has the meaning of the average value of the Poisson r.v. X, as the limiting value of np in the binomial case. In most situations we will be concerned with, the Poisson distribution will serve as an approximation to any binomial distribution qualified for the approximation by having a large n and a small p.
Example 5.7. (i) It is known that the Norton Antivirus software can identify and eliminate 90% of all current and past viruses. Suppose a sample of 10 personal computers is selected for testing and they are all infected with various viruses. What is the probability that exactly 8 of them will be detected by Norton?

Solution. We model the above situation by a binomial r.v. with parameters n = 10 and p = 0.9. The reason for this is that p = 0.9 comes from large statistics on an unidentified population, from which a small sample was randomly drawn. Thus, if the r.v. Y describes the number of detected infections (i.e. the number of machines on which Norton detects the viruses), we need to find the probability that exactly 8 out of the 10 PCs will be detected by Norton:

P{Y = 8} = C(10, 8) · 0.9⁸ · 0.1² = 45 · 0.43046721 · 10⁻² = 0.1937102445.  (5.17)

(Alternatively, the same answer can be found in the attached Table of Binomial Distributions, page T-7. However, the above formula needs to be presented.)
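In R, the binomial probability (5.17) is given directly by dbinom:

> dbinom(8, 10, 0.9)
[1] 0.1937102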
(ii) Suppose a typist makes 1½ typos per page on average when typing a manuscript. What is the probability that on a given page she will make three or more typos?

Solution. A typical page of a manuscript can contain between 500 and 1000 characters, say 750. Due to Remark 5.2, we interpret the phrase "on average 1½ typos per page" as np = 1½. With n = 750 (large), we then have p = 1.5/750 = 0.002, which is small. Therefore this is ideally a binomial model, but it allows a good Poisson approximation with λ = 1½. Thus, the corresponding r.v. X is "approximately" Poisson with λ = 1½ and we need to find

P{X ≥ 3} = 1 − P{X ≤ 2}
= 1 − e^{−λ}(1 + λ/1! + λ²/2!)  (with λ = 1½).  (5.18)
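The Poisson tail in (5.18) can be evaluated in R via the cumulative function ppois:

> 1-ppois(2, 1.5)
[1] 0.1911532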
(iii) A student is about to submit a draft of her master's thesis. It is known that there is a 70% chance that her advisor will accept the draft. If the draft needs a revision, then the new submission will still have the same probability of being accepted. Assume that submissions continue with the same statistics. What is the probability that the student will need fewer than five submissions for her thesis to be accepted?

Solution. In this case we observe a series of independent Bernoulli trials X₁, X₂, …, which ends with the first "success", and with P{X_i = 1} = p = 0.7. Therefore, the r.v. X describing the process of trials and ending the series is geometric (type I) with parameter p = 0.7. What we need to find is

P{X ≤ 4} = Σ_{i=1}^4 p q^{i−1} = p (1 − q⁴)/(1 − q) = 1 − q⁴ = 1 − 0.3⁴ = 1 − 0.0081
= 0.9919.  (5.19)
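In R, pgeom uses the type II parameterization (the number of failures before the first success), so fewer than five submissions means at most three failures:

> pgeom(3, 0.7)
[1] 0.9919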
Example 5.8. It is known that of the 313.6 million US population, about 200 million have high-speed Internet access at home. Thus the population proportion of high-speed Internet users is 0.6378. Suppose 1500 people are randomly selected. What is the probability that at least 1000 of those responding to the survey have high-speed Internet access?

Solution. We interpret the 0.6378 proportion as p and the sample of 1500 people selected for the survey as a sample of 1500 independent Bernoulli trials, so that X = X₁ + ⋯ + X₁₅₀₀ ∈ [(1500, 0.6378)] gives the number of those in the surveyed sample with high-speed access. Then

P{X ≥ 1000} = Σ_{k=1000}^{1500} C(1500, k) 0.6378^k · 0.3622^{1500−k}.

Now the above is a computational challenge. The normal approximation, due to the Central Limit Theorem (to be explored in Chapter V, Section 2), would be a remedy. Alternatively, we can use the R language to compute the above probability precisely. The procedure will be as follows. Since

P{X ≥ 1000} = 1 − P{X ≤ 999} = 1 − Σ_{k=0}^{999} C(1500, k) 0.6378^k · 0.3622^{1500−k},

we have

> 1-pbinom(999,1500,.6378)
[1] 0.01042895

which reads

P{X ≥ 1000} = 0.01042895.

A Poisson approximation might be attempted as well:

> 1-ppois(999,956.7)
[1] 0.08395288

The result significantly differs from that of the direct binomial computation. The student needs to explain the discrepancy. (See Problem 5.6.)
Hypergeometric R.V. A r.v. X: Ω → {0, …, min{r, n}} with parameters (N, r, n) is called hypergeometric if its distribution is

P{X = k} = C(r, k) C(N − r, n − k) / C(N, n), k = max{n − w, 0}, …, min{r, n},  (5.20)

where w = N − r.
Example 5.9. (An example from ecology.) Suppose an unknown number N of animals or insects inhabit a certain region. In order to obtain some information about the population size (and make a decision on whether or not this population is endangered or overpopulated), ecologists often perform the following test. They catch a number of animals, say r, and mark them in some manner; then they release them, trying to disperse them throughout the habitat. After a while, a new catch of size n is rendered. If X is the r.v. describing the number of marked animals in the catch, then, as we know from Model 5, X is a hypergeometric r.v. with parameters (N, r, n). Suppose X is observed to equal x. (We assume that the number of animals between the two catches did not alter.) From (5.20), we recall that

h(N, r, n; x) = P{X = x} = C(r, x) C(N − r, n − x) / C(N, n).  (5.21)
Now, we need to estimate the unknown parameter N using the maximum likelihood principle. Namely, we assume that the value of N we seek is the one that maximizes the likelihood (5.21). In other words, we need to find an N* (the m.l.e.) such that

h(N*, r, n; x) = max over N of h(N, r, n; x).  (5.22)

Consider the ratio

h(N, r, n; x)/h(N − 1, r, n; x) = (N − r)(N − n)/(N(N − r − n + x)).  (5.23)

From (5.23) we find that h(N, r, n; x) ≥ h(N − 1, r, n; x) if and only if the above ratio is greater than or equal to 1, and thus

N ≤ rn/x.

From similar calculations, h(N, r, n; x) ≥ h(N + 1, r, n; x) if and only if

N + 1 > rn/x.

Altogether,

rn/x − 1 ≤ N* ≤ rn/x.

Therefore, h(N, r, n; x) reaches its maximum at the largest integer value of N not exceeding rn/x. Roughly speaking, we have

N* ≈ rn/x,

the ratio under which the likelihood function h(N, r, n; x) attains its maximum. For example, if the initial catch consisted of r = 50 animals which were marked and then released, and the second catch of n = 100 animals showed x = 5 animals marked, then we will estimate that there are some 1000 animals in the habitat.
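A numerical confirmation of the m.l.e. in R (a sketch; dhyper(x, m, n, k) takes the number x of marked animals in the recapture, the numbers m of marked and n of unmarked animals in the population, and the recapture size k):

N <- 900:1100                      # candidate population sizes
lik <- dhyper(5, 50, N - 50, 100)  # likelihood h(N, 50, 100; 5)
N[which.max(lik)]                  # about 1000; in fact h is tied at N = 999 and N = 1000 here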
PROBLEMS
5.1. The monthly worldwide average number of airplane crashes of commercial airlines is 3. What is the probability that there will be at least 2 such accidents in the next month? Justify your choice of a relevant r.v. and its approximation.
5.2. Two chess players of equivalent strength play against each other. Disregarding draws, which is more probable: to win two games out of four, or three games out of six?
5.4. The interval [0, 15] is partitioned into two subintervals, A = [0, 10] and B = (10, 15]. Suppose four points are randomly selected. Assuming that the probability of a point landing in a subinterval is proportional to the subinterval's length, find the probability that exactly two points fall into A and the other two into B.
5.5. Let the r.v.'s X₁, …, X_n form n independent Bernoulli trials with parameter p. Determine the conditional probability that X₁ = 1, given that Σ_{i=1}^n X_i = k (k = 1, …, n).
5.6. Explain how λ was obtained in Example 5.8 and what went wrong with the Poisson approximation.
5.7. Given the plot below of a geometric distribution, conclude what type of geometric distribution (type I or type II) it is and give its parameter:

[Figure: barplot of a geometric pmf f(x), x = 0, 1, …, 49, with vertical axis from 0.00 to 0.10]
5.8. In the context of Problem 5.7 write an R-program to plot the above figure.
5.9. Suppose a parallel reliability system consists of a cable of wires for a bridge and that this cable must have at least 50 functioning wires. If the reliability of a single wire is 0.2, how many wires should the cable have to guarantee that the reliability of the system is at least 0.97?
6. Expected Value

Let X be a discrete r.v. with values x₁, x₂, … and distribution p_k = P{X = x_k}. We define

μ = EX = Σ_k x_k p_k  (6.1)

and call it the expected value of the r.v. X. We also call μ the mean or the first moment of the r.v. X. If X is Poisson with parameter λ, we can easily show that the parameter λ is also the mean of X. Indeed,

EX = Σ_{k=0}^∞ k e^{−λ} λ^k/k! = λ e^{−λ} Σ_{k=1}^∞ λ^{k−1}/(k − 1)! = λ e^{−λ} e^λ = λ.

For the binomial and geometric r.v.'s, the calculation of their means is somewhat more cumbersome. We therefore introduce a transform that works for nonnegative integer-valued r.v.'s.
Before that, let h be a real-valued function such that the composition h(X) is also a r.v., valued in {h(x₁), h(x₂), …}. We assume that P{h(X) = h(x_k)} = p_k, i.e. the probability distribution over the values of h(X) is inherited from that over the values of X. We can regard h(X) as a new r.v. with the distribution {p₁, p₂, …}. Thus the mean of h(X) is

Eh(X) = Σ_k h(x_k) p_k.

We can interpret h(X) as the capital gain corresponding to the values of X: for instance, X denotes the contents of some inventory and h(X) is its US dollar value.
Suppose X: Ω → {0, 1, …} (i.e. x_k = k) is distributed as {p₀, p₁, …} and let h(x) := z^x. Then the expectation of the function h of X is the power series

Eh(X) = g(z) := Ez^X = Σ_{k=0}^∞ z^k p_k,  (6.4)

called the probability generating function (pgf). The pgf g(z) is absolutely convergent on the closed unit disk; in particular, at |z| = 1,

E|z^X| ≤ Σ_{k=0}^∞ p_k = 1.  (6.5)

Therefore, g(z) is analytic everywhere for |z| < 1 and continuous for |z| = 1. In particular, g(z) can be expanded in a Taylor series at zero,

g(z) = Σ_{k=0}^∞ (g^{(k)}(0)/k!) z^k.  (6.6)

From the comparison of the series (6.4) and (6.6) and the uniqueness of the power series representation, we conclude that

p_k = g^{(k)}(0)/k!.  (6.7)
Remark 6.1. (The existence of the expectation). Since in many circumstances the expectation is a series, it sounds as if the expectation exists if and only if the series EX converges. This is not true, in spite of the common sense of mathematical analysis. We know that if a series converges absolutely, it converges in the usual sense; not so for the expectation, which came from abstract analysis and integration. We will agree from now on that the expected value μ = EX exists if and only if the series E|X| converges. In this case, μ is the value of the expectation of X.

For example, let X take the values x_n = (−1)^n n, n = 1, 2, …, with probabilities p_n = 6/(π² n²). This is a legitimate distribution, since

Σ_{n=1}^∞ 6/(π² n²) = 1.  (6.8)

Formally,

EX = Σ_{n=1}^∞ (−1)^n 6/(π² n),  (6.9)

which converges as a Leibnitz series (being sign-alternating with its terms 6/(π² n) → 0). Nevertheless, EX does not exist, since the series obviously does not converge absolutely: its absolute values form a divergent harmonic series.
Below are some useful applications of pgf's. If we formally differentiate the series,

d/dz g(z) = d/dz Ez^X = Σ_{k=1}^∞ k z^{k−1} p_k,  (6.10)

and assume that the limit as z → 1 exists (this takes place if and only if the expectation EX exists), we have

g′(1) = Σ_{k=1}^∞ k p_k = EX.  (6.11)

This is a more convenient way of obtaining the mean in many particular cases.
For instance, for a binomial r.v. Y ∈ [(n, p)], the binomial formula yields

g(z) = Ez^Y = Σ_{k=0}^n z^k C(n, k) p^k q^{n−k} = (pz + q)^n,

whence EY = g′(1) = np(pz + q)^{n−1}|_{z=1} = np.
Example 6.3 (Poisson r.v. revisited). Suppose X is a binomial r.v. with the pgf g(n, p; z) = (pz + q)^n. Assume that n → ∞ and p → 0 so that

lim np = λ > 0.

Also assume that z ≠ 1. Consider n very large and p very small, so that np ≈ λ, i.e. p ≈ λ/n. For simplicity, we assume that np = λ for very large n and very small p. Then

g(n, p; z) = (1 + λ(z − 1)/n)^n → e^{λ(z−1)}, as n → ∞.

This is due to the following. Setting x = n/(λ(z − 1)), we observe that |x| → ∞ if and only if n → ∞, provided that z ≠ 1. Then we easily recognize that

(1 + 1/x)^x → e, as x → ∞,

a known result from calculus. The rest is due to the continuity of the exponential function. Hence we proved that

g(n, p; z) → e^{λ(z−1)}.

For z = 1, g(n, p; 1) = 1 and this is a trivial result. Now the Taylor series expansion of e^{λ(z−1)} is

e^{λ(z−1)} = e^{−λ} Σ_{k=0}^∞ (λz)^k/k! = Σ_{k=0}^∞ z^k e^{−λ} λ^k/k!,

concluding that, in the limit, P{X = k} = e^{−λ} λ^k/k!, k = 0, 1, … .
PROBLEMS
6.1. Find the pgf's of the type I and type II geometric r.v.'s, and then, using procedures like those in Examples 6.1 and 6.2, find the means in both cases.
6.2. Let X be a Bernoulli-type r.v. specified by P{X = 1} = p and P{X = −1} = 1 − p. Give the set of all values z ≠ 1 that solve the equation Ez^X = 1.
6.3*. Let X be a binomial r.v. with parameters (n, p) (0 < p < 1). Evaluate E[1/(X + 1)]. Hint: You can use the probability generating function g(z) of X and integrate it from zero to 1.
6.4. Let X: {ω₁, …, ω_n} → {1, …, n} be a discrete uniform r.v., i.e. such that P{X = k} = 1/n, for k = 1, …, n. Find the pgf, the mean, and the variance of X. Hint: You can use the formula

Σ_{k=1}^n z^k = z(1 − z^n)/(1 − z).
7. Variance

Recall two basic properties of the expectation: (Property 1) E(aX + b) = aEX + b, and (Property 2) E[g(X) + h(X)] = Eg(X) + Eh(X), whenever the expectations involved exist.

The variance of a r.v. X with mean μ = EX is defined as σ² = Var X = E(X − μ)². A more convenient computational form is Var X = EX² − μ², where μ₂ := EX² denotes the second moment. Indeed,

E(X − μ)² = E(X² − 2μX + μ²) = μ₂ − 2μ² + μ² = μ₂ − μ².  (7.2)

Observe that Property 2 easily yields Property 1 with g(X) = aX and h(X) = b.
PROBLEMS

7.1. Let X be a r.v. with mean μ and variance σ², and let a and b be constants. Find the variance of aX + b.

Solution.
Var(aX + b) = E(aX + b)² − [E(aX + b)]²
= a²μ₂ + 2abμ + b² − (a²μ² + 2abμ + b²)  (using Property 2)
= a²μ₂ − a²μ² = a²σ².

7.2. Let X be a r.v. with mean μ and variance σ². Find the expected value and variance of the random variable Y = (X − μ)/σ.
8. The Moment Generating Function

Another transform, the moment generating function, is defined as m(θ) := Ee^{θX}. If the pgf of a r.v. is known and equals g(z), then it is easily seen that

m(θ) = g(e^θ).

The term "moment generating function" (mgf) stems from the following expansion:

e^{θX} = Σ_{n=0}^∞ θ^n X^n/n!,  (8.3)

which is the Taylor series expansion of the exponential function at zero. Applying formally the expectation to both sides of the last equation and using Fubini's Theorem (allowing us to interchange two series, or two integrals, or a combination of the two, in a variety of special cases), we have

m(θ) = Σ_{n=0}^∞ (θ^n/n!) EX^n,  (8.4)

where EX^n =: μ_n is the n-th moment of the r.v. X (assuming it exists). In the event all moments μ_n of X exist, then the mgf of X also exists and, according to (8.4), m(θ) can be expanded in a Taylor series at zero:

m(θ) = Σ_{n=0}^∞ (θ^n/n!) m^{(n)}(0).  (8.5)

Comparing (8.4) and (8.5), and from the uniqueness of the power series representation, we conclude that

μ_n = EX^n = m^{(n)}(0).

Notice that m(0) = g(1) = 1.
Example 8.1 (Binomial r.v.). Recall that g(z) = (pz + q)^n, so that m_n(θ) = (pe^θ + q)^n. Then

d/dθ m_n(θ) = np e^θ (pe^θ + q)^{n−1} = np e^θ m_{n−1}(θ)  (8.9)

and

d²/dθ² m_n(θ) = np [e^θ m_{n−1}(θ) + e^θ m′_{n−1}(θ)].  (8.10)

Therefore,

EY = m′_n(0) = np m_{n−1}(0) = np  (8.11)

and, since EY² = m″_n(0) = np[1 + m′_{n−1}(0)] = np[1 + (n − 1)p],

σ² = EY² − (EY)² = np + n(n − 1)p² − n²p² = np(1 − p) = npq.  (8.12)
PROBLEMS
8.1. Using the mgf technique, find the mean and variance of a Poisson r.v. with parameter λ.
8.2. Using the mgf technique, find the mean and variance of a type II geometric r.v.
8.3. Using the mgf technique, find the mean and variance of a type I geometric r.v.
8.4. If X is a binomial r.v. with expected value 6 and variance 2.4, find P{X = 2}.
CHAPTER II

1. The Probability Distribution Function

Let X be a discrete r.v. with values x₁ < x₂ < ⋯ and distribution p_k = P{X = x_k}. Define F(x) := P{X ≤ x} and plot this function in order to understand how it works. Obviously, if x < x₁, then F(x) = 0. For x = x₁ we have F(x₁) = P{X < x₁} + P{X = x₁} = 0 + p₁. Hence, F(x) equals zero for all x < x₁ and at x = x₁ it jumps to p₁. For x₁ ≤ x < x₂,

F(x) = p₁ > 0.  (1.1)

Therefore F(x), after it increases to p₁, continues to equal p₁ for all x < x₂. When x = x₂, F(x) jumps to its second level p₁ + p₂, keeping its trend as a step function. Continuing with the same method, we see that between x_i and x_{i+1}, F(x) is constant, while at x_i it picks up p_i and adds it to the already accumulated sum p₁ + ⋯ + p_{i−1}. Consequently, F is a piecewise constant (step) function with jumps at x₁, x₂, …, of respective magnitudes p₁, p₂, … (as in Figure 1.1).
Figure 1.1
The function F is referred to as the probability distribution function (PDF). Notice that if the state space E of X is finite, say |E| = n, then p₁ + ⋯ + p_n = 1. If E is countably infinite, then for large x, F(x) asymptotically approaches the line y = 1. Formally,

F(x) → 1, as x → ∞; F(x) → 0, as x → −∞.

Notice from the definition of F and its plot that F is right-continuous; that is, wherever the function has a discontinuity, at x_i, the right limit at x_i is assigned as the value of F.
The expectation (6.1) of Chapter I,

EX = Σ_k x_k p_k,  (1.4)

can be rewritten in terms of F if we first replace p_k with ΔF(x_k) := F(x_k) − F(x_k−), where F(x_k−) is the left limit of F at x_k (see Figure 1.3). Now, considering that ΔF(x) = 0 everywhere on ℝ except for the x_k's, we can rewrite (1.4) in the form

EX = Σ_k x_k ΔF(x_k),

or rather

EX = Σ_{x ∈ ℝ} x ΔF(x),

since the sum formally deals with at most countably many nonzero terms. Perhaps even more consistent would be to replace ΔF(x) with dF(x): as the function F is piecewise constant, its differential is zero everywhere except at the points x_k, where dF(x_k) does not exist. Without much formality, we can let dF(x_k) equal ΔF(x_k) = p_k, thereby arriving at the (Stieltjes) integral

EX = ∫_{−∞}^{∞} x dF(x).  (1.7)

Now, if in the plot of F the points x₁, x₂, … (at which X is concentrated) become more and more dense over ℝ, the PDF F becomes strictly monotone increasing (Figure 1.2):
Figure 1.2
Here the function F looks smooth and, if so, the differential of F exists (almost) everywhere; in this case dF(x) = F′(x) dx, where F′(x) is positive; it is denoted by f(x) and called the pdf (probability density function). Expression (1.7) can then be formally rewritten as

EX = ∫_{−∞}^{∞} x f(x) dx.

In the discrete case, the PDF itself can be written compactly as

F(x) = Σ_{j: x_j ≤ x} p_j
Figure 1.3
The PDF readily yields the probabilities of intervals. Express, in terms of F, probabilities such as P{X ∈ (a, b]}, P{X ∈ [a, b)}, and P{X = a}.

Solution.
a) P{X ∈ (a, b]} = P{a < X ≤ b}
= P{X ∈ (−∞, b] − (−∞, a]} = P{X ≤ b} − P{X ≤ a}
= F(b) − F(a).
b) P{X ∈ [a, b)} = P{a ≤ X < b}
= P{X ∈ (−∞, b) − (−∞, a)} = P{X < b} − P{X < a}
= F(b−) − F(a−).
d) P{X = a} = P{X ≤ a} − P{X < a} = F(a) − F(a−).
Remark 1.1. One can readily generalize the expectation to a function h of the r.v. X as

Eh(X) = ∫_{−∞}^{∞} h(x) dF(x).  (1.10)

When F has a density f, which takes place when the r.v. X is valued in a "continuum" set (as opposed to a discrete set), the r.v. X is called continuous. The continuity of X has nothing to do with the conventional notion of a continuous function in analysis.
PROBLEMS
1.1. Let X = c almost surely (i.e., with probability 1, X is a constant and equals c). Plot the PDF of X.

1.2. Let X be the number of Heads in a single toss of a biased coin, assuming that it shows Heads with probability p. Plot the PDF of X.

1.3. Under the condition of Problem 1.2, let Y give the number of Heads in two tosses of this coin. Plot the PDF of Y.
1.4. An investment firm offers its customers treasury bills that mature after varying numbers of years. Given that the PDF (probability distribution function) of T (the number of years to maturity for a randomly selected bond) is

F(t) = 0,    t < 1
       0.2,  1 ≤ t < 2
       0.5,  2 ≤ t < 4
       0.7,  4 ≤ t < 8
       1,    t ≥ 8,

find (a) P{T = 4}, (b) P{T = 5}, (c) P{T > 2}, (d) P{1 ≤ T < 4}. Also sketch F.
2. Continuous Random Variables

Recall that F(x) = P{X ≤ x}, the probability P{X ≤ x} becoming a function of the variable x; F(x) is the PDF of the r.v. X introduced in Section 1. Here the PDF of X is almost everywhere smooth and thus has a pdf (probability density function) related to F as

f(x) = d/dx F(x) ≥ 0, since obviously F(x) is everywhere monotone nondecreasing. Conversely (cf. (1.9)),

F(x) = ∫_{−∞}^x f(t) dt.
The Uniform R.V. A r.v. U: Ω → (0, 1) (range) is called standard uniform if its pdf is

f(x) = { 1, x ∈ (0, 1);  0, x ∈ (0, 1)^c } =: 1_{(0,1)}(x) (in notation).  (2.4)

In general,

1_A(x) = { 1, x ∈ A;  0, x ∈ A^c }  (2.5)

is called the indicator function of a set A (the set being a parameter). There are a number of very interesting properties of indicator functions, making them more special than just a device for brevity.
The chief property of a continuous uniform r.v. is that T ÖY − E× is constant wherever set E is
picked out from Ð!ß "Ñ as long as the measure of E, lEl (length if it is an interval) is constant. It
is easy to prove by integration of 0 ÐBÑ over different intervals with identical lengths.
A uniform r.v. on an interval $(a,b)$ has the pdf

$$f(x) = \begin{cases} \frac{1}{b-a}, & x \in (a,b)\\ 0, & x \in (a,b)^c \end{cases} = \tfrac{1}{b-a}\mathbf{1}_{(a,b)}(x). \qquad (2.6)$$

Integration of $f$ yields the PDF $F$ of such a uniform r.v., as depicted in Figure 2.1 below:

Figure 2.1

$$F(x) = \begin{cases} 0, & x \le a\\ \frac{x-a}{b-a}, & a < x < b\\ 1, & x \ge b. \end{cases} \qquad (2.7)$$
Simulation. Let $U$ be a standard uniform r.v. and let $X$ be an arbitrarily distributed continuous r.v. with PDF $F_X$. We show how to simulate the r.v. $X$ using a simulation of $U$ (which is supposed to be a simple procedure). If we obtain $U$ by simulation, it takes a value between 0 and 1. We prove that $F_X^{-1}(U) \in [X]$. Indeed,

$$P\{F_X^{-1}(U) \le x\} = P\{U \le F(x)\} = F(x),$$

since $P\{U \le y\} = y$ for all $y \in (0,1)$ (see Figure 2.1, with $a = 0$ and $b = 1$). This can be utilized for simulating a r.v. $X$ whose PDF has an easily found inverse (cf. the exponential r.v. in Example 2.4).
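As a quick illustration of the inverse-transform method (an added sketch, not part of the original notes, using base R only), take $X \in \mathrm{Exp}(\lambda)$, for which $F^{-1}(u) = -\ln(1-u)/\lambda$:

lambda <- 0.5
u <- runif(10000)                # standard uniform draws
x <- -log(1 - u)/lambda          # F^{-1}(U); distributed Exp(lambda)
c(mean(x), 1/lambda)             # sample mean vs the true mean 1/lambda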
x<-seq(-4,12,0.01)
y<-dexp(x,0.5)
plot(x,y,type="l",col="blue")
[Plot: the exponential density dexp(x, 0.5) over $(-4, 12)$.]
It is zero on the negative real axis, and for positive values of $x$, $f$ is a negative exponential: monotone decreasing, concave up, and asymptotically approaching the horizontal axis. The integral of $f$ gives

$$F(x) = \left(1 - e^{-\lambda x}\right)\mathbf{1}_{[0,\infty)}(x).$$

$F$ is monotone increasing for nonnegative $x$, concave down, and asymptotically approaches the line $y = 1$. Note that

$$P\{X > x\} = e^{-\lambda x}, \quad x \ge 0. \qquad (2.16)$$
Using (2.16) we prove one interesting property of exponential r.v.'s. Suppose we begin to observe an exponentially distributed r.v. $X$ at some point in time when $X$ has already been running, for instance, a telephone conversation. What is the probability that $X$ will continue some time longer, i.e., what is $P\{X > t+s \mid X > s\}$? By the conditional probability formula,

$$P\{X > t+s \mid X > s\} = \frac{P(\{X > t+s\}\cap\{X > s\})}{P\{X > s\}} = \frac{P\{X > t+s,\ X > s\}}{P\{X > s\}}.$$

Obviously, $\{X > t+s\}$ is a subset of $\{X > s\}$, and as such, the intersection of the two gives the smaller set $\{X > t+s\}$, yielding

$$P\{X > t+s \mid X > s\} = \frac{P\{X > t+s\}}{P\{X > s\}} \qquad (2.17)$$

and, by (2.16),

$$= \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P\{X > t\}. \qquad (2.18)$$

The latter means that the residual lifetime (if this is a time process) does not depend on $s$ (i.e., on how long the process has lasted) and, furthermore, it is as if the process were observed from its beginning. This property is called the memoryless property of the exponential r.v. One can show that the exponential r.v. is the only representative of the class of continuous r.v.'s with the memoryless property.
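The identity (2.18) is easy to verify numerically; the following fragment (an illustrative sketch, not from the original notes; the values of lambda, s, t are arbitrary) compares the conditional and unconditional tails:

lambda <- 0.5; s <- 1; t <- 2
x <- rexp(1e6, lambda)
cond <- mean(x > t + s)/mean(x > s)    # P{X > t+s | X > s}
c(cond, mean(x > t), exp(-lambda*t))   # all three should nearly coincide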
Example 2.1. Insurance companies collect accident records (called histories) of drivers. A driver is considered to be in a "stable" category if his/her probability of having an accident remains approximately constant, independent of time. One interpretation of this feature is that if $X$ is the time up to the first accident of the driver in question, then

$$P\{X > t+s \mid X > s\} = P\{X > t\}.$$

For such a safe driver, the insurance company does not bother estimating the risk of the driver but rather sets the amount of the premium based on the driver's record. Now, from Problem 2.1, we know that the only continuous r.v. with the above memoryless property is exponential.
Arbitrary Continuous Random Variables. Let $X: \Omega \to \mathbb{R}$ be a continuous r.v. with pdf $f$. As in the case of a discrete r.v., the expectation of $X$ is defined as $\mu = EX = \int_{x=-\infty}^{\infty} x f(x)\,dx$, and it exists if and only if the integral $E|X| = \int_{x=-\infty}^{\infty} |x| f(x)\,dx$ converges. As before,

$$\mathrm{Var}\,X = \mu_2 - \mu^2. \qquad (2.19)$$

The same properties of the expectation and variance w.r.t. an affine transformation apply to continuous r.v.'s, namely:

$$E(aX+b) = a\,EX + b, \qquad \mathrm{Var}(aX+b) = a^2\,\mathrm{Var}\,X.$$
PROBLEMS

2.1. Show that if a continuous r.v. $X$ has the memoryless property, i.e., $P\{X > t+s \mid X > s\} = P\{X > t\}$ for all nonnegative $s$ and $t$, then $X$ is exponential. [Hint: From (2.17) and (2.18),

$$P\{X > t+s\} = P\{X > t\}\,P\{X > s\}.]$$

2.2. Let $X$ be a r.v. with the pdf $f(x) = c\sqrt{x}\,\mathbf{1}_{(0,1)}(x)$. Find the constant $c$. Find the PDF $F(x)$ and then, using $F$, find the probabilities $P\{-1 < X \le 1\tfrac{1}{2}\}$ and $P\{\tfrac{1}{2} < X \le 2\}$.
Figure 2.1

$$F(x) = \begin{cases} 0, & x \le 0\\ x^{3/2}, & 0 < x < 1\\ 1, & x \ge 1. \end{cases} \qquad (2.22)$$

Finally, $P\{-1 < X \le 1\tfrac{1}{2}\} = F(1\tfrac{1}{2}) - F(-1) = 1 - 0 = 1$, and $P\{\tfrac{1}{2} < X \le 2\} = 1 - \left(\tfrac{1}{2}\right)^{3/2} \approx 1 - 0.35 = 0.65$.
2.4. Find the mean and variance of a uniform r.v. on the interval $(a,b)$.

Solution. With

$$f(x) = \begin{cases} \frac{1}{b-a}, & x \in (a,b)\\ 0, & x \in (a,b)^c \end{cases} = \tfrac{1}{b-a}\mathbf{1}_{(a,b)}(x),$$

we have $EX = \frac{1}{b-a}\int_a^b x\,dx = \frac{a+b}{2}$ and

$$EX^2 = \frac{1}{b-a}\cdot\frac{b^3-a^3}{3} = \frac{a^2+ab+b^2}{3}.$$

Hence

$$\mathrm{Var}\,X = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{a^2-2ab+b^2}{12} = \frac{(b-a)^2}{12}.$$
For the exponential r.v. with parameter $\lambda$, the MGF and its derivatives give

$$m^{(n)}(\theta) = \frac{n!\,\lambda}{(\lambda-\theta)^{n+1}}, \quad n = 0, 1, 2, \ldots \qquad (3.2)$$

In particular,

$$\mu = \frac{1}{\lambda} \text{ and } \mu_2 = \frac{2}{\lambda^2} \;\Longrightarrow\; \mathrm{Var}\,X = \sigma^2 = \frac{1}{\lambda^2} \;\Longrightarrow\; \sigma = \mu. \qquad (3.4)$$

The gamma pdf is

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,e^{-\beta x}x^{\alpha-1}\,\mathbf{1}_{[0,\infty)}(x), \qquad (3.6)$$

where

$$\Gamma(\alpha) = \int_0^{\infty}e^{-x}x^{\alpha-1}\,dx \qquad (3.7)$$

is referred to as the gamma function. (See Figure 3.1, where $\Gamma(\alpha)$ is depicted for $\alpha > 0$.)

Figure 3.1

From (3.7), it can easily be shown that $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$. If $\alpha$ is a positive integer, then, with $\Gamma(1) = 1$, we get $\Gamma(\alpha) = (\alpha-1)!$, and (3.6) becomes

$$f(x) = \frac{\beta^{\alpha}}{(\alpha-1)!}\,e^{-\beta x}x^{\alpha-1}\,\mathbf{1}_{[0,\infty)}(x). \qquad (3.9)$$

This is the pdf of the so-called Erlang r.v. Furthermore, if $\alpha = 1$, (3.9) reduces to the exponential density.
x<-seq(0.01,4,0.01)
y<-dgamma(x,0.8,0.5)
plot(x,y,type="l",col="blue")
[Plot: the gamma density dgamma(x, 0.8, 0.5) over $(0, 4)$.]
x<-seq(0.01,4,0.01)
y<-dgamma(x,2,2)
plot(x,y,type="l",col="blue")
[Plot: the gamma density dgamma(x, 2, 2) over $(0, 4)$.]
It is easily seen that the gamma pdf (3.6) looks almost identical to the integrand of the gamma function (3.7). Consequently,

$$\int_{x=0}^{\infty}\frac{\beta^{\alpha}}{\Gamma(\alpha)}e^{-\beta x}x^{\alpha-1}\,dx = \frac{1}{\Gamma(\alpha)}\int_{x=0}^{\infty}e^{-\beta x}(\beta x)^{\alpha-1}\,d(\beta x) = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1, \qquad (3.10)$$

and thus the integral of $f$ equals one, which proves that $f$ is indeed a pdf. The MGF of the gamma r.v. is

$$M(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_{x=0}^{\infty}e^{-(\beta-\theta)x}x^{\alpha-1}\,dx = \frac{1}{\Gamma(\alpha)}\left(\frac{\beta}{\beta-\theta}\right)^{\alpha}\Gamma(\alpha) = \left(\frac{\beta}{\beta-\theta}\right)^{\alpha}.$$

Denote $m(\theta) = \frac{\beta}{\beta-\theta}$. Then $M(\theta) = m^{\alpha}(\theta)$. It can easily be shown that

$$M''(\theta) = \frac{\alpha(\alpha+1)}{\beta^2}\,m^{\alpha+2}(\theta). \qquad (3.13)$$

In general,

$$M^{(k)}(\theta) = \frac{\alpha(\alpha+1)\cdots(\alpha+k-1)}{\beta^k}\,m^{\alpha+k}(\theta), \qquad (3.14)$$

and therefore

$$\mu_k = M^{(k)}(0) = \frac{\alpha(\alpha+1)\cdots(\alpha+k-1)}{\beta^k}. \qquad (3.15)$$

In particular,

$$EX = \frac{\alpha}{\beta} \qquad (3.16)$$

and

$$\mathrm{Var}\,X = \frac{\alpha}{\beta^2}. \qquad (3.17)$$

For $\alpha = 1$, the numerator of (3.15) reduces to $k!$, and thus (3.15) reduces to formula (3.3) for the $k$th moment of the exponential r.v.
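Formulas (3.16) and (3.17) can be checked by a quick Monte Carlo experiment (an illustrative sketch, not part of the notes; note that R's rgamma takes the shape alpha and the rate beta):

alpha <- 2; beta <- 2
x <- rgamma(1e6, shape = alpha, rate = beta)
c(mean(x), alpha/beta)       # EX vs alpha/beta
c(var(x), alpha/beta^2)      # VarX vs alpha/beta^2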
PROBLEMS

3.2. If $X$ is an exponential r.v. with parameter $\lambda$, identify the r.v. $cX$, where $c$ is a positive real constant.

3.5. Let $X$ be a gamma r.v. with parameters $\alpha$ and $\beta$. Identify the r.v. $cX$, where $c$ is a positive real constant.

3.6. Using R, plot the gamma density function with parameters $\alpha = \beta = 10$.
$$f(x) = f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}. \qquad (4.1)$$

We will say that $X \in [N(\mu,\sigma^2)]$, i.e., $X$ belongs to the class of normal r.v.'s with parameters $\mu$ and $\sigma^2$. Graphically, $f$ is the famous bell-shaped curve, symmetric about the line $x = \mu$ and having its maximum value at $x = \mu$, approximately equal to $0.399\cdot\frac{1}{\sigma}$, where $\sigma = \sqrt{\sigma^2}$. (See Figure 4.1 below.)

Figure 4.1

The integral of $f$ belongs to the class of so-called transcendental functions, which, except for some special values, cannot be evaluated in closed form.

A special case of $X$ with $\mu = 0$ and $\sigma^2 = 1$ is referred to as standard normal. Its pdf is denoted by

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}. \qquad (4.2)$$
curve(dnorm(x,2,1.5),from=-4,to=8)
[Plot: the density dnorm(x, 2, 1.5) over $(-4, 8)$.]
The following two plots are of Gaussian densities with parameters $\mu = 10, \sigma^2 = 1$ and $\mu = 10, \sigma^2 = 4$, produced by the source code
x<-seq(4,16,len=101)
y<-cbind(dnorm(x,10,1),dnorm(x,10,2))
matplot(x,y,type="l",ylab="f(x)")
text(7.5,.3,"X~N(10,1)")
text(14,.05,"X~N(10,4)")
[Plot: the two densities, labeled X~N(10,1) and X~N(10,4), over $(4, 16)$.]
Let $Z$ be standard normal and set

$$X = \sigma Z + \mu, \quad \text{with } \sigma > 0. \qquad (4.3)$$

Then

$$F_X(x) = P\{\sigma Z + \mu \le x\} = P\Big\{Z \le \tfrac{x-\mu}{\sigma}\Big\}, \qquad (4.4)$$

and, differentiating in $x$,

$$f_X(x) = \frac{1}{\sigma}\varphi\!\Big(\frac{x-\mu}{\sigma}\Big) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} = f(x;\mu,\sigma^2), \qquad (4.5)$$

i.e., $X \in [N(\mu,\sigma^2)]$ if $X$ is the above affine transformation of $Z$. This shows how to tabulate Gaussian r.v.'s by reducing them to the standard normal.

For instance, suppose we need to find $P\{X \le x\}$ for some $X \in [N(\mu,\sigma^2)]$. We use the inverse affine transformation of (4.3):

$$P\{X \le x\} = P\Big\{Z \le \frac{x-\mu}{\sigma}\Big\} = \Phi\!\Big(\frac{x-\mu}{\sigma}\Big),$$

where $\Phi$ denotes the standard normal PDF. Tables of $\Phi$ can be found in all standard statistical software packages and in textbooks on probability and statistics, although in the latter they can be quite crude.
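In R, this standardization is reflected in the equivalence of pnorm(x, mu, sigma) and pnorm((x - mu)/sigma); a small illustrative check (added here, with arbitrary values):

mu <- 2; sigma <- 1.5; x <- 3.1
pnorm(x, mean = mu, sd = sigma)    # P{X <= x} directly
pnorm((x - mu)/sigma)              # via the standard normal PDF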
The Moment Generating Function of the Gaussian R.V. If $X \in [N(\mu,\sigma^2)]$, then we can represent $X = \sigma Z + \mu$, and thus

$$M_X(\theta) = E\,e^{\theta(\sigma Z+\mu)} = e^{\mu\theta}\,E\,e^{(\sigma\theta)Z} = e^{\mu\theta}\,m(\sigma\theta), \qquad (4.8)$$

using the linearity of the expectation and denoting by $m(\theta)$ the MGF of the standard normal. From (4.8), it is thus sufficient to calculate $m(\theta)$, which we do below:

$$m(\theta) = \int_{-\infty}^{\infty}e^{\theta x}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x^2-2\theta x)}\,dx, \qquad (4.10)$$

where the exponent is short of a perfect square $(x^2 - 2\theta x + \theta^2)$. After a standard maneuver with (4.10) we arrive at

$$m(\theta) = e^{\theta^2/2}\int_{x=-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-(x-\theta)^2/2}\,dx = e^{\theta^2/2},$$

since the second factor is the integral of a normal density with parameters $\mu = \theta$ and $\sigma^2 = 1$. Thus we have

$$m(\theta) = e^{\theta^2/2}, \qquad (4.12)$$

and, by (4.8),

$$M_X(\theta) = e^{\mu\theta + \sigma^2\theta^2/2}. \qquad (4.13)$$

The Mean and Variance of a Gaussian R.V. To find the mean and variance of the Gaussian r.v. we use the MGF technique. From (4.13),

$$M'(\theta) = (\mu + \sigma^2\theta)\,e^{\mu\theta + \sigma^2\theta^2/2} \;\Longrightarrow\; M'(0) = \mu. \qquad (4.15)$$

Furthermore,

$$M''(0) = \mu_2 = \sigma^2 + \mu\cdot\mu \;\Longrightarrow\; \mathrm{Var}\,X = \mu_2 - \mu^2 = \sigma^2.$$

It turns out that the parameters $\mu$ and $\sigma^2$ are in fact the mean and variance of a Gaussian r.v.
The Affine Transformation of a Gaussian R.V. Let $X \in [N(\mu,\sigma^2)]$. We wonder about the distribution of the r.v.

$$Y = aX + b, \quad a \ne 0.$$

PROBLEMS

4.1. Let $Y = aZ + b$ be an affine transformation of the standard Gaussian r.v., with $a \ne 0$. Use a procedure similar to (4.4)-(4.5) to find the pdf of $Y$.
Let $X \in [N(\mu,\sigma^2)]$. The r.v. $Y = e^X$ is called lognormal with parameters $\mu$ and $\sigma^2$; its pdf is

$$f_Y(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\,e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \quad x > 0.$$
4.8. Suppose it is known that a certain bridge can hold a maximum of 100 vehicles at a time. However, the total weight of all cars on the bridge can vary. It is known that the weight of a vehicle is a Gaussian random variable (r.v.) with mean $\mu = 4$ and standard deviation $\sigma = 0.4$, measured in 1000-pound units. It can be shown that the total weight of 100 vehicles is a Gaussian r.v. with mean $\mu_{100} = 400$ and standard deviation $\sigma_{100} = 4$. [The latter can be rigorously proved.] Civil engineers worry that 100 vehicles can exceed the threshold of 410 units and thus cause structural damage to the bridge. What is the probability that this will take place?
4.9. Below are two Gaussian curves along with their R source code. Looking at the curves, give the parameters of the respective Gaussian pdf's and interpret the shaded areas below the curves.
x<-seq(-4,8,0.01)
y<-dnorm(x,2,1.5)
plot(x,y,type="l")
polygon(c(x[x>4],4),c(y[x>4],y[x==-4]),col="honeydew2")
[Plot: the N(2, 1.5^2) density with the area over x > 4 shaded.]
x<-seq(-4,8,0.01)
y<-dnorm(x,2,1.5)
plot(x,y,type="l")
polygon(c(x[x<0],0),c(y[x<0],y[x==-4]),col="honeydew2")
[Plot: the N(2, 1.5^2) density with the area over x < 0 shaded.]
4.10. Using R, plot the Gaussian density function with parameters $\mu = 2.5$ and $\sigma^2 = 5$.

4.11. Under the condition of Problem 4.10, shade the area under the left half of the curve.
$$(X,Y): \Omega \to E_1 \times E_2.$$

We assume that both $E_1$ and $E_2$ are discrete (at most countable) spaces. While in the one-dimensional case we sought $P\{X = x\}$ for some $x \in E_1$, in the two-dimensional case it is natural to seek $P\{(X,Y) = (x,y)\}$, where $(x,y) \in E_1 \times E_2$. Notice that $(x,y)$ as a pair is literally the Cartesian product of $\{x\}$ and $\{y\}$, and thus the latter can be rewritten in the form $P\{(X,Y) \in \{x\}\times\{y\}\}$, or in the equivalent form

$$P\{\{X = x\} \cap \{Y = y\}\}$$

(in notation, $P\{X = x, Y = y\}$). The latter is less obvious: one can, however, easily see that the sets $\{(X,Y) = (x,y)\}$ and $\{X = x\}\cap\{Y = y\}$ are identical. We can use the pick-a-point process to show it.
Example 1.1. If $A'$ and $B'$ are subsets of $E_1$ and $E_2$, respectively, we will operate with $A = \{X \in A'\}$ and $B = \{Y \in B'\}$ and measure the event $A \cap B$; i.e., we will be interested in

$$P\{X \in A', Y \in B'\} = P(A \cap B), \qquad (1.1)$$

where the comma separating the events abbreviates the intersection and braces. Suppose $B_1', B_2', \ldots$ is a measurable partition of $E_2$. Then $B_1, B_2, \ldots$, being $\{Y \in B_1'\}, \{Y \in B_2'\}, \ldots$, form a measurable partition of $\Omega$, as depicted in Figure 1.1. Therefore,

$$P\{X \in A'\} = P(A) = P\Big(\bigcup_k (A\cap B_k)\Big) = \sum_k P(A\cap B_k) = \sum_k P\{X \in A', Y \in B_k'\}. \qquad (1.2)$$

Figure 1.1

This is called the "marginal" distribution of the r.v. $X$. More generally, if $A_1', A_2', \ldots$ is a measurable partition of $E_1$, then, proceeding with each $A_i'$ as with $A'$, we obtain the marginal distribution of the r.v. $X$:

$$P\{X \in A_i'\} = \sum_k P\{X \in A_i', Y \in B_k'\}, \quad i = 1, 2, \ldots$$
Example 1.2. Suppose we flip a fair coin twice. Let $X$ be the number of heads in the second flip and $Y$ the total number of tails in two flips. Thus we have

$$X: \Omega \to E_1 = \{0,1\}, \qquad Y: \Omega \to E_2 = \{0,1,2\},$$

and

$$P\{X=0,Y=0\} = 0, \quad P\{X=1,Y=0\} = \tfrac{1}{4}, \quad P\{X=0,Y=1\} = \tfrac{1}{4},$$
$$P\{X=1,Y=1\} = \tfrac{1}{4}, \quad P\{X=0,Y=2\} = \tfrac{1}{4}, \quad P\{X=1,Y=2\} = 0.$$

Table 1.2

Were $X$ and $Y$ independent, we would have

$$P\{X=i,Y=j\} = P\{X=i\}\,P\{Y=j\}$$

for all $i$ and $j$, i.e., the probability of the intersection would be the product of the marginal probabilities, consistent with the definition of independence, except that now we are talking about the probability of pairs of events. For example,

$$P\{X=1,Y=1\} = \tfrac{1}{4} = P\{X=1\}P\{Y=1\} = \tfrac{1}{2}\cdot\tfrac{1}{2},$$

but, say, $P\{X=0,Y=0\} = 0 \ne P\{X=0\}P\{Y=0\} = \tfrac{1}{2}\cdot\tfrac{1}{4}$, indicating that $X$ and $Y$ are not independent, since independence fails for some of the combinations.
The conditional probability is

$$P\{X=i \mid Y=j\} = \frac{P\{X=i, Y=j\}}{P\{Y=j\}}. \qquad (1.5)$$

Recall that two events $A$ and $B$ are independent if

$$P(A \cap B) = P(A)\,P(B).$$

Analogously, discrete r.v.'s $X$ and $Y$ are independent if

$$P\{X = x_i, Y = y_j\} = P\{X = x_i\}\,P\{Y = y_j\} \quad \text{for all } i, j, \qquad (1.6)$$

i.e., the joint distribution of $X$ and $Y$ is the product of their marginal distributions. Independence is similarly defined for $n$ r.v.'s.
Theorem 1.1. Suppose r.v.'s $X$ and $Y$ are independent and let $g$ and $h$ be some real-valued functions. Then $g(X)$ and $h(Y)$ are also independent and

$$E[g(X)h(Y)] = E\,g(X)\cdot E\,h(Y). \qquad (1.7)$$

Proof. We only prove the second part of the statement, namely the validity of (1.7). By the definition of independence (eq. (1.6)),

$$E[g(X)h(Y)] = \sum_i\sum_j g(x_i)h(y_j)\,P\{X=x_i\}P\{Y=y_j\} = E\,g(X)\cdot E\,h(Y).$$

Theorem 1.1 is easily extended to any $n$-tuple of r.v.'s, and it finds a very important use with transforms of sums of r.v.'s. Let $X_1, \ldots, X_n$ be iid (independent and identically distributed) r.v.'s with a common pgf

$$g(z) = E\,z^{X_1}.$$
$$(X,Y): \Omega \to E_1 \times E_2.$$

In the general case, specifically if both $X$ and $Y$ are continuous r.v.'s, we will be most interested in $P\{(X,Y) \in D\}$, in analogy to $P\{X \in A\}$ for a single r.v.; for a rectangle $D = A\times B$ this is also written

$$P\{(X,Y) \in A\times B\} = P\{X \in A, Y \in B\}. \qquad (2.2)$$

Of special interest is

$$P\{X \le x, Y \le y\}, \qquad (2.3)$$

which we reasonably denote by $F(x,y)$ and call the joint PDF of $X$ and $Y$. Here

$$E_1 \times E_2 = \mathbb{R}\times\mathbb{R} = \mathbb{R}^2.$$

The joint PDF of $X$ and $Y$ in this case resembles the univariate case in the sense that there is a unique joint pdf $f(x,y) \ge 0$ defined on $\mathbb{R}^2$ such that

$$F(x,y) = P\{X \le x, Y \le y\} = \int_{u=-\infty}^{x}\int_{v=-\infty}^{y}f(u,v)\,dv\,du, \qquad (2.4)$$

i.e., the PDF of $(X,Y)$ can be expressed as the integral of a unique nonnegative function $f$.
The three-dimensional plot of the probability density function $f(x,y) = \frac{1}{2\pi}e^{-\frac{1}{2}(x^2+y^2)}$ is as follows.

[Surface plot of f(x,y) over $[-3,3]\times[-3,3]$.]
f<-function(x,y) {
z<-(1/(2*pi))*exp(-0.5*(x^2+y^2))
}
y<-x<-seq(-3,3,length=50)
z<-outer(x,y,f)
persp(x,y,z)
persp(x,y,z,theta=45,phi=30,expand=0.6,ltheta=120,shade=0.75,
      ticktype="detailed",xlab="X",ylab="Y",zlab="f(x,y)",col="honeydew2")
In a more general case, a set $D$ to which $(X,Y)$ may belong need not be a rectangle. In this case (2.4) is generalized as follows:

$$P\{(X,Y) \in D\} = \iint_D f(x,y)\,d(x,y). \qquad (2.5)$$

The latter is the volume of the cylinder enclosed between the flat set $D$ in the $(X,Y)$-plane and the surface $f$, surrounded by the lateral surface generated by a line orthogonal to $D$ and running along its boundary $\partial D$. (See Figure 2.2 below.)

In many cases, the boundary $\partial D$ can be smoothly parametrized. Suppose that the projection of $D$ on the $Y$-axis is an interval $[a,b]$, and suppose $\partial D = C_1 \cup C_2$, such that both projections of $C_1$ and $C_2$ on the $Y$-axis are $[a,b]$. Suppose $C_1$ and $C_2$ are parametrized as $\varphi_1$ and $\varphi_2$, respectively. This is depicted in Figure 2.1 below.

Figure 2.1

Then the integration of $f$ over the region $D$ is rendered according to the following informal procedure. First, for a $y \in [a,b]$, we draw the $y$-section of the solid (the cylinder enclosed between $f$ and $D$). This is the intersection of the solid with the plane perpendicular to the $XY$-plane and parallel to the $XZ$-plane. This plane crosses the boundary of $D$ at $\varphi_1(y)$ and $\varphi_2(y)$. See Figure 2.2 below.

Figure 2.2 [Sketch: the surface $f(\cdot,y)$, a $y$-section, and its projection.]

Hence the $y$-section looks like a curved trapezoid enclosed between the curve $f(\cdot,y)$ and the segment $[\varphi_1(y),\varphi_2(y)]$. The area of the $y$-section is obviously $\int_{x=\varphi_1(y)}^{\varphi_2(y)}f(x,y)\,dx$. Assuming that the $y$-section does not change from $y$ to $y+dy$, the volume of the infinitesimal layer of the solid enclosed between $y$ and $y+dy$ is $V_y = \big(\int_{x=\varphi_1(y)}^{\varphi_2(y)}f(x,y)\,dx\big)dy$. To find the volume of the whole solid requires $y$ to run from $a$ to $b$ and the summation of all such $V_y$'s. Therefore, integral (2.5) equals

$$P\{(X,Y) \in D\} = \int_{y=a}^{b}\int_{x=\varphi_1(y)}^{\varphi_2(y)}f(x,y)\,dx\,dy. \qquad (2.6)$$
Furthermore, we can also encounter cases like $P\{g(X,Y) \in R\}$, which can be shown to equal $\iint_{\{(x,y):\,g(x,y)\in R\}}f(x,y)\,d(x,y)$.
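An iterated integral such as (2.6) can also be evaluated numerically; the sketch below (added for illustration, not from the original notes) nests R's integrate for $f(x,y) = e^{-(x+y)}$ over $D = \{x \ge 0, y \ge 0, x+y \le 1\}$, i.e., $\varphi_1(y) = 0$ and $\varphi_2(y) = 1-y$:

f <- function(x, y) exp(-(x + y))
inner <- function(y) sapply(y, function(yy)
  integrate(function(x) f(x, yy), lower = 0, upper = 1 - yy)$value)
integrate(inner, lower = 0, upper = 1)$value  # equals 1 - 2/e = 0.2642...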
f<-function(x,y) {
z<-exp(-(x+y))
}
y<-x<-seq(0,6,length=50)
z<-outer(x,y,f)
persp(x,y,z)
persp(x,y,z,theta=45,phi=30,expand=0.6,ltheta=120,shade=0.75,
      ticktype="detailed",xlab="X",ylab="Y",zlab="f(x,y)")
[Surface plot of f(x,y) = exp(-(x+y)) over $[0,6]\times[0,6]$.]
For the pdf $f(x,y) = e^{-(x+y)}$ plotted above,

$$P\Big\{\frac{X}{Y} \le t\Big\} = \iint_{\{x/y\le t\}}f(x,y)\,d(x,y).$$

Since $\{x/y \le t\}\cap\mathbb{R}_+^2 = \{y \ge \frac{1}{t}x\}\cap\mathbb{R}_+^2$, we get

$$P\Big\{\frac{X}{Y} \le t\Big\} = \int_{x=0}^{\infty}\int_{y=x/t}^{\infty}e^{-(x+y)}\,dy\,dx = \int_{x=0}^{\infty}e^{-x(1+1/t)}\,dx = \frac{t}{1+t}.$$

Notice that since $X$ and $Y$ are both positive, their ratio is also positive, and thus the above probability is zero for all negative values of $t$. In conclusion,

$$P\Big\{\frac{X}{Y} \le t\Big\} = \frac{t}{1+t}\,\mathbf{1}_{\mathbb{R}_+}(t). \qquad (2.9)$$

Finally, we remark from (2.4) that, applying the second-order partial derivatives $\frac{\partial^2}{\partial x\,\partial y}$ or $\frac{\partial^2}{\partial y\,\partial x}$, depending on the order in which we iterate the integration, we extract the joint pdf:

$$\frac{\partial^2}{\partial y\,\partial x}F(x,y) = f(x,y). \qquad (2.10)$$
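Formula (2.9) can be confirmed by simulation (an illustrative sketch, added here, with iid standard exponentials):

x <- rexp(1e6); y <- rexp(1e6)
t <- 1.5
c(mean(x/y <= t), t/(1 + t))   # empirical probability vs t/(1+t)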
PROBLEMS

2.1. Let

$$f(x,y) = \begin{cases} \tfrac{1}{2}x + cy, & 0 < x < 1,\ 2 < y < 10\\ 0, & \text{elsewhere}. \end{cases}$$

Find the constant $c$.

2.2. Let

$$f(x,y) = \begin{cases} e^{-(x+y)}, & (x,y)\in\mathbb{R}_+^2\\ 0, & \text{otherwise}. \end{cases}$$

Find $P\{X+Y \le t\}$.

2.3. Let $X$ and $Y$ have the joint pdf

$$f(x,y) = \begin{cases} \mu\nu\,e^{-(\mu x + \nu y)}, & (x,y)\in\mathbb{R}_+^2\\ 0, & \text{otherwise}. \end{cases}$$

(i) Find $P\{X+Y \le t\}$, $t \in \mathbb{R}$.

Solution. For $t \ge 0$, integrating over $\{x+y \le t\}$,

$$F_Z(t) = \int_{x=0}^{t}\mu e^{-\mu x}\big(1-e^{-\nu(t-x)}\big)\,dx = 1 - e^{-\mu t} - \frac{\mu}{\mu-\nu}\big(e^{-\nu t}-e^{-\mu t}\big) = 1 - \frac{\nu}{\nu-\mu}e^{-\mu t} + \frac{\mu}{\nu-\mu}e^{-\nu t}.$$

Hence

$$F_Z(t) = \left[1 - \frac{\nu}{\nu-\mu}e^{-\mu t} + \frac{\mu}{\nu-\mu}e^{-\nu t}\right]\mathbf{1}_{[0,\infty)}(t).$$
2.4. Using the pattern of Example 2.1, the R source code below, and Problem 2.3, find the density function $f$ of the vector $(X,Y)$ and identify the parameters $\mu$ and $\nu$.
f<-function(x,y) {
z<-0.256*exp(-0.6*(2*x+0.3*y))
}
y<-x<-seq(0,3,length=50)
z<-outer(x,y,f)
persp(x,y,z)
persp(x,y,z,theta=45,phi=30,expand=0.6,ltheta=120,shade=0.75,
      ticktype="detailed",xlab="X",ylab="Y",zlab="f(x,y)",col="lightskyblue")
[Surface plot of f(x,y) = 0.256 exp(-0.6(2x + 0.3y)) over $[0,3]\times[0,3]$.]
2.6. Suppose a point is located in the unit square and that its location within the square obeys the uniform distribution; i.e., if $X$ and $Y$ are the coordinates of the point, then their joint pdf is

$$f(x,y) = \begin{cases} 1, & 0 < x < 1,\ 0 < y < 1\\ 0, & \text{elsewhere}. \end{cases}$$

3. Marginal Distributions. From

$$F(x,y) = P\{X \le x, Y \le y\}$$

it follows that, for $y \to \infty$, the set $\{Y \le y\}$ approaches $\Omega$. This is literally because $Y(\omega) < \infty$ for all $\omega$. Then

$$F_X(x) = \lim_{y\to\infty}F(x,y). \qquad (3.2)$$

Similarly,

$$F_Y(y) = \lim_{x\to\infty}F(x,y).$$

The corresponding marginal densities (pdf's) can be obtained by differentiating the marginal PDF's. So, from (3.2) we have

$$f_X(x) = \int_{y=-\infty}^{\infty}f(x,y)\,dy, \qquad f_Y(y) = \int_{x=-\infty}^{\infty}f(x,y)\,dx.$$
Example 3.1. Let $X$ and $Y$ be two r.v.'s with the given joint pdf

$$f(x,y) = \begin{cases} \frac{15}{4}x^2, & 0 \le y \le 1-x^2\\ 0, & \text{elsewhere}. \end{cases} \qquad (3.6)$$

Solution. As we see from Figure 3.1, $f(x,y) > 0$ between the line $y_1(x) = 0$ and the parabola $y_2(x) = 1-x^2$.

Figure 3.1

Integrating w.r.t. $y$ along this strip,

$$f_X(x) = \int_{y=0}^{1-x^2}\tfrac{15}{4}x^2\,dy = \tfrac{15}{4}x^2(1-x^2)\,\mathbf{1}_{[-1,1]}(x). \qquad (3.7)$$

Now, to find the marginal pdf $f_Y$ we need to integrate $f$ w.r.t. $x$. From $y \le 1-x^2$ we find that $|x| \le \sqrt{1-y}$, which means that we will use the positive values of $f(x,y)$ for $-\sqrt{1-y} \le x \le \sqrt{1-y}$. In other words, for fixed $y \in [0,1]$, the integration of $f$ w.r.t. $x$ runs along the segment of the line through $y$, parallel to the $X$-axis, between $x = -\sqrt{1-y}$ and $x = \sqrt{1-y}$. (See Figure 3.2 below.)

Figure 3.2

Consequently, we have

$$f_Y(y) = \int_{x=-\sqrt{1-y}}^{\sqrt{1-y}}\tfrac{15}{4}x^2\,dx = \tfrac{5}{2}(1-y)^{3/2}\,\mathbf{1}_{[0,1]}(y).$$
Example 3.2. Let $(X_1, \ldots, X_n)$ be a random sample such that $X_i \in [X]$, where $X$ is a r.v. with mean $\mu$. Denote

$$\bar{X}_n := \tfrac{1}{n}(X_1 + \cdots + X_n) \qquad (3.10)$$

and call it the sample mean of the population $[X]$. From Proposition 3.1, we easily find that

$$E\bar{X}_n = \tfrac{1}{n}\sum_{i=1}^{n}EX_i = \mu.$$

Thus, the mean of the sample mean is $\mu$ (i.e., the mean of the population) regardless of the sample size.
PROBLEMS

3.1. Generalize Property 3.1 as follows. Let $g$ and $h$ be two functions and let $X$ and $Y$ be r.v.'s. Show that

$$E[g(X) + h(Y)] = E\,g(X) + E\,h(Y).$$

3.2. Let

$$f(x,y) = \begin{cases} 8xy, & 0 < x < y < 1\\ 0, & \text{elsewhere}. \end{cases} \qquad (3.13)$$

Draw the region where $f > 0$ and find the marginal pdf's of $X$ and $Y$.

[Sketch: the triangle $\{y > x\}$ in the unit square.]

3.3. Under the condition of Problem 3.2, find

(i) $E\Big[2X + \dfrac{1}{3Y^2 + 4}\Big]$;

(ii) $E[XY + 1]$.
4. Independence. Recall that r.v.'s $X$ and $Y$ are independent if

$$P\{X \in A, Y \in B\} = P\{X \in A\}\,P\{Y \in B\} \qquad (4.1)$$

holds for all Borel sets $A$ and $B$. This definition makes sense, and it follows the independence of two events in basic probability. However, for continuous r.v.'s, condition (4.1) is difficult to verify for all $A$ and $B$. If we use sets like $(-\infty,x]$ for $A$ and $(-\infty,y]$ for $B$, expression (4.1) reduces to an easily verifiable product of the marginal PDF's:

$$F(x,y) = F_X(x)\,F_Y(y). \qquad (4.2)$$

Fortunately, if (4.2) holds, then (4.1) also holds for all Borel subsets of $\mathbb{R}$, in particular all infinite closed intervals. This is not easy to prove, but it is contained in advanced textbooks on probability. From (4.2), we readily obtain that

$$f(x,y) = f_X(x)\,f_Y(y) \qquad (4.3)$$

if and only if $X$ and $Y$ are independent. From (4.3), it follows that a straightforward independence test is to calculate the marginal densities and then check whether (4.3) holds. However, there is yet another way to test for independence without the necessity of integrating for the marginal densities.

Property 4.1. Let $X$ and $Y$ be two continuous r.v.'s with joint pdf $f(x,y)$. $X$ and $Y$ are independent if and only if $f$ factors as

$$f(x,y) = g(x)\,h(y). \qquad (4.4)$$

Proof. Suppose (4.4) holds true. Integrating $f(x,y)$ in $y$ and then in $x$ gives

$$f_X(x) = b\,g(x), \qquad f_Y(y) = a\,h(y),$$

where $a = \int g$ and $b = \int h$. Therefore, $g$ and $h$ differ from the marginal densities by multiplicative constants $a$ and $b$ such that $ab = 1$. Consequently,

$$f(x,y) = g(x)h(y) = \tfrac{1}{b}f_X(x)\cdot\tfrac{1}{a}f_Y(y) = f_X(x)f_Y(y),$$

which proves that $X$ and $Y$ are independent. The converse obviously holds true.

Remark 4.1. To fully utilize Property 4.1, we notice that in many applications a joint pdf is positive on some proper subset of $\mathbb{R}^2$. It often reads

$$f(x,y) = \varphi(x,y)\,\mathbf{1}_D(x,y). \qquad (4.5)$$

Now, two things must take place in order for $X$ and $Y$ to be independent. Firstly, $\varphi$ must be factorizable. Secondly, $D$ must be a rectangle, i.e., $D$ must be a Cartesian product $A\times B$. The latter is due to the property of the indicator function:

$$\mathbf{1}_{A\times B}(x,y) = \mathbf{1}_A(x)\,\mathbf{1}_B(y), \qquad (4.6)$$

which is easy to verify by seeing that the left- and right-hand sides of (4.6) are simultaneously equal to 1 or 0. This adds up to the full factorization of $f(x,y)$. It is readily seen that only if $D$ is a rectangle can $\mathbf{1}_D$ be factorized, and thus $f$ be factorizable.

In a nutshell, the pdf $f$ in (4.5) is factorizable (and thus $X$ and $Y$ are independent) if and only if $\varphi$ is factorizable and $D$ is a rectangle.
For instance, let

$$f(x,y) = \begin{cases} 15\,e^{-3x-5y}, & (x,y)\in\mathbb{R}_+^2\\ 0, & \text{otherwise}. \end{cases} \qquad (4.7)$$

In this case the algebraic part of $f$ is clearly factorizable and, in addition, $f > 0$ on $\mathbb{R}_+^2$, which is a rectangle $[0,\infty)\times[0,\infty)$. Therefore, $X$ and $Y$ are independent.

Now let

$$f(x,y) = \begin{cases} c\,x\,e^{-(x+y)}, & 0 < x < y < 1\\ 0, & \text{otherwise}. \end{cases}$$

In this case the algebraic part is factorizable, but $f > 0$ only on the upper triangle of the unit square $(0,1)\times(0,1)$, and thus $X$ and $Y$ are not independent.
Recall that for the joint pdf

$$f(x,y) = \begin{cases} \frac{15}{4}x^2, & 0 \le y \le 1-x^2\\ 0, & \text{elsewhere}, \end{cases}$$

we found that

$$f_X(x) = \tfrac{15}{4}x^2(1-x^2)\,\mathbf{1}_{[-1,1]}(x)$$

and

$$f_Y(y) = \tfrac{5}{2}(1-y)^{3/2}\,\mathbf{1}_{[0,1]}(y).$$

From Figure 4.1 below we see that the regions where $f(x,y) > 0$ and where $f_X(x)f_Y(y) > 0$ do not coincide, so we conclude that $X$ and $Y$ are not independent. In addition, $f \ne f_X f_Y$, which is obvious. However, the first argument is crucial in seeing that the set where $f > 0$ is not a rectangle, as we learn (indirectly) from Property 4.1. This is a vivid illustration of the independence principle in Remark 4.1.

Figure 4.1
Property 4.2. Let $X$ and $Y$ be independent r.v.'s and let $G$ and $H$ be two functions. Then

$$E[G(X)H(Y)] = E\,G(X)\cdot E\,H(Y).$$

Remark 4.2. Furthermore, one can show that if $X$ and $Y$ are independent r.v.'s and $G$ and $H$ are two functions, then the r.v.'s $G(X)$ and $H(Y)$ are independent.
PROBLEMS

4.1. Let

$$f(x,y) = \begin{cases} 2x\,e^{-y}, & 0 \le x \le 1,\ 0 < y < \infty\\ 0, & \text{elsewhere}. \end{cases}$$

Are $X$ and $Y$ independent?

4.2. Let

$$f(x,y) = \begin{cases} c\,e^{-(x^2+y^2)}, & 0 \le x < \infty,\ 0 < y < \infty\\ 0, & \text{elsewhere}, \end{cases}$$

where $c$ is a positive constant. Are the events $\{X > 1\}$ and $\{\tfrac{1}{2} < Y \le 2\}$ independent?

Solution. Firstly, we easily conclude that $X$ and $Y$ are independent. This is because $f$ is clearly factorizable and because $f > 0$ on the rectangle $[0,\infty)\times[0,\infty)$. Secondly, because $X$ and $Y$ are independent, they generate independent events. Namely, recall that $X$ and $Y$ are by the original definition independent if and only if

$$P\{X \in A, Y \in B\} = P\{X \in A\}\,P\{Y \in B\},$$

valid for all Borel sets $A, B \subseteq \mathbb{R}$, in particular for $A = (1,\infty)$ (corresponding to $X > 1$) and for $B = (\tfrac{1}{2}, 2]$ (corresponding to $\tfrac{1}{2} < Y \le 2$).
4.3. Let $X$ and $Y$ be two independent exponentially distributed r.v.'s with parameters $\lambda$ and $\mu$, respectively. Find the distribution of $Z := \min\{X, Y\}$.

Solution. It can readily be shown that $\{Z > t\} = \{X > t\}\cap\{Y > t\}$. Indeed, $Z(\omega) > t$ if and only if $X(\omega) > t$ and $Y(\omega) > t$, valid for all $\omega \in \Omega$. Then, by the independence of $X$ and $Y$,

$$P\{Z > t\} = P\{X > t\}\,P\{Y > t\} = e^{-(\lambda+\mu)t}.$$

Therefore, $P\{Z \le t\} = 1 - e^{-(\lambda+\mu)t}$, and we conclude that $Z$ is exponential with parameter $\lambda + \mu$.

4.4. Suppose $X_1, \ldots, X_k$ are independent r.v.'s, each exponentially distributed, with parameters $\beta_1, \ldots, \beta_k$, respectively. Show that the r.v. $Y := \min\{X_1, \ldots, X_k\}$ is exponentially distributed with parameter $\beta_1 + \cdots + \beta_k$.
4.5. Let $X$ and $Y$ be independent continuous r.v.'s. Show that

$$P\{X+Y \le t\} = \iint_{\{x+y\le t\}}f_X(x)f_Y(y)\,d(x,y),$$

or, equivalently,

$$F_{X+Y}(t) = \int_{-\infty}^{\infty}F_X(t-y)\,f_Y(y)\,dy.$$
4.6. Let $X$ and $Y$ be two independent exponentially distributed r.v.'s with parameters $\lambda$ and $\mu$, respectively. Using the formula of Problem 4.5, find the probability distribution function of $Z = X + Y$.

Solution. For $t \ge 0$,

$$F_Z(t) = \int_{x=0}^{t}\lambda e^{-\lambda x}\big(1-e^{-\mu(t-x)}\big)\,dx = 1 - e^{-\lambda t} - \frac{\lambda}{\lambda-\mu}\big(e^{-\mu t}-e^{-\lambda t}\big) = 1 - \frac{\mu}{\mu-\lambda}e^{-\lambda t} + \frac{\lambda}{\mu-\lambda}e^{-\mu t}.$$

Hence

$$F_Z(t) = \left[1 - \frac{\mu}{\mu-\lambda}e^{-\lambda t} + \frac{\lambda}{\mu-\lambda}e^{-\mu t}\right]\mathbf{1}_{[0,\infty)}(t).$$
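The distribution function just obtained admits a quick simulation check (an added sketch; the parameter values are arbitrary):

lambda <- 2; mu <- 3; t <- 0.8
z <- rexp(1e6, lambda) + rexp(1e6, mu)   # Z = X + Y
Fz <- 1 - mu/(mu - lambda)*exp(-lambda*t) + lambda/(mu - lambda)*exp(-mu*t)
c(mean(z <= t), Fz)                      # empirical vs closed form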
4.7. Under the condition of Problem 4.6, assuming that $\lambda = \mu$, find the probability density function of $Z = X + Y$ and identify $Z$.

Solution. Denote $\nu = \mu/\lambda$ and rewrite the answer to Problem 4.6 as

$$F_Z(t) = 1 - e^{-\lambda t}\,\frac{\nu - e^{-(\nu-1)\lambda t}}{\nu-1}, \quad t \ge 0.$$

Now, $\mu \to \lambda$ if and only if $\nu \to 1$. We thus apply L'Hôpital's rule to the fraction $\frac{\nu - e^{-(\nu-1)\lambda t}}{\nu-1}$ (differentiating numerator and denominator w.r.t. $\nu$):

$$\lim_{\nu\to 1}\frac{\nu - e^{-(\nu-1)\lambda t}}{\nu-1} = \lim_{\nu\to 1}\frac{1 + \lambda t\,e^{-(\nu-1)\lambda t}}{1} = 1 + \lambda t.$$

Hence $F_Z(t) = 1 - (1+\lambda t)e^{-\lambda t}$, and differentiation gives $f_Z(t) = \lambda^2 t\,e^{-\lambda t}$. Comparing with the gamma density, we conclude that $f_Z$ is gamma with parameters $\alpha = 2$ and $\beta = \lambda$.
4.8. Let $X$ and $Y$ be two independent exponentially distributed r.v.'s with common parameter $\lambda$. Find the probability distribution of $Z = X - Y$, the pdf of $Z$, and identify $Z$.
Therefore, $X + Y \in [N(\mu+\nu,\ \sigma^2+\delta^2)]$.

Example 5.2. Let $X_1, \ldots, X_n$ be iid exponential r.v.'s, each with parameter $\lambda$. From (5.2),

$$M_{X_1+\cdots+X_n}(\theta) = \left(\frac{\lambda}{\lambda-\theta}\right)^{n}, \qquad (5.4)$$

which is the MGF of an Erlang r.v. with parameters $(n,\lambda)$ (see (3.9)), a special case of a gamma r.v. with parameters $(\alpha = n, \beta = \lambda)$. As an interesting interpretation of (5.4), we see that an Erlang r.v. consists of $n$ independent exponential phases and is often used to describe time-related processes. For example, a visit to a supermarket can be statistically adapted to an Erlang distribution, considering that one spends an exponential time at every department on the list, visiting exactly $n$ different departments.

Example 5.3. Let $X_1, \ldots, X_n$ be iid Bernoulli r.v.'s with parameter $p$. Recall that $X$ is a Bernoulli r.v. with parameter $p$ if $X: \Omega \to \{0,1\}$ such that $P\{X=1\} = p$ and $P\{X=0\} = q = 1-p$. The pgf of $X$ is

$$E\,z^X = z^0 q + z\,p = pz + q. \qquad (5.5)$$

We once mentioned that the sum of $n$ independent Bernoulli r.v.'s is binomial with parameters $(n,p)$. Now we see that, by independence,

$$E\,z^{X_1+\cdots+X_n} = (pz+q)^n,$$

which is the pgf of a binomial r.v. with parameters $(n,p)$. The corresponding MGF of the sum is

$$M(\theta) = \left(pe^{\theta}+q\right)^n.$$
PROBLEMS

5.1. Under the condition of Example 5.1, identify the r.v. $aX + bY$, where $a, b \ne 0$.

5.2. Let $X_1, \ldots, X_n$ be iid (independent, identically distributed) r.v.'s, each Gaussian with parameters $\mu$ and $\sigma^2$. Find the distribution of the r.v. $X_1 + \cdots + X_n$.

5.3. Under the condition of Problem 5.2, find the distribution of the sample mean $\bar{X}_n = \frac{1}{n}(X_1+\cdots+X_n)$ (introduced earlier in Example 3.2).

5.4. Let $X$ and $Y$ be two independent Poisson r.v.'s with parameters $\lambda_1$ and $\lambda_2$. Using the MGF technique, find the distribution of the r.v. $X + Y$.

5.5. Let $X_1, \ldots, X_n$ be independent r.v.'s, each Gaussian with parameters $\mu_i$ and $\sigma_i^2$, $i = 1, \ldots, n$. Find the distribution of the r.v. $a_1X_1 + \cdots + a_nX_n$, where $a_i \in \mathbb{R}$ and $a_1^2 + \cdots + a_n^2 > 0$.

5.6. Let $X$ and $Y$ be two independent r.v.'s with respective MGF's

$$m_X(\theta) = e^{3(e^{\theta}-1)} \quad\text{and}\quad m_Y(\theta) = \left(\tfrac{1}{5}e^{\theta}+\tfrac{4}{5}\right)^{3}.$$

Calculate $E[XY]$.

Solution. Firstly, we identify $m_X$ and $m_Y$ as the MGF's of a Poisson and a binomial r.v., with parameters $\lambda = 3$ and $(n = 3, p = \tfrac{1}{5})$, respectively. Then the mean $\mu_X$ of $X$ is $3$ and the mean $\mu_Y$ of $Y$ is $3\cdot\tfrac{1}{5}$. By independence, $E[XY] = \mu_X\mu_Y = \tfrac{9}{5}$.

5.7. Let $X$ and $Y$ be two independent r.v.'s with respective MGF's

$$m_X(\theta) = e^{e^{\theta}-1} \quad\text{and}\quad m_Y(\theta) = \left(\tfrac{2}{3}e^{\theta}+\tfrac{1}{3}\right)^{5}.$$

Calculate

(i) $P\{XY = 0\}$;

(ii) $E[XY]$;

(iii) $P\{X+Y = 1\}$.

5.8. Let $X_1, \ldots, X_n$ be independent r.v.'s, each gamma with parameters $\alpha_i$ and $\beta_i$, $i = 1, \ldots, n$. Find the distribution of the r.v. $a_1X_1 + \cdots + a_nX_n$, where $a_i > 0$, $i = 1, \ldots, n$.
6. Correlation

If r.v.'s $X$ and $Y$ are not independent, we are interested in how dependent they are. We would like to introduce, in some sense, a measure of their dependence, referred to as the covariance:

$$\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)],$$

where $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$. The motivation for this measure of dependence is as follows. If $X$ and $Y$ are independent, then so are their functions $X-\mu_X$ and $Y-\mu_Y$. In this case,

$$\mathrm{Cov}(X,Y) = E[X-\mu_X]\cdot E[Y-\mu_Y] = 0. \qquad (*)$$

The converse is not true, as we will see, but for $X$ and $Y$ independent the covariance vanishes.

Now, using Proposition 1.1 on the linearity of the expectation, after some simple algebra we arrive at the computationally friendlier formula

$$\mathrm{Cov}(X,Y) = E[XY] - \mu_X\mu_Y.$$

As mentioned, the converse of $(*)$ does not hold (i.e., if the covariance of $X$ and $Y$ is zero, $X$ and $Y$ need not be independent), as we learn from the example below.

Example 6.1. Let $X: \Omega \to \{-1, 0, 1\}$ be uniformly distributed, with probability $\tfrac{1}{3}$ for each value. Let

$$Y = \begin{cases} 0, & X \ne 0\\ 1, & X = 0. \end{cases}$$

Then $XY = 0$ and $\mu_X = 0$, so $\mathrm{Cov}(X,Y) = E[XY] - \mu_X\mu_Y = 0$, although $X$ and $Y$ are clearly dependent.

If $\mathrm{Cov}(X,Y) = 0$, the r.v.'s $X$ and $Y$ are called uncorrelated. Hence, if $\mathrm{Cov}(X,Y) = 0$, $X$ and $Y$ are not necessarily independent. In any event, they are uncorrelated.
Next, we study the covariance more thoroughly and work out several of its important properties.

Properties of the Covariance. Let $X, Y, X_1, \ldots, X_n, Y_1, \ldots, Y_m$ be r.v.'s. Then the following hold true:

(iv) $\mathrm{Cov}\big(\sum_{i=1}^{n}X_i,\ Y\big) = \sum_{i=1}^{n}\mathrm{Cov}(X_i, Y)$;

(v) $\mathrm{Cov}\big(\sum_{i=1}^{n}X_i,\ \sum_{j=1}^{m}Y_j\big) = \sum_{i=1}^{n}\sum_{j=1}^{m}\mathrm{Cov}(X_i, Y_j)$;

(vi) if $Y = c$ is almost surely (a.s.) constant, then $\mathrm{Cov}(X, c) = 0$. Thus, any r.v. is uncorrelated with any constant.

In particular,

$$\mathrm{Var}\Big(\sum_{i=1}^{n}X_i\Big) = \sum_{i=1}^{n}\mathrm{Cov}(X_i,X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i,X_j) = \sum_{i=1}^{n}\mathrm{Var}\,X_i + 2\sum_{i<j}\mathrm{Cov}(X_i,X_j).$$

Corollary 6.2. Let $X_1, \ldots, X_n$ be pairwise uncorrelated r.v.'s. Then the following Bienaymé equation holds:

$$\mathrm{Var}\Big(\sum_{i=1}^{n}X_i\Big) = \sum_{i=1}^{n}\mathrm{Var}\,X_i. \qquad (6.4)$$
Remark 6.1. Given an $n$-tuple of r.v.'s $X_1, \ldots, X_n$ with variances $\sigma_1^2, \ldots, \sigma_n^2$, denote their covariances $\sigma_{ij}$, $i,j = 1, \ldots, n$. We can place all covariances in a so-called covariance matrix,

$$\mathrm{Cov}\,\mathbf{X} := K = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1n}\\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2n}\\ \cdots & \cdots & \cdots & \cdots & \cdots\\ \sigma_{n1} & \sigma_{n2} & \sigma_{n3} & \cdots & \sigma_n^2 \end{pmatrix}. \qquad (6.5)$$

Notice that $K$ is a symmetric matrix (i.e., $K' = K$) and that the main diagonal consists of the variances of the r.v.'s $X_1, \ldots, X_n$.
Remark 6.2. A common problem in statistics is the parameter estimation of a r.v. (such as its mean or variance). In this case, one assumes that some population is represented by an equivalence class of r.v.'s having the same particular distribution. Such an equivalence class will be denoted by $[X]$, where $X$ is a generic r.v. being one of them. An experiment consists of drawing a random sample $\mathbf{T}_n = (X_1, \ldots, X_n)$ from the population $[X]$, which is an $n$-tuple (or vector) of $n$ iid r.v.'s. To estimate an unknown parameter, one forms a statistic, a Borel function of the sample. One of them, the sample mean $\bar{X}_n$, was introduced in Example 3.2, except that there we did not assume that $X_1, \ldots, X_n$ were independent. We recall that the sample mean is

$$\bar{X}_n = \tfrac{1}{n}(X_1 + \cdots + X_n). \qquad (6.6)$$

If the r.v. $X$ has mean $\mu$ and variance $\sigma^2$, then the mean of $\bar{X}_n$ is

$$E\bar{X}_n = \tfrac{1}{n}\,n\,EX_1 = \mu, \qquad (6.7)$$

which equals the mean of the population, as per Example 3.2. (This result did not require $X_1, \ldots, X_n$ to be independent.) The variance of $\bar{X}_n$ is, by the Bienaymé equation (6.4),

$$\mathrm{Var}\,\bar{X}_n = \tfrac{1}{n^2}\,n\,\sigma^2 = \tfrac{1}{n}\sigma^2. \qquad (6.8)$$

This shows that the variance of the sample mean is $n$ times smaller than that of the population, so it seems that by increasing the sample size $n$ we reduce the variance, thereby making the sample mean approach a constant, apparently $\mu$ itself. Thus, for large samples, the sample mean can be a good estimator of the unknown mean of the population. That the sample mean indeed converges to $\mu$ follows from the so-called Law of Large Numbers.
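The shrinking of the variance in (6.8) is easy to watch empirically; the following illustrative sketch (not part of the notes) estimates the variance of the sample mean for a Gaussian population with $\sigma^2 = 4$:

sigma2 <- 4
for (n in c(10, 100, 1000)) {
  xbar <- replicate(2000, mean(rnorm(n, 0, sqrt(sigma2))))
  cat(n, var(xbar), sigma2/n, "\n")   # empirical variance vs sigma^2/n
}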
PROBLEMS

6.1. Let $X$ and $Y$ be two independent Gaussian r.v.'s with parameters $(-1, 2)$ and $(3, 5)$. Find $\mathrm{Var}(2X + 7Y + 5)$.

Hint. Step 1. Notice that the constant 5 in $2X + 7Y + 5$ does not have any impact on the variance, and thus it can be discarded. Therefore,

$$\mathrm{Var}(2X + 7Y + 5) = \mathrm{Var}(2X + 7Y).$$

Step 2. Since $X$ and $Y$ are independent, so are $2X$ and $7Y$ as linear functions of $X$ and $Y$. In particular, $2X$ and $7Y$ are uncorrelated. Therefore, we can use the Bienaymé equation (6.4):

$$\mathrm{Var}(2X + 7Y) = \mathrm{Var}(2X) + \mathrm{Var}(7Y).$$

Next,

$$\mathrm{Var}(2X) + \mathrm{Var}(7Y) = 4\,\mathrm{Var}\,X + 49\,\mathrm{Var}\,Y = 4\cdot 2 + 49\cdot 5 = 253.$$

6.2. Suppose a fair die is rolled twice and let $X$ and $Y$ denote the sum of and the difference between the first and second outcomes, respectively. Find $\mathrm{Cov}(X,Y)$.

6.6. Let $X$ and $Y$ be two identically distributed r.v.'s. Show that the r.v.'s $X+Y$ and $X-Y$ are uncorrelated.

6.7. Suppose two fair dice are rolled and let $X_1$ and $X_2$ denote the outcomes of the first and second die, respectively. Prove that the r.v.'s $X = X_1 + 2X_2$ and $Y = 2X_1 - X_2$ are uncorrelated.

6.8. Let $X_1, \ldots, X_{100}$ be a sample of iid r.v.'s drawn from a binomial population $[10, 0.3]$ (i.e., with parameters $10, 0.3$). Find the mean and variance of the sample mean $\bar{X}_{100}$.

Solution.

$$E\bar{X}_{100} = \tfrac{1}{100}\,100\,\mu = \mu,$$

where $\mu$ is the mean of each $X_i$. Since the expectation of a binomial r.v. with parameters $(m,p)$ is $mp$, we have $\mu = 10\cdot 0.3 = 3$, and so is $E\bar{X}_{100}$. Furthermore, by (6.8),

$$\mathrm{Var}\,\bar{X}_{100} = \tfrac{1}{100^2}\,100\cdot 10pq = 10pq/100 = 10\cdot 0.3\cdot 0.7/100 = 0.021.$$
Proposition 7.1. Markov Inequality. Let $Y \ge 0$ a.s. (i.e., with probability 1) and let $a > 0$. Then

$$a\,\mathbf{1}_{\{Y\ge a\}} \le Y. \qquad (7.1)$$

Indeed, if $Y < a$, the left-hand side is 0, while $Y \ge 0$ by assumption. If $Y \ge a$, then the left-hand side equals $a$, in which case the inequality again holds true. Now, taking the expectation on both sides of (7.1) and recalling that the expectation is a monotone functional, we have

$$P\{Y \ge a\} \le \frac{\mu}{a}, \quad \text{where } \mu = EY. \qquad (7.2)$$

Inequality (7.2) is known as the Markov inequality. Notice that if $\mu = \infty$, then (7.2) holds trivially.

Proposition 7.2. Chebyshev's Inequality. Let $X$ be a r.v. Denote $Y = (X-\mu)^2$ and let $a = \varepsilon^2$ for some $\varepsilon > 0$. Then, using the Markov inequality, we have

$$P\{|X-\mu| \ge \varepsilon\} \le \frac{\mathrm{Var}\,X}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}, \qquad (7.3)$$

known as Chebyshev's inequality. One noteworthy application: if $\mathrm{Var}\,X = 0$, then from (7.3) it follows that $X$ is almost surely a constant. If $\sigma^2 = \infty$, then (7.3) holds trivially.
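To see how conservative (7.3) can be, the following illustrative sketch (added here) compares the Chebyshev bound with the actual tail probability for a standard Gaussian sample:

mu <- 0; sigma <- 1; eps <- 2
x <- rnorm(1e6, mu, sigma)
c(mean(abs(x - mu) >= eps), sigma^2/eps^2)  # about 0.0455 vs the bound 0.25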
Definition 7.1. Types of Convergence for a sequence of r.v.'s. Let $Z, Z_1, Z_2, \ldots$ be a sequence of r.v.'s on a probability space $(\Omega, \mathcal{F}, P)$. We say that the sequence $\{Z_n\}$ converges to the r.v. $Z$

(i) in probability, in notation $Z_n \xrightarrow{P} Z$, if for every $\varepsilon > 0$, $\lim_{n\to\infty}P\{|Z_n - Z| \ge \varepsilon\} = 0$;

(ii) in the mean square (or in the square mean), in notation $Z_n \xrightarrow{L^2} Z$, if

$$\lim_{n\to\infty}E\big[(Z_n - Z)^2\big] = 0.$$

Definition 7.2. Let $X_1, X_2, \ldots$ be a sequence of independent r.v.'s on a probability space $(\Omega, \mathcal{F}, P)$. Denote by $\bar{X}_n$ the sample mean and by $\bar{\mu}_n = \frac{1}{n}\sum_{i=1}^{n}EX_i$ its mean. The sequence $\{X_n\}$ satisfies the weak law of large numbers if $\bar{X}_n - \bar{\mu}_n \to 0$ in probability, and the strong law of large numbers if the convergence is almost sure.

Remark 7.1. Since almost sure convergence of a sequence $\{X_n\}$ implies convergence in probability (we provide no proof of this assertion), it is obvious that the strong law of large numbers is stronger than the weak law of large numbers. Note that in the literature the weak law of large numbers is often referred to as just the law of large numbers.

Theorem 7.3. (Chebyshev's Weak Law of Large Numbers.) In addition, assume that the associated sequence of variances $\sigma_n^2 = \mathrm{Var}\,X_n$ has the property that $\lim_{n\to\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 = 0$. Then the sequence $\{X_n\}$ satisfies the weak law of large numbers.
Proof. By Chebyshev's inequality (7.3) and the Bienaymé equation (6.4),

$$P\{|\bar{X}_n - \bar{\mu}_n| \ge \varepsilon\} \le \frac{\mathrm{Var}\,\bar{X}_n}{\varepsilon^2} = \frac{\sigma_1^2 + \cdots + \sigma_n^2}{n^2\varepsilon^2} \to 0, \quad \text{as } n \to \infty. \qquad (7.4)$$

Remark 7.2. The condition $\lim_{n\to\infty}\frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 = 0$ can be replaced by a stronger but more easily verifiable condition: $\sigma_n^2 \le M$, $n = 1, 2, \ldots$, for some positive real number $M$, i.e., the sequence $\{\sigma_n^2\}$ is bounded.

Theorem 7.4. Kolmogorov's Strong Law of Large Numbers. Under the assumptions of Theorem 7.3, if $\sum_{n=1}^{\infty}\sigma_n^2/n^2 < \infty$, then the sequence $\{X_n\}$ satisfies the strong law of large numbers.

Corollary 7.5. A Special Case of the Weak Law of Large Numbers. Let $X_1, X_2, \ldots \in [X]$ be a sequence of independent and identically distributed r.v.'s with common mean $\mu$ and variance $\sigma^2 < \infty$. Then, from (7.4),

$$P\{|\bar{X}_n - \mu| \ge \varepsilon\} \to 0 \quad \text{as } n \to \infty. \qquad (7.5)$$

Thus, for large $n$, the sample mean $\bar{X}_n$ becomes approximately its mean. (More about this will come in Chapter V.)

Corollary 7.6. A Special Case of the Strong Law of Large Numbers. Under the assumptions of Corollary 7.5 on $\{X_n\}$ and $\sigma^2$, the Kolmogorov conditions are obviously met. Indeed,

$$\sum_{n=1}^{\infty}\sigma_n^2/n^2 = \sigma^2\sum_{n=1}^{\infty}\frac{1}{n^2} = \sigma^2\frac{\pi^2}{6} < \infty.$$

Therefore, such a sequence also obeys the strong law of large numbers.
PROBLEMS

7.1. Let $\{X_n\}$ be a sequence of independent r.v.'s such that $X_n: \Omega \to \{-n, 0, n\}$ with the distribution $\big(\frac{1}{n^2},\ 1-\frac{2}{n^2},\ \frac{1}{n^2}\big)$. Does the sequence obey the strong law of large numbers?

Solution. Since $EX_n = 0$, $\mathrm{Var}\,X_n = \sigma_n^2 = EX_n^2 = 1 + 0 + 1 = 2$, which implies that

$$\sum_{n=1}^{\infty}\sigma_n^2/n^2 = 2\,\frac{\pi^2}{6} = \frac{\pi^2}{3} < \infty.$$

Thus the conditions of Kolmogorov's strong law of large numbers (Theorem 7.4) are met.

7.2. Under the conditions of Problem 7.1, let $\{X_n\}$ be pairwise uncorrelated. Does $\{X_n\}$ satisfy the weak law of large numbers?

7.3. Show, as in Remark 7.2, that if the sequence $\{\sigma_n^2\}$ is bounded, the sequence $\{X_n\}$ of r.v.'s satisfies the weak law of large numbers.

Solution. If $\{\sigma_n^2 = \mathrm{Var}\,X_n\}$ is bounded, i.e., $\sigma_n^2 \le M$, $n = 1, 2, \ldots$, then $\frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2 \le M\frac{1}{n}$, and thus, by Chebyshev's inequality (7.3) and the Bienaymé equation (6.4),

$$P\{|\bar{X}_n - \bar{\mu}_n| \ge \varepsilon\} \le \frac{\mathrm{Var}\,\bar{X}_n}{\varepsilon^2} = \frac{\sigma_1^2+\cdots+\sigma_n^2}{n^2\varepsilon^2} \le \frac{nM}{n^2\varepsilon^2} \to 0, \quad \text{as } n \to \infty. \qquad (P7.2)$$
7.4. Suppose a continuous r.v. $X$ has mean $1$ and standard deviation $\sigma = 0.2$. Using Chebyshev's inequality, estimate the probability of the event $\{0.5 < X \le 1.5\}$.

7.5. It is known that the probability that a Gaussian r.v. lies within plus-or-minus three standard deviations of its mean is $0.997$. Estimate the probability that an arbitrary continuous r.v. $X$ lies within plus-or-minus three standard deviations of its mean.
1. Reliability Measures

Reliability and Hazard Functions. We will focus on various measures representing the lifetimes of mechanical or biological components, or amounts of demands or claims. Let $X$ be a nonnegative r.v. denoting the "time-to-failure" of a component, with PDF $F(t) = P\{X \le t\}$. Define

$$R(t) := 1 - F(t) = P\{X > t\},$$

the reliability function of $X$. For a $t > 0$, the r.v. $X - t$ is referred to as the residual lifetime of a component until its failure, given $\{X > t\}$ (i.e., that it was sustained until time $t$). The conditional distribution function of the residual life is defined as

$$F(x \mid t) = P\{X - t \le x \mid X > t\};$$

that is, given that the component sustained until $t$, it is the probability that the component fails no later than in $x$ units of time. Explicitly, it is

$$F(x \mid t) = \frac{F(t+x) - F(t)}{R(t)}.$$

We denote

$$h(t) = \frac{f(t)}{R(t)}. \qquad (1.4)$$

$h(t)$ can be interpreted as the conditional (instantaneous) failure rate of a component that has otherwise sustained until time $t$. In the reliability literature, $h$ is referred to as the hazard rate function of $X$ (which agrees with the physical notion of a rate).

Remark 1.1. If $h(t)$ exists, then from formula (1.4), $h(t)$ can also be written as

$$h(t) = -\frac{d}{dt}\ln R(t),$$

and thus

$$R(t) = e^{-\int_0^t h(u)\,du}. \qquad (1.7)$$
Mean and Mean Residual Time. Suppose $X \ge 0$ a.s. We obtain another expression for $EX$ in terms of the reliability function:

$$EX = \int_{x=0}^{\infty}x\,dF(x) = \int_{x=0}^{\infty}\int_{u=0}^{x}du\,dF(x) = \int_{u=0}^{\infty}\int_{x=u}^{\infty}dF(x)\,du = \int_{u=0}^{\infty}R(u)\,du.$$

Similarly, for the mean residual time of a component that has sustained until $t$ we obtain

$$\mu(t) = \frac{1}{R(t)}\int_{u=t}^{\infty}R(u)\,du. \qquad (1.11)$$

We will drop the indicator function for now, because in most cases we deal with positively supported r.v.'s. From expression (1.8) for the conditional reliability function, for the exponential r.v. we have

$$R(x \mid t) = \frac{e^{-\lambda(x+t)}}{e^{-\lambda t}} = e^{-\lambda x} = P\{X > x\} = R(x), \qquad (2.1)$$
or also

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda; \qquad (2.3)$$

we see that the exponential r.v. has a constant hazard rate function.

Now suppose $X$ is a r.v. with a constant hazard rate function $\lambda$. From (1.7),

$$R(t) = e^{-\int_0^t \lambda\,du} = e^{-\lambda t},$$

implying that $X$ is exponential. The latter lets us conclude that the exponential r.v. is the only r.v. that has a constant hazard rate.
Now let $X_1, \ldots, X_n \in \mathrm{Exp}(\lambda)$ be a sample of independent r.v.'s. We are interested in the distribution of the minimum value of the sample, $\min\{X_1, \ldots, X_n\}$. The latter is of importance in calculating the distribution of $n$ unreliable machines connected in series (a serial system). Obviously,

$$P\{\min\{X_1,\ldots,X_n\} > x\} = P\Big(\bigcap_{i=1}^{n}\{X_i > x\}\Big) = \prod_{i=1}^{n}P\{X_i > x\} = e^{-\lambda n x}, \qquad (2.4)$$

which shows that $\min\{X_1,\ldots,X_n\}$ is exponential with parameter $\lambda n$. [Likewise, the r.v. $n\min\{X_1,\ldots,X_n\}$ is exponential with parameter $\lambda$.]
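A brief simulation check of (2.4) (an added sketch, not from the notes, with arbitrary parameter values):

lambda <- 0.2; n <- 10
m <- replicate(1e5, min(rexp(n, lambda)))
c(mean(m), 1/(n*lambda))   # mean of the minimum vs 1/(n*lambda)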
Example 2.1. Consider a serial system of 10 identical components working independently. Such a system fails as soon as one item fails. Thus the lifetime of the system is

$$X = \min\{X_1, \ldots, X_{10}\},$$

where $X_i$ stands for the lifetime of the $i$th component. Suppose $X_i \in \mathrm{Exp}(0.2)$. Then, from (2.4), the reliability function of the system is

$$R_X(x) = e^{-0.2\cdot 10\,x} = e^{-2x}.$$
Example 2.2. Unlike a serial system, a parallel system works as long as at least one component does. If we have a parallel system of $n$ components with lifetimes $X_1, \ldots, X_n$ independent of each other, then the lifetime $X$ of such a system can be defined as

$$X := \max\{X_1, \ldots, X_n\}$$

and

$$\{X \le x\} = \bigcap_{i=1}^{n}\{X_i \le x\}, \qquad (2.5)$$

implying, for iid exponential components, that

$$P\{X \le x\} = \prod_{i=1}^{n}P\{X_i \le x\} = \left(1-e^{-\lambda x}\right)^{n}.$$

Furthermore,

$$R_X(x) = 1 - \left(1-e^{-\lambda x}\right)^{n}.$$

In particular, for $n = 2$,

$$h_X(x) = \frac{f_X(x)}{R_X(x)} = \frac{2\lambda\left(1-e^{-\lambda x}\right)e^{-\lambda x}}{1-\left(1-e^{-\lambda x}\right)^2} = \frac{2\lambda\left(1-e^{-\lambda x}\right)}{2-e^{-\lambda x}}.$$
Remark 2.1. For the applications, suppose $X$ and $Y$ are two independent, positively supported r.v.'s and we need to find the distribution $P\{X+Y \le t\}$ of their sum. We proceed as before:

$$P\{X+Y \le t\} = \iint_{\{x+y\le t\}}f_X(x)f_Y(y)\,d(x,y), \quad\text{or}\quad F_{X+Y}(t) = \int F_X(t-y)\,f_Y(y)\,dy. \qquad (2.6)$$
Example 2.3. (Cold-Redundant System.) Consider a system of two units in which the second one is in cold standby: it is put into operation as soon as the first unit breaks down. If $X_i$ is the lifetime of the $i$th unit, $X_i \in \mathrm{Exp}(\lambda_i)$, and $X_1$ and $X_2$ are independent, the lifetime of the system is $X_1 + X_2$.

Solution. Obviously, $R(t) = 1 - F_{X_1+X_2}(t)$. Then, using (2.6),

$$R(t) = e^{-\lambda_2 t} + \frac{\lambda_2}{\lambda_2-\lambda_1}\,e^{-\lambda_2 t}\left(e^{(\lambda_2-\lambda_1)t}-1\right) = \frac{\lambda_2}{\lambda_2-\lambda_1}e^{-\lambda_1 t} - \frac{\lambda_1}{\lambda_2-\lambda_1}e^{-\lambda_2 t}. \qquad (2.7)$$
Example 2.4. In this example we evaluate the reliability of a four-engine jet. The jet continues to fly as long as at least one engine on each wing keeps functioning; the jet crashes if both engines fail on either one of the wings.

A  C
B  D

Suppose all four engines work independently with exponential lifetimes, so that if $X_i$ is the lifetime of engine $i$, then $X_i \in \mathrm{Exp}(\lambda)$. Each subsystem (the left and the right wing) is a parallel system of two components, so the reliability of one wing is

$$R_w(x) = 1 - \left(1-e^{-\lambda x}\right)^2.$$

Now, the system fails if at least one of the wings fails, which agrees with the way a serial system operates. Thus, if $W_l$ and $W_r$ denote the lifetimes of the two wings, the system sustains working (i.e., the jet continues flying) beyond $x$ in accordance with the event

$$\{W_l > x\}\cap\{W_r > x\}, \quad\text{so that}\quad R(x) = \left[1-\left(1-e^{-\lambda x}\right)^2\right]^2.$$
Remark 2.2. In general, the main principle of modeling reliability block diagrams with arbitrary distributions of the components' lifetimes works as follows. Suppose we have a serial system of independent components with lifetimes $X_1, \ldots, X_n$. Recall that

$$\big\{X := \min\{X_1,\ldots,X_n\} > x\big\} = \bigcap_{i=1}^{n}\{X_i > x\}, \qquad (2.8)$$

and thus

$$R_X(x) = \prod_{i=1}^{n}R_{X_i}(x). \qquad (2.9)$$

For a parallel system,

$$\big\{Y := \max\{X_1,\ldots,X_n\} \le x\big\} = \bigcap_{i=1}^{n}\{X_i \le x\}, \qquad (2.10)$$

implying that

$$R_Y(x) = 1 - \prod_{i=1}^{n}F_{X_i}(x).$$
2. Weibull and Rayleigh Distributions. A r.v. $X$ is said to have a Weibull distribution with parameters $\mu$ and $\alpha$ if its pdf is

$$f(t) = e^{-(\mu t)^{\alpha}}\,\alpha\mu^{\alpha}t^{\alpha-1}\,\mathbf{1}_{[0,\infty)}(t). \qquad (2.15)$$

In the sequel we drop the indicator function. For $\alpha = 1$ the Weibull distribution reduces to the exponential distribution, while for $\alpha = 2$ the special case is referred to as the Rayleigh distribution. That is,

$$f(t) = 2\mu^2 t\,e^{-(\mu t)^2}, \qquad R(t) = e^{-(\mu t)^2}.$$

In the reliability literature, $2\mu^2$ is commonly replaced with $\lambda$, to have the Rayleigh density and reliability in the form

$$f(t) = \lambda t\,e^{-\frac{1}{2}\lambda t^2}, \qquad R(t) = e^{-\frac{1}{2}\lambda t^2}, \qquad (2.16)$$

under the new meaning of $2\mu^2$ as $\lambda$. We will say that the underlying Rayleigh r.v. $X \in \mathrm{Ray}(\lambda)$. The corresponding hazard rate is

$$h(t) = \frac{f(t)}{R(t)} = \lambda t,$$

a linear function. Therefore, it follows that a probability distribution is Rayleigh if and only if its hazard rate function is linear.
Remark 2.3. Very often in reliability one characterizes a distribution by its hazard rate alone. We saw that if the hazard rate is constant, the associated distribution is exponential, and if the hazard rate is linear, the associated distribution is Rayleigh. If a hazard rate is affine, $h(t) = a + bt$ (with $a, b \ge 0$ and $a+b > 0$), then it is easy to see that

$$R(t) = e^{-(at + \frac{1}{2}bt^2)}.$$

Theorem 2.1. For the Weibull distribution with parameters $\mu$ and $\alpha$,

$$\mathrm{Var}\,X = \mu^{-2}\left[\Gamma\!\left(1+\tfrac{2}{\alpha}\right) - \Gamma^2\!\left(1+\tfrac{1}{\alpha}\right)\right], \qquad (2.20)$$

and, in general,

$$EX^n = \mu^{-n}\,\Gamma\!\left(1+\tfrac{n}{\alpha}\right). \qquad (2.21)$$

The Weibull distribution has been used to describe fatigue failure, vacuum-tube failure, and ball-bearing failure. It is the most popular parametric family of failure distributions in the reliability of electronic and mechanical systems.
With the results of Theorem 2.1, we can get the corresponding parameters for the Rayleigh distribution, expressed in the more common notation of (2.16):

$$EX = \sqrt{\frac{\pi}{2\lambda}}, \qquad (2.23)$$

$$EX^n = \frac{2^{n/2}}{\lambda^{n/2}}\,\Gamma\!\left(1+\tfrac{n}{2}\right) = \frac{2^{n/2}}{\lambda^{n/2}}\cdot\begin{cases} k!, & n = 2k,\ k = 1, 2, \ldots\\[4pt] \dfrac{(2k+1)!}{2\cdot 4^{k}\,k!}\,\sqrt{\pi}, & n = 2k+1,\ k = 0, 1, \ldots \end{cases} \qquad (2.25)$$

In particular,

$$EX^2 = \frac{2}{\lambda}, \qquad (2.26)$$

$$EX^4 = \frac{8}{\lambda^2}, \qquad (2.27)$$

$$\mathrm{Var}\,X^2 = \frac{4}{\lambda^2}. \qquad (2.28)$$
Example 2.5. Consider a parallel system of two independent components. What is the probability that component 1 fails before component 2 if the lifetime $X_i$ of component $i$ is Rayleigh with parameter $\lambda_i$?

$$P\{X_1 < X_2\} = \int_{x=0}^{\infty}P\{X_2 > x\}\,f_1(x)\,dx = \int_{x=0}^{\infty}\lambda_1 x\,e^{-\frac{1}{2}\lambda_1 x^2}\,e^{-\frac{1}{2}\lambda_2 x^2}\,dx$$
$$= \frac{\lambda_1}{\lambda_1+\lambda_2}\int_{x=0}^{\infty}(\lambda_1+\lambda_2)\,x\,e^{-\frac{1}{2}(\lambda_1+\lambda_2)x^2}\,dx. \qquad (2.31)$$

Now, since $(\lambda_1+\lambda_2)\,x\,e^{-\frac{1}{2}(\lambda_1+\lambda_2)x^2}$ is the pdf of a Rayleigh r.v. with parameter $\lambda_1+\lambda_2$, the integral

$$\int_{x=0}^{\infty}(\lambda_1+\lambda_2)\,x\,e^{-\frac{1}{2}(\lambda_1+\lambda_2)x^2}\,dx = 1.$$

Thus,

$$P\{X_2 > X_1\} = \frac{\lambda_1}{\lambda_1+\lambda_2}.$$
PROBLEMS

2.1. For an $X \in \mathrm{Exp}(\lambda)$, find the mean residual time from (1.11) and show that it is constant.

Solution.

$$\mu(t) = \frac{1}{R(t)}\int_{u=t}^{\infty}R(u)\,du = e^{\lambda t}\cdot\frac{1}{\lambda}e^{-\lambda t} = \frac{1}{\lambda}.$$

2.2. Show that if $X$ is a r.v. such that its mean residual time is constant, then $X$ is exponential.

Solution. Let

$$\frac{1}{\lambda} = \mu(t) = \frac{1}{R(t)}\int_{u=t}^{\infty}R(u)\,du.$$

Then $\int_{t}^{\infty}R(u)\,du = \frac{1}{\lambda}R(t)$; differentiating both sides in $t$ gives $-R(t) = \frac{1}{\lambda}R'(t)$, whence

$$R(t) = e^{-\lambda t}.$$

2.3. Consider a three-unit serial reliability system consisting of three independent components with lifetimes $X_1, X_2, X_3$ such that $X_i \in \mathrm{Exp}(\lambda_i)$, $i = 1, 2, 3$. Find the reliability function $R_X(x)$ of the system, where $X := \min\{X_1, X_2, X_3\}$. Also give the hazard rate and the mean residual time of the system.

2.4. Find the mean residual time of a parallel system of two components with independent and identically distributed lifetimes, with the common exponential distribution with parameter $\lambda$.

2.5. Under the condition of Example 2.3, find $R(t)$ when $\lambda_1 = \lambda_2 = \lambda$.

2.6. In the context of Example 2.4 about the reliability of a jet, find the hazard rate of the system.

2.7. Under the conditions of Example 2.4, suppose that $X_i \in \mathrm{Exp}(\lambda_i)$, where $i = 1, 2, 3, 4$ corresponds to engines A, B, C, D. Assuming that $\lambda_1 = \lambda_4$ and $\lambda_2 = \lambda_3$, find the reliability function of the system.
(For the Weibull moments, substituting $u = (\mu x)^{\alpha}$ in $EX^n = \int_0^{\infty}x^n f(x)\,dx$ gives $EX^n = \mu^{-n}\Gamma\!\left(1+\tfrac{n}{\alpha}\right)$, i.e., formula (2.21).)

2.12. Consider a parallel system of two independent components. What is the probability that component 1 fails before component 2 if the lifetime $X_i$ of component $i$ is exponential with parameter $\lambda_i$?
3. The k-out-of-n System. Assuming that all units have identical and independent life distributions and that the probability that a unit is functioning is $p$, recall that the probability that exactly $k$ out of a total of $n$ units function is

$$\binom{n}{k}p^{k}(1-p)^{n-k}.$$

In a $k$-out-of-$n$ system, at least $k$ of the units have to function, and that probability is

$$R_S = \sum_{i=k}^{n}\binom{n}{i}p^{i}(1-p)^{n-i}. \qquad (3.6)$$

In a typical $k$-out-of-$n$ system with components having equal constant hazard rates $\lambda$, $p$ stands for the reliability of one component, and it is $p = e^{-\lambda t}$. With a linearly increasing hazard rate ($h(t) = \lambda t$), the corresponding system reliability is the same sum with $p = e^{-\frac{1}{2}\lambda t^2}$, the Rayleigh reliability function. As in Chapter I, Section 5 (Example 5.3), if we need to compute the reliability function of the system, we can use the R command pbinom.

Example 3.1. Consider a parallel 2-out-of-3 system with components that exhibit constant hazard rates with parameter $\lambda$. What is the reliability of the system? If $\lambda = 3\cdot 10^{-5}$ failures per hour, what is the reliability at time $t = 1000$ hours?

Solution. By (3.6) with $k = 2$ and $n = 3$,

$$R_S = \binom{3}{2}p^2(1-p) + \binom{3}{3}p^3 = 3p^2 - 2p^3, \quad p = e^{-\lambda t} = e^{-0.03},$$

so that

$$R_S(t) = 0.9974.$$
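The same number can be reproduced in R either from the closed form or from the binomial tail (an added illustrative check):

lambda <- 3e-5; t <- 1000
p <- exp(-lambda*t)
c(3*p^2 - 2*p^3, 1 - pbinom(1, 3, p))   # both give 0.9974...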
Example 3.2. Consider a $k$-out-of-$n$ parallel system having a constant hazard rate $\lambda$ for each component. Find the minimum number $N \ge k$ of components so that a minimum reliability $R^*$ is reached for some fixed $t$. Let $k = 2$, $t = 100$ hours, and $\lambda = 0.01$ failures per hour. (a) $R^* = 0.75$; (b) $R^* = 0.5$.

With $k = 2$, $t = 100$, and $\lambda = 0.01$, using (3.6) the target inequality reads

$$1 - \sum_{i=0}^{1}\binom{N}{i}p^{i}(1-p)^{N-i} \ge R^*, \quad p = e^{-\lambda t} = e^{-1}.$$

To calculate $N$ we can use the R program introduced in Chapter I, Section 5, Example 5.3. Here we identify $p = e^{-1}$. Thus, from formula (3.6), we have
p<-exp(-1)
N<-1;
R<-0;
while (R<0.75) {
N=N+1
R<-1-pbinom(1,N,p);
}
print(R);
print(N)
and we get
> print(R);
[1] 0.7953857
> print(N)
[1] 7
Alternatively, denote

$$g_n = \sum_{i=0}^{1}\binom{n}{i}p^{i}(1-p)^{n-i} = (1-p)^n + np(1-p)^{n-1}, \quad p = e^{-1},$$

so that the target inequality reads $g_n \le 1-R^*$. First off, notice that with $x := n-1$, $r := \ln(1-R^*)$, and $a := 1-e^{-1}$, and since $\ln$ is a monotone increasing function, the dominating term of $\ln g_n$ is $n\ln a$; since $a < 1$, $\ln a < 0$, implying that $\ln g_n$ is monotone decreasing in $n$, and so is $g_n$. Consequently, we need to find the first $n$ at which $g_n \le 1-R^*$. So we have

$$n = 2 \Rightarrow g_2 = 0.8647,\quad n = 3 \Rightarrow g_3 = 0.69,\quad n = 4 \Rightarrow g_4 = 0.5313,\ \ldots,\ n = 7 \Rightarrow g_7 = 0.2046.$$

For $R^* = 0.75 \Rightarrow 1-R^* = 0.25 \Rightarrow g_6 > 0.25$ but $g_7 < 0.25$; thus $N = 7$. For $R^* = 0.5$, the program below yields $N = 5$:
p<-exp(-1)
N<-2;
R<-0;
while (R<0.5) {
N=N+1
R<-1-pbinom(1,N,p);
}
print(R);
print(N)
> print(R);
[1] 0.6053943
> print(N)
[1] 5
Example 3.3. Under the condition of Example 3.2, consider a $k$-out-of-$n$ parallel system having a constant hazard rate $\lambda$ for each component. Find the minimum number $N \ge k$ of components so that a minimum reliability $R^*$ is reached for some fixed $t$. Let $k = 25$, $t = 100$ hours, $\lambda = 0.005$ failures per hour, and $R^* = 0.95$.

To calculate $N$ we can use the R program of Example 5.3, Chapter I, Section 5, with $p = e^{-\lambda t} = e^{-0.5}$. Thus we have
p<-exp(-0.5);
N<-24;
R<-0;
while (R<0.95) {
N=N+1
R<-1-pbinom(24,N,p);
}
print(R);
print(N)
and we get
> print(R);
[1] 0.9528804
> print(N)
[1] 50
PROBLEMS

3.1. Consider a parallel 2-out-of-4 system with components that exhibit linearly increasing hazard rates with parameter $\lambda$. What is the reliability of the system? If $\lambda = 2\cdot 10^{-3}$ failures per hour, what is the reliability at time $t = 100$ hours?

3.2. Under the condition of Example 3.1, let $k = 2$, $t = 100$, and $\lambda = 0.005$ failures per hour. Find the minimum number $N \ge 2$ of components so that a minimum reliability $R^*$ is reached, with $R^* = 0.75$ and $R^* = 0.5$.

Solution. Here $p = e^{-0.5}$, and, as in the example, we need to find the first $n$ at which $g_n \le 1-R^*$. So we have

$$n = 2 \Rightarrow g_2 = 0.632,\quad n = 3 \Rightarrow g_3 = 0.3425,\quad n = 4 \Rightarrow g_4 = 0.1718.$$

For $R^* = 0.75 \Rightarrow 1-R^* = 0.25 \Rightarrow g_3 > 0.25$ but $g_4 < 0.25$. Thus $N = 4$ is the minimum number of components needed to sustain a reliability of at least $0.75$ within 100 hours. With $R^* = 0.5 \Rightarrow 1-R^* = 0.5 \Rightarrow g_2 > 0.5$ but $g_3 < 0.5$; thus $N = 3$ is the minimum number of components needed to sustain a system reliability of at least $0.5$ within 100 hours.

3.3. Under the condition of Example 3.2, consider a $k$-out-of-$n$ parallel system having a constant hazard rate $\lambda$ for each component. Find the minimum number $N \ge 45$ of components so that a minimum reliability $R^*$ is reached, for $t = 100$, $\lambda = 0.003$ failures per hour, and $R^* = 0.88$.

3.4. Under the same conditions, find the minimum number of components so that a minimum reliability $R^*$ is reached for some fixed $t$. Let $k = 45$, $t = 100$ hours, $\lambda = 0.0004$ failures per hour, and $R^* = 0.92$.
CHAPTER V. ESTIMATION

Let $[X]$ be some population from which we draw a sample $X_1, \ldots, X_n$. ($[X]$ is the equivalence class of all r.v.'s sharing the same distribution with $X$.) In most applications we assume that $X_1, \ldots, X_n$ are independent. A function $\delta$ of the sample $\mathbf{T}_n = (X_1, \ldots, X_n)$, $\delta(\mathbf{T}_n)$, known as a statistic, is called an estimator of an unknown parameter $\theta$, also denoted $\hat{\theta}_n$. Correspondingly, $\delta(\mathbf{t}_n)$ (where $\mathbf{t}_n = (x_1, \ldots, x_n)$ are the observed values of the sample) is called an estimate of $\theta$ and denoted by the lowercase $\hat{\vartheta}_n$.

How do we choose an estimator or estimate of an unknown parameter $\theta$? There are various methods of choosing $\delta$. For example,

$$\delta(\mathbf{T}_n) = \bar{X}_n = \tfrac{1}{n}(X_1+\cdots+X_n),$$

known as the sample mean. This is a common estimator of a parameter $\theta$ that is the mean of an underlying r.v., and such an estimator makes perfect sense. Another example of an estimator is the sample variance, introduced below.

One can propose a function $\delta$ of the sample to serve as an estimator of $\theta$, but how reasonable can it be, i.e., how well does the estimator estimate $\theta$? There are several common "credibility criteria" for estimators. One of them is known as the maximum likelihood estimator (MLE for short). Roughly speaking, we take the joint density function $f_n(x_1, \ldots, x_n)$ of an $n$-sample and find $\delta(x_1, \ldots, x_n)$ that maximizes $f_n$. If such a function $\delta$ exists, then $\delta(\mathbf{t}_n)$ is referred to as a maximum likelihood estimate (m.l.e.) of the sample, and the associated version $\hat{\theta}_n = \delta(\mathbf{T}_n)$ is then an MLE. Such an estimator seems quite credible.

There are some additional goodness criteria for an estimator, such as the property that $E\hat{\theta}_n = \theta$; such an estimator is called unbiased. Another lucrative (and seemingly most valuable) property of an estimator is consistency. The latter means that $\hat{\theta}_n \to \theta$ as $n \to \infty$ in some sense, and thus $\hat{\theta}_n$ approximates $\theta$ well, even though we often cannot afford to collect a large sample.

However, we need to distinguish the properties of an estimator $\hat{\theta}_n$ from a method of obtaining $\hat{\theta}_n$. The method we study in this section is called the maximum likelihood method. This technique is common for many discrete and continuous r.v.'s alike, and, as mentioned above, it is based on the maximization of the sample pdf (known as the likelihood function). Another method of obtaining $\delta$ or $\hat{\theta}_n$ is to assume that the parameter $\theta$ we are interested in is a r.v. whose pdf $f$ (called the prior) we pretend to know. The knowledge of such a prior pdf can be arbitrarily crude, but $f$ can then be calibrated and improved upon taking the values of a sample, yielding a new pdf, called the posterior, via Bayes principles. The posterior pdf can yield the conditional mean of an associated r.v. (that owns the posterior pdf), called the Bayes estimator of $\theta$. (It can be shown that every Bayes estimator is unbiased.) Such a method is called Bayes analysis, which we will study in forthcoming sections.
Definition 1.1. Let $\mathbf{T}_n = (X_1, \ldots, X_n)$ be a random sample from the equivalence class $[X]$ of r.v.'s, with the joint density

$$f_n(\mathbf{x}_n \mid \theta) = \prod_{k=1}^{n}f(x_k \mid \theta),$$

called the likelihood function of the sample. This function is regarded as a function of $\theta$, with $x_1, \ldots, x_n$ being "fixed values."

In the situations below we develop very common techniques for obtaining an MLE of $\theta$ for different distributions.

Example 1.1. Sampling from a Bernoulli population. Assume that we need to test a proportion of defective items with no prior data. Such problems arise in reliability analysis, quality control, exit polls, biotechnology, and pharmaceutics, to name a few.

Draw a sample $\mathbf{T}_n = (X_1, \ldots, X_n)$ from a Bernoulli population with an unknown $\theta \in (0,1)$. [Note that a Bernoulli r.v. was previously parametrized by $p \in (0,1)$; now we change the character $p$ to the character $\theta$.] We can write the density $f(x\mid\theta)$ of each r.v. $X_k$ as

$$f(x\mid\theta) = \begin{cases} \theta^{x}(1-\theta)^{1-x}\,\mathbf{1}_{(0,1)}(\theta), & x = 0, 1\\ 0, & \text{otherwise}. \end{cases} \qquad (1.2)$$

Thus the likelihood function is

$$f_n(\mathbf{x}_n\mid\theta) = \theta^{k}(1-\theta)^{n-k}, \qquad (1.3)$$

where

$$k := x_1 + \cdots + x_n. \qquad (1.4)$$
The value of $\hat{\vartheta}_n$ that maximizes the likelihood function is the same value that maximizes the log of $f_n(\mathbf{x}_n\mid\theta)$, since the log function is strictly monotone. [This can be rigorously proved for a composition $f \circ g$ of any monotone function $f$ and any continuous function $g$.]

So, let

$$L(\theta) = \ln f_n(\mathbf{x}_n\mid\theta) = n\left[\bar{x}_n\ln\theta + (1-\bar{x}_n)\ln(1-\theta)\right].$$

Then

$$L'(\theta) = n\left[\bar{x}_n\frac{1}{\theta} - (1-\bar{x}_n)\frac{1}{1-\theta}\right] = n\,\frac{\bar{x}_n(1-\theta) - (1-\bar{x}_n)\theta}{\theta(1-\theta)} = n\,\frac{\bar{x}_n-\theta}{\theta(1-\theta)}, \qquad (1.8)$$

where $\bar{x}_n$ is the sample mean of $x_1, \ldots, x_n$.

It is easily seen that $L'(\theta)$ changes its sign from positive to negative when passing through $\bar{x}_n$, which of course is a value from $[0,1]$. Thus, the empirical sample mean $\bar{x}_n$ is the m.l.e. of $\theta$. Notice that a more elaborate analysis, in which we distinguish the cases $\bar{x}_n = 0$ and $\bar{x}_n = 1$ from all other values of $\bar{x}_n$ in $(0,1)$, yields the same result. Thus,

$$\hat{\theta}_n = \bar{X}_n. \qquad (1.9)$$
Example 1.2. To estimate the probability $\theta$ that a timber joist delivered to a construction site from a particular source is below specification, an engineer randomly selects 100 joists and inspects them. It turns out that 5 of them are below specification. Therefore, $5/100 = 0.05$ is the m.l.e. of $\theta$.
(i) As we know from (6.11), Chapter II, $E\bar{X}_n = EX_1$, i.e., the expectation of the sample mean equals the mean of the population; thus $\hat{\theta}_n = \bar{X}_n$ is unbiased.

(ii) In our case, since $\sigma^2 = \theta(1-\theta) \le 1$, the variance of $\bar{X}_n$ becomes smaller and smaller as $n$ gets large. Eventually, $\mathrm{Var}\,\bar{X}_n \to 0$ as $n \to \infty$. Thus,

$$\mathrm{Var}\,\bar{X}_n = E\big(\bar{X}_n - \mu\big)^2 = \big\|\bar{X}_n - \mu\big\|_{L^2}^2 \to 0,$$

which means that $\bar{X}_n \to \mu$ in the so-called $L^2$-norm. Since $E\hat{\theta}_n = \theta$, i.e., $\hat{\theta}_n$ is unbiased, in our case $\mu = \theta$. The latter implies that $\hat{\theta}_n \to \theta$ under the norm. This is also known as mean square convergence.

Convergence of $\hat{\theta}_n$ to $\theta$ in probability is called consistency. To tell (1.11) apart from the stronger form of convergence of $\hat{\theta}_n$ in the $L^2$-norm, we refer to the former as consistency in probability and to the latter as consistency in the square mean (or mean square consistency). Therefore, the estimator $\bar{X}_n$ is consistent (in probability and in the square mean).

(iii) In the general case, an estimator $\hat{\theta}_n$ of a parameter $\theta$ is called consistent in the square mean if

$$\big\|\hat{\theta}_n - \theta\big\|_{L^2} \to 0 \quad \text{as } n \to \infty. \qquad (1.12)$$

The convergence is in the sense of the $L^2$-norm, which is the most common norm in probability theory. Now, if $\hat{\theta}_n$ is also an unbiased estimator, then $\|\hat{\theta}_n - \theta\|_{L^2}^2$ can be written as $\|\hat{\theta}_n - E\hat{\theta}_n\|_{L^2}^2$, which is the variance of $\hat{\theta}_n$. Therefore, if $\hat{\theta}_n$ is an unbiased estimator of $\theta$, it is consistent in the square mean if and only if $\mathrm{Var}\,\hat{\theta}_n \to 0$ as $n \to \infty$. The latter is often easier to verify than other forms of convergence.

Thus, if $\hat{\theta}_n$ is unbiased and consistent in the square mean, from Chebyshev's inequality it also follows that $\hat{\theta}_n$ converges to $\theta$ in probability, which is yet another form of consistency of $\hat{\theta}_n$.
Example 1.3. Sampling from a Gaussian population. Here both $\mu$ and $\sigma^2$ are unknown. The likelihood function is

$$f_n(\mathbf{x}\mid\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right). \qquad (1.13)$$

To maximize $f_n$, we maximize $\ln f_n$ (in notation, $L(\mu,\sigma^2)$), since the logarithm is a monotone increasing function. Now,

$$L(\mu,\sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2. \qquad (1.14)$$

We calculate

$$\frac{\partial}{\partial\mu}L = L_{\mu} \quad\text{and}\quad \frac{\partial}{\partial\sigma^2}L = L_{\sigma^2}, \quad \text{and set them equal to } 0. \qquad (1.15)$$

From the first equation,

$$L_{\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = \frac{1}{\sigma^2}\,n(\bar{x}_n-\mu) = 0 \iff \mu = \bar{x}_n. \qquad (1.16)$$

We denote $\hat{m}_n := \bar{x}_n$. From the second equation of (1.15), taking $\bar{x}_n$ for $\mu$, we have

$$L_{\sigma^2} = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i-\bar{x}_n)^2 = 0,$$

which yields

$$\hat{s}_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x}_n)^2, \qquad (1.17)$$

called the sample variance estimate of $\sigma^2$. To verify that the critical point

$$\hat{\vartheta}_n = (\hat{m}_n, \hat{s}_n^2) \qquad (1.18)$$

is a maximum, we check the second-order partial derivatives of $L$ at $\theta = \hat{\vartheta}_n$. We get

$$L_{\mu\mu}(\hat{\vartheta}_n) = -\frac{n}{\hat{s}_n^2} < 0, \qquad (1.20)$$

then

$$L_{\sigma^2\sigma^2}(\hat{\vartheta}_n) = -\frac{n}{2(\hat{s}_n^2)^2} < 0, \qquad (1.21)$$

and finally

$$L_{\mu\sigma^2}(\hat{\vartheta}_n) = -\frac{1}{(\hat{s}_n^2)^2}\left(\sum_{i=1}^{n}x_i - n\bar{x}_n\right) = 0. \qquad (1.22)$$
^ 8 . In addition, I .
^ 8 œ . (the latter is true in general). The estimator
Remark 1.2 As per _ the last situation, the MLE of the unknown mean . of a normal population is
the sample mean \ 8 œ . _
^ 8 of . was called (Remark 1.1) unbiased. Now, .
. ^ 8 œ \ 8 is a function, say $ , of the sample
\" ß ÞÞÞß \8 , i.e.
^ 8 œ $ \" ß ÞÞÞß \8 .
. (1.23)
Suppose X₁,…,X_n is a random sample from a population with unknown variance σ² (not
necessarily Gaussian). Due to Example 1.3, the statistic

σ̂_n² = (1/n) Σ_{i=1}^n (X_i − X̄_n)² (1.25)

is a natural candidate for an estimator of σ². Is it unbiased? Writing

Σ_{i=1}^n (X_i − X̄_n)² = Σ_{i=1}^n (X_i − μ + μ − X̄_n)² = Σ_{i=1}^n (X_i − μ)² − n(X̄_n − μ)²

and taking expectations (with E(X_i − μ)² = σ² and E(X̄_n − μ)² = σ²/n), we find

E σ̂_n² = σ² − σ²/n = ((n − 1)/n) σ² ≠ σ². (1.28)

Thus σ̂_n² is biased, while the corrected statistic

K_n := (n/(n − 1)) σ̂_n² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)² (1.29)

is unbiased. Note that no assumption about the nature of the population has been made.
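A quick R experiment (with illustrative values n = 5 and σ² = 4, not tied to any example above)
makes the bias in (1.28) visible and shows that the correction (1.29) removes it:

# average of the (1/n)-estimator vs. the corrected (1/(n-1))-estimator
n <- 5; sigma2 <- 4
s.biased <- replicate(20000, {x <- rnorm(n, 0, sqrt(sigma2)); mean((x - mean(x))^2)})
mean(s.biased)               # close to ((n-1)/n) sigma^2 = 3.2, cf. (1.28)
mean(s.biased * n/(n - 1))   # close to sigma^2 = 4, cf. (1.29)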
Example 1.4. Sampling from a Poisson population. We lay out the MLE method for the
following common situation. The number of connections to a wrong phone number is often
modeled by a Poisson distribution. Suppose we need to estimate the parameter λ of that
distribution (the mean number of wrong connections) by observing a sample x₁,…,x_n of wrong
connections on n different days. Assuming that k = x₁ + … + x_n > 0, find the m.l.e. and MLE
of λ.

Solution. If f(x|λ) = e^{−λ} λ^x / x!, then the likelihood function is

f_n(x_n|λ) = e^{−nλ} λ^k / ∏_{j=1}^n x_j!,

and

(∂/∂λ) L(λ) = −n + k/λ =: 0 ⟺ λ = k/n = x̄_n. (1.30)

Now, k/n is a critical point of L. But L′(λ) = (n/λ)(x̄_n − λ), showing that L′ is positive for
λ < x̄_n, equal to zero when λ = x̄_n, and negative when λ > x̄_n. It proves that x̄_n is a local and
hence the global maximum point of L and therefore of f_n(x_n|λ). Thus, λ̂ = x̄_n is the m.l.e. of λ
and Λ̂ = X̄_n is the MLE of λ.
Example 1.5. Atmospheric dust particles around construction sites cause a serious
environmental problem. It is assumed that the number of particles in a unit volume of air is
Poisson with a parameter λ that needs to be estimated. To estimate it, small randomly selected
portions of the air were observed by focusing a powerful microscope on the particles and making
counts. Suppose 50 such samples of air were collected, giving a total of 2500 particles. From
(1.30), we have 2500/50 = 50 as the m.l.e. of the unknown λ. In conclusion, the number of
particles around the construction site is modeled as Poisson with parameter λ = 50 per unit
volume of air.
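A short R sketch of this estimate (the counts are simulated here, since only the total 2500 is
reported above):

# simulate 50 unit-volume particle counts and apply (1.30)
x <- rpois(50, lambda = 50)   # stand-in for the 50 observed air samples
lambda.hat <- mean(x)         # m.l.e. of lambda; the observed data gave 2500/50 = 50
lambda.hat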
Example 1.6. In the context of the above situation, let X₁,…,X_n ∈ [X], where X is an
exponential r.v. with parameter λ unknown. Find the m.l.e. and MLE of λ. Is this MLE unbiased?

Solution. f(x|λ) = λe^{−λx}, x ≥ 0, and f(x|λ) = 0 if x < 0. (The value of f(x|λ) for x < 0 will
be ignored for convenience.) Thus the likelihood function of the sample is f_n(x_n|λ) = λⁿ e^{−λn x̄_n},
with log-likelihood L(λ) = n ln λ − λ n x̄_n and

L′(λ) = n/λ − n x̄_n = −(n x̄_n/λ)(λ − 1/x̄_n). (1.33)

It follows that

L′(λ) is  > 0 for λ < 1/x̄_n,  = 0 at λ = 1/x̄_n,  < 0 for λ > 1/x̄_n, (1.34)

proving that λ = 1/x̄_n is the global maximum of the likelihood function on the parameter set
Θ = (0, ∞). Hence, 1/x̄_n is the m.l.e. of λ for the exponential distribution.

Is 1/X̄_n an unbiased estimator of λ? Formally, it is not, because E(1/X̄_n) ≠ λ. However, since
1/λ̂ = x̄_n and 1/λ is the mean of the exponential r.v., E X̄_n = 1/λ, and thus X̄_n is an unbiased
estimator for 1/λ. Furthermore, X̄_n is consistent, as any sample mean is; that is, X̄_n → 1/λ
as n → ∞.
Example 1.7. The traffic engineering problem revisited. Suppose there are 30 observed intervals
between passings of vehicular traffic, giving a mean time interval of 0.55 minute. In other
words, x̄₃₀ = 0.55 is the observed sample mean, and thus λ̂ = 1/0.55 = 1.82.

Note that the exponential distribution has wide applications in many areas of science and
reliability: the times between earthquakes fit an exponential distribution, and so does the time
to failure of various components.
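In R, the same computation may be sketched as follows (the 30 interarrival times are simulated
stand-ins, since only their mean 0.55 is reported above):

# exponential m.l.e. from simulated interarrival times
t <- rexp(30, rate = 1/0.55)   # hypothetical gaps with mean 0.55 minute
lambda.hat <- 1/mean(t)        # cf. Example 1.6; the observed mean 0.55 gives 1.82
lambda.hat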
Example 1.8. Sampling from a Rayleigh population. Suppose we need to estimate the parameter λ
in the Rayleigh density given by

f(x) = λx e^{−(1/2)λx²}, x ≥ 0. (1.35)

The log-likelihood is L(λ) = n ln λ + Σ_{i=1}^n ln x_i − (λ/2) Σ_{i=1}^n x_i². Its derivative is

L′(λ) = n/λ − (1/2) Σ_{i=1}^n x_i². (1.37)

It follows that

L′(λ) is  > 0 for λ < 2n/Σ_{i=1}^n x_i²,  = 0 at λ = 2n/Σ_{i=1}^n x_i²,  < 0 for λ > 2n/Σ_{i=1}^n x_i², (1.39)

proving that λ = 2n/Σ_{i=1}^n x_i² is the global maximum of the likelihood function on the
parameter set Θ = (0, ∞). Hence it is the m.l.e. of λ for the Rayleigh distribution. In summary,

λ̂ = 2n / Σ_{i=1}^n x_i² (1.40)

is the m.l.e. and

Λ̂ = 2n / Σ_{i=1}^n X_i² (1.41)

is the MLE of λ. As in the case of the exponential distribution, we will work with its reciprocal

Λ̂^{−1} = (1/(2n)) Σ_{i=1}^n X_i². (1.42)

Since E X_i² = 2/λ for the Rayleigh density,

E Λ̂^{−1} = (1/(2n)) Σ_{i=1}^n E X_i² = (1/(2n)) · n · (2/λ) = 1/λ, (1.43)

so Λ̂^{−1} is unbiased for 1/λ.
Since X₁²,…,X_n² are independent r.v.'s, by Bienaymé's equation (6.4), Chapter IV, and then
equation (2.28), Chapter IV,

Var Λ̂^{−1} = (1/(4n²)) Σ_{i=1}^n Var X_i² = (1/(4n²)) · n · (4/λ²) = (1/λ²)(1/n) → 0.
Even though Λ̂^{−1} is an unbiased and consistent estimator for 1/λ, we can use Λ̂ as an MLE
estimator for λ, although we cannot claim that it is unbiased and consistent for λ. Furthermore,
we can apply an observed value λ̂_n of Λ̂ to the mean and variance of X, calling them
"estimates" of the mean and variance, respectively. Recall that according to formulas
(2.23–2.24), Chapter IV,

μ := E X = √(π/(2λ)) (1.44)

and Var X = (2/λ)(1 − π/4), so that

μ̂_n = √(π/(2λ̂_n)) (1.46)

and

σ̂_n² = (2/λ̂_n)(1 − π/4). (1.47)
""! "$! "&! "&& "&* "'$ "'' "') "'* "(!
Assuming that the miles to failure follow a Rayleigh distribution determine the value of the
parameter of the distribution and the corresponding estimate of the mean.
Solution. The value of the parameter used is the m.l.e. of the above sample of 10 values of
^ œ 8#8 we have
mileage. Applying formula (1.40) - 8 B# 3œ" 3
^ œ
- #†"!
œ )Þ$""** † "!"" .
"! #%!'"'†"!'
PROBLEMS
1.1. Let X₁,…,X_n ∈ [X], where X is a r.v. with pdf θx^{θ−1} 1_{(0,1)}(x) w.r.t. the unknown
parameter θ > 0. Find the m.l.e. and MLE of θ.

Solution. The likelihood function is

f_n(x₁,…,x_n|θ) = θⁿ ( ∏_{i=1}^n x_i )^{θ−1} 1_{(0,1)ⁿ}(x₁,…,x_n).

Setting the derivative of L(θ) = n ln θ + (θ − 1) Σ_{i=1}^n ln x_i equal to zero, we find
θ = −n/Σ_{i=1}^n ln x_i as a critical point of L. (Notice that this θ is positive.) It is readily seen
that −n/Σ_{i=1}^n ln x_i is a maximum point of L by checking the signs of L′ on its left (positive)
and on its right (negative). Hence,

ϑ̂ = −n / Σ_{i=1}^n ln x_i

is the m.l.e. and

Θ̂ = −n / Σ_{i=1}^n ln X_i

is the MLE of θ.
1.2. Suppose we need to estimate the time to failure of a water plant that has an exponential
distribution with an unknown parameter λ. The past 10 failures of the plant took place after
2, 10, 12, 6, 7, 9, 14, 8, 3, 4 days. Find the maximum likelihood estimate of λ.

Solution. From the given data x₁,…,x₁₀ we find that x̄₁₀ = 75/10 = 7.5. From Example 1.6, we
have λ̂ = 1/7.5 ≈ 0.133.
1.3. Suppose some purchases of a cell phone brand are made by men and some by women, and
their proportions are unknown except that the proportion p of purchases made by males satisfies
1/2 ≤ p ≤ 2/3. In a random sample of 70 phones of a particular brand it was found that 58
purchases were made by women and 12 by men. Find the m.l.e. of p.
1.4. Let X₁,…,X_n ∈ [Exp(λ)]. Show that X̄_n is a consistent estimator of the parameter 1/λ.

2. The Central Limit Theorem

If X₁, X₂, … is a sequence of iid r.v.'s with common mean μ and variance σ², but not necessarily
Gaussian, what can be said about the distribution of the standardized sample mean
Y_n := (X̄_n − μ)/(σ/√n)?

Theorem 2.1 (The Central Limit Theorem (CLT)). Let X₁, X₂, … be a sequence of iid r.v.'s,
each with mean μ and variance σ². Then the standardized r.v.

Y_n = (X̄_n − μ)/(σ/√n) (2.1)

or

Y_n = (X₁ + … + X_n − nμ)/(σ√n) (2.2)

converges in distribution to a standard Gaussian r.v. Z ∈ [N(0,1)] as n → ∞.
Example 2.1. Let X be a binomial r.v. with parameters (n, p). Then, as we recall, X = X₁ + … + X_n
is the sum of n independent Bernoulli r.v.'s. By the CLT, X can be approximated by a Gaussian
r.v. if n is large enough. [Recall that the Poisson approximation to the binomial required both n
to be large and p to be small. The normal approximation does not limit p.] Using (2.2) with
σ² = pq we have that the r.v.

(X₁ + … + X_n − nμ)/(σ√n) = (X − np)/√(npq) → Z ∈ [N(0,1)] in distribution as n → ∞. (2.3)
Example 2.2. Suppose a fair coin is tossed 900 times. Find the probability of obtaining more
than 495 heads. In this case the r.v. X giving the number of heads in 900 trials is binomial with
parameters (900, 1/2). Therefore, np = 450, npq = 225, and

P{X > 495} = P{ (X − 450)/15 > (495 − 450)/15 } ≈ 1 − Φ(3) = 0.00135.
Remark 2.1. In the general case, if X₁, X₂, … are iid r.v.'s with common parameters (μ, σ²) and
the sum X₁ + … + X_n = X is a r.v. formed as a sum (like the binomial r.v. in Example 2.1),
then we have

μ_X = E(X₁ + … + X_n) = nμ

and

σ_X² = Var(X₁ + … + X_n) = nσ².
When using the normal approximation we assume that n is large and replace a r.v. like a
binomial, a sum, or a sample mean with a Gaussian one after a corresponding standardization. In
some other cases, we even calculate the value of n needed to meet the accuracy of estimation of
an unknown mean by the sample mean, assuming that the sample mean is already normal. But
how accurate is the normal approximation itself? The following theorem partially addresses this
question.

Theorem 2.2 (Berry-Esseen). In the context of the CLT, the following estimate holds:

sup_x | P{ (X̄_n − μ)/(σ/√n) ≤ x } − Φ(x) | ≤ C ρ / (σ³ √n), (2.6)

where ρ = E|X₁ − μ|³ < ∞ and C is a universal constant (known to be smaller than 0.48).

The Berry-Esseen theorem thus gives the speed of convergence of a (standardized) sum of
r.v.'s to the Gaussian r.v.
PROBLEMS
2.1. The number of students enrolled in calculus classes at FIT is a Poisson r.v. with parameter
λ = 100. Use the CLT approximation to find the probability that the new enrollment is going to
be 120 or more students.

Solution:

P{X ≥ 120} = P{ (X − 100)/10 ≥ (120 − 100)/10 } ≈ 1 − Φ(2) = 0.0228. (2.7)
2.2. Let X be an n-Erlang r.v. Determine how large n must be so that X/n estimates the mean 1
within accuracy ε with probability at least 1 − α.

Hint: As the n-Erlang, X is the sum of n independent exponential r.v.'s, each with parameter
λ = 1. Now we need to evaluate

P{ |(X₁ + … + X_n)/n − 1| < ε } ≥ 1 − α;

since σ² = 1, the CLT yields n ≥ (1/ε²)(Φ^{−1}(1 − α/2))².
2.3. Civil engineers believe that W, the amount of weight (in units of 1000 pounds) that a certain
span of a bridge can withstand without structural damage resulting, is normally distributed with
mean 400 and standard deviation 40. Suppose that the weight of a car is a r.v. with mean 3 and
standard deviation 0.3. Approximately how many cars would have to be on the bridge span for
the probability of structural damage to exceed 0.1?

Hint: Let W_n be the total weight of n cars and let W be the total weight the bridge can
withstand. We need to estimate the minimal value of n such that P{W_n > W} > 0.1. Since
W_n − W is approximately normal with mean 3n − 400 and variance 0.09n + 1600,

P{W_n − W > 0} = P{ (W_n − W − (3n − 400))/√(0.09n + 1600) > (400 − 3n)/√(0.09n + 1600) }
              = P{ Z > (400 − 3n)/√(0.09n + 1600) } > 0.1,

or

Φ( (400 − 3n)/√(0.09n + 1600) ) ≤ 0.9,

which yields

(400 − 3n)/√(0.09n + 1600) ≤ Φ^{−1}(0.9) = 1.28,

whence n ≥ 117.
3. Confidence Intervals
Preliminaries. Suppose Z ∈ [N(0,1)], and let a > 0 and α ∈ (0,1). Consider the equation

P{−a < Z < a} = 1 − α. (3.1)

Its solution is a = z_{α/2}, where z_{α/2} = Φ^{−1}(1 − α/2) is the reference point of the α/2 tail,
i.e. the tail area of the standard Gaussian density from the point z_{α/2} all the way to the right
equals α/2. [Figure: the standard Gaussian density; the area to the right of z_{α/2} equals α/2.]

It follows from the equation P{−z_{α/2} < Z < z_{α/2}} = 1 − α that 1 − α is the area enclosed
between the two tail areas, each valued α/2, located outside the two reference points −z_{α/2}
and z_{α/2}. [Figure: the central area 1 − α between −z_{α/2} and z_{α/2}, with tail areas α/2 on
either side.]
Suppose we wish to estimate the unknown mean μ of a population by the sample mean X̄_n
(which we proved to be the MLE of μ for the Gaussian case) within a prescribed measure of
accuracy ε. This can be formalized as

P{|X̄_n − μ| < ε} = 1 − α, (3.3)

where α is referred to as the significance level (often taken to be 0.05, 0.10, or 0.025).
Equation (3.3) tells us that X̄_n deviates from μ by less than ε with probability 1 − α.
If we rewrite (3.3) as

P{X̄_n − ε < μ < X̄_n + ε} = 1 − α, (3.4)

we see that the expression |X̄_n − μ| < ε forms the estimator interval (X̄_n − ε, X̄_n + ε) of an
unknown mean μ.
The Explicit Form of the Estimator Interval of μ with σ² Known. We can also rewrite
P{|X̄_n − μ| < ε} as

P{ |X̄_n − μ|/(σ/√n) < ε√n/σ }.

According to section 5, Chapter III, the statistic (X̄_n − μ)/(σ/√n) of the sample X₁,…,X_n is
Gaussian (in notation, Z). Indeed, X̄_n is a linear combination of X₁,…,X_n and as such, it is
Gaussian with parameters (μ, σ²/n). Thus the r.v. (X̄_n − μ)/(σ/√n) is the standardized version
of X̄_n. Therefore,

1 − α = P{|X̄_n − μ| < ε} = P{ |Z| < ε√n/σ } = P{−z_{α/2} < Z < z_{α/2}}.

Here now

z_{α/2} = ε√n/σ, (3.6)

implying that

ε = (σ/√n) z_{α/2}. (3.7)

In particular, with α = 0.05, z_{α/2} = z_{0.025} = 1.96 (3.8) and

ε = (σ/√n) · 1.96. (3.9)
Then, assuming that σ is known, the estimator interval of an unknown mean μ,
(X̄_n − ε, X̄_n + ε), will have the explicit form

(A, B) = ( X̄_n − (σ/√n) z_{α/2}, X̄_n + (σ/√n) z_{α/2} ) (3.10)

or, with α = 0.05,

(A, B) = ( X̄_n − (σ/√n)·1.96, X̄_n + (σ/√n)·1.96 ). (3.11)

According to (3.11), we thus have that the unknown parameter μ lies between the r.v.'s A and B
with probability 1 − α = 0.95.
The predicament with an empirical substitute is that we can no longer claim that μ ∈ (a, b) with
probability 1 − α = 0.95, since a and b are not r.v.'s, but mere realizations of A and B. In fact,
there is nothing random in this empirical interval to induce an event, and thereby to warrant the
use of the word probability. Instead, we say that μ ∈ (a, b) with confidence 1 − α = 0.95. The
interval (a, b) is accordingly called a 100(1 − α)% confidence interval for μ.
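A minimal R helper for (3.10), assuming σ is known (the function name is ours, not a built-in):

# confidence interval for mu with known sigma, cf. (3.10)
z.interval <- function(x, sigma, alpha = 0.05) {
  eps <- qnorm(1 - alpha/2) * sigma/sqrt(length(x))
  c(lower = mean(x) - eps, upper = mean(x) + eps)
}
z.interval(c(5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5), sigma = 2)  # cf. Example 3.1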
Example 3.1. Suppose that when a signal of value μ is transmitted from node A, the value
received at node B is normally distributed with mean μ and variance 4. In other words, when the
signal is sent, the value received is μ + W, where W represents Gaussian noise with parameters
(0, 4). To reduce the error, suppose 9 signals of the same value μ are sent, and upon their receipt
at node B the values 5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5 are recorded. We need to construct a 95%
confidence interval for μ. Denoting s = 5 + 8.5 + 12 + 15 + 7 + 9 + 7.5 + 6.5 + 10.5 = 81, we obtain

x̄₉ = s/9 = 81/9 = 9.

Thus, substituting 9 for x̄_n, 9 for n, and 2 for σ, we arrive at the confidence interval

(a, b) = (9 − 1.96·2/3, 9 + 1.96·2/3) = (7.69, 10.31).
Remark 3.1. If X₁, X₂, … are Gaussian r.v.'s and μ is to be estimated while σ is unknown, then σ
should be replaced with the unbiased sample standard deviation

σ′ = √( (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)² ). (3.16)

Now, the rest of the calculations will be very similar, except that the statistic

(X̄_n − μ)/(σ′/√n) = Z′ = U_{n−1} (3.17)

is no longer Gaussian: it has Student's t-distribution with n − 1 degrees of freedom. (3.18)
As per (3.7),

ε = (σ′/√n) τ_{n−1}^{−1}(1 − α/2) = (σ′/√n) t^{n−1}_{α/2}, (3.19)

where τ_{n−1} denotes the PDF of the t-distribution with n − 1 degrees of freedom. Thus the
confidence interval for the unknown mean μ of a Gaussian population, with an unknown
variance σ², is

(a′, b′) = ( x̄_n − (σ′/√n) t^{n−1}_{α/2}, x̄_n + (σ′/√n) t^{n−1}_{α/2} ).

So, the new confidence interval (a′, b′) is like (a, b), with σ replaced by σ′ and z_{α/2} replaced
by t^{n−1}_{α/2}. Now, t^{n−1}_{α/2} can be found from the table of the t-distribution just like
z_{α/2} from the Gaussian table; only now t_{α/2} depends also upon one more parameter, n.
Notice that for n relatively large (31 or more), t^{n−1}_{α/2} ≈ z_{α/2}.
Example 3.2. In the context of Example 3.1, assume now that the variance σ² of the transmitted
signal is unknown. Calculation of σ′² gives σ′² = 9.5, so σ′ = 3.08, and with t⁸_{0.025} = 2.306,

(a′, b′) = (9 − 2.306·3.08/3, 9 + 2.306·3.08/3) = (6.63, 11.37).
Remark 3.2. To ease computations one can use R, MS Excel, MatLab, or Mathematica, to name a
few. For instance, in Mathematica one can use commands like

Sample = {5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5}
Z = Variance[Sample] // N

Users of MS Excel are warned not to use the STDEV.P function, as it gives the square root of the
biased sample variance (the population variance). Likewise, VAR.P in Excel gives the population
variance. So, either of them needs to be adjusted. (See Problem 3.6.) In R, the same
computations read:
sample=c(5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5)
mean(sample)
var(sample)
sd(sample)
to yield
> sample=c(5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5)
> mean(sample)
[1] 9
> var(sample)
[1] 9.5
> sd(sample)
[1] 3.082207
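Continuing in R, the t-based interval of Remark 3.1 for the same sample takes two more lines (a
sketch reusing the vector sample defined above):

n <- length(sample)
eps <- qt(0.975, df = n - 1) * sd(sample)/sqrt(n)
c(mean(sample) - eps, mean(sample) + eps)   # roughly (6.63, 11.37), cf. Example 3.2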
PROBLEMS
3.1 Suppose that when a signal having an unknown constant value μ is transmitted from node A,
the value received at node B is normally distributed with mean μ and variance 2. That is, when
the signal is sent, the value received is μ + W, where W is Gaussian noise with parameters (0, 2).
To reduce error, 16 signals of the same value μ are sent. Upon their receipt at node B, the sum of
the received values is s = 100. Then,

x̄₁₆ = s/16 = 100/16 = 6.25.

Construct a 95% confidence interval for μ.
3.2 In the context of Problem 3.1, assume that the variance σ² of the transmitted signal is
unknown. Construct a 95% confidence interval for μ, given that

σ′ = √( (1/15) Σ_{i=1}^{16} (x_i − x̄₁₆)² ) = 4.63.
3.4 Suppose that when a signal having an unknown constant value μ is transmitted from node A,
the value received at node B is normally distributed with mean μ and an unknown variance. That
is, when the signal is sent, the value received is μ + W, where W is Gaussian noise with
parameters (0, σ²). To reduce error, 16 signals of the same value μ are sent.
3.5 In the context of Problem 3.4, write a program in R to calculate the sample mean, the
unbiased sample variance and the confidence interval.
3.6 How can the population variance and standard deviation returned by the corresponding
MS Excel functions of Remark 3.2 be adjusted?
4. Confidence Intervals for Proportions

Example 4.1. Suppose we need to estimate the proportion of defective items in a large
population, or the proportion of HIV-infected individuals, the percentage of speed limit
violations, of smokers, of obese people, of a TV program's viewers, or the number of corrupted
signals. Unlike the parameter estimation in section 1, here we discuss how to obtain a confidence
interval for the unknown parameter.

Suppose a large sample X₁,…,X_n is drawn from a Bernoulli population with parameters μ = p
and σ² = p(1 − p). The value of p is unknown and will be estimated by the sample mean X̄_n.
Because the value of σ is generally unknown, we can replace σ with max σ, which will end up
giving us a larger interval than it really needs to be.
Since σ²(p) = p(1 − p) is a second-degree polynomial with roots at 0 and 1, the function σ²(p) is
a parabola with vertex at p = 1/2, where σ² attains its largest value 1/4; hence max σ = 1/2.
[Figure: the parabola p(1 − p) on (0,1), peaking at 1/4 for p = 1/2.]
Now, if we utilize the same idea with a more reasonable bound on σ for the confidence interval
(a, b) based on historical data, say 0.3 for max σ, and then collect empirical data, the size of the
confidence interval will shrink. For example, if we know from past observations that p cannot
exceed 0.1, then σ² = p(1 − p) ≤ 0.1·0.9 = 0.09, i.e. max σ = 0.3. [Figure: the left piece of the
parabola p(1 − p) on (0, 0.1], bounded by 0.09.]
Example 4.2. Suppose that in a sample of 400 drivers, 15 drove above the speed limit on highway
I-95. Thus we have x̄₄₀₀ = 15/400 = 0.0375. Continuing with a and b, and assuming that the true
proportion of drivers going over the speed limit never exceeds 0.1, we get 0.3 as max σ, and thus

a = 0.0375 − (0.3/√400)·1.96 = 0.0375 − 0.0294 = 0.0081,
b = 0.0375 + (0.3/√400)·1.96 = 0.0375 + 0.0294 = 0.0669.
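In R, the computation of Example 4.2 reads:

# conservative proportion interval with max sigma = 0.3, cf. Example 4.2
n <- 400; p.hat <- 15/n
eps <- 1.96 * 0.3/sqrt(n)
c(a = p.hat - eps, b = p.hat + eps)   # (0.0081, 0.0669)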
Example 4.3. Sample size determination. Suppose we need to choose the sample size n so that
the sample mean estimates μ within accuracy ε with probability at least 1 − α. Notice that σ is
not specified here, but is assumed to be known. Then, as in (3.1), we have

2Φ(ε√n/σ) − 1 ≥ 1 − α

or

Φ(ε√n/σ) ≥ 1 − α/2, (4.3)

or ε√n/σ ≥ z_{α/2}, (4.4) i.e.

n ≥ (σ²/ε²) z²_{α/2}. (4.5)

With α = 0.05,

n ≥ (σ²/ε²) · 3.84. (4.6)
Example 4.4. As in Example 4.1, when σ is unknown, we can replace σ with max σ in (4.6),
ensuring that the requested accuracy is met at the expense of a possibly much larger value of n:

n ≥ (max σ²/ε²) · 3.84. (4.7)
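A small R helper for (4.6)-(4.7) (the function name is ours; the call below uses the worst-case
Bernoulli bound max σ = 1/2):

# smallest n achieving accuracy eps at confidence 1 - alpha
n.needed <- function(sigma.max, eps, alpha = 0.05)
  ceiling((sigma.max/eps)^2 * qnorm(1 - alpha/2)^2)
n.needed(0.5, 0.05)   # worst-case Bernoulli: n = 385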
Suppose again that we need to estimate a proportion: of defective items in a large population, of
HIV-infected individuals, of speed limit violators, of smokers, and so on. A sample X₁,…,X_n is
drawn from a Bernoulli population with parameters μ = p and σ² = p(1 − p); the value of p is
unknown and is estimated by the sample mean X̄_n. Since σ²(p) = p(1 − p) is a parabola with
vertex at p = 1/2 and maximal value 1/4, formula (4.7) yields the conservative sample size
n ≥ 3.84/(4ε²) = 0.96/ε².
5. Confidence Intervals for the Difference of Two Means

Now, since X̄_n ∈ [N(μ, σ²/n)] and Ȳ_m ∈ [N(ν, δ²/m)] and they are independent,

X̄_n − Ȳ_m ∈ [N(μ − ν, σ²/n + δ²/m)],

so that

Z := ( X̄_n − Ȳ_m − (μ − ν) ) / √(σ²/n + δ²/m) ∈ [N(0,1)].

Therefore,

1 − α = P{−z_{α/2} < Z < z_{α/2}}
      = P{ X̄_n − Ȳ_m − z_{α/2}√(σ²/n + δ²/m) < μ − ν < X̄_n − Ȳ_m + z_{α/2}√(σ²/n + δ²/m) },

and the confidence interval for μ − ν is

(a, b) = ( x̄_n − ȳ_m − z_{α/2}√(σ²/n + δ²/m), x̄_n − ȳ_m + z_{α/2}√(σ²/n + δ²/m) ).
Example 5.1. Two different types of electrical cable insulation have recently been tested to
determine the voltage level at which failures tend to occur. When specimens were subjected to
an increasing voltage stress in a laboratory experiment, failures of the two types of cable
insulation occurred at certain recorded voltage values. Suppose it is known that the amount of
voltage that cables with type A insulation can withstand is normally distributed with an
unknown mean μ and known variance σ² = 40, whereas the corresponding distribution for type B
insulation is normal with unknown mean ν and known variance δ² = 100. We need to find the
95% confidence interval for μ − ν.
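For illustration, here is a hedged R sketch of the interval; the voltage readings below are
hypothetical stand-ins, since the data table is not reproduced here:

# 95% CI for mu - nu with known variances 40 and 100
x <- c(36, 44, 41, 53, 38, 36, 34, 54, 52, 37)   # type A (assumed values)
y <- c(52, 64, 38, 68, 66, 52, 60, 44, 48, 46)   # type B (assumed values)
d <- mean(x) - mean(y)
half <- 1.96 * sqrt(40/length(x) + 100/length(y))
c(d - half, d + half)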
The χ²-Distribution. It is a special case of the gamma distribution with parameters α and β. In
pdf form:

f(x|α, β) = (β^α/Γ(α)) x^{α−1} e^{−βx} 1_{ℝ₊}(x), (1.1)

where Γ(α) is the gamma function. Taking α = n/2, where n is a positive integer, and β = 1/2, we
arrive at the special case of the gamma pdf

f(x; n) = ( 1/(2^{n/2} Γ(n/2)) ) x^{n/2 − 1} e^{−x/2} 1_{ℝ₊}(x), (1.2)

called the pdf of a χ² r.v. with n degrees of freedom (in notation, χ²_n).
We wish to test

H₀: X ∈ [X⁰] (hypothetical),
H₁: X ∉ [X⁰].

We develop a test procedure that will either reject or not reject the null hypothesis. Let
p_i⁰ = P{X⁰ = y_i}, i = 1,…,k, be the hypothetical probabilities of the k categories, and let N_i be
the number of sample values falling into category i. Furthermore, define

Q := Σ_{i=1}^k (N_i − n p_i⁰)² / (n p_i⁰),

which, given H₀, is asymptotically a χ²_{k−1} r.v. (1.4) Then find the critical value c from

P{χ²_{k−1} > c} = α. (1.5)

The Interpretation: Given that H₀ is true, a genuine chi-square r.v. Q is unlikely to exceed c
(with a chance of less than α). The opposite of this would mean that H₀ is not true.

Practical Application. Define the observed value of Q, denoted by q, in one of the frequently
used forms:

q := Σ_{i=1}^k (n_i − n p_i⁰)²/(n p_i⁰) = (1/n) Σ_{i=1}^k n_i²/p_i⁰ − n. (1.6)
Here q is an observed value of Q. The test consists of checking whether or not q > c, where c was
previously obtained from (1.5). If q > c, then the deviations of X from X⁰ are not negligible and
H₀ must be rejected at significance level α.

Remark 1.1. Given a fixed k, the higher the significance level α, the smaller c is (i.e. the larger
the critical region). Thus, the higher α is, the more likely we are to reject H₀, because we judge
moderate deviations more "strictly," at the price of less confidence.
Example 1.1. It is conjectured that the number of wrong telephone connections is Poisson with
parameter λ = 0.5. A total of 120 days of observations produced the results summarized below.
Test the null hypothesis H₀ that the daily number of wrong telephone connections is indeed
Poisson with λ = 0.5 at significance level α = 0.05 against the alternative hypothesis H₁ that it
is not.

Solution. We expand the observed table, giving it a more formal interpretation (n_i is the number
of days with i wrong connections, p_i⁰ the hypothetical Poisson probabilities):

group   value i   n_i        p_i⁰
1       0         n₀ = 71    0.6065
2       1         n₁ = 37    0.3033
3       2         n₂ = 9     0.0758
4       3         n₃ = 2     0.0126
5       4         n₄ = 1     0.0016
6       ≥5        n₅ = 0     0.0002
total             n = 120    1.0

So, we have altogether 6 groups (k = 6) made of the numbers of observed occurrences n₀,…,n₅,
with n = 120 days. To calculate q we use formula (1.6) with the corresponding n_i's from the
table, which gives q = 3.64. Now, because k = 6, we select c = 11.071 from the table of the
χ²-distribution with 6 − 1 = 5 degrees of freedom, making the critical region C = (11.071, ∞).
Since q ∉ C, we do not reject H₀ that the daily number of wrong telephone connections is
Poisson with parameter λ = 0.5 at the significance level α = 0.05.
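The whole test is a few lines in R (n₀ = 71 is inferred from the day total):

# chi-square goodness of fit for Example 1.1
obs <- c(71, 37, 9, 2, 1, 0)
p0  <- c(0.6065, 0.3033, 0.0758, 0.0126, 0.0016, 0.0002)
q   <- sum((obs - 120*p0)^2/(120*p0))   # Pearson statistic, cf. (1.6); about 3.64
qchisq(0.95, df = 5)                    # critical value c = 11.07
1 - pchisq(q, df = 5)                   # P-value, cf. Remark 1.2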
Remark 1.2 (The P-Value). The P-value is a more common way in statistics to make an inference
in such hypothesis testing. It is defined as

p := P{χ²_{k−1} > q},

where q is the observed value of Q in formula (1.6). If the P-value p turns out to be very small,
then we say it is highly unlikely that a true χ²_{k−1} r.v. is as large as q, and thus we reject H₀.
However, if p is not small, we say it is quite possible for χ²_{k−1} to be that "large" (i.e. equal to
q or greater). Therefore, we do not reject H₀. The P-value can be computed, e.g., at

http://stattrek.com/online-calculator/chi-square.aspx

taking into account that the result provided there is 1 − p and not p. In Example 1.1, from the
same table, P{χ²₅ ≤ 3.64} = 0.4, and thus the P-value is

p = P{χ²₅ > 3.64} = 0.6,

which is very large, thereby showing that the Poisson model with λ = 0.5 provides a reasonable
fit for the collected data. In other words, we do not reject H₀ that the daily number of wrong
telephone connections obeys the Poisson law with parameter λ = 0.5.
Example 1.2. In our next example we want to investigate whether the number of fatalities in
automobile accidents obeys a Poisson distribution. Suppose we have a record of 340 fatal
automobile crashes, observed per each hour during 72 consecutive hours. Some hours gave zero,
one, two, etc. crashes, which we categorized in eight groups: group 1 included only zero or one
crash, group 2 included only 2 crashes, etc.; finally, group 8 included 8 or more crashes. The data
were placed in a table with totals 340 crashes and n = 72 hours.

Notice that the number of groups (k = 8) made the numbers of observed occurrences n₁,…,n₈
approximately uniform, which suggests how we should group the incidences. Consequently, we
have 72 "observations" (i.e. hours) of an unknown r.v. X: Ω → {y₁,…,y₈}. For instance, take the
set {5 crashes}: according to our records, 5 crashes occurred in the first hour, the third hour, the
23rd hour, etc., altogether in 11 of the 72 hours. Consequently, the corresponding observed
frequency is 11.
We would like to test that the above data come from the class of Poisson r.v.'s with parameter
λ = 5. In the formula

q = (1/n) Σ_{i=1}^k n_i²/p_i⁰ − n

we use the hypothetical frequencies p_i⁰, which we can take from the Poisson table for λ = 5.
According to the procedure specified in (1.4–1.5), given the significance level α = 0.05 and with
k = 8, we find the "critical" value c from the table for the χ²₇ distribution: c = χ²₇(0.95) = 14.07.
The computation gives

q = 3.46 < c = 14.07.

Therefore, we do not reject H₀ that the proposed distribution is indeed Poisson with λ = 5.
Example 1.3 (Calculation of the P-Value). Recall that the P-value is defined as

p = P{χ²_{k−1} = Q > q},

where q = 3.46 in Example 1.2. From the same site we find that P{Q ≤ 3.46} = 0.37, and thus
the P-value is

p = P{Q > 3.46} = 0.63,

which is very large, thereby showing that the Poisson model with λ = 5 provides a very good fit
for the collected data. Thus we do not reject H₀ that the number of fatalities obeys the Poisson
law with λ = 5.
Example 1.4. Suppose there are two sets of measurements of aluminum oxide taken from
potteries at two different archeological sites from the Roman era. Do these findings come from
the same period? The 10 measurements from each site are placed in two tables:

Site 2: 10 10 10 11 11 11 12 13 13 14

We interpret the measurements of site 1 as hypothetical and calculate their n_i's as hypothetical
frequencies. The measurements from site 2 we take as observed occurrences, all placed into five
groups.

Notice that the test formally works only when none of the p_i⁰ = 0, which fails when some
values observed at site 2 do not occur at site 1. In that case we switch the sites, making the
other site hypothetical, and get some n_i = 0 instead. The same adjustment can be made when
the numbers of tested values are different. Ultimately, the Kolmogorov-Smirnov procedure (of
section 3) works better than the χ² method anyway, with no need for adjustments.
With k = 5 groups,

q = (1/10) Σ_{i=1}^5 n_i²/p_i⁰ − 10.

The P-value is then computed as in Remark 1.2. Because the P-value is fairly large, we do not
reject the null hypothesis that both findings come from the same period of the Roman era.
PROBLEMS
1.1. A coin is tossed until a head occurs and the number X of tosses is recorded. After repeating
the experiment 256 times, the following results were obtained:

x̂_i    1     2    3    4    5    6    7    8
n_i    136   60   34   12   9    1    3    1

where x̂_i is the number of tosses needed to obtain the first head and n_i is the number of
experiments in which x̂_i occurs. Test the hypothesis, at the 0.05 significance level, that the
observed distribution of X is geometric with parameter p = 1/2.
1.2. According to the Mendelian theory of genetics, a certain garden pea plant should produce
either white, pink, or red flowers, with respective probabilities 1/4, 1/2, 1/4. To test this theory,
a sample of 564 peas was studied; they produced 141 white, 291 pink, and 132 red flowers. Test
the hypothesis at the 0.05 significance level that the observed sample agrees with the Mendelian
theory. Also give the P-value. Answer: q = 0.8617 and P-value = 0.648.
1.3. To test the claim that a certain die is fair, 1000 rolls of the die were recorded with the
following results:

Outcome   # Occurrences
1         158
2         172
3         164
4         181
5         160
6         165

Test the hypothesis that the die is fair at the 0.05 significance level.
Answer: q = 2.1796, P-value = 0.824.

Solution. We expand the above table, adding some optional columns for convenience, and
compute q from (1.6). The P-value p = P{χ²₅ > 2.18} = 0.824 is very large, meaning that a true
chi-square r.v. is very likely to be 2.18 or larger; hence the hypothesis of fairness is not rejected.
1.4. It is conjectured that the daily number of electrical power failures in a certain city obeys the
Poisson law with mean 4.2. A total of 150 days of observations produced a frequency table of the
daily failure counts. Test H₀ that the number of power outages is indeed Poisson with λ = 4.2 at
α = 0.05 and give the P-value. Answer: q = 15.955, P-value = 0.143.
1.5. A contractor who purchases a large number of fluorescent light bulbs has been told by the
manufacturer that the bulbs are not of uniform quality: each bulb produced will, independently,
be of quality level 1, 2, 3, 4, or 5, with respective probabilities p₁ = .15, p₂ = .25, p₃ = .35,
p₄ = .20, p₅ = .05. However, the contractor feels that he is receiving too many type 5 (the lowest
quality) bulbs, and so he decides to challenge the manufacturer's claim by taking the time and
expense to ascertain the quality of 30 such bulbs. Suppose that he discovers that of the 30 bulbs,
3 are of quality level 1, 6 are of quality level 2, 9 are of quality level 3, 7 are of quality level 4,
and 5 are of quality level 5. Do these data, at the 5% level of significance, enable the contractor
to reject the manufacturer's claim? Find the P-value and interpret the result.
Hint: Identify the claimed frequencies p₁,…,p₅ as hypothetical, thus having p₁⁰ = .15, p₂⁰ = .25,
p₃⁰ = .35, p₄⁰ = .20, p₅⁰ = .05. Then take n₁ = 3, n₂ = 6, n₃ = 9, n₄ = 7, n₅ = 5 and substitute the
data in formula (1.6):

q = (1/30) Σ_{i=1}^5 n_i²/p_i⁰ − 30 = 9.348.

Since χ²₄(0.95) = 9.488 > 9.348, the hypothesis should not be rejected at the 5 percent level of
significance. However, it would be rejected at all significance levels above 0.053.
2. Contingency Tables: Testing for Independence

Example 2.1. Table 2.1 below is a 2 × 3 contingency table obtained from a report on the
relationship between aspirin use and heart attacks by the Research Study Group at Harvard
Medical School among two groups (11,034 and 11,037) of participating physicians.

Table 2.1
           Fatal attack   Nonfatal attack   No attack   total
Placebo    18             171               10845       11034
Aspirin    5              99                10933       11037
total      23             270               21778       22071

The above data came from a 5-year randomized study of whether regular aspirin intake reduces
mortality from cardiovascular disease. The study was "double-blind": those in the study did not
know whether they were taking aspirin or a placebo.

Define

p_ij = P{X = i, Y = j} (2.1a)

and the marginals p_i· = P{X = i} and p_·j = P{Y = j}. The null hypothesis of independence is
H₀: p_ij = p_i· p_·j for all (i, j), or equivalently, that the attributes X and Y are independent. In
the context of Example 2.1, the veracity of H₀ would mean that aspirin has no impact (i.e. brings
no improvement) on the underlying cardiovascular condition.

The Method. We conduct n independent trials (X₁, Y₁),…,(X_n, Y_n) (in the context of Example
2.1, n = 22,071) of the random vector (X, Y). Then introduce the r.v.'s N_ij, the numbers of trials
in which (X, Y) = (i, j). For example, from Table 2.1, n₂₁ (the observed value of N₂₁), the
number of physicians out of the total n = 22,071 who took aspirin and had fatal heart attacks,
is 5.
Since the real values of the p_ij's in (2.1a) are unknown, we use the following proxies (i.e.
estimators) for p_ij:

P̂_ij := N_ij/n,

and

P̂_i· := N_i·/n and P̂_·j := N_·j/n,

respectively, where

N_i· = Σ_{j=1}^C N_ij and N_·j = Σ_{i=1}^R N_ij.

Under H₀ we expect

P̂_ij ≈ P̂_i· P̂_·j. (2.4)

Since they are never exactly equal, we need to figure out how far they may deviate while H₀ still
holds in a reasonable way. For example, we could consider the statistic

Σ_{i=1}^R Σ_{j=1}^C ( P̂_ij − P̂_i· P̂_·j )²

and see if it is reasonably small. Pearson and Fisher proposed the statistic

Q = Σ_{i=1}^R Σ_{j=1}^C (N_ij − N_i· N_·j/n)² / (N_i· N_·j/n),

reducible to

Q = n ( Σ_{i=1}^R Σ_{j=1}^C N_ij²/(N_i· N_·j) − 1 ), (2.5)

which they claimed is asymptotically chi-square with (R − 1)(C − 1) degrees of freedom.
Example 2.2 (Example 2.1 revisited). We expand Table 2.1 in accordance with the above
specifications and compute, by (2.5),

q = 22071 [ 18²/(11034·23) + 171²/(11034·270) + 10845²/(11034·21778)
          + 5²/(11037·23) + 99²/(11037·270) + 10933²/(11037·21778) − 1 ]
  = 26.90293781.

The P-value of a χ²₂ r.v. (with (2 − 1)·(3 − 1) = 2 d.f.), taken from the table of tails of
chi-square PDF's, is

p = P{χ²₂ > 26.9} = e^{−13.45} ≈ 1.4·10^{−6},

which is very small. We therefore reject the null hypothesis that taking a placebo or aspirin
does not impact the cardiovascular condition.
Example 2.3. Can mobile devices (such as phones and tablets) and radio transmitters interfere
with airplane instruments? Some independent studies registered a total of 370 incidents in which
airplane instruments malfunctioned while personal mobile devices were or were not used. Two
communication frequencies banned by the FCC, 450 and 800 MHz, were involved (since they are
believed to interfere with a plane's communications). Test the hypothesis that the use of mobile
phones does not interfere with airplane instruments at significance α = 0.05 and also find the
P-value. The incidents were categorized in the following table:

instruments \ use of phones      B1: 450-MHz   B2: 800-MHz   B3: no use   A-marginals
A1 Galvanometers                 n₁₁ = 5       n₁₂ = 30      n₁₃ = 10     n₁· = 45
A2 Navigation systems            n₂₁ = 30      n₂₂ = 45      n₂₃ = 100    n₂· = 175
A3 Pilot communication noises    n₃₁ = 20      n₃₂ = 50      n₃₃ = 80     n₃· = 150
B-marginals                      n·₁ = 55      n·₂ = 125     n·₃ = 190    total 370
Furthermore, R = C = 3 and n = 370, so that

q = 370 [ (1/45)(5²/55 + 30²/125 + 10²/190) + (1/175)(30²/55 + 45²/125 + 100²/190)
        + (1/150)(20²/55 + 50²/125 + 80²/190) − 1 ] = 27.7.

From the table for chi-square, χ²₄(0.95) = 9.488 < 27.7 ⇒ we reject H₀ that attributes A and B
are independent. In other words, the use of mobile phones and malfunctions of airplane
instruments are related. Using

http://stattrek.com/online-calculator/chi-square.aspx

we find that P{χ²₄ ≤ 27.7} ≈ 1, implying that P{χ²₄ > 27.7} is very small, so that a true
chi-square r.v. (representing the deviations of the joint distribution from the product of the
marginals) being so large is highly unlikely. Therefore, we reject the null hypothesis.
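R's built-in chisq.test performs the same computation directly:

# test of independence for Example 2.3
tab <- matrix(c( 5, 30,  10,
                30, 45, 100,
                20, 50,  80), nrow = 3, byrow = TRUE)
chisq.test(tab)   # X-squared about 27.7 on 4 df; P-value about 1.4e-05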
In the general case the data form an R × C contingency table (Table 2.3), in which category A_i
occupies row i with entries n_ij, the row marginals are n_i· = Σ_{j=1}^C n_ij, the column
marginals are n·_j = Σ_{i=1}^R n_ij, and the total is n. We need to calculate these marginal
quantities in each category to get the value of q subject to the formula

q = n ( Σ_{i=1}^R Σ_{j=1}^C n_ij²/(n_i· n_·j) − 1 ).

Next, solve

P{χ²_{(R−1)(C−1)} > c} = α (2.7)

and find the value c, which is the (1 − α)-quantile of the χ²_{(R−1)(C−1)} r.v., to obtain the
critical region (c, ∞). Then count the observed occurrences of the pairs (i, j) directly from the
contingency table. Finally, reject the null hypothesis H₀ if q > c. The P-value is

p = P{χ²_{(R−1)(C−1)} > q}. (2.10)
PROBLEMS
2.1. Show that

q = n Σ_{i=1}^R Σ_{j=1}^C (n_ij − n_i· n_·j/n)² / (n_i· n_·j)

can be reduced to

q = n ( Σ_{i=1}^R Σ_{j=1}^C n_ij²/(n_i· n_·j) − 1 ).
2.2. A random sample of 795 individuals was collected to investigate whether smoking and
drinking alcohol are related. The results were as follows:

Test the hypothesis that drinking alcohol and smoking are independent at α = 0.05.
2.3. Suppose 332 people were selected at random, and each person in the sample was classified
according to blood type, O, A, B, AB, as well as according to Rh status, positive or negative.
The observed data are put in the following table:

              O    A    B    AB
Rh positive   92   89   66   19
Rh negative   13   37   7    9

Test the hypothesis that the two classifications of blood types are independent at α = 0.05.
Explain your steps, interpret the result, and make a conclusion. Also find the P-value.
2.4. Suppose 300 people were selected at random, and each person in the sample was classified
according to blood type, O, A, B, AB, as well as according to Rh status, positive or negative.
The observed data are put in the following table:

              O    A    B    AB
Rh positive   82   89   54   19
Rh negative   13   27   7    9

Test the hypothesis that the two classifications of blood types are independent at α = 0.05 and
also find the P-value.
2.5. A random sample of 2100 death certificates of adults was examined in a large metropolitan
area and showed the following results:

death cause \ habit   B1: heavy smoker   B2: moderate   B3: nonsmoker   A-marginals
A1 Respiratory        n₁₁ = 55           n₁₂ = 120      n₁₃ = 162       n₁· = 337
A2 Heart              n₂₁ = 49           n₂₂ = 388      n₂₃ = 315       n₂· = 752
A3 Other              n₃₁ = 61           n₃₂ = 300      n₃₃ = 650       n₃· = 1011
B-marginals           n·₁ = 165          n·₂ = 808      n·₃ = 1127      total 2100

Furthermore, R = C = 3 and n = 2100, so that

q = 2100 [ (1/337)(55²/165 + 120²/808 + 162²/1127) + (1/752)(49²/165 + 388²/808 + 315²/1127)
         + (1/1011)(61²/165 + 300²/808 + 650²/1127) − 1 ] = 134.12.

From the table for chi-square, χ²₄(0.95) = 9.488 ⇒ we reject H₀ that attributes A and B are
independent. In other words, smoking and death caused by respiratory, heart, and other
conditions are related. Now, P{χ²₄ > 134.12} is very small, so that a true chi-square r.v.
(representing the deviations of the joint distribution from the product of the marginals) being
so large is highly unlikely. Therefore, we reject the null hypothesis.
2.6 A sample of 300 people was randomly chosen, and each was identified by gender and by
political affiliation: Democrat, Republican, or Independent. Test the hypothesis that gender and
political affiliation are independent at α = 0.05. Also find the P-value and interpret the result.

This is an R = 2 by C = 3 contingency table. The statistic q gives the value 6.433, and the
critical value c of the chi-square r.v. with 2 d.f. at 5% is 5.991. Since q > c, we reject, at the 5%
significance level, the null hypothesis that gender and political affiliation are independent. The
P-value is calculated as P{χ²₂ > 6.433} = 0.04. The latter means that we would not reject the
null hypothesis at significance levels lower than 4%.
2.7 A company operates four machines on three separate shifts daily. The following contingency
table presents data collected during a 6-month period concerning the machine breakdowns that
resulted:

Shift \ Machine   A    B    C    D
1                 10   12   6    7
2                 10   24   9    10
3                 13   20   7    10

Determine whether machine breakdowns are independent of the particular shift, using the
P-value argument. Answer: q = 1.8148. The P-value is p = P{χ²₆ > 1.8148} = 0.9359. Since the
P-value is very large, we do not reject the null hypothesis that machine breakdowns are
independent of the shifts.
3. The Kolmogorov-Smirnov Test

Suppose we need to figure out what class of distributions a particular population [X] (continuous
or discrete) belongs to. Here we discuss testing the hypothesis that the unknown PDF
(probability distribution function) F of X equals, or is close to, some hypothetical PDF F⁰.

The idea is to collect and order a sample x₁ < … < x_n, form the empirical discrete PDF F_n with
successive increments 1/n, and compare it with the hypothetical PDF F⁰ by means of the largest
deviation between the two. Andrey Kolmogorov suggested a test statistic which evaluates the
goodness of F⁰.
Let X₁,…,X_n ∈ [X] be a sample of continuous r.v.'s drawn from a population [X] with a common
PDF F. Suppose that after observation their values are x₁,…,x_n (assumed to be all different).
We can also assume that

x₁ < … < x_n, (3.1)

or just reorder them. We now construct the associated sample PDF, also referred to as the
empirical distribution function (EDF), relative to the ordered sample (3.1):

F_n(x) = k/n for x_k ≤ x < x_{k+1}, k = 1,…,n (with x_{n+1} = ∞), and F_n(x) = 0 for x < x₁. (3.2)

Its random counterpart is

F̂_n(x, ω) := (1/n) Σ_{k=1}^n Y_k, where Y_k = 1_{{X_k ≤ x}}, (3.3)
and the test statistic is T_n := √n sup_x |F̂_n(x) − F⁰(x)|, whose limiting distribution is H(t),
the well-known H-PDF, which is tabulated. (See Table 3.2.) Like in other hypothesis tests, we set
H₀: F = F⁰ against H₁: F ≠ F⁰ and use the critical region C = (c, ∞) with c = H^{−1}(1 − α).
For instance, if the significance level is α = 0.05, from Table 3.2 we will have H^{−1}(0.95) = 1.35
approximately, since H(1.35) = 0.9478 is closest to 0.95.

As we have already done for other instances of hypothesis testing, we form an empirical PDF F_n
(in place of the random F̂_n) and compare it with the hypothetical F⁰ by calculating the norm
(the largest distance) d_n of their difference and forming the empirical version

τ_n = √n · d_n (3.22)

of the statistic T_n. The value τ_n is then measured against C, and if it falls into C, we reject the
null hypothesis. [We then say that if in place of τ_n there were T_n, it would be unlikely to see
T_n greater than c.] Conversely, we do not reject H₀ if τ_n ≤ c. We cautiously admit that another
candidate F⁰ may possibly exist that produces an empirical τ_n less than c.
The technical part of this can be laid out as the following procedure.

Procedure 1.
Step 1. Order the sample and construct the EDF F_n and the hypothetical F⁰.
Step 2. Construct

Δ_i⁺ := |F_n(x_i) − F⁰(x_i)| and Δ_i⁻ := |F_n(x_{i−1}) − F⁰(x_i)|. (3.23)

Since F⁰ is monotone increasing and F_n is a step function, sup|F_n(x) − F⁰(x)| over
x_i ≤ x < x_{i+1} is readily seen to be attained at x_i or just before x_{i+1}, with values Δ_i⁺ or
Δ_{i+1}⁻, respectively. [Figure 3.1: the EDF F_n against F⁰, with the one-sided deviations Δ_i⁺
and Δ_i⁻ marked.]
Step 3. Compute

d_n := max{Δ_i⁺, Δ_i⁻; i = 1,…,n} (3.24)

(the empirical value of D_n⁰).
Remark 3.2 (The P-Value). The P-value, as we know it from the previous two sections, is
defined as

p = P{T_n > τ_n}. (3.29)

We recall that the P-value is set apart from any significance level α (according to which we
derive the critical region C). So, after getting τ_n we substitute it in (3.29) to obtain p. Suppose
that after the collection of empirical data and evaluation of τ_n, using the table for the H r.v.,
we have some p. From (3.29), it is obvious that p is the tail area under the density curve to the
right of τ_n. Clearly, the larger the P-value, the larger the area under the curve over (τ_n, ∞);
and vice versa, the smaller p is, the smaller the area over (τ_n, ∞) will be. If the P-value p turns
out to be very small, then we say there is a very low probability that the real T_n can be larger
than τ_n, and thus we reject H₀. However, if p is not small, we say it is quite possible for T_n to
be that "large." Therefore, we do not reject H₀.
Example 3.1. The time (in seconds) between successive vehicle arrivals at a certain intersection
was measured for some period of time and yielded the following:

0.3 0.6 1.0 1.1 1.3 1.8 1.9 2.1 2.3 4.0 5.0

(i) Test the hypothesis that these data come from an exponential distribution at the significance
level α = 0.05. First find the m.l.e. of the unknown parameter λ, and then test the hypothesis.
(ii) Test the hypothesis that these data come from an exponential distribution with a mean of 6
seconds, using α = 0.05.

Solution. (i) We have n = 16 and the m.l.e. λ̂ = 1/4.13125 (see Problem 1.2, Chapter III), so the
hypothetical PDF is F⁰(x) = 1 − e^{−x/4.13125}. After calculations (see the attached spreadsheet,
Table 3.10), we arrive at d₁₆ = 0.135482143. Therefore, τ₁₆ = √16 · d₁₆ = 0.5419. The latter is
less than 1.36, and hence we do not reject the null hypothesis that the above data come from the
exponential distribution with parameter 1/4.13125.
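R's ks.test carries out the same comparison (here with the rate fixed at the m.l.e., as in the
notes; the listed values are passed as printed above):

# Kolmogorov test of Example 3.1(i)
t <- c(0.3, 0.6, 1.0, 1.1, 1.3, 1.8, 1.9, 2.1, 2.3, 4.0, 5.0)
ks.test(t, "pexp", rate = 1/4.13125)   # hypothetical F0(x) = 1 - exp(-x/4.13125)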
Table 3.3

PROBLEMS

3.1. Test the hypothesis that the data below (Table 3.4) come from a uniform distribution on the
interval (0,1).

3.2. Test the hypothesis that the data below (Table 3.5) come from a Gaussian distribution with
parameters (μ = 26, σ² = 4).
We notice that when constructing the empirical distribution function (EDF) F_n with some x_i's
equal, we proceed as follows: if x_{i+1} = … = x_{i+s}, then F_n jumps by s/n at that common
value.

4. The Smirnov Two-Sample Test

To compare two samples x₁ < … < x_n and y₁ < … < y_m with EDF's F̂_n and Ĝ_m, the test
statistic is

T_{mn} = (mn/(m + n))^{1/2} sup{ |F̂_n(x) − Ĝ_m(x)| : x ∈ ℝ }. (4.1)
Procedure 2.
Step 1. Take x₁ < … < x_n and y₁ < … < y_m ordered, and construct the associated empirical
PDF's F_n (for x₁ < … < x_n) and G_m (for y₁ < … < y_m), as in Step 1 of Procedure 1.
Step 2. Mix the two sets {x₁,…,x_n} and {y₁,…,y_m} into one and reorder it, denoting the result
{z₁,…,z_{m+n}}.
Step 3. Evaluate both EDF's at the points z₁,…,z_{m+n}.
Step 4. Find

d_{mn} := max{ |F_n(z_i) − G_m(z_i)| : i = 1,…,m+n }.

[Figure 4.1: the two EDF's F_n and G_m plotted together, with the largest gap d_{mn} marked.]
As per Figure 4.1, the maximum deviation d_{mn} between F_n and G_m occurs at one of the
points {z₁,…,z_{m+n}} of Step 2.
Example 4.1. Suppose there are two sets of measurements of aluminum oxide taken from
potteries at two different archeological sites from the Roman era. Do these findings come from
the same period? In other words, if F_n is the sample PDF of site 1 and G_m is the sample PDF
of site 2, do they belong to the same class of PDF's?

Table 4.2 below contains the data on the aluminum oxide contents from both sites as x_i's and
y_j's. They are reordered, while placed in different columns for convenience; as reordered they
represent the set {z₁,…,z₁₅}.

Table 4.2

The largest difference between F₁₀ and G₅ is at z₁₀ = x₈ = 13.8, and it equals d_{mn} = 0.4.
Now let α = 0.05. Then H^{−1}(0.95) = 1.36 (as already mentioned), and therefore the critical
region is C = (1.36, ∞). Furthermore,

√(mn/(m + n)) = √(50/15) = 1.826 ⇒ τ_{mn} = 0.4 · 1.826 = 0.73.

Since 0.73 < 1.36, H₀ that the sites are from the same era is not rejected at α = 0.05. On the
other hand, H(0.73) = 0.35. Therefore, the P-value is 1 − 0.35 = 0.65, and we will accept H₀ at
each α < 0.65.
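The two-sample version is also built into R; since Table 4.2 is not reproduced here, both vectors
below are hypothetical stand-ins used only to show the call:

# Smirnov two-sample test, cf. Procedure 2 and Example 4.1
x <- c(12.1, 12.8, 13.1, 13.4, 13.8, 14.0, 14.3, 14.5, 15.0, 15.2)  # "site 1"
y <- c(13.9, 14.6, 14.9, 15.4, 15.8)                                # "site 2"
ks.test(x, y)   # D plays the role of d_mn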
Remark 4.1. In testing a discrete distribution, an observed sample x₁,…,x_n need not have all
distinct values. In this case we gather the values in k groups (or rather sets) x̂₁,…,x̂_k, where,
say, y₁ ∈ x̂₁ = {x₁,…,x_{n₁}} with |x̂₁| = n₁ (all of x₁,…,x_{n₁} equal), etc. So
|x̂₁| + … + |x̂_k| = n₁ + … + n_k = n is the total size of the sample, now consisting of k distinct
groups, each containing identical values. If we construct the corresponding EDF F_n(x), then
F_n(y₁) = n₁/n, F_n(y₂) = (n₁ + n₂)/n, and so on, with y₁ < … < y_k and F_n being a piecewise
constant, monotone nondecreasing step function.
Example 4.2. The data in Table 4.3 below show the frequency counts for n = 400 observations of
the number of bacterial colonies within the field of a microscope, using samples of milk film.

Table 4.3

In the first column we place the number of colonies, grouped in each of the 11 rows beginning
with 0 and ending with 10, the largest count among the 400 observations. The second column
provides the number of observations corresponding to each named group. For instance, the first
group x̂₀ includes a total of 56 observations with no colonies; thus y₀ ∈ x̂₀. We need to test the
hypothesis that the data fit a Poisson distribution with some reasonable parameter λ at α = 0.05.

Here x_i = #colonies in the i-th observation, so that i = 1,…,56 = n₀ gives x₁ = … = x₅₆ = 0, and
so on. Therefore, the sample mean of the number of colonies (per observation) is

x̄₄₀₀ = 976/400 = 2.44 = λ̂.

Next, we form the EDF F₄₀₀ based on the above frequencies and compare it with the genuine
hypothetical Poisson PDF with λ = 2.5 (for convenience, in place of 2.44). Table 4.4 shows the
calculations. The supremum norm of the difference between the EDF F₄₀₀ and the hypothetical
PDF F⁰ is

‖F₄₀₀ − F⁰‖∞ = 0.1127 = d₄₀₀.

Using Smirnov's version (as if we were comparing two samples), we therefore have

τ₄₀₀ = √(400/2) · 0.1127 = 1.59,

which belongs to the critical region at α = 0.05, and therefore the null hypothesis that the data
fit the Poisson distribution must be rejected.
If, however, we refine the hypothetical distribution as shown in Table 4.5, we will get Table 4.6.
The supremum norm now decreases to

‖F₄₀₀ − F⁰‖∞ = 0.0579 = d₄₀₀.

Therefore,

τ₄₀₀ = √(400/2) · 0.0579 = 0.82,

which does not belong to the critical region at α = 0.05, and therefore the null hypothesis that
the data fit the Poisson distribution will not be rejected. Hence it is very likely that the real T is
greater than 0.82, giving us reason to believe that the EDF coming from the corresponding
readings fits the hypothetical Poisson PDF with parameter 2.5 well.
Problem 4.1. Test the hypothesis, at α = 0.05, that 25 observations (Table 4.7) selected at
random from a distribution with an unknown PDF F and 20 observations (Table 4.8) from a
distribution with an unknown PDF G are such that F and G are identical:

Table 4.7
Table 4.8
Conditional Distributions and Densities for Two Random Variables. For two discrete r.v.'s X
and Y, using the conditional probability formula

P{X = i | Y = j} = P{X = i, Y = j} / P{Y = j} (1.1)

(valid when P{Y = j} ≠ 0), we can define the conditional distribution. Notice that the
distribution in the denominator is the marginal distribution of Y.

Using the same principle we can define the conditional probability density function for
continuous r.v.'s X and Y. Let (X, Y): Ω → ℝ² be a continuous random vector on a probability
space (Ω, F(Ω), P) with pdf (probability density function) f(x, y). We call

f(x|y) := f(x, y) / f_Y(y)

the conditional pdf of X given Y, provided f_Y(y) ≠ 0. Here f_Y(y) is the marginal pdf of the
r.v. Y. Analogously,

f(y|x) := f(x, y) / f_X(x).

Conditional densities are genuine pdf's in the sense that they are nonnegative, the integrals of
conditional pdf's in their first variables are equal to 1 (as is easily verified), and, if X and Y are
independent, the conditional pdf's reduce to the marginal pdf's. We postpone for a little while
the discussion of the areas of the conditional densities where their denominators are zero.
Example 1.1. Suppose a point is randomly selected on the interval (0,1) and its value is X. Then
another point is randomly selected on the interval (0, X) and its value is Y. We need to find the
marginal pdf of the r.v. Y.

Solution. [Figure 1.1: the nested intervals (0,1) and (0, X), with the points X and Y marked.]
Here we observe that the random selection of a point on the interval (0,1) means that X is
uniformly distributed on (0,1), i.e. X is standard uniform. Furthermore, we interpret the
corresponding pdf of X as its marginal. Thus,

f_X(x) = 1_{(0,1)}(x).

Notice that the position of X depends upon the position of Y, as we know from basic probability,
where events A and B are either mutually dependent or independent. The position of X seems to
be invariant of Y only because, in our interpretation, it is given by its marginal pdf. Furthermore,
the random choice of Y means that the conditional pdf of Y given X = x is uniform on (0, x),
that is,

f(y|x) = (1/x) 1_{(0,x)}(y).

Consequently, the marginal pdf of Y is

f_Y(y) = ∫ f(y|x) f_X(x) dx = ∫_y^1 (1/x) dx = ln(1/y), 0 < y < 1.

[Figure 1.2: the graph of f_Y(y) = ln(1/y) on (0,1).]
PROBLEMS

1.1. Let the joint density of X and Y be

f(x, y) = (12/5) x(2 − x − y) for 0 < x < 1, 0 < y < 1, and 0 elsewhere.

Find the conditional pdf's f(x|y) and f(y|x).

1.2. Let

f(x, y) = (e^{−x/y} e^{−y})/y for 0 < x < ∞, 0 < y < ∞, and 0 elsewhere.

Find the conditional pdf's f(x|y) and f(y|x).

1.3. Show that the integrals of conditional pdf's in their first variables are equal to 1 and that, if
X and Y are independent, the conditional pdf's reduce to the marginal pdf's.
1.4. A small reservoir of water supply has a random amount Y at the beginning of a month and
dispenses a random amount X during the month, with measurements in thousands of gallons. It
is not resupplied during the month, thus making X ≤ Y. Suppose that the joint density of X
and Y is

f(x, y) = 1/2 for 0 < x ≤ y ≤ 2, and 0 elsewhere.

Find the conditional density f(x|y).

Solution. We use the formula f(x|y) = f(x, y)/f_Y(y). Thus we need to find the marginal pdf f_Y:

f_Y(y) = ∫_{x=−∞}^{∞} f(x, y) dx = ( ∫_{x=0}^{y} (1/2) dx ) 1_{(0,2]}(y) = (1/2) y · 1_{(0,2]}(y).

Therefore,

f(x|y) = (1/y) 1_{(0,y]}(x), 0 < y ≤ 2.
2. Bayesian Analysis

The Bayesian Likelihood Function. We continue our discussion of conditional densities, now in
connection with a very important and useful statistical tool, Bayesian analysis. In the context of
the parameter estimation started in section 1, Chapter III, we dealt with the likelihood function

f_n(x₁,…,x_n|θ) = f(x₁|θ) ⋯ f(x_n|θ). (2.1)

We will agree on the following notation: the capital Greek letter Θ, representing the unknown
parameter, will stand for the r.v.; the associated parameter variable in the density will be
denoted by the lowercase letter ϑ (also pronounced theta). In light of this, the likelihood function
(2.1) above will be rewritten as

f_n(x_n|ϑ) = f_n(x₁,…,x_n|ϑ) = f(x₁|ϑ) ⋯ f(x_n|ϑ), (2.2)

in which the joint pdf of the sample, as well as the marginal pdf's of the r.v.'s X₁,…,X_n, will be
regarded as conditional pdf's given the random parameter Θ = ϑ.
Parametric Conditional Distributions. In the forthcoming situations we most often deal with
discrete distributions that depend on a real-valued random parameter which is a continuous r.v.
For example, for a binomial r.v. with parameters (n, P) we can assume that the parameter P is a
continuous r.v. distributed on the interval (0,1); then the distribution
b(n, p; x) = C(n, x) p^x (1 − p)^{n−x} is understood as conditional given P = p.

Posterior Distribution of the Parameter Θ. We target the conditional distribution of the
parameter Θ given X₁ = x₁,…,X_n = x_n, i.e. after we draw a sample from the population; this
will calibrate the prior density f(ϑ) of the random parameter Θ, and we call the obtained density
the posterior. In other words, we are interested in f(ϑ|x₁,…,x_n), which we can calculate using
the conditional density formula from section 1:

f(ϑ|x₁,…,x_n) = f(x₁,…,x_n, ϑ) / g(x₁,…,x_n),

where g(x₁,…,x_n) stands for the marginal distribution of the sample X₁,…,X_n and
f(x₁,…,x_n, ϑ) is the joint pdf of the sample and the parameter Θ. The joint density can be found
by a mere multiplication of the likelihood function f_n of (2.2) and the prior f(ϑ), in accordance
with the conditional density formula (of section 1), i.e.,

f(x₁,…,x_n, ϑ) = f_n(x_n|ϑ) f(ϑ).

As to the marginal distribution g of the sample: to get it, we should integrate over ϑ (as in any
case where a marginal density is required), but this unwanted procedure will be avoided through
a simple recipe that applies to many special cases.
Beta and Gamma Random Variables. A r.v. Θ: Ω → (0,1) is said to be beta with parameters
α > 0 and β > 0 if its pdf is

f(ϑ|α, β) = ( Γ(α+β) / (Γ(α)Γ(β)) ) ϑ^{α−1} (1 − ϑ)^{β−1} 1_{(0,1)}(ϑ), (2.5)

where Γ(α), α > 0, is the gamma function, which reduces to (α − 1)! if α is a positive integer.
The mean of Θ is

E[Θ] = α/(α + β). (2.6)

If α = β = 1, then the beta r.v. readily reduces to the standard uniform r.v.

In section 3, Chapter II, we introduced the gamma r.v. The student is referred to that section,
but for consistency we recall that for a gamma r.v. Θ with parameters (α, β),

E Θ = α/β. (2.8)
The Proportion of Defective Items Revisited. We revisit Example 1.1 of section 1, Chapter III,
about the unknown proportion of defective items; only now the unknown parameter is random.

Example 2.1. Let Θ be the proportion of defective items in a large manufactured lot, which is
unknown, but suppose its prior distribution is uniform on (0,1). (Note that if we do not know a
prior of some proportion, we can always assume that it is standard uniform.) Let a random
sample of n items, X₁,…,X_n, be drawn. X₁,…,X_n are independent Bernoulli r.v.'s with the
parameter θ unknown. Then their sum Σ := X₁ + … + X_n, as we recall (Example 8.3, Chapter
II), is a binomial r.v. with parameters (n, Θ), where Θ is random.

Using formula (1.2), section 1, Chapter III, the conditional density f(x|ϑ) of X_k can be written
as

f(x|ϑ) = ϑ^x (1 − ϑ)^{1−x} 1_{(0,1)}(ϑ) for x = 0, 1, and 0 otherwise, (2.10)

so the likelihood is f_n(x_n|ϑ) = ϑ^k (1 − ϑ)^{n−k} 1_{(0,1)}(ϑ), with k = x₁ + … + x_n (the
number of defective items in the empirical sample, an integer), and the posterior is

f(ϑ|x_n) = (1/g(x_n)) ϑ^k (1 − ϑ)^{n−k} 1_{(0,1)}(ϑ). (2.14)

This must belong to the family of beta densities. Here is why: except for the constant factor
1/g(x_n), (2.14) fits (2.5). We identify the parameter α as k + 1 and β as n − k + 1. Hence the
vacant constant (relative to ϑ) 1/g(x_n) should thus be Γ(n+2)/(Γ(k+1)Γ(n−k+1)) =
(n+1)!/(k!(n−k)!), so that

f(ϑ|x_n) = ( (n+1)! / (k!(n−k)!) ) ϑ^{k+1−1} (1 − ϑ)^{n−k+1−1} 1_{(0,1)}(ϑ). (2.17)
This is the technique that allows us to get the posterior density without integrating
f_n(x_n|ϑ)f(ϑ). Notice that the idea of "guessing" a constant factor for a posterior pdf rests on
the fact that any two pdf's of the form f(ϑ) = a·h(ϑ) and g(ϑ) = b·h(ϑ) must have a = b. This is
very easy to show by integrating both densities.
Example 2.2. Suppose we need to reevaluate the proportion of defective items, and from past
information, 10% defective items occurred with probability 0.7 and 20% defective items occurred
with probability 0.3. Suppose 8 items were selected and 2 of them turned out to be defective. We
need to find the posterior distribution of the proportion of defective items based on this sample.
We take the discrete density

f(0.1) = 0.7, f(0.2) = 0.3 (2.18)

as the prior distribution of the proportion of defective items. Now, the defective items X₁, X₂, …
are independent Bernoulli r.v.'s with the conditional density as in (2.10), while the conditional
density of the sample (i.e., the likelihood function) is

f₈(x₈|ϑ) = ϑ^k (1 − ϑ)^{n−k} = ϑ²(1 − ϑ)⁶,

where k = 2 and n = 8. [Notice that the dependence of the likelihood function on the data is
explicit and only through the sum k = 2.] Consequently, the joint density function of the
parameter Θ and of the sample is

f(x₈, ϑ) = ϑ²(1 − ϑ)⁶ f(ϑ),

with f(ϑ) satisfying (2.18) and ϑ taking on the values 0.1 or 0.2. Now we need to find the
marginal density g(x₈) using an analog of the total probability formula (the discrete analog of
the integral of the joint density):

g(x₈) = (0.1)²(0.9)⁶ · 0.7 + (0.2)²(0.8)⁶ · 0.3,

and

f(0.1|x₈) = (0.1)²(0.9)⁶ · 0.7 / g(x₈) = 0.54, f(0.2|x₈) = (0.2)²(0.8)⁶ · 0.3 / g(x₈) = 0.46.

Notice that f(0.1|x₈) + f(0.2|x₈) = 1. Thus, after the observations, the correction to the prior
distribution (which was 0.7 and 0.3) is 0.54 and 0.46, respectively. Also, the above is nothing
else but the conventional Bayes formula.
Example 2.3. Under the conditions of Example 2.1, assume that the prior of the proportion Θ of
defective items is beta with parameters α and β (instead of standard uniform). Suppose a sample
of n items was drawn and that k = x₁ + … + x_n were defective. Find the posterior pdf of Θ.

Solution. As before,

f(x|ϑ) = ϑ^x (1 − ϑ)^{1−x} 1_{(0,1)}(ϑ) for x = 0, 1, and 0 otherwise, (2.21)

while the prior is

f(ϑ) = ( Γ(α+β) / (Γ(α)Γ(β)) ) ϑ^{α−1} (1 − ϑ)^{β−1} 1_{(0,1)}(ϑ), (2.23)

so the joint density is

f(x_n, ϑ) = ( Γ(α+β) / (Γ(α)Γ(β)) ) ϑ^{α+k−1} (1 − ϑ)^{β+n−k−1} 1_{(0,1)}(ϑ). (2.24)

Matching the ϑ-dependent part with a beta density with parameters α + k and β + n − k, whose
normalizing constant is Γ(α+β+n)/(Γ(α+k)Γ(β+n−k)), we conclude that

f(ϑ|x_n) = ( Γ(α+β+n) / (Γ(α+k)Γ(β+n−k)) ) ϑ^{α+k−1} (1 − ϑ)^{β+n−k−1} 1_{(0,1)}(ϑ), (2.26)

i.e. the posterior is again beta, with parameters α + k and β + n − k.
Example 2.4 (Example 2.3 revisited). Under the conditions of Example 2.3, suppose that the
prior of Θ is beta with parameters α₀ and β₀, i.e.

f(ϑ) = ( Γ(α₀+β₀) / (Γ(α₀)Γ(β₀)) ) ϑ^{α₀−1} (1 − ϑ)^{β₀−1} 1_{(0,1)}(ϑ). (2.27)

We draw the first sample x₁,…,x_{n₁}, in which we assume that k₁ items were defective. Hence
the likelihood function of the sample is

f_{n₁}(x_{n₁}|ϑ) = ϑ^{k₁} (1 − ϑ)^{n₁−k₁} 1_{(0,1)}(ϑ), (2.28)

and the joint pdf of X_{n₁} and Θ is the product of (2.27) and (2.28):

f(x_{n₁}, ϑ) = ϑ^{k₁} (1 − ϑ)^{n₁−k₁} ( Γ(α₀+β₀) / (Γ(α₀)Γ(β₀)) ) ϑ^{α₀−1} (1 − ϑ)^{β₀−1} 1_{(0,1)}(ϑ). (2.29)

Dropping the constant Γ(α₀+β₀)/(Γ(α₀)Γ(β₀)), we arrive at the posterior in proportional form,

f(ϑ|x_{n₁}) ∝ ϑ^{α₀+k₁−1} (1 − ϑ)^{β₀+n₁−k₁−1} 1_{(0,1)}(ϑ), (2.30)

without a multiplicator, figuring that it is beta with parameters α₁ = α₀ + k₁ and
β₁ = β₀ + n₁ − k₁, and thus concluding that the posterior pdf is

f₁(ϑ|x_{n₁}) = ( Γ(α₁+β₁) / (Γ(α₁)Γ(β₁)) ) ϑ^{α₀+k₁−1} (1 − ϑ)^{β₀+n₁−k₁−1} 1_{(0,1)}(ϑ). (2.31)

Suppose now that we conduct yet another experiment (say, experiment 2) by drawing another
sample y₁,…,y_{n₂}, in which k₂ = y₁ + … + y_{n₂} items are defective. Using the posterior
f₁(ϑ|x_{n₁}) of experiment 1 as the prior for experiment 2, we have the likelihood function of the
new sample

f_{n₂}(y_{n₂}|ϑ) = ϑ^{k₂} (1 − ϑ)^{n₂−k₂} 1_{(0,1)}(ϑ), (2.32)

and the joint pdf of Y_{n₂} and Θ is the product of (2.31) and (2.32), proportional to

ϑ^{α₀+k₁+k₂−1} (1 − ϑ)^{β₀+n₁+n₂−k₁−k₂−1} 1_{(0,1)}(ϑ). (2.33)

Hence

f(ϑ|y_{n₂}) ∝ ϑ^{α₀+k₁+k₂−1} (1 − ϑ)^{β₀+n₁+n₂−k₁−k₂−1} 1_{(0,1)}(ϑ), (2.34)

without a multiplicator, figuring that the second posterior is also beta, with parameters
α₂ = α₀ + k₁ + k₂ and β₂ = β₀ + n₁ + n₂ − k₁ − k₂; thus

f₂(ϑ|y_{n₂}) = ( Γ(α₂+β₂) / (Γ(α₂)Γ(β₂)) ) ϑ^{α₂−1} (1 − ϑ)^{β₂−1} 1_{(0,1)}(ϑ). (2.35)
Bayes Estimates and Estimators. We will briefly introduce the notion of the Bayes estimator
with minimum rigor. For further details, the student is referred to sections 4 and 5. Given two
r.v.'s X and Θ and the conditional pdf f(ϑ|x), we define the conditional expectation of the
r.v. Θ given X = x as

E(Θ|X = x) = ∫ ϑ f(ϑ|x) dϑ.

Example 2.5. To obtain the Bayes estimates in the above examples, all we need is to copy the
formula for the associated expectation and adjust its parameters. For instance, using the formula
E[Θ] = α/(α+β) for the expectation of a beta r.v. Θ, in the context of Example 2.1, in which
α = k + 1 and β = n − k + 1, we get the Bayes estimate of θ as

E(Θ|x_n) = (k+1)/(n+2) = (x̄_n + 1/n)/(1 + 2/n), (2.37)

after division by n; this approaches the m.l.e. x̄_n for large n. Evidently, the Bayes estimate is
more accurate than the corresponding m.l.e. x̄_n of θ from section 1, Chapter III, provided we
know a prior of θ.

Now, if in formula (2.37) we replace x̄_n with X̄_n, we will have the so-called Bayes estimator of
θ, which formally is warranted by the replacement of x_n with capital X_n in the conditional
expectation formula. Thus we have

E(Θ|X_n) = (X̄_n + 1/n)/(1 + 2/n). (2.38)
Similarly, with the beta prior of Example 2.3, the posterior

f(θ|xₙ) = [Γ(α+β+n)/(Γ(α+k)Γ(β+n−k))] θ^(α+k−1)(1 − θ)^(β+n−k−1) 1_(0,1)(θ),

belonging to the beta family of r.v.'s with parameters α + k and β + n − k, gives the Bayes estimate

E[θ | xₙ] = (α + k)/(α + β + n) = (x̄ₙ + α/n)/(1 + (α + β)/n)   (2.39)

and the corresponding Bayes estimator

E[θ | Xₙ] = (X̄ₙ + α/n)/(1 + (α + β)/n).   (2.40)
Example 2.7. We use an R program to generate the first sample of 100 Bernoulli r.v.'s with parameter p = 0.1.

# prior is uniform, i.e. beta(1, 1)
# generate a sample of 100 items with defect probability p = 0.1;
# only the number of defectives is needed, so it is drawn as one binomial count
n1 <- 100
sigma0 <- rbinom(1, n1, p = 0.1)
# maximum likelihood estimate from the sample
MaxLik <- sigma0/n1
print(MaxLik)
[1] 0.13    (a second run gave [1] 0.07)
# obtain the Bayes estimate, per (2.37)
alpha1 <- sigma0 + 1
beta1 <- n1 - sigma0 + 1
BayesEst1 <- alpha1/(n1 + 2)
print(BayesEst1)
[1] 0.1372549    (second run: [1] 0.07843137)
# generate a second sample to improve the first Bayes estimate
n2 <- 100
sigma1 <- rbinom(1, n2, p = 0.1)
alpha2 <- alpha1 + sigma1
BayesEst2 <- alpha2/(alpha1 + beta1 + n2)
print(BayesEst2)
[1] 0.1052632    (second run: [1] 0.0990099)

Because we need only the sum k of the Bernoulli observations, we generate a single binomial r.v. from the class [Binomial(100, 0.1)]. Then, since the computations in the next round also depend on the observations only through their sum k, we again generate a binomial r.v. from the class [Binomial(100, 0.1)].
PROBLEMS
2.1. If Xₙ is drawn from an exponential population with parameter Λ > 0 and the prior of Λ is gamma with parameters (α, β), then show that the posterior of Λ is also gamma, with parameters (α + n, β + Σᵢ₌₁ⁿ xᵢ).
2.2. Let θ be the proportion of defective items in a large manufactured lot, which is unknown, but its prior distribution is supposed to be uniform in (0, 1). Let a random sample of 10 items, X₁₀ = (X₁, …, X₁₀), be drawn and suppose just one of them turned out to be defective. Find the posterior density f(θ|x₁₀) = f(θ|x₁, …, x₁₀).
2.3. Suppose that, as in Problem 2.2, a sample of (n₁ =) 10 items from a Bernoulli population was drawn and that exactly one item turned out to be defective (k₁ = 1). In Problem 2.2 we assumed that the prior pdf was standard uniform (i.e. a special case of beta with parameters α₀ = β₀ = 1). Now, assume that a new sample of n₂ = 14 items was drawn from the same population and that k₂ = 4 of them appeared to be defective. Show that the posterior density after the second draw is beta and find its parameters.
2.4. Suppose that in the estimation of an unknown parameter θ, being the proportion of defective items, the prior of θ is known to be a beta density with parameters (α, β). Give the Bayes estimator of θ.
2.5. Under the condition of Problem 2.2, with an unknown parameter θ being the proportion of defective items, the prior density of θ was assumed to be standard uniform. With a sample of 10 items drawn, of which one was defective, give the Bayes estimate of θ.
2.6. In light of Examples 2.4, 2.5, suppose that in the second round of the experiment with the proportion of defective items, 14 items were drawn, of which 4 were defective. Find the Bayes estimate of the proportion of defective items after the second experiment.
Example 3.1. Show that the family 𝒫 of beta pdf's is conjugate for a Bernoulli r.v. with parameter θ ∈ (0, 1).
Solution. Suppose first that X is a Bernoulli r.v. with unknown parameter θ ∈ (0, 1) (previously interpreted as a proportion of defective items). We know from Example 1.1 that if θ is uniform (which is a member of the beta family), then the posterior is beta as per (2.17):

f(θ|xₙ) = [(n+1)!/(k!(n−k)!)] θ^(k+1−1)(1 − θ)^(n−k+1−1) 1_(0,1)(θ).   (3.2)

However, we need to prove this for a more general class of priors to justify the conjugacy of 𝒫. In other words, we now assume that the prior of θ is not just uniform, but an arbitrary beta with parameters α and β:

f(θ) = [Γ(α+β)/(Γ(α)Γ(β))] θ^(α−1)(1 − θ)^(β−1) 1_(0,1)(θ).   (3.3)

Then

f(θ|xₙ) ∝ θᵏ(1 − θ)ⁿ⁻ᵏ θ^(α−1)(1 − θ)^(β−1) 1_(0,1)(θ)   (3.5)

(with the right-hand side of (3.5) being the "principal part" of the density), in which the constant factor Γ(α+β)/(Γ(α)Γ(β)) of (3.3) and g(xₙ) of (3.1) are ignored. We rewrite (3.5) by regrouping the factors as

f(θ|xₙ) ∝ θ^(α+k−1)(1 − θ)^(β+n−k−1) 1_(0,1)(θ),

thus figuring out that f(θ|xₙ) is beta with parameters (α + k, β + n − k). Consequently, we say that the beta family (of priors) is conjugate for the Bernoulli family

f(x|θ) = θˣ(1 − θ)¹⁻ˣ 1_(0,1)(θ), x = 0, 1, and 0 otherwise.   (3.6)
Example 3.2. Show that the gamma family is conjugate for Poisson.
Solution. Let the prior of θ be gamma with parameters (α, β),

f(θ) = [β^α/Γ(α)] θ^(α−1) e^(−βθ), θ > 0,   (3.7)

and let each observation be Poisson,

f(x|θ) = e^(−θ) θˣ/x!, x = 0, 1, ….   (3.8)

Notice that, as in section 1, this conditional density is a mixture of discrete (in x) and continuous (in θ) components. (3.8) yields the likelihood function

fₙ(xₙ|θ) = e^(−nθ) θᵏ/(x₁! ⋯ xₙ!) (the xᵢ's integers),   (3.9)

where

k = x₁ + ⋯ + xₙ.   (3.10)

Recalling (3.1), multiplying (3.9) by (3.7), and ignoring constants, we have the principal part of the posterior,

e^(−(n+β)θ) θ^(α+k−1),

which we figure out is gamma with parameters (α + k, β + n). The rest is obvious.
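The conjugacy can also be verified numerically. Below is a minimal R sketch under assumed values of α, β and simulated Poisson data: the gridded product of prior and likelihood is compared with the closed-form gamma(α + k, β + n) density.

# gamma-Poisson conjugacy check on a grid (all parameter values hypothetical)
set.seed(1)
alpha <- 2; beta <- 3
x <- rpois(10, lambda = 1.5)
n <- length(x); k <- sum(x)
theta <- seq(0.01, 6, by = 0.01)
# unnormalized posterior: prior times likelihood (constants ignored)
post <- dgamma(theta, shape = alpha, rate = beta) * theta^k * exp(-n*theta)
post <- post/sum(post*0.01)                # normalize on the grid
closed <- dgamma(theta, shape = alpha + k, rate = beta + n)
print(max(abs(post - closed)))             # ~ 0 up to grid error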
Example 3.3 (Example 1.4 revisited). Recall that in Example 1.4, the number of connections to a wrong phone number was modeled by a Poisson distribution. Suppose we need to estimate the parameter θ (denoted there by λ) of that distribution (the mean number of wrong connections) by observing a sample x₁, …, xₙ of wrong connections on n different days. Only now, we assume that the prior f(θ) of θ is gamma with parameters α and β. From Example 3.2, it follows that the posterior pdf f(θ|x₁, …, xₙ) is gamma with parameters (α + k, β + n), where k = x₁ + ⋯ + xₙ ≥ 0.
Problem 3.1. The number of connections to a wrong phone number is modeled by a Poisson distribution. Suppose the prior of its parameter Λ is known to be gamma with parameters α = 2 and β = 3. As in section 1, Chapter III, we need to estimate Λ (the mean number of wrong connections) by observing a sample x₁, …, xₙ of wrong connections on n different days. Applying the results of Examples 3.2 and 3.3, find the Bayes estimate of Λ using the fact that on 10 different days there was a total of 20 wrong connections.
DISCRETE CASE
Definition 4.1. Let X and Y be two r.v.'s defined on Ω and valued in a subset Ω′ of the real numbers. Let h: Ω′ → ℝ be a function of the r.v. Y. We assume that the r.v.'s X and Y are discrete. The conditional expectation of h(Y) given the event {X = x} is

E[h(Y) | X = x] = Σ_y h(y) P{Y = y | X = x}.   (4.1)

If we formally drop the little x, having E[h(Y)|X], the latter becomes a function of the r.v. X, say g(X). The conditioning on X means that in place of the single event {X = x} in expression (4.1), we now have a multitude of events generated by the r.v. X.
Example 4.1. Let X₁, X₂, … be a sequence of iid r.v.'s drawn from [X], each with a common mean μ. Suppose we need to find the mean of their sum Sₙ = X₁ + ⋯ + Xₙ, which is E Sₙ = nμ. Now, if we need to calculate the conditional mean of a random sum of them, say S_N = X₁ + ⋯ + X_N given N, then it will be

E[S_N | N] = Nμ.   (4.2)

Now, if X₁, X₂, … are geometric with a common parameter p and N is an integer-valued r.v. independent of the Xᵢ's, then, by (4.2), E[S_N | N] = N/p.
Example 4.2. Let X₁, X₂, … be a sequence of iid r.v.'s drawn from [X], each with a common pgf g(z) = E z^X. Suppose we need to find the distribution of their sum Sₙ = X₁ + ⋯ + Xₙ expressed in the pgf form G(z) = E z^(Sₙ). We have, by Proposition 4.2, Chapter II, G(z) = gⁿ(z). Now, if we need to calculate the pgf of a random sum of them, say X₁ + ⋯ + X_N, where the Xᵢ's are geometric with a common parameter p and N is an integer-valued r.v. independent of the Xᵢ's, then, by (4.4),

E[z^(S_N) | N] = g^N(z) = (pz/(1 − qz))^N, where q = 1 − p.

Property 4.1. (The Law of Double Expectation, or Iterated Conditioning.) Let X and Y be two discrete-valued r.v.'s and let h be a function. Then it holds that

E[h(Y)] = E[E[h(Y)|X]].   (4.6)

Example 4.3. Suppose we need to calculate the pgf of a random sum of r.v.'s from Example 4.1, X₁ + ⋯ + X_N, where the Xᵢ's are independent and geometrically distributed with parameter p and N is also a geometrically distributed r.v. independent of the Xᵢ's, with parameter π > 0. Then, by equation (4.6), we have

E z^(S_N) = E[(pz/(1 − qz))^N] = E y^N = πy/(1 − (1 − π)y), where y = pz/(1 − qz).

Substituting y back and noticing that q + (1 − π)p = 1 − πp, we arrive at

E z^(S_N) = πpz/(1 − (1 − πp)z),

which indicates that X₁ + ⋯ + X_N is, surprisingly, also a geometric r.v., with parameter πp.
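This "geometric compounding" fact is easy to confirm by simulation; a minimal R sketch follows, with hypothetical parameters p and π (note that R's rgeom counts failures, so 1 is added to shift the support to {1, 2, …}).

# geometric(pi) random sum of geometric(p) r.v.'s should be geometric(pi*p)
set.seed(2)
p <- 0.5; pp <- 0.4                      # pp plays the role of pi
sims <- replicate(1e5, {
  N <- rgeom(1, pp) + 1                  # support {1, 2, ...}
  sum(rgeom(N, p) + 1)
})
print(mean(sims))                        # ~ 1/(pi*p) = 5
print(mean(sims == 1))                   # ~ P{S = 1} = pi*p = 0.2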
CONTINUOUS CASE
The notion of the conditional expectation introduced for discrete r.v.'s will now be extended to the continuous and arbitrary cases.
Definition 4.2. Let (Ω, ℱ(Ω), P) be a probability space, let X and Y be two random variables on this probability space, and let h be a Borel measurable function. The conditional expectation of the r.v. h(Y) given the event {ω: X(ω) = x} is defined as

E[h(Y) | X = x] = ∫ h(y) f(y|x) dy.   (4.8)

Example 4.4. Suppose we need to calculate the mgf of a random sum of r.v.'s, say X₁ + ⋯ + X_N, where the Xᵢ's are iid exponential with parameter λ and N is an integer-valued r.v. independent of the Xᵢ's; then

E[e^(θ(X₁+⋯+X_N)) | N] = (λ/(λ − θ))^N.   (4.9)
Property 4.2. (The Law of Double Expectation.) Let X and Y be two continuous r.v.'s and let h be a Borel measurable function. Then it holds that

E[h(Y)] = E[E[h(Y)|X]] = ∫ₓ [ ∫_y h(y) f(y|x) dy ] f_X(x) dx = ∫∫ h(y) f(x, y) dx dy,

using f(x, y) = f(y|x) f_X(x).
Example 4.5. Suppose we need to calculate the mgf of a random sum of r.v.'s, say X₁ + ⋯ + X_N, where the Xᵢ's are iid exponential with parameter λ and N is a geometrically distributed r.v. independent of the Xᵢ's, with parameter p > 0. Then, by the law of double expectation, we have

E e^(θ(X₁+⋯+X_N)) = E[(λ/(λ − θ))^N] = E z^N = pz/(1 − (1 − p)z), where z = λ/(λ − θ).

Substituting z back, we obtain

E e^(θ(X₁+⋯+X_N)) = λp/(λp − θ),   (4.11)

which indicates that X₁ + ⋯ + X_N is an exponential r.v. with parameter λp.
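Again, a quick simulation supports the claim; the following minimal R sketch uses hypothetical λ and p.

# geometric(p) random sum of exponential(lambda) r.v.'s should be
# exponential with parameter lambda*p
set.seed(3)
lambda <- 2; p <- 0.25
sims <- replicate(1e5, {
  N <- rgeom(1, p) + 1                  # geometric on {1, 2, ...}
  sum(rexp(N, rate = lambda))
})
print(c(mean(sims), 1/(lambda*p)))      # both ~ 2
print(ks.test(sims, "pexp", rate = lambda*p)$statistic)  # small distance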
PROBLEMS
4.2. Under the condition of Problem 1.1, find E[X|Y]. Then, using this result, find E X.
4.3. Under the condition of Problem 1.2, find E[X|Y]. Then, using this result, find E X.
4.4. Suppose someone flips a fair coin N times and we need to calculate the number of Heads in the N trials; given N, this number is binomial. However, N is a geometrically distributed r.v. with parameter 1/2. Find the distribution of the number W of Heads in the N trials. Hint: first find the pgf of W, then expand this pgf in a Taylor series.
4.5. (Generalization.) Suppose someone flips a biased coin (with probability p of a Head) N times and we need to calculate the number of Heads in the N trials; given N, this number is binomial. However, N is a geometrically distributed r.v. with parameter a. Find the distribution of the number W of Heads in the N trials. Hint: first find the pgf of W, then expand this pgf in a Taylor series.
The Bayes estimator

δ*(Xₙ) = E[θ | Xₙ]

is the posterior mean of the parameter θ. It can easily be found from the posterior distributions or densities, or just from their parameters, as the following examples show.
Example 5.1. According to Problem 2.1, if Xₙ is drawn from an exponential population with (an unknown) parameter Λ > 0 and the prior of Λ is gamma with parameters (α, β), then the posterior of Λ is also gamma, with parameters (α + n, β + Σᵢ₌₁ⁿ xᵢ). In particular, if Λ is m-Erlang with parameter β for each of its m individual exponential phases, then the posterior of Λ is (m + n)-Erlang with parameter β + Σᵢ₌₁ⁿ xᵢ for each phase.
Recall (cf. formula (2.10)) that the mean of a gamma random variable Λ with parameters (α, β) is E[Λ] = α/β. Therefore, the Bayes estimator δ*(Xₙ) of Λ is

E[Λ | Xₙ] = (α + n)/(β + Σᵢ₌₁ⁿ Xᵢ).   (5.2)
Notice that Λ is the reciprocal of the (conditional) expected value of each member X_k of the population. In other words, since the estimate δ*(xₙ) estimates Λ, (δ*(xₙ))⁻¹ estimates the mean of the "exponential" population. Now, with the observed sample mean value denoted by

x̄ₙ = (1/n) Σᵢ₌₁ⁿ xᵢ,   (5.4)

dividing the numerator and the denominator in (5.3) by n and reciprocating, we can express the reciprocal of δ*(xₙ) in terms of the observed sample mean as follows:

(δ*(xₙ))⁻¹ = (β/n + x̄ₙ)/(α/n + 1).   (5.5)

Hence, for large n, (δ*(xₙ))⁻¹ becomes approximately the observed sample mean (the m.l.e. of the population mean), with no regard to the prior of Λ. So it seems that for a large sample, the sample mean does its m.l.e. job as usual, but for smaller samples, the Bayes estimator of Λ is more accurate. Note that, while (δ*(Xₙ))⁻¹ approaches the sample mean X̄ₙ, the sample mean, for large n, by the Law of Large Numbers, in turn approaches the mean of the population. The latter in our case is the reciprocal of E Λ. Any estimator (like (δ*(Xₙ))⁻¹ of (E Λ)⁻¹) with such a property is called consistent.
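As a small numeric illustration (a sketch with an assumed gamma prior and simulated data), formula (5.2) can be compared with the m.l.e. 1/x̄ₙ of an exponential rate:

# Bayes estimate (alpha + n)/(beta + sum(x)) of an exponential rate, per (5.2)
set.seed(4)
alpha <- 2; beta <- 3                  # hypothetical prior parameters
x <- rexp(10, rate = 1.5)              # small simulated sample
print(c(bayes = (alpha + length(x))/(beta + sum(x)), mle = 1/mean(x)))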
Remark 5.2. In Example 2.4 we arrived at the posterior after the second experiment with the proportion of defective items as beta with parameters α₂ = α₀ + k₁ + k₂ and β₂ = β₀ + n₁ + n₂ − k₁ − k₂. The Bayes estimate then is

E[θ | Xₙ = x_{n₂}] = α₂/(α₂ + β₂) = (α₀ + k₁ + k₂)/(α₀ + β₀ + n₁ + n₂).

It is easy to see that after the k-th experiment, the Bayes estimate would be

E[θ | Xₙ = x_{n_k}] = α_k/(α_k + β_k) = (α₀ + Σᵢ₌₁ᵏ kᵢ)/(α₀ + β₀ + Σᵢ₌₁ᵏ nᵢ),

where nᵢ is the size of the i-th sample and kᵢ is the number of defective items in the i-th sample. The same result, as we see, could be obtained by combining the k samples into one larger sample of size n = Σᵢ₌₁ᵏ nᵢ, with the total number of defective items k = Σᵢ₌₁ᵏ kᵢ, in one single experiment. The result formally looks the same. However, conducting these experiments sequentially allows us to draw smaller, more "affordable" samples and also to spread them over a longer time period, which is practically more rational.
PROBLEMS
5.1. Under the condition of Example 2.1, give the Bayes estimator of the parameter θ (the proportion of defective items) with the uniform prior.
5.2. Suppose that in the estimation of an unknown parameter θ, being the proportion of defective items, the prior of θ is known to be a beta density with parameters (α, β). Give the Bayes estimator of θ.
5.3. Under the condition of Problem 2.2, with an unknown parameter θ being the proportion of defective items, the prior density of θ was assumed to be standard uniform. With a sample of 10 items drawn, of which one was defective, give the Bayes estimate of θ.
5.4. In light of Examples 2.4, 2.5, suppose that in the second round of the experiment with the proportion of defective items, 14 items were drawn, of which 4 were defective. Find the Bayes estimate of the proportion of defective items after the second experiment.
5.5. The number of connections to a wrong phone number is modeled by a Poisson distribution. Suppose the prior of its parameter Λ is known to be gamma with parameters α = 2 and β = 3. As in section 1, we need to estimate Λ (the mean number of wrong connections) by observing a sample x₁, …, xₙ of wrong connections on n different days. Applying the results of Examples 3.2 and 3.3, find the Bayes estimate of Λ using the fact that on 10 different days there was a total of 20 wrong connections.
Let Z₁, Z₂ be iid standard Gaussian r.v.'s and consider the random vector Y = (Y₁, Y₂)′ given by

Y₁ = σ₁Z₁ + μ₁
Y₂ = σ₂ρZ₁ + σ₂√(1 − ρ²) Z₂ + μ₂,   (1.1)

or, in matrix form,

Y = AZ + μ,   (1.2)

where μ₁, μ₂ are real constants, σᵢ > 0, i = 1, 2, ρ ∈ (−1, 1), and A is the 2 × 2 matrix with rows (σ₁, 0) and (σ₂ρ, σ₂√(1 − ρ²)). The random vector Y is called a bivariate normal random vector. Its joint density is

f_Y(y) = [1/(2πσ₁σ₂√(1 − ρ²))] exp{ −[1/(2(1−ρ²))] [ ((y₁−μ₁)/σ₁)² − 2ρ((y₁−μ₁)/σ₁)((y₂−μ₂)/σ₂) + ((y₂−μ₂)/σ₂)² ] }.   (1.7)

Thus,

ρ = Cov(Y₁, Y₂)/√(Var Y₁ · Var Y₂)   (1.11)

is the correlation coefficient. If ρ = 0, then Cov(Y₁, Y₂) = 0 and thus Y₁ and Y₂ are uncorrelated; we then have from (1.8)-(1.9) that Y₁ and Y₂ reduce to affine transformations of Z₁ and Z₂, respectively, and thus are independent. Therefore, Y₁ and Y₂ are independent if and only if they are uncorrelated, or if and only if the correlation coefficient ρ = 0.
Furthermore, (1.8) shows that Y₁ is an affine transformation of a standard Gaussian r.v., and thus the marginal distribution of Y₁ is Gaussian with parameters (μ₁, σ₁²). From (1.9) we have that Y₂ is a linear combination of two independent standard Gaussian r.v.'s plus a constant; thus it is Gaussian with mean μ₂ and variance σ₂².
The covariance matrix of Y is

K = [σ₁² σ₁σ₂ρ; σ₁σ₂ρ σ₂²],   (1.12)

with

det K = σ₁²σ₂²(1 − ρ²).

Furthermore,

K⁻¹ = [1/(1 − ρ²)] [1/σ₁² −ρ/(σ₁σ₂); −ρ/(σ₁σ₂) 1/σ₂²],   (1.14)

or in the form

K⁻¹ = (1/det K) [σ₂² −σ₁σ₂ρ; −σ₁σ₂ρ σ₁²].   (1.15)

A direct calculation shows that the quadratic expression in the exponent of (1.7) equals (y − μ)′K⁻¹(y − μ).   (1.16)

Therefore, in light of (1.16), equation (1.7) can be rewritten in terms of the covariance matrix K:

f_Y(y) = [1/(2π√(det K))] exp{ −½ (y − μ)′K⁻¹(y − μ) }.   (1.17)
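A minimal R sketch (with hypothetical μ, σ₁, σ₂, ρ) that generates bivariate normal pairs via the affine map (1.1)-(1.2) and checks that the sample covariance matrix approaches K = AA′:

# bivariate normal via Y = A Z + mu, per (1.1)-(1.2)
set.seed(5)
mu <- c(1, -2); s1 <- 2; s2 <- 1; rho <- 0.6
A <- matrix(c(s1, 0,
              s2*rho, s2*sqrt(1 - rho^2)), 2, 2, byrow = TRUE)
Z <- matrix(rnorm(2e5), nrow = 2)        # two rows of iid standard Gaussians
Y <- A %*% Z + mu
print(cov(t(Y)))                         # ~ K
print(A %*% t(A))                        # K: 4 and 1 on the diagonal, 1.2 off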
Problem 1.1. Find the marginal densities of a bivariate normal random vector.
If Yₙ = r(Xₙ) is such that yₙ = Axₙ, where A is an n × n nonsingular matrix, then the joint pdf of Yₙ satisfies the following formula:

f_{Yₙ}(yₙ) = (1/|det A|) f_{Xₙ}(A⁻¹yₙ).

The results of section 1 can be generalized as follows. Let Z₁, Z₂, … be iid standard Gaussian r.v.'s, let A be an n × n nonsingular matrix, let μ ∈ ℝⁿ be a constant vector, and let Z := (Z₁, …, Zₙ)′. Then the random vector

Y = Yₙ = AZ + μ   (2.1)

is called an n-variate normal random vector. Since

f_Z(xₙ) = (1/√(2π)) e^(−x₁²/2) ⋯ (1/√(2π)) e^(−xₙ²/2) = (2π)^(−n/2) exp{−½(x₁² + ⋯ + xₙ²)} = (2π)^(−n/2) exp{−½ ||xₙ||²},   (2.6)

we arrive, by Lemma 2.1, at

f_Y(yₙ) = (2π)^(−n/2) (1/|det A|) exp{−½ (y − μ)′(Cov Y)⁻¹(y − μ)}.

Furthermore, since

Cov Y = K = AA′,

and hence det K = (det A)², we have

f_Y(yₙ) = (2π)^(−n/2) (det K)^(−1/2) exp{−½ (y − μ)′K⁻¹(y − μ)}.   (2.9)

In addition,

E Y = μ.   (2.10)

Consequently, the distribution of Y is fully specified by the pair (μ, K); we write Y ∈ N(μ, K).
Theorem 2.4. Let Y = (Y₁, …, Yₙ)′ be an n-variate normal vector. Then Y₁, …, Yₙ are independent if and only if they are pairwise uncorrelated.
Theorem 2.6. Suppose Y is a random vector whose components Y₁, …, Yₙ are independent Gaussian r.v.'s with means μ₁, …, μₙ and common variance σ². Let W = AY, where A is an orthogonal matrix. Then the components W₁, …, Wₙ of the vector W are independent Gaussian r.v.'s with common variance σ². In particular, if Y₁, …, Yₙ are independent standard Gaussian, then W₁, …, Wₙ are also independent standard Gaussian.
Proof. Denote μ = (μ₁, …, μₙ)′ and let X = (1/σ)(Y − μ). Then the elements of X are iid standard Gaussian r.v.'s. Let

W = AY = σ(AX + (1/σ)Aμ).

Since the elements of X are iid standard Gaussian, by Theorem 2.5, the vector AX + (1/σ)Aμ has independent Gaussian components, each with variance 1. Thus W has independent Gaussian components with common variance σ² and means calculated from Aμ.
If Y₁, …, Yₙ are standard Gaussian, then μ is the zero vector, Aμ = 0, and σ² = 1, which yields the second part of the statement.
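A minimal R sketch of Theorem 2.6, with a 2 × 2 rotation matrix as a simple orthogonal A (all numbers hypothetical): the transformed components keep the common variance and stay uncorrelated.

# orthogonal transform of independent Gaussians, per Theorem 2.6
set.seed(6)
phi <- pi/6
A <- matrix(c(cos(phi), -sin(phi),
              sin(phi),  cos(phi)), 2, 2, byrow = TRUE)  # rotation, A' = A^{-1}
Y <- matrix(rnorm(2e5, mean = 0, sd = 2), nrow = 2)      # iid N(0, 4) components
W <- A %*% Y
print(cov(t(W)))    # ~ 4*I: same common variance, off-diagonals ~ 0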
Theorem 2.7. Let Y ∈ N(μ, K) be an n-variate Gaussian random vector, let B be an n × n nonsingular matrix, and let b ∈ ℝⁿ. Then

V := BY + b

is an n-variate Gaussian vector with mean Bμ + b and covariance matrix B K B′. In particular, if Y₁, …, Yₙ are pairwise uncorrelated (thus independent) and have a common variance σ², then

Cov V = B (Cov Y) B′ = σ²BB′.

Now, if B is orthogonal, BB′ = I, and thus the components of the vector V are also independent, with the same common variance σ².
PROBLEMS
2.3. Derive the mgf for the bivariate normal r.v. given in (1.1).
2.4. Give the joint pdf of an n-variate normal vector whose components are pairwise uncorrelated.
Solution. Suppose the components Y₁, …, Yₙ of the vector Y are pairwise uncorrelated. Thus K is the diagonal matrix

K = diag(σ₁², …, σₙ²).   (2.14)

Furthermore,

f_Y(yₙ) = ∏ᵢ₌₁ⁿ [1/(√(2π) σᵢ)] exp{ −(yᵢ − μᵢ)²/(2σᵢ²) }.   (2.15)
2.5. Specify the condition imposed on the matrix A in order that Y₁, …, Yₙ be independent. How does K look in terms of the elements of the matrix A in this case?
2.6. Let Y = (Y₁, …, Yₙ)′ ∈ N(μ, K) and let α ∈ ℝⁿ. Show that α′Y ∈ N(α′μ, α′Kα).
2.7. Prove Theorem 2.5: Let Y = AZ + μ be an n-variate normal random vector generated by an n × n orthogonal matrix A, a random vector Z = (Z₁, …, Zₙ)′ of iid standard Gaussian r.v.'s, and μ ∈ ℝⁿ. Recall (Definition 2.2) that a nonsingular n × n matrix A is called orthogonal if A′ = A⁻¹. Prove that the components Y₁, …, Yₙ of Y are independent and that Var Yᵢ = 1 for all i = 1, …, n. Furthermore, Y₁, …, Yₙ are identically distributed if and only if μ = μ(1, …, 1)′. In particular, if μ₁ = ⋯ = μₙ = 0, then Y₁, …, Yₙ are independent standard Gaussian r.v.'s.
Solution. Since K = AA′ = AA⁻¹ = I, the elements of Cov Y are zero except on the main diagonal. Hence Y₁, …, Yₙ are pairwise uncorrelated. By Theorem 2.4, they are also independent. Moreover, Var Yᵢ = σᵢ² = 1, i = 1, …, n, and thus the Yᵢ's are identically distributed if and only if μ₁ = ⋯ = μₙ. The rest is trivial.
Suppose we want to find out how increasing the amount x of a certain chemical in the soil increases the amount y of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between x and y can be approximated by the linear equation

y = β₀ + β₁x.   (1.1)

Now, if we run several experiments with the same x using nearly identical soils and plants, we will find that the values of y are not the same. At the same time, if we use different values x₁, x₂, … of x and expect to observe y(x₁), y(x₂), … from equation (1.1), we will see that the actually observed values y₁, y₂, … differ from y(x₁), y(x₂), ….
For this reason, we need to extend the deterministic model driven by equation (1.1) to a more flexible stochastic model with the random line

Y(x) = β₀ + β₁x + ε,   (1.2)

whose only random element ε is a random error generated by random variations due to differences between plants, soils, measurements, and other possible factors.
Observing responses at the control values x₁, …, xₙ leads to

Y(xᵢ) = β₀ + β₁xᵢ + εᵢ, i = 1, …, n,   (1.3)

with the εᵢ's being presumably independent and identically distributed. If, in addition, the εᵢ's are Gaussian with zero mean and Var εᵢ = σ², then the corresponding model is referred to as the linear regression model.
Notice that the presence of the r.v.'s εᵢ in (1.3) explains the observed vertical deviations of the yᵢ's from y(xᵢ) = β₀ + β₁xᵢ. Consequently, the Y(xᵢ)'s of (1.3) must "predict" the yᵢ's. In other words, we can regard the yᵢ's as observed values of the Y(xᵢ)'s.
Now, to find the most suitable β₀, β₁, one uses the "least square estimates" β̂₀, β̂₁. Letting the sum of the squared error deviations

Q(β₀, β₁) = Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]²   (1.4)

be minimal, we hope to find the two values β̂₀ and β̂₁. Once the parameters β̂₀ and β̂₁ that minimize Q(β₀, β₁) are found, the line ŷ = β̂₀ + β̂₁x can be drawn so that it best fits all the points (xᵢ, yᵢ), i = 1, …, n, dispersed around this line. In Figure 1.1 below we drew a line with yet unspecified β₀ and β₁ through a group of scattered points. The "best" line ŷ = β̂₀ + β̂₁x can be used to interpolate or extrapolate the values of chemicals in the plants for any given value x of the chemical put in the soil in the above illustration. Notice that in the objective function Q(β₀, β₁) only the vertical deviations of the (xᵢ, yᵢ)'s from the line are taken into account.
Figure 1.1
In (1.4), differentiating Q partially in β₀ and β₁ and setting the obtained equations to zero, we can solve them w.r.t. β₀ and β₁, denoting the solutions by a = β̂₀ and b = β̂₁:

a := β̂₀ = ȳₙ − β̂₁x̄ₙ   (1.5)

b := β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)(yᵢ − ȳₙ) / Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)².   (1.6)

Applying the familiar "second partials" test to β̂₀ and β̂₁, we find that, in the form of (1.5) and (1.6), they indeed minimize Q. However, β̂₀ and β̂₁ do only a part of the job. We still have σ² undetermined, which contributes to a larger picture of the regression line Y.
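A minimal R sketch of (1.5)-(1.6) on hypothetical data, checked against R's built-in lm():

# least square estimates a and b, per (1.5)-(1.6)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)
b <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
a <- mean(y) - b*mean(x)
print(c(a, b))
print(coef(lm(y ~ x)))    # the same two values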
We are back to the stochastic version of the deterministic line y = β₀ + β₁x as per (1.2). The error ε, as previously mentioned, can be due to unobservable factors that corrupt measurements. In our next step, we consider a random sample Y₁, …, Yₙ such that

Yᵢ = β₀ + β₁xᵢ + εᵢ, with the εᵢ's iid N(0, σ²).   (1.8)

If we form the likelihood function of the sample Y₁, …, Yₙ, then even though the Yᵢ's are not identically distributed, we can still use a method similar to that of section 1, Chapter III, to find the values of β₀, β₁, and σ² that maximize the likelihood function of the sample. It turns out that the m.l.e.'s of β₀ and β₁ are precisely the values a = β̂₀ and b = β̂₁ of (1.5) and (1.6). The m.l.e. σ̂² of σ² satisfies the formula

σ̂² = (1/n) sse,   (1.9)

where

sse = Σᵢ₌₁ⁿ [yᵢ − (β̂₀ + β̂₁xᵢ)]².   (1.10)
We use hypothesis testing of the parameters β₀ and β₁ in the general form below, choosing some fixed constants c₀, c₁, and c*:

H₀: c₀β₀ + c₁β₁ = c*   (2.1)

against

H₁: c₀β₀ + c₁β₁ ≠ c*.   (2.2)

This amounts to a large variety of special cases for different choices of c₀, c₁, and c*.
As in section 1, Chapter III, we replace the yᵢ's in (1.5), (1.6), and (1.9) with their random versions Yᵢ of (1.8), thereby changing a, b, and σ̂² from m.l.e.'s to MLE's, in notation A, B, and K².
To test the veracity of the null hypothesis H₀ of (2.1) we define the statistic

T = c₀A + c₁B   (2.3)

with mean

μ_T = c₀β₀ + c₁β₁   (2.4)

and variance

σ_T² = σ²[ c₀²/n + (c₀x̄ₙ − c₁)²/s_x² ],   (2.5)

where, as we recall,

s_x² := Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)².   (2.5a)

Thus if the null hypothesis (2.1) is true, μ_T of (2.4) must equal c*, and since T ∈ [N(μ_T, σ_T²)], the statistic

W₀₁ = (T − c*)/σ_T ∈ [N(0, 1)]   (2.6)

is standard Gaussian. Notice that the subscripts 01 in W₀₁ are meant to represent the respective subscripts of β₀ and β₁, rather than the fact that W₀₁ ∈ N(0, 1), although the latter holds true.
An important special case of (2.1) arises when c₀ = 0, c₁ = 1, and c* = 0,   (2.7)
that is,

H₀: β₁ = 0   (2.8)

against H₁: β₁ ≠ 0,   (2.9)

thus meaning that if H₀ is true, then in the regression line y = β₀ + β₁x, y does not (linearly) depend on x.
We would like to discuss this special case. Under the restrictions (2.7), the T, μ_T, σ_T² of (2.3)-(2.5) reduce to

T = B,   (2.10)

μ_T = β₁,   (2.11)

σ_T² = σ²/s_x².   (2.12)

Thus if the null hypothesis H₀ of (2.8) is true, the mean μ_T of T must be zero, and since T ∈ [N(μ_T, σ_T²)], the r.v.

W₁ := T/σ_T ∈ [N(0, 1)]   (2.13)

will be a test statistic for the parameter β₁. Consequently, it stands to reason to test the statistic W₁ for being standard Gaussian. This approach, however, has a serious shortcoming, since in σ_T² of (2.5) the variance σ² is unknown.
Now, if in σ̂² of (1.9) we replace the yᵢ's with the Yᵢ's, and a and b with A and B, we arrive at

K² = (1/n) SSE.   (2.14)

This is the MLE of σ², which however is biased. We will use the unbiased estimator

K′² = SSE/(n − 2).   (2.15)

Notice that T is normal regardless of whether or not μ_T = 0, but it fails to be standard normal if H₀ is not true. Furthermore, if σ² is replaced with K′², it is not standard normal even if the null hypothesis is true, and thus we need to figure out an alternative test for the veracity of H₀. Consider, in place of W₁, the statistic

U₁ = (T/σ_T)(σ/K′) = B s_x/K′,   (2.16)

in which the unknown σ cancels. The r.v. U₁ turns out to be a t-r.v. with n − 2 degrees of freedom. Hence the null hypothesis is true if and only if U₁ belongs to the [t_{n−2}] equivalence class.
Recall that the t-r.v. with k degrees of freedom has a pdf looking very much like the pdf of the standard Gaussian r.v., being a bell-shaped curve, symmetric about the y-axis and with mean equal to zero for k > 1. (For k = 1 the mean does not exist.)
3. The Procedure
In this section we develop a procedure which leads to decision making in the root problem of accepting or rejecting the null hypothesis. The veracity of H₀ is checked by whether or not U₁ belongs to a specified "critical region", say C ⊆ ℝ. If it does, the null hypothesis is rejected. The critical region C is established based on how a true t_{n−2} r.v. behaves. The criterion for finding C should logically depend on whether or not the tail of the distribution of t_{n−2}, P{|t_{n−2}| > c}, is small for some positive value c. The critical region will thus be the set

C = (−∞, −c) ∪ (c, ∞), with   (3.1)

P{|t_{n−2}| > c} = α,   (3.2)

where the level of significance α is usually chosen to be 0.05. For instance, if n = 12, we will have t₁₀ (with 10 degrees of freedom), and from the tables for t₁₀ we find that

c = 2.228.   (3.3)

Consequently, if the test statistic U₁ behaves like t_{n−2}, then the empirical value u₁ (based on the empirical values yᵢ of the Yᵢ's) of U₁ will lie in the region (−c, c) = (−2.228, 2.228), complementary to C, with "confidence" 1 − α = 0.95. The chief reason why 1 − α is used as a confidence, rather than a probability, is that an empirical value u₁ of U₁ replaces the r.v. U₁. (The same reasons as those we used when arguing about confidence intervals in the context of the Central Limit Theorem.)
The empirical value u₁ is like that of U₁, with the Yᵢ's replaced by their "observed" values yᵢ:

u₁ = b s_x/k′ = b s_x / √{ (1/(n−2)) Σᵢ₌₁ⁿ [yᵢ − (β̂₀ + β̂₁xᵢ)]² },   (3.4)

where

k′² = (1/(n−2)) Σᵢ₌₁ⁿ [yᵢ − (β̂₀ + β̂₁xᵢ)]²   (3.5)

is the empirical unbiased sample variance, and β̂₀ and β̂₁ satisfy (1.5)-(1.6). In a nutshell, u₁ is calculated from (1.5)-(1.6) and (3.4) and checked for whether or not it belongs to the critical region C. For α = 0.05 and n = 12 the latter is C = (−∞, −2.228) ∪ (2.228, ∞). If u₁ ∈ C, then the null hypothesis is rejected. Otherwise, the null hypothesis is either not rejected or will undergo yet another, more refined testing.
The associated P-value is

P = P{|t_{n−2}| ≥ |u₁|},

where u₁ is the value satisfying equation (3.4). Unlike decision making by first setting a significance level and then calculating the associated critical region, the P-value implies a simpler logic. Namely, if the P-value is small, we conclude that deviations of the genuine t-r.v. beyond |u₁| are unlikely, and thus we reject the null hypothesis.
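A minimal R sketch of the whole test on hypothetical data, with (3.4) checked against summary(lm(...)):

# t test of H0: beta1 = 0, per (3.4), and the p-value of section 3
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 2.4, 3.1, 2.8, 3.9, 4.2)
n <- length(x)
b <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
a <- mean(y) - b*mean(x)
kprime <- sqrt(sum((y - a - b*x)^2)/(n - 2))       # unbiased estimate of sigma
u1 <- b*sqrt(sum((x - mean(x))^2))/kprime
pval <- 2*(1 - pt(abs(u1), df = n - 2))
print(c(u1, pval))
print(summary(lm(y ~ x))$coefficients["x", c("t value", "Pr(>|t|)")])  # same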
Introduce

s_xx := Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)², s_xy := Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)(yᵢ − ȳₙ), s_yy := Σᵢ₌₁ⁿ (yᵢ − ȳₙ)²,

so that b = s_xy/s_xx, a = ȳₙ − b x̄ₙ, and sse = s_yy − b s_xy. Next,

k′² := (1/(n−2)) sse,   (4.9)

which is the square of the denominator of (3.4) (the unbiased empirical sample variance). Finally, formula (3.4) can be rewritten as

u₁ = b/denom,   (4.10)

where

denom := k′/√(s_xx).

The above formulas can be readily incorporated into spreadsheets in Excel or QuattroPro office utilities.
Example 4.1. We test the stock AmeriGas (ticker GASFX) against IBM, with their "linear" independence conjectured as the null hypothesis. We take twelve weekly values of both stocks from February 2008 through April 2008.

 i   Date    GASFX xᵢ   (xᵢ−x̄)²    IBM yᵢ    (yᵢ−ȳ)²     (xᵢ−x̄)(yᵢ−ȳ)
 1   04/21   20.29      0.367034   123.67    85.22367     5.5928513
 2   04/14   20.44      0.571284   124.40    99.23480     7.5293597
 3   04/07   19.65      0.001167   116.00    2.438802    −0.0533569
 4   03/31   19.67      0.000200   115.76    1.746802    −0.0187236
 5   03/24   19.14      0.296117   114.57    0.017336    −0.0718488
 6   03/17   18.98      0.495850   118.33    15.14506    −2.7403819
 7   03/10   19.44      0.059617   115.23    0.626736    −0.1932986
 8   03/03   19.43      0.064600   113.94    0.248336     0.1266597
 9   02/25   19.51      0.030334   113.86    0.334469     0.1007264
10   02/19   19.97      0.081700   108.07    40.55566    −1.8202282
11   02/11   19.95      0.070667   106.16    68.53080    −2.2006569
12   02/04   19.74      0.003117   103.27    124.7316    −0.6235653

x̄ₙ = 19.68417, ȳₙ = 114.43, s_xx = 2.041691, s_yy = 438.8341667, sse = 423.32;
β̂₁ = b = 2.7563 (b x̄ₙ = 54.257, b s_xy = 15.512), β̂₀ = a = 60.181;
k′² = 42.33, denom = 4.55345, u₁ = 0.6053.

Figure 4.1

The value u₁ = 0.6053 does not belong to the critical region C = (−∞, −2.228) ∪ (2.228, ∞), and therefore the null hypothesis (that AmeriGas is linearly independent of IBM) should not be rejected. It thus stands to reason to assume that these two companies had a limited (if any) impact on each other.
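The same numbers come out of R's lm(); a minimal sketch with the twelve closings of the table above (most recent first):

# Example 4.1 via lm(); IBM is the response, GASFX the control variable
gasfx <- c(20.29, 20.44, 19.65, 19.67, 19.14, 18.98,
           19.44, 19.43, 19.51, 19.97, 19.95, 19.74)
ibm <- c(123.67, 124.40, 116.00, 115.76, 114.57, 118.33,
         115.23, 113.94, 113.86, 108.07, 106.16, 103.27)
print(summary(lm(ibm ~ gasfx))$coefficients)  # slope ~ 2.756, t value ~ 0.605
print(qt(0.975, df = 10))                     # critical value ~ 2.228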
PROBLEMS
1. The weekly closing prices (see the table below) in US$ from the stock exchange for American Gas GASFX (x-values) and Baxter International (y-values) were recorded from February 4, 2008 to April 21, 2008. Calculate the least square coefficients a and b for the deterministic linear regression line y = β₀ + β₁x.
2. Under the conditions specified in Problem 1, calculate the sse (sum of squared errors) value

sse = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)².

3. Under the conditions specified in Problem 1, test the hypothesis H₀: β₁ = 0 (i.e., that the weekly closings of American Gas are "linearly" independent of the Baxter prices) against its alternative H₁: β₁ ≠ 0 at the significance level α = 0.05. Among other things, calculate u₁. Also, find the p-value and draw your conclusion about the testing.
2) it additionally depended on an additive random term ε ∈ N(0, σ²), with σ² unknown
5) altogether, the regression line depended on three parameters β₀, β₁, and σ², for which we identified the maximum likelihood estimators B₀, B₁, and K²
7) the empirical values y₁, …, yₙ, along with the "observed" control variables x₁, …, xₙ, were used to find initially the deterministic values β̂₀ and β̂₁ of β₀ and β₁ using the least square method, yielding a deterministic regression line ŷ = β̂₀ + β̂₁x
8) the impact of a presumed additive Gaussian error ε brought us to a likelihood function whose maximum value was reached at β̂₀ and β̂₁ (in agreement with the least squares) and, in addition, σ̂², all three being the maximum likelihood estimates of β₀, β₁, and σ², respectively
9) replacing the empirical values y₁, …, yₙ with their "hypothetical random priors" Y₁, …, Yₙ turned the maximum likelihood estimates β̂₀, β̂₁, and σ̂² into the maximum likelihood estimators B₀, B₁, and K²
11) we determined a statistical test procedure for the three parameters β₀, β₁, and σ² by using the test statistic U₀₁, a t r.v. with n − 2 degrees of freedom
12) the veracity of the null hypothesis (some relationship between the tested parameters) had to be checked by whether or not the empirical value u₀₁ of U₀₁ falls outside the critical region C, which was established based on the behavior of a genuine t r.v. with n − 2 degrees of freedom and a selected significance level.
In the case of a multiple regression model, the new regression figure depends on more than two deterministic parameters, compared to the simple regression model.
Example 1.1. Suppose that instead of fitting n points by a single straight line, we want to employ a possibly more flexible fit by a (p−1)st degree polynomial of the form

y = β₀ + β₁x + β₂x² + ⋯ + β_{p−1}x^(p−1).   (1.1)

Notice that the p unknown parameters in (1.1) still appear in a linear form, while the control variable x forms a polynomial function.
We can again use the method of least squares to find the βᵢ's minimizing the sum Q of the squares of the vertical deviations of the points from the polynomial curve:

Q(β₀, …, β_{p−1}) = Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ + ⋯ + β_{p−1}xᵢ^(p−1))]².   (1.2)

Calculating the p partial derivatives ∂Q/∂βᵢ, i = 0, 1, …, p−1, and equating each of them to zero, we obtain the following system of p normal equations in the p unknowns β₀, …, β_{p−1}:

Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ + ⋯ + β_{p−1}xᵢ^(p−1))] xᵢʲ = 0, j = 0, …, p−1.   (1.3)

We assume that the rank of the associated matrix of these equations is p, and thus this system has a unique solution. It can be shown that the solution of this system indeed minimizes Q. Let these values be β̂₀, …, β̂_{p−1}. Then the least square polynomial is

ŷ = β̂₀ + β̂₁x + ⋯ + β̂_{p−1}x^(p−1).   (1.4)
Example 1.2. Suppose a new drug a is being tested on a group of n volunteers. In the framework of the simple regression we would use x as a control (input) variable, being some numeric response of a patient to an old drug b, and y as the patient's response to drug a, and represent y as a "linear" function y = β₀ + β₁x w.r.t. the parameters β₀ and β₁.
In a more general case, we are interested in the respective responses y₁, …, yₙ to the new drug compared to the multiple responses x_{i1}, …, x_{i,p−1}, i = 1, …, n, to the old drug b, combined with other readings such as blood pressure, interaction with a supplement, and heart rate. For example, suppose we have p − 1 input variables x₁, …, x_{p−1} representing p − 1 responses to b, listed above, such as blood pressure, interaction with a supplement, heart rate, body temperature, interaction with drugs a, b, and c, red cell count, etc., for each of the n patients in the test group. Furthermore, suppose we would like to represent a generic response y of a patient to drug a as a linear function w.r.t. the βᵢ's of the patient's response to drug b and the other named components, in the form

y = β₀ + β₁x₁ + ⋯ + β_{p−1}x_{p−1} = x′β,   (1.5)

where

x′ = (1, x₁, …, x_{p−1}) and β = (β₀, …, β_{p−1})′.   (1.5a)
Also in this case, we will use the least square function Q as in (1.2), in a slightly different form:

Q(β) = Σᵢ₌₁ⁿ (yᵢ − xᵢ′β)².   (1.6)

The corresponding system of normal equations, after the same routine, will read

Σᵢ₌₁ⁿ (yᵢ − xᵢ′β) x_{ij} = 0, j = 0, …, p−1 (with x_{i0} = 1).   (1.7)

Again, after finding the unique solution β̂₀, …, β̂_{p−1} (assuming this is the case), we have the least-square hyperplane

ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂_{p−1}x_{p−1}.   (1.8)

Comparing both examples, we conclude that Example 1.1 is a special case of Example 1.2, with x_{ij} of 1.2 equal to xᵢʲ of 1.1. It therefore stands to reason to consider the general linear model

y = β₀ + Σⱼ₌₁^(p−1) βⱼxⱼ   (1.9)

with its stochastic version Y = β₀ + Σⱼ₌₁^(p−1) βⱼxⱼ + ε, so that

E Y = β₀ + Σⱼ₌₁^(p−1) βⱼxⱼ.   (1.14)

Thus, a single multiple regression plane Y includes p unknown parameters β₀, …, β_{p−1}, the same number of control variables 1, x₁, …, x_{p−1}, and one random error ε ∈ [N(0, σ²)]. With n independent observations of Y and the associated control variables x₀ = 1, x₁, …, x_{p−1}, the complete model in matrix form becomes

Y = Cβ + ε,

where the matrix C of control variables (the design matrix) is introduced in section 2 below.
PROBLEMS
1.1. Suppose y is the current value of a house, and x₁ is the square footage of the living area (in thousands of sq. ft.), x₂ is the location (numerical value of the indicator zone: from 1 to 10; 10 is the best), x₃ is the state mean appraised value for the past three years (in $100,000), x₄ the wind resistance of the roof (the hurricane code: 1 if it is up to the code, 0 if it is not), x₅ additional hurricane protection, such as shutters, panels, etc. (ranked from 0 to 10), x₆ flood zone (from −5 to 5), x₇ total acreage of the property. (a) Set up the corresponding linear regression model. (b) Modify the model by allowing a third degree polynomial for the first control variable only, and linear terms otherwise.
Suppose we have 12 observations of the current value of the house, collected from 12 different homes.
The likelihood function of the sample is

fₙ(y; x′) = [1/(2πσ²)^(n/2)] exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ [yᵢ − (β₀ + x_{i1}β₁ + ⋯ + x_{i,p−1}β_{p−1})]² }.   (2.1)

We denote, as previously, by β̂₀, …, β̂_{p−1} the values of β₀, …, β_{p−1} that maximize the above likelihood function. For a fixed σ², we need to minimize the same objective function Q as in (1.6):

Q(β) = Σᵢ₌₁ⁿ [yᵢ − (β₀ + x_{i1}β₁ + ⋯ + x_{i,p−1}β_{p−1})]².   (2.2)

Therefore, the minimum square error values β̂₀, …, β̂_{p−1} of β₀, …, β_{p−1} will be the same ones that solve system (1.7). They will also be the m.l.e.'s of β₀, …, β_{p−1}. Now, replacing β₀, …, β_{p−1} with β̂₀, …, β̂_{p−1} in (2.1), we can also find the m.l.e. σ̂² of σ² as

σ̂² = (1/n) sse,   (2.3)

where

sse = Σᵢ₌₁ⁿ [yᵢ − (β̂₀ + x_{i1}β̂₁ + ⋯ + x_{i,p−1}β̂_{p−1})]².   (2.4)
Then, replacing the yᵢ's with the Yᵢ's and the β̂_k's with the B_k's, we will have p + 1 MLE's of β₀, …, β_{p−1} and σ², with the MLE of σ²:

K² = (1/n) SSE,

where

SSE = Σᵢ₌₁ⁿ [Yᵢ − (B₀ + x_{i1}B₁ + ⋯ + x_{i,p−1}B_{p−1})]²

stands for the associated stochastic sum of square errors. Also,

K′² = (1/(n−p)) SSE   (2.6)

will serve as its unbiased version (cf. section 3). Next,

C = the n × p matrix with rows (1, x_{i1}, …, x_{i,p−1}), i = 1, …, n,   (2.7)

is the design matrix. We assume that n > p and that the rank of the matrix is p. The other notation will be as follows:

y = (y₁, …, yₙ)′,   (2.8)
Y = (Y₁, …, Yₙ)′,   (2.9)
β = (β₀, …, β_{p−1})′,   (2.10)
β̂ = (β̂₀, …, β̂_{p−1})′,   (2.11)
B = (B₀, …, B_{p−1})′.   (2.12)
For example, in light of the notation (2.7)-(2.12), the objective function Q can be represented as

Q(β) = (y − Cβ)′(y − Cβ),   (2.13)

and the system of normal equations (1.7) reads

C′Cβ̂ = C′y,   (2.14)

where the index k is replaced with p − 1. System (2.14) obviously has a unique solution β̂ if and only if the matrix C′C is nonsingular. This in turn holds if the number of observations n is at least p and, furthermore, there are at least p linearly independent rows in the matrix C. In this case, we have from (2.14)

β̂ = RC′y,   (2.15)

where

R = (C′C)⁻¹,   (2.16)

and similarly,

B = RC′Y.   (2.17)

In matrix form,

sse = (y − Cβ̂)′(y − Cβ̂),   (2.18)

which (after opening the parentheses and using (2.14)) reduces to

sse = (y − Cβ̂)′y,   (2.19)

and

σ̂² = (1/n) sse.   (2.20)
In scalar form, the model Y = Cβ + ε reads

Yᵢ = β₀ + Σⱼ₌₁^(p−1) x_{ij}βⱼ + εᵢ, i = 1, …, n,   (2.21)

where

ε = (ε₁, …, εₙ)′.   (2.22)

Substituting Y = Cβ + ε into (2.17) yields

B = β + RC′ε, or in the form B = β + RC′σZ,   (2.23)

where β and B are defined in (2.10) and (2.12) and Z ∈ ℝⁿ is a random vector of independent standard Gaussian r.v.'s.
Obviously, the vector B has each of its components normal, while jointly B does not automatically fall into the category of multivariate normal of (2.1), since RC′σ = σ(C′C)⁻¹C′, being a p × n matrix, is not necessarily a square matrix. However, if we multiply B by an n × p matrix G, we obviously have GB = Gβ + σGRC′Z, which is n-variate normal whenever GRC′ is nonsingular. Furthermore,

Cov B = RC′(σ² Cov Z)CR′,

and since Cov Z = Iₙ (the n × n identity matrix) and R′ = R (a p × p matrix), we have from the last equation

Cov B = σ²RC′CR′ = σ²R,   (2.25)

provided that R⁻¹ = C′C is nonsingular, after taking into account the well-known property (A⁻¹)′ = (A′)⁻¹ for a nonsingular square matrix A. Equation (2.25), along with the obvious

E B = β   (2.26)

from (2.23) (i.e., B is an unbiased estimator of β), gives all the first and second order characteristics of the random vector B.
Example 2.1. Consider the case of a simple linear regression whose control and response variables are given in Table 2.1 below:

 i    xᵢ    yᵢ
 1    −2    0
 2    −1    0
 3     0    1
 4     1    1
 5     2    3

Table 2.1

The simple linear regression line is y = β₀ + β₁x, for which we would like to find β̂₀ and β̂₁ by using matrix operations. The design matrix is C = [1 −2; 1 −1; 1 0; 1 1; 1 2] (rows listed; each row is (1, xᵢ)). Then

C′C = [5 0; 0 10], R = (C′C)⁻¹ = [1/5 0; 0 1/10],   (2.29)

C′y = (5, 7)′,   (2.30)

and

β̂ = RC′y = (1, 0.7)′.   (2.31)
Example 2.2. Under the condition of Example 2.1, fit a quadratic polynomial curve to the data of Table 2.1. The quadratic parabola is y = β₀ + β₁x + β₂x². The design matrix will then be

C = [1 −2 4; 1 −1 1; 1 0 0; 1 1 1; 1 2 4] (rows (1, xᵢ, xᵢ²)).   (2.33)

Next,

C′C = [5 0 10; 0 10 0; 10 0 34],   (2.34)

R = (C′C)⁻¹ = [17/35 0 −1/7; 0 1/10 0; −1/7 0 1/14],   (2.35)

C′y = (5, 7, 13)′.   (2.36)

Thus,

β̂ = RC′y = (4/7, 0.7, 3/14)′   (2.37)

and

ŷ(x) = 4/7 + 0.7x + (3/14)x².   (2.38)
Example 2.3. Suppose each of 10 patients is treated with two blood pressure medications, drug A and drug B, and the change of systolic blood pressure in patient i is measured by the responses xᵢ and yᵢ, respectively. The data are recorded in the following table:

 i     xᵢ     yᵢ
 1     1.9    0.7
 2     0.8   −1.0
 3     1.1   −0.2
 4     0.1   −1.2
 5    −0.1   −0.1
 6     4.4    3.4
 7     4.6    0.0
 8     1.6    0.8
 9     5.5    3.7
10     3.4    2.0

Table 2.2

Suppose that for each value x of X, the regression function is a polynomial of the form E[Y|X = x] = β₀ + β₁x + β₂x², so that the design matrix C has the rows (1, xᵢ, xᵢ²). Solving system (2.39) (or using (2.15)) we get the m.l.e.'s of β₀, β₁, β₂:

β̂ = (−0.744, 0.616, 0.013)′.   (2.40)

Specifically, from Table 2.2 and the matrix C, we obtain the matrix C′C and the matrix R = (C′C)⁻¹:

C′C = [10 23.3 90.37; 23.3 90.37 401; 90.37 401 1892.7].   (2.41)

Thus,

R = (C′C)⁻¹ = [0.400 −0.307 0.046; −0.307 0.421 −0.074; 0.046 −0.074 0.014].   (2.42)

In particular, by (2.25), the correlation coefficient of B₀ and B₁ is ρ₀₁ = −0.307/√(0.400 · 0.421) ≈ −0.75.
The m.l.e. of σ² is, by formula (2.3) and using (2.40) (or (2.19) in matrix form),

σ̂² = (1/10) Σᵢ₌₁¹⁰ [yᵢ − (−0.744 + 0.616xᵢ + 0.013xᵢ²)]² = 0.937.   (2.43)
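A minimal R check of (2.40)-(2.43), with the table's data re-entered as vectors:

# Example 2.3 via matrix operations
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)
C <- cbind(1, x, x^2)                  # rows (1, x_i, x_i^2)
R <- solve(t(C) %*% C)
beta.hat <- R %*% t(C) %*% y           # ~ (-0.744, 0.616, 0.013)'
sigma2 <- sum((y - C %*% beta.hat)^2)/length(y)   # ~ 0.937
print(beta.hat); print(sigma2)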
PROBLEMS
2.2. Suppose the observation data for a simple linear regression Y = β₀ + β₁x + ε are presented in the following table:

 i    xᵢ    yᵢ
 1    0     1
 2    1     4
 3    2     3
 4    3     8
 5    4     9

Table 2.3

Use matrix calculus to find the m.l.e. estimates of β₀, β₁, and σ². Also find Cov B.
We test an individual regression coefficient:

H₀: βⱼ = βⱼ*  against  H₁: βⱼ ≠ βⱼ*.   (3.1)

Since Bⱼ is Gaussian with mean βⱼ and variance σⱼ² = σ²rⱼⱼ (rⱼⱼ being the j-th diagonal entry of R), if H₀ is true,

Wⱼ = (Bⱼ − βⱼ*)/(σ√rⱼⱼ) ∈ N(0, 1).   (3.2)

Since σ is unknown, consider instead

Uⱼ = Wⱼ / [ (1/(n−p)) SSE/σ² ]^(1/2),   (3.3)

where SSE/σ² ∈ χ²_{n−p} independently of Bⱼ, so that

Uⱼ = (Bⱼ − βⱼ*) √( (n−p)/(SSE rⱼⱼ) ) ∈ t_{n−p}, if H₀ is true.   (3.4)

The estimator

K′² := (1/(n−p)) SSE   (3.5)

can be proved to be an unbiased estimator of σ². This estimator replaces σ² in (3.2) to give the test statistic

Uⱼ = (Bⱼ − βⱼ*)/(K′√rⱼⱼ),   (3.6)

whose empirical value is

uⱼ = (β̂ⱼ − βⱼ*)/(k′√rⱼⱼ),   (3.7)

where

k′² := (1/(n−p)) sse   (3.8)

and, from (2.19),

sse = (y − Cβ̂)′y.   (3.9a)
The p-value of the test is

g = P{|t_{n−p}| ≥ |uⱼ|} = 1 − T_{n−p}(|uⱼ|) + T_{n−p}(−|uⱼ|) = 2[1 − T_{n−p}(|uⱼ|)] = 2 t̄_{n−p}(|uⱼ|),   (3.10)

where T_{n−p} denotes the cdf of the t_{n−p} distribution and t̄_{n−p} its tail; we denote the p-value this time by g in order not to confuse it with p, the number of model parameters β₀, …, β_{p−1}.
Example 3.1. Under the condition of Example 2.3, test the hypotheses

H₀: β₂ = 0
H₁: β₂ ≠ 0.

Firstly,

k′ = (9.37/7)^(1/2), r₂₂ = 0.014, and β̂₂ = 0.013.

Hence,

u₂ = 0.013/[(9.37/7)(0.014)]^(1/2) = 0.095.

Using the table for t₇ we find the p-value equal to 2t̄₇(0.095) ≈ 0.9. This is large, and we do not reject H₀ that β₂ = 0.
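The p-value can be read from R instead of the t-table:

# p-value of Example 3.1
u2 <- 0.013/sqrt((9.37/7)*0.014)
print(2*(1 - pt(abs(u2), df = 7)))   # ~ 0.93, so H0 is not rejected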
PROBLEMS
3.1. In Table 3.1 below, data were collected on 23 days (from February 15, 2011 through March 18, 2011) on the closings of the AMEX Oil Index, the CBOE Gold Index, and Boeing Co. AMEX Oil and the Gold Index are assumed to be the control variables x₁ and x₂, while Boeing represents the response y.

Table 3.1

(i) Establish the design matrix C and formulate the regression function y = E[Y|X = x] = β₀ + β₁x₁ + β₂x₂.
(ii) Calculate the matrix R = (C′C)⁻¹. Then, using the system of equations β̂ = RC′y, calculate the vector β̂ of the m.l.e.'s of β₀, β₁, β₂.
(iii) Calculate the associated covariance matrix Cov B.
(iv) Test the hypotheses

H₀: β₂ = 0 against H₁: β₂ ≠ 0

and

H₀: β₁ = 0 against H₁: β₁ ≠ 0.

In each case find the critical region at the significance level α = 0.05 and, using the p-value, interpret the results of the testing.
4. Prediction
After finding β̂₀, …, β̂_{p−1}, we can introduce the deterministic "predictor"

ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂_{p−1}x_{p−1} = x′β̂   (4.1)

of y = β₀ + β₁x₁ + ⋯ + β_{p−1}x_{p−1}, which can be used to interpolate a response for any set of control variables x₁, …, x_{p−1}. Replacing β̂₀, …, β̂_{p−1} with their random versions B₀, …, B_{p−1} (the MLE's), we introduce the random predictor of the random regression line as

Ŷ = x′B,   (4.2)

with

E Ŷ = x′β   (4.3)

and, as can be shown from (2.25),

Var Ŷ = Var(x′B) = x′(Cov B)x = σ²x′Rx.   (4.5)

We would like to determine the mean square error (MSE) of Ŷ w.r.t. Y = x′β + ε. As in Remark 3.3, Y and Ŷ are independent, and thus

E(Ŷ − Y)² = Var Ŷ + Var Y = σ²x′Rx + σ² = σ²(1 + x′Rx).   (4.6)

Since the MSE E(Ŷ − Y)² in (4.6) contains σ², which is unknown, we replace σ², as we did in Remark 3.3, with

k′² := (1/(n−p)) sse,   (4.7)

where

sse = (y − Cβ̂)′y,   (4.8)

arriving at

k′²(1 + x′Rx) = (1/(n−p)) (y − Cβ̂)′y (1 + x′Rx),   (4.9)

the empirical MSE between Y and Ŷ. Formula (4.9) can be used to calculate the predicted deviation of a response Y for any set of control variables x₁, …, x_{p−1}.
PROBLEMS
4.1. Under the condition of Problem 3.1, find the empirical MSE between Y and Ŷ at x₁ = 1300 and x₂ = 240.
4.2. Under the condition of Example 2.3, with the polynomial regression function, find the empirical MSE between Y and Ŷ at x = 3.
Since both Y and Ŷ are Gaussian and independent, so is their difference, and

Z := (Y − Ŷ)/(σ√(1 + x′Rx)) ∈ N(0, 1).

Furthermore,

W = SSE/σ² ∈ χ²_{n−p},

independent of Z. Hence

T := Z/√(W/(n−p)) = [(Y − Ŷ)/(σ√(1 + x′Rx))]/√(K′²/σ²) = (Y − Ŷ)/(K′√(1 + x′Rx)) ∈ t_{n−p}.

Setting

P{|T| ≤ c} = 1 − α = P{−c ≤ T ≤ c} = 2P{T ≤ c} − 1 ⟹ P{T ≤ c} = 1 − α/2 ⟹ c = t^(n−p)(1 − α/2) =: t^(n−p)_{α/2},

we have, since T ∈ t_{n−p}, that T falls into (−t^(n−p)_{α/2}, t^(n−p)_{α/2}) with probability 1 − α, i.e.,

Y ∈ ( Ŷ − t^(n−p)_{α/2} √(K′²(1 + x′Rx)), Ŷ + t^(n−p)_{α/2} √(K′²(1 + x′Rx)) ).

Replacing Ŷ and K′² with their empirical values ŷ and k′² yields the interval

( ŷ − t^(n−p)_{α/2} √(k′²(1 + x′Rx)), ŷ + t^(n−p)_{α/2} √(k′²(1 + x′Rx)) ),

where

k′²(1 + x′Rx) = (y − Cβ̂)′y (1 + x′Rx)/(n − p),

and it is called the 100(1 − α)%-prediction interval for the regression function y = β₀ + Σᵢ₌₁^(p−1) βᵢxᵢ.
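A minimal R sketch of the prediction interval, using the data of Example 2.1 and a hypothetical new control value, cross-checked with R's predict():

# 95% prediction interval at a new point x0
x <- c(-2, -1, 0, 1, 2); y <- c(0, 0, 1, 1, 3)
C <- cbind(1, x); n <- length(y); p <- ncol(C)
R <- solve(t(C) %*% C)
beta.hat <- R %*% t(C) %*% y
x0 <- c(1, 1.5)                                  # new control value 1.5
y0 <- sum(x0*beta.hat)
k2 <- as.numeric(t(y - C %*% beta.hat) %*% y)/(n - p)
half <- qt(0.975, n - p)*sqrt(k2*(1 + t(x0) %*% R %*% x0))
print(c(y0 - half, y0 + half))
print(predict(lm(y ~ x), data.frame(x = 1.5), interval = "prediction"))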
PROBLEMS
5.1. Under the condition of Problems 3.1 and 4.1, find the 95% prediction interval for y.
5.2. Under the condition of Example 2.3 and the polynomial regression function, with x = 3, find the 95% prediction interval for y.
In the regression plane y = β₀ + Σᵢ₌₁^(p−1) βᵢxᵢ, stuffed with numerous control variables x₁, …, x_{p−1}, there may be some which have no or little impact on the response y; if this is the case, it would be a good idea to get rid of them and reduce the regression model. Consequently, it makes sense to test the hypothesis that the associated βᵢ's equal zero. We did some testing of single β's in section 3, but testing more than one β at once requires a different approach.
Call y = β₀ + Σᵢ₌₁^(p−1) βᵢxᵢ the complete model. Suppose, without loss of generality, we need to test the last p − r (r < p) parameters, as

H₀: β_r = ⋯ = β_{p−1} = 0,

and in order to do so we will try to propose an associated test statistic. We then introduce the reduced model

y = β₀ + Σᵢ₌₁^(r−1) βᵢxᵢ.

The degree of redundancy of the tested parameters can be evaluated from the associated SSE's for both models, denoted for convenience by SSE_C and SSE_R for the complete and reduced models, respectively. It seems plausible that SSE_C must be smaller than SSE_R, since the presence of more parameters in the complete model should reduce the least square error function. But how significant the reduction is, is left to a test. The greater the difference SSE_R − SSE_C, the stronger the evidence that H₀ must be rejected.
It can be shown that SSE_C/σ² ∈ χ²_{n−p} and, under H₀,

SSE_R/σ² ∈ χ²_{n−r},

which leads to the statistic

F = [ (SSE_R − SSE_C)/(p−r) ] / [ SSE_C/(n−p) ],

in which the numerator and denominator can be shown to be independent. Furthermore, the r.v. F is known to have the F-distribution with p − r numerator and n − p denominator degrees of freedom.
We can also operate with the p-value, g = P{F ≥ f} = 1 − U(f) (U being the cdf of this F-distribution), to reject or not reject H₀.
Example 6.1. In light of Example 2.2, where we considered a three-parameter (quadratic) regression model driven by the table

 i    xᵢ    yᵢ
 1    −2    0
 2    −1    0
 3     0    1
 4     1    1
 5     2    3

we test

H₀: β₁ = β₂ = 0.

Therefore, the model of Example 2.2 can be regarded as complete, while the reduced model will be y = β₀. We use

sse = (y − Cβ̂)′y

to compute the sse's for both models. For the reduced model,

C_R = (1, 1, 1, 1, 1)′, y = (0, 0, 1, 1, 3)′, β̂ = ȳ = 1,

so that

sse_R = 6.

For the complete model, the design matrix C of (2.33) and

β̂ = (4/7, 0.7, 3/14)′

give

sse_C = 0.463.

Hence, with n = 5, p = 3, r = 1,

f = [(6 − 0.463)/(3 − 1)] / [0.463/(5 − 3)] = 11.959.

The tabulated critical value is c = U⁻¹(0.95) = 19, so f does not belong to the critical region and we do not reject H₀ that β₁ = β₂ = 0, meaning that the regression line is independent of x in the parabolic and linear senses. In addition, the p-value is g = P{F ≥ f} = 0.07717, which is not large, but greater than 0.05, so H₀ is not rejected.
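A minimal R cross-check of Example 6.1 with anova() on the two nested models (R computes the sse's exactly, so its F statistic, about 12.1, matches the rounded 11.959 above only up to rounding):

# F test for the redundancy of beta1 and beta2, per Example 6.1
x <- c(-2, -1, 0, 1, 2); y <- c(0, 0, 1, 1, 3)
complete <- lm(y ~ x + I(x^2))     # the model of Example 2.2
reduced <- lm(y ~ 1)               # y = beta0 only
print(anova(reduced, complete))    # F on (2, 2) df, p-value ~ 0.077
print(qf(0.95, 2, 2))              # tabulated critical value 19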
PROBLEMS
6.1. Under the conditions of Problem 3.1, test the hypothesis

H₀: β₁ = β₂ = 0.

6.2. Suppose y is the current value of a house, and x₁ is the square footage of the living area (in thousands of sq. ft.), x₂ is the location (numerical value of the indicator zone: from 1 to 10; 10 is the best), x₃ is the state mean appraised value for the past three years (in $100,000), x₄ the wind resistance of the roof (the hurricane code: 1 if it is up to the code, 0 if it is not), x₅ additional hurricane protection, such as shutters, panels, etc. (ranked from 0 to 10), x₆ flood zone (from −5 to 5), x₇ total acreage of the property.
Suppose we have 12 observations of the current value of the house, collected from 12 different homes.
a) Find the deterministic prediction plane ŷ and calculate the predicted value of a home with the following control variables: x₁ = 3.7, x₂ = 5, x₃ = 2.7, x₄ = 1, x₅ = 1, x₆ = 1, x₇ = 2.0.
d) Test the hypothesis that β₃, β₄, and β₇ are redundant as an entire group, pairwise, and individually.
6.3 (PROJECT). Select four different financial instruments of your choice (such as stocks, mutual funds, and indexes), identifying one of them as the response variable, and test different combinations of β₁, β₂, β₃ for redundancy. You can restrict the data to 12 trading days.
In this section we consider a linear combination of the regression coefficients,

c′β = Σᵢ₌₀^(p−1) cᵢβᵢ, cᵢ ∈ ℝ,   (7.1)

thereby generalizing the results of section 3, in which not a combination but one particular regression coefficient βⱼ was tested. Furthermore, we also add a discussion of confidence intervals for c′β, not previously rendered.
We begin by figuring out the nature of its estimator c′B. The results and discussion will be very similar to those for the simple linear regression. However, since we did not focus there on the general case, we rebound in this section. If

c′B = Σᵢ₌₀^(p−1) cᵢBᵢ,   (7.2)

we have

c′β = E[c′B].   (7.3)

Recall from (2.23) that B = β + σRC′Z. Multiplying B by an n × p matrix G whose first row is c′ yields a multivariate Gaussian vector, provided that the n × n matrix GRC′ is nonsingular; that is, W := GB is n-variate normal, and therefore the first component of W, say W₁, is normal. Obviously, c′B can be regarded as the product of the first row of the matrix G and the vector B, and the result is a normal r.v. with parameters c′β and σ²c′Rc. Consequently, the r.v.

Z := (c′B − c′β)/(σ√(c′Rc))   (7.6)

is standard normal.
Remark 7.1. Suppose we intend to estimate c′β with c′β̂. Consider

P{ |c′B − c′β|/(σ√(c′Rc)) ≤ γ } = 1 − α.   (7.7)

To find γ, we set the probability in (7.7) equal to 1 − α, where α is the usual significance level. Since the r.v. (c′B − c′β)/(σ√(c′Rc)) is standard Gaussian, we can easily find γ = Φ⁻¹(1 − α/2) =: z_{α/2} (which stands for the corresponding α/2-tail point of the standard Gaussian cdf). Replacing γ with z_{α/2} in (7.7), we conclude that c′β falls into the random interval

( c′B − z_{α/2}σ√(c′Rc), c′B + z_{α/2}σ√(c′Rc) )

with probability 1 − α. The practical significance of this conclusion is minimal, mainly due to the presence of the estimator B of β; thus, if we replace c′B with its empirical value c′β̂, we obtain a fully deterministic interval

(a, b) = ( c′β̂ − z_{α/2}σ√(c′Rc), c′β̂ + z_{α/2}σ√(c′Rc) ),   (7.11)

for which, however, the notion of probability no longer applies. In other words, we can no longer claim that c′β falls into the interval (a, b) of (7.11) with probability 1 − α, simply because the latter interval is one of the realizations of the original random interval (A, B). Consequently, statisticians call this empirical interval (a, b) the 100(1 − α)% confidence interval for c′β.
Remark 7.2. The above confidence interval for c′β has yet another shortcoming. We notice that it contains the unknown model parameter σ, which naturally needs to be replaced with k′, the empirical estimate of σ. However, the entire associated argumentation then needs to be modified. Namely, we need to start with Z = (c′B − c′β)/(σ√(c′Rc)) in (7.6) and there replace σ with K′, the unbiased estimator of σ. By doing so, the modified r.v.

T := (c′B − c′β)/(K′√(c′Rc))   (7.12)

becomes a t-r.v. with n − p degrees of freedom, as we have seen numerous times. The main argument, as in the past cases, is that B is independent of SSE and thus of K′² = SSE/(n − p). Skipping the details of this procedure (as being totally similar to those in Remark 7.1), we obtain the 100(1 − α)% confidence interval for c′β as

( c′β̂ − t^(n−p)_{α/2} k′√(c′Rc), c′β̂ + t^(n−p)_{α/2} k′√(c′Rc) ),   (7.13)

where t^(n−p)_{α/2} is the α/2-tail point of the t_{n−p} distribution.
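A minimal R sketch of (7.13) on the data of Example 2.1, for a hypothetical combination c′β = β₀ + 2β₁:

# 95% confidence interval for c'beta
x <- c(-2, -1, 0, 1, 2); y <- c(0, 0, 1, 1, 3)
C <- cbind(1, x); n <- length(y); p <- ncol(C)
R <- solve(t(C) %*% C)
beta.hat <- as.vector(R %*% t(C) %*% y)
cc <- c(1, 2)                       # c' = (1, 2), i.e. beta0 + 2*beta1
kprime <- sqrt(as.numeric(t(y - C %*% beta.hat) %*% y)/(n - p))
half <- qt(0.975, n - p)*kprime*sqrt(t(cc) %*% R %*% cc)
print(sum(cc*beta.hat) + c(-half, half))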
Consider now testing

H₀: c′β = c*   (7.14)

against

H₁: c′β ≠ c*.   (7.15)

The usual way to employ a test statistic is first to replace c′β in (7.6) with c*, in which case we would test whether or not the obtained statistic, still Gaussian, remains standard. However, since σ is unknown, we also need to replace σ with K′, thereby arriving at the t-r.v. with n − p degrees of freedom

U* := (c′B − c*)/(K′√(c′Rc)),   (7.16)

with the critical region

C = (−∞, −t^(n−p)_{α/2}) ∪ (t^(n−p)_{α/2}, ∞).   (7.17)

The p-value associated with the empirical value u* of U* is

g = 1 − T_{n−p}(|u*|) + T_{n−p}(−|u*|) = 2[1 − T_{n−p}(|u*|)] = 2 t̄_{n−p}(|u*|),   (7.19)

which we denoted this time by g in order not to confuse it with p, the number of model parameters β₀, …, β_{p−1}.
Remark 7.3. Suppose we test some generic simple hypothesis H₀: μ = μ₀, and suppose there is a test statistic T, with an even pdf, that rejects H₀ at significance level α whenever, for an observed value t of T, |t| > c, where

c = F_T⁻¹(1 − α/2).

If we reject H₀, it means that |t| > c, which is equivalent to saying that

p = P{|T| ≥ |t|} < α

for the observed t; we call p the p-value. Since t was initially measured against c, and c had been calculated from a given significance level α, we see that the smaller p is compared to α, the more evidence we have against H₀. Therefore, small values of p, which speak against the validity of H₀, tell us that deviations of the true T beyond such a t are unlikely. In conclusion, a p-value is a measure of how much evidence we have against the null hypothesis.
PROBLEMS
7.2. Give the formula for the confidence interval for c′β in the setting of section 3, i.e., when testing a simple null hypothesis about one particular regression coefficient.
7.3. Under the condition of Problem 3.1, test the hypothesis that β₁ = β₂, first at the 0.025 significance level and then by using the p-value.
7.4. Under the condition of Problem 3.1, give the 95% confidence interval for the linear combination β₀ + 1300β₁ + 70β₂. Compare it with the 90% and 99% confidence intervals.