Introduction To Probability and Sampling
Program: PGPMEx
Course Name: Data Analysis for Decision Making
Unit Name: Introduction to Probability and Sampling
Powered by Great Learning. Proprietary content. ©Great Learning. All Rights Reserved.
Unauthorized use or distribution prohibited.
Contents
Overview
Objectives
Introduction
The Rules of Probability
Conditional probabilities
Bayes Theorem
Random Variables
Discrete Probability distributions
    Definition
    Expected value and variance
Binomial distribution
Continuous probability distributions
    Definition
    Expected value and variance
Normal distribution
Population and samples
Types of sampling
Sampling distributions
Central Limit Theorem
Glossary
Overview
In this unit, we introduce the notion of probability and present several key results that form the
mathematical basis of statistical analysis. We introduce two important probability distributions
– binomial and normal – and highlight the applications of these distributions in the business
context. Finally, we also present an overview of sampling methods, the application of the Central
Limit Theorem and explain how the different kinds of sampling can be conducted.
Objectives
At the end of this unit, you will be able to:
● Apply the rules of probability, conditional probabilities and Bayes theorem to business decisions
● Work with discrete and continuous random variables, and with the binomial and normal distributions
● Explain populations, samples, sampling distributions and the Central Limit Theorem
Unit Pre-requisites
Before studying this unit, the student should have completed Measures of Central Tendency and Measures of Variation.
Introduction
Probability, or chance as it is often referred to in common parlance, occupies a central position
in the mathematical foundations of statistical analysis. The irony in studying chance is that
thinking about the chances of something occurring (e.g., CSK winning another IPL or getting a
job or winning a lottery) is very intuitive. However, development of a rigorous definition for
‘chance’ is an incredibly tough endeavor, one that occupied the greats of mathematics for
several decades.
In this chapter, we will adopt a frequency approach to formalize our study of probability. This
approach is ideal for processes that can be independently repeated several times under the same
conditions. As you would realize, our approach of drawing samples from a population and
summarizing the characteristics of the population using the sample also follows a frequency
approach. The mathematics we build in this chapter powers all the statistical guarantees that
our samples represent the population, as we will discover by the end of this chapter.
Let us begin with a couple of examples where the frequency approach can be applied. Consider
flipping a coin 100 times. Unless you are a magician trying to pull a trick, each flip of the coin is
an independent event with only two possible outcomes – head or tail. By independent we mean
that each flip of the coin does not depend on what happened before that flip. A complete
listing of all possible outcomes, that is, the set {head, tail} is called the sample space. Each event
will result in an outcome from the sample space. The actual act of flipping the coin and
observing the outcome is called an experiment.
To formalize our discussion, an experiment is a process that results in a well-defined result
(referred to as the outcome). Of particular interest to us are random experiments, where we
know all the possible outcomes, but we do not know which exact outcome will be realized and
are in no position to influence the outcome (i.e., there is no bias).
In the context of processes that can occur repeatedly, the zero-probability event is the
realization of an impossible outcome (i.e., something that happens 0 times out of all the trials).
In the same vein, an event with probability 1 is the realization of an outcome that happens
always (i.e., something that happens in all trials). Following this observation, we note that
probabilities can take values only between 0 and 1.
Another very important property of probability concerns events opposite to the event of
our interest. When we flip a coin, if the probability of heads is 0.5, then the probability of the
opposite happening, that is, tails showing up, is 1 − 0.5 = 0.5. To take a cruel example, if your
chance of winning a lottery is 1% (or 0.01), there is a 99% chance (or 0.99) that you do not win
the lottery.
In sum, according to frequency theory, the probability of an event is the long-run relative
frequency with which that event occurs over repeated trials of the experiment.
Now that we have a sense of what the numerical value of probability means, let us now
define a formal method of computing probabilities. In choosing such a method, we want to rely
on the frequency approach. For example, if a bag contains 1 Superman figurine and 2 Batman
figurines, the chance that a Superman figurine shows up when you put your hand in the bag is
1/3. This is because there are 3 figurines in the bag and each of them is equally likely to be
picked. However, there are 2 times as many Batman figurines, so over repeated draws from the
bag, the Superman figurine will appear only 1/3rd of the time. Note the emphasis on repeated
draws. Formally, we state that the probability that a Superman figurine shows up = 1/3.
Now, suppose the bag had 10 Superman figurines and 20 Batman figurines. Should our answer to
the probability of drawing a Superman from the bag change? It should not, since over repeated
draws, Superman still has a 1/3rd chance of showing up, given the relative proportions of the
figurines. From this discussion, we arrive at the following definition of the probability of an
event:
P(E) = (# of favorable outcomes) / (Total number of outcomes)
Here, favorable outcomes refer to the outcomes of interest, or the event whose probability we
wish to determine. For example, using this definition, the probability of drawing a Superman
figurine from a bag containing 1 Superman figurine and 2 Batman figurines is 1/(1+2) = 1/3.
Similarly, the probability of drawing a Superman figurine from a bag containing 10 Superman
figurines and 20 Batman figurines is 10/(10+20) = 10/30 = 1/3. Note how our definition of
probability yields the same answer in both cases.
Before we move on, we re-emphasize that our definition of probability relies on the
independence of draws and the repeatability of the experiment. We have no comments on
what will happen in any one draw, but have something to say (i.e., the probability) when many
such independent draws are made. In assuming the frequency approach, we are espousing the
empirical computation of probability, that is, probability calculated based on historical data.
Probability could also be subjective (i.e., a personal judgement based on the assessment of the
outcomes) or theoretical (i.e., the true probability of the event, that is usually unknown).
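The frequency interpretation above is easy to demonstrate by simulation. Here is a hypothetical sketch in Python (the document itself uses no code; the function name is our own) that draws repeatedly from the figurine bag and checks that the empirical frequency approaches the theoretical 1/3:

```python
import random

def estimate_superman_probability(n_draws: int, seed: int = 42) -> float:
    """Estimate P(Superman) by repeated independent draws from a bag
    holding 1 Superman and 2 Batman figurines."""
    rng = random.Random(seed)
    bag = ["Superman", "Batman", "Batman"]
    hits = sum(1 for _ in range(n_draws) if rng.choice(bag) == "Superman")
    return hits / n_draws

# With many repeated draws the empirical frequency settles near the theoretical 1/3
print(estimate_superman_probability(100_000))
```

With only a handful of draws the estimate fluctuates widely; the guarantee is only about what happens over many repetitions, exactly as the text emphasizes.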
The Rules of Probability
Before we lay down the rules, let us get some formalism out of the way (this is going to trigger
some nostalgia for set theory). We define the union of two events A and B as the event that
contains all the outcomes that are either in A or in B or in both A and B. The union of two events is
denoted as A ∪ B. The intersection of two events A and B (denoted by A ∩ B) is defined as the
event that comprises the outcomes that are in both A and B. The complement of the event A
(denoted by A′) is the event that contains all the outcomes that are not in the event A.
Given two events A and B, they are said to be mutually exclusive if their intersection is empty.
This means that if the event A occurs, then the event B cannot occur. For example, with a flip of
a coin, there is no event that is both heads and tails at the same time. If the union of two or
more events encompasses the entire sample space, such a collection of events is deemed to be
collectively exhaustive.
Following these definitions, we now define a set of axioms based on which the probability of
any complex composition of events can be computed. By agreeing to these axioms, we enter a
gentlemanly mathematical agreement to impose no further assumptions on the results we
derive. Let us look at the axioms now.
The first axiom states that the probability of the entire sample space is 1; that is, at least one of
the outcomes that comprise the sample space will occur. The second axiom states that the
probability that no event occurs is 0. The third axiom states that the probability of the
complement of the event A (i.e., A′) is 1 − P(A). The fourth axiom states that the probability of
the union of two events A and B is P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
It is remarkable that once we shake hands in agreement to these four axioms, every other
result in probability can be derived from these. As an aside you have now joined the league of
extraordinary mathematicians (think Kolmogorov) who made it their life’s work to derive
results on the probabilities of complex events based only on these four axioms.
Let us now look at an example where these axioms help in decision making – managing the
sales funnel. The rules of probability can be used to plan which among the customers of a firm
will convert and hence to appropriately manage resource allocation. For example, consider a
B2B firm who is trying to understand which among its two key customers – A and B – will place
an order this month. From the sales team, the feedback is that 𝑃(𝐴) = 0.4, that is, the
probability that A will place the order is 0.4. Similarly, the team estimates that 𝑃(𝐵) = 0.7 and
the probability that at least one of them places the order is P(A ∪ B) = 0.8. If we want to
compute the probability that both of them place orders during the month, that is P(A ∩ B), all
we need to do is to plug the respective values into axiom 4. Formally, P(A ∩ B) = P(A) +
P(B) − P(A ∪ B) ⇒ P(A ∩ B) = 0.4 + 0.7 − 0.8 = 0.3. So, the probability that both orders
will land in the month is 0.3, and it is worth pushing the sales team to convert both.
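This axiom-4 arithmetic is easy to script. The sketch below (hypothetical Python; the function name and the validity check are our own additions) computes P(A ∩ B) from P(A), P(B) and P(A ∪ B), and rejects inputs that cannot form a valid probability assignment:

```python
def intersection_prob(p_a: float, p_b: float, p_union: float) -> float:
    """P(A and B) via axiom 4: P(A or B) = P(A) + P(B) - P(A and B)."""
    p_inter = p_a + p_b - p_union
    # A valid assignment needs 0 <= P(A and B) <= min(P(A), P(B)).
    if not (0.0 <= p_inter <= min(p_a, p_b)):
        raise ValueError("inputs do not form a valid probability assignment")
    return p_inter

# Illustrative sales-funnel numbers: P(A) = 0.4, P(B) = 0.7, P(A or B) = 0.8
print(intersection_prob(0.4, 0.7, 0.8))  # ≈ 0.3
```

The validity check matters in practice: sales-team estimates are subjective, and internally inconsistent numbers (e.g., a union probability smaller than either marginal) should be flagged rather than silently plugged into the formula.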
Conditional probabilities
While the rules of probability help us to infer the probabilities of events that are the
composition of simpler events, several managerial decisions are rooted on partial information.
For example, given that the budget was announced this morning, what is the probability that
corporate spending would increase? Or, given that our major customer delayed their payment
last month, what is the probability that this repeats again this month?
Such situations, where we are interested in the probability of an event A, given that another
event B has already occurred, fall into the realm of conditional probability. Formally, P(A|B),
that is, the probability that event A occurs given that the event B has occurred, is: P(A|B) =
P(A ∩ B)/P(B). Let us try to make sense of this expression. First, note how the notion of
conditionality is closely related to the independence we discussed at length earlier. If the
probability of A occurring is independent of how B turned out, then P(A|B) = P(A) since A does
not depend in any way on B (i.e., A would occur with some probability irrespective of whether B
occurred or did not). In other words, when A and B are independent
P(A|B) = P(A) ⟹ P(A ∩ B)/P(B) = P(A) ⟹ P(A ∩ B) = P(A)P(B)
Second, if the occurrence of B is in any way going to influence the occurrence of A, then the
two are dependent and P(A|B) becomes a meaningful entity on its own. For such a pair of
dependent events A and B, one way to think of both of them occurring is to say that B occurred
and then A occurred given that B occurred. This is what the expression P(A ∩ B) = P(B)P(A|B)
conveys. That is, the probability that both A and B occur has two components – the probability
that B occurs and then the probability that A occurs, given that B occurred.
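The product rule for independent events can be checked empirically. A minimal simulation (an illustrative Python sketch, assuming two independent fair coin flips with A = first flip is heads and B = second flip is heads):

```python
import random

def joint_vs_product(n_trials: int, seed: int = 7) -> tuple[float, float]:
    """Compare the empirical P(A and B) with P(A) * P(B) for two
    independent fair coin flips: A = first is heads, B = second is heads."""
    rng = random.Random(seed)
    a_count = b_count = both_count = 0
    for _ in range(n_trials):
        a = rng.random() < 0.5          # first flip is heads
        b = rng.random() < 0.5          # second flip is heads
        a_count += a
        b_count += b
        both_count += a and b
    return both_count / n_trials, (a_count / n_trials) * (b_count / n_trials)

joint, product = joint_vs_product(100_000)
print(joint, product)  # both settle near 0.25, as the product rule predicts
```

For dependent events the two numbers would diverge, and the gap between them is one simple diagnostic for dependence in data.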
This intuitive usage of conditional probabilities is very useful to understand an important law of
probability – the law of total probability. First, let us define a partition of the sample space 𝑆 as
the set of events {𝐴1, 𝐴2, … , 𝐴𝑛} such that each of these events are disjoint, i.e., they cannot
occur together. In effect, we are slicing up the sample space into a series of non-overlapping
sub-events that together comprise the entire space. Now, any event B that might occur in this
sample space can be decomposed using conditional probabilities like so:
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + ⋯ + P(B|An)P(An)
In this expression, we recognize that event B might occur as a result of the occurrence of one or
more of the sub-events that make up the sample space and carefully account for the individual
probabilities of each of these. The crux here is to understand that we cleverly split the sample
space into non-overlapping events, allowing us to add each of the conditional expressions. The
above equation is commonly referred to as the law of total probability.
The law of total probability is a powerful method that enables us to break a complex problem in
to smaller manageable chunks, derive probabilities of these smaller chunks, and aggregate
these mini probabilities to solve our problem. Let us look at an example. Consider a brand that
wants to place a banner ad on three social media platforms – Facebook, Twitter and Instagram.
The click-through rates (CTR) observed for the brand's ads on these platforms are 0.01, 0.005
and 0.02 for Facebook, Twitter and Instagram respectively. In probabilistic terms, this means,
on Facebook for example, there is a 1% chance or 0.01 probability that a banner ad displayed
on Facebook gets clicked by the viewer. In the language of conditional probability, this data
indicates that the probability that a user clicks an ad on Facebook, given that the user is shown
an ad is 0.01. Assuming that the brand has no specific preference for any of these three
platforms to advertise, what is the overall probability of a click on their ad?
Let us derive this probability using the law of total probability. First, note that the brand wants
to advertise on one of the three social media. Hence, sample space is composed of three
disjoint events – advertising on Facebook, Twitter or Instagram. This allows us to split the
probability of a click overall (say 𝑃(𝐶)) in terms of what happens when the brand advertises on
each of these. Formally,
P(C) = P(C|F)P(F) + P(C|T)P(T) + P(C|I)P(I)
What are we expressing through the equation above? The overall click probability 𝑃(𝐶) equals
the sum of the probability of a click on Facebook, Twitter or Instagram. Given that the ad was
placed on Facebook, the probability of a click is 𝑃(𝐶|𝐹). The probability that an ad would be
placed on Facebook is 𝑃(𝐹). The other two terms will be interpreted similarly.
Now what is P(F)? Since the brand has no preference for any of the social media, it would be
safe to assume that P(F) = P(T) = P(I), and since there is no other option left, P(F) +
P(T) + P(I) = 1 (axiom 1). Hence, P(F) = P(T) = P(I) = 1/3. We are now in a position to
compute the overall probability of a click like so:
P(C) = 0.01 × (1/3) + 0.005 × (1/3) + 0.02 × (1/3) = 0.035/3 ≈ 0.0117
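The same computation can be expressed as a small, reusable function. A hypothetical Python sketch (dictionaries stand in for the partition; the function and variable names are illustrative):

```python
def total_probability(conditionals: dict[str, float], priors: dict[str, float]) -> float:
    """Law of total probability: P(C) = sum of P(C | A_i) * P(A_i)
    over a partition {A_i} of the sample space."""
    assert abs(sum(priors.values()) - 1.0) < 1e-9, "partition priors must sum to 1"
    return sum(conditionals[k] * priors[k] for k in priors)

ctr = {"Facebook": 0.01, "Twitter": 0.005, "Instagram": 0.02}   # P(C | platform)
platform_prob = {k: 1 / 3 for k in ctr}                         # no platform preference
print(round(total_probability(ctr, platform_prob), 4))          # 0.0117
```

Changing `platform_prob` to a non-uniform allocation immediately shows how the brand's overall click probability shifts with its media mix.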
Bayes Theorem
Let us look more closely at the banner advertisement example we solved in the previous
section. While it is useful to compute the overall click probability P(C), probably a more crucial
decision aid would be: given that a click occurred, what is the chance that the ad was placed on
Facebook, that is, P(F|C)? Or even better, if the firm had to allocate its advertising expenditure
across the three social media, what better a metric than the probabilities P(F|C), P(T|C), P(I|C)
to make this call? But how does one go about computing these? Bayes theorem (formulated by
Thomas Bayes (1701 – 1761), who never published it himself!) allows us to do exactly this. Bayes
theorem is in fact central to data-centered thinking and allows us to incorporate evidence that
is currently available to inform decision making by revising our beliefs.
P(B|A) = P(A|B) P(B) / P(A)
In this equation, P(B) is the unconditional probability of event B and P(A) is the unconditional
probability of event A. P(A|B) is the probability of event A given B and P(B|A) is the probability of
event B given A. A good way to think of this equation is to interpret P(B) as the probability of
event B before event A was observed (the prior probability) and P(B|A) as the probability of event B
once event A is observed (the posterior probability). With this interpretation, this equation allows
us to update the prior probability to arrive at the posterior probability of event B. An entire
branch of statistics called Bayesian statistics relies on the seemingly simple, but mighty
inversion equation that is Bayes theorem.
Let us now apply Bayes theorem to compute 𝑃(𝐹|𝐶) from the data presented in the previous
section.
P(F|C) = P(C|F)P(F) / P(C) = (0.01 × (1/3)) / 0.0117 ≈ 0.285
Now to the interpretation of the computation. The brand had no particular preference for
Facebook before the click-through rate was observed. Once that data is known (i.e., P(C|F)),
we update the probability that the brand should prefer Facebook to P(F|C) ≈ 0.285 (from the
previous value of 1/3). Note how beautifully the interpretation flows from the equation and
how this lends itself to data-oriented decision making. As this example illustrates, we start with
a specific idea about the solution (e.g., Facebook is equally likely as the other two social media),
then observe some data and update our original belief (e.g., Facebook is now slightly less preferred
since P(F|C) < 1/3). In many ways, Bayesian updating as illustrated in this example is a natural
extension of the human thought process, so it is no surprise that Bayesian statistics is so
popular!
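The Bayesian update can be computed for all three platforms in one pass. A hypothetical Python sketch (the function name is ours; note that using the unrounded P(C) gives Facebook about 0.286 rather than the 0.285 obtained from the rounded P(C) = 0.0117):

```python
def posterior(conditionals: dict[str, float], priors: dict[str, float]) -> dict[str, float]:
    """Bayes theorem: P(A_i | C) = P(C | A_i) * P(A_i) / P(C),
    where P(C) comes from the law of total probability."""
    p_c = sum(conditionals[k] * priors[k] for k in priors)
    return {k: conditionals[k] * priors[k] / p_c for k in priors}

ctr = {"Facebook": 0.01, "Twitter": 0.005, "Instagram": 0.02}
uniform_prior = {k: 1 / 3 for k in ctr}          # no initial preference
post = posterior(ctr, uniform_prior)
print({k: round(v, 3) for k, v in post.items()})
# Facebook ≈ 0.286, Twitter ≈ 0.143, Instagram ≈ 0.571 with the unrounded P(C)
```

The resulting posteriors could directly drive the expenditure split across the three platforms: Instagram, with the highest CTR, attracts the largest posterior weight.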
Random Variables
Our study of probability presented thus far in this chapter allows us to now use these concepts
to study processes that are inherently uncertain. Such processes where the exact outcomes
cannot be predicted are called random processes. This does not mean that random processes
cannot be predicted at all, but rather that while the overall set of possible outcomes is known,
which of these outcomes is actually realized is unknown. Note how this definition closely
relates to our definition of the sample space and probability discussed in the previous section.
This similarity is by design. Once we represent the uncertain outcomes of a random process as
appropriately chosen numeric values (called random variables), probabilities could be assigned
to each of these outcomes. For example, a random variable 𝑋 could be used to denote the
outcomes of a flip of a coin (i.e., a random process) like so:
X = { 1, for Heads; 0, for Tails }
The sample space here is {H, T} and each of these events is mapped to a numeric value (i.e.,
𝐻 → 1, 𝑇 → 0). The probability of observing heads is then 𝑃(𝑋 = 1) and if the coin is fair it
would be reasonable to assume that 𝑃(𝑋 = 1) = 𝑃(𝑋 = 0) = 0.5. To take another example,
one way to translate the click through rate in the digital advertising example we saw in the last
two sections as a random variable would be:
CTR = { 0.01, if the ad is on Facebook; 0.005, if the ad is on Twitter; 0.02, if the ad is on Instagram }
There are two kinds of random variables – discrete and continuous – that get used to represent
uncertainties across different kinds of processes.
A discrete random variable assumes a finite set of values or a countably infinite set of values (i.e.,
an infinite sequence of values). Examples include the number of red-colored vehicles that cross
a traffic junction every hour, the number of units of a particular SKU picked up in a day from
a retail shelf, the number of whole kilometres a car can travel before requiring an oil change, and
the number of EMI payment defaults that occur in a month.
On the other hand, continuous random variables can take on any numerical value in an interval
or collection of intervals. Continuous random variables are usually used to represent outcomes
that are measured on a continuous scale, such as, time, distance or weight. Examples include
closing share prices of stocks and weights of material packed after production.
Once we have defined random variables, the set of values the variable can take is known. The
definition of the random variable is, however, not complete until we assign probabilities to
each of the values that the random variable can take. After all, random variables are different in
precisely this aspect – the outcomes are probabilistic. Such a description of the probability
associated with each outcome of the random variable is called defining a probability
distribution.
Discrete Probability distributions
Definition
For a discrete random variable, 𝑋, the process of assigning probabilities to the possible values
of the variable results in a probability mass function (PMF), 𝑓𝑋(𝑥). Formally, the PMF is an
enumeration of the values and the associated probabilities, that is, 𝑓𝑋(𝑥) = 𝑃(𝑋 = 𝑥).
Following from axiom 1, the sum of probabilities across possible values of the discrete random
variable will sum up to 1, that is, 𝛴𝑓𝑋(𝑥) = 1. Another distribution function that is extremely
useful in statistical analysis is the cumulative distribution function (CDF). To arrive at the CDF
when 𝑋 = 𝑥, we sum up all the probabilities of the values that are less than or equal to 𝑥
(hence the term ‘cumulative’). Formally, this means that the definition of the CDF, 𝐹𝑋(𝑥) =
𝑃(𝑋 ≤ 𝑥). Let us look at an example of the construction of a PMF now.
Continuing with the example of flipping the coin with two possible outcomes – {Heads, Tails} –
we defined the random variable X as X = 1 for Heads and X = 0 for Tails.
For this definition of X, if the coin is unbiased, the PMF could be defined like so:
f_X(x) = { 1/2, x = 0; 1/2, x = 1; 0, otherwise }
Here, since we have an unbiased coin, it is reasonable to assume that the probability of both
the events is ½. Since there are no other options for the coin beyond heads and tails, the
probability of everything else is 0.
We can compute the CDF from the PMF by accumulating the probabilities at each step. For
example, F_X(0) = P(X ≤ 0) = P(X = 0) = 1/2 (since X cannot be less than 0). F_X(1) =
P(X ≤ 1) = P(X = 0) + P(X = 1) = 1/2 + 1/2 = 1 (since when X ≤ 1, there are only two
possible values, 0 and 1, that the random variable can take). The CDF is then:
F_X(x) = { 0, x < 0; 1/2, 0 ≤ x < 1; 1, x ≥ 1 }
Expected value and variance
Let us now see how the computation of the expected value and the variance works for the PMF
of the fair coin we discussed above. For a discrete random variable, the expected value is
E[X] = Σ x f_X(x) and the variance is σ² = Σ (x − E[X])² f_X(x). For the coin:
E[X] = Σ x f_X(x) = (1/2) × 0 + (1/2) × 1 = 1/2
σ² = Σ (x − E[X])² f_X(x) = (0 − 1/2)² × (1/2) + (1 − 1/2)² × (1/2) = 1/8 + 1/8 = 1/4
Binomial distribution
The binomial distribution is closely linked to the world of counting, permutations and
combinations in an intricate way. For example, observe the following problems:
● If a production process is executed 100 times, what is the probability that 20 of these
trials produced defective pieces?
● Out of our 100 customers, what is the probability that 25 of these will move to the
competition in this month?
The solution of these problems involves the usage of binomial coefficients (that you might have
encountered in a study of permutations and combinations) first developed by Pascal and
Newton. The distribution of probabilities of such events where a sequence of trials is
conducted, with each trial having a probability of an event occurring is called a binomial
distribution. It is standard practice to define a ‘success’ in this context as the occurrence of the
event we are interested in (e.g., defect production, customer churn). Formally, a binomial
distribution describes the probability of observing x successes in an experiment comprising n
independent trials, where the probability of success in each trial is p. So, each trial of the
experiment has only two outcomes – success (probability p) and failure (probability 1 − p) –
and hence the name binomial distribution. The PMF is:
f_X(x) = P(X = x) = C(n, x) p^x (1 − p)^(n−x)
Let us parse this expression. First, note the combinatorial formula C(n, x) ('n choose x') that
indicates the number of ways in which 𝑥 successes can occur from 𝑛 independent trials. Each of
these success events has a probability 𝑝, while the failure events have a probability (1 − 𝑝).
Hence by the multiplication rule (section 4.3) the probability that we observe 𝑥 successes (and
hence n − x failures) is p^x (1 − p)^(n−x). There are C(n, x) ways in which these x successes and
n − x failures can be arranged across the n trials, so the probability of exactly x successes is
f_X(x) = C(n, x) p^x (1 − p)^(n−x).
Let us now see what the expected value and the variance of the binomial distribution are:
E[X] = Σ x f_X(x) = Σ x C(n, x) p^x (1 − p)^(n−x) = np
σ² = Σ (x − E[X])² f_X(x) = Σ (x − np)² C(n, x) p^x (1 − p)^(n−x) = np(1 − p)
(both sums running over x = 0, 1, …, n)
Let us now apply these formulae to the example we saw at the beginning of this section: if we
know that the overall error rate of our production process is 1% and we look at 100 production
runs, what is the probability that exactly 20 defective pieces are produced? What we need here is
P(X = 20) = C(100, 20) p^20 (1 − p)^80. Since p = 0.01, P(X = 20) = C(100, 20) × 0.01^20 ×
0.99^80 ≈ 2.4 × 10^−19. In other words, this is so rare that it is practically 0.
Continuous probability distributions
Definition
The distribution of continuous random variables is a direct extension of the idea of the PMF and
CDF we encountered in the previous section. However, since there are no discrete points where
the 'mass' of the probability rests, the probability is spread as a density across every possible
value that the continuous random variable might take. For example, the number of hours your
customers use social media every day is a continuous random variable. Each customer has their
own preference, but it would be very surprising to find anyone who spends more than, say, 6
hours a day on social media.
To describe the distribution of continuous random variables, we use the probability density
function (PDF), denoted by f_X(x). The PDF assigns a density to each possible value of x, so each
small interval dx around x carries a small amount of probability, f_X(x)dx. When we add all such
delta amounts of probability across all values the variable can take, we should get 1 (in
accordance with axiom 1). Formally, this statement translates to: ∫_−∞^∞ f_X(x) dx = 1. Note
how the summation in the case of the discrete random variable gets replaced by an integral.
The interpretation of the PMF or PDF is the same: they indicate how much probability mass or
density exists at each value the random variable can take.
Moreover, since the PDF is a density, computing the probability that the random variable
takes a value between, say, a and b, amounts to computing the integral:
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx
To re-iterate, all we are doing here is adding up the tiny bits of probability at each dx as defined
by the PDF, i.e., f_X(x), between a and b.
The definition of the CDF also extends in a similar way from the discrete case, with the summation
being replaced by an integral. Formally, the CDF F_X(x) for a continuous random variable X with
a PDF f_X(x) is F_X(x) = P(X ≤ x) = ∫_−∞^x f_X(t) dt. In the computation of this integral, we are
adding up all the tiny bits of probability that are associated with each dt around the values t
that the variable X can take. Let us now look at a numerical example to see how this
computation works.
Consider that you are monitoring the number of chat sessions that your customers have with
your service staff each minute. Assume that the PDF of this continuous random variable –
number of chat sessions per minute – denoted by 𝑋 is 𝑓𝑋(𝑥) = 𝑒−𝑥 (𝑥 ≥ 0). The CDF for this
distribution can be computed like so (there is no meaning for negative number of chat sessions
so the value of 𝑥 ≥ 0):
F_X(x) = P(X ≤ x) = ∫_0^x f_X(t) dt = ∫_0^x e^(−t) dt = [−e^(−t)]_0^x = 1 − e^(−x)
(In the calculation above, we have used the antiderivative ∫ e^(−x) dx = −e^(−x) + C.)
Beyond the computation of the CDF, we can also answer several interesting questions armed
with the PDF. For example, what is the chance that more than 3 chat sessions are triggered in a
minute? (Imagine the impact of such events on the efficiency of your service team and
consequently on customer satisfaction). Let us compute this probability now using the formula
listed in the previous paragraph.
P(X ≥ 3) = ∫_3^∞ f_X(x) dx = ∫_3^∞ e^{-x} dx = [-e^{-x}]_3^∞ = -[0 - e^{-3}] = e^{-3} ≈ 0.0498

Rest easy: the probability that more than 3 chat sessions are triggered in a minute with your service staff is only about 0.05.
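The two computations above can be checked numerically. Here is a minimal sketch using scipy's numerical integration (assuming numpy and scipy are available):

```python
import numpy as np
from scipy import integrate

# PDF of the chat-session variable: f_X(x) = e^(-x) for x >= 0
pdf = lambda x: np.exp(-x)

# CDF at x = 2 via numerical integration vs the closed form 1 - e^(-2)
cdf_2, _ = integrate.quad(pdf, 0, 2)
assert abs(cdf_2 - (1 - np.exp(-2))) < 1e-8

# P(X >= 3): integrate the PDF from 3 to infinity; closed form is e^(-3)
tail, _ = integrate.quad(pdf, 3, np.inf)
print(round(tail, 4))  # about 0.0498
```

The same pattern works for any PDF: integrate it over the interval of interest to get a probability.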
Let us now look at the most popular continuous distribution – the normal distribution.
Normal distribution
Let us first get the complicated-looking functional form of the normal distribution out of the way. If X is a continuous random variable following a Gaussian or normal distribution, then its PDF is:

f_X(x) = (1/√(2πσ²)) e^{-(x-μ)²/(2σ²)}
The neatest thing about the normal distribution is that it is symmetric and bell-shaped around its mean (Figure 1).
Figure 1. A plot of the PDF of the normal distribution. The red dotted line indicates the mean,
median and the mode.
Let us now see the mean and variance of the Gaussian distribution:

E[X] = ∫_{-∞}^{∞} x f_X(x) dx = ∫_{-∞}^{∞} x (1/√(2πσ²)) e^{-(x-μ)²/(2σ²)} dx = μ

Var[X] = ∫_{-∞}^{∞} (x - μ)² f_X(x) dx = σ²
As this exercise indicates, despite the complicated expression for the PDF, the mean and variance of a Gaussian are simple (don't worry: in practice, we never compute these integrals by hand). Given the wide prevalence of this distribution, when the random variable X is normally distributed, we usually write X ∼ N(μ, σ²). This expression (read as 'the random variable X is distributed normally with mean μ and variance σ²') spares us from writing and wrangling the complicated-looking PDF.
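Though we never evaluate these integrals by hand, the stated mean and variance can be sanity-checked by simulation. A sketch with numpy (the sample size, seed, and the choice μ = 30, σ = 4 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 30.0, 4.0

# Draw a large sample from N(mu, sigma^2) and compare the sample
# moments with the theoretical mean (mu) and variance (sigma^2)
x = rng.normal(loc=mu, scale=sigma, size=1_000_000)
print(x.mean())  # close to 30
print(x.var())   # close to 16
```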
A very special case of the normal distribution, called the standard normal distribution, occurs when we set the mean μ = 0 and the variance σ² = 1. This variable is usually denoted by Z ∼ N(0, 1) and its PDF is the less ominous-looking:

f_Z(z) = (1/√(2π)) e^{-z²/2}
Why is the standard normal so special (apart from having mean 0 and variance 1)? Well, we can transform any normal distribution with mean μ and variance σ² to the standard normal distribution. For example, if X ∼ N(μ, σ²), then the random variable Z = (X - μ)/σ follows the standard normal distribution (i.e., N(0, 1)). This process is called standardization.
Eons before our world was invaded by computers, when Carl Friedrich Gauss was still wondering about the positions of celestial bodies, having a table of probabilities at various points of the standard normal distribution was a powerful weapon. Let us look at one example to elucidate the superpower that is standardization. Say we have a variable X ∼ N(30, 16). What would be the probability that this variable takes a value between 30 and 32, that is, P(30 < X < 32)? If we did not want to standardize X, this would require us to compute the integral:

∫_30^32 (1/√(2π·16)) e^{-(x-30)²/(2·16)} dx

The trouble is that the evaluation of every new probability requires a new integral to be computed, and if you were in the business of charting the positions of celestial bodies (like Gauss was), this is not what got you the job. Enter standardization.
Since X ∼ N(30, 16), the standard deviation is σ = √16 = 4. Consider the variable Z = (X - 30)/4, which we know will be standard normal (N(0, 1)) since we subtracted the mean and divided by the standard deviation. Now, Z = (X - 30)/4 is the same as X = 4Z + 30. So, P(30 < X < 32) = P(30 < 4Z + 30 < 32) = P(0 < Z < 2/4) = P(0 < Z < 0.5). Neat!
But we have one final trick up our sleeve to compute P(0 < Z < 0.5): the z-table for the standard normal distribution, shown in Figure 2.
Each value in the table represents the area to the left of the z value listed in column 1 (labelled Z). So, the entry corresponding to Z = 0.5, that is, 0.69146, is the area under the standard normal curve to the left of 0.5: P(Z < 0.5) = 0.69146. Similarly, P(Z < 0) = 0.5 (we did not even need the table here, since the standard normal curve is symmetrical around 0). Each further column in the table gives the probabilities at the second decimal place. For example, P(Z < 0.02) = 0.50798, which is the entry at the first row (Z = 0.0) and the column .02.
So, how do we compute P(0 < Z < 0.5)? Look at Figure 3 now. Given that we know P(Z < 0.5) = 0.69146 and P(Z < 0) = 0.5, we have the areas under the curve in Figure 3 to the left of these two points. To compute P(0 < Z < 0.5), that is, the area under the curve between 0 and 0.5, all we have to do is subtract these values. So, P(0 < Z < 0.5) = P(Z < 0.5) - P(Z < 0) = 0.69146 - 0.5 = 0.19146.
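Today, a statistics library replaces the z-table lookup entirely. A sketch with scipy.stats, computing the same probability both directly on N(30, 16) (so σ = 4) and via the standardized variable (assuming scipy is available):

```python
from scipy.stats import norm

# P(30 < X < 32) computed directly on N(mu=30, sigma=4)
p_direct = norm(loc=30, scale=4).cdf(32) - norm(loc=30, scale=4).cdf(30)

# Same probability after standardizing: P(0 < Z < 0.5) on N(0, 1)
p_std = norm.cdf(0.5) - norm.cdf(0.0)

print(round(p_direct, 5), round(p_std, 5))  # both about 0.19146
```

The two answers agree, which is exactly what standardization guarantees.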
Figure 3. Standard normal curve annotated at two points, z = 0 and z = 0.5. Each entry in the table in Figure 2 represents the area under the curve to the left of the points shown in this figure.
Sampling
Sampling is an important way to understand the features of a population of interest (e.g., all millennials in India). When carefully executed, a sample can provide a wealth of information about the population at a fraction of the cost, time and resources. In this context, the class of observations to which we want our inferences to generalize is called the population. The specific aspects of this population that are of interest to us (e.g., recall of a recent TV ad, attitude towards a product, or defects produced by a manufacturing process) are called parameters. To estimate these parameters, we usually examine a portion of the population, referred to as a sample. All the statistical machinery we have developed till now, and will develop further on, exists to ensure that our inferences about the parameters of the population, based on the sample, are accurate.
Types of sampling
Let us first distinguish between two broad ways in which sampling is conducted – random and
nonrandom. A crucial aspect of random sampling is that every unit in the population has the same probability of being included in the sample. This ensures that our inferences depend only on the features we observe in the sample and not on other aspects that might bias the observations.
A common type of random sampling is simple random sampling, where a sample is picked from the population completely at random. For example, to estimate how many Indians like our app, we could pick a randomly generated selection of Aadhaar numbers (say 100 of them) and call them up to find out what they think of our app. In this case, we are relying on the fact that every member of our population has a unique identifier, and we are selecting a sample out of this large number completely at random.
Another popular type of random sampling is stratified random sampling, where the population is first divided into non-overlapping strata (e.g., strata based on the state of residence) and a random sample is then drawn from each stratum. Care is taken so that each stratum is homogeneous in its characteristics. For example, we could first divide the young population of India into different age groups and then randomly sample a selection of Aadhaar numbers from the list for each group. In executing this procedure, we are assuming that the phenomenon we are interested in depends only on the age of the unit being observed.
A third, closely related sampling method is cluster sampling. In this method, the population is divided into non-overlapping groups just like in stratified sampling, but the groups need not be homogeneous. The intent of cluster sampling is to identify mini-populations, or clusters, that are heterogeneous and representative of the population. We then proceed to observe either all the units in a cluster or a random sample from the cluster, depending on resource availability. For example, the states of India form naturally occurring clusters.
A fourth type of sampling method is systematic sampling. In this method, we start at a random point and systematically include every kth member from that point onwards. For example, to take a systematic random sample of all the advertisements that ran on TV last Tuesday, we could pick at random one ad shown in the first hour, and then select every 10th ad shown after it. This method is widely used in advertising research and is very easy to execute.
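The random sampling schemes above can be sketched in a few lines of Python. The population of unit identifiers and the stratum labels here are made-up illustrations:

```python
import random

random.seed(42)
population = list(range(1, 101))  # 100 hypothetical unit identifiers

# Simple random sampling: 10 units, each equally likely to be picked
simple = random.sample(population, k=10)

# Stratified sampling: split into (hypothetical) age-group strata,
# then sample randomly within each stratum
strata = {"18-25": population[:50], "26-35": population[50:]}
stratified = {name: random.sample(units, k=5) for name, units in strata.items()}

# Systematic sampling: random start point, then every 10th unit
start = random.randrange(10)
systematic = population[start::10]

print(len(simple), len(systematic))
```

Note that in the systematic scheme, once the random start is fixed, the rest of the sample is fully determined.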
Finally, we note that beyond the random sampling techniques discussed in this section, there are also non-random sampling techniques, where the inclusion criteria for the sample are defined by the convenience or judgement of the analyst. It is important to note that when random sampling is used, the probability of a member of the population being included in the sample is known and computable. This is an extremely important foundation on which several statistical methods rely, because when the sampling is random, we know that even when the sample is not truly representative of the population, this is only due to chance and not because of bias.
Sampling distributions
Once the random sample has been identified and the observations of interest measured, we can estimate the parameters of the population from the sample. However, a question remains: we took only one sample, so how can we be sure that the estimates from this sample can be believed? This is an important question. While we took a single random sample given the constraints on time, people and cost, we should ideally repeat this process over and over again to get a sense of how our estimates vary because of the differences in the samples we get each time. This is the notion that a sampling distribution formalizes.
Let us illustrate this with a simple example. Consider that a brand made 10 posts on Twitter and
the number of likes received on these posts is as follows (Table 1):
Table 1. Number of likes received on the 10 tweets posted by a brand

Tweet id   Number of likes
1          616
2          504
3          774
4          27
5          45
6          209
7          902
8          529
9          446
10         513
From this population of 10 tweets, let us now take 30 random samples of 2 tweets each and note down the number of likes received on these tweets. How would we do that? Well, we generate two random numbers between 1 and 10 and then include the corresponding tweets from the population in the sample. The 30 random samples are presented in Table 2.
Table 2. 30 random samples of size 2 drawn from the 10 tweets posted by a brand (presented in Table 1)
Sample  Tweet id  Likes    Sample  Tweet id  Likes
1       2         504      16      7         902
1       7         902      16      3         774
2       4         27       17      2         504
2       6         209      17      5         45
3       8         529      18      2         504
3       9         446      18      8         529
4       4         27       19      1         616
4       4         27       19      8         529
5       6         209      20      5         45
5       4         27       20      6         209
6       2         504      21      8         529
6       9         446      21      1         616
7       2         504      22      8         529
7       10        513      22      10        513
8       4         27       23      7         902
8       6         209      23      5         45
9       10        513      24      1         616
9       2         504      24      6         209
10      8         529      25      1         616
10      7         902      25      1         616
11      1         616      26      1         616
11      9         446      26      10        513
12      4         27       27      5         45
12      9         446      27      1         616
13      1         616      28      10        513
13      1         616      28      8         529
14      6         209      29      10        513
14      7         902      29      7         902
15      8         529      30      2         504
15      8         529      30      10        513
As Table 2 indicates, for each sample (1 to 30), we selected two indices at random from 1 to 10 and collected the corresponding number of likes from Table 1. Let us now estimate the mean number of likes received by the brand's tweets from these thirty samples. First, the true mean or population mean = 456.5 (from Table 1). The sample means for the 30 samples are listed in Table 3 and represent the distribution of the mean number of likes estimated from these 30 samples.
Table 3. Mean number of likes estimated from the 30 simple random samples of size 2 drawn from the population
Sample  Mean likes    Sample  Mean likes    Sample  Mean likes
1       703.0         11      531.0         21      572.5
2       118.0         12      236.5         22      521.0
3       487.5         13      616.0         23      473.5
4       27.0          14      555.5         24      412.5
5       118.0         15      529.0         25      616.0
6       475.0         16      838.0         26      564.5
7       508.5         17      274.5         27      330.5
8       118.0         18      516.5         28      521.0
9       508.5         19      572.5         29      707.5
10      715.5         20      127.0         30      508.5
The data presented in Table 3 is called the sampling distribution of the mean, whose main aim
is to estimate the population mean. Note that the mean of these 30 values is 460.1 and the true
population mean is 456.5. Not bad for a sample of size 2!
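We can replicate this experiment in code, and push it further: with many more repeated samples, the average of the sample means settles even closer to the population mean. A sketch with numpy (the number of repetitions and the seed are arbitrary choices; sampling is with replacement, as in Table 2, where sample 4 picked tweet 4 twice):

```python
import numpy as np

# Population: likes on the 10 tweets from Table 1
likes = np.array([616, 504, 774, 27, 45, 209, 902, 529, 446, 513])
print(likes.mean())  # population mean = 456.5

# Draw many random samples of size 2 (with replacement) and record each mean
rng = np.random.default_rng(1)
sample_means = rng.choice(likes, size=(10_000, 2)).mean(axis=1)

# The mean of the sample means approximates the population mean
print(round(sample_means.mean(), 1))
```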
Let’s look at this sampling distribution like we did in Unit 1 with a histogram (Figure 4).
Figure 4. A histogram of the sampling distribution of 30 random samples of size 2 drawn from
Table 1
Figure 4 ties together a lot of the probability and sampling ideas we have discussed in this chapter. When estimating the true mean number of likes (= 456.5), we deemed it too costly to compute this from the whole population. So we used a random sample of size 2 to estimate the mean and generated a sampling distribution by repeating this process 30 times. At the beginning of this chapter, we defined our measure of probability using a frequency approach. Now is finally the time to understand why.
The random variable under exploration here is the number of likes received on a tweet. To estimate the expected value of this variable, we repeatedly sample from the population and estimate the mean from each sample. While individual sample means can deviate substantially (e.g., sample 5 has mean 118), the average across all these samples gets close to the population mean. The probabilistic guarantees that underpin this important finding hinge heavily on the fact that each sample is an independent, completely unbiased draw.
One thorn still remains, though: should we always sample so many times to estimate the parameter of interest? Won't this be a costly exercise? The answer is no, because the Central Limit Theorem provides rock-solid guarantees for a single, appropriately drawn sample. Let us look at this beautiful theorem now.
Central Limit Theorem
The central limit theorem states that when estimating the mean of a population with mean μ and standard deviation σ, if we draw random samples of size n, with n ≥ 30, the sample means x̄ will be approximately normally distributed. Further, the expected value of the sample means is E[x̄] = μ, and the standard deviation of the sample means (commonly referred to as the standard error) is σ_x̄ = σ/√n.
This theorem has several important implications. As long as we are interested in the mean value of a parameter in the population (e.g., mean app rating, mean likes, mean production defects), all we need to do is generate a random sample of size 30 or more from the population. The Central Limit Theorem (fondly referred to as the CLT) guarantees that this sample mean will approximate the population mean. As a bonus, the CLT also tells us that the standard error of our estimate, that is, the variability in the sample mean due to the peculiarities of our sample, is σ/√n and therefore decays with the square root of the sample size.
As a double bonus, the sampling distribution is normal, and we know that through the magic of
standardization (section 4.9), we can arrive at a probabilistic interpretation of our estimates. It
is no exaggeration to say that the CLT holds a central position in all of statistics and has had a
profound impact on the way scientists have understood the world around us.
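The CLT guarantee can be seen empirically: draw repeated samples of size n = 30 from a heavily skewed distribution, and the sample means still line up with μ, with spread σ/√n. A numpy sketch (the exponential distribution and the number of repetitions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# A skewed population: exponential with mean 2 (so mu = 2 and sigma = 2)
mu = sigma = 2.0
n = 30  # sample size, as required by the n >= 30 rule of thumb

# 50,000 repeated samples of size n; record each sample mean
means = rng.exponential(scale=mu, size=(50_000, n)).mean(axis=1)

print(round(means.mean(), 2))  # close to mu = 2
print(round(means.std(), 2))   # close to sigma / sqrt(n), about 0.365
```

A histogram of `means` would show the familiar bell shape, even though the population itself is far from normal.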
Glossary
Probability is a numeric measure associated with the occurrence of an event of interest.
According to frequency theory, it is estimated as the fraction of favorable outcomes to the total
possible outcomes
In a random experiment, while the exact outcome is unknown, it is known to be one of the
outcomes from the sample space
For mutually exclusive events, the occurrence of one event obviates the possibility of
occurrence of another event
Conditional probability refers to the probability of an event occurring given that another event
has already occurred
Prior probability (in the context of Bayes Theorem) signifies the probability of an event before
observing another event
Posterior probability (in the context of Bayes Theorem) signifies the conditional probability of
an event given that another event was observed
Random processes are those processes whose outcomes are inherently uncertain
Random variables map the outcomes of a random process to numeric values, with each value
being associated with a probability of occurrence
Discrete random variables are random variables that take on a finite set of values or a countably infinite set of values (i.e., an infinite sequence of values)
Continuous random variables are random variables that take on any numerical value in an
interval or collection of intervals
The probability distribution of a random variable is an enumeration of the possible values of the
variable and the probabilities that the variable can take each of these values
The Probability Mass Function (PMF) describes the probabilities associated with each of the
values of the discrete random variable
The Probability Density Function (PDF) describes the probability density associated with every value in the range of the continuous random variable
The Cumulative Distribution Function (CDF) at a certain value of a random variable is an aggregation of the probabilities associated with all values up to this value
The expected value of a random variable is the probability weighted average of all the values
that the random variable can assume
A binomial distribution describes the probability of the number of successes in an experiment comprising 𝑛 independent trials, where the probability of success in each trial is 𝑝
The population is the set of observations whose characteristics are of interest to the scientist.
These characteristics are called parameters
In random sampling, the selection of units into the sample is determined probabilistically. Random samples could be simple random samples (each unit is drawn at random), stratified random samples (the population is divided into homogeneous strata and a random sample is drawn from each stratum), cluster samples (the population is divided into non-homogeneous clusters and units are observed within selected clusters) or systematic samples (from a random start point, every kth unit is included in the sample)
The sampling distribution of the mean is the distribution of the means of the observations in the random samples drawn from a population
The standard error of the sample mean is the standard deviation of the sample means, which signifies the variability in the estimate of the sample mean due to the sampling process
The Central Limit Theorem states that, irrespective of the distribution of a variable in the population, the sampling distribution of the mean of samples of size n ≥ 30 is approximately normal, with expected value equal to the population mean
Formulas used
● According to the frequency theory, the probability of an event E is defined as:

P(E) = (# of favorable outcomes) / (Total number of outcomes)
● The conditional probability of event A given that event B has occurred is:

P(A | B) = P(A ∩ B) / P(B)
● When the sample space is divided into a set of disjoint events A_1, …, A_n, the law of total probability states that the probability of an event B is:

P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
● For two events A and B, the Bayes theorem states that:

P(A | B) = P(B | A) P(A) / P(B)
● For a discrete random variable X with PMF f_X(x), the variance is:

σ² = Σ (x − E[X])² f_X(x)

● For a binomial random variable with n trials and success probability p:

E[X] = np,  σ² = np(1 − p)

● For a continuous random variable X with PDF f_X(x):

E[X] = ∫_{−∞}^{∞} x f_X(x) dx,  σ² = ∫_{−∞}^{∞} (x − E[X])² f_X(x) dx