Lecture 5 - Probability


Statistical analysis 2021/22

Lecture 5

Probability Distributions
Sampling
PROBABILITY DISTRIBUTIONS
Uniform Distribution

• A uniform distribution, also called a rectangular distribution, is a probability distribution that has constant probability.
• This distribution encompasses the basic assumption we often make that all outcomes are equally likely.
• e.g., it is the assumption you make about a “fair coin”, or a pack of cards
• For a die, the graph would look like this:

[Figure: bar chart with a bar of height 1/6 over each of the outcomes 1, 2, 3, 4, 5, 6]
Binomial Distribution (1)

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd
edition). Dubuque, IA: McGraw-Hill Education, p. 156-157.
Binomial Distribution (2)

• Many situations can be characterized by “either/or” answers


• Either the answer is yes or no
• Either the percentage of fat in a food is over 50%, or it is not
• Either an item is defective, or it is not

• A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times.
• there are 2 outcomes to each trial: outcomes can be classified into 2 groups
• the probability of “success” does not change from trial to trial, and the trials are independent

• In fact, in most cases, we are not even bothered about which items are defective or have over 50% fat content.
• We only want to know the number with this characteristic.
Binomial Distribution - A Sample of 2

If we have two trials and a probability of success of 0.2, then there are 4 outcomes. If we use p for the probability of a success and q for the probability of a failure, and note that the order doesn’t matter, we have:

Outcome | Probability | Using letters | Order ignored | Result
2 successes | (0.2)(0.2) = 0.04 | p² | p² | P(2S)
A success followed by a failure | (0.2)(0.8) = 0.16 | pq | 2pq | P(1S)
A failure followed by a success | (0.8)(0.2) = 0.16 | qp | |
2 failures | (0.8)(0.8) = 0.64 | q² | q² | P(0S)


Binomial Distribution- A Sample of 3

Looking at a sample of 3, we get a similar pattern (using S for success and F for failure). Using letters, and noting that the order doesn’t matter, we have:

Outcome | Probability | Using letters | Order ignored | Result
S,S,S | (0.2)(0.2)(0.2) = 0.008 | p³ | p³ | P(3S)
S,S,F | (0.2)(0.2)(0.8) = 0.032 | p²q | 3p²q | P(2S)
S,F,S | (0.2)(0.8)(0.2) = 0.032 | pqp | |
F,S,S | (0.8)(0.2)(0.2) = 0.032 | qp² | |
F,F,S | (0.8)(0.8)(0.2) = 0.128 | q²p | 3pq² | P(1S)
F,S,F | (0.8)(0.2)(0.8) = 0.128 | qpq | |
S,F,F | (0.2)(0.8)(0.8) = 0.128 | pq² | |
F,F,F | (0.8)(0.8)(0.8) = 0.512 | q³ | q³ | P(0S)
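The pattern for a sample of 3 can be checked by listing every ordered outcome and adding up the probabilities that share the same number of successes. This is only a sketch: the variable names are illustrative, and p = 0.2 is the value used on the slides.

```python
from collections import defaultdict
from itertools import product

p = 0.2      # probability of success, as on the slides
q = 1 - p    # probability of failure
n = 3        # number of trials

# Add up the probability of every ordered outcome (S,S,S ... F,F,F),
# grouped by how many successes it contains.
prob_by_successes = defaultdict(float)
for outcome in product("SF", repeat=n):
    successes = outcome.count("S")
    prob_by_successes[successes] += p ** successes * q ** (n - successes)

for k in sorted(prob_by_successes, reverse=True):
    print(f"P({k}S) = {prob_by_successes[k]:.3f}")
```

The four totals correspond to p³, 3p²q, 3pq² and q³, and together they add up to 1.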


Combinations

You should be seeing a pattern in these results.


The difficulty will be working out how many ways there are
of getting 1, or 2, or 5 successes in a given number of trials

The answer is a formula called COMBINATIONS


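n = number of trials
r = number of successes
and the number of ways of picking r from n is:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$$

where n! is factorial n: $n! = n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1$.

For example, the number of ways to get 3 successes in 10 trials is:

$$\binom{10}{3} = \frac{10!}{3!\,(10-3)!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{3 \cdot 2 \cdot 1 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1} = 120$$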
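The combinations formula is easy to check in code. The sketch below writes it out with factorials and compares it with Python's built-in `math.comb`; the function name `combinations` is just an illustrative choice.

```python
import math

def combinations(n, r):
    """Number of ways of picking r items from n: n! / (r! (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(combinations(10, 3))   # 3 successes in 10 trials, as in the example
print(math.comb(10, 3))      # built-in equivalent (Python 3.8 or later)
```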
Binomial – A Formula

A general formula for binomial situations is:

$$P(r) = \binom{n}{r}\, p^{r} (1-p)^{n-r}$$
This gives the probability of r successes in n trials
Provided that the binomial conditions are met
Tables of values are also available
or spreadsheets can be used to do the calculations
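As an alternative to tables or a spreadsheet, the general formula can be sketched in a few lines of Python. The function name `binomial_pmf` is an assumption for illustration; the values n = 3 and p = 0.2 are the ones used in the sample-of-3 slide.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 3, 0.2
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]
for r, prob in enumerate(probs):
    print(f"P({r}) = {prob:.3f}")
print("total =", round(sum(probs), 10))
```

The probabilities for r = 0, 1, ..., n always add up to 1, which is a useful check on any hand calculation.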
Binomial Distribution - Example

• What is the probability that with a fair die you throw a six six times in a row?
• n = 6
• r = 6
• p = 1/6
• q = 1 − p = 5/6

$$P(r) = \binom{n}{r}\, p^{r} (1-p)^{n-r}$$

$$P(6) = \binom{6}{6} \left(\frac{1}{6}\right)^{6} \left(1-\frac{1}{6}\right)^{6-6} = \left(\frac{1}{6}\right)^{6} = \frac{1}{46656} \approx 0.0000214$$
Discrete Probability Distributions and Function Names in Excel

• Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 162.

Basic Concepts Of Normal Distribution

• This is a continuous distribution
• but with adjustments can be used for discrete data
• “Normal” does not imply a value judgement
• The distribution does occur very frequently
• In nature, in development.
• In fact, it may occur anywhere where many small factors affect the outcome.
• It is a symmetrical distribution
Why Is It Important?

• occurs in nature
• occurs in production situations
• is the basis of sampling theory
• hence is the basis for the underlying usefulness of statistics to allow us to draw implications about the whole population from the results of a sample
Probability Density Function

• To calculate probabilities we define a probability density function f(x).
• The density function satisfies the following conditions:
• f(x) is non-negative,
• the total area under the curve representing f(x) equals 1.
[Figure: density curve f(x) with total area = 1; the shaded area between x1 and x2 is P(x1 ≤ X ≤ x2)]
• The probability that X falls between x1 and x2 is found by calculating the area under the graph of f(x) between x1 and x2.
Normal Distribution

• A random variable X with mean µ and variance σ² is normally distributed if its probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$

where π = 3.14159... and e = 2.71828...


The Shape of the Normal Distribution

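The normal distribution is bell shaped, and symmetrical around µ.

[Figure: bell curve centred at µ, with 90 and 110 marked on either side]

Why symmetrical? Let µ = 100. Suppose x = 110:

$$f(110) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{110-100}{\sigma}\right)^{2}} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{10}{\sigma}\right)^{2}}$$

Now suppose x = 90:

$$f(90) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{90-100}{\sigma}\right)^{2}} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{-10}{\sigma}\right)^{2}}$$

Because the deviation from the mean is squared, (10/σ)² = (−10/σ)², so f(110) = f(90).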
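The density function and the symmetry argument can be checked numerically. The sketch below uses µ = 100 as on the slide; σ = 10 is an assumed value chosen only for illustration, and the total area is approximated with a simple midpoint rule rather than computed exactly.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x) for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 100.0, 10.0   # sigma = 10 is an assumption for this example

# Symmetry: the density is the same 10 units above and below the mean.
print(normal_pdf(110, mu, sigma), normal_pdf(90, mu, sigma))

# Total area under the curve, approximated over mu +/- 8 sigma.
steps = 10_000
low, high = mu - 8 * sigma, mu + 8 * sigma
width = (high - low) / steps
area = sum(normal_pdf(low + (i + 0.5) * width, mu, sigma) for i in range(steps)) * width
print(round(area, 6))
```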
Normal Distributions

• The shape and symmetry remain the same
• BUT, there are an infinite number of normal distributions
• Some seem relatively flat
• Some seem tall and narrow
The Effects of µ & σ

How does the standard deviation affect the shape of f(x)?

[Figure: normal curves with σ = 2, σ = 3 and σ = 4]

How does the expected value affect the location of f(x)?

[Figure: normal curves with µ = 10, µ = 11 and µ = 12]
Finding Normal Probabilities

• Two facts help calculate normal probabilities:
• The normal distribution is symmetrical.
• Any normal distribution can be transformed into a specific normal distribution called the “STANDARD NORMAL DISTRIBUTION”.


Standard Normal Distribution

ANY normal distribution can be converted into the Standard Normal Distribution:
• Subtract the mean from every value
• Divide by the standard deviation
Standard Normal Distribution
Z-values

• This might sound complicated as a description, but it really is easy
• It gives us z-values
• The number of standard deviations away from the mean
The formula is:

$$z = \frac{X - \mu}{\sigma}$$

Fortunately there are tables of values for areas under the Standard Normal Distribution
Probabilities

• The total area under the normal distribution is ONE
• so area represents probability
• We can use tables to find areas below particular z-values

[Figure: standard normal curve with the area to the left of z shaded]
This area can be found in tables and represents the probability of getting a value BELOW z
Tables
Because the distribution is symmetrical, the tables only need to show values on
one side of the mean

So, the area below z = 1.00 is 0.8413

And the area below z = 1.64 is 0.9495

This means that the probability of getting a value less than 1.64 standard deviations above the mean is 0.9495
Tables
Because the distribution is symmetrical, the tables only need to show values on
one side of the mean

So, the area below z = -3.39 is 0.00035

And the area below z = -2.16 is 0.01539

This means that the probability of getting a value more than 2.16 standard deviations below the mean (z < -2.16) is 0.01539
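The table values can be reproduced without a printed table. The sketch below assumes the standard identity Φ(z) = ½(1 + erf(z/√2)) for the area below z, using `math.erf` from the Python standard library; the function name `standard_normal_cdf` is an illustrative choice.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The four z-values looked up in the tables on the slides.
for z in (1.00, 1.64, -3.39, -2.16):
    print(f"area below z = {z:+.2f}: {standard_normal_cdf(z):.5f}")
```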
Symmetry

• Because of the symmetry of the Normal Distribution, the area above a positive value of z must be the same as the area below the corresponding negative value.
• If the area below z = -1 is 0.1587
• Then the area above z = +1 is 0.1587
• And since the total area is 1
• The area between z = -1 and z = +1 is 1 - 0.1587 - 0.1587 = 0.6826
• So the probability of getting a value in this region is also 0.6826
• Or you could say, the percentage of the distribution between z = -1 and z = +1 is 68.26%
• Similarly, since we know 0.0505 of the distribution is above z = +1.64
• We can say that only 5.05% of the distribution is above that z-value.
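The symmetry argument can be sketched in code as follows, again assuming the error-function form of the standard normal area. Note that the unrounded area between z = -1 and z = +1 is about 0.6827; the figure 0.6826 above comes from subtracting table values that were already rounded to four decimals.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

below_minus_1 = standard_normal_cdf(-1.0)
above_plus_1 = 1.0 - standard_normal_cdf(1.0)
between = 1.0 - below_minus_1 - above_plus_1
above_1_64 = 1.0 - standard_normal_cdf(1.64)

print(f"below -1: {below_minus_1:.4f}, above +1: {above_plus_1:.4f}")
print(f"between -1 and +1: {between:.4f}")
print(f"above +1.64: {above_1_64:.4f}")
```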
An Example
We will look at a business set-up where the owner has determined that a profit can be made on weekly sales above €2,000.
The bank has asked for a Business Plan showing the proportion of time that the business is likely to be in profit and also how often it might expect to have a turnover above (a) €2,500 and (b) €3,000.
An independent market analyst has said that, in this market, the average expected weekly sales are €1,900 with a standard deviation of €500 and that a normal distribution model can be used.

Remember the formula:

$$z = \frac{X - \mu}{\sigma}$$

Let’s see what we have:
Average Sales (µ) = 1900
Standard Deviation (σ) = 500
And we need probabilities associated with three different X values: 2000, 2500, and 3000
Finding the z-values

To make a profit we need sales above €2,000, so:

$$z = \frac{2000 - 1900}{500} = \frac{100}{500} = 0.2$$

We can look at the tables and find that the area above z = 0.2 is 0.4207
So, the business can expect to make a profit 42.07% of the time
(It does look as if it might be successful)
More z-values

For X = €2,500, we have:

$$z = \frac{2500 - 1900}{500} = \frac{600}{500} = 1.2$$

Giving a probability of 0.1151

For X = €3,000, we have:

$$z = \frac{3000 - 1900}{500} = \frac{1100}{500} = 2.2$$

Giving a probability of 0.0139

So we would expect the business to have sales above €2,500 for 11.51% of the time, and sales above €3,000 for only 1.39% of the time.
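The three probabilities in this example can be reproduced in a few lines. This is a sketch under the same assumptions as the slides (normal model, µ = 1900, σ = 500); the area above z is taken as one minus the area below z, computed with `math.erf` instead of a table.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 1900.0, 500.0   # average weekly sales and standard deviation

prob_above = {}
for x in (2000, 2500, 3000):
    z = (x - mu) / sigma
    prob_above[x] = 1.0 - standard_normal_cdf(z)
    print(f"X = {x}: z = {z:.1f}, P(sales above X) = {prob_above[x]:.4f}")
```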
SAMPLING
Sampling: A Recap of the Basics

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 220.
Sampling

• Selecting units from population

• Elements of sampling
• population
• sampling framework
• random number generator
• one vs. more samples

• Sampling error and sampling bias

Source: Doane, D. P., Seward, L. E. (2016). Applied Statistics in Business and Economics. New York, NY:
McGraw-Hill Education, p. 293.
Types of Sampling

• Probability Sampling
• This is where every item has a calculable chance of selection

• Non-probability Sampling
• this is where someone has some choice in who or what is selected
• this would mean that some people or organisations had zero chance of selection
Types of Random Sample

• Simple Random Sample


• most basic, like picking numbers from a hat
• SRS without replacement
• Once an individual is sampled, that person is not placed back in the
population for re-sampling.
• SRS with replacement
• Once a person is selected to be in a sample, that person is placed
back in the population to possibly be sampled again.

• Systematic Sample
• Stratified Sampling
• uses information that we already have to try to make sure the sample
reflects the population

• Clusters
• Multi-stage Designs
• only practical method for national surveys
Simple Random Sample (1)

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 222.
Simple Random Sample (2)

• Subjects in the population are sampled by a random process, using either a random
number generator or a random number table, so that each person remaining in the
population has the same probability of being selected for the sample.

[Figures: random sample of three units from a population of nine units; table of random numbers.]
Simple Random Sample: An Example

• A population of nine drug addicts: all nine addicts have injected heroin into their veins, and have often shared needles and equipment with colleagues.
• Three of the nine addicts are now infected with HIV (cross-hatched figures).
• Sampling with replacement: all nine addicts have the same
probability of being selected (i.e., 1 in 9) in all three steps,
since the selected addict is placed back into the population
before each step.
• Sampling without replacement: the selection process changes.
• Step 1: each addict in the population has the same probability of
being selected.
• Step 2: the situation changes. Once the first addict is chosen, he is
not placed back in the population. Thus, the second addict to be
sampled comes from the remaining eight addicts in the population,
all of whom have the same probability of being selected (i.e., 1 in 8).
• Step 3: the selection is derived from a population of seven addicts,
with each addict having a probability of 1 in 7 of being selected.
• Once the steps are completed, the sample contains three different
addicts.
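The two procedures can be sketched with Python's `random` module. The labels 1 to 9 for the nine units and the fixed seed are assumptions made only so that the example is repeatable.

```python
import random
from fractions import Fraction

population = list(range(1, 10))   # nine units, labelled 1..9 for illustration
rng = random.Random(42)           # fixed seed so the run is repeatable

# Without replacement: three different units are always selected.
without_replacement = rng.sample(population, k=3)

# With replacement: the same unit may be selected more than once.
with_replacement = rng.choices(population, k=3)

print(without_replacement, with_replacement)

# Chance of one particular ordered sample of three under each scheme.
p_ordered_without = Fraction(1, 9) * Fraction(1, 8) * Fraction(1, 7)
p_ordered_with = Fraction(1, 9) ** 3
print(p_ordered_without, p_ordered_with)
```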
Stratified Random Sampling (1)

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 222.
Stratified Random Sampling (2)

• This sampling procedure separates the population into mutually exclusive sets (strata), and
then draws simple random samples from each stratum.
Stratified Random Sampling (3)

• a stratum is a subset of the population that shares at least one common characteristic (males and females, or managers and non-managers...)

• the researcher first identifies the relevant strata and their actual representation in the population

• random sampling is then used to select a sufficient number of subjects from each stratum

• often used when one or more of the strata in the population have a low incidence relative to the other strata
• stratified sampling can reduce cost per observation and narrow the error bounds (reduces sampling error)
Stratified Random Sampling (4)

• With this procedure we can acquire information about:

• the whole population

• each stratum
• the relationships among strata.
• Advantages:
• It guarantees that the population subdivisions of interest are represented in the sample.
• The estimates of parameters produced from stratified random sampling have greater precision than
estimates obtained from simple random sampling.
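A minimal sketch of stratified random sampling with proportional allocation is shown below. The population (80 non-managers and 20 managers), the sample size of 10 and all names are hypothetical; in general the rounded stratum sizes may need adjusting so that they add up to the intended total.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable

# Hypothetical population split into two mutually exclusive strata.
population = {
    "non-manager": [f"N{i}" for i in range(80)],
    "manager": [f"M{i}" for i in range(20)],
}
sample_size = 10
total = sum(len(members) for members in population.values())

# Draw a simple random sample from each stratum, sized in proportion
# to the stratum's share of the population.
sample = {}
for stratum, members in population.items():
    k = round(sample_size * len(members) / total)
    sample[stratum] = rng.sample(members, k)

for stratum, chosen in sample.items():
    print(stratum, len(chosen), chosen)
```

Every stratum is guaranteed to appear in the sample, which is the first advantage listed above.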
Stratified Random Sample: An Example
Cluster Sampling (1)

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 223.
Cluster Sampling (2)

• applied when natural groupings exist


• difference between sampling units and units of observation
• sampling units are clusters
• units of observation are members of the cluster
• main advantage: (lower) costs and time
• main disadvantage: higher sampling error
• common application: geographical sampling
• heterogeneity within each cluster is an important feature of the ideal cluster design
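The distinction between sampling units (clusters) and units of observation (members) can be sketched as follows. The six city blocks with four households each are a hypothetical population, and the names are illustrative only.

```python
import random

rng = random.Random(1)   # fixed seed so the run is repeatable

# Hypothetical population: 6 city blocks (clusters) of 4 households each.
clusters = {
    f"block-{b}": [f"block-{b}/house-{h}" for h in range(1, 5)]
    for b in range(1, 7)
}

# The sampling units are the clusters: choose 2 blocks at random.
chosen_clusters = rng.sample(sorted(clusters), k=2)

# The units of observation are all members of the chosen clusters.
sample = [unit for name in chosen_clusters for unit in clusters[name]]

print(chosen_clusters)
print(sample)
```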
Cluster Sampling (3)

http://www.youtube.com/watch?v=QOxXy-I6ogs&feature=related
Stratified vs. Cluster Sampling (1)

Source: Jaggia, S., & Hawke, A. K. (2020). Essentials of Business Statistics: Communicating with Numbers (2nd edition).
Dubuque, IA: McGraw-Hill Education, p. 223.
Stratified vs. Cluster Sampling (2)

Stratified | Cluster
The population is divided into homogeneous segments, and then the sample is randomly taken from the segments. | The members of the population are selected at random from naturally occurring groups called clusters.
Division into groups is imposed by the researcher | Naturally occurring groups
Few groups | Many groups
All groups selected | Some groups selected
Randomly selected individuals are taken from all the strata | All the individuals are taken from randomly selected clusters
Some units in a group selected | Often all members of a group selected
Requires good knowledge of population | No knowledge regarding population needed
Reduces sampling error | Reduces costs
Homogeneity within subgroups | Homogeneity between subgroups
Heterogeneity between groups | Heterogeneity within subgroups
The objective is to increase precision and representation. | The objective is to reduce cost and improve efficiency.
Stratified vs. Cluster Sampling (3)

• In stratified sampling, a sample is drawn from each stratum.
• In cluster sampling, the sampling unit is the whole cluster; instead of sampling individuals from within each group, a researcher will study whole selected clusters.
Systematic Sampling

• Select every kth item from a list or sequence (e.g., restaurant customers)
• Systematic sampling is quick and convenient when you have a complete list of the members of your population (for example, members of Congress). However, if there’s some kind of pattern to the original list, then bias may creep into your statistics.
• For example, if a list of people is ordered as MFMFMFMF, then choosing every 10th person will give you a sample consisting entirely of females.

Source: Doane, D. P., Seward, L. E. (2016). Applied Statistics in Business and Economics. New York, NY:
McGraw-Hill Education, p. 37.
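The MFMF example can be sketched in code. The helper `systematic_sample` and the list of 100 people are assumptions made for illustration; in practice the starting point is usually chosen at random between 0 and k - 1.

```python
def systematic_sample(items, k, start):
    """Every k-th item of the list, beginning at index start (0 <= start < k)."""
    return items[start::k]

people = ["M", "F"] * 50   # a list ordered M, F, M, F, ...

# Every 10th person (the 10th, 20th, 30th, ...) is always female.
every_10th = systematic_sample(people, k=10, start=9)
print(every_10th)

# An odd interval does not line up with the pattern in the list.
every_9th = systematic_sample(people, k=9, start=0)
print(every_9th)
```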
Multistage Sampling

• Using a combination of the sampling methods at various stages.
• Example:
• Stratify the population by region of the country.
• For each region, stratify by urban, suburban, and rural and take a random sample of communities within those strata.
• Divide the selected communities into city blocks as clusters, and sample some blocks.
• Everyone on the block or within the fixed area may then be sampled.
Types of Non-Random Sample

• Quota Sample
• most frequently used, especially in market research
• again uses information we already have about the population in order for the sample to
reflect this

• Judgmental Sampling
• Snowball Sampling
• Convenience Sampling
Quota Sampling

• nonprobability equivalent of stratified sampling
• researcher first identifies the strata and their proportions as they are represented in the population
• convenience or judgment sampling is used to select the required number of subjects from each stratum
• the selection of the sample is made by the researcher (interviewer), who has been given quotas to fill from specified sub-groups of the population
• example: sample 50 females between the ages of 45 and 60
• this differs from stratified sampling, where the strata are filled by random sampling
Other Non-Random Sampling Methods

• Convenience Sampling
• used in exploratory research
• sample is selected because they are convenient
• first available primary data source will be used without additional requirements (e.g., Facebook polls).
• Judgment (Purposive) Sampling
• common nonprobability method
• researcher selects the sample based on judgment
• most effective in situations where only a restricted number of people in a population have the qualities that a researcher expects from the target population
• extension of convenience sampling
• the researcher must be confident that the chosen sample is truly representative of the entire population
• Snowball Sampling
• used when the desired sample characteristic is rare
• relies on referrals from initial subjects to generate additional subjects
• may introduce bias
Sampling Methods: An Overview
Sources of Error: Sampling Bias vs. Sampling Error

• In sampling, the word bias does not refer to prejudice. Rather, it refers to a systematic
tendency to over- or underestimate a population parameter of interest.
• The word error generally refers to issues / characteristics of sample methodology that lead
to inaccurate estimates of a population parameter.

Source: Doane, D. P., Seward, L. E. (2016). Applied Statistics in Business and Economics. New York, NY:
McGraw-Hill Education, p. 41.
Sampling Error and Survey Bias (1)

• A survey produces a sample statistic, which is used to estimate a population parameter.


• If you repeated a survey many times, using different samples each time, you might get a different
sample statistic with each replication. Each of the different sample statistics would be an estimate
for the same population parameter.
• If the statistic is unbiased, the average of all the statistics from all possible samples will equal the
true population parameter; even though any individual statistic may differ from the population
parameter.
• The variability among statistics from different samples is called sampling error.
• Increasing the sample size tends to reduce the sampling error; it makes the sample statistic less
variable.
• Increasing sample size does not affect survey bias. A large sample size cannot correct for the
methodological problems (undercoverage, nonresponse bias, etc.) that produce survey bias.
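The difference between sampling error and survey bias can be illustrated with a small simulation. Everything here is hypothetical: a population of 10,000 values with mean near 50, and a sampling frame with undercoverage that leaves out every value above 60. The sketch repeats the survey 500 times for two sample sizes.

```python
import random
import statistics

rng = random.Random(2021)   # fixed seed so the run is repeatable

# Hypothetical population and its true mean.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Undercoverage: this frame omits every unit with a value above 60.
covered_frame = [x for x in population if x <= 60]

def sample_means(frame, n, repeats=500):
    """Means of repeated simple random samples of size n from the frame."""
    return [statistics.fmean(rng.sample(frame, n)) for _ in range(repeats)]

summary = {}
for n in (25, 400):
    fair = sample_means(population, n)
    biased = sample_means(covered_frame, n)
    summary[n] = (
        statistics.stdev(fair),                   # sampling error (spread)
        statistics.fmean(fair) - true_mean,       # bias with full coverage
        statistics.fmean(biased) - true_mean,     # bias with undercoverage
    )
    print(n, [round(v, 2) for v in summary[n]])
```

With the larger sample the spread of the sample means shrinks, but the estimates from the incomplete frame stay systematically too low.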
Sampling and Non-Sampling Errors

• Two major types of errors can arise when a sampling procedure / data collection is performed.
• Sampling Error
• Sampling error refers to differences between the sample and the population, because of the specific observations that happen to be selected.
• Sampling error is expected to occur when making a statement about the population based on the sample taken.
• Non-sampling Error
• Non-sampling error is the error that arises in a data collection process as a result of factors other than taking a sample.
• Increasing sample size will not reduce this type of error.
• Non-sampling errors have the potential to cause bias in polls, surveys or samples.
• There are three types of non-sampling errors:
• errors in data acquisition
• non-response errors
• selection bias.
