Study Guide For Biostats
Code: BPH3612
NQF level: 6
Notional hours: 160
Contact Hours: 4 hours lecture + 2 hours Tutorials per week for 14 weeks
NQF Credits: 16
Pre-requisite/co-requisite: MAT3511 – Basic Mathematics
Compulsory/Electives: Compulsory
Semester offered: 2
Module aims:
This module aims to introduce students to the basic principles and applications of statistics in
public health practice. Intermediate-level biostatistics will be taught in the Statistical Methods in
Epidemiology module later in the programme. This module is also a foundation for many public
health modules, including the research modules and the research project.
Module Content
The use of statistics in public health – Data types, frequency distribution, Descriptive statistics:
description and presentation of data, random variables and distributions, descriptive statistics,
Inferential statistics: introduction to probability, sample and population, z-scores, sampling error,
sampling, estimation, elements of hypothesis testing, type I and II errors, one- and two-sample
tests, non-parametric tests.
Contents
1 Introduction to statistics in public health
 1.1 Why statistics?
 1.2 Descriptive vs inferential statistics
2 Data and variable types in public health
 2.1 Quantitative variables
 2.2 Qualitative variables
 2.3 Levels of measurement
3 Data management, analysis and presentation
 3.1 Frequency tables
 3.2 Graphs and charts for the various variable types
4 Measures of central tendency and dispersion (or variability)
 4.1 Central tendency
 4.2 Dispersion and variability
5 Distribution types and the relationship between probability and the area under the normal curve
 5.1 Defining Probability
 5.2 Probability distributions
6 Types of sampling methods
 6.1 Types of sampling
  6.1.1 Nonprobability Sampling
  6.1.2 Probability Sampling
  6.1.3 Sampling Frame
  6.1.4 Simple Random Sampling
  6.1.5 Systematic sampling
  6.1.6 Stratified Sampling
  6.1.7 Cluster Sampling
7 Statistical tests in comparing population means and proportions and confidence intervals for population means and proportions
 7.1 Estimation of means
  7.1.1 Sampling distribution of means
  7.1.2 Confidence interval of means
 7.2 Estimation of proportions
  7.2.1 Sampling distribution of proportions
  7.2.2 Confidence interval of proportions
 7.3 Test of hypothesis, p-values and associated errors
  7.3.1 Hypothesis Testing
  7.3.2 One sample tests
  7.3.3 Two sample tests
  7.3.4 Type I and Type II errors
 7.4 Non-parametric tests
1 Introduction to statistics in public health
1.1 Why statistics?
In public health, sample statistics are used to learn about population parameters. For example, we may be interested in:
- The prevalence of certain diseases within some populations.
- Observing certain groups of individuals over a period of time to determine if they develop some disease outcome and what could have led to that.
- Whether a vaccine or drug is effective and to what extent the effect is within a population, or whether that effect is by chance or real.
Population
This is the entire set of possible individuals, items or objects of interest in a study
Sample
This refers to a subset of the population, typically selected for detailed study in order to know
more about the population and make informed decisions concerning it.
Sampling unit
Any item/unit under study, if selected from a larger population, is called a sampling unit, and
observations are made on sampling units in order to make generalizations about the population.
1.2 Descriptive vs inferential statistics
Descriptive statistics
Deals with using summary measures to describe data, typically showing averages, central
locations and dispersion in the distribution of the data, and may be illustrated with graphs and charts.
For example:
- To summarize or describe the distribution of a single variable. These statistics are called univariate (one variable) descriptive statistics.
- To describe the relationship between two or more variables. These statistics are called bivariate (for two variables) or multivariate (more than two variables) descriptive statistics.
Inferential statistics
Deals with characterizing or making decisions about a population based on information
from a sample drawn from that population. This is used when we wish to generalize our
findings from a sample to a larger population and is mostly based on the probability
distribution (explained later) of the statistic which has been estimated. The probability
distribution of the estimated statistic determines the degree of certainty to which an
inference is made.
There are also univariate, bivariate and multivariate inferential statistics, just as we
have them for descriptive statistics.
2 Data and variable types in public health
Data
When observations are made and recorded, they have to be stored in some form: as numbers,
text, dates, times, etc. The collection of such records is what we term data, and it forms the basis
of any statistical analysis. Whether a statistical analysis provides reliable results depends largely
on the nature and quality of the data collected. The right types of observations and recordings
have to be made to ensure accuracy and reduce errors, which can come from multiple sources.
The type of analysis that can be carried out also depends on the type of data collected, which
are stored in variables.
Variables
A variable is an element of a dataset that holds values for observations that vary. When all
values held by the variable are the same then it is said to be a constant.
Types of variables
Variables are broadly classified as quantitative or qualitative.
Continuous variable
These are measured on the continuous scale, where there exists an infinite number of
subdivisions between any two data points, and they may have an infinite number of decimal
places depending on the degree of precision of the instrument used in recording the observation.
E.g. heights and weights of individuals, temperature, distance between points, etc. These
can be analysed using means, medians, standard deviations, inter-quartile ranges, etc.
Discrete variables
Discrete variables also represent number and magnitude, but they can only assume whole
numbers: there are no subdivisions between two successive points. E.g. the number of people in
a class, the population of a city/town, the number of households, etc. For example, you can't have
one and a half people in practice, although that is mathematically possible, and so rounding up
or down is typically done after analysis. Discrete variables may have some order within them,
making them ordinal, although the reverse cannot be said for ordinal variables in general. These
can also be analysed using means, medians, standard deviations, inter-quartile ranges, etc.
Nominal variables
These show whether an observation belongs to one category or another, and they are also
referred to as categorical variables. Categories may not be compared directly, except through
some other quantitative measure. Where there are only two categories, the variable is called a
binary variable. E.g. sex of a person, nationality, type of health facility, etc. They are typically
summarised with frequencies and proportions or percentages.
Ordinal variables
These variables come about where there is some order or ranking in a nominal/categorical
variable. The difference between each step in that order may not be the same across the
entire variable. E.g. employment rank, or position in a race, where the difference in time
between the first and second person may not be the same as the difference between the
second and third person. They may also be analysed using frequencies and proportions or
percentages.
Derived variables
These are obtained by converting one of the variable types into another form. E.g. Body
Mass Index (BMI) which is obtained by dividing the weight of a person (in kg) by the square
of the height of the person (in metres). This same variable can be categorized to indicate
if the person is underweight, normal, overweight or obese based on the values obtained.
Derived variables may fall under any of the variable types.
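
To make this concrete, here is a minimal Python sketch (illustrative, not part of the original guide; the cut-offs follow the BMI table used later in this document) that derives BMI from weight and height and then categorises it:

def bmi(weight_kg, height_m):
    # Body Mass Index: weight (kg) divided by the square of height (m)
    return weight_kg / height_m ** 2

def bmi_category(value):
    # Collapse the continuous (ratio-scale) BMI into four ordinal categories
    if value < 18.5:
        return "Underweight"
    elif value < 25.0:
        return "Normal"
    elif value < 30.0:
        return "Overweight"
    else:
        return "Obese"

print(bmi(70, 1.75))                 # 22.86 (continuous, ratio scale)
print(bmi_category(bmi(70, 1.75)))   # "Normal" (derived, ordinal scale)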
2.3 Levels of measurement
Nominal scale
Variables measured at the nominal level have categories that are not numerical.
Examples of variables at the nominal level include sex, residence, ethnicity, religious
affiliation, etc. This is the lowest level of measurement and the only mathematical
operation possible is comparing the relative sizes of the categories (e.g., more females
than males in this class). The categories or scores of nominal level variables are not ranked
with respect to each other and cannot be added, divided, or otherwise manipulated
mathematically.
Ordinal scale
Data measured on the ordinal scale has some meaningful order, so that higher values may
represent more of some characteristic than lower values and vice-versa depending on
what the values represent. For example, in medical practice burns may be described by
their degree, which describes the amount of tissue damage caused by the burn. These
categories may be ranked in a logical order: first-degree burns are the least serious in terms
of tissue damage, third-degree burns the most serious. However, there is no metric
measurement to quantify how great the distance between categories is, nor is it possible
to determine if the difference between first- and second-degree burns is the same as the
difference between second- and third-degree burns.
Interval scale
Interval scale data have a meaningful order and equal intervals between measurements
representing equal changes in the quantity of what is being measured. The most common
example of interval data are the temperature scales (degrees Celsius or Fahrenheit) where
the difference between 10 degrees and 20 degrees (a difference of 10 degrees)
represents the same amount of temperature change as the difference between 30 and
40 degrees. Addition and subtraction are appropriate with interval scales: a difference of
10 degrees represents the same amount over the entire scale of temperature. However,
interval scales have no natural zero point, because 0 on a temperature scale does not
represent an absence of temperature but simply a location relative to other temperatures.
Multiplication and division are not appropriate with interval data: there is no
mathematical sense in saying that 60 degrees is twice as hot as 30 degrees; it is simply 30
degrees more. Interval scales are rare, and it is difficult to think of other common
examples.
Ratio scale
Ratio scale data have all the characteristics of interval data (natural order, equal intervals)
plus a natural zero point. Many physical measurements are ratio data: for instance, height,
weight, and age, etc. Zero means the absence of the quantity, for example 0 income or
0 litres of a fluid. With this scale, it is appropriate to multiply and divide as well as add and
subtract: it makes sense to say that someone weighing 100kg is twice as heavy as someone
whose weight is 50kg, or a person who is 50 years old is 5 times as old as someone who is
10 years old.
It should be noted that some psychological measurements (IQ, aptitude, etc.) are not truly
interval, but rather ordinal (e.g., values from a Likert scale). However, you may still see
interval or ratio techniques applied to such data (for instance, the calculation of means,
which involves division). While incorrect from a strict statistical point of view, sometimes you
have to go with the conventions of your field, or of the variable involved.
3 Data management, analysis and presentation
In statistics and scientific research, the answers being sought from the data, as well as the type
of variable, determine the analysis that is carried out and presented. Charts or graphs may be
used if the results look better illustrated than displayed in a table along with other results,
where they may get 'drowned out'. The first question to ask when considering a graphic
method of presentation is whether it is needed at all.
3.1 Frequency tables
A picture may be worth a thousand words, but sometimes frequency tables do a better
job than graphs at presenting information. This is particularly true when the actual values
of the numbers in different categories, rather than the general pattern among the
categories, are of primary interest. Frequency tables are also an efficient way to present
large quantities of data.
The table below shows a frequency distribution of educational level for a sample of 95
people. Cumulative relative frequency allows us to tell at a glance, for instance, that 74.74% of
these people had at most primary education.

Educational level    Frequency    Relative frequency (%)    Cumulative relative frequency (%)
None                 41           43.16                     43.16
Primary              30           31.58                     74.74
Secondary school     13           13.68                     88.42
College              11           11.58                     100.00
Total                95           100.00
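
Such a table can be reproduced with a short Python sketch using only the standard library (the helper name frequency_table is hypothetical, and the data below are reconstructed to match the totals above):

from collections import Counter

def frequency_table(values, order):
    # Rows of (category, frequency, relative %, cumulative relative %)
    counts = Counter(values)
    n = len(values)
    rows, cumulative = [], 0.0
    for category in order:
        rel = 100 * counts.get(category, 0) / n
        cumulative += rel
        rows.append((category, counts.get(category, 0), round(rel, 2), round(cumulative, 2)))
    return rows

data = ["None"] * 41 + ["Primary"] * 30 + ["Secondary school"] * 13 + ["College"] * 11
for row in frequency_table(data, ["None", "Primary", "Secondary school", "College"]):
    print(row)   # e.g. ('None', 41, 43.16, 43.16)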
Suppose we are interested in collecting data on the general health of these people for
example. Because obesity is a matter of growing concern, one of the statistics we can
look at is the Body Mass Index (BMI), calculated as weight in kilograms divided by squared
height in meters. This results in data on a continuous scale just like the weights and heights
used. It wouldn’t be useful to display a frequency distribution table listing the individual
frequencies as we would end up with a very long list having low frequencies or even
possibly a frequency of 1 for each value. We therefore group them based on meaningful
categories.
The table below shows the BMI categories, a derived variable from weight and heights
which were then categorised from a continuous variable (ratio scale) to an ordinal
variable (ordinal scale).
BMI category             Frequency    Relative frequency (%)    Cumulative relative frequency (%)
Underweight (<18.5)      2            2.11                      2.11
Normal (18.5-24.9)       54           56.84                     58.95
Overweight (25.0-29.9)   34           35.79                     94.74
Obese (30.0+)            5            5.26                      100.00
Total                    95           100.00
We can also construct frequency tables to make comparisons between groups, for
instance, the distribution of BMI in males and females. The use of row or column
percentages is very important depending on what we want to interpret.
The table below shows the comparison between males and females for BMI category with
column percentages used (in the brackets) because we want to compare the columns
with regards to BMI.
[Table: BMI category by sex, with column percentages in brackets]
Other tables are also used to display statistics beyond just frequencies depending on what
statistics are being computed or compared.
3.2 Graphs and charts for the various variable types
Bar Charts
The bar chart is particularly appropriate for displaying discrete, ordinal or nominal data
with only a few categories, as in our example of educational level and BMI. The bars in a
bar chart are separated from each other so they do not suggest continuity in the
distribution of the variable, although for BMI categories which are ordinal this is not
absolutely the case. Absolute frequencies are useful when you need to know the number
of people in a particular category and relative frequencies are more useful when you
need to know the relationship of the numbers in each category, particularly when
comparing multiple groups: for example, whether the proportion of obese people is rising
or falling. The concept of relative frequencies becomes even more useful if we compare
the distribution of BMI categories over several groups.
[Bar chart showing the frequency of each BMI category; y-axis: frequency, 0-60]
Pie chart
Pie charts present data in a manner similar to the stacked bar chart: it shows graphically what
proportion each part occupies of the whole. Pie charts, like stacked bar charts, are most useful
when there are only a few categories of information, and when the differences among those
categories are fairly large. We will present the same BMI information in pie chart form. This is a
single pie chart but other options are available including side-by-side charts to allow comparison
of the proportions of different groups and exploded sections to show a more detailed breakdown
of categories within very small segments.
[Pie chart of BMI categories: Underweight 2.1%, Obese 5.3%, Overweight 35.8%, Normal 56.8%]
Histogram
The histogram is normally used for displaying continuous data and looks similar to a bar chart, but
generally has many more individual bars, which represent ranges of a continuous variable. To
emphasize the continuous nature of the variable displayed, the bars (also known as “bins,”
because you can think of them as bins into which values from a continuous distribution are sorted)
in a histogram touch each other, unlike the bars in a bar chart. Bins do not have to be the same
width, although frequently they are. The x-axis (horizontal axis) represents a scale rather than simply
a series of labels, and the area of each bar represents the percentage of values that are
contained in that range. The histogram below shows the original BMI values on the continuous
scale. Note the shape compared to the bar chart for the categorical BMI.
[Histogram of body mass index (kg/m²), 15-40 on the x-axis; y-axis: percent]
Boxplot
The boxplot, also known as the “box and whiskers plot,” is a compact way to summarize and
display the distribution of a set of continuous data. Although boxplots can be drawn by hand (as
can many other graphics, including bar charts and histograms), in practice they are nearly always
created using software. Interestingly, the exact methods used to construct a boxplot vary from
one software program to another, but they are always constructed to highlight five important
characteristics of a data set: the median, the first and third quartiles (and hence the interquartile
range (IQR) as well), and the minimum and maximum. The central tendency, range, symmetry,
and presence of outliers in a data set can be seen at a glance in a boxplot, and side-by-side
boxplots make it easy to make comparisons among different distributions of data.
The boxplot below displays the continuous BMI values. The box represents the middle 50% of the
data. The lower whisker is obtained by subtracting 1.5 × the interquartile range (1.5×IQR) from the
lower quartile: if there are data values lower than that, the cut-off is set there and the values
beyond it are indicated as outliers. However, if the lowest value is above this 1.5×IQR mark, the
whisker is shortened to that lowest value. The opposite is done for the upper whisker by adding
1.5×IQR to the upper quartile.
[Boxplot of BMI values, annotated (top to bottom): outlier, upper whisker, upper quartile, median, lower quartile, lower whisker; scale 15-35]
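
The whisker rule described above can be sketched in a few lines of Python (the function name and data are hypothetical; the quartiles are assumed to have been computed already):

def boxplot_limits(lower_quartile, upper_quartile, data):
    # Fences sit 1.5 x IQR beyond the quartiles; whiskers stop at the most
    # extreme data values inside the fences; values beyond them are outliers.
    iqr = upper_quartile - lower_quartile
    low_fence = lower_quartile - 1.5 * iqr
    high_fence = upper_quartile + 1.5 * iqr
    lower_whisker = min(x for x in data if x >= low_fence)
    upper_whisker = max(x for x in data if x <= high_fence)
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return lower_whisker, upper_whisker, outliers

bmi_values = [16.0, 19.2, 21.5, 23.0, 24.8, 26.1, 27.5, 29.0, 38.0]
print(boxplot_limits(21.5, 27.5, bmi_values))   # (16.0, 29.0, [38.0])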
Scatter plots
Charts that display information about the relationship between two variables are called bivariate
charts: the most common example is the scatterplot. Scatterplots define each point in a data set
by two values, commonly referred to as x and y, and plot the intersection of each pair of points
on a pair of their respective axes. Conventionally the vertical axis is called the y-axis and represents
the y-value for each point, and the horizontal axis is called the x-axis and represents the x-value.
Scatterplots are a very important tool for examining bivariate relationships among variables.
Consider the following data showing age and BMI (kg/m²); each point on the scatter plot shows
the intersection of a pair of values.
Age (years)    BMI (kg/m²)
60             23.7
49             23.7
55             27.9
61             25.8
52             22.1
50             19.7
52             23.6
55             19.7
42             22.7
41             26.8
66             31.6
54             29.5
51             28.3
35             25.4
41             23.9

[Scatter plot of body mass index (kg/m²) against age (years)]
Line graphs
Line graphs are also often used to display the relationship between two variables, often between
time on the x-axis and some other variable on the y-axis. A useful feature is that more than one
y-variable can be plotted against the same x-values, so a line graph can be used to compare two variables over time.
Consider the following data showing the prevalence of obesity among some population,
measured annually over a 12-year period. The intersection of the data points are marked and
then joined with the lines in a continuous manner. Multiple lines may be compared across the
same time period and up to two y-axis variables may also be compared if the range of values are
similar.
Year    Prevalence (%)
2001    14.6
2002    14.6
2003    15.7
2004    16.4
2005    17.8
2006    18.8
2007    17.6
2008    19.3
2009    20.7
2010    21.1
2011    22.0
2012    23.1

[Line graph of obesity prevalence (%) over the 12-year period]
4 Measures of central tendency and dispersion (or variability)
4.1 Central tendency
The Mean
The arithmetic mean, or simply the mean, is more commonly known as the average of a
set of values. It is appropriate for interval and ratio data, and can also be used for
dichotomous variables that are coded as 0 or 1. For continuous data, for instance measures
of height or scores on an IQ test, the mean is simply calculated by adding up all the values
and dividing by the number of values. The mean of a population is denoted by the Greek
letter mu (μ) while the mean of a sample is typically denoted by a bar over the variable
symbol: for instance, the mean of x would be designated 𝑥̅ and pronounced "x-bar." The
bar notation is sometimes adapted for the names of variables also: for instance, some
authors denote "the mean of the variable age" by placing a bar over the word age, which
would be pronounced "age-bar".
For instance, if we have the following values of the variable x:
100, 115, 93, 102, 97
We calculate the mean by adding them up and dividing by 5 (the number of values):
𝑥̅ = (100 + 115 + 93 + 102 + 97)/5 = 507/5 = 101.4
Statisticians often use a convention called summation notation, which defines a statistic
by expressing how it is calculated. The computation of the mean is the same whether the
numbers are considered to represent a population or a sample: the only difference is the
symbol for the mean itself. The mean of a data set, as expressed in summation notation, is:

𝑥̅ = (1/n) Σᵢ₌₁ⁿ xᵢ

where 𝑥̅ is the mean of x, n is the number of cases, and xᵢ is a particular value of x. The
Greek letter sigma (Σ) means summation (adding together), and the figures above and
below the sigma define the range over which the operation should be performed: in this
case, the notation says to sum all the values of x from 1 to n. The subscript i designates the
position in the data set, so x₁ is the first value in the data set, x₂ the second value, and xₙ
the last value. The mean is therefore calculated by summing all the data in the data set,
then dividing by the number of cases, which is the same thing as multiplying by 1/n.
The mean is an intuitively easy measure of central tendency to understand. However, the
mean is not an appropriate summary measure for every data set, because it is sensitive to
extreme values, also known as outliers (discussed further below), and may also be
misleading for skewed (non-symmetrical) data. For instance, if the last value in the data
set were 297 instead of 97, the mean would be:
𝑥̅ = (100 + 115 + 93 + 102 + 297)/5 = 707/5 = 141.4
This is not a typical value for this data: 80% of the data (the first four values) are below the
mean, which is distorted by the presence of one extremely high value. A good practical
example of when the mean is misleading as a measure of central tendency is household
income. A few very rich households make the mean household income a larger value than
is truly representative of the average or typical household.
The mean can also be calculated using data from a frequency table, i.e., a table
displaying data values and how often each occurs. Consider the following example:
Value    Frequency
1        7
2        5
3        12
4        2
To find the mean of these numbers, treat the frequency column as a weighting variable,
i.e., multiply each value by its frequency, sum the products, and divide by the total frequency:

𝑥̅ = (1×7 + 2×5 + 3×12 + 4×2)/26 = 61/26 ≈ 2.35

This is the same result you would reach by adding together each individual score the
number of times it occurs (its frequency) and dividing by 26.
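
The same weighted-mean calculation as a short Python sketch (illustrative only):

values = [1, 2, 3, 4]
freqs = [7, 5, 12, 2]
# Multiply each value by its frequency, then divide by the total frequency
mean = sum(v * f for v, f in zip(values, freqs)) / sum(freqs)
print(mean)   # 61 / 26 = 2.3461... (~2.35)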
The mean for grouped data, in which data has been tabulated by range, is calculated in
a similar manner. One additional step is necessary though: the midpoint of each range
must be calculated, and for the purposes of the calculation it is assumed that all data
points in that range have the midpoint as their value. A mean calculated in this way is
called a grouped mean. A grouped mean is not as precise as the mean calculated from
the original data points, but it is often your only option if the original values are not
available. For example, for data grouped into ranges, the mean is calculated by multiplying
the midpoint of each interval by its frequency, summing the products, and dividing by the
total frequency:

𝑥̅ = Σ(mᵢ × fᵢ) / Σfᵢ

where mᵢ is the midpoint and fᵢ the frequency of interval i.
One way to lessen the influence of outliers is by calculating a trimmed mean. As the name
implies, a trimmed mean is calculated by trimming or discarding a certain percentage of
the extreme values in a distribution, and calculating the mean of the remaining values. In
the second distribution above, the trimmed mean (defined by discarding the highest and
lowest values after sorting in ascending or descending order) would be:
𝑥̅ = (100 + 115 + 102)/3 = 317/3 = 105.7
This is much closer to the typical values in the distribution than 141.4, the value of the mean
of all the values. In a data set with many values, a percentage such as 10 percent or 20
percent of the highest and lowest values may be eliminated before calculating the
trimmed mean; trimming is typically applied to both tails of the distribution.
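
A trimmed mean can be sketched in Python as follows (hypothetical function; the trimming proportion is applied to each tail after sorting):

def trimmed_mean(values, proportion):
    # Drop the given proportion of values from each end, then average the rest
    data = sorted(values)
    k = int(len(data) * proportion)
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

print(trimmed_mean([100, 115, 93, 102, 297], 0.2))   # drops 93 and 297 -> 105.67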
The mean can also be calculated for dichotomous data using 0-1 coding, in which case
the mean is equal to the proportion of values coded 1. For instance, if we have
10 subjects, 6 males and 4 females, coded 1 for male and 0 for female, computing the
mean will give us the proportion of males in the sample:
𝑥̅ = (1+1+1+1+1+1+0+0+0+0)/10 = 6/10 = 0.6 or 60% males
The Median
The median of a data set is the middle value when the values are ranked in ascending or
descending order. If there are n values, the median is formally defined as the (n+1)/2th
value. If n = 7, the middle value is the (7+1)/2th or fourth value. If there is an even number
of values, the median is the average of the two middle values. For example:
Odd number of values: 1, 2, 3, 4, 5, 6, 7 median = 4
Even number of values: 1, 2, 3, 4, 5, 6 median = (3+4)/2 = 3.5
The median is a better measure of central tendency than the mean for data that is
asymmetrical or contains outliers. This is because the median is based on the ranks of data
points rather than their actual values: 50 percent of the data values in a distribution lie
below the median, and 50 percent above the median, without regard to the actual values
in question. Therefore it does not matter if the data set contains some extremely large or
small values, because they will not affect the median more than less extreme values. For
instance, the median of all three distributions below is 4:
Distribution A: 1, 1, 3, 4, 5, 6, 7
Distribution B: 0.01, 3, 3, 4, 5, 5, 5
Distribution C: 1, 1, 2, 4, 5, 100, 2000
The Mode
A third measure of central tendency is the mode, which refers to the most frequently
occurring value. The mode is most useful in describing ordinal or categorical data. For
instance, imagine that the numbers below represent the academic year of college
students, where 1 = first year, 2 = second year, and 3 = third year:
1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3
The modal year is year 3, which has the highest frequency.
The modal BMI group from our earlier example is "Normal", the category with the highest
frequency.
In a symmetrical distribution (such as the normal distribution, to be discussed later), the
mean, median, and mode are identical. In an asymmetrical or skewed distribution they
differ, and the amount by which they differ is one way to evaluate the skewness of a
distribution.
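
All three measures are available in Python's standard statistics module; a small sketch using the academic-year data above:

import statistics

years = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
print(statistics.mean(years))     # ~2.38
print(statistics.median(years))   # 3 (the 7th of 13 ordered values)
print(statistics.mode(years))     # 3 (the most frequent value)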
Skewness
The general principle to remember is that, relative to the median, the mean is always
pulled in the direction of extreme scores (i.e., scores that are much higher or lower than
other scores); that is, when the data show a skew. The mean and median will have the
same value when and only when a distribution is symmetrical. When a distribution has
some extremely high scores (this is called a positive skew), the mean will always have a
greater numerical value than the median. If the distribution has some very low scores (a
negative skew), the mean will be lower in value than the median. These relationships
between medians and means also have a practical value. For one thing, a quick
comparison of the median and mean will always tell you if a distribution is skewed and, if
so, the direction of the skew. If the mean is less than the median, the distribution has a
negative skew. If the mean is greater than the median, the distribution has a positive skew.
The figures below illustrate this.
[Figures: positively and negatively skewed distributions, showing the relative positions of the mean and median]
Use the mean when:
1. The variable is measured at the interval-ratio level (except when the variable is
highly skewed).
2. You want to report the typical score. The mean is the centre that exactly balances
all of the scores.
3. You are going to conduct additional statistical analysis based on the mean.
4.2 Dispersion and variability
The Range
The simplest measure of dispersion is the range, which is simply the difference between the
highest and lowest values. Often the minimum (smallest) and maximum (largest) values
are reported as well as the range. For the data set (95, 98, 101, 105), the minimum is 95, the
maximum is 105, and the range is 10 (105 - 95). If there are one or a few outliers in the data
set, the range may not be a useful summary measure: for instance, in the data set (95, 98,
101, 105, 210), the range is 115 but most of the numbers lie within a range of 10 (95 to 105).
Inspection of the range for any variable is a good data screening technique: an unusually
wide range, or extreme minimum or maximum values, warrants further investigation. It may
be due to a data entry error or to inclusion of a case that does not belong to the
population under study.
Interquartile Range (IQR)
The interquartile range is an alternative measure of dispersion that is less influenced by
extreme values than the range. The interquartile range is the range of the middle 50% of
the values in a data set, which is calculated as the difference between the 75th and 25th
percentile values which are also the upper/third and lower/first quartiles respectively. The
interquartile range is easily obtained from most statistical computer programs but may also
be calculated by hand using the following formulae for the percentiles:
Median = (1/2)(n + 1)th, i.e., (50/100)(n + 1)th, value of the ordered observations
Lower quartile = (1/4)(n + 1)th, i.e., (25/100)(n + 1)th, value of the ordered observations
Upper quartile = (3/4)(n + 1)th, i.e., (75/100)(n + 1)th, value of the ordered observations
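
These position-based formulas can be sketched in Python (hypothetical helper; it interpolates when the (n + 1)-based position falls between two ordered values, which is one common convention among several):

def position_value(data_sorted, pos):
    # Value at a possibly fractional 1-based position, with linear interpolation
    i = int(pos) - 1
    frac = pos - int(pos)
    if frac == 0 or i + 1 >= len(data_sorted):
        return data_sorted[i]
    return data_sorted[i] + frac * (data_sorted[i + 1] - data_sorted[i])

data = sorted([95, 98, 101, 105, 110, 112, 120])
n = len(data)
q1 = position_value(data, (n + 1) / 4)          # 2nd value: 98
median = position_value(data, (n + 1) / 2)      # 4th value: 105
q3 = position_value(data, 3 * (n + 1) / 4)      # 6th value: 112
print(q1, median, q3, "IQR =", q3 - q1)         # IQR = 14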
The Variance and Standard Deviation
The most common measures of dispersion for continuous data are the variance and
standard deviation. Both describe how much the individual values in a data set vary from
the mean or average value. The variance and standard deviation are calculated slightly
differently depending on whether a population or a sample is being studied, but basically
the variance is the average of the squared deviations from the mean, and the standard
deviation is the square root of the variance. The variance of a population is signified by σ²
(pronounced "sigma-squared"; σ is the Greek letter sigma) and the standard deviation by
σ, while the sample variance and standard deviation are signified by s² and s, respectively.
The deviation from the mean for one value in a data set is calculated as (xᵢ − 𝑥̅), where xᵢ is
value i from the data set and 𝑥̅ is the mean of the data set. Written in summation notation,
the formula to calculate the sum of all deviations from the mean for a data set with n
observations is:

Σᵢ₌₁ⁿ (xᵢ − 𝑥̅)
Unfortunately this quantity is not useful because it will always equal zero. This is not surprising
if you consider that the mean is computed as the average of all the values in the data set.
This may be demonstrated with the tiny data set (1, 2, 3, 4, 5), whose mean is 3: the
deviations (−2, −1, 0, 1, 2) sum to zero.
So we work with squared deviations (which are always positive) and divide their sum by N,
the number of cases, to get the average squared deviation, or variance, for a population:

σ² = (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)²
The sample formula for the variance requires dividing by n – 1 rather than n because we
lose one degree of freedom when we calculate the mean. The formula for the variance
of a sample, notated as s², is therefore:

s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − 𝑥̅)²

Continuing with our tiny data set, we can calculate the variance for this population as:

σ² = ((−2)² + (−1)² + 0² + 1² + 2²)/5 = 10/5 = 2
Note that because of the different divisor, the sample formula for the variance will always
return a larger result than the population formula, although if the sample size is close to the
population size, this difference will be slight. The divisor (n – 1) is used so that the sample
variance will be an unbiased estimator of the population variance.
Because squared numbers are always positive (outside the realm of imaginary numbers),
the variance will always be equal to or greater than 0. The variance would be zero only if
all values of a variable were the same (in which case the variable would really be a
constant). However, in calculating the variance, we have changed from our original units
to squared units, which may not be convenient to interpret. For instance, if we were
measuring weight in kilograms, we would probably want measures of central tendency
and dispersion expressed in the same units, rather than having the mean expressed in
kilograms and variance in squared kilograms. To get back to the original units, we take the
square root of the variance: this is called the standard deviation and is signified by σ for a
population and s for a sample.
For a population, the formula for the standard deviation is:

σ = √( (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)² )

and for a sample it is s, the square root of the sample variance s².
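
A minimal Python sketch contrasting the population and sample formulas (hypothetical function name), using the tiny data set (1, 2, 3, 4, 5) from above:

import math

def variance(values, sample=True):
    # Average squared deviation from the mean; n - 1 divisor for samples
    m = sum(values) / len(values)
    ss = sum((x - m) ** 2 for x in values)
    return ss / (len(values) - 1) if sample else ss / len(values)

data = [1, 2, 3, 4, 5]
print(variance(data, sample=False))     # population variance: 2.0
print(variance(data))                   # sample variance: 2.5
print(math.sqrt(variance(data)))        # sample standard deviation: ~1.58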
5 Distribution types and the relationship between probability and the area under the normal
curve
Some terms to note:
Trials
Probability is concerned with the outcome of trials, which are also called experiments or
observations: the crucial fact, whichever term is used, is that they refer to an event whose
outcome is unknown. If the outcome were known, after all, there would be no need to
consider its probability. A trial can be as simple as flipping a coin or drawing a card from
a deck, or as complex as observing whether a person diagnosed with breast cancer is still
alive five years after diagnosis or the chances a person has a disease when selected at
random from a population.
Sample Space
The sample space, signified by S, is the set of all possible elementary outcomes of a trial. If
the trial is flipping a coin once, then the sample space is S = {heads, tails} (often
abbreviated S = {h, t}), because those two alternatives represent all the possible outcomes
for the experiment. If the experiment is rolling a single die, the sample space is S = {1, 2, 3,
4, 5, 6}, representing the six faces of the die that may turn up in a single roll. These
elementary outcomes are also referred to as sample points. If the experiment consists of
multiple trials, all possible combinations of outcomes of the trials must be specified as part
of the sample space. For instance, if the trial consists of flipping a coin twice, the sample
space is S = {(h, h), (h, t), (t, h), (t, t)}. An example of a sample space is the target population
from which people may be selected for scientific studies; one could end up with any
combination of persons for this purpose.
Events
An event, usually signified by a capital letter other than S, is the specification of the
outcome of a trial, and may consist of a single outcome or a set of outcomes. If the
outcome or set of outcomes occurs, we say the outcome has “satisfied the event” or “the
event occurred.” For instance, the event “heads in flipping a coin” could be specified as
E = {heads} while the event “odd number in rolling a die” could be specified as E = {1, 3,
5}.
A simple event is the outcome of a single experiment or observation, such as a single coin
flip. Simple events may be combined into compound events, as in the union and
intersection examples below. Events can be defined by listing the outcomes or by defining
them logically, so that if the trial was rolling two dice, and we were interested in how often
the sum would be less than 6, we could specify this as either E = {2, 3, 4,
5} or E = {sum is less than 6}.
Union
The union of several simple events creates a compound event that occurs if one or more
of the events occur. The union of E and F is written E ∪ F and means “either E or F, or both
E and F.” Note that the union symbol is similar to a capital letter U.
Intersection
The intersection of two or more simple events creates a compound event that occurs only
if all the simple events occur. The intersection of E and F is written E ∩ F and means “both E
and F.”
Complement
The complement of an event means everything in the sample space that is not that event.
For example, if R = (probability a breast cancer patient survives for at least five years), then
the complement of R = (probability a breast cancer patient does not survive for at least five
years).
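
These set operations map directly onto Python's built-in sets; a small illustrative sketch for a single roll of a die:

S = {1, 2, 3, 4, 5, 6}   # sample space
E = {1, 3, 5}            # event: odd number
F = {4, 5, 6}            # event: number greater than 3

print(E | F)   # union: {1, 3, 4, 5, 6}
print(E & F)   # intersection: {5}
print(S - E)   # complement of E: {2, 4, 6}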
Mutual Exclusivity
If events cannot occur together, they are mutually exclusive. Following the same line of
reasoning, if two sets have no events in common, they are mutually exclusive. For instance,
the event A = (BMI is normal) and event B = (BMI is obese) are mutually exclusive, as a
person cannot be both normal and obese at the same time in terms of BMI.
Independence
If two trials are independent, that means that the outcome of one has no relationship to
the outcome of another. To put it another way, knowing the outcome of one event gives
you no information about the outcome of the second event. The classic example of
independence is flipping an ordinary coin: if you flip the coin twice, the outcome of the
first trial has no influence on the outcome of the second trial, and the probability of heads
is the same on every flip.
5.1 Defining Probability
There are several technical ways to define probability, but a definition useful for statistics is
that probability tells us how often something is likely to occur when an experiment is
repeated at random. For instance, the probability that a coin will come up heads can be
estimated by executing a number of coin flips and observing how many times it is heads
rather than tails. The most important single fact about probability is this:
The probability of an event is always between 0 and 1.
If the probability of an event is 0, that means there is no chance that it will occur, while if
the probability of an event is 1, that means it is certain to occur. It is conventional in
mathematics to specify probability using decimals, so we say that the probability of an
event is between 0 and 1, but it is equally acceptable (and more common in everyday
speech) to speak in terms of percentages, so it is equally correct to say that the probability
of an event is always between 0% and 100%. To move from decimals to percent, multiply
by 100 (per cent = per 100), so a probability of 0.4 is also a probability of 40% (0.4 × 100 =
40) and a probability of 0.85 may also be stated as 85% probability.
If a person were to be selected at random from this group of 95 persons (the BMI table
above), the probability that they would be obese is the number of obese (5) divided by the
size of the sample space (95), which gives us 5.26%. The probability that a person selected
is not obese is the complement of that, equal to (100 − 5.26)% = 94.74%, the same as the
cumulative relative frequency of all the other categories.
Conditional Probabilities
Often we want to know the probability of some event, given that another event has
occurred. This is expressed symbolically as P(E|F) and read as “the probability of E given
F.” The second event is known as the “condition” and the process is sometimes referred to
as “conditioning on F.” Conditional probability is an important concept in statistics
because often we are trying to establish that a factor has a relationship with an outcome,
for instance that people who smoke cigarettes are more likely to develop lung cancer.
This would be expressed symbolically as:
P(lung cancer | smoker) > P(lung cancer | non-smoker)
Our BMI table above cannot give us conditional probabilities, since there is no condition.
However, from a table of BMI broken down by sex (such as the male/female comparison
described earlier), we do have conditional probabilities, the condition here being the sex
of the person.
5.2 Probability distributions
Statistical inference frequently relies on making assumptions about the way data is
distributed, or requires performing some transformation of the data to make it better fit
some known distribution. Therefore we will begin the topic of statistical inference with a
presentation of the concept of a theoretical probability distribution, and a review of two
commonly used distributions.
A theoretical probability distribution is defined by a formula that specifies what values can
be taken by data points within the distribution, and how common each value will be (or,
in the case of continuous distributions, how common a given range of values will
be). Graphic presentations of theoretical probability distributions are often used to present
statistical concepts: the well-known “bell curve” of the normal distribution is a good
example.
Theoretical probability distributions are useful in inferential statistics because their
properties and characteristics are known. If the actual distribution of a given data set is
reasonably close to that of a theoretical probability distribution, many calculations may
be performed on the actual data using assumptions drawn from the theoretical
distribution. In addition, thanks to the Central Limit Theorem, we can assume that the
distribution of means of samples of a sufficient size is normal, even if the population from
which the samples were drawn is not normally distributed.
Probability distributions are commonly classified as continuous, meaning the data can
take on any value within a specified range, or discrete, meaning the data can only take
on certain values. There are a lot of other theoretical distributions that may be used in
computing inferential statistics depending on the characteristics of the data being used
such as the standard normal, Student's t, chi-square, etc. We will look at the standard
normal distribution and the t-distribution as examples of continuous distributions.
Standard normal distribution
The normal distribution, a continuous distribution, is arguably the most commonly used
distribution in statistics. This is due in part to the fact that the normal distribution is a
reasonable description of how many continuous variables are distributed in reality, from
health and anthropometric variation to intelligence test scores. A second reason for the
widespread use of the normal distribution is that under specified conditions we may
assume that sampling distributions of statistics such as the sample mean are normally
distributed even if the samples are drawn from populations that are not normally
distributed. This is discussed further in the section on Sampling distribution of means later in
this document. The normal distribution is also referred to as the “bell curve” due to its
characteristic shape.
There are an infinite number of normal distributions, all of which have the same basic
shape but differ according to their mean μ (the Greek letter mu) and standard deviation σ
(the Greek letter sigma). Examples of three normal distributions with different means and
standard deviations are displayed in the figure below.
[Figure: three normal distributions with different means and standard deviations]
The normal distribution with a mean of 0 and standard deviation of 1 is known as the
standard normal distribution or Z distribution. Any normal distribution can be transformed
to the standard normal distribution by converting the original values to standardized scores
(a process discussed further), which facilitates comparison among populations with
different means and variances.
All normal distributions, regardless of the mean and variance, share certain characteristics.
These include:
- Symmetry
- Unimodality (a single most common value)
- A continuous range from −∞ to +∞ (negative infinity to positive infinity)
- A total area under the curve of 1
- A common value for the mean, median, and mode
As we noted earlier, there are an infinite number of normal distributions, but they all share
certain properties. For convenience, we often describe normal distributions in terms of units
of standard deviation rather than raw numbers, because that allows us to apply the same
description to any normal distribution. Because all normal distributions have the same basic
shape, we can make some assumptions about how data is distributed within any normal
distribution. The empirical rule states that for any normal distribution:
- About 68% of the data will fall within one standard deviation of the mean
- About 95% of the data will fall within two standard deviations of the mean
- Over 99% of the data will fall within three standard deviations of the mean
This is shown in the figure below
[Figure: the empirical rule, showing the areas within 1, 2 and 3 standard deviations of the mean]
Knowledge of these properties of the normal distribution gives us a way to judge whether
a particular value is similar or not compared to other values in the population. This
comparison is facilitated by converting raw scores (scores in their natural metric, for
instance weight measured in pounds or kilograms) into Z-scores, which express the value
of the score in terms of units of the standard deviation for their population. Converting all
data points to Z-scores is equivalent to transforming a normally distributed population to
the standard normal distribution. For this reason, Z-scores are sometimes referred to as
normalized scores, the process of computing them as normalizing the scores, and the
standard normal distribution is sometimes called the Z distribution.
The formula to calculate a Z-score for a population with known mean and standard
deviation is:

Z = (x − μ) / σ
If the variable x is distributed normally with mean of 100 and standard deviation of 5, i.e.,
x ~ N(100, 5), a value of 105 has a Z-score of 1, because:

Z = (105 − 100)/5 = 1
Similarly, a value of 110 from this population has a Z-score of 2, and a value of 85 has a Z-
score of −3. Using the empirical rule cited above, we classify the value 105 as above
average but not remarkable among the population (about 15.9% of the population would
be expected to have higher Z-scores). A Z-score of 2 is more unusual (about 2.5% of the
population would be expected to have higher Z-scores) and −3 is quite unusual (less than
half of one percent of the population would be expected to have scores this low or lower).
The advantage of Z-scores is that they facilitate comparison among populations with
different means and standard deviations. For instance, comparing our population
x ~ N(100, 5) with another population y ~ N(50,10), we can’t immediately say whether a
score of 95 among the first population is more or less unusual than a score of 35 among
the second population. However, this comparison is easily made using Z-scores:
𝑍𝑥 = (95 − 100)/5 = −1        𝑍𝑦 = (35 − 50)/10 = −1.5
Conversion to Z-scores places both populations on the same scale, and we can see that
the second score is more extreme because –1.5 is further from 0, the mean of the standard
normal distribution, than –1.
Specific probabilities can be obtained from the standard normal table for Z-scores
computed. However, most standard normal tables show probabilities between Z = -3.5 and
Z = 3.5. The remaining probabilities not shown are extremely small compared to what is
shown in the tables. The total area under the standard normal curve is 1 or 100%.
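
If SciPy is available (an assumption; the guide itself refers only to printed tables), areas under the standard normal curve can be obtained with scipy.stats.norm:

from scipy import stats

mu, sigma = 100, 5
z = (105 - mu) / sigma                          # z = 1.0
print(stats.norm.cdf(z))                        # ~0.841: area below z = 1
print(1 - stats.norm.cdf(z))                    # ~0.159: area above, as quoted earlier
print(stats.norm.cdf(2) - stats.norm.cdf(-2))   # ~0.954: within 2 SDs of the mean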
t-distribution
In many natural systems, populations are normally distributed, but sometimes they are not,
and thus, the normal distribution cannot be used as a model. However, if you have
gathered enough samples, it may still be possible to use the properties of the normal
distribution, since the sampling distribution of averages is likely to be normal, according to
the Central Limit Theorem, to be discussed later (or at least have some of the key
characteristics of a normal distribution, such as being unimodal and symmetrical). Thus,
irrespective of the underlying population distribution, the normal distribution can be used
to estimate probabilities when samples are sufficiently large: the sample variance can then
be used to estimate the population variance, and inferences drawn with the assistance of
the normal distribution.
This strategy may not always be appropriate for answering your specific research question,
especially if you can only obtain limited samples because of financial, physical, or time
constraints. When samples are collected from a normal distribution, the number of
observations is small, and these are used to estimate the variance, the formula for
standardizing (for the variable x) becomes:

t = (𝑥̅ − μ) / (s/√n)
The resulting distribution is both flatter and has more observations appearing in the tails
than a normal distribution when sample sizes are less than about 30; here s/√n refers to
the standard error (to be explained later). Since both s and 𝑥̅ are random variables, this
may not be such a surprise. However, as the number of observations increases, the
distribution becomes normal, given the dependence on n and the corresponding effect on
the degrees of freedom, since df = n − 1. This distribution is known as the t-distribution,
and it approximates a normal distribution if n (and by implication df) is large (>30 in
practical terms).
Books of statistical tables normally provide critical values of t that can be used at different
degrees of freedom to make inferences about the population, with the associated
probabilities for each t-value. There is a complete t-distribution for each degree of freedom,
with the total area under each t-distribution curve being 1 or 100%, but typically only the
critical t-values for the commonly used probabilities (such as 80%, 90%, 95% and 99%)
are produced in the t-tables.
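
Assuming SciPy is available, critical t-values can be looked up with scipy.stats.t.ppf instead of a printed table; a small sketch:

from scipy import stats

# Critical t-value leaving 2.5% in the upper tail (95% two-sided), df = 9
print(stats.t.ppf(0.975, df=9))     # ~2.262
# With large df the t-distribution approaches the standard normal
print(stats.t.ppf(0.975, df=200))   # ~1.97, close to z = 1.96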
At lower degrees of freedom, the height of the curve is lower, and so the range of t-values
needed to give a total area of 1 (100%) is wider. As the sample size increases, the degrees
of freedom do the same, and the range of t-values giving the same total area of 1 narrows.
We may consider a rectangular figure as a rough analogy for a t-distribution, with area
(width × height) of 100 cm², the width being the range of t-values and the height being a
function of the degrees of freedom. A low degree of freedom of, say, 5 results in a wider
range of t-values (20) to give the total area of 5 × 20 = 100 cm². A higher degree of
freedom of, say, 10 needs a smaller range of 10 to give the same area of 10 × 10 = 100 cm².
At each critical t-value and corresponding degree of freedom, we can tell the percentage
or proportion of the area below, above or between t-values, since we know the total to be
1 or 100%. At high degrees of freedom, the t-distribution approximates the standard
normal distribution. Each t-value, like a standard normal (Z) value, has a negative and a
positive side, with the mean of 0 dividing the distribution into two equal halves. Some
statistical tables may show only one half of the distribution, since it is symmetrical:
whatever holds on the half shown is mirrored on the other side. For example, at 200
degrees of freedom, a t-value of 1.96 with 0.975 area (probability) below it has 0.025 area
(probability) above it; likewise, on the opposite side, a t-value of −1.96 has an area
(probability) of 0.025 below it and 0.975 above it at the same degrees of freedom. Should
the degrees of freedom be much lower, the height of the distribution will be lower, further
widening the range of t-values required to maintain the same area under the curve.
6 Types of sampling methods
Principles of sampling
The concept of populations and samples is very important to understanding inferential
statistics. The process of defining the population and selecting an appropriate sampling
method can be quite complex and requires careful planning and study.
The population of interest consists of all the people or other units that the researchers would
like to study, if they had infinite resources. To put it another way, the population of interest
is all the units to which the researchers would like to be able to generalize their results.
Defining the population of interest is the first step in drawing a sample: it may be very
broad, such as everyone living in the country, or narrow, such as Namibian men aged 18-
60 years with a diagnosis of certain diseases. Almost all research is based on a study sample
drawn from a population, rather than the population itself, for practical reasons. The rare
exceptions are studies such as those based on the census of an entire population, where
data from every individual is collected.
6.1 Types of sampling
6.1.1 Nonprobability Sampling
Volunteer sampling
This is a commonly used type of nonprobability sample. An example would be if a
researcher advertises in the newspaper for study subjects and accepts those who answer
and volunteer to be part of the study. Unfortunately, people who volunteer for studies
can’t be assumed to be representative of any general population. Volunteer samples are
particularly common in circumstances when it would be difficult to randomly select from
a population, for instance in a study about people who use illegal drugs. Much useful
information can be gained from volunteer samples, particularly in the early stages of a
project: for instance, you might use volunteer subjects to gather information about drug
use within a community, which you could use to construct a questionnaire that would be
administered to a random sample from the community. But results from volunteer samples
have limited usefulness if the goal is to generalize beyond the sample.
Convenience sampling
This type of nonprobability sampling may be used to collect initial information but, like
volunteer sampling, it has limited usefulness if the goal is to generalize beyond the sample.
For instance, you might collect information about the eating habits of people in an area
by interviewing the first 30 people you saw at a particular restaurant, and use that
information to construct a study that would use a randomized design. But it would not be
valid to conclude, for instance, that because 75% of your convenience sample favoured
eating at one place, 75% of the people living in the area would do likewise.
6.1.3 Quota sampling
With this type of nonprobability sampling, a data collector is only interested in a limited
number of subjects with specific requirements: in the previous example, the data collector
might be instructed to collect data from a sample of 15 men and 15 women. Quota
sampling is a slight improvement in accuracy over convenience sampling because it can
specify the makeup of the sample: without the quota requirements the example sample
might be 25 women and 5 men. But it does not get around the main problem of all
nonprobability sampling, which is that you have no way of knowing if the people sampled
were representative of the population of interest. You may have an even representation
of men and women in a quota sample, for instance, but they may not be representative
of all the men and women in the population they come from.
6.1.4 Simple Random Sampling
The most basic type of probability sampling is simple random sampling (SRS). In SRS, all
samples of a given size have an equal probability of being selected. Suppose you wanted
to draw a random sample of 50 students attending a particular school. You obtain a list of
the students and select 50 at random from the list, using a random number table or random
number generator. Because the list represents an enumeration of the entire population
and the choice of who to include in the sample is completely random, every student has
an equal probability of being selected for the sample, as does every combination of
students up to the size of the sample. However, SRS may be difficult or expensive to carry
out, so other methods of probability sampling have been developed to deal with situations
where SRS is not possible or practical.
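As an illustration, here is a small sketch of SRS in Python, using a hypothetical list of 500 student IDs (the names and numbers are illustrative only):

```python
# Sketch: simple random sampling from a hypothetical sampling frame.
import random

students = [f"student_{i:03d}" for i in range(1, 501)]  # hypothetical frame
random.seed(1)                        # fixed seed so the draw is reproducible
sample = random.sample(students, 50)  # SRS of size 50, without replacement
print(sample[:5])
```

Because random.sample draws without replacement and treats every element alike, every student, and every combination of 50 students, has the same chance of selection.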
6.1.6 Stratified Sampling
In stratified sampling, the population of interest is divided into non-overlapping groups
called strata based on common characteristics. For people, these characteristics might
be gender or age, academic level, community, etc. and for hospitals they might be type
of facility, etc. If comparing different strata is a primary goal of the study, stratified sampling
is a good choice because it can be designed to ensure adequate and proportionate
sampling from each stratum of interest. For instance, using SRS might not produce sufficient
elderly people to accurately compare their results with younger people, while a stratified
sample can be designed to oversample the elderly who may be fewer to choose from in
order to ensure sufficient sample size, then correct statistically for the oversampling.
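A sketch of this idea in Python, using a hypothetical sampling frame in which only about 10% of people are elderly: equal allocation deliberately oversamples the elderly stratum, and the analysis would then weight the strata back to their population proportions.

```python
# Sketch: stratified sampling with oversampling of a small stratum.
import random

random.seed(2)
population = [{"id": i, "group": "elderly" if i % 10 == 0 else "younger"}
              for i in range(1, 1001)]          # hypothetical frame, ~10% elderly

strata = {}
for person in population:                        # divide the frame into strata
    strata.setdefault(person["group"], []).append(person)

# Draw 25 from each stratum: the elderly are oversampled relative to their
# 10% share, ensuring a sufficient sample size for comparison.
sample = {group: random.sample(members, 25) for group, members in strata.items()}
print({group: len(s) for group, s in sample.items()})  # 25 per stratum
```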
7 Statistical tests in comparing population means and proportions and confidence intervals for
population means and proportions
Note: the factorial (!) of any number is the product of the number and all the preceding
positive integers, one at a time, e.g. 3! = 3×2×1 and 4! = 4×3×2×1.
But if we were to conduct our study, we would select only one of such possible
combinations for our analysis and compute the mean systolic blood pressure, etc. The
collection of the possible means from all these possible combinations forms a
distribution itself, and it is this that we refer to as the sampling distribution of the mean,
our statistic in this case. This resulting distribution becomes approximately normal as the
sample size grows, no matter the shape of the distribution of the original values, and this
is known as the Central Limit Theorem.
Let's illustrate this further using a smaller list of numbers: 1, 2, 3 and 4. Let's refer to this as our
population. The mean of this set of numbers is 2.5. Assuming we are to sample just 2 values from
this population, the possibilities are shown below.
$${}^{4}C_{2} = \frac{4!}{(4-2)!\,2!} = \frac{4 \times 3 \times 2 \times 1}{(2 \times 1)(2 \times 1)} = 6 \text{ possible combinations of 2 from 4}$$
Pair          1      2      3      4      5      6
Values      1, 2   1, 3   1, 4   2, 3   2, 4   3, 4
Mean         1.5    2.0    2.5    2.5    3.0    3.5     Overall mean: 2.5
We see that only the means for (1, 4) and (2, 3) match the actual population mean of 2.5;
also, the mean of the sampled means is equal to the population mean of 1, 2, 3, 4. Now let
us look at this graphically.
First, plotted as a frequency chart, the original values form a flat (uniform) distribution,
since each value occurs exactly once. The distribution of the means, by contrast, shows a
slight peak in the middle, as if it were a somewhat normal distribution. It looks like a flat
normal distribution because of the low frequency; a higher number of combinations and
frequencies would yield a smoother curve and a higher peak. We also see the population
mean of 2.5 in the middle, with the distribution symmetric about this value.
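The worked example can be verified with a few lines of Python (assuming only the standard library):

```python
# Sketch: enumerate all samples of size 2 from the population {1, 2, 3, 4}.
from itertools import combinations
from statistics import mean

population = [1, 2, 3, 4]
sample_means = [mean(pair) for pair in combinations(population, 2)]
print(sorted(sample_means))  # [1.5, 2.0, 2.5, 2.5, 3.0, 3.5]
print(mean(sample_means))    # 2.5: the mean of the sample means equals the population mean
```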
This is just for a small population of 4 numbers. For a much larger population, as in our
earlier systolic blood pressure example and as happens in real life, if we were to get the
complete list of possible means from this distribution and compute their mean (i.e. the
mean of all the means), this would be equal to the mean of the entire population. The
resulting distribution would also be normal and smoother. But because we cannot have
access to the entire population, or the resources to do this, we settle for what we obtained
from our sample as an estimate for the population. This single estimate is referred to as a
point estimate. A point estimate such as the mean, on its own, gives no indication of how
far it might be from the population value. For this reason, statisticians normally put a range
around point estimates called a confidence interval (explained later).
Repeating the sampling a large number of times builds up a distribution for the sample
means. We can work out the mean of these sample means, which is close to the
population mean. We can also work out the standard deviation of these sample means,
which is called the standard error. The standard error measures the amount of variability
in the sample mean; it indicates how closely the population mean is likely to be estimated
by the sample mean and is equal to the population standard deviation divided by the
square root of the sample size ($\sigma/\sqrt{n}$), estimated in practice by $s/\sqrt{n}$.
The standard error is used to construct a range of likely values for the (unknown)
population mean. Such a range is called a confidence interval. Note that the standard
deviation measures the average variability in the population and should not be confused
with the standard error, which measures the average variability in the sampling distribution
of the statistic concerned (such as a distribution of means).
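A small simulation sketch, using a hypothetical normal population (mean 120, standard deviation 12) and samples of size 36, shows that the standard deviation of the sample means matches $\sigma/\sqrt{n}$:

```python
# Sketch: the standard error as the SD of the sampling distribution of the mean.
import random
from statistics import stdev

random.seed(3)
sigma, n = 12, 36
sample_means = [
    sum(random.gauss(120, sigma) for _ in range(n)) / n  # mean of one sample
    for _ in range(10_000)                               # many repeated samples
]
print(stdev(sample_means))  # about 2.0, i.e. sigma / sqrt(n) = 12 / 6
```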
7.1.2 Confidence interval of means
When we calculate a single statistic, such as the mean, to describe a sample, this is
referred to as calculating a point estimate, because the number represents a single point
on the number line. The sample mean is a point estimate, and is a useful statistic as the
best estimate of the population mean. However, we know that the sample mean is only
an estimate and that if we drew a different sample, the mean of the sample would
probably be different. We don’t expect that every possible sample we could draw will
have the same sample mean. It is reasonable to ask how much the point estimate is likely
to vary by chance if we had chosen a different sample, and in many professional fields it
has become common practice to report both point estimates and interval estimates. A
point estimate is a single number, while an interval estimate is a range or interval of
numbers.
Confidence intervals are based on the notion that if a study was repeated an infinite
number of times, each time drawing a different sample from the same population and
constructing a confidence interval based on that sample, x% of the time the confidence
interval would contain the unknown parameter value that the study seeks to estimate. For
instance, if our test statistic is the mean and we are using 95% confidence intervals, over
an infinite number of repetitions of the study, 95% of the time the confidence interval
constructed from the study would contain the mean of the population. For this reason, the
confidence interval is sometimes described as presenting a plausible range of values for
the true population mean being estimated.
The confidence interval conveys important information about the precision of the point
estimate. For instance, suppose we have two samples of students and in both cases the
mean IQ score is 100. In one case, however, the 95% confidence interval is (95, 105), while
in the other case the 95% confidence interval is (80, 120). Because the former confidence
interval is much narrower than the latter, the estimate of the mean is more precise for the
first sample.
Single mean
For a single sample mean the formula is

$$\bar{x} \pm t_{(\alpha/2,\,df)}\, SE(\bar{x})$$

for small sample sizes or where the variance in the population is unknown. Here $\bar{x}$ and
$SE(\bar{x}) = s/\sqrt{n}$ are the mean and standard error of the mean respectively, and
$t_{(\alpha/2,\,df)}$ is the reliability coefficient: the critical value from the t or standard
normal (Z) distribution tables.
Note: standard normal distribution tables are not all structured the same way.
Consider the following example of 10 numbers.
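The original list of values is not reproduced here, so the sketch below uses a hypothetical sample of 10 systolic blood pressure readings to show the calculation (assuming Python with SciPy):

```python
# Sketch: 95% confidence interval for a single mean (hypothetical data).
from statistics import mean, stdev
from scipy import stats

x = [118, 124, 131, 127, 122, 135, 119, 128, 125, 130]  # hypothetical readings
n = len(x)
xbar, s = mean(x), stdev(x)              # point estimate and sample SD
se = s / n ** 0.5                        # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # reliability coefficient for 95% CI
print(xbar - t_crit * se, xbar + t_crit * se)
```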
Point estimate ± (Reliability coefficient × Standard error of the point estimate)
The formula for the difference between two means becomes

$$(\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2,\,df)}\, SE(\bar{x}_1 - \bar{x}_2) = (\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2,\,df)} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where $\bar{x}_1, \bar{x}_2$ are the sample means from the two populations; $SE(\bar{x}_1 - \bar{x}_2)$ is the
standard error of the difference between the means; $s_1^2, s_2^2$ are the variances for the two
samples (used as replacements for the population variances unless those are known); and
$n_1, n_2$ are the sample sizes for the two groups respectively. The degrees of freedom used here
are $(n_1 - 1) + (n_2 - 1)$, resulting in $n_1 + n_2 - 2$ degrees of freedom. Sometimes, the confidence
interval may be used to test if significant differences exist between the two populations. If
the range contains a zero (0), the populations are said to be not significantly different since
the true population difference being estimated could have been zero (0).
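A sketch of this calculation with hypothetical summary statistics for two groups (assuming Python with SciPy):

```python
# Sketch: 95% CI for the difference between two means (hypothetical summaries).
from scipy import stats

xbar1, s1, n1 = 125.9, 5.4, 40   # hypothetical group 1: mean, SD, size
xbar2, s2, n2 = 121.3, 6.1, 45   # hypothetical group 2: mean, SD, size

diff = xbar1 - xbar2
se = (s1**2 / n1 + s2**2 / n2) ** 0.5          # SE of the difference
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)    # df = (n1 - 1) + (n2 - 1)
print(diff - t_crit * se, diff + t_crit * se)  # a range excluding 0 suggests a significant difference
```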
7.2.1 Sampling distribution of proportions
Proportions, like means and other statistics, also have their sampling distributions, based
on the same concept: the many possible sample combinations yield different values of
the proportion of interest. From this we can construct a sampling distribution for
proportions, from which we may make estimates based on probabilities, construct
confidence intervals for proportions, and carry out hypothesis testing. The concept of sampling
distributions for proportions comes from the binomial distribution (not covered here) which
deals with binary outcomes, i.e. either an observation is in one category or not, e.g.
hypertensive or not. The binomial distribution however becomes cumbersome to work with
as sample size increases (beyond 30) and so for large sample sizes beyond this point, the
standard normal distribution is used as an approximation to the binomial distribution. Most
scientific studies deal with samples larger than this and so most statistical software use the
standard normal distribution for analysis involving proportions.
Single proportion
From the general way of calculating confidence intervals,
Point estimate ± (Reliability coefficient × Standard error of the point estimate),
the interval for a single proportion is

$$\hat{p} \pm Z_{\alpha/2}\, SE(\hat{p})$$

where $\hat{p}$ is typically used as an estimate for the population proportion ($p$). Note however
that the use of these symbols for proportions may vary from textbook to textbook.
The standard error for a proportion is:

$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$
where $n$ is the sample size and $p$ is the proportion of people in the population with the
outcome of interest; $p$ is mostly not known, and so the estimate from the sample ($\hat{p}$)
may be used, just as the standard deviation ($s$) from the sample may be used as an
estimate of the population standard deviation ($\sigma$) when dealing with means.
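A short sketch of the single-proportion interval, assuming a hypothetical sample in which 60 of 200 people have the outcome of interest:

```python
# Sketch: 95% CI for a single proportion (hypothetical counts).
from scipy import stats

p_hat, n = 60 / 200, 200
se = (p_hat * (1 - p_hat) / n) ** 0.5  # SE, using the sample proportion for p
z = stats.norm.ppf(0.975)              # reliability coefficient, about 1.96
print(p_hat - z * se, p_hat + z * se)  # roughly (0.236, 0.364)
```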
Difference between two proportions
We can also estimate the confidence interval for the difference between two proportions
just as we did for the means using the same concept. The formula for this becomes:
$$(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\, SE(\hat{p}_1 - \hat{p}_2) = (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

where $n_1, n_2$ are the sample sizes from the two populations and $p_1, p_2$ are the proportions
of the outcome of interest in the respective populations; these are mostly not known, and
so the estimates from the samples, $\hat{p}_1, \hat{p}_2$, may be used.
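The same calculation for two hypothetical samples:

```python
# Sketch: 95% CI for a difference between two proportions (hypothetical counts).
from scipy import stats

p1, n1 = 60 / 200, 200   # hypothetical sample 1: 60 of 200 with the outcome
p2, n2 = 45 / 220, 220   # hypothetical sample 2: 45 of 220 with the outcome

se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
z = stats.norm.ppf(0.975)
diff = p1 - p2
print(diff - z * se, diff + z * se)  # a range containing 0 suggests no significant difference
```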
For example, in a trial of a drug to lower blood pressure, defining 𝜇1 as the mean blood
pressure in the group treated with drug X, and 𝜇2 as the mean blood pressure in the group
receiving standard treatment, our null and alternative hypotheses could be respectively stated as:
Null hypothesis, 𝐻0: 𝜇1 ≥ 𝜇2 and Alternate hypothesis, 𝐻𝑎: 𝜇1 < 𝜇2
If the drug really works at reducing the blood pressure, we expect 𝜇1 to be less than 𝜇2, and
the data should lead us to reject the null hypothesis.
Normally these steps are performed before the experiment is designed or the data
collected; the statistic to be used for hypothesis testing is also sometimes specified at this
time, or may be clear from the hypothesis and type of data involved. We then collect the
data and perform the statistical calculations, in this case probably a t-test or ANOVA
because we are dealing with means, and based on our results make one of two decisions:
Reject the null hypothesis and accept the alternative hypothesis, or
Fail to reject the null hypothesis
When we state a null hypothesis, we do so at some degree of confidence called the
confidence level, which is the probability or confidence we have in it. The complement
of this is called the significance level, which we attach to the alternative hypothesis. If we
set our significance level at 5% (or 0.05), for example, we are simply saying we have a
95% (0.95) degree of confidence in our null hypothesis and allow only a 5% (0.05) chance
for the alternative. If, after computing our test statistic and tracing it on the relevant
probability distribution table, we end up with the tail probability (the p-value) being less
than our significance level (also called the alpha level), then our test is said to be
significant, leading to a rejection of the null hypothesis.
For means and proportions, the general way of computing a test statistic is:
$$\text{Test statistic} = \frac{\text{Computed statistic} - \text{Hypothesized value}}{SE(\text{Computed statistic})}$$

where SE = Standard Error.
The test statistic may have a standard normal (Z) or t-distribution, depending on whether
it concerns a proportion or a mean respectively. For large sample sizes and known
population variance, the Z-distribution may also be used for means, since the t-distribution
becomes approximately equal to the Z-distribution at high degrees of freedom.
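The general form can be written as a small helper function; the numbers below are hypothetical, and the tail probability is read from the Z-distribution for illustration:

```python
# Sketch: the general test-statistic form (hypothetical values).
from scipy import stats

def test_statistic(computed, hypothesized, se):
    return (computed - hypothesized) / se

z = test_statistic(computed=127.0, hypothesized=125.0, se=0.9)
p_upper = 1 - stats.norm.cdf(z)  # upper-tail probability under Z
print(z, p_upper)                # z is about 2.22, p is about 0.013
```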
7.3.2 One sample tests
In one sample tests, we test our computed statistic (mean or proportion) against a
hypothesized value. The formulas become, for a mean:

$$t = \frac{\bar{x} - \mu_h}{s/\sqrt{n}}$$

where $\bar{x}$ is the computed mean from the sample, $\mu_h$ is the hypothesized mean
being tested, $s$ is the standard deviation of the sample and $n$ is the sample size; and, for a proportion:

$$Z = \frac{\hat{p} - p_h}{\sqrt{p(1-p)/n}}$$

where $\hat{p}$ is the computed proportion from the sample, $p_h$ is the hypothesized proportion
being tested, $p$ is the proportion of the outcome of interest in the population, which may
be estimated using the sample proportion, and $n$ is the sample size.
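Sketches of both one-sample tests on hypothetical data (assuming Python with SciPy; scipy.stats.ttest_1samp implements the t-test, while the proportion Z-test is computed directly from the formula):

```python
# Sketch: one-sample t-test and one-sample Z-test for a proportion.
from scipy import stats

# t-test: does the mean of this hypothetical sample differ from 120?
x = [118, 124, 131, 127, 122, 135, 119, 128, 125, 130]
t_stat, p_value = stats.ttest_1samp(x, popmean=120)
print(t_stat, p_value)

# Z-test for a proportion: 60 of 200 have the outcome; is the true
# proportion 0.25? The SE here uses the hypothesized proportion under H0.
p_hat, p_h, n = 60 / 200, 0.25, 200
z = (p_hat - p_h) / (p_h * (1 - p_h) / n) ** 0.5
print(z, 2 * (1 - stats.norm.cdf(abs(z))))  # two-sided p-value
```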
7.3.3 Two sample tests
For two sample tests, we test differences in our computed statistic (mean or proportion)
across two groups or populations represented by their respective samples. The formulas in
this case are, for means:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{x}_1, \bar{x}_2$ are the computed sample means, $(\mu_1 - \mu_2)$ is the hypothesized
population mean difference between the groups, and $s_1^2, s_2^2$ are the variances for the
two samples (used as replacements for the population variances unless those are known).
If the variances of the two groups are known to be equal, they can be pooled and a
common variance factored out of the square root. The denominator then becomes:

$$SE(\bar{x}_1 - \bar{x}_2) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where $s$ is the pooled standard deviation, $s = \sqrt{\dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$.

For proportions:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}}$$

where $n_1, n_2$ are the sample sizes from the two populations and $p_1, p_2$ are the proportions
of the outcome of interest in the respective populations; these are mostly not known, and
so the estimates from the samples, $\hat{p}_1, \hat{p}_2$, may be used.
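A sketch of the two-sample t-test on hypothetical data, using scipy.stats.ttest_ind; equal_var=True applies the pooled-variance form above, while equal_var=False keeps the separate variances:

```python
# Sketch: two-sample t-test on hypothetical measurements for two groups.
from scipy import stats

group1 = [118, 124, 131, 127, 122, 135, 119, 128]  # hypothetical group 1
group2 = [112, 120, 117, 125, 114, 121, 116, 123]  # hypothetical group 2

t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=True)  # pooled variance
print(t_stat, p_value)
```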
7.3.4 Type I and Type II errors
Inferential statistics allows us to make probabilistic statements about the data, but the
possibility of error exists in this process. Statisticians have classified two types of errors when
making decisions in inferential statistics, and have set levels of error that are commonly
considered acceptable. The two types of error are shown in the table below.

                        H0 is true            H0 is false
Fail to reject H0       Correct decision      Type II error (β)
Reject H0               Type I error (α)      Correct decision

The diagonal boxes represent correct decisions: H0 is true and was not rejected in the
study, or H0 is false and was rejected in the study. The other two boxes represent Type I and
Type II errors. A Type I error, also known as alpha or 𝛼, represents the error made when the
null hypothesis is true but is rejected in a study. A Type II error, also called beta or β,
represents the error made when H0 is false but is not rejected in a study.
7.4 Non-parametric tests
The basis of statistics is parameter estimation, i.e., attempting to estimate the parameters
(such as the mean and standard deviation) of a population from a random sample.
However, most statistical techniques rely on the underlying distribution being of a particular
type, such as the normal distribution, for inferences made from the relevant statistical tests
to be valid. What about scenarios where the underlying data is known to be nonnormal?
In these cases, a different set of statistical techniques, known as nonparametric statistics,
can be fruitfully applied to understand data. These techniques are often known as
distribution-free since they make no assumptions about the underlying distribution of the
data. For most parametric tests such as t-tests and ANOVA, there exists a non-parametric
equivalent.
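For example, scipy.stats.mannwhitneyu provides the Mann-Whitney U test, a common nonparametric counterpart of the two-sample t-test; the data below are hypothetical and deliberately skewed:

```python
# Sketch: Mann-Whitney U test as a nonparametric alternative to the t-test.
from scipy import stats

group1 = [3, 5, 7, 9, 21, 40, 4, 6]  # hypothetical skewed measurements
group2 = [2, 3, 4, 4, 5, 8, 10, 3]

u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(u_stat, p_value)
```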
Methods of Facilitation of Learning
The module will be facilitated through the following learning activities: modified lectures and
tutorials based on the needs of students; group discussions; class exercises; individual homework;
and introductory computer practical sessions. Group work will be used to enhance active
participation by all students.
Assessment Strategies
Continuous assessment (CA): 50% (minimum of 2 tests and 2 assignments).
Examination: 50% (1 × 3-hour paper)
Prescribed textbooks:
Daniel, W. W. (2012). Biostatistics: A Foundation for Analysis in the Health Sciences (10th Edition).
John Wiley & Sons: New York
Suggested readings:
Kuzma, J. W. & Bohnenblust, S. (2005). Basic Statistics for the Health Sciences. McGraw-Hill
Education: Boston
Ott, R. L. & Longnecker, M. (2010). An Introduction to Statistical Methods and Data Analysis
(6th Edition). Brooks/Cole, Cengage Learning: Belmont