Statistics
INTRODUCTION
Statistics is a set of methods used to analyze data. Statistics is present in all areas of science that involve the collection, handling, and sorting of data, providing insight into a particular phenomenon and making it possible, from that knowledge, to infer new results. One of the goals of statistics is to extract information from data to get a better understanding of the situations they represent. Thus, statistics can be thought of as the science of learning from data.
Currently, the high competitiveness in technologies and markets has caused a constant race for information. This is a growing and irreversible trend. Learning from data is one of the most critical challenges of the information age in which we live. In general, we can say that statistics, based on the theory of probability, provides techniques and methods for data analysis that help the decision-making process in problems where there is uncertainty.
This chapter presents the main concepts used in statistics, which will contribute to understanding the analyses presented throughout this book.
Types of variables
Statistical variables can be classified as categorical variables or numerical variables.
Categorical variables have values that describe a “quality” or “characteristic” of a data unit, like “which
type” or “which category”. Categorical variables fall into categories that are mutually exclusive (each observation belongs to one category or another, not both) and exhaustive (the categories cover all possible options). Therefore, categorical variables are qualitative variables and tend to be represented by non-numeric values. Categorical variables may be
further described as (Marôco, 2011):
• Nominal: the data consist of categories only. The variables are measured in discrete classes, and
it is not possible to establish any qualification or ordering. Standard mathematical operations
(addition, subtraction, multiplication, and division) are not defined when applied to this type of
variable. Gender (male or female) and colors (blue, red or green) are two examples of nominal
variables.
• Ordinal: the data consist of categories that can be ranked in a meaningful order according to their relative size or quality, but the differences between categories cannot be quantified. Standard mathematical operations (addition, subtraction, multiplication, and division) are not defined when applied to this type of variable.
Social class (upper, middle, and lower) and education level (elementary, medium, and high) are two examples of ordinal variables. Likert scales (1-“Strongly Disagree”, 2-“Disagree”,
3-“Undecided”, 4-“Agree”, 5-“Strongly Agree”) are ordinal scales commonly used in social
sciences.
Numerical variables have values that describe a measurable quantity as a number, like “how many” or
“how much”. Therefore, numeric variables are quantitative variables. Numeric variables may be further
described as:
• Discrete: the data is numerical. Observations can take a value based on a count of a set of distinct
integer values. A discrete variable cannot take a value that falls between one possible value and the next closest value. The number of registered cars, the number of business locations, and the number of
children in a family, all of which measured as whole units (i.e. 1, 2, or 3 cars) are some examples
of discrete variables.
• Continuous: the data is numerical. Observations can take any value between a particular set of
real numbers. The value given to an observation for a continuous variable can be as precise as the measurement instrument allows. Height and time are two examples of
continuous variables.
DESCRIPTIVE STATISTICS
Descriptive statistics are used to describe the essential features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphical analysis, they form the basis of virtually every quantitative analysis of data. Descriptive statistics allow presenting quantitative descriptions in a convenient way. A research study may involve many measures, or it may measure a large number of people on a single measure. Descriptive statistics help to simplify large amounts of data in a sensible way: each descriptive statistic reduces lots of data into a simpler summary.
Frequency Distributions
Frequency distributions are displays that organize and present frequency counts (n) so that the information can be interpreted more easily. Along with the frequency counts, they may include the relative frequency, the cumulative frequency, and the cumulative relative frequency.
• The frequency (n) is the number of times a particular value of the variable occurs.
• The cumulative frequency (N) is the number of times the variable takes on a value less than or equal to that value.
• The relative frequency (f) is the proportion of observations with that value (the frequency divided by the total number of observations).
• The cumulative relative frequency (F) is the proportion of observations less than or equal to that value (the cumulative frequency divided by the total number of observations).
Depending on the variable (categorical, discrete or continuous), various frequency tables can be created.
Example 1 - Frequency distribution of a categorical variable (color):

Color   n    N    f     F
Blue    4    4    0.4   0.4
Red     2    6    0.2   0.6
White   2    8    0.2   0.8
Green   1    9    0.1   0.9
Black   1    10   0.1   1.0
Total   10        1
Example 2 - List of responses (ages of 20 individuals):

20 22 21 24 21 20 20 24 22 20
22 24 21 25 20 23 22 23 21 20

Frequency distribution:

Age    n    N    f     F
20     6    6    0.3   0.3
21     4    10   0.2   0.5
22     4    14   0.2   0.7
23     2    16   0.1   0.8
24     3    19   0.15  0.95
25     1    20   0.05  1
Total  20        1
Example 3 - List of responses (a continuous variable):

1.58 1.56 1.77 1.59 1.63 1.58 1.82 1.69 1.76 1.60
1.73 1.51 1.54 1.61 1.67 1.72 1.75 1.55 1.68 1.65

Frequency distribution:

Interval       n    N    f     F
]1.50, 1.55]   3    3    0.15  0.15
]1.55, 1.60]   5    8    0.25  0.40
]1.60, 1.65]   3    11   0.15  0.55
]1.65, 1.70]   3    14   0.15  0.70
]1.70, 1.75]   3    17   0.15  0.85
]1.75, 1.80]   2    19   0.10  0.95
]1.80, 1.85]   1    20   0.05  1
Total          20        1
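As an illustration, a frequency table like the one in Example 2 can be computed with a few lines of Python; the following is a minimal sketch, assuming the pandas library is available.

```python
# Sketch: frequency table for the ages of Example 2 using pandas.
import pandas as pd

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

n = pd.Series(ages).value_counts().sort_index()  # frequency (n) per value
table = pd.DataFrame({
    "n": n,                     # frequency
    "N": n.cumsum(),            # cumulative frequency
    "f": n / n.sum(),           # relative frequency
    "F": n.cumsum() / n.sum(),  # cumulative relative frequency
})
print(table)
```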
A measure of central tendency is a single value that describes the way the values of a data set cluster around a central value; the most common measures of central tendency are the mean, the median, and the mode. A measure of variability is a value that describes the spread or dispersion of a data set around its central value (McCune, 2010). If the values of the measures of variability are high, the scores or values in the data set are widely spread out and not tightly centered on the mean. There are three common measures of variability: the range, the standard deviation, and the variance.
Mean
The mean (or average) is the most popular and well-known measure of central tendency. It can be used
with both discrete and continuous data. An important property of the mean is that it includes every value
in the data set as part of the calculation. The mean is equal to the sum of all the values of the variable
divided by the number of values in the data set. So, if we have $n$ values in a data set and $(x_1, x_2, \ldots, x_n)$ are the values of the variable, the sample mean, usually denoted by $\bar{x}$ (the population mean is denoted by $\mu$), is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

For the ages of Example 2:

$$\bar{x} = \frac{20 \cdot 6 + 21 \cdot 4 + 22 \cdot 4 + 23 \cdot 2 + 24 \cdot 3 + 25 \cdot 1}{20} = \frac{435}{20} = 21.75$$

So, the mean age of the 20 individuals is 21.75, i.e., approximately 22 years.
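In code, the sample mean of the ages in Example 2 can be obtained directly; a minimal Python sketch:

```python
# Sketch: sample mean of the ages in Example 2.
ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

mean = sum(ages) / len(ages)  # sum of all values divided by the number of values
print(mean)                   # 21.75
```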
Median
The median is the middle value, or the arithmetic average of the two middle values, of a variable whose values have been arranged in order of magnitude. So, 50% of the observations are greater than or equal to the median, and 50% are less than or equal to it. It can be used with ordinal or numerical data. The median (after ordering all values) is given by:

$$\tilde{x} = \begin{cases} \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \\[2mm] x_{(n+1)/2}, & \text{if } n \text{ is odd} \end{cases}$$

For the ages of Example 2, the ordered values are:
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 24, 25
As $n = 20$ is even, the median is the average of the two middle values. So $\tilde{x} = \frac{21 + 22}{2} = 21.5$ is the median age for the sample of 20 individuals.
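The same result can be checked in Python, either by applying the rule above directly or with the standard library's statistics module:

```python
# Sketch: median of the ages in Example 2 (n = 20 is even).
import statistics

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

ordered = sorted(ages)
n = len(ordered)
manual = (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # average of the 10th and 11th values
print(manual, statistics.median(ages))                # 21.5 21.5
```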
Mode
The mode is the most common value (or values) of the variable. A variable in which each data value
occurs the same number of times has no mode. If only one value occurs with the greatest frequency, the
variable is unimodal; that is, it has one mode. If exactly two values occur with the same frequency, and
that is higher than the others, the variable is bimodal; that is, it has two modes. If more than two data
values occur with the same frequency, and that is greater than the others, the variable is multimodal; that
is, it has more than two modes (McCune, 2010). The mode is typically used with categorical or discrete variables.
In Example 2 above, the most frequent value of the age variable is 20. It occurs six times, so 20 is the mode of the age variable.
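In Python, the statistics module gives the mode directly (and multimode for bimodal or multimodal variables); a minimal sketch:

```python
# Sketch: mode of the ages in Example 2.
import statistics

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

print(statistics.mode(ages))       # 20 (occurs six times)
print(statistics.multimode(ages))  # [20] - would list every mode if there were several
```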
Percentiles
The $p$th percentile ($P_p$) is the value below which $p\%$ of the ordered observations fall. Quartiles are particular percentiles: the 1st quartile ($Q_1$) is the 25th percentile, the 2nd quartile ($Q_2$) is the median, and the 3rd quartile ($Q_3$) is the 75th percentile. After ordering the $n$ values, the $p$th percentile is given by:

$$P_p = \begin{cases} X_{\lfloor i \rfloor + 1}, & \text{if } i = \dfrac{np}{100} \text{ is not an integer} \\[2mm] \dfrac{X_i + X_{i+1}}{2}, & \text{if } i = \dfrac{np}{100} \text{ is an integer} \end{cases}$$

For the ages of Example 2, the ordered values are:
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 24, 25
Thus,
• 25th percentile ($P_{25}$) or 1st quartile ($Q_1$): as $i = \frac{25 \times 20}{100} = \frac{500}{100} = 5$ is an integer,
$$P_{25} = Q_1 = \frac{X_5 + X_6}{2} = \frac{20 + 20}{2} = 20$$
• 50th percentile ($P_{50}$) or median: as $i = \frac{50 \times 20}{100} = \frac{1000}{100} = 10$ is an integer,
$$P_{50} = Q_2 = \tilde{x} = \frac{X_{10} + X_{11}}{2} = \frac{21 + 22}{2} = 21.5$$
• 75th percentile ($P_{75}$) or 3rd quartile ($Q_3$): as $i = \frac{75 \times 20}{100} = \frac{1500}{100} = 15$ is an integer,
$$P_{75} = Q_3 = \frac{X_{15} + X_{16}}{2} = \frac{23 + 23}{2} = 23$$
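The textbook rule above can be implemented directly; note that numerical libraries often use slightly different interpolation rules for percentiles, so the following sketch codes the rule exactly as stated:

```python
# Sketch: percentiles following the rule above (1-based indexing on ordered data).
import math

def percentile(values, p):
    x = sorted(values)
    i = len(x) * p / 100
    if i.is_integer():           # i integer: average X_i and X_(i+1)
        i = int(i)
        return (x[i - 1] + x[i]) / 2
    return x[math.floor(i)]      # otherwise: X_(floor(i) + 1)

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

print(percentile(ages, 25), percentile(ages, 50), percentile(ages, 75))  # 20.0 21.5 23.0
```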
Range
The range for a data set is the difference between the maximum value (greatest value) and the minimum
value (lowest value) in the data set; that is
𝑟𝑎𝑛𝑔𝑒 = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 − 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒
The range should have the same units as those of the data values from which it is computed.
The interquartile range (IQR) is the difference between the third and first quartiles; that is, $IQR = Q_3 - Q_1$ (McCune, 2010).
In Example 2 above, the minimum value is 20 and the maximum value is 25, so the range is $25 - 20 = 5$; with $Q_1 = 20$ and $Q_3 = 23$, the interquartile range is $IQR = 23 - 20 = 3$.
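Reusing the percentile helper from the previous sketch, the range and the interquartile range of Example 2 follow in one line each:

```python
# Sketch: range and interquartile range for the ages of Example 2.
ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

print(max(ages) - min(ages))                        # range = 25 - 20 = 5
print(percentile(ages, 75) - percentile(ages, 25))  # IQR = 23 - 20 = 3 (helper defined above)
```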
Variance and Standard Deviation
The variance and the standard deviation measure how far the data values spread out from the mean. They are defined as follows:
• Population variance: $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$, where $x_i$ is the $i$th data value from the population, $\mu$ is the mean of the population, and $N$ is the size of the population.
• Sample variance: $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$, where $x_i$ is the $i$th data value from the sample, $\bar{x}$ is the mean of the sample, and $n$ is the size of the sample.
• Population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
• Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$
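Both the population and the sample versions are available in Python's statistics module (numpy exposes the same choice through its ddof argument); a minimal sketch using the ages of Example 2:

```python
# Sketch: population vs. sample variance and standard deviation.
import statistics

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

print(statistics.pvariance(ages), statistics.pstdev(ages))  # divide by N (population)
print(statistics.variance(ages), statistics.stdev(ages))    # divide by n - 1 (sample)
```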
Data can be summarized in a visual way using charts and/or graphs. These are displays that are organized
to give a big picture of the data in a flash and to zoom in on a particular result that was found. Depending
on the data type, the graphs include pie charts, bar charts, time charts, histograms or boxplots.
Pie Charts
A pie chart (or a circle chart) is a circular graphic. Each category is represented by a slice of the pie. The
area of the slice is proportional to the percentage of responses in the category. The sum of all slices of the
pie should be 100% or close to it (with a bit of round-off error). The pie chart is used with categorical
variables or discrete numerical variables.
Figure 2 represents Example 1 above.
Figure 2 Pie chart of the color variable in Example 1 (Blue 40%, Red 20%, White 20%, Green 10%, Black 10%)
Bar Charts
A bar chart (or bar graph) is a chart that presents grouped data with rectangular bars with lengths
proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical
bar chart is sometimes called a column bar chart. In general, the x-axis represents categorical variables or
discrete numerical variables.
Figure 3 and Figure 4 represent Example 1 above.
Figure 3 and Figure 4 Bar charts of the frequencies and relative frequencies of Example 1
Time Charts
A time chart is a data display whose main point is to examine trends over time. Another name for a time
chart is a line graph. Typically a time chart has some unit of time on the horizontal axis (year, day, month,
and so on) and a measured quantity on the vertical axis (average household income, birth rate, total sales,
or others). At each time period, the amount is shown as a dot, and the dots are connected to form the
time chart (Rumsey, 2010).
Figure 5 is an example of a time chart. It represents, for instance, the number of accidents in a small city over several years.
Figure 5 Time chart of the number of accidents per year in a small city
Histogram
A histogram is a graphical representation of numerical data distribution. It is an estimate of the
probability distribution of a continuous quantitative variable. Because the data is numerical, the categories
are ordered from smallest to largest (as opposed to categorical data, such as gender, which has no inherent
order to it). To be sure each number falls into exactly one group, the bars on a histogram touch each other
but don’t overlap (Rumsey, 2010). The height of a bar in a histogram may represent either frequency or a
percentage (Peers, 2006).
Figure 6 accounts for the histogram of example 3 above.
Figure 6 Histogram of Example 3
Boxplot
A boxplot or box plot is a convenient way of graphically depicting groups of numerical data. It is a one-
dimensional graph of numerical data based on the five-number summary, which includes the minimum
value, the 25th percentile (also known as Q1), the median, the 75th percentile (Q3), and the maximum
value. In essence, these five descriptive statistics divide the data set into four equal parts (Rumsey, 2010).
Some statistical software adds asterisk signs (∗) or circle signs (ο) to show numbers in the data set that are
considered to be, respectively, outliers or suspected outliers — numbers determined to be far enough
away from the rest of the data. There are two types of outliers:
• Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first
quartile.
• Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above
the third quartile or 1.5×IQR or more below the first quartile.
The boxplot in Figure 7 marks, from top to bottom: outliers (∗), i.e., values greater than Q3 + 3·IQR (or lower than Q1 - 3·IQR, if they are low outliers); suspected outliers (ο), i.e., values greater than Q3 + 1.5·IQR (or lower than Q1 - 1.5·IQR); the largest value that is not an outlier; the 3rd quartile (75th percentile); the 2nd quartile (50th percentile, or median); the 1st quartile (25th percentile); and the minimum (or the lowest value that is not an outlier, if there are low outliers or suspected outliers).
Figure 7 Boxplot
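The outlier and suspected-outlier fences described above are easy to compute once the quartiles are known. The following sketch is a hypothetical helper (the function name and the small data set are illustrative only):

```python
# Sketch: classifying outliers and suspected outliers from Q1, Q3 and the IQR fences.
def classify_outliers(values, q1, q3):
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # suspected-outlier fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # outlier fences
    outliers = [v for v in values if v < outer_low or v > outer_high]
    suspected = [v for v in values
                 if (outer_low <= v < inner_low) or (inner_high < v <= outer_high)]
    return outliers, suspected

# Illustrative values: with Q1 = 3 and Q3 = 5, IQR = 2, so 10 is a suspected outlier
# (above 5 + 1.5*2 = 8) and 30 is an outlier (above 5 + 3*2 = 11).
print(classify_outliers([2, 3, 4, 5, 10, 30], q1=3, q3=5))  # ([30], [10])
```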
STATISTICAL INFERENCE
Statistical inference is the process of drawing conclusions about populations or scientific truths from data.
This process is divided into two areas: estimation theory and decision theory. The objective of estimation theory is to estimate the value of the theoretical population parameters from the sample estimates. The purpose of decision theory is to make decisions about the population parameters with the use of hypothesis tests, supported by a concrete measure of the degree of certainty/uncertainty regarding the decision taken (Marôco, 2011).
Normal distribution
The normal distribution, or Gaussian distribution, is the most important probability density function in statistical inference. The requirement that the sampling distribution be normal is one of the demands of some frequently used statistical methodologies, called parametric methods (Marôco, 2011). A random variable $X$ with a normal distribution of mean $\mu$ and standard deviation $\sigma$ is written as $X \sim N(\mu, \sigma)$. The probability density function (PDF) of this variable is given by:
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < +\infty$$

The standard normal variable $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$ has the probability density function:

$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < +\infty$$
The normal distribution graph has a bell-shaped curve (the normal distribution is also known as the bell curve) and is completely determined by the mean and the standard deviation of the distribution. Figure 8 shows the distribution $N(0, 1)$.
Figure 8 Normal distribution N(0, 1)
Although there are many normal curves, they all share an important property that allows us to treat them
in a uniform fashion. Thus, all normal density curves satisfy the following property, which is often
referred to as the Empirical Rule.
Range Proportion
𝜇 ± 1𝜎 68.3 %
𝜇 ± 2𝜎 95.5 %
𝜇 ± 3𝜎 99.7 %
Thus, for a normal distribution, almost all values lie within three standard deviations of the mean.
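The Empirical Rule can be verified numerically with the standard normal CDF; a minimal sketch, assuming scipy is available:

```python
# Sketch: proportion of a normal distribution within k standard deviations of the mean.
from scipy.stats import norm

for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)  # P(mu - k*sigma < X < mu + k*sigma)
    print(k, round(prob, 3))           # approximately 0.683, 0.954, 0.997
```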
Chi-Square distribution
A random variable $X$ obtained as the sum of squares of $n$ independent random variables $Z_i \sim N(0, 1)$ has a chi-square distribution with $n$ degrees of freedom, denoted $X \sim \chi^2(n)$. The probability density function (PDF) of this variable is given by (Kerns, 2010):

$$f_X(x) = \frac{1}{2^{n/2}\,\Gamma\!\left(\frac{n}{2}\right)}\, x^{\frac{n}{2}-1}\, e^{-\frac{x}{2}}, \quad x > 0$$

where $\Gamma(u) = \int_0^{+\infty} x^{u-1} e^{-x}\, dx$ is the gamma function.
Figure 9 Chi-square distribution example
The expected value of $X$ is $E[X] = n$ and the variance is $V[X] = 2n$. As noted above, the $\chi^2(n)$ distribution is the sum of squares of $n$ independent $N(0, 1)$ variables. Thus, the central limit theorem (see section Central Limit Theorem) also ensures that the $\chi^2$ distribution approaches the normal distribution for large values of $n$.
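The definition of the chi-square distribution as a sum of squared standard normals can be checked by simulation; a minimal sketch, assuming numpy and scipy are available:

```python
# Sketch: chi-square(n) as a sum of squares of n independent N(0, 1) variables.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 10
samples = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)

print(samples.mean(), samples.var())    # close to E[X] = n = 10 and V[X] = 2n = 20
print(chi2.mean(df=n), chi2.var(df=n))  # exact values: 10.0 and 20.0
```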
Student's t-distribution
Student’s t-distribution is a probability distribution that is used to estimate population parameters when
the sample size is small and/or when the population variance is unknown.
A random variable $X = \frac{Z}{\sqrt{Y/n}}$ has a Student's t-distribution with $n$ degrees of freedom if $Z \sim N(0, 1)$ and $Y \sim \chi^2(n)$ are independent variables. The probability density function (PDF) of this variable is given by (Kerns, 2010):

$$f_X(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}, \quad -\infty < x < +\infty$$

where $\Gamma(u) = \int_0^{+\infty} x^{u-1} e^{-x}\, dx$ and $n > 0$. When $n$ increases, this distribution approaches the standard normal distribution $N(0, 1)$. Figure 10 shows an example of a Student's t-distribution.
Figure 10 Student’s t-distribution example
Like the standard normal distribution, the Student's t-distribution has expected value $E[X] = 0$ and variance $V[X] = \frac{n}{n-2}$, for $n > 2$.
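The variance formula can be checked against scipy's implementation of the t-distribution; a minimal sketch:

```python
# Sketch: variance of Student's t with n degrees of freedom equals n / (n - 2) for n > 2.
from scipy.stats import t

for n in (3, 5, 10, 30):
    print(n, t.var(df=n), n / (n - 2))  # the two values agree for each n
```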
Snedecor’s F-distribution
Snedecor's F-distribution is a continuous statistical distribution that arises in testing whether two population variances are equal. A random variable $X = \frac{U/m}{V/n}$, where $U \sim \chi^2(m)$ and $V \sim \chi^2(n)$ are independent, has an F-distribution with $m$ and $n$ degrees of freedom, denoted $X \sim F(m, n)$. The probability density function (PDF) of this variable is given by:

$$f_X(x) = \frac{\Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)} \left(\frac{m}{n}\right)^{\frac{m}{2}} x^{\frac{m}{2}-1} \left(1 + \frac{m}{n}x\right)^{-\frac{m+n}{2}}, \quad x > 0$$

where $\Gamma(u) = \int_0^{+\infty} x^{u-1} e^{-x}\, dx$. Figure 11 shows an example of a Snedecor's F-distribution.
Figure 11 Snedecor’s F-distribution example
The expected value of $X$ is $E[X] = \frac{n}{n-2}$, for $n > 2$, and the variance is $V[X] = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$, for $n > 4$.
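These two moments can likewise be checked against scipy's F-distribution; a minimal sketch with illustrative degrees of freedom m = 5 and n = 10:

```python
# Sketch: mean and variance of F(m, n) compared with the closed-form expressions.
from scipy.stats import f

m, n = 5, 10
print(f.mean(dfn=m, dfd=n), n / (n - 2))  # both 1.25
print(f.var(dfn=m, dfd=n),
      2 * n**2 * (m + n - 2) / (m * (n - 2) ** 2 * (n - 4)))  # both ~1.354
```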
Binomial distribution
The binomial distribution is the discrete distribution most used in statistical inference to test hypotheses concerning proportions of dichotomous nominal variables (true vs. false, exists vs. does not exist). It gives the probability of obtaining exactly $x$ successes in $n$ Bernoulli trials, where the result of each trial is a success with probability $p$ and a failure with probability $q = 1 - p$. The binomial distribution for the variable $X$ has parameters $n$ and $p$ and is denoted as $X \sim B(n, p)$. The probability mass function (PMF) of this variable is given by:
$$f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n$$
Figure 12 Binomial distribution example
The expected value of the variable $X$ is $E[X] = n \cdot p$, and the variance is $V[X] = n \cdot p \cdot q$. As with the chi-square and Student's t-distributions, the central limit theorem ensures that the binomial distribution is approximated by the normal distribution when $n$ and $p$ are sufficiently large ($n > 20$ and $np > 7$; Marôco, 2011).
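The normal approximation mentioned above can be illustrated with scipy; the parameters below are purely illustrative and satisfy the stated conditions:

```python
# Sketch: normal approximation to the binomial for large n (with continuity correction).
import math
from scipy.stats import binom, norm

n, p = 50, 0.4                   # illustrative parameters: n > 20 and np = 20 > 7
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

print(binom.cdf(25, n, p))                  # exact P(X <= 25)
print(norm.cdf(25.5, loc=mu, scale=sigma))  # normal approximation, very close to the exact value
```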
Sampling distribution
To perform statistical inference (estimating confidence intervals or testing hypotheses) it is necessary to know the distributional properties of the sample from which we intend to infer about the theoretical population (Marôco, 2011). In the examples given so far, a population was specified, and the
sampling distribution of the mean and the range were determined. In practice, the process proceeds the
other way: the sample data is collected, and from these data, the parameters of the sampling distribution
are estimated. The mean of a representative sample provides an estimate of the unknown population
mean, but intuitively we know that if we took multiple samples from the same population, the estimates
would vary from one another. We could, in fact, sample over and over from the same population and
compute a mean for each of the samples. In essence, all these sample means constitute yet another
"population", and we could graphically display the frequency distribution of the sample means. This is
referred to as the sampling distribution of the sample means.
Some of the sampling distributions commonly used in the statistical inference process are presented below (Marôco, 2011):
• Sample mean: $\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ if the sampling is with replacement or the sampling fraction is small ($n/N \le 0.05$); $\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}\right)$ if the sampling is without replacement from a finite population.
• $\frac{\bar{X}-\mu}{S'/\sqrt{n}} \sim t(n-1)$ if the population standard deviation is unknown.
• $\frac{(n-1)S'^2}{\sigma^2} \sim \chi^2(n-1)$ if the variable has a normal distribution.
• $\frac{S_1'^2/\sigma_1^2}{S_2'^2/\sigma_2^2} \sim F(n_1-1, n_2-1)$ if the scaled sample variances have $\chi^2$ distributions.
• Sample proportion: $\hat{P} \sim B(n, p)$ for small samples; $\frac{\hat{P}-p}{\sqrt{p(1-p)/n}} \sim N(0, 1)$ for large samples (with $n > 20$ and $np > 5$, where $p$ is the population proportion).
The sample mean is one of the most relevant statistics for both the theory of estimation and the theory of decision.
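The idea of repeatedly sampling and collecting the sample means can be simulated directly; a minimal sketch with a hypothetical skewed population generated with numpy:

```python
# Sketch: simulating the sampling distribution of the sample mean.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # hypothetical (skewed) population

# Draw many samples of size 30 and keep the mean of each one.
sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

print(population.mean(), np.mean(sample_means))              # both close to 2.0
print(population.std() / np.sqrt(30), np.std(sample_means))  # theoretical vs. observed standard error
```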
Hypothesis tests
A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis tests refer to the formal procedures used by statisticians to accept or reject a statistical
hypothesis.
The best way to determine whether a statistical hypothesis is true would be to examine the entire
population. Since that is often impractical, statistical tests are used to determine whether there is enough
evidence in a sample of data to infer that a particular condition is true for the entire population. If sample
data are not consistent with the statistical hypothesis, the hypothesis is rejected.
Hypothesis tests examine two opposing hypotheses about a population: the null hypothesis and the
alternative hypothesis.
The null hypothesis, denoted by H0, is the statement being tested. Usually, the null hypothesis is a statement of no effect (the absence of effect) and is the less compromising claim. The alternative hypothesis, denoted by H1, is the hypothesis that the sample observations are influenced by some non-random cause.
H0 should only be rejected if there is enough evidence, for a given probability of error or a certain level of confidence, suggesting that H0 is in fact not valid.
A hypothesis test can have one of two outcomes: the null hypothesis is accepted, or the null hypothesis is rejected. Many statisticians, however, take issue with the notion of "accepting the null hypothesis". Instead, they say: you reject the null hypothesis, or you fail to reject the null hypothesis. The distinction between "acceptance" and "failure to reject" is crucial. While acceptance would imply that the null hypothesis is true, failure to reject means only that the data are not sufficiently persuasive to prefer the alternative hypothesis to the null hypothesis.
When considering whether the null hypothesis should be rejected and the alternative hypothesis accepted, it is necessary to consider the direction of the alternative hypothesis statement. This leads to either a one-tailed test or a two-tailed test.
A one-tailed test is a statistical test in which the critical area of the distribution is one-sided so that it is
either greater than or less than a particular value, but not both. If the sample that is being tested falls into
the one-sided critical area, the alternative hypothesis will be accepted instead of the null hypothesis. The
one-tailed test gets its name from checking the area under one of the tails (sides) of a normal distribution,
although the test can be used in other non-normal distributions as well.
For example, suppose the null hypothesis states that the mean is less than or equal to 10. The alternative
hypothesis would be that the mean is greater than 10. The region of rejection would consist of a range of
numbers located on the right side of sampling distribution; that is, a set of numbers greater than 10. This
represents the implementation of a one-tailed test.
A two-tailed test is a statistical test in which the critical area of the distribution is two-sided and tests
whether a sample is either greater than or less than a specified range of values. If the sample that is being
tested falls into either of the critical areas, the alternative hypothesis will be accepted instead of the null
hypothesis. The two-tailed test gets its name from checking the area under both of the tails (sides) of a
normal distribution, although the test can be used in other non-normal distributions.
For example, suppose the null hypothesis states that the mean is equal to 10. The alternative hypothesis would be that the mean is different from 10, i.e., less than 10 or greater than 10. The region of rejection
would consist of a range of numbers located on both sides of sampling distribution; that is, the region of
rejection would consist partly of numbers that were less than 10 and partly of numbers that were greater
than 10.
Decision rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice, statisticians describe these decision rules in two ways: in terms of a p-value or in terms of a region of acceptance.
• For a one-tailed test, the p-value is the area to the right (right-tailed test) or left (left-tailed test) of
the test statistic.
• For a two-tailed test, the p-value is two times the area to the right of a positive test statistic or the
left of a negative test statistic.
To make a decision about rejecting or not rejecting H0, it is necessary to determine the cutoff probability
for the p-value before doing a hypothesis test; this cutoff is called an alpha level (α). Typical values for α
are 0.05 or 0.01.
When the p-value (instead of the test statistic) is used in the decision rule, the rule becomes: if the p-value is less than α (the level of significance), reject H0 and accept H1; otherwise, fail to reject H0.
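As an illustration of this decision rule, the ages of Example 2 can be tested against a hypothetical null value of 22 with a two-tailed one-sample t-test; a minimal sketch, assuming scipy is available:

```python
# Sketch: two-tailed one-sample t-test and the p-value decision rule.
from scipy import stats

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

# H0: the population mean age is 22; H1: the population mean age differs from 22.
t_stat, p_value = stats.ttest_1samp(ages, popmean=22)
alpha = 0.05

print(t_stat, p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```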
However, incorrect interpretations of p-values are very common. The most common mistake is to
interpret a p-value as the probability of making an error by rejecting a true null hypothesis (called a type I
error).
There are several reasons why p-values can’t be the error rate.
First, p-values are calculated based on the assumptions that the null is true for the population and that the
difference in the sample is caused entirely by random chance. Consequently, p-values can’t tell the
probability that the null hypothesis is true or false because it is 100% true from the perspective of the
calculations.
Second, while a small p-value indicates that the data are unlikely assuming a true null, it can’t evaluate
which of two competing cases is more likely: 1) The null is true, but the sample was unusual or; 2) The
null is false. Determining which case is more likely requires subject area knowledge and replicate studies.
For example, suppose that a vaccine study produced a p-value of 0.04. The correct way to interpret this value is: assuming that the vaccine had no effect, one would obtain the observed difference or a larger one in 4% of studies due to random sampling error. An incorrect way to interpret it is: if the null hypothesis is rejected, there is a 4% chance that a mistake is being made.
Types of errors
The point of a hypothesis test is to make the correct decision about H0. Unfortunately, hypothesis testing
is not a simple matter of being right or wrong. No hypothesis test is 100% certain because the hypothesis
test is based on probability, so there is always a chance that an error has been made. Two types of errors
are possible: type I and type II. The risks of these two errors are inversely related and determined by the
significance level and the power for the test.
The four possible situations are the following:
• The null hypothesis is true and it is not rejected: correct decision (probability = 1 - α).
• The null hypothesis is true and it is rejected: type I error - rejecting the null when it is true (probability = α).
• The null hypothesis is false and it is not rejected: type II error - failing to reject the null when it is false (probability = β).
• The null hypothesis is false and it is rejected: correct decision (probability = 1 - β).
Type I error
When the null hypothesis is true and it is rejected, a type I error occurs. The probability of making a type I error is α, which is the significance level set for the hypothesis test. An α of 0.05 indicates a willingness to accept a 5% chance of being wrong when rejecting the null hypothesis. To reduce this risk, a lower value for α should be used. However, with a lower value for α, it will be less likely that a true difference is detected if one exists.
Type II error
When the null hypothesis is false and it fails to be rejected, a type II error occurs. The probability of making a type II error is β, which depends on the power of the test. The risk of committing a type II error can be decreased by ensuring that the test has enough power, which can be done by making the sample size large enough to detect a practical difference when one truly exists.
The probability of rejecting the null hypothesis when it is false is equal to 1–β. This value is the power of
the test.
The following example helps to understand the interrelationship between type I and type II errors, and to determine which error has more severe consequences in each situation. If there is interest in comparing the effectiveness of two medications, the null and alternative hypotheses are:
• H0: μ1 = μ2 (the two medications are equally effective)
• H1: μ1 ≠ μ2 (the two medications are not equally effective)
A type I error occurs if the null hypothesis is rejected when it is true, i.e., if it is concluded that the two medications are different when, in fact, they are not. If the medications have the same effectiveness, this
error may not be considered too severe because the patients still benefit from the same level of
effectiveness regardless of which medicine they take.
However, if a type II error occurs, the null hypothesis is not rejected when it should be rejected. That is, it
is possible to conclude that the medications have the same effectiveness when, in fact, they are different.
This error is potentially life-threatening if the less-effective drug is sold to the public instead of the more
effective one.
When the hypothesis tests are conducted, consider the risks of making type I and type II errors. If the
consequences of making one type of error are more severe or costly than making the other type of error,
then choose a level of significance and power for the test that will reflect the relative severity of those
consequences.
Confidence intervals
A confidence interval is an estimated range for a parameter of a population. Instead of estimating the parameter by a single value, a range of probable estimates is given.
Confidence intervals are used to indicate the reliability of an estimate. For example, a confidence interval can be used to describe how trustworthy the results of a study are. All else being equal, a study that results in a narrow confidence interval is more reliable than one that results in a wider confidence interval. These intervals are usually calculated so that the confidence level is 95%, but 90%, 99%, 99.9% (or any other) confidence intervals for the unknown parameter can also be produced.
The width of the confidence interval gives some idea of how uncertain the research is about the unknown
parameter. A very wide interval may indicate that more data should be collected before anything very
definite can be said about the parameter.
Confidence intervals are more informative than the simple results of hypothesis tests (where we decide
"reject H0" or "don't reject H0") since they provide a range of plausible values for the unknown parameter.
Confidence limits are the lower and upper boundaries/values of a confidence interval, that is, the values
that define the range of a confidence interval.
The upper and lower bounds of a 95% confidence interval are the 95% confidence limits. These limits
may be taken for other confidence levels, for example, 90%, 99%, and 99.9%.
The confidence level is the probability value 1 − 𝛼 associated with a confidence interval.
It is often expressed as a percentage. For example, say 𝛼 = 0.05 = 5%, then the confidence level is equal
to 1 − 0.05 = 0.95, i.e. a 95% confidence level. For example, suppose an opinion poll predicted that, if
the election were held today, the Conservative party would win 60% of the vote. The pollster might attach
a 95% confidence level to the interval 60% plus or minus 3%. That is, he thinks it very likely that the
Conservative party would get between 57% and 63% of the total vote.
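A confidence interval like the one in the opinion-poll example can be computed from the normal approximation for a proportion; the sample size below is hypothetical and chosen only to reproduce a margin of about 3 percentage points:

```python
# Sketch: 95% confidence interval for a proportion (hypothetical n = 1000, observed 60%).
import math
from scipy.stats import norm

n, p_hat, confidence = 1000, 0.60, 0.95
z = norm.ppf(1 - (1 - confidence) / 2)           # about 1.96 for a 95% level
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # about 0.03 (3 percentage points)

print(p_hat - margin, p_hat + margin)            # roughly (0.57, 0.63)
```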
Summarizing, a p-value is the probability of obtaining an effect as large as or greater than the observed effect, assuming the null hypothesis is true. A p-value:
• provides a measure of the strength of evidence against H0;
• does not provide information on the magnitude of the effect;
• is affected by sample size and by the magnitude of the effect: interpret with caution!
• cannot be used in isolation to inform clinical judgment.
CONCLUSION
This chapter presents the main concepts used in statistical analysis. Without these, it will be difficult for the reader to understand the additional analyses presented in the course of this book.
The reader should now be able to recognize the concepts used, their meaning, and when they should be applied.
The theoretical concepts presented in this chapter are:
• Variable, population and sample
• Mean, median, mode, standard deviation, quartile and percentile
• Statistic distributions
o Normal distribution
o Chi-square distribution
o Student’s t-distribution
o Snedecor’s F-distribution
o Binomial distribution
• Central limit theorem
• Decision rules: p-value, error, confidence interval and tests.
REFERENCES
Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. 1st Edition. Lulu.com.
Marôco, J. (2011). Análise Estatística com o SPSS Statistics. 5th Edition. Pero Pinheiro: Report Number, pp. 7-61.
McCune, S. (2010). Practice Makes Perfect: Statistics. 1st Edition. United States: McGraw-Hill.
Peers, I. (2006). Statistical Analysis for Education and Psychology Researchers: Tools for Researchers in Education and Psychology. Routledge.
Rumsey, D. (2010). Statistics Essentials For Dummies. New Jersey: Wiley Publishing, Inc.