



Statistics
INTRODUCTION
Statistics is a set of methods used to analyze data. Statistics is present in all areas of science that involve the collection, handling, and sorting of data, with the aim of gaining insight into a particular phenomenon and, from that knowledge, inferring possible new results. One of the goals of statistics is to extract information from data in order to get a better understanding of the situations the data represent. Thus, statistics can be thought of as the science of learning from data.
Currently, the high competitiveness in the search for technologies and markets has caused a constant race for information. This is a growing and irreversible trend. Learning from data is one of the most critical challenges of the information age in which we live. In general, we can say that statistics, based on the theory of probability, provides techniques and methods for data analysis that help the decision-making process in various problems where there is uncertainty.

This chapter presents the main concepts used in statistics, which will contribute to understanding the analyses presented throughout this book.

VARIABLES, POPULATION, AND SAMPLES


In statistical analysis, a "variable" is a common characteristic of all elements of the sample or population to which it is possible to attribute a number or category. The values of the variable vary from element to element.

Types of variables
Statistical variables can be classified as categorical variables or numerical variables.

Categorical variables have values that describe a "quality" or "characteristic" of a data unit, like "which type" or "which category". Categorical variables fall into categories that are mutually exclusive (an observation belongs to one category or another, not both) and exhaustive (the categories cover all possible options). Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value. Categorical variables may be further described as (Marôco, 2011):
• Nominal: the data consist of categories only. The variables are measured in discrete classes, and
it is not possible to establish any qualification or ordering. Standard mathematical operations
(addition, subtraction, multiplication, and division) are not defined when applied to this type of
variable. Gender (male or female) and colors (blue, red or green) are two examples of nominal
variables.
• Ordinal: the data consist of categories that can be arranged in a specific order according to their
relative size or quality, but cannot be quantified. Standard mathematical operations (addition,
subtraction, multiplication, and division) are not defined when applied to this type of variable.
Social class (upper, middle, and lower) and education (elementary, medium, and high) are two
examples of ordinal variables. Likert scales (1-"Strongly Disagree", 2-"Disagree",
3-"Undecided", 4-"Agree", 5-"Strongly Agree") are ordinal scales commonly used in social
sciences.

Numerical variables have values that describe a measurable quantity as a number, like “how many” or
“how much”. Therefore, numeric variables are quantitative variables. Numeric variables may be further
described as:

• Discrete: the data is numerical. Observations can take a value based on a count of a set of distinct
integer values. A discrete variable cannot take the value of a fraction of one value and the next
closest value. The number of registered cars, the number of business locations, and the number of
children in a family, all of which measured as whole units (i.e. 1, 2, or 3 cars) are some examples
of discrete variables.
• Continuous: the data is numerical. Observations can take any value between a particular set of
real numbers. The value given to one observation for a continuous variable can include values as
precise as possible with the instrument of measurement. Height and time are two examples of
continuous variables.

Population and Samples


The population is the total of all the individuals who have certain characteristics and are of interest to a
researcher. Community college students, racecar drivers, teachers, and college-level athletes can all be
considered populations.
It is not always convenient or possible to examine every member of an entire population. For example, it
is not practical to ask all students which color they prefer, but it is possible to ask the students of
three schools their preferred color. This subset of the population is called a sample.
A sample is a subset of the population. Samples are important because, in many kinds of scientific
research, it is impossible (from both a strategic and a resource perspective) to study all
members of a population for a research project. It simply costs too much and takes too much time. Instead, a
selected few participants (who make up the sample) are chosen in such a way that the sample is representative of
the population. If this happens, the results from the sample can be inferred to the population,
which is precisely the purpose of inferential statistics: using information on a smaller group of
participants makes it possible to understand the whole population.
There are many types of samples, including random samples, stratified samples, and convenience
samples, but they all share the goal of accurately obtaining a smaller subset of the larger set of total
participants, such that the smaller subset is representative of the larger set.

Independent and Paired Samples


The relationship, or absence of a relationship, between the elements of one or more samples defines
another classification factor, particularly important in statistical inference. If there is no
relationship between the elements of the samples, they are called independent samples; in that case, the
theoretical probability of a given subject belonging to more than one sample is null. Conversely, if
the samples are composed according to some unifying criterion (for example, samples in which
the same variable is measured before and after a specific treatment on the same subjects), they are called paired
samples. In such samples, the subjects being tested are related. They can even be the same
subjects (e.g., repeated measurements) or subjects with paired characteristics (as in statistical block studies).

DESCRIPTIVE STATISTICS
Descriptive statistics are used to describe the essential features of the data in a study. They provide simple
summaries about the sample and the measures. Together with simple graphical analysis, they form the basis
of virtually every quantitative analysis of data. Descriptive statistics allow quantitative
descriptions to be presented in a convenient way. A research study may involve many measures, or it may
measure a significant number of people on a single measure. Descriptive statistics help to simplify large
amounts of data in a sensible way: each descriptive statistic reduces lots of data into a simpler summary.

Frequency Distributions

Frequency distributions are visual displays that organize and present frequency counts (n) so that the
information can be interpreted more easily. Along with the frequency counts, it may include relative
frequency, cumulative frequency, and cumulative relative frequencies.
• The frequency (n) is the number of times a particular value of the variable occurs.
• The cumulative frequency (N) is the number of times the variable takes a value less than or equal
to that value.
• The relative frequency (f) is the frequency expressed as a proportion of the total number of observations.
• The cumulative relative frequency (F) is the cumulative frequency expressed as a proportion of the total number of observations.

Depending on the variable (categorical, discrete or continuous), various frequency tables can be created.

Example 1: favorite color of 10 individuals – categorical variable

List of responses:
Blue Red Blue White Green
White Blue Red Blue Black

Frequency distribution:

Color   n    N    f     F
Blue    4    4    0.4   0.4
Red     2    6    0.2   0.6
White   2    8    0.2   0.8
Green   1    9    0.1   0.9
Black   1    10   0.1   1.0
Total   10             1

Example 2: age of 20 individuals – discrete numerical variable

List of responses:
20 22 21 24 21 20 20 24 22 20
22 24 21 25 20 23 22 23 21 20

Frequency distribution:

Age     n    N    f      F
20      6    6    0.3    0.3
21      4    10   0.2    0.5
22      4    14   0.2    0.7
23      2    16   0.1    0.8
24      3    19   0.15   0.95
25      1    20   0.05   1
Total   20              1

Example 3: height of 20 individuals – continuous numerical variable

List of responses:
1.58 1.56 1.77 1.59 1.63 1.58 1.82 1.69 1.76 1.60
1.73 1.51 1.54 1.61 1.67 1.72 1.75 1.55 1.68 1.65

Frequency distribution:

Interval       n    N    f      F
]1.50, 1.55]   3    3    0.15   0.15
]1.55, 1.60]   5    8    0.25   0.4
]1.60, 1.65]   3    11   0.15   0.55
]1.65, 1.70]   3    14   0.15   0.7
]1.70, 1.75]   3    17   0.15   0.85
]1.75, 1.80]   2    19   0.1    0.95
]1.80, 1.85]   1    20   0.05   1
Total          20             1
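For illustration, the frequency table of example 2 can be reproduced with a few lines of Python using only the standard library. This is just a sketch; the column labels follow the notation n, N, f, and F defined above.

```python
from collections import Counter

# Ages of the 20 individuals from example 2
ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

counts = Counter(ages)          # absolute frequencies (n)
total = len(ages)
cumulative = 0

print(f"{'Age':>4} {'n':>3} {'N':>3} {'f':>5} {'F':>5}")
for value in sorted(counts):
    n = counts[value]
    cumulative += n             # cumulative frequency (N)
    f = n / total               # relative frequency (f)
    F = cumulative / total      # cumulative relative frequency (F)
    print(f"{value:>4} {n:>3} {cumulative:>3} {f:>5.2f} {F:>5.2f}")
```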

Measures of Central Tendency and Measures of Variability


A measure of central tendency is a numerical value that describes a data set, by attempting to provide a
“central” or “typical” value of the data (McCune, 2010). As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics.
Measures of central tendency should have the same units as those of the data values from which they are
determined. If no units are specified for the data values, no units are specified for the measures of central
tendency.
The mean (often called the average) is most likely the measure of central tendency that the reader is most
familiar with, but there are others, such as the median, the mode, percentiles, and quartiles.
The mean, median and mode are all valid measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others.

A measure of variability is a value that describes the spread or dispersion of a data set around its central value
(McCune, 2010). If the values of the measures of variability are high, it means that the scores or values in the
data set are widely spread out and not tightly centered on the mean. There are three common measures of
variability: the range, the standard deviation, and the variance.

Mean
The mean (or average) is the most popular and well-known measure of central tendency. It can be used
with both discrete and continuous data. An important property of the mean is that it includes every value
in the data set as part of the calculation. The mean is equal to the sum of all the values of the variable
divided by the number of values in the data set. So, if we have \(n\) values in a data set and \(x_1, x_2, \ldots, x_n\)
are the values of the variable, the sample mean, usually denoted by \(\bar{x}\) (the population mean is denoted by \(\mu\)), is:

\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}
\]

Applying this formula to example 2 above, the mean is given by:

\[
\bar{x} = \frac{20 \times 6 + 21 \times 4 + 22 \times 4 + 23 \times 2 + 24 \times 3 + 25 \times 1}{20} = \frac{435}{20} = 21.75
\]

So, the mean age of the 20 individuals is 21.75, i.e., approximately 22 years.

Median
The median is the middle value, or the arithmetic average of the two middle values, of the variable once its
values have been arranged in order of magnitude. So, 50% of the observations are greater than or equal to the median, and
50% are less than or equal to it. It can also be used with ordinal data. The median (after ordering all
values) is given by:

\[
\tilde{x} =
\begin{cases}
\dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \\[8pt]
x_{(n+1)/2}, & \text{if } n \text{ is odd}
\end{cases}
\]

In example 2 above, by ordering the age variable values, we have:

20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 24, 25

!"!!!
As n is even, the median is the average of the middle values. So 𝑥 = = 21.5 is the age median for
!
the sample with 20 individuals.

Mode
The mode is the most common value (or values) of the variable. A variable in which each data value
occurs the same number of times has no mode. If only one value occurs with the greatest frequency, the
variable is unimodal; that is, it has one mode. If exactly two values occur with the same frequency, and
that is higher than the others, the variable is bimodal; that is, it has two modes. If more than two data
values occur with the same frequency, and that is greater than the others, the variable is multimodal; that
is, it has more than two modes (McCune, 2010). The mode should be used only with discrete variables.

In example 2 above, the most frequent value of the age variable is 20, which occurs six times. So, 20 is the
mode of the age variable.

Percentiles and Quartiles


The most common way to report relative standing of a number within a data set is by using percentiles
(Rumsey, 2010). The Pth percentile cuts the data set in two so that approximately P% of the data is below
it and (100−P)% of the data is above it. So, the percentile of order p is calculated by (Marôco, 2011):

\[
P_p =
\begin{cases}
X_{\operatorname{int}(i+1)}, & \text{if } i = \dfrac{np}{100} \text{ is not an integer} \\[8pt]
\dfrac{X_i + X_{i+1}}{2}, & \text{if } i = \dfrac{np}{100} \text{ is an integer}
\end{cases}
\]

where \(n\) is the sample size and \(\operatorname{int}(i+1)\) is the integer part of \(i+1\).


It is usual to calculate \(P_{25}\), also called the first quartile (\(Q_1\)); \(P_{50}\), the second quartile (\(Q_2\)) or median; and \(P_{75}\), the third quartile (\(Q_3\)).

In example 2 above, we have:

20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 24, 25

Thus,
!"∗!" !""
• 25th percentile (𝑃!" ) or 1st quartile (𝑄! ): as 𝑖 = = = 5 is integer,
!"" !""
12

𝑋! +   𝑋! 20 + 20
𝑃!" = 𝑄! =   =   = 20
2 2
!"∗!" !"""
• 50th percentile (𝑃!" ) or median: as 𝑖 = = = 10 is integer,
!"" !""

𝑋!" +   𝑋!! 21 + 22
𝑃!" = 𝑄! = 𝑥 =   =   = 21.5
2 2
!"∗!" !"#!
• 75th percentile (𝑃!" ) or 3rd quartile (𝑄! ): as 𝑖 = = = 15 is integer,
!"" !""

𝑋!" +   𝑋!" 23 + 23
𝑃!" = 𝑄! =   =   = 23
2 2

Range
The range for a data set is the difference between the maximum value (greatest value) and the minimum
value (lowest value) in the data set; that is
range = maximum value − minimum value

The range should have the same units as those of the data values from which it is computed.
The interquartile range (IQR) is the difference between the third and first quartiles; that is, \(IQR = Q_3 - Q_1\) (McCune, 2010).

In example 2 above, the minimum value is 20 and the maximum value is 25. Thus, the range is given by 25 − 20 = 5.

Standard Deviation and Variance


The variance and standard deviation are widely used measures of variability. They measure how far the values of
a variable deviate from its mean. If there is no variability in a variable, each data value equals the mean, so
both the variance and the standard deviation of the variable are zero. The greater the distance of the variable's
values from the mean, the greater its variance and standard deviation.
The relationship between the variance and the standard deviation is quite simple. The standard
deviation (denoted by \(\sigma\) for the population and \(s\) for a sample) is the square root of the variance
(denoted by \(\sigma^2\) for the population and \(s^2\) for a sample).
The formulas for the variance and standard deviation (for population and sample, respectively) are:

• Population variance: \(\sigma^2 = \dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}\), where \(x_i\) is the \(i\)th data value from the population, \(\mu\) is the mean of the population, and \(N\) is the size of the population.
• Sample variance: \(s^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\), where \(x_i\) is the \(i\)th data value from the sample, \(\bar{x}\) is the mean of the sample, and \(n\) is the size of the sample.
• Population standard deviation: \(\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}\)
• Sample standard deviation: \(s = \sqrt{s^2} = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\)
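As an illustration, the sample formulas above can be applied to the ages of example 2 in a few lines of Python (the population versions would divide by n instead of n − 1):

```python
import math
import statistics

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

n = len(ages)
mean = sum(ages) / n                                            # sample mean (21.75)
sample_variance = sum((x - mean) ** 2 for x in ages) / (n - 1)  # divides by n - 1
sample_std = math.sqrt(sample_variance)

print(sample_variance, sample_std)
# The standard library gives the same results:
print(statistics.variance(ages), statistics.stdev(ages))
```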

Charts and Graphs



Data can be summarized visually using charts and/or graphs. These displays are organized to give a big picture
of the data at a glance and to allow zooming in on particular results. Depending
on the data type, suitable graphs include pie charts, bar charts, time charts, histograms, and boxplots.
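As an illustration, the sketch below draws a bar chart for example 1 and a histogram for example 3. It assumes the matplotlib library is available (it is not used elsewhere in this chapter); note also that matplotlib closes class intervals on the left, whereas the table of example 3 closes them on the right, so boundary values such as 1.55 may fall in a different class.

```python
import matplotlib.pyplot as plt

# Example 1: favorite colors (categorical variable) and their frequencies
colors = {"Blue": 4, "Red": 2, "White": 2, "Green": 1, "Black": 1}

# Example 3: heights (continuous variable)
heights = [1.58, 1.56, 1.77, 1.59, 1.63, 1.58, 1.82, 1.69, 1.76, 1.60,
           1.73, 1.51, 1.54, 1.61, 1.67, 1.72, 1.75, 1.55, 1.68, 1.65]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(list(colors.keys()), list(colors.values()))        # bar chart of frequencies
ax1.set_title("Favorite color (frequencies)")

bins = [1.50, 1.55, 1.60, 1.65, 1.70, 1.75, 1.80, 1.85]    # class limits from example 3
ax2.hist(heights, bins=bins, edgecolor="black")             # histogram
ax2.set_title("Height (histogram)")

plt.tight_layout()
plt.show()
```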

Pie Charts
A pie chart (or circle chart) is a circular graphic in which each category is represented by a slice of the pie. The
area of each slice is proportional to the percentage of responses in the category, and the sum of all slices of the
pie should be 100%, or close to it (with a bit of round-off error). The pie chart is used with categorical
variables or discrete numerical variables.
Figure 2 represents example 1 above.

[Pie chart of the favorite colors in example 1: Blue 40%, Red 20%, White 20%, Green 10%, Black 10%]

Figure 2 Pie chart example

Bar Charts
A bar chart (or bar graph) is a chart that presents grouped data with rectangular bars with lengths
proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical
bar chart is sometimes called a column bar chart. In general, the x-axis represents categorical variables or
discrete numerical variables.
Figure 3 and Figure 4 represent the example 1 above.
[Bar charts of the favorite colors in example 1, one with frequency counts and one with relative frequencies, with the categories Black, Blue, Green, Red, and White on the x-axis]

Figure 3 Bar graph example (with frequencies)
Figure 4 Bar graph example (with relative frequencies)

Time Charts
A time chart is a data display whose main point is to examine trends over time. Another name for a time
chart is a line graph. Typically a time chart has some unit of time on the horizontal axis (year, day, month,
and so on) and a measured quantity on the vertical axis (average household income, birth rate, total sales,
or others). At each time’s period, the amount is shown as a dot, and the dots are connected to form the
time chart (Rumsey, 2010).
Figure 5 is an example of a time chart. It represents the number of accidents, for instance, in a small city
along some years.
[Time chart of the number of accidents per year, from 2010 to 2015]

Figure 5 Time Chart example

Histogram
A histogram is a graphical representation of numerical data distribution. It is an estimate of the
probability distribution of a continuous quantitative variable. Because the data is numerical, the categories
are ordered from smallest to largest (as opposed to categorical data, such as gender, which has no inherent
order to it). To be sure each number falls into exactly one group, the bars on a histogram touch each other
but don’t overlap (Rumsey, 2010). The height of a bar in a histogram may represent either frequency or a
percentage (Peers, 2006).
Figure 6 shows the histogram for example 3 above.
[Histogram of the heights in example 3, with class intervals from 1.50 to 1.85]

Figure 6 Histogram example

Boxplot
A boxplot or box plot is a convenient way of graphically depicting groups of numerical data. It is a one-
dimensional graph of numerical data based on the five-number summary, which includes the minimum
value, the 25th percentile (also known as Q1), the median, the 75th percentile (Q3), and the maximum
value. In essence, these five descriptive statistics divide the data set into four equal parts (Rumsey, 2010).

Some statistical software adds asterisk signs (∗) or circle signs (ο) to show numbers in the data set that are
considered to be, respectively, outliers or suspected outliers — numbers determined to be far enough
away from the rest of the data. There are two types of outliers:

• Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first
quartile.
• Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above
the third quartile or 1.5×IQR or more below the first quartile.
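These rules translate directly into code, as shown in the sketch below. It is a minimal illustration that assumes the quartiles have already been computed (for instance with the percentile rule given earlier); different quartile definitions may flag slightly different points.

```python
def classify_outliers(data, q1, q3):
    """Label each value as 'outlier', 'suspected outlier' or 'normal' using the IQR rules."""
    iqr = q3 - q1
    labels = []
    for x in data:
        if x >= q3 + 3 * iqr or x <= q1 - 3 * iqr:
            labels.append((x, "outlier"))
        elif x >= q3 + 1.5 * iqr or x <= q1 - 1.5 * iqr:
            labels.append((x, "suspected outlier"))
        else:
            labels.append((x, "normal"))
    return labels

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

# Using the quartiles computed for example 2 (Q1 = 20, Q3 = 23): no value is flagged
print(classify_outliers(ages, q1=20, q3=23))
```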

Figure 7 shows a boxplot with its elements annotated.

[Annotated boxplot, from top to bottom:
- Outlier: values greater than Q3 + 3 × IQR (or lower than Q1 − 3 × IQR, for low outliers)
- Suspected outlier: values greater than Q3 + 1.5 × IQR (or lower than Q1 − 1.5 × IQR, for low suspected outliers)
- Largest value that is not an outlier
- 3rd quartile (75th percentile)
- 2nd quartile (50th percentile or median)
- 1st quartile (25th percentile)
- Minimum (or lowest value that is not an outlier, if there are low outliers or suspected outliers)]

Figure 7 Boxplot

STATISTICAL INFERENCE
Statistical inference is the process of drawing conclusions about populations or scientific truths from data.
This process is divided into two areas: estimation theory and decision theory. The objective of estimation
theory is to estimate the values of the theoretical population's parameters from the sample statistics. The
purpose of decision theory is to make decisions, with the use of hypothesis tests on the population
parameters, supported by a concrete measure of the degree of certainty/uncertainty regarding the decision
that was taken (Marôco, 2011).

Inference Distribution Functions (Most Frequent)


The statistical inference process requires that the probability density function (the function that describes the
probability behavior of the observations in the sample) is known, that is, that the sampling distribution can be estimated.
Thus, a common procedure in statistical analysis is to test whether the observations in the sample are
properly fitted by a theoretical distribution. Several statistical tests (e.g., the Kolmogorov-Smirnov test or
the Shapiro-Wilk test) can be used to check whether the sample fits a particular theoretical
distribution. The following distributions are some of the probability density functions most commonly used in
statistical analysis.

Normal distribution
The normal distribution or Gaussian distribution is the most important probability density function in
statistical inference. The requirement that the sampling distribution be normal is one of the demands of
some frequently used statistical methodologies, called parametric methods (Marôco, 2011). A random
variable \(X\) with a normal distribution of mean \(\mu\) and standard deviation \(\sigma\) is written as \(X \sim N(\mu, \sigma)\). The
probability density function (PDF) of this variable is given by:

\[
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad -\infty < x < +\infty
\]

The expected value of \(X\) is \(E(X) = \mu\), and the variance is \(V(X) = \sigma^2\). When \(\mu = 0\) and \(\sigma = 1\), the
distribution is called the standard normal distribution and is typically written as \(Z \sim N(0, 1)\). The letter phi
(\(\varphi\)) is used to denote the standard normal PDF, given by:

\[
\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \qquad -\infty < z < +\infty
\]

The normal distribution graph has a bell-shaped curve (one of the names of the normal distribution is the bell curve)
and is completely determined by the mean and standard deviation of the sample. Figure 8 shows an
\(N(0, 1)\) distribution.
[Plot of the standard normal density: a symmetric bell-shaped curve centered at 0]

Figure 8 Normal distribution

Although there are many normal curves, they all share an important property that allows us to treat them
in a uniform fashion. Thus, all normal density curves satisfy the following property, which is often
referred to as the Empirical Rule.

Range                  Proportion
\(\mu \pm 1\sigma\)    68.3%
\(\mu \pm 2\sigma\)    95.5%
\(\mu \pm 3\sigma\)    99.7%

Thus, for a normal distribution, almost all values lie within three standard deviations of the mean.
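For reference, these proportions can be reproduced numerically. The sketch below assumes scipy is available; the same values could equally be obtained from statistics.NormalDist in the Python standard library.

```python
from scipy.stats import norm

# Probability that a normal variable lies within k standard deviations of its mean
for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(f"mu +/- {k} sigma: {proportion:.3f}")
# Prints approximately 0.683, 0.954 and 0.997
# (the 68.3%, 95.5% and 99.7% of the table above, up to rounding)
```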

Chi-Square distribution
A random variable \(X\) obtained as the sum of squares of \(n\) random variables \(Z_i \sim N(0, 1)\) has a chi-square
distribution with \(n\) degrees of freedom, denoted \(X \sim \chi^2(n)\). The probability density function
(PDF) of this variable is given by (Kerns, 2010):

\[
f_X(x) = \frac{1}{2^{n/2}\,\Gamma\!\left(\frac{n}{2}\right)}\, x^{\frac{n}{2}-1}\, e^{-\frac{x}{2}}
\]

where \(\Gamma(u) = \int_0^{+\infty} x^{u-1} e^{-x}\, dx\), with \(n > 0\) and \(x > 0\). Figure 9 shows an example of a chi-square distribution.

[Plot of a chi-square density: right-skewed, defined for x > 0]

Figure 9 Chi-square distribution example

The expected value of \(X\) is \(E(X) = n\) and the variance is \(V(X) = 2n\). As noted above, the \(\chi^2\) distribution
is the sum of squares of \(n\) variables \(N(0, 1)\). Thus, the central limit theorem (see the central limit
theorem section) also ensures that the \(\chi^2\) distribution approaches the normal distribution for high values of \(n\).

Student's t-distribution
Student’s t-distribution is a probability distribution that is used to estimate population parameters when
the sample size is small and/or when the population variance is unknown.
A random variable \(X = \dfrac{Z}{\sqrt{Y/n}}\) has a Student's t-distribution with \(n\) degrees of freedom if \(Z \sim N(0, 1)\) and
\(Y \sim \chi^2(n)\) are independent variables. The probability density function (PDF) of this variable is given by
(Kerns, 2010):

𝑛+1 !
!
!!!
𝜏 𝑥! !
𝑓! 𝑥 = 2   ∙   1 + , −∞   < 𝑥 < +∞
𝑛 𝑛
𝑛𝜋   ∙  𝜏
2
!!
where 𝜏 u = ! x !!!   ∙   e!! ∙ 𝑑𝑋 and 𝑛 > 0. When 𝑛 increases, this distribution approximates to the
centered reduced normal distribution (𝑁  (0, 1)). Figure 10 shows an example of a student’s t-distribution:
[Plot of a Student's t density: a symmetric bell-shaped curve centered at 0]

Figure 10 Student's t-distribution example

Like the standard normal distribution, the Student's t-distribution has expected value \(E(X) = 0\) and
variance \(V(X) = \dfrac{n}{n-2}\), for \(n > 2\).

Snedecor’s F-distribution
Snedecor's F-distribution is a continuous statistical distribution which arises in testing whether two
observed samples have the same variance. A random variable \(X = \dfrac{Y_1/m}{Y_2/n}\), where \(Y_1 \sim \chi^2(m)\) and \(Y_2 \sim \chi^2(n)\) are independent,
has a Snedecor's F-distribution with \(m\) and \(n\) degrees of freedom, \(X \sim F(m, n)\). The probability density
function (PDF) of this variable is given by (Kerns, 2010):

𝑚+𝑛 ! !!!
𝜏 𝑚 ! 𝑚 !
𝑓! 𝑥 = 𝑚 2 𝑛   ∙  
! !
  ∙ 𝑥 ! !!   ∙   1+ 𝑥 ,𝑥 > 0
𝜏   ∙  𝜏 𝑛 𝑛
2 2
!! !!!
where 𝜏 u = !
x   ∙   e!! ∙ 𝑑𝑋 and 𝑚 > 2 and 𝑛 > 4. Figure 11 shows an example of a Snedecor’s
F-distribution.
[Plot of a Snedecor's F density: right-skewed, defined for x > 0]

Figure 11 Snedecor's F-distribution example

The expected value of \(X\) is \(E(X) = \dfrac{n}{n-2}\), with \(n > 2\), and the variance is \(V(X) = \dfrac{2n^2\,(m+n-2)}{m\,(n-2)^2\,(n-4)}\), with \(n > 4\).

Binomial distribution
The binomial distribution is the discrete distribution most used in statistical inference to test hypotheses
concerning proportions of dichotomous nominal variables (true vs. false, exists vs. does not exist). It gives
the probability of obtaining exactly \(x\) successes in \(n\) Bernoulli trials (where each Bernoulli trial is a success
with probability \(p\) and a failure with probability \(q = 1 - p\)). The binomial distribution
for the variable \(X\) has parameters \(n\) and \(p\) and is denoted as \(X \sim B(n, p)\). The probability mass function
(PMF) of this variable is given by:

\[
f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n
\]

Figure 12 shows an example of a binomial distribution.

[Plot of a binomial probability mass function over the values 0 to 10]

Figure 12 Binomial distribution example

The expected value of the variable \(X\) is \(E(X) = n \cdot p\), and the variance is \(V(X) = n \cdot p \cdot q\). As with the chi-square
distribution or Student's t-distribution, the central limit theorem ensures that the binomial
distribution is approximated by the normal distribution when \(n\) and \(p\) are sufficiently large (\(n > 20\) and
\(np > 7\); Marôco, 2011).
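As a brief illustration of how these densities can be evaluated in practice, the sketch below uses scipy (assumed to be available); the parameter values are arbitrary choices for this example and are not taken from the chapter.

```python
from scipy.stats import binom, chi2, t, f

n, p = 10, 0.3
print(binom.pmf(3, n, p))                  # P(X = 3) for X ~ B(10, 0.3)
print(binom.mean(n, p), binom.var(n, p))   # n*p = 3.0 and n*p*q = 2.1

print(chi2.mean(5), chi2.var(5))           # chi-square: E(X) = n = 5, V(X) = 2n = 10
print(t.var(10))                           # Student's t: V(X) = n/(n-2) = 1.25
print(f.mean(5, 10))                       # Snedecor's F: E(X) = n/(n-2) = 1.25
```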

Sampling distribution
To perform statistical inference (confidence interval estimation or hypothesis testing), it is necessary to know
the distributional properties of the sample, from which it is intended to infer about the theoretical population
(Marôco, 2011). In the examples given so far, a population was specified, and the
sampling distribution of the mean and the range were determined. In practice, the process proceeds the
other way: the sample data is collected, and from these data, the parameters of the sampling distribution
are estimated. The mean of a representative sample provides an estimate of the unknown population
mean, but intuitively we know that if we took multiple samples from the same population, the estimates
would vary from one another. We could, in fact, sample over and over from the same population and
compute a mean for each of the samples. In essence, all these sample means constitute yet another
"population", and we could graphically display the frequency distribution of the sample means. This is
referred to as the sampling distribution of the sample means.
Some of the sampling distributions commonly used in statistical inference process are presented in the
table below (Marôco, 2011).

Statistic: sample mean \(\bar{X}\)

• \(\bar{X} \sim N\!\left(\mu, \dfrac{\sigma}{\sqrt{n}}\right)\), if the sampling is with replacement or the population is very large (\(n/N \le 0.05\)).
• \(\bar{X} \sim N\!\left(\mu, \dfrac{\sigma}{\sqrt{n}} \times \sqrt{\dfrac{N-n}{N-1}}\right)\), if the sampling is without replacement or the population is small.
• \(\dfrac{\bar{X} - \mu}{S'/\sqrt{n}} \sim t(n-1)\), if the population standard deviation is unknown.

Statistic: sample variance \(S'^2\)

• \(\dfrac{(n-1)\,S'^2}{\sigma^2} \sim \chi^2(n-1)\), if the variable has a normal distribution.

Statistic: ratio of sample variances \(S'^2_1 / S'^2_2\)

• \(\dfrac{S'^2_1}{S'^2_2} \sim F(n_1 - 1, n_2 - 1)\), if the variances have \(\chi^2\) distributions.

Statistic: sample proportion \(\hat{P}\)

• \(\hat{P} \sim B(n, p)\), for small samples.
• \(\dfrac{\hat{P} - p}{\sqrt{\dfrac{p(1-p)}{n}}} \sim N(0, 1)\), for large samples (with \(n > 20\) and \(np > 5\), where \(p\) is the population proportion).

(Marôco, 2011)

The sample mean is one of the most relevant statistics for both estimation theory and decision theory.

Central limit theorem


The central limit theorem states that, if the population has mean \(\mu\) and standard deviation \(\sigma\), and sufficiently
large random samples are taken from the population with replacement, then the distribution of the sample means
will be approximately normal. This holds true regardless of whether the source
population is normal or skewed, provided the sample size is sufficiently large (usually \(n > 30\)). If the
population is normal, then the theorem holds true even for samples smaller than 30. In fact, it also holds
true when the population is binomial, provided that \(\min(np, n(1-p)) > 5\), where \(n\) is the sample size
and \(p\) is the probability of success in the population. This means that it is possible to use the normal
probability model to quantify uncertainty when making inferences about a population mean based on the
sample mean.
This theorem is particularly useful to justify the use of parametric methods for large samples.
When it is not possible to assume that the distribution of the sample mean is normal, particularly when the
sample size does not allow the application of the central limit theorem, it is necessary to resort to methods
that do not require, in principle, any assumption about the form of the sampling distribution. These
methods are referred to generically as nonparametric methods.
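A quick simulation illustrates the theorem: even for a markedly skewed population, the distribution of sample means looks approximately normal once the sample size is large enough. The sketch below assumes numpy is available; the exponential population and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# A skewed (exponential) population: clearly not normal
population = rng.exponential(scale=2.0, size=100_000)

sample_size = 50        # "sufficiently large" (> 30)
n_samples = 10_000

# Draw many samples with replacement and keep each sample's mean
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=True).mean()
    for _ in range(n_samples)
])

print(population.mean(), sample_means.mean())    # both close to the population mean (about 2)
print(population.std() / np.sqrt(sample_size))   # theoretical sigma / sqrt(n)
print(sample_means.std())                        # close to the value above
```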

Hypothesis tests
A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis tests refer to the formal procedures used by statisticians to accept or reject a statistical
hypothesis.
The best way to determine whether a statistical hypothesis is true would be to examine the entire
population. Since that is often impractical, statistical tests are used to determine whether there is enough
evidence in a sample of data to infer that a particular condition is true for the entire population. If sample
data are not consistent with the statistical hypothesis, the hypothesis is rejected.
Hypothesis tests examine two opposing hypotheses about a population: the null hypothesis and the
alternative hypothesis.
The null hypothesis, denoted by H0, is the statement being tested. Usually, the null hypothesis is a
statement of the absence of an effect, or of no effect at all, and is the less compromising claim. The alternative
hypothesis, denoted by H1, is the hypothesis that the sample observations are influenced by some non-random cause.
H0 should only be rejected if there is enough evidence, for a given probability of error or a certain
level of confidence, suggesting that H0 is in fact not valid.

A hypothesis test can have one of two outcomes: the null hypothesis is accepted, or the null hypothesis is
rejected. Many statisticians, however, take issue with the notion of "accepting the null hypothesis".
Instead, they say: you reject the null hypothesis, or you fail to reject the null hypothesis. The distinction
between "acceptance" and "failure to reject" is crucial. Whilst acceptance implies that the null hypothesis
is true, failure to reject means only that the data are not sufficiently persuasive to prefer the alternative
hypothesis over the null hypothesis.

A hypothesis test is developed in the following steps:


• State the hypotheses. This involves stating the null and alternative hypotheses. The hypotheses
are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be
false.
• Formulate an analysis plan. The analysis plan describes how to use sample data to evaluate the
null hypothesis. The evaluation often focuses on a single test statistic.
• Analyze sample data. Find the value of the test statistic (mean score, proportion, t-score, z-score,
etc.) described in the analysis plan.
• Interpret results. Apply the decision rule described in the analysis plan. If the value of the test
statistic is unlikely, based on the null hypothesis, reject the null hypothesis.
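As an illustration of these four steps, the sketch below runs a one-sample t-test on the ages of example 2, assuming scipy is available; the hypothesized mean of 21 and the 0.05 significance level are arbitrary choices for this example, not values taken from the chapter.

```python
from scipy import stats

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

# 1. State the hypotheses: H0: mu = 21  vs  H1: mu != 21 (two-tailed test)
# 2. Formulate an analysis plan: one-sample t-test with significance level alpha = 0.05
alpha = 0.05

# 3. Analyze sample data: compute the test statistic and its p-value
t_statistic, p_value = stats.ttest_1samp(ages, popmean=21)
print(t_statistic, p_value)

# 4. Interpret results: reject H0 if the p-value is below alpha
if p_value < alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```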

When considering whether the null hypothesis should be rejected and the alternative hypothesis accepted, it is
necessary to consider the direction of the alternative hypothesis statement. The test could be a one-tailed test or a
two-tailed test.
A one-tailed test is a statistical test in which the critical area of the distribution is one-sided so that it is
either greater than or less than a particular value, but not both. If the sample that is being tested falls into
the one-sided critical area, the alternative hypothesis will be accepted instead of the null hypothesis. The
one-tailed test gets its name from checking the area under one of the tails (sides) of a normal distribution,
although the test can be used in other non-normal distributions as well.
For example, suppose the null hypothesis states that the mean is less than or equal to 10. The alternative
hypothesis would be that the mean is greater than 10. The region of rejection would consist of a range of
numbers located on the right side of sampling distribution; that is, a set of numbers greater than 10. This
represents the implementation of a one-tailed test.
A two-tailed test is a statistical test in which the critical area of the distribution is two-sided and tests
whether a sample is either greater than or less than a specified range of values. If the sample that is being
tested falls into either of the critical areas, the alternative hypothesis will be accepted instead of the null
hypothesis. The two-tailed test gets its name from checking the area under both of the tails (sides) of a
normal distribution, although the test can be used in other non-normal distributions.
For example, suppose the null hypothesis states that the mean is equal to 10. The alternative hypothesis
would be that the mean is different from 10, i.e., less than 10 or greater than 10. The region of rejection
would consist of a range of numbers located on both sides of sampling distribution; that is, the region of
rejection would consist partly of numbers that were less than 10 and partly of numbers that were greater
than 10.

Decision rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice, statisticians
describe these decision rules in two ways: in terms of a p-value or in terms of a region of acceptance.

p-value and statistical errors


The p-value is the probability of observing a value of the test statistic as extreme as, or more extreme than,
the test statistic actually computed from the sample. Depending on the distribution associated with
the hypothesis test, the p-value is calculated as follows:

• For a one-tailed test, the p-value is the area to the right (right-tailed test) or left (left-tailed test) of
the test statistic.
• For a two-tailed test, the p-value is two times the area to the right of a positive test statistic or the
left of a negative test statistic.

To make a decision about rejecting or not rejecting H0, it is necessary to determine the cutoff probability
for the p-value before doing the hypothesis test; this cutoff is called the alpha level (α). Typical values for α
are 0.05 or 0.01.
When the p-value (instead of the test statistic) is used in the decision rule, the rule becomes: if the p-value is
less than α (the level of significance), reject H0 and accept H1; otherwise, fail to reject H0.
However, incorrect interpretations of p-values are very common. The most common mistake is to
interpret a p-value as the probability of making an error by rejecting a true null hypothesis (called a type I
error).
There are several reasons why p-values can’t be the error rate.
First, p-values are calculated based on the assumptions that the null is true for the population and that the
difference in the sample is caused entirely by random chance. Consequently, p-values can’t tell the
probability that the null hypothesis is true or false because it is 100% true from the perspective of the
calculations.
Second, while a small p-value indicates that the data are unlikely assuming a true null, it cannot evaluate
which of two competing cases is more likely: 1) the null is true, but the sample was unusual; or 2) the
null is false. Determining which case is more likely requires subject area knowledge and replicate studies.

For example, suppose that a vaccine study produced a p-value of 0.04. The correct way to interpret this
value is: assuming that the vaccine had no effect, the observed difference or a larger one would be obtained in 4%
of studies due to random sampling error. An incorrect way to interpret it is: if the null hypothesis is rejected,
there is a 4% chance that a mistake is being made.

Types of errors
The point of a hypothesis test is to make the correct decision about H0. Unfortunately, hypothesis testing
is not a simple matter of being right or wrong. No hypothesis test is 100% certain because the hypothesis
test is based on probability, so there is always a chance that an error has been made. Two types of errors
are possible: type I and type II. The risks of these two errors are inversely related and determined by the
significance level and the power for the test.
The following table shows the four possible situations:

Null hypothesis   Decision: fail to reject                                                       Decision: reject
True              Correct decision (probability = 1 − α)                                         Type I error: rejecting the null when it is true (probability = α)
False             Type II error: failing to reject the null when it is false (probability = β)   Correct decision (probability = 1 − β)

Type I error
When the null hypothesis is true and it is rejected, a type I error occurs. The probability of making a type I
error is α, which is the significance level set for the hypothesis test. An α of 0.05 indicates a willingness to
accept a 5% chance of being wrong when rejecting the null hypothesis. To reduce this risk, a
lower value for α should be used. However, with a lower value for alpha, it will be less likely that a
true difference is detected if one exists.

Type II error
When the null hypothesis is false and we fail to reject it, a type II error occurs. The probability of
making a type II error is β, which depends on the power of the test. The risk of
committing a type II error can be decreased by ensuring that the test has enough power, for example by
making the sample size large enough to detect a practical difference when one truly exists.
The probability of rejecting the null hypothesis when it is false is equal to 1 − β. This value is the power of
the test.

The following example helps to understand the interrelationship between type I and type II errors, and to
determine which error has more severe consequences in each situation. If there is interest in comparing
the effectiveness of two medications, the null and alternative hypotheses are:

Null hypothesis (H0): µ1 = µ2 (the two medications have equal effectiveness).

Alternative hypothesis (H1): µ1 ≠ µ2 (the two medications do not have equal effectiveness).

A type I error occurs if the null hypothesis is rejected, i.e., if it is possible to conclude that the two
medications are different when, in fact, they are not. If the medications have the same effectiveness, this
error may not be considered too severe because the patients still benefit from the same level of
effectiveness regardless of which medicine they take.
However, if a type II error occurs, the null hypothesis is not rejected when it should be rejected. That is, it
is possible to conclude that the medications have the same effectiveness when, in fact, they are different.
This error is potentially life-threatening if the less-effective drug is sold to the public instead of the more
effective one.
When the hypothesis tests are conducted, consider the risks of making type I and type II errors. If the
consequences of making one type of error are more severe or costly than making the other type of error,
then choose a level of significance and power for the test that will reflect the relative severity of those
consequences.

Acceptance region vs. Rejection region


The acceptance region is a range of values. If the test statistic falls within the region of acceptance, the
null hypothesis is not rejected. The acceptance region is defined so that the chance of making a type I
error is equal to the significance level.
The set of values outside the acceptance region is called the rejection region. If the test statistic falls
within the rejection region, the null hypothesis is rejected. The rejection region is also known as the
critical region. The value(s) that separates the critical region from the acceptance region is called the
critical value(s).
In such cases, we say that the hypothesis has been rejected at the α level of significance.

Confidence intervals
A confidence interval is an estimated range of a parameter of a population. Instead of estimating the
parameter by a single value, it is given a range of probable estimates.
Confidence intervals are used to indicate the reliability of an estimate. For example, a confidence interval
can be used to describe how trustworthy the results of a survey are. All else being equal, a survey
that results in a small confidence interval is more reliable than one that results in a wider confidence
interval. These intervals are usually calculated so that the confidence level is 95%, but 90%,
99%, or 99.9% (or other) confidence intervals for the unknown parameter can also be produced.
The width of the confidence interval gives some idea of how uncertain we are about the unknown
parameter. A very wide interval may indicate that more data should be collected before anything very
definite can be said about the parameter.

Confidence intervals are more informative than the simple results of hypothesis tests (where we decide
"reject H0" or "don't reject H0") since they provide a range of plausible values for the unknown parameter.
Confidence limits are the lower and upper boundaries/values of a confidence interval, that is, the values
that define the range of a confidence interval.
The upper and lower bounds of a 95% confidence interval are the 95% confidence limits. These limits
may be taken for other confidence levels, for example, 90%, 99%, and 99.9%.
The confidence level is the probability value 1 − 𝛼 associated with a confidence interval.
It is often expressed as a percentage. For example, say 𝛼 = 0.05 = 5%, then the confidence level is equal
to 1 − 0.05 = 0.95, i.e. a 95% confidence level. For example, suppose an opinion poll predicted that, if
the election were held today, the Conservative party would win 60% of the vote. The pollster might attach
a 95% confidence level to the interval 60% plus or minus 3%. That is, he thinks it very likely that the
Conservative party would get between 57% and 63% of the total vote.
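As a minimal sketch of how such an interval can be computed for a population mean with unknown standard deviation, the code below builds a t-based confidence interval for the ages of example 2 (assuming scipy is available; the 95% level is just the conventional choice):

```python
import math
import statistics
from scipy import stats

ages = [20, 22, 21, 24, 21, 20, 20, 24, 22, 20,
        22, 24, 21, 25, 20, 23, 22, 23, 21, 20]

n = len(ages)
mean = statistics.mean(ages)
std_err = statistics.stdev(ages) / math.sqrt(n)

confidence = 0.95
t_critical = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # two-tailed critical value

lower = mean - t_critical * std_err
upper = mean + t_critical * std_err
print(f"{confidence:.0%} confidence interval for the mean: [{lower:.2f}, {upper:.2f}]")
```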

Summarizing:

- A p-value is the probability of obtaining an effect as large as or larger than the observed effect, assuming
the null hypothesis is true
• Provides a measure of the strength of evidence against H0
• Does not provide information on the magnitude of the effect
• Is affected by sample size and by the magnitude of the effect: interpret with caution!
• Cannot be used in isolation to inform clinical judgment

- A confidence interval quantifies

• How confident we are about the true value in the source population
• Better precision with large sample size
• Corresponds to hypothesis testing, but is much more informative than a p-value

- Keep clinical importance in mind when interpreting statistical significance!

Parametric and non-parametric tests


During the process of statistical inference, there is often the question about the best hypothesis test for
data analysis. In statistics, the test with higher power (1 − 𝛽) is considered the most appropriate and more
robust to violations of assumptions or application conditions.
Hypothesis tests are categorized into two major groups: parametric tests and non-parametric tests.
Parametric tests use more information than non-parametric tests and are, therefore, more powerful.
However, if a parametric test is wrongly used with data that do not satisfy the required assumptions, it
may indicate significant differences when in truth there are none.
Alternatively, non-parametric tests use less information and are, therefore, more conservative than
their parametric alternatives. This means that if the reader uses a non-parametric test on data
that satisfy the assumptions of a parametric test, power decreases, i.e., the reader is less
likely to get a significant result when, in reality, one exists (a significant relationship, a significant difference,
or other).


CONCLUSION
This chapter presents the main concepts used in statistical analysis. Without them, it would be difficult for
the reader to understand the additional analyses carried out in the course of this book.
The reader should now be able to recognize these concepts, their meaning, and when they should be
applied.
The theoretical concepts presented in this chapter are:
• Variable, population, and sample
• Mean, median, mode, standard deviation, quartile, and percentile
• Statistical distributions
o Normal distribution
o Chi-square distribution
o Student’s t-distribution
o Snedecor’s F-distribution
o Binomial distribution
• Central limit theorem
• Decision rules: p-value, error, confidence interval and tests.

REFERENCES

Kerns, G. J. (2010). Introduction to Probability and Statistics Using R (1st ed.). Lulu.com.

Marôco, J. (2011). Análise Estatística com o SPSS Statistics (5th ed.). Pero Pinheiro: Report Number, pp. 7-61.

McCune, S. (2010). Practice Makes Perfect: Statistics (1st ed.). United States: McGraw-Hill.

Peers, I. (2006). Statistical Analysis for Education and Psychology Researchers: Tools for Researchers in Education and Psychology. Routledge.

Rumsey, D. (2010). Statistics Essentials For Dummies. New Jersey: Wiley Publishing, Inc.
