Descriptive and Inferential Statistics Course Pack


Contents

Chapter 01 - Basic Statistical Concepts
Lesson 1 - Introduction
Lesson 2 - Basic Statistical Concepts
Chapter Summative Test

Chapter 02 - Measures of Central Tendency
Lesson 1 - Introduction
Lesson 2 - The Measures of Central Tendency for Ungrouped Data
  2.1 - The Mean
  2.2 - The Median
  2.3 - The Mode
Lesson 3 - Measures of Central Tendency for Grouped Data
  3.1 - The Grouped Mean
  3.2 - The Grouped Median
  3.3 - The Grouped Mode
Check your Progress
Chapter Summative Test

Chapter 03 - Measures of Dispersion (Variation)
Lesson 1 - Introduction
Lesson 2 - The Measures of Dispersion for Ungrouped Data
  2.1 - The Range
  2.2 - The Interquartile Range
  2.3 - The Standard Deviation
  2.4 - The Variance
Check your Progress
Lesson 3 - The Measures of Dispersion for Grouped Data
  3.1 - The Grouped Range
  3.2 - The Grouped Interquartile Range
  3.3 - The Grouped Standard Deviation

References

DESCRIPTIVE AND INFERENTIAL
STATISTICS:
(A compilation of Topics)

Frances Jay B. Pacaldo, LPT, MA.Ed- Teaching Mathematics


Cebu Technological University
Descriptive and Inferential Statistics
Chapter 01 - Basic Statistical Concepts
Frances Jay B. Pacaldo, LPT., MA.ED-Teaching Mathematics

Introduction
Statistics is the science of collecting, analyzing, interpreting, and presenting data. Statistical methods are
used in a wide range of fields, including science, business, healthcare, social sciences, and many others. In
this introduction, we will explore some of the basic statistical concepts that are fundamental to understand-
ing and analyzing data.

One of the primary tasks of statistics is to describe and summarize data using descriptive statistics.
Descriptive statistics include measures of central tendency (such as mean, median, and mode), measures of
variability (such as standard deviation and range), and graphical displays of the distribution (such as
histograms and box plots).

Another important aspect of statistics is making inferences about a population based on a sample of data.
This involves using inferential statistics, which includes methods such as hypothesis testing and confidence
intervals. These methods allow us to draw conclusions about a larger population based on a smaller sample.

Sampling is the process of selecting a subset of individuals or units from a larger population to represent
the population as a whole. Statistical methods are used to ensure that the sample is representative of the
population and to estimate population parameters based on the sample data.

Correlation is another important statistical concept that refers to the degree to which two variables are
related to each other. Correlation can be positive, negative, or zero and is used to describe the relationship
between two variables.

In summary, understanding basic statistical concepts is essential for interpreting and analyzing data in a
wide range of fields. Whether you are working in science, business, healthcare, or any other field, knowledge
of statistics can help you make informed decisions based on data.

Basic Statistical Concepts


• A population is any specific collection of objects of interest.
• A sample is any subset or subcollection of the population. When the sample consists of the whole
population, it is termed a census.
• A measurement is a number or attribute computed for each member of a population or of a sample.
• Sample data are the measurements of the sample elements.
• A parameter is a number that summarizes some aspect of the population as a whole.
• Statistics is a collection of methods for collecting, displaying, analyzing, and drawing conclusions
from data.
• Areas of Statistics: Descriptive statistics is the branch of statistics that involves organizing,
displaying, and describing data.
• Areas of Statistics: Inferential statistics is the branch of statistics that involves drawing conclusions
about a population based on information contained in a sample taken from that population.
• The measurement made on each element of a sample need not be numerical. In the case of automobiles,
what is noted about each car could be its color, its make, its body type, and so on. Such data are
categorical or qualitative, as opposed to numerical or quantitative data such as value or age. This is
a general distinction of data.
• Qualitative data are measurements for which there is no natural numerical scale, but which consist
of attributes, labels, or other non-numerical characteristics.
• Quantitative data are numerical measurements that arise from a natural numerical scale.
• Relation between a Population and a Sample: The relationship between a population of interest
and a sample drawn from that population is perhaps the most important concept in statistics, since
everything else rests on it. This relationship is illustrated graphically in the Figure below. The circles
in the large box represent elements of the population. The solid black circles represent the elements
of the population that are selected at random and that together form the sample. For each element
of the sample there is a measurement of interest, denoted by a lowercase x (which we have indexed as
$x_1, \ldots, x_n$ to tell them apart).

• A probability sampling method is a sampling method that uses randomization to choose survey
participants.

– A simple random sample is a sample in which each member of the population has an equal chance
of being included and in which the selection of one member is independent of the selection of
all other members.

[Example: You want to select a simple random sample of 100 employees of Company X. You
assign a number to every employee in the company database from 1 to 1000, and use a random
number generator to select 100 numbers.]
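
The draw in this example can be sketched in Python. This is only an illustration: the ID range mirrors the example, while the fixed seed is an assumption added so that the draw can be repeated.

```python
import random

# Hypothetical employee IDs 1..1000, as in the Company X example.
employee_ids = list(range(1, 1001))

# A fixed seed is an assumption made here so the result is reproducible.
rng = random.Random(42)

# random.sample draws without replacement, so no employee is picked twice.
sample_ids = rng.sample(employee_ids, k=100)

print(len(sample_ids), "employees selected")
```

Every ID has the same chance of being selected, which is the defining property of a simple random sample.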

– Stratified sampling attempts to account for the demographics and traits of the larger population
by reproducing their proportions in the sample. For example, if you’re surveying
college history majors, and you already know that 40% of history majors are female and 60%
are male, you might want your sample to have the same proportions.

[Example: The company has 800 female employees and 200 male employees. You want to ensure
that the sample reflects the gender balance of the company, so you sort the population into two
strata based on gender. Then you use random sampling on each group, selecting 80 women and
20 men, which gives you a representative sample of 100 people.]
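
A minimal sketch of this proportional allocation in Python is shown below. The population records and the helper name `stratified_sample` are hypothetical, introduced only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) for a repeatable draw

# Hypothetical population: 800 women ("F") and 200 men ("M").
population = [("F", i) for i in range(800)] + [("M", i) for i in range(200)]

def stratified_sample(units, total, rng):
    """Draw a random sample from each stratum, proportional to its size."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[0], []).append(unit)  # group by stratum label
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding may need adjusting for other sizes.
        k = round(total * len(members) / len(units))
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_sample(population, 100, rng)
n_women = sum(1 for label, _ in sample if label == "F")
n_men = len(sample) - n_women
print(n_women, n_men)
```

With 800 women and 200 men, the allocation works out to 80 women and 20 men, matching the example.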

– Cluster sampling also involves dividing the population into subgroups, but each subgroup
should have similar characteristics to the whole population. Instead of sampling individuals from
each subgroup, you randomly select entire subgroups.

[Example: The company has offices in 10 cities across the country (all with roughly the same
number of employees in similar roles). You don’t have the capacity to travel to every office to
collect your data, so you use random sampling to select 3 offices – these are your clusters.]
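
The office selection can be sketched as follows. The office names and the figure of 100 employees per office are assumptions made for illustration; the example only says the offices are roughly the same size.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for a repeatable draw

# Hypothetical data: 10 offices with 100 employees each.
offices = {
    f"Office {c}": [f"emp-{c}-{i}" for i in range(100)]
    for c in range(1, 11)
}

# Randomly choose 3 whole offices; these are the clusters.
chosen_offices = rng.sample(sorted(offices), k=3)

# Everyone in a chosen office is included in the sample.
cluster_sample = [emp for office in chosen_offices for emp in offices[office]]

print(chosen_offices, len(cluster_sample))
```

Note the contrast with stratified sampling: here the randomization picks whole groups, not individuals within each group.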

– A Systematic sampling is a method that imitates many of the randomization benefits of sim-
ple random sampling, but is slightly easier to conduct. You can use systematic sampling with a
list of the entire population, like you would in simple random sampling. However, unlike with
simple random sampling, you can also use this method when you’re unable to access a list of
your population in advance.

[Example: You run a department store and are interested in how you can improve the store ex-
perience for your customers. To investigate this question, you ask an employee to stand by the
store entrance and survey every 20th visitor who leaves, every day for a week.]

Although you do not necessarily have a list of all your customers ahead of time, this method
should still provide you with a representative sample of your customers since their order of exit
is essentially random.
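
The "every 20th visitor" rule can be sketched with list slicing. The visitor list below is hypothetical; in the store example the visitors arrive one by one and no list exists in advance.

```python
# Hypothetical stream of 200 visitors, in order of exit.
visitors = [f"visitor-{i}" for i in range(1, 201)]

step = 20  # survey every 20th visitor

# Slicing from index step-1 picks the 20th, 40th, 60th, ... visitor.
surveyed = visitors[step - 1::step]

print(surveyed[:3], len(surveyed))
```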
• Nonprobability sampling methods do not use any randomization to select survey
participants. Therefore, population members do not have an equal chance of being included.
– Convenience sampling is a nonprobability sampling method that includes participants based on their
availability and accessibility. Essentially, it includes people who are easy to reach.

[Example: You are researching opinions about student support services in your university, so
after each of your classes, you ask your fellow students to complete a survey on the topic. This
is a convenient way to gather data, but as you only surveyed students taking the same classes as
you at the same level, the sample is not representative of all the students at your university]

– Voluntary response sampling is similar to convenience sampling in that it is mainly based on
ease of access. Instead of the researcher choosing
participants and directly contacting them, people volunteer themselves (e.g. by responding to a
public online survey). Voluntary response samples are always at least somewhat biased, as some
people will inherently be more likely to volunteer than others.

[Example: You send out the survey to all students at your university and a lot of students decide
to complete it. This can certainly give you some insight into the topic, but the people who re-
sponded are more likely to be those who have strong opinions about the student support services,
so you can’t be sure that their opinions are representative of all students]

– Snowball sampling relies on the first survey participants to refer you to the
next ones, and so on. Once you’ve found enough people to meet your required sample size, you
stop the survey.

[Example: You are researching experiences of homelessness in your city. Since there is no list
of all homeless people in the city, probability sampling isn’t possible. You meet one person who
agrees to participate in the research, and she puts you in contact with other homeless people that
she knows in the area.]

– Purposive sampling, also known as judgement sampling, involves the
researcher using their expertise to select a sample that is most useful to the purposes of the
research.

[Example: You want to know more about the opinions and experiences of disabled students at
your university, so you purposefully select a number of students with different support needs in
order to gather a varied range of data on their experiences with student services.]

– Quota sampling is similar to stratified sampling. The difference is that this method doesn’t
randomly select participants. As with stratified sampling, the researchers first define categories
they want to represent in their sample and choose appropriate proportions for each group. These
could be equal quotas, like 100 men and 100 women, or they could seek to replicate a target
population’s demographics. Instead of randomly selecting participants, the surveyors use
some form of convenience sampling. When they’ve hit the right quotas for each category, they
stop the survey.

TYPES OF VARIABLE
• A random variable is a variable that represents value(s) from a random sample. We will use letters
at the end of the alphabet, especially x, y and z, as random variables.
• An independent variable is a variable that is chosen, and then measured or manipulated,
by the researcher in order to study some observed behavior.
• A dependent variable is a variable whose value depends on the value of one or more
independent variables.
• A discrete variable is a variable that can take only separate, countable values (e.g. cards in a deck or
scores on an IQ test). Discrete variables can take either a finite or infinite set of values, although for
our purposes we usually consider discrete variables which only take a finite set of values.
• A continuous variable is a variable that can take all the values in a finite or infinite interval (e.g.
weight or temperature). A continuous variable can take an infinite set of values.

TYPES OF DATA MEASUREMENT


• Data Scales: Nominal data provide a name; if numeric, no scale is implied.
Example: Gender (Male, Female); Primary Color (Yellow, Red, Blue)
• Data Scales: Ordinal data provide an ordered scale.
Example: Educational Level (High School, BS, MS, Ph.D.)
• Data Scales: Interval data are measured on a scale with equal increments. An
interval scale is one where there is order and the difference between two values is meaningful.
Example: Temperature (Fahrenheit), Temperature (Celsius), pH, SAT score (200-800)
• Data Scales: Ratio data are interval-scale data with a meaningful zero. A ratio scale is a quantitative
scale where there is a true zero and equal intervals between neighboring points. Unlike on an interval
scale, a zero on a ratio scale means there is a total absence of the variable you are measuring.
Example: Length, area, and population

TYPES OF ERROR
• A type I error (false positive) occurs if a researcher rejects a null hypothesis that is actually true in
the population.
Example: You throw away good food that you thought was spoiled.
• A type II error (false negative) occurs if the researcher fails to reject a null hypothesis that is actually
false in the population.
Example: You eat spoiled food that you thought was good.
• Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. The test
provides evidence concerning the plausibility of the hypothesis, given the data. Statistical analysts
test a hypothesis by measuring and examining a random sample of the population being analyzed.
• The p-value is the probability, computed under the assumption that the null hypothesis is true, of
obtaining a result at least as extreme as the one actually observed.
• A non-parametric test (sometimes called a distribution-free test) does not assume anything about the
underlying distribution (for example, that the data come from a normal distribution).

• Parametric tests are those that make assumptions about the parameters of the population distribution
from which the sample is drawn. This is often the assumption that the population data are normally
distributed.
• The normal distribution is a continuous probability distribution in which values are distributed
symmetrically, with most of them situated around the mean.
• A measure of central tendency is a statistic that represents an entire population or dataset with a
single value.

– The mean (or average) is the most popular and well-known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data. The mean is equal to the sum of all the values in the data set divided by the number of
values in the data set
– The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data.
– The mode is the most frequent score in our data set.

• Measures of dispersion help to interpret the variability of data, i.e. to know how homogeneous or
heterogeneous the data are.
– A standard deviation is a measure of how dispersed the data is in relation to the mean. Low
standard deviation means data are clustered around the mean, and high standard deviation
indicates data are more spread out.
– The Variance measures how far a data set is spread out. It is mathematically defined as the
average of the squared differences from the mean
• Measures of relative position are conversions of values, usually standardized test scores, to show where
a given value stands in relation to other values of the same grouping.
– In statistics, a quartile is a type of quantile which divides the number of data points into four
parts, or quarters, of more-or-less equal size.
– Deciles divide a distribution into ten equal parts. When data are divided
into deciles, a decile rank is assigned to each data point so that the data can be sorted into ascending
or descending order.
– A percentile is a value below which a certain percentage of scores fall.
• Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal
distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed.
• Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low
kurtosis tend to have light tails, or lack of outliers.
• In descriptive statistics, the interquartile range tells you the spread of the middle half of your
distribution.
• The Mean Absolute Deviation of a dataset is the average distance between each data point and the
mean. It gives us an idea about the variability in a dataset.
• A frequency in statistics is the number of times an event or observation happened in an experiment
or study. It can also be defined simply as a count of a certain event.
• A scatter plot is a graph in which the values of two variables are plotted along two axes, the pattern
of the resulting points revealing any correlation present.
• A graphical representation refers to the use of charts and graphs to visually display, analyze, clarify,
and interpret numerical data, functions, and other qualitative structures.
– Types of Graphs: Line Graphs – A line graph (or linear graph) is used to display continuous
data and is useful for predicting future events over time.
– Types of Graphs: Bar Graphs – A bar graph is used to display categories of data; it compares
the data using solid bars to represent the quantities.
– Types of Graphs: Histograms – A graph that uses bars to represent the frequency of numerical
data that are organized into intervals. Since all the intervals are equal and continuous, all the
bars have the same width.
– Types of Graphs: Line Plot – It shows the frequency of data along a number line. An "x" is placed
above the number line each time that value occurs.
– Types of Graphs: Circle Graph – Also known as a pie chart, it shows the relationships of
the parts to the whole. The whole circle represents 100%.
– Types of Graphs: Stem and Leaf Plot – In the stem and leaf plot, the data are organized from
least value to the greatest value. The digits of the least place value form the leaves, and the
digits of the next place value form the stems.
– Types of Graphs: Box and Whisker Plot – The plot summarises the data by dividing them into
four parts. The box and whiskers show the range (spread) and the middle (median) of the data.

• A trimmed mean (similar to an adjusted mean) is a method of averaging that removes a small
designated percentage of the largest and smallest values before calculating the mean. After removing
the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging
formula. The use of a trimmed mean helps eliminate the influence of outliers or data points on the
tails that may unfairly affect the traditional or arithmetic mean.
• The coefficient of variation (CV) is the ratio of the standard deviation to the mean and shows the
extent of variability in relation to the mean of the population. The higher the CV, the greater the
dispersion.
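
The last two definitions, the trimmed mean and the coefficient of variation, can be sketched in Python. The function names and the data values are illustrative assumptions, and the sample standard deviation is used for the CV; a population version would use `statistics.pstdev` instead.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each tail.

    `proportion` is assumed to be below 0.5 so that some values remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

data = [12, 14, 15, 15, 15, 16, 17, 18, 90, 95]  # illustrative values

print(statistics.mean(data))           # ordinary mean, pulled up by 90 and 95
print(trimmed_mean(data, 0.2))         # drops the 2 smallest and 2 largest values
print(coefficient_of_variation(data))  # spread relative to the mean
```

Trimming 20% from each tail leaves 15, 15, 15, 16, 17, 18, whose mean is 16, far closer to the bulk of the data than the ordinary mean.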

Chapter Summative Test
Complete Name:
ID Number:
Course Year and Section:
Score:

Descriptive and Inferential Statistics
Chapter 02 - Measures of Central Tendency
Frances Jay B. Pacaldo, LPT., MA.ED-Teaching Mathematics

Measures of Central Tendency


Measures of Central Tendency provides a comprehensive overview of the fundamental statistical tools used
to describe the center of a set of data. This lecture covers the three main measures of central tendency, which
are the mean, median, and mode, and explains how to calculate them and interpret their results. It also
explores the advantages and limitations of each measure and offers guidance on selecting the appropriate
measure for a given dataset.

In addition to covering the basic concepts, “Measures of Central Tendency” delves deeper into more ad-
vanced topics such as weighted means, trimmed means, and measures of central tendency for grouped data.
This lecture also provides practical examples and real-world applications of central tendency measures in
various fields, including finance, economics, psychology, and healthcare.

Whether you are a student, researcher, or practitioner in any field that involves statistical analysis, this
lecture provides a valuable resource for understanding and utilizing measures of central tendency to accu-
rately describe and interpret data. With clear explanations, numerous examples, and helpful illustrations,
“Measures of Central Tendency” is an essential guide for anyone seeking to master this important statistical
concept.

We have learned from the previous lesson the definition of descriptive statistics. They are defined as brief
informational coefficients that summarize a given data set, which can be either a representation of the entire
population or a sample of a population. Descriptive statistics are broken down into measures of central
tendency, measures of variability (spread), and measures of relative position. Measures of central tendency
include the mean, median, and mode. Measures of variability include standard deviation, variance,
minimum and maximum values, kurtosis, and skewness, while measures of relative position include quartiles,
deciles, and percentiles.

Descriptive statistics, in short, help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data.

People use descriptive statistics to repurpose hard-to-understand quantitative insights across a large data
set into bite-sized descriptions.

A student’s grade point average (GPA), for example, provides a good understanding of descriptive statistics.
The idea of a GPA is that it takes data points from a wide range of exams, classes, and grades, and averages
them together to provide a general understanding of a student’s overall academic performance. A student’s
personal GPA reflects their mean academic performance.

In this lesson, we will discuss the measures of central tendency for ungrouped and grouped data, their
importance, how to compute each average, and when best to use them.

MEASURES OF CENTRAL TENDENCY:

Def.

Measures of central tendency focus on the average or middle values of data sets. They are often presented
with graphs, tables, and general discussions to help people understand the meaning of the analyzed data.
Measures of central tendency describe the center position of a distribution for a data set. A person
analyzes the frequency of each data point in the distribution and describes it using the mean,
median, or mode, which measure the most common patterns of the analyzed data set.

A measure of central tendency is a single value that attempts to describe a set of data by identifying the
central position within that set of data. As such, measures of central tendency are sometimes called measures
of central location. They are also classed as summary statistics.

There are three main measures of central tendency: the mean, median and mode. Each of these measures
describes a different indication of the typical or central value in the distribution.

In the following sections, we will look at the mean, mode and median, and learn how to calculate them and
under what conditions they are most appropriate to be used:

Measures of Central Tendency for Ungrouped Data


Ungrouped data refers to individual data points or raw data that haven’t been organized into groups, classes,
or intervals. These data are presented in their raw form without any summarization, categorization, or ag-
gregation.

When you’re dealing with ungrouped data, you’re looking at a set of individual observations.
In the context of measures of central tendency, ungrouped data allows for direct calculation of key metrics
like the mean, median, and mode.

• Mean: The sum of all data points divided by the total number of points.
• Median: The middle value when data is sorted in ascending order.
• Mode: The value(s) that occur most frequently in the dataset.
Each of these measures helps to find the “center” or “average” value in ungrouped data. Since this data
type is not categorized or summarized, it provides a granular view of the dataset, allowing for more detailed
analysis of individual values.
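
As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The data values here are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

scores = [2, 3, 3, 5, 7]  # hypothetical ungrouped data

mean_value = statistics.mean(scores)        # (2 + 3 + 3 + 5 + 7) / 5
median_value = statistics.median(scores)    # middle value of the sorted data
mode_values = statistics.multimode(scores)  # list of the most frequent value(s)

print(mean_value, median_value, mode_values)
```

`multimode` returns a list because a data set can have more than one mode.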

The Mean, x̄
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with
both discrete and continuous data, although its use is most often with continuous data. The mean is equal
to the sum of all the values in the data set divided by the number of values in the data set.

So, if we have n values in a data set and they have values $x_1, x_2, x_3, \ldots, x_n$, the sample mean, usually denoted
by $\bar{x}$ (pronounced as "x bar"), is:

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \tag{1.1} \]

This formula is usually written in a slightly different manner using the Greek capital letter $\Sigma$, pronounced
as "sigma", which means "the sum of...":

\[ \bar{x} = \frac{\sum x_i}{n} \tag{1.2} \]

You may have noticed that the above formula refers to the sample mean. This is because in statistics,
samples and populations have very different meanings and these differences are very important, even if, in
the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the
population mean and not the sample mean, we use the Greek lower-case letter mu, denoted as µ:

\[ \mu = \frac{\sum x_i}{N} \tag{1.3} \]

Example 1.
On a day in May 2024, the temperatures for the 7 places around Cebu are as follows:

Lapu-Lapu, 40°C      Cordova, 42°C      Mandaue, 45°C      Cebu City, 47°C
Minglanilla, 46°C    Talisay, 41°C      Consolacion, 40°C

Find the average temperature of the 7 places around Cebu.

Using the formula of the mean (in eq. 1.2), we have:

\[ \bar{x} = \frac{\sum x_i}{n} \]

Substitute all the values of x and n. In our case, we have:

\[ \bar{x} = \frac{40 + 42 + 45 + 47 + 46 + 41 + 40}{7} = \frac{301}{7} = 43^{\circ}\text{C} \]

The average temperature for the 7 places around Cebu is 43°C.
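
The arithmetic of Example 1 can be checked with a few lines of Python. The dictionary layout is just one convenient way to hold the data.

```python
# Temperatures (in °C) from Example 1.
temperatures = {
    "Lapu-Lapu": 40, "Cordova": 42, "Mandaue": 45, "Cebu City": 47,
    "Minglanilla": 46, "Talisay": 41, "Consolacion": 40,
}

total = sum(temperatures.values())  # numerator of eq. 1.2
n = len(temperatures)               # number of places
mean_temperature = total / n

print(total, n, mean_temperature)
```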

Let's try another example:

Example 2.
Consider the wages, in thousands (k), of the 10 employees of TUNGAB Refreshment below. Solve for the mean.

Staff                   1   2   3   4   5   6   7   8   9   10
Salary (in thousands)   15  18  16  14  15  15  12  17  90  95

Using the formula of the mean (in eq. 1.2), we have:

\[ \bar{x} = \frac{\sum x_i}{n} \]

Substitute all the values of x and n. In our case, we have:

\[ \bar{x} = \frac{15 + 18 + 16 + 14 + 15 + 15 + 12 + 17 + 90 + 95}{10} = \frac{307}{10} = 30.7 \]

Since the salaries are given in thousands, x̄ = 30,700.00.

The average salary for the 10 employees at TUNGAB Refreshment is 30,700.00.

The process of getting the mean is correct. However, the computed value of 30,700.00 does not
accurately reflect the typical salary of a worker, as most
workers have salaries ranging from 12,000.00 to 18,000.00. The mean is being skewed by the two large salaries.
Therefore, in this situation, we would like to have a better measure of central tendency.

Remember:
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are
values that are unusual compared to the rest of the data set by being especially small or large in numerical
value. When there are significant outliers in your data set, the mean loses its ability to provide the best
central location for the data because the skewed data is dragging it away from the typical value.
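
To see how strongly the two large salaries in Example 2 pull the mean, the sketch below computes it with and without them. Treating 90 and 95 as the outliers is an assumption made for this illustration; the text does not give a formal outlier rule.

```python
import statistics

# Salaries (in thousands) from Example 2.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# Assumption for illustration: the two largest salaries are the outliers.
typical = [s for s in salaries if s < 90]

mean_all = statistics.mean(salaries)     # mean of all 10 salaries
mean_typical = statistics.mean(typical)  # mean of the remaining 8 salaries

print(mean_all, mean_typical)
```

Removing just two of the ten values roughly halves the mean, which shows how sensitive it is to extreme values.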

Exercise 1
1. In a statistics exam, the following are the scores of the 9 students:
25 27 21 25 23
21 27 21 18
Find the average score of the 9 students.
2. In the zoo, a group of kangaroos has a competition to see how far they can jump. Their
results are as follows:
Name of Kangaroo Jump Height (m)
Kang 2.3m
Gar 3.5m
Roo 1.7m
Orr 4.3m
Rag 2.1m
Ngak 1.7m
Find the mean of their jumps.
3. Eleven runners are raising money for charity by running around a track. The following are
the numbers of laps they manage to run:
Runners No. of Laps
Najnaj 15
Juls 12
Frans 8
Trevor 26
Grace 14
Row 11
Wil 8
Ann 15
Ed 9
Bel 10
Shirl 15
Find the mean number of laps they manage to run.

The Median, x̃
The median is the middle score for a set of data that has been arranged in order of magnitude. The median
is less affected by outliers and skewed data.

Formula for the median:

x̃ = |Middle Score| (1.4)

Steps in solving the median, x̃:


• Step 1. Arrange the data in order of magnitude (smallest to greatest).
• Step 2. After arranging the data, locate the middle score to solve for the median.
(Remember: if the number of data is even, locate the two middle scores and get their average)

Using the same data in Example 2, above:

Consider the wages, in thousands (k), of the 10 employees of TUNGAB Refreshment below. Solve for the
median.

Staff                   1   2   3   4   5   6   7   8   9   10
Salary (in thousands)   15  18  16  14  15  15  12  17  90  95

Step 1. We first need to rearrange that data into order of magnitude (smallest first). Then we have:

Salary (in thousand) 12 14 15 15 15 16 17 18 90 95

Step 2. After arranging the data, locate the middle score to solve for the median. (Remember: if the number
of data is even, locate the two middle scores and get their average)

Since n = 10 (an even number), we locate the two middle scores in our data set, which are the 5th and
6th values, 15 and 16:

Salary (in thousand) 12 14 15 15 15 16 17 18 90 95

\[ \tilde{x} = \frac{15 + 16}{2} = 15.50 \]

Since the salaries are given in thousands, x̃ = 15,500.00.

So, the median salary for the 10 employees is 15,500.00.

Comparing the two averages, mean = 30,700.00 and median = 15,500.00, the median more accurately
describes the typical salary of a worker, since most workers have salaries ranging from
12,000.00 to 18,000.00. Remember that in the case where outliers are significant in the data set, the median
may be the best measure of central tendency.

Fun Fact: The median (or mode) may or may not be affected by the outliers in the data set.
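
The two steps for finding the median can be written out as a short Python function. The name `median_by_steps` is hypothetical; Python's built-in `statistics.median` gives the same result.

```python
def median_by_steps(values):
    """Median found by sorting, then taking the middle score(s)."""
    ordered = sorted(values)  # Step 1: arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]   # Step 2, odd n: the single middle score
    # Step 2, even n: average of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

# Salaries (in thousands) from Example 2.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]
median_salary = median_by_steps(salaries)

print(median_salary)
```

Unlike the mean, the result does not change if 90 and 95 are replaced by even larger salaries, because only the middle positions matter.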

Exercise 2
1. Suppose we have 10 dogs whose weights, in pounds, are shown in the table.
Dogs 1 2 3 4 5 6 7 8 9 10
Weights(lbs) 20 25 32 40 55 50 56 58 55 24
Find the median weight of the 10 dogs.
2. Suppose we have 9 cats whose weights, in pounds, are shown in the table below:
Cats 1 2 3 4 5 6 7 8 9
Weights(lbs) 24 20 30 40 50 45 52 51 52
Find the median weight of the 9 cats.

The Mode, x̂
The mode is the most frequent score in our data set. On a bar chart or histogram, it corresponds to the
highest bar. You can, therefore, sometimes consider the mode as being the most popular option.

x̂ = |Most Frequent Score| (1.5)

An example of a mode is presented below:

Normally, the mode is used for categorical data where we wish to know which is the most common category,
as illustrated below:

Based on the histogram, the students' most favored color in this particular data set is blue.

However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we
have two or more values that share the highest frequency, such as below:

In the provided histogram, the data reveals that blue and pink are the most preferred colors among the
students. This poses a challenge because Measures of Central Tendency are defined as single values that
summarize the entire dataset. Having two modes, as evident in this case, complicates the determination of
a single representative measure of central tendency.

FACT:
Remember that the mode (or median) may or may not be affected by the outliers. However, the
mode may not be the best measure of central tendency, since there are cases where two or more
modes are present in the data set, or where there is no mode at all.

Example 1.
Consider the wages of the 10 employees of TUNGAB Refreshment below in thousand (k). Solve for the
Mode.
Staff 1 2 3 4 5 6 7 8 9 10
Salary (in thousand) 15 18 16 14 10 11 12 17 90 95

Looking at the data above, no score appears more than once in the data set. Therefore, there is no
mode at all.

(Note: you can't write "0" to represent "no mode", since "0" is itself a value that could be a mode.)

Example 2.
Identify the mode for the following data set:
21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28, 62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94.

Step 1. To make it easier to identify the mode, it is better to arrange the data according to magnitude
(highest to lowest or vice versa). So we have:
19, 19, 21, 21, 28, 28, 48, 48, 59, 59, 62, 62, 63, 63, 66, 66, 66, 66, 79, 79, 94, 94.

Step 2. After arranging the data, we look for the most frequent score in the data set. One might think
that most of the values can be considered modes, since the majority of them appear twice or more.
However, based on the definition, the mode is the most frequent score. In our case, 66 is the mode of
the given dataset, since it appears four times while every other value appears only twice. Therefore,

x̂ = 66
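The counting in Step 2 can be sketched in Python with `collections.Counter`. The helper name `modes` is our own, and returning an empty list for "no mode" is an assumption made here to mirror the note that "0" must not be used for that case.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency (empty list if no value repeats)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                      # every score appears once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

scores = [21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28,
          62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94]
salaries = [15, 18, 16, 14, 10, 11, 12, 17, 90, 95]

print(modes(scores))               # [66]
print(modes(salaries))             # []  (Example 1: no mode)
print(modes([3, 3, 4, 5, 6, 6]))   # [3, 6]  (two modes)
```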

Exercise 3
1. The temperature in °F on 20 days during the month of May was as follows:

70°F, 76°F, 76°F, 74°F, 70°F, 70°F, 72°F, 74°F, 78°F, 80°F,
74°F, 74°F, 78°F, 76°F, 78°F, 76°F, 74°F, 78°F, 80°F, 76°F.

What is the mode of the temperatures in °F for the month of May?


2. Ritchelle, a BEED student, recorded her scores on weekly Stat Quizzes that were marked out
of a possible 10 points. Her scores were as follows:

8, 5, 8, 5, 7, 6, 7, 7, 5, 7, 5, 5, 6, 6, 9, 8, 9, 7, 9, 9, 6, 8, 6, 6, 7

Find her most frequent score.


3. What are the modes of the following sets of numbers?

a. 3, 16, 6, 8, 10, 5, 6
b. 12, 0, 15, 15, 13, 19, 16, 13, 16, 16

Measures of Central Tendency for Grouped Data
Grouped data refers to data that has been organized into categories, classes, or intervals. Instead of dealing
with individual data points, grouped data represents summarized information about a dataset, typically
using frequency counts or other descriptive statistics to indicate the distribution of data within defined ranges.

Grouping data is a common method when you have large datasets or when you want to analyze the
distribution of data within specific ranges. This approach can simplify analysis and provide an overview of
patterns and trends in the dataset.

Calculating measures of central tendency with grouped data typically involves additional assumptions and
calculations, as you’re working with intervals instead of individual data points. Common methods include:
• Grouped Mean: Using the midpoint of each group to estimate the mean.
• Grouped Median: Identifying the median interval and interpolating within it.
• Grouped Mode: Finding the mode based on the group with the highest frequency.
Grouping data is valuable when analyzing larger datasets or when precision in individual data points isn’t as
critical. It provides a broader view of the data’s distribution and can be used to identify trends or patterns
across different groups or intervals.

Grouped Mean, x̄
To solve for the mean involving grouped data, we use:

$$\bar{x} = \frac{\sum fm}{n} \tag{1.6}$$

Where:
f - frequency
m - midpoint
n - Total frequency

Example.
The following table gives the frequency distribution of the number of orders received each day during the
50 days at the office of a mail-order company. Calculate the mean.

Number of orders Frequency (f )


10-12 4
13-15 12
16-18 20
19-21 14

Steps in Solving the mean involving grouped data:

a]. Locate the midpoint m of each class in the given data. To find the midpoint, add the class limits and
divide the sum by 2.

For example: $\frac{10 + 12}{2} = 11$

Number of orders Frequency (f ) Midpoint (m)

10-12 4 $\frac{10 + 12}{2} = 11$
13-15 12 $\frac{13 + 15}{2} = 14$
16-18 20 $\frac{16 + 18}{2} = 17$
19-21 14 $\frac{19 + 21}{2} = 20$

b]. Once you have located the midpoint of each group, multiply it by the corresponding frequency to
solve for fm.
For example: 4 ∗ 11 = 44

Number of orders Frequency (f ) Midpoint (m) fm


10-12 4 11 4 ∗ 11 = 44
13-15 12 14 12 ∗ 14 = 168
16-18 20 17 20 ∗ 17 = 340
19-21 14 20 14 ∗ 20 = 280

c]. Find the summation of all the values under the column frequency (f) and fm (frequency*midpoint)

Number of orders Frequency (f ) Midpoint (m) fm


10-12 4 11 44
13-15 12 14 168
16-18 20 17 340
19-21 14 20 280
Total n = 50 $\sum fm = 832$

d]. Solve for the mean by substituting the values generated in c.
So we have:

$$\begin{aligned}
\bar{x} &= \frac{\sum fm}{n} \\
\bar{x} &= \frac{832}{50} \\
\bar{x} &= 16.64 \approx 17
\end{aligned}$$

So, the average number of orders received each day is approximately 17.
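Steps a] through d] can be checked with a short Python sketch. The function name `grouped_mean` and the representation of each class as a `(lower, upper)` pair are our own choices for illustration.

```python
def grouped_mean(classes, freqs):
    """Estimate the mean of grouped data as sum(f * m) / n, with m the class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]   # step a
    fm = [f * m for f, m in zip(freqs, midpoints)]                  # step b
    return sum(fm) / sum(freqs)                                     # steps c and d

classes = [(10, 12), (13, 15), (16, 18), (19, 21)]   # number of orders
freqs = [4, 12, 20, 14]                              # frequency of each class

print(grouped_mean(classes, freqs))   # 16.64
```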

Exercise 4
1. During 3 hours at Mactan Cebu International Airport (MCIA), 55 aircraft arrived late. The
number of minutes they were late is shown in the grouped frequency table below:

Minutes late Frequency (f ) Midpoint (m) fm


1 - 10 27
11 - 20 10
21 - 30 7
31 - 40 5
41 - 50 4
51 - 60 2

Find the Mean of the number of minutes the aircraft were late.

Grouped Median, x̃
To solve for the median involving grouped data, we use:

$$\tilde{x} = L_m + i\left(\frac{\frac{n}{2} - {<}CF}{f_m}\right) \tag{1.7}$$

Where:
Lm = Lower limit of the median class
n/2 = Position used to locate the median class
< CF = Cumulative frequency before the median class
fm = Frequency of the median class
i = Class interval

Example.
The following table gives the frequency distribution of the number of orders received each day during the
50 days at the office of a mail-order company. Calculate the median.

Number of orders Frequency (f )


10-12 4
13-15 12
16-18 20
19-21 14

Steps in calculating for the median involving grouped data:

a]. Create a cumulative frequency in your table. The cumulative frequency is calculated by adding each
frequency from a frequency distribution table to the sum of its predecessors. The last value will always be
equal to the total for all observations, since all frequencies will already have been added to the previous
total.
Number of orders Frequency (f ) CF
10-12 4 4
13-15 12 4 + 12 = 16
16-18 20 16 + 20 = 36
19-21 14 36 + 14 = 50
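As a quick check of the cumulative frequency column, the running totals can be produced in Python with `itertools.accumulate`. This is only an illustrative sketch; the variable names are our own.

```python
from itertools import accumulate

freqs = [4, 12, 20, 14]             # frequencies from the table
cf = list(accumulate(freqs))        # running totals: each f added to the sum before it

print(cf)                           # [4, 16, 36, 50]
print(cf[-1] == sum(freqs))         # True: the last CF equals the total n
```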
b]. Locate the median class. To locate the median class, calculate $\frac{n}{2}$:

$$\begin{aligned}
\text{Median Class} &= \frac{n}{2} \\
\text{Median Class} &= \frac{50}{2} \\
\text{Median Class} &= 25
\end{aligned}$$

c]. After solving for the median class, locate its value in the cumulative frequency column. In our case,
$\frac{n}{2} = \frac{50}{2} = 25$, and the value 25 falls in the third row of the cumulative frequency
column (CF = 36), since the 17th through 36th observations lie in that class.

Note: To find the median class, we have to find the cumulative frequencies of all the classes and
$\frac{n}{2}$. After that, locate the first class whose cumulative frequency is greater than or equal to
$\frac{n}{2} = 25$.

Number of orders Frequency (f ) CF


10-12 4 4
13-15 12 16
16-18 20 36 (Median, x̃ class)
19-21 14 50

d]. After locating the median class, create another column in your table for the lower limit, Lm.

Note: the lower limit is the smallest value of the class interval, and the actual lower limit is obtained
by subtracting 0.5 from that smallest value.

Number of orders Frequency (f ) CF Lower Limit (Lm )


10-12 4 4 10 − 0.5 = 9.5
13-15 12 16 13 − 0.5 = 12.5
16-18 20 36 16 − 0.5 = 15.5
19-21 14 50 19 − 0.5 = 18.5

e]. Once the table is complete, locate the following values (refer to your median class in locating
them):
• Median class position, $\frac{n}{2} = 25$
• < CF = 16 (The < CF is the cumulative frequency before the median class. In our case, the median
class' cumulative frequency is 36, and 16 is the number before it. So, < CF = 16.)
• Lower limit of the median class (Lm) is 15.5
• Frequency (fm) of the median class is 20, and the
• Class interval (i) is 3. To solve for the class interval (i), find the highest value (21) and the lowest
value (10) in the given class intervals, and divide their difference by the number of groups (4). In our
case, $i = \frac{HV - LV}{g} = \frac{21 - 10}{4} = 2.75 \approx 3$. (The same value is obtained as
the difference between consecutive lower limits: 12.5 − 9.5 = 3.)
Once you have these values, substitute them into the formula and do the operation.
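$$\begin{aligned}
\tilde{x} &= L_m + i\left(\frac{\frac{n}{2} - {<}CF}{f_m}\right) \\
\tilde{x} &= 15.5 + 3\left(\frac{\frac{50}{2} - 16}{20}\right) \\
\tilde{x} &= 15.5 + 3\left(\frac{25 - 16}{20}\right) \\
\tilde{x} &= 15.5 + 3\left(\frac{9}{20}\right) \\
\tilde{x} &= 15.5 + 3(0.45) \\
\tilde{x} &= 15.5 + 1.35 \\
\tilde{x} &= 16.85 \approx 17
\end{aligned}$$

So, the median of the given data set is approximately 17.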
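The whole procedure (cumulative frequencies, median class, then the interpolation formula) can be sketched in Python. The function name `grouped_median` and its parameters are our own; the sketch assumes the classes are listed in increasing order and that all classes share the same width.

```python
from itertools import accumulate

def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: Lm + i * ((n/2 - <CF) / fm)."""
    n = sum(freqs)
    half = n / 2
    cf = list(accumulate(freqs))                              # step a: cumulative frequency
    k = next(i for i, c in enumerate(cf) if c >= half)        # steps b-c: median class
    cf_before = cf[k - 1] if k > 0 else 0                     # <CF (0 if first class)
    return lower_limits[k] + width * (half - cf_before) / freqs[k]

lower_limits = [9.5, 12.5, 15.5, 18.5]   # actual lower limits (class limit minus 0.5)
freqs = [4, 12, 20, 14]

print(round(grouped_median(lower_limits, freqs, width=3), 2))   # 16.85
```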

Exercise 5
1. During 3 hours at Mactan Cebu International Airport (MCIA), 55 aircraft arrived late. The
number of minutes they were late is shown in the grouped frequency table below:

Minutes late Frequency (f ) C.F Lower Limit (Lm )


1 - 10 27
11 - 20 10
21 - 30 7
31 - 40 5
41 - 50 4
51 - 60 2

Find the Median of the number of minutes the aircraft were late.

Grouped Mode, x̂
To solve for the mode involving grouped data, we use:

$$\hat{x} = L_{mod} + i\left(\frac{d_1}{d_1 + d_2}\right) \tag{1.8}$$

Where:
Lmod = Lower limit of the modal class
d1 = Frequency of the modal class − frequency of the class before the modal class
d2 = Frequency of the modal class − frequency of the class after the modal class
i = Class interval

Example.
The following table gives the frequency distribution of the number of orders received each day during the
50 days at the office of a mail-order company. Calculate the mode.

Number of orders Frequency (f )


10-12 4
13-15 12
16-18 20
19-21 14

Steps in calculating for the mode involving grouped data:

a]. To solve for the mode of grouped data, one must locate the group with the highest frequency, since
the mode is defined as the most frequent score in the data set. In our example, the class 16-18 has the
highest frequency (f ), which is 20. Therefore, the modal class is in the third row of the table.

So we have,

Number of orders Frequency (f )


10-12 4
13-15 12
16-18 20 (modal class, x̂)
19-21 14

b]. After locating the modal class, create another column for the lower limit (Lmod ). So we have:

Number of orders Frequency (f ) Lower Limit (Lmod )


10-12 4 9.5
13-15 12 12.5
16-18 20 15.5
19-21 14 18.5

c]. Once you have located the modal class, identify the following values (refer to your modal class
in finding them):
• Lmod - Lower limit of the modal class: 15.5
• d1 - Frequency of the modal class − frequency of the class before the modal class: 20 − 12 = 8
• d2 - Frequency of the modal class − frequency of the class after the modal class: 20 − 14 = 6
• i - Class interval: 3. To solve for the class interval (i), find the highest value (21) and the lowest
value (10) in the given class intervals, and divide their difference by the number of groups (4). In our
case, $i = \frac{HV - LV}{g} = \frac{21 - 10}{4} = 2.75 \approx 3$. (The same value is obtained as
the difference between consecutive lower limits: 12.5 − 9.5 = 3.)

Once you have these values, substitute them into the formula and do the operation.

$$\begin{aligned}
\hat{x} &= L_{mod} + i\left(\frac{d_1}{d_1 + d_2}\right) \\
\hat{x} &= 15.5 + 3\left(\frac{8}{8 + 6}\right) \\
\hat{x} &= 15.5 + 3\left(\frac{8}{14}\right) \\
\hat{x} &= 15.5 + 3(0.571) \\
\hat{x} &= 15.5 + 1.714 \\
\hat{x} &= 17.21 \approx 17
\end{aligned}$$

So, the mode for the given data set is approximately equal to 17.
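The grouped mode calculation can be sketched in Python as follows. The function name `grouped_mode` is our own. Two choices here are assumptions not stated in the lesson: if several classes tie for the highest frequency, the first one is used, and a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: Lmod + i * (d1 / (d1 + d2))."""
    k = max(range(len(freqs)), key=lambda i: freqs[i])     # modal class (first if tied)
    before = freqs[k - 1] if k > 0 else 0                  # assumed 0 when no class before
    after = freqs[k + 1] if k < len(freqs) - 1 else 0      # assumed 0 when no class after
    d1 = freqs[k] - before
    d2 = freqs[k] - after
    return lower_limits[k] + width * d1 / (d1 + d2)

lower_limits = [9.5, 12.5, 15.5, 18.5]
freqs = [4, 12, 20, 14]

print(round(grouped_mode(lower_limits, freqs, width=3), 2))   # 17.21
```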

Exercise 6
1. During 3 hours at Mactan Cebu International Airport (MCIA), 55 aircraft arrived late. The
number of minutes they were late is shown in the grouped frequency table below:

Minutes late Frequency (f ) Lower Limit (Lmod )


1 - 10 27
11 - 20 10
21 - 30 7
31 - 40 5
41 - 50 4
51 - 60 2

Find the Mode of the number of minutes the aircraft were late.

Check your Progress!


1. Given the following data below:
Statistics Exam Score Number of Students (f )
11-20 7
21-30 10
31-40 10
41-50 21
51-60 20
61-70 15
71-80 8
Calculate the following:
a. mean
b. median
c. mode

2. If the mean of the given frequency distribution is 35, then find the missing frequency y. Also,
calculate the following:
a. the median
b. the mode

Statistics Exam Score Number of Students (f )


11-20 2
21-30 4
31-40 7
41-50 y
51-60 1

Chapter Summative Test
Complete Name:
ID Number:
Course, Year & Section:
Score:
Examination Date:

Instruction: Read each item very carefully and choose the letter of the correct answer.

1. Which measure of central tendency is generally used in determining the size of the most saleable shirt
in the department store?

a. Mean b. Median c. Mode d. None of the above

2. Which measure of central tendency has the greatest stability?

a. Mean b. Median c. Mode d. None of the above

3. Which is the most often repeated value, or the value with the highest frequency, in the data set?

a. Mean b. Median c. Mode d. None of the above

4. Which measure of central tendency is greatly affected by extreme scores?

a. Mean b. Median c. Mode d. None of the above

5. What is the mode of 3, 3, 4, 5, 6, 6?

a. 3 b. 6 c. 3 and 6 d. 4 and 5

6. Which statement is true for the set of data consisting of 80, 80, 81, 82, 82?

a. x̄ = x̂ b. x̃ = x̂ c. x̄ = x̃ d. x̄ < x̃

7. What is the mean of 5, 6, 7, 8, 9?

a. 5 b. 6 c. 7 d. 8

8. What is the median age of a group of students whose ages are: 16, 18, 10, 15 and 8 years old?

a. 18 b. 10 c. 15 d. 16

9. If the heights in cm of a group of students are 150, 154, 158, 160, 163, what is the mean height of these
students?

a. 155 b. 156 c. 157 d. 158

10. Ritchelle has grades of 86, 85, and 89 in her first three grading periods in Statistics. What grade must
she obtain in the fourth grading to get a final rating of 85?

a. 80 b. 82 c. 83 d. 85

11. Rachelle is hosting a birthday party in her house. Six kids aged 12 and five babies aged 2 attended the
party. Which measure of central tendency is appropriate to use to find the average age?

a. Mean b. Median c. Mode d. None of the above

12. Seven Students got the following scores in Math: 24, 15, 18, 10, 25, 30, and 9. What is the median score?

a. 10 b. 15 c. 18 d. 24

For items 13-18, refer to the data below.


Class Frequency (f)
1−5 2
6 − 10 4
11 − 15 6
16 − 20 5
21 − 25 9
26 − 30 6
31 − 35 10
36 − 40 3
41 − 45 2
46 − 50 1
13. What is the class size (i) of the data above?

a. 3 b. 4 c. 5 d. 6

14. What is $\sum fm$?

a. 1158 b. 1159 c. 1160 d. 1161

15. What is $\sum f$?

a. 48 b. 50 c. 255 d. 1159

16. What is Mean (x̄)?

a. 27.45 b. 26.01 c. 24.15 d. 23.78

17. What is Median (x̃)?

a. 24.39 b. 25.40 c. 22.17 d. 23.78

18. What is Mode (x̂)?

a. 32.32 b. 31.01 c. 33.15 d. 35.78

19. Find the mean of the number of hamburgers sold in 7 days: 25, 28, 23, 28, 25, 27, 24.

a. 27.45 b. 26.01 c. 24.15 d. 23.78

20. Find the mode of the number of hamburgers sold in 7 days: 25, 28, 23, 28, 26, 27, 24.

a. 27 b. 26 c. 25 d. 28

Test II.
A periodical in Math contained 40 questions. The distribution below summarizes the results of the test.

Scores f m fm CF Lower limit


1−5 2
6 − 10 3
11 − 15 10
16 − 20 15
21 − 25 9
26 − 30 1
Total $\sum f =$ $\sum fm =$

Complete the table above and find the following:


a. Mean (x̄)
b. Median (x̃)
c. Mode (x̂)

Descriptive and Inferential Statistics
Chapter 03 - Measures of Dispersion (Variation)
Frances Jay B. Pacaldo, LPT., MA.ED-Teaching Mathematics

Measures of Dispersion (Variation)


The measures of central tendency alone are not adequate to describe data. Two data sets can have the same
mean but be entirely different. Thus, to describe data, one needs to know the extent of variability. This
is given by the measures of dispersion.

Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to which
numerical data is likely to vary about an average value. In other words, dispersion helps to understand the
distribution of the data.

The most important use of measures of dispersion is that they help to get an understanding of the
distribution of data. As the data becomes more diverse, the value of the measure of dispersion increases.

Range, Interquartile Range, Standard Deviation, and Variance are the commonly used measures of
dispersion.
• The range is the difference between the largest observation (LO) and the smallest observation (SO) in
the data. The prime advantage of this measure of dispersion is that it is easy to calculate. On the other
hand, it has a lot of disadvantages. It is very sensitive to outliers and does not use all the observations
in a data set. It is more informative to provide the minimum and the maximum values rather than
providing only the range.

Range = LO − SO (1.9)

• Interquartile range is defined as the difference between the 25th and 75th percentiles (also called the
first and third quartiles). Hence the interquartile range describes the middle 50% of observations. If
the interquartile range is large, it means that the middle 50% of observations are spaced wide apart.
The important advantage of the interquartile range is that it can be used as a measure of variability
even if the extreme values are not recorded exactly (as in the case of open-ended class intervals in a
frequency distribution).[2] Another advantageous feature is that it is not affected by extreme values.
The main disadvantage of using the interquartile range as a measure of dispersion is that it is not
amenable to mathematical manipulation.

IQR = Q3 − Q1 (1.10)

• Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the
spread of data about the mean. SD is the square root of the sum of squared deviations from the mean
divided by n − 1, where n is the number of observations. The reason why SD is a very useful measure
of dispersion is that, if the observations are from a normal distribution, then about 68% of observations
lie between mean ± 1 SD, 95% of observations lie between mean ± 2 SD, and 99.7% of observations lie
between mean ± 3 SD. The other advantage of SD is that, along with the mean, it can be used to detect
skewness. The disadvantage of SD is that it is an inappropriate measure of dispersion for skewed data.

$$\sigma/s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \tag{1.11}$$

• Variance is the average squared deviation from the mean of the given data set. It measures the
dispersion or spread of a set of data points in relation to their mean, quantifying how much the values
in a dataset deviate from the mean. Variance helps us understand how much the data points differ from
each other and how much they differ from the average value. If there's a lot of spread, it could indicate
that there's more unpredictability or variety. If there's not much spread, it suggests more consistency
or similarity among the data points.

$$\sigma^2/s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \tag{1.12}$$

The Measures of Dispersion for Ungrouped Data


Measures of dispersion help to quantify the variability or spread in a dataset. For ungrouped data—data
that hasn’t been organized into frequency distributions or grouped by intervals—these measures can provide
insights into how much individual data points differ from one another and from the central tendency.

Example:
The data below show the scores of 40 students in the 2012 Division Achievement Test (DAT).

35 16 28 43 21 17 15 20
16 20 18 25 22 33 18 22
32 38 23 32 18 25 35 23
18 20 22 36 22 20 14 21
39 22 38 25 32 33 17 35

Discuss how spread out the 2012 DAT scores are.

The Range
The range is one of the simplest measures of dispersion in statistics. It describes the spread or extent of
variation in a dataset by calculating the difference between the maximum and minimum values.

The range is the difference between the largest and smallest values in a dataset. Mathematically, it’s
expressed as:

Range = LO − SO

Consider the dataset of the scores of 40 students in the 2012 Division Achievement Test (DAT). Based on
the data provided, the largest observed value is 43 while the smallest observed value is 14. Then we have,

Range = LO − SO
Range = 43 − 14
Range = 29

The range of 29 indicates that the data values span a total distance of 29 units, from the lowest to the
highest. It provides a sense of the spread or dispersion in the data. There is a 29-unit gap between the
smallest and largest numbers in the dataset.
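The range of the DAT scores can be verified with a few lines of Python. This is a simple sketch using the built-in `max` and `min`; the variable names are our own.

```python
scores = [35, 16, 28, 43, 21, 17, 15, 20,
          16, 20, 18, 25, 22, 33, 18, 22,
          32, 38, 23, 32, 18, 25, 35, 23,
          18, 20, 22, 36, 22, 20, 14, 21,
          39, 22, 38, 25, 32, 33, 17, 35]

largest = max(scores)               # LO, the largest observation
smallest = min(scores)              # SO, the smallest observation
value_range = largest - smallest    # Range = LO - SO

print(largest, smallest, value_range)   # 43 14 29
```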

Interquartile Range
The interquartile range (IQR) is a measure of dispersion that quantifies the spread of the middle 50% of
a dataset. It’s particularly useful for understanding the variability of data while minimizing the impact of
outliers and extreme values.

The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) in a dataset. Quartiles
divide a dataset into four equal parts. Thus, the IQR represents the range of the middle two quartiles, or
the central 50% of the data. Mathematically, the IQR is given by:

IQR = Q3 − Q1

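Where:
$Q_3 = X_{\left(\frac{3(n+1)}{4}\right)\text{th}}$
$Q_1 = X_{\left(\frac{n+1}{4}\right)\text{th}}$

When the index for a quartile is not a whole number, we use interpolation between the two neighboring
values, where the "Decimal Point" is the fractional part of the index. Thus,

Interpolation = Qlow + Decimal Point ∗ (Qhigh − Qlow )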

Consider the dataset of the score of 40 students in the 2012 Division Achievement Test (DAT).

35 16 28 43 21 17 15 20
16 20 18 25 22 33 18 22
32 38 23 32 18 25 35 23
18 20 22 36 22 20 14 21
39 22 38 25 32 33 17 35

• First, we arrange the data according to magnitude (smallest to largest) and label the values
X1 , X2 , X3 , ..., Xn

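• Second, we solve for the locations of both the First Quartile and the Third Quartile.

For the First Quartile (Q1): Since n = 40,

$$\begin{aligned}
Q_1 &= X_{\left(\frac{n+1}{4}\right)\text{th}} \\
Q_1 &= X_{\left(\frac{40+1}{4}\right)\text{th}} \\
Q_1 &= X_{\left(\frac{41}{4}\right)\text{th}} \\
Q_1 &= X_{10.25\text{th}}
\end{aligned}$$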

Therefore, Q1 is located between X10 and X11 .

Since the index for Q1 is not a whole number (10.25), the value of X10.25th cannot be read directly
from the data; it lies between X10th = 18 and X11th = 20. To solve for the actual value of X10.25th ,
we use the interpolation formula for quartiles. Thus,
Interpolation = Qlow + Decimal Point ∗ (Qhigh − Qlow )

We assign X10th = 18 as our Qlow and X11th = 20 as Qhigh . For the decimal point, we have 0.25.
With these values, we can now substitute into the interpolation formula. Thus,
Q1 = Qlow + Decimal Point ∗ (Qhigh − Qlow )
Q1 = 18 + 0.25 ∗ (20 − 18)
Q1 = 18 + 0.25 ∗ (2)
Q1 = 18 + 0.50
Q1 = 18.50
So, the First Quartile (Q1) in the given dataset is 18.50.

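For the Third Quartile (Q3): Since n = 40,

$$\begin{aligned}
Q_3 &= X_{\left(\frac{3(n+1)}{4}\right)\text{th}} \\
Q_3 &= X_{\left(\frac{3(40+1)}{4}\right)\text{th}} \\
Q_3 &= X_{\left(\frac{3(41)}{4}\right)\text{th}} \\
Q_3 &= X_{\left(\frac{123}{4}\right)\text{th}} \\
Q_3 &= X_{30.75\text{th}}
\end{aligned}$$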

Therefore, Q3 is located between X30 and X31 .

Since the index for Q3 is not a whole number (30.75), the value of X30.75th cannot be read directly
from the data; it lies between X30th = 32 and X31th = 33. To solve for the actual value of X30.75th ,
we use the interpolation formula for quartiles. Thus,

Interpolation = Qlow + Decimal Point ∗ (Qhigh − Qlow )

We assign X30th = 32 as our Qlow and X31th = 33 as Qhigh . For the decimal point, we have 0.75.
With these values, we can now substitute into the interpolation formula. Thus,

Q3 = Qlow + Decimal Point ∗ (Qhigh − Qlow )
Q3 = 32 + 0.75 ∗ (33 − 32)
Q3 = 32 + 0.75 ∗ (1)
Q3 = 32 + 0.75
Q3 = 32.75

So, the Third Quartile (Q3) in the given dataset is 32.75.

• Third, solve for the Interquartile Range, IQR = Q3 − Q1. Thus,

IQR = Q3 − Q1
IQR = 32.75 − 18.50
IQR = 14.25

The IQR provides a robust and reliable measure of dispersion, offering insights into the variability of
the central portion of a dataset and helping to detect outliers. If the interquartile range (IQR) is small,
this indicates that the values within the middle 50% of the data are close together or concentrated.
This can mean that there's relatively little variability among the central data points, suggesting
consistency or homogeneity within this central portion.

The calculated IQR of 14.25 suggests moderate dispersion in the middle 50% of the data.
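The three steps for the IQR (sort, locate each quartile at position q(n + 1)/4, interpolate) can be sketched in Python. The helper name `quartile` is our own, and the sketch assumes the dataset is large enough that each position falls between 1 and n, as it does for the 40 DAT scores.

```python
def quartile(values, q):
    """Return the q-th quartile (q = 1 or 3) using the position q * (n + 1) / 4."""
    ordered = sorted(values)                    # first: arrange by magnitude
    position = q * (len(ordered) + 1) / 4       # second: 1-based location, e.g. 10.25
    whole = int(position)                       # index of Qlow
    fraction = position - whole                 # the "Decimal Point" part, e.g. 0.25
    q_low = ordered[whole - 1]
    if fraction == 0:
        return q_low
    q_high = ordered[whole]
    return q_low + fraction * (q_high - q_low)  # interpolation

scores = [35, 16, 28, 43, 21, 17, 15, 20,
          16, 20, 18, 25, 22, 33, 18, 22,
          32, 38, 23, 32, 18, 25, 35, 23,
          18, 20, 22, 36, 22, 20, 14, 21,
          39, 22, 38, 25, 32, 33, 17, 35]

q1 = quartile(scores, 1)
q3 = quartile(scores, 3)
print(q1, q3, q3 - q1)   # 18.5 32.75 14.25
```

Other software may use a different quartile convention and return slightly different values for the same data.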

Fact:
The IQR can be used to detect potential outliers. A common rule of thumb is to define outliers as
data points that are more than 1.5 times the IQR below Q1 or above Q3:

Lower Bound = Q1 − 1.5 ∗ (IQR)

Upper Bound = Q3 + 1.5 ∗ (IQR)
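The outlier bounds can be computed for the DAT scores as a short sketch. It uses Python's `statistics.quantiles` (available in Python 3.8 and later), whose default "exclusive" method places the quartiles at positions based on n + 1, matching the quartile positions used in this lesson; the variable names are our own.

```python
import statistics

scores = [35, 16, 28, 43, 21, 17, 15, 20,
          16, 20, 18, 25, 22, 33, 18, 22,
          32, 38, 23, 32, 18, 25, 35, 23,
          18, 20, 22, 36, 22, 20, 14, 21,
          39, 22, 38, 25, 32, 33, 17, 35]

q1, _, q3 = statistics.quantiles(scores, n=4)   # Q1, Q2 (unused), Q3
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_bound or x > upper_bound]

print(lower_bound, upper_bound, outliers)   # -2.875 54.125 []
```

Since every score lies between 14 and 43, none falls outside the bounds, so this rule of thumb flags no potential outliers in the DAT data.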

Standard Deviation
Standard deviation is a measure of dispersion that quantifies the amount of variation or spread in a set of
data. It tells us how much individual data points deviate from the mean (average) of a dataset.

A high standard deviation indicates a greater spread of values, suggesting more variability or inconsistency
within the dataset. A low standard deviation indicates that the data points are closer to the mean,
suggesting less variability and more consistency.

A higher standard deviation can suggest the presence of outliers or extreme values in the dataset.

Standard deviation is a powerful tool for understanding data variability and dispersion. It’s used in various
fields to measure risk, consistency, and overall spread, but it has limitations when dealing with datasets with
outliers or non-symmetrical distributions.

Standard deviation is the square root of the variance. The variance measures the average squared deviation
from the mean, and standard deviation takes the square root to bring the measure back to the original units
of the data.

Mathematically, the standard deviation is given by:

$$\sigma/s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Consider the dataset of the score of 40 students in the 2012 Division Achievement Test (DAT).
35 16 28 43 21 17 15 20
16 20 18 25 22 33 18 22
32 38 23 32 18 25 35 23
18 20 22 36 22 20 14 21
39 22 38 25 32 33 17 35
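• First, we solve for the mean (x̄). Since n = 40, we have:

$$\begin{aligned}
\bar{x} &= \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \\
\bar{x} &= \frac{35 + 16 + 28 + \dots + 33 + 17 + 35}{40} \\
\bar{x} &= \frac{1009}{40} \\
\bar{x} &= 25.225 \approx 25.23
\end{aligned}$$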
• Second, we get the squared difference between each xi and the mean x̄, and add them. Then we have:

$$\begin{aligned}
\sum (x_i - \bar{x})^2 &= (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 \\
\sum (x_i - \bar{x})^2 &= (35 - 25.23)^2 + (16 - 25.23)^2 + \dots + (17 - 25.23)^2 + (35 - 25.23)^2 \\
\sum (x_i - \bar{x})^2 &= 95.4529 + 85.1929 + \dots + 67.7329 + 95.4529 \\
\sum (x_i - \bar{x})^2 &= 2{,}452.98
\end{aligned}$$

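• Third, we substitute the values and do the operation to solve for the standard deviation, σ. So we
have,

$$\begin{aligned}
\sigma &= \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \\
\sigma &= \sqrt{\frac{2{,}452.98}{40 - 1}} \\
\sigma &= \sqrt{\frac{2{,}452.98}{39}} \\
\sigma &= \sqrt{62.90} \\
\sigma &= 7.93
\end{aligned}$$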
With a standard deviation of 7.93, the data points are generally spread out, indicating a moderate to
high level of variability.
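The three steps for the standard deviation can be sketched in Python. The function name `sample_std` is our own; it divides by n − 1, following the formula used in this lesson.

```python
import math

def sample_std(values):
    """Square root of the sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                                    # first: the mean
    squared_dev = sum((x - mean) ** 2 for x in values)        # second: sum of (xi - mean)^2
    return math.sqrt(squared_dev / (n - 1))                   # third: substitute and solve

scores = [35, 16, 28, 43, 21, 17, 15, 20,
          16, 20, 18, 25, 22, 33, 18, 22,
          32, 38, 23, 32, 18, 25, 35, 23,
          18, 20, 22, 36, 22, 20, 14, 21,
          39, 22, 38, 25, 32, 33, 17, 35]

print(sum(scores))                      # 1009
print(round(sample_std(scores), 2))     # 7.93
```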

Variance
Variance is a measure of statistical dispersion that quantifies the spread or variability within a set of data.
It represents the average of the squared differences from the mean, providing a sense of how much individual
data points differ from the mean of a dataset. Variance provides a measure of how much the data points in a
dataset are spread out. A higher variance indicates greater variability or dispersion among data points, while
a lower variance suggests more consistency and less spread. Variance is expressed in squared units (since
it involves squaring differences), making it more challenging to understand compared to other statistical
measures like standard deviation.

Variance is computed as the average of the squared differences between each data point and the mean.
Mathematically, the variance is given by:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Consider the dataset of the score of 40 students in the 2012 Division Achievement Test (DAT).

35 16 28 43 21 17 15 20
16 20 18 25 22 33 18 22
32 38 23 32 18 25 35 23
18 20 22 36 22 20 14 21
39 22 38 25 32 33 17 35

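• First, we solve for the mean (x̄). Since n = 40, we have:

$$\begin{aligned}
\bar{x} &= \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \\
\bar{x} &= \frac{35 + 16 + 28 + \dots + 33 + 17 + 35}{40} \\
\bar{x} &= \frac{1009}{40} \\
\bar{x} &= 25.225 \approx 25.23
\end{aligned}$$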

• Second, we get the squared difference between each xi and the mean x̄, and add them. Then we have:

$$\begin{aligned}
\sum (x_i - \bar{x})^2 &= (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2 \\
\sum (x_i - \bar{x})^2 &= (35 - 25.23)^2 + (16 - 25.23)^2 + \dots + (17 - 25.23)^2 + (35 - 25.23)^2 \\
\sum (x_i - \bar{x})^2 &= 95.4529 + 85.1929 + \dots + 67.7329 + 95.4529 \\
\sum (x_i - \bar{x})^2 &= 2{,}452.98
\end{aligned}$$

• Third, we substitute the values and do the operation to solve for the variance, σ². So we have,

$$\begin{aligned}
\sigma^2 &= \frac{\sum (x_i - \bar{x})^2}{n - 1} \\
\sigma^2 &= \frac{2{,}452.98}{40 - 1} \\
\sigma^2 &= \frac{2{,}452.98}{39} \\
\sigma^2 &= 62.90
\end{aligned}$$

A variance of 62.90 suggests that the data points in the dataset are generally spread out, indicating a
relatively high level of variability or dispersion. This spread can imply inconsistency and the presence
of outliers.
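The variance can be checked with Python's standard `statistics` module. In this sketch, `statistics.variance` divides by n − 1 (the formula used in this lesson), while `statistics.pvariance` divides by n; the second is shown only for contrast and is not part of the worked example.

```python
import statistics

scores = [35, 16, 28, 43, 21, 17, 15, 20,
          16, 20, 18, 25, 22, 33, 18, 22,
          32, 38, 23, 32, 18, 25, 35, 23,
          18, 20, 22, 36, 22, 20, 14, 21,
          39, 22, 38, 25, 32, 33, 17, 35]

sample_variance = statistics.variance(scores)        # divides by n - 1
population_variance = statistics.pvariance(scores)   # divides by n (for contrast only)

print(round(sample_variance, 2))        # 62.9
print(round(population_variance, 2))    # 61.32
```

Taking the square root of the sample variance gives back the standard deviation of about 7.93, in the original units of the scores.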

Exercise
Instruction. Read each question carefully before selecting your answer. Choose the best answer from
the given choices.
1. The range of a dataset is given by:
a. The difference between the maximum and minimum values
b. The difference between the third quartile (Q3) and the first quartile (Q1)
c. The square root of the variance
d. The average of the squared differences from the mean
2. The interquartile range (IQR) of a dataset is calculated as
a. The difference between the maximum and minimum values
b. The difference between the third quartile (Q3) and the first quartile (Q1)
c. The square root of the variance
d. The average of the squared differences from the mean
3. Standard deviation can be defined as
a. The square root of the variance
b. The difference between the maximum and minimum values
c. The difference between the third quartile and the first quartile
d. The average absolute deviation from the mean
4. Given the following dataset: [4, 8, 15, 16, 23, 42], what is the range?

a. 15 b. 20 c. 38 d. 30

5. Given the following dataset: [12, 15, 22, 25, 33, 36, 40, 45, 50, 52], what is the interquartile range
(IQR)?

a. 20 b. 18 c. 25 d. 17

Check your Progress!


To check whether you really understand the measures of dispersion and how to calculate them,
consider the dataset below.

58 67 45 72 86 93 55 68 70 80
62 71 49 77 82 59 90 51 65 78
60 69 83 88 74 64 53 56 73 85
61 66 79 57 92 50 54 63 76 75

Calculate the following Measures of Dispersion and interpret the result.


a. Range
b. Interquartile Range (IQR)
c. Standard Deviation (σ)
d. Variance (σ 2 )

The Measures of Dispersion for Grouped Data
Measures of dispersion for grouped data help quantify the spread or variability in a dataset that's organized
into groups or classes. These groups are often used when dealing with large datasets, frequency
distributions, or data organized into intervals. The main measures of dispersion for grouped data include
range, interquartile range (IQR), variance, and standard deviation.

Example A.
The following table gives the frequency distribution of the number of orders received each day during the
50 days at the office of a mail-order company. Discuss how spread out the company's data distribution is.

Number of orders Frequency (f )


10-12 4
13-15 12
16-18 20
19-21 14

The Grouped Range


To solve for the range involving grouped data, we use:

Range = Upper class limit of the highest interval − Lower class limit of the lowest interval

Using the data in Example A, the upper class limit of the highest interval is 21 while the lower class
limit of the lowest interval is 10. Therefore,

Range = 21 − 10 = 11

The Grouped Interquartile Range, IQR


To solve for the interquartile range (IQR) of grouped data, we use

IQR = Q3 − Q1

To solve for the 3rd quartile:

$$Q_3 = L_{qc} + i\left(\frac{\frac{3n}{4} - {<}CF}{f_{qc}}\right)$$

Where:
Lqc = Lower limit (lower class boundary) of the quartile class
i = Class interval
3n/4 = Position of the 3rd quartile, used to locate the 3rd quartile class
<CF = The cumulative frequency before the quartile class
fqc = The frequency of the quartile class

To solve for the 1st quartile:

$$Q_1 = L_{qc} + i\left(\frac{\frac{n}{4} - {<}CF}{f_{qc}}\right)$$

Where:
Lqc = Lower limit (lower class boundary) of the quartile class
i = Class interval
n/4 = Position of the 1st quartile, used to locate the 1st quartile class
<CF = The cumulative frequency before the quartile class
fqc = The frequency of the quartile class
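The two quartile formulas differ only in the position (n/4 or 3n/4), so they can be sketched as a single Python function. This is an illustrative sketch, not part of the course pack: the function name `grouped_quartile` and its argument layout are assumptions, and the quartile class is taken to be the first class whose cumulative frequency reaches the position.

```python
from itertools import accumulate

def grouped_quartile(lowers, freqs, width, k):
    """Return the k-th quartile (k = 1 or 3) of grouped data.

    lowers: Lqc candidates, the lower limit (class boundary) of each class
    freqs:  frequency of each class
    width:  class interval i
    """
    n = sum(freqs)
    position = k * n / 4                  # n/4 for Q1, 3n/4 for Q3
    cum = list(accumulate(freqs))         # cumulative frequencies
    # Quartile class: first class whose cumulative frequency reaches the position
    idx = next(j for j, cf in enumerate(cum) if cf >= position)
    cf_before = cum[idx - 1] if idx > 0 else 0    # <CF
    return lowers[idx] + width * (position - cf_before) / freqs[idx]

# Classes 10-12, 13-15, 16-18, 19-21 from Example A
lowers = [9.5, 12.5, 15.5, 18.5]
freqs = [4, 12, 20, 14]

q1 = grouped_quartile(lowers, freqs, 3, 1)
q3 = grouped_quartile(lowers, freqs, 3, 3)
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {q3 - q1:.4f}")
```

The function keeps full precision, so its output can be used to check hand calculations that round at intermediate steps.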

Using the data in Example A, solve for the interquartile range.


a. Complete the table by adding columns for the cumulative frequency (C.F) and the lower limit (lower class boundary) of each class.

Number of orders Frequency (f ) C.F Lower Limit


10-12 4 4 9.5
13-15 12 16 12.5
16-18 20 36 15.5
19-21 14 50 18.5
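The two new columns can also be built programmatically. The sketch below is only an illustration under one assumption: each lower limit in the table is placed halfway across the gap between adjacent classes (here the gap between 12 and 13 is 1, so the offset is 0.5). The variable names are hypothetical.

```python
from itertools import accumulate

# Classes and frequencies from Example A
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]

# C.F column: running total of the frequencies
cum_freqs = list(accumulate(freqs))

# Lower Limit column: half the gap between adjacent classes below each class
gap = classes[1][0] - classes[0][1]           # 13 - 12 = 1
lower_limits = [low - gap / 2 for low, _ in classes]

for (low, high), f, cf, ll in zip(classes, freqs, cum_freqs, lower_limits):
    print(f"{low}-{high}  f={f}  C.F={cf}  lower limit={ll}")
```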

b. Locate the quartile classes (1st and 3rd).

b.1. For the 1st quartile, we use n/4. Since n = 50, we have:

$$\frac{n}{4} = \frac{50}{4} = 12.50$$

In our table (in the CF column), the first cumulative frequency that is greater than or equal to 12.50 is 16.
Therefore, the class 13-15 is our 1st quartile class.

Number of orders Frequency (f ) C.F Lower Limit


10-12 4 4 9.5
13-15 12 16 (Q1 class) 12.5
16-18 20 36 15.5
19-21 14 50 18.5

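b.2. For the 3rd quartile, we use 3n/4. Since n = 50, we have:

$$\frac{3n}{4} = \frac{3 \times 50}{4} = \frac{150}{4} = 37.50$$

In our table (in the CF column), the first cumulative frequency that is greater than or equal to 37.50 is 50
(the previous value, 36, falls short of 37.50). Therefore, the class 19-21 is our 3rd quartile class.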

Number of orders Frequency (f ) C.F Lower Limit


10-12 4 4 9.5
13-15 12 16 (Q1 class) 12.5
16-18 20 36 15.5
19-21 14 50 (Q3 class) 18.5
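Locating a quartile class amounts to a search in the C.F column for the first cumulative frequency that is at least the quartile position. A minimal Python sketch of that search is given below; the helper name `quartile_class` is an assumption, and `bisect_left` from the standard library is used because it returns the index of the first entry greater than or equal to the searched value in a sorted list.

```python
from bisect import bisect_left

# Class labels and cumulative frequencies from the table in step a
labels = ["10-12", "13-15", "16-18", "19-21"]
cum_freqs = [4, 16, 36, 50]
n = cum_freqs[-1]                     # total number of observations, 50

def quartile_class(k):
    """Return the label of the class containing the k-th quartile."""
    position = k * n / 4              # 12.5 for k = 1, 37.5 for k = 3
    # Index of the first cumulative frequency >= position
    return labels[bisect_left(cum_freqs, position)]

q1_class = quartile_class(1)
q3_class = quartile_class(3)
print(q1_class, q3_class)
```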

c. Since we have located our 1st and 3rd Quartile class, we can now find the values needed to solve for the
3rd Quartile and 1st Quartile.

c.1. For 3rd Quartile, Q3:


Lqc - Lower limit of the quartile class = 18.5
i - Class interval = 3
3n/4 - Position of the 3rd quartile = 37.50
<CF - The cumulative frequency before the quartile class = 36
fqc - The frequency of the quartile class = 14

c.2. For 1st Quartile, Q1:


Lqc - Lower limit of the quartile class = 12.5
i - Class interval = 3
n/4 - Position of the 1st quartile = 12.50
<CF - The cumulative frequency before the quartile class = 4
fqc - The frequency of the quartile class = 12

d. Solve for the 3rd quartile and the 1st quartile using the values we obtained in c.

d.1. Using the values obtained in c.1, solve for the 3rd quartile:

$$
\begin{aligned}
Q_3 &= L_{qc} + i\left(\frac{\frac{3n}{4} - {<}CF}{f_{qc}}\right) \\
Q_3 &= 18.5 + 3\left(\frac{37.50 - 36}{14}\right) \\
Q_3 &= 18.5 + 3\left(\frac{1.5}{14}\right) \\
Q_3 &= 18.5 + 3(0.107142857) \\
Q_3 &= 18.5 + 0.321428571 \\
Q_3 &\approx 18.82
\end{aligned}
$$

Therefore, the 3rd quartile is 18.82.

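d.2. Using the values obtained in c.2, solve for the 1st quartile:

$$
\begin{aligned}
Q_1 &= L_{qc} + i\left(\frac{\frac{n}{4} - {<}CF}{f_{qc}}\right) \\
Q_1 &= 12.50 + 3\left(\frac{12.50 - 4}{12}\right) \\
Q_1 &= 12.50 + 3\left(\frac{8.50}{12}\right) \\
Q_1 &= 12.50 + 2.125 \\
Q_1 &= 14.625 \approx 14.63
\end{aligned}
$$

Therefore, the 1st quartile is 14.63.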

e. Since we have already computed Q1 = 14.63 and Q3 = 18.82, we can now solve for the interquartile
range (IQR). Thus,

IQR = Q3 − Q1
IQR = 18.82 − 14.63
IQR = 4.19

Therefore, the IQR of the given data set is 4.19. In other words, the middle 50% of the daily order counts
span about 4.19 orders.
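Because Q1 and Q3 were each rounded to two decimal places before subtracting, the last digit of the IQR depends on when the rounding happens. The short Python check below, a sketch using only the values from step c, compares the two approaches; the variable names are hypothetical.

```python
# Quartiles at full precision, using the values from step c
q3 = 18.5 + 3 * (37.50 - 36) / 14     # about 18.8214
q1 = 12.5 + 3 * (12.50 - 4) / 12      # 14.625

iqr_full = q3 - q1                    # subtract first, round afterwards
iqr_rounded_first = 18.82 - 14.63     # round each quartile first, as in step e

print(f"IQR at full precision: {iqr_full:.4f}")
print(f"IQR from rounded quartiles: {iqr_rounded_first:.2f}")
```

At full precision the IQR is about 4.1964, which rounds to 4.20, while subtracting the rounded quartiles gives 4.19. The difference is small, but it is worth stating which convention is being used when reporting the result.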

The Grouped Standard Deviation, σ
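To solve for the standard deviation, σ, of grouped data, we use:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

For grouped data, each observation x_i is represented by the midpoint m of its class, so a class with
frequency f contributes f(m − x̄)² to the sum.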

Using the data in Example A, solve for the grouped standard deviation:

a. Add another column for the midpoint. To solve for the midpoint, add the smallest and largest values of
each class and divide the sum by two. For example, in Row 1, m = (10 + 12)/2 = 11.

Number of orders Frequency (f ) Midpoint (m)


10-12 4 11
13-15 12 14
16-18 20 17
19-21 14 20

b. Once you have found the midpoints, solve for the grouped mean. Recall the formula for the grouped mean:

$$\bar{x} = \frac{\sum fm}{n}$$

Multiply each midpoint by its corresponding frequency and get the sum.
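Steps a and b, together with one way of applying the standard deviation formula to the result, can be sketched in Python as follows. This is a hedged illustration rather than the course pack's own solution: it assumes each squared deviation is weighted by its class frequency and that the divisor is n − 1, as in the formula above. All variable names are hypothetical.

```python
from math import sqrt

# Classes and frequencies from Example A
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]

# Step a: midpoint of each class
midpoints = [(low + high) / 2 for low, high in classes]

# Step b: grouped mean = sum of f*m divided by n
n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
mean = sum_fm / n

# Assumption: frequency-weighted squared deviations, divided by n - 1
sum_sq_dev = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
variance = sum_sq_dev / (n - 1)
std_dev = sqrt(variance)

print(f"sum of fm = {sum_fm}, mean = {mean:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

With these assumptions the sum of fm is 832 and the grouped mean is 16.64. Dividing by n instead of n − 1 would give the population form of the variance, so the choice of divisor should be stated alongside the result.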

