Lectures_1_2_notes

What is the Mean and How to Find It: Definition & Formula

What is the Mean?

The mean in math and statistics summarizes an entire dataset with a single number
representing the data’s center point or typical value. It is also known as the
arithmetic mean, and it is the most common measure of central tendency. It is
frequently called the “average.”

Learn how to find the mean and know when it is and is not a good statistic to use!

How to Find the Mean

Finding the mean is simple: add all the values and divide by the number of
observations. The mean formula is below:

mean = (sum of all values) / (number of observations)

For example, suppose the heights of five people are 48, 51, 52, 54, and 56 inches.
Here’s how to find the mean:

(48 + 51 + 52 + 54 + 56) / 5 = 261 / 5 = 52.2

Their average height is 52.2 inches.
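The same calculation can be sketched in a few lines of Python. This is only an illustration; the variable names are our own, and the data are the five heights from the example.

```python
# Heights (in inches) of the five people from the example.
heights = [48, 51, 52, 54, 56]

# Mean = sum of all values divided by the number of observations.
mean_height = sum(heights) / len(heights)

print(mean_height)
```

Python's standard library also provides statistics.mean, which gives the same result for these data.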

Mean Formula

There are two versions of the mean formula in math: the sample and population
formulas. In each case, the process for how to find the mean mathematically does not
change. Add the values and divide by the number of values. However, the formula
notation differs between the two types.

Sample Mean Formula

The sample mean formula is the following:

x̄ = (∑x) / n

Module-1 Lectures:1 &2 Bhaskar


Where:

o x̄ is the sample average of variable x.
o ∑x = sum of the n values.
o n = number of values in the sample.

Typically, the sample formula notation uses lowercase letters.

Population Mean Formula

The population mean formula is the following:

µ = (∑X) / N

Where:

o µ is the population average.
o ∑X = sum of the N values.
o N = number of values in the population.

Typically, the population mean formula notation uses Greek and uppercase
letters.

When Do You Use the Average?

Ideally, the mean in math (or the average) indicates the region where most values in
a distribution fall. Statisticians refer to it as the central location of a distribution. You
can think of it as the tendency of data to cluster around a middle value.
The histogram below illustrates the average accurately finding the center of the
data’s distribution.

However, the average does not always find the center of the data. It is sensitive to
skewed data and extreme values. For example, when the data are skewed, it can miss
the mark. In the histogram below, the average is outside the area with the most
common values.

This problem occurs because outliers have a substantial impact on the mean.
Extreme values in an extended tail pull it away from the center. As the distribution
becomes more skewed, the average is drawn further away from the center.

In these cases, the average can be misleading because it might not be near the most
common values. Consequently, it’s best to use the average to measure the central
tendency when you have a symmetric distribution.

For skewed distributions, it’s often better to use the median or trimmed mean,
which use different methods to find the central location. Note that the average
provides no information about the variability present in a distribution. To evaluate
that characteristic, assess the standard deviation.
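To see how a single extreme value pulls the mean while the median and trimmed mean stay near the bulk of the data, here is a small sketch. The dataset is hypothetical, and trimmed_mean is our own helper written for illustration, not a standard library function.

```python
from statistics import mean, median


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Hypothetical right-skewed data: most values lie between 20 and 31,
# plus one extreme value in the tail.
data = [20, 22, 23, 25, 26, 27, 28, 30, 31, 250]

print("mean:        ", mean(data))               # pulled upward by 250
print("median:      ", median(data))             # middle of the ordered values
print("trimmed mean:", trimmed_mean(data, 0.1))  # drops the lowest and highest value
```

With these values, the mean lands well above every observation except the outlier, while the other two statistics stay inside the cluster.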

Standard Deviation: Interpretations and Calculations


The standard deviation (SD) is a single number that summarizes the variability in a
dataset. It represents the typical distance between each data point and the mean.
Smaller values indicate that the data points cluster closer to the mean—the values in
the dataset are relatively consistent. Conversely, higher values signify that the values
spread out further from the mean. Data values become more dissimilar, and extreme
values become more likely.

The standard deviation uses the original data units, simplifying the interpretation.
For this reason, it is the most widely used measure of variability. Suppose a pizza
restaurant measures its delivery time in minutes and has an SD of 5. In that case, the
interpretation is that the typical delivery occurs 5 minutes before or after the mean
time. Statisticians often report the standard deviation with the mean: 20 minutes
(StDev 5). If another pizza restaurant has a standard deviation of 10 minutes, we
know that its delivery service is more inconsistent. We’ll assess this example more
closely later on in lecture 3.

In this note, you will learn why the standard deviation is essential, work through an
interpretation example, and learn how to calculate it by hand.

Why is the Standard Deviation Important?

Understanding the standard deviation is crucial. While the mean identifies a central
value in the distribution, it does not indicate how far the data points fall from the
center. Higher SD values signify that more data points are further away from the
mean.

Variability is everywhere. When you order a favourite meal at a restaurant, it isn’t
exactly the same each time. Your drive time to work varies every day. Parts from an
assembly line might seem identical, but they have subtly different lengths and widths.

When variability is high, you can expect to experience extreme values more
frequently, which can cause problems! If the restaurant meal differs noticeably from
the usual, you might not like it at all. When your morning commute takes much
longer than the average travel time, you will be late. And when manufactured parts
are too far from their intended dimensions, the system won’t perform correctly.

Frequently, we feel distressed at the extremes more than the mean. The standard
deviation helps you understand the variability and provides vital information about
the consistency of outcomes or lack thereof.

Example of Using the Standard Deviation

Suppose two pizza restaurants advertise a 20-minute average delivery time. We’re
starving and both look equally good! However, we know the mean does not tell the
entire story!

Let’s assess their standard deviations to choose the restaurant. Imagine we obtain
their delivery time data. One restaurant has an SD of 10 minutes while the other has
an SD of 5 minutes. How does this affect deliveries?

The graphs below incorporate the SDs to answer this question. The restaurant with
the larger standard deviation (10 minutes) has more variable delivery times and a
broader distribution curve.

NOTE: The area under the curve always equals one, irrespective of the shape (normal
or square).

In these charts, we’ll consider a 30-minute wait or longer to be unacceptable, since
we’re hungry! The shaded areas represent the percentage of delivery times exceeding 30
minutes. Almost 16% of deliveries for the high-variability pizza joint exceed 30
minutes compared to only about 2% for the low-variability restaurant. They both have a
mean delivery time of 20 minutes, but I know where I’d place my order when I’m
hungry!
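The shaded percentages can be reproduced with a short sketch, assuming the delivery times follow a normal distribution with the stated mean and SD. The function name share_over is our own; NormalDist comes from Python's standard statistics module.

```python
from statistics import NormalDist


def share_over(cutoff, mean, sd):
    """Share of a normal distribution that lies above `cutoff`."""
    return 1 - NormalDist(mu=mean, sigma=sd).cdf(cutoff)


mean_minutes = 20
cutoff = 30  # a wait of 30 minutes or longer counts as unacceptable

for sd in (10, 5):
    late = share_over(cutoff, mean_minutes, sd)
    print(f"SD = {sd:>2} minutes: {late:.1%} of deliveries exceed {cutoff} minutes")
```

A cutoff of 30 minutes is one SD above the mean for the first restaurant and two SDs above the mean for the second, which is why the two tail areas differ so much.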

After calculating the standard deviation, you can use various methods to evaluate it.
The graphs above incorporate the SD into the normal probability distribution.
Alternatively, you can use the Empirical Rule or Chebyshev’s Theorem to assess
how the standard deviation relates to the distribution of values [1]. You can also
calculate the coefficient of variation (CV), which uses both the SD and the mean.
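As a rough illustration of two of these tools, the sketch below computes the coefficient of variation (SD divided by the mean) for the two pizza restaurants and the Chebyshev lower bound (at least 1 - 1/k² of values lie within k SDs of the mean, for any distribution and k > 1). The function names are our own.

```python
def coefficient_of_variation(sd, mean):
    """SD expressed as a fraction of the mean."""
    return sd / mean


def chebyshev_lower_bound(k):
    """Minimum share of values within k SDs of the mean (any distribution, k > 1)."""
    return 1 - 1 / k ** 2


# Both restaurants have a mean delivery time of 20 minutes.
for sd in (5, 10):
    print(f"SD = {sd:>2}: CV = {coefficient_of_variation(sd, 20):.0%}")

for k in (2, 3):
    print(f"Within {k} SDs of the mean: at least {chebyshev_lower_bound(k):.1%} of values")
```

The Empirical Rule gives tighter figures (about 68%, 95%, and 99.7% within 1, 2, and 3 SDs), but it applies only to normally distributed data.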

Standard Deviation Formula

The formula for the standard deviation is below.

s = √( ∑(Xi - x̄)² / (N - 1) )

o s = the sample StDev
o N = number of observations
o Xi = value of each observation
o x̄ = the sample mean

Statisticians refer to the numerator portion of the standard deviation formula as
the sum of squares. (Remember why we square the differences, as we did in class?
Simply adding deviations of +1 and -1 gives zero, which would suggest no deviation
at all!)
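A quick sketch of that point, using the five heights from the mean example (48, 51, 52, 54, and 56 inches): the raw deviations cancel out, while the squared deviations do not. Variable names are our own.

```python
heights = [48, 51, 52, 54, 56]
mean_height = sum(heights) / len(heights)

# Deviation of each observation from the mean.
deviations = [x - mean_height for x in heights]

# Positive and negative deviations cancel (up to floating-point rounding).
sum_of_deviations = sum(deviations)

# Squaring makes every term non-negative, so nothing cancels.
sum_of_squares = sum(d ** 2 for d in deviations)

print("sum of deviations:", round(sum_of_deviations, 10))
print("sum of squares:   ", round(sum_of_squares, 2))
```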

Technically, this formula is for the sample standard deviation. The population
version uses N in the denominator. Think about the differences between the
population and sample varieties, i.e., why we divide by (N - 1) in the case of samples.
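Python's standard statistics module offers both versions, which makes the effect of the denominator easy to see. The sketch below reuses the five heights from the mean example and treats them first as a sample and then as a whole population.

```python
from statistics import pstdev, stdev

heights = [48, 51, 52, 54, 56]

sample_sd = stdev(heights)       # divides the sum of squares by N - 1
population_sd = pstdev(heights)  # divides the sum of squares by N

print(f"sample SD:     {sample_sd:.3f}")
print(f"population SD: {population_sd:.3f}")
```

Dividing by N - 1 always gives the slightly larger value, and the gap shrinks as the number of observations grows.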

Step-by-Step Example of Calculating the Standard Deviation

Calculating the standard deviation involves the following steps. The numbers
correspond to the column numbers.

The calculations take each observation (1), subtract the sample mean (2) to calculate
the difference (3), and square that difference (4).

[1] Note: It is not important for us to remember these names; they are just for
information.

Then, at the bottom, sum the column of squared differences and divide it by 16 (17
– 1 = 16), which equals 201. Statisticians call this value the variance.

Calculate the square root of the variance to derive the SD, which is about 14.18
here. (Question: why not just leave it at the variance? Why take the square root?)
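The column-by-column procedure can be sketched as follows. The five observations are hypothetical stand-ins, not the 17 values from the lecture table, and the result is cross-checked against statistics.stdev.

```python
from math import sqrt
from statistics import stdev

# Hypothetical observations (NOT the 17 values used in the lecture table).
observations = [4, 8, 6, 5, 7]

n = len(observations)
sample_mean = sum(observations) / n

print("value  mean  difference  squared")
sum_of_squares = 0.0
for x in observations:
    difference = x - sample_mean  # column 3: observation minus mean
    squared = difference ** 2     # column 4: squared difference
    sum_of_squares += squared
    print(f"{x:5}  {sample_mean:4.1f}  {difference:10.1f}  {squared:7.1f}")

variance = sum_of_squares / (n - 1)  # sample variance divides by n - 1
sd = sqrt(variance)                  # square root returns to the original units

print(f"variance = {variance:.2f}, SD = {sd:.2f}")
print(f"statistics.stdev gives {stdev(observations):.2f}")
```

Because the variance is in squared units, taking the square root is what puts the SD back on the same scale as the data.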

You can also use the range of a dataset to estimate the standard deviation using the
range rule of thumb (the SD is roughly the range divided by 4).

The standard deviation is similar to the mean absolute deviation. Both statistics use
the original data units and they compare the data points to the mean to assess
variability. However, there are differences. To learn more, read about the mean
absolute deviation (MAD).

Mean, Median, and Mode: Measures of Central Tendency

What is Central Tendency?

Measures of central tendency are summary statistics that represent the center point
or typical value of a dataset. Examples of these measures include the mean, median,
and mode. These statistics indicate where most values in a distribution fall and are
also referred to as the central location of a distribution. You can think of central
tendency as the propensity for data points to cluster around a middle value.

In statistics, the mean, median, and mode are the three most common measures of
central tendency. Each one calculates the central point using a different method.
Choosing the best measure of central tendency depends on the type of data you
have. In this post, I explore the mean, median, and mode as measures of central
tendency, show you how to calculate them, and explain how to determine which one
is best for your data.

Locating the Measures of Central Tendency

Most articles about the mean, median, and mode focus on how you calculate these
measures of central tendency. I’m going to start by illustrating the central point of
several datasets graphically so that you understand the goal. Then, we’ll move on to
choosing the best measure of central tendency for your data and the calculations.

The three distributions below represent different data conditions. In each
distribution, look for the region where the most common values fall. Even though
the shapes and type of data are different, you can find that central tendency. That’s
the area in the distribution where the most common values are located. These
examples cover the mean, median, and mode, or the 3Ms.

As the graphs highlight, you can see where most values tend to occur. That’s the
concept. Measures of central tendency represent this idea with a value. Coming up,
you’ll learn that as the distribution and kind of data change, so does the best
measure of central tendency. Consequently, you need to know the type of data you
have, and graph it, before choosing between the mean, median, and mode!

Median

The median is the middle value. It is the value that splits the dataset in half, making
it a natural measure of central tendency.

To find the median, order your data from smallest to largest, and then find the data
point that has an equal number of values above it and below it. The method for
locating the median varies slightly depending on whether your dataset has an even
or odd number of values. I’ll show you how to find the median for both cases. In
the examples below, I use whole numbers for simplicity, but you can have decimal
places.

In the dataset with the odd number of observations, notice how the number 12 has
six values above it and six below it. Therefore, 12 is the median of this dataset.

When there is an even number of values, you count in to the two innermost values
and then take the average. The average of 27 and 29 is 28. Consequently, 28 is the
median of this dataset.
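Both cases can be sketched in one small function. The two datasets are hypothetical, chosen only so that they match the description above (13 values with 12 in the middle, and an even-sized set whose innermost values are 27 and 29); find_median is our own name.

```python
def find_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical data: 13 values, with six below 12 and six above it.
odd_data = [3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 20, 22, 25]

# Hypothetical data: 8 values, with 27 and 29 as the two innermost values.
even_data = [20, 23, 25, 27, 29, 31, 34, 38]

print(find_median(odd_data))
print(find_median(even_data))
```

The standard library function statistics.median follows the same rule.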

Outliers and skewed data have a smaller effect on the median than on the mean as a
measure of central tendency. To understand why, imagine we have the Median
dataset below and find that the median is 46. However, we discover data entry errors
and need to change four values, which are shaded in the Median Fixed dataset. We’ll
make them all significantly higher so that we now have a skewed distribution with
large outliers.

As you can see, the median doesn’t change at all. It is still 46. When comparing the
mean vs median, the mean depends on all values in the dataset while the median
does not. Consequently, when some of the values are more extreme, the effect on
the median is smaller. Of course, with other types of changes, the median can
change. When you have a skewed distribution, the median is a better measure of
central tendency than the mean.
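The same experiment can be sketched in code. The numbers are hypothetical stand-ins for the Median and Median Fixed datasets (the originals were shown as a figure); they are chosen so that the median is 46 and the four largest values are the ones changed.

```python
from statistics import mean, median

# Hypothetical stand-in for the "Median" dataset.
original = [40, 42, 44, 45, 46, 47, 48, 50, 52]

# Hypothetical stand-in for "Median Fixed": the four highest values are raised.
fixed = [40, 42, 44, 45, 46, 147, 148, 150, 152]

print("median:", median(original), "->", median(fixed))
print("mean:  ", round(mean(original), 2), "->", round(mean(fixed), 2))
```

The median depends only on which value sits in the middle of the ordered data, so changing values that stay on the same side of it leaves it untouched, while the mean shifts because it uses every value.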

Mean vs Median as Measures of Central Tendency

Now, let’s compare the mean vs median as measures of central tendency on
symmetrical and skewed distributions to see how they perform. The histograms
below allow us to compare these two statistics directly.

In a symmetric distribution, the mean and median both find the center accurately.
They are approximately equal, and both are valid measures of central tendency.

In a skewed distribution, the outliers in the tail pull the mean away from the center
towards the longer tail. For this example, the mean and median differ by over 9,000.
The median better represents the central tendency for the skewed distribution.

These data are based on the U.S. household income for 2006. Income is the classic
example of when to use the median instead of the mean because its distribution
tends to be skewed. The median indicates that half of all incomes fall below 27,581,
and half are above it. For these data, the mean overestimates where most household
incomes fall.

NOTE: The median is a robust statistic, while the mean is sensitive to outliers and
skewed distributions.
