Descriptive Statistics

INFORMATION SHEET 2.

Descriptive Statistics

Learning Objectives

After reading this Information Sheet, you must be able to:

1. Define descriptive statistics.
2. Understand univariate analysis and its three major characteristics.
3. Use different statistical tools.

Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They
provide simple summaries about the sample and the measures. Together with simple graphics
analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics. With
descriptive statistics you are simply describing what is or what the data shows. With
inferential statistics, you are trying to reach conclusions that extend beyond the immediate
data alone.

Descriptive statistics are used to present quantitative descriptions in a manageable
form. In a research study we may have lots of measures. Or we may measure a large number of
people on any measure. Descriptive statistics help us to simplify large amounts of data in a
sensible way. Each descriptive statistic reduces lots of data into a simpler summary.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are
three major characteristics of a single variable that we tend to look at:

• the distribution
• the central tendency
• the dispersion

In most situations, we would describe all three of these characteristics for each of the
variables in our study.

The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a
variable. The simplest distribution would list every value of a variable and the number of persons who
had each value. For instance, a typical way to describe the distribution of college students is by year
in college, listing the number or percent of students at each of the four years. Or, we describe gender
by listing the number or percent of males and females. In these cases, the variable has few enough
values that we can list each one and summarize how many sample cases had the value. But what do
we do for a variable like income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case, we group the raw scores into
categories according to ranges of values. For instance, we might look at GPA according to the letter
grade ranges. Or, we might group income into four or five ranges of income values.

Category              Percent

Under 35 years old    9%
36–45                 21%
46–55                 45%
56–65                 19%
66+                   6%

One of the most common ways to describe a single variable is with a frequency distribution.
Depending on the particular variable, all of the data values may be represented, or you may group the
values into categories first. For variables such as age, price, or temperature, it is usually not
sensible to determine the frequency of each individual value; instead, the values are grouped into
ranges and the frequency of each range is determined. Frequency distributions can be depicted in two ways, as a table or as a
graph. The table above shows an age frequency distribution with five categories of age ranges
defined. The same frequency distribution can be depicted in a graph as shown in Figure 1. This type
of graph is often referred to as a histogram or bar chart.

Figure 1. Frequency distribution bar chart.

Distributions may also be displayed using percentages. For example, you could use percentages
to describe the:

• percentage of people in different income levels
• percentage of people in different age ranges
• percentage of people in different ranges of standardized test scores
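
The grouping described above can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made-up data (not the data behind the table in this sheet), the helper name `age_category` is hypothetical, and the sketch assumes that an age of exactly 35 belongs to the first category.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only.
ages = [23, 31, 38, 41, 44, 47, 49, 50, 52, 53, 55, 58, 61, 64, 70]

# Category labels modeled on the age table; the boundary at 35 is an assumption.
labels = ["35 and under", "36-45", "46-55", "56-65", "66+"]

def age_category(age):
    """Return the label of the age range that contains `age`."""
    if age <= 35:
        return labels[0]
    if age <= 45:
        return labels[1]
    if age <= 55:
        return labels[2]
    if age <= 65:
        return labels[3]
    return labels[4]

# Frequency of each category, then the same distribution as percentages.
counts = Counter(age_category(a) for a in ages)
percents = {label: 100 * counts[label] / len(ages) for label in labels}

for label in labels:
    print(f"{label:<14}{counts[label]:>3}{percents[label]:>7.1f}%")
```

The same counts could feed a bar chart; the table form is shown here because it needs no plotting library.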

Central Tendency

The central tendency of a distribution is an estimate of the “center” of a distribution of values.


There are three major types of estimates of central tendency:

• Mean
• Median
• Mode

The Mean or average is probably the most commonly used method of describing central
tendency. To compute the mean, add up all the values and divide by the number of
values. For example, the mean quiz score is determined by summing all the scores
and dividing by the number of students who took the quiz. Consider the test score
values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
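
As a quick check of this arithmetic, here is a minimal Python sketch using the eight scores from this sheet; the variable names are just illustrative.

```python
# The eight test scores from the example.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

total = sum(scores)          # add up all the values
mean = total / len(scores)   # divide by the number of values

print(total, mean)  # 167 20.875
```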

The Median is the score found at the exact middle of the set of values. One way to
compute the median is to list all scores in numerical order, and then locate the score in the
center of the sample. For example, if there are 501 scores in the list, score #251 would be the
median. If we order the 8 scores shown above, we would get:

15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores, so scores #4 and #5 sit on either side of the halfway point. Since both
of these scores are 20, the median is 20. If the two middle scores had different values, you
would have to interpolate to determine the median.
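
The procedure can be sketched as a small Python function. One assumption is made here: "interpolate" is taken to mean averaging the two middle scores, which is the usual convention but is not spelled out in this sheet. The function name `median` is illustrative.

```python
def median(values):
    """Return the middle value of `values`; average the two middle
    values when the count is even (assumed interpolation rule)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the pair

scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(median(scores))        # 20.0 (scores #4 and #5 are both 20)
print(median([15, 20, 25]))  # 20
print(median([15, 20, 22, 36]))  # 21.0 (halfway between 20 and 22)
```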

The Mode is the most frequently occurring value in the set of scores. To determine the
mode, you might again order the scores as shown above, and then count each one. The most
frequently occurring value is the mode. In our example, the value 15 occurs three times and is
the mode. In some distributions there is more than one modal value. For instance, in a bimodal
distribution there are two values that occur most frequently.
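
Counting each value, as described above, might look like this in Python. The helper name `modes` is hypothetical; it returns a list so that a bimodal set of scores yields both modal values.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

scores = [15, 20, 21, 20, 36, 15, 25, 15]
print(modes(scores))           # [15], since 15 occurs three times
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal example
```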

Notice that for the same set of 8 scores we got three different values (20.875, 20,
and 15) for the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-
shaped), the mean, median and mode are all equal to each other.

Dispersion

Dispersion refers to the spread of the values around the central tendency. There are two
common measures of dispersion: the range and the standard deviation. The range is simply the
highest value minus the lowest value. In our example distribution, the high value is 36 and the
low is 15, so the range is 36 - 15 = 21.
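
A short Python sketch of the range, using the same eight scores. The second calculation, which drops the single highest score, is an added illustration of how strongly one extreme value can affect the range.

```python
scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Range: highest value minus lowest value.
value_range = max(scores) - min(scores)
print(value_range)  # 21

# Illustration only: recompute the range without the highest score.
without_top = sorted(scores)[:-1]
trimmed_range = max(without_top) - min(without_top)
print(trimmed_range)  # 10
```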

The Standard Deviation is a more accurate and detailed estimate of dispersion because
an outlier can greatly exaggerate the range (as was true in this example, where the single outlier
value of 36 stands apart from the rest of the values). The Standard Deviation shows the relation
that the set of scores has to the mean of the sample. Again let's take the set of scores:

15, 20, 21, 20, 36, 15, 25, 15

To compute the standard deviation, we first find the distance between each value and the mean.
We know from above that the mean is 20.875. So, the differences from the mean are:

15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875

Notice that values that are below the mean have negative discrepancies and values
above it have positive ones. Next, we square each discrepancy:

-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
+15.125 * +15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625

Now, we take these “squares” and sum them to get the Sum of Squares (SS) value.
Here, the sum is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the
result is 350.875 / 7 = 50.125. This value is known as the variance. To get the standard
deviation, we take the square root of the variance (remember that we squared the deviations
earlier). This would be SQRT(50.125) = 7.079901129253.
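
The hand calculation above can be retraced step by step in Python. This is a minimal sketch that follows the same order of operations (deviations, squares, Sum of Squares, variance, square root); the variable names are illustrative.

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = sum(scores) / len(scores)            # 20.875

# Step 1: distance of each score from the mean.
deviations = [x - mean for x in scores]

# Step 2: square each deviation.
squares = [d * d for d in deviations]

# Step 3: Sum of Squares (SS).
ss = sum(squares)                           # 350.875

# Step 4: variance = SS divided by (number of scores minus 1).
variance = ss / (len(scores) - 1)           # 50.125

# Step 5: standard deviation = square root of the variance.
sd = math.sqrt(variance)

print(ss, variance, round(sd, 4))  # 350.875 50.125 7.0799
```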

Although this computation may seem convoluted, it’s actually quite simple. To see this,
consider the formula for the standard deviation:

$$ s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} $$

where:

• X is each score,
• X̄ is the mean (or average),
• n is the number of values,
• Σ means we sum across the values.

In the top part of the ratio, the numerator, we see that each score has the mean
subtracted from it, the difference is squared, and the squares are summed. In the bottom part,
we take the number of scores minus 1. The ratio is the variance and the square root is the
standard deviation. In English, we can describe the standard deviation as:

the square root of the sum of the squared deviations from the mean divided by the number of
scores minus one.

Although we can calculate these univariate statistics by hand, it gets quite tedious when you
have more than a few values and variables. Every statistics program is capable of calculating
them easily for you. For instance, I put the eight scores into SPSS and got the following table as
a result:

Metric                Value

N                     8
Mean                  20.8750
Median                20.0000
Mode                  15.00
Standard Deviation    7.0799
Variance              50.1250
Range                 21.00

which confirms the calculations I did by hand above.
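
SPSS is only one option. As a hedged illustration, the same summary can be reproduced with Python's standard `statistics` module, whose `stdev` and `variance` functions use the "number of scores minus 1" divisor described above. The layout of the printed summary is my own and is not SPSS output.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

summary = {
    "N": len(scores),
    "Mean": statistics.mean(scores),
    "Median": statistics.median(scores),
    "Mode": statistics.mode(scores),
    "Standard Deviation": statistics.stdev(scores),  # sample SD (n - 1)
    "Variance": statistics.variance(scores),         # sample variance (n - 1)
    "Range": max(scores) - min(scores),
}

for metric, value in summary.items():
    print(f"{metric:<20}{value:>10.4f}")
```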

The standard deviation allows us to reach some conclusions about specific scores in our
distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the
following conclusions can be reached:

• approximately 68% of the scores in the sample fall within one standard deviation of the mean
• approximately 95% of the scores in the sample fall within two standard deviations of the mean
• approximately 99.7% of the scores in the sample fall within three standard deviations of the mean

For instance, since the mean in our example is 20.875 and the standard deviation
is 7.0799, we can estimate from the statements above that approximately 95% of the scores will
fall in the range of 20.875-(2*7.0799) to 20.875+(2*7.0799), or between 6.7152 and 35.0348.
This kind of information is a critical stepping stone to enabling us to compare the performance of
an individual on one variable with their performance on another, even when the variables are
measured on entirely different scales.
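
The interval above can be computed directly, along with the share of the eight scores that actually fall inside it. The sketch below is illustrative only: the helper names `share_within` and `z_score` are hypothetical, and the standard score (the distance from the mean measured in standard deviations) is a common technique that this sheet does not name, shown here as one possible way to put scores from different scales on a common footing.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1)

def share_within(k):
    """Return (lower, upper, share) for the interval mean +/- k * sd."""
    lower, upper = mean - k * sd, mean + k * sd
    inside = [x for x in scores if lower <= x <= upper]
    return lower, upper, len(inside) / len(scores)

for k in (1, 2, 3):
    lower, upper, share = share_within(k)
    print(f"within {k} SD: {lower:.4f} to {upper:.4f}, share = {share:.3f}")

def z_score(x):
    """Distance of x from the mean in standard deviation units (assumed technique)."""
    return (x - mean) / sd

print(round(z_score(36), 2))  # about 2.14 standard deviations above the mean
```

With only eight scores and one outlier, the observed shares differ from the 68%, 95%, and 99.7% figures, which describe a normal distribution rather than this small sample.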

Reference:
W. M. K. Trochim (10 March 2020). Research Methods Knowledge Base. Conjointly.
https://conjointly.com/kb/descriptive-statistics/
