Descriptive Statistics

CONTENTS
Weighted Average - calculating aggregate marks
If Statement - pass/fail verdicts
Central Tendency - average & median
Dispersion - variance, standard deviation, quartiles & interquartile range
Frequency Distributions - histogram
Descriptive Statistics - summary statistics
Ranking Data - handling data
The data below are the percentage marks obtained in each of two tests
by a group of fifty students.
Test 1 is weighted at 40% and test 2 at 60%.
These weights have been entered in cells A12 and B12, and the cells have been
named W_1 and W_2 respectively.
The task is to calculate the aggregate mark for each student in column C.
To see how to do it, look at the comments in A12, B12 and C13.
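Outside Excel, the same calculation can be sketched in a few lines of Python. The function names here are illustrative, and the pass/fail rule is an assumption inferred from the pass marks shown in the sheet header (35, 35 and 40), not a formula taken from the workbook.

```python
# Python sketch of the sheet's weighted-average calculation.
# W_1 and W_2 are the weights named in cells A12 and B12.
W_1, W_2 = 0.40, 0.60

def aggregate(test1, test2):
    # Equivalent to a cell formula such as =A14*W_1+B14*W_2,
    # rounded to one decimal place as in the marks table.
    return round(test1 * W_1 + test2 * W_2, 1)

def verdict(test1, test2, agg):
    # Hypothetical pass/fail rule inferred from the sheet's pass marks
    # (35, 35 and 40): a student must reach all three thresholds to pass.
    return "PASS" if test1 >= 35 and test2 >= 35 and agg >= 40 else "FAIL"

# First student in the table: 57 in test 1, 38 in test 2.
print(aggregate(57, 38), verdict(57, 38, aggregate(57, 38)))  # 45.6 PASS
```

This reproduces the first rows of the table: 57 and 38 give an aggregate of 45.6 and a PASS, while 50 and 34 give 40.4 and a FAIL (test 2 is below 35).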
SHEET 4 (Dispersion)

Having obtained measures of central tendency for the data, we should now
compute a few measures of dispersion (how the data are spread around their
central value). The commonest measure is the variance and its square root, the standard deviation.
Once again the appropriate Excel functions and formulae are explained in the comments.
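As a cross-check on the worksheet formulae, here is a minimal Python sketch of the population variance and standard deviation, matching Excel's VARP and STDEVP (the function names `varp` and `stdevp` are chosen to mirror the Excel ones).

```python
# Population variance and standard deviation, as Excel's VARP and STDEVP
# compute them: mean squared deviation from the mean, and its square root.
def varp(xs):
    m = sum(xs) / len(xs)                           # arithmetic mean
    return sum((x - m) ** 2 for x in xs) / len(xs)  # divide by n, not n-1

def stdevp(xs):
    return varp(xs) ** 0.5                          # square root of the variance

print(varp([2, 4, 4, 4, 5, 5, 7, 9]))    # 4.0
print(stdevp([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```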
VARIANCE TEST 1
VARIANCE TEST 2
VARIANCE AGGREGATE
ST DEVIATION TEST 1
ST DEVIATION TEST 2
ST DEVIATION AGGREGATE

35 35 40
40% 60%
TEST 1 TEST 2 AGGREGATE VERDICT
When the median has been used as the measure of central tendency,
the appropriate measure of dispersion is given by the interquartile range.
We define:
Q1 - the lower quartile - is the mark below which 25% of the observations lie.
Q3 - the upper quartile - is the mark below which 75% of the observations lie.
Q2 is the median - 50% lie above and 50% lie below.
Consequently the interquartile range is defined as Q3 minus Q1 -
the range of marks obtained by the middle 50% of students.

FOR TEST 1: QUARTILE 1, QUARTILE 2, QUARTILE 3, INTERQUARTILE RANGE
FOR TEST 2: QUARTILE 1, QUARTILE 2, QUARTILE 3, INTERQUARTILE RANGE
FOR AGGREGATE: QUARTILE 1, QUARTILE 2, QUARTILE 3, INTERQUARTILE RANGE

57 38 45.6 PASS
50 34 40.4 FAIL
36 69 55.8 PASS
47 83 68.6 PASS
97 66 78.4 PASS
92 88 89.6 PASS
97 90 92.8 PASS
4 46 29.2 FAIL
6 52 33.6 FAIL
81 70 74.4 PASS
24 90 63.6 FAIL
37 6 18.4 FAIL
96 88 91.2 PASS
70 44 54.4 PASS
52 11 27.4 FAIL
46 2 19.6 FAIL
93 63 75 PASS
7 83 52.6 FAIL
66 46 54 PASS
94 11 44.2 FAIL
71 50 58.4 PASS
85 91 88.6 PASS
95 91 92.6 PASS
0 56 33.6 FAIL
82 57 67 PASS
12 95 61.8 FAIL
97 55 71.8 PASS
11 16 14 FAIL
82 53 64.6 PASS
29 19 23 FAIL
3 98 60 FAIL
91 90 90.4 PASS
34 86 65.2 FAIL
15 38 28.8 FAIL
54 34 42 FAIL
39 56 49.2 PASS
96 11 45 FAIL
42 51 47.4 PASS
51 27 36.6 FAIL
79 76 77.2 PASS
65 91 80.6 PASS
75 80 78 PASS
88 22 48.4 FAIL
98 98 98 PASS
99 48 68.4 PASS
75 8 34.8 FAIL
23 0 9.2 FAIL
26 58 45.2 FAIL
42 11 23.4 FAIL
81 71 75 PASS
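The quartile definitions above can be sketched in Python. This version uses the same linear-interpolation convention as Excel's QUARTILE function (position (n-1)·p in the sorted data); the function names are illustrative.

```python
# Quartiles and interquartile range, using the linear-interpolation
# convention of Excel's QUARTILE: position (n-1)*q/4 in the sorted data.
def quartile(xs, q):
    s = sorted(xs)
    pos = (len(s) - 1) * q / 4      # fractional position of quartile q (q = 1, 2, 3)
    lo = int(pos)
    frac = pos - lo
    if lo + 1 < len(s):
        return s[lo] + frac * (s[lo + 1] - s[lo])   # interpolate between neighbours
    return s[lo]

def interquartile_range(xs):
    return quartile(xs, 3) - quartile(xs, 1)        # Q3 minus Q1

marks = [1, 2, 3, 4]
print(quartile(marks, 1), quartile(marks, 2), quartile(marks, 3))  # 1.75 2.5 3.25
```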
In this sheet we are going to use the Descriptive Statistics option from the Data Analysis routine.
Assuming it has been added in, we proceed as follows.
Select Tools, then Data Analysis, and then Descriptive Statistics.
At the dialogue box that appears, define the input range as A14:C64, then check the "Labels in First Row" box.
Next check the Summary Statistics box and define the output range as E14.
Click OK and the results shown in the following link will be reproduced.
A few terms may not be familiar. They are explained in the links below
the output table.
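The main figures that the Descriptive Statistics routine produces can be reproduced with Python's standard `statistics` module. This is a sketch using only the first ten test 1 marks as a small sample; the dictionary keys simply mirror the routine's output labels.

```python
# Python sketch of the main Descriptive Statistics output figures,
# computed for the first ten test 1 marks from the data table.
import statistics as st

marks = [57, 50, 36, 47, 97, 92, 97, 4, 6, 81]

summary = {
    "Mean": st.mean(marks),
    "Median": st.median(marks),
    "Mode": st.mode(marks),
    "Standard Deviation": st.stdev(marks),   # sample, like Excel's STDEV
    "Sample Variance": st.variance(marks),   # sample, like Excel's VAR
    "Range": max(marks) - min(marks),
    "Minimum": min(marks),
    "Maximum": max(marks),
    "Sum": sum(marks),
    "Count": len(marks),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```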
35 35 40
0.4 0.6
TEST 1 TEST 2 AGGREGATE
[The fifty students' marks are repeated here, with links to the notes below:
STANDARD_ERROR, SAMPLE_VARIANCE, SKEWNESS, KURTOSIS.]
Here we have the aggregate marks data for the student group.
It will often be useful to obtain each mark's rank in the group.
Sorting the data from high to low would clearly do this, but in itself would not
supply usable ranks.
This is where the RANK function is useful.
First we have named the range containing the 50 aggregate marks AGG.
Next we note that the general syntax of the RANK function is:
=RANK(Value to be ranked, Range of values from which to rank, Order)
where Order = 1 makes 1 the rank of the lowest mark, and Order = 0 makes 1 the rank of the highest mark.
With the first mark located in A15, the following will provide ranks (1 = highest)
when copied down column B:
=RANK(A15,AGG,0)
Enter this now in cell B15 and copy down to obtain each mark's rank.
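RANK's behaviour can be sketched in Python. Like Excel's RANK, this version gives tied values the same rank (one more than the count of strictly better values); the function name is illustrative.

```python
# Python sketch of Excel's RANK function. Ties share the same rank,
# exactly as RANK behaves in the worksheet.
def rank(value, values, order=0):
    if order == 0:
        # order 0: 1 = highest; one more than the count of strictly larger values
        return 1 + sum(1 for v in values if v > value)
    # order 1: 1 = lowest; one more than the count of strictly smaller values
    return 1 + sum(1 for v in values if v < value)

agg = [45.6, 40.4, 55.8, 68.6]           # first four aggregate marks
print([rank(a, agg) for a in agg])       # [3, 4, 2, 1]
```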
AGGREGATE RANK (1 = HIGHEST)
[The fifty aggregate marks are repeated here; the RANK formula is copied down
alongside them in column B.]

PROCEED TO SHEET 8 (Spearman's rank correlation)
We have used the RANK function to rank the scores in each test after naming
A7:A56 and B7:B56 as Test1 and Test2 respectively.
We now want to know whether the rankings in each of the tests agree or disagree with one another.
The method of doing this is called Rank Correlation, and the statistic is
known as Spearman's Rank Correlation Coefficient. It lies between
minus one and plus one, with the former indicating perfect disagreement and the
latter indicating perfect agreement.
The first step in calculating the statistic is to use column E to obtain the
difference in the ranks from =C7-D7, copied down to E56.
As you can see, some of these differences are positive and some negative.
There will therefore be a tendency for them to cancel each other out.
We deal with this by squaring the differences in column F from =E7^2.
Do this now and then copy down to F56.
The next step is to sum the squared differences, so do this in the
cell below (I21) from =SUM(F7:F56).
Next we count the number of differences in I22 from =COUNT(F7:F56).
SUM D2
COUNT (N)

Now we define Spearman's Rank Correlation Coefficient as:
1 - 6*SUM D2 / [N*(N^2 - 1)]
So use I27 below to calculate 6*SUM D2 from =6*I21.
6*SUM D2
Then use I29 below to calculate N*(N^2 - 1) from =I22*(I22^2 - 1).
N*(N^2 - 1)
Finally, calculate the rank correlation coefficient in I31 below from =1-I27/I29.
SPEARMAN

PROCEED TO SHEET 9 (Pearson's correlation coefficient)

TEST 1 TEST 2 RANK 1 RANK 2 D D2
57 38 26 35
50 34 30 37
36 69 37 19
47 83 31 13
97 66 3 20
92 88 11 10
97 90 3 7
4 46 48 32
6 52 47 28
81 70 17 18
24 90 41 7
37 6 36 48
96 88 6 10
70 44 23 34
52 11 28 43
46 2 32 49
93 63 10 21
7 83 46 13
66 46 24 32
94 11 9 43
71 50 22 30
85 91 14 4
95 91 8 4
0 56 50 24
82 57 15 23
12 95 44 3
97 55 3 26
11 16 45 42
82 53 15 27
29 19 39 41
3 98 49 1
91 90 12 7
34 86 38 12
15 38 43 35
54 34 27 37
39 56 35 24
96 11 6 43
42 51 33 29
51 27 29 39
79 76 19 16
65 91 25 4
75 80 20 15
88 22 13 40
98 98 2 1
99 48 1 31
75 8 20 47
23 0 42 50
26 58 40 22
42 11 33 43
81 71 17 17
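The step-by-step cell calculation above can be collapsed into one Python function. Note that, like the simple worksheet formula, this assumes untied ranks; where RANK assigns ties the result is only approximate.

```python
# Spearman's rank correlation coefficient from two lists of ranks,
# using the formula built up on the sheet: 1 - 6*SUM(d^2) / (N*(N^2 - 1)).
def spearman(rank1, rank2):
    n = len(rank1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))  # SUM D2
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (perfect agreement)
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (perfect disagreement)
```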
Pearson's correlation coefficient

Spearman's rank correlation coefficient uses ranked ordinal data as the basis of its test
of association between the rankings.
However, for numerical data there is another correlation coefficient, known as
Pearson's Product Moment Correlation Coefficient, that computes the degree of
association between two numerical data sets.
Like Spearman's, it has a minimum value of minus one and a maximum value of plus one. The former
indicates a perfect negative or indirect association, while the latter indicates a perfect
positive or direct linear association.
Excel has a dedicated function to calculate Pearson's coefficient with the following syntax:
=CORREL(First data range,Second data range)
To see it in operation we have reproduced the marks data below.
TEST 1 TEST 2 AGGREGATE
[The fifty students' marks are repeated here in columns A to C.]

To correlate the test 1 marks with the test 2 marks we should write:
=CORREL(A14:A63,B14:B63) - try it in the blue cell below.
Test 1/Test 2
Next we could correlate the test 1 marks with the aggregate from:
=CORREL(A14:A63,C14:C63) - try it in the blue cell below.
Test 1/Aggregate
Finally we can correlate the test 2 marks with the aggregate from:
=CORREL(B14:B63,C14:C63) - try it in the blue cell below.
Test 2/Aggregate

From this we conclude that the association between test 1 and test 2 marks is
not very high - performing well in one does not suggest that a student will perform
well in the other.
However, the high positive correlation between the test 2 marks and the aggregate
suggests that performance in test 2 is highly associated with overall performance.

Finally, associated with the correlation coefficient is a statistic known as the
coefficient of determination. This is simply the square of the correlation coefficient
and is referred to in Excel as R squared or R2.
Being the square of a number that is always less than or equal to one in magnitude, it will
always be less than or equal to the absolute value of the correlation coefficient.
Also, being a square, it will always be positive and so gives no indication of the
direction of the association (positive/direct or negative/indirect).
The Excel function to calculate R squared is:
=RSQ(First data range,Second data range)
It is left as an exercise to calculate the R squared values for each of the three
combinations.
TEST 1/TEST 2
TEST 1/AGGREGATE
TEST 2/AGGREGATE
Test 1/Aggregate correlation: 0.65352
Test 2/Aggregate correlation: 0.84388
R squared Test 1/Test 2: 0.021137
R squared Test 1/Aggregate: 0.427088
R squared Test 2/Aggregate: 0.712134
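Both CORREL and RSQ are straightforward to sketch in Python, which makes the relationship between them (RSQ is simply the square) explicit; the function names mirror the Excel ones.

```python
# Python sketch of Excel's CORREL (Pearson's product moment correlation
# coefficient) and RSQ (the coefficient of determination, its square).
def correl(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-deviation
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def rsq(xs, ys):
    return correl(xs, ys) ** 2   # always positive: no direction information

print(correl([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect direct association)
print(correl([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect indirect association)
```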
Excel has added the category "More" to its bin range, but its frequency is zero.
This is because there are no observations greater than 100.
The frequency column gives the absolute number of marks in the intervals:
0-10 (5), 11-20 (3), 21-30 (4), and so on.
These are shown as the columns of the histogram.
The cumulative % column gives the percentage of observations that are less than
or equal to the interval mark. Thus 5 out of 50 (10%) marks are less than or equal to 10,
while 5 + 3 + 4 = 12 out of 50 (24%) are less than or equal to 30.
This is shown as the line graph on the histogram.
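The binning and cumulative % columns can be sketched in Python using the same bin convention as Excel's histogram tool: each bin counts observations greater than the previous bin and up to and including it, with a final "More" category above the last bin. The function names are illustrative.

```python
# Python sketch of the Frequency and Cumulative % columns.
def frequency(data, bins):
    counts, prev = [], float("-inf")
    for b in bins:
        # count observations in the half-open interval (prev, b]
        counts.append(sum(1 for x in data if prev < x <= b))
        prev = b
    counts.append(sum(1 for x in data if x > bins[-1]))  # the "More" category
    return counts

def cumulative_pct(counts):
    total, n, out = 0, sum(counts), []
    for c in counts:
        total += c
        out.append(100 * total / n)   # running total as a percentage
    return out

data = [5, 15, 15, 25, 101]
print(frequency(data, [10, 20, 30]))                  # [1, 2, 1, 1]
print(cumulative_pct(frequency(data, [10, 20, 30])))  # [20.0, 60.0, 80.0, 100.0]
```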
35 35 40
40% 60%
TEST 1 TEST 2 AGGREGATE
[The fifty students' marks are repeated here in columns A to C.]

INTERVALS Frequency Cumulative %
10 5 10.00%
20 3 16.00%
30 4 24.00%
40 4 32.00%
50 5 42.00%
60 4 50.00%
70 3 56.00%
80 4 64.00%
90 6 76.00%
100 12 100.00%
More 0 100.00%
[Chart: histogram with frequency columns and a cumulative % line over the mark intervals.]
35 35 40
40% 60%
TEST 1 TEST 2 AGGREGATE
[The fifty students' marks are repeated here in columns A to C.]

Descriptive Statistics output for TEST 1:
Mean 57.84
Standard Error 4.516849182
Median 61
Mode 97
Standard Deviation 31.93894686
Sample Variance 1020.096327
Kurtosis -1.23718592
Skewness -0.31490202
Range 99
Minimum 0
Maximum 99
Sum 2892
Count 50

Links to the notes below: STANDARD_ERROR, SAMPLE_VARIANCE, SKEWNESS, KURTOSIS
[Corresponding output tables for TEST 2 and AGGREGATE appear alongside in the worksheet.]
Data Analysis is an Add-In and may not automatically be installed in your version of Excel.
To install it, select Tools and then Add-Ins, and then check the first two boxes, viz:
Analysis ToolPak and Analysis ToolPak VBA.
The standard error is a sampling concept and is defined as the sample standard deviation
divided by the square root of the sample size. Thus for test 1 we have =F19/SQRT(F27), which gives 4.516849.
Excel has calculated the sample as opposed to the population variance, and so produces a slightly different
answer from our earlier calculations.
However, the answer returned by the Descriptive Statistics routine can be reproduced exactly if we use
VAR rather than VARP. For example: 1020.096. QED. The same is true for the standard deviation;
that is, use STDEV rather than STDEVP: 31.93895. QED.
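The VAR/VARP distinction and the standard error can be made concrete in a short Python sketch (illustrative function names):

```python
# Sample vs population variance (Excel's VAR vs VARP) and the standard error.
def var_sample(xs):        # like VAR: divides by n - 1
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def var_population(xs):    # like VARP: divides by n
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def standard_error(xs):    # sample standard deviation / sqrt(sample size)
    return var_sample(xs) ** 0.5 / len(xs) ** 0.5

xs = [2, 4, 6]
print(var_sample(xs))      # 4.0
print(var_population(xs))  # ≈ 2.667 - slightly smaller, as the sheet notes
```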
Skewness is the extent to which the histogram of the data is "off centre" (asymmetrical).
Skewness can be either negative, zero, or positive.
With negative skewness the histogram has significantly more observations to the right than the left (tail to the
left, peak to the right).
With zero skewness the histogram is symmetrical (peak in the middle, equivalent tails on either side).
With positive skewness the histogram has significantly more observations to the left than the right (tail to the
right, peak to the left).
The negative value calculated by Excel for the test 1 data (-0.315) confirms what the histogram shows -
namely that the marks are peaked to the right (a relatively large number of students with high marks).
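Skewness can also be computed directly. This sketch uses the formula documented for Excel's SKEW function; negative output means a left tail (peak to the right), positive output a right tail (peak to the left).

```python
# Python sketch of Excel's SKEW formula:
# n/((n-1)*(n-2)) * sum(((x - mean)/s)^3), with s the sample standard deviation.
def skew(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5   # sample std dev
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

print(skew([1, 2, 3]))         # 0.0 - symmetrical data
print(skew([1, 1, 1, 5]) > 0)  # True - tail to the right, peak to the left
```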