Statistical Analysis For Environmental Systems and Societies



Statistical analysis
1 State that error bars are a graphical representation of the variability of data.
2 Calculate the mean and standard deviation (SD) of a set of values.
3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the
values fall within one standard deviation of the mean.
4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more
samples.
5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate
tables.
6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.



Statistical analysis Keywords

Arithmetic mean Uncertainty Correlation Error bars
Relationship Significance Spread Standard deviation
t-test Value Variability Variable



























Describing variation mathematically:

Living things can vary so that even two peas in a pod show a variety of sizes and shapes.
This raises a number of questions. How can we describe the range of variation? Which
pea size is the most common? Can we sort the peas into groups to decide if they came
from the same or different pods? Biologists ask these types of questions not only about
living organisms but also about sets of data from experiments.

The Arithmetic Mean:

A group of ten students were tested for shoe size. The results are listed here:

Group A 5 6 8 7 8 6 7 7 9 7

The arithmetic mean is the total divided by the number of results, so:

Total = 70
Number of results = 10

Mean = 70 / 10 = 7
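
The same calculation can be sketched in Python. This is only an illustration; the variable names are invented here, and `statistics.mean` is the standard-library function for the arithmetic mean.

```python
from statistics import mean

# Shoe sizes for Group A, copied from the list above
group_a = [5, 6, 8, 7, 8, 6, 7, 7, 9, 7]

total = sum(group_a)          # add up all the results
count = len(group_a)          # number of results
by_hand = total / count       # total divided by the number of results

print(total, count, by_hand)  # prints: 70 10 7.0
print(mean(group_a))          # the library function gives the same mean
```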

Group A 5 6 8 7 8 6 7 7 9 7
Group B 7 7 7 7 7 7 7 7 7 7
Group C 5 6 6 6 7 7 7 7 8 9
Group D 5 5 5 5 8 8 8 8 9 9

Most people, if asked to summarise each set of data above, would probably come up with the idea of using
the mean. Can you think of any other information that would be useful?



You may have suggested that a measure of the spread of data would be useful. A very simple way to do
this is to record the range. Can you complete the information below to describe each set of data?

Group   Mean   Spread of data
A       7      Data ranges from 5 to 9
B       7
C       7
D       7
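
The spread column can be checked with a few lines of Python. This sketch uses the four groups listed earlier (Group A as first given, with a total of 70); the dictionary name `groups` is invented for the example.

```python
# Shoe sizes for each group, copied from the lists above
groups = {
    "A": [5, 6, 8, 7, 8, 6, 7, 7, 9, 7],
    "B": [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
    "C": [5, 6, 6, 6, 7, 7, 7, 7, 8, 9],
    "D": [5, 5, 5, 5, 8, 8, 8, 8, 9, 9],
}

# The range is described by the smallest and largest value in each group
ranges = {name: (min(sizes), max(sizes)) for name, sizes in groups.items()}

for name, (low, high) in ranges.items():
    print(f"Group {name}: data ranges from {low} to {high}")
```

Compare the ranges printed for the different groups with how the individual values in each group are actually distributed.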


Can you see any problems in only using this to describe a group of data?


If you were a shoe manufacturer you would find the mean useful, as you would know that it would be a good
idea to make plenty of shoes at size 7. However, the mean alone does not tell you how wide the variation is
around it. The four groups above all have means of about 7, but they clearly need very different outputs
from the shoe factory.




Standard Deviation

Continuous data shows a smooth transition of values across a spectrum; weight and height are good examples.
(Counts, such as the number of plants in a particular area, are strictly whole numbers, but are often
summarised in the same way.) To describe the spread of results in continuous data, biologists use a
statistic called the standard deviation.


The standard deviation of a set of data is worked out from the deviation of each measurement from
the mean.

Which group is very tightly packed around 7 (a shoe manufacturer's dream)? _____________
This group should have the smallest standard deviation.

If data is clustered around the mean you would expect to have lots of small deviations away from the
mean. If the data is more spread out you would expect the deviations to be bigger.
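
To see the link between clustering and the size of the deviations, the sketch below lists each value's deviation from the group mean for two of the shoe-size groups. The helper name `deviations` is invented for this example.

```python
from statistics import mean

def deviations(values):
    """Return each value's difference from the mean of the values."""
    centre = mean(values)
    return [value - centre for value in values]

group_b = [7, 7, 7, 7, 7, 7, 7, 7, 7, 7]
group_d = [5, 5, 5, 5, 8, 8, 8, 8, 9, 9]

print(deviations(group_b))  # every deviation is 0
print(deviations(group_d))  # deviations of -2, 1 and 2
```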



Calculating Standard Deviation (What are the steps involved and what does it mean?)

The heights in the groups of students checked for shoe size were recorded. The data shows a typical
distribution and the mean can easily be calculated.

Heights 157 160 161 164 171 172 175 176 177 182

Work out the total and the mean for this set of data:




The heights of individual members of the group are different from the mean.
These differences can be calculated.


Height                     157  160  161  164  171  172  175  176  177  182
Difference from the mean


Since some of the individuals fall below the average, some of the differences will be negative. To convert
all these values into positive numbers, they are squared.



Height                     157  160  161  164  171  172  175  176  177  182
Squared differences



The figures above give a measure of the deviation of the individuals from the mean. The standard
deviation is based on the mean of these squared deviations, so you first need to find the mean of these values:

Total = 642.5 (explain what numbers were used to calculate this)



Mean = 64.25 (explain how this number was obtained)


Since this is the mean of the squares of the original deviations, we take the square root of the mean and
call it the standard deviation.


Standard deviation = √64.25 = 8.02

The standard deviation is a useful way to describe the variability in a set of continuous data. The larger
the standard deviation the larger the spread of data is around the mean.
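
The steps above can be followed line by line in Python. This is a minimal sketch using the ten heights from the worked example; the variable names are invented here. Like the hand calculation, it divides by the number of values (10) rather than by one less than that number.

```python
from math import sqrt

heights = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]

# Step 1: total and mean
total = sum(heights)
mean_height = total / len(heights)

# Step 2: difference of each height from the mean
differences = [h - mean_height for h in heights]

# Step 3: square each difference so that every value is positive
squared = [d ** 2 for d in differences]

# Step 4: mean of the squared differences
mean_of_squares = sum(squared) / len(squared)

# Step 5: the square root of that mean is the standard deviation
standard_deviation = sqrt(mean_of_squares)

print(total, mean_height)              # 1695 169.5
print(sum(squared), mean_of_squares)   # 642.5 64.25
print(round(standard_deviation, 2))    # 8.02
```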


Questions

The table below shows the heights of two groups of IB Biology students.

Group A heights / cm

180 176 160 169 172 178 182 177 175

Group B heights / cm

180 177 163 166 175 177 180 179 173 169



a. Calculate the mean for each set of students.


b. Calculate the standard deviation for each set of students.


Group A heights / cm       180  176  160  169  172  178  182  177  175
Difference from mean
Square of differences


Group B heights / cm       180  177  163  166  175  177  180  179  173  169
Difference from mean
Square of differences



Total of squared differences for group A:

Total of squared differences for group B:


Standard deviation for group A =
Standard deviation for group B =





Using Excel to Calculate Average and Standard Deviation

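You can use Excel to calculate the mean and standard deviation of a set of data for you. This is especially
helpful when you want to quickly produce summary values to put onto a graph.

The command to calculate the average (in the English-language version of Excel) is:

=AVERAGE(*)

where * represents the range of cells holding the dataset of interest.


The command to calculate the standard deviation (in the English-language version of Excel) is:

=STDEVP(*)

where * represents the range of cells holding the dataset of interest. STDEVP divides by the number of
values, which matches the hand calculation above; newer versions of Excel offer the same calculation
under the name STDEV.P.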
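
For readers working outside Excel, Python's standard `statistics` module offers comparable functions. The sketch below is an illustration rather than part of the worksheet: `pstdev` divides by the number of values n (like STDEVP), while `stdev` divides by n − 1 (like Excel's STDEV), so the two give slightly different answers for the same heights.

```python
from statistics import mean, pstdev, stdev

heights = [157, 160, 161, 164, 171, 172, 175, 176, 177, 182]

average = mean(heights)          # comparable to =AVERAGE(*)
population_sd = pstdev(heights)  # divides by n, comparable to =STDEVP(*)
sample_sd = stdev(heights)       # divides by n - 1, comparable to =STDEV(*)

print(average)                   # 169.5
print(round(population_sd, 2))   # 8.02
print(round(sample_sd, 2))       # 8.45
```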



One way to show the variability in the results, and hence how precise they are, is to draw error
bars.

A simple way to construct an error bar is to use the maximum deviation of a single data point away from
the mean.

When drawing a graph, a bar is drawn above and below the mean, with a length equal to the maximum
deviation away from the mean.


Error bars can be constructed for each mean value:

If the error bars overlap then it cannot be concluded that the values are truly different. We state that the
values are not significantly different.
If the error bars do not overlap then a conclusion that they are significantly different is justified.
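
The maximum-deviation error bar and the overlap check can be sketched in Python as follows. The function names and the three small data sets are invented for illustration only.

```python
from statistics import mean

def error_bar(values):
    """Return (lower, upper) limits: the mean minus and plus the
    largest deviation of any single data point from the mean."""
    centre = mean(values)
    largest = max(abs(value - centre) for value in values)
    return centre - largest, centre + largest

def bars_overlap(bar_1, bar_2):
    """True if the two (lower, upper) error bars share any values."""
    return bar_1[0] <= bar_2[1] and bar_2[0] <= bar_1[1]

# Hypothetical measurements, made up for this example
sample_a = [10, 12, 11, 13, 14]   # mean 12, bar from 10 to 14
sample_b = [13, 15, 16, 14, 17]   # mean 15, bar from 13 to 17
sample_c = [20, 21, 22, 23, 24]   # mean 22, bar from 20 to 24

print(bars_overlap(error_bar(sample_a), error_bar(sample_b)))  # True
print(bars_overlap(error_bar(sample_a), error_bar(sample_c)))  # False
```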

Standard deviation error bars are a more sophisticated indicator of the precision of a set of measurements.
They are usually drawn for 1 standard deviation above and below the mean. Excel can do this for you.

If a standard deviation is to be calculated for a set of data, you will need a minimum of five repeats (and
preferably seven or more).

Student's t-test is a statistical test. One of the most common applications of statistics is to compare two
sets of data, for example the heights of males and females in a class. These heights can be represented as a
frequency histogram using the same x axis for both sets of data.

If almost all the male students were taller than the female students then the two histograms would show
very little overlap, as shown below in graph (a). From looking at this graph we would be confident in
saying that the male students are taller than the female students.


Fig 1: Comparing two sets of data. The triangle indicates the mean value for each set of data.


As the overlap increases it becomes less certain that there is a difference. If the data looked like that
shown in graph (b) above where there is almost complete overlap, then we would be confident in saying
that there is no difference in the height of male and female students.

It may appear from the graphs above that the difference between the mean values should be a sufficient
measure of overlap, i.e. as the means become closer the overlap increases. However, the overlap between
the two sets of data also depends on how closely the data are clustered around the two means.

Look at the two graphs below:



















You should notice that the difference between the means is the same.
However, the data used to plot graph (b) is more variable, so there is more overlap and less certainty that
there is a difference between the data.

The t-test is a technique that takes into account the means as well as the amount of overlap between
two sets of data, and indicates how certain we can be that there is a significant difference.

The t-Test





















What does the t-Test tell us?

It provides a way of measuring the overlap between two sets of data.
Notation
X̄1 (X with a bar over it, subscript 1) is the mean value for data set 1
Vertical lines indicate that the positive difference between the means should be taken,
irrespective of which is bigger
S is the symbol for standard deviation
n is the number of measurements collected

** Note: you will not be expected to remember this formula
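
One form of the t statistic that fits the notation listed here is sketched below. It assumes the unpooled version of the test, with subscripts 1 and 2 labelling the two data sets; when both data sets contain the same number of measurements it gives the same value of t as the pooled version.

```latex
t = \frac{\left| \bar{X}_1 - \bar{X}_2 \right|}
         {\sqrt{\dfrac{S_1^{2}}{n_1} + \dfrac{S_2^{2}}{n_2}}}
```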

If two sets of data have widely separated means and small variances (the data is clustered around the
mean), they will have little overlap and a big value of t. They can be shown to be significantly different.

On the other hand, if two sets of data have means that are close together and large variances (the data is
spread out from the mean), they will have a large overlap and a small value of t. They cannot be shown to be
significantly different.



































A large value of t indicates little overlap and a significant difference.
A small value of t indicates a lot of overlap and no significant difference.
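
The effect of spread on t can be tried out in Python. This is a sketch under stated assumptions, not the worksheet's own method: it uses the unpooled form t = |mean 1 − mean 2| / √(S1²/n1 + S2²/n2), takes S² as the sample variance (dividing by n − 1), and the function name and all four data sets are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample_1, sample_2):
    """t for two samples, assuming the unpooled form
    |mean1 - mean2| / sqrt(s1^2/n1 + s2^2/n2),
    with s^2 taken as the sample variance (divides by n - 1)."""
    difference = abs(mean(sample_1) - mean(sample_2))
    spread = variance(sample_1) / len(sample_1) + variance(sample_2) / len(sample_2)
    return difference / sqrt(spread)

# Hypothetical data: both pairs have means of 10 and 15
tight_1, tight_2 = [9, 10, 11, 10, 10], [14, 15, 16, 15, 15]
wide_1, wide_2 = [4, 10, 16, 7, 13], [9, 15, 21, 12, 18]

t_tight = t_statistic(tight_1, tight_2)   # data clustered around the means
t_wide = t_statistic(wide_1, wide_2)      # data spread out from the means

print(round(t_tight, 2), round(t_wide, 2))  # 11.18 1.67
```

The difference between the means is 5 in both cases, but the clustered pair gives a much larger value of t than the spread-out pair.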


To judge whether the value of t is big or small you have to consult a table of critical values. The value
to look up in the table depends on the degrees of freedom. Part of a table of critical values is shown
below:



Degrees of freedom    Significance level
                      p = 0.05    p = 0.01
15                    2.13        2.94
16                    2.12        2.92
17                    2.11        2.90
18                    2.10        2.88
19                    2.09        2.86
20                    2.09        2.85
21                    2.08        2.83
22                    2.07        2.82
23                    2.07        2.81
24                    2.06        2.80
25                    2.06        2.80
30                    2.04        2.75
40                    2.00        2.70
60                    2.00        2.66

To work out the degrees of freedom for two samples:

Degrees of freedom = (number of measurements in sample 1 − 1) + (number of measurements in sample 2 − 1)

So if there were 21 individuals in each sample then the degrees of freedom would equal:

Degrees of freedom = (21 − 1) + (21 − 1)
Degrees of freedom = 40

Imagine carrying out a t-test to compare two sets of data with 21 measurements in each set, and obtaining a
calculated value of t = 3.42.

Looking at the table, the critical value for t at the 0.05 level (the level scientists usually use) with 40
degrees of freedom is 2.00. If there were no real difference between the two sets of data, the probability
of getting a value of t of 2.00 or larger by chance would be only 0.05 (5%). Since 3.42 is bigger than 2.00,
the probability that the difference arose by chance is less than 5%, so the two sets of data are
significantly different. In fact 3.42 is also bigger than the critical value at the 0.01 level (2.70), which
means that the probability of getting a value of t this large by chance is less than 0.01 (1%).

In investigations that will be analysed using statistical tests, scientists usually state a null hypothesis.
The null hypothesis usually states that there is no significant difference between two samples.

If a value of t is greater than or equal to the critical value then the null hypothesis can be rejected and it
can be stated that there is a significant difference.
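
The decision rule can be sketched in Python. This is an illustration only: the function and variable names are invented, and the lookup holds just a few rows copied from the table of critical values, so it only works for those degrees of freedom.

```python
# Critical values copied from a few rows of the table above,
# keyed by degrees of freedom and then by significance level
CRITICAL_VALUES = {
    20: {0.05: 2.09, 0.01: 2.85},
    30: {0.05: 2.04, 0.01: 2.75},
    40: {0.05: 2.00, 0.01: 2.70},
    60: {0.05: 2.00, 0.01: 2.66},
}

def degrees_of_freedom(n_1, n_2):
    """Degrees of freedom for two samples: (n1 - 1) + (n2 - 1)."""
    return (n_1 - 1) + (n_2 - 1)

def reject_null(t, n_1, n_2, level=0.05):
    """True if t is greater than or equal to the critical value."""
    critical = CRITICAL_VALUES[degrees_of_freedom(n_1, n_2)][level]
    return t >= critical

print(degrees_of_freedom(21, 21))        # 40
print(reject_null(3.42, 21, 21))         # True
print(reject_null(3.42, 21, 21, 0.01))   # True
print(reject_null(1.50, 21, 21))         # False
```

With t = 3.42 and 21 measurements in each sample, the null hypothesis is rejected at both levels, matching the worked example; a hypothetical t of 1.50 would not allow it to be rejected.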
