NYA Lab 2B Scientific Method Statistics W2022
AIM
When you have completed this lab, you will be expected to:
understand the difference between descriptive and inferential statistics
apply inferential statistics (focus on t-tests) appropriately to data.
1. INTRODUCTION
Researchers gather data to describe and learn about large populations. Unfortunately, most populations are too
big for us to measure any one variable for every member of the population. For example, it would be practically
impossible to measure the resting heart rate of all teenagers in Canada. Instead, we collect information from a
sample of the population. Sampling means we take a relatively small number of measurements that (hopefully)
represents the entire population.
2. EXPERIMENTAL DESIGN
2.1 The importance of sample size
Let’s look at what can happen if your sample size is small and doesn’t represent the population. If you were to test
the effect of exercise on heart rate by measuring the heart rate of 10 people at rest, then again after 1 minute of
skipping rope, you would find that the heart rate obviously increased, but what you’re interested in is the degree to
which the heart rate increased. I’ve made up 2 possible data sets to compare below.
Table 1. Increase in heart rate (bpm) of 10 subjects, 5 males, 5 females.
          Group 1   Group 2
             5        35
             3        39
             6        28
             4        33
             7        36
             7        37
             6        32
             5        28
             8        31
             4        35
Average      6        33
If, just by chance, the 10 people you chose were in really good shape (group 1), you would conclude that heart rate
doesn’t increase very much in response to exercise. If, on the other hand, just by chance, you chose 10 people
who were couch-potatoes (group 2), you would conclude that heart rate increases greatly. Neither of these groups
would be representative of the population at large.
In the experiments you will perform this semester, you will be using sample sizes that are scientifically way too
small (for example, 4 samples). This is done for practical reasons (cost and space), but you will see below that no one
would publish results from a study that has a sample size of only 4.
In order to have confidence that your survey results are representative, it is critically important that you have a
large number of randomly selected participants in each group you survey. So, what exactly is “a large number”?
For a 95% confidence level (which means that there is only a 5% chance of your sample results differing from the true population average), a good estimate of the margin of error (or confidence interval) is given by 1/√n, where n is the number of participants or sample size (Niles, 2006).
The following table shows this estimate of the margin of error for sample sizes ranging from 10 to 10,000. For more advanced students with an interest in statistics, the Creative Research Systems website (Creative Research Systems, 2003) has a more exact formula, along with a sample size calculator that you can use. For most purposes, though, the 1/√n approach is sufficient. Source of table:
http://www.sciencebuddies.org/science-fair-projects/project_ideas/Soc_participants.shtml
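As an optional illustration (not part of the lab procedure), the 1/√n estimate can be computed for several sample sizes with a few lines of Python:

```python
import math

# Rule-of-thumb margin of error at a 95% confidence level: 1/sqrt(n)
# (the estimate described above, from Niles, 2006).
for n in (10, 100, 1000, 10000):
    margin = 1 / math.sqrt(n)
    print(f"n = {n}: margin of error ~ {margin:.1%}")

# n = 10: margin of error ~ 31.6%
# n = 100: margin of error ~ 10.0%
# n = 1000: margin of error ~ 3.2%
# n = 10000: margin of error ~ 1.0%
```

Notice that quadrupling the sample size only halves the margin of error, which is why very large samples are needed for precise survey results.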
You can quickly see from the table that results obtained from a survey with only 10 random participants are not reliable. The margin of error in this case is roughly 32%. This means that if you found, for example, that 6 out of your 10 participants (60%) had a fear of heights, then the actual proportion of the population with a fear of heights could be anywhere from about 28% to 92% (60% ± 32%), which is far too wide a range to be useful.
QUESTION: If you were to test 100 students (before and after exercise) on 3 separate days, the 100 students are ______________________ replicates, and the repeated measurements of the heart rates on days 1, 2, and 3 are _______________________ replicates. With the technical replicates, you could determine if the heart rate increased to the same degree for each individual.
QUESTIONS:
1. Consider the allelopathy lab (Lab 2A); how many replicates are there in the control group?
2. Were these biological or technical replicates?
3. What information does a biological replicate give you?
4. What information does a technical replicate give you?
2. Inferential statistics are used to judge the probability that an observed difference between groups is due to the independent variable, or whether it might have happened by chance in this study. A conclusion
using inferential statistics might be “Students who studied Biology at Dawson did significantly better on
standardized Biology tests than those who studied at CEGEP Mediocre.”
Inferential statistics are used to determine how confident you are that the results (in this example grades on the
standardized test) were caused by the independent variable (the CEGEP you went to), and not by chance (e.g., what if, by chance, the students who went to Dawson had higher IQs than those who went to the other CEGEP and so scored higher because of their IQs, not because of the classes they took at Dawson?). You will see below
(section 3.2) how this is determined.
Descriptive stats are used to summarize data so we can get an overview of the results. We can present the data
in tables and graphs and report the mean values as well as the variability observed.
Below, data from a small sample of 18 students (n = 18), 9 males and 9 females, are shown. By graphing the raw data,
you can see a trend; for both females and males, the heart rate increased after skipping rope for 5 minutes.
However, just by looking at the graph, you can't tell what the mean heart rate before and after exercise was.
To be able to say how much the heart rate increased after exercise, you will need to use descriptive
statistics.
To determine if the increase was significant, you would need to do inferential statistics.
By the way, a control variable is missing here. It is well known that the heart of people who exercise regularly
pumps blood more effectively and so the heart rate of such people doesn’t increase as much when exercising, as
compared to that of people who never exercise. If you wanted to compare, for example, the increase in heart rate
in males as compared to females, you would need to know beforehand how often each person exercised. Imagine
that you hypothesized that the increase in heart rate would be less for males than for females and, just by chance,
the females you chose were all athletes and the males were all couch-potatoes (see also Table 1, page 2). Your
results would probably suggest that the heart rate is less affected by exercise in females than in males.
It is very important to try to identify as many control variables as you can when designing your experiments.
Note that the control variable is not the same thing as a control group. For example, if you wanted to know the
effect of exercise on weight loss, you would have a control group that didn’t exercise, and an experimental group
that did. An example of a control variable would be diet (you wouldn’t want subjects in one group eating 4,000 Calories/day while those in the other group eat only 1,000 Calories/day).
3.1.1 Measures of Central Tendency
The most likely “pattern” revealed by examining a data set is the central tendency of the values. The most common
measures of central tendency are the mean (average), median, and mode. The mean (x̄) is the arithmetic
average of a group of measurements. It is the sum of all the values divided by the number of values (n). In the
case of exercise above, the mean beats per minute (bpm) at rest was 1290/18 ≈ 71.7 ≈ 72 bpm, and after exercise it was 1491/18 ≈ 82.8 ≈ 83 bpm.
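Written out as a formula (using the totals from the example above):

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},
\qquad
\bar{x}_{\text{rest}} = \frac{1290}{18} \approx 71.7 \approx 72 \text{ bpm},
\qquad
\bar{x}_{\text{after exercise}} = \frac{1491}{18} \approx 82.8 \approx 83 \text{ bpm}
```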
The mean is the most common measure of central tendency, but the median and mode are sometimes useful
because they are less sensitive to extreme values. [FYI: The median is the middle value of a group of
measurements that have been ranked from highest to lowest or lowest to highest. The mode is the value that
appears most often in the data set.] Note that for this course, we will not be using either the median or mode.
The range is the difference between the smallest and the largest values of the data set—the wider the range the
greater the variation. The range of the fish length from Lake 1 is 35 − 25 = 10 mm; the range of fish length from Lake 2 is 67 − 10 = 57 mm. The mean length of the fish is the same for both sites, but the ranges indicate much more variation in the fish from Lake 2.
Variance measures how data values vary about the mean. Variance is much more informative than the range.
In order to determine the variance (see Table 2), you first calculate the mean. Second, calculate the deviation of each sample from the mean (see column 3 in Table 2). Third, square each deviation (column 4). Fourth, sum the squared deviations. Lastly, divide the sum of squared deviations by the number of data points minus one to calculate the variance (s²): s² = Σ(xᵢ − x̄)² / (n − 1) (see step 5 below).
The n − 1 in the denominator reflects the fact that the variance is being calculated from a sample rather than from the whole population; in general, the larger the sample size, n, the more likely the sample is to be representative of the population (see Figure 2, page 1).
While variance is a good measure of the dispersion of values about the mean, a second and more commonly cited
measure of variation is the standard deviation. The standard deviation (S) equals the square root of the variance:
s = √s²
Table 2: Calculation for the variance of the fish length (mm) data from Lake 1:

Length of fish from Lake 1 (xᵢ) | Mean (x̄) | Deviation from the mean (xᵢ − x̄) | Deviation² (xᵢ − x̄)²
25 | 30 | −5 | 25
28 | 30 | −2 | 4
30 | 30 | 0 | 0
32 | 30 | 2 | 4
35 | 30 | 5 | 25

Step 4: Sum of the squared deviations = 58
Step 5: Variance = 58/(5 − 1) = 14.5 mm²
Step 6: Standard deviation = √14.5 = 3.807 ≈ 4 mm
Table 3. Calculation for the variance of the fish length (mm) data from Lake 2:

Length of fish from Lake 2 (xᵢ) | Mean (x̄) | Deviation from the mean (xᵢ − x̄) | Deviation² (xᵢ − x̄)²
10 | 30 | −20 | 400
18 | 30 | −12 | 144
20 | 30 | −10 | 100
35 | 30 | 5 | 25
67 | 30 | 37 | 1369

Sum of the squared deviations = 2038
Variance = 2038/(5 − 1) = 509.5 mm²; Standard deviation = √509.5 = 22.57 ≈ 23 mm
The length of the fish can be reported as the mean ± standard deviation: 30 ± 4 mm for Lake 1 and 30 ± 23 mm for Lake 2.
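If you would like to check hand calculations like those in Tables 2 and 3, here is a minimal, optional Python sketch (not part of the lab procedure) that reproduces the Lake 1 and Lake 2 results; the statistics module uses the same n − 1 formula shown above:

```python
from statistics import mean, variance, stdev  # sample variance/stdev divide by n - 1

lake1 = [25, 28, 30, 32, 35]   # fish lengths (mm) from Lake 1 (Table 2)
lake2 = [10, 18, 20, 35, 67]   # fish lengths (mm) from Lake 2 (Table 3)

for name, data in (("Lake 1", lake1), ("Lake 2", lake2)):
    print(f"{name}: mean = {mean(data)} mm, range = {max(data) - min(data)} mm, "
          f"variance = {variance(data)} mm^2, standard deviation = {stdev(data):.1f} mm")

# Lake 1: mean = 30 mm, range = 10 mm, variance = 14.5 mm^2, standard deviation = 3.8 mm
# Lake 2: mean = 30 mm, range = 57 mm, variance = 509.5 mm^2, standard deviation = 22.6 mm
```

The only difference from the values reported above is rounding (3.8 mm vs 4 mm, and 22.6 mm vs 23 mm).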
The mean and measures of variability (variance and standard deviation) are examples of descriptive statistics.
If we think of a graphical example, the variability in a sample might look something like this:
Going back to our fish example, we would see that the standard deviation (represented by the error bar in the
graph) is much greater for lake 2.
[Bar graph: average fish size (mm) for Lake 1 and Lake 2, with error bars.]
Figure 6. Comparison of mean fish lengths (mm) in 2 different lakes. For both lakes, the sample size, n, was five. The error bars represent standard deviation.
QUESTION: Why is it necessary, but not sufficient, to present the average value obtained? Consider the fictional
person who drowned crossing a river that had a mean depth of 0.5 meters; what did this person fail to consider?
A) Biological replicates
B) Technical replicates
The resting heart rate can be reported as the mean ± standard deviation: ______ ± ______
Variance =
Standard deviation =
Standard deviation =
Write out your hypotheses (null and alternate), collect your data, analyze your data using the appropriate
statistical test, and draw your conclusion. For this course, we will only be looking at the t-test.
For example, you may be interested in determining whether the ability to recall words is reduced when studying
with music on, as compared to when studying in silence, or whether physical exercise is associated with reduced
amounts of stress hormones. In both cases you would probably find that the sample means that you are comparing
are different. The question is: are they significantly different?
The first hypothesis (1) is called the null hypothesis because you are predicting that there will be no difference; this is the hypothesis that researchers want to nullify (reject), so it’s called the null hypothesis. Researchers
themselves do not usually predict that there will be no difference. Rather, they make one of the alternative
hypotheses (see numbers 2,3,4 above) and collect data in an attempt to REJECT the null hypothesis. If you are
successful in rejecting the null hypothesis, you can conclude that the data supports the alternative hypothesis.
There are also two types of alternative hypotheses in the above examples. Hypotheses (3) and (4) are
directional hypotheses. In both cases, you are specifying which group (young or old) will have the larger sample
mean. Hypothesis (2) is a non-directional hypothesis. You are only saying that there will be a difference; you are
not specifying whether the difference is positive or negative. For example, you are not specifying whether the
blood pressure of young people will be higher or lower. A non-directional hypothesis is said to be two-tailed; a
directional hypothesis is said to be one-tailed.
You can only test one alternative hypothesis with a data set. Therefore, you must state only one alternative
hypothesis.
State whether the following hypotheses are 1-tailed (directional) or 2-tailed (non-directional):
Mice fed vitamin C will grow larger than mice fed a placebo. __________
Is your alternative hypothesis regarding the onion extract one or two-tailed? _____________________
Traditionally, scientists have chosen a significance level of 0.05 (p < 0.05), which means that there is less than a 5% probability that the difference observed between the experimental and control groups was due simply to chance and not to the independent variable (e.g., a sample that is not representative of the population, unidentified confounding variables, experimental errors…). When the consequence of a mistaken conclusion is great,
scientists may choose a lower significance level, for example p<0.01 or even p< 0.001. On the other hand, if a
drug company is testing a drug that may be used for terminally ill people, for whom there is no other drug option, a
significance level of 0.1 or higher may be considered acceptable. The smaller the calculated p-value, the higher
the significance.
The lower the p-value, the less likely that you would obtain this difference by chance and so you don’t believe
the H0 (you reject it) and you believe the HA.
For this course, you will use a significance level of 0.05 (p < 0.05), i.e., there is less than a 5% chance (1 out of 20 times) of getting these results if the H0 were true; in that case, the H0 is considered too unlikely to be true and is rejected.
It's possible that you perform a study (e.g., testing whether 50 people given drug X lose more weight than 50 people given a placebo) in which you find a significant difference, but then you see that several other researchers repeated
the study and failed to find a significant difference. There are a number of possible explanations for such results.
One possible explanation is that your sample size was too small and so was not representative of the
population. In the case of the weight loss study, it’s possible that the test group (got the drug) of patients you
randomly selected just happened to respond to the drug better than, say, 95% of most people (again, the smaller
the sample size, the more likely this can occur). It’s entirely possible that the subjects in the other researchers’ studies were simply less sensitive to the drug than were your subjects. Or maybe the subjects in your test group were eating better or exercising more; these are variables that you didn’t control for in your experimental
design. While you do try to control for all the variables that you can think of, it can happen that there are control
variables that you were unaware of, and so you didn’t control for them.
Moral of the story: no matter how well an experiment is designed, it should never be considered as definitive: you
can draw conclusions based on your data, but you can’t say you have proven something. When you’re writing a
lab report, please make sure you never write “we proved…”; it’s very bad for my blood pressure, and for your
grade.
Based on the above, for your experiment on allelopathy (Lab 2A), you will want a p value that is _____________
in order to conclude that the onion extract significantly inhibited the germination of the radish seeds.
This leads us to a very important conclusion: when we are looking at the differences between scores for two
groups, we have to look not only at the difference in the means, but also in the variability within each group.
A t-test considers 2 factors to judge whether a difference between 2 groups is significant or not:
1. The size of the difference in the measured variable between treatment groups (difference in their means).
2. The amount of variation in the measured variable WITHIN each treatment group.
The LARGER the difference and the SMALLER the variability of the data within each group, the more likely that
the difference is statistically significant.
One of the equations used in t-tests is shown below. Don’t memorize this; I include it so you can see that t-tests evaluate not only the difference in the means (X̄1 vs X̄2), but also the variability (S² is variance) in your control and test groups.
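One common form of the t statistic, for comparing two independent groups with sample sizes n1 and n2, is:

```latex
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^{2}}{n_1} + \dfrac{S_2^{2}}{n_2}}}
```

The numerator is the difference between the two sample means; the denominator grows as the variability (S₁², S₂²) within each group grows. A large difference in means combined with small variability within the groups therefore gives a large t value, which corresponds to a small p value.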
The paired t-test is used to compare 2 sets of data that are related to each other. Usually this happens when measurements are taken “before and after” a certain treatment or event; essentially, you test the same samples under two different conditions. So, the samples are not ‘independent’ samples.
The paired t-test reduces inter-subject variability (because it makes comparisons between the same subject), and
thus is theoretically more powerful than the unpaired t-test.
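To make the distinction concrete, here is a short, optional Python sketch using SciPy; the recall scores below are invented placeholder numbers, used only to show how the two kinds of test are called:

```python
from scipy import stats

# Unpaired design: two DIFFERENT groups of students,
# one group studies in silence, the other with music.
silence_group = [14, 12, 15, 11, 13, 16]   # words recalled (placeholder data)
music_group   = [10, 12,  9, 11, 10, 13]   # words recalled (placeholder data)
t_unpaired, p_unpaired = stats.ttest_ind(silence_group, music_group)

# Paired design: the SAME students are tested twice,
# once in silence and once with music.
silence_scores = [14, 12, 15, 11, 13, 16]  # placeholder data
music_scores   = [12, 11, 13, 10, 12, 14]  # same students, second condition
t_paired, p_paired = stats.ttest_rel(silence_scores, music_scores)

print(f"Unpaired t-test: t = {t_unpaired:.2f}, p = {p_unpaired:.3f}")
print(f"Paired t-test:   t = {t_paired:.2f}, p = {p_paired:.3f}")
```

Both functions return a two-tailed p value by default; the decision rule described below (reject the H0 when p < 0.05) is then applied to that value.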
QUESTION 2. Describe a protocol to test the following hypothesis: Hypothesis: the ability to recall words is
reduced when studying with music on, as compared to when studying in silence. A) First describe a protocol that
would be unpaired and then B) re-write the protocol for a paired t-test.
A) Unpaired t-test:
B) Paired t-test:
If the p-value is ≥ 0.05, then you fail to reject your null hypothesis (the H0): the difference IS NOT statistically significant.
If the p-value is < 0.05, then you reject the H0 and accept your alternative hypothesis (the HA), i.e., you conclude that the difference IS statistically significant.
For instance:
1. A difference between treatment groups might be significant but small and scientifically boring.
2. A statistical test does not look at how you carried out the experiment and cannot identify a false positive result caused by an unwanted (confounding) variable. You may get a p value of 0.01 with a sample size of only 5, which would be meaningless.
3. A statistical test can’t look at other limitations of the experiment in testing a hypothesis.
A) Samples of male Leopard frogs (Rana pipiens) were selected from 2 different ponds, and their body lengths were measured to the nearest mm. The data were used to determine whether there is a statistically significant difference in body length between the male frogs from the 2 populations. The p value was found to be 0.04.
Null hypothesis:
Alternative hypothesis:
Conclusion:
B) There is some support for the idea that eating anchovies before a race will make a person run faster. To test this hypothesis, Dr. B. Joggerson gathered 3 professional sprinters. He had each sprinter run the 100 m dash and recorded their times. After an hour break, Dr. Joggerson then had each sprinter eat 5 extra-salty anchovies and asked them to run the dash again, and he recorded their times. The following data were collected, and the p value was found to be 0.05.
Sprinter | Sprint time without anchovies (sec) | Sprint time with anchovies (sec)
1 | 13.4 | 12.2
2 | 12.2 | 12.0
3 | 12.5 | 12.1
Null hypothesis:
Alternative hypothesis:
Conclusion:
You can also use Excel to make sure that you calculated
your mean and standard deviation correctly.
Step 3. A) Choose the cell where you want your p value to appear (in the example above, cell C10 was chosen).
B) Click on Insert Function and choose the following function: T.TEST (its arguments are array1, array2, tails, and type; type 1 is a paired test, while types 2 and 3 are unpaired tests).
C) Click OK
You are now ready to analyze your Allelopathy results (Lab 2A, which you haven’t done yet😊).