University of Bristol Statistics Content
This document explains hypothesis testing through three essential tools for a research project: 1) the method to design a controlled experiment to gather data; 2) the method to use statistical tests to analyse the results and prove something or not; 3) R, a programming language to run those tests. This document does not go into any mathematical details of the statistical tests themselves.
Table of contents
Chapter 1. All you need to know about hypothesis testing
Hypothesis testing is a fundamental concept in statistics and is widely used in scientific research to draw conclusions about populations based on sample data. It is not the only method (others include Bayesian inference or exploratory data analysis). Hypothesis testing relies on two primary areas of knowledge: you need to know how to design a controlled experiment (this is called experimental design) and you need to know how to analyse the data via a statistical analysis.
Note that this first chapter contains most of the terms you need, so it might be overwhelming. This is why the other chapters have a lot of examples where we investigate all these terms in detail. Hopefully, by the end of your reading, you will only need to refer to this chapter as a reminder!
1.2. Experimental design
The first step in hypothesis testing is to design your controlled experiment. Experimental design
refers to the process of planning and conducting a scientific experiment to ensure valid results. It
involves making decisions about how to manipulate independent variables and measure
dependent variables.
Independent variable: it is the factor that the researcher manipulates or changes. It is the variable
believed to have a causal effect on the dependent variable. We manipulate the independent
variable to test if it influences the dependent variable.
Example 1: does my mindfulness meditation app lead to a significant reduction in stress levels? Your independent variable is the usage (or not) of the meditation app.
Example 2: which of three ketchup brands leads to better customer satisfaction? Your independent variable is the type of ketchup, and you have 3 brands (or factors) to compare.
Example 3: is there a difference in test scores between students who receive traditional classroom instruction and those who use an online learning platform? Your independent variable is the type of teaching (with two factors: in class or online).
Dependent variable: it is the outcome that is measured. It is the variable that the researcher
expects to be influenced by changes in the independent variable.
Example 1: does my mindfulness meditation app lead to a significant reduction in stress levels? Your dependent variable is the stress level (difference from before to after using the app), which may be measured through biological sensors or questionnaires.
Example 2: which of three ketchup brands leads to better customer satisfaction? Your dependent variable is the satisfaction of the participants, which may be measured through questionnaires, or the amount of product sold.
Example 3: is there a difference in test scores between students who receive traditional classroom instruction and those who use an online learning platform? Your dependent variable is the grade that students get on their exam.
All other variables in your experiment should be controlled variables. They should not be
confounding variables. Controlled variables are factors that are kept constant or held unchanged
throughout the experiment. These are variables that could potentially affect the dependent
variable but are not the focus of the study. Controlling these variables helps isolate the impact of
the independent variable on the dependent variable by minimizing other potential influences.
Confounding variables are factors that are NOT kept constant and may screw up your experiment. A confounding variable is an extraneous variable that is not the focus of a research study but can affect the outcome of the study by influencing the relationship between the independent variable and the dependent variable. You need to avoid them at all costs.
Even if two variables are correlated, it does not mean that changes in one variable cause changes
in the other. Correlation simply indicates that there is a relationship. There may be a correlation
between ice cream sales and the number of murders in a community. Both tend to increase during
the summer, but buying ice cream does not cause murder (see figure 1). Causation implies a
cause-and-effect relationship between two variables. If changes in one variable lead to changes in another, there is a causal relationship. Establishing causation is more complex than demonstrating correlation, which is why we need a controlled experiment with no confounding variables to demonstrate causation.
Figure 1. Correlation between ice cream sales and the number of murders in a community. Both tend to increase during the summer, but buying ice cream does not cause murder; this is not causation. (From the book Spurious Correlations.)
I will show you in later chapters how to design your experiment, with plenty of examples. But it is very important to design it well. In a well-designed study you should also know in advance what statistical tests you will do. You cannot run the experiment first and then figure out which tests to use. Countless researchers have failed to do so and had to redo their entire experiment from scratch (I certainly did that myself and it is a pain!).
¹ Note that a well-formulated hypothesis is a crucial part of the scientific method. It should be clear and specific,
leaving no room for ambiguity. It should be testable and measurable through empirical observation and
experimentation. It should be falsifiable, meaning there should be a way to prove it wrong. If a hypothesis cannot be
proven wrong, it is not scientifically useful. It should be formulated in a way that predicts a relationship or an outcome.
It should be concise and to the point.
Example 2: which of three ketchup brands leads to better customer satisfaction? Your hypothesis may be “customer satisfaction will be higher with ketchup a”.
Example 3: is there a difference in test scores between students who receive traditional classroom instruction and those who use an online learning platform? Your hypothesis may be “grades will be higher for students in class compared to online”.
Oddly enough, when we run a statistical test, we don’t ask if our hypothesis is true, i.e. whether the data are different enough to grant us permission to say one is better. We ask instead if the data are similar. We call this the Null Hypothesis (H₀). This is a statement of no effect or no difference. In our ketchup example, H₀ is “there is no difference in ketchup satisfaction”.
When we run a statistical test, it gives us some important data, the most important is the p-value.
The p-value is a measure that helps you decide if your data provides enough evidence to reject
the null hypothesis. It tells you the probability of seeing the observed results (or more extreme
results) if the null hypothesis is true. For example:
• A low P-value (e.g., less than 0.05): It suggests that the observed results are unlikely
to happen just by chance, so you might reject the null hypothesis and thus accept your
hypothesis (H1). And as a researcher you are happy.
• A high P-value (e.g., greater than 0.05): It suggests that the observed results could
plausibly occur by chance, so you might fail to reject the null hypothesis. In that case
we cannot say anything about our initial hypothesis (H1). And this is important, we will
come back to this. But in this case, you CANNOT conclude anything from your data. As
a researcher at this stage, you are generally not happy.
Why use 0.05 to test against? (this is called the significance level)
We picked a threshold of 0.05 to test our p-value against. 0.05 is what we call our significance level (α). Commonly used significance levels are 0.05, 0.01, or 0.10. It may depend on the discipline of science you are in. A significance level of 0.05 means that there is a 5% chance of rejecting the null hypothesis when it is true. 0.05 is very common in computer science. 0.01 may be used in medical science (e.g. when we test a vaccine) or in physics. You will most likely use 0.05 as your significance level; this is a convention.
Is the p-value the only thing I need to conclude? (no, we also need the effect size)
While a P value can inform whether an effect exists, the P value will not reveal the size of the
effect. The lower the p-value, the stronger the evidence against the null hypothesis. But it doesn't
tell you the size or importance of the effect; it only indicates whether the observed effect is likely
real or just a random occurrence. The p-value doesn't quantify the size or practical significance of
the effect. A small p-value doesn't necessarily mean that the observed effect is large or practically
meaningful; it only means that the effect is statistically significant.
The size of the effect tells you how big or important the change or difference you observed in
your study is. The size of the effect is like a measure that says, "Yes, there's a change, and it's this
much." It gives you a sense of the practical importance of what you found, beyond just saying
whether the change is statistically significant (unlikely to be due to chance).
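To give a concrete example of the distinction (a preview of chapter 2, where this is computed in R): a common effect size for comparing two means is Cohen’s d, roughly d = (mean of group 1 − mean of group 2) / standard deviation. A d of around 0.2 is considered small, 0.5 medium, and 0.8 or more large; these are the rule-of-thumb thresholds used later in this document.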
Why no conclusion if high p-value? (because we cannot test equality)
In hypothesis testing, failing to reject the null hypothesis does not provide direct evidence in support of the null hypothesis itself. It means that the data you have collected did not provide enough
evidence to reject the null hypothesis at the chosen significance level. The absence of evidence
against the null hypothesis does not imply evidence in favour of it.
To explain this, I will take a detour example. Imagine you have a pair of shoes you absolutely adore. You know they make you run faster and you have collected data on how fast you run with them. One day you feel that someone took your beloved shoes and replaced them with identical ones. You doubt these are your shoes. So you collect more data: you run with those dubious shoes.
You now have a new set of running data with the dubious shoes that you can compare with the
old log you had with your beloved shoes. You can observe that the dubious shoes make you run
slightly slower, but you are not sure if this is statistically significant. You run a statistical test. Now
imagine these two scenarios:
1. In one universe imagine the p-value is low (below 0.05). You can reject the null hypothesis
that the data are similar. Thus, the data are different. Thus, someone has replaced your
beloved shoes!! You are not 100% sure because science is never 100%, but you have so
much data (your sample size is large) that you can be arbitrarily sure these are not your shoes, because the difference in data did not occur by chance.
2. In another universe, imagine the p-value is high (above 0.05). You CANNOT reject the null
hypothesis that the data are similar. Thus, you have rejected nothing, and you are back to
square one. In this case there are still two possibilities: 1) these are replaced shoes, but they make you run in a very similar way to your beloved ones; 2) these are your beloved shoes and the slight difference in speed is due to randomness. And thus, you still cannot
conclude anything. The two possibilities co-exist.
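To make the two scenarios concrete, here is a minimal R sketch (R is only introduced in section 1.8, and the running times below are hypothetical, not taken from the document):
# hypothetical running times (in minutes) with each pair of shoes
beloved = c(25.1, 24.8, 25.3, 24.9, 25.0, 25.2)
dubious = c(25.4, 25.1, 25.6, 25.2, 25.3, 25.5)
t.test(beloved, dubious)   # look at the p-value in the output:
                           # low  -> reject H0, the data differ
                           # high -> we cannot conclude anything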
Then, how do you test that things are equal? In hypothesis testing, you typically don’t prove that two things are equal. And this is super important! In a paper or a report, if you fail to reject your null hypothesis you cannot claim anything; you can just say “we fail to reject the null hypothesis” … thus … well, no “thus” … nothing :).
• Interval Data has ordered categories with uniform intervals between them, but it lacks a
true zero point. Examples include temperature measured in Celsius or Fahrenheit, where
the difference between 20°C and 30°C is the same as the difference between 30°C and
40°C, but zero does not mean the absence of temperature.
• Ratio Data has ordered categories with uniform intervals between them and a true zero
point, indicating the absence of the attribute being measured. Examples include height,
weight, age, speed during a particular task, income, or mark on an exam.
Figure 2. This figure shows the main statistical tests used depending on the type of data.
Independent vs. dependent variable: At the top of the chart, within the first task, you see there is a distinction between data that is the independent or the dependent variable. We have seen the difference between those already.
Normal or skewed distribution: In the middle of the chart, you will see that we must choose whether our data follows a normal distribution or is skewed². This is important because statistical tests can be split into two main categories:
• Parametric tests are statistical methods used when your independent variable is a category, and your dependent variable is a scale. This is what we have seen so far, for example the type of keyboard (a or b) or the memorization group (baseline or reward), where the dependent variable was a ratio scale. But they only work if the data you have collected follows a normal distribution (also called Gaussian, or bell curve).
• Non-parametric tests are statistical methods that are used when you cannot use the parametric tests above, i.e. when your data do not meet the assumptions of parametric tests. Parametric tests typically assume that the data come from a specific parametric distribution (often the normal distribution) and make inferences about population parameters such as means and variances. Non-parametric tests, on the other hand, do not make strong assumptions about the shape of the underlying distribution and are more robust to deviations from normality.
² How do you know if your data follows a normal distribution? Do not worry, there are tests to check this.
Paired / non paired: Toward the bottom of the chart, you may have to decide if your data is paired
or not. Within-subjects (paired) or Between-subjects (non-paired) are terms used to describe
different experimental designs. These designs determine how participants are allocated to
different conditions (or factors) and how the study is structured.
• Between-Subjects Design (or independent measures or independent groups design): each
group of participants is exposed to only one level/factor of the independent variable, and
they are independent of each other. It helps control for order effects and minimizes
potential carryover effects.
• Within-Subjects Design (or repeated measures design): each participant experiences all levels of the independent variable, and the order of exposure is typically counterbalanced to control for order effects (we will see what counterbalancing means in the next chapter).
• Mixed Design: combines elements of both between-subjects and within-subjects designs.
It involves having at least one independent variable manipulated as a within-subjects
factor and another as a between-subjects factor. Participants experience all levels of one
independent variable (within-subjects factor) and only one level of another independent
variable (between-subjects factor).
Number of things you are comparing (group): At the bottom of the chart, you also have different
choices depending on how many things/conditions you want to compare (or how many
levels/factors your independent variable has). For example, you may want to compare the
satisfaction with three different products. In that case you have three factors (or levels) to compare.
1.8 First step in R and CSV files
There are many ways to do statistical tests; here we will use R. R is a command-line tool, but there is also a graphical user interface version of it (I have never tested it myself). SPSS is another piece of software with a graphical user interface that is commonly used to run stats. You can download R here: https://www.r-project.org/. Find the link associated with your operating system and install R. I am on a mac; hopefully there is not too much difference with other OSes. Once installed you will have an application icon called R appearing on your computer. When you open it (double click on it) you will see a command-line interface (figure 3). Once you have the R command line opened, try to type what is below. Note I will use this format in the rest of the document: first what you type, then (under “Output:”) the result R gives you.
print("hello world")

Output:
[1] "hello world"
Figure 4. The R command line interface you get when you double click on the logo of the app.
To use R, you will need to make sure that your data is formatted in the right way. The most common way of doing this is as a table. You can do it in excel and then save it as a CSV (comma-separated values) file. Typically, you would have something looking like figure 4 (left). ID is the identifier of the participant; the column called ketchup is the type of ketchup product tested (a, b, or c); tastiness is the measured data from the participant (how much they liked it on a scale from 1 to 5, 5 being the tastiest). From excel, you save this as “.csv” (or in your code). The csv will look like figure 4 (right) if you open it with a text editor.
ID ketchup tastiness
1 a 1
2 a 2
3 a 1
4 a 2
5 b 2
6 b 3
7 b 1
8 b 2
9 c 5
10 c 4
11 c 4
12 c 5
Figure 4. Example of CSV formatting: on the left opened as an excel file, on the right as a text file.
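For reference, the right-hand side of figure 4 (the same table opened in a text editor) would look roughly like this; the exact layout may vary slightly depending on how excel exports it:
ID,ketchup,tastiness
1,a,1
2,a,2
3,a,1
4,a,2
5,b,2
6,b,3
7,b,1
8,b,2
9,c,5
10,c,4
11,c,4
12,c,5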
Now let’s import our “ketchup.csv”³ file, where all the imaginary data from the ketchup experiment is stored, and try the following (see below). While trying this it is possible you get an error saying the file does not exist. It is because the path of your file is not found. I would suggest you go to the file and find its path (right click, get info on it). Then try the command getwd() in R to know where R is running from and adapt the path accordingly.
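For example (a minimal sketch; "~/Desktop" is only an assumed location, adapt it to wherever your file actually is):
getwd()              # prints the folder R is currently running from
setwd("~/Desktop")   # optionally move R's working directory to where the csv file is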
dat = read.csv("Desktop/ketchup.csv", header = TRUE)
print(dat)

Output:
   ID ketchup tastiness
1   1       a         1
2   2       a         2
3   3       a         1
4   4       a         2
5   5       b         2
6   6       b         3
7   7       b         1
8   8       b         2
9   9       c         5
10 10       c         4
11 11       c         4
Etc.
1.9 Conclusion
Hypothesis testing is a fundamental method in statistics used to make informed decisions about population parameters based on sample data. It involves formulating two hypotheses: the null hypothesis (H₀), which represents a default assumption of no effect or no difference, and the alternative hypothesis (or simply the hypothesis) (H₁), which suggests a specific effect or difference. Researchers collect and analyse data, often using a significance level (α) of 0.05, to determine the probability of obtaining the observed results if the null hypothesis is true. The p-value, calculated from statistical tests, is compared to the significance level. A small p-value (< 0.05) leads to the rejection of the null hypothesis, indicating sufficient evidence to support the alternative hypothesis. Conversely, a larger p-value results in a failure to reject the null hypothesis and thus leads to no conclusion about either hypothesis. While statistical significance is crucial, researchers must also consider effect size, which quantifies the practical importance of results.
³ All the files related to this document are also provided.
In the rest of this document, I will use imaginary studies to exemplify hypothesis testing:
Keyboard performance study: this could be a study done by computer scientists who designed a new type of keyboard and want to test if it has better performance than traditional keyboards. The study asks participants to type segments of text and counts the words typed in one minute. We look at our first statistical test: the Student t-test.
Memorisation study: this is a study that could have been done by psychologists to determine if humans have better memorization ability when they are presented with a chocolate reward or a punishment. The study asks participants to memorize sequences of numbers of different lengths. We also go through an imaginary scenario where one group gets chocolate, one gets nothing, and another receives a slap … which is obviously not ethical! We look at the Student t-test in more detail, using the Bonferroni correction. We also look at ANOVA.
Snack studies: these could be done by someone in a marketing company. We do a series of small studies to investigate MnM’s: how their color affects taste; then how to test if packets have the right distribution of colors; finally, we look at the best size for MnM’s so they don’t melt in the hand. We look at the non-parametric versions of the t-test and ANOVA. Then we look at Chi-square tests and finally linear regression.
Chapter 2. The keyboard layout study
For this first study I want you to step in the shoes of an engineer, August Dvorak. In 1936, he
created a new computer keyboard layout and went on a mission to show it is a faster and more
ergonomic alternative to the QWERTY layout (figure 3). Dvorak proponents claim that it requires
less finger motion and as a result reduces errors, increases typing speed, reduces repetitive strain
injuries, or is simply more comfortable than QWERTY. Let’s focus on the typing speed only here,
as it is a measure we can easily imagine collecting from a bunch of participants.
2.1 Experimental design
Step 1. Research questions and hypotheses.
Our first step is to define our research question and hypotheses. We want to know if the Dvorak layout results in a different typing speed compared to the conventional Qwerty layout. Our hypothesis and null hypothesis are:
H1: The Dvorak layout has a significantly higher typing speed compared to the Qwerty one.
H0: There is no significant difference in typing speed between the Dvorak and QWERTY layouts.
We should also explain why this is our prediction. In our case we can take the rationale from
August Dvorak himself: The Dvorak keyboard layout emphasizes placing the most common letters
and letter combinations on the home row to minimize finger movement and increase typing
speed. The layout prioritizes reduced finger travel, encourages finger alternation for smoother
typing, and minimizes instances of same finger typing. The Dvorak design is tailored to the English
language, considering letter frequencies and linguistic characteristics. Thus we should expect (predict) that it will increase typing speed compared to the traditional layout.
It seems clear we are going to measure some typing speed. This is our dependent variable. Now we want to compare how this dependent variable evolves under two conditions, one with the Dvorak layout, the other with the Qwerty layout. The type of keyboard is thus our independent variable.
It has two levels (or two factors): Qwerty or Dvorak.
We might start wondering a bit more about what we are measuring exactly. A common way to measure any speed is to decide on a scale, e.g. kilometres per hour. We might not want to have our participants type for a whole hour; maybe using a minute would be fine and enough to collect enough data. We can count the number of words typed within a minute. But what if participants just hammered the keyboard? We probably need to count the number of words typed properly within a minute and exclude the mistakes. We might instruct them to “type as fast as possible without making mistakes within 1 minute”.
But what are they typing? If we make them type a text they already know, they might be biased.
We need to come up with a text that is relatively random. We can generate a text and give it to
participants and count how many correct words they manage to type in one minute. Our sample text
is a controlled variable. It is the same for all participants so there is no variation and thus no
possibility it will create confounding results.
Also note that Dvorak is designed for the English language, so we need English text … which also means we probably need participants whose first language is English. Since we want to compare relatively similar expertise we might also want to make sure to recruit people who are in the same age range, or with similar demographics. This avoids introducing confounding variables.
Our study starts to take shape but now we need to figure out if we pair our data or not. If we pair it (within-subject design), it means we make all our participants use both keyboards. If we do not pair it (between-subject design), it means that one half of the participants will do Dvorak, and the other half will do Qwerty. Now choosing a between-subject design may feel easier because we only have one generated sample text to type. This makes our life easier because each participant will have to type only once, and only with that sample text.
However, if you can (there are studies where you cannot, as we will see later), it is often better to go for a within-subject experiment because the data are paired and it helps increase the statistical power of the experiment (see chapter 5); in other words, your likelihood of finding the signal in the noise is increased. It also requires fewer participants. In a within-subject study like this one you could probably be ok with 12 participants⁴. In a between-subject study you would need twice as many.
Even if within-subject experiments have advantages, they bring some more confounding variables to control for: the learning effect and the fatigue effect. In our experiment, we can have participants use both keyboards one after the other, but this creates some problems:
- We cannot use the same sample text. If they type something with one keyboard and then the same text with the other keyboard, they will be at an advantage because they have already typed the sample text before. Thus, we need to generate another sample text. We need two in total. But we need to make sure those texts are different enough to avoid memorization but also similar enough to be comparable. They need to have the same difficulty somehow. We could say we generate the second sample with the same length, containing words of the same complexity, etc.
- Secondly, we cannot make all participants start with Dvorak and then Qwerty (or the opposite). If all participants systematically started with Dvorak and then Qwerty, it is possible either that 1) the Dvorak keyboard shows better performance because participants are quite energetic at the beginning and fatigued at the end when they reach the Qwerty (we call this the fatigue effect), or 2) that the Qwerty shows better performance because participants are well practised with the task after spending the first half of the study on the Dvorak keyboard (we call this the learning effect).
Both learning and fatigue effects are confounding variables. We must control them. To do this we
can use counterbalancing. For our keyboard experiment this could be simple, we make half of
the participants start with keyboard Dvorak and then do the Qwerty. The other half of the
participants start with keyboard Qwerty and then keyboard Dvorak. Counterbalancing is
employed to minimize the impact of these effects and enhance the internal validity of the study.
At this stage we can start identifying the statistical tests we may use. We use the figure 2 flow chart. Our independent variable (type of keyboard) is a category. Our dependent variable is a scale (number of words typed in one minute). Now we don’t know yet if the data is normally distributed or not.
- If our data is normally distributed: we are comparing [2 groups] and they are paired because we do a within-subject study. The flow chart says we need a T-TEST.
- If our data is skewed: we are comparing [2 groups] and they are paired. The flow chart says we need a Friedman test.
⁴ I will also talk about the number of participants and why 12 in chapter 5.
You can now summarise your current experimental design in a report or a paper as such:
We are running a within-subject study with 12 participants. Our independent variable is the type of keyboard (Dvorak or Qwerty), and the dependent variable is the typing speed (number of correct words typed within a minute). The order of the keyboards is counterbalanced. Two sample texts of similar difficulty are used as controlled variables to avoid a learning effect. We also counterbalanced the sample texts. In summary our design is 12 (participants) x 2 (types of keyboard) = 24 data samples. Our experiment lasts 3-4 minutes per participant. If our data is normally distributed we will use a t-test, if not a Friedman test, to compare keyboard performance.
I do really encourage you to run a pilot experiment. A pilot experiment is a small-scale version of the entire study. Try a few participants first. It helps you to check if the experiment is not too long, if you have forgotten to control any variables, if the data you collect are not bugged, and if your instructions are clear. It really saves time later to do a few participants first and refine the study to make sure you have something solid. You also generally throw away any data collected in this pilot, unless your experimental design did not change at all and was perfect straight from the design phase.
Let’s imagine we did run such a pilot. A few things we may have noticed: some participants are lost at the first trial because they have never seen this task before, and they often screw it up; some participants were distracted during one of the trials and screwed it up; some participants were tall and the table on which the keyboard was placed was too low; looking at our data we realise that our code does not count the words that are misspelled, which we totally forgot.
From those observations we will of course fix our code :) but we can also iterate on our design. One thing we should do is to make participants practise, to avoid them screwing up their first trial. We can generate a bunch of other sample texts (of course not the ones we will use for the actual experiment!) and make them practise, say 4 times on each keyboard.
People getting distracted is often unavoidable. Maybe we could run the experiment in an enclosed room with no noise and no passers-by or other distractions to make sure this does not happen. We can also make sure we adapt the setup to people’s height. Basically, we need to make sure that all of them do the task in the same conditions so that the only thing that changes is the type of keyboard.
Another thing we could consider is to increase the number of trials they do (repeating the typing task) so we have multiple data points for each keyboard, and if one of these data points is screwed up it won’t matter too much. This is also good because we increase the sample size. Let’s say each participant does each keyboard 4 times, with a differently generated sample text each time.
After the pilot study it is also a good moment to check the timing of your study: if you have room for more, or if, inversely, you must reduce something because the experiment is too long. I would not advise letting the keyboard experiment last longer than 30 minutes because it is quite repetitive and boring. Any longer than 30 min and you may start to get biased data. The length of your experiment may vary, but it depends on the task the participants do. You will get a feel for how long you can go with the pilot. Generally, it is good to stop at 1h.
In our case let’s keep it under 30 minutes. We have participants doing the two keyboards 4 times each, so 8 trials in total of roughly 1 minute each = 8 minutes. But we also said we need them to practise, 4 times on each keyboard = 8 minutes. We are at 16 minutes. Count some break time and the initial welcome and explanation, and we are probably around 20-25 minutes. This sounds good.
We summarise our iterated experimental design as such:
Experimental design: We are running a within-subject study with 12 participants. Our independent variable is the type of keyboard (Dvorak or Qwerty), and the dependent variable is the typing speed (number of correct words typed within a minute). The order of the keyboards is counterbalanced. Each participant repeats the task 4 times for each keyboard. We generated 8 sample texts of similar difficulty as controlled variables to avoid a learning effect. The 8 sample texts are randomised so that each sample is used once. Before each swap of keyboards, the participants must practise on 4 randomly generated texts (different from the ones explained before). In summary our design is 12 (participants) x 2 (types of keyboards) x 4 (repetitions) = 96 data samples. Our experiment lasts 20-25 minutes per participant. If our data is normally distributed we will use a t-test, if not a Friedman test, to compare keyboard performance.
At this stage you could also write about your procedure, apparatus, and participants:
Apparatus: The study was set up in a room in which only the participant and the experimenter were present. A computer screen was on a desk with the keyboard below the screen. The experimenter swapped the keyboards when necessary. The participant could adjust the height of their chair to be ergonomically sat at the table.
Procedure: Participants were first explained the task and signed a consent form. They were asked to sit comfortably at the table. They were instructed to type as fast as possible while
avoiding mistakes. A sample text in English was displayed on the screen and a countdown
was shown before each trial to allow the participant to get ready. The experimenter could
stop the experiment at any time between trials when participants needed a break. The
word count performance was not shown to the participant. (note you can also show a
screenshot of the interface and provide the sample texts).
Participants: 12 participants (complete later with their ages, genders) participated in this study. They were recruited within our institution and use a keyboard regularly as part of their work. They received no incentive for their participation. All participants spoke English as their first language.
Ok, this is the easy part. Hopefully it is, because you piloted it before. And you have a beautiful CSV file ready for analysis.
2.2. Statistical analysis
Step 8 Using R to do some pre-analysis
If you ran the previous command successfully, the variable “dat” holds your table. We can start by computing the means for each keyboard (see code below). We use a function called mean(). The argument for this function is a vector of numbers. In R you don’t need to create a new table for each keyboard’s data; you can just use conditional selection. Writing dat[dat$keyboard == 'dvorak', 'count'] creates a sub-table with only the lines where keyboard = dvorak, keeping only the count column (words per minute). This is quite practical and will often be used in R. By doing this we find the mean for Dvorak at roughly 114 and Qwerty at 106 words per minute.
mean(dat[dat$keyboard == 'dvorak', 'count'])
mean(dat[dat$keyboard == 'qwerty', 'count'])

Output:
[1] 114.4375
[1] 106.0625
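As a side note, you can also get both means in a single call (an alternative sketch, not needed for the rest of the analysis):
tapply(dat$count, dat$keyboard, mean)   # mean words per minute for each keyboard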
Now let’s see if we can visualize our raw data (code below). We use the function barplot(), which takes a vector of values as argument. We specify that we want a plot of the words per minute and thus pass dat$count (count is the name of the column we used in the csv file). By doing this you can see all the results from our experiment, on the left side the Dvorak data and on the right the Qwerty data. If we look carefully we can see there is a strange entry on the right: one participant did a trial that is very odd compared to the rest of the data. This is clearly an outlier.
barplot(dat$count)
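If you prefer a per-keyboard view, a boxplot is a handy alternative (a sketch, not used in the rest of this chapter):
boxplot(count ~ keyboard, data = dat)   # one box per keyboard; outliers appear as isolated points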
Removing outliers
It is important to check for outliers because they are often caused by something we are not trying to model in our data. Here it may be that there was a disruption during the trial. We do not want them; they are not characteristic of our data. A good practice is to set a threshold: often any data point that is more than 3 standard deviations away from the mean can be considered an outlier. But we want to avoid removing a large percentage of the data, because otherwise we bias our experiment. It is good to define an acceptable outlier policy before the experiment, e.g. 1%.
For our data we can try the following. Note that # marks comments. We use the function mean() that we have already seen, and sd(), which gives the standard deviation. We store both mean and standard deviation in temporary variables meandat and stddat. Then we create the lower and upper bounds, 3 standard deviations below (Tmin) and above (Tmax) the mean. Then we find the outliers. To do this we search our data table for words-per-minute values below Tmin or above Tmax. This line will output the result displayed below (here 1), so we know we have 1 outlier. We then need to remove it from our data table called dat. To do this I first create a duplicate of the table in a new variable called dat2, and I then use the function subset() to keep only the lines matching a condition: here the lines where count (words per minute) is between Tmin and Tmax. We will now use dat2 as our data for the rest of the tests.
# get mean and standard deviation
meandat = mean(dat$count)
stddat = sd(dat$count)
# lower and upper bounds: 3 standard deviations below and above the mean
Tmin = meandat - (3 * stddat)
Tmax = meandat + (3 * stddat)
# find outliers
outliers = dat$count[which(dat$count < Tmin | dat$count > Tmax)]
print(length(outliers))
# remove outliers
dat2 = dat
dat2 = subset(dat2, dat2$count > Tmin & dat2$count < Tmax)

Output:
[1] 1
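To check that this removal stays within the outlier policy mentioned above (e.g. 1%), a quick sketch:
# fraction of rows removed by the outlier filter
1 - nrow(dat2) / nrow(dat)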
Step 9 Checking tests requirements (here for normality)
Finally, we can have a look at the distribution of our data, first visually. This is done as below. From this graphic we can already see that it roughly follows the bell curve, but we can check with a test called Shapiro-Wilk whether the data really is normally distributed.
# the function hist() makes a histogram
# breaks specifies the number of bins you want
# here we used 10
hist(dat2$count, breaks = 10)
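To run the normality check on the same data, the call would be:
shapiro.test(dat2$count)   # reports a W statistic and a p-value (about 0.2 for this made-up data)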
We have done our first statistical test (the Shapiro-Wilk test!). But it is not the one that will help us compare our two keyboards; it is one that tests whether our data follows a normal distribution. Remember we still don’t know which test we need to perform, either a t-test or a Friedman test. We ran a Shapiro-Wilk test to check if our data follow a normal distribution. The Shapiro-Wilk test gives us a p-value of 0.2, which is a high p-value (above 0.05). Thus, we are fine with running parametric tests, aka the t-test. Note that there are other normality tests such as the Kolmogorov-Smirnov test (the command would be ks.test(dat2$count, "pnorm", mean(dat2$count), sd(dat2$count))). This test is good when you have a lot of participants (above 50).
If the p-value of the Shapiro-Wilk or Kolmogorov-Smirnov test is greater than α = .05, then the data is assumed to be normally distributed. For those who have followed up to this point, this might seem extremely strange … why? Did I not say earlier that when the p-value is high we cannot reject the null hypothesis … and thus we cannot conclude anything? So why can we conclude that our data is normally distributed? Why can we conclude that certain things seem equivalent, I thought it was forbidden? Those normality tests do indeed compare two things: they compare your data with the normal distribution. H1 would be “there is a significant difference between those two data sets” and H0 would be “there is no difference between those two data sets”. Since we found that the p-value is high, we cannot reject the null hypothesis. Thus, we should not be able to conclude anything, right? And in fact this is what we do. We call this the “assumption of normality” rather than a proof of normality. We do not know for sure. But assuming normality is acceptable under certain conditions. Parametric tests can perform well if the underlying distribution is approximately normal, or the sample size is sufficiently large.
In our case we are fine to assume our data follows a normal distribution (I made the numbers up, so I know they do!). And we can proceed using a t-test rather than a Friedman test.
Step 10 Run your test (t-test)
The paired Student's t-test, or simply the paired t-test, is a statistical method used to compare
the means of two related groups where each observation in one group is paired or matched with
an observation in the other group. This is exactly what we do with our keyboard experiment. We
have a data point for keyboard a and one for keyboard b for each. Each participant coming to do
the experiment must have done a typing task with each keyboard. The data we have for keyboard
a are paired with the data we have for keyboard b because each time we run the study they come
from the same person (within participants design). Each statistical test has a series of requirement
that need to be checked before running it. For paired t-test these are:
- The paired t-test needs paired data. This is what we explained before. In our case each participant provided two data points, one for each keyboard, so we are good.
- The differences between the paired observations should be independent. This means that
the value of one observation difference should not be influenced by the value of another.
This means we need to think hard about what the participants are doing in their typing
task so that the answer from one participant does not influence the answer for another
one. Typically, this could be avoided by making sure the study is individual, that no
participants can observe the others nor their results to avoid any biases. The data between
each participant is thus completely independent.
- The data should be continuous, meaning it takes any numerical value within a range. The
paired t-test is not appropriate for categorical data. The variables should be measured on
an interval or ratio scale. As said before our keyboard example is a ratio-data type and
continuous: when measuring words per minute, we count the number of words a person types within a minute, and this measurement can take on a wide range of real-number values. There is a continuous spectrum of possible values for words per minute, allowing for fractions or decimals, and there are no separate categories as seen in discrete data.
- It should have a sufficient sample size. This one is more complex to answer now. Let’s say
for now we will use 12 participants. Chapter 5 will dive more on sample size.
- The differences between the paired observations should be approximately normally
distributed. We have seen we are good with this.
Next let’s run our paired t-test (code below). The function we use is t.test(). The first argument is the first factor of the independent variable (dat2$count[dat2$keyboard == "dvorak"] creates a vector of all the measured words per minute for dvorak only). The second argument is the second factor of the independent variable (dat2$count[dat2$keyboard == "qwerty"]). The 3rd argument is paired = TRUE, to specify to the test that our data are paired. The final argument is alternative = "two.sided", which is another important thing to specify; we will see later what this means.
After running this, you can see there is obviously a problem in our data. The error says that we don’t have the same length of data, which is very important when we run paired statistical tests. The problem is that we removed an outlier earlier, and now a repetition is missing for one participant.
t.test(dat2$count[dat2$keyboard == "dvorak"],
       dat2$count[dat2$keyboard == "qwerty"],
       paired = TRUE, alternative = "two.sided")

Output:
Error in complete.cases(x, y) : not all arguments have the same length
But no worries. We included the repetitions exactly to handle this kind of case, and to have more reliable data. We first need to aggregate our data. The way to aggregate the data is simple: we basically want to remove the repetition column and compute the mean. For each participant (ID), we need the mean for keyboard dvorak and the mean for keyboard qwerty. We do this as follows (see code): we create a new table called dat3, in which we use the function aggregate(). We specify that we want to aggregate the variable count (words per minute) for each participant (ID) and keyboard type. We then specify which table we should do that on (data = dat2), and the function we want to use for the aggregation (FUN = mean).
dat3 = aggregate(count ~ ID + keyboard, data = dat2, FUN = mean)
print(dat3)

Output:
  ID keyboard    count
1  1   dvorak 115.5000
2  2   dvorak 115.2500
3  3   dvorak 115.2500
4  4   dvorak 114.7500
5  5   dvorak 114.5000
6  6   dvorak 114.5000
7  7   dvorak 114.2500
….
Now we can run our test on the new table dat3, which has no outliers and the aggregated results (see code). There is a lot of information displayed as a result. In a way we don’t really care about all of it here, because we first look at the p-value, which is low (below 0.05, our significance level). We can thus reject the null hypothesis and accept our hypothesis H1.
t.test(dat3$count[dat3$keyboard == "dvorak"],
       dat3$count[dat3$keyboard == "qwerty"],
       paired = TRUE, alternative = "two.sided")

Output:
Paired t-test
data: dat3$count[dat3$keyboard == "dvorak"] and dat3$count[dat3$keyboard == "qwerty"]
t = 53.889, df = 11, p-value = 1.107e-14
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 5.888160 6.389618
sample estimates:
mean difference
      6.138889
Finally, as we have seen in chapter 1, it is good to have the significant difference (p-value), but we also need to compute the effect size to report the results properly. The effect size for a paired-samples t-test can be calculated by dividing the mean difference by the standard deviation of the differences. We can do this using the function cohen.d(). Doing this outputs a d-value whose magnitude ranges from 0 to infinity, with 0 indicating no effect, where the means are equal.
install.packages("effsize")   # you only need to install the library once
library(effsize)
cohen.d(dat3$count[dat3$keyboard == "dvorak"],
        dat3$count[dat3$keyboard == "qwerty"],
        paired = TRUE)

Output:
Cohen's d
d estimate: 9.41925 (large)
95 percent confidence interval:
    lower     upper
 6.977857 11.860643
In our case the Cohen’s d is 9.42: the average words per minute with the Dvorak is 9.42 standard deviations greater than the average words per minute with the Qwerty. We often use the following rule of thumb when interpreting Cohen’s d: a value between 0.2 and 0.5 represents a small effect size; a value between 0.5 and 0.8 represents a medium effect size; a value above 0.8 represents a large effect size. We would thus interpret our results as a large effect size. Note we compute the effect size in the hope of finding an effect (small, medium, or large). But if the d-value is less than 0.2, the difference is negligible, even if it is statistically significant.
Step 11 Conclude
Counterbalancing more than two factors
Now this is simple with two levels. What do we do if we have more things we want to compare? A common method is to use a Latin square. Let's consider a study with four experimental conditions (A, B, C, D). We make a Latin square (see below; a small R sketch of assigning orders follows the table). Participants are assigned to different sequences based on the rows or columns of the Latin square. For example, participants number 1, 5 and 9 (etc.) would start with condition A, then B, then C, then D.
Participant 1, 5, 9, etc A B C D
Participant 2, 6, 10, etc B C D A
Participant 3, 7, 11, etc C D A B
Participant 4, 8, 12, etc D A B C
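A minimal R sketch of this assignment (not from the original document; the modulo trick maps participant IDs 1, 5, 9, … to row 1, IDs 2, 6, 10, … to row 2, and so on):
latin_square = rbind(c("A","B","C","D"),
                     c("B","C","D","A"),
                     c("C","D","A","B"),
                     c("D","A","B","C"))
participant_id = 7
row = ((participant_id - 1) %% 4) + 1   # participants 3, 7, 11, ... get row 3
latin_square[row, ]                     # "C" "D" "A" "B"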
At this point you may understand that choosing the number of participants is important. In the example above, with a 4-condition within-subject study using a Latin square, we need to have the same number of entries for each of these possible orders. Thus, we need the number of participants to be a multiple of 4 so that we have a fair number of entries for each possible order, to spread the possible noise.
2.4 Summary (what you should write in a paper)
We designed a keyboard layout we called Dvorak. Its layout emphasizes placing the most common letters and letter combinations on the home row to minimize finger movement and increase typing speed. The layout prioritizes reduced finger travel, encourages finger alternation for smoother typing, and minimizes instances of same-finger typing. It is tailored to the English language, considering letter frequencies and linguistic characteristics. We want to know if the Dvorak layout results in a different typing speed compared to the conventional Qwerty layout. Our hypotheses are:
H1: The Dvorak layout has a significantly higher typing speed compared to the Qwerty one.
H0: There is no significant difference in typing speed between the Dvorak and QWERTY layouts.
Variables: Our independent variable is the type of keyboard used. Our dependent variable is the typing speed. To measure the typing speed, we use the number of correct words written within 1 minute. We did this to capture quality as much as quantity of typing. We generated several sample texts that the participants copied. To avoid a learning effect, we controlled this variable and generated 16 samples of text of the same difficulty <say how>.
Apparatus: The study was set up in a room in which only the participants and experimenter were present. A computer screen was on a desk with the keyboards below. The experimenter swapped the keyboards. The participant could adjust the height of their chair to be ergonomically sat at the table.
Procedure: Participants were first explained the task and signed a consent form. They were asked to sit comfortably at the table. They were instructed to type as fast as possible while avoiding mistakes. A sample text in English was displayed on the screen and a countdown was shown before each trial to allow the participant to get ready. The experimenter could stop the experiment at any time between trials when participants needed a break. The word count performance was not shown to the participant. (Note you can also show a screenshot of the interface and provide the sample texts.)
Participants: 12 participants (<add their ages, genders>) were recruited within our institution and use a keyboard regularly as part of their work. They received no incentive for their participation. All participants spoke English as their first language, as Dvorak was designed for the English language.
Experimental design: We are running a within-subject study with 12 participants. Our independent variable is the type of keyboard (Dvorak or Qwerty), and the dependent variable is the typing speed (number of correct words typed within a minute). The order of the keyboards is counterbalanced. Each participant repeats the task 4 times for each keyboard. We generated 8 sample texts of similar difficulty as controlled variables to avoid a learning effect. The 8 sample texts are randomised so that each sample is used once. Before each swap of keyboards, participants must practise on 4 randomly generated texts (different from the trial texts). In summary our design is 12 (participants) x 2 (types of keyboards) x 4 (repetitions) = 96 data samples. Our experiment lasts 20-25 minutes per participant. If our data is normally distributed we will use a t-test, if not a Friedman test, to compare keyboard performance.
Results: We removed one outlier (3 std above the mean). We ran a Shapiro-Wilk test (W = 0.98269, p > 0.05). We thus assumed normality and ran a paired Student t-test (t = 53.889, df = 11, p < 0.05), showing that there is a significant difference between the performance of the two keyboards. The effect size, as measured by Cohen’s d, was d = 9.42, indicating a large effect. Dvorak averaged 114.4 words per minute while Qwerty averaged 106 words per minute (figure 1). We can conclude that the Dvorak layout enhances typing speed.
Figure 1. <caption>
Chapter 3. The memorization study
Did you know that places where chocolate consumption is highest have the most Nobel Prize recipients? It's true, at least according to a 2012 study published in the New England Journal of Medicine. Of course, that could be a coincidence. But is it possible that intelligence or other measures of high brain function are improved by the consumption of chocolate? Let's try to see if we can figure some of it out. I want you to step into the shoes of an experimental psychologist. The topic of our study will be the effect of reward and punishment on memorization.
Warning!! For learning purposes, I am going to use bad examples of experimental design in this chapter. I hope these examples will help you understand the intricacies of designing experiments, what to do and what not to do. And we will use this to go into more complex parametric tests (the t-test with Bonferroni correction and ANOVA).
We then divide the room into three equal parts and explain that each split is a different group.
- The left side is the no reward group. We ask them to do their best at the game.
- The middle side is the reward group. We put a huge box containing chocolates on a table
in front of them and explain that if they beat everybody they will get the chocolate.
- The right group is the punishment group. We tell them if they arrive last, they will receive
a slap (note again this is not a real experiment!).
We proceed with running the experiment. We manage to get up to a list of 10 numbers until all the students fail to remember correctly. We ask all the students to fill in an online questionnaire to individually report their memorization score. We end up with a list of memorization scores, a bunch of entries for each group in the room. We put all this in a csv file called “chocolate.csv”.
Now, I know you have probably already seen some huge issues with the design of this experiment, but before analysing what is wrong I want to dive into what we would do for the statistical analysis. Even if the study is completely flawed, we will still do the stats well.
Identify variables
In our design, what are the independent and dependent variables?
We have one independent variable, which is the group the students are split into. We will call this independent variable “group”. Group can have three values: “chocolate”, “punishment” or “no reward”. This is the variable we are manipulating. We have one dependent variable, the memorization score. It is a continuous ratio-scale variable. We are trying to see if the memorization score changes depending on the group.
All the other variables in an experiment should be controlled. It is crucial to avoid confounding variables, and when it comes to designing an experiment it is probably the hardest part. In our design we have a lot of confounding variables, which is not good. Can you think of some? More precisely, take a minute to think about what other factors in this experiment could screw up the results and that we have not checked. Could you think of a way to transform them into controlled variables?
o We have not checked if participants like chocolate to start with. Their results would corrupt our data, even worse if we in fact ended up with a reward group containing only people who dislike chocolate. We must control this. The way to do it is to ask participants if they like chocolate, and if not they cannot participate in the study.
o At what time of the day are we doing this study? Is it straight after lunch? If so, it could corrupt our data, as people may not feel particularly hungry for chocolate. We must control this. We should ask the participants to come to the experiment, let's say, 2h after a meal, and if not they cannot participate.
o Are we sure the participants are not cheating with their score, or just making mistakes? It is possible. We must control this. A way to do this is to remove the self-reported results and ask them to enter their guesses directly into a computer that checks them automatically.
o On top of this, we are doing the experiment with all the students present together. They may influence each other (see the point before), but more importantly the simple fact of seeing the other groups might itself be an incentive (a form of competitive reward) to win. We must control this. We should do the experiment individually, asking each participant to do the task one by one, without seeing the others. Independence of data is also a requirement for many statistical tests.
o What about the lists of numbers you ask them to memorize? Imagine some of them are “45 46 47 48 49 50”: would this be easier than remembering “45 98 4 72 18 37”? Probably yes. Thus, the lists of numbers must not be purely random; they must be handcrafted to avoid any possible lucky draw.
Identify type of design
This point relates to whether we want paired data or unpaired data. Between-subjects, within-subjects, and mixed design are terms used to describe different experimental designs. What have we done for our experiment?
In our experiment, we follow a between-subject design because each group of students is exposed to only one level (reward, baseline, punishment) of the independent variable. This is a pretty good decision, given that doing it within-subject may lead to participants potentially cheating in the baseline condition to make sure they get the chocolate.
Now, I said before that it is better to try to do it as within-subject, even if there are more learning and fatigue effects to manage. However, it is not always possible, as in our case. For example, consider another scenario where researchers are investigating the effectiveness of two new drugs in treating a specific medical condition. In this case, it would be impossible to have participants use both drugs without interference between them, so we would have to do a between-subject experiment.
dat = read.csv("Desktop/chocolate.csv", [1] 6.4 (chocolate)
header = TRUE) [1] 6.3 (no chocolate)
mean(dat[dat$group == 'chocolate', 'score'])
mean(dat[dat$group == 'nochocolate', 'score'])
shapiro.test(dat$score) Shapiro-Wilk normality test
data: dat$score
W = 0.96331, p-value = 0.3752
t.test(dat$score[dat$group == "chocolate"], Welch Two Sample t-test
dat$score[dat$group =="nochocolate"],
paired = FALSE, alternative = "two.sided") data: dat$score[dat$group ==
"chocolate"] and dat$score[dat$group ==
"nochocolate"]
t = 0.10991, df = 16.086, p-value =
0.9138
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
-1.827903 2.027903
sample estimates:
mean of x mean of y
6.4 6.3
Conclude: We ran a Shapiro-Wilk test (W = 0.96331, p > 0.05). We can thus assume normality and run a parametric test. We ran an unpaired (Welch) t-test (t = 0.10991, df = 16.086, p-value > 0.05). There is no significant difference, so we cannot conclude anything.
Note: the group with reward scored better than the control group BUT the results are non-significant, so we do not conclude anything from those data. The groups are neither shown to be different, nor shown to be equal.
Figure 5. Type I and II errors
If we run multiple t-tests, we must account for the increased probability of making a Type I error. The solution in our case is to use what we call a Bonferroni correction. The Bonferroni correction is employed to mitigate the inflated risk of making Type I errors. It adjusts the significance level of each test by dividing the chosen alpha level by the number of comparisons.
In practice it is simple. We count how many tests we need to run, which corresponds to how many combinations of factors we have. In our case we have 3 factors, so we need 3 comparisons (chocolate-no reward, chocolate-punishment, punishment-no reward). Thus, our significance level is no longer 0.05 but 0.05/3, so roughly 0.0167. A p-value would have to be below this to count as significant.
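Equivalently, instead of lowering the threshold by hand, R can apply the correction to the p-values themselves with the base function p.adjust(). A minimal sketch, using made-up p-values purely for illustration:

# three raw p-values from three pairwise tests (hypothetical values)
pvals = c(0.041, 0.020, 0.003)
# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1);
# the adjusted values can then be compared against the usual 0.05
p.adjust(pvals, method = "bonferroni")
[1] 0.123 0.060 0.009

Note how, after correction, only the third comparison would remain significant.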
dat = read.csv("Desktop/chocolate.csv", [1] 6.4 (chocolate)
header = TRUE) [1] 6.3 (no chocolate)
mean(dat[dat$group == 'chocolate', 'score']) [1] 1.5 (slap)
mean(dat[dat$group == 'nochocolate', 'score'])
mean(dat[dat$group == slap, 'score'])
t.test(dat$score[dat$group == "chocolate"], Welch Two Sample t-test
dat$score[dat$group =="nochocolate"], data: dat$score[dat$group ==
paired = FALSE, alternative = "two.sided") "chocolate"] and dat$score[dat$group ==
"nochocolate"]
(we already had that) t = 0.10991, df = 16.086, p-value =
0.9138
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
-1.827903 2.027903
sample estimates:
mean of x mean of y
6.4 6.3
t.test(dat$score[dat$group == "chocolate"], Welch Two Sample t-test
dat$score[dat$group =="slap"], paired = data: dat$score[dat$group ==
FALSE, alternative = "two.sided") "chocolate"] and dat$score[dat$group ==
"slap"]
t = 7.4532, df = 16.905, p-value =
9.772e-07
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
3.512337 6.287663
sample estimates:
mean of x mean of y
6.4 1.5
t.test(dat$score[dat$group == "slap"], Welch Two Sample t-test
dat$score[dat$group =="nochocolate"], data: dat$score[dat$group == "slap"] and
paired = FALSE, alternative = "two.sided") dat$score[dat$group == "nochocolate"]
t = -5.6656, df = 13.807, p-value =
6.145e-05
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
-6.619488 -2.980512
sample estimates:
mean of x mean of y
1.5 6.3
As you see, we ran two more t-tests. This time their p-values are low. But are they low enough? Remember we compare them with 0.0167. It turns out they are still below this level for our two last comparisons, so we are happy and can conclude. (Note that I have not computed Cohen’s d for effect size here; it would be required for each comparison you make, and a sketch of how to do it is given below.)
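One way to compute Cohen’s d in R is via the effsize package (an assumption on my part; any equivalent effect-size function works just as well). A minimal sketch for the chocolate vs. no-reward comparison:

# install.packages("effsize")   # only needed once
library(effsize)
# Cohen's d for chocolate vs. no reward; repeat for the other two comparisons
cohen.d(dat$score[dat$group == "chocolate"],
        dat$score[dat$group == "nochocolate"])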
Conclude: We ran a Shapiro-Wilk test (W = 0.96331, p > 0.05). We can thus assume normality and run a parametric test. We chose to run 3 unpaired t-tests. We thus had to adjust the significance level from 0.05 to 0.05/3 (0.0167) using a Bonferroni correction. We found no significant difference between chocolate and no reward (t = 0.10991, df = 16.086, p-value > 0.05). However, we found a significant difference between chocolate and slap (t = 7.4532, df = 16.905, p < 0.0167) as well as between no reward and slap (t = -5.6656, df = 13.807, p < 0.0167). Our results show that punishment significantly affects the memorization score (mean 1.5 for slap compared to 6.3 and 6.4 for no reward and chocolate respectively).
Note: I see this mistake often with the Bonferroni correction. If you had an independent variable with 4 factors/groups, you would have 6 comparisons to make, so your significance level drops to about 0.008. If you had an independent variable with 5 groups, you would have 10 comparisons to make, so your significance level drops to 0.005, and so on. The more comparisons you make, the harder it becomes to get a p-value below the threshold, which is why people tend to use the ANOVA alternative instead.
• Homogeneity of Variances (Homoscedasticity): The variances of the groups being compared should be approximately equal. This assumption is critical for the reliability of ANOVA results. You can check for homogeneity of variances using statistical tests or graphical methods. That is a new thing, and we will see below how to deal with it.
An ANOVA runs in three steps:
1. You check for homogeneity of variance.
2. You run the ANOVA.
3. You run post-hoc tests.
Now the code we will be using does steps 1 and 2 together, which makes our life easier. We do this as follows. Note that sometimes we must use functions (here ddply) which come from specific libraries. This is why we first import the library plyr with the first line library(plyr). In most cases the library you need will be specified in this document, but if you use other statistical tests you will find the information in the R online documentation. Not all libraries come bundled with R, so sometimes you may also have to use the command install.packages(), but you only need to do this once.
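Since the code itself is not reproduced above, here is a minimal sketch of what it could look like, assuming the ez package (which builds on plyr) and assuming chocolate.csv also contains an ID column identifying each participant:

library(plyr)
library(ez)        # provides ezANOVA(); run install.packages("ez") once if needed
dat = read.csv("Desktop/chocolate.csv", header = TRUE)
dat$ID = factor(dat$ID)          # assumed participant identifier column
dat$group = factor(dat$group)
# One-way between-subject ANOVA; for a between factor, ezANOVA() also returns
# Levene's test for homogeneity of variance in the same output
ezANOVA(data = dat, dv = score, wid = ID, between = group, detailed = TRUE)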
At the top are our ANOVA results; at the bottom, our test for homogeneity. We look at this last one first. If the p-value for the Levene test is greater than .05, then the variances are not significantly different from each other (i.e., the homogeneity of variance assumption is met). It is the same story we had with normality, remember: it means we can “assume” the homogeneity of variance is met. We are not sure, but the ANOVA should be robust to this.
We then look at the top result. We search for the p-value; it is low. It means that there is something significant in our data, but at this stage we don’t know what. To figure out what it is we need to run post-hoc tests. This is the last step. So directly after you run your ANOVA you input this:
pairwise.t.test(dat$score, dat$group,
                paired = FALSE,
                p.adjust.method = "bonferroni")

Output:
Pairwise comparisons using t tests with pooled SD
data: dat$score and dat$group
            chocolate nochocolate
nochocolate 1         -
slap        5.8e-06   8.0e-06
As you can see, we end up with a matrix of p-values. Basically, this gives us all the possible comparisons, like with the t-tests. But this time we don’t need to apply a Bonferroni correction ourselves, as it is already included in the test we just performed (in a different way than we did before).
For effect size with an ANOVA we use η² (Eta squared), rather than Cohen’s d as with a t-test. The good news for you is that the function ezANOVA() we ran earlier automatically computes that for you. It is the “ges” column just after the p-value. In this case we have η² = 0.63. The value of Eta squared ranges from 0 to 1. The following rules of thumb are used to interpret it: from 0.01 to 0.06 is a small effect size, from 0.06 to 0.14 a medium effect size, and higher than 0.14 a large effect size. We thus have a large effect size.
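If you want to cross-check that value with base R only, a minimal sketch for a simple one-way between-subject design like this one (where the generalized eta squared reported by ezANOVA coincides with the classical η² = SS_between / SS_total) could be:

# classical eta squared from a base-R one-way ANOVA
fit = aov(score ~ group, data = dat)
ss = summary(fit)[[1]][["Sum Sq"]]   # sums of squares: group, then residuals
ss[1] / sum(ss)                      # eta squared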
Conclude: We ran a Shapiro-Wilk test (W = 0.96331, p > 0.05). We can thus assume normality and run a parametric test. We chose to run a one-way ANOVA. We can also assume homogeneity of variance following a Levene test (DFn=1 DFd=2 SSn=27 SSd=3.8 F=34.4 p>.05). “A one-way ANOVA showed a significant effect on memorization score for our independent variable Group (F2,27=23.79, p < 0.05). Post-hoc comparison t-tests (with Bonferroni adjustment method) showed a significant difference between the groups chocolate and slap (p<0.05) and between the groups no reward and slap (p<0.05). The effect size, as measured by Eta squared, was η² = 0.63, indicating a large effect. Our results show that punishment significantly affects the memorization score (mean 1.5 for slap compared to 6.3 and 6.4 for no reward and chocolate respectively).”
3.7 Summary (what you should write in a paper)
Studies suggest that humans can memorize an average of 7 ± 2 chunks of information [X]. Each of these 10 sequences was not generated using an automatic random function, to avoid cases in which an easy number sequence appears (e.g. 1 2 3 4). We also generated 3 different versions of each of these sequences to enable more trials per participant. Those versions are all distinct to avoid any potential learning biases.
Hypothesis: The memorization score between the groups will be different: the reward group will score higher than the baseline group (H1), and the punishment group will score lower than the other groups (H2).
Experimental design: We use a between-subject design. Our independent variable is group (baseline or reward). Our dependent variable is the memorization score, which is a continuous ratio-scale. We controlled the variable “length of numbered sequence” to have 10 different lengths. We have 12 participants in each group. Each different length of list of numbers is tested 3 times (3 trials). As a result, we have 12 x 2 (group) x 10 (sequence length) x 3 (trials) = 720 entries. The experiment lasted 30 minutes. We plan to perform the following statistical tests: we will first test whether our data follows a normal distribution. If it does, we will run an ANOVA. If not, we will run the non-parametric equivalent.
Results: We can also assume homogeneity of variance following a Levene test (DFn=1 DFd=2 SSn=27 SSd=3.8 F=34.4 p>.05). “A one-way ANOVA showed a significant effect on memorization score for our independent variable Group (F2,27=23.79, p < 0.05). Post-hoc comparison t-tests (with Bonferroni adjustment method) showed a significant difference between the groups chocolate and slap (p<0.05) and between the groups no reward and slap (p<0.05). The effect size, as measured by Eta squared, was η² = 0.63, indicating a large effect. Our results show that punishment significantly affects the memorization score, thus accepting H2 (mean 1.5 for slap compared to 6.3 and 6.4 for no reward and chocolate respectively).”
<Add a figure to show results visually>
Chapter 4. Snack study (realm of non-parametric)
In this chapter I want you to imagine you are working for the M&M's company and want to
improve your product. The first M&M's came in six colours: brown, yellow, orange, red, green,
and violet. Then, in 1949, violet was switched to tan. Later, the company asked customers to vote
for which colour they wanted in the pack and the winner was blue. Nowadays, you'll find brown,
yellow, orange, red, green, and blue. Despite common belief, each coloured M&M’s does not have a different flavour; they all have the same chocolate taste. Despite this, you got some infuriated Twitter posts about certain colours not tasting good, and you want to investigate this further. You want to understand if there is some kind of colour bias effect at play and, if there is, you want to suggest a new distribution of colours per packet to satisfy your customers.
In this chapter we are entering the realm of non-parametric tests, which are different from the tests we have seen so far. They are very robust to data that does not follow a normal distribution, which is often the case for the type of data we are going to gather here. We also finish with another useful test, parametric this time: linear regression.
4.2 Gathering subjective metrics via questionnaires
At this stage the experiment is rather well designed, but I have omitted a detail. How are we going to gather the tastiness metric? Contrary to the other example studies we have done so far, this is not something we can easily measure. It is something we need to ask the participants: it is a subjective measure. Now there is a very simple way to gather this information. We use a Likert scale questionnaire.
Likert scale questionnaires are a popular method of collecting and analysing opinions, attitudes,
and perceptions of respondents towards a particular subject. The Likert scale itself is a
psychometric scale commonly used in survey research. A typical Likert scale consists of a series
of statements or items related to a specific topic. Respondents are asked to indicate their level of
agreement or disagreement with each statement by selecting a position on a scale. Participants
choose the response that best represents their opinion or attitude.
• The scale is usually five or seven points, with response options such as: Strongly Agree;
Agree; Neutral; Disagree; Strongly Disagree
• For seven-point Likert scales, additional options might include: Strongly Agree; Agree;
Somewhat Agree; Neutral; Somewhat Disagree; Disagree; Strongly Disagree
Likert scales are great in statistics because the scores assigned to each response can be analysed statistically to understand the distribution of opinions within a group. It is important that you use them systematically to gather subjective metrics because of one important aspect: the responses are categorical data BUT in statistics they can be treated as a continuous variable. This is a massive game changer for doing statistics; it simplifies things a lot.
One common alternative that people often think about is to use a ranking scale. Ranking scales are to be avoided, because rankings stay categorical data whatever you do. For example, imagine you must rank your preference between orange juice, apple juice and broccoli juice. I would personally say apple juice is best, then orange juice and then broccoli juice. But apple and orange are very equal to my taste buds … and broccoli juice is not going anywhere near me :). If I had a 7-point Likert scale instead, I would give a 1 to broccoli juice and a 7 to both orange and apple juice.
Likert scales should be used instead. They allow respondents to express their opinions on a continuum. This granularity allows for a more nuanced understanding of individual attitudes. They can be treated as interval data, allowing for statistical analysis such as mean, median, and standard deviation. This enables researchers to quantify and compare the strength of opinions. Likert scales are generally easier for respondents to understand and complete: participants are familiar with expressing their agreement or disagreement on a scale, making Likert items more user-friendly. Likert scales also offer flexibility in the number of response options, allowing researchers to customize the scale based on the complexity of the topic.
In terms of the question you ask, it is often a variation of this:
On a scale from 1 to 5, how would you rate the taste of [specific item]?
1. Not Tasty at All  2. Slightly Tasty  3. Moderately Tasty  4. Very Tasty  5. Extremely Tasty
One thing to be careful with in a Likert scale questionnaire is to avoid biasing users in the question itself. We tend to make the question as neutral as possible. For example, we would avoid “on a scale of 1 to 5, how delicious is [specific item]?” because it implies that the item must be delicious. If you have several Likert scale questions, also make sure to be consistent with the scale you use (what 1 means, what 5 means) so as not to confuse participants.
4.3 Analysing the data with a non-parametric test
We should have the answers from our study in a file called “snack.csv”. Let’s summarize, and use the chart in figure 2 to figure out which tests to use. Our independent variable is a category with 4 factors (the 4 colours). Our dependent variable is a Likert scale. Because it can be treated as interval data, we can continue in the same branch as for our keyboard and memorization studies. The next step is the usual one: the assumption of normality. This time, of course, the data is not normally distributed, so we get to explore the non-parametric versions of the t-test and ANOVA. We are testing more than 2 groups (we have 4), and our data is paired because each participant tried the 4 colours (within-subject). So, we must use a Friedman test. If our data was not paired, we would have used a Kruskal-Wallis test.
The Friedman test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts or conditions. It is, in a way, the non-parametric counterpart of the ANOVA. It is an extension of the Wilcoxon signed-rank test to more than two related groups (the Wilcoxon signed-rank test can be seen as a non-parametric paired t-test). The process in R is not that different from what we did before. I put below the different commands we need to launch.
dat = read.csv("Desktop/snack.csv", header = TRUE) [1] 4.25 (yellow)
mean(dat[dat$color == 'yellow', 'taste']) [1] 3.4 (green)
mean(dat[dat$color == 'green', 'taste']) [1] 1.95 (blue)
mean(dat[dat$color == 'blue', 'taste']) [1] 1.9 (Red)
mean(dat[dat$color == 'red', 'taste'])
shapiro.test(dat$taste) Shapiro-Wilk normality test
data: dat$taste
W = 0.89426, p-value = 7.029e-06
friedman.test(dat$taste, Friedman rank sum test
dat$color, dat$ID)
data: dat$taste, dat$color and dat$ID
Friedman chi-squared = 34.721, df = 3, p-value =
1.395e-07
pairwise.wilcox.test(dat$taste, Pairwise comparisons using Wilcoxon rank sum test
dat$color, p.adj = "bonf") with continuity correction
All our p-value are low (<0.05). For our normality test this confirm that the data is not normally
distributed, we have rejected the null hypothesis, we know this for sure. The Friedman test, like
ANOVA, give us an indication that something is the data is statistically different, but we don't
know what. We run a post-hoc test (here a Wilcoxon rank sum test) with a bonferroni (“bonf”)
correction (already included so we don’t have to adjust our significance level). And we get a table
of p-value comparing each colour. Now which one are we allowed to compare? All those who are
below 0.05 which is a lot them apart from the comparison red vs. blue which is not significantly
different. To get a bit more of what is going on we can do a boxplot. Red and Blue appeared the
less tasty, but the results are not significantly different from one another. But we know that Yellow
is the tastiest, followed by green and then blue and red altogether.
Now of course the p-value on its own is not sufficient; for non-parametric tests we also need to measure effect size. Kendall’s W can be used as the measure of effect size for the Friedman test. It takes values from 0 (indicating no relationship) to 1 (indicating a perfect relationship). Kendall’s W uses the following interpretation guidelines: 0.1 to < 0.3 is a small effect, 0.3 to < 0.5 a moderate effect, and >= 0.5 a large effect. We can see our effect is large (0.579).
install.packages("rstatix") # A tibble: 1 × 5
library(rstatix) .y. n effsize method magnitude
dat %>% friedman_effsize(taste ~ color * <chr> <int> <dbl> <chr> <ord>
| ID) 1 taste 20 0.579 Kendall W large
boxplot(dat[dat$color == 'yellow', 'taste'],
dat[dat$color == 'green', 'taste'], dat[dat$color ==
'blue', 'taste'], dat[dat$color == 'red', 'taste'],
names=c("yellow","green","blue","red"),ylim=c(0,5))
Conclude: We did not find outliers (3 SD from the mean). We ran a Shapiro-Wilk test (W = 0.89426, p < 0.05); our data does not follow a normal distribution. A non-parametric Friedman test (chi-squared = 34.721, df = 3, p < 0.05) showed a significant effect on tastiness score for our independent variable color. Post-hoc comparisons (using Wilcoxon rank sum tests with Bonferroni adjustment method) showed significant differences between all the colors except for the comparison of blue and red. The effect size, measured by Kendall’s W, was W = 0.579, indicating a large effect. Our results show that yellow (mean 4.25) is rated significantly tastier than any other color. Green (mean 3.4) is rated as less tasty than yellow but tastier than red (mean 1.9) and blue (mean 1.95). Red and blue are the least tasty, but we did not find any significant difference between those two.
• If your study was unpaired rather than paired, we would do the same process as with the Friedman test, but with a Kruskal-Wallis test as below. Note that to compute the effect size we would use the function kruskal_effsize().
kruskal.test(taste ~ color, data = dat)
pairwise.wilcox.test(dat$taste, dat$color, p.adjust.method = "bonferroni")
• If your study was paired but you had only two colors to compare, we would use a Wilcoxon signed-rank test as below (in a similar fashion to a paired t-test). Note that to compute the effect size we would use the function wilcox_effsize().
wilcox.test(dat$taste[dat$color == "yellow"], dat$taste[dat$color == "green"], paired = TRUE)
• If your study was unpaired and you had only two colors to compare, we would use a Mann-Whitney U test as below (in a similar fashion to an unpaired t-test). Note that to compute the effect size we would use the function wilcox_effsize().
wilcox.test(dat$taste[dat$color == "yellow"], dat$taste[dat$color == "green"], paired = FALSE)
4.5 Using Chi-square test
As there are other tests in the figure 2 chart that we have not seen, I want to push the M&M’s study example further. Imagine that, following the first study, the marketing people decide to change their production line to ensure each packet of M&M’s has 40% yellow, 30% green, and 15% each of blue and red. They want to incentivise their customers to buy more and made an advert for this. As a customer who cherishes M&M’s, you want to make sure this is not fake news. You want to verify that the distribution is as advertised. We might postulate the following hypotheses:
H1: The color distribution for M&Ms is different from the distribution stated in the null hypothesis.
H0: The color distribution for M&Ms is 40% yellow, 30% green, 15% red, 15% blue.
We select a random sample of 150 plain M&M’s from a bunch of packets to test these hypotheses. Of course, we will eat them all afterwards :). We get 55 yellow, 43 green, 25 blue and 27 red.
To test this, you can use a chi-square test. The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables. It is a non-parametric test, meaning it makes no assumptions about the distribution of the data. The chi-square test comes in different forms, each tailored for specific types of data and research questions. Here we need to use the chi-square goodness-of-fit test. It is used when you have one categorical variable and you want to compare the distribution of observed frequencies to an expected distribution. We do this as follows.
table = c(55, 43, 25, 27)
chisq.test(table, p = c(0.4, 0.3, 0.15, 0.15))

Output:
Chi-squared test for given probabilities
data: table
X-squared = 1.6833, df = 3, p-value = 0.6406
Our p-value is high (>0.05) so we cannot reject the null hypothesis. We do not have evidence that
the manufacturer has fooled us. Now imagine we got a different result such as this:
table = c(53, 35, 30, 32)
chisq.test(table, p = c(0.4, 0.3, 0.15, 0.15))

Output:
Chi-squared test for given probabilities
data: table
X-squared = 9.55, df = 3, p-value = 0.02281
In that case the p-value is low so we can reject the null hypothesis and accept H1: The color
distribution for M&Ms is different from the distribution stated in the null hypothesis. So, the
manufacturer has fooled us!
Now, as usual, it is important to report the effect size. With the chi-square test this is given by Phi: φ = sqrt(X-squared / sample size). In our case this is computed as below. A value between 0.1 and 0.3 is considered a small effect, from 0.3 to 0.5 a medium effect, and above 0.5 a large effect. We thus have a small effect.
sqrt(9.55 / 150)
[1] 0.2523225
Conclude: We ran a chi-square test (Χ² = 9.55, df = 3, p < 0.05). The effect size, as measured by Phi, was φ = 0.25, indicating a small effect. We reject the null hypothesis: the color distribution for M&Ms is not 40% yellow, 30% green, 15% red, 15% blue.
4.6 Using regression for continuous independent and dependent variables
We use the M&M’s example for a final test, parametric this time: linear regression. Imagine the marketing people decide to tailor the recipe for M&M’s so that they do not melt easily in the hands of their customers. They want to investigate whether the size of M&M's (independent variable) can predict the time it takes for them to melt (dependent variable). Their hypotheses are that the size of M&M's has no effect on melting time (H0) and that the size of M&M's predicts melting time (H1).
To figure this out we can use a linear regression to model the relationship between M&M's size and melting time. Regression is a statistical method used when both your dependent and independent variables are scale data. Imagine they run an experiment with different sizes of M&M’s, measuring how long they take to melt at 33 degrees Celsius (the temperature of human hands). The file containing the data is melting.csv.
dat = read.csv("Desktop/melting.csv",
header = TRUE)
scatter.smooth(x=dat$size, y=dat$speed,
main="speed ~ size")
linearMod <- lm(speed ~ size, data=dat)
summary(linearMod) Call:
lm(formula = speed ~ size, data = dat)
Residuals:
Min 1Q Median 3Q
Max
-0.076497 -0.037079 0.005408 0.039839
0.081218
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.3713051 0.0282071 154.97
<2e-16 ***
size 0.0102072 0.0001684 60.61
<2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
0.05 ‘.’ 0.1 ‘ ’ 1
42
What this means is that we have a linear equation: melting speed = 4.371 (intercept) + 0.01 * size. The p-value is < 0.05 and the R-squared is 0.9946. R-squared tells us what percentage of the variation within our dependent variable the independent variable explains; in other words, it is another way to determine how well our model fits the data. In the example above, size explains ~99% of the variation within our dependent variable.
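Once the model is fitted, you can also use it to predict the melting speed for a new size. A quick sketch (the size value of 150 below is made up purely for illustration, in whatever unit melting.csv uses):

# predicted melting speed for a hypothetical size of 150
predict(linearMod, newdata = data.frame(size = 150))
# equivalently, by hand with the fitted equation: 4.3713051 + 0.0102072 * 150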
Conclude: A simple linear regression was used to test whether the size of M&M’s significantly predicted melting time. The fitted regression model was “melting speed = 4.371 + 0.01 * size”. The regression was statistically significant (R² = 0.9946, F(1,20) = 3673, p < 0.05).
Chapter 5. Examples, sample size and p-hacking
5.1 Examples and solutions
Here are a few more examples of studies. I put the solutions after so you can try to think for yourself. For each, try to think about:
1) What is (are) the independent variable(s)?
2) What is the dependent variable?
3) What are the types of data for both?
4) What statistical test should you use?
A 20 participants were asked to write text using two different keyboard layouts (A and B). Half of the participants started the task on the A layout and then did the B layout, and the other half started on the B layout and then did the A layout. The number of words typed per minute was collected for each participant and layout. Choose the most appropriate procedure to decide which layout allows participants to type the fastest. The assumptions of normality and homogeneity are verified.
B 40 participants were randomized into two groups. One group received a drug to decrease hair loss and the other group received a placebo (a sugar pill). At the end of the program, the percentage of hair loss for each patient was recorded. Choose the most appropriate procedure to decide if there is a relationship between the use of the drug and the percentage of hair loss. The assumptions of normality and homogeneity are verified.
C A study attempted to find out if the age of an animal had any relationship to their athletic
ability. The researchers took the data of 104 cheetahs, calculating their age and running a
test to measure their speed. Choose the most appropriate procedure to decide if the age
has any relationship with the run speed.
D 20 participants were asked to type on their phone touchscreen in four different postures (sitting, lying down, standing, and running). The number of words typed per minute was collected for each participant and posture. Choose the most appropriate procedure to decide which posture allows participants to type the fastest. The assumptions of normality and homogeneity are verified.
E 20 participants were asked to type on their phone touchscreen in four different postures (sitting, lying down, standing, and running). They were asked to rate their comfort for each posture using a Likert scale questionnaire. Choose the most appropriate procedure to decide which posture was most comfortable. The data does not follow the normal distribution.
F A study has gathered 10000 observations of computer performance (speed) in three different rooms of varying temperature (15, 25 and 35 degrees Celsius). Choose the most appropriate procedure to decide if the data follows a normal distribution.
Solutions:
A Answer: IV = type of keyboard (category - nominal); DV = words per minute (scale - continuous); Test = paired t-test, because participants did both conditions
B Answer: IV = drug or placebo (category - nominal); DV = percentage hair loss (scale - continuous); Test = unpaired t-test, because participants did only one condition
C Answer: IV = animal age (scale - continuous); DV = athletic ability (scale - continuous); Test = regression
D Answer: IV = posture (category - nominal); DV = words typed per minute (scale - continuous); Test = repeated-measures ANOVA (within)
E Answer: IV = posture (category - nominal); DV = comfort (scale - continuous; Likert can be considered continuous although often not following a normal distribution); Test = Friedman
F Answer: IV = room temperature (scale - discrete); DV = speed (scale - continuous); Test = Kolmogorov-Smirnov, because there are more than 50 observations
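To connect these answers back to R, here is a minimal sketch of the corresponding calls. The data frames and column names (kb, hair, cheetah, typing, comfort, perf, and so on) are made up purely for illustration:

# A: paired t-test (within-subject, two layouts)
t.test(kb$words[kb$layout == "A"], kb$words[kb$layout == "B"], paired = TRUE)
# B: unpaired t-test (between-subject, drug vs. placebo)
t.test(hair$loss[hair$cond == "drug"], hair$loss[hair$cond == "placebo"], paired = FALSE)
# C: linear regression of speed on age
summary(lm(speed ~ age, data = cheetah))
# D: repeated-measures (within-subject) ANOVA with the ez package
library(ez)
ezANOVA(data = typing, dv = words, wid = participant, within = posture)
# E: Friedman test (paired, non-parametric, four postures)
friedman.test(comfort$rating, comfort$posture, comfort$participant)
# F: Kolmogorov-Smirnov test against a normal distribution
ks.test(perf$speed, "pnorm", mean(perf$speed), sd(perf$speed))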
Now, why am I saying this? Well, we have seen how to compute p-values and effect sizes, but we have not touched on how to find the right number of participants for your study, and this is closely linked to statistical power. There are two ways to determine the right sample size.
1) Determining sample size using pilot studies and statistical power. The conventional way to find the correct sample size is to run a pilot study and do a power analysis. Doing this allows you to gather the p-value and the observed effect size. From that you can determine the statistical power (see code below). Researchers should strive for adequate power (typically 0.80 or higher) to minimize the risk of Type II errors (false negatives) and increase confidence in the study's findings. Insufficient power can lead to failure to detect true effects, undermining the validity and generalizability of the study results. By doing a pilot study and gathering the effect size you are likely to see in the real study, you can work out what the sample size of your study should be to ensure you have a statistical power above 0.8.
library(pwr)
# Define parameters
effect_size = 0.5   # Cohen's d (standardized effect size)
sample_size = 100   # Number of observations
alpha = 0.05        # Significance level
# Compute power
power <- pwr.t.test(d = effect_size, n = sample_size, sig.level = alpha)$power
print(power)

Output:
[1] 0.9404272
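The same pwr.t.test() function can also be turned around: leave n out and specify the power you want, and it solves for the required sample size per group. A minimal sketch, reusing the assumed effect size of 0.5:

library(pwr)
# number of participants per group needed to reach 80% power (two-sided, two-sample t-test)
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05)$n
# roughly 64 participants per group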
2) Determining sample size using established sample size heuristics. Not every researcher uses the first method. The most common sample size in HCI is 12 participants for a 2-factor independent variable in a within-subject experiment with 2-3 repetitions of the task. If you do your experiment as a between-subject design, you will have to double this number (24 participants). The more factors you add to your independent variable, or the more independent variables you add, the more participants you will need. This is also why studies in CS tend to be simple, only looking at a few factors at a time, to avoid lowering the statistical power. Try to avoid comparing too many things in one go; rather, split your experiment into several tasks. For more information you can read this article [5].
[5] Kelly Caine. 2016. Local Standards for Sample Size at CHI. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). Association for Computing Machinery, New York, NY, USA, 981–992. https://doi.org/10.1145/2858036.2858498
5.3 P-hacking
P-hacking, also known as significance hacking or fishing for significance, refers to the unethical practice of manipulating or analysing data in ways that increase the likelihood of obtaining statistically significant results. It involves various tactics aimed at achieving a p-value below the conventional threshold for statistical significance in order to claim a significant finding. P-hacking undermines the integrity of scientific research by producing false-positive results and distorting
the evidence base. It contributes to the replication crisis, where subsequent studies struggle to
reproduce initially reported findings. To combat p-hacking, researchers are encouraged to pre-
register their studies, clearly articulate their hypotheses, use appropriate statistical methods, and
report all results transparently, including those that are not statistically significant. Journals and
the scientific community play a role in promoting transparency and replicability by valuing
rigorous methodology and complete reporting. The tactics below are all considered p-hacking:
- Multiple Testing: Conducting many statistical tests on the same dataset increases the
probability of finding at least one result that appears statistically significant purely by
chance. P-hackers may exploit this by testing multiple hypotheses until one yields a
significant p-value.
- Data Mining/fishing: Repeatedly exploring and analysing data without a clear hypothesis
in mind until a significant result is found. This involves trying different combinations of
variables or subgroups until a desirable outcome is achieved.
- Selective Reporting: Choosing to report only the results that support the desired
conclusion while omitting or downplaying results that do not show statistical significance.
This can create a biased and misleading representation of the study findings.
- Hypothesis Switching: Changing the research question or hypothesis after observing the
data to align with the observed results. This can lead to "post hoc" rationalization of
findings.
- Exclusion of Outliers: Selectively removing data points, or labelling them as outliers without justification, to obtain more favourable results. This can lead to an overestimation of the observed effect.
- Data Transformations: Applying various data transformations or statistical techniques
until a desired result is achieved. This can involve manipulating the data to make it more
amenable to finding statistical significance.
Here are several strategies to mitigate p-hacking:
- Pre-Registration: Encourage researchers to pre-register their studies by specifying the
research questions, hypotheses, and analysis plans before data collection begins. This
helps reduce the flexibility in analysis choices that can contribute to p-hacking.
- Transparent Reporting: Emphasize complete and transparent reporting of all results,
including those that are not statistically significant. Journals can play a role in promoting
such reporting standards.
- Publication of Negative Results: Promote the publication of studies with null or non-
significant findings. This helps counteract publication bias and provides a more balanced
view of the evidence.
- Replication Studies: Encourage replication studies to validate and confirm findings.
Replication helps ensure the robustness of results and identifies potential issues related
to p-hacking.
- Adjusting Significance Thresholds: Consider adjusting the conventional significance
threshold (e.g., from 0.05 to 0.005) to reduce the likelihood of false positives. However,
this approach has its own set of challenges and may not be universally adopted.
- Promoting Effect Sizes: Emphasize the importance of effect sizes and confidence intervals
alongside p-values. Understanding the magnitude and precision of an effect can provide
a more comprehensive view of the results.
- Open Data and Code Sharing: Encourage researchers to share their raw data and analysis
code. This allows others to scrutinize the methods and results, promoting transparency
and reproducibility.
- Education and Training: Provide training and education on good research practices,
statistical methods, and the potential pitfalls of p-hacking. And this is hopefully what we
have done here!
- Beneficence: Researchers should strive to maximize the benefits of the research while
minimizing potential harm. The research should contribute to knowledge and
understanding without undue risk to participants.
- Respect for Autonomy: Participants' autonomy should be respected, and they should be
treated with dignity. Researchers should consider individual differences, cultural
sensitivities, and diverse perspectives.
- Ongoing Monitoring: Researchers should monitor the experiment continuously to identify
any unexpected adverse effects on participants. If risks emerge, appropriate action should
be taken.
(Note: this is a working document; if you find any mistakes or anything unclear, please feel free to contact me at [email protected] so that I can improve it.)