Lecture 6 - NHST and Assumptions Testing


PSY107 Lecture 6 – NHST &

Assumptions Testing

Slides prepared by: Timothy Liew


What’s on the menu
today?

▪ NHST
▫ P-values
▪ Introduction to
assumptions
▫ Normality
NHST and p-values

▪ Inference
▪ Hypothesis Testing - NHST
▫ P-values
▫ Type I and II Error
▫ Caveats and considerations

What are Inferential Statistics?

❑ Infer = deduce or conclude from evidence and reasoning rather than


from explicit statements
▪ Inferential statistics: Used to make propositions about a population, using
data from samples that have been drawn from the population
❑ Inferential statistics is preceded by descriptive statistics (= describing the
data without making inferences)
▪ Predict (hypothesis) -> sample -> study -> consolidate data -> statistical
inference -> conclusion regarding hypothesis
o Extra info: we’re adhering to the ‘classical inference’ school of thought
❑ Confidence intervals are sometimes also considered as inferential
statistics.
❑ Why Inferential statistics?
▪ Consider the following scenario

Women have a mean (average) IQ of 140 and men have a mean IQ of 129.
Therefore women are more intelligent than men.

❑ What do you think might be the problem with the above conclusion?

Why Inferential Statistics?

❑ Helps determine how likely such an outcome would be even if there were no
real effect (i.e., if it occurred by chance alone)
❑ Makes making claims a more stringent process by applying the principle
of falsifiability
Logic of Hypothesis Testing

❑ Confirmation bias
▪ We pay more attention to information that confirms our pre-existing beliefs
❑ Principle of falsifiability
▪ We must be able to test for and find results that DO NOT support our claim

▪ Must be able to specify what information will falsify our claim
❑ Example:
▪ Your claim (𝑯𝟏): There is a gender difference in intelligence
▪ Counter-claim (𝑯𝟎): There is no gender difference in intelligence

To abide by the principle of falsifiability & reduce confirmation bias, in
inferential statistics, the standard is to test the null hypothesis (𝑯𝟎)
Null Hypothesis Significance Testing

❑ A framework that seeks to help answer questions like “Is the difference in
height between men and women genuine, or is it simply due to chance?”
▪ Follows the logic of reductio ad absurdum, or “reducing to an absurdity”
▪ You might have seen this logic in the form of proof by contradiction in your
high school math (*shiver*)
❑ Let’s take a look at an example on the next slide
NHST

❑ Let’s say you hypothesize that drinking milk will make you
taller (𝐻1 )
▪ The counter claim to this would be that drinking milk
would not make you taller (𝐻0 )
▪ We momentarily assume that 𝑯𝟎 is the “true state of
reality”
▪ We embark on a quest to refute/reject this counter
claim
❑ We collect our data, and find evidence that 96% of people who
drink milk are very much taller than the national average
▪ Remember, we had assumed that 𝐻0 was true
▪ So our results are considered as very “surprising” if 𝐻0
were true
▪ Therefore, we conclude that 𝐻0 can be rejected.
▪ In other words, in order to support our alternative hypothesis,
we actually collect data to see whether we can reject/refute the
null hypothesis; if the data is considered “surprising” under the
null hypothesis, we can then proceed to reject the null.
Okay cool, but how “surprising” does
my result have to be to reject the null?
p-value

❑ A statistic indicating the statistical significance of evidence


▪ We look at p-values to determine whether we should reject or fail to reject the null
hypothesis
❑ p-value = Probability of observing our sample results given that the null
hypothesis is true
▪ The probability of obtaining a result at least this extreme purely by
chance, when there is really no effect
❑ We use a common threshold (alpha) to decide under what circumstances
rejecting the null hypothesis is ‘safe’ (= most likely not an error). This
conventional level is usually p < 0.05 (= 5% probability)
▪ If the p-value is lower than 0.05, the results are “statistically significant” and it is ‘safe’ for us
to reject the null hypothesis
p-value (Example)

❑ On a stats knowledge test, girls scored 98 and boys scored 88 on average. (Difference
in average: 10)
▪ Inferential stats test will calculate the probability that this would occur
given that the null hypothesis is true. (Null hypothesis, H₀: There is no
difference in statistical knowledge between girls and boys.)
❑ If the inferential stats test result of this difference of 10 is p = .031:
▪ There is a 3.1% chance that you have gotten this result when the H₀ is actually
true (actually no difference, but you found a difference due to
chance/alternative explanations).
▪ Compared to conventional threshold of 5%, you can have some confidence that
your observed difference is an actual difference and did not occur by chance
❑ What you would conclude: The difference is statistically significant. Girls have
significantly better statistical knowledge than boys (p = .031).
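Although this module runs its tests in SPSS, the calculation behind a p-value like this can be sketched in Python with scipy (the scores below are made-up illustrative data, not the actual data behind the slide's example):

```python
from scipy import stats

# Hypothetical stats-test scores (illustrative only)
girls = [98, 96, 101, 99, 97, 100, 95, 102]
boys = [88, 91, 85, 90, 87, 92, 86, 89]

# Independent-samples t-test: p is the probability of observing a
# difference at least this large IF H0 (no difference) were true
t_stat, p_value = stats.ttest_ind(girls, boys)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0")
```

Note that the test itself never "proves" the alternative hypothesis; it only quantifies how surprising the data would be under the null.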
Wow! P-values seem all powerful

There’s always a catch


Errors!

❑ P-values are based on probabilities.


▪ Because of the nature of probabilities, there is always a chance of us
committing some kind of mistake/error.
❑ Type I Error
▪ Rejecting the null hypothesis when the null is actually true
o Incorrectly rejecting the null hypothesis
▪ False Positive/ False Alarm
❑ Type II Error
▪ Failing to reject the null when the null is actually NOT true
o Incorrectly failing to reject the null hypothesis
▪ False Negative
Null = You’re not
pregnant
Examples

❑ Weather forecast is that it won’t rain, but it rained.


▪ What is the null hypothesis?
▪ What is the alternative hypothesis?
❑ What error have the weather forecasters made?
Interim Summary

▪ When working with probabilities, there’s always a


possibility of committing Type I or II Error.
▪ This does not change the meaning of p-values, but we
should be mindful about the error rates.
▪ Let’s take a closer look at these errors in the context of
NHST
Alpha value (α)

❑ Probability of making a Type I error


▪ Probability of incorrectly rejecting the H₀ when the H₀ is actually true
❑ The alpha value (α) is the threshold that we compare p-value against, to determine if
our results are statistically significant
❑ The α that we set is called our “significance level”, what we consider to be an
‘acceptable’ error. This must have been decided before the study.
▪ α determines how bad the predictions of H0 have to be before you’ll decide to reject it
o Need to choose α at the beginning of the research –not when you’re doing
analysis!
❑ If our p-value falls below the set α, we say that our results are “significant”
❑ Conventional critical value of α is 0.05
▪ At α = 0.05, there is a 5% chance of making a Type I error
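The claim that α = 0.05 corresponds to a 5% Type I error rate can be checked by simulation; a minimal sketch in Python (assuming numpy and scipy are available), where both groups are drawn from the same population so H0 is true by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000

# Both groups come from the SAME population, so H0 is true by construction;
# any "significant" result here is a false positive (Type I error)
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(100, 15, size=30)
    b = rng.normal(100, 15, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# The long-run false-positive rate should hover around alpha (about 5%)
print(f"Type I error rate: {false_positives / n_experiments:.3f}")
```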
NHST: How does it work?

❑ Select a null hypothesis, H0


▪ The skeptical hypothesis, that there is no effect
▪ For example:
o Does brain training improve IQ? A good H0 would be µ = 100, the typical
average in the population.
o Can thinking about heads make a coin land heads? A good H0 would be ph =
0.50, the proportion of heads expected by chance alone.
❑ Calculate a p-value or significance value, p:
▪ p is the probability your data would occur IF H0 were true
NHST: How does it work?

❑ Compare p to a decision cut-off, α


▪ α determines where your critical region is
▪ Critical regions are ranges of the distributions where the values represent
statistically significant results
o If p < α: reject the null; if the null had been true, data like this would have
been quite rare, occurring with probability less than α.
o If p ≥ α: fail to reject the null; data like this occurs with probability of at
least α, i.e., it is fairly common when the null is true.

NHST: How does it work?

❑ One-tailed vs. two-tailed tests*


▪ In theory, a directional hypothesis should be tested with a one-tailed test
because you are hypothesizing that the effect/difference/relationship only
happens on one side of the curve
o The critical region is only on one side of the curve.
NHST: How does it work?

❑ One-tailed vs. two-tailed tests


▪ HOWEVER, there is criticism regarding one-tailed tests:
o Easier to find an effect/difference/relationship
o Completely disregards the opposite effect/alternative explanations (no power to
detect an effect in the opposite direction)
▪ Two-tailed tests are more commonly used, even with directional hypotheses, to
reflect the tentativeness of the research process – convention.
▪ For this class, we will only learn and use two-tailed tests.
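The relationship between the two kinds of test can be seen directly from the t distribution; a small sketch (the t value and degrees of freedom are arbitrary, chosen only for illustration):

```python
from scipy import stats

t, df = 2.10, 20  # hypothetical t statistic and degrees of freedom

# Two-tailed: the critical region sits in BOTH tails of the curve
p_two = 2 * stats.t.sf(abs(t), df)
# One-tailed (directional): the critical region sits in ONE tail only
p_one = stats.t.sf(t, df)

print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
# The one-tailed p is exactly half the two-tailed p, which is why
# one-tailed tests make it "easier" to find a significant effect
```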

NHST: How does it work?

▪ If the p-value is more than the α (p ≥ .05), then we fail to


reject the null hypothesis.
▪ If the p-value is less than the α (p < .05), then we reject
the null hypothesis.
Limitations of NHST

❑ Significance testing can be misleading


▪ As sample size increases, the ability to detect small differences increases,
therefore it is easier to get significant results
o With very large samples, even tiny, practically trivial differences can
come out statistically significant
▪ α is not always a reliable indicator
o Researchers have a say over how stringent they want it to be
▪ p-values don’t tell you anything about the size or strength of the
relationship/effect you have found (=“effect size”)
o Merely the probability of finding the relationship/effect if the H₀ is true
Beta value (β)

❑ = Probability of making a Type II error


▪ Probability of incorrectly failing to reject the H₀
▪ That is, the probability that you would not reject the H₀ (=fail to reject H₀),
when you actually should, because H₀ is in fact false
Power (1 – β)

❑ = Probability of correctly rejecting the null


▪ probability of rejecting the H₀ when you should
▪ Power refers to the ability to detect an effect – you would want high power!
❑ Conventional level for high power: ≥ 0.8
▪ In other words, the acceptable level for β (Type II error) is 20% max
❑ Higher power -> higher chances of avoiding Type II error
❑ Increasing sample sizes often increases power
▪ Small effects can be significant in big samples; big effects may not be
significant in small samples
▪ So if you want more power, you need more participants
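The link between sample size and power can be demonstrated by simulation; a sketch assuming a true effect of +5 points on an IQ-like scale (all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def estimated_power(n, n_sims=1000, alpha=0.05):
    """Fraction of simulated studies that detect a true +5 point effect."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(100, 15, size=n)  # no effect group
        treated = rng.normal(105, 15, size=n)  # a true effect exists (+5)
        _, p = stats.ttest_ind(control, treated)
        if p < alpha:
            hits += 1
    return hits / n_sims

# Bigger samples -> more simulated studies detect the (real) effect
powers = {n: estimated_power(n) for n in (20, 80, 200)}
for n, power in powers.items():
    print(f"n per group = {n:3d} -> estimated power = {power:.2f}")
```

The same true effect that is routinely missed with 20 participants per group is detected almost every time with 200, which is exactly the "small effects can be significant in big samples" point above.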
NHST: A quick overview

▪ Define the null hypothesis, H0


▪ Based on the data, calculate a p value: the probability you would get such
data if the null were true
o Think of p as representing the quality of H0
▪ Compare p to α
o If p < α, reject the null
o If p ≥ α, fail to reject the null at this point
▪ The language:
o If the null is rejected (p < α), the results are described as “statistically significant”
o If null is not rejected (p ≥ α), results are described as “not statistically
significant”
NHST: Red Flags!

❑ Beware dichotomous thinking!


❑ Statistically significant ≠ large, meaningful, or worthy of attention
▪ It just means that the H0 is unlikely.
❑ Beware of accepting the H0
▪ a non-significant result does not mean that the H0 is true!
❑ Beware of interpreting the p value:
▪ p value is not the odds the H0 is true
▪ It is the odds of obtaining your results IF the null is true.
Assumption

▪ Parametric vs Non-parametric Tests


▪ Assumption of Normality
▫ How to test for normality

What is an assumption?

❑ Stacy: Oh. Mai. Gawwwwwd did you see that huge sombrero that Shia was
wearing?? She must work at Taco Bell or something
❑ Antuan: …She’s a singer-songwriter.
❑ Stacy: Uhhh, same thing.
What is an assumption?

❑ Sometimes assumptions are still necessary.


▪ All parametric statistical tests need to make assumptions about the data.
▪ For example, a t-test assumes that the data is from a normal distribution.
❑ Think of (parametric) statistical tests as professional chefs
▪ We assume that the ingredients are fresh and organic.
▪ If that assumption is met, our chef can produce glorious dishes.
▪ If that assumption is not met, our chef can still cook something, but it won’t
be as good.
❑ If statistical assumptions are not met, we still get results, but they are not
as trustworthy (depending on how severely the assumptions are violated)
Parametric vs Non-Parametric Tests

Statistical Tests

Parametric Tests
▪ There are assumptions about the parameters that the data must meet in
order for the test to be powerful
▪ e.g. Pearson’s r, simple linear regression, t-tests, ANOVAs, multiple
regression, etc.

Non-Parametric Tests
▪ These tests don’t require the data to meet any particular parameters
▪ Less powerful than parametric tests, unless the parametric assumptions
are not met
▪ e.g. Spearman’s rho, Kendall’s tau, Wilcoxon rank sum test, Mann-Whitney U,
Kruskal-Wallis, Friedman, etc.
Assumption of Normality

❑ Normality is one of the major assumptions for parametric tests, which


have been designed with a normal distribution (= bell curve) as their
basis
▪ We assume that any continuous data we obtain will be normally distributed
❑ Before running inferential tests such as Pearson’s r, we need to ensure that
the assumption of normality is met for our data
❑ To do that, we run normality tests such as Kolmogorov-Smirnov and
Shapiro-Wilk
Normal Distribution

[Figure: the normal distribution (bell curve)]
Assumption of Normality

❑ Kolmogorov-Smirnov or Shapiro-Wilk tests the assumption that our data’s


distribution is drawn from a normally distributed population
❑ Null hypothesis of these tests: There is no significant difference between the
data’s distribution and the normal distribution
❑ We don’t want to reject this null hypothesis because we want our data’s
distribution to be NOT SIGNIFICANTLY DIFFERENT from the normal
distribution
▪ When these tests return p < .05: there IS significant difference between our data’s
distribution and the normal distribution
▪ When these tests return p > .05: there is NO significant difference between our data’s
distribution and the normal distribution
Assumption of Normality

Population Distribution Sample Distribution


Assumption of Normality

❑ So which one do we want? When we run normality tests, do we want p <


.05 or p > .05?
❑ If our data’s distribution is not significantly different from the population
distribution, we can say that the assumption of normality for our data is
met
▪ If p < .05, the assumption of normality is violated (= not met)
❑ We use Shapiro-Wilk if n < 2,000 (n = sample size)
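Outside SPSS, the same Shapiro-Wilk test is available in scipy; a sketch with simulated data (one sample drawn from a normal population, one deliberately skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

normal_sample = rng.normal(50, 10, size=40)   # drawn from a normal population
skewed_sample = rng.exponential(10, size=40)  # clearly non-normal (right-skewed)

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)

# Remember: H0 here is "no difference from a normal distribution",
# so p > .05 means the normality assumption is met
for name, w, p in [("normal sample", w_norm, p_norm),
                   ("skewed sample", w_skew, p_skew)]:
    verdict = "met" if p > .05 else "violated"
    print(f"{name}: W = {w:.2f}, p = {p:.3f} -> normality {verdict}")
```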
How to run K-S & Shapiro-Wilk in SPSS

❑ Analyse → Descriptive Statistics → Explore


❑ Enter variables of interest into Dependent List
❑ Go to Plots – check “Normality Plots with test”
Reporting Results: Assumption of
Normality

❑ Values to look out for:


▪ Shapiro-Wilk statistic – report to 2 decimal places
▪ df
▪ p-value (= Sig.) – report exact value to 3 decimal places
o Except when the p-value is shown as .000, then report as p < .001
❑ Format: Shapiro-Wilk (df) = S-W statistic, p = p-value
❑ Examples:
▪ Shapiro-Wilk (55) = .87, p = .067
▪ Shapiro-Wilk (25) = .38, p < .001
❑ If assumption of normality is met, proceed to use a parametric test
❑ If the assumption of normality has been violated, you need to use a non-parametric
test (or explain the robustness of the test)
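The reporting format above can be captured in a small helper function (a hypothetical illustration, not part of SPSS or any library):

```python
def report_shapiro(w, df, p):
    """Format a Shapiro-Wilk result in the style shown on the slide:
    Shapiro-Wilk (df) = statistic, p = p-value
    (Hypothetical helper for illustration only.)"""
    # Report exact p to 3 decimal places, except very small values
    p_str = "p < .001" if p < .001 else f"p = {p:.3f}".replace("0.", ".")
    # Report the statistic to 2 decimal places, dropping the leading zero
    w_str = f"{w:.2f}".replace("0.", ".")
    return f"Shapiro-Wilk ({df}) = {w_str}, {p_str}"

print(report_shapiro(0.87, 55, 0.067))
print(report_shapiro(0.38, 25, 0.0003))
```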
Test of Normality (Examples)

1. Shapiro-Wilk (55) = .87, p = .067


▪ p is >.05, that means there is NO SIGNIFICANT DIFFERENCE between
sample distribution and the normal population distribution
▪ This indicates that our sample distribution is normal
▪ The assumption of normality is met / assumed
2. Shapiro-Wilk (25) = .38, p < .001
▪ p is <.05, that means there is A SIGNIFICANT DIFFERENCE between sample
distribution and the normal population distribution
▪ This indicates that our sample distribution is NOT normal
▪ The assumption of normality is NOT met / violated
Recap

Recap
Normality Test

❑ What I’m asking: Is my sample distribution significantly different from a


normal population distribution?
❑ My H₀: There is no significant difference between my sample distribution and
a normal distribution.
▪ What if I get p < .05? What can I conclude?
▪ What if I get p > .05? What can I conclude?
Concept Test

❑ For my study on exam scores and exam anxiety, I ran a Shapiro-Wilk test
and got these results for exam anxiety:
▪ Shapiro-Wilk (45) = .78, p = .075
❑ What is the df?
A. 45
B. 47
C. 43
D. .78
Concept Test

❑ For my study on exam scores and exam anxiety, I ran a Shapiro-Wilk test
and got these results for exam anxiety:
▪ Shapiro-Wilk (45) = .78, p = .075
❑ Is the assumption of normality met?
A. We can’t tell from these results alone
B. It depends on the inferential test later
C. Yes
D. No
Concept Test

❑ For my study on exam scores and exam anxiety, I ran a Pearson’s r test and
got these results:
▪ r(18) = .42, p = .034
❑ Is normality assumed?
A. We can’t tell from these results
B. It depends on the inferential test later
C. Yes
D. No
Normality Testing Tutorial

❑ Interpret this table and report on the assumption of normality in full:

❑ The assumption of normality was MET for the intelligence scores of men,
Shapiro-Wilk(10)=.93, p=.412, and women, Shapiro-Wilk(10)=.92, p=.345.
Therefore, the overall assumption of normality has been met.

Normality Testing Tutorial

❑ Interpret this table and report on the assumption of normality in full:

❑ The assumption of normality was not met for the collective self-efficacy scores of
participants who were asked to recycle, Shapiro-Wilk(30)=.92, p=.021. However,
normality was assumed for participants who were asked to drive an electric vehicle,
Shapiro-Wilk(20)=.97, p=.646. Thus, the overall assumption of normality has been
violated.
One final note

▪ Always graph your data


▪ Use histograms to visually gauge how normal (or non-
normal) is your distribution
What have we learned today?

▪ NHST
▪ Types of Errors
▪ Assumptions
▫ Normality
▫ Shapiro-Wilk

