AP SHAH ADS Notes Mod 2 p2
Hypothesis testing
Hypothesis testing helps in data analysis by providing a way to make inferences about a
population based on a sample of data. It allows analysts to decide whether to accept or
reject a given assumption (hypothesis) about the population based on the evidence provided
by the sample data. For example, hypothesis testing can be used to determine whether a
sample mean is significantly different from a hypothesized population mean, or whether a
sample proportion is significantly different from a hypothesized population proportion.
The hypothesis is a statement, assumption or claim about the value of the parameter
(mean, variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.
For example, consider the statement "Dhoni is the best Indian captain ever." This is an
assumption we make based on the average wins and losses the team had under his
captaincy. We can test this statement using all the match data.
The null hypothesis is the hypothesis to be tested for possible rejection under the
assumption that it is true. The concept of the null is similar to "innocent until proven guilty":
we assume innocence until we have enough evidence to prove that a suspect is guilty.
It is denoted by H0.
The alternative hypothesis complements the Null hypothesis. It is the opposite of the null
hypothesis such that both Alternate and null hypothesis together cover all the possible
values of the population parameter.
It is denoted by H1.
Let’s understand this with an example:
A soap company claims that its product kills, on average, 99% of germs.
Suppose Lifebuoy claims that it kills 99.9% of germs. How can they say so? There has
to be a testing technique to prove this claim, right? Hypothesis testing is used to prove or
disprove such claims and assumptions.
To test the claim of this company we will formulate the null and alternate hypothesis.
Note: When we test a hypothesis, we assume the null hypothesis to be true until there is
sufficient evidence in the sample to prove it false. In that case, we reject the
null hypothesis and support the alternate hypothesis. If the sample fails to provide
sufficient evidence for us to reject the null hypothesis, we cannot say that the null
hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.
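As an illustrative sketch (not part of the original notes), the soap claim above could be checked against hypothetical lab measurements in Python. The trial data below and the use of scipy's one-sample t-test (a test covered later in these notes) are assumptions made purely for illustration.

```python
# Illustrative sketch: testing the claimed 99.9% germ-kill rate against
# hypothetical lab measurements. All numbers below are made up.
from scipy import stats

claimed_kill_rate = 99.9                      # H0: mean kill rate = 99.9
trial_kill_rates = [99.7, 99.8, 99.9, 99.6,   # hypothetical kill rates (%) from 8 lab trials
                    99.8, 99.5, 99.9, 99.7]

# One-sample t-test: H0: mu = 99.9  vs  H1: mu < 99.9 (claim is overstated)
t_stat, p_value = stats.ttest_1samp(trial_kill_rates, claimed_kill_rate,
                                    alternative='less')
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the data does not support the 99.9% claim.")
else:
    print("Fail to reject H0: not enough evidence against the claim.")
```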
When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and
if it specifies a range of values then it is called a composite hypothesis.
e.g. a motorcycle company claiming that a certain model gives an average mileage of
100 km per litre is a case of a simple hypothesis.
The average age of students in a class is greater than 20. This statement is a composite
hypothesis.
If the alternate hypothesis gives the alternate in both directions (less than and greater than)
of the value of the parameter specified in the null hypothesis, it is called a Two-tailed test.
If the alternate hypothesis gives the alternate in only one direction (either less than or
greater than) of the value of the parameter specified in the null hypothesis, it is called
a One-tailed test.
For example, for the motorcycle mileage claim above we can take H0: μ = 100 and
H1: μ ≠ 100. Here, according to H1, the mean can be greater than or less than 100, so this
is an example of a two-tailed test.
Critical Region
The critical region is the region of the sample space such that, if the calculated value of the
test statistic lies in it, we reject the null hypothesis.
Suppose you are looking to rent an apartment. You have listed all the available apartments
from different real estate websites. You have a budget of Rs. 15,000/month and cannot
spend more than that. The list of apartments you have made has prices ranging from
Rs. 7,000/month to Rs. 30,000/month.
You select a random apartment from the list and assume the following hypotheses:
H0: the apartment's rent is at most Rs. 15,000/month (within budget).
H1: the apartment's rent is more than Rs. 15,000/month (over budget).
Now, since your budget is 15,000, you have to reject all the apartments above that price.
Here, all prices greater than Rs. 15,000 become your critical region. If the random
apartment's price lies in this region, you reject your null hypothesis, and if the random
apartment's price doesn't lie in this region, you do not reject your null hypothesis.
The critical region lies in one tail or two tails on the probability distribution curve according
to the alternative hypothesis. The critical region is a pre-defined area corresponding to a
cut off value in the probability distribution curve. It is denoted by α.
Critical values are values separating the values that support or reject the null hypothesis
and are calculated on the basis of alpha.
We will see more examples later on, and it will become clear how we choose α.
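As a small sketch of how the cut-off is obtained in practice, the critical z-values corresponding to a chosen α can be looked up from the standard normal distribution. The use of scipy here is an assumption, not something prescribed by the notes.

```python
# Sketch: critical z-values for a chosen significance level alpha,
# read off the standard normal distribution (scipy is an assumed choice).
from scipy.stats import norm

alpha = 0.05

# One-tailed (right-tail) test: reject H0 if z > z_crit
z_crit_one_tailed = norm.ppf(1 - alpha)          # ≈ 1.645

# Two-tailed test: reject H0 if |z| > z_crit
z_crit_two_tailed = norm.ppf(1 - alpha / 2)      # ≈ 1.960

print(z_crit_one_tailed, z_crit_two_tailed)
```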
Type I and Type II errors are among the most important topics of hypothesis testing. Let's
simplify the topic by breaking it into smaller portions.
A false positive (Type I error) occurs when you reject a true null hypothesis.
A false negative (Type II error) occurs when you fail to reject a false null hypothesis.
The probability of committing a Type I error (false positive) is equal to the significance level,
or size of the critical region, α.
The probability of committing a Type II error (false negative) is denoted by β. The quantity
1 − β is called the 'power of the test'.
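A minimal sketch, with made-up numbers, of how β and the power 1 − β can be computed for a one-sided z-test; the means, standard deviation, and sample size below are hypothetical.

```python
# Sketch with made-up numbers: beta and power (1 - beta) for a one-sided z-test
# H0: mu = 100 vs H1: mu > 100, when the true mean is actually 103.
from math import sqrt
from scipy.stats import norm

mu0, mu_true = 100, 103        # hypothesized and (assumed) true population means
sigma, n = 10, 25              # population std dev and sample size
alpha = 0.05

se = sigma / sqrt(n)                       # standard error of the sample mean
x_crit = mu0 + norm.ppf(1 - alpha) * se    # sample mean above which we reject H0

beta = norm.cdf(x_crit, loc=mu_true, scale=se)   # P(fail to reject H0 | H1 true)
power = 1 - beta

print(f"Type II error beta = {beta:.3f}, power = {power:.3f}")
```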
A person is on trial for a criminal offense, and the judge needs to provide a verdict on his
case. Now, there are four possible combinations in such a case:
• First Case: The person is innocent, and the judge identifies the person as innocent
• Second Case: The person is innocent, and the judge identifies the person as guilty
• Third Case: The person is guilty, and the judge identifies the person as innocent
• Fourth Case: The person is guilty, and the judge identifies the person as guilty
As you can clearly see, there can be two types of error in the judgment –
Type I error will occur if the jury convicts the person [rejects H0] although the person is
innocent [H0 is true].
Type II error will occur if the jury releases the person [does not reject H0] although
the person is guilty [H1 is true].
According to the Presumption of Innocence, the person is considered innocent until proven
guilty. We consider the Null Hypothesis to be true until we find strong evidence against
it. Then we accept the Alternate Hypothesis. That means the judge must find the
evidence which convinces him "beyond a reasonable doubt." This phenomenon
of "beyond a reasonable doubt" can be understood as the significance level (⍺), i.e.
P(judge decides guilty | person is innocent) should be small. Thus, if ⍺ is smaller, more
evidence will be required to reject the null hypothesis.
The basic concepts of Hypothesis Testing are actually quite analogous to this situation.
It must be noted that z-tests and t-tests are parametric tests, which means that the null
hypothesis is about a population parameter being less than, greater than, or equal to
some value. The test broadly follows four steps: (1) state the null and alternate hypotheses,
(2) choose the significance level ⍺ and collect the sample, (3) compute the test statistic, and
(4) compute the p-value and make a decision. Steps 1 to 3 are quite self-explanatory, but on
what basis can we make a decision in step 4? What does this p-value indicate?
We can understand this p-value as the measurement of the defense attorney's argument.
If the p-value is less than ⍺, we reject the null hypothesis, and if the p-value is greater
than ⍺, we fail to reject the null hypothesis.
Level of significance(α)
The significance level, in the simplest of terms, is the threshold probability of incorrectly
rejecting the null hypothesis when it is in fact true. This is also known as the type I error
rate.
It is the probability of a Type I error. It is also the size of the critical region.
Generally, strong control of α is desired, and in tests it is fixed in advance at very low levels
like 0.05 (5%) or 0.01 (1%).
Note that if H0 is not rejected at a significance level of 5%, we cannot claim that the null
hypothesis is true; we can only say that the sample does not provide sufficient evidence to
reject it.
The p-value is the smallest level of significance at which a null hypothesis can be
rejected.
p-value
To understand what the p-value indicates, we will pick up the normal distribution:
p-value is the cumulative probability (area under the curve) of the values to the right of the red
point in the figure above.
Or,
p-value corresponding to the red point tells us about the ‘total probability’ of getting any value
to the right hand side of the red point, when the values are picked randomly from the population
distribution.
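A minimal sketch of this "area to the right" calculation for an illustrative z-score; the choice of scipy and the number used are assumptions for illustration.

```python
# Sketch: p-value as the area under the standard normal curve to the right
# of an observed point (illustrative number, scipy is an assumed choice).
from scipy.stats import norm

z_observed = 1.8                     # illustrative z-score for a sample result
p_value = 1 - norm.cdf(z_observed)   # area to the right of the observed point
# equivalently: p_value = norm.sf(z_observed)

print(f"p-value = {p_value:.4f}")    # ≈ 0.0359
```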
A large p-value implies that sample scores are more aligned or similar to the population score.
Alpha value is nothing but a threshold p-value, which the group conducting the
test/experiment decides upon before conducting a test of similarity or significance ( Z-test or a
T-test).
Consider the above normal distribution again. The red point in this distribution represents the
alpha value or the threshold p-value. Now, let’s say that the green and orange points represent
different sample results obtained after an experiment.
We can see in the plot that the leftmost green point has a p-value greater than the alpha. As a
result, these values can be obtained with fairly high probability and the sample results are
regarded as lucky.
The point on the rightmost side (orange) has a p-value less than the alpha value (red). As a
result, the sample results are a rare outcome and very unlikely to be lucky. Therefore, they
are significantly different from the population.
The alpha value is decided depending on the test being performed. An alpha value of 0.05 is
considered a good convention if we are not sure of what value to consider.
Let’s look at the relationship between the alpha value and the p-value closely.
Here, the red point represents the alpha value. This is basically the threshold p-value. We can
clearly see that the area under the curve to the right of the threshold is very low.
The orange point represents the p-value using the sample population. In this case, we can
clearly see that the p-value is less than the alpha value (the area to the right of the red point is
larger than the area to the right of the orange point). This can be interpreted as:
The results obtained from the sample are an extremity of the population distribution (an
extremely rare event), and hence there is a good chance they may belong to some other
distribution (as shown below).
Considering our definitions of alpha and the p-value, we consider the sample results obtained
as significantly different. We can clearly see that the p-value is far less than the alpha value.
Again, consider the same population distribution curve with the red point as alpha and the
orange point as the calculated p-value from the sample:
So, p-value > alpha (considering the area under the curve to the right-hand side of the red and
the orange points) can be interpreted as follows:
The sample results are just a low-probability event of the population distribution and are very
likely to have been obtained by luck.
We can clearly see that the area under the population curve to the right of the orange point is
much larger than the alpha value. This means that the obtained results are more likely to be
part of the same population distribution than being a part of some other distribution.
In the National Academy of Archery, the head coach intends to improve the performance of
the archers ahead of an upcoming competition. What do you think is a good way to improve
the performance of the archers?
He proposed and implemented the idea that breathing exercises and meditation before the
competition could help. The statistics before and after experiments are below:
Interesting. The results favor the assumption that the overall score of the archers improved. But
the coach wants to make sure that these results are because of the improved ability of the
archers and not by luck or chance. So what do you think we should do?
This is a classic example of a similarity test (Z-test in this case) where we want to check
whether the sample is similar to the population or not. In order to solve this, we will follow a
step-by-step approach:
1. Understand the information given and form the alternate and null hypothesis
2. Calculate the Z-score and find the area under the curve
3. Calculate the corresponding p-value
4. Compare the p-value and the alpha value
5. Interpret the final results
• Population Mean = 74
• Population Standard Deviation = 8
• Sample Mean = 78
• Sample Size = 60
We have the population mean and standard deviation with us and the sample size is over
30, which means we will be using the Z-test.
1. The after-experiment results are a matter of luck, i.e. mean before and after experiment
are similar. This will be our “Null Hypothesis”
2. The after-experiment results are indeed very different from the pre-experiment ones.
This will be our “Alternate Hypothesis”
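The z-score behind the probability quoted below can be worked out from the given values; a minimal sketch follows (the exact cumulative area comes out slightly above the 0.999 read from a standard z-table).

```python
# Sketch: z-score and cumulative area for the archery example
# (population mean 74, population std dev 8, sample mean 78, n = 60).
from math import sqrt
from scipy.stats import norm

pop_mean, pop_std = 74, 8
sample_mean, n = 78, 60

z = (sample_mean - pop_mean) / (pop_std / sqrt(n))   # ≈ 3.87
area_left = norm.cdf(z)     # ≈ 0.9999; a standard z-table reports this as about 0.999

print(f"z = {z:.2f}, area to the left = {area_left:.4f}")
```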
The probability that we obtained is to the left of the Z-score (Red Point) which we calculated.
The value 0.999 represents the “total probability” of getting a result “less than the sample
score 78”, with respect to the population.
Here, the red point signifies where the sample mean lies with respect to the population
distribution. But we have studied earlier that p value is to the right-hand side of the red point,
so what do we do?
For this, we will use the fact that the total area under the normal Z distribution is 1.
Therefore, the area to the right of the Z-score (the p-value, represented by the unshaded
region) can be calculated as:
p-value = 1 − 0.999 = 0.001
0.001 (p-value) is the unshaded area to the right of the red point. The value 0.001 represents
the “total probability” of getting a result “greater than the sample score 78”, with respect to
the population.
We were not given any value for alpha, therefore we can consider alpha = 0.05. According to
our understanding, if the likeliness of obtaining the sample (p-value) result is less than the alpha
value, we consider the sample results obtained as significantly different.
We can clearly see that the p-value is far less than the alpha value:
This says that obtaining a sample mean of 78 is a rare event with respect to the population
distribution. Therefore, it is reasonable to say that the increase in the performance of the
archers in the sample is not the result of luck. The sample belongs to some other (better, in
this case) distribution.
Box plot
A box and whisker plot—also called a box plot—displays the five-number summary of
a set of data. The five-number summary is the minimum, first quartile, median, third
quartile, and maximum.
In a box plot, we draw a box from the first quartile to the third quartile. A vertical line
goes through the box at the median. The whiskers go from each quartile to the
minimum or maximum.
Outlier: Data points that fall on the far left or far right of the ordered data are tested as
outliers. Generally, outliers fall more than a specified distance from the first and third
quartiles,
(i.e.) outliers are greater than Q3 + (1.5 · IQR) or less than Q1 − (1.5 · IQR).
Applications
A box plot is used to know:
• the five-number summary (minimum, Q1, median, Q3, maximum) and the spread (IQR) of the data
• whether the data is symmetric or skewed
• whether the data contains outliers, and what their values are
• how two or more data sets compare in their distributions
Example:
3, 7, 8, 5, 12, 14, 21, 13, 18, 50
Step 1: Sort the values.
3, 5, 7, 8, 12, 13, 14, 18, 21, 50
Step 2: Find the median.
Q2 = (12 + 13)/2 = 12.5 (the average of the 5th and 6th values)
Step 3: Find the quartiles.
First quartile, Q1 = data value at position (N + 2)/4=12/4=3rd position
Third quartile, Q3 = data value at position (3N + 2)/4=8th position
Q1=7
Q3=18
Step 4: Complete the five-number summary by finding the min and the max. Minimum = 3, Maximum = 50.
Here IQR = Q3 − Q1 = 18 − 7 = 11.
Any point beyond Q3 + 1.5·IQR = 18 + 16.5 = 34.5 or below Q1 − 1.5·IQR = 7 − 16.5 = −9.5
is considered an outlier. In this data set, 50 is an outlier.
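A minimal sketch that reproduces this worked example using the positional quartile formulas from the notes and draws the box plot; matplotlib is an assumed choice of plotting library.

```python
# Sketch: five-number summary, IQR and outlier bounds for the worked example,
# using the positional quartile formulas given in the notes.
import matplotlib.pyplot as plt

data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18, 50])
n = len(data)

q1 = data[(n + 2) // 4 - 1]                   # 3rd value -> 7
q2 = (data[n // 2 - 1] + data[n // 2]) / 2    # median of 10 values -> 12.5
q3 = data[(3 * n + 2) // 4 - 1]               # 8th value -> 18
iqr = q3 - q1                                 # 11

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr # -9.5 and 34.5
outliers = [x for x in data if x < lower or x > upper]
print(min(data), q1, q2, q3, max(data), outliers)   # 3 7 12.5 18 50 [50]

plt.boxplot(data, vert=False)   # matplotlib also flags points beyond 1.5*IQR as fliers
plt.title("Box plot of the sample data")
plt.show()
```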
Scatter plot
Scatter plots are the graphs that present the relationship between two variables in a
data-set. It represents data points on a two-dimensional plane or on a Cartesian
system. The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis. These plots are often called scatter
graphs or scatter diagrams.
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter diagram
graphs pairs of numerical data, with one variable on each axis, to show their relationship.
Scatter plots are typically used in the following situations:
• when we have paired numerical data
• when the dependent variable may have multiple values for each value of the independent variable
• when we are trying to determine whether two variables are related
The line drawn in a scatter plot that lies closest to almost all the points in the plot is
known as the "line of best fit" or "trend line".
Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
1. Positive Correlation
2. Negative Correlation
3. No Correlation
Positive Correlation
A scatter plot in which the values of both variables increase together is said to have a
positive correlation. Positive correlation can further be classified into three categories:
perfect positive, high positive, and low positive correlation.
Negative Correlation
A scatter plot in which the value of one variable increases while the value of the other
variable decreases is said to have a negative correlation. These are also of three types:
perfect negative, high negative, and low negative correlation.
No Correlation
A scatter plot with no clear increasing or decreasing trend in the values of the variables
is said to have no correlation
Example: The data below shows the number of games played by ten players and their
scores. Draw a scatter plot for the data.
No. of games: 3, 5, 2, 6, 7, 1, 2, 7, 1, 7
Scores: 80, 90, 75, 80, 90, 50, 65, 85, 40, 100
Solution:
X-axis or horizontal axis: Number of games
Y-axis or vertical axis: Scores
Now, the scatter graph will be:
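Since the plotted figure is not reproduced here, the following minimal sketch shows how the scatter plot for this data could be drawn; matplotlib is an assumed choice of library.

```python
# Sketch: scatter plot of games played vs. scores.
import matplotlib.pyplot as plt

games  = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]              # independent variable -> X-axis
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]   # dependent variable -> Y-axis

plt.scatter(games, scores)
plt.xlabel("Number of games")
plt.ylabel("Scores")
plt.title("Scatter plot: games vs. scores")
plt.show()
```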
Note: We can also combine multiple scatter plots per sheet in order to read and understand
higher-level patterns in data sets containing more than two variables.
Z-Test
A z-test is used when the population variance is known and the sample size is large (30 or
more). If we have a sample size of less than 30 and do not know the population variance, we
must use a t-test. This is how we judge when to use the z-test vs the t-test. Further, it is
assumed that the z-statistic follows a standard normal distribution, whereas the t-statistic
follows the t-distribution with degrees of freedom equal to n − 1, where n is the sample size.
It must be noted that the samples used for a z-test or t-test must be independent samples,
and must have a distribution identical to the population distribution. This makes sure that the
sample is not "biased" towards/against the Null Hypothesis which we want to validate/invalidate.
One-Sample Z-Test
We perform the One-Sample z-Test when we want to compare a sample mean with the population mean.
In this example:
• Mean Score for Girls is 641
• The number of data points in the sample is 20
• The population mean is 600
• Standard Deviation for Population is 100
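A minimal sketch of the calculation behind the conclusion below, assuming a right-tailed test of H0: µ = 600 against H1: µ > 600; the use of scipy is an assumption.

```python
# Sketch: one-sample z-test for the girls' scores (right-tailed),
# H0: mu = 600 vs H1: mu > 600.
from math import sqrt
from scipy.stats import norm

sample_mean, n = 641, 20
pop_mean, pop_std = 600, 100

z = (sample_mean - pop_mean) / (pop_std / sqrt(n))   # ≈ 1.83
p_value = 1 - norm.cdf(z)                            # right-tail area ≈ 0.033

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```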
Since the P-value is less than 0.05, we can reject the null hypothesis and conclude based on our result that
Girls on average scored higher than 600.
Two-Sample Z-Test
We perform a Two Sample z-test when we want to compare the mean of two samples.
In this example:
• Mean Score for Girls (Sample Mean) is 641
• Mean Score for Boys (Sample Mean) is 613.3
• Standard Deviation for the Population of Girls’ is 100
• Standard deviation for the Population of Boys’ is 90
• Sample Size is 20 for both Girls and Boys
• Hypothesized difference between the population means is 10
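A minimal sketch of the calculation behind the conclusion below, assuming a right-tailed test of H0: µ(girls) − µ(boys) = 10 against H1: µ(girls) − µ(boys) > 10; the use of scipy is an assumption.

```python
# Sketch: two-sample z-test,
# H0: mu_girls - mu_boys = 10  vs  H1: mu_girls - mu_boys > 10.
from math import sqrt
from scipy.stats import norm

mean_girls, mean_boys = 641, 613.3
std_girls, std_boys = 100, 90
n_girls = n_boys = 20
hypothesized_diff = 10

se = sqrt(std_girls**2 / n_girls + std_boys**2 / n_boys)   # standard error ≈ 30.1
z = ((mean_girls - mean_boys) - hypothesized_diff) / se    # ≈ 0.59
p_value = 1 - norm.cdf(z)                                  # right-tail area ≈ 0.28

print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```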
Thus, based on the p-value, we fail to reject the Null Hypothesis. We don't have enough
evidence to conclude that girls, on average, score 10 marks more than the boys. Pretty simple, right?
T-Test
A t-test is a type of inferential statistic used to study whether there is a statistically significant
difference between two groups. Mathematically, it sets up the problem by assuming that the means
of the two distributions are equal (H₀: µ₁=µ₂). If the t-test rejects the null hypothesis (H₀: µ₁=µ₂),
it indicates that the two groups are very likely different.
The statistical test can be one-tailed or two-tailed. The one-tailed test is appropriate when there is a difference
between groups in a specific direction. It is less common than the two-tailed test. When choosing a t test, you
will need to consider two things: whether the groups being compared come from a single population or two
different populations, and whether you want to test the difference in a specific direction.
• One Sample t-test : Compares mean of a single group against a known/hypothesized/ population mean.
• Two Sample: Paired Sample T Test: Compares means from the same group at different times.
• Two Sample: Independent Sample T Test: Compares means for two different groups.
For the one-sample t-test, the test statistic is
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where
$\bar{x}$ = sample mean
$\mu$ = population mean
$s$ = sample standard deviation
$n$ = sample size
For the paired-sample t-test, the test statistic is
$$t = \frac{\bar{d}}{s/\sqrt{n}}$$
where $\bar{d}$ is the mean of the paired differences $d$, and the standard deviation of the differences can be calculated as:
$$s = \sqrt{\frac{\sum d^{2} - n\,(\bar{d})^{2}}{n-1}}$$
The degrees of freedom are n − 1 for a one-sample or paired t-test, and n1 + n2 − 2 for an independent two-sample t-test.
• If the calculated t value is greater than critical t value (obtained from a critical value table called the T-
distribution table) then reject the null hypothesis.
• P-value <significance level (𝜶) => Reject your null hypothesis in favor of your alternative
hypothesis. Your result is statistically significant.
• P-value >= significance level (𝜶) => Fail to reject your null hypothesis. Your result is not statistically
significant.
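A minimal sketch, with made-up data, of the three t-test variants and the p-value decision rule above; scipy's t-test functions are used here as an assumed implementation choice.

```python
# Sketch with made-up data: the three t-test variants and the decision rule.
from scipy import stats

alpha = 0.05
group_a = [23, 25, 28, 30, 22, 27, 26, 24]     # hypothetical measurements, group A
group_b = [31, 29, 33, 35, 30, 32, 28, 34]     # hypothetical measurements, group B
before  = [72, 75, 70, 68, 74, 71]             # hypothetical paired "before" values
after   = [78, 80, 73, 72, 79, 75]             # hypothetical paired "after" values

# 1. One-sample t-test: compare group_a's mean with a hypothesized mean of 24
t1, p1 = stats.ttest_1samp(group_a, popmean=24)

# 2. Paired-sample t-test: same group measured at two different times
t2, p2 = stats.ttest_rel(before, after)

# 3. Independent-sample t-test: two different groups
t3, p3 = stats.ttest_ind(group_a, group_b)

for name, p in [("one-sample", p1), ("paired", p2), ("independent", p3)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```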