STATFINALedit


CHAPTER 8: NORMAL DISTRIBUTION

A normal distribution, also known as a bell curve or a Gaussian distribution, named after the German mathematician Carl Friedrich Gauss (1777-1855), is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.

The normal distribution is one type of symmetrical distribution. Symmetrical distributions occur when a dividing line produces two mirror images.

Not all symmetrical distributions are normal, since some data could appear as two humps or a series of hills rather than the single bell curve that indicates a normal distribution. A normal distribution is a continuous, symmetric, bell-shaped distribution of a variable.

Applications:
The normal distribution is closely associated with many things such as:
• Marks scored on a test
• Heights of different persons
• Size of objects produced by a machine
• Blood pressure and so on.

Properties of A Normal Distribution
1. It has zero skew
Skewness measures the degree of asymmetry of a distribution. The mean, median, and mode are all equal. A normal distribution curve is unimodal. Half of the population is less than the mean and half is greater than the mean.

2. It has two parameters
The parameters of the normal distribution define its shape and probabilities entirely. The 2 parameters of the normal distribution: ~ mean ~ standard deviation

The mean is the central tendency of the normal distribution. It defines the location of the peak of the bell curve. Most values cluster around the mean. Increasing the mean moves the curve right, while decreasing it moves the curve left.

The standard deviation is a measure of variability. It defines the width of the normal distribution. The standard deviation determines how far away from the mean the values tend to fall. It represents the typical distance between the observations and the average. The standard deviation stretches or squeezes the curve. A small standard deviation results in a narrow curve, while a large standard deviation leads to a wide curve.

3. It has a kurtosis of 0.263
Kurtosis measures the thickness of the tail ends of a distribution in relation to the tails of a normal distribution. Measured by the percentile coefficient of kurtosis, the normal distribution has a kurtosis equal to 0.263.

Other Properties of A Normal Distribution
- The curve is continuous, that is, there are no gaps or holes. For each value of X, there is a corresponding value of Y.
- The curve never touches the x-axis. Theoretically, no matter how far in either direction the curve extends, it never meets the x-axis, but it gets increasingly closer.
- The total area under a normal distribution curve is equal to 1.00 or 100%. This fact may seem unusual, since the curve never touches the x-axis, but one can prove it mathematically by using calculus.
- Normal distributions are symmetrical, but not all symmetrical distributions are normal.

A Standard Normal Distribution
The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

While individual observations from normal distributions are referred to as x, they are referred to as z in the z-distribution. Every normal distribution can be converted to the standard normal distribution by turning the individual values into z-scores.

Difference Between A Normal Distribution And The Standard Normal Distribution
We convert normal distributions into the standard normal distribution for several reasons:
• To find the probability of observations in a distribution falling above or below a given value.
• To find the probability that a sample mean significantly differs from a known population mean.
• To compare scores on different distributions with different means and standard deviations.

EMPIRICAL RULE For Normal Distribution
➢ Also known as the three sigma rule or the 68 - 95 - 99.7 rule.
➢ It is a statistical rule which states that for a normal distribution, almost all observed data will fall within 3 standard deviations from the mean.
➢ 68% of the data lie within 1 standard deviation (μ ± σ) of the mean.
➢ 95% of the data lie within 2 standard deviations (μ ± 2σ) of the mean.
➢ 99.7% of the data lie within 3 standard deviations (μ ± 3σ) of the mean.

Example:
The heights of BSED-MATH 2A students are normally distributed with a mean of 160 cm and a standard deviation of 3.5 cm.
➢ Use the empirical rule to estimate the probability of a student whose height is taller than 163.5 cm.
Since 163.5 cm is one standard deviation above the mean (160 + 3.5), about 68% of heights lie between 156.5 cm and 163.5 cm, so about (100% − 68%) / 2 = 16% of students are taller than 163.5 cm.

Normal Curve
➢ The normal curve, also known as the Gaussian distribution or bell curve, is a probability distribution that describes a wide range of phenomena in nature and human behavior.
➢ It is a symmetrical, bell-shaped curve that is defined by two parameters: its mean (µ) and standard deviation (σ).
➢ The normal curve is also important in statistical inference, as it allows researchers to make probabilistic statements about the population based on sample data.

Example
Find the value of the probability density function for the normal distribution where mean = 4, standard deviation = 2 and x = 3.
Given: μ = 4, σ = 2, x = 3
f(x) = 1 / (σ√(2π)) · e^(−(x − μ)² / (2σ²))
f(3) = 1 / (2√(2π)) · e^(−(3 − 4)² / (2 · 2²)) ≈ 0.17603
Therefore, the value of the probability density function for the normal distribution is 0.17603.

Standard Normal Table / Z-Table
➢ Used to determine the area/percentage under consideration in a standard normal distribution.
➢ This z-table covers z-values from 0 to 3.99.
➢ This particular z-table provides the area between the mean and the z-value.

Steps In Finding Area Under Normal Curve
⦿ SKETCH the normal curve, locate the z-value and label it.
⦿ ANALYZE and shade the area you are looking for depending on the given condition.
⦿ LOCATE the given z-value on the z-table. Find the first two digits (whole number and tenths) on the left side of the z-table. Look for the remaining digit (hundredths) on the top of the z-table. The intersection of this row and column will be the area from the mean to the z-value.
⦿ LABEL the shaded area.
⦿ State your ANSWER.

A Normal Distribution Curve As A Probability Distribution Curve
A normal distribution curve can be used as a probability distribution curve for normally distributed variables.
For probabilities, a special notation is used. For example, the probability of any z value between 0 and 2.32 is written as P(0 < z < 2.32).

TAKE NOTE!
With a table that gives the area between the mean and a positive z-value, we add 0.5000 to the tabled area if we are looking for the probability that z is less than or equal to that value. We subtract the tabled area from 0.5000 if we are looking for the probability that z is greater than or equal to that value.

Application Of The Normal Distribution
The standard normal distribution curve can be used to solve a wide variety of practical problems. The only requirement is that the variable be normally or approximately normally distributed.

To solve problems by using the standard normal distribution, use the formula:
z = (X − μ) / σ

Example: 1. The mean number of hours an American worker spends on the computer is 3.1 hours per workday. Assume the standard deviation is 0.5 hour. Find the percentage of workers who spend less than 3.5 hours on the computer. Assume the variable is normally distributed.
Solution:
Step 1 – Draw the figure and represent the area.
Step 2 – Find the z value corresponding to X.
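For the computer-use example (mean 3.1 hours, standard deviation 0.5 hour, X = 3.5), the z value and the area lookup can be sketched in Python. This is only an illustration: it assumes the standard library's `statistics.NormalDist` in place of a printed z-table, and the variable names are made up for this sketch.

```python
from statistics import NormalDist

mu, sigma, x = 3.1, 0.5, 3.5         # values given in the example

z = (x - mu) / sigma                 # z value corresponding to X
std_normal = NormalDist()            # standard normal: mean 0, standard deviation 1

# Area between the mean and z, as a mean-to-z table would list it
area_mean_to_z = std_normal.cdf(z) - 0.5

# "Less than" probability for a positive z: add 0.5000 to the tabled area
p_less = 0.5 + area_mean_to_z

print(f"z = {z:.2f}")
print(f"area from mean to z = {area_mean_to_z:.4f}")
print(f"P(X < 3.5) = {p_less:.4f}")
```

With z = 0.80 the tabled area is 0.2881, so about 78.81% of workers spend less than 3.5 hours on the computer.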
Step 3 – Find the area.

CHAPTER 9: SAMPLING DISTRIBUTION

SAMPLING DISTRIBUTION
A sampling distribution is a probability distribution of a statistic obtained from a large number of samples drawn from a specific population. Each sample is a subset of the population. Any quantity obtained from a sample for the purpose of estimating a population parameter is called a sample statistic or statistic.

A. PROBABILITY SAMPLING
Probability sampling means that every member of the population has a chance of being selected. It is mainly used in quantitative research. If you want to produce results that are representative of the whole population, probability sampling techniques are the most valid choice.
There are four main types of probability samples.

1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population. Random sampling can be done for relatively small populations by drawing lots or by using a table of random numbers.
To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.

Example: You want to select a simple random sample of 100 of the 1000 employees of a social media marketing company. You assign a number to every employee in the company database from 1 to 1000, and use a random number generator to select 100 numbers.

2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

Example: All employees of the company are listed in alphabetical order. From the first 10 numbers, you randomly select a starting point: number 6. From number 6 onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.

If you use this technique, it is important to make sure that there is no hidden pattern in the list that might skew the sample. For example, if the HR database groups employees by team, and team members are listed in order of seniority, there is a risk that your interval might skip over people in junior roles, resulting in a sample that is skewed towards senior employees.

3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g., gender identity, age range, income bracket, job role).

Example: The company has 800 female employees and 200 male employees. You want to ensure that the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100 people.

4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have similar characteristics to the whole sample. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.

A cluster sample is a simple random sample of clusters from the available clusters in the population. A cluster is a group of elements in the population of interest.
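The simple random, systematic and stratified examples above can be sketched with Python's standard `random` module. This is a rough illustration under assumptions: the employee numbers and the `W…`/`M…` labels are hypothetical stand-ins for a real employee list, and the fixed seed is an arbitrary choice made only so the run is repeatable.

```python
import random

random.seed(1)  # arbitrary seed, only to make the sketch repeatable

employee_ids = list(range(1, 1001))           # employees numbered 1 to 1000

# Simple random sampling: 100 of the 1000 numbers, each equally likely
simple_random = random.sample(employee_ids, 100)

# Systematic sampling: start at number 6 (as in the example; in practice
# the start would be drawn at random from 1 to 10), then every 10th person
start = 6
systematic = employee_ids[start - 1::10]

# Stratified sampling: 800 women and 200 men, sampled in the same 10% share
women = [f"W{i}" for i in range(1, 801)]      # hypothetical labels
men = [f"M{i}" for i in range(1, 201)]        # hypothetical labels
stratified = random.sample(women, 80) + random.sample(men, 20)

print(systematic[:4])                         # first few systematic picks
print(len(simple_random), len(systematic), len(stratified))
```

Each of the three approaches yields 100 people; they differ only in how chance enters the selection.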
If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.

This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole population.

Example: The company has offices in 10 cities across the country (all with roughly the same number of employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random sampling to select 3 offices – these are your clusters.

B. NON-PROBABILITY SAMPLING METHOD
In a non-probability sample, individuals are selected based on non-random criteria, and not every individual has a chance of being included. This type of sample is easier and cheaper to access, but it has a higher risk of sampling bias. That means the inferences you can make about the population are weaker than with probability samples, and your conclusions may be more limited. If you use a non-probability sample, you should still aim to make it as representative of the population as possible.

Non-probability sampling techniques are often used in exploratory and qualitative research. In these types of research, the aim is not to test a hypothesis about a broad population, but to develop an initial understanding of a small or under-researched population.

These are the five main types of non-probability sample.

1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the researcher. This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample is representative of the population, so it can’t produce generalizable results. Convenience samples are at risk for both sampling bias and selection bias.

Advantages:
1. Collect data quickly
2. Inexpensive methodology
3. Easy to do research
4. Low cost
5. Readily available sample
6. Fewer rules to follow

Example: You are researching opinions about student support services in your university, so after each of your classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university.

2. Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by responding to a public online survey).

Voluntary response samples are always at least somewhat biased, as some people will inherently be more likely to volunteer than others, leading to self-selection bias.

Example: You send out the survey to all students at your university and a lot of students decide to complete it. This can certainly give you some insight into the topic, but the people who responded are more likely to be those who have strong opinions about the student support services, so you can’t be sure that their opinions are representative of all students.

3. Purposive sampling
This type of sampling, also known as judgement sampling, involves the researcher using their expertise to select a sample that is most useful to the purposes of the research.

It is often used in qualitative research, where the researcher wants to gain detailed knowledge about a specific phenomenon rather than make statistical inferences, or where the population is very small and specific. An effective purposive sample must have clear criteria and rationale for inclusion. Always make sure to describe your inclusion and exclusion criteria and beware of observer bias affecting your arguments.

Example: You want to know more about the opinions and experiences of disabled students at your university, so you purposefully select a number of students with different support needs in order to gather a varied range of data on their experiences with student services.

4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other participants. The number of people you have access to “snowballs” as you get in contact with more people. The downside here is also representativeness, as you have no way of knowing how representative your sample is due to the reliance on participants recruiting others. This can lead to sampling bias.

Example: You are researching experiences of homelessness in your city. Since there is no list of all
homeless people in the city, probability sampling isn’t possible. You meet one person who agrees to participate in the research, and she puts you in contact with other homeless people that she knows in the area.

5. Quota sampling
Quota sampling relies on the non-random selection of a predetermined number or proportion of units. This is called a quota. You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units until you reach your quota. These units share specific characteristics, determined by you prior to forming your strata. The aim of quota sampling is to control what or who makes up your sample.

Quota sampling is used in both qualitative and quantitative research designs in order to gain insight about a characteristic of a particular subgroup or investigate relationships between different subgroups.

Example: You want to gauge consumer interest in a new produce delivery service in Boston, focused on dietary preferences. You divide the population into meat eaters, vegetarians, and vegans, drawing a sample of 600 people. Since the company wants to cater to all consumers, you set a quota of 200 people for each dietary group. In this way, all dietary preferences are equally represented in your research, and you can easily compare these groups. You continue recruiting until you reach the quota of 200 participants for each subgroup.

In finding the sample size in a given population, we use the formula:

SLOVIN’S FORMULA
n = N / (1 + N·e²)
n = sample size
N = population size
e = margin of error

To use the formula, first figure out what you want your tolerance for error to be. For example, you may be happy with a confidence level of 95 percent (giving a margin of error of 0.05), or you may require a tighter accuracy of a 98 percent confidence level (a margin of error of 0.02).

CHAPTER 10: ESTIMATION

STATISTICAL INFERENCE → is the process by which we infer population properties from sample properties.

Two Types Of Statistical Inference
1. ESTIMATION
2. HYPOTHESIS TESTING

ESTIMATION → The objective of estimation is to approximate the value of a population parameter on the basis of a sample statistic. For example, the sample mean X̄ is used to estimate the population mean µ.

PLEASE TAKE NOTE! A single estimate provides limited information. It does not tell us much about the possible size of the error. In order for us to be confident that our estimates approximate the true parameter value, we take as many samples as possible from the population, compute the sample statistics, and carefully compare the results before we formulate conclusions. A good method of estimating a parameter is one whose estimates from many samples are, on average, equal to the true population parameter. In this case, we can say that the sample statistic is an unbiased estimate.

TWO TYPES OF ESTIMATORS
1. POINT ESTIMATOR → A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point.
Example: Suppose a college president wishes to estimate the average age of students attending classes this semester. The president could select a random sample of 100 students and find the average age of these students, say, 22.3 years. From the sample mean the president could infer that the average age of all the students is 22.3 years. This type of estimate is called a point estimate.
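The idea that a point estimate comes from one sample, while an unbiased estimator averages out to the true parameter over many samples, can be illustrated with a small simulation. This is only a sketch: the population of ages below is randomly generated and hypothetical (it is not data from the college example), and the seed, population size and number of repetitions are arbitrary assumptions.

```python
import random
from statistics import mean

random.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical population of 10,000 student ages (made-up data)
population = [random.randint(17, 30) for _ in range(10_000)]
mu = mean(population)                 # the parameter we pretend not to know

# One point estimate: the mean of a single random sample of 100 students
one_estimate = mean(random.sample(population, 100))

# Many point estimates: repeat the sampling 2,000 times and average them
estimates = [mean(random.sample(population, 100)) for _ in range(2_000)]
average_estimate = mean(estimates)

print(f"population mean          = {mu:.2f}")
print(f"one point estimate       = {one_estimate:.2f}")
print(f"average of all estimates = {average_estimate:.2f}")
```

A single estimate can miss the population mean by a noticeable amount, while the average of many estimates lands very close to it, which is what unbiasedness describes.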
2. INTERVAL ESTIMATOR → An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval. Here, we try to construct an interval that “covers” the true population parameter with a specified probability.

In an interval estimate, the parameter is specified as being between two values. For example, an interval estimate for the average age of all students might be 26.9 < μ < 27.7.

The confidence interval is a specific interval estimate of a parameter determined by using data obtained from a sample and by using the specific confidence level of the estimate. Three common confidence intervals are used: the 90, the 95, and the 99% confidence intervals.

Steps In Finding The Confidence Interval
STEP #1: Determine the confidence coefficients or the critical values (zα/2)
STEP #2: Find the lower and upper confidence limits
STEP #3: Interpret the results.

Confidence Intervals For The Mean
1. LARGE SAMPLES (n ≥ 30) If the statistic is the sample mean x̄, then the confidence interval is
x̄ ± zα∕2 · (σ/√n)
where:
• x̄ is the sample mean
• zα∕2 is the z value providing an area of α/2 in the upper tail of the standard normal probability distribution
• n is the sample size
• σ is the population standard deviation

Example:
Find a 95% confidence interval for a population mean μ for n = 36, x̄ = 15.2, σ = 1.6.
Since the sample size is n = 36, the sample mean x̄ is approximately normally distributed with mean μ and standard error σ/√n. The approximate 95% confidence interval is found as follows:
STEP #1: Determine the confidence coefficients or the critical values (zα/2). Looking at the z-table, z = 1.96.
STEP #2: Find the lower and upper confidence limits: 15.2 ± 1.96(1.6/√36) = 15.2 ± 0.52, giving 14.68 and 15.72.
STEP #3: Interpret the results. Hence, one can say with 95% confidence that the true mean of the population is between 14.68 and 15.72.

EXAMPLE:
A researcher wishes to estimate the average amount of money a person spends on lottery tickets each month. A sample of 50 people who play the lottery found the mean to be P 19.00 and the standard deviation to be 6.8. Find the best point estimate of the population mean and the 95% confidence interval of the population mean.

Solution: The best point estimate of the mean is P 19.00. For the 95% confidence interval use z = 1.96.
Hence, one can say with 95% confidence that the true mean of the population is between about 17.1 and 20.9, based on a sample of 50 people who play the lottery.

2. SMALL SAMPLES (n < 30) In this case we use the t distribution to obtain the confidence interval:
x̄ ± tα∕2 · (s/√n)
• x̄ is the sample mean
• tα∕2 is the critical value found in the t-table, corresponding to the area in the two tails of the curve
• Additional information: Degrees of freedom refer to the maximum number of logically independent values that are free to vary in a data sample.
• n is the sample size
• s is the sample standard deviation

Steps In Finding The Confidence Interval
STEP #1: Determine the confidence coefficients or the critical values (tα/2)
STEP #2: Find the lower and upper confidence limits
STEP #3: Interpret the results

EXAMPLE: The Statistician of BISCAST wants to know the mean age of entering mathematics majors. He computed a mean age of 18 years and a standard deviation of 1.4 years on a random sample of 25 entering Mathematics majors coming from a normally distributed population. With 99% confidence, find the interval estimate of the population mean.

STEP #1: Determine the confidence coefficients or the critical values (tα/2)
n = 25
df = n - 1 = 25 – 1 = 24
The coefficient for these values is 2.797 (based on the t table).

STEP #2: Find the lower and upper confidence limits
18 ± 2.797(1.4/√25) = 18 ± 0.78, giving 17.22 and 18.78.

STEP #3: Interpret the results
Thus, we can say with 99% confidence that the interval between 17.22 and 18.78 contains the true mean age of the population of entering math majors, based on a sample of 25.

EXAMPLE #1:
The average weight of 20 dark chocolate bars selected from a normally distributed population is 180 g with a standard deviation of 15 g. Find the interval estimate using the 95% confidence interval.

STEP #1: Determine the confidence coefficients or the critical values (tα/2)
n = 20
df = n - 1 = 20 – 1 = 19
The coefficient for these values is 2.093 (based on the t table).

STEP #2: Find the lower and upper confidence limits
180 ± 2.093(15/√20) = 180 ± 7.02, giving 172.98 and 187.02.

STEP #3: Interpret the results
Thus, we can say with 95% confidence that the interval between 172.98 and 187.02 contains the true mean weight of dark chocolate bars, based on a sample of 20 dark chocolate bars.

Confidence Intervals for Proportions
The confidence limits for the population proportion are given by
p̂ ± zα∕2 · √(p̂q̂ / n)

Sample Proportion
p̂ = x / n
where p̂ is the point estimate or single value estimate, x is the number of successes in a sample of size n, and the remaining n − x are considered failures. The sample proportion of failures, 1 − p̂, is often denoted as q̂ = 1 – p̂.

Suppose a random sample of 500 students at BISCAST is asked whether they agree or disagree with a statement given by the researcher, and the sample proportion who agree is 0.84. This 0.84 is a point estimate for the population proportion, and the sample proportion of failures is 1 − 0.84, which is 0.16.

It gives us an idea of what the population proportion might be. However, it does not tell us that exactly 84% of the population will agree to the statement.

To infer or generalize about the population we construct an interval, that is, a range of values that we expect to capture the population parameter p at some confidence level. The confidence level represents the proportion of times we expect to capture the population parameter with repeated sampling.

The confidence interval for p is computed using the formula p̂ ± zα∕2 · √(p̂q̂ / n).

Now, let’s estimate the population proportion with 85% confidence with the given n = 500 and p̂ = 0.84.

SOLUTION:
Step 1: Calculate the standard error: √(p̂q̂ / n) = √(0.84 × 0.16 / 500) ≈ 0.0164
Step 2: Find the significance level, represented by α: α = 1 − 0.85 = 0.15
Step 3: Find the tail area, α/2 = 0.075, and look it up in the less-than z-table: z ≈ 1.44
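The 85% interval for p̂ = 0.84 and n = 500 can be reproduced numerically. The sketch below assumes Python's `statistics.NormalDist` as a substitute for the less-than z-table; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n, p_hat, confidence = 500, 0.84, 0.85     # values from the example
q_hat = 1 - p_hat                          # sample proportion of failures

se = sqrt(p_hat * q_hat / n)               # standard error of the proportion
alpha = 1 - confidence                     # significance level
z = NormalDist().inv_cdf(1 - alpha / 2)    # critical z leaving alpha/2 in each tail

margin = z * se
lower, upper = p_hat - margin, p_hat + margin

print(f"standard error = {se:.4f}")
print(f"z = {z:.2f}")
print(f"interval: {lower:.1%} to {upper:.1%}")
```

The limits come out at roughly 81.6% and 86.4%, matching the conclusion of the worked example.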
Step 4: Compute the confidence interval: 0.84 ± 1.44 × 0.0164 = 0.84 ± 0.024
Step 5: Find the percentage and conclusion. We are 85% confident that the population proportion lies between 81.6% and 86.4%.

Confidence Intervals for the Difference Between Two Population Means
The confidence limits for the difference of two population means, where populations are infinite, are given by
(X̄1 − X̄2) ± Zc · √(σ1²/n1 + σ2²/n2)
Where;
X̄ = mean
σ = standard deviation
n = sample size
Zc = cumulative z value (from any z table)

EXAMPLE:
A sample of 100 brand A cellphone batteries showed a mean life of 1500 hours and a standard deviation of 115 hours. A sample of 200 brand B cellphone batteries showed a mean life of 1200 hours and a standard deviation of 110 hours. Find the 95% confidence limits for the difference of the mean lifetimes of the populations of brands A and B.

SOLUTION:
Step 1: Write all the given values.
Population 1: X̄1 = 1500, σ1 = 115, n1 = 100
Population 2: X̄2 = 1200, σ2 = 110, n2 = 200
Step 2: Find the cumulative z value (using the less-than z-table): 1 − 0.95 = 0.05 and 0.05/2 = 0.025; find the z value nearest to 0.025 in the less-than z-table and you will get 1.96.
Step 3: Now, substitute all the given values in the formula and compute for the two population means:
(1500 − 1200) ± 1.96 · √(115²/100 + 110²/200) = 300 ± 27.21
Step 4: Write your final answer in conclusion. Therefore, the 95% confidence limits for the difference of the mean lifetimes of the populations of brand A and brand B are 272.79 and 327.21 hours.

Confidence Intervals for the Difference Between Two Population Proportions
The confidence limits for the difference of two population proportions, where populations are infinite, are given by
(P1 − P2) ± Zc · √(P1(1 − P1)/n1 + P2(1 − P2)/n2)
Where;
P = sample proportion
x = number of successes in the sample
n = sample size
Zc = cumulative z value (from any z table)

EXAMPLE: In a random sample of 400 female and 600 male college students who frequent internet cafes, 100 female and 300 male college students indicated they like playing internet games. Construct the 99% confidence limits for the difference in proportions of all female and male college students who frequent internet cafes.

SOLUTION:
Step 1: Write all the given values.
Population 1: x1 = 100, n1 = 400
Population 2: x2 = 300, n2 = 600
Step 2: Now, using these values find P1 and P2: P1 = 100/400 = 0.25 and P2 = 300/600 = 0.50.
Step 3: Find the cumulative z value (using the less-than z-table): 1 − 0.99 = 0.01 and 0.01/2 = 0.005; find the z value nearest to 0.005 in the less-than z-table and you will get 2.58.
Step 4: Now, substitute all the given values in the formula and compute for the two population proportions. Taking the difference as P2 − P1 so that it is positive:
(0.50 − 0.25) ± 2.58 · √(0.25 × 0.75/400 + 0.50 × 0.50/600) = 0.25 ± 0.077
Step 5: Write your final answer in conclusion. Therefore, the 99% confidence limits for the difference in proportion of all female and male college students who frequent internet cafes are 17.3% and 32.7%.

CHAPTER 12: ANALYSIS OF VARIANCE (ANOVA)

Analysis of Variance (ANOVA)
When an F test is used to test a hypothesis concerning the means of three or more populations, the technique is called analysis of variance, commonly abbreviated as ANOVA.

Analysis of variance is a collection of statistical models and their associated procedures used to analyze the differences among group means, such as the "variation" among and between groups. It checks the impact of one or more factors by comparing the means of different samples. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal and therefore generalizes the t-test to more than two groups.
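To make the comparison of "variation among and between groups" concrete, the sketch below computes the usual one-way ANOVA F statistic (mean square between groups divided by mean square within groups). It is an illustration under assumptions: the function name `one_way_f` and the three small samples are hypothetical, and the text itself does not give this computation.

```python
from statistics import mean

def one_way_f(groups):
    """Return the F statistic for a one-way ANOVA on a list of samples."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations
    grand_mean = mean([x for g in groups for x in g])

    # Between-group variation: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: spread of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)                 # df between = k - 1
    ms_within = ss_within / (n_total - k)             # df within = N - k
    return ms_between / ms_within

samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]           # hypothetical data
f_stat = one_way_f(samples)
print(f"F = {f_stat:.2f}")
```

A large F suggests that the group means differ more than the variation inside the groups would explain; the value is then compared with a critical value from an F table.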
Hypothesis Test In Analysis Of Variance

CHAPTER 13: CHI-SQUARE DISTRIBUTION

Chi Square Distribution: CONCEPT GRASPING
→ The chi-square distribution is closely related to the standard normal distribution.

Imagine taking a sample from a standard normal distribution (z). If we square all the values in the sample, we have a chi-square distribution wherein k = 1.

Now let us take samples from two standard normal distributions (z₁ and z₂). If we square each pair of values and add the squares, we have a chi-square distribution wherein k = 2.

More generally, if we sample from k independent standard normal distributions and then square and add the values, we produce a chi-square distribution with k degrees of freedom.

Chi Square Distribution: DEFINITION
→ Widely used to test a hypothesis or a specific prediction.
→ Indicates how likely it is that a variable fits an expected pattern, or whether two variables are independent.
→ Unlike distributions such as the normal or Poisson distribution, which are used to describe real-world data, the chi-square distribution is mainly used in hypothesis testing.
→ In layman’s terms, the chi-square distribution is used to test if a frequency distribution fits a specific pattern or if the elements have common characteristics for each population.

Chi Square Distribution: SHAPE
→ The shape of a chi-square distribution is determined by the parameter k, the degrees of freedom. The shape changes as k increases.

When k is one or two
→ The chi-square distribution is a curve shaped like a backwards “J.” The curve starts out high and then drops off, meaning that there is a high probability that Χ² is close to zero.

When k is greater than two
→ The chi-square distribution is hump-shaped. The curve starts out low, increases, and then decreases again. There is low probability that Χ² is very close to or very far from zero. The most probable value of Χ² is k − 2. When k is only a bit greater than two, the distribution is much longer on the right side of its peak than its left (i.e., it is strongly right-skewed).

When k greatly increases
→ The distribution looks more and more similar to a normal distribution. In fact, when k is 90 or greater, a normal distribution is a good approximation of the chi-square distribution.

Chi Square Distribution: PROPERTIES
1. Chi-square distributions are continuous; they start at zero and continue to infinity.
2. The parameter k is the number of degrees of freedom (df). In a goodness of fit test, df is the number of groups minus one.
3. The variance is 2k.
4. The mode is k − 2 (when k > 2).
5. The standard deviation is √(2k).
6. The distribution is asymmetrical (right-skewed), but becomes increasingly symmetrical as k increases.

Types of Chi-Square Tests
● Chi-square goodness of fit test
● Chi-square test of independence
● Chi-square test for homogeneity

TEST OF GOODNESS OF FIT
Definition
This is used when a person is testing to see whether a frequency distribution fits a specific pattern. In short, this test tells if the observed distribution differs from the expectations. The chi-square goodness of fit test tells you how well a statistical model fits a set of observations. It’s often used to analyze genetic crosses.

Chi-square goodness of fit test hypotheses
Like all hypothesis tests, a chi-square goodness of fit test evaluates two hypotheses: the null and alternative hypotheses. They’re two competing answers to the
question “Was the sample drawn from a population that ● States a specific difference between a parameter and a The larger the difference between the observations and
follows the specified distribution?” specific value or states that there is a difference between the expectations (O − E in the equation), the bigger the
● Null hypothesis (H0): The population follows the two parameters. chi-square will be.
specified distribution. Step 2: Find the critical value
● Alternative hypothesis (Ha): The population does not Find the critical chi-square value in a chi-square critical To use the formula, follow these five steps:
follow the specified distribution. value table or using statistical software. The critical value Step 1: Create a table
is calculated from a chi-square distribution. To find the Step 2: Calculate O − E Step
These are general hypotheses that apply to all chi-square critical chi-square value, you’ll need to know two things: 3: Calculate (O − E) 2
goodness of fit tests. You should make your hypotheses ● The degrees of freedom (df): For chi-square goodness Step 4: Calculate (O − E) 2 / E
more specific by describing the “specified distribution.” of fit tests, the df is the number of groups minus one. Step 5: Calculate Χ2
You can name the probability distribution (e.g., Poisson ● Significance level (α): By convention, the significance
distribution) or give the expected proportions of each level is usually .05. Example: Chi-square goodness of fit test
group. Step 3: Compute the critical chi-square value Calculate You’re hired by a dog food company to help them test
the chi-square value from your observed and expected three new dog food flavors. You recruit a random sample
When to use the chi-square goodness of fit test? frequencies using the chi-square formula. of 75 dogs and offer each dog a choice between the three
The following conditions are necessary if you want to flavors by placing bowls in front of them. You expect that
perform a chi-square goodness of fit test: the flavors will be equally popular among the dogs, with
1. You want to test a hypothesis about the distribution of about 25 dogs choosing each flavor.
one categorical variable. If your variable is continuous, Step 4: Compare the chi-square value to the critical
you can convert it to a categorical variable by separating value. Make a decision. Once you have your experimental results, you plan to use
the observations into intervals. This process is known as Compare the chi-square value to the critical value to a chi-square goodness of fit test to figure out whether the
data binning. determine which is larger. distribution of the dogs’ flavor choices is significantly
2. The sample was randomly selected from the ● If the Χ 2 value is greater than the critical value, then different from your expectations. Observed and expected
population. the difference between the observed and expected frequencies.
3. There are a minimum of five observations expected in distributions is statistically significant (p < α).
each group. o The data allows you to reject the null hypothesis After weeks of hard work, your dog food experiment is
and provides support for the alternative hypothesis. complete and you compile your data in a table:
How to perform the chi-square goodness of fit test? ● If the Χ 2 value is less than the critical value, then the
The chi-square statistic is a measure of goodness of fit, difference between the observed and expected Observed and expected frequencies of dogs’ flavor
but on its own it doesn’t tell you much. For example, is Χ distributions is not statistically significant (p > α). choices
2 = 1.52 a low or high goodness of fit? o The data doesn’t allow you to reject the null
To interpret the chi-square goodness of fit, you need to hypothesis and doesn’t provide support for the alternative
compare it to something. That’s what a chi-square test is: hypothesis.
comparing the chi-square value to the appropriate chi- Step 5: Summarize the results.
square distribution to decide whether to reject the null
hypothesis. How to calculate the test statistic (chi-square value) The
test statistic for the chi-square (Χ2) goodness of fit test is
To perform a chi-square goodness of fit test, follow these Pearson’s chi-square:
five steps
Step 1: State the hypothesis. To help visualize the differences between your observed
● States that there is no difference between a parameter and expected frequencies, you also create a bar graph:
and specific value or that there is no difference between
two parameters.
Step 1: State your hypothesis
● Null hypothesis (H0): The dog population chooses the three flavors in equal proportions
(p1 = p2 = p3).
● Alternative hypothesis (Ha): The dog population does not choose the three flavors in equal
proportions.
Step 2: Find the critical value
Since there are three groups (Garlic Blast, Blueberry Delight, and Minty Munch), there are
two degrees of freedom.
For a test of significance at α = .05 and df = 2, the Χ² critical value is 5.99.
Step 3: Compute the chi-square value
Add up the values of the (O − E)² / E column. This is the chi-square test statistic (Χ²).
Χ² = 0.36 + 1 + 0.16 = 1.52
Step 4: Compare the chi-square value to the critical value. Make a decision.
Χ² = 1.52    Critical value = 5.99
Since 1.52 < 5.99, the null hypothesis is not rejected.
Step 5: Summarize the result
There is not enough evidence to reject the null hypothesis. The data are consistent with the
dog population choosing the three flavors in equal proportions.
This suggests that the dog food flavors are equally popular in the dog population. You report
your findings back to the dog food company president. He decides not to eliminate the Garlic
Blast and Minty Munch flavors based on your findings. The many dogs who love these flavors
are very grateful!

CONTINGENCY TABLE
Definition
When data can be tabulated in table form in terms of frequencies, several types of hypotheses
can be tested by using the chi-square test.
Two such tests are the independence of variables test and the homogeneity of proportions
test. The test of independence of variables is used to determine whether two variables are
independent of or related to each other when a single sample is selected. The test of
homogeneity of proportions is used to determine whether the proportions for a variable are
equal when several samples are selected from different populations. Both tests use the
chi-square distribution and a contingency table, and the test value is found in the same way.
The independence test will be explained first.

TEST FOR INDEPENDENCE
Definition
The chi-square independence test can be used to test the independence of two variables.

Steps to Follow
1. State the hypothesis, null hypothesis and alternative hypothesis.
2. Determine the tabular value (critical value) using the chi-square table.
3. Compute the test value by using the chi-square test. Solve for the expected values using
the formula.
4. Make a decision.
5. Summarize the result.

**Always remember:
If test value < tabular value, we do not reject H0.
If test value > tabular value, we reject H0.

Example:
Suppose a new postoperative procedure is administered to a number of patients in a large
hospital. The researcher can ask the question, Do the doctors feel differently about this
procedure from the nurses, or do they feel basically the same way? Note that the question is
not whether they prefer the procedure but whether there is a difference of opinion between
the two groups.
To answer this question, a researcher selects a sample of nurses and doctors and tabulates
the data in table form, as shown.

Group      Prefer new procedure   Prefer old procedure   No preference
Nurses     100                    80                     20
Doctors    50                     120                    30

As the survey indicates, 100 nurses prefer the new procedure, 80 prefer the old procedure,
and 20 have no preference; 50 doctors prefer the new procedure, 120 like the old procedure,
and 30 have no preference. The main question is whether there is a difference in opinion.

Step 1: State the hypothesis
H0: The opinion about the procedure is independent of the profession.
H1: The opinion about the procedure is dependent on the profession.
Step 2: Determine the tabular value
The degrees of freedom are (R − 1)(C − 1) = (2 − 1)(3 − 1) = 2. If α = 0.05, the critical
value from Table G is 5.991.
Step 3: Compute the test value. Solve first for the expected value.
To test the null hypothesis by using the chi-square independence test, you must compute the
expected frequencies, assuming that the null hypothesis is true. These frequencies are
computed by using the observed frequencies given in the table.
When data are arranged in table form for the chi-square independence test, the table is
called a contingency table. The table is made up of R rows and C columns. The table here has
two rows and three columns. Note that row and column headings do not count in determining the
number of rows and columns.
A contingency table is designated as an R x C (rows by columns) table. In this case, R = 2
and C = 3; hence, this table is a 2x3 contingency table. Each block in the table is called a
cell and is designated by its row and column position. For example, the cell with a frequency
of 80 is designated as C1,2, or row 1, column 2. The cells are shown below.

           Column 1   Column 2   Column 3
Row 1      C1,1       C1,2       C1,3
Row 2      C2,1       C2,2       C2,3
Using the previous table, you can compute the expected frequencies for each block (or cell),
as shown next.

a. Find the sum of each row and each column, and find the grand total, as shown.

Group      Prefer new   Prefer old   No preference   Total
Nurses     100          80           20              200
Doctors    50           120          30              200
Total      150          200          50              400

b. For each cell, multiply the corresponding row sum by the column sum and divide by the
grand total, to get the expected value:

$E = \frac{(\text{row sum})(\text{column sum})}{\text{grand total}}$

c. For example, for C1,2, the expected value, denoted by E1,2, is (refer to the previous
tables)

$E_{1,2} = \frac{(200)(200)}{400} = 100$

d. The expected values can now be placed in the corresponding cells along with the observed
values, as shown (expected values in parentheses).

Group      Prefer new   Prefer old   No preference   Total
Nurses     100 (75)     80 (100)     20 (25)         200
Doctors    50 (75)      120 (100)    30 (25)         200
Total      150          200          50              400

The rationale for the computation of the expected frequencies for a contingency table uses
proportions. For C1,1 a total of 150 out of 400 people prefer the new procedure. And since
there are 200 nurses, you would expect, if the null hypothesis were true, (150/400)(200), or
75, of the nurses to be in favor of the new procedure.

e. The formula for the test value for the independence test is the same as the one used for
the goodness-of-fit test. It is

$\chi^2 = \sum \frac{(O - E)^2}{E}$

$\chi^2 = \frac{(100-75)^2}{75} + \frac{(80-100)^2}{100} + \frac{(20-25)^2}{25} + \frac{(50-75)^2}{75} + \frac{(120-100)^2}{100} + \frac{(30-25)^2}{25}$
$\chi^2 = 8.333 + 4.000 + 1.000 + 8.333 + 4.000 + 1.000 = 26.67$

Step 4: Make a decision
The decision is to reject the null hypothesis since 26.67 > 5.991.

Step 5: Summarize the result
The conclusion is that there is enough evidence to support the claim that opinion is related
to (dependent on) profession; that is, the doctors and nurses differ in their opinions about
the procedure.

TEST FOR HOMOGENEITY
Definition
The chi-square test is more convenient to use than the z-test in testing the significance of
the difference between two proportions. The test is used to test the significant difference
between two proportions.

Formula

$\chi^2 = \frac{N(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$

Where N = total number of cases
A, B, C, and D are the observed frequencies of the 2x2 table (A and B in the first row, C and
D in the second row)

Example
A sample survey of presidential candidates in the Philippines shows that 120 of 200 male
voters dislike the candidate and 175 of 250 female voters dislike the same candidate.
Determine whether the difference between the two sample proportions 120/200 and 175/250 is
significant or not at 1% level of significance.

Step 1: State the hypothesis
H0: There is no significant difference between the opinions of male and female voters toward
the candidate.
H1: There is a significant difference between the opinions of male and female voters toward
the candidate.
Step 2: Determine the critical value
degrees of freedom = (R − 1)(C − 1)
= (2 − 1)(2 − 1)
= (1)(1)
degrees of freedom = 1
level of significance = 1% or 0.01
Tabular value is 6.64
Step 3: Compute the test value using the chi-square test for homogeneity.

           Like   Dislike   Total
Male       80     120       200
Female     75     175       250
Total      155    295       450

Computing for Χ² we have:

$\chi^2 = \frac{450[(80)(175) - (120)(75)]^2}{(200)(250)(155)(295)} = \frac{450(5000)^2}{2{,}286{,}250{,}000} = 4.92$

Step 4: Make a decision
Given that 4.92 < 6.64, we do not reject the null hypothesis.
Step 5: Summarize the result
There is no significant difference in the opinions between male and female voters toward the
candidate, and there are more voters who dislike the candidate than those who like him.
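
The three chi-square computations in this chapter can be checked with a short Python sketch. It is only an illustration, not part of the original handout: the function names are made up for this example, and the observed dog counts (22, 30, 23) are an assumption chosen to reproduce the contributions 0.36 + 1 + 0.16 quoted in the text. The "like" counts for the voters (80 and 75) are derived from 200 − 120 and 250 − 175.

```python
def pearson_chi_square(observed, expected):
    """Pearson's statistic: the sum of (O - E)^2 / E over all groups or cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected cell counts: (row sum * column sum) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def contingency_chi_square(table):
    """Chi-square test value for an R x C table of observed counts."""
    expected = expected_counts(table)
    flat_observed = [o for row in table for o in row]
    flat_expected = [e for row in expected for e in row]
    return pearson_chi_square(flat_observed, flat_expected)


# Goodness of fit: 75 dogs, 25 expected per flavor.
# ASSUMPTION: these observed counts are illustrative values consistent
# with the contributions 0.36 + 1 + 0.16 given in the text.
dog_chi2 = pearson_chi_square([22, 30, 23], [25, 25, 25])

# Independence: rows are nurses and doctors; columns are new, old, no preference.
procedure_chi2 = contingency_chi_square([[100, 80, 20], [50, 120, 30]])

# Homogeneity: rows are male and female voters; columns are like, dislike.
voter_chi2 = contingency_chi_square([[80, 120], [75, 175]])

print(round(dog_chi2, 2), round(procedure_chi2, 2), round(voter_chi2, 2))
```

Under these assumptions the three values round to 1.52, 26.67, and 4.92, matching the test values in the worked examples. Each value is then compared with the tabular value for its degrees of freedom, as in Step 4 of each example.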
2nd Semester, Academic Year 2022 – 2023
MATH111A: ELEMENTARY STATISTICS AND PROBABILITY

CHAPTER 11: HYPOTHESIS TESTING

I. OBJECTIVES

At the end of the lesson, students should be able to:

● Acquire a deep understanding of the basic concepts in hypothesis testing.
● Identify the difference between the independent t-test and the dependent t-test.
● Use a step-by-step process to solve independent and dependent t-test problems.

II. CONTENT

HYPOTHESIS
● A premise or claim that we want to test.
● A statistical hypothesis is a conjecture about a population parameter. This
conjecture may or may not be true.
● A claim or statement about a population parameter.

HYPOTHESIS TESTING
● Hypothesis testing in statistics is a way for you to test the results of a survey or
experiment to see if you have meaningful results. You’re basically testing whether
your results are valid by figuring out the odds that your results have happened by
chance.
● It is a decision-making process for evaluating claims about a population.

Two Types of Statistical Hypotheses

Null Hypothesis
● States that there is no difference between a parameter and specific value or that
there is no difference between two parameters.
● It shows no significant difference, no changes, nothing happened, no relationship
between two parameters.
● Currently accepted value for a parameter.
● It is the initial claim and represented by H0.
Alternative Hypothesis
● States a specific difference between a parameter and a specific value or states
that there is a difference between two parameters.
● It shows that there is significant difference, an effect, change, relationship
between a parameter and specific value.
● It involves the claim to be tested and it is also called a research hypothesis.
● It is contrary to the null hypothesis and represented by Ha or H1.

Types of Statistical   Symbols   Words or Clues in Identifying the Types of
Hypotheses             Used      Statistical Hypotheses

Null Hypothesis        =         Equal to, the same as, not changed from, and is.
(H0)                   ≥         Is greater than or equal to, is at least, is not less
                                 than, and no less than.
                       ≤         Is less than or equal to, is at most, does not exceed,
                                 is not greater than, no greater than, is not more than,
                                 and no more than.

Alternative            ≠         Not equal, different from, changed from, and not the
Hypothesis                       same as.
(Ha or H1)             >         Greater than, above, higher than, longer than, bigger
                                 than, increased.
                       <         Less than, below, lower than, smaller than, shorter
                                 than, decreased or reduced from.

Example:
State the null and the alternative hypotheses for each conjecture.

1. It is believed that a candy machine makes chocolate bars that are on average 5
grams. A worker claims that the machine after maintenance no longer makes 5
grams bars.
Solution:
H0 : μ = 5 grams
Ha : μ ≠ 5 grams
2. Doctors believe that teens sleep on average no longer than 10 hours per day. A
researcher believes that teens on average sleep longer.
Solution:
H0 : μ ≤ 10 hours
Ha : μ > 10 hours
3. A researcher feels that advertising on television will change the buying preferences
of young adults for a certain product. The researcher is not sure whether the sales
will increase or decrease. In the past, the mean sales was P500,000.
Solution:
H0 : μ = P500,000
Ha : μ ≠ P500,000

STATISTICAL TEST
● Uses the data obtained from a sample to make a decision whether or not the null
hypothesis should be rejected. The numerical value obtained from a statistical test
is called the test value.
● A statistical test provides a mechanism for making quantitative decisions about a
process or processes. The intent is to determine whether there is enough evidence
to "reject" a conjecture or hypothesis about the process. The conjecture is called
the null hypothesis.

Types of Errors
In the hypothesis testing situation, there are four possible outcomes. The four
possible outcomes are shown below:

                                   Null Hypothesis (H0) is
                                   True               False

Reject Null Hypothesis (H0)        Type I Error       Correct Decision

Fail to reject Null
Hypothesis (H0)                    Correct Decision   Type II Error

Remember:
● A type I error occurs if one rejects the null hypothesis when it is true.
● A type II error occurs if one does not reject the null hypothesis when it is false.
Examples:
1. Suppose the null hypothesis is: Ben’s used car is safe to drive. Which statement
represents a type I error and a type II error?
a. Ben thinks that his car may be safe when, in fact, it is not safe.
b. Ben thinks that his car may be safe when, in fact, it is safe.
c. Ben thinks that his car may not be safe when, in fact, it is not safe.
d. Ben thinks that his car may not be safe when, in fact, it is safe.
Answer:
➔ Letter d is the statement that represents type I error because it rejects the
null hypothesis even though it is true.
➔ Letter a is the statement that represents type II error because it does not
reject the null hypothesis when it is false.

2. In a criminal court case, the null hypothesis is that the defendant is presumed
innocent. Which statement represents a type I error and a type II error?
a. The jury believes that the defendant is guilty when, in fact, he is innocent.
b. The jury believes that the defendant is guilty when, in fact, he is not
innocent.
c. The jury believes that the defendant is not guilty when, in fact, he is not
innocent.
d. The jury believes that the defendant is not guilty when, in fact, he is
innocent.
Answer:
➔ Letter a is the statement that represents type I error because it rejects the
null hypothesis when it is true.
➔ Letter c is the statement that represents type II error because it does not
reject the null hypothesis when it is false.
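
The four outcomes in the table above can be written as a small lookup, shown here as a Python sketch. The function name and its two boolean arguments are hypothetical and chosen only for this illustration; the sketch simply encodes the definitions of type I and type II errors given in this section.

```python
def classify_outcome(null_is_true, null_rejected):
    """Name the outcome of a hypothesis test from the truth of H0 and the decision."""
    if null_is_true and null_rejected:
        return "Type I Error"        # rejected a true null hypothesis
    if not null_is_true and not null_rejected:
        return "Type II Error"       # failed to reject a false null hypothesis
    return "Correct Decision"


# Ben's car, statement d: the car is safe (H0 true) but Ben thinks it is not (H0 rejected).
print(classify_outcome(null_is_true=True, null_rejected=True))

# Court case, statement c: the defendant is not innocent (H0 false) but is found not guilty.
print(classify_outcome(null_is_true=False, null_rejected=False))
```

The first call corresponds to a type I error and the second to a type II error, in line with the answers above.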

Level of Confidence
● It is the percentage of times you expect to get close to the same estimate if you
run your experiment again or resample the population in the same way.
● The most common confidence levels are 90%, 95% and 99%.
Level of Significance
The level of significance is the maximum probability of committing a type I error.
This probability is symbolized by “α” (alpha). That is, P(type I error) = α. Researchers
generally agree on using three arbitrary significance levels: 0.10, 0.05, and 0.01. When α
= 0.10, there is a 10% chance of rejecting a true null hypothesis; when α = 0.05, there is
a 5% chance of rejecting a true null hypothesis; and when α = 0.01, there is a 1% chance of
rejecting a true null hypothesis.
● It can also be expressed as α (alpha) = 1 – C, where C is the level of confidence.

For example:
The level of confidence is 95%
C = 0.95
α = 1 – 0.95
α = 0.05

Remember:
● The level of confidence and level of significance are related to each other since
C and α describe the same thing from opposite sides: C tells how sure you are of
making the right decision, and α tells the probability of not making the right
decision. Some problems will specify the level of confidence and some will
specify the level of significance.

CRITICAL VALUES
● The critical value (s) separates the critical region from the non-critical region.
● The critical or rejection region is the range of values of the test value that
indicates that there is a significant difference and that the null hypothesis should
be rejected.
● The non-critical or non-rejection region is the range of values of the test value
that indicates that the null hypothesis should not be rejected.
● The critical value can be on the right side of the mean or on the left side of the
mean for a one-tailed test. Its location depends on the inequality sign of the
alternative hypothesis.

● A one-tailed test indicates that the null hypothesis should be rejected when the
test value is in the critical region on one side of the mean. A one-tailed test is
either right-tailed or left-tailed, depending on the direction of the inequality of the
alternative hypothesis.
● A two-tailed test, in statistics, is a method in which the critical area of a distribution
is two-sided and tests whether a sample is greater than or less than a certain range
of values. It is used in null-hypothesis testing and testing for statistical significance.

● A right-tailed test is also called the upper-tail test. It is performed if the
population parameter is suspected to be greater than the value assumed in the null
hypothesis.

● A left-tailed test is used when the alternative hypothesis states that the true value
of the parameter specified in the null hypothesis is less than the null hypothesis
claims.

Decision Criteria:
● Reject the null hypothesis if test statistic > t critical value (right-tailed hypothesis
test).
● Reject the null hypothesis if test statistic < t critical value (left-tailed hypothesis
test).
● Reject the null hypothesis if the test statistic does not lie in the acceptance region
(two-tailed hypothesis test).

Example:
Find the critical value of the following.

1. The level of significance is 5%. Find the critical value of “Z” for a right-tailed test.
Solution:
Step 1: α = 5% = 0.05

Step 2: Subtract α from 1 to get the area to the left of the critical value.

1 – α = 1 – 0.05 = 0.95

Step 3: Find 0.95 in the body of the z-table and add the row and column values.
1.6 + 0.05 = 1.65
(0.95 falls between the table entries for 1.64 and 1.65, so 1.645 is also commonly used.)

Therefore, the critical value of Z for the right-tailed test is 1.65.


2. Find the critical value of “Z” for a two-tailed test with an alpha of 15%.
Solution:
Step 1: α = 15% = 0.15

Step 2: Divide α by 2 because the test is two-tailed.

α/2 = 0.15/2 = 0.075

Step 3: Subtract the result from 1 to get the area to the left of the upper critical value.
1 – 0.075 = 0.925

Step 4: Find 0.925 in the body of the z-table and add the row and column values.
1.4 + 0.04 = 1.44

Therefore, the critical value of “Z” for the two-tailed test is ± 1.44.
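
The z-table lookups in the two examples above can be cross-checked with Python's standard library. The sketch below is only an illustration: the helper name `critical_z` and its `tail` argument are made up for this example, while `statistics.NormalDist().inv_cdf` is the standard-library inverse of the cumulative normal distribution.

```python
from statistics import NormalDist


def critical_z(alpha, tail):
    """Critical z value for a given level of significance and type of test."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":
        return standard_normal.inv_cdf(1 - alpha)      # area to the left is 1 - alpha
    if tail == "left":
        return standard_normal.inv_cdf(alpha)          # area to the left is alpha
    if tail == "two":
        return standard_normal.inv_cdf(1 - alpha / 2)  # upper value; the lower one is its negative
    raise ValueError("tail must be 'right', 'left', or 'two'")


print(critical_z(0.05, "right"))  # Example 1
print(critical_z(0.15, "two"))    # Example 2
```

The printed values are approximately 1.645 and 1.44, which agree with the table values 1.65 and ± 1.44 up to the rounding of the z-table.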


t-test DEPENDENT
● The dependent t-test (also called the paired t-test or paired-samples t-test)
compares the means of two related groups to determine whether there is a
statistically significant difference between these means.
● You need one dependent variable that is measured on an interval or ratio scale.
● A dependent t-test is an example of a "within-subjects" or "repeated-measures"
statistical test. This indicates that the same participants are tested more than once.
Thus, in the dependent t-test, "related groups" indicate that the same participants
are present in both groups.

Example:
1. Measure participants' weight before and after the diet counseling course.
2. Measure the performance of 10 participants in a spelling test before and after
they undergo a new form of computerized teaching method to improve spelling.
3. Compute the difference of their scores from the pre-test and post-test.
Assumptions:
1. Your dependent variable should be measured on a continuous scale.
2. Related samples/group. This means that the subjects in the first group are also in
the second group.
3. No significant outliers in the differences between the two groups.
4. The differences should be approximately normally distributed.

Formulas:

To find the t-value:

$t = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}$

where:
$\bar{D}$ = average of the differences (D)
$\mu_D$ = hypothesized difference between the means of the two variables (assumed to be zero)
$s_D$ = standard deviation of the differences
n = sample size (number of pairs)

Requirements: D, $\bar{D}$, and $s_D$

To find D:

$D = x_1 - x_2$

where:
D = difference between the paired values
$x_1$ = first value (for example, before)
$x_2$ = second value (for example, after)

To find $\bar{D}$:

$\bar{D} = \frac{\Sigma D}{n}$

where:
$\Sigma D$ = summation of the differences
n = sample size

To find $s_D$:

$s_D = \sqrt{\frac{n \Sigma D^2 - (\Sigma D)^2}{n(n-1)}}$

where:
n = sample size
$\Sigma D^2$ = summation of the squared differences
$\Sigma D$ = summation of the differences

If we substitute the formulas:

$t = \frac{\frac{\Sigma D}{n} - \mu_D}{\sqrt{\frac{n \Sigma D^2 - (\Sigma D)^2}{n(n-1)}} \Big/ \sqrt{n}}$

Steps in computing t-test dependent:


Step 1: State the hypotheses (Null and Alternative Hypothesis).
Step 2: Find the type of test, degrees of freedom, and critical value.
Step 3: Find the t-value.
Step 4: Make the decision.
Step 5: Make a summary.

Example 1:

A math teacher wishes to see whether a new program will reduce the number of
errors that the students make when solving worded problems. The data are shown here.
At α = 0.05, can it be concluded that the number of errors has been reduced?

Student 1 2 3 4 5 6

Errors Before 12 9 0 5 4 3

Errors After 9 6 1 3 2 3
Step 1: State the hypothesis.
H0: There is no significant difference between the average number of errors before and
after the new program.
H1: The average number of errors after the new program is lower than before (the mean of
D = before − after is greater than zero).

Step 2: Find the type of test, degree of freedom, and critical value.
Type of Test: One-tailed Test (Right-tailed Test)

Degree of freedom: d.f.= n-1 = 6-1 = 5

Critical Value: α = 0.05, d.f. = 5, look in the t-table = 2.015

Step 3: Find the t-value.

Student          1    2    3    4    5    6

Errors Before    12   9    0    5    4    3

Errors After     9    6    1    3    2    3

D                3    3    -1   2    2    0

D²               9    9    1    4    4    0

Given: n = 6

Required: $\Sigma D$, $\Sigma D^2$, $\bar{D}$, $s_D$

Find $\Sigma D$:
$\Sigma D = 3 + 3 + (-1) + 2 + 2 + 0$
$\Sigma D = 9$

Find $\Sigma D^2$:
$\Sigma D^2 = 9 + 9 + 1 + 4 + 4 + 0$
$\Sigma D^2 = 27$

Find $\bar{D}$:
$\bar{D} = \frac{\Sigma D}{n} = \frac{9}{6} = \frac{3}{2}$
$\bar{D} = 1.5$

Find $s_D$:
$s_D = \sqrt{\frac{n \Sigma D^2 - (\Sigma D)^2}{n(n-1)}}$
$= \sqrt{\frac{(6)(27) - (9)^2}{(6)(6-1)}}$
$= \sqrt{\frac{162 - 81}{(6)(5)}}$
$= \sqrt{\frac{81}{30}}$
$= \sqrt{2.7}$
$s_D = 1.64$

Find t-value:
$t = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}$
$t = \frac{1.5 - 0}{1.64 / \sqrt{6}}$
$t = \frac{1.5}{1.64 / 2.45}$
$t = \frac{1.5}{0.67}$
$t = 2.24$

Step 4: Make the decision.


Reject the H0 , since 2.24 > 2.015

Step 5: Make a summary.


Therefore, there is enough evidence to support the claim that the number of errors
has been reduced after the new program.
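
As a cross-check of Example 1, the dependent t-test formulas can be written out step by step in Python. This is a minimal sketch, not part of the original handout: the function name `paired_t` and its return layout are assumptions made for this illustration, and the formula used for the standard deviation is the shortcut formula given in this section.

```python
import math


def paired_t(first, second, mu_d=0.0):
    """Dependent t-test value from paired data, using D = first - second."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    sum_d = sum(diffs)                   # summation of differences
    sum_d2 = sum(d * d for d in diffs)   # summation of squared differences
    mean_d = sum_d / n
    # Shortcut formula for the standard deviation of the differences.
    s_d = math.sqrt((n * sum_d2 - sum_d ** 2) / (n * (n - 1)))
    t = (mean_d - mu_d) / (s_d / math.sqrt(n))
    return mean_d, s_d, t


errors_before = [12, 9, 0, 5, 4, 3]
errors_after = [9, 6, 1, 3, 2, 3]
mean_d, s_d, t_value = paired_t(errors_before, errors_after)
print(mean_d, round(s_d, 2), round(t_value, 2))
```

With the data from the table, the mean difference is 1.5, the standard deviation is about 1.64, and the t-value is about 2.24, which exceeds the critical value 2.015.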
Example 2:

A teacher, after seeing the poor mathematics scores of the class, decides to
conduct special tutoring for the subject. The test was out of 10. She then compares the
before and after scores of the students. The alpha is to be assumed as 0.05. The aim is
to find whether the special tutoring was effective or not and the results come out as
follows:

Before 7 6 5 4 4 6 7 5 5 7

After 9 10 7 5 7 5 9 6 8 7

Step 1: State the hypothesis.


H0: There is no significant difference between the mean score before and the mean score
after the special tutoring. Meaning, the tutoring did not make any difference.
H1: There is a significant difference between the mean score before and the mean score
after the special tutoring. Meaning, the tutoring did make some difference.

Step 2: Find the type of test, degree of freedom, and critical value.
Type of Test: Two-tailed Test

Degree of freedom: d.f.= n-1 = 10-1 = 9

Critical Value: α = 0.05, d.f. = 9, look in the t-table = ±2.2622

Step 3: Find the t-value.

Before   7    6    5    4    4    6    7    5    5    7

After    9    10   7    5    7    5    9    6    8    7

D        -2   -4   -2   -1   -3   1    -2   -1   -3   0

D²       4    16   4    1    9    1    4    1    9    0

Given: n = 10

Required: $\Sigma D$, $\Sigma D^2$, $\bar{D}$, $s_D$

Find $\Sigma D$:
$\Sigma D = (-2) + (-4) + (-2) + (-1) + (-3) + 1 + (-2) + (-1) + (-3) + 0$
$\Sigma D = -17$

Find $\Sigma D^2$:
$\Sigma D^2 = 4 + 16 + 4 + 1 + 9 + 1 + 4 + 1 + 9 + 0$
$\Sigma D^2 = 49$

Find $\bar{D}$:
$\bar{D} = \frac{\Sigma D}{n} = \frac{-17}{10}$
$\bar{D} = -1.7$

Find $s_D$:
$s_D = \sqrt{\frac{n \Sigma D^2 - (\Sigma D)^2}{n(n-1)}}$
$= \sqrt{\frac{(10)(49) - (-17)^2}{(10)(10-1)}}$
$= \sqrt{\frac{490 - 289}{(10)(9)}}$
$= \sqrt{\frac{201}{90}}$
$= \sqrt{2.23}$
$s_D = 1.49$

Find t-value:
$t = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}$
$t = \frac{-1.7 - 0}{1.49 / \sqrt{10}}$
$t = \frac{-1.7}{1.49 / 3.16}$
$t = \frac{-1.7}{0.47}$
$t = -3.62$
Step 4: Make the decision.
Reject the H0 , since -3.62 < -2.2622

Step 5: Make a summary.


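There is enough evidence to conclude that the scores before and after the special
tutoring differ. Since the mean difference (before − after) is negative, the students
scored higher after the tutoring than before the tutoring.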
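
Example 2 can also be checked with the standard-library `statistics` module. The sketch below is an illustration under the assumption that `statistics.stdev` (the sample standard deviation with n − 1 in the denominator) is applied to the differences, which is equivalent to the shortcut formula used in the solution; the variable names are chosen for this example only.

```python
import math
from statistics import mean, stdev

before = [7, 6, 5, 4, 4, 6, 7, 5, 5, 7]
after = [9, 10, 7, 5, 7, 5, 9, 6, 8, 7]

# D = before - after, as in the worked solution.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

# t-value without rounding the intermediate results.
t_unrounded = mean(diffs) / (stdev(diffs) / math.sqrt(n))

critical_value = 2.2622  # two-tailed, alpha = 0.05, d.f. = 9 (from the t-table)
reject_null = abs(t_unrounded) > critical_value
print(round(t_unrounded, 2), reject_null)
```

Carried at full precision, the t-value is about −3.60 rather than −3.62; the small gap comes from rounding 1.49 / 3.16 to 0.47 in the hand computation. The decision to reject the null hypothesis is the same either way.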

t-test INDEPENDENT

The independent-samples t-test (t-test independent) compares the means
between two unrelated groups on the same continuous, dependent variable.

Example:
● The effectiveness of two different diets on two different groups of individuals.
● Comparing the height of students in two different schools.

Caution!! The t-test can be used when the population standard deviations are not known
and the sample size is smaller (less than 30).

Assumptions:
1. Independence of the observations. Each subject should belong to only one
group. There is no relationship between the observations in each group.
2. No significant outliers in the two groups.
3. Normality. The data for each group should be approximately normally distributed.
4. Homogeneity of variances. The variance of the outcome variable should be
equal in each group.

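Formula for the t-test for testing the difference between two means (independent
samples):

Variances are assumed to be unequal:

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

with d.f. equal to the smaller of $n_1 - 1$ and $n_2 - 1$.

Variances are assumed to be equal:

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

with d.f. = $n_1 + n_2 - 2$.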

Example:
According to Nielsen Media Research, children (ages 2-11) spend an average of
21 hours and 30 minutes watching television per week while teens (ages 12-17) spend
an average of 20 hours and 40 minutes. Based on the sample statistics obtained below,
is there sufficient evidence to conclude a difference in average television watching times
between the two groups? Use α = 0.05.

                   Children   Teens
Sample mean        22.45      18.50
Sample variance    16.4       18.2
Sample size        15         15

Step 1: State the hypothesis


H0: There is no significant difference between the average tv watching times of the two
groups.
H1: There is a significant difference between the average tv watching times of the two
groups.
Step 2: Identify the statistical test
Two-tailed t-test for independent samples

Step 3: Find the critical values


α = 0.05

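d.f. = n1 − 1 or n2 − 1 (whichever is smaller) = 15 − 1 = 14

critical value = ±2.145

Step 4: Find the test value

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(22.45 - 18.50) - 0}{\sqrt{\frac{16.4}{15} + \frac{18.2}{15}}} = \frac{3.95}{1.52} = 2.60$

Step 5: Make a decision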


Reject the null hypothesis, since 2.60 > 2.145

Step 6: Make a summary


There is sufficient evidence to conclude a difference in viewing times of the two
groups.
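
The test value in this example can be reproduced from the summary statistics alone. The following Python sketch assumes the form of the statistic in which each sample variance is divided by its own sample size (the unequal-variances form), with the hypothesized difference between the population means set to zero; the function name is made up for this illustration.

```python
import math


def two_sample_t(mean1, var1, n1, mean2, var2, n2):
    """t-value for two independent samples, given means, variances, and sizes."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error


# Children versus teens, from the table of sample statistics.
t_value = two_sample_t(22.45, 16.4, 15, 18.50, 18.2, 15)
degrees_of_freedom = min(15 - 1, 15 - 1)   # the smaller of n1 - 1 and n2 - 1
critical_value = 2.145                     # two-tailed, alpha = 0.05, d.f. = 14

print(round(t_value, 2), degrees_of_freedom, abs(t_value) > critical_value)
```

The computed t-value is about 2.60, which is greater than 2.145, so the sketch agrees with the decision to reject the null hypothesis.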
III. LEARNING ACTIVITIES & ASSESSMENT TASKS
Instruction: Answer the following in a separate sheet of paper. Copy the heading but no
need to copy the questions.

OUTPUT 11: HYPOTHESIS TESTING


Elementary Statistics and Probability
Name: Course, Year & Section:
Instructor: Monette C. Valencia Date Submitted:

A. Compare and Contrast

Create a Venn diagram comparing the null and alternative hypotheses. (3 pts.)

B. State the hypotheses

State only the null and the alternative hypotheses for each conjecture. (1 point
each)

a. An instructor feels that using a module will enhance the performance of his
students in clinical psychology. In the past, the average grade of the students was
75.

b. The school board claims that at least 60% of students bring a phone to school. A
teacher believes this number is too high and randomly samples 25 students to test
at a level of significance of 0.02.

c. A company has stated that their straw machine makes straws that are 4 mm in
diameter. A worker believes the machine no longer makes straws of this size and
samples 100 straws to perform a hypothesis test with 99% confidence.

C. Identification

Analyze the possibilities of Amari’s conclusion. Identify if it is a Type I Error,
Type II Error, or a Correct Decision.

If Amari finds out that her null hypothesis is …

1. True and she fails to reject it, then she commits a __________.
2. True and she rejects it, then she commits a __________.
3. False and she fails to reject it, then she commits a __________.
4. False and she rejects it, then she commits a __________.
D. Problem Solving

1. Critical Value

a. Solve and illustrate the critical value of Z. Find the critical value of “Z” for
two-tailed test for alpha of 39%. (5 points)

2. t-test Dependent

Salary Wizard is an online tool that allows you to look up incomes for
specific jobs for cities in the United States. We looked up the 25th percentile for
income for six jobs in two cities: Boise, Idaho, and Los Angeles, California. The
data are below:

1 2 3 4 5 6

Boise, Idaho 53047 49958 41974 44366 40470 36963

Los Angeles, California 62490 58850 49445 52263 47674 43542

a. Use the steps in solving t-test dependent. (15 points)

3. t-test Independent

A statistics teacher wants to compare his two classes to see if they perform any differently
on the tests he gave that semester. Class A had 25 students with an average score of 70,
standard deviation 15. Class B had 20 students with an average score of 74, standard
deviation 25. Using alpha 0.05, did these two classes perform differently on the tests?
(15 points)
CHAPTER 12: ANALYSIS OF VARIANCE (ANOVA)

I. OBJECTIVES

At the end of this chapter, students should be able to:

● identify the number of groups or classification to be tested


● determine if there is a significant difference among three or more means
using the ANOVA technique
● perform ANOVA technique using MS Excel

II. CONTENT

ANALYSIS OF VARIANCE (ANOVA)

When an F test is used to test a hypothesis concerning the means of three
or more populations, the technique is called analysis of variance, commonly
abbreviated as ANOVA.

Analysis of variance is a collection of statistical models used to analyze the
differences among group means and their associated procedures such as
"variation" among and between groups. It checks the impact of one or more factors
by comparing the means of different samples. In its simplest form, ANOVA
provides a statistical test of whether or not the means of several groups are equal
and therefore generalizes the t-test to more than two groups.

HYPOTHESIS TEST IN ANALYSIS OF VARIANCE

$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$

$H_1$: At least one mean is different from the others.

ANOVA Notation

$k$ = number of populations or treatments being compared

Population or treatment             $1$            $2$            $\cdots$   $k$
Population or treatment mean        $\mu_1$        $\mu_2$        $\cdots$   $\mu_k$
Population or treatment variance    $\sigma_1^2$   $\sigma_2^2$   $\cdots$   $\sigma_k^2$
Sample size                         $n_1$          $n_2$          $\cdots$   $n_k$
Sample mean                         $\bar{x}_1$    $\bar{x}_2$    $\cdots$   $\bar{x}_k$
Sample variance                     $s_1^2$        $s_2^2$        $\cdots$   $s_k^2$

$N = n_1 + n_2 + \cdots + n_k$ (sum of the sample sizes of the groups)

$T$ = grand total = sum of all $N$ observations = $n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k$

$\bar{x}_{GM}$ = grand mean

Degrees of freedom are:


$d.f.N. = k - 1$ (numerator)        $d.f.D. = N - k$ (denominator)

BETWEEN-GROUP VARIATION

It refers to variations between the distributions of individual groups (or
levels) as the values within each group differ. Each sample is examined, and the
difference between its mean and the grand mean is calculated to measure the
variability. If the distributions overlap or are close, the grand mean will be similar
to the individual means, whereas if the distributions are far apart, the differences
between the group means and the grand mean will be large.

The formula for between-group variation is:

$s_B^2 = \frac{\sum n_i (\bar{x}_i - \bar{x}_{GM})^2}{k - 1}$

The numerator of the fraction obtained in the computational procedure is
called the sum of squares between groups, denoted by $SS_B$. Then $SS_B$ is
divided by $d.f.N.$ to obtain the between-group variance. This variance is
sometimes called a mean square, denoted by $MS_B$.

WITHIN-GROUP VARIATION

It refers to variations caused by differences within individual groups (or
levels), as not all the values within each group are the same. Each sample is
looked at on its own, and variability between the individual points in the sample is
calculated. In other words, no interactions between samples are considered.

The formula for within-group variation is:

$s_W^2 = \frac{\sum (n_i - 1) s_i^2}{\sum (n_i - 1)}$

The numerator of the fraction obtained in the computational procedure is
called the sum of squares within groups, denoted by $SS_W$. This statistic is also
called the sum of squares for the error. Then $SS_W$ is divided by $d.f.D.$ to obtain the
within-group variance. This variance is sometimes called a mean square, denoted
by $MS_W$.

These terms are used to summarize the analysis of variance and are placed
in a summary table, as shown in Table 1.
Table 1. Analysis of Variance Summary Table

Source            Sum of squares   d.f.      Mean square   F

Between           $SS_B$           $k - 1$   $MS_B$        $F$

Within (error)    $SS_W$           $N - k$   $MS_W$

Total

$SS_B$ = sum of squares between groups        $MS_B = \frac{SS_B}{k - 1}$

$SS_W$ = sum of squares within groups         $MS_W = \frac{SS_W}{N - k}$

$F = \frac{MS_B}{MS_W}$
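
The entries of the summary table can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `anova_from_summary` and the three-group summary statistics are hypothetical and do not come from the module. It assumes the sample size, mean, and variance of each group are already known.

```python
def anova_from_summary(sizes, means, variances):
    """Build the one-way ANOVA summary quantities from per-group statistics."""
    k = len(sizes)                                   # number of groups
    N = sum(sizes)                                   # total number of observations
    T = sum(n * m for n, m in zip(sizes, means))     # grand total
    grand_mean = T / N

    # Sum of squares between groups: weighted squared distance to the grand mean.
    ss_b = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Sum of squares within groups: pooled (n_i - 1) * s_i^2.
    ss_w = sum((n - 1) * v for n, v in zip(sizes, variances))

    ms_b = ss_b / (k - 1)      # between-group variance, d.f.N. = k - 1
    ms_w = ss_w / (N - k)      # within-group variance, d.f.D. = N - k
    return {
        "SS_B": ss_b, "SS_W": ss_w,
        "dfN": k - 1, "dfD": N - k,
        "MS_B": ms_b, "MS_W": ms_w,
        "F": ms_b / ms_w,
    }


# Hypothetical summary statistics for three groups (illustration only).
table = anova_from_summary(sizes=[5, 5, 5],
                           means=[10.0, 12.0, 14.0],
                           variances=[4.0, 4.0, 4.0])
for name, value in table.items():
    print(name, value)
```

Working from summary statistics mirrors the hand computation: the same three lists are all that the formulas for $MS_B$ and $MS_W$ require.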

TWO MAIN TYPES OF ANALYSIS OF VARIANCE

● One-Way ANOVA

It has one independent variable (factor) with two or more levels, usually
three or more, and is used when you want to test whether the group means
differ from one another.

● Two-Way ANOVA

It has two independent variables which can have multiple levels and
is used when you want to know how two independent variables, in
combination, affect a dependent variable.

PERFORMING ANALYSIS OF VARIANCE (A Step-By-Step Procedure)

Step 1. State the hypothesis and identify the claim.

Step 2. Find the critical value.

Step 3. Compute the test value.

a. Find the mean and variance of each sample.

$(\bar{x}_1, s_1^2), (\bar{x}_2, s_2^2), \cdots, (\bar{x}_k, s_k^2)$

b. Find the grand mean.

$\bar{x}_{GM} = \frac{T}{N}$

c. Find the between-group variance.

$MS_B = \frac{SS_B}{k - 1} = \frac{\sum n_i (\bar{x}_i - \bar{x}_{GM})^2}{k - 1}$

d. Find the within-group variance.

$MS_W = \frac{SS_W}{N - k} = \frac{\sum (n_i - 1) s_i^2}{\sum (n_i - 1)}$

e. Find the F test value.

$F = \frac{MS_B}{MS_W}$

Step 4. Make the decision.

Step 5. Summarize the results.

POST HOC TEST

“Post hoc” (Latin, meaning “after this”) tests are analyses carried out on
the results of your experimental data after the main test has been run. They are
often based on a familywise error rate: the probability of at least one Type I
error in a set (family) of comparisons.

A post hoc test is conducted only when the p-value for the ANOVA is
statistically significant. If the p-value is not statistically significant, there is
no evidence that the group means differ from each other, so there is no need to
conduct a post hoc test to find out which groups are different.

One of the most common post hoc tests is the Bonferroni procedure.

BONFERRONI PROCEDURE

The Bonferroni procedure is a statistical comparison test that involves
checking multiple tests while limiting the chance of failure (a false positive).
It is otherwise known as the Bonferroni correction or Bonferroni adjustment.

It is a test where the mean of a numerical outcome can be compared
between two or more independent groups. This test can be helpful when there are
more than two groups. To carry out a Bonferroni correction on sample data, set
the significance cut-off for each test at $\alpha / n$, where $n$ is the number of
tests. For example, if you are running 20 simultaneous tests at $\alpha = 0.05$, the
corrected cut-off would be $0.05 / 20 = 0.0025$.
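
As a small illustration of the arithmetic, the corrected cut-off can be computed as below. The function names and the list of p-values are hypothetical and serve only to show how the cut-off is applied.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cut-off under the Bonferroni correction."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that falls below the corrected cut-off."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p < cutoff for p in p_values]


# The example from the text: 20 simultaneous tests at alpha = 0.05.
cutoff_20 = bonferroni_cutoff(0.05, 20)
print(round(cutoff_20, 4))

# Hypothetical p-values from three comparisons (cut-off is 0.05 / 3).
flags = significant_after_bonferroni([0.001, 0.02, 0.04])
print(flags)
```

Note that 0.02 and 0.04 would count as significant at an uncorrected 0.05 level but not after the correction.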

This test is used to limit the possibility of getting a statistically significant
result by chance when testing multiple hypotheses. It’s needed because the more
tests you run, the more likely you are to get at least one significant result even
when no real effect exists. The correction shrinks the area where you can reject
the null hypothesis. In other words, it makes the cut-off that each p-value must
fall below smaller; the p-values themselves do not change.

The specifics of the study will determine whether or not to apply the
Bonferroni correction. We can use the method in certain circumstances, such as
if:

a. A test of the “universal null hypothesis” (denoted as $H_0$), stating
that all tests examined are not significant, is necessary,

b. Avoiding a “type I” error is critical, and

c. There are no predetermined hypotheses while performing most tests.

In conclusion, testing several hypotheses on a single data set increases the
chance of drawing false-positive findings. The Bonferroni correction is a
straightforward statistical technique for reducing this risk, and when done
properly, it helps protect the objectivity of research that employs several
significance tests.
ONE-WAY ANALYSIS OF VARIANCE

The one-way analysis of variance (ANOVA) is used to determine whether
there are any statistically significant differences between the means of three or
more independent groups.

Formula:

$F = \frac{s_B^2}{s_W^2}$

Where:

$s_B^2$ = the mean square between the groups.

$s_W^2$ = the mean square within the groups.

Assumptions for the F Test for Comparing Three or More Means

1. Populations from which the samples were obtained must be normally or
approximately normally distributed.
2. Samples must be independent of one another.
3. Variances of the populations must be equal.

Even though you are comparing three or more means in this use of the F
test, variances are used in the test instead of means.

With the F test, two different estimates of the population variance are
made, namely, between-group variance, and within-group variance.

Between-Group Variance

➢ It is the first estimate that involves finding the variance of the means.

Within-Group Variance
➢ It is the second estimate made by computing the variance using all the
data and is not affected by differences in the means.

Example 1: A researcher wishes to know if music in a class enhances
concentration and helps students absorb more information. The researcher took
three groups of 10 randomly selected students (all of the same age) from three
classrooms. Each classroom has a different environment for students to study.
Classroom A had constant music playing in the background, Classroom B had
variable music, and Classroom C was a regular class with no music playing. After
a month, the researcher conducted a test for all three groups and collected their
test scores. At 0.05 alpha level, test the claim that there is a significant difference
among the means.

Solution:

Step 1: State the hypothesis and identify the claim.

H0: μ1 = μ2 = μ3

H1: At least one mean is different from the others. (claim)

Step 2: Find the critical value. Since k = 3 and N = 30,

d.f.N. = k – 1 = 3 – 1 = 2

d.f.D. = N – k = 30 – 3 = 27

The critical value is 3.35 with α = 0.05.
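
The critical value is normally read from an F table. As a cross-check, there is a closed form that applies only to the special case d.f.N. = 2: the upper-tail probability of F is $(1 + 2F / d.f.D.)^{-d.f.D./2}$. The sketch below uses that special case; the function names are hypothetical, and for any other numerator degrees of freedom a table or statistical software is still needed.

```python
def f_upper_tail_dfn2(f_value, dfd):
    """P(F > f_value) for an F distribution with d.f.N. = 2 (special case only)."""
    return (1 + 2 * f_value / dfd) ** (-dfd / 2)


def f_critical_dfn2(alpha, dfd):
    """Critical value with d.f.N. = 2, found by inverting the tail formula above."""
    return (dfd / 2) * (alpha ** (-2 / dfd) - 1)


crit = f_critical_dfn2(alpha=0.05, dfd=27)
print(round(crit, 2))                           # agrees with the table value 3.35
print(round(f_upper_tail_dfn2(crit, 27), 4))    # tail area at the critical value
```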

Step 3: Compute the test value.

a. Find the mean and variance of each sample.

Class A    Class B    Class C

7          4          6
9          3          1
5          6          3
8          2          5
6          7          3
8          5          4
6          5          6
10         4          5
7          1          7
4          3          3

$\bar{x}_1 = 7$      $\bar{x}_2 = 4$      $\bar{x}_3 = 4.3$

$s_1^2 = 3.33$       $s_2^2 = 3.33$       $s_3^2 = 3.34$

b. Find the grand mean.

$\bar{x}_{GM} = \frac{T}{N}$

$\bar{x}_{GM} = \frac{70 + 40 + 43}{30}$

$\bar{x}_{GM} = \frac{153}{30}$

$\bar{x}_{GM} = 5.1$

c. Find the between-group variance.

$MS_B = \frac{SS_B}{k - 1} = \frac{\sum n_i (\bar{x}_i - \bar{x}_{GM})^2}{k - 1}$

$MS_B = \frac{10(7 - 5.1)^2 + 10(4 - 5.1)^2 + 10(4.3 - 5.1)^2}{3 - 1}$

$MS_B = \frac{10(1.9)^2 + 10(-1.1)^2 + 10(-0.8)^2}{2}$

$MS_B = \frac{10(3.61) + 10(1.21) + 10(0.64)}{2}$

$MS_B = \frac{36.1 + 12.1 + 6.4}{2}$

$MS_B = \frac{54.6}{2}$

$MS_B = 27.3$

d. Find the within-group variance.

$MS_W = \frac{SS_W}{N - k} = \frac{\sum (n_i - 1) s_i^2}{\sum (n_i - 1)}$

$MS_W = \frac{(10 - 1)(3.33) + (10 - 1)(3.33) + (10 - 1)(3.34)}{(10 - 1) + (10 - 1) + (10 - 1)}$

$MS_W = \frac{29.97 + 29.97 + 30.06}{9 + 9 + 9}$

$MS_W = \frac{90}{27}$

$MS_W = 3.33$

e. Find the F test value.

$F = \frac{s_B^2}{s_W^2} = \frac{MS_B}{MS_W}$

$F = \frac{27.3}{3.33}$

$F = 8.198$

Step 4. Make the decision.

Since the F test value is greater than the critical value (8.198 > 3.35),
reject the null hypothesis.

Step 5. Summarize the results.

There is enough evidence to reject the null hypothesis and support the claim
that at least one of the three group means is significantly different from the
others, which suggests that at least one sample comes from a different population.
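
The five steps of Example 1 can be re-traced with a short Python sketch using only the standard library. The variable names are chosen for illustration, and the critical value 3.35 is taken from Step 2 rather than computed here.

```python
from statistics import mean, variance

# Test scores from Example 1.
class_a = [7, 9, 5, 8, 6, 8, 6, 10, 7, 4]
class_b = [4, 3, 6, 2, 7, 5, 5, 4, 1, 3]
class_c = [6, 1, 3, 5, 3, 4, 6, 5, 7, 3]
groups = [class_a, class_b, class_c]

k = len(groups)                      # number of groups
N = sum(len(g) for g in groups)      # total number of observations

# Step 3a: mean and sample variance (n - 1 divisor) of each group.
means = [mean(g) for g in groups]
variances = [variance(g) for g in groups]

# Step 3b: grand mean.
grand_mean = sum(sum(g) for g in groups) / N

# Step 3c and 3d: between-group and within-group variances.
ms_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)) / (k - 1)
ms_w = sum((len(g) - 1) * v for g, v in zip(groups, variances)) / (N - k)

# Step 3e: F test value.
f_value = ms_b / ms_w

# Step 4: compare with the critical value from Step 2.
critical_value = 3.35
reject_h0 = f_value > critical_value

print(round(ms_b, 2), round(ms_w, 2), round(f_value, 2), reject_h0)
```

Because this sketch keeps the unrounded variances, it gives an F value of about 8.18, slightly below the hand-computed 8.198, which used variances rounded to two decimals. The decision is the same.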
Limitations of One-Way ANOVA

A one-way ANOVA tells us that at least two groups are different from each
other, but it won’t tell us which groups are different. If our test returns a
significant F-statistic, we may need to run a post hoc test to tell us exactly
which groups differ in means.

Step-by-Step to Perform One-Way ANOVA With Post-hoc Test in Excel 2013

Step 1. Input your data into columns or rows in Excel. For example, if three groups
of students for music treatment are being tested, spread the data into three
columns.

Step 2. Click the “Data” tab and then click “Data Analysis.” If you don’t see Data
Analysis, load the ‘Data Analysis Toolpak’ add-in.

Step 3. Click “ANOVA Single Factor” and then click “OK.”

Step 4. Type an input range into the Input Range box. For example, if the data is
in cells A1 to C10, type “A1:C10” into the box. Check the “Labels in the first row” if
we have column headers, and select the Rows radio button if the data is in rows.

Step 5. Select an output range. For example, click the “New Worksheet” radio
button.

Step 6. Choose an alpha level. For most hypothesis tests, 0.05 is standard.

Step 7. Click “OK.” The results from ANOVA will appear in the worksheet.

The results for our example look like this:


Here, we can see that the F-value is greater than the F-critical value for the
alpha level selected (0.05). Therefore, we have evidence to reject the null
hypothesis and say that at least one of the three group means is significantly
different from the others.

Another measure for ANOVA is the p-value. If the p-value is less than
the alpha level selected (which it is, in our case), we reject the Null Hypothesis.

Now to check which samples had different means, we will take the
Bonferroni approach and perform the post hoc test in Excel through the following
steps:

Step 8. Again, click on “Data Analysis” in the “Data” tab and select “t-Test: Two-
Sample Assuming Equal Variances,” and click “OK.”

Step 9. Input the range of the Class A column in the Variable 1 Range box and the
range of the Class B column in the Variable 2 Range box. Check the “Labels” if
you have column headers in the first row.

Step 10. Select an output range. For example, click the “New Worksheet” radio
button.

Step 11. Perform the same steps (step 8 to step 10) for Columns of Class B –
Class C and Class A – Class C.

The results will look like this:


Here, we can see that the p-values of (A vs B) and (A vs C) are less than the
alpha level selected (alpha = 0.05). Because three comparisons are made, the
Bonferroni approach compares each p-value with the corrected cut-off
0.05 / 3 ≈ 0.0167, and both p-values are below it as well. This means that the
differences between groups A and B and between groups A and C are statistically
significant. For (B vs C), the p-value is much greater than the significance
level, so there is no evidence that B and C differ. So, group A (the constant
music group) stands apart from the other two groups, and we can say that the
constant music had a significant effect on students’ performance.
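
The pairwise comparisons can also be sketched outside Excel. The code below is an illustration under stated assumptions: it uses the Example 1 scores, the pooled-variance (equal variances) t statistic, and a two-tailed p-value obtained by numerically integrating the t density with Simpson's rule instead of Excel's built-in routine, so the last decimals may differ from the Excel output. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def pooled_t(x, y):
    """t statistic and degrees of freedom assuming equal variances."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


classes = {
    "A": [7, 9, 5, 8, 6, 8, 6, 10, 7, 4],
    "B": [4, 3, 6, 2, 7, 5, 5, 4, 1, 3],
    "C": [6, 1, 3, 5, 3, 4, 6, 5, 7, 3],
}

alpha = 0.05
pairs = list(combinations(classes, 2))     # (A, B), (A, C), (B, C)
cutoff = alpha / len(pairs)                # Bonferroni-corrected cut-off

results = {}
for a, b in pairs:
    t, df = pooled_t(classes[a], classes[b])
    p = two_tailed_p(t, df)
    results[(a, b)] = (t, p, p < cutoff)
    print(a, "vs", b, "t =", round(t, 3), "p =", round(p, 4), "significant:", p < cutoff)
```

The three decisions agree with the discussion above: A differs from both B and C, while B and C do not differ significantly.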

TWO-WAY ANALYSIS OF VARIANCE

ANOVA stands for analysis of variance and tests for differences in the
effects of independent variables on a dependent variable. A two-way ANOVA test
is a statistical test used to determine the effect of two nominal predictor variables
on a continuous outcome variable.

A two-way ANOVA is an extension of the one-way ANOVA (analysis of
variance) that examines the effects of two independent variables on a dependent
variable.

ONE-WAY OR TWO-WAY ANALYSIS OF VARIANCE

A one-way ANOVA evaluates the impact of a single factor on a single response
variable. It determines whether the observed differences between the means of
independent (unrelated) groups are explainable by chance alone, or whether there
are any statistically significant differences between groups.

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA,
you have one independent variable affecting a dependent variable. With a two-way
ANOVA, there are two independent variables. For example, a two-way ANOVA allows a
company to compare worker productivity based on two independent variables,
such as department and gender. It is used to observe the interaction between
the two factors, and it tests the effects of the two factors at the same time.

The two-way ANOVA summary table is set up as shown in the table below:
Correction term ($C_x$)

$C_x = \frac{(\sum x)^2}{\text{No. of observations}}$

Sum of Squares (SS)

For Factor A

$SS_A = \frac{\sum (\sum X_A)^2}{\text{No. of observations per level}} - C_x$

For Factor B

$SS_B = \frac{\sum (\sum X_B)^2}{\text{No. of observations per level}} - C_x$

For Within (Error)

$SS_W = \sum (x - \bar{x})^2$, where $\bar{x}$ is the mean of the group (cell) that contains $x$

For $SS_{Total}$

$SS_{Total} = \sum X^2 - C_x$

For Interaction

$SS_{A \times B} = SS_{Total} - SS_A - SS_B - SS_W$

Assumptions of Two-Way ANOVA

The assumptions for the two-way analysis of variance are basically the
same as those for the one-way ANOVA, with an added requirement on sample size.

1. The populations from which the samples were obtained must be
normally or approximately normally distributed.
2. The samples must be independent.
3. The variances of the populations from which the samples were selected
must be equal.
4. The groups must be equal in sample size.

EXAMPLE:

A researcher wishes to see whether the type of gasoline used and the type of
automobile driven have any effect on gasoline consumption. Two types of gasoline,
regular and high-octane, will be used, and two types of automobiles, two-wheel-
and four-wheel-drive, will be used in each group. There will be two automobiles in
each group, for a total of eight automobiles used. Using a two-way analysis of
variance, the researcher will perform the following steps.

Solution:

Step 1: State the hypotheses. The hypotheses for the interaction are these:
H0: There is no interaction effect between type of gasoline used and type
of automobile a person drives on gasoline consumption.
H1: There is an interaction effect between type of gasoline used and type
of automobile a person drives on gasoline consumption.

The hypotheses for the gasoline types are:

H0: There is no difference between the means of gasoline consumption
for two types of gasoline.
H1: There is a difference between the means of gasoline consumption for
two types of gasoline.

The hypotheses for the types of automobiles driven are:

H0: There is no difference between the means of gasoline consumption
for two-wheel-drive and four-wheel-drive automobiles.
H1: There is a difference between the means of gasoline consumption for
two-wheel-drive and four-wheel-drive automobiles.

Step 2: Find the critical values for each F test.


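In this case, each independent variable, or factor, has two levels. Hence, a
2×2 ANOVA table is used. Factor A is designated as the gasoline type. It has two
levels, regular and high-octane; therefore, a = 2. Factor B is designated as the
automobile type. It also has two levels; therefore, b = 2. With n = 2 automobiles
in each group, the degrees of freedom for each factor are as follows:

Factor A: d.f.N. = a – 1 = 2 – 1 = 1
Factor B: d.f.N. = b – 1 = 2 – 1 = 1
Interaction (A × B): d.f.N. = (a – 1)(b – 1) = (2 – 1)(2 – 1) = 1
Within (error): d.f.D. = ab(n – 1) = 2 · 2 · (2 – 1) = 4

Each F test therefore uses d.f.N. = 1 and d.f.D. = 4, for which the critical
value at α = 0.05 is 7.71.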

Step 3: Complete the ANOVA summary table to get the test values.

The sums of squares are computed first.

Correction term ($C_x$)

$C_x = \frac{(\sum x)^2}{\text{No. of observations}}$

$= \frac{(26.7 + 25.2 + 28.6 + 29.3 + 32.3 + 32.8 + 26.1 + 24.2)^2}{8} = \frac{(225.2)^2}{8}$

$= 6339.38$

SUM OF SQUARES (SS)

For Factor A

$SS_A = \frac{\sum (\sum X_A)^2}{\text{No. of observations per level}} - C_x$

$= \frac{109.8^2 + 115.4^2}{4} - 6339.38 = 6343.30 - 6339.38 = 3.92$

For Factor B

$SS_B = \frac{\sum (\sum X_B)^2}{\text{No. of observations per level}} - C_x$

$= \frac{117^2 + 108.2^2}{4} - 6339.38 = 6349.06 - 6339.38 = 9.68$

For Within (Error)

$SS_W = \sum (x - \bar{x})^2$

$= 1.125 + 0.245 + 0.125 + 1.805 = 3.300$

For $SS_{Total}$

$SS_{Total} = \sum X^2 - C_x$

$= (26.7^2 + 25.2^2 + 28.6^2 + 29.3^2 + 32.3^2 + 32.8^2 + 26.1^2 + 24.2^2) - 6339.38$

$= 6410.36 - 6339.38 = 70.98$

For Interaction

$SS_{A \times B} = SS_{Total} - SS_A - SS_B - SS_W$

$= 70.98 - 3.92 - 9.68 - 3.300 = 54.080$

MEAN SQUARE (MS)

$MS_A = \frac{SS_A}{a - 1} = \frac{3.920}{2 - 1} = 3.920$

$MS_B = \frac{SS_B}{b - 1} = \frac{9.680}{2 - 1} = 9.680$

$MS_{A \times B} = \frac{SS_{A \times B}}{(a - 1)(b - 1)} = \frac{54.080}{(2 - 1)(2 - 1)} = 54.080$

$MS_W = \frac{SS_W}{ab(n - 1)} = \frac{3.300}{4} = 0.825$

F VALUE

$F_A = \frac{MS_A}{MS_W} = \frac{3.920}{0.825} = 4.752$

$F_B = \frac{MS_B}{MS_W} = \frac{9.680}{0.825} = 11.733$

$F_{A \times B} = \frac{MS_{A \times B}}{MS_W} = \frac{54.080}{0.825} = 65.552$

Step 4: Make the decision.


Since $F_B = 11.733$ and $F_{A \times B} = 65.552$ are greater than the critical value
7.71, the null hypotheses concerning the type of automobile driven and the
interaction effect should be rejected. Since $F_A = 4.752$ is less than 7.71, the
null hypothesis concerning the type of gasoline is not rejected. Since the
interaction effect is statistically significant, no decision should be made about
the automobile type without further investigation.

Step 5: Summarize the results.


Since the null hypothesis for the interaction effect was rejected, it can be
concluded that the combination of the type of gasoline and type of automobile
does affect gasoline consumption.
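
The computations in Step 3 can be cross-checked with a short Python sketch. The grouping of the eight observations into four cells follows the totals and within-group terms used in the solution; the level labels attached to each cell ("regular", "high-octane", "two-wheel", "four-wheel") are an assumption made for illustration, since the original data table is not reproduced here. The helper `factor_ss` is hypothetical.

```python
from statistics import mean

# Observations per cell, keyed by (Factor A level, Factor B level).
# The labels are assumed; only the grouping is confirmed by the totals in the text.
cells = {
    ("regular", "two-wheel"): [26.7, 25.2],
    ("regular", "four-wheel"): [28.6, 29.3],
    ("high-octane", "two-wheel"): [32.3, 32.8],
    ("high-octane", "four-wheel"): [26.1, 24.2],
}

all_obs = [x for obs in cells.values() for x in obs]
correction = sum(all_obs) ** 2 / len(all_obs)          # correction term C_x


def factor_ss(index):
    """Sum of squares for one factor (index 0 = Factor A, index 1 = Factor B)."""
    totals, counts = {}, {}
    for key, obs in cells.items():
        level = key[index]
        totals[level] = totals.get(level, 0.0) + sum(obs)
        counts[level] = counts.get(level, 0) + len(obs)
    return sum(totals[lv] ** 2 / counts[lv] for lv in totals) - correction


ss_a = factor_ss(0)
ss_b = factor_ss(1)
ss_w = sum((x - mean(obs)) ** 2 for obs in cells.values() for x in obs)
ss_total = sum(x * x for x in all_obs) - correction
ss_ab = ss_total - ss_a - ss_b - ss_w

a, b, n = 2, 2, 2                                      # levels of A, levels of B, cell size
ms_w = ss_w / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / ms_w
f_b = (ss_b / (b - 1)) / ms_w
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_w

critical_value = 7.71                                  # from Step 4 of the example
for name, f in (("A", f_a), ("B", f_b), ("A x B", f_ab)):
    print(name, round(f, 3), "reject H0:", f > critical_value)
```

The sketch reproduces the decisions in Step 4: the F values for Factor B and for the interaction exceed 7.71, while the F value for Factor A does not.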
STEPS TO PERFORM TWO-WAY ANOVA IN EXCEL 2013

Step 1: Click the “Data” tab and then click “Data Analysis.” If you don’t see the
Data analysis option, install the Data Analysis Toolpak.

Step 2: Click “ANOVA two factor with replication” and then click “OK.” The two-
way ANOVA window will open.

Step 3: Type an Input Range into the Input Range box. For example, if your data
is in cells A1 to A25, type “A1:A25” into the Input Range box. Ensure you include
all of your data, including headers and group names.

Step 4: Type a number in the “Rows per sample” box. Rows per sample is actually
a bit misleading. What this is asking you is how many individuals are in each group.
For example, if you have 5 individuals in each age group, you would type “5” into
the Rows per Sample box.

Step 5: Select an Output Range. For example, click the “new worksheet” radio
button to display the data in a new worksheet.

Step 6: Select an alpha level. In most cases, an alpha level of 0.05 (5 percent)
works for most tests.

Step 7: Click “OK” to run the two-way ANOVA. The data will be returned in your
specified output range.

Step 8: Read the results. To figure out if you are going to reject the null hypothesis
or not, you’ll basically be looking at two factors:

● If the F-value (F)is larger than the f critical value (F crit)


● If the p-value is smaller than your chosen alpha level.

Note: Each factor does not have to be limited to two levels to run a two-way
ANOVA in Excel 2013. We can use the same function when a factor has three, four,
five, or more levels.

❖ When there are two independent variables, the analysis of variance is called
a two-way ANOVA.
❖ The two-way ANOVA enables the researcher to test the effects of two
independent variables and a possible interaction effect on one dependent
variable.
The results for the two-way ANOVA test on our example look like this:

As you can see in the highlighted cells in the image above, the F-value for
sample and column, i.e., factor 1 (music) and factor 2 (age), respectively, are
higher than their F-critical values. This means that the factors significantly affect
the students’ results, and thus we can reject the null hypothesis for the factors.

Also, the F-value for the interaction effect is well below its F-critical value,
so we have no evidence that music and age had any combined (interaction) effect
on the students’ results.

SUMMARY

❖ ANOVA is a statistical technique that uses variances to compare the
means (or averages) of different groups.
❖ There are two commonly used types of ANOVA: one-way ANOVA and two-
way ANOVA.
❖ sample means after the ANOVA technique has been done.
❖ The ANOVA technique uses two estimates of the population variance.
❖ When there is one independent variable, the analysis of variance is called
a one-way ANOVA.
III. LEARNING ACTIVITIES AND ASSESSMENT TASKS

OUTPUT 12: ANOVA

Name: Course, Year & Section: BSED-Math 2A

Instructor: Monette C. Valencia Score:

Copy and answer the following. (short-size bond paper)

A. IDENTIFICATION

1. A collection of statistical models used to analyze the differences
among group means and their associated procedures, such as "variation" among
and between groups.

2. To analyze the results of your experimental data.

3. Used to determine whether there are any statistically significant
differences between the means of three or more independent groups.

4. A statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable.

5. A statistical comparison test that involves checking multiple tests
while limiting the chance of failure.

6. What is the F-test formula for comparing three or more means?

7. It refers to variations between the distributions of individual groups
(or levels) as the values within each group differ.

8. It is the first estimate that involves finding the variance of the means.

9. It refers to variations caused by differences within individual groups
(or levels), as not all the values within each group are the same.

10. It is the second estimate made by computing the variance using
all the data and is not affected by differences in the means.

B. SOLVE THE FOLLOWING AND PERFORM THE FOLLOWING STEPS.


a. State the hypotheses and identify the claim.
b. Find the critical value
c. Compute the test value.
d. Compute the ANOVA summary table to get the test values. (For two-way
ANOVA)
e. Make the decision.
f. Summarize the results.

1. A researcher wishes to try three different techniques to lower the blood
pressure of individuals diagnosed with high blood pressure. The subjects are
randomly assigned to three groups; the first group takes medication, the second
group exercises, and the third group follows a special diet. After four weeks, the
reduction in each person’s blood pressure is recorded. At α = 0.05, test the claim
that there is no difference among the means. The data are shown below.

Medication Exercise Diet

10 6 5

12 8 9

9 3 12

15 0 8

13 2 4

2. A reputed marketing agency in India has three different training programs for its
salesmen. The three programs are Method – A, B, and C. To assess the success
of the programs, 4 salesmen from each of the programs were sent to the field.
Their performances in terms of sales are given in the following table. Test whether
there is a significant difference among methods and among salesmen.

METHODS
SALESMAN
A B C

1 4 6 2

2 6 10 6

3 5 7 4

4 7 5 4
BICOL STATE COLLEGE OF APPLIED SCIENCES AND TECHNOLOGY
Penafrancia Ave., Penafrancia, Naga City
Academic School Year 2022 - 2023

III. LEARNING ACTIVITIES & ASSESSMENT TASKS


Instruction: Answer the following in a separate sheet of paper.

OUTPUT 13: CHI-SQUARE DISTRIBUTION


Elementary Statistics and Probability

Name: _______________________________ Course, Year & Section: _________________


Instructor: Ms. Monette C. Valencia Date Submitted: _______________________

A. True or False (1 point each)


Identify whether the following situation is true or false.
1. Chi-square is used to describe useful things or real world distribution.
2. Chi-square compares the observed frequency and expected frequency.
3. Contingency tables are always 2 x 2.
4. The Chi-Square Test for Independence tests the significant difference of two proportions.
5. The Chi-Square Test for Independence uses a two-way classification contingency table.

B. Problem Solving (10 points each)

1. In a recent survey conducted by Company A to determine the effectiveness of its hair
shampoo products, 5 groups of female respondents were given questionnaires and their
answers are as follows:

Group    Strongly Approve    Approve    Disapprove    Strongly Disapprove    Total

1        12                  5          12            11                     40

2        5                   14         14            11                     44

3        5                   6          12            11                     34

4        10                  8          13            13                     44

5        15                  20         5             5                      45

Total    47                  53         56            51                     207

Test the significance of the difference between the observed frequencies and the expected
frequencies at 1% level of significance.

2. As a researcher, you want to examine if the proportion of students who drive their own
cars is the same as the proportion of students who drive their parents’ cars at the two schools:
BISCAST and CBSUA. Use 0.05 as the level of significance.