Basic Statistical Analysis
ARJAY T. ALTOVAR
CAS-MNSD INSTRUCTOR
While descriptive statistics describes what is going on in our sample data set, inferential statistics
allows us to predict trends in a larger population based on a study of samples taken from
it. Some inferences examine the relationships among variables within a sample
and then generalize or predict how those variables will relate in the larger
population.
The prerequisite for following the discussion on tests of hypothesis is a clear grasp of the basic
concepts of inferential statistics. Recall the following concepts used in hypothesis testing. Discuss
them with your classmates.
Hypothesis Test
As previously mentioned, this is a statistical activity that uses data to decide between two or more
different possibilities to resolve an issue. Formally, this is a systematic procedure for testing a
claim about a property of a population of interest.
This topic presents individual components of a hypothesis test. We should know and understand
the following:
✓ How to identify the null hypothesis and alternative hypothesis from a given claim, and
how to express both in symbolic form.
✓ How to calculate the value of the test statistic, given a claim and sample data.
✓ How to identify the critical value(s), given a significance level.
✓ How to identify the P-value, given a value of the test statistic.
✓ How to state the conclusion about a claim in simple and nontechnical terms.
Example:
With individual lines at its various windows, a bank finds that the standard deviation of its
normally distributed waiting times on Friday afternoons is 6.2 mins. The bank experiments with
a single main waiting line and finds that for a random sample of 25 customers, the waiting times
have a standard deviation of 3.8 mins. We want to test the bank’s claim that a single line causes
lower variation among the waiting times for customers.
✓ Null Hypothesis, H0: A single line does not cause lower variation among the waiting times
for customers.
In symbols: 𝜎 = 6.2
✓ Alternative Hypothesis, Ha: A single line causes lower variation among the waiting times
for customers.
In symbols: 𝜎 < 6.2
Once the hypotheses are properly stated and sample data are obtained, an appropriate test
statistic is chosen.
A test statistic is a value computed from the sample data by a statistical test procedure; the
decision to reject the null hypothesis depends on its value.
Whether the appropriate test is one-tailed or two-tailed depends on the hypotheses stated.
❖ One-tailed test – if the claim or the alternative hypothesis indicates a direction (greater
than, > or less than, <). From the previous example, the alternative is given by: 𝜎 < 6.2
which indicates a single line causing lower variation among waiting times, hence one-
tailed test is the appropriate test for hypothesis testing.
❖ Two-tailed test - if the claim or the alternative hypothesis does not specify a direction (not
equal to, ≠).
The type of test determines the appropriate critical region and, together with the significance
level, the correct decision. The critical region (or rejection region) is the set of all values of the
test statistic that cause us to reject the null hypothesis. The critical value for each test statistic can
be determined manually using statistical tables or with an Excel function. The significance level
(denoted by 𝛼) is the probability that the test statistic will fall in the critical region when the null
hypothesis is true. It is usually set by the researcher; common choices for 𝛼 are 0.05, 0.01, and 0.10.
We always test the null hypothesis. The initial conclusion will always be one of the following:
✓ Reject the null hypothesis.
✓ Fail to reject the null hypothesis.
To decide whether to reject or fail to reject the null hypothesis, we state our decision rule:
❖ Using Critical Region – check if the test statistic falls in the region of rejection, if yes, reject
the null.
❖ Using p-value/significance level – determine the corresponding p-value of the test
statistic, then reject the null hypothesis if the p-value is less than or equal to the
significance level, 𝛼.
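The two decision rules above can be sketched in a few lines of Python. This is only an illustration: the function names `decide_by_p_value` and `decide_by_critical_value` and the `tail` argument are hypothetical, and the numbers passed in at the end are illustrative inputs, not results of a new analysis.

```python
def decide_by_p_value(p_value: float, alpha: float = 0.05) -> str:
    """Decision rule using the p-value: reject H0 when p-value <= alpha."""
    if not (0.0 <= p_value <= 1.0) or not (0.0 < alpha < 1.0):
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


def decide_by_critical_value(t_c: float, t_crit: float, tail: str) -> str:
    """Decision rule using the critical region.

    t_crit is the positive critical value read from a table
    (t_alpha for a one-tailed test, t_alpha/2 for a two-tailed test).
    tail is "left", "right", or "two" (hypothetical labels).
    """
    if t_crit <= 0:
        raise ValueError("t_crit must be the positive table value")
    if tail == "left":
        in_region = t_c <= -t_crit
    elif tail == "right":
        in_region = t_c >= t_crit
    elif tail == "two":
        in_region = abs(t_c) >= t_crit
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return "reject H0" if in_region else "fail to reject H0"


# Illustrative inputs only.
print(decide_by_p_value(0.03, alpha=0.05))
print(decide_by_p_value(0.067, alpha=0.05))
print(decide_by_critical_value(-2.042, 2.365, tail="two"))
```

Either function returns only the initial conclusion; the final step is still to restate that conclusion in plain words that address the original claim.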
It is important never to end a hypothesis test with only the statement “reject the null
hypothesis” or “fail to reject the null hypothesis.” Always make sense of the conclusion with a
statement that uses simple nontechnical wording and addresses the original claim.
There are several types of research problems, and the test statistic to use depends on the
number of groups we want to compare. In this module, we will focus only on comparing two
population means, or testing the significance of their difference, based on samples.
The test procedures used in comparing means are the z-test and the t-test. However, z-tests
require a known population variance (a parameter), which most of the time is unknown
and tedious to determine. Hence, we will focus only on the application of t-tests in comparing
the means of two types of samples.
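Below are the test procedures for Student’s and Welch’s t-tests:

Student’s t-test:
\[ t_c = \frac{(\bar{x} - \bar{y}) - \mu_D}{S_p \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}, \qquad \text{where } S_p^2 = \frac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2} \]

Welch’s t-test:
\[ t_c = \frac{(\bar{x} - \bar{y}) - \mu_D}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}} \]

Where: $\bar{x}$ is the sample mean of the first group;
$\bar{y}$ is the sample mean of the second group;
$s_1^2$ is the sample variance of the first group;
$s_2^2$ is the sample variance of the second group;
$\mu_D$ is the hypothesized difference between the two population means;
m is the number of observations in the first group;
n is the number of observations in the second group.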
The decision rules using the critical region for t-tests are as follows:
Test type     Alternative Hypothesis     Reject 𝐻0 if
One-tailed    𝜇1 − 𝜇2 < 𝜇𝐷               𝑡𝑐 < −𝑡𝛼
One-tailed    𝜇1 − 𝜇2 > 𝜇𝐷               𝑡𝑐 > 𝑡𝛼
Two-tailed    𝜇1 − 𝜇2 ≠ 𝜇𝐷               |𝑡𝑐| > 𝑡𝛼/2
Now that we know all the components for hypothesis test on comparing means given
independent samples, let us consider these examples:
❖ An instructor wanted to test the difference in academic performance, based on grades,
of BSBA students in SLSU. Group A, with 36 students, has a mean grade of 85 and a standard
deviation of 2.5, while Group B, with 41 students, has a mean of 83 and a standard deviation of
1.8. If population variances are assumed to be equal, test the claim that Group A has better
performance than Group B at the 0.05 level of significance. Perform the 5-step hypothesis test
to test the instructor’s claim.
❖ The average size of a coconut farm in Lucban, Quezon is 77 hectares. The average size of
a coconut farm in Luisiana, Laguna is 80.5 hectares. From the samples, the standard
deviations and sample sizes were 15.3 and 5, and 8 and 10, respectively. Can it be
concluded at the 5% level of significance that the average sizes of the farms in the two
municipalities are different, assuming unequal variances?
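As a rough check on the first example, the Student’s t statistic can be computed directly from the summary statistics. The sketch below is an illustration only: the function names `student_t` and `welch_t` are hypothetical, and applying Welch’s formula to the same Group A and Group B figures is an assumption made here purely to compare the two formulas (the example itself assumes equal variances).

```python
import math


def student_t(mean1, mean2, s1, s2, m, n, mu_d=0.0):
    """Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    pooled_var = ((m - 1) * s1**2 + (n - 1) * s2**2) / (m + n - 2)
    std_error = math.sqrt(pooled_var) * math.sqrt(1 / m + 1 / n)
    return (mean1 - mean2 - mu_d) / std_error, m + n - 2


def welch_t(mean1, mean2, s1, s2, m, n, mu_d=0.0):
    """Welch's t statistic (unequal variances); the statistic only."""
    std_error = math.sqrt(s1**2 / m + s2**2 / n)
    return (mean1 - mean2 - mu_d) / std_error


# Example 1: Group A (m = 36, mean 85, s = 2.5), Group B (n = 41, mean 83, s = 1.8).
t_student, df = student_t(85, 83, 2.5, 1.8, 36, 41)
t_welch = welch_t(85, 83, 2.5, 1.8, 36, 41)  # comparison only, see assumption above

print(f"Student's t = {t_student:.3f} with df = {df}")
print(f"Welch's t   = {t_welch:.3f}")
```

The resulting statistic is then compared with the one-tailed critical value from a t table at the stated significance level and degrees of freedom.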
Go to the Data tab, then “Data Analysis” in Microsoft Excel, and perform “t-Test: Two-Sample
Assuming Equal Variances”.
5-step hypothesis test and solution
1. Formulate hypotheses.
𝐻0 : 𝜇𝑀 − 𝜇𝐹 = 0; There is no significant difference in the GEC05 grades of male and female
students.
𝐻1 : 𝜇𝑀 − 𝜇𝐹 ≠ 0; There is a significant difference in the GEC05 grades of male and female
students.
2. Set the level of significance and determine the p-value.
Level of significance: 𝛼 = 0.05
Test procedure: Student’s two-tailed t-test, since variances are assumed to be equal
p-value: 0.067013954 ≅ 0.067
3. Formulate the decision rule.
Reject 𝐻0 if p-value ≤ 𝛼; otherwise, fail to reject 𝐻0.
4. Give the decision.
Since 0.067 > 0.05, fail to reject 𝐻0.
5. Conclusion.
At the 5% level of significance, there is not enough evidence to say that there is a significant
difference in the GEC05 grades of male and female students.
The decision rules using the critical region for the paired t-test are as follows:
Test type     Alternative Hypothesis     Reject 𝐻0 if
One-tailed    𝜇1 − 𝜇2 < 𝜇𝐷               𝑡𝑐 < −𝑡𝛼
One-tailed    𝜇1 − 𝜇2 > 𝜇𝐷               𝑡𝑐 > 𝑡𝛼
Two-tailed    𝜇1 − 𝜇2 ≠ 𝜇𝐷               |𝑡𝑐| > 𝑡𝛼/2
Before 14 13 9 9 10 10 12 7
After 11 15 10 14 11 13 11 12
1. Open a new MS Excel file and encode the Before and After data in cells A1 to A8 and B1 to
B8, respectively.
2. Go to the Data tab, then click “Data Analysis” at the right of the toolbar. A dialog box will appear.
3. From the Data Analysis dialog box, choose “t-Test: Paired Two Sample for Means”, then
click OK.
4. In the first input range box, type the location of the Variable 1 data (Before): “A1:A8”.
In the second input range box, type the location of the Variable 2 data (After), which is in
cells B1 to B8: “B1:B8”.
5. Type “0” into the hypothesized mean difference box and set the alpha level.
6. Choose an output area, for example cell D1, then click OK.
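The paired t statistic that Excel reports for these data can also be reproduced by hand. The following is a minimal sketch using only the Python standard library; the function name `paired_t` is hypothetical, and the differences are taken as Before minus After, matching the Variable 1 and Variable 2 order used in the steps above.

```python
import math
from statistics import mean, stdev

before = [14, 13, 9, 9, 10, 10, 12, 7]
after = [11, 15, 10, 14, 11, 13, 11, 12]


def paired_t(x, y, mu_d=0.0):
    """Paired t statistic; returns (t, degrees of freedom)."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]  # Before minus After
    n = len(diffs)
    d_bar = mean(diffs)                    # mean of the differences
    s_d = stdev(diffs)                     # sample standard deviation (n - 1)
    t = (d_bar - mu_d) / (s_d / math.sqrt(n))
    return t, n - 1


t_c, df = paired_t(before, after)
print(f"mean difference = {mean(a - b for a, b in zip(before, after)):.3f}")
print(f"t = {t_c:.3f} with df = {df}")
```

The sign of the statistic depends on the order of subtraction, so the alternative hypothesis should be written with the same order.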
Correlation is a statistical method that determines the degree of relationship between two
different variables. The relationship between any two variables can range from strong, to weak, to
none. When a relationship is strong, knowing an object’s score on one variable
helps to predict its score on the second variable. The correlation coefficient ranges from -1 to 1.
If the correlation or relationship between variables A and B is a weak one, then knowing an
object’s score on variable A does not help to predict its score on variable B.
Example:
A tobacco company statistician wishes to know whether heavy smoking is related to longevity.
From a sample of recently deceased smokers, the number of cigarettes smoked per day (estimated
for their last five years, after visits with their surviving relatives) is paired with the number of
years they lived.
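Cigarette (X)    25 35 10 40 85 75 60 45 50
Year Lived (Y)   63 68 72 62 65 46 51 60 55
∑𝑥 = 25 + 35 + ⋯ + 50 = 425;  ∑𝑥² = 25² + 35² + ⋯ + 50² = 24525
∑𝑦 = 63 + 68 + ⋯ + 55 = 542;  ∑𝑦² = 63² + 68² + ⋯ + 55² = 33188
(∑𝑥)² = 425² = 180625;  (∑𝑦)² = 542² = 293764
∑𝑥𝑦 = 25(63) + 35(68) + ⋯ + 50(55) = 24640
With n = 9 pairs, the correlation coefficient is
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{9(24640) - (425)(542)}{\sqrt{\left[9(24525) - 180625\right]\left[9(33188) - 293764\right]}} \approx -0.61 \]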
Thus, 𝑟 = −0.61 means that there is a strong negative correlation between smoking and longevity.
This indicates that the higher the number of cigarettes smoked in the past five years, the lower
the number of years lived.
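The value of r can be checked from the raw pairs with a short Python sketch. Two points here are assumptions rather than part of the text: the function name `pearson_r` is hypothetical, and the Year Lived value paired with 45 cigarettes is taken as 60, the value consistent with the stated sums ∑𝑦 = 542 and ∑𝑥𝑦 = 24640.

```python
import math

cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]
# Assumption: the eighth value is 60, which agrees with sum(y) = 542 and sum(xy) = 24640.
years_lived = [63, 68, 72, 62, 65, 46, 51, 60, 55]


def pearson_r(x, y):
    """Pearson correlation coefficient from the raw-score (computational) formula."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


r = pearson_r(cigarettes, years_lived)
print(f"r = {r:.3f}")
```

Computing the intermediate sums inside the function makes it easy to compare each one against the hand computation.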
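From the previous example, 𝑟 = −0.61 was obtained, which means that there is a strong negative
relationship between smoking and longevity. We still need to check whether this relationship is
significant. Using the t-test for a correlation coefficient with n = 9 pairs, we have:
\[ t_c = r\sqrt{\frac{n-2}{1-r^2}} = -0.61\sqrt{\frac{9-2}{1-(-0.61)^2}} \approx -2.042 \]
(the value −2.042 is obtained with the unrounded 𝑟 ≈ −0.611).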
For a two-tailed test of significance at 𝛼 = 0.05 with 𝑑𝑓 = 𝑛 − 2 = 7, the critical value of t is 𝑡 =
2.365. Note that the decision rule for a two-tailed t-test is:
Reject 𝐻0 if |𝑡𝑐| ≥ 𝑡𝛼/2, otherwise fail to reject 𝐻0.
Since |−2.042| = 2.042 < 2.365, we fail to reject 𝐻0. Thus, at the 5% level of significance, there is
not enough evidence to say that the relationship indicated by 𝑟 = −0.61 is significant.
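The significance check can be sketched in Python as follows. The function name `correlation_t` is hypothetical, r is entered as −0.611 (the coefficient kept to three decimals, an assumption made so that the result matches the −2.042 in the text), and the critical value 2.365 is the table value quoted above.

```python
import math


def correlation_t(r: float, n: int) -> float:
    """t statistic for testing whether a correlation differs from zero (df = n - 2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return r * math.sqrt((n - 2) / (1 - r**2))


r, n = -0.611, 9   # r to three decimals (assumption), n pairs from the example
t_crit = 2.365     # two-tailed critical value at alpha = 0.05, df = 7, from the text

t_c = correlation_t(r, n)
reject_h0 = abs(t_c) >= t_crit  # two-tailed rule: reject when |t_c| >= critical value

print(f"t_c = {t_c:.3f}, |t_c| = {abs(t_c):.3f}, critical value = {t_crit}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Because |t_c| is smaller than the critical value, the sketch lands on "fail to reject H0" for these inputs.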
EXERCISE
1. Since the pandemic, the price of gasoline has varied constantly over the last two months.
However, you observed that local gas stations charge higher prices. You want to test your
hypothesis that local gas stations are charging much more than the national average price
for gasoline. What is your claim about this problem? What is the null hypothesis of this
study?
2. You carry out a test of the hypothesis described in #1. If the results show that you cannot
reject the null hypothesis, what conclusion can you generate based on your claim?
4. What type of correlation would you expect between wages and the unemployment rate?
EXERCISE
Test of Hypothesis.
1. The following data are the grades in GEC 05 and GPA (grade point average) of 60 students
in the College of Arts and Sciences.
2. A dietitian wishes to see if a person’s cholesterol level will change if the diet is
supplemented by a certain mineral. Six respondents were pretested and then took the
mineral supplement for a six-week period. Can it be concluded that the cholesterol level
has changed at 𝛼 = 0.10? Assume that the data are approximately normally
distributed. Cholesterol level is measured in milligrams per deciliter.
Subject 1 2 3 4 5 6
Before 210 235 208 190 172 244
After 190 170 210 188 173 228
Perform the 5-step hypothesis testing.
10 | S o u t h e r n L u z o n S t a t e U n i v e r s i t y
Private Government-owned
x = P26,800 x = P25,400
s=P600 s=P450
n = 10 n= 8