Basic Statistical Analysis


BASIC STATISTICAL ANALYSIS
ARJ AY T. ALTOVAR
CAS-MNSD INSTRUCTOR

I. Managing and Understanding Data using Inferential Statistics.

While descriptive statistics describe what is going on in our sample data set, inferential statistics
allow us to predict trends about a larger population based on a study of samples taken from it.
Some inferences also examine the relationships among variables within a sample and then
generalize or predict how those variables relate in the larger population.

A prerequisite for following the discussion on tests of hypotheses is a clear grasp of the basic
concepts of inferential statistics. Recall the following concepts used in hypothesis testing. Discuss
them with your classmates.

1. Why do we need to test the hypothesis?
2. Two kinds of statistical hypotheses
3. Level of Significance
4. One-Tailed or Two-Tailed Test: when do we use each?
5. Type I and Type II Errors
6. p-value
7. Which hypothesis do we reject or fail to reject?
8. When do we reject H0?

Hypothesis Test
As previously mentioned, this is a statistical activity that uses data to decide between two or more
different possibilities to resolve an issue. Formally, this is a systematic procedure for testing a
claim about a property of a population of interest.

This topic presents individual components of a hypothesis test. We should know and understand
the following:
✓ How to identify the null hypothesis and alternative hypothesis from a given claim, and
how to express both in symbolic form.
✓ How to calculate the value of the test statistic, given a claim and sample data.
✓ How to identify the critical value(s), given a significance level.
✓ How to identify the P-value, given a value of the test statistic.
✓ How to state the conclusion about a claim in simple and nontechnical terms.

Statistical Hypothesis is an assertion or conjecture concerning the characteristics of one or more
populations. There are two types of hypotheses: the null hypothesis and the alternative
hypothesis.
❖ The null hypothesis (denoted by H0) is a statement that the value of a population
parameter (such as a proportion, mean, or standard deviation) is equal to some claimed
value. This is the hypothesis the researcher usually hopes to reject. In symbolic form, it
uses the (=) symbol.
❖ The alternative hypothesis (denoted by H1 or Ha) is the statement that the parameter has
a value that somehow differs from the null hypothesis. This is often the claim of the
researcher. The symbolic form of the alternative hypothesis must use one of these
symbols: ≠, >, <.

1|Southern Luzon State University



Example:
With individual lines at its various windows, a bank finds that the standard deviation of
normally distributed waiting times on Friday afternoons is 6.2 minutes. The bank experiments
with a single main waiting line and finds that, for a random sample of 25 customers, the waiting
times have a standard deviation of 3.8 minutes. We want to test the bank's claim that a single line
causes lower variation among the waiting times for customers.
✓ Null Hypothesis, H0: A single line does not cause lower variation among the waiting times
for customers.
In symbols: 𝜎 = 6.2
✓ Alternative Hypothesis, Ha: A single line causes lower variation among the waiting times
for customers.
In symbols: 𝜎 < 6.2
Once the hypotheses are properly stated and sample data are obtained, the appropriate test
statistic is computed.

Test Statistic is a value computed from the sample data under a given statistical test procedure;
the decision to reject the null hypothesis depends on its value.
Whether the test is one-tailed or two-tailed depends on the hypotheses stated.
❖ One-tailed test – used if the claim or the alternative hypothesis indicates a direction (greater
than, > or less than, <). In the previous example, the alternative is 𝜎 < 6.2, which states
that a single line causes lower variation among waiting times; hence a one-tailed test is
the appropriate test.
❖ Two-tailed test – used if the claim or the alternative hypothesis does not specify a direction
(not equal to, ≠).

The type of test determines the appropriate critical region and critical value for a given
significance level. Critical Region (or rejection region) is the set of all values of the
test statistic that cause us to reject the null hypothesis. The critical value for each test statistic can
be determined manually using statistical tables or with an Excel function. Significance level
(denoted by 𝛼) is the probability that the test statistic will fall in the critical region when the null
hypothesis is true. This is usually set by the researcher; common choices for 𝛼 are 0.05, 0.01, and 0.10.
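
As an alternative to statistical tables or Excel, critical values of the t distribution can be approximated numerically. The sketch below uses only the Python standard library; the function names `t_pdf`, `upper_tail`, and `t_critical` are illustrative choices, and the method (Simpson's rule plus bisection) is one simple option rather than the way tables are actually produced.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, integrating the density on [0, t] with Simpson's rule."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_area, df):
    """Positive value t with P(T >= t) = tail_area, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail_area:
            lo = mid   # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.05
one_tailed = t_critical(alpha, 8)       # one-tailed test, df = 8
two_tailed = t_critical(alpha / 2, 7)   # two-tailed test, df = 7 (alpha split in two tails)
print(f"{one_tailed:.3f} {two_tailed:.3f}")
```

For a two-tailed test the tail area passed in is 𝛼/2, because the critical region is split between both tails. The two values computed here should agree with the tabulated 1.860 and 2.365 that appear in the worked examples of this module.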

We always test the null hypothesis. The initial conclusion will always be one of the following:
✓ Reject the null hypothesis.
✓ Fail to reject the null hypothesis.

To decide whether to reject or fail to reject the null hypothesis, we state our decision rule:
❖ Using Critical Region – check if the test statistic falls in the region of rejection, if yes, reject
the null.
❖ Using p-value/significance level – determine the corresponding p-value of the test
statistic, then reject the null hypothesis if the p-value is less than or equal to the
significance level, 𝛼.
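
The two decision rules can be written as short functions. This is a minimal sketch: the function names and the `tail` labels are hypothetical, and the numbers passed in are the test statistic, critical value, and p-value quoted in the worked examples later in this module.

```python
def reject_by_critical_region(t_c, t_crit, tail):
    """Return True if H0 is rejected.

    t_crit is the positive table value: looked up at alpha for a
    one-tailed test and at alpha/2 for a two-tailed test.
    """
    if tail == "left":       # H1 uses <
        return t_c <= -t_crit
    if tail == "right":      # H1 uses >
        return t_c >= t_crit
    if tail == "two":        # H1 uses !=
        return abs(t_c) >= t_crit
    raise ValueError("tail must be 'left', 'right', or 'two'")

def reject_by_p_value(p_value, alpha):
    """Return True if H0 is rejected under the p-value rule."""
    return p_value <= alpha

# Paired t-test example: t_c = -1.6742, critical value 1.860, left-tailed
print(reject_by_critical_region(-1.6742, 1.860, "left"))
# Excel example: p-value 0.067 at alpha = 0.05
print(reject_by_p_value(0.067, 0.05))
```

Both rules lead to the same decision for the same test; the critical-region form compares test statistics, while the p-value form compares probabilities.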

It is important never to end a hypothesis test with only the statement “reject the null
hypothesis” or “fail to reject the null hypothesis.” Always follow the decision with a conclusion
in simple, nontechnical wording that addresses the original claim.


Steps in Hypothesis Testing


1. Formulate the null and alternative hypotheses.
2. Set the level of significance; determine the test statistic or p-value.
3. Formulate the decision rule (using the critical region or the p-value).
4. Give the decision (reject or fail to reject the null hypothesis).
5. Draw conclusions.

Several types of research problems call for different test statistics, depending on the
number of groups we want to compare. In this module, we focus only on comparing two
population means based on samples and testing whether their difference is significant.

Testing the Significance of Means


In comparing two population means derived from samples, we have two types of samples:
1. Independent Samples – The sample selected from one population has no effect on how
the sample from the other population is selected. The sample sizes of independent
samples may be different. Example: Comparing a group of Filipinos and a group of
non-Filipinos on whether or not they will talk to a stranger on public transport.
2. Paired/Dependent Samples – Each observation in sample 1 is matched or paired with an
observation in sample 2. Example: Testing the effectiveness of a certain food supplement
in reducing weight by measuring randomly sampled individuals before and after supplementation.

The test procedures used in comparing means are the z-test and the t-test. However, z-tests
require a known population variance (a parameter), which is usually unknown and tedious to
determine. Hence, we focus only on the application of t-tests in comparing means for the two
types of samples.

Comparing Means of Independent Samples


For independent samples, there are two types of t-tests to be used:
✓ Student’s t-test – used when population variances are unknown but equal, $\sigma_1^2 = \sigma_2^2$
✓ Welch’s t-test – used when population variances are unknown and unequal, $\sigma_1^2 \ne \sigma_2^2$
How can we determine if the population variances are equal or not?
In practice, we test the equality of population variances using statistical tests. In this case we
conduct another hypothesis test with the claim that the variances are not equal. However, in this
module, we will not require you to test the equality of variances; instead, we will assume equal
or unequal variances depending on the problem given.

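Below are the test procedures for Student’s and Welch’s t-tests:

Test | Test Procedure
Student’s t-test | $t_c = \dfrac{(\bar{x} - \bar{y}) - \mu_D}{S_p \sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$, where $S_p^2 = \dfrac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}$
Welch’s t-test | $t_c = \dfrac{(\bar{x} - \bar{y}) - \mu_D}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}}$

Where: $\bar{x}$ – sample mean of the first group;
$\bar{y}$ – sample mean of the second group;
$s_1^2$ – sample variance of the first group;
$s_2^2$ – sample variance of the second group;
$S_p$ – pooled standard deviation (the square root of $S_p^2$);
$\mu_D$ – hypothesized difference between the population means (0 in the examples that follow);
m – number of observations in the first group;
n – number of observations in the second group.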

The decision rules for t-tests using the critical region are as follows:


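Alternative Hypothesis | (Student’s) Reject $H_0$ at level $\alpha$ if: | (Welch’s) Reject $H_0$ at level $\alpha$ if:
One-tailed, $\mu_1 - \mu_2 < \mu_D$ | $t_c \le -t_{(m+n-2),\alpha}$ | $t_c \le -t_{(\min(m,n)-1),\alpha}$
One-tailed, $\mu_1 - \mu_2 > \mu_D$ | $t_c \ge t_{(m+n-2),\alpha}$ | $t_c \ge t_{(\min(m,n)-1),\alpha}$
Two-tailed, $\mu_1 - \mu_2 \ne \mu_D$ | $|t_c| \ge t_{(m+n-2),\alpha/2}$ | $|t_c| \ge t_{(\min(m,n)-1),\alpha/2}$

For Welch’s test, the degrees of freedom are taken here as the smaller sample size minus 1, which
is the convention used in the worked example below (sample sizes 8 and 10, critical value at 7
degrees of freedom).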

Now that we know all the components for hypothesis test on comparing means given
independent samples, let us consider these examples:

❖ An instructor wanted to test the difference in academic performance, based on grades,
of BSBA students in SLSU. Group A, with 36 students, has a mean grade of 85 and a standard
deviation of 2.5, while Group B, with 41 students, has a mean of 83 and a standard deviation
of 1.8. If population variances are assumed to be equal, test the claim that Group A has better
performance than Group B at the 0.05 level of significance. Perform the 5-step hypothesis test
to test the instructor’s claim.

5-step hypothesis test and solution:

1. Formulate hypotheses.
   $H_0: \mu_1 - \mu_2 = 0$; There is no significant difference in the academic performance
   between Group A and Group B based on their mean grades.
   $H_1: \mu_1 - \mu_2 > 0$; Group A has significantly better performance than Group B based
   on their mean grades.
2. Set level of significance and determine test statistic.
   Level of Significance: $\alpha = 0.05$
   Test Procedure: Student’s one-tailed t-test, since variances are assumed to be equal.
   Test Statistic: $\bar{x} = 85$, $\bar{y} = 83$, $s_1 = 2.5$, $s_2 = 1.8$, $m = 36$, $n = 41$
   $S_p^2 = \dfrac{(36-1)(2.5)^2 + (41-1)(1.8)^2}{36 + 41 - 2} = 4.6447$, so $S_p = \sqrt{4.6447} = 2.1551$
   $t_c = \dfrac{(\bar{x} - \bar{y}) - \mu_D}{S_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}} = \dfrac{(85 - 83) - 0}{2.1551\sqrt{\dfrac{1}{36} + \dfrac{1}{41}}} = \mathbf{4.0630}$
   $t_{(m+n-2),\alpha} = t_{75,\,0.05} = \mathbf{1.665}$
3. Formulate decision rule.
   Reject $H_0$ if $t_c \ge t_{(m+n-2),\alpha}$; otherwise, fail to reject $H_0$.
4. Give the decision.
   Since $t_c = 4.0630 > 1.665$, reject $H_0$.
5. Conclusion.
   At the 5% level of significance, there is enough evidence to say that Group A has
   significantly better performance than Group B based on their mean grades.
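
As a cross-check, the Student's t statistic for the Group A and Group B example can be recomputed from the summary values given in the problem. This is a sketch using the standard library; the variable names are illustrative. Note that the denominator uses the pooled standard deviation S_p, which is the square root of the pooled variance S_p² ≈ 4.6447.

```python
import math

# Summary values from the Group A / Group B example
m, n = 36, 41                # sample sizes
x_bar, y_bar = 85.0, 83.0    # sample means
s1, s2 = 2.5, 1.8            # sample standard deviations
mu_d = 0.0                   # hypothesized difference under H0

# Pooled variance, then its square root (the pooled standard deviation)
pooled_var = ((m - 1) * s1**2 + (n - 1) * s2**2) / (m + n - 2)
pooled_sd = math.sqrt(pooled_var)

t_c = ((x_bar - y_bar) - mu_d) / (pooled_sd * math.sqrt(1 / m + 1 / n))
t_crit = 1.665               # table value for df = 75, alpha = 0.05 (one-tailed)

print(f"Sp^2 = {pooled_var:.4f}, Sp = {pooled_sd:.4f}, t_c = {t_c:.4f}")
print("reject H0:", t_c >= t_crit)
```

Using S_p² in place of S_p in the denominator would understate the statistic, although in this example the decision to reject the null hypothesis is the same either way.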

❖ The average size of a coconut farm in Lucban, Quezon is 77 hectares. The average size of
a coconut farm in Luisiana, Laguna is 80.5 hectares. From the samples, the standard
deviations were 15.3 and 5, and the sample sizes were 8 and 10, respectively. Assuming
unequal variances, can it be concluded at the 5% level of significance that the average
sizes of the farms in the two municipalities are different?

5-step hypothesis test and solution:

1. Formulate hypotheses.
   $H_0: \mu_1 - \mu_2 = 0$; There is no significant difference in the average size of the
   farms in the two municipalities.
   $H_1: \mu_1 - \mu_2 \ne 0$; There is a significant difference in the average size of the
   farms in the two municipalities.
2. Set level of significance and determine test statistic.
   Level of Significance: $\alpha = 0.05$
   Test Procedure: Welch’s two-tailed t-test, since variances are assumed to be unequal.
   Test Statistic: $\bar{x} = 77$, $\bar{y} = 80.5$, $s_1 = 15.3$, $s_2 = 5$, $m = 8$, $n = 10$
   $t_c = \dfrac{(\bar{x} - \bar{y}) - \mu_D}{\sqrt{\dfrac{s_1^2}{m} + \dfrac{s_2^2}{n}}} = \dfrac{(77 - 80.5) - 0}{\sqrt{\dfrac{15.3^2}{8} + \dfrac{5^2}{10}}} = \mathbf{-0.6210}$
   $t_{7,\,\alpha/2} = t_{7,\,0.025} = \mathbf{2.365}$ (degrees of freedom: smaller sample size minus 1)
3. Formulate decision rule.
   Reject $H_0$ if $|t_c| \ge t_{7,\,\alpha/2}$; otherwise, fail to reject $H_0$.
4. Give the decision.
   Since $|t_c| = 0.6210 < 2.365$, fail to reject $H_0$.
5. Conclusion.
   At the 5% level of significance, there is not enough evidence to say that there is a
   significant difference in the average size of the farms in the two municipalities.
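
The Welch statistic for the coconut farm example can be checked the same way. In this sketch, `df_conservative` reproduces the degrees of freedom implied by the critical value 2.365 used above. The Welch-Satterthwaite approximation `df_ws` is not part of this module; it is included only as an assumption about what most statistical software reports.

```python
import math

# Summary values from the coconut farm example
m, n = 8, 10                 # sample sizes (Lucban, Luisiana)
x_bar, y_bar = 77.0, 80.5    # sample means in hectares
s1, s2 = 15.3, 5.0           # sample standard deviations
mu_d = 0.0                   # hypothesized difference under H0

v1, v2 = s1**2 / m, s2**2 / n          # per-group variance of the sample mean
t_c = ((x_bar - y_bar) - mu_d) / math.sqrt(v1 + v2)

df_conservative = min(m, n) - 1        # smaller sample size minus 1
# Welch-Satterthwaite degrees of freedom (not covered in this module)
df_ws = (v1 + v2) ** 2 / (v1**2 / (m - 1) + v2**2 / (n - 1))

t_crit = 2.365                         # table value for df = 7, alpha/2 = 0.025
print(f"t_c = {t_c:.4f}, df = {df_conservative}, Welch-Satterthwaite df = {df_ws:.2f}")
print("reject H0:", abs(t_c) >= t_crit)
```

The conservative choice gives a slightly larger critical value than the Welch-Satterthwaite degrees of freedom would, so it never rejects more readily.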

Using Excel, consider this example:


Use the data in Exercise 5.1 on 60 CAS students, with their GEC05 Grades and GPA, to test a
hypothesis using the t-test for two independent samples assuming equal variances. Problem: Is
there a significant difference in the GEC05 grades of male and female students?

In Microsoft Excel, go to the Data tab, click “Data Analysis”, and run “t-Test: Two-Sample
Assuming Equal Variances”.
5-step hypothesis test and solution:

1. Formulate hypotheses.
   $H_0: \mu_M - \mu_F = 0$; There is no significant difference in the GEC05 Grades of male
   and female students.
   $H_1: \mu_M - \mu_F \ne 0$; There is a significant difference in the GEC05 Grades of male
   and female students.
2. Set level of significance and determine the p-value.
   Level of Significance: $\alpha = 0.05$
   Test Procedure: Student’s two-tailed t-test, since variances are assumed to be equal.
   p-value: 0.067013954 ≅ 0.067
3. Formulate decision rule.
   Reject $H_0$ if p-value ≤ $\alpha$; otherwise, fail to reject $H_0$.
4. Give the decision.
   Since 0.067 > 0.05, fail to reject $H_0$.
5. Conclusion.
   At the 5% level of significance, there is not enough evidence to say that there is a
   significant difference in the GEC05 Grades of male and female students.

Comparing Means of Paired/Dependent Samples


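For comparing means given paired samples, we focus only on the paired t-test. Below is
the test procedure for the paired t-test:

Test | Test Procedure
Paired t-test | $t_c = \dfrac{\bar{d} - \mu_D}{s_d / \sqrt{n}}$

Where: $\bar{d}$ – mean of the paired differences;
$s_d$ – standard deviation of the paired differences;
n – number of pairs.

The decision rules for the paired t-test using the critical region are as follows:

Alternative Hypothesis | Reject $H_0$ at level $\alpha$ if:
One-tailed, $\mu_1 - \mu_2 < \mu_D$ | $t_c \le -t_{(n-1),\alpha}$
One-tailed, $\mu_1 - \mu_2 > \mu_D$ | $t_c \ge t_{(n-1),\alpha}$
Two-tailed, $\mu_1 - \mu_2 \ne \mu_D$ | $|t_c| \ge t_{(n-1),\alpha/2}$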


Consider this example for paired t-test:


❖ A sample of nine local banks shows their deposits (in billion pesos) 5 years ago and their
deposits (in billion pesos) at present. At the 0.05 level of significance, can it be concluded
that the average deposit of the banks is greater today than it was 5 years ago? The mean and
standard deviation of the differences are $\bar{d} = -1.081$ and $s_d = 1.937$, where the
negative mean indicates that each difference is taken as the deposit 5 years ago minus the
deposit at present. Perform the 5-step hypothesis test to test the claim.

5-step hypothesis test and solution:

1. Formulate hypotheses.
   $H_0: \mu_D = 0$; There is no significant difference between the banks’ average deposit
   5 years ago and today.
   $H_1: \mu_D < 0$; The banks’ average deposit today is significantly greater than the
   average deposit 5 years ago.
2. Set level of significance and determine test statistic.
   Level of Significance: $\alpha = 0.05$
   Test Procedure: one-tailed t-test for dependent samples
   Test Statistic: $\bar{d} = -1.081$, $s_d = 1.937$, $n = 9$
   $t_c = \dfrac{\bar{d} - \mu_D}{s_d / \sqrt{n}} = \dfrac{-1.081 - 0}{1.937 / \sqrt{9}} = \mathbf{-1.6742}$
   $t_{(n-1),\alpha} = t_{8,\,0.05} = \mathbf{1.860}$, so the critical region starts at $-1.860$.
3. Formulate decision rule.
   Reject $H_0$ if $t_c \le -t_{(n-1),\alpha}$; otherwise, fail to reject $H_0$.
4. Give the decision.
   Since $t_c = -1.6742 > -1.860$, fail to reject $H_0$.
5. Conclusion.
   At the 5% level of significance, there is not enough evidence to say that the banks’
   average deposit today is greater than it was 5 years ago.

Using Excel, try this example:


❖ As an aid for improving employees’ working habits, eight employees were randomly
selected to attend a seminar-workshop on the importance of work. The table shows the
number of workloads done per week before and after attending the seminar-workshop.
At the 5% level of significance, did attending the seminar-workshop increase the
performance level of the employees?

Before 14 13 9 9 10 10 12 7
After 11 15 10 14 11 13 11 12

1. Open a new MS Excel file and encode the Before and After data in cells A1 to A8 and B1 to
B8, respectively.
2. Go to the Data tab, then click “Data Analysis” at the right of the toolbar. A dialog box will appear.
3. From the Data Analysis dialog box, choose “t-Test: Paired Two Sample for Means”, then
click OK.
4. In the first input range box (Variable 1, the Before data), type “A1:A8”. In the second input
range box (Variable 2, the After data), type “B1:B8”.
5. Type “0” into the hypothesized mean difference box and set the alpha level.
6. Choose an output area, for example cell D1, then click OK.

Perform the 5-step hypothesis test.
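
For readers without the Excel add-in, the paired t statistic for the Before and After data can also be computed with a few lines of standard-library Python. This is a sketch, not a replacement for the 5-step write-up: the direction of the differences (Before minus After) is an assumption made here, so an increase in workloads shows up as a negative mean difference.

```python
import math
import statistics

before = [14, 13, 9, 9, 10, 10, 12, 7]
after = [11, 15, 10, 14, 11, 13, 11, 12]

# Differences taken as before - after (an assumed convention)
d = [b - a for b, a in zip(before, after)]
n = len(d)

d_bar = statistics.mean(d)
s_d = statistics.stdev(d)        # sample standard deviation (n - 1 in the denominator)
mu_d = 0                         # hypothesized mean difference under H0

t_c = (d_bar - mu_d) / (s_d / math.sqrt(n))
print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t_c = {t_c:.4f}, df = {n - 1}")
```

The statistic is then compared with the one-tailed critical value at n − 1 = 7 degrees of freedom, following the decision rules for the paired t-test.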


Determining Relationship of two Variables

Correlation is a statistical method that determines the degree of relationship between two
different variables. The relationship between any two variables can range from strong to weak
to none. When a relationship is strong, knowing an object’s score on one variable helps to
predict its score on the second variable. The correlation coefficient ranges from -1 to 1.

If the correlation between variables A and B is weak, then knowing an object’s score on
variable A does not help much in predicting its score on variable B.

✓ Positive Correlation: the values of the two variables change in the same direction.
✓ Negative Correlation: the values of the two variables change in opposite directions.

One caution in correlation: just because two variables are correlated, it does not mean that
one variable caused the other.

Pearson Correlation Coefficient

✓ The single most common type of correlation.
✓ A measure of the strength of the linear relationship between two continuous variables.

$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$

where n is the number of paired values.

Values of 𝝆 or r (in absolute value) | Interpretation
$|r| = 0$ | No linear association
$0 < |r| < 0.2$ | Very weak linear association
$0.2 \le |r| < 0.4$ | Weak linear association
$0.4 \le |r| < 0.6$ | Moderate linear association
$0.6 \le |r| < 0.8$ | Strong linear association
$0.8 \le |r| < 1$ | Very strong linear association
$|r| = 1$ | Perfect linear association

The sign of r gives the direction of the association (positive or negative).

Example:
A tobacco company statistician wishes to know whether heavy smoking is related to longevity.
From a sample of recently deceased smokers, the number of cigarettes smoked per day during
their last five years (estimated after visits with their surviving relatives) is paired with the
number of years they lived.
Cigarettes (X) 25 35 10 40 85 75 60 45 50
Years Lived (Y) 63 68 72 62 65 46 51 60 55

$\sum x = 25 + 35 + \cdots + 50 = 425$; $\sum x^2 = 25^2 + 35^2 + \cdots + 50^2 = 24525$
$\sum y = 63 + 68 + \cdots + 55 = 542$; $\sum y^2 = 63^2 + 68^2 + \cdots + 55^2 = 33188$
$(\sum x)^2 = 425^2 = 180625$; $(\sum y)^2 = 542^2 = 293764$
$\sum xy = 25(63) + 35(68) + \cdots + 50(55) = 24640$

Note that there are 9 pairs of scores, hence n = 9.

$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$

$r = \dfrac{9(24640) - (425)(542)}{\sqrt{[9(24525) - 180625][9(33188) - 293764]}} = \dfrac{-8590}{\sqrt{(40100)(4928)}} = \mathbf{-0.61}$

Thus, 𝑟 = −0.61 means that there is a strong negative correlation between smoking and longevity
in this sample: the higher the number of cigarettes smoked per day in the last five years, the
lower the number of years lived tends to be.
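
The sums and the coefficient in this example can be verified directly from the raw pairs. In this sketch the eighth Years Lived value is taken as 60, because that is the value consistent with the totals Σy = 542 and Σxy = 24640 quoted in the computation; the variable names are illustrative.

```python
import math

cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]
# Eighth value taken as 60 (assumption): it reproduces sum(y) = 542
# and sum(xy) = 24640 used in the worked computation.
years = [63, 68, 72, 62, 65, 46, 51, 60, 55]

n = len(cigarettes)
sum_x, sum_y = sum(cigarettes), sum(years)
sum_x2 = sum(x * x for x in cigarettes)
sum_y2 = sum(y * y for y in years)
sum_xy = sum(x * y for x, y in zip(cigarettes, years))

# Computational formula for the Pearson correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(f"r = {r:.4f}")
```

Recomputing the sums from the data is a quick way to catch transcription slips before interpreting r.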

Testing of Significance of the Correlation coefficient


A correlation coefficient may be tested to determine whether it differs significantly
from zero.
$H_0: \rho = 0$
$H_1: \rho \ne 0$
The test procedure for testing the significance of the correlation coefficient is a t-test with
$n - 2$ degrees of freedom. Below is the formula for the test statistic:

$t_c = r\sqrt{\dfrac{n - 2}{1 - r^2}}$

From the previous example, 𝑟 = −0.61 was obtained, which indicates a strong negative
relationship between smoking and longevity in the sample. We still need to check whether this
relationship is statistically significant. Using the formula above, and carrying r to three decimal
places (𝑟 ≈ −0.611), we have:

$t_c = -0.611\sqrt{\dfrac{9 - 2}{1 - (-0.611)^2}} = -2.042$

For a two-tailed test of significance at 𝛼 = 0.05 with 𝑑𝑓 = 𝑛 − 2 = 7, the critical value of t is
2.365. Note that the decision rule for a two-tailed t-test is:
Reject 𝐻0 if $|t_c| \ge t_{\alpha/2}$; otherwise, fail to reject 𝐻0.

Since |−2.042| = 2.042 < 2.365, we fail to reject 𝐻0. Thus, although 𝑟 = −0.61 is strong in this
sample of nine pairs, there is not enough evidence at the 5% level of significance to say that the
correlation differs from zero.
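
The test statistic and an approximate two-tailed p-value for this correlation can be computed with the standard library. This is a sketch under stated assumptions: r is carried to three decimals (−0.611), the helper names `t_pdf` and `two_tailed_p` are hypothetical, and the p-value comes from a simple Simpson's-rule integration of the t density rather than from a statistical package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3      # P(-|t| <= T <= |t|)
    return 1 - central

r, n = -0.611, 9                     # r to three decimals (assumption)
df = n - 2
t_c = r * math.sqrt(df / (1 - r**2))
t_crit = 2.365                       # table value for df = 7, alpha/2 = 0.025
p = two_tailed_p(t_c, df)

print(f"t_c = {t_c:.3f}, p = {p:.3f}")
print("|t_c| >= critical value:", abs(t_c) >= t_crit)
```

The critical-region comparison and the p-value comparison with 𝛼 = 0.05 are two views of the same decision rule and should always agree.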


EXERCISE

Explain your reasoning.

1. Since the pandemic began, the price of gasoline has varied constantly over the last two
months. You observed, however, that local gas stations charge higher prices. You want to test
your hypothesis that local gas stations are charging more than the national average price
for gasoline. What is your claim about this problem? What is the null hypothesis of this
study?

2. You carry out a test of the hypothesis described in #1. If the results show that you cannot
reject the null hypothesis, what conclusion can you generate based on your claim?

3. Explain why accepting the null hypothesis is not a possible outcome.

4. What type of correlation would you expect between wages and the unemployment rate?


EXERCISE

Test of Hypothesis.
1. The following data are the grades in GEC 05 and GPA (grade point average) of 60 students
in the College of Arts and Sciences.

Male No. | GEC 05 Grade | GPA | Female No. | GEC 05 Grade | GPA
1 99 95 1 96 99
2 98 91 2 91 98
3 98 90 3 91 96
4 98 89 4 90 89
5 97 89 5 90 89
6 96 88 6 89 89
7 95 87 7 89 89
8 93 87 8 88 88
9 91 86 9 88 88
10 90 85 10 88 88
11 89 85 11 86 88
12 89 84 12 85 87
13 89 84 13 85 87
14 89 82 14 84 86
15 89 81 15 84 86
16 89 81 16 84 86
17 88 80 17 84 86
18 88 80 18 81 85
19 87 78 19 80 85
20 84 77 20 80 84
21 83 77 21 80 84
22 80 77 22 80 84
23 80 76 23 80 83
24 79 76 24 80 84
25 77 76 25 79 82
26 76 75 26 78 80
27 76 74 27 76 80
28 76 72 28 76 79
29 75 72 29 73 78
30 70 70 30 70 75
Perform the 5 steps in hypothesis testing. Is there a significant correlation between the
GEC 05 grades and the GPA of male students?

2. A dietitian wishes to see if a person’s cholesterol level will change if the diet is
supplemented by a certain mineral. Six respondents were pretested and then took the
mineral supplement for a six-week period. Can it be concluded that the cholesterol level
has changed at 𝛼 = 0.10? Assume that the data are approximately normally
distributed. Cholesterol level is measured in milligrams per deciliter.

Subject 1 2 3 4 5 6
Before 210 235 208 190 172 244
After 190 170 210 188 173 228
Perform the 5-step hypothesis testing.


3. A researcher wishes to determine whether the salaries of professional nurses working as
frontliners in private hospitals are higher than those of nurses employed by
government-owned hospitals. She selects a sample of nurses from each type of hospital
and calculates the means and standard deviations of their salaries. At 𝛼 = 0.01, can you
conclude that private hospitals pay more than government hospitals?

Private | Government-owned
$\bar{x}$ = P26,800 | $\bar{x}$ = P25,400
s = P600 | s = P450
n = 10 | n = 8

Perform the 5-step hypothesis testing.
