Chi-Square Test A Nonparametric Hypothesis Test
Parametric vs. Nonparametric Tests
• Parametric hypothesis test
– about a population parameter (μ or p)
– z, t tests
– interval/ratio data
• Nonparametric tests
– do not test a specific parameter
– nominal & ordinal data
– frequency data
Chi-square test
χ² distribution
[Figure: χ² distribution curves for df = 3, df = 5, and df = 10]
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
where
$O_i$ is the observed frequency
$E_i$ is the expected frequency
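The statistic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `chi_square_statistic` and the three-category counts are hypothetical and chosen only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (not taken from the slides).
observed = [18, 22, 20]
expected = [20, 20, 20]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

Each category contributes its squared deviation scaled by the expected count, so a large deviation in a category with a small expected count weighs more heavily.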
χ² Goodness of fit test
Uses univariate data
Want to see how well the
observed counts “fit” what we
expect the counts to be
Based on df:
df = number of categories - 1
Hypotheses – written in words
H0: the observed counts equal the
expected counts, i.e., there
is no significant difference between the
observed and the expected counts
H1: the observed counts are not equal to
the expected counts
Steps for Computation of χ² (con’t)
5 Under the null hypothesis that the theory fits the data
well, the above statistic follows the χ² distribution
with ν = n − 1 d.f.
6 Look up the tabulated value of χ² for (n − 1) d.f. at
the given level of significance.
7 If the calculated value of χ² is less than the
corresponding tabulated value obtained in step 6,
it is said to be non-significant at the
required level of significance, and the null hypothesis is not rejected.
8 If the calculated value of χ² is greater than the
corresponding tabulated value obtained in step 6,
it is said to be significant at the required
level of significance, and the null hypothesis is rejected.
Let’s test our dice!
CASELETS
1. A die is rolled 100 times with the following
distribution:
Number             : 1  2  3  4  5  6
Observed frequency : 17 14 20 17 17 15
At the 0.01 level of significance, determine whether the
die is true (unbiased).
Solution. We are given:
Number of categories = 6
N = total frequency = 17+14+20+17+17+15 = 100
Expected frequency for each face under H0 = 100/6 ≈ 16.67
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 1.2796$$
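The die caselet can be checked numerically with a short sketch. The critical value 15.086 is the standard tabulated χ² value for 5 d.f. at the 0.01 level; it is an assumption here, since the slides do not quote it. With unrounded expected counts the statistic comes out at about 1.28, close to the 1.2796 quoted above (the small difference is presumably due to rounding of the expected frequency).

```python
observed = [17, 14, 20, 17, 17, 15]   # counts for faces 1..6 from the caselet
total = sum(observed)                  # 100 rolls
# Under H0 (unbiased die) every face has the same expected count, 100/6.
expected = [total / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                 # number of categories - 1

# Assumption: standard table value for 5 d.f. at the 0.01 level
# (not quoted on the slides).
critical_value = 15.086
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.4f}, df = {df}, reject H0: {reject_h0}")
```

Because the calculated value is far below the tabulated one, the sketch does not reject the hypothesis that the die is unbiased.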
CASELETS
2. Offspring of certain fruit flies may have
yellow or ebony bodies and normal wings or
short wings. Genetic theory predicts that
these traits will appear in the ratio 9:3:3:1
(yellow & normal, yellow & short, ebony &
normal, ebony & short). A researcher checks
100 such flies and finds the distribution of
traits to be 59, 20, 11, and 10, respectively.
What are the expected counts? df?
We expect 9/16 of the 100 flies to have yellow
bodies and normal wings. (Y & N)
Expected counts:
Y & N = 56.25
Y & S = 18.75
E & N = 18.75
E & S = 6.25
Since there are 4 categories, df = 4 – 1 = 3
Are the results consistent with the
theoretical distribution predicted by the
genetic model? (5% level of significance)
CASELETS (con’t)
Assumption:
All expected counts are greater than 5.
Expected counts:
Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25
H0: The distribution of fruit flies is the same as the theoretical
model.
Ha: The distribution of fruit flies is not the same as the
theoretical model.
Number  Obser. Freq. (O)  Expec. Freq. (E)  (O-E)   (O-E)²    (O-E)²/E
1       59                56.25             2.75    7.5625    0.134
2       20                18.75             1.25    1.5625    0.083
3       11                18.75             -7.75   60.0625   3.203
4       10                6.25              3.75    14.0625   2.25
CASELETS (con’t)
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 5.671$$
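The fruit-fly computation can be reproduced from the 9:3:3:1 ratio with a brief sketch. The critical value 7.815 (3 d.f., 5% level) is the standard tabulated value and is an assumption here, as the slides do not state it.

```python
ratio = [9, 3, 3, 1]           # Y & N, Y & S, E & N, E & S from the genetic model
observed = [59, 20, 11, 10]    # counts reported in the caselet
total = sum(observed)          # 100 flies

# Expected count for each category is its share of the ratio times the total.
expected = [total * part / sum(ratio) for part in ratio]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Assumption: standard table value for 3 d.f. at the 5% level
# (not quoted on the slides).
critical_value = 7.815
consistent_with_model = chi_square <= critical_value

print(expected, round(chi_square, 3), df, consistent_with_model)
```

Under that assumed critical value, the calculated statistic is smaller, so the observed counts would be judged consistent with the genetic model.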
$$\chi^2 = \frac{(23 - 21.3)^2}{21.3} + \frac{(20 - 21.3)^2}{21.3} + \cdots + \frac{(29 - 21.3)^2}{21.3} = 5.094$$
CASELETS (con’t)
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 5.094$$
CASELETS
4. Records taken of the number of male and female
births in 800 families having four children are given as
follows:
No. of births      Frequency
Male    Female
0       4          32
1       3          178
2       2          290
3       1          236
4       0          64
Test whether the data are consistent with the
hypothesis that the binomial law holds and the chance
of a male birth is equal to that of a female birth.
CASELETS (con’t)
Let us set up the null hypothesis that the data are consistent
with the binomial law of equal probability for male and female
births.
We are given n = 4, N = 800
According to the binomial probability law, the frequency
of r male births is given by:
$$F(r) = N \, p(r) = N \cdot {}^{n}C_{r} \, p^{r} q^{n-r} = 800 \cdot {}^{4}C_{r} \, (0.5)^{r} (0.5)^{4-r} = 50 \cdot {}^{4}C_{r}, \quad r = 0, 1, 2, 3, 4$$
No. of male births   Expected frequency F(r)
0                    50 * 4C0 = 50
1                    50 * 4C1 = 200
2                    50 * 4C2 = 300
3                    50 * 4C3 = 200
4                    50 * 4C4 = 50
Total                800
CASELETS (con’t)
No. of male births  Obser. Freq. (O)  Expec. Freq. (E)  (O-E)  (O-E)²  (O-E)²/E
0                   32                50                -18    324     6.48
1                   178               200               -22    484     2.42
2                   290               300               -10    100     0.33
3                   236               200               36     1296    6.48
4                   64                50                14     196     3.92
Total                                                                  χ² = 19.63
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 19.63$$
Since the calculated value of χ² = 19.63 is greater than the
tabulated value of χ², i.e., 9.488 (for 4 d.f. at the 5% level of
significance), the null hypothesis is rejected and we
conclude that the hypothesis of equal male and female births is
wrong.
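The binomial expected frequencies and the resulting statistic can be reproduced with a small sketch using only the Python standard library. The variable names are illustrative; the data and the critical value 9.488 come from the caselet.

```python
from math import comb

observed = [32, 178, 290, 236, 64]   # families with 0, 1, 2, 3, 4 male births
n = 4                                # children per family
N = sum(observed)                    # 800 families
p = 0.5                              # chance of a male birth under H0

# F(r) = N * nCr * p^r * q^(n-r), which reduces to 50 * 4Cr here.
expected = [N * comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical_value = 9.488               # tabulated value for 4 d.f. at 5%, as quoted
reject_h0 = chi_square > critical_value

print(expected, round(chi_square, 2), reject_h0)
```

No parameter is estimated from the data here because p = 0.5 is fixed by the hypothesis, which is why the degrees of freedom stay at 5 − 1 = 4.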
CASELETS
5. A company says its premium mixture of nuts contains
10% Brazil nuts, 20% cashews, 20% almonds, 10%
hazelnuts and 40% peanuts. You buy a large can and
separate the nuts. Upon weighing them, you find there
are 112 g of Brazil nuts, 183 g of cashews, 207 g of almonds,
71 g of hazelnuts, and 446 g of peanuts. You wonder
whether your mix is significantly different from what the
company advertises.
Why is the chi-square goodness-of-fit test NOT
appropriate here?
Because we do NOT have counts of each type of nut.
What might you do instead of weighing the nuts in
order to use chi-square?
We could count the number of each type of nut and
then perform a χ² test.
Practice CASELETS
1. The following figures show the distribution of digits in
numbers chosen at random from a telephone directory:
Digit     : 0    1    2   3   4    5   6    7   8   9
Frequency : 1026 1107 997 966 1075 933 1107 972 964 853
Test whether the digits may be taken to occur equally
frequently in the directory. (Tabular value for 9 d.f. at the 5%
level of significance is 16.92.)
2. The numbers of scooter accidents per month in a certain
town were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Are these frequencies in agreement with the belief that
accident conditions were the same during this 10-month
period? (Tabular value for 9 d.f. at the 5% level is 16.92.)
χ² test for independence
Hypotheses – written in words
H0: two variables are
independent
H1: two variables are dependent
CASELETS
1. A beef distributor wishes to determine
whether there is a relationship between
geographic region and cut of meat preferred.
If there is no relationship, we will say that
beef preference is independent of geographic
region. Suppose that, in a random sample of
500 customers, 300 are from the North and
200 from the South. Also, 150 prefer cut A,
275 prefer cut B, and 75 prefer cut C.
Also suppose that in the actual sample of 500
consumers the observed numbers were as
follows:
CASELETS (con’t)
North South Total
Cut C 50 25 75
Assuming H0 is true,
expected count = (row total × column total) / table total
Degrees of freedom
df = (r − 1)(c − 1)
Or cover up one row & one
column & count the number of
cells remaining!
CASELETS (con’t)
If beef preference is independent of
geographic region, how would we expect this
table to be filled in?
        North  South  Total
Cut A   90     60     150
Cut B   165    110    275
Cut C   45     30     75
Total   300    200    500
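The expected table under independence can be derived from the marginal totals alone. The sketch below assumes the row and column totals stated in the caselet (150, 275, 75 by cut; 300, 200 by region) and uses illustrative variable names.

```python
row_totals = {"Cut A": 150, "Cut B": 275, "Cut C": 75}
column_totals = {"North": 300, "South": 200}
table_total = 500

# Under H0 each expected count is (row total * column total) / table total.
expected = {
    cut: {
        region: row_total * column_total / table_total
        for region, column_total in column_totals.items()
    }
    for cut, row_total in row_totals.items()
}

# Degrees of freedom for an r x c table: (r - 1)(c - 1).
df = (len(row_totals) - 1) * (len(column_totals) - 1)

for cut, counts in expected.items():
    print(cut, counts)
print("df =", df)
```

With 3 rows and 2 columns the sketch gives df = 2, which matches the "cover up one row and one column" shortcut.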
CASELETS
2. In a certain sample of 2000 families, 1400 families
are consumers of tea. Out of 1800 Hindu families,
1236 families consume tea. Use the Chi-Square test and
state whether there is any significant difference
between consumption of tea among Hindu and non-Hindu
families. (5% level of significance)
CASELETS
3. A sample of 400 under-graduate students and 400
post-graduate students was taken to know
their opinion about autonomous colleges. 290 of the
under-graduate and 310 of the post-graduate students
favored the autonomous status. Present these facts in
the form of a table and test, at the 5% level, that the opinion
regarding autonomous status of colleges is independent of
the level of classes of students.
Solution
Observed Frequencies
Class            Number of Students     Total
                 Favoring   Opposing
Under Graduate   290        110         400
Post Graduate    310        90          400
Total            600        200         800
CASELETS (con’t)
Expected Counts
• Assuming H0 is true,
$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$
Class            Number of Students     Total
                 Favoring   Opposing
Under Graduate   300        100         400
Post Graduate    300        100         400
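As a rough sketch, the student caselet can be carried through to the test statistic as follows. The slides do not show the final value or the critical value; 3.841 is the standard tabulated χ² value for 1 d.f. at the 5% level and is an assumption here. No continuity correction is applied in this sketch.

```python
observed = [
    [290, 110],   # under-graduate: favoring, opposing
    [310, 90],    # post-graduate: favoring, opposing
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count = (row total * column total) / table total under H0.
expected = [[r * c / table_total for c in column_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(column_totals) - 1)

# Assumption: standard table value for 1 d.f. at the 5% level
# (not quoted on the slides).
critical_value = 3.841
reject_independence = chi_square > critical_value

print(expected, round(chi_square, 3), df, reject_independence)
```

Under that assumed critical value the statistic (about 2.67) falls short, so this sketch would not reject independence between opinion and class level.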
CASELETS
4. Suppose that, in a public opinion survey, answers to
the questions
(a) Do you drink?
(b) Are you in favor of local option on sale of liquor?
were as given in the table
Question (b)   Question (a)   Total
               Yes    No
Yes            56     31      87
No             18     6       24
Total          74     37      111
Can you infer that opinion on local option is dependent
Expected counts:
Question (b)   Question (a)   Total
               Yes    No
Yes            58     29      87
No             16     8       24
Total          74     37      111
CASELETS (con’t)
Assumptions:
All expected counts are greater than 5.
Yates correction for Continuity
If any cell frequency in a 2×2 table is less than 5,
then for the application of the Chi-Square test it has
to be pooled with the preceding or succeeding
frequency so that the total is greater than 5. This
results in the loss of 1 d.f. Since for a 2×2 table
d.f. = (2−1)×(2−1) = 1, the d.f. left after
adjusting for pooling are ν = 1 − 1 = 0, which is
absurd. In such a situation we apply Yates correction
for ‘continuity’. In this method we add 0.5 to the
cell frequency which is less than 5 and adjust
the remaining frequencies accordingly, since row
and column totals are fixed, and then apply the
Chi-Square test.
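A common way to write the corrected statistic, assumed here rather than taken from the slides, is to shrink every |O − E| by 0.5 before squaring. The sketch below uses a hypothetical helper `chi_square_2x2` and, purely for illustration, the liquor-survey table from the earlier caselet; its expected counts all exceed 5, so the slide's rule would not require the correction there.

```python
def chi_square_2x2(observed, yates=False):
    """Chi-square statistic for a 2x2 table, optionally Yates-corrected."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * column_totals[j] / table_total
            deviation = abs(o - e)
            if yates:
                # Continuity correction: reduce |O - E| by 0.5, not below 0.
                deviation = max(deviation - 0.5, 0.0)
            statistic += deviation ** 2 / e
    return statistic


observed = [[56, 31], [18, 6]]   # survey table from the earlier caselet
uncorrected = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
print(round(uncorrected, 4), round(corrected, 4))
```

The corrected value is always the smaller of the two, which makes the test more conservative for small tables.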
χ² test for homogeneity
Assumptions & formula remain
the same!
Expected counts & df are found
the same way as in the test for
independence.