Chi Squared
function. By no means is the coverage exhaustive. The aim is to help the reader
appreciate the conceptual framework of chi-square analysis through problem
illustrations in marketing. The ideas presented in this article can certainly be extended
to many decision situations in marketing that can fruitfully employ chi-square tests.
Contents:
1. Chi-Square Analysis-Introduction
2. Chi-Square Test-Goodness of Fit
3. Chi-Square Test of Independence
Package Color    Preference by Consumers
Red                   70
Blue                 106
Green                 80
Pink                  70
Orange                74
Total                400
Do the consumer preferences for package colors show any significant difference?
Solution: If you look at the data, you may be tempted to infer that Blue is the most
preferred color. Statistically, you have to find out whether this preference could have
arisen due to chance. The appropriate test is the chi-square test of goodness of fit.
Null Hypothesis: All colors are equally preferred.
Alternative Hypothesis: They are not equally preferred.
Package    Observed           Expected           (O-E)^2    (O-E)^2/E
Color      Frequencies (O)    Frequencies (E)
Red              70                 80              100        1.250
Blue            106                 80              676        8.450
Green            80                 80                0        0.000
Pink             70                 80              100        1.250
Orange           74                 80               36        0.450
Total           400                400                        11.400
Please note that if the null hypothesis of equal preference for all colors is true,
the expected frequency for each color is 400/5 = 80. Applying the formula
$\chi^2 = \sum (O - E)^2 / E$,
we get the computed value of chi-square ($\chi^2$) = 11.400.
The critical value of $\chi^2$ at the 5% level of significance for 4 degrees of freedom
(5 colors - 1) is 9.488. Since 11.400 > 9.488, the null hypothesis is rejected. The
inference is that not all colors are equally preferred by the consumers. In particular,
Blue is the most preferred one. The marketing manager can introduce a blue package
in the market.
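The goodness-of-fit arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `goodness_of_fit` and the constant name are made up for illustration, the counts come from the table, and the critical value 9.488 is simply the figure quoted in the text rather than something the code derives.

```python
# Goodness-of-fit check for the package-color survey (counts from the table above).
observed = {"Red": 70, "Blue": 106, "Green": 80, "Pink": 70, "Orange": 74}


def goodness_of_fit(counts):
    """Return (chi_square, degrees_of_freedom) against equal expected counts."""
    total = sum(counts.values())
    expected = total / len(counts)  # 400 / 5 = 80 under the null hypothesis
    chi_square = sum((o - expected) ** 2 / expected for o in counts.values())
    return chi_square, len(counts) - 1


chi_square, df = goodness_of_fit(observed)

CRITICAL_5PCT_DF4 = 9.488  # critical value quoted in the text, not computed here
reject_null = chi_square > CRITICAL_5PCT_DF4

print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject_null}")
```

The sketch assumes equal expected frequencies, which is what the null hypothesis of this example states; a different null hypothesis would need its own expected counts.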
Cross Tabulation of Income versus Brand chosen (Figures in the cells represent
number of consumers)
                            Brands
Income          Brand1   Brand2   Brand3   Brand4   Total
Lower             25       15       55       65      160
Middle            30       25       35       30      120
Upper Middle      50       55       20       22      147
Upper             60       80       15       18      173
Total            165      175      125      135      600
Analyze the cross-tabulation data above using chi-square test of independence and
draw your conclusions.
Solution:
Null Hypothesis: There is no association between brand preference and income
level (these two attributes are independent).
Alternative Hypothesis: There is an association between brand preference and income
level (these two attributes are dependent).
Let us take a level of significance of 5%.
In order to calculate the $\chi^2$ value, you need to work out the expected frequency in
each cell of the contingency table. In our example, there are 4 rows and 4 columns,
amounting to 16 cells, so there will be 16 expected frequencies. Each expected
frequency is (row total x column total) / grand total; for more on calculating
expected frequencies, please go through hyperstat. Relevant data tables are given
below:
Observed Frequencies (These are actual frequencies observed in the survey)
                            Brands
Income          Brand1   Brand2   Brand3   Brand4   Total
Lower             25       15       55       65      160
Middle            30       25       35       30      120
Upper Middle      50       55       20       22      147
Upper             60       80       15       18      173
Total            165      175      125      135      600
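Expected Frequencies (These are calculated on the assumption of the null hypothesis being
true: That is, income level and brand preference are independent)
                             Brands
Income          Brand1    Brand2    Brand3    Brand4     Total
Lower           44.000    46.667    33.333    36.000    160.000
Middle          33.000    35.000    25.000    27.000    120.000
Upper Middle    40.425    42.875    30.625    33.075    147.000
Upper           47.575    50.458    36.042    38.925    173.000
Total          165.000   175.000   125.000   135.000    600.000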
Note: The fractional expected frequencies are retained for the purpose of accuracy. Do
not round them.
Calculation:
Compute
$\chi^2 = \sum (O - E)^2 / E$.
There are 16 observed frequencies (O) and 16 expected frequencies (E). As in the case
of the goodness of fit, calculate this value. In our case, the computed $\chi^2$ = 131.76, as
shown below. Each cell in the table below shows $(O - E)^2 / E$:

Income          Brand1   Brand2   Brand3   Brand4
Lower             8.20    21.49    14.08    23.36
Middle            0.27     2.86     4.00     0.33
Upper Middle      2.27     3.43     3.69     3.71
Upper             3.24    17.30    12.28    11.25

There are 16 such cells. Adding all these 16 values, we get $\chi^2$ = 131.76.
The critical value of $\chi^2$ depends on the degrees of freedom. The degrees of freedom =
(the number of rows - 1) multiplied by (the number of columns - 1) in any contingency
table. In our case, there are 4 rows and 4 columns, so the degrees of freedom =
(4-1) x (4-1) = 9. At the 5% level of significance, the critical $\chi^2$ for 9 d.f. = 16.92.
Since 131.76 > 16.92, reject the null hypothesis and accept the alternative hypothesis.
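The expected frequencies and the chi-square total for this cross tabulation can be checked with a short Python sketch. The helper names `expected_frequencies` and `chi_square_statistic` are illustrative choices, not part of any library, and the critical value 16.92 is taken from the text. Because the code does not round the individual cells, its total agrees with the 131.76 above only up to that rounding.

```python
# Chi-square test of independence for the income-by-brand cross tabulation.
observed = [
    [25, 15, 55, 65],  # Lower
    [30, 25, 35, 30],  # Middle
    [50, 55, 20, 22],  # Upper Middle
    [60, 80, 15, 18],  # Upper
]


def expected_frequencies(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_frequencies(observed)
chi_square = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_5PCT_DF9 = 16.92  # critical value quoted in the text, not computed here
print(f"chi-square is about {chi_square:.2f} with {df} degrees of freedom")
print("reject H0:", chi_square > CRITICAL_5PCT_DF9)
```

Keeping the expected frequencies as unrounded floats follows the note above about retaining fractional values for accuracy.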
The inference is that brand preference is significantly associated with income level. Thus,
the choice of the brand depends on the income strata. Consumers in different income
strata prefer different brands. Specifically, consumers in the upper middle and upper
income groups prefer premium brands, while consumers in the lower and middle-income
categories prefer economy brands. The company should develop suitable
strategies to position its detergent products. In the marketplace, it should position
economy brands to the lower and middle-income categories and premium brands to the
upper middle and upper income categories.
Link at:
http://davidmlane.com/hyperstat/viswanathan/chi_square_marketing.html
Here you could compare sales revenues of each product type before
and after the change in product mix. Thus the categories in variable
X would include all the product types and the categories in variable
Y would include period 1 and period 2.
3. A final, somewhat classic application of the chi-square test of
independence is to verify the influence of gender on purchase
decisions. Are men the primary decision makers when it comes to
purchasing big-ticket items? Is gender a factor in the color preference
for a car? Here variable X would be gender and variable Y would be
color.
No matter the business analytics problem, the chi-square test is
useful when you are trying to establish or rule out
a relationship between two given business
parameters that are categorical (or nominal) data types.
The chi-squared test of independence is a very useful tool for any
predictive analytics professional. What other types of business
problems are best solved by using these tools?
Link at:
http://www.simafore.com/blog/bid/54594/How-to-use-Chi-Square-test-for-3common-business-analytics-problems
          Yes    No    Total
Female     45     5       50
Male       15    35       50
Total      60    40      100
This article starts with the theory, and then has guidelines for using the statistic:
UNDERSTANDING THE CALCULATIONS
When we eyeball our table above, it looks like women are much more likely to answer Yes, but is
it random variation or something we can count on? What Chi Square does is compare the actual
or Observed data we have from respondents with an Expected value. In our two questions, the
total answers are:
Female    50        Yes    60
Male      50        No     40
If there were no relationship between the questions, then you would Expect a table that allocates
those totals to look like this:
          Yes    No    Total
Female     30    20       50
Male       30    20       50
Total      60    40      100
(E = Expected, O = Observed, D = (O - E)^2 / E)

          Yes                         No                          Total
Female    E: 30  O: 45  D: 7.50       E: 20  O: 5   D: 11.25      E&O: 50
Male      E: 30  O: 15  D: 7.50       E: 20  O: 35  D: 11.25      E&O: 50
Total     E&O: 60                     E&O: 40                     E&O: 100  D: 37.5
Adding all the differences (the D values), we get a Total Chi Square of 37.5, which is yet
another interim value in this calculation. So on to the next stage.
Many statistics rely on a concept called Degrees of freedom. The details vary stat to stat, but it's
based on the number of variables involved in a calculation. For Chi Square, the degrees of
freedom are:
df = (# rows - 1) * (# columns - 1)
= ( 2 - 1) * ( 2 - 1) = 1
In our case we now have:
Degrees of freedom = 1
We have two more players, and those are the Probability and Critical Value.
Any time you have a statistic designed to "predict" for a larger population or tell you a value's
validity or reliability, part of the calculation is a level of confidence. Sometimes you'll see this
indicated as the level of risk such as 5%, and at other times it will be noted as the level of
certainty, 95%. For Chi Square, the tables are based on the level of risk, with common thresholds
of 10%, 5%, 2.5%, 1% and 0.1%. Each one of those risk levels has a Critical Value associated with it:
Probability    Critical Value
               when df = 1
10.0%           2.71
 5.0%           3.84
 2.5%           5.02
 1.0%           6.64
 0.1%          10.83
Our final step to calculate Chi Square is to compare our Total to the Critical Values. In our case,
37.5 > 10.83 which means it's even more than 99.9% significant. If instead we only came up with a
Total of 4.5, that's > 3.84 so we'd say it was 95% significant.
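The comparison against the critical values can be written as a small lookup. The sketch below is only an illustration: the function name `smallest_risk_level` is hypothetical, and the thresholds are copied from the df = 1 table above rather than computed from the chi-square distribution.

```python
# Critical values for df = 1, copied from the table above (risk level -> value).
CRITICAL_VALUES_DF1 = {
    0.10: 2.71,
    0.05: 3.84,
    0.025: 5.02,
    0.01: 6.64,
    0.001: 10.83,
}


def smallest_risk_level(total_chi_square, critical_values=CRITICAL_VALUES_DF1):
    """Return the smallest tabulated risk level whose critical value is exceeded.

    Returns None when the total does not exceed even the most lenient threshold.
    """
    passed = [risk for risk, crit in critical_values.items() if total_chi_square > crit]
    return min(passed) if passed else None


for total in (37.5, 4.5, 1.0):
    risk = smallest_risk_level(total)
    if risk is None:
        print(f"{total}: not significant at any tabulated level")
    else:
        print(f"{total}: significant at the {risk:.1%} risk level")
```

A table for a different number of degrees of freedom would need its own dictionary of critical values; the lookup itself stays the same.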
IN REAL LIFE
If you're lucky, you have survey software or a statistics program which will take your Observed
values and crunch everything for you; some won't even make you specify a probability first.
If you don't have an application which makes this easy, try the on-line calculator Kristopher J.
Preacher has posted on his site.
While Microsoft Excel has a CHITEST function, it takes a bit of hand work. You have to manually
generate all the Expected values, and what it returns is the probability, not the Total Chi Square
(our 37.5). If you have worked out the Total Chi Square yourself, the CHIDIST function converts it
to a probability once you manually give it the degrees of freedom.
TO SURVEYS
Question types:
Chi square can be used with any pair of single answer discrete questions. This includes:
Demographics
Likert scales
The answers do not need to be ordered, equal or symmetrical, just discrete. This is part of what
makes Chi Square a handy statistic for surveys.
"Mark all that apply" questions cannot be used as an individual respondent cannot exist in more
than one cell of our table. For example, a woman answering the survey can't appear in both the
Yes and No columns.
                      Good     Fair    Poor    Total
On-line       325      597      216      52    1,190
In Store    1,527    1,712      304      96    3,639
Total       1,852    2,309      520     148    4,829
                      Good     Fair    Poor     Total
On-line     27.3%    50.2%    18.2%    4.4%    100.0%
                                                1,190
In Store    42.0%    47.0%     8.4%    2.6%    100.0%
                                                3,639
Total       38.4%    47.8%    10.8%    3.1%    100.0%
            1,852    2,309      520     148     4,829
You still want to keep the count totals in the report so that readers know the relative sizes of the
groups.
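Converting a count table to row percentages while keeping the group sizes can be sketched as follows. The counts are the ones from the cross-tab above, kept in the table's column order; the function name `row_percentages` and the output layout are illustrative assumptions.

```python
# Counts from the cross-tab above; columns are kept in the table's order.
counts = {
    "On-line": [325, 597, 216, 52],
    "In Store": [1527, 1712, 304, 96],
}


def row_percentages(row):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(row)
    return [100 * value / total for value in row]


# Add the column totals as a final "Total" row, as in the report layout.
counts["Total"] = [sum(column) for column in zip(*counts.values())]

for label, row in counts.items():
    percents = "  ".join(f"{p:5.1f}%" for p in row_percentages(row))
    print(f"{label:<9}{percents}  (n = {sum(row):,})")
```

Printing the row count next to each line of percentages keeps the relative sizes of the groups visible to the reader.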
Cross-tabs can also be well suited to graphical views, including stacked bar charts, bar graphs and
line/profile graphs.
Combining categories:
This is used to increase the counts of cells when you have too many with infrequent responses, or
simply to clarify the relationships for your analysis.
With an ordered scale such as a 5 level Likert, this could take the form of combining the upper
and lower categories into a 3 level "Agree/Neither/Disagree" breakdown.
With unordered data such as product names, you might combine into categories. With city names
you might group the information into geographic regions or urban/rural classifications.
The main issue is to make sure the categories are sufficiently related that you're not masking a
relationship. When in doubt, first run the cross-tabulation and Chi Square on an expanded table,
then start combining.
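Collapsing a 5-level scale into 3 levels amounts to a simple mapping applied before counting. In the sketch below the Likert labels, the mapping, and the responses are all made up for illustration; only the idea of merging the upper and lower categories comes from the text.

```python
from collections import Counter

# Hypothetical 5-level Likert labels; this mapping to 3 levels is an assumption.
COLLAPSE = {
    "Strongly agree": "Agree",
    "Agree": "Agree",
    "Neither": "Neither",
    "Disagree": "Disagree",
    "Strongly disagree": "Disagree",
}


def combine_categories(responses, mapping=COLLAPSE):
    """Count responses after collapsing each answer into its broader category."""
    return Counter(mapping[answer] for answer in responses)


# Made-up responses, only to show the mechanics.
responses = [
    "Strongly agree", "Agree", "Agree", "Neither",
    "Disagree", "Strongly disagree", "Agree",
]

combined = combine_categories(responses)
print(dict(combined))
```

The same pattern would apply to unordered data, for example a mapping from city names to regions, with the caveat from the text that the merged categories should be sufficiently related.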
Link at:
http://www.practicalsurveys.com/reporting/chisquare.php
If sample data are displayed in a contingency table, the expected frequency count for
each cell of the table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results.
Significance level. Often, researchers choose significance levels equal to 0.01, 0.05,
or 0.10; but any value between 0 and 1 can be used.
Test method. Use the chi-square test for independence to determine whether there is
a significant relationship between two categorical variables.
Test statistic. The test statistic is a chi-square random variable ($\chi^2$) defined by the
following equation.
$\chi^2 = \sum [ (O_{r,c} - E_{r,c})^2 / E_{r,c} ]$
where $O_{r,c}$ is the observed frequency count at level r of Variable A and level c of
Variable B, and $E_{r,c}$ is the expected frequency count at level r of Variable A and
level c of Variable B.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null
hypothesis. Typically, this involves comparing the P-value to the significance level, and
rejecting the null hypothesis when the P-value is less than the significance level.
                Republican   Democrat   Independent   Row total
Male               200         150           50          400
Female             250         300           50          600
Column total       450         450          100         1000
Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results. We work through those
steps below:
State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample
data, we compute the degrees of freedom, the expected frequency counts, and the
chi-square test statistic. Based on the chi-square statistic and the degrees of
freedom, we determine the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
$E_{r,c} = (n_r * n_c) / n$
$E_{1,1}$ = (400 * 450) / 1000 = 180000/1000 = 180
$E_{1,2}$ = (400 * 450) / 1000 = 180000/1000 = 180
Interpret results. Since the P-value (0.0003) is less than the significance level
(0.05), we cannot accept the null hypothesis. Thus, we conclude that there is a
relationship between gender and voting preference.
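The remaining expected counts, the test statistic, and the P-value for this example can be worked out with the Python sketch below. It uses only the counts from the table and the formulas stated above; the one added ingredient is the closed form exp(-x/2) for the upper-tail probability of a chi-square variable, which is an assumption that holds only for 2 degrees of freedom.

```python
import math

# Observed counts from the voting table; rows are Male, Female.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E[r][c] = (n_r * n_c) / n
expected = [[nr * nc / n for nc in col_totals] for nr in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 the chi-square upper-tail probability has the closed form exp(-x / 2).
# This shortcut is valid only for df = 2, so guard against misuse.
assert df == 2
p_value = math.exp(-chi_square / 2)

print(f"expected = {expected}")
print(f"chi-square = {chi_square:.1f}, df = {df}, p-value = {p_value:.4f}")
```

For tables with other degrees of freedom, a statistics package or a printed chi-square table would be needed in place of the closed form.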
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the sampling
method was simple random sampling, the variables under study were categorical, and the
expected frequency count was at least 5 in each cell of the contingency table.
Link at:
http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP