Chi Squared


In this article, an attempt is made to bring into sharp focus the use of chi-square (χ²)
analysis in the marketing function. The coverage is by no means exhaustive. The aim is to
make the reader appreciate the conceptual framework of chi-square analysis through problem
illustrations in marketing. The ideas presented in this article can certainly be extended
to many decision situations in marketing that can fruitfully employ chi-square tests.
Contents:
1. Chi-Square Analysis-Introduction
2. Chi-Square Test-Goodness of Fit
3. Chi-Square Test of Independence


1. Chi-Square (χ²) Analysis-Introduction


Consider the following decision situations:
1) Are all package designs equally preferred?
2) Are all brands equally preferred?
3) Is there any association between income level and brand preference?
4) Is there any association between family size and size of washing machine bought?
5) Are the attributes educational background and type of job chosen independent?
Answering these questions requires the help of chi-square (χ²) analysis. The first two
questions can be answered using the chi-square test of goodness of fit for a single
variable, while questions 3, 4, and 5 need the chi-square test of independence in a
contingency table. Please note that the variables involved in chi-square analysis are
nominally scaled. Nominal data are also known by two other names: categorical data and
attribute data.
The symbol χ² is used here to denote the chi-square distribution, whose shape
depends upon the number of degrees of freedom (d.f.). As we know, the chi-square
distribution is a skewed distribution, particularly with smaller d.f. As the d.f.
increase and become large, the distribution approaches normality.
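The shrinking skew can be made concrete with the standard result that a chi-square distribution with k degrees of freedom has skewness sqrt(8/k). The short Python sketch below is an illustration added here, not something taken from the original article, and the function name is hypothetical.

```python
import math


def chi_square_skewness(df: int) -> float:
    """Skewness of a chi-square distribution with `df` degrees of freedom."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return math.sqrt(8.0 / df)


# Skewness falls toward 0 (the value for a normal distribution) as d.f. grows.
for df in (1, 4, 9, 30, 100):
    print(f"d.f. = {df:3d}  skewness = {chi_square_skewness(df):.3f}")
```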
χ² tests are nonparametric or distribution-free in nature. This means that no
assumption needs to be made about the form of the original population
distribution from which the samples are drawn. Please note that all parametric
tests assume that the samples are drawn from a specified or assumed
distribution such as the normal distribution.
For a meaningful appreciation of the conditions/assumptions involved in using
chi-square analysis, please go through the contents of hyperstat on the chi-square
test meticulously.


2. Chi-Square Test-Goodness of Fit


A number of marketing problems involve decision situations in which it is important
for a marketing manager to know whether the pattern of observed frequencies fits well
with the expected ones. The appropriate test is the χ² test of goodness of fit.
The illustration given below will clarify the role of χ² when only one categorical
variable is involved.
Problem: In consumer marketing, a common problem that any marketing manager
faces is the selection of appropriate colors for package design. Assume that a
marketing manager wishes to compare five different colors of package design. He is
interested in knowing which of the five is the most preferred one so that it can be
introduced in the market. A random sample of 400 consumers reveals the following:

Package Color    Preference by Consumers
Red               70
Blue             106
Green             80
Pink              70
Orange            74
Total            400

Do the consumer preferences for package colors show any significant difference?

Solution: If you look at the data, you may be tempted to infer that Blue is the most
preferred color. Statistically, you have to find out whether this preference could have
arisen due to chance. The appropriate test is the χ² test of goodness of fit.
Null Hypothesis: All colors are equally preferred.
Alternative Hypothesis: They are not equally preferred.
Package    Observed           Expected           (O-E)²    (O-E)²/E
Color      Frequencies (O)    Frequencies (E)
Red         70                 80                 100        1.250
Blue       106                 80                 676        8.450
Green       80                 80                   0        0.000
Pink        70                 80                 100        1.250
Orange      74                 80                  36        0.450
Total      400                400                           11.400

Please note that if the null hypothesis of equal preference for all colors is true,
the expected frequency for each color is 400/5 = 80. Applying the formula

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

we get the computed value of chi-square (χ²) = 11.400.
The critical value of χ² at the 5% level of significance for 4 degrees of freedom
(5 categories - 1) is 9.488. Since 11.400 > 9.488, the null hypothesis is rejected.
The inference is that the colors are not all equally preferred by the consumers. In
particular, Blue is the most preferred one. The marketing manager can introduce the
blue package in the market.
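The goodness-of-fit arithmetic above can be retraced in a few lines of Python. This is a minimal sketch using only the counts and the 5% critical value quoted in the text; the variable names are illustrative.

```python
observed = {"Red": 70, "Blue": 106, "Green": 80, "Pink": 70, "Orange": 74}

total = sum(observed.values())        # 400 consumers
expected = total / len(observed)      # 80 per color under equal preference

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1                # 5 categories - 1 = 4

CRITICAL_5_PERCENT_DF4 = 9.488        # table value quoted in the text
reject_null = chi_square > CRITICAL_5_PERCENT_DF4

print(f"chi-square = {chi_square:.3f}, d.f. = {df}, reject H0: {reject_null}")
```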

3. Chi-Square Test of Independence


The goodness-of-fit test discussed above is appropriate for situations that involve one
categorical variable. If there are two categorical variables, and our interest is to
examine whether these two variables are associated with each other, the chi-square (χ²)
test of independence is the correct tool to use. This test is very popular in
analyzing cross-tabulations in which an investigator is keen to find out whether the
two attributes of interest have any relationship with each other.
The cross-tabulation is popularly called a contingency table. It contains
frequency data that correspond to the categorical variables in the rows and columns. The
marginal totals of the rows and columns are used to calculate the expected frequencies
that will be part of the computation of the χ² statistic. For calculations on expected
frequencies, refer to hyperstat on the χ² test.
Problem: A marketing firm producing detergents is interested in studying
consumer behavior in the context of the purchase decision for detergents in a specific
market. This company is a major player in the detergent market, which is characterized
by intense competition. It would like to know in particular whether the income level
of the consumers influences their choice of brand. Currently there are four brands
in the market. Brand 1 and Brand 2 are the premium brands while Brand 3 and Brand
4 are the economy brands.
A representative stratified random sampling procedure was adopted covering the
entire market using income as the basis of selection. The categories that were used in
classifying income level are: Lower, Middle, Upper Middle and Upper. A sample of
600 consumers participated in this study. The following data emerged from the study.

Cross Tabulation of Income versus Brand Chosen (figures in the cells represent the
number of consumers)

                         Brands
Income          Brand1   Brand2   Brand3   Brand4   Total
Lower               25       15       55       65     160
Middle              30       25       35       30     120
Upper Middle        50       55       20       22     147
Upper               60       80       15       18     173
Total              165      175      125      135     600

Analyze the cross-tabulation data above using chi-square test of independence and
draw your conclusions.

Solution:
Null Hypothesis: There is no association between brand preference and income
level (these two attributes are independent).
Alternative Hypothesis: There is an association between brand preference and income
level (these two attributes are dependent).
Let us take a level of significance of 5%.
In order to calculate the χ² value, you need to work out the expected frequency in
each cell of the contingency table. In our example, there are 4 rows and 4 columns,
amounting to 16 cells, so there will be 16 expected frequencies. For calculating
expected frequencies, please go through hyperstat. Relevant data tables are given
below:
Observed Frequencies (These are actual frequencies observed in the survey)

                         Brands
Income          Brand1   Brand2   Brand3   Brand4   Total
Lower               25       15       55       65     160
Middle              30       25       35       30     120
Upper Middle        50       55       20       22     147
Upper               60       80       15       18     173
Total              165      175      125      135     600

Expected Frequencies (These are calculated on the assumption of the null hypothesis being
true: that is, income level and brand preference are independent)

                           Brands
Income           Brand1    Brand2    Brand3    Brand4     Total
Lower            44.000    46.667    33.333    36.000   160.000
Middle           33.000    35.000    25.000    27.000   120.000
Upper Middle     40.425    42.875    30.625    33.075   147.000
Upper            47.575    50.458    36.042    38.925   173.000
Total           165.000   175.000   125.000   135.000   600.000

Note: The fractional expected frequencies are retained for the purpose of accuracy. Do
not round them.
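Each expected frequency is (row total x column total) / grand total. The Python sketch below recomputes the 16 expected values from the observed table; it is an illustration added here, with variable names chosen for readability.

```python
observed = [
    [25, 15, 55, 65],   # Lower
    [30, 25, 35, 30],   # Middle
    [50, 55, 20, 22],   # Upper Middle
    [60, 80, 15, 18],   # Upper
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print("  ".join(f"{e:7.3f}" for e in row))
```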

Calculation:
Compute

$$\chi^2 = \sum \frac{(O - E)^2}{E}.$$

There are 16 observed frequencies (O) and 16 expected frequencies (E). As in the case
of the goodness of fit, calculate this χ² value. In our case, the computed χ² = 131.76,
as shown below. Each cell in the table below shows (O-E)²/E.
Income          Brand1   Brand2   Brand3   Brand4
Lower             8.20    21.49    14.08    23.36
Middle            0.27     2.86     4.00     0.33
Upper Middle      2.27     3.43     3.69     3.71
Upper             3.24    17.30    12.28    11.25

There are 16 such cells. Adding all these 16 values, we get χ² = 131.76.
The critical value of χ² depends on the degrees of freedom. In any contingency table,
the degrees of freedom = (the number of rows - 1) multiplied by (the number of
columns - 1). In our case, there are 4 rows and 4 columns, so the degrees of freedom
= (4-1) x (4-1) = 9. At the 5% level of significance, the critical χ² for 9 d.f. = 16.92.
Since 131.76 > 16.92, reject the null hypothesis and accept the alternative hypothesis.
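The whole test of independence for the detergent data can be sketched in Python as follows. The code uses only the observed counts and the critical value 16.92 quoted in the text; everything else (names, layout) is illustrative.

```python
observed = [
    [25, 15, 55, 65],   # Lower
    [30, 25, 35, 30],   # Middle
    [50, 55, 20, 22],   # Upper Middle
    [60, 80, 15, 18],   # Upper
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all 16 cells, keeping E unrounded
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (4-1) * (4-1) = 9
CRITICAL_5_PERCENT_DF9 = 16.92                      # table value quoted in the text

print(f"chi-square = {chi_square:.2f}, d.f. = {df}")
print("reject H0" if chi_square > CRITICAL_5_PERCENT_DF9 else "do not reject H0")
```

Summing the unrounded cell values gives about 131.77; the 131.76 in the text is the sum of the cells after each has been rounded to two decimals. The conclusion is the same either way.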

The inference is that brand preference is highly associated with income level. Thus,
the choice of brand depends on the income stratum. Consumers in different income
strata prefer different brands. Specifically, consumers in the upper middle and upper
income groups prefer premium brands while consumers in the lower and middle income
categories prefer economy brands. The company should develop suitable
strategies to position its detergent products. In the marketplace, it should position
economy brands to the lower and middle income categories and premium brands to the upper
middle and upper income categories.

Link at:
http://davidmlane.com/hyperstat/viswanathan/chi_square_marketing.html

How to use Chi-Square test for 3 common business analytics problems

Background: The chi-square test of independence is a very useful
statistical tool that helps in identifying whether two variables are
related to each other. In a functional sense it is loosely similar to the
coefficient of determination R^2; the key difference is that the
chi-square test was developed to work with nominal or categorical
data, whereas standard R^2 works only with numerical data.
When would you use the chi-square test of independence:
Any business situation where you are essentially checking if one
variable, X is related to, or independent of, another variable, Y. The
use of chi-square test is indicated in any of the following business
scenarios.
1. Suppose you want to determine if certain types of products sell
better in certain geographic locations than others. A trivial example:
the type of shoes sold in winter depends strongly on whether a retail
outlet is located in the upper mid-west versus in the south. A slightly
more complicated example would be to check if the type of gasoline
sold in a neighborhood is indicative of the median income in the
region. So variable X would be the type of gasoline and variable Y
would be income ranges (e.g. <0k, 41k-50k, etc).
2. Suppose you want to test if altering your product mix (% of
upscale, mid-range and volume items, say) has impacted profits.

Here you could compare sales revenues of each product type before
and after the change in product mix. Thus the categories in variable
X would include all the product types and the categories in variable
Y would include period 1 and period 2.
3. A final, somewhat classic application of the chi-square test of
independence is to verify the influence of gender on purchase
decisions. Are men the primary decision makers when it comes to
purchasing big-ticket items? Is gender a factor in the color preference
for a car? Here variable X would be gender and variable Y would be
color.
No matter the business analytics problem, the chi-square test will
find uses when you are trying to establish or invalidate that
a relationship exists between two given business
parameters that are categorical (or nominal) data types.
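In each scenario the first practical step is to tally raw records into a contingency table of X categories by Y categories. The Python sketch below shows one way to do that; the records are hypothetical, made up purely for illustration, and are not data from the article.

```python
from collections import Counter

# Hypothetical survey records (made up for illustration): (gender, car color)
records = [
    ("Female", "Red"), ("Female", "Blue"), ("Male", "Blue"),
    ("Male", "Black"), ("Female", "Red"), ("Male", "Black"),
    ("Female", "Black"), ("Male", "Blue"),
]

counts = Counter(records)                    # frequency of each (X, Y) pair
rows = sorted({x for x, _ in records})       # categories of variable X
cols = sorted({y for _, y in records})       # categories of variable Y

# Contingency table: one row per X category, one column per Y category
table = [[counts[(x, y)] for y in cols] for x in rows]

print("        " + "  ".join(f"{c:>5}" for c in cols))
for x, row in zip(rows, table):
    print(f"{x:<8}" + "  ".join(f"{n:5d}" for n in row))
```

A real analysis would need far more records than this toy list before a chi-square test on the table would be meaningful.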
Chi-squared test of independence is a very useful tool for any
predictive analytics professional. What other type of business
problems are best solved by using these tools?

Link at:
http://www.simafore.com/blog/bid/54594/How-to-use-Chi-Square-test-for-3common-business-analytics-problems

UNDERSTANDING CHI SQUARE


Chi Square lets you know whether two groups have significantly different opinions, which makes it
a very useful statistic for survey research. It's applied to cross-tabulations (AKA pivot tables) which
are simply breakdowns like this:
          Yes    No    Total
Female     45     5       50
Male       15    35       50
Total      60    40      100

This article starts with the theory, and then has guidelines for using the statistic:

Understanding the calculations


Calculating Chi Square in real life
Applying Chi Square to surveys

UNDERSTANDING THE CALCULATIONS

When we eyeball our table above, it looks like women are much more likely to answer Yes, but is
it random variation or something we can count on? What Chi Square does is compare the actual
or Observed data we have from respondents with an Expected value. In our two questions, the
total answers are:

Female    50
Male      50
Yes       60
No        40

If there were no relationship between the questions, then you would Expect a table that allocates
those totals to look like this:
          Yes    No    Total
Female     30    20       50
Male       30    20       50
Total      60    40      100

The formula for the upper-left cell is:


(TotalYes * TotalFemale) / TotalTable
= (60 * 50) / 100 = 30
In less tidy examples, the Expected values often have a decimal or two. Once we have all the
Expected values, we need to find the difference squared (so they're all positive) between the
individual cells' Expected and Observed values:
$$D = \frac{(O - E)^2}{E}$$
          Yes         No          Total
Female    E: 30       E: 20       E&O: 50
          O: 45       O: 5
          D: 7.50     D: 11.25
Male      E: 30       E: 20       E&O: 50
          O: 15       O: 35
          D: 7.50     D: 11.25
Total     E&O: 60     E&O: 40     E&O: 100
                                  D: 37.5

Adding all the differences, we get a Total Chi Square of 37.5, which is yet another interim
value in this calculation. So on to the next stage.
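The Expected values, the per-cell differences D and the Total Chi Square for the Female/Male by Yes/No table can be reproduced with the Python sketch below. It uses only the counts shown above; the variable names are illustrative.

```python
observed = [
    [45, 5],    # Female: Yes, No
    [15, 35],   # Male:   Yes, No
]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [60, 40]
table_total = sum(row_totals)                      # 100

total_chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / table_total   # Expected value
        d = (o - e) ** 2 / e                              # difference for this cell
        total_chi_square += d
        print(f"row {i}, col {j}: E = {e:.0f}, O = {o}, D = {d:.2f}")

print(f"Total Chi Square = {total_chi_square}")
```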
Many statistics rely on a concept called Degrees of freedom. The details vary stat to stat, but it's
based on the number of variables involved in a calculation. For Chi Square, the degrees of
freedom are:

df = (# rows - 1) * (# columns - 1)
= ( 2 - 1) * ( 2 - 1) = 1
In our cast we now have:

Assorted Observed and Expected values

Total Chi Square = 37.5

Degrees of freedom = 1

We have two more players, and those are the Probability and Critical Value.
Any time you have a statistic designed to "predict" for a larger population or tell you a value's
validity or reliability, part of the calculation is a level of confidence. Sometimes you'll see this
indicated as the level of risk such as 5%, and at other times it will be noted as the level of
certainty, 95%. For Chi Square, the tables are based on the level of risk, with common thresholds
of 10%, 5%, 2.5%, 1% and 0.1%. Each one of those risk levels has a Critical Value associated with it:
Probability    Critical Value when df = 1
10.0%           2.71
 5.0%           3.84
 2.5%           5.02
 1.0%           6.64
 0.1%          10.83

(For more values, see the "Upper" table.)

Our final step is to compare our Total Chi Square to the Critical Values. In our case,
37.5 > 10.83, which means the result is significant even at the 0.1% risk level (more than
99.9% certainty). If instead we had only come up with a Total of 4.5, that's > 3.84 but
< 5.02, so we'd say it was significant at the 5% risk level (95% certainty).
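The comparison against the Critical Values can be sketched in Python as a simple lookup. The table values are copied from the df = 1 table above; the function names are hypothetical, and the exact probability uses the standard closed form for one degree of freedom, which is an addition not found in the original article.

```python
import math

# Critical values for df = 1, copied from the table above (risk level -> value)
CRITICAL_DF1 = {0.10: 2.71, 0.05: 3.84, 0.025: 5.02, 0.01: 6.64, 0.001: 10.83}


def strictest_risk_level(total_chi_square: float):
    """Smallest tabulated risk level whose critical value the total exceeds."""
    passed = [risk for risk, crit in CRITICAL_DF1.items() if total_chi_square > crit]
    return min(passed) if passed else None


def p_value_df1(total_chi_square: float) -> float:
    """Exact upper-tail probability for df = 1 (closed form via erfc)."""
    return math.erfc(math.sqrt(total_chi_square / 2.0))


for total in (37.5, 4.5, 1.0):
    print(total, strictest_risk_level(total), f"{p_value_df1(total):.4g}")
```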

CALCULATING CHI SQUARE IN REAL LIFE

If you're lucky, you have survey software or a statistics program which will take your
Observed values and crunch everything for you; some won't even make you specify a
probability first.
If you don't have an application which makes this easy, try the on-line calculator
Kristopher J. Preacher has posted on his site.
While Microsoft Excel has a CHITEST function, it takes a bit of hand work. You have to
manually generate all the Expected values, and what it returns is the probability rather
than the Total Chi Square (our 37.5). If you have computed the Total yourself, the CHIDIST
function converts it to a probability once you manually give it the degrees of freedom.

APPLYING CHI SQUARE TO SURVEYS

Question types:
Chi square can be used with any pair of single answer discrete questions. This includes:

Demographics

Likert scales

Cities, product names, instructor names, etc.

Dates once they've been grouped into periods

Numbers once they've been grouped into ranges

The answers do not need to be ordered, equal or symmetrical, just discrete. This is part of
what makes Chi Square a handy statistic for surveys.
"Mark all that apply" questions cannot be used as an individual respondent cannot exist in more
than one cell of our table. For example, a woman answering the survey can't appear in both the
Yes and No columns.

Presenting the information:


While the statistic has to be calculated on the counts, that's not necessarily the best approach for
our brains to spot patterns. For example in this table we have over 3 times the number of In Store
respondents as On-line:
            Excellent    Good    Fair    Poor    Total
On-line           325     597     216      52    1,190
In Store        1,527   1,712     304      96    3,639
Total           1,852   2,309     520     148    4,829

In a report, it's easier for our brains to compare percentages:


            Excellent    Good       Fair     Poor     Total
On-line     27.3%        50.2%      18.2%    4.4%     100.0% (1,190)
In Store    42.0%        47.0%      8.4%     2.6%     100.0% (3,639)
Total       38.4%        47.8%      10.8%    3.1%     100.0%
            (1,852)      (2,309)    (520)    (148)    (4,829)

You still want to keep the count totals in the report so that readers know the relative sizes of the
groups.
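Converting the counts to row percentages while keeping the count totals is a one-line calculation per row. The Python sketch below uses the On-line and In Store counts from the table above; the helper name is illustrative.

```python
labels = ["Excellent", "Good", "Fair", "Poor"]
counts = {
    "On-line":  [325, 597, 216, 52],
    "In Store": [1527, 1712, 304, 96],
}


def row_percentages(row):
    """Each count as a percentage of its row total."""
    total = sum(row)
    return [100.0 * n / total for n in row]


for group, row in counts.items():
    cells = "  ".join(
        f"{label}: {pct:.1f}%" for label, pct in zip(labels, row_percentages(row))
    )
    # Keep the count total next to the percentages so group sizes stay visible
    print(f"{group:<9} {cells}  (n = {sum(row):,})")
```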
Cross-tabs can also be well suited to graphical views, including stacked bar charts, bar graphs and
line/profile graphs.

Low count cells:


The guidelines on this vary, but if you have more than one cell with 5 or fewer respondents, the
final calculation may overstate your level of probability. If you do have this situation, either wait
on this statistic until you have more data, or combine categories.
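A quick check for low count cells might look like the Python sketch below. The threshold of 5 and the "more than one cell" rule are taken from the guideline as stated above, which the text itself notes varies between sources; the function name is hypothetical.

```python
def count_low_cells(table, threshold=5):
    """Count cells whose respondent count is at or below `threshold`."""
    return sum(1 for row in table for n in row if n <= threshold)


observed = [
    [45, 5],    # Female: Yes, No
    [15, 35],   # Male:   Yes, No
]

low_cells = count_low_cells(observed)
# Guideline from the text: more than one low count cell is a warning sign.
needs_attention = low_cells > 1

print(f"low count cells: {low_cells}, needs attention: {needs_attention}")
```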

Dropping answer options:


In our original example, our column scale might have been "Yes/Uncertain/No." If the Uncertain
column totaled 0, we would have to drop it as the Expected values for it would have all been 0.
This means the difference calculation would be attempting to divide by 0, which is challenging.
Completely empty rows or columns are the only answers you should ever drop. Even if there was
just 1 response in the Uncertain column, you need to include that individual in the table for the
statistic to be reliable. We can, however, combine Uncertain with Yes or No if needed.
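Dropping only the completely empty rows or columns can be sketched in Python as follows. The three-column layout (Yes, Uncertain, No) mirrors the hypothetical scale described above, and the function name is an assumption made for illustration.

```python
def drop_empty(table):
    """Remove rows and columns whose totals are 0; keep everything else."""
    rows = [row for row in table if sum(row) > 0]
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if sum(row[j] for row in rows) > 0]
    return [[row[j] for j in keep] for row in rows]


# Columns: Yes, Uncertain, No. Nobody answered Uncertain, so that column goes.
with_empty_column = [
    [45, 0, 5],
    [15, 0, 35],
]
print(drop_empty(with_empty_column))   # the Uncertain column is removed

# A single Uncertain response is enough to keep the column.
with_one_response = [
    [45, 1, 4],
    [15, 0, 35],
]
print(drop_empty(with_one_response))   # table is returned unchanged
```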

Combining categories:
This is used to increase the counts of cells when you have too many with infrequent responses, or
simply to clarify the relationships for your analysis.
With an ordered scale such as a 5 level Likert, this could take the form of combining the upper
and lower categories into a 3 level "Agree/Neither/Disagree" breakdown.
With unordered data such as product names, you might combine into categories. With city names
you might group the information into geographic regions or urban/rural classifications.
The main issue is to make sure the categories are sufficiently related that you're not masking a
relationship. When in doubt, first run the cross-tabulation and Chi Square on an expanded table,
then start combining.
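Collapsing a 5 level Likert scale into the 3 level "Agree/Neither/Disagree" breakdown is a matter of summing counts by group. In the Python sketch below the level labels and the counts are hypothetical, made up for illustration only.

```python
# Maps each 5 level answer to its 3 level group (labels are assumed here)
MERGE = {
    "Strongly agree": "Agree",
    "Agree": "Agree",
    "Neither": "Neither",
    "Disagree": "Disagree",
    "Strongly disagree": "Disagree",
}


def combine_levels(counts: dict) -> dict:
    """Sum the counts of the original levels into their combined groups."""
    combined = {}
    for level, n in counts.items():
        group = MERGE[level]
        combined[group] = combined.get(group, 0) + n
    return combined


# Hypothetical counts for one group of respondents (not from the article)
five_level = {
    "Strongly agree": 3, "Agree": 12, "Neither": 9,
    "Disagree": 14, "Strongly disagree": 2,
}
three_level = combine_levels(five_level)
print(three_level)
```

Combining leaves the number of respondents unchanged, which is a useful sanity check after any regrouping.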

Questions left blank:


In surveys respondents will often skip one or both of the questions in your comparison. If this
represents more than a couple people, you may want to add a "No Answer" or "Empty" row and
column. Just as with non-response sampling errors, sometimes there's a relationship in the people
who don't give an answer.
And that's Chi Square in a nutshell! (Or as close to nutshells as inferential statistics get.)

Link at:
http://www.practicalsurveys.com/reporting/chisquare.php

Chi-Square Test for Independence


This lesson explains how to conduct a chi-square test for independence. The test is
applied when you have two categorical variables from a single population. It is used to
determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female)
and voting preference (Democrat, Republican, or Independent). We could use a chi-square
test for independence to determine whether gender is related to voting preference.
The sample problem at the end of the lesson considers this example.

When to Use Chi-Square Test for Independence


The test procedure described in this lesson is appropriate when the following conditions are
met:

The sampling method is simple random sampling.

The variables under study are each categorical.

If sample data are displayed in a contingency table, the expected frequency count for
each cell of the table is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses


Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states
that knowing the level of Variable A does not help you predict the level of Variable B. That is,
the variables are independent.
H0: Variable A and Variable B are independent.
Ha: Variable A and Variable B are not independent.
The alternative hypothesis is that knowing the level of Variable A can help you predict the
level of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are related; but the
relationship is not necessarily causal, in the sense that one variable "causes" the other.

Formulate an Analysis Plan


The analysis plan describes how to use sample data to accept or reject the null hypothesis.
The plan should specify the following elements.

Significance level. Often, researchers choose significance levels equal to 0.01, 0.05,
or 0.10; but any value between 0 and 1 can be used.

Test method. Use the chi-square test for independence to determine whether there is
a significant relationship between two categorical variables.

Analyze Sample Data


Using sample data, find the degrees of freedom, expected frequencies, test statistic, and the
P-value associated with the test statistic. The approach described in this section is illustrated
in the sample problem at the end of this lesson.

Degrees of freedom. The degrees of freedom (DF) is equal to:
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the number of
levels for the other categorical variable.

Expected frequencies. The expected frequency counts are computed separately
for each level of one categorical variable at each level of the other categorical
variable. Compute r * c expected frequencies, according to the following formula.

$$E_{r,c} = \frac{n_r \cdot n_c}{n}$$

where $E_{r,c}$ is the expected frequency count for level r of Variable A and level c of
Variable B, $n_r$ is the total number of sample observations at level r of Variable A,
$n_c$ is the total number of sample observations at level c of Variable B, and n is the
total sample size.

Test statistic. The test statistic is a chi-square random variable (χ²) defined by the
following equation.

$$\chi^2 = \sum \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$$

where $O_{r,c}$ is the observed frequency count at level r of Variable A and level c of
Variable B, and $E_{r,c}$ is the expected frequency count at level r of Variable A and
level c of Variable B.

P-value. The P-value is the probability of observing a sample statistic at least as
extreme as the test statistic. Since the test statistic is a chi-square, use the Chi-Square
Distribution Calculator to assess the probability associated with the test statistic. Use
the degrees of freedom computed above.
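Where no distribution calculator is at hand, the P-value step can be approximated in code. The Python sketch below uses the standard closed-form series for the upper tail of a chi-square distribution with integer degrees of freedom; the function name is hypothetical, and in practice a statistics library would normally be used instead.

```python
import math


def chi_square_p_value(statistic: float, df: int) -> float:
    """Upper-tail probability P(X > statistic) for a chi-square with integer df."""
    if df < 1 or statistic < 0:
        raise ValueError("need df >= 1 and statistic >= 0")
    half = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    tail = math.erfc(math.sqrt(half))
    if df == 1:
        return tail
    # sum_{i=0}^{(df-3)/2} x^i / (1 * 3 * ... * (2i + 1))
    term, total = 1.0, 1.0
    for i in range(1, (df - 1) // 2):
        term *= statistic / (2 * i + 1)
        total += term
    return tail + math.sqrt(2.0 * statistic / math.pi) * math.exp(-half) * total


print(f"{chi_square_p_value(16.2, 2):.4f}")
```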

Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null
hypothesis. Typically, this involves comparing the P-value to the significance level, and
rejecting the null hypothesis when the P-value is less than the significance level.

Test Your Understanding


Problem
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were
classified by gender (male or female) and by voting preference (Republican, Democrat, or
Independent). Results are shown in the contingency table below.
                        Voting Preferences
                Republican   Democrat   Independent   Row total
Male                   200        150            50         400
Female                 250        300            50         600
Column total           450        450           100        1000

Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results. We work through those
steps below:

State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.

Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.

Analyze sample data. Applying the chi-square test for independence to sample
data, we compute the degrees of freedom, the expected frequency counts, and the
chi-square test statistic. Based on the chi-square statistic and the degrees of
freedom, we determine the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2

$$E_{r,c} = \frac{n_r \cdot n_c}{n}$$

$$
\begin{aligned}
E_{1,1} &= (400 \times 450) / 1000 = 180000/1000 = 180 \\
E_{1,2} &= (400 \times 450) / 1000 = 180000/1000 = 180 \\
E_{1,3} &= (400 \times 100) / 1000 = 40000/1000 = 40 \\
E_{2,1} &= (600 \times 450) / 1000 = 270000/1000 = 270 \\
E_{2,2} &= (600 \times 450) / 1000 = 270000/1000 = 270 \\
E_{2,3} &= (600 \times 100) / 1000 = 60000/1000 = 60
\end{aligned}
$$

$$\chi^2 = \sum \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$$

$$
\begin{aligned}
\chi^2 &= (200 - 180)^2/180 + (150 - 180)^2/180 + (50 - 40)^2/40 \\
       &\quad + (250 - 270)^2/270 + (300 - 270)^2/270 + (50 - 60)^2/60 \\
       &= 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60 \\
       &= 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
\end{aligned}
$$

where DF is the degrees of freedom, r is the number of levels of gender, c is the
number of levels of the voting preference, $n_r$ is the number of observations from
level r of gender, $n_c$ is the number of observations from level c of voting preference,
n is the number of observations in the sample, $E_{r,c}$ is the expected frequency count
when gender is level r and voting preference is level c, and $O_{r,c}$ is the observed
frequency count when gender is level r and voting preference is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom
is more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(χ² > 16.2) = 0.0003.
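The sample problem can be checked end to end with the Python sketch below. It uses the counts from the contingency table; the P-value line relies on the fact that for exactly 2 degrees of freedom the upper-tail probability is exp(-x/2), so that shortcut is an assumption that holds only for this DF.

```python
import math

observed = [
    [200, 150, 50],   # Male:   Republican, Democrat, Independent
    [250, 300, 50],   # Female: Republican, Democrat, Independent
]

row_totals = [sum(row) for row in observed]        # [400, 600]
col_totals = [sum(col) for col in zip(*observed)]  # [450, 450, 100]
n = sum(row_totals)                                # 1000

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1) * (3 - 1) = 2
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Closed form valid only when DF = 2: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi_square / 2.0)

print(f"DF = {df}, chi-square = {chi_square:.1f}, P-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```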

Interpret results. Since the P-value (0.0003) is less than the significance level
(0.05), we reject the null hypothesis. Thus, we conclude that there is a
relationship between gender and voting preference.

Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the sampling
method was simple random sampling, the variables under study were categorical, and the
expected frequency count was at least 5 in each cell of the contingency table.

Link at:
http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP
