AMA Assignment
Utsav Vadgama
2014IPM106 Section B
1. Use regression data. This dataset talks about the sales of different cereals and also explains the
amount of calories, protein, fats, etc. in each cereal. Further, it provides the insights on where
these cereals are located (variable-shelf) and the advertising amount spent on them. We also
know the weight and the cups available. (Total points=40)
a. Estimate and interpret a regression model with sales as DV and shelf, calories, protein, fat,
sodium, fiber, carbo, sugars, potass, vitamins, weight, cups, and adv as IVs (consider 0.05
significance level). Report the significance and the performance of the model.
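The normality of the dependent variable sales was checked first, with the following result:
Shapiro-Wilk normality test
data: reg_data$sales
W = 0.94507, p-value = 0.002304
The p-value (0.0023) is below 0.05, so the null hypothesis of normality is rejected: sales deviates
significantly from a normal distribution, even though the W statistic is fairly close to 1. Linear
regression, however, requires normality of the residuals rather than of the dependent variable itself,
so the model is still estimated.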
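The normality assumption in linear regression applies to the model residuals, so a Shapiro-Wilk test on the residuals is a useful complement to the test on sales. The following is a minimal sketch on simulated data; `sim_data`, `sim_fit`, `x` and `y` are hypothetical stand-ins, not the cereal dataset.

```r
# Simulated example (assumption: not the cereal data)
set.seed(1)
sim_data <- data.frame(x = rnorm(100))
sim_data$y <- 2 + 3 * sim_data$x + rnorm(100)

# Fit a linear model, then test the residuals for normality
sim_fit <- lm(y ~ x, data = sim_data)
res_test <- shapiro.test(residuals(sim_fit))

# A p-value above 0.05 gives no evidence against normal residuals
res_test$p.value
```

For the cereal model, the same call would be applied to the residuals of the fitted regression object.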
The regression model specified above was estimated, with the following results:
Call:
lm(formula = sales ~ shelf + calories + protein + fat + sodium +
    fiber + carbo + sugars + potass + vitamins + weight + cups +
    adv, data = reg_data)
Residuals:
    Min      1Q  Median      3Q     Max
-443561 -165896  -10392  116442  409273
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   182906.0   269565.0   0.679  0.50009
shelfMiddle   117127.0    84512.0   1.386  0.17099
shelfTop       71963.7    81271.4   0.885  0.37950
calories      -12187.0     5798.8  -2.102  0.03986 *
protein       101834.2    41679.7   2.443  0.01757 *
fat           149428.6    60355.6   2.476  0.01618 *
sodium           575.0      383.2   1.501  0.13876
fiber          63904.1    35734.5   1.788  0.07886 .
carbo          84887.3    27295.3   3.110  0.00288 **
sugars         84335.0    25929.1   3.253  0.00189 **
potass         -1305.8     1285.2  -1.016  0.31378
vitamins       -1059.6     1483.5  -0.714  0.47786
weight       -804621.2   420162.9  -1.915  0.06034 .
cups         -152483.3   149148.5  -1.022  0.31078
adv              805.4      701.5   1.148  0.25561
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
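The model has a low R^2 of 0.2409 and an even lower adjusted R^2 of 0.06082. The overall F statistic is
1.388 with a p-value of 0.2139, so the model as a whole is not significant at the 0.05 level. Although
calories, protein, fat, carbo and sugars are individually significant at 0.05, the model does not
establish a reliable overall relationship between sales and the independent variables.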
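The performance figures quoted above (R^2, adjusted R^2, F statistic and its p-value) can all be read from the object returned by `summary()` on an `lm` fit. The sketch below uses simulated data; `sim`, `fit`, `x1`, `x2` and `y` are hypothetical names, not the cereal variables.

```r
# Simulated example (assumption: not the cereal data)
set.seed(42)
sim <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
sim$y <- 1 + 0.5 * sim$x1 + rnorm(80)

fit <- lm(y ~ x1 + x2, data = sim)
s <- summary(fit)

# s$fstatistic is a named vector: value, numdf, dendf
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Collect the model-level performance measures in one place
c(r2 = s$r.squared,
  adj_r2 = s$adj.r.squared,
  F = unname(f["value"]),
  p = unname(p_overall))
```

The overall p-value is not stored in the summary object, which is why it is recomputed from the F distribution here.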
b. How is “fat” related to sales? (consider 0.1 significance level).
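Fat is positively related to sales. Its coefficient (149428.6, p = 0.01618) is significant at the 0.1
level, so, holding the other variables constant, a one-unit increase in fat is associated with an
increase in sales of about 149428.6 units. Because the overall model is not significant, this estimate
should be interpreted with caution.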
Sales do not differ significantly across shelf levels. In the regression, neither shelfMiddle
(p = 0.171) nor shelfTop (p = 0.380) is significant relative to the baseline shelf level.
A one-way ANOVA of sales on shelf leads to the same conclusion (F = 1.037, p = 0.359). The following
results were obtained:
            Df    Sum Sq   Mean Sq F value Pr(>F)
shelf        2 1.149e+11 5.746e+10   1.037  0.359
Residuals   74 4.099e+12 5.539e+10
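As a check on the ANOVA table, the F value and its p-value can be recomputed from the reported sums of squares and degrees of freedom. This is a small worked example using only the rounded figures printed above, so the results may differ from the table in the last digit.

```r
# Figures copied from the ANOVA table (rounded as printed)
ss_shelf <- 1.149e11
df_shelf <- 2
ss_resid <- 4.099e12
df_resid <- 74

# Mean squares are sums of squares divided by their degrees of freedom
ms_shelf <- ss_shelf / df_shelf
ms_resid <- ss_resid / df_resid

# F is the ratio of mean squares; the p-value is the upper tail of F(2, 74)
f_value <- ms_shelf / ms_resid
p_value <- pf(f_value, df_shelf, df_resid, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 3)
```

A p-value well above 0.05 is consistent with the conclusion that mean sales do not differ across shelf levels.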
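d <- file.choose() # Assumption: pick the cereal CSV interactively; d was not defined in the original script
reg_data <- read.csv(d)
str(reg_data) # To check that all the variables are specified correctly
shapiro.test(reg_data$sales) # Normality test on the dependent variable
reg_mod <- lm(sales ~ shelf + calories + protein + fat + sodium + fiber + carbo +
    sugars + potass + vitamins + weight + cups + adv, data = reg_data) # Regression model
summary(reg_mod)
library(ggpubr)
res.aov <- aov(sales ~ shelf, data = reg_data) # ANOVA for sales and shelf level
summary(res.aov) # Prints the ANOVA table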
a. Estimate and interpret a logit model (Model a) where dv=coke_selection and rest are IVs.
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9664  -0.4746  -0.3500   0.6083   2.4225
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)           0.7125     0.6021   1.183 0.236625
gender1               2.9010     0.2204  13.160  < 2e-16 ***
occupation1           0.8143     0.2203   3.696 0.000219 ***
country_of.origin1   -2.6236     0.3501  -7.493 6.71e-14 ***
price                -0.1598     0.3846  -0.416 0.677771
distribution          0.4770     0.3714   1.284 0.199036
adv_ratio            -0.3347     0.3670  -0.912 0.361731
satisfaction_avg     -0.6086     0.3752  -1.622 0.104775
competition           0.2463     0.3724   0.661 0.508386
storevisit_perweek   -0.4684     0.3707  -1.264 0.206378
health.conciousness  -0.3605     0.3782  -0.953 0.340464
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
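In Model A, gender, occupation and country of origin are the only significant independent variables
(all with p < 0.001); gender and occupation have positive coefficients and country of origin a negative
one. The model has an AIC of 592.67. With a 0.5 probability cutoff, the confusion matrix gives a
prediction accuracy of 84%.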
> table(log_mod ,log_data$coke.selection)
log_mod   0   1
    No  352  46
    Yes  74 278
> (352+278)/750
[1] 0.84
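Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The sketch below does this for the three significant Model A coefficients, with the values copied from the output above; `coefs` and `odds_ratios` are illustrative names.

```r
# Significant Model A coefficients, copied from the reported output
coefs <- c(gender1 = 2.9010,
           occupation1 = 0.8143,
           country_of.origin1 = -2.6236)

# exp() converts log-odds coefficients into odds ratios
odds_ratios <- exp(coefs)

round(odds_ratios, 3)
```

An odds ratio above 1 means the odds of selecting Coke are higher than for the reference level, holding the other variables constant, and a value below 1 means they are lower. With the fitted model object available, `exp(coef(...))` would give the same quantities for every term.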
Model B:
Call:
glm(formula = coke.selection ~ gender + occupation + country_of.origin +
    price + distribution + adv_ratio, family = binomial(link = "logit"),
    data = log_data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8476  -0.4558  -0.3696   0.6734   2.4219
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.1272     0.4658   0.273 0.784810
gender1              2.8555     0.2169  13.165  < 2e-16 ***
occupation1          0.8065     0.2163   3.728 0.000193 ***
country_of.origin1  -2.6284     0.3490  -7.531 5.05e-14 ***
price               -0.1793     0.3803  -0.472 0.637211
distribution         0.5174     0.3707   1.396 0.162750
adv_ratio           -0.2997     0.3634  -0.825 0.409519
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In Model B, gender, occupation and country of origin are again the only significant independent
variables (all with p < 0.001), with the same signs as in Model A. The model has an AIC of 590.32. With
a 0.5 probability cutoff, the confusion matrix gives a prediction accuracy of 84.27%.
> table(log_mod2 ,log_data$coke.selection)
log_mod2   0   1
     No  348  40
     Yes  78 284
> (348+284)/750
[1] 0.8426667
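Accuracy alone hides how each model treats the two classes, so it can help to also compute sensitivity and specificity from the same confusion matrices. The sketch below uses the reported counts and assumes that rows are predictions, columns are observed values, and 1 (Coke selected) is the positive class; `confusion_metrics` is a hypothetical helper, not part of the original script.

```r
# Hypothetical helper: summary metrics from the four confusion-matrix cells
confusion_metrics <- function(tn, fn, fp, tp) {
  c(accuracy    = (tn + tp) / (tn + fn + fp + tp),
    sensitivity = tp / (tp + fn),  # share of actual 1s predicted "Yes"
    specificity = tn / (tn + fp))  # share of actual 0s predicted "No"
}

# Counts copied from the two confusion matrices
metrics_a <- confusion_metrics(tn = 352, fn = 46, fp = 74, tp = 278)
metrics_b <- confusion_metrics(tn = 348, fn = 40, fp = 78, tp = 284)

round(rbind(model_a = metrics_a, model_b = metrics_b), 4)
```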
The two models differ only slightly. Model B is preferable in terms of AIC because its AIC (590.32) is
slightly lower than Model A's (592.67), and its accuracy (84.27%) is slightly higher than Model A's
(84%). Because Model B is nested in Model A, the two models can also be compared with a likelihood-ratio
test using anova().
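The likelihood-ratio comparison can be approximated from the reported AIC values alone. This is a rough sketch, assuming AIC = -2 log-likelihood + 2k, where k is the number of estimated coefficients (11 for Model A, 7 for Model B), and assuming both models were fit to the same 750 observations.

```r
# Reported AIC values and number of estimated coefficients
aic_a <- 592.67; k_a <- 11   # Model A (full model)
aic_b <- 590.32; k_b <- 7    # Model B (reduced model)

# Recover -2 * log-likelihood from AIC = -2 * logLik + 2 * k
neg2ll_a <- aic_a - 2 * k_a
neg2ll_b <- aic_b - 2 * k_b

# Likelihood-ratio statistic and its chi-square p-value
lr_stat <- neg2ll_b - neg2ll_a
df_diff <- k_a - k_b
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

round(c(LR = lr_stat, df = df_diff, p = p_value), 3)
```

A p-value above 0.05 would indicate that the four variables dropped in Model B do not significantly improve the fit, which supports the simpler model. With the fitted glm objects available, calling `anova()` on the reduced and full models with `test = "Chisq"` should give the equivalent test directly.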
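p <- file.choose()
log_data <- read.csv(p, header = TRUE)
## Model A: all remaining variables as IVs (assumed call; categorical IVs assumed to be coded as factors)
mod_a <- glm(coke.selection ~ ., family = binomial(link = "logit"), data = log_data) ## mod_a is an illustrative name
mod.probs <- predict(mod_a, type = "response") ## Predicted probabilities from Model A
log_mod <- rep("No", 750)
log_mod[mod.probs > .5] <- "Yes" ## Assuming more than 50% probability means the person will buy coke
table(log_mod, log_data$coke.selection) ## To generate confusion matrix
## Model B: reduced model, as in the Call shown in the output (mod_b is an illustrative name)
mod_b <- glm(coke.selection ~ gender + occupation + country_of.origin + price + distribution +
    adv_ratio, family = binomial(link = "logit"), data = log_data)
mod.probs2 <- predict(mod_b, type = "response") ## Predicted probabilities from Model B
log_mod2 <- rep("No", 750)
log_mod2[mod.probs2 > .5] <- "Yes" ## Assuming more than 50% probability means the person will buy coke
table(log_mod2, log_data$coke.selection) ## To generate confusion matrix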