
Project 2 – Factor Hair Revised Case Study

By:
Shreya Garg
Table of Contents

1 Project Objective
2 Exploratory Data Analysis – Step by step approach
2.1 Environment Set up and Data Import
2.1.1 Install necessary Packages and Invoke Libraries
2.1.2 Set up working Directory
2.1.3 Import and Read the Dataset
2.2 Variable Identification
2.3 Bi-Variate Analysis
2.4 Missing Value Identification
3 Multicollinearity Evidence
4 Simple Linear Regression
5 PCA/Factor Analysis
6 Multiple Linear Regression
7 Model Validity
1 Project Objective

The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum regression model to predict satisfaction.
This report will consist of the following:

• Importing the dataset in R
• Understanding the structure of the dataset
• Graphical exploration
• Descriptive statistics
• Insights from the dataset
• Performing PCA/Factor Analysis
• Building an optimum regression model to predict satisfaction

2 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import
2. Variable Identification
3. Bi-Variate Analysis
4. Missing Value Identification

2.1 Environment Set up and Data Import

2.1.1 Install necessary Packages and Invoke Libraries


The following packages were installed and their libraries invoked for the analysis:
1. readr – provides a fast and friendly way to read rectangular data (such as 'csv', 'tsv' and 'fwf' files); our dataset is in .csv format, which readr handles.
2. ggplot2 – a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
3. psych – includes functions most useful for personality and psychological research. Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, error.bars and error.dots) are useful for basic data entry and descriptive analyses.
4. nFactors – helps determine the number of factors for factor analysis.
5. caTools – contains several basic utility functions, including moving (rolling, running) window statistics, read/write for GIF and ENVI binary files, fast calculation of AUC, a LogitBoost classifier, a base64 encoder/decoder, and round-off-error-free sum and cumsum.
6. corrplot – provides a graphical display of a correlation matrix, confidence interval or general matrix.
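A minimal sketch of the setup in R (install once, then invoke the libraries each session):

for (p in c("readr", "ggplot2", "psych", "nFactors", "caTools", "corrplot")) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)   # install if absent
  library(p, character.only = TRUE)                               # invoke the library
}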
2.1.2 Set up working Directory
Setting the working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is simply the location/folder on the PC where the data, code and other files related to the project are stored.
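For example (the path below is a placeholder; substitute your own project folder):

setwd("C:/Projects/FactorHair")   # hypothetical path to the project folder
getwd()                           # confirm the working directory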

2.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing
the file.
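A minimal sketch, assuming the file sits in the working directory; the object names mydata, newdata and newdata_cor match those used later in this report, and the first column is assumed to be a row identifier:

mydata <- read.csv("Factor-Hair-Revised.csv", header = TRUE)   # import the dataset
newdata <- mydata[ , -1]               # drop the ID column (assumed to be column 1)
newdata_cor <- cor(newdata[ , 1:11])   # correlation matrix of the 11 predictors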

Please refer to Appendix A for the source code.

2.2 Variable Identification

A total of 12 variables were used for market segmentation in the context of product service management. From this data we are asked to first check for multicollinearity among the variables. Factor analysis is then performed to reduce the 11 independent variables to 4 factors. Finally, linear regression is performed with customer satisfaction as the dependent variable and the four factors as the independent variables.
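A short sketch of the commands typically used for this step:

dim(newdata)        # number of observations and variables
str(newdata)        # variable names and data types
describe(newdata)   # descriptive statistics (psych package)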

2.3 Bi-Variate Analysis

The scatter plots of all 12 variables show the distribution and variance of each of these variables. All of the variables follow an approximately normal, bell-shaped distribution.
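A sketch of how such plots can be produced, using pairs.panels from the psych package loaded earlier:

pairs.panels(newdata)   # pairwise scatter plots with histograms on the diagonal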

2.4 Missing Value Identification

The given dataset does not contain any missing values. This was checked in RStudio by running any(is.na(dataset)), which returned FALSE.
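A minimal sketch of the check, using the newdata object from above:

anyNA(newdata)            # TRUE if any value is missing; FALSE for this dataset
colSums(is.na(newdata))   # count of missing values per variable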

3 Multicollinearity Evidence

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables.
The given dataset shows a high degree of multicollinearity among the independent variables, which can be demonstrated through the following tests and analysis:

1. The corrplot command displays a pictorial representation of the correlation matrix. It indicates the presence of correlations between the variables; the circles with darker shades of blue show stronger correlations.
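A sketch of the plotting call (the method and layout options are illustrative):

corrplot(newdata_cor, method = "circle", type = "lower", tl.col = "black")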

The same correlations can also be examined numerically to get a clearer picture:
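For instance, printing the correlation matrix to two decimal places:

round(newdata_cor, 2)   # numeric view of the pairwise correlations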

2. After studying the above graphs, we can perform the following tests to confirm the multicollinearity in the dataset.

a) Bartlett's Test of Sphericity – an inferential statistic used to test whether the correlation matrix is an identity matrix, i.e. whether the variables are uncorrelated with one another.
The null hypothesis states that the variables are uncorrelated (the correlation matrix is an identity matrix), whereas the alternative hypothesis states that at least one pair of variables is significantly correlated.

After performing the Bartlett Test in R, we get the following output:

print(cortest.bartlett(newdata_cor, nrow(newdata)))
$chisq
[1] 619.2726

$p.value
[1] 1.79337e-96

$df
[1] 55
From the above output, we can clearly infer that the p-value is much smaller than the significance level of 0.05, hence we reject the null hypothesis.

Therefore, we accept the alternative hypothesis that at least one pair of variables is significantly correlated, which is evidence of multicollinearity. This means it is possible to reduce the dimensionality of the dataset.

b) Kaiser-Meyer-Olkin (KMO) Test – It is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance.

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = newdata_cor)
Overall MSA = 0.65
MSA for each item =
ProdQual Ecom TechSup CompRes Advertising ProdLine SalesFImage
0.51 0.63 0.52 0.79 0.78 0.62 0.62
ComPricing WartyClaim OrdBilling DelSpeed
0.75 0.51 0.76 0.67

From the above test output, we can infer that the data is suitable for factor analysis, as the overall MSA > 0.5.

A KMO value over 0.5 and a significance level for the Bartlett’s test below
0.05 suggest there is substantial correlation in the data.
Variable collinearity indicates how strongly a single variable is correlated
with other variables.
KMO measures are also calculated for each variable. Values above 0.5
are acceptable.

Alternative way: we can use the VIF function to check for evidence of multicollinearity in the Factor-Hair-Revised.csv data. We need to check whether any high or moderate correlation exists between the independent variables. For this, we test the data with the following R code.

Code in R (Using VIF function)

vif(lm(Satisfaction~ProdQual+Ecom+TechSup+CompRes+Advertising+Pr
odLine+SalesFImage+ComPricing+WartyClaim+OrdBilling+DelSpeed,
data= newdata))
ProdQual Ecom TechSup CompRes
1.635797 2.756694 2.976796 4.730448
Advertising ProdLine SalesFImage ComPricing
1.508933 3.488185 3.439420 1.635000
WartyClaim OrdBilling DelSpeed
3.198337 2.902999 6.516014

VIF status of predictors:

VIF = 1          Not correlated
1 < VIF < 5      Moderately correlated
5 < VIF <= 10    Highly correlated

As we can see, most predictors have VIF values between 1 and 5 (moderately correlated), while DelSpeed has a VIF above 5 (highly correlated), confirming multicollinearity in the dataset.

4 Simple Linear Regression
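Each of the eleven models below regresses Satisfaction on a single predictor. A compact sketch of how these fits could be produced in one loop (object names as defined earlier):

for (p in setdiff(names(newdata), "Satisfaction")) {
  f <- as.formula(paste("Satisfaction ~", p))   # e.g. Satisfaction ~ ProdQual
  print(summary(lm(f, data = newdata)))         # one simple regression per predictor
}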

1. Satisfaction~ProdQual

Call:
lm(formula = Satisfaction ~ ProdQual)

Residuals:
Min 1Q Median 3Q Max
-1.88746 -0.72711 -0.01577 0.85641 2.25220

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
ProdQual 0.41512 0.07534 5.510 2.90e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.047 on 98 degrees of freedom


Multiple R-squared: 0.2365, Adjusted R-squared: 0.2287
F-statistic: 30.36 on 1 and 98 DF, p-value: 2.901e-07

2. Satisfaction~Ecom

Call:
lm(formula = Satisfaction ~ Ecom)

Residuals:
Min 1Q Median 3Q Max
-2.37200 -0.78971 0.04959 0.68085 2.34580

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1516 0.6161 8.361 4.28e-13 ***
Ecom 0.4811 0.1649 2.918 0.00437 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.149 on 98 degrees of freedom


Multiple R-squared: 0.07994, Adjusted R-squared: 0.07056
F-statistic: 8.515 on 1 and 98 DF, p-value: 0.004368

3. Satisfaction~TechSup

Call:
lm(formula = Satisfaction ~ TechSup)

Residuals:
Min 1Q Median 3Q Max
-2.26136 -0.93297 0.04302 0.82501 2.85617

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.44757 0.43592 14.791 <2e-16 ***
TechSup 0.08768 0.07817 1.122 0.265
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.19 on 98 degrees of freedom


Multiple R-squared: 0.01268, Adjusted R-squared: 0.002603
F-statistic: 1.258 on 1 and 98 DF, p-value: 0.2647

4. Satisfaction~CompRes

Call:
lm(formula = Satisfaction ~ CompRes)

Residuals:
Min 1Q Median 3Q Max
-2.40450 -0.66164 0.04499 0.63037 2.70949

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.68005 0.44285 8.310 5.51e-13 ***
CompRes 0.59499 0.07946 7.488 3.09e-11 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9554 on 98 degrees of freedom


Multiple R-squared: 0.3639, Adjusted R-squared: 0.3574
F-statistic: 56.07 on 1 and 98 DF, p-value: 3.085e-11

5. Satisfaction~Advertising

Call:
lm(formula = Satisfaction ~ Advertising)

Residuals:
Min 1Q Median 3Q Max
-2.34033 -0.92755 0.05577 0.79773 2.53412

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6259 0.4237 13.279 < 2e-16 ***
Advertising 0.3222 0.1018 3.167 0.00206 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.141 on 98 degrees of freedom


Multiple R-squared: 0.09282, Adjusted R-squared: 0.08357
F-statistic: 10.03 on 1 and 98 DF, p-value: 0.002056

6. Satisfaction~ProdLine

Call:
lm(formula = Satisfaction ~ ProdLine)

Residuals:
Min 1Q Median 3Q Max
-2.3634 -0.7795 0.1097 0.7604 1.7373

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.02203 0.45471 8.845 3.87e-14 ***
ProdLine 0.49887 0.07641 6.529 2.95e-09 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 98 degrees of freedom


Multiple R-squared: 0.3031, Adjusted R-squared: 0.296
F-statistic: 42.62 on 1 and 98 DF, p-value: 2.953e-09

7. Satisfaction~SalesFImage

Call:
lm(formula = Satisfaction ~ SalesFImage)

Residuals:
Min 1Q Median 3Q Max
-2.2164 -0.5884 0.1838 0.6922 2.0728

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.06983 0.50874 8.000 2.54e-12 ***
SalesFImage 0.55596 0.09722 5.719 1.16e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.037 on 98 degrees of freedom


Multiple R-squared: 0.2502, Adjusted R-squared: 0.2426
F-statistic: 32.7 on 1 and 98 DF, p-value: 1.164e-07

8. Satisfaction~ComPricing

Call:
lm(formula = Satisfaction ~ ComPricing)

Residuals:
Min 1Q Median 3Q Max
-1.9728 -0.9915 -0.1156 0.9111 2.5845

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.03856 0.54427 14.769 <2e-16 ***
ComPricing -0.16068 0.07621 -2.108 0.0376 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.172 on 98 degrees of freedom


Multiple R-squared: 0.04339, Adjusted R-squared: 0.03363
F-statistic: 4.445 on 1 and 98 DF, p-value: 0.03756

9. Satisfaction~WartyClaim

Call:
lm(formula = Satisfaction ~ WartyClaim)

Residuals:
Min 1Q Median 3Q Max
-2.36504 -0.90202 0.03019 0.90763 2.88985

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3581 0.8813 6.079 2.32e-08 ***
WartyClaim 0.2581 0.1445 1.786 0.0772 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.179 on 98 degrees of freedom


Multiple R-squared: 0.03152, Adjusted R-squared: 0.02164
F-statistic: 3.19 on 1 and 98 DF, p-value: 0.0772

10. Satisfaction~OrdBilling

Call:
lm(formula = Satisfaction ~ OrdBilling)

Residuals:
Min 1Q Median 3Q Max
-2.4005 -0.7071 -0.0344 0.7340 2.9673

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0541 0.4840 8.377 3.96e-13 ***
OrdBilling 0.6695 0.1106 6.054 2.60e-08 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.022 on 98 degrees of freedom


Multiple R-squared: 0.2722, Adjusted R-squared: 0.2648
F-statistic: 36.65 on 1 and 98 DF, p-value: 2.602e-08
11. Satisfaction~DelSpeed

Call:
lm(formula = Satisfaction ~ ProdQual)

Residuals:
Min 1Q Median 3Q Max
-1.88746 -0.72711 -0.01577 0.85641 2.25220

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
ProdQual 0.41512 0.07534 5.510 2.90e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.047 on 98 degrees of freedom


Multiple R-squared: 0.2365, Adjusted R-squared: 0.2287
F-statistic: 30.36 on 1 and 98 DF, p-value: 2.901e-07

5 PCA/Factor Analysis

To address this multicollinearity, we perform factor analysis.

R code:

ev <- eigen(newdata_cor)
eigenValues <- ev$values
eigenVector <- ev$vectors

Factor <- 1:11                      # one label per eigenvalue of the 11 predictors
Scree <- data.frame(Factor, eigenValues)
plot(Scree, main = "Scree Plot", col = "Blue")
lines(Scree, col = "Red")

Based on the above scree plot and applying the Kaiser criterion, we retain the 4 factors whose eigenvalues are greater than 1 and drop the remaining factors whose eigenvalues are less than 1.

PCA (Unrotated factor loadings):

> Unrotate <- principal(newdata, nfactors = 4, rotate = "none")
> print(Unrotate, digits = 3)
Principal Components Analysis
Call: principal(r = newdata, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix

Mean item complexity = 1.9
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 39.023 with prob < 0.00177

Fit based upon off diagonal values = 0.968

Rotated Factor Loadings:

> Rotate <- principal(newdata, nfactors = 4, rotate = "varimax")
> print(Rotate, digits = 3)
Principal Components Analysis
Call: principal(r = newdata, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 39.023 with prob < 0.00177

Fit based upon off diagonal values = 0.968

RotateProfile=plot(Rotate,row.names(Rotate$loadings),cex=1.0)

Naming of Factors:

RC1 – Customer Interface: DelSpeed, OrdBilling, CompRes
RC2 – Market Presence: SalesFImage, Ecom, Advertising
RC3 – Value for Money: ProdQual, ProdLine, ComPricing
RC4 – Service Quality: TechSup, WartyClaim
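For the regression in the next section, the rotated scores can be collected into a data frame and labelled with these factor names (a sketch; the score columns are assumed to come out in the order RC1–RC4):

scores <- as.data.frame(Rotate$scores)
colnames(scores) <- c("CustomerInterface", "MarketPresence",
                      "ValueForMoney", "ServiceQuality")   # names chosen above
head(scores)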

6 Multiple Linear Regression

RC <- Rotate$scores                          # factor scores from the rotated solution
ModelRC = lm(Satisfaction ~ Rotate$scores)   # regress satisfaction on the four factor scores
summary(ModelRC)

Call:
lm(formula = Satisfaction ~ Rotate$scores)

Residuals:
Min 1Q Median 3Q Max
-1.6308 -0.4996 0.1372 0.4623 1.5228

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.07089 97.589 < 2e-16 ***
Rotate$scoresRC1 0.61805 0.07125 8.675 1.12e-13 ***
Rotate$scoresRC2 0.50973 0.07125 7.155 1.74e-10 ***
Rotate$scoresRC3 0.06714 0.07125 0.942 0.348
Rotate$scoresRC4 0.54032 0.07125 7.584 2.24e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7089 on 95 degrees of freedom


Multiple R-squared: 0.6605, Adjusted R-squared: 0.6462
F-statistic: 46.21 on 4 and 95 DF, p-value: < 2.2e-16

1. Confidence Interval for each of the RC components:

> confint(ModelRC,"Rotate$scoresRC1")
2.5 % 97.5 %
Rotate$scoresRC1 0.4766059 0.7594879
> confint(ModelRC,"Rotate$scoresRC2")
2.5 % 97.5 %
Rotate$scoresRC2 0.3682931 0.6511751
> confint(ModelRC,"Rotate$scoresRC3")
2.5 % 97.5 %
Rotate$scoresRC3 -0.07430515 0.2085769
> confint(ModelRC,"Rotate$scoresRC4")
2.5 % 97.5 %
Rotate$scoresRC4 0.398878 0.6817601

2. Predicted Model vs Actual Model

newdata1 = data.frame(RC1 = 0.71, RC2 = 0.66, RC3 = 0.13, RC4 = 0.76)   # example factor scores
prediction = predict(ModelRC, newdata1)
prediction
prediction = predict(ModelRC, newdata1, interval = "confidence")   # with confidence interval
prediction
Predicted = predict(ModelRC)       # fitted values from the model
Actual = mydata$Satisfaction       # observed satisfaction scores
Backtrack = data.frame(Actual, Predicted)
Backtrack
plot(Actual, col = "Red")
lines(Actual, col = "Red")
plot(Predicted, col = "Blue")
lines(Predicted, col = "Blue")
lines(Actual, col = "Red")         # overlay actual (red) on predicted (blue)

The actual values are plotted in red and the predicted values in blue. The plot shows that the predicted values are fairly close to the actual values; hence, the regression model developed captures the essence of the data well.

7 Model Validity

Since the p-value of the overall F-test for the multiple regression is less than alpha (0.05), we can reject the null hypothesis that all betas are zero. It can be concluded that at least one beta is non-zero, and hence we accept the alternative hypothesis. Overall, there is overwhelming evidence that the regression relationship exists in the population, meaning the linear model of Customer Satisfaction as a function of RC1, RC2, RC3 and RC4 is robust and statistically valid and can be used for predictive analytics. Note from the coefficient table that RC3 is not individually significant (p = 0.348), while RC1, RC2 and RC4 are highly significant.
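The overall F-test p-value cited above can be extracted directly from the fitted model (a sketch):

fstat <- summary(ModelRC)$fstatistic                  # value, numerator df, denominator df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # overall p-value of the regression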

