
Project 2 – Factor Hair Revised Case Study

By:
Shreya Garg
Table of Contents

1 Project Objective
2 Exploratory Data Analysis – Step by step approach
2.1 Environment Set up and Data Import
2.1.1 Install necessary Packages and Invoke Libraries
2.1.2 Set up working Directory
2.1.3 Import and Read the Dataset
2.2 Variable Identification
2.3 Bi-Variate Analysis
2.4 Missing Value Identification
3 Multicollinearity Evidence
4 Simple Linear Regression
5 PCA/Factor Analysis
6 Multiple Linear Regression
7 Model Validity
1 Project Objective

The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum regression model to predict satisfaction.
This report will consist of the following:

• Importing the dataset in R
• Understanding the structure of the dataset
• Graphical exploration
• Descriptive statistics
• Insights from the dataset
• Performing PCA/Factor Analysis
• Building an optimum regression model to predict satisfaction

2 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import
2. Variable Identification
3. Bi-Variate Analysis
4. Missing Value Identification

2.1 Environment Set up and Data Import

2.1.1 Install necessary Packages and Invoke Libraries


The following packages were installed and their libraries invoked for the analysis:
1. readr – provides a fast and friendly way to read rectangular data (such as 'csv', 'tsv' and 'fwf' files); our dataset is in .csv format, which readr handles.
2. ggplot2 – a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
3. psych – includes functions most useful for personality and psychological research. Some of the functions (e.g., read.file, read.clipboard, describe, pairs.panels, error.bars and error.dots) are useful for basic data entry and descriptive analyses.
4. nFactors – helps determine the number of factors for factor analysis.
5. caTools – contains several basic utility functions, including moving (rolling, running) window statistics, read/write for GIF and ENVI binary files, fast calculation of AUC, a LogitBoost classifier, a base64 encoder/decoder, and round-off-error-free sum and cumsum.
6. corrplot – provides a graphical display of a correlation matrix, confidence interval or general matrix.
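A minimal sketch of the setup in R (install once, then invoke the libraries each session):

for (p in c("readr", "ggplot2", "psych", "nFactors", "caTools", "corrplot")) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)   # install if absent
  library(p, character.only = TRUE)                               # invoke the library
}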
2.1.2 Set up working Directory
Setting the working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is simply the location/folder on the PC where the data, code and other files related to the project are stored.
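For example (the path below is a placeholder; substitute your own project folder):

setwd("C:/Projects/FactorHair")   # hypothetical path to the project folder
getwd()                           # confirm the working directory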

2.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing
the file.
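A minimal sketch, assuming the file sits in the working directory; the object names mydata, newdata and newdata_cor match those used later in this report, and the first column is assumed to be a row identifier:

mydata <- read.csv("Factor-Hair-Revised.csv", header = TRUE)   # import the dataset
newdata <- mydata[ , -1]               # drop the ID column (assumed to be column 1)
newdata_cor <- cor(newdata[ , 1:11])   # correlation matrix of the 11 predictors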

Please refer to Appendix A for the source code.

2.2 Variable Identification

A total of 12 variables were used for market segmentation in the context of product service management. From this data we are asked to first check for multicollinearity among the variables. Factor analysis is then performed to reduce the 11 independent variables to 4 factors. Finally, linear regression is performed with customer satisfaction as the dependent variable and the four factors as the independent variables.
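A short sketch of the commands typically used for this step:

dim(newdata)        # number of observations and variables
str(newdata)        # variable names and data types
describe(newdata)   # descriptive statistics (psych package)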

2.3 Bi-Variate Analysis

The scatter plots of all 12 variables show the distribution and variance of each of these variables. All of the variables follow an approximately normal, bell-shaped distribution.
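A sketch of how such plots can be produced, using pairs.panels from the psych package loaded earlier:

pairs.panels(newdata)   # pairwise scatter plots with histograms on the diagonal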

2.4 Missing Value Identification

The given dataset does not contain any missing values. This was checked in RStudio by running any(is.na(dataset)), which returned FALSE.
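A minimal sketch of the check, using the newdata object from above:

anyNA(newdata)            # TRUE if any value is missing; FALSE for this dataset
colSums(is.na(newdata))   # count of missing values per variable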

3 Multicollinearity Evidence

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables.
The given dataset shows a high degree of multicollinearity among the independent variables, which can be demonstrated through the following tests and analysis:

1. The corrplot command displays a pictorial representation of the correlation matrix. It indicates the presence of correlations between the variables; the circles with darker shades of blue show stronger correlations.
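A sketch of the plotting call (the method and layout options are illustrative):

corrplot(newdata_cor, method = "circle", type = "lower", tl.col = "black")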

The same correlations can also be examined numerically to get a clearer picture:
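For instance, printing the correlation matrix to two decimal places:

round(newdata_cor, 2)   # numeric view of the pairwise correlations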

2. After studying the above graphs, we can perform the following tests to confirm the multicollinearity in the dataset.

a) Bartlett's Test of Sphericity – an inferential statistic used to test whether the correlation matrix is an identity matrix, i.e. whether the variables are uncorrelated with one another.
The null hypothesis states that the variables are uncorrelated (the correlation matrix is an identity matrix), whereas the alternative hypothesis states that at least one pair of variables is significantly correlated.

After performing the Bartlett Test in R, we get the following output:

print(cortest.bartlett(newdata_cor, nrow(newdata)))
$chisq
[1] 619.2726

$p.value
[1] 1.79337e-96

$df
[1] 55
From the above output, we can clearly infer that the p-value is much smaller than the significance level of 0.05, hence we reject the null hypothesis.

Therefore, we accept the alternative hypothesis that at least one pair of variables is significantly correlated, which is evidence of multicollinearity. This means it is possible to reduce the dimensionality of the dataset.

b) Kaiser-Meyer-Olkin (KMO) Test – It is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance.

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = newdata_cor)
Overall MSA = 0.65
MSA for each item =
ProdQual Ecom TechSup CompRes Advertising ProdLine SalesFImage
0.51 0.63 0.52 0.79 0.78 0.62 0.62
ComPricing WartyClaim OrdBilling DelSpeed
0.75 0.51 0.76 0.67

From the above test output, we can infer that the data is suitable for factor analysis, as the overall MSA > 0.5.

A KMO value over 0.5 and a significance level for the Bartlett’s test below
0.05 suggest there is substantial correlation in the data.
Variable collinearity indicates how strongly a single variable is correlated
with other variables.
KMO measures are also calculated for each variable. Values above 0.5
are acceptable.

Alternative way: we can use the VIF function to check for evidence of multicollinearity in the Factor-Hair-Revised.csv data. We need to check whether any high or moderate correlation exists between the independent variables. For this, we test the data with the following R code.

Code in R (Using VIF function)

vif(lm(Satisfaction~ProdQual+Ecom+TechSup+CompRes+Advertising+Pr
odLine+SalesFImage+ComPricing+WartyClaim+OrdBilling+DelSpeed,
data= newdata))
ProdQual Ecom TechSup CompRes
1.635797 2.756694 2.976796 4.730448
Advertising ProdLine SalesFImage ComPricing
1.508933 3.488185 3.439420 1.635000
WartyClaim OrdBilling DelSpeed
3.198337 2.902999 6.516014

VIF status of predictors:

VIF = 1          Not correlated
1 < VIF < 5      Moderately correlated
5 < VIF <= 10    Highly correlated

As we can see, most predictors have VIF values between 1 and 5 (moderately correlated), while DelSpeed has a VIF above 5 (highly correlated), confirming multicollinearity in the dataset.

4 Simple Linear Regression
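Each of the eleven models below regresses Satisfaction on a single predictor. A compact sketch of how these fits could be produced in one loop (object names as defined earlier):

for (p in setdiff(names(newdata), "Satisfaction")) {
  f <- as.formula(paste("Satisfaction ~", p))   # e.g. Satisfaction ~ ProdQual
  print(summary(lm(f, data = newdata)))         # one simple regression per predictor
}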

1. Satisfaction~ProdQual

Call:
lm(formula = Satisfaction ~ ProdQual)

Residuals:
Min 1Q Median 3Q Max
-1.88746 -0.72711 -0.01577 0.85641 2.25220

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
ProdQual 0.41512 0.07534 5.510 2.90e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.047 on 98 degrees of freedom


Multiple R-squared: 0.2365, Adjusted R-squared: 0.2287
F-statistic: 30.36 on 1 and 98 DF, p-value: 2.901e-07

2. Satisfaction~Ecom

Call:
lm(formula = Satisfaction ~ Ecom)

Residuals:
Min 1Q Median 3Q Max
-2.37200 -0.78971 0.04959 0.68085 2.34580

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1516 0.6161 8.361 4.28e-13 ***
Ecom 0.4811 0.1649 2.918 0.00437 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.149 on 98 degrees of freedom


Multiple R-squared: 0.07994, Adjusted R-squared: 0.07056
F-statistic: 8.515 on 1 and 98 DF, p-value: 0.004368

3. Satisfaction~TechSup

Call:
lm(formula = Satisfaction ~ TechSup)

Residuals:
Min 1Q Median 3Q Max
-2.26136 -0.93297 0.04302 0.82501 2.85617

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.44757 0.43592 14.791 <2e-16 ***
TechSup 0.08768 0.07817 1.122 0.265
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.19 on 98 degrees of freedom


Multiple R-squared: 0.01268, Adjusted R-squared: 0.002603
F-statistic: 1.258 on 1 and 98 DF, p-value: 0.2647

4. Satisfaction~CompRes

Call:
lm(formula = Satisfaction ~ CompRes)

Residuals:
Min 1Q Median 3Q Max
-2.40450 -0.66164 0.04499 0.63037 2.70949

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.68005 0.44285 8.310 5.51e-13 ***
CompRes 0.59499 0.07946 7.488 3.09e-11 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9554 on 98 degrees of freedom


Multiple R-squared: 0.3639, Adjusted R-squared: 0.3574
F-statistic: 56.07 on 1 and 98 DF, p-value: 3.085e-11

5. Satisfaction~Advertising

Call:
lm(formula = Satisfaction ~ Advertising)

Residuals:
Min 1Q Median 3Q Max
-2.34033 -0.92755 0.05577 0.79773 2.53412

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6259 0.4237 13.279 < 2e-16 ***
Advertising 0.3222 0.1018 3.167 0.00206 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.141 on 98 degrees of freedom


Multiple R-squared: 0.09282, Adjusted R-squared: 0.08357
F-statistic: 10.03 on 1 and 98 DF, p-value: 0.002056

6. Satisfaction~ProdLine

Call:
lm(formula = Satisfaction ~ ProdLine)

Residuals:
Min 1Q Median 3Q Max
-2.3634 -0.7795 0.1097 0.7604 1.7373

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.02203 0.45471 8.845 3.87e-14 ***
ProdLine 0.49887 0.07641 6.529 2.95e-09 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 98 degrees of freedom


Multiple R-squared: 0.3031, Adjusted R-squared: 0.296
F-statistic: 42.62 on 1 and 98 DF, p-value: 2.953e-09

7. Satisfaction~SalesFImage

Call:
lm(formula = Satisfaction ~ SalesFImage)

Residuals:
Min 1Q Median 3Q Max
-2.2164 -0.5884 0.1838 0.6922 2.0728

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.06983 0.50874 8.000 2.54e-12 ***
SalesFImage 0.55596 0.09722 5.719 1.16e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.037 on 98 degrees of freedom


Multiple R-squared: 0.2502, Adjusted R-squared: 0.2426
F-statistic: 32.7 on 1 and 98 DF, p-value: 1.164e-07

8. Satisfaction~ComPricing

Call:
lm(formula = Satisfaction ~ ComPricing)

Residuals:
Min 1Q Median 3Q Max
-1.9728 -0.9915 -0.1156 0.9111 2.5845

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.03856 0.54427 14.769 <2e-16 ***
ComPricing -0.16068 0.07621 -2.108 0.0376 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.172 on 98 degrees of freedom


Multiple R-squared: 0.04339, Adjusted R-squared: 0.03363
F-statistic: 4.445 on 1 and 98 DF, p-value: 0.03756

9. Satisfaction~WartyClaim

Call:
lm(formula = Satisfaction ~ WartyClaim)

Residuals:
Min 1Q Median 3Q Max
-2.36504 -0.90202 0.03019 0.90763 2.88985

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3581 0.8813 6.079 2.32e-08 ***
WartyClaim 0.2581 0.1445 1.786 0.0772 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.179 on 98 degrees of freedom


Multiple R-squared: 0.03152, Adjusted R-squared: 0.02164
F-statistic: 3.19 on 1 and 98 DF, p-value: 0.0772

10. Satisfaction~OrdBilling

Call:
lm(formula = Satisfaction ~ OrdBilling)

Residuals:
Min 1Q Median 3Q Max
-2.4005 -0.7071 -0.0344 0.7340 2.9673

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0541 0.4840 8.377 3.96e-13 ***
OrdBilling 0.6695 0.1106 6.054 2.60e-08 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.022 on 98 degrees of freedom


Multiple R-squared: 0.2722, Adjusted R-squared: 0.2648
F-statistic: 36.65 on 1 and 98 DF, p-value: 2.602e-08
11. Satisfaction~DelSpeed

Call:
lm(formula = Satisfaction ~ ProdQual)

Residuals:
Min 1Q Median 3Q Max
-1.88746 -0.72711 -0.01577 0.85641 2.25220

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
ProdQual 0.41512 0.07534 5.510 2.90e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.047 on 98 degrees of freedom


Multiple R-squared: 0.2365, Adjusted R-squared: 0.2287
F-statistic: 30.36 on 1 and 98 DF, p-value: 2.901e-07

5 PCA/Factor Analysis

To address this multicollinearity, we perform factor analysis.

R code:

ev <- eigen(newdata_cor)
eigenValues <- ev$values
eigenVector <- ev$vectors

Factor <- 1:11                      # one label per eigenvalue of the 11 predictors
Scree <- data.frame(Factor, eigenValues)
plot(Scree, main = "Scree Plot", col = "Blue")
lines(Scree, col = "Red")

Based on the above scree plot and applying the Kaiser criterion, we retain the 4 factors whose eigenvalues are greater than 1 and drop the remaining factors whose eigenvalues are less than 1.

PCA (Unrotated factor loadings):

> Unrotate <- principal(newdata, nfactors = 4, rotate = "none")
> print(Unrotate, digits = 3)
Principal Components Analysis
Call: principal(r = newdata, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix

Mean item complexity = 1.9
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 39.023 with prob < 0.00177

Fit based upon off diagonal values = 0.968

Rotated Factor Loadings:

> Rotate <- principal(newdata, nfactors = 4, rotate = "varimax")
> print(Rotate, digits = 3)
Principal Components Analysis
Call: principal(r = newdata, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 39.023 with prob < 0.00177

Fit based upon off diagonal values = 0.968

RotateProfile=plot(Rotate,row.names(Rotate$loadings),cex=1.0)

Naming of Factors:

RC1 – Customer Interface: DelSpeed, OrdBilling, CompRes
RC2 – Market Presence: SalesFImage, Ecom, Advertising
RC3 – Value for Money: ProdQual, ProdLine, ComPricing
RC4 – Service Quality: TechSup, WartyClaim
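For the regression in the next section, the rotated scores can be collected into a data frame and labelled with these factor names (a sketch; the score columns are assumed to come out in the order RC1–RC4):

scores <- as.data.frame(Rotate$scores)
colnames(scores) <- c("CustomerInterface", "MarketPresence",
                      "ValueForMoney", "ServiceQuality")   # names chosen above
head(scores)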

6 Multiple Linear Regression

RC <- Rotate$scores                          # factor scores from the rotated solution
ModelRC = lm(Satisfaction ~ Rotate$scores)   # regress satisfaction on the four factor scores
summary(ModelRC)

Call:
lm(formula = Satisfaction ~ Rotate$scores)

Residuals:
Min 1Q Median 3Q Max
-1.6308 -0.4996 0.1372 0.4623 1.5228

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.07089 97.589 < 2e-16 ***
Rotate$scoresRC1 0.61805 0.07125 8.675 1.12e-13 ***
Rotate$scoresRC2 0.50973 0.07125 7.155 1.74e-10 ***
Rotate$scoresRC3 0.06714 0.07125 0.942 0.348
Rotate$scoresRC4 0.54032 0.07125 7.584 2.24e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7089 on 95 degrees of freedom


Multiple R-squared: 0.6605, Adjusted R-squared: 0.6462
F-statistic: 46.21 on 4 and 95 DF, p-value: < 2.2e-16

1. Confidence Interval for each of the RC components:

> confint(ModelRC,"Rotate$scoresRC1")
2.5 % 97.5 %
Rotate$scoresRC1 0.4766059 0.7594879
> confint(ModelRC,"Rotate$scoresRC2")
2.5 % 97.5 %
Rotate$scoresRC2 0.3682931 0.6511751
> confint(ModelRC,"Rotate$scoresRC3")
2.5 % 97.5 %
Rotate$scoresRC3 -0.07430515 0.2085769
> confint(ModelRC,"Rotate$scoresRC4")
2.5 % 97.5 %
Rotate$scoresRC4 0.398878 0.6817601

2. Predicted Model vs Actual Model

newdata1 = data.frame(RC1 = 0.71, RC2 = 0.66, RC3 = 0.13, RC4 = 0.76)   # example factor scores
prediction = predict(ModelRC, newdata1)
prediction
prediction = predict(ModelRC, newdata1, interval = "confidence")   # with confidence interval
prediction
Predicted = predict(ModelRC)       # fitted values from the model
Actual = mydata$Satisfaction       # observed satisfaction scores
Backtrack = data.frame(Actual, Predicted)
Backtrack
plot(Actual, col = "Red")
lines(Actual, col = "Red")
plot(Predicted, col = "Blue")
lines(Predicted, col = "Blue")
lines(Actual, col = "Red")         # overlay actual (red) on predicted (blue)

The actual values are plotted in red and the predicted values in blue. The plot shows that the predicted values are fairly close to the actual values; hence, the regression model developed captures the essence of the data well.

7 Model Validity

Since the p-value of the overall F-test for the multiple regression is less than alpha (0.05), we can reject the null hypothesis that all betas are zero. It can be concluded that at least one beta is non-zero, and hence we accept the alternative hypothesis. Overall, there is overwhelming evidence that the regression relationship exists in the population, meaning the linear model of Customer Satisfaction as a function of RC1, RC2, RC3 and RC4 is robust and statistically valid and can be used for predictive analytics. Note from the coefficient table that RC3 is not individually significant (p = 0.348), while RC1, RC2 and RC4 are highly significant.
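The overall F-test p-value cited above can be extracted directly from the fitted model (a sketch):

fstat <- summary(ModelRC)$fstatistic                  # value, numerator df, denominator df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # overall p-value of the regression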

