
Machine Learning Group Assignment - Group 3

Amritkant Debasis, Bijay Gupta & Rajagopalan Krishnan

Problem Statement
To understand the factors that influence the use of cars as a mode of transport, and to build the model that best explains an employee's decision to use a car as the main means of transport.

Preparation of the data


Read the data and understand the structure

library(caret)

## Loading required package: lattice


## Loading required package: ggplot2
library(car)

## Loading required package: carData


library(DMwR)

## Loading required package: grid


carsbasedata<-read.csv("C:\\Users\\acer\\Documents\\PGPBABI\\Machine Learning\\GA\\cars.csv", header = TRUE)
str(carsbasedata)
## 'data.frame': 444 obs. of 9 variables:
## $ Age : int 28 23 29 28 27 26 28 26 22 27 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
## $ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
## $ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
## $ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
## $ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
## $ license : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

## The columns Engineer, MBA and license need to be converted into factors


carsbasedata$Engineer<-as.factor(carsbasedata$Engineer)
carsbasedata$MBA<-as.factor(carsbasedata$MBA)
carsbasedata$license<-as.factor(carsbasedata$license)
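Before imputing, it is worth confirming where the missing values actually sit (a one-line check, not in the original):

colSums(is.na(carsbasedata)) # count of NAs per column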
# Impute the remaining missing values using k-nearest neighbours (DMwR)
carsbasedata<-knnImputation(carsbasedata)

Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence we will create a new column for car usage: it takes the value 0 for Public Transport and 2Wheeler, and 1 for Car.

## Understand the proportion of cars in Transport Mode

carsbasedata$CarUsage<-ifelse(carsbasedata$Transport=='Car',1,0)
table(carsbasedata$CarUsage)
##
## 0 1
## 383 61
sum(carsbasedata$CarUsage == 1)/nrow(carsbasedata)
## [1] 0.1373874
carsbasedata$CarUsage<-as.factor(carsbasedata$CarUsage)

Considerations for the model building and data split

The number of records for people travelling by car is in the minority, hence we need to use an appropriate sampling method on the training data; we will explore using SMOTE. We will use logistic regression and decision trees to find the best-fitting model, and also explore a couple of black-box models for prediction later on.

## Balancing the data

##Split the data into test and train

set.seed(400)
carindex<-createDataPartition(carsbasedata$CarUsage, p=0.7, list = FALSE, times = 1)
carsdatatrain<-carsbasedata[carindex,]
carsdatatest<-carsbasedata[-carindex,]
prop.table(table(carsdatatrain$CarUsage))
##
## 0 1
## 0.8621795 0.1378205
prop.table(table(carsdatatest$CarUsage))

##
## 0 1
## 0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]

## The train and test data have almost the same percentage of car usage as the base data

## Apply SMOTE on Training data set

library(DMwR)

attach(carsdatatrain)

carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250, perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))
##
## 0 1
## 0.5 0.5
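As a rough sketch of what these parameters mean (an approximation of DMwR's semantics, not stated in the original): perc.over = 250 creates about 2.5 synthetic minority cases per original minority case, and perc.under = 150 then keeps about 1.5 majority cases per synthetic minority case, which brings the two classes close to parity.

table(carsdataSMOTE$CarUsage) # absolute counts behind the proportions above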

We now have an equal split in the data between car users and non-car users. Let us proceed with building the models.

## Model Building

We will use the logistic regression method on the SMOTE data to understand the factors influencing car usage. Since we have only a limited set of variables, we will use them all in model building.

##Create control parameter for GLM 

outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv', number = 10, repeats = 3)
carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "binomial", trControl = trainctrl)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## (the two warnings above repeat across the 30 resampling fits)

These warnings point to (quasi-)complete separation: some combinations of the predictors identify car users almost perfectly, which is consistent with the very low residual deviance in the model summary below.

summary(carsglm$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.98326 -0.00024 0.00000 0.00000 1.11801
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -118.62275 51.84138 -2.288 0.0221 *
## Age 3.69904 1.69960 2.176 0.0295 *
## Work.Exp -1.20774 0.89958 -1.343 0.1794
## Salary 0.96031 0.72224 1.330 0.1836
## Distance -0.04112 0.32873 -0.125 0.9005
## license1 4.44597 2.89203 1.537 0.1242
## Engineer1 -0.72971 2.98679 -0.244 0.8070
## MBA1 -1.05817 1.70158 -0.622 0.5340
## GenderMale -3.03060 2.10593 -1.439 0.1501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 357.664 on 257 degrees of freedom
## Residual deviance: 17.959 on 249 degrees of freedom
## AIC: 35.959
##
## Number of Fisher Scoring iterations: 12
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
## glm variable importance
##
## Overall
## Age 100.000
## license1 68.845
## GenderMale 64.056
## Work.Exp 59.351
## Salary 58.720
## MBA1 24.219
## Engineer1 5.813
## Distance 0.000

plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")

## Model Interpretation
From the model we see that Age and license are the most significant factors. Looking at the odds and probabilities table, a one-year increase in age multiplies the odds of car usage by about 40 (an odds-implied probability of roughly 98%). As expected, holding a license multiplies the odds by about 85 (roughly 99%), and a one-lakh increase in salary multiplies the odds by about 2.6 (roughly 72%). The null deviance of this model is 357.664 and the residual deviance is 17.959, which yields a McFadden R-square of about 0.95, a very good fit. The resampled Accuracy and Kappa values are also high.
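These conversions can be reproduced directly from the fitted model; a minimal sketch using only objects already defined above:

fm<-carsglm$finalModel
odds<-exp(coef(fm)) # odds ratio per unit change in each predictor
round(cbind(odds, prob = odds/(1 + odds)), 3) # odds-implied probabilities
1 - fm$deviance/fm$null.deviance # McFadden pseudo R-square, ~0.95 here

We shall do the prediction based on this model.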

carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],
type = "raw")
confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 108 1
## 1 6 17
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7984
## Mcnemar's Test P-Value : 0.130570
##
## Sensitivity : 0.9444
## Specificity : 0.9474
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.9908
## Prevalence : 0.1364
## Detection Rate : 0.1288
## Detection Prevalence : 0.1742
## Balanced Accuracy : 0.9459
##
## 'Positive' Class : 1
##
carusagepreddata<-carsdatatest

carusagepreddata$predictusage<-carusageprediction

Interpretation of Prediction
We see that the overall accuracy of prediction is 95%, with almost all non-users predicted correctly, and a 94% sensitivity in predicting the car users.
Let us perform the prediction for the two given cases

carunknown<-read.csv("cars2.csv", header = TRUE)


carunknown$license<-as.factor(carunknown$license)
carunknown$Engineer<-as.factor(carunknown$Engineer)
carunknown$MBA<-as.factor(carunknown$MBA)

carunknown$predictcaruse<-predict.train(object = carsglm,carunknown[,regressors],type = "raw")

print(carunknown)
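Both unknown cases are predicted to be non-car users (predictcaruse = 0, as visible in the later printouts of carunknown).

Multinomial Logistic Regression

We next model the original three-class Transport outcome with a multinomial logit, carsmlr. The exact training call is not shown above; a minimal sketch, assuming caret's multinom method (the nnet backend that appears in finalModel below) with the same training data (carsdatatrainldasm) and repeated-CV control used for the random forest later on:

trainctrlmlr<-trainControl(method = 'repeatedcv', number = 10, repeats = 3)
carsmlr<-train(Transport ~., carsdatatrainldasm, method = "multinom", trControl = trainctrlmlr)

Fitting produces one nnet iteration trace per resampling fit: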
## # weights: 30 (18 variable)
## initial value 353.753157
## iter 10 value 227.286067
## iter 20 value 163.721684
## iter 30 value 150.708078
## iter 40 value 150.092463
## iter 50 value 150.037849
## iter 60 value 150.024405
## final value 150.024405
## converged
## (similar iteration traces, differing only in the fold-level values, repeat for the remaining resampling fits)
carsmlr$finalModel
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
##
## Coefficients:
## (Intercept) Age GenderMale Engineer1 MBA1
## Public Transport -2.924602 0.1867473 0.9073215 -0.1392119 0.1814413
## Car -73.157600 2.4072670 -0.8481166 1.4958898 -0.7794798
## Work.Exp Salary Distance license1
## Public Transport 0.1027101 -0.07091949 -0.1481504 -1.154896
## Car -1.0684135 0.17301043 0.3334671 1.378113
##
## Residual Deviance: 327.5019
## AIC: 363.5019
carmlrcoeff<-exp(coef(carsmlr$finalModel))
write.csv(carmlrcoeff,file = "Coeffsmlr.csv")
plot(varImp(object=carsmlr), main = "Variable Importance for Multinomial Logit")

The model has a residual deviance of 327.5. It implies that a one-year increase in age multiplies the odds of taking Public Transport (relative to the 2Wheeler baseline) by about 1.2 (an odds-implied probability of roughly 55%), whereas it multiplies the odds of choosing Car by about 11 (roughly 92%). Age and license are the two most important factors in deciding the mode of transport.
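These odds figures follow from exponentiating the Age coefficients shown in finalModel above:

exp(0.1867473) # ~1.21: odds of Public Transport vs 2Wheeler per extra year of age
exp(2.4072670) # ~11.1: odds of Car vs 2Wheeler per extra year of age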
Let us try and predict using the test data

predictions_mlr<-predict(carsmlr,carstestlda)
confusionMatrix(predictions_mlr,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_mlr, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##

## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 16 0 8
## Car 1 18 2
## Public Transport 7 0 80
##
## Overall Statistics
##
## Accuracy : 0.8636
## 95% CI : (0.7931, 0.9171)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 1.242e-06
##
## Kappa : 0.725
## Mcnemar's Test P-Value : 0.3815
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6667 1.0000 0.8889
## Specificity 0.9259 0.9737 0.8333
## Pos Pred Value 0.6667 0.8571 0.9195
## Neg Pred Value 0.9259 1.0000 0.7778
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1212 0.1364 0.6061
## Detection Prevalence 0.1818 0.1591 0.6591
## Balanced Accuracy 0.7963 0.9868 0.8611

The overall accuracy is 86%, and every car user in the test data is correctly identified (Car sensitivity of 100%). Let us try and predict for the two unknown cases.

predictTransportuk4<-predict.train(object = carsmlr,newdata = carunknown)


carunknown$predictTransportmlr<-predictTransportuk4
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport

The multinomial model has predicted both cases to be using public transport.

Prediction using Random Forest


rftrcontrol<-trainControl(method="repeatedcv", number=10, repeats=3)
mtry<-sqrt(ncol(carsdatatrainldasm)) # rule of thumb: mtry ~ sqrt(number of variables)
tunegridrf <- expand.grid(.mtry=mtry)
carsrf<-train(Transport ~.,carsdatatrainldasm,method = "rf", trControl=rftrcontrol, tuneGrid = tunegridrf)
carsrf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.77%
## Confusion matrix:
## 2Wheeler Public Transport Car class.error
## 2Wheeler 85 32 1 0.27966102
## Public Transport 14 100 4 0.15254237
## Car 2 1 83 0.03488372

plot(varImp(object=carsrf), main = "Variable Importance for Random Forest")


The out-of-bag error estimate is 16.77% on the training dataset. Age, Work Experience, Distance and Salary are the most significant variables.
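The OOB figure can also be read off the fitted forest directly; a small sketch (rffit is a local alias, not in the original):

rffit<-carsrf$finalModel
rffit$err.rate[rffit$ntree, "OOB"] # OOB error after all 500 trees, ~0.1677
varImp(carsrf) # caret's 0-100 scaled variable importance

Let us try and predict for the test data.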

predictions_rf<-predict(carsrf,carstestlda)
confusionMatrix(predictions_rf,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_rf, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 18 0 29
## Car 1 17 1
## Public Transport 5 1 60
##
## Overall Statistics
##
## Accuracy : 0.7197
## 95% CI : (0.6349, 0.7943)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.2011606
##
## Kappa : 0.5123
## Mcnemar's Test P-Value : 0.0004523
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.7500 0.9444 0.6667
## Specificity 0.7315 0.9825 0.8571
## Pos Pred Value 0.3830 0.8947 0.9091
## Neg Pred Value 0.9294 0.9912 0.5455
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1364 0.1288 0.4545
## Detection Prevalence 0.3561 0.1439 0.5000
## Balanced Accuracy 0.7407 0.9635 0.7619

We have an overall accuracy of 72% and a 94% sensitivity in predicting car usage. Let us now predict the choice of transport for the 2 unknown cases.

predictTransportuk5<-predict.train(object = carsrf,newdata = carunknown)


carunknown$predictTransportrf<-predictTransportuk5
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport
## predictTransportrf
## 1 Public Transport
## 2 2Wheeler

We have one record (female, engineer) predicted to choose 2Wheeler, while the other record is predicted to choose Public Transport.
