XGBoost R Tutorial
Introduction
Xgboost is short for eXtreme Gradient Boosting package.
The purpose of this Vignette is to show you how to use Xgboost to build a model and make predictions.
It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
linear model
tree learning algorithm.
It supports various objective functions, including regression, classification and ranking. The package is made to be extendible, so that users are also allowed to define their own objective functions easily.
It has been used to win several Kaggle competitions.
It has several features:
Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
Input Type: it takes several types of input data:
Dense Matrix: R's dense matrix, i.e. matrix
Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix
Data File: local data files
xgb.DMatrix: its own class (recommended).
Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input.
Customization: it supports customized objective functions and evaluation functions.
Installation
Github version
For the weekly updated version (highly recommended), install from Github:
install.packages("drat",repos="https://cran.rstudio.com")
drat:::addRepo("dmlc")
install.packages("xgboost",repos="http://dmlc.ml/drat/",type="source")
Windows users will need to install Rtools first.
CRAN version
The version 0.4-2 is on CRAN, and you can install it by:
install.packages("xgboost")
Formerly available versions can be obtained from the CRAN archive.
Learning
For the purpose of this tutorial we will load the XGBoost package.
require(xgboost)
Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as what you will use in your everyday life :-).
Mushroom data is cited from the UCI Machine Learning Repository. @Bache+Lichman:2013.
Dataset loading
We will load the agaricus datasets embedded with the package and will link them to variables.
The datasets are already split in:
train: will be used to build the model
test: will be used to assess the quality of our model.
Why split the dataset in two parts?
In the first part we will build our model. In the second part we will test it and assess its quality. Without dividing the dataset we would test the model on data which the algorithm has already seen.
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
In the real world, it would be up to you to make this division between train and test data. The way to do it is beyond the scope of this article; however, the caret package may help, as sketched below.
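As a purely illustrative sketch of such a split with caret: the data frame df, its outcome column y and the 75/25 ratio below are hypothetical choices, not part of this tutorial's data.
library(caret)

# illustrative only: df and its outcome column y are hypothetical
set.seed(42)
in_train <- createDataPartition(df$y, p = 0.75, list = FALSE)  # stratified 75% sample of row indices
train_df <- df[in_train, ]    # training part
test_df  <- df[-in_train, ]   # held-out test part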
str(train)
## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
##   .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
##   .. ..@ Dim     : int [1:2] 6513 126
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=
##   .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...
The label is the outcome of our dataset, meaning it is the binary classification target we will try to predict.
Let's discover the dimensionality of our datasets.
dim(train$data)
## [1] 6513  126
dim(test$data)
## [1] 1611  126
This dataset is kept very small so that the R package does not become too heavy; however, XGBoost is built to manage huge datasets very efficiently.
class(train$data)[1]
##[1]"dgCMatrix"
class(train$label)
##[1]"numeric"
Basic Training using XGBoost
This step is the most critical part of the process for the quality of our model.
Basic training
In a sparse matrix, cells containing 0 are not stored in memory. Therefore, in a dataset mainly made of 0, memory size is reduced. It is very common to have such a dataset.
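As a small, optional check (object.size comes from base R), you can compare the memory footprint of the sparse agaricus matrix with its dense equivalent:
# sparse dgCMatrix: only the non-zero cells are stored
print(object.size(train$data), units = "Mb")
# dense conversion of the same data: every cell is stored
print(object.size(as.matrix(train$data)), units = "Mb")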
We will train a decision tree model using the following parameters:
objective = "binary:logistic": we will train a binary classification model
max.depth = 2: the trees won't be deep, because our case is very simple
nthread = 2: the number of CPU threads we are going to use
nround = 2: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
## [0]  train-error:0.046522
## [1]  train-error:0.022263
Parameter variations
Dense matrix
Alternatively, you can put your dataset in a dense matrix, i.e. a basic R matrix.
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
## [0]  train-error:0.046522
## [1]  train-error:0.022263
xgb.DMatrix
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
## [0]  train-error:0.046522
## [1]  train-error:0.022263
Verbose option
XGBoost has several features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model quality.
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)

# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)
## [0]  train-error:0.046522
## [1]  train-error:0.022263

# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes
## [0]  train-error:0.046522
## [11:41:01] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes
## [1]  train-error:0.022263
Basic prediction using XGBoost
Perform the prediction
The purpose of the model we have built is to classify new data. As explained before, we will use the test dataset for this step.
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))
## [1] 1611

# limit display of predictions to the first 10
print(head(pred))
## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391
Transform the regression into a binary classification
The only thing that XGBoost does is a regression. XGBoost is using the label vector to build its regression model.
How can we use a regression model to perform a binary classification?
If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as 1. Therefore, we will set the rule that if this probability for a specific datum is > 0.5 then the observation is classified as 1 (and 0 otherwise).
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
## [1] 0 1 0 0 0 1
Measuring model performance
To measure the model performance, we will compute a simple metric, the average error.
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
## [1] "test-error= 0.0217256362507759"
Steps explanation:
1. as.numeric(pred > 0.5) applies our rule that when the probability (<=> regression <=> prediction) is > 0.5 the observation is classified as 1, and as 0 otherwise;
2. probabilityVectorPreviouslyComputed != test$label computes the vector of errors between the true labels and the computed predictions;
3. mean(vectorOfErrors) computes the average error itself.
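If you want to look at the errors a bit more closely, a plain cross-tabulation with base R's table gives a simple confusion matrix; this is an optional check, not part of the original output.
# optional check: confusion matrix of thresholded predictions vs. true labels
table(prediction = as.numeric(pred > 0.5), truth = test$label)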
The most important thing to remember is that to do a classification, you just do a regression on the label and then apply a threshold.
Multiclass classification works in a similar way (see the sketch below).
This metric is 0.02 and is pretty low: our yummy mushroom model works well!
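As a minimal multiclass sketch, you would switch the objective to multi:softmax and pass the number of classes; the iris data and the parameter values below are illustrative assumptions, not part of this tutorial.
# illustrative multiclass sketch: labels must be integers in 0..(num_class - 1)
data(iris)
iris_label <- as.numeric(iris$Species) - 1
bst_multi <- xgboost(data = as.matrix(iris[, -5]), label = iris_label,
                     max.depth = 2, eta = 1, nthread = 2, nround = 2,
                     objective = "multi:softmax", num_class = 3)
# predictions are now class indices (0, 1 or 2) instead of probabilities
pred_multi <- predict(bst_multi, as.matrix(iris[, -5]))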
Advanced features
Most of the features below have been implemented to help you to improve your model by offering a better understanding of its content.
Dataset preparation
For the following advanced features, we need to put the data in an xgb.DMatrix as explained above.
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)
Measure learning progress with xgb.train
Both the xgboost (simple) and xgb.train (advanced) functions train models. One of the special features of xgb.train is the capacity to follow the progress of the learning after each round. In some way it is similar to what we have done above with the average error: the main difference is that above we measured the error after building the model, whereas here we measure it during the construction. For this we use the watchlist parameter, a named list of xgb.DMatrix objects to evaluate at each round.
watchlist <- list(train = dtrain, test = dtest)
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")
## [0]  train-error:0.046522  test-error:0.042831
## [1]  train-error:0.022263  test-error:0.021726
XGBoost has computed at each round the same average error metric as the one seen above (we set nround to 2, that is why we have two lines). Obviously, the train-error number is related to the training dataset (the one the algorithm learns from) and the test-error number to the test dataset.
Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. Maybe there is something to fix. Again, the caret package may help.
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
## [0]  train-error:0.046522  train-logloss:0.233376  test-error:0.042831  test-logloss:0.22668
## [1]  train-error:0.022263  train-logloss:0.136658  test-error:0.021726  test-logloss:0.13787
eval.metric allows us to monitor two new metrics for each round, logloss and error.
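Beyond the built-in metrics, xgb.train also accepts a user-defined evaluation function through its feval argument. The sketch below simply re-implements the average error by hand; it assumes that, with the built-in binary:logistic objective used here, the predictions handed to the function are probabilities.
# sketch of a hand-written evaluation metric passed through feval
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "custom_error", value = err)
}
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
                 watchlist = watchlist, feval = evalerror, objective = "binary:logistic")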
Linear boosting
Until now, all the learnings we have performed were based on boosting trees. XGBoost implements a second algorithm, based on linear boosting. The only difference with the previous command is the booster = "gblinear" parameter (and removing the eta parameter).
bst <- xgb.train(data = dtrain, booster = "gblinear", max.depth = 2, nthread = 2, nround = 2, watchlist = watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
## [0]  train-error:0.024720  train-logloss:0.184616  test-error:0.022967  test-logloss:0.18423
## [1]  train-error:0.004146  train-logloss:0.069885  test-error:0.003724  test-logloss:0.06808
In this specific case, linear boosting gets slightly better performance metrics than the decision tree based algorithm.
In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
Manipulating xgb.DMatrix
Save / Load
xgb.DMatrix.save(dtrain, "dtrain.buffer")
## [1] TRUE
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
## [11:41:01] 6513x126 matrix with 143286 entries loaded from dtrain.buffer
bst <- xgb.train(data = dtrain2, max.depth = 2, eta = 1, nthread = 2, nround = 2, watchlist = watchlist, objective = "binary:logistic")
## [0]  train-error:0.046522  test-error:0.042831
## [1]  train-error:0.022263  test-error:0.021726
Information extraction
Information can be extracted from an xgb.DMatrix using the getinfo function. Hereafter we will extract the label data.
label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
## [1] "test-error= 0.0217256362507759"
View feature importance/influence from the learnt model
Feature importance is similar to the R gbm package's relative influence (rel.inf).
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
View the trees from a model
xgb.dump(bst, with.stats = T)
##[1]"booster[0]"
##[2]"0:[f28<1.00136e05]yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
##[3]"1:[f55<1.00136e05]yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
##[4]"3:leaf=1.71218,cover=812"
##[5]"4:leaf=1.70044,cover=112.5"
##[6]"2:[f108<1.00136e05]yes=5,no=6,missing=5,gain=198.174,cover=703.75"
##[7]"5:leaf=1.94071,cover=690.5"
##[8]"6:leaf=1.85965,cover=13.25"
##[9]"booster[1]"
##[10]"0:[f59<1.00136e05]yes=1,no=2,missing=1,gain=832.545,cover=788.852"
##[11]"1:[f28<1.00136e05]yes=3,no=4,missing=3,gain=569.725,cover=768.39"
##[12]"3:leaf=0.784718,cover=458.937"
##[13]"4:leaf=0.96853,cover=309.453"
##[14]"2:leaf=6.23624,cover=20.4624"
You can plot the trees from your model using xgb.plot.tree:
xgb.plot.tree(model = bst)
Save and load models
Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
Helpfully, XGBoost implements such functions.
# save model to binary local file
xgb.save(bst, "xgboost.model")
## [1] TRUE
The xgb.save function should return TRUE if everything goes well, and crashes otherwise.
An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
## [1] "sum(abs(pred2-pred))= 0"