CSE 6242 Homework 4
Due: Friday, April 22, 2016, 11:55 PM EST
Prepared by Gopi Krishnan Nambiar, Nilaksh Das, Pradeep Vairamani, Ajitesh Jain, Vishakha Singh, Polo Chau
Submission Instructions:
It is important that you read the following instructions carefully, and also those about the deliverables at the end of each question, or you may lose points.
Submit a single zipped file, called hw4-{YOUR_LAST_NAME}{YOUR_FIRST_NAME}.zip, containing all the deliverables, including source code/scripts, data files, and a readme. Example: hw4-DoeJohn.zip if your name is John Doe. Only .zip is allowed (no .rar, etc.).
You may collaborate with other students on this assignment, but you must write your own code and give the explanations in your own words, and also mention the collaborators' names on the T-Square submission page. All GT students must observe the honor code.
Suspected plagiarism and academic misconduct will be reported to and directly handled by the Office of Student Integrity (OSI). Here are some examples, similar to those on Prof. Jacob Eisenstein's NLP course page (grading policy):
OK: discussing concepts (e.g., how cross-validation works) and strategies (e.g., using a hashmap instead of an array).
Not OK: several students working on one master copy together (e.g., by dividing it up), sharing solutions, or using solutions from previous years or from the web.
If you use any slip days, you must write down the number of days used on the T-Square submission page. For example: "Slip days used: 1". Each slip day equals 24 hours. E.g., if a submission is late by 30 hours, that counts as 2 slip days.
At the end of this assignment, we have specified a folder structure for how to organize your files in a single zipped file. 5 points will be deducted for not following this strictly.
Wherever you are asked to write down an explanation for the task you perform, stay within the word limit or you may lose points.
We will not consider late submission of any missing parts of a homework assignment or project deliverable. To make sure you have submitted everything, download your submitted files to double-check.
Task 0: Download HW skeleton
Download the HW skeleton from this link; it contains starter code for Task 1.
Task 1: Random Forest (70 points)
In this task, you will implement a random forest classifier in Python. The performance of the classifier will be evaluated using 10-fold cross-validation on a provided dataset. For details on random forests, see Chapter 15 of the Elements of Statistical Learning book and the lecture slides. You must not use existing machine learning or random forest libraries. We have also set up an in-class Kaggle competition in which your classifier will compete with the rest of the class!
To train your classifier, you will be working on the wine dataset, which is often used for evaluating classification algorithms. The classification task is to determine whether the quality of a particular wine is 7 or above. We have mapped the wine quality scores for you to binary classes of 0 and 1: wine scores from 0 to 6 (inclusive) are mapped to 0, and wine scores of 7 and above are mapped to 1. You will be performing binary classification on the dataset. The dataset is extracted from https://archive.ics.uci.edu/ml/datasets/Wine+Quality. The data is stored in a comma-separated file (csv). Each line describes a wine using 12 columns: the first 11 describe the wine's characteristics (see details on the Data page for the competition on Kaggle), and the last column is a ground-truth label for the quality of the wine (0/1). You must not use the last column as an input feature when you classify the data.
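
As a point of reference, a minimal loading step might look like the sketch below. This is illustrative only: the file name hw4-data.csv is taken from the folder structure at the end of this document, and the assumption that the file has no header row is something you should verify against the actual data.

    import csv

    # Minimal sketch: read the wine csv, keeping the last column as the label only.
    # Assumes no header row; verify against the actual data file from the skeleton.
    def load_data(path="hw4-data.csv"):
        X, y = [], []
        with open(path) as f:
            for row in csv.reader(f):
                X.append([float(v) for v in row[:-1]])  # first 11 columns: features
                y.append(int(float(row[-1])))           # last column: 0/1 ground truth
        return X, y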
A. Implementing Random Forest (40 pt)
The main parameters in a random forest are:
1. Which attributes of the whole set of attributes do you select to find a split?
2. When do you stop splitting leaf nodes? You may even choose to use decision stumps, which are just 1 node deep.
3. How many trees should be in the forest?
In your implementation, you may apply any variations that you like (e.g., using entropy, Gini index, or other measures; binary split or multi-way split). However, you must explain your approaches and their effects on the classification performance in a text file, description.txt.
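
For concreteness, here is a minimal sketch of one possible set of choices: entropy-based binary splits over a random subset of attributes, bootstrap sampling, and majority voting. It is illustrative only, not the required design; the names and parameter values (max_depth, n_trees, n_features) are placeholders that you should choose and justify yourself in description.txt.

    import random
    from collections import Counter
    from math import log2

    # Impurity measure: H(S) = -sum_c p_c * log2(p_c). The Gini index would be
    # an equally valid alternative; this sketch uses entropy.
    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    # Pick the (feature, threshold) pair with the highest information gain,
    # considering only the randomly chosen subset of features.
    def best_split(X, y, feature_ids):
        base, best = entropy(y), (None, None, 0.0)
        for f in feature_ids:
            for t in set(row[f] for row in X):
                left = [lab for row, lab in zip(X, y) if row[f] <= t]
                right = [lab for row, lab in zip(X, y) if row[f] > t]
                if not left or not right:
                    continue
                gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
                if gain > best[2]:
                    best = (f, t, gain)
        return best

    def build_tree(X, y, n_features, depth=0, max_depth=5):
        if depth >= max_depth or len(set(y)) == 1:
            return Counter(y).most_common(1)[0][0]           # leaf: majority label
        feats = random.sample(range(len(X[0])), n_features)  # random attribute subset
        f, t, _ = best_split(X, y, feats)
        if f is None:                                        # no split improves purity
            return Counter(y).most_common(1)[0][0]
        L = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
        R = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
        return (f, t,
                build_tree([r for r, _ in L], [lab for _, lab in L], n_features, depth + 1, max_depth),
                build_tree([r for r, _ in R], [lab for _, lab in R], n_features, depth + 1, max_depth))

    def predict_tree(node, row):
        while isinstance(node, tuple):                       # internal nodes are tuples
            f, t, left, right = node
            node = left if row[f] <= t else right
        return node                                          # leaf label (0 or 1)

    def build_forest(X, y, n_trees=10, n_features=3):
        forest = []
        for _ in range(n_trees):
            idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
            forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], n_features))
        return forest

    def predict_forest(forest, row):
        votes = [predict_tree(tree, row) for tree in forest]         # majority vote
        return Counter(votes).most_common(1)[0][0]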
B. Evaluation using Cross Validation (30 pt)
You will evaluate your random forest (model) using 10-fold cross validation (also, lecture slides 13-15). In a nutshell, you will first divide the provided data into 10 parts. Then hold out 1 part as the test set and use the remaining 9 parts for training. Train your model using the training set and use the trained model to classify entries in the test set. Repeat this process for all 10 parts, so that each entry will be used for testing exactly once. To compute your model's final accuracy, take the average of the 10 folds' accuracies.
With correct implementation of both parts (random forest and cross validation), your …
Note: In random forests, evaluation using cross validation is not necessary (see slide 9 of the class slides on Ensemble Methods). However, we would like you to learn how to implement and apply cross validation in practice, as it is a fundamental model evaluation approach.
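
As a rough illustration of the procedure described above, a 10-fold cross validation loop might look like the sketch below. It assumes the hypothetical build_forest/predict_forest helpers from the Section A sketch; your own interface may differ.

    import random

    # Minimal sketch of k-fold cross validation; illustrative only.
    def cross_validate(X, y, k=10):
        idx = list(range(len(X)))
        random.shuffle(idx)                    # shuffle once before forming folds
        folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
        accuracies = []
        for i in range(k):
            held_out = set(folds[i])           # fold i is the test set
            train = [j for j in idx if j not in held_out]
            model = build_forest([X[j] for j in train], [y[j] for j in train])
            correct = sum(predict_forest(model, X[j]) == y[j] for j in folds[i])
            accuracies.append(correct / len(folds[i]))
        return sum(accuracies) / k             # final accuracy: mean over the folds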
C. Kaggle competition (15 pt bonus)
Kaggle is a platform for predictive modelling and analytics competitions. You can submit the predictions of your random forest on a test dataset at the competition page hosted on this platform. The top submissions will receive a significant amount of bonus points. The starter code is accompanied by a tester file (CSE6242HW4Tester.pyc) that automatically evaluates your random forest implementation and creates a Kaggle submission file every time you run your code. The generated submission file is named cse6242_hw4_submission_LastNameFirstName.csv. You MUST NOT change this file in any way; submit this file to Kaggle.
Sign up on Kaggle with your @gatech.edu email address and use your Georgia Tech account name (e.g., gpburdell3) as both your username and display name. If you have already registered with Kaggle using your GaTech email address, please make a private Piazza post to the instructors with the title "HW4 Kaggle Registration Details" and mention your existing Kaggle username, display name, and your Georgia Tech account name.
Note that Kaggle has 2 leaderboards: a public one that is visible to all students, and a private one only visible to the instructors. Each leaderboard uses a different dataset to rank submissions. We will award bonus points based on your model's performance on the private dataset/leaderboard. Using two similar, independent datasets/leaderboards is a common practice, which prevents competitors from overfitting their models to the public dataset. You can read more about this here.
Bonus points will be awarded as follows:
Rank 1 to 5: 15 pt
Rank 6 to 25: 10 pt
Rank 26 to 75: 6 pt
Rank 76 and above: 3 pt
Deliverables
1. RandomForest.py: The source file of your program. This should also contain brief comments mentioning what each section of the code does, e.g., calculating information gain.
2. cse6242_hw4_submission_LastNameFirstName.csv: The submission file generated by the tester file.
3. description.txt: This file should include:
   a. How you implemented the initial random forest (Section A) and why you chose your specific approach (<75 words)
   b. The accuracy of your implementation (with cross-validation)
   c. An explanation of the improvements that you made (Section C) and why you think they work better (or worse) (<50 words)
Task 2: Using Weka (30 points)
You will use Weka to train classifiers for the same dataset used in Task 1, and compare the performance of your implementation with Weka's.
Download and install Weka. Note that Weka requires the Java Runtime Environment (JRE) to run. We suggest you install the latest JRE to avoid Java or runtime related issues.
How to use Weka:
Load data into Weka Explorer: Weka supports file formats such as arff, csv, and xls.
Preprocessing: you can view your data, select attributes, and apply filters.
Classify: under Classifier, you can select the different classifiers that Weka offers. You can adjust the input parameters of many of the models by clicking on the text to the right of the Choose button in the Classifier section.
These are just some fundamentals of Weka. There are plenty of online tutorials.
A. Experiment (15 pt)
Run the following experiments. After each experiment, report your parameters, running time, confusion matrix, and prediction accuracy. An example is provided below, under the Deliverables section. For the Test options, choose 10-fold cross validation.
1. Random Forest. Under classifiers > trees, select RandomForest. You might have to preprocess the data before using this classifier. (5 pt)
2. Support Vector Machines. Under classifiers > functions, select SMO. (5 pt)
3. Your choice: choose any classifier you like from the numerous classifiers Weka provides. You can use the package manager to install the ones you need. (5 pt)
B. Discussion (15 pt)
1. Compare the Random Forest result from A1 to your implementation in Task 1 and discuss possible reasons for the difference in performance. (<50 words, 5 pt)
2. Describe the classifier you chose in A3: what it is, how it works, and its strengths & weaknesses. (<50 words, 5 pt)
3. Compare and explain the three approaches' classification results in Section A, specifically their running times, accuracies, and confusion matrices. If you have changed/tuned any of the parameters, briefly explain what you have done and why it improves the prediction accuracy. (<100 words, 5 pt)
Deliverables
report.txt: a text file containing the Weka results and your discussion for all questions above. For example:

Section A
1.
J48 -C 0.25 -M 2
Time taken to build model: 3.73 seconds
Overall accuracy: 86.0675%
Confusion Matrix:
     a     b   <-- classified as
 33273  2079 |   a = no
  4401  6757 |   b = yes
2.
Section B
1. The result of Weka is 86.1% compared to my result <accuracy> because ...
2. I chose <classifier>, which is <algorithm> ...
...
Submission Guidelines
Submit the deliverables as a single zip file named hw4-LastNameFirstName.zip (should start with lowercase hw4). Write down the name(s) of any students you have collaborated with on this assignment, using the text box on the T-Square submission page.
The zip file's directory structure must exactly be (when unzipped):
hw4-LastNameFirstName/
    Task1/
        cse6242_hw4_submission_LastNameFirstName.csv
        description.txt
        hw4-data.csv
        RandomForest.py
        CSE6242HW4Tester.pyc
    Task2/
        report.txt
You must follow the naming convention specified above.