Milestone
-------------------------------------------------------------------
2. See my sample report for assignment 1; it serves as a rubric for this assignment.
3. Go over the dataset's sections (Data, Kernels, Discussion, Activity) to understand the data
analytic work already done on it.
4. Download one of the following data analytic tools (or else use Python or Java):
b. RapidMiner: https://rapidminer.com/get-started
c. Weka: https://www.cs.waikato.ac.nz/ml/weka
d. Those who use programming languages such as Python, Java, or R will be given a
grading advantage (as it requires more effort to program than to use
ready-to-use tools). However, you are not required to do so.
------------------------------------------
• The team will be assessed based on overall progress up to the milestone. There are
minimum criteria (related to data science projects’ tasks and template components). In
addition, the top 10-20% will be assessed based on relative contribution/achievement among all
teams.
• The final research report should follow research paper templates (you can see examples in
research-paper/, https://style.mla.org/formatting-papers/,
http://www.aresearchguide.com/4format.html, https://explorable.com/research-paper-format,
https://owl.english.purdue.edu/owl/resource/560/18/, and
https://www.ece.ucsb.edu/~parhami/rsrch_paper_gdlns.htm).
• I will also upload samples of “anonymous good previous submissions” for your reference.
• In Milestone 1 you will submit: (1) Abstract, (2) Research Questions, (3) Introduction, and (4)
analysis progress (e.g., data collection and preprocessing activities), with a size of no less than
• Be prepared to present both your code progress and your report (presentation is part of the
grade; you will not receive credit for the presentation part unless you present your work orally).
• You have to evaluate all other students’ presentations according to the assessment form
template. Submit your assessments before you leave the live session. Although your assessment
of others will not impact your grade, failing to submit it will affect your own.
-----------------------------------------------------------------------------
In deliverable 2 and the final deliverable, you add the new sections in addition to any
modifications to earlier sections from previous deliverables.
Make sure you submit your report in addition to any supporting code.
------------------------- Deliverable 2 components -------------------------------------
As most of the selected projects use public datasets, there are no doubt other
attempts/projects to analyze those datasets. 30% of this deliverable is your overall assessment
of previous data analysis efforts. This effort should include:
Evaluating the existing source code (e.g., in the Kernels and Discussion sections)
or any other reference. Make sure you try that code and show its results.
In addition to the code, summarize the most relevant literature or efforts to analyze the same
dataset you have picked.
If you have a new dataset with no or limited Kernels, survey the literature not necessarily
for work on this dataset in particular, but in the domain of the dataset (as there may
be many other similar or relevant datasets).
https://www.kaggle.com/WinningModelDocumentationGuidelines
We talked about feature selection methods, and I uploaded several samples about them. I
expect you, at the least, to reuse such code on your dataset.
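As one possible starting point, the sketch below shows a standard feature-selection pass using scikit-learn's SelectKBest with a mutual-information score. The synthetic dataset is a stand-in; swap in your project's feature matrix X and labels y.

```python
# Minimal feature-selection sketch (assumes scikit-learn is installed).
# The synthetic data below is only a placeholder for your own dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data: 300 rows, 20 features, 5 of them actually informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Keep the 5 features with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=5)
X_sel = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_sel.shape)
```

Other score functions (e.g., chi2 for non-negative features, or f_classif) can be dropped in without changing the rest of the code.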
You can use the following questions to guide what to include in this section (if you can
answer all these questions with evidence/screenshots from your work, that is great):
What were the most important features?
We suggest you provide:
• a variable importance plot (an example here about halfway down the page), showing the 10-20
most important features, and
• partial plots for the 3-5 most important features.
If this is not possible, provide a list of the most important features.
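One common way to get such a ranking is the built-in feature importances of a random forest; the sketch below computes and prints the top 10 (a bar plot would then just chart these values). The synthetic data and feature names are placeholders for your own dataset.

```python
# Variable-importance sketch using a random forest (assumes scikit-learn).
# Replace the synthetic data with your project's feature matrix and labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=4, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]  # placeholder feature names

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Indices of the 10 most important features, highest first.
order = np.argsort(rf.feature_importances_)[::-1][:10]
for i in order:
    print(f"{names[i]}: {rf.feature_importances_[i]:.3f}")

# For the plot itself, e.g. with matplotlib:
#   plt.barh([names[i] for i in order][::-1],
#            rf.feature_importances_[order][::-1])
```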
How did you select features?
Did you make any important feature transformations?
Did you find any interesting interactions between features?
Did you use external data? (if permitted)
Many customers are happy to trade off model performance for simplicity. With this in mind:
* Is there a subset of features that would get 90-95% of your final performance? Which features?
* Which model was most important?
* What would the simplified model score?
* Try to restrict your simple model to fewer than 10 features and one training
method.
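A quick way to answer the subset question is to compare cross-validated scores of the full model against a model restricted to its top 10 features. This is only a sketch under the same scikit-learn assumption as above, with synthetic placeholder data.

```python
# Sketch: how much performance does a <=10-feature model retain?
# Assumes scikit-learn; the synthetic data stands in for your dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=6, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
full_score = cross_val_score(model, X, y, cv=5).mean()

# Pick the 10 most important features from a fit on the full data,
# then re-score the same single training method on that subset.
top10 = np.argsort(model.fit(X, y).feature_importances_)[::-1][:10]
simple_score = cross_val_score(model, X[:, top10], y, cv=5).mean()

print(f"full model CV accuracy:   {full_score:.3f}")
print(f"10-feature CV accuracy:   {simple_score:.3f}")
print(f"retained {100 * simple_score / full_score:.1f}% of full performance")
```

The printed retention percentage directly answers the 90-95% question for your report; screenshot it as evidence.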
• Use the questions below to guide your effort in this section (if you can answer all these
questions with evidence/screenshots from your work, that is great).
Ensemble and deep learning methods (partial; part of the final grading)
We will cover ensemble and DL methods in the last two weeks. The details, quality, and
thoroughness of the evaluated ensemble and DL methods are important factors in grading the
final deliverable.
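As a starting point for the ensemble part, the sketch below combines three standard scikit-learn classifiers in a soft-voting ensemble; this is one illustrative ensemble technique among the several covered in class, not a required design.

```python
# Sketch: a soft-voting ensemble of three base learners (assumes scikit-learn).
# Synthetic placeholder data -- substitute your project's X and y.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities
)
score = cross_val_score(ensemble, X, y, cv=5).mean()
print(f"ensemble CV accuracy: {score:.3f}")
```

In the final deliverable, compare the ensemble's cross-validated score against each base learner's on the same folds to show whether combining models actually helps on your dataset.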
Provide citations to references, websites, blog posts, and external sources of information where
appropriate.
Summary
Summarize the most important aspects of your model and analysis, such as:
-------------------------------------