Assignment - Machine Learning
Select a real-world dataset and perform a machine learning analysis on it. Dimensional reduction,
classification, regression, and clustering are all important tools for a successful data scientist
to learn. Working with them will allow you to generate questions and investigate them with visualizations.
Machine learning allows an analyst to draw conclusions from data that drive business
impact. Often you will make interesting discoveries that were not part of your initial questions.
The machine learning project should include the items listed in the grading rubric below.
You may use either Python or R for this project. In Python, the scikit-learn library provides many machine
learning tools for classification, regression, clustering, dimensional reduction, model selection, and
preprocessing. Similar machine learning packages are available in R as well.
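As a rough illustration of how the scikit-learn tools named above fit together, here is a minimal sketch that chains preprocessing, model selection, and classification. The bundled breast-cancer dataset, the choice of logistic regression, and the grid of `C` values are assumptions made only for this example; your own dataset and methods will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast-cancer dataset stands in for your own data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and the model live in one pipeline, so imputation and
# scaling are fit only on the training portion of each cross-validation fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Model selection: 5-fold cross-validated search over the
# regularization strength C (an illustrative grid, not a recommendation).
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate the tuned pipeline once on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
test_confusion = confusion_matrix(y_test, search.predict(X_test))

print("best C:", search.best_params_["model__C"])
print("test AUC:", round(test_auc, 3))
print(test_confusion)
```

Keeping the imputer and scaler inside the pipeline is one way to avoid leaking information from the validation folds into the preprocessing step.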
Please show your steps and outputs as you go along. Your workflow should have a logical, well thought
out and organized presentation.
Format for the presentation: Present your work using R Markdown or Python Jupyter
notebooks, and/or PowerPoint. Avoid taking screenshots of your inputs/outputs from an IDE.
Grading Rubric:
• Describe 3 questions/goals you are trying to answer or achieve (or prove wrong). The questions
should involve:
o 1) dimensional reduction.
o 2) classification/regression.
o 3) clustering.
• Describe the kind of data you have. State the data types (numerical, categorical, etc.). If there are
missing data and/or outliers, you should manage them appropriately. You may also need to convert
data types.
• Feature selection and engineering (hints: see helpful links section below):
o Explain the value and benefit of feature selection and feature engineering towards your stated
question. It is important to consider feature selection as a part of selecting an appropriate
machine learning model to avoid overfitting to the training data. You may also need to
perform feature engineering (creating a new feature out of existing features).
o Perform the appropriate steps for feature engineering and/or feature selection. In some cases,
feature selection may be combined with dimensional reduction, supervised learning, and
unsupervised learning methods. If you performed any parameter and hyperparameter tuning,
you should describe that as well. Evaluate the feature selection you performed as appropriate.
o Perform appropriate evaluation of performance of the model(s), using appropriate evaluation
metrics (AUC, adjusted R², confusion matrix, etc.).
o Describe strengths and weaknesses, and suitable alternate approaches.
• Supervised learning (Classification/Regression) question (hints: see helpful links section below):
o Explain the value and benefit of supervised learning towards your stated question.
o Describe the steps/algorithm of how you have performed regression/classification. You may
use any method(s) (e.g. linear regression, logistic regression, K-nearest neighbors, decision
trees, ensemble methods (random forest, gradient boosted trees), SVM, Naïve Bayes, etc).
To perform the analysis, you may need to split the data into training, test, and validation data
sets as appropriate. If you performed any parameter and hyperparameter tuning, you should
describe that as well.
o Please describe the reason why you chose your method(s).
o Perform appropriate evaluation of performance of the model(s), using appropriate evaluation
metrics (AUC, adjusted R², confusion matrix, etc.).
o Describe strengths and weaknesses, and suitable alternate approaches.
• Unsupervised learning (Clustering) question (hints: see helpful links section below):
o Explain the value and benefit of unsupervised learning towards your stated question.
o Describe the steps/algorithm of how you have performed clustering. You may use any
method(s) (e.g. K-means, hierarchical clustering, Gaussian mixture models, DBSCAN, etc).
To perform the analysis, you may need to split the data into training, test, and validation data
sets as appropriate. If you performed any parameter and hyperparameter tuning, you should
describe that as well.
o Please describe the reason why you chose your method(s).
o Perform appropriate evaluation of performance of the model(s), using evaluation
metrics suited to clustering (silhouette score, Davies-Bouldin index, adjusted Rand index, etc.).
o Describe strengths and weaknesses, and suitable alternate approaches.
• Summary and Conclusions.
o Summarize your findings.
o Provide conclusions.
o Describe what you have learned.
o What might you have done differently for solving any of the three questions?
o Please describe any future work extensions.
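For the dimensional reduction and clustering items in the rubric above, the following is a minimal sketch of one possible workflow in scikit-learn. The bundled iris measurements, the use of PCA and K-means, the range of cluster counts tried, and the silhouette score as the evaluation metric are all assumptions chosen for illustration, not required methods.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris measurements stand in for your own numeric features.
X = load_iris().data

# Standardize first so that no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Dimensional reduction: project the four features onto two principal components.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Clustering: fit K-means for several candidate cluster counts and score each
# result with the silhouette coefficient, which needs no ground-truth labels.
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)
    silhouette_by_k[k] = silhouette_score(X_2d, labels)

# Pick the cluster count with the highest silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)

print("variance explained by 2 components:", round(explained, 3))
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The two-dimensional projection `X_2d` can also be plotted directly, which is often a convenient way to inspect the clusters visually.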
Useful links (you may use other references as well)
Machine learning libraries in R and Python:
https://cran.r-project.org/web/views/MachineLearning.html
https://medium.com/activewizards-machine-learning-company/top-20-r-libraries-for-data-science-in-2018-infographic-956f8419f883
https://medium.com/activewizards-machine-learning-company/top-20-python-libraries-for-data-science-in-2018-2ae7d1db8049
Feature selection:
https://machinelearningmastery.com/an-introduction-to-feature-selection/
https://www.machinelearningplus.com/machine-learning/feature-selection/
http://www.feat.engineering/goals-of-feature-selection.html
https://towardsdatascience.com/feature-selection-and-dimensionality-reduction-f488d1a035de
https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
Dimensional reduction:
https://en.wikipedia.org/wiki/Dimensionality_reduction
https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/
https://elitedatascience.com/dimensionality-reduction-algorithms
https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/
Classification:
https://en.wikipedia.org/wiki/Statistical_classification
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
Cluster analysis:
https://en.wikipedia.org/wiki/Cluster_analysis
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
https://towardsdatascience.com/unsupervised-machine-learning-clustering-analysis-d40f2b34ae7e