Project Data Scientist Program Group Project


Data Scientist Program

DATA SCIENTIST PROGRAM


GROUP PROJECT

GENERAL INSTRUCTION
You are required to perform a classification or regression task using supervised learning techniques.
Choose at least THREE suitable algorithms (e.g., k-NN, decision trees, naïve Bayes) and write code
to implement them using Python and the Scikit-Learn library.
Your work must include the following:

1. Dataset
Select any dataset suitable for your classification or regression task from the UCI repository or
any other source. Use the same dataset for all of your ML algorithms. A dataset that meets the
following criteria will earn higher marks:
a) Recent dataset (2020-2024)
b) Sufficient number of data points in the dataset to create a good ML model.
c) Complex dataset where some pre-processing operations are required.
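Criteria (b) and (c) can be checked quickly before committing to a dataset. The following is a minimal sketch, assuming the data has been loaded into a pandas DataFrame; the toy table and its column names (`age`, `city`, `label`) are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the dataset you selected.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["A", "B", "B", None, "B"],
    "label": [0, 1, 1, 0, 1],
})

# (b) Size: number of data points (rows) and columns.
n_rows, n_cols = df.shape

# (c) Signs that pre-processing is required:
missing_per_column = df.isna().sum()         # values that need imputing or dropping
n_duplicates = int(df.duplicated().sum())    # exact duplicate rows
categorical = df.select_dtypes(exclude="number").columns.tolist()  # columns that need encoding

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print("duplicate rows:", n_duplicates)
print("non-numeric columns:", categorical)
```

Missing values, duplicate rows and non-numeric columns each suggest that the dataset will need some pre-processing before modelling.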

2. Report

2-1. Dataset Description and Exploration


● A satisfactory description of the dataset you used (task, classes, attributes, year)
● Find the central tendency (e.g., mean, median) of each attribute
● Understand the spread of each attribute
● Visualize the distribution of each attribute
● Pivot the data
● Watch out for outliers
● Understand the relationship between attributes
● Visualize the relationship between attributes
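The exploration steps above can be sketched with pandas. This is an illustrative example only: it uses the Wine dataset bundled with Scikit-Learn as a stand-in (it does not meet the 2020-2024 criterion), and the 1.5 × IQR rule for flagging outliers is one common convention, not a requirement of this assignment.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Stand-in dataset; replace with the dataset you selected.
df = load_wine(as_frame=True).frame          # features plus a "target" column
features = df.drop(columns="target")

# Central tendency of each attribute.
central = features.agg(["mean", "median"]).T

# Spread of each attribute.
spread = features.agg(["std", "min", "max"]).T
spread["iqr"] = features.quantile(0.75) - features.quantile(0.25)

# Pivot the data: mean of two attributes per class.
pivot = df.pivot_table(index="target",
                       values=["alcohol", "color_intensity"],
                       aggfunc="mean")

# Outliers: count values outside 1.5 * IQR of the quartiles, per attribute.
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

# Relationship between attributes: pairwise Pearson correlation.
corr = features.corr()

print(central.head())
print(pivot)
print(outlier_counts.sort_values(ascending=False).head())
```

For the visualisation steps, `features.hist()` plots the distribution of each attribute and `pandas.plotting.scatter_matrix(features)` plots pairwise relationships; both require matplotlib to be installed.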

2-2. Design Framework


● Give a brief introduction to the objective of the machine learning task you will
perform. State the number of samples and features in the dataset.
● Describe the process of designing each of your ML models. The description may
include the pre-processing method (if needed), the feature extraction method (if relevant),
the classification/regression algorithms (minimum three algorithms) and any other related
process that will be implemented. Also, describe how the dataset is partitioned into
training and testing sets. The description can be in text or a graphical representation
such as a flowchart.
● Describe how the evaluation is conducted and which performance metrics you use for
evaluation (e.g., precision, recall, accuracy).
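One possible shape for such a framework is sketched below, assuming a classification task. The stand-in dataset (Scikit-Learn's bundled Wine data), the 80/20 stratified split, the choice of three algorithms and their hyperparameters are all illustrative assumptions, not prescribed settings.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; replace with the dataset you selected.
X, y = load_wine(return_X_y=True)

# Partition into training and testing sets (80/20, stratified by class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Three algorithms. k-NN is distance-based, so it is paired with scaling;
# the scaler is fitted on the training data only, inside the pipeline.
models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Train each model and evaluate it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="macro"),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Macro averaging is used here because the stand-in data has three classes; for a regression task, the classifiers and metrics would be swapped for regressors and measures such as mean squared error.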


2-3. Implementation (Coding)


● Briefly describe the packages that you import and explain why you need them.
● Add comments to your code explaining the steps described in 2-2.

2-4. Result
● Include figures of the output results. Choose suitable performance metrics to evaluate the
classification/regression task.
● Based on the model validation and performance evaluation in your work, compare
your models in terms of model performance (based on the performance metrics
used).
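One way to present such a comparison is a single table of cross-validated scores. The sketch below assumes a classification task and uses Scikit-Learn's bundled Wine data as a stand-in; the 5-fold stratified scheme and accuracy as the scoring metric are illustrative choices only.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; replace with the dataset you selected.
X, y = load_wine(return_X_y=True)

models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# The same folds are used for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    rows.append({
        "model": name,
        "mean_accuracy": scores.mean(),
        "std_accuracy": scores.std(),
    })

# One row per model, best mean accuracy first.
comparison = (
    pd.DataFrame(rows)
    .set_index("model")
    .sort_values("mean_accuracy", ascending=False)
)
print(comparison.round(3))
```

Reporting the standard deviation alongside the mean shows whether a difference between two models is larger than the variation across folds.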

SUBMISSION INSTRUCTIONS
Due Date:

(1) This assignment must be submitted as a group (maximum of 3 persons)

(2) Submission must be through Google Classroom

(3) Deliverables:
● Dataset used for the assignment
● Notebook (.ipynb file)
● Full Report

LATE POLICY

Work submitted after the due time, without an extension granted by your lecturer, is considered
'late'. Late work incurs a penalty of 10% of the total possible marks, deducted from the mark
your work is worth, for each day late (including weekends).
