Machine Learning Team Coursework
4iz210, version 19/02/2024

1. Team case study on machine learning


Dataset choice
Choose a suitable public dataset for the classification task. The target in the input data can be
categorical (or a number that can be interpreted as a category, which you will transform into a
category later). You can also choose a dataset with a continuous target if you convert the target to a
binary attribute as part of preprocessing. Recommended sources for datasets:
- https://www.openml.org/
- https://kaggle.com/
Software choice
It is recommended to use Python (scikit-learn).
Task customization
Please select:
1. Target attribute
2. One instance of interest (you can choose it randomly, but do not have to), which will be analysed
later. The instance can be identified, for example, by an id.
3. Choose an attribute of interest (ideally one that can be influenced for the given instance, for
example, ticket class, not age)
4. Cost matrix consisting of the cost of a false positive, false negative, true positive and true
negative. Alternatively, you can also define the cost matrix in terms of benefits.
The report will have the following structure:
Introduction
1. Data exploration
2. Data preprocessing
3. Modelling
4. Model evaluation
5. Explanation
Conclusion
Requirements and content
Introduction
1. Describe the business value of addressing this problem with machine learning.
2. Provide the link to the source of the data.
3. Describe the chosen customization: target attribute, instance of interest, attribute of interest, and
cost matrix.

Example: if you used a dataset about air tickets, the target attribute might be the Airport travelled
to. You would also explain that the Airport will be binarized into EU Airport (positive) or Non-EU
Airport (negative). An instance of interest could be a particular passenger with id=10. The attribute of
interest could be the passenger class. The cost matrix could be given by providing the cost of a true
positive (1), true negative (1), false positive (2) and false negative (3). This expresses that the highest
cost is incurred when the classifier predicts a Non-EU Airport but the actual airport is an EU
Airport.
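For illustration only, the cost matrix from this example could be recorded in code as a small table whose rows are the actual class and whose columns are the predicted class (the same layout scikit-learn uses for its confusion matrix); the labels below are hypothetical:

    import pandas as pd

    # Hypothetical cost matrix for the air-ticket example:
    # rows = actual class, columns = predicted class.
    cost_matrix = pd.DataFrame(
        [[1, 2],   # actual non-EU airport: true negative = 1, false positive = 2
         [3, 1]],  # actual EU airport:     false negative = 3, true positive = 1
        index=["actual non-EU", "actual EU"],
        columns=["predicted non-EU", "predicted EU"],
    )
    print(cost_matrix)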
Data Exploration
1. Show a histogram (or a table with value frequencies) for the target variable and for selected
other variables
2. Show a scatterplot (correlation plot) of the relation between selected predictors and the
target variable (see the sketch after this list)
3. Interpret the results
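A minimal sketch of these exploration steps, assuming a hypothetical tickets.csv file with an "Airport" target and a numeric "Price" predictor (adapt the file and column names to your own dataset):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("tickets.csv")  # hypothetical file and column names

    # 1. Value frequencies / histogram for the target variable
    print(df["Airport"].value_counts())
    df["Airport"].value_counts().plot(kind="bar", title="Target value frequencies")
    plt.show()

    # 2. Relation between a selected numeric predictor and the (later binarized) target
    df["EUAirport"] = (~df["Airport"].isin({"BRS", "LHR", "LCY"})).astype(int)
    df.plot.scatter(x="Price", y="EUAirport", alpha=0.3)
    plt.show()
    print(df.corr(numeric_only=True)["EUAirport"].sort_values())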
Data preprocessing
Carefully describe the preprocessing in the report. You need to define the following subsections:
1. Preprocessing for supervised machine learning
i. Derive binary target attribute: The team needs to convert the target attribute to a binary
transformed target attribute by choosing which value(s) will be positive and which value(s)
will be negative.
o For example, the target attribute can be "Airport". The team decides to create a new
attribute "EUAirport", which will be set to "0" if "Airport" is set to BRS (Bristol), LHR
(London Heathrow) or LCY (London City) and to "1" otherwise. Remove the original
target attribute; in our example, this is "Airport".
ii. Split the data into a training set and a test set. It is up to you what percentage will be used
for training but the training set needs to be larger than the test set.
iii. Perform at least one additional data preprocessing step, such as
- Remove missing values
- If the classes are imbalanced, you may upsample (or downsample) the training dataset.
- Normalize values or apply standard scaling
- Remove rows based on subsetting
- Derive new columns
- Perform feature selection (remove some attributes)
- Important: Make sure that your preprocessing operations do not use information from the
test set. It is therefore recommended to "fit" preprocessing on the training set and then
apply it to the test set (a minimal sketch follows the tip below).

(tip: see Preprocessing Data With SCIKIT-LEARN (Python tutorial) - JC Chouinard)
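A minimal sketch of steps i-iii, continuing the hypothetical air-ticket example (the file name, column names and the 30% test size are assumptions, not requirements):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("tickets.csv")  # hypothetical dataset

    # i. Derive the binary target and remove the original target attribute
    df["EUAirport"] = (~df["Airport"].isin({"BRS", "LHR", "LCY"})).astype(int)
    df = df.drop(columns=["Airport"])

    # ii. Train/test split (the training set must be larger than the test set)
    X = df.drop(columns=["EUAirport"])
    y = df["EUAirport"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # iii. One additional step: standard scaling fitted on the training set only,
    #      then applied to both sets, so no information leaks from the test set
    numeric_cols = X_train.select_dtypes("number").columns
    scaler = StandardScaler().fit(X_train[numeric_cols])
    X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])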
2. Preprocessing for unsupervised models (clustering)
i. For clustering, it is recommended to perform feature rescaling and/or normalization (see the
sketch after this list).
ii. If you need to remove some attributes, you can select the most important variables from the
classification part.
iii. Optionally, you can remove unnecessary rows by detecting outliers.
iv. Perform the final pre-processing step: for the following phases you need two datasets, 1) one
without the target variable for building the clusters, and 2) one with the target variable for the
evaluation part.
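A minimal sketch of the clustering preprocessing, again using the hypothetical air-ticket example and keeping only numeric attributes for simplicity:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("tickets.csv")  # hypothetical dataset
    y = (~df["Airport"].isin({"BRS", "LHR", "LCY"})).astype(int)  # binarized target
    X = df.drop(columns=["Airport"]).select_dtypes("number")      # numeric features only

    # i. Rescale the features used for clustering
    X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

    # iv. Two datasets: one without the target (for building the clusters),
    #     one with the target (for the evaluation part)
    clustering_data = X_scaled
    clustering_data_with_target = X_scaled.assign(target=y.values)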
Modeling
Carefully describe in the report all parameters used (an illustrative sketch follows this list):
1. Supervised model that predicts the target attribute.
a. Train the model on training data, and evaluate the model on test data.
b. Try at least two machine learning algorithms. It is recommended to try Decision Trees
and Forests.
c. Try various combinations of metaparameters (such as tree depth for decision tree or
number of trees in a forest) and record the impact on predictive performance
d. Try a baseline model (e.g., random classifier)
2. Clustering model.
a. Make two clustering models with k-means
i. k = 2 (hereafter k-2 model)
ii. best k (hereafter k-best model) recommended based on WCSS or Silhouette score
(use metaparameter tuning)
b. Make hierarchical clustering and visualize it.
c. Experiment with different clustering setups.
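A minimal sketch of both modelling parts, assuming X_train, X_test, y_train, y_test and clustering_data from the preprocessing sketches above; the metaparameter values are only examples:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.dummy import DummyClassifier
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.metrics import accuracy_score, f1_score, silhouette_score

    # 1. Supervised models, including a random baseline
    models = {
        "tree (depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
        "tree (depth=10)": DecisionTreeClassifier(max_depth=10, random_state=42),
        "forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=42),
        "baseline (random)": DummyClassifier(strategy="uniform", random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name, accuracy_score(y_test, pred), f1_score(y_test, pred))

    # 2. Clustering: k-2 model, tuning for the best k, hierarchical clustering
    k2_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(clustering_data)
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(clustering_data)
        print(k, km.inertia_, silhouette_score(clustering_data, km.labels_))  # WCSS + Silhouette
    hierarchical = AgglomerativeClustering(n_clusters=3).fit(clustering_data)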
Evaluation
1. Supervised model
a. Which metric is most suitable for the current problem (accuracy, F-measure)?
b. Compare the performance metrics for all types of models (e.g., decision tree and forest).
Which model is the best one?
c. Combine (multiply) the predefined cost matrix with the values in the confusion matrix
for each model. Which model is the best one? (See the sketch after this section.)
d. If you changed the probability (score) threshold for classification, would you
obtain better results in terms of total costs? For which threshold? (optional)
e. Quantify the effect of individual preprocessing steps (such as rescaling). How would the
performance change if you had not performed this step? (optional)

2. Unsupervised model (clustering)
a. Justify the chosen k using the elbow curve graph (WCSS value comparison) and/or the
Silhouette score (only for the k-means algorithm).
b. For the k-2 model with two clusters, evaluate the quality of the clustering using the Rand
index with respect to the target variable.
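A minimal sketch of the cost-based comparison and the Rand index, assuming the cost matrix from the customization example and the objects from the earlier sketches (models, X_test, y_test, clustering_data_with_target, k2_model):

    import numpy as np
    from sklearn.metrics import confusion_matrix, rand_score

    # Cost matrix in the same layout as sklearn's confusion matrix:
    # rows = actual class (0, 1), columns = predicted class (0, 1)
    costs = np.array([[1, 2],   # TN = 1, FP = 2
                      [3, 1]])  # FN = 3, TP = 1

    # 1.c Total cost of each supervised model on the test set
    for name, model in models.items():
        cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
        print(name, "total cost:", (cm * costs).sum())

    # 1.d Optional: vary the probability threshold and recompute the total cost
    proba = models["forest (100 trees)"].predict_proba(X_test)[:, 1]
    for threshold in (0.3, 0.5, 0.7):
        cm = confusion_matrix(y_test, (proba >= threshold).astype(int), labels=[0, 1])
        print("threshold", threshold, "total cost:", (cm * costs).sum())

    # 2.b Rand index of the k-2 clustering with respect to the binarized target
    print("Rand index:", rand_score(clustering_data_with_target["target"], k2_model.labels_))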
Explanation
1. Supervised models
a. Identify the most important variables in the model (see the sketch after this section).
i. Explain how the decision tree model reached its conclusion (which branches of
the tree/decision nodes were activated).
b. Use both models to classify the chosen instance
i. Do both models assign the same class?
ii. What is the confidence (probability) of the prediction?
c. If you changed the value of the attribute of interest, how would the classification
of the instance change? In scikit-learn, you can use an ICE plot (optional).
d. Apply LIME, SHAP (Shapley values) or Anchors to explain the classification of the instance
(optional, Python only).
2. Unsupervised model (clustering)
a. Visualize all cluster models in 2D or 3D space.
i. Dendrogram for hierarchical clustering.
ii. Select the two most important features (as the X and Y axes) and plot the data points
for the selected attributes in different colors to represent the clusters. Alternatively, you
can use the PCA method to reduce the dimensions and render the entire model in 2D or
3D space.
b. Compare found clusters (for k-2 model and k-best model) with the target variable. What
is the proportion of target variable values in each cluster?
c. Try to describe all clusters (in k-2 model and k-best model) returned by the k-means
algorithm based on their centroid and size. What is the main characteristic of each
cluster?
d. Use the k-best model to classify the chosen instance into a cluster
i. Inspect the assigned cluster.
ii. Does the value of the target class in the data match the mode (average) of the target
variable in the assigned cluster?
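A minimal sketch of some of these explanation steps, assuming the fitted models and data splits from the earlier sketches; the instance id (10) and k_best_model are hypothetical placeholders:

    import pandas as pd
    from sklearn.tree import export_text

    tree = models["tree (depth=3)"]
    forest = models["forest (100 trees)"]
    instance = X_test.loc[[10]]  # hypothetical id of the instance of interest

    # 1.a Most important variables according to the forest
    importances = pd.Series(forest.feature_importances_, index=X_test.columns)
    print(importances.sort_values(ascending=False).head())

    # 1.a.i Structure of the decision tree and the nodes visited by the instance
    print(export_text(tree, feature_names=list(X_test.columns)))
    print("visited nodes:", tree.decision_path(instance).indices)

    # 1.b Classify the chosen instance with both models (class and confidence)
    for name in ("tree (depth=3)", "forest (100 trees)"):
        print(name, models[name].predict(instance), models[name].predict_proba(instance))

    # 2.d Assign the instance to a cluster with the k-best model
    # (k_best_model: a KMeans model fitted with the chosen best k; the instance must
    #  first be preprocessed the same way as clustering_data)
    print("assigned cluster:", k_best_model.predict(instance))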

Conclusion
Summarize the results, answering questions such as:
1. Which machine learning result has the highest value and is most interesting?
2. What setting provided the best result?
3. Which attributes are the most important?

Submission:
- Submit an .ipynb and an .html (or .pdf) file with the report. You can also split the submission
into several files with clearly chosen names indicating what part of the report each file
relates to.
- Submit the input datasets.

Final checklist
• Are all preprocessing steps justified?
• Did you try different metaparameter values where appropriate?
• Are the results replicable? Given the same data, does the report describe all steps in
sufficient detail to obtain the same results as reported by the authors?
• Were proper evaluation metrics selected? Are the results correctly interpreted?
• Are all important steps explained and justified?
• What is the quality of writing? Is the language clear and concise?

2. Review assignment
Once all teamwork reports have been submitted, they will be uploaded to OneDrive, and you will find
out which team you will be writing a review for. You will have approximately 7 days to submit the
review. The review document will be submitted to InSIS in .pdf format.

Assessment outline:

1. Summary of the report


a. This description should demonstrate that you have read the study carefully.
b. A brief recap of the business problem, objectives, and results
2. Follow the report structure and check all required steps
a. Customization
b. Structure
c. Introduction
d. Data exploration
e. Data preprocessing (classification and clustering)
f. Modeling (classification and clustering)
g. Evaluation (classification and clustering)

h. Explanation (classification and clustering)

3. Review summary
a. Positives and negatives, formal aspects, improvement proposals
4. Final evaluation, recommendations
a. Accept without any changes (18-20 points)
b. Accept with minor revisions (13-17 points)
c. Accept after major revisions (8-12 points)
d. Reject (0-7 points)

3. Presentation
After submitting the review, you will have approximately 7 days to prepare the presentation. All presentations
take place at the seminar.

Presentation essentials:

• Prepare slides summarizing the most important parts; slides are preferred over presenting
directly from Jupyter notebooks.
• Each team has about 15 minutes to present the coursework.
• The coursework can be presented by anyone on the team or even by all team members.
• The team's presentation is followed by the opponents' statements (max. 5 minutes).
• Focus on the most important points: model description, problems, “lessons learnt”.

4. Statement of work
After the presentation, you will be required to submit a statement about your work and the work of the
other students in your team. The statement of work will be submitted via nbgrader. If a student did
not work sufficiently in the team, he or she may be penalized based on the average of all statements.

The final number of student points results from the following formula:

Total points = b * min{1, 2*p*n}

where

• b = total points of your team (report + review + presentation, max 30 points)


• p = your average proportion of the team's work on the coursework (based on the statements)
• n = number of members in your team

In short, in order not to be penalized, a student must complete at least 50% of his or her assigned share of the work.

For example, team A has 4 members and obtained 30 points. Proportionally, each student should
ideally account for 25% of the work, but the actual proportions are as follows:

• Member 1: 25% - the member did exactly what he/she was supposed to do. Points = 30 * min{1,
2*0.25*4} = 30
• Member 2: 12.5% - the member completed 50% of his/her assigned work. However, this is enough
to get the full number of points. Points = 30 * min{1, 2*0.125*4} = 30
• Member 3: 52.5% - the member worked more than he/she should have, so he/she naturally
has full points. Points = 30 * min{1, 2*0.525*4} = 30
• Member 4: 10% - this member worked less than required, so he/she will receive fewer points than
the other team members. Points = 30 * min{1, 2*0.1*4} = 24
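The four cases above can be checked directly against the formula; a minimal sketch:

    def total_points(b, p, n):
        """Total points = b * min(1, 2 * p * n)."""
        return b * min(1.0, 2 * p * n)

    # Team A from the example: b = 30 points, n = 4 members
    for p in (0.25, 0.125, 0.525, 0.10):
        print(p, total_points(30, p, 4))   # -> 30.0, 30.0, 30.0, 24.0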
