Num Py


10/06/2022, 20:40 Assignment_1_NumPy

Assignment 1 - NumPy
Name, Surname, Student Number

You need to implement k-nearest neighbour (knn) classification and regression (knn and
linear regression) algorithms using the numpy library only. You should use Euclidean distance
to measure the distance between the data points. For a top mark, your knn algorithm
should include a weighting proportional to the inverse distance. More details about these
algorithms will be explained in lectures. An example of a knn algorithm for the travelling
salesman problem is given in Notebook 1.8 of the lecture material.
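
As an illustration of the distance and weighting requirements, the following is a minimal sketch of how pairwise Euclidean distances and inverse-distance weights might be computed with broadcasting. The function names `euclidean_distances` and `inverse_distance_weights` and the small constant `eps` are assumptions made for this example, not part of the assignment.

```python
import numpy as np

def euclidean_distances(a, b):
    """Pairwise Euclidean distances between rows of a (m, p) and rows of b (n, p)."""
    # a[:, None, :] has shape (m, 1, p) and b[None, :, :] has shape (1, n, p),
    # so broadcasting produces all (m, n) pairs of differences at once.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def inverse_distance_weights(dist, eps=1e-12):
    """Weights proportional to 1 / distance.

    eps is an assumption of this sketch: it avoids division by zero when
    a point coincides exactly with another point.
    """
    return 1.0 / (dist + eps)

a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])

d = euclidean_distances(a, b)   # shape (2, 3)
w = inverse_distance_weights(d)
print(d)
```

No Python loop is needed here: the two inserted axes let numpy form every pair of rows in a single vectorised operation.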

To learn more about the knn and linear regression algorithms see Wikipedia articles on knn and
Linear Regression, Sections 3.1, 3.2, 3.5 and 4.7.6 in An Introduction to Statistical Learning, and
Sections 2.3.2, 3.2 and 13 in The Elements of Statistical Learning.

Your code should be written in such a way that, given correctly formatted data, running the
whole notebook gives the required answers. The data is given in CSV files where the first
column is the class or response $\vec{y}$, and the remaining columns are predictors
$\vec{x}_1, \vec{x}_2, \ldots$

The data file names should be global variables, class_file_name and regr_file_name.

You do not need to produce any graphical visualisations. This will be a part of your Assignment
3.

You can use this notebook as a template, but you should delete all the instructions before
submitting.

This assignment contributes 25% of your final grade. Parts 1 and 2 are worth 10% each. Top
marks will be awarded for effective use of numpy features such as broadcasting,
universal functions, sorting, masking and fancy indexing. The remaining 5% is awarded for the clarity
of your code and comments, and the overall presentation of your work.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# available sample data files
# classification: class1.csv, class2.csv
# regression: regr1.csv, regr2.csv

class_file_name = "class2.csv"
regr_file_name = "regr2.csv"

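The sample data has the following structure:

| $\vec{y}$ | $\vec{x}_1$ | $\vec{x}_2$ | $\cdots$ |
|---|---|---|---|
| $y_1$ | $x_{11}$ | $x_{12}$ | $\cdots$ |
| $y_2$ | $x_{21}$ | $x_{22}$ | $\cdots$ |
| $y_3$ | $x_{31}$ | $x_{32}$ | $\cdots$ |
| $\vdots$ | $\vdots$ | $\vdots$ | |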

localhost:8888/nbconvert/html/Data Handling and Visualisation/Assignment_1_NumPy.ipynb?download=false 1/3



where $\vec{y}$ is a column-vector of responses and $\vec{x}_1, \vec{x}_2, \ldots$ are column-vectors of predictors. For
the classification problem the $y_i$'s take integer values: 1, 2, 3, etc. For the regression problem the $y_i$'s are
real numbers.

Important! Your code must work well for any data file having this structure.
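
The column layout described above can be separated with basic slicing. The sketch below uses a small hypothetical in-memory array in place of a loaded file; the values and the variable names `y`, `X` and `labels` are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for a loaded data file: first column is the
# class/response, remaining columns are predictors.
data = np.array([[1, 0.5, 2.0],
                 [2, 1.5, 0.1],
                 [1, 0.7, 1.8]])

y = data[:, 0]    # responses, shape (n,)
X = data[:, 1:]   # predictors, shape (n, p) for any number of columns p

# For classification the responses are integer class labels.
labels = y.astype(int)
print(labels, X.shape)
```

Slicing with `1:` rather than naming columns explicitly keeps the code independent of how many predictor columns a file contains.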

Part 1 - KNN classification


1.1 KNN classification algorithm
In this section you should write a function knn_classify(test, train, k) that takes train
and test data as numpy ndarrays, and a k-value as an integer, and returns the class-values of the
test data.
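
One possible shape for such a function is sketched below. It assumes that both `train` and `test` follow the file layout (class in the first column, predictors in the rest) and that the first column of `test` is ignored when predicting; the handout does not state this explicitly. The `eps` parameter is also an assumption, added to guard against zero distances.

```python
import numpy as np

def knn_classify(test, train, k, eps=1e-12):
    """Inverse-distance-weighted knn classification (illustrative sketch)."""
    X_train, y_train = train[:, 1:], train[:, 0].astype(int)
    X_test = test[:, 1:]  # assumption: first column of test is not used

    # Pairwise Euclidean distances via broadcasting, shape (m, n).
    dist = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))

    # Indices of the k nearest training points for each test point, shape (m, k).
    nearest = np.argsort(dist, axis=1)[:, :k]

    # Fancy indexing picks out the distances and labels of those neighbours.
    rows = np.arange(X_test.shape[0])[:, None]
    weights = 1.0 / (dist[rows, nearest] + eps)   # shape (m, k)
    labels = y_train[nearest]                     # shape (m, k)

    # Masking: votes[i, j] is the total weight of the neighbours of test
    # point i that belong to class classes[j].
    classes = np.unique(y_train)
    votes = (weights[:, :, None] * (labels[:, :, None] == classes)).sum(axis=1)
    return classes[np.argmax(votes, axis=1)]

train = np.array([[1, 0.0, 0.0], [1, 0.0, 1.0], [1, 1.0, 0.0],
                  [2, 5.0, 5.0], [2, 5.0, 6.0], [2, 6.0, 5.0]])
test = np.array([[1, 0.2, 0.1], [2, 5.5, 5.2]])

pred = knn_classify(test, train, k=3)
print(pred)
```

Summing weights per class, instead of counting neighbours, is what makes nearer neighbours count for more in the vote.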

In [3]:
# Write your code with comments here

1.2 Data Analysis


In this section you should read the data. Then split it randomly into train (60%), validation (20%),
and test (20%) data. Use the train and validation data to find the k-value that gives the best
classification result. Then use this k-value to classify the test data and report your findings: the
k-value and the percentage of correct predictions.
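
The workflow above might look roughly like the following sketch. It runs on synthetic two-class data rather than the course files, and the helper names (`split_data`, `knn_majority`, `accuracy`), the random seed, and the candidate range of k-values are all assumptions chosen for illustration. The classifier here is a plain unweighted majority vote, kept short so that the example is self-contained.

```python
import numpy as np

def split_data(data, rng):
    """Shuffle the rows and split them 60% / 20% / 20%."""
    n = data.shape[0]
    shuffled = data[rng.permutation(n)]
    return np.split(shuffled, [int(0.6 * n), int(0.8 * n)])

def knn_majority(test, train, k):
    """Unweighted knn classifier (hypothetical helper for this sketch)."""
    y_train = train[:, 0].astype(int)
    dist = np.sqrt(((test[:, None, 1:] - train[None, :, 1:]) ** 2).sum(axis=2))
    labels = y_train[np.argsort(dist, axis=1)[:, :k]]
    classes = np.unique(y_train)
    votes = (labels[:, :, None] == classes).sum(axis=1)
    return classes[votes.argmax(axis=1)]

def accuracy(pred, actual):
    """Percentage of correct predictions."""
    return 100.0 * np.mean(pred == actual)

rng = np.random.default_rng(0)

# Synthetic data: two well-separated classes with two predictors each.
class1 = np.column_stack([np.full(50, 1), rng.normal(0.0, 1.0, (50, 2))])
class2 = np.column_stack([np.full(50, 2), rng.normal(4.0, 1.0, (50, 2))])
data = np.vstack([class1, class2])

train, val, test = split_data(data, rng)

# Score every candidate k on the validation data and keep the best one.
k_values = np.arange(1, 16)
val_acc = np.array([accuracy(knn_majority(val, train, k), val[:, 0].astype(int))
                    for k in k_values])
best_k = k_values[np.argmax(val_acc)]

# The test data is used only once, with the chosen k.
test_acc = accuracy(knn_majority(test, train, best_k), test[:, 0].astype(int))
print(f"best k = {best_k}, test accuracy = {test_acc:.1f}%")
```

Keeping the test data out of the search for k means the reported percentage is not biased by the selection step.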

In [4]:
# write your code here

Part 2 - KNN and linear regression


2.1 KNN regression algorithm
In this section you should write a function knn_regression(train, test, k) that takes
train and test data, and a k-value, and returns the regression (fitted) values of the responses of
the test data.
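
A minimal sketch of such a function is given below. As before, it assumes that `train` and `test` both follow the file layout and that the response column of `test` is ignored. The prediction is an inverse-distance-weighted mean of the k nearest responses; `eps` is an assumption that guards against zero distances.

```python
import numpy as np

def knn_regression(train, test, k, eps=1e-12):
    """Inverse-distance-weighted knn regression (illustrative sketch)."""
    X_train, y_train = train[:, 1:], train[:, 0]
    X_test = test[:, 1:]  # assumption: first column of test is not used

    # Pairwise Euclidean distances, shape (m, n).
    dist = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))

    # k nearest training points per test point and their distances, shape (m, k).
    nearest = np.argsort(dist, axis=1)[:, :k]
    near_dist = np.take_along_axis(dist, nearest, axis=1)

    # Weighted mean of the neighbours' responses.
    weights = 1.0 / (near_dist + eps)
    return (weights * y_train[nearest]).sum(axis=1) / weights.sum(axis=1)

# Training data on the line y = 2x, with a single predictor.
train = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])
test = np.array([[0.0, 1.5], [0.0, 0.0]])

pred = knn_regression(train, test, k=2)
print(pred)
```

For the first test point the two neighbours are equally distant, so the prediction is their plain average. The second test point coincides with a training point, whose very large weight dominates the result.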

In [5]:
# write your code here

data = pd.read_csv(class_file_name).to_numpy()

2.2 Linear regression algorithm


In this section you should write a function linear_regression(train, test) that takes
train and test data, and returns linear regression (fitted) values of the responses of the test data.
The column-vector of regression values $\hat{\vec{y}}$ should be computed using this formula:

$$\hat{\vec{y}} = X^{(\text{test})} \hat{\vec{\beta}}$$

where

$X^{(\text{test})}$ is the test design matrix obtained by stacking together a column of 1's with columns
of predictor variables from the test data:

$$X^{(\text{test})} = \begin{bmatrix} \mathbf{1} & \vec{x}_1^{(\text{test})} & \vec{x}_2^{(\text{test})} & \cdots \end{bmatrix} = \begin{bmatrix} 1 & x_{11}^{(\text{test})} & x_{12}^{(\text{test})} & \cdots \\ 1 & x_{21}^{(\text{test})} & x_{22}^{(\text{test})} & \cdots \\ \vdots & \vdots & \vdots & \\ 1 & x_{m1}^{(\text{test})} & x_{m2}^{(\text{test})} & \cdots \end{bmatrix}$$

$\hat{\vec{\beta}}$ is a column vector of least-squares estimates of the regression coefficients:

$$\hat{\vec{\beta}} = \left( (X^{(\text{train})})^T X^{(\text{train})} \right)^{-1} (X^{(\text{train})})^T \vec{y}^{(\text{train})}$$

$X^{(\text{train})}$ is the design matrix for the train data:

$$X^{(\text{train})} = \begin{bmatrix} \mathbf{1} & \vec{x}_1^{(\text{train})} & \vec{x}_2^{(\text{train})} & \cdots \end{bmatrix} = \begin{bmatrix} 1 & x_{11}^{(\text{train})} & x_{12}^{(\text{train})} & \cdots \\ 1 & x_{21}^{(\text{train})} & x_{22}^{(\text{train})} & \cdots \\ \vdots & \vdots & \vdots & \\ 1 & x_{n1}^{(\text{train})} & x_{n2}^{(\text{train})} & \cdots \end{bmatrix}$$

$m$ is the number of rows of the test data

$n$ is the number of rows of the train data

$\vec{y}^{(\text{train})}$ is a column-vector of response values of the train data
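
The least-squares formula can be sketched in numpy as follows, again assuming that `train` and `test` follow the file layout with the response in the first column. Note one deliberate substitution: instead of forming the matrix inverse explicitly, this sketch solves the equivalent linear system with `np.linalg.solve`, which yields the same coefficients when the matrix is invertible and is usually better behaved numerically.

```python
import numpy as np

def linear_regression(train, test):
    """Least-squares fitted values for the test data (illustrative sketch)."""
    # Design matrices: a column of ones stacked with the predictor columns.
    X_train = np.column_stack([np.ones(train.shape[0]), train[:, 1:]])
    X_test = np.column_stack([np.ones(test.shape[0]), test[:, 1:]])
    y_train = train[:, 0]

    # Solve (X^T X) beta = X^T y rather than inverting X^T X explicitly.
    beta = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
    return X_test @ beta

# Training data generated exactly from y = 1 + 2*x1 - x2.
train = np.array([[1.0, 0.0, 0.0],
                  [3.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [2.0, 1.0, 1.0],
                  [4.0, 2.0, 1.0]])
# The first column of test is a placeholder and is not used for prediction.
test = np.array([[0.0, 3.0, 2.0],
                 [0.0, 0.0, 2.0]])

fitted = linear_regression(train, test)
print(fitted)
```

Because the training responses lie exactly on a plane, the fitted values reproduce that plane at the test points, up to floating-point rounding.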

In [6]:
# write your code here

2.3 Data Analysis


In this section you should read the data. Then split it randomly into train (60%), validation (20%),
and test (20%) data. Use the train and validation data to find the k-value that gives the best knn
regression result. Then use this k-value to conduct knn regression on the test data and report
your findings: the k-value and the residual sum of squares:

$$\text{RSS} = \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i$ are predicted values, $y_i$ are observed values, and $m$ is the number of observations in your test
data. Then repeat the last step using the linear regression approach. Finally, compare the
two RSS values you have obtained. Which algorithm, knn or linear regression, gives a better
result?
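
As a small worked example of the residual sum of squares, the sketch below evaluates it on made-up numbers; the function name `rss` and the sample values are assumptions for illustration only.

```python
import numpy as np

def rss(y_pred, y_obs):
    """Residual sum of squares between predicted and observed values."""
    return np.sum((y_pred - y_obs) ** 2)

y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residuals are -0.5, 0.5, 0.0 and 1.0, so RSS = 0.25 + 0.25 + 0.0 + 1.0.
value = rss(y_pred, y_obs)
print(value)
```

When two methods are evaluated on the same test data, the one with the smaller RSS fits those observations more closely.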

In [7]:
# write your code here

data = pd.read_csv(regr_file_name).to_numpy()
