Num Py
Assignment 1 - NumPy
Name, Surname, Student Number
You need to implement k-nearest neighbour (knn) classification and regression (k-mean and
linear regression) algorithms using the numpy library only. You should use Euclidean distance
to measure the distance between the data points. For a top mark, your knn algorithm
should include a weighting proportional to the inverse distance. More details about these
algorithms will be explained in lectures. An example of a knn algorithm for the travelling
salesman problem is given in Notebook 1.8 of the lecture material.
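As an illustration of the inverse-distance weighting mentioned above, here is a minimal sketch of a weighted knn classifier. It is not the expected solution: the function name `knn_classify`, the default `k=3`, and the small constant `eps` (added to avoid division by zero when a test point coincides with a training point) are all assumptions made for this example.

```python
import numpy as np

def knn_classify(X_train, y_train, X_test, k=3, eps=1e-12):
    """Inverse-distance-weighted knn classification (illustrative sketch)."""
    # Pairwise Euclidean distances via broadcasting: shape (m, n)
    diff = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Indices of the k nearest training points for every test point
    nearest = np.argsort(dist, axis=1)[:, :k]
    rows = np.arange(X_test.shape[0])[:, np.newaxis]

    # Inverse-distance weights; eps (an assumption) guards against division by zero
    weights = 1.0 / (dist[rows, nearest] + eps)
    labels = y_train[nearest]                      # fancy indexing, shape (m, k)

    # Sum the weights per class with a boolean mask, then pick the heaviest class
    classes = np.unique(y_train)
    mask = labels[:, :, np.newaxis] == classes     # shape (m, k, c)
    votes = (weights[:, :, np.newaxis] * mask).sum(axis=1)
    return classes[np.argmax(votes, axis=1)]
```

Closer neighbours contribute larger weights, so a single very close point can outvote several distant ones.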
To learn more about the knn and linear regression algorithms, see the Wikipedia articles on knn and
Linear Regression, Sections 3.1, 3.2, 3.5 and 4.7.6 in An Introduction to Statistical Learning, and
Sections 2.3.2, 3.2 and 13 in The Elements of Statistical Learning.
Your code should be written in such a way that, given correctly formatted data, running the
whole notebook gives the wanted answers. The data is given in CSV files where the first
column is the class or response $\vec{y}$, and the remaining columns are predictors
$\vec{x}_1, \vec{x}_2, \ldots$
The data file names should be global variables, class_file_name and regr_file_name.
You do not need to produce any graphical visualisations. This will be a part of your Assignment
3.
You can use this notebook as a template, but you should delete all the instructions before
submitting.
This assignment contributes 25% of your final grade. Parts 1 and 2 are worth 10% each. Top
marks will be awarded for effective use of numpy features such as broadcasting,
universal functions, sorting, masking and fancy indexing. The remaining 5% is awarded for the clarity
of your code and comments, and the overall presentation of your work.
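For orientation, the numpy features named above can each be shown in a line or two. The toy arrays `a` and `b` below are made up for this sketch and have nothing to do with the assignment data.

```python
import numpy as np

a = np.array([4.0, 1.0, 3.0, 2.0])
b = np.array([[10.0], [20.0]])

# Broadcasting: shapes (2, 1) and (4,) combine into (2, 4)
grid = b + a

# Universal function: element-wise square root without a Python loop
roots = np.sqrt(a)

# Sorting: argsort returns the indices that would sort the array
order = np.argsort(a)

# Masking: keep only the elements greater than 2
large = a[a > 2]

# Fancy indexing: pick elements by an array of indices
two_smallest = a[order[:2]]
```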
In [1]:
import numpy as np
import pandas as pd
In [2]:
# available sample data files
class_file_name = "class2.csv"
regr_file_name = "regr2.csv"
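$$
\begin{array}{cccc}
\vec{y} & \vec{x}_1 & \vec{x}_2 & \cdots \\
y_1 & x_{11} & x_{12} & \cdots \\
y_2 & x_{21} & x_{22} & \cdots \\
y_3 & x_{31} & x_{32} & \cdots \\
\vdots & \vdots & \vdots &
\end{array}
$$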
For the classification problem, the $y_i$'s take integer values: 1, 2, 3, etc. For the regression
problem, the $y_i$'s are real numbers.
Important! Your code must work well for any data file having this structure.
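A file with this structure can be split into the response and the predictors without hard-coding the number of columns. The sketch below is one possible approach and rests on several assumptions: it reads with `np.genfromtxt` instead of the `pd.read_csv` call used in the template, it assumes the file has a single header row, and the helper name `split_response_predictors` and the in-memory sample data are hypothetical.

```python
import io
import numpy as np

def split_response_predictors(data):
    """Split an (n, 1 + p) array into the response y (n,) and predictors X (n, p)."""
    data = np.asarray(data, dtype=float)
    return data[:, 0], data[:, 1:]

# Hypothetical in-memory CSV standing in for a real data file
sample_csv = io.StringIO("y,x1,x2\n1,0.5,1.5\n2,3.0,4.0\n1,0.7,1.2\n")

# skip_header=1 assumes one header row; adjust if your files differ
data = np.genfromtxt(sample_csv, delimiter=",", skip_header=1)
y, X = split_response_predictors(data)
```

Slicing with `data[:, 1:]` keeps the code independent of how many predictor columns the file contains.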
In [3]:
# Write your code with comments here
In [4]:
# write your code here
In [5]:
# write your code here
data = pd.read_csv(class_file_name).to_numpy()
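$$
\hat{\vec{y}} = X^{(test)} \hat{\vec{\beta}}
$$

where $X^{(test)}$ is the test design matrix obtained by stacking together a column of 1's with columns
of predictor variables from the test data:

$$
X^{(test)} = [1 \;\; \vec{x}_1^{(test)} \;\; \vec{x}_2^{(test)} \;\; \cdots] =
\begin{bmatrix}
1 & x_{11}^{(test)} & x_{12}^{(test)} & \cdots \\
1 & x_{21}^{(test)} & x_{22}^{(test)} & \cdots \\
\vdots & \vdots & \vdots & \\
1 & x_{m1}^{(test)} & x_{m2}^{(test)} & \cdots
\end{bmatrix}
$$

$\hat{\vec{\beta}}$ is a column vector of least-squares estimates of the regression coefficients:

$$
\hat{\vec{\beta}} = \left( (X^{(train)})^T X^{(train)} \right)^{-1} (X^{(train)})^T \vec{y}^{(train)}
$$

$X^{(train)}$ is the design matrix for the train data:

$$
X^{(train)} = [1 \;\; \vec{x}_1^{(train)} \;\; \vec{x}_2^{(train)} \;\; \cdots] =
\begin{bmatrix}
1 & x_{11}^{(train)} & x_{12}^{(train)} & \cdots \\
1 & x_{21}^{(train)} & x_{22}^{(train)} & \cdots \\
\vdots & \vdots & \vdots & \\
1 & x_{n1}^{(train)} & x_{n2}^{(train)} & \cdots
\end{bmatrix}
$$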
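The least-squares estimate and the prediction step described above can be sketched in a few lines of numpy. This is an illustration under stated assumptions, not the required implementation: the function names are hypothetical, and the sketch solves the normal equations with `np.linalg.solve` rather than forming the matrix inverse explicitly, which gives the same coefficients when the matrix is invertible and is usually more stable numerically.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones to the (n, p) predictor matrix."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X_train, y_train):
    """Solve the normal equations (A^T A) beta = A^T y for beta."""
    A = design_matrix(X_train)
    return np.linalg.solve(A.T @ A, A.T @ y_train)

def predict_linear(X_test, beta):
    """Predicted responses: test design matrix times the coefficients."""
    return design_matrix(X_test) @ beta
```

For example, `predict_linear(X_test, fit_least_squares(X_train, y_train))` returns one predicted value per row of `X_test`.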
In [6]:
# write your code here
Here $\hat{y}_i$ are predicted values, $y_i$ are observed values, and $m$ is the number of observations in your test
data. Then repeat the last step using the linear regression approach. Finally, compare the
two RSS values you have obtained. Which algorithm, knn or linear regression, gives a better
result?
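The RSS comparison can be computed with a small helper. The sketch below assumes the standard definition of the residual sum of squares, $\mathrm{RSS} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$; the function name `rss` is hypothetical.

```python
import numpy as np

def rss(y_observed, y_predicted):
    """Residual sum of squares: sum over i of (y_i - yhat_i)^2 (assumed definition)."""
    residuals = np.asarray(y_observed, dtype=float) - np.asarray(y_predicted, dtype=float)
    return float(np.sum(residuals ** 2))
```

Under this definition, the algorithm with the smaller RSS on the same test data fits that data more closely.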
In [7]:
# write your code here
data = pd.read_csv(regr_file_name).to_numpy()