Hmls
Venelin Valkov
This book is for sale at http://leanpub.com/hmls
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
Contents
Conclusion
Acknowledgments
Build a taxi driving agent in a post-apocalyptic world using Reinforcement Learning
Reinforcement Learning
Driving in a post-apocalyptic world
Evaluation
Conclusion
A lamer is someone who thinks they are smart, yet is really a loser. E.g. that hacker
wannabe is really just a lamer.
You might’ve noticed that reading through Deep Learning and TensorFlow/PyTorch tutorials can
give you an idea of how to do a specific task, but falls short when you want to apply it to your own
problems. Importing a library and calling 4-5 methods might get the job done but leave you clueless
about why it works. Can you solve this problem?
Different people learn in different styles, but all hackers learn the same way - we build stuff!
Supervised Learning
In the Supervised Learning setting, you have a dataset, which is a collection of N labeled examples. Each
example has a vector of features xi and a label yi. The label yi can belong to a finite set of classes
{1, . . . , C}, be a real number, or be something more complex.
The goal of supervised learning algorithms is to build a model that receives a feature vector x as
input and infers the correct label for it.
Unsupervised Learning
In the Unsupervised Learning setting, you have a dataset of N unlabeled examples. Again, you have
a feature vector x, and the goal is to build a model that takes it and transforms it into another, more
useful representation. Some practical examples include clustering, dimensionality reduction, and
anomaly detection.
Reinforcement Learning
Reinforcement Learning is concerned with building agents that interact with an environment by
observing its state and executing an action. Actions yield rewards and change the state of the
environment. The goal is to learn a policy, a way of choosing actions, that maximizes the total reward.
These are the main components of all the algorithms we’re going to implement. While there are
many optimization routines, Stochastic Gradient Descent¹ is the most used in practice. It is used to
find optimal parameters for logistic regression, neural networks, and many other models.
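To make this concrete, here is a minimal sketch of the SGD update loop, assuming a hypothetical function grad(W, x, y) that returns the gradient of the loss for a single example:

import numpy as np

def sgd(W, X, y, grad, lr=0.01, n_epochs=10):
    # visit the examples in random order, nudging W against the gradient
    for _ in range(n_epochs):
        for i in np.random.permutation(len(X)):
            W = W - lr * grad(W, X[i], y[i])
    return W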
Making predictions
How can you guarantee that your model will make correct predictions when deployed in production?
Well, only suckers think that this is possible.
¹https://en.wikipedia.org/wiki/Stochastic_gradient_descent
“All models are wrong, but some are useful.” - George Box
That said, there are ways to increase the prediction accuracy of your models. If the examples used for
training were selected randomly and independently of one another, following the same generating
procedure, then your model is more likely to learn well. Still, for situations that are less
likely to happen, your model will probably make errors.
Generally, the larger the data set, the better the predictions you can expect.
NumPy
NumPy² is the fundamental package for scientific computing with Python. It contains, among other
things, a powerful N-dimensional array object and useful linear algebra, Fourier transform, and
random number capabilities.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.
Here’s a super simple walkthrough:
import numpy as np

a = np.array([1, 2, 3])  # 1D array
type(a)
numpy.ndarray
²http://www.numpy.org/
a.shape
(3,)
We’ve created a 1-dimensional array with 3 elements. You can get the first element of the array:
a[0]
1
a[0] = 5

a
array([5, 2, 3])
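The next snippets use a 2-dimensional array b; a 3×3 array like the following (the third row is an assumption, only the shape is known) reproduces the outputs shown:

b = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],  # assumed row; only b.shape == (3, 3) is certain
])
b.shape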
(3, 3)
b[0:2]
array([[1, 2, 3],
       [4, 5, 6]])
Pandas
pandas³ is an open source, BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
We’re going to use pandas as a holder for our datasets. The primary entity we’ll be working with is
the DataFrame. We’ll also do some transformations/preprocessing and visualizations.
Let’s start by creating a DataFrame:
³https://pandas.pydata.org/
import pandas as pd

df = pd.DataFrame(dict(
    customers=["Jill", "Jane", "Hanna"],
    payments=[120, 180, 90]
))

df
customers payments
Jill 120
Jane 180
Hanna 90
df.shape
(3, 2)
We have 3 rows and 2 columns. You can use head()⁴ to render a preview of the first five rows:
df.head()
customers payments
Jill 120
Jane 180
Hanna 90
df.isnull()
customers payments
False False
False False
False False
⁴https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html
⁵https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
df.payments.sum()
390
df.plot(
    kind='bar',
    x='customers',
    y='payments',
    title='Payments done by customer'
);
Matplotlib
Matplotlib⁷ is a Python 2D plotting library which produces publication quality figures in
a variety of hardcopy formats and interactive environments across platforms.
Ready to start?
Welcome to the amazing world of Machine Learning. Let’s get this party started!
The Data
You collected some data from your database(s), analytics packages, etc. Here’s what you might’ve
come up with:
⁸https://colab.research.google.com/drive/1kmtjoULbyRtAtDPKYlhWSwATLpF7PQd8
from collections import OrderedDict

data = OrderedDict(
    amount_spent=[50, 10, 20, 5, 95, 70, 100, 200, 0],
    send_discount=[0, 1, 1, 1, 0, 0, 0, 0, 1]
)
df = pd.DataFrame.from_dict(data)
amount_spent send_discount
0 50 0
1 10 1
2 20 1
3 5 1
4 95 0
5 70 0
6 100 0
7 200 0
8 0 1
Note — the presented data is a simplification of a real dataset you might have. If your
data is really simple, you might try simpler methods.
y ∈ {0, 1}
and set 0: negative class (e.g. email is not spam) or 1: positive class (e.g. email is spam).
⁹https://www.kaggle.com/surveys/2017
$$h_w(x) = w_1 x_1 + w_0$$

where the coefficients $w_i$ are parameters of the model. Let the coefficient vector $W$ be:

$$W = \begin{pmatrix} w_1 \\ w_0 \end{pmatrix}$$

Then we can rewrite the hypothesis as:

$$h_w(x) = w^T x$$
¹⁰https://machinelearningplus.com/
$$0 \le h_w(x) \le 1$$

For Logistic Regression we want to modify this and introduce another function g:

$$h_w(x) = g(w^T x)$$

where

$$g(z) = \frac{1}{1 + e^{-z}}, \quad z \in \mathbb{R}$$

g is also known as the sigmoid function or the logistic function. After substitution, we end up with:

$$h_w(x) = \frac{1}{1 + e^{-w^T x}}$$

Here is the sigmoid function, implemented in Python:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Loss function
We have a model that we can use to make decisions, but we still have to find the parameters W.
To do that, we need an objective measurement of how good a given set of parameters are. For that
purpose, we will use a loss (cost) function:
$$J(W) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_w(x^{(i)}), y^{(i)})$$

$$\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$$

Combining both cases into a single expression gives:

$$J(W) = \frac{1}{m} \left( -y \log(h_w) - (1 - y) \log(1 - h_w) \right)$$
where

$$h_w(x) = g(w^T x)$$
¹²https://ml-cheatsheet.readthedocs.io
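The snippet below calls a loss function that the formulas above describe; a minimal sketch of it, assuming y_hat and y are NumPy arrays (or floats), might be:

def loss(y_hat, y):
    # binary cross-entropy, J(W) from above
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))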
X = df['amount_spent'].astype('float').values
y = df['send_discount'].astype('float').values

def predict(x, w):
    return sigmoid(x * w)

def print_result(y_hat, y):
    print(f'loss: {np.round(loss(y_hat, y), 5)} predicted: {y_hat} actual: {y}')

y_hat = predict(x=X[0], w=.5)
print_result(y_hat, y[0])
Unfortunately, I am pretty lazy, and this approach seems like a bit too much work for me. Let’s go
to the next one:
0.0
0.0
0.0
6.661338147750941e-16
9.359180097590508e-14
1.3887890837434982e-11
2.0611535832696244e-09
3.059022736706331e-07
4.539889921682063e-05
0.006715348489118056
0.6931471805599397
5.006715348489103
10.000045398900186
15.000000305680194
19.999999966169824
24.99999582410784
30.001020555434774
34.945041100449046
inf
inf
Amazing, the first parameter value we tried got us a loss of 0. Is it your lucky day, or will this always
be the case? The answer is left as an exercise for the reader :)
Ok, but how can we find that gradient thing? We have to find the derivative of our cost function,
which is doable since our example is rather simple.
$$J(W) = \frac{1}{m} \left( -y \log(h_w) - (1 - y) \log(1 - h_w) \right)$$

its derivative¹⁴ with respect to W is given by:

$$\frac{\partial J(W)}{\partial W} = \frac{1}{m} (h_w - y)x$$

and the Gradient Descent update becomes:

$$W := W - \alpha \left( \frac{1}{m} (h_w - y)x \right)$$

Note that the sign matches the gradient computed in the code below (h - y).
The parameter α is known as the learning rate. A high learning rate can converge quickly, but risks
overshooting the lowest point. A low learning rate allows for confident moves in the direction of
the negative gradient, but it is time-consuming, so convergence will take much longer.
¹⁴https://math.stackexchange.com/a/1225116/499458
Based on the description below and on the later version with an intercept, the fit function presumably looked something like this (reconstructed; only the last two lines survive here):

def fit(X, y, n_iter=100000, lr=0.01):
    W = np.zeros(X.shape[1])
    for i in range(n_iter):
        z = np.dot(X, W)
        h = sigmoid(z)
        gradient = np.dot(X.T, (h - y)) / y.size
        W -= lr * gradient
    return W
About that “until convergence” part: you might notice that we kinda brute-force our way around it.
That is, we run the algorithm for a preset number of iterations. Another interesting point is the
initialization of our weights W — initially set to zero.
Let’s put our implementation to the test, literally. But first, we need a function that helps us predict
y given some data X (predict whether or not we should send a discount to a customer based on their
spending):
class TestGradientDescent(unittest.TestCase):

    def test_correct_prediction(self):
        global X
        global y
        if len(X.shape) != 2:
            X = X.reshape(X.shape[0], 1)
        w = fit(X, y)
        y_hat = predict(X, w).round()
        self.assertTrue((y_hat == y).all())
Note that we use reshape to add a dummy dimension to X. Further, after our call to predict, we round
the results. Recall that the sigmoid function spits out (kinda like a dragon with an upset stomach)
numbers in the [0; 1] range. We’re just going to round the result in order to obtain our 0 or 1 (yes or
no) answers.
run_tests()
Here is the result of running our test case:
F
Well, that’s not good. After all that hustling, we’re nowhere near achieving our goal of finding good
parameters for our model. But what went wrong?
Welcome to your first model debugging session! Let’s start by finding whether our algorithm
improves over time. We can use our loss metric for that:
run_tests()
We pretty much copy & pasted our training code except that we’re printing the training loss every
10,000 iterations. Let’s have a look:
loss: 0.6931471805599453
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056

F........
Suspiciously enough, we found a possible cause for our problem on the first try! Our loss doesn’t
get low enough; in other words, our algorithm gets stuck at a point that is not a good enough
minimum for us. How can we fix this? Perhaps try a different learning rate or initialize our
parameter with a different value?
First, a smaller learning rate α:
run_tests()
loss: 0.42351356323845546
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056

F.......
Not so successful, are we? How about adding one more parameter for our model to find/learn?
def add_intercept(X):
    intercept = np.ones((X.shape[0], 1))
    return np.concatenate((intercept, X), axis=1)

def predict(X, W):
    X = add_intercept(X)
    return sigmoid(np.dot(X, W))

def fit(X, y, n_iter=100000, lr=0.01):
    X = add_intercept(X)
    W = np.zeros(X.shape[1])
    for i in range(n_iter):
        z = np.dot(X, W)
        h = sigmoid(z)
        gradient = np.dot(X.T, (h - y)) / y.size
        W -= lr * gradient
    return W
run_tests()
........
---------------------------------------------------------
Ran 8 tests in 0.686s

OK
What did we do here? We added a new element, an intercept, to our parameter vector W; its
corresponding input feature is always set to 1. Seems like this turned things in our favor!
class TestLogisticRegressor(unittest.TestCase):

    def test_correct_prediction(self):
        global X
        global y
        X = X.reshape(X.shape[0], 1)
        clf = LogisticRegressor()
        y_hat = clf.fit(X, y).predict(X)
        self.assertTrue((y_hat == y).all())
class LogisticRegressor:

    def _add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def predict_probs(self, X):
        X = self._add_intercept(X)
        return sigmoid(np.dot(X, self.W))

    def predict(self, X):
        return self.predict_probs(X).round()

    def fit(self, X, y, n_iter=100000, lr=0.01):
        X = self._add_intercept(X)
        self.W = np.zeros(X.shape[1])
        for i in range(n_iter):
            z = np.dot(X, self.W)
            h = sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            self.W -= lr * gradient
        return self
run_tests()
We just packed all previously written functions into a tiny class. One huge advantage of this
approach is that we hide the complexity of the Gradient Descent algorithm and the handling
of the parameters W.
Now let’s try our model on data obtained from 2 new customers:
Customer 1 - $10
Customer 2 - $250
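A minimal sketch of how those predictions might be produced (X_test is an assumed name):

X_test = np.array([[10], [250]])
y_test = clf.predict(X_test)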
y_test
array([1., 0.])
Conclusion
Well done! You have a complete (albeit simple) LogisticRegressor implementation that you can play
with. Go on, have some fun with it!
Complete source code in Google Colaboratory Notebook¹⁶
Coming up next, you will implement a Linear regression model from scratch :)
¹⁶https://colab.research.google.com/drive/1kmtjoULbyRtAtDPKYlhWSwATLpF7PQd8
I know that you’ve always dreamed of dominating the housing market. Until now, that was
impossible. But with this limited offer you can… got a bit sidetracked there.
Let’s start building our model with Python, but this time we will use it on a more realistic dataset.
Complete source code notebook¹⁷
The Data
Our data comes from a Kaggle competition named “House Prices: Advanced Regression Tech-
niques¹⁸”. It contains 1460 training data points and 80 features that might help us predict the selling
price of a house.
df_train = pd.read_csv('house_prices_train.csv')
¹⁷https://colab.research.google.com/drive/1DXkpo9PmH9_HiCSz9NQlZ9vGQtMIYqmF
¹⁸https://www.kaggle.com/c/house-prices-advanced-regression-techniques
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
Most of the density lies between 100k and 250k, but there appear to be a lot of outliers on the pricier
side.
Next, let’s have a look at the greater living area (square feet) against the sale price:
You might’ve expected that a larger living area would mean a higher price. This chart shows you’re
generally correct. But what are those 2–3 “cheap” houses offering a huge living area?
One column you might not think about exploring is the “TotalBsmtSF” — Total square feet of the
basement area, but let’s do it anyway:
Intriguing, isn’t it? The basement area seems like it might have a lot of predictive power for our
model.
Ok, last one. Let’s look at “OverallQual” — overall material and finish quality. Of course, this one
seems like a much more subjective feature, so it might provide a somewhat different perspective on
the sale price.
Everything seems fine for this one, except that when you look to the right things start getting much
more nuanced. Will that “confuse” our model?
Let’s have a more general view of the top 8 features correlated with the sale price:
Surprised? All the features we discussed so far appear to be present. It’s almost like we knew them
from the start…
$$Y = bX + a$$
Multivariable Regression
A more complex, multivariable linear equation might look like this, where w represents the
coefficients, or weights, that our model will try to learn:

$$Y(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$$

The variables x1, x2, x3 represent the attributes, or distinct pieces of information, that we have about each
observation.
Loss function
Given our Simple Linear Regression equation:
Y = bX + a
We can use the following cost function to find the coefficients/parameters for our model:
$$MSE = J(W) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_w(x^{(i)}) \right)^2$$
where, for Linear Regression, the hypothesis is simply:

$$h_w(x) = w^T x$$
The MSE measures how much the average model predictions vary from the correct values. The
number is higher when the model is performing “badly” on our training data.
The first derivative of MSE with respect to W is given by:

$$MSE' = J'(W) = \frac{2}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

To drop the constant factor of 2, we can minimize one-half MSE (OHMSE) instead, which has the same minimum:

$$OHMSE = J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - h_w(x^{(i)}) \right)^2$$

$$OHMSE' = J'(W) = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
class TestLoss(unittest.TestCase):

    def test_zero_h_zero_y(self):
        self.assertAlmostEqual(loss(h=np.array([0]), y=np.array([0])), 0)

    def test_one_h_zero_y(self):
        self.assertAlmostEqual(loss(h=np.array([1]), y=np.array([0])), 0.5)

    def test_two_h_zero_y(self):
        self.assertAlmostEqual(loss(h=np.array([2]), y=np.array([0])), 2)

    def test_zero_h_one_y(self):
        self.assertAlmostEqual(loss(h=np.array([0]), y=np.array([1])), 0.5)

    def test_zero_h_two_y(self):
        self.assertAlmostEqual(loss(h=np.array([0]), y=np.array([2])), 2)
Now that we have the tests ready, we can implement the loss function:
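A sketch of that implementation, one-half MSE as derived above, matching the tests:

def loss(h, y):
    sq_error = (h - y) ** 2
    n = len(y)
    return 1.0 / (2 * n) * sq_error.sum()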
run_tests()
.....
--------------------------------------------------------------
Ran 5 tests in 0.007s

OK
Data preprocessing
We will do a little preprocessing to our data using the following formula (standardization):
$$x' = \frac{x - \mu}{\sigma}$$
x = df_train['GrLivArea']
y = df_train['SalePrice']

x = (x - x.mean()) / x.std()
x = np.c_[np.ones(x.shape[0]), x]
We will only use the greater living area feature for our first model.
class TestLinearRegression(unittest.TestCase):

    def test_find_coefficients(self):
        clf = LinearRegression()
        clf.fit(x, y, n_iter=2000, lr=0.01)
        np.testing.assert_array_almost_equal(
            clf._W,
            np.array([180921.19555322, 56294.90199925])
        )
class LinearRegression:

    def predict(self, X):
        return np.dot(X, self._W)

    def _gradient_descent_step(self, X, targets, lr):
        predictions = self.predict(X)
        error = predictions - targets
        gradient = np.dot(X.T, error) / len(X)
        self._W -= lr * gradient

    def fit(self, X, y, n_iter=100000, lr=0.01):
        self._W = np.zeros(X.shape[1])
        for i in range(n_iter):
            self._gradient_descent_step(X, y, lr)
        return self
You might find our Linear Regression implementation simpler compared to the one presented for the
Logistic Regression. Note that the use of the Gradient Descent algorithm is pretty much the same.
Interesting, isn’t it? A single algorithm can build two different types of models. Would we be able
to use it for more?
run_tests()
......
-----------------------------------------------------------------
Ran 6 tests in 1.094s

OK
clf = LinearRegression()
clf.fit(x, y, n_iter=2000, lr=0.01)
822817042.8437098
The loss at the last iteration is nearly 2 times smaller. Does this mean that our model is better now?
You can find complete source code and run the code in your browser²⁰
¹⁹https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html
²⁰https://colab.research.google.com/drive/1DXkpo9PmH9_HiCSz9NQlZ9vGQtMIYqmF
Conclusion
Nice! You just implemented a Linear Regression model and not the simple/crappy kind.
One thing you might want to try is to predict house sale prices on the testing dataset from Kaggle.
Is this model any good on it?
In the next part, you’re going to implement a Decision Tree model from scratch!
I am sorry, you might be losing sleep. Deep down you know your Linear Regression model ain’t
gonna cut it. That housing market domination is still further down the road.
Can we improve on that? Can we build a model that makes better predictions?
Complete source code notebook²¹
The Data
Once again, we’re going to use the Kaggle data: “House Prices: Advanced Regression Techniques²²”.
It contains 1460 training data points and 80 features that might help us predict the selling price of a
house.
Decision Trees
Decision tree models build structures like this:
²¹https://colab.research.google.com/drive/11T18ZELpOWuP7UtM7EKOut5juhJa9cHb
²²https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Data Preprocessing
We’re going to use the same data we used with the Linear Regression model. However, we’re not
going to do any scaling, just because we’re lazy (or it is not needed):
Cost function
We’re going to use a new cost function — Root Mean Square Error (RMSE). It is the standard
deviation of the residuals, measuring how far data points lie from the regression line. In other
words, it tells you how concentrated the data is around the line of best fit.
RMSE is given by the formula:
²³https://www.xoriant.com/
$$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h(x^{(i)}) \right)^2}$$
We’ve already implemented MSE in previous parts, so we’re going to import an implementation
here, in the name of readability (or holy laziness?):
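A sketch of that import, together with fitting the single-tree “forest” described next (X, y and RANDOM_SEED are assumed to be prepared earlier in the notebook):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

reg = RandomForestRegressor(
    n_estimators=1,  # a forest of one tree is just a Decision Tree
    random_state=RANDOM_SEED  # assumed seed constant, for reproducibility
)
reg.fit(X, y)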
We are using RandomForestRegressor with 1 estimator, which basically means we’re using a
Decision Tree model. Here is the tree structure of our model:
²⁴https://scikit-learn.org/stable/
You should receive the exact same model (if you’re running the code) since we are setting the random
state. How many features did the model use?
Now that this model is ready to be used let’s evaluate its R2 score:
preds = reg.predict(X)
metrics.r2_score(y, preds)
0.6336246655552089
The R2 statistic gives us information about the goodness of fit of the model. A score of 1 indicates
perfect fit. Let’s have a look at the RMSE:
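A sketch of that computation, reusing the imported mean_squared_error:

np.sqrt(mean_squared_error(y, preds))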
48069.23940764968
class DecisionTreeRegressor:

    def fit(self, X, y, min_leaf=5):
        self.dtree = Node(X, y, np.array(np.arange(len(y))), min_leaf)
        return self

    def predict(self, X):
        return self.dtree.predict(X.values)
class Node:

    def __init__(self, x, y, idxs, min_leaf=5):
        self.x = x
        self.y = y
        self.idxs = idxs
        self.min_leaf = min_leaf
        self.row_count = len(idxs)
        self.col_count = x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()
Trees are recursive data structures and we’re going to take full advantage of that. Our Node class
represents one decision point in our model. Each division within the model has 2 possible outcomes
for finding a solution — go to the left or go to the right. That decision point also divides our data
into two sets.
The property idxs stores indexes of the subset of the data that this Node is working with.
The decision (prediction) is based on the value that Node holds. To make that prediction we’re just
going to take the average of the data of the dependent variable for this Node.
The method find_varsplit finds where should we split the data. Let’s have a look at it:
def find_varsplit(self):
    for c in range(self.col_count): self.find_better_split(c)
    if self.is_leaf: return
    x = self.split_col
    lhs = np.nonzero(x <= self.split)[0]
    rhs = np.nonzero(x > self.split)[0]
    self.lhs = Node(self.x, self.y, self.idxs[lhs], self.min_leaf)
    self.rhs = Node(self.x, self.y, self.idxs[rhs], self.min_leaf)
First, we try to find a better feature to split on. If no such feature is found (we’re at a leaf node), we
do nothing. Then we use the split value found by find_better_split to divide the data and create the
left and right child nodes, each with its subset of the data.
Here are the split_col and is_leaf implementations:
@property
def split_col(self): return self.x.values[self.idxs, self.var_idx]

@property
def is_leaf(self): return self.score == float('inf')
We try splitting at each data point and let the best split win.
We’re going to create our split such that it has as low a standard deviation as possible. We find the split
that minimizes the weighted average of the standard deviations, which is equivalent to minimizing
RMSE.
If we find a better split we store the following information: index of the variable, split score and
value of the split.
The score is a metric that tells us how effective the split was (note that leaf nodes do not have scores,
so theirs will be infinity). The method find_score calculates a weighted sum of the standard deviations
on each side of the split. If the score is lower than the previous one, we have a better split. Note that the
score is initially set to infinity, so only leaf nodes and really shallow trees (and Thanos) have a score of infinity.
Finally, let’s look at how we use all this to make predictions:
Once again, we’re exploiting the recursive nature of life. Starting at the tree root, predict_row checks
whether we need to go to the left or the right node, based on the split value we found. The recursion
ends once we hit a leaf node. At that point, the answer/prediction is stored in the val property.
Here is the complete source of our Node class:
class Node:

    def __init__(self, x, y, idxs, min_leaf=5):
        self.x = x
        self.y = y
        self.idxs = idxs
        self.min_leaf = min_leaf
        self.row_count = len(idxs)
        self.col_count = x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()

    def find_varsplit(self):
        for c in range(self.col_count): self.find_better_split(c)
        if self.is_leaf: return
        x = self.split_col
        lhs = np.nonzero(x <= self.split)[0]
        rhs = np.nonzero(x > self.split)[0]
        self.lhs = Node(self.x, self.y, self.idxs[lhs], self.min_leaf)
        self.rhs = Node(self.x, self.y, self.idxs[rhs], self.min_leaf)

    def find_better_split(self, var_idx):
        x = self.x.values[self.idxs, var_idx]

        for r in range(self.row_count):
            lhs = x <= x[r]
            rhs = x > x[r]
            if rhs.sum() < self.min_leaf or lhs.sum() < self.min_leaf: continue

            curr_score = self.find_score(lhs, rhs)
            if curr_score < self.score:
                self.var_idx = var_idx
                self.score = curr_score
                self.split = x[r]

    def find_score(self, lhs, rhs):
        y = self.y[self.idxs]
        lhs_std = y[lhs].std()
        rhs_std = y[rhs].std()
        return lhs_std * lhs.sum() + rhs_std * rhs.sum()

    @property
    def split_col(self): return self.x.values[self.idxs, self.var_idx]

    @property
    def is_leaf(self): return self.score == float('inf')

    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])

    def predict_row(self, xi):
        if self.is_leaf: return self.val
        node = self.lhs if xi[self.var_idx] <= self.split else self.rhs
        return node.predict_row(xi)
Evaluation
Let’s check how your Decision Tree regressor does on the training data:
regressor = DecisionTreeRegressor().fit(X, y)
preds = regressor.predict(X)
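Evaluating it the same way as the scikit-learn model might look like this (assuming the metrics imports from above):

metrics.r2_score(y, preds)
np.sqrt(mean_squared_error(y, preds))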
Feel free to submit your csv file to Kaggle. Also, how can you improve?
Conclusion
Give yourself a treat, you just implemented a Decision Tree regressor!
Would it be possible to implement a Random Forest regressor on top of your model? How would that
affect your Kaggle scores?
In the next part, you’re gonna do some unsupervised learning with K-Means!
²⁵https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation
Acknowledgments
The source code presented in this part is heavily inspired by the excellent course of fast.ai²⁶ —
Introduction to Machine Learning for Coders²⁷
²⁶https://www.fast.ai/
²⁷http://course18.fast.ai/ml
Unsupervised Learning
Up until now we only looked at models that require training data in the form of features and labels.
In other words, for each example, we need to have the correct answer, too.
Usually, such training data is hard to obtain and requires many hours of manual work done by
humans (yes, we’re already serving “The Terminators”). Can we skip all that?
Yes, at least for some problems we can use example data without knowing the correct answer for it.
One such problem is clustering.
What is clustering?
Given some vector of data points X , clustering methods allow you to put each point in a group. In
other words, you can categorize a set of entities, based on their properties, automatically.
Yes, that is very useful in practice. Usually, you run the algorithm on a bunch of data points and
specify how many groups you want. For example, your inbox contains two main groups of e-mail:
spam and not-spam (were you waiting for something else?). You can let the clustering algorithm
create two groups from your emails and use your beautiful brain to decide which is which.
More applications of clustering algorithms:
²⁸https://dribbble.com/
²⁹https://www.uplabs.com/
³⁰https://www.behance.net/
³¹https://colab.research.google.com/drive/1_p1nptDfvJhcSvprmi1WtoIugsCI40Xa
• Customer segmentation - find groups of users that spend/behave the same way
• Fraudulent transactions - find bank transactions that belong to different clusters and identify
them as fraudulent
• Document analysis - group similar documents
The Data
source:
various authors on https://www.uplabs.com/
This time, our data doesn’t come from some predefined or well-known dataset. Since Unsupervised
learning does not require labeled data, the Internet can be your oyster.
Here, we’ll use 3 mobile UI designs from various authors. Our model will run on each shot and try
to extract the color palette for each.
Wikipedia also tells us that solving K-Means clustering is difficult (in fact, NP-hard³³), but we can
find local optimum solutions using some heuristics.
But how does K-Means Clustering work?
Let’s say you have some vector X that contains n data points. Running our algorithm consists of the
following steps:
1. Pick k random points as the initial cluster centers (centroids).
2. Assign each data point to its closest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the centroids stop moving (or move very little).
Let’s see how we can use it to extract color palettes from mobile UI screenshots.
Data Preprocessing
Given our data is stored in raw pixels (called images), we need a way to convert it to points that our
clustering algorithm can use.
Let’s first define two classes that represent a point and cluster:
class Point:

    def __init__(self, coordinates):
        self.coordinates = coordinates
Our Point is just a holder for the coordinates in each dimension of our space.
class Cluster:

    def __init__(self, center, points):
        self.center = center
        self.points = points
The Cluster is defined by its center and all other points it contains.
Given a path to an image file, we can create the points as follows:
³³https://en.wikipedia.org/wiki/NP-hardness
def get_points(image_path):
    img = Image.open(image_path)
    img.thumbnail((200, 400))
    img = img.convert("RGB")
    w, h = img.size

    points = []
    for count, color in img.getcolors(w * h):
        for _ in range(count):
            points.append(Point(color))

    return points
Note that we’re creating a Point for each pixel in our image.
Alright! You can now extract points from an image. But how can we calculate the distance between
points in our clusters?
Distance function
Similar to the cost function in our supervised algorithm examples, we need a function that tells us
how well we’re doing. The objective of our algorithm is to minimize the distance between the points
in each cluster and its centroid.
One of the simplest distance functions we can use is the Euclidean distance, defined by:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
To find the center of a set of points, we add the values for each dimension and divide by the number
of points.
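Minimal sketches of both helpers (the names euclidean and calculate_center are assumptions):

def euclidean(p, q):
    # straight-line distance between two Points
    return np.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p.coordinates, q.coordinates)))

def calculate_center(points, n_dim):
    # add the values for each dimension and divide by the number of points
    vals = [0.0] * n_dim
    for p in points:
        for i in range(n_dim):
            vals[i] += p.coordinates[i]
    return Point([v / len(points) for v in vals])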
Now for finding the actual clusters:
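Here is a minimal sketch that follows the steps above; the class name KMeans and the min_diff threshold are assumptions, chosen to match the fit call and the exit condition described next:

import random

class KMeans:

    def __init__(self, n_clusters, min_diff=1):
        self.n_clusters = n_clusters
        self.min_diff = min_diff  # stop when centers move less than this

    def fit(self, points):
        # step 1: pick random points as the initial centroids
        centers = random.sample(points, self.n_clusters)
        while True:
            # step 2: assign each point to its closest centroid
            groups = [[] for _ in range(self.n_clusters)]
            for p in points:
                distances = [euclidean(p, c) for c in centers]
                groups[distances.index(min(distances))].append(p)
            # step 3: recompute the centroids
            n_dim = len(points[0].coordinates)
            new_centers = [calculate_center(g, n_dim) for g in groups]
            # exit when the color difference gets small enough
            diff = max(euclidean(c, n) for c, n in zip(centers, new_centers))
            centers = new_centers
            if diff < self.min_diff:
                break
        return [Cluster(c, g) for c, g in zip(centers, groups)]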
The implementation follows the description of the algorithm given above. Note that we exit the
training loop when the color difference is lower than the one set by us.
Evaluation
Now that you have an implementation of K-Means clustering you can use it on the UI screenshots.
We need a little more glue code to make it easier to extract color palettes:
def rgb_to_hex(rgb):
    return '#%s' % ''.join(('%02x' % p for p in rgb))

def get_colors(filename, n_colors=3):
    points = get_points(filename)
    clusters = KMeans(n_clusters=n_colors).fit(points)
    clusters.sort(key=lambda c: len(c.points), reverse=True)
    rgbs = [map(int, c.center.coordinates) for c in clusters]
    return list(map(rgb_to_hex, rgbs))
The get_colors function takes a path to an image file and the number of colors you want to extract
from the image. We sort the clusters obtained from our algorithm based on the points in each (in
descending order). Finally, we convert the RGB colors to hexadecimal values.
Let’s extract the palette for the first UI screenshot from our data:
Running the clustering on the next two images, we obtain the following:
Conclusion
Congratulations, you just implemented your first unsupervised algorithm from scratch! It also
appears that it obtains good results when trying to extract a color palette from images.
Next up, we’re going to do sentiment analysis on short phrases and learn a bit more about how we can
handle text data.
Textual data dominates our world, from the tweets you read to the timeless writings of Seneca. And
while we’re consuming images (looking at you, Instagram) and videos at increasing rates, you still
read Google search results multiple times daily.
One frequently recurring problem with text data is Sentiment analysis³⁶ (classification). Imagine
you’re a big sugar + water beverage company. You want to know what people think of your products.
You’ll search for texts with some tags, logos or names. You can then use Sentiment analysis to figure
out if the opinions are positive or negative. Of course, you’ll send the negative ones to your highly
underpaid support center in India to sort things out.
Here, we’ll build a generic text classifier that puts movie review texts into one of two categories -
negative or positive sentiment. We’re going to have a brief look at the Bayes theorem and relax its
requirements using the Naive assumption.
Complete source code in Google Colaboratory Notebook³⁷
The Data
Our data comes from a Kaggle challenge - “Bag of Words Meets Bags of Popcorn”⁴⁰.
We have 25,000 movie reviews from IMDB labeled as positive or negative. You might know that
IMDB ratings are in the 0-10 range. An additional preprocessing step, done by the dataset authors,
converts the rating to a binary sentiment (< 5 is negative). Of course, a single movie can have multiple
reviews, but no more than 30.
³⁶https://en.wikipedia.org/wiki/Sentiment_analysis
³⁷https://colab.research.google.com/drive/1OPQDDJTKy0b40pziZWSsoBCmQV6HyXsm
³⁸https://en.wikipedia.org/wiki/Natural_language_processing
³⁹https://en.wikipedia.org/wiki/Bag-of-words_model
⁴⁰https://www.kaggle.com/c/word2vec-nlp-tutorial
Exploration
Let’s get a feel for our data. Here are the first 5 rows of the training data:
id sentiment review
5814_8 1 With all this stuff going down at the moment w…
2381_9 1 “The Classic War of the Worlds” by Timothy Hi…
7759_3 0 The film starts with a manager (Nicholas Bell)…
3630_4 0 It must be assumed that those who praised this…
9495_8 1 Superbly trashy and wondrously unpretentious 8…
We’re going to focus on the sentiment and review columns. The id column combines the movie
id with a unique review number. This might be a piece of important information in real-world
scenarios, but we’re going to keep it simple.
Both positive and negative sentiments have an equal presence. No need for additional gimmicks to
fix that!
Here are the most common words in the training dataset reviews:
Cleaning
Real-world text data is really messy. It can contain excessive punctuation, HTML tags (including
that br), multiple spaces, etc. We’ll try to remove/normalize most of it.
Most of the cleaning we’ll do using Regular Expressions⁴¹, but we’ll use two libraries to handle
HTML tags (surprisingly hard to remove) and to remove common (stop) words; see the Tokenizer sketch below.
First, we use BeautifulSoup⁴² to remove HTML tags from our text. Second, we remove anything
that is not a letter or space (note the ignoring of uppercase characters). Finally, we replace excessive
spacing with a single one.
⁴¹https://en.wikipedia.org/wiki/Regular_expression
⁴²https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Tokenization
Now that our reviews are “clean”, we can further prepare them for our Bag of Words model.
Let’s convert them to lowercase letters and split them into individual words. This process is known as
tokenization⁴³.
The last step of our pre-processing is to remove stop words⁴⁴ using those defined in the NLTK⁴⁵
library. Stop words are usually frequently occurring words that might not significantly affect the
meaning of the text. Some examples in English are: “is”, “the”, “and”.
An additional benefit of removing stop words is speeding up our models, since we’re reducing
the amount of train/test data. However, in real-world scenarios, you should think about whether
removing stop words can be justified.
We’ll place our clean and tokenize function in a class called Tokenizer.
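A minimal sketch of that class, assuming BeautifulSoup and NLTK's English stop words are available:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

class Tokenizer:

    def clean(self, text):
        # strip HTML tags, keep only letters and spaces, collapse whitespace
        no_html = BeautifulSoup(text, 'html.parser').get_text()
        letters_only = re.sub(r'[^a-z\s]', ' ', no_html, flags=re.IGNORECASE)
        return re.sub(r'\s+', ' ', letters_only).strip()

    def tokenize(self, text):
        stop_words = set(stopwords.words('english'))
        words = self.clean(text).lower().split()
        return [w for w in words if w not in stop_words]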
Naive Bayes
Naive Bayes models are probabilistic classifiers that use the Bayes theorem⁴⁶ and make a strong
assumption that the features of the data are independent. For our case, this means that each word is
independent of others.
Intuitively, this might sound like a dumb idea. You know (even from reading) that the previous
word(s) influence the current and next ones. However, the assumption simplifies the math and works
really well in practice⁴⁷!
The Bayes theorem is defined as:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
In words: to get the probability of A given B, we take the probability of B given A (known as the
likelihood) and multiply that by the probability of A (known as the prior). All of this is divided by the
probability of B happening on its own.
The naive assumption allows us to reformulate the Bayes theorem for our example as:
$$P(Sentiment \mid w_1, \ldots, w_n) = \frac{P(Sentiment) \prod_{i=1}^{n} P(w_i \mid Sentiment)}{P(w_1, \ldots, w_n)}$$
We don’t really care about probabilities. We only want to know whether a given text has a positive
or negative sentiment. We can skip the denominator entirely since it just scales the numerator:
$$P(Sentiment \mid w_1, \ldots, w_n) \propto P(Sentiment) \prod_{i=1}^{n} P(w_i \mid Sentiment)$$
To choose the sentiment, we’ll compare the scores for each sentiment and pick the one with a higher
score.
In practice, multiplying many probabilities between 0 and 1 can underflow to zero, so we sum
logarithms instead:

$$\log P(Sentiment \mid w_1, \ldots, w_n) = \log P(Sentiment) + \sum_{i=1}^{n} \log P(w_i \mid Sentiment)$$
Note that we can still use the highest score of our model to predict the sentiment since log is
monotonic⁴⁹.
⁴⁸https://en.wikipedia.org/wiki/Multinomial_distribution
⁴⁹https://en.wikipedia.org/wiki/Monotonic_function
We’re going to implement a generic text classifier that doesn’t assume the number of classes. You
can use it for news category prediction, sentiment analysis, email spam detection, etc.
For each class, we’ll find the number of examples in it and the log probability (prior). We’ll also
record the number of occurrences of each word and create a vocabulary - set of all words we’ve
seen in the training data:
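A sketch of that fit method, inside the MultinomialNaiveBayes class used later (the attribute names are assumptions, mirrored in the predict sketch below):

from collections import Counter, defaultdict

def fit(self, X, y):
    self.n_class_items = {}
    self.log_class_priors = {}
    self.word_counts = {}
    self.vocab = set()

    n = len(X)
    grouped_data = self.group_by_class(X, y)
    for c, data in grouped_data.items():
        self.n_class_items[c] = len(data)
        self.log_class_priors[c] = np.log(self.n_class_items[c] / n)
        self.word_counts[c] = defaultdict(int)
        for text in data:
            # bag of words for one review
            counts = Counter(self.tokenizer.tokenize(text))
            for word, count in counts.items():
                self.vocab.add(word)
                self.word_counts[c][word] += count
    return self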
Note that we use our Tokenizer and the built-in class Counter⁵⁰ to convert a review to a bag of
words.
In case you’re interested, here’s how group_by_class works:
⁵⁰https://docs.python.org/3/library/collections.html#collections.Counter
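A possible sketch, simply grouping the examples by their label:

def group_by_class(self, X, y):
    grouped = {c: [] for c in self.classes}
    for text, label in zip(X, y):
        grouped[label].append(text)
    return grouped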
Making predictions
In order to predict sentiment from text data, we’ll use our class priors and vocabulary:
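A sketch of predict following that description (it already applies the Laplace smoothing explained in the next section):

def predict(self, X):
    result = []
    for text in X:
        class_scores = dict(self.log_class_priors)
        words = set(self.tokenizer.tokenize(text))
        for word in words:
            if word not in self.vocab:
                continue
            for c in self.classes:
                # Laplace-smoothed log likelihood of the word given the class
                num = self.word_counts[c][word] + 1
                denom = sum(self.word_counts[c].values()) + len(self.vocab)
                class_scores[c] += np.log(num / denom)
        result.append(max(class_scores, key=class_scores.get))
    return result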
For each review, we use the log priors for positive and negative sentiment and tokenize the text.
For each word, we check if it is in the vocabulary and skip it if it is not. Then we calculate the log
probability of this word for each class. We add the class scores and pick the class with a max score
as our prediction.
Laplace smoothing
Note that log(0) is undefined (and no, we’re not using JavaScript here). It is entirely possible for a
word in our vocabulary to be present in one class but not another - the probability of finding this
word in that class will be 0! We can use Laplace smoothing⁵¹ to fix this problem. We’ll simply add 1
to the numerator but also add the size of our vocabulary to the denominator:
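Written out, the smoothed estimate for a word w in class c becomes:

$$P(w \mid c) = \frac{count(w, c) + 1}{\sum_{w' \in V} count(w', c) + |V|}$$

where V is our vocabulary.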
⁵¹https://en.wikipedia.org/wiki/Additive_smoothing
Predicting sentiment
Now that your model can be trained and make predictions, let’s use it to predict sentiment from
movie reviews.
Data preparation
As discussed previously, we’ll use only the review and sentiment columns from the training data.
Let’s split it for training and testing:
X = train['review'].values
y = train['sentiment'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_SEED
)
Evaluation
We’ll pack our fit and predict functions into a class called MultinomialNaiveBayes. Let’s use it:
MNB = MultinomialNaiveBayes(
    classes=np.unique(y),
    tokenizer=Tokenizer()
).fit(X_train, y_train)
Our classifier takes a list of possible classes and a Tokenizer as parameters. Also, the API is quite
nice (thanks, scikit-learn!).
y_hat = MNB.predict(X_test)
accuracy_score(y_test, y_hat)
0.8556
Overall, our classifier does pretty well for itself. Submit the predictions to Kaggle and find out
what place you’ll get on the leaderboard.
Conclusion
Well done! You just built a Multinomial Naive Bayes classifier that does pretty well on sentiment
prediction. You also learned about Bayes theorem, text processing, and Laplace smoothing! Will
another flavor of the Naive Bayes classifier perform better?
Complete source code in Google Colaboratory Notebook⁵²
Next up, you’ll build a recommender system from scratch!
⁵²https://colab.research.google.com/drive/1OPQDDJTKy0b40pziZWSsoBCmQV6HyXsm
Recommender Systems are becoming more and more relevant as the amount of information on “The
Internets” is exponentially increasing:
Finding what you might enjoy from books, movies, and games, to whom to follow on Instagram, becomes
increasingly difficult. Moreover, we (the users) require faster interactions with the products we use
on a daily basis, so we don’t feel like we’re wasting time (even though we do it more than ever before in
human history). As the amount of data increases, your computational resources might have a hard
time producing results fast enough.
Here, we’ll have a look at a succinct implementation of a Recommender System that is both the
basis of many real-world implementations and easy to understand. I highly recommend reading
this through :)
Complete source code in Google Colaboratory Notebook⁵³
User Ratings
Traditionally, recommender systems are built around user ratings given for a set of items, e.g. movie
ratings on IMDB. Here, we’ll have a look at using another metric for making recommendations.
Our data comes from last.fm⁵⁴, hosted by GroupLens (download from here⁵⁵).
We’ll focus on the user_artists.dat and artists.dat files, as they contain all the data required to make
recommendations for new music artists to a user. Instead of ratings, we’re going to use the play count by
a user for each artist.
Data wrangling
You need to do a bit of wrangling before working with the data:
⁵³https://colab.research.google.com/drive/1_WxDPLGkJY3qJ-PK0J1YjATaZz35efmk
⁵⁴https://www.last.fm/
⁵⁵https://grouplens.org/datasets/hetrec-2011/
ap = pd.merge(
    artists, plays,
    how="inner",
    left_on="id",
    right_on="artistID"
)
ap = ap.rename(columns={"weight": "playCount"})
We merge the artists and user plays and rename the weight column to playCount. Let’s rank the
artists based on how much they were played by the users:
artist_rank = ap.groupby(['name']) \
    .agg({'userID': 'count', 'playCount': 'sum'}) \
    .rename(columns={'userID': 'totalUniqueUsers', 'playCount': 'totalArtistPlays'}) \
    .sort_values(['totalArtistPlays'], ascending=False)

artist_rank['avgUserPlays'] = artist_rank['totalArtistPlays'] / artist_rank['totalUniqueUsers']
Exploration
Let’s look at how much each artist is played by users:
No surprises here, popular artists are taking the majority of the plays. Good thing that “The Beatles”
still hold strong, though :)
Recommender Systems
Recommender systems (RS) are used pretty much everywhere where you have an array of items to
choose from. They work by making suggestions that might help you make better/faster choices.
You might think that you’re a special snowflake, but RS exploit behavior patterns that are shared
between users (you and other people). For example, you and your friends might have similar tastes
for some stuff. RS try to find “friends” among other users and recommend things you haven’t tried
yet.
The two most used types of RS are Content-Based and Collaborative Filtering (CF). Collaborative
filtering produces recommendations based on the knowledge of users’ preference to items, that is
it uses the “wisdom of the crowd” to recommend items. In contrast, content-based recommender
systems focus on the properties of the items and give you recommendations based on the similarity
between them.
Memory-Based CF methods can be divided into two sections: user-item filtering and item-item
filtering. Here is the difference:
• Item-Item Collaborative Filtering: “Users who liked this item also liked …”
• User-Item Collaborative Filtering: “Users who are similar to you (kinda like the twin you never
knew you had) also liked …”
Both methods require a user-item matrix that contains the rating of user u for item i. From that, you
can calculate the similarity matrix.
The similarity values in Item-Item Collaborative Filtering are calculated by taking into account all
users who have rated a pair of items.
For User-Item Collaborative Filtering, the similarity values are calculated by observing all items that
are rated by a pair of users.
Model-based CF methods are based on matrix factorization (MF)⁵⁶. MF methods are used as an
unsupervised learning method for latent variable decomposition and dimensionality reduction. They
can handle scalability and sparsity problems better than Memory-based CF.
The goal of MF is to learn latent user preferences and item attributes from known ratings. Then it uses
those variables to predict unknown ratings through the dot product of the latent features of users
and items.
Matrix factorization restructures the user-item matrix into a low-rank matrix. You can represent it
by the multiplication of two low-rank matrices, where the rows contain a vector of latent variables.
You want this matrix to approximate the original matrix, as closely as possible, by multiplying the
low-rank matrices together. That way, you predict the missing entries in the original matrix.
$$X = U S V^T$$

Given an m × n matrix X:

• U is an (m × r) orthogonal matrix
• S is an (r × r) diagonal matrix with non-negative real numbers on the diagonal
• V^T is an (r × n) orthogonal matrix

where U represents feature vectors of the users, V represents feature vectors of the items, and the
elements on the diagonal of S are known as singular values.
You can make a prediction by taking the dot product of U , S and V T . Here is a quick example of how
you can implement SVD in Python:
⁵⁶https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)
⁵⁷https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
import scipy.sparse as sp
from scipy.sparse.linalg import svds

U, S, VT = svds(user_item_ratings, k=20)
S_diagonal = np.diag(S)
Y_hat = np.dot(np.dot(U, S_diagonal), VT)
$$\min_{Q^*, P^*} \sum_{(u,i) \in K} \left( r_{ui} - P_u^T Q_i \right)^2 + \lambda \left( \lVert Q_i \rVert^2 + \lVert P_u \rVert^2 \right)$$

where K is the set of (u, i) pairs for which the rating r_ui of user u for item i is known, and λ is a regularization term
(used to avoid overfitting). The training of our model consists of minimizing the regularized squared
error. After an estimate of P and Q is obtained, you can predict unknown ratings by taking the dot
product of the latent features for users and items.
We can apply Stochastic Gradient Descent (SGD)⁵⁹ or Alternating Least Squares (ALS)⁶⁰ to
minimize the loss function. Both methods can be used to incrementally update our model (online
learning) as new ratings come in.
Here, we’ll implement SGD as it seems to be generally faster and more accurate⁶¹ than ALS (except
in situations of highly sparse and implicit data). As an added bonus, SGD is widely used for training
Deep Neural Networks (which we’ll discuss later). Thus, many high-quality implementations exist
for this algorithm.
⁵⁸https://en.wikipedia.org/wiki/Cold_start_(computing)
⁵⁹https://en.wikipedia.org/wiki/Stochastic_gradient_descent
⁶⁰https://www.quora.com/What-is-the-Alternating-Least-Squares-method-in-recommendation-systems-And-why-does-this-
algorithm-work-intuition-behind-this
⁶¹http://cs229.stanford.edu/proj2014/Christopher%20Aberger,%20Recommender.pdf
Preprocessing
In order to apply the CF algorithm, we need to transform our dataset into a user-artist play count
matrix. Let’s do that after doing data scaling first:
pc = ap.playCount
play_count_scaled = (pc - pc.min()) / (pc.max() - pc.min())

ap = ap.assign(playCountScaled=play_count_scaled)
This squishes the play counts into the [0, 1] range and adds a new column to our data frame. Let’s
build our “ratings” data frame:
ratings_df = ap.pivot(
    index='userID',
    columns='artistID',
    values='playCountScaled'
)

ratings = ratings_df.fillna(0).values
We use Pandas pivot⁶² method to create index/column data frame and fill the missing play counts
with 0.
Let’s have a look at how sparse our data frame is:
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('{:.2f}%'.format(sparsity))
0.28%
Our dataset seems really sparse. Next, let’s split our data into training and validation data sets:
⁶²https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html
MIN_USER_RATINGS = 35
DELETE_RATING_COUNT = 15

def train_test_split(ratings):
    validation = np.zeros(ratings.shape)
    train = ratings.copy()

    for user in np.arange(ratings.shape[0]):
        if len(ratings[user, :].nonzero()[0]) >= MIN_USER_RATINGS:
            val_ratings = np.random.choice(
                ratings[user, :].nonzero()[0],
                size=DELETE_RATING_COUNT,
                replace=False
            )
            train[user, val_ratings] = 0
            validation[user, val_ratings] = ratings[user, val_ratings]
    return train, validation
Okay, that is a lot different than you might’ve expected. We remove some existing play counts from
users by replacing them with zeros.
We’ll evaluate the training with Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

where y_i is the real value for item i, ŷ_i is the predicted one, and N is the size of the training set.
A discussion on more evaluation metrics can be found here⁶³
Here is the RMSE in Python:
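A sketch of it, comparing predictions against the known (non-zero) entries only:

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return np.sqrt(np.mean((prediction - ground_truth) ** 2))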
Training
Let’s use SGD to train our recommender:
⁶³https://www.microsoft.com/en-us/research/publication/evaluating-recommender-systems/
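A minimal sketch of such a recommender, matching the description that follows (the class name, n_factors and the learning rate are assumptions):

class Recommender:

    def __init__(self, n_factors=40, n_epochs=20, lr=0.001):
        self.n_factors = n_factors  # number of latent features
        self.n_epochs = n_epochs
        self.lr = lr

    def fit(self, train, validation):
        n_users, n_items = train.shape
        # latent feature matrices for users (P) and items (Q)
        self.P = np.random.normal(scale=1.0 / self.n_factors,
                                  size=(self.n_factors, n_users))
        self.Q = np.random.normal(scale=1.0 / self.n_factors,
                                  size=(self.n_factors, n_items))
        self.train_error, self.val_error = [], []
        users, items = train.nonzero()
        for epoch in range(self.n_epochs):
            for u, i in zip(users, items):
                # simple difference between existing and predicted rating
                error = train[u, i] - np.dot(self.P[:, u], self.Q[:, i])
                self.P[:, u] += self.lr * error * self.Q[:, i]
                self.Q[:, i] += self.lr * error * self.P[:, u]
            # track training and validation RMSE after each epoch
            y_hat = self.predictions(self.P, self.Q)
            self.train_error.append(rmse(y_hat, train))
            self.val_error.append(rmse(y_hat, validation))
        return self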
We start by creating 2 matrices for the latent features of the users and the items. For each (user,
item) pair, we calculate the error (note we use a simple difference of the existing and predicted
ratings). We then update P and Q using Gradient Descent.
After each training epoch, we calculate the training and validation errors and store their values to
analyze later. Here is the predictions method implementation:
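A sketch of it:

def predictions(self, P, Q):
    # dot product of the transposed user features with the item features
    return np.dot(P.T, Q)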
As we discussed earlier, we obtain predictions by taking the dot product of the transposed P matrix
with Q.
Evaluation
Let’s have a quick look at how the training process went by looking at the training and validation
RMSE:
Our model seems to train gradually, and both errors decrease as the number of epochs increase. Note
that you might want to spend even more time to train your model, as it might get even better.
Making recommendations
Finally, we are ready to make some recommendations for a user. Here is the implementation of the
predict method:
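A sketch following those two steps:

def predict(self, train, user_index):
    y_hat = self.predictions(self.P, self.Q)
    predictions_index = np.where(train[user_index, :] == 0)[0]
    return y_hat[user_index, predictions_index]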
First, we obtain the matrix with all predicted play counts. Second, we take the indices of all unknown
play counts and return only these as predictions. Let’s make some recommendations for a specific
user:
user_id = 1236
user_index = ratings_df.index.get_loc(user_id)
predictions_index = np.where(train[user_index, :] == 0)[0]

rating_predictions = recommender.predict(train, user_index)
Before looking at the recommendation list, let’s have a look at what this user’s preferences currently
are:
name rating
Marilyn Manson 0.196486
3 Doors Down 0.043204
Pearl Jam 0.042016
Children of Bodom 0.025657
Disturbed 0.021690
Rammstein 0.021562
A Perfect Circle 0.020879
Gojira 0.017051
Rob Zombie 0.016280
D12 0.010990
create_artist_ratings(
    artists,
    predictions_index,
    rating_predictions
)
name rating
Sander van Doorn 0.559202
Jacob Miller 0.552893
Celtas Cortos 0.552142
Camisa de Vênus 0.546361
Ella Fitzgerald & Louis Armstrong 0.541477
The Answer 0.537367
Anjelika Akbar 0.536756
Medications 0.536486
Lützenkirchen 0.535515
Archie Star 0.535147
Now might be a good time to go on YouTube or Spotify and try out some of those artists!
Conclusion
Congratulations on building a highly performant Recommender System from scratch! You’ve
learned how to:
We live in the age of Instagram, YouTube, and Twitter. Images and video (a sequence of images)
dominate the way millennials and other weirdos consume information.
Having models that understand what images show can be crucial for understanding your emotional
state (yes, you might get a personalized Coke ad right after you post your breakup selfie on
Instagram), location, interests and social group.
Predominantly, models that understand image data used in practice are (Deep) Neural Networks.
Here, we’ll implement a Neural Network image classifier from scratch in Python.
Complete source code in Google Colaboratory Notebook⁶⁵
Image Data
Hopefully, it’s not a complete surprise to you that computers can’t actually see images as we do. Each
image on your device is represented/stored as a matrix, where each pixel is one or more numbers.
You might be familiar with the original handwritten digits MNIST⁶⁷ dataset and wondering why
we’re not using it? Well, it might be too easy to make predictions on⁶⁸. And of course, fashion is
cooler, right?
Exploration
The product images are grayscale, 28x28 pixels and look something like this:
⁶⁷http://yann.lecun.com/exdb/mnist/
⁶⁸http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/
Here are the first 3 rows from the pixel matrix of the image:
Let’s have a look at a lower-dimensional representation of some of the products using t-SNE⁶⁹. We’ll
transform the data into 2 dimensions using the implementation from scikit-learn⁷⁰:
You can observe a clear separation between some classes and significant overlap between others.
Let’s build a Neural Network that can try to separate between different fashion products!
Neural Networks
Neural Networks (NNs), Deep Neural Networks in particular, are all the rage in the last couple of
years in the Machine Learning realm. That’s hardly a surprise since most state-of-the-art results
(SOTA)⁷¹ on various Machine Learning problems are obtained via Neural Nets.
Each neuron computes a weighted sum of its inputs and passes it through an activation:

$$y = f\left(\sum_{i} w_i x_i + b\right)$$

where f is an activation function that controls how strong the output signal of the neuron is.
• Input layer - 3 neurons that should match the size of your input data
• Hidden layer - 4 neurons with weights W that your model should learn during training
• Output layer - 2 neurons that provide the predictions of your model
Want to build a Deep Neural Network? Just add at least one more hidden layer:
⁷⁵https://cs231n.github.io/
Sigmoid
The sigmoid function⁷⁷ is a quite commonly used activation function, at least it was until recently. It
has a distinct S shape, it is a differentiable real function for any real input value, and its output values
are between 0 and 1. Additionally, it has a positive derivative at each point. More importantly, we will
use it as the activation function for the hidden layer of our model.
⁷⁶https://cs231n.github.io/
⁷⁷https://en.wikipedia.org/wiki/Sigmoid_function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
Its first derivative (which we will use during the backpropagation step of our training algorithm)
has the following formula:

$$\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$$

def sigmoid_prime(z):
    sg = sigmoid(z)
    return sg * (1 - sg)
Softmax
The softmax function can be easily differentiated, it is pure (output depends only on input) and the
elements of the resulting vector sum to 1. Here it is:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K$$

def softmax(z):
    return (np.exp(z.T) / np.sum(np.exp(z), axis=1)).T
In probability theory, the output of the softmax function is sometimes used as a representation of a
categorical distribution. Let’s see an example result:
softmax(np.array([[2, 4, 6, 8]]))
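Since softmax is deterministic, the call returns approximately:

array([[0.00214401, 0.0158422 , 0.11705891, 0.86495488]])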
The output has most of its weight corresponding to the input 8. The softmax function highlights the
largest value(s) and suppresses the smaller ones.
Backpropagation
Backpropagation is the backbone of almost anything we do when using Neural Networks. The
algorithm consists of 3 subtasks:
1. Make a forward pass
2. Calculate the error
3. Propagate the error backward and update the weights
In the first step, backprop uses the data and the weights of the network to compute a prediction. Next,
the error is computed based on the prediction and the provided labels. The final step propagates the
error through the network, starting from the final layer. Thus, the weights get updated based on the
error, little by little.
Let’s build more intuition about what the algorithm is actually doing.
Solving XOR
We will try to create a Neural Network that can properly predict values from the XOR function.
Here is its truth table:

x1  x2  XOR(x1, x2)
0   0   0
0   1   1
1   0   1
1   1   0
epochs = 50000
input_size, hidden_size, output_size = 2, 3, 1
LR = .1 # learning rate
The epochs parameter controls how many times our algorithm will “see” the data during training. Then we set the number of neurons in the input, hidden, and output layers - we have 2 numbers as input and 1 number as output. The learning rate parameter controls how quickly our Neural Network will learn from new data and how quickly it will forget what it already knows.
Our training data (from the table) looks like this:
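The listing isn’t reproduced here; as NumPy arrays (a sketch matching the truth table above), it would be:

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0, 1, 1, 0]]).T  # one XOR label per row of X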
The W matrices in our NN need to have some initial values. We’ll sample them from a uniform distribution, initialized with the proper size:
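A minimal sketch of the initialization and the training loop, matching the description below (the names w_hidden and w_output, the seed, and the progress printout are assumptions; the original listing is in the notebook):

np.random.seed(42)  # assumption: fixed seed for reproducibility

w_hidden = np.random.uniform(size=(input_size, hidden_size))
w_output = np.random.uniform(size=(hidden_size, output_size))

for epoch in range(epochs):
    # Forward pass
    z_hidden = np.dot(X, w_hidden)
    act_hidden = sigmoid(z_hidden)
    output = np.dot(act_hidden, w_output)

    # Error: difference between true and predicted values (crude, but fine here)
    error = y - output

    if epoch % 5000 == 0:
        print(f'epoch {epoch}, error sum {np.sum(np.abs(error)):.4f}')

    # Backward pass: nudge the weights in the direction that reduces the error
    dZ = error * LR
    w_output += act_hidden.T.dot(dZ)
    dH = dZ.dot(w_output.T) * sigmoid_prime(z_hidden)
    w_hidden += X.T.dot(dH)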
That error seems to be decreasing! YaY! And the implementation is not that scary, is it?
During the forward step, we take the dot product of the data X and Whidden and apply our activation
function to obtain the output of our hidden layer. We obtain the predictions by taking the dot product
of the hidden layer output and Woutput .
To obtain the error, we calculate the difference between the true values and the predicted ones.
Note that this is a very crude metric, but it works fine for our example.
Finally, we use the calculated error to adjust the weights. Note that we need the result of the forward pass act_hidden to update Woutput, and the first derivative, computed with sigmoid_prime, to update Whidden.
In order to make an inference (predictions) we’ll do just the forward step (since we won’t adjust W
based on the result):
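For example, to check XOR(1, 0) (this particular test input is my choice):

X_test = np.array([[1, 0]])  # we expect XOR(1, 0) = 1
act_hidden = sigmoid(np.dot(X_test, w_hidden))
np.round(np.dot(act_hidden, w_output))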
array([ 1.])
Classifying Images
We’ll sample a uniform distribution with values between -1 and 1 for our initial weights. Here is the
implementation:
def _init_weights(self):
    w1 = np.random.uniform(-1.0, 1.0,
                           size=(self.n_hidden_units, self.n_features))
    w2 = np.random.uniform(-1.0, 1.0,
                           size=(self.n_classes, self.n_hidden_units))
    return w1, w2
Training
For each epoch, we apply the backprop algorithm, evaluate the error and the gradient with respect
to the weights. We then use the learning rate and gradients to update the weights.
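In code, the training loop might look roughly like this (the helper _backprop_step and the attribute names are assumptions; the full implementation is in the notebook):

def fit(self, X, y):
    self.error_ = []
    for epoch in range(self.epochs):
        # _backprop_step: hypothetical helper returning the epoch error and
        # the gradients of the loss with respect to w1 and w2
        error, grad1, grad2 = self._backprop_step(X, y)
        self.error_.append(error)

        # gradient descent: step against the gradients, scaled by the learning rate
        self.w1 -= self.learning_rate * grad1
        self.w2 -= self.learning_rate * grad2
    return self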
Doing a backprop step is a bit more complicated than in our XOR example. We do an additional step before returning the gradients - we apply L1 and L2 Regularization⁷⁸. Regularization is used to guide our training towards simpler models by penalizing large values for our parameters W.
Our forward and backward steps are very similar to those in our previous example. How about the error?
We’re going to use Cross-Entropy loss⁷⁹ (known as log loss) function to evaluate the error. This
function measures the performance of a classification model whose output is a probability. It
penalizes (harshly) predictions that are wrong and confident. Here is the definition:
⁷⁸http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
⁷⁹https://en.wikipedia.org/wiki/Cross_entropy
Cross-Entropy = − ∑_{c=1}^{C} y_{o,c} · log(p_{o,c})
where C is the number of classes, y_{o,c} is a binary indicator of whether class c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o is of class c.
The implementation in Python looks like this:
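A direct translation of the formula (assuming outputs holds the predicted probabilities and y_target is one-hot encoded) could be:

def cross_entropy(outputs, y_target):
    # element-wise -y * log(p), summed over the classes of each observation
    return -np.sum(np.log(outputs) * y_target, axis=1)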
Now that we have our loss function, we can finally define the error for our model:
After computing the Cross-Entropy loss, we add the regularization terms and calculate the mean
error. Here is the implementation for L1 and L2 regularizations:
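One possible way to write these (the shared strength parameter lambda_ is an assumption) is:

def L1_reg(lambda_, w1, w2):
    # penalty proportional to the absolute values of the weights
    return (lambda_ / 2.0) * (np.abs(w1).sum() + np.abs(w2).sum())

def L2_reg(lambda_, w1, w2):
    # penalty proportional to the squared values of the weights
    return (lambda_ / 2.0) * (np.sum(w1 ** 2) + np.sum(w2 ** 2))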
Making predictions
Now that our model can learn from data, it is time to make predictions on data it hasn’t seen before.
We’re going to implement two methods for prediction - predict and predict_proba (sketched below).
Recall that making predictions with a NN (generally) involves applying a forward step on the data. Its result is a vector of values representing how strong the belief for each class is, given the data. We’ll use Maximum likelihood estimation (MLE)⁸⁰ to obtain our final predictions.
⁸⁰https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
MLE works by picking the highest value and returning it as the predicted class for the input.
The method predict_proba returns a probability distribution over all classes, representing how
likely each class is to be correct. Note that we obtain it by applying the softmax function to the
result of the forward step.
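Putting the two together, a sketch of both methods might look like this (the _forward helper standing in for the forward pass is an assumption):

def predict(self, X):
    # forward pass, then pick the class with the strongest activation (MLE)
    a_out = self._forward(X)
    return np.argmax(a_out, axis=1)

def predict_proba(self, X):
    # forward pass, then squash the activations into a probability distribution
    a_out = self._forward(X)
    return softmax(a_out)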
Evaluation
Time to put our NN model to the test. Here’s how we can train it:
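Roughly, the call might look like this (the class name NNClassifier and these hyperparameter values are assumptions; the exact call is in the notebook):

nn = NNClassifier(
    n_classes=10,       # 10 fashion categories
    n_features=784,     # 28 x 28 pixels per image
    n_hidden_units=50,  # an assumption; any reasonable size works
    epochs=300,
    learning_rate=0.001
).fit(X_train, y_train)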
The training might take some time, so please be patient. Let’s get the predictions:
y_hat = nn.predict_proba(X_test)
Something looks fishy here; it seems like our model can’t continue to reduce the error after 150 epochs or so. Let’s have a look at a single prediction:
Not too good. How about the training & testing accuracy:
Well, those don’t look that good. While a random classifier would return ∼10% accuracy (there are 10 classes), ∼50% accuracy on the test dataset doesn’t make for a practical classifier either.
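One handy trick is to scale the pixel values before training, e.g. by centering them and dividing by the standard deviation (a sketch; the exact preprocessing in the notebook may differ):

# compute the scaling statistics on the training set only,
# assuming X_train and X_test hold the flattened pixel arrays
pixel_mean, pixel_std = X_train.mean(), X_train.std()

X_train_scaled = (X_train - pixel_mean) / pixel_std
X_test_scaled = (X_test - pixel_mean) / pixel_std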
After retraining on the scaled data, the error seems a lot more stable and settles at a lower point - ∼200 vs ∼400. Let’s have a look at some predictions:
∼87% accuracy (vs ∼50%) on the training set is a vast improvement over the unscaled version. Finally, your hard work paid off!
Conclusion
What a ride! I hope you had a blast working on your first Neural Network from scratch, too!
You learned how to process image data, transform it, and use it to train your Neural Network. We
used some handy tricks (scaling) to vastly improve the performance of the classifier.
Complete source code in Google Colaboratory Notebook⁸³
⁸³https://drive.google.com/file/d/1S59KWV8KmZTI-A6OpXge3yG87HXyIcUz/view?usp=sharing
You wake up. It is a sunny day. Your partner is still asleep next to you. You take a minute to admire
the beauty and even crack a smile.
Your stomach is rumbling, so you jump on your feet and look around. Jackpot! You’re looking at
the leftovers of the anniversary dinner you two had last night. You take a minute to scratch some
private spot of yours. Yep, what an amazing morning! You have 1/3 of a bean can, expired only 6
months ago. It is delicious!
Although you feel bad about not leaving any food for the one sleeping in your bed, you dress up
and prepare for work. “What are you going to do? Not eat and work!?” - those thoughts don’t seem
to help anymore. Time to hop in the car!
You try to reach Johny via the radio, but no luck. Oh well, Jimmy has been missing since last week; his twin brother might be, too. Strange, you can’t remember the last time your eyes felt wet.
A lot has changed since the event. The streets are getting more dangerous every week, but people
still need transportation. You receive a pickup request and have a look at the anomaly map. You
accept hesitantly. Your map was updated 2 months ago.
You got the job done and got credit for a half can. That couple was really generous! That leaves time
for tinkering with the idea of making self-driving taxis. You even got a name - SafeCab. You read a lot about this crazy guy called Elon Nusk who was trying to build those fully autonomous vehicles right before the event occurred.
You start scratching your head. Is this possible or just a fantasy? After all, if you get this done, you
might afford to eat every other day. Enter Reinforcement Learning.
Complete source code in Google Colaboratory Notebook⁸⁴
⁸⁴https://colab.research.google.com/drive/1FMo6Lpf1UtO1blfMyA4yznzDN7tlgIWm
Reinforcement Learning
Reinforcement Learning (RL) is concerned with producing algorithms (agents) that try to achieve some predefined goal. Achieving this objective depends on choosing a set of actions and receiving rewards for the good ones and punishments for the bad ones - that’s where the reinforcement bit comes from. The agent acts in an environment, which has a state, gives rewards, and exposes a set of possible actions.
The discount factor γ allows us to inject the heuristic that 100 bucks now are more valuable than 100 bucks in 30 days. For example, with γ = 0.9, a reward of 100 received 3 steps from now is worth only 0.9^3 · 100 = 72.9 today. A discount factor of 1 would make future rewards worth just as much as immediate rewards.
Another reason to discount future rewards: the future is uncertain, so a reward that arrives sooner is more likely to actually be collected.
Learning
Here is how learning happens in the RL context:

for time step t = 0 until done:
    1. The agent observes the current state s_t of the environment
    2. The agent picks an action a_t
    3. The environment gives back a reward r_t and transitions to a new state s_{t+1}
What is the objective of all this? To find a function π*, known as the optimal policy, that maximizes the cumulative discounted reward:

∑_{t≥0} γ^t · r_t

where r_t is the reward received at step t and γ^t is the discount factor at step t.
A policy π is a function that maps a state s to the action a that our agent believes is best given that state.
Value function
The value function gives you the expected future (discounted) reward the agent will get, starting from some state s and following some policy π:

V_π(s) = E_π[ ∑_{k=0}^{∞} γ^k · R_{t+k+1} | S_t = s ]
There exists an optimal value function that has the highest value for all states, defined as:

V*(s) = max_π V_π(s), for all states s
Q-function
Similarly, let’s define another function known as Q-function (state-action value function) that gives
the expected return starting from state s, taking action a, and following policy π.
Q_π(s, a) = E_π[ ∑_{k=0}^{∞} γ^k · R_{t+k+1} | S_t = s, A_t = a ]
You can think of the Q-function as the “quality” of a certain action given a state. We can define the optimal Q-function as:

Q*(s, a) = max_π Q_π(s, a)
There is a relationship between the two optimal functions V* and Q*. It is given by:

V*(s) = max_a Q*(s, a)

That is, the maximum expected total reward when starting at s is the maximum of Q*(s, a) over all possible actions.
We can find the optimal policy π* by choosing the action a that gives maximum reward Q*(s, a) for state s:

π*(s) = argmax_a Q*(s, a)
There seems to be a synergy between all the functions we’ve defined so far. More importantly, we can now build an optimal agent for a given environment. But can we do it in practice?
Q-learning
Q-Learning is an off-policy algorithm for Temporal Difference (TD) learning⁸⁹. It has been proven that, given enough training, it converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. It learns the optimal policy even when actions are selected using an exploratory (somewhat random) policy - that is the off-policy part.
Given a state s and action a, we can express Q(s, a) in terms of itself (recursively):
⁸⁹https://en.wikipedia.org/wiki/Temporal_difference_learning
Q(s, a) = r + γ · max_a′ Q(s′, a′)
This is the Bellman equation. It defines the maximum future reward as the reward r the agent received for being at state s, plus the maximum future reward for the next state s′, taken over every possible action a′.
We can define Q-learning as an iterative approximation of Q* using the Bellman equation:

Q(s, a) ← Q(s, a) + α · (r + γ · max_a′ Q(s′, a′) − Q(s, a))

where α is the learning rate that controls how much of the difference between the previous and the new Q value is taken into account.
Here’s the general algorithm we’re going to implement:

1. Initialize the Q-table with zeros
2. In the current state s, choose an action a (explore or exploit)
3. Execute the action, then observe the reward r and the next state s′
4. Update Q(s, a) using the update equation above
5. Set s ← s′ and repeat from step 2 until the episode is done
Note that Q-learning is just one possible algorithm to solve the RL problem. Here is a comparison
between some more of them⁹⁰.
Exploration vs exploitation
You might’ve noticed that we glossed over the strategy for choosing an action. Any strategy you implement will have to decide how often to try something new versus use something it already knows. This is known as the exploration/exploitation tradeoff.
Remember, the goal of our RL agent is to maximize the expected cumulative reward. What does this
mean for our self-driving taxi?
Initially, our driving agent knows pretty much nothing about the best set of driving directions for picking up and dropping off passengers. It should learn to avoid anomalies too, since they are bad for business (passengers tend to disappear, or worse, in those)! During this time, we expect it to do a lot of exploring.
After obtaining some experience, the agent can use it to choose actions more and more often. Eventually, all of its choices will be based on what it has learned.
⁹⁰https://en.wikipedia.org/wiki/Reinforcement_learning#Comparison_of_reinforcement_learning_algorithms
Your partner sketched this map. Each block represents a small region of the city. Anomalies are
marked in bright circles. The four letters are “safe zones”, where pickups and drop-offs happen.
Let’s assume your car is the only vehicle in the city (not much of a stretch). We can break it up into
a 5x5 grid, which gives us 25 possible taxi locations.
The current taxi location is (3, 1) in (row, col) coordinates. The pickup and drop-off locations are: [(0,0), (0,4), (4,0), (4,3)]. Anomalies are at [(0, 2), (2, 1), (2, 2), (4, 2)].
Environment
We’re going to encode our city map into an environment for the self-driving agent using OpenAI’s
Gym⁹¹. What is this Gym thing anyways?
The most important entity in Gym is the Environment. It provides a unified and easy-to-use interface. Here are the most important methods:

• reset() - resets the environment to its initial state and returns that state
• step(action) - executes an action; returns the new state, the received reward, whether the episode is done, and additional info
• render() - displays the current state of the environment

⁹¹https://gym.openai.com/
The complete source code of the environment is in the notebook. Here, we’ll take a look at the
registration to Gym:
register(
    id='SafeCab-v0',
    entry_point=f"{__name__}:SafeCab",
    timestep_limit=100,
)
Note that we set a timestep_limit which limits the number of steps in an episode.
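With the environment registered, creating an instance of it follows the standard Gym pattern:

import gym

# gym.make looks up the id we registered above
env = gym.make('SafeCab-v0')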
Action Space
Our agent encounters one of the 500 states (5 rows x 5 columns x 5 passenger locations x 4
destinations) and it chooses an action. Here are the possible actions:
• south
• north
• east
• west
• pickup
• dropoff
You might notice that the taxi cannot perform certain actions when near the boundaries of the city. We will penalize such moves with -1 and won’t move the taxi if such a situation occurs.
env.reset()
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))
• 0 = south
• 1 = north
• 2 = east
• 3 = west
• 4 = pickup
• 5 = dropoff
Building an agent
Our agent has a simple interface for interacting with the environment. Here it is:
class Agent:

    def __init__(
        self,
        n_states,
        n_actions,
        decay_rate=0.0001,
        learning_rate=0.7,
        gamma=0.618
    ):
        pass

    def choose_action(self, explore=True):
        pass

    def learn(
        self,
        state,
        action,
        reward,
        next_state,
        done,
        episode
    ):
        pass
We’re going to use the choose_action method when we want our agent to make a decision and act.
Then, after the reward and new state from the environment are observed, our agent will learn from
its actions using the learn method.
Let’s take a look at the implementations:
def __init__(
    self,
    n_states,
    n_actions,
    decay_rate=0.0001,
    learning_rate=0.7,
    gamma=0.618
):
    self.n_actions = n_actions
    self.q_table = np.zeros((n_states, n_actions))
    self.max_epsilon = 1.0
    self.min_epsilon = 0.01
    self.epsilon = self.max_epsilon
    self.decay_rate = decay_rate
    self.learning_rate = learning_rate
    self.gamma = gamma  # discount rate
    self.epsilons_ = []
The interesting part of the __init__ is the initialization of our Q-table. Initially, it is all zeros. Can
we initialize it in some other way?
Our strategy is rather simple. We draw a random number from a uniform distribution between 0
and 1. If this number is smaller than epsilon and we want to explore, we take a random action.
Otherwise, we take the best action based on our current knowledge.
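The snippets here don’t show how the current state reaches choose_action (the notebook has the full picture); assuming it is available to the agent, a sketch of this epsilon-greedy strategy might look like:

def choose_action(self, state, explore=True):
    # explore: with probability epsilon, pick a random action
    if explore and np.random.uniform(0, 1) < self.epsilon:
        return np.random.randint(self.n_actions)
    # exploit: otherwise take the best action we currently know of
    return np.argmax(self.q_table[state, :])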
def learn(
    self,
    state,
    action,
    reward,
    next_state,
    done,
    episode
):
    # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
    self.q_table[state, action] = self.q_table[state, action] + \
        self.learning_rate * (reward + self.gamma * \
        np.max(self.q_table[next_state, :]) - self.q_table[state, action])

    if done:
        # Reduce epsilon to decrease the exploration over time
        self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * \
            np.exp(-self.decay_rate * episode)
        self.epsilons_.append(self.epsilon)
Learning involves the update of the Q-table using the Q-learning equation and reducing the
exploration rate ϵ if the episode is complete.
Training
Now that our agent is ready for action, we can train it on the environment we’ve created. Let’s train our agent for 60k episodes and record the episode rewards over time:
total_episodes = 60000
total_test_episodes = 10

agent = Agent(env.observation_space.n, env.action_space.n)
Let’s have a look at our agent driving before learning anything about the environment:
See YouTube video⁹⁵
⁹⁵https://www.youtube.com/watch?v=CHayUf_IoUc

rewards = []

for episode in range(total_episodes):
    state = env.reset()
    episode_rewards = []

    while True:
        action = agent.choose_action()

        # Take the action (a) and observe the outcome state (s') and reward (r)
        new_state, reward, done, info = env.step(action)

        agent.learn(
            state,
            action,
            reward,
            new_state,
            done,
            episode
        )

        state = new_state

        episode_rewards.append(reward)

        if done:
            break

    rewards.append(np.mean(episode_rewards))
Recall that we set timestep_limit when we registered the environment so our agent won’t stay in
an episode infinitely.
Evaluation
Let’s have a look at reward changes as the training progresses:
Note that the learning curve has been smoothed using savgol_filter⁹⁶: savgol_filter(rewards, window_length=1001, polyorder=2).
Recall that our exploration rate should decrease as our agent is learning. Have a look:
⁹⁶https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html
Here’s how we’re going to test our agent and record the progress:
frames = []

rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    episode_rewards = []

    step = 1

    while True:
        action = agent.choose_action(explore=False)

        new_state, reward, done, info = env.step(action)

        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'episode': episode + 1,
            'step': step,
            'reward': reward
        })

        episode_rewards.append(reward)

        if done:
            step = 0
            break
        state = new_state
        step += 1

    rewards.append(np.mean(episode_rewards))

env.close()
Note that we want our agent to use only the experience it has so we set explore=False. Here’s what
the total reward for each episode looks like:
I know that this chart might not give you a good idea of what the agent is capable of. Here is a video
of it driving in the city:
See YouTube video⁹⁷
Pretty good, right? It looks like it learned to avoid the anomalies and to pick up and drop off passengers.
⁹⁷https://www.youtube.com/watch?v=HHIuVNZNMNg
Conclusion
Congratulations on building a self-driving taxi agent! You’ve learned how to encode a city map as a Gym environment, implement Q-learning from scratch, balance exploration and exploitation, and train and evaluate your agent.
Complete source code in Google Colaboratory Notebook⁹⁸
⁹⁸https://colab.research.google.com/drive/1FMo6Lpf1UtO1blfMyA4yznzDN7tlgIWm