Python Machine Learning Tutorial With Scikit-Learn
Snob Edition
In this end-to-end Python machine learning tutorial, you’ll learn how to use Scikit-Learn to
build and tune a supervised learning model!
We’ll be training and tuning a random forest for wine quality (as judged by wine snob
experts) based on traits like acidity, residual sugar, and alcohol concentration.
Before we start, we should state that this guide is meant for beginners who are interested
in applied machine learning.
Our goal is to introduce you to one of the most flexible and useful libraries for machine
learning in Python. We’ll skip the theory and math in this tutorial, but we’ll still
recommend great resources for learning those.
Before we start…
Recommended Prerequisites
The recommended prerequisites for this guide are at least basic Python programming
skills. To move quickly, we’ll assume you have this background.
Scikit-Learn is also a fantastic library for beginners because it offers a high-level interface for many
tasks (e.g. preprocessing data, cross-validation, etc.). This allows you to better practice
the entire machine learning workflow and understand the big picture.
This is not a complete course on machine learning. Machine learning requires the
practitioner to make dozens of decisions throughout the entire modeling process, and we
won’t cover all of those nuances.
Instead, this is a tutorial that will take you from zero to your first Python machine learning
model with as little headache as possible!
If you’re interested in mastering the theory behind machine learning, then we recommend
our free guide.
In addition, we also won’t be covering exploratory data analysis in much detail, which is a
vital part of real-world machine learning. We’ll leave that for a separate guide.
This tutorial is designed to be streamlined, and it won’t cover any one topic in too
much detail. It may be helpful to have the Scikit-Learn documentation open beside you as
a supplemental reference.
Drinking wine makes predicting wine easier (probably).
Next, make sure the following are installed on your computer: NumPy, Pandas, and
Scikit-Learn.

If you need to update any of the packages, it's as easy as typing $ conda update
<package> from your command line program (Terminal on a Mac).

Shell
$ conda update <package>

First, let's import NumPy, which provides support for efficient numerical computation:

NumPy
Python
import numpy as np
Next, we'll import Pandas, a convenient library that supports dataframes. Pandas is
technically optional because Scikit-Learn can handle numerical matrices directly, but it'll
make our lives easier:
Pandas
Python
import pandas as pd
Now it's time to start importing functions for machine learning. The first one will be the
train_test_split() function from the model_selection module. As its name implies, this
module contains many utilities that will help us choose between models.
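Python
from sklearn.model_selection import train_test_split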
Next, we'll import the entire preprocessing module. This contains utilities for scaling,
transforming, and wrangling data.
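Python
from sklearn import preprocessing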
Next, let's import the families of models we'll need... wait, did you just say "families?"
A "family" of models are broad types of models, such as random forests, SVM's, linear
regression models, etc. Within each family of models, you'll get an actual model after you
fit and tune its parameters to the data.
*Tip: Don't worry too much about this for now... It will make more sense once we get to
Step 7.
For the scope of this tutorial, we'll only focus on training a random forest and tuning its
parameters. We'll have another detailed tutorial for how to choose between model
families.
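Python
from sklearn.ensemble import RandomForestRegressor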
For now, let's move on to importing the tools to help us perform cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
Next, let's import some metrics we can use to evaluate our model performance later.
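Python
from sklearn.metrics import mean_squared_error, r2_score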
And finally, we'll import a way to persist our model for future use.
Joblib is an alternative to Python's pickle package, and we'll use it because it's more
efficient for storing large numpy arrays.
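Python
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib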
Phew! That was a lot. Don't worry, we'll cover each function in detail once we get to it.
Let's first take a quick sip of wine and toast to our progress... cheers!
You can read data from CSV, Excel, SQL, SAS, and many other data formats. Here's a
list of all the Pandas IO tools.
The convenient tool we'll use today is the read_csv() function. Using this function, we can
load any CSV file, even from a remote URL!
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)
print(data.head())
# fixed acidity;"volatile acidity";"citric acid"...
# 0 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56...
# 1 7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68...
# 2 7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0...
# 3 11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;...
# 4 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56...
Crap... that looks really messy. Upon further inspection, it looks like the CSV file is
actually using semicolons to separate the data. That's annoying, but easy to fix:
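Python
data = pd.read_csv(dataset_url, sep=';')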
Great, that's much nicer. Now, let's take a look at the data.
Python
print(data.shape)
# (1599, 12)
We have 1,599 samples and 12 features, including our target feature. We can easily print
some summary statistics.
Summary statistics
Python
print(data.describe())
#        fixed acidity  volatile acidity  citric acid...
# count    1599.000000       1599.000000  1599.000000...
# mean        8.319637          0.527821     0.270976...
# std         1.741096          0.179060     0.194801...
# min         4.600000          0.120000     0.000000...
# 25%         7.100000          0.390000     0.090000...
# 50%         7.900000          0.520000     0.260000...
# 75%         9.200000          0.640000     0.420000...
# max        15.900000          1.580000     1.000000...
Here's the full list of columns in the dataset:

quality (target)
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
All of the features are numeric, which is convenient. However, they have some very
different scales, so let's make a mental note to standardize the data later.
As a reminder, for this tutorial, we're cutting out a lot of exploratory data analysis we'd
typically recommend.
First, let's separate our target (y) features from our input (X) features:
y = data.quality
X = data.drop('quality', axis=1)
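Then, we'll split the data into training and test sets:

Python
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)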
As you can see, we'll set aside 20% of the data as a test set for evaluating our model. We
also set an arbitrary "random state" (a.k.a. seed) so that we can reproduce our results.
Finally, it's good practice to stratify your sample by the target variable. This will
ensure your training set looks similar to your test set, making your evaluation metrics
more reliable.
WTF is standardization?
Standardization is the process of subtracting the means from each feature and then
dividing by the feature standard deviations.
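In symbols, if a feature $x$ has mean $\mu$ and standard deviation $\sigma$, each standardized value is

$$z = \frac{x - \mu}{\sigma}$$

which gives every feature a mean of 0 and a standard deviation of 1.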
Scikit-Learn makes data preprocessing a breeze. For example, here's how simple it is to
scale a dataset in a single step (although, as we'll explain in a moment, we won't actually
end up using this code):
X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled)
# array([[ 0.51358886,  2.19680282, -0.164433  , ...,  1.08415147,
#         -0.69866131, -0.58608178],
#        [-1.73698885, -0.31792985, -0.82867679, ...,  1.46964764,
#          1.2491516 ,  2.97009781],
#        [-0.35201795,  0.46443143, -0.47100705, ..., -0.13658641,
# ...
You can confirm that the scaled dataset is indeed centered at zero, with unit variance:
Python
print(X_train_scaled.mean(axis=0))
# [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

print(X_train_scaled.std(axis=0))
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
Great, but why did we say that we won't use this code?
The reason is that we won't be able to perform the exact same transformation on the test
set.
Sure, we can still scale the test set separately, but we won't be using the same means
and standard deviations as we used to transform the training set.
In other words, it wouldn't be a fair representation of how the model pipeline,
including the preprocessing steps, would perform on brand-new data.
So instead of directly invoking the scale function, we'll be using a feature in Scikit-Learn
called the Transformer API. The Transformer API allows you to "fit" a preprocessing step
using the training data the same way you'd fit a model...
1. Fit the transformer on the training set (saving the means and standard deviations)
2. Apply the transformer to the training set (scaling the training data)
3. Apply the transformer to the test set (using the same means and standard
deviations)
This makes your final estimate of model performance more realistic, and it allows you to
insert your preprocessing steps into a cross-validation pipeline (more on this in Step 7).
scaler = preprocessing.StandardScaler().fit(X_train)
Now, the scaler object has the saved means and standard deviations for each feature in
the training set.
X_train_scaled = scaler.transform(X_train)

print(X_train_scaled.mean(axis=0))
# [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

print(X_train_scaled.std(axis=0))
# [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
Note how we're taking the scaler object and using it to transform the training set. Later,
we can transform the test set using the exact same means and standard deviations used
to transform the training set:
Applying transformer to test data
Python
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(axis=0))
# [ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
#  -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]

print(X_test_scaled.std(axis=0))
# [ 1.02160495  1.00135689  0.97456598  0.91099054  0.86716698  0.94193125
#   1.03673213  1.03145119  0.95734849  0.83829505  1.0286218 ]
Notice how the scaled features in the test set are not perfectly centered at zero with unit
variance! This is exactly what we'd expect, as we're transforming the test set using the
means from the training set, not from the test set itself.
In practice, when we set up the cross-validation pipeline, we won't even need to manually
fit the Transformer API. Instead, we'll simply declare the class object, like so:
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))
This is exactly what it looks like: a modeling pipeline that first transforms the data using
StandardScaler() and then fits a model using a random forest regressor.
There are two types of parameters we need to worry about: model parameters and
hyperparameters. Model parameters can be learned directly from the data (e.g.
regression coefficients), while hyperparameters cannot.
Hyperparameters express "higher-level" structural information about the model, and they
are typically set before training the model.
Within each decision tree, the computer can empirically decide where to create branches
based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the
actual branch locations are model parameters.
However, the algorithm does not know which of the two criteria, MSE or MAE, it
should use. The algorithm also cannot decide how many trees to include in the forest.
These are examples of hyperparameters that the user must set.
print(pipeline.get_params())
# ...
# 'randomforestregressor__criterion': 'mse',
# 'randomforestregressor__max_depth': None,
# 'randomforestregressor__max_features': 'auto',
# 'randomforestregressor__max_leaf_nodes': None,
# ...
You can also find a list of all the parameters on the RandomForestRegressor
documentation page. Just note that when it's tuned through a pipeline, you'll need to
prepend randomforestregressor__ before the parameter name, like in the code above.
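Now, let's declare the hyperparameters we want to tune through cross-validation:

Python
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}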
As you can see, the format should be a Python dictionary (data structure for key-value
pairs) where keys are the hyperparameter names and values are lists of settings to try.
The options for parameter values can be found on the documentation page.
Cross-validation is one of the most important skills in all of machine learning because it
helps you maximize model performance while reducing the chance of overfitting.
These are the steps for CV:
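1. Split your data into k equal parts, or "folds" (typically k = 10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
4. Perform steps (2) and (3) k times, each time holding out a different fold.
5. Aggregate the performance across all k folds. This is your performance metric.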
Let's say you want to train a random forest regressor. One of the hyperparameters you
must tune is the maximum depth allowed for each decision tree in your forest.
That's where cross-validation comes in. Using only your training set, you can use CV to
evaluate different hyperparameters and estimate their effectiveness.
This allows you to keep your test set "untainted" and save it for a true hold-out evaluation
when you're finally ready to select a model.
For example, you can use CV to tune a random forest model, a linear regression model,
and a k-nearest neighbors model, using only the training set. Then, you still have the
untainted test set to make your final selection between the model families!
The best practice when performing CV is to include your data preprocessing steps inside
the cross-validation loop. This prevents accidentally tainting your training folds with
influential data from your test fold.
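Here's how that looks:

1. Split your data into k equal parts, or "folds".
2. Preprocess the k-1 training folds (e.g. standardize them, saving the means and
standard deviations).
3. Train your model on the same k-1 folds.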
4. Preprocess the hold-out fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) - (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.
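Setting this up with Scikit-Learn takes just two lines:

Python
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

# Fit and tune model
clf.fit(X_train, y_train)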
Yes, it's really that easy. GridSearchCV essentially performs cross-validation across the
entire "grid" (all possible permutations) of hyperparameters.
It takes in your model (in this case, we're using a model pipeline), the hyperparameters
you want to tune, and the number of folds to create.
Obviously, there's a lot going on under the hood. We've included the pseudo-code above,
and we'll cover writing cross-validation from scratch in a separate guide.
Now, you can see the best set of parameters found using CV:
Python
print(clf.best_params_)
# {'randomforestregressor__max_depth': None,
#  'randomforestregressor__max_features': 'auto'}
Interestingly, it looks like the default parameters win out for this data set.
*Tip: It turns out that in practice, random forests don't actually require a lot of tuning. They
tend to work pretty well out-of-the-box with a reasonable number of trees. Even so,
these same steps can be used when building any type of supervised learning model.
Conveniently, GridSearchCV from sklearn will automatically refit the model with the best
set of hyperparameters using the entire training set.
Confirm model will be retrained
Python
print(clf.refit)
# True
Now, you can simply use the clf object as your model when applying it to other sets
of data. That's what we'll be doing in the next step.
This step is really straightforward once you understand that the clf object you used to
tune the hyperparameters can also be used directly like a model object.
y_pred = clf.predict(X_test)
Now we can use the metrics we imported earlier to evaluate our model performance.
Python
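print(r2_score(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))

So, how do we know if this performance is any good?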
Well, the rule of thumb is that your very first model probably won't be the best possible
model. However, we recommend a combination of three strategies to decide if you're
satisfied with your model performance.
1. Start with the goal of the model. If the model is tied to a business problem, have you
successfully solved the problem?
2. Look in academic literature to get a sense of the current performance benchmarks
for specific types of data.
3. Try to find low-hanging fruit in terms of ways to improve your model.
There are various ways to improve a model. We'll have more guides that go into detail
about how to improve model performance, but here are a few quick things to try:
1. Try other regression model families (e.g. regularized regression, boosted trees,
etc.).
2. Collect more data if it's cheap to do so.
3. Engineer smarter features after spending more time on exploratory analysis.
4. Speak to a domain expert to get more context (...this is a good excuse to go wine
tasting!).
As a final note, when you try other families of models, we recommend using the same
training and test set as you used to fit the random forest model. That's the best way to get
a true apples-to-apples comparison between your models.
You've done the hard part, and deserve another glass of wine. Maybe this time you can
use your shiny new predictive model to select the bottle.
But before you go, let's save your hard work so you can use the model in the future. It's
really easy to do so:
joblib.dump(clf, 'rf_regressor.pkl')
And that's it. When you want to load the model again, simply use this function:
clf2 = joblib.load('rf_regressor.pkl')

# Predict data set using loaded model
clf2.predict(X_test)
We've just completed a whirlwind tour of Scikit-Learn's core functionality, but we've only
really scratched the surface. Hopefully you've gained some guideposts to further explore
all that sklearn has to offer.
Python
# 2. Import libraries and modules
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

# 3. Load red wine data
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

# 4. Split data into training and test sets
y = data.quality
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

# 5. Declare data preprocessing steps
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

# 6. Declare hyperparameters to tune
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

# 7. Tune model using cross-validation pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

clf.fit(X_train, y_train)

# 8. Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)

# 9. Evaluate model pipeline on test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

# 10. Save model for future use
joblib.dump(clf, 'rf_regressor.pkl')
# To load: clf2 = joblib.load('rf_regressor.pkl')