TPOT Automated Machine Learning in Python

Jeff Hale

Aug 22, 2018·19 min read


TPOT graphic from the docs
In this post I’m sharing some of my explorations with TPOT, an
automated machine learning (autoML) tool in Python. The goal
is to see what TPOT can do and if it merits becoming part of
your machine learning workflow.

Automated machine learning doesn’t replace the data scientist
(at least not yet), but it might be able to help you find good
models faster. TPOT bills itself as your Data Science Assistant.

TPOT is meant to be an assistant that gives you ideas on how
to solve a particular machine learning problem by exploring
pipeline configurations that you might have never considered,
then leaves the fine-tuning to more constrained parameter
tuning techniques such as grid search.

So TPOT helps you find good algorithms. Note that it isn’t
designed for automating deep learning; something like
AutoKeras might be helpful there.

An example machine learning pipeline (source: TPOT docs)


TPOT is built on the scikit-learn library and follows the
scikit-learn API closely. It can be used for regression and
classification tasks and has special implementations for medical
research. TPOT is open source, well documented, and under active
development. Its development was spearheaded by researchers
at the University of Pennsylvania. TPOT appears to be one of the
most popular autoML libraries, with nearly 4,500 GitHub
stars as of August 2018.

How does TPOT work?


An example TPOT Pipeline (source: TPOT docs)

TPOT has what its developers call a genetic search algorithm to
find the best parameters and model ensembles. It could also be
thought of as a natural selection or evolutionary algorithm.
TPOT tries a pipeline, evaluates its performance, and randomly
changes parts of the pipeline in search of better performing
algorithms.

AutoML algorithms aren’t as simple as fitting one model on the
dataset; they are considering multiple machine learning
algorithms (random forests, linear models, SVMs, etc.) in a
pipeline with multiple preprocessing steps (missing value
imputation, scaling, PCA, feature selection, etc.), the
hyperparameters for all of the models and preprocessing steps,
as well as multiple ways to ensemble or stack the algorithms
within the pipeline. (source: TPOT docs)

The power of TPOT comes from evaluating all kinds of possible
pipelines automatically and efficiently. Doing this manually is
cumbersome and slow.
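
To make the try, evaluate, and randomly change loop concrete, here is a toy sketch in plain Python. It is not TPOT's implementation: the search space, the `toy_score` function (a stand-in for a cross-validated score), and every name below are made up for illustration, and it uses mutation only, with no crossover.

```python
import random

# Hypothetical search space: each "pipeline" is just a dict of choices.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "knn", "logistic"],
    "depth": list(range(1, 11)),
}


def random_pipeline(rng):
    """Pick one option at random for every step."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy a pipeline and randomly change one of its parts."""
    child = dict(pipeline)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def toy_score(pipeline):
    """Made-up stand-in for a cross-validated score; higher is better."""
    score = 0.3 if pipeline["scaler"] == "standard" else 0.0
    score += 0.4 if pipeline["model"] == "knn" else 0.0
    score += 0.3 - 0.03 * abs(pipeline["depth"] - 5)
    return score


def evolve(generations=20, population_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Produce one mutated child per slot in the population.
        offspring = [mutate(rng.choice(population), rng)
                     for _ in range(population_size)]
        # Keep the best individuals among parents and children.
        ranked = sorted(population + offspring, key=toy_score, reverse=True)
        population = ranked[:population_size]
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 3))
```

Because the best individuals always survive into the next generation, the best score can only stay the same or improve as generations are added, which is the intuition behind giving TPOT more time.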

Running TPOT
Instantiating, fitting, and scoring the TPOT classifier is similar
to any other sklearn classifier. Here’s the format:
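from tpot import TPOTClassifier

tpot = TPOTClassifier()
tpot.fit(X_train, y_train)    # X_train, y_train: your training data
tpot.score(X_test, y_test)    # X_test, y_test: your held-out data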

TPOT comes with its own variation of one-hot encoding. Note
that it could add it to a pipeline automatically because it treats
features with fewer than 10 unique values as categorical. If you
want to use your own encoding strategy you can encode your
data and then feed it into TPOT.
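
The fewer-than-10-unique-values rule is easy to check on your own columns before handing data to TPOT. The sketch below only illustrates that rule as described above; the function name, the threshold argument, and the example columns are hypothetical, and it is not TPOT's encoder.

```python
def looks_categorical(values, threshold=10):
    """Return True when a column has fewer than `threshold` distinct values."""
    return len(set(values)) < threshold


# Made-up example columns.
columns = {
    "cap_color": ["red", "brown", "red", "white"],
    "weight_g": [i * 1.5 for i in range(50)],
}

for name, values in columns.items():
    print(name, looks_categorical(values))
```

A numeric column with only a handful of distinct values (a rating from 1 to 5, say) would be flagged too, which is one reason you might prefer to encode the data yourself first.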

You can choose the scoring criterion for tpot.score (although a
bug with Jupyter and multiple processor cores prevents you
from having a custom scoring criterion with multiple processor
cores in a Jupyter notebook).

According to the TPOT docs, the scoring parameter sets the
metric TPOT uses internally to evaluate pipelines as it searches
for the best one, and tpot.score reports that same metric on the
test set. If you want a different metric on the test set, you can
compute it yourself from the fitted pipeline’s predictions.

TPOT writes Python code for the best performing pipeline,
along with its score, to a file with tpot.export(). You can
choose the level of verbosity you would like to see as TPOT
runs and have it write pipelines to an output file as it runs in
case it terminates early for some reason (e.g. your Kaggle Kernel
crashes).

How long does TPOT take to run?


The short answer is that it depends.

TPOT was designed to run for a while: hours or even a day,
although less complex problems with smaller datasets can see
great results in minutes. You can adjust several parameters for
TPOT to finish its searches faster, but at the expense of a less
thorough search for an optimal pipeline. It was not designed to
be a comprehensive search of preprocessing steps, feature
selection, algorithms, and parameters, but it can come close if
you set its parameters to be more exhaustive.

As the docs explain:


…TPOT will take a while to run on larger datasets, but it’s
important to realize why. With the default TPOT settings (100
generations with 100 population size), TPOT will evaluate
10,000 pipeline configurations before finishing. To put this
number into context, think about a grid search of 10,000
hyperparameter combinations for a machine learning
algorithm and how long that grid search will take. That is
10,000 model configurations to evaluate with 10-fold cross-
validation, which means that roughly 100,000 models are fit
and evaluated on the training data in one grid search.

Some of the data sets we’ll see below only need a few minutes to
find algorithms that score well; others might need days.

Here are the default TPOTClassifier parameters:


generations=100,
population_size=100,
offspring_size=None,  # Jeff notes this gets set to population_size
mutation_rate=0.9,
crossover_rate=0.1,
scoring="accuracy",  # for classification
cv=5,
subsample=1.0,
n_jobs=1,
max_time_mins=None,
max_eval_time_mins=5,
random_state=None,
config_dict=None,
warm_start=False,
memory=None,
periodic_checkpoint_folder=None,
early_stop=None,
verbosity=0,
disable_update_check=False

A description of each parameter can be found in the docs. Here are
a few key ones that determine the number of pipelines TPOT
will search through:

generations: int, optional (default: 100)
Number of iterations to run the pipeline optimization process.
Generally, TPOT will work better when you give it more
generations (and therefore time) to optimize the pipeline. TPOT
will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE
pipelines in total (emphasis mine).

population_size: int, optional (default: 100)
Number of individuals to retain in the GP population every
generation.
Generally, TPOT will work better when you give it more
individuals (and therefore time) to optimize the pipeline.

offspring_size: int, optional (default: None)
Number of offspring to produce in each GP generation.
By default, offspring_size = population_size.

When starting out with TPOT it’s worth setting verbosity=3 and
periodic_checkpoint_folder="any_string_you_like" so that you
can watch the models evolve and training scores improve. You’ll
see some errors as some combinations of pipeline elements are
incompatible, but don’t sweat that.

If you’re running on multiple cores and not using a custom
scoring function, set n_jobs=-1 to use all available cores and
speed up TPOT.

Search Space

Here are the classification algorithms and parameters TPOT
chooses from as of version 0.9:

'sklearn.naive_bayes.BernoulliNB': {
    'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    'fit_prior': [True, False] },
'sklearn.naive_bayes.MultinomialNB': {
    'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    'fit_prior': [True, False] },
'sklearn.tree.DecisionTreeClassifier': {
    'criterion': ["gini", "entropy"],
    'max_depth': range(1, 11),
    'min_samples_split': range(2, 21),
    'min_samples_leaf': range(1, 21) },
'sklearn.ensemble.ExtraTreesClassifier': {
    'n_estimators': [100],
    'criterion': ["gini", "entropy"],
    'max_features': np.arange(0.05, 1.01, 0.05),
    'min_samples_split': range(2, 21),
    'min_samples_leaf': range(1, 21),
    'bootstrap': [True, False] },
'sklearn.ensemble.RandomForestClassifier': {
    'n_estimators': [100],
    'criterion': ["gini", "entropy"],
    'max_features': np.arange(0.05, 1.01, 0.05),
    'min_samples_split': range(2, 21),
    'min_samples_leaf': range(1, 21),
    'bootstrap': [True, False] },
'sklearn.ensemble.GradientBoostingClassifier': {
    'n_estimators': [100],
    'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
    'max_depth': range(1, 11),
    'min_samples_split': range(2, 21),
    'min_samples_leaf': range(1, 21),
    'subsample': np.arange(0.05, 1.01, 0.05),
    'max_features': np.arange(0.05, 1.01, 0.05) },
'sklearn.neighbors.KNeighborsClassifier': {
    'n_neighbors': range(1, 101),
    'weights': ["uniform", "distance"],
    'p': [1, 2] },
'sklearn.svm.LinearSVC': {
    'penalty': ["l1", "l2"],
    'loss': ["hinge", "squared_hinge"],
    'dual': [True, False],
    'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] },
'sklearn.linear_model.LogisticRegression': {
    'penalty': ["l1", "l2"],
    'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
    'dual': [True, False] },
'xgboost.XGBClassifier': {
    'n_estimators': [100],
    'max_depth': range(1, 11),
    'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
    'subsample': np.arange(0.05, 1.01, 0.05),
    'min_child_weight': range(1, 21),
    'nthread': [1] }

And TPOT can stack classifiers, including the same classifier
multiple times. One of the core developers of TPOT explains
how it works in this issue:

The pipeline
ExtraTreesClassifier(ExtraTreesClassifier(input_matrix, True,
'entropy', 0.10000000000000001, 13, 6), True, 'gini', 0.75, 17, 4)
does the following:

Fit all of the original features using an ExtraTreesClassifier

Take the predictions from that ExtraTreesClassifier and create
a new feature using those predictions

Pass the original features plus the new “predicted feature” to
the 2nd ExtraTreesClassifier and use its predictions as the final
predictions of the pipeline

This process is called stacking classifiers, which is a fairly
common tactic in machine learning.
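
The three steps quoted above can be written out by hand with scikit-learn. This is a minimal sketch of the idea, assuming scikit-learn and NumPy are installed; it is not how TPOT builds the pipeline internally. The hyperparameters, random seeds, and the add_prediction_feature helper are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=34)

# Step 1: fit the first classifier on the original features.
first = ExtraTreesClassifier(n_estimators=100, criterion="entropy",
                             random_state=0)
first.fit(X_train, y_train)


def add_prediction_feature(model, features):
    """Step 2: append the model's predictions as one extra column."""
    predicted = model.predict(features).reshape(-1, 1)
    return np.hstack([features, predicted])


X_train_stacked = add_prediction_feature(first, X_train)
X_test_stacked = add_prediction_feature(first, X_test)

# Step 3: fit the second classifier on original + predicted features.
second = ExtraTreesClassifier(n_estimators=100, criterion="gini",
                              random_state=0)
second.fit(X_train_stacked, y_train)
print("stacked test accuracy:", second.score(X_test_stacked, y_test))
```

One caveat with this simple version: the first model predicts on the same rows it was trained on, so the extra feature looks more reliable during training than it will be on new data.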

And here are the 14 preprocessors that could be applied by
TPOT as of version 0.9.

'sklearn.preprocessing.Binarizer': {
    'threshold': np.arange(0.0, 1.01, 0.05) },
'sklearn.decomposition.FastICA': {
    'tol': np.arange(0.0, 1.01, 0.05) },
'sklearn.cluster.FeatureAgglomeration': {
    'linkage': ['ward', 'complete', 'average'],
    'affinity': ['euclidean', 'l1', 'l2', 'manhattan',
                 'cosine'] },
'sklearn.preprocessing.MaxAbsScaler': { },
'sklearn.preprocessing.MinMaxScaler': { },
'sklearn.preprocessing.Normalizer': {
    'norm': ['l1', 'l2', 'max'] },
'sklearn.kernel_approximation.Nystroem': {
    'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial',
               'poly', 'linear', 'additive_chi2', 'sigmoid'],
    'gamma': np.arange(0.0, 1.01, 0.05),
    'n_components': range(1, 11) },
'sklearn.decomposition.PCA': {
    'svd_solver': ['randomized'],
    'iterated_power': range(1, 11) },
'sklearn.preprocessing.PolynomialFeatures': {
    'degree': [2],
    'include_bias': [False],
    'interaction_only': [False] },
'sklearn.kernel_approximation.RBFSampler': {
    'gamma': np.arange(0.0, 1.01, 0.05) },
'sklearn.preprocessing.RobustScaler': { },
'sklearn.preprocessing.StandardScaler': { },
'tpot.builtins.ZeroCount': { },
'tpot.builtins.OneHotEncoder': {
    'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
    'sparse': [False] } (emphasis mine)

That’s a pretty comprehensive list of sklearn ML algorithms,
including a few preprocessors you might not have used, such as
Nystroem and RBFSampler. The final preprocessing algorithm
listed is the custom OneHotEncoder mentioned before. Note
that the list contains no neural network algorithms.

The number of combinations appears to be nearly infinite: you
can stack algorithms, including instances of the same algorithm.
There may be an internal cap on the number of steps in the
pipeline, but suffice it to say there is a plethora of possible
pipelines.

TPOT will likely not result in the same algorithm selection if you
run it twice (maybe not even if random_state is set, I found, as
discussed below). As the docs explain:

If you’re working with a reasonably complex dataset or run
TPOT for a short amount of time, different TPOT runs may
result in different pipeline recommendations. TPOT’s
optimization algorithm is stochastic in nature, which means
that it uses randomness (in part) to search the possible pipeline
space. When two TPOT runs recommend different pipelines,
this means that the TPOT runs didn’t converge due to lack of
time or that multiple pipelines perform more-or-less the same
on your dataset.

Less talk — more action. Let’s try out TPOT on some data!

Dataset 1: MNIST Digit Classification


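First we’ll look at a classification task: the popular
handwritten digit classification task, using the digits data
included in sklearn’s datasets. The MNIST database contains
70,000 images of handwritten Arabic digits in 28x28 pixels,
labeled from 0 to 9. Note that sklearn’s load_digits, used in the
code below, loads a much smaller set of 1,797 8x8 digit images
rather than the full MNIST database.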

TPOT comes standard on the Kaggle Docker image, so you only
need to import it if you’re using Kaggle; you don’t need to
install it.

Here’s my code. It is available on this Kaggle Kernel, in a
slightly different form and possibly with a few modifications.

# import the usual stuff
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# import TPOT and sklearn stuff
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import sklearn.metrics

# create train and test sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.75, test_size=0.25, random_state=34)

tpot = TPOTClassifier(verbosity=3,
                      scoring="balanced_accuracy",
                      random_state=23,
                      periodic_checkpoint_folder="tpot_mnst1.txt",
                      n_jobs=-1,
                      generations=10,
                      population_size=100)

# lists to collect results (missing from the snippet as extracted)
times = []
scores = []
winning_pipes = []

# run three iterations and time them
for x in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mnist_pipeline.py')

# convert seconds to minutes and report
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)

As mentioned above, the total number of pipelines is equal to
POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE.

For example, if you set population_size=20 and generations=5,
then offspring_size=20 (because offspring_size equals
population_size by default), and you’ll have a total of 120
pipelines because 20 + (5 * 20) = 120.
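
The same arithmetic as a small helper, which is handy for budgeting a run before starting it. The function names are hypothetical; the formula is the one quoted from the docs, and the count is an upper bound since settings such as early_stop or max_time_mins can end a run sooner.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE, per the TPOT docs."""
    if offspring_size is None:
        # Documented default: offspring_size equals population_size.
        offspring_size = population_size
    return population_size + generations * offspring_size


def total_model_fits(pipelines, cv_folds):
    """Each pipeline is fit once per cross-validation fold."""
    return pipelines * cv_folds


print(total_pipelines(population_size=20, generations=5))   # 120
print(total_pipelines(population_size=10, generations=5))   # 60
print(total_pipelines(population_size=80, generations=8))   # 720
print(total_model_fits(total_pipelines(10, 5), cv_folds=5))  # 300
```

The 60 and 720 figures correspond to the small and long runs used in this article.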

You can see it doesn’t take much code at all to run this data set
— and that includes a loop to time and test it repeatedly.

With 10 possible classes and no reason to prefer one outcome to
another, accuracy (the TPOT classification default) is a fine
metric for this task.

Here’s the relevant code section.


digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.75, test_size=0.25, random_state=34)

tpot = TPOTClassifier(verbosity=3,
                      scoring="accuracy",
                      random_state=32,
                      periodic_checkpoint_folder="tpot_results.txt",
                      n_jobs=-1,
                      generations=5,
                      population_size=10,
                      early_stop=5)

And here are the results:
Times: [4.740584810283326, 3.497970838083226,
3.4362493358499098]
Scores: [0.9733333333333334, 0.9644444444444444,
0.9666666666666667]
Winning pipelines: [Pipeline(memory=None,
steps=[('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=7,
max_features=0.15000000000000002,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
...auto', random_state=None,
subsample=0.9500000000000001, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True,
with_mean=True, with_std=True)), ('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002,
max_leaf_...auto', random_state=None,
subsample=0.9500000000000001, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True,
with_mean=True, with_std=True)), ('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002,
max_leaf_...auto', random_state=None,
subsample=0.9500000000000001, verbose=0,
warm_start=False))])]

Note that with only 60 pipelines, far fewer than what TPOT
suggests, we were able to see pretty good scores: over 97%
accuracy on the test set in one case.

Reproducibility

Does TPOT find the same winning pipeline every time with the
same random_state set? Not necessarily. Individual
algorithms such as RandomForestClassifier() have their own
random_state parameters that don’t get set.

TPOT doesn’t always find the same result if you instantiate one
classifier and then fit it repeatedly like we do in the for loop in
the code above. I ran three very small sets of 60 pipelines with
random_state set and Kaggle’s GPU setting on. Note that we get
slightly different pipelines and thus slightly different test set
scores across the three runs.

Here’s another example of a small number of pipelines with
random_state set and using Kaggle’s CPU setting.
Times: [2.8874817832668973, 0.043678393283335025,
0.04388708711679404]
Scores: [0.9622222222222222, 0.9622222222222222,
0.9622222222222222]
Winning pipelines: [Pipeline(memory=None,
steps=[('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('gradientboostingclassifier',
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))])]
The same pipeline was found each of the three times.

Note that the run time is much faster after the first iteration.
TPOT does seem to remember when it has seen a pipeline
and doesn’t rerun it, even if it’s a second fit and you’ve left
memory=None. Here’s what you’ll see if you set verbosity=3
when it finds such a previously evaluated pipeline:
Pipeline encountered that has previously been evaluated during
the optimization process. Using the score from the previous
evaluation.

Longer runs for higher accuracy

How does TPOT do if you make a large number of pipelines? To
really see the power of TPOT for the MNIST digits task, you
need over 500 total pipelines to run. This will take at least an
hour if you’re running it on Kaggle. Then you will see higher
accuracy scores and might see more complex models.

Chained, or stacked, ensembles where the outputs of one
machine learning algorithm feed into another are what you’ll
likely see if you have a larger number of pipelines and a
non-trivial task.
0.9950861171999883
knn = KNeighborsClassifier(
DecisionTreeClassifier(
OneHotEncoder(input_matrix,
OneHotEncoder__minimum_fraction=0.15,
OneHotEncoder__sparse=False),
DecisionTreeClassifier__criterion=gini,
DecisionTreeClassifier__max_depth=5,
DecisionTreeClassifier__min_samples_leaf=20,
DecisionTreeClassifier__min_samples_split=17),
KNeighborsClassifier__n_neighbors=1,
KNeighborsClassifier__p=2,
KNeighborsClassifier__weights=distance)
This is a .995 average internal CV accuracy score after running
for over an hour and generating over 600 pipelines. The kernel
crashed before completion, so I didn’t get to see a test set score
and couldn’t get an exported model, but this looks quite
promising for TPOT.

The algorithm uses a DecisionTreeClassifier with TPOT’s
custom OneHotEncoder categorical encodings feeding into
KNeighborsClassifier.

Here’s a similar internal score with a different pipeline resulting
from a different random_state after nearly 800 pipelines.
0.9903723557310828
KNeighborsClassifier(Normalizer(OneHotEncoder(
RandomForestClassifier(MinMaxScaler(input_matrix),
RandomForestClassifier__bootstrap=True,
RandomForestClassifier__criterion=entropy,
RandomForestClassifier__max_features=0.55,
RandomForestClassifier__min_samples_leaf=6,
RandomForestClassifier__min_samples_split=15,
RandomForestClassifier__n_estimators=100),
OneHotEncoder__minimum_fraction=0.2,
OneHotEncoder__sparse=False), Normalizer__norm=max),
KNeighborsClassifier__n_neighbors=4, KNeighborsClassifier__p=2,
KNeighborsClassifier__weights=distance)

TPOT found a pipeline with KNN, one-hot encoding,
normalization, and random forest. It took two and a half hours.
The previous one was faster and scored better, but sometimes
that’s what happens with the stochastic nature of TPOT’s genetic
search algorithm. 😉

Takeaways from MNIST digit classification task


1. TPOT can perform really well on this image
recognition task if you give it enough time.

2. TPOT works better with more pipelines.

3. If you need reproducibility for a task, TPOT isn’t the
tool you want.

Dataset 2: Mushroom Classification


For a second dataset I chose the popular mushroom
classification task. The goal is to determine correctly whether a
mushroom is poisonous based on its labels. This is not an image
classification task. It’s set up as a binary task so that all
potentially dangerous mushrooms are grouped as one category
and safe to eat mushrooms as another category.
Yummy or deadly? Ask TPOT.
My code is available on this Kaggle Kernel.

TPOT can routinely fit a perfect model quickly on this data set.
It did so in under two minutes. This is much better performance
and speed than when I tested this dataset without TPOT with
many scikit-learn classification algorithms, a wide range of
nominal data encodings, and no parameter tuning.

On three runs with the same TPOTClassifier instance and the
same random state set, here’s what TPOT found:
Times: [1.854785452616731, 1.5694829618000463,
1.3383520993001488]
Scores: [1.0, 1.0, 1.0]
Interestingly, it found a different best algorithm each time. It
found a DecisionTreeClassifier, then a KNeighborsClassifier, and
then a stacked RandomForestClassifier with BernoulliNB.

Let’s dig into reproducibility a bit more. Let’s run it again with
exactly the same settings.
Times: [1.8664863013502326, 1.5520636909670429,
1.3386059726501116]
Scores: [1.0, 1.0, 1.0]

We see the same set of three pipelines, very similar times, and
the same scores on the test set.

Now let’s try splitting the cell into multiple different cells and
instantiating a TPOT instance in each one. Here’s the code:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=25)

tpot = TPOTClassifier(
    verbosity=3,
    scoring="accuracy",
    random_state=25,
    periodic_checkpoint_folder="tpot_mushroom_results.txt",
    n_jobs=-1,
    generations=5,
    population_size=10,
    early_stop=5
)

The result of the second run now matches the result of the first
one and took almost the same time (Score = 1.0, Time = 1.9
minutes, pipeline = Decision Tree Classifier). The key for higher
reproducibility is that we are instantiating a new instance of the
TPOT classifier in each cell.
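
One way to picture why a fresh instance helps is the toy analogy below. It is only an analogy under an assumption, not a description of TPOT's internals: it assumes some random state is carried over from one fit to the next when the same object is reused. The ToySearch class is hypothetical.

```python
import random


class ToySearch:
    """Hypothetical stand-in for a stochastic search; not TPOT's code."""

    def __init__(self, random_state):
        # The generator is created once, when the object is instantiated.
        self._rng = random.Random(random_state)

    def fit(self):
        # Every fit draws from the same generator and advances its state.
        return [self._rng.randint(0, 99) for _ in range(5)]


# Reusing one instance: later fits continue the random stream.
shared = ToySearch(random_state=25)
reused_runs = [shared.fit() for _ in range(3)]

# Fresh instance per run: every run starts from the same seed.
fresh_runs = [ToySearch(random_state=25).fit() for _ in range(3)]

print("reused:", reused_runs)
print("fresh: ", fresh_runs)
```

In the toy version, the first reused run matches the fresh runs and the later ones drift, which mirrors the pattern seen with the mushroom data: three different pipelines from one reused instance, the same pipeline from new instances.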
Time results from 10 sets of 30 pipelines with random_state on
train_test_split and TPOT set to 10 are below. All pipelines
correctly classified all mushrooms on the test set. TPOT was
quite fast on this fairly easy-to-learn task.

Takeaways from Mushroom Task


TPOT performs well and quickly for this basic classification
task.
As a comparison, this Kaggle kernel on the mushroom set in R is
very nice and explores a variety of algorithms and gets very
close to perfect accuracy. But it doesn’t quite reach 100% and it
certainly took quite a bit more time to prepare and train than
our implementation of TPOT.

I would strongly consider TPOT as a time saver for a task like
this in the future, at least as a first step.

Dataset 3: Ames Housing Prediction


Next we turn to a regression task to see how TPOT performs.
We’ll predict housing property sale values with the
popular Ames, Iowa Housing Price Prediction dataset. My code
is available on this Kaggle Kernel.
This house could be in Ames, Iowa, right?

For this task, I did some basic imputation of missing values
first. I filled missing numeric column values with the most
frequent value for the column, because some of those columns
contain ordinal data. With more time I’d categorize the columns
and use different imputation strategies depending on interval,
ordinal, or nominal data types.

String column missing values were filled with a “missing” label
prior to ordinal encoding because not all columns had a most
frequent value. TPOT’s one hot encoding algorithm would then
make one more dimension per feature that would indicate that
the data had a missing value for that feature.

TPOTRegressor uses mean squared error scoring by default. It
reports the negative of the MSE (sklearn’s neg_mean_squared_error
convention, where higher is better), which is why the scores
below are negative.

Here’s a run with only 60 pipelines.


import timeit
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split

# X, y: the imputed and encoded features and sale prices from above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=33)

# instantiate tpot
tpot = TPOTRegressor(verbosity=3,
                     random_state=25,
                     n_jobs=-1,
                     generations=5,
                     population_size=10,
                     early_stop=5,
                     memory=None)
times = []
scores = []
winning_pipes = []

# run 3 iterations
for x in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_ames.py')

# output results
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)

Here are the results of these three little runs.

Times: [3.8920086714831994, 1.4063017464330188,
1.2469199204002508]
Scores: [-905092886.3009057, -922269561.2683483,
-949881926.6436856]
Winning pipelines: [Pipeline(memory=None,
steps=[('zerocount', ZeroCount()), ('xgbregressor',
XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0,
max_depth=9, min_child_weight=18, missing=None,
n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear',
random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))]), Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5,
booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0,
max_depth=9, min_child_weight=11, missing=None,
n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear',
random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))]), Pipeline(memory=None,
steps=[('stackingestimator',
StackingEstimator(estimator=RidgeCV(alphas=array([ 0.1, 1. ,
10. ]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None,
store_cv_values=False))), ('maxabsscaler-1',
MaxAbsScaler(copy=True)), ('maxabsscaler-2',
MaxAbsScaler(copy=True)), ('xgbr... reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))])]

The runs finished pretty quickly and found different winning
pipelines each time. The scores are negative mean squared
errors, so taking the square root of their absolute values gives
us the Root Mean Squared Error (RMSE). The RMSE was
around $30,000 on average.
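
The conversion can be checked directly from the three scores printed above. This sketch uses only the standard library; the helper name is my own.

```python
import math

# Scores reported above: negative mean squared error, in dollars squared.
scores = [-905092886.3009057, -922269561.2683483, -949881926.6436856]


def rmse_from_neg_mse(score):
    """Undo the sign flip, then take the square root."""
    return math.sqrt(abs(score))


rmses = [rmse_from_neg_mse(s) for s in scores]
mean_rmse = sum(rmses) / len(rmses)

for score, rmse in zip(scores, rmses):
    print(f"{score:.1f} -> RMSE {rmse:,.0f}")
print(f"mean RMSE: {mean_rmse:,.0f}")
```

Each run lands between $30,000 and $31,000, consistent with the average quoted above.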

Next I tried 60 pipelines with random_state = 20 for both
train_test_split and TPOTRegressor.
Times: [9.691357856966594, 1.8972856383004304,
2.5272325469001466]
Scores: [-1061075530.3715296, -695536167.1288683,
-783733389.9523941]
Winning pipelines: [Pipeline(memory=None,
steps=[('stackingestimator-1',
StackingEstimator(estimator=RandomForestRegressor(bootstrap=True
, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_sample...0.6000000000000001,
tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5,
booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0,
max_depth=7, min_child_weight=3, missing=None,
n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear',
random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1.0))]), Pipeline(memory=None,
steps=[('stackingestimator',
StackingEstimator(estimator=RandomForestRegressor(bootstrap=True
, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_...ators=100,
n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False))])]

This led to very different pipelines and scores.

Let’s try one longer run with 720 pipelines.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=20)

tpot = TPOTRegressor(verbosity=3,
                     random_state=10,
                     #scoring=rmsle,
                     periodic_checkpoint_folder="any_string",
                     n_jobs=-1,
                     generations=8,
                     population_size=80,
                     early_stop=5)

Results:
Times: [43.206709423016584]
Scores: [-644910660.5815958]
Winning pipelines: [Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5,
booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0,
max_depth=8, min_child_weight=3, missing=None,
n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear',
random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.8500000000000001))])]
RMSE is the best yet, at roughly $25,400. It took the better part
of an hour to converge, and we’re still running far fewer
pipelines than recommended. 🤔

Next let’s try using Root Mean Squared Logarithmic Error (RMSLE),
the custom scoring metric Kaggle uses for this competition. This
was run in another very small iteration with 30 pipelines in
three runs with random_state=20. We couldn’t use more than
one CPU core because of a bug with custom scoring parameters
in Jupyter in some algorithms included in TPOT.
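
For reference, here is a minimal standard-library sketch of the RMSLE metric itself. It is not the exact function from my kernel; the name rmsle matches the commented-out scoring=rmsle line earlier, and the sale prices are made up for illustration.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error; lower is better."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up sale prices, for illustration only.
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]
print(rmsle(actual, predicted))

# With scikit-learn, a function like this is typically wrapped as
#   sklearn.metrics.make_scorer(rmsle, greater_is_better=False)
# before being passed as scoring=...; that wrapper negates the value,
# which is why the scores in the results are negative.
```

Because the error is computed on logarithms, a $10,000 miss on a cheap house counts for more than the same miss on an expensive one.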
Times: [1.6125734224997965, 1.2910610851162345,
0.9708147236000514]
Scores: [-0.15007242511943228, -0.14164770517342357,
-0.15506057088945932]
Winning pipelines: [Pipeline(memory=None,
steps=[('maxabsscaler', MaxAbsScaler(copy=True)),
('stackingestimator',
StackingEstimator(estimator=RandomForestRegressor(bootstrap=True
, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
...0.6000000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('extratreesregressor',
ExtraTreesRegressor(bootstrap=False, criterion='mse',
max_depth=None,
max_features=0.6500000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=7, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False))]), Pipeline(memory=None,
steps=[('ridgecv', RidgeCV(alphas=array([ 0.1, 1. ,
10. ]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None,
store_cv_values=False))])]

Those scores aren’t terrible. The output file from tpot.export
from this small run is below.

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE',
                        sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values,
                     random_state=42)

# Score on the training set was:-0.169929041242275
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LassoLarsCV(normalize=False)),
    ElasticNetCV(l1_ratio=0.75, tol=0.01)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

In the future I’d like to do some longer runs with TPOT on this
data set to see how it performs. I’d also like to see how some
manual feature engineering and various encoding strategies can
improve our model performance.

Gotchas with TPOT and Kaggle


I love Kaggle’s kernels, but if you want to run an algorithm such
as TPOT for a few hours, it can be super frustrating. Kernels
frequently crash when running, you sometimes can’t tell if your
attempted commit is hanging, and you can’t control your
environment as much as you might like.
Wall for banging head

There’s nothing like getting to 700 out of 720 pipeline iterations
and having Kaggle disconnect. My Kaggle CPU utilization rate
often showed 400%+, and many restarts were required during
this exercise.

A few other things to be aware of:


• I found I needed to convert my Pandas DataFrame to
a NumPy array to avoid an XGBoost issue on the
regression task. This is a known issue with Pandas
and XGBoost.

• A Kaggle kernel is running a Jupyter notebook under
the hood. Custom scoring functions in TPOT don’t
work when n_jobs is > 1 in a Jupyter notebook. This is
a known issue.

• Kaggle will only let your kernel code write to an
output file when you commit your code. And you can’t
see TPOT’s temporary output when committing. Make
sure you just have the file name in quotes, with no
slashes. The file will show up on the Output tab.

• Turning on the GPU setting on Kaggle didn’t speed
things up for most of these analyses, but likely would
for deep learning.

• Kaggle’s 6 hours of possible run time and GPU setting
make it possible to experiment with TPOT for free
with no configuration on non-huge data sets. It’s hard
to pass up free.
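A few of these gotchas can be sketched in code. The snippet below is only an illustration under assumptions: the helper name to_numpy_xy, the column names, and the toy values are made up, and the TPOT settings in the comments are hypothetical rather than the ones used in this post.

```python
import numpy as np
import pandas as pd


def to_numpy_xy(df, target_col):
    """Split a DataFrame into plain NumPy feature and target arrays."""
    X = df.drop(columns=[target_col]).to_numpy(dtype=np.float64)
    y = df[target_col].to_numpy(dtype=np.float64)
    return X, y


# Toy, made-up data purely for illustration
df = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0],
        "feature_b": [10, 20, 30],
        "target": [1.5, 2.5, 3.5],
    }
)
X, y = to_numpy_xy(df, "target")

# Hypothetical TPOT usage, commented out so the sketch runs without TPOT:
# from tpot import TPOTRegressor
# tpot = TPOTRegressor(generations=2, population_size=10, n_jobs=1, verbosity=2)
# tpot.fit(X, y)                   # NumPy arrays rather than a DataFrame
# tpot.export("tpot_pipeline.py")  # bare file name, no slashes
```

Passing arrays instead of the DataFrame sidesteps the XGBoost issue, n_jobs=1 avoids the custom scoring problem in Jupyter, and the bare file name keeps the exported pipeline where Kaggle can show it on the Output tab.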
For more time and speed you can use something
like Paperspace. I set TPOT up on Paperspace and it was pretty
pain-free, although not money-free. If you need a cloud solution
to run TPOT, I suggest playing around with it on Kaggle first
and then moving off Kaggle if you need more than a few hours
of running time or more power.

Future Directions
There are so many interesting directions to explore with TPOT
and autoML. I’d like to compare TPOT with auto-sklearn,
MLBox, Auto-Keras, and others. I’d also like to see how it
performs with a greater variety of data, other imputation
strategies, and other encoding strategies. A comparison with
LightGBM, CatBoost, and deep learning algorithms would also
be interesting. The exciting thing about this moment in machine
learning is that there are so many areas to explore. Follow me to
make sure you don’t miss future analysis.

For most data sets there’s still a lot of data cleaning, feature
engineering, and final model selection to do — not to mention
the most important step of asking the right questions up front.
Then you might need to productionize your model. And TPOT
isn’t doing exhaustive searches yet. So TPOT isn’t going to
replace the data scientist role — but this tool might make your
final machine learning algorithms better, faster.

If you’ve used TPOT or other autoML tools, please share your
experience in the comments.

I hope you found this introduction to TPOT to be helpful. If you
did, please share it on your favorite social media so other folks
can find it, too. 😀
I write about Python, SQL, and other tech topics. If any of that’s
of interest to you, sign up for my mailing list of awesome data
science resources and read more to help you grow your
skills here. 👍
Happy TPOTing! 🚀
Jeff Hale

I write about data science. Join my Data Awesome mailing list to stay on top of the latest data
tools and tips: https://dataawesome.com