TPOT Automated Machine Learning in Python
Jeff Hale
Running TPOT
You instantiate, fit, and score the TPOT classifier like any other sklearn classifier. Here’s the format:
from tpot import TPOTClassifier

tpot = TPOTClassifier()
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)
It appears that you can’t alter the scoring criteria TPOT uses
internally as it searches for the best pipeline, just the scoring
criteria for use on the test set after TPOT has chosen the best
algorithms. This is an area where some users might want more
control. Perhaps this option will be added in a future version.
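In the meantime, if you want a different metric on the held-out set, you can always score the winning pipeline directly with any scikit-learn metric. A minimal sketch (the choice of f1_score here is arbitrary, not something TPOT requires):

from sklearn.metrics import f1_score

# tpot.fitted_pipeline_ holds the winning sklearn pipeline after fit()
y_pred = tpot.fitted_pipeline_.predict(X_test)
print(f1_score(y_test, y_pred, average='weighted'))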
Some of the data sets we’ll see below only need a few minutes to
find algorithms that score well; others might need days.
Search Space
TPOT searches over a space of nested pipelines. For example, the pipeline ExtraTreesClassifier(ExtraTreesClassifier(input_matrix, True, 'entropy', 0.10000000000000001, 13, 6), True, 'gini', …) feeds the output of one ExtraTreesClassifier into another, with each argument a hyperparameter TPOT can mutate.
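If you want to narrow that search space, TPOT accepts a config_dict that maps operator names to the hyperparameter values it may try. A minimal sketch (the operators and values below are illustrative choices, not TPOT's defaults):

from tpot import TPOTClassifier

# a pared-down, illustrative search space
tpot_config = {
    'sklearn.ensemble.ExtraTreesClassifier': {
        'n_estimators': [100],
        'criterion': ['gini', 'entropy'],
        'max_features': [0.1, 0.5, 1.0],
    },
    'sklearn.naive_bayes.GaussianNB': {},
}

tpot = TPOTClassifier(config_dict=tpot_config, generations=5, population_size=20)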
Less talk — more action. Let’s try out TPOT on some data!
import timeit
from tpot import TPOTClassifier

times = []
winning_pipes = []
scores = []

tpot = TPOTClassifier(
    verbosity=3,
    scoring="accuracy",
    random_state=25,
    periodic_checkpoint_folder="tpot_mnst1.txt",
    n_jobs=-1,
    generations=10,
    population_size=100)

# run three iterations and time them
for x in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mnist_pipeline.py')

times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)
You can see it doesn’t take much code at all to run this data set
— and that includes a loop to time and test it repeatedly.
Note that with only 60 pipelines, far fewer than TPOT suggests, we were able to see pretty good scores: over 97% accuracy on the test set in one case.
Reproducibility
Does TPOT find the same winning pipeline every time with the same random_state set? Not necessarily. Individual algorithms such as RandomForestClassifier() have their own random_state parameters, which don’t get set.
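If you need run-to-run identical refits of a winner, one workaround is to set random_state explicitly on every step of the fitted pipeline that supports it. A sketch (assumes tpot has already been fit):

# set random_state on each top-level step of the winning pipeline
for name, step in tpot.fitted_pipeline_.steps:
    if 'random_state' in step.get_params():
        step.set_params(random_state=25)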
TPOT doesn’t always find the same result if you instantiate one classifier and then fit it repeatedly, as we do in the for loop in the code above. I ran three very small sets of 60 pipelines with random_state set and Kaggle’s GPU setting on. Note that we get slightly different pipelines, and thus slightly different test set scores, across the three runs.
Note that the run time is much faster after the first iteration.
TPOT does seem to remember when it has seen an algorithm
and doesn’t rerun it, even if it’s a second fit and you’ve set
memory=False. Here’s what you’ll see with verbosity=3 when it finds such a previously evaluated pipeline:
Pipeline encountered that has previously been evaluated during
the optimization process. Using the score from the previous
evaluation.
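That duplicate-pipeline detection is separate from TPOT’s memory parameter, which caches fitted pipeline steps within a run. A sketch of enabling that cache (the other arguments here are just placeholders):

# memory='auto' caches fitted transformers in a temporary directory;
# pass a directory path instead to keep the cache between runs
tpot = TPOTClassifier(memory='auto', verbosity=3,
                      generations=5, population_size=10)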
TPOT can routinely fit a perfect model on this data set quickly; here it did so in under two minutes. That’s much better performance and speed than when I tested this dataset without TPOT, trying many scikit-learn classification algorithms and a wide range of nominal data encodings, with no parameter tuning.
Let’s dig into reproducibility a bit more. Let’s run it again with exactly the same settings.
Times: [1.8664863013502326, 1.5520636909670429, 1.3386059726501116]
Scores: [1.0, 1.0, 1.0]
We see the same set of three pipelines, very similar times, and
the same scores on the test set.
Now let’s try splitting the cell into multiple different cells and
instantiating a TPOT instance in each one. Here’s the code:
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=25)

tpot = TPOTClassifier(
    verbosity=3,
    scoring="accuracy",
    random_state=25,
    periodic_checkpoint_folder="tpot_mushroom_results.txt",
    n_jobs=-1,
    generations=5,
    population_size=10,
    early_stop=5,
)
The result of the second run now matches the result of the first one and took almost the same time (score = 1.0, time = 1.9 minutes, pipeline = Decision Tree Classifier). The key to reproducibility is instantiating a new instance of the TPOT classifier in each cell.
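Here’s that pattern as a sketch, re-instantiating before every fit so each search starts from the same seeded state (parameter values carried over from the cell above):

for i in range(3):
    # a fresh TPOT instance per run, rather than re-fitting one instance
    tpot = TPOTClassifier(random_state=25, generations=5,
                          population_size=10, early_stop=5)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))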
Timing results from 10 sets of 30 pipelines, with random_state=10 set on both train_test_split and TPOT, are below. All pipelines correctly classified all mushrooms on the test set. TPOT was quite fast on this fairly easy-to-learn task.
Results:
Times: [43.206709423016584]
Scores: [-644910660.5815958]
Winning pipelines: [Pipeline(memory=None,
    steps=[('xgbregressor',
            XGBRegressor(base_score=0.5, booster='gbtree',
                colsample_bylevel=1, colsample_bytree=1, gamma=0,
                learning_rate=0.1, max_delta_step=0, max_depth=8,
                min_child_weight=3, missing=None, n_estimators=100,
                n_jobs=1, nthread=1, objective='reg:linear', random_state=0,
                reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                silent=True, subsample=0.8500000000000001))])]
RMSE is the best yet. It took the better part of an hour to
converge, and we’re still running far smaller pipelines than
recommended. 🤔
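Assuming the score above is TPOT’s default regression metric, negative mean squared error, converting it to RMSE takes one square root:

import numpy as np

neg_mse = -644910660.5815958  # TPOT score from the run above
rmse = np.sqrt(-neg_mse)
print(f'RMSE: {rmse:,.0f}')  # roughly 25,400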
In the future I’d like to do some longer runs with TPOT on this
data set to see how it performs. I’d also like to see how some
manual feature engineering and various encoding strategies can
improve our model performance.
Future Directions
There are so many interesting directions to explore with TPOT and autoML. I’d like to compare TPOT with auto-sklearn, MLBox, Auto-Keras, and others. I’d also like to see how it performs with a greater variety of data, other imputation strategies, and other encoding strategies. A comparison with LightGBM, CatBoost, and deep learning algorithms would also be interesting. The exciting thing about this moment in machine learning is that there are so many areas to explore. Follow me to make sure you don’t miss future analysis.
For most data sets there’s still a lot of data cleaning, feature
engineering, and final model selection to do — not to mention
the most important step of asking the right questions up front.
Then you might need to productionize your model. And TPOT isn’t doing exhaustive searches yet. So TPOT isn’t going to replace the data scientist role, but this tool might make your final machine learning algorithms better, faster.
I write about data science. Join my Data Awesome mailing list to stay on top of the latest data
tools and tips: https://dataawesome.com