Narrowing the Search: Which Hyperparameters Really Matter?
June 17, 2020

Tech Blog | Aimee Coelho

When performing hyperparameter optimization, we are really searching for the best model we can find within our time
constraints. If we searched long enough, with random search we would get arbitrarily close to the optimum model. However,
the more hyperparameters we have to tune, the larger our search space and the more infeasible this becomes, especially
when we are working with larger real-world datasets and each model we test can take hours to train.

This is why in AutoML the best search strategies are those that can quickly home in on the most promising regions of the search space and therefore find better models in the limited time available to train and test models.


Hyperparameter optimization feels like figuring out a map on a budget (Credit: Pexels (https://www.pexels.com/fr-fr/photo/a-l-interieur-argent-bague-boire-2678301/))

One way to refine the search space is to study which hyperparameters are most ‘important’ and focus on them. For a given machine learning task it is likely that changing the values of some hyperparameters will make a much larger difference to the performance than others. Tuning the value of these hyperparameters can therefore bring the greatest benefits.

By studying the importance of the different hyperparameters over many datasets, we can try to meta-learn good values for the hyperparameters. For the most important hyperparameters we hope to reduce the search range and focus on the best values, while for the least important ones, we could fix them to a single value and thus remove entire dimensions from a grid search.

This is exactly what Jan N. van Rijn and Frank Hutter propose in their paper “Hyperparameter Importance Across Datasets”
(https://arxiv.org/abs/1710.04725).

We think this is a really interesting approach to crafting a more targeted hyperparameter search when you have limited resources, and it has the potential to offer valuable insights to data scientists as well, so we decided to test it for ourselves on XGBoost to see what insights we could glean.

How Can You Identify Which Hyperparameters Are Important?
To determine which hyperparameters have the largest impact on the model performance, first you need to collect an unbiased dataset of performance evaluations with different combinations of hyperparameters by doing a random search on many different datasets.

Then, for each dataset you can use functional ANOVA (https://www.automl.org/legacy/papers/ICML14-HyperparameterAssessment.pdf) to decompose the variance in the performance metric and attribute it to each of the hyperparameters and their interactions. This is a quantitative measure of the importance of each hyperparameter.
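If you want to reproduce this variance decomposition step yourself, a minimal sketch might look like the following. It assumes the open-source fanova package from the AutoML group (https://github.com/automl/fanova) and uses a small synthetic stand-in for the random-search results; in practice X and y would hold the evaluated configurations and their scores for one dataset.

```python
import numpy as np
from fanova import fANOVA  # pip install fanova (the AutoML group's implementation)

# Synthetic stand-in for the random search results of one dataset:
# each row is one evaluated configuration, each column one hyperparameter,
# and y holds the corresponding performance scores (e.g., AUC)
rng = np.random.default_rng(0)
X = rng.uniform(size=(225, 4))                          # 225 evaluations, 4 hyperparameters
y = 0.7 + 0.2 * X[:, 0] + 0.02 * rng.normal(size=225)   # column 0 drives most of the variance

f = fANOVA(X, y)

# Fraction of the variance in the score attributable to each hyperparameter alone
for i in range(X.shape[1]):
    importance = f.quantify_importance((i,))
    print(f"hyperparameter {i}: {importance[(i,)]['individual importance']:.3f}")
```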

Figure 1: From the original paper (https://arxiv.org/abs/1710.04725), the marginal contributions over all datasets for SVM (RBF kernel) (left), showing that hyperparameter gamma was the most important, and AdaBoost (right), showing that max depth was the most important hyperparameter.


When you aggregate these variance contributions, or importances, over all the datasets, you can identify which hyperparameters are generally more important. To illustrate this point, the violin plots in figure 1 show the distribution of importances for each hyperparameter, ordered by the median. On the left are the hyperparameters for SVM (https://scikit-learn.org/stable/modules/svm.html) with the RBF kernel, where gamma is shown to generally be the most important hyperparameter; on the right, for the AdaBoost (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) algorithm, max depth was found to generally be the most important hyperparameter.

Hyperparameter Importance Density Distributions
To study whether or not there are clear best values for each of these hyperparameters across all the datasets, the ten best models found during the random search are selected. For each hyperparameter independently, a one-dimensional kernel density estimator is then fitted to these best evaluations to create density distributions. Examples of these density distributions for the algorithms studied in the paper are shown in figure 2. Since these distributions were used as priors to sample from for Hyperband (https://medium.com/data-from-the-trenches/a-slightly-better-budget-allocation-for-hyperband-bbd45af14481), they were referred to as priors.
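As a rough illustration of how such a prior could be built, the sketch below fits scikit-learn's KernelDensity to the best values of a single hyperparameter. The values and bandwidth are made up for illustration, and the exact KDE settings used in the original paper may differ.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Illustrative stand-in: the min samples leaf values from the ten best models
# of every dataset, concatenated into a single array (real values would come
# from the random search results)
best_values = np.array([1, 1, 2, 3, 3, 4, 5, 8, 10, 12, 15, 20, 35, 50], dtype=float)

# Fit a one-dimensional Gaussian KDE; the bandwidth here is illustrative
kde = KernelDensity(kernel="gaussian", bandwidth=3.0).fit(best_values.reshape(-1, 1))

# Evaluate the density over the search range to plot it...
grid = np.linspace(1, 60, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# ...or sample from it to use as a prior for a search strategy such as Hyperband
prior_samples = kde.sample(n_samples=20)
```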

Figure 2: Examples of density distributions of the top ten evaluations from all datasets for some of the hyperparameters of the algorithms studied in the paper: random forest, AdaBoost, SVM (RBF kernel), and SVM (sigmoid).

More Random Forest Hyperparameters


Here at Dataiku we build the Dataiku Data Science Studio (Dataiku DSS), a collaborative data science platform that allows coders to collaborate with non-coder profiles to explore, prototype, build, and deliver their own data and AI products together. Dataiku DSS offers the possibility to train machine learning algorithms through a visual interface.

We were particularly interested in looking at the hyperparameters that we expose in the visual machine learning interface of Dataiku DSS, so we ran the experiments again while searching over different hyperparameters. We also extended the range for min samples leaf, as we expected that higher values would perform better for larger datasets. We evaluated 225 models for each dataset; each combination of hyperparameters was evaluated using 3-fold cross-validation on the AUC metric. A sketch of this search setup follows the list of hyperparameters below.

The hyperparameters and their values we searched over were:

Min Samples Leaf: 1–60
Max Features: 0.1–1.0
Max Depth: 6–20
Number of Estimators: 10–200
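The following is a minimal sketch of how a random search like this could be set up with scikit-learn's RandomizedSearchCV. It mirrors the ranges above but is not the exact experimental code, and the dataset loading and fit call are left as placeholders.

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space matching the ranges listed above
param_distributions = {
    "min_samples_leaf": randint(1, 61),     # 1–60
    "max_features": uniform(0.1, 0.9),      # 0.1–1.0
    "max_depth": randint(6, 21),            # 6–20
    "n_estimators": randint(10, 201),       # 10–200
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=225,          # 225 evaluated configurations per dataset
    scoring="roc_auc",   # AUC
    cv=3,                # 3-fold cross-validation
    random_state=0,
)
# search.fit(X, y)  # X, y: features and labels of one benchmark dataset
```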

We again found the most important hyperparameter to be min samples leaf, followed by max features and number of estimators; the new violin plot can be seen in figure 3.

Figure 3: The violin plot of importances for this set of hyperparameters for random forest, showing min samples leaf to be by far the most
important hyperparameter generally.

We found that min samples leaf had a higher importance than in the original experiments because the range was larger, and very high values could in some cases seriously degrade the performance. However, lower values were not always better, as can be seen by comparing the two datasets in figure 4.

Figure 4: The marginal performance for min samples leaf for random forest on the mice protein dataset
(https://www.openml.org/d/40966) and on the analcatdata DMFT dataset (https://www.openml.org/d/469) where higher values of min
samples leaf are better.

We created the density distributions to explore the best values of this hyperparameter over the full set of datasets and
recovered a similar shape to that in the original experiments. This is shown in figure 5.

https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter Page 4 of 9
Narrowing the Search: Which Hyperparameters Really Matter? 14/03/22, 1:13 AM

Figure 5: One-dimensional density plots of the 10 best values of min samples leaf across all datasets, from the original paper (left) and for our new experiments on OpenML-CC18 data with different hyperparameters (right).

XGBoost Experiments
XGBoost (https://xgboost.readthedocs.io/en/latest/parameter.html) is an algorithm with a very large number of parameters.
We are using the implementation with the scikit-learn API, which reduces the number of parameters you can change, and we
decided to restrict our study to those available to tune in Dataiku DSS. The hyperparameters and their ranges that we chose to
search over are:

Booster: gbtree, dart
CPU Tree Method: exact, hist, approx
Max Depth: 3–20
Learning Rate: 0.01–0.5
Gamma: 0.0–1.0
Alpha: 0.0–1.0
Lambda: 0.0–1.0
Min Child Weight: 0.0–5.0
Subsample: 0.5–1.0
Colsample by Tree: 0.5–1.0
Colsample by Level: 0.5–1.0

The functional ANOVA technique relies on training a model to predict performance. As detailed in the original paper (https://arxiv.org/abs/1710.04725), this requires a minimum number of evaluations, which increases with the number of dimensions in the search. Since we were searching over 11 hyperparameters, we increased the number of evaluations per dataset from 225 to 1,500 to improve the quality of the underlying model. All models were evaluated on AUC averaged over 3-fold cross-validation, and training was stopped after 4 early stopping rounds.
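Below is a sketch of how this 11-dimensional search space could be expressed through the scikit-learn API of XGBoost. The mapping of names (alpha/lambda to reg_alpha/reg_lambda, cpu tree method to tree_method) and the early-stopping handling reflect a typical setup rather than the exact experimental code, and the way early_stopping_rounds is passed depends on the xgboost version.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Search space matching the ranges listed above
param_distributions = {
    "booster": ["gbtree", "dart"],
    "tree_method": ["exact", "hist", "approx"],
    "max_depth": randint(3, 21),               # 3–20
    "learning_rate": uniform(0.01, 0.49),      # 0.01–0.5
    "gamma": uniform(0.0, 1.0),
    "reg_alpha": uniform(0.0, 1.0),            # alpha
    "reg_lambda": uniform(0.0, 1.0),           # lambda
    "min_child_weight": uniform(0.0, 5.0),
    "subsample": uniform(0.5, 0.5),            # 0.5–1.0
    "colsample_bytree": uniform(0.5, 0.5),     # 0.5–1.0
    "colsample_bylevel": uniform(0.5, 0.5),    # 0.5–1.0
}

# Recent xgboost releases accept early_stopping_rounds in the constructor;
# older releases take it as a fit() argument instead
model = XGBClassifier(early_stopping_rounds=4, eval_metric="auc")

search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=1500,         # 1,500 evaluated configurations per dataset
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
# search.fit(X, y, eval_set=[(X_val, y_val)])  # early stopping needs a validation set
```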

Figure 6 shows the violin plot of the importances over all datasets: learning rate was found to be generally the most important, followed by subsample and min child weight.


Figure 6: A violin plot showing the most important hyperparameters and interactions for XGBoost over all datasets with learning rate
generally the most important.

Important Hyperparameters for XGBoost

From our experiments we found that:

For learning rate, higher values of 0.2 and above tended to perform better.
For subsample, the density was increased above 0.8.
For min child weight, there is increased density at lower values. A setting of 0 means no minimum is required on the sum of the instance weights in a child, and therefore no regularization. This can make the algorithm much slower, so setting this parameter to 1 (the default value) represents a good trade-off between performance and training time. A starting configuration reflecting these trends is sketched below.
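If you wanted to fold these observations into a starting configuration for XGBoost, a sketch could look like the following; these are starting points suggested by the density plots, not universally optimal values.

```python
from xgboost import XGBClassifier

# Starting points reflecting the trends above, to be refined per dataset
clf = XGBClassifier(
    learning_rate=0.2,    # values of 0.2 and above tended to perform better
    subsample=0.8,        # the density of best values was concentrated above 0.8
    min_child_weight=1,   # the default; a good speed/performance trade-off
)
```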

Figure 7: The density distributions of the ten best values over all datasets for the hyperparameters learning rate, subsample and min child
weight for XGBoost.

What Are Our Takeaways From This Experiment?

This is a very interesting technique that can help data scientists build intuition about the effect of changing the hyperparameters for an algorithm, not only generally, such as in studies on benchmark sets of datasets, but in particular for a given task, as the effect is data-dependent.

It can help you to tune your algorithms faster by focusing on the hyperparameters that bring the largest performance
improvement. Generally, that is min samples leaf for random forest and learning rate, subsample and min child weight for
XGBoost.

Why Is Min Samples Leaf for Random Forest Almost Always Better Set So Low?

While doing this study, we were surprised to find that higher values of min samples leaf did not bring higher performance as often as we expected. This subject has already (https://github.com/scikit-learn/scikit-learn/issues/10773) been discussed (https://github.com/scikit-learn/scikit-learn/issues/11976) in the scikit-learn community, and it may be due to the way this parameter has been implemented in scikit-learn. While the expectation may be that this parameter would limit the depth of the trees, in reality it simply excludes splits that would leave fewer than the minimum number of samples in a leaf, so the algorithm can choose a worse split instead of halting the splitting.
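A small illustrative experiment on synthetic data (not from the original study) makes this behaviour visible: increasing min samples leaf reduces the number of leaves, but the parameter only rules out splits that would create leaves smaller than the threshold, so it does not act as a hard depth limit the way max depth does.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# min_samples_leaf only forbids splits that would create leaves smaller than
# the threshold; it does not directly cap the depth the way max_depth does
for leaf in (1, 10, 50):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    print(f"min_samples_leaf={leaf}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```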

How to Set the Number of Estimators?

We found the number of estimators for random forest to be only the third most important hyperparameter. We were
interested to see what would be a reasonable number of estimators that would be likely to give good performance.

By looking at the marginal performance for each dataset as well as the distribution of best values across all datasets, we see that while 10 estimators frequently led to lower performance, once the number of estimators reached 100, the performance was relatively stable. Figure 8 shows an example of the characteristic shape of the marginal performance on the left and the density distribution of the best values over all datasets on the right.

Given that the training time increases with the number of estimators, a sensible default value for this hyperparameter would appear to be 100. This provides empirical evidence for the change of the default (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from 10 to 100 estimators in version 0.22 of scikit-learn.
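A quick way to see this plateau for yourself is to score a random forest with an increasing number of trees; the sketch below uses synthetic data and 3-fold cross-validated AUC purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (10, 50, 100, 200):
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=0),
        X, y, cv=3, scoring="roc_auc",
    )
    print(f"n_estimators={n}: mean AUC = {scores.mean():.4f}")
```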

Figure 8: An example marginal for the number of estimators parameter for random forest on the madelon dataset
(https://www.openml.org/d/1485), and the density distribution of the best values of n estimators over all datasets.

The Importance of a Good Benchmarking Set

Finally, we found that even though the OpenML-CC18 benchmark set (https://www.openml.org/s/99) was designed to be more challenging, some datasets were still too easy. Tuning the algorithm therefore had little effect on these tasks, so they were removed from the results. We are working on further curating our benchmark set of datasets to be more challenging.

Read More Technical Content from Dataiku's Experts

Did you enjoy this article? Follow our data scientists and engineers' tech blog Data from the Trenches on Medium.
