CHOPT: Automated Hyperparameter Optimization Framework for Cloud-Based Machine Learning Platforms
Jinwoong Kim * 1 Minkyu Kim * 1 Heungseok Park 1 Ernar Kusdavletov 2 Dongjun Lee 1 Adrian Kim 1
Ji-Hoon Kim 1 Jung-Woo Ha 1 Nako Sung 1
ABSTRACT
arXiv:1810.03527v2 [cs.LG] 16 Oct 2018
Many hyperparameter optimization (HyperOpt) methods assume restricted computing resources and mainly focus
on enhancing performance. Here we propose a novel cloud-based HyperOpt (CHOPT) framework which can
efficiently utilize shared computing resources while supporting various HyperOpt algorithms. We incorporate
convenient web-based user interfaces, visualization, and analysis tools, enabling users to easily control optimization
procedures and build up valuable insights with an iterative analysis procedure. Furthermore, our framework can
be incorporated with any cloud platform, thus complementarily increasing the efficiency of conventional deep
learning frameworks. We demonstrate applications of CHOPT with tasks such as image recognition and question-
answering, showing that our framework can find hyperparameter configurations competitive with previous work.
We also show that CHOPT is capable of providing interesting observations through its analysis tools.
• We introduce an implemented analytic visual tool and show a step-by-step use case of hyperparameter fine-tuning with a real-world example.

2 RELATED WORK

2.1 Hyperparameter Optimization

AutoML is a general concept which covers diverse techniques for automated model learning, including automatic data preprocessing, architecture search, and model selection. HyperOpt is one of the core components of AutoML because hyperparameter tuning has a large influence on performance even for a fixed model structure, in particular in neural network-based models. We mainly focus on automated HyperOpt throughout this paper.

To this end, several hyperparameter optimization algorithms have been proposed (Jaderberg et al., 2017; Falkner et al., 2018; Li et al., 2017). Population-Based Training (PBT) (Jaderberg et al., 2017) is an asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. In practice, Hyperband (Li et al., 2017) works very well and typically outperforms random search and Bayesian optimization methods operating on the full function evaluation budget quite easily for small to medium total budgets. However, its convergence to the global optimum is limited by its reliance on randomly-drawn configurations, and with large budgets its advantage over random search typically diminishes. BOHB (Falkner et al., 2018) is a combination of Hyperband and Bayesian optimization. Hyperband already satisfies most of the desiderata (in particular, strong anytime performance, scalability, robustness, and flexibility), and with BOHB it also satisfies the desideratum of strong final performance.

Although these hyperparameter optimization methods outperform traditional methods in terms of resource utilization and optimization time, they require a high cost to apply to a new environment or codebase. Moreover, it is hard to know which solution performs better than the others before applying all methods to the model. For this reason, there have been some efforts to solve these problems through systematic improvements.

2.2 Hyperparameter Optimization Framework

Tune (Liaw et al., 2018) is a scalable hyperparameter optimization framework for deep learning that provides many hyperparameter optimization methods. However, Tune requires user code modification to access intermediate training results, while our framework requires no user code modification. Also, Tune provides a visual tool via a web interface, but the user cannot interact with it to fine-tune, analyze, or change hyperparameter sets. Orı́on (Tsirigotis et al., 2018) is another hyperparameter optimization tool, which mainly focuses on version control for experiments. It is designed to work with any language or framework since it has a very simple configuration interface; however, it does not really manage computing resources such as GPUs to improve performance at optimization time and resource efficiency. Google Vizier (Golovin et al., 2017) is a Google-internal service. Although it provides many hyperparameter optimization methods, parallel execution, early stopping, and many other features, it is a closed-source project, so it is hard to apply this system to a user's code or system.

2.3 NSML

NSML (Sung et al., 2017) is a platform designed to help machine learning developers and researchers. Although many machine learning libraries such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017) make many researchers' lives easier by simplifying model implementation, they still suffer from a non-trivial amount of manual tasks such as computing resource allocation, training progress monitoring, and comparison of models with different hyperparameter settings.

In order to handle all these tasks so that researchers can focus on models, NSML includes automatic GPU allocation, learning status visualization, and handling of model parameter snapshots, as well as comparison of performance metrics between models via a leaderboard.

We chose NSML as the cloud machine learning platform for implementing and evaluating the proposed CHOPT due to the following properties: 1) NSML has well-designed interfaces, including a command line interface (CLI) and web, so it is easy to assign hyperparameters to a model and get results back from the model; 2) NSML enables us to easily manage cluster computing resources and effectively monitor the progress of model training.

3 CHOPT: CLOUD-BASED HYPERPARAMETER OPTIMIZATION

3.1 Design Goals and Requirements

The main goals of designing our framework are as follows.

Efficient resource management. We need our framework to manage shared resources efficiently so that it balances
throughout CHOPT users and non-CHOPT users while maximizing resource utilization.

Minimum configuration and code modification. By hosting hyperparameter optimization methods in our system, users should be able to easily tune their models without implementation overheads. In addition, configurations should be simple yet flexible.

Web-based analytic visual tool. As hyperparameter optimization requires many training sessions, it is essential to have powerful visualization tools to analyze hundreds of training results.

3.2 System Architecture

Figure 1. System Architecture of CHOPT. [The diagram shows users interacting through CLI and web interfaces with NSML; submitted CHOPT sessions wait in a queue, and agents run them as NSML sessions organized into live, stop, and dead pools.]

In this section, we introduce our hyperparameter optimization system step by step; Figure 1 shows the architecture of the proposed CHOPT framework.

Users can run CHOPT sessions through both a command line interface (CLI) and a web interface. The CLI is light-weight and therefore suitable for simple tasks such as running and stopping sessions. The web interface is relatively heavy but provides many features, making it more suitable for analyzing and monitoring training results. To run a CHOPT session, the user needs to submit code compatible with NSML and a configuration file containing details on the tuning method, hyperparameter sets, and more. While holding the submitted code, a CHOPT session is initialized with the configuration file and inserted into a queue, which stores CHOPT sessions before they run. When a CHOPT session manager (Agent) is available, a CHOPT session obtained from the queue is run by the Agent.

3.2.1 Agent

An Agent is a module responsible for running CHOPT sessions as well as monitoring the progress of an AutoML session so that users can check and analyze the corresponding NSML training sessions with the given interfaces. Note that an NSML session is a single training model.

To manage and maximize computing resources, CHOPT sessions have three session pools: a live pool, a stop pool, and a dead pool. All running NSML sessions are in the live pool, which has resource limits set by the configuration file. For NSML sessions in the live pool, the agent periodically compares their performance and tunes them according to the configuration file. If an NSML session exits, it is moved to the stop pool or the dead pool. An NSML session in the stop pool can be restarted, while an NSML session in the dead pool is completely removed from our system, since AutoML systems commonly create a large number of models, which often takes up too much system storage space. Thus, the user can specify the stop ratio, which determines how many exited NSML sessions go to the stop pool; an NSML session in the stop pool can then be resumed when the system becomes under-utilized, or be recovered by the user.

3.2.2 Master Agent

Whenever a user submits a CHOPT session into the queue, the master agent assigns the session to one of the agents according to the configuration file. The master agent is elected from among the agents, similar to ZooKeeper's leader election (Hunt et al., 2010). If the master agent fails, any agent can become the next master agent. One of the most important roles of the master agent is shifting available computing resources such as GPUs between CHOPT users and non-CHOPT users; otherwise, users would face serious computing resource imbalances. We describe the details in Section 3.3.

3.3 Stop-and-Go

Automated optimization methods and frameworks often take a lot of computing resources because many models are trained in parallel. If not carefully managed, this leads to crucial load imbalance across users in shared cluster environments. To avoid this problem, our framework controls available computing resources such as GPUs among CHOPT users and non-CHOPT users with fairness while maximizing the utilization of computing resources. Here, we introduce a new feature called Stop-and-Go. The key idea of this feature is that the master agent controls available resources among NSML and CHOPT sessions according to the cluster's condition. By doing so, we can maximize the utilization of the shared cluster environment as well as mitigate the problem of early stopping in CHOPT environments.

3.3.1 Efficient Resource Management

Whenever a resource cluster is under-utilized, the master agent assigns more resources (GPUs) to CHOPT sessions so that they can quickly finish hyperparameter optimization. On the other hand, if the cluster is over-utilized, the master agent takes GPUs from CHOPT sessions so that other non-CHOPT users can train their models on the shared cluster environment. This feature enables maximum utilization of the shared cluster.
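The Stop-and-Go reallocation described above can be sketched as a simple priority rule. This is only our reading of the mechanism; the function, the demand-first ordering, and the even per-session split are illustrative assumptions, not CHOPT's actual policy:

```python
def rebalance(total_gpus, non_chopt_demand, chopt_sessions, max_chopt_gpus):
    """Illustrative Stop-and-Go rule: non-CHOPT jobs are served first,
    and CHOPT sessions absorb only the leftover GPUs, capped by the
    configured maximum. A busy cluster therefore shrinks the CHOPT
    share (Stop) and an idle one grows it (Go)."""
    leftover = total_gpus - non_chopt_demand        # GPUs regular users do not need
    chopt_share = max(0, min(leftover, max_chopt_gpus))
    per_session = chopt_share // max(1, len(chopt_sessions))
    return {session: per_session for session in chopt_sessions}
```

For example, with 100 GPUs of which 80 are demanded by non-CHOPT users, two CHOPT sessions would receive 10 GPUs each; when regular demand rises to 100, both drop to 0, and under the pool mechanism above their models would be stopped rather than discarded.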
Setting the checking interval for stopping is important because it controls the performance of early stopping. Here, users can configure this interval by setting step. Note that if step is -1, the CHOPT session disables early stopping.

Every iterative optimization process has a termination condition to avoid running infinitely. We support three options for termination: time, maximum session number, and performance threshold. When multiple conditions are used, our framework stops as soon as it reaches the first condition.

The CHOPT algorithm is configured by tune. In this paper, we support three algorithms: random search with or without early stopping, population based training (Jaderberg et al., 2017), and Hyperband (Li et al., 2017). Each algorithm requires a different set of parameters. For instance, population based training requires two functions, exploit to compare models and explore to search new spaces, while Hyperband requires a resource limit.

3.5 Web-Based Analytic Environment

Although a hyperparameter optimization framework enables users to train hundreds or thousands of models in parallel, the results can be useless if they are difficult to analyze. CHOPT provides powerful visualization tools that allow users to monitor and analyze the results to gain constructive insights. In this section, we describe our visual tool's design goals, implementation, and features.

3.5.1 Design Goals

• Scalable representation: illustrate tuning results in a scalable way without loss of information, even as the user adds more and more hyperparameters.
• Compatible environment: integrate multiple results although they have different configurations, such as the number of hyperparameter sets, tuning algorithm, and others.
• Fine-tuning method: provide an interface that simply re-configures the hyperparameter spaces, for searching unexplored ranges or shrinking ranges.
• Expandable hyperparameter dimension: allow the user to add a new hyperparameter to be tuned.
• Model analysis: serve many features for model analysis.

3.5.2 Scalable Hyperparameter Configuration Visualization

For scalable representation, we use parallel coordinates visualization (Inselberg & Dimsdale, 1987; Heinrich & Weiskopf, 2013). Parallel coordinates visualization has an advantage in representing high-dimensional or multivariate data in a 2D plane by arranging each dimension in parallel. Additionally, the visualization can provide overall distributions of the data for each dimension. Other CHOPT visualization systems have also used parallel coordinates visualization to represent hyperparameter tuning results (Golovin et al., 2017; Liaw et al., 2018; Tsirigotis et al., 2018).

In the visualization, as in Figure 3, each line represents a machine learning model (an NSML session created by a CHOPT session), where each axis represents a hyperparameter such as learning rate, activation, dropout rate, and more. By arranging the hyperparameters in parallel, the visualization can scale to more hyperparameters and present a large number of hyperparameter configurations in a scalable way. We expect this visualization to help users analyze the relationships between hyperparameters as well as the results.
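The core transform behind such a parallel-coordinates view is easy to state: each configuration becomes a polyline whose position on every axis is its value rescaled to [0, 1]. The sketch below is a hypothetical helper of our own (CHOPT's frontend is not described at this level); numeric axes are min-max scaled and categorical axes are spread evenly:

```python
def to_parallel_coords(configs, axes):
    """Map each hyperparameter configuration (a dict) to [0, 1]
    positions on the given parallel axes. Numeric axes are min-max
    scaled; categorical axes are spread evenly in sorted order."""
    lines = []
    for cfg in configs:
        line = []
        for axis in axes:
            values = [c[axis] for c in configs]
            if all(isinstance(v, (int, float)) for v in values):
                lo, hi = min(values), max(values)
                pos = 0.5 if hi == lo else (cfg[axis] - lo) / (hi - lo)
            else:
                cats = sorted(set(map(str, values)))
                pos = cats.index(str(cfg[axis])) / max(1, len(cats) - 1)
            line.append(pos)
        lines.append(line)
    return lines
```

The returned polylines can then be drawn with any plotting library, one vertical axis per hyperparameter.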
[Usage-flow figure: run AutoML with initial hyperparameter sets → analyze hyperparameter tuning results → get the best hyperparameter sets → get desired models with fine-tuned hyperparameter sets.]

…previous results. In the case of unexplored ranges, users may want to see the effect of other ranges on the performance of their models. In the case of narrowed ranges, they may want to find more precise values affecting the performance of the models. Also, users can change the other configurations for the next CHOPT session, such as the tuning algorithm, the step size for early stopping, the population size, and so on. Additionally, we expect that users can combine tuning configurations in this step based on what they aim at, such as the tuning algorithm, the early stopping option, etc.

4. (Optional) Append a new hyperparameter to be tuned: if users obtain an optimal combination of a given hyperparameter set through a single CHOPT session or multiple sessions, they can append a new hyperparameter to be tuned which was a constant value in the previous sessions.

5. Get desired models with fine-tuned hyperparameter sets.

4 PRACTICAL USE CASES

In this section, we describe how CHOPT visualization works with a real-world example. Figure 7 shows an overview of CHOPT visualization with six CHOPT sessions on the CIFAR-100 dataset.

As described in the usage flow in Section 3.5.4, we fine-tuned six hyperparameters step by step. First, we ran an initial CHOPT session (purple in Figure 7) to tune 'learning rate', with the other hyperparameters not tuned.

For the next step, we selected the top ten models with our masking method (Figure 4) and obtained the optimal ranges in which models performed nicely, then ran with additional hyperparameters, the same as in the earlier steps. With this approach, we gradually added other hyperparameters one by one and fine-tuned each optimized range.

However, in the fifth step (yellow), we added the 'depth' hyperparameter and obtained only slightly improved performance, 70.54, and we found that the experiment was quite biased by early stopping because most models with depths greater than twenty were dropped very quickly. Therefore, we reran the CHOPT session (blue) without the early stopping method and obtained better results and the highest performing model with 79.37 percent accuracy, shown at the top-right side of the parallel coordinates in Figure 7. As we mentioned earlier, early stopping does not guarantee better results than not using early stopping.

Table 1 shows the changes of the configuration (hyperparameter spaces) according to the progress of the fine-tuning. Each constant value means there was no exploration by CHOPT. Even though the CHOPT sessions have different configurations of hyperparameter spaces to be tuned, our visualization allows users to integrate the sessions by setting the constant value. For example, the 'depth' hyperparameter was a constant value of 20 in all sessions (sessions 1-4 in the table) except for the yellow and blue colored sessions (sessions 5-6 in the table), as shown in Figure 7.

Table 1. Fine-tuning results and configurations for each session in Figure 7

no.  | Top Acc. | early stopped | learning rate   | momentum     | prob            | sh              | depth
1st  | 69.62    | True          | 0.001 - 0.2     | 0.9          | 0.0             | 0.4             | 20
2nd  | 69.78    | True          | 0.0334 - 0.0868 | 0.1 - 0.999  | 0.0             | 0.4             | 20
3rd  | 70.4     | True          | 0.0334 - 0.0868 | 0.9 - 0.9381 | 0.0 - 0.9       | 0.4             | 20
4th  | 70.36    | True          | 0.0334 - 0.0868 | 0.9 - 0.9381 | 0.2378 - 0.579  | 0.2 - 0.9       | 20
5th  | 70.54    | True          | 0.0266 - 0.0399 | 0.9 - 0.9381 | 0.2111 - 0.380  | 0.2237 - 0.3496 | [20, 92, 110, 122, 134, 140]
6th  | 79.37    | False         | 0.0266 - 0.0399 | 0.9 - 0.9381 | 0.2111 - 0.380  | 0.2237 - 0.3496 | [20, 92, 110, 122, 134, 140]
Figure 7. An overview of the CHOPT visualization system: six different sessions with different hyperparameter configurations are represented on parallel coordinates, and the three top performing models are shown in detail. The tooltips on the parallel coordinates are activated by hovering over the 'train/loss' line plot in the middle of the figure.
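The masking step above, keeping the best models and reusing the hyperparameter ranges they occupied as the next session's search space, can be sketched as follows. `narrowed_ranges` and its exact top-k rule are our own illustration of the idea, not the tool's implementation:

```python
def narrowed_ranges(results, k, params):
    """Keep the k best models by accuracy and shrink each listed
    hyperparameter's search range to the (min, max) span those models
    actually used, giving the next session's narrowed space."""
    top = sorted(results, key=lambda r: r["accuracy"], reverse=True)[:k]
    return {p: (min(r[p] for r in top), max(r[p] for r in top)) for p in params}
```

Applied repeatedly, this reproduces the step-by-step narrowing shown for 'learning rate' in Table 1.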
The horizontal bar plot (right middle) shows the learning duration of each model. The x-axis represents the last learning step of each model. In practice, this plot can help users find biased experiments: we found that the fifth experiment had never explored a depth of 140 because of early stopping. Figure 2 shows the difference between the fifth and sixth sessions, which have the same hyperparameter spaces but differ in their use of early stopping. We found that the fifth session was biased toward a depth of 20, and a depth of 140 was not explored. On the other hand, the sixth session was not biased.

The selected model summary table (right-bottom side) lets users see the precise values of the selected models at once, such as performance and hyperparameter configuration.

5 EVALUATION

In order to evaluate the performance of our system, we conduct experiments on two different tasks: image classification and question-answering. For these experiments, we used the CIFAR-100 dataset (Krizhevsky & Hinton, 2009) and the SQuAD 1.1 dataset (Rajpurkar et al., 2016), respectively. Both evaluations show that our proposed framework finds better performance than that reported in the original papers.

5.1 Image Classification and Question-Answering

Image classification and question-answering are well known as among the most challenging and most popular tasks in the deep learning community (He et al., 2016; Seo et al., 2016). There were huge breakthroughs in convolutional neural networks (CNNs) for image classification after the residual network (He et al., 2016), and in attention architectures for question-answering using the SQuAD dataset, respectively.

We test our CHOPT framework on various CNN structures on the image classification task with CIFAR-100. We choose CIFAR-100 because it is large enough to compare deeper models' performance, unlike MNIST (LeCun, 1998) or CIFAR-10, and, simultaneously, small enough to train many models in a short time, unlike ImageNet (Deng et al., 2009). We examine our framework with the residual network (ResNet) and the wide residual network (WRN) (Zagoruyko & Komodakis, 2016). In addition, we test a regularization method as well, specifically data augmentation by Random Erasing (RE) (Zhong et al., 2017) with ResNet and WRN, to prove that our framework works on search spaces with high dimension. CHOPT is also evaluated on SQuAD 1.1 for question-answering; for these experiments, we use BiDAF (Seo et al., 2016). In these experiments, we use random search with early stopping, population based training (PBT), and Hyperband, reporting the best result among these methods.
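As a concrete reference point for one of the methods above, random search with early stopping can be sketched as follows. The median-based stopping rule and all names here are illustrative assumptions of ours; the paper does not specify the exact stopping criterion, only that it is checked at a configurable step interval (`check_every` below plays that role):

```python
import random

def random_search_early_stop(space, train_step, n_models, n_steps, check_every):
    """Random search with a simple early-stopping rule: every
    `check_every` steps, stop the models scoring below the current
    median. `space` maps names to (low, high) ranges; `train_step(cfg, t)`
    returns the model's score at step t."""
    configs = [{k: random.uniform(lo, hi) for k, (lo, hi) in space.items()}
               for _ in range(n_models)]
    alive = {i: 0.0 for i in range(n_models)}        # model id -> latest score
    for t in range(1, n_steps + 1):
        for i in list(alive):
            alive[i] = train_step(configs[i], t)
        if t % check_every == 0 and len(alive) > 1:
            median = sorted(alive.values())[len(alive) // 2]
            alive = {i: s for i, s in alive.items() if s >= median}
    best = max(alive, key=alive.get)
    return configs[best], alive[best]
```

With a score that grows with the sampled value, the survivor is always the best-sampled configuration, which is the behavior a real training curve only approximates.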
Figure 8. Adaptive available GPU control between NSML and CHOPT sessions. [The plot shows the normalized number of GPUs (%) over time: the total GPUs, the maximum GPUs allowed for AutoML, and the GPUs in use by CHOPT.]

Figure 9. Fully trained result of a revived early-stopped model. Top: hyperparameter configuration of the early-stopped model. Bottom: fully trained result of the early-stopped model (test accuracy and test loss over training steps, with the early stopping point marked).

For future work, we would like to find more effective policies than random selection methods. In addition, we would like to extend our AutoML method by adding more criteria to our framework. Currently, CHOPT is only used for selecting the best model, but we would also like to find models which have fewer parameters, are resource efficient, and train fast.
REFERENCES

Han, D., Kim, J., and Kim, J. Deep pyramidal residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 6307-6315. IEEE, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Heinrich, J. and Weiskopf, D. State of the art of parallel coordinates. In Eurographics (STARs), pp. 95-116, 2013.

Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference, volume 8, pp. 9. Boston, MA, USA, 2010.

Inselberg, A. and Dimsdale, B. Parallel coordinates for visualizing multi-dimensional geometry. In Computer Graphics 1987, pp. 25-44. Springer, 1987.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Sung, N., Kim, M., Jo, H., Yang, Y., Kim, J., Lausen, L., Kim, Y., Lee, G., Kwak, D., Ha, J.-W., et al. NSML: A machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902, 2017.

Tsirigotis, C., Bouthillier, X., Corneau-Tremblay, F., Henderson, P., Askari, R., Lavoie-Marchildon, S., Deleu, T., Suhubdy, D., Noukhovitch, M., Bastien, F., et al. Orı́on. 2018.

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In SSW, pp. 125, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.