
CHOPT: AUTOMATED HYPERPARAMETER OPTIMIZATION FRAMEWORK FOR CLOUD-BASED MACHINE LEARNING PLATFORMS

Jinwoong Kim*1, Minkyu Kim*1, Heungseok Park1, Ernar Kusdavletov2, Dongjun Lee1, Adrian Kim1, Ji-Hoon Kim1, Jung-Woo Ha1, Nako Sung1

arXiv:1810.03527v2 [cs.LG] 16 Oct 2018

ABSTRACT

Many hyperparameter optimization (HyperOpt) methods assume restricted computing resources and mainly focus
on enhancing performance. Here we propose a novel cloud-based HyperOpt (CHOPT) framework which can
efficiently utilize shared computing resources while supporting various HyperOpt algorithms. We incorporate
convenient web-based user interfaces, visualization, and analysis tools, enabling users to easily control optimization
procedures and build up valuable insights with an iterative analysis procedure. Furthermore, our framework can
be incorporated with any cloud platform, thus complementarily increasing the efficiency of conventional deep
learning frameworks. We demonstrate applications of CHOPT with tasks such as image recognition and question-
answering, showing that our framework can find hyperparameter configurations competitive with previous work.
We also show that CHOPT is capable of providing interesting observations through its analysis tools.

* Equal contribution. 1 Clova AI Research, NAVER Corp., Seongnam, Korea. 2 Department of Computer Science, Ulsan National Institute of Science and Technology, Ulsan, Korea. Correspondence to: Jung-Woo Ha <jungwoo.ha@navercorp.com>.

1 INTRODUCTION

Deep neural networks (DNNs) have become an essential method for solving difficult tasks in computer vision, signal processing, and natural language processing (He et al., 2016; Choi et al., 2018; Han et al., 2017; Van Den Oord et al., 2016; Seo et al., 2016; Vaswani et al., 2017). As the capabilities of deep learning have expanded with more modular architectures and advanced optimization methods, the number of hyperparameters has generally increased. This growth makes it more difficult for a researcher to optimize a model, wasting considerable human effort and potentially leading to unfair comparisons. It reinforces the importance of efficient automated hyperparameter tuning methods and interfaces.

To address this problem, several hyperparameter optimization (HyperOpt) methods have been proposed (Jaderberg et al., 2017; Falkner et al., 2018; Li et al., 2017). These methods have many advantages, such as strong final performance, parallelism, and early stopping, which significantly improve computing resource efficiency and optimization time. However, to use these methods, users must implement them in their own system or code, and even when source code is provided, considerable time and effort is required for the integration.

Recently, HyperOpt frameworks have been proposed to eliminate this integration cost (Liaw et al., 2018; Tsirigotis et al., 2018; Golovin et al., 2017). These frameworks host hyperparameter optimization methods and provide visualization tools for analyzing the results of trained models. However, they do not consider computing resource management, and their visualization tools provide only a single view of all parameter information, without interactive functions for analyzing properties and trends between user-specified hyperparameters.

In this work, we propose an efficient HyperOpt framework called Cloud-based Hyperparameter OPTimization (CHOPT). CHOPT efficiently improves resource usage in flexible computing environments based on a Stop-and-Go approach, while supporting state-of-the-art hyperparameter optimization algorithms. Our CHOPT framework also provides a web-based analytic visual tool with which users can monitor the current tuning progress of models or compare results across sessions. Moreover, users can fine-tune hyperparameters through the visualization tool.

The main contributions of this work are as follows:

• We develop an automated HyperOpt framework that hosts hyperparameter optimization methods. We show that our framework can find hyperparameter configurations competitive with results reported in previous papers.
• We propose Stop-and-Go to address the early stopping problem and maximize utilization of shared-cluster resources, which we demonstrate in our evaluation.
• We introduce our implemented analytic visual tool and show a step-by-step use case of hyperparameter fine-tuning with a real-world example.

2 RELATED WORK

2.1 Hyperparameter Optimization

AutoML is a general concept that covers diverse techniques for automated model learning, including automatic data preprocessing, architecture search, and model selection. HyperOpt is one of the core components of AutoML because hyperparameter tuning has a large influence on performance even for a fixed model structure, in particular in neural network-based models. We mainly focus on automated HyperOpt throughout this paper.

To this end, several hyperparameter optimization algorithms have been proposed (Jaderberg et al., 2017; Falkner et al., 2018; Li et al., 2017). Population-Based Training (PBT) (Jaderberg et al., 2017) is an asynchronous optimization algorithm that effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. In practice, Hyperband (Li et al., 2017) works very well and typically outperforms random search and Bayesian optimization methods operating on the full function evaluation budget quite easily for small to medium total budgets. However, its convergence to the global optimum is limited by its reliance on randomly-drawn configurations, and with large budgets its advantage over random search typically diminishes. BOHB (Falkner et al., 2018) is a combination of Hyperband and Bayesian optimization. Hyperband already satisfies most of the desiderata (in particular, strong anytime performance, scalability, robustness, and flexibility), and BOHB additionally satisfies the desideratum of strong final performance.

Although these hyperparameter optimization methods outperform traditional methods in terms of resource utilization and optimization time, they are costly to apply to a new environment or codebase. Moreover, it is hard to know which method will perform best before applying all of them to the model. For this reason, there have been efforts to solve these problems through systematic improvements.

2.2 Hyperparameter Optimization Framework

Tune (Liaw et al., 2018) is a scalable hyperparameter optimization framework for deep learning that provides many hyperparameter optimization methods. However, Tune requires user code modification to access intermediate training results, while our framework does not require any user code modification. Tune also provides a visual tool via a web interface, but users cannot interact with it for fine-tuning, analysis, or changing hyperparameter sets. Oríon (Tsirigotis et al., 2018) is another hyperparameter optimization tool, which mainly focuses on version control for experiments. It is designed to work with any language or framework since it has a very simple configuration interface; however, it does not manage computing resources such as GPUs to improve optimization time and resource efficiency. Google Vizier (Golovin et al., 2017) is a Google-internal service. Although it provides many hyperparameter optimization methods, parallel execution, early stopping, and many other features, it is a closed-source project, so it is hard to apply to one's own code or system.

2.3 NSML

NSML (Sung et al., 2017) is a platform designed to help machine learning developers and researchers. Although machine learning libraries such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017) make researchers' lives easier by simplifying model implementation, researchers still face a non-trivial amount of manual work, such as allocating computing resources, monitoring training progress, and comparing models with different hyperparameter settings.

To handle all of these tasks so that researchers can focus on their models, NSML includes automatic GPU allocation, learning status visualization, and handling of model parameter snapshots, as well as comparison of performance metrics between models via a leaderboard.

We chose NSML as the cloud machine learning platform for implementing and evaluating the proposed CHOPT due to the following properties: 1) NSML has well-designed interfaces, including a command line interface (CLI) and a web interface, so it is easy to assign hyperparameters to a model and get the results back from it; 2) NSML enables us to easily manage cluster computing resources and effectively monitor the progress of model training.

3 CHOPT: CLOUD-BASED HYPERPARAMETER OPTIMIZATION

3.1 Design Goals and Requirements

The main goals in designing our framework are as follows.

Efficient resource management. We need our framework to manage shared resources efficiently, balancing them between CHOPT users and non-CHOPT users while maximizing resource utilization.
Minimum configuration and code modification. By hosting hyperparameter optimization methods in our system, users should be able to tune their models easily, without implementation overhead. In addition, configurations should be simple yet flexible.

Web-based analytic visual tool. As hyperparameter optimization requires many training sessions, it is essential to have powerful visualization tools for analyzing hundreds of training results.

3.2 System Architecture

Figure 1. System Architecture of CHOPT. Users interact through CLI and web interfaces; submitted CHOPT sessions wait in a queue and are run by agents, which manage NSML sessions in live, stop, and dead pools.

In this section, we introduce our hyperparameter optimization system step by step; Figure 1 shows the architecture of the proposed CHOPT framework.

Users can run CHOPT sessions through either a command line interface (CLI) or a web interface. The CLI is lightweight, and therefore suitable for simple tasks such as running and stopping sessions. The web interface is relatively heavy but provides many features, making it more suitable for analyzing and monitoring training results. To run a CHOPT session, the user needs to submit code compatible with NSML and a configuration file containing details on the tuning method, hyperparameter sets, and more. While holding the submitted code, a CHOPT session is initialized with the configuration file and inserted into a queue, which stores CHOPT sessions before they run. When a CHOPT session manager (agent) becomes available, it takes a CHOPT session from the queue and runs it.

3.2.1 Agent

An agent is a module responsible for running CHOPT sessions as well as monitoring the progress of an AutoML session, so that users can check and analyze the corresponding NSML training sessions through the given interfaces. Note that an NSML session is a single training model.

To manage and maximize computing resources, CHOPT sessions have three session pools: a live pool, a stop pool, and a dead pool. All running NSML sessions are in the live pool, which has resource limits set by the configuration file. For NSML sessions in the live pool, the agent periodically compares their performance and tunes them according to the configuration file. If an NSML session exits, it is moved to the stop pool or the dead pool. An NSML session in the stop pool can be restarted, while an NSML session in the dead pool is completely removed from the system, since AutoML systems commonly create a large number of models, which often takes up too much system storage space. The user can specify the stop ratio, which determines how many exited NSML sessions go to the stop pool; NSML sessions in the stop pool can then be resumed when the system becomes under-utilized, or be recovered by the user.
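The pool transitions just described can be sketched in a few lines. The following is only a minimal illustration under our own naming (SessionPools, on_session_exit, and revive_one are hypothetical), not CHOPT's actual implementation; it applies the stop ratio as a random draw over exited sessions, as the text describes.

import random

class SessionPools:
    """Minimal sketch of the live/stop/dead pool mechanics (hypothetical API)."""

    def __init__(self, stop_ratio=0.5):
        self.stop_ratio = stop_ratio   # fraction of exited sessions kept for revival
        self.live, self.stop, self.dead = [], [], []

    def on_session_exit(self, session):
        # An exited session either stays revivable or is removed to save storage.
        self.live.remove(session)
        if random.random() < self.stop_ratio:
            self.stop.append(session)
        else:
            self.dead.append(session)  # checkpoints deleted from system storage

    def revive_one(self):
        # Called when the cluster becomes under-utilized: prefer resuming a
        # stopped session over creating a brand-new one.
        if self.stop:
            session = self.stop.pop()
            self.live.append(session)
            return session
        return None  # stop pool empty: caller creates a new session instead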
3.2.2 Master Agent

The master agent assigns each CHOPT session that a user submits into the queue to one of the agents, according to the configuration file. The master agent is elected from among the agents, similar to ZooKeeper's leader election (Hunt et al., 2010); if the master agent fails, any agent can become the next master agent. One of the most important roles of the master agent is shifting available computing resources, such as GPUs, between CHOPT users and non-CHOPT users; otherwise, users would face serious computing resource imbalances. We describe the details in Section 3.3.

3.3 Stop-and-Go

Automated optimization methods and frameworks often consume a lot of computing resources because many models are trained in parallel. If not carefully managed, this leads to crucial load imbalance across users in shared cluster environments. To avoid this problem, our framework fairly divides the available computing resources, such as GPUs, among CHOPT users and non-CHOPT users while maximizing overall utilization. Here, we introduce a new feature called Stop-and-Go. The key idea of this feature is that the master agent controls the available resources among NSML and CHOPT sessions according to the cluster's condition. By doing so, we can maximize the utilization of the shared cluster environment as well as mitigate the problem of early stopping in CHOPT environments.

3.3.1 Efficient Resource Management

Whenever the resource cluster is under-utilized, the master agent assigns more resources (GPUs) to CHOPT sessions so that they can finish hyperparameter optimization quickly. On the other hand, if the cluster is over-utilized, the master agent takes GPUs away from CHOPT sessions so that non-CHOPT users can train their models on the shared cluster. This feature enables maximum utilization of computing resources and efficient model training.
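As a rough illustration of this policy, the sketch below shows how a master agent might decide whether to grant GPUs to, or reclaim GPUs from, CHOPT sessions based on cluster utilization. The function name and the utilization thresholds are ours, not CHOPT's.

def rebalance(total_gpus, non_chopt_in_use, chopt_in_use,
              low_water=0.7, high_water=0.9):
    """Return a GPU delta for CHOPT sessions (+grant / -reclaim). Sketch only."""
    utilization = (non_chopt_in_use + chopt_in_use) / total_gpus
    idle = total_gpus - non_chopt_in_use - chopt_in_use
    if utilization < low_water and idle > 0:
        return idle            # under-utilized: let CHOPT use the idle GPUs
    if utilization > high_water and chopt_in_use > 0:
        # over-utilized: stop some CHOPT sessions (they go to stop/dead pools)
        overshoot = (non_chopt_in_use + chopt_in_use) - int(high_water * total_gpus)
        return -min(chopt_in_use, overshoot)
    return 0                   # balanced: leave allocations unchanged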

3.3.2 Resolving the Early Stopping Problem

Many hyperparameter optimization methods and frameworks have a feature called early stopping (Liaw et al., 2018; Golovin et al., 2017; Jaderberg et al., 2017; Li et al., 2017), which terminates unpromising model training early rather than letting it run to completion. This feature can save significant computing resources and also enables researchers to train more models in the same amount of time.

However, early stopping can be harmful, since models that eventually reach high performance are not guaranteed to train quickly. Figure 2 presents a history of model search by our framework with a step size of 7 epochs; here, we refer to the step size as the checking interval for early stopping. After a few steps, CHOPT with early stopping only searches a space of shallow models, because models with shallow depth perform better in the early stage of training than the others. This may cause premature convergence of CHOPT, ending up with poor models. More importantly, it is difficult to avoid, because setting a small number of epochs as the step size is the first option a beginner with CHOPT would choose.

Figure 2. Hyperparameter optimization with early stopping (x-axis: step). Colors indicate the depth of the model as a hyperparameter, to showcase the negative effects of early stopping.
The proposed Stop-and-Go method can help users avoid this pitfall. In our framework, when the master agent takes GPUs from a CHOPT session, the session randomly splits its running NSML sessions into the stop pool and the dead pool. If the CHOPT session is later allocated more GPUs, it attempts to resume NSML sessions from the stop pool instead of creating new sessions, which it does only when the stop pool is empty. By reviving early-stopped sessions that still have potential, we avoid the early stopping problem.

3.4 Configuration

Many hyperparameter optimization frameworks require users to modify their code. However, understanding and implementing these modifications takes time. In contrast with other frameworks, CHOPT requires no code modification and only minimal configuration for users to tune their models on our framework. In this section, we introduce the typical form of our framework's configuration file, which uses a dictionary-based structure. We provide many example configuration files in our system for user convenience.

3.4.1 Hyperparameter Space Definition

Defining the hyperparameter space is one of the most important steps in automated machine learning. The hyperparameter space should be large enough to cover many points, yet at the same time small enough to be searched efficiently.

We chose to use Python's dictionary format for hyperparameters because Python is one of the most popular programming languages among ML researchers. Listing 1 shows an example of the interface for a hyperparameter space, which has four components: parameters, distribution, type, and range. The framework supports various distributions other than uniform, such as Gaussian distributions. With this simple Python-dictionary-style configuration, users can easily add or remove hyperparameters, and can also compare two CHOPT sessions with different hyperparameter configurations via the web interface. As deep learning models become more complex in pursuit of better performance, they often require hierarchical hyperparameters; for instance, if a model uses stochastic gradient descent (Ruder, 2016) as an optimizer, training requires momentum and learning rate as hyperparameters. To fulfill these requirements, CHOPT also supports hierarchical hyperparameter spaces.

3.4.2 CHOPT Session Configuration

As shown in Listing 1, a CHOPT session requires a few configurations beyond the hyperparameter space. It is necessary to define the goal of CHOPT so that it can compare many models and find the best-performing one. Users define the goal of a task through measure and order. For example, we can select either top-1 accuracy or loss on the validation set as the measure, and correspondingly set descending or ascending as the order.

As we mentioned in Section 3.3, our framework supports early stopping.

Listing 1. Example of configuration

config = {
    'h_params': {
        'lr': {'parameters': [0.01, 0.09], 'distribution': 'log_uniform',
               'type': 'float', 'p_range': [0.001, 0.1]},
        'depth': {'parameters': [5, 10], 'distribution': 'uniform', 'type': 'int',
                  'p_range': [5, 10]},
        'activation': {'parameters': ['relu', 'sigmoid'], 'distribution': 'categorical',
                       'type': 'str', 'p_range': []}
    },
    'h_params_conditions': [],
    'h_params_conjunctions': [],
    'measure': 'test/accuracy',
    'order': 'descending',
    'step': 5,
    'population': 5,
    'tune': {'pbt': {'exploit': 'truncation', 'explore': 'perturb'}},
    'termination': {'max_session_number': 50}
}
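To make the semantics of Listing 1 concrete, the sketch below draws one hyperparameter assignment from such a space. This is only our illustration of the assumed semantics (sampling within the parameters bounds according to distribution and casting by type); it is not CHOPT's actual sampler, and how CHOPT uses the p_range field is not shown here.

import math
import random

def sample(h_params):
    """Draw one assignment from a Listing 1 style space (assumed semantics)."""
    assignment = {}
    for name, spec in h_params.items():
        bounds = spec['parameters']
        dist = spec['distribution']
        if dist == 'categorical':
            assignment[name] = random.choice(bounds)
        elif dist == 'log_uniform':
            lo, hi = bounds
            assignment[name] = math.exp(random.uniform(math.log(lo), math.log(hi)))
        else:  # 'uniform'
            lo, hi = bounds
            value = random.uniform(lo, hi)
            assignment[name] = int(round(value)) if spec['type'] == 'int' else value
    return assignment

# Example: sample(config['h_params']) might yield
# {'lr': 0.031, 'depth': 7, 'activation': 'relu'}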

Setting the checking interval for stopping is important because it controls the behavior of early stopping. Users configure this interval by setting step; note that if step is -1, the CHOPT session disables early stopping.

Every iterative optimization process needs a termination condition so that it does not run infinitely. We support three options for termination: time, maximum session number, and performance threshold. When multiple conditions are used, our framework stops as soon as it reaches the first one.

The CHOPT algorithm is chosen via tune. In this paper, we support three algorithms: random search with or without early stopping, population based training (Jaderberg et al., 2017), and Hyperband (Li et al., 2017). Each algorithm requires a different set of parameters; for instance, population based training requires two functions, an exploit function to compare models and an explore function to search new spaces, while Hyperband requires a resource limit.
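The termination logic described above might look like the following check. This is a sketch under our assumptions: max_session_number matches Listing 1, while the time and performance_threshold keys are illustrative names of ours.

import time

def should_terminate(termination, started_at, n_sessions, best_measure, order):
    """Return True as soon as any configured condition is met (illustrative)."""
    if 'time' in termination:                      # wall-clock budget in seconds
        if time.time() - started_at > termination['time']:
            return True
    if 'max_session_number' in termination:        # budget on created models
        if n_sessions >= termination['max_session_number']:
            return True
    if 'performance_threshold' in termination:     # stop once good enough
        threshold = termination['performance_threshold']
        if order == 'descending':                  # higher is better (accuracy)
            return best_measure >= threshold
        return best_measure <= threshold           # lower is better (loss)
    return False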
3.5 Web-Based Analytic Environment

Although a hyperparameter optimization framework enables users to train hundreds or thousands of models in parallel, the results can be useless if they are difficult to analyze. CHOPT provides powerful visualization tools that allow users to monitor and analyze the results and build constructive insights. In this section, we describe our visual tool's design goals, implementation, and features.

3.5.1 Design Goals

• Scalable representation: illustrate tuning results in a scalable way, without loss of information, even as users add more and more hyperparameters.
• Compatible environments: integrate multiple results even when they have different configurations, such as the number of hyperparameter sets, the tuning algorithm, and others.
• Fine-tuning method: provide an interface that makes it simple to reconfigure the hyperparameter spaces, for searching unexplored ranges or shrinking existing ones.
• Expandable hyperparameter dimensions: allow users to add a new hyperparameter to be tuned.
• Model analysis: serve many features for analyzing models.

3.5.2 Scalable Hyperparameter Configuration Visualization

For scalable representation, we use parallel coordinates visualization (Inselberg & Dimsdale, 1987; Heinrich & Weiskopf, 2013). Parallel coordinates visualization has an advantage in representing high-dimensional or multivariate data in a 2D plane by arranging each dimension in parallel. Additionally, the visualization can show the overall distribution of the dataset along each dimension. Other hyperparameter optimization visualization systems have also used parallel coordinates to represent tuning results (Golovin et al., 2017; Liaw et al., 2018; Tsirigotis et al., 2018).

In the visualization, as in Figure 3, each line represents a machine learning model (an NSML session created by the CHOPT session), and each axis represents a hyperparameter, such as learning rate, activation, dropout rate, and more. By arranging the hyperparameters in parallel, the visualization can scale to more hyperparameters and present a large number of hyperparameter configurations in a compact way. We expect this visualization to help users analyze the relationships between hyperparameters as well as the results themselves.
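For a rough feel of what such a plot looks like, a static parallel-coordinates chart of tuning results can be drawn with pandas and matplotlib. This toy snippet, with made-up runs and column names, reproduces only the basic chart, not CHOPT's interactive selection, highlighting, or rerun features.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Each row is one trained model (NSML session); last column is the metric.
runs = pd.DataFrame({
    'learning_rate': [0.01, 0.05, 0.03, 0.08],
    'momentum':      [0.90, 0.95, 0.92, 0.99],
    'dropout':       [0.1, 0.3, 0.2, 0.5],
    'test_accuracy': [0.71, 0.68, 0.74, 0.62],
})
# Bucket the metric so lines can be colored by performance.
runs['bucket'] = pd.qcut(runs['test_accuracy'], 2, labels=['low', 'high'])

parallel_coordinates(runs, class_column='bucket', colormap='coolwarm')
plt.title('Hyperparameter configurations on parallel coordinates')
plt.show()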
Figure 3. An example of hyperparameter tuning results shown with parallel coordinates visualization, for a CIFAR-100 random erasing model. Each axis represents one hyperparameter (except for the rightmost axis, which shows the evaluation metric), and each line represents one trained model's hyperparameter configuration.

For the rest of the design goals, we implemented several features, which are described in the following section.

3.5.3 Visualization Feature List

The CHOPT visual tool serves many useful features. Users can monitor and analyze their models by interacting with the visual tools. Selected features are listed below.

• Model Selection and View
  – Masking Top-K sessions: users can set thresholds and select the models with the highest performance at once (top of Figure 4).
  – Multiple range selection: users can select models within specific hyperparameter ranges (bottom of Figure 4).
  – Highlighting axes: users can highlight lines with specific conditions to find relations between hyperparameters and model performance.

Figure 4. An example of selecting models of interest on the parallel coordinates (top: models selected by masking the top-10 performing models; bottom: models selected by multiple dragging interactions on the axes). Users can select the top-K performing models at once with one click, and select models within certain ranges by dragging the ranges on each axis.

• Axis Control
  – Filtering axes: if there are too many hyperparameters to analyze, users can specify the hyperparameters of interest by setting the visibility of each axis.
  – Controlling range scale: if there are too many lines to see details on the parallel coordinates, users can easily adjust the scale of each axis by selecting the ranges of interest on the density plot.
• Exhaustive Analysis
  – Merging or switching sessions of interest: users can load and view multiple AutoML sessions, as well as merge them and see the results simultaneously.
  – Scalar plot view: users can analyze the details of selected models, such as loss values at each step or against wall-clock time.
  – Model summary view: users can easily see the precise hyperparameter values and the performance of selected models.

Figure 5. An example of the further analytic tools. The learning duration view is a bar plot of the models' learning durations; the hyperparameter set clustered view is a t-SNE plot giving a structural overview of the created models; the hierarchical view is a node-link diagram giving an overview of parent-child relations.

• Further Analytic Tools
  – Rerun interface: users can change the configuration of a CHOPT session and submit a new one by selecting different hyperparameter ranges or adding hyperparameters on the parallel coordinates visualization.
  – Parameter analytic view: users can select a visualization type and parameters to see the distribution of a parameter of interest, or the relation between two parameters, via a histogram or scatter plot.
  – Session overview: users can see a summarized overview of the hyperparameter configurations by selecting from the learning duration, hierarchical, and clustered views.

3.5.4 Fine Tuning with the Analytic Visual Tool

When optimizing hyperparameters, a typical user starts with a small set of hyperparameters and incrementally expands it while optimizing. This fine-tuning process can be carried out with the following steps, which are also illustrated in Figure 6:

1. Run a CHOPT session with the initial hyperparameter sets: users run a CHOPT session with an initial configuration, which contains the hyperparameter sets to be tuned, the ranges to be explored, the tuning method, and so on.
2. Analyze the hyperparameter tuning results: users analyze the CHOPT results from a single session or from multiple sessions, according to whether the session is the initial one or not. At this step, we expect users to gain insight into which values (for numerical hyperparameters) or types (for categorical ones) affect the performance of the model.
Figure 6. Basic usage flow of hyperparameter tuning with CHOPT visualization: run AutoML with initial hyperparameter sets; analyze the tuning results; rerun with unexplored or narrowed ranges of the hyperparameter sets, or append a new hyperparameter to be tuned; finally, get the best hyperparameter sets and the desired models with fine-tuned hyperparameter sets.

3. (Optional) Rerun with unexplored or narrowed ranges of the hyperparameter sets: users can select unexplored or narrowed ranges of the hyperparameter sets so that the next CHOPT session explores those ranges based on the previous results. In the case of unexplored ranges, users may want to see the effect of other ranges on the performance of their models; in the case of narrowed ranges, they may want to find more precise values affecting the performance of the models. Users can also change the other configurations for the next CHOPT session, such as the tuning algorithm, the step size for early stopping, the population size, and so on, combining tuning options according to what they aim at in this step. A code-level view of this loop is sketched after the list.
4. (Optional) Append a new hyperparameter to be tuned: if users obtain an optimal combination of the given hyperparameter sets through a single CHOPT session or multiple sessions, they can append a new hyperparameter to be tuned that was a constant value in the previous sessions.
5. Get the desired models with the fine-tuned hyperparameter sets.
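Viewed as code, the loop referenced in step 3 is roughly what the workflow automates. Everything here is hypothetical: submit_session, best_models, and narrowed_range are illustrative stand-ins for actions that, in CHOPT, the user performs interactively on the parallel coordinates view.

def fine_tune(client, space, schedule, top_k=10):
    """Sketch of the iterative fine-tuning workflow (hypothetical API)."""
    results = None
    for new_param, new_spec in schedule:     # e.g. ('momentum', {...}), one per round
        if results is not None:
            best = client.best_models(results, k=top_k)      # step 2: analyze
            for name in space:                               # step 3: narrow ranges
                space[name] = client.narrowed_range(best, name)
        if new_param is not None:
            space[new_param] = new_spec                      # step 4: append new h-param
        results = client.submit_session(space)               # steps 1/3: (re)run CHOPT
    return client.best_models(results, k=1)                  # step 5: desired model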
4 PRACTICAL USE CASES

In this section, we describe how the CHOPT visualization works with a real-world example. Figure 7 shows an overview of the CHOPT visualization with six CHOPT sessions on the CIFAR-100 dataset.

Following the usage flow described in Section 3.5.4, we fine-tuned six hyperparameters step by step. First, we ran an initial CHOPT session (purple in Figure 7) to tune 'learning rate', with the other hyperparameters held constant.

For the next step, we selected the top ten models with our masking method (Figure 4) and obtained the optimal ranges associated with the selected models. With the optimized range of 'learning rate', we ran a second CHOPT session with the additional hyperparameter 'momentum' to be tuned (red), so that the CHOPT session explores the space given by the optimal range of 'learning rate' and the initial range of 'momentum'.

Subsequently, we obtained the optimized ranges that performed well and ran sessions with additional hyperparameters, as in the earlier steps. In this manner, we gradually added the other hyperparameters one by one and fine-tuned each optimized range.

However, in the fifth step (yellow), we added the 'depth' hyperparameter and obtained only slightly improved performance, 70.54%, and we found that the experiment was strongly biased by early stopping, because most models with depths greater than twenty were dropped very quickly. Therefore, we reran the CHOPT session (blue) without the early stopping method and obtained better results, including the highest-performing model with 79.37% accuracy, shown at the top-right side of the parallel coordinates in Figure 7. As we mentioned in Section 3.3.2, early stopping does not guarantee better results than running without early stopping.

Table 1 shows the changes of the configuration (hyperparameter spaces) over the course of the fine tuning. A constant value means there was no exploration by CHOPT. Even though the CHOPT sessions have different configurations of hyperparameter spaces to be tuned, our visualization allows users to integrate the sessions by setting the constant value. For example, the 'depth' hyperparameter was a constant value of 20 in all sessions (sessions 1-4 in the table) except for the yellow- and blue-colored sessions (sessions 5-6 in the table), as shown in Figure 7.

Table 1. Fine tuning results and configurations for each session in Figure 7

no.  | Top Acc. | early stopped | learning rate   | momentum     | prob            | sh              | depth
1st  | 69.62    | True          | 0.001 - 0.2     | 0.9          | 0.0             | 0.4             | 20
2nd  | 69.78    | True          | 0.0334 - 0.0868 | 0.1 - 0.999  | 0.0             | 0.4             | 20
3rd  | 70.4     | True          | 0.0334 - 0.0868 | 0.9 - 0.9381 | 0.0 - 0.9       | 0.4             | 20
4th  | 70.36    | True          | 0.0334 - 0.0868 | 0.9 - 0.9381 | 0.2378 - 0.579  | 0.2 - 0.9       | 20
5th  | 70.54    | True          | 0.0266 - 0.0399 | 0.9 - 0.9381 | 0.2111 - 0.380  | 0.2237 - 0.3496 | [20, 92, 110, 122, 134, 140]
6th  | 79.37    | False         | 0.0266 - 0.0399 | 0.9 - 0.9381 | 0.2111 - 0.380  | 0.2237 - 0.3496 | [20, 92, 110, 122, 134, 140]

Meanwhile, the right side of the figure shows further analysis of the AutoML results and a model summary table of the selected models. The scatter plot between the 'prob' hyperparameter and the 'test/accuracy' metric is visualized (top right of the figure), and it can be seen that 'sh' leads to high performance for values in the range of 0.22 to 0.35.
Figure 7. An overview of the CHOPT visualization system: the six sessions with different hyperparameter configurations are represented on parallel coordinates, and the three top-performing models are shown in detail. The tooltips on the parallel coordinates are activated by hovering the mouse over the 'train/loss' line plot in the middle of the figure.

The horizontal bar plot (right middle) shows the learning duration of each model; the x-axis represents the last learning step of each model. In practice, this plot can help users find biased experiments: we found that, because of early stopping, the fifth experiment had never explored a depth of 140. Figure 2 illustrates the difference between the fifth and sixth sessions, which have the same hyperparameter spaces but differ in their use of early stopping: the fifth session was biased toward a depth of 20 and never explored a depth of 140, while the sixth session was not biased.

The selected model summary table (bottom right) lets users see the precise values of the selected models at once, such as their performance and hyperparameter configurations.

5 EVALUATION

In order to evaluate the performance of our system, we conduct experiments on two different tasks, image classification and question-answering, using the CIFAR-100 dataset (Krizhevsky & Hinton, 2009) and the SQuAD 1.1 dataset (Rajpurkar et al., 2016), respectively. Both evaluations show that our proposed framework finds configurations that perform better than those reported in the corresponding papers.

5.1 Image Classification and Question-Answering

Image classification and question-answering are well known as among the most challenging and most popular tasks in the deep learning community (He et al., 2016; Seo et al., 2016). There have been huge breakthroughs in convolutional neural networks (CNNs) for image classification since the residual network (He et al., 2016), and in attention architectures for question-answering on the SQuAD dataset, respectively.

We test our CHOPT framework on various CNN structures for the image classification task with CIFAR-100. We choose CIFAR-100 because it is large enough to compare the performance of deeper models, unlike MNIST (LeCun, 1998) or CIFAR-10, and at the same time small enough to train many models in a short time, unlike ImageNet (Deng et al., 2009). We examine our framework with the residual network (ResNet) and the wide residual network (WRN) (Zagoruyko & Komodakis, 2016). In addition, we test a regularization method as well, specifically data augmentation by Random Erasing (RE) (Zhong et al., 2017), with ResNet and WRN, to show that our framework works on high-dimensional search spaces. CHOPT is also evaluated on SQuAD 1.1 for question-answering; for these experiments, we use BiDAF (Seo et al., 2016). In these experiments, we use random search with early stopping, population based training (PBT), and Hyperband, and report the best result among these methods.
Table 2. Best top-1 accuracy (%) obtained with CHOPT. Image classification (IC) and question-answering (QA) tasks were performed with various models. Bold indicates higher performance. ResNet (He et al., 2016), WRN (Zagoruyko & Komodakis, 2016), RE (Zhong et al., 2017).

TASKS | MODELS         | REFERENCES | CHOPT
IC    | ResNet         | 76.27      | 77.75
IC    | WRN            | 81.51      | 81.66
IC    | ResNet with RE | 77.9       | 79.45
IC    | WRN with RE    | 82.27      | 83.1
QA    | BiDAF          | 77.3       | 77.93

Table 2 compares the models reported in the references with the best models found by our CHOPT framework. We use top-1 accuracy as the measure, and in most cases CHOPT succeeds in finding better-performing models than the references.

However, it may seem unfair to claim that CHOPT is better than the human-tuned models from the references because of differing model parameter sizes. For instance, CHOPT's best model of WRN with RE in Table 2 has 172.07M parameters, while the best result from the reference contains only 36.54M. We therefore ran one more experiment on WRN with RE while limiting the maximum number of parameters.

Table 3. Best model with parameter limit

                     | TOP-1  | # OF PARAMETERS
BASELINE             | 82.27% | 36.54M
CHOPT w/ constraint  | 82.41% | 36.54M
CHOPT w/o constraint | 83.1%  | 172.07M

Table 3 shows the result of this experiment. The best model found by CHOPT is slightly better than, or at least on par with, the human-tuned model even when it is limited to the same model size. Specifically, both models shared the same architecture, with depth 28 and widening factor 10, but they had different hyperparameters for regularization and optimization.

5.2 Stop-and-Go

Our early stopping method Stop-and-Go is one of the key features of our framework. In the best-case scenario, early stopping helps users save both resources and time while still finding the best model; however, naive early stopping can fail and drop potentially good training sessions. In this section, we show how early stopping affects performance and resource efficiency in CHOPT. We chose ResNet with RE as the target architecture for image classification on CIFAR-100, with a termination condition of 200 models for the experiment; each model is configured to run up to 300 epochs. We used population based training (PBT) for all experiments with early stopping, and random search without early stopping.

Table 4. GPU time and performance by step size

                            | GPU TIME | TOP-1
WITHOUT EARLY STOPPING      | 60+ DAYS | 79.75%
LARGE STEP SIZE (25 EPOCHS) | 22 DAYS  | 79.45%
SMALL STEP SIZE (3 EPOCHS)  | 2 DAYS   | 77.42%

Both resource efficiency and performance by step size are presented in Table 4. We denote by GPU time the time that would be taken if only one GPU were used for the entire CHOPT pipeline. Without early stopping, CHOPT can generate the best model among all algorithms, including the human-tuned model (77.9%), while taking more than two months of GPU time. When we enable early stopping with a small step size, the run takes only 2 days of GPU time, but it cannot find a good enough model. However, with the right step size, as in the second row of the table, we can more than double our GPU efficiency while obtaining performance similar to running without early stopping.

As another advantage, the Stop-and-Go feature allows us to get the same performance in a shorter time by assigning idle GPUs to CHOPT sessions. Figure 8 shows how the proposed Stop-and-Go method improves resource utilization. We divide the time period into zones A, B, C, D, and E for explanation. In time zone A, no CHOPT session is running, so the green and yellow lines overlap. Then several CHOPT sessions start hyperparameter optimization in time zone B, and as a result, the green line rises slightly. During time zone C, the master agent assigns idle GPUs to CHOPT sessions to maximize cluster utilization and speed up tuning, since the cluster is under-utilized. However, other users suddenly try to use GPUs for their own models in time zone D (the yellow line rises), so the master agent takes GPUs back from the CHOPT sessions; although usage briefly exceeds the maximum number of GPUs for CHOPT, it does not exceed it by much. In the last part of time zone D and in time zone E, the CHOPT sessions have almost finished optimizing hyperparameters, so the green line falls and finally overlaps with the yellow line again.

By reviving sessions terminated by early stopping, the Stop-and-Go method sometimes improves the performance of CHOPT. Figure 9 presents one example of a successful Stop-and-Go case. As shown in the top of Figure 9, the model was stopped by CHOPT at an early stage because of its poor performance.
However, after it was revived by Stop-and-Go, the model ended up with 76.61% top-1 accuracy. Although this was not the best model CHOPT found (77.42%), considering that the results are competitive, we can say that Stop-and-Go can potentially save valuable hyperparameter configurations.

Figure 8. Adaptive control of available GPUs between NSML and CHOPT sessions. The y-axis is the normalized number of GPUs (%) and the x-axis is time; the plotted lines are the total GPUs, the maximum GPUs allowed for CHOPT, the GPUs in use by NSML, and the GPUs in use by CHOPT.

Figure 9. Fully trained result of a revived early-stopped model. Top: hyperparameter configuration of the early-stopped model (lr, depth, momentum, prob, sh, test/accuracy). Bottom: fully trained result of the early-stopped model, showing test/accuracy and test/loss over training steps, with the early stopping point marked.

6 CONCLUSION

Although several HyperOpt methods have been proposed, many researchers still struggle to optimize hyperparameters due to difficulty of use and inconvenient interfaces. In this paper, we propose a novel cloud-based HyperOpt (CHOPT) framework that efficiently exploits flexible computing resources while supporting various HyperOpt algorithms. In addition, we propose Stop-and-Go to address the early stopping problem while maximizing the utilization of shared-cluster resources. We presented our analytic visual tool and showed our system's capabilities with a practical use case of hyperparameter fine tuning. Finally, we demonstrated the performance of CHOPT on two real-world tasks, image classification and question-answering; the results show that our framework can find competitive hyperparameter configurations compared to previously reported results.

Although Stop-and-Go is an effective method, it is still hard to pick which session to revive among all the early-stopped models. For future work, we would like to find more effective policies than random selection. In addition, we would like to extend our AutoML method by adding more criteria to our framework: currently CHOPT optimizes only for the best model performance, but we would also like to find models that have fewer parameters, are resource efficient, and train fast.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Falkner, S., Klein, A., and Hutter, F. Bohb: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.

Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.

Han, D., Kim, J., and Kim, J. Deep pyramidal residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 6307–6315. IEEE, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Heinrich, J. and Weiskopf, D. State of the art of parallel coordinates. In Eurographics (STARs), pp. 95–116, 2013.

Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. Zookeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference, volume 8, pp. 9. Boston, MA, USA, 2010.

Inselberg, A. and Dimsdale, B. Parallel coordinates for visualizing multi-dimensional geometry. In Computer Graphics 1987, pp. 25–44. Springer, 1987.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.

Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., and Stoica, I. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Sung, N., Kim, M., Jo, H., Yang, Y., Kim, J., Lausen, L., Kim, Y., Lee, G., Kwak, D., Ha, J.-W., et al. Nsml: A machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902, 2017.

Tsirigotis, C., Bouthillier, X., Corneau-Tremblay, F., Henderson, P., Askari, R., Lavoie-Marchildon, S., Deleu, T., Suhubdy, D., Noukhovitch, M., Bastien, F., et al. Oríon. 2018.

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. In SSW, pp. 125, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.