1 Introduction

The performance of machine learning models depends highly on the choice of hyperparameters. For many years, grid search was the standard approach for tuning them. However, with the emergence of more sophisticated models, such as those in deep learning, grid search is no longer practical due to the large hyperparameter space; simpler approaches such as random search have thus become more desirable and have been shown to be more effective [2].

Over the last few years, the problem of hyperparameter optimization has been successfully framed as meta-learning using Bayesian optimization methods [3, 5, 10]. Nevertheless, bandit-based approaches exhibit superb performance in many scenarios [8, 9]. Li et al. [8] propose a method for hyperparameter selection, called Hyperband (HB), which, in their setting, outperforms Bayesian methods while providing a significant speed-up over those competitors. Hyperband builds on the successive halving approach [11] and improves random search by adaptively allocating the available resources to different configurations.

However, Hyperband is an amended version of random search in which no learning guides the search. In addition, although Hyperband is highly efficient at finding a good configuration, it does not find an optimum quickly. Modeling hyperparameter optimization as a learning problem therefore promises to be more reliable than a pure search algorithm. Hence, instead of only sampling iid configurations of hyperparameters as Hyperband does, we propose to leverage the information of previous batches in order to pre-evaluate sampled configurations and to discard unpromising ones. This is done by a UCB bandit strategy in a contextual setting.

In this paper, we introduce HyperUCB, a model-based bandit framework which adds exploitation to the purely exploratory algorithm of Hyperband. In HyperUCB, the arm selection incorporates an Upper Confidence Bound (UCB) strategy [1] to guide the search within the iterations and to balance exploration and exploitation. We further model the arms in a contextual setting, which generalizes the model to unseen arms (i.e., configurations). To this end, we employ a modified version of LinUCB [7] to obtain a model-based Hyperband for the task of hyperparameter optimization. Empirically, we show that our proposed approach either outperforms Hyperband or performs on par with it when optimizing the hyperparameters of a deep learning model.

Algorithm 1. Hyperband.

2 Background

2.1 Problem Setting

Let \(\mathcal {D} = (\mathcal {X},\mathcal {Y})\) be a data set and M a learning algorithm. The data is usually split into a training set for optimizing the parameters of the model, a validation set for optimizing the hyperparameters, and a test set for evaluating the overall performance of the model. Assuming that \(\mathcal {H}\) is the set of all possible hyperparameter configurations, we denote by \(\mathcal {L}(\lambda )\) the loss of M on the validation set when using \(\lambda \in \mathcal {H}\). The goal is to find the best hyperparameter configuration \(\lambda ^\star = {{\text {arg}\,\text {min}}_{\lambda }} \mathcal {L}(\lambda )\), which minimizes the validation loss for a given budget.

2.2 Hyperband

Hyperband (HB) is an anytime search algorithm based on multi-armed bandits that finds the best configuration for a machine learning approach given limited resources. The method performs several iterations depending on the available resources, and in each iteration calls the SuccessiveHalving method [6] to identify the best configurations. Let R be the maximum budget available for training various instances of a model; then Hyperband conducts \(s_{max} = \lfloor {\log _\eta R}\rfloor \) iterations for exploration, where \(\eta \) is the downsampling rate, i.e., only the best \(1/\eta \) fraction of the arms is kept in each round of successive halving.

Hyperband is outlined in Algorithm 1. Note that the evaluation of the hyperparameters \(\lambda \in \varLambda _s\) in line 7 can be done in parallel. Within the algorithm, three subroutines are used. The sampling routine returns a set \(\varLambda _s\) of \(n \in \mathbb {N}\) hyperparameter configurations \(\{\lambda _1,\ldots ,\lambda _n\}\) drawn iid from a given hyperparameter space \(\mathcal {H}\) of feasible configurations. The evaluation routine returns the validation loss \(\mathcal {L}(\lambda )\) of a configuration \(\lambda \) trained with resource allocation r. Finally, the selection routine returns the subset of \(\varLambda _s\) of size k with the k lowest validation losses in \(\mathcal {L}(\varLambda _s)\).
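For illustration, a minimal sketch of Hyperband with these three subroutines could look as follows; the one-dimensional toy search space and the noisy loss are stand-ins for a real training routine, and the bookkeeping of the original algorithm is slightly simplified:

```python
import math
import random


def sample(n):
    """Draw n iid configurations; here a one-dimensional toy space."""
    return [random.uniform(0.0, 1.0) for _ in range(n)]


def evaluate(lam, r):
    """Toy validation loss of configuration lam under budget r:
    the noise shrinks as more resources are spent."""
    return (lam - 0.3) ** 2 + random.gauss(0.0, 0.1 / r)


def top_k(configs, losses, k):
    """Return the k configurations with the lowest validation losses."""
    return [lam for _, lam in sorted(zip(losses, configs))[:k]]


def hyperband(R, eta=3):
    s_max = int(math.log(R, eta) + 1e-9)      # number of brackets minus one
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):            # most aggressive bracket first
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))
        r = R * eta ** (-s)                   # initial budget per configuration
        configs = sample(n)
        for i in range(s + 1):                # successive halving rounds
            r_i = r * eta ** i
            losses = [evaluate(lam, r_i) for lam in configs]
            for lam, loss in zip(configs, losses):
                if loss < best_loss:          # track the incumbent across budgets
                    best, best_loss = lam, loss
            configs = top_k(configs, losses, max(len(configs) // eta, 1))
    return best, best_loss


print(hyperband(R=81))
```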

2.3 Contextual Bandits

Multi-armed bandits in contextual settings benefit from available side information (the context) to make better decisions when selecting an action (arm). That is, before a decision is made, some context is revealed to the bandit, and the decision may differ depending on that context. The context can include information about the current state, the attributes of the arms, or any other available data. A contextual bandit aims at finding a mapping between contexts and their corresponding outcomes in order to minimize the total regret. Li et al. [7] propose LinUCB, in which the outcome of every arm is modeled as a linear function of the context. In the next section, we present a modified form of LinUCB to design a contextual Hyperband.

Algorithm 2. HyperUCB.

3 Contextual HyperUCB

In this section, we present our approach to upgrade Hyperband to a contextual bandit method using a UCB strategy. Let \(\mathcal {H}\) be the space of all possible hyperparameter configurations for a machine learning approach. We are interested in finding \(\lambda ^\star \in \mathcal {H}\) that gives the best performance \(y^\star \) in terms of the validation loss \(\mathcal {L}\) of the model

$$\begin{aligned} \lambda ^\star = \mathop {\arg \min }\limits _{\lambda } \mathcal {L}(\lambda ). \end{aligned}$$
(1)

We assume that a hyperparameter configuration can be represented by a d-dimensional vector \(\lambda \) and model the contextual bandit as a linear function of the configurations. After learning the parameters \(\varvec{\theta }\) of the linear model, a new configuration \(\lambda \) can be evaluated as \(\hat{y} = \varvec{\theta }^\top \lambda \). The optimization problem in Eq. (1) suggests a lower confidence bound strategy since we aim to minimize \(\mathcal {L}\). However, by considering negative loss values \(-y\), we can retain the usual upper confidence bound (UCB) strategy, since maximizing the negative validation loss \(-\mathcal {L}\) is equivalent to minimizing \(\mathcal {L}\). The UCB approach trades off exploration and exploitation as it also considers the uncertainty of a specific hyperparameter configuration. The score \(p_\lambda \) is thus given by \(p_\lambda = \varvec{\theta }^\top \lambda + \alpha \sqrt{\lambda ^\top A^{-1} \lambda }\), where \(A = \varvec{X}^\top \varvec{X} + \gamma \mathbf {I}\), with \(\gamma \ge 0\), is the regularized design matrix of the configurations evaluated so far, and \(\alpha >0\) is a trade-off parameter.
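As a minimal sketch, the score computation could look as follows in NumPy, assuming that \(\varvec{\theta }\) is fit by ridge regression on the negative validation losses (the standard LinUCB estimator; the exact fitting procedure is not spelled out above):

```python
import numpy as np


def ucb_scores(X, y_neg, candidates, alpha=0.4, gamma=0.1):
    """Score candidates with p = theta^T lam + alpha * sqrt(lam^T A^-1 lam).

    X:          (m, d) matrix of already evaluated configurations
    y_neg:      (m,)   negative validation losses of those configurations
    candidates: (c, d) configurations to score
    """
    A = X.T @ X + gamma * np.eye(X.shape[1])  # regularized design matrix
    A_inv = np.linalg.inv(A)
    theta = A_inv @ X.T @ y_neg               # ridge-regression estimate
    mean = candidates @ theta                 # predicted negative loss
    # per-candidate confidence width sqrt(lam^T A^-1 lam)
    width = np.sqrt(np.einsum("cd,de,ce->c", candidates, A_inv, candidates))
    return mean + alpha * width


# Toy usage: three evaluated configurations in d = 2, two candidates.
X = np.array([[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]])
y_neg = -np.array([0.40, 0.25, 0.35])
print(ucb_scores(X, y_neg, np.array([[0.4, 0.6], [0.8, 0.2]])))
```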

Algorithm 2 summarizes our HyperUCB approach. The bandit model is learned in line 13 and, together with the covariance matrix, provides the upper confidence values for two sampling steps. At every iteration, \(n_0\) configurations are sampled at random as in HB, and from those the bandit model selects the n most promising ones. The second sampling step is in line 14, where top_ucb is performed on the scores \(p_{\lambda }\) rather than the losses \(y_\lambda \). Note that the matrix A is updated every time a configuration is chosen, even within an iteration, which leads to tighter confidence intervals for those configurations.
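A hedged sketch of this selection step is given below; the routine name top_ucb follows the text, while the interplay with the bracket loop of Algorithm 2 is simplified. Note how A is updated immediately after each pick, which tightens the confidence interval of similar configurations within the same iteration:

```python
import numpy as np


def top_ucb(candidates, theta, A, n, alpha=0.4):
    """Greedily pick the n candidates with the highest UCB scores."""
    chosen = []
    A = A.copy()
    for _ in range(n):
        A_inv = np.linalg.inv(A)
        scores = candidates @ theta + alpha * np.sqrt(
            np.einsum("cd,de,ce->c", candidates, A_inv, candidates))
        best = int(np.argmax(scores))
        lam = candidates[best]
        chosen.append(lam)
        A += np.outer(lam, lam)               # update design matrix right away
        candidates = np.delete(candidates, best, axis=0)
    return np.array(chosen)


# Toy usage: pick n = 2 out of n_0 = 5 random configurations in d = 3;
# theta would come from the ridge fit above, zeros here for the demo.
rng = np.random.default_rng(0)
cands = rng.uniform(size=(5, 3))
print(top_ucb(cands, theta=np.zeros(3), A=0.1 * np.eye(3), n=2))
```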

Table 1. Hyperparameters of the multi-layer perceptron.

4 Empirical Study

In this section, we evaluate the performance of the HyperUCB strategy compared to Hyperband [8]. The experiments are conducted on the MNIST data set, which consists of 60,000 training and 10,000 test instances. As a model, we use a simple multi-layer perceptron (MLP) which learns to classify images of handwritten digits. We use the categorical cross entropy as the loss function and the RMSprop optimizer. The validation loss is computed on the hold-out data. We tune four hyperparameters of the MLP, which are outlined in Table 1. We set the minimum budget to one unit of resource, which corresponds to 100 mini-batches of size 100. The maximum budget consists of R units of resources, hence 100R mini-batches. We use the default value of \(\eta =3\) as specified in Hyperband. Our approach has two additional parameters: the exploration-exploitation trade-off \(\alpha \) and the regularization weight \(\gamma \) of the ridge regression. We set \(\alpha =0.4\), as it gives the best performance in [7], and \(\gamma =0.1\).
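To make the budget semantics concrete, the following sketch shows how a single evaluation under a budget of r resource units could be implemented, assuming Keras. The hold-out split and the concrete hyperparameter values (hidden units, dropout rate, learning rate) are hypothetical stand-ins, as the entries of Table 1 are not reproduced here; the sparse cross entropy used below is the integer-label equivalent of the categorical cross entropy mentioned above.

```python
import numpy as np
import tensorflow as tf


def build_mlp(config):
    """Hypothetical MLP builder; the tuned hyperparameters are stand-ins."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(config["units"], activation="relu"),
        tf.keras.layers.Dropout(config["dropout"]),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(config["lr"]),
                  loss="sparse_categorical_crossentropy")
    return model


def evaluate(config, r):
    """Train for r units = r * 100 mini-batches of size 100, as stated above."""
    (x_tr, y_tr), _ = tf.keras.datasets.mnist.load_data()
    x_tr = x_tr.astype("float32") / 255.0
    # assumed hold-out split: last 10,000 training images for validation
    x_val, y_val = x_tr[-10000:], y_tr[-10000:]
    x_tr, y_tr = x_tr[:-10000], y_tr[:-10000]
    model = build_mlp(config)
    idx = np.random.randint(0, len(x_tr), size=r * 100 * 100)
    model.fit(x_tr[idx], y_tr[idx], batch_size=100, epochs=1, verbose=0)
    return model.evaluate(x_val, y_val, verbose=0)  # validation loss


print(evaluate({"units": 128, "dropout": 0.2, "lr": 1e-3}, r=1))
```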

Fig. 1. Performance w.r.t. the budget.

Fig. 2. Performance w.r.t. the time.

Figure 1 shows the validation loss, averaged over five independent runs, for various maximum budgets, including standard errors. For maximum budgets higher than 19, HyperUCB outperforms Hyperband as it consistently yields lower validation errors. We attribute this finding to the fact that with a higher budget, more rounds are conducted, on which the bandit model can learn to discriminate promising from unpromising hyperparameter configurations. This can hardly be done with lower maximum budgets due to the lack of training data.

Figure 2 depicts the average validation loss as a function of computational time, measured in seconds, for a budget of 45. It can be seen that HyperUCB performs on par with Hyperband, i.e., it reaches comparable validation losses as fast as or faster than Hyperband.

5 Conclusion and Future Work

In this paper, we presented HyperUCB, a contextual extension using a UCB strategy for Hyperband, which is a bandit-based method for hyperparameter optimization. The idea is as follows: instead of sampling n iid hyperparameter configurations in each round for evaluation, we sample more configurations, assess them using a multi-armed bandit with a UCB strategy, and only evaluate the n best configurations. This way, we guide the sampling procedure towards more promising configurations and avoid evaluating configurations which are already assumed to yield a high validation error. An experiment on the MNIST data showed that HyperUCB outperforms the Hyperband baseline for moderate budgets when optimizing several hyperparameters of a multi-layer perceptron.

Future work will utilize the ideas of Tavakol and Brefeld [12], in which the parameters of the bandit model are learned using kernel methods in the dual space to capture non-linearity. We also plan to extend the experimental setup by adding more baselines, e.g., BOHB [4], and by considering multiple hyperparameter optimization scenarios on various data sets and models.