
As you probably know if you are familiar with Data Science, Machine Learning, Towards Data Science or my previous post on the subject, fine-tuning your model is crucial for getting the best performance. You simply cannot rely on default values.
As Satyam Kumar states in his latest article, several methods exist to perform this optimization, ranging from manual tuning and random search to brute force and Bayesian search. Each of these methods has its advantages and drawbacks.
This article will focus on a fairly recent way to perform Hyper Parameter (HP) optimization: model-based HP tuning.
This method is quite interesting, as it uses ML methods to tune ML models. We can reuse the tools we are familiar with to optimize the tools we are familiar with 🙂 dizzying, no?
Hyper Parameters can be regarded as structured, tabular data, and what is the most versatile ML algorithm that we know for structured, tabular data? The uncontested winner of Kaggle competitions? XGBoost!
Let’s see what we can do with it, and try to use it to tune itself.
Model-based HP Tuning
The idea behind model-based tuning is pretty simple: to speed up convergence towards the best parameters for a given use case, we need a way to guide the Hyper Parameter Optimization towards the best solution. Indeed, training a model can be time-consuming, depending on the size of the training set, and the configuration space to explore is usually combinatorially large.
This means that we need a way to identify the most promising configuration to evaluate. Why not simply use a model to learn an estimator of the score for a given configuration? Each training will then be used to refine the underlying model, and give us some insight into the direction to explore. This is the leading idea of model-based Hyper Parameter Optimization.
If you are interested in the theory behind this idea, I strongly advise reading this academic paper. You could also be curious to look at the implementation of SMAC that uses this method.
Overall method
The overall algorithm for model-based optimization of Hyper Parameters is pretty straightforward:
1. Select n configurations randomly
2. Evaluate these configurations using the internal estimator that scores a configuration
3. Keep the configuration with the best estimated score, discard all the others
4. Train the model with this configuration
5. Add the current configuration and its score to the training set of the internal estimator
6. Retrain the internal estimator
7. Go back to step 1 if the maximal number of iterations or the minimal score has not been reached yet
Let’s see how we can implement this.
Sampling the configuration space

The first question we have to answer is: how do we sample the configuration space? That is, how do we randomly pick an eligible configuration from the configuration space?
This is not a very difficult task, and we could write some code ourselves to solve the problem, but fortunately, a library exists that handles all of this for us: ConfigurationSpace. More specifically, ConfigurationSpace can handle conditional configurations, i.e. parameters that are only valid when another parameter takes a given value. We won't use this feature here, but it's very helpful in many situations.
Below is an example showing how to use ConfigurationSpace to randomly generate a RandomForest configuration:

In a few lines of code, we can easily generate random configurations. Code from the author.
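To make the idea concrete without the library, here is a minimal hand-rolled sampler (a sketch, not the ConfigurationSpace API); the parameter names and ranges below are illustrative assumptions, not tuned values:

```python
import random

# Illustrative search space for a RandomForest: each entry is either an
# integer (low, high) range or a list of categorical choices.
SPACE = {
    "n_estimators": (10, 500),          # integer range
    "max_depth": (2, 20),               # integer range
    "max_features": ["sqrt", "log2"],   # categorical
    "bootstrap": [True, False],         # categorical
}

def sample_configuration(space):
    """Draw one random, eligible configuration from the space."""
    cfg = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            cfg[name] = random.randint(domain[0], domain[1])
        else:
            cfg[name] = random.choice(domain)
    return cfg

print(sample_configuration(SPACE))
```

The library buys you the same thing plus validation and conditional parameters, which is why it is worth using in practice.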
Choosing the right model
As always in Data Science, the next question now is: what model should we use to build a trustable estimator?
Initially, model-based optimization used Gaussian processes to estimate configuration scores, but recent papers show that tree-based models are a good option. The main reason for dropping Gaussian processes is that they don't support categorical features.
As stated a few lines above, when speaking of tree-based models, the immediate (but probably not always the best) answer is XGBoost!
Why use XGBoost? Not only because XGBoost and gradient boosting methods are very efficient and amongst the most frequent winners of Kaggle contests, but also because they are very versatile and do not need much preprocessing: feature normalization is not required, missing values can be handled automatically, and so on.
To be honest, it's also fun to use XGBoost to optimize XGBoost. However, we will also consider another option that is very similar to XGBoost but has the noticeable advantage of supporting categories natively: CatBoost.
Handling categorical features is very handy, as many model parameters are categorical. Think of XGBoost's objective, booster or tree_method parameters, for instance.
LightGBM would also be a perfect fit.
Tuning model Hyper Parameters
We are now in possession of all the elements required to create our own Hyper Parameter Optimization engine. To do so, we create an Optimizer class that is configured by five parameters:
algo_score: a method used to score a model or an algorithm for a given configuration.
max_iter: the maximal number of trainings to perform
max_intensification: the maximal number of candidate configurations to sample randomly
model: the class of the internal model used as score estimator
cs: the configuration space to explore
As you can see below, this does not require too many lines:

This class implements HP Tuning using a model. Code from the author.
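A minimal sketch of such a class is shown below, following the five parameters listed above. It makes simplifying assumptions not fixed by the text: the internal model exposes scikit-learn style fit/predict, cs is a callable returning one random configuration as a dict, and configuration values are numeric (categorical parameters would need encoding first):

```python
import random

class Optimizer:
    """Model-based Hyper Parameter tuning (minimal sketch)."""

    def __init__(self, algo_score, max_iter, max_intensification, model, cs):
        self.algo_score = algo_score                    # trains and scores the target model for a config
        self.max_iter = max_iter                        # number of real trainings to perform
        self.max_intensification = max_intensification  # candidates sampled per iteration
        self.model = model                              # class of the internal score estimator
        self.cs = cs                                    # callable sampling one random configuration

    def optimize(self):
        cfgs, scores, trajectory = [], [], []  # explored configs, their scores, improvements
        best_score = float("-inf")
        for _ in range(self.max_iter):
            candidates = [self.cs() for _ in range(self.max_intensification)]
            if cfgs:
                # Pick the candidate with the best *estimated* score.
                estimator = self.model()
                estimator.fit([list(c.values()) for c in cfgs], scores)
                preds = estimator.predict([list(c.values()) for c in candidates])
                chosen = candidates[max(range(len(candidates)), key=lambda i: preds[i])]
            else:
                chosen = random.choice(candidates)  # no training data yet: pick at random
            score = self.algo_score(chosen)         # the only real (expensive) training
            cfgs.append(chosen)
            scores.append(score)
            if score > best_score:                  # keep track of improvements only
                best_score = score
                trajectory.append((chosen, score))
        return trajectory
```

Note that only one expensive training happens per iteration; the intensification loop is cheap because it only queries the internal estimator.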
The critical part of the code above lies in the optimize function. This function does four things:
It stores explored configurations in the list cfgs.
It stores selected configurations in the list trajectory.
It selects candidate configurations for exploration using the score estimation provided by the internal model.
It trains the estimator using the scores of past trainings.
Performances analysis
To evaluate the efficiency of our model-based Hyper Parameter engine, we are going to use the Boston dataset. As you probably already know, this dataset contains information regarding house prices in Boston. The goal of our model is to estimate house prices given the features. To begin, we are going to use RandomForest as our base model and evaluate our model-based method on it.
First, let's ensure that our engine really helps to converge more quickly to a better configuration. To do so, we compare the learning progression of a random search with that of our engine. In the code below, we use both scikit-learn's RandomizedSearchCV and our Optimizer to explore the configuration space randomly:

Comparing RandomizedSearchCV with our engine. Code by the author.
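The random-search side of that comparison can be sketched as follows. Since load_boston was removed from recent scikit-learn releases, this sketch uses the bundled diabetes regression dataset as a stand-in, and the parameter ranges are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Illustrative RandomForest search space.
param_distributions = {
    "n_estimators": range(10, 200),
    "max_depth": range(2, 12),
    "min_samples_leaf": range(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=10,        # 10 random configurations, like 10 iterations of our engine
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Plotting the cross-validated score of each of the 10 sampled configurations, in the order they were drawn, gives the random-search curve used in the comparison.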
Looking at the figure below, there is no doubt that our engine is much more effective than a random search. More precisely, it clearly appears that our engine learns and improves with iterations:

RandomizedSearchCV vs our Engine. Plot by the author.
Random search is as expected clearly erratic.
As our code is (relatively) independent of the model used as a score estimator, we can also compare the speed of convergence of XGBoost vs CatBoost. Code follows:

Comparing convergence when using CatBoost with categories, XGBoost and CatBoost without categories. Code by the author.
The resulting plots are displayed in the figure below:

CatBoost (with and without categorical features) vs XGBoost. Plot by the author.
Performance looks quite similar in both cases. Keep in mind that neither the XGBoost nor the CatBoost Hyper Parameters of the score estimator have been optimized: both use the default config. We could try to tune the scoring model using another scoring model, but this article would start to look as convoluted as the Inception movie 😉
Another aspect worth analyzing is the impact of the intensification step, i.e. the number of candidates sampled randomly from the exploration space at each iteration. In the following plot, we have trained our model with 25, 250 and 2500 intensification candidates respectively. The code is similar to the previous one; we just configure the Optimizer differently:

Comparing various levels of intensification. Code by the author
The resulting plots follow:

The number of intensification candidates does not seem to impact the convergence rate. Plot by the author.
Intensification does not seem to have much impact in this case.
Finally, as promised, we are going to use XGBoost to tune XGBoost. The code is exactly the same as the one for RandomForest, except that we use XGBoost as the main model. See the code below:

Tuning XGBoost using XGBoost. Code by the author.
Note that we also use CatBoost as an internal scoring estimator, for comparison purposes. Looking at the plot below, it seems that in this case, XGBoost is slightly better than CatBoost:

Tuning XGBoost using XGBoost. Image by the author.
Going further
We have shown in this post that building a decent Hyper Parameter Optimization engine is not that complex. With a few lines of code, it's possible to greatly speed up model tuning.
What is fun is that you don't need to use external libraries. Reusing models already at hand works: XGBoost can be used to tune XGBoost, CatBoost can tune CatBoost, and RandomForest can tune RandomForest. You can also mix them.
Although our engine works pretty well, an improvement that would be very interesting to investigate is replacing the random sampling of candidates with a Bayesian strategy that generates candidates from the learned distribution.