5 Hyperparameter Optimization Methods Every Data Scientist Should Use
Grid Search, Successive Halving & Bayesian Grid Search …
Getting Started
In every Data Science project, it is possible and recommended to search the hyperparameter space to get the best performance metric. Finding the best hyperparameter combination is a step you wouldn’t want to miss as it might give your well-conceived model the final boost it needs.
Many of us default to using the well-established GridSearchCV implemented in scikit-learn. However, alternative optimization methods might be more suitable depending on the situation. In this article, we go through five options, with an in-depth explanation of each and a guide to using them in practice.
Table Of Contents
· How to find my models' hyperparameters
· Grid Search
· Successive Halving
· Bayesian Grid Search
· Visualizing hyperparameter optimization results
Warming Up
Before starting our quest for our best model, we want to find a dataset and a model first.
For the dataset, we will use a package called datasets that allows us to easily download more than 500 datasets:
We chose to use Amazon US Reviews. The goal is to predict the target feature (the number of stars attributed) from the customer reviews.
Below, we’re defining the model whose hyperparameters we will try to optimize:
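The original snippet is not reproduced here, so the following is only a minimal sketch of what such a model could look like. It assumes a two-step pipeline, a TF-IDF vectorizer followed by a random forest classifier, because both are referred to later in the article. The step names tfidf and clf are our own illustrative choice.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Assumed model: the step names "tfidf" and "clf" are illustrative choices.
model = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),                      # review text -> TF-IDF features
        ("clf", RandomForestClassifier(random_state=0)),   # predicts the star rating
    ]
)

# Hyperparameters of a pipeline step are addressed as "<step>__<parameter>".
print(model.get_params()["clf__n_estimators"])
```

With this naming, the vectorizer's maximum document frequency would be tuned as tfidf__max_df and the number of trees as clf__n_estimators.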
If you’re not familiar with pipelines, don’t hesitate to check out our previous article!
How to find my models’ hyperparameters
Before we get to the optimization part, we first need to know what our model’s hyperparameters are, right? To do so, there are two simple ways:
- Scikit-learn documentation for your specific model: You can simply look up your model in the scikit-learn documentation. You’ll get to see a full list of hyperparameters with their names and possible values.
- Using one line of code: You can also use the get_params method to find the names and current values of all the parameters of a given estimator:
model.get_params()
Now that we know how to find our hyperparameters, we can move on to our different optimization options 😉
Grid Search
What is it?
Grid search is an optimization method based on trying out every possible combination of a finite number of hyperparameter values. In other words, in order to decide which combination of values gives the optimal results, we go through all the possibilities and measure the performance for each resulting model using a certain performance metric. In practice, grid search is usually combined with cross-validation on the training set.
When it comes to grid search, Scikit-learn gives us two options to choose from:
Exhaustive Grid Search ( GridSearchCV )
This first version is the classic one that goes through all the possible combinations of hyperparameter values exhaustively. The resulting models are evaluated one by one and the best performing combination gets picked.
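As a hedged illustration, here is a small self-contained sketch of an exhaustive grid search. The toy corpus, the pipeline step names (tfidf, clf) and the grid values are all assumptions made for the example, not the article's original data or settings.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 40 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(40)]
labels = [i % 2 for i in range(40)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])

# Every combination is tried: 2 * 2 * 2 = 8 candidates, each scored with 3-fold CV.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 5],
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
grid_search.fit(texts, labels)

print(len(grid_search.cv_results_["params"]))  # number of evaluated candidates
print(grid_search.best_params_)
```

The cost grows multiplicatively: adding a third value to each of the three lists would raise the number of candidates from 8 to 27.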
To visualize the results of your grid search and to get the best hyperparameters, refer to the paragraph at the end of the article.
Randomized Grid Search ( RandomizedSearchCV )
The second variant of grid search is a more selective one. Instead of going through every possible combination of hyperparameters, it evaluates only a fixed number of combinations. Each combination is sampled from the lists of values or statistical distributions given as arguments.
This method offers the flexibility of choosing the computational cost we can afford. This is done by fixing the number of sampled candidates (sampling iterations) through the n_iter argument.
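A possible version of this search is sketched below, under the same assumptions as before: a toy corpus and a pipeline whose step names (tfidf, clf) are our own. The uniform distribution from scipy.stats is used to sample the TF-IDF maximum document frequency; the ranges themselves are illustrative.

```python
import random

from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 40 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(40)]
labels = [i % 2 for i in range(40)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])

param_distributions = {
    # uniform(loc, scale) samples from [loc, loc + scale], here [0.5, 1.0].
    "tfidf__max_df": uniform(loc=0.5, scale=0.5),
    "clf__n_estimators": randint(10, 60),  # integers from 10 to 59
    "clf__max_depth": [None, 5, 10],       # lists are sampled uniformly
}

# n_iter fixes the budget: exactly 5 candidates are sampled and evaluated.
random_search = RandomizedSearchCV(
    model, param_distributions, n_iter=5, cv=3, random_state=0
)
random_search.fit(texts, labels)

print(random_search.best_params_)
```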
There are certain points to mention here:
- Lists are sampled uniformly
- As mentioned in the documentation:
If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used.
- It is recommended to use continuous distributions for continuous parameters to take full advantage of the randomization. A good example of this is the uniform distribution used above to sample the maximum document frequency chosen for the TF-IDF vectorizer.
- Use n_iter to control the trade-off between the quality of the results and the computational cost. Increasing n_iter gives the search more chances to find a good combination, especially if there are continuous distributions involved, but it does not guarantee a better result and the runtime grows with it.
- Any object can be passed as a distribution as long as it implements an rvs method (random variate sample) for value sampling.
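To make the last point concrete, here is a small sketch of a custom distribution. The class PowersOfTwo is hypothetical and only serves the example. It is exercised with ParameterSampler, the scikit-learn helper that draws candidates from a dictionary of lists and distributions.

```python
from sklearn.model_selection import ParameterSampler
from sklearn.utils import check_random_state


class PowersOfTwo:
    """Hypothetical distribution: draws 2**k with k uniform in [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, random_state=None):
        # The sampler passes its own random state so that draws are reproducible.
        rng = check_random_state(random_state)
        return 2 ** int(rng.randint(self.low, self.high + 1))


space = {
    "clf__n_estimators": PowersOfTwo(3, 7),  # one of 8, 16, 32, 64, 128
    "clf__max_depth": [None, 5],             # plain list, sampled uniformly
}

candidates = list(ParameterSampler(space, n_iter=4, random_state=0))
for candidate in candidates:
    print(candidate)
```

The same space dictionary could be passed as param_distributions to RandomizedSearchCV, assuming the estimator is a pipeline with a step named clf.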
Successive Halving
What is it?
The second method of hyperparameter tuning offered by scikit-learn is successive halving. This method consists of iteratively choosing the best performing candidates on increasingly larger amounts of resources.
In fact, in the first iteration, the largest number of parameter combinations is tested over a small number of resources. As the number of iterations increases, only the best performing candidates are kept. They are compared based on their performance over bigger amounts of resources.
💡 In practice what can resources be? Most of the time, resources are the number of samples in the training set. It is however possible to choose another custom numeric parameter like the number of trees in the random forest algorithm by passing it as an argument.
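The elimination loop described above can be sketched in plain Python. This is a simplified illustration of the idea and not scikit-learn's implementation: the candidates, their hidden quality and the noisy evaluate function are all made up for the example.

```python
import random


def successive_halving(candidates, evaluate, min_resources, max_resources, factor=3):
    """Return the winner and a log of (n_candidates, n_resources) per iteration."""
    history = []
    resources = min_resources
    while True:
        # Score every surviving candidate with the current budget, best first.
        ranked = sorted(candidates, key=lambda c: evaluate(c, resources), reverse=True)
        history.append((len(candidates), resources))
        # Stop when few candidates remain or the next budget would be too large.
        if len(candidates) <= factor or resources * factor > max_resources:
            return ranked[0], history
        candidates = ranked[: len(candidates) // factor]  # keep the top 1/factor
        resources *= factor                                # give survivors more budget


rng = random.Random(0)
# Hypothetical candidates: each one has a hidden "true quality" in [0, 1].
candidates = [{"id": i, "quality": rng.random()} for i in range(27)]


def evaluate(candidate, resources):
    # Toy score: the estimate gets less noisy as more resources are spent.
    return candidate["quality"] + rng.gauss(0, 1.0 / resources)


best, history = successive_halving(candidates, evaluate, min_resources=10, max_resources=1000)
print(history)  # [(27, 10), (9, 30), (3, 90)]
```

Many candidates are screened cheaply at first, and only the few survivors are evaluated with a large budget.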
Similar to the first type of grid search, there are two variants: HalvingGridSearchCV and HalvingRandomSearchCV.
Successive Halving Grid Search ( HalvingGridSearchCV )
Successive halving estimators are still experimental in scikit-learn. Therefore, in order to use them, you need scikit-learn 0.24.0 or later and you have to explicitly enable the experimental feature:
from sklearn.experimental import enable_halving_search_cv
Once this is done, the code closely mirrors GridSearchCV’s:
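A minimal sketch of such a search could look as follows. The toy corpus, the pipeline step names (tfidf, clf) and the grid values are assumptions for the example. The resource is left at its default, the number of training samples.

```python
import random

from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must be imported first)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 120 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(120)]
labels = [i % 2 for i in range(120)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 5],
}

halving_search = HalvingGridSearchCV(model, param_grid, cv=3, factor=3, random_state=0)
halving_search.fit(texts, labels)

print(halving_search.n_candidates_)  # candidates evaluated at each iteration
print(halving_search.n_resources_)   # training samples used at each iteration
print(halving_search.best_params_)
```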
To exploit successive halving to the fullest and customize the computational cost to your needs, there are a number of relevant arguments to play with:
- resource: You can use this argument to customize the type of resource that increases with each iteration. For example, in our code above, we can define it to be the number of trees in the random forest:
Or even the number of features in the TFIDF vectorization:
Make sure however to remove the type of resource from the param_grid dictionary.
- factor: This parameter is the halving parameter. By choosing a certain value for this argument, we get to choose the proportion of candidates that are selected and the number of resources being used for each iteration:
n_resources_{i+1} = n_resources_i * factor
n_candidates_{i+1} = n_candidates_i / factor
- aggressive_elimination: Since the amount of resources used is multiplied by factor at each iteration, there can be at most i_max iterations, where i_max is the largest value such that
n_resources_{i_max} = n_resources_0 * factor^{i_max} ≤ max_resources
If the amount of resources isn’t high enough, the number of candidates remaining at the last iteration isn’t small enough (ideally at most factor). This is where the aggressive_elimination argument makes sense: if it is set to True, the first iteration is replayed multiple times, with the same small amount of resources, until the number of candidates is small enough.
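The resource and factor arguments can be tried out together with a sketch like the one below. The toy corpus, the step names (tfidf, clf) and the budget values are assumptions for the example. The number of trees is used as the resource, so it is deliberately left out of param_grid. aggressive_elimination is not exercised here.

```python
import random

from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must be imported first)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 120 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(120)]
labels = [i % 2 for i in range(120)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])

# 3 * 3 = 9 candidates; clf__n_estimators is absent because it is the resource.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__max_depth": [None, 3, 5],
}

search = HalvingGridSearchCV(
    model,
    param_grid,
    resource="clf__n_estimators",  # budget = number of trees
    min_resources=3,               # trees in the first iteration
    max_resources=27,              # upper bound on the number of trees
    factor=3,
    cv=3,
    random_state=0,
)
search.fit(texts, labels)

# Each iteration multiplies the number of trees by factor and keeps
# roughly 1/factor of the candidates.
print(search.n_resources_)
print(search.n_candidates_)
```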
Randomized Successive Halving ( HalvingRandomSearchCV )
Just like the randomized grid search, randomized successive halving is similar to regular successive halving with one exception. In this variant, a fixed chosen number of candidates is sampled at random from the parameter space. This number is given as an argument named n_candidates. Let’s go back to our code. If we wish to apply randomized successive halving, the corresponding code would be:
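One way this could look is sketched below, again with an assumed toy corpus, assumed step names (tfidf, clf) and illustrative ranges. n_candidates sets how many parameter combinations are sampled for the first iteration.

```python
import random

from scipy.stats import randint
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must be imported first)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 120 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(120)]
labels = [i % 2 for i in range(120)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])

param_distributions = {
    "clf__n_estimators": randint(10, 60),       # integers from 10 to 59
    "clf__min_samples_split": randint(2, 6),    # integers from 2 to 5
    "clf__max_depth": [None, 5, 10],            # lists are sampled uniformly
}

halving_random_search = HalvingRandomSearchCV(
    model,
    param_distributions,
    n_candidates=9,  # candidates sampled for the first iteration
    factor=3,
    cv=3,
    random_state=0,
)
halving_random_search.fit(texts, labels)

print(halving_random_search.n_candidates_)
print(halving_random_search.n_resources_)
print(halving_random_search.best_params_)
```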
Bayesian Grid Search
The third and final method we’re going to talk about in this article is Bayesian optimization over hyperparameters. To use it with Python, we rely on a library called scikit-optimize. The class is called BayesSearchCV and, as mentioned in the documentation, it "utilizes Bayesian Optimization where a predictive model referred to as "surrogate" is used to model the search space and utilized to arrive at good parameter values combination as soon as possible".
What is the difference between randomized grid search and Bayesian grid search?
Compared to the randomized grid search, this method offers the advantage of taking into consideration the structure of search space to optimize the search time. This is done by keeping in memory past evaluations and using that knowledge to sample new candidates that are most likely to give better results.
Now that we have a clear overall idea about this method, let’s move on to the concrete part, the coding part.
I should mention though:
1- If you’re using the latest version of scikit-learn, you might have to downgrade it to 0.23.2 for scikit-optimize to work properly (I would recommend you do that in a new environment):
pip install scikit-learn==0.23.2
2- Also, to avoid any further errors, make sure to install the newest development version via this command:
pip install git+https://github.com/scikit-optimize/scikit-optimize.git
Now the actual code for our model would be:
Visualizing hyperparameter optimization results
To get the full report of all candidates’ performance, we just need to use the attribute cv_results_, which is available for all the methods listed above. The resulting dictionary can be converted to a data frame for more readability:
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
To get other resulting items, you just need these lines of code:
The winning candidate:
best_model = grid_search.best_estimator_
The best combination of hyperparameters:
params = grid_search.best_params_
The mean cross-validated score of the best candidate (computed on the held-out validation folds, not on a separate testing set):
score = grid_search.best_score_
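As a short sketch of how these attributes fit together, the example below runs a small grid search on an assumed toy corpus (with our own step names tfidf and clf) and ranks the candidates by their mean cross-validated score.

```python
import random

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the reviews dataset (assumption): 40 three-word "reviews".
rng = random.Random(0)
positive = ["great", "love", "excellent", "perfect", "happy"]
negative = ["bad", "awful", "broken", "poor", "refund"]
texts = [" ".join(rng.sample(positive if i % 2 else negative, 3)) for i in range(40)]
labels = [i % 2 for i in range(40)]

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", RandomForestClassifier(random_state=0))])
param_grid = {"clf__n_estimators": [10, 50], "clf__max_depth": [None, 5]}

grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(texts, labels)

# One row per candidate, best-ranked candidates first.
results = pd.DataFrame(grid_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results.sort_values("rank_test_score")[columns]
print(summary.to_string(index=False))
```

Despite its name, mean_test_score is the average score over the validation folds of the cross-validation, which is also what best_score_ reports for the winning candidate.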
Grid Search report
If you wish to have a real-time report while the search is still on, scikit-learn developers were kind enough to post a ready-to-use piece of code that does exactly that. The result looks like this:
Final Thoughts
When it comes to hyperparameter optimization, you have a wide choice of ready-to-use tools with Python. You can choose what works for you and experiment with them according to your needs. The trade-off between the best model performance and the most optimized search time is usually the factor that most influences the choice. In any case, it is important not to forget this step to give your model its best chance to perform well.
You can find all the Python scripts gathered in one place in this Github repository. If you have questions, please don’t hesitate to leave them in the responses section and we’ll be more than happy to answer.
Thank you for sticking around this far, stay safe and we will see you in our next article! 😊
References
- scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
- scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- scikit-learn.org/stable/modules/grid_search.html
- scikit-learn.org/stable/auto_examples/model_selection/plot_successive_halving_heatmap.html#sphx-glr-auto-examples-model-selection-plot-successive-halving-heatmap-py
- scikit-learn.org/stable/modules/grid_search.html#aggressive-elimination