Tuning the hyperparameters of a neural network doesn't have to be black magic! Leveraging the power of TensorBoard and Clusterone, here’s a simple and hands-on introduction to hyperparameter tuning.
You’ve worked hard to implement your neural network and after hours of work it’s finally training nicely. Awesome!
But now comes the next step. The question is: Could it be better? Probably! But how? And what does better really mean?
Hyperparameters are the settings that influence how a neural network performs but are not learned during training. These range from structural choices like the number of layers in the network and the number of neurons in each layer to more abstract parameters like the learning rate, the batch size, and so forth.
Why hyperparameter tuning is useful
The goal of hyperparameter tuning is to find hyperparameters that produce a better model. But what are better models? What’s optimal in this context?
Most of the time, “better” comes down to a lower loss and higher test accuracy and this is what we’ll focus on in this article. But just for the record, “better” can also mean faster convergence, lower memory consumption, or lower compute requirements. The definition of optimality is usually set by the end user’s specific requirements.
Hyperparameter tuning tries to find a combination of the variables in a neural network, so that the network learns more efficiently and achieves a higher test accuracy. Or, in other words, we’re trying to minimize the error produced over the test set.
One common way to approach hyperparameter tuning is to find the right model complexity that balances bias and variance. Low-complexity models have a low variance, but a high bias. They don’t train very efficiently and don’t learn much.
As the complexity increases, the bias decreases. The model trains better and also achieves better results in test accuracy. But with increasing complexity, the variance is growing, too. While the training accuracy is getting better and better, the test accuracy actually gets worse.
This process is called overfitting. We have created such an intricate model that it describes the training dataset very well, but it can no longer generalize and thus fails when it has to deal with the new test dataset.
Think of the example of trying to teach your network to recognize dogs. It ends up perfectly recognizing the specific dogs from your training set, but not dogs in general. It can identify Snoopy in a pool of a million images, but wouldn’t be able to say that Lassie is a dog, too.
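To make the gap between training and test accuracy concrete, here is a minimal sketch. It swaps the neural network for a k-nearest-neighbor classifier on synthetic 1-D data, purely because that fits into a few lines of standard-library Python. A smaller k plays the role of a more complex model here. The data, the noise level, and all function names are assumptions made for illustration.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic 1-D task: label is 1 if x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise that a too-flexible model will memorize
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def accuracy(train, data, k):
    """Fraction of points in `data` that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)
train_set = make_data(200, rng)
test_set = make_data(200, rng)

for k in (1, 5, 25):  # k = 1 is the most complex model
    print(k, accuracy(train_set, train_set, k), accuracy(train_set, test_set, k))
```

With k = 1 the training accuracy is perfect by construction, since every training point is its own nearest neighbor. The test accuracy is typically lower, because the memorized label noise does not carry over to new data.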
So how do we find the perfect balance between variance and bias? Well, it’s a process of trial and error. There’s no single equation we can just solve to get to the right solution.
But there are tools and methods that can help us find the balance we are looking for. One of these methods is the Coordinate Descent Approach.
Coordinate Descent Approach
There are two main areas where tuning can happen: model complexity and task complexity.
Model complexity is controlled by factors like the number of layers in the model and the number of nodes (“neurons”) in each layer.
While model complexity focuses on the representation of the network we’re training to solve a problem, task complexity describes the problem itself. It consists of factors like the number of data points (or “samples”) we have in the dataset, how many labels we have to differentiate between (if we’re talking about a classification problem), and the size of each data point.
ImageNet is a great example of very high task complexity. The dataset is huge: there are millions of data points, each of which is a large file - an image. The images also show hundreds of different objects that a neural network needs to learn to differentiate between.
MNIST, on the other hand, has a relatively low task complexity. It “only” has around 60,000 data points, the images are small and grey-scale. Most importantly, there are only 10 classes.
So, how does Coordinate Descent for hyperparameter tuning work? It’s simple: start small, then gradually go bigger.
Step 0: Start with a simple task
Even when training on a hugely complex problem like ImageNet, start with only 2 classes and maybe 100 images. For example, pick “cats” and “dogs” as labels. Then, choose 100 images of cats and dogs from ImageNet and train your model on it.
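Carving out such a reduced task could look roughly like the sketch below. It assumes the dataset is available as a list of (item, label) pairs. The `reduce_task` helper and the file names are hypothetical placeholders, not part of any ImageNet tooling.

```python
import random
from collections import defaultdict

def reduce_task(samples, keep_labels, per_class, seed=0):
    """Keep only `keep_labels` and at most `per_class` samples for each of them.

    `samples` is a list of (item, label) pairs; an item can be a file path or an array.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        if label in keep_labels:
            by_label[label].append((item, label))
    reduced = []
    for label in keep_labels:
        pool = by_label[label]
        reduced.extend(rng.sample(pool, min(per_class, len(pool))))
    rng.shuffle(reduced)  # avoid feeding the classes in blocks
    return reduced

# Hypothetical stand-in for a labeled image list: 300 file names, 3 classes.
dataset = [(f"img_{i}.jpg", ("cat", "dog", "bird")[i % 3]) for i in range(300)]
small = reduce_task(dataset, keep_labels=("cat", "dog"), per_class=50)
print(len(small))
```

Fixing the seed keeps the subset identical between runs, so changes in the results come from the hyperparameters and not from a different sample.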
Step 1: Increase your model complexity parameters a little to solve the current task
Now work on tuning the model parameters to solve the reduced task. Usually increasing the number of layers or nodes will lead to better results. Make sure to increase the complexity only slightly, so you don’t overfit.
Step 2: Tune the optimizer
Now it’s time to tune the remaining hyperparameters. Play around with different optimizer algorithms and adjust the batch size and the learning rate.
Step 3: Increase the task complexity
When the model can successfully identify all cats and dogs in your reduced data set, extend it! Add another label class and extend the image set to maybe 300.
Step 4: Go back to step 1 and repeat
Now it’s time to repeat the cycle. Increase the model complexity slightly, tune other parameters like the learning rate, or maybe add a new layer. Then increase the task complexity again and start over, until eventually, your model is learning on the entire training dataset.
This approach won’t guarantee amazing results, but it provides a scaffolding to help you deal with a large number of unknown hyperparameters.
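The cycle of steps 1 to 4 can be sketched as a small loop. This is only an illustration under stated assumptions: `toy_score` is a made-up stand-in for "train the model and return its test accuracy", and the assumption that harder tasks favor larger layers is built into it by hand.

```python
import math

def tune_one(evaluate, config, name, candidates):
    """Vary one hyperparameter, keep the rest fixed, return the best config."""
    best, best_score = config, evaluate(config)
    for value in candidates:
        trial = {**config, name: value}
        score = evaluate(trial)
        if score > best_score:
            best, best_score = trial, score
    return best

def coordinate_descent(evaluate, config, search_space, task_schedule):
    for task in task_schedule:                         # Step 3: grow the task
        for name, candidates in search_space.items():  # Steps 1 and 2: one knob at a time
            config = tune_one(lambda c: evaluate(c, task), config, name, candidates)
    return config

def toy_score(config, n_classes):
    """Made-up stand-in for 'train the model and return test accuracy'."""
    ideal_nodes = 8 * n_classes  # assumption: harder tasks want bigger layers
    lr_penalty = (math.log10(config["learning_rate"]) + 1) ** 2
    size_penalty = (math.log2(config["hidden_nodes"]) - math.log2(ideal_nodes)) ** 2
    return -(lr_penalty + size_penalty)

search_space = {
    "hidden_nodes": [8, 16, 32, 64],    # model complexity
    "learning_rate": [1.0, 0.1, 0.01],  # optimizer
}
start = {"hidden_nodes": 8, "learning_rate": 1.0}
best = coordinate_descent(toy_score, start, search_space, task_schedule=[2, 3, 4])
print(best)
```

In a real project, `evaluate` would launch a training job on the current task and return a validation or test metric. The loop structure stays the same.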
A hands-on example
Let’s apply what we’ve learned to an actual code example. We’ll use MNIST, sort of the “hello world” of machine learning, as a baseline.
Since there are only 10 different labels in MNIST, we won’t start with a reduced task complexity. Instead, we’ll tackle the complete MNIST dataset head-on.
Note: The demo was updated since the original post and the webinar.
tune.sh that was used in the webinar has been replaced with similar
The MNIST demo has 4 hyperparameters: learning rate, batch size, number of nodes in hidden layer 1, number of nodes in hidden layer 2.
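One possible way to expose four such hyperparameters to a training script is through command-line flags, as in the sketch below. The flag names and default values are assumptions chosen for illustration and are not taken from the demo itself.

```python
import argparse

def parse_hparams(argv=None):
    """Parse the four hyperparameters; flag names and defaults are hypothetical."""
    parser = argparse.ArgumentParser(description="MNIST hyperparameters")
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--hidden1", type=int, default=64)  # nodes in hidden layer 1
    parser.add_argument("--hidden2", type=int, default=64)  # nodes in hidden layer 2
    return parser.parse_args(argv)

hparams = parse_hparams(["--learning_rate", "0.01", "--hidden1", "8", "--hidden2", "8"])
print(vars(hparams))
```

Keeping every hyperparameter behind a flag makes it easy to start many jobs that differ only in their arguments.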
Following the Coordinate Descent Approach from above, we’ll start with a relatively simple model and work our way forward from there. But we’ll add one more trick to make our life easier: we’re using Clusterone’s ability to run multiple experiments concurrently, thus speeding up the tuning process.
To get an idea what order of magnitude our hyperparameters should be in, we start 6 experiments, varying the learning rate and the number of nodes in the hidden layers. The batch size is kept constant:
- Learning rate = 1, nodes in each hidden layer: 8
- Learning rate = 0.1, nodes in each hidden layer: 8
- Learning rate = 0.01, nodes in each hidden layer: 8
- Learning rate = 1, nodes in each hidden layer: 64
- Learning rate = 0.1, nodes in each hidden layer: 64
- Learning rate = 0.01, nodes in each hidden layer: 64
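The six combinations above form a small grid, which can be generated instead of being typed out by hand. A minimal sketch, with dictionary keys that are illustrative names only:

```python
from itertools import product

learning_rates = [1, 0.1, 0.01]
hidden_nodes = [8, 64]

# One config per (nodes, learning rate) pair, in the same order as the list above.
experiments = [
    {"learning_rate": lr, "hidden_nodes": nodes}
    for nodes, lr in product(hidden_nodes, learning_rates)
]
for config in experiments:
    print(config)
```

Each dictionary can then be turned into the arguments of one training job.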
Then we use Clusterone’s built-in TensorBoard integration to compare the results. If you run the jobs locally, make sure each job writes its results into its own subfolder of your TensorBoard log directory.
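For local runs, one simple convention is to name each job’s subfolder after its hyperparameters. The helper below is a hypothetical sketch; the base folder name `logs` is an assumption.

```python
import os

def run_log_dir(base_dir, learning_rate, hidden_nodes):
    """One subfolder per job, named after its hyperparameters."""
    return os.path.join(base_dir, f"lr_{learning_rate}_nodes_{hidden_nodes}")

log_dir = run_log_dir("logs", 0.1, 64)
print(log_dir)
```

Pointing TensorBoard at the parent folder, for example with `tensorboard --logdir logs`, then shows each subfolder as a separate run that can be toggled in the graphs.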
The Loss graphs should look something like this:
You can see immediately that there’s a huge difference in performance. The yellow and violet lines at the top are the two jobs with a learning rate of 1. They have huge loss values and don’t improve.
The two unsteady curves below belong to the jobs with learning rate 0.01, while the two lines all the way at the bottom use a learning rate of 0.1. Likewise you can see that one job converged particularly quickly (the blue line). This is the 64 neurons per layer job with a learning rate of 0.1. We’ll be using this job as a starting point to see if we can improve further.
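Why a learning rate of 1 can fail to improve at all is easiest to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2. It is not the MNIST network, and it does not reproduce the unsteady curves seen for 0.01, but it shows the same ordering of the three learning rates.

```python
def final_loss(learning_rate, steps=50, x=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final loss."""
    for _ in range(steps):
        gradient = 2 * x
        x -= learning_rate * gradient
    return x * x

for lr in (1.0, 0.1, 0.01):
    print(lr, final_loss(lr))
```

In this toy setting, a learning rate of 1 jumps from one side of the minimum to the other on every step and never gets closer, 0.1 converges quickly, and 0.01 moves in the right direction but slowly.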
From here, we slowly increase the model complexity and add a few nodes. I created three new jobs and plotted the loss values in TensorBoard (see graph below). Here are the parameters I used:
- Green: Learning rate = 0.1, nodes in each hidden layer: 128
- Yellow: Learning rate = 0.1, nodes in each hidden layer: 86
- Blue: Learning rate = 0.2, nodes in each hidden layer: 64
The purple line in the graph is our best previous job for comparison, with a learning rate of 0.1 and 64 hidden nodes.
As you can see, all three new jobs converge faster than the “old” purple job. But you can also see that it doesn’t seem to matter much if there are 86 or 128 hidden nodes, since both the green and the yellow job behave almost the same.
At the same time, a slightly increased learning rate of 0.2 - the blue graph - seems to work just as well as or even better than increasing the number of nodes.
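“Converges faster” can be turned into a number by asking how many logged steps a job needs to reach a given loss. A minimal sketch follows; the two curves are synthetic and only stand in for loss values that would come from the training logs.

```python
def steps_to_reach(losses, threshold):
    """Index of the first logged loss at or below `threshold`, or None."""
    for step, loss in enumerate(losses):
        if loss <= threshold:
            return step
    return None

# Synthetic curves for illustration only, not measurements from the jobs above.
slow = [0.9 ** step for step in range(100)]
fast = [0.8 ** step for step in range(100)]
print(steps_to_reach(slow, 0.1), steps_to_reach(fast, 0.1))
```

Comparing this number across jobs gives a simple ranking that does not depend on reading colored lines off a graph.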
From here, you can either decide to be satisfied with the results or iterate further. At some point you will notice that your changes only produce tiny improvements. For MNIST, this is already almost the case in the graph above.
Whether you keep going, what you tweak next, and by how much is up to you. Just make sure not to change any parameter too drastically, and compare test accuracy in TensorBoard as often as feasible to see if you’re going in the right direction.
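One way to decide when improvements have become too small is a relative threshold on the loss. The sketch below is a hypothetical rule of thumb; the 1% cutoff is an arbitrary assumption and should be adapted to the task.

```python
def worth_continuing(previous_loss, new_loss, min_relative_gain=0.01):
    """True if the new job improved the loss by at least `min_relative_gain`."""
    if previous_loss <= 0:
        return False  # nothing left to gain
    return (previous_loss - new_loss) / previous_loss >= min_relative_gain

print(worth_continuing(0.10, 0.05))    # halving the loss is a clear gain
print(worth_continuing(0.10, 0.0999))  # a 0.1% gain is below the cutoff
```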
That’s it, happy tuning!
Hopefully, this post has helped you get a better idea of the mysterious dark art of hyperparameter tuning. This post is based on a session of our Coffee with Clusterone webinar. You can find the video of the session on YouTube.
If you’re interested in joining one of our upcoming webinars, check out what’s coming and sign up. We’d love to see you around!