Paper recap: Cyclical Learning Rates for Training Neural Networks

Cyclical Learning Rates for Training Neural Networks

Instead of monotonically decreasing the learning rate, as is traditionally done, the author introduces a new method called Cyclical Learning Rates (CLR), which lets the learning rate rise and fall systematically between two bounds during training.
CLR improves classification accuracy with little or no tuning, often in fewer iterations, and requires essentially no additional computation.
The author also shows a way to estimate the bounds between which the learning rate should vary during training.

Imagine you are trying to draw a picture. There are multiple ways to tackle this:

You jump straight in and draw ~80-90% of the picture. To refine it, however, you have to slow down for the finishing touches, which is very tiring, so you might have to slow down even more until you either finish it or abandon it midway. (~ Learning rate schedule)
You start by drawing the layout, and then you calculate how much effort you should spend on each section. After you finish one, you reevaluate again and continue to draw until the picture is done. (~Adaptive learning rate)
You start simple: draw a tree today, some more trees tomorrow, and a sun and a mountain the following day. The gist is that you start slow and ramp up over time until you reach a particular intensity. Then you slow down again to avoid burnout until you are back at the initial state (one full cycle), and you repeat the cycle until the picture is done. (~ Cyclical Learning Rates)

Training a deep neural network to convergence usually requires experimenting with a variety of learning rates (LR).
There are already multiple solutions to this, including learning rate scheduling (e.g., time-based decay, step decay, exponential decay) and adaptive learning rates (e.g., RMSProp, AdaGrad, AdaDelta, Adam), but they still have drawbacks:
- Learning rate scheduling: the learning rate only decreases monotonically, which was later shown not to help the model escape saddle points.
- Adaptive learning rates: come at a significant computational cost.

Observation: increasing the learning rate might have a short-term negative effect and yet achieve a longer-term beneficial effect, since a higher learning rate allows more rapid traversal of saddle point plateaus.
Pick a minimum (base_lr) and a maximum (max_lr) boundary, and the learning rate will vary cyclically between these bounds.

Window type

For the cyclical function, a triangular (Bartlett) window, a parabolic (Welch) window, and a sinusoidal (Hann) window all produced equivalent results, which led to adopting the triangular window for its simplicity.
Variations:
- triangular: uses the plain triangular window
- triangular_2: same as triangular, but after every cycle the learning rate difference (max_lr - base_lr) is halved
- exp_range: each boundary value declines by an exponential factor of \(gamma^{iteration}\), where iteration is the current iteration

Triangular window

LR_finder

Set stepsize (the number of iterations in half a cycle) to 2 to 10 times the number of iterations per epoch.
It is best to stop training at the end of a cycle, when the LR is at base_lr and the accuracy peaks.
- Consequently, early stopping might not work well with CLR.
Rule of thumb: the optimum learning rate is usually within a factor of two of the largest one that converges, so take that as max_lr and set base_lr to \(\frac{1}{3}\) or \(\frac{1}{4}\) of max_lr.
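The rules of thumb above can be turned into a small helper. This is only a sketch: the function name `clr_settings`, its parameters, and the example numbers (50,000 training samples, batch size 100, max_lr of 0.006) are illustrative assumptions, not values prescribed by this post.

```python
import math

def clr_settings(num_samples, batch_size, max_lr,
                 epochs_per_half_cycle=4, base_lr_fraction=1 / 3):
    """Hypothetical helper: derive stepsize and base_lr from the rules of thumb."""
    if not 2 <= epochs_per_half_cycle <= 10:
        raise ValueError("stepsize should span 2 to 10 epochs")
    # One iteration = one mini-batch update.
    iterations_per_epoch = math.ceil(num_samples / batch_size)
    # stepsize is half a cycle, expressed in iterations.
    step_size = epochs_per_half_cycle * iterations_per_epoch
    # base_lr is a fixed fraction (1/3 or 1/4) of max_lr.
    base_lr = max_lr * base_lr_fraction
    return step_size, base_lr

step_size, base_lr = clr_settings(num_samples=50_000, batch_size=100, max_lr=0.006)
print(step_size, base_lr)
```

With these assumed numbers, one epoch is 500 iterations, so a half-cycle of 4 epochs gives a stepsize of 2000 iterations.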

Result

CLR helps the model converge much faster.
The result for the decay policy (monotonically decreasing LR) provides evidence that both increasing and decreasing the LR are essential to the benefits of CLR.

Result

When CLR is used together with adaptive learning rate methods, its benefits are reduced.

Triangular code

All experiments show similar or better accuracy with CLR than with a fixed learning rate, even though performance drops at some of the learning rate values within the range.

Noice

This paper is straightforward and well explained, so I highly recommend it to new DL practitioners.
CLR is an impressive technique for controlling the learning rate. It is worth trying when starting on a new model or a new dataset, since it gives a solid baseline for further optimization.
CLR is also widely used by Kagglers.

triangular implementation:
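A minimal sketch of the triangular policy in plain Python, following the formula from the paper. The function name `triangular_lr` and the example values are illustrative assumptions, not the original code from this post.

```python
import math

def triangular_lr(iteration, step_size, base_lr, max_lr):
    """Learning rate at `iteration` under the triangular policy."""
    # 1-based index of the current cycle; one cycle = 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, in half-cycles (0 at peak, 1 at the ends).
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Illustrative values: LR climbs for 4 iterations, then falls for 4.
for it in range(0, 9, 2):
    print(it, triangular_lr(it, step_size=4, base_lr=0.001, max_lr=0.006))
```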
triangular_2 implementation:
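A sketch of the triangular_2 variant, assuming the learning rate difference (max_lr - base_lr) is halved after every completed cycle while base_lr stays fixed. The function name `triangular2_lr` and the example values are hypothetical.

```python
import math

def triangular2_lr(iteration, step_size, base_lr, max_lr):
    """Triangular policy whose amplitude is halved after each cycle."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Amplitude scale: 1 in cycle 1, 1/2 in cycle 2, 1/4 in cycle 3, ...
    scale = 1 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

# Peaks of the first three cycles with step_size=4 fall at iterations 4, 12, 20.
for it in (4, 12, 20):
    print(it, triangular2_lr(it, step_size=4, base_lr=0.001, max_lr=0.009))
```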
exp_range implementation:
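A sketch of the exp_range variant. It assumes one common formulation in which the triangular amplitude is multiplied by `gamma ** iteration` while base_lr stays fixed; the function name `exp_range_lr`, the default `gamma`, and the example values are assumptions for illustration.

```python
import math

def exp_range_lr(iteration, step_size, base_lr, max_lr, gamma=0.99994):
    """Triangular policy whose amplitude decays by gamma ** iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Unlike triangular_2, the decay is applied at every iteration, not per cycle.
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * gamma ** iteration

# A deliberately aggressive gamma makes the decay visible in a few iterations.
for it in range(0, 9):
    print(it, exp_range_lr(it, step_size=2, base_lr=0.001, max_lr=0.006, gamma=0.9))
```

In practice gamma is chosen very close to 1 so that the envelope shrinks slowly over many thousands of iterations.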
Plot:
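Explanation of how x is calculated, with \(cycle = \lfloor 1 + \frac{current\_iteration}{2 \cdot step\_size} \rfloor\) being the 1-based index of the current cycle:
- \(raw\_x = \frac{current\_iteration}{step\_size}\): training progress measured in half-cycles (floating point)
- \(x' = raw\_x - 2 \cdot cycle\): a value in \([-2, 0)\) whose magnitude is the number of half-cycles left to complete this cycle
- \(x' = x' + 1\): shifts the value into \([-1, 1)\) so that the function is centered at 0. Taking the absolute value, \(x = |x'|\), then gives a triangle that repeats every cycle, and the learning rate is \(base\_lr + (max\_lr - base\_lr) \cdot \max(0, 1 - x)\).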
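As a worked example of these steps, assume the illustrative values \(step\_size = 4\), \(current\_iteration = 6\), \(base\_lr = 0.001\) and \(max\_lr = 0.006\) (chosen only for this example), with \(cycle = \lfloor 1 + \frac{current\_iteration}{2 \cdot step\_size} \rfloor\) and the triangular policy:

\[
\begin{aligned}
cycle &= \left\lfloor 1 + \tfrac{6}{8} \right\rfloor = 1 \\
raw\_x &= \tfrac{6}{4} = 1.5 \\
x' &= 1.5 - 2 \cdot 1 = -0.5 \\
x' &= -0.5 + 1 = 0.5 \\
x &= |0.5| = 0.5 \\
lr &= 0.001 + (0.006 - 0.001) \cdot \max(0,\ 1 - 0.5) = 0.0035
\end{aligned}
\]

Iteration 6 is halfway down the falling side of the first cycle, so the learning rate sits midway between base_lr and max_lr.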