As you might know, many schedulers decrease the learning rate monotonically. While this can work well in some cases, such methods also have drawbacks:
The model might get stuck in a local minimum or at a saddle point. Because the learning rate only ever decreases, it is hard for the model to break out of this “trap.”
The model’s success depends heavily on the initial choice of the learning rate. If it is set poorly, the model will likely get stuck early, leaving the loss high.
Cyclic Learning Rate is a scheduling technique that varies the learning rate between a minimum and a maximum threshold. The learning rate changes in cycles, from lower to higher values and back. The periodic increases can help the model escape a local minimum or a saddle point, while the upper bound keeps the steps from growing so large that they overshoot a good minimum.
The general algorithm for CyclicLR is the following:
Set the minimum learning rate;
Set the maximum learning rate;
Let the learning rate fluctuate between the two thresholds in cycles.
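The three steps above can be sketched in plain Python. This is a minimal illustration, not the code of any particular library: the helper name `triangular_lr` and the single `step_size` argument (iterations per half cycle) are assumptions made for the example.

```python
def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate for a symmetric triangular cycle (hypothetical helper).

    step_size is the number of iterations in each half of the cycle.
    """
    cycle_len = 2 * step_size
    pos = iteration % cycle_len               # position inside the current cycle
    if pos <= step_size:
        frac = pos / step_size                # rising half: 0 -> 1
    else:
        frac = (cycle_len - pos) / step_size  # falling half: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac


# One full cycle: 4 iterations up, 4 iterations down.
lrs = [triangular_lr(i, base_lr=0.001, max_lr=0.01, step_size=4) for i in range(9)]
```

Here `lrs` starts at the minimum, peaks at the maximum on iteration 4, and returns to the minimum on iteration 8, where the next cycle begins.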
Base LR - the initial learning rate, which is the lower boundary of the cycle.
Max LR - the maximum learning rate, which is the higher boundary of the cycle.
The cycle amplitude is defined as (max_lr - base_lr). The learning rate at any point in the cycle is the sum of base_lr and some scaling of the amplitude. Therefore, max_lr may not actually be reached in some cases, depending on the scaling function.
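One way to write this relationship down is the following. The symbols x(t) (the position within the cycle, from 0 to 1) and s (the value of the scaling function) are notation assumed here for illustration and do not appear in the parameter list.

```latex
\mathrm{lr}(t) = \mathrm{base\_lr} + (\mathrm{max\_lr} - \mathrm{base\_lr}) \cdot x(t) \cdot s,
\qquad 0 \le x(t) \le 1, \quad 0 < s \le 1
```

At the peak of a cycle x(t) = 1, so the learning rate equals base_lr + s (max_lr - base_lr). Whenever s < 1, the peak stays below max_lr, which is why the upper bound may never be reached.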
The step size is the number of training iterations (batches, not epochs) that the learning rate takes to move from one bound to the other.
Step size up - the number of training iterations passed when increasing the learning rate from Base LR to Max LR.
Step size down - the number of training iterations passed when decreasing the learning rate from Max LR to Base LR.
If Step Size Down is set to null, then its value is set to that of Step Size Up.
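A small sketch of how the two step sizes could shape a cycle, including the fallback when Step Size Down is not set. The function name `cycle_position` is hypothetical; it returns the fraction of the amplitude applied at a given iteration, before any scaling.

```python
def cycle_position(iteration, step_size_up, step_size_down=None):
    """Fraction of the amplitude (0.0 to 1.0) at a given iteration (hypothetical helper)."""
    if step_size_down is None:
        # Mirrors the rule above: a missing Step Size Down defaults to Step Size Up.
        step_size_down = step_size_up
    total = step_size_up + step_size_down
    pos = iteration % total
    if pos <= step_size_up:
        return pos / step_size_up                        # rising part
    return 1.0 - (pos - step_size_up) / step_size_down   # falling part


# Asymmetric cycle: a quick 2-iteration warm-up, then a slow 6-iteration descent.
asymmetric = [cycle_position(i, step_size_up=2, step_size_down=6) for i in range(9)]
```

With these settings the peak is reached after 2 iterations and the cycle ends after 8, so most of each cycle is spent lowering the learning rate.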
Mode - the policy by which the learning rate varies between the two boundaries:
Triangular - in this method, we start training at the base learning rate and then increase it until the maximum learning rate is reached. After that, we decrease the learning rate back to the base value. Increasing and decreasing the learning rate from min to max and back take half a cycle each.
Triangular2 - the same as Triangular, except that the cycle amplitude (max_lr - base_lr) is cut in half after every cycle, so the peak learning rate moves closer to base_lr over time. Thus, you can still escape local minima and saddle points while the learning rate decreases overall.
Exp_range - like Triangular2, this method shrinks the cycle amplitude over time, but more gradually: the amplitude is scaled by gamma^(cycle iterations) at every iteration, which gives an exponential decay.
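Gamma - the constant in the ‘exp_range’ scaling function, gamma^(cycle iterations): a multiplicative factor by which the cycle amplitude is decayed at every iteration. For instance, if the amplitude is 1000 and gamma is 0.5, the amplitude becomes 1000 x 0.5 = 500 after one iteration and 250 after two.
The gamma value should be less than 1 for the learning rate to decrease. Because the factor compounds at every iteration, values only slightly below 1 are usually the practical choice.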
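The three modes differ only in how they scale the amplitude. The sketch below assumes a 1-based cycle counter and a hypothetical helper named `amplitude_scale`; it follows the descriptions above rather than reproducing any library's source.

```python
def amplitude_scale(mode, cycle, iteration, gamma=1.0):
    """Factor applied to (max_lr - base_lr) (hypothetical helper).

    cycle is 1-based; iteration counts training iterations.
    """
    if mode == "triangular":
        return 1.0                          # amplitude never shrinks
    if mode == "triangular2":
        return 1.0 / (2.0 ** (cycle - 1))   # halved after every cycle
    if mode == "exp_range":
        return gamma ** iteration           # exponential decay per iteration
    raise ValueError(f"unknown mode: {mode}")


base_lr, max_lr = 0.001, 0.009
amplitude = max_lr - base_lr

# Peak learning rate of the first three cycles in triangular2 mode.
peaks = [base_lr + amplitude * amplitude_scale("triangular2", c, 0) for c in (1, 2, 3)]
```

The peaks come out at roughly 0.009, 0.005, and 0.003: the amplitude halves each cycle while the lower bound stays at base_lr.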
Scale mode - defines whether the scaling function is evaluated on the cycle number or on cycle iterations (training iterations since the start of the cycle).
Base momentum - lower momentum boundaries in the cycle for each parameter group.
Note that momentum is cycled inversely to the learning rate. At the cycle’s peak, momentum is ‘base_momentum,’ and the learning rate is ‘max_lr.’
Max momentum - upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference between max_momentum and some scaling of the amplitude; therefore, base_momentum may not actually be reached depending on the scaling function.
Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’, and learning rate is ‘base_lr.’
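The inverse relationship between the two quantities can be illustrated with a few lines of Python. The helper `lr_and_momentum` and its `frac` argument (the scaled position in the cycle, from 0 to 1) are assumptions made for this sketch.

```python
def lr_and_momentum(frac, base_lr, max_lr, base_momentum, max_momentum):
    """Learning rate and momentum for a cycle position frac in [0, 1] (hypothetical helper)."""
    # Learning rate rises from base_lr as frac grows...
    lr = base_lr + (max_lr - base_lr) * frac
    # ...while momentum falls from max_momentum by the same fraction of its amplitude.
    momentum = max_momentum - (max_momentum - base_momentum) * frac
    return lr, momentum


start = lr_and_momentum(0.0, base_lr=0.001, max_lr=0.01, base_momentum=0.8, max_momentum=0.9)
peak = lr_and_momentum(1.0, base_lr=0.001, max_lr=0.01, base_momentum=0.8, max_momentum=0.9)
```

At the start of the cycle the pair is (base_lr, max_momentum), and at the peak it is approximately (max_lr, base_momentum), matching the two notes above.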
Hello, thank you for using the code provided by Hasty. Please note that some code blocks might not be 100% complete and ready to be run as is. This is done intentionally as we focus on implementing only the most challenging parts that might be tough to pick up from scratch. View our code block as a LEGO block - you can’t use it as a standalone solution, but you can take it and add to your system to complement it. If you have questions about using the tool, please get in touch with us to get direct help from the Hasty team.