The learning rate is one of the most important hyperparameters in training deep neural networks. CyclicLR reduces the need to hand-tune a single fixed learning rate: instead, the learning rate cycles between two set boundaries at a fixed frequency.
Base LR is the initial learning rate and the lower boundary of the cycle. The learning rate will not drop below Base LR.
Max LR is the upper boundary of the cycle. The learning rate will not rise above Max LR.
Step Size Up is the number of iterations it takes to increase the learning rate from Base LR to Max LR.
Step Size Down is the number of iterations it takes to decrease the learning rate from Max LR back to Base LR.
If Step Size Down is not set (None in PyTorch), its value defaults to that of Step Size Up.
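The defaulting rule for the two step sizes can be sketched in a few lines. This is a minimal illustration, not PyTorch's code; the helper name `cycle_length` is hypothetical and exists only for this example.

```python
def cycle_length(step_size_up, step_size_down=None):
    """Return the number of iterations in one full cycle (hypothetical helper)."""
    # An unset down step copies the up step, giving a symmetric cycle.
    if step_size_down is None:
        step_size_down = step_size_up
    return step_size_up + step_size_down


print(cycle_length(2000))        # symmetric cycle: up and down take equally long
print(cycle_length(2000, 6000))  # asymmetric cycle: the descent is three times slower
```

A longer Step Size Down keeps the learning rate high for fewer iterations relative to the whole cycle, which is one reason to set the two values independently.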
There are different ways in which the learning rate can be varied between the two boundaries. The mode selects which one is used. The three modes available are Triangular, Triangular2, and Exp Range.
Let us explain the Triangular mode with a figure:
The figure shows the maximum and minimum bounds of the learning rate and the step size, that is, the number of iterations the learning rate takes to go from Base LR to Max LR (here, step size up = step size down). The learning rate follows a triangular wave between the two bounds, which is what gives Triangular mode its name.
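The triangular wave can also be written out directly. The following is a minimal pure-Python sketch of the idea, not the library's implementation; the function name `triangular_lr` and the example values (Base LR 0.001, Max LR 0.01, step size 4) are assumptions chosen for illustration.

```python
def triangular_lr(iteration, base_lr, max_lr, step_size_up, step_size_down=None):
    """Learning rate at a given iteration for a plain triangular cycle (sketch)."""
    if step_size_down is None:
        step_size_down = step_size_up
    cycle_len = step_size_up + step_size_down
    pos = iteration % cycle_len  # position inside the current cycle
    if pos <= step_size_up:
        frac = pos / step_size_up  # rising edge: 0 -> 1
    else:
        frac = 1 - (pos - step_size_up) / step_size_down  # falling edge: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac


# One full cycle plus the first step of the next one.
for it in range(10):
    print(it, triangular_lr(it, base_lr=0.001, max_lr=0.01, step_size_up=4))
```

The learning rate starts at Base LR, peaks at Max LR after `step_size_up` iterations, and is back at Base LR when the cycle ends.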
Triangular2 is a basic triangular cycle like Triangular, except that the amplitude is halved after each cycle. (A cycle is the number of iterations it takes for the learning rate to rise from Base LR to its peak and return to Base LR.)
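The halving can be made concrete with a small sketch. It assumes a symmetric cycle (step size up equal to step size down); the function name `triangular2_lr` and the values used are illustrative only and not part of any library.

```python
def triangular2_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cycle whose amplitude is halved after every cycle (sketch)."""
    cycle_len = 2 * step_size
    cycle = iteration // cycle_len       # 0 for the first cycle, 1 for the second, ...
    pos = iteration % cycle_len          # position inside the current cycle
    frac = 1 - abs(pos / step_size - 1)  # triangular wave: 0 -> 1 -> 0
    scale = 0.5 ** cycle                 # amplitude: 1, 1/2, 1/4, ...
    return base_lr + (max_lr - base_lr) * frac * scale


# Peaks of the first three cycles with a step size of 4 iterations.
for peak_iteration in (4, 12, 20):
    print(peak_iteration, triangular2_lr(peak_iteration, 0.0, 1.0, step_size=4))
```

Each peak is half as far above Base LR as the previous one, so the schedule gradually settles toward Base LR.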
Exp Range is a triangular cycle whose amplitude is scaled according to gamma and the number of iterations completed so far. In PyTorch's CyclicLR, the initial amplitude is scaled at each iteration by $$\gamma^{\text{iterations}}$$
Gamma is the constant used in exp_range mode to scale the amplitude.
Setting gamma above 1 makes the amplitude grow and can cause the learning rate to explode to a very high value. Setting it to 1 makes Exp Range behave like Triangular mode. A value slightly below 1 (e.g. 0.99994) is therefore usually the more effective choice.
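To get a feel for how sensitive the schedule is to gamma, the scaling factor can be computed on its own. This sketch assumes the factor is gamma raised to the number of iterations completed, which is how PyTorch's exp_range mode applies it; the helper name `exp_range_scale` and the 10,000-iteration horizon are arbitrary choices for illustration.

```python
def exp_range_scale(iteration, gamma):
    """Factor applied to the initial amplitude after `iteration` steps (sketch)."""
    return gamma ** iteration


# Compare a decaying, a neutral, and an exploding gamma after 10,000 iterations.
for gamma in (0.99994, 1.0, 1.001):
    print(gamma, exp_range_scale(10_000, gamma))
```

With gamma = 0.99994 the amplitude is still a little over half its initial size after 10,000 iterations, gamma = 1.0 leaves it unchanged, and even gamma = 1.001 inflates it by several orders of magnitude.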
Scale Mode defines whether the amplitude scaling applied by the chosen mode is evaluated once per cycle or at every iteration.
Base Momentum is the lower momentum boundary when cyclic momentum is used. Momentum is cycled inversely to the learning rate, so momentum sits at Base Momentum when the learning rate is at Max LR.
Max Momentum is the upper momentum boundary used during training. Because momentum varies inversely with the learning rate, Max Momentum is applied when the learning rate is at Base LR.
The default values of Base Momentum and Max Momentum are 0.8 and 0.9, respectively.
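The inverse relationship between momentum and learning rate can be sketched as a mirror image of the learning-rate ramp. This is an illustration under the assumption of a linear ramp; the function name `cyclic_momentum` and the `frac` argument (0 at Base LR, 1 at Max LR) are hypothetical and introduced only for this example.

```python
def cyclic_momentum(frac, base_momentum=0.8, max_momentum=0.9):
    """Momentum for a given position in the learning-rate ramp (sketch).

    frac is 0.0 when the learning rate is at Base LR and 1.0 at Max LR.
    """
    # Momentum moves in the opposite direction to the learning rate.
    return max_momentum - (max_momentum - base_momentum) * frac


for frac in (0.0, 0.5, 1.0):
    print(frac, cyclic_momentum(frac))
```

At Base LR the momentum is at its maximum, and it falls to Base Momentum as the learning rate climbs to Max LR.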
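import torch
from torch import nn

# Placeholder model, loss, and data so the example runs end to end (illustrative only)
model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
dataset = [(torch.randn(8, 2), torch.randn(8, 1)) for _ in range(10)]

# Example boundary values; choose them for your own task
base_lr, max_lr = 0.001, 0.01
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01, amsgrad=False)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=base_lr, max_lr=max_lr,
    step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0,
    cycle_momentum=False)  # momentum cycling is switched off here

for epoch in range(20):
    for inputs, target in dataset:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        scheduler.step()  # CyclicLR is stepped after every batch, not once per epoch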