Adam can be understood as updating the weights inversely proportionally to a scaled L2 norm of the past gradients (based on their squared values). AdaMax extends this to the so-called infinity norm (max) of the past gradients.
The calculation of the infinity norm exhibits stable behavior. Mathematically, the infinity norm can be viewed as

$$u_t = \beta_2^{\infty} u_{t-1} + (1 - \beta_2^{\infty})\,|g_t|^{\infty} = \max(\beta_2 \cdot u_{t-1},\ |g_t|)$$

We see that the calculation of the infinity norm is mathematically equivalent to taking the exponentially decayed maximum of the gradient magnitudes observed up to step t. The update is then similar to Adam's:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\,\hat{m}_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

Again, here $\eta$ is the base learning rate and $\hat{m}_t$ is the bias-corrected momentum (first moment) term, similar to what was discussed for Adam.
In AdaMax, the exponential moving average of the gradients and the infinity norm are calculated as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t$$
$$u_t = \max(\beta_2 \cdot u_{t-1},\ |g_t|)$$

Here $\beta_1$ and $\beta_2$ are the betas: the exponential decay rates of the first moment and of the exponentially weighted infinity norm, respectively.
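To make the formulas above concrete, here is a minimal, from-scratch sketch of a single AdaMax step. The function name adamax_step, the toy tensors, the small epsilon safeguard, and the default hyperparameters are illustrative assumptions, not the internals of any particular library.

# A minimal, from-scratch sketch of one AdaMax update step (illustrative only).
import torch

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving average of the gradients (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Exponentially weighted infinity norm: a decayed max of the gradient magnitudes.
    u = torch.maximum(beta2 * u, grad.abs())
    # Bias-correct the first moment and update the parameters.
    m_hat = m / (1 - beta1 ** t)
    theta = theta - lr * m_hat / (u + eps)  # eps only guards against division by zero
    return theta, m, u

# Usage on a toy parameter vector with stand-in gradients.
theta = torch.zeros(3)
m, u = torch.zeros(3), torch.zeros(3)
for t in range(1, 4):
    grad = torch.randn(3)  # stand-in for a real gradient
    theta, m, u = adamax_step(theta, grad, m, u, t)

In practice you would not implement this by hand: PyTorch ships the optimizer as torch.optim.Adamax, which is used in the example below.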
# Importing the libraries
import torch
import torch.nn as nn
# Dummy inputs and targets.
x = torch.randn(10, 3)
y = torch.randn(10, 2)
# Build a fully connected layer.
linear = nn.Linear(3, 2)
# Build the MSE loss function and the optimizer.
criterion = nn.MSELoss()
# Optimization method using Adamax
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
# Forward pass.
pred = linear(x)
# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())
# Backward pass: compute gradients before updating the weights.
optimizer.zero_grad()
loss.backward()
# Update the parameters with one Adamax step.
optimizer.step()
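The snippet above performs a single update. In practice, the same optimizer is driven inside a training loop; below is a hedged sketch that reuses the linear, criterion, and optimizer objects defined above (the iteration count is an arbitrary choice for illustration).

# Repeating the forward/backward/update cycle for a few iterations.
for step in range(100):          # arbitrary number of iterations
    optimizer.zero_grad()        # clear gradients from the previous step
    pred = linear(x)             # forward pass
    loss = criterion(pred, y)    # compute the MSE loss
    loss.backward()              # backward pass: compute gradients
    optimizer.step()             # Adamax parameter update
print('final loss:', criterion(linear(x), y).item())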