AdaMax

Adam can be understood as scaling each weight update inversely proportional to the (scaled) L2 norm of past gradients. AdaMax generalizes this to the infinity norm (the maximum) of past gradients.

The infinity norm is numerically stable to compute. Mathematically, it can be written as:

$$$u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty=\max(\beta_2 \cdot v_{t-1},|g_t|)$$$
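The limit in this expression can be checked numerically. The following is a minimal sketch, not part of the original material: the function names and the sample values (`prev`, `grad`, `beta2 = 0.9`) are hypothetical and chosen only for illustration. It evaluates the finite-`p` version of the update, rescaled by the `1/p` root, and compares it with the simple maximum.

```python
def p_norm_update(prev, grad, beta2, p):
    """Finite-p version of the update, brought back to gradient scale by the 1/p root."""
    decayed = (beta2 * prev) ** p
    current = (1.0 - beta2 ** p) * abs(grad) ** p
    return (decayed + current) ** (1.0 / p)


def max_update(prev, grad, beta2):
    """Infinity-norm update: the larger of the decayed previous value and |grad|."""
    return max(beta2 * prev, abs(grad))


# Hypothetical values, chosen only to illustrate the convergence.
prev, grad, beta2 = 0.8, 0.5, 0.9

for p in (2, 8, 32, 128, 512):
    print(f"p={p:<4d} p-norm update = {p_norm_update(prev, grad, beta2, p):.6f}")
print(f"max update        = {max_update(prev, grad, beta2):.6f}")
```

As `p` grows, the printed values should approach the result of `max_update`, which is why AdaMax can replace the power and root with a single `max`.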

In other words, the infinity-norm update is equivalent to taking the maximum of the decayed previous value and the magnitude of the current gradient. The parameter update is then similar to Adam's:

$$$\theta_{t+1}=\theta_t- \eta \cdot \frac{m_t}{u_t}$$$

Here, $$\eta$$ is the base learning rate and $$m_t$$ is the momentum term (the exponential moving average of the gradients), as discussed for Adam.
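The update rule above can be sketched in plain Python for a single scalar parameter. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `adamax_step`, the toy objective `f(theta) = theta**2`, and the small `eps` added to the denominator are all assumptions introduced here. The Adam paper's version of AdaMax also divides the learning rate by `1 - beta1**t` as a bias correction; that term is left out to match the formula above.

```python
def adamax_step(theta, grad, m, u, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaMax update to a scalar parameter and return (theta, m, u)."""
    m = beta1 * m + (1.0 - beta1) * grad   # exponential moving average of gradients
    u = max(beta2 * u, abs(grad))          # exponentially weighted infinity norm
    theta = theta - lr * m / (u + eps)     # eps (an assumption) avoids division by zero
    return theta, m, u


# Toy problem (hypothetical): minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, u = 5.0, 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * theta
    theta, m, u = adamax_step(theta, grad, m, u, lr=0.05)

print("theta after 2000 steps:", theta)
```

Because `u` only shrinks by a factor of `beta2` per step, the ratio `m / u` stays bounded, so the size of each step is controlled mainly by the learning rate.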

Major Parameters

Betas

AdaMax computes an exponential moving average of the gradients and an exponentially weighted infinity norm. They are given by the formulas:

$$$V_{dw}=\beta_1 \cdot V_{dw}+(1-\beta_1)\cdot \partial w$$$

$$$u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty$$$

Here $$\beta_1$$ and $$\beta_2$$ are the betas. They are the exponential decay rates of the first moment and of the exponentially weighted infinity norm, respectively.
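To get a feel for what the two betas control, the sketch below tracks both quantities over a short gradient sequence containing one spike. The helper name `track_moments` and the gradient values are hypothetical, chosen only for illustration. It uses the `max` form of the infinity norm.

```python
def track_moments(grads, beta1=0.9, beta2=0.999):
    """Return the (first moment, infinity norm) pair after each gradient."""
    m, u = 0.0, 0.0
    history = []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g  # decay rate beta1: first moment
        u = max(beta2 * u, abs(g))         # decay rate beta2: infinity norm
        history.append((m, u))
    return history


# Hypothetical gradient sequence with a single large spike at the third step.
grads = [0.1, 0.1, 5.0, 0.1, 0.1, 0.1]

for beta2 in (0.999, 0.5):
    print(f"beta2 = {beta2}")
    for step, (m, u) in enumerate(track_moments(grads, beta2=beta2), start=1):
        print(f"  t={step}  m={m:.4f}  u={u:.4f}")
```

With a `beta2` close to 1, the infinity norm remembers the spike for many steps and keeps later updates small. A smaller `beta2` forgets it quickly.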

Code Implementation

  

```python
# Import the libraries.
import torch
import torch.nn as nn

# Random inputs and targets for a toy regression problem.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function.
criterion = nn.MSELoss()

# Build the Adamax optimizer.
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear stale gradients, then compute new ones.
optimizer.zero_grad()
loss.backward()

# Update the weights with one Adamax step.
optimizer.step()
```


