

Adam scales each weight update inversely proportional to a scaled L2 norm of the past and current gradients. AdaMax extends this to the infinity norm (the maximum) of those gradients.

Unlike norms of large finite order, the infinity norm is numerically stable to compute. Mathematically, it can be written as:

$$ u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty=\max(\beta_2 \cdot v_{t-1},|g_t|) $$
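The equality on the right-hand side can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper names `pnorm_ema` and `max_recursion`, the gradient values, and the choice of `p=100` as a stand-in for "very large p" are all assumptions made for illustration. It also assumes that the previous value inside the max is the running norm carried over from the prior step.

```python
def pnorm_ema(grads, beta2, p):
    """Exponentially weighted p-norm of the gradients, returned as v_t ** (1/p)."""
    v = 0.0
    out = []
    for g in grads:
        # Same form as the formula above, with a finite p instead of infinity.
        v = beta2 ** p * v + (1 - beta2 ** p) * abs(g) ** p
        out.append(v ** (1.0 / p))
    return out


def max_recursion(grads, beta2):
    """Running infinity norm: the larger of the decayed previous value and |g_t|."""
    u = 0.0
    out = []
    for g in grads:
        u = max(beta2 * u, abs(g))
        out.append(u)
    return out


# Hypothetical gradient sequence, chosen only for illustration.
grads = [0.5, -2.0, 1.0, 0.3]
approx = pnorm_ema(grads, beta2=0.9, p=100)
exact = max_recursion(grads, beta2=0.9)

for a, e in zip(approx, exact):
    print(f"p=100 norm: {a:.6f}   max recursion: {e:.6f}")
```

For a large but finite p the two columns should agree closely, which is why the max form can replace the p-norm in practice.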

In other words, computing the infinity norm is equivalent to taking the maximum of the decayed previous value and the magnitude of the current gradient. The update is then similar to Adam:

$$ \theta_{t+1}=\theta_t- \eta \cdot \frac{m_t}{u_t} $$

Again, here €€\eta€€ is the [base learning rate](/content-hub/mp-wiki/solvers-optimizers/base-learning-rate) and €€m_t€€ is the momentum term, as discussed for Adam.
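To make the update rule concrete, here is a minimal sketch of a single AdaMax step for one scalar parameter, written in plain Python. The function name `adamax_step`, the toy objective, and the learning rate of 0.05 are hypothetical choices for illustration. The small `eps` in the denominator is an added assumption to avoid division by zero; it does not appear in the formula above. Practical implementations may also apply a bias correction to the momentum term, which this sketch leaves out.

```python
def adamax_step(theta, grad, m, u, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update for a scalar parameter (illustrative sketch)."""
    # Momentum: exponential moving average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm: larger of the decayed previous value and |grad|.
    u = max(beta2 * u, abs(grad))
    # Parameter update; eps is an assumption, added only for numerical safety.
    theta = theta - lr * m / (u + eps)
    return theta, m, u


# Toy problem (assumed for illustration): minimize f(theta) = (theta - 3)^2.
theta, m, u = 0.0, 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (theta - 3.0)
    theta, m, u = adamax_step(theta, grad, m, u, lr=0.05)

print("theta after 500 steps:", theta)
```

Because the step is divided by the running maximum of gradient magnitudes, the size of each update is bounded by roughly the learning rate, regardless of how large the raw gradient is.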

Major Parameters


AdaMax maintains two running quantities: the exponential moving average of the gradients and the infinity norm. They are given by:

$$ \begin{aligned} V_{dw} &= \beta_1 \cdot V_{dw}+(1-\beta_1)\cdot \partial w \\ u_t &= \beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty \end{aligned} $$

Here €€\beta_1€€ and €€\beta_2€€ are the betas: €€\beta_1€€ is the exponential decay rate of the first moment, and €€\beta_2€€ is the decay rate of the exponentially weighted infinity norm.
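One way to build intuition for the second beta is to look at how long the infinity norm "remembers" a single large gradient. The sketch below is an illustration under stated assumptions: the helper `steps_to_half` is hypothetical, and it assumes a gradient spike of magnitude 1 followed by zero gradients, so that the norm shrinks only through the decay factor.

```python
def steps_to_half(beta2):
    """Count steps until the infinity norm falls to half its value after a spike."""
    u = 1.0  # norm right after a gradient spike of magnitude 1
    steps = 0
    while u > 0.5:
        # With zero gradients, max(beta2 * u, 0) is just the decayed value.
        u = max(beta2 * u, 0.0)
        steps += 1
    return steps


for beta2 in (0.9, 0.99, 0.999):
    print(f"beta2={beta2}: {steps_to_half(beta2)} steps to halve the norm")
```

A value of the second beta close to 1 keeps the denominator large for many steps after a spike, which damps subsequent updates for longer; a smaller value lets the optimizer forget the spike sooner.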

Code Implementation

# importing the library
import torch
import torch.nn as nn

x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build MSE loss function and optimizer.
criterion = nn.MSELoss()

# Optimization method using Adamax
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

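# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass and a single Adamax update step.
optimizer.zero_grad()
loss.backward()
optimizer.step()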

Last updated on Jun 01, 2022
