Adam scales each weight update inversely proportional to a (scaled) L2 norm of the past gradients. AdaMax replaces the L2 norm with the infinity norm, that is, the maximum, of the past gradients.

Unlike norms of large finite order, the infinity norm is numerically stable to compute. Mathematically, it can be written as

$$ u_t=\beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty=\max(\beta_2 \cdot v_{t-1},|g_t|) $$
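The limit in the equation above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: it compares the running-maximum recursion with an order-p version of Adam's second-moment accumulator and takes the p-th root at the end. The function names, the gradient values, and the choice of €€\beta_2 = 0.9€€ are illustrative assumptions, and the running value is kept in a single variable for brevity.

```python
def max_recursion(grads, beta2):
    """Running maximum: u_t = max(beta2 * u_{t-1}, |g_t|), starting from 0."""
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u


def p_norm_recursion(grads, beta2, p):
    """Order-p accumulator v_t = beta2^p * v_{t-1} + (1 - beta2^p) * |g_t|^p,
    returned as v_t ** (1/p)."""
    decay = beta2 ** p
    v = 0.0
    for g in grads:
        v = decay * v + (1.0 - decay) * abs(g) ** p
    return v ** (1.0 / p)


# Illustrative values only; they do not come from the text.
grads = [0.5, -1.5, 0.3, 2.0, -0.1]
beta2 = 0.9

print("max recursion:", max_recursion(grads, beta2))
for p in (2, 8, 32, 256):
    print(f"p = {p:>3}:", p_norm_recursion(grads, beta2, p))
```

As p grows, the order-p value approaches the running maximum. Raising gradients to a large power can overflow or underflow for less benign inputs, which is one reason to compute the maximum directly.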

The infinity norm therefore reduces to a running maximum: €€u_t€€ is the larger of the decayed previous value and the magnitude of the current gradient. The update rule is then similar to Adam's:

$$ \theta_{t+1}=\theta_t- \eta \cdot \frac{m_t}{u_t} $$

Again, €€\eta€€ is the [base learning rate](/content-hub/mp-wiki/solvers-optimizers/base-learning-rate) and €€m_t€€ is the momentum term, as discussed for Adam.

AdaMax therefore maintains two quantities: an exponential moving average of the gradients and the exponentially weighted infinity norm. They are given by

$$
\begin{aligned}
V_{dw} &= \beta_1 \cdot V_{dw}+(1-\beta_1)\cdot \partial w \\
u_t &= \beta_2^\infty v_{t-1} + (1-\beta_2^\infty)|g_t|^\infty
\end{aligned}
$$

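Here €€\beta_1€€ and €€\beta_2€€ are the exponential decay rates of the first-moment estimate and of the exponentially weighted infinity norm, respectively. €€V_{dw}€€ is the first-moment estimate (the €€m_t€€ of the update rule) and €€\partial w€€ is the gradient €€g_t€€.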
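To make the two recursions and the update rule concrete, here is a minimal from-scratch sketch in plain Python. It is not PyTorch's implementation. The function `adamax_minimize`, the toy quadratic objective, and the hyperparameters in the usage lines are hypothetical. Two details go beyond the formulas above and are assumptions of this sketch: the bias correction of €€m_t€€ by €€1-\beta_1^t€€, as in the original AdaMax formulation, and a small `eps` added to the denominator to avoid division by zero.

```python
def adamax_minimize(grad_fn, theta0, lr=0.002, beta1=0.9, beta2=0.999,
                    eps=1e-8, steps=1000):
    """Run AdaMax for a fixed number of steps, given a gradient callable."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimate (V_dw, i.e. m_t)
    u = [0.0] * len(theta)  # exponentially weighted infinity norm (u_t)

    for t in range(1, steps + 1):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            # Exponential moving average of the gradient.
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            # Infinity norm as a running maximum.
            u[i] = max(beta2 * u[i], abs(g))
            # Assumption: bias-correct m_t; eps guards against u_t == 0.
            m_hat = m[i] / (1.0 - beta1 ** t)
            theta[i] -= lr * m_hat / (u[i] + eps)
    return theta


# Hypothetical toy objective: f(theta) = sum((theta_i - target_i) ** 2).
targets = [3.0, -1.0]


def toy_grad(theta):
    return [2.0 * (th - tg) for th, tg in zip(theta, targets)]


solution = adamax_minimize(toy_grad, theta0=[0.0, 0.0], lr=0.1, steps=500)
print(solution)
```

Because €€u_t€€ is a maximum of gradient magnitudes, the ratio of €€m_t€€ to €€u_t€€ stays roughly within [-1, 1], so each parameter moves by at most about the learning rate per step.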

```
# Import the library.
import torch
import torch.nn as nn

# Random inputs and targets: 10 samples, 3 features, 2 outputs.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function and the Adamax optimizer.
criterion = nn.MSELoss()
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute the loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear stale gradients, then compute new ones.
optimizer.zero_grad()
loss.backward()

# Update the parameters with one Adamax step.
optimizer.step()
```
