Adam can be understood as updating weights inversely proportional to the scaled L2 norm of the past (squared) gradients. AdaMax extends this to the infinity norm (the max norm) of past gradients.
The calculation of the infinity norm exhibits stable behavior, so no bias correction is needed for it. Mathematically, generalizing Adam's second-moment update to the ℓp norm and letting p tend to infinity, the infinity norm can be viewed as:

u_t = lim_{p→∞} ( β₂^p · u_{t−1}^p + (1 − β₂^p) · |g_t|^p )^{1/p} = max(β₂ · u_{t−1}, |g_t|)
We see that the infinity norm is mathematically equivalent to the maximum of the (exponentially decayed) gradient magnitudes computed up to step t. The update is then similar to Adam's:

θ_{t+1} = θ_t − (η / u_t) · m̂_t
Again, here η is the base learning rate and m̂_t is the bias-corrected momentum (first moment), similar to what was discussed for Adam.
AdaMax therefore tracks two quantities: the exponential moving average of the gradients and the infinity norm. Mathematically, they are given by the formulas:

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t

u_t = max(β₂ · u_{t−1}, |g_t|)
Here β₁ and β₂ are the betas: the exponential decay rates of the first moment and of the exponentially weighted infinity norm, respectively.
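The updates above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function name `adamax_step`, the toy objective, and the default hyperparameters (the common choices lr=0.002, β₁=0.9, β₂=0.999) are assumptions, not from the text.

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update: first-moment EMA plus an infinity-norm accumulator."""
    m = beta1 * m + (1 - beta1) * grad       # exponential moving average of gradients
    u = np.maximum(beta2 * u, np.abs(grad))  # exponentially weighted infinity norm
    m_hat = m / (1 - beta1 ** t)             # bias-correct the first moment only
    theta = theta - lr * m_hat / (u + eps)   # update: step inversely scaled by u_t
    return theta, m, u

# Toy example: minimize f(x) = x^2 starting from x = 5
theta = np.array([5.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta                         # gradient of x^2
    theta, m, u = adamax_step(theta, grad, m, u, t)
```

Note that, unlike Adam's second moment, u_t needs no bias correction: the max operation keeps it from being biased toward zero at early steps.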