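AdamW is very similar to Adam. It differs only in how weight decay is implemented. Adam's implementation was carried over from plain SGD, where the decay is added to the gradient as an L2 penalty. With Adam's adaptive per-parameter scaling this is no longer equivalent to true weight decay, so it is not mathematically correct. AdamW fixes this mistake by applying the decay directly to the weights, decoupled from the gradient-based update.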
The authors of the original AdamW paper claimed that their modification solves the generalization issues of the Adam solver. Empirically, however, the right hyperparameter settings seem to have a bigger impact than the choice between Adam and AdamW, although AdamW does tend to generalize a bit better.
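
To make the weight decay difference concrete, here is a minimal sketch of a single update step for one scalar weight. The function name `adam_step` and the `decoupled` flag are made up for illustration and are not any library's API; the default hyperparameters are just the commonly used ones.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam-style update for a single scalar weight (illustrative only).

    decoupled=False: decay is added to the gradient (Adam with L2 penalty).
    decoupled=True:  decay is applied directly to the weight (AdamW style).
    Returns the updated (w, m, v).
    """
    if not decoupled:
        # L2 penalty: the decay term becomes part of the gradient, so it is
        # later divided by sqrt(v_hat) like everything else.
        grad = grad + weight_decay * w

    # Standard Adam moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled decay: shrink the weight directly, outside the
        # adaptive scaling.
        w_new = w_new - lr * weight_decay * w

    return w_new, m, v


# Same weight, zero loss gradient, only the decay acts on the weight.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
w_dec, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1,
                        decoupled=True)
print(w_l2, w_dec)
```

In this toy setup the L2 variant moves the weight by roughly the full learning rate (to about 0.9), because the decay term is normalized by the adaptive denominator, while the decoupled variant shrinks it by `lr * weight_decay * w` (to about 0.99). This is only a single-step, single-weight illustration and says nothing about generalization on real models.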