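We will use the Adam optimizer to briefly explain the epsilon coefficient. In Adam, the first and second moment estimates are calculated as:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$g_t$ is the derivative of the loss function with respect to a parameter at step $t$.
$m_t$ is the exponentially decaying average of the gradients (the momentum term) and $v_t$ is the exponentially decaying average of the squared gradients. $\beta_1$ and $\beta_2$ are the corresponding decay rates.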
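The parameter updates are then done as follows:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
Here $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates, $\theta_t$ is the parameter, and $\eta$ is the learning rate.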
The epsilon in the aforementioned update is the epsilon coefficient.
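Note that when the bias-corrected second moment estimate $\hat{v}_t$ gets close to zero, the denominator $\sqrt{\hat{v}_t}$ also approaches zero. Without epsilon, the update would divide by zero or by a vanishingly small number, so it would be undefined or arbitrarily large. To rectify this, a small epsilon is added to the denominator to keep the computation numerically stable.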
The standard value of the epsilon is 1e-08.
Epsilon in code example
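The following is a minimal sketch of a single Adam step for one scalar parameter, written in plain Python to show where epsilon enters the update. The function name `adam_step` and its argument names are assumptions made for illustration and are not taken from any particular library. The default values for `lr`, `beta1` and `beta2` are the commonly used ones.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-08):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    # Decaying averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction; t is the 1-based step count.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Epsilon keeps the denominator away from zero.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient on the first step, v_hat is exactly zero.
# Epsilon makes the update well defined, and the parameter is left unchanged.
safe_param, _, _ = adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)

# Without epsilon, the same step divides zero by zero.
try:
    adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1, eps=0.0)
    failed_without_eps = False
except ZeroDivisionError:
    failed_without_eps = True
```

In plain Python the division by zero raises an exception. Array libraries typically produce NaN instead, which then propagates silently through later updates, and that is the situation epsilon is meant to prevent.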
Further Resources