We shall use Adam optimization to briefly explain the epsilon coefficient. For the Adam optimizer, the first and second moments are calculated as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Here $g_t$ is the derivative of the loss function with respect to a parameter at step $t$. $m_t$ is the exponentially decaying average of past gradients (the momentum term), and $v_t$ is the exponentially decaying average of past squared gradients. After bias correction, the parameter updates are done as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The $\epsilon$ in the update above is the epsilon coefficient.
Note that when the bias-corrected second moment $\hat{v}_t$ gets close to zero, the denominator $\sqrt{\hat{v}_t}$ also approaches zero, so the update blows up (and is undefined when the denominator is exactly zero). To rectify this, a small epsilon is added to the denominator so that the update stays numerically stable.
The standard (and PyTorch default) value of epsilon is 1e-08.
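To see concretely where epsilon enters, here is a minimal sketch of a single Adam step for one scalar parameter, written out by hand (the values and variable names are illustrative only, not taken from the example further below):

import torch

# Usual Adam hyperparameters; eps is the epsilon coefficient.
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

theta = torch.tensor(1.0)  # the parameter being optimized
m = torch.tensor(0.0)      # first moment: decaying average of gradients
v = torch.tensor(0.0)      # second moment: decaying average of squared gradients

g = torch.tensor(0.0)      # a zero gradient makes the second moment exactly zero
t = 1                      # step counter

# Moment estimates
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2

# Bias correction
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)

# Without eps the update divides 0 by sqrt(0) and produces NaN;
# adding eps keeps the update well-defined (the parameter is unchanged).
print(theta - lr * m_hat / torch.sqrt(v_hat))          # tensor(nan)
print(theta - lr * m_hat / (torch.sqrt(v_hat) + eps))  # tensor(1.)

In practice you rarely write this step yourself; the PyTorch example below simply passes eps when constructing the built-in Adam optimizer.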
import torch

# Batch size, input dimension, hidden dimension, output dimension
N, D_in, H, D_out = 64, 1000, 100, 10

# Random input and target data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# A simple two-layer fully connected network
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
# eps is the epsilon coefficient discussed above (1e-08 is also the default)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, amsgrad=True, eps=1e-08)

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()