# ASGD

Averaged Stochastic Gradient Descent, abbreviated as ASGD, runs ordinary SGD but additionally averages the weights computed at every iteration. The underlying SGD update is

$$w_{t+1}=w_t-\eta \nabla Q(w_t)$$

where €€w_t€€ is the weight tensor, €€\eta€€ is the base learning rate, and €€\nabla Q(w_t)€€ is the gradient of the objective function evaluated at €€w_t€€.

With this update rule, SGD assigns the newly calculated weight €€w_t€€ directly to the model. ASGD instead assigns the following averaged weight €€\overline{w}€€:

$$\overline{w}=\frac{1}{N} \sum_{t=1}^Nw_t$$

where €€w_t€€ is the weight tensor calculated in iteration €€t€€.

Such averaging reduces the variance introduced by noisy gradient estimates, so it is most useful when the training data (and therefore the gradients) are noisy.
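
For intuition, the averaging itself can be sketched in a few lines of plain PyTorch. This is a toy illustration of the formula above, not the `torch.optim.ASGD` API; all names (`w`, `w_avg`, `grad`, `lr`) are made up for the example:

```python
import torch

torch.manual_seed(0)
w = torch.zeros(3)       # current SGD iterate w_t
w_avg = torch.zeros(3)   # running average \bar{w}
lr = 0.1

for t in range(1, 101):
    # Noisy gradient of the toy objective Q(w) = 0.5 * ||w - 1||^2
    grad = (w - torch.ones(3)) + 0.1 * torch.randn(3)
    w = w - lr * grad                # SGD step: w_{t+1} = w_t - lr * grad
    w_avg = w_avg + (w - w_avg) / t  # incremental mean over all iterates

print('last iterate:', w)
print('averaged    :', w_avg)
```

The averaged weights fluctuate far less than the last iterate, which is exactly the point of ASGD.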

### Lambda

The decay term applied to the weights at each optimization step. Together with `alpha`, it also controls how quickly the per-step learning rate decays (see the formula after the next subsection).

### Alpha

The power used in the learning-rate decay schedule.
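
In PyTorch's implementation, `lambd` and `alpha` together determine the effective learning rate €€\eta_t€€ used at step €€t€€:

$$\eta_t=\frac{\eta}{(1+\lambda \eta t)^{\alpha}}$$

where €€\eta€€ is the base learning rate (`lr`) and €€\lambda€€ is `lambd`. With the defaults (€€\lambda=10^{-4}€€, €€\alpha=0.75€€), the learning rate decays slowly and polynomially.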

### t0

The optimization step at which averaging starts. If the total number of optimization steps is lower than `t0`, averaging never begins and ASGD behaves like plain SGD with a decayed learning rate.
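
Concretely, PyTorch realizes this delayed start through an averaging coefficient €€\mu_t€€:

$$\mu_t=\frac{1}{\max(1,\,t-t_0)}$$

Until step €€t_0€€ is passed, €€\mu_t=1€€ and the stored average simply tracks the current weights; afterwards, the average is updated incrementally as €€\overline{w} \leftarrow \overline{w}+\mu_t\,(w_t-\overline{w})€€.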

### Code Implementation


```python
# Import the libraries
import torch
import torch.nn as nn

# Toy inputs and targets
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function.
criterion = nn.MSELoss()

# Optimization method using ASGD
optimizer = torch.optim.ASGD(linear.parameters(), lr=0.01, lambd=0.0001,
                             alpha=0.75, t0=1000000.0, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: without this, the parameters have no gradients
# and optimizer.step() would do nothing.
optimizer.zero_grad()
loss.backward()

# Parameter update.
optimizer.step()
```
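
Note that `optimizer.step()` updates the model with the raw SGD iterate; the running average is kept separately in the optimizer state. As a sketch, the average can be read out as below, where the `'ax'` state key is an implementation detail of PyTorch's ASGD and may change between versions:

```python
# Read the averaged weights out of the optimizer state.
# 'ax' is the internal key PyTorch's ASGD uses for the running average
# (an implementation detail; verify against your PyTorch version).
for p in linear.parameters():
    avg = optimizer.state[p]['ax']
    print(avg.shape)
```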
