# ASGD

Averaged Stochastic Gradient Descent, abbreviated as ASGD, keeps a running average of the weights computed at every iteration and uses that average as the model's weights.

$w_{t+1} = w_t - \eta \nabla Q(w_t)$

where $w_t$ is the weight tensor, $\eta$ is the base learning rate, and $\nabla Q(w_t)$ is the gradient of the objective function evaluated at $w_t$.

With this update rule, plain SGD assigns the latest iterate $w_t$ to the model. ASGD instead assigns the averaged weight $\overline{w}$:

$\overline{w} = \frac{1}{N} \sum_{t=1}^{N} w_t$

where $w_t$ is the weight tensor computed at iteration $t$.
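The averaging above can be sketched in a few lines of plain Python. This is a minimal illustration on a 1-D quadratic objective with artificially noisy gradients, not the PyTorch implementation; the target value, noise level, and learning rate are made up for the example:

```python
import random

random.seed(0)

def noisy_grad(w, target=3.0, noise=1.0):
    # Gradient of Q(w) = (w - target)^2, plus zero-mean Gaussian noise.
    return 2.0 * (w - target) + random.gauss(0.0, noise)

w = 0.0     # current SGD iterate w_t
avg = 0.0   # running average, an incremental form of (1/N) * sum_t w_t
eta = 0.05  # base learning rate
N = 2000

for t in range(1, N + 1):
    w = w - eta * noisy_grad(w)   # plain SGD update
    avg += (w - avg) / t          # update the running average

print(f"last iterate: {w:.3f}")
print(f"averaged:     {avg:.3f}")
```

The last iterate keeps bouncing around the optimum because of the gradient noise, while the averaged weight settles much more steadily near it.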

Such averaging (often called Polyak-Ruppert averaging) reduces the variance of the final weights, which is especially helpful when the gradients are noisy.

### Lambda

It is the decay term (`lambd` in PyTorch): at every step the current weights are shrunk by this amount, and it also enters the learning-rate schedule.

### Alpha

It is the power (`alpha` in PyTorch) used in the learning-rate decay schedule.

### TO

It is the optimization step (`t0` in PyTorch) at which the averaging starts. If training runs for fewer iterations than `t0`, averaging never kicks in and ASGD behaves like plain SGD.
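The three parameters interact through ASGD's internal schedules. A small sketch of how the per-step learning rate $\eta_t$ and the averaging coefficient $\mu_t$ evolve, following the schedule used by `torch.optim.ASGD` ($\eta_t = lr / (1 + \text{lambd} \cdot lr \cdot t)^{\alpha}$ and $\mu_t = 1 / \max(1, t - t_0)$); the concrete values below are illustrative only:

```python
# Illustrative hyperparameters (t0 deliberately small so the
# effect is visible within a few steps).
lr, lambd, alpha, t0 = 0.01, 1e-4, 0.75, 100

for t in [1, 50, 100, 101, 1000]:
    eta = lr / (1 + lambd * lr * t) ** alpha   # decayed learning rate
    mu = 1.0 / max(1, t - t0)                  # averaging coefficient
    print(f"t={t:5d}  eta={eta:.6f}  mu={mu:.6f}")
```

While $t \le t_0$, $\mu_t$ stays at 1, so the "average" just tracks the latest iterate; once $t$ passes $t_0$, $\mu_t$ decays like $1/(t - t_0)$ and genuine averaging begins.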

### Code Implementation

    # Import the library.
    import torch
    import torch.nn as nn

    x = torch.randn(10, 3)
    y = torch.randn(10, 2)

    # Build a fully connected layer.
    linear = nn.Linear(3, 2)

    # Build the MSE loss function.
    criterion = nn.MSELoss()

    # Optimization method using ASGD.
    optimizer = torch.optim.ASGD(linear.parameters(), lr=0.01, lambd=0.0001,
                                 alpha=0.75, t0=1000000.0, weight_decay=0)

    # Forward pass.
    pred = linear(x)

    # Compute the loss.
    loss = criterion(pred, y)
    print('loss:', loss.item())

    # Backward pass and parameter update.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Last updated on Dec 21, 2022
