Why do we need to call zero_grad() during training?

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, you should ideally zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a combination of the old gradient (which you have already used to update your model parameters) and the newly computed gradient. It would therefore point in some direction other than the intended direction towards the minimum (or the maximum, in the case of maximization objectives).
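To see the accumulation behaviour in isolation, here is a minimal sketch (the tensor x is made up for illustration):

import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()

# first backward pass: dy/dx = 2 for each element
y.backward(retain_graph=True)
print(x.grad)  # tensor([2., 2.])

# second backward pass without zeroing: the new gradient is summed in
y.backward()
print(x.grad)  # tensor([4., 4.])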

Here is a simple example of a full training loop:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# Variable is deprecated; plain tensors with requires_grad=True
# serve the same purpose in current PyTorch
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # registered with this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar so backward() needs no explicit gradient argument
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()

Alternatively, if you're doing vanilla gradient descent, then:

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

learning_rate = 0.01

for sample, target in zip(data, targets):
    # clear out the gradients of W and b
    # (grads are None before the first backward pass)
    if W.grad is not None:
        W.grad.zero_()
    if b.grad is not None:
        b.grad.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update the parameters outside of autograd tracking
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

Note:

The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor. As of v1.7.0, PyTorch offers the option to reset the gradients to None instead of filling them with a tensor of zeroes: optimizer.zero_grad(set_to_none=True). The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
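A minimal sketch of the difference (the parameter w is made up for illustration):

import torch
from torch import optim

w = torch.zeros(3, requires_grad=True)
opt = optim.SGD([w], lr=0.1)

w.sum().backward()
opt.zero_grad(set_to_none=False)  # fills w.grad with zeroes
print(w.grad)                     # tensor([0., 0., 0.])

w.sum().backward()
opt.zero_grad(set_to_none=True)   # drops the gradient tensor entirely
print(w.grad)                     # None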

If you use a gradient-based method to decrease the error (or loss), zero_grad() restarts each step of the loop without carrying over the gradients from the previous step.

If you do not use zero_grad(), the loss will increase, not decrease as required.

For example:

If you use zero_grad(), you will get the following output:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

If you do not use zero_grad(), you will get the following output:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
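A minimal sketch that reproduces this effect on toy linear-regression data (all names below are made up for illustration, and the exact loss values will differ from those shown above):

import torch
from torch import nn, optim

torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [2.0], [3.0]])

model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

use_zero_grad = True  # flip to False to watch the loss grow instead

for epoch in range(5):
    if use_zero_grad:
        optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    print(f"model training loss is {loss.item():.2f}")

With use_zero_grad = False, the stale gradients keep adding up, so each step overshoots and the loss typically grows instead of shrinking.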

Although this idea can be inferred from the accepted answer, I feel like I want to make it explicit.

Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives more freedom in how the optimizer accumulates and applies gradients in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit into the GPU card.

In this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.

train_batch_size is the batch size for the forward pass, followed by loss.backward(); this is limited by the GPU memory. gradient_accumulation_steps determines the actual training batch size, as the losses from multiple forward passes are accumulated before one parameter update; the effective batch size (train_batch_size × gradient_accumulation_steps) is therefore not limited by the GPU memory.

From this example, you can see that optimizer.zero_grad() may be paired with optimizer.step(), rather than with loss.backward(). loss.backward() is invoked in every iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated train batches equals gradient_accumulation_steps (line 227 in the if block at line 219).

https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
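Here is a hedged sketch of that accumulation pattern, assuming model, loss_fn, optimizer, and dataloader are defined elsewhere (the names are placeholders, not the ones used in the linked script):

# model, loss_fn, optimizer and dataloader are placeholders defined elsewhere
gradient_accumulation_steps = 4  # illustrative value

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    # forward pass on a batch small enough to fit in GPU memory
    loss = loss_fn(model(inputs), labels)
    # scale so the accumulated gradient averages over the effective batch
    (loss / gradient_accumulation_steps).backward()

    # update and reset only once per effective (larger) batch
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()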

Someone is also asking about an equivalent method in TensorFlow; I guess tf.GradientTape serves the same purpose.

(I am still new to AI libraries; please correct me if anything I said is wrong.)
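For what it's worth, a minimal sketch of that TensorFlow pattern: each GradientTape computes gradients from scratch, so there is no accumulated state to zero out (the toy model and data below are made up for illustration):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

inputs = tf.random.normal((8, 4))
labels = tf.random.normal((8, 1))

# each tape starts fresh, so no explicit zeroing is needed
with tf.GradientTape() as tape:
    predictions = model(inputs)
    loss = loss_fn(labels, predictions)

grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))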

You don't have to call zero_grad(); alternatively, you can decay the gradients instead, for example:

optimizer = some_pytorch_optimizer
# decay the gradients instead of zeroing them:
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' for reference, the original zero_grad() code from the
            PyTorch source:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
            '''
            # halve the accumulated gradient rather than resetting it
            p.grad = p.grad / 2

This way the learning can continue more smoothly.

During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs). When we start backpropagation, we want to update the weights in order to minimize the loss of our cost function. So we clear out the gradients from the previous step in order to obtain better weights. We keep doing this during training, and we do not perform it during testing, because the weights obtained at training time are the ones that best fit our data. Hope this clears things up!

In simple terms, we need zero_grad()

because when we start a training loop, we do not want past gradients or past results to interfere with our current results, due to the way PyTorch collects/accumulates the gradients during backpropagation; past results could otherwise mix in and give us wrong results. So we set the gradients to zero every time we go through the loop. Here is an example:

# let us write a training loop
# (model_1, X_train, y_train, loss_fn and optimizer
# are assumed to be defined beforehand)
import torch

torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
  model_1.train()

  y_pred = model_1(X_train)

  loss = loss_fn(y_pred, y_train)

  # zero the gradients accumulated in the previous iteration
  optimizer.zero_grad()

  loss.backward()

  # note: step() belongs inside the loop, after backward()
  optimizer.step()

In this for loop, if we do not zero out the optimizer's gradients, the past values will be added up each time and change the result. So we use zero_grad() to avoid wrongly accumulated results.