PyTorch | Gradients
Table of Contents:
- Gradients, Jacobians and Hessians
- Gradients act as accumulators
- Gradient with respect to (wrt.) the inputs and gradients wrt. the leaves
- Formal definition of Gradients, Jacobians and Hessians
- Getting the gradients
Gradients, Jacobians and Hessians
We already mentioned how automatic differentiation (AD) computes gradients. Now it's time for a few basic experiments with PyTorch AD. In this section we explain the specifics of gradients in PyTorch, together with Jacobian and Hessian matrices, since these are closely related.
Gradients act as accumulators
One of the very first experiments in PyTorch is to create a tensor that requires gradients. It can be created from a single line.
import torch
w = torch.randn(5, requires_grad = True)
print(w)
Out:
tensor([-0.0340, 1.1180, 1.1411, 3.1435, 2.1553], requires_grad=True)
Alternatively, we can call the requires_grad_() method on an existing tensor for the same effect.
import torch
w = torch.rand(5)
print(w)
w.requires_grad_()
print(w)
Out:
tensor([0.5695, 0.4895, 0.9994, 0.9366, 0.2727])
tensor([0.5695, 0.4895, 0.9994, 0.9366, 0.2727], requires_grad=True)
Next, let's use the scalar function sum() to compute the scalar value s. A scalar function is a function that returns a single value. Then we call s.backward(). If we print the gradients for w we get a tensor of all ones.
s = w.sum()
print(s)
s.backward()
print(w.grad)
Out:
tensor(2.2267, grad_fn=<SumBackward0>)
tensor([1., 1., 1., 1., 1.])
But if we repeat this process, the gradients in w.grad (or w.grad.data, which refers to the same values) will increase by one with every call.
s.backward()
print(w.grad)
s.backward()
print(w.grad)
s.backward()
print(w.grad.data)
Out:
tensor([2., 2., 2., 2., 2.])
tensor([3., 3., 3., 3., 3.])
tensor([4., 4., 4., 4., 4.])
To zero the gradients we call w.grad.zero_() or w.grad.data.zero_().
w.grad.zero_()
print(w.grad)
Out:
tensor([0., 0., 0., 0., 0.])
The previous example shows one important feature of how PyTorch handles gradients: they act as accumulators. We first created a tensor w with requires_grad=False (the default). Then we activated the gradients with w.requires_grad_().
After that we created the computational graph with w.sum(). The root of the computational graph is s, and the leaves are the elements of the tensor w.
Even though the sum itself never changes, the gradients of the tensor w stored in w.grad increase by 1.0 each time we call s.backward(). This is because each element of the tensor w contributes to the sum, and backward() adds the newly computed gradients to the ones already stored.
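To make the accumulation explicit: since $s = w_1 + w_2 + \dots + w_5$, each partial derivative is
\[\frac{\partial s}{\partial w_i} = 1,\]
so after $k$ calls to s.backward() the value stored in w.grad is $k$ for every element.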
Lastly, when we zero the gradients, we can verify that all of them are set to 0.
print(w.grad)
Out:
tensor([0., 0., 0., 0., 0.])
Gradient with respect to (wrt.) the inputs and gradients wrt. the leaves
From the last example we can also request the gradients of s w.r.t. w explicitly. To do that we call the torch.autograd.grad() function.
torch.autograd.backward() is a special case of torch.autograd.grad():
- backward(): computes the sum of gradients of the outputs w.r.t. the graph leaves and accumulates them into the leaves' .grad attribute.
- grad(): computes and returns the sum of gradients of the outputs w.r.t. the specified inputs.
import torch
w = torch.rand(5, requires_grad=True)
s = w.sum()
torch.autograd.grad(s, w)
Out:
(tensor([1., 1., 1., 1., 1.]),)
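To make the difference between the inputs and the leaves concrete, here is a minimal sketch (the intermediate tensor v and the factor 2 are our own additions for illustration):
import torch
w = torch.rand(5, requires_grad=True)   # leaf tensor
v = 2 * w                               # intermediate (non-leaf) tensor
s = v.sum()
# grad() returns the gradients w.r.t. any inputs we pass in, here the
# non-leaf tensor v; nothing is written into .grad
print(torch.autograd.grad(s, v, retain_graph=True))  # (tensor([1., 1., 1., 1., 1.]),)
print(w.grad)                                        # None
# backward() accumulates the gradients into the .grad of the leaves
s.backward()
print(w.grad)                                        # tensor([2., 2., 2., 2., 2.])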
Formal definition of Gradients, Jacobians and Hessians
The gradient and the Hessian are defined for functions $f: R^n \to R$, in other words for scalar functions.
Gradient:
\[\nabla f=\left[\begin{array}{c}\frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \\ \vdots \\ \frac{\partial f}{\partial x_{n}}\end{array}\right]\]
Hessian:
\[\nabla^{2} f=\left[\begin{array}{cccc}\frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}}\end{array}\right]\]
The Jacobian is also defined for non-scalar functions $f: R^n \to R^m$:
\[J f=\left[\begin{array}{cccc}\frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \cdots & \frac{\partial f_{1}}{\partial x_{n}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & & \frac{\partial f_{2}}{\partial x_{n}} \\ \vdots & & \ddots & \vdots \\ \frac{\partial f_{m}}{\partial x_{1}} & \frac{\partial f_{m}}{\partial x_{2}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}}\end{array}\right]\]
When $m = 1$, the Jacobian is the transpose of the gradient.
# jacobian example: Jacobian of the elementwise log of a 2x2 input
def flog(x):
    return x.log()

inputs = torch.rand(2, 2)
torch.autograd.functional.jacobian(flog, inputs)
Out:
tensor([[[[2.4621, 0.0000],
[0.0000, 0.0000]],
[[0.0000, 7.4280],
[0.0000, 0.0000]]],
[[[0.0000, 0.0000],
[1.7573, 0.0000]],
[[0.0000, 0.0000],
[0.0000, 2.0443]]]])
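Both the input and the output are 2×2 here, so the Jacobian has shape (2, 2, 2, 2). Since log() acts elementwise, only the diagonal entries $\partial \log x_{ij} / \partial x_{ij} = 1 / x_{ij}$ are non-zero, which is exactly the sparsity pattern above.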
# hessian example: second derivatives of the sum of elementwise square roots
def rootsum(x):
    return torch.sqrt(x).sum()

inputs = torch.rand(2, 2)
torch.autograd.functional.hessian(rootsum, inputs)
Out:
tensor([[[[-7.4587, -0.0000],
[-0.0000, -0.0000]],
[[-0.0000, -8.0190],
[-0.0000, -0.0000]]],
[[[-0.0000, -0.0000],
[-0.2779, -0.0000]],
[[-0.0000, -0.0000],
[-0.0000, -1.0183]]]])
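As a quick check, for $f(x) = \sum_{ij} \sqrt{x_{ij}}$ all mixed second derivatives vanish and the diagonal entries are $\frac{\partial^{2} f}{\partial x_{ij}^{2}} = -\frac{1}{4} x_{ij}^{-3/2}$, which matches the pattern of negative values above.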
Getting the gradients
Example: Getting Jacobians using torch.autograd.grad
In the next examples torch.autograd.grad computes the product of the Jacobian with the vector given in grad_outputs (a vector-Jacobian product). To pass the all-ones vector [1., 1.] in grad_outputs you can use a line like:
grad_outputs=y.data.new(y.shape).fill_(1)
To recover the full Jacobian instead, you multiply with [1., 0.] to extract the first column of the transposed Jacobian, then with [0., 1.] to extract the second column.
Here is the complete code:
import numpy as np
import torch

x = np.arange(1, 3, 1)
x = torch.from_numpy(x).reshape(len(x), 1)
x = x.float()
x.requires_grad = True
w1 = torch.randn((2, 2), requires_grad=False)
y = w1 @ x
print(w1)
jacT = torch.zeros(2, 2)
for i in range(2):
    # unit vector that selects one column of the transposed Jacobian
    output = torch.zeros(2, 1)
    output[i] = 1.
    jacT[:, i:i+1] = torch.autograd.grad(y, x, grad_outputs=output, retain_graph=True)[0]
print(jacT)
Out:
tensor([[-0.4164, 0.1159],
[-0.4436, 1.8093]])
tensor([[-0.4164, -0.4436],
[ 0.1159, 1.8093]])
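As a cross-check (a small sketch reusing w1 and x from the snippet above, with a helper function linear of our own), torch.autograd.functional.jacobian returns the untransposed Jacobian directly, which for y = w1 @ x is simply w1:
# dy/dx has shape (2, 1, 2, 1) here; reshaping to (2, 2) recovers w1,
# i.e. the transpose of jacT computed above
def linear(x):
    return w1 @ x

print(torch.autograd.functional.jacobian(linear, x).reshape(2, 2))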
A similar example, this time multiplying with the all-ones vector:
import numpy as np
import torch

x = np.arange(1, 3, 1)
print(x)
x = torch.from_numpy(x).reshape(len(x), 1)
print(x)
x = x.float()
x.requires_grad = True
w1 = torch.randn((2, 2), requires_grad=True)
y = w1 @ x
# grad_outputs is the all-ones vector, so we get the product of the
# transposed Jacobian with [1., 1.]
jac = torch.autograd.grad(y, x, grad_outputs=y.data.new(y.shape).fill_(1), create_graph=True)
jac
Out:
[1 2]
tensor([[1],
[2]], dtype=torch.int32)
(tensor([[-0.8852],
[ 0.7835]], grad_fn=<MmBackward>),)
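The returned tensor is the vector-Jacobian product $J^{\top} v$ with $v = [1, 1]^{\top}$, and since $J = w_1$ here, a quick sanity check (our own addition) is:
# should match the tuple returned by torch.autograd.grad above
print(w1.t() @ torch.ones(2, 1))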
…
tags: gradients - jacobians - hessians - pytorch requires_grad - pytorch gradients - gradient accumulation - gradient accumulation example - gradients wrt. input - gradients wrt. leaves & category: pytorch