# PyTorch | Automatic Differentiation

*Table of Contents:*

- What is PyTorch AD ?
- Create backward computational graph using torchviz
- Detach from AD
- Bonus define deep learning

## What is PyTorch AD ?

Automatic Differentiation (AD) is a technique for calculating the derivative of a function $f(x_1, \cdots, x_n)$ at some point.

### What PyTorch AD is not

AD is not a symbolic math approach to calculating the derivative. A symbolic approach applies differentiation rules to the formula itself. For example, if you have $f(x) = {1\over x}$ then $f'(x) = -{1 \over x^2}$.

AD is also not a numerical procedure for calculating the derivative. A numerical procedure for the derivative of the `tanh` function at the point $x=1$ would be:

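```
import numpy as np

def tanh(x):
    # tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x))
    y = np.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

s = 0.00001  # some small number
x = 1.0
d = (tanh(x + s) - tanh(x)) / s  # forward difference
print(f"{d:.5f}")
```

Output:

```
0.41997
```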
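As a rough illustration of why the numerical procedure is only an approximation, the sketch below compares a forward difference and a central difference with the analytic derivative $1-\tanh^2(x)$. The helper names `forward_diff` and `central_diff` are made up for this example, and `math.tanh` is used as the reference implementation.

```python
import math

def forward_diff(f, x, s):
    # one-sided estimate; the error shrinks linearly with s
    return (f(x + s) - f(x)) / s

def central_diff(f, x, s):
    # two-sided estimate; the error shrinks quadratically with s
    return (f(x + s) - f(x - s)) / (2.0 * s)

x = 1.0
s = 1e-5
exact = 1.0 - math.tanh(x) ** 2  # analytic derivative of tanh

err_forward = abs(forward_diff(math.tanh, x, s) - exact)
err_central = abs(central_diff(math.tanh, x, s) - exact)

print(f"exact derivative = {exact:.6f}")
print(f"forward difference error = {err_forward:.1e}")
print(f"central difference error = {err_central:.1e}")
```

Both estimates depend on the choice of `s`, which is the kind of tuning AD avoids.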

### How does reverse mode AD work?

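PyTorch uses **reverse mode** AD. *Forward mode* AD also exists, but it is computationally more expensive when a function has many inputs and few outputs, which is the usual case in deep learning (many parameters, one scalar loss).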
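For contrast, forward mode can be sketched with dual numbers, where every value carries its derivative along during the forward pass. The `Dual` class and the function `g` below are hypothetical and not part of PyTorch. They only illustrate that forward mode needs one pass per input variable.

```python
import math

class Dual:
    """Hypothetical forward-mode number: a value plus its derivative (tangent)."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # chain rule: d sin(x) = cos(x) * dx
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def g(a, b):
    return a * b + sin(a)

# One forward pass per input: seed the input of interest with dot=1.0.
dg_da = g(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot  # b + cos(a)
dg_db = g(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot  # a
print(dg_da, dg_db)
```

With two inputs this takes two passes. Reverse mode gets both partial derivatives from a single backward pass.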

Reverse mode AD works the following way.

First, the forward pass is executed. During the forward pass PyTorch builds the computational graph dynamically and calculates the intermediate values from the inputs.

The computational graph is like a tree. Inputs are the *tree leaves*, and each node in the graph corresponds to some operation (such as `+`) or to some function (such as `sin`).

The output function is *the root* of the tree. Once the forward pass has created the computational tree, together with the intermediate gradients, we can get the gradient of the root with respect to any of the leaves.

We say: we compute the gradients of a function with respect to the input variable $x$.
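In PyTorch these roles can be inspected directly on tensors. The short sketch below, which assumes only that `torch` is installed, shows that an input tensor is a leaf with no `grad_fn`, while a computed tensor records the operation that produced it.

```python
import torch

x = torch.tensor(0.5, requires_grad=True)  # input: a leaf of the graph
y = torch.sin(x) + 1.0                     # built from x during the forward pass

print(x.is_leaf, x.grad_fn)      # leaf tensors have no grad_fn
print(y.is_leaf)                 # y was computed, so it is not a leaf
print(y.grad_fn)                 # backward node of the last operation (the addition)
print(y.grad_fn.next_functions)  # links to the nodes that feed into it

y.backward()                     # backward pass from the root y
print(x.grad)                    # gradient of y with respect to x, i.e. cos(0.5)
```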

This pass, in which we compute the gradients, is known as the backward pass and corresponds to the PyTorch `grad` function (`torch.autograd.grad`, or `backward()` called on the output tensor).

The PyTorch `grad` function is cheap. It just traverses the computational graph and sums the products of the intermediate gradients to calculate the final gradient. The math behind this calculation is called the **chain rule**.

Note: There is a tool similar to PyTorch called Chainer, named so because of this chain rule principle.

Example:

To get a feel for how PyTorch AD works, the next code example creates the computational graph for the function:

\[f(x_1, x_2) = \frac{1+\sin(x_2)}{x_2+e^{x_1}} + x_1x_2\]

We will calculate the gradient of the function $f(x_1, x_2)$ with respect to $x_2$.

```
import math

class ADNumber:
    """A number that records its children and the local gradient to each of them."""

    def __init__(self, val, name=""):
        self.name = name
        self._val = val
        self._children = []  # list of (local gradient, child node) pairs

    def __truediv__(self, other):
        new = ADNumber(self._val / other._val, name=f"{self.name}/{other.name}")
        self._children.append((1.0 / other._val, new))
        # derivative of a/b with respect to b is -a/b^2
        other._children.append((-self._val / other._val**2, new))
        return new

    def __mul__(self, other):
        new = ADNumber(self._val * other._val, name=f"{self.name}*{other.name}")
        self._children.append((other._val, new))
        other._children.append((self._val, new))
        return new

    def __add__(self, other):
        if isinstance(other, (int, float)):
            other = ADNumber(other, str(other))
        new = ADNumber(self._val + other._val, name=f"{self.name}+{other.name}")
        self._children.append((1.0, new))
        other._children.append((1.0, new))
        return new

    def __sub__(self, other):
        new = ADNumber(self._val - other._val, name=f"{self.name}-{other.name}")
        self._children.append((1.0, new))
        other._children.append((-1.0, new))
        return new

    @staticmethod
    def exp(self):
        new = ADNumber(math.exp(self._val), name=f"exp({self.name})")
        self._children.append((new._val, new))  # derivative of exp(x) is exp(x)
        return new

    @staticmethod
    def sin(self):
        new = ADNumber(math.sin(self._val), name=f"sin({self.name})")
        self._children.append((math.cos(self._val), new))  # derivative of sin(x) is cos(x)
        return new

    def grad(self, other):
        # gradient of self with respect to other: sum over all paths from other up to self
        if self is other:
            return 1.0
        result = 0.0
        for child in other._children:
            result += child[0] * self.grad(child[1])
        return result

A = ADNumber  # shortcuts
sin = A.sin
exp = A.exp

def print_child(f, wrt):  # wrt: with respect to
    for e in f._children:
        print("child:", wrt, "->", e[1].name, "grad:", e[0])
        print_child(e[1], e[1].name)

x1 = A(1.5, name="x1")
x2 = A(0.5, name="x2")
f = (sin(x2) + 1) / (x2 + exp(x1)) + x1 * x2
print_child(x2, "x2")
print("\ncalculated gradient for the function f with respect to x2:", f.grad(x2))
```

Out:

```
child: x2 -> sin(x2) grad: 0.8775825618903728
child: sin(x2) -> sin(x2)+1 grad: 1.0
child: sin(x2)+1 -> sin(x2)+1/x2+exp(x1) grad: 0.20073512936690338
child: sin(x2)+1/x2+exp(x1) -> sin(x2)+1/x2+exp(x1)+x1*x2 grad: 1.0
child: x2 -> x2+exp(x1) grad: 1.0
child: x2+exp(x1) -> sin(x2)+1/x2+exp(x1) grad: -0.05961284871202578
child: sin(x2)+1/x2+exp(x1) -> sin(x2)+1/x2+exp(x1)+x1*x2 grad: 1.0
child: x2 -> x1*x2 grad: 1.5
child: x1*x2 -> sin(x2)+1/x2+exp(x1)+x1*x2 grad: 1.0
calculated gradient for the function f with respect to x2: 1.6165488003791766
```

Check:

```
1.5 + (0.8775825618903728 * 1.0 * 0.20073512936690338) + (-0.05961284871202578 *1.0)
```

Out:

```
1.6165488003791768
```

The next image shows the computational graph for the example function:

\[f(x_1, x_2) = \frac{1+\sin(x_2)}{x_2+e^{x_1}} + x_1x_2\]

where $x_1=1.5, x_2=0.5$.

Each node in the tree graph is either a leaf node (green) or the root node (brown) or something in between.

From the input `x2` in the forward pass we identify three paths leading to the root. The arrows in dark red, red and orange denote these paths. We can ignore the black arrows for now.

To compute the final gradient of our function `f` with respect to `x2`, we multiply the gradient values along each path and then sum the products.

The calculation is as follows:

```
1.5 + (0.8775825618903728 * 1.0 * 0.20073512936690338) + (-0.05961284871202578 *1.0)
# 1.6165488003791768
```

This is exactly what our `grad` function does: if we print `f.grad(x2)` the result is 1.6165488003791766 (the last digit differs only because of floating-point rounding).
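Written out with the chain rule, the three paths correspond to the three terms below. This is only a restatement of the numbers above, one term per path, with each factor being a local gradient along that path:

\[
\frac{\partial f}{\partial x_2}
= \underbrace{x_1}_{\text{via } x_1x_2}
+ \underbrace{\cos(x_2)\cdot 1\cdot\frac{1}{x_2+e^{x_1}}}_{\text{via } \sin(x_2)}
+ \underbrace{1\cdot\left(-\frac{1+\sin(x_2)}{(x_2+e^{x_1})^2}\right)}_{\text{via } x_2+e^{x_1}}
\]

\[
\frac{\partial f}{\partial x_2}\bigg|_{x_1=1.5,\ x_2=0.5}
\approx 1.5 + 0.17616 - 0.05961 \approx 1.61655
\]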

Let’s show that the numerical procedure gives approximately the same result.

```
import math

def f(x1, x2):
    return (math.sin(x2) + 1) / (x2 + math.exp(x1)) + x1 * x2

e = 0.0001  # some small e
x1 = 1.5
x2 = 0.5
grad = (f(x1, x2 + e) - f(x1, x2)) / e
print(grad)  # 1.6165416488078677
```

## Create backward computational graph using torchviz

```
# !pip install torchviz
import torch
from torchviz import make_dot

# Create tensors
x1 = torch.tensor(1.5, requires_grad=True)
x2 = torch.tensor(0.5, requires_grad=True)
c = torch.tensor(1., requires_grad=True)

# Build a computational graph
y = (torch.sin(x2) + c) / (x2 + torch.exp(x1)) + x1 * x2
y.backward()  # compute gradients
print(x1.grad)
print(x2.grad)
print(c.grad)

# Name the leaf tensors so they are labeled in the rendered graph
params = {'x1': x1, 'x2': x2, 'c': c}
make_dot(y, params=params)
```

Out:

```
tensor(0.2328)
tensor(1.6165)
tensor(0.2007)
```
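The same gradients can also be requested with `torch.autograd.grad`, which returns them directly instead of storing them in `.grad`. This is a minimal sketch using the same function and input values, with the constant written as a plain `1.0` for brevity:

```python
import torch

x1 = torch.tensor(1.5, requires_grad=True)
x2 = torch.tensor(0.5, requires_grad=True)

y = (torch.sin(x2) + 1.0) / (x2 + torch.exp(x1)) + x1 * x2

# Returns one gradient tensor per requested input.
g1, g2 = torch.autograd.grad(y, [x1, x2])
print(g1, g2)    # approximately 0.2328 and 1.6165
print(x1.grad)   # None: autograd.grad does not populate .grad
```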

**Example**: Create resnet18 computational graph

```
import torch
import torchvision.models as models
from torchviz import make_dot

resnet18 = models.resnet18()
x = torch.zeros(1, 3, 224, 224, dtype=torch.float, requires_grad=False)
out = resnet18(x)
make_dot(out)
```

**Example**: Using hiddenlayer

```
import torch
import hiddenlayer as hl
import torchvision.models as models
resnet18 = models.resnet18()
x = torch.zeros(1, 3, 224, 224, dtype=torch.float, requires_grad=False)
transforms = [ hl.transforms.Prune('Constant') ] # removes Constant nodes from graph.
# resnet18 is the torchvision model and x is the 4D input tensor
graph = hl.build_graph(resnet18, x, transforms=transforms)
graph.theme = hl.graph.THEMES['blue'].copy()
# graph.save('rnn_hiddenlayer', format='png')
graph
```

*Save and zoom to check the details.*

## Detach from AD

Here is one computational graph.

```
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x
r = (y + z).sum()
make_dot(r)
```

It is possible to `detach()` a tensor from the AD computational graph.

```
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.detach()  # this branch is cut off from the graph
r = (y + z).sum()
make_dot(r)
```
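The effect of detaching shows up in the gradients as well as in the picture. The following sketch, which does not need torchviz, compares the gradient of `r` with respect to `x` with and without `detach()`. The variable names `r_full`, `r_det`, `g_full` and `g_det` are chosen just for this example.

```python
import torch

x = torch.ones(2, requires_grad=True)

# Both branches depend on x: d/dx (2x + 3 + x) = 3
r_full = (2 * x + (3 + x)).sum()
(g_full,) = torch.autograd.grad(r_full, x)

# The second branch is detached: only 2x contributes, so the gradient is 2
r_det = (2 * x + (3 + x.detach())).sum()
(g_det,) = torch.autograd.grad(r_det, x)

print(g_full)                    # tensor([3., 3.])
print(g_det)                     # tensor([2., 2.])
print(x.detach().requires_grad)  # False
```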

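For this purpose `x.detach()` gives the same result as `x.data`: both return a tensor that shares storage with `x` and has `requires_grad=False`. The difference is that in-place changes made through `.detach()` are still noticed by autograd, whereas changes made through `.data` are not, so `.detach()` is the safer choice.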

```
import torch
from torchviz import make_dot

x = torch.ones(2, requires_grad=True)
y = 2 * x
z = 3 + x.data
r = (y + z).sum()
make_dot(r)
```

You can also use `torch.no_grad()` as a context manager (`with torch.no_grad():`). Whatever is computed inside that block ends up with `requires_grad=False`. The next example shows just that: tensor `x`, which has `requires_grad=True`, is used to create tensor `y`, but `y` has `requires_grad=False`.

```
import torch

x = torch.tensor(2., requires_grad=True)
print(x)
with torch.no_grad():
    y = x * 2
print(y, y.requires_grad)
```

Out:

```
tensor(2., requires_grad=True)
tensor(4.) False
```

For more details refer to `help(torch.no_grad)`.
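`torch.no_grad()` can also be applied as a decorator, which is convenient for whole functions that should never be tracked. A small sketch follows, where the function name `double` is hypothetical:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

@torch.no_grad()
def double(t):
    # Everything computed in here is excluded from the graph.
    return t * 2

y = double(x)  # computed without tracking
z = x * 2      # computed with tracking

print(y.requires_grad, z.requires_grad)  # False True
print(y.grad_fn)                         # None
```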

## Bonus define deep learning

In essence, for deep learning you need deep models. By definition, shallow models have just one hidden layer:

```
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Linear(H, D_out),
)
```

Deep models have 2 or more hidden layers.

```
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H1),
    torch.nn.Linear(H1, H2),
    torch.nn.Linear(H2, D_out),
)
```

In other words, to do deep learning you need at least three linear layers. The dimension `H` is called the hidden dimension. Instead of `nn.Linear` layers you may use convolution layers.
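A runnable version of the deep model might look like the sketch below. The sizes `D_in`, `H1`, `H2` and `D_out` are arbitrary values picked for illustration. The `ReLU` activations are an addition not present in the snippets above: consecutive linear layers without a nonlinearity compose into a single linear map, so practical models usually put one between them.

```python
import torch

# Hypothetical sizes, chosen only for this example
D_in, H1, H2, D_out = 8, 16, 16, 2

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H1),
    torch.nn.ReLU(),
    torch.nn.Linear(H1, H2),
    torch.nn.ReLU(),
    torch.nn.Linear(H2, D_out),
)

x = torch.randn(4, D_in)  # a batch of 4 random inputs
out = model(x)            # forward pass builds the graph
print(out.shape)          # torch.Size([4, 2])

out.sum().backward()      # backward pass fills .grad of every parameter
print(model[0].weight.grad.shape)  # torch.Size([16, 8])
```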

…

**tags:** *ad* - *pytorch automatic differentiation* - *pytorch ad* - *automatic differentiation* - *computational graph* - *backward computational graph* - *reverse mode ad* - *derivation rule*

**category:** *pytorch*