PyTorch | Optimizer
About PyTorch Optimizers
torch.optim is a package implementing various optimization algorithms in PyTorch. If you use PyTorch, you can also create your own optimizers in Python.
PyTorch ships with default optimizers. The most famous is torch.optim.SGD, followed by torch.optim.Adam and torch.optim.AdamW.
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.
Recently torch.optim.LBFGS, inspired by the MATLAB function minFunc, has also become popular.
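Unlike the other optimizers, LBFGS re-evaluates the loss several times per step, so its step() takes a closure. A minimal usage sketch, assuming a toy quadratic objective:

import torch

x = torch.randn(10, requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=0.1)

def closure():
    optimizer.zero_grad()        # clear old gradients
    loss = (x ** 2).sum()        # toy objective (assumed)
    loss.backward()              # compute new gradients
    return loss                  # LBFGS needs the loss back

for _ in range(20):
    optimizer.step(closure)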
It is really easy to create a custom PyTorch optimizer; it is just a Python class.
It needs a constructor __init__, a step() method that applies the update, and it keeps per-parameter state in self.state (exposed through state_dict()), also called the state.
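As a minimal sketch of that structure (assuming plain gradient descent as the update rule; the full Adaam example follows below), a custom optimizer can look like this:

import torch
from torch.optim import Optimizer

class PlainSGD(Optimizer):
    def __init__(self, params, lr=1e-3):
        defaults = dict(lr=lr)
        super(PlainSGD, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # plain gradient descent: p <- p - lr * grad
                p.data.add_(p.grad.data, alpha=-group['lr'])
        return loss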
Instantiate optimizers
PyTorch ships well-debugged optimizers you can use directly.
import torch.optim as optim

# net is your model (a torch.nn.Module)
optimizer = optim.Adam(net.parameters(), lr=0.001)
optimizer = optim.AdamW(net.parameters(), lr=0.001)
optimizer = optim.SGD(net.parameters(), lr=0.001)
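Whichever optimizer you pick, the training-loop pattern is the same: zero the gradients, backpropagate, then step. A minimal sketch with an assumed toy model and dummy data:

import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(4, 1)                    # toy model (assumed)
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

x = torch.randn(16, 4)                   # dummy batch
y = torch.randn(16, 1)

for epoch in range(10):
    optimizer.zero_grad()                # clear old gradients
    loss = criterion(net(x), y)
    loss.backward()                      # compute new gradients
    optimizer.step()                     # update the parameters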
Creating a custom optimizer
Here is an example of an optimizer called Adaam that I created some time ago.
Usually you start from a template class and set torch.optim.Optimizer as the base class:
import torch
from torch.optim import Optimizer

class Adaam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-3):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super(Adaam, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Adaam, self).__setstate__(state)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()              # re-evaluate the model if a closure is given
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                if len(state) == 0:       # lazy state initialization
                    state['step'] = 0
                    state['agrad'] = torch.zeros_like(p.data)   # grad average
                    state['agrad2'] = torch.zeros_like(p.data)  # Hadamard grad average
                state['step'] += 1
                agrad, agrad2 = state['agrad'], state['agrad2']
                beta1, beta2 = group['betas']
                # exponential moving averages of the gradient and its element-wise square
                agrad.mul_(beta1).add_(grad, alpha=1 - beta1)
                agrad2.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                # bias correction
                bias_1 = 1 - beta1 ** state['step']
                bias_2 = 1 - beta2 ** state['step']
                agrad = agrad.div(bias_1)
                agrad2 = agrad2.div(bias_2)
                denom = agrad2.sqrt().add_(group['eps'])
                p.data.addcdiv_(agrad, denom, value=-group['lr'])
        return loss
This optimizer has:
- input parameters (lr, betas, eps)
- the state
- an optional closure passed to step() that re-evaluates the model and returns the loss
- params
- self.param_groups
The state holds the step number, the gradient average, and the Hadamard (element-wise squared) gradient average in this case.
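As a hedged usage sketch (assuming a toy quadratic objective), the closure lets step() re-evaluate the loss, and the per-parameter state can be inspected afterwards:

import torch

x = torch.randn(5, requires_grad=True)
optimizer = Adaam([x], lr=1e-2)

def closure():
    optimizer.zero_grad()
    loss = (x ** 2).sum()       # toy objective (assumed)
    loss.backward()
    return loss

for _ in range(100):
    loss = optimizer.step(closure)

print(loss.item())
print(optimizer.state[x]['step'])   # per-parameter state, e.g. the step counter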
You probably won’t write your own optimizers, but you will use the existing ones.
…
tags: optimizer & category: pytorch