PyTorch | Optimizer

About PyTorch Optimizers

torch.optim is a package implementing various optimization algorithms in PyTorch.

If you use PyTorch you can create your own optimizers in Python.

PyTorch ships with built-in optimizers. The best known is torch.optim.SGD, followed by torch.optim.Adam and torch.optim.AdamW.

The original Adam algorithm was proposed in "Adam: A Method for Stochastic Optimization". The AdamW variant was proposed in "Decoupled Weight Decay Regularization".

Also popular recently is torch.optim.LBFGS, which is inspired by the MATLAB function minFunc.
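
Unlike SGD or Adam, torch.optim.LBFGS needs a closure passed to step(), because it may re-evaluate the loss several times per step. A minimal sketch, assuming a toy one-variable quadratic that is made up for illustration (the names x, closure and final_x are not from any real project):

```python
import torch

# Toy problem (an assumption for illustration): minimize (x - 3)^2
x = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([x], lr=1.0, max_iter=20)


def closure():
    # LBFGS may call this more than once per step to re-evaluate the loss
    optimizer.zero_grad()
    loss = ((x - 3.0) ** 2).sum()
    loss.backward()
    return loss


for _ in range(5):
    optimizer.step(closure)

final_x = x.item()  # should end up close to the minimizer at 3
```

The same closure pattern also works with the other optimizers, where it is optional.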

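It is easy to create a custom PyTorch optimizer. It is just a Python class that subclasses torch.optim.Optimizer. It needs a constructor, __init__, and a step method. The base class provides a per-parameter state dictionary (self.state), also called the state, which is saved and restored through state_dict() and load_state_dict().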

Instantiate optimizers

PyTorch has several well-debugged optimizers you can consider.

import torch.optim as optim

# Pick one of these; net is any torch.nn.Module
optimizer = optim.Adam(net.parameters(), lr=0.001)
optimizer = optim.AdamW(net.parameters(), lr=0.001)
optimizer = optim.SGD(net.parameters(), lr=0.001)
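
Once instantiated, an optimizer is typically used in a loop of zero_grad(), backward() and step(). The sketch below assumes a hypothetical stand-in for net (a single linear layer), synthetic data, and a larger learning rate than above so the effect shows within 100 steps:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for `net`: a single linear layer
net = nn.Linear(4, 1)
optimizer = optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Synthetic regression data, made up for this example
inputs = torch.randn(32, 4)
targets = inputs.sum(dim=1, keepdim=True)

losses = []
for _ in range(100):
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss = loss_fn(net(inputs), targets)   # forward pass
    loss.backward()                        # compute gradients
    optimizer.step()                       # update the parameters
    losses.append(loss.item())
```

Swapping optim.SGD for optim.Adam or optim.AdamW changes only the constructor line; the loop stays the same.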

Creating a custom optimizer

Here is an example of an optimizer called Adaam I created some time ago.

Usually, you start from a template class with torch.optim.Optimizer as the base class:

import torch
from torch.optim import Optimizer


class Adaam(Optimizer):

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-3):
        
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super(Adaam, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Adaam, self).__setstate__(state)

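    def step(self, closure=None):
        # If given, the closure re-evaluates the model and returns the loss
        loss = None
        if closure is not None:
            loss = closure()
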
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p] 
                
                if len(state) == 0:
                    state['step'] = 0
                    state['agrad'] = torch.zeros_like(p.data)  # running average of the gradient
                    state['agrad2'] = torch.zeros_like(p.data)  # running average of grad * grad (Hadamard square)
                    
                state['step'] += 1
                
                agrad, agrad2 = state['agrad'], state['agrad2'] 
                beta1, beta2 = group['betas']
                
                # Keyword forms; the positional-scalar overloads are deprecated
                agrad.mul_(beta1).add_(grad, alpha=1 - beta1)
                agrad2.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                bias_1 = 1 - beta1 ** state['step']
                bias_2 = 1 - beta2 ** state['step'] 
                
                agrad = agrad.div(bias_1)
                agrad2 = agrad2.div(bias_2)
                
                denom = agrad2.sqrt().add_(group['eps'])
            
                p.data.addcdiv_(agrad, denom, value=-group['lr'])

        return loss
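
Read as equations, the step method above appears to compute the standard bias-corrected Adam update. The symbols are my own mapping, not the author's notation: g_t is grad, m_t is agrad, v_t is agrad2, t is state['step'], theta is the parameter p, and lr, beta_1, beta_2, epsilon are the values from the parameter group.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \mathrm{lr}\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

Here \odot is the element-wise (Hadamard) product, and the division and square root in the last line are element-wise as well.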

This optimizer has:

  • input parameters (lr, betas, eps)
  • the state
  • an optional closure that step can call to re-evaluate the model and return the loss
  • params, the parameters to optimize
  • self.param_groups, the parameter groups, each with its own hyperparameters

In this case the state holds, for each parameter, the step number, the running average of the gradient, and the running average of its element-wise (Hadamard) square.
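
The same layout can be inspected on a built-in optimizer. A small sketch using torch.optim.Adam, whose state entries exp_avg and exp_avg_sq play the role of agrad and agrad2; the tensor w and the loss are made up for the example:

```python
import torch

w = torch.ones(3, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=1e-3, betas=(0.9, 0.999))

# State is created lazily, so it is empty before the first step
state_before = len(optimizer.state)

loss = (w ** 2).sum()
loss.backward()
optimizer.step()

# Per-parameter state, keyed by the parameter tensor itself
state_keys = sorted(optimizer.state[w].keys())

# Hyperparameters live in param_groups, one dict per group
group = optimizer.param_groups[0]
lr, betas = group['lr'], group['betas']

# state_dict() bundles state and param_groups, e.g. for checkpointing
checkpoint_keys = sorted(optimizer.state_dict().keys())
```

Saving optimizer.state_dict() alongside the model weights lets training resume with the running averages intact.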

You probably won't write your own optimizers, but you will use the existing ones.
