
Optimizers supported by the PyTorch Framework


PyTorch is one of the fastest-growing deep learning frameworks and is used by many top companies such as Tesla, Apple, Qualcomm, and Facebook. It packages many algorithms, methods, and classes behind concise APIs, often a single line of code, to ease your day. PyTorch ships with many optimizer classes, such as Adadelta, Adam, and SGD, to name a few.

An optimizer takes the parameters we want to update and the learning rate we want to use, and it updates the weights through its step() method.

In this blog, we are going to look at 13 such optimizers supported by the PyTorch framework.



TORCH.OPTIM

torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used optimization methods are already supported, and the interface is general enough that more complex ones can easily be integrated in the future.

To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on their gradients.

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

 

optimizer = optim.Adam([var1, var2], lr=0.0001)
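Once the optimizer is constructed, training follows the usual pattern of clearing the gradients, backpropagating the loss, and calling step(). A minimal sketch (model, loss_fn, input, and target are assumed to be defined elsewhere):

optimizer.zero_grad()                    # clear gradients left over from the previous step
loss = loss_fn(model(input), target)     # forward pass and loss computation
loss.backward()                          # populate .grad on every parameter
optimizer.step()                         # update the parameters from their gradients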

 

 

AdaDelta Class

 

This class implements the Adadelta algorithm, proposed in the paper ADADELTA: An Adaptive Learning Rate Method. Adadelta does not require an initial learning rate constant to start with. You can implement it from scratch, without any torch method, by defining a function like this (a NumPy sketch, with the gradients passed in explicitly):

import numpy as np

def Adadelta(weights, grads, sqrs, deltas, rho, batch_size):
    # weights, grads, sqrs, and deltas are parallel lists of NumPy arrays.
    eps_stable = 1e-5
    for weight, grad, sqr, delta in zip(weights, grads, sqrs, deltas):
        g = grad / batch_size
        # Accumulate the squared gradient.
        sqr[:] = rho * sqr + (1. - rho) * np.square(g)
        # Scale the gradient by the ratio of accumulated updates to accumulated gradients.
        cur_delta = np.sqrt(delta + eps_stable) / np.sqrt(sqr + eps_stable) * g
        # Accumulate the squared update.
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # Update the weight in place.
        weight[:] -= cur_delta

 

With PyTorch, you can do the same with a single line of code, as shown below:

torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
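As a minimal usage sketch (the model and data below are toy placeholders), Adadelta plugs straight into a training step without a hand-tuned learning rate:

import torch

model = torch.nn.Linear(10, 1)                       # hypothetical toy model
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-06)

x, y = torch.randn(32, 10), torch.randn(32, 1)       # dummy batch
optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()   # the step size is derived from accumulated gradients and past updates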

 

 

 

AdaGrad Class

Adagrad (short for adaptive gradient) shrinks the learning rate for parameters that are frequently updated and gives a larger learning rate to sparse parameters, i.e. parameters that are updated less frequently. You can implement it from scratch like this (a NumPy sketch, where compute_gradients is assumed to be a function that returns the gradient for the current weights):

 

import numpy as np

def Adagrad(data, weights, compute_gradients, num_iterations, lr=0.01, epsilon=1e-10):
    # weights is a NumPy array; compute_gradients(data, weights) returns its gradient.
    gradient_sums = np.zeros_like(weights)
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)
        gradient_sums += gradients ** 2                        # accumulate squared gradients
        gradient_update = gradients / np.sqrt(gradient_sums + epsilon)
        weights = weights - lr * gradient_update               # per-parameter scaled step
    return weights

 

In many problems, the most critical information lives in features that occur infrequently. So, if your use-case involves sparse data, Adagrad can be useful. You can call the optimizer in torch with the command below:

 

 

torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
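As a rough sketch of the sparse-data use-case (all sizes here are arbitrary), Adagrad can consume the sparse gradients produced by an embedding layer built with sparse=True:

import torch

# Hypothetical embedding layer; sparse=True makes its gradient a sparse tensor.
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=16, sparse=True)
optimizer = torch.optim.Adagrad(embedding.parameters(), lr=0.01)

token_ids = torch.randint(0, 10000, (32,))   # dummy batch of token indices
optimizer.zero_grad()
loss = embedding(token_ids).sum()            # toy loss just to produce gradients
loss.backward()                              # only the rows seen in the batch get gradients
optimizer.step()                             # Adagrad applies the sparse gradient update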

 

There are drawbacks too: it is computationally more expensive, and the continually shrinking learning rate can make training slow.

 

 

 

Adam Class

 

 

Adam (Adaptive Moment Estimation) is one of the most popular optimizers. It combines the good properties of Adadelta and RMSprop into one and therefore tends to work well on most problems. You can call this class using the command below:

 

 

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
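As a minimal sketch (the two layers below are hypothetical stand-ins for parts of a larger model), Adam, like every torch.optim optimizer, also accepts per-parameter groups, which is handy when different parts of a network need different learning rates:

import torch

# Hypothetical two-part model just for illustration.
backbone = torch.nn.Linear(10, 10)
head = torch.nn.Linear(10, 1)

# Parameter groups let the head use a different learning rate than the backbone.
optimizer = torch.optim.Adam([
    {'params': backbone.parameters()},            # uses the default lr below
    {'params': head.parameters(), 'lr': 1e-4},    # overrides lr for this group
], lr=1e-3, betas=(0.9, 0.999))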

 

 

AdamW Class

 

AdamW is an improved version of the Adam class in which weight decay is applied only after the parameter-wise step size is computed. The weight decay (regularization) term therefore does not end up in the moving averages, and the decay applied is proportional only to the weight itself. The authors show empirically that AdamW yields better training loss and that the resulting models generalize much better than models trained with plain Adam, letting it compete with SGD with momentum.

 

In PyTorch, you can call this algorithm with the command below:

 

 

torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
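To make the difference concrete, here is a small sketch (with a hypothetical one-layer model) of the two ways weight decay enters; in practice you would construct only one of the two:

import torch

model = torch.nn.Linear(10, 1)   # hypothetical model

# Adam: weight_decay is added to the gradient as an L2 penalty,
# so it flows through the moving averages.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: weight_decay is applied directly to the weights after the adaptive step.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)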

 

 

 

SparseAdam Class

 

SparseAdam implements a lazy version of the Adam algorithm, suitable for sparse tensors.

In this variant, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.

 

In PyTorch, you can call this algorithm with the command below:

 

torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
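A minimal sketch of the intended use-case (layer sizes are arbitrary): SparseAdam expects its parameters to receive sparse gradients, so the dense parameters of the model are typically handed to a separate optimizer:

import torch

embedding = torch.nn.Embedding(10000, 16, sparse=True)   # produces sparse gradients
head = torch.nn.Linear(16, 1)                             # produces dense gradients

sparse_opt = torch.optim.SparseAdam(embedding.parameters(), lr=0.001)
dense_opt = torch.optim.Adam(head.parameters(), lr=0.001)

token_ids = torch.randint(0, 10000, (32,))                # dummy batch of indices
sparse_opt.zero_grad()
dense_opt.zero_grad()
loss = head(embedding(token_ids)).sum()                   # toy loss just to create gradients
loss.backward()
sparse_opt.step()   # lazily updates only the embedding rows seen in this batch
dense_opt.step()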

 

 

 

Adamax Class

 

This class implements the Adamax algorithm (a variant of Adam based on the infinity norm).

 

In PyTorch, you can call this algorithm with the command below:

 

torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

 

 

LBFGS Class

 

This class implements the L-BFGS algorithm, heavily inspired by minFunc (unconstrained differentiable multivariate optimization in MATLAB).

 

In PyTorch, you can call it as follows:

 

torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
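Unlike the other optimizers in this list, LBFGS needs to re-evaluate the function several times per update, so its step() method takes a closure that recomputes the loss. A minimal sketch (model, loss_fn, input, and target are assumed to be defined):

optimizer = torch.optim.LBFGS(model.parameters(), lr=1, max_iter=20)

def closure():
    optimizer.zero_grad()                    # clear gradients before re-evaluating
    loss = loss_fn(model(input), target)     # forward pass
    loss.backward()                          # recompute gradients
    return loss                              # LBFGS uses the returned loss value

optimizer.step(closure)                      # may call closure multiple times internally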

 

 

 

RMSprop Class

 

This class implements the RMSprop algorithm, proposed by G. Hinton.

 

The centered version first appears in Generating Sequences with Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus γ/(√v + ε), where γ is the scheduled learning rate and v is the weighted moving average of the squared gradient.

 

To use this optimizer in PyTorch, run the command below:

 

torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
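For instance, the centered variant described above is enabled with a single flag (a sketch assuming model is already defined):

# Centered RMSprop normalizes by an estimate of the gradient's variance
# instead of the uncentered second moment; momentum here is optional.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99,
                                momentum=0.9, centered=True)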

 

 

 

 

Rprop Class

 

This class implements the resilient backpropagation (Rprop) algorithm.

 

 

In PyTorch, you can call the optimizer with the command below:

 

 

torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
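Rprop adapts a separate step size per parameter from the sign of the gradient, so it is usually applied to full-batch gradients rather than noisy mini-batches. A minimal sketch with a hypothetical model and a small in-memory dataset:

import torch

model = torch.nn.Linear(10, 1)                     # hypothetical model
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01, etas=(0.5, 1.2))

X, y = torch.randn(100, 10), torch.randn(100, 1)   # full (toy) dataset
for epoch in range(50):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()                               # per-parameter step sizes grow or shrink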

 

 

SGD Class

 

Implements stochastic gradient descent (optionally with momentum).

 

 

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

 

A from-scratch sketch of mini-batch SGD (here backprop is assumed to be a function that computes the gradients for one mini-batch and updates the parameters):

 

import numpy as np

def SGD(data, batch_size, lr, backprop):
    # data is a list of (X, y) pairs; backprop updates the parameters for one mini-batch.
    N = len(data)
    np.random.shuffle(data)                       # shuffle the data each epoch
    mini_batches = [data[i:i + batch_size]        # split into mini-batches
                    for i in range(0, N, batch_size)]
    for mini_batch in mini_batches:
        X, y = zip(*mini_batch)                   # unpack the batch into inputs and targets
        backprop(np.array(X), np.array(y), lr)    # gradient step on this mini-batch

 

 

 

 

In PyTorch, you can call the optimizer with the command below:

 

 

torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

 

 

 

To use the optimizer when training a deep neural network, follow the example below:

 

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

optimizer.zero_grad()

loss_fn(model(input), target).backward()

optimizer.step()
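If you want the Nesterov variant mentioned above, enable it with the nesterov flag (it requires a non-zero momentum and zero dampening):

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)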

 

 

 

 

 

ASGD Class

 

This class implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.

 

It was proposed in the paper Acceleration of stochastic approximation by averaging.

 

In PyTorch, you can call the optimizer with the command below:

 

 

torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)

 

 

 

NAdam Class

 

 

This class implements the NAdam optimization algorithm.

 

For further details regarding the algorithm, you can refer to the paper Incorporating Nesterov Momentum into Adam.

 

 

In PyTorch, you can call the optimizer with the command below:

 

 

torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004, foreach=None)

 

 

RAdam Class

 

This class implements the RAdam optimization algorithm.

 

For more details regarding the algorithm, you can refer to the paper On the Variance of the Adaptive Learning Rate and Beyond.

 

In PyTorch, you can call the optimizer with the command below:

torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, foreach=None)

 

 

 

Conclusion

 

 

In this blog, we saw 13 optimizers supported by the PyTorch framework and how to use them. PyTorch is a powerful tool for deep learning models, whether you are working on a research problem or a business problem. For more in-depth details on each optimizer's parameters, you can refer to the official documentation here.
