
Optimizers supported by the PyTorch Framework


PyTorch is one of the most widely used deep learning frameworks, adopted by companies such as Tesla, Apple, Qualcomm, Facebook, and many more. It wraps many algorithms, methods, and classes behind short, simple calls to ease your day. PyTorch has many optimizer classes, such as Adadelta, Adam, and SGD, to name a few.

An optimizer takes the parameters we want to update and the learning rate we want to use, and it updates the weights through its step() method.

In this blog, we are going to see 13 such optimizers which are supported by the PyTorch framework.



TORCH.OPTIM

torch.optim is a PyTorch package that implements various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can be easily integrated in the future.

To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

 

optimizer = optim.Adam([var1, var2], lr=0.0001)
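The two constructor calls above assume that model, var1, and var2 already exist. As a minimal self-contained sketch, the snippet below builds a small linear model on synthetic data (the model, the data, and the hyperparameters are illustrative assumptions, not taken from any particular project) and runs the usual zero_grad / backward / step cycle:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy problem: recover a linear map from noise-free samples.
model = nn.Linear(3, 1)
x = torch.randn(64, 3)
y = x @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.3
loss_fn = nn.MSELoss()

# The optimizer holds references to the parameters plus its own state
# (here: one momentum buffer per parameter).
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

initial_loss = loss_fn(model(x), y).item()
for _ in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients
    optimizer.step()               # update the parameters in place
final_loss = loss_fn(model(x), y).item()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Any other optimizer in torch.optim can be swapped in by changing only the constructor line; the training loop stays the same.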

 

 

AdaDelta Class

 

This class implements the Adadelta algorithm, which was proposed in the paper ADADELTA: An Adaptive Learning Rate Method. Adadelta does not require you to hand-tune an initial learning rate. You can implement it by hand, without torch.optim, by defining a function like this:

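from mxnet import nd  # assumption: nd is MXNet's NDArray module, not a torch module

def Adadelta(weights, sqrs, deltas, rho, batch_size):
    eps_stable = 1e-5  # keeps the square roots away from zero
    for weight, sqr, delta in zip(weights, sqrs, deltas):
        # average the accumulated gradient over the mini-batch
        g = weight.grad / batch_size
        # running average of squared gradients
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        # scale the gradient by RMS(past updates) / RMS(gradients)
        cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
        # running average of squared updates
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # update weight in place.
        weight[:] -= cur_delta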

 

With the help of PyTorch, you can do the same with just a single line of code, as shown below:

torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
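To connect the hand-written update rule with the torch class, here is a small sketch that applies one torch.optim.Adadelta step to a single parameter and reproduces it by hand. The parameter value and the gradient are made up for illustration, and the manual formulas assume the default arguments shown above:

```python
import torch

rho, eps, lr = 0.9, 1e-6, 1.0

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.Adadelta([p], lr=lr, rho=rho, eps=eps)

g = torch.tensor([1.0])        # illustrative gradient, set by hand
p.grad = g.clone()
opt.step()

# The same first step computed manually (both accumulators start at zero).
square_avg = (1 - rho) * g ** 2
delta = (torch.zeros(1) + eps).sqrt() / (square_avg + eps).sqrt() * g
expected = torch.tensor([1.0]) - lr * delta

print(p.detach(), expected)
```

Note that the first update is tiny (on the order of sqrt(eps)), because the accumulator of past updates starts at zero.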

 

 

 

AdaGrad Class

Adagrad (short for adaptive gradient) lowers the learning rate for parameters that are updated frequently and, in effect, gives a larger learning rate to sparse parameters, that is, parameters that are updated less often. You can implement it without any optimizer class like this:

 

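import numpy as np

# compute_gradients is a caller-supplied function: (data, weights) -> gradient array
def Adagrad(data, weights, compute_gradients, lr=0.01, num_iterations=100, epsilon=1e-10):
    # running sum of squared gradients, one entry per weight
    gradient_sums = np.zeros_like(weights, dtype=float)
    for t in range(num_iterations):
        gradients = compute_gradients(data, weights)
        gradient_sums += gradients ** 2
        # weights with a large gradient history take smaller steps
        gradient_update = gradients / (np.sqrt(gradient_sums + epsilon))
        weights = weights - lr * gradient_update
    return weights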

 

In many problems, the most critical information is carried by features that appear infrequently in the data. So, if the use case you are working on involves sparse data, Adagrad can be useful. With torch, you can call the optimizer using the command below:

 

 

torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)

 

It has drawbacks too: it keeps an extra accumulator for every parameter, and because the accumulated squared gradients only grow, the effective learning rate keeps decreasing, which can make training slow.
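The shrinking step size is easy to observe. The sketch below is a toy setup with a constant, hand-set gradient of 1.0 (an assumption chosen to make the arithmetic obvious). It records how far torch.optim.Adagrad moves a single parameter on each step:

```python
import torch

p = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.Adagrad([p], lr=0.1)

step_sizes = []
for _ in range(4):
    before = p.item()
    p.grad = torch.tensor([1.0])   # the same gradient every time
    opt.step()
    step_sizes.append(before - p.item())

# With a constant gradient g = 1, step t has size lr / sqrt(t).
print([round(s, 4) for s in step_sizes])
```

Even though the gradient never changes, each step is smaller than the previous one, because the sum of squared gradients in the denominator keeps growing.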

 

 

 

Adam Class

 

 

Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of the AdaGrad and RMSprop optimizers into one and hence tends to do well on most problems. You can call this class using the command below:

 

 

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
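As a rough illustration of what "moment estimation" means, the sketch below performs one torch.optim.Adam step on a hand-set gradient and recomputes it manually from the first and second moment estimates. The parameter values and gradient are illustrative assumptions, and the manual formulas assume the default betas and eps shown above:

```python
import torch

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

p = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([p], lr=lr, betas=(beta1, beta2), eps=eps)

g = torch.tensor([0.5, -3.0])      # illustrative gradient
before = p.detach().clone()
p.grad = g.clone()
opt.step()

# Manual first step: both moments start at zero.
m = (1 - beta1) * g                # first moment (mean of gradients)
v = (1 - beta2) * g ** 2           # second moment (mean of squared gradients)
m_hat = m / (1 - beta1)            # bias correction for step 1
v_hat = v / (1 - beta2)
expected = before - lr * m_hat / (v_hat.sqrt() + eps)

print(p.detach(), expected)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so every parameter moves by roughly lr regardless of the gradient's magnitude.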

 

 

AdamW Class

 

AdamW is an improved version of Adam, proposed in the paper Decoupled Weight Decay Regularization, in which weight decay is performed only after controlling the parameter-wise step size. The weight decay (regularization) term therefore does not end up in the moving averages, and its effect stays proportional to the weight itself. The authors show empirically that AdamW yields better training loss and that the resulting models generalize better than models trained with Adam, allowing it to compete with SGD with momentum.

 

In PyTorch, you can call this algorithm using the command below:

 

 

torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
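The difference between the two ways of applying weight decay shows up clearly when the gradient is zero. In the sketch below (a toy setup with assumed values lr=0.1 and weight_decay=0.1), Adam folds the decay term into the gradient, so it passes through the moving averages, while AdamW shrinks the weight directly:

```python
import torch

def one_step(optimizer_cls):
    """Run a single step with a zero gradient and return the new weight."""
    p = torch.tensor([1.0], requires_grad=True)
    opt = optimizer_cls([p], lr=0.1, weight_decay=0.1)
    p.grad = torch.zeros_like(p)   # no loss gradient, only weight decay acts
    opt.step()
    return p.item()

adam_result = one_step(torch.optim.Adam)
adamw_result = one_step(torch.optim.AdamW)

# Adam:  decay enters the gradient and is rescaled by the adaptive step,
#        so the weight moves by about lr (1.0 -> ~0.9).
# AdamW: weight is multiplied by (1 - lr * weight_decay) (1.0 -> 0.99).
print(adam_result, adamw_result)
```

With AdamW the amount of decay depends only on the weight, the learning rate, and weight_decay, which is what "decoupled" refers to.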

 

 

 

SparseAdam Class

 

SparseAdam implements a lazy version of the Adam algorithm that is suitable for sparse tensors.

In this variant of the Adam optimizer, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.

In PyTorch, you can call this algorithm using the command below:

 

torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
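A common source of sparse gradients is an embedding layer created with sparse=True. The following sketch (the table size and lookup indices are arbitrary assumptions) looks up two rows of a small embedding table and checks which rows SparseAdam actually changed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# sparse=True makes the embedding produce a sparse gradient for its weight.
emb = nn.Embedding(num_embeddings=5, embedding_dim=2, sparse=True)
opt = torch.optim.SparseAdam(list(emb.parameters()), lr=0.1)

before = emb.weight.detach().clone()

idx = torch.tensor([1, 3])         # only rows 1 and 3 take part in the loss
loss = emb(idx).sum()
opt.zero_grad()
loss.backward()
opt.step()

after = emb.weight.detach()
changed_rows = (after != before).any(dim=1)
print(changed_rows)
```

Only the rows that appeared in the lookup are updated; the other rows and their moment estimates are left untouched.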

 

 

 

Adamax Class

It implements the Adamax algorithm, a variant of Adam based on the infinity norm.

In PyTorch, you can call this algorithm using the command below:

 

torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

 

 

LBFGS Class

 

This class implements the L-BFGS algorithm, which is heavily inspired by minFunc (unconstrained differentiable multivariate optimization in Matlab).

In PyTorch, you can call it as follows:

 

torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
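Unlike most other optimizers, LBFGS may need to re-evaluate the loss several times per step, so its step() method expects a closure that clears the gradients, computes the loss, calls backward(), and returns the loss. A minimal sketch on a made-up quadratic problem (the target vector is an illustrative assumption):

```python
import torch

target = torch.tensor([1.0, -2.0, 3.0])
x = torch.zeros(3, requires_grad=True)

opt = torch.optim.LBFGS([x], lr=1, max_iter=20)

def closure():
    # LBFGS calls this function whenever it needs a fresh loss and gradient.
    opt.zero_grad()
    loss = ((x - target) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)

print(x.detach())
```

Each call to step(closure) can run up to max_iter inner iterations, so far fewer outer steps are needed than with SGD-style optimizers.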

 

 

 

RMSprop Class

 

This class implements the RMSprop algorithm, which was proposed by G. Hinton.

The centered version first appears in Generating Sequences with Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus $\gamma / (\sqrt{v} + \epsilon)$, where $\gamma$ is the scheduled learning rate and $v$ is the weighted moving average of the squared gradient.

To use this optimizer in PyTorch, you can use the command below:

 

torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
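The effective learning rate formula can be checked on a single step. The sketch below uses a hand-set gradient of 2.0 (an illustrative assumption) with the default lr, alpha, and eps, and compares the torch result with the formula, taking the square root before adding epsilon:

```python
import torch

lr, alpha, eps = 0.01, 0.99, 1e-8

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.RMSprop([p], lr=lr, alpha=alpha, eps=eps)

g = torch.tensor([2.0])            # illustrative gradient
p.grad = g.clone()
opt.step()

# v is the weighted moving average of the squared gradient (starts at zero).
v = (1 - alpha) * g ** 2           # 0.01 * 4 = 0.04
effective_lr = lr / (v.sqrt() + eps)   # gamma / (sqrt(v) + eps) = 0.01 / 0.2
expected = torch.tensor([1.0]) - effective_lr * g

print(p.detach(), expected)
```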

 

 

 

 

Rprop Class

 

This class implements the resilient backpropagation algorithm.

In PyTorch, you can use the command below to call the optimizer:

 

 

torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))

 

 

SGD Class

 

This class implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Implementation of the algorithm:

 

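import numpy as np

# backprop is a caller-supplied function that runs one update on a mini-batch
def SGD(data, batch_size, lr, backprop):
    # data is a list of (x, y) training pairs
    N = len(data)
    np.random.shuffle(data)
    # split the shuffled data into mini-batches (the last one may be smaller)
    mini_batches = [data[i:i+batch_size]
                    for i in range(0, N, batch_size)]
    for batch in mini_batches:
        X, y = zip(*batch)  # separate inputs from targets
        backprop(X, y, lr)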

 

 

 

 

In PyTorch, you can call the algorithm using the command below:

 

 

torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)

 

 

 

To use the optimizer when training a deep neural network model, follow the example below:

 

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

optimizer.zero_grad()

loss_fn(model(input), target).backward()

optimizer.step()
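To see what the momentum argument does, here is a small sketch that feeds torch.optim.SGD the same hand-set gradient three times and tracks the velocity buffer manually. The values lr=0.1 and momentum=0.9 are illustrative, and the manual rule assumes dampening=0 and nesterov=False:

```python
import torch

lr, mu = 0.1, 0.9

p = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=lr, momentum=mu)

w, buf = 0.0, 0.0                  # manual weight and velocity buffer
for _ in range(3):
    p.grad = torch.tensor([1.0])   # constant gradient of 1.0
    opt.step()
    buf = mu * buf + 1.0           # velocity accumulates past gradients
    w = w - lr * buf               # step along the velocity

# Velocities are 1.0, 1.9, 2.71, so the weight ends near -0.561.
print(p.item(), w)
```

With a constant gradient the steps grow from one iteration to the next, which is how momentum speeds up progress along consistent directions.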

 

 

 

 

 

ASGD Class

 

It implements the Averaged Stochastic Gradient Descent (ASGD) algorithm.

It was proposed in the paper Acceleration of stochastic approximation by averaging.

In PyTorch, you can call the algorithm using the command below:

 

 

torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)

 

 

 

NAdam Class

 

 

This class implements the NAdam optimization algorithm.

For further details regarding the algorithm, you can refer to the paper Incorporating Nesterov Momentum into Adam.

In PyTorch, you can call the algorithm using the command below:

 

 

torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004, foreach=None)

 

 

RAdam Class

 

It implements the RAdam optimization algorithm.

For more details regarding the algorithm, you can refer to the paper On the variance of the adaptive learning rate and beyond.

In PyTorch, you can call the algorithm using the command below:

torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, foreach=None)
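Because all of these classes share the same interface, it is easy to try several of them on the same problem. The sketch below is only an illustration: the quadratic objective, the step count, and the learning rates are arbitrary assumptions, not recommended settings. It runs a handful of optimizers and reports the final loss for each:

```python
import torch

def final_loss(make_optimizer, steps=200):
    """Minimize a toy quadratic with the given optimizer factory."""
    target = torch.tensor([1.0, -2.0, 3.0])
    w = torch.zeros(3, requires_grad=True)
    opt = make_optimizer([w])
    for _ in range(steps):
        opt.zero_grad()
        loss = ((w - target) ** 2).sum()
        loss.backward()
        opt.step()
    return ((w - target) ** 2).sum().item()

initial_loss = 14.0  # loss at w = 0: 1 + 4 + 9

# Each entry maps a name to a function that builds the optimizer.
candidates = {
    "SGD": lambda ps: torch.optim.SGD(ps, lr=0.1, momentum=0.9),
    "Adagrad": lambda ps: torch.optim.Adagrad(ps, lr=0.5),
    "RMSprop": lambda ps: torch.optim.RMSprop(ps, lr=0.05),
    "Adam": lambda ps: torch.optim.Adam(ps, lr=0.1),
    "AdamW": lambda ps: torch.optim.AdamW(ps, lr=0.1),
    "NAdam": lambda ps: torch.optim.NAdam(ps, lr=0.1),
    "RAdam": lambda ps: torch.optim.RAdam(ps, lr=0.1),
}

results = {name: final_loss(make) for name, make in candidates.items()}
for name, value in results.items():
    print(f"{name:8s} final loss = {value:.6f}")
```

The ranking on such a toy problem says little about real models; the point is only that switching optimizers is a one-line change.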

 

 

 

Conclusion

 

 

In this blog, we saw 13 optimizers supported by the PyTorch framework and how to call them. PyTorch is a powerful tool for deep learning models, whether you are working on a research problem or a business problem. For more in-depth details on the parameters, you can refer to the official documentation.
