PyTorch is one of the fastest-growing deep learning frameworks and is used by many large companies such as Tesla, Apple, Qualcomm, and Facebook. It wraps many algorithms, methods, and classes behind simple one-line APIs, and it ships with a wide range of optimizer classes such as AdaDelta, Adam, and SGD, to name a few.
An optimizer takes the parameters we want to update and the learning rate we want to use, and it updates the weights through its step() method.
In this blog, we are going to look at 13 optimizers supported by the PyTorch framework.
Table of contents
- TORCH.OPTIM
- AdaDelta Class
- AdaGrad Class
- Adam Class
- AdamW Class
- SparseAdam Class
- Adamax Class
- LBFGS Class
- RMSprop Class
- Rprop Class
- SGD Class
- ASGD Class
- NAdam Class
- RAdam Class
- Conclusion
TORCH.OPTIM
torch.optim is a PyTorch package containing various optimization algorithms. The most commonly used optimizers are already implemented, and the interface is simple enough that more complex ones can easily be integrated in the future.
To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on their gradients.
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
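torch.optim also accepts per-parameter options through parameter groups. As a minimal sketch (assuming a model with base and classifier submodules, which are placeholder names of my own), you could give the classifier head its own learning rate like this:
optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
], lr=1e-2, momentum=0.9)
Parameters in the first group fall back to the default lr=1e-2, while the classifier group overrides it.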
AdaDelta Class
It implements the Adadelta algorithm, which was proposed in the paper ADADELTA: An Adaptive Learning Rate Method. Adadelta does not require an initial learning rate constant to start with. You can implement it without any torch method by defining a function like this:
from mxnet import nd  # this snippet operates on MXNet NDArrays with attached gradients

def Adadelta(weights, sqrs, deltas, rho, batch_size):
    eps_stable = 1e-5
    for weight, sqr, delta in zip(weights, sqrs, deltas):
        # average the accumulated gradient over the mini-batch
        g = weight.grad / batch_size
        # running average of squared gradients
        sqr[:] = rho * sqr + (1. - rho) * nd.square(g)
        # rescale the gradient by the ratio of the two running averages
        cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g
        # running average of squared updates
        delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta
        # update the weight in place
        weight[:] -= cur_delta
With the help of PyTorch, you can do the same with just a single line of code, as shown below:
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
AdaGrad Class
Adagrad (short for adaptive gradient) penalizes the learning rate of parameters that are frequently updated and gives a larger learning rate to sparse parameters, i.e. parameters that are not updated as frequently. You can implement it without any class like this:
import numpy as np

def Adagrad(data, weights, compute_gradients, lr=0.01,
            num_iterations=100, epsilon=1e-8):
    # running sum of squared gradients, one entry per parameter
    gradient_sums = np.zeros(weights.shape[0])
    for t in range(num_iterations):
        # compute_gradients is assumed to return the gradient of the loss
        # with respect to `weights` for the given data
        gradients = compute_gradients(data, weights)
        gradient_sums += gradients ** 2
        # scale each gradient by the accumulated history for that parameter
        gradient_update = gradients / np.sqrt(gradient_sums + epsilon)
        weights = weights - lr * gradient_update
    return weights
In many problems, the most critical information is carried by features that appear infrequently. So, if the use-case you are working on involves sparse data, Adagrad can be useful. You can call the optimizer with the help of torch using the command below:
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
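As a minimal sketch of the sparse-data case (the layer sizes here are arbitrary), Adagrad can consume the sparse gradients produced by an embedding layer created with sparse=True:
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=32, sparse=True)
optimizer = torch.optim.Adagrad(embedding.parameters(), lr=0.01)

indices = torch.randint(0, 10000, (16,))   # a toy batch of token ids
loss = embedding(indices).sum()            # any scalar loss will do here
optimizer.zero_grad()
loss.backward()                            # produces a sparse gradient
optimizer.step()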
But there are some drawbacks too: Adagrad is computationally expensive, and its learning rate keeps shrinking, which can make training slow.
Adam Class
Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers. It combines the good properties of Adadelta and RMSprop and hence tends to do well on most problems. You can simply call this class using the command below:
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
AdamW Class
The authors of AdamW suggested an improved version of Adam in which weight decay is performed only after the parameter-wise step size has been computed. The weight decay (regularization) term therefore does not end up in the moving averages, and its contribution to the update is proportional only to the weight itself. The authors show empirically that AdamW yields a better training loss and that the models generalize much better than models trained with Adam, allowing AdamW to compete with SGD with momentum.
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
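To make the distinction concrete, here is a simplified NumPy sketch (not the actual torch implementation; bias correction is omitted and the names are my own) contrasting Adam-style L2 regularization with AdamW's decoupled weight decay:
import numpy as np

def adam_like_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                   wd=0.01, decoupled=False):
    if not decoupled:
        # Adam + L2: the decay term is folded into the gradient,
        # so it also flows into the moment estimates below.
        g = g + wd * w
    m = b1 * m + (1 - b1) * g              # first moment (moving average)
    v = b2 * v + (1 - b2) * g ** 2         # second moment (moving average)
    w = w - lr * m / (np.sqrt(v) + eps)    # parameter-wise adaptive step
    if decoupled:
        # AdamW: decay applied directly to the weight, after the adaptive step.
        w = w - lr * wd * w
    return w, m, v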
SparseAdam Class
SparseAdam implements a lazy version of the Adam algorithm which is suitable for sparse tensors. In this variant, only the moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
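Since SparseAdam only accepts sparse gradients, a common pattern (sketched below with arbitrary layer sizes) is to give the sparse embedding its own SparseAdam instance and optimize the remaining dense parameters with a regular optimizer such as Adam:
import torch
import torch.nn as nn

embedding = nn.Embedding(50000, 64, sparse=True)   # produces sparse gradients
dense_head = nn.Linear(64, 1)                      # produces dense gradients

sparse_opt = torch.optim.SparseAdam(embedding.parameters(), lr=0.001)
dense_opt = torch.optim.Adam(dense_head.parameters(), lr=0.001)

ids = torch.randint(0, 50000, (8,))
loss = dense_head(embedding(ids)).sum()
sparse_opt.zero_grad()
dense_opt.zero_grad()
loss.backward()
sparse_opt.step()   # updates only the embedding rows that appeared in `ids`
dense_opt.step()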
Adamax Class
It implements the Adamax algorithm (a variant of Adam based on the infinity norm).
In PyTorch, you can simply call this algorithm using the command below:
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
LBFGS Class
This class implements the L-BFGS algorithm, heavily inspired by minFunc (minFunc – unconstrained differentiable multivariate optimization in Matlab).
In PyTorch, you can simply call it with the help of the torch method:
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)
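Unlike the other optimizers in this post, LBFGS needs to re-evaluate the objective several times per step, so step() must be given a closure. A minimal sketch, reusing the model, loss_fn, input, and target names from the SGD example later in this post:
optimizer = torch.optim.LBFGS(model.parameters(), lr=1, max_iter=20)

def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    loss.backward()
    return loss

optimizer.step(closure)   # LBFGS calls the closure as many times as it needs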
RMSprop Class
This class implements the RMSprop algorithm, which was proposed by G. Hinton. The centered version first appears in Generating Sequences with Recurrent Neural Networks. The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus γ/(√v + ε), where γ is the scheduled learning rate and v is the weighted moving average of the squared gradient.
To implement this optimizer in PyTorch, you can use the command below:
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Rprop Class
This class implements the resilient backpropagation (Rprop) algorithm.
In PyTorch, you can use the command below to call the optimizer:
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
SGD Class
Implements stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
Implementation of the algorithm:
import numpy as np

def SGD(data, batch_size, lr):
    # `data` is a list of (x, y) pairs; `backprop` is assumed to perform one
    # gradient update on a mini-batch with learning rate `lr`
    N = len(data)
    np.random.shuffle(data)
    mini_batches = [data[i:i + batch_size] for i in range(0, N, batch_size)]
    for mini_batch in mini_batches:
        X = np.array([x for x, y in mini_batch])
        y = np.array([y for x, y in mini_batch])
        backprop(X, y, lr)
In PyTorch, you can implement the algorithm by calling the command below:
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
To use the algorithm when building a deep neural network model, you can follow the example below:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()
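If you want the Nesterov momentum mentioned above, pass nesterov=True (PyTorch requires a positive momentum and zero dampening in that case):
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)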
ASGD Class
It implements the Averaged Stochastic Gradient Descent (ASGD) algorithm, which was proposed in the paper Acceleration of stochastic approximation by averaging.
In PyTorch, you can implement the algorithm by calling the command below:
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
NAdam Class
This class implements the NAdam optimization algorithm. For further details regarding the algorithm, you can refer to the paper Incorporating Nesterov Momentum into Adam.
In PyTorch, you can implement the algorithm by calling the command below:
torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004, foreach=None)
RAdam Class
It implements the RAdam optimization algorithm. For more details regarding the algorithm, you can refer to the paper On the variance of the adaptive learning rate and beyond.
In PyTorch, you can implement the algorithm by calling the command below:
torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, foreach=None)
Conclusion
In this blog, we looked at 13 optimizers supported by the PyTorch framework and their implementations. PyTorch is a powerful tool for building deep learning models, whether the problem comes from research or business. For more in-depth details on the parameters, you can refer to the official documentation here.