Popular Optimization Algorithms In Deep Learning

November 2, 2020 Venkatesh Nagilla

Popular Optimization Algorithms In Deep Learning

Building a well optimized, deep learning model is always a dream. To build such models, we need to study about various optimization algorithms in deep learning.

The optimization algorithm plays a key in achieving the desired performance for the models.

If using the best optimization algorithm helps in achieving the desired performance. Why can’t we go ahead and use the best one?

Unfortunately, we can’t comment saying, the best optimization algorithms work well with all the deep learning algorithms.

In some cases, it would be better to use the standard optimization algorithms to get the desired results.

So how do we select the best optimization algorithms for the model we are building?

Learn about the popular optimisation algorithms used in deep learning #optimization #deeplearning #SGD #Adam

Click to Tweet

So how do we select the best optimization algorithms for the model we are building?

The answer is quite simple.

Learn all the popular optimization algorithms out there and pick the one which best suits the deep learning model you are building.

This is similar to the approach of learning various and using the most popular activation functions to create the best deep learning and neural networks modelling architecture.

If you’re new to deep learning or neural networks, we recommend you to please study the introduction to neural network basics before going forward to read this article.

To give you a high overview. In this article, we will learn various most popular optimization algorithms in deep learning. What are the advantages and drawbacks of each of these algorithms in detail.

In particular, we spend more time understanding the stochastic gradient descent (SGD) optimization.

Before we dive further, let’s look at what you are going to learn in this article.

Before we learn about various optimization algorithms.

First, let’s discuss why we need a better optimization algorithm as the performance of machine learning models or the deep learning models depends on the data we feed.

Why we need optimization algorithms

The essential step in building a deep learning model is solving the underlying complexity of the problem without overfitting it. Feeding more data will improve the model performance, but it's not the case all the time.

Even thought we have data augmentation methods to increase the data. Still learning the patterns from the data is a challenging problem to solve.

If we are trying to optimize the problem for the given train dataset. Chances it will fail in production or for the data it has never seen is very high.

So we need to be very conscious about how we use the optimization functions and the loss functions.

We are going to cover the optimization function in a while. If you want to know about the loss function, below is a quick explanation for your reference.

What is the loss function?

Now let’s learn what the loss function is?

The best way to represent the relationship of the predicted values and actual values is to showcase how close two series (actual values and predicted values).

How about having a value which signifies the closeness of the actual values and predicted values?

One way of showing the closeness is the loss function. If the values output from the loss function is minimum. Then both the actual and the predicted values are having minimal differences.

It means the algorithm is working correctly, or in other words, we can say it’s working close to the real values.

This is the reason why we need a function to determine the closeness of the actual and the target values, and the minimum the loss function value determines the performance of the model.

We know the minimum loss function value the closer the actual and the predicted values, then how do we optimize the loss function to get a low loss value.

What is optimization?

We learned loss function; how can we optimize (get less loss function value) this for the given data?

In Deep Learning, these loss functions are very complex, which means high dimensional functions usually have hundreds of parameters.

How do we optimize all these parameters to get the least loss value or the value returned by the loss function?

Where comes the optimizations and the optimization algorithms.

If you are familiar with building supervised machine learning models like regression algorithms. Then you are already aware of the gradient descent algorithm, which is one kind of optimization algorithm.

All the optimization algorithms we are discussing in this article is on top of the gradient descent algorithm.

So let’s understand gradient descent before we are moving to stochastic gradient descent.

Gradient Descent Algorithm

Gradient descent is a mathematical method of determining a minimum of a variable function. In deciding this minimum of a variable function via gradient descent, and then we move proportionally to the negative of the gradient of the function at the current point.

This means if we move proportionally towards the positive of the gradient, then we come to the maximum of the function. Popularly known as the gradient descent.

For explaining the gradient descent algorithm, we are going to publish an article very soon. Until then, the above explanation is enough for going through this article.

What is Stochastic Gradient Descent?

Stochastic gradient descent

Stochastic gradient descent allows you to calculate the gradient of a function and follow the negative gradient of that function.

Now the gradient means a bunch of derivatives. It Tells us basically how the function is changing in different directions.

Suppose You are moving towards the valley of the complex surface, i.e., in a negative direction. You stop at one point and again re-calculate the gradient and keep doing it until the function is not improving.

To better understand Stochastic gradient descent, let’s take a random search.

Why?

The idea of random is the current approximation of the solution is randomly perturbed. If the configuration is better than the current approximation, then the perturbation is accepted.

Many of us didn’t realize that a stochastic algorithm is nothing else than a random search.

Unlike random in the stochastic gradient descent, every time meaningfully hints. We update the weights to get better loss with hints by a chosen heuristics to guide the next potential solution to evaluate.

Computing the gradient can be very time consuming. However, often it is possible to find a “cheap” approximation of the gradient.

Approximating the gradient is still useful as long as its points are roughly in the same direction as the true gradient.

Stochastic gradient descent is a random approximation of the gradient descent method for minimizing the function which is written as a sum of differentiable functions.

The word stochastic here refers to the fact that we acknowledge that we do not know the gradient precisely but instead only know a noisy approximation to it.

By constraining the probability distribution of the approximate gradients, we can still theoretically guarantee that Stochastic gradient descent will converge.

Understanding Stochastic gradient descent with an example

To understand the stochastic gradient, let’s take an example.

Let’s take this very simple function:

The derivative of the function is 2x.

If x = 1, then the derivative is equal to 2.

This means the slope is positive, and we have a decrease in x to get closer to the minimum. If the slope was negative, we could increase the x value to get closer to the minimum.

Remember, Stochastic gradient descent updates parameters in the reverse direction of the slope.

Stochastic gradient descent — many parameters

In deep neural networks, every weight is a parameter. As deep learning models are higher dimensional, there could be millions or even more parameters.

Still, Stochastic gradient descent works the same just you need to compute the partial derivatives of the given function.

You think of Stochastic gradient descent as a Travelling Salesman Problem (TSP), local search, or hill climbing.

An example of walking down the mountain step by step to reach a low point in the valley.

Stochastic gradient descent example

By now, you understood Stochastic gradient descent. It’s not that complicated.

The number of dimensions doesn’t matter; we have a simple iterative algorithm that lets us find the smallest value possible for our function.

Does it really work every single time?

No. We end up with many challenges in optimization.

We have solutions to solve these challenges too. Before we address how to solve these challenges, first, let’s understand these challenges.

Challenges in Optimization Algorithms

Challenges in optimization algorithms

We end up various challenges while handling the optimisations, Below are the listed challenges.

Local Minima
Slow convergence
Different slopes
Gradient size & distributed training
Saddle points

Let's understand about these challenge in much better way.

Challenge 1: Local minima

A function may exhibit many local minima. Look at the weird beast below.

Local minima example

There are lots of shallow local minima. Some are deeper ones and a global one.

As we iterate on Stochastic gradient descent, we want to avoid getting stuck in any of the shallow ones — this happens if we use a small learning rate.

Ideally, we end up that global minimum, but we could end up falling into a deep one.

Challenge 2: Slow convergence

Imagine the parameter space where the slope would be close to zero in every dimension.

There all the components of the gradient would be close to zero, right?

Hardly any slope.

The consequence would be near-zero updates of all the weights, which would mean that they hardly move towards the minimum.

We’d stuck at one point for a longer time, and training would be extremely slow, no matter how much hardware we use—definitely an undesirable solution (unless we’d reach an excellent minimum).

Challenge 3: Different slopes

We cannot accept all the dimensions that could have the same slope. There could be steep dimensions to make right quick moves or flatter dimensions where we can stick or move much slower.

As we know, Stochastic gradient descent uses the same learning rate for all parameters. There could be uneven progress, which would eventually cause slowing down the training process.

Challenge 4: Gradient size & distributed training

Imagine that we’re working with an extensive data set: distributed training — splitting computation across multiple instances would surely deliver a nice speedup.

We can store our sample on the hard drive, read one sample at a time, and do an update to all the remaining samples.

Challenge 5: Saddle points

Now, imagine we’d reach a specific point in the parameter space where all components of the gradient are actually equal to zero.

What would happen then?

No more weights updates. We’d be stuck there for infinitely.

Defeating the gradient

Let’s look at the example:

This does look like a horse saddle.

Saddle points

Now, let’s compute the partial derivatives:

Hence, the gradient of this function is, at the point (0,0), both components of the gradient are equal to zero.

Look at the above graph. We can say that this point is a minimum along the x-axis and a maximum along the y-axis. Such points are called saddle points: a minimum or a maximum for every dimension.

In higher dimensions, saddle points are more common. At saddle points, Hessian has both positive and negative values.

What is Hessian?

Now we can see our problem statement in a single direction. Instead, we look at curvature around the saddle point, then there could be a way down along the y-axis.

Then we need to use second-order partial derivatives for computation. The output stored in a square matrix called the Hessian.

Let’s do this for our previous example:

Thus, the Hessian for our functions is:

By multiplying this matrix with unit vectors along the x and y axes, we’re going to find out what the curvature looks like:

H * [0, 1] = [0 -2] = -2*[0, 1]

H * [0, -1] = [0 2] = -2*[0, -1]

H * [1, 0] = [2 0] = 2*[1, 0]

H * [ -1, 0] = [-2 0] = 2*[-1, 0]

If you observe the above multiplication of H by a unit vector along the x-axis gives a positive multiple of the vector, meaning it only goes up. Indeed, (0,0) is a minimum along the x-axis.

In the other direction. Multiplying H by a unit vector along the y-axis gives a negative multiple of the vector. This indicates a negative curvature, which means that there is a way down.

But the problem here is computing the Hessian is quite expensive.

Defeating the Hessian

Here’s an example called the monkey saddle:

Hessian monkey saddle

Let’s compute the gradient and the Hessian again.

The gradient is [3x^2- 3y^2, -6xy], which is equal to [0, 0] at point (0,0). This is a saddle point again.

Accordingly, the Hessian is:

If we multiply the matrix by a unit vector, it will result in a zero vector. So we can’t find curvature. In this case, the gradient nor the Hessian provide any information on which way is down.

Now we will discuss solutions to these problems and how to apply them.

Solutions for the challenges

Solution 1: Local Minima

In deep neural networks, the error surfaces are guaranteed to have a large and, in some cases, an infinite. So it is not a big issue here.

If there is a local minimum then the slope in all directions is zero. Which is very unlikely and which means if you have 100-dimensional vectors, then in all the 100 directions of the slope will be zero.

Solution 2 & 3: Slow convergence and different slopes

Let’s assume we have a small learning rate for a high dimensional deep neural network, then it takes a longer time to reach a minimum. Here these two problems are related.

To overcome these problems, we use techniques like Momentum and Nesterov Momentum that we're used to progress in the direction of the steepest distance.

Momentum

The momentum is a technique for accelerating gradient descent that accumulates a velocity vector to move in the same direction as previously across iterations.

Given an objective function to be minimized, momentum is given by:

To understand momentum better, let's take a real life example of riding a bike in the mountain range.

Momentum reference example

As the figure shows, there are cyclists who came to climb mountain ranges. Since it is a mountain range, we naturally have vertical and horizontal land. Up and down, up and down.

But we want to take cyclists deep to the bottom of the mountains. So, we want to stop at some part of the road that has the lowest elevation.

The only problem we can see here is we can’t move faster since we have lots of ups and downs. But eventually, we should start at one point, so we began to.

As our “cycle” moves downwards, it’s gaining more and more speed. So we are just moving towards the downhill.

Will this hill stop us?

No, because we are gaining a lot of momentum! So we pass the small hill and another small hill and another and another.

Finally, after doing like this forever, we find ourselves facing a very tall hill. Maybe it’s the bottom of the mountain range. The “cycle” stopped. We could see the deepest valley of the mountain!

That’s exactly what momentum does in SGD. It just uses the law of motion principle to passing through the local minima, in our case, a small hill.

Adding momentum will also make the convergence faster, as we’re getting more speed, so our gradient descent step could be more extensive when compared with the SGD’s constant step.

Now the code!

Nesterov Momentum

Nesterov momentum is very similar to the momentum method as discussed above but adds one little different bit to the momentum calculation. What it does is instead of calculating the current position gradient, it calculates the gradient at the approximated new position.

Let’s take the sample example of the cycle. We have some momentum applied to our “cycle” at the current position. Then we can have some intuition where our “cycle” ends up one more minute from the present time.

So, Nesterov momentum makes use of that knowledge. Instead of using the current position gradient, it uses the approximated position gradient, which will give better information for taking the next step.

Now the code!

Hold on, let's forget about our “cycle” example. Now we will see gradient descent from a different perspective where we can ignore the learning rate.

We are entering into the family of adaptive algorithms.

Adagrad

If we observe the gradient descent, we have a problem with the learning rate,, affecting all our parameters.

What happens when we slow down or speed up our “cycle”? If we accelerate our speed in one direction instead of another.

What will happen? It’s our hard-luck using SGD.

To solve these kinds of problems we have Adagrad.

Now the code!

It has different learning rates for different parameters or per-parameters. To normalize our learning rate (), we need to get the sum of squares of all our parameter’s gradients.

As a result, when the gradient is very large, the learning rate () is reduced and vice-versa.

RMSProp

We have one problem with Adagrad, where the learning rate will be decreasing to the point where the learning stops altogether, where we have a minimal learning rate.

To overcome that problem, RMSProp goes off the past accumulated gradient, so only a small portion of gradients are considered.

Now, instead of taking all the past gradients RMSProp takes the moving average.

Now the code!

The only gradient difference between Adagrad and RMSProp is how we calculate the cache.

Adam

It’s just a modification of RMSProp which adds momentum to it. So, what Adam does is it combines momentum and learning rate.

Now the code!

In short:

Adam = momentum + normalise the learning rate + moving average of squared gradient.

The visualisation is shown below.

Optimizations comparision

Solution 4: Gradient Size & distributed training

Instead of sending all the gradients at a time, send the gradients only when the updates reach the certain threshold value.

Solution 5: Saddle points

By now we are clearly understood that gradient descent will never move from a stationary point if started there so it is necessary to modify gradient descent slightly to get some degrees of randomness. There are two methods for this:

Intermittent Perturbations
Random Initialisation.

Conclusion

We learnt the importance of optimization algorithms in reducing the loss in building various deep learning models.

Also discussed various challenges in optimising and learnt how we can overcome these challenges. Feel free to fork the complete code used in this article in our GitHub repo.