Popular Activation Functions In Neural Networks


In the neural network introduction article, we discussed the basics of neural networks. This article focuses on the different types of activation functions used in building neural networks.

In the deep learning literature, and in many neural network online courses, these activation functions are also called transfer functions.

The main focus of this article is to give you a complete overview of various activation functions and their properties. We'll also see how to implement them in Python.


Before we dive further, let's look at the topics you are going to learn in this article.

So let's begin by understanding what an activation function is. If you remember decision trees, at each node the algorithm has to decide how to split the data further; we can relate this idea to how activation functions work.


What Is an Activation Function?

The name activation is self-explanatory: as it suggests, an activation function decides whether to activate, or fire, the neurons/nodes in a neural network.

If we treat these functions as a black box, the way we treat many classification algorithms, they take an input and return an output, which the neural network passes on to the next nodes in the network.

Activation functions are vital components of neural networks: they help the network learn intricate patterns in the training data, which in turn helps it make predictions on unseen data.

In mathematical terms, the network computes the weighted sum of the inputs and a bias, and the activation function applied to this sum is used to decide whether a neuron fires or not.

We can relate computing this weighted sum to the linear regression concept.

The network then adjusts its parameters through some gradient-based processing, usually gradient descent, and produces an output that reflects the patterns in the data.
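As a rough sketch of this idea (the helper name neuron_output and the example numbers are purely illustrative, not from any specific library), a single neuron first computes the weighted sum and then passes it through an activation function:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = np.dot(w, x) + b      # weighted sum, the same form as linear regression
    return activation(z)      # the activation decides how strongly the node "fires"

# Illustrative example using a sigmoid activation (covered later in this article)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron_output(x=np.array([0.5, -1.2]),
                    w=np.array([0.8, 0.3]),
                    b=0.1,
                    activation=sigmoid))
```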

Activation functions are often referred to as transfer functions in the deep learning research literature. These functions have a set of properties they should follow.

Let’s discuss this.

Properties of activation functions


Activation functions should follow this set of properties:

  • Computationally inexpensive
  • Differentiable
  • Zero Centered

Computationally Inexpensive

The activation function's computation has to be minimal, as it directly impacts the neural network's training time.

Complicated neural network architectures, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), have many parameters to optimize.

This optimization needs to compute the activation functions at each layer. If the activation functions are computationally expensive, it will take a very long time to obtain the optimized weights at each layer of the network.

So a key property an activation function should satisfy is computational inexpensiveness.

Differentiable

The second fundamental property is differentiability.

Activation functions have to be differentiable. Even though some functions are not differentiable everywhere, to learn the complex patterns in the training data the activation function needs to be differentiable.

Now another question arises:

Why do activation functions need to be differentiable?

If you remember, in the neural networks introduction article we explained the concept called backpropagation. Using backpropagation, the network calculates the errors it made previously and, using this information, updates the weights to reduce the overall network error.

To perform this, the network uses the gradient descent approach, which needs the derivative of the activation functions.
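To make this concrete, here is a minimal, hypothetical single-neuron sketch (squared-error loss, sigmoid activation, made-up numbers) showing where the derivative of the activation enters the weight update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# One gradient descent step for a single neuron with squared-error loss
x, y_true = 1.5, 1.0          # made-up training example
w, b, lr = 0.2, 0.0, 0.1      # initial weight, bias, learning rate
z = w * x + b
y_pred = sigmoid(z)
error = y_pred - y_true
# Chain rule: dL/dw = (y_pred - y_true) * sigmoid'(z) * x
w -= lr * error * sigmoid_derivative(z) * x
b -= lr * error * sigmoid_derivative(z)
print(w, b)
```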

Zero Centered

The output of the activation function should ideally be zero centered. If it is not, the gradients of a layer's weights tend to all share the same sign, which forces the weight updates to zig-zag and slows down optimization.

We have discussed the key properties of activation functions; now let's discuss the various categories of these functions.

Activation Function Categories


At a high level, activation functions are categorized into three types.

  • Binary step functions
  • Linear activation functions
  • Nonlinear activation functions

Binary step functions

The simplest activation function is the step function. The output value depends on the threshold we choose: if the input is greater than the threshold, the output is 1; otherwise, the output is 0.

In other words, if the input value exceeds the threshold, the node fires; otherwise, it does not.

This is similar to the way logistic regression predicts the binary target class.
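Below is a minimal Python sketch of the binary step function (the function name binary_step is just illustrative), with the threshold defaulting to zero:

```python
import numpy as np

def binary_step(z, threshold=0.0):
    """Fire (1) if the input exceeds the threshold, otherwise stay silent (0)."""
    return np.where(z > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # [0 0 0 1 1]
```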

In the sketch above, the threshold value is zero.

As the name suggests, this activation function can be used for binary classification; however, it cannot be used when you have multiple classes to deal with.

Why is it used?

Some cases call for a function that applies a hard threshold: the output is either exactly one value or the other.

The other functions discussed in this article have an intrinsically probabilistic output, i.e., a higher decimal output implies a greater probability of being 1 (or a high output).

The step function does away with this, opting for a definite high or low output depending on some threshold T on the input.

However, the step function is discontinuous and therefore non-differentiable, so in practice it is not used with backpropagation.

Linear activation functions

The linear activation function is the simplest form of activation. If you use linear activation functions in the wrong places, your whole neural network collapses into a plain regression model.

Not sure about that?

Just think:

What would the network look like if we simply used linear activation functions everywhere?

In the end, each layer feeds its activations into the next one. If every node uses a linear activation, we are just composing linear functions, and the composition of linear functions is itself a linear function.

This makes the whole network equivalent to a single regression equation.

Linear activations are only needed in the last layer, when you're solving a regression problem. The short sketch below shows why stacking linear layers on their own adds no power.
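As a quick illustration of the point above (the weights here are random, made-up values), stacking two purely linear layers collapses into a single linear transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two "layers" with linear (identity) activations
h = W1 @ x + b1
y = W2 @ h + b2

# Exactly equivalent to one linear layer with combined weights
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
print(np.allclose(y, W_combined @ x + b_combined))  # True
```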

Why is it used?

If there's a situation where we want a node to give its output without applying any threshold, then the identity (linear) function is the way to go.

The linear function is not used in the hidden layers. We must use non-linear transfer functions in the hidden-layer nodes, or else the output will only ever be a linear function of the input.

Pros

  • The output value is not binary.
  • Can connect multiple neurons together; if more than one fires, we can take the maximum to make the decision.

Cons

  • The derivative is constant, which means gradient descent cannot learn anything useful from it.
  • The weight updates in backpropagation depend on this constant derivative, not on the actual input.

Neither the binary step function nor the linear activation function is popular in complex, modern deep learning architectures; nonlinear activation functions are used almost everywhere.

So let’s discuss various nonlinear activation functions.

Nonlinear activation functions

There are numerous non-linear activation functions; in this article we mainly focus on the functions below.

  • Sigmoid Function
  • Tanh Function
  • Gaussian
  • ReLU
  • Leaky ReLU

Let’s start with the sigmoid function.

Sigmoid function

The sigmoid activation function is sometimes referred to as the logistic function or the squashing function in the literature.

Why is it used?

This function maps the input to a value between 0 and 1 (but not equal to 0 or 1). This means the output from the node will be a high signal (if the input is positive) or a low one (if the input is negative). 

The simplicity of its derivative allows us to efficiently perform backpropagation without using any fancy packages or approximations. The fact that this function is smooth, continuous, monotonic, and bounded means that backpropagation will work well. 

The sigmoid’s natural threshold is 0.5, meaning that any input that maps to a value above 0.5 will be considered high (or 1) in binary terms.
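Here is a minimal numpy sketch of the sigmoid and its derivative (the function names are just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """The derivative has a simple closed form: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```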

Similar to this, we have the softmax function, which can be used for multi-class classification problems. You can see the key differences by reading the softmax vs sigmoid article.

Pros

  • Interpretability of the output mapped between 0 and 1.
  • Compute gradient quickly.
  • It has a smooth gradient.

Cons

  • At the extremes of the sigmoid function, the output responds very little to changes in the input; this is known as the vanishing gradient problem.
  • Sigmoids saturate and kill gradients.
  • The optimization becomes hard when the output is not zero centered.

Hyperbolic Tangent function

The hyperbolic tangent function, known as the tanh function, is a smooth, zero-centered function whose range lies between -1 and 1.

Why is it used?

This function is very similar to the sigmoid and shares many of the same properties; even its derivative is straightforward to compute. However, it maps the input to any value between -1 and 1 (but not inclusive of those).

In effect, this allows us to apply a penalty to the node (negative) rather than just have the node not fire at all. 

This function has a natural threshold of 0, meaning that any input value greater than 0 is considered high (or 1) in binary terms. 

Again, the fact that this function is smooth, continuous, monotonic, and bounded means that backpropagation will work well. 

The subsequent functions don’t have all these properties which makes them more difficult to use in backpropagation.
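A minimal numpy sketch of tanh and its derivative (numpy already provides np.tanh, so only the derivative needs to be written out):

```python
import numpy as np

def tanh_activation(z):
    """Squashes the input into the open interval (-1, 1)."""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

print(tanh_activation(np.array([-5.0, 0.0, 5.0])))  # ~[-0.9999, 0.0, 0.9999]
```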

Pros

  • Outputs are centered around 0 (in the range -1 to 1), which makes optimization in the hidden layers more efficient.

Cons

  • It also suffers from the vanishing gradient problem.

Now the question is

What is the difference between the sigmoid and the hyperbolic tangent?

They both achieve a similar mapping, both are continuous, smooth, monotonic, and differentiable, but give out different values.

For a sigmoid function, a large negative input generates an almost zero output. This near-zero output affects all subsequent weights in the network, which may not be desirable, effectively stopping the next nodes from learning.

In contrast, the tanh function outputs values close to -1 for large negative inputs, preserving a meaningful signal from the node and allowing subsequent nodes to learn from it. The short comparison below makes this concrete.
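A tiny comparison on made-up negative inputs shows the difference in output behaviour:

```python
import numpy as np

z = np.array([-5.0, -1.0])
print(1.0 / (1.0 + np.exp(-z)))  # sigmoid: ~[0.0067, 0.2689] -> output close to zero
print(np.tanh(z))                # tanh:    ~[-0.9999, -0.7616] -> strong negative signal
```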

Gaussian Function

Why is it used?

The Gaussian function is an even function, so it gives the same output for positive and negative inputs of equal magnitude. It gives its maximal output when the input is zero, and its output decreases as the input moves away from zero.

We can perhaps imagine this function being used in a node where the input feature is less likely to contribute to the final result.
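A minimal sketch of a Gaussian activation of the form exp(-z^2) (one common form; other variants scale or shift the exponent):

```python
import numpy as np

def gaussian(z):
    """Even function: maximal output 1 at z = 0, decaying symmetrically away from zero."""
    return np.exp(-z ** 2)

print(gaussian(np.array([-2.0, 0.0, 2.0])))  # same output for -2 and +2
```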

Rectified Linear Unit (ReLU)

ReLU is widely used in Convolutional Neural Networks. Since the function is just the maximum of the input and zero, it is easy to compute, does not saturate for positive inputs, and does not cause the vanishing gradient problem.

You may come across this activation function in applications such as extracting text from handwritten images.

Why is it used?

The ReLU represents a nearly linear function and therefore preserves the properties of linear models that made them easy to optimize with gradient-descent methods.

This function rectifies inputs less than zero by forcing them to zero, and it mitigates the vanishing gradient problem observed in the earlier types of activation function.
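A minimal numpy sketch of ReLU and its derivative (the derivative at exactly zero is undefined; 0 is used here by convention):

```python
import numpy as np

def relu(z):
    """max(0, z): negative inputs are clipped to zero."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

print(relu(np.array([-3.0, 0.0, 3.0])))  # [0. 0. 3.]
```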

Pros

  • Easy to implement and quick to compute.
  • It avoids and rectifies the vanishing gradient problem.

Cons

  • Problematic when many inputs are negative, since the output is then always 0, which can lead to the death of the neuron.
  • It is not zero centered, and it suffers from the "dying ReLU" problem.

LeakyReLU

The Leaky ReLU is a variant of ReLU. Instead of being 0 when z < 0, a leaky ReLU allows a small, non-zero, constant gradient for negative inputs.

Why is it used?

As said before, Leaky ReLU is a variant of ReLU. Here alpha (α) is a hyperparameter, generally set to 0.01. Leaky ReLU solves the "dying ReLU" problem to some extent.

If you observe, setting α to 1 turns Leaky ReLU into the linear function f(x) = x, which is of no use. Hence, the value of alpha is never set close to 1. If α is instead learned as a parameter for each neuron separately, we get the parametric ReLU, or PReLU.
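A minimal numpy sketch of the Leaky ReLU with the commonly used alpha of 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha instead of being zeroed."""
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-3.0, 0.0, 3.0])))  # [-0.03  0.    3.  ]
```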

Activation functions are not limited to these, but we have discussed the ones most widely used in the industry.

The below figure shows the different types of activation functions.

Different activation functions (Source: Wikipedia)

Conclusion

To conclude, we provided a comprehensive summary of the activation functions used in deep learning.

Activation functions improve the network's ability to learn patterns in the data, thereby automating the process of feature detection and justifying their use in the hidden layers of neural networks.

Recommended Deep Learning Courses

  • Deep Learning A to Z Course (Rating: 4.5/5)
  • Learn Deep Learning With Tensorflow (Rating: 4/5)
  • Python Deep Learning Specialization (Rating: 4.5/5)


I hope you liked this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
