Introduction to Neural Network Basics

September 17, 2020 Venkatesh Nagilla

This is the first part of a series of blog posts on simple neural networks. The basics of neural networks can be found all over the internet, and many of those articles cover the same ground with only slight differences in wording.

Here we take a different approach: we build a deep understanding of neural networks by explaining each building-block concept needed to construct one.

We will narrow things down to the very basic concepts you need to build a neural network. The knowledge you gain in this article will help you understand the architecture of various deep learning models in the long run.


Before we dive in, below is the list of topics you will learn in this article.

Table of Contents

What is Learning

Before diving deeper into the individual building blocks of neural networks, let's first discuss what learning is.

What is Learning

A pocket calculator can beat most of us in an arithmetic contest, yet it will never improve its speed or accuracy, no matter how much it practices.

In short:

It doesn't learn.

For example, every time I press its square-root button, it computes exactly the same function in exactly the same way. Here the pocket calculator is not learning.

But how can it learn?

By changing the function it computes in response to experience. Our brains learn on the same idea, and do so very efficiently. Before delving deeper into how such networks can learn, let's first understand how they can compute.

In deep learning, this computed function is called a neural network model; in the wider machine learning literature it is simply called a machine learning model.

Deep learning models differ from classical machine learning models such as logistic regression, decision trees, and random forests in the way they learn from data.

Now let’s look at how neural networks learn from the data we feed them.

Introduction to Neural networks

A neural network is simply a group of interconnected neurons that are able to influence each other’s behavior.

Your brain contains about as many neurons as there are stars in our galaxy. On average, each of these neurons is connected to a thousand other neurons via junctions called synapses.

We can schematically draw a neural network as a collection of dots representing neurons connected by lines representing synapses as shown in the below figure.

Neural Network Architecture

Real-world neurons are very complicated. However, AI researchers have shown that neural networks can still attain human-level performance on many remarkably complex tasks, such as handwritten text recognition and identifying cancerous tumors, even if one ignores all these complexities and replaces real biological neurons with extremely simple simulated ones that are all identical and obey very simple rules.

Currently the most popular model for such an artificial neural network represents the state of each neuron by a single number and the strength of each synapse by a single number.

In this model, each neuron updates its state at regular time steps by averaging together the inputs from all connected neurons, weighting them by the synaptic strengths, optionally adding a constant, and then applying what’s called an activation function to the result to compute its next state.
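The update rule for a single neuron can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_next_state`, the input values, and the synaptic strengths are made up for the example, and the sigmoid is used as the activation function simply because it is discussed later in this article.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_next_state(inputs, strengths, constant=0.0, activation=sigmoid):
    """Weight the inputs, add an optional constant, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, strengths))
    return activation(weighted_sum + constant)

# Hypothetical values: three connected neurons and their synaptic strengths.
inputs = [0.5, -1.0, 2.0]
strengths = [0.4, 0.3, 0.1]
state = neuron_next_state(inputs, strengths, constant=0.1)
print(state)
```

The weighted sum here is 0.2 - 0.3 + 0.2 = 0.1, the constant lifts it to 0.2, and the activation then maps that value into the range (0, 1).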

Activation Functions

The easiest way to use a neural network as a function is to make it feedforward, with information flowing only in one direction.

In case you like math, here are two popular choices of activation function:

Sigmoid Function
Ramp Function

These are the sigmoid function and the ramp function ƒ(x) = max{0, x}, although it’s been proven that almost any function will suffice as long as it’s not linear (a straight line).

One famous model uses the step function:

ƒ(x) = -1 if x < 0 and ƒ(x) = 1 if x >= 0.
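As a rough sketch, the three activation functions mentioned above can be written as plain Python functions. The names `sigmoid`, `ramp`, and `step` are chosen here for illustration only.

```python
import math

def sigmoid(x):
    """Smooth S-shaped curve with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def ramp(x):
    """Ramp function: zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def step(x):
    """Step function: -1 for negative inputs, +1 for zero or positive inputs."""
    return 1 if x >= 0 else -1

for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), ramp(value), step(value))
```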

If the neuron states are stored in a vector, then the network is updated by simply multiplying that vector by a matrix storing the synaptic couplings and then applying the function ƒ to all elements.
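A minimal sketch of this vector-times-matrix update, using plain Python lists and the step function as ƒ. The three-neuron network and its coupling values are invented for the example.

```python
def step(x):
    """Step activation: -1 for negative inputs, +1 otherwise."""
    return 1 if x >= 0 else -1

def update(states, couplings, f=step):
    """Multiply the state vector by the coupling matrix, then apply f to each element."""
    return [f(sum(w * s for w, s in zip(row, states))) for row in couplings]

# Hypothetical three-neuron network; couplings[a][b] links neuron b to neuron a.
states = [1, -1, 1]
couplings = [
    [0.0, 0.5, -0.2],
    [0.5, 0.0, 0.3],
    [-0.2, 0.3, 0.0],
]
new_states = update(states, couplings)
print(new_states)
```

The three weighted sums are -0.7, 0.8 and -0.5, so the step function turns them into the new states -1, 1 and -1.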

Simple neural networks are universal in the sense that they can compute any function arbitrarily accurately by simply adjusting those synapse strength numbers accordingly.

When I first learned about neural networks, I was mystified by how something so simple could compute something arbitrarily complicated.

For example, how can you compute even something as simple as multiplication, when all you’re allowed to do is compute weighted sums and apply a single fixed function?

How this works is shown in the figure below, which shows how a mere four neurons can multiply two arbitrary numbers together, and how a single neuron can multiply three bits together.

Continuous multiplication gate
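To make the three-bit case concrete, here is one possible way a single neuron can multiply three bits. The weights (all 1), the constant (-2.5), and the 0/1 step activation are assumptions chosen for this sketch, not values read off the figure: the neuron fires only when all three inputs are 1, which is exactly when their product is 1.

```python
from itertools import product

def threshold_neuron(bits, weights=(1, 1, 1), constant=-2.5):
    """Weighted sum plus a constant, followed by a 0/1 step activation."""
    total = sum(b * w for b, w in zip(bits, weights)) + constant
    return 1 if total >= 0 else 0

# Compare the neuron with ordinary multiplication on all eight bit patterns.
results = {bits: threshold_neuron(bits) for bits in product((0, 1), repeat=3)}
for bits, output in results.items():
    print(bits, output, bits[0] * bits[1] * bits[2])
```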

Now let’s see a hello world example of neural networks.

Suppose that we wish to classify megapixel grayscale images into two categories, say cats and dogs. If each of the million pixels can take one of, say, 256 values, then there are $256^{1000000}$ possible images.

For each one, we wish to compute the probability that it depicts a cat. This means that an arbitrary function that inputs a picture and outputs a probability is defined by a list of $256^{1000000}$ probabilities, i.e., way more numbers than there are atoms in our universe (about $10^{78}$ to $10^{82}$ by common estimates).
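To get a feel for the size of this number, we can count its decimal digits with a logarithm instead of writing it out. This small sketch assumes one million pixels with 256 possible values each, as in the example above.

```python
import math

pixels = 1_000_000
values_per_pixel = 256

# The number of decimal digits of N is floor(log10(N)) + 1,
# and log10(256 ** pixels) = pixels * log10(256).
digits = math.floor(pixels * math.log10(values_per_pixel)) + 1
print(f"256^1000000 has {digits:,} decimal digits")
```

The count of possible images has roughly 2.4 million digits, whereas the estimated number of atoms in the universe has fewer than a hundred.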

Now we have an idea of how neural networks work. To put it simply:

“Fire together, wire together”

Let’s see the math behind the neural networks.

The math behind the neural networks

At each node in the hidden and output layers of the neural networks (NN) an activation function is executed.

The activation function can also be called a transfer function. Each node takes in the outputs of the nodes in the previous layer, multiplies them by some weights, and passes the sum through this function. The weights on the connections leaving one node can all be different, so the same output activates different neurons to different degrees.

There are many forms of transfer function. We will first look at the sigmoid transfer function, as it is the traditional choice.

Sigmoid Function

Sigmoid Function

As you can see from the figure, the sigmoid function takes any real-valued input and maps it to a real number in the range (0, 1).

We can think of this almost like saying:

“If the value maps to an output near 1, this node fires; if it maps to an output near 0, the node does not fire.”

The equation of the sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Sigmoid Function Derivation

We need to have the derivative of this transfer function so that we can perform backpropagation later on. This is the process where the connections in the network are updated to tune the performance of the neural network.

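We’ll talk about this in more detail later, but let’s find the derivative now. Writing the sigmoid as $\sigma(x) = (1 + e^{-x})^{-1}$ and applying the chain rule:

$$\frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

Therefore, we can write the derivative of the sigmoid function as:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$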
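A quick way to convince yourself that the derivative of the sigmoid equals the sigmoid times one minus the sigmoid is to compare that formula with a numerical estimate. The sketch below uses a central finite difference; the helper names and the step size `h` are choices made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Analytic derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```

The two columns should agree to many decimal places, and the derivative peaks at 0.25 when the input is 0.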

What does it look like?

Sigmoid Function Derivation

Now that we have our activation or transfer function selected, what do we do with it?

We will use this in feed forward flow.

How Feed Forward Works

During the feed-forward pass, the network takes in the input values and gives us some output values. To see how this is done, let’s first consider a two-layer neural network like the one below.

Feed forward neural network

Here we will use the following indices:

i - the node of the input layer I.

j - the node of the hidden layer J.

k - the node of the output layer K.

The input to the activation function at a node j in the hidden layer takes the value:

$$z_j = \sum_{i \in I} x_i \, w_{ij}$$

where $x_i$ is the value of input node $i$ and $w_{ij}$ is the weight of the connection between input node $i$ and hidden node $j$.

In short:

At each hidden layer node, multiply each input value by the weight of the connection it arrives on and add them together.

We apply the activation function $\sigma$ to $z_j$ at hidden node $j$ and get:

$$O_j = \sigma(z_j) = \sigma\Bigl(\sum_{i \in I} x_i \, w_{ij}\Bigr)$$

$O_j$ is the output of hidden node $j$. This is calculated for each of the j nodes in the hidden layer. The resulting outputs now become the input for the next layer in the network.

In our case, this is the final output layer. So for each of the k nodes in K:

$$O_k = \sigma(z_k) = \sigma\Bigl(\sum_{j \in J} O_j \, w_{jk}\Bigr)$$
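The feed-forward pass can be sketched with plain Python lists. In this illustration each layer multiplies the incoming values by its weights, sums them per node, and applies the sigmoid. The network size (2 inputs, 3 hidden nodes, 1 output) and all weight values are invented for the example, and `layer_forward` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """weights[a][b] is the weight from node a of the previous layer to node b of this one."""
    n_out = len(weights[0])
    return [
        sigmoid(sum(inputs[a] * weights[a][b] for a in range(len(inputs))))
        for b in range(n_out)
    ]

x = [1.0, 0.5]                                   # input layer I
w_ij = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]      # weights from I to J
w_jk = [[0.7], [-0.8], [0.9]]                    # weights from J to K

o_j = layer_forward(x, w_ij)     # outputs of the hidden layer
o_k = layer_forward(o_j, w_jk)   # outputs of the output layer
print(o_j, o_k)
```

The same helper is used for both layers because the computation is identical; only the inputs and the weights change.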

This is the end of the feed-forward pass. So how well did our network do at getting the correct result?

As this is the training phase of our network, the true results are already known when we calculate the error.

What is Error

We measure error at the end of each forward pass. This allows us to quantify how well our network has performed in getting the correct output. Once the neural network is built, we can use various evaluation metrics to measure the performance of the model.

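Let’s define $t_k$ as the expected or target value of the $k$-th node of the output layer K, and $O_k$ as the output that node actually produced. Then the error E on the entire output is:

$$E = \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2$$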
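As a small worked example, here is a sketch of a squared-error measure over the output nodes, taken as half the sum of squared differences between outputs and targets. The output and target values are made up.

```python
def squared_error(outputs, targets):
    """Half the sum of squared differences between outputs and targets."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

outputs = [0.8, 0.3]   # hypothetical network outputs
targets = [1.0, 0.0]   # the values we wanted
error = squared_error(outputs, targets)
print(error)
```

Here the differences are -0.2 and 0.3, so the error is 0.5 × (0.04 + 0.09) = 0.065. A perfect prediction gives an error of exactly zero.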

Good! Now how does this help us?

Our aim here is to find a way to tune our network such that when we do a forward pass of the input data, the output is exactly what we know it should be. But we can’t change the input data, so there are only two things we can change:

The weights going into the activation function.
The activation function itself.

The second case will be considered as a separate blog post since there are a lot of activation functions, but the magic of neural networks is all about the weights.

Getting each weight, i.e. each connection between nodes, to be just the right value is what backpropagation is all about. We’ll look at the backpropagation algorithm in the next section.

But let’s go ahead and set it up by considering the following:

How much of this error E has come from each of the weights in the network?

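What is the proportion of the error coming from each of the connections $w_{jk}$ between the nodes in the layer J and the output layer K? In mathematical terms:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2$$

where $O_k$ is the output of output node $k$, $t_k$ is its target, and $O_j$ is the output of hidden node $j$. Only the output of node $k$ depends on the weight $w_{jk}$, so the summation drops out.

The derivative of the error function w.r.t. the weights is then:

$$\frac{\partial E}{\partial w_{jk}} = (O_k - t_k)\,\frac{\partial O_k}{\partial w_{jk}} = (O_k - t_k)\, O_k (1 - O_k)\, O_j$$

Here $O_k (1 - O_k)$ is the sigmoid derivative, and $O_j$ appears because the weighted input to node $k$ is $\sum_{j \in J} O_j w_{jk}$.

We group the terms involving k and define:

$$\delta_k = O_k (1 - O_k)(O_k - t_k)$$

And therefore:

$$\frac{\partial E}{\partial w_{jk}} = O_j \, \delta_k$$

So we have an expression for the amount of error, called ‘delta’ ($\delta_k$). But how does this help us to improve our network? We need to backpropagate the error.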
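The output-layer result can be checked numerically. The sketch below assumes a sigmoid output layer and the squared error (half the sum of squared differences). It computes the delta of each output node as output × (1 − output) × (output − target), multiplies it by the hidden output feeding the weight, and compares one entry with a finite-difference estimate. All values and helper names are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_output(o_j, w_jk):
    """Outputs of the output layer, given the hidden-layer outputs o_j."""
    return [
        sigmoid(sum(o_j[j] * w_jk[j][k] for j in range(len(o_j))))
        for k in range(len(w_jk[0]))
    ]

def error(o_k, t_k):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(o_k, t_k))

o_j = [0.6, 0.4]                       # hypothetical hidden-layer outputs
w_jk = [[0.3, -0.1], [0.2, 0.5]]       # weights from J to K
t_k = [1.0, 0.0]                       # targets

o_k = forward_output(o_j, w_jk)
delta_k = [o * (1 - o) * (o - t) for o, t in zip(o_k, t_k)]
grad_jk = [[o_j[j] * delta_k[k] for k in range(2)] for j in range(2)]

# Finite-difference estimate of dE/dw for the weight from hidden node 0 to output node 1.
h = 1e-6
w_plus = [row[:] for row in w_jk]
w_minus = [row[:] for row in w_jk]
w_plus[0][1] += h
w_minus[0][1] -= h
numeric = (error(forward_output(o_j, w_plus), t_k)
           - error(forward_output(o_j, w_minus), t_k)) / (2 * h)
print(grad_jk[0][1], numeric)
```

If the two printed numbers agree, the delta expression is doing its job for that weight.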

When calculating the errors, special care needs to be taken over the form of the loss function, as neural networks tend to overfit if the data we provide is not diverse enough.

Even though we have various ways to create more diverse data from the available data, it's still worth keeping this in mind.

How Back Propagation Works

Backpropagation takes the error function and uses it to calculate the error on the current layer and updates the weights to that layer by some amount.

So far we’ve looked at the error on the output layer. What about the hidden layer?

This also has an error, but the error here depends on the output layer’s error too because this is where the difference between the target and output can be calculated.

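Let’s have a look at the error on the weights of the hidden layer, $w_{ij}$. Below, $x_i$ is the value of input node $i$, $z_j = \sum_{i \in I} x_i w_{ij}$ and $z_k = \sum_{j \in J} O_j w_{jk}$ are the weighted inputs to hidden node $j$ and output node $k$, $O_j$ and $O_k$ are their outputs, and $t_k$ is the target:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2$$

Now, unlike before, we cannot just drop the summation as the derivative is not directly acting on a subscript k in the summation. We should be careful to note that the output from every node in J is actually connected to each of the nodes in K so the summation should stay.

But we can still use the same tricks as before: let’s use the power rule again and move the derivative inside (because the summation is finite):

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k \in K} (O_k - t_k)\,\frac{\partial O_k}{\partial w_{ij}}$$

Again, we substitute $O_k = \sigma(z_k)$ and its derivative and revert back to our output notation:

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\, \frac{\partial z_k}{\partial w_{ij}}$$

This still looks familiar from the output layer derivative, but now we’re struggling with the derivative of the input to k, i.e. $z_k$, w.r.t. the weights from I to J.

Let’s use the chain rule to break apart this derivative in terms of the output from J:

$$\frac{\partial z_k}{\partial w_{ij}} = \frac{\partial z_k}{\partial O_j}\,\frac{\partial O_j}{\partial w_{ij}}$$

The change of the input to node $k$ with respect to the output from node $j$ is down to a product with the weights $w_{jk}$.

Therefore this derivative just becomes the weights $w_{jk}$. The final derivative has nothing to do with the subscript k anymore, so we’re free to move this around. Let’s put it at the beginning:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial O_j}{\partial w_{ij}} \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\, w_{jk}$$

Let’s finish the derivatives, remembering that the output of the node j is just $O_j = \sigma(z_j)$ and we know the derivative of this function too:

$$\frac{\partial E}{\partial w_{ij}} = O_j (1 - O_j)\, \frac{\partial z_j}{\partial w_{ij}} \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\, w_{jk}$$

The final derivative is straightforward too: the derivative of the input to j w.r.t. the weights is just the previous input, which in our case is $x_i$:

$$\frac{\partial E}{\partial w_{ij}} = O_j (1 - O_j)\, x_i \sum_{k \in K} (O_k - t_k)\, O_k (1 - O_k)\, w_{jk}$$

Almost there! Recall that we defined $\delta_k = O_k (1 - O_k)(O_k - t_k)$ earlier; let’s substitute that in:

$$\frac{\partial E}{\partial w_{ij}} = O_j (1 - O_j)\, x_i \sum_{k \in K} \delta_k \, w_{jk}$$

To clean this up, we now define the ‘delta’ for our hidden layer:

$$\delta_j = O_j (1 - O_j) \sum_{k \in K} \delta_k \, w_{jk}$$

That’s the amount of error on each of the weights going into our hidden layer:

$$\frac{\partial E}{\partial w_{ij}} = x_i \, \delta_j$$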
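The hidden-layer result can be checked the same way. This sketch assumes sigmoid activations in both layers and the squared error (half the sum of squared differences). The hidden delta is the node's output × (1 − output) × the sum of each output delta times its connecting weight, and the gradient for a weight is the input value times that delta. The network size, the weights, and the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ij, w_jk):
    """Return the hidden-layer outputs and the output-layer outputs."""
    o_j = [sigmoid(sum(x[i] * w_ij[i][j] for i in range(len(x))))
           for j in range(len(w_ij[0]))]
    o_k = [sigmoid(sum(o_j[j] * w_jk[j][k] for j in range(len(o_j))))
           for k in range(len(w_jk[0]))]
    return o_j, o_k

def error(o_k, t):
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(o_k, t))

x = [1.0, 0.5]
t = [1.0, 0.0]
w_ij = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
w_jk = [[0.7, -0.3], [-0.8, 0.2], [0.9, 0.1]]

o_j, o_k = forward(x, w_ij, w_jk)
delta_k = [o * (1 - o) * (o - tk) for o, tk in zip(o_k, t)]
delta_j = [
    o_j[j] * (1 - o_j[j]) * sum(delta_k[k] * w_jk[j][k] for k in range(len(delta_k)))
    for j in range(len(o_j))
]
grad_ij = [[x[i] * delta_j[j] for j in range(len(delta_j))] for i in range(len(x))]

def numeric_grad(i, j, h=1e-6):
    """Finite-difference estimate of dE/dw for the weight from input i to hidden node j."""
    plus = [row[:] for row in w_ij]
    minus = [row[:] for row in w_ij]
    plus[i][j] += h
    minus[i][j] -= h
    return (error(forward(x, plus, w_jk)[1], t)
            - error(forward(x, minus, w_jk)[1], t)) / (2 * h)

print(grad_ij[1][2], numeric_grad(1, 2))
```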

What is Bias

Let's remind ourselves what happened inside our hidden layer nodes:

1. Each feature $x_i$ from the input layer I is multiplied by some weight $w_{ij}$.
2. These are added together to get $z_j$, the total, weighted input from the nodes in I.
3. $z_j$ is passed through the activation or transfer function, $\sigma$.
4. This gives the output $O_j$ for each of the j nodes in hidden layer J.
5. $O_j$ from each of the J nodes becomes an input for the next layer.

When we talk about the bias term in neural networks, we are actually talking about an additional parameter that is included in the summation of step 2 above.

The bias term is usually denoted with the symbol θ (theta). Its function is to act as a threshold for the activation (transfer) function.

The bias can be pictured as an extra node that is given the value of 1 and is not connected to anything else. As such, any derivative of the node’s weighted input with respect to the bias term just gives a constant, 1.

This allows us to just think of the bias term as an output from the node with the value of 1. This will be updated later during back propagation to change the threshold at which the node fires.

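Let’s update the equation of $z_j$, the weighted input to hidden node $j$, to include the bias term $\theta_j$:

$$z_j = \sum_{i \in I} x_i \, w_{ij} + \theta_j$$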
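To see the bias acting as a threshold, the sketch below computes one node's output with and without a bias term added to the weighted sum. The inputs, weights, and bias value are invented, and `node_output` is a hypothetical helper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, theta=0.0):
    """Weighted sum of the inputs plus the bias theta, passed through the sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + theta
    return sigmoid(z)

inputs = [0.5, 0.5]
weights = [1.0, 1.0]

without_bias = node_output(inputs, weights)            # weighted input is 1.0
with_bias = node_output(inputs, weights, theta=-2.0)   # weighted input is -1.0
print(without_bias, with_bias)
```

With the same inputs and weights, the negative bias pushes the output from above 0.5 to below 0.5, so the node needs a larger weighted input before it "fires".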

Now we have all the pieces to understand neural networks. Note that the bias we are talking about here is completely different from the bias in the bias-variance tradeoff in machine learning.

Conclusion

We’ve got the initial outputs after our feed-forward, we have the equations for the delta terms (the amount of the error attributable to the different weights) and we know we need to update our bias term too.

So what does the full algorithm look like?

1. Input the data into the network and feed-forward.

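2. For each of the output nodes calculate:

$$\delta_k = O_k (1 - O_k)(O_k - t_k)$$

3. For each of the hidden layer nodes calculate:

$$\delta_j = O_j (1 - O_j) \sum_{k \in K} \delta_k \, w_{jk}$$

4. Calculate the changes that need to be made to the weights and bias terms:

$$\Delta w_{jk} = -\eta\,\delta_k\,O_j, \qquad \Delta w_{ij} = -\eta\,\delta_j\,x_i, \qquad \Delta\theta_k = -\eta\,\delta_k, \qquad \Delta\theta_j = -\eta\,\delta_j$$

where $O$ denotes a node's output, $t_k$ a target value, $x_i$ an input value, $w$ a weight, $\theta$ a bias, and $\eta$ is the learning rate, a small positive step size that controls how far each update moves.

5. Update the weights and biases across the network:

$$w \leftarrow w + \Delta w, \qquad \theta \leftarrow \theta + \Delta\theta$$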

This algorithm is looped over and over until the error between the output and the target values is below some set threshold. Depending on the size of the network i.e. the number of layers and number of nodes per layer, it can take a long time to complete one ‘epoch’ or run through of this algorithm.
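Putting the five steps together, here is a minimal sketch of the whole training loop in plain Python. Several things are assumptions made for this example rather than part of the derivation above: the toy dataset (the logical OR of two bits), two hidden nodes, a learning rate `eta` of 0.5, random initial weights between -1 and 1, and a fixed number of epochs in place of an error threshold.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ij, theta_j, w_jk, theta_k):
    """Step 1: feed the input forward through the hidden and output layers."""
    o_j = [sigmoid(sum(x[i] * w_ij[i][j] for i in range(len(x))) + theta_j[j])
           for j in range(len(theta_j))]
    o_k = [sigmoid(sum(o_j[j] * w_jk[j][k] for j in range(len(o_j))) + theta_k[k])
           for k in range(len(theta_k))]
    return o_j, o_k

def total_error(data, params):
    """Squared error summed over every example in the dataset."""
    return sum(
        0.5 * sum((o - t) ** 2 for o, t in zip(forward(x, *params)[1], target))
        for x, target in data
    )

def train(data, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    w_ij = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_in)]
    w_jk = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hidden)]
    theta_j = [0.0] * n_hidden
    theta_k = [0.0] * n_out
    params = (w_ij, theta_j, w_jk, theta_k)
    initial_error = total_error(data, params)
    for _ in range(epochs):
        for x, t in data:
            o_j, o_k = forward(x, *params)
            # Step 2: deltas for the output nodes.
            delta_k = [o * (1 - o) * (o - tk) for o, tk in zip(o_k, t)]
            # Step 3: deltas for the hidden nodes (uses the weights before updating).
            delta_j = [
                o_j[j] * (1 - o_j[j])
                * sum(delta_k[k] * w_jk[j][k] for k in range(n_out))
                for j in range(n_hidden)
            ]
            # Steps 4 and 5: compute each change and apply it in place.
            for j in range(n_hidden):
                for k in range(n_out):
                    w_jk[j][k] -= eta * delta_k[k] * o_j[j]
            for k in range(n_out):
                theta_k[k] -= eta * delta_k[k]
            for i in range(n_in):
                for j in range(n_hidden):
                    w_ij[i][j] -= eta * delta_j[j] * x[i]
            for j in range(n_hidden):
                theta_j[j] -= eta * delta_j[j]
    return params, initial_error, total_error(data, params)

# Toy dataset: the logical OR of two bits.
or_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
params, initial_error, final_error = train(or_data)
print(initial_error, final_error)
```

The weights are updated after every example here, which is one common choice; accumulating the changes over the whole dataset before applying them is another. The printed error after training should be noticeably smaller than the error before it.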

In the next article, we’ll discuss different types of activation functions. If you have FOMO “fear of missing out” please follow us.

If you like the article share it, if not tell us. Be like a neural network, learn from mistakes.