Backpropagation - The Heart of Neural Networks


The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.

Each layer has its own set of weights, and these weights must be tuned so that the network accurately predicts the right output for a given input.

A high-level overview of backpropagation is as follows:

  1. Forward pass - the input is transformed into some output. At each layer, the pre-activation is computed as the dot product of the layer's input with its weights, plus the bias. This value is then passed through an activation function to give the layer's activation, which becomes the input to the next layer.
  2. Error computation - in the last layer, the output is compared to the actual label for that input, and the error is computed. Mean squared error is a common choice; for classification with a softmax output, as in the code further down, cross-entropy is typical.
  3. Backward pass - the error computed in step 2 is propagated back to the inner layers, and the weights of all layers are adjusted to account for this error.
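
Steps 1 and 2 can be sketched for a single layer as follows. This is a minimal illustration, not the network built in the sections below: the function name `layer_forward` and the toy values are assumptions made up for the example, and mean squared error is used only because it is the simplest error to compute by hand.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU activation.
    return np.maximum(0.0, z)

def layer_forward(w, b, a_prev):
    """One layer of the forward pass: weighted sum plus bias, then activation."""
    z = np.dot(w, a_prev) + b
    return relu(z)

# Toy values chosen only for illustration (hypothetical, not from the article).
w = np.array([[1.0, -1.0],
              [0.5, 0.5]])
b = np.array([[0.0], [1.0]])
x = np.array([[2.0], [3.0]])
target = np.array([[0.0], [3.0]])

a = layer_forward(w, b, x)          # step 1: forward pass
mse = np.mean((a - target) ** 2)    # step 2: error against the label
print(a.ravel(), mse)
```

Here the pre-activation is [-1, 3.5], ReLU clips the negative entry to 0, and the error measures how far the result is from the target.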

1. Weights Initialisation

A simplified example of weights initialisation is shown below:

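import numpy as np

layers = [784, 64, 10]  # input, hidden and output layer sizes
# Plain lists: the per-layer arrays have different shapes and cannot be stacked into one ndarray.
weights = [np.random.randn(y, x) * np.sqrt(2.0 / (x + y)) for x, y in zip(layers[:-1], layers[1:])]
biases = [np.zeros((y, 1)) for y in layers[1:]]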
  • Hidden layer 1 has weights of dimension [64, 784] and a bias of dimension [64, 1].

  • The output layer has weights of dimension [10, 64] and a bias of dimension [10, 1].

You may be wondering what is going on when initialising the weights in the code above. This is called Xavier initialisation: the random weights are scaled by sqrt(2 / (fan_in + fan_out)), where fan_in and fan_out are the number of inputs and outputs of the layer. It is a step better than drawing the weight matrices from an unscaled random distribution. Yes, initialisation does matter. Depending on your initialisation, gradient descent may settle in a better or worse local minimum (backpropagation computes the gradients that gradient descent then uses to update the weights).
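
To see what the scaling buys you, the sketch below compares Xavier-scaled weights with unscaled ones for the 784-to-64 layer. The helper name `xavier_normal`, the fixed seed and the unit-variance random input are assumptions made for this illustration only.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Normal weights scaled by sqrt(2 / (fan_in + fan_out))."""
    scale = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_out, fan_in)) * scale

rng = np.random.default_rng(0)      # seed fixed for reproducibility
fan_in, fan_out = 784, 64

w_scaled = xavier_normal(fan_in, fan_out, rng)
w_unscaled = rng.standard_normal((fan_out, fan_in))
expected_std = np.sqrt(2.0 / (fan_in + fan_out))

# A hypothetical input with roughly unit variance per component.
x = rng.standard_normal((fan_in, 1))
z_scaled = np.dot(w_scaled, x)
z_unscaled = np.dot(w_unscaled, x)

print("weight std:", w_scaled.std(), "expected:", expected_std)
print("pre-activation std, scaled vs unscaled:", z_scaled.std(), z_unscaled.std())
```

With unscaled weights the pre-activations grow with the square root of the layer width, while the scaled version keeps them close to the magnitude of the input.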

2. Forward Pass

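activation = x
hidden_activations = [np.reshape(x, (len(x), 1))]  # layer 0 is the input as a column vector
z_list = []

for w, b in zip(self.weights, self.biases):
    z = np.dot(w, np.reshape(activation, (len(activation), 1))) + b
    z_list.append(z)  # keep the pre-activation for the backward pass
    activation = relu(z)  # relu(z) = np.maximum(0, z), assumed to be defined elsewhere
    hidden_activations.append(activation)

# Replace the last activation with softmax probabilities
t = hidden_activations[-1]
hidden_activations[-1] = np.exp(t) / np.sum(np.exp(t))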

This code carries out the transformation described above. hidden_activations[-1] contains the softmax probabilities, one prediction per class, which sum to 1. If we are predicting digits, the output is a vector of 10 probabilities.
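
One practical caveat: np.exp overflows for large inputs. A common remedy, sketched below as a standalone helper (the name `softmax` and the example logits are assumptions for this illustration), is to subtract the maximum before exponentiating. This leaves the probabilities unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax(t):
    """Softmax over a column vector, shifted by the max for numerical stability."""
    shifted = t - np.max(t)
    e = np.exp(shifted)
    return e / np.sum(e)

# Logits this large would overflow np.exp without the shift.
logits = np.array([[1000.0], [1001.0], [1002.0]])
probs = softmax(logits)
print(probs.ravel(), probs.sum())
```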

3. Backward Pass

weight_gradients = [np.zeros(w.shape) for w in self.weights]
bias_gradients = [np.zeros(b.shape) for b in self.biases]

# Output-layer error: prediction minus target, masked by the relu derivative
delta = (hidden_activations[-1] - y) * (z_list[-1] > 0)
weight_gradients[-1] = np.dot(delta, hidden_activations[-2].T)
bias_gradients[-1] = delta

# Walk backwards through the remaining layers
for l in range(2, self.num_layers):
    z = z_list[-l]
    delta = np.dot(self.weights[-l + 1].T, delta) * (z > 0)  # relu derivative
    weight_gradients[-l] = np.dot(delta, hidden_activations[-l - 1].T)
    bias_gradients[-l] = delta

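The first 2 lines initialise the gradients to zero. They are filled in by the code that follows and will be used to update the weights and biases later.

The next 3 lines compute the output-layer error, delta, as the prediction minus the target (y is assumed here to be a one-hot column vector), masked by the ReLU derivative. With a softmax output, this difference is the gradient of the cross-entropy loss with respect to the softmax input. The error is then backpropagated to the inner layers.

Now, carefully trace the working of the loop. The statements that assign z and delta transform the error from layer[i] to layer[i - 1]: delta is multiplied by the transposed weights of the layer after it, then masked by the ReLU derivative of the current layer's z. Trace the shapes of the matrices being multiplied to understand why the transposes are needed.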
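
One way to do that tracing, and to convince yourself the gradients are right, is a numerical gradient check on a tiny network. The sketch below is a standalone re-statement of the forward and backward passes under stated assumptions: the function names, the [4, 3, 2] layer sizes, the positive toy weights (chosen so that every ReLU stays active) and the cross-entropy loss are all made up for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(weights, biases, x):
    """ReLU on every layer, then softmax on the last activation."""
    a = x
    activations, z_list = [a], []
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b
        z_list.append(z)
        a = relu(z)
        activations.append(a)
    e = np.exp(activations[-1] - np.max(activations[-1]))
    activations[-1] = e / np.sum(e)
    return activations, z_list

def cross_entropy(weights, biases, x, y):
    probs = forward(weights, biases, x)[0][-1]
    return float(-np.sum(y * np.log(probs)))

def weight_grads(weights, biases, x, y):
    """Backward pass; returns one gradient array per weight matrix."""
    activations, z_list = forward(weights, biases, x)
    grads = [np.zeros(w.shape) for w in weights]
    delta = (activations[-1] - y) * (z_list[-1] > 0)
    grads[-1] = np.dot(delta, activations[-2].T)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * (z_list[-l] > 0)
        grads[-l] = np.dot(delta, activations[-l - 1].T)
    return grads

# Hypothetical tiny network with positive values so no ReLU sits near its kink.
rng = np.random.default_rng(0)
layers = [4, 3, 2]
weights = [rng.uniform(0.1, 0.5, (n_out, n_in))
           for n_in, n_out in zip(layers[:-1], layers[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layers[1:]]
x = rng.uniform(0.5, 1.0, (4, 1))
y = np.array([[1.0], [0.0]])  # one-hot target

grads = weight_grads(weights, biases, x, y)
for w, g in zip(weights, grads):
    print("weight", w.shape, "gradient", g.shape)  # shapes must match

# Central differences on the first weight matrix.
eps = 1e-5
numeric = np.zeros_like(weights[0])
for idx in np.ndindex(*weights[0].shape):
    original = weights[0][idx]
    weights[0][idx] = original + eps
    loss_plus = cross_entropy(weights, biases, x, y)
    weights[0][idx] = original - eps
    loss_minus = cross_entropy(weights, biases, x, y)
    weights[0][idx] = original
    numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

max_diff = float(np.max(np.abs(numeric - grads[0])))
print("largest difference between numeric and analytic gradient:", max_diff)
```

If the analytic and numeric gradients agree to several decimal places, the backward pass is very likely consistent with the forward pass and the loss.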

4. Weights/Parameter Update

for i in range(len(self.weights)):
    self.weights[i] -= self.learning_rate * weight_gradients[i]
    self.biases[i] -= self.learning_rate * bias_gradients[i]

self.learning_rate specifies the step size of each update, that is, how fast the network learns. You don't want it to learn too fast, because the updates may overshoot and never converge. A smooth descent is favoured for finding a good minimum. Rates between 0.01 and 0.1 are a common starting point, but the best value depends on the problem.
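
The effect of the learning rate is easy to see on a toy problem. The sketch below runs the same update rule on f(w) = w^2, whose gradient is 2w. The function name `descend` and the two rates are assumptions chosen for illustration, and the exact threshold at which updates diverge is specific to this function, not a rule for neural networks.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        gradient = 2.0 * w
        w -= learning_rate * gradient
    return w

w_small = descend(0.1)   # each step multiplies w by 0.8, so w shrinks towards 0
w_large = descend(1.1)   # each step multiplies w by -1.2, so w oscillates and grows
print(w_small, w_large)
```

With the small rate, w settles at the minimum; with the large one, every step overshoots further than the last.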