The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
Each layer has its own set of weights, and these weights must be tuned so that the network accurately predicts the right output for a given input.
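A high-level overview of backpropagation is as follows: initialise the weights and biases, run a forward pass to get a prediction, propagate the prediction error backwards to obtain a gradient for every weight and bias, and then update the weights and biases using those gradients.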
A simplified example of weight initialisation is shown below:
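import numpy as np
layers = [784, 64, 10]
# Plain Python lists: the per-layer arrays have different shapes, so they cannot be stacked into one regular np.array.
weights = [np.random.randn(y, x) * np.sqrt(2.0 / (x + y)) for x, y in zip(layers[:-1], layers[1:])]
biases = [np.zeros((y, 1)) for y in layers[1:]]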
Hidden layer 1 has a weight matrix of dimension [64, 784] and a bias of dimension 64.
The output layer has a weight matrix of dimension [10, 64] and a bias of dimension 10.
You may be wondering what is going on when initialising the weights in the code above. This is called Xavier initialisation: the weights are still drawn at random, but they are scaled by the sizes of the two layers they connect, which is a step better than plain random initialisation. Yes, initialisation does matter. Depending on your initialisation, you might be able to find a better local minimum during gradient descent (backpropagation computes the gradients that gradient descent then follows).
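To see what the scaling factor buys you, here is a minimal sketch that compares the spread of the pre-activations of a single 784-to-64 layer with unscaled random weights and with the Xavier-style scaling used above. The seed, the helper name `preactivation_std`, and the unit-variance random inputs are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
fan_in, fan_out = 784, 64

def preactivation_std(w, n_samples=1000):
    """Standard deviation of z = w @ x for unit-variance random inputs x."""
    x = rng.standard_normal((w.shape[1], n_samples))
    return float(np.std(w @ x))

# Same random draw, with and without the Xavier scaling factor.
w_plain = rng.standard_normal((fan_out, fan_in))
w_xavier = w_plain * np.sqrt(2.0 / (fan_in + fan_out))

std_plain = preactivation_std(w_plain)
std_xavier = preactivation_std(w_xavier)
print(f"unscaled: {std_plain:.2f}, Xavier-scaled: {std_xavier:.2f}")
```

With unscaled weights the pre-activations come out with a spread of roughly the square root of the fan-in (about 28 here), while the scaled weights keep them close to 1, so the signal neither blows up nor vanishes as it passes through the layer.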
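activation = x
hidden_activations = [np.reshape(x, (len(x), 1))]  # the input, stored as a column vector
z_list = []
for w, b in zip(self.weights, self.biases):
    z = np.dot(w, np.reshape(activation, (len(activation), 1))) + b
    z_list.append(z)  # keep z for the backward pass
    activation = relu(z)  # relu is assumed to be defined elsewhere, e.g. np.maximum(z, 0)
    hidden_activations.append(activation)
t = hidden_activations[-1]
hidden_activations[-1] = np.exp(t) / np.sum(np.exp(t))  # softmax over the output layer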
This code carries out the forward pass: each layer computes z = w · a + b and applies ReLU, and the last activation is then passed through softmax. hidden_activations[-1] contains the softmax probabilities, one prediction per class, the sum of which is 1. If we are predicting digits, the output will be a vector of 10 probabilities that sum to 1.
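The softmax step can be tried in isolation. The sketch below is a standalone variant, not the code from the network class: the function name `softmax`, the example scores, and the max-subtraction trick (a common way to keep `np.exp` from overflowing on large inputs) are additions for illustration.

```python
import numpy as np

def softmax(t):
    """Softmax over a column vector; subtracting the max avoids overflow in np.exp."""
    exps = np.exp(t - np.max(t))
    return exps / np.sum(exps)

scores = np.array([[2.0], [1.0], [0.1]])  # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs.ravel(), probs.sum())

# With inputs this large, a plain np.exp(t) would overflow to inf.
large = softmax(np.array([[1000.0], [999.0]]))
print(large.ravel())
```

Subtracting the maximum does not change the result, because the same constant factor cancels in the numerator and the denominator.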
weight_gradients = [np.zeros(w.shape) for w in self.weights]
bias_gradients = [np.zeros(b.shape) for b in self.biases]
delta = (hidden_activations[-1] - y) * (z_list[-1] > 0) # relu derivative
weight_gradients[-1] = np.dot(delta, hidden_activations[-2].T)
bias_gradients[-1] = delta
for l in range(2, self.num_layers):
    z = z_list[-l]
    delta = np.dot(self.weights[-l + 1].T, delta) * (z > 0) # relu derivative
    weight_gradients[-l] = np.dot(delta, hidden_activations[-l - 1].T)
    bias_gradients[-l] = delta
The first 2 lines initialise the gradients. These gradients are computed and will be used to update the weights and biases later.
The next 3 lines compute the error at the output layer by subtracting the target from the prediction, and use it to fill in the gradients of the last layer. The error is then backpropagated to the inner layers.
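Why is "prediction minus target" the right error? Assuming the loss is cross-entropy and y is a one-hot column vector (the code above does not name the loss, so this is an assumption), the gradient of the loss with respect to the softmax input is exactly that difference. Here is a small sketch that checks this against finite differences; all names in it are hypothetical.

```python
import numpy as np

def softmax(t):
    exps = np.exp(t - np.max(t))
    return exps / np.sum(exps)

def cross_entropy(t, y):
    """Loss for softmax input t and one-hot target y (both column vectors)."""
    return float(-np.sum(y * np.log(softmax(t))))

rng = np.random.default_rng(1)
t = rng.standard_normal((10, 1))  # hypothetical softmax inputs
y = np.zeros((10, 1))
y[3] = 1.0                        # hypothetical target class

analytic = softmax(t) - y         # prediction minus target

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(t)
for i in range(t.shape[0]):
    step = np.zeros_like(t)
    step[i] = eps
    numeric[i] = (cross_entropy(t + step, y) - cross_entropy(t - step, y)) / (2 * eps)

max_diff = float(np.max(np.abs(analytic - numeric)))
print(max_diff)
```

The two gradients should agree up to floating-point noise. The extra factor (z_list[-1] > 0) in the network code comes from the ReLU that is applied before the softmax.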
Now, carefully trace the working of the loop. Lines 2 and 3 transform the error from layer[i] to layer[i - 1]. Trace the shapes of the matrices being multiplied to see why this works.
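If you would rather let NumPy do the tracing, the following standalone sketch runs one forward and one backward pass for the [784, 64, 10] network and prints the shape of every intermediate result. It mirrors the code above without the class; the random input, the one-hot target, and the variable names are assumptions made for illustration.

```python
import numpy as np

layers = [784, 64, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((y, x)) * np.sqrt(2.0 / (x + y))
           for x, y in zip(layers[:-1], layers[1:])]
biases = [np.zeros((y, 1)) for y in layers[1:]]

x = rng.random((784, 1))   # hypothetical input column vector
y = np.zeros((10, 1))
y[7] = 1.0                 # hypothetical one-hot target

# Forward pass: ReLU on every layer, then softmax on the last activation.
activations = [x]
z_list = []
for w, b in zip(weights, biases):
    z = w @ activations[-1] + b
    z_list.append(z)
    activations.append(np.maximum(z, 0))
t = activations[-1]
exps = np.exp(t - np.max(t))
activations[-1] = exps / np.sum(exps)

# Backward pass, one layer at a time.
delta_out = (activations[-1] - y) * (z_list[-1] > 0)            # (10, 1)
grad_w2 = delta_out @ activations[-2].T                         # (10, 1) @ (1, 64)
delta_hidden = (weights[-1].T @ delta_out) * (z_list[-2] > 0)   # (64, 10) @ (10, 1)
grad_w1 = delta_hidden @ activations[-3].T                      # (64, 1) @ (1, 784)

for name, arr in [("delta_out", delta_out), ("grad_w2", grad_w2),
                  ("delta_hidden", delta_hidden), ("grad_w1", grad_w1)]:
    print(name, arr.shape)
```

Each weight gradient ends up with the same shape as the weight matrix it belongs to, which is what makes the update step a plain element-wise subtraction.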
for i in range(len(self.weights)):
    self.weights[i] += -self.learning_rate * weight_gradients[i]  # step against the gradient
    self.biases[i] += -self.learning_rate * bias_gradients[i]
learning_rate specifies the rate at which the network learns. You don't want it to learn too fast, because it may not converge. A smooth descent is favoured for finding a good minimum. Usually, rates between 0.01 and 0.1 are a reasonable starting point, though the best value depends on the problem.
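To get a feel for what "too fast" means, here is a toy sketch that runs the same update rule on the one-dimensional function f(w) = w**2 instead of a network. The function, the starting point, the step count, and the three rates are all assumptions chosen for illustration; the rate at which a real network starts to diverge will be different.

```python
def descend(learning_rate, start=5.0, steps=50):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w += -learning_rate * (2 * w)  # same update rule as in the network
    return w

final = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in final.items():
    print(f"learning_rate={lr}: w after 50 steps = {w:.6g}")
```

On this toy problem a rate of 0.01 creeps towards the minimum at 0, a rate of 0.1 gets very close to it, and a rate of 1.1 overshoots further on every step and moves away from it.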