The backpropagation method of training neural networks
In the backpropagation method, neural networks are trained through the gradient descent technique, where the combined weights vector, W, is updated iteratively, as follows:
Here, η is the learning rate, W(t+1) and W(t) are the weight vectors at iterations (t+1) and (t), respectively, and ∇C(W(t)) is the gradient of the cost function or the error function, with respect to the weight vector, W, at iteration (t). The previous algorithm for an individual weight or bias generalized by w ∈ W can be represented as follows:
As you can gather from the previous expressions, the heart of the gradient descent method of learning relies on computing the gradient of the cost function or the error function, with respect to each weight.
From the chain rule of differentiation, we know that if we have y = f(x), z = f(y), then the following is true:
This expression can be generalized to any number of variables. Now, let's take a look at a very simple neural network, as illustrated in the following diagram, in order to understand the backpropagation algorithm:
Let the input to the network be a two-dimensional vector, x = [x1 x2]T, and the corresponding output label and prediction be and , respectively. Also, let's assume that all of the activation units in the neural network are sigmoids. Let the generalized weight connecting any unit i in layer (l-1) to unit j in layer l be denoted by , while the bias in any unit i in layer l should be denoted by . Let's derive the gradient for one data point; the total gradient can be computed as the sum of all of the data points used in training (or in a mini-batch). If the output is continuous, then the loss function, C, can be chosen as the square of the error in prediction:
The weights and biases of the network, cumulatively represented by the set W, can be determined by minimizing the cost function with respect to the W vector, which is as follows:
To perform the minimization of the cost function iteratively through gradient descent, we need to compute the gradient of the cost function with respect to each weight, w ∈ W, as follows:
Now that we have everything that we need, let's compute the gradient of the cost function, C, with respect to the weight, . Using the chain rule of differentiation, we get the following:
Now let's look at the following formula:
As you can see in the previous expression, the derivative is nothing but the error in prediction. Generally, the output unit activation function is linear in the case of regression problems, and hence the following expression applies:
So, if we were to compute the gradient of the cost function with respect to the total input at the output unit, it would be . This is still equal to the error in prediction of the output.
The total input at the output unit, as a function of the incoming weights and activations, can be expressed as follows:
This means that, and the derivative of the cost function with respect to the weight, , contributing to the input of the output layer is given via the following:
As you can see, the error is backpropagated in computing the gradient of the cost function, with respect to the weights in the layers preceding the final output layer. This becomes more obvious when we compute the gradient of the cost function with respect to the generalized weight, . Let's take the weight corresponding to j=1 and k=2; that is, . The gradient of the cost function, C, with respect to this weight can be expressed as follows:
Now, , which means that, .
So, once we have figured out the gradient of the cost function with respect to the total input to a neuron as , the gradient of any weight, w, contributing to the total input, s, can be obtained by simply multiplying the activation, z, associated with the weight.
Now, the gradient of the cost function with respect to the total input, , can be derived by chain rule again, as follows:
Since all of the units of the neural network (except for the output unit) are sigmoid activation functions, the following is the case:
Combining (1), (2), and (3), we get the following:
In the preceding derived gradient expressions, you can see that the error in prediction, , is backpropagated by combining it with the relevant activations and weights (as per the chain rule of differentiation) for computing the gradients of the weights at each layer, hence, the name backpropagation in AI nomenclature.