Gradient Computation:
The cost function for a neural network is shown below. We need to optimize the values of Θ:
J(Θ) = −(1/m) [ ∑_{i=1}^{m} ∑_{k=1}^{K} y^{(i)}_k log(h_Θ(x^{(i)}))_k + (1 − y^{(i)}_k) log(1 − (h_Θ(x^{(i)}))_k) ] + (λ/2m) ∑_{l=1}^{L−1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ^{(l)}_{ji})^2
Goal: min_Θ J(Θ)
We need to write the code to compute:
- J(Θ)
- ∂J(Θ)/∂Θ^{(l)}_{ij}, where Θ^{(l)}_{ij} ∈ ℝ
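As a rough sketch (not from the original notes), the cost J(Θ) can be computed in NumPy as below. The names are placeholders: h is assumed to be the m × K matrix of hypothesis outputs h_Θ(x^{(i)})_k, Y the matching m × K matrix of labels y^{(i)}_k, thetas a list of the weight matrices Θ^{(1)}, …, Θ^{(L−1)} with the bias weights in column 0, and lam the regularization parameter λ.

```python
import numpy as np

def nn_cost(h, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta) for a neural network.
    h: (m, K) hypothesis outputs, Y: (m, K) labels,
    thetas: list of weight matrices, lam: regularization parameter lambda."""
    m = Y.shape[0]
    # Unregularized term: sum over all m examples and K output units
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # Regularization term: sum of squared weights, excluding the bias column j = 0
    reg = sum(np.sum(Theta[:, 1:] ** 2) for Theta in thetas)
    return cost + (lam / (2 * m)) * reg
```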
How to compute the Partial Derivative Terms?
Example: Forward Propagation for One Training Example
Given one training example (x,y):
Forward Propagation: Vectorized implementation to calculate the activation values for all layers
a^{(1)} = x
z^{(2)} = Θ^{(1)} a^{(1)}
a^{(2)} = g(z^{(2)})   (add the bias unit a^{(2)}_0)
z^{(3)} = Θ^{(2)} a^{(2)}
a^{(3)} = g(z^{(3)})   (add the bias unit a^{(3)}_0)
z^{(4)} = Θ^{(3)} a^{(3)}
a^{(4)} = h_Θ(x) = g(z^{(4)})
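A minimal NumPy sketch of this forward pass for the 4-layer network, assuming weight matrices Theta1, Theta2, Theta3 (Θ^{(1)}, Θ^{(2)}, Θ^{(3)}) whose first column holds the bias weights, and assuming a bias unit is also prepended to the input x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Vectorized forward propagation for a single example x."""
    a1 = np.concatenate(([1.0], x))            # a^(1) = x, with bias unit prepended
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)), add a^(2)_0
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    a3 = np.concatenate(([1.0], sigmoid(z3)))  # a^(3) = g(z^(3)), add a^(3)_0
    z4 = Theta3 @ a3                           # z^(4) = Theta^(3) a^(3)
    a4 = sigmoid(z4)                           # a^(4) = h_Theta(x)
    return a1, z2, a2, z3, a3, z4, a4
```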
Gradient Computation: Backpropagation Algorithm
Intuition: we need to compute δ^{(l)}_j, the "error" of node j in layer l, i.e. the error in its activation value.
For each output unit (Layer L=4)
δ^{(4)}_j = a^{(4)}_j − y_j   [ where a^{(4)}_j = (h_Θ(x))_j ]
This δ^{(4)}_j is essentially the difference between what our hypothesis outputs and the actual value of y in our training set.
Writing in vectorized format:
δ^{(4)} = a^{(4)} − y
Next step is to compute the values of δ for other layers.
δ^{(3)} = (Θ^{(3)})^T δ^{(4)} .∗ g′(z^{(3)})
δ^{(2)} = (Θ^{(2)})^T δ^{(3)} .∗ g′(z^{(2)})
There is no δ^{(1)} because the first layer corresponds to the input units, which are used as given, so there are no error terms associated with it.
g′() is the derivative of the sigmoid activation function.
It can be proved that g′(z^{(3)}) = a^{(3)} .∗ (1 − a^{(3)}). Similarly, g′(z^{(2)}) = a^{(2)} .∗ (1 − a^{(2)}).
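As a quick check of this identity (a standard derivation, not part of the original notes), for the sigmoid g(z) = 1/(1 + e^{−z}):

```latex
g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}}
      = \frac{e^{-z}}{(1 + e^{-z})^2}
      = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
      = g(z)\,\bigl(1 - g(z)\bigr)
```

Applied elementwise, with a^{(l)} = g(z^{(l)}), this gives the identity above.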
Finally, ∂J(Θ)/∂Θ^{(l)}_{ij} = a^{(l)}_j δ^{(l+1)}_i   (ignoring regularization, i.e. λ = 0)
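Putting the δ computations and this gradient formula together, here is a hedged NumPy sketch for a single example, reusing the sigmoid() and forward_propagate() placeholders from the sketch above. Note that the bias row of (Θ^{(l)})^T δ^{(l+1)} is dropped, since there is no δ for bias units:

```python
import numpy as np

def backprop_single_example(x, y, Theta1, Theta2, Theta3):
    """Delta terms and unregularized gradients dJ/dTheta^(l) for one example."""
    a1, z2, a2, z3, a3, z4, a4 = forward_propagate(x, Theta1, Theta2, Theta3)

    d4 = a4 - y                              # delta^(4) = a^(4) - y
    g3 = sigmoid(z3) * (1 - sigmoid(z3))     # g'(z^(3)) = a^(3) .* (1 - a^(3))
    g2 = sigmoid(z2) * (1 - sigmoid(z2))     # g'(z^(2)) = a^(2) .* (1 - a^(2))
    d3 = (Theta3.T @ d4)[1:] * g3            # delta^(3), bias row dropped
    d2 = (Theta2.T @ d3)[1:] * g2            # delta^(2), bias row dropped

    # dJ/dTheta^(l)_ij = a^(l)_j * delta^(l+1)_i  (lambda = 0)
    return np.outer(d2, a1), np.outer(d3, a2), np.outer(d4, a3)
```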
Backpropagation Algorithm:
The name backpropagation comes from the fact that we calculate the δ terms for the output layer first and then use them to calculate the δ terms of the earlier layers, going backward.
Suppose we have a training set: {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)})}
First, set Δ^{(l)}_{ij} = 0 for all values of l, i, j. These Δ^{(l)}_{ij} are used as accumulators and will eventually be used to compute the partial derivatives ∂J(Θ)/∂Θ^{(l)}_{ij}.
For i=1 to m
Set a^{(1)} = x^{(i)}  (activations for the input layer)
Perform forward propagation to compute a^{(l)} for l = 2, 3, …, L
Using y^{(i)}, compute the error term for the output layer: δ^{(L)} = a^{(L)} − y^{(i)}
Use backpropagation to compute δ^{(L−1)}, δ^{(L−2)}, …, δ^{(2)} (there is no δ^{(1)})
Accumulate the partial derivative terms: Δ^{(l)}_{ij} := Δ^{(l)}_{ij} + a^{(l)}_j δ^{(l+1)}_i. In vectorized form (with Δ^{(l)} a matrix and a^{(l)}, δ^{(l+1)} vectors): Δ^{(l)} := Δ^{(l)} + δ^{(l+1)} (a^{(l)})^T
end;
D^{(l)}_{ij} := (1/m) Δ^{(l)}_{ij} + λ Θ^{(l)}_{ij}   if j ≠ 0
D^{(l)}_{ij} := (1/m) Δ^{(l)}_{ij}                    if j = 0
Finally, we can calculate the partial derivatives as: ∂J(Θ)/∂Θ^{(l)}_{ij} = D^{(l)}_{ij}
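A sketch of the full loop in NumPy, building on the hypothetical backprop_single_example() above; X is assumed to hold one training example per row and Y the matching labels, and the regularization follows the D^{(l)}_{ij} formulas just stated:

```python
import numpy as np

def backprop(X, Y, Theta1, Theta2, Theta3, lam):
    """Accumulate Delta^(l) over all m examples and return D^(l) = dJ/dTheta^(l)."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    Delta3 = np.zeros_like(Theta3)

    for i in range(m):
        g1, g2, g3 = backprop_single_example(X[i], Y[i], Theta1, Theta2, Theta3)
        Delta1 += g1        # Delta^(l) := Delta^(l) + delta^(l+1) (a^(l))^T
        Delta2 += g2
        Delta3 += g3

    D1, D2, D3 = Delta1 / m, Delta2 / m, Delta3 / m
    # Add lambda * Theta^(l)_ij to every column except the bias column j = 0
    for D, Theta in ((D1, Theta1), (D2, Theta2), (D3, Theta3)):
        D[:, 1:] += lam * Theta[:, 1:]
    return D1, D2, D3
```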
---------------------------------------------------------------------------------
Backpropagation Intuition with Example:
Mechanical steps of the Backpropagation Algorithm
Forward Propagation: Consider a simple neural network with 2 input units (not counting the bias unit), 2 hidden units in each of two hidden layers (not counting the bias unit in each layer), and one output unit in layer 4.
The input (x^{(i)}, y^{(i)}) is fed into the network. We first use forward propagation to compute the value of z^{(2)}_1, and apply the sigmoid activation function to z^{(2)}_1 to get the value of a^{(2)}_1. Similarly, the values of z^{(2)}_2 and a^{(2)}_2 are computed to complete layer 2. The bias unit a^{(2)}_0 = 1 is then added to layer 2.
Once the values of layer 2 are calculated, we apply the same methodology to compute the values of layer 3 (a & z)
For the values of layer 3, we have the parameters Θ^{(2)}, which are used to compute the values z^{(3)}_i:
z^{(3)}_1 = Θ^{(2)}_{10} a^{(2)}_0 + Θ^{(2)}_{11} a^{(2)}_1 + Θ^{(2)}_{12} a^{(2)}_2
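As a purely illustrative numeric sketch (the numbers below are made up), the same computation for z^{(3)}_1 in NumPy:

```python
import numpy as np

a2 = np.array([1.0, 0.62, 0.37])          # [a^(2)_0 (bias), a^(2)_1, a^(2)_2]
theta2_row1 = np.array([0.1, -1.2, 2.3])  # [Theta^(2)_10, Theta^(2)_11, Theta^(2)_12]

z3_1 = theta2_row1 @ a2                   # Theta^(2)_10*a^(2)_0 + Theta^(2)_11*a^(2)_1 + Theta^(2)_12*a^(2)_2
a3_1 = 1.0 / (1.0 + np.exp(-z3_1))        # apply the sigmoid to get a^(3)_1
```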
What is Backpropagation doing? Backpropagation is almost doing the same thing as forward propagation in the opposite direction (right to left, from output to input)
The cost function again:
J(Θ) = −(1/m) [ ∑_{i=1}^{m} ∑_{k=1}^{K} y^{(i)}_k log(h_Θ(x^{(i)}))_k + (1 − y^{(i)}_k) log(1 − (h_Θ(x^{(i)}))_k) ] + (λ/2m) ∑_{l=1}^{L−1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ^{(l)}_{ji})^2
Assume λ = 0 (i.e. ignore the regularization term for now), and focus on a single training example (x^{(i)}, y^{(i)}) with one output unit.
cost(i) = y^{(i)} log h_Θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_Θ(x^{(i)}))  — the cost associated with the (x^{(i)}, y^{(i)}) training example.
For the purpose of intuition, think of cost(i) ≈ (h_Θ(x^{(i)}) − y^{(i)})^2, i.e. how well the network is doing on example i: how close the output is to the actual observed value of y.
The δ terms are in fact partial derivatives of the cost associated with example i (formally, δ^{(l)}_j = ∂cost(i)/∂z^{(l)}_j). They are a measure of how much the weights Θ need to be changed to bring the hypothesis output h_Θ(x) closer to the actual observed value of y.
For the output layer, we first set the value of δ^{(4)}_1 = a^{(4)}_1 − y^{(i)}.
Next, we calculate the values of δ^{(3)} in layer 3 using δ^{(4)}, and the values of δ^{(2)} in layer 2 using δ^{(3)}. Note that there is no δ^{(1)}.
Focusing on the calculation of δ^{(2)}_2: it is actually a weighted sum of δ^{(3)}_1 and δ^{(3)}_2, the weights being the corresponding parameters in Θ^{(2)}:
δ^{(2)}_2 = Θ^{(2)}_{12} δ^{(3)}_1 + Θ^{(2)}_{22} δ^{(3)}_2
This corresponds to the magenta and red values in the diagram from the lecture.
Similarly, to calculate δ^{(3)}_2:
δ^{(3)}_2 = Θ^{(3)}_{12} δ^{(4)}_1
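A tiny NumPy sketch of these weighted sums, again with made-up numbers just to show the mechanics (the g′(z) factor from the full algorithm is omitted here, as in the intuition above):

```python
import numpy as np

delta3 = np.array([0.3, -0.1])        # [delta^(3)_1, delta^(3)_2] (made up)
theta2_col2 = np.array([1.1, -0.4])   # [Theta^(2)_12, Theta^(2)_22]: weights leaving a^(2)_2

delta2_2 = theta2_col2 @ delta3       # Theta^(2)_12*delta^(3)_1 + Theta^(2)_22*delta^(3)_2

theta3_12 = 0.7                       # Theta^(3)_12 (made up)
delta4_1 = 0.25                       # delta^(4)_1 (made up)
delta3_2 = theta3_12 * delta4_1       # Theta^(3)_12 * delta^(4)_1
```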
Once we have calculated the values of δ (and from them the gradients), we can use an optimization function to calculate the optimized values of Θ.
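As a final illustrative step (assuming the backprop() sketch from earlier, plus made-up X, Y, initial Θ matrices, and a hypothetical learning rate), plain batch gradient descent using these gradients would look like:

```python
alpha = 0.5                  # hypothetical learning rate
for _ in range(400):         # hypothetical number of iterations
    D1, D2, D3 = backprop(X, Y, Theta1, Theta2, Theta3, lam=1.0)
    Theta1 -= alpha * D1     # Theta^(l) := Theta^(l) - alpha * dJ/dTheta^(l)
    Theta2 -= alpha * D2
    Theta3 -= alpha * D3
```

In practice, an advanced optimizer (e.g. scipy.optimize.minimize, given the cost and the unrolled gradients) is typically used instead of fixed-step gradient descent.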