Friday, May 15, 2015

Neural Networks: Unrolling Parameters, Gradient Checking & Random Initialization

Neural Networks: Unrolling Parameters

We need to unroll the parameters from matrices into vectors in order to use them with optimization functions such as fminunc in Matlab/Octave.

Advanced Optimization Functions:

function [jVal, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction, initialTheta, options);

The above call assumes that the parameters initialTheta (and the returned gradient) are vectors, not matrices. In logistic regression the parameters are already vectors; in a neural network, however, they are matrices.
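For reference, here is a minimal sketch of how this call might look in Octave; the option values (e.g. 100 iterations) are placeholders I am assuming, not values from the lecture:

% Tell fminunc that costFunction also returns the gradient, and cap the iterations.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);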

For a Neural Network with 4 Layers (L=4)
$\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$ - matrices (Theta1, Theta2, Theta3)
$D^{(1)}, D^{(2)}, D^{(3)}$ - matrices (D1, D2, D3)

"Unroll into Vectors"

Example: let's say we have a 4-layer NN (L = 4, as above) with layer sizes:
$s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1$

The dimensions of the $\Theta$ and $D$ matrices are then:

$\Theta^{(1)}\in\Re^{10\times11}$, $\Theta^{(2)}\in\Re^{10\times11}$, $\Theta^{(3)}\in\Re^{1\times11}$

$D^{(1)} \in \Re^{10\times11}$, $D^{(2)} \in \Re^{10\times11}$, $D^{(3)} \in \Re^{1\times11}$

The commands below unroll each matrix into a column vector and concatenate them:
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

To recover the matrices in their original dimensions from the unrolled vector:
Theta1 = reshape(thetaVec(1:110),10,11);
Theta2 = reshape(thetaVec(111:220),10,11);
Theta3 = reshape(thetaVec(221:231),1,11);
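As a quick sanity check, the round trip can be verified directly in Octave (this assumes Theta1, Theta2, Theta3 already exist with the dimensions above):

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];   % 231 x 1 unrolled vector
Theta1check = reshape(thetaVec(1:110), 10, 11); % recover the first matrix
isequal(Theta1check, Theta1)                    % should print ans = 1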

Learning Algorithm:

Here is how unrolling fits into the learning algorithm:

Have initial parameters $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$
Unroll them to get initialTheta, which is passed to fminunc(@costFunction, initialTheta, options);

function [jVal, gradientVec] = costFunction(thetaVec)

  • From thetaVec, get  $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$ (reshape to get the original matrices back)
  • Use forward propagation/back propagation to compute $D^{(1)}, D^{(2)}, D^{(3)}$ and $J(\Theta)$
  • Unroll $D^{(1)}, D^{(2)}, D^{(3)}$ to get gradientVec (a sketch of this function is given below)
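Putting these steps together, here is a minimal sketch of what costFunction could look like for the 4-layer example above. Note that nnCostAndGrad is a hypothetical helper standing in for your own forward-propagation/backpropagation code; it is not a function from the lecture.

function [jVal, gradientVec] = costFunction(thetaVec)
  % Reshape the unrolled parameter vector back into the weight matrices.
  Theta1 = reshape(thetaVec(1:110),   10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1,  11);

  % Hypothetical helper: runs forward prop and backprop, returning the
  % cost J and the gradient matrices D1, D2, D3 (same sizes as the Thetas).
  [jVal, D1, D2, D3] = nnCostAndGrad(Theta1, Theta2, Theta3);

  % Unroll the gradient matrices so fminunc receives a single vector.
  gradientVec = [D1(:); D2(:); D3(:)];
end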




Numerical Gradient Checking:

Backpropagation has many details that are easy to get wrong. The cost function J may appear to decrease on every iteration even when the gradient computation has a subtle bug, so the network can end up with a higher final value of J (and worse performance) than a bug-free implementation would achieve.

How to check the gradient: Numerical Estimation of gradients (Gradient Checking)

We approximate the derivative of $J(\Theta)$, which is the slope of the curve at that value of $\Theta$, by estimating the slope geometrically.


We pick a point $\Theta+\epsilon$ slightly above $\Theta$ and a point $\Theta-\epsilon$ slightly below it. The slope of the line connecting $J(\Theta+\epsilon)$ and $J(\Theta-\epsilon)$ approximates the derivative at $\Theta$:

$\frac{d}{d\Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}$

A typical value is $\epsilon \approx 10^{-4}$.


Implementation in Matlab or Octave:
gradApprox = (J(theta+EPSILON) - J(theta - EPSILON))/(2*EPSILON)

This will give a numerical estimate of slope at this point.
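As a quick concrete check of the formula, take $J(\theta) = \theta^3$, whose true derivative is $3\theta^2$; the two-sided estimate recovers it almost exactly:

J = @(theta) theta^3;      % simple test function (not from the lecture)
EPSILON = 1e-4;
theta = 1;
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
% prints approximately 3.0000, matching the true derivative 3*theta^2 = 3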

General case: parameter vector $\theta$
$\theta \in \Re^n$ (e.g. $\theta$ is the 'unrolled' version of $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)},\ldots$)
$\theta = [\theta_1, \theta_2, \theta_3, \ldots, \theta_n]$



The partial derivative with respect to each of $\theta_1, \theta_2, \ldots, \theta_n$ is approximated separately:

$\frac{\partial}{\partial\theta_i}J(\theta) \approx \frac{J(\theta_1,\ldots,\theta_i+\epsilon,\ldots,\theta_n)-J(\theta_1,\ldots,\theta_i-\epsilon,\ldots,\theta_n)}{2\epsilon}$

In Matlab/Octave, this is implemented as:
for i = 1:n
     thetaPlus = theta;
     thetaPlus(i) = thetaPlus(i) + EPSILON;    % perturb only the i-th parameter upward
     thetaMinus = theta;
     thetaMinus(i) = thetaMinus(i) - EPSILON;  % perturb only the i-th parameter downward
     gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
end

Check that gradApprox $\approx$ DVec. If the values are very close, we can be confident that backpropagation is computing the gradient of $J$ correctly, and therefore that gradient descent (or fminunc) will optimize $\Theta$ correctly.
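One common way to quantify "very close" is the relative difference between the two vectors; a value around $10^{-9}$ or smaller usually indicates a correct backprop implementation (the threshold here is a rule of thumb I am assuming, not a value from the lecture):

% Relative difference between the backprop gradient and the numerical estimate.
relDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);
if relDiff < 1e-9
  disp('Backpropagation gradient looks correct');
end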

Implementation Note:

  • Implement Backprop to compute DVec (unrolled $D^{(1)}, D^{(2)}, D^{(3)}$)
  • Implement Numerical Gradient Checking to calculate gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use the backprop code for learning
Important:
  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction), your code will be very slow.

Random Initialization of $\Theta$:

Initial value of $\Theta$

What if we set all values of initialTheta = zeros(n,1)? The values of $\Theta$ determine the activations in each layer. If all the weights are equal, then the hidden-unit activations $a_1, a_2, \ldots, a_n$ in a layer are identical, the error terms $\delta$ are identical, and so are the partial derivatives of the cost function with respect to the weights feeding into each hidden unit.

As a result, after each update the parameters corresponding to the inputs going into each hidden unit remain identical: every hidden unit keeps computing exactly the same function of the input, so the network cannot learn distinct features.


Random Initialization: Symmetry Breaking

Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$ (i.e. $-\epsilon\le\Theta_{ij}^{(l)}\le\epsilon$)

E.g.:

Theta1 = rand(10,11) *(2*INIT_EPSILON)-INIT_EPSILON;

Theta2 = rand(1,11) *(2*INIT_EPSILON)-INIT_EPSILON;

rand(10,11) returns a random 10x11 matrix with entries uniformly distributed between 0 and 1. (Note that this $\epsilon$, INIT_EPSILON, is unrelated to the $\epsilon$ used in gradient checking.)
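A minimal sketch of initializing every layer this way and unrolling the result into initialTheta, using the same layer sizes as the unrolling example above; the value INIT_EPSILON = 0.12 is a common choice I am assuming here, not one given in the lecture:

INIT_EPSILON = 0.12;                                    % small range around zero (assumed value)
Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON; % each entry in [-INIT_EPSILON, INIT_EPSILON]
Theta2 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11)  * (2*INIT_EPSILON) - INIT_EPSILON;
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];       % unrolled, ready to pass to fminunc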



