Friday, May 15, 2015

Implementation Summary : Neural Networks

Training a Neural Network:

Pick a network architecture (connectivity pattern between neurons)

No. of input units: Dimension of features $x^{(i)}$
No. of output units: Number of classes

Reasonable default: 1 hidden layer; if >1 hidden layer, use the same number of hidden units in every layer (usually, the more units the better)

Training:

  1. Randomly initialize weights
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to calculate the cost function $J(\Theta)$
  4. Implement Backpropagation to compute the partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}} J(\Theta)$:
for i = 1 to m
     Perform forward propagation and backpropagation using example $(x^{(i)},y^{(i)})$
     (get activations $a^{(l)}$ and delta terms $\delta^{(l)}$ for $l=2,3,4,\ldots,L$)
     Accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$
     ...
end;

Then compute $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ from the accumulated $\Delta$ terms.

     5.  Use Gradient Checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backpropagation vs. the numerical estimate of the gradient of $J(\Theta)$. Then disable the gradient checking code.
     6.  Use Gradient Descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of the parameters $\Theta$ (a sketch of the full pipeline follows below).
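
Putting these steps together, a minimal Octave sketch of the pipeline (the helper names randInitializeWeights and nnCostFunction, the layer sizes, and the in-scope variables X, y, lambda are illustrative assumptions, not part of the original notes):

% Sketch only: a 3-layer network with 400 inputs, 25 hidden units, 10 classes
Theta1 = randInitializeWeights(400, 25);   % hypothetical helper, returns 25 x 401
Theta2 = randInitializeWeights(25, 10);    % hypothetical helper, returns 10 x 26
initialTheta = [Theta1(:); Theta2(:)];     % unroll into a single vector

options = optimset('GradObj', 'on', 'MaxIter', 100);
costFunc = @(t) nnCostFunction(t, 400, 25, 10, X, y, lambda);  % hypothetical helper
[optTheta, cost] = fminunc(costFunc, initialTheta, options);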


Neural Networks: Unrolling Parameters, Gradient Checking & Random Initialization

Neural Networks: Unrolling Parameters

We need to unroll the parameters from matrices into vectors to use them in an optimization function such as fminunc in Matlab/Octave.

Advanced Optimization Functions:

function [jVal, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction, initialTheta, options);

The above function assumes that the parameters gradient and initialTheta are vectors. In logistic regression these parameters are vectors; in a Neural Network, however, they are matrices.

For a Neural Network with 4 layers (L=4):
$\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$ - matrices (Theta1, Theta2, Theta3)
$D^{(1)}, D^{(2)}, D^{(3)}$ - matrices (D1, D2, D3)

"Unroll into Vectors"

Example: let's say we have a 4-layer NN (L=4) with the following unit counts:
$s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1$

The dimension of matrices $\Theta$ and $D$ are given by:

$\Theta^{(1)}\in\Re^{10\times11}$, $\Theta^{(2)}\in\Re^{10\times11}$, $\Theta^{(3)}\in\Re^{1\times11}$

$D^{(1)} \in \Re^{10\times11}$, $D^{(2)} \in \Re^{10\times11}$, $D^{(3)} \in \Re^{1\times11}$

The commands below unroll the matrices into vectors and concatenate them:
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

To convert the vectors back into matrices of the original dimensions:
Theta1 = reshape(thetaVec(1:110),10,11);
Theta2 = reshape(thetaVec(111:220),10,11);
Theta3 = reshape(thetaVec(221:231),1,11);
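
Note the index ranges: Theta1 and Theta2 each contribute $10 \times 11 = 110$ elements and Theta3 contributes $1 \times 11 = 11$, hence the slices 1:110, 111:220, and 221:231 (231 parameters in total).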

Learning Algorithm:

Here is how we use unrolling in the learning algorithm:

Have initial parameters $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options);

function [jVal, gradientVec] = costFunction(thetaVec)

  • From thetaVec, get  $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$ (reshape to get the original matrices back)
  • Use forward propagation/backpropagation to compute $D^{(1)}, D^{(2)}, D^{(3)}$ and $J(\Theta)$
  • Unroll $D^{(1)}, D^{(2)}, D^{(3)}$ to get gradientVec
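
A minimal Octave skeleton of this costFunction for the 4-layer example above; forwardBackward is a hypothetical helper standing in for the forward/backpropagation code described later in these notes:

function [jVal, gradientVec] = costFunction(thetaVec)
  % Reshape the unrolled parameter vector back into weight matrices
  Theta1 = reshape(thetaVec(1:110), 10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1, 11);

  % Hypothetical helper: runs forward/backpropagation and returns
  % the cost and the gradient matrices D1, D2, D3
  [jVal, D1, D2, D3] = forwardBackward(Theta1, Theta2, Theta3);

  % Unroll the gradients so the output matches thetaVec's layout
  gradientVec = [D1(:); D2(:); D3(:)];
end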




Numerical Gradient Checking:

The backpropagation algorithm is complex, and even though the cost function $J$ may appear to be decreasing, a subtle bug can still give erroneous results: the network can end up with a higher value of $J$ than it would with a bug-free implementation.

How to check the gradient: Numerical Estimation of gradients (Gradient Checking)

We approximate the derivative of the function $J(\Theta)$, which is defined as the slope of the curve at that value of $\Theta$, by calculating the slope geometrically.


We select a point $\Theta+\epsilon$ just ahead of $\Theta$, and a point $\Theta-\epsilon$ just behind it. The slope of the line connecting the values of $J$ at these two points approximates the derivative:

$\frac{d}{d\Theta}J(\Theta) \approx \frac {J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}$

A typical value is $\epsilon \approx 10^{-4}$.


Implementation in Matlab or Octave:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON);

This will give a numerical estimate of slope at this point.
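
For example, take $J(\theta) = \theta^3$ at $\theta = 1$, where the true derivative is $3\theta^2 = 3$. With $\epsilon = 0.01$:

gradApprox $= \frac{(1.01)^3 - (0.99)^3}{0.02} = \frac{1.030301 - 0.970299}{0.02} = 3.0001$

which is very close to the true slope.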

General case : Parameter vector $\theta$
$\theta \in \Re^n$ E.g. $\theta$ is 'unrolled' version of $\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}...$
$\theta = \theta_1, \theta_2, \theta_3...\theta_n$



The partial derivatives of each of $\theta_1, \theta_2...\theta_n$ are calculated separately.

In Matlab/Octave, the following is implemented:
for i = 1:n
     thetaPlus = theta;
     thetaPlus(i) = thetaPlus(i) + EPSILON;
     thetaMinus = theta;
     thetaMinus(i) = thetaMinus(i) - EPSILON;
     gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end

Check that gradApprox $\approx$ DVec. If the values are very close, we can be more confident that backpropagation is computing the derivatives of $J$ correctly, and that the optimizer will correctly optimize $\Theta$.
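
One common way to quantify "very close" (a sketch added here, not part of the original notes) is the relative difference between the two gradient vectors:

% Relative difference between numerical and backprop gradients;
% values around 1e-9 or smaller usually indicate a correct implementation
diff = norm(gradApprox - DVec) / norm(gradApprox + DVec);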

Implementation Note:

  • Implement Backprop to compute DVec (unrolled $D^{(1)}, D^{(2)}, D^{(3)}$)
  • Implement Numerical Gradient Checking to calculate gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use the backprop code for learning
Important:
  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction), your code will be very slow.

Random Initialization of $\Theta$:

Initial value of $\Theta$

What if we set all values of initialTheta = zeros(n,1)? The values of $\Theta$ determine the activation values in each layer. If all the weights are the same, then all the hidden units in a layer compute the same activations $a_1, a_2, \ldots, a_n$, their $\delta$ terms are equal, and the partial derivatives of the cost function with respect to the weights feeding each hidden unit are identical, so the units never differentiate from one another.

After each update, the parameters corresponding to inputs going into each of the two hidden units are identical.


Random Initialization : Symmetry breaking

Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$ (i.e. $-\epsilon\le\Theta_{ij}^{(l)}\le\epsilon$)

E.g.:

Theta1 = rand(10,11) *(2*INIT_EPSILON)-INIT_EPSILON;

Theta2 = rand(1,11) *(2*INIT_EPSILON)-INIT_EPSILON;

rand(10,11) will give a random 10x11 matrix with values between 0 and 1.
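
A heuristic sometimes used for choosing INIT_EPSILON (an assumption added here, not part of the original notes) scales it with the sizes of the two layers that $\Theta^{(l)}$ connects:

% Heuristic (assumption): epsilon_init = sqrt(6)/sqrt(s_in + s_out), where
% s_in and s_out are the unit counts of the layers Theta connects
INIT_EPSILON = sqrt(6) / sqrt(s_in + s_out);
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;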




Neural Networks: Backpropagation Algorithm

Gradient Computation:

The cost function for a Neural Network is given below. We need to find the values of $\Theta$ that minimize $J(\Theta)$.

$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$

Goal: $\min_\Theta J(\Theta)$

We need to write the code to compute:
- $J(\Theta)$
- $\frac \partial {\partial\Theta_{ij}^{(l)}} J(\Theta)$

$\Theta_{ij}^{(l)} \in \Re$

How to compute the Partial Derivative Terms?


Example : Given One Training Example: Forward Propagation

Given one training example (x,y):

Forward Propagation: Vectorized implementation to calculate the activation values for all layers

$a^{(1)} = x$
$z^{(2)} = \Theta^{(1)}a^{(1)}$

$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)} = 1$)
$z^{(3)} = \Theta^{(2)}a^{(2)}$

$a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)} = 1$)
$z^{(4)} = \Theta^{(3)}a^{(3)}$

$a^{(4)} = h_\Theta(x) = g(z^{(4)})$
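
In Octave this forward pass might look as follows (a sketch; it assumes x is a column vector of features without the bias unit, and Theta1, Theta2, Theta3 are the weight matrices):

sigmoid = @(z) 1 ./ (1 + exp(-z));   % the activation function g(z)
a1 = [1; x];                         % a(1): input with bias unit added
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];               % a(2), with bias unit a0 = 1
z3 = Theta2 * a2;
a3 = [1; sigmoid(z3)];               % a(3), with bias unit a0 = 1
z4 = Theta3 * a3;
a4 = sigmoid(z4);                    % a(4) = h_Theta(x)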

Gradient Computation: Backpropagation Algorithm




Intuition: We need to compute $\delta_j^{(l)}$, the "error" of node $j$ in layer $l$ (the error in the activation values).

For each output unit (Layer L=4)
$\delta_j^{(4)} = a_j^{(4)} - y_j$ (where $a_j^{(4)} = (h_\Theta(x))_j$)

This $\delta_j^{(4)}$ is essentially the difference between what our hypothesis outputs and the actual value of $y$ in our training set.

Writing in vectorized format:
$\delta^{(4)} = a^{(4)}-y$

Next step is to compute the values of $\delta$ for other layers.
$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}\ .* g'(z^{(3)})$
$\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}\ .* g'(z^{(2)})$

There is no $\delta^{(1)}$ because the first layer corresponds to the input units, which are used as-is, so there are no error terms associated with it.

$g'()$ is the derivative of the sigmoid activation function.
It can be proved that $g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$. Similarly, $g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$.
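
Continuing the Octave sketch from the forward-propagation step (one common convention, assumed here: a3 and a2 include their bias units, so the bias entry of each hidden-layer $\delta$ is dropped):

delta4 = a4 - y;                                  % output-layer error
delta3 = (Theta3' * delta4) .* (a3 .* (1 - a3));  % uses g'(z3) = a3 .* (1 - a3)
delta3 = delta3(2:end);                           % drop the bias-unit entry
delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));
delta2 = delta2(2:end);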

Finally, $\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}$ (ignoring regularization, i.e. $\lambda = 0$)


Backpropagation Algorithm:

The name Backpropagation comes from the fact that we calculate the $\delta$ terms for the output layer first and then use it to calculate the values of other $\delta$ terms going backward.

Suppose we have a training set: $\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})\}$

First, set $\Delta_{ij}^{(l)} = 0$ for all values of $l, i, j$. These $\Delta_{ij}^{(l)}$ are used as accumulators; eventually they will be used to compute the partial derivatives $\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta)$.

For $i = 1$ to $m$:
     Set $a^{(1)} = x^{(i)}$ (activation for the input layer)
     Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \ldots, L$
     Using $y^{(i)}$, compute the error term for the output layer: $\delta^{(L)} = a^{(L)} - y^{(i)}$
     Use the backpropagation equations to compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$ (there is no $\delta^{(1)}$)
     Accumulate the partial derivative terms: $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$, or in vectorized form (with $\Delta$, $a$ and $\delta$ as matrices/vectors): $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$
end;

$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}\ \ \text{if}\ j\ne0$
$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{if}\ j=0$

Finally, we can calculate the partial derivatives as : $\frac \partial {\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
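
A hedged Octave sketch of this accumulation for the 4-layer example (it assumes the forward-propagation and $\delta$ sketches above, a feature matrix X with one example per row, and a matrix Y whose $i$-th column is $y^{(i)}$):

Delta1 = zeros(size(Theta1));   % accumulators, one per weight matrix
Delta2 = zeros(size(Theta2));
Delta3 = zeros(size(Theta3));
for i = 1:m
  x = X(i, :)';  y = Y(:, i);
  % ... forward propagation and delta computation as in the sketches above,
  % yielding a1, a2, a3 (with bias units), a4, and delta2, delta3, delta4
  Delta3 = Delta3 + delta4 * a3';
  Delta2 = Delta2 + delta3 * a2';
  Delta1 = Delta1 + delta2 * a1';
end
% Unregularized gradients; when regularizing, add (lambda/m)*Theta to every
% column except the first (bias) column of each D matrix
D1 = Delta1 / m;
D2 = Delta2 / m;
D3 = Delta3 / m;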

---------------------------------------------------------------------------------


Backpropagation Intuition with Example:

Mechanical steps of the Backpropagation Algorithm

Forward Propagation: Consider a simple NN with 2 input units (not counting the bias unit), 2 units in each of two hidden layers (not counting the bias unit in each layer), and one output unit in layer 4.




The input $(x^{(i)},y^{(i)})$ is represented as below. We first use forward propagation to compute the value of $z_1^{(2)}$, then apply the sigmoid activation function to $z_1^{(2)}$ to get $a_1^{(2)}$. Similarly, the values of $z_2^{(2)}$ and $a_2^{(2)}$ are computed to complete layer 2. The bias unit $a_0^{(2)} = 1$ is then added to layer 2.




Once the values of layer 2 are calculated, we apply the same methodology to compute the values of layer 3 (a & z)


For the values of layer 3, we have the weights $\Theta^{(2)}$, which are used to compute the values of $z_i^{(3)}$:

$z_1^{(3)} = \Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)}$

What is Backpropagation doing? Backpropagation is almost doing the same thing as forward propagation in the opposite direction (right to left, from output to input)

The cost function again:
$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$

Assume $\lambda = 0$ (i.e. ignore the regularization term for now).
Focusing on a single example $(x^{(i)},y^{(i)})$ and the case of one output unit:

cost($i$) = $y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\Theta(x^{(i)}))$, the cost associated with the training example $(x^{(i)},y^{(i)})$

Think of cost($i$) $\approx (h_\Theta(x^{(i)}) - y^{(i)})^2$, i.e. a measure of how well the network is doing on example $i$ (just for the purpose of intuition): how close the output is to the actual observed value $y$.



The $\delta$ terms are actually partial derivatives of the cost associated with example $i$: formally, $\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\text{cost}(i)$. They are a measure of how much the intermediate values of the network would need to change to bring the hypothesis output $h_\Theta(x)$ closer to the actual observed value of $y$.


For the output layer, we first set the value of $\delta_1^{(4)} = a_1^{(4)} - y^{(i)}$.
Next, we calculate the values of $\delta^{(3)}$ in the layer 3 using $\delta^{(4)}$, and the values of $\delta^{(2)}$ in the layer 2 using $\delta^{(3)}$. Please note that there is no $\delta^{(1)}$.

Focusing on the value of $\delta_2^{(2)}$: it is a weighted sum (the weights being the parameters $\Theta$) of the values $\delta_1^{(3)}$ and $\delta_2^{(3)}$.



$\delta_2^{(2)} = \Theta_{12}^{(2)}\delta_1^{(3)}+ \Theta_{22}^{(2)}\delta_2^{(3)}$
This corresponds to the magenta and red values in the diagram above.

Similarly, to calculate $\delta_2^{(3)}$,

$\delta_2^{(3)} =  \Theta_{12}^{(3)}\delta_1^{(4)}$

Once we have calculated the values of $\delta$, we can use the optimizer function to calculate the optimized values of $\Theta$.














Neural networks : Cost Function

Neural Network Classification:

Above: A Neural Network with 4 layers

Training Set: $\{ (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\ldots,(x^{(m)},y^{(m)})\}$
L = Total number of layers in the network (L=4 in the above case)
$s_l$ = number of units (not counting bias units) in the layer $l$

There are two types of Neural Network outcomes:

Binary Classification:
$y$ = 0 or 1 ; 1 Output Unit
$s_L=1$, K=1

Multiclass Classification:

Number of output units: K

$y\in\Re^{K}$

E.g. if K=4

The output $y$ will be one of the $K$ vectors: $\begin{bmatrix}1 \cr 0\cr0\cr0\cr\end{bmatrix}$, $\begin{bmatrix}0 \cr 1\cr0\cr0\cr\end{bmatrix}$,$\begin{bmatrix}0\cr 0\cr1\cr0\cr\end{bmatrix}$,$\begin{bmatrix}0\cr 0\cr0\cr1\cr\end{bmatrix}$
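
In Octave, integer class labels $y^{(i)} \in \{1,\ldots,K\}$ can be converted into these vectors with an identity-matrix trick (a common idiom, shown here as a sketch):

% Row i of Y is the K-dimensional 0/1 vector for example i
I = eye(K);
Y = I(y, :);   % y is an m x 1 vector of labels in 1..K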



Cost Function:

Cost function for a Neural Network is a generalization of the Cost Function of a Logistic Regression.

Logistic Regression Cost Function:

$J(\theta) = -\frac{1}{m} \left[\sum_{i=1}^m y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$

Neural Network Cost Function:

Neural Network outputs vectors in $\Re^K$

$h_\Theta(x)\in \Re^K$;
$(h_\Theta(x))_i$ is the $i^{th}$ output

Cost Function:
$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$

The summation $\sum_{k=1}^K$ is over the $K$ output units, i.e. it adds up the cost for each of the $K$ output units.

Regularization term: we don't sum over the terms corresponding to the bias units, i.e. the weights $\Theta_{i0}^{(l)}$ multiplying $a_0$ (or $x_0$). Even if the bias terms were included, the result would usually be similar.
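
A vectorized Octave sketch of this cost for the 4-layer network above, assuming H is the $m \times K$ matrix whose $(i,k)$ entry is $(h_\Theta(x^{(i)}))_k$ and Y is the corresponding $m \times K$ label matrix (the variable names are illustrative):

% Unregularized cost: sum over all m examples and K output units
J = -(1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));

% Regularization: all weights except the bias columns (column 1)
reg = (lambda / (2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + ...
                          sum(sum(Theta2(:, 2:end).^2)) + ...
                          sum(sum(Theta3(:, 2:end).^2)));
J = J + reg;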


Neural Networks: MultiClass Classification


Tuesday, May 12, 2015

Neural Network: Model Representation

Anatomy of a Neuron:

A single neuron can be described by the following system: dendrites collect and transmit incoming information, the cell body acts as a node/processing unit, and the axon serves as an output wire.





We can mimic a simple Logistic Model as a neuron by the following diagram:



The Output function $h_\theta(x)$ is defined by the sigmoid (logistic) activation function $\frac 1 {1+e^{-\theta^Tx}}$

The vector of weights/parameters $\theta$ is defined as $\begin{bmatrix}\theta_0 \cr \theta_1 \cr \theta_2 \cr \theta_3 \cr \end{bmatrix}$, and the input vector $x$ as $\begin{bmatrix}x_0 \cr x_1 \cr x_2 \cr x_3 \cr \end{bmatrix}$

Sigmoid (logistic) activation function:

$g(z) = \frac 1 {1+e^{-z}}$

The above representation is of a very simple, basic network: input units feeding a single neuron. Typically, a Neural Network has multiple input units, multiple hidden layers with multiple units each, and multiple output units for multiclass classification.

The input layer has an additional unit called 'the bias unit' ($x_0$), which is equal to 1. The activation (hidden) layers also each have an additional bias unit equal to one.

Neural Network:



Details:

The Neural Network above consists of three layers : an Input Layer (Layer 1), a Hidden Layer (Layer 2), and an Output Layer (Layer 3). Both the Input and Hidden Layers (Layers 1 and 2) contain a bias unit $x_0$ and $a_0^{(2)}$ respectively.

The Hidden Layer: Layer 2, or the 'activation layer', consists of activation units $a_i^{(j)}$ computed from the input layer. Each input unit feeds into each activation unit, and the interaction is characterized by the weight parameters $\Theta$.

Number of Units: 4 (including a bias input unit) in  Layer 1, 4 (including a bias activation unit) in Layer 2, 1 in Output Layer (Layer 3).

Definitions:
$a_i^{(j)}$: Activation of unit $i$ in layer $(j)$;
$\Theta^{(j)}$: matrix of weights controlling the function mapping from layer $(j)$ to layer $(j+1)$.

$\Theta^{(1)} \in \Re^{3\times4}$: one row for each computed activation unit $a_1^{(2)}, a_2^{(2)}$ and $a_3^{(2)}$ (the bias unit $a_0^{(2)} = 1$ is set directly, not computed from $\Theta$). The four columns correspond to the input units (including the bias input) $x_0, x_1, x_2$ and $x_3$.

The subscript denotes the units (0,1,2, and 3), and the superscript denotes the layer number (1,2 or 3).

$a_1^{(2)}$ denotes the first unit in the second layer.

The sigmoid function $g(z)$ is defined by  $g(z) = \frac 1 {1+e^{-z}}$

z is defined as follows:

Layer 2: The activation layer
$a_1^{(2)}=g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)$
$a_2^{(2)}=g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)$
$a_3^{(2)}=g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$

Layer 3: The output layer
$h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$

Dimensions of $\Theta^{(j)}$: If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1}\times(s_j+1)$

The value of z:
$a_1^{(2)} = g(z_1^{(2)})$
$a_2^{(2)} = g(z_2^{(2)})$
$a_3^{(2)} = g(z_3^{(2)})$

$a_1^{(3)} = g(z^{(3)})$


Forward Propagation Model: Vectorized Implementation:

Define X, $\Theta$, $z_i^j$, $a_i^j$ in a vector notation

$a^{(1)} = x = \begin{bmatrix}x_0 \cr x_1 \cr x_2 \cr x_3 \cr \end{bmatrix} = \begin{bmatrix}a_0^{(1)} \cr a_1^{(1)} \cr a_2^{(1)} \cr a_3^{(1)}\cr \end{bmatrix}$

$\Theta^{(1)} = \begin{bmatrix}
\Theta_{10}^{(1)} & \Theta_{11}^{(1)} & \Theta_{12}^{(1)} & \Theta_{13}^{(1)} \cr
\Theta_{20}^{(1)} & \Theta_{21}^{(1)} & \Theta_{22}^{(1)} & \Theta_{23}^{(1)} \cr
\Theta_{30}^{(1)} & \Theta_{31}^{(1)} & \Theta_{32}^{(1)} & \Theta_{33}^{(1)} \cr
\end{bmatrix} \in \Re^{3\times4}$

$z^{(2)} = \Theta^{(1)}a^{(1)}$

$\Theta^{(1)} : 3\times 4$, $a^{(1)} : 4 \times 1$; $z^{(2)} : 3 \times 1$

$a^{(2)} = g(z^{(2)}) \in \Re^3$

Add $a_0^{(2)} = 1$ as a bias unit to the activation layer (layer 2), so that $a^{(2)} \in \Re^4$

$\Theta^{(2)} = \begin{bmatrix}
\Theta_{10}^{(2)} & \Theta_{11}^{(2)} & \Theta_{12}^{(2)} & \Theta_{13}^{(2)}
\end{bmatrix} \in \Re^{1\times4}$

$z^{(3)} = \Theta^{(2)}a^{(2)}$

$\Theta^{(2)} : 1\times 4$, $a^{(2)} : 4 \times 1$; $z^{(3)} : 1 \times 1$

$h_\Theta(x) = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$

$h_\Theta(x) = a^{(3)} = g(z^{(3)})$
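
In Octave, this vectorized forward pass is only a few lines (a sketch; it assumes x is the 3-element feature vector without the bias unit):

sigmoid = @(z) 1 ./ (1 + exp(-z));   % g(z)
a1 = [1; x];              % 4x1: input plus bias unit x0 = 1
z2 = Theta1 * a1;         % 3x4 * 4x1 = 3x1
a2 = [1; sigmoid(z2)];    % 4x1: activations plus bias unit a0 = 1
z3 = Theta2 * a2;         % 1x4 * 4x1 = 1x1
h  = sigmoid(z3);         % h_Theta(x)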


Neural Network learning its own features: Forward Propagation

This is called forward propagation because we compute the activations layer by layer, moving forward through the network: we map layer 1 to layer 2 using the weight parameters, then layer 2 to layer 3, and so on. Each layer's activations, together with the parameters $\Theta$, serve as the input to the next layer, until the output layer is reached.


Network Architecture: 

A Neural Network can be more complex than the one shown above: it can have multiple hidden (activation) layers $a^{(j)}$ and multiple output units.








Neural Networks: Introduction

Neural Networks: Introduction - How is it different from Logistic Regression

Neural Networks are a class of algorithms used widely for many purposes. They mimic the functioning of the brain (networks of neurons, hence the name) and try to simulate the brain's network structure to train a computer.

Why not logistic regression? Logistic regression is a class of algorithms used for much the same set of problems as neural networks, but it has various constraints. Logistic regression typically works with a small set of features, where all the relevant polynomial terms ($x_1^2x_2$, $x_1x_2^2$, etc.) can be included in the model. The problem occurs when we have too many features: if we have 100 variables and we need every combination of them, the model will include terms like $x_{23}^3x_{74}^7x_{12}^{33}$ and so on. It becomes extremely hard to create and analyze these features, the model is prone to overfitting, and the computations become extremely expensive. If, instead, we reduce the number of features, we lose information about the data.

Neural Networks aim to solve this issue. They are built to handle a huge number of features efficiently, which is the typical case in image recognition or handwriting recognition problems. They can be used for simple classification, multiclass classification, or prediction models.