
Tuesday, May 5, 2015

Regularization: Logistic Regression

The problem of overfitting can occur in a Logistic Regression model when the model includes high-order polynomial terms, as in the following equation:

$h_\theta(x) = g(\theta_0 +\theta_1x_1 + \theta_2x_1^2 + \theta_3x_1^2x_2 + \theta_4x_1^2x_2^2 + \theta_5x_1^2x_2^3... )$

The cost function of a Logistic Regression model is given by:

$J(\theta) = -\frac 1 m \sum^m_{i=1}\left[y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))\right]$

As with regularized Linear Regression, we add a regularization term to the cost function, defined as $\frac \lambda {2m} \sum_{j=1}^n\theta_j^2$

We do not include $\theta_0$ in the regularization term; the regularization is applied only to $\theta_1, \theta_2, \theta_3, \ldots, \theta_n$

The Cost Function for Logistic regression becomes:

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

Gradient Descent with Regularization:
$\theta_0 := \theta_0-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_0}\right]$

$\theta_j := \theta_j-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j} + \frac \lambda m \theta_j \right]$ Simultaneously update for all $\theta_j$

The value of $\theta_0$ is calculated separately without adding the regularization term. The value of j ranges from 1...n in the regularization term.
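
A minimal Octave sketch of one such update step follows, assuming a sigmoid hypothesis, a design matrix X (m x (n+1)) whose first column is all ones, and a label vector y; the function and variable names are illustrative, not part of the original post.

% One regularized gradient descent step for logistic regression.
% theta(1) plays the role of theta_0 and is not regularized.
function theta = gradientStep(theta, X, y, alpha, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % h_theta(x) = g(theta' * x)
  grad = (1 / m) * (X' * (h - y));                         % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end); % add (lambda/m)*theta_j for j >= 1
  theta = theta - alpha * grad;                            % simultaneous update
end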

Regularization with Advanced Optimization:

Estimating $\theta$ using advanced optimization


Code:

function [jVal, gradient] = costFunction(theta)

jVal = [code to compute $J(\theta)$]

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

gradient(1) = [code to compute $\frac \partial {\partial\theta_0} J(\theta)$]
No regularization term for $\theta_0$

gradient(2) = [code to compute $\frac \partial {\partial\theta_1} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_1^{(i)} + \frac \lambda m \theta_1$

gradient(3) = [code to compute $\frac \partial {\partial\theta_2} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_2^{(i)} + \frac \lambda m \theta_2$
.
.
.

gradient(n+1) = [code to compute $\frac \partial {\partial\theta_n} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_n^{(i)} + \frac \lambda m \theta_n$
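
As a concrete, vectorized version of this template, here is a minimal Octave sketch of a regularized cost function; passing X, y, and lambda as extra arguments (and the names themselves) are assumptions for illustration, not part of the original template.

% Regularized logistic regression cost and gradient, vectorized.
% X: m x (n+1) matrix with a leading column of ones, y: m x 1 vector of 0/1 labels.
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % theta_0 not penalized
  gradient = (1 / m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

An anonymous function such as @(t) costFunctionReg(t, X, y, lambda) can adapt it to the single-argument form expected by fminunc.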






Monday, May 4, 2015

Regularization for Linear & Logistic Regression : Overfitting & Cost Function

The Problem of Overfitting:

The parameters that generate a model can produce three types of results: overfit, right fit, and underfit.


Overfitting:
  • High variance scenario: the model fits the training data too closely
  • Usually occurs due to the presence of a large number of higher-order terms
  • Fits the training examples quite well, but fails to generalize to new examples
Underfitting:
  • High bias scenario: the model fits the data poorly and is biased toward one generalized result for all the data
  • Doesn't fit the training data well

Addressing Overfitting:

  1. Reduce the number of features
    1. Manually select which features to keep
    2. Use a model selection algorithm
  2. Regularization
    1. Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$
    2. Works well when we have a lot of features, each of which contributes a bit to predicting $y$

 Regularization: Cost Function

Intuition:

$h_\theta(x)$ for an overfitted function :

$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$

We can penalize $\theta_3$ and $\theta_4$ and make them very small by changing the cost function to:

Cost ($J(\theta)$) = $\min_\theta \frac 1 {2m} \sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + 1000\theta_3^2 + 1000\theta_4^2$

Multiplying $\theta_3^2$ and $\theta_4^2$ by 1000 in the cost function forces the algorithm to reduce the values of both $\theta_3$ and $\theta_4$ and make them very close to 0.

The smaller values of all $\theta$'s will generate a simpler hypothesis which will be less prone to overfitting.

Implementation:

The Regularization Parameter $\lambda$:
 
The cost function can be rewritten as :

Cost ($J(\theta)$) = $min \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right]  $

Points to note:
  • An extra term is added to regularize all the $\theta$ parameters; $\theta_0$ is not penalized (the summation starts at j=1 and not at j=0, leaving out $\theta_0$)
  • The parameter $\lambda$ controls a tradeoff between the two goals of the equation: fitting the data well (first part) and keeping the coefficients small (second part)
  • Selecting the value of $\lambda$: if $\lambda$ is too large, the parameters $\theta$ are penalized very heavily and become very close to 0. The hypothesis reduces to $h_\theta(x) = \theta_0$, which is a highly biased hypothesis
  • If $\lambda$ is too large, the Gradient Descent algorithm may also fail to converge
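
As a quick illustration of this regularized cost, here is a minimal Octave sketch; the names regularizedCost, X, y, and lambda are illustrative assumptions, and X is taken to be an m x (n+1) matrix with a leading column of ones.

% Regularized linear regression cost, as in the formula above.
% theta_0 (theta(1) in Octave indexing) is not penalized.
function jVal = regularizedCost(theta, X, y, lambda)
  m = length(y);
  errors = X * theta - y;                      % h_theta(x^(i)) - y^(i)
  jVal = (1 / (2 * m)) * (sum(errors .^ 2) + lambda * sum(theta(2:end) .^ 2));
end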

Logistic Regression : One Vs All Classification

MultiClass Classification : Logistic Regression

Examples:

  • Email Folder tagging: Work (y=1), Friends (y=2), Family (y=3), Travel (y=4)
  • Weather : Sunny (y=1), Cloudy (y=2), Rain (y=3)
The outcome variable $y$ is not restricted to only two outcomes, $y \in \{0,1\}$, but can take values $y \in \{1, 2, 3, 4, \ldots\}$ depending on the number of classes/groups we need.


In multiclass classification, we train a separate classifier for y=1, y=2, y=3, and so on; to make a prediction for a new input, we select the class that maximizes $h_\theta^{(i)}(x)$

$h_\theta^{(i)}(x) = P(y=i | x;\theta)$

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$
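
A minimal Octave sketch of the prediction step, assuming the K trained parameter vectors are stacked as the rows of a matrix allTheta (an illustrative name) and x is an (n+1) x 1 feature vector with x(1) = 1:

% One-vs-all prediction: evaluate every classifier and pick the best class.
function prediction = predictOneVsAll(allTheta, x)
  probs = 1 ./ (1 + exp(-allTheta * x));   % h_theta^(i)(x) for every class i
  [~, prediction] = max(probs);            % class i with the highest probability
end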




Logistic Regression : Advanced Optimization

Advanced Optimization for Logistic Regression : Finding the values of $\theta$

The Gradient Descent Algorithm is one way to calculate the values of the parameters $\theta$. However, it involves selecting an appropriate value of $\alpha$, which may require multiple runs.

There are other optimization algorithms for minimizing a cost function $J(\theta)$, such as Conjugate Gradient, BFGS and L-BFGS. They are a bit more complex, but there is no need to manually pick a value of $\alpha$, and they are often much faster than Gradient Descent.

Coding the Advanced Optimization Algorithms in MATLAB/Octave:

Example:

Say we have to optimize $\theta_1$ and $\theta_2$, and $J(\theta)$ is given by
$J(\theta) = {(\theta_1-5)}^2+{(\theta_2-5)}^2$
$\frac \partial {\partial\theta_1} J(\theta) = 2(\theta_1-5)$
$\frac \partial {\partial\theta_2} J(\theta) = 2(\theta_2-5)$

We write the function in Matlab/Octave which calculates the value of the cost function $J(\theta)$ and the partial derivatives (gradient 1 for $\theta_1$ and gradient 2 for $\theta_2$)

Note: Indexing in Octave/MATLAB starts from 1, so $\theta_0, \theta_1, \ldots, \theta_n$ in the equations correspond to theta(1), theta(2), ..., theta(n+1) in the code.

Code:
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end

Once the code for calculating jVal (the cost function) and gradient is written, the values of $\theta$ are optimized by the following code:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
     = fminunc(@costFunction, initialTheta, options);
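
For this particular $J(\theta)$, optTheta should come back very close to $\pmatrix{5 \cr 5}$ (the minimizer of the cost), functionVal should be close to 0, and a positive exitFlag indicates that fminunc has converged.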



Recap:

theta = $\pmatrix{\theta_0 \cr \theta_1 \cr .\cr .\cr \theta_{n}}$
function [jVal, gradient] = costFunction(theta)

jVal = [code to compute $J(\theta)$]

gradient(1) = [code to compute $\frac \partial {\partial\theta_0} J(\theta)$]
gradient(2) = [code to compute $\frac \partial {\partial\theta_1} J(\theta)$]

gradient(n+1) = [code to compute $\frac \partial {\partial\theta_n} J(\theta)$]

Logistic Regression : Cost Function & Gradient Descent

Logistic Regression: Cost Function

The Logistic Regression hypothesis function is given by
$h_\theta(x) = \frac 1 {1+e^{-\theta^Tx}}$

The question is : how do we choose the parameters $\theta$?

Recap: Linear Regression Cost Function

$J(\theta) = \frac 1 m \sum^m_{i=1} \frac 1 2 (h_\theta(x^{(i)}) - y^{(i)})^2$

Logistic Regression Cost Function:

In Logistic Regression, $h_\theta(x) = \frac 1 {1+e^{-\theta^Tx}}$  as opposed to linear regression where $h_\theta(x) = \theta_0 + \theta_1x_1...$.

The problem with using $\frac 1 2 (h_\theta(x^{(i)}) - y^{(i)})^2$ as the cost for Logistic Regression is that the resulting $J(\theta)$ is non-convex, i.e. it has multiple local minima, so Gradient Descent is not guaranteed to reach the global minimum. For Gradient Descent to converge reliably, the cost function has to be convex, as it is for Linear Regression.


Cost Function :

$Cost (h_\theta(x),y) = -log(h_\theta(x))$ if y = 1
$Cost (h_\theta(x),y) = -log(1- h_\theta(x))$ if y = 0

or 

$Cost (h_\theta(x),y) = -ylog(h_\theta(x)) -(1-y)log(1- h_\theta(x))$

If y=1:
Cost = 0 if y=1 and $h_\theta(x)$ = 1 (i.e. if the actual value of y = 1 and the predicted value of y is also 1)

But as $h_\theta(x) \rightarrow 0$, $Cost \rightarrow \infty$

If y=0:
Cost = 0 if y=0 and $h_\theta(x)$ = 0 (i.e. if the actual value of y = 0 and the predicted value of y is also 0)

But as $h_\theta(x) \rightarrow 1$, $Cost \rightarrow \infty$

Simplified Cost Function:

$J(\theta) = \frac 1 m \sum^m_{i=1}Cost(h_\theta(x^{(i)}), y^{(i)})$

$J(\theta) = -\frac 1 m \sum^m_{i=1}\left[y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))\right]$

Now, to fit parameters $\theta$, we need to minimize $J(\theta)$

To make a prediction given a new $x$:
Output $h_\theta(x) = \frac 1 {1+e^{-\theta^Tx}}$

Gradient Descent Function for Logistic Regression:

Pseudocode : repeat until convergence $\{$

$\theta_j := \theta_j - {\alpha}{\frac {\partial }{ \partial {\theta_j}}}{J(\theta)}$

$\}$

Substituting the derivative of the cost function:

$\theta_j := \theta_j-\alpha {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j}$ Simultaneously update for all $\theta_j$

The Gradient Descent algorithm for Logistic Regression is the same as the one for Linear Regression; the only difference is the hypothesis $h_\theta(x)$, which here is $h_\theta(x) = \frac 1 {1+e^{-\theta^Tx}}$ instead of $h_\theta(x) = \theta^Tx$ for Linear Regression.
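
A minimal Octave sketch of this (unregularized) batch gradient descent loop, with illustrative names and assuming X is an m x (n+1) design matrix with a leading column of ones and y an m x 1 vector of 0/1 labels:

% Batch gradient descent for logistic regression.
function theta = logisticGradientDescent(X, y, alpha, numIters)
  m = length(y);
  theta = zeros(size(X, 2), 1);
  for iter = 1:numIters
    h = 1 ./ (1 + exp(-X * theta));                      % sigmoid hypothesis
    theta = theta - alpha * (1 / m) * (X' * (h - y));    % simultaneous update of all theta_j
  end
end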





Logistic Regression : Decision Boundary

$h_\theta(x) = g(\theta^Tx)$

$g(z) = \frac 1 {1+e^{-z}}$

Threshold:

Predict y=1 if $h_\theta(x)\ge0.5$ or $\theta^Tx\ge0$
g(z) $\ge$ 0.5 when z $\ge$ 0
$h_\theta(x) = g(\theta^Tx)$

Predict y=0 if $h_\theta(x)\lt0.5$ or $\theta^Tx\lt0$
g(z) $\lt$ 0.5 when z $\lt$ 0
$h_\theta(x) = g(\theta^Tx)$

Decision Boundary:

The Decision Boundary is a property of the hypothesis and its parameters, not of the dataset.



In the example above, the two classes of data points (red crosses and blue circles) can be separated by a decision boundary whose equation is given by:

$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$

Suppose the parameter vector $\theta$ is defined by

$\theta = \pmatrix {-3 \cr 1 \cr 1 \cr}$

The model becomes:

Predict "y=1" if $-3 + x_1 + x_2 \ge 0$
or $x_1 + x_2 \ge 3$

Similarly, predict "y=0" if $x_1 + x_2 \lt 3$

At the Decision Boundary, when $x_1 + x_2 =3$, $h_\theta(x) = 0.5$

A Decision Boundary can be nonlinear, depending on the parameters. For example, a logistic regression with the following hypothesis will result in a circular Decision Boundary:

$h_\theta(x)  = g(\theta_0 +  \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)$
where the vector $\theta$ is given by $\theta = \pmatrix{-1 \cr 0\cr0\cr1\cr1\cr}$

In this case, predict "y=1" when $z\ge0$, where $z=-1 +x_1^2+x_2^2$
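
A small Octave sketch of this circular-boundary prediction, using an illustrative function name and the parameter vector given above:

% Predict y for the circular decision boundary with theta = [-1; 0; 0; 1; 1]
% and features [1, x1, x2, x1^2, x2^2].
function y = predictCircle(x1, x2)
  theta = [-1; 0; 0; 1; 1];
  features = [1; x1; x2; x1^2; x2^2];
  h = 1 / (1 + exp(-theta' * features));   % h_theta(x) = g(theta' * x)
  y = (h >= 0.5);                          % predict y = 1 when z >= 0
end

For example, predictCircle(0, 0) returns 0 (inside the unit circle), while predictCircle(2, 0) returns 1.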

The Cost Function and the Gradient Descent algorithm work in much the same manner as in Linear Regression, with the exception of the hypothesis $h_\theta(x)$, which is different for Linear and Logistic Regression.




Classification using Logistic Regression

Classification differs from linear regression in that it assigns the data to two or more categories. It is still called regression because it takes the input from the training data and builds a model much like linear regression, but since it uses the logistic (sigmoid) function to classify the data points, it is named Logistic Regression.

Why can't we use Linear Regression for classification? Linear Regression can separate two sets of data points, even with a higher-order polynomial in the case of a non-linear decision boundary, but the presence of any outlier seriously affects the classification. This problem is handled much better by Logistic Regression.

Logistic regression examples:

  • Emails: Spam/Not Spam
  • Online Transactions: Fraud/Not Fraud
  • Tumor: Malignant/Benign
Typically the target variable (outcome) is classified as 

$y \in \{0,1\}$
0: Negative Class (e.g. benign Tumor)
1: Positive Class (e.g. malignant tumor)

Differences with Linear Regression:

Logistic: y = 0 or 1

Logistic:$0\le h_\theta(x) \le 1$
Linear: $h_\theta(x)$ can be >1 or <0

Logistic Regression Model:

Want: $0\le h_\theta(x) \le 1$

$h_\theta(x) = g(\theta^Tx)$

$g(z) = \frac 1 {1+e^{-z}}$

where $z= \theta^Tx$

The function g(z) is called the logistic function or the Sigmoid Function
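
A minimal Octave sketch of the sigmoid function (the function name is illustrative):

% Vectorized sigmoid (logistic) function g(z) = 1 / (1 + e^(-z)).
% Works element-wise on scalars, vectors, and matrices.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

sigmoid(0) returns 0.5; large positive z gives values close to 1, and large negative z gives values close to 0.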


Interpretation of Hypothesized Output:

$h_\theta(x)$ = estimated probability that y=1 on input x

Example: If $h_\theta(x)$ = 0.7, there is a 70% chance that y = 1

$h_\theta(x) = P(y=1 | x ; \theta)$
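
Since $y$ can only take the values 0 and 1, the estimated probability of $y=0$ follows directly:

$P(y=0 | x;\theta) = 1 - P(y=1 | x;\theta) = 1 - h_\theta(x)$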