Machine Learning, Vizualization & Analytics: regularization

Showing posts with label regularization. Show all posts

Tuesday, May 5, 2015

Regularization : Logistic Regression

Regularization: Logistic Regression

The problem of overfitting can occur in a Logistic regression model in case the model includes high order polynomial terms, like the following quation

$h_\theta(x) = g(\theta_0 +\theta_1x_1 + \theta_2x_1^2 + \theta_3x_1^2x_2 + \theta_4x_1^2x_2^2 + \theta_5x_1^2x_2^3... )$

The cost function of a Logistic Regression model is given by:

$J(\theta) = -\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}$

In a similar way as regularization using Linear Regression, we add a regularization term to the cost function which is defined as $\frac \lambda {2m} \sum_{j=1}^n\theta_j^2$

We do not add $\theta_0$ in the regularization term, and the regularization parameter is defined for $\theta_1, \theta_2, \theta_3.....\theta_n$

The Cost Function for Logistic regression becomes:

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

Gradient Descent with Regularization:
$\theta_0 := \theta_0-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j}\right]$

$\theta_j := \theta_j-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j} - \frac \lambda m \theta_j \right]$ Simultaneously update for all $\theta_j$

The value of $\theta_0$ is calculated separately without adding the regularization term. The value of j ranges from 1...n in the regularization term.

Regularization with Advanced Optimization:

Estimating $\theta$ using advanced optimization

Code:

function [jVal, gradient] = costFunction(theta)

jVal = [code to compute $J(\theta)$]

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

gradient(1) = [code to compute $\frac \partial {\partial\theta_0} J(\theta)$]

No regularization term for $\theta_0$

gradient(2) = [code to compute $\frac \partial {\partial\theta_1} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_1^{(i)} - \frac \lambda m \theta_1$

gradient(3) = [code to compute $\frac \partial {\partial\theta_2} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_2^{(i)} - \frac \lambda m \theta_2$

gradient(n+1) = [code to compute $\frac \partial {\partial\theta_n} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_n^{(i)} - \frac \lambda m \theta_1$

Monday, May 4, 2015

Regularization : Linear Regression

Regularization : Linear Regression

$J(\theta)$ = $min \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right] $

Goal: minimize $J(\theta)$

Gradient Descent:

GD Algorithm without the regularization term:

repeat until convergence $\{$

$\theta_0 := \theta_0 - \alpha{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.x_0^{(i)}$

$\theta_j := \theta_j-\alpha {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}}$

$\}$

Gradient Descent with Regularization term $\lambda$ added:

Since we are not regularizing $\theta_0$, it is updated separately and the regularization term is not included in the calculation of $\theta_0$

repeat until convergence $\{$

$\theta_0 := \theta_0 - \alpha{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.x_0^{(i)}$

$\theta_j := \theta_j-\alpha \left[{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}} - \frac \lambda m \theta_j \right]$
(j= 1,2,3,....n)
$\}$

Simplifying:

$\theta_j := \theta_j(1-\alpha\frac \lambda m) - \alpha\frac 1 m \sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}}$

The term $(1-\alpha\frac \lambda m)$ is less than 1 (0.9 or 0.95) and this reduces the value of $\theta_j$ in every step by multiplying it repeatedly with a number less than 1, thus achieveing the goal of regularization.

Normal Equation Method:

Matrices:

X : m x (n+1) matrix
y: m- dimensional vector

Without Regularization:

$\theta = (X^TX)^{-1}X^Ty$

With Regularization

$\theta = (X^TX + \lambda\begin{bmatrix}0& 0& 0& 0& 0& 0 \cr 0& 1& 0& 0& 0& 0 \cr 0& 0& 1& 0& 0& 0 \cr 0& 0& 0& .& .& . \cr 0& 0& 0& .& .& .\cr 0& 0& 0& .& .& 1\cr\end{bmatrix})^{-1}X^Ty$

The Matrix multiplied with $\lambda$ is a $(n+1)$ x$ (n+1)$ matrix, where the first row, first columns and all other non diagonal elements are 0. Only the elements on the main diagonal (except the first one) are equal to 1

The Normal equation method with regularization takes care of the non-invertibility of the matrix $X^TX$

Regularization for Linear & Logistic Regression : Overfitting & Cost Function

The Problem of Over fitting:

The parameters that generate a model might throw up three type of results: Overfit, Right Fit and Underfit

Overfitting:

High Variance scenario, the model fits too well with the training data
Usually occurs due to the presence of large number of higher order parameters
Fits the training examples quite well, but fails to generalize to new examples

Underfitting:

High Bias scenario, the model fits poorly with the data and usually biased (generalized one result for all data)
Doesn't fit the training data well

Addressing Overfitting:

Reduce the number of features

Manually Select which features to keep
Model selection algorithm

Regularization

Keep all the features, but reduce the magnitude/values of parameters $\theta_j$
Works well when we have a lot of features, each one of which contributes a bit in predicting $y$

Regularization: Cost Function

Intuition:

$h_\theta(x)$ for an overfitted function :

$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$

If we penalize and make $\theta_3$ and $\theta_4$ very small, by changing the cost function to:

Cost ($J(\theta)$) = $min \frac 1 {2m} \sum_{i=1}^m {(h_\theta(x)-y)^2} + 1000\theta_3^2 + 1000\theta_4^2$

Multiplying 1000 to $\theta_3$ and $\theta_4$ in the cost function forces the algorithm to reduce the value of both $\theta_3$ and $\theta_4$ and make it very close to 0.

The smaller values of all $\theta$'s will generate a simpler hypothesis which will be less prone to overfitting.

Implementaion:

The Regularization Parameter $\lambda$:

The cost function can be rewritten as :

Cost ($J(\theta)$) = $min \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right] $

Points to note:

An extra term is added to regularize all the $\theta$ parameters, $\theta_0$ is not penalized (the summation term starts at j=1 and not at j=0, leaving $\theta_0$)
The parameter $\lambda$ controls a tradeoff between two goals of the equation : Fit the data well (first part), and keep the coefficients small (second part)
Selecting the value of $\lambda$: If $\lambda$ is too large, the parameters $\theta$ will be penalized very heavily and will become very close to 0. The hypothesis will become $h_\theta(x) = \theta_0$ which is a highly biased hypothesis
If $\lambda$ is too large, the Gradient Descent Algorithm will fail to converge