Showing posts with label regularization. Show all posts
Showing posts with label regularization. Show all posts

Tuesday, May 5, 2015

Regularization : Logistic Regression

Regularization: Logistic Regression

The problem of overfitting can occur in a Logistic regression model in case the model includes high order polynomial terms, like the following quation

$h_\theta(x) = g(\theta_0 +\theta_1x_1 + \theta_2x_1^2 + \theta_3x_1^2x_2 + \theta_4x_1^2x_2^2 + \theta_5x_1^2x_2^3... )$

The cost function of a Logistic Regression model is given by:

$J(\theta) = -\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}$

In a similar way as regularization using Linear Regression, we add a regularization term to the cost function which is defined as $\frac \lambda {2m} \sum_{j=1}^n\theta_j^2$

We do not add $\theta_0$ in the regularization term, and the regularization parameter is defined for $\theta_1, \theta_2, \theta_3.....\theta_n$

The Cost Function for Logistic regression becomes:

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

Gradient Descent with Regularization:
$\theta_0 := \theta_0-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j}\right]$

$\theta_j := \theta_j-\alpha\left[ {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x^{(i)}_j} - \frac \lambda m \theta_j \right]$ Simultaneously update for all $\theta_j$

The value of $\theta_0$ is calculated separately without adding the regularization term. The value of j ranges from 1...n in the regularization term.

Regularization with Advanced Optimization:

Estimating $\theta$ using advanced optimization


Code:

function [jVal, gradient] = costFunction(theta)

jVal = [code to compute $J(\theta)$]

$J(\theta) = -\left[\frac 1 m \sum^m_{i=1}{y^{(i)}log(h_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))}\right] + \frac \lambda {2m} \sum_{j=1}^n\theta_j^2 $

gradient(1) = [code to compute $\frac \partial {\partial\theta_0} J(\theta)$]
No regularization term for $\theta_0$

gradient(2) = [code to compute $\frac \partial {\partial\theta_1} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_1^{(i)} - \frac \lambda m \theta_1$

gradient(3) = [code to compute $\frac \partial {\partial\theta_2} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_2^{(i)} - \frac \lambda m \theta_2$
.
.
.

gradient(n+1) = [code to compute $\frac \partial {\partial\theta_n} J(\theta)$]

$\frac 1 m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}).x_n^{(i)} - \frac \lambda m \theta_1$






Monday, May 4, 2015

Regularization : Linear Regression

Regularization : Linear Regression

$J(\theta)$ = $min \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right]  $

Goal: minimize $J(\theta)$

Gradient Descent:

GD Algorithm without the regularization term:

repeat until convergence $\{$

$\theta_0 := \theta_0 - \alpha{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.x_0^{(i)}$

$\theta_j := \theta_j-\alpha {\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}}$

$\}$

Gradient Descent with Regularization term $\lambda$ added:

Since we are not regularizing $\theta_0$, it is updated separately and the regularization term is not included in the calculation of $\theta_0$

repeat until convergence $\{$

$\theta_0 := \theta_0 - \alpha{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.x_0^{(i)}$

$\theta_j := \theta_j-\alpha \left[{\frac{1}{m}}\sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}} - \frac \lambda m \theta_j \right]$
(j= 1,2,3,....n)
$\}$

Simplifying:

$\theta_j := \theta_j(1-\alpha\frac \lambda m) - \alpha\frac 1 m \sum_{i=1}^m{({h_\theta}(x^{(i)})-y^{(i)})}.{x_j^{(i)}}$


The term $(1-\alpha\frac \lambda m)$ is less than 1 (0.9 or 0.95) and this reduces the value of $\theta_j$ in every step by multiplying it repeatedly with a number less than 1, thus achieveing the goal of regularization.


Normal Equation Method:

Matrices:

X : m x (n+1) matrix
y:  m- dimensional vector

Without Regularization:

$\theta = (X^TX)^{-1}X^Ty$

With Regularization

$\theta = (X^TX + \lambda\begin{bmatrix}0& 0& 0& 0& 0& 0 \cr 0& 1& 0& 0& 0& 0 \cr 0& 0& 1& 0& 0& 0 \cr 0& 0& 0& .& .& . \cr 0& 0& 0& .& .& .\cr 0& 0& 0& .& .& 1\cr\end{bmatrix})^{-1}X^Ty$

The Matrix multiplied with $\lambda$ is a $(n+1)$ x$ (n+1)$ matrix, where the first row, first columns and all other non diagonal elements are 0. Only the elements on the main diagonal (except the first one) are equal to 1

The Normal equation method with regularization takes care of the non-invertibility of the matrix $X^TX$

Regularization for Linear & Logistic Regression : Overfitting & Cost Function

The Problem of Over fitting:

The parameters that generate a model might throw up three type of results: Overfit, Right Fit and Underfit


Overfitting:
  • High Variance scenario, the model fits too well with the training data
  • Usually occurs due to the presence of large number of higher order parameters
  • Fits the training examples quite well, but fails to generalize to new examples
Underfitting:
  • High Bias scenario, the model fits poorly with the data and usually biased (generalized one result for all data)
  • Doesn't fit the training data well

Addressing Overfitting:

  1. Reduce the number of features
    1. Manually Select which features to keep
    2. Model selection algorithm
  2. Regularization
    1. Keep all the features, but reduce the magnitude/values of parameters $\theta_j$
    2. Works well when we have a lot of features, each one of which contributes a bit in predicting $y$

 Regularization: Cost Function

Intuition:

$h_\theta(x)$ for an overfitted function :

$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$

If we penalize and make $\theta_3$ and $\theta_4$ very small, by changing the cost function to:

Cost ($J(\theta)$) = $min \frac 1 {2m} \sum_{i=1}^m {(h_\theta(x)-y)^2} + 1000\theta_3^2 + 1000\theta_4^2$

Multiplying 1000 to $\theta_3$ and $\theta_4$ in the cost function forces the algorithm to reduce the value of both  $\theta_3$ and $\theta_4$ and make it very close to 0.

The smaller values of all $\theta$'s will generate a simpler hypothesis which will be less prone to overfitting.

Implementaion:

The Regularization Parameter $\lambda$:
 
The cost function can be rewritten as :

Cost ($J(\theta)$) = $min \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right]  $

Points to note:
  • An extra term is added to regularize all the $\theta$ parameters, $\theta_0$ is not penalized (the summation term starts at j=1 and not at j=0, leaving $\theta_0$)
  • The parameter $\lambda$ controls a tradeoff between two goals of the equation : Fit the data well (first part), and keep the coefficients small (second part)
  • Selecting the value of $\lambda$: If $\lambda$ is too large, the parameters $\theta$ will be penalized very heavily and will become very close to 0. The hypothesis will become $h_\theta(x) = \theta_0$ which is a highly biased hypothesis
  • If $\lambda$ is too large, the Gradient Descent Algorithm will fail to converge