Monday, May 4, 2015

Regularization for Linear & Logistic Regression : Overfitting & Cost Function

The Problem of Overfitting:

The parameters that generate a model can produce three types of results: Overfit, Right Fit, and Underfit.


Overfitting:
  • High-variance scenario: the model fits the training data too closely
  • Usually occurs when the hypothesis contains a large number of higher-order terms
  • Fits the training examples quite well, but fails to generalize to new examples
Underfitting:
  • High-bias scenario: the model fits the data poorly and is biased (it effectively generalizes one result to all data)
  • Doesn't fit the training data well

Addressing Overfitting:

  1. Reduce the number of features
    1. Manually Select which features to keep
    2. Model selection algorithm
  2. Regularization
    1. Keep all the features, but reduce the magnitude/values of parameters $\theta_j$
    2. Works well when we have a lot of features, each one of which contributes a bit in predicting $y$

Regularization: Cost Function

Intuition:

$h_\theta(x)$ for an overfitted function:

$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$
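As a concrete illustration (a minimal sketch in Python/NumPy, which is an assumption; the post itself contains no code), this quartic hypothesis is just a dot product between the parameter vector and the polynomial features of $x$:

```python
import numpy as np

def quartic_hypothesis(theta, x):
    """Evaluate h_theta(x) = theta_0 + theta_1*x + ... + theta_4*x^4."""
    features = np.array([1.0, x, x**2, x**3, x**4])  # polynomial features of x
    return features @ theta                          # theta is a length-5 vector
```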

Suppose we penalize $\theta_3$ and $\theta_4$ so that they become very small, by changing the cost function to:

Cost ($J(\theta)$) = $\frac 1 {2m} \sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + 1000\,\theta_3^2 + 1000\,\theta_4^2$

Multiplying $\theta_3^2$ and $\theta_4^2$ by 1000 in the cost function forces the minimization to drive both $\theta_3$ and $\theta_4$ very close to 0.

Smaller values for all the $\theta$'s produce a simpler hypothesis, which is less prone to overfitting.
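A minimal sketch of this penalized cost in NumPy (the vectorized layout, variable names, and the fixed weight of 1000 are illustrative assumptions); `X` is assumed to hold the polynomial features $[1, x, x^2, x^3, x^4]$ for each training example:

```python
import numpy as np

def penalized_cost(theta, X, y):
    """Squared-error cost plus ad-hoc penalties on theta_3 and theta_4."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)  # (1/2m) * sum of squared errors
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2   # large weights on theta_3, theta_4
    return squared_error + penalty
```

Any $\theta$ that minimizes this cost has to keep `theta[3]` and `theta[4]` near zero; otherwise the penalty terms dominate the fit term.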

Implementation:

The Regularization Parameter $\lambda$:
 
The cost function can be rewritten as:

Cost ($J(\theta)$) = $\frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right]$, and the learning algorithm chooses $\theta$ to minimize this cost.

Points to note:
  • An extra term is added to regularize all the $\theta$ parameters; $\theta_0$ is not penalized (the summation starts at $j=1$ rather than $j=0$, leaving out $\theta_0$) - see the sketch after this list
  • The parameter $\lambda$ controls the tradeoff between the two goals of the equation: fitting the data well (first term) and keeping the parameters small (second term)
  • Selecting the value of $\lambda$: if $\lambda$ is too large, the parameters $\theta$ are penalized very heavily and become very close to 0. The hypothesis reduces to roughly $h_\theta(x) = \theta_0$, which is a highly biased (underfit) hypothesis
  • If $\lambda$ is extremely large, the gradient descent algorithm may also fail to converge
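
Putting the pieces together, here is one possible sketch of the regularized cost and a single batch gradient-descent update (again NumPy; `lam`, `alpha`, and the vectorized layout are assumptions, and $\theta_0$ is deliberately left out of the penalty, as noted above):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum of squared errors + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    error = X @ theta - y
    return (np.sum(error ** 2) + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient_descent_step(theta, X, y, lam, alpha):
    """One batch gradient-descent update on the regularized cost."""
    m = len(y)
    error = X @ theta - y
    grad = (X.T @ error) / m            # gradient of the fit term
    grad[1:] += (lam / m) * theta[1:]   # regularize all parameters except theta_0
    return theta - alpha * grad
```

With `lam = 0` this reduces to ordinary (unregularized) gradient descent; with a very large `lam`, every $\theta_j$ for $j \ge 1$ is pushed toward zero and the hypothesis collapses toward $h_\theta(x) = \theta_0$, matching the note above.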
