Monday, May 4, 2015

Regularization for Linear & Logistic Regression : Overfitting & Cost Function

The Problem of Overfitting:

The parameters that generate a model can produce three types of results: Overfit, Right Fit, and Underfit.


Overfitting:
  • High-variance scenario: the model fits the training data too closely
  • Usually occurs when the hypothesis contains a large number of higher-order terms
  • Fits the training examples quite well, but fails to generalize to new examples
Underfitting:
  • High-bias scenario: the model fits the data poorly and is biased (it effectively generalizes one result to all data)
  • Doesn't fit the training data well

Addressing Overfitting:

  1. Reduce the number of features
    1. Manually Select which features to keep
    2. Model selection algorithm
  2. Regularization
    1. Keep all the features, but reduce the magnitude/values of parameters $\theta_j$
    2. Works well when we have a lot of features, each one of which contributes a bit in predicting $y$

Regularization: Cost Function

Intuition:

$h_\theta(x)$ for an overfitted function:

$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$
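As a concrete illustration (a minimal sketch in Python/NumPy, which is an assumption; the post itself contains no code), this quartic hypothesis is just a dot product between the parameter vector and the polynomial features of $x$:

```python
import numpy as np

def quartic_hypothesis(theta, x):
    """Evaluate h_theta(x) = theta_0 + theta_1*x + ... + theta_4*x^4."""
    features = np.array([1.0, x, x**2, x**3, x**4])  # polynomial features of x
    return features @ theta                          # theta is a length-5 vector
```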

Suppose we penalize $\theta_3$ and $\theta_4$ so that they become very small, by changing the cost function to:

Cost ($J(\theta)$) = $\frac 1 {2m} \sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + 1000\,\theta_3^2 + 1000\,\theta_4^2$

Multiplying $\theta_3^2$ and $\theta_4^2$ by 1000 in the cost function forces the minimization to drive both $\theta_3$ and $\theta_4$ very close to 0.

Smaller values for all the $\theta$'s produce a simpler hypothesis, which is less prone to overfitting.
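A minimal sketch of this penalized cost in NumPy (the vectorized layout, variable names, and the fixed weight of 1000 are illustrative assumptions); `X` is assumed to hold the polynomial features $[1, x, x^2, x^3, x^4]$ for each training example:

```python
import numpy as np

def penalized_cost(theta, X, y):
    """Squared-error cost plus ad-hoc penalties on theta_3 and theta_4."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2) / (2 * m)  # (1/2m) * sum of squared errors
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2   # large weights on theta_3, theta_4
    return squared_error + penalty
```

Any $\theta$ that minimizes this cost has to keep `theta[3]` and `theta[4]` near zero; otherwise the penalty terms dominate the fit term.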

Implementation:

The Regularization Parameter $\lambda$:
 
The cost function can be rewritten as:

Cost ($J(\theta)$) = $\frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right]$, and the learning algorithm chooses $\theta$ to minimize this cost.

Points to note:
  • An extra term is added to regularize all the $\theta$ parameters; $\theta_0$ is not penalized (the summation starts at $j=1$ rather than $j=0$, leaving out $\theta_0$) - see the sketch after this list
  • The parameter $\lambda$ controls the tradeoff between the two goals of the equation: fitting the data well (first term) and keeping the parameters small (second term)
  • Selecting the value of $\lambda$: if $\lambda$ is too large, the parameters $\theta$ are penalized very heavily and become very close to 0. The hypothesis reduces to roughly $h_\theta(x) = \theta_0$, which is a highly biased (underfit) hypothesis
  • If $\lambda$ is extremely large, the gradient descent algorithm may also fail to converge
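
Putting the pieces together, here is one possible sketch of the regularized cost and a single batch gradient-descent update (again NumPy; `lam`, `alpha`, and the vectorized layout are assumptions, and $\theta_0$ is deliberately left out of the penalty, as noted above):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum of squared errors + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    error = X @ theta - y
    return (np.sum(error ** 2) + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient_descent_step(theta, X, y, lam, alpha):
    """One batch gradient-descent update on the regularized cost."""
    m = len(y)
    error = X @ theta - y
    grad = (X.T @ error) / m            # gradient of the fit term
    grad[1:] += (lam / m) * theta[1:]   # regularize all parameters except theta_0
    return theta - alpha * grad
```

With `lam = 0` this reduces to ordinary (unregularized) gradient descent; with a very large `lam`, every $\theta_j$ for $j \ge 1$ is pushed toward zero and the hypothesis collapses toward $h_\theta(x) = \theta_0$, matching the note above.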
