The Problem of Overfitting:
The parameters that generate a model might produce three types of results: Overfit, Right Fit and Underfit
Overfitting:
- High Variance scenario: the model fits the training data too closely
- Usually occurs due to the presence of a large number of higher-order parameters
- Fits the training examples quite well, but fails to generalize to new examples
Underfitting:
- High Bias scenario: the model fits the data poorly and is usually biased (it generalizes one result for all data)
- Doesn't fit the training data well (both scenarios are illustrated in the sketch below)
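To make this concrete, here is a minimal sketch (my own illustration, not code from the course) that fits polynomials of degree 1, 2 and 9 to noisy quadratic data. The degree-1 fit underfits (high training and test error), while the degree-9 fit overfits (low training error, higher test error). The toy data and the chosen degrees are assumptions made only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underlying relationship is quadratic; the noise lets a high-degree polynomial "memorize" the training set.
x_train = np.linspace(0, 1, 10)
y_train = 1 + 2 * x_train - 3 * x_train**2 + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(0, 1, 50)
y_test = 1 + 2 * x_test - 3 * x_test**2 + rng.normal(0, 0.1, x_test.size)

def mse(theta, x, y):
    """Mean squared error of the polynomial with coefficients theta."""
    return np.mean((np.polyval(theta, x) - y) ** 2)

for degree in (1, 2, 9):                      # underfit, right fit, overfit
    theta = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree}: train MSE = {mse(theta, x_train, y_train):.4f}, "
          f"test MSE = {mse(theta, x_test, y_test):.4f}")
```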
Addressing Overfitting:
- Reduce the number of features
- Manually Select which features to keep
- Model selection algorithm
- Regularization
- Keep all the features, but reduce the magnitude/values of parameters $\theta_j$
- Works well when we have a lot of features, each one of which contributes a bit in predicting $y$
Regularization: Cost Function
Intuition:
$h_\theta(x)$ for an overfitted function:
$\theta_0 + \theta_1x + \theta_2x^2+\theta_3x^3+\theta_4x^4$
We can penalize $\theta_3$ and $\theta_4$ and make them very small by changing the cost function to:
Cost ($J(\theta)$) = $\min_\theta \frac 1 {2m} \sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + 1000\theta_3^2 + 1000\theta_4^2$
Multiplying $\theta_3^2$ and $\theta_4^2$ by 1000 in the cost function forces the algorithm to reduce the values of both $\theta_3$ and $\theta_4$ and make them very close to 0.
Smaller values of all the $\theta$'s generate a simpler hypothesis that is less prone to overfitting.
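As a minimal sketch of this intuition (my own illustration; the "wiggly" and "simple" parameter vectors and the toy data are assumptions), the penalized cost below becomes huge whenever $\theta_3$ and $\theta_4$ are non-negligible, which is exactly what pushes the algorithm to shrink them toward 0:

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta_0 + theta_1*x + theta_2*x^2 + theta_3*x^3 + theta_4*x^4"""
    powers = np.vander(x, N=5, increasing=True)   # columns are x^0 ... x^4
    return powers @ theta

def penalized_cost(theta, x, y):
    """Squared-error cost plus the 1000*theta_3^2 + 1000*theta_4^2 penalty from above."""
    m = len(y)
    squared_error = np.sum((hypothesis(theta, x) - y) ** 2) / (2 * m)
    return squared_error + 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2

# Toy, roughly quadratic data (an assumption made only for this illustration).
x = np.linspace(-1, 1, 20)
y = 1 + 2 * x - 3 * x ** 2

wiggly = np.array([1.0, 2.0, -3.0, 0.5, 0.5])   # noticeable theta_3 and theta_4
simple = np.array([1.0, 2.0, -3.0, 0.0, 0.0])   # theta_3 and theta_4 driven to 0
print("cost with large theta_3, theta_4:", penalized_cost(wiggly, x, y))
print("cost with theta_3, theta_4 ~ 0:  ", penalized_cost(simple, x, y))
```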
Implementation:
The Regularization Parameter $\lambda$:
The cost function can be rewritten as:
Cost ($J(\theta)$) = $\min_\theta \frac 1 {2m} \left[\sum_{i=1}^m {(h_\theta(x^{(i)})-y^{(i)})^2} + \lambda\sum_{j=1}^n\theta_j^2 \right] $
Points to note:
- An extra term is added to regularize all the $\theta$ parameters except $\theta_0$, which is not penalized (the summation starts at $j=1$, not $j=0$, leaving out $\theta_0$; see the sketch after this list)
- The parameter $\lambda$ controls a tradeoff between the two goals of the equation: fitting the training data well (first part) and keeping the parameters small (second part)
- Selecting the value of $\lambda$: if $\lambda$ is too large, the parameters $\theta$ will be penalized very heavily and will become very close to 0. The hypothesis then reduces to $h_\theta(x) = \theta_0$, which is a highly biased (underfitting) hypothesis
- If $\lambda$ is too large, the Gradient Descent Algorithm will fail to converge
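Below is a minimal sketch of this regularized cost together with a corresponding gradient-descent step (my own illustration, not the course's code; the toy data, $\lambda = 1$ and learning rate $\alpha = 0.1$ are assumptions). Note how $\theta_0$ is left out of both the penalty and the regularization part of the gradient:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [ sum of squared errors + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    error = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)        # theta_0 is NOT penalized
    return (error @ error + penalty) / (2 * m)

def gradient_step(theta, X, y, lam, alpha):
    """One gradient-descent step on the regularized cost."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]             # regularization term skips theta_0
    return theta - alpha * grad

# Toy usage; X carries a leading column of ones so theta[0] plays the role of theta_0.
rng = np.random.default_rng(1)
X = np.c_[np.ones(30), rng.normal(size=(30, 2))]
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.1, 30)

theta = np.zeros(3)
for _ in range(500):
    theta = gradient_step(theta, X, y, lam=1.0, alpha=0.1)
print("theta:", theta)
print("J(theta):", regularized_cost(theta, X, y, lam=1.0))
```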