Understanding the Gradient Descent Algorithm for minimizing the cost function of a Linear Regression Model
Gradient Descent is a generic algorithm used in many scenarios beyond finding the parameters that minimize the cost function in linear regression. It's a robust algorithm, though a little tricky to implement compared to other methods such as the Normal Equation method. The algorithm generalizes to both univariate and multivariate models: it finds the parameters θ0, θ1, θ2, … corresponding to the variables x0, x1, x2, … that minimize the cost function J(θ0, θ1, …).
More about the Cost Function:
Understanding Linear Regression
Gradient Descent Algorithm:
Case: the GD algorithm applied to minimize J(θ0, θ1). The same case generalizes to any number of parameters θ.
The Gradient Descent (GD) Algorithm works in the following way:
- Have some function J(θ0,θ1)
- Want to minimize J(θ0,θ1)
- Outline:
- Start with some value of θ0,θ1
- Keep changing θ0,θ1 to reduce J(θ0,θ1) until we hopefully end at a minimum
Gradient Descent starts at a point on the surface generated by plotting J(θ0,θ1) against θ0 and θ1. It then takes a small step downhill to find a point where the value of J is lower than the original, and resets θ0 and θ1 to the new values. It takes another step in the downhill direction to find a lower cost J, and repeats these steps until it reaches a minimum of the surface. The values of θ0 and θ1 can be read off at this minimum value of J.
Key pointers:
- How do we decide how big a step to take in the downhill direction? This is controlled by the parameter α, called the Learning Rate of the GD algorithm
- The above curve shows two local minima. What if the GD algorithm reaches a local minimum while the global minimum is elsewhere? The reason the GD algorithm works for linear regression is that the cost surface of a Linear Regression model is a convex function with a single global minimum, so irrespective of where you start, you will reach the global minimum of J(θ0,θ1)
Convex function:
Gradient Descent Algorithm: Pseudocode
repeat until convergence
{
θj := θj − α · ∂J(θ0, θ1)/∂θj    (for j = 0 and j = 1)
}
Simultaneously update both θ0 and θ1, i.e. don't plug the value of θ0 calculated in one step into the function J when calculating the value of θ1.
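As a minimal sketch in Python (the gradient functions grad_theta0 and grad_theta1 are assumed helpers, not defined in the article), the simultaneous update might look like this:

```python
# One simultaneous GD update for two parameters.
# grad_theta0 and grad_theta1 are assumed to compute the partial
# derivatives of J at the current (theta0, theta1).

def gd_step(theta0, theta1, alpha, grad_theta0, grad_theta1):
    # Compute both gradients from the *current* parameter values first...
    temp0 = theta0 - alpha * grad_theta0(theta0, theta1)
    temp1 = theta1 - alpha * grad_theta1(theta0, theta1)
    # ...then assign, so the update of theta1 never sees the new theta0.
    return temp0, temp1
```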
The Learning Rate α: This defines how fast the algorithm converges, i.e. whether we take a small step or a big step towards the next, lower value of J. If α is too small, the steps will be small and the algorithm will be slow. If α is too large, the steps will be big and we might overshoot the minimum of J. Also, as we approach the minimum the partial derivative (the slope) shrinks, so the GD algorithm automatically takes smaller steps even with a fixed α. There is therefore no need to readjust or decrease α over time.
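To see the effect of α, here is a small illustration in Python on a toy convex function J(θ) = θ² (not the regression cost itself), whose derivative is 2θ:

```python
# Minimize J(theta) = theta**2 (derivative 2 * theta), starting from theta = 1.0.

def descend(alpha, steps=10):
    theta = 1.0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta := theta - alpha * dJ/dtheta
    return theta

print(descend(alpha=0.01))   # small alpha: ~0.82, still far from the minimum at 0 after 10 steps
print(descend(alpha=0.5))    # moderate alpha: reaches 0 immediately (a lucky choice for this J)
print(descend(alpha=1.1))    # alpha too large: |theta| grows each step, it overshoots and diverges
```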
The partial derivative term: This is the slope of the cost surface at the current point (θ0, θ1). It can be either negative or positive, and in both cases the update pulls the values of θ0 and θ1 towards the minimum of the surface.
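A quick numeric illustration of how the sign of the slope pulls the parameter towards the minimum, again on the toy function J(θ) = θ²:

```python
# Sign of the slope vs. direction of the update, for J(theta) = theta**2.
alpha = 0.1
for theta in (2.0, -2.0):
    slope = 2 * theta                   # dJ/dtheta
    new_theta = theta - alpha * slope
    print(theta, slope, new_theta)
# theta =  2.0: slope is positive, so theta moves left  ( 2.0 ->  1.6), towards the minimum at 0
# theta = -2.0: slope is negative, so theta moves right (-2.0 -> -1.6), towards the minimum at 0
```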
Application: Gradient Descent Algorithm for Linear Regression
Linear Regression Model:
hθ(x) = θ0 + θ1·x
Cost Function:
J(θ0, θ1) = (1/2m) · ∑_{i=1}^{m} (hθ(x(i)) − y(i))²
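A minimal Python sketch of this model and cost function, assuming the dataset is given as 1-D NumPy arrays x and y of length m:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1 / 2m) * sum((h_theta(x(i)) - y(i))**2)
    m = len(y)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)
```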
Step 1: Calculate the partial derivatives of J w.r.t. θ0 and θ1
j = 0:  ∂J(θ0, θ1)/∂θ0 = (1/m) · ∑_{i=1}^{m} (hθ(x(i)) − y(i))
j = 1:  ∂J(θ0, θ1)/∂θ1 = (1/m) · ∑_{i=1}^{m} (hθ(x(i)) − y(i)) · x(i)
This is simple partial differentiation, once with respect to θ0 and again with respect to θ1, keeping the other parameter constant. If you can follow the derivation, good; if you can't, it doesn't matter for applying the algorithm.
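Continuing the sketch above (it reuses the hypothesis function defined earlier), the two partial derivatives can be computed as:

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    # Partial derivatives of J w.r.t. theta0 and theta1 for 1-D arrays x, y of length m.
    m = len(y)
    errors = hypothesis(theta0, theta1, x) - y
    d_theta0 = np.sum(errors) / m        # dJ/dtheta0
    d_theta1 = np.sum(errors * x) / m    # dJ/dtheta1
    return d_theta0, d_theta1
```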
Step 2: The Gradient Descent Algorithm now becomes:
repeat until convergence
{
θ0 := θ0 − α · (1/m) · ∑_{i=1}^{m} (hθ(x(i)) − y(i))
θ1 := θ1 − α · (1/m) · ∑_{i=1}^{m} (hθ(x(i)) − y(i)) · x(i)
} (update θ0 and θ1 simultaneously)
Batch Gradient Descent: each step of Gradient Descent uses all m training examples.
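Putting the pieces together, a minimal batch gradient descent loop (reusing the gradients function sketched above; alpha and num_iters are illustrative defaults):

```python
def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        # Batch GD: gradients() sums over all m training examples on every step.
        d_theta0, d_theta1 = gradients(theta0, theta1, x, y)
        # Simultaneous update: both gradients were computed before either theta changes.
        theta0 -= alpha * d_theta0
        theta1 -= alpha * d_theta1
    return theta0, theta1
```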
The Normal Equation Method: There exists a method in Linear Algebra where matrices and vectors are used to calculate the parameters θ0 and θ1 without iteration. This method is an efficient one, except in the case of very large datasets, where it becomes computationally expensive due to the calculation of the inverse of very large matrices.
To use the Normal Equation method of solving for θ0 and θ1, the dataset has to be represented in a matrix format.
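A minimal sketch of the Normal Equation in Python, assuming the same 1-D arrays x and y as before; it computes θ = (XᵀX)⁻¹Xᵀy, with a pseudo-inverse used in case XᵀX is singular:

```python
import numpy as np

def normal_equation(x, y):
    # Design matrix X: a column of ones (for theta0) next to the feature column x.
    X = np.column_stack([np.ones_like(x), x])
    # theta = (X^T X)^(-1) X^T y
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta[0], theta[1]   # theta0, theta1
```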