Gradient Descent Demystified

Gradient Descent
Gradient descent is an optimization algorithm that helps minimize the cost by finding the optimal parameters for given set of inputs. 
Parameters are coefficients (b,m) in linear regression or weights (w) in neural networks.
The goal is to minimize the cost when training. 

Error

The error can be defined as the following:


where


Cost

The cost function tells us how well our model is performing. In other words, it tells us how good or bad the model is for a given set of parameters (m and b values for linear regression, w for neural networks).
The cost function can be defined as follows:


In linear regression, a prediction is made using the line equation:


Plugging in the line equation to cost function, we get:


Our goal is to minimize the cost. Cost is same as loss. 
On graphing the cost function, we can identify the minima. But when graphing models with too many dimensions, it is not possible to mathematically compute the minima. 
To find the minima of the loss function; we take partial derivatives for b and m
Gradient Descent is used to find the minima of the cost/loss function
Screen-Shot-2018-11-23-at-9-28-29-AM.png
Source: IMG Source
Now as the name states, we calculate the gradient and iteratively descent down the gradient to reach the minima of the function.
To calculate the gradient, we take partial derivatives with respect to b and m values. 
The derivative gives us the slope of the line drawn tangent at the point of b and m values that we just gave. The derivative is the tiny nudge or change that causes a change with respect to another value. In this case, when we're calculating the partial derivative, the other values stay constant.
The partial derivative tells us which towards which direction is the slope higher. So we need to substract that from the given b or m at that state to reach a more optimal solution. 
Remember,
Derivative tells us how steep the slope is. It is nothing but the rise over run.
We calculate the partial derivative for the cost function(C) on  m by the following formula :


Plugging in the cost function:
we get, 


the derivative applies on tiny changes on m as m approaches 0
Hence:


Explanations
The derivative of constant y and m will be zero since it's a constant and it has no change.

Similarly, we calculate the partial derivative for the cost function(C) on b as:


Gradient for b


Now that we have the derivatives, we know which direction towards the steepest point at that specific point. 
The b and m are substracted by the derivatives ( substracted since we want to traverse to the oppsite direction of the slope, i.e. travel towards the minima). Here we introduce a new factor, the learning rate
The learning rate defines how much of a step to take towards the minima. Higher the learning rate, bigger the jump towards the minima. But if the learning rate is too high, we risk the overshooting the minima. Hence the learning rate is set to a tiny value, for very tiny steps. This does mean running through more samples of data as compared to a higher learning rate, but the risk of overshooting the minima is minimal or absent when we have a low learning rate. 
Learning rate is multiplied with the calculated b and m derivatives and then substracted from the existing b and m values and are replaced by the new values.
The formula for this is mentioned below:


The process of iterating and updating  b and m is done through the whole data. 
And that is how Gradient Descent optimizes the parameters for a given function.

Comments

Popular Posts