Ok, the title is a bit of an exaggeration but hopefully it caught your attention :)
When looking into deep learning models, you will frequently encounter the term stochastic gradient descent (SGD) as an optimization mechanism. SGD is by far the most common optimization algorithm in deep learning, and it is not a stretch to say that nearly all deep learning is, to some extent, powered by SGD. What is this mystical algorithm, and why is it so relevant to deep learning?
In its most basic form, SGD is an optimization algorithm. Like many other optimization techniques, SGD focuses on minimizing the cost function of a specific model by adjusting its parameters in small steps rather than changing the model drastically. Optimization algorithms are nothing new in machine learning and have been a core part of it since its inception. However, most of the traditional optimization techniques in machine learning needed to be adapted to perform well with the large datasets common in deep learning problems. Specifically, SGD is the deep learning adaptation of a common family of optimization methods known as Gradient Descent.
Gradient Descent Optimization
The general goal of optimization algorithms is to minimize the cost function (also known as the error function) of a learning model. If we represent the cost function as c = f(x), the objective of an optimization algorithm is to minimize c by altering x.
Gradient Descent Optimization (GDO) relies on derivatives to minimize the cost function. Derivatives are one of the pillars of calculus and have many applications in different areas of deep learning. GDO relies on a very particular property of derivatives: they tell us how a small change in the input of a function changes its output. Using some mathematical nomenclature, if f’(x) is the first-order derivative of our cost function f(x) and d is a very small number, then we can assert that:
f(x + d) ≈ f(x) + d*f’(x)
All those mathematical expressions simply tell us that we can make small, predictable changes in f(x) by modifying x. In particular, moving x a small step in the direction opposite to the sign of f’(x) lowers f(x). GDO applies that step repeatedly to find points that minimize a cost function f(x). Even if you are not familiar with GDO, I am sure you have heard of some of its terms, such as local minimum or maximum and global minimum or maximum, as they are often used loosely in mainstream technical articles.
A local minimum/maximum is a point at which a function f(x) is lower/higher than at all its neighboring points. Similarly, the global minimum/maximum is the absolute lowest or highest point of the function. In the context of GDO, the goal of the algorithm is to find the global minimum or, failing that, a local minimum whose cost is close to it.
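To make the idea concrete, here is a minimal sketch of gradient descent on a one-dimensional cost function. The function f(x) = (x - 3)^2, the starting point, the learning rate and the number of steps are all assumptions chosen for illustration; they are not part of any particular deep learning model.

```python
def gradient_descent(df, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative df, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move x a small step in the direction that lowers the cost.
        x = x - learning_rate * df(x)
    return x


def f(x):
    # Illustrative cost function (an assumption); its only minimum is at x = 3.
    return (x - 3.0) ** 2


def df(x):
    # First-order derivative of f.
    return 2.0 * (x - 3.0)


x_min = gradient_descent(df, x0=0.0)
print(x_min, f(x_min))  # x_min ends up very close to 3
```

Because this cost function has a single minimum, the local and the global minimum coincide. On a function with several valleys, the same loop would settle in whichever valley is downhill from the starting point.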
Stochastic Gradient Descent
Traditional GDO techniques often become impractical and prohibitively expensive when dealing with large datasets. Imagine calculating derivatives across billions of data points in a training dataset. SGD improves on classic GDO techniques by uniformly drawing small sets of samples (ranging from a few dozen to a few hundred) from the training dataset and computing each update from those samples alone instead of from the entire dataset.
Without getting into the algorithmic details behind SGD, it is important to highlight that it excels at finding very low values of the cost function very quickly. More importantly, SGD does so while keeping computational costs steady: the cost of each update depends on the size of the sample, not on the size of the training dataset. SGD provides no guarantee that it will ever arrive at a local minimum, but the tradeoff in terms of speed and resources makes it a more viable option than typical GDO algorithms.
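For readers who want to see the idea in code anyway, here is a minimal sketch of SGD fitting a straight line y = w*x + b with a mean squared error cost. Everything in it is an assumption made for illustration: the synthetic dataset, the helper name sgd_linear_fit, the batch size of 32, the learning rate and the number of steps. Real deep learning frameworks do the same thing with far more parameters.

```python
import random


def sgd_linear_fit(data, learning_rate=0.05, batch_size=32, steps=2000, seed=0):
    """Fit y = w*x + b with minibatch SGD on a mean squared error cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Uniformly draw a small set of samples instead of using all the data.
        batch = rng.sample(data, batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for x, y in batch:
            error = (w * x + b) - y
            grad_w += 2.0 * error * x  # derivative of error**2 with respect to w
            grad_b += 2.0 * error      # derivative of error**2 with respect to b
        # Average the gradient over the batch and take one small step.
        w -= learning_rate * grad_w / batch_size
        b -= learning_rate * grad_b / batch_size
    return w, b


# Synthetic dataset (an assumption): y = 2x + 1 plus a little Gaussian noise.
data_rng = random.Random(42)
data = []
for _ in range(10000):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + data_rng.gauss(0.0, 0.1)))

w, b = sgd_linear_fit(data)
print(w, b)  # both should land near the true values 2 and 1
```

Each update touches only 32 points, so its cost stays the same whether the dataset holds ten thousand rows or ten billion. The price is a noisy gradient estimate, which is why the fitted values hover near the true ones rather than matching them exactly.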
Optimization is a very active area of research in the deep learning space. Researchers are constantly testing new algorithms, and many of them are improvements on SGD. For now, SGD remains a favorite of the deep learning community, and it is important to understand some of its concepts when applying it in deep learning solutions.