Divide and Conquer: Supervised Pretraining in Deep Learning Models

Most people divide the world of machine learning (and consequently deep learning) into two main types of algorithms: supervised and unsupervised. While that categorization is technically correct, it’s far from complete. As deep learning practitioners quickly find out, there are dozens of subcategories of supervised and unsupervised learning techniques, each with dozens of algorithms. One of those subcategories that has become extremely popular with the emergence of deep learning is known as supervised pretraining.

Deep learning models are typically represented by a neural net structure with several hidden layers or sub-networks. The essence of supervised pretraining is to break down tasks into simpler tasks that can be trained independently before confronting the original task.

To illustrate the ideas behind supervised pretraining, let’s use a hypothetical deep learning model that is learning to master chess. Strategically, chess games are divided into three main stages: opening, middle-game and end-game. While the opening and end-game stages require very strong theoretical knowledge, the middle-game is where most deep strategies and tactical maneuvers take place. Training a single deep learning model to learn chess can be really impractical from a computational standpoint (not the best technical example, but stay with me for a minute :) ). Instead, we can use supervised pretraining to train individual sub-networks in opening, middle-game and end-game strategies and then combine them into a complete chess master deep learning model.

Let’s use another example, closer to the deep learning world: a natural language understanding (NLU) agent that is learning to have conversations with humans on specific subjects. A human conversation goes beyond determining the subject and intent of a dialog; it includes many aspects such as empathy, tone, voice speed, clarifications and dozens of others. Supervised pretraining can be used to learn each of these individual aspects of a conversation and then combine them into robust NLU models.

Why Do We Need Supervised Pretraining?

The simplest answer boils down to computational cost. Training large deep learning networks can be incredibly expensive and, often, the data needed to train the full model is simply not available. Supervised pretraining addresses that challenge with two simple goals:

a) Modify deep learning models into simpler versions that are easier to train.

b) Evolve the trained simpler models into more complex models that solve the original task.

Getting Greedy

Greedy supervised pretraining is one of the most popular supervised pretraining techniques and one that has been widely adopted in deep learning algorithms. Technically, greedy supervised pretraining breaks down a network into many different components and tries to find the optimal version of each component in isolation. After that, the greedy algorithm combines the optimized versions of the sub-networks into a new deep learning model that solves the original problem, and then proceeds to optimize that combined model.
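To make the idea a bit more concrete, here is a minimal sketch of one common flavor of greedy supervised pretraining: growing a network one hidden layer at a time. Everything in it is an assumption made for illustration, not something prescribed by this article: the toy dataset, the layer sizes, the learning rates, and the helper names `train_shallow`, `greedy_pretrain`, `forward` and `fine_tune` are all hypothetical. Each stage trains a small one-hidden-layer classifier on the features produced by the previously trained layers, and a final pass fine-tunes the combined stack on the original task.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_shallow(X, y, hidden, steps=500, lr=0.2):
    """Train a one-hidden-layer classifier with plain gradient descent."""
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.5, (d, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(0, 0.5, hidden), 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        g = (sigmoid(H @ w2 + b2) - y) / n     # grad of mean cross-entropy w.r.t. logits
        gH = np.outer(g, w2) * (1 - H ** 2)    # backprop through tanh (uses old w2)
        w2, b2 = w2 - lr * H.T @ g, b2 - lr * g.sum()
        W1, b1 = W1 - lr * X.T @ gH, b1 - lr * gH.sum(axis=0)
    return (W1, b1), (w2, b2)


def greedy_pretrain(X, y, layer_sizes):
    """Greedy stage: train each hidden layer as part of its own shallow model."""
    layers, inputs, head = [], X, None
    for hidden in layer_sizes:
        (W, b), head = train_shallow(inputs, y, hidden)
        layers.append((W, b))
        inputs = np.tanh(inputs @ W + b)       # frozen features feed the next stage
    return layers, head                        # keep the output head of the last stage


def forward(X, layers, head):
    acts = [X]
    for W, b in layers:
        acts.append(np.tanh(acts[-1] @ W + b))
    w, c = head
    return acts, sigmoid(acts[-1] @ w + c)


def fine_tune(X, y, layers, head, steps=200, lr=0.1):
    """Joint stage: optimize the combined stack end to end."""
    layers = [(W.copy(), b.copy()) for W, b in layers]
    w, c = head[0].copy(), head[1]
    n = len(y)
    for _ in range(steps):
        acts, p = forward(X, layers, (w, c))
        g = (p - y) / n
        delta = np.outer(g, w) * (1 - acts[-1] ** 2)
        w, c = w - lr * acts[-1].T @ g, c - lr * g.sum()
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:  # propagate through the old W before updating it
                delta = (delta @ W.T) * (1 - acts[i] ** 2)
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers, (w, c)


# Toy, linearly separable data (an assumption made purely for the demo).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

layers, head = greedy_pretrain(X, y, layer_sizes=[8, 4])
tuned_layers, tuned_head = fine_tune(X, y, layers, head)
_, p = forward(X, tuned_layers, tuned_head)
accuracy = ((p > 0.5) == y).mean()
print(f"training accuracy after fine-tuning: {accuracy:.2f}")
```

The greedy part is that each stage only ever solves a small, shallow problem and never revisits the layers trained before it; only the final `fine_tune` pass looks at the whole network at once. In a real project you would most likely reach for a deep learning framework rather than hand-written gradients, but the control flow would stay roughly the same.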

Greedy supervised pretraining should be seen as a tradeoff between knowledge and computational resources. Obviously, combining the optimal versions of the sub-networks doesn’t always produce an optimal deep learning model. Just because you are super empathetic doesn’t mean that you can have a conversation about politics or art. Nonetheless, greedy supervised pretraining is often computationally cheaper than other training approaches. From that perspective, data scientists should strike the right balance between knowledge and resource consumption in order to build an effective process to train their models.

CEO of IntoTheBlock, Chief Scientist at Invector Labs, I write The Sequence Newsletter, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.