A few months ago, I wrote about ensemble learning as one of the most practical techniques used in real-world deep learning applications. Conceptually, ensemble learning combines multiple models into a single meta-model in order to reduce the generalization error. As you might imagine, there are many ways to construct ensemble models. Among those, Bagging and Dropout have become increasingly popular in the latest generation of deep learning technologies.
Ensemble Learning as a Regularization Method
The foundations of ensemble learning go back to the core of how humans make decisions. Research in decision theory shows that large, heterogeneous groups of people can arrive at better decisions than top experts on a specific subject. In machine learning, it is well known that most models underperform when exposed to data that deviates from the training data. A related result, the no free lunch theorem, states that no single learning algorithm is best across all possible problems. From that perspective, combining different models in an ensemble is an efficient way to expand the capacity of a model and reduce its generalization error, because the errors of the individual models partially cancel out as long as they are not perfectly correlated. Reducing generalization error, rather than training error, is the textbook definition of a regularization method.
Almost everyone agrees on the value of ensemble learning. In recent years, the emergence of deep learning has produced a large number of techniques for constructing ensemble models, and it is becoming increasingly difficult to differentiate between them. Bagging and Dropout are two ensemble methods that have been widely implemented in deep learning frameworks and are regularly used in modern applications.
One of the drawbacks of ensemble learning is that the models in an ensemble can have different training algorithms and objective functions, which can make the ensemble computationally expensive to run at scale. Bagging is an ensemble learning technique that reuses the same model, training algorithm and objective function across the members of the meta-model.
The essence of Bagging (short for bootstrap aggregating) is to train a deep learning model on variations of the training dataset. Each variation is built by sampling entries from the original training dataset with replacement, typically until the new dataset is as large as the original, so some entries are left out and others appear more than once. Let’s use an example and assume that our original dataset is the vector [1,2,3,4,5]. A Bagging technique will produce datasets such as [1,1,2,3,5], [1,2,2,3,4], [1,2,3,5,5]… and many others. Bagging trains a separate copy of the model on each generated dataset and combines their predictions, typically by averaging or voting. The different training runs should specialize the model copies on different areas of the target data.
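The resampling and combining steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production recipe: bootstrap_sample, fit_mean_model and bagged_predict are hypothetical names introduced here, and the "model" is a trivial stand-in that always predicts the mean of its training sample, so that the example stays self-contained.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) entries from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_mean_model(sample):
    """Hypothetical stand-in for training: the 'model' always predicts the sample mean."""
    sample_mean = mean(sample)
    return lambda: sample_mean


def bagged_predict(data, n_models=25, seed=0):
    """Train one stand-in model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(model() for model in models)


data = [1, 2, 3, 4, 5]
rng = random.Random(0)
for _ in range(3):
    # Each resampled dataset has the original size, with some entries repeated
    # and others missing.
    print(sorted(bootstrap_sample(data, rng)))
print(bagged_predict(data))
```

In a real deep learning setting, fit_mean_model would be replaced by a full training run of the same architecture on each resampled dataset, and the final line would average the networks' outputs (or take a majority vote for classification).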
While Bagging trains the model on variations of the input dataset, Dropout works by generating subnetworks of the original deep neural network. Essentially, Dropout creates subnetworks by removing some of the non-output units and then training those subnetworks on the original dataset. Because the subnetworks share the weights of the original model, Dropout gives a single network ensemble-like capabilities at a fraction of the cost of training separate models, which can be incredibly efficient.
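To make the idea of removing units concrete, here is a minimal sketch of Dropout applied to the activations of one hidden layer. It makes a few assumptions that the text above does not specify: the function name dropout and the keep_prob parameter are introduced here for illustration, and the code uses the common "inverted" variant, which rescales the surviving units during training so that nothing needs to change at inference time.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Apply inverted dropout to a list of hidden-unit activations (illustrative sketch)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        # At inference time the full network is used unchanged.
        return list(activations)
    # Each unit is kept independently with probability keep_prob. A dropped unit
    # outputs 0, which removes it from the subnetwork used for this training step.
    # Kept units are scaled by 1 / keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


hidden = [0.5, 1.2, -0.3, 0.8]
rng = random.Random(0)
for step in range(3):
    # A different random subnetwork is sampled on every training step.
    print(step, dropout(hidden, keep_prob=0.5, rng=rng))
print("inference", dropout(hidden, keep_prob=0.5, rng=rng, training=False))
```

Every call in training mode samples a different mask, so each training step updates a different subnetwork that shares its weights with all the others. That weight sharing is what lets a single model behave like an ensemble without training separate networks.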