Practical Deep Learning: Selecting the Right Model and Gathering Training Data Part II

This is the second part of an essay that explains some of the practical tips in deep learning applications. Specifically, we are focusing on the selection of learning models and the correct structuring of training datasets. In the first part, we explored some of the basic methodology fro selecting baseline models in deep learning scenarios. Today, we are going to provide some guidance in regards to one of the most difficult challenges facing deep learning practitioners: how to determine if the right size of the training dataset?

Structuring a proper training dataset is an essential aspect of effective deep learning models but one that is particularly hard to solve. Part of the challenge comes from the intrinsic relationship between a model and the corresponding training dataset. If the performance of a model is below expectations, it is often hard to determine whether the causes are related to the model itself or to the composition of the training dataset. While there is no magic formula for creating the perfect training dataset, there are some patterns that can help.

When confronted with a deep learning model with poor performance, data scientists should determine if the optimization efforts should focus on the model itself or on the training data. In most real world scenarios, optimizing a model is exponentially cheaper than gathering additional clean data and retraining the algorithms. From that perspective, data scientists should make sure that the model has been properly optimized and regularized before considering collecting additional data.

Typically, the first rule to consider when a deep learning algorithm is underperforming is to evaluate whether its using the entire training dataset. Very often data scientists will be shocked to find out that models that are not working correctly are only using a fraction of the training data. At that point, a logical thing to consider is to increase the capacity of the model( the number of potential hypothesis it can formulate) by adding extra layers and additional hidden units per layer. Another ideas to explore in that scenario is to optimize the model’s hyperparameters( read my article about hyperparamenters). If none of those ideas work, then it might be time to consider gathering more training data.

The process of enriching a training dataset can be cost prohibited in many scenarios. To mitigate that, data scientists should implement a data wrangling pipeline that is constantly labeling new records. semi-supervised learning strategies might also help to incorporate unlabeled records as part of the training dataset.

The imperative question in scenarios that require extra training data always is: how much data? Assuming that the composition of the training dataset doesn’t drastically vary with new records, we can estimate the appropriate size of the new training dataset by monitoring its correlation with the generalization error. A basic principle to follow in that situation is to increase the training dataset at a logarithmic scale by, for example, doubling the number of instances each time. In some cases, we can improve the training dataset by simply creating variations using noise generation models or regularization techniques such as Bagging(read my recent article about Bagging).

CEO of IntoTheBlock, Chief Scientist at Invector Labs, I write The Sequence Newsletter, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store