OpenAI Helps Us Understand How Deep Learning Training Scales
Understanding the optimal size of a training batch remains one of the most interesting challenges in supervised learning.
--
I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
In the last few years, there has been increasing interest in training parallelization methods that can be applied to large deep learning models. These efforts have focused on both model parallelism and data parallelism, with the latter being more popular given its simplicity. Conceptually, data parallelism involves splitting a training dataset into batches of data, distributing those batches across multiple computing devices, and aggregating the resulting gradients. A common challenge in data-parallel training is determining the right size for those batches. In An Empirical Model of Large-Batch Training, researchers from OpenAI propose a quantitative approach to understanding how the training process of deep neural networks scales.
For a given model and dataset, there is little more than empirical guidance about the appropriate batch size for data parallelism, or about why it differs so much across models. For instance, many reinforcement learning models can use data parallelism with batch sizes of millions of records, while image classification models are constrained to a few thousand. If we use a batch of only a few records with a reinforcement learning model, it is unlikely to achieve any generalization and, similarly, using a million-record batch with an image classification model is likely to produce diminishing returns. How to determine the right training…