The Sequence Scope: Distributed ML Training is Going to be Everyone’s Problem
Weekly newsletter with over 100,000 subscribers that discusses impactful ML research papers, cool tech releases, the money in AI, and real-life implementations.
--
📝 Editorial: Distributed ML Training is Going to be Everyone’s Problem
Large-scale, distributed training is one of those machine learning (ML) problems that is easy to ignore. After all, only large AI labs like Google, Facebook, and Microsoft work with massive models that require many GPUs to train. I certainly thought that way until transformers came into the picture. If there is one takeaway from the emergence of transformer models, it is that bigger models are better, at least for the time being. Training even a basic BERT-based transformer model requires quite a bit of infrastructure and distributed processing. As a result, distributed training is slowly becoming a mainstream problem for the entire AI community.
As someone who didn’t care much about distributed ML training, I followed the research peripherally without getting into the details. That changed in the last couple of years, when I started playing with larger and larger models. The level of research and engineering built into distributed ML training frameworks is mind-blowing. Frameworks like Horovod and Ray are certainly the best known, but the innovation doesn’t stop there. Just this week, Microsoft open-sourced some new additions to its DeepSpeed distributed training library. At the same time, Facebook and Tencent published very advanced research on scaling the distributed training of transformer models. Innovation in this space will certainly continue over the next few years, and, at this point, distributed training should be considered a key building block of any modern ML pipeline.
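To make the idea concrete, here is a minimal sketch of what data-parallel distributed training looks like with Horovod on PyTorch, one of the frameworks mentioned above. This is not code from the newsletter or from any of the papers it cites: the linear model and synthetic dataset are toy placeholders standing in for a real transformer and corpus, and the `train.py` filename in the launch command is hypothetical.

```python
# Minimal Horovod + PyTorch data-parallel sketch (toy model and data).
# Launch with, e.g.: horovodrun -np 4 python train.py
import torch
import torch.nn as nn
import torch.utils.data as data
import horovod.torch as hvd

# Initialize Horovod and pin each worker process to one GPU (if available).
hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Toy model and synthetic dataset as placeholders for a real workload.
model = nn.Linear(128, 2)
if torch.cuda.is_available():
    model.cuda()
dataset = data.TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))

# Each worker trains on a distinct shard of the data.
sampler = data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers during backpropagation.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Ensure every worker starts from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for features, labels in loader:
        if torch.cuda.is_available():
            features, labels = features.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Even in this stripped-down form, the pattern shows why distributed training touches the whole pipeline: data sharding, optimizer wrapping, and state synchronization all have to be wired in, which is exactly the plumbing that frameworks like Horovod, Ray, and DeepSpeed try to abstract away.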