One Model to Learn Them All: Inside Google’s MultiModel Algorithm
This is the third (and final, I promise ;) ) part of an essay about Google’s recent and famous research paper on a new multi-domain learning algorithm. The first part of the series explained some of the theory and vision behind multi-model machine learning algorithms. In the second part, we explored the master algorithm theory of a universal learner popularized (if that word can be applied to machine learning) by University of Washington researcher Pedro Domingos. Today, I would like to discuss some of the main ideas behind Google’s MultiModel algorithm without going too crazy into the implementation details.
The objective behind the Google MultiModel algorithm was to create a single deep learning model that can learn tasks from multiple domains. Specifically, MultiModel focuses on deep learning areas such as machine translation, image classification, speech recognition and language parsing.
One of the first challenges that Google researchers faced when conceiving MultiModel was the diversity of input data types, such as images, audio and text files, as well as the different sizes and dimensions that needed to be processed. In order to address those challenges, MultiModel creates individual “sub-networks” to process specific inputs and transform them into a uniform representation. MultiModel refers to those “sub-networks” as modality nets, and they specialize in processing data from a specific modality such as text, images or audio files. One important characteristic of Google’s MultiModel algorithm is that it maintains a single modality net for each category of task instead of having individual modality nets for each task. In that sense, all translation tasks will share the same modality net instead of having specific modality nets for each language. This important design decision facilitates generalization while preventing the number of sub-networks from getting out of control. Specifically, MultiModel uses four modality nets for language, image, audio and categorical data respectively.
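To make the idea of a uniform representation concrete, here is a minimal sketch of two modality nets that map very different raw inputs into tensors with the same last dimension. The function names, shapes and the unified width are illustrative assumptions for this post, not code from the paper.

```python
import numpy as np

UNIFIED_DIM = 8  # illustrative width of the shared representation space

def text_modality_net(token_ids, vocab_size=100):
    # Embed discrete tokens into the unified space (random weights
    # stand in for the trained embedding table).
    rng = np.random.default_rng(0)
    embedding = rng.standard_normal((vocab_size, UNIFIED_DIM))
    return embedding[token_ids]                     # (seq_len, UNIFIED_DIM)

def image_modality_net(image):
    # Flatten pixels and project the channel dimension into the
    # unified space (again, random weights stand in for training).
    rng = np.random.default_rng(1)
    pixels = image.reshape(-1, image.shape[-1])     # (n_pixels, channels)
    projection = rng.standard_normal((image.shape[-1], UNIFIED_DIM))
    return pixels @ projection                      # (n_pixels, UNIFIED_DIM)

# Both modalities produce tensors with the same last dimension,
# so a single shared body can consume either one.
text_repr = text_modality_net(np.array([3, 14, 15]))
image_repr = image_modality_net(np.ones((4, 4, 3)))
print(text_repr.shape[-1] == image_repr.shape[-1])  # True
```

The point of the sketch is only the interface: whatever happens inside a modality net, its output lives in the same representation space as every other modality's output.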
In terms of the overall architecture, Google MultiModel combines its modality nets into a structure that includes an encoder that processes the inputs, a mixer that combines the encoded inputs with previous outputs, and an autoregressive decoder that processes the outputs of the mixer and generates new outputs.
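The data flow through those three pieces can be sketched with stubs in place of the real networks. Each stage below is a toy transformation chosen only to make the encoder → mixer → decoder loop visible; none of it reflects the actual layers in MultiModel.

```python
import numpy as np

def encoder(inputs):
    # Stand-in for the deep encoder over the unified input representation.
    return inputs * 2.0

def mixer(encoded, previous_outputs):
    # Combine the encoded inputs with everything generated so far.
    return encoded + previous_outputs.sum(axis=0, keepdims=True)

def decoder_step(mixed):
    # Autoregressive: emit one new output vector from the mixed state.
    return mixed.mean(axis=0, keepdims=True)

inputs = np.ones((4, 8))            # (seq_len, dim) from a modality net
outputs = np.zeros((1, 8))          # start token
for _ in range(3):                  # generate three steps autoregressively
    mixed = mixer(encoder(inputs), outputs)
    outputs = np.concatenate([outputs, decoder_step(mixed)], axis=0)

print(outputs.shape)  # (4, 8): the start token plus three generated steps
```

The key structural idea is the feedback loop: each decoding step sees the encoded inputs mixed with the outputs already produced, which is what "autoregressive" means here.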
MultiModel proposes an architecture for the encoder and decoder that combines three fundamental building blocks to operate efficiently across different domains. The first components are called Convolutional Blocks and focus on detecting patterns and generalizations across domains. Technically, Convolutional Blocks receive an input tensor and return an output tensor of the same shape. The second components are called Attention Blocks, which improve the performance of the model by focusing on specific elements of the input. Finally, the third components of the architecture are Mixture-of-Experts Blocks, which consist of a number of expert neural networks and a trainable gating network that selects the appropriate experts to process a specific input.
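The mixture-of-experts idea is the least familiar of the three, so here is a minimal sketch of the gating mechanism: a gate scores every expert for a given input, and only the top-scoring experts actually run. The expert count, top-k value and shapes are illustrative assumptions, and real implementations train the gate rather than using random weights.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, NUM_EXPERTS, TOP_K = 8, 4, 2

# Each "expert" is a simple linear map; the gate scores them per input.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_block(x):
    scores = x @ gate_weights                  # one score per expert
    top = np.argsort(scores)[-TOP_K:]          # indices of the best experts
    # Softmax over the selected experts only.
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()
    # Weighted combination of the chosen experts' outputs; the other
    # experts are never evaluated, which is where the efficiency comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.standard_normal(DIM)
y = moe_block(x)
print(y.shape)  # (8,)
```

Because only a few experts run per input, the model can hold a very large number of parameters while keeping the computation per example modest.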
Confused? It’s not that bad ;) Think about MultiModel as a combination of encoders, mixers and decoders, with each of those components architected using a combination of convolutional, attention and mixture-of-experts blocks. Google has released an implementation of MultiModel based on TensorFlow, which makes it relatively easy to follow.
Based on the initial evaluations, MultiModel didn’t show any particular improvements over individual models, but it highlighted some areas in which learning can be drastically improved by sharing knowledge across different domains. From this initial experience, Google believes that the key to designing successful MultiModel algorithms is to leverage an architecture in which parameters and computational blocks are shared across different domains.