Inside Imagen. Google’s Impressive Text-to-Image Alternative to OpenAI’s DALL-E 2.
Imagen provides a simpler architecture able to generate photorealistic images from language inputs.
I recently started an AI-focused educational newsletter that already has over 125,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Text-to-image (TTI) is one of the most innovative areas in multi-modal learning these days. The influence that transformer architectures have had on natural language understanding (NLU) and computer vision has catalyzed research in the TTI space. In the last few months, OpenAI has made headlines by publishing two papers on its DALL-E models, which can generate photorealistic, artistic images from language. Recently, Google Brain published the research behind Imagen, a simpler alternative to DALL-E 2 that generates mind-boggling images based on textual inputs.
Imagen does not rely solely on transformer models to deliver its TTI capabilities. The Imagen architecture combines transformers with high-fidelity diffusion models to deliver a very simple structure for TTI synthesis. More specifically, Imagen’s architecture is based on the following components:
· T5-XXL Encoder: This component maps a text input to a sequence of embeddings. OpenAI’s CLIP has become one of the favorite options for encoding in these architectures. However, Imagen uses T5-XXL given that it provides similar performance to CLIP and seems to be preferred by human evaluators.
· Conditional Diffusion Models: Conceptually, diffusion models are methods that convert Gaussian noise into samples drawn from a learned data distribution. Imagen uses a conditional diffusion model to map the text embeddings produced by the encoder to a 64x64 image.
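To make the two-stage pipeline concrete, here is a minimal toy sketch of the flow described above: encode a prompt into an embedding, then run a reverse-diffusion loop that turns Gaussian noise into a 64x64 array conditioned on that embedding. Everything here is a stand-in assumption for illustration: `encode_text` is a hash-based stub for the frozen T5-XXL encoder (whose real embeddings are far larger), and `denoise_step` is a linear toy, not Imagen's text-conditioned U-Net; the step count and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16          # stand-in for the T5-XXL embedding dimension (much larger in reality)
IMG_SHAPE = (64, 64)  # Imagen's base diffusion model outputs 64x64 images
STEPS = 50            # number of reverse-diffusion steps (hypothetical)

def encode_text(prompt: str) -> np.ndarray:
    """Toy stand-in for the frozen T5-XXL encoder: hash characters into a unit vector."""
    vec = np.zeros(EMB_DIM)
    for i, ch in enumerate(prompt):
        vec[i % EMB_DIM] += ord(ch)
    return vec / (np.linalg.norm(vec) + 1e-8)

def denoise_step(x: np.ndarray, emb: np.ndarray, t: int) -> np.ndarray:
    """Toy stand-in for one conditioned denoising step: nudge the noisy
    image toward a pattern derived from the text embedding."""
    target = np.outer(
        np.sin(np.linspace(0, np.pi, IMG_SHAPE[0]) * emb[0] * 10),
        np.cos(np.linspace(0, np.pi, IMG_SHAPE[1]) * emb[1] * 10),
    )
    alpha = 1.0 / (t + 1)  # blend more aggressively as t approaches 0
    return (1 - alpha) * x + alpha * target

def sample(prompt: str) -> np.ndarray:
    """Full toy pipeline: text -> embedding -> iterative denoising -> 64x64 array."""
    emb = encode_text(prompt)
    x = rng.standard_normal(IMG_SHAPE)  # start from pure Gaussian noise
    for t in range(STEPS, 0, -1):       # run the reverse process step by step
        x = denoise_step(x, emb, t)
    return x

img = sample("a photo of a corgi")
print(img.shape)  # (64, 64)
```

The point of the sketch is the interface, not the math: the diffusion sampler never sees the raw text, only the encoder's embedding, which is exactly the decoupling that lets Imagen use a frozen, off-the-shelf language model as its text encoder.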