Inside the AI Powering Stable Diffusion, The New Hot Text-To-Image Synthesis Model

Latent Diffusion has the ability to power a new wave of text-to-image generation models.

Jesus Rodriguez
3 min read · Aug 29, 2022

Image Credit: Stability AI

I recently started an AI-focused educational newsletter that already has over 125,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:

A few days ago, AI startup Stability AI unveiled the first version of its Stable Diffusion text-to-image synthesis model. Unless you have been living under a rock for the last year, you probably know that the text-to-image generation space is going through a massive revolution. Models like OpenAI’s GLIDE and DALL-E 2, Midjourney, and Google’s Parti and Imagen have made significant progress advancing different text-to-image techniques. Stable Diffusion matches the quality of those models using a highly efficient architecture and, best of all, is open source. Even AI legend and former Tesla AI chief Andrej Karpathy agrees.

One of the main advantages of Stable Diffusion is its relatively lightweight architecture. The current version runs in under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds. The model itself, however, was trained on the 4,000 A100 Ezra-1 AI ultracluster over the course of a month, and it has over 1,000 beta testers creating something like 1.7 million images per day. The artistic quality of the images generated by Stable Diffusion is astonishing.
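To put the beta-test numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above (roughly 1.7 million images per day and at least 1,000 testers). The function name `beta_throughput` is made up for illustration, and since the tester count is a lower bound, the per-tester figure is an upper bound rather than a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def beta_throughput(images_per_day: float, testers: int) -> dict:
    """Convert a daily image count into rough per-second and per-tester rates."""
    return {
        "images_per_second": images_per_day / SECONDS_PER_DAY,
        "images_per_tester_per_day": images_per_day / testers,
    }


# Figures quoted in the article: ~1.7M images/day, "over 1000" beta testers.
stats = beta_throughput(images_per_day=1_700_000, testers=1_000)
print(f"{stats['images_per_second']:.1f} images per second across the beta")
print(f"at most {stats['images_per_tester_per_day']:.0f} images per tester per day")
```

In other words, the beta is producing on the order of 20 images every second, which is only plausible because each image takes a few seconds on a single GPU.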


Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, I write The Sequence Newsletter, Guest lecturer at Columbia University and Wharton, Angel Investor, Author, Speaker.