OpenAI’s New Foundation Model: Point-E is Able to Generate 3D Representations from Language
The new model combines GLIDE with an image-to-3D generation model in a very clever and efficient architecture.

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Generative AI and foundation models are dominating the headlines in the deep learning space. Text-to-image models such as DALL-E, Stable Diffusion, or Midjourney have gained tremendous momentum in terms of adoption. 3D and video seem to be the next frontier for multimodal generative models. OpenAI has been actively working in this space and quietly unveiled Point-E, a new text-to-3D model that is able to generate 3D point clouds from natural language inputs.
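Before digging into the details, it is worth noting that OpenAI open-sourced Point-E at github.com/openai/point-e. As a loose illustration of what text-to-point-cloud generation looks like in practice, here is a minimal sketch along the lines of the repo's example notebook. It uses the small 'base40M-textvec' checkpoint, which conditions directly on a text embedding rather than going through the GLIDE image stage; the model names and sampler parameters are taken from that example and may change between releases:

```python
# Minimal sketch: sample a colored point cloud from a text prompt with the
# open-sourced point-e package (github.com/openai/point-e). Checkpoint names
# and sampler settings follow the repo's example notebook and may change.
import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Base text-conditional model: produces a coarse 1,024-point cloud.
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler model: densifies the coarse cloud to 4,096 points.
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],           # per-point RGB colors
    guidance_scale=[3.0, 0.0],              # guidance on the base model only
    model_kwargs_key_filter=('texts', ''),  # the upsampler is not text-conditioned
)

# Run the two-stage diffusion sampler on a natural language prompt.
samples = None
for x in tqdm(sampler.sample_batch_progressive(
        batch_size=1, model_kwargs=dict(texts=['a red motorcycle']))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]  # coords plus RGB channels
```

The resulting point cloud object carries XYZ coordinates along with per-point RGB channels and can be rendered with the plotting utilities included in the repository.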
3D is a particularly challenging domain for generative AI models. Compared to image or even video datasets, 3D datasets are scarce. Additionally, 3D generation involves more than shape: it includes aspects such as texture and orientation that are hard to capture in a text representation. As a result, traditional supervised methods based on text-3D pairs face severe limitations in terms of scalability. Pretrained models have been somewhat successful at overcoming some of the limitations of supervised methods, and this is precisely the path followed by OpenAI.
Point-E
Instead of focusing on complete 3D objects, Point-E generates 3D point cloud representations based on an input prompt. These synthetic…