Edge 258: Inside OpenAI's Point-E: The New Foundation Model Able to Generate 3D Representations from Language
The new model, which combines GLIDE with an image-to-3D generation model, is a very clever and efficient architecture.
Generative AI and foundation models are dominating the headlines in the deep learning space. Text-to-image models such as DALL-E, Stable Diffusion, or Midjourney have gained tremendous momentum in terms of adoption. 3D and video seem to be the next frontier for multimodal generative models. OpenAI has been actively working in the space and quietly unveiled Point-E, a new text-to-3D model that is able to generate 3D point clouds from natural language inputs.
3D is a particularly challenging domain for generative AI models. Compared to image or even video datasets, 3D datasets are scarce. Additionally, 3D generation involves more than shape: it includes aspects such as texture and orientation that are hard to capture in text representations. As a result, traditional supervised methods based on text-3D pairs face serious limitations in terms of scalability. Pretrained models have been somewhat successful at overcoming some of the limitations of supervised methods, and that is precisely the path followeded by OpenAI.
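The reliance on pretrained components is visible in the code OpenAI released alongside the paper. The sketch below, adapted from the example notebook in the openai/point-e repository, loads the pretrained text-conditional base model and the point-cloud upsampler, then samples a colored point cloud from a prompt. The model names, sampler parameters, and prompt come from that notebook and may change as the repository evolves.

import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint
from point_e.util.plotting import plot_point_cloud

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Text-conditional base model: produces a coarse 1,024-point cloud from a prompt.
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler: refines the coarse cloud up to 4,096 points.
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''),  # only the base model sees the prompt
)

# Run the progressive sampler; the last yielded batch is the final point cloud.
samples = None
for x in tqdm(sampler.sample_batch_progressive(
        batch_size=1, model_kwargs=dict(texts=['a red motorcycle']))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]
plot_point_cloud(pc, grid_size=3)

Note that nothing here is trained from scratch: both stages are downloaded checkpoints, which is exactly what lets Point-E sidestep the shortage of paired text-3D training data.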