Last week we finished our series about a new generation of text-to-image models and their underlying techniques. Here is a full recap to help you catch up on the topics we covered. As the proverb goes (and many ML people agree): repetition is the mother of learning ;)
Multidomain learning is one of the crown jewels of deep learning. Today, most neural networks remain highly specialized in a single domain such as language, speech, or computer vision. Recently, we have seen a generation of successful models that can operate on data from different domains. Among those, text-image models have proven particularly successful at combining recent breakthroughs in both language and computer vision.
The key to text-image models is the ability to detect the relationships between images and the text that describes them. In this super popular series, we covered techniques such as diffusion that have made major inroads in this area, as well as models such as VQGAN, CLIP, DALL-E 2, and Imagen that achieve remarkable performance in text-to-image generation.
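To make that idea concrete, here is a minimal sketch of how a CLIP-style model scores how well candidate captions match an image, using Hugging Face's CLIP implementation covered in Edge#219. The checkpoint name, image path, and captions are illustrative placeholders rather than anything prescribed in the series.

```python
# Minimal sketch: scoring how well candidate captions describe an image with CLIP.
# Assumes the `transformers` and `Pillow` packages are installed; the checkpoint
# name, image path, and captions are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local image
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]

# Encode the image and the candidate captions into the same embedding space.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = the caption that better describes the image.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```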
Forward this email to those who might benefit from reading it or give a gift subscription.
→ In Edge#219 (read it without a subscription): we start the new series about text-to-image models; discuss CLIP, a neural network that can learn image representations while being trained using natural language datasets; and explore Hugging Face’s CLIP implementation.
→ In Edge#221: we explain what diffusion models are; discuss Imagen, Google’s massive diffusion model for photorealistic text-to-image generation; explore MindEye, which allows you to run multiple generative art models in a single interface.
→ In Edge#223: we discuss different types of diffusion; explain OpenAI’s GLIDE, a guided diffusion method for photorealistic image generation; explore the Hugging Face text-to-image catalog.
→ In Edge#225: we explain latent diffusion models; discuss the original latent diffusion paper; explore Hugging Face Diffusers, a library for state-of-the-art diffusion models (see the short sketch after this list).
→ In Edge#227: we explain autoregressive text-to-image models; discuss Google’s Parti, an impressive autoregressive text-to-image model; explore MS COCO, one of the most commonly used datasets in text-to-image research.
→ In Edge#229: we introduce the VQGAN+CLIP architecture; discuss the original VQGAN+CLIP paper; explore the VQGAN+CLIP implementations.
→ In Edge#231: we explain text-to-image synthesis with GANs; discuss Google’s XMC-GAN, a modern approach to text-to-image synthesis; explore the NVIDIA GauGAN2 demo.
→ In Edge#233: we explain DALL-E 2; discuss the DALL-E 2 paper; explore DALL-E Mini (now Craiyon), the most popular DALL-E implementation on the market.
→ In Edge#235: we explain Meta AI’s Make-A-Scene; discuss Meta’s Make-A-Scene paper; explore LAION, one of the most complete training datasets for text-to-image synthesis models.
→ In Edge#237: we discuss Midjourney, one of the most enigmatic models in the space; explore Microsoft’s LAFITE, which can train text-to-image synthesis models without any text data; explain Disco Diffusion, an important open-source implementation of diffusion models.
→ In Edge#239: we dive deeper into Stable Diffusion; discuss retrieval-augmented diffusion models that bring memory to text-to-image synthesis; explore Stable Diffusion interfaces.
→ In Edge#241: we conclude our text-to-image series by discussing the emerging capabilities of text-to-image synthesis models; explain NVIDIA’s textual inversion approach to improving text-to-image synthesis; explore DALL-E and Stable Diffusion outpainting interfaces.
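For readers who want to get hands-on with the diffusion tooling referenced above (Edge#225 and Edge#239), here is a minimal sketch of generating an image from a text prompt with Hugging Face Diffusers. The model id and prompt are illustrative assumptions, not something prescribed by the series.

```python
# Minimal sketch: generating an image from a text prompt with Hugging Face Diffusers.
# Assumes the `diffusers`, `transformers`, and `torch` packages are installed; the
# model id and prompt are illustrative, and a GPU is optional but much faster.
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionPipeline.from_pretrained(model_id)
pipe = pipe.to(device)

# The pipeline wraps the text encoder, U-Net, scheduler, and VAE decoder
# behind a single text-to-image call.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```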
Next week we start a new series, diving deep into the foundations of ML interpretability methods as well as the top frameworks and platforms in the space. Fascinating!