📄➡️🖼 Edge#219: A New Series About Text-to-Image Models
In this issue:
we start the new series about text-to-image models;
we discuss CLIP, a neural network that can learn image representations while being trained using natural language datasets;
we explore Hugging Face’s CLIP implementation.
Enjoy the learning!
💡 ML Concept of the Day: A New Series About Text-to-Image Models
Multidomain learning is one of the aspirational crown jewels of deep learning. Today, most neural networks remain highly specialized in a single domain such as language, speech, or computer vision. In recent years, however, we have seen a generation of successful models that can operate across datasets from different domains. Among those, text-to-image models have proven particularly successful at combining recent breakthroughs in language and computer vision. Today, we are starting a brief series about this new generation of text-to-image models and their underlying techniques.
Combining language and images seems like a natural evolution for deep learning methods. After all, the association between images and text is natural to human cognition. Recent advancements in transformer architectures, few-shot learning, and pretrained language models have pushed the boundaries of natural language understanding (NLU) and are making inroads in computer vision. Additionally, methods such as generative adversarial networks (GANs) have achieved major milestones in photorealistic image generation. Combining these two lines of work has produced a new wave of models able to generate high-fidelity images from textual inputs.
The key to text-to-image models is the ability to detect the relationships between images and the text that describes them. In this series, we will cover methods such as diffusion models that have made major inroads in this area, as well as models such as VQGAN, CLIP, DALL-E 2, and Imagen that are achieving remarkable performance in text-to-image generation.
🔎 ML Research You Should Know: Connecting Text to Images with OpenAI CLIP
In the paper Learning Transferable Visual Models From Natural Language Supervision, researchers from OpenAI introduced CLIP (Contrastive Language–Image Pre-training), a neural network that can learn image representations while being trained using natural language datasets.
The objective: Master different computer vision tasks without requiring expensive labeled image datasets.
Why is it so important: CLIP has been the foundation of models like DALL-E 2 that have revolutionized the text-to-image space.
Diving deeper: Text-to-image models are typically vulnerable to two main sets of challenges. Cost is likely the main roadblock to implementing text-to-image models in the real world, as labeled datasets for computer vision tasks are very expensive to produce. Additionally, text-to-image models require mastering several computer vision tasks at once, which contrasts with the highly specialized nature of most computer vision models. With CLIP, OpenAI tried to address both challenges.
The idea behind CLIP is to train a model on an image dataset combined with widely available natural language supervision. The resulting model can be instructed in natural language to perform different image classification tasks without being optimized for any specific one. This “zero-shot” capability is one of CLIP’s key contributions: by not optimizing for any specific task, CLIP becomes fairly robust across many of them.
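To make the zero-shot idea concrete, here is a minimal conceptual sketch (not OpenAI’s actual code): the candidate labels are wrapped in natural-language prompts, encoded with the text encoder, and compared against the image embedding, with the closest prompt winning. The encode_text function and the prompt template are illustrative assumptions.

```python
# Conceptual sketch of CLIP-style zero-shot classification.
# encode_text stands in for CLIP's text encoder; image_embedding is
# assumed to come from the image encoder. Both are illustrative.
import numpy as np

def zero_shot_classify(image_embedding, class_names, encode_text):
    # Wrap each candidate label in a natural-language prompt.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeddings = np.stack([encode_text(p) for p in prompts])

    # Normalize so that dot products become cosine similarities.
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    text_embeddings = text_embeddings / np.linalg.norm(
        text_embeddings, axis=1, keepdims=True
    )

    # The label whose prompt embedding is closest to the image wins.
    scores = text_embeddings @ image_embedding
    return class_names[int(np.argmax(scores))]
```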
CLIP’s design is based on three powerful concepts:
Multimodal learning to combine language and computer vision in a single model.
Zero-shot transfer learning to reuse knowledge across different tasks.
Language supervision to train computer vision models using large varieties of text.
CLIP combines these ideas in a powerful transformer architecture trained on an image dataset using language supervision. CLIP uses an image encoder and a text encoder to predict which images are paired with which texts in a dataset. That behavior is then reused as a zero-shot classifier that can be adapted to several image classification tasks. During training, CLIP relies on a proxy task: given an input image, predict which of a set of text snippets is actually associated with it. This process allows CLIP to learn a wide variety of visual concepts and their associated texts across different classification tasks.
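The paper frames this proxy task as a symmetric contrastive objective over a batch of (image, text) pairs. The PyTorch sketch below is an illustrative adaptation of the pseudocode in the paper, not OpenAI’s production code; the temperature value and variable names are assumptions.

```python
# Sketch of CLIP's symmetric contrastive objective for a batch of N
# (image, text) pairs. Matching pairs sit on the diagonal of the
# similarity matrix; everything off-diagonal acts as a negative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize both sets of embeddings so similarities are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise image-text similarity matrix, scaled by a temperature.
    logits = image_features @ text_features.t() / temperature

    # Image i should match text i, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image and
    # the right image for each text.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```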
CLIP became an important building block of several text-to-image models, such as OpenAI’s own DALL-E 2, and has been widely used alongside generative adversarial network (GAN) architectures.
🤖 ML Technology to Follow: Using OpenAI CLIP with Hugging Face’s Transformers Library
Why should I know about this: Hugging Face’s transformers library includes one of the most complete implementations of the OpenAI CLIP model.
What is it: Hugging Face’s transformers library has become the home of some of the most sophisticated language and computer vision models ever created. Shortly after the publication of the CLIP paper, Hugging Face added an implementation to the library to improve its text-to-image capabilities.
As explained in the previous section, CLIP can predict the most relevant text snippet that matches a given image without directly optimizing for the task. This makes it suitable for tasks such as image-text similarity and zero-shot image classification. The Hugging Face implementation uses a series of abstractions that match the original model architecture proposed by OpenAI:
CLIPFeatureExtractor: to rescale and normalize images for the model.
CLIPTokenizer: to encode an input text.
CLIPProcessor: to encapsulate CLIPFeatureExtractor and CLIPTokenizer into a single class that can handle both the text encoding and the preprocessing of images.
CLIPModel: to provide the main interface to interact with the CLIP implementation.
The Hugging Face transformers library combines these components into a single programming model that enables the use of CLIP in just a few lines of code, as shown in the sketch below.
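Here is a minimal usage sketch along the lines of the transformers documentation; the checkpoint name, sample image URL, and candidate captions are just illustrative choices.

```python
# Zero-shot image classification with the Hugging Face CLIP classes.
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (the checkpoint name is an example).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any PIL image works; this download is purely illustrative.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor tokenizes the candidate captions and preprocesses the image.
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

# logits_per_image holds image-text similarity scores; softmax turns them
# into zero-shot classification probabilities over the candidate captions.
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)
```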
While there are other implementations of CLIP available, Hugging Face’s transformers library offers notable benefits in terms of maintenance, consistency of the programming model, and seamless interoperability with other machine learning frameworks and tools.
How can I use it: CLIP is open source. The original OpenAI implementation is available at https://github.com/openai/CLIP, and Hugging Face’s implementation ships with its open source transformers library.