The Sequence Chat: Salesforce Research's Junnan Li on Multimodal Generative AI
One of the creators of the famous BLIP-2 model shares his insights about the current state of multimodal generative AI.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning (ML).
I'm a research scientist at Salesforce Research, focusing on multimodal AI. I did my PhD in computer vision at the National University of Singapore. I got started in computer vision and machine learning through my undergraduate final-year project (FYP).
🛠 ML Work
Recently, you have been working on BLIP-2, which can be considered one of the first open-source multimodal conversational agents ever released. Could you elaborate on the vision and history of the project?
BLIP-2 is a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images. It unlocks zero-shot image-to-text generation and powers the world's first open-source multimodal chatbot prototype. Check out this blog post for more details: https://blog.salesforceairesearch.com/blip-2/
Before BLIP-2, we published BLIP, one of the most popular vision-and-language models and the #18 most-cited AI paper of 2022. BLIP-2 achieves significant improvements over BLIP by effectively leveraging frozen pre-trained image encoders and LLMs.
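For readers who want to try the released models, below is a minimal zero-shot captioning sketch. It assumes the BLIP-2 checkpoints published on the Hugging Face Hub (e.g. `Salesforce/blip2-opt-2.7b`) and the `transformers` integration; the image URL is only a placeholder, and none of these specifics come from the interview itself.

```python
# Minimal zero-shot image captioning sketch with a BLIP-2 checkpoint.
# Assumes the Hugging Face `transformers` integration and the
# `Salesforce/blip2-opt-2.7b` checkpoint; adjust for your hardware.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Placeholder image: any RGB image works here.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# No text prompt -> plain zero-shot captioning.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Passing a text prompt to the processor (e.g. a question about the image) turns the same setup into the chatbot-style prompted generation described above.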
One of the biggest contributions of BLIP-2 is the idea of zero-shot image-to-text generation. Could you explain this concept and how your team was able to implement it?
BLIP-2 achieves zero-shot image-to-text generation by enabling LLMs to understand images, thus inheriting the zero-shot text generation capability of LLMs. It is challenging for LLMs to understand images due to the domain gap between images and text. We propose a novel two-stage pre-training strategy to bridge this gap.
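To make the bridging idea concrete, here is a highly simplified, self-contained sketch: a small set of learnable query tokens cross-attends to features from a frozen image encoder, and the result is projected into the frozen LLM's embedding space as soft visual prompts. The dimensions, module choices, and the `QFormerBridge` name are illustrative assumptions, not the actual BLIP-2 / Q-Former implementation.

```python
# Simplified sketch (not the real Q-Former): learnable queries pull
# information out of frozen image features, and a linear layer maps the
# result into the frozen LLM's input embedding space.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)   # align frozen encoder features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)         # into the LLM's embedding space

    def forward(self, frozen_image_features):
        # frozen_image_features: (batch, num_patches, vision_dim), produced by a
        # frozen image encoder (e.g. ViT patch embeddings); only the bridge trains.
        kv = self.vision_proj(frozen_image_features)
        q = self.queries.expand(frozen_image_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)    # queries extract visual information
        return self.llm_proj(out)              # soft visual prompts for the frozen LLM

# Dummy usage: 2 images, 257 patch tokens of width 1024.
features = torch.randn(2, 257, 1024)
bridge = QFormerBridge()
print(bridge(features).shape)  # torch.Size([2, 32, 2560])
```

In the full method, the first pre-training stage trains this kind of bridge against the frozen image encoder with vision-language representation objectives, and the second stage connects its outputs to the frozen LLM for generative learning.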
The release of GPT-4 highlighted the importance of multimodality as one of the key elements of the new wave of generative AI models. How would you compare the strengths and weaknesses of GPT-4 and BLIP-2?
GPT-4 is amazing and demonstrates strong image-to-text generation capabilities. There are two key differences between BLIP-2 and GPT-4:
First, BLIP-2 is a generic multimodal pre-training method that can enable any LLM to understand images, whereas GPT-4 refers to a particular model or family of models.
Second, compared to GPT-4, BLIP-2 is much more efficient during both pre-training and inference. In fact, BLIP-2 is one of the most computation-friendly modern multimodal pre-training methods.
What are some of the next milestones and the biggest research breakthroughs needed to push multimodal generative AI to the next level?
The world is multimodal by nature, so an AI agent that can understand and simulate the world needs to be multimodal. In my opinion, multimodal generative AI will drive the next wave of AI breakthroughs. There are so many exciting areas, such as video generation, embodied multimodal AI, etc.
💥 Miscellaneous – a set of rapid-fire questions
Favorite area of AI research outside generative AI?
Self-supervised/unsupervised learning
How do you see the balance and risks between the open source and closed source approach to foundation models?
I believe that open source is the preferable approach for driving safer and more responsible AI research that can benefit a larger community. However, it requires careful planning before open-sourcing a model to mitigate its potential risks.
The next domain for multimodal AI would be language and video, audio, 3D, all of them?
Yes!
Is the Turing Test still relevant? Any clever alternatives?
This question is out of my scope, so I cannot answer it :).