The Sequence Chat: Deyao Zhu and Jun Chen on MiniGPT-4
The researchers behind the open source GPT-4 alternative share their insights about the state of multimodal AI agents.
"Visually informed language models open up a new route to generalized visual intelligence. By providing these models with visual input, they are already capable of solving tasks such as recognizing cats or dogs and answering questions about a car brand. Moreover, they can be further developed to solve more tasks on-demand, such as interactively generating a painting or designing a new chair to be placed in your living room."Â
Mohamed Elhoseiny, Assistant Professor of Computer Science at KAUST.Â
👤 Quick bio
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning and data labeling.
I am Deyao Zhu from Quanzhou, China, and I am a fourth-year PhD student at King Abdullah University of Science and Technology (KAUST), under the supervision of Prof. Mohamed Elhoseiny. My recent research focuses on AI in decision making and vision-language understanding. Prior to joining KAUST, I obtained my Bachelor’s degree in Mechatronics from Tongji University in Shanghai and my Master’s degree in Electrical Engineering from Leibniz Universitaet Hannover in Germany.
My interest in artificial neural networks began in high school when I read an article about the Human Brain Project in Scientific American. I was attracted by the idea of simulating brains via computers. A few years later, the success of AlphaGo made me believe in the future of deep learning, and I found a research assistant position in human motion prediction at my Master's university.
I am Jun Chen, originally from Qinhuangdao, China. Currently, I am a fourth-year PhD student at King Abdullah University of Science and Technology (KAUST), under the guidance of Prof. Mohamed Elhoseiny. My research primarily revolves around multi-modal learning, with a particular emphasis on vision and language learning. Prior to joining KAUST, I earned my Bachelor's degree from Xi'an Jiaotong-Liverpool University, situated in the beautiful city of Suzhou.
My fascination with machine learning began during my undergraduate studies, when AlphaGo's groundbreaking success deeply impressed me. That success served as a driving force, motivating me to concentrate on the study of machine learning.
🛠 ML Work
MiniGPT-4 has made headlines in the past few weeks due to its impressive multimodal capabilities. Could you provide more information about the vision and inspiration behind the project?
When OpenAI released GPT-4, we were shocked by its unbelievable vision-language abilities, such as coding a website from a photo of a draft and describing a given image in rich detail. These abilities had never been demonstrated by any previous vision-language model. However, OpenAI did not release any technical details about GPT-4 or its model. We were curious about how they created it and how we could reproduce their website-coding demo.
At the same time, we were also inspired by our previous project, ChatCaptioner. In ChatCaptioner, we found that one of the best open-source vision-language models, BLIP-2, could see many image details and answer questions related to those details. However, compared to GPT-4, BLIP-2 lacks many advanced abilities, such as generating code based on handwritten text and describing images in great detail. We believe that the absence of these abilities is due to the lack of an advanced language model, such as ChatGPT. Therefore, we decided to work on aligning our vision model with an advanced language model, such as Vicuna.
MiniGPT-4 builds on Vicuna, an LLaMA-based model, by integrating computer vision capabilities. What motivated the selection of Vicuna over other models, and how would you describe the significance of LLaMA in advancing open-source LLMs?
At the beginning of this project, Vicuna had not yet been released, and we used other open-source models. We tried different models but did not find any that could perform similarly to ChatGPT. Three days after starting the project, Vicuna was released, and after playing with their demo, we found that Vicuna behaved similarly to ChatGPT and was the strongest model among all the available open-source language models. Thus, we began using Vicuna.
We believe that LLaMA is still the strongest open-source base model and that it significantly boosts the performance of other open-source LLMs, such as Alpaca and Vicuna. Without LLaMA, models as strong as Vicuna could not have been developed as quickly as they were.
Building MiniGPT-4 involved a unique process, starting with an existing frozen LLM. Could you explain the pretraining and fine-tuning procedures used to incorporate computer vision capabilities into MiniGPT-4?
The first pretraining stage is a standard procedure for aligning the vision and language modules. In this stage, the vision-language model takes an image as input and is trained to generate the ground-truth image caption. As the model learns to predict captions accurately, it learns to understand the visual content of the image.
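To make the stage-1 objective concrete, here is a minimal sketch of the alignment step, under the assumption that both the vision encoder and the language model stay frozen and only a small projection layer is trained on the captioning loss; the class and argument names are illustrative, not MiniGPT-4's actual code.

```python
# Minimal sketch of stage-1 vision-language alignment (assumptions: frozen
# vision encoder, frozen LLM such as Vicuna, and a single trainable projection
# layer; all names here are illustrative).
import torch
import torch.nn as nn


class VisionLanguageAligner(nn.Module):
    def __init__(self, vision_encoder, llm, vis_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()  # frozen
        self.llm = llm.eval()                        # frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # The only trainable component: map visual features into the LLM's
        # input embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, images, caption_ids):
        with torch.no_grad():
            vis_tokens = self.vision_encoder(images)      # (B, N, vis_dim)
        vis_embeds = self.proj(vis_tokens)                 # (B, N, llm_dim)
        cap_embeds = self.llm.get_input_embeddings()(caption_ids)
        inputs = torch.cat([vis_embeds, cap_embeds], dim=1)

        # Standard next-token loss, computed only on the caption tokens:
        # -100 masks the visual positions out of the loss.
        ignore = torch.full(vis_embeds.shape[:2], -100,
                            dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([ignore, caption_ids], dim=1)
        return self.llm(inputs_embeds=inputs, labels=labels).loss
```

In this setup the captioning loss only updates the projection weights, which matches the spirit of aligning a vision module to an already capable language model without retraining either one.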
However, our goal is not just to train a model that comprehends image content but to develop a chatbot that can talk about the image fluently. We noticed that after the first pretraining stage, Vicuna's otherwise powerful speaking ability was affected by the visual inputs. For instance, it began to generate incomplete sentences or sentences with significant repetition, and it only functioned with carefully designed prompts.
Therefore, we proposed a second fine-tuning stage to restore Vicuna's speaking ability. Traditional image-caption datasets, such as those used in stage one, typically contain only brief captions, while people usually prefer a chatbot that provides rich information. So we created a small image-text-pair dataset of long image descriptions, generated by the model itself after stage-one training using carefully designed prompts and then cleaned with a post-processing step based on ChatGPT and hard-coded rules. Additionally, in stage two the image inputs are wrapped in a conversation template rather than given to the model alone.
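As a rough illustration of the conversation-template idea, here is a small sketch; the tags, placeholder, and wording are assumptions for illustration rather than the project's exact format.

```python
# Illustrative stage-2 prompt construction: the image is wrapped in a
# conversation-style template instead of being fed to the model on its own.
# The placeholder is later replaced by projected visual embeddings, and the
# training target is the long, detailed description from the curated dataset.
IMAGE_PLACEHOLDER = "<ImageHere>"

STAGE2_TEMPLATE = "###Human: <Img>{image}</Img> {instruction} ###Assistant:"


def build_stage2_prompt(instruction: str = "Describe this image in detail.") -> str:
    return STAGE2_TEMPLATE.format(image=IMAGE_PLACEHOLDER, instruction=instruction)


print(build_stage2_prompt())
# ###Human: <Img><ImageHere></Img> Describe this image in detail. ###Assistant:
```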
After fine-tuning our model on this small dataset, which only takes 7 minutes, Vicuna successfully recovers its speaking ability, and our final MiniGPT-4 system is complete.
MiniGPT-4 has exhibited remarkable abilities, including generating narratives and identifying key facts in images. Which of these capabilities surprised you, and which ones emerged organically versus being engineered through the fine-tuning process?
I was surprised by the first demo we made, which was writing a poem from a given photo. Honestly, we expected this could work: Vicuna is able to write a poem, so if it can see the image, it should be able to write a poem about it. But when we saw how well it worked, we were still surprised.
Most of the capabilities came from aligning the vision component with Vicuna. However, Vicuna lost its ability to speak smoothly once we added the vision features after the first pretraining stage. Therefore, we proposed the second fine-tuning stage to help Vicuna recover its speaking ability.
The majority of current multimodal foundation models appear to be focused on combining language and computer vision. What is the potential of multimodal AI, and what are the practical limitations? Will models with capabilities across various domains be developed in the future?Â
As vision and language serve as crucial inputs to the human brain, with language also acting as a primary tool for thinking, the potential of a multimodal AI system to comprehend a wide range of tasks, environments, and situations becomes evident. Consequently, such a system holds the capacity to automate numerous human jobs closely tied to these abilities.
However, it is important to acknowledge that current multimodal foundation models possess several limitations. For instance, they exhibit a significant issue with hallucinations, struggle with object counting, and have difficulty understanding spatial information.
Nevertheless, we maintain the belief that future developments will lead to the creation of a single, all-encompassing AI model capable of understanding a vast array of modalities and domains.
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research outside of generative AI?
Deyao: For me, it is decision making. I still think decision making is the most important next step in AI research. We now have many AI systems that understand language and vision well; it is time to use them to do real jobs for humans, and that requires decision making. AutoGPT, for example, is a good first step.
Jun: Ensuring safe and robust data alignment is of utmost importance to me. While generative AI has shown remarkable success, achieving a truly safe AI necessitates training on high-quality aligned data. By leveraging safe aligned data, we can establish a more secure AI environment that promotes safety at its core.
Compare and contrast MiniGPT-4 with GPT-4.
MiniGPT-4 is an attempt to unveil the secret of GPT-4's vision-language abilities. Although we are able to reproduce many of GPT-4's vision-language demos, MiniGPT-4's abilities are still weaker than GPT-4's. For example, GPT-4 is able to read small and long pieces of text in an image, whereas MiniGPT-4 can only read short text rendered in large type.
Besides, the pure language ability of GPT-4 is much stronger than that of both ChatGPT and Vicuna. MiniGPT-4 does not focus on improving language ability and directly uses a frozen Vicuna as its language component. Thus, MiniGPT-4's language ability is also weaker than GPT-4's.
Lastly, although we reproduce GPT-4's vision-language demos, our system is built on open-source models, and we still don't know how GPT-4 is implemented. So I expect there are differences between the method they used to train GPT-4 and ours.
Have we hit a scale limit with LLMs? What new techniques are needed to enhance LLM knowledge without depending on size?
From what we have learned from OpenAI's talks, I think we are close to consuming all of the data on the internet, and we will hit a scaling limit once we can no longer build larger datasets. However, since humans learn by interacting with the environment rather than from a fixed dataset, I think the next promising step for further improving LLMs is to develop algorithms that let an LLM collect data from its environment by itself and learn from it.
What is the next frontier for multimodal foundation models?Â
We think the next step for multimodal foundation models is video understanding. We now have good models to understand images. However, it is still unclear how to build LLMs that can understand videos well. Solving this problem is important and can enable many new applications.