Inside LLaVA: The Very Popular Open Source Alternative to GPT-4V
The model outperforms GPT-4 in several visual instruction tasks.
Today, we celebrate the Thanksgiving holiday in the United States, when it is customary to give thanks for our blessings over the past year. I wanted to take a moment to express my gratitude for your support of this newsletter. Writing this amount of deeply technical content on a weekly basis is not easy, particularly considering I have operational responsibilities in three other companies. I do it because I believe it is a small contribution to raising awareness about new AI research and technology, but also because I am fortunate to have a very engaging, intellectually curious, and technically rigorous audience that holds this newsletter to a high standard and makes writing it really enjoyable.
For that, thank you.
Now on to today’s edition:
A few weeks ago, OpenAI unveiled new image and audio processing capabilities in GPT-4. As part of that release, the AI lab announced a new model known as GPT-4 Vision (GPT-4V), which allows users to instruct GPT-4 to analyze image inputs. GPT-4V is an interesting development in the multimodal foundation model space. A few days after the GPT-4V announcements, we already had the first open-source alternative. Researchers from the University of Wisconsin-Madison and Microsoft Research introduced Large Language and Vision Assistant (LLaVA), a LLaMA-based multimodal LLM that takes images and text instructions as input.