The Sequence Chat: Lianmin Zheng, UC Berkeley About Vicuna, Chatbot Arena and the Open Source LLM Revolution
The co-creator of one of the most important open source LLMs shares his insights about research and development in foundation models.
Lianmin Zheng is a Ph.D. student in the EECS department at UC Berkeley, advised by Ion Stoica and Joseph E. Gonzalez. His research interests include LLMs, compilers, and distributed systems. He was awarded the Meta PhD Fellowship. Currently, he is leading the LMSYS efforts and open-source projects including Vicuna and Chatbot Arena.
Quick bio
Please tell us a bit about yourself: your background, your current role, and how you got started in AI.
I am a Ph.D. student working at the intersection of AI and systems. I am committed to open-source AI research, developing better models (e.g., Vicuna), evaluations (e.g., Chatbot Arena and MT-bench), and systems (e.g., FastChat, Alpa). I got started in AI through my undergraduate research projects.
🛠 AI Work
You are one of the researchers behind Vicuna, one of the most popular open-source LLMs released to date. Could you please tell us about the vision and inspiration behind the project?
The vision of the Vicuna project is to build powerful models similar to OpenAI’s ChatGPT, but with an open recipe. The rapid advancement of large language models (LLMs) has revolutionized AI systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. However, despite its impressive performance, the training and architecture details of ChatGPT remain unclear, hindering research and open-source innovation in this field. So we started the Vicuna project to replicate ChatGPT-like capabilities with an open recipe.
This project was inspired by LLaMA and Alpaca. We emphasize the importance of data quality, so we went for the best data source we could find: user-shared conversations on ShareGPT.
Vicuna is based on a fine-tuned version of Meta AI’s LLaMA using ChatGPT conversations. Can you provide more details about the supervised fine-tuning process and the techniques that you used while building Vicuna?
We used standard instruction fine-tuning and additionally handled multi-turn conversations.
We carefully cleaned the collected conversations and computed the loss only on the assistant outputs. This makes Vicuna better at multi-turn conversations. In the latest versions of Vicuna, we also extended the context length to 16K with RoPE interpolation. All our code and hyperparameters are available at https://github.com/lm-sys/FastChat.
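To make the loss-masking idea concrete, here is a minimal sketch of how assistant-only loss masking can be implemented for multi-turn data. It is illustrative only; the checkpoint name, conversation format, and helper function are simplified assumptions rather than FastChat's actual preprocessing code.

```python
# A minimal, illustrative sketch of assistant-only loss masking for
# multi-turn fine-tuning. The checkpoint name and conversation format are
# assumptions; this is not FastChat's actual preprocessing code.
import torch
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # labels with this value are ignored by the cross-entropy loss

def build_example(conversation, tokenizer):
    """conversation: list of (role, text) pairs, role is "user" or "assistant"."""
    input_ids, labels = [], []
    for role, text in conversation:
        ids = tokenizer(text + tokenizer.eos_token, add_special_tokens=False).input_ids
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # supervise assistant replies
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask user turns out of the loss
    return torch.tensor(input_ids), torch.tensor(labels)

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")  # example checkpoint
ids, labels = build_example(
    [("user", "Explain RoPE in one sentence."),
     ("assistant", "RoPE encodes positions by rotating query/key vectors.")],
    tokenizer,
)
```

With labels built this way, a standard causal-LM training loop only back-propagates through the assistant tokens, since -100 is the default ignore index of the cross-entropy loss.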
The latest version of Vicuna scales to 33 billion parameters. Could you explain the process for scaling the Vicuna architecture, the challenges you encountered, and the best practices you discovered?
To scale training to larger models, you need more GPUs and better parallelism strategies. Fine-tuning a 33B model is actually not that challenging on the latest GPUs such as the H100 (80 GB) or A100 (80 GB), so we just use our existing code in FastChat, which utilizes PyTorch FSDP for parallelism.
If you want to scale efficiently to even larger models with more advanced parallelism strategies, you can check out Megatron-LM, DeepSpeed, or our research project Alpa.
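As a rough picture of what FSDP-based fine-tuning looks like, here is a minimal sketch of wrapping a LLaMA-family model with PyTorch FSDP. The model name, wrap policy, and hyperparameters are assumptions for illustration, not the exact FastChat launch configuration.

```python
# A minimal sketch of sharded fine-tuning with PyTorch FSDP. The model name,
# wrap policy, and hyperparameters are illustrative assumptions, not the
# exact FastChat configuration.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")  # launched with torchrun, one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.5", torch_dtype=torch.bfloat16
)
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... then run a standard training loop: forward pass, loss.backward(), optimizer.step()
```

Wrapping each decoder layer separately lets FSDP shard parameters, gradients, and optimizer states across GPUs, which is what makes a 33B fine-tune fit on a node of 80 GB cards.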
In relation to the previous question, you have been significantly involved in research aimed at expanding the context length of LLMs. Can you discuss the emerging research in this area and outline some of the practical limitations of using large context windows in LLMs?
Possible topics:
How to re-design the attention mechanism for better model accuracy or better inference/training efficiency.
How to construct the training data to let the model learn long-context dependencies better.
Limitations:
Current open LLMs are not good at using the information in the context, even when we provide them with a large amount of it.
Inference is slow with long contexts.
You are also actively engaged in evaluation projects such as Chatbot Arena, MT-Bench, and LongEval. Would you be able to describe those efforts and offer insights into what constitutes robust evaluation benchmarks for LLMs?
We think we should evaluate LLMs on more open-ended and fresh questions, instead of multiple-choice questions like MMLU, so we started Chatbot Arena and MT-bench.
Chatbot Arena is a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. So far, we have collected around 70K votes and used them to compute Elo ratings for the models. You can check out the latest leaderboard, which is based on human preferences.
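To illustrate the idea of turning pairwise votes into ratings, here is a simplified online Elo update over battle outcomes. The K-factor, initial rating, and example battles are arbitrary assumptions, not the exact computation behind the Chatbot Arena leaderboard.

```python
# Simplified online Elo update from pairwise votes (illustrative only; the
# K-factor, initial rating, and battles below are arbitrary assumptions).
from collections import defaultdict

INIT_RATING, K = 1000.0, 32.0

def update_elo(ratings, model_a, model_b, winner):
    """winner is "a", "b", or "tie"."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + K * (score_a - expected_a)
    ratings[model_b] = rb + K * ((1.0 - score_a) - (1.0 - expected_a))

ratings = defaultdict(lambda: INIT_RATING)
battles = [("vicuna-13b", "alpaca-13b", "a"), ("vicuna-13b", "gpt-4", "b")]
for a, b, w in battles:
    update_elo(ratings, a, b, w)
print(dict(ratings))
```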
MT-bench is a small set of challenging multi-turn questions that you can use in a more controlled and automated manner. The details can be found in our paper, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. It is based on GPT-4 grading.
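For a sense of what GPT-4 grading looks like in practice, here is a minimal single-answer grading sketch in the spirit of LLM-as-a-judge. The judge prompt and score parsing are simplified assumptions, not the exact MT-bench judge implementation.

```python
# A minimal sketch of single-answer grading with a GPT-4 judge. The prompt
# wording and score parsing are simplified assumptions, not the exact
# MT-bench judge code.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question below. Rate the response "
    "on a scale of 1 to 10 and end your reply with the verdict in the format: "
    "Rating: [[score]].\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def judge(question: str, answer: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    text = completion.choices[0].message.content
    match = re.search(r"Rating: \[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else float("nan")
```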
I think it is very challenging to build a robust evaluation. Here are some suggestions:
Always evaluate the model for your specific use cases.
Build comprehensive evaluations of LLMs by synthesizing a lot of benchmark results.
Areas such as reasoning, knowledge augmentation, tool usage, and memory have been highly active segments of research in the LLM space. Are these elements part of the Vicuna roadmap? Additionally, could you elaborate on the new capabilities that you are currently exploring for the upcoming versions of Vicuna?
Yes. We are working on enhancing the reasoning and coding ability of Vicuna. Stay tuned!
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of research outside of generative AI?
Compilers. Besides generative AI, I have worked on several compiler projects such as Alpa (based on JAX/XLA) and Ansor (based on TVM).
What are the next milestones or potential research breakthroughs for the next generation of foundation models?
Achieving competitive performance in coding/algorithm competitions such as the International Olympiad in Informatics (IOI) and the International Collegiate Programming Contest (ICPC), without having seen the problems in the training data.
How would you describe the strengths and weaknesses of Vicuna in relation to other open-source LLMs such as LLaMA 2, Falcon, Platypus, Alpaca, and others?
The latest Vicuna is fine-tuned from Llama 2. It focuses on chat ability and helpfulness. Compared to base models (e.g., Llama 2, Falcon), it has instruction-following ability. Compared to other fine-tunes, Vicuna's training data (ShareGPT) enables it to handle chats on a diverse range of topics.
How do you perceive the balance between open-source and closed-source/API-based distribution for foundation models? Who do you think will emerge as the victor in the end?