Dr. Joseph Gonzalez, UC Berkeley: Creating Gorilla and Language Models that Can Call APIs
The team behind Gorilla discusses the process of creating the model and the state of API-augmented language models.
This interview includes answers from Dr. Gonzalez’s students Shishir G. Patil and Tianjun Zhang, who participated in the creation of the Gorilla model.
Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
I am a Professor in the EECS department at UC Berkeley, a co-director and founding member of the UC Berkeley Sky Computing Lab and the RISE Lab, and a member of the Berkeley AI Research (BAIR) group. I work on research spanning machine learning and systems. I am a co-founder of Aqueduct, and prior to joining Berkeley, I co-founded Turi Inc. (acquired by Apple), which was based on my thesis work.
🛠 ML Work
You recently unveiled your work on Gorilla, which is one of the most interesting papers in the tool-augmented LLM space that I have read. Could you tell us about the vision and inspiration behind the project?
We started the Gorilla project with the goal of radically extending the tool capabilities of LLMs. However, as the project evolved, we realized we were working on something even bigger. The vision of Gorilla is to provide an LLM interface to the world. Today, we rely on a web browser to discover and use services. Tomorrow, AI technology will extend and maybe even replace the browser as our interface to the world. Through conversations with persistent context, LLMs will discover the right services and take the correct actions to help us complete tasks and even understand the scope of what we can accomplish. For example, next week I am traveling to give a talk, and Gorilla could examine my schedule, remind me this week, and notice that I still haven’t booked a rental car. It could find a discount EV rental service based on my preferences and perhaps even plan a road trip over the weekend.
What was the process and architecture used to fine-tune Gorilla? What are some of the unique characteristics and fundamental challenges of fine-tuning LLMs for API calls?
Shishir: Gorilla is an LLM that is trained to write API calls accurately. We are able to do this thanks to an innovative training recipe that we call RAT (Retriever-Aware Training). In RAT, we train the LLM to be aware that the prompt is in fact augmented by a retriever, which allows the model to treat the retrieved data differently from the user prompt itself. What makes API calls unique is that APIs are extremely brittle: even a single spelling error leads to a failure. Hence, it is a significantly more challenging task than general text, code, or image generation.
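The retriever-aware idea above can be sketched as building fine-tuning examples whose prompts embed the retrieved API documentation alongside the user query. The function and field names below are illustrative assumptions, not Gorilla's actual training code.

```python
# A minimal sketch of constructing a retriever-aware training example.
# Assumption: a retriever has already returned a relevant API doc string.
# All names and the example content are illustrative, not Gorilla's code.

def build_rat_example(user_query, retrieved_doc, gold_api_call):
    """Format one fine-tuning example whose prompt embeds retrieved API docs."""
    prompt = (
        f"{user_query}\n"
        f"Use this API documentation for reference: {retrieved_doc}"
    )
    # The model is trained to emit the exact API call given the augmented
    # prompt, so it learns to condition on the retrieved doc rather than
    # treating it as ordinary user text.
    return {"prompt": prompt, "completion": gold_api_call}

example = build_rat_example(
    user_query="Generate an image from the text 'a cat'.",
    retrieved_doc="StableDiffusionPipeline.from_pretrained(model_id) ...",
    gold_api_call="pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5')",
)
```

Because the doc string is injected by a retriever at both training and inference time, the model can adapt when the API documentation changes, rather than relying only on memorized calls.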
Gorilla is based on LLaMA-7B and appears to outperform substantially larger models such as GPT-4 and Claude. To what factors do you attribute the performance of this relatively small model?
Shishir: Well, we don’t know for sure, given that the other models are closed-source :) But our best guess is that a few factors helped. First, as I mentioned, I think our RAT (retriever-aware training) recipe really shines when it comes to writing APIs. Second, introducing the ability to measure hallucination, something we can do with APIs, gave us a basis to actually compare and refine techniques. Third, having the LLM focus solely on writing API calls definitely helped.
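The hallucination-measurement point can be illustrated with a toy check: a generated API call either matches an entry in the known API database or it does not, which makes hallucination objectively countable. Gorilla's actual evaluation uses AST subtree matching against its API dataset; the substring check below is a deliberate simplification for illustration.

```python
# A toy sketch of why API calls make hallucination measurable.
# Gorilla's real evaluation uses AST subtree matching; this simplified
# version just checks whether any known API name appears in the output.

API_DATABASE = {
    "StableDiffusionPipeline.from_pretrained",
    "pipeline",
}

def is_hallucinated(generated_call, api_db):
    """Flag a generated call that references no API in the database."""
    return not any(api in generated_call for api in api_db)
```

With free-form text there is no such ground truth to compare against, which is why measuring hallucination for API generation is tractable in a way general generation is not.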
One aspect that I find fascinating about Gorilla is the use of GPT-4 to generate synthetic instructions. Could you elaborate on the process and techniques employed to preserve the quality of these instructions?
Tianjun: We used a technique called Self-Instruct, a simple idea where you use the LLM itself to generate questions and their corresponding answers. We know LLMs today are really good at coming up with answers and solutions, and, as the Self-Instruct paper showed, they are also good at generating questions. It turns out that after being shown a few example question-answer pairs, the model is already good at coming up with new instructions. For better quality, we also manually edited the questions and answers to make them more robust.
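The few-shot bootstrapping described above can be sketched as assembling a prompt from seed question-answer pairs and asking the model to continue the pattern. The seed pairs and wording below are invented for illustration, not Gorilla's actual seeds.

```python
# A minimal sketch of a Self-Instruct-style few-shot prompt: seed
# question-answer pairs are shown to the LLM, which is then asked to
# produce new pairs. Seed content here is invented for illustration.

SEED_PAIRS = [
    ("How do I translate English to German?",
     "translator = pipeline('translation_en_to_de')"),
    ("How can I classify the sentiment of a review?",
     "classifier = pipeline('sentiment-analysis')"),
]

def build_self_instruct_prompt(seed_pairs, n_new=3):
    """Assemble a few-shot prompt asking the model for new Q/A pairs."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in seed_pairs)
    return (
        f"{shots}\n\n"
        f"Following the examples above, write {n_new} new question-answer "
        f"pairs about using machine-learning APIs."
    )

prompt = build_self_instruct_prompt(SEED_PAIRS)
```

The generated pairs would then be filtered and, as Tianjun notes, manually edited before being used as fine-tuning data.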
What are the main differences between Gorilla and Meta AI's Toolformer?
Tianjun: Toolformer is a great demonstration of tool use in LLMs, but it only demonstrates the capability in a very narrow domain: using tools like a calculator or Wikipedia to answer a specific question. It also focuses on roughly 20 API calls, far fewer than Gorilla, which deals with 3,000+ API calls. We also built an extensive evaluation benchmark on these calls, rather than focusing on question answering. From our point of view, the solutions proposed by Gorilla and Toolformer differ mainly because they look at the problem from a different scale and perspective.
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research aside from generative AI?
I am still very excited about the work being done in computer vision and multi-modal models as well as a lot of the more basic work connecting machine learning to various data sources (e.g., Feature Stores and Vector Stores). We are also looking at how to better serve models by playing with different tradeoffs in latency and throughput.
Your team open-sourced the original version of Gorilla. How do you see the balance between open-source and API-based distribution of LLMs? Which approach ultimately prevails?
This is an important question, and I don’t yet know where things will head. When the first open-source LLMs for chatting came about, a lot of people started to think that open-source LLMs would dominate the big commercial LLM providers. This hasn’t happened. For general reasoning, it is challenging to beat state-of-the-art commercial offerings. I think in the future we will see lots of open-source specialized models that perform certain tasks well (or at least well enough). Yet, just as with web search, I still imagine there will be a few hosted LLMs that people use every day. This is because building, maintaining, and delivering LLM technology requires significant capital investments in people, data, and technology.
There are important areas, such as reasoning or knowledge augmentation, that are receiving a lot of research focus. Are there any specific research milestones that you believe will be relevant in the next generation of LLMs?
There is a major open question about how we balance retrieval-augmented generation with fine-tuning to incorporate domain knowledge. There are strengths and weaknesses to both approaches. Retrieval is limited by the quality of the retrieved results as well as our LLM’s ability to deal with distracting content. Fine-tuning has the challenge of potentially requiring many models, and it is not yet clear how much to fine-tune or what the consequences of fine-tuning are for the underlying model’s abilities.
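The retrieval side of this tradeoff can be sketched very simply: pick the most relevant document for a query and prepend it to the prompt. Real systems rank by embedding similarity; the word-overlap scorer and the document contents below are toy assumptions for illustration only.

```python
# A toy sketch of retrieval-augmented generation: select the document
# sharing the most words with the query, then prepend it as context.
# Real retrievers use embedding similarity; this is illustrative only.

def retrieve(query, docs):
    """Return the document with the largest word overlap with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def augmented_prompt(query, docs):
    doc = retrieve(query, docs)
    return f"Context: {doc}\nQuestion: {query}\nAnswer:"

docs = [
    "The Sky Computing Lab studies cross-cloud systems.",
    "Gorilla is an LLM fine-tuned to write API calls.",
]
result = augmented_prompt("What does Gorilla write?", docs)
```

The quality ceiling Dr. Gonzalez describes is visible even in this sketch: if `retrieve` returns an irrelevant or distracting document, the generation step inherits that error, whereas fine-tuning bakes the knowledge into the weights at the cost of retraining when it changes.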
How do you envision tool/knowledge-augmented LLMs evolving in the next two-to-five years?
I suspect this will quickly become a dominant focus of LLM research and commercial applications of LLMs. Being able to use tools and web services will make LLM technology significantly more powerful and useful.