The Sequence Chat: Raza Habib, Humanloop on Building LLM-Driven Applications
Humanloop is one of the emerging platforms that allow developers to build large-scale applications on top of LLMs.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
I’m the co-founder and CEO of Humanloop. We help developers build reliable applications on top of LLMs like GPT-4.
I first got interested in machine learning when I was a physics undergrad at Cambridge and saw Professor Sir David MacKay's lectures on information theory and learning algorithms. The idea of building intelligent learning systems fascinated me and I was immediately hooked. I was excited both by the potential applications of the technology and also by the dream that we might understand how brains work.
Later, during my PhD, the rate of progress in AI and NLP totally staggered me. Things that I didn't expect to happen for decades kept happening every year, and it feels like progress has only accelerated since then. I initially studied physics, and at the start of the 20th century all of the smartest people were drawn to the problems of quantum mechanics. Today, it seems to me that the most exciting and challenging problems are in AI.
🛠 ML Work
Humanloop is one of the emerging platforms in the LLM application development space. Can you tell us about the vision and inspiration for the project?
I’ve believed for a long time now that foundational AI models, like GPT-3/4, are the start of the next big computing platform. Developers building on top of these models will be able to build a new generation of applications that until recently would have felt like science fiction. We’ve already seen examples of these in the form of intelligent assistants like ChatGPT, or GitHub Copilot for software, but these are just the beginning.
We've worked closely with some of the earliest adopters of GPT-3 to understand the challenges they faced when working with this powerful new technology. Repeatedly we heard that prototyping was easy but getting to production was hard: evaluation is subjective and difficult, prompt engineering is more art than science, and models hallucinate and are hard to customise. To unlock the potential of LLM applications we need a new set of tools built from first principles.
At Humanloop, we’ve been building the tools needed to take the raw intelligence of a Large Language Model and wrangle that into a differentiated and reliable product. Our vision is to empower millions of developers to build novel and useful apps and products with LLMs.
Humanloop recently partnered with Stability AI/Carper AI to build a new generation of instruction-following LLMs. What is the workflow and what techniques are used to add instruction-following capabilities to LLMs?
OpenAI pioneered the techniques needed to train instruction following models and the main steps and workflow are largely unchanged. There are three steps:
Train a self-supervised base model — in this step, you train a language model using a next-word prediction task on a large corpus of data from the public internet (and elsewhere). This model learns a lot about language and the world but won’t follow instructions at all. It just tries to predict the next word.
Supervised fine-tuning — next you gather a large dataset (thousands of examples) of human-written demonstrations. Each example is a pair of an instruction and an appropriate completion. You use this data to do a small amount of extra training that teaches the model to follow instructions.
Reinforcement learning from Human Feedback (RLHF) — in the final step you gather a different type of feedback data, which is preferences. You show people two examples of generations for the same instruction and ask them which one they prefer. You use those preference data to train a second model that attempts to predict the human preferences. This new model is called the “reward model” and is finally used as a training signal to further improve the model from step 2 with reinforcement learning.
After supervised fine-tuning (step 2), the models are already quite good at following instructions, but RLHF provides much more feedback data and allows the models to learn more abstract human preferences, like a preference for honest answers.
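To make step 3 a little more concrete, below is a minimal sketch of the pairwise preference objective typically used to train the reward model (a Bradley-Terry style loss). The scores are dummy values and the function name is illustrative; this is not code from Humanloop, OpenAI or Carper AI.

```python
# A minimal sketch of the reward-model training objective used in step 3.
# `reward_chosen` / `reward_rejected` are scalar scores the reward model assigns
# to the preferred and non-preferred completions for the same instruction.
# Names and values are illustrative, not taken from any specific library.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style loss: push the preferred completion's score above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy batch of 4 preference pairs
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.6, -0.1, 1.1])
loss = preference_loss(chosen, rejected)  # small when the reward model agrees with the labellers
```

The trained reward model's score then serves as the reward signal for the reinforcement-learning step that further improves the supervised model from step 2.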
Testing and experimentation are some of the hallmarks of the Humanloop platform. What are some of the best practices for prompt A/B testing with LLMs?
One of the hardest parts of building with LLMs is that evaluation is much more subjective than in traditional software or machine learning. When you’re building a coding assistant, sales coach or personal tutor, it’s not straightforward to say what the “correct” answer actually is.
You can get moderately far using traditional machine learning metrics like ROUGE but we’ve found that by far the best signal of performance is human feedback. This feedback can be generated during development from an internal team but it’s particularly important to capture feedback data in production based on how users actually respond to the model’s behavior.
We’ve seen three types of feedback be particularly useful:
votes — thumbs up/thumbs down
actions — does the user accept a suggestion? Do they regenerate? Do they complete the flow? These are all excellent signals of performance.
corrections — if you generate answers users can interact with and they edit those answers (e.g. email drafting), then the corrections are a very useful form of feedback.
The feedback data you collect in production allows you to monitor performance and also to take actions to improve models over time (e.g. through fine-tuning).
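As an illustration of how these three feedback types might be captured alongside a simple prompt A/B test, here is a sketch in Python. The prompt variants, data model and function names are hypothetical; a production setup would log this through your observability tooling or a platform SDK rather than in-memory objects.

```python
# Hypothetical sketch: assign users to prompt variants and log votes, actions and
# corrections against each generation. Names and structure are illustrative only.
import random
from dataclasses import dataclass, field

PROMPT_VARIANTS = {
    "A": "Summarise the email below in two sentences:\n{email}",
    "B": "You are a concise assistant. Summarise this email briefly:\n{email}",
}

@dataclass
class GenerationLog:
    variant: str
    prompt: str
    output: str
    feedback: dict = field(default_factory=dict)

def generate(email: str, call_llm) -> GenerationLog:
    variant = random.choice(list(PROMPT_VARIANTS))         # simple A/B assignment
    prompt = PROMPT_VARIANTS[variant].format(email=email)
    return GenerationLog(variant=variant, prompt=prompt, output=call_llm(prompt))

def record_vote(log: GenerationLog, thumbs_up: bool):        # "votes" feedback
    log.feedback["vote"] = "up" if thumbs_up else "down"

def record_action(log: GenerationLog, action: str):          # "actions" feedback, e.g. "accepted", "regenerated"
    log.feedback.setdefault("actions", []).append(action)

def record_correction(log: GenerationLog, edited_text: str): # "corrections" feedback
    log.feedback["correction"] = edited_text
```

Aggregating this feedback per variant is what turns it into an A/B test: you compare vote rates, acceptance rates and correction frequency across prompts rather than relying on intuition.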
Another common best practice for evaluating and monitoring models is to use a second LLM to score the generations from your application. In practice, evaluation is a much easier task than generation, and LLMs provide surprisingly accurate scoring information.
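As a sketch of that pattern, the snippet below asks a second model to grade a generation against a simple rubric. It assumes the OpenAI Python SDK (v1) with an API key in the environment; the model name, rubric and scale are placeholders you would adapt to your own application.

```python
# Sketch of "LLM as judge": a second model scores generations from your application.
# Assumes the OpenAI Python SDK v1 and OPENAI_API_KEY in the environment;
# the model name and rubric are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def score_generation(instruction: str, answer: str) -> int:
    rubric = (
        "Rate the answer to the instruction on a 1-5 scale for helpfulness and "
        "factual accuracy. Reply with a single digit only.\n\n"
        f"Instruction: {instruction}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model can act as the judge
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```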
Just like RLHF laid the foundation for ChatGPT, techniques such as chain of thought prompting or information augmentation are enabling exciting capabilities for LLMs. What are some of the new methods that you foresee having an impact in the next generation of LLMs?
The trends that excite me most are parameter-efficient finetuning, larger context windows and multi-modality.
The context window is the number of “tokens” (similar to words) a model can “read” before generating a response. Today’s models can’t learn new things after training, so any new information needs to be included in the context window. Many applications today are limited by the size of this context window, but I think we can reasonably expect much longer contexts in the future.
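To illustrate what being "limited by the context window" means in practice, here is a small sketch that counts tokens with tiktoken and greedily packs retrieved documents into a fixed budget. The budget numbers and function name are illustrative assumptions, not part of any particular product's API.

```python
# Illustrative sketch: everything the model should "know" at request time must fit
# inside the context window, so prompts are often assembled against a token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by recent OpenAI models

def fit_to_context(system_prompt: str, documents: list[str],
                   max_tokens: int = 8192, reserve_for_answer: int = 1024) -> str:
    """Greedily pack documents into the prompt until the token budget is spent."""
    budget = max_tokens - reserve_for_answer - len(enc.encode(system_prompt))
    kept = []
    for doc in documents:
        cost = len(enc.encode(doc))
        if cost > budget:
            break
        kept.append(doc)
        budget -= cost
    return system_prompt + "\n\n" + "\n\n".join(kept)
```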
Parameter-efficient fine-tuning methods like LoRA make it cost-effective to fine-tune LLMs yourself. This will enable a lot of developers to train private models and enable products that are privacy-sensitive or need a lot of personalization.
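A minimal sketch of the LoRA idea is shown below: the pretrained weight is frozen and only a low-rank update is trained. The rank, scaling and shapes are illustrative; in practice you would typically use a library such as Hugging Face PEFT rather than hand-rolling this.

```python
# Minimal LoRA sketch: freeze the pretrained linear layer and learn a low-rank
# update B @ A on top of it. Hyperparameters and shapes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))  # only A and B (a tiny fraction of parameters) are trainable
```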
Language models do surprisingly well on questions that require world knowledge despite having only seen text, but text-only training is a severe limitation on genuine understanding. Models trained on images, text, audio, video, etc. are a natural next step and will allow a much richer understanding of the world.
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research apart from generative AI?
I find this question hard to answer because I think ultimately most of AI is actually generative AI. Taken in its broadest sense, generative AI is trying to learn the full probability distribution of a dataset from unlabelled data. Once this distribution is learned it can be used for discriminative tasks like classification, for sampling (generation) and even for reasoning and compression. So I actually think generative AI is not really distinct from AI writ large.
Present the arguments in favor of and against open-source foundation models compared to closed-source API distribution models. Which approach ultimately prevails in the long run?
I think both strands are important and both will win in different ways.
Open source enables permissionless innovation and will drive a lot of creativity. For many use cases, existing models are smart enough and the real challenges are product challenges or privacy, latency and cost. Open-source models will help a lot here. This may even be the majority of use cases by number.
However, there are valuable use cases that are well beyond the capabilities of existing models, e.g. scientific research. To reach these capabilities we’ll have to build much more powerful models, which will require investment beyond what open source can support. Those model capabilities also become increasingly dangerous in the hands of bad actors and will likely not be safe to open-source.
Does LLM application development require a new stack, or would it be part of standard app lifecycle management tools?
I think it almost certainly requires a new stack. It’s fundamentally a new paradigm of software and is just getting going!
What are some of the major milestones and roadblocks for LLM app development in the next 3 to 5 years?
Multimodality, larger context lengths, and better reasoning are big milestones. GPU compute and talent are the main bottlenecks. On a 5-year time horizon I think it’s conceivable we will see capabilities quite close to AGI.