Luca Beurer-Kellner, ETH Zürich, creator of the Language Model Query Language (LMQL), on language model programming and the future of LLMs.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
I am currently in the third year of my PhD in Computer Science at ETH Zürich, where I focus on the intersection of machine learning (ML) and programming language (PL) research. Before that, I did my BSc in CS at HU Berlin and later my MSc at ETH. In between, I worked at a few smaller companies doing software and compiler engineering.
Throughout my academic and programming life, I have been fascinated by the design and implementation of programming languages. Only later, during my master’s studies, did I get more exposure to the machine learning world, and I quickly became interested in combining the two in my research and other projects. In this context, I have worked on both PL-informed machine learning and ML-focused PL projects, such as differentiable programming and, more recently, language model programming.
🛠 ML Work
You recently worked on LMQL, a query language for LLMs. Could you please tell us more about the vision and inspiration behind the project?
The origin story of LMQL is always a fun one to tell when we get around to it. We started working on the project in July 2022, months before ChatGPT came out. Even then, we observed with great interest how recent LLMs had become more and more powerful and had started to exhibit a level of programmability not previously seen in other models. In contrast to traditional single-task ML models, LLMs can be prompted to perform all sorts of tasks. More recent developments indicate that they may even have the potential to become general-purpose reasoning engines. For PL researchers this is super exciting to see, as it allows for fundamentally new forms of programming, where code is still needed but is tightly interwoven with an LLM that acts as a kind of text computer, able to do all sorts of computations that were previously very hard to do.
Based on this perspective, we started to explore how LLMs could be used as primitive building blocks to construct and program a novel form of system. During the summer of 2022, we built the first version of LMQL, which happened to be completed right before NeurIPS, around the same time ChatGPT was announced. Unfortunately, we could not release LMQL back then, because it first had to go through anonymized peer review. Still, we felt empowered and validated by that generation of (RLHF) models, and continued to build LMQL into the open-source project we lead today. Our core vision is to further explore and facilitate LLM programming, and to provide good infrastructure for this evolving space, focusing on language abstractions, interface robustness, types, and vendor compatibility.
LMQL inherits aspects of Python but also uses SQL-inspired elements. What are the advantages and limitations of this approach in combining scripts and prompts, compared to other forms of language model programming (LMP)?
Fundamentally, LMQL separates LLM programming into three orthogonal dimensions: (1) how text is generated, in terms of the decoding algorithm you use (e.g. argmax, sampling, or beam search); (2) what kind of (multi-part) prompt you use to call the model; and (3) what constraints and formatting requirements you place on the model’s response. The decoding algorithm (1) and the constraints (3) are relatively declarative aspects of this process; prompting itself, however, is more of an imperative, programmatic concept, i.e. you imperatively provide the model with instructions and examples on how to respond.
Based on this understanding, LMQL adopts declarative elements to specify the decoder and constraints, and allows imperative Python code for the actual prompting logic of your program. If your focus is on prompting alone, we also provide a reduced syntax mode that feels very much like standard Python. Overall, I think this separation of declarative vs. imperative makes a lot of sense and maps well to the programming models we observe in LLM configuration vs. LLM prompting itself.
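As a rough illustration, a query in LMQL’s standalone syntax keeps these three concerns in separate clauses; the sketch below uses an illustrative model identifier and constraint bounds:

```lmql
# (1) decoding algorithm: declarative, chosen up front
argmax
    # (2) prompting logic: imperative, mixing text with template variables
    "Q: What is the capital of France?\n"
    "A: [ANSWER]"
from
    # model backend (identifier is illustrative)
    "openai/text-davinci-003"
where
    # (3) constraints: declarative restrictions on the generated variable
    STOPS_AT(ANSWER, "\n") and len(TOKENS(ANSWER)) < 20
```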
In our initial paper on LMQL, we defined this exact form of programming as language model programming (LMP). However, since our work was first published, we have seen a plethora of different approaches emerge. More specifically, we observe compositional frameworks that focus on retrieval and chaining, and more template-based frameworks that focus mostly on output formatting. LMQL sits somewhat outside of that spectrum: it also emphasises constrained templates, but it puts algorithmic LLM use at the centre, i.e. the top-level statements in LMQL are code, not prompt. This allows LMQL to tightly integrate and optimize inline LLM calls in your code, while providing the same outside interface as a standard Python function. Overall, this enables the use of an LMQL program as a functional component in your existing compositional frameworks, while also benefiting from the concise syntax and runtime optimizations.
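To make the “code, not prompt” point more concrete, here is a rough sketch in which an ordinary Python loop at the top level drives several constrained LLM calls; the model identifier, decoder arguments, and constraints are purely illustrative:

```lmql
sample(temperature=0.8)
    "A list of things not to forget when going to the sea:\n"
    items = []                      # regular Python state, usable alongside the prompt
    for i in range(4):              # top-level control flow is just code
        "-[THING]"                  # each [THING] triggers a constrained model call
        items.append(THING.strip())
from
    "openai/text-davinci-003"
where
    STOPS_AT(THING, "\n") and len(TOKENS(THING)) < 10
```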
The LMQL query syntax includes unique elements such as decoders or constraints. Could you please elaborate on the components of an LMQL query and their specific relevance?
While I think the fundamental reasoning capabilities of LLMs lie in the model weights, I also think constraining and decoding are important aspects from a programming perspective.
For us, constraining mostly serves the purpose of establishing interface robustness: you want your LLM to provide an answer to a specific query, but to reliably process this answer in an automated system, you need it to arrive in a very specific, parsable format every time. LMQL constraints give you this by making sure the model’s vocabulary is limited in a way that only allows it to produce (at least syntactically) correct outputs. If you instead rely on prompting alone, you will end up with a small error rate per LLM call, just because of unexpected output formatting on the model side. In production, when you issue many consecutive LLM calls in a single process, these small error rates compound into unacceptable error rates for your overall system. In that sense, I think constraining is a cornerstone of robust LLM use that enables a form of reliable programmability not otherwise possible.
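As a sketch of the kind of format-enforcing constraints I mean (the prompt and model identifier are illustrative), restricting a variable to an integer or to a fixed set of labels makes the output trivially parsable downstream:

```lmql
argmax
    "Review: The food was great, but the wait was far too long.\n"
    "On a scale of 1 to 5, the rating is: [RATING]\n"
    "One-word sentiment: [SENTIMENT]"
from
    "openai/text-davinci-003"
where
    # RATING is guaranteed to parse as an integer, SENTIMENT as one of three labels
    INT(RATING) and SENTIMENT in ["positive", "neutral", "negative"]
```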
Decoding is generally known to improve overall output quality at the price of more model calls. However, in the presence of constraints and multi-part LLM use and reasoning, I think it will play an increasingly important role when it comes to backtracking forms of reasoning, as recently shown with tree-of-thought. Especially when you externally enforce constraints on LLM output, this form of backtracking can become crucial, as constraints can turn out to be unsatisfiable quite late during generation, which can then only be resolved by stepping back out of the current trajectory of reasoning.
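At the query level, this is largely a matter of swapping the decoder clause: a branching decoder keeps several hypotheses alive, so a trajectory that a constraint invalidates late does not doom the whole generation. The sketch below uses an illustrative beam width and model identifier:

```lmql
beam(n=4)
    "Name a prime number between 20 and 30: [N]"
from
    "openai/text-davinci-003"
where
    # if one beam drifts into an output that cannot satisfy INT(N),
    # another surviving beam can still complete the query
    INT(N)
```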
What is the extensibility model in LMQL? Does it support custom functions or different LLM plugins?
At all levels, LMQL is extensible and allows for custom user functions and functionality. First, our constrained decoding engine provides custom evaluation semantics for an expressive constraint language, which can be extended with custom constraints as long as they satisfy the internally used interface. This system even comes with proven guarantees on the soundness of the resulting validation behaviour.
With respect to the programs users write, LMQL is fully interoperable with Python. This means it embeds seamlessly into your existing Python program: you can call any of your existing Python functions, and you can also call LMQL programs opaquely, just like standard Python functions.
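As a small sketch of what this looks like in practice (the default model and exact call semantics depend on the installed LMQL version):

```python
import lmql

@lmql.query
def one_line_summary(topic):
    '''lmql
    "Write a one-sentence summary about {topic}: [SUMMARY]" where STOPS_AT(SUMMARY, ".")
    return SUMMARY
    '''

# used like any other Python function (inside async code the call may need to be awaited)
print(one_line_summary("programming language research"))
```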
With respect to model backends, we currently support OpenAI, HuggingFace Transformers, and llama.cpp. We have also started to generalize our backend infrastructure, with the aim of standardizing the LLM backend interface beyond just mimicking the OpenAI API. We call this the Language Model Transport Protocol, and we hope it can benefit the broader community, unlock more backends for LMQL, and also carry some of the optimizations we enable internally over to other projects.
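To give a rough idea, the backend is essentially just a parameter of the query; the identifiers below are illustrative and the exact local-model syntax depends on the LMQL version you have installed:

```python
import lmql

# hosted OpenAI model (identifier is illustrative)
m_openai = lmql.model("openai/text-davinci-003")

# HuggingFace Transformers model, loaded and served locally
m_local = lmql.model("local:gpt2")

# llama.cpp models are referenced with a similar "local:llama.cpp:<path-to-weights>"
# style identifier, with details depending on the version
```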
How do you see LMQL in relation to other LMP frameworks, such as LangChain, LlamaIndex, or Semantic Kernel?
Most existing frameworks (with very few notable exceptions) treat LLMs as magic boxes with a purely text-based interface: they pass in some prompt and get back some textual response. If you look more closely, however, there is more to the internals of an LLM that you can leverage. This is what enables constrained decoding, advanced caching, or distributions in LMQL. However, since LMQL mostly operates inside this “magic box”, it is fully compatible with existing frameworks like LangChain and can be used in conjunction with such compositional layers.
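The distribution clause is one example of looking inside that box: instead of returning free text, the sketch below scores a fixed set of continuations and yields a probability distribution over them. The prompt and model identifier are illustrative:

```lmql
argmax
    "Review: The food was amazing, but the wait was far too long.\n"
    "Q: What is the underlying sentiment of this review?\n"
    "A: [ANALYSIS]\n"
    "Overall, the sentiment is [CLS]"
from
    "openai/text-davinci-003"
where
    STOPS_AT(ANALYSIS, "\n")
distribution
    CLS in ["positive", "neutral", "negative"]
```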
In the long run, we intend to extend LMQL beyond the scope of what a query program or prompt template is considered to be today. For now, however, I generally advise people to embrace both: LMQL in the loop, and other frameworks outside of the loop or for retrieval.
LMQL on its own also provides a very simple text-based interface to your calling code, which is a simple and powerful model to work with yourself. What LMQL contributes here are optimizations, model backends, decoders, scripting, constraining, error handling, and lower-level convenience functionality. So I definitely encourage everyone to write their own custom code to chain calls, which has often proven to be much simpler than using existing do-it-all-style frameworks. On this level, many abstractions still seem very early, and trying many different variants yourself is likely the fastest way to identify the best solution.
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research aside from generative AI?
Having done some work there, I generally find differentiable programming and algorithmically guided neural networks to be very interesting research areas.
How do you see the balance between open source and API-based distribution of LLMs? Which approach ultimately prevails?
I think it is currently very clear that OpenAI has the best models and the widest use. However, open-source models are catching up, and I am very optimistic about their future. LMQL is vendor-neutral, so you can use them all, although I have to say that I prefer open-source models, as they give us full access and we do not have to work around very restrictive, proprietary APIs.
There are important areas, such as reasoning or knowledge augmentation, that are receiving a lot of research focus. Are there any specific research milestones that you believe will be relevant in the next generation of foundation models?
I think hallucinations are *the* biggest issue here. Hopefully retrieval and augmentation will help eventually, but ultimately I think it will require a very big and fundamentally different modelling decision at the level of model training and architecture.
How do you envision LMP evolving in the next five years?
I think conversational models like ChatGPT and GPT-4 point to a very interesting future for programming in general. I think we will see a lot of neuro-symbolic systems that rely heavily on models as primitive building blocks. This is an exciting prospect for programming language development and something we definitely want to contribute to, with LMQL and all the features and updates we have planned. If models continue the trend of becoming more and more capable, the resulting programmability will be very fruitful to build upon.