The Sequence Chat: Doug Burger, Technical Fellow at Microsoft Research, on Building Autonomous Agents, AutoGen, and the Future of Generative AI
One of the collaborators on the AutoGen project shares insights about its vision, architecture, and the future of autonomous agents.
Please tell us a bit about yourself: your background, current role, and how you got started in AI.
I started out as a researcher in computer architecture, focusing on CPU and memory systems architecture. I spent a decade as an academic (Computer Sciences professor at the University of Texas at Austin) co-leading a DARPA-funded research program there. In 2008 I joined Microsoft and spent a decade working in Microsoft Research across several areas, including computer architecture, AI, computing systems, and reconfigurable computing. After that I shifted to Azure with the founding of the Azure Hardware Systems group, and served as a product executive building AI supercomputers. A bit under a year ago I moved back to Microsoft Research to take on a leadership role, and help evolve the organization to meet the challenge of the new AI era.
I’ve been interested in AI for a long time, but didn’t work in it as a core area until about 2015. I co-authored my first AI-related paper in 2000 (using neural networks to manage on-CPU hardware resources). My colleagues and I did other explorations on the boundaries of AI and computer architecture over the next 15 years, including analog neural networks for low-power branch prediction (2008), using neural networks to replace conventional code running on CPUs (2012), and building analog neural network accelerators into CPUs to offload that code (2014). In 2013, my team started exploring building neural network accelerators on FPGAs, and in 2015 we ramped that effort to take it to production as Project Brainwave, which shipped into production at large scale in 2017, accelerating neural network inference for Bing and Office. In 2018, when I moved into Azure, my team started building custom AI supercomputers at scale. Some of that work involved deep algorithmic work, which culminated in this year’s announcement of the MX consortium, which standardized 4-, 6-, and 8-bit datatypes for ultra-efficient AI computation.
TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
🛠 AI Work
You are one of the co-creators of AutoGen for autonomous AI agents. Can you tell us about the vision and inspiration for the project?
I’m not actually one of the co-creators. I’ve been working closely with Chi Wang, Gagan Bansal, and Ahmed Awadallah on the AutoGen team. When I met them last summer, I realized the importance of this area and what they were building, and started meeting with the team weekly to see how I could help. The potential of this area (and project) is to uplevel the capabilities of AI working with humans, allowing everyone to achieve more, which is in line with Microsoft’s mission as a company.
The team built a beautiful library that made multi-agent orchestration, and using human feedback in the loop, simple and powerful. That's one reason for its great success and growing popularity on GitHub. The thing that I wanted to see was a scientific study of why different patterns worked well or poorly. What combination of agent capabilities would be most effective at solving tasks? How would you know that a particular subtask was complete and correct, without a human monitoring each agent interaction?
In the open-source community, there are huge numbers of people leveraging AutoGen in creative ways, and solving surprising problems. One pattern that we see as fundamental is the "generator+critic" pattern, where one agent generates content (writing, code, etc.) and another agent critiques it (finds bugs, etc.). The two can iterate until the solution is correct, or can even find problems in the environment and automatically install the packages needed for the generated code to run.
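The generator+critic loop described above can be sketched in a few lines of plain Python. This is not the AutoGen API, just the control flow of the pattern; `generate` and `critique` are hypothetical stand-ins for LLM-backed agents.

```python
def generator_critic_loop(generate, critique, task, max_rounds=5):
    """Iterate a generator agent against a critic until the critic approves.

    `generate` takes the task plus the critic's last feedback and returns a
    draft; `critique` returns (ok, feedback). Both stand in for LLM agents.
    """
    feedback = None
    draft = None
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        ok, feedback = critique(task, draft)
        if ok:
            return draft  # critic accepted the draft
    return draft  # best effort after max_rounds


# Toy agents: the "generator" proposes successive squares,
# the "critic" rejects anything that is not greater than 50.
def toy_generate(task, feedback):
    toy_generate.n += 1
    return toy_generate.n ** 2

toy_generate.n = 0

def toy_critique(task, draft):
    if draft > 50:
        return True, "looks good"
    return False, f"{draft} is too small, try again"

result = generator_critic_loop(toy_generate, toy_critique,
                               "find a big square", max_rounds=10)
print(result)  # 64, the first square greater than 50
```

In the real pattern the critic's feedback string is fed back into the generator's prompt, which is what lets the pair converge on a correct answer without a human reviewing each turn.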
Right now, the AutoGen user has the option to be in the loop for any interaction, which is super valuable as we are figuring out how to apply this technology to successively larger problems. Ultimately, we will want more of an agent communication graph to be "closed loop", meaning that humans don't need to review the generated information, and can interact in an open loop fashion at a higher level (for example, refining specifications where they are imprecise, or giving feedback about how the agents are interacting if it can't make forward progress). Ultimately, graphs of human collaborations and graphs of agent collaborations will be interleaved, and I think we'll find that the ideal interleavings will be quite surprising.
The two most important problems (at least how I am thinking about them currently), are:
First, finding a rigorous scientific framework for how different agent skills, personalities, and instructions combine to be most capable for different problems (think of this as social management science for AI agents).
Second, figuring out how to formally validate and verify that subtasks of a larger task are correct and complete. Progress on those two fronts will unlock large capabilities as we grow the sizes of inter-agent collaborations.
What are the core components of the AutoGen architecture and their key capabilities?
AutoGen has three core components that contributed to its success early on. First, it is incredibly simple and lightweight; setting up multiple agents interacting through "conversational programming" is straightforward. That ease of use allows people to get up and running quickly. Second, the abstractions AutoGen provides are fully general, but they can be realized in useful specific ways: the notion of a "User Proxy Agent" allows people to choose to be an agent in the graph, intercept messages, provide their own, or allow the agents to run. That capability greatly simplifies the ability to keep things on track, especially in the early days when we don't know how to ground the conversations and inter-agent collaborations fully. Third, AutoGen’s flexible topologies allow for arbitrary and creative organizations of agent graphs. An example is AutoGen’s Group Chat Manager, which allows an arbitrary set of agents to participate in a chat. The Group Chat Manager selects the agent that seems like the best choice to respond at each iteration of the conversation. This dynamism allows many types of agents to work together with low friction from the user or programmer. Put together, these three components allow people new to the platform to build sophisticated groups of agents without having to do a lot of debugging or experimentation. That low friction is central to AutoGen's success and momentum.
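The "User Proxy Agent" idea can be illustrated with a minimal sketch. The classes below are hypothetical, not AutoGen's own (AutoGen's real agents carry LLM configs and richer interfaces); the point is only the interception behavior: the proxy sits in the agent graph on the human's behalf and can override any automated reply.

```python
class Agent:
    """Minimal conversational agent: maps an incoming message to a reply."""
    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn

    def reply(self, message):
        return self.reply_fn(message)


class UserProxy(Agent):
    """Represents the human in the agent graph. If `human_input` returns a
    string, that turn is intercepted; if it returns None, the automated
    reply function runs instead."""
    def __init__(self, name, reply_fn, human_input=None):
        super().__init__(name, reply_fn)
        self.human_input = human_input  # callable: message -> str or None

    def reply(self, message):
        if self.human_input is not None:
            override = self.human_input(message)
            if override is not None:
                return override  # human intercepted this turn
        return self.reply_fn(message)  # fall through to automation


# The proxy auto-approves unless the (simulated) human steps in.
proxy = UserProxy("user_proxy",
                  reply_fn=lambda m: "ACK",
                  human_input=lambda m: "stop" if "risky" in m else None)
print(proxy.reply("plan step 3"))       # ACK  (human stayed out of the loop)
print(proxy.reply("run risky script"))  # stop (human intercepted)
```

Swapping the `human_input` callable between `None` and an interactive prompt is what moves a graph between fully autonomous and human-in-the-loop operation.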
AutoGen seems to place special emphasis on multi-agent communication patterns. Could you expand on some of the main patterns included in the current version of the framework?
The most popular innovation in AutoGen’s multi-agent coordination is the Group Chat Manager. It uses the LLM capabilities themselves to guide arbitrary collections of agents working together, as opposed to exposing that complexity to the user in a space that is new and not well understood. Beyond Group Chat Manager, AutoGen also supports many popular conversational patterns, such as one-to-one, hierarchical, and nested chats. Over time, as we understand the patterns that work well for different types and compositions of groups of agents, specific point functionality like the Group Chat Manager may diminish in importance, but it's been incredibly helpful for getting high value, low-friction experiences off the ground for users quickly. AutoGen also supports some interesting features, such as dynamic agents, which can—on the fly—decide to initiate and consult new agents. One of the exciting aspects of the exploding popularity of this tool is seeing the surprising and creative ways that users are leveraging these more advanced features.
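A group chat of this kind reduces to a simple loop: a manager repeatedly selects a speaker and routes the running transcript to it. The sketch below is a toy version under stated assumptions; in AutoGen the `select_speaker` step is performed by an LLM reading the transcript, whereas here a hard-coded alternation policy stands in for it.

```python
def run_group_chat(agents, select_speaker, opening, max_round=4):
    """Toy group-chat loop: each round, a manager picks the next speaker
    and passes it the transcript so far. `select_speaker` stands in for
    the LLM-based choice a real Group Chat Manager makes."""
    transcript = [("user", opening)]
    for _ in range(max_round):
        speaker = select_speaker(agents, transcript)
        reply = speaker["reply"](transcript)
        transcript.append((speaker["name"], reply))
    return transcript


coder = {"name": "coder", "reply": lambda t: "here is some code"}
critic = {"name": "critic", "reply": lambda t: "found a bug"}

# Stand-in selection policy: alternate speakers based on who spoke last
# (an LLM would instead pick the most relevant agent for the moment).
def alternate(agents, transcript):
    last_speaker = transcript[-1][0]
    return critic if last_speaker == "coder" else coder

chat = run_group_chat([coder, critic], alternate,
                      "please write code", max_round=4)
print([name for name, _ in chat])
# ['user', 'coder', 'critic', 'coder', 'critic']
```

Because selection happens per round over the whole transcript, the same loop supports one-to-one, hierarchical, or dynamically changing rosters of agents without the user wiring up the topology by hand.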
AutoGen’s agents offer capabilities such as tool integration, memory, and human interaction. What techniques were used to implement those features?
AutoGen's value (today) is in low-friction assemblage of agents, tools, and human feedback, which results in this “conversation-centric computing” paradigm. The core technique is simple: basic message passing among agents, humans, and tools (code). There is really nothing special in the messaging; the popularity of AutoGen really resides in the low friction needed to get interesting combinations of agents up and running quickly. Many of the capabilities in OpenAI's platform enhance AutoGen's capabilities as well, as AutoGen sits above the level of large language models.
What are the enhanced inference capabilities of AutoGen, and why are they needed?
Because AutoGen sees the entire agent graph, it can make optimizations in the back end. Some of the optimizations AutoGen supports include performance tuning, transparent error handling, and caching. AutoGen does not implement model inference itself, but rather calls LLMs through pre-existing APIs that leverage inference optimizations. Additionally, since AutoGen is model independent, over time it can support a fleet of optimized per-topic "expert agents" that are called where appropriate, rather than calling expensive foundation models for every type of agent. Specifically, we are exploring advanced AI model techniques such as those contained in MSR's Orca and phi models.
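Caching is the easiest of these back-end optimizations to picture: in a multi-agent conversation the same prompt can recur across agents and rounds, so memoizing model calls saves real cost. The sketch below is a generic illustration, not AutoGen's implementation; `fake_llm` is a hypothetical stand-in for an expensive model API, and real frameworks typically use persistent, on-disk caches keyed by prompt and sampling parameters.

```python
def cached_llm_call(llm_fn):
    """Wrap a (deterministic) model call with an in-memory cache, so
    repeated prompts across an agent graph hit the backend only once.
    Returns the wrapped call plus a stats dict for observing cache hits."""
    cache = {}
    stats = {"backend_calls": 0}

    def ask(prompt):
        if prompt not in cache:
            stats["backend_calls"] += 1  # cache miss: call the backend
            cache[prompt] = llm_fn(prompt)
        return cache[prompt]

    return ask, stats


# `fake_llm` stands in for an expensive model API.
fake_llm = lambda prompt: prompt.upper()
ask, stats = cached_llm_call(fake_llm)

print(ask("hello"))            # HELLO
print(ask("hello"))            # HELLO (served from cache)
print(stats["backend_calls"])  # 1, the backend ran only once
```

Note the caveat in the docstring: caching is only safe when the call is deterministic for a given key, which is why real caches usually fold the temperature and seed into the key.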
The autonomous AI agents space is evolving at a very rapid pace. Can you share some of the new areas of focus for AutoGen?
We have two parallel tracks that we are pursuing. The first is to move the platform forward with requests from the open-source community and integration with new capabilities like OpenAI's Assistants/custom GPTs APIs. The second is to advance the science of automated problem solving. One direction is to leverage learning loops to identify which combination of agents best solve problems. Another direction is to advance the understanding of how to partition tasks automatically into solvable sub-tasks with AutoGen.
How does AutoGen compare with specialized agent frameworks such as AutoGPT or BabyAGI, or with language-model programming stacks such as LlamaIndex, LangChain, and Microsoft’s own Semantic Kernel?
All of these frameworks are useful and provide a different (but often overlapping) set of capabilities. All of them are essentially running experiments to see which features will be most useful, which abstractions people most like, and which classes of problems each set of capabilities can address. And given that they are all either open source or open access, they will compose in interesting ways. Semantic Kernel supports both AutoGen and its native multi-agent approach. AutoGen just integrated with OpenAI’s Assistant/Custom GPT developer interfaces. I think having an ebb and flow, and varying levels of integration between these frameworks, is allowing the community to experiment rapidly and have all of us advance the utility of these frameworks more quickly than if they were rigid verticals with no cross pollination.
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of research outside of generative AI?
I don't have a favorite area; there are a ton of research problems I'm interested in, and I've historically worked across many areas. Understanding biological neural networks is one current focus. In the past, I’ve spent a lot of time working on more efficient silicon architectures, particularly dataflow architectures, and advanced numerical quantization approaches for deep learning. Another area I'm excited about is programming languages for hardware synthesis, both ASICs and reconfigurable computing.
What are some major research breakthroughs that you believe can unlock the true value of AI agent solutions?
There are several important research areas and problems where breakthroughs would take multi-agent systems to the next level. One is a more formal view of explainability: what semantic features in LLMs do different prompts invoke, and which semantic hierarchies invoked across multiple agents are most effective at solving problems, working together, being creative, etc.? These models contain so much information, but we don't have good structures for reasoning about how even one invocation works, let alone how to think about multiple invocations collaborating. I expect this area to advance empirically for now, with learning loops improving the empirical results, but it would be wonderful to have some deeper theory to understand why different combinations of agents work well or poorly.

Second, having formal abstractions to reason about the correctness of a wide range of problems would be good. Things like code are testable, because they (ideally) have precise specifications. But applying a notion of "correct" or "good enough" to a wide range of problems will allow multi-agent systems to be much more effective.

Finally, we need formal structures to support decomposition and recomposition of tasks into subtasks and back into tasks. Currently our approaches are ad hoc, and having formal structures to solve problems hierarchically (for general problems) will be essential. These structures may also change how we architect solutions; sort of an AI version of Conway's Law. Another Conway’s Law-related observation is that the capabilities of the models will also change the topology of the ideal multi-agent solutions, an observation we refer to as Gabuchi’s Law.
Can you give us a vision of autonomous AI agents in three to five years?
It's a really great (and tough) question. I recently learned from Eric Horvitz, Microsoft’s Chief Scientific Officer, that the originator of the "technological singularity" concept was actually John von Neumann, but he used it in a different fashion than how Ray Kurzweil and others use it today. He meant it as the point where technology was advancing so rapidly that it was not possible to extrapolate and make predictions about even the near-term future. I feel like we are at that point; I built a multi-agent application this week that would not have been possible just two weeks ago. But researchers should predict, so I'll make a prediction that will likely be wrong: In five years we will have a much deeper understanding of how human collaborative graphs and AI collaborative graphs work. We'll be able to mix and match them to design systems that can give us much better outcomes on hard problems. My personal dream is that we can use these capabilities to solve problems, like building a more fair or more sustainable society, that are beyond our reach today because they interact with large-scale human and sociotechnical systems. It's also possible that these technologies will lead us to bigger problems. Just like evolution is unpredictable, it's unclear what the driving forces that will guide how these technologies shape society will be (what is the equivalent of natural selection?). In part that is up to us, to the extent that we can understand and guide how this technology affects society. We might need sophisticated multi-agent systems to understand how to steer sophisticated multi-agent systems for responsible use. All of us have a collective responsibility to steer these powerful technologies in directions that do more good than harm.