The Sequence Chat: Zhou Wangchunshu - CTO, AIWaves, on Generative AI Autonomous Agents
The co-author of the Agents paper/framework shares his ideas about the path towards autonomous agents.
Quick bio
Please tell us a bit about yourself: your background, your current role, and how you got started in AI.
I majored in mathematics and physics as an undergraduate and started doing AI research during my master's, when I interned at Microsoft Research, ByteDance AI Lab, and the Allen Institute for AI. I worked on neural text generation and language modeling during that time, then began my PhD at ETH Zurich, where I continued working on long, creative, and controllable text generation. In late 2022, inspired by the huge success of large language models such as ChatGPT, I realized it was the right time to turn my research into real-world impact, and I decided to launch a startup named AIWaves with my PhD colleague, Eleanor. I'm currently the CTO of the startup, working mainly on AI algorithms while also spending time on products and software development.
I got started in AI because I have been a huge fan of sci-fi since I was 5 or 6 years old; I was always fascinated by the idea of AI in stories such as Hyperion and Neuromancer. Watching AlphaGo's Go matches, I realized those stories might come true in the near future and decided to be part of this great transition. I chose to work on NLP and LLMs because I think language is the core of human intelligence.
🛠 AI Work
You are the creator of AIWaves. Can you tell us about the vision and inspiration for the project?
The goal of AIWaves is to use AI technology (specifically, large language models) to improve human productivity in both content creation and daily work. Our vision is that building controllable AI agents is the key to freeing up human labor and giving us more time to do whatever we want.
Recently, you published some of your ideas about autonomous AI agents in a fascinating paper with different collaborators. Can you elaborate on the core components of the architecture outlined in the paper?
The key idea is that instead of using a single task description and letting LLM agents plan and act entirely on their own, we provide symbolic control over them using an SOP: a graph, or automaton, of the different states (e.g., sub-tasks) that LLM agents may encounter while completing the task. This enables finer-grained control of an LLM agent system. The SOP not only makes LLM agents more controllable but also makes it easier for developers and users to tune them (in most previous frameworks, one can only try different initial task descriptions and hope for the best).
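To make this concrete, here is a minimal sketch of SOP-style symbolic control. The class and method names are illustrative assumptions, not the Agents framework's actual API:

```python
# Minimal sketch of an SOP as a state graph (illustrative names,
# not the actual Agents framework API).

class State:
    def __init__(self, name, instruction, transitions):
        self.name = name
        self.instruction = instruction  # fine-grained prompt for this sub-task only
        self.transitions = transitions  # maps condition(output) -> next state name

class SOP:
    """A graph/automaton of sub-task states that constrains the agent."""
    def __init__(self, states, start):
        self.states = {s.name: s for s in states}
        self.current = self.states[start]

    def step(self, llm, context):
        # The LLM only ever sees the instruction for the current state,
        # instead of one monolithic task description.
        output = llm(self.current.instruction, context)  # llm: assumed callable
        for condition, next_name in self.current.transitions.items():
            if condition(output):
                self.current = self.states[next_name]
                break
        return output
```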
Another interesting innovation in the Agents framework is a new method for dispatching multiple LLM agents in a multi-agent system. Previous multi-agent frameworks generally use hard-coded rules (fixed or random order) to decide which agent acts next, while we use a "controller agent" to dynamically decide which agent should act based on the environment, the dialogue history, and each agent's goal.
What's the role of reasoning and methods like chain-of-thought (CoT) in enabling autonomous agents? Do we need any major research breakthroughs in this area?
As mentioned in the previous answer, I think CoT on its own is not able to build truly usable autonomous agents, because it is not controllable and is also hard to debug or refine if one of the intermediate reasoning steps goes wrong. I believe integrating symbolic methods (e.g., the SOP system in our framework) will be helpful on the path toward AGI agents.
Tool learning is another key capability included in the Agents architecture. What techniques have you seen in this area, and how do you see the balance between retrieval-augmented generation and tool augmentation?
Yeah, I agree that tool usage is a very important feature. Recent work such as Toolformer and Gorilla represents very interesting directions in tool learning. Personally, I still believe that crowdsourcing high-quality data reflecting how and when real humans use different kinds of tools is the key to building LLM agents that use tools really cleverly.
As for RAG versus tool augmentation, I think RAG can be regarded as a subset of tool augmentation, because retrieving from a knowledge base can be seen as a kind of tool/API, similar to a search engine. That's why we integrate RAG from an external knowledge base as a sub-class of ToolComponent in our framework.
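A rough sketch of this "retrieval is just another tool" view. The framework does treat KB retrieval as a sub-class of ToolComponent, but the class body and the knowledge-base interface below are illustrative assumptions:

```python
# Illustrative sketch: RAG as a tool. Retrieving from a knowledge base
# is an API call, conceptually no different from hitting a search engine.
# (Interfaces here are assumed, not the framework's exact code.)

class ToolComponent:
    def func(self, query: str) -> str:
        raise NotImplementedError

class KBRetrievalTool(ToolComponent):
    def __init__(self, knowledge_base):
        # knowledge_base: assumed to expose search(query, top_k) -> docs with .text
        self.kb = knowledge_base

    def func(self, query: str) -> str:
        docs = self.kb.search(query, top_k=3)
        return "\n".join(d.text for d in docs)
```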
The Agents architecture introduces the unique notion of a Standard Operating Procedure, or SOP. Why is this required, and can you illustrate its role with a few scenarios?
As mentioned above, SOPs are required to make multi-agent systems, or LLM agents, more controllable and easier for developers to tune. For example, when building a software company with LLM agents using existing frameworks such as ChatDev, we sometimes find that a software engineer agent performs poorly when reviewing others' code. This is very hard for developers to fix: adding overly detailed code-review instructions to the software engineer agent's system prompt degrades its general performance considerably, while instructions that are not detailed enough are often overlooked. With the SOP system, however, we can define code review as a separate state and provide very detailed instructions/demonstrations for it, and those instructions are only used in the code-review state, never in other states (a rough sketch follows below).
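Reusing the illustrative State class from the earlier sketch, the scenario might look like this. The state names, the "APPROVED" convention, and the instruction text are all hypothetical:

```python
# Code review becomes its own state, so the detailed reviewing
# instructions are only injected there and never degrade the
# engineer agent's behavior in other states.
code_review = State(
    name="code_review",
    instruction=(
        "You are reviewing a teammate's code. Check correctness, "
        "edge cases, naming, and test coverage. For each issue, "
        "quote the offending line and propose a concrete fix. "
        "If there are no issues, answer APPROVED."
    ),
    transitions={
        lambda out: "APPROVED" in out: "write_tests",      # hypothetical next state
        lambda out: "APPROVED" not in out: "revise_code",  # hypothetical next state
    },
)
```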
One of the areas that I find fascinating about the paper is the emphasis on multi-agent communication. How does an agent decide it needs to collaborate with other agents? Which messaging protocols have been effective in this area?
In our framework, communication between agents is controlled by a carefully designed "controller agent". Unlike previous multi-agent frameworks, which generally use hard-coded rules (fixed or random order) to decide which agent acts next, the controller agent dynamically decides which agent should act based on the environment, the dialogue history, and each agent's goal. This allows LLM agents to "cleverly" decide when to act and how to interact with other LLM agents.
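A minimal sketch of such controller-based dispatch. This is an assumed implementation, not the framework's actual one; the prompt wording and the agent attributes (name, goal) are illustrative:

```python
# Controller-agent dispatch: an LLM picks the next speaker
# instead of a fixed or random turn order.

def controller_select(llm, agents, environment, history):
    roster = "\n".join(f"- {a.name}: {a.goal}" for a in agents)
    prompt = (
        f"Environment: {environment}\n"
        f"Dialogue so far:\n{history}\n"
        f"Agents and their goals:\n{roster}\n"
        "Which agent should act next? Answer with the agent's name only."
    )
    choice = llm(prompt).strip()  # llm: assumed callable returning a string
    # Fall back to the first agent if the controller's answer doesn't match.
    return next((a for a in agents if a.name == choice), agents[0])
```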
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of research outside of generative AI?
AI for Science. I believe the power of AI can help advance other disciplines.
What are some major research breakthroughs that you think can unlock the true value of AI agent solutions?
I think prompt optimization methods (e.g., "Large Language Models as Optimizers") are very promising because they allow agents to evolve without having to be re-trained.
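A hedged sketch of that idea: an optimizer LLM rewrites the prompt based on how previous prompts scored, so the agent improves with no gradient updates. Function names and the scoring interface are assumptions, loosely following the cited paper:

```python
# OPRO-style prompt optimization sketch ("Large Language Models as
# Optimizers"): no re-training, only prompt search guided by scores.

def optimize_prompt(optimizer_llm, score_fn, seed_prompt, steps=10):
    history = [(seed_prompt, score_fn(seed_prompt))]
    for _ in range(steps):
        # Show previous prompts sorted by score, low to high.
        trajectory = "\n".join(
            f"prompt: {p!r} -> score: {s:.2f}"
            for p, s in sorted(history, key=lambda x: x[1])
        )
        candidate = optimizer_llm(
            "Here are prompts and their task scores, low to high:\n"
            f"{trajectory}\n"
            "Write a new prompt likely to score higher."
        )
        history.append((candidate, score_fn(candidate)))
    return max(history, key=lambda x: x[1])[0]  # best prompt found
```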
What is the biggest challenge that can cause agent models to fail as an automation paradigm?
I think controllability and consistency issues are the biggest challenges; that's why we designed the "symbolic control with SOPs" paradigm.
Your work seems to strongly support open-source distribution models. How do you perceive the balance between open-source and closed foundation models? Who ultimately prevails?
AIWaves' Agents framework currently supports open-source models. However, the cost of deploying open-source models is too high for most developers and users, which makes calling OpenAI's or Google's APIs the de facto choice. I think this situation is not solely about whether a model is open-sourced, but more about whether some company serves the model in a centralized way so that users can call its API easily. Companies can serve open-source models and still make a profit. I therefore believe we should have open-source models served in a centralized way by tech companies: normal users can call the APIs directly, while advanced users can deploy the same model on their own computational resources.
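In practice, several hosts already expose open-source models behind OpenAI-compatible endpoints, which makes this pattern a one-line switch. A minimal sketch using the openai Python client; the base URL and model name below are placeholders, not real endpoints:

```python
# "Open model, centralized serving": point the standard OpenAI client
# at a provider that hosts an open-source model behind a compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-host.com/v1",  # hypothetical provider URL
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="some-open-source-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```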