The Sequence Chat: Yohei Nakajima on Creating BabyAGI, Autonomous Agents and Investing in Generative AI

The creator of one of the most popular open source generative AI projects shares his views about AI tech, investing and the future.

Mar 06, 2024

Quick bio

Please tell us a bit about yourself: your background, current role, and how you got started in AI.

My name is Yohei. I was born in Japan, raised in Seattle, and went to college in California. I’ve been working with startups my whole career, initially on the community side (in LA), and on the investing side for over a decade, starting with helping launch the Disney Accelerator. I became the first Director of Pipeline for Techstars before joining Scrum Ventures. This led me to start my own VC firm, Untapped Capital. Specific to AI, I’d played with a few APIs back when I was at Techstars, but this recent deep dive started in August of ’22, a few months before ChatGPT. I’ve always been a build-to-learn kind of guy, and have done this across no code, web3, and more. Doing this publicly (build-in-public) is how I accelerate my learning, while also connecting and collaborating with founders.

🛠 AI Work

You are widely recognized in the generative AI community as the creator of BabyAGI. Could you share the vision and inspiration behind the project?

BabyAGI was project number 70 or so in a series of experiments and prototypes I’ve built with AI. The inspiration for this project was HustleGPT, where people were using ChatGPT as a cofounder and doing whatever it told them to do. I wanted to experiment with taking the human element out of this and embarked on a weekend challenge to prototype an autonomous startup founder. When I shared a demo video online, people were quick to identify that this framework could be used for more, which is where it got the nickname BabyAGI (from my friend Jenny).

Planning and reasoning are among the most fascinating capabilities of BabyAGI. Similar to how reinforcement learning from human feedback (RLHF) laid the groundwork for GPT-3.5/ChatGPT, planning and reasoning have the potential to advance the next generation of Large Language Models (LLMs). Could you share some of the best practices you've adopted to enhance reasoning within LLMs, as well as any new research in this area that you find particularly compelling?

I’ve tried a couple of things, but what’s stood out to me in my experiments is the ability to learn over time. In the most recent modification of BabyAGI, every task list is analyzed alongside its output to generate a “reflection” of sorts that we store alongside the objective and task list. Anytime we run a new objective, we do a vector search to find the most similar past objectives, pull in their reflection notes, and write a pre-reflection note based on them that gets fed into the task list generator. On a small scale, this has worked in giving BabyAGI the ability to create better task lists over time, even with the same objective. What I like is that this mimics our ability to improve through repetition, and the same approach could be used to generate code, which is more on the execution side.
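The reflection loop described above can be sketched roughly as follows. This is a minimal illustration, not BabyAGI's actual code: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the names `Run`, `ReflectionMemory`, `store`, `most_similar`, and `pre_reflection` are hypothetical, and the pre-reflection note is assembled by simple concatenation where the real system would presumably ask an LLM to write it.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Run:
    objective: str
    task_list: List[str]
    reflection: str  # note written after comparing the task list with its output


@dataclass
class ReflectionMemory:
    runs: List[Run] = field(default_factory=list)

    def store(self, objective: str, task_list: List[str], reflection: str) -> None:
        """Save a finished run together with its reflection note."""
        self.runs.append(Run(objective, task_list, reflection))

    def most_similar(self, objective: str, k: int = 2) -> List[Run]:
        """Return the k past runs whose objectives are closest to the new one."""
        query = embed(objective)
        ranked = sorted(
            self.runs,
            key=lambda run: cosine(query, embed(run.objective)),
            reverse=True,
        )
        return ranked[:k]

    def pre_reflection(self, objective: str, k: int = 2) -> str:
        """Build the note that would be fed into the task list generator."""
        notes = [run.reflection for run in self.most_similar(objective, k)]
        return "Lessons from similar past objectives:\n" + "\n".join(
            f"- {note}" for note in notes
        )


memory = ReflectionMemory()
memory.store("research competitors for a coffee startup",
             ["search the web", "summarize findings"],
             "Check pricing pages first.")
memory.store("plan a birthday party",
             ["pick a venue", "send invites"],
             "Confirm the guest count before booking.")
memory.store("research competitors for a tea startup",
             ["search the web", "build a comparison table"],
             "Limit the list to five competitors.")

print(memory.pre_reflection("research competitors for a matcha startup"))
```

With these three stored runs, the note for the matcha objective draws on the two competitor-research runs and leaves out the birthday party one, which is the behavior the interview describes: similar past objectives shape the next task list.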

Autonomous agents are seen as the cornerstone of automation in the generative AI era. Despite significant advancements, they remain a challenging problem. What do you consider the biggest challenges and obstacles to achieving widespread adoption of autonomous agents?

Autonomous agents, especially general ones, are best suited for edge cases. For organizations, the most valuable workflows to automate are workflows that happen repeatedly, meaning there is no need for an agent to generate the task list. The reality today is that there is a ton of value organizations can gain from automation tools like Zapier, even without the use of AI. Reflecting on what I’ve seen here, I suspect the biggest obstacle to achieving widespread adoption is change in human behavior, which compounds in a complex organization with multiple stakeholders with varying incentives.

Action execution stands out as a critical capability of BabyAGI and autonomous agents at large, made particularly complex by the stochastic nature of LLMs. How do you view the development of trends such as Retrieval-Augmented Generation (RAG) or large action models in relation to task execution in autonomous agents?

RAG is a great way to get context, but simply looking at documents is just scratching the surface. Ultimately, we’ll want our agents to be able to RAG against all human knowledge, against their previous runs, their own code, etc. More challenging today is giving AI access to the tools it needs to execute tasks (calendar, messaging, etc.), as it requires managing and storing the user’s authentication methods, understanding how the tool is used and how the AI can use it (API or browser), and in some cases adapting RAG techniques to match the data structure. One approach is building these integrations one at a time, which is more stable and so can get to market quicker - but I believe the goal should be building a system that can teach itself how to use new tools.
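To make the "one integration at a time" approach concrete, here is a small sketch of a hand-built tool registry. Everything in it is hypothetical: the `Tool` and `ToolRegistry` names, the token-based notion of authentication, and the `fake_calendar` function (which stands in for a real calendar API call) are illustrative assumptions rather than anything from BabyAGI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Tool:
    name: str
    description: str      # text an LLM could read to decide when to use the tool
    requires_auth: bool   # whether the user must connect an account first
    run: Callable[[dict, Optional[str]], str]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}
        self._tokens: Dict[str, str] = {}  # per-tool credentials supplied by the user

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def connect(self, tool_name: str, token: str) -> None:
        """Store the user's credential for one tool."""
        self._tokens[tool_name] = token

    def describe(self) -> str:
        """List the tools in a form that could be placed in a prompt."""
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def execute(self, tool_name: str, args: dict) -> str:
        """Run a tool, returning an error string instead of raising."""
        tool = self._tools.get(tool_name)
        if tool is None:
            return f"error: unknown tool '{tool_name}'"
        token = self._tokens.get(tool_name)
        if tool.requires_auth and token is None:
            return f"error: '{tool_name}' needs the user to connect their account"
        return tool.run(args, token)


def fake_calendar(args: dict, token: Optional[str]) -> str:
    # Placeholder for a real calendar API call.
    return f"created event '{args['title']}' at {args['time']}"


registry = ToolRegistry()
registry.register(Tool("calendar", "Create calendar events.", True, fake_calendar))
print(registry.execute("calendar", {"title": "Demo", "time": "10:00"}))
registry.connect("calendar", "user-token")
print(registry.execute("calendar", {"title": "Demo", "time": "10:00"}))
```

Returning error strings rather than raising is a deliberate choice in this sketch: an agent can read the message and decide what to do next, such as asking the user to connect an account. A system that teaches itself new tools, the goal named above, would need to generate entries like these on its own.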

A somewhat controversial question: How do you view the boundaries between the internal and augmented capabilities of LLMs? In other words, should capabilities such as Retrieval-Augmented Generation (RAG) or tool integration remain external, or become integrated into LLMs themselves?

Candidly, this is far outside my area of expertise, but based on my observations, it does seem like the rapid experimentation on the orchestration side is slowly being embedded into the models themselves. You can almost imagine an MoE approach of three experts in a loop like BabyAGI. That being said, I’m unsure if things like RAG or tool usage (engaging with things outside the model) can be done from the model natively… unless the model has a code sandbox within it…? Unsure. Regardless, it does feel like the effort in building better orchestration will help models improve, so I think it’s not wasted effort to experiment and explore newer and better orchestration methods.
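For readers unfamiliar with the loop being referenced, here is a rough sketch of a three-role orchestration loop. It assumes the three roles are executing a task, creating follow-up tasks, and reprioritizing the list; the interview itself only says "three experts in a loop". The function name `run_loop` and the stub agents are hypothetical, and in a real system each stub would be an LLM call.

```python
from collections import deque
from typing import Callable, List, Tuple


def run_loop(
    objective: str,
    first_task: str,
    execute: Callable[[str, str], str],
    create_tasks: Callable[[str, str, str, List[str]], List[str]],
    prioritize: Callable[[str, List[str]], List[str]],
    max_steps: int = 5,
) -> List[Tuple[str, str]]:
    """Run execute -> create -> prioritize until the list is empty or max_steps is hit."""
    tasks = deque([first_task])
    results: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        result = execute(objective, task)                       # role 1: do the task
        results.append((task, result))
        tasks.extend(create_tasks(objective, task, result, list(tasks)))  # role 2: add tasks
        tasks = deque(prioritize(objective, list(tasks)))       # role 3: reorder the list
    return results


# Stub "experts" standing in for LLM calls.
def execute(objective: str, task: str) -> str:
    return f"done: {task}"


def create_tasks(objective: str, task: str, result: str, pending: List[str]) -> List[str]:
    return ["write summary"] if task == "research market" else []


def prioritize(objective: str, pending: List[str]) -> List[str]:
    return sorted(pending)


for task, result in run_loop("launch product", "research market",
                             execute, create_tasks, prioritize):
    print(task, "->", result)
```

The `max_steps` cap is there because a loop like this has no natural stopping point if the task creator keeps producing work.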

Besides your contributions to BabyAGI, you also work as a venture capitalist. Apart from well-known areas like LLMs or GPU providers, what generative AI trends do you believe will capture significant value in the coming years?

(1) AI everywhere - we’ll see bits and pieces of AI across all apps and businesses, regardless of whether they are an AI company - similar to how most companies store data in the cloud without being a “cloud company”. (2) Passive AI - with costs going down, we’ll see an increasing amount of AI just running in the background, structuring and summarizing data, generating insights, etc. (3) AI workers - so many people around the world are spending a lot of time doing tasks that don’t require people. We’ll see lots of workflows being automated over the next decade. (4) Smaller/local/fine-tuned models - it’s still early days, but much like we went from general to personalized ads on the web, I suspect we’ll slowly start engaging with various models that are fine-tuned for us specifically, and running on our phones, etc.

You've recently been involved in another open-source project, Instagraph, which focuses on implementing knowledge graphs. What defines a high-quality knowledge graph, and why is it crucial for applications involving autonomous agents?

Candidly, I’m new to knowledge graphs so I can’t speak to “high quality” knowledge graphs. I’ve had plenty of feedback that deduping is hard (it is), but in early RAG experiments on knowledge graphs, I’ve found that it can still work with imperfect deduping. I’m curious about this approach because the data structure feels closer to how our brain is wired, so it intuitively feels like the right way to do RAG. As we (humans) experience life, we’re constantly processing and restructuring information in our minds for more efficient recall and storage, so it makes sense to me that AI would benefit from the same type of activity. I think RAG techniques against knowledge graphs, while there are some early examples, are still underexplored.
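As a toy illustration of retrieval against a knowledge graph with imperfect deduping, the sketch below stores facts as (subject, relation, object) triples and merges node names only by case and whitespace. It is not Instagraph's code: the `KnowledgeGraph` class, the `normalize` rule, and the substring-based `retrieve` method are all hypothetical simplifications, and a real system would likely use embeddings or an LLM for both deduping and matching.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Triple = Tuple[str, str, str]


def normalize(name: str) -> str:
    """Cheap, imperfect dedup: case-fold and collapse whitespace only.

    "BabyAGI" and " babyagi" merge, but "LLM" and "large language model"
    would remain separate nodes.
    """
    return " ".join(name.lower().split())


class KnowledgeGraph:
    def __init__(self) -> None:
        # Index every triple under both of its endpoint nodes.
        self.by_node: DefaultDict[str, Set[Triple]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        triple = (normalize(subject), relation, normalize(obj))
        self.by_node[triple[0]].add(triple)
        self.by_node[triple[2]].add(triple)

    def retrieve(self, question: str) -> List[str]:
        """Return facts touching any node whose name appears in the question."""
        text = normalize(question)
        hits: Set[Triple] = set()
        for node, triples in self.by_node.items():
            if node in text:
                hits |= triples
        return sorted(f"{s} {r} {o}" for s, r, o in hits)


kg = KnowledgeGraph()
kg.add("BabyAGI", "uses", "vector search")
kg.add(" babyagi", "created by", "Yohei Nakajima")  # duplicate spelling, merged
kg.add("Instagraph", "builds", "knowledge graphs")

for fact in kg.retrieve("What does BabyAGI use?"):
    print(fact)
```

The retrieved facts would then be placed in the model's context, the same way document chunks are in ordinary RAG. Duplicates that the normalizer misses simply show up as separate nodes with fewer facts attached, which is one way to read the observation that retrieval can still work with imperfect deduping.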

💥 Miscellaneous – a set of rapid-fire questions

Looking three to five years into the future, how do you envision the role of autonomous agents in enhancing the productivity of knowledge workers?

I suspect we’ll see lots of workflows that have been replaced with AI, new problems that arise from this, and new roles to solve these problems. That being said, rollout won’t be immediate, as we’ll see early adopters implement this, run into challenges, and solve them before late adopters start experimenting. This happens in stages, with experiments/replacements starting small and getting larger, and at varying speeds across different industries. In 3-5 years, I’d expect a good number of forward-thinking organizations to have a handful of AI workers who are capable of handling some tasks/workflows being done by humans today.

Do you think transformers combined with massive computing power are sufficient to pave the way to Artificial General Intelligence (AGI), or do you believe new architectures are necessary?

This one is also outside of my area of expertise but for true AGI (depending on your definition here), it seems like we’d want a model that can process multiple inputs in different modalities in parallel (audio, visual, etc) and also stream parallel outputs across modalities (audio, text, movement) at the same time. My guess is this requires some new architecture beyond what we have today.

How do you view the balance between open-source and proprietary foundational models?

Human beings intrinsically have both self-serving and altruistic motivations, both from an evolutionary history of survival that includes wars and tribes. In my view, the balance of open source and proprietary models reflects this duality within us, and we’ll continue to see this balance ebb and flow based on a multitude of factors from culture to economic results.

Who is your favorite mathematician or computer scientist, and why?

Da Vinci, cuz he understood the benefit of exploring the same idea through various modalities (image, text, math, etc.).
