The Sequence Chat #814: Z.ai's Zixuan Li Talks About GLM
We are back with our interview series and a very special guest today! Zixuan Li has been at the center of China's open-source AI revolution. We discuss the GLM models, Chinese open source, and more.
Background
1. Can you introduce yourself and tell us about your journey from academia to Zhipu AI?
I’m Zixuan Li, Head of Z.ai Global Ecosystem, responsible for Z.ai Chat, Z.ai API services, global partnerships, and global branding.
It was an easy choice for me. I’m not a pure academic person. I’ve done entrepreneurship before, worked at large tech firms, and hold CFA and PMP certifications. I love challenges. When I joined Z.ai, I saw fascinating challenges that I had a chance to conquer: product development, partnership building, commercialization, and establishing a global brand. The opportunity to build something from the ground up on a global scale was too compelling to pass up.
The Main Sequence
Let’s start with the genesis of the General Language Model (GLM). Unlike many projects that simply forked existing architectures, GLM emerged with a distinct technical philosophy regarding pre-training objectives. Can you walk us through the original hypothesis behind GLM? What was the specific gap you saw in the landscape back then, and how has that initial DNA influenced the models you are shipping today?
The original hypothesis behind GLM was that the dichotomy between autoencoding (BERT-style) and autoregressive (GPT-style) models was a false choice. We believed a unified framework could capture the best of both worlds: strong bidirectional understanding and powerful generation capabilities.
Back then, the landscape was fragmented. You had to choose between models good at understanding versus models good at generation. GLM’s autoregressive blank infilling objective was designed to bridge that gap. This DNA still influences our models today. We continue to prioritize versatility and multi-task capability rather than optimizing for a single benchmark or use case.
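The blank infilling objective described above can be sketched in a few lines. This is a toy illustration of the general idea, not GLM's actual preprocessing code; the function name and special tokens are placeholders chosen for clarity. A span is masked out of the input (which the model sees bidirectionally, BERT-style), and the model is trained to regenerate the span token by token (GPT-style), conditioned on the masked context.

```python
def blank_infill_example(tokens, span_start, span_len,
                         mask="[MASK]", sep="[S]", end="[E]"):
    """Toy sketch of autoregressive blank infilling:
    Part A is the input with a span replaced by a mask token
    (attended bidirectionally); Part B is the span itself,
    regenerated autoregressively, conditioned on Part A."""
    span = tokens[span_start:span_start + span_len]
    part_a = tokens[:span_start] + [mask] + tokens[span_start + span_len:]
    part_b = [sep] + span            # decoder input, shifted right
    targets = span + [end]           # tokens to predict one at a time
    return part_a, part_b, targets

tokens = ["GLM", "unifies", "understanding", "and", "generation"]
part_a, part_b, targets = blank_infill_example(tokens, 1, 2)
# part_a:  ['GLM', '[MASK]', 'and', 'generation']
# part_b:  ['[S]', 'unifies', 'understanding']
# targets: ['unifies', 'understanding', '[E]']
```

The single objective covers both regimes: short spans train understanding of context, while masking a long trailing span recovers ordinary left-to-right generation.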
The original GLM-130B was distinct for its ‘autoregressive blank infilling’ objective, differing from the standard causal decoder-only architecture of GPT-3. With GLM-4 and now GLM-4.7, have you maintained that unique architectural lineage, or has the industry converged on a single ‘best practice’ architecture? What specific architectural bet is Z.ai still making that differs from the West?
The GLM model architecture today has evolved significantly from its previous versions. We cannot comment extensively on Western models since most are closed-source. We simply cannot see their architectures.
Throughout our development process, we continuously absorb industry best practices while making our own innovations. The key point is to keep identifying new critical problems and solving them. For us, there is no unified “best practice.” The field is moving too fast, and what works best depends heavily on the specific challenges you’re trying to solve.
With GLM-4.5, you moved heavily into Mixture-of-Experts (MoE) architectures, activating only ~32B parameters out of hundreds of billions per token. Given the compute constraints often discussed in the Chinese ecosystem, is MoE purely an efficiency play for you, or do you see it as a superior path to reasoning compared to dense models?
We see MoE as a superior path to reasoning, not merely an efficiency optimization. The sparse activation pattern in MoE architectures allows different experts to specialize in different types of knowledge and reasoning patterns. This specialization leads to more nuanced and accurate responses across diverse domains.
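The sparse activation pattern Li describes boils down to a learned router that sends each token to a small subset of experts. Below is a minimal top-k routing sketch in NumPy, assuming nothing about GLM's actual router; the shapes, gate, and `top_k_route` helper are illustrative only.

```python
import numpy as np

def top_k_route(x, gate_w, k=2):
    """Toy top-k MoE router: score every expert for this token,
    keep only the k best, and renormalize their gate weights so
    just k experts run (sparse activation)."""
    logits = x @ gate_w                     # one score per expert
    top = np.argsort(logits)[-k:][::-1]     # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over selected experts only
    return top, weights

rng = np.random.default_rng(0)
d, num_experts = 8, 4
x = rng.normal(size=d)                          # one token's hidden state
gate_w = rng.normal(size=(d, num_experts))      # router weights
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]

idx, w = top_k_route(x, gate_w, k=2)
# Only the 2 selected experts compute; the other 2 stay idle.
y = sum(wi * (x @ experts[i]) for i, wi in zip(idx, w))
```

Because different tokens route to different experts, total capacity can grow far beyond the per-token compute budget, which is what allows a model with hundreds of billions of parameters to activate only ~32B per token.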
Zhipu has been a prolific contributor to open source (e.g., GLM-4-9B), often called the ‘Llama of China.’ In a market where competitors are increasingly closing their frontiers to protect IP, why does Z.ai continue to release weights? Is this a strategy to build a developer moat that US sanctions or competitors can’t breach?
We want to grow the pie before taking a slice of it. With open source, we aim to achieve three things:
Improve accessibility: You can download the model yourself or use it from various inference providers. This lowers barriers for developers and researchers worldwide.
Enable ecosystem innovation: You can build your own models on top of the GLM series. Quantize, finetune, extend. We’ve seen remarkable work extending our original capabilities, like Intellect-3. The community often takes our models in directions we hadn’t anticipated.
Shape standards: If we’re fortunate enough, we might help set some norms and standards for open models. That’s something we simply cannot achieve with closed-source models alone.
Critically, open-sourcing does not cannibalize our business. Demand for GLM has exceeded supply. The whole world now lacks enough compute to run all the GLM deployments people want. Open source builds trust, expands the ecosystem, and ultimately drives more enterprise customers to our managed services.
We are seeing a shift from ‘Chat’ to ‘Action’ with models like AutoGLM that can navigate phone UIs. What are the unique technical bottlenecks in building agents that control local devices versus cloud agents? Is the latency of current LLMs the primary blocker for a truly fluid ‘Her’ OS experience, or is it an issue of error-correction?
To assess the product-market fit of device agents, the critical point is not to compare local versus cloud solutions. It’s to compare the agent’s operation with human operation. Currently, we see three main obstacles:
Speed of operations: Agents must match or exceed the speed at which humans navigate interfaces. A 2-second delay per action compounds quickly into an unusable experience.
Error recovery and robustness: Humans are incredibly good at recovering from small mistakes. We barely notice when we misclick and correct. Agents need this same resilience. A single error that derails an entire workflow breaks user trust immediately.
Context persistence across sessions: Humans remember what we were doing yesterday. We pick up tasks seamlessly. Agents need similar long-term context awareness to feel truly integrated into daily life.
Latency is important, but I’d argue error-correction and graceful degradation are the bigger blockers today. Users can tolerate a slightly slower agent that reliably completes tasks. They cannot tolerate a fast agent that fails unpredictably.
With the GLM-4V series, you are pushing hard into visual understanding. Do you subscribe to the view that AGI cannot be achieved through text alone? Specifically, is visual data necessary to teach a model ‘physics’ and cause-and-effect, or can that be learned through text reasoning?
Vision capabilities are still essential in many scenarios. For example:
Physical world understanding: Text can describe that “water flows downhill,” but video data showing fluid dynamics teaches intuitive physics in ways text descriptions cannot fully capture.
Real-world grounding: Many real-world applications, such as medical imaging, autonomous driving, and industrial inspection, simply cannot function without visual input.
That said, I don’t think the question is binary. Text contains vast amounts of implicit physical knowledge encoded in how humans describe the world. The most capable systems will likely combine both, using text to provide abstract reasoning frameworks and visual data to ground those frameworks in physical reality. The question isn’t “text or vision” but “how do we best integrate multiple modalities for richer understanding?”
The Chinese market is fiercely competitive with the ‘Four Tigers’ (Zhipu, Moonshot, MiniMax, DeepSeek). If you had to describe Z.ai’s ‘soul’ compared to the others—is your differentiation in the model quality, the full-stack ecosystem (model + cloud + hardware), or your B2B enterprise entrenchment?
Z.ai is relatively more mature in “service” and more committed to the model-as-a-service (MaaS) philosophy. We don’t just provide models. We’ve developed a mature enterprise and individual service system.
This means we not only serve customers on our own API platform but also assist other providers in deploying GLM models correctly and efficiently. We help with optimization, integration, and ongoing support. That comprehensive service approach has driven our commercial success across both B2C and B2B segments.
Our soul, if I had to distill it: we’re builders who serve builders. We care deeply about developer experience, enterprise reliability, and the entire journey from model to production application.
Tell us about your latest release. What makes GLM-5 unique?
GLM-5 scales from 355B total parameters in GLM-4.5 to 744B total parameters, with 40B active per token through a Mixture-of-Experts architecture. To keep deployment practical at this size, it integrates DeepSeek Sparse Attention for the first time, maintaining long-context capacity while significantly reducing memory and compute costs. The result is a model designed around intelligence efficiency, not raw parameter count.
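The memory savings come from each query attending to only a small, high-scoring subset of the context rather than every position. The sketch below shows the generic top-k idea only; DeepSeek Sparse Attention uses a trained lightweight indexer rather than the raw scores used here, and `topk_sparse_attention` is a name invented for this example.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Toy top-k sparse attention for one query: coarsely score all
    n keys, then run full softmax attention over only the k
    highest-scoring positions, cutting per-query cost from O(n) to O(k)."""
    scores = K @ q / np.sqrt(q.shape[0])   # coarse relevance scores
    keep = np.argsort(scores)[-k:]         # the k most relevant positions
    s = scores[keep]
    p = np.exp(s - s.max())
    p /= p.sum()                           # softmax over selected keys only
    return p @ V[keep]                     # weighted sum of k value vectors

rng = np.random.default_rng(1)
n, d = 64, 16                              # 64-token context, small head dim
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = topk_sparse_attention(q, K, V, k=4)  # attends to 4 of 64 positions
```

At long contexts the selected-key fraction shrinks further, which is why sparse attention keeps a 744B-parameter model's long-context serving costs manageable.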
Beyond scaling, GLM-5 uses "slime," a custom asynchronous reinforcement learning infrastructure built to overcome throughput bottlenecks that make RL at frontier scale difficult. By decoupling data generation from gradient updates, slime enables faster, more iterative post-training cycles. This is a meaningful technical contribution that gives the GLM team durable infrastructure for continued improvement in future model generations.
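slime's internals aren't detailed here, but the decoupling it describes is the classic producer/consumer pattern: rollout generation fills a buffer while the learner drains it, so neither side blocks on the other. A minimal single-process sketch under that assumption (all names and the fake trajectory payload are illustrative):

```python
import queue
import threading

def rollout_worker(buf, n):
    """Generation side: produce trajectories continuously,
    never waiting for the learner's gradient step."""
    for i in range(n):
        buf.put({"traj": i, "reward": i % 3})  # stand-in for a sampled rollout

def learner(buf, steps, batch_size=4):
    """Training side: consume whatever rollouts are ready and update.
    Keeping both loops busy is what raises RL throughput."""
    updates = 0
    while updates < steps:
        batch = [buf.get() for _ in range(batch_size)]
        updates += 1                           # stand-in for a gradient update
    return updates

buf = queue.Queue(maxsize=16)                  # bounded buffer between the loops
w = threading.Thread(target=rollout_worker, args=(buf, 32))
w.start()
done = learner(buf, steps=8)
w.join()
```

In a real system the two sides run on separate accelerator pools, and the learner must also handle the slight off-policy staleness that asynchrony introduces; the sketch only shows the throughput-side decoupling.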
GLM-5 is purpose-built for long-horizon task execution, not just code generation. On Vending Bench 2, which simulates running a business autonomously over a full year, it ranked first among open-source models and close to Claude Opus 4.5, demonstrating sustained goal alignment and resource planning. For product builders, this means GLM-5 can handle entire engineering pipelines rather than isolated subtasks.
Miscellaneous
There is a popular meme that the AI race is ‘Chinese engineers in China vs. Chinese engineers in the US.’ As someone deep in the ecosystem, are you seeing a ‘reverse brain drain’ where talent is returning to Beijing to work on GLM, or is the flow still predominantly outwards? What attracts a researcher to Z.ai over a Silicon Valley lab today?
I won’t comment on other labs in China. But for Z.ai, we don’t really face this “China vs. Silicon Valley” framing. Most of our researchers are Z.ai-originated. They started their research journey here, growing with the company from early days.
What attracts talent to us? I’d say it’s the combination of cutting-edge research opportunity, the pace of iteration, and the chance to see your work deployed at massive scale very quickly. At Z.ai, the distance from research idea to production deployment is remarkably short. For researchers who want to see their work matter in the real world, that’s compelling.
There is a critical debate on whether the Transformer architecture is the final vehicle that takes us to AGI. Do you believe that scaling the current Transformer paradigm—optimizing data, compute, and MoE structures—is sufficient to reach General Intelligence, or will we hit a fundamental ceiling that requires a completely new architectural breakthrough to bridge the final mile?
It’s hard to give a precise definition of AGI. The goalposts keep moving as capabilities advance. But I believe the current Transformer architecture has a very high ceiling, higher than most people expect.
Here’s a perspective that’s often overlooked: most of the data patterns we’ll see in 2027 and 2028 haven’t even been created yet. Human knowledge, content, and interaction patterns are continuously expanding. The models of tomorrow will be trained on data that doesn’t exist today. That’s a powerful tailwind for continued scaling.
Will we eventually need architectural breakthroughs? Probably. But I suspect we’re nowhere near the ceiling of what’s possible with Transformers. We’re still in the early innings of understanding how to fully leverage this architecture.
Give us one prediction for the AI landscape in 2026 that 90% of our readers would likely disagree with.
Per-token pricing may no longer be the mainstream business model. LLMs will be charged for the value they create.
Think about it: we don’t pay for electricity based on how many electrons flow through our devices. We pay for what those electrons enable us to do. Similarly, as AI becomes more agentic and outcome-oriented, pricing will shift from input metrics (tokens) to output metrics (tasks completed, value generated, problems solved).
This is already beginning with subscription models and outcome-based enterprise contracts.

