Edge 410: Learn About Virtual Token Counter: A Novel Method that Addresses One of the Major Challenges in LLM Serving

Created by UC Berkeley and Stanford University, VTC introduces fairness into LLM serving scheduling

Jul 04, 2024

Imagine the following scenarios in an LLM application:

Client A sends requests averaging 4k tokens each.
Client B sends requests averaging 200 tokens each.

Should the requests from both clients be served by the same LLM resources? The answer seems obviously no, since the second client requires far fewer resources than the first. However, today’s LLM infrastructures do not differentiate between the two types of requests. This problem has come to be known as fair serving, and it is the subject of a fascinating paper by a list of rock-star researchers that includes UC Berkeley’s Joseph Gonzalez and Ion Stoica, as well as researchers from Stanford University and Duke University.

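Current LLM serving systems commonly handle incoming requests with a First-Come-First-Serve (FCFS) approach. However, this method is not without its problems: because requests are served strictly in arrival order, a client that submits many large requests can delay every client queued behind it.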
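To make the FCFS problem concrete, here is a minimal Python sketch of a toy queue. It is an illustration only, not the paper's method or any real serving system: it assumes a single worker, a hypothetical fixed cost of 1 ms per token, and a made-up workload in which client A's three 4k-token requests arrive just ahead of client B's three 200-token requests. The function name and parameters are invented for this example.

```python
from collections import deque


def fcfs_finish_times(requests, ms_per_token=1):
    """Serve (client, tokens) requests strictly in arrival order.

    Toy model (assumption): one worker, fixed cost per token, no batching.
    Returns a list of (client, finish_time_ms).
    """
    queue = deque(requests)
    clock_ms = 0
    finished = []
    while queue:
        client, tokens = queue.popleft()
        clock_ms += tokens * ms_per_token  # each request occupies the worker fully
        finished.append((client, clock_ms))
    return finished


# Hypothetical workload: A's large requests arrive just before B's small ones.
requests = [("A", 4000)] * 3 + [("B", 200)] * 3
results = fcfs_finish_times(requests)

# Time at which client B's first request completes.
first_b_finish = next(t for client, t in results if client == "B")
print(first_b_finish)  # 12200
```

In this toy setup, B's first request needs only 200 ms of work but finishes at 12,200 ms because it waits behind all of A's requests. Real serving stacks batch requests and behave differently in detail, but the ordering effect is the one FCFS is criticized for.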

TheSequence
