Edge 376: The Creators of Vicuna and Chatbot Arena Built SGLang for Super Fast LLM Inference
Created by LMSYS, the framework introduces a set of optimizations that can speed up LLM inference by up to 5x.
Chat remains the dominant pattern for interacting with LLMs. While chatting provides an interactive way to invoke LLMs, real applications require far more complex workflows. To cater to this need, several programming systems have been developed. These systems range from high-level libraries with ready-to-use modules to more adaptable pipeline programming frameworks. Additionally, there are languages focused on managing a single prompt, enhancing control over the LLM’s output. However, more integrated approaches that operate at lower levels of the LLM stack might provide a different optimization vector. This is the core thesis behind SGLang, a new open source project from UC Berkeley.
SGLang stands for Structured Generation Language for LLMs. It is designed to streamline interactions with LLMs, making them quicker and more manageable, and it integrates a backend runtime system with a frontend language for better control. SGLang is based on two fundamental components: