Inside SGLang: LMSys' New Framework for Super-Fast LLM Inference

The framework enables core optimizations to build highly sophisticated LLM applications.

Jesus Rodriguez
5 min read · Jan 29, 2024
Created Using DALL-E


Chat remains the main pattern for interacting with LLMs. While chatting provides an interactive way to invoke LLMs, real applications require much more complex workflows. To cater to this need, several programming systems have been developed. These systems range from high-level libraries with ready-to-use modules to more adaptable pipeline programming frameworks. Additionally, there are languages focused on managing a single prompt, enhancing control over the LLM's output. However, more integrated approaches that operate at lower levels of the LLM stack might provide a different optimization vector. This is the core thesis behind SGLang, a new open source project from UC Berkeley.

SGLang stands for Structured Generation Language for LLMs. SGLang is designed to streamline interactions with LLMs, making them quicker and more manageable. It integrates the backend runtime system with frontend languages for better control. SGLang is based on two fundamental components:

I. On the backend, LMSys uses a technique called RadixAttention to efficiently reuse KV cache across multiple LLM generation calls.

II. The frontend features a domain-specific language embedded in Python, which can be operated in either interpreter or compiler mode. These elements together aim to improve the execution and programming efficiency of complex LLM programs.

The Challenges

SGLang tackles a set of known challenges in LLM applications with a fresh approach. There are several areas where LLM programming can be improved:

· Caching: In LLM programs, caching the computed KV cache from previous tokens can minimize repeated calculations when multiple text segments and generation calls are involved.

· Batching: Since LLM decoding is primarily memory-bound, increasing batch sizes can significantly boost throughput. Employing continuous batching techniques is also beneficial.

· Sharing: LLM programs often need to generate multiple outputs from a single prompt or branch out to a new prompt. Developing more sophisticated sharing patterns can enhance efficiency.

· Parallelism: By creating a dependency graph for generation calls within an LLM program, independent calls can be executed simultaneously, enhancing parallelism within the program.

· Compilation: Full programs can be compiled into an optimized intermediate representation for more efficient execution. Aggressive optimizations, such as adjusting prompts based on test cases, can further enhance performance.

These improvements aim to extend the capabilities of LLM applications, enabling them to handle more complex tasks and interactions.

Enter SGLang

LMSys introduces SGLang, a domain-specific language integrated within Python. This language is designed to enhance the way users interact with and control LLM programs. An example SGLang program might assess the quality of an essay from various angles. It uses a state variable to manage and modify the prompt, incorporating the essay, then creates parallel forks that generate judgments along different dimensions. These are eventually merged to form a summary and assign a grade to the essay.

An SGLang program allows developers to combine fundamental primitives in LLM applications as illustrated in the following code:

Image Credit: UC Berkeley
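To make the essay-judging pattern concrete, here is a runnable toy mock, not real SGLang code: the stub `gen` function stands in for generation calls against a live backend (real SGLang programs use decorated functions and primitives such as `gen` and `fork`), and the fork/join structure is simulated with plain Python.

```python
# Toy mock of the fork/gen/join pattern described above.
# In real SGLang, gen() would call a model backend; here it is a stub
# so the control flow can run self-contained.

def gen(prompt: str) -> str:
    """Stub standing in for an LLM generation call."""
    return f"<generated for: {prompt[:30]}...>"

def multi_dimensional_judge(essay: str) -> dict:
    base_prompt = f"Please evaluate the following essay:\n{essay}\n"

    # Fork: judgments along different dimensions are independent,
    # so they could run in parallel.
    dimensions = ["clarity", "originality", "evidence"]
    judgments = {
        dim: gen(base_prompt + f"Judge the essay's {dim}.")
        for dim in dimensions
    }

    # Join: merge the parallel judgments back into one prompt state.
    merged = base_prompt + "Judgments:\n" + "\n".join(judgments.values())
    summary = gen(merged + "\nSummarize the judgments.")
    grade = gen(merged + f"\nSummary: {summary}\nAssign a letter grade.")
    return {"judgments": judgments, "summary": summary, "grade": grade}

result = multi_dimensional_judge("LLMs are changing software development.")
print(sorted(result["judgments"].keys()))
```

The key design point is that the three judgment calls share the same prompt prefix and have no dependencies on each other, which is exactly what SGLang's runtime exploits.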

SGLang comprises several essential elements:

Image Credit: UC Berkeley

Breaking down the components of the architecture:

1) Interpreter

The basic execution of an SGLang program occurs through an interpreter. Here, prompts are treated as asynchronous streams. When primitives like extend, gen, and select are used on a prompt, they are submitted to the stream for asynchronous execution. This method allows the Python code to run concurrently without waiting for the LLM generation to complete, similar to launching asynchronous CUDA kernels.
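One rough way to picture this, using hypothetical names rather than the actual SGLang runtime, is a prompt stream backed by asyncio: primitives enqueue work and return immediately, and results are awaited only when they are actually read back.

```python
import asyncio

# Minimal sketch of "prompt as an asynchronous stream" (illustrative only,
# not SGLang's actual implementation).

class PromptStream:
    def __init__(self):
        self.text = ""
        self._pending = []          # futures for in-flight generations

    def extend(self, chunk: str):
        self.text += chunk          # returns immediately

    def gen(self):
        # Submit a generation asynchronously; Python keeps running.
        task = asyncio.create_task(self._fake_llm(self.text))
        self._pending.append(task)

    async def _fake_llm(self, prompt: str) -> str:
        await asyncio.sleep(0.01)   # stands in for model latency
        return f"[{len(prompt)} chars generated]"

    async def result(self):
        # Synchronize only when the output is needed, much like reading
        # back results after launching asynchronous CUDA kernels.
        return [await t for t in self._pending]

async def main():
    s = PromptStream()
    s.extend("Question: what is SGLang? ")
    s.gen()                         # non-blocking
    s.extend("Follow-up: why is it fast? ")
    s.gen()                         # non-blocking
    return await s.result()

outputs = asyncio.run(main())
print(outputs)
```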

2) Compiler

SGLang programs can also be compiled into computational graphs, enabling more sophisticated optimization opportunities. This process involves tracing SGLang structures and operations as a graph, with nodes representing primitive operators and edges for dependencies. Each function call or fork in a program creates a new prompt state or stream.
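The graph idea can be illustrated with a small sketch (this is not SGLang's actual IR) using Python's standard `graphlib`: the three judgment nodes from the essay example depend only on the shared prompt, so a scheduler can see they become ready at the same time and could be executed in parallel.

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph (not SGLang's IR): nodes are primitive
# operators, edges are dependencies between them.
graph = {
    "prompt":   set(),                            # root: shared essay prompt
    "clarity":  {"prompt"},                       # three independent gens
    "style":    {"prompt"},
    "evidence": {"prompt"},
    "summary":  {"clarity", "style", "evidence"}, # join node
}

ts = TopologicalSorter(graph)
ts.prepare()
batches = []
while ts.is_active():
    ready = list(ts.get_ready())    # all nodes whose deps are satisfied
    batches.append(sorted(ready))   # each batch could run in parallel
    ts.done(*ready)

print(batches)
# → [['prompt'], ['clarity', 'evidence', 'style'], ['summary']]
```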

3) RadixAttention

A crucial optimization in SGLang involves KV cache reuse. This approach allows prompts with identical prefixes to share intermediate KV cache, reducing redundant memory and computation. In complex LLM programs involving multiple calls, various KV cache reuse patterns are observed. SGLang addresses this with a radix tree, an efficient data structure managing the mapping between token sequences and their corresponding KV cache tensors. These tensors are stored in a non-contiguous, paged layout optimized for space efficiency.

Given the limitations of GPU memory, an eviction policy is necessary. SGLang uses an LRU (Least Recently Used) eviction policy that targets leaf nodes for eviction, with a reference counter to track usage. The frontend of the system sends complete prompts to the backend, which then performs automatic prefix matching, reuse, and caching. The radix tree’s structure is maintained on the CPU, ensuring minimal overhead. This process is key to managing the flow of multiple requests and efficiently utilizing system resources.

Image Credit: UC Berkeley
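The radix-tree mechanics described above can be sketched in plain Python. This is an illustrative simplification, not SGLang's implementation: keys are token IDs, each node carries an LRU timestamp and a reference counter, and eviction removes the least recently used unreferenced leaf.

```python
import time

# Simplified sketch of a radix-tree KV-cache index (illustrative only).
# In a real system each node would map to KV-cache tensors on the GPU;
# here the tree only tracks which token sequences are cached.

class Node:
    def __init__(self):
        self.children = {}      # token id -> Node
        self.last_access = 0.0  # timestamp for LRU eviction
        self.ref_count = 0      # in-flight requests pinning this node

class RadixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_access = time.monotonic()

    def match_prefix(self, tokens):
        """Length of the longest cached prefix; its KV cache can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = time.monotonic()
            matched += 1
        return matched

    def evict_one_leaf(self):
        """Evict the least recently used unreferenced leaf (LRU policy)."""
        best = None  # (last_access, parent, token)
        stack = [self.root]
        while stack:
            node = stack.pop()
            for tok, child in node.children.items():
                if not child.children and child.ref_count == 0:
                    if best is None or child.last_access < best[0]:
                        best = (child.last_access, node, tok)
                else:
                    stack.append(child)
        if best:
            del best[1].children[best[2]]

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # first request's prompt tokens
print(cache.match_prefix([1, 2, 3, 9]))  # shares a prefix of length 3
```

A second request whose prompt shares the first three tokens would skip recomputing their KV cache entirely, which is the source of the speedups reported below.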

The Results

LMSys evaluated SGLang across several benchmarks, and the results are remarkable.

Take a look at the following benchmark in few-shot in-context learning tasks.

Image Credit: UC Berkeley

And this one, on reasoning tasks.

Image Credit: UC Berkeley

One of the most impressive qualities of SGLang is its speed. This is clearly illustrated in the following benchmark of latency on agent tasks.

Image Credit: UC Berkeley

SGLang represents a fresh take on building the foundational capabilities of LLM applications. Rather than focusing mainly on usability like higher-level frameworks, SGLang focuses on performance and optimization. An amazing project to track.



Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, President of NeuralFabric and founder of The Sequence, Lecturer at Columbia University, Wharton, Angel Investor...