Inside VTC: A Super Innovative Method for Fair LLM Serving
A collaboration between UC Berkeley, Stanford University, and Duke University tackles fairness and cost in LLM inference.
Imagine the following scenarios in an LLM application:
I. Client A sends requests averaging 4k tokens each.
II. Client B sends requests averaging 200 tokens each.
Should the requests from both clients be served by the same LLM resources? The answer seems to be an obvious no, since the second client requires far fewer resources than the first. However, today's LLM serving infrastructures do not differentiate between the two types of requests. This problem has come to be known as fair serving, and it is the subject of a fascinating paper by a list of rock-star researchers that includes UC Berkeley's Joseph Gonzalez and Ion Stoica as well as researchers from Stanford University and Duke University.
Current LLM serving systems handle incoming requests on a First-Come-First-Serve (FCFS) basis. This method is not without its problems. For instance, it offers no way to prevent one client from overwhelming the system with excessive requests, which can slow down or even disrupt the service for other clients. To counteract this, many LLM services impose a limit on the number of requests a client can make per minute. While this does help manage the load, it can also lead to inefficient use of resources: a client may be throttled even when the system has plenty of unused capacity, wasting valuable resources like GPUs.
To address these challenges, the researchers proposed a method known as Virtual Token Counter (VTC). VTC aims to offer a more balanced and efficient way of managing LLM serving. It addresses three main challenges inherent in current LLM serving systems:
I. The cost of processing a request can vary, depending on the number of input and output tokens involved.
II. The capacity of the server to process tokens can fluctuate.
III. The length of the output for each request is unpredictable.
The VTC method works by measuring the service each client receives using a weighted number of tokens. This allows for a fairer distribution of resources and a more efficient utilization of the system’s capacity. The details of this method and its broader implications are discussed in the publication, offering a promising solution for the challenges faced in LLM serving. The technique operates by managing a queue of client requests while keeping a record of the tokens served to each client. During the operation of the LLM execution engine, tokens are generated for various clients. The VTC system updates each client’s counter based on these tokens. When the system is ready to handle new requests, such as when memory becomes available, VTC is responsible for selecting which requests to process next. It achieves fairness by giving priority to the clients with the lowest token count.
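As a rough sketch, the bookkeeping described above might look like the following. The class and method names are illustrative, not taken from the paper's implementation, and input and output tokens are assumed to be weighted equally:

```python
from collections import defaultdict, deque

# Hypothetical sketch of VTC-style bookkeeping: track tokens served per
# client and admit work from the client with the lowest counter.
class VirtualTokenCounter:
    def __init__(self):
        self.counters = defaultdict(int)   # tokens served to each client so far
        self.queues = defaultdict(deque)   # pending requests per client

    def enqueue(self, client, request):
        # Stream that receives incoming requests.
        self.queues[client].append(request)

    def account(self, client, num_tokens):
        # Called by the execution engine as tokens are generated.
        self.counters[client] += num_tokens

    def select_next(self):
        # When capacity frees up, pick a request from the backlogged
        # client that has received the least service.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()
```

With the two clients from the opening example, a client that has consumed 4k tokens is deprioritized relative to one that has consumed only 200, regardless of arrival order.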
The core principle of VTC is to track the service each client has received and prioritize those who have received the least. Each time a client rejoins the queue after a period of low activity, its counter is adjusted to ensure fairness. This adjustment, known as a counter lift, prevents a returning client from claiming a disproportionate share of future service: credits are used as they accrue and cannot be banked for later. The virtual counters are updated with each new token, so the service received is reflected immediately.
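The counter lift can be sketched as a small helper. The function name is hypothetical, and it assumes the lift raises a returning client's counter to the minimum counter among currently active clients:

```python
def lift_counter(counters, active_clients, client):
    # Hypothetical "counter lift": when a client re-enters the queue after
    # a quiet period, raise its counter to the floor set by active clients,
    # so unused "credit" cannot be banked and spent later.
    if active_clients:
        floor = min(counters.get(c, 0) for c in active_clients)
        counters[client] = max(counters.get(client, 0), floor)
    return counters
```

A client whose counter already exceeds the floor is left untouched; only stale, artificially low counters get lifted.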
VTC’s design allows it to be seamlessly integrated into a continuous batching mechanism, independent of the server’s capacity. This resolves the issue of fluctuating token-rate capacity. The system uses two parallel streams for its operation. The first stream monitors incoming requests, adding new requests to the queue and adjusting counters as necessary to maintain fairness, especially in cases where a client was previously underloaded. The second stream is part of the execution engine and controls the addition of new requests into the ongoing batch. This is done by selecting requests from the client with the smallest counter, ensuring a fair distribution of service. The counters are updated in response to the service provided, ensuring an equitable distribution of resources among all clients.
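Putting the two streams together, a toy single-threaded simulation could look like the sketch below. Real systems run the two streams concurrently with continuous batching; here each scheduling step admits arrivals, fills a fixed-size batch from the lowest-counter clients, and generates one token of service per batched request. All names and simplifications are illustrative:

```python
def simulate(arrivals, steps, batch_size=2):
    # arrivals maps a step number to a list of clients submitting one request.
    counters, queue, served = {}, [], {}
    for step in range(steps):
        # Stream 1: admit new arrivals, lifting counters of returning clients.
        for client in arrivals.get(step, []):
            floor = min(counters.values(), default=0)
            counters[client] = max(counters.get(client, 0), floor)
            queue.append(client)
        # Stream 2: fill the batch from clients with the smallest counters.
        batch = []
        while queue and len(batch) < batch_size:
            queue.sort(key=lambda c: counters[c])
            batch.append(queue.pop(0))
        # The engine generates one token per batched request; in this toy
        # model a request finishes after a single token.
        for client in batch:
            counters[client] += 1
            served[client] = served.get(client, 0) + 1
    return served
```

Even with a batch size of one, clients end up with equal service over time because admission always favors the smallest counter.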
The research paper includes several super interesting scenarios that evaluate the effectiveness of VTC.
In one scenario, two clients, both overloaded, were sending requests at different rates. Client 1 sent 90 requests per minute, while Client 2 doubled that rate with 180 requests per minute. Each client sent their requests at consistent intervals, making for an evenly spaced request pattern. Every request from both clients had an input and output length of 256. Due to their high request rates, both clients exceeded the server’s capacity and were backlogged.
Another scenario involved three clients with varied request rates. Client 3 was overloaded and consumed more than its fair share, while Clients 1 and 2 sent requests at lower rates than their allocated share. Specifically, Clients 1, 2, and 3 sent 15, 30, and 90 requests per minute respectively. All requests, as in the previous scenario, had input and output lengths of 256. In this case, Client 3 faced a backlog, whereas Clients 1 and 2 did not.
A particularly interesting scenario featured two clients with distinct request patterns. Client 1 sent requests at a rate of 30 per minute, which was under half of the server’s capacity. In contrast, Client 2 started at a slower rate but gradually increased its requests to 120 per minute, eventually exceeding half of the server’s capacity. Both clients sent requests uniformly, with each request having input and output lengths of 256.
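Since every request in these scenarios has input and output lengths of 256, the per-client token demand is straightforward to compute (assuming input and output tokens are weighted equally):

```python
TOKENS_PER_REQUEST = 256 + 256  # input + output length in the scenarios

def tokens_per_minute(requests_per_minute):
    # Total token demand a client places on the server each minute.
    return requests_per_minute * TOKENS_PER_REQUEST

# Scenario 1: Client 1 at 90 req/min demands 46,080 tokens/min, while
# Client 2 at 180 req/min demands 92,160 tokens/min. With both clients
# backlogged, a fair scheduler should still give each the same share.
demand = {rate: tokens_per_minute(rate) for rate in (15, 30, 90, 180)}
```

These figures make the backlog conditions concrete: whether a client is overloaded depends on how its token demand compares to its fair share of the server's token-rate capacity, not just its raw request rate.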
These scenarios were designed to test the ability of VTC to manage varying loads and request patterns, ensuring fair resource allocation and efficient handling of server capacity.