Radix Attention (SGLang)
Radix Attention is the core innovation of SGLang. It optimizes the Prefill Phase by reusing computation from previous requests.
The Intuition
If two users ask:
- “Write a Python script to scrape a website.”
- “Write a Python script to sort a list.”
They share the prefix “Write a Python script to “. In a standard engine, we would compute the KV cache for this prefix twice.
graph TD
classDef shared fill:#aaffaa,stroke:#333,stroke-width:2px;
classDef unique fill:#ffaaaa,stroke:#333,stroke-width:2px;
Root((Root)) --> Node1["Write a Python script to"]:::shared
Node1 --> Node2["scrape a website"]:::unique
Node1 --> Node3["sort a list"]:::unique
style Node1 fill:#aaffaa
The Radix Tree
SGLang maintains a Radix Tree (Trie) of all token sequences currently in the KV cache.
- Nodes: Sequences of tokens.
- Edges: Transitions to new tokens.
When a new request arrives, we map its prompt to the longest matching path in the Radix Tree.
- Hit: We reuse the KV Cache for the matched part. The prefill only needs to compute the new suffix.
- Miss: We compute from scratch.
Benefits
- Reduced Latency: “Time To First Token” (TTFT) is nearly zero for cached prefixes.
- Higher Throughput: Less computation required per request.
- Complex Workflows: Enables efficient multi-turn chat, few-shot learning, and tree-of-thought prompting.