Memory Management: Paged KV-Cache System

Overview

The memory management system in Mini-YAIE implements a paged KV-cache mechanism inspired by systems like vLLM and SGLang. This approach addresses the memory fragmentation challenges in LLM inference by using fixed-size memory blocks (pages) that can be allocated and deallocated independently for each request.

Core Concepts

Paged Memory Architecture

Traditional KV-cache management allocates contiguous memory blocks for each request, leading to fragmentation when requests have varying lengths. Paged KV-cache solves this by:

  • Dividing the KV-cache into fixed-size blocks (pages)
  • Allowing requests to use non-contiguous memory blocks
  • Enabling efficient memory reuse and sharing

Key Benefits

  1. Reduced Fragmentation: Fixed-size blocks eliminate external fragmentation
  2. Efficient Memory Utilization: Unused blocks can be reallocated to other requests
  3. Scalability: Supports variable-length requests with minimal memory waste
  4. Computation Sharing: Requests with shared prefixes can reference the same memory blocks

Architecture

KVCacheBlock Class

Each KVCacheBlock represents a fixed-size memory block:

class KVCacheBlock:
    def __init__(self, block_id: int, size: int, num_heads: int, head_dim: int, ...):
        self.block_id = block_id  # Unique identifier for the block
        self.size = size          # Number of tokens this block can hold
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.keys = None          # [size, num_heads, head_dim] tensor
        self.values = None        # [size, num_heads, head_dim] tensor

KVCacheManager Class

The main manager orchestrates all memory operations:

class KVCacheManager:
    def __init__(self, num_blocks: int, block_size: int, num_heads: int, head_dim: int, ...):
        self.num_blocks = num_blocks      # Total number of blocks in the pool
        self.block_size = block_size      # Size of each block in tokens
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.blocks: List[KVCacheBlock] = []  # Pool of all blocks
        self.free_block_list: List[int] = []  # Available blocks for allocation
        self.block_tables: dict = {}      # Maps request_id to list of block_ids
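
The constructor also needs to populate the pool and mark every block as free. A minimal sketch of that step, assuming KVCacheBlock's remaining constructor arguments can be left at their defaults:

        # Continuation of __init__ (sketch): fill the pool and the free list.
        for i in range(num_blocks):
            self.blocks.append(
                KVCacheBlock(block_id=i, size=block_size,
                             num_heads=num_heads, head_dim=head_dim)
            )
            self.free_block_list.append(i)  # every block starts out free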

Memory Management Operations

1. Block Allocation

When a request needs KV-cache memory, the manager allocates the required number of blocks:

def allocate_blocks(self, request_id: str, num_tokens: int) -> List[int]:
    # Calculate required blocks
    num_blocks_needed = (num_tokens + self.block_size - 1) // self.block_size
    
    # Check availability
    if len(self.free_block_list) < num_blocks_needed:
        raise RuntimeError("Not enough free blocks")
    
    # Allocate from free list
    allocated_block_ids = []
    for _ in range(num_blocks_needed):
        block_id = self.free_block_list.pop(0)  # Remove from free list
        allocated_block_ids.append(block_id)
        self.blocks[block_id].allocate()  # Allocate GPU memory
    
    # Track allocation
    self.block_tables[request_id] = allocated_block_ids
    return allocated_block_ids

Key Aspects:

  • Calculates minimum blocks needed based on token count and block size
  • Ensures sufficient free blocks before allocation
  • Updates free block list and request tracking
  • Allocates the backing GPU tensors for each block
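
For example, a hypothetical manager with 16-token blocks serving a 100-token prompt reserves ceil(100 / 16) = 7 blocks. The request id, pool size, and head dimensions below are illustrative, and the optional constructor arguments are assumed to have defaults:

manager = KVCacheManager(num_blocks=256, block_size=16,
                         num_heads=8, head_dim=64)
block_ids = manager.allocate_blocks(request_id="req-0", num_tokens=100)
print(len(block_ids))                # 7  -> ceil(100 / 16)
print(len(manager.free_block_list))  # 249 blocks remain free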

2. Block Deallocation

When requests complete, their blocks are returned to the free pool:

def free_blocks(self, request_id: str):
    if request_id in self.block_tables:
        block_ids = self.block_tables[request_id]
        self.free_block_list.extend(block_ids)  # Return to free pool
        self.free_block_list.sort()  # Maintain sorted order
        del self.block_tables[request_id]  # Remove tracking entry
        
        # Optionally clear tensors to free GPU memory
        for block_id in block_ids:
            self.blocks[block_id].keys = None
            self.blocks[block_id].values = None

Key Aspects:

  • Returns blocks to the free list for reuse
  • Maintains sorted order for allocation efficiency
  • Removes request tracking information
  • Optionally clears GPU tensors to free memory
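
Continuing the allocation example above, freeing the request returns every block to the pool:

manager.free_blocks("req-0")
print(len(manager.free_block_list))     # 256 -> all blocks are free again
print("req-0" in manager.block_tables)  # False -> tracking entry removed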

3. Block Copying

For advanced operations like request preemption or memory defragmentation:

def copy_blocks(self, src_block_ids: List[int], dst_block_ids: List[int]):
    if len(src_block_ids) != len(dst_block_ids):
        raise ValueError("Source and destination lists must have same length")
    
    for src_id, dst_id in zip(src_block_ids, dst_block_ids):
        src_block = self.blocks[src_id]
        dst_block = self.blocks[dst_id]
        
        # Allocate destination if needed
        if dst_block.keys is None or dst_block.values is None:
            dst_block.allocate()
        
        # Copy data
        with torch.no_grad():
            dst_block.keys.copy_(src_block.keys)
            dst_block.values.copy_(src_block.values)
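
A hedged usage sketch for preemption: copy a request's cache into freshly allocated blocks so its state survives while the original blocks are reused. The request ids are illustrative, not part of the API shown above:

# Snapshot request A's cache before preempting it.
src_ids = manager.block_tables["req-A"]
dst_ids = manager.allocate_blocks("req-A-snapshot",
                                  num_tokens=len(src_ids) * manager.block_size)
manager.copy_blocks(src_ids, dst_ids)
manager.free_blocks("req-A")  # the original blocks can now be reused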

Memory Layout and Access

Block Organization

The memory is organized as a collection of fixed-size blocks:

Global Memory Pool:
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Block 0 │ Block 1 │ Block 2 │ Block 3 │ Block 4 │ Block 5 │
│ [16xHxD]│ [16xHxD]│ [16xHxD]│ [16xHxD]│ [16xHxD]│ [16xHxD]│
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Request A: Uses [Block 0, Block 2] for non-contiguous sequence storage
Request B: Uses [Block 1, Block 4, Block 5] for its sequence

Where H = num_heads and D = head_dim

Block Tables

Each request has an associated block table that maps logical token positions to physical blocks:

Request A Block Table:
Logical:  [0-15]   [16-31]
Physical: Block 0  Block 2
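
Translating a logical token position into a physical location is then simple arithmetic over the block table. A minimal sketch (the function name is illustrative):

from typing import List, Tuple

def locate_token(block_table: List[int], block_size: int,
                 position: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    block_index = position // block_size  # which block-table entry
    offset = position % block_size        # slot inside that block
    return block_table[block_index], offset

# With Request A's table [0, 2] and block_size=16, token 20 lives in
# physical Block 2 at offset 4.
assert locate_token([0, 2], 16, 20) == (2, 4)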

Integration with SGLang Features

Computation Sharing

The paged system enables computation sharing by allowing requests with shared prefixes to reference the same memory blocks:

  • Requests with common prefixes can share the same KV-cache blocks
  • Multiple requests can reference the same physical memory location
  • Reduces redundant computation and memory usage
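
One common way to make such sharing safe is to reference-count blocks so a shared prefix block is only released when the last request using it finishes. A minimal sketch, assuming each KVCacheBlock carries a ref_count field (not shown in the code above):

def share_prefix_blocks(manager: KVCacheManager, src_request_id: str,
                        dst_request_id: str, num_shared_blocks: int) -> None:
    # Let the destination request reuse the source's first N blocks.
    shared = manager.block_tables[src_request_id][:num_shared_blocks]
    for block_id in shared:
        manager.blocks[block_id].ref_count += 1  # assumed field
    # The destination's table starts with the shared prefix; its own suffix
    # blocks are allocated separately as it diverges from the prefix.
    manager.block_tables[dst_request_id] = list(shared)

Under this scheme, free_blocks would decrement ref_count and only return a block to the free list when the count reaches zero.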

Memory Efficiency

By using fixed-size blocks:

  • External fragmentation is eliminated; internal fragmentation is bounded by the block size
  • Block reuse is maximized
  • Memory utilization stays high, since only the last block of each request can be partially filled

Performance Considerations

Block Size Selection

The block size parameter is critical for performance:

  • Smaller blocks: Less internal fragmentation, more overhead for block management
  • Larger blocks: More internal fragmentation, less management overhead
  • Typical values: 8-32 tokens per block work well in practice
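
As a worked example, a 70-token sequence with block_size=16 occupies ceil(70 / 16) = 5 blocks (80 token slots) and wastes 10 slots, while block_size=64 would occupy 2 blocks (128 slots) and waste 58; per request, the waste is at most block_size - 1 slots in the final block.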

Memory Allocation Strategy

The allocation strategy is deliberately simple: always take the first (lowest-numbered) block on the sorted free list:

# Allocate from beginning of free list
block_id = self.free_block_list.pop(0)

For production systems, more sophisticated strategies might include:

  • Locality-aware placement that keeps a request's blocks close together
  • Batching of allocations and frees to reduce bookkeeping overhead
  • Preallocation to reduce allocation overhead

Memory Safety and Management

GPU Memory Management

The system ensures proper GPU memory allocation:

def allocate(self):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    self.keys = torch.zeros(
        self.size, self.num_heads, self.head_dim, 
        dtype=self.dtype, device=device
    )
    self.values = torch.zeros(
        self.size, self.num_heads, self.head_dim, 
        dtype=self.dtype, device=device
    )

Memory Cleanup

Proper cleanup prevents memory leaks:

  • Free blocks when requests complete
  • Clear GPU tensors to release memory
  • Maintain consistent state in block tables
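
A minimal per-block cleanup method that mirrors allocate() above (the method name free is an assumption; dropping the Python references lets PyTorch's caching allocator reuse the memory):

def free(self):
    # Release references to the key/value tensors. The GPU memory is
    # returned to PyTorch's caching allocator once nothing else holds them.
    self.keys = None
    self.values = None

KVCacheManager.free_blocks() could then call self.blocks[block_id].free() instead of clearing the tensors inline.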

Advanced Features

Dynamic Block Resizing

For requests that need to extend beyond their initial allocation:

  • Allocate additional blocks as needed
  • Maintain logical sequence continuity
  • Update block tables accordingly
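
A hedged sketch of such an extension on the manager, assuming the caller passes only the tokens that overflow the blocks already held (the method name extend_blocks is not part of the code above):

def extend_blocks(self, request_id: str, num_new_tokens: int) -> List[int]:
    extra = (num_new_tokens + self.block_size - 1) // self.block_size
    if len(self.free_block_list) < extra:
        raise RuntimeError("Not enough free blocks to extend request")
    new_ids = []
    for _ in range(extra):
        block_id = self.free_block_list.pop(0)
        self.blocks[block_id].allocate()
        new_ids.append(block_id)
    # Appending preserves the logical order of the sequence.
    self.block_tables[request_id].extend(new_ids)
    return new_ids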

Memory Pool Management

Advanced implementations might include:

  • Block migration to reduce fragmentation
  • Eviction policies for memory-constrained scenarios
  • Prefetching strategies for better performance

Error Handling

Out of Memory Conditions

The system detects memory exhaustion up front and raises an explicit error rather than corrupting cache state:

if len(self.free_block_list) < num_blocks_needed:
    raise RuntimeError(f"Not enough free blocks. Need {num_blocks_needed}, have {len(self.free_block_list)}")

Block Validation

Before operations, the system validates block states:

  • Verify blocks are allocated before accessing
  • Check for proper tensor dimensions
  • Validate request associations
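
A minimal sketch of such a check on the manager (the method name and error types are assumptions):

def _validate_request_blocks(self, request_id: str) -> None:
    if request_id not in self.block_tables:
        raise KeyError(f"Unknown request: {request_id}")
    for block_id in self.block_tables[request_id]:
        block = self.blocks[block_id]
        if block.keys is None or block.values is None:
            raise RuntimeError(f"Block {block_id} is not allocated")
        expected = (block.size, block.num_heads, block.head_dim)
        if tuple(block.keys.shape) != expected:
            raise RuntimeError(f"Block {block_id} has shape "
                               f"{tuple(block.keys.shape)}, expected {expected}")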

Future Enhancements

Memory Optimization

Potential improvements include:

  • Compressed KV-cache storage
  • Offloading to CPU memory when possible
  • Cache eviction policies for long-running requests

Performance Optimization

Advanced techniques might include:

  • Block prefetching for better cache performance
  • Heterogeneous memory management (different memory types)
  • Asynchronous memory operations

Implementation Variations

SGLang-Style Memory Management

For SGLang-specific optimizations:

  • Prefix sharing memory management
  • Radix tree integration for shared computation
  • Advanced scheduling based on memory access patterns

Integration Points

The memory manager connects with other components:

  • Scheduler: Provides memory availability information
  • Attention modules: Access KV-cache through block tables
  • Model execution: Uses paged cache for efficient attention computation