KV Cache Manager (kernels/kv_cache.py)
1. Concept: Paged Attention
In a standard implementation, the KV cache is one huge contiguous tensor of shape [MAX_SEQ_LEN, HEADS, DIM] allocated per sequence. This wastes memory because most prompts are much shorter than MAX_SEQ_LEN.
Paged Attention breaks this tensor into small fixed-size blocks (e.g., 16 tokens each).
- Physical Memory: A big pool of blocks of shape [NUM_BLOCKS, 16, HEADS, DIM].
- Logical Memory: For each request, we just keep a list of block indices, e.g. [0, 5, 12] (illustrated in the small example below).
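To make the logical-to-physical translation concrete, here is a tiny illustrative snippet (the block_table contents and the block size of 16 are made up for the example):

```python
block_size = 16
block_table = [0, 5, 12]  # physical blocks owned by one request, in logical order

def physical_slot(token_pos: int) -> tuple[int, int]:
    """Map a logical token position to (physical block index, offset within block)."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset

print(physical_slot(0))   # (0, 0)  -> offset 0 inside physical block 0
print(physical_slot(20))  # (5, 4)  -> offset 4 inside physical block 5
```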
Your job is to write the Allocator (like malloc in C).
2. Implementation Guide
Open src/kernels/kv_cache.py.
Step 1: Initialization
We need to track which blocks are free and which are used.
Task: In __init__:
- Create a list self.free_blocks. Initially, it should contain all integers from 0 to num_blocks - 1.
- Create a dictionary self.block_tables. This will map request_id -> List[int] (the list of blocks owned by that request).
```python
# Hint
self.free_blocks = list(range(num_blocks))
```
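Putting Step 1 together, a minimal sketch of the constructor might look like this (it assumes the constructor receives num_blocks and block_size; the real __init__ likely takes additional arguments such as head count and dtype):

```python
from typing import Dict, List

class KVCacheManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Every physical block starts out free.
        self.free_blocks: List[int] = list(range(num_blocks))
        # request_id -> list of physical block indices owned by that request.
        self.block_tables: Dict[str, List[int]] = {}
```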
Step 2: The allocate_blocks Method
When a request comes in (or generates new tokens), it needs memory.
Signature:
```python
def allocate_blocks(self, request_id: str, num_tokens: int) -> List[int]:
```
Algorithm:
- Calculate how many blocks are needed.
  - $N_{\text{blocks}} = \lceil \text{num\_tokens} / \text{block\_size} \rceil$
- Check if we have enough free_blocks.
  - If len(free_blocks) < needed, raise an error (or handle OOM).
- If there are enough:
  - Pop the blocks from free_blocks.
  - Assign them to self.block_tables[request_id].
  - Return the list of allocated block indices.
Your Turn: Implement this logic. Watch out for integer division!
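One possible body, continuing the class sketch from Step 1 (it allocates from the front of the free list so a fresh manager hands out blocks 0 and 1 first, matching the Step 5 example; popping from the end works just as well, and the exception type is only a placeholder):

```python
    def allocate_blocks(self, request_id: str, num_tokens: int) -> List[int]:
        # Ceiling division without floats: 20 tokens with block_size=16 -> 2 blocks.
        needed = (num_tokens + self.block_size - 1) // self.block_size
        if len(self.free_blocks) < needed:
            # Placeholder OOM handling: raise; a real engine might preempt instead.
            raise MemoryError(
                f"KV cache OOM: need {needed} blocks, only {len(self.free_blocks)} free"
            )
        # Take blocks from the front of the free list and record ownership.
        allocated = self.free_blocks[:needed]
        del self.free_blocks[:needed]
        self.block_tables.setdefault(request_id, []).extend(allocated)
        return allocated
```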
Step 3: The free_blocks Method
When a request finishes, we must reclaim its memory. (Heads-up: __init__ already stores a list attribute named self.free_blocks, so an instance method with the exact same name would be shadowed; give one of the two a different name in your actual class.)
Algorithm:
- Look up the blocks for request_id.
- Append them back to self.free_blocks.
- Delete the entry from self.block_tables.
Critical: Do not double-free! (A Python set would make duplicate checks easier, but a plain list works fine as a stack and is faster for this access pattern.)
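Continuing the same sketch, the reclaim logic could look like the following. Because the list attribute already uses the name free_blocks, the method here gets the hypothetical name free_request; rename it to whatever your class actually exposes:

```python
    def free_request(self, request_id: str) -> None:
        # Hypothetical name: a method literally called free_blocks would be
        # shadowed by the self.free_blocks list created in __init__.
        blocks = self.block_tables.pop(request_id, None)
        if blocks is None:
            return  # Unknown or already-freed request: nothing to do, so no double-free.
        self.free_blocks.extend(blocks)
```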
Step 4: Connecting to the Engine
The get_kv_tensors method checks whether you can translate the “Logical” view into the “Physical” view.
Task: Implement get_kv_tensors.
- It should presumably return the specific GPU tensors for the blocks.
- Note: In this Python simulation, returning the block indices is often enough for the Scheduler to know the mapping; the actual tensor access happens in the CUDA kernel (see the sketch below).
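In that spirit, a minimal simulation-level version might just hand back the physical indices (the signature is assumed; a real implementation would instead gather the corresponding slices of the pooled GPU tensor):

```python
    def get_kv_tensors(self, request_id: str) -> List[int]:
        # In the Python simulation the physical block indices are enough;
        # the CUDA kernel is what actually indexes the pooled
        # [NUM_BLOCKS, block_size, HEADS, DIM] tensor on the GPU.
        return self.block_tables.get(request_id, [])
```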
Step 5: Verify
Create tests/test_kv_manual.py:
```python
manager = KVCacheManager(num_blocks=10, block_size=16, ...)
# Alloc 20 tokens -> ceil(20 / 16) = 2 blocks (e.g. indices 0, 1)
blocks = manager.allocate_blocks("req1", 20)
print(blocks)
# Free (use whichever name you chose for the release method; see the Step 3 note)
manager.free_blocks("req1")
print(len(manager.free_blocks))  # Should be 10 again
```
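Optionally, you can also exercise the OOM path. This assumes the allocator raises on exhaustion, as in the Step 2 sketch:

```python
# Exhaust the pool, then confirm that one more allocation fails.
manager.allocate_blocks("req2", 10 * 16)    # consumes all 10 blocks
try:
    manager.allocate_blocks("req3", 1)
except MemoryError as exc:                  # or whatever error your allocator raises
    print("OOM raised as expected:", exc)
```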