Memory Operations (kernels/cuda/memory_ops.cu)
Concept
Moving data between different GPU memory locations is a frequent operation in Paged Attention.
Implementation Goal
Implement copy_blocks_kernel:
Signature
void copy_blocks_kernel(
torch::Tensor key_cache, // [num_blocks, block_size, head_dim]
torch::Tensor value_cache, // [num_blocks, block_size, head_dim]
torch::Tensor block_mapping, // [num_mappings, 2] (src, dst)
int num_mappings
);
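One way the host-side entry point could be wired up is sketched below. This is not the reference implementation: it assumes contiguous float caches, int64 block_mapping entries, block_size <= 1024 (so one token per thread fits in a single thread block), and a hypothetical device kernel copy_blocks_device_kernel that performs the per-token copy described under Logic.

```cuda
#include <torch/extension.h>

// Hypothetical device kernel implementing the Logic section; see below.
__global__ void copy_blocks_device_kernel(
    float* key_cache, float* value_cache,
    const int64_t* block_mapping, int block_size, int head_dim);

// Sketch of the host wrapper: one thread block per mapping,
// one thread per token slot within the cache block.
void copy_blocks_kernel(
    torch::Tensor key_cache,      // [num_blocks, block_size, head_dim]
    torch::Tensor value_cache,    // [num_blocks, block_size, head_dim]
    torch::Tensor block_mapping,  // [num_mappings, 2] (src, dst)
    int num_mappings) {
  int block_size = key_cache.size(1);
  int head_dim = key_cache.size(2);
  dim3 grid(num_mappings);   // blockIdx.x selects the mapping
  dim3 block(block_size);    // threadIdx.x selects the token (<= 1024 assumed)
  copy_blocks_device_kernel<<<grid, block>>>(
      key_cache.data_ptr<float>(),
      value_cache.data_ptr<float>(),
      block_mapping.data_ptr<int64_t>(),
      block_size, head_dim);
}
```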
Logic
- Parallelism: Launch one thread block per mapping, with one thread per token to copy.
- Indexing:
  mapping_idx = blockIdx.x
  src_block = block_mapping[mapping_idx][0]
  dst_block = block_mapping[mapping_idx][1]
- Copy:
  - Read the key/value entries of src_block at the threadIdx offset.
  - Write them to the same offset in dst_block.
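The logic above can be sketched as a device kernel. Again a sketch under stated assumptions, not the definitive implementation: float caches, a flattened int64 [num_mappings, 2] mapping array, and the hypothetical name copy_blocks_device_kernel. Each thread owns one token slot and copies that token's head_dim values for both caches.

```cuda
__global__ void copy_blocks_device_kernel(
    float* __restrict__ key_cache,
    float* __restrict__ value_cache,
    const int64_t* __restrict__ block_mapping,  // flattened [num_mappings, 2]
    int block_size,
    int head_dim) {
  // One thread block per (src, dst) mapping.
  int mapping_idx = blockIdx.x;
  int64_t src_block = block_mapping[2 * mapping_idx];      // src
  int64_t dst_block = block_mapping[2 * mapping_idx + 1];  // dst

  // One thread per token slot in the cache block.
  int token = threadIdx.x;
  if (token < block_size) {
    int64_t src = (src_block * block_size + token) * (int64_t)head_dim;
    int64_t dst = (dst_block * block_size + token) * (int64_t)head_dim;
    // Copy this token's head_dim values in both caches.
    for (int d = 0; d < head_dim; ++d) {
      key_cache[dst + d] = key_cache[src + d];
      value_cache[dst + d] = value_cache[src + d];
    }
  }
}
```

A per-element layout (one thread per scalar, block_size * head_dim threads) would coalesce global loads better; the per-token layout here simply mirrors the indexing scheme described above.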