Welcome to Mini-YAIE
Mini-YAIE (Yet Another Inference Engine) is an educational project designed to demystify modern Large Language Model (LLM) inference engines.
Driven by the need for efficiency, modern engines like SGLang, vLLM, and TensorRT-LLM use sophisticated techniques to maximize GPU throughput and minimize latency. Mini-YAIE provides a simplified, clean implementation of these concepts, focusing on:
- Continuous Batching
- Paged KV Caching (a toy sketch follows this list)
- Radix Attention (Prefix Sharing)
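To make one of these ideas concrete before the full guides, here is a minimal, hypothetical sketch of the block-table bookkeeping behind paged KV caching: instead of reserving one contiguous KV buffer per sequence, the cache is split into fixed-size blocks that are handed out on demand. Every name below (`BlockAllocator`, `Sequence`, `block_size`) is illustrative and is not Mini-YAIE's actual API.

```python
# Toy sketch of paged KV cache bookkeeping (illustrative names, not the real API).

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared physical pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids

    def allocate(self) -> int:
        return self.free_blocks.pop()  # grab any free physical block

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one fills up, so memory
        # is claimed on demand rather than reserved for the max sequence length.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8, block_size=4)
seq = Sequence(allocator)
for _ in range(6):  # 6 tokens -> ceil(6 / 4) = 2 blocks
    seq.append_token()
print(seq.block_table)  # -> [7, 6]
```

The other two techniques lean on the same indirection: because blocks are reached through a table rather than fixed offsets, sequences can join and leave a running batch freely (continuous batching), and sequences that share a prompt prefix can point at the same physical blocks (radix attention).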
How to Use This Guide
This documentation is structured to take you from high-level concepts to low-level implementation.
- Core Concepts: Start here to understand the why and what of inference optimization.
- Architecture: Understand how the system components fit together.
- Implementation Guides: Step-by-step guides to implementing the missing “kernels” in Python and CUDA.
Your Mission
The codebase contains placeholders that raise NotImplementedError for critical components. Your goal is to implement these components by following this guide, turning Mini-YAIE from a skeleton into a fully functional inference engine.
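As a hypothetical illustration of the pattern you will encounter (the real module paths, function names, and signatures are defined in the Mini-YAIE source, not here):

```python
# Hypothetical placeholder in the style of the ones you will implement;
# the name and signature below are made up for illustration.
def paged_attention(query, key_cache, value_cache, block_table):
    """Compute attention over a paged KV cache.

    Your task: gather K/V from the physical blocks listed in block_table
    and run scaled dot-product attention against them.
    """
    raise NotImplementedError("Implement me by following the guides.")
```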