System Architecture Overview
Introduction
Mini-YAIE (Yet Another Inference Engine) is an educational implementation of modern LLM inference techniques, specifically designed to demonstrate concepts from state-of-the-art systems like SGLang, vLLM, and TensorRT-LLM. The architecture focuses on three core optimizations:
- Continuous Batching: Dynamically batching incoming requests to maximize GPU utilization
- Radix Attention: Efficient attention mechanism with prefix sharing for similar requests
- Paged KV-Cache: Memory-efficient key-value cache management
High-Level Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Layer     │    │   Engine Core   │    │  Model/Kernels  │
│   (FastAPI)     │◄──►│   (Scheduler,   │◄──►│   (PyTorch/     │
│                 │    │    Attention)   │    │      CUDA)      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                       ▲                       ▲
         │                       │                       │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CLI Layer     │    │  Model Loading  │    │   Memory Mgmt   │
│  (yaie serve/   │    │  (HuggingFace   │    │  (Paged Cache)  │
│   yaie chat)    │    │   Integration)  │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Core Components
1. Main Inference Engine (engine.py)
The main inference engine orchestrates all components and provides the high-level API for inference. It implements SGLang-style continuous batching with radix attention and prefix sharing.
Key Responsibilities:
- Request orchestration and management
- Integration between scheduler, attention mechanisms, and memory management
- API layer communication
- Model loading and tokenizer management
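As a rough illustration of these responsibilities, here is a minimal sketch of what the engine's public surface could look like; the class and method names are assumptions for illustration, not the actual contents of engine.py.

```python
# Hypothetical engine sketch -- names are illustrative, not the real engine.py API.
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    prompt_token_ids: list[int]
    max_new_tokens: int = 128
    output_token_ids: list[int] = field(default_factory=list)


class InferenceEngine:
    def __init__(self, scheduler, kv_cache, model, tokenizer, sampler):
        # Collaborators are injected, mirroring the modular design:
        # scheduler, paged KV-cache, model/kernels, tokenizer, sampler.
        self.scheduler = scheduler
        self.kv_cache = kv_cache
        self.model = model
        self.tokenizer = tokenizer
        self.sampler = sampler

    def add_request(self, request: Request) -> None:
        # Hand new work to the scheduler, which decides when to prefill it.
        self.scheduler.add(request)

    def step(self) -> list[Request]:
        # One engine iteration: build a batch, run the model against the
        # paged KV-cache, sample next tokens, and return finished requests.
        batch = self.scheduler.next_batch(self.kv_cache)
        logits = self.model.forward(batch, self.kv_cache)
        next_tokens = self.sampler.sample(logits)
        return self.scheduler.update(batch, next_tokens)
```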
2. SGLang Scheduler (core/sglang_scheduler.py)
The SGLang-style scheduler implements advanced request scheduling with:
- Prefix-based request grouping: Groups requests with common prefixes for computation sharing
- Separate prefill and decode scheduling: Optimizes for the different computational patterns of prompt processing and token generation
- Memory-aware batch sizing: Considers available KV-cache memory when scheduling
- Continuous batching optimization: Maintains high GPU utilization
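A simplified sketch of the memory-aware prefill/decode split is shown below; it omits prefix grouping and uses hypothetical names rather than the actual core/sglang_scheduler.py API.

```python
# Hypothetical scheduler sketch: separate prefill and decode queues plus a
# memory check against the paged KV-cache before admitting new prefill work.
from collections import deque


class Scheduler:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.prefill_queue: deque = deque()  # requests still waiting for prefill
        self.decode_batch: list = []         # requests currently generating tokens

    def add(self, request) -> None:
        self.prefill_queue.append(request)

    def next_batch(self, kv_cache) -> list:
        # Memory-aware batch sizing: admit prefill requests only while
        # enough free KV-cache blocks remain to hold their prompts.
        while self.prefill_queue:
            req = self.prefill_queue[0]
            blocks_needed = -(-len(req.prompt_token_ids) // self.block_size)
            if kv_cache.num_free_blocks() < blocks_needed:
                break
            self.decode_batch.append(self.prefill_queue.popleft())
        return self.decode_batch
```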
3. Radix Attention System (kernels/radix_attention.py)
Implements the radix attention mechanism with:
- Prefix sharing: Reduces redundant computation for requests with common prefixes
- Paged KV-cache integration: Efficient memory management for variable-length requests
- RoPE (Rotary Position Embeddings): Supports position-aware attention
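The RoPE part of this path can be illustrated with a short PyTorch sketch; the tensor layout (tokens, heads, head_dim) and the interleaved-pair rotation are assumptions for illustration and may differ from the actual kernel.

```python
# Minimal RoPE sketch; an interleaved-pair rotation, not the project's exact kernel.
import torch


def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors by position-dependent angles.

    x:         (num_tokens, num_heads, head_dim), head_dim must be even
    positions: (num_tokens,) absolute position of each token in its sequence
    """
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions[:, None].float() * inv_freq[None, :]        # (num_tokens, head_dim / 2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]  # broadcast over heads
    x1, x2 = x[..., ::2], x[..., 1::2]                             # even/odd feature pairs
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```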
4. Paged KV-Cache Management (kernels/kv_cache.py)
Efficient memory management using page-based allocation:
- Fixed-size blocks: Reduces memory fragmentation
- Request-to-block mapping: Tracks which blocks belong to which requests
- Dynamic allocation/deallocation: Manages memory based on request lifecycle
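The allocation bookkeeping can be sketched in a few lines; this is a hypothetical simplification of the idea, not the actual kernels/kv_cache.py code.

```python
# Hypothetical paged allocator: a pool of fixed-size blocks plus a
# request-to-block mapping; a real cache also stores K/V tensors per block.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # indices of free pages
        self.block_table: dict[str, list[int]] = {}  # request_id -> block indices

    def num_free_blocks(self) -> int:
        return len(self.free_blocks)

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        # Reserve enough fixed-size blocks to hold num_tokens KV entries.
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV-cache blocks")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_table.setdefault(request_id, []).extend(blocks)
        return blocks

    def free(self, request_id: str) -> None:
        # Return a finished request's blocks to the pool (end of request lifecycle).
        self.free_blocks.extend(self.block_table.pop(request_id, []))
```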
5. Radix Tree System (kernels/radix_tree.py)
Enables efficient prefix matching and computation sharing:
- Trie-based structure: Organizes token sequences hierarchically
- Request grouping: Identifies requests with shared prefixes
- Computation optimization: Provides prefix-match information the scheduler uses to plan shared computation
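The prefix-matching idea can be illustrated with a plain token trie; the sketch below uses hypothetical names and omits the path compression and eviction a full radix tree would have.

```python
# Hypothetical token-trie sketch of prefix matching, not the actual kernels/radix_tree.py.
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}


class RadixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids: list[int]) -> None:
        # Record a token sequence so later requests can share its prefix.
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def match_prefix(self, token_ids: list[int]) -> int:
        # Return how many leading tokens are already in the tree, i.e. the
        # prefix whose cached computation the scheduler can reuse.
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched
```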
6. Sampling Kernel (kernels/sampling.py)
Implements core sampling algorithms:
- Temperature scaling: Controls randomness in generation
- Top-K sampling: Limits selection to the K most probable tokens
- Top-P (Nucleus) sampling: Limits selection to the smallest set of tokens whose cumulative probability exceeds P
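The chain of temperature, top-K, and top-P can be written compactly in PyTorch; the single-sequence sketch below shows the standard formulation and is not the project's batched kernel.

```python
# Sketch of temperature / top-K / top-P sampling for a single logits vector.
import torch


def sample(logits: torch.Tensor, temperature: float = 1.0,
           top_k: int = 0, top_p: float = 1.0) -> int:
    # Temperature scaling: higher temperature flattens the distribution.
    logits = logits / max(temperature, 1e-5)

    # Top-K: keep only the K most probable tokens.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth_value,
                             torch.full_like(logits, float("-inf")), logits)

    # Top-P (nucleus): keep the smallest set of tokens whose cumulative
    # probability exceeds P.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()  # shift so the token crossing P is kept
        cutoff[0] = False
        logits[sorted_idx[cutoff]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```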
7. API Server (server/api.py)
Provides OpenAI-compatible API endpoints:
- RESTful design: Follows OpenAI’s API specification
- Streaming support: Real-time token streaming
- Health monitoring: Server status endpoints
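A stripped-down FastAPI sketch of the endpoint shapes is shown below; the fields follow OpenAI's chat completions format, but the route handlers are placeholders rather than the actual server/api.py implementation.

```python
# Hypothetical FastAPI sketch of an OpenAI-style endpoint plus a health probe.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    max_tokens: int = 128
    stream: bool = False  # the real server would stream tokens when True


@app.get("/health")
def health() -> dict:
    # Health monitoring: simple liveness/status probe.
    return {"status": "ok"}


@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest) -> dict:
    # In the real server this would enqueue the request on the engine and
    # return (or stream) generated tokens; here it returns a stub response.
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{"index": 0,
                     "message": {"role": "assistant", "content": "..."}}],
    }
```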
Data Flow
The system processes requests in the following sequence:
1. Request Arrival: Client sends a request through the API layer
2. Request Scheduling: SGLang scheduler groups requests with common prefixes
3. Prefill Phase: Process full prompt sequences using radix attention
4. Decode Phase: Generate tokens one-by-one with shared computation
5. KV-Cache Management: Efficient memory allocation and sharing
6. Response Generation: Return results via API layer
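Put together, a driver loop over this sequence could look like the following; it reuses the hypothetical Request and engine sketches from earlier sections and is illustrative only.

```python
# Illustrative end-to-end loop: enqueue requests, step the engine until all
# requests finish, then decode the generated tokens.
def generate(engine, tokenizer, prompts, max_new_tokens=64):
    for i, prompt in enumerate(prompts):
        token_ids = tokenizer.encode(prompt)
        engine.add_request(Request(request_id=str(i),
                                   prompt_token_ids=token_ids,
                                   max_new_tokens=max_new_tokens))

    finished = []
    while len(finished) < len(prompts):
        # Each step schedules a batch (prefill for new requests, decode for
        # running ones), runs attention against the paged KV-cache, and
        # samples one new token per running request.
        finished.extend(engine.step())

    return [tokenizer.decode(req.output_token_ids) for req in finished]
```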
Key Design Principles
Modularity
Each component is designed to be independent, allowing for focused learning and experimentation.
Educational Focus
Clean, well-documented code with comprehensive explanations of key concepts.
SGLang-Style Optimization
Focus on prefix sharing and radix trees to maximize computational efficiency.
Memory Efficiency
Paged cache management to reduce memory fragmentation and maximize utilization.
Architecture Benefits
- High Throughput: Continuous batching and prefix sharing maximize GPU utilization
- Memory Efficiency: Paged KV-cache reduces fragmentation and enables larger batch sizes
- Scalability: Modular design allows for optimization of individual components
- Educational Value: Clean implementation of state-of-the-art techniques
Integration Points
The system integrates components through well-defined interfaces:
- Engine connects to scheduler for request management
- Scheduler connects to memory manager for KV-cache coordination
- Attention mechanisms access KV-cache through the memory manager
- Sampler provides token selection for generation
- API layer communicates with the engine for request processing
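Continuing the hypothetical sketches above, the wiring could look roughly like this; the stub model and None placeholders stand in for the real HuggingFace-backed model, tokenizer, and sampling kernel.

```python
# Illustrative wiring of the integration points (names reuse the sketches above).
class StubModel:
    def forward(self, batch, kv_cache):
        # The real model would run radix attention against the paged cache here.
        raise NotImplementedError


kv_cache = PagedKVCache(num_blocks=1024, block_size=16)
scheduler = Scheduler(block_size=16)
engine = InferenceEngine(
    scheduler=scheduler,   # engine -> scheduler: request management
    kv_cache=kv_cache,     # scheduler -> memory manager: KV-cache budget
    model=StubModel(),     # attention kernels access the cache via the manager
    tokenizer=None,        # HuggingFace tokenizer in the real engine
    sampler=None,          # sampling kernel: token selection for generation
)
```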