Configuration Management
Overview
Mini-YAIE uses a flexible configuration system that allows users to customize various aspects of the inference engine without modifying the code. The configuration system provides settings for memory management, scheduling, model loading, and performance optimization.
Configuration Structure
The configuration system is built around the SGLangConfig dataclass in src/config.py. The system supports:
- Dataclass-based configuration with type hints
- Default values for all parameters
- Dictionary-based overrides
- Component-specific configuration sections
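The exact fields live in src/config.py, but the dataclass-plus-overrides pattern described above can be sketched as follows. The field subset and the with_overrides helper here are illustrative, not the real API:

```python
from dataclasses import dataclass, replace

@dataclass
class SGLangConfig:
    # Hypothetical subset of fields; the real dataclass in
    # src/config.py defines many more parameters.
    max_batch_size: int = 8
    max_seq_len: int = 2048
    dtype: str = "float16"

    def with_overrides(self, **overrides):
        # Dictionary-based overrides: unknown keys raise a TypeError,
        # which catches typos early.
        return replace(self, **overrides)

config = SGLangConfig().with_overrides(max_batch_size=16)
```

Because dataclasses.replace rejects unknown field names, a misspelled override fails loudly instead of being silently ignored.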
Key Configuration Parameters
Scheduler Configuration
# Maximum batch size for processing requests
max_batch_size: int = 8
# Maximum batch size for prefill operations
max_prefill_batch_size: int = 16
# Maximum batch size for decode operations
max_decode_batch_size: int = 256
# Maximum sequence length allowed
max_seq_len: int = 2048
KV Cache Configuration
# Number of GPU memory blocks for KV-cache
num_gpu_blocks: int = 2000
# Number of CPU memory blocks for swapping
num_cpu_blocks: int = 1000
# Size of each memory block (in tokens)
block_size: int = 16
Model Configuration
# Data type for model weights and KV-cache
dtype: str = "float16" # Options: "float16", "float32", "bfloat16"
# Tensor parallelism size
tensor_parallel_size: int = 1
# GPU memory utilization fraction
gpu_memory_utilization: float = 0.9
# CPU swap space in GB
swap_space: int = 4
Generation Configuration
# Default maximum tokens to generate per request
default_max_tokens: int = 1024
# Default sampling temperature
default_temperature: float = 1.0
# Default top-p value
default_top_p: float = 1.0
SGLang-Specific Features
# Enable radix attention cache for prefix sharing
enable_radix_cache: bool = True
# Enable chunked prefill for long prompts
enable_chunked_prefill: bool = True
# Scheduling policy: "fcfs" (first-come-first-served)
schedule_policy: str = "fcfs"
# Enable prefix caching
enable_prefix_caching: bool = True
# Maximum scheduling steps before preemption
max_num_schedule_steps: int = 1000
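The radix cache saves work by reusing KV-cache entries for tokens that two requests share at the start of their prompts. A toy function (not the real radix tree) makes the underlying idea concrete:

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    # Length of the common token prefix two requests could share in a
    # prefix cache; illustrates the idea, not the real tree structure.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Every token counted by this function is a token whose KV-cache entries need not be recomputed for the second request.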
Configuration Loading
Default Configuration
When no explicit configuration is provided, the system falls back to sensible defaults suited to educational use:
- Conservative memory usage to work on most GPUs
- Balanced performance settings
- Safe batch sizes that avoid out-of-memory errors
Custom Configuration
Users can customize configurations by:
- Direct parameter passing to constructors
- Environment variables for deployment scenarios
- Configuration files (when implemented)
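Environment-variable overrides are a common pattern for deployment scenarios. A sketch of what such a loader could look like; the SGLANG_ prefix, the config_from_env helper, and this two-field dataclass are all hypothetical, not part of src/config.py:

```python
import os
from dataclasses import dataclass, fields

@dataclass
class SGLangConfig:
    # Illustrative subset of fields.
    max_batch_size: int = 8
    gpu_memory_utilization: float = 0.9

def config_from_env(prefix: str = "SGLANG_") -> SGLangConfig:
    # Read overrides such as SGLANG_MAX_BATCH_SIZE=16 from the
    # environment, coercing each value to the field's declared type.
    overrides = {}
    for f in fields(SGLangConfig):
        raw = os.environ.get(prefix + f.name.upper())
        if raw is not None:
            overrides[f.name] = f.type(raw)
    return SGLangConfig(**overrides)
```

Coercing through the field's annotated type keeps the environment strings from silently leaking into numeric parameters.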
Configuration Best Practices
Performance Tuning
For production use, consider these configuration adjustments:
- Increase batch sizes based on available GPU memory
- Adjust block size for optimal cache utilization
- Tune memory pool size based on request patterns
Memory Management
Configure memory settings based on your hardware:
# For high-end GPUs (24GB+ VRAM)
num_blocks = 4000
max_batch_size = 32
# For mid-range GPUs (8-16GB VRAM)
num_blocks = 1000
max_batch_size = 8
# For entry-level GPUs (4-8GB VRAM)
num_blocks = 500
max_batch_size = 4
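Selecting a tier can be automated; the helper below simply encodes the guidance above (the function name and thresholds are illustrative, not tuned values):

```python
def settings_for_vram(vram_gb: float) -> dict:
    # Map GPU memory tiers from the guidance above to block counts
    # and batch sizes.
    if vram_gb >= 24:        # high-end GPUs
        return {"num_blocks": 4000, "max_batch_size": 32}
    if vram_gb >= 8:         # mid-range GPUs
        return {"num_blocks": 1000, "max_batch_size": 8}
    return {"num_blocks": 500, "max_batch_size": 4}  # entry-level
```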
Integration with Components
Engine Integration
The main engine uses the SGLangConfig for initialization:
from src.config import SGLangConfig, get_sglang_config
# Use default config
config = get_sglang_config()
# Or override specific parameters
config = get_sglang_config(
    max_batch_size=16,
    num_gpu_blocks=4000
)
# Initialize components with config values
scheduler = SGLangScheduler(
    max_batch_size=config.max_batch_size,
    max_prefill_batch_size=config.max_prefill_batch_size,
    max_decode_batch_size=config.max_decode_batch_size
)
Scheduler Configuration
The SGLang scheduler uses configuration for scheduling policies:
- Batch size limits
- Prefill/decode phase sizing
- Memory-aware scheduling decisions
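A first-come-first-served admission step under a batch-size limit can be sketched as follows. This is a toy version: the real scheduler would also consult free KV-cache blocks before admitting a request:

```python
from collections import deque

def schedule_fcfs(waiting: deque, max_batch_size: int) -> list:
    # Admit requests strictly in arrival order until the batch limit
    # is reached; remaining requests stay queued for the next step.
    batch = []
    while waiting and len(batch) < max_batch_size:
        batch.append(waiting.popleft())
    return batch
```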
Memory Manager Configuration
The KV-cache manager uses configuration for:
- Total memory pool size
- Block allocation strategies
- Memory optimization policies
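The total pool size follows directly from the block count, block size, and per-token KV footprint. A rough sizing formula, assuming illustrative model dimensions (32 layers, 32 heads of dimension 128, float16):

```python
def kv_cache_bytes(num_blocks: int, block_size: int, num_layers: int,
                   num_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    # Each block stores block_size tokens; every token keeps a key and
    # a value vector per layer, each of size num_heads * head_dim.
    per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes
    return num_blocks * block_size * per_token

# Example: the default 2000 blocks of 16 tokens, for the hypothetical
# model dimensions above, in GiB.
gib = kv_cache_bytes(2000, 16, 32, 32, 128) / 1024**3
```

Running the arithmetic backwards is how num_gpu_blocks should be chosen in practice: measure free VRAM after model weights are loaded, then divide by the per-block footprint.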
Environment-Specific Configuration
Development Configuration
For development and learning:
- Conservative memory limits
- Detailed logging
- Debug information enabled
Production Configuration
For production deployment:
- Optimized batch sizes
- Performance-focused settings
- Minimal logging overhead
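These two profiles might be captured as preset override dictionaries. Note that the log_level key is an assumption for illustration; it is not among the configuration fields shown above:

```python
# Illustrative presets; numeric values follow the tiers discussed above.
DEV_CONFIG = {
    "max_batch_size": 4,
    "gpu_memory_utilization": 0.5,  # conservative memory limit
    "log_level": "DEBUG",           # assumed field, for detailed logging
}
PROD_CONFIG = {
    "max_batch_size": 32,
    "gpu_memory_utilization": 0.9,
    "log_level": "WARNING",         # assumed field, minimal overhead
}
```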
Configuration Validation
The system validates configuration parameters to prevent:
- Memory allocation failures
- Invalid parameter combinations
- Performance-degrading settings
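A validation pass for these failure modes might look like the sketch below. The actual checks live in src/config.py; the power-of-two rule for block_size is an assumption here, not a documented constraint:

```python
def validate_config(cfg: dict) -> list[str]:
    # Collect all problems rather than failing on the first one, so a
    # user can fix an invalid configuration in a single round trip.
    errors = []
    if cfg.get("max_batch_size", 8) < 1:
        errors.append("max_batch_size must be >= 1")
    if not 0.0 < cfg.get("gpu_memory_utilization", 0.9) <= 1.0:
        errors.append("gpu_memory_utilization must be in (0, 1]")
    bs = cfg.get("block_size", 16)
    if bs < 1 or bs & (bs - 1):  # assumed constraint, for illustration
        errors.append("block_size must be a positive power of two")
    return errors
```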
Future Extensions
The configuration system is designed to accommodate:
- Model-specific optimizations
- Hardware-aware tuning
- Runtime configuration updates
- Performance auto-tuning
Configuration Examples
Basic Configuration
# Minimal configuration for learning
config = {
    "max_batch_size": 4,
    "num_blocks": 1000,
    "block_size": 16
}
Performance Configuration
# Optimized for throughput
config = {
    "max_batch_size": 32,
    "max_decode_batch_size": 512,
    "num_blocks": 4000,
    "block_size": 32
}
Troubleshooting Configuration Issues
Memory Issues
If experiencing out-of-memory errors:
- Reduce num_blocks in the KV-cache configuration
- Lower batch sizes
- Check available GPU memory
Performance Issues
If experiencing low throughput:
- Increase batch sizes
- Optimize block size for your model
- Verify CUDA availability and compatibility