CLI Usage: Interactive and Server Modes
Overview
Mini-YAIE provides a comprehensive command-line interface (CLI) that serves as the primary entry point for users. The CLI supports both interactive chat mode and server mode, making it suitable for both direct interaction and production deployment scenarios.
CLI Architecture
Entry Point Structure
The CLI is organized around different command verbs:
yaie <command> [options] [arguments]
Commands:
- serve: Start an OpenAI-compatible API server
- chat: Start an interactive chat session
Core Components
- Argument Parsing: Uses argparse for command-line option handling
- Model Integration: Connects CLI commands to the inference engine
- Interactive Interface: Provides user-friendly chat experience
- Server Integration: Launches API server with proper configuration
Server Mode
Basic Server Usage
Start the API server with a specific model:
yaie serve microsoft/DialoGPT-medium --host localhost --port 8000
Server Options
Model Selection
--model MODEL_NAME # Specify the model to use (required)
Network Configuration
--host HOST # Server host (default: localhost)
--port PORT # Server port (default: 8000)
--workers WORKERS # Number of server workers
Performance Options
--max-batch-size N # Maximum batch size
--max-prefill-batch-size N # Maximum prefill batch size
--max-decode-batch-size N # Maximum decode batch size
--num-blocks N # Number of KV-cache blocks
--block-size N # Size of each cache block
Server Startup Process
- Model Loading: Download and load model from HuggingFace if not cached
- Engine Initialization: Create inference engine with specified parameters
- API Server Creation: Initialize FastAPI application with engine
- Server Launch: Start the web server on specified host/port
Example Server Commands
Basic Server
yaie serve microsoft/DialoGPT-medium
Production Server
yaie serve microsoft/DialoGPT-medium --host 0.0.0.0 --port 8000 --max-batch-size 16
Resource-Constrained Server
yaie serve microsoft/DialoGPT-medium --num-blocks 1000 --max-batch-size 4
Chat Mode
Basic Chat Usage
Start an interactive chat session:
yaie chat microsoft/DialoGPT-medium
Chat Options
Generation Parameters
--temperature TEMP # Sampling temperature (default: 1.0)
--top-p TOP_P # Nucleus sampling threshold (default: 1.0)
--max-tokens N # Maximum tokens to generate (default: 512)
--stream # Stream responses in real-time (default: true)
Model Configuration
--model MODEL_NAME # Specify the model to use (required)
Interactive Chat Experience
Session Flow
- Model Loading: Model is loaded if not cached
- Chat Initialization: Engine and tokenizer are set up
- Conversation Loop: User inputs are processed and responses generated
- Session Termination: Exit with Ctrl+C or quit command
User Interaction
The chat interface provides a conversational experience:
$ yaie chat microsoft/DialoGPT-medium
Model loaded successfully!
Starting chat session (press Ctrl+C to exit)...
User: Hello, how are you?
AI: I'm doing well, thank you for asking!
User: What can you help me with?
AI: I can have conversations, answer questions, and assist with various tasks.
Example Chat Commands
Basic Chat
yaie chat microsoft/DialoGPT-medium
Creative Chat
yaie chat microsoft/DialoGPT-medium --temperature 1.2 --top-p 0.9
Focused Chat
yaie chat microsoft/DialoGPT-medium --temperature 0.7 --max-tokens 128
Model Selection
Supported Model Formats
The CLI supports any HuggingFace-compatible model:
Pre-trained Models
yaie serve microsoft/DialoGPT-medium
yaie serve gpt2
yaie serve facebook/opt-1.3b
Local Models
yaie serve /path/to/local/model
yaie serve ./models/my-custom-model
Model Caching
Models are automatically downloaded and cached:
- First run: Download from HuggingFace Hub
- Subsequent runs: Use cached version
- Cache location: Standard HuggingFace cache directory
Performance Tuning
Memory Configuration
Adjust memory settings based on available GPU memory:
# For 24GB+ GPU
yaie serve model --num-blocks 4000 --max-batch-size 32
# For 8-16GB GPU
yaie serve model --num-blocks 1500 --max-batch-size 8
# For 4-8GB GPU
yaie serve model --num-blocks 800 --max-batch-size 4
Batch Size Optimization
Tune batch sizes for optimal throughput:
# High throughput (more memory)
yaie serve model --max-batch-size 32 --max-prefill-batch-size 64
# Memory efficient (lower batch sizes)
yaie serve model --max-batch-size 4 --max-prefill-batch-size 8
Error Handling and Troubleshooting
Common Errors
Model Loading Errors
# If model name is invalid
Error: Model not found on HuggingFace Hub
# If network is unavailable during first load
Error: Failed to download model
Memory Errors
# If not enough GPU memory
CUDA out of memory error
# If KV-cache is too large
Memory allocation failed
Debugging Options
Verbose Output
yaie serve model --verbose # Show detailed startup information
Configuration Validation
yaie serve model --debug # Enable debugging features
Advanced CLI Features
Configuration Files
The CLI supports configuration files for complex setups:
yaie serve --config config.yaml model
Environment Variables
Several environment variables can customize behavior:
# Set default host
export YAIE_HOST=0.0.0.0
# Set default port
export YAIE_PORT=9000
# Set memory limits
export YAIE_MAX_BLOCKS=2000
Logging Configuration
Control logging verbosity and output:
# Enable detailed logging
yaie serve model --log-level DEBUG
# Log to file
yaie serve model --log-file server.log
Integration with SGLang Features
Batching Optimization
The CLI exposes SGLang batching parameters:
yaie serve model \
--max-prefill-batch-size 16 \
--max-decode-batch-size 256
Prefix Sharing Control
Parameters that affect prefix sharing efficiency:
yaie serve model \
--max-seq-len 2048 \
--block-size 16
Production Deployment
Server Management
Process Control
# Start server in background
nohup yaie serve model > server.log 2>&1 &
# Kill server process
pkill -f "yaie serve"
Process Monitoring
# Monitor server with systemd
systemctl start yaie-server
# Monitor with supervisor
supervisorctl start yaie-server
Health Checks
The server provides health status:
# Check server status
curl http://localhost:8000/health
# Integrate with monitoring tools
# Health check interval and thresholds
Examples and Use Cases
Development Usage
# Quick test with small model
yaie chat gpt2
# Interactive development with verbose output
yaie serve gpt2 --port 8000 --verbose
Production Usage
# High-performance server for production
yaie serve microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 8000 \
--max-batch-size 16 \
--num-blocks 2000
# Low-resource server for edge deployment
yaie serve gpt2 \
--max-batch-size 2 \
--num-blocks 500
Testing and Evaluation
# Test with various parameters
yaie chat model --temperature 0.5 --top-p 0.9
# Evaluate different models
yaie serve model1 --port 8001 &
yaie serve model2 --port 8002 &
The CLI provides a comprehensive interface to access all of Mini-YAIE’s features, from simple interactive chat to high-performance API serving with SGLang-style optimizations.