API & Serving: OpenAI-Compatible Server

Overview

The API server in Mini-YAIE implements an OpenAI-compatible interface, allowing the engine to be used with existing applications and tools designed for OpenAI’s API. The server uses FastAPI to provide RESTful endpoints with proper request/response handling, streaming support, and health monitoring.

API Design Philosophy

The server follows OpenAI’s API specification to ensure compatibility with existing tools and applications while leveraging the advanced features of the SGLang-style inference engine. This approach allows users to:

  • Use existing OpenAI clients without modification
  • Take advantage of Mini-YAIE’s performance optimizations
  • Integrate with tools built for OpenAI’s API format
  • Maintain familiar request/response patterns

Core Architecture

FastAPI Application

The server is built using FastAPI for high-performance web serving:

def create_app(model_name: str) -> FastAPI:
    app = FastAPI(title="YAIE API", version="0.1.0")
    engine = InferenceEngine(model_name)
    # Route handlers registered on `app` close over `engine` (omitted here)
    return app

Main Components

  1. Inference Engine Integration: Connects API endpoints to the core inference engine
  2. Request Validation: Pydantic models for request/response validation
  3. Streaming Support: Server-sent events for real-time token streaming
  4. Error Handling: Proper HTTP error codes and message formatting
  5. Health Monitoring: Endpoints for system status and availability

API Endpoints

1. Chat Completions Endpoint

The main endpoint follows OpenAI’s /v1/chat/completions specification:

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Handles both streaming and non-streaming responses (details below)
    ...

Request Schema

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 1.0
    top_p: Optional[float] = 1.0
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False

Response Schema

class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[Choice]
    usage: Dict[str, int]
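
The request and response models above reference ChatMessage and Choice, which are not shown in this chapter. A minimal sketch of what they might look like, following OpenAI's field names (the exact definitions in Mini-YAIE may differ):

from typing import Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str        # "system", "user", or "assistant"
    content: str

class Choice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Optional[str] = None  # e.g. "stop" or "length"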

Supported Parameters

  • model: Model identifier (passed during server startup)
  • messages: List of message objects with role and content
  • temperature: Sampling temperature (0.0-2.0 recommended)
  • top_p: Nucleus sampling threshold (0.0-1.0)
  • max_tokens: Maximum tokens to generate
  • stream: Whether to stream responses (true/false)

2. Model Listing Endpoint

Lists the available model:

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [
            {
                "id": model_name,
                "object": "model",
                "owned_by": "user",
                "created": int(time.time()),
            }
        ],
    }

3. Health Check Endpoint

Simple health monitoring:

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": model_name}

Streaming Implementation

Streaming vs Non-Streaming

The server supports both response formats using the same endpoint:

if request.stream:
    # Return streaming response
    return StreamingResponse(generate_stream(), media_type="text/event-stream")
else:
    # Return non-streaming response
    response = engine.chat_completion(messages_dicts, **kwargs)
    return response

Streaming Response Format

The streaming implementation generates Server-Sent Events (SSE):

def generate_stream():
    for chunk in engine.chat_completion_stream(messages_dicts, **kwargs):
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

Each chunk follows OpenAI’s streaming format:

{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "created": 1234567890,
  "model": "model-name",
  "choices": [{
    "index": 0,
    "delta": {"content": "token"},
    "finish_reason": null
  }]
}
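
A client consumes this stream by reading the response line by line, stripping the data: prefix, and stopping at the [DONE] sentinel. A minimal sketch using the requests library (the URL and payload are illustrative, not part of Mini-YAIE):

import json
import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue  # skip blank SSE separator lines
        data = line.decode("utf-8").removeprefix("data: ")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)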

Integration with Inference Engine

Request Processing Flow

  1. API Request: Received through FastAPI endpoints
  2. Validation: Pydantic models validate request format
  3. Parameter Extraction: Convert API parameters to engine format
  4. Engine Processing: Call appropriate engine methods
  5. Response Formatting: Convert engine output to API format
  6. API Response: Return properly formatted responses

Parameter Mapping

API parameters are mapped to engine capabilities:

kwargs = {}
if request.max_tokens is not None:
    kwargs["max_tokens"] = request.max_tokens
if request.temperature is not None:
    kwargs["temperature"] = request.temperature
if request.top_p is not None:
    kwargs["top_p"] = request.top_p

Message Formatting

The server converts OpenAI-style messages into plain dictionaries before handing them to the engine; the prompt itself is then built with the tokenizer's chat template when one is available, falling back to simple role-prefixed formatting otherwise:

messages_dicts = [
    {"role": msg.role, "content": msg.content}
    for msg in request.messages
]

# Apply chat template if available
if hasattr(self.tokenizer, 'apply_chat_template'):
    formatted_prompt = self.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
else:
    # Fallback formatting
    formatted_prompt = ""
    for message in messages:
        formatted_prompt += f"{message['role'].capitalize()}: {message['content']}\n"
    formatted_prompt += "\nAssistant:"

Error Handling

HTTP Error Codes

The server properly handles various error conditions:

  • 400 Bad Request: Invalid request parameters
  • 429 Too Many Requests: Rate limiting (not implemented in basic version)
  • 500 Internal Server Error: Server-side errors during processing

Error Response Format

Standard OpenAI-compatible error format:

{
  "error": {
    "message": "Error description",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
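
One way to produce this envelope from FastAPI is a global exception handler; the sketch below is illustrative and not necessarily how Mini-YAIE formats its errors:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(Exception)
async def openai_style_error_handler(request: Request, exc: Exception):
    # Wrap any unhandled exception in the OpenAI-style error envelope
    return JSONResponse(
        status_code=500,
        content={
            "error": {
                "message": str(exc),
                "type": "server_error",
                "param": None,
                "code": None,
            }
        },
    )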

Exception Handling

The server wraps request processing in try/except blocks:

try:
    # Process request
    response = engine.chat_completion(messages_dicts, **kwargs)
    return response
except Exception as e:
    traceback.print_exc()
    raise HTTPException(status_code=500, detail=str(e))

Performance Considerations

Request Batching

The API integrates with the engine's batching system (a client-side illustration follows the list below):

  • Multiple API requests can be batched together in the engine
  • Continuous batching maintains high throughput
  • Batch size limited by engine configuration
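
To illustrate from the client side, several requests sent concurrently can be picked up by the engine's continuous batching. The sketch below uses the third-party httpx library purely for demonstration; the endpoint, model name, and questions are illustrative:

import asyncio
import httpx

async def ask(client: httpx.AsyncClient, question: str) -> str:
    resp = await client.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "model-name",
            "messages": [{"role": "user", "content": question}],
        },
        timeout=120.0,
    )
    return resp.json()["choices"][0]["message"]["content"]

async def main():
    questions = ["What is a KV-cache?", "Explain paged attention.", "What is SSE?"]
    async with httpx.AsyncClient() as client:
        # Concurrent requests arrive close together and can be batched server-side
        answers = await asyncio.gather(*(ask(client, q) for q in questions))
    for question, answer in zip(questions, answers):
        print(question, "->", answer[:60])

asyncio.run(main())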

Memory Management

The server shares memory with the inference engine:

  • KV-cache is shared across API requests
  • Efficient memory reuse through paged cache system
  • Memory limits enforced by engine configuration

Concurrency

FastAPI provides automatic concurrency handling (see the sketch after this list):

  • Async request processing
  • Connection pooling
  • Efficient handling of multiple simultaneous requests
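
Because the engine's generation call is synchronous in the snippets above, a common pattern is to off-load it to a worker thread so the event loop keeps accepting connections during generation. This is a sketch of that pattern, reusing the app, engine, and ChatCompletionRequest names from earlier; it is not necessarily how Mini-YAIE structures its handler:

from fastapi.concurrency import run_in_threadpool

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    messages_dicts = [
        {"role": msg.role, "content": msg.content} for msg in request.messages
    ]
    # Run the blocking engine call in a worker thread so other requests
    # can still be accepted while this one is generating.
    response = await run_in_threadpool(engine.chat_completion, messages_dicts)
    return response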

Security Considerations

Input Validation

  • Pydantic models validate all request parameters (a constrained request model is sketched after this list)
  • Type checking prevents injection attacks
  • Length limits prevent excessive resource consumption
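
A sketch of how such constraints could be expressed on the request model, assuming Pydantic v2 syntax; the specific limits are illustrative, and ChatMessage is the model sketched earlier:

from typing import Annotated, List, Optional
from pydantic import BaseModel, Field

class ChatCompletionRequest(BaseModel):
    model: str
    # At least one message, with an upper bound to limit resource use
    messages: Annotated[List[ChatMessage], Field(min_length=1, max_length=256)]
    # Keep sampling parameters within sane ranges
    temperature: Optional[Annotated[float, Field(ge=0.0, le=2.0)]] = 1.0
    top_p: Optional[Annotated[float, Field(ge=0.0, le=1.0)]] = 1.0
    max_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    stream: Optional[bool] = False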

Rate Limiting

Rate limiting is not implemented in the basic version, but it can be added:

  • Per-IP rate limiting
  • Request quota management
  • Usage monitoring

Deployment Configuration

Server Startup

The server is launched with uvicorn; the model to serve is supplied when the application is created (see Environment Configuration below):

uvicorn server.api:app --host 0.0.0.0 --port 8000

Environment Configuration

The server supports environment-based configuration (a startup sketch follows the list below):

  • Model name via environment variables
  • Port and host configuration
  • Resource limits
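
A sketch of a launcher that reads this configuration from the environment, building on the create_app factory shown earlier; the variable names (YAIE_MODEL, YAIE_HOST, YAIE_PORT) are assumptions for illustration, and the server.api module path matches the uvicorn command above:

import os

import uvicorn
from server.api import create_app  # factory shown in Core Architecture

# Environment variable names are illustrative
model_name = os.environ.get("YAIE_MODEL", "model-name")
host = os.environ.get("YAIE_HOST", "0.0.0.0")
port = int(os.environ.get("YAIE_PORT", "8000"))

app = create_app(model_name)

if __name__ == "__main__":
    uvicorn.run(app, host=host, port=port)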

SGLang-Style Features Integration

Continuous Batching

The API benefits from the engine’s continuous batching:

  • Requests are automatically batched
  • High throughput maintained
  • Low latency for individual requests

Prefix Sharing

API requests with similar prefixes benefit from:

  • Shared computation via radix attention
  • Reduced memory usage
  • Improved efficiency

Multi-Step Processing

The API leverages the engine’s multi-step capabilities:

  • Efficient prefill and decode phases
  • Optimized request scheduling
  • Memory-aware processing

Usage Examples

Basic Chat Request

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "temperature": 0.7
  }'

Streaming Request

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true
  }'
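
Tip: adding curl's -N (--no-buffer) flag disables output buffering, so streamed tokens are printed as soon as each SSE chunk arrives.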

Programmatic Usage

import openai

# Configure client to use local server
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Required but ignored
)

# Use standard OpenAI SDK
response = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
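
The same client can also consume the streaming endpoint. With the OpenAI Python SDK (v1+), passing stream=True returns an iterator of chunks whose delta field carries newly generated tokens:

# Streaming with the same client
stream = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)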

Monitoring and Logging

Health Endpoints

The health endpoint can be used for monitoring:

curl http://localhost:8000/health

Performance Metrics

The server can be extended with the following (a timing-middleware sketch follows the list):

  • Response time tracking
  • Request volume monitoring
  • Error rate monitoring
  • Resource utilization metrics
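
As a starting point, response-time tracking can be added with a small FastAPI middleware; this is an illustrative extension, not existing Mini-YAIE code, and the header name is arbitrary:

import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def track_response_time(request: Request, call_next):
    # Measure wall-clock time per request and expose it as a response header
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    response.headers["X-Response-Time-ms"] = f"{elapsed_ms:.1f}"
    return response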

Advanced Features

Model Loading

The server handles model loading automatically:

  • Lazy loading when first accessed
  • Caching for subsequent requests
  • HuggingFace model integration

Response Caching

The system supports response caching for:

  • Repeated identical requests
  • Common prompt patterns
  • Improved response times for cached content

Logging and Debugging

Comprehensive logging can be added:

  • Request/response logging
  • Performance metrics
  • Error tracing
  • Usage analytics

This OpenAI-compatible API server enables Mini-YAIE to be integrated into existing ecosystems while providing the performance benefits of SGLang-style inference optimization.