Skip to content

LLM Gateway Mini Demo

A CPU-runnable, pure-Python miniature of an LLM Gateway. It demonstrates routing, load balancing, rate limiting, retries with fallback, circuit breaking, request/response transformation, and Prometheus-style metrics without requiring GPUs or external gateway binaries.

Design

The demo is split into small, focused modules:

ModuleResponsibility
config.pyLoad gateway configuration from YAML
providers.pyMock OpenAIProvider, vLLMProvider, TritonProvider
router.pyResolve a model alias to candidate providers and pick one
load_balancer.pySelection algorithms: round-robin, weighted, least-latency, priority
rate_limiter.pyToken-bucket rate limiter keyed by tenant+model
retry_fallback.pyRetry policy, circuit breaker, fallback chain
auth_middleware.pyBearer token -> tenant validation
transform.pyOpenAI request/respose normalization
metrics.pyPrometheus-style counters and histograms
server.pyhttp.server gateway wiring everything together
demo.pyEntry script: spins up 3 mock providers and sends concurrent traffic

Routing strategies

  • round_robin: cycles through candidate providers per model
  • weighted: random selection weighted by provider weight
  • least_latency: picks the provider with the lowest moving-average latency
  • priority: picks the provider with the lowest priority value

Retry / fallback / circuit breaker

Each provider call is wrapped by RetryPolicy (exponential backoff). A per-provider CircuitBreaker trips after 3 consecutive failures, enters an open state, then half-opens after a cooldown. If the chosen provider fails, the FallbackChain tries the next candidate.

Install

bash
cd docs/04-llmops/llm-gateway/mini-demo
pip install -e ".[dev]"

Run the demo

bash
python -m llm_gateway_mini.demo

The demo writes a temporary YAML config with three mock providers (fast vLLM, slow Triton, flaky OpenAI), starts the gateway on a random local port, and fires 20 concurrent requests. It prints routing decisions, rate-limit hits, fallback behavior, and final Prometheus-style metrics.

Run the server directly

python
from llm_gateway_mini.config import load_config
from llm_gateway_mini.server import GatewayApp, make_server

config = load_config("config.yaml")
app = GatewayApp(config, router_strategy="weighted")
server = make_server(app, host="127.0.0.1", port=8080)
server.serve_forever()

Endpoints:

  • GET /health
  • GET /v1/models
  • POST /v1/chat/completions (OpenAI-compatible, requires Authorization: Bearer <key>)
  • GET /metrics

Sample request:

bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-demo-123" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-mini", "messages": [{"role": "user", "content": "hello"}]}'

Run tests

bash
pytest tests/

Mini demo vs. a real gateway

CapabilityMini demoReal gateway (e.g., LiteLLM, Kong, Envoy)
Upstream providersMock Python classesReal OpenAI / vLLM / Triton endpoints
Transporthttp.serverProduction HTTP server / proxy
AuthStatic API key mapOAuth, JWT, RBAC, audit logs
Rate limitingIn-memory token bucketRedis-backed, per-tenant quotas
RoutingSimple strategiesML-based load prediction, cost-aware routing
Retries / fallbackIn-process retry + circuit breakerDistributed retry, queueing, dead-letter
MetricsIn-memory countersPrometheus / Grafana, distributed tracing
StreamingNot implementedFull SSE streaming
DeploymentSingle processHorizontally scalable with config management

This project is intentionally small: it shows the concepts and control flow so readers can understand how a production gateway behaves before adopting a heavier system.

Released under CC-BY-SA-4.0 License.