Patterns
Server endpoint: 429 vs queue
When a request is denied, you have two strategies.
Option A: immediate deny with 429
Return a 429 response immediately and let the client retry.
const result = await withLease(gov, request, fn);
if (!result.granted) {
  res.status(429).json({
    error: "throttled",
    retryAfterMs: result.decision.retryAfterMs,
  });
}

This is the default behavior and the simplest to reason about. The caller gets an instant answer.
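Clients that retry usually look for the standard `Retry-After` header, which is expressed in whole seconds rather than milliseconds. The sketch below shows one way to derive it from `retryAfterMs`. The `toThrottleResponse` helper and the `ThrottleResponse` type are hypothetical names for illustration, not part of ThrottleAI.

```typescript
// Hypothetical helper (not a ThrottleAI API): shapes a deny into a 429 payload.
interface ThrottleResponse {
  status: 429;
  headers: { "Retry-After": string };
  body: { error: "throttled"; retryAfterMs: number };
}

function toThrottleResponse(retryAfterMs: number): ThrottleResponse {
  // Retry-After takes whole seconds; round up so clients never retry early.
  const seconds = Math.ceil(retryAfterMs / 1000);
  return {
    status: 429,
    headers: { "Retry-After": String(seconds) },
    body: { error: "throttled", retryAfterMs },
  };
}

const response = toThrottleResponse(1500);
console.log(response.status, response.headers["Retry-After"]);
```

Keeping the millisecond value in the body lets clients that understand it retry more precisely than the one-second header granularity allows.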
Option B: wait with bounded retries
Queue the request server-side with exponential backoff.
const result = await withLease(gov, request, fn, {
  strategy: "wait-then-deny",
  maxAttempts: 3,
  maxWaitMs: 5_000,
});

The wait-then-deny strategy retries internally with exponential backoff. If it succeeds within maxAttempts and maxWaitMs, the caller never sees a 429. If it fails, the deny is returned with the full decision object.
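To see how maxAttempts and maxWaitMs bound the total wait, here is a minimal sketch of an exponential backoff schedule. It is an illustration only: the library's actual delays are not documented here, and `backoffSchedule` and `baseDelayMs` are hypothetical names.

```typescript
// Hypothetical sketch: the real schedule used by wait-then-deny may differ.
interface BackoffOptions {
  baseDelayMs: number; // assumed starting delay, not a documented option
  maxAttempts: number;
  maxWaitMs: number;
}

// Returns the waits inserted between attempts. The first attempt is immediate,
// so there are at most maxAttempts - 1 waits, and their sum never exceeds maxWaitMs.
function backoffSchedule({ baseDelayMs, maxAttempts, maxWaitMs }: BackoffOptions): number[] {
  const delays: number[] = [];
  let waited = 0;
  for (let retry = 0; retry < maxAttempts - 1; retry++) {
    const remaining = maxWaitMs - waited;
    if (remaining <= 0) break; // time budget exhausted: stop retrying
    // Double the delay on each retry, clipped to the remaining budget.
    const delay = Math.min(baseDelayMs * 2 ** retry, remaining);
    delays.push(delay);
    waited += delay;
  }
  return delays;
}

console.log(backoffSchedule({ baseDelayMs: 250, maxAttempts: 3, maxWaitMs: 5_000 }));
```

Whichever limit is reached first ends the retries, which is why a tight maxWaitMs keeps latency predictable even with a generous maxAttempts.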
When to choose which
| Scenario | Strategy |
|---|---|
| Public API with client retries | deny (immediate 429) |
| Internal service with latency budget | wait-then-deny with tight maxWaitMs |
| Batch pipeline that can wait | wait with generous maxWaitMs |
| User-facing chat endpoint | wait-then-deny with maxWaitMs: 3_000 |
Interactive vs background priority
ThrottleAI supports two priority levels: interactive and background. The interactiveReserve config reserves a number of concurrency slots exclusively for interactive requests.
// User-facing chat gets priority
gov.acquire({ actorId: "user", action: "chat", priority: "interactive" });

// Background embedding can wait
gov.acquire({ actorId: "pipeline", action: "embed", priority: "background" });

With interactiveReserve: 2 and maxInFlight: 5, background tasks are blocked when only 2 slots remain. Those last 2 slots are reserved for interactive requests.
How it works in practice
Suppose you have maxInFlight: 5 and interactiveReserve: 2:
| In-flight | Available | Background can use | Interactive can use |
|---|---|---|---|
| 0 | 5 | 3 | 5 |
| 1 | 4 | 2 | 4 |
| 2 | 3 | 1 | 3 |
| 3 | 2 | 0 (blocked) | 2 |
| 4 | 1 | 0 (blocked) | 1 |
| 5 | 0 | 0 (blocked) | 0 (full) |
Background tasks see a deny with reason: "concurrency" as soon as the available slots drop to the interactive reserve level. Interactive requests can use all remaining slots.
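The arithmetic behind the table can be written as a small function. This is a sketch of the rule as described above, not the governor's internal code, and `usableSlots` is a hypothetical name.

```typescript
type Priority = "interactive" | "background";

// Hypothetical helper: how many more requests of a given priority could be admitted.
function usableSlots(
  maxInFlight: number,
  interactiveReserve: number,
  inFlight: number,
  priority: Priority,
): number {
  const available = Math.max(0, maxInFlight - inFlight);
  // Interactive requests may use every free slot.
  if (priority === "interactive") return available;
  // Background requests must leave the reserve untouched.
  return Math.max(0, available - interactiveReserve);
}

// Print one row per in-flight count for maxInFlight: 5, interactiveReserve: 2.
for (let inFlight = 0; inFlight <= 5; inFlight++) {
  console.log(
    inFlight,
    usableSlots(5, 2, inFlight, "background"),
    usableSlots(5, 2, inFlight, "interactive"),
  );
}
```

With maxInFlight: 5 and interactiveReserve: 2, the loop yields the same background and interactive columns as the table.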
Streaming calls
For streaming responses (SSE, WebSocket-backed, OpenAI streaming), acquire the lease once and hold it for the entire stream duration.
const decision = gov.acquire({ actorId: "user", action: "stream" });
if (!decision.granted) return;

try {
  const stream = await openai.chat.completions.create({
    stream: true,
    model: "gpt-4",
    messages,
  });
  for await (const chunk of stream) {
    // process chunk
  }
  gov.release(decision.leaseId, { outcome: "success" });
} catch (err) {
  gov.release(decision.leaseId, { outcome: "error" });
  throw err;
}

The lease holds for the entire stream. This is correct behavior: the slot is occupied for as long as the model is generating tokens.
Lease TTL for streams
Streaming calls often take longer than typical request-response calls. If your streams can run for 30+ seconds, increase leaseTtlMs to avoid premature expiration:
createGovernor({
  concurrency: { maxInFlight: 5 },
  leaseTtlMs: 120_000, // 2 minutes, enough for long streams
});

If a lease expires mid-stream, the governor fires an expire event and frees the slot. The stream continues but the governor no longer tracks it. This is a safety net, not normal operation; if you see frequent expirations, increase the TTL.
Observability
ThrottleAI provides three observability mechanisms: events, snapshots, and formatters.
Event stream
The onEvent callback fires on every acquire, deny, release, expiry, and warning:
const gov = createGovernor({
  ...presets.balanced(),
  onEvent: (e) => console.log(formatEvent(e)),
});

Example output:
[acquire] actor=user-1 action=chat leaseId=abc123
[deny] actor=user-1 action=chat reason=concurrency retryAfterMs=500
[release] leaseId=abc123 outcome=success latencyMs=1200
[expire] leaseId=def456 — TTL exceeded

Point-in-time snapshot
snapshot() returns the governor’s current state:
const snap = gov.snapshot();
console.log(formatSnapshot(snap));
// concurrency=3/5 rate=12/60 leases=3

The snapshot includes:
- concurrency: in-flight weight, max, effective max (after adaptive), available
- requestRate: current count vs limit
- tokenRate: current count vs limit
- leases: active count, last deny reason, last deny time
Stats collector
For aggregated metrics (total grants, denies, average latency), use the built-in stats collector:
import { createStatsCollector, createGovernor, presets } from "@mcptoolshop/throttleai";

const stats = createStatsCollector();
const gov = createGovernor({
  ...presets.balanced(),
  onEvent: stats.handler,
});

// Later
const summary = stats.snapshot();
// { grants: 150, denies: 12, avgLatencyMs: 340, ... }

Fairness in multi-tenant apps
When multiple actors share a governor, fairness prevents monopolization.
createGovernor({
  concurrency: { maxInFlight: 20 },
  fairness: {
    softCapRatio: 0.3, // no actor gets >30% of slots
    starvationWindowMs: 10_000, // denied actors get priority for 10s
  },
});

Without fairness, one noisy tenant running 20 concurrent embedding jobs can exhaust all slots. With softCapRatio: 0.3 and maxInFlight: 20, each actor soft-caps at 6 concurrent calls.
The starvation window ensures that actors who were recently denied get priority when slots free up. This prevents a fast actor from repeatedly snatching slots before slower actors can acquire them.
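The two fairness settings reduce to simple arithmetic, sketched below. The rounding rule and the inclusive window check are assumptions made for illustration, and `softCap` and `hasStarvationPriority` are hypothetical names rather than ThrottleAI exports.

```typescript
// Hypothetical sketch of the fairness arithmetic; the library's rounding may differ.
function softCap(maxInFlight: number, softCapRatio: number): number {
  // Assumption: round down, with a small epsilon to absorb floating-point error,
  // and always allow each actor at least one slot.
  return Math.max(1, Math.floor(maxInFlight * softCapRatio + 1e-9));
}

// Assumption: an actor counts as "recently denied" while its last deny
// is no older than the starvation window.
function hasStarvationPriority(
  lastDeniedAtMs: number | undefined,
  nowMs: number,
  starvationWindowMs: number,
): boolean {
  if (lastDeniedAtMs === undefined) return false; // never denied: no boost
  return nowMs - lastDeniedAtMs <= starvationWindowMs;
}

console.log(softCap(20, 0.3));
console.log(hasStarvationPriority(1_000, 9_000, 10_000));
```

For the config above, `softCap(20, 0.3)` gives the 6-call cap mentioned in the text, and an actor denied 8 seconds ago is still inside the 10-second window.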