Skip to content

Patterns

When a request is denied, you have two strategies.

Return a 429 response immediately and let the client retry.

const result = await withLease(gov, request, fn);
if (!result.granted) {
res.status(429).json({
error: "throttled",
retryAfterMs: result.decision.retryAfterMs,
});
}

This is the default behavior and the simplest to reason about. The caller gets an instant answer.

Queue the request server-side with exponential backoff.

const result = await withLease(gov, request, fn, {
strategy: "wait-then-deny",
maxAttempts: 3,
maxWaitMs: 5_000,
});

The wait-then-deny strategy retries internally with exponential backoff. If it succeeds within maxAttempts and maxWaitMs, the caller never sees a 429. If it fails, the deny is returned with the full decision object.

ScenarioStrategy
Public API with client retriesdeny (immediate 429)
Internal service with latency budgetwait-then-deny with tight maxWaitMs
Batch pipeline that can waitwait with generous maxWaitMs
User-facing chat endpointwait-then-deny with maxWaitMs: 3_000

ThrottleAI supports two priority levels: interactive and background. The interactiveReserve config reserves a number of concurrency slots exclusively for interactive requests.

// User-facing chat gets priority
gov.acquire({ actorId: "user", action: "chat", priority: "interactive" });
// Background embedding can wait
gov.acquire({ actorId: "pipeline", action: "embed", priority: "background" });

With interactiveReserve: 2 and maxInFlight: 5, background tasks are blocked when only 2 slots remain. Those last 2 slots are reserved for interactive requests.

Suppose you have maxInFlight: 5 and interactiveReserve: 2:

In-flightAvailableBackground can useInteractive can use
0535
1424
2313
320 (blocked)2
410 (blocked)1
500 (blocked)0 (full)

Background tasks see a deny with reason: "concurrency" as soon as the available slots drop to the interactive reserve level. Interactive requests can use all remaining slots.

For streaming responses (SSE, WebSocket-backed, OpenAI streaming), acquire the lease once and hold it for the entire stream duration.

const decision = gov.acquire({ actorId: "user", action: "stream" });
if (!decision.granted) return;
try {
const stream = await openai.chat.completions.create({
stream: true,
model: "gpt-4",
messages,
});
for await (const chunk of stream) {
// process chunk
}
gov.release(decision.leaseId, { outcome: "success" });
} catch (err) {
gov.release(decision.leaseId, { outcome: "error" });
throw err;
}

The lease holds for the entire stream. This is correct behavior — the slot is occupied for as long as the model is generating tokens.

Streaming calls often take longer than typical request-response calls. If your streams can run for 30+ seconds, increase leaseTtlMs to avoid premature expiration:

createGovernor({
concurrency: { maxInFlight: 5 },
leaseTtlMs: 120_000, // 2 minutes — enough for long streams
});

If a lease expires mid-stream, the governor fires an expire event and frees the slot. The stream continues but the governor no longer tracks it. This is a safety net, not normal operation — if you see frequent expirations, increase the TTL.

ThrottleAI provides three observability mechanisms: events, snapshots, and formatters.

The onEvent callback fires on every acquire, deny, release, expiry, and warning:

const gov = createGovernor({
...presets.balanced(),
onEvent: (e) => console.log(formatEvent(e)),
});

Example output:

[acquire] actor=user-1 action=chat leaseId=abc123
[deny] actor=user-1 action=chat reason=concurrency retryAfterMs=500
[release] leaseId=abc123 outcome=success latencyMs=1200
[expire] leaseId=def456 — TTL exceeded

snapshot() returns the governor’s current state:

const snap = gov.snapshot();
console.log(formatSnapshot(snap));
// concurrency=3/5 rate=12/60 leases=3

The snapshot includes:

  • concurrency — in-flight weight, max, effective max (after adaptive), available
  • requestRate — current count vs limit
  • tokenRate — current count vs limit
  • leases — active count, last deny reason, last deny time

For aggregated metrics (total grants, denies, average latency), use the built-in stats collector:

import { createStatsCollector, createGovernor } from "@mcptoolshop/throttleai";
const stats = createStatsCollector();
const gov = createGovernor({
...presets.balanced(),
onEvent: stats.handler,
});
// Later
const summary = stats.snapshot();
// { grants: 150, denies: 12, avgLatencyMs: 340, ... }

When multiple actors share a governor, fairness prevents monopolization.

createGovernor({
concurrency: { maxInFlight: 20 },
fairness: {
softCapRatio: 0.3, // no actor gets >30% of slots
starvationWindowMs: 10_000, // denied actors get priority for 10s
},
});

Without fairness, one noisy tenant running 20 concurrent embedding jobs can exhaust all slots. With softCapRatio: 0.3 and maxInFlight: 20, each actor soft-caps at 6 concurrent calls.

The starvation window ensures that actors who were recently denied get priority when slots free up. This prevents a fast actor from repeatedly snatching slots before slower actors can acquire them.