Getting Started
Install
Section titled “Install”# npmnpm install @mcptoolshop/throttleai
# pnpmpnpm add @mcptoolshop/throttleai
# yarnyarn add @mcptoolshop/throttleaiThrottleAI has zero runtime dependencies. It ships as dual ESM + CJS and runs on Node.js 18+.
60-second quickstart
Section titled “60-second quickstart”import { createGovernor, withLease, presets } from "@mcptoolshop/throttleai";
const gov = createGovernor(presets.balanced());
const result = await withLease( gov, { actorId: "user-1", action: "chat" }, async () => await callMyModel(),);
if (result.granted) { console.log(result.result);} else { console.log("Throttled:", result.decision.recommendation);}That is the entire flow: create a governor, wrap your call in withLease, and check whether it was granted. The governor enforces concurrency, rate limits, and fairness. Leases auto-expire if you forget to release them.
Choose your limiter
Section titled “Choose your limiter”ThrottleAI has five limiter dimensions. You do not need all of them — start with concurrency and add others only when you have a specific reason.
| Limiter | What it caps | When to use |
|---|---|---|
| Concurrency | Simultaneous in-flight calls | Always. This is the most important knob. |
| Rate | Requests per minute | When the upstream API has a documented rate limit. |
| Token rate | Tokens per minute | When you have a per-minute token budget. |
| Fairness | Per-actor share of capacity | Multi-tenant apps where one user should not monopolize slots. |
| Adaptive | Auto-tuned concurrency ceiling | When upstream latency is unpredictable. |
The one-knob rule
Section titled “The one-knob rule”Configure limiters in this order and stop as soon as you have enough control:
- Concurrency first — cap in-flight calls. This prevents stampedes and is the single most useful setting.
- Rate second — add
requestsPerMinuteif the upstream API enforces a rate limit. - Token budget third — add
tokensPerMinuteonly if you have a per-minute token budget (for example, OpenAI tier limits).
Most applications only need concurrency.
Minimal examples by scenario
Section titled “Minimal examples by scenario”Protect an upstream API (OpenAI, Anthropic)
Section titled “Protect an upstream API (OpenAI, Anthropic)”You have a hard rate limit from the provider.
createGovernor({ concurrency: { maxInFlight: 10 }, rate: { requestsPerMinute: 500 }, adaptive: true,});Adaptive mode auto-tunes concurrency when latency spikes, which indicates upstream pressure.
Protect a local model (Ollama, vLLM, llama.cpp)
Section titled “Protect a local model (Ollama, vLLM, llama.cpp)”You are GPU-bound, not API-bound. Concurrency is the bottleneck.
createGovernor({ concurrency: { maxInFlight: 3 }, leaseTtlMs: 120_000, // local models can be slow — extend TTL});Skip rate limiting unless you are paying per-token through a proxy. For local models, concurrency alone prevents OOM and context-switching overhead.
User-facing latency SLO
Section titled “User-facing latency SLO”Interactive users cannot wait more than a few seconds.
createGovernor({ concurrency: { maxInFlight: 8, interactiveReserve: 3 }, adaptive: { targetDenyRate: 0.05, latencyThreshold: 1.3, adjustIntervalMs: 3_000, },});interactiveReserve holds slots for real users even when background jobs are running.
Multi-tenant SaaS
Section titled “Multi-tenant SaaS”Multiple actors share the same governor. Fairness prevents any one user from monopolizing capacity.
createGovernor({ concurrency: { maxInFlight: 20, interactiveReserve: 5 }, rate: { requestsPerMinute: 300 }, fairness: { softCapRatio: 0.3, starvationWindowMs: 5_000 },});With softCapRatio: 0.3 and maxInFlight: 20, each actor soft-caps at 6 concurrent calls.
Batch pipeline (embeddings, migrations)
Section titled “Batch pipeline (embeddings, migrations)”Throughput matters more than latency.
createGovernor({ concurrency: { maxInFlight: 20 }, rate: { requestsPerMinute: 500, tokensPerMinute: 500_000 },});
gov.acquire({ actorId: "pipeline", action: "embed", priority: "background" });Skip adaptive and fairness. Batch jobs have predictable load — let the governor run at full capacity.