
# Configuration

ThrottleAI ships three presets that cover the most common scenarios. Each preset returns a config object that you can spread and override.

**quiet** — single user, CLI tools: 1 call at a time, 10 requests per minute.

```ts
import { createGovernor, presets } from "@mcptoolshop/throttleai";

createGovernor(presets.quiet());
```

**balanced** — SaaS backend: 5 concurrent (2 reserved for interactive), 60 requests per minute, fairness enabled.

```ts
createGovernor(presets.balanced());
```

**aggressive** — batch processing: 20 concurrent, 300 requests per minute, fairness + adaptive tuning.

```ts
createGovernor(presets.aggressive());
```

Presets are plain objects, so you can spread one and override individual fields:

```ts
createGovernor({
  ...presets.balanced(),
  leaseTtlMs: 30_000,
  concurrency: { maxInFlight: 10, interactiveReserve: 3 },
});
```
The full option reference:

```ts
createGovernor({
  // Concurrency (optional)
  concurrency: {
    maxInFlight: 5,        // max simultaneous weight
    interactiveReserve: 1, // slots reserved for interactive priority
  },

  // Rate limiting (optional)
  rate: {
    requestsPerMinute: 60,    // request-rate cap
    tokensPerMinute: 100_000, // token-rate cap
    windowMs: 60_000,         // rolling window (default 60s)
  },

  // Fairness (optional): `fairness: true` enables defaults, or pass an object:
  fairness: {
    softCapRatio: 0.6,         // max share of capacity per actor (default 0.6)
    starvationWindowMs: 5_000, // denied actors get priority (default 5s)
  },

  // Adaptive tuning (optional): `adaptive: true` enables defaults, or:
  adaptive: {
    targetDenyRate: 0.05,    // target deny ratio (default 0.05)
    latencyThreshold: 1.5,   // EMA ratio that triggers reduction (default 1.5)
    alpha: 0.2,              // EMA smoothing factor (default 0.2)
    adjustIntervalMs: 5_000, // how often to re-evaluate (default 5s)
    minConcurrency: 1,       // floor for effective concurrency (default 1)
  },

  // Strict mode (optional)
  strict: true, // throw on double release / unknown ID (dev mode)

  // Lease settings
  leaseTtlMs: 60_000,      // auto-expire (default 60s)
  reaperIntervalMs: 5_000, // sweep interval (default 5s)

  // Observability
  onEvent: (e) => { /* acquire, deny, release, expire, warn */ },
});
```
### Concurrency

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxInFlight` | `number` | required | Maximum simultaneous in-flight weight. This is the most important setting. |
| `interactiveReserve` | `number` | `0` | Slots reserved exclusively for `priority: "interactive"` requests. Background requests are denied when available slots drop to this level. |
### Rate limiting

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `requestsPerMinute` | `number` | — | Maximum requests per rolling window. |
| `tokensPerMinute` | `number` | — | Maximum tokens per rolling window. |
| `windowMs` | `number` | `60_000` | Rolling window duration in milliseconds. |
### Fairness

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `softCapRatio` | `number` | `0.6` | Maximum share of `maxInFlight` any single actor can hold. With `maxInFlight: 20` and `softCapRatio: 0.3`, each actor caps at 6 slots. |
| `starvationWindowMs` | `number` | `5_000` | Actors denied within this window get a priority pass when slots free up. |
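The soft-cap arithmetic from the table can be sketched as a small helper. `actorSlotCap` is a hypothetical name, and flooring the fractional result is an assumption — the library may round differently:

```typescript
// Hypothetical helper mirroring the soft-cap rule described above:
// each actor may hold at most maxInFlight * softCapRatio slots.
// Flooring fractional caps is an assumption, not documented behavior.
function actorSlotCap(maxInFlight: number, softCapRatio: number): number {
  return Math.floor(maxInFlight * softCapRatio);
}

actorSlotCap(20, 0.3); // 6 — matches the example in the table
actorSlotCap(5, 0.6);  // 3 — the balanced preset with the default ratio
```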
### Adaptive tuning

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `targetDenyRate` | `number` | `0.05` | Target deny ratio. Higher values allow more throughput but more denials. |
| `latencyThreshold` | `number` | `1.5` | EMA ratio that triggers concurrency reduction. Lower values react faster to latency spikes. |
| `alpha` | `number` | `0.2` | EMA smoothing factor. Lower values produce smoother, slower-reacting signals. |
| `adjustIntervalMs` | `number` | `5_000` | How often the adaptive controller re-evaluates. |
| `minConcurrency` | `number` | `1` | Floor for effective concurrency. Adaptive never reduces below this. |
### Leases

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `leaseTtlMs` | `number` | `60_000` | Leases auto-expire after this duration. Set it just above your expected p99 latency. |
| `reaperIntervalMs` | `number` | `5_000` | How often the reaper sweeps for expired leases. |
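The "just above your expected p99 latency" advice can be turned into a rule of thumb. `leaseTtlFromP99` and the 1.5× margin are illustrative, not part of ThrottleAI:

```typescript
// Illustrative helper (not part of ThrottleAI): derive leaseTtlMs from an
// observed p99 latency with a safety margin, so slow-but-healthy calls are
// not reaped while genuinely stuck leases still expire promptly.
function leaseTtlFromP99(p99Ms: number, margin = 1.5): number {
  return Math.ceil(p99Ms * margin);
}

leaseTtlFromP99(40_000); // 60000 — the library default, for a 40s p99
```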

Use this to decide which limiters to enable:

```text
Is your app user-facing?
+-- YES --> Set interactiveReserve >= 1, consider adaptive: true
+-- NO  --> Skip interactiveReserve

Does the upstream have a rate limit?
+-- YES --> Set requestsPerMinute to match (leave 10-20% headroom)
+-- NO  --> Skip rate config

Do you have multiple actors/users?
+-- YES --> Enable fairness: true (or tune softCapRatio)
+-- NO  --> Skip fairness

Is latency unpredictable?
+-- YES --> Enable adaptive: true
+-- NO  --> Skip adaptive (manual tuning is simpler)
```
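Walking the tree for a hypothetical user-facing SaaS app — multiple tenants, variable latency, and an assumed upstream limit of 100 requests per minute — might produce a config like this; the specific numbers are illustrative, not recommendations:

```ts
// Hypothetical outcome of the decision tree above. The upstream limit of
// 100 req/min and the ~20% headroom are assumptions for illustration.
createGovernor({
  concurrency: { maxInFlight: 8, interactiveReserve: 2 }, // user-facing: reserve slots
  rate: { requestsPerMinute: 80 }, // upstream allows 100; keep ~20% headroom
  fairness: true,                  // multiple tenants
  adaptive: true,                  // latency is unpredictable
});
```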

Adaptive helps when:

- Upstream latency is variable (cloud APIs, shared GPUs)
- You do not know the right concurrency up front
- Load patterns change throughout the day

Adaptive hurts when:

- Latency is constant (local model with fixed batch size)
- You know the exact capacity (you own the hardware)
- Traffic is bursty and low-volume (not enough samples for a good EMA)

If adaptive oscillates, increase `adjustIntervalMs` (slower reactions) or lower `alpha` (smoother signal).
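The effect of `alpha` is easiest to see with the standard EMA update. Whether ThrottleAI uses exactly this formula is an assumption, but it is the conventional one:

```typescript
// Standard EMA update: ema = alpha * sample + (1 - alpha) * ema.
// (Assumed to match ThrottleAI's smoothing; this is the conventional form.)
function emaStep(prev: number, sample: number, alpha: number): number {
  return alpha * sample + (1 - alpha) * prev;
}

// A single 500ms latency spike against a 100ms baseline:
emaStep(100, 500, 0.5);   // 300 — high alpha chases the spike
emaStep(100, 500, 0.125); // 150 — low alpha damps it, reacting more slowly
```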

| You see this | Adjust this |
| --- | --- |
| `reason: "concurrency"` | Increase `maxInFlight` or decrease call duration |
| `reason: "rate"` | Increase `requestsPerMinute` / `tokensPerMinute` |
| `reason: "policy"` (fairness) | Raise `softCapRatio` or increase `maxInFlight` |
| High `retryAfterMs` | Reduce `leaseTtlMs` so expired leases free up faster |
| Background tasks starved | Increase `maxInFlight` or reduce `interactiveReserve` |
| Interactive latency high | Increase `interactiveReserve` |
| Adaptive shrinks too fast | Lower `alpha` or raise `targetDenyRate` |
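One way to see which row of this table applies is to tally deny reasons from the `onEvent` hook. The event shape (`{ type, reason }`) is an assumption inferred from the event names in the config reference; verify it against the library's actual types:

```typescript
// Hypothetical event shape, inferred from the documented event list
// (acquire, deny, release, expire, warn) and the deny reasons above
// ("concurrency", "rate", "policy"). Verify against the real types.
type GovernorEvent = { type: string; reason?: string };

function makeDenyCounter() {
  const counts: Record<string, number> = {};
  return {
    counts,
    onEvent(e: GovernorEvent): void {
      if (e.type === "deny" && e.reason) {
        counts[e.reason] = (counts[e.reason] ?? 0) + 1;
      }
    },
  };
}

// Pass counter.onEvent as the governor's onEvent to collect a live tally.
const counter = makeDenyCounter();
counter.onEvent({ type: "deny", reason: "rate" });
counter.onEvent({ type: "deny", reason: "rate" });
counter.onEvent({ type: "acquire" }); // non-deny events are ignored
counter.counts; // { rate: 2 }
```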