Skip to content

VRAM estimator

The VRAM estimator answers the operator question “will this config OOM on my card?” BEFORE the trainer pulls the model, loads the dataset, and runs to the first OOM. Two surfaces share the same back-of-envelope math:

  • Trainer.estimate_vram(...) (v1.4 Python API) — returns a structured VRAMEstimate with the headline total_gb and a per-consumer breakdown.
  • backprop estimate-vram (v1.3 CLI) — prints a tier table mapping VRAM ranges to recommended batch sizes for the auto-detection path.

Both surfaces are accurate within ~10-20% of empirical peak for well-known training configs. The estimator is a planning tool, not a guarantee — see the limitations below.

from backpropagate import Trainer
trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")
estimate = trainer.estimate_vram(
mode="lora",
lora_r=256,
batch_size=2,
max_seq_length=2048,
)
print(estimate.summary())
# VRAM estimate (lora, 7.0B params, batch=2, seq=2048): total=13.4GB
# (weights=3.3 + lora=0.5 + optim=0.0 + activations=4.1 + kv=3.7 + overhead=1.7)
print(f"Fits on 16GB card: {estimate.fits_on_card(16.0)}")
# Fits on 16GB card: True

The returned dataclass carries the headline number, per-consumer breakdown, and the inputs that produced it:

FieldDescription
total_gbHeadline number. Compare against your card’s VRAM.
model_weights_gbBase model in nf4 (0.5 bytes/param when quantize_base=True, default).
lora_adapter_gbLoRA adapter weights in bf16 (0 when mode="full").
optimizer_state_gbpaged 8-bit Adam state for the trainable parameters.
activations_gbForward + backward activations. mode="full" assumes gradient_checkpointing=True (sqrt(L) scaling).
kv_cache_gbTransient KV cache during forward pass (training amortization factor 0.25).
overhead_gb15% margin for fragmentation + framework overhead.
param_count_billionsEstimated model size used in the math.
mode / batch_size / gradient_accumulation / max_seq_length / lora_rReproducibility inputs.
notes: list[str]Diagnostic notes (e.g. “base model quantized to nf4”, or “param_count_billions not provided and could not be estimated”).
estimate.fits_on_card(vram_gb) # bool — does total_gb <= vram_gb?
estimate.summary() # str — operator-readable one-line summary

The CLI surfaces the tier heuristic that Trainer._detect_batch_size uses when --batch-size auto is set. It’s lighter-weight than the Python API (no Trainer construction, no torch / transformers import) — useful for sizing infra spend or previewing the auto-batch decision before kicking off training.

Terminal window
# Uses the local GPU's VRAM
backprop estimate-vram
# Override the model (model is used only for the printed header)
backprop estimate-vram Qwen/Qwen2.5-7B-Instruct
# Simulate a card you don't currently have
backprop estimate-vram --vram-gb 24
# Machine-readable output for CI consumers
backprop estimate-vram --vram-gb 16 --json | jq .recommended_batch_size
# 32 GB card, 7B full fine-tuning via FSDP2 CPU-offload — see host_ram_gb
backprop estimate-vram Qwen/Qwen2.5-7B-Instruct --vram-gb 32 --mode full --full-ft-offload --json

See CLI reference → backprop estimate-vram for the full flag table.

A few canonical configurations on 16GB consumer cards (RTX 4080 / 5080 / 4070 Ti Super):

ModelConfigEstimated totalFits 16GB?
Qwen 2.5 7BLoRA r=256, batch=2, seq=2048~13.4 GBYes
Qwen 2.5 7BLoRA r=64, batch=4, seq=2048~14.1 GBYes (tight)
Phi-4-mini-3.8Bmode="full", batch=1, seq=2048~12.8 GBYes
SmolLM3-3Bmode="full", batch=2, seq=2048~11.2 GBYes
Qwen 2.5 7Bmode="full" (raises before estimate)n/aRefused — RUNTIME_FULL_FT_MODEL_TOO_LARGE

For the last row, the trainer’s mode=‘full’ gate fires at Trainer.__init__ before the estimator gets a chance — see full fine-tuning.

On a 32 GB card (RTX 5090) the envelope opens up — and the offload path reports host_ram_gb:

ModelConfigGPU totalHost RAM (offload)
Qwen 2.5 14BQLoRA r=32, batch=2, seq=4096~8.5 GB
Qwen 2.5 32BQLoRA r=32, seq=2048~26 GB (just fits)
Qwen 2.5 7Bmode="full" (pure-GPU, 7B > 6B ceiling)refused → use --full-ft-offload
Qwen 2.5 7Bmode="full" --full-ft-offload~1 GB working set + activations~39 GB

The recommended auto-batch is 6 at 32 GB and 8 at 48 GB (backprop estimate-vram --vram-gb 32).

The estimator is a planning tool, not a guarantee against OOM. Caveats:

  • Activation memory is the most variable component. Real-world activation peaks depend on the model architecture, the dataset’s sequence-length distribution, attention backend (Flash Attention 2 / 3 vs xFormers vs SDPA), and the gradient-checkpointing schedule. The 15% overhead margin absorbs typical variance; pathological inputs (very long sequences, exotic architectures) can exceed it.
  • The KV cache amortization factor (0.25) is a heuristic. Training rarely keeps the full KV cache (it’s primarily an inference cost), but transformers libraries allocate it transiently during forward. The constant approximates that share.
  • Pretrained-model defaults (hidden_dim, num_layers, num_heads) are 7B-class. Pass explicit values to the Python API for non-Llama-architecture models.
  • The estimator does not account for OS-level GPU usage. Your desktop compositor, browser, or background ML processes share the same VRAM pool. Leave headroom.

If the estimator says a config fits and it OOMs in practice, the oom_recovery=True default will halve the batch and retry up to 3 times before raising RUNTIME_GPU_OOM — that’s the runtime safety net beneath the planning surface.