Error codes

Every Backpropagate exception carries a stable, machine-readable code you can search for, grep your logs for, or quote in a bug report. Codes are grouped by prefix:

  • INPUT_* — bad input from the user (your command line, your dataset, your config). Exit code 1. Not retryable.
  • CONFIG_* — invalid persisted configuration. Exit code 1. Not retryable.
  • DEP_* — an external dependency (HuggingFace Hub, Ollama daemon, CUDA driver) misbehaved. Exit code 2. Often retryable.
  • RUNTIME_* — something failed inside Backpropagate or the training/export pipeline. Exit code 2.
  • STATE_* — corrupt on-disk state (checkpoint file, SLAO snapshot). Exit code 2. Not retryable without manual cleanup.
  • PARTIAL_* — operation finished, but not every unit succeeded. Exit code 3.
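
The prefix list above can be turned into a small lookup, for example in a wrapper script that decides how to react to a failure. The following is a minimal sketch, not Backpropagate's own exit-code mapper: `Policy`, `PREFIX_POLICY`, and `classify` are hypothetical names introduced for illustration, and the retry value is only the per-prefix hint from the list (`None` where the list makes no claim). Individual codes can deviate from their prefix, so treat the result as a default.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    exit_code: int
    retryable_hint: Optional[bool]  # None: the prefix list makes no claim


# Hypothetical lookup transcribed from the prefix list above.
PREFIX_POLICY = {
    "INPUT": Policy(1, False),
    "CONFIG": Policy(1, False),
    "DEP": Policy(2, True),      # "often retryable": a hint, not a guarantee
    "RUNTIME": Policy(2, None),
    "STATE": Policy(2, False),   # not retryable without manual cleanup
    "PARTIAL": Policy(3, None),
}


def classify(code: str) -> Optional[Policy]:
    """Return the policy for a code's prefix, or None if the prefix is unknown."""
    prefix = code.split("_", 1)[0]
    return PREFIX_POLICY.get(prefix)


print(classify("DEP_MODEL_LOAD_FAILED"))
print(classify("SLAO_MERGE_DIVERGED"))  # code without a listed prefix: None
```

Returning `None` for an unknown prefix, rather than guessing, keeps the caller honest about codes that fall outside the six groups.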

You will see codes printed to stderr as [CODE_NAME]: message and in the structured envelope returned by BackpropagateError.to_dict(). They are designed to stay stable across class renames, so quote them in issues.
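
Because the code appears in a fixed bracketed shape, it can be pulled out of captured stderr with a regular expression. This is a sketch under one assumption: that codes consist only of upper-case letters, digits, and underscores, as every code on this page does. `CODE_LINE` and `extract_code` are hypothetical helper names, not part of Backpropagate.

```python
import re
from typing import Optional, Tuple

# Assumed line shape: "[CODE_NAME]: message", possibly preceded by a
# timestamp or log level added by your own logging setup.
CODE_LINE = re.compile(r"\[([A-Z][A-Z0-9_]*)\]:\s*(.*)")


def extract_code(line: str) -> Optional[Tuple[str, str]]:
    """Return (code, message) if the line carries a bracketed code, else None."""
    match = CODE_LINE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


sample_log = [
    "loading model...",
    "[RUNTIME_GPU_OOM]: Insufficient GPU memory: need 14.2GB, have 12.0GB",
]
for line in sample_log:
    found = extract_code(line)
    if found:
        print(found[0], "->", found[1])
```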

Code | Raised when | Fix
INPUT_VALIDATION_FAILED | A user-supplied argument or flag failed validation (e.g. steps=0, malformed --auth). | Re-read the suggestion in stderr; fix the offending argument.
INPUT_AUTH_REQUIRED | An operation required --auth credentials (or BACKPROPAGATE_UI_AUTH=user:pass) but they were not supplied. | Pass --auth user:password on the CLI, or set BACKPROPAGATE_UI_AUTH=user:pass in the environment. See handbook/security.md for the full auth contract.
INPUT_AUTH_INVALID_SHAPE | The credentials passed to the UI launcher are not a username:password tuple. | Use the format --auth username:password (single colon, no spaces).
INPUT_DATASET_INVALID | The dataset is malformed in a way that doesn’t fit a more specific code. | Inspect the file; ensure each line is valid JSON for JSONL inputs.
INPUT_DATASET_NOT_FOUND | The dataset path does not exist. | Check the path is spelled correctly and the file exists. Relative paths resolve against the current working directory.
INPUT_DATASET_PARSE_FAILED | A line in a JSONL/CSV dataset failed to parse. | The error includes the line number; open the file and fix the offending row.
INPUT_DATASET_VALIDATION_FAILED | Dataset content failed schema validation (missing required fields, wrong shape). | See the listed validation errors in the suggestion and fix them.
INPUT_DATASET_FORMAT_UNSUPPORTED | Dataset format could not be auto-detected as ShareGPT / Alpaca / OpenAI chat / ChatML / raw text. | Convert to one of the supported formats, or pass an already-loaded HuggingFace Dataset object.
INPUT_DATASET_REPORT_THRESHOLD | A backprop data report quality gate tripped: a --fail-on-dups / --fail-on-contamination / --max-outlier-rate threshold was exceeded, or --strict promoted a WARN verdict to FAIL. Returned as exit 65 (not raised); the code is stamped into the structured log line. | Inspect the report’s failed_thresholds list (re-run without --json for the human summary). Either clean the dataset (dedupe / trim outliers / remove contamination against the --against set) or relax the threshold you passed. Drop the --fail-* flags to run in advisory mode (exit 0).
INPUT_EVAL_RUN_NOT_FOUND | backprop eval <run_id> (or --vs / --gate-against) named a run_id that is not present in the on-disk run history under the configured --output directory. | Run backprop runs --output <dir> to list available run_ids. If the run was trained under a different output directory, re-run with --output <that-dir>. Partial-prefix matches are accepted as long as the prefix is unambiguous.
INPUT_EVAL_HELDOUT_UNRESOLVED | backprop eval could not resolve the held-out evaluation set: the --heldout path is missing / unreadable, or the --prompts file could not be opened. | Pass a readable path to --heldout (a JSONL held-out split) or --prompts (one prompt per line, or JSONL). Check the path resolves from the current working directory and is UTF-8.

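
Several of the INPUT_DATASET_* fixes above come down to finding the offending row in a JSONL file. A stand-alone pre-flight check using only the Python standard library might look like the sketch below. It is not Backpropagate's validator: `find_bad_jsonl_lines` is a hypothetical helper, it checks JSON syntax only (not the dataset schema), and it skips blank lines, which the real loader may or may not do.

```python
import json
from typing import Iterable, List, Tuple


def find_bad_jsonl_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error) for every non-blank line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
    return problems


rows = [
    '{"prompt": "hi", "response": "hello"}',
    '{"prompt": "broken", "response": }',
    "",
]
for number, error in find_bad_jsonl_lines(rows):
    print(f"line {number}: {error}")
```

To check a file on disk, pass an open handle instead of a list, for example `with open("data.jsonl", encoding="utf-8") as fh: find_bad_jsonl_lines(fh)`.
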
Code | Raised when | Fix
CONFIG_INVALID | The persisted configuration (env vars, .env, settings) is invalid in a way that doesn’t fit a more specific code. | Run backprop config to dump the resolved config; correct the offending value.
CONFIG_INVALID_SETTING | A specific setting has an invalid value (type, range, enum). | The error includes the setting name, the value seen, and the expected shape; fix that one knob.

Code | Raised when | Fix
DEP_MODEL_LOAD_FAILED | The model could not be loaded from HuggingFace Hub or local cache. Common causes: gated model without HF_TOKEN, network timeout, typo in model name, corrupted cache. | Verify the model name. For gated models (e.g. some Llama variants) run huggingface-cli login and re-try. Retryable: the trainer will back off on 5xx/429.
DEP_DATASET_ENGINE_MISSING | An optional dependency needed to read a dataset file format (pandas for CSV; pandas + a pyarrow parquet engine for parquet) is not installed. | Install the missing extra: pip install pandas pyarrow (parquet) or pip install pandas (CSV).
DEP_GPU_NOT_AVAILABLE | CUDA / a compatible GPU could not be detected. | Confirm nvidia-smi works. Reinstall PyTorch with the CUDA wheel that matches your driver (pip install torch --index-url https://download.pytorch.org/whl/cu121 etc.).
DEP_OLLAMA_REGISTRATION_FAILED | register_with_ollama(...) could not reach the Ollama daemon or the registration call returned an error. | Start the daemon: ollama serve (or install Ollama from https://ollama.com). Default endpoint is localhost:11434. Retryable.
DEP_MLX_UNAVAILABLE | v1.5 T3.1 (experimental). The MLX (Apple-Silicon) rail was selected (resolved backend='mlx') but the mlx_lm toolchain is not importable on this host. mlx-lm is Apple-Silicon-ONLY (macOS + arm64), so on a Windows / Linux / Intel-Mac host the [mlx] extra cannot install and this fires. Raised by MLXBackend.run() at the subprocess-launch attempt; a forced backend='mlx' on a non-Apple host is normally intercepted earlier at Trainer construction as CONFIG_INVALID_SETTING, so reaching this code usually means a corrupted mlx-lm install on a real Mac. | On an Apple-Silicon Mac: install the extra with pip install 'backpropagate[mlx]'. On a non-Apple host: use backend='auto' (the default, which routes to CUDA) or backend='cuda'; backend='mlx' cannot run there. Set the rail via --backend (CLI) or BACKPROPAGATE_TRAINING__BACKEND (env).
DEP_FSDP_UNAVAILABLE | v1.7. --full-ft-offload (FSDP2 CPU-offload full fine-tuning) was requested but the FSDP toolchain (torch.distributed FSDP / accelerate, with a usable NCCL backend) is not available. The common case is Windows-native: FSDP CUDA collectives need NCCL, which is Linux/WSL2-only (gloo can’t carry CUDA collectives). Raised by the trainer before the offload run starts; it never silently runs without offload and OOMs. | Run --full-ft-offload under WSL2 / Linux (the measured recipe; needs ~64 GB host RAM), OR drop --full-ft-offload and use a model within the pure-GPU full-FT ceiling, OR use --mode lora (QLoRA fits 7B–34B on a 32 GB card).

Code | Raised when | Fix
RUNTIME_TRAINING_FAILED | Training failed in a way that doesn’t fit a more specific code. | Re-run with --verbose (or BACKPROPAGATE_DEBUG=1) to see the unredacted traceback. The run_id printed at startup correlates this failure with every log line and checkpoint; quote it in any bug report.
RUNTIME_TRAINING_ABORTED | Training was aborted (user interrupt, GPU pause/abort, etc.). The error includes steps_completed and the last checkpoint_path so you can resume. | If the abort was from GPU temperature or VRAM pressure, fix the underlying issue and resume from the checkpoint with backprop resume <run_id> (the run_id is printed in the error).
RUNTIME_EVAL_GATE_REGRESSED | backprop eval --gate-against <baseline_run_id> determined the evaluated run regressed the held-out metric beyond the allowed --max-regression. This is the eval-gate that protects continual-merge / SLAO campaigns. Returned as exit 65 (not raised); the code is stamped into the structured log line. | The after-run is worse than the baseline by more than --max-regression. Inspect the diff (re-run with --vs <baseline> for the side-by-side). Either keep the baseline (reject this run / merge), raise --max-regression if a small regression is acceptable, or retrain with a higher learning rate / more steps / cleaner data.
RUNTIME_EVAL_FAILED | backprop eval failed to complete the evaluation: the model could not be loaded, the held-out forward pass crashed, or sample generation raised. Distinct from a clean regression (RUNTIME_EVAL_GATE_REGRESSED); here the eval did not finish. Surfaced via the catch-all exit-code mapper (exit 2, or 137 on CUDA OOM / 69 on a Hub failure). | Re-run with --verbose (or BACKPROPAGATE_DEBUG=1) for the full traceback. Confirm the run’s checkpoint loads via backprop show-run <run_id> and that the model fits in VRAM at eval time (lower --num-samples / --max-new-tokens if you OOM during generation).
RUNTIME_UI_AUTH_NOT_ENFORCED | backprop ui --share (or --host <non-loopback>) was invoked without --auth user:password. The v1.2.0+ contract refuses to start the UI rather than expose an unauthenticated tunnel. (The pre-v1.2.0 BACKPROPAGATE_SECURITY__REQUIRE_AUTH_FOR_SHARE=false opt-out is a no-op under the Reflex UI, retained only for backward compatibility with the Gradio era, and will not relax this gate.) | Pass --auth user:password (or --auth-file <path> in v1.3+ to keep the credential out of shell history). If you don’t actually need a public URL, use SSH port-forwarding instead: ssh -L 7860:localhost:7860 <host>. See handbook/security.md for the full threat model and the four supported auth modes.
RUNTIME_EXPORT_FAILED | Export failed in a way that doesn’t fit a more specific code. | Re-run with --verbose; verify the model loaded cleanly first via backprop info. If the export step fired before trainer.train() returned, look for the LoRA-export-without-training warning earlier in your stderr; as of v1.4, save() no longer silently writes untrained weights.
RUNTIME_LORA_EXPORT_FAILED | LoRA adapter export failed. | Confirm the trainer actually has LoRA adapters attached: trainer.train() must have run before trainer.export(...) or trainer.save(...). As of v1.4, save() emits a WARN line if you call it before train(), naming the path it would have written; check your stderr for that line.
RUNTIME_GGUF_EXPORT_FAILED | GGUF export failed. | Install the export extra: pip install backpropagate[export]. On first run, llama-cpp-python may need to build from source: Windows needs Visual C++ Build Tools + CMake; Linux needs build-essential + cmake. See troubleshooting → llama-cpp-python build failed.
RUNTIME_MERGE_EXPORT_FAILED | Merging the LoRA back into the base model failed. | The most common cause is VRAM pressure during the merge: try a smaller base model, free GPU memory first (nvidia-smi to see what’s holding it), or pass device_map="cpu" to merge on CPU at the cost of speed.
RUNTIME_GPU_ERROR | Generic GPU error that doesn’t fit a more specific GPU code. | Run backprop info to confirm CUDA / GPU / driver are wired up correctly. Then re-run training with --verbose to see which op tripped. If you see a CUBLAS error, see the CUDA troubleshooting page → CUBLAS errors.
RUNTIME_GPU_OOM | Out-of-memory on the GPU during training, eval, or export. | The B-002 OOM-recovery path (enabled by default) auto-shrinks batch size up to 3 times before giving up. If you still see this code: lower --batch-size manually, reduce the model’s max_seq_length setting (most VRAM lives in attention; pass Trainer(max_seq_length=1024) or set BACKPROPAGATE_MODEL__MAX_SEQ_LENGTH=1024), or pick a smaller model preset (Qwen 2.5 3B fits in ~8GB). To opt out of auto-recovery and let OOM bubble up: Trainer(oom_recovery=False).
RUNTIME_FULL_FT_MODEL_TOO_LARGE | Trainer(mode="full") was requested for a model whose parameter count exceeds the card-aware full-fine-tuning ceiling, which is derived from detected VRAM (16 GB→4B, 24 GB→5B, 32 GB→6B pure-GPU; --full-ft-offload lifts it to ≈8B / 7B-class on a 32 GB card). Fires at Trainer.__init__ (preset/model-id lookup) AND again at Trainer.load_model() (authoritative model.num_parameters() check). | The error is contrastive: (1) if the model clears the offload ceiling but not the pure-GPU one, add --full-ft-offload (Python full_ft_offload=True) to spill params+optimizer to host RAM (slower; Linux/WSL2; ~64 GB host RAM); (2) raise the ceiling explicitly with --full-ft-ceiling-billions; (3) past even the offload ceiling, use --mode lora, since LoRA/QLoRA fits 7B–34B on a 32 GB card (recommended per Biderman 2024 + Thinking Machines 2025). See full fine-tuning for the VRAM math + the decision tree.
RUNTIME_FP8_UNSUPPORTED | v1.5; verified on Blackwell (RTX 5090, sm_120) in v1.6. The FP8 compute path (--fp8 / BACKPROPAGATE_TRAINING__FP8 / Trainer(fp8=True)) was enabled but FP8 conversion / the first float8 GEMM failed at runtime, most often because of a half-installed or ABI-mismatched torchao whose import torchao.float8 fails AFTER the capability gate already promised FP8 (so it is NOT degraded silently to bf16). Other FP8 failure axes (no CUDA, sm < 9, conversion error) instead degrade to bf16 with a warning and do not raise this. | Re-run without --fp8 (fp8=False) to train in bf16, which is correct, just without the FP8 speed/memory win. If you want FP8: reinstall a torchao matching your torch build with pip install --force-reinstall 'backpropagate[fp8]'; FP8 needs base-layer inner dims divisible by 16, a Hopper/Blackwell (sm_90+) GPU, and torch>=2.11 for the compiled torchao kernels (2.10 uses the slower _scaled_mm fallback). Note --fp8 is --mode lora + --method sft only; mode='full', any non-sft method (orpo / simpo / kto), or explicit 4-bit combined with FP8 are rejected earlier as CONFIG_INVALID_SETTING.
RUNTIME_GPU_TEMPERATURE_CRITICAL | GPU exceeded the configured safety threshold and training was paused or aborted. Retryable. | Wait for the GPU to cool (training auto-resumes once it does); if it keeps tripping, improve case airflow and / or lower batch size to reduce sustained load. See GPU safety for the temperature thresholds.
RUNTIME_GPU_MONITORING_FAILED | The GPU monitor could not query NVML. | Install pynvml: pip install pynvml. If pynvml is installed and you still see this, your driver may not expose NVML; confirm with nvidia-smi.
RUNTIME_SLAO_ERROR | Generic SLAO failure. | Re-run with --verbose to see which merge step tripped. The run_index field in the error gives the run number; pair it with the on-disk checkpoint for that run when reporting.
RUNTIME_SLAO_MERGE_FAILED | SLAO weight merge failed at a specific run. | The error includes run_index; inspect the on-disk checkpoint for that run.
SLAO_MERGE_DIVERGED | A SLAO merge produced weights with non-finite values (NaN/Inf). Defensive check raised before the bad weights enter the model. | Reduce the learning rate, shorten steps_per_run, or pick a different merge_mode.
PEFT_API_INCOMPATIBLE | The installed peft version does not expose the API Backpropagate expects. | Upgrade peft: pip install -U peft.
UI_OUTPUT_DIR_FORBIDDEN | The BACKPROPAGATE_UI__OUTPUT_DIR env var (or default) resolved to a system / credential directory like /etc, ~/.ssh, C:\Windows\System32. | Pick a non-system directory under your home (e.g. ~/.backpropagate/ui-outputs or ~/work/backprop-out).

Code | Raised when | Fix
STATE_CHECKPOINT_INVALID | A checkpoint file could not be saved or loaded (missing manifest, corrupt safetensors, mid-write crash). | Delete the offending checkpoint directory; if it was written via atomic-rename, look for stray .partial files and remove them.
STATE_SLAO_CHECKPOINT_INVALID | A SLAO checkpoint snapshot is corrupt. | Same fix: remove the bad snapshot and re-run. SLAO will rebuild from the previous good snapshot.

Code | Raised when | Fix
PARTIAL_BATCH_OPERATION | A per-item loop (e.g. multi-run training, batch export) finished with some items failing. The error lists the first N failures and a success-rate percentage. | Inspect the listed errors; re-run the failed items individually.
PARTIAL_SUCCESS | The CLI ran to completion but not every unit succeeded. Maps to exit code 3. | The operation completed enough to be useful; inspect logs, then decide whether to retry the failed units.

Every BackpropagateError exposes a structured envelope:

from backpropagate.exceptions import BackpropagateError

try:
    trainer.train("data.jsonl", steps=100)
except BackpropagateError as e:
    envelope = e.to_dict()
    # {
    #     "type": "GPUMemoryError",
    #     "code": "RUNTIME_GPU_OOM",
    #     "message": "Insufficient GPU memory: need 14.2GB, have 12.0GB",
    #     "retryable": False,
    #     "suggestion": "Try reducing batch size, ...",
    #     "details": {"required_gb": 14.2, "available_gb": 12.0},
    # }

The retryable flag tells callers whether a naive retry is safe. True for DEP_OLLAMA_REGISTRATION_FAILED, RUNTIME_GPU_TEMPERATURE_CRITICAL, DEP_MODEL_LOAD_FAILED; False for everything else by default.
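
One way a caller might act on the retryable flag is a small retry wrapper that backs off only when the envelope says a retry is safe. The sketch below is illustrative rather than part of Backpropagate: `run_with_retry` is a hypothetical helper, the backoff schedule is an arbitrary choice, and `FakeError` is a stand-in for BackpropagateError so the example runs without the library installed. It relies only on the `to_dict()` envelope and its "retryable" key described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(operation: Callable[[], T], attempts: int = 3,
                   base_delay: float = 1.0) -> T:
    """Call operation(); retry with exponential backoff only when the
    raised error's envelope reports retryable=True."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            to_dict = getattr(exc, "to_dict", None)
            envelope = to_dict() if callable(to_dict) else {}
            # Re-raise non-retryable errors, and the final failed attempt.
            if not envelope.get("retryable") or attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")


class FakeError(Exception):
    """Stand-in for BackpropagateError (assumption: only to_dict() is needed)."""

    def __init__(self, code: str, retryable: bool):
        super().__init__(code)
        self.code, self.retryable = code, retryable

    def to_dict(self) -> dict:
        return {"code": self.code, "retryable": self.retryable}


calls = []


def flaky() -> str:
    """Fail twice with a retryable error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise FakeError("DEP_OLLAMA_REGISTRATION_FAILED", retryable=True)
    return "registered"


print(run_with_retry(flaky, base_delay=0.0))
```

Errors without a `to_dict()` method, and errors whose envelope reports retryable as False, propagate immediately, which matches the "False for everything else by default" rule.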

  • Troubleshooting — symptoms-first reverse index (start here if you don’t yet know which code fired).
  • Environment variables — every BACKPROPAGATE_* knob, including the security and UI sandbox knobs referenced above.