Preference tuning (ORPO / SimPO / KTO)
Backpropagate’s training objective is selected with one knob — method (kwarg Trainer(method=...), CLI --method, env BACKPROPAGATE_TRAINING__METHOD). It is one of four values:
| method | Stage | Data shape | Reference model | VRAM envelope |
|---|---|---|---|---|
| sft (default) | supervised | any chat format (ShareGPT / Alpaca / OpenAI / ChatML) | n/a | baseline |
| orpo | preference, reference-free | paired {prompt, chosen, rejected} | none (monolithic) | ≈ SFT |
| simpo | preference, reference-free | paired {prompt, chosen, rejected} | none (length-normalized reward) | tightest of the paired methods |
| kto | preference, binary feedback | unpaired {prompt, completion, label} | the frozen base (no second model) | ≈ SFT (16 GB) |
All three preference methods are reference-free in practice on a 16 GB card: ORPO and SimPO need no reference model at all, and KTO uses the frozen LoRA base as its own reference (TRL’s KTOTrainer disables the adapter to compute reference logprobs), so no second model copy is loaded. That is the whole reason these four — and not classic DPO/PPO — are what Backpropagate ships: they fit the single-consumer-GPU envelope.
ORPO shipped in v1.5; SimPO and KTO shipped in v1.6.
Which one should I use?
- Start with `sft`. If you have instruction → response data and no explicit “this answer is better than that one” signal, supervised fine-tuning is the right tool. Preference methods do not replace SFT; they refine a model on comparative signal.
- Use `orpo` or `simpo` when you have paired preferences: two candidate responses to the same prompt, one marked `chosen` and one `rejected`. Both are single-stage and reference-free.
  - Reach for `simpo` when VRAM is tightest: SimPO’s length-normalized reward removes the per-token length bias without any reference model, and it is the leanest paired objective Backpropagate offers.
  - Reach for `orpo` when you want the odds-ratio formulation that folds the SFT loss and the preference penalty into one term (it tends to be a gentle, stable default for paired data).
- Use `kto` when your feedback is binary and unpaired: you have a pile of `(prompt, completion)` rows each tagged simply “good” (`label: true`) or “bad” (`label: false`), with no requirement that a good and bad response share a prompt. This is the realistic shape for thumbs-up/thumbs-down product telemetry, where you almost never have two responses to the same prompt. KTO is the unpaired / binary-feedback method.
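
The decision list above can be condensed into a small helper. This is only a sketch: `suggest_method` is a hypothetical name that is not part of Backpropagate, and it looks at nothing but the keys of one sample row.

```python
def suggest_method(row: dict) -> str:
    """Suggest a training method from the keys of one sample row (hypothetical helper)."""
    keys = set(row)
    # Paired preferences: two candidate responses, one chosen and one rejected.
    if {"chosen", "rejected"} <= keys:
        return "orpo or simpo"
    # Unpaired binary feedback: a single completion tagged good or bad.
    if {"prompt", "completion", "label"} <= keys:
        return "kto"
    # Anything else is treated as plain instruction -> response data.
    return "sft"


print(suggest_method({"prompt": "p", "chosen": "a", "rejected": "b"}))
print(suggest_method({"prompt": "p", "completion": "c", "label": True}))
print(suggest_method({"instruction": "i", "output": "o"}))
```

The choice between `orpo` and `simpo` is left open on purpose, since it depends on VRAM headroom rather than on the data shape.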
Data shapes
The format auto-detector keys off the columns present, so you usually do not declare the shape: you just point Backpropagate at the file and pick the method.
Paired preference data (ORPO, SimPO)
One object per line, each carrying both a chosen and a rejected completion. `prompt` is optional (it can live inside the chosen/rejected message lists for the implicit-prompt case). Each value may be a plain string or an OpenAI-style message list:

    {"prompt": "Explain backpropagation to a 10-year-old.", "chosen": "Imagine you guessed an answer and a friend tells you how far off you were, so next time you guess a little closer. Backprop is the computer doing that, over and over.", "rejected": "Backpropagation is the reverse-mode automatic differentiation of the loss with respect to the parameters."}

Only rows that carry both chosen and rejected are kept for preference training (DatasetLoader.to_preference_dataset). A row missing one side is dropped, so a mixed file contributes only its real preference rows.
Note: a `{prompt, chosen, rejected}` row trained under `method='sft'` is still valid. SFT renders `prompt → chosen` and deliberately drops `rejected`. The `rejected` column only carries signal under a preference method.
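
The row-filtering rule for paired data can be illustrated with plain Python. This is a sketch of the behavior described above, not the code behind `DatasetLoader.to_preference_dataset`; `keep_paired_rows` is a hypothetical name, and treating an empty string as a missing side is an assumption.

```python
import json


def keep_paired_rows(lines):
    """Yield only the rows that carry both a `chosen` and a `rejected` value."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        row = json.loads(line)
        # Assumption: an empty value counts as a missing side.
        if row.get("chosen") and row.get("rejected"):
            yield row


sample = [
    '{"prompt": "p1", "chosen": "good", "rejected": "bad"}',
    '{"prompt": "p2", "chosen": "only one side"}',
    '{"instruction": "i", "output": "o"}',
]
paired = list(keep_paired_rows(sample))
print(len(paired))  # 1
```

Only the first row survives, which matches the statement that a mixed file contributes only its real preference rows.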
Unpaired binary-feedback data (KTO)
One object per line: a prompt, a single completion, and a boolean label (true = desirable, false = undesirable). The ints 0 / 1 are accepted and coerced to bool; a float like 1.0 is not a valid label (it is read as a numeric score, not a binary flag) and a string class label is not a KTO label either.

    {"prompt": "Write a commit message for a one-line typo fix.", "completion": "fix typo in README", "label": true}
    {"prompt": "Write a commit message for a one-line typo fix.", "completion": "Various changes and improvements across the codebase.", "label": false}
    {"prompt": "Summarize the meeting in one sentence.", "completion": "We agreed to ship Friday and Jia owns the rollback plan.", "label": true}

The KTO converter (DatasetLoader.to_kto_dataset) emits exactly the columns {prompt, completion, label}. A row is kept only if it has a completion and a boolean (or 0/1) label; all other rows are dropped. There is no pairing requirement: desirable and undesirable rows are independent.
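
The label rules above are easy to get wrong in Python, because `bool` is a subclass of `int` and `1.0 == 1`. The following sketch shows one way to implement them; `coerce_kto_label` and `to_kto_rows` are hypothetical names, not the actual `DatasetLoader.to_kto_dataset` code, and defaulting a missing prompt to an empty string is an assumption.

```python
def coerce_kto_label(value):
    """Return a bool for a valid KTO label, or None if the value is not a binary flag."""
    # Check bool first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return value
    # Only the exact ints 0 and 1 are coerced; the float 1.0 is not an int.
    if isinstance(value, int) and value in (0, 1):
        return bool(value)
    return None  # floats, strings, None, and other ints are rejected


def to_kto_rows(rows):
    """Keep rows with a completion and a valid label; emit prompt/completion/label only."""
    out = []
    for row in rows:
        label = coerce_kto_label(row.get("label"))
        if not row.get("completion") or label is None:
            continue
        out.append({
            "prompt": row.get("prompt", ""),  # assumption: missing prompt becomes ""
            "completion": row["completion"],
            "label": label,
        })
    return out


rows = [
    {"prompt": "p", "completion": "c1", "label": True},
    {"prompt": "p", "completion": "c2", "label": 0},
    {"prompt": "p", "completion": "c3", "label": 1.0},     # float score: dropped
    {"prompt": "p", "completion": "c4", "label": "good"},  # string class: dropped
    {"prompt": "p", "label": False},                       # no completion: dropped
]
print(to_kto_rows(rows))
```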
Key hyperparameters and defaults
Every knob below is inert unless its method is selected. Each maps to a CLI flag (--simpo-beta, --kto-beta, …), a Trainer(...) kwarg of the same name, and an env var (BACKPROPAGATE_TRAINING__SIMPO_BETA, …); see Environment variables → Training for the full env table.
| Knob | Default | Meaning |
|---|---|---|
| orpo_beta | 0.1 | Odds-ratio weight (the ORPO “lambda” / TRL ORPOConfig.beta). Must be > 0: a value of zero would silently degenerate ORPO to plain SFT, and a negative value would train toward the rejected completion, so both are rejected at construction with CONFIG_INVALID_SETTING. 0.1 is the paper’s headline setting. |
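
To see why a non-positive `orpo_beta` is refused, it helps to write the ORPO objective out for a single pair. The sketch below follows the formulation in the ORPO paper (SFT negative log-likelihood on the chosen response plus a weighted odds-ratio penalty) on scalar, length-averaged log-probabilities. It is a toy illustration with hypothetical function names, not TRL's or Backpropagate's implementation.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), with avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, beta: float) -> float:
    """SFT loss on the chosen response plus beta times the odds-ratio penalty."""
    sft_nll = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return sft_nll + beta * penalty


print(orpo_loss(-0.5, -1.0, beta=0.1))   # SFT term plus a positive penalty
print(orpo_loss(-0.5, -1.0, beta=0.0))   # penalty vanishes: plain SFT loss
print(orpo_loss(-0.5, -1.0, beta=-0.1))  # penalty is rewarded: wrong direction
```

With `beta=0.0` the preference term disappears entirely, and with a negative `beta` the optimizer is paid for making the rejected response more likely, which is why both cases are treated as configuration errors.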
SimPO has no dedicated TRL trainer — it is TRL’s CPOTrainer / CPOConfig driven with loss_type="simpo" and cpo_alpha=0.0 forced (a non-zero cpo_alpha would be “CPO-SimPO”, a different method; Backpropagate always means pure SimPO). It consumes the same paired {prompt, chosen, rejected} data as ORPO.
| Knob | Default | Meaning |
|---|---|---|
| simpo_beta | 2.0 | Reward-scaling temperature (CPOConfig.beta). The cross-setup safe floor from the paper. Any finite value is admissible. |
| simpo_gamma | 1.0 | Target reward margin (gamma; absolute, = beta×0.5 at the default beta). Must be > 0 (CONFIG_INVALID_SETTING). |
A gamma/beta ratio above 1.0 over-weights the margin relative to the reward scale and risks repetitive / degenerate output — that is a soft signal, so it only emits a WARN (the run is still launchable). The paper pins gamma at roughly beta×0.5; keep the ratio ≤ 1.0.
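
The roles of `beta` and `gamma` are easiest to see in the SimPO objective itself. The sketch below computes the per-pair loss from the SimPO paper on scalar inputs and adds a ratio check that mirrors the hard and soft rules above. The function names are hypothetical, the ratio check is restricted to `beta > 0` for simplicity, and none of this is Backpropagate's or TRL's code.

```python
import math
import warnings


def check_gamma_beta(beta: float, gamma: float) -> float:
    """Validate gamma and return gamma/beta, warning when the ratio exceeds 1.0."""
    if gamma <= 0:
        raise ValueError("simpo_gamma must be > 0")
    if beta <= 0:
        raise ValueError("this sketch only handles beta > 0")
    ratio = gamma / beta
    if ratio > 1.0:
        # Soft signal only: the run is still allowed to launch.
        warnings.warn(f"gamma/beta = {ratio:.2f} > 1.0: risk of repetitive output")
    return ratio


def simpo_loss(sum_logp_chosen, len_chosen, sum_logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (avg_logp_chosen - avg_logp_rejected) - gamma)."""
    # Length normalization: each reward is the average per-token log-probability.
    reward_chosen = beta * sum_logp_chosen / len_chosen
    reward_rejected = beta * sum_logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    return math.log1p(math.exp(-margin))


print(check_gamma_beta(2.0, 1.0))          # 0.5, the default ratio
print(simpo_loss(-10.0, 10, -30.0, 10))    # chosen clearly preferred: small loss
print(simpo_loss(-20.0, 20, -30.0, 10))    # same per-token average, same loss
```

The last two calls give the same value because a chosen response that is twice as long but has the same per-token average earns the same reward, which is the length normalization the method is named for.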
KTO is LoRA-mode-only in v1.6: kto + mode='full' is rejected at construction, because the frozen base must exist to serve as the reference. It runs on TRL’s KTOTrainer / KTOConfig.
| Knob | Default | Meaning |
|---|---|---|
| kto_beta | 0.1 | Prospect-theory loss temperature (KTOConfig.beta). |
| kto_desirable_weight | 1.0 | Loss weight on desirable (label=true) examples. Must be > 0. |
| kto_undesirable_weight | 1.0 | Loss weight on undesirable (label=false) examples. Must be > 0. |
The desirable/undesirable weights you set are a starting point, not the final ratio. The trainer auto-rebalances the effective weights from your dataset’s actual label counts so the desirable:undesirable contribution lands in the [1:1, 4:3] band recommended by the KTO authors. This corrects for class imbalance (a dataset that is 90% “good” rows would otherwise drown the “bad” signal). A zero or negative weight is still rejected at construction (CONFIG_INVALID_SETTING) — start from positive values and let the trainer balance.
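
The arithmetic behind the rebalancing can be sketched as follows. The quantity being balanced is (desirable weight × desirable count) : (undesirable weight × undesirable count). How Backpropagate actually adjusts the weights is not specified here, so this example makes two loud assumptions: it holds the desirable weight fixed, and it moves the ratio to the nearest edge of the [1, 4/3] band. `rebalance_kto_weights` is a hypothetical name.

```python
def rebalance_kto_weights(n_desirable, n_undesirable,
                          desirable_weight=1.0, undesirable_weight=1.0,
                          band=(1.0, 4.0 / 3.0)):
    """Return (desirable_weight, undesirable_weight) with the weighted ratio in `band`."""
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("this sketch needs at least one row of each label")
    if desirable_weight <= 0 or undesirable_weight <= 0:
        raise ValueError("weights must be > 0")
    low, high = band
    ratio = (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)
    if low <= ratio <= high:
        return desirable_weight, undesirable_weight  # already balanced
    # Assumption: clamp to the nearest band edge and solve for the undesirable weight.
    target = min(max(ratio, low), high)
    new_undesirable = desirable_weight * n_desirable / (target * n_undesirable)
    return desirable_weight, new_undesirable


# A dataset that is 90% "good" rows: the "bad" rows get a larger weight.
w_d, w_u = rebalance_kto_weights(n_desirable=900, n_undesirable=100)
print(w_d, w_u)
```

For the 900 / 100 split the starting ratio is 9:1, so the undesirable weight is raised to roughly 6.75, which brings the weighted ratio down to 4:3.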
Learning rate (auto-lowered)
Preference objectives are far more LR-sensitive than SFT. When you do not pass an explicit learning rate, Backpropagate auto-selects a method-appropriate default:

- SFT: a dataset-size ladder (small `5e-4` / medium `2e-4` / large `1e-4`).
- ORPO: a lower dataset-size ladder (`2e-5` / `1e-5` / `5e-6`); the odds-ratio penalty is unstable at SFT magnitudes (Hong, Lee & Thorne 2024, arXiv:2403.07691).
- SimPO and KTO: a fixed `1e-6` anchor at every dataset size. SimPO degrades to repetitive output at LR ≥ `1e-5` (Meng et al. 2024) and KTO’s published runs sit at `1e-6` (Ethayarajh et al. 2024). These anchors are published settings, not scaled off the SFT base LR.
If you pass an explicit --lr / learning_rate=..., it wins — but for SimPO a value ≥ 1e-5 is clamped down with a warning, because high LR is the documented SimPO failure mode.
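
The selection logic described in this section can be summarized as a lookup plus one guard. The sketch below is illustrative only: `resolve_learning_rate` is a hypothetical name, the dataset-size bucket is passed in directly because the size thresholds are not given here, the ORPO ladder is assumed to run small / medium / large like the SFT one, and the value that an over-large SimPO rate is clamped to (`1e-6`) is an assumption.

```python
import warnings

# Default learning rates per method and dataset-size bucket.
DEFAULT_LR = {
    "sft": {"small": 5e-4, "medium": 2e-4, "large": 1e-4},
    "orpo": {"small": 2e-5, "medium": 1e-5, "large": 5e-6},  # assumed bucket order
    "simpo": {"small": 1e-6, "medium": 1e-6, "large": 1e-6},
    "kto": {"small": 1e-6, "medium": 1e-6, "large": 1e-6},
}
SIMPO_LR_LIMIT = 1e-5    # SimPO degrades at or above this rate
SIMPO_CLAMPED_LR = 1e-6  # assumption: the clamp target is the SimPO anchor


def resolve_learning_rate(method, size_bucket, explicit_lr=None):
    """Pick the method default, or honor an explicit rate with the SimPO guard."""
    if explicit_lr is None:
        return DEFAULT_LR[method][size_bucket]
    if method == "simpo" and explicit_lr >= SIMPO_LR_LIMIT:
        warnings.warn(f"SimPO learning rate {explicit_lr:g} clamped to {SIMPO_CLAMPED_LR:g}")
        return SIMPO_CLAMPED_LR
    return explicit_lr  # an explicit value wins for every other case


print(resolve_learning_rate("orpo", "medium"))
print(resolve_learning_rate("sft", "small", explicit_lr=3e-4))
```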
Verifying the result
After a preference run, score the adapter against a held-out reference set rather than eyeballing loss, which is a weak proxy. The eval harness computes deterministic, judge-free task metrics and can gate a merge on non-regression:

    # Carve a held-out reference split, then score the run on exact-match + token-F1
    backprop data split prefs.jsonl --heldout-ratio 0.1 --seed 0
    backprop eval <run_id> --references prefs.heldout.jsonl \
      --metric normalized_exact_match --metric token_f1

See Recipes for the full SimPO / KTO / eval-gate recipes.
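
For intuition about what the two metric names measure, here is a sketch of the conventional definitions (lowercase, strip punctuation and English articles, then compare strings or token multisets). Backpropagate's exact normalization rules are not described here, so the harness may differ in detail from this common SQuAD-style convention.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(normalized_exact_match("Fix typo in the README.", "fix typo in README"))
print(token_f1("fix typo", "fix typo in README"))
```

Both metrics are deterministic and need no judge model, which is what makes them usable as a merge gate.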
Papers
- ORPO: Hong, Lee & Thorne, “ORPO: Monolithic Preference Optimization without Reference Model” (2024). arXiv:2403.07691
- SimPO: Meng, Xia & Chen, “SimPO: Simple Preference Optimization with a Reference-Free Reward” (2024). arXiv:2405.14734
- KTO: Ethayarajh, Xu, Muennighoff, Jurafsky & Kiela, “KTO: Model Alignment as Prospect Theoretic Optimization” (2024). arXiv:2402.01306
See also
- Training: the base `Trainer` surface, dataset formats, callbacks.
- Recipes: paste-and-run SimPO / KTO / eval snippets.
- Environment variables → Training: the `BACKPROPAGATE_TRAINING__*` knobs.
- CLI reference: every `--method` / `--simpo-*` / `--kto-*` flag.