CLI reference

Detect the GPU, VRAM, RAM, and (optionally) measured NVMe bandwidth, and print the hardware profile.

Rank the available models best-first for the detected hardware, dropping any that can’t run interactively.

Plan one model: choose the quant, KV-cache, context, and offload settings, predict tok/s, and emit ready-to-run arguments, or refuse with a structured reason. An exact model id wins; an unambiguous prefix also resolves. An ambiguous prefix lists the candidates and exits with code 2.
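The resolution rule above (exact id first, then a unique prefix, otherwise list the candidates) can be sketched as follows. This is only an illustration of the rule as described, not bytefit's actual code: the function name `resolve_model`, the `AmbiguousModel` exception, the example model ids, and the choice of case-sensitive matching are all assumptions.

```python
class AmbiguousModel(Exception):
    """Raised when a prefix matches more than one model id (hypothetical type)."""

    def __init__(self, query, candidates):
        super().__init__(f"ambiguous model {query!r}: {', '.join(candidates)}")
        self.candidates = candidates


def resolve_model(query, known_ids):
    """Return the model id that `query` names, or None if nothing matches."""
    # An exact id always wins, even if it is also a prefix of other ids.
    if query in known_ids:
        return query
    # Otherwise fall back to prefix matching (case-sensitive: an assumption).
    matches = sorted(m for m in known_ids if m.startswith(query))
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return None
    # More than one candidate: the caller would list them and exit with code 2.
    raise AmbiguousModel(query, matches)


# Hypothetical model ids, for illustration only.
ids = ["llama3:8b", "llama3:70b", "qwen2.5:7b"]
print(resolve_model("qw", ids))
```

A caller would map `None` to exit code 1 (model not found) and `AmbiguousModel` to exit code 2, matching the exit codes documented on this page.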

| Flag | Applies to | Meaning |
| --- | --- | --- |
| --json | all | Machine-readable JSON output |
| --dir <path> | recommend, plan | Also scan a folder of .gguf files |
| --hf <repo> | recommend, plan | Also rank a Hugging Face GGUF repo without downloading it (opt-in network) |
| --ctx <n> | recommend, plan | Context length in tokens (default 8192) |
| --use-case <c> | recommend, plan | reasoning, chat, or bulk; gates the quant floor |
| --backend <b> | plan | llama.cpp, ollama, or lmstudio (default llama.cpp) |
| --experimental | recommend, plan | Allow the experimental MoE disk-streaming tier (MoE only) |
| -h, --help | | Show help |

| Code | Meaning |
| --- | --- |
| 0 | OK |
| 1 | Model not found, or the loadout was refused |
| 2 | Usage error (bad command, ambiguous model, unknown backend or --use-case, out-of-range --ctx) |
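A script that drives the plan command can branch on these exit codes. The sketch below is a minimal wrapper under stated assumptions: that the executable is named `bytefit`, that the model id is passed as a positional argument after `plan`, and that a successful `--json` run prints a single JSON document on stdout. None of these details is confirmed by this page, and the JSON fields are left uninterpreted because the schema is not documented here.

```python
import json
import subprocess

# Exit codes as listed in the table on this page.
EXIT_MEANINGS = {
    0: "OK",
    1: "model not found, or the loadout was refused",
    2: "usage error",
}


def build_plan_argv(model, ctx=8192, backend="llama.cpp", use_case=None):
    """Build the argument list for a plan run (argument order is an assumption)."""
    argv = ["bytefit", "plan", model, "--json",
            "--ctx", str(ctx), "--backend", backend]
    if use_case is not None:
        argv += ["--use-case", use_case]
    return argv


def run_plan(model, **options):
    """Run the plan command and return its parsed JSON output on success."""
    proc = subprocess.run(build_plan_argv(model, **options),
                          capture_output=True, text=True, check=False)
    if proc.returncode != 0:
        meaning = EXIT_MEANINGS.get(proc.returncode, "undocumented exit code")
        raise RuntimeError(
            f"bytefit exited {proc.returncode} ({meaning}): {proc.stderr.strip()}")
    # Assumption: stdout holds one JSON document when --json is given.
    return json.loads(proc.stdout)
```

Separating `build_plan_argv` from `run_plan` keeps the argument construction testable without the binary installed.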

bytefit reads installed Ollama models by default (via OLLAMA_HOST, default http://127.0.0.1:11434). Add a local folder of .gguf files with --dir <path>, or a Hugging Face GGUF repo with --hf <repo>. The --hf option reads only the GGUF header over HTTPS and never downloads the weights.
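To illustrate why reading a header is cheap, here is a minimal sketch that fetches only the first bytes of a remote file with an HTTP Range request and decodes the fixed part of a GGUF header. It is not bytefit's implementation, which this page does not describe. The helper names `fetch_prefix` and `parse_gguf_header` are hypothetical, the URL is supplied by the caller, and the 24-byte layout shown applies to GGUF version 2 and later. A real tool would need to read further into the metadata key-value section to learn details such as architecture or context length.

```python
import struct
import urllib.request

GGUF_MAGIC = b"GGUF"
# magic (4 bytes) + version (uint32) + tensor count (uint64) + metadata KV count (uint64)
HEADER_SIZE = 24


def parse_gguf_header(data):
    """Decode the fixed-size, little-endian start of a GGUF v2+ file."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes, got {len(data)}")
    magic, version, tensor_count, kv_count = struct.unpack(
        "<4sIQQ", data[:HEADER_SIZE])
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def fetch_prefix(url, nbytes=HEADER_SIZE):
    """Fetch only the first `nbytes` of a remote file via an HTTP Range request."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(request) as response:
        # read(nbytes) caps the transfer even if the server ignores Range.
        return response.read(nbytes)
```

Usage would look like `parse_gguf_header(fetch_prefix(url))`, where `url` points at a .gguf file on a server that supports range requests.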