Getting started
Install
    npm install -g @mcptoolshop/bytefit

Requires Node ≥ 20. Ollama is optional but recommended: bytefit reads your installed Ollama models out of the box.
See your hardware
    bytefit probe

    NVIDIA GeForce RTX 5090  31.8 GiB VRAM (28.9 free) @ 1792 GB/s
    RAM  63.4 GiB (46.7 free) @ 76.8 GB/s

bytefit reports honest free memory: if another process (or a resident Ollama model) is holding VRAM, you’ll see less free, and the recommendations adjust accordingly.
Rank your models
    bytefit recommend

    NVIDIA GeForce RTX 5090 / 31.8 GiB VRAM / 63.4 GiB RAM — 10 models, 10 runnable:

    qwen3.6:35b-a3b    FITS  Q4_K_M  q8_0  ctx8192  ~132 tok/s  [vram]
    mistral-small:24b  FITS  Q4_K_M  q8_0  ctx8192  ~84 tok/s   [vram]
    qwen3.6:27b        FITS  Q4_K_M  q8_0  ctx8192  ~74 tok/s   [vram]
    gemma4:31b         FITS  Q4_K_M  q8_0  ctx8192  ~60 tok/s   [vram]

Each row is a loadout: the verdict (FITS / DEGRADED / REFUSED), the chosen quant and KV-cache type, the context length, a predicted tok/s, and the memory tier (vram, vram+ram, or the experimental disk).
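If you want to post-process these rows in a script, a minimal sketch along the following lines may help. It assumes the whitespace-separated layout of the sample rows above, which is not a documented format, and it has only been checked against FITS rows; DEGRADED or REFUSED rows may look different. The names `Loadout` and `parse_row` are illustrative and not part of bytefit.

```python
import re
from dataclasses import dataclass

# Layout inferred from the sample rows above; this is an assumption,
# not a documented output format.
ROW = re.compile(
    r"^(?P<model>\S+)\s+"
    r"(?P<verdict>FITS|DEGRADED|REFUSED)\s+"
    r"(?P<quant>\S+)\s+"
    r"(?P<kv_cache>\S+)\s+"
    r"ctx(?P<ctx>\d+)\s+"
    r"~(?P<tok_per_s>\d+(?:\.\d+)?)\s*tok/s\s+"
    r"\[(?P<tier>[^\]]+)\]$"
)


@dataclass(frozen=True)
class Loadout:
    """One row of `bytefit recommend` output (hypothetical helper type)."""
    model: str
    verdict: str
    quant: str
    kv_cache: str
    ctx: int
    tok_per_s: float
    tier: str


def parse_row(line: str) -> Loadout:
    """Split one recommend row into its fields, or raise ValueError."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    f = match.groupdict()
    return Loadout(
        model=f["model"],
        verdict=f["verdict"],
        quant=f["quant"],
        kv_cache=f["kv_cache"],
        ctx=int(f["ctx"]),
        tok_per_s=float(f["tok_per_s"]),
        tier=f["tier"],
    )


rows = [
    "qwen3.6:35b-a3b    FITS  Q4_K_M  q8_0  ctx8192  ~132 tok/s  [vram]",
    "mistral-small:24b  FITS  Q4_K_M  q8_0  ctx8192  ~84 tok/s   [vram]",
    "qwen3.6:27b        FITS  Q4_K_M  q8_0  ctx8192  ~74 tok/s   [vram]",
    "gemma4:31b         FITS  Q4_K_M  q8_0  ctx8192  ~60 tok/s   [vram]",
]
loadouts = [parse_row(r) for r in rows]
fastest = max(loadouts, key=lambda l: l.tok_per_s)
print(fastest.model, fastest.tok_per_s)
```

Scraping text is fragile; where a command offers JSON output, that is likely the sturdier input for scripts.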
Plan one model
    bytefit plan qwen3.6:27b

plan prints the reasoning trace and the exact command to launch it:

    qwen3.6:27b: FITS Q4_K_M q8_0 ctx8192 ~74 tok/s
      - Usable 27.0 GiB VRAM + 43.1 GiB RAM after headroom; KV 0.3 GiB (q8_0, 8192 ctx).
      - Quant Q4_K_M: 16.2 GiB weights.
      - Fits fully in VRAM. ~74 tok/s.

    llama.cpp: llama-server -m <model.gguf> -c 8192 -ngl 64 -ctk q8_0 -ctv q8_0 -fa on

Pick a backend with --backend ollama or --backend lmstudio, change the context with --ctx, or get JSON with --json. Every flag is in the CLI reference.
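The arithmetic in the trace can be checked by hand: 16.2 GiB of weights plus 0.3 GiB of KV cache is 16.5 GiB, which is below the 27.0 GiB of usable VRAM, so the model fits fully in VRAM. The sketch below reproduces that comparison. The function `pick_tier` and its fallback order are assumptions made for illustration; bytefit's actual decision logic, and how tiers map to the FITS / DEGRADED / REFUSED verdicts, are not described here.

```python
def pick_tier(weights_gib: float, kv_gib: float,
              usable_vram_gib: float, usable_ram_gib: float) -> str:
    """Return the smallest memory tier that can hold weights plus KV cache.

    Hypothetical helper: a plain capacity comparison, not bytefit's
    real algorithm.
    """
    need = weights_gib + kv_gib
    if need <= usable_vram_gib:
        return "vram"
    if need <= usable_vram_gib + usable_ram_gib:
        return "vram+ram"
    return "disk"


# Numbers taken from the qwen3.6:27b trace above.
need = 16.2 + 0.3
tier = pick_tier(16.2, 0.3, usable_vram_gib=27.0, usable_ram_gib=43.1)
print(f"need {need:.1f} GiB -> {tier}")
```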
Models bytefit can’t fully read
You aren’t limited to installed Ollama models:

    bytefit recommend --dir ~/models                                  # a folder of .gguf files
    bytefit plan unsloth/Qwen3-14B-GGUF --hf unsloth/Qwen3-14B-GGUF   # a HF repo, no download

--hf reads only the GGUF header over HTTPS; it never downloads the weights.
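To illustrate why reading a header does not require the weights, here is a rough sketch of the general technique: an HTTP Range request for the first bytes of a file, followed by parsing the fixed GGUF preamble (magic, version, tensor count, metadata key/value count, all little-endian). This is not bytefit's implementation. The helper names are hypothetical, the URL passed to `fetch_header` is supplied by the caller, and the metadata that follows the preamble (architecture, layer counts and so on) would need a larger range than the 24 bytes read here.

```python
import struct
import urllib.request

# Fixed GGUF preamble: 4-byte magic, uint32 version,
# uint64 tensor_count, uint64 metadata_kv_count (little-endian).
GGUF_HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(data: bytes) -> dict:
    """Decode the fixed 24-byte GGUF preamble from the start of `data`."""
    if len(data) < GGUF_HEADER.size:
        raise ValueError("need at least 24 bytes")
    magic, version, tensor_count, kv_count = GGUF_HEADER.unpack_from(data)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def fetch_header(url: str) -> dict:
    """Fetch only the preamble bytes via an HTTP Range request.

    Assumes the server honours Range; the read is capped either way.
    """
    last_byte = GGUF_HEADER.size - 1
    request = urllib.request.Request(url, headers={"Range": f"bytes=0-{last_byte}"})
    with urllib.request.urlopen(request) as response:
        return parse_gguf_header(response.read(GGUF_HEADER.size))


# Offline demonstration with a synthetic preamble (values are made up).
sample = GGUF_HEADER.pack(b"GGUF", 3, 291, 25)
print(parse_gguf_header(sample))
```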