Documentation
VRAMPilot installs as a Python package and needs nothing else: no llama.cpp setup, no manual download of a llama-server binary. This page covers installation, how the recovery ladder works, what is persisted, how the watchdog behaves, and the requirements.
Install
pipx install vrampilot # or: pip install vrampilot
Run it on a model:
vrampilot your-model.gguf --ctx 8192 # CLI: plan -> launch -> recover -> serve
vrampilot-web # web UI: paste a .gguf path -> Launch -> chat
The web UI listens on http://127.0.0.1:8770. Both entry points end with an OpenAI-compatible endpoint you can point any client at. Module form also works: python -m vrampilot.cli / python -m vrampilot.web.
How the recovery ladder works
When the server fails to come up (or boots but cannot generate a token), VRAMPilot reads the error log, classifies it as out-of-memory, backs off one step, and retries. The order is designed to preserve the most capability:
- KV-cache precision first (f16 -> q8_0 -> q4_0). This shrinks the KV cache while keeping your full context. It is lossy on long reasoning — the report says so.
- Then shrink context — halve until it fits.
- Then offload more to CPU — more MoE expert-offload, or fewer GPU layers for dense models.
- Floor — if nothing fits, it says so honestly instead of pretending.
Validated example (raw trail in validation/RESULTS.md): given a deliberately impossible configuration at 262144 context, it recovered to 131072 context — 4x more than a context-only back-off would have kept — and served a real reply.
The OOM detector matches error strings across CUDA, Vulkan, ROCm and Metal.
Persistence
Every configuration that actually booted is persisted per (machine fingerprint, model header hash, requested context) in a local SQLite database at ~/.governor/configs.db (the engine keeps the governor name on purpose). It is append-only: a configuration that stops working is kept as history, never overwritten. The next launch boots the known-good configuration directly; a driver or GPU change invalidates the entry and triggers a replan.
Why it matters: a failed attempt costs a full model load — measured at ~220 s per attempt for a 9.5 GB MoE model loaded with --no-mmap on the gate machine's disk. The cache exists so you never pay that twice.
Transparency controls:
vrampilot configs list # inspect everything it remembers
vrampilot model.gguf --no-cache # bypass the cache entirely, force a replan
Nothing leaves your machine.
Watchdog behavior
While the server runs, the watchdog covers VRAM exhaustion during a generation — for example, you open another GPU application mid-stream.
- On NVIDIA, free VRAM is measured (real nvidia-smi reads, polled every few seconds), not estimated. A bad trend triggers a soft alert; crossing the auto-calibrated floor (400 MiB in the validated run) triggers a controlled restart at a degraded configuration, and the surviving configuration is persisted so the next launch starts there.
- The restart is not invisible. A generation in flight at restart time is lost, and the watchdog says so in its own critical message. There is no over-promise of continuity.
- Validated run under real external pressure (
validation/WATCHDOG.md): floor crossed at 102 MiB free, recovered in 223.9 s at context 8192 → 4096 — while the pressure server still held its VRAM. Counter-test on normal generations afterwards: 0 soft alerts, 0 interventions. - On Vulkan and Apple, free VRAM is an estimate, not good enough to act on: the watchdog honestly downgrades itself to a process + health watch and states it in its first event.
Known limit: llama.cpp cannot shrink a live server's KV cache, so the only runtime action is this controlled restart. If upstream lands hot KV resize, VRAMPilot will adopt it and retire the restart for that path.
Requirements
Python 3.9+ is the one declared prerequisite of the pip/pipx install — everything else (including the inference engine binary) is fetched and verified on first run.
- Python 3.9+. That is the only stated prerequisite of the pip/pipx distribution.
- No llama-server needed. The first run resolves one in this order:
$LLAMA_SERVER→ PATH → local cache → fetch the pinned llama.cpp build b9592 for your OS and GPU, with mandatory SHA256 verification — a hash mismatch deletes the file and aborts, with no fallback. The manifest pins 7 targets (Windows Vulkan/CUDA/CPU, Linux Vulkan/CPU, macOS arm64/x64). - Defaults: Vulkan on Windows and Linux (one binary for NVIDIA, AMD and Intel), Metal on macOS, CUDA opt-in via
GOVERNOR_BACKEND=cuda. GOVERNOR_SERVER_BASE_URLcan point the fetch at any mirror you control (own server, NAS, air-gapped share) — the pinned hashes still apply, so no mirror can substitute binaries. You can always pass--server /path/to/llama-serverinstead.