Changelog
Release history. Every figure below is read from a committed validation file — the rule of this site.
v0.2.0 — 2026-06-11
- Renamed to VRAMPilot. The product was previously called Inference Autopilot; the Python package is now vrampilot. The generic engine keeps the governor name (~/.governor/, GOVERNOR_* environment variables) on purpose.
- Zero-prerequisite first run. The first launch fetches a pinned llama.cpp build (b9592) for the detected OS and GPU, with mandatory SHA256 verification: a mismatch deletes the file and aborts. Gate measured at 7.6 s from a cold start to a real served completion.
- Clean Windows VM validated. A brand-new Windows 11 VM (no GPU, the worst case, exercising the CPU fallback) passed the same gate in 12.2 s, after finding and fixing two real first-run blockers the scrubbed-host test could not see: a fresh Windows TLS root-CA store breaking the first download (fixed with an OS curl.exe fallback, safe because the pinned SHA256, not the transport, is the security) and the missing MSVC runtime that official llama.cpp Windows binaries import (fixed by bundling 3 DLLs, deployed app-locally only when missing).
- GitHub Releases confirmed as the primary binary host. The host needs no trust: the pinned manifest plus mandatory SHA256 is the security. GOVERNOR_SERVER_BASE_URL remains the self-hosting mechanism for any mirror. Windows code signing is postponed.
- Energy measurement gate. Decode measured at 0.84 J/token (CPU-side) on the bench rig. The energy probe is observation-only and optional: the product never depends on the instrument, and when it is absent, energy is reported as absent, never estimated.
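
The "mismatch deletes the file and aborts" rule from the first-run item can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the function name verify_or_delete and the exception ChecksumMismatch are hypothetical, and the pinned digest is assumed to arrive as a hex string from the manifest.

```python
import hashlib
import os


class ChecksumMismatch(Exception):
    """Raised when a downloaded file does not match its pinned digest."""


def verify_or_delete(path: str, pinned_sha256: str, chunk_size: int = 1 << 20) -> None:
    """Hash the file in chunks; on mismatch, delete it and raise.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch, not part of the released package.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large binaries are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != pinned_sha256.lower():
        os.remove(path)  # never leave an unverified binary on disk
        raise ChecksumMismatch(f"SHA256 mismatch for {path}")
```

Because the check runs on the bytes that landed on disk, it gives the same guarantee whichever transport or mirror delivered them, which is the reasoning behind the curl.exe fallback and the untrusted-host items above.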
v0.1.x — June 2026 (sprint)
- Honesty pass. Every VRAM reading is labeled measured or estimated, enforced by the type. The KV-cache estimate reads head_dim from the GGUF when present; the fallback is documented.
- Persistence: the machine that learns. Every configuration that actually booted is persisted in a local append-only SQLite database; the next launch starts at the known-good configuration. Gate passed on real runs: a failed attempt costs a full model load, measured at ~220 s per attempt for a 9.5 GB MoE. The gate also found and fixed a real bug: a fixed boot timeout was killing healthy boots of large models mid-load.
- In-inference watchdog. Controlled restart at a degraded configuration when free VRAM crosses the auto-calibrated floor mid-generation. Gate passed under real external VRAM pressure: floor crossed at 102 MiB free, recovered in 223.9 s while the pressure stayed; counter-test 0 soft alerts, 0 interventions on normal generations.
- Governor extraction. The recovery loop was extracted into a domain-agnostic engine (core.py) with frozen contracts; the persistence and watchdog gates were re-passed identically on the refactored code, which is the non-regression proof.
- Cross-platform validation. Runs and serves a real completion on 3 GPU vendors on real hardware: NVIDIA (CUDA), AMD (Vulkan), Apple (Metal). OOM-recovery demonstrated clean on NVIDIA and AMD; on Apple the mechanism is in place but a clean forced-OOM demo was not feasible on the small CI runner (stated, not hidden).
- Market probe. Tried by name to kill the differentiator against LM Studio, Ollama and Jan; runtime OOM-recovery survived as genuinely unserved (validation/MARKET.md).
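
The persistence item describes an append-only store of configurations that booted, consulted at the next launch. A minimal sketch of that idea with Python's built-in sqlite3 module might look as follows. The table name, column names, function names and the configuration keys (n_gpu_layers, ctx) are all assumptions made for illustration; the real schema is not shown in this changelog.

```python
import json
import sqlite3
import time
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS booted_configs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " model TEXT NOT NULL,"
        " config_json TEXT NOT NULL,"
        " booted_at REAL NOT NULL)"
    )
    return conn


def record_boot(conn: sqlite3.Connection, model: str, config: dict) -> None:
    """Append one row per successful boot; rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO booted_configs (model, config_json, booted_at) VALUES (?, ?, ?)",
        (model, json.dumps(config, sort_keys=True), time.time()),
    )
    conn.commit()


def last_known_good(conn: sqlite3.Connection, model: str) -> Optional[dict]:
    """Return the most recently booted configuration for a model, if any."""
    row = conn.execute(
        "SELECT config_json FROM booted_configs"
        " WHERE model = ? ORDER BY id DESC LIMIT 1",
        (model,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Starting from last_known_good instead of a default configuration is what avoids paying for a failed attempt, which the gate above measured at roughly 220 s per attempt for a 9.5 GB MoE.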