The wall

Everyone hits the same error.

The first time you run a local model, it crashes with CUDA out of memory — or silently crawls far slower. You're left guessing which quant and context fit. VRAMPilot does the guessing, and recovers when the guess is wrong.

How it works

Profile → plan → recover → serve.

Reads your GPU

Real VRAM, right now — NVIDIA, AMD or Apple. Not a static guess.

Plans a fit

Picks the quant, KV-cache precision and context that should fit your card.

Recovers from OOM

If it still overflows: KV-quant → shrink context → offload layers, and retries until it boots and generates.

Serves + tells you

An OpenAI-compatible endpoint, plus an honest note on what it traded off to fit.

Not a simulation

A real run, measured.

The actual output recovering an impossible config on a real RTX 3070 — KV-quant, then context, until it boots and serves.

vrampilot · RTX 3070 · CUDA

$ vrampilot serve Qwen2.5-7B-Q4.gguf
  GPU: NVIDIA RTX 3070 · 7.7 / 8.0 GB free
  plan: -ngl 99  -c 262144
  ✕ CUDA out of memory
  ↳ KV cache → q8_0  (keep context)   ✕ out of memory
  ↳ KV cache → q4_0  (keep context)   ✕ out of memory
  ↳ context  262144 → 131072          ✓ booted
  ✓ serving · OpenAI-compatible endpoint · reply: "OK"
  # recovered instead of crashing — measured, not staged

New — June 2026

It now remembers, watches, and installs itself.

Three additions since the first preview — each validated on real hardware, raw logs published under /proofs/.

NEW

Remembers what booted

Every config that actually booted is saved per machine + model (local SQLite, append-only). A failed attempt costs ~220 s of model loading — the cache exists so you never pay it twice.

NEW

Survives mid-generation OOM

Another app grabs VRAM mid-generation? The watchdog catches the collapse (floor crossed at 102 MiB free in the validated run), restarts at a degraded config, recovered in 223.9 s. Honest cost: the generation in flight is lost — it says so itself.

NEW

Installs itself

Nothing to set up: the first run fetches a pinned, verified inference build (b9592) with mandatory SHA256. Measured: 7.6 s from cold to a served completion — 1 GB test model already on disk, binary fetch included.

Any GPU

Tested on real hardware. All three.

Runs and serves a real completion on each — the OOM-recovery works across CUDA, Vulkan and Metal.

NVIDIA

✓ VALIDATED

RTX 3070 · CUDA — recovered & served

AMD

✓ VALIDATED

Radeon 780M · Vulkan — recovered & served

Apple

✓ RUNS & SERVES

M1 · Metal — runs & serves (forced-OOM recovery not yet demonstrated on Apple)

No hype

What it is — and isn't.

It is

Recovers from out-of-memory at runtime — unserved in every tool we probed by name (2026, see compare)
Remembers the config that actually booted (local, append-only)
Auto-picks the quant + context that fit your GPU
A clean automation layer on top of proven open-source inference · Win / Linux / macOS

It isn't

A new inference engine — a battle-tested open-source engine (llama.cpp) does the running (named in the docs)
Magic — a model truly too big for your GPU is told so, plainly
Sending anything anywhere — it's 100% local

Who builds this

About ZMLabs

Independent deep-engineering lab based in Sète, France. We build practical AI systems, automation platforms and next-generation software — focused on real-world reliability.

zmlabs.ai →