Simulator below is illustrative · the measured runs are further down, with raw logs
Local AI, unstuck

Run big AI models on a small GPU.
When it runs out of memory, it recovers — it doesn't crash.

Point it at a model. It fits it to your GPU, and when the memory overflows it backs off and keeps running instead of dying. Works on NVIDIA, AMD and Apple.

▸ Simulate a fit — illustrative model, not your hardware

Run your first model — free

100% local · no account · no tracking · your conversations never leave your computer
The wall
Everyone hits the same error.
The first time you run a local model, it crashes with CUDA out of memory, or silently slows to a crawl. You're left guessing which quant and context will fit. VRAMPilot does the guessing, and recovers when the guess is wrong.
How it works
Profile → plan → recover → serve.
01

Reads your GPU

Real VRAM, right now — NVIDIA, AMD or Apple. Not a static guess.

02

Plans a fit

Picks the quant, KV-cache precision and context that should fit your card.

03

Recovers from OOM

If it still overflows, it steps down in order (quantize the KV cache → shrink the context → offload layers) and retries until the model boots and generates.

04

Serves + tells you

An OpenAI-compatible endpoint, plus an honest note on what it traded off to fit.
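The recover step can be pictured as a small retry ladder. What follows is a minimal sketch, not VRAMPilot's actual code: `Config`, `degrade_steps`, `recover` and the `try_boot` callback are hypothetical names, and the halving and 8-layer step sizes are assumptions chosen for illustration. Only the order of the rungs (KV quant, then context, then layer offload) comes from the description above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterator, Optional

# Hypothetical precision ladder, from most to least memory-hungry.
KV_LADDER = ("f16", "q8_0", "q4_0")


@dataclass(frozen=True)
class Config:
    kv_type: str = "f16"   # KV-cache precision
    ctx: int = 262144      # context length in tokens
    gpu_layers: int = 99   # layers kept on the GPU


def degrade_steps(cfg: Config, min_ctx: int = 2048) -> Iterator[Config]:
    """Yield progressively cheaper configs: KV quant, then context, then offload."""
    # Rung 1: lower the KV-cache precision, keeping the context.
    # Raises ValueError if cfg.kv_type is not on the ladder.
    for kv in KV_LADDER[KV_LADDER.index(cfg.kv_type) + 1:]:
        cfg = replace(cfg, kv_type=kv)
        yield cfg
    # Rung 2: halve the context until a floor is reached.
    while cfg.ctx // 2 >= min_ctx:
        cfg = replace(cfg, ctx=cfg.ctx // 2)
        yield cfg
    # Rung 3: move layers off the GPU, eight at a time.
    while cfg.gpu_layers > 0:
        cfg = replace(cfg, gpu_layers=max(0, cfg.gpu_layers - 8))
        yield cfg


def recover(cfg: Config, try_boot: Callable[[Config], bool]) -> Optional[Config]:
    """Return the first config that boots, or None if every rung fails."""
    if try_boot(cfg):
        return cfg
    for candidate in degrade_steps(cfg):
        if try_boot(candidate):
            return candidate
    return None
```

In a real tool `try_boot` would launch the inference server and check that it generates; here it is left as a callback so the ladder can be exercised without a GPU.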

Not a simulation
A real run, measured.
The actual output from recovering an impossible config on a real RTX 3070: KV-quant first, then context, until it boots and serves.
vrampilot · RTX 3070 · CUDA
$ vrampilot serve Qwen2.5-7B-Q4.gguf
  GPU: NVIDIA RTX 3070 · 7.7 / 8.0 GB free
  plan: -ngl 99  -c 262144
  ✕ CUDA out of memory
   KV cache → q8_0  (keep context)   ✕ out of memory
   KV cache → q4_0  (keep context)   ✕ out of memory
   context  262144 → 131072          ✓ booted
  ✓ serving · OpenAI-compatible endpoint · reply: "OK"
  # recovered instead of crashing — measured, not staged
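A back-of-envelope estimate suggests why the run above only boots once the context is halved. The sketch below computes KV-cache size from model shape. The shape values (28 layers, 4 KV heads, head dimension 128) are assumptions for a Qwen2.5-7B-class model and should be checked against the GGUF metadata. The bytes-per-element figures follow llama.cpp's block layouts: 34 bytes per 32 values for q8_0, 18 bytes per 32 values for q4_0. Weights, compute buffers and driver overhead are ignored, so real usage is higher.

```python
# Bytes stored per cached element for each KV-cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_bytes(ctx, kv_type, n_layers=28, n_kv_heads=4, head_dim=128):
    """Estimate KV-cache size in bytes (shape defaults are assumptions)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V entries per token
    return int(ctx * per_token * BYTES_PER_ELEMENT[kv_type])


def gib(n_bytes):
    return n_bytes / 2**30


for ctx, kv in [(262144, "f16"), (262144, "q8_0"), (262144, "q4_0"), (131072, "q4_0")]:
    print(f"ctx={ctx:>6}  kv={kv:<4}  {gib(kv_cache_bytes(ctx, kv)):.2f} GiB")
# ctx=262144  kv=f16   14.00 GiB
# ctx=262144  kv=q8_0  7.44 GiB
# ctx=262144  kv=q4_0  3.94 GiB
# ctx=131072  kv=q4_0  1.97 GiB
```

Under these assumptions the f16 cache alone exceeds an 8 GB card, and even the q4_0 cache at full context leaves little room for the weights. If the q4_0 cache is kept, halving the context frees roughly another 2 GiB.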
New — June 2026
It now remembers, watches, and installs itself.
Three additions since the first preview — each validated on real hardware, raw logs published under /proofs/.
NEW

Remembers what booted

Every config that actually booted is saved per machine + model (local SQLite, append-only). A failed attempt costs ~220 s of model loading — the cache exists so you never pay it twice.
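An append-only cache of this kind can be sketched with Python's built-in `sqlite3` module. The table name, columns and function names below are hypothetical, chosen for illustration; the actual schema is not documented here.

```python
import sqlite3
import time

# Hypothetical schema: one row per successful boot, never updated or deleted.
SCHEMA = """
CREATE TABLE IF NOT EXISTS booted_configs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    machine    TEXT    NOT NULL,
    model      TEXT    NOT NULL,
    kv_type    TEXT    NOT NULL,
    ctx        INTEGER NOT NULL,
    gpu_layers INTEGER NOT NULL,
    booted_at  REAL    NOT NULL
)
"""


def open_cache(path=":memory:"):
    """Open (or create) the cache; ':memory:' keeps it in RAM for testing."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record_boot(db, machine, model, kv_type, ctx, gpu_layers):
    """Append one row for a config that booted. Rows are only ever inserted."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO booted_configs "
            "(machine, model, kv_type, ctx, gpu_layers, booted_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (machine, model, kv_type, ctx, gpu_layers, time.time()),
        )


def last_booted(db, machine, model):
    """Return (kv_type, ctx, gpu_layers) of the newest row, or None."""
    return db.execute(
        "SELECT kv_type, ctx, gpu_layers FROM booted_configs "
        "WHERE machine = ? AND model = ? ORDER BY id DESC LIMIT 1",
        (machine, model),
    ).fetchone()
```

Looking up the newest row before planning lets a later launch start from a config known to boot instead of repeating a failed attempt.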

NEW

Survives mid-generation OOM

Another app grabs VRAM mid-generation? The watchdog catches the collapse (floor crossed at 102 MiB free in the validated run) and restarts at a degraded config; in that run, recovery took 223.9 s. Honest cost: the generation in flight is lost, and the tool says so itself.
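The core of such a watchdog is a floor check on free VRAM. The sketch below is illustrative only: `watch`, its parameters and the 128 MiB floor in the usage lines are hypothetical, and the real threshold is not stated here. Readings are passed in as an iterable so the logic can be run without a GPU; on NVIDIA they could come from polling `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`.

```python
from typing import Callable, Iterable


def watch(
    readings: Iterable[int],
    floor_mib: int,
    on_collapse: Callable[[int], None],
) -> bool:
    """Scan free-VRAM readings (MiB); fire on_collapse at the first one below the floor.

    Returns True if the floor was crossed, False if the readings ran out first.
    """
    for free_mib in readings:
        if free_mib < floor_mib:
            # In a real tool this is where the server would be restarted
            # at a degraded config; the in-flight generation is lost.
            on_collapse(free_mib)
            return True
    return False


# Usage with made-up readings and an arbitrary 128 MiB floor:
crossed_at = []
watch([4000, 900, 102, 50], floor_mib=128, on_collapse=crossed_at.append)
print(crossed_at)  # [102]
```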

NEW

Installs itself

Nothing to set up: the first run fetches a pinned inference build (b9592) and verifies it with a mandatory SHA256 check. Measured: 7.6 s from cold to a served completion, binary fetch included, with a 1 GB test model already on disk.
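A mandatory checksum gate can be sketched with the standard library alone. The function names below are hypothetical, and the expected digest would come from wherever the pinned build's hash is published; none of this is taken from VRAMPilot's source.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large binaries never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Raise ValueError unless the file's SHA256 matches the pinned digest."""
    actual = sha256_of(path)
    # compare_digest avoids leaking where the two strings first differ.
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(f"SHA256 mismatch for {path}: got {actual}")
    return actual
```

Refusing to run on a mismatch, instead of warning and continuing, is what makes the check mandatory.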

Any GPU
Tested on real hardware. All three.
Runs and serves a real completion on each. OOM recovery works across CUDA, Vulkan and Metal.
No hype
What it is — and isn't.

It is

  • Recovers from out-of-memory at runtime, a gap left open by every tool we probed by name (2026, see compare)
  • Remembers the config that actually booted (local, append-only)
  • Auto-picks the quant + context that fit your GPU
  • A clean automation layer on top of proven open-source inference · Win / Linux / macOS

It isn't

  • A new inference engine: a battle-tested open-source engine, llama.cpp (named in the docs), does the running
  • Magic — a model truly too big for your GPU is told so, plainly
  • Sending anything anywhere — it's 100% local
Who builds this
About ZMLabs

Independent deep-engineering lab based in Sète, France. We build practical AI systems, automation platforms and next-generation software — focused on real-world reliability.

zmlabs.ai →