Independent developer Ferran Duarri released GreenBoost, an open-source Linux kernel module and CUDA userspace shim that extends GPU VRAM with system DDR4 RAM and NVMe storage, on March 14, 2026, with no changes required to inference software. Licensed under GPL v2 and published on GitLab, the project emerged from a concrete problem: Duarri's RTX 5070 carries 12 GB of VRAM, but he wanted to run glm-4.7-flash:q8_0, a 31.8 GB quantized model, without the 5-10x throughput penalty of CPU layer offloading or the quality degradation of aggressive requantization. The solution routes overflow memory allocations to system RAM via DMA-BUF and CUDA external memory imports over the PCIe 4.0 x16 link, yielding roughly 32 GB/s of effective bandwidth: far below on-die VRAM speeds, but substantially faster than the CPU offloading approaches used by llama.cpp and similar tools.

The architecture splits into two cooperating components. A kernel module (greenboost.ko) allocates pinned DDR4 pages using 2 MB compound pages for efficiency, exports them as DMA-BUF file descriptors, and lets the GPU import them as CUDA external memory; from the CUDA runtime's perspective, those pages simply look like device-accessible memory. A CUDA shim (libgreenboost_cuda.so), injected via LD_PRELOAD, intercepts cudaMalloc, cudaMallocAsync, and related calls, passing small allocations through to the standard runtime while redirecting large ones (model weights and KV cache that overflow VRAM) to the kernel module. Ollama resolves GPU symbols internally via dlopen and dlsym, bypassing LD_PRELOAD, so the shim also intercepts dlsym itself, bootstrapping with dlvsym and a GLIBC version tag, and returns hooked versions of cuDeviceTotalMem_v2 and nvmlDeviceGetMemoryInfo; otherwise Ollama would see only the physical 12 GB and push layers to the CPU.

The project's technical foundation was partly enabled by NVIDIA's own 2022 decision to open-source its GPU kernel modules under MIT/GPLv2. Duarri explicitly credits the nvidia/open-gpu-kernel-modules repository as invaluable for understanding UVM page fault handling and the DMA-BUF external memory import path. That NVIDIA's own open-source release handed a developer the tools to patch over its consumer VRAM segmentation is an awkward irony the company has structural reasons not to address. Its enterprise answer to the same problem, the Grace Hopper GH200 with 480 GB of unified memory over NVLink-C2C at 900 GB/s, costs orders of magnitude more than the consumer RTX lineup it deliberately segments by VRAM tier. <a href="/news/2026-03-14-runanywhere-launches-rcli-on-device-voice-ai-with-proprietary-metalrt-inference">Apple's M-series unified memory architecture</a> sidesteps the problem entirely at the silicon level, a structural advantage that has made Apple Silicon Macs the default hardware for users running large models locally.

GreenBoost bundles a curated inference optimization stack (ExLlamaV3, kvpress, NVIDIA ModelOpt, TensorRT-Edge-LLM, and Unsloth with LoRA) accessible through a unified optimize-model command. Observed performance on the RTX 5070 running the 31.8 GB model ranges from 2-5 tokens per second at baseline with the CUDA shim to 25-60 tokens per second when the model is quantized down to 8 GB with ExLlamaV3 EXL3 and fits fully in VRAM. Duarri frames the DDR4 pool as most useful for KV cache once the model itself is compressed enough to reside in VRAM, with the NVMe tier serving as a safety net rather than a primary inference path. The project had 34 commits on its first public day. A companion terminal tool called Synapse remains proprietary for now.