Google has released quantization-aware training (QAT) checkpoints for Gemma 4, cutting the memory footprint of the E2B text-only model to under 1GB so it can run on a phone.
The trick is selective compression. Rather than squeezing the whole model evenly, Google pushed the token-generating layers down to 2 bits while keeping the core reasoning layers at higher precision, and precomputed activation scaling during training so mobile chips skip that work at runtime. Because QAT simulates quantization during training, the model learns weights that are robust to rounding error, so quality stays closer to the full-precision reference than ordinary post-training quantization manages. Checkpoints ship in the popular Q4_0 format and a new mobile-specific schema, across five sizes from E2B up to 31B.
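To make the "simulated quantization" idea concrete, here is a minimal sketch of the quantize-then-dequantize step that QAT typically runs in the forward pass. It is an illustration under stated assumptions, not Google's recipe: the function names, the per-tensor symmetric scaling and the toy weights are all hypothetical, and real QAT pipelines also pass gradients through the rounding step (commonly with a straight-through estimator), which this sketch leaves out.

```python
def fake_quantize(weights, bits):
    """Round weights onto a signed integer grid, then map them back to floats.

    Hypothetical helper: symmetric, one scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1      # largest integer level, e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # smallest integer level, e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)            # snap to the nearest integer level
        q = max(qmin, min(qmax, q))     # clamp to the representable range
        out.append(q * scale)           # dequantize back to a float
    return out


def mean_abs_error(reference, approx):
    """Average absolute rounding error introduced by quantization."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Toy weights, made up for illustration only.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.61, -0.12, 0.28]

# Rounding error at each bit width; fewer bits means a coarser grid.
errors = {bits: mean_abs_error(weights, fake_quantize(weights, bits))
          for bits in (8, 4, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error grows as the bit width shrinks, which is why the layers pushed to the lowest precision are the ones that benefit most from seeing this rounding during training rather than only after it.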
Weights are already live on Hugging Face and wired into llama.cpp, Ollama and LM Studio. The pitch is plain: capable local models that fit on the hardware people actually carry, with no server round-trip.