Google DeepMind just dropped Gemma 4, a family of four open models built from the same research that powers Gemini 3. The lineup spans E2B and E4B for edge devices, plus 26B-A4B and 31B for more demanding tasks. All four support native function calling and multimodal reasoning across vision and audio, and cover 140 languages. Apache 2.0 licensed, available now on Hugging Face.
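To make "native function calling" concrete, here's a minimal sketch of the usual loop: you hand the model a tool schema, it emits a structured tool call, and your code dispatches it. The schema format and tool-call JSON shown are assumptions in the common JSON-schema style, not Gemma 4's documented format — check the model card for the real shape.

```python
import json

# Hypothetical tool schema in the common JSON-schema style; the exact
# format Gemma 4 expects may differ -- check the model card.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in implementation; a real agent would call a weather API.
    return f"Sunny in {city}"

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    if call["name"] == "get_weather":
        return get_weather(**call["arguments"])
    raise ValueError(f"Unknown tool: {call['name']}")

# A model with native function calling would emit something like:
model_output = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'
print(dispatch(model_output))  # -> Sunny in Lagos
```

The point of the structured loop is that the model never executes anything itself; your dispatcher stays in control of which functions can actually run.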
The "E" in the smaller models stands for Effective, a nod to Per-Layer Embeddings, a technique that compresses vocabulary lookup overhead so the memory footprint is smaller than the raw parameter count suggests. That lets the E4B model run in as little as 4-6GB of RAM at 4-bit quantization, which is genuinely useful for local inference on phones and IoT devices. The 26B-A4B uses a Mixture-of-Experts architecture with only 4B parameters active per token, giving you near-frontier performance at a fraction of the compute cost.
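The 4-6GB figure checks out on the back of an envelope: roughly 4B effective parameters at 4 bits each, plus room for activations and KV cache. The overhead allowance below is an illustrative assumption, not a measured number.

```python
def quantized_footprint_gb(n_params: float, bits_per_weight: int,
                           overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate: weights at the given precision plus a fixed
    allowance (assumed here) for KV cache and runtime buffers."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# E4B: ~4B effective parameters at 4-bit quantization.
print(round(quantized_footprint_gb(4e9, 4), 1))  # -> 3.5
```

Weights alone come to about 2GB at 4 bits, so landing in the 4-6GB range once context length grows is plausible.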
Benchmarks look solid. The 31B model hits 85.2% on MMLU and 86.4% on τ2-bench for agentic tool use. Simon Willison reported that the 26B model handles complex SVG generation well, though he noted the 31B emitted loops of dashes at times during local testing. The models also expose explicit thinking control through special tokens, letting you toggle the reasoning trace on or off. That toggle, combined with native function calling and 4GB RAM deployment, means you can run actual agents locally without calling home.