Google dropped the AI Edge Gallery app for iPhone this week. It runs Gemma 4 models entirely on your phone, no internet connection required, so your prompts, images, and conversations never leave your hardware. The centerpiece is Gemma 4 E2B, a 2-billion-parameter model that handles reasoning and creative tasks. On an iPhone 17 Pro it hits 56.5 tokens per second on the GPU, which is genuinely fast for local inference.
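To put that throughput in perspective, here's a back-of-the-envelope calculation. The response lengths and the roughly 1.3 tokens-per-word ratio are my own assumptions for illustration, not figures from Google:

```python
# Rough latency estimates at the reported 56.5 tokens/sec.
# TOKENS_PER_WORD (~1.3 for English) and the response lengths
# are assumptions for illustration, not published numbers.
TOKENS_PER_SEC = 56.5
TOKENS_PER_WORD = 1.3

for words in (50, 150, 400):  # short reply, paragraph, long answer
    tokens = words * TOKENS_PER_WORD
    print(f"{words:3d} words (~{tokens:.0f} tokens): {tokens / TOKENS_PER_SEC:.1f}s")
```

Under those assumptions, a paragraph-length reply streams in about three and a half seconds, which backs up the "genuinely fast" claim.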
The feature set goes well beyond basic chat. Agent Skills let the model pull in tools like Wikipedia and interactive maps. Thinking Mode shows you the model's step-by-step reasoning, which is useful for understanding how it reaches conclusions. Ask Image handles multimodal tasks through your camera. Audio Scribe does real-time transcription. There's even a mini-game, Tiny Garden, where you plant and harvest a virtual garden with natural-language commands. The app is built on LiteRT-LM, Google's orchestration layer for edge AI, and the source code lives on GitHub if you want to dig in.
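Google hasn't documented the Agent Skills protocol in detail, but tool use in LLM apps generally follows a parse-dispatch-append loop: the model emits a structured tool call, the host runs it, and the result goes back into the context. Here's a minimal sketch under that assumption; the JSON action format, the `SKILLS` registry, and the stubbed `model_generate` are all hypothetical, not the app's actual API:

```python
import json

# Hypothetical skill registry. In AI Edge Gallery these would be
# Wikipedia lookups, map queries, etc.; stubbed here so the sketch runs.
SKILLS = {
    "wikipedia": lambda query: f"[stub] Wikipedia summary for {query!r}",
}

def model_generate(history):
    # Stand-in for on-device inference (via LiteRT-LM in the real app).
    # Returns either a JSON tool call or a final answer; hardcoded here.
    if not any(msg["role"] == "tool" for msg in history):
        return json.dumps({"skill": "wikipedia", "query": "Gemma"})
    return "Gemma is a family of open-weight models from Google."

def agent_loop(user_prompt, max_steps=5):
    history = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        output = model_generate(history)
        try:
            call = json.loads(output)   # model asked to use a skill
        except json.JSONDecodeError:
            return output               # plain text: final answer
        result = SKILLS[call["skill"]](call["query"])
        history.append({"role": "tool", "content": result})
    return output

print(agent_loop("What is Gemma?"))
```

A real implementation would also validate arguments and cap tool calls, but the loop structure is the same regardless of which skills are plugged in.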
Developers are already pushing boundaries. Fikri Karim built Parlor, which pairs Gemma 4 with Kokoro text-to-speech for fully local voice conversations at 2.5-3 seconds of latency. The Heretic project heads somewhere more contentious: it uses directional ablation to strip safety alignment from Gemma 4 by removing "refusal directions" from the model's activations, no retraining required. The technique exposes how fragile RLHF constraints really are.
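Directional ablation itself is a one-line linear algebra operation: given a unit "refusal direction" extracted from activation differences, you project it out of each hidden state, h' = h - (h . r)r. Below is a minimal numpy sketch of that projection; the random vectors stand in for real activations, and none of this is Heretic's actual code:

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of each hidden state along `direction`.

    hidden:    (n, d) array of residual-stream activations
    direction: (d,) refusal direction, typically estimated as the mean
               difference between activations on refused vs. complied prompts
    """
    r = direction / np.linalg.norm(direction)  # unit vector r
    return hidden - np.outer(hidden @ r, r)    # h - (h . r) r

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # stand-in activations
refusal = rng.normal(size=8)       # stand-in refusal direction

ablated = project_out(hidden, refusal)
# The ablated states have no component left along the refusal direction:
print(np.allclose(ablated @ (refusal / np.linalg.norm(refusal)), 0))  # True
```

Applying the same projection to the weight matrices that write into the residual stream bakes the edit into the model permanently, which is, roughly, why no retraining is needed.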