OpenAI's April 21 livestream debuted Images 2.0, the latest version of its image generation model. The event doubled as a look back at a packed release calendar: GPT-5 last August, GPT-5.3 and Codex, ChatGPT Atlas, the Realtime API for voice agents, and Sora for video. But gpt-image-2 was the main event. It ditches diffusion for something fundamentally different: the model generates images as sequences of discrete tokens, the same way a language model produces text. That's a genuine architectural shift.

Developer Simon Willison ran early tests through the API with a custom Python tool. A single 3840x2160 image consumed roughly 13,000 output tokens and took close to two minutes to render. Fidelity was strong; prompt adherence was not. When he asked for a raccoon holding a ham radio in a crowded scene, the model couldn't manage it, and specific compositional details got lost. That's the trade-off with token-based generation: you get native multimodal processing in one pipeline, but predicting visual tokens sequentially is slow and struggles with fine spatial relationships.

The community response on Hacker News was familiar. People praised the resolution and detail, then circled back to the same complaint that has dogged every major image generator: you still can't iteratively fix parts of an image without regenerating the whole thing. Want to change one object's position or adjust a single detail? Too bad. That workflow gap remains wide open, and gpt-image-2 does nothing to close it. For anyone building tools on top of these models, that's the real limitation worth watching.
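To make the shape of a test like Willison's concrete, here is a minimal sketch of requesting a single image over the API with the OpenAI Python SDK. It assumes gpt-image-2 is exposed through the existing images.generate endpoint, that "3840x2160" is an accepted size string, and that token usage comes back on the response; none of those details were confirmed at the event, and this is not Willison's own tool.

```python
# Minimal sketch: one 4K render from the hypothetical gpt-image-2 model via the
# OpenAI Images endpoint. Model name, size option, and usage fields are
# assumptions drawn from the article, not confirmed API details.
import base64
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
result = client.images.generate(
    model="gpt-image-2",  # assumed model identifier
    prompt="A raccoon holding a ham radio in a crowded street market",
    size="3840x2160",     # assumed 4K size option matching the reported test
)
elapsed = time.monotonic() - start

# Image models on this endpoint return base64-encoded bytes; format assumed here.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("raccoon.png", "wb") as f:
    f.write(image_bytes)

# If the response reports token usage (the article's ~13,000-token figure
# suggests it does), it would surface here; attribute names are an assumption.
usage = getattr(result, "usage", None)
print(f"rendered in {elapsed:.1f}s, usage: {usage}")
```

At roughly 13,000 output tokens and close to two minutes per 4K frame, anything that loops this call over many prompts has to budget both cost and wall-clock time before it becomes practical.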
gpt-image-2 trades speed for architectural ambition
OpenAI's latest image model abandons diffusion for token-based generation. The architectural shift enables native multimodal processing but introduces new trade-offs: slower rendering and compositional blind spots. Early tests show impressive fidelity but the same old workflow gaps remain.