Alibaba's Qwen team trained a model to be the environment agents learn in

Alibaba's Qwen team released Qwen-AgentWorld, a pair of language world models (35B and 397B mixture-of-experts) that simulate agentic environments rather than act inside them.

A world model predicts an environment's next state given an action. Trained on more than 10 million real interaction trajectories across seven domains, these models let developers train an agent with reinforcement learning (RL) inside the simulation instead of on the live system. The paper reports that RL run inside the simulated world outperformed training in the real environment alone, and that world-model pre-training also served as a warm-up that lifted scores across seven agentic benchmarks.
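To make the idea concrete, here is a minimal Python sketch of a world model standing in for a real environment. It is not Qwen-AgentWorld's interface, which this article does not describe. The names `toy_world_model`, `SimulatedEnv`, and `rollout` are hypothetical, and a hand-written lookup table plays the part of the learned model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: a world model maps (state, action) to a predicted next state.
WorldModel = Callable[[str, str], str]


def toy_world_model(state: str, action: str) -> str:
    """Stand-in for a learned model: a hand-written table for a tiny file environment."""
    transitions = {
        ("empty", "create_file"): "file_exists",
        ("file_exists", "delete_file"): "empty",
    }
    # Unknown (state, action) pairs leave the state unchanged.
    return transitions.get((state, action), state)


@dataclass
class SimulatedEnv:
    """Wraps a world model so an agent can step through it like a real environment."""
    model: WorldModel
    state: str

    def step(self, action: str) -> str:
        # The next state is predicted by the model; no real system is touched.
        self.state = self.model(self.state, action)
        return self.state


def rollout(
    env: SimulatedEnv, policy: Callable[[str], str], steps: int
) -> List[Tuple[str, str, str]]:
    """Collect (state, action, next_state) transitions entirely inside the simulation."""
    trajectory = []
    for _ in range(steps):
        state = env.state
        action = policy(state)
        next_state = env.step(action)
        trajectory.append((state, action, next_state))
    return trajectory


# A trivial policy: create the file if it is missing, otherwise delete it.
def policy(state: str) -> str:
    return "create_file" if state == "empty" else "delete_file"


traj = rollout(SimulatedEnv(toy_world_model, "empty"), policy, steps=3)
```

In an RL setup, transitions like those in `traj` would feed the agent's policy update. Because the environment is just a function call, many such rollouts can run in parallel without touching a live system.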

If a model can convincingly play the world, the slowest and riskiest part of agent training, running against real systems, can be replaced by simulations you can spin up thousands of times cheaply. The open question is whether the simulator's fidelity holds outside the domains it was benchmarked on.