Multi-agent debate makes LLMs reason better, but only after burning tokens on long debate transcripts. A new paper distills the whole process into a single model that matches or exceeds the debate's performance while using up to 93% fewer tokens.
The method, called Latent Agents, uses a two-stage fine-tuning pipeline: learn the debate structure, then internalise it with dynamic reward scheduling and length clipping. The striking part is mechanistic. Probing the trained model with activation steering reveals agent-specific subspaces, interpretable directions in activation space that correspond to each debater's perspective, suggesting the model really does carry distinct internal "voices."
The authors demonstrate a safety payoff. Instil a malicious agent through debate, then suppress it with negative steering, and the harmful behaviour is easier to localise and remove, with less collateral damage to general performance than steering a base model. Internalising agents may make their reasoning both cheaper and more controllable.
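To make "negative steering" concrete, here is a minimal sketch of the general idea in plain Python. It is not the paper's procedure: the difference-of-means estimate of an agent direction, the function names (`agent_direction`, `steer`), and the three-dimensional toy activations are all assumptions chosen for illustration. The sketch estimates a direction from activations recorded with and without the agent's behaviour, then adds a negative multiple of that direction to a hidden state.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def agent_direction(with_agent, without_agent):
    """Hypothetical helper: unit difference-of-means direction.

    Assumption: the agent's behaviour shifts activations along roughly
    one direction. The paper reports agent-specific subspaces, which
    may have more than one dimension.
    """
    diff = [p - q for p, q in zip(mean_vector(with_agent),
                                  mean_vector(without_agent))]
    norm = sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction; alpha < 0 suppresses."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up): the agent shows up along the first axis.
with_agent = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
without_agent = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = agent_direction(with_agent, without_agent)

hidden = [3.0, 5.0, 1.0]
# Negative steering sized to cancel the component along the direction.
suppressed = steer(hidden, direction, alpha=-dot(hidden, direction))
```

In this toy case the steered vector has no remaining component along the estimated direction, while the other coordinates are unchanged. In a real model the same addition would be applied to a layer's activations during the forward pass, and how cleanly the behaviour separates from everything else is exactly what the paper's comparison against a base model measures.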