LangChain ran open-weight models through its Deep Agents evaluations and found something that would've seemed unlikely a year ago: GLM-5 and MiniMax M2.7 match closed frontier models on the core tasks that matter for agents: file operations, tool use, and instruction following. That's the stuff that determines whether an agent actually works in production.
The cost difference is stark. An application pushing 10 million tokens daily runs about $250 a day on Claude Opus 4.6 versus $12 on MiniMax M2.7. That's $87,000 a year. And the open models aren't just cheaper. They're faster. GLM-5 averaged 0.65 seconds of latency at 70 tokens per second on Baseten, compared to 2.56 seconds at 34 tokens per second for Claude Opus 4.6. For interactive products, that gap matters.
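The annual figure follows straight from the per-day gap. A quick sanity check, using the article's numbers and assuming both quoted prices are per day:

```python
# Cost comparison from the article: ~10M tokens/day on each model.
cost_opus_per_day = 250.0     # Claude Opus 4.6, approx. daily cost
cost_minimax_per_day = 12.0   # MiniMax M2.7, approx. daily cost

annual_savings = (cost_opus_per_day - cost_minimax_per_day) * 365
print(f"${annual_savings:,.0f}")  # → $86,870, i.e. roughly $87,000/year
```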
The evaluation covered seven categories: file operations, tool use, retrieval, conversation, memory, summarization, and unit tests. GLM-5 hit 64% correctness and MiniMax M2.7 57%, against 68% for Claude Opus 4.6. Close enough that the economics start to look very different. Both open models excel at file operations and tool use, though they lag on conversation. GLM-5, developed by Zhipu AI out of Tsinghua University, even beat the frontier models on solve rate, a combined accuracy-speed metric.
This doesn't mean open models have won. The closed models still edge ahead on raw correctness, and conversation remains a weak point. But for teams building agents that need to call tools, manipulate files, and follow instructions, the open options are now genuinely competitive. You can swap to GLM-5 or MiniMax M2.7 with a single line of code in LangChain's Deep Agents SDK.
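The swap itself might look something like the sketch below. This assumes the `deepagents` package (LangChain's Deep Agents SDK) with its `create_deep_agent` entry point and an OpenAI-compatible serving endpoint for the open model; the model name and endpoint URL are placeholders, not verified identifiers.

```python
# Sketch only: assumes the `deepagents` package and an OpenAI-compatible
# endpoint for the open model. Model name and URL below are placeholders.
from deepagents import create_deep_agent
from langchain_openai import ChatOpenAI

# Point a standard chat-model client at the open-model endpoint
# (e.g. a Baseten deployment serving GLM-5).
open_model = ChatOpenAI(
    model="glm-5",                             # placeholder model name
    base_url="https://example.baseten.co/v1",  # placeholder endpoint
    api_key="...",                             # your provider key
)

# The one-line swap: hand the agent the open model instead of a closed one.
agent = create_deep_agent(model=open_model)
```

Because the endpoint speaks the OpenAI wire format, nothing else in the agent configuration has to change.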