LM Studio just got serious about headless deployments. Version 0.4.0 splits the inference engine into a standalone daemon called llmster, with a new lms CLI for managing models entirely from the command line. No GUI required. This puts LM Studio in direct competition with Ollama for local model serving on servers, CI/CD pipelines, and SSH sessions.
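The headless workflow looks something like the following sketch. Command names follow the existing lms CLI documentation; the exact subcommands and flags in 0.4.0 may differ, and `<model>` is a placeholder for whatever model identifier you pull.

```shell
# Sketch of a GUI-free session over SSH (command names per current lms docs;
# 0.4.0 specifics may vary)
lms server start      # bring up the local inference server
lms get <model>       # download a model from the command line
lms load <model>      # load it into memory for serving
lms ps                # list currently loaded models
lms server status     # confirm the server is up
```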
George Liu tested the setup with Google's Gemma 4 26B mixture-of-experts model on a 14" MacBook Pro M4 Pro with 48 GB unified memory. The results: 51 tokens per second, with time to first token at 1.5 seconds. The MoE architecture makes this possible. The model has 128 experts but only activates 8 (about 3.8B parameters) per forward pass. That means you get quality close to a 10B dense model while paying inference costs closer to a 4B model. On benchmarks, it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, not far off from the dense 31B variant's 85.2% and 89.2%.
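The "quality of a 10B dense model at the cost of a 4B" framing lines up with a common (informal) rule of thumb that puts an MoE's dense-equivalent quality near the geometric mean of its total and active parameter counts. A back-of-envelope check, using only the figures quoted above:

```python
# Back-of-envelope MoE arithmetic for the numbers in the article.
# Assumptions: 26B total parameters, ~3.8B active per forward pass,
# and the informal geometric-mean heuristic for dense-equivalent quality.
import math

total_params = 26e9    # all experts plus shared weights
active_params = 3.8e9  # parameters actually exercised per token

# Inference cost scales with the active fraction, not total size.
active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")        # ~14.6%

# Heuristic dense equivalent: sqrt(total * active).
dense_equiv = math.sqrt(total_params * active_params)
print(f"dense-equivalent size: {dense_equiv / 1e9:.1f}B")  # ~9.9B
```

The heuristic lands at roughly 9.9B, consistent with the "close to a 10B dense model" claim, while the per-token compute tracks the 3.8B active parameters.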
What differentiates LM Studio from Ollama here is the developer workflow focus. The 0.4.0 release includes native MCP (Model Context Protocol) integration with permission-key gating, a stateful REST API that maintains conversation history, and parallel request processing with continuous batching. Liu notes that performance degrades when the model is driven from Claude Code compared to standalone chats, which tracks with Hacker News discussions about conversational MCP workflows being latency-sensitive. Users report that delays above 2 seconds disrupt reasoning quality, prompting optimizations like in-memory caching to hit sub-100ms response times.
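The caching optimization mentioned above can be sketched in a few lines. This is an illustrative pattern, not LM Studio's implementation: a TTL-bounded in-memory cache in front of a slow tool call, so repeated queries skip the round-trip entirely and return in well under 100 ms.

```python
# Illustrative TTL cache for tool-call results (hypothetical example,
# not LM Studio's actual code): repeat queries within the TTL window
# are answered from memory instead of re-invoking the slow call.
import time
from functools import wraps

def ttl_cache(ttl_seconds: float = 60.0):
    """Cache a function's results in memory, expiring after ttl_seconds."""
    def decorator(fn):
        store = {}  # args -> (timestamp, value)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # cache hit: no slow round-trip
            value = fn(*args)          # cache miss: pay the full cost once
            store[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=30.0)
def lookup_docs(query: str) -> str:
    # Stand-in for a slow MCP tool call or model request.
    time.sleep(0.2)
    return f"results for {query!r}"

lookup_docs("gemma")   # first call pays the 200 ms cost
lookup_docs("gemma")   # served from memory, sub-millisecond
```

The trade-off is staleness: a TTL that is too long serves outdated tool results, too short and the cache stops helping, which is why this fits read-heavy lookups better than stateful actions.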
For anyone running local models, the calculus is shifting. Ollama built its user base on being the lightweight, CLI-first option. LM Studio is now matching that headless capability while betting on ecosystem features for AI-assisted development. The question is whether developers want a general-purpose model runner or an engine tuned for tool-calling coding agents.