A blog post published March 13, 2026 on Minor Gripe outlines Millwright, a proposed framework for adaptive tool selection in AI agents. The core problem is familiar to anyone building production agentic systems: as tool catalogs scale to hundreds or thousands of entries, injecting all tool definitions into an LLM's context becomes increasingly costly, crowding out space needed for RAG retrieval, conversation history, and planning. Models like Meta's Llama 4 Scout and Google's Gemini 3 Pro now offer <a href="/news/2026-03-14-1m-token-context-window-generally-available-claude-opus-4-6-sonnet-4-6">128K–1M token windows</a>, yet the pressure persists — tool catalogs at large enterprises grow faster than context budgets do. Static heuristics — returning the most frequently used tools, or the top-N semantically similar ones — fail to account for real-world performance variation and cannot adapt as new tools and tasks emerge.

Millwright's proposed architecture exposes exactly two meta-tools to the agent runtime: suggest_tools and review_tools. When an agent invokes suggest_tools with a natural language query, Millwright decomposes it into atomic subqueries, embeds them, and ranks candidates by fusing two signals: cosine similarity over embedded tool descriptions for semantic relevance, and a historical fitness layer that draws on an append-only review log of (tool, query, fitness) tuples. After using selected tools, the agent calls review_tools to log fitness ratings — "perfect," "related," "unrelated," or "broken" — which feed back into future rankings. An epsilon-greedy exploration mechanism occasionally surfaces non-obvious tools to prevent the system from converging on a stale subset. If an initial suggestion set proves insufficient, the agent can iterate through additional candidates.
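The ranking fusion and epsilon-greedy step can be sketched roughly as follows. This is a minimal illustration, not Millwright's actual code: the fitness weights for the four labels, the `alpha` blending weight, the `epsilon` rate, and the function signature are all assumptions, since the post describes the signals but not their numeric combination.

```python
import math
import random

# Hypothetical numeric weights for the four review labels; the proposal
# names the labels but does not publish a scoring scheme.
FITNESS = {"perfect": 1.0, "related": 0.5, "unrelated": -0.5, "broken": -1.0}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def suggest_tools(query_vec, tools, review_log, top_n=5, epsilon=0.1, alpha=0.7):
    """Rank tools by fusing semantic similarity with logged fitness.

    tools: {name: embedding}; review_log: list of (tool, query, label)
    tuples. alpha blends semantics against history; alpha and epsilon
    are illustrative knobs, not values from the proposal.
    """
    # Average historical fitness per tool, defaulting to neutral (0.0)
    # for tools with no reviews yet.
    totals, counts = {}, {}
    for tool, _query, label in review_log:
        totals[tool] = totals.get(tool, 0.0) + FITNESS[label]
        counts[tool] = counts.get(tool, 0) + 1

    def fused(name):
        hist = totals.get(name, 0.0) / counts.get(name, 1)
        return alpha * cosine(query_vec, tools[name]) + (1 - alpha) * hist

    ranked = sorted(tools, key=fused, reverse=True)
    picks = ranked[:top_n]
    # Epsilon-greedy exploration: occasionally swap the last slot for a
    # non-obvious candidate so rankings don't converge on a stale subset.
    if random.random() < epsilon and len(ranked) > top_n:
        picks[-1] = random.choice(ranked[top_n:])
    return picks
```

With `epsilon=0` the ranking is deterministic; raising it trades a slot of ranking quality for catalog coverage, which is the exploration/exploitation balance the post describes.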

Millwright builds directly on the 2024 academic Toolshed paper by Lumer et al., which introduced RAG-based tool retrieval but had no dynamic feedback mechanism. The fitness layer closes that gap, making tool rankings a function of accumulated agent experience rather than static semantic similarity alone. The review log doubles as an observability layer: operators can inspect which tools underperform, which get overused for tasks they are only marginally suited for, and where gaps in the catalog exist. Cold-start is handled through seed reviews; a periodic compaction step merges entries whose embeddings are sufficiently close, keeping the review index from growing without bound.

As of publication, Millwright is a design proposal, not a released library. The architecture is specified in enough detail to build on any <a href="/news/2026-03-14-edb-postgresql-agentic-ai-case">vector store</a> and standard LLM agent runtime. The harder question is whether the epsilon-greedy exploration mechanism holds up under real catalog churn — tools added and deprecated continuously — without corrupting fitness signals built from earlier runs. That's the gap every clean feedback-loop design has to cross before it becomes infrastructure. No production implementation or release timeline has been announced.