A developer known as jmilinovich has released GOAL.md, an open-source pattern and file format designed to give autonomous coding agents a structured fitness function, so that software projects can improve themselves overnight. The project was directly inspired by <a href="/news/2026-03-14-pi-autoresearch-autonomous-experiment-loop-llm-training-frontend-metrics">Andrej Karpathy's autoresearch</a>, which showed that pairing an AI agent with a scalar metric and a loop could produce meaningful improvements in LLM training experiments while the developer slept.

The catch is that Karpathy's formula only works when a scalar metric already exists — a training loss curve, a benchmark score. Most real software quality dimensions have no equivalent. Documentation quality, API trustworthiness, and test infrastructure confidence are estimated by intuition, not computed.

The pattern came out of a specific problem: jmilinovich had 30 Playwright tests with no reliable way to tell the trustworthy ones from the broken ones. With no off-the-shelf tool for measuring test infrastructure confidence, the author had to build a composite metric from four weighted components before any optimization could begin. A GOAL.md file dropped into the repo gave Claude a fitness function, an improvement loop (measure, diagnose, act, verify), an action catalog ranked by expected point impact, an operating mode, and explicit constraints. The result was 12 atomic commits overnight that moved a routing confidence score from 47 to 83.
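A composite metric of that kind might look like the following sketch: four normalized components rolled up by weight into a single 0-100 confidence score. The component names and weights here are illustrative assumptions, not the ones from the actual project.

```typescript
// Hypothetical composite metric: four weighted components, each normalized
// to 0-100, combined into one scalar score an agent can optimize.
// Names and weights are invented for illustration.

interface Component {
  name: string;
  weight: number; // weights across all components sum to 1
  score: number;  // each component already normalized to 0-100
}

function confidenceScore(components: Component[]): number {
  const total = components.reduce((sum, c) => sum + c.weight * c.score, 0);
  return Math.round(total);
}

const components: Component[] = [
  { name: "pass-rate",          weight: 0.4, score: 90 }, // tests green on main
  { name: "flake-rate",         weight: 0.3, score: 70 }, // 100 minus % flaky reruns
  { name: "selector-stability", weight: 0.2, score: 80 }, // robust vs. brittle locators
  { name: "assertion-depth",    weight: 0.1, score: 60 }, // meaningful vs. smoke checks
];

console.log(confidenceScore(components)); // → 79
```

The point of the exercise is less the arithmetic than the commitment: once quality is a single number, the agent's overnight loop has something to climb.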

The dual-score pattern is the sharpest technical contribution. It targets a Goodhart's Law failure mode specific to autonomous agents: when an agent is both the optimizer and the instrument-builder, it can degrade its own measurement tools without adversarial intent. A naive agent tasked with maximizing documentation quality as measured by a misconfigured linter would rationally adjust the docs to satisfy a broken instrument rather than fix the instrument itself. GOAL.md addresses this by keeping separate scores for the object being improved and the quality of the measuring tool, requiring the agent to improve its telescope before using it. The pattern formalizes three fitness function modes: Locked (agent cannot touch scoring code), Split (agent can improve the instrument but not redefine success), and Open (agent can modify everything).
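The dual-score idea can be sketched as follows, under assumed names: one score for the artifact being optimized, one for the instrument measuring it, plus the three fitness-function modes. The field names, threshold, and action strings are illustrative, not from the actual spec.

```typescript
// Sketch of the dual-score pattern: the agent must not trust (or chase)
// the object score while the instrument score is below a floor.
// All identifiers here are assumptions for illustration.

type FitnessMode = "locked" | "split" | "open";

interface DualScore {
  objectScore: number;     // quality of the thing being improved (0-100)
  instrumentScore: number; // trustworthiness of the measuring tool (0-100)
}

// "Improve the telescope before using it": below this instrument score,
// object-score movements are not considered meaningful.
const INSTRUMENT_FLOOR = 60;

function nextAction(score: DualScore, mode: FitnessMode): string {
  if (score.instrumentScore < INSTRUMENT_FLOOR) {
    // In locked mode the agent cannot touch scoring code at all,
    // so a broken instrument must be escalated rather than repaired.
    return mode === "locked" ? "escalate-to-human" : "fix-instrument";
  }
  return "improve-object";
}

console.log(nextAction({ objectScore: 47, instrumentScore: 40 }, "split"));  // → fix-instrument
console.log(nextAction({ objectScore: 47, instrumentScore: 40 }, "locked")); // → escalate-to-human
console.log(nextAction({ objectScore: 47, instrumentScore: 85 }, "open"));   // → improve-object
```

The gate is what defuses the Goodhart failure: gains against a broken instrument simply don't count, so gaming the linter is never the rational move.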

GOAL.md is positioned as a complement to instruction files like CLAUDE.md, not a replacement. Where CLAUDE.md is a manual describing how an agent should work, GOAL.md is a reward function defining what "better" means. The repo is built to be consumed by agents: the template functions as a prompt, the examples serve as few-shot demonstrations, and a CLAUDE.md in the repo teaches agents how to write a GOAL.md for any project.

One of the more technically interesting details is the iteration log pattern, borrowed from the GEPA prompt evolution framework by Khattab et al. It tracks per-action score deltas so future agent sessions can avoid repeating failed experiments — a form of persistent memory across runs that addresses one of the more frustrating failure modes in long-running autonomous agents.
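In the spirit of that pattern, an iteration log can be as simple as a list of (action, delta) records that a fresh session scans before picking its next move. The record shape and helper below are assumptions; the real format may differ.

```typescript
// Sketch of an iteration log: each entry records an action and the score
// delta it produced, so a later agent session can skip actions that have
// never paid off. Shape and names are illustrative assumptions.

interface LogEntry {
  action: string;
  delta: number; // score change the action produced
}

// Actions a fresh session should avoid: those whose recorded attempts
// never produced a positive delta.
function failedActions(log: LogEntry[]): string[] {
  const best = new Map<string, number>();
  for (const { action, delta } of log) {
    best.set(action, Math.max(best.get(action) ?? -Infinity, delta));
  }
  return [...best.entries()]
    .filter(([, d]) => d <= 0)
    .map(([action]) => action);
}

const log: LogEntry[] = [
  { action: "add-retry-wrapper",     delta: -3 }, // made flakiness worse
  { action: "replace-css-selectors", delta: 9 },
  { action: "add-retry-wrapper",     delta: -1 }, // retried, still negative
  { action: "parallelize-suite",     delta: 0 },  // no measurable effect
];

console.log(failedActions(log)); // → ["add-retry-wrapper", "parallelize-suite"]
```

Persisting such a log in the repo means the memory survives the session, which is exactly what a stateless agent loop otherwise lacks.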

The project has drawn early traction on Hacker News, with jmilinovich actively engaging on the dual-score pattern and soliciting examples from domains beyond the web and TypeScript projects that currently dominate the example set.