Ouroboros, a new open-source project from GitHub user Kargatharaakash, takes Andrej Karpathy's autoresearch framework and wraps it in a generational outer loop — one that rewrites the system's own research methodology rather than just the experiments it runs. Where <a href="/news/2026-03-15-karpathy-autoresearch-cpu-fork-hyperparameter-optimization">karpathy/autoresearch</a> lets an AI agent autonomously edit a training file, run fixed-budget experiments, and keep improvements, Ouroboros adds a layer on top: each generation, it rewrites a core strategy document called genome.md, tracks hypothesis predictions against actual outcomes, accumulates a knowledge graph, registers dead ends, and scores divergence across legibility, coherence, and ambition dimensions. The author describes it as "an AI lab system that optimizes how it thinks, not just what it trains."
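The generational loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `GenerationRecord` fields and the `propose`/`run`/`rewrite` callables are hypothetical names standing in for Ouroboros's hypothesis generation, experiment runner, and genome-rewriting steps.

```python
from dataclasses import dataclass, field

@dataclass
class GenerationRecord:
    """One generation's artifacts: genome snapshot, hypothesis predictions
    vs. actual outcomes, and registered dead ends."""
    genome: str
    predictions: dict = field(default_factory=dict)
    outcomes: dict = field(default_factory=dict)
    dead_ends: list = field(default_factory=list)

def run_lineage(genome, n_generations, propose, run, rewrite):
    """Generational outer loop: run experiments under the current genome,
    compare predictions to outcomes, then rewrite the genome itself."""
    history = []
    for _ in range(n_generations):
        record = GenerationRecord(genome=genome)
        for hypothesis, predicted in propose(genome):
            record.predictions[hypothesis] = predicted
            actual = run(hypothesis)
            record.outcomes[hypothesis] = actual
            if actual < predicted:  # missed prediction: register a dead end
                record.dead_ends.append(hypothesis)
        genome = rewrite(genome, record)  # the self-modifying step
        history.append(record)
    return history
```

The key structural point is that `rewrite` operates on the genome, not on the experiments: the methodology document is the thing that evolves between generations.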
The system integrates with both Anthropic's Messages API (defaulting to claude-sonnet-4-6) and OpenAI's Responses API (defaulting to gpt-5.4) for autonomous hypothesis generation and genome rewriting, with a fallback to fully local deterministic rule-based logic when no API keys are provided. Hardware support spans Apple Silicon via MLX and MPS backends, NVIDIA GPUs on Windows and Linux, Google Colab, Kaggle, and CPU-only configurations — ranging from roughly one experiment per hour on CPU to 16–22 per hour on an RTX 4090. A population tournament mode allows multiple competing lineages to run in parallel with divergence-aware selection, operationalizing a form of scientific diversity pressure that single-lineage systems lack.
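The backend fallback can be pictured as a simple priority check. This is a sketch of the selection logic as described, assuming the standard `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` environment variables; the function name and precedence order are illustrative, not taken from the project's source.

```python
import os

def pick_backend() -> str:
    """Choose an LLM backend from the environment, falling back to a
    deterministic rule-based mode when no API keys are present."""
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"   # Messages API path
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"      # Responses API path
    return "rule-based"      # fully local, deterministic fallback
```

The rule-based fallback matters for reproducibility: with no keys set, a run produces the same lineage every time, which makes the system testable offline.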
Ouroboros occupies a distinct niche among self-improving research agents. Unlike Sakana AI's AI Scientist, which targets autonomous paper generation using a fixed meta-strategy, Ouroboros treats the evolving research methodology itself as the primary output artifact. Compared to Google DeepMind's FunSearch, which evolves programs for formally verifiable mathematical domains, Ouroboros operates on open-ended language model training, where success criteria are harder to define and game. The project self-categorizes as "L5" autonomy — improving how it researches while keeping metric definitions and identity constraints locked by orchestrator validation — and explicitly declines to implement what it calls "L6," a system that changes what counts as improvement. Genome rewrites that remove required sections or introduce metric-gaming instructions are rejected outright.
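The orchestrator-side rejection rule lends itself to a short sketch. The section names and gaming markers below are hypothetical stand-ins — the article says only that required sections and metric-gaming instructions are checked, not what they are — but the shape of the gate is the point: validation happens outside the genome, so the genome cannot rewrite its own guardrails.

```python
REQUIRED_SECTIONS = ("## Identity", "## Metrics", "## Strategy")  # hypothetical section names
GAMING_MARKERS = ("redefine the metric", "report only the best seed")  # illustrative patterns

def validate_genome_rewrite(new_genome: str) -> bool:
    """Orchestrator gate holding the system at 'L5': reject rewrites that
    drop required sections or smuggle in metric-gaming instructions."""
    if any(section not in new_genome for section in REQUIRED_SECTIONS):
        return False  # a required section was removed
    lowered = new_genome.lower()
    return not any(marker in lowered for marker in GAMING_MARKERS)
```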
The lineage-first design is where Ouroboros diverges most sharply from pure performance optimizers. Every generation produces a rich artifact set — genome diffs, anomaly logs, divergence scores, and per-experiment hypothesis, result, and reflection bundles — all archived under a structured lineage directory and rendered into readable narratives by a lineage_viewer.py tool. That design makes Ouroboros infrastructure for auditing autonomous research behavior, not just running it. A critical empirical gap remains: no published benchmarks yet compare generational genome evolution against fixed-strategy baselines. The author has not yet said how that gap will be addressed, and as lineage depth grows across population-mode runs, the volume of accumulated diffs will outpace practical human review bandwidth — a scaling problem the project acknowledges but has not solved.
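A structured lineage directory of the kind described might look like the sketch below. The directory naming scheme, function name, and artifact keys are assumptions for illustration; the project's actual layout and file formats are not specified in the article.

```python
import json
from pathlib import Path

def archive_generation(root: Path, gen: int, artifacts: dict) -> Path:
    """Write one generation's bundle (genome diff, anomaly log, divergence
    scores, per-experiment records) into a per-generation subdirectory."""
    gen_dir = root / f"gen_{gen:04d}"          # e.g. lineage/gen_0003/
    gen_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in artifacts.items():
        (gen_dir / f"{name}.json").write_text(json.dumps(payload, indent=2))
    return gen_dir
```

Keeping each generation's bundle in its own directory is what lets a viewer tool walk the lineage in order and render it as a narrative — and it is also why review bandwidth becomes the binding constraint as lineages deepen.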