Andrej Karpathy published autoresearch in March 2026 as a working prototype of fully autonomous AI-driven experimentation: an agent loop that modifies a single training file (train.py), runs a fixed 5-minute experiment, measures validation bits per byte (val_bpb), keeps improvements and reverts regressions, and repeats overnight. The original repository targets a single H100 GPU and depends on Flash Attention 3, putting it out of reach of most hobbyist and academic hardware. Within weeks, at least four community forks had appeared targeting virtually every consumer platform — a pace that mirrors the reception of the HN thread on <a href="/news/2026-03-15-karpathy-autoresearch-ai-ml-experiments">Karpathy's announcement</a>, which passed 400 points and drew over 200 comments within 48 hours of posting.

The most notable fork, bopalvelut-prog/autoresearch by developer Matti A. Pöysti, removes the H100 and Flash Attention 3 requirements entirely and introduces an "Always-On Folding Mode" inspired by Folding@home. The fork runs a low-priority background process that consults a local Ollama instance running Qwen 2.5 0.5B to propose changes to train.py, executes the 5-minute training budget, logs results to a TSV file, and auto-commits improvements via git — all without cloud compute or human intervention after initial setup. A Hacker News commenter independently reported a CPU-only Linux run that improved val_bpb overnight from 2.29 to 2.23, demonstrating that meaningful iterative progress is achievable even on severely compute-constrained hardware. Three additional forks extend coverage further: miolini/autoresearch-macos targets Apple Silicon via PyTorch MPS; trevin-creator/autoresearch-mlx replaces PyTorch entirely with Apple's native MLX framework, achieving val_bpb as low as 1.294 on M4 Max hardware; and jsegov/autoresearch-win-rtx enables native Windows support on consumer NVIDIA GPUs using PyTorch SDPA attention.
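The two fork-specific pieces, asking a local model for a patch and appending results to a TSV log, can be sketched with Ollama's standard HTTP API. This is not Pöysti's code: the prompt wording, function names, and log columns are assumptions; only the `/api/generate` endpoint and its `model`/`prompt`/`stream` fields are Ollama's actual interface.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def propose_patch(source: str, model: str = "qwen2.5:0.5b") -> str:
    """Ask the local model for a revised train.py (prompt is illustrative)."""
    prompt = ("You are improving a small LM training script. "
              "Return only the full revised file.\n\n" + source)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def log_result(path: str, timestamp: str, val_bpb: float, kept: bool) -> str:
    """Append one TSV row (timestamp, val_bpb, decision) and return it."""
    row = f"{timestamp}\t{val_bpb:.4f}\t{'keep' if kept else 'revert'}\n"
    with open(path, "a") as f:
        f.write(row)
    return row
```

Running this under `nice` (or an equivalent low-priority scheduler) is what makes the mode unobtrusive enough to leave on overnight.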

Autoresearch's design also shifts how human researchers interact with the system. Karpathy's framework splits authorship explicitly: the AI agent edits train.py, while the human edits program.md, a Markdown document encoding the research strategy, priorities, and constraints that guide the agent. The human's primary output becomes natural-language policy rather than executable code. The Pöysti fork pushes this further by substituting a local Ollama model for the human-invoked <a href="/news/2026-03-14-autoresearch-fork-evolutionary-database-claude-codex">Claude or Codex workflow</a>, removing one more layer of human mediation from the loop. Karpathy's own README frames the current moment as transitional: it is written in the voice of a future historian describing the "10,205th generation" of a self-modifying codebase that has "grown beyond human comprehension", a rhetorical device that doubles as a design principle.
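To make the division of labor concrete, a program.md might look like the following. This is a hypothetical sketch; the headings and constraints are invented for illustration and do not come from Karpathy's repository.

```markdown
# Research program

## Objective
Minimize validation bits per byte (val_bpb) on the fixed dataset.

## Priorities
1. Prefer architectural tweaks (attention variants, norm placement)
   over hyperparameter noise.
2. Keep each change small enough to evaluate in a single 5-minute run.

## Constraints
- Never modify the evaluation code or the time budget.
- Revert anything that pushes peak memory past the hardware limit.
```

The agent reads this document before each edit; the human steers the search by revising it, never by touching train.py directly.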

The framework's deliberately minimal architecture — one editable file, one metric, a fixed time budget, git-based keep/revert — is accessible enough that any capable coding agent can drive the loop. As these forks mature and practitioners develop divergent program.md strategies, outcomes will increasingly reflect the quality of human instruction rather than compute or model size. Karpathy gestured at this directly in the HN thread, writing that he expects program.md "to become the interesting variable" as the hardware gap closes. If he's right, the field is about to discover what separates a good research prompt from a great one, and no leaderboard that currently exists will supply the answer.