Christoph Mütze, known in the demoscene as gizmo64k, has built a working transformer language model that runs on a stock Commodore 64. The same architecture behind ChatGPT, Claude, and Gemini, squeezed onto hardware from 1982 with 64KB of RAM and a 1 MHz processor. The model has about 25,000 int8 parameters across 2 layers, written entirely in hand-coded 6502/6510 assembly. It generates roughly one token per minute.
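The article doesn't give the model's exact dimensions, but it's easy to see how a two-layer transformer can land near that parameter budget. The plain-Python sketch below is one plausible breakdown; the vocabulary size, embedding width, feed-forward width, and untied output head are illustrative guesses, not the model's actual configuration.

```python
# Back-of-the-envelope check that a 2-layer transformer can land near
# 25,000 parameters. Only the layer count, context window, and total are
# from the article; every width below is an assumption.

VOCAB = 64        # assumed token vocabulary (e.g. character-level)
D_MODEL = 32      # assumed embedding width
D_FF = 96         # assumed feed-forward width
CONTEXT = 20      # context window, per the article
N_LAYERS = 2      # layer count, per the article

token_emb = VOCAB * D_MODEL        # token embedding table
pos_emb = CONTEXT * D_MODEL        # learned positional embeddings (assumed)
attn = 4 * D_MODEL * D_MODEL       # Wq, Wk, Wv, Wo per layer
mlp = 2 * D_MODEL * D_FF           # up- and down-projection per layer
head = D_MODEL * VOCAB             # untied output projection (assumed)

total = token_emb + pos_emb + N_LAYERS * (attn + mlp) + head
print(total)  # 25,216 with these guesses, in the right ballpark
```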
The project nearly didn't work. Mütze's breakthrough came from fixing the softmax normalization: shifting attention scores right by 14 bits instead of 17 before the exponential lookup. The smaller shift gave the 128-entry exponential lookup table enough dynamic range to produce meaningful attention weights. Without it, the integer attention stayed essentially uniform across all positions, making the model useless no matter how it was trained.
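The article doesn't spell out the fixed-point formats, but the failure mode is easy to reproduce. In the sketch below, only the 128-entry table and the 17-versus-14-bit shift come from the description above; the decaying-exponential table encoding, the score magnitudes, and the normalization format are assumptions for illustration. With the larger shift, every score collapses onto the first table entries and the weights come out nearly flat; the smaller shift keeps the differences between scores visible.

```python
# Hypothetical sketch of the integer-softmax issue. Exact C64 fixed-point
# formats are assumptions; only the 128-entry LUT and the 17 -> 14 bit
# shift change come from the article.
import math

EXP_LUT_SIZE = 128
TEMP = 8.0
# LUT entry i approximates exp(-i / TEMP) in 8-bit fixed point (assumed format).
EXP_LUT = [round(math.exp(-i / TEMP) * 255) for i in range(EXP_LUT_SIZE)]

def int_softmax(scores, shift):
    """Integer softmax over raw attention scores (wide fixed-point ints)."""
    top = max(scores)
    weights = []
    for s in scores:
        # Distance from the max score, scaled down by `shift` bits
        # to fit the LUT's input range.
        idx = min((top - s) >> shift, EXP_LUT_SIZE - 1)
        weights.append(EXP_LUT[idx])
    total = sum(weights) or 1
    # Normalize to 8-bit fixed point (weights sum to roughly 256).
    return [w * 256 // total for w in weights]

scores = [260_000, 210_000, 140_000, 90_000]
print(int_softmax(scores, shift=17))  # nearly uniform: the diffs vanish
print(int_softmax(scores, shift=14))  # meaningfully peaked toward the max
```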
Soul Player C64 ships with a full training pipeline using PyTorch, quantization that optimizes for int8 accuracy rather than float loss, and about 90 tests spanning the entire chain from float reference through integer arithmetic down to the assembly routines. You can train your own model, build a C64 disk image, and run it on real hardware or an emulator like VICE. The whole thing fits on a single floppy disk with room to spare.
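One way to read "optimizes for int8 accuracy rather than float loss" is that the quantizer chooses its scales by the error of the quantized layer's outputs on calibration data, rather than by how well the float weights round-trip. The sketch below shows that idea in miniature, with an assumed per-tensor symmetric scheme and an illustrative scale sweep; it is not the project's actual pipeline.

```python
# Minimal sketch of output-aware int8 calibration. The scheme (per-tensor
# symmetric), the search grid, and the MSE metric are all assumptions.
import numpy as np

def quantize(w, scale):
    """Symmetric int8 quantization of a weight matrix with a given scale."""
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def output_error(w, q, scale, x_calib):
    """MSE of the dequantized layer output against the float reference."""
    ref = x_calib @ w
    approx = (x_calib @ q.astype(np.float32)) * scale
    return np.mean((ref - approx) ** 2)

def calibrate_scale(w, x_calib, n_candidates=50):
    """Pick the scale minimizing int8 *output* error, not weight error."""
    max_abs = np.abs(w).max()
    best_scale, best_err = None, np.inf
    # Sweep from aggressive clipping up to the full weight range.
    for frac in np.linspace(0.3, 1.0, n_candidates):
        scale = (max_abs * frac) / 127.0
        err = output_error(w, quantize(w, scale), scale, x_calib)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16)).astype(np.float32)
x = rng.normal(size=(64, 16)).astype(np.float32)  # calibration activations
scale, err = calibrate_scale(w, x)
print(f"chosen scale={scale:.4f}  output MSE={err:.6f}")
```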
The model has a 20-token context window and produces broken sentences. At 25,000 parameters, it is roughly 70 million times smaller than GPT-4, operating at a toy scale much like the GuppyLM project. But the architecture works at this scale, and that's the point. Mütze, a Hamburg-based developer with over 30 years of experience and a member of the demogroup Farbrausch, has shown that transformers aren't just for cloud clusters and GPU farms. They can run, slowly and badly, on a computer that predates the World Wide Web.