Carlo Valenti spent 18 months of lunch breaks and weekend nights building TRiP, a complete transformer engine written entirely in C. No dependencies. Just raw linear algebra on arrays of floats. The project handles inference, training, tokenizer creation, and chat across the Llama 2, Gemma 1, and GPT-2 architectures, plus multimodal vision via PaliGemma. One 'make' command compiles everything.

This isn't about speed. The forward and backward passes sit side by side in math.c, about 3,000 lines total. Every core transformer operation, from matrix multiplication to backpropagation with AdamW optimization, is hand-coded. The whole engine spans seven files. Valenti is transparent about what AI helped with: a JSON parser, some JPEG handling, the README, and file splitting. The transformer logic itself? That's all human-written.
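
What does hand-coded mean in practice? Roughly the sketch below: a matmul loop and a decoupled AdamW update written directly against float arrays. The names, signatures, and the W (d,n) @ x (n,) layout convention are borrowed from llama2.c's style as an illustration, not lifted from TRiP's actual source.

```c
#include <math.h>
#include <stdio.h>

/* Matmul in the llama2.c style: W is (d,n) row-major, x is (n,),
 * xout is (d,). Illustrative sketch, not TRiP's actual API. */
static void matmul(float *xout, const float *x, const float *W, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++)
            val += W[i * n + j] * x[j];
        xout[i] = val;
    }
}

/* One AdamW step over a parameter tensor. m and v are the first and
 * second moment buffers, t is the 1-based step count, and the weight
 * decay is decoupled per the AdamW paper. Again a sketch, not TRiP's
 * optimizer code. */
static void adamw_step(float *w, const float *g, float *m, float *v, int len,
                       int t, float lr, float b1, float b2, float eps, float wd) {
    for (int i = 0; i < len; i++) {
        m[i] = b1 * m[i] + (1.0f - b1) * g[i];
        v[i] = b2 * v[i] + (1.0f - b2) * g[i] * g[i];
        float mhat = m[i] / (1.0f - powf(b1, (float)t)); /* bias correction */
        float vhat = v[i] / (1.0f - powf(b2, (float)t));
        w[i] -= lr * (mhat / (sqrtf(vhat) + eps) + wd * w[i]);
    }
}

int main(void) {
    float W[6] = {1, 2, 3, 4, 5, 6};           /* 2x3 weight matrix */
    float x[3] = {1, 0, -1};
    float y[2];
    matmul(y, x, W, 3, 2);
    printf("y = [%.1f, %.1f]\n", y[0], y[1]);  /* prints y = [-2.0, -2.0] */

    float g[6] = {0.1f, 0.1f, 0.1f, 0.1f, 0.1f, 0.1f};
    float m[6] = {0}, v[6] = {0};
    adamw_step(W, g, m, v, 6, 1, 1e-3f, 0.9f, 0.999f, 1e-8f, 0.01f);
    return 0;
}
```

The backward pass is the same idea in reverse: each forward loop gets a mirror-image loop that accumulates gradients into buffers, which the AdamW step then consumes.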

The vision pipeline implements SigLIP, PaliGemma's vision encoder, in pure C. That means custom JPEG parsing, manual patch embedding, and explicit matrix operations turning 2D image data into token sequences. No Python imaging libraries. No automatic differentiation. Visual embeddings get projected to match text embedding dimensions, then concatenated so the transformer blocks process both modalities together.
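
Stripped down, patch embedding is a flatten-and-project loop. The sketch below assumes non-overlapping P x P patches and a learned projection matrix; SigLIP actually expresses this step as a strided convolution, which is equivalent for non-overlapping patches. Every name and memory layout here is an assumption for illustration, not TRiP's code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Turn an (H, W, 3) row-major image into (num_patches, dim) tokens:
 * flatten each P x P patch and project it with a (dim, P*P*3) matrix.
 * Illustrative layout and names; not TRiP's actual code. */
static void embed_patches(float *tokens, const float *image, const float *proj,
                          int H, int W, int P, int dim) {
    int cols = W / P;
    int patch_len = P * P * 3;
    for (int r = 0; r < H / P; r++) {
        for (int c = 0; c < cols; c++) {
            float *out = tokens + (size_t)(r * cols + c) * dim;
            for (int i = 0; i < dim; i++) {        /* out = proj @ patch */
                float val = 0.0f;
                for (int py = 0; py < P; py++) {
                    for (int px = 0; px < P; px++) {
                        const float *pix = image +
                            (size_t)((r * P + py) * W + (c * P + px)) * 3;
                        int j = (py * P + px) * 3; /* flattened patch index */
                        val += proj[(size_t)i * patch_len + j]     * pix[0];
                        val += proj[(size_t)i * patch_len + j + 1] * pix[1];
                        val += proj[(size_t)i * patch_len + j + 2] * pix[2];
                    }
                }
                out[i] = val;
            }
        }
    }
}

int main(void) {
    enum { H = 4, W = 4, P = 2, DIM = 3 };
    int num_patches = (H / P) * (W / P), patch_len = P * P * 3;
    float *img    = calloc((size_t)H * W * 3, sizeof(float));
    float *proj   = calloc((size_t)DIM * patch_len, sizeof(float));
    float *tokens = calloc((size_t)num_patches * DIM, sizeof(float));
    img[0] = 1.0f;                           /* one red pixel at (0,0) */
    for (int i = 0; i < DIM; i++)
        proj[(size_t)i * patch_len] = 1.0f;  /* each dim reads that pixel */
    embed_patches(tokens, img, proj, H, W, P, DIM);
    printf("token0 = [%.1f, %.1f, %.1f]\n", tokens[0], tokens[1], tokens[2]);
    free(img); free(proj); free(tokens);
    return 0;                                /* prints token0 = [1.0, 1.0, 1.0] */
}
```

From there, the patch tokens run through the encoder blocks, get projected to the text embedding width, and are concatenated ahead of the text tokens.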

Valenti credits Andrej Karpathy's llama2.c and nanoGPT as inspiration. TRiP supports their checkpoint formats alongside Hugging Face's safetensors. You can train from scratch, build BPE tokenizers, run chat sessions with proper templates, or feed images to PaliGemma for multimodal inference. Seven files, no dependencies, open source on GitHub.
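
Of those checkpoint formats, llama2.c's is the easiest to picture, and it shows why a dependency-free loader is feasible: a header of seven 32-bit ints, then raw fp32 weights. A minimal header reader looks something like this; the struct mirrors the one in Karpathy's repo, and whether TRiP parses it exactly this way is an assumption.

```c
#include <stdio.h>

/* Config header of a llama2.c checkpoint, as defined in Karpathy's
 * repo; the fp32 weight tensors follow immediately after. */
typedef struct {
    int dim;        /* transformer width */
    int hidden_dim; /* FFN inner width */
    int n_layers;   /* number of transformer blocks */
    int n_heads;    /* query heads */
    int n_kv_heads; /* key/value heads (fewer means GQA/MQA) */
    int vocab_size; /* sign encodes classifier weight sharing in llama2.c */
    int seq_len;    /* maximum sequence length */
} Config;

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.bin\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    Config cfg;
    if (fread(&cfg, sizeof cfg, 1, f) != 1) {
        fprintf(stderr, "short read on header\n");
        fclose(f);
        return 1;
    }
    printf("dim=%d layers=%d heads=%d vocab=%d seq=%d\n",
           cfg.dim, cfg.n_layers, cfg.n_heads, cfg.vocab_size, cfg.seq_len);
    fclose(f);
    return 0;
}
```

Everything after the header is just weight tensors back to back, which is why a C loader with no dependencies can get by with fread and pointer arithmetic.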