The Future of Everything Is Lies, I Guess

Aphyr, the distributed systems researcher known for tearing apart database claims, has turned their attention to LLM security. Their verdict is bleak. In a sprawling essay published this week, they argue that alignment, the project of making AI models friendly to humans, is fundamentally broken. Not because the teams at OpenAI and Anthropic aren't trying. Because the math itself doesn't care. ML models are piles of linear algebra with no intrinsic tendency toward prosocial behavior. Alignment comes from training data and human evaluators, both of which are optional and expensive. Anyone with enough money can skip that step.

The four barriers that might have prevented malicious models are crumbling. Hardware access is getting cheaper by the month as Microsoft, Oracle, and Amazon race to rent out training clusters. The mathematics are published openly. Training data is just pirated books and scraped websites, as Meta demonstrated. And human evaluation teams can be bypassed entirely by training your model on another model's outputs. Aphyr puts it plainly: the ML industry is lowering the barrier to creating unaligned models, not raising it.

Then there's the agent problem. Simon Willison coined the term "lethal trifecta" for what happens when you give an LLM access to untrusted content, private data, and external communication. Aphyr argues it's actually a unifecta: LLMs cannot safely be given destructive power, period. Yet tools like OpenClaw (a hypothetical but representative example), which connects LLMs to your inbox, browser, and files in an autonomous loop, are the direction things are heading. Picture handing an AI your credit card so it can buy things from random web pages. Or a social network where agents exchange untrusted content automatically. As Aphyr notes, nobody would run a program that executed any command it saw on Twitter. But call it an "AI agent" and suddenly it's fine.

Prompt injection keeps happening. There was an attack against Claude Cowork just two months ago. The web page you ask a model to summarize can instruct it to share your private SSH key, and there's a chance it will comply. These aren't theoretical vulnerabilities. They're structural features of systems that treat all input the same way.

The gap between what AI companies promise and what their products can safely deliver isn't closing. It's widening. And nobody with a working model has an incentive to admit that.