Agent Wars
technical Mar 12th, 2026

Nobody Agrees on AI Evals — Here's What Practitioners Are Actually Using

A Hacker News thread on AI evaluation practices has revealed a field that quietly abandoned BLEU and ROUGE in favor of LLM-as-judge scoring — but with no consensus on tooling, methodology, or what 'good' even looks like when agents are involved.

technical Mar 12th, 2026

Andrej Karpathy Makes the Case for an IDE Built Around the Agent

Andrej Karpathy is asking what an IDE looks like when the agent—not the developer—is the primary user. The question exposes how much current tooling was never built for autonomy.

technical Mar 12th, 2026

Computational Antibody Design Gets a Field Manual. BoltzGen Leads — Except When It Doesn't.

Asimov Press has published a detailed technical guide to computational antibody design by Brian Naughton, walking through a five-step pipeline — target selection (Nipah virus Glycoprotein G), structure preparation, running design campaigns on the Ariax platform, candidate filtering, and experimental validation. BoltzGen, from MIT's Boltz team, leads the open-source field and achieves sub-micromolar affinity on most tested targets, but logged only a 1% pass rate on the Nipah G Adaptyv Bio competition dataset. BindCraft is the other open-source option with a meaningful track record. Commercial offerings from Nabla Bio, Chai Discovery, Latent Labs, and Isomorphic Labs round out the landscape. The guide stands out for using transparent benchmark data — dissociation constant thresholds — in a field prone to inflated performance claims.

technical Mar 12th, 2026

tropes.fyi releases a system-prompt catalog of AI writing tics

tropes.fyi releases 'tropes.md', a single Markdown file cataloging dozens of recurring LLM writing patterns — from overused words like 'delve' and 'tapestry' to structural tics like negative parallelism, tricolon abuse, and bold-first bullet lists. The file is designed to be dropped directly into an AI system prompt to suppress these tells. Categories cover word choice, sentence structure, paragraph structure, tone, formatting, and composition. The project is openly AI-assisted and framed as a cat-and-mouse game between prompt engineers and model defaults.

technical Mar 12th, 2026

Runflow Says Its AI Image Orchestration API Lifted One Client's Gross Margin From 40% to 87%

Runflow is pitching itself as the infrastructure layer for AI image and video generation — a single API routing across 20+ models including FLUX, Kling, and Sora, with pre-built workflows for specific visual niches. A BetterPic case study showing gross margin climbing from 40% to 87% is the centrepiece of its commercial argument.

technical Mar 12th, 2026

Meta unveils four custom AI inference chips, says MTIA 450 beats leading Nvidia silicon

Meta has revealed four previously undisclosed custom chips (MTIA 300, 400, 450, and 500) built in close partnership with Broadcom for AI inference workloads. The MTIA 300 targets ranking and recommendation workloads and is already in production. The MTIA 400 supports generative AI and is entering datacenter deployment. The MTIA 450 doubles HBM bandwidth over the 400, and Meta claims it outperforms leading commercial products; it is targeted for mass deployment in early 2027. The MTIA 500 adds 50% more HBM bandwidth over the 450 (three times the 400's, taking the two figures together) and is also planned for 2027. Broadcom has characterised Meta's commitment as deploying multiple gigawatts of these chips. Meta says it can now ship a new chip roughly every six months via a modular chiplet design strategy.

technical Mar 12th, 2026

Developers Keep Asking If Claude Is Down. That's a Problem for Anthropic.

A recurring Hacker News thread signals growing frustration with Claude's reliability — and with how slowly Anthropic's official status page reflects real-world incidents.

technical Mar 12th, 2026

PycoClaw Brings OpenClaw-Class AI Agents to $5 ESP32 Hardware

USRobotIQ's PycoClaw runs an OpenClaw-compatible agent on a $5 ESP32 microcontroller using MicroPython. It includes a dual-loop reasoning engine, hybrid TF-IDF and vector memory backed by SD card, multi-model routing, sub-agent support, and direct hardware control over GPIO, CAN, I2C, and LVGL displays. Skills can be discovered and installed at runtime from the ScriptoHub marketplace. A companion browser PWA, Scripto Studio, handles firmware flashing with no local toolchain required.

technical Mar 12th, 2026

1,000 Lines of Python vs. the Enterprise Knowledge Stack

Andy Chen, an engineer at Abnormal Security, describes building an Enterprise Context Layer using ~1,000 lines of Python and a GitHub repo instead of expensive SaaS tools. Twenty parallel LLM agents synthesize organizational knowledge — product docs, Slack threads, Gong call transcripts, Jira tickets, source code — into a richly cross-referenced, citation-backed file system. The result: 6,000 commits across 1,020 files covering 11 domains, including end-to-end customer journey maps, competitor battle cards with closed evidence loops, and feature flag inventories no human team could maintain. Chen's core argument: retrieval and synthesis are fundamentally different problems, and modern LLMs plus a simple agent harness can now solve the synthesis half for near-zero cost.

technical Mar 12th, 2026

Dot Matrix Labs' Alien Stack Explores What Code Looks Like When Written for an AI, Not a Human

What if software architecture were optimized for how AI agents actually work — sequential text access, grep-based navigation, limited context windows — rather than for human readability? That's the question Dot Matrix Labs is testing with Alien Stack, a proof-of-concept that has Claude writing software directly in LLVM IR, bypassing high-level source languages entirely. The project backs the idea with working demos: an HTTP server with a WASM client, a TechEmpower plaintext benchmark that edges out a naive Rust Hyper baseline at low-to-medium concurrency, Z3 SMT verification of formal function contracts, and an isomorphic UI kit — all generated by Claude in under 15 minutes, offline.

opinion Mar 12th, 2026

AI is supercharging fake work

A Hacker News thread hit a nerve this week: AI tools aren't killing busywork, they're scaling it. Workers trapped in broken incentive structures now have a superpower for producing output that looks productive and does nothing.

technical Mar 12th, 2026

Qodo Claims 12-Point F1 Lead Over Claude Code Review in Its Own Benchmark

Qodo has published benchmark results showing its AI code review platform outperforming Anthropic's Claude Code Review by 12 F1 points — on a benchmark Qodo itself designed. The Qodo Code Review Benchmark 1.0 injects realistic defects into 100 real-world pull requests across 8 repositories and 7 languages. Both systems achieve similar precision, but Qodo's multi-agent harness, which routes tasks to specialized agents and blends models from OpenAI, Anthropic, and Google, delivered significantly higher recall. Qodo also claims per-review costs roughly an order of magnitude below Claude Code Review's $15–$25 pricing.
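
For readers who want to see how a recall gap turns into an F1 gap when precision is roughly equal, here is a minimal Python sketch of the standard F1 formula (the harmonic mean of precision and recall). The precision and recall values are hypothetical, picked only so the arithmetic is easy to follow; they are not Qodo's published numbers.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only (not taken from the benchmark).
same_precision = 0.60
low_recall, high_recall = 0.40, 0.60

f1_low = f1_score(same_precision, low_recall)
f1_high = f1_score(same_precision, high_recall)

# F1 "points" are the score difference expressed on a 0-100 scale.
gap_points = (f1_high - f1_low) * 100
print(f"F1 {f1_low:.2f} vs {f1_high:.2f}, gap of {gap_points:.0f} points")
```

With identical precision, the whole difference in F1 comes from recall, which is the pattern the summary describes.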

technical Mar 12th, 2026

The Em Dash Was Never the Tell

Will Keleher's satirical technical essay walks a narrator through CSS tricks, font binary patching, and a Norvig-inverted misspelling algorithm — three increasingly baroque attempts to pass as human. The ending explains everything.

technical Mar 12th, 2026

Can LLMs Be Computers? Percepta Claims 30k Tokens/Second by Executing Programs Inside Transformers

Percepta's Christos Tzamos argues that transformers can work as general-purpose computers by executing programs directly in the forward pass — bypassing autoregressive token generation entirely. The claimed result is 30,000 tokens per second. There is no paper, no code, and no third-party validation. The claim is theoretically coherent enough to take seriously and unsubstantiated enough to treat with caution.

technical Mar 12th, 2026

Claude 4.6 Opus, linux/list.h, and a GPL problem nobody's verified yet

A Hacker News thread claimed Claude 4.6 Opus can reproduce the Linux kernel's list.h header verbatim — unverified, but the GPL-2.0 implications are worth taking seriously regardless.

technical Mar 12th, 2026

OpenAI drops Oracle expansion as newer Nvidia chips beckon

OpenAI has abandoned plans to expand its Stargate data center with Oracle in Abilene, Texas, opting to build new sites around Nvidia's next-generation Vera Rubin chips instead. The decision highlights a widening gap between annual GPU release cycles and the 12-to-24-month lead time for data center construction — a problem that hits Oracle harder than most, given its heavy reliance on debt financing, negative free cash flow, and a $50 billion capex commitment that investors are growing impatient with.

opinion Mar 12th, 2026

AI Translation Demos Are Really Just Fancy Guessing Machines

Software engineer Alperen Keles argues that the 'AI translation' demos dominating 2026 headlines are a sleight of hand: models propose code, but human-designed test harnesses decide whether the translation is correct. That shifts the hard problem from the AI to the engineer who wrote the tests. Keles's February analysis — prompted by January demos from Cursor and Anthropic — also looks ahead to LLM-driven code optimization as a harder but potentially more valuable next frontier.
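
The shape of that argument can be sketched in a few lines of Python. This is a toy illustration and not code from Keles's essay: `first_accepted`, `reference`, and the two candidates are hypothetical names, and plain functions stand in for model-proposed translations.

```python
from typing import Callable, Optional, Sequence

Func = Callable[[int], int]


def first_accepted(candidates: Sequence[Func], reference: Func,
                   test_inputs: Sequence[int]) -> Optional[int]:
    """Return the index of the first candidate that matches the reference
    on every test input, or None if no candidate passes."""
    for index, candidate in enumerate(candidates):
        if all(candidate(x) == reference(x) for x in test_inputs):
            return index
    return None


def reference(x: int) -> int:
    """The original behaviour the 'translation' must preserve."""
    return x * x


def wrong_candidate(x: int) -> int:
    """Agrees with the reference only for non-negative inputs."""
    return x * abs(x)


def right_candidate(x: int) -> int:
    return x ** 2


candidates = [wrong_candidate, right_candidate]

# A harness that never tries negative inputs accepts the wrong candidate.
weak_choice = first_accepted(candidates, reference, [0, 1, 2])
# A harness that includes a negative input rejects it.
strong_choice = first_accepted(candidates, reference, [-2, 0, 3])
print(weak_choice, strong_choice)
```

The proposer is identical in both runs; only the test inputs change which candidate is declared correct, which is where the essay locates the hard part of the work.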

technical Mar 12th, 2026

AI Coding Agents Can Fix a Bug. SWE-CI Asks If They Can Do the Job for Six Months.

Researchers introduce SWE-CI, the first repository-level benchmark built around the Continuous Integration loop, designed to evaluate LLM-powered agents on dynamic, long-term code maintainability rather than static one-shot bug fixes. The benchmark includes 100 tasks averaging 233 days and 71 commits of evolution history, requiring agents to resolve issues through iterative rounds of analysis and coding — a harder test than anything SWE-bench currently offers.

technical Mar 12th, 2026

A Solo Developer's Satellite Demo Is Doing What Palantir Charges Millions For

A browser-based demo from indie developer Useful AI Tools applies vision-language models to satellite imagery, letting analysts detect objects — vehicles, fuel depots, bridges — via plain-text queries with no model training required. The tool undercuts the specialist classifiers that have historically made geospatial intelligence expensive to enter. The full platform adds global coverage, multi-layer GeoJSON exports, and project management tools for Earth observation and urban monitoring professionals.

technical Mar 12th, 2026

Switchboard Brings Order to Claude Code's Session Sprawl

Switchboard is an open-source Electron app from Doctly that gives developers a single window to browse, search, fork, and resume Claude Code sessions across all their projects. Where Claude Code's CLI offers no session overview, Switchboard reads on-disk state directly to surface session history, handle permission prompts, and edit plan files — without touching the underlying agent.

technical Mar 12th, 2026

LLM Neuroanatomy: Topping the AI Leaderboard Without Changing a Single Weight

Independent researcher David Noel Ng reached #1 on the HuggingFace Open LLM Leaderboard in mid-2024 with dnhkng/RYS-XLarge by duplicating seven middle transformer layers of Alibaba's 72B-parameter Qwen2-72B, with no fine-tuning, no weight changes, and no gradient descent. Running on two consumer RTX 4090 GPUs via ExLlamaV2 quantized inference, Ng developed what he calls 'LLM Neuroanatomy': the hypothesis that early transformer layers translate input into abstract representations, late layers translate back to output, and middle layers perform universal abstract reasoning that tolerates architectural rearrangement. Inspired by Base64 jailbreaking experiments and the Goliath-120b Frankenmerge anomaly, he built a 'brain scanner' that swept 3,241 layer-loop configurations across the 80-layer model, scoring each with fast proxy tasks and a logit-weighted LLM-as-judge system. The sweep found that duplicating middle layers improves performance across all six leaderboard benchmarks.
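
As a rough picture of what "duplicating layers without changing weights" means, the Python sketch below treats a model as an ordered list of layer indices and repeats a contiguous block in place. It is not Ng's code: `duplicate_block` and `sweep_configs` are hypothetical helpers, and the block position (layers 40 to 46) is an arbitrary assumption, since the summary does not say which seven layers were repeated. The enumeration in `sweep_configs` is likewise only one plausible way to sweep an 80-layer model; it happens to produce 3,241 configurations, but the summary does not state how the sweep was actually defined.

```python
from typing import List, Tuple


def duplicate_block(order: List[int], start: int, end: int) -> List[int]:
    """Return a new execution order in which order[start:end] runs twice
    in a row. Weights are untouched; only the order of layers changes."""
    if not 0 <= start <= end <= len(order):
        raise ValueError("need 0 <= start <= end <= len(order)")
    return order[:end] + order[start:end] + order[end:]


def sweep_configs(n_layers: int) -> List[Tuple[int, int]]:
    """Every contiguous block (start, end) with start < end, plus (0, 0)
    as the unmodified baseline. Assumed enumeration, for illustration."""
    configs = [(0, 0)]
    for start in range(n_layers):
        for end in range(start + 1, n_layers + 1):
            configs.append((start, end))
    return configs


base_order = list(range(80))          # an 80-layer model, as in the summary
looped = duplicate_block(base_order, 40, 47)   # hypothetical 7-layer block
print(len(looped), looped[38:56])
print(len(sweep_configs(80)))
```

In a real model each index would be a transformer block applied to the hidden state, so the repeated block simply processes its own output a second time.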

technical Mar 12th, 2026

Agent Safehouse – kernel-level walls between your local agents and your SSH keys

Agent Safehouse is a macOS-native sandboxing tool that uses kernel-level enforcement (macOS sandbox-exec) to restrict local LLM coding agents to their project working directory. It operates on a deny-first model: agents inherit no user permissions by default, with only the current project granted read/write access and toolchains granted read-only. Sensitive paths like ~/.ssh and ~/.aws are blocked at the syscall level. It supports all major local coding agents, including Claude Code, Codex, Gemini CLI, Aider, Cursor, and Cline. It is available via Homebrew or as a single self-contained shell script, and is open source under Apache 2.0.
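
The deny-first rule set described above can be modelled as a small decision function. The Python sketch below is only a model of the policy logic under stated assumptions, not Agent Safehouse's implementation or a sandbox-exec profile: `decide` is a hypothetical helper, all paths are made up, and real enforcement happens in the kernel rather than in user code. It also does no symlink or `..` resolution, which a real sandbox must handle.

```python
from pathlib import PurePosixPath
from typing import Sequence


def is_within(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if path is root itself or lies somewhere beneath it."""
    return path == root or root in path.parents


def decide(path: str, mode: str, project: str,
           toolchains: Sequence[str], denied: Sequence[str]) -> bool:
    """Deny-first check: explicit denials win, the project gets read/write,
    toolchains get read-only, and everything else is refused."""
    target = PurePosixPath(path)
    if any(is_within(target, PurePosixPath(d)) for d in denied):
        return False
    if is_within(target, PurePosixPath(project)):
        return mode in ("read", "write")
    if mode == "read" and any(
        is_within(target, PurePosixPath(t)) for t in toolchains
    ):
        return True
    return False  # default: no inherited permissions


# Hypothetical layout for illustration.
PROJECT = "/Users/dev/code/app"
TOOLCHAINS = ["/opt/homebrew", "/usr/bin"]
DENIED = ["/Users/dev/.ssh", "/Users/dev/.aws"]

print(decide("/Users/dev/code/app/src/main.py", "write", PROJECT, TOOLCHAINS, DENIED))
print(decide("/Users/dev/.ssh/id_ed25519", "read", PROJECT, TOOLCHAINS, DENIED))
print(decide("/opt/homebrew/bin/python3", "write", PROJECT, TOOLCHAINS, DENIED))
```

Checking denials before anything else means a sensitive directory stays blocked even if it happens to sit inside an otherwise allowed tree.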

technical Mar 12th, 2026

Legal experts back Anthropic's challenge to Pentagon blacklisting

Attorneys familiar with federal procurement law say Anthropic has solid grounds to contest its exclusion from Defense Department contracts — and a win could force the Pentagon to justify how it sidelines AI vendors.

opinion Mar 12th, 2026

RFC 454545 — Human Em Dash Standard

A mock-RFC published on GitHub Gist proposes two new Unicode code points — the Human Em Dash (HED, U+10EAD) and Human Attestation Mark (HAM, U+10EAC) — visually identical to the standard em dash but encoded separately to signal probable human authorship. Authors Janice Wilson and Jeff Auriemma name the underlying problem 'Dash Authenticity Collapse' (DAC): LLMs use em dashes with 'suspicious regularity' and 'unwavering grammatical confidence,' making the punctuation a widely mocked AI tell. Human Cognitive Proof-of-Work (HCPoW) prerequisites for emitting the certified dash include hesitation pauses exceeding 137ms, backspace events, and audible sighing. Written in strict IETF format with RFC 2119 MUST/SHOULD/MAY terminology throughout, the piece satirizes AI content detection anxiety and the standards process in equal measure.

technical Mar 12th, 2026

Autonoma scraps 18 months of QA agent code as LLM advances make complex inspection wrappers obsolete

Tom Piaggio, co-founder of Autonoma (an AI-powered QA testing platform), explains the company's decision to rewrite 1.5 years of production code serving paying customers. Two core drivers: (1) a no-tests culture in its TypeScript monorepo that caused quality to collapse at scale, and (2) LLM capability leaps from GPT-4 to modern models, which made the sophisticated Playwright/Appium UI inspection wrappers (built to compensate for weak models) unnecessary. The rewrite enables the fully agentic architecture the team originally envisioned. Technical changes include dropping Next.js Server Actions in favor of React+tRPC+Hono, and adopting Argo for Kubernetes-native workflow orchestration over alternatives including Temporal and useworkflow.dev.