GPT-4 Agent Traces GKE Outages to WireGuard Bug

When users started seeing random connection failures on Lovable's platform, infrastructure engineer Sascha Eglau pointed an AI agent at the problem. The agent, running on GPT-4 with read access to ClickHouse logs, found that anetd pods (Google's implementation of Cilium for GKE networking) were crashing roughly once per hour. That's bad news when your product spins up 50 sandboxes per second and every crash blocks new pods from getting network interfaces.
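To make the "roughly once per hour" finding concrete, here is a minimal Python sketch of how crash timestamps pulled from logs could be reduced to an average gap between crashes. It is not the agent's actual query or method, which the article does not describe. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_crash_interval(timestamps):
    """Return the average gap between consecutive crash times, or None if fewer than two."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    # Gap between each crash and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical crash times for one pod, expressed as minutes after midnight.
start = datetime(2024, 1, 1)
crashes = [start + timedelta(minutes=m) for m in (0, 58, 121, 180)]
interval = mean_crash_interval(crashes)
```

With these made-up values the gaps are 58, 63, and 59 minutes, so the average works out to one hour.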

The agent traced the crashes to a concurrent map-access panic inside anetd's WireGuard module. The bug sat in Google's integration code, not WireGuard itself, so Lovable got Google's account team on a Sunday call. Google recommended disabling transparent node-to-node encryption. It worked, for about four hours.

Then Valkey connections started failing randomly. Engineer Erik grabbed tcpdump and Wireshark and found the real culprit: an MTU mismatch. Nodes that hadn't restarted yet were stuck at the 1420-byte MTU from when WireGuard was active, while others had moved to the standard 1500-byte Ethernet MTU. Rerolling all nodes fixed it. Google has since patched the original WireGuard bug.
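The 1420-byte figure follows from WireGuard's per-packet overhead on a 1500-byte link. The Python sketch below works through that arithmetic using standard header sizes, then shows why a full-size packet cannot cross a path that includes a smaller-MTU hop. The function names are illustrative, and the article does not describe the exact failure mechanism Lovable saw, so treat the `fits` check as a simplified model rather than a reconstruction of the incident.

```python
from typing import Iterable

# Standard header sizes in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WIREGUARD_OVERHEAD = 32  # 16-byte data-message header plus 16-byte authentication tag


def wireguard_mtu(link_mtu: int, outer_ipv6: bool = True) -> int:
    """MTU left for tunneled traffic after WireGuard encapsulation."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return link_mtu - outer_ip - UDP_HEADER - WIREGUARD_OVERHEAD


def fits(packet_bytes: int, path_mtus: Iterable[int]) -> bool:
    """A packet crosses unfragmented only if it fits the smallest MTU on the path."""
    return packet_bytes <= min(path_mtus)


tunnel_mtu = wireguard_mtu(1500)                 # sized for the IPv6 worst case
full_packet_ok = fits(1500, [1500, tunnel_mtu])  # rerolled node to stale node
small_packet_ok = fits(tunnel_mtu, [1500, tunnel_mtu])
```

Sizing for an IPv6 outer header gives 1500 - 40 - 8 - 32 = 1420, which matches the stale value in the story. In this model, a 1500-byte packet from a rerolled node does not fit a hop still configured for 1420 bytes, while smaller packets pass, which is consistent with failures that look random.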

Eglau told Lovable's blog that the incident changed how he debugs. "I haven't gone back" to manual log parsing, he said. The agent could query logs at scale and spot patterns that would have taken hours to find by hand, which mattered on a Sunday when users were seeing errors. Kelet offers a similar capability.