A March 2026 IMF working paper has put hard numbers on a problem that AI practitioners have long suspected but rarely quantified: large language models are strikingly unreliable at retrieving specific statistical data, even from authoritative sources they can access directly. Researchers James Tebrake, Bachir Boukherouaa, Jeff Danforth, and Niva Harikrishnan tested ChatGPT by asking it to generate a CSV table of G7 annual economic growth rates drawn from the IMF's own World Economic Outlook publication. Tested under three conditions, the model was correct only 34 percent of the time within the same conversation, 17 percent across unique conversations, and just 14 percent when the WEO document was explicitly loaded into its context window.
That third condition — document loaded, accuracy lowest — is where the paper gets uncomfortable for enterprise AI deployments. Grounding a model with a source document is supposed to improve accuracy; it is the foundational premise behind most <a href="/news/2026-03-14-captain-yc-w26-launches-automated-rag-platform-for-enterprise-ai-agents">RAG-based products</a>. Here it made things worse. The most plausible explanation is the "lost in the middle" problem: LLMs exhibit degraded recall for information buried deep within long, dense documents. The WEO is a multi-hundred-page publication mixing narrative analysis with statistical tables, and dumping the whole thing into a context window is architecturally nothing like a properly engineered retrieval pipeline that surfaces only relevant, semantically matched chunks. Practitioners who upload PDFs directly to ChatGPT's file interface and expect reliable numerical extraction are replicating the exact failure mode the IMF paper documented.
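The difference between the two approaches can be sketched in a few lines. This is a toy illustration, not the IMF authors' code: the document pages are invented, and keyword overlap stands in for the embedding similarity a real retrieval pipeline would use. The point is structural: instead of handing the model the whole publication, a retrieval step surfaces only the chunk that actually contains the requested numbers.

```python
"""Toy sketch: chunk-and-retrieve vs. whole-document injection.

All page text is invented, and keyword overlap is a stand-in for
embedding similarity in a real RAG pipeline.
"""

def chunk_document(pages):
    """Split the document into page-level chunks instead of one blob."""
    return [{"page": i, "text": t} for i, t in enumerate(pages, start=1)]

def score(query, chunk):
    """Toy relevance score: count of shared lowercase tokens."""
    q = set(query.lower().split())
    c = set(chunk["text"].lower().split())
    return len(q & c)

def retrieve(query, chunks, k=1):
    """Return only the top-k relevant chunks for the model's context."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

# Hypothetical WEO-like pages: narrative chapters around one statistics table.
pages = [
    "Chapter 1 discusses global growth prospects and the policy outlook.",
    "Table A1 real GDP growth annual percent Canada 1.2 France 0.9 Germany 0.4",
    "Chapter 2 covers inflation dynamics and labor market conditions.",
]
chunks = chunk_document(pages)
top = retrieve("Germany real GDP growth percent", chunks)
print(top[0]["page"])  # prints 2: the table page, not the full document
```

Injecting the full PDF forces the model to locate the table itself inside hundreds of pages of mixed narrative and statistics, which is exactly where "lost in the middle" recall failures occur.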
Agent builders who route numerical queries through standard RAG should pay attention to the architectural alternatives the paper points toward: table-aware chunking strategies that treat statistical tables as discrete structured units, hybrid systems that route numerical queries to <a href="/news/2026-03-15-agents-prefer-structured-queries-over-natural-language-when-given-the-choice">deterministic lookup engines</a> such as SQL or direct data provider APIs, and structured extraction layers that convert tables into queryable formats before any LLM interaction occurs. The authors' proposed short-term remedy — a multi-step prompting sequence that orients the model broadly before narrowing to specific data points — offers some mitigation, though the accuracy numbers suggest it treats symptoms rather than root causes. Any agentic workflow that must accurately pull values from earnings reports, clinical trial tables, or regulatory filings faces the same structural problem.
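A hybrid system of the kind described above can be sketched with Python's standard library alone. The table values, the routing regex, and the schema are all illustrative assumptions, not actual WEO figures or any vendor's API: the design point is that a numerical query never reaches the LLM at all, so the answer is deterministic by construction.

```python
"""Sketch of a hybrid router: numerical queries hit a deterministic SQL
lookup; narrative queries fall through to the LLM. Values are placeholders,
not real WEO data."""

import re
import sqlite3

# Structured extraction layer output: the table lives in SQL, not in prose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE growth (country TEXT, year INTEGER, pct REAL)")
conn.executemany(
    "INSERT INTO growth VALUES (?, ?, ?)",
    [("Canada", 2024, 1.2), ("Germany", 2024, 0.4)],  # placeholder values
)

def route(query):
    """Send recognizable growth-rate questions to SQL; all else to the LLM."""
    m = re.search(r"growth rate for (\w+) in (\d{4})", query)
    if m:
        row = conn.execute(
            "SELECT pct FROM growth WHERE country = ? AND year = ?",
            (m.group(1), int(m.group(2))),
        ).fetchone()
        return ("sql", row[0] if row else None)
    return ("llm", None)  # narrative questions go to the model as usual

print(route("growth rate for Germany in 2024"))  # ('sql', 0.4)
print(route("summarize the growth outlook"))     # ('llm', None)
```

A production router would classify queries with a model rather than a regex, but the lookup itself stays exact either way, which is the property document injection cannot guarantee.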
For the longer term, the IMF team envisions a "Global Trusted Data Commons," a comprehensive AI-ready index of official statistics serving as a verified grounding layer for economic and policy queries. Timothy Taylor, writing at the Conversable Economist blog, endorsed the concept as a public good and drew a clean editorial line: AI tools are capable of producing acceptable first drafts of prose, but statistical retrieval demands a different standard — exact correctness — that current LLMs cannot reliably meet. If the data commons is built with proper structured-retrieval interfaces rather than document-injection approaches, it could solve the precise failure mode the IMF paper identified. That is a solvable engineering problem; the IMF has now made the case that someone needs to solve it.