Nature: Bigger LLMs Are Getting Worse at Knowing When to Shut Up

A Nature study finds that scaling up and instruction-tuning LLMs creates a new failure mode: models now confidently give wrong answers instead of refusing questions they can't handle. Researchers from the Valencian Research Institute for Artificial Intelligence and the University of Cambridge analyzed the GPT, LLaMA, and BLOOM families, finding that scaled-up models produce "apparently sensible yet wrong" answers most often on exactly the questions where human supervisors also make mistakes.

A study published in Nature confirms what many AI practitioners suspected: bigger, more polished language models aren't necessarily more reliable. The researchers found that while scaled-up and instruction-tuned models give correct answers more often and handle prompt variations better, they have developed a worse failure mode: instead of refusing questions they can't handle, they confidently give wrong answers that look right. Early language models avoided difficult questions; they would refuse or hedge. Scaled-up models shaped by RLHF and instruction fine-tuning don't. They answer everything. The researchers call these "apparently sensible yet wrong" answers, and they occur most often on difficult questions where human supervisors are also likely to make mistakes.
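The study's core bookkeeping is a three-way split of model outputs: correct, incorrect, and avoidant (the model declines, deflects, or hedges). The trend it documents is that, across model generations, the avoidant share shrinks while the incorrect share grows. Here is a minimal sketch of that tally, assuming a hypothetical classify(answer, reference) helper standing in for whatever grading scheme is used:

```python
from collections import Counter

LABELS = ("correct", "incorrect", "avoidant")

def reliability_profile(answers, references, classify):
    """Tally a model's answers into the study's three categories.

    `classify(answer, reference)` is a hypothetical grader (exact match,
    human judges, an LLM grader, ...) that returns one of LABELS.
    """
    counts = Counter(classify(a, r) for a, r in zip(answers, references))
    total = sum(counts.values())
    return {label: counts[label] / total for label in LABELS}

# The failure mode the paper describes: between model generations, the
# "avoidant" share falls, but much of the freed-up mass shows up as
# confident wrong answers rather than correct ones.
```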
For anyone deploying these systems, the upshot is stark: your model looks smart, acts confident, and is wrong in ways you won't catch. That breaks a core assumption about AI safety. The prevailing thinking held that as models became more capable, they would fail in predictable ways humans could supervise. The opposite happened. Scaled-up models don't secure "low difficulty" zones where errors are absent or easily spotted; they fail in ways that match human blind spots. José Hernández-Orallo and colleagues argue this requires "a fundamental shift" in how we design AI systems, prioritizing reliability over raw capability.

Alternative approaches are emerging. Anthropic's Constitutional AI trains models to follow explicit principles rather than relying solely on human feedback. Direct Preference Optimization simplifies alignment by learning directly from preference pairs, dropping the separate reward model that RLHF requires (a sketch of its loss follows below). And UC Berkeley and OpenAI have researched uncertainty quantification that helps models recognize when they should abstain (also sketched below).
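To make the DPO point concrete, here is the loss at the heart of the method (Rafailov et al., 2023) as a small PyTorch function. This is an illustrative sketch, not a reference implementation; the arguments are assumed to be summed log-probabilities of the preferred and dispreferred responses under the policy being trained and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: push the policy to prefer the chosen response over the
    rejected one, relative to a frozen reference model, with no separate
    reward model in the loop."""
    # Implicit reward for each response: beta * log(pi_policy / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood, negated to give a loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```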
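And a toy version of the abstention idea: gate the answer on the model's own token-level confidence and refuse below a threshold. Real uncertainty-quantification work is considerably more sophisticated (calibration, ensembles, verbalized confidence), and the threshold here is an arbitrary assumption, but the sketch shows the shape of the intervention:

```python
def answer_or_abstain(tokens, token_logprobs, threshold=-1.0):
    """Return the model's answer only if its average token log-probability
    clears a confidence threshold; otherwise abstain.

    `threshold` is a placeholder; a real system would calibrate it on
    held-out data rather than hard-coding a value.
    """
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    if avg_logprob < threshold:
        return "I don't know."  # abstain instead of guessing
    return "".join(tokens)
```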
But these remain research directions. The models most people use today are built on the scaling and RLHF paradigm this study critiques.