NIX Solutions: OpenAI’s New Models Hallucinate More

This week, OpenAI released its o3 and o4-mini AI models. They are cutting-edge in many ways, but they also hallucinate more often than their predecessors, confidently giving answers that aren't true.

The hallucination problem remains one of the biggest and most persistent challenges in modern AI, affecting even the best-performing systems. Historically, each successive model has hallucinated less than the one before. That doesn't appear to be the case with o3 and o4-mini. According to OpenAI's own tests, the new systems hallucinate more often than the company's previous reasoning models, including o1, o1-mini, and o3-mini, as well as traditional "non-reasoning" models like GPT-4o.


Somewhat troublingly, OpenAI itself doesn't know why. In a technical report (PDF), the company says that "more research is needed" to understand why hallucinations increase as reasoning models scale. The o3 and o4-mini models do perform better than their predecessors in tasks like math and programming, but because they "make more claims overall," they are also more likely to make both "more accurate claims" and "more inaccurate or hallucinatory claims," according to the report.

Real-World Test Results and Ongoing Research

In OpenAI's own PersonQA test, which evaluates knowledge of people, o3 hallucinated 33% of the time, roughly double the rate of the previous reasoning models o1 and o3-mini (16% and 14.8%, respectively). The o4-mini scored even higher, hallucinating 48% of the time on the same test. Independent testing by the research lab Transluce found that o3 sometimes fabricated actions it supposedly took when preparing responses. In one example, it claimed to run code on a 2021 Apple MacBook Pro "outside of ChatGPT" and copy-paste the results into its answer. While o3 has access to some tools, it could not have performed that action.
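For context, the reported figures are simple proportions of graded answers. Below is a minimal, hypothetical Python sketch of how such a hallucination rate could be computed; the questions, labels, and scoring here are invented for illustration and are not OpenAI's actual PersonQA data or grading code.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    question: str
    is_hallucination: bool  # True if the graded answer contained a fabricated claim

def hallucination_rate(answers: list[Answer]) -> float:
    """Share of answers flagged as containing at least one false claim."""
    if not answers:
        return 0.0
    return sum(a.is_hallucination for a in answers) / len(answers)

# Invented sample: 2 flagged answers out of 6 gives ~33%,
# matching o3's reported PersonQA rate only by construction.
sample = [
    Answer("Where did Ada Lovelace study?", False),
    Answer("What year was Grace Hopper born?", True),
    Answer("Who mentored Alan Turing?", False),
    Answer("What did Katherine Johnson calculate?", False),
    Answer("Which university employed Claude Shannon?", True),
    Answer("Where was Margaret Hamilton born?", False),
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")  # 33%
```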

One theory suggests that the reinforcement learning used to train the o-series models may have amplified the hallucination issue, which had previously been kept in check by standard post-training techniques. Experts warn that this could limit o3's usefulness in practice despite its strong performance on programming tasks, where it has also been observed supplying broken website links, notes NIX Solutions.

A promising direction for reducing hallucinations may lie in integrating real-time web search capabilities. GPT-4o, for instance, reached 90% accuracy on OpenAI's SimpleQA benchmark when using such a feature. "Removing hallucinations in all our models is an active area of research, and we are constantly working to improve their accuracy and reliability," OpenAI told TechCrunch. We'll keep you updated as more integrations become available.