Researchers at Arthur AI subjected artificial intelligence (AI) models from Meta, OpenAI, Cohere, and Anthropic to rigorous testing. The study examined AI "hallucinations", cases in which a model fabricates information and presents it as fact, and found that some models do this considerably more than others.
Cohere Leads in Hallucinations, GPT-4 Shows Promise
According to CNBC’s coverage of the study, Cohere’s AI showed the strongest tendency to hallucinate among the tested models. Meta’s Llama 2 also hallucinated more than GPT-4 and Claude 2, while GPT-4 demonstrated the lowest propensity for inventing facts.
Understanding AI Hallucinations and Real-world Impact
AI hallucinations occur when large language models (LLMs) generate entirely fictional information and present it as factual. In one infamous example, a lawyer cited fictitious court cases that ChatGPT had invented in a legal brief, which led to court sanctions.
In their experiment, Arthur AI researchers evaluated the models across categories such as combinatorial mathematics, US presidents, and Moroccan political leaders, devising questions that required the models to reason through multiple steps over the information provided.
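The article does not publish Arthur AI's actual prompts or scoring method, so the following is only a minimal Python sketch of how such a hallucination check could work in principle: each model answers multi-step questions with known reference answers, and replies that miss the reference are counted. The ask_model callable, the question set, and the string-matching check are illustrative assumptions, not the researchers' methodology.

```python
# Illustrative sketch only: Arthur AI's real benchmark, prompts, and grading
# are not described in this article. `ask_model` stands in for any chat API;
# the questions and reference answers below are hypothetical examples.
from typing import Callable, Dict, List

QUESTIONS: Dict[str, List[dict]] = {
    "combinatorial_math": [
        {"prompt": "How many ways can 3 people be seated in a row of 5 chairs?",
         "answer": "60"},
    ],
    "us_presidents": [
        {"prompt": "How many US presidents served before Abraham Lincoln?",
         "answer": "15"},
    ],
}

def hallucination_rate(ask_model: Callable[[str], str],
                       questions: Dict[str, List[dict]]) -> Dict[str, float]:
    """Return, per category, the share of replies that miss the reference answer."""
    rates = {}
    for category, items in questions.items():
        wrong = 0
        for item in items:
            reply = ask_model(item["prompt"])
            # A reply that never mentions the reference answer is counted as
            # wrong; a real evaluation would need far more careful grading.
            if item["answer"].lower() not in reply.lower():
                wrong += 1
        rates[category] = wrong / len(items)
    return rates

# Stub model that always answers "42": every category scores 1.0 (all wrong).
print(hallucination_rate(lambda prompt: "42", QUESTIONS))
```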
GPT-4’s Performance and Reduction in Hallucinations
The study revealed that OpenAI’s GPT-4 outperformed the other models in most categories. Notably, GPT-4 hallucinated less than its predecessor, GPT-3.5: on math-related questions, it produced 33% to 50% fewer hallucinations. In mathematics, GPT-4 took first place, closely followed by Claude 2. On questions about US presidents, however, Claude 2 was the most accurate, pushing GPT-4 into second place. On queries about Moroccan politics, GPT-4 performed best, while Anthropic’s Claude 2 and Meta’s Llama 2 struggled significantly.
Hedging and Self-awareness: GPT-4 vs. Cohere
In a follow-up experiment, the researchers examined how often the models hedged with cautious phrases such as “As an AI model, I cannot express an opinion…”. GPT-4 hedged 50% more often, in relative terms, than GPT-3.5, suggesting that GPT-4 could prove more frustrating for users. Cohere’s model, by contrast, did not hedge in any of its responses. The study singled out Claude 2 as the most reliable in terms of “self-awareness”: it accurately gauged what it knew and confined its answers to areas covered by sufficient training data.
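How exactly Arthur AI detected hedging is not spelled out in the article; the sketch below assumes a simple approach of scanning responses for common disclaimer phrases. The HEDGE_MARKERS list and the hedge_rate helper are hypothetical, not the study's actual criteria.

```python
# Illustrative sketch only: the study's hedging criteria are not published here.
# The marker phrases and this counting scheme are assumptions for illustration.
HEDGE_MARKERS = (
    "as an ai model",
    "as an ai language model",
    "i cannot express an opinion",
)

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses containing a hedging/disclaimer phrase."""
    if not responses:
        return 0.0
    hedged = sum(
        any(marker in response.lower() for marker in HEDGE_MARKERS)
        for response in responses
    )
    return hedged / len(responses)

# Example: two of three hypothetical responses hedge, giving a rate of ~0.67.
sample = [
    "As an AI model, I cannot express an opinion on that.",
    "The answer is 60.",
    "As an AI language model, I do not hold personal views.",
]
print(round(hedge_rate(sample), 2))
```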
The Arthur AI hallucination study highlights how much the tendency to fabricate information varies across AI models, concludes NIX Solutions. GPT-4 stands out for its reduced propensity to hallucinate, while Claude 2 excels in self-awareness, offering valuable insights for the AI research community and its applications.