A new benchmark from Artificial Analysis reveals alarming weaknesses in the factual reliability of large language models. Out of 40 models tested, only four achieved a positive score – with Google’s Gemini 3 Pro clearly in the lead.
The AA-Omniscience benchmark measures how reliably AI models retrieve factual knowledge across different subject areas. Gemini 3 Pro scored 13 points on its Omniscience Index, which ranges from -100 to 100; a score of 0 means a model answers questions correctly and incorrectly at the same rate. That puts it substantially ahead of Claude 4.1 Opus (4.8), GPT-5.1, and Grok 4. The high score mainly reflects the model's strong accuracy: Gemini 3 Pro outperformed Grok 4, previously the most accurate model, by 14 points.
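The article does not spell out the index formula, but the description (a range of -100 to 100, with zero when correct and incorrect answers balance) is consistent with scoring each question +1 if answered correctly, -1 if answered incorrectly, and 0 if the model abstains, then averaging and scaling to 100. The sketch below illustrates that assumed scheme; the function name and the grading rule are illustrative, not Artificial Analysis's published methodology.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Assumed scoring: +1 per correct answer, -1 per incorrect answer,
    0 per abstention, averaged over all questions and scaled to [-100, 100].
    This is a guess consistent with the article, not the official formula."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions graded")
    return 100.0 * (correct - incorrect) / total

# A model that answers correctly and incorrectly at the same rate lands at 0:
print(omniscience_index(correct=40, incorrect=40, abstained=20))  # 0.0

# Under this scheme, Gemini 3 Pro's reported 13 would mean its correct
# answers exceed its incorrect ones by 13 percentage points, e.g.:
print(omniscience_index(correct=45, incorrect=32, abstained=23))  # 13.0
```

A scheme like this would explain why only four of the 40 models score above zero: any model that hallucinates a wrong answer more often than it answers correctly goes negative, regardless of how often it abstains.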
Source: Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high
