A new benchmark from Google DeepMind aims to measure AI model reliability more comprehensively than previous tests. The results reveal that even top-tier models like Gemini 3 Pro and GPT-5.1 are far from perfect.
Researchers at Google DeepMind have introduced the FACTS Benchmark, a testing environment designed to evaluate the factual accuracy of large language models (LLMs) across multiple disciplines. The benchmark aggregates performance in four categories: visual understanding, internal knowledge, web search, and text-based evidence.
Source: FACTS benchmark shows that even top AI models struggle with the truth

