DeepMind Breakthrough: Introducing SAFE, an AI Agent for Fact-Checking LLMs

DeepMind and Stanford University researchers have pioneered SAFE, an AI agent dedicated to fact-checking Large Language Models (LLMs), marking a significant stride in assessing AI model accuracy.

Even the most advanced AI models can occasionally generate inaccurate information, a phenomenon colloquially known as “hallucination.” For instance, when ChatGPT is probed for facts, its longer responses tend to contain a mix of accurate and erroneous information.

Determining which AI models excel in factual accuracy, particularly in generating longer responses, has been a challenge due to the lack of standardized benchmarks. However, DeepMind addressed this gap by leveraging GPT-4 to develop LongFact, comprising 2,280 prompts covering 38 diverse topics, aimed at eliciting detailed responses from LLMs.

SAFE: The AI Agent

To evaluate the factual accuracy of these responses, the researchers introduced SAFE, an AI agent powered by GPT-3.5-turbo. SAFE dissects lengthy LLM responses into discrete facts, subsequently querying Google Search to assess the veracity of each fact based on the returned search results.
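
To illustrate the idea, here is a minimal Python sketch of such a split-then-verify loop. This is not DeepMind’s released implementation: the function names (split_into_facts, rate_fact), the prompts, and the search_google stub are placeholders chosen for this example, and a real pipeline would wire the stub up to an actual search API.

```python
# Minimal sketch of a SAFE-style pipeline (illustrative only, not DeepMind's code).
# Assumes an OpenAI-compatible client for the LLM calls; the search step is a stub.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"  # SAFE is reported to be powered by GPT-3.5-turbo

def split_into_facts(response_text: str) -> list[str]:
    """Ask the LLM to break a long-form answer into individual, self-contained facts."""
    result = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "List each individual factual claim in the text below, "
                       "one per line, with no extra commentary:\n\n" + response_text,
        }],
    )
    return [line.strip("- ").strip()
            for line in result.choices[0].message.content.splitlines()
            if line.strip()]

def search_google(query: str) -> str:
    """Stub: return search-result snippets for the query (plug in a real search client)."""
    raise NotImplementedError("connect this to a Google Search API of your choice")

def rate_fact(fact: str) -> str:
    """Ask the LLM whether the retrieved search results support the fact."""
    snippets = search_google(fact)
    result = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Fact: {fact}\n\nSearch results:\n{snippets}\n\n"
                       "Answer with exactly one word: SUPPORTED or NOT_SUPPORTED.",
        }],
    )
    return result.choices[0].message.content.strip()

def check_long_response(response_text: str) -> dict[str, str]:
    """Run the full split-then-verify loop over one long-form LLM response."""
    return {fact: rate_fact(fact) for fact in split_into_facts(response_text)}
```

The design point mirrors the description above: the LLM is used twice, once to decompose a long response into discrete facts and once to judge each fact against the evidence returned by search.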

Remarkably, SAFE exhibits “superhuman performance” compared to human annotators, agreeing with 72% of human annotations and proving correct in 76% of disagreements. Furthermore, it boasts a significant cost advantage, being 20 times more economical than crowd-sourced human annotators, positioning LLMs as superior and cost-effective fact-checking alternatives to humans.

The effectiveness of LLM responses was gauged on both the quantity of facts provided and their factual accuracy, combined into a single score via the F1@K metric. F1@K balances the precision of a response’s facts against recall relative to K, a target number of supported facts; the researchers report results with K set to 64 (the median number of facts across model responses) and 178 (the maximum).
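
As a rough illustration (based on the metric as described in the accompanying paper; treat the exact formulation as an approximation), F1@K combines precision, the share of a response’s facts that are supported, with a recall term that saturates once the response contains K supported facts:

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Illustrative F1@K: precision over the facts a response actually gives,
    recall measured against K, the target number of supported facts."""
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: a response with 50 supported and 10 unsupported facts, scored at K = 64.
print(f1_at_k(50, 10, 64))  # ≈ 0.81
```

Under this scoring, a response is rewarded both for keeping its facts accurate and for providing enough of them to reach the target K.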

In assessing the factual accuracy of 13 LLMs from various families, including Gemini, GPT, Claude, and PaLM-2, GPT-4-Turbo emerged as the top performer, closely followed by Gemini-Ultra and PaLM-2-L-IT-RLHF. Notably, the results underscored a trend wherein larger LLMs exhibit higher factual accuracy than their smaller counterparts.

SAFE emerges as a cost-efficient and potent tool for quantifying the factual accuracy of LLM-generated long-form content. While it is faster and cheaper than human annotators, its efficacy hinges on the reliability of the information returned by Google Search.

DeepMind has made SAFE publicly available, envisioning its utility in improving LLM factuality through better pretraining and finetuning. Additionally, SAFE could enable LLMs to verify facts before presenting outputs to users.

The findings of this research, including GPT-4’s superiority over Gemini in yet another benchmark, are poised to invigorate further advancements in AI model accuracy and reliability.
