A new study from Google's DeepMind research division finds that artificial intelligence systems can outperform human fact-checkers when assessing the accuracy of information generated by large language models.
The paper, titled “Long-form Factuality in Large Language Models” and published on the preprint server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts, then uses Google Search results to verify the accuracy of each claim.
“SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results,” the authors explained.
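The described loop is straightforward to sketch. Below is a minimal Python illustration of that pipeline, assuming two hypothetical helpers, `llm` (any text-in, text-out model call) and `search` (any web-search wrapper returning result snippets); this is a sketch of the process as described, not the authors' actual implementation.

```python
from typing import Callable

# Minimal sketch of the SAFE evaluation loop (illustrative only, not
# DeepMind's implementation). `llm` is any text-in/text-out LLM call;
# `search` is any web-search wrapper returning result snippets.
def safe_evaluate(
    response: str,
    llm: Callable[[str], str],
    search: Callable[[str], str],
) -> dict:
    # Step 1: have the LLM break the long-form response into
    # individual, self-contained factual claims.
    facts = llm(
        "List each individual factual claim in the following text, "
        f"one per line:\n\n{response}"
    ).splitlines()

    verdicts = {"supported": 0, "not_supported": 0}
    for fact in facts:
        # Step 2: multi-step reasoning. Issue a search query for the
        # claim, then judge the claim against the returned evidence.
        query = llm(f"Write a Google Search query to verify: {fact}")
        evidence = search(query)
        answer = llm(
            f"Claim: {fact}\nSearch results: {evidence}\n"
            "Is the claim supported by the results? "
            "Answer 'supported' or 'not supported'."
        )
        # Check "not supported" first, since "supported" is a substring.
        if "not supported" in answer.lower():
            verdicts["not_supported"] += 1
        else:
            verdicts["supported"] += 1
    return verdicts
```

A notable design point, per the paper's description, is that the same LLM both decomposes the response and reasons over the search evidence, so no task-specific fact-checking model is required.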
'Superhuman' performance claims spark controversy
The researchers compared SAFE against human annotators on a dataset of roughly 16,000 facts and found that SAFE's ratings matched the human ratings 72% of the time. More notably, in a sample of 100 cases where SAFE and the human raters disagreed, SAFE's judgment proved correct 76% of the time.
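Both headline numbers reduce to simple proportions. For concreteness, here is a small sketch (over hypothetical label lists) of how the agreement rate and the disagreement “win rate” would be computed.

```python
# Hypothetical parallel label lists, e.g. True = "supported",
# False = "not supported". Only the metric definitions come from
# the study's description; the data here is illustrative.

def agreement_rate(safe_labels: list, human_labels: list) -> float:
    # Fraction of facts where SAFE and the human annotators agree
    # (reported as ~72% in the study).
    matches = sum(s == h for s, h in zip(safe_labels, human_labels))
    return matches / len(safe_labels)

def disagreement_win_rate(safe_labels: list, truth: list) -> float:
    # Restricted to facts where SAFE and humans disagreed: the
    # fraction where SAFE matched the adjudicated ground truth
    # (reported as ~76% for SAFE in the study).
    wins = sum(s == t for s, t in zip(safe_labels, truth))
    return wins / len(safe_labels)
```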
Although the paper claims that “LLM agents can achieve superhuman assessment performance,” some experts question what “superhuman” really means here.
Gary Marcus, a prominent AI researcher and frequent critic of overhyped claims, suggested on Twitter that “superhuman” here may simply mean “better than an underpaid crowd worker, rather than a true human fact-checker.”
“That makes the characterization misleading,” he said. “It’s like saying chess software in 1985 was superhuman.”
Marcus raises a valid point. To truly demonstrate superhuman performance, SAFE would need to be benchmarked against expert human fact-checkers, not just crowd workers. The specifics of the human raters, including their qualifications, compensation, and fact-checking process, are important for properly contextualizing the results.
Cutting costs and benchmarking top models
One clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than relying on human fact-checkers. As the volume of information generated by language models continues to explode, economical and scalable ways to verify claims will become increasingly important.
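To make that gap concrete, here is a back-of-the-envelope comparison; the per-fact price below is hypothetical, and only the roughly 20x ratio comes from the study.

```python
# Back-of-the-envelope cost comparison (hypothetical unit price;
# only the ~20x ratio is reported by the researchers).
facts_to_check = 16_000
human_cost_per_fact = 0.20                      # assumed, in USD
safe_cost_per_fact = human_cost_per_fact / 20   # ~20x cheaper per the study

print(f"Human: ${facts_to_check * human_cost_per_fact:,.2f}")   # Human: $3,200.00
print(f"SAFE:  ${facts_to_check * safe_cost_per_fact:,.2f}")    # SAFE:  $160.00
```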
The DeepMind team used SAFE to evaluate the factual accuracy of 13 leading language models across four model families (Gemini, GPT, Claude, and PaLM-2) on a new benchmark called LongFact. Their results indicate that larger models generally made fewer factual errors.
However, even the best-performing models generated a significant number of false claims, underscoring the risk of relying too heavily on language models that can fluently present inaccurate information. Automated fact-checking tools like SAFE could play an important role in mitigating those risks.
Transparency and human baselines are important
The SAFE code and LongFact dataset have been open-sourced on GitHub, so other researchers can scrutinize and build on the work. But more transparency is still needed about the human baselines used in the study: understanding the crowd workers' backgrounds and processes is essential for assessing SAFE's capabilities in the proper context.
As tech giants race to develop more powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the output of these systems could play a pivotal role. Tools like SAFE represent an important step toward building new layers of trust and accountability.
However, it is important that these technologies develop openly, beyond the walls of any single company, and with input from a broad range of stakeholders. Measuring true progress requires rigorous, transparent benchmarking against human experts, not just crowd workers. Only then can we gauge the real-world impact of automated fact-checking in the fight against misinformation.