A new study from Google's DeepMind research division finds that artificial intelligence systems can outperform human fact-checkers when assessing the accuracy of information generated by large language models.
The paper, titled “Long-form Factuality in Large Language Models” and published on the preprint server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts, then uses Google Search results to verify the accuracy of each claim.
“SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results,” the authors explained.
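The described loop is straightforward to sketch. Below is a minimal Python illustration of that pipeline, assuming two hypothetical helpers, `llm` (any text-in, text-out model call) and `search` (any web-search wrapper returning result snippets); this is a sketch of the process as described, not the authors' actual implementation.

```python
from typing import Callable

# Minimal sketch of the SAFE evaluation loop (illustrative only, not
# DeepMind's implementation). `llm` is any text-in/text-out LLM call;
# `search` is any web-search wrapper returning result snippets.
def safe_evaluate(
    response: str,
    llm: Callable[[str], str],
    search: Callable[[str], str],
) -> dict:
    # Step 1: have the LLM break the long-form response into
    # individual, self-contained factual claims.
    facts = llm(
        "List each individual factual claim in the following text, "
        f"one per line:\n\n{response}"
    ).splitlines()

    verdicts = {"supported": 0, "not_supported": 0}
    for fact in facts:
        # Step 2: multi-step reasoning. Issue a search query for the
        # claim, then judge the claim against the returned evidence.
        query = llm(f"Write a Google Search query to verify: {fact}")
        evidence = search(query)
        answer = llm(
            f"Claim: {fact}\nSearch results: {evidence}\n"
            "Is the claim supported by the results? "
            "Answer 'supported' or 'not supported'."
        )
        # Check "not supported" first, since "supported" is a substring.
        if "not supported" in answer.lower():
            verdicts["not_supported"] += 1
        else:
            verdicts["supported"] += 1
    return verdicts
```

A notable design point, per the paper's description, is that the same LLM both decomposes the response and reasons over the search evidence, so no task-specific fact-checking model is required.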
'Superhuman' performance claims spark controversy
The researchers compared SAFE against human annotators on a dataset of roughly 16,000 facts and found that SAFE's ratings matched the human ratings 72% of the time. More notably, in a sample of 100 cases where SAFE and the human raters disagreed, SAFE's judgment proved correct 76% of the time.
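Both headline numbers reduce to simple proportions. For concreteness, here is a small sketch (over hypothetical label lists) of how the agreement rate and the disagreement “win rate” would be computed.

```python
# Hypothetical parallel label lists, e.g. True = "supported",
# False = "not supported". Only the metric definitions come from
# the study's description; the data here is illustrative.

def agreement_rate(safe_labels: list, human_labels: list) -> float:
    # Fraction of facts where SAFE and the human annotators agree
    # (reported as ~72% in the study).
    matches = sum(s == h for s, h in zip(safe_labels, human_labels))
    return matches / len(safe_labels)

def disagreement_win_rate(safe_labels: list, truth: list) -> float:
    # Restricted to facts where SAFE and humans disagreed: the
    # fraction where SAFE matched the adjudicated ground truth
    # (reported as ~76% for SAFE in the study).
    wins = sum(s == t for s, t in zip(safe_labels, truth))
    return wins / len(safe_labels)
```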
Although the paper claims that “LLM agents can achieve superhuman assessment performance,” some experts question what “superhuman” really means here.
Gary Marcus, a prominent AI researcher and frequent critic of overhyped claims, suggested on Twitter that “superhuman” here may simply mean “better than an underpaid crowd worker, rather than a true human fact-checker.”
“That makes the characterization misleading,” he said. “It’s like saying chess software in 1985 was superhuman.”
Marcus raises a valid point. To truly demonstrate superhuman performance, SAFE would need to be benchmarked against expert human fact-checkers, not just crowd workers. The specifics of the human raters, including their qualifications, compensation, and fact-checking process, are important for properly contextualizing the results.
Cutting costs and benchmarking top models
One clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than relying on human fact-checkers. As the volume of information generated by language models continues to explode, economical and scalable ways to verify claims will become increasingly important.
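To make that gap concrete, here is a back-of-the-envelope comparison; the per-fact price below is hypothetical, and only the roughly 20x ratio comes from the study.

```python
# Back-of-the-envelope cost comparison (hypothetical unit price;
# only the ~20x ratio is reported by the researchers).
facts_to_check = 16_000
human_cost_per_fact = 0.20                      # assumed, in USD
safe_cost_per_fact = human_cost_per_fact / 20   # ~20x cheaper per the study

print(f"Human: ${facts_to_check * human_cost_per_fact:,.2f}")   # Human: $3,200.00
print(f"SAFE:  ${facts_to_check * safe_cost_per_fact:,.2f}")    # SAFE:  $160.00
```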
The DeepMind team used SAFE to evaluate the factual accuracy of 13 leading language models across four model families (Gemini, GPT, Claude, and PaLM-2) on a new benchmark called LongFact. Their results indicate that larger models generally made fewer factual errors.
However, even the best-performing models generated a significant number of false claims, underscoring the risk of relying too heavily on language models that can fluently present inaccurate information. Automated fact-checking tools like SAFE could play an important role in mitigating those risks.
Transparency and human baselines are important
The SAFE code and LongFact dataset have been open-sourced on GitHub, so other researchers can scrutinize and build on the work. But more transparency is still needed about the human baselines used in the study: understanding the crowd workers' backgrounds and processes is essential for assessing SAFE's capabilities in the proper context.
As tech giants race to develop more powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the output of these systems could play a pivotal role. Tools like SAFE represent an important step toward building new layers of trust and accountability.
However, it is important that these technologies develop openly, beyond the walls of any single company, and with input from a broad range of stakeholders. Measuring true progress requires rigorous, transparent benchmarking against human experts, not just crowd workers. Only then can we gauge the real-world impact of automated fact-checking in the fight against misinformation.