A Better Way to Ask "Is This AI Answer Accurate?"
FActScore replaces the thumbs up or thumbs down with a percentage score by breaking answers into atomic facts and checking each one.
Main Takeaway
Don’t judge an AI’s answer as one block, and don’t let the polish fool you. Check it one claim at a time.
Who is this for?
This is for anyone who leans on AI for factual writing and wants a sharper way to talk about accuracy than “it seemed right.” That could be a student, a professional, or anyone else who cares whether the answers they get are accurate.
Background
FActScore, short for Fine-grained Atomic Evaluation of Factual Precision, was introduced in a 2023 paper published at EMNLP, one of the top research conferences in natural language processing (NLP). Given how quickly computer science research moves, the paper is already old by the field's standards, yet its core ideas are still in use today.
The issue
When an AI tool writes a paragraph, it rarely lands on “all true” or “all false.” A single answer can combine facts that check out with facts that are either irrelevant or incorrect. Evaluating the whole thing with a thumbs up or thumbs down misses the point. The question is not whether it’s accurate, but how much of it is accurate.
A common way to approach this is to break the generated text (the AI’s response) into atomic facts, the smallest standalone claims. Take this generated text as an example:
“He was born in 1961 into a household of 5 living in Kansas City. His family later moved to South Dakota because of his father’s job in the mining industry.”
This paragraph can be split into 5 atomic facts.
He was born in 1961.
He was born into a household of 5.
The household lived in Kansas City.
His family later moved to South Dakota.
His father worked in the mining industry.
The solution
FActScore builds on this idea of atomic facts. First, it uses a language model to break an LLM response into atomic facts. Each fact is then checked against a trusted knowledge source (Wikipedia, in the paper) to see whether the source supports it. The FActScore is simply the percentage of atomic facts that the source supports.
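As a rough illustration of that final step, here is a minimal Python sketch. It assumes the atomic facts have already been extracted and labeled. The `AtomicFact` class, the `factual_precision` function, and the true/false labels are all made up for illustration; none of them come from the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class AtomicFact:
    text: str
    supported: bool  # hypothetical label: does the knowledge source back this fact?


def factual_precision(facts):
    """Return the fraction of atomic facts marked as supported."""
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = sum(1 for fact in facts if fact.supported)
    return supported / len(facts)


# The five facts from the example above, with invented labels.
facts = [
    AtomicFact("born in 1961", True),
    AtomicFact("household of 5", False),
    AtomicFact("lived in Kansas City", True),
    AtomicFact("family moved to South Dakota", True),
    AtomicFact("father worked in mining", False),
]

print(f"{factual_precision(facts):.0%}")  # prints 60%
```

With three of the five facts supported, the score is 60%, which says more than a single thumbs up or thumbs down would.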
The authors initially used human annotators to evaluate the claims, but that is slow and expensive. Their main contribution is therefore an automated version that retrieves passages from the source and has a model judge each claim, matching the human-derived scores with an error rate under 2%. That made it cheap enough to score 6,500 generations from 13 models, work that would have cost about $26,000 (roughly $4 per generation) by hand.
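The automated version can be pictured as a retrieve-then-judge loop. The sketch below is a toy stand-in under loud assumptions, not the authors' implementation: word overlap plays the role of the retriever, a simple word-containment check plays the role of the model judge, and the passages are invented. All function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim, passages, k=1):
    """Toy retriever (assumption): rank passages by word overlap with the claim."""
    claim_words = tokens(claim)
    ranked = sorted(passages, key=lambda p: len(claim_words & tokens(p)), reverse=True)
    return ranked[:k]


def judge(claim, evidence):
    """Toy judge (assumption): supported if every longer claim word appears in the evidence."""
    content_words = {w for w in tokens(claim) if len(w) > 3}
    evidence_words = set()
    for passage in evidence:
        evidence_words |= tokens(passage)
    return bool(content_words) and content_words <= evidence_words


def automated_score(claims, passages):
    """Fraction of claims the toy judge marks as supported."""
    if not claims:
        raise ValueError("no claims to score")
    verdicts = [judge(claim, retrieve(claim, passages)) for claim in claims]
    return sum(verdicts) / len(verdicts)


# Invented passages standing in for the knowledge source.
passages = [
    "He was born in 1961 in Kansas City.",
    "The family later moved to South Dakota.",
]
claims = [
    "He was born in 1961.",
    "His father worked in the mining industry.",
]

print(automated_score(claims, passages))  # prints 0.5
```

A word-matching judge like this cannot tell a supported claim from a contradicted one that happens to reuse the same words, which is why a real pipeline relies on a language model for the judging step.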
The findings are pointed. A few clear patterns emerged in how the models failed.
Even ChatGPT, a leading model at the time, reached only 58% factual precision, so a large share of its claims were not supported.
The more obscure the subject, the more errors crept in.
Claims buried later in a response were more likely to be wrong than the ones near the start.
Models that rarely declined to answer tended to score worse, since always guessing produces more unsupported claims.
Interestingly, even tools that search the web and cite their sources (like Perplexity) were less reliable than their reputation suggests.
Towards Practical Everyday Solutions
FActScore gives you a vocabulary and a high-level framework for evaluating an LLM response: break the answer into atomic facts, check each claim against a trusted source, and label it as supported, not supported, or irrelevant.
The results show that you should be skeptical of claims about niche topics, and of the details that show up deep in a long answer.
In practice, if you plan to use text produced by these models, check each claim on its own, and verify any claim you are unsure about against an independent source.
The habit underneath all of this is simple. Don’t trust the polish, check the pieces.


