Hallucination Detection with Faithfulness & Groundedness
Score RAG outputs for faithfulness and groundedness to catch hallucinations before they reach users.
Catch LLM hallucinations using two complementary metrics: faithfulness (local NLI, catches contradictions) and groundedness (Turing model, catches unsourced claims) — then combine both in a single evaluate() call.
| Time | Difficulty | Package |
|---|---|---|
| 10 min | Beginner | ai-evaluation |
Prerequisites
- FutureAGI account → app.futureagi.com
- API keys: FI_API_KEY and FI_SECRET_KEY (see Get your API keys)
- Python 3.9+
Install
```shell
pip install 'ai-evaluation[nli]'
```
The [nli] extra installs the local NLI model used by faithfulness and contradiction_detection. Without it, these metrics fall back to a less accurate word-overlap heuristic.
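To see why the fallback is less accurate, consider what a word-overlap heuristic can and cannot do. The sketch below is an illustration of the general idea, not the package's actual fallback implementation: it scores the fraction of output words that also appear in the context, which means it is blind to negation and word order.

```python
# Rough sketch of a word-overlap heuristic (NOT the package's actual
# fallback implementation -- just an illustration of why it is less
# accurate than NLI: it ignores negation and word order entirely).
def word_overlap_score(output: str, context: str) -> float:
    output_words = set(output.lower().split())
    context_words = set(context.lower().split())
    if not output_words:
        return 0.0
    return len(output_words & context_words) / len(output_words)

# Identical wording scores 1.0 -- but so would a negated restatement
# built from the same words, which an NLI model would catch.
print(word_overlap_score("JWST launched in 2021", "JWST launched in 2021"))  # 1.0
```

This is exactly the failure mode the NLI model avoids, which is why installing the [nli] extra is recommended.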
```shell
export FI_API_KEY="your-api-key"
export FI_SECRET_KEY="your-secret-key"
```
Metrics for hallucination detection
Three built-in metrics cover hallucination detection. Local NLI metrics run on your machine with no API key; Turing metrics use FutureAGI’s purpose-built evaluation models.
| Metric | Engine | Required inputs | Output | What it catches |
|---|---|---|---|---|
| faithfulness | Local NLI | output, context | Score 0–1 | Contradictions between output and context |
| groundedness | Turing or local | output, input, context | Pass/Fail | Output claims not traceable to context |
| context_adherence | Turing | output, context | Score 0–1 | How strictly the output stays within context boundaries |
Tutorial
Score faithfulness on a single response
This step checks whether the LLM response is consistent with the retrieved context — no contradictions allowed.
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that faithfully reflects the context
response = (
    "The James Webb Space Telescope was launched on December 25, 2021. "
    "It orbits the Sun at the L2 Lagrange point, about 1.5 million kilometers from Earth."
)

result = evaluate(
    "faithfulness",
    output=response,
    context=context,
    input=question,
)

print(f"Faithfulness score : {result.score:.2f}")
print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Expected output:
```
Faithfulness score : 1.00
Passed : True
Reason : 2/2 claims supported.
```

Now test a hallucinated response:
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that contradicts the context on both facts
hallucinated_response = (
    "The James Webb Space Telescope was launched on March 10, 2022. "
    "It orbits Earth at an altitude of 600 kilometers."
)

result = evaluate(
    "faithfulness",
    output=hallucinated_response,
    context=context,
    input=question,
)

print(f"Faithfulness score : {result.score:.2f}")
print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Expected output:
```
Faithfulness score : 0.00
Passed : False
Reason : 0/2 claims supported.
```

Check groundedness with the Turing engine
groundedness checks whether every claim in the output is traceable to the provided context. Unlike faithfulness (which flags direct contradictions), groundedness also catches plausible-sounding additions the model makes that have no basis in the context.
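The distinction can be made concrete with a toy verdict table. Assuming each output claim receives one of three NLI-style labels (entailment, contradiction, or neutral), faithfulness only penalizes contradictions, while groundedness also fails neutral claims the context says nothing about. This is an illustrative sketch, not the library's internals:

```python
# Toy illustration (NOT the library's internals): given per-claim NLI
# labels, faithfulness penalizes contradictions, while groundedness also
# fails claims the context simply says nothing about ("neutral").
def verdicts(claim_labels: list[str]) -> dict[str, bool]:
    return {
        # faithful as long as nothing in the output contradicts the context
        "faithful": all(label != "contradiction" for label in claim_labels),
        # grounded only if every claim is actually entailed by the context
        "grounded": all(label == "entailment" for label in claim_labels),
    }

# "Serviced every year by astronauts" is not contradicted by the JWST
# context -- it just has no basis in it, so only groundedness fails.
print(verdicts(["entailment", "entailment", "neutral"]))
# {'faithful': True, 'grounded': False}
```

This is why running both metrics together gives broader coverage than either one alone.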
Test a response that adds unsourced facts:
```python
from fi.evals import evaluate

context = (
    "The James Webb Space Telescope (JWST) was launched on December 25, 2021. "
    "It orbits the Sun at the second Lagrange point (L2), approximately 1.5 million "
    "kilometers from Earth. JWST observes primarily in the infrared spectrum."
)
question = "When was the James Webb Space Telescope launched and where does it orbit?"

# A response that adds facts not present in the context
ungrounded_response = (
    "The James Webb Space Telescope was launched on December 25, 2021. "
    "It orbits the Sun at L2, 1.5 million kilometers from Earth. "
    "It is serviced every year by astronauts in low Earth orbit."
)

result = evaluate(
    "groundedness",
    output=ungrounded_response,
    context=context,
    input=question,
    model="turing_small",
)

print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Sample output (will vary):
```
Passed : False
Reason : The response includes a claim that is not supported by the provided context.
```

Note
groundedness is model-based, so exact wording and pass/fail can vary. Use the result reason to identify unsupported claims and tune your prompt/retrieval pipeline.
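Beyond prompt tuning, a common production pattern is to gate responses on the verdict before they reach users, serving a safe fallback (or triggering a retry) whenever groundedness fails. A minimal sketch, using a hypothetical EvalResult stand-in for the object returned by evaluate() (only the .passed and .reason fields are assumed):

```python
# Minimal gating sketch. EvalResult here is a hypothetical stand-in for
# the object returned by evaluate(); only .passed and .reason are assumed.
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    reason: str

def gate(response: str, result: EvalResult, fallback: str) -> str:
    # Serve the response only when the groundedness check passes;
    # otherwise log the unsupported-claim reason and serve a safe fallback.
    if result.passed:
        return response
    print(f"groundedness failed: {result.reason}")
    return fallback

answer = gate(
    "JWST is serviced yearly by astronauts.",
    EvalResult(passed=False, reason="Unsupported claim about servicing."),
    fallback="I couldn't verify that from the available sources.",
)
print(answer)  # I couldn't verify that from the available sources.
```

The logged reason doubles as a signal for which retrieval or prompt changes to try next.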
Now test a clean response that stays within the context:
```python
clean_response = (
    "The James Webb Space Telescope was launched on December 25, 2021 "
    "and orbits the Sun at L2, about 1.5 million kilometers from Earth."
)

result = evaluate(
    "groundedness",
    output=clean_response,
    context=context,
    input=question,
    model="turing_small",
)

print(f"Passed : {result.passed}")
print(f"Reason : {result.reason}")
```

Sample output (will vary):
```
Passed : True
Reason : All claims are traceable to the provided context.
```

Combine both metrics in one evaluate() call
Pass a list of metric names to run faithfulness and groundedness together on a single output. The call returns a BatchResult you can iterate over or index by name.
```python
from fi.evals import evaluate

context = (
    "The Great Barrier Reef is the world's largest coral reef system, located in the "
    "Coral Sea off the coast of Queensland, Australia. It is composed of over 2,900 "
    "individual reefs and 900 islands stretching over 2,300 kilometers."
)
question = "Where is the Great Barrier Reef and how large is it?"

response = (
    "The Great Barrier Reef is located in the Coral Sea off Queensland, Australia. "
    "It spans over 2,300 kilometers and consists of more than 2,900 individual reefs "
    "and 900 islands."
)

results = evaluate(
    ["faithfulness", "groundedness"],
    output=response,
    context=context,
    input=question,
)

# Iterate over both results
for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(f"{result.eval_name:<15} score={result.score:.2f} {status}")
    print(f"  Reason: {result.reason}")
    print()

# Or look up by name directly
faith_result = results.get("faithfulness")
ground_result = results.get("groundedness")
print(f"Both metrics passed: {faith_result.passed and ground_result.passed}")
```

Expected output:
```
faithfulness    score=1.00 PASS
  Reason: 4/4 claims supported.

groundedness    score=1.00 PASS
  Reason: All claims traceable to context.

Both metrics passed: True
```

Tip
faithfulness runs entirely locally via NLI — no API key required. groundedness can also run locally (omit model=) or via Turing models. Use turing_flash for lowest latency, turing_small for a balanced default, or turing_large for highest accuracy.
What you built
You can now detect LLM hallucinations using faithfulness (contradiction detection) and groundedness (unsourced claim detection), individually or combined in a single evaluate call.
- Scored a RAG response for faithfulness (local NLI) to detect contradictions — no API key needed
- Used groundedness (Turing, Pass/Fail) to catch unsourced claims the LLM adds beyond the context
- Combined multiple metrics in a single evaluate([...]) call returning a BatchResult